Module metrics

Module metrics 

Source
Expand description

Prometheus counters surface - minimum viable set for alerting.

Three counters (auto-emit on event, monotonic) + one info gauge for cp_build_info. No state-as-label gauges, no master/detail panel drivers - those proved unworkable in Grafana’s table model and were stripped (see fleet’s nixfleet-events.json which now reads from Loki instead). What stays is the alerting surface:

  • nixfleet_compliance_failure_events_total{control_id, host} - per-control, per-host. Cardinality bounded by the closed compliance set (~16 controls) × hosts.
  • nixfleet_runtime_gate_error_events_total - unlabeled. One global counter for the “agent couldn’t measure compliance” class.
  • nixfleet_gate_block_total{gate} - one increment per gates::evaluate_for_host block. gate discriminator is one of the kebab-case gate kinds (channel-edges / wave-promotion / host-edge / disruption-budget / compliance-wave). Drives rate(...{gate="compliance-wave"}[5m]) > 0 style alerts.
  • nixfleet_cp_build_info{version, git_commit}=1 - one series. Standard pattern (cf. kube_pod_info) for tracking the deployed CP version across scrapes. Re-emitted every render since the values are compile-time constants.

When the metrics feature is disabled, all functions in this module are no-ops and neither dep is compiled in.

The exporter recorder is process-global and idempotent - first install_recorder() wins. Tests can spin multiple test servers without colliding.

idle_timeout deliberately NOT set: counters are cumulative and must NEVER reset; the previous version applied idle eviction to gauges, but with no gauges in this slim surface, idle eviction is moot. cp_build_info is the only gauge and it’s re-emitted every scrape via record_build_info().

Statics§

METRICS_HANDLE 🔒

Functions§

install_recorder
Install the process-global Prometheus recorder. Idempotent - safe to call from each test’s server-spawn helper.
record_build_info
cp_build_info{version, git_commit}=1 - the deployed CP version. Constants resolve at compile time; re-emit each scrape so it always renders.
record_compliance_event
Increment on ComplianceFailure event arrival in /v1/agent/report. Bounded labels: hosts × controls. No-op when metrics feature off.
record_gate_block
Increment when gates::evaluate_for_host returns Some(GateBlock) at the dispatch endpoint. gate_kind is the kebab-case discriminator (channel-edges / wave-promotion / host-edge / disruption-budget / compliance-wave).
record_runtime_gate_error
Increment on RuntimeGateError event arrival in /v1/agent/report.