Expand description
Boot-recovery handshake (RFC-0005 §9.5 / Plan 07 open-question resolution).
Runs ONCE at runtime startup, before any worker is allowed to fire. Goal: synchronise the (zero-in-memory) agent reducer with whatever state CP holds, so a crash that drops in-memory state doesn’t lead the agent to either re-fire a switch that already took or skip a switch that didn’t.
Five scenarios, all distinguished by what /run/current-system
points to vs what CP’s record says:
-
Fresh install. /run/current-system = X, CP has no rollout record. First heartbeat reports current_closure=X; CP just records “agent present”; next Dispatch starts normal flow.
-
Crashed mid-Pending (before LocalActivate fired). /run/current-system = prior closure (switch never started). First heartbeat reports prior. CP’s Pending record’s current_closure_at_dispatch matches; CP re-queues Dispatch. Long-poll picks up; normal flow.
-
Crashed mid-Activating. /run/current-system = either prior OR target depending on whether switch-to-configuration had committed the new generation yet.
- If target: switch took, agent crashed before posting ActivationCompleted. CP synthesises an ActivationCompleted-shaped Replay-From event; agent’s reducer applies it as if it had arrived normally; probes start fresh; soak proceeds.
- If prior: switch didn’t take (or had only just started). CP treats as scenario 2.
-
Crashed mid-Soaking. /run/current-system = target. CP sees Soaking record’s target matches; replies with a stream of synthesised LocalProbeResult events the reducer needs to know prior probe state (so the Fail→Pass transition detector doesn’t double-count). Probes resume; soak completes; Converged fires.
-
Crashed post-Converged. /run/current-system = target. CP sees Converged record; nothing to do; wait for next Dispatch.
For 7f, the agent issues the first heartbeat and reads the CP’s
X-Nixfleet-Replay-From header. Full synthesised-event
reconstruction (scenario 3’s ActivationCompleted synthesis, scenario
4’s probe-result stream) is documented here but the rich synthesis
lands in a follow-up — the architecture is in place; the wiring
needs CP-side route changes that aren’t in scope this commit.
Until then, a non-zero Replay-From triggers a warn log; operator
can use nixfleet trace to inspect the in-flight rollout.
Structs§
- Recovery
Outcome - Outcome of the boot-recovery handshake. Returned to
runtime::spawnso the caller can decide whether to feed synthesised events through the reducer before workers start. - Replay
From Payload - Synthetic Replay-From payload shape. Documented for the follow-up that wires CP to emit per-event replay bodies; today the agent only sees the seq header.
Constants§
- RECOVERY_
HTTP_ 🔒TIMEOUT - HTTP timeout for the boot-recovery heartbeat. Longer than the steady-state heartbeat because boot recovery may run while CP is itself starting up — 30s of slack covers normal CP cold-start.
Functions§
- default_
current_ system_ path - Default location of NixOS’s current-system symlink. Test fixtures
override via
handshake’scurrent_system_pathparameter. - handshake
- Issue the gated first-heartbeat. Returns the recovery outcome so
runtime::spawncan act onreplay_fromBEFORE starting workers. - read_
current_ closure - Best-effort read of
/run/current-system’s store-path basename. ReturnsNonewhen the symlink doesn’t exist (fresh install, or non-NixOS host running the agent for the first time).