Module recovery

Module recovery 

Source
Expand description

Boot-recovery handshake (RFC-0005 §9.5 / Plan 07 open-question resolution).

Runs ONCE at runtime startup, before any worker is allowed to fire. Goal: synchronise the (zero-in-memory) agent reducer with whatever state CP holds, so a crash that drops in-memory state doesn’t lead the agent to either re-fire a switch that already took or skip a switch that didn’t.

Five scenarios, all distinguished by what /run/current-system points to vs what CP’s record says:

  1. Fresh install. /run/current-system = X, CP has no rollout record. First heartbeat reports current_closure=X; CP just records “agent present”; next Dispatch starts normal flow.

  2. Crashed mid-Pending (before LocalActivate fired). /run/current-system = prior closure (switch never started). First heartbeat reports prior. CP’s Pending record’s current_closure_at_dispatch matches; CP re-queues Dispatch. Long-poll picks up; normal flow.

  3. Crashed mid-Activating. /run/current-system = either prior OR target depending on whether switch-to-configuration had committed the new generation yet.

    • If target: switch took, agent crashed before posting ActivationCompleted. CP synthesises an ActivationCompleted-shaped Replay-From event; agent’s reducer applies it as if it had arrived normally; probes start fresh; soak proceeds.
    • If prior: switch didn’t take (or had only just started). CP treats as scenario 2.
  4. Crashed mid-Soaking. /run/current-system = target. CP sees Soaking record’s target matches; replies with a stream of synthesised LocalProbeResult events the reducer needs to know prior probe state (so the Fail→Pass transition detector doesn’t double-count). Probes resume; soak completes; Converged fires.

  5. Crashed post-Converged. /run/current-system = target. CP sees Converged record; nothing to do; wait for next Dispatch.

For 7f, the agent issues the first heartbeat and reads the CP’s X-Nixfleet-Replay-From header. Full synthesised-event reconstruction (scenario 3’s ActivationCompleted synthesis, scenario 4’s probe-result stream) is documented here but the rich synthesis lands in a follow-up — the architecture is in place; the wiring needs CP-side route changes that aren’t in scope this commit. Until then, a non-zero Replay-From triggers a warn log; operator can use nixfleet trace to inspect the in-flight rollout.

Structs§

RecoveryOutcome
Outcome of the boot-recovery handshake. Returned to runtime::spawn so the caller can decide whether to feed synthesised events through the reducer before workers start.
ReplayFromPayload
Synthetic Replay-From payload shape. Documented for the follow-up that wires CP to emit per-event replay bodies; today the agent only sees the seq header.

Constants§

RECOVERY_HTTP_TIMEOUT 🔒
HTTP timeout for the boot-recovery heartbeat. Longer than the steady-state heartbeat because boot recovery may run while CP is itself starting up — 30s of slack covers normal CP cold-start.

Functions§

default_current_system_path
Default location of NixOS’s current-system symlink. Test fixtures override via handshake’s current_system_path parameter.
handshake
Issue the gated first-heartbeat. Returns the recovery outcome so runtime::spawn can act on replay_from BEFORE starting workers.
read_current_closure
Best-effort read of /run/current-system’s store-path basename. Returns None when the symlink doesn’t exist (fresh install, or non-NixOS host running the agent for the first time).