Rollouts
A rollout is a fleet-wide deployment coordinated by the control plane (CP). Instead of pushing new code to every machine at once and hoping for the best, a rollout deploys in batches with a health-check gate between batches. If something breaks, the rollout pauses or reverts automatically.
Every rollout targets a release - an immutable CP-managed manifest mapping each host to its built Nix store path. This enables per-host deployment in heterogeneous fleets where every machine’s closure is different (different hardware, hostSpec, modules, certificates). You create a release once (nixfleet release create), then trigger one or more rollouts against it.
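As an illustration only, a release can be pictured as a mapping from host name to store path, which the CP intersects with the targeted machines when a rollout is created. The sketch below assumes that shape; the host names, the placeholder store paths, and the function plan_targets are hypothetical and are not part of nixfleet.

```python
import random

# Hypothetical release manifest: host name -> built store path.
# The paths are placeholders, not real Nix store paths.
release_entries = {
    "web-01": "/nix/store/placeholder-web-01",
    "web-02": "/nix/store/placeholder-web-02",
    "db-01": "/nix/store/placeholder-db-01",
}


def plan_targets(entries, targeted_hosts, seed=None):
    """Return (hosts to deploy in random order, hosts skipped).

    A targeted host that has no entry in the release is skipped,
    mirroring the "skipped with a warning" behavior described in this guide.
    """
    in_release = [h for h in targeted_hosts if h in entries]
    skipped = [h for h in targeted_hosts if h not in entries]
    random.Random(seed).shuffle(in_release)
    return in_release, skipped


# "web-03" matches the target filter but was not built into the release.
deploy_order, skipped = plan_targets(release_entries, ["web-01", "web-02", "web-03"])
print(sorted(deploy_order), skipped)
```

Because each host carries its own store path, the same manifest can serve several rollouts with different target filters.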
The two-step flow
nixfleet release create --push-to ssh://root@cache # build + push + register
nixfleet deploy --release rel-abc123 --tags web --strategy canary --wait
Or use the convenience shorthand - nixfleet deploy with --push-to / --copy implicitly creates a release first:
nixfleet deploy --push-to ssh://root@cache --tags web --strategy canary --wait
Both forms do the same thing. The explicit form is useful when you want to deploy the same release multiple times (e.g., staging then prod, or rolling forward then back).
Strategies
All-at-once
Deploy to every targeted machine simultaneously. No batching, no gates. Suitable for dev/staging environments or non-critical updates.
nixfleet deploy --release rel-abc123 --tags staging --strategy all-at-once
Canary
Deploy to a single machine first. If that machine passes health checks within the timeout, deploy to all remaining machines. Suitable for production environments where you want a quick smoke test.
nixfleet deploy --release rel-abc123 --tags prod --strategy canary \
--health-timeout 120 --wait
This creates two batches: batch 0 with 1 machine, batch 1 with the rest.
Staged
Define explicit batch sizes for fine-grained control. Batch sizes can be absolute numbers or percentages.
nixfleet deploy --release rel-abc123 --tags prod --strategy staged \
--batch-size 1,25%,100% \
--health-timeout 300 --wait
This creates three batches:
- Batch 0: 1 machine (canary)
- Batch 1: 25% of remaining machines
- Batch 2: all remaining machines (100%)
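The batch arithmetic can be sketched as follows. This is a minimal illustration, not the CP's implementation: it assumes percentages apply to the machines still remaining, round up, and always yield at least one machine. The function split_batches and the host names are hypothetical.

```python
import math


def split_batches(machines, sizes):
    """Split machines into batches from specs such as ["1", "25%", "100%"].

    Assumptions (not confirmed by the source): a percentage is taken of the
    machines still remaining, rounded up, with a minimum of one machine.
    Machines left over after the last spec go into one final batch.
    """
    remaining = list(machines)
    batches = []
    for spec in sizes:
        if not remaining:
            break
        if spec.endswith("%"):
            count = math.ceil(len(remaining) * float(spec[:-1]) / 100)
        else:
            count = int(spec)
        count = max(1, min(count, len(remaining)))
        batches.append(remaining[:count])
        remaining = remaining[count:]
    if remaining:
        batches.append(remaining)
    return batches


hosts = [f"web-{i:02d}" for i in range(1, 10)]  # 9 hypothetical hosts
batches = split_batches(hosts, "1,25%,100%".split(","))
print([len(b) for b in batches])  # [1, 2, 6]
```

With 9 machines, the canary batch takes 1, 25% of the remaining 8 is 2, and the last batch takes the remaining 6.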
How rollouts work
- Create - The CLI posts a rollout to the control plane with the release_id, target filter (tags or hosts), and strategy. The CP loads the release entries, intersects them with the target machine set (machines not in the release are skipped with a warning), randomizes the order, and splits them into batches.
- Execute batches - The rollout executor (a background task in the CP) processes batches sequentially:
  - For each machine in the current batch, looks up the per-host store path from the release entries
  - Captures the machine’s current generation into the batch’s previous_generations map (for per-machine rollback)
  - Sets the desired generation on each machine via the internal generations table
  - Returns poll_hint: 5 in the agent’s next desired-generation response so agents react within seconds instead of waiting the full pollInterval
  - Agents poll, detect the mismatch, fetch the closure, apply, run health checks, and report back with their new current_generation
- Health gate - The executor evaluates each machine’s health by verifying TWO conditions:
  - The machine’s latest report’s current_generation matches the desired store path from the release entry (proves the agent actually applied the new generation)
  - A health report with all_passed = true has been received since the batch started
  This two-step gate prevents false-positive completion from stale health reports: a health report from a previous generation cannot count toward the new batch.
- Complete or fail - When all batches succeed, the rollout status moves to completed. If a health gate fails, the rollout transitions to paused or failed depending on the --on-failure setting.
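The two-condition health gate can be sketched as a single predicate. This is a simplified model under stated assumptions: the record MachineState, its field names, and the use of timestamps to decide whether a health report arrived "since the batch started" are all hypothetical, and only the latest health report is tracked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MachineState:
    # Hypothetical record of what the CP last heard from one agent.
    current_generation: str             # store path the agent reports running
    last_health_passed: Optional[bool]  # all_passed of the latest health report, None if none
    last_health_at: float               # time of that health report, in seconds


def passes_health_gate(state, desired_path, batch_started_at):
    """True only if the agent applied the desired path AND reported healthy
    after the batch started."""
    applied = state.current_generation == desired_path
    fresh_and_healthy = (
        state.last_health_passed is True
        and state.last_health_at >= batch_started_at
    )
    return applied and fresh_and_healthy


NEW = "/nix/store/placeholder-new-system"  # placeholder, not a real store path
OLD = "/nix/store/placeholder-old-system"
batch_start = 1000.0

stale = MachineState(OLD, True, 900.0)           # healthy, but on the old generation
no_fresh_health = MachineState(NEW, True, 900.0)  # applied, health report predates the batch
healthy = MachineState(NEW, True, 1010.0)        # applied and healthy since the batch started

results = [passes_health_gate(m, NEW, batch_start) for m in (stale, no_fresh_health, healthy)]
print(results)  # [False, False, True]
```

Requiring both conditions is what keeps a passing report from the previous generation from completing the new batch.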
Health gates
After each batch deploys, the control plane waits for agents to report health. The gate evaluates based on two parameters:
- --health-timeout (default: 300 seconds) - Maximum time to wait for health reports after a batch deploys. Machines that do not report within this window are marked as timed out. Set this higher than pollInterval so agents have time to notice the deploy (or rely on poll_hint to react within 5s).
- --failure-threshold (default: 0) - Maximum number of unhealthy/timed-out machines before triggering the failure action. 0 means zero tolerance - any single failure pauses the rollout. Can be absolute ("3") or a percentage of the batch ("30%").
When the threshold is exceeded:
- --on-failure pause (default) - The rollout pauses. Investigate, fix the issue, then resume with nixfleet rollout resume <id>. Machines in the failed batch that did deploy are left in place (the agent already rolled back individually if its own health checks failed).
- --on-failure revert - The rollout fails and the CP reads each completed batch’s previous_generations map, reverting every machine in those batches to the store path it was running before the rollout started. Each machine rolls back to its OWN previous state - not a single shared generation - which is the correct behavior for heterogeneous fleets.
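The threshold check and the per-machine revert can be sketched as below. This is an illustration under assumptions: a percentage threshold is compared against the unrounded fraction of the batch, and the batch dictionaries, host names, and both function names are hypothetical rather than the CP's actual data model.

```python
def threshold_exceeded(failed, batch_size, threshold):
    """Return True when more machines failed than the threshold tolerates.

    threshold is an absolute count ("3") or a percentage of the batch ("30%").
    Assumption: a percentage is compared without rounding.
    """
    if threshold.endswith("%"):
        limit = batch_size * float(threshold[:-1]) / 100
    else:
        limit = int(threshold)
    return failed > limit


def revert_targets(completed_batches):
    """Collect each machine's own pre-rollout store path from the
    previous_generations map of every completed batch."""
    targets = {}
    for batch in completed_batches:
        targets.update(batch["previous_generations"])
    return targets


print(threshold_exceeded(1, 10, "0"))    # True: zero tolerance
print(threshold_exceeded(3, 10, "30%"))  # False: 3 of 10 is within 30%
print(threshold_exceeded(4, 10, "30%"))  # True

# Placeholder store paths; each machine reverts to a different one.
completed = [
    {"previous_generations": {"web-01": "/nix/store/placeholder-web-01-old"}},
    {"previous_generations": {"web-02": "/nix/store/placeholder-web-02-old",
                              "db-01": "/nix/store/placeholder-db-01-old"}},
]
plan = revert_targets(completed)
print(sorted(plan))  # ['db-01', 'web-01', 'web-02']
```

The revert plan is keyed by host, which reflects the point above: there is no single shared generation to fall back to.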
CLI flags
See CLI reference - deploy for the full flag list with defaults and descriptions.
Monitoring rollouts
Stream progress in real time with --wait:
nixfleet deploy --release rel-abc123 --tags prod --strategy canary --wait
If --on-failure pause triggers, --wait exits immediately with an actionable message instead of blocking until timeout:
Rollout r-xxx paused: batch 1 health check failed (2/3 unhealthy)
Resume with: nixfleet rollout resume r-xxx
Monitor with: nixfleet rollout status r-xxx --watch
List rollouts:
nixfleet rollout list
nixfleet rollout list --status running
nixfleet rollout list --status paused
Inspect a specific rollout with per-batch and per-machine detail:
nixfleet rollout status <rollout-id>
Managing rollouts
Resume a paused rollout (after investigating and fixing the issue):
nixfleet rollout resume <rollout-id>
Cancel a rollout (stops further batches, leaves already-deployed machines as-is):
nixfleet rollout cancel <rollout-id>
SSH fallback
For environments without a control plane (small fleets, bootstrapping, or air-gapped networks), the CLI can deploy directly over SSH without using a release:
nixfleet deploy --ssh --hosts "web*" --flake .
This builds each matching host’s closure locally, copies it to the target via nix-copy-closure, and runs switch-to-configuration switch. No rollout orchestration, no release manifest, no health gates - just a direct push. Useful for initial bootstrap or quick one-off deploys.
Worked example: canary deploy to production
Step 1 - build all production hosts and register a release. If you use harmonia as a binary cache, --push-to ssh:// copies the closures to the cache host’s /nix/store where harmonia serves them immediately:
nixfleet release create \
--flake . \
--hosts 'web-*,db-*' \
--push-to ssh://root@cache
Output includes the release ID, for example rel-abc123-....
Step 2 - deploy with canary strategy, 2-minute health timeout, auto-pause on failure:
nixfleet deploy \
--release rel-abc123 \
--tags prod,web \
--strategy canary \
--health-timeout 120 \
--failure-threshold 0 \
--on-failure pause \
--wait
What happens:
- The CP loads the release entries, filters by prod AND web tags, intersects with the release’s host list (skipping any tagged machine not in the release), and randomizes the order.
- Batch 0: 1 machine receives its per-host store path as desired. The CP starts returning poll_hint=5 in the agent’s desired-generation response.
- Within ~5s, the agent polls, sees the mismatch, fetches the closure via nix copy --from http://cache:5000, runs switch-to-configuration switch, runs health checks, reports back.
- The CP verifies the agent’s report shows the new current_generation (not a stale report from before the deploy), then waits for a passing health report.
- If healthy within 120s: Batch 1 deploys to all remaining machines in parallel.
- If unhealthy: the rollout pauses. The canary machine’s agent has already rolled back locally. Run nixfleet rollout status <id> to investigate, then nixfleet rollout resume <id> or nixfleet rollout cancel <id>.
Step 3 - same release, different environment:
# Same release, redeploy to a different subset with a different strategy
nixfleet deploy --release rel-abc123 --tags staging --strategy all-at-once --wait