RFC-0001: Declarative fleet topology (fleet.nix)
Status. Accepted.
Scope. Schema and evaluation contract for the fleet flake output. Does not cover reconciliation semantics (RFC-0002) or activation (RFC-0003).
1. Motivation
Every seam in nixfleet today routes around a missing object: “the fleet as declared”. The control plane has desired state in SQLite; the CLI has flags; the operator has intent in their head. None of these are git-tracked, reviewable, or composable. Before any of the downstream spine work can land, we need one thing: a pure, evaluable Nix value representing the fleet. Everything downstream consumes it.
Design goals, in order:
- Pure. nix eval .#fleet returns the full value with no IO, no network, no control-plane call.
- Self-contained. No cross-referencing outside the flake - hosts, tags, policies all resolved at eval time.
- Typed. Module system with option types; misuse fails at nix flake check.
- Composable. A fleet is a value; multiple flakes can merge fleets (for org-wide super-fleets).
- Minimal. Schema covers what’s needed for RFC-0002 / RFC-0003 / RFC-0004; resists feature creep.
2. Schema
# flake.nix
outputs = { self, nixpkgs, nixfleet, ... }: {
  fleet = nixfleet.lib.mkFleet {
    # ------------------------------------------------------------
    # 2.1 Hosts - the atomic unit.
    # ------------------------------------------------------------
    hosts.attic-01 = {
      system = "x86_64-linux";
      configuration = self.nixosConfigurations.attic-01;
      tags = [ "homelab" "always-on" "eu-fr" "server" ];
      channel = "stable";
    };
    hosts.rpi-sensor-01 = {
      system = "aarch64-linux";
      configuration = self.nixosConfigurations.rpi-sensor-01;
      tags = [ "edge" "eu-fr" ];
      channel = "edge-slow";
    };

    # ------------------------------------------------------------
    # 2.2 Tags - logical groupings, purely descriptive.
    # Tags have no hierarchy; use as many as needed per host.
    # ------------------------------------------------------------
    tags = {
      homelab.description = "Manuel's personal fleet.";
      "always-on".description = "Expected to be reachable 24/7.";
      "eu-fr".description = "Hosted in France; ANSSI policies apply.";
    };

    # ------------------------------------------------------------
    # 2.3 Channels - release trains.
    # Pinned to a git ref at reconcile time (see RFC-0003).
    # ------------------------------------------------------------
    channels.stable = {
      description = "Main production channel.";
      rolloutPolicy = "canary-conservative";
      signingIntervalMinutes = 60; # default; listed for clarity
      freshnessWindow = 1440;      # 24h in minutes; REQUIRED, no default
                                   # - invariant: ≥ 2 × signingIntervalMinutes
      compliance = {
        mode = "enforce"; # per-channel default for evidence probes;
                          # per-probe mode (RFC-0007 §3.3) overrides
        frameworks = [ "anssi-bp028" ];
      };
    };
    channels.edge-slow = {
      description = "Battery-powered edge nodes; weekly reconcile.";
      rolloutPolicy = "all-at-once";
      reconcileIntervalMinutes = 10080; # 7 days in minutes
      signingIntervalMinutes = 60;
      freshnessWindow = 20160;          # 2 weeks in minutes
    };

    # ------------------------------------------------------------
    # 2.4 Rollout policies - named, reusable.
    # ------------------------------------------------------------
    rolloutPolicies.canary-conservative = {
      strategy = "canary";
      waves = [
        { selector = { tags = [ "canary" ]; }; soakMinutes = 30; }
        { selector = { tagsAny = [ "non-critical" ]; }; soakMinutes = 60; }
        { selector = { all = true; }; soakMinutes = 0; }
      ];
      healthGate = {
        systemdFailedUnits.max = 0;
        complianceProbes.required = true;
      };
      onHealthFailure = "rollback-and-halt";
    };
    rolloutPolicies.all-at-once = {
      strategy = "all-at-once";
      healthGate.systemdFailedUnits.max = 0;
    };

    # ------------------------------------------------------------
    # 2.5 Edges - ordering constraints across hosts (within a rollout).
    # ------------------------------------------------------------
    edges = [
      { after = "db-primary"; before = "app-*"; reason = "schema migrations"; }
    ];

    # ------------------------------------------------------------
    # 2.6 Channel edges - ordering across channels (across rollouts).
    # `before` channel must converge before any new rollout opens on
    # `after`. Edge predecessors with no rollout history are open
    # (proceed); halted predecessors block until the operator
    # resolves them or removes the edge.
    # ------------------------------------------------------------
    channelEdges = [
      { before = "edge"; after = "stable"; reason = "coordinator canaries first"; }
    ];

    # ------------------------------------------------------------
    # 2.7 Disruption budgets - max in-flight per selector. Tag-driven
    # at the wire level: each budget carries its `selector` (operator
    # intent) and is resolved into a concrete host list at OpenRollout
    # time, snapshotted into the rollout manifest. Mid-rollout retags
    # affect future rollouts only - a rollout's topology is immutable
    # for its life. Cross-rollout fleet-wide enforcement survives the
    # snapshot model: in-flight summing matches by selector identity
    # across all active rollouts' snapshots.
    # ------------------------------------------------------------
    disruptionBudgets = [
      { selector = { tags = [ "etcd" ]; }; maxInFlight = 1; }
      { selector = { tags = [ "always-on" ]; }; maxInFlightPct = 50; }
    ];
  };
};
The following additional top-level keys exist; they’re spec’d in the RFCs that own them rather than duplicated here:
- healthChecks / tags.<t>.healthChecks / hosts.<h>.healthChecks - multi-scope probe declarations (RFC-0007).
- compliance / tags.<t>.compliance / hosts.<h>.compliance - multi-scope compliance refinement (RFC-0007 §3.7).
- revocations - signed agent-cert revocation list (RFC-0003 §4.5 + RFC-0010).
- bootstrapNonces - durable replay-invariant allowlist for /v1/enroll (RFC-0003 §4.5).
3. Selector algebra
Used by waves, edges, and budgets. Keep it minimal - resist reinventing Kubernetes label selectors.
selector :=
| { tags = [ "a" "b" ]; } # host has ALL listed tags
| { tagsAny = [ "a" "b" ]; } # host has ANY listed tag
| { hosts = [ "attic-01" ]; } # explicit host list
| { channel = "stable"; } # all hosts on this channel
| { all = true; } # every host in the fleet
| { not = <selector>; } # negation
| { and = [ <sel> <sel> ]; } # intersection
No wildcards in host names (resolve to explicit list). No regex. Evaluates to a concrete set of hosts at flake-eval time - fully static.
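The grammar above is small enough to model directly. The following is a minimal Python sketch of how a selector could resolve to a concrete host set; it is not the nixfleet implementation (which resolves at flake-eval time in Nix). The `resolve` function, the `HOSTS` table, and the rule that a selector carries exactly one key are assumptions made for illustration.

```python
from typing import Any

# Hypothetical in-memory view of the fleet: host name -> tags and channel.
HOSTS: dict[str, dict[str, Any]] = {
    "attic-01": {"tags": ["homelab", "always-on", "eu-fr", "server"], "channel": "stable"},
    "rpi-sensor-01": {"tags": ["edge", "eu-fr"], "channel": "edge-slow"},
}


def resolve(selector: dict[str, Any], hosts: dict[str, dict[str, Any]]) -> set[str]:
    """Return the names of all hosts matched by one selector."""
    if len(selector) != 1:
        # Assumption: each selector attrset holds exactly one alternative.
        raise ValueError(f"selector must have exactly one key: {selector!r}")
    (kind, arg), = selector.items()

    if kind == "tags":  # host has ALL listed tags
        return {n for n, h in hosts.items() if set(arg) <= set(h["tags"])}
    if kind == "tagsAny":  # host has ANY listed tag
        return {n for n, h in hosts.items() if set(arg) & set(h["tags"])}
    if kind == "hosts":  # explicit host list; unknown names are an error
        unknown = set(arg) - set(hosts)
        if unknown:
            raise ValueError(f"unknown hosts: {sorted(unknown)}")
        return set(arg)
    if kind == "channel":
        return {n for n, h in hosts.items() if h["channel"] == arg}
    if kind == "all":
        return set(hosts) if arg else set()
    if kind == "not":  # complement relative to the whole fleet
        return set(hosts) - resolve(arg, hosts)
    if kind == "and":  # intersection of every sub-selector
        result = set(hosts)
        for sub in arg:
            result &= resolve(sub, hosts)
        return result
    raise ValueError(f"unknown selector kind: {kind}")


eu_servers = resolve({"and": [{"tags": ["eu-fr"]}, {"not": {"tags": ["edge"]}}]}, HOSTS)
```

Because every branch reads only the static host table, the result depends on nothing outside the declared fleet, which is the property the section asks for.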
4. Evaluation contract
4.1 What the control plane consumes
The control plane never evaluates Nix. It reads the resolved fleet from a single JSON artifact produced by CI:
nix eval --json .#fleet.resolved > fleet.json
fleet.resolved is a derived attribute. Two resolution policies coexist:
- Waves are pre-resolved to host lists at fleet-eval time (CI). Wave membership is signed into the artifact.
- Disruption budgets carry their selector through unchanged - resolution to host lists happens at OpenRollout time and is snapshotted into the per-rollout manifest. The fleet.resolved artifact records intent; the rollout manifest records the frozen topology that intent produced for that specific rollout. Mid-rollout retags affect future rollouts only.
Shape:
{
  "schemaVersion": 1,
  "hosts": {
    "attic-01": {
      "system": "x86_64-linux",
      "closureHash": "sha256-...",
      "tags": ["homelab", "always-on", "eu-fr", "server"],
      "channel": "stable"
    }
  },
  "channels": { "stable": { "rolloutPolicy": {...}, "compliance": {...} } },
  "waves": {
    "stable": [
      { "hosts": ["canary-box"], "soakMinutes": 30 },
      { "hosts": ["rpi-01", "rpi-02"], "soakMinutes": 60 },
      { "hosts": ["attic-01"], "soakMinutes": 0 }
    ]
  },
  "channelEdges": [
    { "before": "edge", "after": "stable", "reason": "coordinator canaries first" }
  ],
  "disruptionBudgets": [
    { "selector": { "tags": ["etcd"] }, "maxInFlight": 1 }
  ]
}
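As a rough illustration of how the waves of the canary-conservative policy could become the host lists in the shape above, here is a Python sketch. It assumes that a host is placed in the first wave whose selector matches it and is excluded from later waves; the sample output suggests this (the all = true wave lists only attic-01) but the text does not state it. The `matches` and `resolve_waves` helpers and the tags given to canary-box, rpi-01 and rpi-02 are hypothetical.

```python
def matches(selector: dict, host_tags: set[str]) -> bool:
    """Tiny subset of the selector algebra: tags, tagsAny, all."""
    if "tags" in selector:
        return set(selector["tags"]) <= host_tags
    if "tagsAny" in selector:
        return bool(set(selector["tagsAny"]) & host_tags)
    if "all" in selector:
        return bool(selector["all"])
    raise ValueError(f"unsupported selector in this sketch: {selector!r}")


def resolve_waves(waves: list[dict], hosts: dict[str, set[str]]) -> list[dict]:
    """Turn selector-based waves into explicit host lists.

    Assumption: each host lands in the first wave that matches it.
    """
    placed: set[str] = set()
    resolved = []
    for wave in waves:
        members = sorted(
            name for name, tags in hosts.items()
            if name not in placed and matches(wave["selector"], tags)
        )
        placed.update(members)
        resolved.append({"hosts": members, "soakMinutes": wave["soakMinutes"]})
    return resolved


# Waves copied from rolloutPolicies.canary-conservative in section 2.4.
WAVES = [
    {"selector": {"tags": ["canary"]}, "soakMinutes": 30},
    {"selector": {"tagsAny": ["non-critical"]}, "soakMinutes": 60},
    {"selector": {"all": True}, "soakMinutes": 0},
]

# Hypothetical tag assignments for the hosts on the stable channel.
STABLE_HOSTS = {
    "canary-box": {"canary"},
    "rpi-01": {"non-critical"},
    "rpi-02": {"non-critical"},
    "attic-01": {"homelab", "always-on", "eu-fr", "server"},
}

stable_waves = resolve_waves(WAVES, STABLE_HOSTS)
```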
The rollout manifest (releases/rollouts/<rolloutId>.json, signed) carries the resolved snapshot:
{
  "channel": "stable",
  "hostSet": [ ... ],
  "disruptionBudgets": [
    {
      "selector": { "tags": ["etcd"] },
      "hosts": ["etcd-1", "etcd-2", "etcd-3"],
      "maxInFlight": 1
    }
  ],
  ...
}
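Section 2.7 says in-flight summing matches budgets by selector identity across all active rollouts' snapshots. One way to picture that, as a sketch only: before starting a host, sum the in-flight hosts of every active rollout whose snapshot carries a budget with the same selector. The function names, the JSON-based selector key, and the rounding rule for maxInFlightPct (floor, minimum 1) are assumptions, not part of the RFC.

```python
import json


def selector_key(selector: dict) -> str:
    """Stable identity for a selector (assumption: key-order-insensitive JSON)."""
    return json.dumps(selector, sort_keys=True)


def limit(budget: dict) -> int:
    """Absolute in-flight cap for one snapshotted budget."""
    if "maxInFlight" in budget:
        return budget["maxInFlight"]
    # Assumption: percentages round down, but never below one host.
    return max(1, len(budget["hosts"]) * budget["maxInFlightPct"] // 100)


def may_start(host: str, manifest: dict, active: list[tuple[dict, set[str]]]) -> bool:
    """Decide whether `host` may go in flight under `manifest`.

    `active` pairs every active rollout manifest (including this one)
    with the set of hosts currently in flight for that rollout.
    """
    for budget in manifest["disruptionBudgets"]:
        if host not in budget["hosts"]:
            continue
        key = selector_key(budget["selector"])
        flying: set[str] = set()
        for other, in_flight in active:
            for b in other["disruptionBudgets"]:
                if selector_key(b["selector"]) == key:
                    flying |= set(b["hosts"]) & in_flight
        if len(flying) >= limit(budget):
            return False
    return True


ETCD_BUDGET = {
    "selector": {"tags": ["etcd"]},
    "hosts": ["etcd-1", "etcd-2", "etcd-3"],
    "maxInFlight": 1,
}
ROLLOUT_A = {"channel": "stable", "disruptionBudgets": [ETCD_BUDGET]}
ROLLOUT_B = {"channel": "stable", "disruptionBudgets": [ETCD_BUDGET]}

# etcd-1 is already in flight under rollout A, so B may not start etcd-2.
blocked = not may_start("etcd-2", ROLLOUT_B, [(ROLLOUT_A, {"etcd-1"}), (ROLLOUT_B, set())])
```

Because each manifest carries its own frozen host list, a retag after OpenRollout changes neither the key nor the hosts counted here.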
4.2 Invariants checked at nix flake check
- Every host’s configuration is a valid nixosConfiguration.
- Every host’s channel exists in channels.
- Every channel’s rolloutPolicy exists in rolloutPolicies.
- Every selector resolves to at least one host (warn, not fail - empty selectors are sometimes intentional).
- compliance.frameworks reference known frameworks from nixfleet-compliance.
- Edges form a DAG (no cycles).
- Disruption budgets are satisfiable given fleet size (warn when a budget is so tight that a rollout would take impractically long, e.g. maxInFlight = 1 over 100 hosts).
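Several of these invariants are plain reference and graph checks. The sketch below models three of them in Python, plus the freshnessWindow ≥ 2 × signingIntervalMinutes invariant from section 2.3, over a dict shaped like the declared fleet. It is an illustration only: the real checks run in the Nix module system, `check_fleet` is a hypothetical name, and applying the DAG check to channelEdges is a choice made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def check_fleet(fleet: dict) -> list[str]:
    """Return a list of human-readable invariant violations (empty if none)."""
    errors = []

    for name, host in fleet["hosts"].items():
        if host["channel"] not in fleet["channels"]:
            errors.append(f"host {name}: unknown channel {host['channel']!r}")

    for name, channel in fleet["channels"].items():
        if channel["rolloutPolicy"] not in fleet["rolloutPolicies"]:
            errors.append(f"channel {name}: unknown rolloutPolicy {channel['rolloutPolicy']!r}")
        # signingIntervalMinutes defaults to 60 per section 2.3.
        signing = channel.get("signingIntervalMinutes", 60)
        if channel["freshnessWindow"] < 2 * signing:
            errors.append(f"channel {name}: freshnessWindow < 2 x signingIntervalMinutes")

    # Build node -> predecessors: `before` must converge before `after`.
    graph: dict[str, set[str]] = {}
    for edge in fleet.get("channelEdges", []):
        graph.setdefault(edge["after"], set()).add(edge["before"])
    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        errors.append(f"channelEdges contain a cycle: {exc.args[1]}")

    return errors


GOOD = {
    "hosts": {"attic-01": {"channel": "stable"}},
    "channels": {
        "stable": {"rolloutPolicy": "canary-conservative", "freshnessWindow": 1440},
        "edge-slow": {"rolloutPolicy": "all-at-once", "freshnessWindow": 20160},
    },
    "rolloutPolicies": {"canary-conservative": {}, "all-at-once": {}},
    "channelEdges": [{"before": "edge-slow", "after": "stable"}],
}

violations = check_fleet(GOOD)
```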
4.3 Signed artifact contract
fleet.resolved.json is a trust-boundary artifact (see ../design/architecture.md §4). CI produces and signs it with the CI release key; every consumer verifies before use.
- Signing. CI writes fleet.resolved.json + fleet.resolved.sig to the channel’s storage. The signature covers the full canonicalized JSON plus a signedAt RFC 3339 timestamp (embedded as meta.signedAt in the artifact).
- Verification - control plane. On every fetch, verifies the signature against the pinned CI release public key. Signature mismatch or unknown key -> refuse to reconcile the channel; emit an alert.
- Verification - agents (optional path). An agent that fetches fleet.resolved directly (rather than receiving targets from the control plane) performs the same verification. Enables the trust-minimized bootstrap in RFC-0003 §4.
- Key pinning. The CI release public key is committed to the flake (nixfleet.trust.ciReleaseKey) and embedded in every built closure. Key rotation is a new commit + a grace window during which both keys verify.
- Freshness. Downstream consumers (RFC-0003 §7) enforce now − meta.signedAt ≤ channel.freshnessWindow to defend against stale-closure replay by a compromised control plane. freshnessWindow is declared per-channel in minutes (see §2.3); there is no implicit default and the value is part of the signed payload so a compromised control plane cannot widen it.
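The freshness rule is a single inequality, sketched here in Python for concreteness. The function name `is_fresh` and the explicit `now` parameter are choices made for the example; the sketch implements only the inequality stated above and assumes the signature has already been verified.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(signed_at: str, freshness_window_minutes: int, now: datetime) -> bool:
    """Check now - meta.signedAt <= channel.freshnessWindow (minutes)."""
    # RFC 3339 allows a trailing "Z"; normalise it for older Python versions.
    signed = datetime.fromisoformat(signed_at.replace("Z", "+00:00"))
    return now - signed <= timedelta(minutes=freshness_window_minutes)


# A 24h window (1440 minutes), as declared for the stable channel.
now = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
fresh = is_fresh("2024-01-01T00:00:00Z", 1440, now)       # exactly at the edge
stale = not is_fresh("2023-12-31T23:59:00Z", 1440, now)   # one minute too old
```

Passing `now` in rather than reading the clock inside the function keeps the check deterministic and easy to test.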
Canonicalization uses a stable, spec-defined encoding (JCS or deterministic CBOR - final choice tracked as an open question below) so that signatures produced by Nix evaluation are byte-identical to what verifiers reconstruct.
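To show why a canonical encoding matters, the following Python sketch produces identical bytes for two logically equal objects whose keys arrive in different orders. It uses sorted keys and compact separators from the standard json module, which only approximates JCS for simple data (strings and integers); it is not a complete JCS or CBOR implementation and does not anticipate the final choice.

```python
import hashlib
import json


def canonical_bytes(artifact: dict) -> bytes:
    """Deterministic encoding: sorted keys, no insignificant whitespace, UTF-8.

    Approximation only: full JCS also fixes number formatting and sorts
    keys by UTF-16 code units, which this sketch does not guarantee.
    """
    text = json.dumps(
        artifact,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,
    )
    return text.encode("utf-8")


a = {"schemaVersion": 1, "meta": {"signedAt": "2024-01-01T00:00:00Z"}}
b = {"meta": {"signedAt": "2024-01-01T00:00:00Z"}, "schemaVersion": 1}

# Both orderings hash to the same digest, so one signature covers both.
same_digest = hashlib.sha256(canonical_bytes(a)).digest() == hashlib.sha256(canonical_bytes(b)).digest()
```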
5. Composition
Two flakes can merge fleets:
fleet = nixfleet.lib.mergeFleets [
  (import ./fleet-paris.nix)
  (import ./fleet-lyon.nix)
];
Conflicts (same host name, same channel definition with different values) fail eval. Merge is associative but not commutative when policies define overrides: precedence follows list order (later wins), and this precedence must be documented.
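The conflict rule can be modelled in a few lines of Python. This sketch reads the rule as: a host name may be declared by only one fleet, and a channel may be repeated only if every definition is identical. That reading, and the `merge_fleets` name, are assumptions; the sketch does not model policy overrides or the later-wins precedence.

```python
def merge_fleets(fleets: list[dict]) -> dict:
    """Merge fleets left to right, failing on conflicting definitions."""
    merged: dict[str, dict] = {"hosts": {}, "channels": {}}
    for fleet in fleets:
        for name, host in fleet.get("hosts", {}).items():
            if name in merged["hosts"]:
                raise ValueError(f"conflict: host {name!r} declared twice")
            merged["hosts"][name] = host
        for name, channel in fleet.get("channels", {}).items():
            if name in merged["channels"] and merged["channels"][name] != channel:
                raise ValueError(f"conflict: channel {name!r} defined with different values")
            merged["channels"][name] = channel
    return merged


# Hypothetical site fleets sharing one identical channel definition.
STABLE = {"rolloutPolicy": "canary-conservative", "freshnessWindow": 1440}
PARIS = {"hosts": {"paris-01": {"channel": "stable"}}, "channels": {"stable": STABLE}}
LYON = {"hosts": {"lyon-01": {"channel": "stable"}}, "channels": {"stable": dict(STABLE)}}

org_fleet = merge_fleets([PARIS, LYON])
```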
6. What’s deliberately out of scope
- Secrets. Declared alongside, not inside, the fleet schema.
- Enrollment / host identity. A host exists in the fleet schema regardless of whether it’s enrolled. Enrollment is an orthogonal state.
- Runtime state. fleet.resolved is purely declarative. Observed state (which host is online, what gen is running) lives in the control plane only.
- Dynamic host sets. No “autoscaling” - every host is named in the flake. If you need dynamic, generate the flake from a higher-level tool.