NixFleet

Declarative NixOS fleet management with staged rollouts and automatic rollback.

NixFleet combines a thin configuration framework (mkHost) with an optional orchestration layer (agent + control plane) for fleet-wide deployments. The framework builds on standard NixOS tooling - it doesn’t replace nixos-rebuild or nixos-anywhere; it adds reproducible multi-host configuration and health-driven deployment safety on top.

Start with the Quick Start.

Guide

NixFleet documentation, from first deployment to fleet-wide operations.

Getting Started

Core Concepts

  • Defining Hosts - the mkHost API, hostSpec flags, scopes
  • Deploying - standard tools, control plane, agent, rollouts
  • Operating - fleet status, rollback, impermanence
  • Extending - custom scopes, secrets, templates

Quick Start

Define a fleet, deploy your first host, and enable orchestration - all in 15 minutes.

Prerequisites

  • Nix with flakes enabled (experimental-features = nix-command flakes in ~/.config/nix/nix.conf)
  • SSH access to at least one target machine (root login or nixos-anywhere compatible)

1. Create a Fleet

Create a new directory and initialize a flake.nix:

# flake.nix
{
  inputs = {
    nixfleet.url = "github:arcanesys/nixfleet";
    nixpkgs.follows = "nixfleet/nixpkgs";
  };

  outputs = {nixfleet, ...}: {
    nixosConfigurations.web-01 = nixfleet.lib.mkHost {
      hostName = "web-01";
      platform = "x86_64-linux";
      hostSpec = {
        timeZone = "UTC";
      };
      modules = [
        nixfleet.scopes.roles.server
        ./hosts/web-01/hardware-configuration.nix
        ./hosts/web-01/disk-config.nix
        {
          nixfleet.operators = {
            primaryUser = "deploy";
            users.deploy = {
              isAdmin = true;
              sshAuthorizedKeys = [
                "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... you@workstation"
              ];
            };
          };
        }
      ];
    };

    nixosConfigurations.web-02 = nixfleet.lib.mkHost {
      hostName = "web-02";
      platform = "x86_64-linux";
      hostSpec = {
        timeZone = "UTC";
      };
      modules = [
        nixfleet.scopes.roles.server
        ./hosts/web-02/hardware-configuration.nix
        ./hosts/web-02/disk-config.nix
        {
          nixfleet.operators = {
            primaryUser = "deploy";
            users.deploy = {
              isAdmin = true;
              sshAuthorizedKeys = [
                "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... you@workstation"
              ];
            };
          };
        }
      ];
    };
  };
}

Each call to mkHost returns a full nixosSystem. The server role imports the base, operators, firewall, secrets, monitoring, and impermanence scopes. The operators scope manages user accounts - primaryUser is the identity anchor for Home Manager, secrets, and impermanence paths. The framework also injects disko and the fleet agent/control-plane service modules (disabled by default).

Tip: Run git init && git add -A before any nix command. Flakes only see files tracked by git.

2. Deploy the First Host

Use standard NixOS tooling. No custom scripts.

# Fresh install (wipes disk, installs NixOS)
nixos-anywhere --flake .#web-01 root@192.168.1.10

# Subsequent rebuilds
nixos-rebuild switch --flake .#web-01 --target-host root@192.168.1.10

Repeat for web-02. At this point you have two independently managed NixOS machines. Everything below is optional.

3. Enable Fleet Orchestration

Add the control plane to web-01 and the fleet agent to both hosts. Create a shared module:

# modules/fleet-agent.nix
{config, ...}: {
  services.nixfleet-agent = {
    enable = true;
    controlPlaneUrl = "http://web-01:8080";
    tags = ["web"];
    healthChecks.http = [
      {
        url = "http://localhost:80/health";
        interval = 5;
        timeout = 3;
        expectedStatus = 200;
      }
    ];
  };
}

Then add the control plane to web-01:

# modules/control-plane.nix
{
  services.nixfleet-control-plane = {
    enable = true;
    listen = "0.0.0.0:8080";
    openFirewall = true;
  };
}

Extract the operators config into a shared module so both hosts use the same user definition:

# modules/operators.nix
{
  nixfleet.operators = {
    primaryUser = "deploy";
    users.deploy = {
      isAdmin = true;
      sshAuthorizedKeys = [
        "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... you@workstation"
      ];
    };
  };
}

Include all modules in your mkHost calls:

nixosConfigurations.web-01 = nixfleet.lib.mkHost {
  hostName = "web-01";
  platform = "x86_64-linux";
  modules = [
    nixfleet.scopes.roles.server
    ./hosts/web-01/hardware-configuration.nix
    ./hosts/web-01/disk-config.nix
    ./modules/fleet-agent.nix
    ./modules/control-plane.nix
    ./modules/operators.nix
  ];
};

nixosConfigurations.web-02 = nixfleet.lib.mkHost {
  hostName = "web-02";
  platform = "x86_64-linux";
  modules = [
    nixfleet.scopes.roles.server
    ./hosts/web-02/hardware-configuration.nix
    ./hosts/web-02/disk-config.nix
    ./modules/fleet-agent.nix
    ./modules/operators.nix
  ];
};

Rebuild both hosts to activate the agent and control plane.

4. Deploy to the Fleet

First-time setup - create a config file and bootstrap the admin API key:

nixfleet init \
  --control-plane-url https://cp.example.com:8080 \
  --ca-cert ./fleet-ca.pem \
  --cache-url http://cache.example.com:5000 \
  --push-to ssh://root@cache.example.com
nixfleet bootstrap

This writes .nixfleet.toml to the repo and saves the API key to ~/.config/nixfleet/credentials.toml. Subsequent commands run with no flags.

Now deploy - the one-command form builds all targeted hosts, pushes them to the cache, registers a release, and triggers a canary rollout:

nixfleet deploy --push-to ssh://root@cache.example.com --tags web --strategy canary --wait

Or split it into explicit steps if you want to inspect or replay the release:

nixfleet release create --push-to ssh://root@cache.example.com
# Output: Release rel-abc123 created (2 hosts)
nixfleet deploy --release rel-abc123 --tags web --strategy canary --wait

The --strategy flag controls rollout behavior:

  • all-at-once - deploy to every matching host simultaneously (default)
  • canary - deploy to one host first, verify health, then continue
  • staged - deploy in configurable batch sizes (--batch-size 1,25%,100%)

The agent checks health (http://localhost:80/health) after each switch. On failure, it automatically rolls back to the previous generation. The control plane verifies that each machine reports the new generation as its current_generation before accepting a health report as proof of successful deployment.

5. Check Fleet Status

nixfleet status
nixfleet status --json

Next Steps

  • Design Guarantees - properties that hold across every NixFleet deployment
  • Installation - detailed install methods, ISO builds, troubleshooting
  • Rollouts - batch sizes, failure thresholds, pause/resume
  • The mkHost API - all parameters and what the framework injects

Design Guarantees

These are not features you enable. They are properties that emerge from the architecture.

| Property | What it means | How the architecture delivers it |
|---|---|---|
| Reproducibility | Same configuration produces an identical system, every time, on any machine. | The Nix store is content-addressed - every package is identified by a cryptographic hash of its inputs. flake.lock pins every dependency to an exact revision. The follows chain ensures nixpkgs, home-manager, disko, and impermanence all resolve to one consistent version. |
| Immutability | Running systems cannot drift from their declared configuration. | The Nix store is read-only - no process can modify installed software in place. With optional ephemeral root (impermanence), the entire root filesystem is wiped and recreated from configuration on every boot, eliminating accumulated state. |
| Atomic rollback | Recover from any deployment in seconds, not minutes. | NixOS generations are atomic filesystem switches - the previous generation remains intact in the Nix store. The fleet agent auto-rolls back on health check failure. Manual rollback is a single command: nixfleet rollback --host web-01 --ssh. |
| Auditability | Every change to every system is traceable to a commit. | Configuration is Git-native - the entire system state is defined in version-controlled Nix files. The control plane maintains a deployment audit log, a release registry (immutable manifests of per-host store paths), and a rollout event timeline for every host. Releases can be diffed with nixfleet release diff <A> <B>. |
| Supply chain integrity | The complete dependency tree of every system is known and verifiable. | flake.lock records the cryptographic hash of every input. Builds are reproducible - the same inputs always produce the same output hash. No implicit dependencies, no untracked downloads during build. |
| Graceful degradation | The fleet survives a control plane outage without disruption. | The architecture uses a polling model - agents independently pull desired state on a configurable interval (default: 60s, with a poll_hint-driven fast path of 5s during active rollouts, and 30s retries on transient failures). If the control plane is unreachable, agents continue running their last-known-good generation. There is no single point of failure; each host is a self-contained NixOS system that operates independently. |

These properties hold whether you use the full orchestration layer or just mkHost with standard NixOS commands.

Installation

NixFleet uses standard NixOS/Darwin tooling for installation. No custom deploy scripts.

NixOS - Remote Install

Install a fresh machine over SSH using nixos-anywhere:

nixos-anywhere --flake .#web-01 root@192.168.1.10

The target machine needs SSH access and must be booted into a NixOS installer or any Linux with kexec support. nixos-anywhere handles disk partitioning (via disko), NixOS installation, and the first boot.

Options

# Provision extra files (e.g. host keys, pre-generated secrets)
nixos-anywhere --flake .#web-01 --extra-files ./secrets root@192.168.1.10

# Build on the remote machine (useful for aarch64 targets without cross-compilation)
nixos-anywhere --flake .#web-01 --build-on-remote root@192.168.1.10

NixOS - Rebuild

For machines already running NixOS:

# Local rebuild
sudo nixos-rebuild switch --flake .#web-01

# Remote rebuild
nixos-rebuild switch --flake .#web-01 --target-host root@192.168.1.10

macOS

For Darwin hosts (Apple Silicon or Intel), use nix-darwin:

darwin-rebuild switch --flake .#macbook

The mkHost function detects aarch64-darwin or x86_64-darwin platforms and calls darwinSystem instead of nixosSystem, injecting the appropriate Darwin core module and Home Manager integration.

Custom ISO

Build an installer ISO with your fleet’s SSH keys and base configuration pre-baked:

nix build .#iso

The resulting ISO is written to result/iso/. Flash it to USB and boot target machines for a known-good starting point before running nixos-anywhere.

VM Testing

Test host configurations in QEMU before deploying to real hardware.

Prerequisites: Your fleet must set nixfleet.isoSshKeys to a public key (e.g. ~/.ssh/id_ed25519.pub) whose private half is on your machine. The sshAuthorizedKeys in your hostSpec should use the same key. VM commands SSH into the ISO installer using this key - if it doesn’t match, SSH will hang.
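The prerequisite above can be satisfied with a small fleet-level module. A sketch (the key string is a placeholder for your own public key):

```nix
# Fleet-level module (sketch) - substitute your own workstation key
{
  nixfleet.isoSshKeys = [
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... you@workstation"
  ];
}
```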

# Install a host into a persistent VM disk (build ISO + nixos-anywhere)
nix run .#build-vm -- -h web-01

# Start the installed VM as a headless daemon
nix run .#start-vm -- -h web-01

# Full VM test cycle (build, install, reboot, verify, cleanup)
nix run .#test-vm -- -h web-01

See VM Tests for details on writing VM test assertions.

Troubleshooting

SSH connection refused

nixos-anywhere requires root SSH access on the target. Verify:

ssh root@192.168.1.10 echo ok

If the target is a fresh installer image, root login is usually enabled by default. For existing systems, ensure services.openssh.enable = true and users.users.root.openssh.authorizedKeys.keys includes your public key.

Build fails with “path not found”

Flakes only see files tracked by git. If you just created or moved files:

git add -A

Then retry the build.

Missing state on impermanent hosts

Hosts with nixfleet.impermanence.enable = true wipe root on every boot. If a service loses state after reboot, its data directory must be added to the persistence configuration. The agent and control plane modules handle this automatically - their state directories (/var/lib/nixfleet, /var/lib/nixfleet-cp) are persisted when impermanence is active.

For other services, add persist paths in your modules:

environment.persistence."/persist".directories = [
  "/var/lib/my-service"
];

The mkHost API

mkHost is the single entry point for defining hosts in NixFleet. It is a closure over framework inputs (nixpkgs, home-manager, disko, impermanence, microvm) that returns a standard nixosSystem or darwinSystem.

The result is a standard NixOS/Darwin system configuration. All existing NixOS tooling (nixos-rebuild, nixos-anywhere, darwin-rebuild) works unchanged.

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| hostName | string | yes | Machine hostname |
| platform | string | yes | x86_64-linux, aarch64-linux, aarch64-darwin, x86_64-darwin |
| stateVersion | string | no | NixOS/Darwin state version (default: "24.11") |
| hostSpec | attrset | no | Host configuration flags. See hostSpec |
| modules | list | no | Additional NixOS/Darwin modules |
| isVm | bool | no | Inject QEMU VM hardware (default: false) |

For the full parameter reference, injected module order, return types, Home Manager integration, and exports, see the mkHost API reference.

Examples

Single host

The simplest pattern. One machine, one repo, no fleet infrastructure.

# flake.nix
{
  inputs = {
    nixfleet.url = "github:arcanesys/nixfleet";
    nixpkgs.follows = "nixfleet/nixpkgs";
  };

  outputs = {nixfleet, ...}: {
    nixosConfigurations.myhost = nixfleet.lib.mkHost {
      hostName = "myhost";
      platform = "x86_64-linux";
      hostSpec = {
        userName = "alice";
        timeZone = "US/Eastern";
        locale = "en_US.UTF-8";
        sshAuthorizedKeys = [
          "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA..."
        ];
      };
      modules = [
        ./hardware-configuration.nix
        ./disk-config.nix
      ];
    };
  };
}

Deploy with standard NixOS tooling:

nixos-anywhere --flake .#myhost root@192.168.1.50   # fresh install
sudo nixos-rebuild switch --flake .#myhost            # local rebuild

Multi-host fleet with org defaults

Define shared defaults in a let binding and merge per-host overrides. This example uses flake-parts.

# fleet.nix (flake-parts module)
{config, ...}: let
  mkHost = config.flake.lib.mkHost;

  acme = {
    userName = "deploy";
    timeZone = "America/New_York";
    locale = "en_US.UTF-8";
    keyboardLayout = "us";
  };
in {
  flake.nixosConfigurations = {
    dev-01 = mkHost {
      hostName = "dev-01";
      platform = "x86_64-linux";
      hostSpec = acme;
      modules = [
        nixfleet-scopes.scopes.roles.workstation
        { nixfleet.impermanence.enable = true; }
        ./hosts/dev-01/hardware.nix
        ./hosts/dev-01/disk-config.nix
      ];
    };

    prod-web-01 = mkHost {
      hostName = "prod-web-01";
      platform = "x86_64-linux";
      hostSpec = acme;
      modules = [
        nixfleet-scopes.scopes.roles.server
        ./hosts/prod-web-01/hardware.nix
        ./hosts/prod-web-01/disk-config.nix
      ];
    };
  };
}

Batch hosts from a template

Standard Nix. Generate 50 identical edge devices with builtins.genList, then merge with named hosts.

# fleet.nix (flake-parts module)
{config, ...}: let
  mkHost = config.flake.lib.mkHost;

  acme = {
    userName = "deploy";
    timeZone = "America/New_York";
    locale = "en_US.UTF-8";
  };

  edgeHosts = builtins.listToAttrs (map (i: {
    name = "edge-${toString i}";
    value = mkHost {
      hostName = "edge-${toString i}";
      platform = "aarch64-linux";
      hostSpec = acme;
      modules = [
        nixfleet-scopes.scopes.roles.endpoint
        ./hosts/edge/common-hardware.nix
        ./hosts/edge/disk-config.nix
      ];
    };
  }) (builtins.genList (i: i + 1) 50));

  namedHosts = {
    control-plane = mkHost {
      hostName = "control-plane";
      platform = "x86_64-linux";
      hostSpec = acme;
      modules = [ nixfleet-scopes.scopes.roles.server ./hosts/control-plane/hardware.nix ];
    };
  };
in {
  flake.nixosConfigurations = namedHosts // edgeHosts;
}

No special batch API needed - mkHost is a plain function, and Nix handles the rest.

Key points

  • hostSpec values use lib.mkDefault, so modules you pass in modules can override them.
  • hostName is the exception - it is set without mkDefault and always matches the hostName parameter.
  • isDarwin is auto-detected from the platform parameter. You never set it manually.
  • VM mode (isVm = true) adds QEMU hardware, SPICE agent, DHCP, and software GL - useful for testing with nix run .#build-vm and nix run .#start-vm.

hostSpec Configuration

hostSpec is a NixOS module option that holds host identity data. It is the primary mechanism for identifying hosts in NixFleet.

Every module injected by mkHost - core, scopes, Home Manager - can read config.hostSpec to adapt behavior. Scope activation is driven by nixfleet.<scope>.enable options (set by roles from nixfleet-scopes), not by hostSpec flags.
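Because roles set scope defaults with lib.mkDefault, a host module can override them with lib.mkForce. A sketch (the monitoring scope is used here only as an example):

```nix
# Sketch: opting a single host out of a scope its role enables by default
{ lib, ... }: {
  nixfleet.monitoring.enable = lib.mkForce false;
}
```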

Options

Data fields: userName (required), hostName (auto-set), home (computed), timeZone, locale, keyboardLayout, sshAuthorizedKeys, networking, secretsPath, hashedPasswordFile, rootHashedPasswordFile.

Platform flag: isDarwin (auto-set by mkHost).

For the full option reference with types, defaults, and descriptions, see hostSpec Options.

Accessing hostSpec in modules

hostSpec is available in any NixOS, Darwin, or Home Manager module injected by mkHost:

# In a NixOS/Darwin module
{config, lib, ...}: let
  hS = config.hostSpec;
in {
  services.myapp.dataDir = "${hS.home}/data";
  networking.firewall.enable = lib.mkIf config.nixfleet.firewall.enable true;
}

# In a Home Manager module
{config, lib, ...}: let
  hS = config.hostSpec;
in {
  programs.git.userName = lib.mkIf (!hS.isDarwin) "linux-user";
}

Home Manager modules receive hostSpec because mkHost imports the hostSpec module into the HM evaluation and passes the effective hostSpec values.

Extending hostSpec in fleet repos

The framework defines only the options above. Fleet repos add their own flags as plain NixOS modules:

# modules/hostspec-extensions.nix (in your fleet repo)
{lib, ...}: {
  options.hostSpec = {
    isDev = lib.mkOption {
      type = lib.types.bool;
      default = false;
      description = "Enable development tools and Docker.";
    };
    isGraphical = lib.mkOption {
      type = lib.types.bool;
      default = false;
      description = "Enable graphical desktop (audio, fonts, display manager).";
    };
  };
}

Then use them in fleet-level scopes:

# modules/scopes/dev.nix (in your fleet repo)
{config, lib, pkgs, ...}: let
  hS = config.hostSpec;
in {
  config = lib.mkIf hS.isDev {
    virtualisation.docker.enable = true;
    environment.systemPackages = with pkgs; [gcc gnumake];
  };
}

Include the extension module in your mkHost calls via the modules parameter. No framework changes needed.
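Putting the pieces together, a host definition might wire in both modules and flip the new flag (host name and paths here are illustrative):

```nix
# Sketch: enabling the custom hostSpec flag on one host
nixosConfigurations.dev-box = nixfleet.lib.mkHost {
  hostName = "dev-box";
  platform = "x86_64-linux";
  modules = [
    ./modules/hostspec-extensions.nix  # defines hostSpec.isDev
    ./modules/scopes/dev.nix           # activates when hostSpec.isDev is true
    { hostSpec.isDev = true; }
  ];
};
```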

Org defaults pattern

Define shared defaults in a let binding and merge per-host:

let
  orgDefaults = {
    userName = "deploy";
    timeZone = "America/New_York";
    locale = "en_US.UTF-8";
    keyboardLayout = "us";
    sshAuthorizedKeys = [
      "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... ops-team"
    ];
  };
in {
  web-01 = mkHost {
    hostName = "web-01";
    platform = "x86_64-linux";
    hostSpec = orgDefaults;
    modules = [nixfleet-scopes.scopes.roles.server ./hosts/web-01/hardware.nix];
  };
}

All hostSpec values passed to mkHost use lib.mkDefault, so modules in the modules list can override them if needed.
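Because of that mkDefault wrapping, a plain assignment in any module wins over the org default. A sketch (the timezone value is illustrative):

```nix
# Sketch: per-host module overriding an org default from hostSpec
{
  hostSpec.timeZone = "Europe/Berlin";  # beats the mkDefault set via mkHost
}
```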

Cross-Platform

NixFleet supports NixOS and macOS from a single API. mkHost detects the platform from the platform parameter and builds the appropriate system type.

Supported platforms

| Platform | System builder | Init system | Notes |
|---|---|---|---|
| x86_64-linux | nixosSystem | systemd | Full feature set |
| aarch64-linux | nixosSystem | systemd | Full feature set (ARM servers, edge devices) |
| aarch64-darwin | darwinSystem | launchd | Apple Silicon Macs |
| x86_64-darwin | darwinSystem | launchd | Intel Macs |

Automatic platform detection

mkHost sets hostSpec.isDarwin based on the platform parameter. You never set it manually. The home option also auto-computes:

  • Linux: /home/<userName>
  • Darwin: /Users/<userName>

What differs by platform

| Concern | NixOS | Darwin |
|---|---|---|
| Core module | _nixos.nix - boot, systemd-boot, NetworkManager, polkit, SSH | _darwin.nix - system defaults, TouchID sudo, dock management |
| User config | users.users.<name>.isNormalUser | users.users.<name>.home, .isHidden |
| Services | systemd services (systemd.services.*) | launchd agents (launchd.agents.*) |
| Impermanence | Btrfs root wipe, /persist bind mounts | Not applicable |
| Base scope packages | ifconfig, netstat, xdg-utils (system) | dockutil, mas (system) |
| Home Manager | HM NixOS module + impermanence HM module | HM Darwin module (no impermanence) |
| Nix daemon | Managed by NixOS (nix.gc.automatic, etc.) | Determinate-compatible (nix.enable = false) |
| Trusted users | @admin + user (non-server) | @admin + user |

Platform guards in modules

Use hostSpec.isDarwin (or pkgs.stdenv) for platform-specific logic:

# Using hostSpec (available in all mkHost modules)
{config, lib, ...}: let
  hS = config.hostSpec;
in {
  config = lib.mkIf (!hS.isDarwin) {
    # Linux-only configuration
    services.openssh.enable = true;
  };
}

# Using stdenv (standard Nix pattern)
{lib, pkgs, ...}: {
  home.packages = lib.optionals pkgs.stdenv.isLinux [pkgs.strace]
    ++ lib.optionals pkgs.stdenv.isDarwin [pkgs.darwin.apple_sdk.frameworks.Security];
}

Both approaches work. hostSpec.isDarwin is preferred in NixFleet modules because it is available without pkgs and is consistent with the hostSpec-driven activation pattern.

Scopes and platform support

Not all framework scopes apply to both platforms:

| Scope | NixOS | Darwin |
|---|---|---|
| base | NixOS module + HM module | Darwin module + HM module |
| impermanence | NixOS module + HM module | Not included |
| nixfleet-agent | NixOS service (systemd) | Not available |
| nixfleet-control-plane | NixOS service (systemd) | Not available |

The agent and control-plane services are NixOS-only (systemd). macOS hosts are managed through standard darwin-rebuild and do not participate in fleet orchestration.

Design principle

Prefer simple platform-specific implementations over complex cross-platform abstractions. If a feature only makes sense on one platform, keep it there. The framework handles the platform split at the mkHost level - individual modules should stay focused on their target platform rather than adding conditionals for every difference.

Mixed fleet example

let
  org = {
    userName = "ops";
    timeZone = "UTC";
    sshAuthorizedKeys = ["ssh-ed25519 AAAA..."];
  };
in {
  # NixOS server
  web-01 = mkHost {
    hostName = "web-01";
    platform = "x86_64-linux";
    hostSpec = org;
    modules = [nixfleet-scopes.scopes.roles.server ./hosts/web-01/hardware.nix];
  };

  # macOS developer laptop
  dev-mac = mkHost {
    hostName = "dev-mac";
    platform = "aarch64-darwin";
    hostSpec = org;
    modules = [./hosts/dev-mac/extras.nix];
  };

  # ARM edge device
  sensor-01 = mkHost {
    hostName = "sensor-01";
    platform = "aarch64-linux";
    hostSpec = org;
    modules = [nixfleet-scopes.scopes.roles.endpoint ./hosts/sensor/hardware.nix];
  };
}

All three hosts share org defaults and use the same mkHost call. The framework selects the right system builder, core module, and scope set based on platform.

Scopes & Roles

NixFleet uses a scope system to compose host configurations. Scopes ship in the nixfleet-scopes companion repository - a standalone collection of infrastructure modules, roles, and disk templates that work with any NixFleet-managed host.

Scopes are NixOS modules that self-activate based on configuration flags. Each scope wraps its config block in lib.mkIf so it produces no configuration when its condition is false. Options live under nixfleet.*. Roles compose scopes and set defaults with lib.mkDefault - consumers override with lib.mkForce when needed.
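A minimal scope following this pattern might look like the sketch below (option name and package are illustrative):

```nix
# Sketch of the scope pattern: inert until its enable flag is set
{ config, lib, pkgs, ... }: {
  options.nixfleet.example.enable = lib.mkEnableOption "the example scope";

  # mkIf guarantees no configuration is produced while the flag is false
  config = lib.mkIf config.nixfleet.example.enable {
    environment.systemPackages = [ pkgs.htop ];
  };
}
```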

Repository: github.com/arcanesys/nixfleet-scopes - MIT licensed, works standalone or via inputs.nixfleet.scopes re-export.

Framework Service Scopes

These ship with NixFleet and are auto-included by mkHost (disabled by default).

| Scope | Options | Description |
|---|---|---|
| Agent | services.nixfleet-agent.* | Deploy cycle daemon - polls CP, applies generations, reports health |
| Agent (Darwin) | services.nixfleet-agent.* | macOS variant using launchd |
| Control Plane | services.nixfleet-control-plane.* | Axum HTTP with mTLS, SQLite, RBAC for fleet orchestration |
| Cache Server | services.nixfleet-cache-server.* | Harmonia-based Nix binary cache serving from local store |
| Cache | services.nixfleet-cache.* | Nix substituter pointing to fleet cache |
| MicroVM Host | services.nixfleet-microvm-host.* | MicroVM hypervisor with bridge networking, DHCP, and NAT |

The impermanence scope from nixfleet-scopes is also auto-imported by mkHost. It is inert unless nixfleet.impermanence.enable is set.

Infrastructure Scopes

From nixfleet-scopes. Import via roles or individually.

| Scope | Namespace | Description |
|---|---|---|
| base | nixfleet.base | Universal CLI tools (ifconfig, netstat, xdg-utils). Darwin and HM variants available. |
| operators | nixfleet.operators | Multi-user management - primary user, SSH keys, sudo, shell, HM routing, role groups |
| firewall | nixfleet.firewall | nftables backend, SSH rate limiting (5/min), drop logging, microVM bridge forwarding |
| secrets | nixfleet.secrets | Backend-agnostic identity paths for agenix/sops-nix, boot ordering, key validation |
| backup | nixfleet.backup | Timer scaffolding with restic and borgbackup backends, pre/post hooks, health pings |
| monitoring | nixfleet.monitoring | Prometheus node exporter with fleet-tuned collector defaults |
| monitoring-server | nixfleet.monitoring.server | Prometheus server with scrape configs, retention, and built-in alert rules |
| impermanence | nixfleet.impermanence | Btrfs root wipe + system persist paths (/etc/nixos, /var/lib/systemd, /var/log, etc.) |
| home-manager | nixfleet.home-manager | HM integration - useGlobalPkgs/useUserPackages defaults, fans out profileImports to HM-enabled operators |
| disko | nixfleet.disko | Disko NixOS module injection (inert without disko.devices) |
| o11y | nixfleet.o11y | Metrics remote-write (vmagent to VictoriaMetrics/Mimir) + journal log shipping |
| vpn | nixfleet.vpn | Profile-driven VPN framework with wireguard driver |
| compliance | nixfleet.compliance | Filesystem integration for compliance evidence - persists evidence dir, sets configurationRevision |
| generation-label | nixfleet.generationLabel | Rich boot entry labels from flake metadata (date, rev, deterministic codename) |
| remote-builders | nixfleet.distributedBuilds | Cross-platform distributed build delegation (handles Determinate Nix on Darwin) |
| hardware | nixfleet.hardware | Auto-imports hardware sub-modules: microcode, bluetooth, nvidia, wake-on-LAN, memory/zram, legacy boot |
| terminal-compat | nixfleet.terminalCompat | Terminfo for modern terminals (kitty, alacritty) + headless tools (curl, wget, unzip) |

Platform variants exist for: base (Darwin, HM), operators (Darwin), backup (Darwin), impermanence (HM), home-manager (Darwin).

Operators

The operators scope manages user accounts declaratively. One operator is designated primaryUser - the identity anchor for Home Manager, secrets, and impermanence paths.

Each operator (users.<name>) supports:

  • isAdmin - adds wheel group (sudo access)
  • sshAuthorizedKeys - SSH public keys for authorized_keys
  • shell - login shell (default: bash)
  • homeManager.enable - apply the profile’s HM stack to this operator
  • hashedPassword / hashedPasswordFile - password authentication
  • extraGroups - additional groups on top of roleGroups

Top-level options:

  • primaryUser - identity anchor (auto-detected when only one operator exists)
  • roleGroups - groups added to all operators (set by roles, e.g. workstation adds networkmanager/video/audio/docker)
  • rootSshKeys - root SSH access, independent of operator accounts
  • mutableUsers - allow imperative passwd changes (default: false)

nixfleet.operators = {
  primaryUser = "alice";
  users.alice = {
    isAdmin = true;
    sshAuthorizedKeys = [ "ssh-ed25519 AAAA... alice@workstation" ];
    homeManager.enable = true;
    shell = pkgs.zsh;
  };
  users.bob = {
    sshAuthorizedKeys = [ "ssh-ed25519 BBBB... bob@laptop" ];
  };
  rootSshKeys = config.nixfleet.operators._adminSshKeys;
};

Roles

Roles compose scopes with sensible defaults. Import one role per host.

| Role | Type | Scopes imported | Key defaults |
|---|---|---|---|
| server | Headless | base, operators, firewall, secrets, monitoring, impermanence, o11y, generation-label, terminal-compat, hardware | Firewall on, secrets on, monitoring on, o11y metrics on, no user key, no roleGroups |
| workstation | Interactive | base, operators, firewall, secrets, home-manager, backup, impermanence, o11y, generation-label, terminal-compat, hardware | Firewall on, secrets on, HM on, o11y metrics on, zram swap, roleGroups: networkmanager/video/audio/docker |
| endpoint | Locked-down | base, operators, secrets, impermanence | Secrets on with user key enabled. Consumer provides firewall, HM, and hardware. |
| microvm-guest | VM guest | base, operators, impermanence | Minimal - host owns firewall, backup, and networking |

Disk Templates

Pre-built disko configurations for common partition layouts.

| Template | Boot | Filesystem | Impermanence |
|---|---|---|---|
| btrfs | UEFI | btrfs | No |
| btrfs-bios | Legacy BIOS | btrfs | No |
| btrfs-impermanence | UEFI | btrfs | Yes |
| btrfs-impermanence-bios | Legacy BIOS | btrfs | Yes |
| ext4 | UEFI | ext4 | No |
| luks-btrfs-impermanence | UEFI | LUKS + btrfs | Yes |

Access via inputs.nixfleet-scopes.scopes.disk-templates.<name>.
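A template slots into the modules list like any other module. A sketch - the attribute name main and the device path are assumptions you must adapt to the template and your hardware:

```nix
# Sketch: using a disk template instead of a hand-written disk-config.nix
modules = [
  inputs.nixfleet-scopes.scopes.disk-templates.btrfs-impermanence
  { disko.devices.disk.main.device = "/dev/nvme0n1"; }  # adjust to your hardware
];
```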

What Belongs Where

| Content | Belongs in |
|---|---|
| Framework API (mkHost) | nixfleet |
| Service modules (agent, CP, cache, microvm) | nixfleet |
| Infrastructure scopes and roles | nixfleet-scopes |
| Disk templates | nixfleet-scopes |
| Compliance controls and frameworks | nixfleet-compliance |
| Opinionated fleet scopes (dev, graphical, theming) | Your fleet repo |
| Hardware configs and dotfiles | Your fleet repo |

Compliance

NixFleet’s compliance layer ships in the nixfleet-compliance companion repository - a standalone collection of regulatory controls, framework presets, and evidence probes for NixOS hosts.

Each control enforces a security measure and produces machine-readable evidence via probes. Evidence is collected on a schedule and written to /var/lib/nixfleet-compliance/evidence.json. The governance engine lets fleet operators set enforcement levels, host-type scoping, and per-rule exceptions with mandatory rationale.

Repository: github.com/arcanesys/nixfleet-compliance - MIT licensed, works standalone or alongside nixfleet and nixfleet-scopes.

Quick Start

# flake.nix - add the input
inputs.compliance.url = "github:arcanesys/nixfleet-compliance";

# In your mkHost modules list:
modules = [
  compliance.nixosModules.nis2
  {
    compliance.frameworks.nis2 = {
      enable = true;
      entityType = "essential";
    };
  }
];

Frameworks

| Framework | Regulation | Controls | Differentiation |
|---|---|---|---|
| NIS2 | Directive 2022/2555 | 12 | essential vs important |
| DORA | Regulation 2022/2554 | 9 | critical provider vs standard |
| ISO 27001 | ISO/IEC 27001:2022 | 14 | full vs partial scope |
| ANSSI | BP-028 v2.0 | 7 | minimal / intermediary / reinforced / high |

Controls

| Control | What it enforces |
|---|---|
| access-control | SSH key-only auth, root login disabled, idle session timeout |
| asset-inventory | Host, service, and network inventory from running system |
| audit-logging | Journald persistence, auditd with execve tracking, log retention |
| authentication | MFA policy, PAM modules, SSH certificate auth |
| backup-retention | Backup service verification, last backup age, retention compliance |
| baseline-hardening | Kernel sysctl, IOMMU, filesystem permissions (ANSSI R7-R14) |
| change-management | System rebuild freshness, generation frequency |
| disaster-recovery | Generation retention, RTO target, recovery test interval |
| encryption-at-rest | LUKS verification, encrypted swap, tmpfs /tmp |
| encryption-in-transit | TLS minimum version, certificate inventory and expiry |
| incident-response | Rollback readiness, journal availability, alert retention |
| key-management | SSH host key age and algorithm, LUKS key slots, rotation policy |
| network-segmentation | Firewall status, VLAN detection, interface inventory |
| secure-boot | EFI support, secure boot status, signed unified kernel images |
| supply-chain | flake.lock pinning, SBOM generation, nixpkgs staleness |
| vulnerability-mgmt | Nixpkgs freshness, scan interval, critical vulnerability blocking |

Governance

| Option | Values | Description |
|---|---|---|
| enforceMode | enforce, report | Enforce applies NixOS config and runs probes; report only runs probes |
| level | minimal, standard, strict, paranoid | Rules above this severity threshold are auto-disabled |
| hostType | server, workstation, appliance | Rules scope themselves to matching host types |
| excludes | list of tags | Tag-based rule exclusions (e.g., ["no-ipv6"]) |
| exceptions | attrs with rationale | Per-rule exceptions with mandatory reason, included in audit report |

compliance.governance = {
  enforceMode = "enforce";
  level = "standard";
  hostType = "server";
  exceptions.BH-07 = {
    rationale = "IPv6 required for internal mesh networking";
  };
};

Evidence Collection

Probes run on a configurable schedule - hourly for essential/critical entities, daily for important/standard - and produce JSON. The compliance-check CLI runs all probes interactively:

compliance-check              # colored summary
VERBOSE=1 compliance-check    # detailed JSON per control

Framework Mappings

For detailed article-by-article regulatory mappings, see the per-framework reference pages.

NixOS Advantage

NixOS provides unique compliance properties. flake.lock is a cryptographically verifiable supply chain manifest - every input is pinned by hash. Content-addressing makes binary tampering detectable. Impermanence prevents malware persistence by wiping the root filesystem on every reboot. Declarative configuration means the audit configuration IS the actual running configuration - there is no drift between what was approved and what is deployed.

Standard Tools

NixFleet builds on standard NixOS tooling. Every host produced by mkHost is a regular nixosSystem or darwinSystem output, so the standard deployment commands work unchanged.

Fresh install (with disk partitioning)

nixos-anywhere --flake .#hostname root@192.168.1.42

Disko partitions the disk according to the host’s disk config, then installs the NixOS closure.

Local rebuild

sudo nixos-rebuild switch --flake .#hostname

Remote rebuild

nixos-rebuild switch --flake .#hostname --target-host root@192.168.1.42

Evaluates locally, copies the closure to the target, and activates it.

macOS rebuild

darwin-rebuild switch --flake .#hostname

When to reach for more

These commands work because mkHost returns standard nixosSystem/darwinSystem outputs. The orchestration layer (control plane + agent) is additive - use it when your fleet grows beyond manual rebuilds.

Control Plane

The control plane is a lightweight HTTP server that coordinates fleet deployments. It provides:

  • Machine registry - agents auto-register on first report; machines are tracked with tags and lifecycle states
  • Rollout orchestration - staged, canary, and all-at-once deployment strategies with health-check gates
  • Tag storage - group machines by role, environment, or any arbitrary label
  • Deployment audit log - every action (deploy, rollback, tag change, lifecycle transition) is recorded
  • REST API - all operations available programmatically at /api/v1/

Enabling the service

services.nixfleet-control-plane = {
  enable = true;
  listen = "0.0.0.0:8080";
  dbPath = "/var/lib/nixfleet-cp/state.db";
  openFirewall = true;
};

Options

See Control Plane Options for the full option reference including TLS, metrics, and systemd service details.

Verify

systemctl status nixfleet-control-plane
curl http://localhost:8080/health

What it manages

Machines auto-register when the agent sends its first report to the control plane. Each machine has:

  • A unique ID (defaults to hostname)
  • Tags for grouping (web, prod, eu-west, etc.)
  • A lifecycle state (pending → provisioning → active → maintenance → decommissioned)

Releases are immutable manifests mapping each host to its built Nix store path. A release captures “what the flake means for each host at a point in time”. Created via nixfleet release create, they can be inspected, diffed, listed, and referenced by rollouts multiple times (e.g., staging then prod, or rollback to a previous release). See CLI reference.

Rollouts coordinate fleet-wide deployments across batches with health gates between each batch. Every rollout references a release - the CP resolves each target machine’s store path from the release entries at batch execution time. See Rollouts for details.

Audit events record every mutation (deployment, rollback, tag change, lifecycle transition) with actor, timestamp, and detail. Query them with:

curl http://localhost:8080/api/v1/audit           # JSON
curl http://localhost:8080/api/v1/audit/export     # CSV

Monitoring

The /metrics endpoint is available on the CP’s listen address with no extra configuration. It is always active when the service is running.

Add a scrape target to your Prometheus configuration:

scrape_configs:
  - job_name: nixfleet-control-plane
    static_configs:
      - targets: ["fleet.example.com:8080"]

See Control Plane Options for the full list of exposed metrics.

Security

The control plane uses two independent auth layers: the TLS layer (authentication) and the API layer (authorization).

| Layer | Mechanism | Who | Purpose |
|---|---|---|---|
| TLS | mTLS client certs | Agents + admin clients | Authenticate the connection |
| API | API keys (role-gated) | Admin clients only | Authorize specific operations |

API keys have one of three roles: admin (full access), deploy (create releases and rollouts), readonly (read-only). The bootstrap endpoint creates an admin key.

Configuration

services.nixfleet-control-plane = {
  enable = true;
  tls.cert = "/run/secrets/cp-cert.pem";      # enables HTTPS
  tls.key = "/run/secrets/cp-key.pem";
  tls.clientCa = "/run/secrets/fleet-ca.pem";  # enables required mTLS
};

When tls.clientCa is set, all connections must present a valid client certificate:

  • Agents authenticate via client cert alone (no API key)
  • Admin clients require both a client cert AND an API key (Authorization: Bearer <key>)

See Control Plane Options for full TLS option details.

Bootstrap

On first deployment, create the initial admin key via the bootstrap endpoint (only works when no keys exist):

curl -X POST https://cp-host:8080/api/v1/keys/bootstrap \
  --cacert fleet-ca.pem --cert client-cert.pem --key client-key.pem \
  -H 'Content-Type: application/json' -d '{"name":"admin"}'
# Returns: {"key":"nfk-...","name":"admin","role":"admin"}

Save the returned key - it’s only shown once. Subsequent calls return 409 Conflict.

Production recommendation: Always enable TLS. Set tls.clientCa to require mTLS from all clients. Admin clients need both a client certificate and an API key.

Persistence

State is stored in a single SQLite database at dbPath. On impermanent NixOS hosts, the module automatically persists /var/lib/nixfleet-cp across reboots.

A background task cleans up health reports older than 24 hours to prevent unbounded database growth.

Agent

The agent runs on each managed host as a systemd service. It polls the control plane for a desired generation, applies changes when a mismatch is detected, runs health checks, reports status, and automatically rolls back on failure.

Enabling the agent

services.nixfleet-agent = {
  enable = true;
  controlPlaneUrl = "https://fleet.example.com";
  tags = ["web" "prod" "eu-west"];
  pollInterval = 60;
  healthInterval = 60;

  healthChecks = {
    systemd = [{ units = ["nginx.service" "postgresql.service"]; }];
    http = [{
      url = "http://localhost:8080/health";
      expectedStatus = 200;
      timeout = 3;
      interval = 5;
    }];
    command = [{
      name = "disk-space";
      command = "test $(df --output=pcent / | tail -1 | tr -d '% ') -lt 90";
      timeout = 5;
      interval = 10;
    }];
  };
};

Agent options

See Agent Options for the full option reference including TLS, metrics, health checks, and systemd service details.

Deploy cycle

On every poll tick the agent runs a single sequential deploy cycle (run_deploy_cycle) to completion - no cooperative state machine, no interruptible transitions:

  1. Check - GET /api/v1/machines/<id>/desired-generation returns {hash, cache_url, poll_hint}. If hash matches /run/current-system, the cycle reports “up-to-date” and returns. If poll_hint is set (active rollout), the next poll is scheduled at that shorter interval.
  2. Fetch - if the generation differs, the agent runs nix copy --from <cache_url> <hash>. With no cache URL, it falls back to nix path-info to verify the path was pre-pushed out-of-band.
  3. Apply - runs <hash>/bin/switch-to-configuration switch as a subprocess. The agent is a privileged root service - sandboxing is minimal because switch-to-configuration needs access to /dev, /home, /root, cgroups, and kernel modules to do its job.
  4. Verify - runs all configured health checks. If any fail, the agent transitions to rollback.
  5. Report - posts a Report to the CP with current_generation, success, and message. The executor uses current_generation to verify the machine has actually applied the new generation before accepting health-gated completion.

On any failure (network, fetch, apply, or verify), the cycle returns PollOutcome::Failed and the main loop reschedules the next poll to retryInterval (30s by default) instead of the full pollInterval. This handles bootstrap races (agent polls before the CP has a release), transient network failures, and flaky fetches.
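
The rescheduling decision can be sketched as follows (an illustrative sketch with invented variable names, not the agent's actual code):

```shell
# Hypothetical next-poll selection after one deploy cycle (illustrative only)
poll_interval=60; retry_interval=30
outcome="failed"      # one of: up-to-date, applied, failed
poll_hint=""          # set by the CP while a rollout targets this machine
if [ "$outcome" = "failed" ]; then
  next=$retry_interval          # back off to the shorter retry interval
elif [ -n "$poll_hint" ]; then
  next=$poll_hint               # active rollout: react within seconds
else
  next=$poll_interval           # steady state
fi
echo "next poll in ${next}s"
```

With a failed cycle as above, the agent retries in 30 seconds rather than waiting out the full poll interval.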

Periodic health reports run on a separate healthInterval tick (default 60s) independent of the deploy cycle. The executor only counts a health report toward batch completion when the machine’s current_generation matches the desired store path from the release entry.

Health checks

Three types of health check are supported, all configured declaratively in Nix:

Systemd units

Verify that critical systemd units are in the active state.

healthChecks.systemd = [{
  units = ["nginx.service" "postgresql.service"];
}];

HTTP endpoints

Send a GET request and verify the response status code.

| Suboption | Type | Default | Description |
|---|---|---|---|
| url | string | - (required) | URL to GET |
| expectedStatus | int | 200 | Expected HTTP status code |
| timeout | int | 3 | Timeout in seconds |
| interval | int | 5 | Check interval in seconds |

healthChecks.http = [{
  url = "http://localhost:3000/healthz";
  expectedStatus = 200;
  timeout = 5;
}];

Custom commands

Run an arbitrary shell command. Exit code 0 means healthy.

| Suboption | Type | Default | Description |
|---|---|---|---|
| name | string | - (required) | Check name (used in reports) |
| command | string | - (required) | Shell command to execute |
| timeout | int | 5 | Timeout in seconds |
| interval | int | 10 | Check interval in seconds |

healthChecks.command = [{
  name = "disk-space";
  command = "test $(df --output=pcent / | tail -1 | tr -d '% ') -lt 90";
  timeout = 5;
}];

Continuous health reporting

The agent sends periodic health reports at healthInterval (default: 60s), independent of deploy cycles. The CP uses these to track fleet health, evaluate rollout health gates, and surface issues in nixfleet status.

Prometheus Metrics

Enable the agent metrics listener by setting metricsPort:

services.nixfleet-agent = {
  enable = true;
  controlPlaneUrl = "https://fleet.example.com";
  metricsPort = 9101;
  metricsOpenFirewall = true;
};

Scrape from Prometheus at http://agent-host:9101/metrics. See Agent Options for the full list of exposed metrics.

Registration & tags

Agents auto-register on first report (gated by mTLS). Tags from services.nixfleet-agent.tags sync on every report - change the NixOS config, rebuild, and the CP picks up the new tags automatically. Admins can pre-register machines via nixfleet machines register <id>.

nixfleet machines list              # verify enrollment
nixfleet machines list --tags prod  # filter by tag

Persistence

Agent state is stored in a SQLite database at dbPath. On impermanent NixOS hosts, the module automatically persists /var/lib/nixfleet across reboots.

Security

Configure mTLS via the NixOS module options tls.clientCert and tls.clientKey. Set allowInsecure = true for dev-only HTTP mode.
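
Put together, a minimal agent TLS configuration might look like this (certificate paths are placeholders; the option names are those documented above):

```nix
services.nixfleet-agent = {
  enable = true;
  controlPlaneUrl = "https://fleet.example.com";
  tls.clientCert = "/run/secrets/agent-cert.pem";  # placeholder secret path
  tls.clientKey = "/run/secrets/agent-key.pem";    # placeholder secret path
};
```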

The systemd service runs without sandboxing because switch-to-configuration needs full system access. See Agent Options - Systemd service for the full hardening rationale.

Binary Cache

A fleet binary cache means agents fetch closures from your own infrastructure instead of rebuilding or pulling from cache.nixos.org on every deploy.

NixFleet ships with harmonia as the default cache server. Harmonia serves paths directly from the local Nix store over HTTP - no separate storage backend, database, or push protocol. Paths are signed on-the-fly using the host’s Nix signing key.

Server setup

Enable the cache server on a dedicated host (or any always-on fleet member):

services.nixfleet-cache-server = {
  enable = true;
  port = 5000;            # default
  openFirewall = true;
  signingKeyFile = "/run/secrets/cache-signing-key";
};

Generating a signing key

nix-store --generate-binary-cache-key cache.fleet.example.com secret-key.pem public-key.pem

Store secret-key.pem as an encrypted secret (agenix/sops). Note the public-key.pem contents - clients need it.

Populating the cache

Harmonia serves whatever is in the local Nix store. To populate it, copy closures to the cache host after building:

# Push closures to the cache host's Nix store
nixfleet release create --push-to ssh://root@cache.fleet.example.com

# Or with nix copy directly
nix copy --to ssh://root@cache.fleet.example.com /nix/store/...

Client setup

Enable on agent hosts to configure Nix substituters:

services.nixfleet-cache = {
  enable = true;
  cacheUrl = "http://cache.fleet.example.com:5000";
  publicKey = "cache.fleet.example.com:AAAA...=";  # contents of public-key.pem
};

This adds cacheUrl to nix.settings.substituters and the public key to nix.settings.trusted-public-keys.
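
The net effect is roughly equivalent to setting the Nix options directly (an approximation of what the module produces, not its source):

```nix
# Approximate net effect of services.nixfleet-cache (illustrative sketch)
nix.settings = {
  substituters = [ "http://cache.fleet.example.com:5000" ];
  trusted-public-keys = [ "cache.fleet.example.com:AAAA...=" ];
};
```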

Agent fetch workflow

When a deploy is triggered, each agent resolves the closure from substituters in order:

  1. cacheUrl from services.nixfleet-cache (or services.nixfleet-agent.cacheUrl)
  2. Default Nix substituters (cache.nixos.org, etc.)

Agents automatically benefit from the fleet cache once the client module is enabled and the signing key is trusted - no additional configuration on the agent side is needed.

To override the cache URL per-deploy from the CLI:

nixfleet deploy --tags web --release REL-xxx --cache-url http://cache.fleet.example.com:5000

Advanced: custom cache backends

For Attic, Cachix, or other cache backends that need a custom push command, use the --push-hook CLI flag:

# Attic example
nixfleet release create --push-to ssh://root@cache --push-hook "attic push fleet {}"

# Cachix example
nixfleet release create --push-hook "cachix push my-cache {}"

The {} placeholder is replaced with each store path. When combined with --push-to, the hook runs on the remote host via SSH. Without --push-to, it runs locally.
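
The substitution itself is simple; a sketch of the expansion for one path (the shell snippet is only illustrative, not the CLI's implementation):

```shell
# Illustrative expansion of the {} placeholder for a single store path
hook='attic push fleet {}'
path='/nix/store/abc123-nixos-system-web-01'   # hypothetical store path
cmd=$(printf '%s' "$hook" | sed "s|{}|$path|g")
echo "$cmd"
```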

Fleet repos that want Attic can add it as their own flake input and configure it via plain NixOS modules.

Attic and upstream dependencies

Attic is a push-only cache - it does not proxy upstream caches like cache.nixos.org. When you push a closure with attic push, Attic skips store paths that already exist in upstream caches to save bandwidth and storage. This means your private cache may not have every path needed to fetch a full closure.

The agent handles this automatically: if nix copy --from <cache_url> fails (e.g. a dependency like kmod exists on cache.nixos.org but not in your Attic cache), it falls back to nix-store --realise which uses the system-configured substituters. Your custom-built paths are still served from LAN (Attic), while standard nixpkgs dependencies fall through to cache.nixos.org.

For air-gapped fleets (no WAN access), you must push complete closures including all upstream dependencies. Use nix copy --to instead of attic push - it copies every path regardless of upstream availability:

# Push complete closure (all paths, no upstream skip)
nix copy --to http://cache:8081/fleet /nix/store/...-nixos-system-...

# Or via SSH to the cache host's store (harmonia serves it directly)
nix copy --to ssh://root@cache /nix/store/...-nixos-system-...

Rollouts

A rollout is a fleet-wide deployment coordinated by the control plane. Instead of pushing new code to every machine at once and hoping for the best, rollouts deploy in batches with health-check gates between each batch. If something breaks, the rollout pauses or reverts automatically.

Every rollout targets a release - an immutable CP-managed manifest mapping each host to its built Nix store path. This enables per-host deployment in heterogeneous fleets where every machine’s closure is different (different hardware, hostSpec, modules, certificates). You create a release once (nixfleet release create), then trigger one or more rollouts against it.

The two-step flow

nixfleet release create --push-to ssh://root@cache   # build + push + register
nixfleet deploy --release rel-abc123 --tags web --strategy canary --wait

Or use the convenience shorthand - nixfleet deploy with --push-to / --copy implicitly creates a release first:

nixfleet deploy --push-to ssh://root@cache --tags web --strategy canary --wait

Both forms do the same thing. The explicit form is useful when you want to deploy the same release multiple times (e.g., staging then prod, or rolling forward then back).

Strategies

All-at-once

Deploy to every targeted machine simultaneously. No batching, no gates. Suitable for dev/staging environments or non-critical updates.

nixfleet deploy --release rel-abc123 --tags staging --strategy all-at-once

Canary

Deploy to a single machine first. If that machine passes health checks within the timeout, deploy to all remaining machines. Suitable for production environments where you want a quick smoke test.

nixfleet deploy --release rel-abc123 --tags prod --strategy canary \
  --health-timeout 120 --wait

This creates two batches: batch 0 with 1 machine, batch 1 with the rest.

Staged

Define explicit batch sizes for fine-grained control. Batch sizes can be absolute numbers or percentages.

nixfleet deploy --release rel-abc123 --tags prod --strategy staged \
  --batch-size 1,25%,100% \
  --health-timeout 300 --wait

This creates three batches:

  1. Batch 0: 1 machine (canary)
  2. Batch 1: 25% of remaining machines
  3. Batch 2: all remaining machines (100%)
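
To make the percentage arithmetic concrete, here is how `1,25%,100%` might resolve for a hypothetical 12-machine target set (an assumption about the rounding, not the control plane's actual code):

```shell
# Hypothetical batch-size resolution for --batch-size 1,25%,100%
# (illustrative only; the control plane's real rounding may differ)
total=12
batch0=1
remaining=$((total - batch0))        # 11 machines left
batch1=$((remaining * 25 / 100))     # 25% of remaining -> 2
remaining=$((remaining - batch1))    # 9 machines left
batch2=$remaining                    # 100% of remaining -> 9
echo "$batch0 $batch1 $batch2"
```

Under these assumptions, 12 machines split into batches of 1, 2, and 9.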

How rollouts work

  1. Create - The CLI posts a rollout to the control plane with the release_id, target filter (tags or hosts), and strategy. The CP loads the release entries, intersects them with the target machine set (machines not in the release are skipped with a warning), randomizes the order, and splits them into batches.

  2. Execute batches - The rollout executor (a background task in the CP) processes batches sequentially:

    • For each machine in the current batch, looks up the per-host store path from the release entries
    • Captures the machine’s current generation into the batch’s previous_generations map (for per-machine rollback)
    • Sets the desired generation on each machine via the internal generations table
    • Returns poll_hint: 5 in the agent’s next desired-generation response so agents react within seconds instead of waiting the full pollInterval
    • Agents poll, detect the mismatch, fetch the closure, apply, run health checks, and report back with their new current_generation
  3. Health gate - The executor evaluates each machine’s health by verifying TWO conditions:

    • The machine’s latest report’s current_generation matches the desired store path from the release entry (proves the agent actually applied the new generation)
    • A health report with all_passed = true has been received since the batch started

    This two-step gate prevents false-positive completion from stale health reports: a health report from a previous generation cannot count toward the new batch.

  4. Complete or fail - When all batches succeed, the rollout status moves to completed. If a health gate fails, the rollout transitions to paused or failed depending on the --on-failure setting.

Health gates

After each batch deploys, the control plane waits for agents to report health. The gate evaluates based on two parameters:

  • --health-timeout (default: 300 seconds) - Maximum time to wait for health reports after a batch deploys. Machines that do not report within this window are marked as timed out. Set this higher than pollInterval so agents have time to notice the deploy (or rely on poll_hint to react within 5s).
  • --failure-threshold (default: 0) - Maximum number of unhealthy/timed-out machines before triggering the failure action. 0 means zero tolerance - any single failure pauses the rollout. Can be absolute ("3") or a percentage of the batch ("30%").

When the threshold is exceeded:

  • --on-failure pause (default) - The rollout pauses. Investigate, fix the issue, then resume with nixfleet rollout resume <id>. Machines in the failed batch that did deploy are left in place (the agent already rolled back individually if its own health checks failed).
  • --on-failure revert - The rollout fails and the CP reads each completed batch’s previous_generations map, reverting every machine in those batches to the store path it was running before the rollout started. Each machine rolls back to its OWN previous state - not a single shared generation - which is the correct behavior for heterogeneous fleets.
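
The threshold arithmetic can be pictured with a small sketch (an assumption about how absolute vs percentage values resolve, not the control plane's actual parsing):

```shell
# Hypothetical resolution of --failure-threshold against a 10-machine batch
batch_size=10
threshold="30%"
case "$threshold" in
  *%) max_failed=$((batch_size * ${threshold%\%} / 100)) ;;  # "30%" of 10 -> 3
  *)  max_failed=$threshold ;;                               # absolute, e.g. "3"
esac
echo "$max_failed"
```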

CLI flags

See CLI reference - deploy for the full flag list with defaults and descriptions.

Monitoring rollouts

Stream progress in real time with --wait:

nixfleet deploy --release rel-abc123 --tags prod --strategy canary --wait

If --on-failure pause triggers, --wait exits immediately with an actionable message instead of blocking until timeout:

Rollout r-xxx paused: batch 1 health check failed (2/3 unhealthy)
  Resume with:  nixfleet rollout resume r-xxx
  Monitor with: nixfleet rollout status r-xxx --watch

List rollouts:

nixfleet rollout list
nixfleet rollout list --status running
nixfleet rollout list --status paused

Inspect a specific rollout with per-batch and per-machine detail:

nixfleet rollout status <rollout-id>

Managing rollouts

Resume a paused rollout (after investigating and fixing the issue):

nixfleet rollout resume <rollout-id>

Cancel a rollout (stops further batches, leaves already-deployed machines as-is):

nixfleet rollout cancel <rollout-id>

SSH fallback

For environments without a control plane (small fleets, bootstrapping, or air-gapped networks), the CLI can deploy directly over SSH without using a release:

nixfleet deploy --ssh --hosts "web*" --flake .

This builds each matching host’s closure locally, copies it to the target via nix-copy-closure, and runs switch-to-configuration switch. No rollout orchestration, no release manifest, no health gates - just a direct push. Useful for initial bootstrap or quick one-off deploys.

Worked example: canary deploy to production

Step 1 - build all production hosts and register a release. If you use harmonia as a binary cache, --push-to ssh:// copies the closures to the cache host’s /nix/store where harmonia serves them immediately:

nixfleet release create \
  --flake . \
  --hosts 'web-*,db-*' \
  --push-to ssh://root@cache

Output includes the release ID, for example rel-abc123-....

Step 2 - deploy with canary strategy, 2-minute health timeout, auto-pause on failure:

nixfleet deploy \
  --release rel-abc123 \
  --tags prod,web \
  --strategy canary \
  --health-timeout 120 \
  --failure-threshold 1 \
  --on-failure pause \
  --wait

What happens:

  1. The CP loads the release entries, filters by prod AND web tags, intersects with the release’s host list (skipping any tagged machine not in the release), and randomizes the order.
  2. Batch 0: 1 machine receives its per-host store path as desired. The CP starts returning poll_hint=5 in the agent’s desired-generation response.
  3. Within ~5s, the agent polls, sees the mismatch, fetches the closure via nix copy --from http://cache:5000, runs switch-to-configuration switch, runs health checks, reports back.
  4. The CP verifies the agent’s report shows the new current_generation (not a stale report from before the deploy), then waits for a passing health report.
  5. If healthy within 120s: Batch 1 deploys to all remaining machines in parallel.
  6. If unhealthy: the rollout pauses. The canary machine’s agent has already rolled back locally. Run nixfleet rollout status <id> to investigate, then nixfleet rollout resume <id> or nixfleet rollout cancel <id>.

Step 3 - same release, different environment:

# Same release, redeploy to a different subset with a different strategy
nixfleet deploy --release rel-abc123 --tags staging --strategy all-at-once --wait

Fleet Status

Day-2 operations for monitoring your fleet through the CLI and control plane.

Fleet overview

nixfleet status

Shows a summary of all machines known to the control plane: hostname, current generation (from the agent’s most recent report), desired generation (from the active rollout’s release entry, if any), lifecycle state, last report time, and tags.

For machine-readable output:

nixfleet status --json

Listing machines

nixfleet machines list

Filter by tag:

nixfleet machines list --tags prod
nixfleet machines list --tags web

Tags

Tags group machines for targeted deployments and filtering. They can be set in two places.

Via NixOS configuration

Declare tags in the agent service config. These are baked into the system closure and reported on every poll:

services.nixfleet-agent = {
  enable = true;
  controlPlaneUrl = "https://fleet.example.com";
  tags = ["prod" "web" "region-eu"];
};

Tags are stored in the control plane database. NixOS-configured tags (from services.nixfleet-agent.tags) are reported by the agent on every poll and synced to the control plane.

Machine lifecycle

Every machine has a lifecycle state that determines how the control plane treats it.

| State | Description |
|---|---|
| pending | Pre-registered, no agent report yet |
| provisioning | Install in progress |
| active | Agent reporting normally |
| maintenance | Manually paused |
| decommissioned | Removed from fleet |

Lifecycle is informational - rollouts target machines by tag or hostname regardless of lifecycle state. Use lifecycle to track operational status and filter with nixfleet machines list.

Transitions

Not all transitions are valid. The control plane enforces these rules:

pending --> provisioning --> active
pending --> active                    (agent reports directly)
pending --> decommissioned            (never used)
provisioning --> pending              (reset)
active <--> maintenance               (pause/resume)
active --> decommissioned             (retire)
maintenance --> decommissioned        (retire while paused)

Invalid transitions (e.g., decommissioned to active, or active to pending) are rejected by the control plane.
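
The rules above amount to a simple lookup; this sketch mirrors the documented transition table (the control plane enforces this server-side; the snippet is only illustrative):

```shell
# Hypothetical transition check mirroring the documented rules
is_valid() {
  case "$1->$2" in
    "pending->provisioning" | "pending->active" | "pending->decommissioned" | "provisioning->active" | "provisioning->pending" | "active->maintenance" | "maintenance->active" | "active->decommissioned" | "maintenance->decommissioned")
      echo valid ;;
    *)
      echo invalid ;;
  esac
}
is_valid active maintenance        # valid
is_valid decommissioned active     # invalid
```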

Changing lifecycle state

Use the control plane API directly:

curl -X PATCH "$NIXFLEET_CONTROL_PLANE_URL/api/v1/machines/web-01/lifecycle" \
  -H "Content-Type: application/json" \
  -d '{"lifecycle": "maintenance"}'

When the control plane is unavailable

The CLI’s status and machines list commands require a running control plane. If the CP is down:

  • Agents continue running with their last-known generation
  • Agents do not receive new deployments
  • Use SSH for direct machine access (ssh root@hostname)
  • Use standard NixOS tools for local inspection (nixos-rebuild list-generations, systemctl status)

Rollback

Four mechanisms exist for rolling back, from fully automatic to fully manual.

1. Automatic (agent health checks)

When the agent applies a new generation, it runs the configured health checks (systemd units, HTTP endpoints, custom commands). If any check fails, the agent automatically:

  1. Rolls back to the previous generation (switch-to-configuration switch)
  2. Reports the failure to the control plane with success: false
  3. Includes the rollback reason in the report message

No operator action required. During a rollout, this failure report triggers the rollout’s health gate, which may pause or revert the entire rollout depending on --on-failure settings.

2. Rollout-level revert (on_failure = revert)

When a rollout is created with --on-failure revert and a later batch fails, the control plane reads each completed batch’s previous_generations map (captured at batch start) and sets each machine’s desired generation back to the store path it was running BEFORE the rollout started. This is per-machine - each host reverts to its own previous state, not a single shared generation. The rollout status becomes failed and agents pull the revert on their next poll (within ~5s due to poll_hint).

This is the correct rollback mechanism for heterogeneous fleets where each machine has a unique closure.

3. Manual via CLI (SSH mode)

nixfleet rollback is an SSH-only operation - it switches a single machine to a previous generation directly over SSH, bypassing the control plane.

# Rollback to the previous generation (reads from system-1-link on the target)
nixfleet rollback --host web-01 --ssh

# Rollback to a specific store path
nixfleet rollback --host web-01 --ssh --generation /nix/store/abc123-nixos-system

This runs switch-to-configuration switch on the target via SSH. Useful when the control plane is unreachable or during bootstrap before the agent is running.

For CP-driven rollback of a bad deploy discovered after health checks pass, deploy an older release:

git checkout <old-commit>
nixfleet release create --push-to ssh://root@cache
git checkout -
nixfleet deploy --release <old-id> --tags prod --wait

4. Manual via NixOS

Standard NixOS rollback mechanisms work regardless of NixFleet.

Command-line rollback

# On the target machine
sudo nixos-rebuild switch --rollback

# Or switch to a specific generation
sudo nix-env -p /nix/var/nix/profiles/system --switch-generation 42
sudo /nix/var/nix/profiles/system/bin/switch-to-configuration switch

Boot menu

systemd-boot lists previous generations at boot. Select an older entry to boot into a previous configuration. This is the last resort when SSH access is unavailable or the current generation fails to boot.

When to use which

| Scenario | Mechanism |
|---|---|
| Deployment health check fails | Automatic (agent rolls back per-machine) |
| Mid-rollout batch failure with --on-failure revert | Automatic (CP reverts completed batches from per-machine previous_generations) |
| Bad deploy discovered after health checks pass | Create a release pointing at the old closures, nixfleet deploy --release <old> |
| Control plane is down | SSH rollback (nixfleet rollback --host <h> --ssh) or NixOS boot menu |
| Machine won't boot | Boot menu (select previous generation) |
| Rollout affecting multiple machines | nixfleet rollout cancel + individual rollbacks if needed |

Impermanence

Impermanent hosts wipe their root filesystem on every boot. Only explicitly persisted paths survive. This eliminates configuration drift and forces every piece of state to be declared.

What ephemeral root gives you

  • No drift - the root filesystem is always a clean slate. Undeclared state cannot accumulate.
  • Forced explicitness - if you forget to persist something, you notice on the next reboot. No hidden state.
  • Reproducibility - two machines with the same closure and the same persisted data behave identically.

How the btrfs wipe works

On boot, an initrd script runs before the root filesystem is mounted:

  1. Mounts the btrfs partition by label (root)
  2. Renames the current @root subvolume to old_roots/<timestamp>
  3. Deletes old root snapshots older than 30 days (recursive subvolume deletion)
  4. Creates a fresh @root subvolume
  5. Unmounts

The /persist filesystem is marked neededForBoot = true so it is available during early boot before the wipe completes.
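The steps above follow the common "erase your darlings" pattern. A simplified sketch of such a wipe via boot.initrd.postDeviceCommands (illustrative only - the framework's actual script, device label, and retention logic may differ):

```nix
{lib, ...}: {
  boot.initrd.postDeviceCommands = lib.mkAfter ''
    mkdir -p /mnt
    # 1. Mount the btrfs partition by label
    mount -o subvol=/ /dev/disk/by-label/root /mnt

    # 2. Rename the current @root subvolume aside
    mkdir -p /mnt/old_roots
    mv /mnt/@root "/mnt/old_roots/$(date +%Y-%m-%d_%H%M%S)"

    # 3. Delete snapshots older than 30 days (inner subvolumes first)
    for old in $(find /mnt/old_roots/ -maxdepth 1 -mindepth 1 -mtime +30); do
      for sub in $(btrfs subvolume list -o "$old" | cut -d ' ' -f 9-); do
        btrfs subvolume delete "/mnt/$sub"
      done
      btrfs subvolume delete "$old"
    done

    # 4. Create a fresh @root subvolume, 5. unmount
    btrfs subvolume create /mnt/@root
    umount /mnt
  '';
}
```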

What the framework persists

System-level (/persist)

| Path | Purpose |
|---|---|
| /etc/nixos | NixOS configuration |
| /etc/NetworkManager/system-connections | WiFi/VPN connections |
| /var/lib/systemd | systemd state (timers, journals) |
| /var/lib/nixos | NixOS UID/GID maps |
| /var/log | System logs |
| /etc/machine-id | Stable machine identity (file) |

User-level (/persist via Home Manager)

The framework persists common user paths. Fleet repos extend this list with their own application state via scope-aware persistence (see below).

| Path | Purpose |
|---|---|
| .keys | Encryption/decryption keys |
| .local/share/nix | Nix user state |
| .ssh/known_hosts | SSH known hosts (file) |

The framework also persists paths for tools included in the base scope (shell history, plugin state, CLI auth). See modules/scopes/_impermanence.nix for the full list.

User-level mounts are hidden (hideMounts = true) to keep ls output clean.

Service-level (auto-persist)

The agent and control plane modules automatically persist their state directories when impermanence is enabled:

  • Agent: /var/lib/nixfleet (SQLite state database)
  • Control plane: /var/lib/nixfleet-cp (SQLite state database)

No manual configuration needed. The service modules detect nixfleet.impermanence.enable and add the persist entries.

Scope-aware persistence

Persist paths belong next to the program they support, not in a centralized list. When you write a scope that installs a program with state, co-locate the persist declaration:

{config, lib, pkgs, ...}: let
  hS = config.hostSpec;
in {
  config = lib.mkIf hS.isGraphical {
    programs.firefox.enable = true;

    # Persist Firefox profile alongside its config
    home.persistence."/persist" = lib.mkIf config.nixfleet.impermanence.enable {
      directories = [".mozilla/firefox"];
    };
  };
}

This prevents the persistence list from drifting out of sync with installed programs.

Opting in

Enable nixfleet.impermanence.enable (or use a role that sets it) and use a btrfs disk layout with separate persist subvolumes:

nixfleet.lib.mkHost {
  hostName = "myhost";
  platform = "x86_64-linux";
  hostSpec = {
    userName = "alice";
  };
  modules = [
    nixfleet.scopes.roles.workstation
    { nixfleet.impermanence.enable = true; }
    # Use the framework's btrfs-impermanence disko template
    nixfleet.diskoTemplates.btrfs-impermanence
    ./hardware-configuration.nix
  ];
}

The framework provides two disko templates:

  • diskoTemplates.btrfs - standard btrfs layout without impermanence
  • diskoTemplates.btrfs-impermanence - btrfs layout with @root, @persist, and @nix subvolumes

Ownership and activation

The framework runs an activation script that ensures /persist/home/<userName> exists with correct ownership. If a .keys directory exists in the persist home, it is recursively chowned to the primary user.

Custom Scopes

Scopes are plain NixOS/HM modules that self-activate based on enable options. The framework provides base, impermanence, and the service modules. Your fleet repo adds scopes for everything else.

Step 1: Define a hostSpec flag

Extend hostSpec in your fleet repo with a plain NixOS module:

# modules/host-spec-extensions.nix
{lib, ...}: {
  options.hostSpec.isDev = lib.mkOption {
    type = lib.types.bool;
    default = false;
    description = "Enable development tools.";
  };
}

Include this module in your mkHost modules list (or use an import-tree pattern).

Step 2: Create the scope module

Write a NixOS module that activates only when the flag is true:

# modules/scopes/dev.nix
{config, lib, pkgs, ...}: let
  hS = config.hostSpec;
in {
  config = lib.mkIf hS.isDev {
    virtualisation.docker.enable = true;
    environment.systemPackages = with pkgs; [gcc gnumake];
  };
}

Step 3: Add Home Manager config

If the scope needs user-level configuration, use the HM module pattern. You can define it as a separate module or combine it with the NixOS module depending on your import strategy.

In a multi-module pattern (returned as an attrset):

# modules/scopes/dev.nix
{
  nixos = {config, lib, pkgs, ...}: let
    hS = config.hostSpec;
  in {
    config = lib.mkIf hS.isDev {
      virtualisation.docker.enable = true;
    };
  };

  homeManager = {config, lib, pkgs, ...}: let
    hS = config.hostSpec;
  in {
    home.packages = lib.optionals hS.isDev (with pkgs; [
      nodejs
      python3
      rustup
    ]);
  };
}

Step 4: Add persist paths

If the scope installs programs with state on impermanent hosts, co-locate the persistence declaration:

{config, lib, pkgs, ...}: let
  hS = config.hostSpec;
in {
  config = lib.mkIf hS.isDev {
    virtualisation.docker.enable = true;

    # Persist Docker data on impermanent hosts
    environment.persistence."/persist".directories =
      lib.mkIf config.nixfleet.impermanence.enable [
        "/var/lib/docker"
      ];
  };
}

For user-level persistence (in an HM module):

home.persistence."/persist" = lib.mkIf config.nixfleet.impermanence.enable {
  directories = [".cargo" ".rustup" ".npm"];
};

Step 5: Import in mkHost

Add the scope module to your host definitions:

nixfleet.lib.mkHost {
  hostName = "workstation";
  platform = "x86_64-linux";
  hostSpec = {
    userName = "alice";
    isDev = true;
  };
  modules = [
    ./modules/host-spec-extensions.nix
    ./modules/scopes/dev.nix
    ./hardware-configuration.nix
  ];
}

If you use an import-tree or similar auto-discovery pattern, the scope is picked up automatically without explicit imports.

Conventions

  • One concern per scope - dev, graphical, desktop, not dev-and-graphical
  • lib.mkIf on enable options - scopes produce no config when their enable is false
  • Co-locate persistence - persist paths live in the scope that needs them
  • Framework vs fleet - generic infrastructure (base, impermanence, agent, CP) belongs in NixFleet. Opinionated tools and theming belong in your fleet repo.

Secrets

NixFleet provides a secrets wiring scope that handles identity path management, impermanence persistence, and boot ordering. Fleet repos bring their own backend (agenix, sops-nix) and wire it to the framework.

Enabling the secrets scope

nixfleet.secrets.enable = true;

The scope computes config.nixfleet.secrets.resolvedIdentityPaths based on its options:

  • Servers (enableUserKey = false, the default for the server role): host SSH key only (/etc/ssh/ssh_host_ed25519_key)
  • Workstations (enableUserKey = true, the default for the workstation role): host SSH key + user key fallback (~/.keys/id_ed25519)

On impermanent hosts, identity keys are automatically persisted.
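For example, a workstation-style host might enable the scope and the user-key fallback explicitly (this assumes enableUserKey sits under nixfleet.secrets alongside enable - roles normally set it for you):

```nix
{
  nixfleet.secrets = {
    enable = true;
    # false (server default): host SSH key only
    # true (workstation default): host SSH key + ~/.keys/id_ed25519 fallback
    enableUserKey = true;
  };
}
```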

agenix example

# flake.nix inputs
inputs.agenix.url = "github:ryantm/agenix";
inputs.agenix.inputs.nixpkgs.follows = "nixfleet/nixpkgs";

# In your host modules
{inputs, config, ...}: {
  imports = [inputs.agenix.nixosModules.default];

  # Use framework-computed identity paths
  age.identityPaths = config.nixfleet.secrets.resolvedIdentityPaths;

  age.secrets.root-password.file = "${inputs.secrets}/org/root-password.age";

  hostSpec = {
    hashedPasswordFile = config.age.secrets.root-password.path;
    rootHashedPasswordFile = config.age.secrets.root-password.path;
  };
}

sops-nix example

# flake.nix inputs
inputs.sops-nix.url = "github:Mic92/sops-nix";
inputs.sops-nix.inputs.nixpkgs.follows = "nixfleet/nixpkgs";

# In your host modules
{inputs, config, ...}: {
  imports = [inputs.sops-nix.nixosModules.sops];

  sops = {
    defaultSopsFile = ./secrets/secrets.yaml;
    # sops-nix derives age keys from SSH ed25519 keys, so the
    # framework-computed identity paths plug in here too
    age.sshKeyPaths = config.nixfleet.secrets.resolvedIdentityPaths;
  };

  sops.secrets.root-password.neededForUsers = true;

  hostSpec = {
    hashedPasswordFile = config.sops.secrets.root-password.path;
    rootHashedPasswordFile = config.sops.secrets.root-password.path;
  };
}

Extension points

hostSpec provides three options for wiring secrets into the framework:

| Option | Type | Purpose |
|---|---|---|
| secretsPath | nullOr str | Hint for the path to your secrets repo/directory. |
| hashedPasswordFile | nullOr str | Path to a hashed password file for the primary user. |
| rootHashedPasswordFile | nullOr str | Path to a hashed password file for root. |

When hashedPasswordFile or rootHashedPasswordFile is non-null, the core NixOS module sets users.users.<name>.hashedPasswordFile accordingly.
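Conceptually, the wiring amounts to the following (a simplified sketch, not the framework's actual source; hS.userName as the primary-user option follows the examples elsewhere in this guide):

```nix
{config, lib, ...}: let
  hS = config.hostSpec;
in {
  config = {
    # Primary user password, only when a file is provided
    users.users.${hS.userName}.hashedPasswordFile =
      lib.mkIf (hS.hashedPasswordFile != null) hS.hashedPasswordFile;
    # Root password, wired independently
    users.users.root.hashedPasswordFile =
      lib.mkIf (hS.rootHashedPasswordFile != null) hS.rootHashedPasswordFile;
  };
}
```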

Bootstrapping

New machines need a decryption key before they can decrypt secrets. Two approaches:

--extra-files (nixos-anywhere)

Pass the key during initial install:

mkdir -p /tmp/extra/etc/ssh
cp /path/to/ssh_host_ed25519_key /tmp/extra/etc/ssh/ssh_host_ed25519_key
chmod 600 /tmp/extra/etc/ssh/ssh_host_ed25519_key

nixos-anywhere --flake .#myhost --extra-files /tmp/extra root@192.168.1.50

The build-vm and test-vm apps do this automatically when a key is found at ~/.keys/id_ed25519 or ~/.ssh/id_ed25519. You can also pass a key explicitly with --identity-key PATH. For real hardware, pass --extra-files to nixos-anywhere to inject the key during install.

The secrets scope’s nixfleet-host-key-check service auto-generates the host key at /etc/ssh/ssh_host_ed25519_key on first boot if the key is missing, so bootstrapping without a pre-provisioned key is safe.

Generate on target

SSH into the machine and the host key will be generated automatically by nixfleet-host-key-check before sshd starts. Alternatively, generate one manually and add it to your secrets configuration:

ssh root@192.168.1.50
ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key -N ""

Then extract the public key, add it to your secrets configuration (e.g., secrets.nix for agenix), and re-encrypt the affected secrets.

Key placement on impermanent hosts

On impermanent hosts, the secrets scope automatically persists:

  • /etc/ssh/ssh_host_ed25519_key (and .pub)
  • The user key directory (~/.keys) when enableUserKey is true

The impermanence scope also persists ~/.keys independently, providing defense in depth.

Templates & Patterns

NixFleet ships flake templates for common fleet structures. Initialize a new project with:

nix flake init -t github:arcanesys/nixfleet

Available templates

| Template | Command | Description |
|---|---|---|
| default / standalone | nix flake init -t nixfleet | Single NixOS machine, no flake-parts |
| fleet | nix flake init -t nixfleet#fleet | Multi-host fleet with flake-parts |
| batch | nix flake init -t nixfleet#batch | Batch of identical hosts from a template |

standalone

Minimal setup for a single machine. No flake-parts, no import-tree. Just nixfleet + one mkHost call:

{
  inputs = {
    nixfleet.url = "github:arcanesys/nixfleet";
    nixpkgs.follows = "nixfleet/nixpkgs";
  };

  outputs = {nixfleet, ...}: {
    nixosConfigurations.myhost = nixfleet.lib.mkHost {
      hostName = "myhost";
      platform = "x86_64-linux";
      hostSpec = {
        userName = "alice";
        timeZone = "US/Eastern";
        locale = "en_US.UTF-8";
        sshAuthorizedKeys = [
          "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA..."
        ];
      };
      modules = [
        ./hardware-configuration.nix
        ./disk-config.nix
      ];
    };
  };
}

fleet

Multi-host fleet using flake-parts for structure. Imports NixFleet’s flakeModules for apps, tests, formatter, and ISO generation.

batch

Generate many identical hosts from a template. Useful for edge devices, kiosks, or lab machines where the only difference between hosts is the hostname and network config.
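A sketch of what such a structure can look like (illustrative only - the host names and module paths here are invented, not what the template scaffolds):

```nix
{
  inputs = {
    nixfleet.url = "github:arcanesys/nixfleet";
    nixpkgs.follows = "nixfleet/nixpkgs";
  };

  outputs = {nixfleet, ...}: let
    # The only per-host variation: hostname (and per-host network config)
    hostNames = ["kiosk-01" "kiosk-02" "kiosk-03"];
    mkKiosk = name:
      nixfleet.lib.mkHost {
        hostName = name;
        platform = "x86_64-linux";
        hostSpec.timeZone = "UTC";
        modules = [
          ./modules/kiosk-common.nix   # shared template
          ./hosts/${name}/network.nix  # per-host network config
        ];
      };
  in {
    nixosConfigurations = builtins.listToAttrs
      (map (name: { inherit name; value = mkKiosk name; }) hostNames);
  };
}
```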

The follows chain

Every template uses this pattern:

inputs = {
  nixfleet.url = "github:arcanesys/nixfleet";
  nixpkgs.follows = "nixfleet/nixpkgs";
};

The follows directive means your fleet uses the same nixpkgs revision that NixFleet was tested against. This is important because:

  • Consistency - framework modules, core config, and your fleet code all evaluate against the same package set
  • No diamond dependency - without follows, you would have two separate nixpkgs evaluations (NixFleet’s and yours), doubling memory usage and causing subtle version mismatches
  • Tested combination - NixFleet’s CI validates against its pinned nixpkgs

NixFleet’s own follows chain

NixFleet pins and follows these inputs internally:

nixpkgs           (nixos-unstable)
darwin            follows nixpkgs
home-manager      follows nixpkgs
disko             follows nixpkgs
impermanence      follows nixpkgs
lanzaboote        follows nixpkgs
microvm           follows nixpkgs
nixos-anywhere    follows nixpkgs, flake-parts, disko, treefmt-nix
treefmt-nix       follows nixpkgs

All major inputs share a single nixpkgs, ensuring consistent package versions throughout the dependency tree.

When to follow vs pin independently

| Scenario | Recommendation |
|---|---|
| Standard fleet | Follow NixFleet’s nixpkgs (follows = "nixfleet/nixpkgs") |
| Need a specific nixpkgs fix not yet in NixFleet | Pin your own nixpkgs, accept potential mismatches, update NixFleet soon |
| Fleet-specific inputs (secrets tool, hardware modules) | Follow your fleet’s nixpkgs for consistency |
| NixFleet’s bundled inputs (disko, HM, etc.) | Always use the versions bundled in NixFleet - they are tested together |
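For the independent-pin scenario, the inputs look like this - note the absence of follows, which is exactly what creates the double evaluation described above:

```nix
inputs = {
  nixfleet.url = "github:arcanesys/nixfleet";
  # Independent pin: picks up a nixpkgs fix before NixFleet does.
  # Without a follows directive, NixFleet's nixpkgs and yours
  # evaluate separately until you drop back to the follows pattern.
  nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
};
```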

Disko templates

NixFleet also provides reusable disk layout templates, separate from flake templates:

| Template | Import path | Description |
|---|---|---|
| btrfs | nixfleet.diskoTemplates.btrfs | Standard btrfs layout |
| btrfs-impermanence | nixfleet.diskoTemplates.btrfs-impermanence | Btrfs with @root, @persist, @nix subvolumes for impermanence |

Use them in your mkHost modules list:

modules = [
  nixfleet.diskoTemplates.btrfs-impermanence
  ./hardware-configuration.nix
];

CLI

Flat reference for all nixfleet CLI commands and flags.

Global options

| Flag | Env var | Default | Description |
|---|---|---|---|
| --control-plane-url | NIXFLEET_CONTROL_PLANE_URL | http://localhost:8080 | Control plane URL |
| --api-key | NIXFLEET_API_KEY | "" | API key for control plane authentication |
| --client-cert | NIXFLEET_CLIENT_CERT | "" | Client certificate for mTLS authentication |
| --client-key | NIXFLEET_CLIENT_KEY | "" | Client key for mTLS authentication |
| --ca-cert | NIXFLEET_CA_CERT | "" | CA certificate for TLS verification (uses system trust store if omitted) |
| --json | - | false | Output structured JSON (on commands that produce tables/detail views) |
| --config | - | - | Path to .nixfleet.toml (default: walk up from cwd) |
| -v, --verbose | - | 0 | Verbosity: -v shows INFO milestones + subprocess rolling window + progress bar; -vv shows raw passthrough (debug) |

Logging is controlled via RUST_LOG (overrides -v/--verbose when set).

Configuration sources

The CLI reads connection settings from four layers, in priority order (highest wins):

  1. CLI flags (--control-plane-url, --api-key, …)
  2. Environment variables (NIXFLEET_* shown above)
  3. ~/.config/nixfleet/credentials.toml - user-level API keys, keyed by CP URL (auto-saved by nixfleet bootstrap)
  4. .nixfleet.toml - repo-level config, from --config <path> or discovered by walking up from cwd

This means the same CLI commands run with no flags from any fleet repo, inheriting the repo’s connection settings and the user’s bootstrapped credentials. See .nixfleet.toml format below.

mTLS example (with config file):

# One-time setup (creates .nixfleet.toml)
nixfleet init \
  --control-plane-url https://cp-01:8080 \
  --ca-cert modules/_config/fleet-ca.pem \
  --client-cert '/run/agenix/agent-${HOSTNAME}-cert' \
  --client-key '/run/agenix/agent-${HOSTNAME}-key' \
  --cache-url http://cache:5000 \
  --push-to ssh://root@cache

# Bootstrap first admin key (auto-saves to ~/.config/nixfleet/credentials.toml)
nixfleet bootstrap

# Subsequent commands: no flags needed
nixfleet machines list
nixfleet release create
nixfleet deploy --release rel-abc123 --hosts 'web-*' --wait

deploy

Deploy config to fleet hosts.

nixfleet deploy [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --release <ID> | string | | Deploy an existing release (required for rollout mode unless using --push-to / --copy) |
| --push-to <URL> | string | | Build all hosts, push to a Nix binary cache URL, and register a release implicitly (e.g., ssh://root@cache, s3://bucket) |
| --hook | bool | false | Use hook mode: push via [cache.hook] push-cmd instead of nix copy. Requires [cache.hook] in .nixfleet.toml or --hook-push-cmd |
| --hook-push-cmd <CMD> | string | | Override hook push command ({} = store path). Requires --hook |
| --hook-url <URL> | string | | Override hook cache URL for agents to pull from. Requires --hook |
| --copy | bool | false | Build all hosts, push to each target via nix-copy-closure (no binary cache needed), and register a release implicitly |
| --hosts <PATTERN> | string (comma-separated or repeatable) | * | Host glob patterns. In SSH mode: hosts to deploy. In rollout mode: target machines directly (alternative to --tags) |
| --tags <TAG> | string (comma-separated or repeatable) | | Target machines by tag - filters both the release build and rollout targeting (only hosts with a matching services.nixfleet-agent.tags value are built) |
| --dry-run | bool | false | Build closures and show plan, do not push or register |
| --ssh | bool | false | SSH fallback mode: build locally, copy via SSH, run switch-to-configuration (no CP needed) |
| --target <SSH> | string | | SSH target override (e.g., root@192.168.1.10). Only valid with --ssh and a single host |
| --flake <REF> | string | . | Flake reference |
| --strategy <STRATEGY> | string | all-at-once | Rollout strategy: canary, staged, all-at-once |
| --batch-size <SIZES> | string (comma-separated) | | Batch sizes (e.g., 1,25%,100%) |
| --failure-threshold <N> | string | 0 | Max unhealthy machines per batch before pausing/reverting. Accepts absolute count or percentage (e.g., 30%) |
| --on-failure <ACTION> | string | pause | Action on batch failure: pause (stop and wait for rollout resume) or revert (roll back to previous generation) |
| --health-timeout <SECS> | u64 | 300 | Seconds to wait for health reports per batch |
| --wait | bool | false | Stream rollout progress; exits non-zero if rollout pauses or fails |
| --wait-timeout <SECS> | u64 | 300 | Timeout in seconds for --wait (0 = wait forever) |
| --cache-url <URL> | string | | Binary cache URL for agents to fetch closures from (overrides the release’s cache_url) |

Modes:

  • SSH mode (--ssh): Builds locally, copies closures via SSH, activates on target. No control plane required. Platform-aware: NixOS hosts use switch-to-configuration switch, Darwin hosts use nix-env --set + activate (auto-detected from the host’s platform).

Note: --ssh deploys directly via nix-copy-closure and activation, bypassing the control plane entirely. Lifecycle state is not checked - a machine in maintenance will still receive the deploy. Use --ssh as an emergency escape hatch when the CP is unavailable, not as a routine deployment method.

Darwin SSH deploy requirements: SSH deploy to Darwin hosts connects as $USER@host (not root@ - macOS disables root SSH login). This requires:

  1. Username match: The operator’s local username must exist on the Darwin target with SSH key access. Override with --target user@host for single-host deploys if usernames differ.
  2. Passwordless sudo: Activation requires root. The target must allow passwordless sudo for nix-env and the activation script:
    # nix-darwin: security.sudo.extraConfig
    s33d ALL=(root) NOPASSWD: /nix/var/nix/profiles/default/bin/nix-env *
    s33d ALL=(root) NOPASSWD: /nix/store/*/activate
    
  3. SSH key access: The operator’s SSH public key must be in the target user’s authorized keys.

For production mixed-fleet deploys, prefer the CP rollout path - the agent runs as root (launchd daemon), pulls from cache, and activates locally with no SSH user/sudo requirements.

  • Rollout mode (requires a release): Creates a rollout on the control plane with the specified strategy. Specify an existing release with --release <ID>, or use --push-to <url> / --hook / --copy to build + push + register implicitly in one command.
  • Hook mode (--hook): Uses [cache.hook] push-cmd from .nixfleet.toml to push closures (e.g., attic push mycache {}). Overrides --push-to and uses [cache.hook] url as the cache URL for agents. Flags --hook-push-cmd and --hook-url override the config values.
  • Targeting: Use --tags <TAG> or --hosts <pattern> to select machines. Both are intersected with the release’s host list (machines not in the release are skipped with a warning).

init

Create a .nixfleet.toml config file in the current directory. Run this once per fleet repo to set the connection and deploy defaults.

nixfleet init [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --control-plane-url <URL> | string | (required) | Control plane URL |
| --ca-cert <PATH> | string | | CA certificate path (relative to config file or absolute) |
| --client-cert <PATH> | string | | Client certificate path (supports ${HOSTNAME} expansion) |
| --client-key <PATH> | string | | Client key path (supports ${HOSTNAME} expansion) |
| --cache-url <URL> | string | | Default binary cache URL for agents |
| --push-to <URL> | string | | Default push destination for release create |
| --hook-url <URL> | string | | Hook mode cache URL (e.g., http://cache:8081/mycache for Attic) |
| --hook-push-cmd <CMD> | string | | Hook mode push command ({} = store path, e.g., attic push mycache {}) |
| --strategy <STRATEGY> | string | | Default deploy strategy (canary, staged, all-at-once) |
| --on-failure <ACTION> | string | | Default deploy failure action (pause, revert) |

After init, run nixfleet bootstrap to create and auto-save the first admin API key.


release create

Build host closures, distribute them, and register a release manifest in the control plane. A release is an immutable mapping of hostnames to built store paths that subsequent rollouts can target.

nixfleet release create [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --flake <REF> | string | . | Flake reference |
| --hosts <PATTERN> | string | * | Host glob pattern or comma-separated list |
| --push-to <URL> | string | | Push closures to this Nix cache URL via nix copy --to (e.g., ssh://root@cache, s3://bucket) |
| --hook | bool | false | Use hook mode: push via [cache.hook] push-cmd instead of nix copy |
| --hook-push-cmd <CMD> | string | | Override hook push command ({} = store path). Requires --hook |
| --hook-url <URL> | string | | Override hook cache URL. Requires --hook |
| --copy | bool | false | Push closures directly to each target host via nix-copy-closure (no binary cache) |
| --cache-url <URL> | string | | Override the cache URL recorded in the release (defaults to --push-to URL, or config file) |
| --eval-only | bool | false | Evaluate config.system.build.toplevel.outPath without building. Assumes closures are already in the cache (e.g., CI-built). Incompatible with --push-to, --hook, --copy |
| --dry-run | bool | false | Build and show the manifest without pushing or registering |
| --allow-dirty | bool | false | Skip the dirty working tree check |

Output prints the release ID, host count, and per-host store paths. Use the ID with nixfleet deploy --release <ID>.


release list

List recent releases.

nixfleet release list [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --limit <N> | u32 | 20 | Number of releases to show (newest first) |
| --host <HOSTNAME> | string | | Filter releases to those containing entries for this hostname |

release show

Show a release’s full metadata and per-host entries.

nixfleet release show <ID>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Release ID |

release diff

Diff two releases: added hosts, removed hosts, changed store paths, unchanged.

nixfleet release diff <ID_A> <ID_B>

| Argument | Type | Description |
|---|---|---|
| <ID_A> | string | First release ID |
| <ID_B> | string | Second release ID |

release delete

Delete a release. Fails with exit code 1 if the release is still referenced by a rollout - the control plane returns 409 in that case to prevent breaking rollout history.

nixfleet release delete <RELEASE_ID>

| Argument | Type | Description |
|---|---|---|
| <RELEASE_ID> | string | ID of the release to delete |

Exit codes:

  • 0 - release deleted (CP returned 204)
  • 1 - release still referenced by a rollout (CP returned 409), release not found (CP returned 404), or another non-2xx status

status

Show fleet status from the control plane.

nixfleet status [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --stale-threshold <SECS> | u64 | 600 | Seconds without a report before a machine is marked stale |
| --watch | bool | false | Continuously refresh the display (clears screen, Ctrl+C to exit). Incompatible with --json |
| --interval <SECS> | u64 | 2 | Refresh interval in seconds (requires --watch) |

Outputs a table of all machines. Pass --json (global flag) for structured JSON output.


rollback

Rollback a single machine to a previous generation via SSH. Activates the previous generation directly on the target, then notifies the control plane so desired generation stays in sync.

nixfleet rollback --host <HOST> --ssh [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --host <HOST> | string | (required) | Target host name |
| --generation <PATH> | string | | Store path to roll back to (default: previous generation from system-1-link) |
| --target | string | | SSH target override (e.g., root@192.168.1.10) |
| --darwin | bool | false | Target is a Darwin (macOS) host - uses $USER@host, sudo activate instead of switch-to-configuration |

Rollback always operates via SSH. The --ssh flag is accepted for backwards compatibility but hidden from --help. For CP-driven rollback, use --on-failure revert on rollouts, or deploy an older release. After a successful rollback, the CP is notified (best-effort) so nixfleet status shows the machine in sync.

Darwin rollback: Use --darwin for macOS hosts. This runs nix-env --set + activate instead of switch-to-configuration:

nixfleet rollback --host aether --ssh --darwin

host add

Scaffold a new host.

nixfleet host add --hostname <NAME> [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --hostname <NAME> | string | (required) | Host name for the new machine |
| --org <ORG> | string | my-org | Organization name |
| --role <ROLE> | string | workstation | Host role (workstation, server, edge, kiosk) |
| --platform <PLATFORM> | string | x86_64-linux | Target platform |
| --target <SSH> | string | | SSH target to fetch hardware config (e.g., root@192.168.1.42) |

rollout list

List rollouts.

nixfleet rollout list [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --status <STATUS> | string | | Filter by status (e.g., running, paused, completed) |
| --sort <FIELD> | string | created | Sort by: created (newest first), status, strategy |

rollout status

Show rollout detail with batch breakdown.

nixfleet rollout status <ID> [FLAGS]

| Argument/Flag | Type | Default | Description |
|---|---|---|---|
| <ID> | string | | Rollout ID |
| --wait | bool | false | Block until rollout completes, fails, is cancelled, or pauses. Exits non-zero on failure or pause |
| --wait-timeout <SECS> | u64 | 300 | Timeout in seconds for --wait (0 = wait forever) |
| --watch | bool | false | Continuously refresh the display (clears screen, Ctrl+C to exit). Incompatible with --wait and --json |
| --interval <SECS> | u64 | 2 | Refresh interval in seconds (requires --watch) |

rollout resume

Resume a paused rollout.

nixfleet rollout resume <ID>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Rollout ID |

rollout cancel

Cancel a rollout.

nixfleet rollout cancel <ID>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Rollout ID |

bootstrap

Create the first admin API key. Only works when no keys exist in the control plane.

nixfleet bootstrap [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --name <NAME> | string | admin | Name for the admin key |
| --save-key <KEY> | string | | Save an existing API key without calling the CP (for setting up additional machines) |

Output: Human-friendly info to stderr, raw key to stdout. Scriptable:

API_KEY=$(nixfleet bootstrap)

Returns exit code 1 with an error message if keys already exist (409).

Note: No --api-key needed (chicken-and-egg). mTLS is still required when the CP has --client-ca set.

Multi-machine setup: Bootstrap once on your primary machine, then use --save-key on additional machines to share the same API key without re-bootstrapping:

# On the primary machine:
nixfleet bootstrap

# On additional machines (same fleet):
nixfleet bootstrap --save-key nfk-abc123...

completions

Generate a shell completion script.

nixfleet completions <SHELL>

| Argument | Type | Description |
|---|---|---|
| <SHELL> | string | Target shell: zsh, bash, or fish |

Source the output in your shell profile:

# zsh
nixfleet completions zsh > ~/.zsh/completions/_nixfleet

# bash
nixfleet completions bash > /etc/bash_completion.d/nixfleet

# fish
nixfleet completions fish > ~/.config/fish/completions/nixfleet.fish

machines register

Register a machine with the control plane (admin endpoint).

nixfleet machines register <ID> [FLAGS]

| Argument/Flag | Type | Description |
|---|---|---|
| <ID> | string | Machine ID |
| --tags <TAG> | string (comma-separated or repeatable) | Initial tags |

Agents auto-register on first health report, so manual registration is optional. Use this to pre-register machines before they come online.


machines list

List machines.

nixfleet machines list [FLAGS]

| Flag | Type | Default | Description |
|---|---|---|---|
| --tags <TAG> | string (comma-separated or repeatable) | | Filter by tags (machines matching any listed tag are shown) |
| --watch | bool | false | Refresh the list on an interval (clears screen, Ctrl+C to exit). Incompatible with --json |
| --interval <SECS> | u64 | 2 | Refresh interval in seconds (requires --watch) |

machines set-lifecycle

Change a machine’s lifecycle state.

nixfleet machines set-lifecycle <ID> <STATE>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Machine ID |
| <STATE> | string | Lifecycle state: active, pending, provisioning, maintenance, decommissioned |

Only active machines participate in rollouts. Machines in maintenance or decommissioned state are excluded even when explicitly targeted by hostname. Use maintenance to temporarily remove a machine from fleet operations without deregistering it.


machines clear-desired

Clear a machine’s stale desired generation. Use this when an agent is stuck polling for a generation that will never be fulfilled (e.g., after a cancelled rollout).

nixfleet machines clear-desired <ID>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Machine ID |

Exit codes:

  • 0 - desired generation cleared (CP returned 204)
  • 1 - machine not found (CP returned 404), or another non-2xx status

machines notify-deploy

Notify the control plane of an out-of-band deploy (e.g. SSH). Sets the machine’s desired generation to the deployed store path so nixfleet status shows the machine in sync once the agent confirms.

Called automatically by deploy --ssh after a successful switch. Also available manually for other out-of-band deploy workflows.

nixfleet machines notify-deploy <ID> <STORE_PATH>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Machine ID |
| <STORE_PATH> | string | Store path that was deployed |

Requires deploy or admin role.


rollout delete

Delete a terminal rollout (completed, cancelled, or failed). The control plane rejects deletion of active rollouts with 409.

nixfleet rollout delete <ID>

| Argument | Type | Description |
|---|---|---|
| <ID> | string | Rollout ID |

Exit codes:

  • 0 - rollout deleted (CP returned 204)
  • 1 - rollout is still active (CP returned 409), rollout not found (CP returned 404), or another non-2xx status

Operation logs

All CLI operations (deploy, release create, rollout commands) write persistent logs to:

~/.local/state/nixfleet/logs/

Each operation creates a JSONL file with timestamped entries covering subprocess invocations (command, stdout, stderr, exit code), tracing events, and host context. Logs are written regardless of verbosity level.


.nixfleet.toml format

Committed to the fleet repo root. Discovered by walking up from the CLI’s current working directory. All fields optional - CLI flags and environment variables always override.

[control-plane]
url = "https://cp.example.com:8080"
ca-cert = "modules/_config/fleet-ca.pem"    # relative to config file location

[tls]
client-cert = "/run/agenix/agent-${HOSTNAME}-cert"
client-key = "/run/agenix/agent-${HOSTNAME}-key"

[cache]
url = "http://cache.example.com:5000"          # default --cache-url for rollouts
push-to = "ssh://root@cache.example.com"       # default --push-to for release create

[cache.hook]                                    # used when --hook is passed
url = "http://cache.example.com:8081/mycache"   # overrides cache.url for the release
push-cmd = "attic push mycache {}"              # {} is replaced with the store path

[deploy]
strategy = "staged"             # default rollout strategy
health-timeout = 300            # default health timeout in seconds
failure-threshold = "0"
on-failure = "pause"

Environment variable expansion: values support ${VAR} expansion. ${HOSTNAME} and ${HOST} fall back to the gethostname() syscall if not set in the environment (so they work from zsh where $HOST is a shell builtin, not exported). This lets the same .nixfleet.toml work across every fleet host when agent cert paths follow a per-hostname convention.

Relative paths (like ca-cert = "modules/_config/fleet-ca.pem") are resolved relative to the .nixfleet.toml location, not the CLI’s working directory.

~/.config/nixfleet/credentials.toml format

User-level, mode 600, not checked into any repo. Written automatically by nixfleet bootstrap and keyed by CP URL to support multiple clusters.

["https://cp.example.com:8080"]
api-key = "nfk-73c713cc..."

["https://cp-staging.example.com:8080"]
api-key = "nfk-abc..."

On impermanent NixOS hosts, add .config/nixfleet to home-manager persistence so the credentials file survives reboots.
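With the home-manager impermanence module, that persistence entry could look like the following sketch (the persistence root "/persist/home/deploy" and the user name are illustrative):

```nix
# Sketch: keep the CLI credentials file across reboots on an impermanent host.
# The persistence root and username are illustrative - match your fleet's layout.
home.persistence."/persist/home/deploy".directories = [
  ".config/nixfleet"
];
```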

mkHost API

Parameters

nixfleet.lib.mkHost {
  hostName    = "myhost";
  platform    = "x86_64-linux";
  stateVersion = "24.11";       # optional
  hostSpec    = { ... };         # optional
  modules     = [ ... ];         # optional
  isVm        = false;           # optional
}
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| hostName | string | yes | – | Machine hostname. Forced into hostSpec.hostName (not overridable). |
| platform | string | yes | – | Target platform: x86_64-linux, aarch64-linux, aarch64-darwin, x86_64-darwin. |
| stateVersion | string | no | "24.11" | NixOS state version (set with lib.mkDefault). Not used for Darwin - consumers set it in their host modules. |
| hostSpec | attrset | no | {} | Host configuration flags. Values are set with lib.mkDefault (overridable by modules). hostName is always forced to match the parameter. |
| modules | list | no | [] | Additional NixOS or Darwin modules appended after framework modules. |
| isVm | bool | no | false | Inject QEMU VM hardware config (virtio disk, SPICE, DHCP, software GL). NixOS only. |

Return type

  • Linux platforms (x86_64-linux, aarch64-linux): Returns the result of nixpkgs.lib.nixosSystem.
  • Darwin platforms (aarch64-darwin, x86_64-darwin): Returns the result of darwin.lib.darwinSystem.

Platform detection is automatic based on platform.

Injected modules

mkHost injects framework modules before user-provided modules. These are mechanism-only - no opinions about packages, services, or user environment.

NixOS (Linux)

  1. system.stateVersion (mkDefault)
  2. nixpkgs.hostPlatform set to platform
  3. hostSpec module (option declarations)
  4. hostSpec values set with lib.mkDefault (overridable by consumer modules)
  5. hostSpec.hostName forced to match the hostName parameter
  6. Impermanence scope from nixfleet-scopes (declares options only - inert unless nixfleet.impermanence.enable = true)
  7. Core NixOS module (_nixos.nix)
  8. Agent service module (disabled by default)
  9. Control plane service module (disabled by default)
  10. Cache server service module (disabled by default)
  11. Cache client module (disabled by default)
  12. MicroVM host module (disabled by default)
  13. User-provided modules

When isVm = true, additionally injects:

  • QEMU disk config and hardware configuration
  • SPICE agent (services.spice-vdagentd.enable)
  • Forced DHCP (networking.useDHCP = lib.mkForce true)
  • Software GL (LIBGL_ALWAYS_SOFTWARE, mesa)

Why impermanence is auto-imported: NixFleet’s internal service modules (agent, control-plane, microvm-host) conditionally contribute to environment.persistence. The NixOS module system validates option paths even inside lib.mkIf false, so the impermanence scope must be present to declare those options. The scope is inert (zero cost) until explicitly enabled.
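A minimal sketch of why the declaration must be present: the module system resolves the option path even when the mkIf condition is false, so a disabled contribution like this one still requires environment.persistence to be declared somewhere.

```nix
# Sketch: even with the condition false, the attribute path
# environment.persistence is checked against declared options -
# without the impermanence scope, evaluation fails with
# "option ... does not exist".
{ lib, config, ... }: {
  config = lib.mkIf config.services.nixfleet-agent.enable {
    environment.persistence."/persist".directories = [ "/var/lib/nixfleet" ];
  };
}
```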

Darwin (macOS)

  1. nixpkgs.hostPlatform set to platform
  2. hostSpec module (option declarations)
  3. hostSpec values set with lib.mkDefault (overridable by consumer modules)
  4. hostSpec.hostName forced to match the hostName parameter
  5. hostSpec.isDarwin = true
  6. Core Darwin module (_darwin.nix)
  7. Agent Darwin module (disabled by default)
  8. User-provided modules

NOT auto-included

These are consumer responsibilities - import them via roles or explicitly in modules:

  • disko - disk partitioning (import from nixfleet-scopes or use diskoTemplates)
  • base scope - opinionated system defaults (import from nixfleet-scopes)
  • home-manager - user environment management (import from nixfleet-scopes)
  • operators scope - multi-user inventory (import from nixfleet-scopes)
  • All other infrastructure scopes - firewall, secrets, backup, monitoring, etc.

The typical pattern is to import a role, which bundles the relevant scopes:

modules = [
  inputs.nixfleet.scopes.roles.workstation  # includes base, HM, operators, etc.
  ./hardware-configuration.nix
];

Framework inputs

Framework inputs are passed via specialArgs = {inherit inputs;}. Modules can access them as the inputs argument. These are NixFleet’s own inputs (nixpkgs, home-manager, disko, impermanence, etc.), not fleet-level inputs.

Home Manager

Home Manager is a scope from nixfleet-scopes. It is not auto-injected by mkHost.

Import it via a role (workstation and endpoint roles include it) or manually:

modules = [
  nixfleet.scopes.home-manager
  { nixfleet.home-manager.enable = true; }
];

The scope fans out profileImports to all operators with homeManager.enable = true.
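Putting the pieces together, a sketch of the fan-out (the profileImports location under nixfleet.home-manager and the per-operator homeManager.enable path are assumptions inferred from the description above):

```nix
# Sketch - option paths are assumptions based on the text above.
{
  nixfleet.home-manager = {
    enable = true;
    profileImports = [ ./home/common.nix ];  # applied to every opted-in operator
  };
  nixfleet.operators.users.deploy.homeManager.enable = true;
}
```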

Scope re-exports

NixFleet re-exports nixfleet-scopes so consumers do not need a separate flake input:

# These are equivalent:
inputs.nixfleet-scopes.scopes.roles.workstation
inputs.nixfleet.scopes.roles.workstation

Available under inputs.nixfleet.scopes:

  • scopes.roles.* - workstation, server, endpoint, microvm-guest
  • scopes.base - opinionated system defaults
  • scopes.home-manager - HM integration
  • scopes.impermanence - impermanence support
  • scopes.disk-templates.* - disko disk layouts
  • All other nixfleet-scopes exports

Exports

All exports from the NixFleet flake:

| Export | Access path | Description |
|---|---|---|
| lib.mkHost | inputs.nixfleet.lib.mkHost | Host definition function |
| lib.mkVmApps | inputs.nixfleet.lib.mkVmApps | VM helper apps generator |
| nixosModules.nixfleet-core | inputs.nixfleet.nixosModules.nixfleet-core | Raw core NixOS module (without mkHost) |
| scopes | inputs.nixfleet.scopes | Re-export of nixfleet-scopes (no separate input needed) |
| diskoTemplates | inputs.nixfleet.diskoTemplates | Alias for scopes.disk-templates |
| flakeModules.apps | inputs.nixfleet.flakeModules.apps | VM lifecycle apps (for fleet repos) |
| flakeModules.tests | inputs.nixfleet.flakeModules.tests | Eval and VM test infrastructure (for fleet repos) |
| flakeModules.iso | inputs.nixfleet.flakeModules.iso | ISO builder (for fleet repos) |
| flakeModules.formatter | inputs.nixfleet.flakeModules.formatter | Treefmt config - alejandra + shfmt (for fleet repos) |
| templates.default | nix flake init -t nixfleet | Single-host template (same as standalone) |
| templates.standalone | nix flake init -t nixfleet#standalone | Single NixOS machine |
| templates.batch | nix flake init -t nixfleet#batch | Batch of identical hosts from a template |
| templates.fleet | nix flake init -t nixfleet#fleet | Multi-host fleet with flake-parts |

hostSpec Options

All options declared in the framework’s hostSpec module. Fleet repos can extend hostSpec with additional options via plain NixOS modules.

Data fields

| Option | Type | Default | Description |
|---|---|---|---|
| hostName | str | – (required) | The hostname of the host. Set automatically by mkHost. |
| userName | str | – (required) | The username of the primary user. |
| home | str | /home/<userName> (Linux) or /Users/<userName> (Darwin) | Home directory path. Computed from userName and isDarwin. |
| timeZone | str | "UTC" | IANA timezone (e.g., Europe/Paris). |
| locale | str | "en_US.UTF-8" | System locale. |
| keyboardLayout | str | "us" | XKB keyboard layout. |
| networking | attrsOf anything | {} | Attribute set of networking information (e.g., { interface = "enp3s0"; }). |
| sshAuthorizedKeys | listOf str | [] | SSH public keys added to authorized_keys for both the primary user and root. |
| secretsPath | nullOr str | null | Hint for secrets repo path. Framework-agnostic - no tool coupling. |
| hashedPasswordFile | nullOr str | null | Path to hashed password file for the primary user. When non-null, sets users.users.<userName>.hashedPasswordFile. |
| rootHashedPasswordFile | nullOr str | null | Path to hashed password file for root. When non-null, sets users.users.root.hashedPasswordFile. |

Platform flag

| Option | Type | Default | Description |
|---|---|---|---|
| isDarwin | bool | false | Darwin (macOS) host. Set automatically by mkHost for Darwin platforms. |

Note: Earlier revisions of NixFleet had isMinimal, isImpermanent, and isServer flags here. These have been removed; they are replaced by scope enable options (nixfleet.impermanence.enable, nixfleet.firewall.enable, etc.) set by roles in nixfleet-scopes.

Extending hostSpec

Fleet repos add custom flags via plain NixOS modules:

{lib, ...}: {
  options.hostSpec = {
    isDev = lib.mkOption {
      type = lib.types.bool;
      default = false;
      description = "Enable development tools.";
    };
    isGraphical = lib.mkOption {
      type = lib.types.bool;
      default = false;
      description = "Enable graphical environment.";
    };
  };
}

Include the extension module in your mkHost modules list. Framework-level hostSpec options and fleet-level extensions merge naturally through the NixOS module system.
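For instance, assuming the extension module above lives at ./modules/hostspec-extensions.nix (path illustrative):

```nix
# Sketch: wiring a hostSpec extension into a host definition.
nixfleet.lib.mkHost {
  hostName = "dev-01";
  platform = "x86_64-linux";
  hostSpec = {
    timeZone = "UTC";
    isDev = true;  # fleet-level flag declared by the extension module
  };
  modules = [
    ./modules/hostspec-extensions.nix
    ./hosts/dev-01/hardware-configuration.nix
  ];
}
```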

Agent Options

All options under services.nixfleet-agent. The module is auto-included by mkHost and disabled by default.

Top-level options

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable the NixFleet fleet management agent. |
| controlPlaneUrl | str | – (required when enabled) | URL of the NixFleet control plane. Example: "https://fleet.example.com". |
| machineId | str | config.networking.hostName | Machine identifier reported to the control plane. |
| pollInterval | int | 60 | Steady-state poll interval in seconds. The control plane may override this for individual cycles via a poll_hint field in the desired-generation response (set to 5 during active rollouts), letting the agent react to new deploys within seconds without reducing the steady-state polling rate. |
| retryInterval | int | 30 | Retry interval in seconds after a failed poll (network error, CP not ready, fetch failure, bootstrap race). Shorter than pollInterval so the agent recovers quickly from transient failures without flooding the CP. |
| cacheUrl | nullOr str | null | Global binary cache URL for fetching closures. Resolution order: (1) per-generation cache_url from the release entry; (2) this option if set; (3) if neither is set, the agent verifies the store path exists locally via nix path-info - the path must be pre-pushed out-of-band. Example: "http://cache:5000". |
| dbPath | str | "/var/lib/nixfleet/state.db" | Path to the SQLite state database. |
| dryRun | bool | false | When true, check and fetch but do not apply generations. |
| tags | listOf str | [] | Tags for grouping this machine in fleet operations. Passed via NIXFLEET_TAGS environment variable. |
| healthInterval | int | 60 | Seconds between continuous health reports to the control plane. |
| allowInsecure | bool | false | Allow insecure HTTP connections to the control plane. Development only. |
| tls.clientCert | nullOr str | null | Path to client certificate PEM file for mTLS authentication. Example: "/run/secrets/agent-cert.pem". |
| tls.clientKey | nullOr str | null | Path to client private key PEM file for mTLS authentication. Example: "/run/secrets/agent-key.pem". |
| metricsPort | nullOr port | null | Port for agent Prometheus metrics HTTP listener. Null disables metrics. |
| metricsOpenFirewall | bool | false | Open the metrics port in the firewall. Only effective when metricsPort is set. |
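A sketch combining the most common options from the table above (URLs, tags, and secret paths are illustrative):

```nix
services.nixfleet-agent = {
  enable = true;
  controlPlaneUrl = "https://fleet.example.com";
  cacheUrl = "http://cache:5000";  # fallback when a release carries no cache_url
  tags = ["web" "production"];
  tls.clientCert = "/run/secrets/agent-cert.pem";
  tls.clientKey = "/run/secrets/agent-key.pem";
};
```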

healthChecks.systemd

List of systemd unit health checks.

| Sub-option | Type | Default | Description |
|---|---|---|---|
| units | listOf str | – | Systemd units that must be active. |

Example:

services.nixfleet-agent.healthChecks.systemd = [
  { units = ["nginx.service" "postgresql.service"]; }
];

healthChecks.http

List of HTTP endpoint health checks.

| Sub-option | Type | Default | Description |
|---|---|---|---|
| url | str | – | URL to GET. |
| interval | int | 5 | Check interval in seconds. |
| timeout | int | 3 | Timeout in seconds. |
| expectedStatus | int | 200 | Expected HTTP status code. |

Example:

services.nixfleet-agent.healthChecks.http = [
  { url = "http://localhost:8080/health"; }
  { url = "https://localhost:443"; expectedStatus = 200; timeout = 5; }
];

healthChecks.command

List of custom command health checks.

| Sub-option | Type | Default | Description |
|---|---|---|---|
| name | str | – | Check name. |
| command | str | – | Shell command (exit 0 = healthy). |
| interval | int | 10 | Check interval in seconds. |
| timeout | int | 5 | Timeout in seconds. |

Example:

services.nixfleet-agent.healthChecks.command = [
  {
    name = "disk-space";
    command = "test $(df --output=pcent / | tail -1 | tr -d ' %') -lt 90";
    interval = 30;
    timeout = 5;
  }
];

Prometheus Metrics

When metricsPort is set, the agent starts a Prometheus HTTP listener on that port. Null (the default) disables the listener.

Metrics exposed:

| Metric | Description |
|---|---|
| nixfleet_agent_state | Current phase of the deploy cycle (idle, checking, fetching, applying, verifying, reporting, rolling_back) encoded as a label |
| nixfleet_agent_poll_duration_seconds | Duration of the last poll cycle |
| nixfleet_agent_last_poll_timestamp_seconds | Unix timestamp of the last completed poll |
| nixfleet_agent_health_check_duration_seconds | Duration of the last health check run |
| nixfleet_agent_health_check_status | Result of the last health check (1 = healthy, 0 = unhealthy) |
| nixfleet_agent_generation_info | Nix store path of the current active generation (as a label) |

Metrics are served in the standard Prometheus text format at GET /metrics.

Example configuration:

services.nixfleet-agent = {
  enable = true;
  controlPlaneUrl = "https://fleet.example.com";
  metricsPort = 9101;
  metricsOpenFirewall = true;
};

Systemd service

The agent runs as a privileged root systemd service:

| Setting | Value |
|---|---|
| Target | multi-user.target |
| After | network-online.target, nix-daemon.service |
| Restart | always (30s delay) |
| StateDirectory | nixfleet |
| NoNewPrivileges | true |
| PATH | ${config.nix.package}/bin:${pkgs.systemd}/bin |
| Environment | XDG_CACHE_HOME=/var/lib/nixfleet/.cache |

Hardening rationale. The agent runs switch-to-configuration as a subprocess, which needs full system access (/dev, /home, cgroups, kernel modules). Sandboxing (e.g. PrivateDevices, ProtectHome) would break these operations. The threat model is equivalent to sudo nixos-rebuild switch as a daemon. NoNewPrivileges = true is kept to prevent setuid escalation.

  • nix is in PATH for nix copy and nix path-info.
  • XDG_CACHE_HOME points into the state directory so nix metadata cache persists on impermanent hosts.

Health check configuration is written to /etc/nixfleet/health-checks.json and passed via --health-config.

On impermanent hosts, /var/lib/nixfleet is automatically persisted (including the XDG cache subdirectory).

Control Plane Options

All options under services.nixfleet-control-plane. The module is auto-included by mkHost and disabled by default.

Options

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable the NixFleet control plane server. |
| listen | str | "0.0.0.0:8080" | Address and port to listen on. |
| dbPath | str | "/var/lib/nixfleet-cp/state.db" | Path to the SQLite state database. |
| openFirewall | bool | false | Open the control plane port in the firewall. The port is parsed from the listen value. |
| tls.cert | nullOr str | null | Path to TLS certificate PEM file. Enables HTTPS when set (requires tls.key). Example: "/run/secrets/cp-cert.pem". |
| tls.key | nullOr str | null | Path to TLS private key PEM file. Example: "/run/secrets/cp-key.pem". |
| tls.clientCa | nullOr str | null | Path to client CA PEM file. When set, all TLS connections must present a valid client certificate signed by this CA (required mTLS). Admin clients must present both a client cert and an API key. Example: "/run/secrets/fleet-ca.pem". |
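A sketch of the mTLS variant (certificate paths and port are illustrative):

```nix
services.nixfleet-control-plane = {
  enable = true;
  listen = "0.0.0.0:8443";
  openFirewall = true;
  tls = {
    cert = "/run/secrets/cp-cert.pem";
    key = "/run/secrets/cp-key.pem";
    clientCa = "/run/secrets/fleet-ca.pem";  # require client certificates (mTLS)
  };
};
```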

Prometheus Metrics

The control plane exposes a GET /metrics endpoint on its listen address. No separate port or additional configuration is required - the endpoint is always available when the service is running.

No authentication is required for /metrics (same as /health). Restrict access at the network level if needed.

Metrics exposed:

| Metric | Description |
|---|---|
| nixfleet_fleet_size | Total number of registered machines |
| nixfleet_machines_by_lifecycle | Machine count grouped by lifecycle state (label: lifecycle) |
| nixfleet_machine_last_seen_timestamp_seconds | Unix timestamp of each machine’s last report (label: machine_id) |
| nixfleet_http_requests_total | HTTP request count by method, path, and status code |
| nixfleet_http_request_duration_seconds | HTTP request latency histogram |
| nixfleet_rollouts_total | Rollout count by status (label: status) |
| nixfleet_rollouts_active | Number of currently active rollouts (created, running, or paused) |

Example:

curl http://localhost:8080/metrics

Systemd service

| Setting | Value |
|---|---|
| Target | multi-user.target |
| After | network-online.target |
| Restart | always (10s delay) |
| StateDirectory | nixfleet-cp |
| NoNewPrivileges | true |
| ProtectHome | true |
| PrivateTmp | true |
| PrivateDevices | true |
| ProtectKernelTunables | true |
| ProtectKernelModules | true |
| ProtectControlGroups | true |
| ReadWritePaths | /var/lib/nixfleet-cp |

Example

services.nixfleet-control-plane = {
  enable = true;
  listen = "0.0.0.0:8080";
  openFirewall = true;
};

On impermanent hosts, /var/lib/nixfleet-cp is automatically persisted.

Secrets Options

This module is provided by nixfleet-scopes. It is documented here as part of the NixFleet ecosystem reference.

All options under nixfleet.secrets. The module is auto-included by mkHost and disabled by default. Enable with nixfleet.secrets.enable = true.

Options

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable NixFleet secrets wiring (identity paths, persist, boot ordering). |
| identityPaths.hostKey | nullOr str | "/etc/ssh/ssh_host_ed25519_key" | Primary decryption identity (host SSH key). Used on all hosts. |
| identityPaths.userKey | nullOr str | "<home>/.keys/id_ed25519" | Fallback decryption identity (user key). Used on workstations only. Computed from hostSpec.home. |
| identityPaths.enableUserKey | bool | true | Include user key in resolved paths. The server role overrides this to false. |
| identityPaths.extra | listOf str | [] | Additional identity paths appended to the resolved list. |
| resolvedIdentityPaths | listOf str | (computed) | Read-only. Computed identity paths. Fleet modules pass this to agenix/sops. |

resolvedIdentityPaths computation

The computed list is:

  1. hostKey (if non-null)
  2. userKey (if enableUserKey is true and userKey is non-null)
  3. Each entry in extra

resolvedIdentityPaths is always computed, even when the scope is disabled, so fleet modules can read it without requiring nixfleet.secrets.enable.

Systemd service

When enable = true and identityPaths.hostKey is non-null:

| Setting | Value |
|---|---|
| Unit | nixfleet-host-key-check.service |
| Type | oneshot |
| WantedBy | multi-user.target |
| Before | sshd.service |
| Condition | ConditionPathExists = !<hostKey> (runs only if key is missing) |
| Action | Generates ed25519 SSH key at identityPaths.hostKey |

A non-fatal activation script (nixfleet-secrets-check) warns at activation if any identity path is missing.

Impermanence

On impermanent hosts (nixfleet.impermanence.enable = true), the scope automatically adds to environment.persistence."/persist":

  • files: hostKey and hostKey.pub
  • directories: parent directory of userKey (when enableUserKey is true)

Example

{config, ...}: {
  nixfleet.secrets = {
    enable = true;
    # Defaults are sufficient for most hosts.
    # Servers: resolvedIdentityPaths = ["/etc/ssh/ssh_host_ed25519_key"]
    # Workstations: resolvedIdentityPaths = ["/etc/ssh/ssh_host_ed25519_key" "~/.keys/id_ed25519"]
  };

  # Wire to agenix
  age.identityPaths = config.nixfleet.secrets.resolvedIdentityPaths;
}

To add a hardware security key as an extra identity:

nixfleet.secrets.identityPaths.extra = ["/run/user/1000/gnupg/S.gpg-agent.ssh"];

Backup Options

This module is provided by nixfleet-scopes. It is documented here as part of the NixFleet ecosystem reference.

All options under nixfleet.backup. The module is auto-included by mkHost and disabled by default. Enable with nixfleet.backup.enable = true.

The backup scope is backend-agnostic. It creates the systemd timer and service skeleton. Set backend to "restic" or "borgbackup" to use a built-in backend, or set systemd.services.nixfleet-backup.serviceConfig.ExecStart directly to use any other tool.

Options

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable NixFleet backup scaffolding (timer, health, persistence). |
| backend | nullOr enum ["restic" "borgbackup"] | null | Backup backend. Null = fleet sets ExecStart manually. |
| paths | listOf str | ["/persist"] | Directories to back up. |
| exclude | listOf str | ["/persist/nix" "*.cache"] | Patterns to exclude from backup. |
| schedule | str | "daily" | Systemd calendar expression (e.g., daily, weekly, *-*-* 02:00:00). |
| retention.daily | int | 7 | Number of daily snapshots to keep. |
| retention.weekly | int | 4 | Number of weekly snapshots to keep. |
| retention.monthly | int | 6 | Number of monthly snapshots to keep. |
| healthCheck.onSuccess | nullOr str | null | URL to GET on successful backup (e.g., Healthchecks.io ping URL). |
| healthCheck.onFailure | nullOr str | null | URL to GET on backup failure. |
| preHook | lines | "" | Shell commands to run before backup. |
| postHook | lines | "" | Shell commands to run after successful backup. |
| stateDirectory | str | "/var/lib/nixfleet-backup" | Directory for backup state/cache. Persisted on impermanent hosts. |

restic backend options

Active when backend = "restic". The restic package is added to environment.systemPackages automatically.

| Option | Type | Default | Description |
|---|---|---|---|
| restic.repository | str | "" | Restic repository URL or path. Example: "/mnt/backup/restic". |
| restic.passwordFile | str | "" | Path to file containing the repository password. Example: "/run/secrets/restic-password". |

borgbackup backend options

Active when backend = "borgbackup". The borgbackup package is added to environment.systemPackages automatically.

| Option | Type | Default | Description |
|---|---|---|---|
| borgbackup.repository | str | "" | Borg repository path or ssh://user@host/path. |
| borgbackup.passphraseFile | nullOr str | null | Path to file containing the repository passphrase. Null = repokey without passphrase. |
| borgbackup.encryption | str | "repokey" | Borg encryption mode (repokey, repokey-blake2, none, etc.). |

Systemd timer

| Setting | Value |
|---|---|
| Unit | nixfleet-backup.timer |
| WantedBy | timers.target |
| OnCalendar | value of schedule |
| Persistent | true (catch up on missed runs) |
| RandomizedDelaySec | 1h (stagger across fleet) |

Systemd service

| Setting | Value |
|---|---|
| Unit | nixfleet-backup.service |
| Type | oneshot |
| After | network-online.target |
| Wants | network-online.target |
| StateDirectory | nixfleet-backup |
| ExecStart | (set by fleet module) |

After a successful backup run, the service writes status.json to stateDirectory:

{"lastRun": "2025-01-15T02:00:00+00:00", "status": "success", "hostname": "web-01"}

When healthCheck.onFailure is set, a companion nixfleet-backup-failure.service is registered as the OnFailure handler.

Impermanence

On impermanent hosts (nixfleet.impermanence.enable = true), the scope automatically persists stateDirectory.

Example - restic (built-in backend)

nixfleet.backup = {
  enable = true;
  backend = "restic";
  paths = ["/persist/home" "/persist/var/lib"];
  schedule = "*-*-* 03:00:00";
  retention = { daily = 7; weekly = 4; monthly = 3; };
  healthCheck.onSuccess = "https://hc-ping.com/your-uuid-here";
  restic = {
    repository = "s3:s3.amazonaws.com/my-bucket/backups";
    passwordFile = "/run/secrets/restic-password";
  };
};

Example - borgbackup (built-in backend)

nixfleet.backup = {
  enable = true;
  backend = "borgbackup";
  paths = ["/persist/home" "/persist/var/lib"];
  schedule = "weekly";
  retention = { daily = 7; weekly = 4; monthly = 6; };
  borgbackup = {
    repository = "ssh://backup-user@backup-host/var/backups/myhost";
    passphraseFile = "/run/secrets/borg-passphrase";
    encryption = "repokey-blake2";
  };
};

Example - custom backend (manual ExecStart)

{config, pkgs, ...}: {
  nixfleet.backup = {
    enable = true;
    paths = ["/persist/home" "/persist/var/lib"];
    schedule = "*-*-* 03:00:00";
    retention = { daily = 7; weekly = 4; monthly = 3; };
    healthCheck.onSuccess = "https://hc-ping.com/your-uuid-here";
  };

  # restic's `backup` subcommand has no --forget flag; retention is applied
  # with a separate `restic forget` run, so wrap both steps in one script.
  systemd.services.nixfleet-backup.serviceConfig.ExecStart = let
    cfg = config.nixfleet.backup;
    resticCmd = "${pkgs.restic}/bin/restic";
    repo = "s3:s3.amazonaws.com/my-bucket/backups";
  in pkgs.writeShellScript "nixfleet-backup-restic" ''
    ${resticCmd} -r ${repo} backup \
      ${builtins.concatStringsSep " " cfg.paths} \
      ${builtins.concatStringsSep " " (map (p: "--exclude=${p}") cfg.exclude)}
    ${resticCmd} -r ${repo} forget --prune \
      --keep-daily ${toString cfg.retention.daily} \
      --keep-weekly ${toString cfg.retention.weekly} \
      --keep-monthly ${toString cfg.retention.monthly}
  '';
}

Cache Options

Options for services.nixfleet-cache-server and services.nixfleet-cache. Both modules are auto-included by mkHost and disabled by default.

The cache server uses harmonia, which serves paths directly from the local Nix store over HTTP. No separate storage backend, database, or push protocol is needed.

services.nixfleet-cache-server

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable the NixFleet binary cache server (harmonia). |
| port | port | 5000 | Port to listen on. |
| openFirewall | bool | false | Open the cache server port in the firewall. |
| signingKeyFile | str | - (required) | Path to the Nix signing key file for on-the-fly signing. Must be readable by the harmonia user (set age.secrets.<name>.owner = "harmonia" with agenix, or sops.secrets.<name>.owner = "harmonia" with sops-nix). Example: "/run/secrets/cache-signing-key". |

services.nixfleet-cache

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable the NixFleet binary cache client. |
| cacheUrl | str | - (required) | URL of the binary cache server. Example: "https://cache.fleet.example.com". |
| publicKey | str | - (required) | Cache signing public key in name:base64 format. Example: "cache.fleet.example.com:AAAA...=". |
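A sketch pairing the two modules (hostnames and the truncated key are illustrative):

```nix
# On the cache host:
services.nixfleet-cache-server = {
  enable = true;
  openFirewall = true;
  signingKeyFile = "/run/secrets/cache-signing-key";
};

# On every other host:
services.nixfleet-cache = {
  enable = true;
  cacheUrl = "http://cache.fleet.example.com:5000";
  publicKey = "cache.fleet.example.com:AAAA...=";
};
```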

Systemd service (server)

| Setting | Value |
|---|---|
| Unit | nixfleet-cache-server.service |
| WantedBy | multi-user.target |
| After | network-online.target, nix-daemon.service |
| Restart | always (10s delay) |
| NoNewPrivileges | true |
| ProtectHome | true |
| PrivateTmp | true |
| PrivateDevices | true |
| ProtectKernelTunables | true |
| ProtectKernelModules | true |
| ProtectControlGroups | true |

Harmonia is stateless - it serves directly from the local Nix store. No state directory or persistence configuration is needed.

Using a different cache backend

Fleet repos that need Attic, Cachix, or another cache backend can add them as their own flake input and configure them via plain NixOS modules. The --push-hook CLI flag supports custom push commands for any backend.

MicroVM Host Options

All options under services.nixfleet-microvm-host. The module is auto-included by mkHost and disabled by default. Enable with services.nixfleet-microvm-host.enable = true.

The module imports the upstream microvm.nixosModules.host module. MicroVMs themselves are defined via the standard microvm.vms option from the microvm.nix framework; this module only provides the bridge networking, DHCP, and NAT infrastructure for the host.

Options

| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable the NixFleet MicroVM host. |
| bridge.name | str | "nixfleet-br0" | Bridge interface name for microVM networking. |
| bridge.address | str | "10.42.0.1/24" | Bridge IP address with CIDR prefix. |
| dhcp.enable | bool | true | Run a dnsmasq DHCP server on the bridge. |
| dhcp.range | str | "10.42.0.10,10.42.0.254,1h" | DHCP range in dnsmasq format (start,end,lease-time). |

What the module configures

When enabled, the module:

  • Creates a systemd-networkd bridge interface (bridge.name) with the given IP address.
  • Enables net.ipv4.ip_forward for NAT.
  • Configures networking.nat with the bridge as an internal interface so microVMs can reach the outside.
  • Optionally starts dnsmasq on the bridge with the configured DHCP range and the bridge IP as the default router.

Impermanence

On impermanent hosts (nixfleet.impermanence.enable = true), the module automatically persists /var/lib/microvms across reboots.

Example

services.nixfleet-microvm-host = {
  enable = true;
  bridge.address = "10.42.0.1/24";
  dhcp.range = "10.42.0.10,10.42.0.100,12h";
};

# Define a microVM using the upstream microvm.nix API
microvm.vms.my-vm = {
  config = { ... };
};

Monitoring Options

This module is provided by nixfleet-scopes. It is documented here as part of the NixFleet ecosystem reference.

All options under nixfleet.monitoring.nodeExporter. The module is auto-included by mkHost and disabled by default. Enable with nixfleet.monitoring.nodeExporter.enable = true.

Options

| Option | Type | Default | Description |
|---|---|---|---|
| nodeExporter.enable | bool | false | Enable Prometheus node exporter with fleet-tuned defaults. |
| nodeExporter.port | port | 9100 | Port for node exporter metrics endpoint. |
| nodeExporter.openFirewall | bool | false | Open the node exporter port in the firewall. |
| nodeExporter.enabledCollectors | listOf str | (see below) | Collectors to enable. Fleet repos can override. |
| nodeExporter.disabledCollectors | listOf str | (see below) | Collectors to disable. |

Default enabled collectors

| Collector | Metrics |
|---|---|
| systemd | Systemd unit state and timing |
| filesystem | Disk usage per mountpoint |
| cpu | CPU utilization |
| meminfo | Memory usage |
| netdev | Network interface statistics |
| diskstats | Disk I/O statistics |
| loadavg | System load averages |
| pressure | Linux PSI (pressure stall information) |
| time | System time and NTP sync status |

Default disabled collectors

| Collector | Reason |
|---|---|
| textfile | Requires external file management - opt-in per host |
| wifi | Irrelevant on servers |
| infiniband | Not used in typical fleets |
| nfs | Not used in typical fleets |
| zfs | Framework uses btrfs |

Systemd service

The scope delegates to NixOS’s services.prometheus.exporters.node module. The resulting service is prometheus-node-exporter.service.

Example

nixfleet.monitoring.nodeExporter = {
  enable = true;
  port = 9100;
  openFirewall = true;  # allow Prometheus scrape from monitoring host
};

To add a collector not in the default set, extend the option’s declared default (defining the option in terms of its own config value would cause infinite recursion):

{options, ...}: {
  nixfleet.monitoring.nodeExporter.enabledCollectors =
    options.nixfleet.monitoring.nodeExporter.enabledCollectors.default ++ ["textfile"];
}

Fleet repos that use a Prometheus stack typically scrape all hosts on port 9100. Pair with a firewall rule on the monitoring host to restrict access to the scrape network.
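On the monitoring host, that scrape typically translates to something like the following sketch (targets are illustrative; this uses the standard NixOS Prometheus module, not a NixFleet option):

```nix
services.prometheus.scrapeConfigs = [{
  job_name = "node";
  static_configs = [{
    targets = [ "web-01:9100" "web-02:9100" "db-01:9100" ];
  }];
}];
```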

Firewall Scope

This module is provided by nixfleet-scopes. It is documented here as part of the NixFleet ecosystem reference.

The firewall scope applies SSH rate limiting, connection drop logging, and the nftables backend to all non-minimal hosts. It has no user-configurable options.

Activation

The scope activates when nixfleet.firewall.enable = true. Roles like server and workstation set this automatically. Minimal roles (endpoint, microvm-guest) leave it disabled by default.

What it provides

nftables backend

Sets networking.nftables.enable = true. This is the forward-compatible choice: Linux 6.17+ drops the ip_tables kernel module. Fleet repos using networking.firewall.extraCommands (iptables syntax) will receive an assertion failure at evaluation time, forcing migration before the kernel forces it.

SSH rate limiting

Adds nftables input rules that accept at most 5 new SSH connections per minute per source IP and drop the rest:

tcp dport 22 ct state new limit rate 5/minute accept
tcp dport 22 ct state new drop

This limits brute-force attempts without blocking legitimate access.

Drop logging

Enables networking.firewall.logRefusedConnections and networking.firewall.logReversePathDrops. Dropped packets appear in the system journal under kernel, making it straightforward to diagnose connectivity issues and detect port scans.

No user-configurable options

The firewall scope is intentionally opinionated. These settings are appropriate for any production NixOS host and require no per-host tuning. Fleet repos needing custom firewall rules add them via standard NixOS options (networking.firewall.extraInputRules, networking.firewall.allowedTCPPorts, etc.) alongside the scope.
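For example, a sketch layering fleet-specific rules on top of the scope (ports and subnet are illustrative):

```nix
{
  nixfleet.firewall.enable = true;  # usually set by the role
  networking.firewall = {
    allowedTCPPorts = [ 80 443 ];
    # nftables syntax - extraCommands (iptables) would trip the scope's assertion
    extraInputRules = ''
      ip saddr 10.0.0.0/8 tcp dport 9100 accept
    '';
  };
}
```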

Core NixOS Module

Everything configured by _nixos.nix, imported automatically by mkHost for Linux platforms.

Nixpkgs

| Setting | Value |
|---|---|
| allowUnfree | true |
| allowBroken | false |
| allowInsecure | false |
| allowUnsupportedSystem | true |

Nix settings

| Setting | Value |
|---|---|
| nixPath | [] (mkDefault) |
| allowed-users | [<userName>] |
| trusted-users | ["@admin"] + <userName> (unless the server role is active) |
| substituters | ["https://nix-community.cachix.org" "https://cache.nixos.org"] |
| trusted-public-keys | nix-community + cache.nixos.org keys |
| auto-optimise-store | true |
| experimental-features | nix-command flakes |
| gc.automatic | true |
| gc.dates | weekly |
| gc.options | --delete-older-than 7d |

Boot

| Setting | Value |
|---|---|
| loader.systemd-boot.enable | true |
| loader.systemd-boot.configurationLimit | 42 |
| loader.efi.canTouchEfiVariables | true |
| initrd.availableKernelModules | xhci_pci, ahci, nvme, usbhid, usb_storage, sd_mod |
| kernelPackages | linuxPackages_latest |
| kernelModules | ["uinput"] |

Localization

| Setting | Source |
|---|---|
| time.timeZone | hostSpec.timeZone |
| i18n.defaultLocale | hostSpec.locale |
| console.keyMap | hostSpec.keyboardLayout (mkDefault) |

Networking

| Setting | Value |
|---|---|
| hostName | hostSpec.hostName |
| useDHCP | false |
| networkmanager.enable | true |
| firewall.enable | true |
| Interface DHCP | Enabled for hostSpec.networking.interface when set |

Programs

| Program | Setting |
|---|---|
| gnupg.agent | Enabled with SSH support |
| dconf | Enabled |
| git | Enabled |
| zsh | Enabled, completion disabled (managed by HM) |

Security

| Setting | Value |
|---|---|
| polkit.enable | true |
| sudo.enable | true |
| Sudo NOPASSWD | reboot for wheel group |

Users

Primary user (hostSpec.userName)

| Setting | Value |
|---|---|
| isNormalUser | true |
| extraGroups | wheel + audio, video, docker, git, networkmanager (if groups exist) |
| shell | zsh |
| openssh.authorizedKeys.keys | hostSpec.sshAuthorizedKeys |
| hashedPasswordFile | hostSpec.hashedPasswordFile (when non-null) |

Root

| Setting | Value |
|---|---|
| openssh.authorizedKeys.keys | hostSpec.sshAuthorizedKeys |
| hashedPasswordFile | hostSpec.rootHashedPasswordFile (when non-null) |

SSH hardening

| Setting | Value |
|---|---|
| services.openssh.enable | true |
| PermitRootLogin | prohibit-password |
| PasswordAuthentication | false |
| KbdInteractiveAuthentication | false |

Other services

| Setting | Value |
|---|---|
| services.printing.enable | false |
| services.xserver.xkb.layout | hostSpec.keyboardLayout (mkDefault) |
| hardware.ledger.enable | true |

System packages

  • git
  • inetutils

State version

system.stateVersion = "24.11" (mkDefault)

Core Darwin Module

Everything configured by _darwin.nix, imported automatically by mkHost for Darwin platforms.

Nixpkgs

| Setting | Value |
|---|---|
| allowUnfree | true |
| allowBroken | false |
| allowInsecure | false |
| allowUnsupportedSystem | true |

Nix settings

| Setting | Value |
|---|---|
| nix.enable | false (Determinate installer compatible) |
| trusted-users | ["@admin" "<userName>"] |
| substituters | ["https://nix-community.cachix.org" "https://cache.nixos.org"] |
| trusted-public-keys | nix-community + cache.nixos.org keys |
| auto-optimise-store | true |
| experimental-features | nix-command flakes |

Programs

| Program | Setting |
|---|---|
| zsh | Enabled, completion disabled (managed by HM) |

Users

| Setting | Value |
|---|---|
| users.users.<userName>.name | <userName> |
| users.users.<userName>.home | hostSpec.home |
| users.users.<userName>.isHidden | false |
| users.users.<userName>.shell | zsh |

TouchID sudo

| Setting | Value |
|---|---|
| security.pam.services.sudo_local.touchIdAuth | true |
| PAM config | pam_reattach.so (ignore_ssh) + pam_tid.so |

TouchID works for sudo in terminal sessions, including through tmux via pam_reattach.

System defaults

NSGlobalDomain

| Key | Value |
|---|---|
| AppleShowAllExtensions | true |
| ApplePressAndHoldEnabled | false |
| KeyRepeat | 2 |
| InitialKeyRepeat | 15 |
| com.apple.mouse.tapBehavior | 1 |
| com.apple.sound.beep.feedback | 0 |

Dock

| Key | Value |
|---|---|
| autohide | true |
| show-recents | false |
| launchanim | true |
| orientation | bottom |
| tilesize | 48 |

Finder

| Key | Value |
|---|---|
| AppleShowAllExtensions | true |
| AppleShowAllFiles | true |
| ShowPathbar | true |
| _FXSortFoldersFirst | true |
| _FXShowPosixPathInTitle | false |

Trackpad

| Key | Value |
|---|---|
| Clicking | true |
| TrackpadThreeFingerDrag | true |

Dock management

The module includes a local.dock option for declarative Dock management using dockutil:

| Option | Type | Default | Description |
|---|---|---|---|
| local.dock.enable | bool | true | Enable dock management |
| local.dock.entries | listOf submodule | – (readOnly) | Dock entries |

Each entry has:

| Sub-option | Type | Default |
|---|---|---|
| path | str | |
| section | str | "apps" |
| options | str | "" |

The activation script diffs current Dock state against the declared entries and only resets when they differ.

Other

| Setting | Value |
|---|---|
| system.stateVersion | 4 |
| system.checks.verifyNixPath | false |
| system.primaryUser | <userName> |
| hostSpec.isDarwin | true |

Apps

Flake apps provided by NixFleet. Available via nix run .#<app>. VM lifecycle apps (build-vm, start-vm, stop-vm, clean-vm, test-vm) are exported via nixfleet.lib.mkVmApps for fleet repos.

validate

Runs the full validation suite: formatting, eval tests, host builds, and optionally VM tests.

nix run .#validate                 # format + flake check + eval + hosts (fast)
nix run .#validate -- --rust       # + cargo test + clippy + rust package builds
nix run .#validate -- --vm         # + every vm-* check (slow)
nix run .#validate -- --all        # everything

| Flag | What it adds to the base |
|---|---|
| (none) | format + flake check + eval + hosts only |
| --rust | + cargo test + clippy + rust package builds |
| --vm | + every vm-* check (dynamically discovered) |
| --all | everything |

See Testing Overview for the full check list, duration estimates, and how to drill into specific failures.


build-vm

Install a host into a persistent QEMU disk via nixos-anywhere. Linux and macOS.

nix run .#build-vm -- -h web-02
nix run .#build-vm -- -h web-02 --rebuild
nix run .#build-vm -- --all

Steps:

  1. Build custom ISO
  2. Create disk image at ~/.local/share/nixfleet/vms/<HOST>.qcow2
  3. Boot QEMU from ISO (headless, SSH forwarded)
  4. Install via nixos-anywhere
  5. Stop ISO VM

If a disk already exists, the install is skipped unless --rebuild is specified. If a key is found at ~/.keys/id_ed25519 or ~/.ssh/id_ed25519, it is provisioned into the VM for secrets decryption.

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| -h <HOST> | string | | Host config to install |
| --all | bool | | Install all hosts in nixosConfigurations |
| --rebuild | bool | | Wipe and reinstall existing disk |
| --identity-key <PATH> | string | | Path to identity key for secrets decryption |
| --ssh-port <N> | string | auto | Override SSH port (default: auto-assigned by index) |
| --ram <MB> | string | 4096 | RAM in MB |
| --cpus <N> | string | 2 | CPU count |
| --disk-size <S> | string | 20G | Disk size |

start-vm

Start an installed VM. Runs headless by default; use --display for graphical output. Linux and macOS.

nix run .#start-vm -- -h web-02
nix run .#start-vm -- -h web-02 --display gtk --ram 4096
nix run .#start-vm -- --all

Boots from the existing disk created by build-vm. SSH is forwarded to a per-host port (auto-assigned by sorted nixosConfigurations index, base 2201).

When --display is spice or gtk, the VM runs in the foreground (no daemonize). Closing the viewer window stops the VM. SPICE mode provides clipboard sharing via the SPICE agent.

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| -h <HOST> | string | | Host to start |
| --all | bool | | Start all installed VMs (headless only) |
| --ssh-port <N> | string | auto | Override SSH port |
| --ram <MB> | string | 1024 | RAM in MB |
| --cpus <N> | string | 2 | CPU count |
| --display <MODE> | string | none | Display: none (headless), spice (SPICE viewer), gtk (native window) |

stop-vm

Stop a running VM daemon. Linux and macOS.

nix run .#stop-vm -- -h web-02
nix run .#stop-vm -- --all

Sends SIGTERM to the QEMU process and removes the pidfile.

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| -h <HOST> | string | | Host to stop |
| --all | bool | | Stop all running VMs |

clean-vm

Delete VM disk, pidfile, and port file. Linux and macOS.

nix run .#clean-vm -- -h web-02
nix run .#clean-vm -- --all

Stops the VM if running, then removes <HOST>.qcow2, <HOST>.pid, and <HOST>.port from ~/.local/share/nixfleet/vms/.

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| -h <HOST> | string | | Host to clean |
| --all | bool | | Clean all VMs |

test-vm

Automated VM test cycle: build ISO, boot, install, reboot, verify, cleanup. Linux and macOS.

nix run .#test-vm -- -h web-02
nix run .#test-vm -- -h edge-01 --keep

Steps

  1. Build custom ISO
  2. Create ephemeral disk (20G)
  3. Boot QEMU from ISO (headless, SSH on port 2299)
  4. Install via nixos-anywhere
  5. Reboot from disk
  6. Verify: hostname, multi-user.target, sshd

Cleans up temp directory and disk on exit unless --keep is specified.

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| -h <HOST> | string | | Host config to install |
| --keep | bool | false | Keep temp dir and disk after test |
| --ssh-port <N> | string | 2299 | Host port for SSH |
| --identity-key <PATH> | string | | Path to identity key for secrets decryption |
| --ram <MB> | string | 4096 | RAM in MB |
| --cpus <N> | string | 2 | CPU count |

Note: Provisioning real hardware is done via standard NixOS tooling: nixos-anywhere --flake .#hostname root@ip. See Standard Tools.

Architecture

NixFleet is a fleet management framework providing a declarative API (mkHost), Rust service crates for orchestration, and NixOS modules for host configuration. Companion repos provide infrastructure scopes (nixfleet-scopes) and compliance controls (nixfleet-compliance).

System Overview

Fleet repo (flake.nix)
    |
    | calls mkHost { hostName, platform, hostSpec, modules }
    v
Framework (core + scopes + service modules)
    |
    | produces
    v
nixosSystem / darwinSystem
    |
    | deploy via
    v
nixos-rebuild / nixos-anywhere         (standard)
    or
Agent <-> Control Plane <-> CLI        (orchestrated)

mkHost is a closure over framework inputs (nixpkgs, home-manager, disko, impermanence, microvm). It returns a nixosSystem or darwinSystem based on the platform argument. For the full module injection order, see mkHost API reference.

Module graph

mkHost closure (binds framework inputs) ->
  - hostSpec module (identity-only options)
  - disko + impermanence NixOS modules
  - core/_nixos.nix or core/_darwin.nix
  - scopes/nixfleet/_agent.nix (+ _agent_darwin.nix on Darwin)
  - scopes/nixfleet/_control-plane.nix, _cache-server.nix, _cache.nix, _microvm-host.nix
  - user-provided modules (roles, fleet profiles, hardware)

Scope self-activation

Scopes are plain NixOS/HM modules. They are always imported but only activate when their corresponding enable option is set:

{ config, lib, ... }:
lib.mkIf config.nixfleet.impermanence.enable {
  # persistence paths, btrfs subvolume setup, etc.
}

Every host gets every scope module in its module tree, but inactive scopes produce zero config. Roles (from nixfleet-scopes) set the appropriate enable options. Fleet repos follow the same pattern for their own scopes.
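A fleet-repo scope following the same self-activation pattern might look like this (the myfleet.vpn option name is hypothetical):

```nix
# Hypothetical fleet-repo scope: always imported, inert until enabled
{ config, lib, pkgs, ... }: {
  options.myfleet.vpn.enable = lib.mkEnableOption "WireGuard VPN scope";

  config = lib.mkIf config.myfleet.vpn.enable {
    networking.wireguard.enable = true;
    environment.systemPackages = [ pkgs.wireguard-tools ];
  };
}
```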

Framework inputs via specialArgs

mkHost passes inputs (the framework flake’s inputs) through specialArgs, making them available to all modules as a function argument. Fleet repos that need their own inputs pass them through _module.args or additional specialArgs.
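A minimal sketch of the _module.args route, assuming the fleet flake's outputs function binds its own inputs as inputs (the fleetInputs argument name is hypothetical):

```nix
{
  modules = [
    # Make the fleet flake's own inputs available to every module
    # as the (hypothetical) `fleetInputs` argument.
    { _module.args.fleetInputs = inputs; }

    # Later modules can then take `fleetInputs` as a function argument:
    ({ fleetInputs, ... }: {
      # e.g. imports = [ fleetInputs.sops-nix.nixosModules.sops ];
    })
  ];
}
```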

Framework Separation

| Repo | Contents |
|---|---|
| nixfleet | mkHost API, core modules (nix, SSH, identity), service modules (agent, control plane, cache, microvm), Rust crates, eval/VM tests |
| nixfleet-scopes | 17 infrastructure scopes, 4 roles (server, workstation, endpoint, microvm-guest), 6 disk templates |
| nixfleet-compliance | 16 compliance controls, 4 regulatory frameworks (NIS2, DORA, ISO 27001, ANSSI), evidence probes |
| Consumer fleet repos | Host definitions via mkHost, opinionated scopes, hardware configs, secrets wiring |

The framework is generic with no org-specific assumptions. Fleet repos provide opinions. Consumers import scopes via roles or individual scope modules from nixfleet-scopes.

Rust Workspace

Four crates in a Cargo workspace at the repo root:

| Crate | Binary | Purpose |
|---|---|---|
| agent/ | nixfleet-agent | State machine daemon on each managed host: poll - fetch - apply - verify - report |
| control-plane/ | nixfleet-control-plane | Axum HTTP server with mTLS. Machine registry, rollout orchestration, audit log |
| cli/ | nixfleet | Operator CLI: deploy, status, rollback, release, rollout, machines, bootstrap, init |
| shared/ | (library) | nixfleet-types - shared data types and API contracts |

Agents poll the control plane for a desired generation, fetch closures, apply, run health checks, and report status. The CLI interacts with the control plane for machine registration, lifecycle management, releases, and rollouts.

Both the agent and control plane ship as NixOS service modules, auto-included by mkHost but disabled by default. Standard nixos-rebuild and nixos-anywhere work without them.

Flake Inputs

| Input | Purpose |
|---|---|
| nixpkgs | Package repository (nixos-unstable) |
| darwin | nix-darwin macOS system config |
| home-manager | User environment management |
| flake-parts | Module system for flake outputs |
| import-tree | Auto-import directory tree as modules |
| disko | Declarative disk partitioning |
| impermanence | Ephemeral root filesystem |
| nixos-anywhere | Remote NixOS installation via SSH |
| nixos-hardware | Hardware-specific optimizations |
| lanzaboote | Secure Boot |
| treefmt-nix | Multi-language formatting |
| microvm | MicroVM support (microvm.nix) |
| crane | Rust build system for Cargo workspace |
| nixfleet-scopes | Companion: infrastructure scopes, roles, disk templates |

Fleet repos add their own inputs as needed (e.g. agenix or sops-nix for secrets).

Design Decisions

Key architectural decisions are documented in Architecture Decision Records.

Summary of foundational decisions:

  1. Dendritic import - every .nix under modules/ is auto-imported via import-tree. No import lists to maintain.
  2. Plain modules - scopes are plain NixOS/HM modules imported by mkHost. No deferred registration.
  3. Central fleet definition - all hosts in flake.nix, not scattered across directories.
  4. Single API - mkHost is the only public constructor. No mkFleet/mkOrg/mkRole layer.
  5. Scope-aware impermanence - persist paths live alongside their program definitions in scopes.
  6. Mechanism over policy - the framework provides mkHost; fleets provide opinions.

Testing Overview

nixfleet has four test tiers that together cover configuration, Rust code, Nix module wiring, and full multi-node runtime behaviour. There is exactly one command that runs everything:

nix run .#validate -- --all

That’s it. Use this for CI, for pre-merge, for pre-release, and for “did my change break something far from where I was editing”. When you need a smaller slice for an inner-loop iteration, the flag variants below trade coverage for speed:

| Command | Runs | Typical duration |
|---|---|---|
| nix run .#validate | format + nix flake check + eval-* + host builds | ~1 min |
| nix run .#validate -- --vm | ^ + every vm-* check (dynamically discovered) | ~20–40 min |
| nix run .#validate -- --rust | ^ + cargo test --workspace + cargo clippy --workspace -- -D warnings + nix-build of every Rust package (sandboxed test run) | ~5–8 min |
| nix run .#validate -- --all | Everything | ~25–45 min |

What --all actually runs, in order:

  1. Formatting - nix fmt --fail-on-change
  2. Flake eval - nix flake check --no-build (every flake output type-checks)
  3. Eval tests - all eval-* derivations under .#checks
  4. Host builds - every nixosConfigurations.<host>.config.system.build.toplevel
  5. VM tests - every vm-* under .#checks, discovered dynamically
  6. Rust workspace tests - cargo test --workspace in the dev shell
  7. Rust lints - cargo clippy --workspace --all-targets -- -D warnings
  8. Rust package builds - nix build .#packages.<system>.{nixfleet-agent,nixfleet-control-plane,nixfleet-cli} (runs cargo test inside the nix sandbox - catches environment-dependent test failures that the dev-shell cargo test misses)

Inner-loop iteration (drilling down when something fails)

When --all surfaces a failure, you can reproduce the failing tier with a narrower command. Prefer these only after --all has already failed:

# Single VM scenario
nix build .#checks.x86_64-linux.vm-fleet-apply-failure --no-link

# Single Rust test binary
nix develop --command cargo test -p nixfleet-control-plane --test route_coverage

# Single test function
nix develop --command cargo test -p nixfleet-agent --test run_loop_scenarios \
    poll_hint_shortens_next_interval

Tier C - Eval tests (fast, ~seconds)

Pure Nix evaluations. No VMs, no Rust builds. Asserts structural properties of hostSpec, scope modules, and service wiring. See Eval Tests for the per-check list.

Tier B - Integration tests (medium)

| Check | Purpose |
|---|---|
| integration-mock-client | Simulates a consumer flake importing nixfleet.lib.mkHost. Proves the public API is reachable, produces valid nixosConfigurations, and exposes core modules/scopes. |

Tier A - VM tests (slow, minutes per test)

Real NixOS VMs booted under QEMU with Python test scripts driving assertions. See VM Tests for the full list and per-scenario semantics, including the fleet scenario subtests under _vm-fleet-scenarios/.

High-level categories:

  • Framework-level VMs (vm-core, vm-minimal, vm-firewall, vm-monitoring, vm-backup, vm-backup-restic, vm-secrets, vm-cache-server) - each one boots one or two nodes and exercises a single subsystem in isolation. These prove the framework produces bootable configs even when no fleet is enabled.
  • Fleet-level VMs (vm-fleet and the vm-fleet-* scenario subtests under _vm-fleet-scenarios/) - exercise multi-node topologies, mTLS, rollout strategies, failure paths, SSH-direct deploys, and the real fetch → apply → verify pipeline (vm-fleet-agent-rebuild).

Rust tests

Every Rust crate has unit tests in-file, plus integration scenarios in control-plane/tests/*_scenarios.rs and cli/tests/*_scenarios.rs. See Rust Tests for the full breakdown.

Finding the right test for a symptom

| Symptom | Where to start |
|---|---|
| “Option X isn’t being set correctly” | Eval test for that option |
| “My consumer flake doesn’t build with mkHost” | integration-mock-client |
| “The agent service won’t start on a real VM” | vm-core, vm-fleet-tag-sync |
| “A scope module (firewall, backup, monitoring) is broken” | vm-firewall, vm-backup, vm-monitoring (per scope) |
| “The fetch→apply pipeline isn’t working” | vm-fleet-agent-rebuild |
| “Rollout state machine is wrong” | vm-fleet + Rust failure_scenarios.rs, deploy_scenarios.rs |
| “mTLS / auth / RBAC is wrong” | vm-fleet-mtls-missing, vm-fleet-mtls-cn-mismatch, Rust auth_scenarios.rs |
| “Release CRUD or release push-hook is wrong” | vm-fleet-release, Rust release_scenarios.rs |
| “Bootstrap / admin-key flow is wrong” | vm-fleet-bootstrap |
| “SSH-direct deploy is broken” | vm-fleet-deploy-ssh, vm-fleet-rollback-ssh |
| “Tag sync from agent config isn’t working” | vm-fleet-tag-sync, Rust machine_scenarios.rs |
| “Health check type X fails” | vm-fleet-apply-failure, agent health::* unit tests |
| “Rollout resume doesn’t resume” | vm-fleet-apply-failure, Rust failure_scenarios.rs |
| “Metrics aren’t being emitted” | Rust metrics_scenarios.rs |
| “Audit log is wrong / CSV injection” | Rust audit_scenarios.rs |

Known coverage gaps

  • Real switch-to-configuration: most VM tests run agents with dryRun = true so the actual apply path is not exercised. The exception is vm-fleet-agent-rebuild, which runs with dryRun = false and exercises the missing-path guard end-to-end. Production bootstraps cover the happy apply path.
  • Multi-CP topologies and agenix secret rotation have no tests.

Eval Tests

Eval tests (Tier C in the testing overview) assert configuration properties at Nix evaluation time. They run instantly and catch structural mistakes before anything is built.

For the full test tier map (eval / integration / VM / Rust) see the Testing Overview. This page documents only the eval checks.

How to run

nix flake check --no-build

The --no-build flag skips VM tests so only eval checks execute. Every check is a pkgs.runCommand that prints PASS: or FAIL: for each assertion and exits non-zero on the first failure.

Test fleet

Eval tests run against a minimal test fleet defined in modules/fleet.nix. These hosts exist solely to exercise framework config paths - they are not a real org. Key hosts used by eval checks:

| Host | Key config | Purpose |
|---|---|---|
| web-01 | workstation role, impermanence enabled | Default web server, impermanent root |
| web-02 | workstation role, impermanence enabled | SSH hardening tests |
| dev-01 | userName = "alice" | Custom user override |
| edge-01 | endpoint role | Minimal edge device |
| srv-01 | server role | Production server |
| agent-test | agent enabled, tags, health checks | Agent module options |
Additional hosts (secrets-test, infra-test, cache-test, microvm-test, backup-restic-test) exercise other subsystems. All hosts share org-level defaults and use isVm = true.

Current checks

| Check | Host | What it asserts |
|---|---|---|
| eval-ssh-hardening | web-02 | PermitRootLogin == "prohibit-password", PasswordAuthentication == false, firewall enabled |
| eval-hostspec-defaults | web-01 | userName is non-empty, hostName matches "web-01" |
| eval-username-override | web-01, dev-01 | web-01 uses the shared default user; dev-01 overrides it to a different value |
| eval-locale-timezone | web-01 | timeZone, defaultLocale, console.keyMap are all non-empty |
| eval-ssh-authorized | web-01 | Primary user and root both have at least one SSH authorized key |
| eval-password-files | web-01 | hostSpec exposes hashedPasswordFile and rootHashedPasswordFile options |
| eval-agent-tags-health | agent-test | Agent systemd service has NIXFLEET_TAGS = "web,production", health-checks.json config file exists |

Adding a new eval test

  1. Pick (or add) a test fleet host in modules/fleet.nix that exercises the config path you want to verify.

  2. Add a new check in modules/tests/eval.nix following this pattern:

eval-my-check = let
  cfg = nixosCfg "web-01";
in
  mkEvalCheck "my-check" [
    {
      check = cfg.some.option == expectedValue;
      msg = "web-01 some.option should be expectedValue";
    }
  ];

  3. Run nix flake check --no-build to verify the new assertion passes.

The mkEvalCheck helper (from modules/tests/_lib/helpers.nix) takes a check name and a list of { check : bool; msg : string; } assertions. It produces a runCommand derivation that prints each result and fails on the first false.
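A plausible shape for the helper, consistent with that description (the real implementation lives in modules/tests/_lib/helpers.nix and may differ in detail):

```nix
# Sketch of mkEvalCheck: forces each boolean at eval time, prints a line
# per assertion, and exits non-zero on the first failure.
mkEvalCheck = name: assertions:
  pkgs.runCommand "eval-${name}" {} ''
    ${lib.concatMapStringsSep "\n" (a:
        if a.check
        then ''echo "PASS: ${a.msg}"''
        else ''echo "FAIL: ${a.msg}"; exit 1'')
      assertions}
    touch $out
  '';
```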

VM Tests

VM tests boot real NixOS virtual machines under QEMU and assert runtime state via Python test scripts run by the nixosTest driver. They verify services start, ports listen, multi-node interactions work end-to-end, and rollout state machines behave as documented.

How to run

The canonical entry point is nix run .#validate -- --all (see Testing Overview). For VM-only iteration:

nix run .#validate -- --vm

All vm-* checks under .#checks.<system> are discovered dynamically by the validate script, so new scenarios land in --vm / --all automatically without touching it.

When --vm surfaces a specific VM failure, drill in:

nix build .#checks.x86_64-linux.vm-fleet-apply-failure --no-link
nix log /nix/store/<hash>-vm-test-run-vm-fleet-apply-failure.drv

nix log retrieves the full driver output (systemctl status, journals, Python traceback) for a failed or past run.

Requirements

  • Platform: x86_64-linux only (nixosTest uses QEMU)
  • KVM: /dev/kvm for acceptable performance
  • Disk space: each VM test builds a NixOS closure; expect several GB per test
  • Time: minutes per test (closure build + parallel VM boots + assertions)

Test cycle

Each VM test goes through:

  1. Build - Nix evaluates the nodes’ config and builds each node’s system closure.
  2. Boot - QEMU launches one or more VMs in parallel; the shared host /nix/store is mounted read-only over 9p on every VM.
  3. Assert - a Python test script runs commands via the test driver API (machine.succeed(), machine.fail(), machine.wait_for_unit(), machine.wait_until_succeeds(cmd, timeout=N)).
  4. Cleanup - VMs shut down, driver reports pass/fail.
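The cycle above maps onto a check definition roughly like this (a sketch using pkgs.nixosTest; the check name and node config are illustrative):

```nix
vm-my-check = pkgs.nixosTest {
  name = "my-check";
  nodes.machine = { ... }: {
    services.openssh.enable = true;
  };
  testScript = ''
    machine.start()                             # Boot
    machine.wait_for_unit("multi-user.target")
    machine.wait_for_unit("sshd.service")       # Assert
    machine.succeed("systemctl is-active sshd")
  '';                                           # Cleanup is implicit
};
```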

Framework-level VM tests

These test one subsystem in isolation. Most are defined in modules/tests/vm*.nix.

vm-core

Boots a standard framework node (defaultTestSpec, no special flags) and verifies:

  • multi-user.target reached
  • sshd and NetworkManager running
  • Firewall active (nftables input chain exists)
  • Test user exists in the wheel group
  • Core packages available to the user (zsh, git)

This is the “does everything still boot” smoke test.

vm-minimal

Boots a node with the endpoint role (minimal scope set) and verifies the minimal profile stays minimal:

  • multi-user.target reached
  • Core tools still present (zsh, git come from core/nixos.nix, not the base scope)
  • Graphical/dev tools absent (e.g., niri not installed, Docker not running)

vm-infra

One node, four scopes in one VM for speed:

  • Firewall - nftables active, SSH rate limiting rules present (limit rate 5/minute), drop logging enabled.
  • Monitoring - node exporter running, port 9100 responds with Prometheus text, node_systemd collector active.
  • Backup - systemd timer registered, manual trigger writes status.json with "status": "success".
  • Secrets - SSH host key generated at /etc/ssh/ssh_host_ed25519_key with mode 600.

vm-fleet - “Tier A headline test”

4-node fleet: cp + web-01 + web-02 + db-01, with full mTLS (build-time CA + CP server cert + per-agent client certs, no allowInsecure).

  1. CP bootstraps an admin API key.
  2. All 3 agents register with tags (web × 2, db × 1).
  3. Canary rollout on tag web (strategy staged, batch sizes ["1","100%"]) - both agents healthy, rollout reaches completed.
  4. Health-gate failure rollout on tag db (strategy all_at_once) - db-01’s health check points at http://localhost:9999/health which nothing listens on; the rollout hits health_timeout and pauses.
  5. Resume the paused rollout and verify it transitions out of paused.
  6. Metrics - CP /metrics exposes nixfleet_fleet_size and nixfleet_rollouts_total; agent node exporter on web-01 exposes node_cpu.

Fleet scenario subtests

Every CLI path, failure mode, and rollout branch has its own independently buildable VM subtest under modules/tests/_vm-fleet-scenarios/*.nix. The aggregator modules/tests/vm-fleet-scenarios.nix exposes each one as .#checks.<system>.vm-fleet-<name>.

vm-fleet-agent-rebuild

The only VM test in the suite that runs with dryRun = false - it is the proof that the agent’s real fetch → apply → verify pipeline works end-to-end. The CP tells the agent to deploy a fabricated store path that does NOT exist anywhere, with no cache URL configured; the agent must log "not found locally and no cache URL configured" and leave /run/current-system untouched. Indirect fetch-path coverage still exists (vm-fleet-release for nix copy + harmonia, vm-fleet-bootstrap for the happy-path report cycle).

vm-fleet-tag-sync

Real agent with tags = ["web" "canary" "eu-west"] in NixOS config. Asserts tags appear in the CP machine_tags table after the first health report, that filtering by a declared tag returns the agent, and that undeclared tags do not leak into the table.

vm-fleet-bootstrap

End-to-end bootstrap flow:

  1. Start CP with an empty api_keys table.
  2. Operator runs nixfleet bootstrap --name test-admin - the CLI returns the first admin API key over mTLS.
  3. Use the returned key to list machines (empty), wait for two real agents (web-01, web-02) to register, list machines again (2 visible).
  4. Create a release via POST /api/v1/releases pointing at each agent’s real /run/current-system toplevel.
  5. POST a rollout targeting tag=web and wait for status=completed.
  6. Negative: a second nixfleet bootstrap call must fail (409 Conflict).

vm-fleet-release

Real nixfleet release create --push-to ssh://root@cache exercised against a harmonia binary cache server:

  • Uses the shared nix-shim (modules/tests/_lib/nix-shim.nix) to intercept nix eval and nix build on the builder node - returns a canned store path - while delegating nix copy to the real nix so the binary transfer actually happens.
  • Cache node runs services.nixfleet-cache-server (harmonia) with a build-time signing key baked as a /nix/store path (avoids the CREDENTIALS=243 race documented in TODO.md).
  • Post-push, assert via the VM-local Nix database (nix-store -q --references) that the path is registered on cache and NOT on cp.
  • Agent then fetches from http://cache:5000 and the DB check passes on the agent too.

vm-fleet-deploy-ssh

Real nixfleet deploy --hosts target --ssh --target root@target - no CP in the topology at all. The CLI calls nix eval (shim) → nix build (shim) → nix-copy-closure (real) → ssh target switch-to-configuration (real). A stub switch-to-configuration writes a marker file to /tmp that the test asserts. Proves --ssh mode truly bypasses the CP.

vm-fleet-apply-failure

Command health check with a sentinel file (/var/lib/fail-next-health) drives the failure path:

  1. Sentinel file created before the agent starts → first health report is unhealthy → rollout pauses (F1).
  2. Assert current_generation is still the agent’s original toplevel (RB1 - the agent did not advance to the failing generation).
  3. Clear the sentinel, wait for health_reports.all_passed = 1, call POST /api/v1/rollouts/{id}/resume, assert the rollout reaches completed.

This test covers two subtle bugs in the resume path: the rollout executor must not re-mark a batch unhealthy from stale pre-resume reports, and the agent’s CommandChecker must use an absolute /bin/sh so it works under a systemd unit PATH. A regression in either would make the resume → completed transition hang.

vm-fleet-revert

2-agent staged rollout with on_failure = revert:

  • Both agents healthy → first batch succeeded.
  • Test then arms the sentinel on both agents so the next batch fails.
  • Rollout executor walks previous_generations on succeeded batches and restores the per-machine desired generation.
  • Indirectly covers C3 (HealthRunner::run_all actually runs post-deploy) - if the health runner were dead code, the failing report would never arrive and the revert path wouldn’t fire.

vm-fleet-timeout

The agent is configured but its unit’s wantedBy is forced to [] so the process never starts. CP records the machine in the release but sees zero reports from it. The batch sits in pending_count > 0 until health_timeout elapses, at which point evaluate_batch pushes pending_count into unhealthy_count and marks the batch failed.

Negative control: the reports table is empty for the machine - the pause reason really is “timeout”, not “agent reported a failure”.

vm-fleet-poll-retry

Agent starts before the CP. First poll hits a closed port (connection refused). The agent’s main loop schedules a retry at retryInterval = 5s. Then the CP starts, and the agent’s next retry succeeds. Asserts the agent journal contains the retry-scheduling log line, then waits for registration.

vm-fleet-mtls-missing

Pure transport-layer test. CP has tls.clientCa set. A client with the CA cert (can verify server) but no client key pair sends curl against /health and /api/v1/machines/{id}/report:

  • Without --cert → handshake failure at the TLS layer (asserted by grepping the curl verbose output for any of a set of TLS markers: alert, handshake, certificate required, SSL_ERROR, etc.).
  • Positive control with a valid client cert → HTTP response comes back (any status - what matters is the handshake completed).

vm-fleet-mtls-cn-mismatch

Application-layer test on top of mTLS. A client with a valid fleet-CA-signed cert (CN = wrong-agent) hits another agent’s endpoints (/api/v1/machines/web-01/...). The cn_matches_path_machine_id middleware rejects with 403 because the cert CN does not match the {id} path segment. Closes the impersonation gap: CA proves fleet membership, CN proves specific agent identity.

vm-fleet-rollback-ssh

Real nixfleet rollback --host target --ssh --generation <G1> end-to-end:

  1. Deploy stub G2 via nixfleet deploy --ssh → target writes active=g2 marker file.
  2. Pre-copy G1 to target via nix-copy-closure (rollback handler does NOT copy, it only SSHes and runs <gen>/bin/switch-to-configuration).
  3. Run nixfleet rollback --host target --ssh --generation <G1> → target writes active=g1 marker.
  4. Assert both G1 and G2 are still registered in target’s Nix DB (rollback did not delete the forward generation).

Shared VM test helpers

All scenario tests use helpers from modules/tests/_lib/helpers.nix (via modules/tests/vm-fleet-scenarios.nix which pre-binds them):

  • mkCpNode { testCerts, ... } - a CP node with standard mTLS wiring (CA + server cert, services.nixfleet-control-plane with clientCa), sqlite and python3 pre-installed.

  • mkAgentNode { testCerts, hostName, tags, healthChecks, ... } - an agent node with standard TLS, fleet CA trust, services.nixfleet-agent with pre-wired machineId/tags/dryRun. Escape hatch agentExtraConfig (merged via lib.recursiveUpdate into services.nixfleet-agent) handles per-scenario overrides like retryInterval or allowInsecure.

  • tlsCertsModule { testCerts, certPrefix } - a NixOS module fragment wiring the fleet CA plus a named client cert under /etc/nixfleet-tls/, for operator / builder / cache-style nodes that need TLS certs but aren’t a CP or an agent.

  • testPrelude { certPrefix ? "cp", api ? "https://localhost:8080" } - returns a Python prelude string with TEST_KEY, KEY_HASH, AUTH, CURL, API constants and a seed_admin_key(node) helper. Interpolate at the top of every testScript:

    testScript = ''
      ${testPrelude {}}
      cp.start()
      cp.wait_for_unit("nixfleet-control-plane.service")
      cp.wait_for_open_port(8080)
      seed_admin_key(cp)
      ...
    '';
    
  • mkTlsCerts { hostnames } (from _lib/helpers.nix) - builds the fleet CA + per-host cert pairs at Nix-eval time. Deterministic, no runtime setup.

  • nix-shim (from _lib/nix-shim.nix) - a writeShellApplication that intercepts nix eval / nix build with canned responses while delegating nix copy and other subcommands to the real nix at an immutable ${pkgs.nix}/bin/nix path. The absolute path is deliberate: installing the shim into systemPackages would collide with the real nix at /run/current-system/sw/bin/nix, and if the shim won the collision its fall-through branches would infinitely exec themselves. See the nixosTest gotchas section below.
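
The agentExtraConfig escape hatch merges like Nix's lib.recursiveUpdate: nested attribute sets merge key-by-key, while any other value is replaced wholesale by the override. A minimal Python analogue of that merge rule (illustrative only, not part of the suite):

```python
def recursive_update(base, overrides):
    """Python sketch of lib.recursiveUpdate semantics:
    nested dicts merge recursively; any other value is replaced."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = recursive_update(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"settings": {"retryInterval": 30, "dryRun": True}, "enable": True}
override = {"settings": {"retryInterval": 5}}
# retryInterval is overridden; dryRun and enable survive the merge.
print(recursive_update(base, override))
```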

nixosTest gotchas worth knowing

A few behaviours of the nixosTest framework itself that have bitten scenarios in this suite:

  • Shared /nix/store via 9p: every VM sees the host store read-only via 9p mount. Any store path referenced anywhere in the test evaluation is visible as a file on every node regardless of whether it was ever copied there. test -e <storepath> assertions are therefore invariant. The workaround is to check the VM-local Nix database (nix-store -q --references <path>) which is per-VM.
  • systemd PATH for services: services like nixfleet-agent do not get /run/current-system/sw/bin in their PATH by default, so Command::new("sh") (relative lookup) fails with ENOENT. Use absolute paths like /bin/sh.
  • nix shim collisions: adding a shim package named "nix" to environment.systemPackages causes a silent collision with the real nix in /run/current-system/sw/bin/nix. The workaround is to keep the shim only on sessionVariables.PATH (which still pulls it into the closure via string interpolation) and never in systemPackages.
  • wait_for_unit vs wait_until_succeeds("systemctl is-active"): a systemd unit stuck in the activating state forever (e.g., due to a LoadCredential= failure) blocks wait_for_unit with no useful error. wait_until_succeeds(..., timeout=120) wrapped in a try/except that dumps systemctl status + the unit journal gives you an informative failure instead of an opaque hang.

Adding a new VM test

  1. Create modules/tests/_vm-fleet-scenarios/<name>.nix following the vm-fleet-tag-sync.nix template.
  2. Accept mkCpNode, mkAgentNode, mkTlsCerts, testPrelude, and tlsCertsModule via scenarioArgs (and pkgs, lib, etc. as needed with ...).
  3. Register the subtest in modules/tests/vm-fleet-scenarios.nix.
  4. Add the check name to the vm-fleet-* section in the project README (automatic discovery means no script edit is needed).

For non-fleet VM tests (single-subsystem things like vm-core / vm-infra) follow the pattern in modules/tests/vm.nix - use mkTestNode directly.

Shared /nix/store and the assertion classes it forbids (WONTFIX)

Every node in a nixosTest mounts the host’s /nix/store read-only via 9p. This means store-path existence checks (test -e /nix/store/...) are tautologically true on every node regardless of which node’s closure references the path. A nix copy between nodes appears to succeed even when it transferred zero bytes, because the receiver could already see the path via 9p.

The suite uses two workaround patterns instead of the heavyweight per-VM store-image approach:

| Need | Workaround | Why it works |
|---|---|---|
| Prove a command ran on a specific node | VM-local marker file under /tmp | /tmp is per-VM, never shared via 9p |
| Prove a path is registered in a node's Nix DB | nix-store -q --references <path> on the target | The Nix DB (/nix/var/nix/db) is per-VM; only the store files are shared |

Concrete examples in the suite:

  • vm-fleet-deploy-ssh uses nix-store -q --references to prove nix-copy-closure --to actually registered the stub closure in the target’s Nix DB. The 9p-mounted store would make a test -e check invariant.
  • vm-fleet-rollback-ssh uses the same pattern for the per-generation rollback assertion.
  • vm-fleet-apply-failure uses /tmp/stub-switch-called (a regular filesystem path, VM-local) as the load-bearing proof that switch-to-configuration switch was invoked.

Why not per-VM store images

The alternative - virtualisation.useNixStoreImage = true; virtualisation.mountHostNixStore = false; - was considered and rejected: every node would rebuild its own store image, multiplying VM build cost for an assertion class that the workarounds already cover. No scenario in the current suite needs per-VM store isolation.

If a future scenario genuinely requires it (e.g. asserting on byte-level transfer through nix copy rather than DB registration), revisit this decision in a follow-up. Do not adopt per-VM store images preemptively: they cost real wall-clock minutes per CI run.

Rust Tests

The Rust side of nixfleet lives in three crates:

| Crate | Path | Role |
|---|---|---|
| nixfleet-control-plane | control-plane/ | Axum HTTP server, SQLite state, rollout executor, release registry, auth/audit, metrics |
| nixfleet-agent | agent/ | Polling daemon, health check runners, store/TLS |
| nixfleet-types | shared/ | Wire types shared by the CLI, agent, and CP |

Plus the CLI at cli/ (nixfleet-cli) which has its own integration tests.

How to run

The canonical entry point is nix run .#validate -- --all (see Testing Overview). For faster Rust-only iteration:

nix run .#validate -- --rust

That runs cargo test --workspace + cargo clippy --workspace --all-targets -- -D warnings + nix build of every Rust package (the sandboxed test run), in order. Use this over raw cargo test so clippy and the sandbox-build check stay in the loop.

When you need to drill into a specific failure after --rust has already surfaced it:

nix develop --command cargo test -p nixfleet-control-plane --test route_coverage
nix develop --command cargo test -p nixfleet-cli --test subcommand_coverage
nix develop --command cargo test -p nixfleet-agent --test run_loop_scenarios \
    poll_hint_shortens_next_interval

The Rust toolchain (cargo, rustc, clippy, rustfmt, rust-analyzer) is pinned in the dev shell.

Unit tests (in-file #[cfg(test)] mod tests)

Each Rust module has its own unit tests exercising pure logic without HTTP / DB / filesystem / network.

nixfleet-control-plane

| Module | Tested logic |
|---|---|
| auth.rs | API key SHA-256 hashing, role matrix (admin/deploy/readonly), bearer token parsing, role check predicates |
| db.rs | Every persistence method: register machine, insert report, generations table, releases + release entries, rollout batches, lifecycle filter, tag join (machine_tags), get_recent_reports deterministic tiebreaker, migrations idempotency |
| metrics.rs | Counter/gauge updates, Prometheus text rendering |
| state.rs | FleetState hydration from DB on startup, in-memory machine inventory, poll_hint propagation |
| tls.rs | Server/client cert loading, rustls ServerConfig / ClientConfig builder |
| rollout/batch.rs | Batch building from strategy (all_at_once, canary, staged), batch_sizes parsing (absolute N and percent), randomization determinism |
| rollout/executor.rs | parse_threshold (absolute + percent), tick_for_tests doc-hidden shim for deterministic single-tick advancement |
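
The absolute-vs-percent split tested in rollout/batch.rs and rollout/executor.rs can be sketched as follows. This is a hedged Python illustration of the semantics, not the actual Rust code; the real parse_threshold's rounding behavior may differ:

```python
def parse_threshold(spec: str, batch_size: int) -> int:
    """Resolve a failure threshold like "3" or "30%" to an absolute
    count for a batch. Percent is taken relative to the batch size.
    Sketch only; mirrors the absolute/percent split, not the real code."""
    if spec.endswith("%"):
        return batch_size * int(spec[:-1]) // 100
    return int(spec)

# 30% of a 10-machine batch tolerates 3 failures; the 4th pauses the rollout
# (matching the "pauses on 4 of 10" case in failure_scenarios.rs).
print(parse_threshold("30%", 10))  # → 3
print(parse_threshold("3", 50))    # → 3
```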

nixfleet-agent

| Module | Tested logic |
|---|---|
| comms.rs | Report payload serialization, HTTP client builder with mTLS |
| config.rs | Default config (e.g., dry_run = false, tags = []), CLI arg parsing |
| nix.rs | /run/current-system symlink resolution, store-path parsing, generation hash extraction |
| store.rs | SQLite state DB: get/set current_generation, log_check, log_error, cleanup |
| tls.rs | Client cert/key loading, fleet CA trust |
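
The store-path parsing and generation hash extraction in nix.rs reduce to splitting the /nix/store/<hash>-<name> layout. A rough Python equivalent (the real parser's validation is stricter; this is only a shape sketch):

```python
def parse_store_path(path: str) -> tuple[str, str]:
    """Split /nix/store/<hash>-<name> into (hash, name).
    Illustrative sketch of the layout nix.rs parses."""
    prefix = "/nix/store/"
    if not path.startswith(prefix):
        raise ValueError(f"not a store path: {path}")
    hash_part, _, name = path[len(prefix):].partition("-")
    return hash_part, name

# Hypothetical system path with a 32-character placeholder hash:
h, name = parse_store_path("/nix/store/" + "z" * 32 + "-nixos-system-web-01-24.05")
```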

nixfleet-types

| Module | Tested logic |
|---|---|
| lib.rs | Serde round-trips for all wire types |
| health.rs | HealthReport + HealthCheckResult serialization |
| release.rs | Release / ReleaseEntry serde |
| rollout.rs | RolloutStatus, RolloutStrategy, OnFailure enum serde |

Integration tests (scenario files)

Integration tests live in control-plane/tests/*_scenarios.rs, control-plane/tests/route_coverage.rs, and cli/tests/*_scenarios.rs. Every file is an independent test binary - cargo test spawns one binary per file.

Shared harness

Every scenario file imports control-plane/tests/harness.rs via a #[path = "harness.rs"] mod harness; sibling include. The harness provides:

| Helper | Purpose |
|---|---|
| spawn_cp() / spawn_cp_at(path) | Boot an in-process CP bound to a temp directory with pre-seeded admin / deploy / readonly API keys. Returns a Cp handle with .db, .fleet, .admin, .base, .db_path. |
| spawn_cp_with_rollout(store_path) | Canonical "1 machine, 1 release, 1 all-at-once rollout, zero-tolerance, pause-on-failure" fixture. Returns (cp, release_id, rollout_id). |
| register_machine(cp, id, tags) | Register a machine directly via DB + fleet state (bypasses HTTP for setup speed). |
| create_release(cp, entries) | POST /api/v1/releases; returns the release id. |
| create_rollout_for_tag(cp, release_id, tag, strategy, batch_sizes, threshold, on_failure, health_timeout) | POST /api/v1/rollouts; returns the rollout id. |
| fake_agent_report(cp, machine_id, generation, success, message, tags) | POST /api/v1/machines/{id}/report as an agent. |
| agent_reports_health(cp, machine_id, store_path, healthy) | Paired helper that emits both a fake_agent_report and an insert_health_report; the executor's generation gate and batch health gate read different tables, so almost every failure / recovery scenario needs both together. |
| assert_status(builder, expected) | One-line replacement for the let resp = ...; .send().await; assert_eq!(resp.status(), N) triple used across route_coverage.rs. |
| tick_once(cp) | Drive a single executor tick deterministically via executor::test_support::tick_for_tests. Replaces the production 2s tokio::time::interval. |
| wait_rollout_status(cp, rollout_id, want, within) | Poll GET /rollouts/{id} until status matches or the deadline elapses. |

Constants: TEST_API_KEY, TEST_DEPLOY_KEY, TEST_READONLY_KEY are the three pre-seeded role keys.
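
The poll-until-deadline shape behind wait_rollout_status is a generic pattern worth internalizing before writing new scenarios. A self-contained Python sketch of the same idea (names are illustrative, not the harness API):

```python
import time

def wait_until(predicate, within: float, interval: float = 0.05) -> bool:
    """Poll predicate() until it returns True or `within` seconds elapse.
    Mirrors the wait_rollout_status poll-with-deadline pattern."""
    deadline = time.monotonic() + within
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# e.g. wait_until(lambda: fetch_status(rollout_id) == "Completed", within=5.0)
# where fetch_status is a hypothetical GET /rollouts/{id} wrapper.
```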

Scenario files - control-plane

| File | Covers |
|---|---|
| release_scenarios.rs | R3 push-hook invocation, R4 release list pagination, R5 referenced release delete → 409, R6 orphan release delete → 204. |
| deploy_scenarios.rs | D2 canary strategy happy path, D3 staged strategy happy path. |
| failure_scenarios.rs | Generation-gate filters stale-gen reports, failure_threshold = "30%" pauses on 4 of 10, resume does not re-flip on a stale pre-resume report, Paused → Cancelled via operator cancel. |
| hydration_scenarios.rs | CP restart mid-rollout resumes from DB (ADR 010): cp1 stages a rollout, cp2 hydrates from the shared SQLite file and drives it to completion, proving FleetState is re-queried per tick. |
| rollback_scenarios.rs | Rollback via CP API: redeploy an old release as a forward rollback; original forward rollout stays Completed (history preserved). |
| polling_scenarios.rs | poll_hint = 5 present when a machine is in an active rollout, absent when idle. |
| machine_scenarios.rs | M1 lifecycle filter (decommissioned excluded from rollout targets), M2 tag propagation via health reports, M3 direct desired-gen ↔ report cycle, M4 success=false → system_state=error, M5 multi-machine desired-gen isolation, M6 Pending → Active auto-transition, M7 Active ↔ Maintenance round trip. |
| auth_scenarios.rs | Bootstrap 409 after first key, anonymous admin route → 401, public /health stays open, readonly/deploy role enforcement on POST /rollouts and READ_ONLY on GET /releases + /rollouts, bearer-token shape errors (invalid token / missing Bearer prefix → 401). |
| audit_scenarios.rs | Audit log writes for every mutating route + CSV-injection escaping for untrusted detail fields. |
| metrics_scenarios.rs | /metrics exposes every CP-side metric after a real rollout cycle, and the HTTP middleware counter increments per normalized path. |
| cn_validation_scenarios.rs | mTLS CN validation middleware: no extension / empty extension / matching CN / mismatched CN (defense in depth above the CA boundary). |
| route_coverage.rs | Happy + error + auth coverage for every admin route, grouped by family via section headers (machines / rollouts / releases / audit+bootstrap+public). ~50 tests. |
| migrations_scenarios.rs | Fresh DB schema shape, refinery_schema_history exists, idempotent on second migrate, every expected table is queryable. |
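
The CSV-injection escaping that audit_scenarios.rs exercises follows the standard defense of neutralizing formula-leading characters in untrusted fields. A generic Python sketch of that defense (assumed behavior, not the CP's exact implementation):

```python
def escape_csv_field(value: str) -> str:
    """Neutralize spreadsheet formula injection: if an untrusted field
    starts with a formula trigger character (= + - @), prefix it with a
    single quote so spreadsheet apps treat it as text, not a formula."""
    if value and value[0] in ("=", "+", "-", "@"):
        return "'" + value
    return value

print(escape_csv_field("=HYPERLINK(...)"))  # → '=HYPERLINK(...)
print(escape_csv_field("normal detail"))    # → normal detail
```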

Scenario files - cli

| File | Covers |
|---|---|
| release_hook_scenarios.rs | release create --push-hook "..." expands {} to the store path and runs the hook under sh -c. |
| rollback_cli_scenarios.rs | nixfleet rollback --host <h> --generation <g> constructs the right SSH invocation. |
| config_scenarios.rs | CLI/credentials/file precedence + env-var precedence (NIXFLEET_* overrides credentials, loses to CLI flags) + HOSTNAME fallback path. |
| subcommand_coverage.rs | Direct CLI test for every leaf subcommand (init, bootstrap, status, host add, machines list/register, rollout list/status/cancel, release list/show/diff). |
| release_delete_scenarios.rs | nixfleet release delete CLI dispatch (204 → exit 0, 409 → exit 1, 404 → exit 1). |
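
The exit-code dispatch that release_delete_scenarios.rs asserts reduces to a small status mapping, sketched here in Python for clarity (illustrative, not the CLI's actual code):

```python
def delete_exit_code(http_status: int) -> int:
    """Map the CP's release-delete responses to CLI exit codes,
    per release_delete_scenarios.rs: 204 (deleted) → exit 0;
    409 (still referenced), 404 (missing), or anything else → exit 1."""
    return 0 if http_status == 204 else 1
```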

Tests deliberately NOT in Rust

  • Everything that needs a real systemd unit (nixfleet-agent.service, harmonia.service, sshd) - those are VM tests.
  • Anything that needs a real /run/current-system symlink to resolve - the agent’s nix::current_generation() falls back via unwrap_or_default() at the call site, so the path is testable in VMs only.
  • End-to-end CLI + real nix builds - those are VM tests (vm-fleet-release, vm-fleet-deploy-ssh, vm-fleet-rollback-ssh).

Known gaps

New gaps surfacing during operation should be added here and tracked in TODO.md.

Coverage measurement

NixFleet measures Rust coverage with cargo llvm-cov on demand. We deliberately do not record a one-shot baseline snapshot - an orphaned number from a single point in time is theater without a concrete change to compare against.

The useful measurement is “coverage delta for the code you just touched”, not “total workspace coverage at an arbitrary date.”

When to run

  • Before merging a non-trivial Rust change, to confirm the new code is covered by at least one test path.
  • Before a release, to spot-check any module whose coverage has drifted.
  • When investigating a regression, to see whether the failing path had test coverage prior to the break.

How to run

cargo install cargo-llvm-cov  # once per toolchain
cargo llvm-cov --workspace --html
# Open target/llvm-cov/html/index.html for the per-crate breakdown.

# Or on a specific crate / test target:
cargo llvm-cov --package nixfleet-control-plane --html
cargo llvm-cov --package nixfleet-agent --test run_loop_scenarios --html

# Diff the branch under review against main:
cargo llvm-cov --workspace --summary-only > /tmp/branch.txt
git checkout main
cargo llvm-cov --workspace --summary-only > /tmp/main.txt
diff /tmp/main.txt /tmp/branch.txt

The HTML report is the primary way to inspect coverage; --summary-only produces a text table suitable for piping into diff tools.

What’s not here

There is no persistent coverage percentage in this document - a static snapshot has no downstream consumer. If a future change wants to establish a persistent baseline (e.g. as a CI regression gate), the tooling above is ready.

Adding a new Rust scenario

  1. Create control-plane/tests/<domain>_scenarios.rs or cli/tests/<domain>_scenarios.rs.

  2. Add the harness sibling include at the top:

    #[path = "harness.rs"]
    mod harness;

    use harness::*;
  3. Write #[tokio::test] functions. Use the spawn_cp / register_machine / create_release / tick_once helpers so your scenario doesn’t fight the executor’s wall-clock interval.

  4. Run cargo test -p nixfleet-control-plane --test <file> to iterate.

  5. If the scenario uncovers a product bug, fix the bug rather than adapting the test around it. Follow the test-vs-component debugging rule: when a test fails, first determine whether the test or the tested component is at fault, then prefer the root-cause fix.