VM Tests

VM tests boot real NixOS virtual machines under QEMU and assert runtime state via Python test scripts run by the nixosTest driver. They verify services start, ports listen, multi-node interactions work end-to-end, and rollout state machines behave as documented.

How to run

The canonical entry point is nix run .#validate -- --all (see Testing Overview). For VM-only iteration:

nix run .#validate -- --vm

All vm-* checks under .#checks.<system> are discovered dynamically by the validate script, so new scenarios land in --vm / --all automatically without touching it.

When --vm surfaces a specific VM failure, drill in:

nix build .#checks.x86_64-linux.vm-fleet-apply-failure --no-link
nix log /nix/store/<hash>-vm-test-run-vm-fleet-apply-failure.drv

nix log retrieves the full driver output (systemctl status, journals, Python traceback) for a failed or past run.

Requirements

Platform: x86_64-linux only (nixosTest uses QEMU)
KVM: /dev/kvm for acceptable performance
Disk space: each VM test builds a NixOS closure; expect several GB per test
Time: minutes per test (closure build + parallel VM boots + assertions)

Test cycle

Each VM test goes through:

Build - Nix evaluates the nodes’ config and builds each node’s system closure.
Boot - QEMU launches one or more VMs in parallel; the shared host /nix/store is mounted read-only over 9p on every VM.
Assert - a Python test script runs commands via the test driver API (machine.succeed(), machine.fail(), machine.wait_for_unit(), machine.wait_until_succeeds(cmd, timeout=N)).
Cleanup - VMs shut down, driver reports pass/fail.
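The Assert phase can be sketched as a plain Python function that takes a driver machine object. This is a minimal illustration, not a script from the suite: the function name smoke_test and the specific shell commands are assumptions, and only the driver methods listed above are used.

```python
def smoke_test(machine) -> None:
    """Assert-phase sketch using only the driver calls listed above."""
    # Block until the node has finished booting.
    machine.wait_for_unit("multi-user.target")
    # succeed() raises if the command exits non-zero.
    machine.succeed("systemctl is-active sshd.service")
    # Poll until the SSH port is listening, giving up after 60 seconds.
    machine.wait_until_succeeds("ss -tln | grep -q ':22 '", timeout=60)
    # fail() raises if the command unexpectedly exits zero.
    machine.fail("systemctl is-active docker.service")
```

Inside a real testScript the driver injects the machine objects, so the body would call smoke_test(machine) directly.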

Framework-level VM tests

These test one subsystem in isolation. Most are defined in modules/tests/vm*.nix.

`vm-core`

Boots a standard framework node (defaultTestSpec, no special flags) and verifies:

multi-user.target reached
sshd and NetworkManager running
Firewall active (nftables input chain exists)
Test user exists in the wheel group
Core packages available to the user (zsh, git)

This is the “does everything still boot” smoke test.

`vm-minimal`

Boots a node with the endpoint role (minimal scope set) and verifies the minimal profile stays minimal:

multi-user.target reached
Core tools still present (zsh, git come from core/nixos.nix, not the base scope)
Graphical/dev tools absent (e.g., niri not installed, Docker not running)

`vm-infra`

One node, four scopes in one VM for speed:

Firewall - nftables active, SSH rate limiting rules present (limit rate 5/minute), drop logging enabled.
Monitoring - node exporter running, port 9100 responds with Prometheus text, node_systemd collector active.
Backup - systemd timer registered, manual trigger writes status.json with "status": "success".
Secrets - SSH host key generated at /etc/ssh/ssh_host_ed25519_key with mode 600.

`vm-fleet` - “Tier A headline test”

4-node fleet: cp + web-01 + web-02 + db-01, with full mTLS (build-time CA + CP server cert + per-agent client certs, no allowInsecure).

CP bootstraps an admin API key.
All 3 agents register with tags (web × 2, db × 1).
Canary rollout on tag web (strategy staged, batch sizes ["1","100%"]) - both agents healthy, rollout reaches completed.
Health-gate failure rollout on tag db (strategy all_at_once) - db-01’s health check points at http://localhost:9999/health which nothing listens on; the rollout hits health_timeout and pauses.
Resume the paused rollout and verify it transitions out of paused.
Metrics - CP /metrics exposes nixfleet_fleet_size and nixfleet_rollouts_total; agent node exporter on web-01 exposes node_cpu.

Fleet scenario subtests

Every CLI path, failure mode, and rollout branch has its own independently buildable VM subtest under modules/tests/_vm-fleet-scenarios/*.nix. The aggregator modules/tests/vm-fleet-scenarios.nix exposes each one as .#checks.<system>.vm-fleet-<name>.

`vm-fleet-agent-rebuild`

The only VM test in the suite that runs with dryRun = false - it is the proof that the agent’s real fetch → apply → verify pipeline works end-to-end. CP tells the agent to deploy a fabricated store path that does NOT exist anywhere, and no cache URL is configured; the agent must log "not found locally and no cache URL configured" and leave /run/current-system untouched. Indirect fetch-path coverage still exists (vm-fleet-release for nix copy + harmonia, vm-fleet-bootstrap for the happy-path report cycle).

`vm-fleet-tag-sync`

Real agent with tags = ["web" "canary" "eu-west"] in NixOS config. Asserts tags appear in the CP machine_tags table after the first health report, that filtering by a declared tag returns the agent, and that undeclared tags do not leak into the table.

`vm-fleet-bootstrap`

End-to-end bootstrap flow:

Start CP with an empty api_keys table.
Operator runs nixfleet bootstrap --name test-admin - the CLI returns the first admin API key over mTLS.
Use the returned key to list machines (empty), wait for two real agents (web-01, web-02) to register, list machines again (2 visible).
Create a release via POST /api/v1/releases pointing at each agent’s real /run/current-system toplevel.
POST a rollout targeting tag=web and wait for status=completed.
Negative: a second nixfleet bootstrap call must fail (409 Conflict).

`vm-fleet-release`

Real nixfleet release create --push-to ssh://root@cache exercised against a harmonia binary cache server:

Uses the shared nix-shim (modules/tests/_lib/nix-shim.nix) to intercept nix eval and nix build on the builder node - returns a canned store path - while delegating nix copy to the real nix so the binary transfer actually happens.
Cache node runs services.nixfleet-cache-server (harmonia) with a build-time signing key baked as a /nix/store path (avoids the CREDENTIALS=243 race documented in TODO.md).
Post-push, assert via the VM-local Nix database (nix-store -q --references) that the path is registered on cache and NOT on cp.
Agent then fetches from http://cache:5000 and the DB check passes on the agent too.

`vm-fleet-deploy-ssh`

Real nixfleet deploy --hosts target --ssh --target root@target - no CP in the topology at all. The CLI calls nix eval (shim) → nix build (shim) → nix-copy-closure (real) → ssh target switch-to-configuration (real). A stub switch-to-configuration writes a marker file to /tmp that the test asserts. Proves --ssh mode truly bypasses the CP.

`vm-fleet-apply-failure`

Command health check with a sentinel file (/var/lib/fail-next-health) drives the failure path:

Sentinel file created before the agent starts → first health report is unhealthy → rollout pauses (F1).
Assert current_generation is still the agent’s original toplevel (RB1 - the agent did not advance to the failing generation).
Clear the sentinel, wait for health_reports.all_passed = 1, call POST /api/v1/rollouts/{id}/resume, assert the rollout reaches completed.

This test covers two subtle bugs in the resume path: the rollout executor must not re-mark a batch unhealthy from stale pre-resume reports, and the agent’s CommandChecker must use an absolute /bin/sh so it works under a systemd unit PATH. A regression in either would make the resume → completed transition hang.
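The stale-report half of this can be illustrated with a small sketch. It is not the executor's actual code: the HealthReport type, its field names, and the resumed_at timestamp are hypothetical, chosen only to show why reports received before the resume must be ignored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class HealthReport:
    # Hypothetical shape of a stored health report.
    machine_id: str
    received_at: float
    all_passed: bool


def batch_is_unhealthy(reports: Iterable[HealthReport], resumed_at: float) -> bool:
    """Judge a batch only on reports that arrived after the resume."""
    # Drop everything from before the resume: those reports describe the
    # failure that caused the pause, not the current state.
    fresh = [r for r in reports if r.received_at > resumed_at]
    # Keep the most recent fresh report per machine.
    latest: Dict[str, HealthReport] = {}
    for report in sorted(fresh, key=lambda r: r.received_at):
        latest[report.machine_id] = report
    return any(not r.all_passed for r in latest.values())
```

Without the resumed_at filter, the old failing report would immediately mark the batch unhealthy again after a resume.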

`vm-fleet-revert`

2-agent staged rollout with on_failure = revert:

Both agents healthy → first batch succeeded.
Test then arms the sentinel on both agents so the next batch fails.
Rollout executor walks previous_generations on succeeded batches and restores the per-machine desired generation.
Indirectly covers C3 (HealthRunner::run_all actually runs post-deploy) - if the health runner were dead code, the failing report would never arrive and the revert path wouldn’t fire.
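The revert walk described above might look roughly like the following. This is a sketch under assumptions: the FinishedBatch type and the dictionary shapes are hypothetical, and the generation strings are placeholders rather than real store paths.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FinishedBatch:
    # Hypothetical record of one rollout batch.
    status: str
    # machine id -> generation the machine ran before this batch deployed.
    previous_generations: Dict[str, str] = field(default_factory=dict)


def revert_succeeded_batches(
    batches: List[FinishedBatch], desired: Dict[str, str]
) -> Dict[str, str]:
    """Return desired generations with succeeded batches rolled back."""
    restored = dict(desired)
    for batch in batches:
        if batch.status != "succeeded":
            continue
        # Restore each machine in a succeeded batch to what it ran before.
        for machine_id, generation in batch.previous_generations.items():
            restored[machine_id] = generation
    return restored
```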

`vm-fleet-timeout`

The agent is configured but its unit’s wantedBy is forced to [] so the process never starts. CP records the machine in the release but sees zero reports from it. The batch sits in pending_count > 0 until health_timeout elapses, at which point evaluate_batch pushes pending_count into unhealthy_count and marks the batch failed.

Negative control: the reports table is empty for the machine - the pause reason really is “timeout”, not “agent reported a failure”.
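The timeout transition can be sketched as follows. Only pending_count, unhealthy_count, health_timeout, and evaluate_batch are names from the description above; the BatchState type, the status strings other than failed, and the rule that any unhealthy machine fails the batch are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchState:
    # Hypothetical in-memory view of one batch.
    pending_count: int
    healthy_count: int = 0
    unhealthy_count: int = 0
    status: str = "in_progress"


def evaluate_batch(batch: BatchState, elapsed_s: float, health_timeout_s: float) -> BatchState:
    """Fold silent machines into the unhealthy count once the timeout passes."""
    if batch.pending_count > 0 and elapsed_s >= health_timeout_s:
        # Machines that never reported are treated as unhealthy.
        batch.unhealthy_count += batch.pending_count
        batch.pending_count = 0
    if batch.pending_count == 0:
        # Assumption: a single unhealthy machine is enough to fail the batch.
        batch.status = "failed" if batch.unhealthy_count > 0 else "succeeded"
    return batch
```

Before the timeout the batch stays in_progress with pending_count unchanged, which is the window the test waits through.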

`vm-fleet-poll-retry`

Agent starts before the CP. First poll hits a closed port (connection refused). The agent’s main loop schedules a retry at retryInterval = 5s. Then the CP starts, and the agent’s next retry succeeds. Asserts the agent journal contains the retry-scheduling log line, then waits for registration.
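The retry behaviour under test can be approximated by the loop below. It is a sketch, not the agent's main loop: the function name, the max_attempts bound, and the log wording are assumptions, and poll, sleep, and log are injected so the loop can be exercised without a network.

```python
import time
from typing import Callable


def poll_until_registered(
    poll: Callable[[], None],
    retry_interval_s: float = 5.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> int:
    """Call poll() until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            poll()
            return attempt
        except ConnectionRefusedError as err:
            # The CP is not listening yet: log, wait, and try again.
            log(f"poll failed ({err}); retrying in {retry_interval_s}s")
            sleep(retry_interval_s)
    raise TimeoutError(f"control plane unreachable after {max_attempts} attempts")
```

The log line is what a journal assertion would look for; the successful return corresponds to the registration the test then waits on.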

`vm-fleet-mtls-missing`

Pure transport-layer test. CP has tls.clientCa set. A client that has the CA cert (so it can verify the server) but no client key pair runs curl against /health and /api/v1/machines/{id}/report:

Without --cert → handshake failure at the TLS layer (asserted by grepping the curl verbose output for any of a set of TLS markers: alert, handshake, certificate required, SSL_ERROR, etc.).
Positive control with a valid client cert → HTTP response comes back (any status - what matters is the handshake completed).
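The marker check could be written along these lines. This is a hedged sketch: the marker tuple contains only the four markers named above (the suite's full set is not reproduced here), and the function name and the exit-code guard are assumptions. The guard is there because verbose output from a successful connection can also mention the handshake.

```python
# Only the markers named in the description; the real set is larger.
TLS_MARKERS = ("alert", "handshake", "certificate required", "SSL_ERROR")


def is_tls_failure(exit_code: int, curl_verbose_output: str) -> bool:
    """True when curl failed and its verbose output points at the TLS layer."""
    if exit_code == 0:
        # The request completed, so the handshake must have succeeded.
        return False
    lowered = curl_verbose_output.lower()
    return any(marker.lower() in lowered for marker in TLS_MARKERS)
```

In a test script the two inputs would come from machine.execute(), which returns the exit status and the captured output as a pair.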

`vm-fleet-mtls-cn-mismatch`

Application-layer test on top of mTLS. A client with a valid fleet-CA-signed cert (CN = wrong-agent) hits another agent’s endpoints (/api/v1/machines/web-01/...). The cn_matches_path_machine_id middleware rejects with 403 because the cert CN does not match the {id} path segment. Closes the impersonation gap: CA proves fleet membership, CN proves specific agent identity.
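The identity check can be sketched in a few lines. This is an illustration of the rule, not the middleware itself: the function names, the path regular expression, and the choice to pass through routes without a machine id are assumptions.

```python
import re
from typing import Optional

# Matches /api/v1/machines/<id> and anything below it.
_MACHINE_PATH = re.compile(r"^/api/v1/machines/([^/]+)(?:/.*)?$")


def machine_id_from_path(path: str) -> Optional[str]:
    """Extract the {id} path segment, or None for non-machine routes."""
    match = _MACHINE_PATH.match(path)
    return match.group(1) if match else None


def authorize(cert_cn: str, path: str) -> int:
    """Return the HTTP status the check would produce for this request."""
    machine_id = machine_id_from_path(path)
    if machine_id is not None and machine_id != cert_cn:
        # Valid fleet cert, but for a different agent: reject.
        return 403
    # Assumption: routes without a machine id are not subject to this check.
    return 200
```

For example, authorize("wrong-agent", "/api/v1/machines/web-01/report") yields 403, while the same path with CN web-01 passes.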

`vm-fleet-rollback-ssh`

Real nixfleet rollback --host target --ssh --generation <G1> end-to-end:

Deploy stub G2 via nixfleet deploy --ssh → target writes active=g2 marker file.
Pre-copy G1 to target via nix-copy-closure (rollback handler does NOT copy, it only SSHes and runs <gen>/bin/switch-to-configuration).
Run nixfleet rollback --host target --ssh --generation <G1> → target writes active=g1 marker.
Assert both G1 and G2 are still registered in target’s Nix DB (rollback did not delete the forward generation).

Shared VM test helpers

All scenario tests use helpers from modules/tests/_lib/helpers.nix (via modules/tests/vm-fleet-scenarios.nix which pre-binds them):

mkCpNode { testCerts, ... } - a CP node with standard mTLS wiring (CA + server cert, services.nixfleet-control-plane with clientCa), sqlite and python3 pre-installed.
mkAgentNode { testCerts, hostName, tags, healthChecks, ... } - an agent node with standard TLS, fleet CA trust, services.nixfleet-agent with pre-wired machineId/tags/dryRun. Escape hatch agentExtraConfig (merged via lib.recursiveUpdate into services.nixfleet-agent) handles per-scenario overrides like retryInterval or allowInsecure.
tlsCertsModule { testCerts, certPrefix } - a NixOS module fragment wiring the fleet CA plus a named client cert under /etc/nixfleet-tls/, for operator / builder / cache-style nodes that need TLS certs but aren’t a CP or an agent.
testPrelude { certPrefix ? "cp", api ? "https://localhost:8080" } - returns a Python prelude string with TEST_KEY, KEY_HASH, AUTH, CURL, API constants and a seed_admin_key(node) helper. Interpolate at the top of every testScript:
```
testScript = ''
  ${testPrelude {}}
  cp.start()
  cp.wait_for_unit("nixfleet-control-plane.service")
  cp.wait_for_open_port(8080)
  seed_admin_key(cp)
  ...
'';
```
mkTlsCerts { hostnames } (from _lib/helpers.nix) - builds the fleet CA + per-host cert pairs at Nix-eval time. Deterministic, no runtime setup.
nix-shim (from _lib/nix-shim.nix) - a writeShellApplication that intercepts nix eval / nix build with canned responses while delegating nix copy and other subcommands to the real nix at an immutable ${pkgs.nix}/bin/nix path. The absolute path is deliberate: installing the shim into systemPackages would collide with the real nix at /run/current-system/sw/bin/nix, and if the shim won the collision its fall-through branches would infinitely exec themselves. See the nixosTest gotchas section below.

nixosTest gotchas worth knowing

A few behaviours of the nixosTest framework itself that have bitten scenarios in this suite:

Shared /nix/store via 9p: every VM sees the host store read-only via a 9p mount. Any store path referenced anywhere in the test evaluation is visible as a file on every node regardless of whether it was ever copied there. test -e <storepath> assertions therefore always succeed and prove nothing. The workaround is to check the VM-local Nix database (nix-store -q --references <path>), which is per-VM.
systemd PATH for services: services like nixfleet-agent do not get /run/current-system/sw/bin in their PATH by default, so Command::new("sh") (relative lookup) fails with ENOENT. Use absolute paths like /bin/sh.
nix shim collisions: adding a shim package named "nix" to environment.systemPackages causes a silent collision with the real nix in /run/current-system/sw/bin/nix. The workaround is to keep the shim only on sessionVariables.PATH (which still pulls it into the closure via string interpolation) and never in systemPackages.
wait_for_unit vs wait_until_succeeds("systemctl is-active"): a systemd unit stuck in the activating state forever (e.g., due to a LoadCredential= failure) blocks wait_for_unit with no useful error. wait_until_succeeds(..., timeout=120) wrapped in a try/except that dumps systemctl status + the unit journal gives you an informative failure instead of an opaque hang.
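The last pattern can be wrapped in a small helper. This is a sketch: the helper name wait_for_active and the journalctl line limit are assumptions, and it relies on the driver's execute() returning an (exit status, output) pair.

```python
def wait_for_active(machine, unit: str, timeout: int = 120) -> None:
    """Wait for a unit to become active; dump diagnostics if it never does."""
    try:
        machine.wait_until_succeeds(f"systemctl is-active {unit}", timeout=timeout)
    except Exception:
        # execute() does not raise on non-zero exit, so it is safe to use here.
        _, status = machine.execute(f"systemctl status {unit} --no-pager")
        _, journal = machine.execute(f"journalctl -u {unit} --no-pager -n 200")
        print(f"--- systemctl status {unit} ---\n{status}")
        print(f"--- journal for {unit} ---\n{journal}")
        # Re-raise so the test still fails, now with context in the log.
        raise
```

A call such as wait_for_active(cp, "nixfleet-control-plane.service") then fails with the unit's status and journal in the driver output instead of hanging silently.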

Adding a new VM test

Create modules/tests/_vm-fleet-scenarios/<name>.nix following the vm-fleet-tag-sync.nix template.
Accept mkCpNode, mkAgentNode, mkTlsCerts, testPrelude, and tlsCertsModule via scenarioArgs (and pkgs, lib, etc. as needed with ...).
Register the subtest in modules/tests/vm-fleet-scenarios.nix.
Add the check name to the vm-fleet-* section in the project README (automatic discovery means no script edit is needed).

For non-fleet VM tests (single-subsystem things like vm-core / vm-infra) follow the pattern in modules/tests/vm.nix - use mkTestNode directly.

Shared `/nix/store` and the assertion classes it forbids (WONTFIX)

Every node in a nixosTest mounts the host’s /nix/store read-only via 9p. This means store-path existence checks (test -e /nix/store/...) are tautologically true on every node regardless of which node’s closure references the path. A nix copy between nodes appears to succeed even when it transferred zero bytes, because the receiver could already see the path via 9p.

The suite uses two workaround patterns instead of the heavy-weight per-VM store-image approach:

| Need | Workaround | Why it works |
| --- | --- | --- |
| Prove a command ran on a specific node | VM-local marker file under `/tmp` | `/tmp` is per-VM, never shared via 9p |
| Prove a path is registered in a node’s Nix DB | `nix-store -q --references <path>` on the target | The Nix DB (`/nix/var/nix/db`) is per-VM, only the store files are shared |

Concrete examples in the suite:

vm-fleet-deploy-ssh uses nix-store -q --references to prove nix-copy-closure --to actually registered the stub closure in the target’s Nix DB. With the 9p-mounted store, a test -e check would pass whether or not the copy happened.
vm-fleet-rollback-ssh uses the same pattern for the per-generation rollback assertion.
vm-fleet-apply-failure uses /tmp/stub-switch-called (a regular filesystem path, VM-local) as the load-bearing proof that switch-to-configuration switch was invoked.
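Both workaround patterns reduce to short helpers around the driver API. The sketch below is illustrative: the helper names are hypothetical and not taken from modules/tests/_lib/helpers.nix.

```python
def assert_registered(node, store_path: str) -> None:
    """Pass only if store_path is valid in this VM's own Nix database."""
    # nix-store exits non-zero for a path the local DB does not know,
    # even though the file is visible through the shared 9p store.
    node.succeed(f"nix-store -q --references {store_path}")


def assert_not_registered(node, store_path: str) -> None:
    """Pass only if this VM's Nix database does not know store_path."""
    node.fail(f"nix-store -q --references {store_path}")


def assert_marker(node, marker_path: str) -> None:
    """Pass only if a VM-local marker file exists on this node."""
    # Marker files live outside /nix/store, so they are never shared via 9p.
    node.succeed(f"test -e {marker_path}")
```

A push test would typically pair assert_registered on the receiving node with assert_not_registered on a node that should never have received the path.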

Why not per-VM store images

The alternative - virtualisation.useNixStoreImage = true; virtualisation.mountHostNixStore = false; - was considered and rejected: every node would rebuild its own store image, multiplying VM build cost for an assertion class that the workarounds already cover. No scenario in the current suite needs per-VM store isolation.

If a future scenario genuinely requires it (e.g. asserting on byte-level transfer through nix copy rather than DB registration), revisit this decision in a follow-up. Do not adopt per-VM store images preemptively - they cost real wall-clock minutes per CI run.
