VM Tests
VM tests boot real NixOS virtual machines under QEMU and assert runtime state via Python test scripts run by the nixosTest driver. They verify services start, ports listen, multi-node interactions work end-to-end, and rollout state machines behave as documented.
How to run
The canonical entry point is `nix run .#validate -- --all` (see
Testing Overview). For VM-only iteration:

    nix run .#validate -- --vm
All `vm-*` checks under `.#checks.<system>` are discovered dynamically by
the validate script, so new scenarios land in `--vm` / `--all`
automatically without touching it.
When `--vm` surfaces a specific VM failure, drill in:

    nix build .#checks.x86_64-linux.vm-fleet-apply-failure --no-link
    nix log /nix/store/<hash>-vm-test-run-vm-fleet-apply-failure.drv

`nix log` retrieves the full driver output (systemctl status, journals,
Python traceback) for a failed or past run.
Requirements
- Platform: x86_64-linux only (nixosTest uses QEMU)
- KVM: `/dev/kvm` for acceptable performance
- Disk space: each VM test builds a NixOS closure; expect several GB per test
- Time: minutes per test (closure build + parallel VM boots + assertions)
Test cycle
Each VM test goes through:
- Build - Nix evaluates the nodes' config and builds each node's system closure.
- Boot - QEMU launches one or more VMs in parallel; the shared host `/nix/store` is mounted read-only over 9p on every VM.
- Assert - a Python test script runs commands via the test driver API (`machine.succeed()`, `machine.fail()`, `machine.wait_for_unit()`, `machine.wait_until_succeeds(cmd, timeout=N)`).
- Cleanup - VMs shut down, driver reports pass/fail.
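The cycle above maps onto a single nixosTest definition. A minimal hypothetical sketch (node name, service, and assertion are illustrative, not a test from this suite):

```nix
# Hypothetical minimal scenario showing the build → boot → assert cycle.
{ pkgs, ... }:
pkgs.nixosTest {
  name = "vm-example";
  nodes.machine = { ... }: {
    services.openssh.enable = true;  # Build: this config becomes the node's closure
  };
  testScript = ''
    machine.start()                        # Boot: QEMU launches the VM
    machine.wait_for_unit("sshd.service")  # Assert: driver API calls
    machine.succeed("ss -tlnp | grep :22")
    # Cleanup happens automatically when the script ends
  '';
}
```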
Framework-level VM tests
These test one subsystem in isolation. Most are defined in modules/tests/vm*.nix.
vm-core
Boots a standard framework node (`defaultTestSpec`, no special flags) and verifies:
- `multi-user.target` reached
- `sshd` and `NetworkManager` running
- Firewall active (nftables input chain exists)
- Test user exists in the `wheel` group
- Core packages available to the user (`zsh`, `git`)
This is the “does everything still boot” smoke test.
vm-minimal
Boots a node with the endpoint role (minimal scope set) and verifies the minimal profile stays minimal:
- `multi-user.target` reached
- Core tools still present (`zsh` and `git` come from `core/nixos.nix`, not the base scope)
- Graphical/dev tools absent (e.g., `niri` not installed, Docker not running)
vm-infra
One node, four scopes in one VM for speed:
- Firewall - nftables active, SSH rate-limiting rules present (`limit rate 5/minute`), drop logging enabled.
- Monitoring - node exporter running, port 9100 responds with Prometheus text, `node_systemd` collector active.
- Backup - systemd timer registered, manual trigger writes `status.json` with `"status": "success"`.
- Secrets - SSH host key generated at `/etc/ssh/ssh_host_ed25519_key` with mode 600.
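The monitoring assertion boils down to checking which metric names appear in the Prometheus text exposition served on port 9100. A self-contained sketch of that check (the `metrics_text` sample here is fabricated; in the real test the body would come from curling the exporter inside the VM):

```python
def metric_names(exposition: str) -> set[str]:
    """Extract metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip # HELP / # TYPE comments and blank lines
        # A sample line is either `name{labels} value` or `name value`
        names.add(line.split("{")[0].split(" ")[0])
    return names

# Fabricated sample of node exporter output
metrics_text = """\
# HELP node_systemd_unit_state Systemd unit state
node_systemd_unit_state{name="sshd.service",state="active"} 1
node_cpu_seconds_total{cpu="0",mode="idle"} 123.45
"""

found = metric_names(metrics_text)
assert "node_systemd_unit_state" in found  # node_systemd collector active
assert "node_cpu_seconds_total" in found
```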
vm-fleet - “Tier A headline test”
4-node fleet: cp + web-01 + web-02 + db-01, with full mTLS (build-time CA + CP server cert + per-agent client certs, no `allowInsecure`).
- CP bootstraps an admin API key.
- All 3 agents register with tags (web × 2, db × 1).
- Canary rollout on tag `web` (strategy `staged`, batch sizes `["1","100%"]`) - both agents healthy, rollout reaches `completed`.
- Health-gate failure rollout on tag `db` (strategy `all_at_once`) - db-01's health check points at `http://localhost:9999/health`, which nothing listens on; the rollout hits `health_timeout` and pauses.
- Resume the paused rollout and verify it transitions out of `paused`.
- Metrics - CP `/metrics` exposes `nixfleet_fleet_size` and `nixfleet_rollouts_total`; agent node exporter on web-01 exposes `node_cpu`.
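The staged batch sizes `["1","100%"]` mix absolute counts with percentages. A hedged sketch of how such a spec could expand over a target set (an illustration of the shape, not the CP's actual implementation - in particular, whether percentages apply to the remainder or the total is an assumption here):

```python
import math

def expand_batches(batch_sizes: list[str], machines: list[str]) -> list[list[str]]:
    """Split `machines` into batches per a staged spec like ["1", "100%"].

    "1"    -> one machine in the batch
    "100%" -> all *remaining* machines (percentage of the remainder,
              assumed for this sketch).
    """
    remaining = list(machines)
    batches = []
    for spec in batch_sizes:
        if not remaining:
            break
        if spec.endswith("%"):
            n = math.ceil(len(remaining) * int(spec[:-1]) / 100)
        else:
            n = int(spec)
        batches.append(remaining[:n])
        remaining = remaining[n:]
    return batches

# Two web agents, canary-style: first batch 1 machine, second batch the rest
print(expand_batches(["1", "100%"], ["web-01", "web-02"]))
# → [['web-01'], ['web-02']]
```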
Fleet scenario subtests
Every CLI path, failure mode, and rollout branch has its own
independently buildable VM subtest under
`modules/tests/_vm-fleet-scenarios/*.nix`. The aggregator
`modules/tests/vm-fleet-scenarios.nix` exposes each one as
`.#checks.<system>.vm-fleet-<name>`.
vm-fleet-agent-rebuild
The only VM test in the suite that runs with `dryRun = false` - it proves
that the agent's real fetch → apply → verify pipeline works end-to-end.
The CP tells the agent to deploy a fabricated store path that does NOT
exist anywhere, with no cache URL configured; the agent must log
"not found locally and no cache URL configured" and leave
`/run/current-system` untouched. Indirect fetch-path coverage still
exists (`vm-fleet-release` for `nix copy` + harmonia, `vm-fleet-bootstrap`
for the happy-path report cycle).
vm-fleet-tag-sync
Real agent with `tags = ["web" "canary" "eu-west"]` in NixOS config. Asserts
that the tags appear in the CP `machine_tags` table after the first health report,
that filtering by a declared tag returns the agent, and that undeclared tags
do not leak into the table.
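The table-level assertions can be mirrored with plain sqlite3. A sketch against a fabricated minimal schema (the real `machine_tags` columns may differ):

```python
import sqlite3

# Fabricated two-column schema standing in for the CP's machine_tags table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE machine_tags (machine_id TEXT, tag TEXT)")

declared = ["web", "canary", "eu-west"]  # tags from the agent's NixOS config
db.executemany(
    "INSERT INTO machine_tags VALUES (?, ?)",
    [("web-01", t) for t in declared],
)

# Filtering by a declared tag returns the agent
rows = db.execute(
    "SELECT machine_id FROM machine_tags WHERE tag = ?", ("canary",)
).fetchall()
assert rows == [("web-01",)]

# Undeclared tags do not leak into the table
leaked = db.execute(
    "SELECT COUNT(*) FROM machine_tags WHERE tag NOT IN (?, ?, ?)",
    tuple(declared),
).fetchone()[0]
assert leaked == 0
```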
vm-fleet-bootstrap
End-to-end bootstrap flow:
- Start CP with an empty `api_keys` table.
- Operator runs `nixfleet bootstrap --name test-admin` - the CLI returns the first admin API key over mTLS.
- Use the returned key to `list machines` (empty), wait for two real agents (`web-01`, `web-02`) to register, then `list machines` again (2 visible).
- Create a release via `POST /api/v1/releases` pointing at each agent's real `/run/current-system` toplevel.
- POST a rollout targeting `tag=web` and wait for `status=completed`.
- Negative: a second `nixfleet bootstrap` call must fail (409 Conflict).
vm-fleet-release
Real `nixfleet release create --push-to ssh://root@cache` exercised against
a harmonia binary cache server:
- Uses the shared `nix-shim` (`modules/tests/_lib/nix-shim.nix`) to intercept `nix eval` and `nix build` on the builder node - it returns a canned store path - while delegating `nix copy` to the real nix so the binary transfer actually happens.
- The cache node runs `services.nixfleet-cache-server` (harmonia) with a build-time signing key baked in as a `/nix/store` path (avoids the `CREDENTIALS=243` race documented in TODO.md).
- Post-push, assert via the VM-local Nix database (`nix-store -q --references`) that the path is registered on `cache` and NOT on `cp`.
- The agent then fetches from `http://cache:5000` and the DB check passes on the agent too.
vm-fleet-deploy-ssh
Real `nixfleet deploy --hosts target --ssh --target root@target` - no CP
in the topology at all. The CLI calls `nix eval` (shim) → `nix build`
(shim) → `nix-copy-closure` (real) → ssh target `switch-to-configuration`
(real). A stub `switch-to-configuration` writes a marker file to `/tmp`
that the test asserts. Proves `--ssh` mode truly bypasses the CP.
vm-fleet-apply-failure
A command health check with a sentinel file
(`/var/lib/fail-next-health`) drives the failure path:
- Sentinel file created before the agent starts → first health report is unhealthy → rollout pauses (F1).
- Assert `current_generation` is still the agent's original toplevel (RB1 - the agent did not advance to the failing generation).
- Clear the sentinel, wait for `health_reports.all_passed = 1`, call `POST /api/v1/rollouts/{id}/resume`, and assert the rollout reaches `completed`.
This test covers two subtle bugs in the resume path: the rollout
executor must not re-mark a batch unhealthy from stale pre-resume
reports, and the agent's `CommandChecker` must use an absolute `/bin/sh`
so it works under a systemd unit `PATH`. A regression in either would
make the resume → completed transition hang.
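The sentinel mechanism is just a command health check whose exit status depends on a file's existence. A self-contained sketch (the suite uses `/var/lib/fail-next-health`; a temp path and the function name here are illustrative, chosen to keep the sketch runnable):

```python
import os
import tempfile

# Illustrative stand-in for /var/lib/fail-next-health
SENTINEL = os.path.join(tempfile.gettempdir(), "fail-next-health")

def command_health_check() -> bool:
    """Healthy unless the sentinel file exists - mirrors a command
    check along the lines of `test ! -e /var/lib/fail-next-health`."""
    return not os.path.exists(SENTINEL)

open(SENTINEL, "w").close()            # arm the sentinel -> report unhealthy
assert command_health_check() is False

os.remove(SENTINEL)                    # clear it -> subsequent reports healthy
assert command_health_check() is True
```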
vm-fleet-revert
2-agent staged rollout with `on_failure = revert`:
- Both agents healthy → first batch `succeeded`.
- The test then arms the sentinel on both agents so the next batch fails.
- The rollout executor walks `previous_generations` on succeeded batches and restores each machine's desired generation.
- Indirectly covers C3 (`HealthRunner::run_all` actually runs post-deploy) - if the health runner were dead code, the failing report would never arrive and the revert path wouldn't fire.
vm-fleet-timeout
The agent is configured, but its unit's `wantedBy` is forced to `[]` so the
process never starts. The CP records the machine in the release but sees zero
reports from it. The batch sits with `pending_count > 0` until
`health_timeout` elapses, at which point `evaluate_batch` pushes
`pending_count` into `unhealthy_count` and marks the batch failed.

Negative control: the reports table is empty for the machine - the pause
reason really is "timeout", not "agent reported a failure".
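The timeout semantics described above can be sketched as a pure function (field names follow this description, not necessarily the CP's actual code):

```python
from dataclasses import dataclass

@dataclass
class Batch:
    pending_count: int
    healthy_count: int
    unhealthy_count: int
    status: str = "running"

def evaluate_batch(batch: Batch, elapsed_s: int, health_timeout_s: int) -> Batch:
    """On timeout, machines that never reported count as unhealthy."""
    if batch.pending_count > 0 and elapsed_s >= health_timeout_s:
        batch.unhealthy_count += batch.pending_count
        batch.pending_count = 0
        batch.status = "failed"
    return batch

# One silent machine, timeout elapsed -> batch fails with unhealthy_count = 1
b = evaluate_batch(Batch(pending_count=1, healthy_count=0, unhealthy_count=0),
                   elapsed_s=300, health_timeout_s=120)
assert b.status == "failed" and b.unhealthy_count == 1 and b.pending_count == 0
```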
vm-fleet-poll-retry
The agent starts before the CP. The first poll hits a closed port (connection
refused). The agent's main loop schedules a retry at `retryInterval = 5s`.
Then the CP starts, and the agent's next retry succeeds. Asserts that the
agent journal contains the retry-scheduling log line, then waits for
registration.
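The retry behaviour is a plain poll loop with a fixed backoff. A sketch with a fake transport (the real agent speaks HTTPS to the CP; the function names here are illustrative):

```python
import time

def poll_until_registered(poll, retry_interval_s: float, max_attempts: int) -> int:
    """Retry `poll` until it succeeds; returns the attempt count.
    `poll` raising ConnectionRefusedError models the CP port being closed."""
    for attempt in range(1, max_attempts + 1):
        try:
            poll()
            return attempt
        except ConnectionRefusedError:
            # The real agent logs a retry-scheduling line here
            time.sleep(retry_interval_s)
    raise RuntimeError("CP never became reachable")

# Fake CP: refuses the first two polls, accepts the third
calls = {"n": 0}
def fake_poll():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError

attempts = poll_until_registered(fake_poll, retry_interval_s=0.01, max_attempts=10)
assert attempts == 3
```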
vm-fleet-mtls-missing
Pure transport-layer test. The CP has `tls.clientCa` set. A client with the CA
cert (so it can verify the server) but no client key pair sends curl requests
against `/health` and `/api/v1/machines/{id}/report`:
- Without `--cert` → handshake failure at the TLS layer (asserted by grepping the curl verbose output for any of a set of TLS markers: `alert`, `handshake`, `certificate required`, `SSL_ERROR`, etc.).
- Positive control with a valid client cert → an HTTP response comes back (any status - what matters is that the handshake completed).
vm-fleet-mtls-cn-mismatch
Application-layer test on top of mTLS. A client with a valid fleet-CA-signed cert (CN = `wrong-agent`) hits another agent's endpoints (`/api/v1/machines/web-01/...`). The `cn_matches_path_machine_id` middleware rejects with 403 because the cert CN does not match the `{id}` path segment. This closes the impersonation gap: the CA proves fleet membership, the CN proves specific agent identity.
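The middleware's core decision is a string comparison between the certificate CN and the `{id}` path segment. A minimal sketch (the real middleware lives in the CP's HTTP stack; the signature and the 404 fallback here are illustrative):

```python
def cn_matches_path_machine_id(cert_cn: str, path: str) -> int:
    """Return an HTTP status: 403 unless the client cert CN equals the
    machine id embedded in /api/v1/machines/{id}/... paths."""
    parts = path.strip("/").split("/")
    # Expect ["api", "v1", "machines", "<id>", ...]
    if len(parts) < 4 or parts[:3] != ["api", "v1", "machines"]:
        return 404  # not a per-machine endpoint (illustrative handling)
    return 200 if parts[3] == cert_cn else 403

# CA-valid but wrong identity: fleet membership alone is not enough
assert cn_matches_path_machine_id("wrong-agent", "/api/v1/machines/web-01/report") == 403
# Matching CN passes the identity gate
assert cn_matches_path_machine_id("web-01", "/api/v1/machines/web-01/report") == 200
```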
vm-fleet-rollback-ssh
Real `nixfleet rollback --host target --ssh --generation <G1>` end-to-end:
- Deploy stub `G2` via `nixfleet deploy --ssh` → target writes an `active=g2` marker file.
- Pre-copy `G1` to the target via `nix-copy-closure` (the rollback handler does NOT copy; it only SSHes and runs `<gen>/bin/switch-to-configuration`).
- Run `nixfleet rollback --host target --ssh --generation <G1>` → target writes an `active=g1` marker.
- Assert both G1 and G2 are still registered in the target's Nix DB (rollback did not delete the forward generation).
Shared VM test helpers
All scenario tests use helpers from `modules/tests/_lib/helpers.nix`
(via `modules/tests/vm-fleet-scenarios.nix`, which pre-binds them):
- `mkCpNode { testCerts, ... }` - a CP node with standard mTLS wiring (CA + server cert, `services.nixfleet-control-plane` with `clientCa`), `sqlite` and `python3` pre-installed.
- `mkAgentNode { testCerts, hostName, tags, healthChecks, ... }` - an agent node with standard TLS, fleet CA trust, and `services.nixfleet-agent` with pre-wired `machineId`/`tags`/`dryRun`. The escape hatch `agentExtraConfig` (merged via `lib.recursiveUpdate` into `services.nixfleet-agent`) handles per-scenario overrides like `retryInterval` or `allowInsecure`.
- `tlsCertsModule { testCerts, certPrefix }` - a NixOS module fragment wiring the fleet CA plus a named client cert under `/etc/nixfleet-tls/`, for operator / builder / cache-style nodes that need TLS certs but aren't a CP or an agent.
- `testPrelude { certPrefix ? "cp", api ? "https://localhost:8080" }` - returns a Python prelude string with `TEST_KEY`, `KEY_HASH`, `AUTH`, `CURL`, `API` constants and a `seed_admin_key(node)` helper. Interpolate it at the top of every `testScript`:

      testScript = ''
        ${testPrelude {}}
        cp.start()
        cp.wait_for_unit("nixfleet-control-plane.service")
        cp.wait_for_open_port(8080)
        seed_admin_key(cp)
        ...
      '';

- `mkTlsCerts { hostnames }` (from `_lib/helpers.nix`) - builds the fleet CA + per-host cert pairs at Nix-eval time. Deterministic, no runtime setup.
- `nix-shim` (from `_lib/nix-shim.nix`) - a `writeShellApplication` that intercepts `nix eval`/`nix build` with canned responses while delegating `nix copy` and other subcommands to the real nix at an immutable `${pkgs.nix}/bin/nix` path. The absolute path is deliberate: installing the shim into `systemPackages` would collide with the real nix at `/run/current-system/sw/bin/nix`, and if the shim won the collision its fall-through branches would infinitely exec themselves. See the nixosTest gotchas section below.
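The shim's dispatch is essentially a case statement over the first subcommand. A hedged sketch of that shape (not the suite's actual script; `REAL_NIX` and the canned path stand in for the immutable `${pkgs.nix}/bin/nix` path and the real canned response):

```shell
# Illustrative sketch of the nix-shim dispatch logic.
REAL_NIX="${REAL_NIX:-/run/current-system/sw/bin/nix}"  # assumption for this sketch
CANNED="/nix/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-toplevel"

nix_shim() {
  case "$1" in
    eval|build)
      # Intercepted: return a canned store path instead of evaluating/building
      echo "$CANNED"
      ;;
    *)
      # Delegated: nix copy etc. hit the real binary. The absolute path
      # avoids the self-exec loop described above.
      "$REAL_NIX" "$@"
      ;;
  esac
}

nix_shim eval .#toplevel    # prints the canned path
nix_shim build .#toplevel   # same canned path
```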
nixosTest gotchas worth knowing
A few behaviours of the nixosTest framework itself that have bitten scenarios in this suite:
- Shared `/nix/store` via 9p: every VM sees the host store read-only via a 9p mount. Any store path referenced anywhere in the test evaluation is visible as a file on every node, regardless of whether it was ever copied there. `test -e <storepath>` assertions are therefore invariant. The workaround is to check the VM-local Nix database (`nix-store -q --references <path>`), which is per-VM.
- systemd `PATH` for services: services like `nixfleet-agent` do not get `/run/current-system/sw/bin` in their `PATH` by default, so `Command::new("sh")` (relative lookup) fails with ENOENT. Use absolute paths like `/bin/sh`.
- nix shim collisions: adding a shim package named `"nix"` to `environment.systemPackages` causes a silent collision with the real `nix` in `/run/current-system/sw/bin/nix`. The workaround is to keep the shim only on `sessionVariables.PATH` (which still pulls it into the closure via string interpolation) and never in `systemPackages`.
- `wait_for_unit` vs `wait_until_succeeds("systemctl is-active")`: a systemd unit stuck in the `activating` state forever (e.g., due to a `LoadCredential=` failure) blocks `wait_for_unit` with no useful error. `wait_until_succeeds(..., timeout=120)` wrapped in a `try/except` that dumps `systemctl status` + the unit journal gives you an informative failure instead of an opaque hang.
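The try/except pattern from the last gotcha, as it could appear in a scenario's `testScript` (a sketch; the unit name is illustrative):

```nix
testScript = ''
  # Informative failure instead of an opaque wait_for_unit hang
  try:
      cp.wait_until_succeeds(
          "systemctl is-active nixfleet-control-plane.service", timeout=120
      )
  except Exception:
      # machine.execute returns (status, output); dump the state that matters
      print(cp.execute("systemctl status nixfleet-control-plane.service")[1])
      print(cp.execute("journalctl -u nixfleet-control-plane.service --no-pager")[1])
      raise
'';
```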
Adding a new VM test
- Create `modules/tests/_vm-fleet-scenarios/<name>.nix` following the `vm-fleet-tag-sync.nix` template.
- Accept `mkCpNode`, `mkAgentNode`, `mkTlsCerts`, `testPrelude`, and `tlsCertsModule` via `scenarioArgs` (and `pkgs`, `lib`, etc. as needed with `...`).
- Register the subtest in `modules/tests/vm-fleet-scenarios.nix`.
- Add the check name to the `vm-fleet-*` section in the project README (automatic discovery means no script edit is needed).
For non-fleet VM tests (single-subsystem things like vm-core / vm-infra),
follow the pattern in `modules/tests/vm.nix` - use `mkTestNode` directly.
Shared /nix/store and the assertion classes it forbids (WONTFIX)
Every node in a nixosTest mounts the host's `/nix/store` read-only via 9p.
This means store-path existence checks (`test -e /nix/store/...`) are
tautologically true on every node, regardless of which node's closure
references the path. A `nix copy` between nodes appears to succeed even
when it transferred zero bytes, because the receiver could already see
the path via 9p.

The suite uses two workaround patterns instead of the heavyweight per-VM store-image approach:
| Need | Workaround | Why it works |
|---|---|---|
| Prove a command ran on a specific node | VM-local marker file under `/tmp` | `/tmp` is per-VM, never shared via 9p |
| Prove a path is registered in a node's Nix DB | `nix-store -q --references <path>` on the target | The Nix DB (`/nix/var/nix/db`) is per-VM; only the store files are shared |
Concrete examples in the suite:
- `vm-fleet-deploy-ssh` uses `nix-store -q --references` to prove `nix-copy-closure --to` actually registered the stub closure in the target's Nix DB. The 9p-mounted store would make a `test -e` check invariant.
- `vm-fleet-rollback-ssh` uses the same pattern for the per-generation rollback assertion.
- `vm-fleet-apply-failure` uses `/tmp/stub-switch-called` (a regular filesystem path, VM-local) as the load-bearing proof that `switch-to-configuration switch` was invoked.
Why not per-VM store images
The alternative - `virtualisation.useNixStoreImage = true;` with
`virtualisation.mountHostNixStore = false;` - was considered and rejected:
every node would rebuild its own store image, multiplying VM build cost
for an assertion class that the workarounds already cover. No scenario
in the current suite needs per-VM store isolation.
If a future scenario genuinely requires it (e.g. asserting on byte-level
transfer through `nix copy` rather than DB registration), revisit this
decision in a follow-up. Do not adopt per-VM store images preemptively -
they cost real wall-clock minutes per CI run.