NixBench

26 tasks · 9 areas · 26 evaluators · 10 scored runs

Measuring AI coding agents on long-horizon Nix repository repair tasks with hidden shell evaluators and concrete worktree diffs.



26 tasks · updated June 25, 2026
| Run | Effort | Pass@1 | Score | Agent time | Failed | |
|---|---|---|---|---|---|---|
| GPT-5.5 via Codex CLI, 26-task corpus · 20260625T072711Z-e484ea0f | low | 81% | 2100 / 2600 | 19m 25s | 5 | 26/26 |
| GPT-5.5 via Codex CLI, 26-task corpus · 20260625T073226Z-3fce189c | medium | 73% | 1900 / 2600 | 22m 15s | 7 | 26/26 |
| GPT-5.5 via Codex CLI, 26-task corpus · 20260625T073227Z-167ae812 | high | 73% | 1900 / 2600 | 25m 59s | 7 | 26/26 |
| GPT-5.5 via Codex CLI, 26-task corpus · 20260624T182835Z-4ad8b555 (+2) | xhigh | 85% | 2200 / 2600 | 41m 03s | 4 | 26/26 |
| GPT-5.4 via Codex CLI, 26-task corpus · 20260625T073231Z-84de082a | low | 77% | 2000 / 2600 | 17m 45s | 6 | 26/26 |
| GPT-5.4 via Codex CLI, 26-task corpus · 20260625T073227Z-76c2964d | medium | 81% | 2100 / 2600 | 20m 55s | 5 | 26/26 |
| GPT-5.4 via Codex CLI, 26-task corpus · 20260625T073228Z-a5a4a383 | high | 85% | 2200 / 2600 | 28m 16s | 4 | 26/26 |
| GPT-5.4 via Codex CLI, 26-task corpus · 20260624T190640Z-fa04a19c (+2) | xhigh | 81% | 2100 / 2600 | 40m 02s | 5 | 26/26 |
| GPT-5.4 mini via Codex CLI, 26-task corpus · 20260624T194359Z-268b0abe (+2) | xhigh | 73% | 1900 / 2600 | 39m 46s | 7 | 26/26 |
| Claude Opus 4.8, 26-task corpus · 20260624T202141Z-881ef1e9 (+2) | default | 81% | 2100 / 2600 | 25m 24s | 5 | 26/26 |

Current rows are local artifacts from results/. The model comparison is summarized in run notes.
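
The Pass@1 and Score columns are consistent with simple arithmetic over pass and fail counts. The sketch below recomputes one row under the assumption that each passing task is worth 100 points. That value is inferred from rows such as 2100 / 2600 with 5 failures, not taken from the harness itself, and the function name `summarize_run` is hypothetical.

```python
def summarize_run(total_tasks: int, failed: int, points_per_task: int = 100) -> dict:
    """Recompute leaderboard columns from pass/fail counts.

    points_per_task=100 is an assumption inferred from the table,
    not a documented harness constant.
    """
    if total_tasks <= 0 or not 0 <= failed <= total_tasks:
        raise ValueError("failed must be between 0 and total_tasks")
    passed = total_tasks - failed
    return {
        # Pass@1 here is simply the share of tasks passed on the single attempt.
        "pass_at_1": round(100 * passed / total_tasks),
        "score": passed * points_per_task,
        "max_score": total_tasks * points_per_task,
    }


row = summarize_run(total_tasks=26, failed=5)
print(f"{row['pass_at_1']}% {row['score']} / {row['max_score']}")
```

With 26 tasks and 5 failures this reproduces the 81% and 2100 / 2600 shown for the matching rows.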

NixBench exists because plausible-looking Nix code often fails at evaluation time.

The benchmark gives agents a copied starter tree, a prompt, and no access to the hidden evaluator. It rewards final worktree behavior, not a fluent explanation of what the code should do.

contamination

Original repair tasks

Tasks are written for this corpus rather than lifted from merged patches, which keeps the answer out of the visible prompt.

scope

Nix-specific failure surfaces

The corpus covers flakes, modules, overlays, derivations, fetchers, Home Manager, shell escaping, and package contracts.

verification

Hand-written checks

Each task has a shell evaluator that checks behavior with small fake package sets and libraries instead of relying on LLM judging.

artifacts

Diff-backed runs

Every run records logs, timings, pass state, score JSON, and the final diff so failures can be inspected after the benchmark ends.

Task examples

Twenty-six small repositories, one hidden evaluator each.

All 26 tasks

Respect NixOS, Home Manager, and nix-darwin boundaries

Keep module outputs separated instead of leaking options across systems.

modules · hard

Patch Python CUDA package inputs

Repair Python/CUDA packaging without falling back to generic Linux path guesses.

packages · hard

Compose module paths from arguments

Build paths with Nix values while avoiding string interpolation traps.

nix-language · medium

Debug network symptoms without false leads

Explain the observed NixOS service behavior without chasing a plausible but wrong network diagnosis.

debugging · medium

Manage home files declaratively

Use Home Manager file and XDG options rather than imperative setup.

modules · easy

Pin a GitHub source fetcher

Preserve the fixed-output fetcher contract with a commit pin and SRI hash.

fetchers · easy

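
To make the "commit pin and SRI hash" wording in the fetcher task concrete, here is a minimal sketch of the two string shapes involved. It only illustrates the format: as far as I know, the hash Nix expects for an unpacked GitHub source is computed over the NAR serialization of the tree, not over raw bytes as done here. The helper names `to_sri` and `looks_like_commit_pin` are hypothetical, and the 40-character check assumes a SHA-1 Git repository.

```python
import base64
import hashlib


def to_sri(data: bytes, algo: str = "sha256") -> str:
    """Format a digest as an SRI string: '<algo>-<base64 digest>'."""
    digest = hashlib.new(algo, data).digest()
    return f"{algo}-{base64.b64encode(digest).decode('ascii')}"


def looks_like_commit_pin(rev: str) -> bool:
    """True for a full 40-hex-character commit id (assumes SHA-1 Git).

    Branch names and tags can move, so they do not count as pins.
    """
    return len(rev) == 40 and all(c in "0123456789abcdef" for c in rev)


print(to_sri(b""))
print(looks_like_commit_pin("main"))
```

A fixed-output fetcher is only reproducible when both pieces are fixed: the revision names one exact commit, and the hash names one exact result.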
7 easy tasks for syntax, lookup, stale options, and small contracts

15 medium repairs across flakes, containers, issue reports, overlays, and packaging

4 hard tasks for modules, overlays, and Python/CUDA package inputs

Methodology

The agent edits a worktree. The evaluator scores the result.

Run guide
  1. copy

    Starter files and the prompt enter a clean temporary workdir.

  2. edit

    The agent reads NIXBENCH_PROMPT.md and modifies only local files.

  3. check

    A hidden shell evaluator scores the final tree after the agent exits.

  4. record

    Logs, timing, score JSON, and the final diff are written under results/.
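
The four steps above can be sketched as a single function. This is an illustration under stated assumptions, not the actual NixBench harness: the evaluator is assumed to signal a pass through exit code 0, every file is assumed to be UTF-8 text, and the names `run_task`, `tree_diff`, `agent.log`, `score.json`, and `final.diff` are hypothetical. Only `NIXBENCH_PROMPT.md` and the `results/` location come from the description.

```python
import difflib
import json
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

PROMPT_NAME = "NIXBENCH_PROMPT.md"


def tree_diff(before: Path, after: Path) -> str:
    """Unified diff of two trees (text files only), ignoring the prompt file."""
    names = sorted(
        {p.relative_to(root).as_posix()
         for root in (before, after)
         for p in root.rglob("*")
         if p.is_file() and p.name != PROMPT_NAME}
    )
    chunks = []
    for name in names:
        old, new = before / name, after / name
        old_lines = old.read_text().splitlines(keepends=True) if old.is_file() else []
        new_lines = new.read_text().splitlines(keepends=True) if new.is_file() else []
        chunks.extend(difflib.unified_diff(old_lines, new_lines, f"a/{name}", f"b/{name}"))
    return "".join(chunks)


def run_task(starter: Path, prompt: str, agent_cmd: list, evaluator_cmd: list,
             results_dir: Path) -> dict:
    results_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        # 1. copy: starter files and the prompt enter a clean temporary workdir.
        workdir = Path(tmp) / "worktree"
        shutil.copytree(starter, workdir)
        (workdir / PROMPT_NAME).write_text(prompt)

        # 2. edit: the agent runs inside the workdir and is timed.
        started = time.monotonic()
        agent = subprocess.run(agent_cmd, cwd=workdir, capture_output=True, text=True)
        agent_seconds = time.monotonic() - started

        # 3. check: the evaluator lives outside the workdir and runs after the agent exits.
        check = subprocess.run(evaluator_cmd, cwd=workdir, capture_output=True, text=True)

        # 4. record: log, score JSON, and final diff are written to results_dir.
        record = {"passed": check.returncode == 0, "agent_seconds": round(agent_seconds, 3)}
        (results_dir / "agent.log").write_text(agent.stdout + agent.stderr)
        (results_dir / "score.json").write_text(json.dumps(record, indent=2))
        (results_dir / "final.diff").write_text(tree_diff(starter, workdir))
    return record
```

Because the agent only ever sees the copied workdir, the starter tree stays untouched and the evaluator script never appears among the files the agent can read.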

Run your agent

Add another row to the benchmark.

Use the local harness to run a CLI agent against the same copied-worktree contract and the same hidden evaluators as the rows above.
