Benchmark Design

NixBench is designed around agentic repair, not snippet generation.

A model is given an editable worktree and a task prompt. It must inspect the files, make changes, and leave the worktree in a state that passes hidden checks. This gives the benchmark room to measure practical behavior: reading requirements, editing the right file, avoiding irrelevant churn, running local checks, and handling Nix evaluation errors.

Goals

NixBench aims to measure:

Whether an agent can produce Nix code that evaluates.
Whether it can distinguish common Nix contexts: flakes, modules, overlays, derivations, shells, and fetchers.
Whether it can avoid plausible but unavailable helpers in restricted evaluators.
Whether it can preserve existing metadata, inputs, patches, and options.
Whether it can fix broken code rather than replacing it blindly.

Non-Goals

NixBench does not currently try to measure:

Large-scale nixpkgs contribution quality.
Long-running real package builds.
Human taste in Nix style.
End-to-end NixOS deployment behavior.
Cross-model leaderboard fairness across hardware and network conditions.

Those can be added later, but the initial corpus focuses on fast deterministic evaluation.

Why Fake Builders Are Useful

Many tasks use fake builders such as:

stdenv.mkDerivation = attrs: attrs // { __mkDerivation = true; };

This pattern makes the evaluator inspect the structure of the candidate Nix expression without building a real derivation. That is useful because:

It avoids network and cache dependence.
It keeps task runtime low.
It catches attr-level mistakes precisely.
It makes hidden tests easy to author.

Real-build tasks are still useful, but they should be marked separately because they are slower and less deterministic.

Why Hidden Evaluators Matter

If the public prompt says "include mainProgram", a weak model can satisfy the visible text with a string search strategy. A hidden evaluator can check semantic shape instead:

assert pkg.meta.mainProgram == "tinygrep";

Hidden evaluators also catch overfitting. For example, a prompt may include one package set, while the evaluator uses another package set with disabled packages, missing fields, or different system lists.

Task Difficulty

Suggested difficulty levels:

Easy: one file, direct requirements, no subtle Nix semantics.
Medium: one or two files, multiple constraints, some idiom required.
Hard: module system, overlays, recursive values, or purity constraints.

Difficulty should reflect the expected reasoning load, not the number of lines changed.

Corpus Health Checks

Every task should satisfy two checks:

python3 bench.py validate --solution reference
python3 bench.py validate --solution starter

The reference should pass. The starter should usually fail. If a starter passes, the task is not measuring anything useful.

Benchmark Integrity

For fair runs:

Do not expose tests/check.sh to the agent workdir.
Do not expose the original task directory, reference solution, or score file path to the agent.
Use the same timeout for every comparable agent.
Record the exact agent command.
Keep summary.json, result.json, agent.log, check.log, and diff.patch.
Pin task corpus versions when comparing results over time.