docs/benchmark-design.md
Static HTML generated from the repository markdown.
NixBench is designed around agentic repair, not snippet generation.
A model is given an editable worktree and a task prompt. It must inspect the files, make changes, and leave the worktree in a state that passes hidden checks. This gives the benchmark room to measure practical behavior: reading requirements, editing the right file, avoiding irrelevant churn, running local checks, and handling Nix evaluation errors.
NixBench aims to measure:
NixBench does not currently try to measure:
Those can be added later, but the initial corpus focuses on fast deterministic evaluation.
Many tasks use fake builders such as:
stdenv.mkDerivation = attrs: attrs // { __mkDerivation = true; };
This pattern lets the evaluator inspect the structure of the candidate Nix expression without building a real derivation. That is useful because:
Real-build tasks are still useful, but they should be marked separately because they are slower and less deterministic.
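One way a Python harness might drive this kind of structural evaluation is to hand the fake builder to the candidate file and ask Nix for JSON. This is a minimal sketch, not NixBench's actual evaluator: it assumes the candidate file is a function of the form `{ stdenv, ... }: stdenv.mkDerivation { ... }` whose result contains only JSON-serializable values, and the names `eval_command` and `evaluate_structure` are hypothetical.

```python
import json
import subprocess

# Fake builder from the text above: it returns its arguments plus a marker
# attribute instead of producing a real derivation.
FAKE_STDENV = "{ mkDerivation = attrs: attrs // { __mkDerivation = true; }; }"


def eval_command(candidate_path: str) -> list[str]:
    """Build the nix-instantiate command that evaluates the candidate.

    Assumption: the candidate is a function taking `stdenv`, so `--arg`
    can supply the fake builder when Nix auto-calls it.
    """
    return [
        "nix-instantiate",
        "--eval",    # evaluate only, do not instantiate a derivation
        "--strict",  # force nested attributes so they can be printed
        "--json",    # print the result as JSON
        "--arg", "stdenv", FAKE_STDENV,
        candidate_path,
    ]


def evaluate_structure(candidate_path: str, run=subprocess.run) -> dict:
    """Return the candidate's attribute set as a Python dict."""
    proc = run(
        eval_command(candidate_path),
        capture_output=True,
        text=True,
        check=True,  # raise if Nix reports an evaluation error
    )
    return json.loads(proc.stdout)
```

The `run` parameter is injectable so the parsing logic can be exercised without Nix installed; a real run would leave it at the default.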
If the public prompt says "include mainProgram", a weak model can satisfy the literal wording by making that string appear somewhere in the file, whether or not the attribute is actually set. A hidden evaluator can check semantic shape instead:
assert pkg.meta.mainProgram == "tinygrep";
Hidden evaluators also catch overfitting. For example, a prompt may include one package set, while the evaluator uses another package set with disabled packages, missing fields, or different system lists.
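The gap between matching text and checking semantic shape can be illustrated in the harness language. The sketch below assumes the candidate has already been evaluated into JSON-like data (for example through a fake-builder evaluation); the function names and the sample candidate are hypothetical and only serve the illustration.

```python
def string_search_passes(source_text: str) -> bool:
    """Naive check: the word appears anywhere in the file, even in a comment."""
    return "mainProgram" in source_text


def semantic_check_passes(pkg: dict, expected: str = "tinygrep") -> bool:
    """Mirror `assert pkg.meta.mainProgram == "tinygrep";` on evaluated data."""
    meta = pkg.get("meta")
    return isinstance(meta, dict) and meta.get("mainProgram") == expected


# A candidate that only mentions the attribute in a comment.
candidate_source = '{ pname = "tinygrep"; meta = { }; } # TODO: include mainProgram'
# What that candidate would evaluate to: meta exists but has no mainProgram.
candidate_value = {"pname": "tinygrep", "meta": {}}

print(string_search_passes(candidate_source))  # True
print(semantic_check_passes(candidate_value))  # False
```

The same candidate passes the text-level check and fails the structural one, which is the behavior a hidden evaluator is meant to expose.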
Suggested difficulty levels:
Difficulty should reflect the expected reasoning load, not the number of lines changed.
Every task should satisfy two checks:
python3 bench.py validate --solution reference
python3 bench.py validate --solution starter
The reference should pass. The starter should usually fail. If a starter passes, the task is not measuring anything useful.
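The two-check rule could be automated when reviewing new tasks. The sketch below shells out to the two commands shown above and assumes, without confirmation from the source, that `bench.py validate` exits with status 0 on a pass and nonzero on a failure; `solution_passes` and `task_is_useful` are hypothetical helpers.

```python
import subprocess
from typing import Callable


def solution_passes(solution: str) -> bool:
    """Run `bench.py validate` for one solution.

    Assumption: exit status 0 means the checks passed.
    """
    proc = subprocess.run(
        ["python3", "bench.py", "validate", "--solution", solution],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def task_is_useful(passes: Callable[[str], bool] = solution_passes) -> bool:
    """A task discriminates only if the reference passes and the starter fails."""
    return passes("reference") and not passes("starter")
```

This version is strict about the starter failing. A task whose starter is expected to pass would need an explicit exception rather than a silent skip.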
For fair runs:
tests/check.sh to the agent workdir.
summary.json, result.json, agent.log, check.log, and diff.patch.