docs/reproducibility.md
Static HTML generated from the repository markdown.
NixBench is built to make benchmark runs inspectable, but full reproducibility still depends on the agent, model, machine, and task corpus version.
For each task, the harness records:
The aggregate run writes results/<run-id>/summary.json.
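As a rough illustration of how those aggregate files could be collected for comparison, the sketch below loads every `summary.json` found one level below a results directory. The function name `load_summaries` is hypothetical and the contents of each summary are treated as opaque JSON, since the schema is not described here.

```python
import json
from pathlib import Path


def load_summaries(results_dir: Path) -> dict[str, object]:
    """Map each run id to the parsed contents of results/<run-id>/summary.json.

    Hypothetical helper: only the path layout comes from the docs; the
    JSON payload is returned as-is without assuming any fields.
    """
    summaries: dict[str, object] = {}
    for summary_path in sorted(results_dir.glob("*/summary.json")):
        run_id = summary_path.parent.name  # directory name is the run id
        with summary_path.open(encoding="utf-8") as fh:
            summaries[run_id] = json.load(fh)
    return summaries
```

Calling `load_summaries(Path("results"))` would return one entry per run directory that contains a summary file; run directories without one are skipped.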
When publishing or comparing results, record:
The bundled tasks are mostly deterministic because evaluators use local Nix evaluation and fake builders. Agent behavior is not deterministic unless the agent and model expose a reliable deterministic mode.
For serious comparisons, run each model multiple times and report:
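A minimal sketch of summarizing repeated runs follows. It assumes each run has already been reduced to a single pass rate between 0 and 1. The statistics chosen (mean, sample standard deviation, minimum, maximum) and the name `summarize_runs` are illustrative assumptions, not a list prescribed by NixBench.

```python
import statistics


def summarize_runs(pass_rates: list[float]) -> dict[str, float]:
    """Summarize per-run pass rates for one model (hypothetical helper)."""
    if not pass_rates:
        raise ValueError("need at least one run")
    return {
        "runs": len(pass_rates),
        "mean": statistics.mean(pass_rates),
        # Sample standard deviation is undefined for a single run; report 0.0.
        "stdev": statistics.stdev(pass_rates) if len(pass_rates) > 1 else 0.0,
        "min": min(pass_rates),
        "max": max(pass_rates),
    }
```

Reporting the spread alongside the mean makes it visible when two models differ by less than their own run-to-run variation.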
Changing a task prompt, starter, evaluator, or reference solution changes the benchmark. Treat corpus changes as benchmark-version changes.
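One way to make corpus changes detectable is to record a fingerprint of the task files with each result. The sketch below hashes every file under a tasks directory. `corpus_fingerprint` is a hypothetical helper, not part of the harness, and it assumes the whole corpus lives under one directory.

```python
import hashlib
from pathlib import Path


def corpus_fingerprint(tasks_dir: Path) -> str:
    """Return a SHA-256 hex digest over the paths and contents of all task files."""
    digest = hashlib.sha256()
    # Sort by POSIX-style relative path so the result does not depend on
    # filesystem enumeration order.
    files = sorted(
        (p.relative_to(tasks_dir).as_posix(), p)
        for p in tasks_dir.rglob("*")
        if p.is_file()
    )
    for rel, path in files:
        name = rel.encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so file boundaries are unambiguous.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Two results with different fingerprints were produced against different corpora and, under the rule above, belong to different benchmark versions.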
Suggested policy:
The hidden evaluator is intentionally outside the copied workdir. Do not give the agent paths to tasks/<id>/tests/check.sh. Do not include hidden assertions in prompt.md.
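A simple pre-flight check can catch accidental leaks before the agent starts. The sketch below scans a copied workdir for an evaluator script or any file that mentions its path. The helper `find_evaluator_leaks` and the idea of scanning file contents are assumptions for illustration. The harness may enforce this differently.

```python
from pathlib import Path

# Path suffix of the hidden evaluator, as named in the docs.
EVALUATOR_MARKER = "tests/check.sh"


def find_evaluator_leaks(workdir: Path) -> list[str]:
    """List files in workdir that are, or mention, the hidden evaluator."""
    leaks: list[str] = []
    for path in sorted(workdir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(workdir).as_posix()
        # Case 1: the evaluator script itself was copied into the workdir.
        is_evaluator = rel == EVALUATOR_MARKER or rel.endswith("/" + EVALUATOR_MARKER)
        # Case 2: some file (prompt, notes, config) references its path.
        mentions = EVALUATOR_MARKER in path.read_text(encoding="utf-8", errors="ignore")
        if is_evaluator or mentions:
            leaks.append(rel)
    return leaks
```

An empty list means no obvious leak was found. It does not prove that hidden assertions are absent from the prompt, so the prompt still needs a manual review.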
For public leaderboards, use a private hidden corpus in addition to the public corpus.