docs/task-format.md
Static HTML generated from the repository markdown.
Each benchmark task lives in `tasks/<task-id>/`.
Static HTML generated from the repository markdown.
Each benchmark task lives in tasks/<task-id>/.
tasks/<task-id>/
metadata.toml
prompt.md
starter/
reference/
tests/
check.sh
metadata.tomlRequired fields:
id = "package-stdenv-cli"
name = "Package A stdenv CLI"
category = "packages"
difficulty = "medium"
timeout_seconds = 60
max_score = 100
systems = ["any"]
evaluator = "tests/check.sh"
systems = ["any"] means the task is system-independent. Use Nix system names such as x86_64-linux or aarch64-darwin for system-specific tasks.
prompt.mdThis is the prompt copied into the workdir as NIXBENCH_PROMPT.md. It should describe the task requirements and expected files, but not reveal hidden evaluator details.
starter/Starter files are copied into a temporary workdir. The agent edits only this copy.
reference/Reference files are overlaid onto the starter when running:
python3 bench.py validate --solution reference
The reference solution should pass the evaluator and act as a regression fixture for task authors.
tests/check.shThe evaluator runs as:
/bin/sh tests/check.sh "$NIXBENCH_WORKDIR"
It should exit 0 for pass and non-zero for fail. It may write a JSON score to $NIXBENCH_SCORE_FILE:
{
"score": 80,
"notes": ["main behavior passed", "style check failed"]
}
If no score file is written, the harness awards full score on pass and zero on fail.
NIXBENCH_SCORE_FILE is only provided to the evaluator, not to the agent command.
Before adding a task to the corpus:
Use categories that describe the Nix skill being tested:
nix-languageflakespackagesmodulesoverlaysfetchersdevshellsdebuggingpurityUse systems = ["any"] for pure evaluation tasks. Reserve real Nix systems for tasks that depend on platform-specific behavior.