docs/scoring.md
Static HTML generated from the repository markdown.
NixBench uses task-level objective scoring first.
Default scoring:
- Evaluator exit code 0: full max_score.
- Any other outcome: 0.0 by default.

Evaluators can provide partial credit by writing JSON to $NIXBENCH_SCORE_FILE, which is an evaluator-only absolute path:
{
"score": 70,
"max_score": 100,
"notes": [
"evaluation passes",
"metadata missing mainProgram"
]
}
The harness records this detail in result.json.
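As a rough illustration, an evaluator written in Python could emit the partial-credit file along these lines. Only `$NIXBENCH_SCORE_FILE` and the three JSON keys come from the text above; the helper name `write_score` and the choice to do nothing when the variable is unset are assumptions made for this sketch, not documented harness behavior.

```python
import json
import os


def write_score(score, max_score, notes):
    """Write partial-credit JSON to $NIXBENCH_SCORE_FILE (hypothetical helper)."""
    path = os.environ.get("NIXBENCH_SCORE_FILE")
    if not path:
        # Assumption: outside the harness the variable is unset, so skip the write.
        return None
    payload = {"score": score, "max_score": max_score, "notes": list(notes)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    return path
```

An evaluator would call `write_score(70, 100, ["evaluation passes", "metadata missing mainProgram"])` before exiting with whatever status reflects the overall check.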
Evaluator-provided scores are clamped to the task's [0, max_score] range. Partial credit can be recorded even when the evaluator exits non-zero or the agent timed out, but passed remains false unless the agent completed and the evaluator exited 0.
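The clamping and pass rules can be sketched as a small function. This is not the harness's actual code: the name `resolve_result` and its parameters are hypothetical, and the default branch assumes that an evaluator exit code of 0 yields the full `max_score` and anything else yields `0.0` when no score file was written.

```python
def resolve_result(max_score, evaluator_exit_code, agent_completed, reported_score=None):
    """Combine evaluator output into a final score and pass flag (illustrative only)."""
    # passed requires both a completed agent run and a zero evaluator exit code.
    passed = bool(agent_completed) and evaluator_exit_code == 0
    if reported_score is None:
        # Assumed default: full credit on exit 0, otherwise 0.0.
        score = float(max_score) if evaluator_exit_code == 0 else 0.0
    else:
        # Evaluator-provided scores are clamped to [0, max_score].
        score = min(max(float(reported_score), 0.0), float(max_score))
    return {"score": score, "passed": passed}
```

Under these assumptions, a run where the evaluator reports 70 out of 100 but exits non-zero keeps its 70 points while `passed` stays false.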
For larger tasks, split hidden checks roughly like this:
- mkIf, overrideAttrs, fixed-output fetchers, phases, and per-system helpers.
- nixfmt-rfc-style, statix, and deadnix where applicable.

Keep the public prompt stable. Add hidden cases when models overfit obvious examples.
When analyzing failed runs, tag failures with one or more classes:
- timeout: the agent exceeded --agent-timeout-seconds.
- syntax: Nix parsing failed.
- evaluation: Nix parsed but evaluation failed.
- missing-attr: required attribute was absent.
- wrong-value: required attribute existed but had the wrong value.
- unavailable-helper: solution used helpers not present in the evaluator.
- impurity: solution referenced host paths, environment variables, or external state.
- overfit: solution hardcoded the visible example and failed hidden inputs.

These tags are not enforced by the harness yet, but they are useful for comparing models.
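Since the harness does not enforce the failure classes, a small analysis script can check them and count them per model. The sketch below assumes failed runs are available as `(model, tags)` pairs; that input shape and the names `FAILURE_TAGS` and `tally_tags` are hypothetical, and only the tag names themselves come from the list above.

```python
from collections import Counter

# Tag names taken from the failure classes listed above.
FAILURE_TAGS = frozenset({
    "timeout", "syntax", "evaluation", "missing-attr",
    "wrong-value", "unavailable-helper", "impurity", "overfit",
})


def tally_tags(failed_runs):
    """Count failure tags per model, rejecting tags outside FAILURE_TAGS."""
    counts = {}
    for model, tags in failed_runs:
        unknown = set(tags) - FAILURE_TAGS
        if unknown:
            raise ValueError(f"unknown failure tags: {sorted(unknown)}")
        counts.setdefault(model, Counter()).update(tags)
    return counts
```

Rejecting unknown tags keeps spelling variants from splitting one class into two when models are compared.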
A useful result report should include:
- check.log and diff.patch for failed tasks.