Scoring

NixBench uses task-level objective scoring first.

Default scoring:

Evaluator exits 0: full max_score.
Evaluator exits non-zero or times out: 0.
Agent times out: the run is marked failed and receives 0 by default.

Evaluators can provide partial credit by writing JSON to $NIXBENCH_SCORE_FILE, which is an evaluator-only absolute path:

{
  "score": 70,
  "max_score": 100,
  "notes": [
    "evaluation passes",
    "metadata missing mainProgram"
  ]
}

The harness records this detail in result.json.

Evaluator-provided scores are clamped to the task's [0, max_score] range. Partial credit can be recorded even when the evaluator exits non-zero or the agent timed out, but passed remains false unless the agent completed and the evaluator exited 0.

Recommended Rubric

For larger tasks, split hidden checks roughly like this:

70% functional correctness: evaluation, build behavior, generated config, or expected attributes.
15% Nix idiom: correct use of mkIf, overrideAttrs, fixed-output fetchers, phases, and per-system helpers.
10% maintainability: minimal unrelated changes, readable structure, clear attr names.
5% formatting and lint: nixfmt-rfc-style, statix, and deadnix where applicable.

Keep the public prompt stable. Add hidden cases when models overfit obvious examples.

Failure Classes

When analyzing failed runs, tag failures with one or more classes:

timeout: the agent exceeded --agent-timeout-seconds.
syntax: Nix parsing failed.
evaluation: Nix parsed but evaluation failed.
missing-attr: required attribute was absent.
wrong-value: required attribute existed but had the wrong value.
unavailable-helper: solution used helpers not present in the evaluator.
impurity: solution referenced host paths, environment variables, or external state.
overfit: solution hardcoded the visible example and failed hidden inputs.

These tags are not enforced by the harness yet, but they are useful for comparing models.

Reporting Results

A useful result report should include:

NixBench commit.
Agent command.
Model name.
Timeout.
Overall score.
Per-task pass/fail.
Per-task duration.
Failure classes.
Links to check.log and diff.patch for failed tasks.

docs/scoring.md

Recommended Rubric

Failure Classes

Reporting Results