docs/runs/2026-06-24-model-comparison.md
Static HTML generated from the repository markdown.
These are local NixBench runs against the full 26-task corpus.
Static HTML generated from the repository markdown.
These are local NixBench runs against the full 26-task corpus.
The raw results/ artifacts are intentionally not tracked because they contain logs, temporary diffs, and machine-local paths. This document records the durable statistics needed for comparison.
The first comparison on June 24 covered the then-current 24-task corpus. On June 25, home-manager-extra-special-args and overlay-module-boundary were added to the displayed comparison by running those two tasks separately for each baseline model. Rows marked (+2) combine the original 24-task run with those two supplemental task artifacts.
On June 27, gpt-5.4-mini was run across the missing low, medium, and high Codex reasoning effort settings.
On June 27 and June 28, Claude Opus 4.8 was run with explicit Claude Code effort settings.
| Agent | Model | Effort | Run ID | Status | Score | Passed | Failed | Timeout Count | Agent Time |
|---|---|---|---|---|---|---|---|---|---|
| Codex CLI | gpt-5.5 | low | 20260625T072711Z-e484ea0f | complete | 2100/2600 | 21 | 5 | 0 | 19m 25s |
| Codex CLI | gpt-5.5 | medium | 20260625T073226Z-3fce189c | complete | 1900/2600 | 19 | 7 | 0 | 22m 15s |
| Codex CLI | gpt-5.5 | high | 20260625T073227Z-167ae812 | complete | 1900/2600 | 19 | 7 | 0 | 25m 59s |
| Codex CLI | gpt-5.5 | xhigh | 20260624T182835Z-4ad8b555 (+2) | complete | 2200/2600 | 22 | 4 | 0 | 41m 03s |
| Codex CLI | gpt-5.4 | low | 20260625T073231Z-84de082a | complete | 2000/2600 | 20 | 6 | 0 | 17m 45s |
| Codex CLI | gpt-5.4 | medium | 20260625T073227Z-76c2964d | complete | 2100/2600 | 21 | 5 | 0 | 20m 55s |
| Codex CLI | gpt-5.4 | high | 20260625T073228Z-a5a4a383 | complete | 2200/2600 | 22 | 4 | 0 | 28m 16s |
| Codex CLI | gpt-5.4 | xhigh | 20260624T190640Z-fa04a19c (+2) | complete | 2100/2600 | 21 | 5 | 0 | 40m 02s |
| Codex CLI | gpt-5.4-mini | low | 20260627T154139Z-a762c204 | complete | 2100/2600 | 21 | 5 | 0 | 13m 51s |
| Codex CLI | gpt-5.4-mini | medium | 20260627T154149Z-11277553 | complete | 2100/2600 | 21 | 5 | 0 | 17m 59s |
| Codex CLI | gpt-5.4-mini | high | 20260627T154205Z-4e04ba57 | complete | 2100/2600 | 21 | 5 | 1 | 29m 05s |
| Codex CLI | gpt-5.4-mini | xhigh | 20260624T194359Z-268b0abe (+2) | complete | 1900/2600 | 19 | 7 | 3 | 39m 46s |
| Claude Code | claude-opus-4-8 | low | 20260627T214634Z-4e66a550 | complete | 1900/2600 | 19 | 7 | 0 | 9m 02s |
| Claude Code | claude-opus-4-8 | high | 20260628T071803Z-ca820a9b | complete | 2100/2600 | 21 | 5 | 0 | 12m 58s |
| Claude Code | claude-opus-4-8 | xhigh | 20260628T154937Z-ed26a81d | complete | 2100/2600 | 21 | 5 | 0 | 22m 03s |
All runs used a 240 second per-task agent timeout and the same task prompt contract: read NIXBENCH_PROMPT.md, edit only the copied task workdir, do not inspect hidden evaluators, run local checks if useful, then stop.
| Model | Notes |
|---|---|
gpt-5.5 | Low effort recorded 21/26, medium and high recorded 19/26, and the xhigh baseline recorded 22/26. |
gpt-5.4 | Scores increased from 20/26 at low effort to 21/26 at medium and 22/26 at high, while xhigh recorded 21/26. |
gpt-5.4-mini | Low, medium, and high effort each recorded 21/26; the xhigh baseline recorded 19/26 with three timeouts. |
claude-opus-4-8 | Explicit low, high, and xhigh efforts recorded 19/26, 21/26, and 21/26. |
| Highest score | gpt-5.4 high and gpt-5.5 xhigh both scored 2200/2600; gpt-5.4 high used less agent time. |
This table shows the xhigh/default baseline columns used by the detailed results page.
| Task | GPT-5.5 xhigh | GPT-5.4 xhigh | GPT-5.4 mini xhigh | Claude Opus 4.8 |
|---|---|---|---|---|
container-native-vs-oci | FAIL | FAIL | FAIL | FAIL |
debug-infinite-recursion | PASS | PASS | PASS | PASS |
debug-network-false-lead | FAIL | FAIL | FAIL | FAIL |
devshell-tooling-contract | PASS | PASS | PASS | PASS |
fetcher-source-pin | PASS | PASS | PASS | PASS |
fhs-binary-wrapper | PASS | PASS | PASS | PASS |
flake-input-package-selection | PASS | PASS | PASS | PASS |
flake-per-system-outputs | PASS | PASS | FAIL | PASS |
home-manager-extra-special-args | PASS | PASS | PASS | PASS |
home-manager-wsl-module-import | PASS | PASS | PASS | PASS |
home-manager-xdg-files | PASS | PASS | PASS | PASS |
issue-report-quality | FAIL | FAIL | FAIL | FAIL |
lang-attrsets-normalize | PASS | PASS | PASS | PASS |
module-path-composition | PASS | PASS | PASS | PASS |
module-service-options | FAIL | PASS | PASS | PASS |
module-stale-option-migration | PASS | PASS | PASS | PASS |
module-system-boundaries | PASS | PASS | PASS | PASS |
mutable-config-home-manager | PASS | PASS | PASS | PASS |
overlay-module-boundary | PASS | PASS | PASS | PASS |
overlay-override-package | PASS | PASS | PASS | PASS |
package-name-lookup-contract | PASS | PASS | PASS | PASS |
package-python-application | PASS | FAIL | FAIL | PASS |
package-stdenv-cli | PASS | FAIL | FAIL | FAIL |
purity-wrapper-derivation | PASS | PASS | PASS | PASS |
python-cuda-uv2nix-patch | PASS | PASS | PASS | PASS |
string-escaping-systemd | PASS | PASS | FAIL | FAIL |
| Task | Notes |
|---|---|
container-native-vs-oci | The xhigh/default baseline rows did not pass this task, while GPT-5.5 low passed it during the effort sweep. |
debug-network-false-lead | No recorded row in this set passed this task; GPT-5.4 mini xhigh timed out. |
issue-report-quality | No recorded row in this set passed this task. |
package-stdenv-cli | Only GPT-5.5 xhigh passed this task among the GPT effort rows. |
Effort sweep Codex runs used this shape, with <model> set to gpt-5.5 or gpt-5.4 and <effort> set to low, medium, or high:
python3 -u bench.py run-all --agent-timeout-seconds 240 --agent-cmd 'codex -c model_reasoning_effort=<effort> --ask-for-approval never exec -m <model> --ephemeral --skip-git-repo-check --sandbox workspace-write "You are in a temporary NixBench benchmark task workspace. Read NIXBENCH_PROMPT.md, then edit the local starter files to satisfy it. Only modify files in this directory. Do not inspect hidden evaluator files or the original task directory. Run local checks if useful, then stop."'
The original xhigh/default comparison commands did not pass -c model_reasoning_effort=...; the local Codex config was set to model_reasoning_effort = "xhigh" for the Codex baseline runs.
The June 27 gpt-5.4-mini effort runs used the same command shape with <model> set to gpt-5.4-mini.
The Claude Opus effort runs used the same Claude Code command shape with --model opus and --effort low, --effort high, or --effort xhigh.