Run Notes

2026-06-24 and 2026-06-25 Model Comparison Runs

These are local NixBench runs against the full 26-task corpus.

Source

docs/runs/2026-06-24-model-comparison.md

Static HTML generated from the repository markdown.

The raw results/ artifacts are intentionally not tracked because they contain logs, temporary diffs, and machine-local paths. This document records the durable statistics needed for comparison.

The first comparison on June 24 covered the then-current 24-task corpus. On June 25, home-manager-extra-special-args and overlay-module-boundary were added to the displayed comparison by running those two tasks separately for each baseline model. Rows marked (+2) combine the original 24-task run with those two supplemental task artifacts.
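
The (+2) combination described above can be sketched as a simple merge of per-task results. This is only an illustration, not the code bench.py uses: the function names, the placeholder base-run task names, and the 100-points-per-task constant (inferred from scores such as 2200/2600) are all assumptions.

```python
from typing import Dict

# Assumption: each task is worth 100 points, inferred from scores like 2200/2600.
POINTS_PER_TASK = 100


def combine_runs(base: Dict[str, bool], supplemental: Dict[str, bool]) -> Dict[str, bool]:
    """Merge two per-task pass/fail maps, refusing to silently overwrite a task."""
    overlap = sorted(base.keys() & supplemental.keys())
    if overlap:
        raise ValueError(f"tasks present in both runs: {overlap}")
    return {**base, **supplemental}


def summarize(results: Dict[str, bool]) -> Dict[str, object]:
    """Derive the passed/failed/score figures from a merged result map."""
    passed = sum(1 for ok in results.values() if ok)
    total = len(results)
    return {
        "passed": passed,
        "failed": total - passed,
        "score": f"{passed * POINTS_PER_TASK}/{total * POINTS_PER_TASK}",
    }


# Hypothetical 24-task base run with placeholder task names: 20 passes, 4 failures.
base = {f"task-{i:02d}": i > 4 for i in range(1, 25)}

# The two supplemental tasks named in this document, both assumed to pass here.
supplemental = {
    "home-manager-extra-special-args": True,
    "overlay-module-boundary": True,
}

print(summarize(combine_runs(base, supplemental)))
```

Raising on overlapping task names keeps a supplemental artifact from masking a result that was already part of the original run.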

On June 27, gpt-5.4-mini was run across the missing low, medium, and high Codex reasoning effort settings.

On June 27 and June 28, Claude Opus 4.8 was run with explicit Claude Code effort settings.

Summary

| Agent | Model | Effort | Run ID | Status | Score | Passed | Failed | Timeout Count | Agent Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codex CLI | gpt-5.5 | low | 20260625T072711Z-e484ea0f | complete | 2100/2600 | 21 | 5 | 0 | 19m 25s |
| Codex CLI | gpt-5.5 | medium | 20260625T073226Z-3fce189c | complete | 1900/2600 | 19 | 7 | 0 | 22m 15s |
| Codex CLI | gpt-5.5 | high | 20260625T073227Z-167ae812 | complete | 1900/2600 | 19 | 7 | 0 | 25m 59s |
| Codex CLI | gpt-5.5 | xhigh | 20260624T182835Z-4ad8b555 (+2) | complete | 2200/2600 | 22 | 4 | 0 | 41m 03s |
| Codex CLI | gpt-5.4 | low | 20260625T073231Z-84de082a | complete | 2000/2600 | 20 | 6 | 0 | 17m 45s |
| Codex CLI | gpt-5.4 | medium | 20260625T073227Z-76c2964d | complete | 2100/2600 | 21 | 5 | 0 | 20m 55s |
| Codex CLI | gpt-5.4 | high | 20260625T073228Z-a5a4a383 | complete | 2200/2600 | 22 | 4 | 0 | 28m 16s |
| Codex CLI | gpt-5.4 | xhigh | 20260624T190640Z-fa04a19c (+2) | complete | 2100/2600 | 21 | 5 | 0 | 40m 02s |
| Codex CLI | gpt-5.4-mini | low | 20260627T154139Z-a762c204 | complete | 2100/2600 | 21 | 5 | 0 | 13m 51s |
| Codex CLI | gpt-5.4-mini | medium | 20260627T154149Z-11277553 | complete | 2100/2600 | 21 | 5 | 0 | 17m 59s |
| Codex CLI | gpt-5.4-mini | high | 20260627T154205Z-4e04ba57 | complete | 2100/2600 | 21 | 5 | 1 | 29m 05s |
| Codex CLI | gpt-5.4-mini | xhigh | 20260624T194359Z-268b0abe (+2) | complete | 1900/2600 | 19 | 7 | 3 | 39m 46s |
| Claude Code | claude-opus-4-8 | low | 20260627T214634Z-4e66a550 | complete | 1900/2600 | 19 | 7 | 0 | 9m 02s |
| Claude Code | claude-opus-4-8 | high | 20260628T071803Z-ca820a9b | complete | 2100/2600 | 21 | 5 | 0 | 12m 58s |
| Claude Code | claude-opus-4-8 | xhigh | 20260628T154937Z-ed26a81d | complete | 2100/2600 | 21 | 5 | 0 | 22m 03s |

All runs used a 240-second per-task agent timeout and the same task prompt contract: read NIXBENCH_PROMPT.md, edit only the copied task workdir, do not inspect hidden evaluators, run local checks if useful, then stop.
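
For readers unfamiliar with per-task timeouts, the following is a minimal sketch of how a wall-clock limit on an agent command could be enforced. It is not taken from bench.py: the function name `run_agent` and the "ok"/"error"/"timeout" labels are hypothetical, and only the 240-second figure comes from this document.

```python
import subprocess
import sys
from typing import List

# The per-task limit stated in this document.
AGENT_TIMEOUT_SECONDS = 240


def run_agent(cmd: List[str], timeout_seconds: float = AGENT_TIMEOUT_SECONDS) -> str:
    """Run one agent command and classify the outcome (hypothetical labels)."""
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_seconds,  # subprocess.run kills the child when this expires
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if completed.returncode == 0 else "error"


if __name__ == "__main__":
    # A stand-in command that sleeps longer than a deliberately short limit.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_agent(slow, timeout_seconds=0.5))
```

Under this reading, a timeout is recorded separately from an ordinary failure, which may be how the Timeout Count column is kept apart from the Failed column.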

Effort Sweep Notes

| Model | Notes |
| --- | --- |
| gpt-5.5 | Low effort recorded 21/26, medium and high recorded 19/26, and the xhigh baseline recorded 22/26. |
| gpt-5.4 | Scores increased from 20/26 at low effort to 21/26 at medium and 22/26 at high, while xhigh recorded 21/26. |
| gpt-5.4-mini | Low, medium, and high effort each recorded 21/26; the xhigh baseline recorded 19/26 with three timeouts. |
| claude-opus-4-8 | Explicit low, high, and xhigh efforts recorded 19/26, 21/26, and 21/26. |
| Highest score | gpt-5.4 high and gpt-5.5 xhigh both scored 2200/2600; gpt-5.4 high used less agent time. |
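
The "highest score" comparison can be reproduced from the Summary figures with a short sketch. The row data is copied from the Summary table; the helper names and the tie-break rule (more passes first, then less agent time) are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# (model, effort, tasks passed, agent time), copied from the Summary table.
Row = Tuple[str, str, int, str]
ROWS: List[Row] = [
    ("gpt-5.5", "low", 21, "19m 25s"),
    ("gpt-5.5", "medium", 19, "22m 15s"),
    ("gpt-5.5", "high", 19, "25m 59s"),
    ("gpt-5.5", "xhigh", 22, "41m 03s"),
    ("gpt-5.4", "low", 20, "17m 45s"),
    ("gpt-5.4", "medium", 21, "20m 55s"),
    ("gpt-5.4", "high", 22, "28m 16s"),
    ("gpt-5.4", "xhigh", 21, "40m 02s"),
    ("gpt-5.4-mini", "low", 21, "13m 51s"),
    ("gpt-5.4-mini", "medium", 21, "17m 59s"),
    ("gpt-5.4-mini", "high", 21, "29m 05s"),
    ("gpt-5.4-mini", "xhigh", 19, "39m 46s"),
    ("claude-opus-4-8", "low", 19, "9m 02s"),
    ("claude-opus-4-8", "high", 21, "12m 58s"),
    ("claude-opus-4-8", "xhigh", 21, "22m 03s"),
]


def parse_agent_time(text: str) -> int:
    """Convert a string such as '28m 16s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized agent time: {text!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds


def best_row(rows: List[Row]) -> Row:
    """Pick the row with the most passes, breaking ties by less agent time."""
    return min(rows, key=lambda row: (-row[2], parse_agent_time(row[3])))


print(best_row(ROWS))
```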

Baseline Per-Task Results

This table shows the xhigh/default baseline columns used by the detailed results page.

| Task | GPT-5.5 xhigh | GPT-5.4 xhigh | GPT-5.4 mini xhigh | Claude Opus 4.8 |
| --- | --- | --- | --- | --- |
| container-native-vs-oci | FAIL | FAIL | FAIL | FAIL |
| debug-infinite-recursion | PASS | PASS | PASS | PASS |
| debug-network-false-lead | FAIL | FAIL | FAIL | FAIL |
| devshell-tooling-contract | PASS | PASS | PASS | PASS |
| fetcher-source-pin | PASS | PASS | PASS | PASS |
| fhs-binary-wrapper | PASS | PASS | PASS | PASS |
| flake-input-package-selection | PASS | PASS | PASS | PASS |
| flake-per-system-outputs | PASS | PASS | FAIL | PASS |
| home-manager-extra-special-args | PASS | PASS | PASS | PASS |
| home-manager-wsl-module-import | PASS | PASS | PASS | PASS |
| home-manager-xdg-files | PASS | PASS | PASS | PASS |
| issue-report-quality | FAIL | FAIL | FAIL | FAIL |
| lang-attrsets-normalize | PASS | PASS | PASS | PASS |
| module-path-composition | PASS | PASS | PASS | PASS |
| module-service-options | FAIL | PASS | PASS | PASS |
| module-stale-option-migration | PASS | PASS | PASS | PASS |
| module-system-boundaries | PASS | PASS | PASS | PASS |
| mutable-config-home-manager | PASS | PASS | PASS | PASS |
| overlay-module-boundary | PASS | PASS | PASS | PASS |
| overlay-override-package | PASS | PASS | PASS | PASS |
| package-name-lookup-contract | PASS | PASS | PASS | PASS |
| package-python-application | PASS | FAIL | FAIL | PASS |
| package-stdenv-cli | PASS | FAIL | FAIL | FAIL |
| purity-wrapper-derivation | PASS | PASS | PASS | PASS |
| python-cuda-uv2nix-patch | PASS | PASS | PASS | PASS |
| string-escaping-systemd | PASS | PASS | FAIL | FAIL |
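
As a cross-check, the per-column pass counts implied by this table can be recomputed from the FAIL entries alone. The sketch below records only the failing tasks per column, copied from the table; the variable names are illustrative.

```python
TOTAL_TASKS = 26

# Failing tasks per baseline column, copied from the per-task table.
FAILURES = {
    "GPT-5.5 xhigh": {
        "container-native-vs-oci",
        "debug-network-false-lead",
        "issue-report-quality",
        "module-service-options",
    },
    "GPT-5.4 xhigh": {
        "container-native-vs-oci",
        "debug-network-false-lead",
        "issue-report-quality",
        "package-python-application",
        "package-stdenv-cli",
    },
    "GPT-5.4 mini xhigh": {
        "container-native-vs-oci",
        "debug-network-false-lead",
        "flake-per-system-outputs",
        "issue-report-quality",
        "package-python-application",
        "package-stdenv-cli",
        "string-escaping-systemd",
    },
    "Claude Opus 4.8": {
        "container-native-vs-oci",
        "debug-network-false-lead",
        "issue-report-quality",
        "package-stdenv-cli",
        "string-escaping-systemd",
    },
}

# Passes per column are the tasks that do not appear in that column's failure set.
passed_counts = {column: TOTAL_TASKS - len(failed) for column, failed in FAILURES.items()}

# Tasks that every baseline column failed.
failed_everywhere = set.intersection(*FAILURES.values())

print(passed_counts)
print(sorted(failed_everywhere))
```

The resulting counts of 22, 21, 19, and 21 agree with the Passed column of the corresponding Summary rows.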

Outcome Notes

| Task | Notes |
| --- | --- |
| container-native-vs-oci | The xhigh/default baseline rows did not pass this task, while GPT-5.5 low passed it during the effort sweep. |
| debug-network-false-lead | No recorded row in this set passed this task; GPT-5.4 mini xhigh timed out. |
| issue-report-quality | No recorded row in this set passed this task. |
| package-stdenv-cli | Only GPT-5.5 xhigh passed this task among the GPT effort rows. |

Commands

Effort sweep Codex runs used this shape, with <model> set to gpt-5.5 or gpt-5.4 and <effort> set to low, medium, or high:

python3 -u bench.py run-all --agent-timeout-seconds 240 --agent-cmd 'codex -c model_reasoning_effort=<effort> --ask-for-approval never exec -m <model> --ephemeral --skip-git-repo-check --sandbox workspace-write "You are in a temporary NixBench benchmark task workspace. Read NIXBENCH_PROMPT.md, then edit the local starter files to satisfy it. Only modify files in this directory. Do not inspect hidden evaluator files or the original task directory. Run local checks if useful, then stop."'

The original xhigh/default comparison commands did not pass -c model_reasoning_effort=...; the local Codex config was set to model_reasoning_effort = "xhigh" for the Codex baseline runs.

The June 27 gpt-5.4-mini effort runs used the same command shape with <model> set to gpt-5.4-mini.
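
The Codex sweep commands differ only in `<model>` and `<effort>`, so they can be generated rather than typed by hand. The sketch below fills in the command shape shown above; the helper names are hypothetical, every flag is copied verbatim from that command, and nothing is executed.

```python
import itertools
import shlex
from typing import List

# Prompt text copied from the command shape in this document.
PROMPT = (
    "You are in a temporary NixBench benchmark task workspace. "
    "Read NIXBENCH_PROMPT.md, then edit the local starter files to satisfy it. "
    "Only modify files in this directory. "
    "Do not inspect hidden evaluator files or the original task directory. "
    "Run local checks if useful, then stop."
)

MODELS = ["gpt-5.5", "gpt-5.4", "gpt-5.4-mini"]
EFFORTS = ["low", "medium", "high"]


def agent_cmd(model: str, effort: str) -> str:
    """Fill <model> and <effort> into the Codex agent command string."""
    return (
        f"codex -c model_reasoning_effort={effort} --ask-for-approval never "
        f"exec -m {model} --ephemeral --skip-git-repo-check "
        f'--sandbox workspace-write "{PROMPT}"'
    )


def bench_argv(model: str, effort: str, timeout_seconds: int = 240) -> List[str]:
    """Build the bench.py argument vector for one model/effort pair."""
    return [
        "python3", "-u", "bench.py", "run-all",
        "--agent-timeout-seconds", str(timeout_seconds),
        "--agent-cmd", agent_cmd(model, effort),
    ]


def sweep(models: List[str], efforts: List[str]) -> List[List[str]]:
    """One argument vector per (model, effort) combination."""
    return [bench_argv(m, e) for m, e in itertools.product(models, efforts)]


# Print the first command as a copy-pasteable shell line; nothing is run here.
print(shlex.join(sweep(MODELS, EFFORTS)[0]))
```

Keeping the agent command as a single argument mirrors how the original passes it to `--agent-cmd` as one quoted string.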

The Claude Opus effort runs used the same Claude Code command shape with --model opus and --effort low, --effort high, or --effort xhigh.