Two rows reached 2200/2600: GPT-5.4 at high effort and GPT-5.5 at xhigh effort.
Model comparison · June 24-25, 2026
GPT and Claude model runs on the full 26-task NixBench corpus.
GPT-5.5 and GPT-5.4 were swept across reasoning effort levels, while GPT-5.4 mini and Claude Opus 4.8 remain xhigh/default baselines. All runs used the same 240-second per-task timeout.
- Highest score
- 2200
- Highest pass count
- 22/26
- Full runs
- 10
- Shortest 22/26 run
- 28m 16s
Run summary
Two rows reached 22 passes on the 26-task corpus.
The ten completed rows span four pass-count values across the full 26-task corpus.
GPT-5.4 increased from low through high effort (20, 21, 22 passes) and dropped to 21 at xhigh. GPT-5.5 was non-monotonic: 21 passes at low, 19 at both medium and high, and 22 at xhigh.
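The effort-sweep pattern can be checked with a few lines of Python. This is a minimal sketch: the pass counts are copied from the run cards in this report, while the names `SWEEP`, `score`, `is_non_decreasing`, and `best_effort` are illustrative, and the 100-points-per-task scoring is an assumption inferred from 22 passes mapping to 2200/2600.

```python
# Pass counts per effort level, copied from the run cards in this report.
SWEEP = {
    "GPT-5.5": {"low": 21, "medium": 19, "high": 19, "xhigh": 22},
    "GPT-5.4": {"low": 20, "medium": 21, "high": 22, "xhigh": 21},
}
EFFORT_ORDER = ["low", "medium", "high", "xhigh"]

# Assumption: each passed task is worth 100 points (22 passes -> 2200/2600).
POINTS_PER_TASK = 100


def score(passed: int) -> int:
    """Convert a pass count into the score shown on the run cards."""
    return passed * POINTS_PER_TASK


def is_non_decreasing(values: list[int]) -> bool:
    """True when each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def best_effort(model: str) -> list[str]:
    """Return the effort level(s) with the highest pass count for a model."""
    counts = SWEEP[model]
    top = max(counts.values())
    return [effort for effort in EFFORT_ORDER if counts[effort] == top]


for model in SWEEP:
    series = [SWEEP[model][effort] for effort in EFFORT_ORDER]
    trend = "non-decreasing" if is_non_decreasing(series) else "non-monotonic"
    print(model, series, trend, "best:", best_effort(model))
```

Neither sweep is non-decreasing over all four levels, so higher effort did not reliably buy more passes in this set of runs.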
GPT-5.5 via Codex CLI (low)
complete · 20260625T072711Z-e484ea0f
2100/2600
- Passed
- 21
- Failed
- 5
- Timeouts
- 0
- Agent time
- 19m 25s
GPT-5.5 via Codex CLI (medium)
complete · 20260625T073226Z-3fce189c
1900/2600
- Passed
- 19
- Failed
- 7
- Timeouts
- 0
- Agent time
- 22m 15s
GPT-5.5 via Codex CLI (high)
complete · 20260625T073227Z-167ae812
1900/2600
- Passed
- 19
- Failed
- 7
- Timeouts
- 0
- Agent time
- 25m 59s
GPT-5.5 via Codex CLI (xhigh)
complete · 20260624T182835Z-4ad8b555 (+2)
2200/2600
- Passed
- 22
- Failed
- 4
- Timeouts
- 0
- Agent time
- 41m 03s
GPT-5.4 via Codex CLI (low)
complete · 20260625T073231Z-84de082a
2000/2600
- Passed
- 20
- Failed
- 6
- Timeouts
- 0
- Agent time
- 17m 45s
GPT-5.4 via Codex CLI (medium)
complete · 20260625T073227Z-76c2964d
2100/2600
- Passed
- 21
- Failed
- 5
- Timeouts
- 0
- Agent time
- 20m 55s
GPT-5.4 via Codex CLI (high)
complete · 20260625T073228Z-a5a4a383
2200/2600
- Passed
- 22
- Failed
- 4
- Timeouts
- 0
- Agent time
- 28m 16s
GPT-5.4 via Codex CLI (xhigh)
complete · 20260624T190640Z-fa04a19c (+2)
2100/2600
- Passed
- 21
- Failed
- 5
- Timeouts
- 0
- Agent time
- 40m 02s
GPT-5.4 mini via Codex CLI (xhigh)
complete · 20260624T194359Z-268b0abe (+2)
1900/2600
- Passed
- 19
- Failed
- 7
- Timeouts
- 3
- Agent time
- 39m 46s
Claude Opus 4.8
complete · 20260624T202141Z-881ef1e9 (+2)
2100/2600
- Passed
- 21
- Failed
- 5
- Timeouts
- 0
- Agent time
- 25m 24s
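The headline figures can be re-derived from the ten run cards above. The following is a sketch only: the pass counts and agent times are copied from the cards, and the helper `to_seconds` and the tuple layout of `RUNS` are hypothetical choices for illustration, not part of the benchmark harness.

```python
# (label, passed, agent time) copied from the run cards in this report.
RUNS = [
    ("GPT-5.5 (low)", 21, "19m 25s"),
    ("GPT-5.5 (medium)", 19, "22m 15s"),
    ("GPT-5.5 (high)", 19, "25m 59s"),
    ("GPT-5.5 (xhigh)", 22, "41m 03s"),
    ("GPT-5.4 (low)", 20, "17m 45s"),
    ("GPT-5.4 (medium)", 21, "20m 55s"),
    ("GPT-5.4 (high)", 22, "28m 16s"),
    ("GPT-5.4 (xhigh)", 21, "40m 02s"),
    ("GPT-5.4 mini (xhigh)", 19, "39m 46s"),
    ("Claude Opus 4.8", 21, "25m 24s"),
]


def to_seconds(text: str) -> int:
    """Parse a duration such as '28m 16s' into whole seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


# Highest pass count and the runs that reached it.
highest_pass_count = max(passed for _, passed, _ in RUNS)
top_runs = [run for run in RUNS if run[1] == highest_pass_count]

# Among the top runs, the one with the least agent time.
shortest_top_run = min(top_runs, key=lambda run: to_seconds(run[2]))

# Distinct pass-count values across all completed rows.
distinct_pass_counts = sorted({passed for _, passed, _ in RUNS})

print("full runs:", len(RUNS))
print("highest pass count:", highest_pass_count)
print("shortest top run:", shortest_top_run[0], shortest_top_run[2])
print("distinct pass counts:", distinct_pass_counts)
```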
Per-task outcomes
Outcomes for the xhigh/default baseline runs are shown task by task. Each cell gives the result and the agent time for that task in seconds.
| Task | Area | GPT-5.5 | GPT-5.4 | GPT-5.4 mini | Claude Opus 4.8 |
|---|---|---|---|---|---|
| container-native-vs-oci | Modules | Fail 53.0s | Fail 48.3s | Fail 75.9s | Fail 27.8s |
| debug-infinite-recursion | Debugging | Pass 62.8s | Pass 42.7s | Pass 37.9s | Pass 54.0s |
| debug-network-false-lead | Debugging | Fail 212.9s | Fail 193.1s | Fail 240.0s | Fail 182.5s |
| devshell-tooling-contract | Dev shells | Pass 46.9s | Pass 131.1s | Pass 53.2s | Pass 39.0s |
| fetcher-source-pin | Fetchers | Pass 96.8s | Pass 138.4s | Pass 107.0s | Pass 36.9s |
| fhs-binary-wrapper | Packaging | Pass 98.0s | Pass 68.5s | Pass 142.7s | Pass 47.1s |
| flake-input-package-selection | Flakes | Pass 29.3s | Pass 63.9s | Pass 24.4s | Pass 21.7s |
| flake-per-system-outputs | Flakes | Pass 212.4s | Pass 184.7s | Fail 240.0s | Pass 74.9s |
| home-manager-extra-special-args | Flakes | Pass 135.5s | Pass 138.4s | Pass 80.5s | Pass 131.9s |
| home-manager-wsl-module-import | Modules | Pass 39.3s | Pass 78.0s | Pass 44.3s | Pass 36.5s |
| home-manager-xdg-files | Modules | Pass 45.7s | Pass 49.1s | Pass 56.9s | Pass 33.1s |
| issue-report-quality | Debugging | Fail 38.8s | Fail 50.2s | Fail 39.1s | Fail 89.3s |
| lang-attrsets-normalize | Nix language | Pass 87.3s | Pass 80.0s | Pass 74.6s | Pass 67.7s |
| module-path-composition | Nix language | Pass 98.1s | Pass 51.7s | Pass 49.6s | Pass 38.2s |
| module-service-options | Modules | Fail 124.8s | Pass 106.6s | Pass 84.6s | Pass 81.2s |
| module-stale-option-migration | Modules | Pass 41.5s | Pass 37.7s | Pass 63.1s | Pass 29.9s |
| module-system-boundaries | Modules | Pass 55.6s | Pass 81.8s | Pass 40.8s | Pass 45.3s |
| mutable-config-home-manager | Modules | Pass 82.8s | Pass 42.8s | Pass 63.3s | Pass 38.2s |
| overlay-module-boundary | Overlays | Pass 69.1s | Pass 58.4s | Pass 95.8s | Pass 34.2s |
| overlay-override-package | Overlays | Pass 44.0s | Pass 70.9s | Pass 50.5s | Pass 48.6s |
| package-name-lookup-contract | Packaging | Pass 51.2s | Pass 54.0s | Pass 26.6s | Pass 36.4s |
| package-python-application | Packaging | Pass 133.5s | Fail 185.6s | Fail 240.0s | Pass 51.5s |
| package-stdenv-cli | Packaging | Pass 182.2s | Fail 203.1s | Fail 184.5s | Fail 30.3s |
| purity-wrapper-derivation | Purity | Pass 145.0s | Pass 107.3s | Pass 79.1s | Pass 100.4s |
| python-cuda-uv2nix-patch | Packaging | Pass 44.4s | Pass 52.8s | Pass 68.9s | Pass 35.4s |
| string-escaping-systemd | Nix language | Pass 231.9s | Pass 83.1s | Fail 122.8s | Fail 111.6s |
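The per-task table can be tallied back into the pass and timeout counts on the baseline run cards. This is a sketch under stated assumptions: the rows are copied from the table above, `parse_row` is a hypothetical helper, and a failure is counted as a timeout when its duration reaches the 240-second limit, which is an inference from the data rather than a documented rule of the harness.

```python
TABLE = """\
container-native-vs-oci | Modules | Fail 53.0s | Fail 48.3s | Fail 75.9s | Fail 27.8s |
debug-infinite-recursion | Debugging | Pass 62.8s | Pass 42.7s | Pass 37.9s | Pass 54.0s |
debug-network-false-lead | Debugging | Fail 212.9s | Fail 193.1s | Fail 240.0s | Fail 182.5s |
devshell-tooling-contract | Dev shells | Pass 46.9s | Pass 131.1s | Pass 53.2s | Pass 39.0s |
fetcher-source-pin | Fetchers | Pass 96.8s | Pass 138.4s | Pass 107.0s | Pass 36.9s |
fhs-binary-wrapper | Packaging | Pass 98.0s | Pass 68.5s | Pass 142.7s | Pass 47.1s |
flake-input-package-selection | Flakes | Pass 29.3s | Pass 63.9s | Pass 24.4s | Pass 21.7s |
flake-per-system-outputs | Flakes | Pass 212.4s | Pass 184.7s | Fail 240.0s | Pass 74.9s |
home-manager-extra-special-args | Flakes | Pass 135.5s | Pass 138.4s | Pass 80.5s | Pass 131.9s |
home-manager-wsl-module-import | Modules | Pass 39.3s | Pass 78.0s | Pass 44.3s | Pass 36.5s |
home-manager-xdg-files | Modules | Pass 45.7s | Pass 49.1s | Pass 56.9s | Pass 33.1s |
issue-report-quality | Debugging | Fail 38.8s | Fail 50.2s | Fail 39.1s | Fail 89.3s |
lang-attrsets-normalize | Nix language | Pass 87.3s | Pass 80.0s | Pass 74.6s | Pass 67.7s |
module-path-composition | Nix language | Pass 98.1s | Pass 51.7s | Pass 49.6s | Pass 38.2s |
module-service-options | Modules | Fail 124.8s | Pass 106.6s | Pass 84.6s | Pass 81.2s |
module-stale-option-migration | Modules | Pass 41.5s | Pass 37.7s | Pass 63.1s | Pass 29.9s |
module-system-boundaries | Modules | Pass 55.6s | Pass 81.8s | Pass 40.8s | Pass 45.3s |
mutable-config-home-manager | Modules | Pass 82.8s | Pass 42.8s | Pass 63.3s | Pass 38.2s |
overlay-module-boundary | Overlays | Pass 69.1s | Pass 58.4s | Pass 95.8s | Pass 34.2s |
overlay-override-package | Overlays | Pass 44.0s | Pass 70.9s | Pass 50.5s | Pass 48.6s |
package-name-lookup-contract | Packaging | Pass 51.2s | Pass 54.0s | Pass 26.6s | Pass 36.4s |
package-python-application | Packaging | Pass 133.5s | Fail 185.6s | Fail 240.0s | Pass 51.5s |
package-stdenv-cli | Packaging | Pass 182.2s | Fail 203.1s | Fail 184.5s | Fail 30.3s |
purity-wrapper-derivation | Purity | Pass 145.0s | Pass 107.3s | Pass 79.1s | Pass 100.4s |
python-cuda-uv2nix-patch | Packaging | Pass 44.4s | Pass 52.8s | Pass 68.9s | Pass 35.4s |
string-escaping-systemd | Nix language | Pass 231.9s | Pass 83.1s | Fail 122.8s | Fail 111.6s |
"""
MODELS = ["GPT-5.5", "GPT-5.4", "GPT-5.4 mini", "Claude Opus 4.8"]
TIMEOUT_S = 240.0  # per-task timeout stated in the report


def parse_row(line: str):
    """Split one table row into (task, area, [(passed, seconds), ...])."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    task, area, *results = cells
    parsed = []
    for cell in results:
        status, duration = cell.split()
        parsed.append((status == "Pass", float(duration.rstrip("s"))))
    return task, area, parsed


rows = [parse_row(line) for line in TABLE.splitlines() if line.strip()]

passes = []
timeouts = []
for column in range(len(MODELS)):
    results = [row[2][column] for row in rows]
    passes.append(sum(1 for ok, _ in results if ok))
    # Assumption: a failure that ran for the full limit was a timeout.
    timeouts.append(sum(1 for ok, secs in results if not ok and secs >= TIMEOUT_S))

# Tasks that no baseline column passed.
failed_by_all = [task for task, _, results in rows if not any(ok for ok, _ in results)]

for model, p, t in zip(MODELS, passes, timeouts):
    print(f"{model}: {p}/{len(rows)} passed, {t} timeouts")
print("failed by every baseline:", failed_by_all)
```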
Agent duration
The xhigh/default baselines show the per-task time profile.
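As a consistency check on the time profile, the per-task durations in the GPT-5.5 column of the table can be summed and compared with the agent time on the GPT-5.5 (xhigh) run card. This is a sketch: the durations are copied from the table, `format_duration` is a hypothetical helper, and whether the harness itself computes agent time as a plain sum is an assumption, although the figures agree for this column.

```python
# Per-task durations in seconds for the GPT-5.5 column, in table order.
GPT_5_5_DURATIONS = [
    53.0, 62.8, 212.9, 46.9, 96.8, 98.0, 29.3, 212.4, 135.5,
    39.3, 45.7, 38.8, 87.3, 98.1, 124.8, 41.5, 55.6, 82.8,
    69.1, 44.0, 51.2, 133.5, 182.2, 145.0, 44.4, 231.9,
]


def format_duration(total_seconds: float) -> str:
    """Render seconds in the 'Mm SSs' style used on the run cards."""
    minutes, seconds = divmod(round(total_seconds), 60)
    return f"{minutes}m {seconds:02d}s"


total = sum(GPT_5_5_DURATIONS)
slowest = max(GPT_5_5_DURATIONS)

print("tasks:", len(GPT_5_5_DURATIONS))
print("total agent time:", format_duration(total))
print("slowest task (s):", slowest)
```

The slowest task in this column, string-escaping-systemd at 231.9s, finished just under the 240-second limit.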
Outcome notes
Several outcome patterns repeat across runs.
Native-container outcome
The xhigh/default baseline rows did not pass the native NixOS container task; GPT-5.5 low passed it in the effort sweep.
Debugging and reports
The false-lead debugging task and issue-report task were not passed by any row in this set.
Packaging variation
The Python application and stdenv CLI tasks show different pass patterns across models and effort levels.
String escaping
GPT-5.5 and GPT-5.4 passed the baseline string-escaping task; GPT-5.4 mini and Claude Opus 4.8 did not.