Model comparison · June 24-25, 2026

GPT and Claude model runs on the full 26-task NixBench corpus.

GPT-5.5 and GPT-5.4 were swept across four reasoning-effort levels (low, medium, high, xhigh), while GPT-5.4 mini and Claude Opus 4.8 were run only as xhigh/default baselines. All runs used the same 240-second per-task timeout.

Highest score
2200
Highest pass count
22/26
Full runs
10
Shortest 22/26 run
28m 16s

Run summary

Two rows reached 22 passes on the 26-task corpus.

Source notes
Highest score
22/26

Two rows reached 2200/2600: GPT-5.4 at high effort and GPT-5.5 at xhigh effort.

Range
19-22 passes

The ten completed rows span four pass-count values (19, 20, 21, and 22) across the full 26-task corpus.

Effort
Model-specific

GPT-5.4 rose from 20 passes at low effort to 21 at medium and 22 at high, then fell back to 21 at xhigh. GPT-5.5 was not monotonic: 21 passes at low, 19 at both medium and high, and 22 at xhigh.

GPT-5.5 via Codex CLI (low)

complete
20260625T072711Z-e484ea0f

2100/2600

Passed
21
Failed
5
Timeouts
0
Agent time
19m 25s

GPT-5.5 via Codex CLI (medium)

complete
20260625T073226Z-3fce189c

1900/2600

Passed
19
Failed
7
Timeouts
0
Agent time
22m 15s

GPT-5.5 via Codex CLI (high)

complete
20260625T073227Z-167ae812

1900/2600

Passed
19
Failed
7
Timeouts
0
Agent time
25m 59s

GPT-5.5 via Codex CLI (xhigh)

complete
20260624T182835Z-4ad8b555 (+2)

2200/2600

Passed
22
Failed
4
Timeouts
0
Agent time
41m 03s

GPT-5.4 via Codex CLI (low)

complete
20260625T073231Z-84de082a

2000/2600

Passed
20
Failed
6
Timeouts
0
Agent time
17m 45s

GPT-5.4 via Codex CLI (medium)

complete
20260625T073227Z-76c2964d

2100/2600

Passed
21
Failed
5
Timeouts
0
Agent time
20m 55s

GPT-5.4 via Codex CLI (high)

complete
20260625T073228Z-a5a4a383

2200/2600

Passed
22
Failed
4
Timeouts
0
Agent time
28m 16s

GPT-5.4 via Codex CLI (xhigh)

complete
20260624T190640Z-fa04a19c (+2)

2100/2600

Passed
21
Failed
5
Timeouts
0
Agent time
40m 02s

GPT-5.4 mini via Codex CLI (xhigh)

complete
20260624T194359Z-268b0abe (+2)

1900/2600

Passed
19
Failed
7
Timeouts
3
Agent time
39m 46s

Claude Opus 4.8

complete
20260624T202141Z-881ef1e9 (+2)

2100/2600

Passed
21
Failed
5
Timeouts
0
Agent time
25m 24s
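
The headline figures can be recomputed from the ten run cards above. The following is a minimal sketch, not the benchmark's own tooling: the `Run` tuple, the `summarize` helper, and the 100-points-per-pass weighting are assumptions, the last one inferred from the cards (for example, 21 passes alongside 2100/2600).

```python
from collections import namedtuple

# Hypothetical record type for one run card.
Run = namedtuple("Run", "label passed failed timeouts agent_time")

# Values transcribed from the ten run cards above.
RUNS = [
    Run("GPT-5.5 (low)", 21, 5, 0, "19m 25s"),
    Run("GPT-5.5 (medium)", 19, 7, 0, "22m 15s"),
    Run("GPT-5.5 (high)", 19, 7, 0, "25m 59s"),
    Run("GPT-5.5 (xhigh)", 22, 4, 0, "41m 03s"),
    Run("GPT-5.4 (low)", 20, 6, 0, "17m 45s"),
    Run("GPT-5.4 (medium)", 21, 5, 0, "20m 55s"),
    Run("GPT-5.4 (high)", 22, 4, 0, "28m 16s"),
    Run("GPT-5.4 (xhigh)", 21, 5, 0, "40m 02s"),
    Run("GPT-5.4 mini (xhigh)", 19, 7, 3, "39m 46s"),
    Run("Claude Opus 4.8", 21, 5, 0, "25m 24s"),
]

TASKS = 26
POINTS_PER_PASS = 100  # assumption: inferred from the cards, not stated by the source


def to_seconds(text):
    """Convert an agent time such as '28m 16s' to whole seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def summarize(runs):
    """Recompute the headline figures from the run rows."""
    for run in runs:
        # In these rows passed + failed already equals 26, so timeouts
        # appear to be included in the failed count rather than added to it.
        assert run.passed + run.failed == TASKS, run.label
    best = max(run.passed for run in runs)
    top_rows = [run for run in runs if run.passed == best]
    fastest = min(top_rows, key=lambda run: to_seconds(run.agent_time))
    return {
        "highest_score": best * POINTS_PER_PASS,
        "top_rows": [run.label for run in top_rows],
        "pass_counts": sorted({run.passed for run in runs}),
        "shortest_top_run": (fastest.label, fastest.agent_time),
    }


if __name__ == "__main__":
    for key, value in summarize(RUNS).items():
        print(f"{key}: {value}")
```

Keeping the agent time as the original string and converting it only for comparison means the reported value matches the card text exactly.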

Per-task outcomes

Outcomes for the xhigh/default baseline runs are shown task by task. Each cell gives the result and the agent time for that task.

| Task | Area | GPT-5.5 | GPT-5.4 | GPT-5.4 mini | Claude Opus 4.8 |
| --- | --- | --- | --- | --- | --- |
| container-native-vs-oci | Modules | Fail 53.0s | Fail 48.3s | Fail 75.9s | Fail 27.8s |
| debug-infinite-recursion | Debugging | Pass 62.8s | Pass 42.7s | Pass 37.9s | Pass 54.0s |
| debug-network-false-lead | Debugging | Fail 212.9s | Fail 193.1s | Fail 240.0s | Fail 182.5s |
| devshell-tooling-contract | Dev shells | Pass 46.9s | Pass 131.1s | Pass 53.2s | Pass 39.0s |
| fetcher-source-pin | Fetchers | Pass 96.8s | Pass 138.4s | Pass 107.0s | Pass 36.9s |
| fhs-binary-wrapper | Packaging | Pass 98.0s | Pass 68.5s | Pass 142.7s | Pass 47.1s |
| flake-input-package-selection | Flakes | Pass 29.3s | Pass 63.9s | Pass 24.4s | Pass 21.7s |
| flake-per-system-outputs | Flakes | Pass 212.4s | Pass 184.7s | Fail 240.0s | Pass 74.9s |
| home-manager-extra-special-args | Flakes | Pass 135.5s | Pass 138.4s | Pass 80.5s | Pass 131.9s |
| home-manager-wsl-module-import | Modules | Pass 39.3s | Pass 78.0s | Pass 44.3s | Pass 36.5s |
| home-manager-xdg-files | Modules | Pass 45.7s | Pass 49.1s | Pass 56.9s | Pass 33.1s |
| issue-report-quality | Debugging | Fail 38.8s | Fail 50.2s | Fail 39.1s | Fail 89.3s |
| lang-attrsets-normalize | Nix language | Pass 87.3s | Pass 80.0s | Pass 74.6s | Pass 67.7s |
| module-path-composition | Nix language | Pass 98.1s | Pass 51.7s | Pass 49.6s | Pass 38.2s |
| module-service-options | Modules | Fail 124.8s | Pass 106.6s | Pass 84.6s | Pass 81.2s |
| module-stale-option-migration | Modules | Pass 41.5s | Pass 37.7s | Pass 63.1s | Pass 29.9s |
| module-system-boundaries | Modules | Pass 55.6s | Pass 81.8s | Pass 40.8s | Pass 45.3s |
| mutable-config-home-manager | Modules | Pass 82.8s | Pass 42.8s | Pass 63.3s | Pass 38.2s |
| overlay-module-boundary | Overlays | Pass 69.1s | Pass 58.4s | Pass 95.8s | Pass 34.2s |
| overlay-override-package | Overlays | Pass 44.0s | Pass 70.9s | Pass 50.5s | Pass 48.6s |
| package-name-lookup-contract | Packaging | Pass 51.2s | Pass 54.0s | Pass 26.6s | Pass 36.4s |
| package-python-application | Packaging | Pass 133.5s | Fail 185.6s | Fail 240.0s | Pass 51.5s |
| package-stdenv-cli | Packaging | Pass 182.2s | Fail 203.1s | Fail 184.5s | Fail 30.3s |
| purity-wrapper-derivation | Purity | Pass 145.0s | Pass 107.3s | Pass 79.1s | Pass 100.4s |
| python-cuda-uv2nix-patch | Packaging | Pass 44.4s | Pass 52.8s | Pass 68.9s | Pass 35.4s |
| string-escaping-systemd | Nix language | Pass 231.9s | Pass 83.1s | Fail 122.8s | Fail 111.6s |
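
The per-task rows can be tallied back into per-model totals and checked against the run cards. This is a sketch under stated assumptions: the compact `P`/`F` cell encoding and the `tally` helper are hypothetical, and treating a failed cell that reached the full 240 seconds as a timeout is an inference from the stated per-task limit, not something the source spells out.

```python
TIMEOUT_S = 240.0  # per-task timeout stated in the introduction
MODELS = ["GPT-5.5", "GPT-5.4", "GPT-5.4 mini", "Claude Opus 4.8"]

# One row per task: (task, cells), with cells in MODELS order.
# Each cell is P (pass) or F (fail) followed by the agent time in seconds.
ROWS = [
    ("container-native-vs-oci", "F53.0 F48.3 F75.9 F27.8"),
    ("debug-infinite-recursion", "P62.8 P42.7 P37.9 P54.0"),
    ("debug-network-false-lead", "F212.9 F193.1 F240.0 F182.5"),
    ("devshell-tooling-contract", "P46.9 P131.1 P53.2 P39.0"),
    ("fetcher-source-pin", "P96.8 P138.4 P107.0 P36.9"),
    ("fhs-binary-wrapper", "P98.0 P68.5 P142.7 P47.1"),
    ("flake-input-package-selection", "P29.3 P63.9 P24.4 P21.7"),
    ("flake-per-system-outputs", "P212.4 P184.7 F240.0 P74.9"),
    ("home-manager-extra-special-args", "P135.5 P138.4 P80.5 P131.9"),
    ("home-manager-wsl-module-import", "P39.3 P78.0 P44.3 P36.5"),
    ("home-manager-xdg-files", "P45.7 P49.1 P56.9 P33.1"),
    ("issue-report-quality", "F38.8 F50.2 F39.1 F89.3"),
    ("lang-attrsets-normalize", "P87.3 P80.0 P74.6 P67.7"),
    ("module-path-composition", "P98.1 P51.7 P49.6 P38.2"),
    ("module-service-options", "F124.8 P106.6 P84.6 P81.2"),
    ("module-stale-option-migration", "P41.5 P37.7 P63.1 P29.9"),
    ("module-system-boundaries", "P55.6 P81.8 P40.8 P45.3"),
    ("mutable-config-home-manager", "P82.8 P42.8 P63.3 P38.2"),
    ("overlay-module-boundary", "P69.1 P58.4 P95.8 P34.2"),
    ("overlay-override-package", "P44.0 P70.9 P50.5 P48.6"),
    ("package-name-lookup-contract", "P51.2 P54.0 P26.6 P36.4"),
    ("package-python-application", "P133.5 F185.6 F240.0 P51.5"),
    ("package-stdenv-cli", "P182.2 F203.1 F184.5 F30.3"),
    ("purity-wrapper-derivation", "P145.0 P107.3 P79.1 P100.4"),
    ("python-cuda-uv2nix-patch", "P44.4 P52.8 P68.9 P35.4"),
    ("string-escaping-systemd", "P231.9 P83.1 F122.8 F111.6"),
]


def parse(cell):
    """Split a cell such as 'F240.0' into (passed, seconds)."""
    return cell[0] == "P", float(cell[1:])


def tally(rows):
    """Count passes, failures, and inferred timeouts, and sum agent time per model."""
    stats = {m: {"passed": 0, "failed": 0, "timeouts": 0, "seconds": 0.0} for m in MODELS}
    for _task, cells in rows:
        for model, cell in zip(MODELS, cells.split()):
            passed, seconds = parse(cell)
            entry = stats[model]
            entry["passed" if passed else "failed"] += 1
            # Assumption: a failed cell that ran the full 240 s is a timeout.
            if not passed and seconds >= TIMEOUT_S:
                entry["timeouts"] += 1
            entry["seconds"] += seconds
    return stats


if __name__ == "__main__":
    for model, entry in tally(ROWS).items():
        minutes, seconds = divmod(round(entry["seconds"]), 60)
        print(f"{model}: {entry['passed']} passed, {entry['failed']} failed, "
              f"{entry['timeouts']} timeouts, {minutes}m {seconds:02d}s")
```

The summed per-task times give a way to cross-check each column against the agent time on the matching run card.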

Agent duration

The xhigh/default baselines show the per-task time profile.

Outcome notes

Several outcome patterns repeat across runs.

container

Native-container outcome

The xhigh/default baseline rows did not pass the native NixOS container task; GPT-5.5 low passed it in the effort sweep.

evidence

Debugging and reports

No row in this set passed the false-lead debugging task or the issue-report task.

package

Packaging variation

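Among the xhigh/default baselines, GPT-5.5 and Claude Opus 4.8 passed the Python application task while GPT-5.4 and GPT-5.4 mini failed it, and only GPT-5.5 passed the stdenv CLI task. Pass patterns on both tasks also varied across effort levels.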

language

String escaping

In the xhigh/default baselines, GPT-5.5 and GPT-5.4 passed the string-escaping task; GPT-5.4 mini and Claude Opus 4.8 did not.
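
The baseline patterns in these notes can be separated into tasks that every baseline failed and tasks that split the baselines. The sketch below uses set operations on the failed tasks transcribed from the per-task outcomes; the `FAILED` mapping and `outcome_patterns` helper are hypothetical names, and the data covers only the four xhigh/default baselines, not the effort sweep.

```python
# Tasks that all four baselines failed, transcribed from the per-task outcomes.
COMMON = {
    "container-native-vs-oci",
    "debug-network-false-lead",
    "issue-report-quality",
}

# Failed tasks per xhigh/default baseline.
FAILED = {
    "GPT-5.5": COMMON | {"module-service-options"},
    "GPT-5.4": COMMON | {"package-python-application", "package-stdenv-cli"},
    "GPT-5.4 mini": COMMON | {
        "flake-per-system-outputs",
        "package-python-application",
        "package-stdenv-cli",
        "string-escaping-systemd",
    },
    "Claude Opus 4.8": COMMON | {"package-stdenv-cli", "string-escaping-systemd"},
}


def outcome_patterns(failed):
    """Return (tasks failed by every baseline, {split task: baselines that failed it})."""
    sets = list(failed.values())
    failed_by_all = set.intersection(*sets)
    split_tasks = set.union(*sets) - failed_by_all
    failed_by = {
        task: [model for model, tasks in failed.items() if task in tasks]
        for task in sorted(split_tasks)
    }
    return failed_by_all, failed_by


if __name__ == "__main__":
    failed_by_all, failed_by = outcome_patterns(FAILED)
    print("Failed by every baseline:", sorted(failed_by_all))
    for task, models in failed_by.items():
        print(f"{task}: failed by {', '.join(models)}")
```

Listing the baselines that failed each split task, rather than those that passed, keeps the output short because most baselines pass most tasks.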