The Inference Report

June 22, 2026

On SWE-rebench, the top tier remains stable, with gpt-5.5-2026-04-23-xhigh at 62.7%, Junie at 61.6%, and Codex at 60.4%. The middle ranks are less settled. Claude Sonnet 4.6 moved from #8 to #10 even as its score rose 4.1 percentage points (47.2 to 51.3), and Gemini 3.1 Pro Preview moved from #9 to #11 on a 4.6-point increase (46.5 to 51.1). Both models lost two places while scoring higher, which suggests that scores rose across this band and points toward test set changes or evaluation methodology shifts rather than model improvements alone. GLM-5.1's jump from #23 to #12 is the most dramatic repositioning, a 10.5-point rise from 40.2 to 50.7 that warrants scrutiny: either the model underwent substantial retraining or the benchmark's task distribution shifted in its favor. Conversely, Gemini 3.5 Flash dropped six places, from #7 to #13, on a score decline of only 0.7 points (50.2 to 49.5), which is what tight clustering in this performance band would produce. GLM-4.7 posted the largest absolute gain in the lower ranks, rising 4.4 points (33.8 to 38.2) between the two evaluations, though it remains at #17 on SWE-rebench.

The Artificial Analysis index, with its broader model coverage, presents a different ordering: Claude Fable 5 leads at 59.9 and Claude Opus 4.8 (55.7) sits above GPT-5.5 (54.8), whereas on SWE-rebench gpt-5.5-2026-04-23-xhigh ranks above Claude Opus 4.8-xhigh. The index is a composite across coding, math, and reasoning benchmarks, so the two leaderboards weight different competencies and test different problem classes.

Without disclosure of the evaluation methodology, task composition, test set overlap, execution environment, or whether SWE-rebench underwent revision, it remains uncertain whether these shifts reflect genuine capability differences or benchmark drift. The same model families occupy the top tier on both benchmarks, which lends some confidence to the broad picture, but the volatility in the middle ranks indicates either genuine model differentiation in narrow domains or measurement sensitivity that limits strong inference.
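The rank and score movements quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the analysis; the `Movement` class and its field names are illustrative, not part of either benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass
class Movement:
    """One model's rank and score at two evaluation points (hypothetical container)."""
    model: str
    rank_before: int
    rank_after: int
    score_before: float
    score_after: float

    @property
    def score_delta(self) -> float:
        # Change in percentage points, rounded to the one decimal the report uses.
        return round(self.score_after - self.score_before, 1)

    @property
    def places_gained(self) -> int:
        # Positive means the model moved up the table; negative means it fell.
        return self.rank_before - self.rank_after


# Figures as quoted in the analysis paragraph.
movements = [
    Movement("Claude Sonnet 4.6", 8, 10, 47.2, 51.3),
    Movement("Gemini 3.1 Pro Preview", 9, 11, 46.5, 51.1),
    Movement("GLM-5.1", 23, 12, 40.2, 50.7),
    Movement("Gemini 3.5 Flash", 7, 13, 50.2, 49.5),
]

for m in movements:
    print(f"{m.model}: {m.score_delta:+.1f} points, {m.places_gained:+d} places")
```

A model whose score delta is positive while its places gained is negative, as with Claude Sonnet 4.6 here, lost rank only because its neighbors gained more.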

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

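| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |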
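Because each model is run five times, a single leaderboard score is best read as an average with some spread around it. The following is a minimal sketch of one common way to summarize repeated runs, using the mean and the standard error of the mean; the source does not state which statistic SWE-rebench reports, and the run values below are hypothetical placeholders, not real results.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical resolved rates (%) from five runs of one model.
runs = [60.0, 62.0, 61.0, 63.0, 59.0]
avg, sem = summarize_runs(runs)
print(f"mean = {avg:.1f}%, standard error = {sem:.2f} points")
```

When two models' averages differ by less than a couple of standard errors, their relative order in the table is weak evidence of a real gap.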

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

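| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 69 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 63 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 53 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 165 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 94 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 244 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 69 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 138 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 200 | $3.75 |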

Output tokens per second — higher is faster. Minimum intelligence score of 40.

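| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 244 |
| 2 | Qwen3.7 Max | 200 |
| 3 | GPT-5.4 mini | 193 |
| 4 | GPT-5.4 | 165 |
| 5 | GPT-5.2 Codex | 145 |
| 6 | Gemini 3.1 Pro Preview | 138 |
| 7 | DeepSeek V4 Flash | 110 |
| 8 | GPT-5.3 Codex | 107 |
| 9 | GLM-5.1 | 106 |
| 10 | DeepSeek V4 Pro | 103 |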

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

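| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |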
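The blended figure in the cost table combines separate input and output prices into one number. The sketch below assumes that "3:1 input/output" means a token-weighted average with three input tokens for every output token; the source does not spell out the formula, and the prices in the example are hypothetical, not any listed model's rates.

```python
def blended_cost(input_price: float, output_price: float,
                 input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average price per 1M tokens for a given input:output token mix.

    Assumes the blend is a simple token-weighted mean (an assumption, not
    a formula confirmed by the source).
    """
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
example = blended_cost(1.00, 5.00)
print(f"blended = ${example:.2f} per 1M tokens")
```

Under this weighting the input price dominates, so a model with expensive output but cheap input can still rank well on blended cost.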