The Inference Report

June 24, 2026

The SWE-rebench rankings are stable at the top: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie 61.6%, and Codex 60.4%, with no movement in the first nine positions. Below that band, the reshuffling is modest and reflects incremental gains among mid-tier models. Claude Sonnet 4.6 is at position 10 with its score unchanged at 51.3%, while GLM-5.1 advanced from rank 23 to 12 by improving from 40.2% to 50.7%. That 10.5-point jump points to a methodology change, a model update, or an evaluation refinement, and it deserves scrutiny. Gemini 3.5 Flash dropped from 7 to 13 despite holding 49.5%, which suggests the ranking absorbed new entrants or was recalibrated.

The Artificial Analysis benchmark, by contrast, saw more motion: Grok Build 0.1 0616 entered at rank 28 and Ring-1T appeared at 159 with no prior placement, indicating either fresh model releases or expanded coverage. At the lower end of the rankings, scores compress into single digits. Models cluster so densely there that small score shifts produce large rank swings, which makes those positions weak discriminators.

Taken together, the movement suggests SWE-rebench is settling into a stable ordering of proven performers while Artificial Analysis continues to absorb new competitors. Neither benchmark's methodology is transparent enough to confirm whether score changes reflect genuine capability shifts or evaluation adjustments.

Cole Brennan
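
The rank-swing effect noted in the summary above can be illustrated with a minimal sketch. The model names and scores here are hypothetical and are not taken from either leaderboard; the helper functions `rank_models` and `rank_moves` are illustrative names, not part of any benchmark's tooling.

```python
def rank_models(scores):
    """Return {model: rank}, where rank 1 is the highest score.

    Ties are broken alphabetically by model name (an assumption made
    for this sketch; real leaderboards may break ties differently).
    """
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {model: position + 1 for position, (model, _) in enumerate(ordered)}


def rank_moves(before, after):
    """Return {model: (old_rank, new_rank)} across two snapshots.

    None marks a model that is missing from one of the snapshots,
    for example a new entrant.
    """
    old, new = rank_models(before), rank_models(after)
    return {m: (old.get(m), new.get(m)) for m in sorted(set(before) | set(after))}


# Hypothetical dense cluster: five models within half a point of each other.
before = {"model-a": 9.1, "model-b": 8.9, "model-c": 8.8, "model-d": 8.7, "model-e": 8.6}
# model-e gains 0.6 points; model-f is a new entrant.
after = {**before, "model-e": 9.2, "model-f": 8.5}

moves = rank_moves(before, after)
for model, (old_rank, new_rank) in moves.items():
    print(model, old_rank, "->", new_rank)
```

In this toy cluster a 0.6-point gain moves `model-e` from fifth to first, which is why rank changes in tightly packed regions say little on their own without the underlying score deltas.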

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                        Score
1    gpt-5.5-2026-04-23-xhigh     62.7%
2    Junie                        61.6%
3    Codex                        60.4%
4    Claude Code                  59.6%
5    gpt-5.5-2026-04-23-medium    58.9%
6    Claude Opus 4.8-xhigh        56.5%
7    gpt-5.4-2026-03-05-medium    54.9%
8    Claude Opus 4.7-high         53.1%
9    Cursor                       53.0%
10   Claude Sonnet 4.6            51.3%
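
SWE-rebench is described as running each model five times to account for stochastic variance. One common way to summarize repeated runs is the mean with a standard error, sketched below. This is an assumption about how such runs could be aggregated, not a statement of the benchmark's actual procedure, and the run values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for repeated runs.

    resolved_rates: per-run scores in percent. Needs at least two runs,
    because the sample standard deviation is undefined for one value.
    """
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical scores from five runs of one model (not real leaderboard data).
runs = [48.0, 50.0, 49.0, 51.0, 52.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.2f}")
```

Under this reading, two models whose means differ by less than a couple of standard errors would not be clearly separated by the ranking.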

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                     Score   tok/s   $/1M
1    Claude Fable 5            59.9    0       $20.00
2    Claude Opus 4.8           55.7    72      $10.00
3    GPT-5.5                   54.8    64      $11.25
4    Claude Opus 4.7           53.5    62      $10.00
5    GPT-5.4                   51.4    161     $5.63
6    GLM-5.2                   51.1    118     $2.15
7    Gemini 3.5 Flash          50.2    237     $3.38
8    Claude Sonnet 4.6         47.2    69      $6.00
9    Gemini 3.1 Pro Preview    46.5    143     $4.50
10   Qwen3.7 Max               46      203     $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                     tok/s
1    Gemini 3.5 Flash          237
2    Qwen3.7 Max               203
3    GPT-5.4 mini              194
4    GPT-5.4                   161
5    GPT-5.2 Codex             155
6    Gemini 3.1 Pro Preview    143
7    DeepSeek V4 Flash         121
8    GLM-5.2                   118
9    DeepSeek V4 Pro           103
10   GLM-5.1                   90
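
The speed ranking applies a minimum intelligence score of 40 and then orders the remaining models by output speed. A minimal sketch of that filter-then-sort step follows. It uses four rows from the composite index table plus one hypothetical low-scoring model to show the cutoff, so its output covers only those rows and is not a reproduction of the full list.

```python
# (model, composite score, output tokens per second)
rows = [
    ("Claude Opus 4.8", 55.7, 72),
    ("GPT-5.4", 51.4, 161),
    ("Gemini 3.5 Flash", 50.2, 237),
    ("Qwen3.7 Max", 46.0, 203),
    ("hypothetical-small-model", 35.0, 400),  # invented row, below the cutoff
]


def fastest(rows, min_score=40.0):
    """Drop models below min_score, then sort by tok/s, fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


ranking = fastest(rows)
for position, (model, score, speed) in enumerate(ranking, start=1):
    print(position, model, speed)
```

The hypothetical model is the fastest in the input but never appears in the output, which is the point of the score floor: raw speed only counts once a model clears the quality threshold.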

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                $/1M
1    DeepSeek V4 Flash    $0.175
2    MiMo-V2.5            $0.175
3    MiniMax-M3           $0.525
4    DeepSeek V4 Pro      $0.544
5    MiMo-V2.5-Pro        $0.544
6    MiMo-V2-Pro          $1.50
7    GPT-5.4 mini         $1.69
8    Kimi K2.6            $1.71
9    Kimi K2.7 Code       $1.71
10   GLM-5.2              $2.15
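
The blended cost figure combines input and output prices at a 3:1 ratio. A minimal sketch of that calculation is below, assuming the ratio means a weighted average of three parts input price to one part output price. The function name `blended_price` and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens.

    input_price, output_price: USD per 1M tokens.
    The default 3:1 weighting is this sketch's reading of the
    "3:1 input/output" note, not a confirmed formula.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
example = blended_price(1.00, 5.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still post a low blended figure, so workloads that generate a lot of output may cost more than the table implies.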