The Inference Report

May 28, 2026

The SWE-rebench rankings show substantial churn at the top, though the movement warrants scrutiny. Two new GPT-5.5 variants (xhigh and medium) now occupy positions 1 and 4 at 62.7% and 58.9%, while Codex and Claude Code jumped from positions 18 and 17 to positions 2 and 3, gaining 2.1 and 1.2 percentage points respectively. Claude Opus 4.6 fell from first place (65.3%) to sixth (53.1%), a 12.2-point drop, and several formerly high-ranked models (gpt-5.2-2025-12-11-medium, Junie, DeepSeek-V3.2, Claude Sonnet 4.5, Qwen3.5-397B-A17B) disappeared from the benchmark entirely. The Artificial Analysis rankings, by contrast, remain stable, with identical scores and positions, which suggests the volatility is specific to SWE-rebench's evaluation methodology or dataset.

Without documentation of what changed in the benchmark itself (whether test cases were added, removed, or reweighted, or whether evaluation criteria shifted), it is unclear whether these movements reflect genuine capability differences or artifacts of the measurement apparatus. The scale of Claude Opus 4.6's decline raises the sharpest questions: a score regression that large, with no corresponding change in the model itself, points toward benchmark modifications rather than model degradation. Until the SWE-rebench evaluation protocol is transparently specified, these rankings indicate movement but not necessarily meaningful progress.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                      Score
1   gpt-5.5-2026-04-23-xhigh   62.7%
2   Codex                      60.4%
3   Claude Code                59.6%
4   gpt-5.5-2026-04-23-medium  58.9%
5   gpt-5.4-2026-03-05-medium  54.9%
6   Claude Opus 4.7            53.1%
7   Cursor                     53.0%
8   Gemini 3.1 Pro Preview     51.1%
9   Claude Sonnet 4.6          51.1%
10  GLM-5.1                    50.7%
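
The five-run protocol mentioned above can be illustrated with a minimal sketch. It assumes the reported score is the mean of the per-run resolved rates, which the source does not state. The per-run numbers and the `summarize_runs` helper are hypothetical, chosen only to show how a mean and its standard error would be computed.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores for one model (not taken from SWE-rebench).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"mean = {avg:.1f}%, standard error = {sem:.2f}")
```

Under these assumptions, a gap between two models that is smaller than a couple of standard errors (for example, the 0.1-point gap between positions 6 and 7) would be hard to distinguish from run-to-run noise.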

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                   Score  tok/s  $/1M
1   GPT-5.5                 60.2   81     $11.25
2   Claude Opus 4.7         57.3   55     $10.94
3   Gemini 3.1 Pro Preview  57.2   132    $4.50
4   GPT-5.4                 56.8   89     $5.63
5   Qwen3.7 Max             56.6   206    $3.75
6   Gemini 3.5 Flash        55.3   228    $3.38
7   Kimi K2.6               53.9   34     $1.71
8   MiMo-V2.5-Pro           53.8   51     $0.544
9   GPT-5.3 Codex           53.6   81     $4.81
10  Grok 4.3                53.2   216    $1.56

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                   tok/s
1   Gemini 3.5 Flash        228
2   Grok 4.3                216
3   Qwen3.7 Max             206
4   GPT-5.1 Codex           205
5   Gemini 3 Flash Preview  200
6   GPT-5 Codex             196
7   Grok 4.20 0309          192
8   Grok 4.20 0309 v2       189
9   Qwen3.6 35B A3B         170
10  GPT-5.4 mini            153

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model              $/1M
1   MiMo-V2-Flash      $0.15
2   MiMo-V2.5          $0.175
3   DeepSeek V4 Flash  $0.175
4   Hy3-preview        $0.20
5   DeepSeek V3.2      $0.337
6   GPT-5.4 nano       $0.463
7   MiniMax-M2.7       $0.525
8   KAT Coder Pro V2   $0.525
9   MiniMax-M2.5       $0.525
10  MiMo-V2.5-Pro      $0.544
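
The blended figure above can be read as a weighted average of input and output prices. The sketch below assumes the 3:1 ratio means three parts input tokens to one part output tokens. The `blended_price` helper and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices in dollars per 1M tokens (illustration only).
example = blended_price(input_price=1.00, output_price=3.00)
print(f"blended = ${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still rank well on blended cost.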