The Inference Report

June 1, 2026

The top of the SWE-rebench table is stable, with gpt-5.5-2026-04-23-xhigh at 62.7% and Codex at 60.4%. Further down, several models score noticeably lower on SWE-rebench than on the Artificial Analysis index. Gemini 3.1 Pro Preview goes from 57.2 on Artificial Analysis to 51.1% on SWE-rebench, and from fourth place to tenth. Kimi K2.6 goes from 53.9 to 46.5%, and GLM-4.7 from 42.1 to 38.2%. GLM-5.1, by contrast, stays roughly level at 50.7 and 51.4, and Claude models appear in the top ten of both lists.

These gaps should be read with care, because the two numbers do not measure the same thing. SWE-rebench scores real-world software engineering tasks under one standardized scaffolding, while the Artificial Analysis index is a composite across coding, math, and reasoning benchmarks. A model can therefore rank lower on SWE-rebench because it handles software engineering tasks less well, because math and reasoning results lift its composite score, or because SWE-rebench scores more strictly. Without a task-level breakdown, the roughly 4 to 7 point differences for the same models cannot be attributed definitively to task difficulty, evaluation strictness, or genuine capability differences, so it is premature to treat either ranking as a complete picture of coding performance.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                        Score
1   gpt-5.5-2026-04-23-xhigh     62.7%
2   Codex                        60.4%
3   Claude Code                  59.6%
4   gpt-5.5-2026-04-23-medium    58.9%
5   Claude Opus 4.8-xhigh        56.4%
6   gpt-5.4-2026-03-05-medium    54.9%
7   Claude Opus 4.7-high         53.1%
8   Cursor                       53.0%
9   Claude Sonnet 4.6-high       51.3%
10  Gemini 3.1 Pro Preview       51.1%
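
The five runs per model mentioned above are what make a single headline score meaningful. As a minimal sketch of how repeated runs might be summarized, the Python below computes a mean, a sample standard deviation, and a standard error. The function name `summarize_runs` and the per-run figures are hypothetical, and SWE-rebench's actual aggregation may differ.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, sample std dev, standard error) for repeated runs."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    sd = stdev(resolved_rates)  # sample standard deviation (n - 1 denominator)
    return avg, sd, sd / sqrt(n)


# Hypothetical per-run resolved rates in percent for one model; not real data.
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sd, sem = summarize_runs(runs)
print(f"mean={avg:.1f}% sd={sd:.2f} sem={sem:.2f}")
```

Two models whose means differ by less than a couple of standard errors are hard to separate on the evidence of five runs, which is worth keeping in mind when reading adjacent rows such as positions 7 and 8.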

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                     Score   tok/s   $/1M
1   Claude Opus 4.8           61.4    63      $10.94
2   GPT-5.5                   60.2    66      $11.25
3   Claude Opus 4.7           57.3    60      $10.94
4   Gemini 3.1 Pro Preview    57.2    144     $4.50
5   GPT-5.4                   56.8    86      $5.63
6   Qwen3.7 Max               56.6    190     $3.75
7   Gemini 3.5 Flash          55.3    227     $3.38
8   Kimi K2.6                 53.9    42      $1.71
9   MiMo-V2.5-Pro             53.8    52      $0.544
10  GPT-5.3 Codex             53.6    86      $4.81

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                     tok/s
1   Grok 4.20 0309            229
2   Gemini 3.5 Flash          227
3   Grok 4.20 0309 v2         219
4   MiniMax-M2.5              206
5   Gemini 3 Flash Preview    193
6   Qwen3.7 Max               190
7   GPT-5.1 Codex             186
8   GPT-5.4 mini              183
9   GPT-5 Codex               173
10  Qwen3.6 35B A3B           160
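
The speed list is a filter followed by a sort: drop models below the score floor, then order the rest by output speed. The sketch below illustrates that, assuming the "intelligence score" is the same composite index shown in the Artificial Analysis table. The `Row` type, the `speed_ranking` function, and the last row are hypothetical; the other rows are copied from that table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # composite index score
    tok_s: int    # output tokens per second


def speed_ranking(rows, min_score=40.0, top_n=10):
    """Keep rows at or above min_score, then sort fastest first."""
    eligible = [r for r in rows if r.score >= min_score]
    return sorted(eligible, key=lambda r: r.tok_s, reverse=True)[:top_n]


rows = [
    Row("Claude Opus 4.8", 61.4, 63),
    Row("Gemini 3.1 Pro Preview", 57.2, 144),
    Row("Qwen3.7 Max", 56.6, 190),
    Row("Gemini 3.5 Flash", 55.3, 227),
    Row("hypothetical-fast-model", 35.0, 300),  # made up; below the floor
]

for rank, row in enumerate(speed_ranking(rows), start=1):
    print(rank, row.model, row.tok_s)
```

The made-up fifth row is the fastest entry but never appears in the output, which is the effect of the minimum score.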

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model                $/1M
1   MiMo-V2-Flash        $0.15
2   MiMo-V2.5            $0.175
3   DeepSeek V4 Flash    $0.175
4   Hy3-preview          $0.20
5   DeepSeek V3.2        $0.337
6   GPT-5.4 nano         $0.463
7   MiniMax-M2.7         $0.525
8   KAT Coder Pro V2     $0.525
9   MiniMax-M2.5         $0.525
10  MiMo-V2.5-Pro        $0.544
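
A blended price folds separate input and output prices into one figure. The sketch below assumes the 3:1 ratio means a weighted average of three parts input tokens to one part output tokens. The function name `blended_cost` and the example prices are illustrative and are not taken from any provider's price list.

```python
def blended_cost(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Illustrative prices in USD per 1M tokens (assumed, not real quotes).
print(blended_cost(2.00, 12.00))  # 4.5
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still land near the top of this list.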