The Inference Report

June 20, 2026

The top tier of SWE-rebench was unchanged: gpt-5.5-2026-04-23-xhigh held first place at 62.7%, and the next five positions did not move. The Artificial Analysis index showed modest reordering in its middle and lower tiers. The two benchmarks use different methodologies, so their scores should not be compared directly.

Below the top six on SWE-rebench, Claude Sonnet 4.6 stayed at #10 with 51.3%; the prior data lists it at 47.2 on Artificial Analysis, a difference that may reflect score drift or evaluation variance. Gemini 3.1 Pro Preview slipped from #9 to #11, and GLM-5.1 jumped from #23 to #12, gaining 10.5 points on Artificial Analysis (40.2 to 50.7). GLM-4.7 is reported with the same 4.4-point gain (33.8 to 38.2) on both SWE-rebench and Artificial Analysis, which points to a consistent improvement.

On Artificial Analysis, minor reordering occurred around rank 190, where Magistral Medium 1 and Mistral Medium 3 swapped positions at 12.5 points, and at ranks 360 to 362, where three models at 2.7 points changed order.

The lack of substantial movement at the top of either benchmark suggests a stable performance hierarchy. The GLM gains merit a closer look to determine whether they reflect genuine capability improvements or differences in evaluation sensitivity between the benchmarks.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                      Score
1   gpt-5.5-2026-04-23-xhigh   62.7%
2   Junie                      61.6%
3   Codex                      60.4%
4   Claude Code                59.6%
5   gpt-5.5-2026-04-23-medium  58.9%
6   Claude Opus 4.8-xhigh      56.5%
7   gpt-5.4-2026-03-05-medium  54.9%
8   Claude Opus 4.7-high       53.1%
9   Cursor                     53.0%
10  Claude Sonnet 4.6          51.3%
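
SWE-rebench runs each model five times to account for stochastic variance, but the description does not say how the five runs are combined into one score. The sketch below assumes the reported figure is the arithmetic mean of per-run resolved rates and adds a standard error as one way to express run-to-run spread. The function name `summarize_runs` and the per-run numbers are hypothetical, not SWE-rebench data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent.

    Assumption: runs are combined by a simple mean. The benchmark's
    actual aggregation method is not stated in this report.
    """
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical scores from five runs of one model (illustrative only).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.2f}")  # prints "62.0% +/- 0.71"
```

With a spread like this one, gaps of a few tenths of a point between adjacent ranks would sit inside the run-to-run noise.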

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                   Score  tok/s  $/1M
1   Claude Fable 5          59.9   0      $20.00
2   Claude Opus 4.8         55.7   64     $10.00
3   GPT-5.5                 54.8   61     $11.25
4   Claude Opus 4.7         53.5   57     $10.00
5   GPT-5.4                 51.4   142    $5.63
6   GLM-5.2                 51.1   72     $2.15
7   Gemini 3.5 Flash        50.2   216    $3.38
8   Claude Sonnet 4.6       47.2   68     $6.00
9   Gemini 3.1 Pro Preview  46.5   140    $4.50
10  Qwen3.7 Max             46     125    $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                   tok/s
1   Gemini 3.5 Flash        216
2   GPT-5.4 mini            174
3   GPT-5.4                 142
4   Gemini 3.1 Pro Preview  140
5   GPT-5.2 Codex           140
6   Qwen3.7 Max             125
7   DeepSeek V4 Flash       110
8   GLM-5.1                 93
9   GPT-5.3 Codex           86
10  DeepSeek V4 Pro         86
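
The speed ranking applies a filter and then a sort: keep models at or above the minimum intelligence score, then order them by output tokens per second. A minimal sketch of that rule follows, using a few rows from the composite index table plus one hypothetical low-scoring model to show the filter. Treating the threshold as inclusive (`>=`) is an assumption, as is the function name `rank_by_speed`.

```python
# (model, intelligence score, output tokens per second)
rows = [
    ("Claude Opus 4.8", 55.7, 64),
    ("GPT-5.4", 51.4, 142),
    ("Gemini 3.5 Flash", 50.2, 216),
    ("Gemini 3.1 Pro Preview", 46.5, 140),
    ("Qwen3.7 Max", 46.0, 125),
    # Hypothetical entry, not from the report: fast but below the cutoff.
    ("hypothetical-small-model", 35.0, 300),
]


def rank_by_speed(rows, min_score=40.0):
    """Drop models below min_score, then sort fastest first."""
    eligible = [r for r in rows if r[1] >= min_score]  # inclusive cutoff (assumed)
    return sorted(eligible, key=lambda r: r[2], reverse=True)


ranked = rank_by_speed(rows)
for position, (model, score, tps) in enumerate(ranked, start=1):
    print(position, model, tps)
```

The hypothetical model is the fastest entry in the list but never appears in the output, because the score filter runs before the sort.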

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model              $/1M
1   DeepSeek V4 Flash  $0.175
2   MiMo-V2.5          $0.175
3   MiniMax-M3         $0.525
4   DeepSeek V4 Pro    $0.544
5   MiMo-V2.5-Pro      $0.544
6   MiMo-V2-Pro        $1.50
7   GPT-5.4 mini       $1.69
8   Kimi K2.6          $1.71
9   Kimi K2.7 Code     $1.71
10  GLM-5.2            $2.15
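
The blended figure weights input and output prices at 3:1. A minimal sketch of that calculation follows, assuming the ratio means three input tokens for every output token, so the blended price is a weighted average of the two per-1M rates. The function name `blended_price` and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of input and output prices, in $ per 1M tokens.

    Assumption: "3:1 input/output" means three parts input tokens
    to one part output tokens.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
price = blended_price(1.00, 4.00)
print(f"${price:.2f} per 1M tokens")  # prints "$1.75 per 1M tokens"
```

Because input carries three times the weight, a model with expensive output but cheap input can still rank well on this measure; workloads that generate far more than they read would see a different effective price.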