The Inference Report

June 21, 2026

The SWE-rebench leaderboard shows consolidation at the top, with no movement among the leading seven models, and the mid-tier was just as static on this benchmark. Claude Sonnet 4.6 held at #10 with 51.3 percent, Gemini 3.1 Pro Preview held at #11 with 51.1 percent, and GLM-5.1 remained at #12 with 50.7 percent. Artificial Analysis tells a different story. GLM-5.1 sits at rank 23 there with 40.2, against rank 12 with 50.7 on SWE-rebench, a 10.5-point gap that suggests either a model update or a methodology difference between the two benchmarks. The most striking gap in rank belongs to GLM-4.7, which goes from #51 on Artificial Analysis (33.8) to #17 on SWE-rebench (38.2), a 4.4-point difference, while Kimi K2.6 goes from rank 16 to rank 15 with a 3.7-point difference, from 42.8 to 46.5.

These discrepancies raise questions about the two evaluation methodologies. SWE-rebench appears to reward different model behaviors or architectural choices than Artificial Analysis, particularly for Chinese-developed models like GLM and Kimi, which suggests the benchmarks may be measuring distinct aspects of coding capability rather than converging on a unified signal. The lack of score inflation at the frontier, where the top model remains at 62.7 percent, indicates the evaluation has not become easier. Still, the divergence in rankings for identical models undermines confidence in any single leaderboard as a complete measure of coding performance.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                       Score
1    gpt-5.5-2026-04-23-xhigh    62.7%
2    Junie                       61.6%
3    Codex                       60.4%
4    Claude Code                 59.6%
5    gpt-5.5-2026-04-23-medium   58.9%
6    Claude Opus 4.8-xhigh       56.5%
7    gpt-5.4-2026-03-05-medium   54.9%
8    Claude Opus 4.7-high        53.1%
9    Cursor                      53.0%
10   Claude Sonnet 4.6           51.3%
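
The five-run protocol mentioned in the description of this table lends itself to a small illustration. The sketch below shows one common way to summarize repeated runs, as a mean with a standard error; it is not taken from SWE-rebench's own tooling. The function name `summarize_runs` and the per-run values are hypothetical and are not leaderboard data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(pass_rates):
    """Return (mean, standard error of the mean) for per-run scores in percent."""
    n = len(pass_rates)
    avg = mean(pass_rates)
    # Sample standard deviation divided by sqrt(n); undefined for one run, so use 0.0.
    sem = stdev(pass_rates) / sqrt(n) if n > 1 else 0.0
    return avg, sem


# Hypothetical resolved rates from five runs of one model (not real leaderboard data).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this kind of summary, two models whose means differ by less than their combined standard errors, such as entries a few tenths of a point apart, would be hard to separate with confidence.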

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                    Score   tok/s   $/1M
1    Claude Fable 5           59.9    0       $20.00
2    Claude Opus 4.8          55.7    67      $10.00
3    GPT-5.5                  54.8    63      $11.25
4    Claude Opus 4.7          53.5    52      $10.00
5    GPT-5.4                  51.4    142     $5.63
6    GLM-5.2                  51.1    85      $2.15
7    Gemini 3.5 Flash         50.2    217     $3.38
8    Claude Sonnet 4.6        47.2    67      $6.00
9    Gemini 3.1 Pro Preview   46.5    136     $4.50
10   Qwen3.7 Max              46      197     $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                    tok/s
1    Gemini 3.5 Flash         217
2    Qwen3.7 Max              197
3    GPT-5.4 mini             180
4    GPT-5.4                  142
5    GPT-5.2 Codex            139
6    Gemini 3.1 Pro Preview   136
7    DeepSeek V4 Flash        110
8    GLM-5.1                  103
9    GPT-5.3 Codex            95
10   DeepSeek V4 Pro          92

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model               $/1M
1    DeepSeek V4 Flash   $0.175
2    MiMo-V2.5           $0.175
3    MiniMax-M3          $0.525
4    DeepSeek V4 Pro     $0.544
5    MiMo-V2.5-Pro       $0.544
6    MiMo-V2-Pro         $1.50
7    GPT-5.4 mini        $1.69
8    Kimi K2.6           $1.71
9    Kimi K2.7 Code      $1.71
10   GLM-5.2             $2.15
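
For readers unfamiliar with blended pricing, the sketch below shows how a single $/1M figure might be derived from separate input and output prices. It assumes that "3:1 input/output" means a weighted average with three input tokens for every output token. The function name `blended_price` and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(input_per_m, output_per_m, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens, assuming a 3:1 input/output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical list prices in dollars per 1M tokens (not taken from the table).
example = blended_price(input_per_m=1.00, output_per_m=4.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still post a low blended figure.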