The Inference Report

May 3, 2026

Claude Opus 4.6 now leads the SWE-rebench rankings at 65.3%, a 12.3 percentage point jump from its prior score of 53%. The rest of the top tier shows consolidation rather than dramatic reshuffling: gpt-5.2-2025-12-11-medium (64.4%), GLM-5 (62.8%), and Junie (62.8%) hold positions 2 through 4, and the top four scores sit within a 2.5-point band, suggesting the frontier of coding performance has compressed into a narrow range.

Where there is movement, it is meaningful. GLM-5 rose from rank 17 to rank 3, GLM-5.1 climbed from 14 to 6, and Kimi K2.5 advanced from 29 to 16, indicating that Chinese model families are closing the gap on the leaders. Gemini 3.1 Pro Preview, meanwhile, dropped from rank 3 to rank 7 despite holding a respectable 62.3%.

The Artificial Analysis benchmark tells a different story: far less movement at the top, with GPT-5.5 still leading at 60.2 and Claude Opus 4.6 sitting at rank 9 with 53 points. That is a significant divergence between the two evaluation frameworks. SWE-rebench reflects a methodology focused on software engineering tasks with specific, measurable outcomes, whereas the Artificial Analysis scores may weight different problem classes or evaluation criteria. The divergence matters: a model can rank first on one benchmark while placing ninth on another, which suggests neither benchmark alone captures complete coding capability. It is also worth noting that more than 100 models were dropped from the Artificial Analysis rankings without corresponding SWE-rebench entries, making it impossible to tell whether those models genuinely degraded or were simply deprioritized in evaluation cycles; that methodological gap should temper any reading of movement as progress.

Cole Brennan
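
The split is easy to see side by side. The sketch below is purely illustrative: it copies the two top-10 lists from the tables in this issue and reports how far each shared model's rank moves between them. It is not either benchmark's methodology.

    # Ranks copied from the SWE-rebench and Artificial Analysis top-10
    # tables published below; the comparison logic is illustrative only.
    swe_rebench = {
        "Claude Opus 4.6": 1, "gpt-5.2-2025-12-11-medium": 2, "GLM-5": 3,
        "Junie": 4, "gpt-5.4-2026-03-05-medium": 5, "GLM-5.1": 6,
        "Gemini 3.1 Pro Preview": 7, "DeepSeek-V3.2": 8,
        "Claude Sonnet 4.6": 9, "Claude Sonnet 4.5": 10,
    }
    artificial_analysis = {
        "GPT-5.5": 1, "Claude Opus 4.7": 2, "Gemini 3.1 Pro Preview": 3,
        "GPT-5.4": 4, "Kimi K2.6": 5, "MiMo-V2.5-Pro": 6,
        "GPT-5.3 Codex": 7, "Grok 4.3": 8, "Claude Opus 4.6": 9,
        "Qwen3.6 Max Preview": 10,
    }

    # Only models appearing in both top-10 lists can be compared directly.
    for model in sorted(set(swe_rebench) & set(artificial_analysis)):
        swe, aa = swe_rebench[model], artificial_analysis[model]
        print(f"{model}: SWE-rebench #{swe}, Artificial Analysis #{aa} ({aa - swe:+d})")

Run against the published lists, only two models overlap, and they move eight and four places respectively, which is exactly the disagreement described above.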

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                        Score
1    Claude Opus 4.6              65.3%
2    gpt-5.2-2025-12-11-medium    64.4%
3    GLM-5                        62.8%
4    Junie                        62.8%
5    gpt-5.4-2026-03-05-medium    62.8%
6    GLM-5.1                      62.7%
7    Gemini 3.1 Pro Preview       62.3%
8    DeepSeek-V3.2                60.9%
9    Claude Sonnet 4.6            60.7%
10   Claude Sonnet 4.5            60.0%
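
The five-run protocol matters because agentic coding runs are noisy: the same model with the same scaffolding can resolve a different subset of tasks on each attempt. Below is a minimal sketch of how repeated runs are typically aggregated, with invented per-run rates; SWE-rebench's exact aggregation is not described in this issue.

    from statistics import mean, stdev

    # Hypothetical resolved rates for one model across five independent runs.
    runs = [0.62, 0.66, 0.63, 0.61, 0.65]

    avg = mean(runs)       # the kind of number a leaderboard would report
    spread = stdev(runs)   # run-to-run noise from stochastic decoding

    print(f"mean resolved rate: {avg:.1%} (+/- {spread:.1%} over {len(runs)} runs)")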

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                     Score   tok/s   $/1M
1    GPT-5.5                   60.2    76      $11.25
2    Claude Opus 4.7           57.3    61      $10.94
3    Gemini 3.1 Pro Preview    57.2    133     $4.50
4    GPT-5.4                   56.8    84      $5.63
5    Kimi K2.6                 53.9    29      $1.71
6    MiMo-V2.5-Pro             53.8    65      $1.50
7    GPT-5.3 Codex             53.6    93      $4.81
8    Grok 4.3                  53.2    150     $1.56
9    Claude Opus 4.6           53      53      $10.94
10   Qwen3.6 Max Preview       51.8    37      $2.92
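
A composite index of this kind is, at bottom, a weighted average over category results. Artificial Analysis does not publish its weights or per-category scores here, so everything in the sketch below, the values and the equal weighting alike, is an assumption used only to show the shape of the calculation.

    # Hypothetical category scores for a single model; the categories mirror
    # the "coding, math, and reasoning" description above, but the numbers
    # and the equal weighting are assumptions, not Artificial Analysis's method.
    category_scores = {"coding": 58.0, "math": 64.0, "reasoning": 59.0}
    weights = {"coding": 1 / 3, "math": 1 / 3, "reasoning": 1 / 3}

    composite = sum(category_scores[c] * weights[c] for c in category_scores)
    print(f"composite index: {composite:.1f}")  # 60.3 under these assumptions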

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                     tok/s
1    GPT-5 Codex               210
2    Gemini 3 Flash Preview    199
3    Qwen3.6 35B A3B           199
4    GPT-5.1 Codex             187
5    GPT-5.4 mini              184
6    GPT-5.4 nano              162
7    Qwen3.5 122B A10B         156
8    Grok 4.3                  150
9    GPT-5.1                   149
10   MiMo-V2-Flash             145

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                $/1M
1    MiMo-V2-Flash        $0.15
2    DeepSeek V4 Flash    $0.175
3    DeepSeek V3.2        $0.337
4    GPT-5.4 nano         $0.463
5    MiniMax-M2.7         $0.525
6    KAT Coder Pro V2     $0.525
7    MiniMax-M2.5         $0.525
8    Qwen3.6 35B A3B      $0.557
9    GPT-5 mini           $0.688
10   Qwen3.5 27B          $0.825
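
The 3:1 blend used above is a weighted average of input and output prices, with three input tokens assumed for every output token. A small sketch of the arithmetic follows; the per-million prices fed in are made up, and only the 3:1 weighting comes from the caption.

    def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
        """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
        return (ratio * input_per_m + output_per_m) / (ratio + 1)

    # Hypothetical pricing: $0.10 per 1M input tokens, $0.30 per 1M output tokens.
    print(f"${blended_price(0.10, 0.30):.3f} per 1M blended tokens")  # $0.150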