The Inference Report

May 4, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, 12.3 points above its score of 53 on Artificial Analysis, though the two benchmarks measure different problem sets and cannot be directly compared. The top tier has consolidated around 62-65% on SWE-rebench, with GPT-5.2-2025-12-11-medium at 64.4% and three models tied at 62.8% (GLM-5, Junie, GPT-5.4-2026-03-05-medium), suggesting diminishing returns in coding task performance at the frontier.

More striking are the mid-tier movements: GLM-5 climbed from position 16 to 3 on SWE-rebench, GLM-4.7 rose from 40 to 14, and Kimi K2.5 advanced from 26 to 16, indicating that Chinese model families have made substantial gains on this particular benchmark. Gemini 3.1 Pro Preview, by contrast, sits seventh on SWE-rebench (62.3%) after placing third on Artificial Analysis (57.2), a relative decline despite the higher absolute score, which may reflect task-specific strengths rather than regression.

On Artificial Analysis, the leaderboard remains fluid, with 33 new entries across the 373-model roster, including several reasoning-focused variants and smaller parameter models, though the top ten is still dominated by GPT and Claude variants. The SWE-rebench benchmark appears more selective and stable, tracking only 34 models versus hundreds on Artificial Analysis, which makes it a tighter measure of coding capability but limits visibility into broader model performance distributions. Without methodological details on how SWE-rebench tasks differ from Artificial Analysis's evaluation protocol, the divergence in rankings suggests these benchmarks may reward different architectural or training choices, a distinction worth investigating rather than treating them as interchangeable measures of coding prowess.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                        Score
 1  Claude Opus 4.6              65.3%
 2  gpt-5.2-2025-12-11-medium    64.4%
 3  GLM-5                        62.8%
 4  Junie                        62.8%
 5  gpt-5.4-2026-03-05-medium    62.8%
 6  GLM-5.1                      62.7%
 7  Gemini 3.1 Pro Preview       62.3%
 8  DeepSeek-V3.2                60.9%
 9  Claude Sonnet 4.6            60.7%
10  Claude Sonnet 4.5            60.0%
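
SWE-rebench's five-run protocol, described above, boils down to a simple aggregation over repeated evaluations. The snippet below is a minimal sketch of that idea, not the benchmark's actual harness: the task IDs and pass/fail results are invented, and the assumption that the headline score is the mean resolved rate over the five runs is mine.

```python
from statistics import mean, stdev

# Hypothetical per-run results: each run records, for every task,
# whether the model's patch resolved the issue (True/False).
runs = [
    {"task-001": True,  "task-002": False, "task-003": True},
    {"task-001": True,  "task-002": True,  "task-003": True},
    {"task-001": False, "task-002": False, "task-003": True},
    {"task-001": True,  "task-002": False, "task-003": True},
    {"task-001": True,  "task-002": False, "task-003": False},
]

# Per-run resolved rate, then the mean and spread across the five runs.
per_run_rates = [mean(run.values()) for run in runs]
score = mean(per_run_rates)    # headline score, assuming a simple mean over runs
spread = stdev(per_run_rates)  # the stochastic variance the reruns are meant to expose

print(f"resolved rate: {score:.1%} (±{spread:.1%} across {len(runs)} runs)")
```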

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                    Score   tok/s     $/1M
 1  GPT-5.5                  60.2       74   $11.25
 2  Claude Opus 4.7          57.3       56   $10.94
 3  Gemini 3.1 Pro Preview   57.2      130    $4.50
 4  GPT-5.4                  56.8       89    $5.63
 5  Kimi K2.6                53.9       31    $1.71
 6  MiMo-V2.5-Pro            53.8       63    $1.50
 7  GPT-5.3 Codex            53.6       87    $4.81
 8  Grok 4.3                 53.2      112    $1.56
 9  Claude Opus 4.6          53         48   $10.94
10  Muse Spark               52.1        0    $0.00
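
The Artificial Analysis score above is a composite across coding, math, and reasoning evaluations. The exact components and weights aren't given in this table, so the sketch below only illustrates the general shape of such an index: the category names come from the caption, but the scores and the equal weighting are placeholder assumptions, not the provider's methodology.

```python
# Hypothetical per-category scores (0-100) for one model.
components = {
    "coding": 62.0,
    "math": 55.5,
    "reasoning": 58.0,
}

# Equal weights are an assumption; the published index may weight
# categories differently or include additional evaluations.
weights = {name: 1 / len(components) for name in components}

index = sum(weights[name] * score for name, score in components.items())
print(f"composite index: {index:.1f}")
```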

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #  Model                    tok/s
 1  Gemini 3 Flash Preview     197
 2  GPT-5 Codex                196
 3  Qwen3.6 35B A3B            192
 4  GPT-5.4 mini               184
 5  GPT-5.1 Codex              184
 6  GPT-5.4 nano               161
 7  Qwen3.5 122B A10B          158
 8  GPT-5.1                    151
 9  MiMo-V2-Flash              147
10  MiMo-V2-Omni-0327          134

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #  Model                  $/1M
 1  MiMo-V2-Flash         $0.15
 2  DeepSeek V4 Flash     $0.175
 3  DeepSeek V3.2         $0.337
 4  GPT-5.4 nano          $0.463
 5  MiniMax-M2.7          $0.525
 6  KAT Coder Pro V2      $0.525
 7  MiniMax-M2.5          $0.525
 8  Qwen3.6 35B A3B       $0.557
 9  GPT-5 mini            $0.688
10  Qwen3.5 27B           $0.825
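
The blended price folds separate input and output rates into one figure at the stated 3:1 input-to-output token ratio. Here is a minimal sketch of that arithmetic; the example prices are hypothetical, not drawn from any provider's rate card.

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Blend separate input/output prices (USD per 1M tokens) at a fixed ratio.

    With the 3:1 ratio used above, three quarters of the tokens are billed
    at the input rate and one quarter at the output rate.
    """
    total = input_ratio + output_ratio
    return (input_ratio * input_per_1m + output_ratio * output_per_1m) / total

# Example with hypothetical prices: $0.10 per 1M input tokens, $0.40 per 1M output tokens.
print(f"${blended_price(0.10, 0.40):.3f} per 1M tokens")  # -> $0.175
```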