The Inference Report

April 30, 2026

Claude Opus 4.6 holds the SWE-rebench lead at 65.3%, unchanged from the previous cycle, while the tier immediately below shows modest compression: gpt-5.2-2025-12-11-medium sits at 64.4%, and GLM-5 and gpt-5.4-2026-03-05-medium both score 62.8%.

The meaningful movement is in the mid-field. GLM-4.7 ranks 14th on SWE-rebench (58.7%) against 43rd on Artificial Analysis (42.1 points), a gap that suggests either a genuine capability jump or a divergence in what the two benchmarks measure. Kimi K2.5 shows a similar split (rank 16 versus rank 28), and Kimi K2 Thinking a wider one (rank 21 versus rank 53), a pattern indicating that Chinese models fare markedly better on the SWE-rebench evaluation specifically.

Gemini 3.1 Pro Preview dropped from rank 3 to rank 6 on SWE-rebench (62.3%) despite holding rank 3 on Artificial Analysis (57.2), a discrepancy that raises the question of whether SWE-rebench is unstable cycle to cycle or simply weights different problem classes. The Artificial Analysis leaderboard itself shows minimal reshuffling in the top 20, with GPT-5.5 leading at 60.2 and Claude Opus 4.7 at 57.3, suggesting those rankings have stabilized.

At the lower end, the Granite 4.1 models appear as new Artificial Analysis entries (30B at rank 229, 8B at 261, 3B at 324), and QwQ 32B and Qwen3 VL 30B A3B swapped positions at ranks 160 and 161 without any score change, a purely cosmetic reordering. The absence of dramatic score inflation on either benchmark, and the persistence of the same top performers, suggest the evaluations are not drifting. The divergence between SWE-rebench and Artificial Analysis rankings for mid-tier models still warrants investigation, though: do the two stress different failure modes, or merely employ different evaluation protocols? One way to probe that question is a rank correlation over the models both leaderboards share, sketched below.
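The sketch below is a minimal, self-contained illustration of that idea, using only the handful of rank pairs quoted above rather than the full intersection of both lists; it is not an official analysis from either benchmark. It re-ranks each list within the shared subset before applying the standard Spearman formula.

```python
def to_ranks(values):
    """Convert leaderboard positions to 1-based ranks within the subset."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation for two equal-length lists without ties:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = to_ranks(xs), to_ranks(ys)
    n = len(rx)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# (SWE-rebench rank, Artificial Analysis rank) for four shared models
shared = {
    "Gemini 3.1 Pro Preview": (6, 3),
    "GLM-4.7": (14, 43),
    "Kimi K2.5": (16, 28),
    "Kimi K2 Thinking": (21, 53),
}
swe = [s for s, _ in shared.values()]
aa = [a for _, a in shared.values()]
print(f"Spearman rho over shared models: {spearman_rho(swe, aa):.2f}")  # 0.80
```

A rho near 1 would mean the two leaderboards mostly agree on ordering; on this small, hand-picked subset the figure is far less informative than it would be computed over every model that appears on both lists.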

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance; a sketch of that five-run averaging follows the table.

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
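The five-run protocol matters because agentic coding evaluations are noisy: a single run can swing a model's resolved rate by a few points. Below is a minimal sketch of that averaging, with run_agent as a hypothetical stand-in for a real rollout in the shared scaffold (the real harness would run the model against the task and check whether its patch passes the instance's tests):

```python
import random
import statistics

def run_agent(model, instance, seed):
    """Hypothetical stand-in for one agent rollout on one task.
    Returns True if the model's patch resolves the instance."""
    rng = random.Random(hash((model, instance, seed)))
    return rng.random() < 0.6  # simulate a ~60% resolved rate

def evaluate_model(model, instances, runs=5):
    """Mean and spread of the resolved rate over independent runs,
    mirroring the five-run protocol described above."""
    rates = []
    for seed in range(runs):
        resolved = sum(run_agent(model, inst, seed) for inst in instances)
        rates.append(resolved / len(instances))
    return statistics.mean(rates), statistics.stdev(rates)

mean, spread = evaluate_model("example-model", [f"task-{i}" for i in range(50)])
print(f"resolved: {mean:.1%} ± {spread:.1%}")
```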

Artificial Analysis composite index across coding, math, and reasoning benchmarks. A sketch of how such a composite can be formed follows the table.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | GPT-5.5 | 60.2 | 65 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 52 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 129 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 93 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 25 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 59 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 86 | $4.81 |
| 8 | Claude Opus 4.6 | 53.0 | 49 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 33 | $2.92 |
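A composite index of this kind is typically a weighted mean of normalized per-benchmark scores. The sketch below shows one plausible construction; the category weights and scores are invented for illustration and are not Artificial Analysis's actual methodology.

```python
# Illustrative only: one plausible way to form a composite index from
# per-category scores, each on a 0-100 scale. Weights are assumptions.
WEIGHTS = {"coding": 0.4, "math": 0.3, "reasoning": 0.3}

def composite(scores, weights=WEIGHTS):
    """Weighted mean of per-category scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * scores[k] for k in weights)

# Hypothetical per-category scores for a made-up model
print(composite({"coding": 62.0, "math": 58.5, "reasoning": 60.0}))  # ≈ 60.35
```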

Output tokens per second — higher is faster. Minimum intelligence score of 40. A sketch of how throughput is typically measured follows the table.

| # | Model | tok/s |
|---|-------|-------|
| 1 | Qwen3.6 35B A3B | 191 |
| 2 | Gemini 3 Flash Preview | 189 |
| 3 | GPT-5.1 Codex | 170 |
| 4 | GPT-5 Codex | 166 |
| 5 | GPT-5.4 nano | 160 |
| 6 | GPT-5.4 mini | 158 |
| 7 | Qwen3.5 122B A10B | 142 |
| 8 | Gemini 3.1 Pro Preview | 129 |
| 9 | Gemini 3 Pro Preview | 129 |
| 10 | GPT-5.1 | 126 |
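Throughput figures like these are generally obtained by timing a streaming generation and dividing output tokens by elapsed decode time. A minimal sketch, with fake_stream as a hypothetical stand-in for a real streaming API response:

```python
import time

def tokens_per_second(stream):
    """Time a token stream and return output tokens per second.
    `stream` is any iterable yielding tokens; here we simulate one.
    Real measurements often exclude time-to-first-token."""
    start = time.perf_counter()
    count = sum(1 for _ in stream)
    return count / (time.perf_counter() - start)

def fake_stream(n_tokens=500, delay=0.005):
    """Hypothetical stand-in for a streaming response."""
    for _ in range(n_tokens):
        time.sleep(delay)  # pretend per-token decode latency
        yield "tok"

# Prints roughly 150-200 tok/s depending on timer granularity
print(f"{tokens_per_second(fake_stream()):.0f} tok/s")
```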

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40. The blend formula is worked through after the table.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
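The 3:1 blend weights input tokens three times as heavily as output tokens, approximating a prompt-heavy workload. A worked sketch, with illustrative per-million prices that are not taken from any provider's price sheet:

```python
def blended_price(input_per_1m, output_per_1m, ratio=(3, 1)):
    """Blended $/1M tokens at a given input:output ratio.
    With the 3:1 blend used above: (3*input + 1*output) / 4."""
    i, o = ratio
    return (i * input_per_1m + o * output_per_1m) / (i + o)

# Illustrative: $0.50/1M input and $2.00/1M output blend to
# (3 * 0.50 + 1 * 2.00) / 4 = $0.875/1M.
print(blended_price(0.50, 2.00))  # 0.875
```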