The Inference Report

April 17, 2026

Claude Opus 4.6 moved from fourth on the Artificial Analysis index to first on SWE-rebench, climbing 12.3 percentage points from 53 to 65.3 percent, while Gemini 3.1 Pro Preview dropped from the top Artificial Analysis ranking at 57.2 to sixth on SWE-rebench at 62.3 percent. The gap between first and second place on SWE-rebench narrowed to just 0.9 points (Claude Opus 4.6 at 65.3 versus gpt-5.2-2025-12-11-medium at 64.4), and the top six models now cluster between 62.3 and 65.3 percent, suggesting convergence at the frontier rather than separation.

GLM-5 and GLM-5.1 gained roughly 13 and 11 points respectively, moving from tenth and seventh on Artificial Analysis to third and fifth on SWE-rebench, which indicates that coding-specific evaluation surfaces a different capability profile than the broader Artificial Analysis benchmark does. The two benchmarks also tell divergent stories about the field: SWE-rebench shows tight competition in the 58 to 65 percent range across the top twenty models, while Artificial Analysis exhibits steeper stratification, with the top performer at 57.2 and a sharper drop-off below rank fifty.

The methodological difference matters here. SWE-rebench measures repository-level code generation against real GitHub issues with deterministic evaluation criteria, while Artificial Analysis covers broader reasoning and general capability. Models like Claude Opus 4.6 and the GLM family appear better calibrated to the specific constraints of software engineering tasks. Still, without visibility into whether the SWE-rebench test set, evaluation harness, or scoring logic changed, or whether these results reflect genuinely new model versions, the magnitude of the gains (particularly the 12.3-point jump for Claude) warrants scrutiny.
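The deltas quoted above fall straight out of the two leaderboards printed below. A minimal sketch of that arithmetic in Python, with the published scores hard-coded for the models that appear under the same name in both tables:

```python
# Published scores from the two leaderboards below (only models listed under
# the same name in both): Artificial Analysis vs. SWE-rebench.
artificial_analysis = {  # model -> (rank, score)
    "Gemini 3.1 Pro Preview": (1, 57.2),
    "Claude Opus 4.6": (4, 53.0),
    "GLM-5.1": (7, 51.4),
    "GLM-5": (10, 49.8),
}
swe_rebench = {  # model -> (rank, score %)
    "Claude Opus 4.6": (1, 65.3),
    "GLM-5": (3, 62.8),
    "GLM-5.1": (5, 62.7),
    "Gemini 3.1 Pro Preview": (6, 62.3),
}

for model in sorted(artificial_analysis):
    aa_rank, aa_score = artificial_analysis[model]
    swe_rank, swe_score = swe_rebench[model]
    print(f"{model}: {aa_score:.1f} -> {swe_score:.1f} "
          f"({swe_score - aa_score:+.1f} pts), rank {aa_rank} -> {swe_rank}")
```

Running this reproduces the figures in the text, for example "Claude Opus 4.6: 53.0 -> 65.3 (+12.3 pts), rank 4 -> 1".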

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                        Score
 1  Claude Opus 4.6              65.3%
 2  gpt-5.2-2025-12-11-medium    64.4%
 3  GLM-5                        62.8%
 4  gpt-5.4-2026-03-05-medium    62.8%
 5  GLM-5.1                      62.7%
 6  Gemini 3.1 Pro Preview       62.3%
 7  DeepSeek-V3.2                60.9%
 8  Claude Sonnet 4.6            60.7%
 9  Claude Sonnet 4.5            60.0%
10  Qwen3.5-397B-A17B            59.9%
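As a rough illustration of the five-run averaging described above, here is a minimal sketch assuming each run yields a pass/fail outcome per issue; the actual SWE-rebench harness and scoring code are not shown here, so the data shapes and names are illustrative only.

```python
from statistics import mean, stdev

def resolved_rate(run_results: list[bool]) -> float:
    """Fraction of issues resolved in a single evaluation run."""
    return sum(run_results) / len(run_results)

def score_model(runs: list[list[bool]]) -> tuple[float, float]:
    """Average resolved rate over repeated runs, plus run-to-run spread."""
    rates = [resolved_rate(run) for run in runs]
    return mean(rates), stdev(rates)

# Hypothetical outcomes: 5 independent runs over the same 8 issues.
runs = [
    [True, True, False, True, True, False, True, True],
    [True, True, True, True, False, False, True, True],
    [True, False, False, True, True, False, True, True],
    [True, True, False, True, True, False, True, False],
    [True, True, False, True, True, True, True, True],
]
avg, spread = score_model(runs)
print(f"resolved rate: {avg:.1%} (run-to-run std dev {spread:.1%})")
```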

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                   Score  tok/s    $/1M
 1  Gemini 3.1 Pro Preview   57.2    123   $4.50
 2  GPT-5.4                  56.8     81   $5.63
 3  GPT-5.3 Codex            53.6     70   $4.81
 4  Claude Opus 4.6          53.0     44  $10.00
 5  Muse Spark               52.1      0   $0.00
 6  Claude Sonnet 4.6        51.7     54   $6.00
 7  GLM-5.1                  51.4     46   $2.15
 8  GPT-5.2                  51.3     64   $4.81
 9  Qwen3.6 Plus             50.0     54   $1.13
10  GLM-5                    49.8     64   $1.55
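Artificial Analysis does not publish its exact weighting in this report, so the following is only an illustrative sketch of how a composite across coding, math, and reasoning scores could be blended, assuming equal weights; the real index may weight or normalize the components differently.

```python
def composite(scores: dict[str, float], weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-benchmark scores (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical per-category scores for a single model, equally weighted.
print(round(composite({"coding": 62.0, "math": 55.0, "reasoning": 58.0}), 1))  # 58.3
```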

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #  Model                   tok/s
 1  GPT-5 Codex               180
 2  GPT-5.1 Codex             179
 3  Gemini 3 Flash Preview    176
 4  Grok 4.20 0309            161
 5  GPT-5.4 mini              159
 6  GPT-5.4 nano              158
 7  Grok 4.20 0309 v2         146
 8  Qwen3.5 122B A10B         130
 9  Gemini 3 Pro Preview      128
10  Gemini 3.1 Pro Preview    123

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #  Model               $/1M
 1  MiMo-V2-Flash      $0.15
 2  DeepSeek V3.2     $0.315
 3  GPT-5.4 nano      $0.463
 4  MiniMax-M2.7      $0.525
 5  KAT Coder Pro V2  $0.525
 6  MiniMax-M2.5      $0.525
 7  GPT-5 mini        $0.688
 8  Qwen3.5 27B       $0.825
 9  GLM-4.7            $1.00
10  Kimi K2 Thinking   $1.07
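The blended figure in this table folds input and output prices into a single number at the stated 3:1 input/output ratio. A minimal sketch of that arithmetic, using hypothetical per-token rates since the table lists only the blended result:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 3.0, output_share: float = 1.0) -> float:
    """Blend input/output prices per 1M tokens at a fixed usage ratio (3:1 by default)."""
    return (input_per_m * input_share + output_per_m * output_share) / (input_share + output_share)

# Hypothetical rates: $1.00/1M input, $5.00/1M output -> $2.00/1M blended.
print(blended_price(1.00, 5.00))
```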