The Inference Report

April 13, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, against a score of 53.0 on the Artificial Analysis index, while Gemini 3.1 Pro Preview moves the other way, from first place on Artificial Analysis (57.2) to fifth on SWE-rebench (62.3%), and Kimi K2.5 scores 58.5% against 46.8, a difference of 11.7 percentage points. SWE-rebench scores run substantially higher across the board than Artificial Analysis scores for the same models, which suggests a difference in task difficulty, evaluation methodology, or the benchmarks' sensitivity to specific coding patterns. GLM-5 moves from tenth place (49.8) to third (62.8%), and Kimi K2 Thinking jumps from 40.9 to 57.4%, indicating that certain architectures perform disproportionately better on the SWE-rebench evaluation.

The clustering of models between 58% and 65% on SWE-rebench, compared with the wider spread on Artificial Analysis, raises the question of whether SWE-rebench's task distribution favors certain model families or whether its evaluation criteria reward specific coding strategies. Without explicit information about SWE-rebench's methodology, test-set composition, or how it differs from Artificial Analysis, the magnitude of these shifts resists clean interpretation: they could reflect genuine capability differences on software engineering tasks, calibration differences between the benchmarks, or selection effects in which models were evaluated on which benchmark.

Cole Brennan
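
The cross-benchmark shifts quoted above reduce to percentage-point differences between the two leaderboards. Below is a minimal sketch of that comparison in Python, with the scores hard-coded from this issue's commentary and tables rather than pulled from either benchmark:

```python
# Percentage-point differences between the two leaderboards for the models
# discussed above. Values are copied from this issue, not fetched live.
artificial_analysis = {
    "Claude Opus 4.6": 53.0,
    "Gemini 3.1 Pro Preview": 57.2,
    "Kimi K2.5": 46.8,
    "GLM-5": 49.8,
    "Kimi K2 Thinking": 40.9,
}
swe_rebench = {
    "Claude Opus 4.6": 65.3,
    "Gemini 3.1 Pro Preview": 62.3,
    "Kimi K2.5": 58.5,
    "GLM-5": 62.8,
    "Kimi K2 Thinking": 57.4,
}

for model in sorted(swe_rebench, key=swe_rebench.get, reverse=True):
    delta = swe_rebench[model] - artificial_analysis[model]
    print(f"{model:24}  AA {artificial_analysis[model]:5.1f}  "
          f"SWE-rebench {swe_rebench[model]:5.1f}  delta {delta:+5.1f} pp")
```

Of the models quoted in the commentary, Kimi K2 Thinking shows the widest gap, at 16.5 points.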

Daily rankings from SWE-rebench, a benchmark designed for fair comparison of LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that aggregation follows the table).

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
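
As referenced above, each published SWE-rebench number reflects five runs per model. The snippet below is only a rough sketch of how such an aggregate might be computed; the run counts, task counts, and field names are assumptions, not SWE-rebench's actual schema.

```python
from statistics import mean, stdev

def aggregate_runs(resolved_per_run: list[int], total_tasks: int) -> dict:
    """Collapse repeated evaluation runs into one leaderboard-style score.

    resolved_per_run: tasks resolved in each independent run (five per model here).
    total_tasks:      size of the current, decontaminated task snapshot (assumed).
    """
    rates = [100.0 * resolved / total_tasks for resolved in resolved_per_run]
    return {
        "score_pct": round(mean(rates), 1),   # headline number
        "stdev_pct": round(stdev(rates), 1),  # run-to-run (stochastic) spread
        "runs": len(resolved_per_run),
    }

# Hypothetical: five runs against a 300-task snapshot.
print(aggregate_runs([196, 199, 194, 197, 195], total_tasks=300))
```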

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 78 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 48 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
| 6 | Claude Sonnet 4.6 | 51.7 | 57 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 54 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 70 | $4.81 |
| 9 | Qwen3.6 Plus | 50.0 | 44 | $1.13 |
| 10 | GLM-5 | 49.8 | 86 | $1.55 |
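
The index above is described only as a composite across coding, math, and reasoning benchmarks; its actual components and weights are not given here, so the equal-weight mean below is purely an illustrative assumption.

```python
def composite_index(sub_scores: dict[str, float]) -> float:
    """Equal-weight mean of per-benchmark scores (assumed aggregation, not
    Artificial Analysis's published methodology)."""
    return round(sum(sub_scores.values()) / len(sub_scores), 1)

# Hypothetical sub-scores for an unnamed model.
print(composite_index({"coding": 52.0, "math": 60.0, "reasoning": 50.0}))  # -> 54.0
```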

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| # | Model | tok/s |
|---|-------|-------|
| 1 | Gemini 3 Flash Preview | 195 |
| 2 | GPT-5.1 Codex | 184 |
| 3 | GPT-5.4 nano | 180 |
| 4 | GPT-5.4 mini | 179 |
| 5 | GPT-5 Codex | 177 |
| 6 | Grok 4.20 0309 | 175 |
| 7 | Grok 4.20 0309 v2 | 172 |
| 8 | Qwen3.5 122B A10B | 154 |
| 9 | Gemini 3 Pro Preview | 137 |
| 10 | Gemini 3.1 Pro Preview | 132 |

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
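
The 3:1 blend used in the cost table above is a weighted average of input and output prices. Here is a minimal sketch of that calculation; the example prices are hypothetical, not the figures behind any row in the table.

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended $ per 1M tokens at a given input:output ratio (3:1 by default)."""
    total_weight = input_ratio + output_ratio
    return (input_per_1m * input_ratio + output_per_1m * output_ratio) / total_weight

# Hypothetical pricing: $0.50 per 1M input tokens, $2.00 per 1M output tokens.
print(f"${blended_price(0.50, 2.00):.3f} per 1M tokens")  # -> $0.875 per 1M tokens
```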