The Inference Report

April 7, 2026

Claude Opus 4.6 maintains its lead on SWE-rebench at 65.3 percent, which sits 12.3 points above the 53 it scores on the Artificial Analysis composite index, though the two evaluations measure different problem spaces and should not be treated as directly comparable. The top tier has consolidated around 62 to 65 percent on SWE-rebench, with gpt-5.2-2025-12-11-medium at 64.4 percent and GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8 percent; these four models now occupy the summit, separated by narrow margins.

Movement in the broader field reveals uneven progress. Kimi K2 Thinking jumped from position 37 at 40.9 percent to position 17 at 57.4 percent, a 16.5-point increase that suggests either a model update or a shift in how the benchmark evaluates reasoning-focused architectures, while Kimi K2.5 advanced from position 16 at 46.8 percent to position 13 at 58.5 percent. Gemini 3 Flash Preview improved from 46.4 percent to 52.5 percent even as its rank slipped from 18 to 22, and Nova 2.0 Lite moved from position 76 at 29.7 to position 58 at 34.5 on the Artificial Analysis side, though this latter shift reflects reranking rather than SWE-rebench performance.

The SWE-rebench scores themselves show no obvious saturation at the top: the gap between first and fifth place is 3.0 percentage points, and the distribution below rank 10 remains steep, suggesting either that the benchmark retains sufficient discriminative power or that model capabilities on software engineering tasks continue to stratify sharply by architecture and training approach. What remains unclear from the data alone is whether these gains reflect genuine improvements in code generation and repository-level reasoning, or changes in evaluation methodology, task distribution, or model selection within the rebench suite.
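
The point movements above are straightforward differences between two leaderboard snapshots. A minimal sketch of that arithmetic, assuming a hypothetical mapping of model name to (rank, score) rather than the report's actual data pipeline:

```python
# Hypothetical before/after snapshots: model -> (rank, score). Not the
# report's real data feed; values copied from the figures quoted above.
previous = {
    "Kimi K2 Thinking": (37, 40.9),
    "Kimi K2.5": (16, 46.8),
    "Gemini 3 Flash Preview": (18, 46.4),
}
current = {
    "Kimi K2 Thinking": (17, 57.4),
    "Kimi K2.5": (13, 58.5),
    "Gemini 3 Flash Preview": (22, 52.5),
}

for model, (prev_rank, prev_score) in previous.items():
    cur_rank, cur_score = current[model]
    print(
        f"{model}: {prev_score:.1f}% -> {cur_score:.1f}% "
        f"({cur_score - prev_score:+.1f} pts), rank {prev_rank} -> {cur_rank}"
    )
```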

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                      Score
 1  Claude Opus 4.6            65.3%
 2  gpt-5.2-2025-12-11-medium  64.4%
 3  GLM-5                      62.8%
 4  gpt-5.4-2026-03-05-medium  62.8%
 5  Gemini 3.1 Pro Preview     62.3%
 6  DeepSeek-V3.2              60.9%
 7  Claude Sonnet 4.6          60.7%
 8  Claude Sonnet 4.5          60.0%
 9  Qwen3.5-397B-A17B          59.9%
10  Step-3.5-Flash             59.6%
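
The five-runs-per-model protocol described above means each published score aggregates repeated attempts. A minimal sketch of one plausible aggregation, assuming hypothetical per-task boolean results; SWE-rebench's exact aggregation rule is not specified here, so the mean-over-runs choice is an assumption:

```python
from statistics import mean, stdev

# Hypothetical per-task results: five independent runs, True if the task's
# tests pass after the model's patch is applied. Assumed format, not the
# benchmark's real schema.
runs_by_task = {
    "repo-a/issue-101": [True, True, False, True, True],
    "repo-b/issue-202": [False, False, False, True, False],
    "repo-c/issue-303": [True, True, True, True, True],
}

# Resolved rate of each run across tasks, then mean and spread over the
# five runs to account for stochastic variance.
per_run_rates = [
    mean(run[i] for run in runs_by_task.values()) for i in range(5)
]
print(f"resolved rate: {mean(per_run_rates):.1%} ± {stdev(per_run_rates):.1%}")
```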

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                    Score  tok/s  $/1M
 1  GPT-5.4                  57.2   82     $5.63
 2  Gemini 3.1 Pro Preview   57.2   142    $4.50
 3  GPT-5.3 Codex            54     81     $4.81
 4  Claude Opus 4.6          53     54     $10.00
 5  Claude Sonnet 4.6        51.7   66     $6.00
 6  GPT-5.2                  51.3   79     $4.81
 7  GLM-5                    49.8   69     $1.55
 8  Claude Opus 4.5          49.7   67     $10.00
 9  MiniMax-M2.7             49.6   43     $0.525
10  MiMo-V2-Pro              49.2   0      $1.50
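
The Artificial Analysis score is a composite across coding, math, and reasoning evaluations; the constituent benchmarks and their weights are not given here, so the equal-weight average below is purely illustrative:

```python
# Illustrative composite: equal-weight mean over hypothetical sub-scores,
# each already normalized to a 0-100 scale. The real index's constituents
# and weights may differ.
def composite(sub_scores: dict[str, float]) -> float:
    return sum(sub_scores.values()) / len(sub_scores)

example = {"coding": 55.0, "math": 60.0, "reasoning": 50.0}
print(f"composite index: {composite(example):.1f}")  # 55.0
```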

Output tokens per second (higher is faster). Only models scoring at least 40 on the intelligence index are included.

 #  Model                    tok/s
 1  Grok 4.20 Beta 0309      265
 2  GPT-5 Codex              214
 3  GPT-5.4 nano             209
 4  Gemini 3 Flash Preview   197
 5  GPT-5.1 Codex            190
 6  GPT-5.4 mini             166
 7  Qwen3.5 122B A10B        154
 8  Gemini 3 Pro Preview     143
 9  Gemini 3.1 Pro Preview   142
10  MiMo-V2-Flash            137
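
Both the speed and cost lists apply the same inclusion rule: models scoring below 40 on the intelligence index are dropped before ranking. A small sketch of that filter-and-sort step over hypothetical records:

```python
# Hypothetical records: (model, intelligence score, output tokens per second).
models = [
    ("Model A", 57.2, 82),
    ("Model B", 38.5, 310),  # fast, but below the score cutoff
    ("Model C", 49.6, 43),
]

MIN_SCORE = 40
fastest = sorted(
    (m for m in models if m[1] >= MIN_SCORE),
    key=lambda m: m[2],
    reverse=True,
)
for rank, (name, score, tok_s) in enumerate(fastest, start=1):
    print(f"{rank}  {name}  {tok_s} tok/s (score {score})")
```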

Blended cost per 1M tokens at a 3:1 input/output mix (lower is cheaper). Only models scoring at least 40 on the intelligence index are included.

 #  Model               $/1M
 1  MiMo-V2-Flash       $0.15
 2  DeepSeek V3.2       $0.315
 3  GPT-5.4 nano        $0.463
 4  MiniMax-M2.7        $0.525
 5  KAT Coder Pro V2    $0.525
 6  MiniMax-M2.5        $0.525
 7  GPT-5 mini          $0.688
 8  Qwen3.5 27B         $0.825
 9  GLM-4.7             $1.00
10  Kimi K2 Thinking    $1.07
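
A 3:1 input/output blend assumes three input tokens for every output token, so the blended figure is a weighted average of the two per-million rates. A minimal sketch with illustrative prices, not any particular vendor's list prices:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Weighted average price per 1M tokens at the given input:output mix."""
    total = input_ratio + output_ratio
    return (input_ratio * input_per_m + output_ratio * output_per_m) / total

# Illustrative: $3.00/1M input and $15.00/1M output blend to $6.00/1M at 3:1.
print(f"${blended_price(3.00, 15.00):.2f}")
```

At this mix, three quarters of the weight falls on the input rate, which is why models with cheap input tokens fare well in this table even when their output tokens are pricier.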