The Inference Report

April 8, 2026

Claude Opus 4.6 tops the SWE-rebench rankings at 65.3%, up from fourth place (score 53) on Artificial Analysis. Gemini 3.1 Pro Preview, tied for first on Artificial Analysis at 57.2, slips to fifth on the coding benchmark at 62.3%, while GLM-5 climbs from ninth at 49.8 to third at 62.8%. The SWE-rebench scores cluster more tightly in the top tier: the gap between first and fifth is only 3 percentage points, compared to Artificial Analysis, where GPT-5.4 and Gemini 3.1 Pro Preview tie at 57.2. This suggests the coding task may be more discriminative, or that the models' relative strengths differ meaningfully between general reasoning and software engineering. Kimi K2.5 advances from sixteenth at 46.8 on Artificial Analysis to thirteenth at 58.5% on SWE-rebench, and Kimi K2 Thinking jumps from thirty-seventh at 40.9 to seventeenth at 57.4%, indicating that both models have particular strength in code-generation tasks. The SWE-rebench benchmark itself lacks published methodology details in the data provided: there is no information on test-set size, task distribution, evaluation criteria, or whether the results reflect an initial release or continued refinement. That makes it difficult to assess whether the ranking shifts reflect genuine capability differences or methodological divergence from Artificial Analysis.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
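The five-run averaging described above can be sketched as follows. This is a minimal illustration, not SWE-rebench's actual aggregation code; the run counts and task total are hypothetical:

```python
from statistics import mean, stdev

def aggregate_runs(resolved_per_run, total_tasks):
    """Average the resolved-task rate across independent runs to damp
    stochastic variance; returns (mean rate, std dev across runs)."""
    rates = [resolved / total_tasks for resolved in resolved_per_run]
    return mean(rates), stdev(rates)

# Hypothetical: a model resolving 63-67 of 100 tasks across five runs
avg, spread = aggregate_runs([65, 63, 66, 64, 67], total_tasks=100)
```

Reporting the mean over several runs (rather than a single run) keeps a lucky or unlucky sample from moving a model several places in a leaderboard this tightly clustered.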

 #   Model                       Score
 1   Claude Opus 4.6             65.3%
 2   gpt-5.2-2025-12-11-medium   64.4%
 3   GLM-5                       62.8%
 4   gpt-5.4-2026-03-05-medium   62.8%
 5   Gemini 3.1 Pro Preview      62.3%
 6   DeepSeek-V3.2               60.9%
 7   Claude Sonnet 4.6           60.7%
 8   Claude Sonnet 4.5           60.0%
 9   Qwen3.5-397B-A17B           59.9%
10   Step-3.5-Flash              59.6%

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #   Model                    Score   tok/s   $/1M
 1   GPT-5.4                   57.2      85   $5.63
 2   Gemini 3.1 Pro Preview    57.2     132   $4.50
 3   GPT-5.3 Codex             54        76   $4.81
 4   Claude Opus 4.6           53        55   $10.00
 5   Claude Sonnet 4.6         51.7      71   $6.00
 6   GLM-5.1                   51.3      80   $2.15
 7   GPT-5.2                   51.3      69   $4.81
 8   Qwen3.6 Plus              50        52   $1.13
 9   GLM-5                     49.8      70   $1.55
10   Claude Opus 4.5           49.7      67   $10.00

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #   Model                    tok/s
 1   Grok 4.20 0309             252
 2   GPT-5 Codex                203
 3   GPT-5.4 nano               202
 4   Gemini 3 Flash Preview     196
 5   GPT-5.1 Codex              191
 6   GPT-5.4 mini               157
 7   Gemini 3 Pro Preview       139
 8   Qwen3.5 122B A10B          138
 9   Gemini 3.1 Pro Preview     132
10   MiMo-V2-Flash              129

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #   Model                 $/1M
 1   MiMo-V2-Flash         $0.15
 2   DeepSeek V3.2         $0.315
 3   GPT-5.4 nano          $0.463
 4   MiniMax-M2.7          $0.525
 5   KAT Coder Pro V2      $0.525
 6   MiniMax-M2.5          $0.525
 7   GPT-5 mini            $0.688
 8   Qwen3.5 27B           $0.825
 9   GLM-4.7               $1.00
10   Kimi K2 Thinking      $1.07
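The 3:1 blend above is a simple weighted average of per-million-token input and output prices. A minimal sketch (the prices in the example are hypothetical, not taken from the table):

```python
def blended_cost(input_price, output_price, input_ratio=3.0):
    """Blended $/1M tokens, weighting input tokens `input_ratio`:1
    against output tokens (the ranking above uses 3:1)."""
    return (input_ratio * input_price + output_price) / (input_ratio + 1.0)

# Hypothetical prices: $0.50/1M input, $2.00/1M output
cost = blended_cost(0.50, 2.00)  # (3*0.50 + 2.00) / 4 = $0.875
```

Because input tokens dominate the blend, models with cheap input pricing rank well here even when their output tokens are comparatively expensive.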