The Inference Report

April 9, 2026

Claude Opus 4.6 tops SWE-rebench with a 65.3% pass rate, even though it sits only fourth on the Artificial Analysis composite at 53 points, while Gemini 3.1 Pro Preview shows the opposite split, placing second on Artificial Analysis but fifth on SWE-rebench despite a solid 62.3% on the coding tasks. The divergence suggests the two benchmarks sample different problem distributions or apply different evaluation criteria. GLM-5 and Kimi K2.5 both climbed sharply on SWE-rebench, the latter jumping from 46.8% to 58.5%, and Kimi K2 Thinking advanced from 40.9% to 57.4%, a sign that reasoning-oriented model variants are gaining traction on software engineering tasks.

The SWE-rebench top tier also clusters more tightly than the Artificial Analysis rankings: the gap between first and fifth place is 3.0 points (65.3% to 62.3%) versus 5.1 points on the composite index (57.2 to 52.1). That suggests either that SWE-rebench is less discriminative among the most capable models or that recent releases have converged on similar capabilities for this task class. The methodology appears to reward practical code generation and repository-level problem-solving over general reasoning or mathematical ability: GPT-5.4 ranks first on Artificial Analysis (57.2) but only fourth on SWE-rebench (62.8% for gpt-5.4-2026-03-05-medium), indicating that task-specific tuning and architectural choices matter more than raw parameter count or general reasoning prowess.

Cole Brennan
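
To sanity-check the clustering comparison above, here is a minimal sketch that recomputes the first-to-fifth spreads from the scores in the two leaderboards below; the score lists are transcribed from those tables, and the helper function is purely illustrative.

```python
# Top-five scores transcribed from the two leaderboards below.
swe_rebench_top5 = [65.3, 64.4, 62.8, 62.8, 62.3]          # pass rate, %
artificial_analysis_top5 = [57.2, 57.2, 54.0, 53.0, 52.1]  # composite index

def first_to_fifth_spread(scores: list[float]) -> float:
    """Gap between the best and the fifth-best score."""
    return max(scores) - min(scores)

print(f"SWE-rebench spread:         {first_to_fifth_spread(swe_rebench_top5):.1f} pts")          # 3.0
print(f"Artificial Analysis spread: {first_to_fifth_spread(artificial_analysis_top5):.1f} pts")  # 5.1
```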

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
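
The five-run protocol amounts to averaging per-run pass rates. A minimal sketch of the idea (the numbers are hypothetical, and this is not the actual SWE-rebench harness):

```python
import statistics

# Hypothetical per-run results for one model: each run attempts the same
# task set, and the pass rate is the fraction of tasks whose patches pass.
runs = [0.66, 0.64, 0.65, 0.67, 0.64]  # five independent runs

mean = statistics.mean(runs)
spread = statistics.stdev(runs)
print(f"reported pass rate: {mean:.1%} (±{spread:.1%} across runs)")
```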

 #   Model                       Score
 1   Claude Opus 4.6             65.3%
 2   gpt-5.2-2025-12-11-medium   64.4%
 3   GLM-5                       62.8%
 4   gpt-5.4-2026-03-05-medium   62.8%
 5   Gemini 3.1 Pro Preview      62.3%
 6   DeepSeek-V3.2               60.9%
 7   Claude Sonnet 4.6           60.7%
 8   Claude Sonnet 4.5           60.0%
 9   Qwen3.5-397B-A17B           59.9%
10   Step-3.5-Flash              59.6%

Artificial Analysis composite index across coding, math, and reasoning benchmarks.
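
The exact constituent benchmarks and weights aren't given in this report; conceptually, though, a composite index is a weighted mean over category scores. A minimal sketch, assuming equal weights and made-up inputs:

```python
# Hypothetical category scores for one model; the real index's
# benchmarks and weights are not published in this excerpt.
scores = {"coding": 58.0, "math": 61.0, "reasoning": 52.6}
weights = {"coding": 1 / 3, "math": 1 / 3, "reasoning": 1 / 3}  # assumed equal

composite = sum(scores[k] * weights[k] for k in scores)
print(f"composite index: {composite:.1f}")  # 57.2 with these invented inputs
```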

 #   Model                    Score   tok/s   $/1M
 1   GPT-5.4                   57.2      85   $5.63
 2   Gemini 3.1 Pro Preview    57.2     124   $4.50
 3   GPT-5.3 Codex             54.0      76   $4.81
 4   Claude Opus 4.6           53.0      55   $10.00
 5   Muse Spark                52.1       0   $0.00
 6   Claude Sonnet 4.6         51.7      71   $6.00
 7   GLM-5.1                   51.4      73   $2.15
 8   GPT-5.2                   51.3      67   $4.81
 9   Qwen3.6 Plus              50.0      52   $1.13
10   GLM-5                     49.8      71   $1.55

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #   Model                    tok/s
 1   Grok 4.20 0309             244
 2   Grok 4.20 0309 v2          215
 3   GPT-5.4 nano               198
 4   GPT-5 Codex                190
 5   Gemini 3 Flash Preview     185
 6   GPT-5.1 Codex              179
 7   GPT-5.4 mini               161
 8   Gemini 3 Pro Preview       134
 9   MiMo-V2-Flash              130
10   Qwen3.5 122B A10B          126

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
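
Assuming the conventional reading of a 3:1 blend, the blended price weights input tokens three to one against output, i.e. blended = (3 × input + output) / 4 per million tokens. A quick check with hypothetical per-token prices:

```python
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens at an assumed 3:1 input/output weighting."""
    return (3 * input_per_1m + output_per_1m) / 4

# Hypothetical prices: $0.25/1M input and $1.00/1M output blend to about $0.44.
print(f"${blended_cost(0.25, 1.00):.3f} per 1M tokens")  # $0.438
```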

 #   Model                 $/1M
 1   MiMo-V2-Flash        $0.15
 2   DeepSeek V3.2        $0.315
 3   GPT-5.4 nano         $0.463
 4   MiniMax-M2.7         $0.525
 5   KAT Coder Pro V2     $0.525
 6   MiniMax-M2.5         $0.525
 7   GPT-5 mini           $0.688
 8   Qwen3.5 27B          $0.825
 9   GLM-4.7              $1.00
10   Kimi K2 Thinking     $1.07