The Inference Report

May 14, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%. Its Artificial Analysis score of 52.9 is far lower, but the two are distinct benchmarks measuring different problem sets, so the numbers should not be read as movement on the same task. The SWE-rebench leaderboard shows tight clustering in the upper tier, with models ranked 2 through 7 all scoring between 64.4% and 62.3%, suggesting convergence in code-agent capability among leading systems.

GLM-5 and GLM-5.1 both advanced significantly on Artificial Analysis, climbing from positions 17 and 14 to current scores of 49.8 and 51.4 respectively, while Kimi K2 Thinking rose from position 54 to a score of 40.9, indicating that Chinese-developed models are narrowing the gap. Gemini 3.1 Pro Preview held steady at 57.2 on Artificial Analysis and sits at position 7 on SWE-rebench with 62.3%, apparently down from a higher placement in earlier runs.

The SWE-rebench methodology evaluates code agents on real GitHub issues requiring multi-step reasoning and tool use, while Artificial Analysis covers general reasoning tasks, so raw score differences across benchmarks reflect task difficulty rather than model capability regression. Within SWE-rebench, the spread from position 1 to position 10 spans only 5.3 percentage points, suggesting marginal gains now require increasingly refined approaches rather than architectural leaps. The Artificial Analysis rankings show broader stratification, with positions 1 through 10 spanning 8.0 points, indicating that general reasoning benchmarks may currently be more discriminative at the frontier than specialized coding benchmarks.
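The spread arithmetic can be checked directly against the tables below. A minimal Python sketch, with the top-10 scores hardcoded from both leaderboards:

```python
# Top-10 scores copied from the SWE-rebench and Artificial Analysis
# tables below; the spread is top-1 minus top-10.
swe_rebench = [65.3, 64.4, 62.8, 62.8, 62.8, 62.7, 62.3, 60.9, 60.7, 60.0]
artificial_analysis = [60.2, 57.3, 57.2, 56.8, 53.9, 53.8, 53.6, 53.2, 52.9, 52.2]

for name, scores in [("SWE-rebench", swe_rebench),
                     ("Artificial Analysis", artificial_analysis)]:
    print(f"{name}: top-1 to top-10 spread = {scores[0] - scores[-1]:.1f} points")
# SWE-rebench: top-1 to top-10 spread = 5.3 points
# Artificial Analysis: top-1 to top-10 spread = 8.0 points
```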

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of this aggregation follows the table).

 #  Model                       Score
 1  Claude Opus 4.6             65.3%
 2  gpt-5.2-2025-12-11-medium   64.4%
 3  GLM-5                       62.8%
 4  Junie                       62.8%
 5  gpt-5.4-2026-03-05-medium   62.8%
 6  GLM-5.1                     62.7%
 7  Gemini 3.1 Pro Preview      62.3%
 8  DeepSeek-V3.2               60.9%
 9  Claude Sonnet 4.6           60.7%
10  Claude Sonnet 4.5           60.0%
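The headline scores above come from the five-run protocol described earlier. SWE-rebench's actual aggregation code isn't shown here, so the sketch below is one plausible reading, with hypothetical per-run resolve rates:

```python
from statistics import mean, stdev

# Hypothetical per-run resolve rates (%) for one model across five
# independent runs; the numbers are illustrative, not published data.
runs = [64.8, 65.9, 65.1, 65.6, 65.1]

score = mean(runs)    # headline leaderboard score
noise = stdev(runs)   # run-to-run stochastic variance
print(f"score = {score:.1f}%, run-to-run stdev = {noise:.2f}")
# score = 65.3%, run-to-run stdev = 0.44
```

Reporting the mean of several runs, rather than a single pass, keeps day-to-day leaderboard movement from being dominated by sampling noise.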

Artificial Analysis composite index across coding, math, and reasoning benchmarks (a sketch of one possible weighting follows the table).

 #  Model                    Score   tok/s   $/1M
 1  GPT-5.5                   60.2      65   $11.25
 2  Claude Opus 4.7           57.3      63   $10.94
 3  Gemini 3.1 Pro Preview    57.2     128   $4.50
 4  GPT-5.4                   56.8      83   $5.63
 5  Kimi K2.6                 53.9      41   $1.71
 6  MiMo-V2.5-Pro             53.8      54   $1.50
 7  GPT-5.3 Codex             53.6      76   $4.81
 8  Grok 4.3                  53.2      81   $1.56
 9  Claude Opus 4.6           52.9      48   $10.94
10  Muse Spark                52.2       0   $0.00
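How the composite index above is assembled from its coding, math, and reasoning components isn't specified here, so the sketch below assumes a simple unweighted mean over hypothetical category scores; the real index may weight categories differently:

```python
# Hypothetical per-category scores for one model. Equal weighting is an
# assumption for illustration, not Artificial Analysis's published method.
categories = {"coding": 58.1, "math": 63.4, "reasoning": 59.1}

index = sum(categories.values()) / len(categories)
print(f"composite index = {index:.1f}")  # composite index = 60.2
```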

Output tokens per second — higher is faster. Only models with an intelligence score of at least 40 are included.

 #  Model                    tok/s
 1  Gemini 3 Flash Preview     197
 2  GPT-5.1 Codex              183
 3  Qwen3.6 35B A3B            182
 4  GPT-5 Codex                179
 5  GPT-5.4 mini               169
 6  Qwen3.5 122B A10B          154
 7  MiMo-V2-Flash              149
 8  GPT-5.4 nano               148
 9  Hy3-preview                134
10  Gemini 3.1 Pro Preview     128

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Only models with an intelligence score of at least 40 are included (a worked example follows the table).

 #  Model               $/1M
 1  Hy3-preview         $0.143
 2  MiMo-V2-Flash       $0.15
 3  DeepSeek V4 Flash   $0.175
 4  DeepSeek V3.2       $0.337
 5  GPT-5.4 nano        $0.463
 6  MiniMax-M2.7        $0.525
 7  KAT Coder Pro V2    $0.525
 8  MiniMax-M2.5        $0.525
 9  Qwen3.6 35B A3B     $0.557
10  GPT-5 mini          $0.688
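The 3:1 blend works out to (3 × input price + output price) / 4. A minimal sketch; the per-direction prices are assumptions for illustration, not quoted figures:

```python
def blended_cost(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at the 3:1 input/output ratio used above."""
    return (3 * input_per_m + output_per_m) / 4

# Assumed prices of $0.25/1M input and $2.00/1M output blend to $0.688/1M,
# matching the GPT-5 mini row above (the split itself is an assumption).
print(f"${blended_cost(0.25, 2.00):.3f}/1M")  # $0.688/1M
```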