The Inference Report

May 12, 2026

Claude Opus 4.6 holds the top position on SWE-rebench with 65.3%, a jump of 12.3 percentage points from its previous ranking of ninth place at 53%, while the rest of the field has compressed at the high end: positions two through five are separated by only 1.6 points. The movement reflects real gains in code generation capability, though the methodology warrants scrutiny. SWE-rebench measures the ability to resolve GitHub issues end-to-end, a task that rewards reasoning depth and context management rather than pure instruction-following, and the compression at the top suggests these models are approaching saturation on the benchmark's current problem distribution.

On Artificial Analysis, the landscape differs markedly: GPT-5.5 leads at 60.2 while Claude Opus 4.6 sits ninth at 53, indicating either that the two benchmarks reward different capabilities or that Artificial Analysis weights broader performance categories beyond coding. The divergence between benchmarks matters. Claude Sonnet 4.6 ranks ninth on SWE-rebench at 60.7% but only twelfth on Artificial Analysis at 51.7%, suggesting it excels at the specific demands of issue resolution but underperforms on Artificial Analysis's mixed evaluation.

GLM-5 climbed from seventeenth to third on SWE-rebench, gaining 13 points, and Kimi K2 Thinking rose from fifty-fourth to twenty-first on the same benchmark with a 16.5-point gain, patterns that point to targeted improvements in code reasoning. The Artificial Analysis list saw minimal reordering beyond the top tier, with most models holding their positions, which indicates either stable model performance or less frequent evaluation updates on that benchmark. What distinguishes this cycle is not a methodological breakthrough but consolidation: the top performers are now substantially ahead of the middle tier, and the gap between first and tenth place on SWE-rebench spans 5.3 points, a meaningful spread that reflects real differences in how models handle multi-step code tasks.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
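To illustrate the resampling step, here is a minimal sketch of averaging a resolved rate over five runs. The run outcomes and scoring function are invented for illustration; this is not SWE-rebench's actual harness or data.

```python
# Hypothetical illustration of averaging a model's resolved rate over five
# independent runs, the way a benchmark might smooth stochastic variance.
# The run outcomes below are made up; they are not real benchmark results.

def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of issues the model resolved end-to-end in one run."""
    return sum(outcomes) / len(outcomes)

# Five runs over the same issue set (True = issue resolved).
runs = [
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, True, True, False],
    [True, False, False, True, False],
    [True, True, False, True, True],
]

per_run = [resolved_rate(run) for run in runs]
score = sum(per_run) / len(per_run)  # mean over the five runs
print(f"per-run rates: {per_run}")
print(f"reported score: {score:.1%}")
```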

 #   Model                        Score
 1   Claude Opus 4.6              65.3%
 2   gpt-5.2-2025-12-11-medium    64.4%
 3   GLM-5                        62.8%
 4   Junie                        62.8%
 5   gpt-5.4-2026-03-05-medium    62.8%
 6   GLM-5.1                      62.7%
 7   Gemini 3.1 Pro Preview       62.3%
 8   DeepSeek-V3.2                60.9%
 9   Claude Sonnet 4.6            60.7%
10   Claude Sonnet 4.5            60.0%
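The compression and spread figures cited in the opening analysis can be checked directly against this table. A short sketch, using only the published top-ten scores:

```python
# Recompute the spreads discussed above from the SWE-rebench top ten.
scores = {
    "Claude Opus 4.6": 65.3,
    "gpt-5.2-2025-12-11-medium": 64.4,
    "GLM-5": 62.8,
    "Junie": 62.8,
    "gpt-5.4-2026-03-05-medium": 62.8,
    "GLM-5.1": 62.7,
    "Gemini 3.1 Pro Preview": 62.3,
    "DeepSeek-V3.2": 60.9,
    "Claude Sonnet 4.6": 60.7,
    "Claude Sonnet 4.5": 60.0,
}

ranked = sorted(scores.values(), reverse=True)
top_tier_compression = ranked[1] - ranked[4]  # places 2 through 5: 1.6 points
first_to_tenth = ranked[0] - ranked[9]        # places 1 through 10: 5.3 points
print(f"places 2-5 span {top_tier_compression:.1f} points")
print(f"places 1-10 span {first_to_tenth:.1f} points")
```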

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #   Model                    Score   tok/s   $/1M
 1   GPT-5.5                  60.2      66    $11.25
 2   Claude Opus 4.7          57.3      71    $10.94
 3   Gemini 3.1 Pro Preview   57.2     143    $4.50
 4   GPT-5.4                  56.8      95    $5.63
 5   Kimi K2.6                53.9      41    $1.71
 6   MiMo-V2.5-Pro            53.8      57    $1.50
 7   GPT-5.3 Codex            53.6      95    $4.81
 8   Grok 4.3                 53.2      83    $1.56
 9   Claude Opus 4.6          53        53    $10.94
10   Muse Spark               52.1       0    $0.00
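Artificial Analysis does not publish its per-category weighting here, so the sketch below only shows the general shape of a composite index: a weighted mean over category scores. The category names, weights, and values are assumptions for illustration, not the index's actual methodology.

```python
# Hypothetical composite index: a weighted mean over category scores.
# Categories, weights, and scores below are placeholders, not real data.

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_weight

example_scores = {"coding": 58.0, "math": 63.0, "reasoning": 60.0}  # made up
equal_weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}      # assumed

print(f"composite: {composite(example_scores, equal_weights):.1f}")
```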

Output tokens per second (higher is faster). Only models with an Artificial Analysis intelligence score of at least 40 are included.

 #   Model                    tok/s
 1   Gemini 3 Flash Preview    205
 2   GPT-5.1 Codex             199
 3   GPT-5.4 mini              185
 4   Qwen3.6 35B A3B           182
 5   GPT-5 Codex               178
 6   Qwen3.5 122B A10B         160
 7   Hy3-preview               158
 8   GPT-5.4 nano               156
 9   GPT-5.1                   150
10   MiMo-V2-Flash             149

Blended cost per 1M tokens at a 3:1 input-to-output ratio (lower is cheaper). Only models with an Artificial Analysis intelligence score of at least 40 are included.

 #   Model                $/1M
 1   MiMo-V2-Flash        $0.15
 2   DeepSeek V4 Flash    $0.175
 3   DeepSeek V3.2        $0.337
 4   GPT-5.4 nano         $0.463
 5   MiniMax-M2.7         $0.525
 6   KAT Coder Pro V2     $0.525
 7   MiniMax-M2.5         $0.525
 8   Qwen3.6 35B A3B      $0.557
 9   GPT-5 mini           $0.688
10   MiMo-V2.5            $0.72
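The blended figure folds input and output prices into one number at the stated 3:1 ratio, i.e. a weighted average of three parts input price to one part output price. A minimal sketch of that arithmetic, with placeholder prices rather than any provider's actual rates:

```python
# Blended cost per 1M tokens at a 3:1 input-to-output ratio:
# three parts input price to one part output price, averaged.
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Placeholder prices in $/1M tokens -- not any specific provider's rates.
print(f"${blended_cost(0.50, 2.00):.3f} per 1M blended tokens")
```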