The Inference Report

May 27, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, against #11 and a score of 52.9 on Artificial Analysis. The 12.4-point gap compares a percentage score with a composite index, so it is better read as a difference in how the two benchmarks evaluate code-solving capability than as direct evidence of model improvement. The top of the SWE-rebench leaderboard is tightly clustered: gpt-5.2-2025-12-11-medium, GLM-5, Junie, and gpt-5.4-2026-03-05-medium all fall between 62.8% and 64.4%, a spread of 1.6 points, suggesting convergence among frontier models on this task.

Several models place far higher on SWE-rebench than on Artificial Analysis, including GLM-5 (#19 to #3, a 13-point difference), Kimi K2.5 (#31 to #16, 11.7 points), and Kimi K2 Thinking (#56 to #21, 16.5 points), which suggests that Chinese-developed models are comparatively strong on repository-level code tasks. Gemini 3.1 Pro Preview goes the other way: it ranks #3 on Artificial Analysis with 57.2 but #7 on SWE-rebench, illustrating that benchmark choice materially affects perceived ranking. Claude Sonnet 4.6 ranks #14 on Artificial Analysis with 51.7 and #9 on SWE-rebench with 60.7, another case of a model faring better on SWE-rebench's problem distribution than on Artificial Analysis's evaluation.

The divergence between the two benchmarks raises a methodological question: SWE-rebench appears to emphasize end-to-end repository modification and integration, while Artificial Analysis may weight reasoning and breadth differently. Without access to the evaluation protocols themselves, it is difficult to assess whether either benchmark has higher discriminative validity for production code work, however large the shifts between them.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                       Score
1   Claude Opus 4.6             65.3%
2   gpt-5.2-2025-12-11-medium   64.4%
3   GLM-5                       62.8%
4   Junie                       62.8%
5   gpt-5.4-2026-03-05-medium   62.8%
6   GLM-5.1                     62.7%
7   Gemini 3.1 Pro Preview      62.3%
8   DeepSeek-V3.2               60.9%
9   Claude Sonnet 4.6           60.7%
10  Claude Sonnet 4.5           60.0%
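
Because SWE-rebench runs each model five times, each score above summarizes several noisy measurements. The sketch below shows one way such runs could be aggregated, assuming a mean with a standard error; the source does not say how SWE-rebench actually combines its runs, and the per-run scores and the `summarize_runs` name are hypothetical.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator), scaled by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical per-run scores (percent) for one model; not real SWE-rebench data.
runs = [64.0, 66.0, 65.0, 67.0, 63.0]
mean, sem = summarize_runs(runs)
print(f"{mean:.1f}% +/- {sem:.1f}")
```

If run-to-run noise were anywhere near the 0.7 points in this made-up example, gaps as small as 62.8% versus 62.7% in the table would not separate the models.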

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                    Score  tok/s  $/1M
1   GPT-5.5                  60.2   72     $11.25
2   Claude Opus 4.7          57.3   54     $10.94
3   Gemini 3.1 Pro Preview   57.2   130    $4.50
4   GPT-5.4                  56.8   90     $5.63
5   Qwen3.7 Max              56.6   206    $3.75
6   Gemini 3.5 Flash         55.3   233    $3.38
7   Kimi K2.6                53.9   32     $1.71
8   MiMo-V2.5-Pro            53.8   51     $1.35
9   GPT-5.3 Codex            53.6   82     $4.81
10  Grok 4.3                 53.2   196    $1.56

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                    tok/s
1   Gemini 3.5 Flash         233
2   Qwen3.7 Max              206
3   Gemini 3 Flash Preview   204
4   GPT-5 Codex              202
5   GPT-5.1 Codex            201
6   Grok 4.3                 196
7   Grok 4.20 0309 v2        188
8   Grok 4.20 0309           185
9   Qwen3.6 35B A3B          170
10  GPT-5.4 mini             165
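
The speed ranking follows a simple rule: drop models below the minimum intelligence score, then sort by output tokens per second. Here is a minimal sketch of that rule, using a few rows from the composite index table in this report. The `Model` dataclass and `rank_by_speed` function are illustrative names rather than part of any Artificial Analysis API, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: float  # composite intelligence index
    tok_s: int    # output tokens per second


def rank_by_speed(models, min_score=40.0):
    """Keep models at or above min_score (assumed inclusive), fastest first."""
    eligible = [m for m in models if m.score >= min_score]
    return sorted(eligible, key=lambda m: m.tok_s, reverse=True)


# A few rows from the composite index table in this report.
models = [
    Model("GPT-5.5", 60.2, 72),
    Model("Gemini 3.1 Pro Preview", 57.2, 130),
    Model("Qwen3.7 Max", 56.6, 206),
    Model("Gemini 3.5 Flash", 55.3, 233),
    Model("Grok 4.3", 53.2, 196),
]

for rank, m in enumerate(rank_by_speed(models), start=1):
    print(rank, m.name, m.tok_s)
```

On this subset the order starts Gemini 3.5 Flash, Qwen3.7 Max, Grok 4.3, matching their relative order in the speed table.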

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model               $/1M
1   MiMo-V2-Flash       $0.15
2   DeepSeek V4 Flash   $0.175
3   Hy3-preview         $0.20
4   DeepSeek V3.2       $0.337
5   MiMo-V2.5           $0.408
6   GPT-5.4 nano        $0.463
7   MiniMax-M2.7        $0.525
8   KAT Coder Pro V2    $0.525
9   MiniMax-M2.5        $0.525
10  DeepSeek V4 Pro     $0.544
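
The blended cost figures weight input and output prices 3:1. A minimal sketch of that arithmetic follows, assuming the blend is a plain weighted average. The input and output prices are hypothetical, since this report lists only the blended values, and `blended_price` is an illustrative name.

```python
def blended_price(input_per_1m, output_per_1m, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output ratio."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_1m + output_weight * output_per_1m) / total_weight


# Hypothetical prices in USD per 1M tokens; not taken from any provider's price list.
print(blended_price(1.00, 4.00))  # (3 * 1.00 + 1 * 4.00) / 4 = 1.75
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure.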