The Inference Report

May 1, 2026

Claude Opus 4.6 climbed from eighth to first on SWE-rebench, jumping 12.3 percentage points from 53% to 65.3%. A move that size reshuffles the entire coding benchmark landscape, and it warrants scrutiny: either the model improved dramatically, or the test itself did not stay stable, or the evaluation methodology changed. The top tier has tightened considerably. gpt-5.2-2025-12-11-medium sits at 64.4%, GLM-5 and Junie are both at 62.8%, and gpt-5.4-2026-03-05-medium holds steady at 62.8%, a compressed band where fractional improvements decide rank. Below the top five, the ordering has reshuffled substantially: Gemini 3.1 Pro Preview dropped from third to seventh despite scoring 62.3%, while several models gained ground, including GLM-5 (rank 16 at 49.8% to rank 3 at 62.8%), Kimi K2.5 (rank 28 at 46.8% to rank 16 at 58.5%), and Kimi K2 Thinking (rank 53 at 40.9% to rank 21 at 57.4%). That pattern suggests either substantial capability gains across Chinese models or a shift in benchmark composition toward their training distribution.

On Artificial Analysis, the top tier barely moved. GPT-5.5 still leads at 60.2, and Claude Opus 4.6 sits ninth at 53, more than seven points behind, a spread that contradicts the tight SWE-rebench clustering and raises questions about how well the two benchmarks align. Grok 4.3 entered the Artificial Analysis top ten at position eight with 53.2, while most other models held their prior positions, which suggests this index is more stable but may be measuring a different capability or applying different evaluation criteria. The divergence between SWE-rebench's dramatic reshuffling and Artificial Analysis's relative stability indicates that the two benchmarks are not measuring the same problem space, or that one has undergone a methodological revision without documentation.
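
The deltas quoted above are absolute percentage-point moves, not relative changes. A quick sketch of that arithmetic in Python, using only the figures from the paragraph (the helper names are illustrative, not anyone's tooling):

```python
def pp_delta(new: float, old: float) -> float:
    """Absolute movement in percentage points."""
    return new - old

def relative_change(new: float, old: float) -> float:
    """Relative change, for contrast with the percentage-point figure."""
    return 100.0 * (new - old) / old

# Claude Opus 4.6 on SWE-rebench: 53% -> 65.3%
print(f"{pp_delta(65.3, 53.0):.1f} pp")        # 12.3 percentage points
print(f"{relative_change(65.3, 53.0):.1f} %")  # ~23.2% relative improvement

# Gap to the Artificial Analysis leader: GPT-5.5 at 60.2 vs Opus 4.6 at 53.0
print(f"{pp_delta(60.2, 53.0):.1f} pp")        # 7.2 points
```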

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for every model, continuously refreshes its task set to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                       Score
 1  Claude Opus 4.6             65.3%
 2  gpt-5.2-2025-12-11-medium   64.4%
 3  GLM-5                       62.8%
 4  Junie                       62.8%
 5  gpt-5.4-2026-03-05-medium   62.8%
 6  GLM-5.1                     62.7%
 7  Gemini 3.1 Pro Preview      62.3%
 8  DeepSeek-V3.2               60.9%
 9  Claude Sonnet 4.6           60.7%
10  Claude Sonnet 4.5           60.0%
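
SWE-rebench's headline number is a resolved rate, and the five-run protocol described above simply averages that rate across repeated runs of the same task set. A minimal sketch of that aggregation, assuming one boolean pass/fail result per task per run (the function names and sample results are invented for illustration, not the benchmark's actual harness):

```python
from statistics import mean

def resolved_rate(results: list[bool]) -> float:
    """Percent of tasks whose patch passed the task's tests in a single run."""
    return 100.0 * sum(results) / len(results)

def benchmark_score(runs: list[list[bool]]) -> float:
    """Mean resolved rate over independent runs, damping stochastic variance."""
    return mean(resolved_rate(run) for run in runs)

# Five runs over the same ten tasks, differing only in sampling noise.
runs = [
    [True, True, False, True, True, False, True, True, True, False],   # 70%
    [True, True, True, True, False, False, True, True, True, True],    # 80%
    [True, False, False, True, True, False, True, True, True, False],  # 60%
    [True, True, False, True, True, False, True, False, True, True],   # 70%
    [True, True, False, True, True, True, True, True, False, False],   # 70%
]
print(f"{benchmark_score(runs):.1f}%")  # 70.0%
```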

Artificial Analysis composite intelligence index across coding, math, and reasoning benchmarks, with each model's output speed and blended price.

 #  Model                    Score   tok/s   $/1M
 1  GPT-5.5                   60.2      67   $11.25
 2  Claude Opus 4.7           57.3      51   $10.00
 3  Gemini 3.1 Pro Preview    57.2     130   $4.50
 4  GPT-5.4                   56.8      87   $5.63
 5  Kimi K2.6                 53.9      25   $1.71
 6  MiMo-V2.5-Pro             53.8      60   $1.50
 7  GPT-5.3 Codex             53.6      82   $4.81
 8  Grok 4.3                  53.2     221   $1.56
 9  Claude Opus 4.6           53.0      52   $10.00
10  Muse Spark                52.1       0   $0.00
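
The intelligence index in the table above folds several category scores into one number. A minimal sketch of that kind of composite, assuming an equal-weight average over coding, math, and reasoning (the category values and weights are invented, not Artificial Analysis's published methodology):

```python
from statistics import mean

def composite_index(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Collapse per-category scores into a single headline figure."""
    if weights is None:
        return mean(scores.values())  # simple unweighted mean
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

# Invented category scores that happen to average to a 60.2 headline number.
example = {"coding": 58.0, "math": 66.0, "reasoning": 56.6}
print(round(composite_index(example), 1))  # 60.2
```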

Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.

 #  Model                     tok/s
 1  Grok 4.3                    221
 2  Qwen3.6 35B A3B             185
 3  Gemini 3 Flash Preview      184
 4  GPT-5.1 Codex               172
 5  GPT-5.4 mini                169
 6  GPT-5 Codex                 165
 7  GPT-5.4 nano                162
 8  Qwen3.5 122B A10B           148
 9  GPT-5.1                     131
10  Gemini 3.1 Pro Preview      130

Blended cost per 1M tokens at a 3:1 input-to-output ratio; lower is cheaper. Only models with an intelligence score of at least 40 are included.

 #  Model                  $/1M
 1  MiMo-V2-Flash          $0.15
 2  DeepSeek V4 Flash      $0.175
 3  DeepSeek V3.2          $0.315
 4  GPT-5.4 nano           $0.463
 5  MiniMax-M2.7           $0.525
 6  KAT Coder Pro V2       $0.525
 7  MiniMax-M2.5           $0.525
 8  Qwen3.6 35B A3B        $0.557
 9  GPT-5 mini             $0.688
10  Qwen3.5 27B            $0.825
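
The 3:1 blend in the cost table above means the quoted figure weights the input price three times as heavily as the output price. A small sketch of that calculation (the example prices are made up, not any provider's actual rates):

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended $/1M tokens at a fixed input:output usage ratio."""
    total = input_ratio + output_ratio
    return (input_per_m * input_ratio + output_per_m * output_ratio) / total

# e.g. $0.10 per 1M input tokens and $0.30 per 1M output tokens
print(f"${blended_price(0.10, 0.30):.3f} per 1M tokens")  # $0.150
```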