The Inference Report

April 21, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous round, while the Artificial Analysis composite index puts Claude Opus 4.7 first at 57.3, suggesting the two benchmarks measure different problem distributions or difficulty levels. The SWE-rebench leaderboard has consolidated into a narrow band: the top six models cluster between 65.3% and 62.3%, with gpt-5.2-2025-12-11-medium at 64.4% and GLM-5 and gpt-5.4-2026-03-05-medium both at 62.8%, which points to diminishing returns as models approach saturation on this evaluation set.

Notable movers on Artificial Analysis include Kimi K2.6, which enters at position 4 with 53.9 points, and JT-MINI, which appears at position 113 with 25.4 points; neither has a reported SWE-rebench score, which makes cross-benchmark validation difficult. Gemini 3.1 Pro Preview sits second on Artificial Analysis (57.2) but only sixth on SWE-rebench (62.3), a reversal that warrants scrutiny of the underlying tasks: SWE-rebench may emphasize code generation and repository manipulation, where the Claude and GPT variants perform better, while Artificial Analysis may weight reasoning and planning more heavily.

The SWE-rebench methodology itself remains opaque in the provided data. Without visibility into task design, the evaluation protocol, or whether scores are statistically independent, it is unclear whether the tight clustering reflects genuine convergence in model capability or whether the benchmark has begun to saturate as a discriminator. Given the divergence between the two leaderboards, practitioners should verify performance on their own use case rather than treat either ranking as a universal proxy for software engineering capability.
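One way to put a number on that divergence is to rank-correlate the two leaderboards over the models that appear in both top-10 tables below. The sketch uses only the scores published in this issue; with just four overlapping models, the statistic is illustrative rather than conclusive.

```python
# Rank agreement between the two leaderboards, restricted to the models that
# appear in both top-10 tables in this issue. Scores are copied from those
# tables; the closed-form Spearman formula below assumes no tied ranks,
# which holds for this subset.

swe_rebench = {          # resolved rate, %
    "Claude Opus 4.6": 65.3,
    "GLM-5.1": 62.7,
    "Gemini 3.1 Pro Preview": 62.3,
    "Claude Sonnet 4.6": 60.7,
}
artificial_analysis = {  # composite intelligence score
    "Claude Opus 4.6": 53.0,
    "GLM-5.1": 51.4,
    "Gemini 3.1 Pro Preview": 57.2,
    "Claude Sonnet 4.6": 51.7,
}

def ranks(scores: dict) -> dict:
    """Map each model to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

r_swe, r_aa = ranks(swe_rebench), ranks(artificial_analysis)
n = len(swe_rebench)
d_squared = sum((r_swe[m] - r_aa[m]) ** 2 for m in swe_rebench)
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))  # Spearman rho, no ties

print(f"Spearman rho over {n} shared models: {rho:.2f}")  # prints 0.00
```

A rho of zero over this small sample means the two rankings carry essentially no shared ordering information for these four models, which is exactly the kind of result that argues for testing on your own workload.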

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
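This report does not state how many instances are in the current rotation, so the following back-of-the-envelope sketch assumes a hypothetical task count purely to show how wide the sampling error on a resolved rate in the low 60s could be. It is not the benchmark's published protocol.

```python
import math

# Back-of-the-envelope check on how separable the top SWE-rebench scores are.
# The number of evaluation instances is NOT given in this report, so N_TASKS
# is an assumption for illustration only; the run-to-run variance across the
# five repeats (also unpublished here) would further widen or tighten this.

N_TASKS = 300  # hypothetical instance count, not from the leaderboard

scores = {
    "Claude Opus 4.6": 0.653,
    "gpt-5.2-2025-12-11-medium": 0.644,
    "Gemini 3.1 Pro Preview": 0.623,
}

for model, p in scores.items():
    se = math.sqrt(p * (1 - p) / N_TASKS)  # binomial standard error
    half_width = 1.96 * se                 # ~95% interval half-width
    print(f"{model:28s} {p:.1%} ± {half_width:.1%}")

# With N_TASKS = 300 each interval is roughly ±5 points, so the 0.9-point
# gap between the top two rows would not be statistically meaningful under
# this assumption.
```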

#   Model                        Score
1   Claude Opus 4.6              65.3%
2   gpt-5.2-2025-12-11-medium    64.4%
3   GLM-5                        62.8%
4   gpt-5.4-2026-03-05-medium    62.8%
5   GLM-5.1                      62.7%
6   Gemini 3.1 Pro Preview       62.3%
7   DeepSeek-V3.2                60.9%
8   Claude Sonnet 4.6            60.7%
9   Claude Sonnet 4.5            60.0%
10  Qwen3.5-397B-A17B            59.9%

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                    Score   tok/s   $/1M
1   Claude Opus 4.7          57.3    53      $10.00
2   Gemini 3.1 Pro Preview   57.2    130     $4.50
3   GPT-5.4                  56.8    83      $5.63
4   Kimi K2.6                53.9    135     $1.71
5   GPT-5.3 Codex            53.6    90      $4.81
6   Claude Opus 4.6          53      57      $10.00
7   Muse Spark               52.1    0       $0.00
8   Qwen3.6 Max Preview      51.8    0       $0.00
9   Claude Sonnet 4.6        51.7    73      $6.00
10  GLM-5.1                  51.4    43      $2.15

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                    tok/s
1   Qwen3.6 35B A3B          238
2   GPT-5 Codex              213
3   Grok 4.20 0309           205
4   Grok 4.20 0309 v2        203
5   Gemini 3 Flash Preview   197
6   GPT-5.4 mini             194
7   GPT-5.1 Codex            170
8   Qwen3.5 122B A10B        163
9   GPT-5.4 nano             161
10  Gemini 3 Pro Preview     137

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model               $/1M
1   MiMo-V2-Flash       $0.15
2   DeepSeek V3.2       $0.315
3   GPT-5.4 nano        $0.463
4   MiniMax-M2.7        $0.525
5   KAT Coder Pro V2    $0.525
6   MiniMax-M2.5        $0.525
7   GPT-5 mini          $0.688
8   Qwen3.5 27B         $0.825
9   Qwen3.6 35B A3B     $0.844
10  GLM-4.7             $1.00
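For reference, the blended figures above are the kind of number you get from a weighted average of input and output prices at the stated 3:1 ratio. The per-direction prices in the sketch below are hypothetical placeholders chosen only to make the weighting visible; they are not the list prices behind any row in the table.

```python
# A blended $/1M figure as a weighted average of input and output prices at
# the stated 3:1 ratio. The prices in the example are hypothetical
# placeholders, not the list prices behind any model above.

def blended_price(input_per_1m: float, output_per_1m: float,
                  input_weight: int = 3, output_weight: int = 1) -> float:
    """Weighted-average $/1M tokens for a given input:output token mix."""
    total = input_weight + output_weight
    return (input_weight * input_per_1m + output_weight * output_per_1m) / total

# Example: $1.00/1M input and $5.00/1M output blend to $2.00/1M at 3:1.
print(blended_price(1.00, 5.00))  # -> 2.0
```

Agentic coding workloads often skew far more output-heavy than 3:1, so a blended figure at this ratio can understate the effective cost of a specific workload.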