The Inference Report

April 18, 2026

Claude Opus 4.6 jumped from fourth to first on SWE-rebench with a 65.3% score, a 12.3-point gain over its previous 53%. Gemini 3.1 Pro Preview, which until now held the top ranking on Artificial Analysis, sits sixth on SWE-rebench despite maintaining 62.3%, suggesting the two benchmarks now diverge meaningfully in what they reward. The top of the SWE-rebench leaderboard is also tightly clustered: positions two through five span only 1.7 percentage points, leaving little separation between the leading models on coding tasks.

Chinese models made notable gains on SWE-rebench: GLM-5 climbed from tenth to third (49.8% to 62.8%), Kimi K2.5 rose from twentieth to sixteenth (46.8% to 58.5%), and GLM-4.7 advanced from thirty-fourth to fourteenth (42.1% to 58.7%). On Artificial Analysis, by contrast, the top tier remains dominated by Anthropic and OpenAI variants, with Claude Opus 4.7 newly entering at first place and Gemini 3.1 Pro Preview sliding to second.

The two indexes are also moving differently in aggregate. Artificial Analysis shows minimal absolute movement, with entries reordering while scores stay largely stable, whereas SWE-rebench shows score inflation across the board, raising the question of whether the benchmarks are measuring consistent capabilities or whether SWE-rebench's evaluation methodology has shifted. JT-MINI dropped entirely from the Artificial Analysis rankings after placing 109th with 25.4 points, and no corresponding SWE-rebench removal is documented, so it is unclear whether this reflects model discontinuation or benchmark revision. The divergence between the two evaluation frameworks is now pronounced enough to warrant scrutiny of their test construction: if both measure code generation ability, the gap between Gemini 3.1 Pro Preview's rankings (second on Artificial Analysis after holding first, sixth on SWE-rebench) and Claude Opus 4.6's trajectory (fourth to first on SWE-rebench, but only fifth on Artificial Analysis) suggests they are sampling different problem distributions or applying different evaluation criteria rather than simply ranking the same capability differently.

Cole Brennan
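
One way to put a number on the divergence described above is a rank correlation over the models that appear under the same name in both top tens (four of them; the dated GPT snapshots are excluded since their names don't match exactly). This is an illustrative sketch using the scores from the tables below, not either leaderboard's own methodology:

```python
# Quantify how differently two leaderboards order the same models.
# Scores are taken from the two top-10 tables in this issue; only the
# four models listed under the same name on both boards are compared.

swe_rebench = {
    "Claude Opus 4.6": 65.3,
    "GLM-5.1": 62.7,
    "Gemini 3.1 Pro Preview": 62.3,
    "Claude Sonnet 4.6": 60.7,
}
artificial_analysis = {
    "Gemini 3.1 Pro Preview": 57.2,
    "Claude Opus 4.6": 53.0,
    "Claude Sonnet 4.6": 51.7,
    "GLM-5.1": 51.4,
}

def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Model -> rank (1 = best), ordering by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation over the models common to both boards."""
    common = set(a) & set(b)
    ra = ranks({m: a[m] for m in common})
    rb = ranks({m: b[m] for m in common})
    n = len(common)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in common)
    return 1 - 6 * d2 / (n * (n * n - 1))

print(f"rank correlation: {spearman(swe_rebench, artificial_analysis):+.2f}")
# +1 = identical ordering, 0 = unrelated, -1 = reversed
```

With only four shared models the estimate is noisy, but a correlation near zero would support the claim that the two boards are ranking different things.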

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| #  | Model                     | Score |
|----|---------------------------|-------|
| 1  | Claude Opus 4.6           | 65.3% |
| 2  | gpt-5.2-2025-12-11-medium | 64.4% |
| 3  | GLM-5                     | 62.8% |
| 4  | gpt-5.4-2026-03-05-medium | 62.8% |
| 5  | GLM-5.1                   | 62.7% |
| 6  | Gemini 3.1 Pro Preview    | 62.3% |
| 7  | DeepSeek-V3.2             | 60.9% |
| 8  | Claude Sonnet 4.6         | 60.7% |
| 9  | Claude Sonnet 4.5         | 60.0% |
| 10 | Qwen3.5-397B-A17B         | 59.9% |
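
The five-run protocol described above amounts to averaging resolved rates across independent runs so a single lucky or unlucky run doesn't move the board. A minimal sketch of that averaging, with invented per-run figures:

```python
# Average a model's resolved rate over five independent runs, as
# SWE-rebench describes. The per-run rates here are hypothetical.
from statistics import mean, stdev

runs = [0.660, 0.648, 0.655, 0.651, 0.651]  # fraction of tasks resolved, per run

print(f"reported score: {mean(runs):.1%}")     # -> 65.3%
print(f"run-to-run stdev: {stdev(runs):.2%}")  # the spread the averaging absorbs
```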

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| #  | Model                  | Score | tok/s | $/1M   |
|----|------------------------|-------|-------|--------|
| 1  | Claude Opus 4.7        | 57.3  | 58    | $10.00 |
| 2  | Gemini 3.1 Pro Preview | 57.2  | 126   | $4.50  |
| 3  | GPT-5.4                | 56.8  | 82    | $5.63  |
| 4  | GPT-5.3 Codex          | 53.6  | 81    | $4.81  |
| 5  | Claude Opus 4.6        | 53.0  | 54    | $10.00 |
| 6  | Muse Spark             | 52.1  | 0     | $0.00  |
| 7  | Claude Sonnet 4.6      | 51.7  | 60    | $6.00  |
| 8  | GLM-5.1                | 51.4  | 47    | $2.15  |
| 9  | GPT-5.2                | 51.3  | 74    | $4.81  |
| 10 | Qwen3.6 Plus           | 50.0  | 53    | $1.13  |
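
Artificial Analysis does not spell out its weighting here, but a composite index of this kind is at minimum an average over per-benchmark scores. The sketch below assumes equal weights and invented per-category scores; it is not the index's actual formula.

```python
# One plausible shape for a composite index: average a model's scores
# across coding, math, and reasoning evals. Equal weights and the
# per-category scores are assumptions for illustration only.
from statistics import mean

per_category = {"coding": 61.0, "math": 55.5, "reasoning": 55.4}  # hypothetical

composite = mean(per_category.values())
print(f"composite index: {composite:.1f}")  # -> 57.3
```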

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| #  | Model                  | tok/s |
|----|------------------------|-------|
| 1  | Qwen3.6 35B A3B        | 238   |
| 2  | GPT-5.1 Codex          | 205   |
| 3  | GPT-5 Codex            | 199   |
| 4  | Grok 4.20 0309         | 194   |
| 5  | Gemini 3 Flash Preview | 191   |
| 6  | Grok 4.20 0309 v2      | 180   |
| 7  | GPT-5.4 mini           | 172   |
| 8  | GPT-5.4 nano           | 155   |
| 9  | Gemini 3 Pro Preview   | 133   |
| 10 | Qwen3.5 122B A10B      | 130   |
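
Throughput translates directly into wait time: for a response of a given length, seconds elapsed is roughly output tokens divided by tok/s. A quick comparison using the fastest and slowest entries above (time to first token and network overhead ignored):

```python
# Approximate generation time for a fixed-length response, ignoring
# time-to-first-token and network overhead.

def generation_seconds(output_tokens: int, tok_per_s: float) -> float:
    return output_tokens / tok_per_s

for model, speed in [("Qwen3.6 35B A3B", 238), ("Qwen3.5 122B A10B", 130)]:
    print(f"{model}: {generation_seconds(1000, speed):.1f}s per 1,000 tokens")
# -> 4.2s vs 7.7s
```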

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

| #  | Model            | $/1M   |
|----|------------------|--------|
| 1  | MiMo-V2-Flash    | $0.15  |
| 2  | DeepSeek V3.2    | $0.315 |
| 3  | GPT-5.4 nano     | $0.463 |
| 4  | MiniMax-M2.7     | $0.525 |
| 5  | KAT Coder Pro V2 | $0.525 |
| 6  | MiniMax-M2.5     | $0.525 |
| 7  | GPT-5 mini       | $0.688 |
| 8  | Qwen3.5 27B      | $0.825 |
| 9  | Qwen3.6 35B A3B  | $0.844 |
| 10 | GLM-4.7          | $1.00  |
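
The blended figure above is a weighted average of input and output prices: at a 3:1 ratio, the input price carries three quarters of the weight. A minimal sketch with hypothetical per-million-token prices:

```python
# Blended $/1M tokens at a 3:1 input/output mix: three input tokens are
# assumed per output token. Prices below are hypothetical, not any
# listed model's actual rates.

def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Cost per 1M tokens of a workload that is 3 parts input, 1 part output."""
    return (3 * input_per_1m + output_per_1m) / 4

print(f"${blended_price(0.25, 1.00):.3f}/1M tokens")  # -> $0.438/1M tokens
```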