Claude Opus 4.6 now leads the SWE-rebench rankings at 65.3%, a 12.3 percentage point jump from its prior score of 53%. The rest of the top tier shows consolidation rather than dramatic reshuffling: gpt-5.2-2025-12-11-medium (64.4%), GLM-5 (62.8%), and Junie (62.8%) occupy positions 2 through 4, putting the entire top four within a 2.5-point band and suggesting the frontier of coding performance has compressed into a narrow range.

The movement is meaningful in specific quarters. GLM-5 rose from rank 17 to rank 3, GLM-5.1 climbed from 14 to 6, and Kimi K2.5 advanced from 29 to 16, indicating that Chinese model families are closing the gap on the leaders. Gemini 3.1 Pro Preview, meanwhile, dropped from rank 3 to rank 7 despite holding a respectable 62.3%.

The Artificial Analysis benchmark tells a different story. It shows far less movement at the top, with GPT-5.5 still leading at 60.2 and Claude Opus 4.6 at rank 9 with 53 points, a significant divergence between the two evaluation frameworks. SWE-rebench reflects a methodology focused on software engineering tasks with specific, measurable outcomes, whereas the Artificial Analysis composite may weight different problem classes or evaluation criteria. The divergence matters: a model can rank first on one benchmark and ninth on another, which suggests neither benchmark alone captures complete coding capability. And the volume of removals from the Artificial Analysis rankings (over 100 models) without corresponding SWE-rebench entries makes it impossible to tell whether those models genuinely degraded or were simply deprioritized in evaluation cycles, a methodological gap worth keeping in mind when interpreting movement as progress.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
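As described above, SWE-rebench runs each model five times to account for stochastic variance. Below is a minimal sketch of how repeated runs could be aggregated into a single resolved rate; the counts are hypothetical, for illustration only, and do not reflect actual SWE-rebench data or its exact aggregation procedure.

```python
from statistics import mean, stdev

def aggregate_runs(resolved_per_run: list[int], total_tasks: int) -> tuple[float, float]:
    """Average resolved rate across repeated runs, plus the run-to-run spread."""
    rates = [resolved / total_tasks for resolved in resolved_per_run]
    return mean(rates), stdev(rates)

# Hypothetical numbers for illustration only.
avg, spread = aggregate_runs([42, 40, 44, 41, 43], total_tasks=65)
print(f"resolved rate: {avg:.1%} (±{spread:.1%} across 5 runs)")
```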
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | Output tok/s | Blended $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 76 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 61 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 133 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 84 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 29 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 65 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 93 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 150 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 53 | $10.94 |
| 10 | Qwen3.6 Max Preview | 51.8 | 37 | $2.92 |
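The divergence discussed at the top is easy to make concrete. The sketch below lines up the ranks from the two top-10 tables for the models that appear in both; the names and ranks come straight from those tables, and everything else is illustrative plumbing rather than either benchmark's methodology.

```python
# Ranks taken from the two top-10 tables above; only models present in both.
swe_rebench = {"Claude Opus 4.6": 1, "Gemini 3.1 Pro Preview": 7}
artificial_analysis = {"Claude Opus 4.6": 9, "Gemini 3.1 Pro Preview": 3}

for model in swe_rebench.keys() & artificial_analysis.keys():
    delta = artificial_analysis[model] - swe_rebench[model]
    print(f"{model}: SWE-rebench #{swe_rebench[model]}, "
          f"Artificial Analysis #{artificial_analysis[model]} (shift {delta:+d})")
```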
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5 Codex | 210 |
| 2 | Gemini 3 Flash Preview | 199 |
| 3 | Qwen3.6 35B A3B | 199 |
| 4 | GPT-5.1 Codex | 187 |
| 5 | GPT-5.4 mini | 184 |
| 6 | GPT-5.4 nano | 162 |
| 7 | Qwen3.5 122B A10B | 156 |
| 8 | Grok 4.3 | 150 |
| 9 | GPT-5.1 | 149 |
| 10 | MiMo-V2-Flash | 145 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
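The blended figure above is a weighted average of per-million-token input and output prices at the stated 3:1 ratio. A minimal sketch of that arithmetic, using hypothetical prices rather than any provider's actual rates:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: int = 3, output_weight: int = 1) -> float:
    """Blend per-1M-token prices at a 3:1 input/output ratio."""
    total = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total

# Hypothetical prices for illustration: $1.00/1M input, $4.00/1M output.
print(f"${blended_price(1.00, 4.00):.2f} per 1M tokens blended")  # -> $1.75
```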