Claude Opus 4.6 holds the SWE-rebench lead at 65.3%, unchanged from the previous cycle, while the tier immediately below shows modest compression: gpt-5.2-2025-12-11-medium sits at 64.4%, and GLM-5 and gpt-5.4-2026-03-05-medium both score 62.8%.

The meaningful movement occurs in the mid-field, where GLM-4.7 has climbed from rank 43 (42.1 points on Artificial Analysis) to rank 14 (58.7% on SWE-rebench), a shift that suggests either a genuine capability jump or a divergence in what these two benchmarks measure. Kimi K2.5 similarly advanced from rank 28 to rank 16, and Kimi K2 Thinking jumped from rank 53 to rank 21, indicating that Chinese models have made gains on the SWE-rebench evaluation specifically. Gemini 3.1 Pro Preview dropped from rank 3 to rank 6 on SWE-rebench (62.3%) despite holding rank 3 on Artificial Analysis (57.2), a discrepancy that raises questions about benchmark stability, or about whether SWE-rebench and Artificial Analysis weight different problem classes.

The Artificial Analysis leaderboard itself shows minimal reshuffling in the top 20, with GPT-5.5 leading at 60.2 and Claude Opus 4.7 at 57.3, suggesting those rankings have stabilized. At the lower end, Granite 4.1 models appear as new entries on Artificial Analysis (30B at rank 229, 8B at 261, 3B at 324), and QwQ 32B and Qwen3 VL 30B A3B swapped positions at ranks 160 and 161 without score change, a cosmetic reordering.

The lack of dramatic score inflation across either benchmark and the persistence of the same top performers suggest the evaluations are not drifting, though the divergence between SWE-rebench and Artificial Analysis rankings for mid-tier models warrants investigation into whether they stress different failure modes or simply employ different evaluation protocols.
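One way to quantify how much two leaderboards disagree is Spearman rank correlation over the models they share. The sketch below uses hypothetical ranks, not the actual leaderboard data, and assumes no tied ranks so the closed-form formula applies:

```python
def spearman_rho(ranks_a: list[int], ranks_b: list[int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the
    same models: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    assert len(ranks_a) == len(ranks_b)
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ranks for five shared models on two leaderboards:
swe_rebench = [1, 2, 3, 4, 5]
artificial_analysis = [2, 1, 3, 5, 4]
print(f"{spearman_rho(swe_rebench, artificial_analysis):.2f}")
```

A rho near 1 means the two benchmarks largely agree on ordering; a markedly lower value for mid-tier models than for the top tier would support the different-failure-modes hypothesis.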
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
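The five-run protocol amounts to averaging the per-run resolved fraction. A minimal sketch of that aggregation, with a hypothetical `resolved_rate` helper and made-up task outcomes (the benchmark's actual scoring pipeline is not published here):

```python
from statistics import mean

def resolved_rate(runs: list[list[bool]]) -> float:
    """Average the resolved fraction over repeated independent runs.

    `runs` holds one list of task outcomes per run of the same model,
    where True means the model's patch resolved the issue.
    """
    per_run = [sum(outcomes) / len(outcomes) for outcomes in runs]
    return mean(per_run)

# Five hypothetical runs over the same four tasks:
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, False],
    [True, True, False, True],
]
print(f"{resolved_rate(runs):.1%}")  # mean of the five per-run scores
```

Averaging over repeated runs narrows the confidence interval on each score, which matters when the top of the table is separated by fractions of a point.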
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 65 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 52 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 129 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 93 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 25 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 59 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 86 | $4.81 |
| 8 | Claude Opus 4.6 | 53 | 49 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 33 | $2.92 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 191 |
| 2 | Gemini 3 Flash Preview | 189 |
| 3 | GPT-5.1 Codex | 170 |
| 4 | GPT-5 Codex | 166 |
| 5 | GPT-5.4 nano | 160 |
| 6 | GPT-5.4 mini | 158 |
| 7 | Qwen3.5 122B A10B | 142 |
| 8 | Gemini 3.1 Pro Preview | 129 |
| 9 | Gemini 3 Pro Preview | 129 |
| 10 | GPT-5.1 | 126 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
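The 3:1 blend above can be read as a weighted average of input and output prices. A minimal sketch under that assumption, with hypothetical per-token prices (not taken from any provider's actual pricing):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $ per 1M tokens, assuming a 3:1 input:output mix,
    i.e. three input tokens for every output token."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical prices: $0.25/1M input, $1.00/1M output
print(blended_price(0.25, 1.00))
```

Because output tokens typically cost several times more than input tokens, the 3:1 weighting keeps the blended figure closer to the input price, which is why some models with expensive output still rank well on this table.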