Claude Opus 4.6 climbed from eighth to first on SWE-rebench, gaining 12.3 percentage points (53% to 65.3%). A jump that large reshuffles the entire coding-benchmark landscape, and it warrants scrutiny: did the model improve, or did the test or evaluation methodology change underneath it?

The top tier has tightened considerably. gpt-5.2-2025-12-11-medium sits at 64.4%, with GLM-5, Junie, and gpt-5.4-2026-03-05-medium all at 62.8%, a compressed band where fractional improvements matter. Below the top five, the ranking has reordered substantially. Gemini 3.1 Pro Preview dropped from third to seventh despite scoring 62.3%, while several models gained ground: GLM-5 (rank 16 at 49.8% to rank 3 at 62.8%), Kimi K2.5 (rank 28 at 46.8% to rank 16 at 58.5%), and Kimi K2 Thinking (rank 53 at 40.9% to rank 21 at 57.4%). That pattern suggests either substantial capability improvements across Chinese models or a shift in benchmark composition toward their training distribution.

On Artificial Analysis, by contrast, the top tier barely moved: GPT-5.5 still leads at 60.2, while Claude Opus 4.6 sits ninth at 53.0, a 7.2-point gap that contradicts the SWE-rebench clustering and raises questions about benchmark alignment. Grok 4.3 entered the Artificial Analysis top 100 at position eight with 53.2; most other models held their prior positions, suggesting this benchmark is more stable but possibly measuring a different capability or using different evaluation criteria. The divergence between SWE-rebench's dramatic reshuffling and Artificial Analysis's relative stability indicates these benchmarks are not measuring the same problem space, or that one has undergone an undocumented methodological revision.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
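The five-run protocol described above amounts to averaging each model's resolved-task rate across independent runs and reporting the spread. A minimal sketch of that aggregation (function name and the sample run values are hypothetical; SWE-rebench's exact reporting may differ):

```python
from statistics import mean, stdev

def aggregate_runs(resolved_fractions):
    """Average resolved-task rate across independent runs of one model.

    resolved_fractions: one resolved fraction per run (SWE-rebench
    reportedly uses five runs to smooth stochastic variance).
    Returns (mean rate, sample standard deviation).
    """
    return mean(resolved_fractions), stdev(resolved_fractions)

# Hypothetical five runs of a single model on the same task set.
avg, sd = aggregate_runs([0.64, 0.66, 0.65, 0.63, 0.67])
print(f"{avg:.1%} ± {sd:.1%}")  # 65.0% ± 1.6%
```

With a spread of roughly ±1.6 points on five runs, the 62.7% vs 62.8% gaps in the table above are well within run-to-run noise, which is exactly why the repeated-run design matters.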
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 67 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 51 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 87 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 25 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 60 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 82 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 221 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 52 | $10.00 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.3 | 221 |
| 2 | Qwen3.6 35B A3B | 185 |
| 3 | Gemini 3 Flash Preview | 184 |
| 4 | GPT-5.1 Codex | 172 |
| 5 | GPT-5.4 mini | 169 |
| 6 | GPT-5 Codex | 165 |
| 7 | GPT-5.4 nano | 162 |
| 8 | Qwen3.5 122B A10B | 148 |
| 9 | GPT-5.1 | 131 |
| 10 | Gemini 3.1 Pro Preview | 130 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |