Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, 12.3 points above its 53.0 composite score on Artificial Analysis, where it sits ninth, though the two benchmarks measure different problem sets on different scales and cannot be directly compared. The top tier has consolidated around 62-65% on SWE-rebench, with GPT-5.2-2025-12-11-medium at 64.4% and three models tied at 62.8% (GLM-5, Junie, and GPT-5.4-2026-03-05-medium), suggesting diminishing returns in coding-task performance at the frontier.

More striking are the mid-tier movements: GLM-5 climbed from position 16 to 3 on SWE-rebench, GLM-4.7 rose from 40 to 14, and Kimi K2.5 advanced from 26 to 16, indicating that Chinese model families have made substantial gains on this particular benchmark. Gemini 3.1 Pro Preview, by contrast, sits third on Artificial Analysis (57.2) but only seventh on SWE-rebench (62.3%), a relative decline that may reflect task-specific strengths rather than regression.

On Artificial Analysis, the leaderboard remains fluid, with 33 new entries across the 373-model roster, including several reasoning-focused variants and smaller-parameter models, though the top ten remain dominated by GPT and Claude variants. SWE-rebench appears more selective and stable, tracking only 34 models versus hundreds on Artificial Analysis, which makes it a tighter measure of coding capability but limits visibility into the broader performance distribution. Without methodological details on how SWE-rebench tasks differ from Artificial Analysis's evaluation protocol, the divergence in rankings suggests the two benchmarks may reward different architectural or training choices, a distinction worth investigating rather than treating them as interchangeable measures of coding prowess.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
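The five-run protocol described above can be sketched as simple averaging: each model is evaluated five times and the per-run resolved rates are combined into one reported score, damping stochastic variance. The run values below are illustrative, not actual SWE-rebench data.

```python
# Hypothetical sketch of repeated-run scoring: average the fraction of
# tasks resolved across five independent runs of the same model.
from statistics import mean, stdev

# resolved fraction per run (illustrative numbers)
runs = [0.66, 0.64, 0.65, 0.66, 0.655]

score = mean(runs)    # reported score: mean over the five runs
spread = stdev(runs)  # run-to-run spread the averaging absorbs

print(f"score = {score:.1%}, stdev = {spread:.4f}")
```

Reporting the mean rather than a single run keeps a lucky or unlucky sample from moving a model several leaderboard positions.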
Artificial Analysis composite intelligence index across coding, math, and reasoning benchmarks, alongside output speed (tok/s) and blended price per 1M tokens ($/1M).
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 74 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 56 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 89 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 31 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 63 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 87 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 112 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 48 | $10.94 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 197 |
| 2 | GPT-5 Codex | 196 |
| 3 | Qwen3.6 35B A3B | 192 |
| 4 | GPT-5.4 mini | 184 |
| 5 | GPT-5.1 Codex | 184 |
| 6 | GPT-5.4 nano | 161 |
| 7 | Qwen3.5 122B A10B | 158 |
| 8 | GPT-5.1 | 151 |
| 9 | MiMo-V2-Flash | 147 |
| 10 | MiMo-V2-Omni-0327 | 134 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
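The 3:1 blend behind the cost column can be sketched as a weighted average of input and output per-token prices: three parts input price to one part output price. The example prices below are illustrative, not actual provider list prices.

```python
# Minimal sketch of the 3:1 input:output blended price per 1M tokens,
# assuming separate per-1M prices for input and output tokens.
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    """Blend prices at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + output_per_1m) / 4

# e.g. $0.10/1M input and $0.30/1M output blend to $0.15/1M
print(f"${blended_cost(0.10, 0.30):.2f} per 1M tokens")
```

The 3:1 weighting reflects a typical workload shape in which prompts (input) consume several times more tokens than completions (output), so cheap input pricing pulls the blended figure down faster than cheap output pricing.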