The Inference Report

April 17, 2026

Claude Opus 4.6 moved from fourth on the Artificial Analysis index to first on SWE-rebench, climbing 12.3 percentage points from 53 to 65.3 percent, while Gemini 3.1 Pro Preview dropped from the top Artificial Analysis ranking at 57.2 to sixth on SWE-rebench at 62.3 percent. The gap between first and second place on SWE-rebench narrowed to just 0.9 points (Claude Opus 4.6 at 65.3 versus gpt-5.2-2025-12-11-medium at 64.4), and the top six models now cluster between 62.3 and 65.3 percent, suggesting convergence at the frontier rather than separation.

GLM-5 and GLM-5.1 gained roughly 13 and 11 points respectively, moving from tenth and seventh on Artificial Analysis to third and fifth on SWE-rebench, which indicates that coding-specific evaluation surfaces a different capability profile than the broader Artificial Analysis benchmark does. The two benchmarks also tell divergent stories about the field: SWE-rebench shows tight competition in the 58 to 65 percent range across the top twenty models, while Artificial Analysis exhibits steeper stratification, with the top performer at 57.2 and a sharper drop-off below rank fifty.

The methodological difference matters here. SWE-rebench measures repository-level code generation against real GitHub issues with deterministic evaluation criteria, while Artificial Analysis covers broader reasoning and general capability. Models like Claude Opus 4.6 and the GLM family appear better calibrated to the specific constraints of software engineering tasks. Still, without visibility into whether the SWE-rebench test set, evaluation harness, or scoring logic changed, or whether these results reflect genuinely new model versions, the magnitude of the gains (particularly the 12.3-point jump for Claude) warrants scrutiny.
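The deltas quoted above fall straight out of the two leaderboards printed below. A minimal sketch of that arithmetic in Python, with the published scores hard-coded for the models that appear under the same name in both tables:

```python
# Published scores from the two leaderboards below (only models listed under
# the same name in both): Artificial Analysis vs. SWE-rebench.
artificial_analysis = {  # model -> (rank, score)
    "Gemini 3.1 Pro Preview": (1, 57.2),
    "Claude Opus 4.6": (4, 53.0),
    "GLM-5.1": (7, 51.4),
    "GLM-5": (10, 49.8),
}
swe_rebench = {  # model -> (rank, score %)
    "Claude Opus 4.6": (1, 65.3),
    "GLM-5": (3, 62.8),
    "GLM-5.1": (5, 62.7),
    "Gemini 3.1 Pro Preview": (6, 62.3),
}

for model in sorted(artificial_analysis):
    aa_rank, aa_score = artificial_analysis[model]
    swe_rank, swe_score = swe_rebench[model]
    print(f"{model}: {aa_score:.1f} -> {swe_score:.1f} "
          f"({swe_score - aa_score:+.1f} pts), rank {aa_rank} -> {swe_rank}")
```

Running this reproduces the figures in the text, for example "Claude Opus 4.6: 53.0 -> 65.3 (+12.3 pts), rank 4 -> 1".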

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                        Score
 1  Claude Opus 4.6              65.3%
 2  gpt-5.2-2025-12-11-medium    64.4%
 3  GLM-5                        62.8%
 4  gpt-5.4-2026-03-05-medium    62.8%
 5  GLM-5.1                      62.7%
 6  Gemini 3.1 Pro Preview       62.3%
 7  DeepSeek-V3.2                60.9%
 8  Claude Sonnet 4.6            60.7%
 9  Claude Sonnet 4.5            60.0%
10  Qwen3.5-397B-A17B            59.9%
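As a rough illustration of the five-run averaging described above, here is a minimal sketch assuming each run yields a pass/fail outcome per issue; the actual SWE-rebench harness and scoring code are not shown here, so the data shapes and names are illustrative only.

```python
from statistics import mean, stdev

def resolved_rate(run_results: list[bool]) -> float:
    """Fraction of issues resolved in a single evaluation run."""
    return sum(run_results) / len(run_results)

def score_model(runs: list[list[bool]]) -> tuple[float, float]:
    """Average resolved rate over repeated runs, plus run-to-run spread."""
    rates = [resolved_rate(run) for run in runs]
    return mean(rates), stdev(rates)

# Hypothetical outcomes: 5 independent runs over the same 8 issues.
runs = [
    [True, True, False, True, True, False, True, True],
    [True, True, True, True, False, False, True, True],
    [True, False, False, True, True, False, True, True],
    [True, True, False, True, True, False, True, False],
    [True, True, False, True, True, True, True, True],
]
avg, spread = score_model(runs)
print(f"resolved rate: {avg:.1%} (run-to-run std dev {spread:.1%})")
```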

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                   Score  tok/s    $/1M
 1  Gemini 3.1 Pro Preview   57.2    123   $4.50
 2  GPT-5.4                  56.8     81   $5.63
 3  GPT-5.3 Codex            53.6     70   $4.81
 4  Claude Opus 4.6          53.0     44  $10.00
 5  Muse Spark               52.1      0   $0.00
 6  Claude Sonnet 4.6        51.7     54   $6.00
 7  GLM-5.1                  51.4     46   $2.15
 8  GPT-5.2                  51.3     64   $4.81
 9  Qwen3.6 Plus             50.0     54   $1.13
10  GLM-5                    49.8     64   $1.55
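Artificial Analysis does not publish its exact weighting in this report, so the following is only an illustrative sketch of how a composite across coding, math, and reasoning scores could be blended, assuming equal weights; the real index may weight or normalize the components differently.

```python
def composite(scores: dict[str, float], weights: dict[str, float] | None = None) -> float:
    """Weighted mean of per-benchmark scores (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical per-category scores for a single model, equally weighted.
print(round(composite({"coding": 62.0, "math": 55.0, "reasoning": 58.0}), 1))  # 58.3
```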

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #  Model                   tok/s
 1  GPT-5 Codex               180
 2  GPT-5.1 Codex             179
 3  Gemini 3 Flash Preview    176
 4  Grok 4.20 0309            161
 5  GPT-5.4 mini              159
 6  GPT-5.4 nano              158
 7  Grok 4.20 0309 v2         146
 8  Qwen3.5 122B A10B         130
 9  Gemini 3 Pro Preview      128
10  Gemini 3.1 Pro Preview    123

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #  Model               $/1M
 1  MiMo-V2-Flash      $0.15
 2  DeepSeek V3.2     $0.315
 3  GPT-5.4 nano      $0.463
 4  MiniMax-M2.7      $0.525
 5  KAT Coder Pro V2  $0.525
 6  MiniMax-M2.5      $0.525
 7  GPT-5 mini        $0.688
 8  Qwen3.5 27B       $0.825
 9  GLM-4.7            $1.00
10  Kimi K2 Thinking   $1.07
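The blended figure in this table folds input and output prices into a single number at the stated 3:1 input/output ratio. A minimal sketch of that arithmetic, using hypothetical per-token rates since the table lists only the blended result:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 3.0, output_share: float = 1.0) -> float:
    """Blend input/output prices per 1M tokens at a fixed usage ratio (3:1 by default)."""
    return (input_per_m * input_share + output_per_m * output_share) / (input_share + output_share)

# Hypothetical rates: $1.00/1M input, $5.00/1M output -> $2.00/1M blended.
print(blended_price(1.00, 5.00))
```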