The Inference Report

April 18, 2026

Claude Opus 4.6 jumped from fourth to first on SWE-rebench with a 65.3% score, a 12.3-point gain over its previous 53%. Gemini 3.1 Pro Preview, which until now held the top ranking on Artificial Analysis, sits sixth on SWE-rebench despite maintaining 62.3%, suggesting the two benchmarks now diverge meaningfully in what they reward. The top of the SWE-rebench leaderboard is also tightly clustered: positions two through five span only 1.7 percentage points, leaving little separation between the leading models on coding tasks.

Chinese models made notable gains on SWE-rebench: GLM-5 climbed from tenth to third (49.8% to 62.8%), Kimi K2.5 rose from twentieth to sixteenth (46.8% to 58.5%), and GLM-4.7 advanced from thirty-fourth to fourteenth (42.1% to 58.7%). On Artificial Analysis, by contrast, the top tier remains dominated by Anthropic and OpenAI variants, with Claude Opus 4.7 newly entering at first place and Gemini 3.1 Pro Preview sliding to second.

The two indexes are also moving differently in aggregate. Artificial Analysis shows minimal absolute movement, with entries reordering while scores stay largely stable, whereas SWE-rebench shows score inflation across the board, raising the question of whether the benchmarks are measuring consistent capabilities or whether SWE-rebench's evaluation methodology has shifted. JT-MINI dropped entirely from the Artificial Analysis rankings after placing 109th with 25.4 points, and no corresponding SWE-rebench removal is documented, so it is unclear whether this reflects model discontinuation or benchmark revision. The divergence between the two evaluation frameworks is now pronounced enough to warrant scrutiny of their test construction: if both measure code generation ability, the gap between Gemini 3.1 Pro Preview's rankings (second on Artificial Analysis after holding first, sixth on SWE-rebench) and Claude Opus 4.6's trajectory (fourth to first on SWE-rebench, but only fifth on Artificial Analysis) suggests they are sampling different problem distributions or applying different evaluation criteria rather than simply ranking the same capability differently.

Cole Brennan
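
One way to put a number on the divergence described above is a rank correlation over the models that appear under the same name in both top tens (four of them; the dated GPT snapshots are excluded since their names don't match exactly). This is an illustrative sketch using the scores from the tables below, not either leaderboard's own methodology:

```python
# Quantify how differently two leaderboards order the same models.
# Scores are taken from the two top-10 tables in this issue; only the
# four models listed under the same name on both boards are compared.

swe_rebench = {
    "Claude Opus 4.6": 65.3,
    "GLM-5.1": 62.7,
    "Gemini 3.1 Pro Preview": 62.3,
    "Claude Sonnet 4.6": 60.7,
}
artificial_analysis = {
    "Gemini 3.1 Pro Preview": 57.2,
    "Claude Opus 4.6": 53.0,
    "Claude Sonnet 4.6": 51.7,
    "GLM-5.1": 51.4,
}

def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Model -> rank (1 = best), ordering by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation over the models common to both boards."""
    common = set(a) & set(b)
    ra = ranks({m: a[m] for m in common})
    rb = ranks({m: b[m] for m in common})
    n = len(common)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in common)
    return 1 - 6 * d2 / (n * (n * n - 1))

print(f"rank correlation: {spearman(swe_rebench, artificial_analysis):+.2f}")
# +1 = identical ordering, 0 = unrelated, -1 = reversed
```

With only four shared models the estimate is noisy, but a correlation near zero would support the claim that the two boards are ranking different things.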

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| #  | Model                     | Score |
|----|---------------------------|-------|
| 1  | Claude Opus 4.6           | 65.3% |
| 2  | gpt-5.2-2025-12-11-medium | 64.4% |
| 3  | GLM-5                     | 62.8% |
| 4  | gpt-5.4-2026-03-05-medium | 62.8% |
| 5  | GLM-5.1                   | 62.7% |
| 6  | Gemini 3.1 Pro Preview    | 62.3% |
| 7  | DeepSeek-V3.2             | 60.9% |
| 8  | Claude Sonnet 4.6         | 60.7% |
| 9  | Claude Sonnet 4.5         | 60.0% |
| 10 | Qwen3.5-397B-A17B         | 59.9% |
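
The five-run protocol described above amounts to averaging resolved rates across independent runs so a single lucky or unlucky run doesn't move the board. A minimal sketch of that averaging, with invented per-run figures:

```python
# Average a model's resolved rate over five independent runs, as
# SWE-rebench describes. The per-run rates here are hypothetical.
from statistics import mean, stdev

runs = [0.660, 0.648, 0.655, 0.651, 0.651]  # fraction of tasks resolved, per run

print(f"reported score: {mean(runs):.1%}")     # -> 65.3%
print(f"run-to-run stdev: {stdev(runs):.2%}")  # the spread the averaging absorbs
```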

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| #  | Model                  | Score | tok/s | $/1M   |
|----|------------------------|-------|-------|--------|
| 1  | Claude Opus 4.7        | 57.3  | 58    | $10.00 |
| 2  | Gemini 3.1 Pro Preview | 57.2  | 126   | $4.50  |
| 3  | GPT-5.4                | 56.8  | 82    | $5.63  |
| 4  | GPT-5.3 Codex          | 53.6  | 81    | $4.81  |
| 5  | Claude Opus 4.6        | 53.0  | 54    | $10.00 |
| 6  | Muse Spark             | 52.1  | 0     | $0.00  |
| 7  | Claude Sonnet 4.6      | 51.7  | 60    | $6.00  |
| 8  | GLM-5.1                | 51.4  | 47    | $2.15  |
| 9  | GPT-5.2                | 51.3  | 74    | $4.81  |
| 10 | Qwen3.6 Plus           | 50.0  | 53    | $1.13  |
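
Artificial Analysis does not spell out its weighting here, but a composite index of this kind is at minimum an average over per-benchmark scores. The sketch below assumes equal weights and invented per-category scores; it is not the index's actual formula.

```python
# One plausible shape for a composite index: average a model's scores
# across coding, math, and reasoning evals. Equal weights and the
# per-category scores are assumptions for illustration only.
from statistics import mean

per_category = {"coding": 61.0, "math": 55.5, "reasoning": 55.4}  # hypothetical

composite = mean(per_category.values())
print(f"composite index: {composite:.1f}")  # -> 57.3
```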

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| #  | Model                  | tok/s |
|----|------------------------|-------|
| 1  | Qwen3.6 35B A3B        | 238   |
| 2  | GPT-5.1 Codex          | 205   |
| 3  | GPT-5 Codex            | 199   |
| 4  | Grok 4.20 0309         | 194   |
| 5  | Gemini 3 Flash Preview | 191   |
| 6  | Grok 4.20 0309 v2      | 180   |
| 7  | GPT-5.4 mini           | 172   |
| 8  | GPT-5.4 nano           | 155   |
| 9  | Gemini 3 Pro Preview   | 133   |
| 10 | Qwen3.5 122B A10B      | 130   |
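
Throughput translates directly into wait time: for a response of a given length, seconds elapsed is roughly output tokens divided by tok/s. A quick comparison using the fastest and slowest entries above (time to first token and network overhead ignored):

```python
# Approximate generation time for a fixed-length response, ignoring
# time-to-first-token and network overhead.

def generation_seconds(output_tokens: int, tok_per_s: float) -> float:
    return output_tokens / tok_per_s

for model, speed in [("Qwen3.6 35B A3B", 238), ("Qwen3.5 122B A10B", 130)]:
    print(f"{model}: {generation_seconds(1000, speed):.1f}s per 1,000 tokens")
# -> 4.2s vs 7.7s
```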

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

| #  | Model            | $/1M   |
|----|------------------|--------|
| 1  | MiMo-V2-Flash    | $0.15  |
| 2  | DeepSeek V3.2    | $0.315 |
| 3  | GPT-5.4 nano     | $0.463 |
| 4  | MiniMax-M2.7     | $0.525 |
| 5  | KAT Coder Pro V2 | $0.525 |
| 6  | MiniMax-M2.5     | $0.525 |
| 7  | GPT-5 mini       | $0.688 |
| 8  | Qwen3.5 27B      | $0.825 |
| 9  | Qwen3.6 35B A3B  | $0.844 |
| 10 | GLM-4.7          | $1.00  |
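
The blended figure above is a weighted average of input and output prices: at a 3:1 ratio, the input price carries three quarters of the weight. A minimal sketch with hypothetical per-million-token prices:

```python
# Blended $/1M tokens at a 3:1 input/output mix: three input tokens are
# assumed per output token. Prices below are hypothetical, not any
# listed model's actual rates.

def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Cost per 1M tokens of a workload that is 3 parts input, 1 part output."""
    return (3 * input_per_1m + output_per_1m) / 4

print(f"${blended_price(0.25, 1.00):.3f}/1M tokens")  # -> $0.438/1M tokens
```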