Claude Opus 4.6 tops the SWE-rebench rankings at 65.3%, despite sitting fourth at 53 points on the Artificial Analysis index; Gemini 3.1 Pro Preview, tied for first on Artificial Analysis at 57.2 points, ranks only fifth on the coding benchmark at 62.3%; and GLM-5 climbs from ninth at 49.8 points to third at 62.8%. The SWE-rebench scores cluster more tightly in the top tier: the gap between first and fifth is only 3 percentage points, whereas on Artificial Analysis GPT-5.4 and Gemini 3.1 Pro Preview tie at 57.2. This suggests either that the coding benchmark is more discriminative or that the models' relative strengths differ meaningfully between general reasoning and software engineering. Kimi K2.5 advances from sixteenth at 46.8 points on Artificial Analysis to thirteenth at 58.5% on SWE-rebench, and Kimi K2 Thinking jumps from thirty-seventh at 40.9 points to seventeenth at 57.4%, indicating that these models have particular strength in code-generation tasks. The SWE-rebench data provided lacks published methodology details: there is no information on test-set size, task distribution, or evaluation criteria, or on whether the results come from an initial release or continued refinement, which makes it difficult to assess whether the ranking shifts reflect genuine capability differences or methodological divergence from Artificial Analysis.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of what that aggregation might look like follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
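Because each published number aggregates five runs, a reported score like 65.3% is presumably a summary statistic rather than a single trial. SWE-rebench's exact aggregation method is not stated in this data; the sketch below assumes a simple mean over hypothetical per-run resolve rates, with the standard deviation as a rough measure of the run-to-run variance the five-run protocol is meant to absorb.

```python
from statistics import mean, stdev

# Hypothetical resolve rates for one model across five independent
# runs (invented values for illustration; SWE-rebench does not
# publish per-run numbers or its aggregation method in this data).
runs = [0.641, 0.658, 0.649, 0.662, 0.655]

score = mean(runs)    # assumption: the reported score is the mean
spread = stdev(runs)  # run-to-run stochastic variance

print(f"reported score: {score:.1%}")   # 65.3% for these made-up runs
print(f"run stdev:      {spread:.2%}")  # ~0.82%
```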
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 85 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 76 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 55 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 71 | $6.00 |
| 6 | GLM-5.1 | 51.3 | 80 | $2.15 |
| 7 | GPT-5.2 | 51.3 | 69 | $4.81 |
| 8 | Qwen3.6 Plus | 50 | 52 | $1.13 |
| 9 | GLM-5 | 49.8 | 70 | $1.55 |
| 10 | Claude Opus 4.5 | 49.7 | 67 | $10.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 0309 | 252 |
| 2 | GPT-5 Codex | 203 |
| 3 | GPT-5.4 nano | 202 |
| 4 | Gemini 3 Flash Preview | 196 |
| 5 | GPT-5.1 Codex | 191 |
| 6 | GPT-5.4 mini | 157 |
| 7 | Gemini 3 Pro Preview | 139 |
| 8 | Qwen3.5 122B A10B | 138 |
| 9 | Gemini 3.1 Pro Preview | 132 |
| 10 | MiMo-V2-Flash | 129 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
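A note on the blended figure: a 3:1 input/output blend is conventionally a weighted average of the per-direction prices, three parts input tokens to one part output. A minimal sketch under that assumption (the prices in the example are invented, not taken from any listed model):

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens, assuming a 3:1 input:output token mix."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical model priced at $0.25/1M input and $1.00/1M output:
print(f"${blended_price(0.25, 1.00):.3f} per 1M tokens")  # $0.438
```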