Claude Opus 4.6 holds its lead on SWE-rebench at 65.3 percent, 12.3 points above its 53-point showing on the Artificial Analysis index, though the two evaluations measure different problem spaces and should not be compared directly. The top tier has consolidated between 62 and 65 percent on SWE-rebench: gpt-5.2-2025-12-11-medium sits at 64.4 percent, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8 percent. These four models now occupy the summit, separated by narrow margins.

Movement in the broader field is uneven. Kimi K2 Thinking jumped from 37th place at 40.9 percent to 17th at 57.4 percent, a 16.5-point gain that suggests either a model update or a change in how the benchmark handles reasoning-focused architectures, while Kimi K2.5 advanced from 16th at 46.8 percent to 13th at 58.5 percent. Gemini 3 Flash Preview improved its score from 46.4 percent to 52.5 percent even as its rank slipped from 18th to 22nd, and Nova 2.0 Lite moved from 76th at 29.7 percent to 58th at 34.5 percent on the Artificial Analysis side, a shift that reflects reranking rather than SWE-rebench performance.

The SWE-rebench scores show no obvious saturation at the top: the gap between first and fifth place is 3.0 percentage points, and the distribution below rank 10 remains steep, suggesting either that the benchmark retains discriminative power or that model capabilities on software engineering tasks continue to stratify sharply by architecture and training approach. What the data alone cannot settle is whether these gains reflect genuine improvements in code generation and repository-level reasoning, or changes in evaluation methodology, task distribution, or model selection within the rebench suite.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed for fair comparison of LLM capabilities on real-world software engineering tasks. Unlike many evaluations, it uses a standardized scaffold for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
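The five-run protocol described above amounts to reporting a mean resolved rate per model, with the repeats absorbing run-to-run noise. A minimal sketch, where the individual run values are invented for illustration and not taken from SWE-rebench:

```python
from statistics import mean, stdev

# Hypothetical run-level resolved rates for one model; SWE-rebench's actual
# per-run numbers are not published in this digest.
runs = [0.660, 0.648, 0.655, 0.651, 0.651]  # fraction of tasks resolved, 5 runs

score = mean(runs)    # the headline leaderboard score
spread = stdev(runs)  # run-to-run stochastic variance

print(f"score = {score:.1%} (±{spread:.1%} across runs)")
```

The point of the repeats is that a single run can land anywhere in that spread, so single-run comparisons between closely ranked models are unreliable.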
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 82 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 142 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 81 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 54 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 66 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 79 | $4.81 |
| 7 | GLM-5 | 49.8 | 69 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 67 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 43 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 0 | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 265 |
| 2 | GPT-5 Codex | 214 |
| 3 | GPT-5.4 nano | 209 |
| 4 | Gemini 3 Flash Preview | 197 |
| 5 | GPT-5.1 Codex | 190 |
| 6 | GPT-5.4 mini | 166 |
| 7 | Qwen3.5 122B A10B | 154 |
| 8 | Gemini 3 Pro Preview | 143 |
| 9 | Gemini 3.1 Pro Preview | 142 |
| 10 | MiMo-V2-Flash | 137 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
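The blended figure above weights input and output token prices 3:1, i.e. (3 × input + 1 × output) / 4. A short sketch, with hypothetical per-direction prices rather than any model's published rates:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended cost per 1M tokens at a 3:1 input:output token ratio."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Hypothetical prices for illustration only: $0.10/1M in, $0.30/1M out.
print(round(blended_price(0.10, 0.30), 4))  # 0.15
```

The 3:1 weighting assumes a typical coding workload reads far more tokens (context, files, diffs) than it writes, so input pricing dominates the effective cost.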