Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the tier immediately below has solidified around 62 to 64 percent: gpt-5.2-2025-12-11-medium at 64.4%, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%.

The more telling comparison is how models place here relative to the Artificial Analysis index (the deltas are recomputed in the sketch after the second table). Claude Opus 4.6 sits at rank 1 on SWE-rebench against rank 4 on Artificial Analysis, scoring 65.3 against 53.0, a 12.3-point gap; GLM-5 lands at rank 3 against rank 7 (13 points); Kimi K2.5 at rank 13 against rank 16 (11.7 points); and Kimi K2 Thinking at rank 17 against rank 35 (16.5 points). Gemini 3.1 Pro Preview moves the other way, from rank 2 on Artificial Analysis to rank 5 here, gaining only 5.1 points (57.2 to 62.3), the smallest improvement among the leaders. The divergence is sharpest for the Claude and Kimi models, which suggests the two benchmarks weight different problem classes or evaluation criteria.

The field is also tighter here: 5.7 percentage points separate first from tenth place on SWE-rebench, while the Artificial Analysis leaderboard spreads 8.0 points, from 57.2 at the top to 49.2 at rank 10. Without access to the specific methodology differences between the two benchmarks, it is unclear whether the SWE-rebench standings reflect genuine capability differences or a benchmark that rewards particular architectural or training choices.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
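On the five-run protocol mentioned above: a minimal sketch of how per-run results might be aggregated, assuming the published score is the mean resolved rate across runs (SWE-rebench's exact aggregation is not documented in this post). The task outcomes below are toy data.

```python
# Toy five-run aggregation: each run records, per task, whether the
# model's patch resolved the issue. Averaging the per-run resolved
# rates damps run-to-run stochastic variance.
from statistics import mean, stdev

runs = [  # five runs x five tasks of made-up pass/fail outcomes
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, False, True, False],
    [True, True, True, True, False],
    [True, False, False, True, False],
]

rates = [sum(run) / len(run) for run in runs]  # resolved rate per run
print(f"score = {mean(rates):.1%}, run-to-run sd = {stdev(rates):.1%}")
```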
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 81 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 74 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 53 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 66 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 72 | $4.81 |
| 7 | GLM-5 | 49.8 | 63 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 64 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 47 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 93 | $1.50 |
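Since the commentary leans on cross-leaderboard deltas, here is a minimal sketch that recomputes them from the two top-10 tables. The scores are hardcoded from this post, and the name equivalences (treating gpt-5.2-2025-12-11-medium as Artificial Analysis's GPT-5.2, and likewise for GPT-5.4) are assumptions; models outside either top 10, such as the Kimi entries, cannot be checked from the data shown here.

```python
# Scores hardcoded from the two top-10 tables above; names normalized
# by hand. The dated "-medium" variants are assumed to match the base
# GPT entries on Artificial Analysis.
swe_rebench = {  # model -> (rank, score %)
    "Claude Opus 4.6": (1, 65.3),
    "GPT-5.2": (2, 64.4),
    "GLM-5": (3, 62.8),
    "GPT-5.4": (4, 62.8),
    "Gemini 3.1 Pro Preview": (5, 62.3),
    "Claude Sonnet 4.6": (7, 60.7),
}
artificial_analysis = {  # model -> (rank, composite score)
    "GPT-5.4": (1, 57.2),
    "Gemini 3.1 Pro Preview": (2, 57.2),
    "Claude Opus 4.6": (4, 53.0),
    "Claude Sonnet 4.6": (5, 51.7),
    "GPT-5.2": (6, 51.3),
    "GLM-5": (7, 49.8),
}

for model in sorted(swe_rebench.keys() & artificial_analysis.keys()):
    sr_rank, sr = swe_rebench[model]
    aa_rank, aa = artificial_analysis[model]
    print(f"{model}: AA rank {aa_rank} -> SWE-rebench rank {sr_rank}, "
          f"score delta {sr - aa:+.1f}")
```

The printed deltas for Claude Opus 4.6, GLM-5, and Gemini 3.1 Pro Preview reproduce the 12.3, 13, and 5.1 figures cited in the commentary. Since one scale is a percentage and the other a composite index, the deltas are only meaningful relative to one another.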
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 238 |
| 2 | GPT-5.4 mini | 198 |
| 3 | Gemini 3 Flash Preview | 184 |
| 4 | GPT-5 Codex | 181 |
| 5 | GPT-5.4 nano | 160 |
| 6 | Qwen3.5 122B A10B | 134 |
| 7 | MiMo-V2-Flash | 129 |
| 8 | GPT-5.1 Codex | 118 |
| 9 | Gemini 3 Pro Preview | 115 |
| 10 | Gemini 3.1 Pro Preview | 114 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
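For reference, the blended figure folds two per-direction prices into one number at the stated 3:1 ratio. A minimal sketch of that arithmetic, using hypothetical prices since the per-direction rates for the listed models are not given here:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output mix: of every 4M
    tokens processed, 3M are assumed input and 1M output."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical per-direction prices, for illustration only.
print(blended_price(2.00, 8.00))  # -> 3.5, i.e. $3.50 per 1M blended
```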