Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, 12.3 points above its score of 53 on the Artificial Analysis index, though the two scales are not directly comparable. gpt-5.2-2025-12-11-medium sits second at 64.4%, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%. The SWE-rebench leaderboard shows material reshuffling in the upper tier: Kimi K2.5 climbed from rank 16 (46.8) to rank 13 (58.5), Kimi K2 Thinking jumped from rank 35 (40.9) to rank 17 (57.4), and Gemini 3 Flash Preview's score rose from 46.4 to 52.5 even as it slipped from rank 18 to rank 22, so all three improved in absolute terms on this benchmark. The Artificial Analysis leaderboard, which uses a different evaluation methodology, remains largely stable in its upper rankings, with GPT-5.4 and Gemini 3.1 Pro Preview tied at 57.2; KAT Coder Pro V2 entered at rank 23 with 43.8, and Nemotron Cascade 2 30B appeared at rank 81 with 27.7. The gap between the two benchmarks' top scores (SWE-rebench's 65.3% versus Artificial Analysis's 57.2) suggests they measure different aspects of model capability or use distinct evaluation criteria; without details on SWE-rebench's methodology relative to Artificial Analysis, it remains unclear whether the higher scores reflect easier test cases, different task distributions, or genuine performance differences on the same underlying problems. The consistent ordering within each benchmark indicates both are internally coherent, but the divergence between them argues for caution in treating either as a complete picture of coding ability.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
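The five-run protocol described above amounts to averaging per-run resolved rates. A minimal sketch of that aggregation, with made-up pass/fail results rather than SWE-rebench's actual harness or data:

```python
from statistics import mean, stdev

def score_model(run_results: list[list[bool]]) -> tuple[float, float]:
    """Aggregate independent runs into a mean resolved rate.

    run_results: one list of per-task pass/fail booleans per run.
    Returns (mean resolved rate, std dev across runs) as percentages.
    """
    rates = [100.0 * sum(run) / len(run) for run in run_results]
    return mean(rates), stdev(rates)

# Five hypothetical runs over the same four tasks:
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, False],
    [True, True, False, True],
]
m, s = score_model(runs)
print(round(m, 1), round(s, 1))  # 70.0 20.9
```

Reporting the cross-run spread alongside the mean is what makes the repeated runs worthwhile: a model whose five runs land at 75/50/100/50/75 is far less settled than one that scores 70 every time.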
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 96 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 120 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 94 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 61 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 79 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 81 | $4.81 |
| 7 | GLM-5 | 49.8 | 65 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 64 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 45 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 0 | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 242 |
| 2 | GPT-5.4 mini | 219 |
| 3 | GPT-5 Codex | 215 |
| 4 | Gemini 3 Flash Preview | 193 |
| 5 | GPT-5.4 nano | 177 |
| 6 | GPT-5.1 Codex | 155 |
| 7 | Qwen3.5 122B A10B | 145 |
| 8 | GPT-5.2 Codex | 129 |
| 9 | Gemini 3 Pro Preview | 123 |
| 10 | MiMo-V2-Flash | 123 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
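A "3:1 input/output" blend is usually computed as a weighted average of the provider's input and output prices, weighting input three times as heavily. A minimal sketch under that assumption, with illustrative prices rather than any provider's actual rates:

```python
def blended_cost(input_price: float, output_price: float,
                 input_weight: int = 3, output_weight: int = 1) -> float:
    """Weighted-average price per 1M tokens at a fixed input:output mix.

    input_price / output_price: $ per 1M input / output tokens.
    Default 3:1 weighting matches the table's stated blend.
    """
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total

# Illustrative: $0.50/1M input and $2.00/1M output at a 3:1 mix
print(blended_cost(0.50, 2.00))  # 0.875
```

Because input tokens dominate the blend, models with cheap input but expensive output can still rank well on this table even when their output price alone looks uncompetitive.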