Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the tier immediately below has compressed: gpt-5.2-2025-12-11-medium, GLM-5, and gpt-5.4-2026-03-05-medium now cluster between 62.8% and 64.4%. GLM-5 climbed from #7 to #3, and its 62.8% sits a full 13 points above its Artificial Analysis composite of 49.8.

Gemini 3.1 Pro Preview fell from #2 to #5 on SWE-rebench despite scoring 62.3%, which is 5.1 points above its Artificial Analysis score of 57.2. A gap that size suggests the two benchmarks sample different problem distributions or apply different evaluation rigor.

Kimi K2.5 and Kimi K2 Thinking both posted substantial Artificial Analysis gains, 12.5 and 16.5 points respectively, and climbed the SWE-rebench ranks to #13 and #17. Improvements of that magnitude raise the question of whether the models were retrained, fine-tuned on benchmark-adjacent data, or evaluated under materially different protocols across the two systems.

The broader pattern shows Claude models and GPT variants dominating the SWE-rebench top ten while Chinese models (GLM-5, the Kimi variants, the Qwen lines) have narrowed the gap. The divergence between SWE-rebench and Artificial Analysis rankings for several mid-tier models indicates these benchmarks are not interchangeable proxies for coding ability. Finally, MiMo-V2-Omni dropped out of the Artificial Analysis rankings entirely despite previously scoring 43.4, a notable exit that warrants clarification on whether the model was discontinued or simply failed to meet evaluation criteria this cycle.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
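SWE-rebench's five-run protocol reduces the stochastic variance noted above by averaging independent runs. A minimal sketch of that aggregation, using hypothetical per-run resolved fractions (not real SWE-rebench data):

```python
from statistics import mean, stdev

# Hypothetical per-run resolved fractions for one model across five
# independent runs (illustrative only; not actual SWE-rebench output).
runs = [0.648, 0.655, 0.651, 0.660, 0.649]

score = mean(runs)    # the reported score is the mean over runs
spread = stdev(runs)  # sample std. dev. captures run-to-run variance

print(f"score = {score:.1%} ± {spread:.1%}")  # → score = 65.3% ± 0.5%
```

A single run could land half a point high or low on a benchmark this size, which is why single-run leaderboard deltas under a point are hard to interpret.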
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 74 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 113 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 78 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 51 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 72 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 72 | $4.81 |
| 7 | GLM-5 | 49.8 | 69 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 59 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 47 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 93 | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5.4 nano | 221 |
| 2 | Grok 4.20 Beta 0309 | 218 |
| 3 | GPT-5.4 mini | 218 |
| 4 | Gemini 3 Flash Preview | 195 |
| 5 | GPT-5 Codex | 190 |
| 6 | Qwen3.5 122B A10B | 134 |
| 7 | MiMo-V2-Flash | 129 |
| 8 | GPT-5.1 Codex | 118 |
| 9 | Gemini 3 Pro Preview | 115 |
| 10 | Gemini 3.1 Pro Preview | 113 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
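The blended $/1M figure weights input and output token prices at the stated 3:1 ratio, i.e. three input tokens assumed per output token. A sketch of that arithmetic, with hypothetical per-direction prices (the table reports only the blended result, not the underlying rates):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output ratio:
    a weighted average of three parts input price to one part output."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Hypothetical rates of $0.30/1M input and $1.20/1M output tokens
# blend to $0.525/1M (illustrative only, not any listed model's rates).
print(blended_price(0.30, 1.20))  # → 0.525
```

Because input tokens carry three times the weight, models with cheap input but expensive output can still rank well on this metric; workloads with long generations will see costs closer to the output rate than the blended figure suggests.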