The SWE-rebench top tier is unchanged from the previous cycle: Claude Code holds 52.9%, Junie sits at 52.1%, and Claude Opus 4.6 and gpt-5.2-2025-12-11-xhigh are tied at 51.7%, suggesting incremental gains have plateaued among the highest-performing models.

Below that ceiling, however, there is significant volatility. Gemini 3 Pro Preview slipped from 48.4 to 46.7 on SWE-rebench while holding rank #8, whereas Kimi K2 Thinking climbed from 40.9 to 43.8 on Artificial Analysis and jumped 20 positions on SWE-rebench, from #33 to #13, which points to either methodological divergence between the two benchmarks or genuine capability shifts on specific coding tasks. GLM-5 fell sharply from 49.8 to 42.1 on SWE-rebench, dropping from #7 to #15, while holding at 49.8 on Artificial Analysis; the discrepancy warrants scrutiny of whether Artificial Analysis samples a different problem distribution or the SWE-rebench methodology has tightened. Kimi K2.5 shows the inverse pattern, declining from 46.8 to 37.9 on SWE-rebench while remaining at 46.8 on Artificial Analysis, which suggests the two benchmarks reward different architectural or prompt-handling strategies.

The broader pattern is that neither benchmark is settling into stable rankings: models in the 35-50% range on SWE-rebench swing 10-20 positions between cycles, and the gap between a model's SWE-rebench and Artificial Analysis scores, sometimes 5-10 points, suggests the two measure meaningfully different aspects of code-generation capability rather than converging on a unified signal.
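To make the cross-benchmark gaps concrete, here is a minimal sketch of the comparison described above, using the scores quoted in this cycle for three models that appear on both leaderboards; the dictionaries are hand-filled illustrations, not a full export of either table.

```python
# Per-model gap between the two leaderboards, using scores quoted in
# this cycle's digest (hand-filled illustration, not a full export).
swe_rebench = {
    "Claude Opus 4.6": 51.7,
    "GLM-5": 42.1,
    "Kimi K2.5": 37.9,
}
artificial_analysis = {
    "Claude Opus 4.6": 53.0,
    "GLM-5": 49.8,
    "Kimi K2.5": 46.8,
}

for model in sorted(swe_rebench.keys() & artificial_analysis.keys()):
    gap = artificial_analysis[model] - swe_rebench[model]
    print(f"{model:16s} AA - SWE-rebench = {gap:+.1f} pts")
```

On these figures GLM-5 and Kimi K2.5 both land 8-9 points higher on Artificial Analysis, squarely in the 5-10 point divergence noted above, while Claude Opus 4.6 differs by just over a point.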
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
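Since each SWE-rebench score is the mean of five runs, as the methodology note above states, run-to-run stability can be sanity-checked with elementary statistics. A toy sketch, with hypothetical per-run resolved rates:

```python
import statistics

# Five hypothetical resolved rates for one model; SWE-rebench averages
# runs like these to damp stochastic variance in agentic evaluation.
runs = [0.521, 0.534, 0.512, 0.529, 0.549]

mean = statistics.mean(runs)
spread = statistics.stdev(runs)  # sample standard deviation
print(f"reported score = {mean:.1%}, run-to-run spread = {spread:.1%}")
```

With a spread on the order of a point, the sub-point gaps separating the top four are plausibly within run-to-run noise, which is consistent with reading the top tier as a plateau.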
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 70 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 117 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 56 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 68 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 66 | $4.81 |
| 7 | GLM-5 | 49.8 | 74 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 60 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 43 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | n/a | n/a |
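How the composite weights its components is not stated here; assuming for illustration an unweighted mean over the three benchmark families, the arithmetic would look like this (the component scores are hypothetical):

```python
# Assumption: the composite is an unweighted mean of component scores.
# The actual Artificial Analysis weighting is not documented in this
# digest, and the component values below are hypothetical.
components = {"coding": 55.0, "math": 62.0, "reasoning": 42.6}
composite = sum(components.values()) / len(components)
print(f"composite index = {composite:.1f}")  # 53.2
```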
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5.4 mini | 254 |
| 2 | GPT-5.4 nano | 216 |
| 3 | Grok 4.20 Beta 0309 | 200 |
| 4 | Gemini 3 Flash Preview | 186 |
| 5 | GPT-5 Codex | 176 |
| 6 | MiMo-V2-Flash | 134 |
| 7 | Qwen3.5 122B A10B | 121 |
| 8 | Gemini 3.1 Pro Preview | 117 |
| 9 | Gemini 3 Pro Preview | 111 |
| 10 | GPT-5.1 Codex | 98 |
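The throughput number is just completion tokens divided by wall-clock generation time. A worked example that reproduces the Gemini 3 Flash Preview entry above (the token count and timing are hypothetical, chosen to land on 186 tok/s):

```python
# tok/s = output tokens / wall-clock generation seconds.
# Hypothetical measurement chosen to match the 186 tok/s entry above.
output_tokens = 1860
elapsed_seconds = 10.0
print(f"{output_tokens / elapsed_seconds:.0f} tok/s")  # 186 tok/s
```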
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
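The blended price weights input and output 3:1, per the caption above: of every four tokens billed, three are input and one is output. A sketch with hypothetical per-direction prices:

```python
# Blended $/1M tokens at a 3:1 input:output ratio, as the table caption
# states. The per-direction prices below are hypothetical illustrations.
input_price = 0.60   # $ per 1M input tokens (hypothetical)
output_price = 2.50  # $ per 1M output tokens (hypothetical)
blended = (3 * input_price + 1 * output_price) / 4
print(f"${blended:.3f} per 1M blended tokens")  # $1.075
```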