The Inference Report

March 22, 2026

The top tier on SWE-rebench remains frozen: Claude Code holds 52.9%, Junie 52.1%, and the next three models cluster between 51.7% and 51.0%, with no movement in the top five positions. Below that line, volatility increases sharply. Claude Opus 4.5 dropped from rank 8 to rank 12 while losing 5.9 percentage points (49.7% to 43.8%), a decline that suggests either benchmark drift or a methodological shift in how the evaluation weights problem categories. Kimi K2 Thinking gained 2.9 points and jumped from rank 34 to rank 13, the largest upward move in the dataset. Gemini 3 Pro Preview climbed from rank 13 to rank 8 despite losing 1.7 points, a ranking shift driven by larger losses elsewhere in the field.

The cross-benchmark picture is murkier. GLM-5 gained 7.7 points (42.1 to 49.8) and climbed from rank 15 to rank 7 on the Artificial Analysis index, yet on SWE-rebench it held at rank 15 with a slight decline, a divergence suggesting the two benchmarks sample different problem distributions or that GLM-5's movement is concentrated in specific task categories. Kimi K2.5 fell 8.9 points on Artificial Analysis (46.8 to 37.9) and lost ground on SWE-rebench while holding at rank 19, a rare case of decline on both metrics at once.

The data shows no clear pattern of across-the-board improvement; individual models post both gains and losses, which raises the question of whether these benchmarks are tracking genuine capability shifts or whether evaluation procedures, test-set composition, or model versioning have changed between measurement cycles.

Cole Brennan
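
The movers above are easiest to sanity-check as a diff between two leaderboard snapshots. Here is a minimal sketch in Python, assuming each snapshot is a plain name -> (rank, score) mapping (an assumed shape, not SWE-rebench's export format); the two entries reuse the figures quoted in the lead.

    # Compare two leaderboard snapshots: rank delta and score delta per model.
    # Snapshot shape (name -> (rank, score)) is assumed for illustration.
    # Gemini's previous score is derived from the 1.7-point loss quoted above.
    previous = {"Claude Opus 4.5": (8, 49.7), "Gemini 3 Pro Preview": (13, 48.4)}
    current  = {"Claude Opus 4.5": (12, 43.8), "Gemini 3 Pro Preview": (8, 46.7)}

    for model, (cur_rank, cur_score) in current.items():
        prev_rank, prev_score = previous[model]
        rank_delta = cur_rank - prev_rank      # negative = moved up the table
        score_delta = cur_score - prev_score   # percentage points
        print(f"{model}: rank {prev_rank}->{cur_rank} ({rank_delta:+d}), "
              f"score {score_delta:+.1f} pts")

On these entries it reports Opus 4.5 down four places and 5.9 points, and Gemini 3 Pro Preview up five places while down 1.7 points: climbing on a loss is only possible when the surrounding field loses more.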

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                       Score
 1  Claude Code                 52.9%
 2  Junie                       52.1%
 3  Claude Opus 4.6             51.7%
 4  gpt-5.2-2025-12-11-xhigh    51.7%
 5  gpt-5.2-2025-12-11-medium   51.0%
 6  gpt-5.1-codex-max           48.5%
 7  Claude Sonnet 4.5           47.1%
 8  Gemini 3 Pro Preview        46.7%
 9  Gemini 3 Flash Preview      46.7%
10  gpt-5.2-codex               45.0%
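
Since each model is run five times, the published score reads naturally as a mean over runs. Below is a minimal sketch of that aggregation; the per-run resolved rates are invented, chosen only to average to the top two figures above, and statistics is Python's standard library.

    # Aggregate five runs per model: report the mean and the spread.
    from statistics import mean, stdev

    runs = {
        "Claude Code": [53.4, 52.1, 53.0, 52.6, 53.4],  # hypothetical per-run %
        "Junie":       [51.5, 52.8, 51.9, 52.2, 52.1],
    }
    for model, scores in runs.items():
        print(f"{model}: {mean(scores):.1f}% (±{stdev(scores):.1f} across runs)")

Even a half-point spread per run is enough to blur adjacent positions in a table this tightly packed, which is one reason the sub-point gaps between ranks 3 through 5 are best read as effective ties.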

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                    Score   tok/s     $/1M
 1  GPT-5.4                   57.2      85    $5.63
 2  Gemini 3.1 Pro Preview    57.2     118    $4.50
 3  GPT-5.3 Codex             54        71    $4.81
 4  Claude Opus 4.6           53        51   $10.00
 5  Claude Sonnet 4.6         51.7      66    $6.00
 6  GPT-5.2                   51.3      75    $4.81
 7  GLM-5                     49.8      89    $1.55
 8  Claude Opus 4.5           49.7      58   $10.00
 9  MiniMax-M2.7              49.6      43   $0.525
10  MiMo-V2-Pro               49.2       0    $0.00
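
For the composite itself, only its shape is sketched here: an average over category scores. Artificial Analysis's actual categories, weights, and normalization are not published in this table, so the values below and the unweighted mean are assumptions for illustration, not their methodology.

    # Composite index as an unweighted mean of category scores (assumed form).
    categories = {"coding": 55.0, "math": 61.2, "reasoning": 55.4}  # hypothetical

    composite = sum(categories.values()) / len(categories)
    print(f"composite: {composite:.1f}")  # 57.2 with these invented inputs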

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #  Model                    tok/s
 1  GPT-5.4 mini               235
 2  GPT-5.4 nano               209
 3  Gemini 3 Flash Preview     193
 4  GPT-5 Codex                170
 5  Qwen3.5 122B A10B          154
 6  Grok 4.20 Beta 0309        145
 7  MiMo-V2-Flash              142
 8  GPT-5.1 Codex              122
 9  Gemini 3 Pro Preview       120
10  Gemini 3.1 Pro Preview     118

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #  Model                   $/1M
 1  MiMo-V2-Flash           $0.15
 2  DeepSeek V3.2           $0.315
 3  GPT-5.4 nano            $0.463
 4  MiniMax-M2.7            $0.525
 5  MiniMax-M2.5            $0.525
 6  GPT-5 mini              $0.688
 7  Qwen3.5 27B             $0.825
 8  GLM-4.7                 $1.00
 9  Kimi K2 Thinking        $1.07
10  Qwen3.5 122B A10B       $1.10
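
The blended price folds two per-token rates into one: at a 3:1 input/output mix, three of every four tokens bill at the input rate. A minimal sketch; the example prices are invented, and only the 3:1 weighting comes from the caption above.

    # Blend input and output $/1M prices at a 3:1 token ratio.
    def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
        return (3 * input_per_1m + 1 * output_per_1m) / 4

    # e.g. $0.80 in / $1.60 out blends to $1.00 per 1M tokens
    print(f"${blended_cost(0.80, 1.60):.2f}")

The weighting matters for agentic coding workloads, which often skew far more input-heavy than 3:1 once file contents and tool output land in context, so the cheap-input models here can look even cheaper in practice.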