The SWE-rebench leaderboard shows stasis at the top, with Claude Code holding 52.9% and Junie at 52.1%, while significant reshuffling occurs in the middle tiers. The more interesting story is how sharply those mid-tier placements diverge between SWE-rebench and the Artificial Analysis composite index.

Claude Opus 4.5 holds position 8 with a 49.7 on Artificial Analysis but only position 12 with 43.8% on SWE-rebench, a 5.9-point gap that warrants scrutiny of whether the two evaluations weight different skills or whether the model genuinely underperforms on this task distribution. Kimi K2 Thinking sits 28th with 40.9% on SWE-rebench yet 13th with a 43.8 on Artificial Analysis, suggesting the composite index rewards capabilities that SWE-rebench's agentic software-engineering tasks do not exercise. Gemini 3 Pro Preview is 11th with a 48.4 on Artificial Analysis and 8th with 46.7% on SWE-rebench, a modest 1.7-point difference consistent with natural variance, though it still hints at methodological differences in how the two benchmarks score the same model. GLM-5 shows the starkest split: position 7 with a 49.8 on Artificial Analysis against position 15 with 42.1% on SWE-rebench, a 7.7-point gap that is difficult to attribute to random noise and suggests these benchmarks are testing different aspects of code-generation capability.

The stability of SWE-rebench's top five positions, combined with large swings in the 7-20 range, indicates the benchmark is sensitive enough to detect real differences but that the frontier models have plateaued relative to their challengers. That pattern is worth monitoring across future cycles to determine whether we are seeing genuine convergence or measurement instability. The divergence arithmetic is sketched below.
Cole Brennan
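To make the comparison concrete, here is a minimal sketch of the divergence computation behind figures like the 7.7-point GLM-5 gap. The input dicts are toy reconstructions from the numbers cited above, not an official data feed, and the two scores are on different scales (a resolve rate versus a composite index), so the gaps are indicative rather than strictly commensurable.

```python
# Minimal sketch: quantify per-model divergence between two leaderboards.
# Values are (rank, score) pairs reconstructed from the commentary above.
swe_rebench = {
    "Claude Opus 4.5": (12, 43.8),
    "GLM-5": (15, 42.1),
    "Kimi K2 Thinking": (28, 40.9),
}
artificial_analysis = {
    "Claude Opus 4.5": (8, 49.7),
    "GLM-5": (7, 49.8),
    "Kimi K2 Thinking": (13, 43.8),
}

for model in sorted(swe_rebench.keys() & artificial_analysis.keys()):
    swe_rank, swe_score = swe_rebench[model]
    aa_rank, aa_score = artificial_analysis[model]
    score_gap = aa_score - swe_score   # positive: stronger on Artificial Analysis
    rank_gap = swe_rank - aa_rank      # positive: ranked higher on Artificial Analysis
    print(f"{model}: {score_gap:+.1f} points, {rank_gap:+d} rank positions")
```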
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance; a toy illustration of that averaging follows the table.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
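The five-run protocol amounts to reporting an aggregate over repeated trials. As an illustration only (the per-run figures below are hypothetical, and the assumption that the published score is a simple mean is mine, not SWE-rebench's documentation):

```python
from statistics import mean, stdev

# Hypothetical resolved rates (%) for one model across five independent runs;
# SWE-rebench does not publish per-run data in this digest.
runs = [51.8, 53.4, 52.6, 52.9, 53.8]

score = mean(runs)    # assumed to be the figure shown on the leaderboard
spread = stdev(runs)  # a rough handle on run-to-run stochastic variance

print(f"score = {score:.1f}% ± {spread:.1f}")  # -> score = 52.9% ± 0.8
```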
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 80 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 56 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 61 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 75 | $4.81 |
| 7 | GLM-5 | 49.8 | 66 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 65 | $10.00 |
| 9 | GPT-5.2 Codex | 49.0 | 108 | $4.81 |
| 10 | Grok 4.20 Beta 0309 | 48.5 | 213 | $3.00 |
Output tokens per second — higher is faster. Only models with an intelligence score of at least 40 are listed; a sketch of this selection rule follows the table.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 213 |
| 2 | GPT-5 Codex | 203 |
| 3 | Gemini 3 Flash Preview | 179 |
| 4 | Qwen3.5 122B A10B | 159 |
| 5 | GPT-5.1 Codex | 140 |
| 6 | MiMo-V2-Flash | 127 |
| 7 | Gemini 3.1 Pro Preview | 114 |
| 8 | GPT-5.1 | 111 |
| 9 | Gemini 3 Pro Preview | 110 |
| 10 | GPT-5.2 Codex | 108 |
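The intelligence floor keeps fast-but-weak models off the speed and cost tables. A minimal sketch of that filter-then-rank rule, using one row from the composite table above and otherwise hypothetical values:

```python
# Selection rule: drop models below the intelligence floor, then rank by speed.
MIN_INTELLIGENCE = 40

models = [
    # (name, intelligence index, output tok/s)
    ("Grok 4.20 Beta 0309", 48.5, 213),      # both values from the tables above
    ("Gemini 3 Flash Preview", 44.0, 179),   # index value assumed for illustration
    ("HypotheticalDraftModel", 31.0, 400),   # fast, but excluded: below the floor
]

eligible = [m for m in models if m[1] >= MIN_INTELLIGENCE]
ranked = sorted(eligible, key=lambda m: m[2], reverse=True)
for rank, (name, _, tps) in enumerate(ranked, 1):
    print(f"{rank}. {name}: {tps} tok/s")
```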
Blended cost per 1M tokens (3:1 input/output mix) — lower is cheaper. Minimum intelligence score of 40; the blend arithmetic is sketched after the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
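The 3:1 blend weights the input price three times as heavily as the output price, reflecting a typical request mix of three input tokens per output token. A minimal sketch of the arithmetic; the price pair in the example is hypothetical, not taken from any vendor's price sheet:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input/output ratio:
    three input tokens for every output token."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical prices: $0.10/1M input, $0.30/1M output.
print(blended_price(0.10, 0.30))  # -> 0.15
```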