On SWE-rebench, the top tier has held steady: Claude Code at 52.9%, Junie at 52.1%, and Claude Opus 4.6 tied with gpt-5.2-xhigh at 51.7%, with no movement from the prior rankings. Claude Opus 4.5, by contrast, sits at position 8 with a 49.7 index score on Artificial Analysis but at position 12 with 43.8% on SWE-rebench, a 5.9-point gap that signals either methodological differences between the two benchmarks or genuine performance variance across problem distributions.

Elsewhere on Artificial Analysis, Kimi K2 Thinking climbed 14 positions (from 27 to 13) on a 2.9-point gain, while Gemini 3 Pro Preview slipped from position 10 to 11 despite holding steady at 48.4, indicating that a new entrant shifted the rankings. GLM-5 shows a similar cross-benchmark split: position 7 at 49.8 on Artificial Analysis versus position 15 at 42.1% on SWE-rebench, a 7.7-point gap that merits scrutiny as to whether it reflects model degradation or evaluation instability. Kimi K2.5 declined sharply from position 12 at 46.8% to position 19 at 37.9%, losing 8.9 points and 7 ranking positions. The Artificial Analysis leaderboard also saw new entries at position 10 (Grok 4.20 Beta) and position 40 (NVIDIA Nemotron 3 Super 120B), while LongCat Flash Lite (position 97) and Sarvam M (position 282) entered lower tiers, suggesting either benchmark expansion or periodic model rotation.

The divergence between SWE-rebench and Artificial Analysis on models like Claude Opus 4.5 and GLM-5 raises questions about benchmark sensitivity to implementation details and task sampling; without clarity on how the two evaluation methodologies differ, it is difficult to say whether these gaps reflect real capability variation or measurement artifacts.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
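The five-run protocol described above can be sketched as a simple aggregation: each run yields a resolved rate over the task set, and the reported score is the mean across runs, with the spread capturing stochastic variance. The data below is hypothetical, not taken from SWE-rebench.

```python
from statistics import mean, stdev

def resolved_rate(runs: list[list[bool]]) -> tuple[float, float]:
    """Aggregate per-run pass/fail outcomes into a mean resolved rate
    (in percent) and the run-to-run standard deviation."""
    rates = [100 * sum(r) / len(r) for r in runs]
    return mean(rates), stdev(rates)

# Hypothetical model: 5 independent runs over the same 20 tasks.
runs = [
    [True] * 11 + [False] * 9,   # 55.0%
    [True] * 10 + [False] * 10,  # 50.0%
    [True] * 11 + [False] * 9,   # 55.0%
    [True] * 10 + [False] * 10,  # 50.0%
    [True] * 12 + [False] * 8,   # 60.0%
]
avg, spread = resolved_rate(runs)  # avg = 54.0
```

Reporting the mean of several runs, rather than a single run, is what keeps a lucky or unlucky sample from moving a model several leaderboard positions.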
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 111 | $4.50 |
| 2 | GPT-5.4 | 57 | 77 | $5.63 |
| 3 | GPT-5.3 Codex | 54 | 57 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 53 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 60 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 65 | $4.81 |
| 7 | GLM-5 | 49.8 | 63 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 57 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 72 | $4.81 |
| 10 | Grok 4.20 Beta 0309 | 48.5 | 245 | $3.00 |
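Artificial Analysis does not publish its exact aggregation here, but a composite index of this kind is typically a weighted mean of per-category scores. The sketch below is purely illustrative: the category scores and equal weights are assumptions, not the provider's actual inputs or method.

```python
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-category scores on a common 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight

# Hypothetical category scores with equal weighting.
scores = {"coding": 60.0, "math": 55.0, "reasoning": 50.0}
weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}
index = composite_index(scores, weights)  # 55.0
```

One consequence of any weighted composite: two models with the same index can have very different category profiles, which is one reason a coding-specific benchmark like SWE-rebench can disagree with it.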
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 245 |
| 2 | GPT-5 Codex | 175 |
| 3 | Gemini 3 Flash Preview | 164 |
| 4 | Qwen3.5 122B A10B | 151 |
| 5 | MiMo-V2-Flash | 133 |
| 6 | Gemini 3 Pro Preview | 115 |
| 7 | Gemini 3.1 Pro Preview | 111 |
| 8 | GPT-5.1 Codex | 108 |
| 9 | Qwen3.5 27B | 87 |
| 10 | GLM-4.7 | 79 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
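The 3:1 blend used in the cost table weights input tokens three times as heavily as output tokens, reflecting a typical request mix. Given separate per-million input and output prices, the blended figure is a weighted average; the prices below are hypothetical, not drawn from the table.

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output ratio:
    (3 * input_price + 1 * output_price) / 4."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical pricing: $0.50/1M input tokens, $2.00/1M output tokens.
cost = blended_price(0.50, 2.00)  # 0.875
```

Because most tokens in agentic and long-context workloads are input, this blend rewards cheap input pricing even when output pricing is several times higher.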