The SWE-rebench leaderboard shows compression at the top but no meaningful movement within the tested range. Claude Code, Junie, and Claude Opus 4.6 remain locked in the 51.7 to 52.9 percent band, with the top five models separated by less than two points, indicating a plateau in discriminative power rather than genuine progress.

Below that tier, the two benchmarks tell different stories about the same models: Claude Opus 4.5 scores 49.7 on Artificial Analysis against 43.8 on SWE-rebench, GLM-5 falls from 49.8 to 42.1, and Kimi K2.5 from 46.8 to 37.9. Gaps that large suggest differences in what the two benchmarks measure, or plain evaluation variance, rather than model regression. In the other direction, Kimi K2 Thinking gained 2.9 points to reach 43.8 on SWE-rebench and GLM-4.6 gained 4.6 points to reach 37.1 on Artificial Analysis, but these moves occur in a region where single-digit swings are routine and may reflect test-set sensitivity rather than architectural breakthroughs.

The divergence persists at the very top: GPT-5.4 and Gemini 3.1 Pro Preview both score 57.2 on Artificial Analysis versus Claude Code's 52.9 on SWE-rebench. The comparison is not even apples to apples, since the Artificial Analysis figure is a composite across coding, math, and reasoning while the SWE-rebench figure is a resolved rate on software engineering tasks alone; SWE-rebench also appears stricter or tests different problem classes, making direct comparison unreliable. Without documentation of methodology changes, evaluation-set stability, or statistical confidence intervals, the apparent volatility in mid-tier positions cannot be distinguished from noise. The real finding is not movement but stagnation at the frontier and inconsistency across benchmarks, both of which limit confidence in using either as a proxy for practical coding capability.
Cole Brennan
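To make the noise argument concrete, here is a rough sketch of the sampling error on a resolved rate, treating each benchmark task as an independent pass/fail trial. The task count is an assumption (SWE-rebench's set size is not given here; 300 is used purely for illustration), and `resolve_rate_ci` is a name of mine, not anything the benchmark publishes.

```python
import math

def resolve_rate_ci(score_pct, n_tasks, z=1.96):
    """95% normal-approximation confidence interval for a resolved rate,
    treating each task as an independent Bernoulli trial."""
    p = score_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n_tasks)
    return 100.0 * (p - z * se), 100.0 * (p + z * se)

# Top-three SWE-rebench scores from the table below; n_tasks is assumed.
for name, score in [("Claude Code", 52.9), ("Junie", 52.1), ("Claude Opus 4.6", 51.7)]:
    lo, hi = resolve_rate_ci(score, n_tasks=300)
    print(f"{name}: {score:.1f}%  (95% CI {lo:.1f}-{hi:.1f})")
```

Under that assumption the interval is roughly plus or minus 5.6 points, which dwarfs the 1.2-point spread separating the top four models; only a much larger task set would make those rank orderings statistically meaningful.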
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that aggregation follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
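A minimal sketch of what five-run aggregation looks like. The per-run rates below are invented for illustration; SWE-rebench reports only the final score, not per-run figures.

```python
from statistics import mean, stdev

# Invented per-run resolved rates for one model across five runs.
runs = [0.536, 0.521, 0.529, 0.533, 0.525]

score = mean(runs)    # the single figure a leaderboard would report
spread = stdev(runs)  # run-to-run variability from sampling stochasticity

print(f"score: {100 * score:.1f}%  (run-to-run sd {100 * spread:.2f} pts)")
```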
Artificial Analysis composite index across coding, math, and reasoning benchmarks; a sketch of how a composite of this shape combines category scores follows the table.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 80 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 113 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 69 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 60 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 68 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 69 | $4.81 |
| 7 | GLM-5 | 49.8 | 66 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 65 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 91 | $4.81 |
| 10 | MiMo-V2-Pro | 48.8 | — | — |
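For readers unfamiliar with composite indices, a minimal sketch under an explicit assumption: Artificial Analysis does not document its weighting here, so the equal weights and all category numbers below are invented purely to show the arithmetic.

```python
def composite(scores, weights=None):
    """Weighted mean of per-category scores; equal weights by default.
    The weighting is an assumption, not Artificial Analysis's formula."""
    weights = weights or {k: 1.0 for k in scores}
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

# Invented category scores, for illustration only.
print(composite({"coding": 50.0, "math": 60.0, "reasoning": 55.0}))  # 55.0
```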
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 196 |
| 2 | Gemini 3 Flash Preview | 180 |
| 3 | GPT-5 Codex | 176 |
| 4 | Qwen3.5 122B A10B | 151 |
| 5 | MiMo-V2-Flash | 130 |
| 6 | Gemini 3.1 Pro Preview | 113 |
| 7 | Gemini 3 Pro Preview | 110 |
| 8 | GPT-5.1 Codex | 103 |
| 9 | GPT-5.2 Codex | 91 |
| 10 | Qwen3.5 27B | 90 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40. A worked example of the blend follows the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
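The 3:1 blend weights the input price three times as heavily as the output price. A minimal sketch of the arithmetic; the per-direction prices here are hypothetical, since the table lists only the blended figure.

```python
def blended_price(input_per_m, output_per_m, input_weight=3.0, output_weight=1.0):
    """Blended $ per 1M tokens at a 3:1 input:output token ratio."""
    total = input_weight + output_weight
    return (input_per_m * input_weight + output_per_m * output_weight) / total

# Hypothetical prices: $0.50/1M input, $2.00/1M output.
print(f"${blended_price(0.50, 2.00):.3f} per 1M tokens")  # $0.875
```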