Claude Opus 4.6 holds first place on SWE-rebench at 65.3%, up from fourth place at 53 on the Artificial Analysis index, a 12.3-point swing that reflects either a genuine strength on software engineering tasks or a substantial methodological divergence between the two benchmarks. The SWE-rebench leaderboard shows tighter clustering at the top than Artificial Analysis: the gap between first and fifth place narrows to 3.0 points (65.3% to 62.3%), compared with 5.5 points on the older index (57.2 to 51.7), suggesting either more homogeneous model performance on software engineering tasks or differences in how the benchmark distributes credit across solution attempts.

Kimi K2.5 and Kimi K2 Thinking both advanced substantially, moving from positions 16 and 35 on Artificial Analysis (46.8 and 40.9 points respectively) to positions 13 and 17 on SWE-rebench (58.5% and 57.4%), indicating these models may have been underestimated by the prior evaluation or that they excel specifically at the code completion and repository-level reasoning SWE-rebench targets. Gemini 3 Flash Preview shows a similar pattern: its score rises from 46.4 (position 18) to 52.5% (position 22), a 6.1-point improvement that outpaces most of the field even as its relative rank slips.

The SWE-rebench evaluation appears to reward architectural choices or training data aligned with real repository work: GLM-5 and gpt-5.4-2026-03-05-medium score identically at 62.8%, yet their Artificial Analysis scores diverge by 4.2 points (49.8 vs 54), suggesting the newer benchmark may reduce noise or focus more narrowly on a specific class of engineering problems. Without documentation of what changed in the benchmark methodology, evaluation harness, or problem distribution, the magnitude of these shifts prevents confident assessment of whether they represent genuine model progress or simply a different measurement regime.
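The cross-benchmark comparisons above amount to joining the two leaderboards and computing score and rank deltas. Below is a minimal sketch of that calculation in Python, using a handful of scores transcribed from the tables in this post; the assumption that the SWE-rebench and Artificial Analysis entries map one-to-one by model name (and that a resolved-rate percentage and a composite index can be differenced point-for-point) is the author's framing, not an established equivalence.

```python
# Sketch: compare the two leaderboards by joining on model name and
# computing rank and score deltas. Scores are transcribed from the
# tables below; the name mapping between benchmarks is an assumption.

swe_rebench = {           # model -> (rank, resolved rate %)
    "Claude Opus 4.6": (1, 65.3),
    "GLM-5": (3, 62.8),
    "Gemini 3.1 Pro Preview": (5, 62.3),
    "Claude Sonnet 4.6": (7, 60.7),
}

artificial_analysis = {   # model -> (rank, composite index)
    "Claude Opus 4.6": (4, 53.0),
    "GLM-5": (7, 49.8),
    "Gemini 3.1 Pro Preview": (2, 57.2),
    "Claude Sonnet 4.6": (5, 51.7),
}

for model, (sr_rank, sr_score) in swe_rebench.items():
    aa_rank, aa_score = artificial_analysis[model]
    print(f"{model}: rank {aa_rank} -> {sr_rank}, "
          f"score {aa_score} -> {sr_score} ({sr_score - aa_score:+.1f} pts)")
```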
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffold for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a small aggregation sketch follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
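Since each model is run five times, the published figure presumably collapses those runs into one number. A minimal sketch of that aggregation, assuming the reported score is the mean resolved rate across runs (the post does not spell out the exact rule), with hypothetical per-run values:

```python
# Sketch: aggregate five independent runs into a single leaderboard score.
# Assumption: the published figure is the mean resolved rate across runs.
from statistics import mean, stdev

# Hypothetical per-run resolved rates (%) for one model over 5 runs.
runs = [64.1, 66.0, 65.5, 64.8, 66.1]

score = mean(runs)
spread = stdev(runs)
print(f"reported score ~ {score:.1f}% (+/- {spread:.1f} across runs)")
```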
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 88 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 92 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 59 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 79 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 83 | $4.81 |
| 7 | GLM-5 | 49.8 | 65 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 68 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 44 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 95 | $1.50 |
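The Artificial Analysis figure is a composite over coding, math, and reasoning benchmarks. Its actual weighting is not documented in this post; the sketch below simply illustrates the idea with an unweighted mean over hypothetical per-category scores.

```python
# Sketch of a composite index as an unweighted mean of category scores.
# The category values and the equal weighting are illustrative assumptions,
# not Artificial Analysis' actual methodology.
categories = {"coding": 55.0, "math": 52.0, "reasoning": 52.0}

composite = sum(categories.values()) / len(categories)
print(f"composite index ~ {composite:.1f}")  # ~53.0 in this example
```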
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 242 |
| 2 | GPT-5.4 mini | 219 |
| 3 | GPT-5 Codex | 218 |
| 4 | Gemini 3 Flash Preview | 192 |
| 5 | GPT-5.4 nano | 177 |
| 6 | Qwen3.5 122B A10B | 145 |
| 7 | GPT-5.1 Codex | 140 |
| 8 | MiMo-V2-Flash | 137 |
| 9 | GPT-5.2 Codex | 129 |
| 10 | Gemini 3 Pro Preview | 118 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
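The blended price weights input and output token prices 3:1, i.e. 75% of billed tokens priced as input and 25% as output. A minimal sketch of that blend, using hypothetical per-million-token prices rather than any listed model's actual rates:

```python
# Sketch: blended price per 1M tokens at a 3:1 input:output ratio.
# The input/output prices below are hypothetical examples.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return 0.75 * input_per_m + 0.25 * output_per_m

# Example: $0.40/1M input and $2.00/1M output blend to $0.80/1M.
print(f"${blended_price(0.40, 2.00):.3f} per 1M tokens")
```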