The Inference Report

May 14, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%. Its Artificial Analysis score of 52.9 is far lower, but the two are distinct benchmarks measuring different problem sets, so the numbers should not be read as movement on the same task. The SWE-rebench leaderboard shows tight clustering in the upper tier, with models ranked 2 through 7 all scoring between 64.4% and 62.3%, suggesting convergence in code-agent capability among leading systems.

GLM-5 and GLM-5.1 both advanced significantly on Artificial Analysis, climbing from positions 17 and 14 to current scores of 49.8 and 51.4 respectively, while Kimi K2 Thinking rose from position 54 to a score of 40.9, indicating that Chinese-developed models are narrowing the gap. Gemini 3.1 Pro Preview held steady at 57.2 on Artificial Analysis and sits at position 7 on SWE-rebench with 62.3%, apparently down from a higher placement in earlier runs.

The SWE-rebench methodology evaluates code agents on real GitHub issues requiring multi-step reasoning and tool use, while Artificial Analysis covers general reasoning tasks, so raw score differences across benchmarks reflect task difficulty rather than model capability regression. Within SWE-rebench, the spread from position 1 to position 10 spans only 5.3 percentage points, suggesting marginal gains now require increasingly refined approaches rather than architectural leaps. The Artificial Analysis rankings show broader stratification, with positions 1 through 10 spanning 8.0 points, indicating that general reasoning benchmarks may currently be more discriminative at the frontier than specialized coding benchmarks.
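The spread arithmetic can be checked directly against the tables below. A minimal Python sketch, with the top-10 scores hardcoded from both leaderboards:

```python
# Top-10 scores copied from the SWE-rebench and Artificial Analysis
# tables below; the spread is top-1 minus top-10.
swe_rebench = [65.3, 64.4, 62.8, 62.8, 62.8, 62.7, 62.3, 60.9, 60.7, 60.0]
artificial_analysis = [60.2, 57.3, 57.2, 56.8, 53.9, 53.8, 53.6, 53.2, 52.9, 52.2]

for name, scores in [("SWE-rebench", swe_rebench),
                     ("Artificial Analysis", artificial_analysis)]:
    print(f"{name}: top-1 to top-10 spread = {scores[0] - scores[-1]:.1f} points")
# SWE-rebench: top-1 to top-10 spread = 5.3 points
# Artificial Analysis: top-1 to top-10 spread = 8.0 points
```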

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of this aggregation follows the table).

 #  Model                       Score
 1  Claude Opus 4.6             65.3%
 2  gpt-5.2-2025-12-11-medium   64.4%
 3  GLM-5                       62.8%
 4  Junie                       62.8%
 5  gpt-5.4-2026-03-05-medium   62.8%
 6  GLM-5.1                     62.7%
 7  Gemini 3.1 Pro Preview      62.3%
 8  DeepSeek-V3.2               60.9%
 9  Claude Sonnet 4.6           60.7%
10  Claude Sonnet 4.5           60.0%
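The headline scores above come from the five-run protocol described earlier. SWE-rebench's actual aggregation code isn't shown here, so the sketch below is one plausible reading, with hypothetical per-run resolve rates:

```python
from statistics import mean, stdev

# Hypothetical per-run resolve rates (%) for one model across five
# independent runs; the numbers are illustrative, not published data.
runs = [64.8, 65.9, 65.1, 65.6, 65.1]

score = mean(runs)    # headline leaderboard score
noise = stdev(runs)   # run-to-run stochastic variance
print(f"score = {score:.1f}%, run-to-run stdev = {noise:.2f}")
# score = 65.3%, run-to-run stdev = 0.44
```

Reporting the mean of several runs, rather than a single pass, keeps day-to-day leaderboard movement from being dominated by sampling noise.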

Artificial Analysis composite index across coding, math, and reasoning benchmarks (a sketch of one possible weighting follows the table).

 #  Model                    Score   tok/s   $/1M
 1  GPT-5.5                   60.2      65   $11.25
 2  Claude Opus 4.7           57.3      63   $10.94
 3  Gemini 3.1 Pro Preview    57.2     128   $4.50
 4  GPT-5.4                   56.8      83   $5.63
 5  Kimi K2.6                 53.9      41   $1.71
 6  MiMo-V2.5-Pro             53.8      54   $1.50
 7  GPT-5.3 Codex             53.6      76   $4.81
 8  Grok 4.3                  53.2      81   $1.56
 9  Claude Opus 4.6           52.9      48   $10.94
10  Muse Spark                52.2       0   $0.00
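How the composite index above is assembled from its coding, math, and reasoning components isn't specified here, so the sketch below assumes a simple unweighted mean over hypothetical category scores; the real index may weight categories differently:

```python
# Hypothetical per-category scores for one model. Equal weighting is an
# assumption for illustration, not Artificial Analysis's published method.
categories = {"coding": 58.1, "math": 63.4, "reasoning": 59.1}

index = sum(categories.values()) / len(categories)
print(f"composite index = {index:.1f}")  # composite index = 60.2
```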

Output tokens per second — higher is faster. Only models with an intelligence score of at least 40 are included.

 #  Model                    tok/s
 1  Gemini 3 Flash Preview     197
 2  GPT-5.1 Codex              183
 3  Qwen3.6 35B A3B            182
 4  GPT-5 Codex                179
 5  GPT-5.4 mini               169
 6  Qwen3.5 122B A10B          154
 7  MiMo-V2-Flash              149
 8  GPT-5.4 nano               148
 9  Hy3-preview                134
10  Gemini 3.1 Pro Preview     128

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Only models with an intelligence score of at least 40 are included (a worked example follows the table).

 #  Model               $/1M
 1  Hy3-preview         $0.143
 2  MiMo-V2-Flash       $0.15
 3  DeepSeek V4 Flash   $0.175
 4  DeepSeek V3.2       $0.337
 5  GPT-5.4 nano        $0.463
 6  MiniMax-M2.7        $0.525
 7  KAT Coder Pro V2    $0.525
 8  MiniMax-M2.5        $0.525
 9  Qwen3.6 35B A3B     $0.557
10  GPT-5 mini          $0.688
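The 3:1 blend works out to (3 × input price + output price) / 4. A minimal sketch; the per-direction prices are assumptions for illustration, not quoted figures:

```python
def blended_cost(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at the 3:1 input/output ratio used above."""
    return (3 * input_per_m + output_per_m) / 4

# Assumed prices of $0.25/1M input and $2.00/1M output blend to $0.688/1M,
# matching the GPT-5 mini row above (the split itself is an assumption).
print(f"${blended_cost(0.25, 2.00):.3f}/1M")  # $0.688/1M
```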