The Inference Report

May 12, 2026

Claude Opus 4.6 holds the top position on SWE-rebench with 65.3%, a jump of 12.3 percentage points from its previous ranking of ninth place at 53%, while the rest of the field has compressed at the high end: positions two through five are separated by only 1.6 points. The movement reflects real gains in code generation capability, though the methodology warrants scrutiny. SWE-rebench measures the ability to resolve GitHub issues end-to-end, a task that rewards reasoning depth and context management rather than pure instruction-following, and the compression at the top suggests these models are approaching saturation on the benchmark's current problem distribution.

On Artificial Analysis, the landscape differs markedly: GPT-5.5 leads at 60.2 while Claude Opus 4.6 sits ninth at 53, indicating either that the two benchmarks reward different capabilities or that Artificial Analysis weights broader performance categories beyond coding. The divergence between benchmarks matters. Claude Sonnet 4.6 ranks ninth on SWE-rebench at 60.7% but only twelfth on Artificial Analysis at 51.7%, suggesting it excels at the specific demands of issue resolution but underperforms on Artificial Analysis's mixed evaluation.

GLM-5 climbed from seventeenth to third on SWE-rebench, gaining 13 points, and Kimi K2 Thinking rose from fifty-fourth to twenty-first on the same benchmark with a 16.5-point gain, patterns that point to targeted improvements in code reasoning. The Artificial Analysis list saw minimal reordering beyond the top tier, with most models holding their positions, which indicates either stable model performance or less frequent evaluation updates on that benchmark. What distinguishes this cycle is not a methodological breakthrough but consolidation: the top performers are now substantially ahead of the middle tier, and the gap between first and tenth place on SWE-rebench spans 5.3 points, a meaningful spread that reflects real differences in how models handle multi-step code tasks.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
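To illustrate the resampling step, here is a minimal sketch of averaging a resolved rate over five runs. The run outcomes and scoring function are invented for illustration; this is not SWE-rebench's actual harness or data.

```python
# Hypothetical illustration of averaging a model's resolved rate over five
# independent runs, the way a benchmark might smooth stochastic variance.
# The run outcomes below are made up; they are not real benchmark results.

def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of issues the model resolved end-to-end in one run."""
    return sum(outcomes) / len(outcomes)

# Five runs over the same issue set (True = issue resolved).
runs = [
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, True, True, False],
    [True, False, False, True, False],
    [True, True, False, True, True],
]

per_run = [resolved_rate(run) for run in runs]
score = sum(per_run) / len(per_run)  # mean over the five runs
print(f"per-run rates: {per_run}")
print(f"reported score: {score:.1%}")
```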

 #   Model                        Score
 1   Claude Opus 4.6              65.3%
 2   gpt-5.2-2025-12-11-medium    64.4%
 3   GLM-5                        62.8%
 4   Junie                        62.8%
 5   gpt-5.4-2026-03-05-medium    62.8%
 6   GLM-5.1                      62.7%
 7   Gemini 3.1 Pro Preview       62.3%
 8   DeepSeek-V3.2                60.9%
 9   Claude Sonnet 4.6            60.7%
10   Claude Sonnet 4.5            60.0%
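The compression and spread figures cited in the opening analysis can be checked directly against this table. A short sketch, using only the published top-ten scores:

```python
# Recompute the spreads discussed above from the SWE-rebench top ten.
scores = {
    "Claude Opus 4.6": 65.3,
    "gpt-5.2-2025-12-11-medium": 64.4,
    "GLM-5": 62.8,
    "Junie": 62.8,
    "gpt-5.4-2026-03-05-medium": 62.8,
    "GLM-5.1": 62.7,
    "Gemini 3.1 Pro Preview": 62.3,
    "DeepSeek-V3.2": 60.9,
    "Claude Sonnet 4.6": 60.7,
    "Claude Sonnet 4.5": 60.0,
}

ranked = sorted(scores.values(), reverse=True)
top_tier_compression = ranked[1] - ranked[4]  # places 2 through 5: 1.6 points
first_to_tenth = ranked[0] - ranked[9]        # places 1 through 10: 5.3 points
print(f"places 2-5 span {top_tier_compression:.1f} points")
print(f"places 1-10 span {first_to_tenth:.1f} points")
```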

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #   Model                    Score   tok/s   $/1M
 1   GPT-5.5                  60.2      66    $11.25
 2   Claude Opus 4.7          57.3      71    $10.94
 3   Gemini 3.1 Pro Preview   57.2     143    $4.50
 4   GPT-5.4                  56.8      95    $5.63
 5   Kimi K2.6                53.9      41    $1.71
 6   MiMo-V2.5-Pro            53.8      57    $1.50
 7   GPT-5.3 Codex            53.6      95    $4.81
 8   Grok 4.3                 53.2      83    $1.56
 9   Claude Opus 4.6          53        53    $10.94
10   Muse Spark               52.1       0    $0.00
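Artificial Analysis does not publish its per-category weighting here, so the sketch below only shows the general shape of a composite index: a weighted mean over category scores. The category names, weights, and values are assumptions for illustration, not the index's actual methodology.

```python
# Hypothetical composite index: a weighted mean over category scores.
# Categories, weights, and scores below are placeholders, not real data.

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_weight

example_scores = {"coding": 58.0, "math": 63.0, "reasoning": 60.0}  # made up
equal_weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}      # assumed

print(f"composite: {composite(example_scores, equal_weights):.1f}")
```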

Output tokens per second (higher is faster). Only models with an Artificial Analysis intelligence score of at least 40 are included.

 #   Model                    tok/s
 1   Gemini 3 Flash Preview    205
 2   GPT-5.1 Codex             199
 3   GPT-5.4 mini              185
 4   Qwen3.6 35B A3B           182
 5   GPT-5 Codex               178
 6   Qwen3.5 122B A10B         160
 7   Hy3-preview               158
 8   GPT-5.4 nano               156
 9   GPT-5.1                   150
10   MiMo-V2-Flash             149

Blended cost per 1M tokens at a 3:1 input-to-output ratio (lower is cheaper). Only models with an Artificial Analysis intelligence score of at least 40 are included.

 #   Model                $/1M
 1   MiMo-V2-Flash        $0.15
 2   DeepSeek V4 Flash    $0.175
 3   DeepSeek V3.2        $0.337
 4   GPT-5.4 nano         $0.463
 5   MiniMax-M2.7         $0.525
 6   KAT Coder Pro V2     $0.525
 7   MiniMax-M2.5         $0.525
 8   Qwen3.6 35B A3B      $0.557
 9   GPT-5 mini           $0.688
10   MiMo-V2.5            $0.72
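The blended figure folds input and output prices into one number at the stated 3:1 ratio, i.e. a weighted average of three parts input price to one part output price. A minimal sketch of that arithmetic, with placeholder prices rather than any provider's actual rates:

```python
# Blended cost per 1M tokens at a 3:1 input-to-output ratio:
# three parts input price to one part output price, averaged.
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Placeholder prices in $/1M tokens -- not any specific provider's rates.
print(f"${blended_cost(0.50, 2.00):.3f} per 1M blended tokens")
```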