The Inference Report

May 28, 2026

The SWE-rebench rankings show substantial churn at the top, and the movement warrants scrutiny. Two new GPT-5.5 variants (xhigh and medium) now occupy positions 1 and 4 at 62.7% and 58.9%. Codex and Claude Code jumped from positions 18 and 17 to positions 2 and 3 while gaining only 2.1 and 1.2 percentage points respectively, which implies that most of their rise comes from entries above them falling or dropping out rather than from their own scores improving. Claude Opus 4.6 fell from first place (65.3%) to sixth (53.1%, listed in the table below as Claude Opus 4.7), a 12.2-point drop that demands explanation. Several formerly high-ranked models (gpt-5.2-2025-12-11-medium, Junie, DeepSeek-V3.2, Claude Sonnet 4.5, Qwen3.5-397B-A17B) disappeared from the benchmark entirely.

The Artificial Analysis rankings remain stable, with identical scores and positions, which suggests the volatility is specific to SWE-rebench's evaluation methodology or dataset. Without documentation of what changed in the benchmark itself (whether test cases were added, removed, or reweighted, or whether evaluation criteria shifted), it is unclear whether these movements reflect genuine capability differences or artifacts of the measurement apparatus. The scale of the Opus decline raises the sharpest questions: a regression that large without a corresponding change in the model itself points toward benchmark modifications rather than model degradation. Until the SWE-rebench evaluation protocol is transparently specified, these rankings indicate movement but not necessarily meaningful progress.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	gpt-5.4-2026-03-05-medium	54.9%
6	Claude Opus 4.7	53.1%
7	Cursor	53.0%
8	Gemini 3.1 Pro Preview	51.1%
9	Claude Sonnet 4.6	51.1%
10	GLM-5.1	50.7%
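
The five-run protocol described above implies that each published score is an aggregate over repeated runs. As a minimal sketch of what that aggregation might look like, the snippet below computes the mean, sample standard deviation, and standard error for one model. The function name `summarize_runs` and the five run scores are hypothetical and are not taken from SWE-rebench; the benchmark's actual aggregation method is not specified here.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) for one model's runs."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    sem = stdev / math.sqrt(len(scores))  # standard error of the mean
    return mean, stdev, sem


# Hypothetical resolved rates (%) from five runs of one model; not real benchmark data.
example_runs = [61.0, 63.0, 62.0, 64.0, 60.0]

mean, stdev, sem = summarize_runs(example_runs)
print(f"mean={mean:.1f}%  stdev={stdev:.2f}  sem={sem:.2f}")
```

If run-to-run spread were of this hypothetical size, a gap such as 53.1% versus 53.0% between adjacent rows would be well inside the noise, so small rank differences should be read with caution.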

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	GPT-5.5	60.2	81	$11.25
2	Claude Opus 4.7	57.3	55	$10.94
3	Gemini 3.1 Pro Preview	57.2	132	$4.50
4	GPT-5.4	56.8	89	$5.63
5	Qwen3.7 Max	56.6	206	$3.75
6	Gemini 3.5 Flash	55.3	228	$3.38
7	Kimi K2.6	53.9	34	$1.71
8	MiMo-V2.5-Pro	53.8	51	$0.544
9	GPT-5.3 Codex	53.6	81	$4.81
10	Grok 4.3	53.2	216	$1.56
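
The price-versus-intelligence trade-off can be made concrete with the ten rows above. The sketch below keeps only the models that no other listed model beats on both score and price, which is one common way to define a value frontier. The function name `value_frontier` is illustrative, and the result covers only these ten rows, not the full set of models behind the chart.

```python
# (model, score, blended $ per 1M tokens) copied from the Intelligence Index table above.
ROWS = [
    ("GPT-5.5", 60.2, 11.25),
    ("Claude Opus 4.7", 57.3, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GPT-5.4", 56.8, 5.63),
    ("Qwen3.7 Max", 56.6, 3.75),
    ("Gemini 3.5 Flash", 55.3, 3.38),
    ("Kimi K2.6", 53.9, 1.71),
    ("MiMo-V2.5-Pro", 53.8, 0.544),
    ("GPT-5.3 Codex", 53.6, 4.81),
    ("Grok 4.3", 53.2, 1.56),
]


def value_frontier(rows):
    """Keep models that no other model beats on both score (higher) and price (lower)."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; a model survives only if it
    # scores strictly higher than every cheaper model seen so far.
    for name, score, price in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((name, score, price))
            best_score = score
    return frontier


for name, score, price in value_frontier(ROWS):
    print(f"{name:24s} score={score:5.1f}  ${price}/1M")
```

Among these ten rows, Grok 4.3, GPT-5.3 Codex, and GPT-5.4 fall off the frontier because a cheaper listed model scores higher than each of them.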

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	228
2	Grok 4.3	216
3	Qwen3.7 Max	206
4	GPT-5.1 Codex	205
5	Gemini 3 Flash Preview	200
6	GPT-5 Codex	196
7	Grok 4.20 0309	192
8	Grok 4.20 0309 v2	189
9	Qwen3.6 35B A3B	170
10	GPT-5.4 mini	153
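
The eligibility rule for this leaderboard (a minimum intelligence score of 40, then ranking by output speed) can be sketched as a filter followed by a sort. The example below applies it only to the ten Intelligence Index rows, since those are the only models here with both a score and a speed listed. The names `speed_ranking` and `MIN_SCORE` are illustrative and not part of any published tooling.

```python
# (model, intelligence score, output tok/s) copied from the Intelligence Index table.
ROWS = [
    ("GPT-5.5", 60.2, 81),
    ("Claude Opus 4.7", 57.3, 55),
    ("Gemini 3.1 Pro Preview", 57.2, 132),
    ("GPT-5.4", 56.8, 89),
    ("Qwen3.7 Max", 56.6, 206),
    ("Gemini 3.5 Flash", 55.3, 228),
    ("Kimi K2.6", 53.9, 34),
    ("MiMo-V2.5-Pro", 53.8, 51),
    ("GPT-5.3 Codex", 53.6, 81),
    ("Grok 4.3", 53.2, 216),
]

MIN_SCORE = 40  # threshold stated in the leaderboard caption


def speed_ranking(rows, min_score=MIN_SCORE):
    """Drop models below the score threshold, then sort fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


for rank, (name, score, speed) in enumerate(speed_ranking(ROWS), start=1):
    print(f"{rank:2d}  {name:24s} {speed} tok/s")
```

The first three entries this produces (Gemini 3.5 Flash, Grok 4.3, Qwen3.7 Max) match the top three rows of the table above; the remaining leaderboard rows are models outside the Intelligence Index top ten.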

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	MiMo-V2.5	$0.175
3	DeepSeek V4 Flash	$0.175
4	Hy3-preview	$0.20
5	DeepSeek V3.2	$0.337
6	GPT-5.4 nano	$0.463
7	MiniMax-M2.7	$0.525
8	KAT Coder Pro V2	$0.525
9	MiniMax-M2.5	$0.525
10	MiMo-V2.5-Pro	$0.544
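
The blended figure in this leaderboard combines input and output prices at a 3:1 ratio. A minimal sketch of that calculation follows, assuming the ratio means three input tokens are weighted for every one output token. The function name `blended_price` and the example prices are hypothetical and do not come from any provider's price sheet.

```python
def blended_price(input_per_million, output_per_million, input_parts=3, output_parts=1):
    """Weighted average of input and output prices, in $ per 1M tokens."""
    total_parts = input_parts + output_parts
    weighted = input_parts * input_per_million + output_parts * output_per_million
    return weighted / total_parts


# Hypothetical list prices in $ per 1M tokens, chosen only to illustrate the arithmetic.
example = blended_price(input_per_million=1.00, output_per_million=4.00)
print(f"${example:.2f} per 1M blended tokens")  # (3 * 1.00 + 1 * 4.00) / 4
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well here; workloads that generate far more output than they read would see a higher effective cost.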