The Inference Report

May 30, 2026

The SWE-rebench rankings show Claude models moving into the top 10 through variant proliferation rather than uniform improvement. Claude Opus 4.8-xhigh entered at 56.4% (rank 5), Claude Opus 4.7-high at 53.1% (rank 7), and Claude Sonnet 4.6-high at 51.3% (rank 9). All three are marked as new entries, and their -xhigh and -high suffixes suggest configuration variants of existing models rather than new releases. The top tier remains stable: gpt-5.5-2026-04-23-xhigh holds 62.7%, Codex 60.4%, and Claude Code 59.6%.

Below the leaders, several models score lower on SWE-rebench than on the Artificial Analysis index. Gemini 3.1 Pro Preview scores 57.2 on Artificial Analysis but 51.1% on SWE-rebench (rank 10), a 6.1-point gap. Kimi K2.6 goes from 53.9 to 46.5% (rank 13), a 7.4-point gap, and GLM-4.7 from 42.1 to 38.2% (rank 14), a 3.9-point gap. Either these models perform materially worse on coding tasks specifically, or SWE-rebench's evaluation criteria diverge meaningfully from Artificial Analysis's methodology. The Artificial Analysis leaderboard itself shows no movement in the top tier and remains led by Claude Opus 4.8 (61.4) and GPT-5.5 (60.2), with the field compressed tightly between ranks 1 and 20. Without knowing SWE-rebench's exact task distribution, evaluation protocol, or scoring criteria (pass rate, time-to-solution, or something else), the divergence cannot be fully explained, but the pattern suggests the two benchmarks test distinct problem classes or apply different scoring thresholds.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike many other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%
6	gpt-5.4-2026-03-05-medium	54.9%
7	Claude Opus 4.7-high	53.1%
8	Cursor	53.0%
9	Claude Sonnet 4.6-high	51.3%
10	Gemini 3.1 Pro Preview	51.1%
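
The five-run protocol described above can be illustrated with a small sketch. It assumes the reported score is the mean over runs, which the source does not state, and the five values are made up for illustration; they are not SWE-rebench data. The function name `summarize` is likewise hypothetical.

```python
from statistics import mean, stdev

# Hypothetical resolved rates (%) from five independent runs of one model.
# Illustrative numbers only; not taken from SWE-rebench.
runs = [55.0, 57.0, 56.0, 58.0, 54.0]

def summarize(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    m = mean(scores)
    s = stdev(scores)              # sample std dev (n - 1 denominator)
    sem = s / len(scores) ** 0.5   # standard error of the mean
    return m, s, sem

m, s, sem = summarize(runs)
print(f"mean={m:.1f}%  stdev={s:.2f}  sem={sem:.2f}")
```

Under these assumptions, two models whose means differ by less than a couple of standard errors would be hard to separate, which is worth keeping in mind when adjacent ranks differ by a fraction of a point.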

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Opus 4.8	61.4	67	$10.94
2	GPT-5.5	60.2	69	$11.25
3	Claude Opus 4.7	57.3	53	$10.94
4	Gemini 3.1 Pro Preview	57.2	129	$4.50
5	GPT-5.4	56.8	92	$5.63
6	Qwen3.7 Max	56.6	187	$3.75
7	Gemini 3.5 Flash	55.3	209	$3.38
8	Kimi K2.6	53.9	34	$1.71
9	MiMo-V2.5-Pro	53.8	49	$0.544
10	GPT-5.3 Codex	53.6	81	$4.81
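
One way to read the price and score columns together is to compute which models are not dominated, meaning no other model is at least as smart and at least as cheap while being strictly better on one of the two. The sketch below is one possible definition of a "value frontier", not necessarily the one used for the chart; it uses only the ten rows shown above, and the function name `value_frontier` is an illustrative choice.

```python
# (model, index score, blended $ per 1M tokens), copied from the ten rows above.
rows = [
    ("Claude Opus 4.8", 61.4, 10.94),
    ("GPT-5.5", 60.2, 11.25),
    ("Claude Opus 4.7", 57.3, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GPT-5.4", 56.8, 5.63),
    ("Qwen3.7 Max", 56.6, 3.75),
    ("Gemini 3.5 Flash", 55.3, 3.38),
    ("Kimi K2.6", 53.9, 1.71),
    ("MiMo-V2.5-Pro", 53.8, 0.544),
    ("GPT-5.3 Codex", 53.6, 4.81),
]

def value_frontier(rows):
    """Return names of models not dominated on (higher score, lower price)."""
    frontier = []
    for name, score, price in rows:
        # A model is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            s >= score and p <= price and (s > score or p < price)
            for _, s, p in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier

for name in value_frontier(rows):
    print(name)
```

Restricted to these ten rows, GPT-5.5, Claude Opus 4.7, GPT-5.4, and GPT-5.3 Codex fall off the frontier because another listed model matches or beats them on both axes. The result would change if models outside the top 10 were included.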

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	209
2	Grok 4.20 0309 v2	202
3	MiniMax-M2.5	199
4	Grok 4.20 0309	197
5	Gemini 3 Flash Preview	196
6	Qwen3.7 Max	187
7	Grok 4.3	177
8	GPT-5.1 Codex	172
9	GPT-5.4 mini	167
10	Qwen3.6 35B A3B	164

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	MiMo-V2.5	$0.175
3	DeepSeek V4 Flash	$0.175
4	Hy3-preview	$0.20
5	DeepSeek V3.2	$0.337
6	GPT-5.4 nano	$0.463
7	MiniMax-M2.7	$0.525
8	KAT Coder Pro V2	$0.525
9	MiniMax-M2.5	$0.525
10	MiMo-V2.5-Pro	$0.544
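
The blended cost in this leaderboard weights input and output prices 3:1. A minimal sketch of that arithmetic follows, assuming the blend is a simple weighted average of per-1M-token list prices. The function name `blended_price` and the example prices are hypothetical and not taken from any provider's price sheet.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of per-1M-token prices for an input:output token ratio."""
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total

# Hypothetical list prices in $ per 1M tokens (assumed for illustration).
example = blended_price(2.00, 12.00)
print(f"${example:.2f}")  # prints $4.50
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can rank better here than it would for output-heavy workloads.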