The Inference Report

June 1, 2026

The SWE-rebench rankings show minimal movement at the top, with gpt-5.5-2026-04-23-xhigh holding 62.7% and Codex at 60.4%, but several models place very differently on SWE-rebench than on the Artificial Analysis Intelligence Index. Gemini 3.1 Pro Preview scores 57.2 on the Artificial Analysis index and 51.1% on SWE-rebench, ranking fourth on the former and tenth on the latter. Kimi K2.6 goes from 53.9 to 46.5% and GLM-4.7 from 42.1 to 38.2%. GLM-5.1, by contrast, is nearly level at 50.7 and 51.4, and Claude models appear in the top ten of both leaderboards.

These figures are not measured on the same scale: SWE-rebench reports a percentage on software engineering tasks, while the Artificial Analysis index is a composite across coding, math, and reasoning benchmarks. The gaps of roughly 4 to 7 points for the same models could therefore reflect harder tasks, stricter evaluation on SWE-rebench, the two evaluations measuring different aspects of capability, or genuine differences in coding ability. The data here cannot separate these explanations, so it is premature to treat either ranking as a complete picture of coding performance.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%
6	gpt-5.4-2026-03-05-medium	54.9%
7	Claude Opus 4.7-high	53.1%
8	Cursor	53.0%
9	Claude Sonnet 4.6-high	51.3%
10	Gemini 3.1 Pro Preview	51.1%
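
To illustrate why running each model five times matters, here is a minimal sketch of how repeated runs might be summarized into a mean and a standard error. The per-run scores are hypothetical, chosen only for illustration, and the function name `summarize_runs` is an assumption; SWE-rebench's actual aggregation method is not described here.

```python
import math
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) for repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)        # sample standard deviation (n - 1)
    sem = stdev / math.sqrt(len(scores))    # standard error of the mean
    return mean, stdev, sem

# Hypothetical resolved rates (percent) from five runs of one model.
# These are NOT real SWE-rebench results.
hypothetical_runs = [50.0, 52.0, 51.0, 49.0, 53.0]

mean, stdev, sem = summarize_runs(hypothetical_runs)
print(f"mean={mean:.1f}%  stdev={stdev:.2f}  sem={sem:.2f}")
```

When two models' means differ by less than a couple of standard errors, their relative order on the leaderboard may simply be run-to-run noise.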

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Opus 4.8	61.4	63	$10.94
2	GPT-5.5	60.2	66	$11.25
3	Claude Opus 4.7	57.3	60	$10.94
4	Gemini 3.1 Pro Preview	57.2	144	$4.50
5	GPT-5.4	56.8	86	$5.63
6	Qwen3.7 Max	56.6	190	$3.75
7	Gemini 3.5 Flash	55.3	227	$3.38
8	Kimi K2.6	53.9	42	$1.71
9	MiMo-V2.5-Pro	53.8	52	$0.544
10	GPT-5.3 Codex	53.6	86	$4.81
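
The price-versus-intelligence trade-off can be sketched from the ten rows above: a model sits on the value frontier if no other listed model is both cheaper (or equal in price) and higher scoring. The sketch below is an illustration restricted to these ten rows, so it is not necessarily the frontier of the full dataset, and the names `Model` and `value_frontier` are assumptions.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    score: float   # Intelligence Index score
    price: float   # blended $ per 1M tokens

# Rows copied from the Intelligence Index leaderboard (top ten only).
MODELS = [
    Model("Claude Opus 4.8", 61.4, 10.94),
    Model("GPT-5.5", 60.2, 11.25),
    Model("Claude Opus 4.7", 57.3, 10.94),
    Model("Gemini 3.1 Pro Preview", 57.2, 4.50),
    Model("GPT-5.4", 56.8, 5.63),
    Model("Qwen3.7 Max", 56.6, 3.75),
    Model("Gemini 3.5 Flash", 55.3, 3.38),
    Model("Kimi K2.6", 53.9, 1.71),
    Model("MiMo-V2.5-Pro", 53.8, 0.544),
    Model("GPT-5.3 Codex", 53.6, 4.81),
]

def value_frontier(models):
    """Return the models not dominated on (lower price, higher score), cheapest first."""
    frontier = []
    best_score = float("-inf")
    # Sweep by ascending price; at equal price, consider the higher score first.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        if m.score > best_score:   # strictly better than every cheaper model
            frontier.append(m)
            best_score = m.score
    return frontier

for m in value_frontier(MODELS):
    print(f"{m.name}: {m.score} at ${m.price}/1M")
```

Among these ten rows, the sweep keeps MiMo-V2.5-Pro, Kimi K2.6, Gemini 3.5 Flash, Qwen3.7 Max, Gemini 3.1 Pro Preview, and Claude Opus 4.8; the remaining four are each matched or beaten on price by a higher-scoring model.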

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Grok 4.20 0309	229
2	Gemini 3.5 Flash	227
3	Grok 4.20 0309 v2	219
4	MiniMax-M2.5	206
5	Gemini 3 Flash Preview	193
6	Qwen3.7 Max	190
7	GPT-5.1 Codex	186
8	GPT-5.4 mini	183
9	GPT-5 Codex	173
10	Qwen3.6 35B A3B	160

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	MiMo-V2.5	$0.175
3	DeepSeek V4 Flash	$0.175
4	Hy3-preview	$0.20
5	DeepSeek V3.2	$0.337
6	GPT-5.4 nano	$0.463
7	MiniMax-M2.7	$0.525
8	KAT Coder Pro V2	$0.525
9	MiniMax-M2.5	$0.525
10	MiMo-V2.5-Pro	$0.544
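
For readers unfamiliar with blended pricing, the sketch below shows one plausible reading of the 3:1 input/output ratio: a weighted average with three parts input price to one part output price. The function name `blended_price` and the example prices are hypothetical, and the exact formula used for this leaderboard is an assumption.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens, assuming a 3:1 input-to-output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight

# Hypothetical list prices in $ per 1M tokens, not taken from any real model.
example = blended_price(input_price=1.00, output_price=4.00)
print(f"${example:.2f} per 1M blended tokens")
```

Because input tokens carry three times the weight, a model with expensive output but cheap input can still post a low blended figure.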