The Inference Report

June 24, 2026

The SWE-rebench rankings are stable at the top: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie 61.6%, and Codex 60.4%, and none of the first nine positions moved. Below that band, modest reshuffling reflects incremental gains among mid-tier models. Claude Sonnet 4.6 now holds position 10 with its score unchanged at 51.3%, while GLM-5.1 advanced from rank 23 to 12 by improving from 40.2% to 50.7%. A 10.5-point jump of that size points to a methodology change, a model update, or an evaluation refinement, and is worth scrutinizing. Gemini 3.5 Flash dropped from 7 to 13 despite holding 49.5%, which suggests the ranking absorbed new entrants or was recalibrated.

The Artificial Analysis benchmark, by contrast, saw more movement: Grok Build 0.1 0616 entered at rank 28 and Ring-1T appeared at 159 with no prior placement, indicating either fresh model releases or expanded coverage. At the lower end, scores compress into the single digits, where models cluster densely and small score shifts produce large rank swings, so those positions discriminate poorly between models. Taken together, the movement suggests SWE-rebench is settling into a stable ordering of proven performers while Artificial Analysis continues to absorb new competitors. Neither benchmark's methodology is transparent enough, however, to confirm whether score changes reflect genuine capability shifts or evaluation adjustments.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
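To illustrate why repeated runs matter, here is a minimal Python sketch of how five runs could be summarized as a mean with a standard error. The aggregation SWE-rebench actually uses is not stated here, so the plain mean, the standard-error formula, the function name `summarize_runs`, and the five example scores are all assumptions made for illustration.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated run scores.

    Assumption: runs are aggregated with a plain arithmetic mean. The
    benchmark's real aggregation method is not described in this report.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(scores)
    sem = stdev / len(scores) ** 0.5
    return mean, sem


if __name__ == "__main__":
    # Hypothetical resolved-rate percentages from five runs of one model.
    hypothetical_runs = [50.0, 52.0, 51.0, 49.0, 53.0]
    mean, sem = summarize_runs(hypothetical_runs)
    print(f"mean={mean:.1f}% sem={sem:.2f}")
```

Under these assumptions, two models whose means differ by less than a standard error or two are hard to separate, which is one reason closely spaced ranks deserve caution.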

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	72	$10.00
3	GPT-5.5	54.8	64	$11.25
4	Claude Opus 4.7	53.5	62	$10.00
5	GPT-5.4	51.4	161	$5.63
6	GLM-5.2	51.1	118	$2.15
7	Gemini 3.5 Flash	50.2	237	$3.38
8	Claude Sonnet 4.6	47.2	69	$6.00
9	Gemini 3.1 Pro Preview	46.5	143	$4.50
10	Qwen3.7 Max	46.0	203	$3.75
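The "lower price, higher score is better" reading of the Value Frontier can be made concrete with a small Pareto-frontier sketch over the ten rows above. This is an illustrative calculation, not the method behind the published chart: the `Model` type and `pareto_frontier` function are hypothetical names, and the result covers only these ten rows, so cheaper models outside the top ten would change the frontier.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # Intelligence Index score
    price: float  # blended $ per 1M tokens


# Rows copied from the Intelligence Index table above (top ten only).
MODELS = [
    Model("Claude Fable 5", 59.9, 20.00),
    Model("Claude Opus 4.8", 55.7, 10.00),
    Model("GPT-5.5", 54.8, 11.25),
    Model("Claude Opus 4.7", 53.5, 10.00),
    Model("GPT-5.4", 51.4, 5.63),
    Model("GLM-5.2", 51.1, 2.15),
    Model("Gemini 3.5 Flash", 50.2, 3.38),
    Model("Claude Sonnet 4.6", 47.2, 6.00),
    Model("Gemini 3.1 Pro Preview", 46.5, 4.50),
    Model("Qwen3.7 Max", 46.0, 3.75),
]


def pareto_frontier(models):
    """Return models that no cheaper-or-equal model matches or beats on score.

    Sweep from cheapest to most expensive (ties broken by higher score
    first) and keep a model only if it raises the best score seen so far.
    """
    frontier = []
    best_score = float("-inf")
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: score {m.score}, ${m.price:.2f}/1M")
```

Within these ten rows, the sweep keeps GLM-5.2, GPT-5.4, Claude Opus 4.8, and Claude Fable 5; every other listed model is matched or beaten on score by something that costs the same or less.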

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	237
2	Qwen3.7 Max	203
3	GPT-5.4 mini	194
4	GPT-5.4	161
5	GPT-5.2 Codex	155
6	Gemini 3.1 Pro Preview	143
7	DeepSeek V4 Flash	121
8	GLM-5.2	118
9	DeepSeek V4 Pro	103
10	GLM-5.1	90

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
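For readers unfamiliar with blended pricing, the sketch below shows one way a 3:1 input/output blend can be computed. It assumes the ratio means three parts input tokens to one part output tokens, weighted by their per-token prices; the function name `blended_price` and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" means three input tokens for every
    output token, so the input price carries three quarters of the weight.
    """
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total_weight


if __name__ == "__main__":
    # Hypothetical prices in $ per 1M tokens: $5 input, $25 output.
    print(f"${blended_price(5.0, 25.0):.2f} per 1M tokens")
```

Because input tokens dominate the blend under this assumption, a model with expensive output but cheap input can still rank well on blended cost.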