The Inference Report

June 20, 2026

The SWE-rebench top tier was static: gpt-5.5-2026-04-23-xhigh held first place at 62.7%, and the next five positions were unchanged. Artificial Analysis showed modest reordering in its middle and lower tiers. The two benchmarks use different methodologies and scales, so scores should not be compared directly between them.

On SWE-rebench, two models changed rank: Gemini 3.1 Pro Preview slipped from #9 to #11, and GLM-5.1 jumped from #23 to #12, gaining 10.5 percentage points (40.2% to 50.7%). Claude Sonnet 4.6 held #10 at 51.3%; its 47.2 on Artificial Analysis is a score on a different scale and does not by itself indicate score drift or evaluation variance. GLM-4.7 is reported with the same 4.4-point gain (33.8 to 38.2) on both SWE-rebench and Artificial Analysis, which points to a consistent improvement if both figures hold.

On Artificial Analysis, minor reordering occurred around rank 190, where Magistral Medium 1 and Mistral Medium 3 swapped positions at 12.5 points, and at ranks 360 to 362, where three models at 2.7 points reordered.

The lack of substantial movement at the top of either benchmark suggests stable performance hierarchies. The GLM gains warrant attention: they may reflect genuine capability improvements, or they may reflect differences in evaluation sensitivity between the benchmarks.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
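
As a rough illustration of why repeated runs matter, the sketch below summarizes five runs of one model as a mean and a standard error. The per-run values are hypothetical, and the source does not say how SWE-rebench aggregates its five runs, so the use of a simple mean is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated runs of one model."""
    n = len(scores)
    avg = mean(scores)
    # stdev() is the sample standard deviation (n - 1 denominator);
    # dividing by sqrt(n) gives the standard error of the mean.
    sem = stdev(scores) / sqrt(n)
    return avg, sem


# Hypothetical per-run resolve rates in percent; NOT real SWE-rebench data.
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"mean = {avg:.1f}%, standard error = {sem:.2f}")
```

Two models whose means differ by less than a couple of standard errors are hard to separate on a single day's ranking.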

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	64	$10.00
3	GPT-5.5	54.8	61	$11.25
4	Claude Opus 4.7	53.5	57	$10.00
5	GPT-5.4	51.4	142	$5.63
6	GLM-5.2	51.1	72	$2.15
7	Gemini 3.5 Flash	50.2	216	$3.38
8	Claude Sonnet 4.6	47.2	68	$6.00
9	Gemini 3.1 Pro Preview	46.5	140	$4.50
10	Qwen3.7 Max	46.0	125	$3.75
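
The rows above carry both a score and a price, so the "lower price, higher score is better" idea can be sketched as a Pareto frontier: a model stays on the frontier if no other model is at least as cheap and scores higher. This is a minimal sketch over the ten rows listed here only, and the function name `pareto_frontier` is illustrative. A frontier computed over the full index would likely include cheaper models that fall outside the top ten.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower price, higher score)."""
    # Sort by price ascending; break ties by score descending so the
    # strongest model at each price point is considered first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier, best_score = [], float("-inf")
    for name, score, price in ordered:
        # Every model seen so far is at least as cheap, so this one
        # survives only if it beats the best score seen so far.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# (name, score, $/1M) copied from the top-ten table above.
rows = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]
frontier = pareto_frontier(rows)
print(frontier)
```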

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	216
2	GPT-5.4 mini	174
3	GPT-5.4	142
4	Gemini 3.1 Pro Preview	140
5	GPT-5.2 Codex	140
6	Qwen3.7 Max	125
7	DeepSeek V4 Flash	110
8	GLM-5.1	93
9	GPT-5.3 Codex	86
10	DeepSeek V4 Pro	86
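
The rule stated for this leaderboard (apply a score floor of 40, then sort by output speed) can be sketched as below. Three entries reuse scores and speeds from the Intelligence Index table. The fourth is a hypothetical model added only to show the floor excluding a fast but low-scoring entry.

```python
def rank_by_speed(models, min_score=40.0):
    """Drop models below the score floor, then sort fastest first."""
    eligible = [m for m in models if m["score"] >= min_score]
    return sorted(eligible, key=lambda m: m["tok_s"], reverse=True)


models = [
    {"name": "GPT-5.4", "score": 51.4, "tok_s": 142},
    {"name": "GLM-5.2", "score": 51.1, "tok_s": 72},
    {"name": "Gemini 3.5 Flash", "score": 50.2, "tok_s": 216},
    # Hypothetical entry, not in the source data: fast but below the floor.
    {"name": "example-small-model", "score": 35.0, "tok_s": 300},
]
ranked = rank_by_speed(models)
for position, m in enumerate(ranked, start=1):
    print(position, m["name"], m["tok_s"])
```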

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
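
The blended figure can be reproduced with a weighted average. The sketch below assumes "3:1 input/output" means three parts input price to one part output price; the source does not spell out the formula, and the example prices are hypothetical.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of input and output prices per 1M tokens."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical list prices in $ per 1M tokens, for illustration only.
example = blended_price(input_price=3.00, output_price=15.00)
print(f"${example:.2f} per 1M blended tokens")
```

Because input tokens carry three times the weight, a model with expensive output but cheap input can still rank well on this leaderboard.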