The Inference Report

June 22, 2026

On SWE-rebench, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie 61.6%, and Codex 60.4%. The middle ranks are less settled. Claude Sonnet 4.6 slipped from #8 to #10 even though its score rose 4.1 percentage points (47.2 to 51.3), and Gemini 3.1 Pro Preview slipped from #9 to #11 despite a 4.6-point increase (46.5 to 51.1). A higher score paired with a lower rank means the surrounding models gained more or new entries arrived, which points to test set changes or evaluation methodology shifts rather than architectural improvements alone.

GLM-5.1's jump from #23 to #12 is the most dramatic repositioning, a 10.5-point rise from 40.2 to 50.7, and it warrants scrutiny: either the model was substantially retrained or the benchmark's task distribution shifted in its favor. Conversely, Gemini 3.5 Flash dropped from #7 to #13 on only a marginal score decline (50.2 to 49.5), a six-place fall for 0.7 points that likely reflects tight clustering in this performance band. GLM-4.7 showed the largest absolute gain in the lower ranks, rising 4.4 points from 33.8 to 38.2, though it remains at #17 on SWE-rebench.

The Artificial Analysis index, with its broader model coverage, presents a different ordering: Claude Fable 5 leads at 59.9, ahead of GPT-5.5 at 54.8, inverting the SWE-rebench order. That index is a composite across coding, math, and reasoning, so the two leaderboards likely weight different competencies or test different problem classes. Without disclosure of the evaluation methodology, task composition, test set overlap, execution environment, or whether SWE-rebench was revised between evaluations, it remains uncertain whether these shifts reflect genuine capability differences or benchmark drift. The consistency of top-tier models across both benchmarks gives some confidence in their relative ordering, but the volatility in the middle ranks indicates either genuine differentiation in narrow domains or measurement sensitivity that limits strong inference.
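For readers who want to check the point changes quoted above, here is a minimal Python sketch that recomputes them. The before and after figures are copied from the commentary; the `quoted` dictionary and the `score_deltas` helper are illustrative names, not part of any benchmark tooling.

```python
# Before/after scores exactly as quoted in the commentary above.
quoted = {
    "Claude Sonnet 4.6": (47.2, 51.3),
    "Gemini 3.1 Pro Preview": (46.5, 51.1),
    "GLM-5.1": (40.2, 50.7),
    "Gemini 3.5 Flash": (50.2, 49.5),
    "GLM-4.7": (33.8, 38.2),
}

def score_deltas(pairs):
    """Return {model: after - before}, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in pairs.items()}

deltas = score_deltas(quoted)

# Print the largest movers first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {delta:+.1f}")
```

Sorting by the size of the change makes it easy to see that a 0.7-point slip and a 10.5-point jump are very different events, even when both move a model several places.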

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
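
The description above says each model is run five times to account for stochastic variance. How those runs are aggregated is not stated here, so the following Python sketch is only one plausible reading: it assumes the reported score is the mean of the runs and shows the spread alongside it. The run values are hypothetical, and `summarize_runs` is an illustrative helper.

```python
from math import sqrt
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) for repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    return m, s, s / sqrt(len(scores))

# Hypothetical resolved-rate percentages from five runs of one model.
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
m, s, se = summarize_runs(runs)
print(f"mean={m:.1f}%  sd={s:.2f}  se={se:.2f}")
```

When the standard error is close to the gap between adjacent rows, as it could be for models separated by a few tenths of a point, the rank order between them carries little information.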

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	69	$10.00
3	GPT-5.5	54.8	63	$11.25
4	Claude Opus 4.7	53.5	53	$10.00
5	GPT-5.4	51.4	165	$5.63
6	GLM-5.2	51.1	94	$2.15
7	Gemini 3.5 Flash	50.2	244	$3.38
8	Claude Sonnet 4.6	47.2	69	$6.00
9	Gemini 3.1 Pro Preview	46.5	138	$4.50
10	Qwen3.7 Max	46.0	200	$3.75
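
The price-versus-intelligence view can be made concrete with a Pareto frontier: the set of models for which no other model is both cheaper (or equally priced) and higher scoring. The sketch below applies that idea to the ten rows of this table only, so it says nothing about models outside the top ten; `pareto_frontier` is an illustrative helper, and a model that merely ties another on both price and score is dropped.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower price, higher score).

    `models` is a list of (name, score, price) tuples.
    """
    # Cheapest first; at equal price, put the higher score first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier, best_score = [], float("-inf")
    for name, score, price in ordered:
        # Keep a model only if it beats every cheaper or equally priced model.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier

# (name, index score, blended $/1M) copied from the table above.
rows = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]

frontier = pareto_frontier(rows)
print(frontier)
# ['GLM-5.2', 'GPT-5.4', 'Claude Opus 4.8', 'Claude Fable 5']
```

Sorting by price and sweeping once keeps the check simple: each model only has to beat the best score seen so far among cheaper options.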

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	244
2	Qwen3.7 Max	200
3	GPT-5.4 mini	193
4	GPT-5.4	165
5	GPT-5.2 Codex	145
6	Gemini 3.1 Pro Preview	138
7	DeepSeek V4 Flash	110
8	GPT-5.3 Codex	107
9	GLM-5.1	106
10	DeepSeek V4 Pro	103

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
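
The blended figure in this table uses a 3:1 input/output mix. The exact formula is not given here, so the Python sketch below assumes the usual reading: a weighted average with three parts input price to one part output price. The prices in the example are hypothetical and are not taken from the tables.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total

# Hypothetical list prices in dollars per 1M tokens.
example = blended_price(1.00, 4.00)
print(f"${example:.2f} per 1M blended tokens")  # $1.75 per 1M blended tokens
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still land well down this list; workloads that generate far more output than the 3:1 mix assumes would see a higher effective price.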