The Inference Report

June 25, 2026

On SWE-rebench, the top tier is unchanged: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie stays at 61.6%, and the Claude and GPT variants occupy positions three through seven without movement. The meaningful shifts are in the mid-tier. GLM-5.1 climbed from position 23 at 40.2% to position 12 at 50.7%, a 10.5-point gain and the largest jump in the dataset, while GLM-4.7 rose from position 52 at 33.8% to position 17 at 38.2%. Kimi K2.6 advanced from position 16 to position 15. Claude Sonnet 4.6 slipped from position 8 to position 10 with an unchanged score of 51.3%, which suggests its rank moved because of changes around it, not because of its own score.

The Artificial Analysis index shows far less volatility. Claude Fable 5 leads at 59.9, the top 20 models cluster between 42.8 and 59.9 with mostly preserved rankings, and a new entry, Nex-N2-Pro at 41.0, appears at position 20. Further down, KAT-Coder-Pro V1 jumped 31 positions, from 83 to 52, on a 6.3-point improvement from 28.3 to 34.6.

The discrepancy between the two benchmarks is notable. The models that rank highest on SWE-rebench (gpt-5.5-xhigh, Junie) do not dominate Artificial Analysis, where Claude Fable 5 leads despite not appearing in the SWE-rebench top ten. Either the two metrics capture different problem-solving dimensions or their evaluation methodologies reward different things. Neither benchmark shows the compression or volatility typical of immature measurement systems, which suggests both have settled into consistent model orderings. Without methodological detail, though, it is not possible to assess whether either captures real capability differences or mainly reflects training data overlap.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
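The description above says each model is run five times to account for stochastic variance, but not how the runs are combined. A minimal sketch, assuming the reported score is the mean of five per-run pass rates and that the spread is summarized as a standard error; the function name and the run values are hypothetical, not SWE-rebench data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(pass_rates):
    """Return (mean, standard error) for per-run pass rates given in percent."""
    n = len(pass_rates)
    avg = mean(pass_rates)
    # Sample standard deviation divided by sqrt(n); zero when only one run exists.
    sem = stdev(pass_rates) / sqrt(n) if n > 1 else 0.0
    return avg, sem


# Hypothetical pass rates from five runs of one model (illustrative only).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this assumption, two models whose means differ by less than their combined standard errors are hard to separate, which is worth keeping in mind when reading gaps of a point or less in the table.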

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	n/a	$20.00
2	Claude Opus 4.8	55.7	66	$10.00
3	GPT-5.5	54.8	66	$11.25
4	Claude Opus 4.7	53.5	58	$10.00
5	GPT-5.4	51.4	159	$5.63
6	GLM-5.2	51.1	122	$2.15
7	Gemini 3.5 Flash	50.2	221	$3.38
8	Claude Sonnet 4.6	47.2	68	$6.00
9	Gemini 3.1 Pro Preview	46.5	145	$4.50
10	Qwen3.7 Max	46.0	204	$3.75
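The price versus intelligence comparison ("lower price, higher score is better") can be made concrete for the ten rows above. This is a rough sketch, not the method behind the Value Frontier chart: it keeps a model only if no other listed model is at least as good on both score and price and strictly better on one. The function name is hypothetical, and the result covers only these ten rows.

```python
# Rows copied from the Intelligence Index table: (model, score, $ per 1M tokens).
ROWS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def value_frontier(rows):
    """Return names of rows that no other row beats on both score and price."""
    frontier = []
    for name, score, price in rows:
        dominated = any(
            s >= score and p <= price and (s > score or p < price)
            for _, s, p in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(value_frontier(ROWS))
# ['Claude Fable 5', 'Claude Opus 4.8', 'GPT-5.4', 'GLM-5.2']
```

Within this top ten, every model outside those four is matched or beaten on both axes by one of them. GLM-5.2 alone dominates the four lowest-scoring entries.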

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	221
2	Qwen3.7 Max	204
3	GPT-5.4 mini	185
4	GPT-5.4	159
5	Gemini 3.1 Pro Preview	145
6	GPT-5.2 Codex	139
7	DeepSeek V4 Flash	124
8	GLM-5.2	122
9	Nex-N2-Pro	108
10	GPT-5.2	88

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
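The "blended cost (3:1 input/output)" used in this leaderboard can be read as a weighted average of input and output token prices. A minimal sketch under that assumption, weighting three parts input to one part output; the function name and the example prices are hypothetical and are not taken from the tables.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical list prices in dollars per 1M tokens (illustrative only).
example = blended_price(2.00, 8.00)
print(f"${example:.2f}")  # (3 * 2.00 + 1 * 8.00) / 4 = 3.50
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still land low on this list. Output-heavy workloads would see a higher effective price than the blended figure suggests.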