The Inference Report

June 18, 2026

On the SWE-rebench coding benchmark, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds first at 62.7%, Junie is second at 61.6%, and Codex is third at 60.4%. The middle of the table shows more movement. Claude Sonnet 4.6 rose from 47.2% to 51.3% (position 8 to 10), GLM-5.1 jumped from 40.2% to 50.7% (rank 23 to 12), and Kimi K2.6 advanced from 42.8% to 46.5% (16 to 15), while Gemini 3.5 Flash slipped from 50.2% to 49.5% yet held rank 13.

The Artificial Analysis leaderboard shows more volatility across its 394 entries. Claude Fable 5 leads at 59.9, but gains in the broader distribution are marginal and concentrated among models scoring between 40 and 50 points; GLM-4.7 made the largest absolute climb, from 33.8 to 38.2 points.

The two benchmarks diverge on identical or near-identical models: Claude Sonnet 4.6 scores 51.3 on SWE-rebench but 47.2 on Artificial Analysis, and GLM-5.1 scores 50.7 versus 40.2. This suggests that they cover different problem distributions or use different evaluation methodologies, and it raises the question of whether an improvement on one reflects a genuine capability gain or benchmark-specific overfitting. The SWE-rebench movements are modest in absolute terms, with most shifts under 5 percentage points, which is consistent with run-to-run variance rather than architectural breakthroughs.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike many other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
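
The five-run protocol mentioned in the SWE-rebench description can be illustrated with a small sketch. It assumes the per-run scores are summarized as a mean with a standard error; the source does not say how SWE-rebench actually aggregates its runs. The function name `summarize_runs` and the run scores are hypothetical and are not taken from the leaderboard.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Return (mean, standard error) for a list of per-run scores in percent."""
    n = len(run_scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(run_scores)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = stdev(run_scores) / sqrt(n)
    return avg, sem


# Hypothetical scores from five runs of one model (illustrative only,
# NOT real SWE-rebench data).
example_runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(example_runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this assumption, a day-over-day change smaller than a couple of standard errors would be hard to distinguish from run-to-run noise.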

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	67	$10.00
3	GPT-5.5	54.8	61	$11.25
4	Claude Opus 4.7	53.5	54	$10.00
5	GPT-5.4	51.4	157	$5.63
6	GLM-5.2	50.7	100	$2.15
7	Gemini 3.5 Flash	50.2	223	$3.38
8	Claude Sonnet 4.6	47.2	66	$6.00
9	Gemini 3.1 Pro Preview	46.5	127	$4.50
10	Qwen3.7 Max	46.0	96	$3.75
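
The "lower price, higher score is better" idea behind the Value Frontier can be sketched as a Pareto filter over the ten rows in the table above. This is a minimal illustration, not necessarily how the chart itself is produced: `pareto_frontier` is a hypothetical helper, and the result covers only these ten rows rather than the full leaderboard.

```python
# (model, intelligence score, blended $/1M) copied from the table above.
rows = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 50.7, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def pareto_frontier(rows):
    """Keep models that no other model beats on both score and price."""
    frontier = []
    for name, score, price in rows:
        # A model is dominated if another one is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            (s >= score and p <= price) and (s > score or p < price)
            for _, s, p in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(rows)
print(frontier)
# ['Claude Fable 5', 'Claude Opus 4.8', 'GPT-5.4', 'GLM-5.2']
```

Among these ten rows, every other model is matched or beaten on score by a cheaper or equally priced alternative.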

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	223
2	GPT-5.4 mini	177
3	GPT-5.4	157
4	Gemini 3.1 Pro Preview	127
5	GPT-5.2 Codex	125
6	DeepSeek V4 Flash	105
7	GLM-5.2	100
8	Qwen3.7 Max	96
9	GPT-5.2	79
10	GPT-5.3 Codex	77

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
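
The "3:1 input/output" blend in the price column can be read as a weighted average with three parts input price to one part output price. The sketch below assumes that reading; the source does not spell out the formula, and the function name `blended_price` and the example prices are hypothetical.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output ratio."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical list prices in $ per 1M tokens (illustrative only).
example = blended_price(3.00, 15.00)
print(f"${example:.2f}")  # $6.00
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still land low on this leaderboard.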