The Inference Report

June 19, 2026

The SWE-rebench and Artificial Analysis rankings are stable at the top, with meaningful movement in the middle tier. On SWE-rebench, the top six positions are unchanged: gpt-5.5-2026-04-23-xhigh leads at 62.7%, followed by Junie at 61.6%, Codex at 60.4%, Claude Code at 59.6%, gpt-5.5-2026-04-23-medium at 58.9%, and Claude Opus 4.8-xhigh at 56.5%. The notable shifts occur below that group. Claude Sonnet 4.6 held position 10 while its score rose from 47.2% to 51.3%, a 4.1-point gain; Gemini 3.1 Pro Preview gained 4.6 points, from 46.5% to 51.1%, yet slipped from position 9 to position 11; GLM-5.1 jumped from position 23 at 40.2% to position 12 at 50.7%, a 10.5-point improvement; and GLM-4.7 advanced from position 51 at 33.8% to position 17 at 38.2%, a 4.4-point gain. Gemini 3.5 Flash, conversely, fell from position 7 at 50.2% to position 13 at 49.5%, a 0.7-point decline. These movements could reflect either run-to-run benchmark variance or genuine performance changes; GLM-5.1's rise in particular warrants scrutiny of whether the test conditions or the model itself changed materially. The Artificial Analysis rankings remain consistent across the top 100 positions, with only minor reordering among tied scores in the 6-to-7-point range, which suggests a more stable evaluation methodology or a less volatile test set.
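For readers who want to check the arithmetic, here is a minimal sketch that recomputes the score and rank changes from the figures quoted in the summary. The tuple layout and the `describe` helper are illustrative choices, not part of either benchmark's tooling.

```python
# Figures quoted in the summary:
# (model, previous rank, previous score %, current rank, current score %)
moves = [
    ("Claude Sonnet 4.6", 10, 47.2, 10, 51.3),
    ("Gemini 3.1 Pro Preview", 9, 46.5, 11, 51.1),
    ("GLM-5.1", 23, 40.2, 12, 50.7),
    ("GLM-4.7", 51, 33.8, 17, 38.2),
    ("Gemini 3.5 Flash", 7, 50.2, 13, 49.5),
]


def describe(prev_rank, prev_score, cur_rank, cur_score):
    """Return (score change in points, rank change); a positive rank change means moving up."""
    delta = round(cur_score - prev_score, 1)  # round away floating-point noise
    rank_change = prev_rank - cur_rank
    return delta, rank_change


for name, prev_rank, prev_score, cur_rank, cur_score in moves:
    delta, rank_change = describe(prev_rank, prev_score, cur_rank, cur_score)
    print(f"{name}: {delta:+.1f} points, rank change {rank_change:+d}")
```

Keeping score change and rank change separate makes it easy to spot cases where a model gains points but still loses places because its neighbors gained more.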

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
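The five-run protocol mentioned above can be illustrated with a short sketch. How SWE-rebench actually aggregates its runs is not stated here, so this assumes a plain mean with a sample standard deviation and standard error; the per-run values are made-up placeholders, not benchmark results.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, sample standard deviation, standard error) for repeated runs."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    sd = stdev(resolved_rates) if n > 1 else 0.0  # stdev needs at least two values
    sem = sd / sqrt(n)
    return avg, sd, sem


# Placeholder resolved rates (%) for five hypothetical runs of one model.
placeholder_runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sd, sem = summarize_runs(placeholder_runs)
print(f"mean={avg:.1f}%  sd={sd:.2f}  sem={sem:.2f}")
```

Under these assumptions, two models whose means differ by less than a couple of standard errors are hard to separate, which is one reason small rank changes between adjacent entries may not be meaningful.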

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	66	$10.00
3	GPT-5.5	54.8	68	$11.25
4	Claude Opus 4.7	53.5	56	$10.00
5	GPT-5.4	51.4	157	$5.63
6	GLM-5.2	51.1	98	$2.15
7	Gemini 3.5 Flash	50.2	219	$3.38
8	Claude Sonnet 4.6	47.2	68	$6.00
9	Gemini 3.1 Pro Preview	46.5	136	$4.50
10	Qwen3.7 Max	46	98	$3.75
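The "lower price, higher score is better" idea behind the value frontier can be sketched as a Pareto filter over the ten rows listed above. This is an illustration only: the charted frontier presumably draws on more models than this top-ten excerpt, and the `value_frontier` function is a hypothetical helper, not Artificial Analysis code.

```python
# (model, intelligence score, blended $ per 1M tokens), copied from the table above
rows = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def value_frontier(models):
    """Keep models that no cheaper-or-equal model matches or beats on score."""
    # Sort by price ascending; at equal price, put the higher score first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier = []
    best_score = float("-inf")
    for name, score, price in ordered:
        # A model stays only if it beats every cheaper (or equally priced) model.
        if score > best_score:
            frontier.append((name, score, price))
            best_score = score
    return frontier


frontier = value_frontier(rows)
for name, score, price in frontier:
    print(f"{name}: score {score}, ${price:.2f} per 1M tokens")
```

On these ten rows the filter keeps GLM-5.2, GPT-5.4, Claude Opus 4.8, and Claude Fable 5; every other listed model is matched or beaten on score by something that costs the same or less.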

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	219
2	GPT-5.4 mini	174
3	GPT-5.4	157
4	GPT-5.2 Codex	137
5	Gemini 3.1 Pro Preview	136
6	DeepSeek V4 Flash	114
7	GLM-5.2	98
8	Qwen3.7 Max	98
9	GPT-5.3 Codex	88
10	GPT-5.2	83

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
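As a rough guide to how a blended figure like the ones above is derived, the sketch below assumes that "3:1 input/output" means a weighted average with three parts input price to one part output price. The function name and the example prices are hypothetical and do not correspond to any listed model.

```python
def blended_price(input_per_m, output_per_m, input_weight=3.0, output_weight=1.0):
    """Weighted average of input and output prices, in $ per 1M tokens."""
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
example = blended_price(1.00, 4.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still rank well on blended price; workloads that generate long outputs would see a higher effective cost.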