The Inference Report

June 17, 2026

On SWE-rebench, the top tier is unchanged: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, and the next four entries sit in a narrow band behind it. The meaningful shifts are in the mid-tier. GLM-5.1 entered at 50.7%, good for position 12, even though it ranks only 21st on Artificial Analysis (40.2), which suggests the benchmark surfaces a different capability profile than general evaluation suites do. Kimi K2.6 gained 3.7 points to reach 46.5%. Gemini 3.5 Flash slipped 0.7 points to 49.5% despite previously ranking sixth on Artificial Analysis with 50.2, a hint that SWE-rebench's code-specific tasks may penalize certain architectural choices.

On Artificial Analysis, GLM-5.2 is a new entrant that did not appear in the prior ranking, debuting at position 6 with 50.7. The rest of the list shows positional shuffling without score changes, so the movement appears to come from model releases rather than re-evaluation of existing systems. Because most Artificial Analysis entries keep identical scores across the two snapshots, that suite behaves more like a stable archive than a live leaderboard, and SWE-rebench offers the cleaner signal for coding capability. Neither benchmark shows the kind of discontinuous jump that would signal a methodological shift or a contamination event.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike many other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
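
The five-runs-per-model design mentioned above means each headline score can be read as an average with some run-to-run spread. The sketch below shows one common way to summarize such runs, as a mean with a standard error. The per-run numbers are hypothetical placeholders, not SWE-rebench data, and the benchmark's own aggregation may differ.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical scores from five runs of one model (illustrative only).
runs = [50.0, 52.0, 51.0, 53.0, 49.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.2f}")
```

With a standard error of this size, a gap of a few tenths of a point between two neighboring entries would sit inside the noise.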

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	68	$10.00
3	GPT-5.5	54.8	67	$11.25
4	Claude Opus 4.7	53.5	54	$10.00
5	GPT-5.4	51.4	166	$5.63
6	GLM-5.2	50.7	114	$2.15
7	Gemini 3.5 Flash	50.2	203	$3.38
8	Claude Sonnet 4.6	47.2	63	$6.00
9	Gemini 3.1 Pro Preview	46.5	127	$4.50
10	Qwen3.7 Max	46.0	106	$3.75
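
The "lower price, higher score" reading of the Price vs Intelligence chart amounts to a Pareto frontier: the models that no other model beats on both axes at once. Below is a minimal sketch using only the ten rows of the table above. The function name is illustrative, and the full chart presumably covers more models than this top ten.

```python
# Rows copied from the Intelligence Index table: (model, score, $/1M).
ROWS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 50.7, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def pareto_frontier(rows):
    """Return models that are not dominated on (lower price, higher score)."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to priciest; on a price tie, the higher score goes first.
    for name, score, price in sorted(rows, key=lambda r: (r[2], -r[1])):
        # A model joins the frontier only if it beats every cheaper model's score.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


frontier = pareto_frontier(ROWS)
print(frontier)
```

On these ten rows the frontier is GLM-5.2, GPT-5.4, Claude Opus 4.8, and Claude Fable 5. Every other entry is matched or beaten on score by something at the same or a lower price.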

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	203
2	GPT-5.4 mini	180
3	GPT-5.4	166
4	Gemini 3.1 Pro Preview	127
5	GPT-5.2 Codex	125
6	GLM-5.2	114
7	Qwen3.7 Max	106
8	DeepSeek V4 Flash	100
9	GPT-5.3 Codex	89
10	GPT-5.2	78

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
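
The blended figure in this table combines input and output prices at a 3:1 ratio. A minimal sketch of that arithmetic follows, assuming "3:1" means three input tokens for every output token and a simple weighted average. The example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_per_m, output_per_m, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens at the given input:output ratio."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical model priced at $5.00 per 1M input and $25.00 per 1M output tokens.
example = blended_price(5.00, 25.00)
print(f"${example:.2f}")  # (3 * 5 + 1 * 25) / 4 = $10.00
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still land low on this list.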