The Inference Report

June 16, 2026

On SWE-rebench, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds first place at 62.7% and the next five positions are unchanged. Meaningful movement appears below that line. Gemini 3.1 Pro Preview dropped from 57.2 to 51.1 on Artificial Analysis (down six positions), Gemini 3.5 Flash fell from 55.3% to 49.5% on SWE-rebench and from 55.3 to 50.2 on Artificial Analysis, and Kimi K2.6 declined from 53.9 to 46.5 on Artificial Analysis while holding steady on SWE-rebench. GLM-4.7 improved notably, from 42.1 to 50.7 on Artificial Analysis, moving into the top 20, though it remains at 38.2% on SWE-rebench.

The Artificial Analysis leaderboard shows broader volatility: Claude Fable 5 dropped from 64.9 to 59.9, Claude Opus 4.8 fell from 61.4 to 55.7, and GPT-5.5 declined from 60.2 to 54.8, suggesting either a recalibration of the benchmark methodology or a systematic change in evaluation conditions. Lower-ranked models show the largest point losses on both benchmarks, with many models ranked between 100 and 200 losing 5 to 8 points. That raises the question of whether the losses reflect actual model degradation, benchmark recalibration, or environmental factors such as inference conditions that affect consistency.

The SWE-rebench scores remain tighter and more stable than those on Artificial Analysis, which could indicate either greater robustness in that benchmark's methodology or a narrower evaluation scope that leaves less room for variance. Without clarity on whether the two benchmarks measure identical task sets or use different evaluation protocols, their divergence makes it difficult to tell genuine capability shifts from measurement artifacts.

Cole Brennan
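To make the recalibration question concrete, the short sketch below computes the absolute point change and the relative change for the three Artificial Analysis drops quoted in the summary. The function names are illustrative, not part of either benchmark's tooling. Similar point drops across the top models would be consistent with a uniform rescaling, though they do not prove one.

```python
# Index scores quoted in the summary above: (previous, current).
snapshots = {
    "Claude Fable 5": (64.9, 59.9),
    "Claude Opus 4.8": (61.4, 55.7),
    "GPT-5.5": (60.2, 54.8),
}


def point_change(previous, current):
    """Absolute change in index points, rounded to one decimal."""
    return round(current - previous, 1)


def relative_change(previous, current):
    """Change as a percentage of the previous score, rounded to one decimal."""
    return round(100.0 * (current - previous) / previous, 1)


for model, (prev, cur) in snapshots.items():
    print(f"{model}: {point_change(prev, cur):+.1f} points, "
          f"{relative_change(prev, cur):+.1f}% relative")
```

The three point changes fall within 0.7 of each other, which is the pattern a methodology change would tend to produce, but a shared shift in evaluation conditions could look the same.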

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
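The five-run protocol described above can be illustrated with a small sketch. It assumes the reported score is the mean of the five runs, which the description does not state explicitly, and the per-run values are hypothetical rather than taken from any model on the board.

```python
import math
import statistics

# Hypothetical resolved-rate percentages from five runs of one model.
run_scores = [50.0, 52.0, 51.0, 49.0, 53.0]

# Assumed aggregation: the leaderboard score is the mean of the runs.
mean_score = statistics.mean(run_scores)

# Sample standard deviation and standard error of the mean show how much
# of a day-to-day change could be run-to-run noise.
std_dev = statistics.stdev(run_scores)
std_err = std_dev / math.sqrt(len(run_scores))

print(f"mean = {mean_score:.1f}%, standard error = {std_err:.2f} points")
```

Under these assumptions, a gap between two models that is smaller than a couple of standard errors is hard to distinguish from noise.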

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	68	$10.00
3	GPT-5.5	54.8	67	$11.25
4	Claude Opus 4.7	53.5	57	$10.00
5	GPT-5.4	51.4	191	$5.63
6	Gemini 3.5 Flash	50.2	212	$3.38
7	Claude Sonnet 4.6	47.2	62	$6.00
8	Gemini 3.1 Pro Preview	46.5	133	$4.50
9	Qwen3.7 Max	46.0	187	$3.75
10	MiniMax-M3	44.4	57	$0.525
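The "lower price, higher score is better" reading of the Value Frontier chart can be sketched as a Pareto filter over the ten rows above. This is a minimal illustration using only the listed top ten, not the full set of models behind the chart, and the function name is hypothetical.

```python
# (model, index score, blended $ per 1M tokens) from the table above.
models = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
    ("MiniMax-M3", 44.4, 0.525),
]


def value_frontier(rows):
    """Return models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on price ties, higher score first.
    for name, score, price in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


print(value_frontier(models))
```

Among these ten rows, the filter keeps MiniMax-M3, Gemini 3.5 Flash, GPT-5.4, Claude Opus 4.8, and Claude Fable 5; every other listed model is matched or beaten on score by something that costs no more.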

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	212
2	GPT-5.4	191
3	Qwen3.7 Max	187
4	GPT-5.4 mini	187
5	GPT-5.2 Codex	137
6	Gemini 3.1 Pro Preview	133
7	DeepSeek V4 Flash	108
8	GPT-5.3 Codex	99
9	DeepSeek V4 Pro	84
10	GPT-5.2	80

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	GLM-5.1	$2.15
10	Qwen3.6 Max Preview	$2.92
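The 3:1 blended cost in the caption above can be read as a weighted average of input and output prices. The sketch below assumes that reading; the function name and the example prices are hypothetical and are not any vendor's published rates.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output ratio."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical model priced at $5.00 per 1M input tokens and $25.00 per 1M output tokens.
example = blended_price(5.00, 25.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three times the weight, a model with expensive output but cheap input ranks better here than it would under a 1:1 blend.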