The Inference Report

June 21, 2026

The SWE-rebench leaderboard shows consolidation at the top, with no movement among the leading seven models, while the mid-tier is less settled. Claude Sonnet 4.6 held at #10 with 51.3 percent, Gemini 3.1 Pro Preview held at #11 with 51.1 percent, and GLM-5.1 remained at #12 with 50.7 percent.

Set against Artificial Analysis, the picture changes. GLM-5.1 sits at rank 23 there with 40.2, against rank 12 with 50.7 on SWE-rebench, a 10.5-point gap that suggests either a model update or a methodology difference between the two benchmarks. The most striking gap in rank belongs to GLM-4.7, which goes from #51 on Artificial Analysis (33.8) to #17 on SWE-rebench (38.2), a 4.4-point difference, while Kimi K2.6 moves from rank 16 to 15 with a 3.7-point difference, 42.8 against 46.5.

These discrepancies raise questions about the two evaluation methodologies. SWE-rebench appears to reward different model behaviors or architectural choices than Artificial Analysis, particularly for Chinese-developed models like GLM and Kimi, which suggests the benchmarks may be measuring distinct aspects of coding capability rather than converging on a unified signal. The lack of score inflation at the frontier, where the top model remains at 62.7 percent, indicates the evaluation has not become easier. Even so, the divergence in rankings for identical models undermines confidence in any single leaderboard as a complete measure of coding performance.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Junie	61.6%
3	Codex	60.4%
4	Claude Code	59.6%
5	gpt-5.5-2026-04-23-medium	58.9%
6	Claude Opus 4.8-xhigh	56.5%
7	gpt-5.4-2026-03-05-medium	54.9%
8	Claude Opus 4.7-high	53.1%
9	Cursor	53.0%
10	Claude Sonnet 4.6	51.3%
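
The description above says each model is run five times to account for stochastic variance. The source does not say how those runs are combined into one score, so the following is only a minimal sketch of one common approach: report the mean and the standard error of the mean. The function name `summarize_runs` and the per-run numbers are hypothetical and are not leaderboard data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent.

    Assumption: runs are independent repeats of the same evaluation.
    """
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores for a single model (illustrative only).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this reading, two models whose means differ by less than their standard errors, such as neighbors a few tenths of a point apart, should probably not be treated as clearly ranked.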

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	67	$10.00
3	GPT-5.5	54.8	63	$11.25
4	Claude Opus 4.7	53.5	52	$10.00
5	GPT-5.4	51.4	142	$5.63
6	GLM-5.2	51.1	85	$2.15
7	Gemini 3.5 Flash	50.2	217	$3.38
8	Claude Sonnet 4.6	47.2	67	$6.00
9	Gemini 3.1 Pro Preview	46.5	136	$4.50
10	Qwen3.7 Max	46.0	197	$3.75
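
The "lower price, higher score is better" reading of the Value Frontier can be made concrete as a Pareto frontier: a model is on the frontier if no other model is at least as cheap and at least as capable while being strictly better on one of the two. The sketch below applies that definition to the ten rows of the table above only; the published chart may cover more models, and the function name `pareto_frontier` is an illustrative choice.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (higher score, lower price)."""
    frontier = []
    for name, score, price in models:
        dominated = any(
            (s >= score and p <= price) and (s > score or p < price)
            for _, s, p in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (name, Intelligence Index score, blended $ per 1M tokens) from the table above.
models = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]

frontier = pareto_frontier(models)
print(frontier)
```

Within these ten rows, the frontier is Claude Fable 5, Claude Opus 4.8, GPT-5.4, and GLM-5.2; every other listed model is matched or beaten on both price and score by at least one of them.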

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	217
2	Qwen3.7 Max	197
3	GPT-5.4 mini	180
4	GPT-5.4	142
5	GPT-5.2 Codex	139
6	Gemini 3.1 Pro Preview	136
7	DeepSeek V4 Flash	110
8	GLM-5.1	103
9	GPT-5.3 Codex	95
10	DeepSeek V4 Pro	92

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	MiMo-V2-Pro	$1.50
7	GPT-5.4 mini	$1.69
8	Kimi K2.6	$1.71
9	Kimi K2.7 Code	$1.71
10	GLM-5.2	$2.15
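
The "3:1 input/output" blend in the Price leaderboard can be read as a weighted average of input and output token prices, with three parts input to one part output. The sketch below assumes that reading; the function name `blended_price` and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_per_m, output_per_m, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $5.00 per 1M input tokens, $25.00 per 1M output tokens.
example = blended_price(5.00, 25.00)
print(f"${example:.2f} per 1M tokens")  # (3 * 5 + 1 * 25) / 4 = 10.00
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can post a lower blended price than its output rate alone would suggest.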