The Inference Report

July 1, 2026

On the SWE-rebench coding benchmark, the top tier is unchanged from the previous round: OpenAI's gpt-5.5-2026-04-23-xhigh model holds first place at 62.7% (±0.91%), followed by the Junie agent at 61.6% (±0.64%) and OpenAI's Codex agent at 60.4% (±1.37%).

The Artificial Analysis index, by contrast, shows material reshuffling across its 398-model roster. Claude Fable 5 enters at #1 with 59.9 points, pushing GPT-5.5 down to #3, while Claude Sonnet 5 debuts at #5 with 53.4 points and moves prior entries down. Lower in the rankings, DeepSeek V3 climbs from #220 (10.4) to #180 (14.2), a 3.8-point gain that suggests either improved evaluation conditions or a correction to a prior assessment. Qwen3.5 9B drops from #101 (25) to #120 (21.4), a 3.6-point decline that warrants scrutiny of methodological consistency.

SWE-rebench's tight error margins (nine of the top ten are under ±1.5%) and static ordering suggest a well-controlled experimental setup, whereas the broader movement and new entrants on Artificial Analysis indicate either looser evaluation criteria or frequent model updates that shift relative standing. Neither benchmark offers the methodological transparency needed to distinguish genuine performance improvement from variance in test conditions.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Provider	System	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
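
When reading the ± figures, one quick sanity check is whether adjacent entries' ranges overlap. The sketch below does that for the top three rows, using the scores and margins exactly as listed. It assumes the ± value is a symmetric margin around the score; the table does not say which statistic it represents, so this is a rough reading aid rather than a significance test. The function names are illustrative only.

```python
# Top three SWE-rebench entries as listed in the table: (name, score %, ± margin %).
top_three = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
]

def interval(score, margin):
    """Return the (low, high) range implied by score ± margin (assumed symmetric)."""
    return (score - margin, score + margin)

def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def adjacent_overlaps(rows):
    """Compare each entry with the one ranked directly below it."""
    results = []
    for (name_a, s_a, m_a), (name_b, s_b, m_b) in zip(rows, rows[1:]):
        shared = overlaps(interval(s_a, m_a), interval(s_b, m_b))
        results.append((name_a, name_b, shared))
    return results

for upper, lower, shared in adjacent_overlaps(top_three):
    print(f"{upper} vs {lower}: ranges overlap = {shared}")
```

Under that assumption, each of the top three overlaps with its immediate neighbor, so small rank differences at the top are best read with the margins in mind.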

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	n/a	$20.00
2	Claude Opus 4.8	55.7	65	$10.00
3	GPT-5.5	54.8	77	$11.25
4	Claude Opus 4.7	53.5	48	$10.00
5	Claude Sonnet 5	53.4	79	$6.00
6	GPT-5.4	51.4	157	$5.63
7	GLM-5.2	51.1	160	$2.15
8	Gemini 3.5 Flash	50.2	210	$3.38
9	Claude Sonnet 4.6	47.2	63	$6.00
10	Gemini 3.1 Pro Preview	46.5	128	$4.50
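
The "lower price, higher score is better" rule behind the Value Frontier chart can be sketched as a Pareto filter over the ten rows above. This is a minimal illustration, not necessarily how the chart itself is built: it covers only the top ten shown here, not the full roster, and the function names are hypothetical.

```python
# (model, Intelligence Index score, blended $ per 1M tokens), copied from the table above.
models = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("Claude Sonnet 5", 53.4, 6.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
]

def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    _, score_a, price_a = a
    _, score_b, price_b = b
    no_worse = score_a >= score_b and price_a <= price_b
    strictly_better = score_a > score_b or price_a < price_b
    return no_worse and strictly_better

def value_frontier(rows):
    """Names of rows that no other row dominates, in their original order."""
    return [r[0] for r in rows if not any(dominates(other, r) for other in rows)]

frontier = value_frontier(models)
print(frontier)
```

Among these ten rows, the filter keeps Claude Fable 5, Claude Opus 4.8, Claude Sonnet 5, GPT-5.4, and GLM-5.2; every other entry is matched or beaten on both price and score by at least one of them.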

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	210
2	Qwen3.7 Max	195
3	GLM-5.2	160
4	GPT-5.4	157
5	GPT-5.4 mini	154
6	Gemini 3.1 Pro Preview	128
7	GPT-5.2 Codex	118
8	DeepSeek V4 Flash	90
9	MiMo-V2.5	86
10	MiniMax-M3	84

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
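
The blended figure can be reproduced with a weighted average. The sketch below reads "3:1 input/output" as three parts input price to one part output price, which is an assumption about the leaderboard's method. The prices in the example are hypothetical and do not belong to any model in the table.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of per-1M-token prices; defaults assume a 3:1 input/output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight

# Hypothetical prices in $ per 1M tokens, chosen only to make the arithmetic easy to follow.
example = blended_price(input_price=1.00, output_price=5.00)
print(f"${example:.2f} per 1M tokens")  # (3 * 1.00 + 1 * 5.00) / 4 = 2.00
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well on this measure.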