The Inference Report

July 3, 2026

The SWE-rebench rankings show no movement from the previous report. The top tier is unchanged, with GPT-5.5-xhigh at 62.7%, the Junie agent at 61.6%, and the Codex agent at 60.4%, and no position has changed anywhere in the 24-model list. The Artificial Analysis benchmark, by contrast, shows substantial churn across its 398-entry ranking, though its top tier is also stable: Claude Fable 5 holds the lead at 59.9, followed by Claude Opus 4.8 at 55.7 and GPT-5.5 at 54.8. Below that summit the ordering has shifted measurably. GPT-5 mini dropped from #65 at 33.0 to #72 at 30.9, a loss of 2.1 points and seven positions; Mistral Small 4 fell from #126 at 20.8 to #132 at 19.6, down 1.2 points and six positions; and Qwen3 Next 80B A3B plummeted from #134 at 19.8 to #159 at 16.7, down 3.1 points and 25 positions. These declines suggest either a methodological revision or genuine performance variance, most visibly in the 16-20 point band where many models cluster.

SWE-rebench's immobility raises the question of whether agentic benchmarks of this kind are less sensitive to model updates than Artificial Analysis, or whether the coding agents themselves have stabilized while the underlying base models continue to diverge. The mid-range instability in Artificial Analysis, where confidence intervals would likely overlap, warrants scrutiny of whether those score differences exceed measurement error. Without published confidence bounds for that benchmark, the ranking shifts read as plausible but not necessarily meaningful.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Provider	Name	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
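
The ± figures in this table make it possible to check which adjacent ranks are actually separable. The sketch below assumes each ± value is the half-width of a symmetric interval around the score; the table does not say whether it is a standard error or a confidence interval, so treat the result as illustrative. Entry names are abbreviated, and `overlapping_neighbors` is a hypothetical helper name.

```python
# Scores and ± values copied from the SWE-rebench top 10 above.
# Assumption: each ± is the half-width of a symmetric interval.
ROWS = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    ("Claude Opus 4.8-xhigh", 56.5, 1.20),
    ("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    ("Claude Opus 4.7-high", 53.1, 1.45),
    ("Cursor", 53.0, 0.53),
    ("Claude Sonnet 4.6", 51.3, 0.55),
]


def overlapping_neighbors(rows):
    """Return (higher, lower) name pairs whose intervals overlap."""
    pairs = []
    # Rows are sorted by score descending, so `a` always ranks above `b`.
    for (name_a, score_a, pm_a), (name_b, score_b, pm_b) in zip(rows, rows[1:]):
        # Overlap: the lower bound of `a` does not clear the upper bound of `b`.
        if score_a - pm_a <= score_b + pm_b:
            pairs.append((name_a, name_b))
    return pairs


for higher, lower in overlapping_neighbors(ROWS):
    print(f"{higher} vs {lower}: intervals overlap")
```

Under that assumption, most adjacent pairs in the top ten overlap, so a ranking that does not move between reports is not the same as a ranking whose entries are clearly separated.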

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	64	$20.00
2	Claude Opus 4.8	55.7	65	$10.00
3	GPT-5.5	54.8	84	$11.25
4	Claude Opus 4.7	53.5	50	$10.00
5	Claude Sonnet 5	53.4	87	$6.00
6	GPT-5.4	51.4	166	$5.63
7	GLM-5.2	51.1	181	$2.15
8	Gemini 3.5 Flash	50.2	210	$3.38
9	Claude Sonnet 4.6	47.2	69	$6.00
10	Gemini 3.1 Pro Preview	46.5	136	$4.50
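
The price-versus-intelligence trade-off described under "The Value Frontier" can be illustrated on just these ten rows. The sketch below is a minimal Pareto-frontier filter: a model stays on the frontier if no other row is at least as smart and at least as cheap while being strictly better on one of the two. It covers only the top ten shown here, not the full 398-entry ranking, and `pareto_frontier` is an illustrative name rather than anything Artificial Analysis publishes.

```python
# (model, intelligence score, blended $ per 1M tokens) from the table above.
MODELS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("Claude Sonnet 5", 53.4, 6.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
]


def pareto_frontier(models):
    """Return names of models that no other row beats on both score and price."""
    frontier = []
    for name, score, price in models:
        dominated = any(
            o_score >= score
            and o_price <= price
            and (o_score > score or o_price < price)  # strictly better on one axis
            for _, o_score, o_price in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(MODELS))
```

Within this ten-row slice, the filter keeps Claude Fable 5, Claude Opus 4.8, Claude Sonnet 5, GPT-5.4, and GLM-5.2; cheaper models further down the full ranking could extend or change that frontier.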

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	210
2	Qwen3.7 Max	200
3	GLM-5.2	181
4	GPT-5.4 mini	168
5	GPT-5.4	166
6	Gemini 3.1 Pro Preview	136
7	Nex-N2-Pro	120
8	GPT-5.2 Codex	120
9	MiniMax-M3	98
10	DeepSeek V4 Flash	93

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
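
The "3:1 input/output" blend in the Price leaderboard caption can be read as a weighted average: three parts input price to one part output price. The sketch below assumes that reading. The function name `blended_price` and the example prices are hypothetical and do not correspond to any listed model.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average of input and output prices per 1M tokens."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
# (3 * 1.00 + 1 * 4.00) / 4 = 1.75
print(blended_price(1.00, 4.00))
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still post a low blended figure, which is worth keeping in mind for output-heavy workloads.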