The Inference Report

June 26, 2026

The SWE-rebench top ten has been entirely replaced by a fresh cohort that mixes agent-based systems with models run at explicit effort settings such as "xhigh" and "medium." OpenAI's gpt-5.5-2026-04-23-xhigh leads at 62.7% ± 0.91%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%. On the Artificial Analysis index, Claude Fable 5 sits in the top position at 59.9, the same first place it held previously.

SWE-rebench scores for comparable models are higher in absolute terms than their Artificial Analysis values (gpt-5.5 reaches 62.7% on one and 54.8 on the other), but the two numbers are not on a common scale: one is a percentage score on software engineering tasks, the other a composite index across coding, math, and reasoning. The gap therefore suggests that the two benchmarks measure different aspects of capability or use divergent evaluation protocols, not that one is more generous than the other. All seventeen prior SWE-rebench entries have been dropped, which points to a benchmark refresh, a shift in how systems are tested, or a change in what qualifies for inclusion. The new cohort's reported margins range from ± 0.53% to ± 1.98%, tighter than one might expect given the diversity of approaches.

The Artificial Analysis ranking remains largely stable in its upper tiers, with minor reordering and Devstral 2 moving from position 165 to 140. It shows none of the wholesale replacement seen in SWE-rebench, suggesting the two leaderboards update on different cadences or criteria. Because there is no prior SWE-rebench data for the new entries, it is not possible to determine whether these scores represent genuine capability gains or simply reflect how agent-based systems perform on that benchmark relative to the model-only systems previously ranked.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
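To make the five-run idea concrete, here is a minimal sketch of how repeated runs could be condensed into a "score ± margin" figure. It assumes the margin is the standard error of the mean, which the description above does not confirm, and the run values are hypothetical placeholders, not SWE-rebench data.

```python
import math
import statistics


def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for repeated runs.

    Assumption: the margin is the standard error of the mean. The
    benchmark may report a different statistic.
    """
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(resolved_rates) / math.sqrt(n)
    return mean, sem


# Hypothetical resolved rates (percent) from five runs of one model.
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
mean, sem = summarize_runs(runs)
print(f"{mean:.1f}% ± {sem:.2f}%")
```

With more runs the margin shrinks roughly with the square root of the run count, which is one reason a fixed number of runs per model keeps the margins comparable across entries.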

#	Provider	Name	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
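
As a reading aid for the ± margins in the table above, the following sketch checks which neighboring entries have overlapping score ranges. It treats each ± value as a symmetric half-width around the score, which is an assumption, and overlapping ranges are only a rough signal, not a formal significance test.

```python
# Scores and ± margins copied from the SWE-rebench table above.
ROWS = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    ("Claude Opus 4.8-xhigh", 56.5, 1.20),
    ("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    ("Claude Opus 4.7-high", 53.1, 1.45),
    ("Cursor", 53.0, 0.53),
    ("Claude Sonnet 4.6", 51.3, 0.55),
]


def ranges_overlap(a, b):
    """True if the [score - margin, score + margin] ranges intersect."""
    _, score_a, margin_a = a
    _, score_b, margin_b = b
    return abs(score_a - score_b) <= margin_a + margin_b


def adjacent_overlaps(rows):
    """Compare each entry with the one ranked directly below it."""
    return [
        (upper[0], lower[0], ranges_overlap(upper, lower))
        for upper, lower in zip(rows, rows[1:])
    ]


for upper, lower, overlap in adjacent_overlaps(ROWS):
    status = "overlap" if overlap else "separated"
    print(f"{upper} vs {lower}: {status}")
```

Under this reading, most neighboring pairs overlap, so small rank differences near the top of the table should be interpreted with caution.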

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	65	$10.00
3	GPT-5.5	54.8	73	$11.25
4	Claude Opus 4.7	53.5	55	$10.00
5	GPT-5.4	51.4	150	$5.63
6	GLM-5.2	51.1	123	$2.15
7	Gemini 3.5 Flash	50.2	213	$3.38
8	Claude Sonnet 4.6	47.2	67	$6.00
9	Gemini 3.1 Pro Preview	46.5	136	$4.50
10	Qwen3.7 Max	46.0	203	$3.75
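
The "lower price, higher score is better" framing of the Value Frontier can be sketched as a Pareto filter over the rows above. This is an illustration restricted to the ten listed models; the full frontier would also include cheaper models that do not appear in this top ten, so the result is not the chart's actual frontier.

```python
# (name, index score, blended $ per 1M tokens) copied from the table above.
MODELS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def pareto_frontier(models):
    """Return names of models that no other model beats on both axes.

    A model is dominated when another one scores at least as high and
    costs at most as much, with a strict improvement on one of the two.
    """
    frontier = []
    for name, score, price in models:
        dominated = any(
            other_score >= score
            and other_price <= price
            and (other_score > score or other_price < price)
            for _, other_score, other_price in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(MODELS))
```

Among these ten rows, the filter keeps Claude Fable 5, Claude Opus 4.8, GPT-5.4, and GLM-5.2; every other listed model is matched or beaten on both score and price by one of those four.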

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	213
2	Qwen3.7 Max	203
3	GPT-5.4 mini	176
4	GPT-5.4	150
5	GPT-5.2 Codex	139
6	Gemini 3.1 Pro Preview	136
7	GLM-5.2	123
8	Nex-N2-Pro	117
9	DeepSeek V4 Flash	113
10	MiMo-V2.5	84

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
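
The "3:1 input/output" blend in the price description can be read as a weighted average with three parts input price to one part output price. The sketch below follows that reading; the function name and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_per_million, output_per_million, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" means three parts input tokens to
    one part output tokens.
    """
    total_weight = input_weight + output_weight
    weighted_sum = input_weight * input_per_million + output_weight * output_per_million
    return weighted_sum / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
example = blended_price(1.00, 5.00)
print(f"${example:.2f} per 1M tokens")  # (3 * 1.00 + 1 * 5.00) / 4
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure.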