The Inference Report

June 27, 2026

The SWE-rebench rankings are unchanged from the previous report, with no movement across the top 17 coding agents. OpenAI's gpt-5.5-2026-04-23-xhigh holds first at 62.7% (±0.91%), followed by the Junie agent at 61.6% (±0.64%). The reported margins are too wide to separate most neighbors: of the nine adjacent pairs in the top 10, only two gaps (fifth to sixth, and ninth to tenth) exceed the combined ± ranges, and the top two themselves overlap. Further down, Claude Opus 4.6-high (47.8% ±1.37%) sits clearly below Claude Sonnet 4.6 (51.3% ±0.55%): the 3.5-point gap is well outside their combined margins of 1.92 points. On the Artificial Analysis index, positions churn in the middle and lower tiers: Magistral Medium 1.2 dropped from position 130 to 148, while Apriel-v1.6-15B-Thinker moved from 129 to 128. The top performers remain locked in place, with Claude Fable 5 leading at 59.9 and Claude Opus 4.8 and GPT-5.5 holding second and third at 55.7 and 54.8 respectively. The two benchmarks measure different problem spaces (SWE-rebench targets repository-level software engineering tasks, while the Artificial Analysis index is a composite across coding, math, and reasoning), which helps explain why their orderings diverge: coding agents such as Junie rank near the top of SWE-rebench, while Claude Fable 5 tops the general index. Without prior Artificial Analysis scores, it is unclear whether the shuffling in positions 128 to 148 reflects genuine performance changes or measurement variance. The stability in SWE-rebench suggests either that the top agents have reached a plateau or that the evaluation's resolution cannot detect sub-point improvements.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Provider	Name	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
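
One way to read the ± column is to check whether neighboring entries can be told apart at all. The sketch below is a minimal illustration, assuming each ± value is a symmetric half-width around the score (the table does not say whether it is a standard error, a standard deviation, or a confidence interval). The function names are hypothetical; the numbers are copied from the table above.

```python
# Scores and ± half-widths (in percent) copied from the SWE-rebench table.
RESULTS = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    ("Claude Opus 4.8-xhigh", 56.5, 1.20),
    ("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    ("Claude Opus 4.7-high", 53.1, 1.45),
    ("Cursor", 53.0, 0.53),
    ("Claude Sonnet 4.6", 51.3, 0.55),
]


def intervals_overlap(a, b):
    """True if [score - err, score + err] for a and b intersect.

    Assumption: the ± value is a symmetric half-width around the score.
    """
    _, score_a, err_a = a
    _, score_b, err_b = b
    return abs(score_a - score_b) <= err_a + err_b


def overlapping_neighbours(results):
    """Return (higher, lower) name pairs of adjacent ranks whose ranges overlap."""
    return [
        (higher[0], lower[0])
        for higher, lower in zip(results, results[1:])
        if intervals_overlap(higher, lower)
    ]


if __name__ == "__main__":
    for higher, lower in overlapping_neighbours(RESULTS):
        print(f"{higher} and {lower} are not separated by their margins")
```

Overlapping ranges are a rough screen rather than a significance test; two entries can overlap and still differ reliably if the runs are paired on the same tasks.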

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	60	$10.00
3	GPT-5.5	54.8	83	$11.25
4	Claude Opus 4.7	53.5	57	$10.00
5	GPT-5.4	51.4	163	$5.63
6	GLM-5.2	51.1	120	$2.15
7	Gemini 3.5 Flash	50.2	225	$3.38
8	Claude Sonnet 4.6	47.2	70	$6.00
9	Gemini 3.1 Pro Preview	46.5	145	$4.50
10	Qwen3.7 Max	46.0	203	$3.75
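
The "lower price, higher score is better" rule behind the Value Frontier view can be made concrete as a Pareto filter: an entry is on the frontier if no other entry is at least as good on both axes and strictly better on one. The following is a rough sketch under that assumption, not the method used to draw the chart; `Entry`, `dominates`, and `value_frontier` are hypothetical names, and the data covers only the ten rows listed above, so cheaper models outside this excerpt are not considered.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    score: float  # Intelligence Index score
    price: float  # blended $ per 1M tokens


# Rows copied from the Intelligence Index table above.
ENTRIES = [
    Entry("Claude Fable 5", 59.9, 20.00),
    Entry("Claude Opus 4.8", 55.7, 10.00),
    Entry("GPT-5.5", 54.8, 11.25),
    Entry("Claude Opus 4.7", 53.5, 10.00),
    Entry("GPT-5.4", 51.4, 5.63),
    Entry("GLM-5.2", 51.1, 2.15),
    Entry("Gemini 3.5 Flash", 50.2, 3.38),
    Entry("Claude Sonnet 4.6", 47.2, 6.00),
    Entry("Gemini 3.1 Pro Preview", 46.5, 4.50),
    Entry("Qwen3.7 Max", 46.0, 3.75),
]


def dominates(a: Entry, b: Entry) -> bool:
    """a dominates b: no worse on score and price, strictly better on one."""
    no_worse = a.score >= b.score and a.price <= b.price
    strictly_better = a.score > b.score or a.price < b.price
    return no_worse and strictly_better


def value_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep the entries that no other entry dominates (input order preserved)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


frontier_names = [e.name for e in value_frontier(ENTRIES)]

if __name__ == "__main__":
    print(frontier_names)
    # ['Claude Fable 5', 'Claude Opus 4.8', 'GPT-5.4', 'GLM-5.2']
```

Among these ten rows, GLM-5.2 dominates every lower-scoring entry, since each of them costs more for a lower score.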

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	225
2	Qwen3.7 Max	203
3	GPT-5.4 mini	177
4	GPT-5.4	163
5	Gemini 3.1 Pro Preview	145
6	GPT-5.2 Codex	139
7	GLM-5.2	120
8	Nex-N2-Pro	118
9	DeepSeek V4 Flash	114
10	GPT-5.3 Codex	100

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
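
For readers who want to reproduce a blended figure from a provider's price sheet, the 3:1 ratio can be read as a weighted average of three parts input price to one part output price. This is a minimal sketch under that reading; the function name and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" means three input tokens are counted
    for every output token when averaging the two per-token prices.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
example = blended_price(1.00, 4.00)  # (3 * 1.00 + 1 * 4.00) / 4

if __name__ == "__main__":
    print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output looks better on this measure than it would for output-heavy workloads.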