The Inference Report

June 29, 2026

The SWE-rebench leaderboard shows no movement from the previous snapshot: OpenAI's gpt-5.5-2026-04-23-xhigh model holds first at 62.7%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%. The reported margins (±0.91, ±0.64, and ±1.37 points respectively) are narrow, but the gaps between adjacent entries are only about 1.1 to 1.2 points, so neighboring ranges overlap and the order within the top three is not firmly established.

The Artificial Analysis index shows modest shuffling rather than substantive reordering. Claude Fable 5 leads at 59.9, Claude Opus 4.8 sits at 55.7, and GPT-5.5 ranks third at 54.8. The list contains no new entries, and scores are almost entirely unchanged. Two minor changes appear further down the list: Qwen3 32B moves from rank 217 to 209 as its score rises from 10.5 to 11.5, and Sarvam 105B and Magistral Small 1.2 exchange positions around ranks 204-205 without score changes, which suggests a database reorganization rather than an actual performance shift.

The methodology behind both benchmarks remains only partly documented. SWE-rebench describes a standardized scaffolding, a continuously updated dataset, and five runs per model, which accounts for the ± figures it reports, yet no detail appears on the task distribution or on how those margins are computed. Artificial Analysis provides no uncertainty quantification at all, making it impossible to assess whether fractional score differences reflect genuine capability gaps or measurement noise. The two benchmarks also diverge at the top (gpt-5.5-xhigh leads SWE-rebench, while GPT-5.5 ranks third on Artificial Analysis), which raises the question of whether they measure the same construct: SWE-rebench targets real-world software engineering tasks, whereas the Artificial Analysis index is a composite across coding, math, and reasoning. Without clarification of what each benchmark tests, how tasks are sampled, and whether scoring is reproducible, the rankings function as indices rather than measures of engineering competence.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
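As a rough illustration of how five runs can be summarized into a "score ± margin" figure, the sketch below computes a sample mean and its standard error. The run scores are hypothetical, and the choice of standard error is an assumption: the description above does not say which statistic SWE-rebench actually reports after the ± sign.

```python
import math
import statistics

# Hypothetical resolved rates (%) from five independent runs of one model.
# These values are illustrative only; they are not SWE-rebench data.
run_scores = [61.0, 62.5, 63.0, 62.0, 64.0]


def mean_and_sem(scores):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.mean(scores)
    # statistics.stdev uses the n-1 (sample) denominator.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


mean, sem = mean_and_sem(run_scores)
print(f"{mean:.1f}% ± {sem:.2f}%")
```

With only five runs, any such margin is itself a noisy estimate, which is one reason to read small gaps between neighboring entries cautiously.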

#	Vendor	Name	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
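One way to read the ± figures in the table above is to check whether the ranges of neighboring entries overlap. The sketch below does this for the top three rows, under the assumption that each ± value is a symmetric half-width around the score; the helper names are illustrative, and overlap is a quick heuristic rather than a formal significance test.

```python
# (name, score %, reported ± %) copied from the top three rows of the table.
top3 = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
]


def interval(score, margin):
    """Symmetric range around a score (assumed meaning of the ± value)."""
    return (score - margin, score + margin)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def adjacent_overlaps(rows):
    """Compare each entry with the one ranked directly below it."""
    results = []
    for (name_a, s_a, m_a), (name_b, s_b, m_b) in zip(rows, rows[1:]):
        shared = overlaps(interval(s_a, m_a), interval(s_b, m_b))
        results.append((name_a, name_b, shared))
    return results


for name_a, name_b, shared in adjacent_overlaps(top3):
    print(f"{name_a} vs {name_b}: overlap={shared}")
```

Under that assumption, both adjacent pairs overlap, while the first and third ranges just miss each other (61.79 versus 61.77).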

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	58	$10.00
3	GPT-5.5	54.8	83	$11.25
4	Claude Opus 4.7	53.5	55	$10.00
5	GPT-5.4	51.4	174	$5.63
6	GLM-5.2	51.1	139	$2.15
7	Gemini 3.5 Flash	50.2	214	$3.38
8	Claude Sonnet 4.6	47.2	57	$6.00
9	Gemini 3.1 Pro Preview	46.5	137	$4.50
10	Qwen3.7 Max	46.0	198	$3.75
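The "lower price, higher score is better" reading of the value frontier can be sketched as a Pareto filter over the ten rows above: a model stays on the frontier if no other listed model is at least as cheap and at least as strong while being strictly better on one axis. The function names are illustrative, and the result covers only these ten entries, not the full index.

```python
# (model, index score, blended $/1M tokens) copied from the ten rows above.
rows = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    _, score_a, price_a = a
    _, score_b, price_b = b
    no_worse = score_a >= score_b and price_a <= price_b
    strictly_better = score_a > score_b or price_a < price_b
    return no_worse and strictly_better


def pareto_frontier(items):
    """Keep the entries that no other entry dominates."""
    return [x for x in items if not any(dominates(y, x) for y in items)]


frontier = [name for name, _, _ in pareto_frontier(rows)]
print(frontier)
# ['Claude Fable 5', 'Claude Opus 4.8', 'GPT-5.4', 'GLM-5.2']
```

Among these ten rows, every other entry is matched or beaten on score by a model that costs the same or less.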

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	214
2	Qwen3.7 Max	198
3	GPT-5.4 mini	178
4	GPT-5.4	174
5	GLM-5.2	139
6	Gemini 3.1 Pro Preview	137
7	GPT-5.2 Codex	135
8	DeepSeek V4 Flash	109
9	GPT-5.3 Codex	94
10	MiMo-V2.5	90

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
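For readers unfamiliar with blended pricing, the sketch below shows one plausible reading of "3:1 input/output": a weighted average with three parts input price to one part output price. This interpretation is an assumption, since the exact formula is not given here, and the per-token prices in the example are hypothetical.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens (assumed 3:1 input/output mix)."""
    total_weight = input_weight + output_weight
    weighted_sum = input_weight * input_price + output_weight * output_price
    return weighted_sum / total_weight


# Hypothetical prices in $ per 1M tokens, not taken from any provider's list.
example = blended_price(input_price=5.00, output_price=25.00)
print(f"${example:.2f}")  # $10.00
```

Because input tokens carry three times the weight, a model with expensive output but cheap input can still rank well on this measure.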