The Inference Report

July 4, 2026

On SWE-rebench, the top tier remains static: OpenAI's gpt-5.5-2026-04-23-xhigh model holds position one at 62.7 percent, followed by the Junie agent at 61.6 percent and OpenAI's Codex agent at 60.4 percent. Standard errors across the ranked set range from 0.53 to 1.98 percentage points. That is tight enough to separate the top of the table from the bottom, but the 1.1-point gap between first and second is about the size of the two entries' combined standard error, so the evaluation captures consistent differences across tiers more convincingly than it orders the leaders themselves.

On the Artificial Analysis Intelligence Index, the landscape shifts more dramatically: Claude Fable 5 now leads at 59.9, displacing GPT-5.5 from the top spot, while Claude Opus 4.8 sits second at 55.7 and GPT-5.5 drops to third at 54.8. Further down the index there is considerable churn: Llama 3.3 Instruct 70B climbed from position 258 to 242, a 16-rank jump, and several models in the 240 to 260 range traded places. In a crowded band where many models cluster between 8 and 10 points, moves of that size need only modest score changes.

The two benchmarks do not track each other closely. The headline numbers of the two leaders differ by 2.8 points (62.7 versus 59.9), but one is a percentage on software engineering tasks and the other a composite index, so the figures do not share a scale and the gap mostly hints at different problem structures. The lists also overlap only in part: the agent entries on SWE-rebench (Junie, Codex, Claude Code, Cursor) have no counterpart in the Artificial Analysis top ten, and Claude Fable 5, the Artificial Analysis leader, is absent from the SWE-rebench top ten. Model families that appear on both, such as GPT-5.5 and Claude Opus 4.8, land in a different order on each. SWE-rebench concentrates on code-generation agents in controlled conditions, while Artificial Analysis appears broader and less transparent in methodology, making direct comparison hazardous.

Without a longer history of Artificial Analysis results, the significance of Claude Fable 5's ascent to first cannot be evaluated. SWE-rebench's stability suggests real differences in agent capability, but the Artificial Analysis movements may reflect noise or evaluation drift rather than genuine progress.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Provider	Entry	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
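
One rough way to judge how separable adjacent ranks are is to express each score gap in units of the combined standard error. The sketch below does that for the table above. It assumes that each ± value is a standard error and that entries are statistically independent; neither is confirmed by the leaderboard itself, and the function names are illustrative only.

```python
import math

# (entry, score in percent, ± value in percentage points), copied from the table above.
# Assumption: each ± value is a standard error and the entries are independent.
ENTRIES = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    ("Claude Opus 4.8-xhigh", 56.5, 1.20),
    ("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    ("Claude Opus 4.7-high", 53.1, 1.45),
    ("Cursor", 53.0, 0.53),
    ("Claude Sonnet 4.6", 51.3, 0.55),
]


def gap_in_standard_errors(a, b):
    """Return the score gap between a and b divided by their combined standard error."""
    _, score_a, se_a = a
    _, score_b, se_b = b
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return (score_a - score_b) / combined_se


def adjacent_gaps(entries):
    """Compare each entry with the one ranked directly below it."""
    return [
        (upper[0], lower[0], gap_in_standard_errors(upper, lower))
        for upper, lower in zip(entries, entries[1:])
    ]


for upper, lower, gap in adjacent_gaps(ENTRIES):
    print(f"{upper} vs {lower}: {gap:.2f} combined standard errors")
```

Under these assumptions the first-versus-second gap comes out at roughly one combined standard error, while the gap between first and tenth is many times larger. A gap near or below two is usually too small to treat the ordering as settled.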

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	62	$20.00
2	Claude Opus 4.8	55.7	61	$10.00
3	GPT-5.5	54.8	88	$11.25
4	Claude Opus 4.7	53.5	49	$10.00
5	Claude Sonnet 5	53.4	79	$6.00
6	GPT-5.4	51.4	167	$5.63
7	GLM-5.2	51.1	176	$2.15
8	Gemini 3.5 Flash	50.2	209	$3.38
9	Claude Sonnet 4.6	47.2	67	$6.00
10	Gemini 3.1 Pro Preview	46.5	140	$4.50
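
The "lower price, higher score is better" reading of the Value Frontier chart can be made concrete as a Pareto frontier: the models that no other model beats on both price and score. The sketch below is a minimal illustration restricted to the ten rows above, so cheaper models that appear only on the Price leaderboard are not considered. The function names are illustrative, not part of any Artificial Analysis tooling.

```python
# (model, index score, blended $ per 1M tokens), copied from the leaderboard above.
MODELS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("Claude Sonnet 5", 53.4, 6.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
]


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    _, score_a, price_a = a
    _, score_b, price_b = b
    no_worse = score_a >= score_b and price_a <= price_b
    strictly_better = score_a > score_b or price_a < price_b
    return no_worse and strictly_better


def pareto_frontier(models):
    """Keep the models that no other model dominates, in their original order."""
    return [m for m in models if not any(dominates(other, m) for other in models)]


for name, score, price in pareto_frontier(MODELS):
    print(f"{name}: score {score}, ${price:.2f} per 1M tokens")
```

Within this top-ten subset, a model falls off the frontier as soon as some other listed model scores at least as high for no more money. GPT-5.5, for example, is dominated by Claude Opus 4.8, which scores higher at a lower blended price.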

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	209
2	Qwen3.7 Max	199
3	GLM-5.2	176
4	GPT-5.4	167
5	GPT-5.4 mini	164
6	Gemini 3.1 Pro Preview	140
7	Nex-N2-Pro	126
8	GPT-5.2 Codex	123
9	MiniMax-M3	98
10	DeepSeek V4 Flash	95

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
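
The "3:1 input/output" blend in the Price leaderboard can be read as a weighted average with three parts input price to one part output price. The sketch below assumes that reading. The prices in the example are hypothetical and are not taken from any model listed above.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of per-1M-token prices.

    Assumption: "3:1 input/output" means three parts input to one part output.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
example = blended_price(1.00, 4.00)
print(f"Blended price: ${example:.2f} per 1M tokens")
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still post a low blended figure. Workloads that generate far more output than this ratio assumes will see a higher effective cost.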