The Inference Report

July 2, 2026

On SWE-rebench, the top tier is unchanged: OpenAI's gpt-5.5-xhigh holds 62.7% ± 0.91%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%, with no movement among the leading six entries. Below that tier, Z.ai's GLM-5.2 enters at position 12 with 51.1% ± 1.13%, pushing its predecessor GLM-5.1 to 13th place. DeepSeek-V4 Pro and MiMo-V2.5-Pro appear as new entries at 18 and 19 respectively, and Qwen models occupy positions 22 and 23 in their first SWE-rebench appearances.

The Artificial Analysis index shows broader volatility. Claude Fable 5, which was not ranked in the earlier snapshot, leads at 59.9. GPT-5.1 dropped from 38.9 to 36.9 (position 44), and Command A+ fell from 29.3 to 22.5 (position 111), the largest documented decline. gpt-oss-20b, which had held position 171 at 14.9, has been removed from the rankings entirely.

The SWE-rebench data carries tighter confidence intervals than Artificial Analysis, suggesting more controlled evaluation conditions. Both benchmarks nonetheless show a frontier dominated by OpenAI and Anthropic systems on coding-oriented evaluations, with newer Chinese models (Qwen, GLM variants) gaining ground in the mid-tier rather than displacing the leaders.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Vendor	Entry	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
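
One way to read the ± figures is to ask whether neighboring entries are actually separable. The sketch below treats each ± value as a symmetric interval around the score. That is an assumption: the report does not say whether the figure is a standard deviation, a standard error, or a confidence interval. The helper names `interval` and `overlaps` are illustrative and not part of any SWE-rebench tooling.

```python
# Top three rows of the SWE-rebench table: (entry, score %, ± margin %).
TOP3 = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
]


def interval(score, margin):
    """Return (low, high), treating the margin as symmetric (assumption)."""
    return (score - margin, score + margin)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Compare every pair of entries once.
for i in range(len(TOP3)):
    for j in range(i + 1, len(TOP3)):
        name_a, score_a, margin_a = TOP3[i]
        name_b, score_b, margin_b = TOP3[j]
        shared = overlaps(interval(score_a, margin_a), interval(score_b, margin_b))
        print(f"{name_a} vs {name_b}: overlap={shared}")
```

Under this reading, first and second place overlap, as do second and third, while first and third do not. A different interpretation of the ± value would change those conclusions.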

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	69	$20.00
2	Claude Opus 4.8	55.7	66	$10.00
3	GPT-5.5	54.8	82	$11.25
4	Claude Opus 4.7	53.5	51	$10.00
5	Claude Sonnet 5	53.4	89	$6.00
6	GPT-5.4	51.4	165	$5.63
7	GLM-5.2	51.1	184	$2.15
8	Gemini 3.5 Flash	50.2	214	$3.38
9	Claude Sonnet 4.6	47.2	69	$6.00
10	Gemini 3.1 Pro Preview	46.5	138	$4.50
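
Because the table above lists both score and price, the "lower price, higher score" trade-off can be sketched as a Pareto frontier: the set of models that no other listed model beats on both axes at once. The example is a minimal sketch over these ten rows only, so models outside the top ten by score are not considered. The function name `pareto_frontier` is illustrative.

```python
# (model, intelligence score, blended $ per 1M tokens), copied from the table above.
ROWS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("Claude Sonnet 5", 53.4, 6.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
]


def pareto_frontier(rows):
    """Return rows for which no other row is both cheaper-or-equal and higher-scoring."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, higher score first.
    for name, score, price in sorted(rows, key=lambda r: (r[2], -r[1])):
        # A row joins the frontier only if it beats every cheaper row on score.
        if score > best_score:
            frontier.append((name, score, price))
            best_score = score
    return frontier


for name, score, price in pareto_frontier(ROWS):
    print(f"{name}: score {score}, ${price:.2f} per 1M tokens")
```

For these rows the frontier, from cheapest to most expensive, is GLM-5.2, GPT-5.4, Claude Sonnet 5, Claude Opus 4.8, and Claude Fable 5. Every other listed model is matched or beaten on score by something that costs the same or less.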

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	214
2	Qwen3.7 Max	197
3	GLM-5.2	184
4	GPT-5.4 mini	175
5	GPT-5.4	165
6	Gemini 3.1 Pro Preview	138
7	GPT-5.2 Codex	125
8	DeepSeek V4 Flash	91
9	Claude Sonnet 5	89
10	Nex-N2-Pro	87

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
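
The blended figure in this table can be sketched as a weighted average of input and output prices. The example assumes "3:1 input/output" means three input tokens for every output token, which is the usual reading but is not spelled out here. The prices in the usage line are hypothetical and do not belong to any listed model.

```python
def blended_price(input_per_m, output_per_m, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output token mix.

    Assumes the ratio describes token volume, so 3:1 weights the input
    price three times as heavily as the output price.
    """
    total_parts = input_parts + output_parts
    return (input_parts * input_per_m + output_parts * output_per_m) / total_parts


# Hypothetical prices: $1.00 per 1M input tokens, $3.00 per 1M output tokens.
print(f"${blended_price(1.00, 3.00):.2f} per 1M tokens")  # prints "$1.50 per 1M tokens"
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure.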