The Inference Report

June 30, 2026

The SWE-rebench rankings remain static at the top, with OpenAI's gpt-5.5-2026-04-23-xhigh holding at 62.7% and Junie's agent at 61.6%, but the Artificial Analysis benchmark shows substantial movement in the mid-tier and below. DeepSeek R1 jumped from position 190 at 12.6% to position 144 at 18.5%, a gain of 5.9 percentage points that reflects either model improvement or a change in evaluation methodology; Mistral Small 3.1 climbed from 255 at 8.6% to 172 at 14.7%, gaining 6.1 points across an 83-position swing. Claude 4 Sonnet dropped from 72 at 30.7% to 83 at 28.9%, losing 1.8 points and slipping 11 positions. Devstral Small 2 moved from 186 at 13.1% to 153 at 17.4%, and Llama 3.1 Instruct 8B climbed from 298 at 6.1% to 274 at 7.6%; both gains suggest either that these models were re-evaluated with different configurations or that the benchmark itself shifted its evaluation criteria.

The Artificial Analysis data spans 397 entries compared to 16 on SWE-rebench, creating an asymmetry in what constitutes meaningful movement: a 1-point swing in the top 10 of Artificial Analysis represents roughly 2 percent of the leader's score, while the same absolute change at position 350 is nearly a 20 percent relative improvement. New entry DiffusionGemma 26B A4B, at position 185 with 13.5%, has no prior reference point, so it is impossible to tell whether this is a newly evaluated model or a previously omitted one.

Without access to methodology details for either benchmark, interpretation of these shifts is limited to surface observation: SWE-rebench appears stable and possibly closed to new entries, while Artificial Analysis exhibits churn consistent with either rolling re-evaluation or score recalibration across the full roster.
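The difference between absolute and relative movement discussed above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the rank and score pairs quoted in this report; the names `MOVES` and `summarize` are illustrative and do not come from either benchmark.

```python
# (model, old_rank, old_score, new_rank, new_score), copied from the figures quoted above
MOVES = [
    ("DeepSeek R1", 190, 12.6, 144, 18.5),
    ("Mistral Small 3.1", 255, 8.6, 172, 14.7),
    ("Claude 4 Sonnet", 72, 30.7, 83, 28.9),
    ("Devstral Small 2", 186, 13.1, 153, 17.4),
    ("Llama 3.1 Instruct 8B", 298, 6.1, 274, 7.6),
]


def summarize(old_rank, old_score, new_rank, new_score):
    """Return (positions gained, point change, relative change in percent)."""
    positions_gained = old_rank - new_rank          # positive means the model moved up
    point_change = round(new_score - old_score, 1)  # absolute change in percentage points
    relative_change = round(100 * (new_score - old_score) / old_score, 1)
    return positions_gained, point_change, relative_change


for name, *row in MOVES:
    gained, points, relative = summarize(*row)
    print(f"{name}: {gained:+d} positions, {points:+.1f} points, {relative:+.1f}% relative")
```

Reading the relative column next to the point column shows why a similar point gain looks much larger for a low-scoring model than for one near the top.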

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Provider	Model	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
6	Anthropic	Claude Opus 4.8-xhigh	Model	56.5% ± 1.20%
7	OpenAI	gpt-5.4-2026-03-05-medium	Model	54.9% ± 1.02%
8	Anthropic	Claude Opus 4.7-high	Model	53.1% ± 1.45%
9	Cursor	Cursor	Agent	53.0% ± 0.53%
10	Anthropic	Claude Sonnet 4.6	Model	51.3% ± 0.55%
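The ± figures in the table presumably summarize the five runs per model, though the source does not say whether they are a standard deviation or a standard error. The sketch below computes both from five run scores; the scores are hypothetical values chosen for illustration, not published SWE-rebench data, and `summarize_runs` is an illustrative name.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)
    sem = stdev / len(scores) ** 0.5   # standard error of the mean
    return mean, stdev, sem


# Hypothetical resolved-rate percentages from five runs of one model (not real data)
runs = [61.8, 63.5, 62.9, 62.1, 63.2]

mean, stdev, sem = summarize_runs(runs)
print(f"mean {mean:.1f}%, stdev {stdev:.2f}, standard error {sem:.2f}")
```

Whichever statistic the benchmark reports, two entries whose intervals overlap, such as adjacent rows in the table, cannot be confidently ordered from a single day's ranking.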

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	0	$20.00
2	Claude Opus 4.8	55.7	59	$10.00
3	GPT-5.5	54.8	79	$11.25
4	Claude Opus 4.7	53.5	50	$10.00
5	GPT-5.4	51.4	174	$5.63
6	GLM-5.2	51.1	151	$2.15
7	Gemini 3.5 Flash	50.2	210	$3.38
8	Claude Sonnet 4.6	47.2	51	$6.00
9	Gemini 3.1 Pro Preview	46.5	131	$4.50
10	Qwen3.7 Max	46.0	196	$3.75
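The price-versus-intelligence trade-off described under The Value Frontier can be sketched from the ten rows above. A model sits on the frontier if no other model in the set is both at least as cheap and at least as high-scoring. This is an illustration over the top 10 only, not the full roster, and the function name `value_frontier` is an assumption for this example rather than anything Artificial Analysis publishes.

```python
# (model, intelligence score, blended $/1M tokens), copied from the table above
ROWS = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("Claude Opus 4.7", 53.5, 10.00),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 51.1, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
    ("Claude Sonnet 4.6", 47.2, 6.00),
    ("Gemini 3.1 Pro Preview", 46.5, 4.50),
    ("Qwen3.7 Max", 46.0, 3.75),
]


def value_frontier(rows):
    """Return the models not dominated on (lower price, higher score), cheapest first."""
    frontier = []
    best_score = float("-inf")
    # Sort by price ascending; on a price tie, consider the higher score first.
    for name, score, price in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:  # beats every model that is cheaper or equally priced
            frontier.append(name)
            best_score = score
    return frontier


print(value_frontier(ROWS))
```

Among these ten rows, the sweep keeps GLM-5.2, GPT-5.4, Claude Opus 4.8, and Claude Fable 5; every other entry is matched or beaten on score by a model that costs no more.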

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	210
2	Qwen3.7 Max	196
3	GPT-5.4	174
4	GPT-5.4 mini	164
5	GLM-5.2	151
6	Gemini 3.1 Pro Preview	131
7	GPT-5.2 Codex	127
8	DeepSeek V4 Flash	104
9	MiMo-V2.5	91
10	GPT-5.3 Codex	88

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	DeepSeek V4 Flash	$0.175
2	MiMo-V2.5	$0.175
3	MiniMax-M3	$0.525
4	DeepSeek V4 Pro	$0.544
5	MiMo-V2.5-Pro	$0.544
6	Nex-N2-Pro	$1.00
7	MiMo-V2-Pro	$1.50
8	GPT-5.4 mini	$1.69
9	Kimi K2.6	$1.71
10	Kimi K2.7 Code	$1.71
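The "3:1 input/output" blend in the caption above is most naturally read as a weighted average that counts three input tokens for every output token. The sketch below assumes that reading; the source does not spell out the formula, and the per-token prices in the usage line are hypothetical.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens, assuming a 3:1 input-to-output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens
print(f"${blended_price(1.00, 4.00):.2f} per 1M blended tokens")
```

Because input is weighted three times as heavily as output, a model with cheap input and expensive output ranks better on this measure than its output price alone would suggest.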