The Inference Report

June 3, 2026

The SWE-rebench rankings remain stable at the top: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, followed by Codex at 60.4% and Claude Code at 59.6%. On the Artificial Analysis index, Claude Opus 4.8 leads at 61.4 with GPT-5.5 second at 60.2. There is more churn below the leaders: Qwen3.7 Max enters at sixth place (56.6), while older GPT versions and specialized models shuffle downward.

The sharper signal is how far the two leaderboards disagree. Gemini 3.1 Pro Preview ranks fourth on Artificial Analysis (57.2) but tenth on SWE-rebench (51.1%), a 6.1-point gap in raw score. Kimi K2.6 shows a similar split: eighth on Artificial Analysis (53.9) but thirteenth on SWE-rebench (46.5%), a 7.4-point gap. GLM-5.1 diverges too, at 11th on SWE-rebench (50.7%) and 18th on Artificial Analysis (51.4). GLM-4.7 is listed as slipping one place on Artificial Analysis, from forty-eighth to forty-ninth, while its SWE-rebench score rises from 38.2% to 42.1%, a 3.9-point gain. Because the two scores sit on different scales, the rank gaps are more telling than the point gaps.

These splits suggest the benchmarks do not measure the same capabilities. SWE-rebench targets real-world software engineering tasks, while the Artificial Analysis index is a composite across coding, math, and reasoning. Without historical Artificial Analysis data from a prior snapshot, the apparent stability of that leaderboard's top positions cannot be confirmed. New entries such as Qwen3.7 Plus at eleventh place indicate that its pool of models is expanding rather than consolidating.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%
6	gpt-5.4-2026-03-05-medium	54.9%
7	Claude Opus 4.7-high	53.1%
8	Cursor	53.0%
9	Claude Sonnet 4.6-high	51.3%
10	Gemini 3.1 Pro Preview	51.1%
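
The five-runs-per-model protocol described above can be illustrated with a minimal Python sketch. It assumes the per-run results are combined as a simple mean with a standard error, which the source does not state. The function name `summarize_runs` and the five example rates are hypothetical, not SWE-rebench data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for repeated runs of one model.

    Assumption: runs are independent and averaged with equal weight.
    """
    if len(resolved_rates) < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(len(resolved_rates))
    return avg, sem


# Hypothetical resolved rates (%) from five runs of a single model.
example_runs = [60.0, 62.0, 61.0, 63.0, 59.0]
avg, sem = summarize_runs(example_runs)
print(f"{avg:.1f}% +/- {sem:.2f}")
```

Under this assumption, two models whose means differ by less than roughly their combined standard errors would be hard to separate on a single day's ranking.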

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Opus 4.8	61.4	59	$10.94
2	GPT-5.5	60.2	67	$11.25
3	Claude Opus 4.7	57.3	53	$10.94
4	Gemini 3.1 Pro Preview	57.2	123	$4.50
5	GPT-5.4	56.8	79	$5.63
6	Qwen3.7 Max	56.6	198	$3.75
7	Gemini 3.5 Flash	55.3	216	$3.38
8	Kimi K2.6	53.9	39	$1.71
9	MiMo-V2.5-Pro	53.8	46	$0.544
10	GPT-5.3 Codex	53.6	74	$4.81
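
The "lower price, higher score is better" reading of the value frontier can be sketched as a Pareto filter over the ten rows above. This is an illustrative sketch, not how the charts are produced: the function name `value_frontier` is hypothetical, and a model counts as on the frontier when no other model in this table is both at least as cheap and strictly higher scoring.

```python
# (model, intelligence score, blended $/1M) copied from the table above.
MODELS = [
    ("Claude Opus 4.8", 61.4, 10.94),
    ("GPT-5.5", 60.2, 11.25),
    ("Claude Opus 4.7", 57.3, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GPT-5.4", 56.8, 5.63),
    ("Qwen3.7 Max", 56.6, 3.75),
    ("Gemini 3.5 Flash", 55.3, 3.38),
    ("Kimi K2.6", 53.9, 1.71),
    ("MiMo-V2.5-Pro", 53.8, 0.544),
    ("GPT-5.3 Codex", 53.6, 4.81),
]


def value_frontier(models):
    """Return names of models not beaten on both price and score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal price, best score first.
    for name, score, price in sorted(models, key=lambda m: (m[2], -m[1])):
        # Keep a model only if it beats every cheaper model on score.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


print(value_frontier(MODELS))
```

On these ten rows the filter keeps MiMo-V2.5-Pro, Kimi K2.6, Gemini 3.5 Flash, Qwen3.7 Max, Gemini 3.1 Pro Preview, and Claude Opus 4.8. The other four are each matched or undercut on price by a higher-scoring model.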

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Step 3.7 Flash	402
2	Gemini 3.5 Flash	216
3	MiniMax-M2.5	200
4	Qwen3.7 Max	198
5	Grok 4.20 0309 v2	187
6	Gemini 3 Flash Preview	180
7	GPT-5.1 Codex	175
8	GPT-5 Codex	173
9	Grok 4.20 0309	166
10	Qwen3.6 35B A3B	162

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	MiMo-V2.5	$0.175
3	DeepSeek V4 Flash	$0.175
4	Hy3-preview	$0.20
5	DeepSeek V3.2	$0.337
6	Step 3.7 Flash	$0.438
7	GPT-5.4 nano	$0.463
8	MiniMax-M2.7	$0.525
9	KAT Coder Pro V2	$0.525
10	MiniMax-M2.5	$0.525
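
The blended cost figure can be illustrated with a short sketch, assuming "3:1 input/output" means a weighted average of three parts input-token price to one part output-token price. The function name `blended_price` and the example prices are hypothetical and do not correspond to any model listed above.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Blend per-1M-token prices as a weighted average.

    Assumption: the 3:1 ratio weights input tokens three times as heavily
    as output tokens.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
print(f"${blended_price(1.00, 4.00):.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still rank well on the blended figure.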