The Inference Report

May 31, 2026

The SWE-rebench rankings are stable at the top: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, followed by Codex at 60.4% and Claude Code at 59.6%, all unchanged from the previous cycle.

The middle tier is where the two leaderboards disagree. Gemini 3.1 Pro Preview ranks 4th on the Artificial Analysis Intelligence Index at 57.2 but 10th on SWE-rebench at 51.1%, a 6.1-point gap and the largest among models visible in the rankings. Kimi K2.6 splits more sharply: 8th on Artificial Analysis at 53.9, but position 13 on SWE-rebench at 46.5%, a 7.4-point gap. GLM-5.1 held ground at 50.7% on SWE-rebench, maintaining position 11, while rising slightly on Artificial Analysis from 51.4. GLM-4.7 presents a puzzling divergence: it improved from 38.2 to 42.1 on Artificial Analysis (rising from position 47), yet on SWE-rebench it remained at 38.2% in position 14. That suggests the two benchmarks may measure different problem classes, or that the Artificial Analysis score reflects a broader evaluation window.

The consistency of scores across both benchmarks for most models in the top 10 suggests the measurements are reliable, but the divergence for the Gemini and Kimi models warrants scrutiny: either the benchmarks are not testing equivalent code-solving difficulty, or recent model updates affected one benchmark more than the other. The lack of movement in the top five positions across both metrics suggests the frontier has stabilized, though the mid-tier churn indicates active differentiation among models in the 45-55% range.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%
6	gpt-5.4-2026-03-05-medium	54.9%
7	Claude Opus 4.7-high	53.1%
8	Cursor	53.0%
9	Claude Sonnet 4.6-high	51.3%
10	Gemini 3.1 Pro Preview	51.1%
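
As a rough illustration of why repeated runs matter, the sketch below averages five per-run scores and reports the standard error of the mean. The run values are invented for illustration and the helper name `summarize_runs` is hypothetical; SWE-rebench's actual aggregation method may differ.

```python
import statistics

def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for per-run scores in percent."""
    mean = statistics.mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(resolved_rates) / len(resolved_rates) ** 0.5
    return mean, sem

# Hypothetical scores from five runs of one model (not real leaderboard data).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
mean, sem = summarize_runs(runs)
print(f"{mean:.1f}% +/- {sem:.1f}")  # 51.0% +/- 0.7
```

Under these made-up numbers, two models whose means differ by less than a point would be hard to separate, which is one reason small gaps between adjacent ranks deserve caution.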

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Opus 4.8	61.4	65	$10.94
2	GPT-5.5	60.2	59	$11.25
3	Claude Opus 4.7	57.3	60	$10.94
4	Gemini 3.1 Pro Preview	57.2	137	$4.50
5	GPT-5.4	56.8	90	$5.63
6	Qwen3.7 Max	56.6	188	$3.75
7	Gemini 3.5 Flash	55.3	218	$3.38
8	Kimi K2.6	53.9	42	$1.71
9	MiMo-V2.5-Pro	53.8	51	$0.544
10	GPT-5.3 Codex	53.6	85	$4.81
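
One way to read this table as a value frontier is to keep only the models that no other model beats on both price and score. The sketch below does that for the ten rows above; the `Model` type and `value_frontier` function are illustrative names, not part of any Artificial Analysis tooling, and the frontier is computed over these ten rows only.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    score: float  # Intelligence Index score
    price: float  # blended $ per 1M tokens

# Rows copied from the Intelligence Index table above.
MODELS = [
    Model("Claude Opus 4.8", 61.4, 10.94),
    Model("GPT-5.5", 60.2, 11.25),
    Model("Claude Opus 4.7", 57.3, 10.94),
    Model("Gemini 3.1 Pro Preview", 57.2, 4.50),
    Model("GPT-5.4", 56.8, 5.63),
    Model("Qwen3.7 Max", 56.6, 3.75),
    Model("Gemini 3.5 Flash", 55.3, 3.38),
    Model("Kimi K2.6", 53.9, 1.71),
    Model("MiMo-V2.5-Pro", 53.8, 0.544),
    Model("GPT-5.3 Codex", 53.6, 4.81),
]

def value_frontier(models):
    """Return models not dominated on (lower price, higher score), cheapest first."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a price tie, visit the higher score first.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        # A model stays only if it scores higher than every cheaper (or equal-price) model.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier

for m in value_frontier(MODELS):
    print(f"{m.name}: {m.score} at ${m.price}/1M")
```

On these rows the frontier runs from MiMo-V2.5-Pro at the cheap end to Claude Opus 4.8 at the top; GPT-5.5, Claude Opus 4.7, GPT-5.4, and GPT-5.3 Codex each have a cheaper or equally priced alternative with a higher score.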

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	218
2	Grok 4.20 0309	218
3	Grok 4.20 0309 v2	216
4	Gemini 3 Flash Preview	203
5	MiniMax-M2.5	191
6	Qwen3.7 Max	188
7	GPT-5.4 mini	183
8	GPT-5.1 Codex	182
9	GPT-5 Codex	170
10	Grok 4.3	161
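
The eligibility rule can be sketched as a filter followed by a sort. The first three rows below come from the Intelligence Index table; the fourth is a hypothetical model added only to show the cutoff, and treating the minimum as inclusive (score >= 40) is an assumption.

```python
MIN_SCORE = 40  # threshold stated in the leaderboard description

# (name, intelligence score, output tok/s)
rows = [
    ("Gemini 3.5 Flash", 55.3, 218),
    ("Qwen3.7 Max", 56.6, 188),
    ("Kimi K2.6", 53.9, 42),
    ("hypothetical-small-model", 35.0, 300),  # invented row, below the threshold
]

def speed_ranking(rows, min_score=MIN_SCORE):
    """Drop models under the score threshold, then rank the rest by tok/s, fastest first."""
    eligible = [r for r in rows if r[1] >= min_score]
    return sorted(eligible, key=lambda r: r[2], reverse=True)

for rank, (name, score, tps) in enumerate(speed_ranking(rows), start=1):
    print(rank, name, tps)
```

The hypothetical model is the fastest in the list but never appears in the ranking, which is the effect of the minimum-score rule.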

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	MiMo-V2.5	$0.175
3	DeepSeek V4 Flash	$0.175
4	Hy3-preview	$0.20
5	DeepSeek V3.2	$0.337
6	GPT-5.4 nano	$0.463
7	MiniMax-M2.7	$0.525
8	KAT Coder Pro V2	$0.525
9	MiniMax-M2.5	$0.525
10	MiMo-V2.5-Pro	$0.544
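
A blended price of this kind can be sketched as a weighted average. The version below assumes "3:1 input/output" means three parts input tokens to one part output tokens; the function name and the example prices are hypothetical, not taken from any provider's price list.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of per-1M-token prices, 3 parts input to 1 part output by default."""
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total

# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
print(blended_price(1.00, 4.00))  # 1.75
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can rank better here than its output price alone would suggest.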