The Inference Report

June 2, 2026

The SWE-rebench rankings show little change in who sits at the top: the same fourteen models hold positions 1-14 in both periods, though their order within that group shifted. Gemini 3.1 Pro Preview dropped from 4th to 10th on SWE-rebench, declining from 57.2% to 51.1%, while Kimi K2.6 fell from 8th to 13th with a 7.4-point loss.

The Artificial Analysis benchmark, by contrast, shows churn throughout its 384-entry list, though the pattern suggests reordering rather than genuine performance shifts. GLM-4.7 rose from 47th to 14th even as its score fell from 42.1 to 38.2. A model can climb while scoring lower only if the models around it lost more ground, which points to a ranking methodology change or score recalibration rather than model improvement. The entry of Step 3.7 Flash at position 47 and the near-universal reshuffling below the top 100 are consistent with the benchmark having adjusted its evaluation criteria, weighting scheme, or model test set.

Without documentation of methodology changes between periods, it remains unclear whether the observed movements reflect actual performance variation or a reorganization of the leaderboard itself. The stability of the SWE-rebench top group contrasts sharply with the volatility of Artificial Analysis, raising questions about benchmark sensitivity and whether either ranking reliably tracks incremental progress in code generation capability.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%
6	gpt-5.4-2026-03-05-medium	54.9%
7	Claude Opus 4.7-high	53.1%
8	Cursor	53.0%
9	Claude Sonnet 4.6-high	51.3%
10	Gemini 3.1 Pro Preview	51.1%
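
The five-run protocol described above can be illustrated with a minimal sketch. It assumes the per-run scores are averaged and that run-to-run spread is summarized as a standard error; the function name and the run values are hypothetical and are not SWE-rebench data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Return (mean, standard error of the mean) for per-run scores in percent."""
    n = len(run_scores)
    avg = mean(run_scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(run_scores) / sqrt(n)
    return avg, sem


# Hypothetical scores from five runs of one model (illustrative only).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this assumption, two models whose averages differ by less than about one standard error are hard to separate, which is worth keeping in mind when adjacent rows differ by a few tenths of a point.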

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	Claude Opus 4.8	61.4	60	$10.94
2	GPT-5.5	60.2	66	$11.25
3	Claude Opus 4.7	57.3	56	$10.94
4	Gemini 3.1 Pro Preview	57.2	132	$4.50
5	GPT-5.4	56.8	79	$5.63
6	Qwen3.7 Max	56.6	201	$3.75
7	Gemini 3.5 Flash	55.3	227	$3.38
8	Kimi K2.6	53.9	40	$1.71
9	MiMo-V2.5-Pro	53.8	53	$0.544
10	GPT-5.3 Codex	53.6	84	$4.81
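
The price-versus-intelligence trade-off can be made concrete with the ten rows above. The sketch below treats a model as being on the value frontier when no other listed model scores at least as high at an equal or lower price. That definition and the function name are assumptions for illustration, and the result covers only these ten rows, not the full leaderboard.

```python
def value_frontier(models):
    """Return names of models not dominated on (higher score, lower price)."""
    # Sort cheapest first; on a price tie, put the higher score first.
    ordered = sorted(models, key=lambda m: (m[2], -m[1]))
    frontier = []
    best_score = float("-inf")
    for name, score, price in ordered:
        # A model joins the frontier only if it beats every cheaper model.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# (model, intelligence score, blended $/1M) copied from the table above.
rows = [
    ("Claude Opus 4.8", 61.4, 10.94),
    ("GPT-5.5", 60.2, 11.25),
    ("Claude Opus 4.7", 57.3, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GPT-5.4", 56.8, 5.63),
    ("Qwen3.7 Max", 56.6, 3.75),
    ("Gemini 3.5 Flash", 55.3, 3.38),
    ("Kimi K2.6", 53.9, 1.71),
    ("MiMo-V2.5-Pro", 53.8, 0.544),
    ("GPT-5.3 Codex", 53.6, 4.81),
]

frontier = value_frontier(rows)
print(frontier)
```

Among these ten rows, GPT-5.5, Claude Opus 4.7, GPT-5.4, and GPT-5.3 Codex fall off the frontier because a listed model matches or beats each of them on score at an equal or lower price.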

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Step 3.7 Flash	408
2	Gemini 3.5 Flash	227
3	Grok 4.20 0309 v2	213
4	Qwen3.7 Max	201
5	GPT-5 Codex	191
6	Gemini 3 Flash Preview	186
7	Grok 4.20 0309	184
8	MiniMax-M2.5	178
9	GPT-5.1 Codex	174
10	GPT-5.4 mini	173

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	MiMo-V2.5	$0.175
3	DeepSeek V4 Flash	$0.175
4	Hy3-preview	$0.20
5	DeepSeek V3.2	$0.337
6	Step 3.7 Flash	$0.438
7	GPT-5.4 nano	$0.463
8	MiniMax-M2.7	$0.525
9	KAT Coder Pro V2	$0.525
10	MiniMax-M2.5	$0.525
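
The blended figure in this table can be read as a weighted average of input and output token prices. The sketch below assumes "3:1 input/output" means three parts input price to one part output price; the function name and the example prices are hypothetical and do not describe any listed model.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average of per-1M-token prices, 3:1 input to output by default."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
price = blended_price(1.00, 4.00)
print(f"${price:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can look cheaper here than it would on an output-heavy workload.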