The Inference Report

May 27, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, against #11 on the Artificial Analysis index at 52.9. The two numbers sit on different scales (a task score in percent versus a composite index), so the 12.4-point gap is best read as a difference between the benchmarks rather than a gain over time; it points to meaningful differences in how the two evaluate code-solving capability. The broader SWE-rebench leaderboard is tightly clustered near the top: gpt-5.2-2025-12-11-medium, GLM-5, Junie, and gpt-5.4-2026-03-05-medium all fall between 62.8% and 64.4%, a spread of 1.6 points, suggesting convergence among frontier models on this task.

Notable climbers include GLM-5 (from #19 to #3, a 13-point jump), Kimi K2.5 (from #31 to #16, up 11.7 points), and Kimi K2 Thinking (from #56 to #21, up 16.5 points), suggesting that Chinese-developed models have made tangible progress on repository-level code tasks. Gemini 3.1 Pro Preview moves the other way: it ranks #3 on Artificial Analysis at 57.2 but #7 on SWE-rebench at 62.3%, illustrating that benchmark choice materially affects perceived ranking. Claude Sonnet 4.6 ranks #14 on Artificial Analysis at 51.7 and #9 on SWE-rebench at 60.7%, suggesting that these models are stronger on the specific problem distribution in SWE-rebench than on Artificial Analysis's evaluation.

The divergence between the two benchmarks raises a methodological question: SWE-rebench appears to emphasize end-to-end repository modification and integration, while Artificial Analysis may weight reasoning and breadth differently. Without access to the evaluation protocols themselves, it is difficult to assess whether one benchmark has higher discriminative validity for production code work.

Cole Brennan

The Value Frontier

Price vs Intelligence — lower price, higher score is better.

Fast and Smart

Speed vs Intelligence — higher on both axes is better.

Efficiency

Speed vs Price — faster and cheaper is better.

SWE-rebench

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	Junie	62.8%
5	gpt-5.4-2026-03-05-medium	62.8%
6	GLM-5.1	62.7%
7	Gemini 3.1 Pro Preview	62.3%
8	DeepSeek-V3.2	60.9%
9	Claude Sonnet 4.6	60.7%
10	Claude Sonnet 4.5	60.0%
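
To make the five-run protocol concrete, here is a minimal sketch of how repeated runs might be summarized. The source does not say how SWE-rebench aggregates its runs, so the mean and standard error used here are an assumption, and the five scores are hypothetical values chosen for illustration, not real results.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) for repeated runs."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    sem = stdev / math.sqrt(len(scores))  # standard error of the mean
    return mean, stdev, sem


# Hypothetical scores (percent) for one model over five runs; not real data.
runs = [64.0, 66.0, 65.0, 67.0, 63.0]
mean, stdev, sem = summarize_runs(runs)
print(f"mean={mean:.1f}%  stdev={stdev:.2f}  sem={sem:.2f}")
```

If run-to-run spread were on this order, gaps of a few tenths of a point between neighboring rows would fall within noise, which is one reason to read closely clustered ranks with caution.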

Intelligence Index — Leaderboard

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#	Model	Score	tok/s	$/1M
1	GPT-5.5	60.2	72	$11.25
2	Claude Opus 4.7	57.3	54	$10.94
3	Gemini 3.1 Pro Preview	57.2	130	$4.50
4	GPT-5.4	56.8	90	$5.63
5	Qwen3.7 Max	56.6	206	$3.75
6	Gemini 3.5 Flash	55.3	233	$3.38
7	Kimi K2.6	53.9	32	$1.71
8	MiMo-V2.5-Pro	53.8	51	$1.35
9	GPT-5.3 Codex	53.6	82	$4.81
10	Grok 4.3	53.2	196	$1.56
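
The score and price columns above are enough to sketch a price-versus-intelligence frontier. The example below marks a model as dominated when another model scores at least as high and costs no more, with at least one of the two strictly better. This is one common definition of a Pareto frontier; whether the report's chart is built the same way is an assumption. Only the ten rows shown here are used, so the frontier over the full dataset may differ.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # Intelligence Index score
    price: float  # blended USD per 1M tokens


# Rows copied from the top-10 table above.
MODELS = [
    Model("GPT-5.5", 60.2, 11.25),
    Model("Claude Opus 4.7", 57.3, 10.94),
    Model("Gemini 3.1 Pro Preview", 57.2, 4.50),
    Model("GPT-5.4", 56.8, 5.63),
    Model("Qwen3.7 Max", 56.6, 3.75),
    Model("Gemini 3.5 Flash", 55.3, 3.38),
    Model("Kimi K2.6", 53.9, 1.71),
    Model("MiMo-V2.5-Pro", 53.8, 1.35),
    Model("GPT-5.3 Codex", 53.6, 4.81),
    Model("Grok 4.3", 53.2, 1.56),
]


def pareto_frontier(models):
    """Return the models not dominated on (higher score, lower price), cheapest first."""
    frontier = []
    for m in models:
        dominated = any(
            o.score >= m.score
            and o.price <= m.price
            and (o.score > m.score or o.price < m.price)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.price)


frontier_names = [m.name for m in pareto_frontier(MODELS)]
for name in frontier_names:
    print(name)
```

Within these ten rows, GPT-5.4, GPT-5.3 Codex, and Grok 4.3 drop out because another listed model matches or beats each of them on both axes.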

Speed — Leaderboard

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#	Model	tok/s
1	Gemini 3.5 Flash	233
2	Qwen3.7 Max	206
3	Gemini 3 Flash Preview	204
4	GPT-5 Codex	202
5	GPT-5.1 Codex	201
6	Grok 4.3	196
7	Grok 4.20 0309 v2	188
8	Grok 4.20 0309	185
9	Qwen3.6 35B A3B	170
10	GPT-5.4 mini	165

Price — Leaderboard

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#	Model	$/1M
1	MiMo-V2-Flash	$0.15
2	DeepSeek V4 Flash	$0.175
3	Hy3-preview	$0.20
4	DeepSeek V3.2	$0.337
5	MiMo-V2.5	$0.408
6	GPT-5.4 nano	$0.463
7	MiniMax-M2.7	$0.525
8	KAT Coder Pro V2	$0.525
9	MiniMax-M2.5	$0.525
10	DeepSeek V4 Pro	$0.544
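
For readers who want to reproduce a blended figure, the sketch below reads "3:1 input/output" as a weighted average with three parts input price to one part output price. That reading is an assumption, and the prices in the example are hypothetical rather than any listed model's rates.

```python
def blended_price(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted average of input and output prices, in USD per 1M tokens."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
example = blended_price(input_price=1.00, output_price=5.00)
print(f"${example:.2f} per 1M tokens")  # (3 * 1.00 + 1 * 5.00) / 4 = 2.00
```

Because the weights favor input tokens, workloads that generate long outputs will see a higher effective cost than the blended figure suggests.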