Claude Opus 4.6 tops SWE-rebench at 65.3% yet ranks only 9th on the Artificial Analysis composite at 53, the sharpest divergence between the two leaderboards among the models listed below. The top seven SWE-rebench positions cluster tightly between 62.3% and 65.3%: gpt-5.2-2025-12-11-medium at 64.4%, GLM-5, Junie, and gpt-5.4-2026-03-05-medium tied at 62.8%, GLM-5.1 at 62.7%, and Gemini 3.1 Pro Preview at 62.3%.

The two benchmarks order models quite differently. GPT-5.5 leads Artificial Analysis at 60.2 while Claude Opus 4.6 sits 9th at 53, suggesting the two evaluations measure different capabilities or that SWE-rebench weights certain problem classes differently. The gap is even wider for the Chinese models: GLM-5 places 3rd on SWE-rebench (62.8%) against 17th on Artificial Analysis (49.8), GLM-4.7 places 14th against 44th (58.7% vs. 42.1), and Kimi K2 Thinking places 21st against 54th (57.4% vs. 40.9), indicating broad-based strength on the coding benchmark that the composite index does not capture. Gemini 3.1 Pro Preview shows the opposite pattern, placing 3rd on Artificial Analysis (57.2) but only 7th on SWE-rebench (62.3), which may reflect that a coding-specific benchmark rewards different optimization choices than a general-purpose evaluation.

The limited methodological detail published for either benchmark makes these divergences hard to interpret. Neither source discloses test set size, problem distribution, whether solutions are judged on correctness alone or also on code quality, or how edge cases are handled, so it is unclear whether the gap between Claude's dominance on SWE-rebench and its mid-tier Artificial Analysis position reflects genuine capability differences or artifacts of evaluation design.
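To make the cross-leaderboard comparison concrete, here is a minimal Python sketch that pairs each model's position on the two lists and reports the rank delta. The dictionaries are hand-filled from the figures cited above; nothing about either benchmark's internals is assumed.

```python
# Hand-filled from the rankings cited above: (rank, score) per leaderboard.
# SWE-rebench scores are % of tasks resolved; Artificial Analysis scores are
# a composite index, so the two score columns are NOT on the same scale.
swe_rebench = {
    "Claude Opus 4.6": (1, 65.3),
    "GLM-5": (3, 62.8),
    "Gemini 3.1 Pro Preview": (7, 62.3),
}
artificial_analysis = {
    "Claude Opus 4.6": (9, 53.0),
    "GLM-5": (17, 49.8),
    "Gemini 3.1 Pro Preview": (3, 57.2),
}

for model in swe_rebench:
    swe_rank, swe_score = swe_rebench[model]
    aa_rank, aa_score = artificial_analysis[model]
    print(f"{model}: SWE-rebench #{swe_rank} ({swe_score}%) vs "
          f"Artificial Analysis #{aa_rank} ({aa_score}) -> "
          f"rank delta {aa_rank - swe_rank:+d}")
```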
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
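As noted above, SWE-rebench runs each model five times to smooth out stochastic variance, but the published leaderboard shows a single figure per model and the exact aggregation rule is not documented here. The sketch below simply assumes the reported score is the mean resolved rate across the five runs, with the per-run numbers invented purely for illustration.

```python
# Hypothetical per-run resolved rates (%) for one model across five runs.
# The real harness and its aggregation rule are not published in this digest;
# this only shows how a mean and spread over repeated runs might be computed.
runs = [64.8, 65.9, 64.5, 66.1, 65.2]

mean = sum(runs) / len(runs)
variance = sum((r - mean) ** 2 for r in runs) / (len(runs) - 1)
std_dev = variance ** 0.5

print(f"mean resolved rate: {mean:.1f}%  (sample std dev {std_dev:.2f})")
```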
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 63 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 64 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 131 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 84 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 41 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 55 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 80 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 86 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 49 | $10.94 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
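The score above is described only as a composite across coding, math, and reasoning benchmarks; the category weights are not given. The sketch below assumes an unweighted mean of per-category scores, purely to illustrate how such a composite could be formed, not the index's actual formula.

```python
# Hypothetical per-category scores for one model. The real Artificial Analysis
# index may use different categories, weights, and normalization.
categories = {"coding": 58.0, "math": 63.5, "reasoning": 59.1}

composite = sum(categories.values()) / len(categories)
print(f"unweighted composite: {composite:.1f}")

# A weighted variant, if one wanted to emphasize coding (weights are made up):
weights = {"coding": 0.5, "math": 0.25, "reasoning": 0.25}
weighted = sum(categories[c] * weights[c] for c in categories)
print(f"weighted composite:   {weighted:.1f}")
```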
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 197 |
| 2 | GPT-5.1 Codex | 183 |
| 3 | GPT-5.4 mini | 182 |
| 4 | Qwen3.6 35B A3B | 182 |
| 5 | GPT-5 Codex | 171 |
| 6 | Hy3-preview | 159 |
| 7 | Qwen3.5 122B A10B | 159 |
| 8 | GPT-5.4 nano | 148 |
| 9 | MiMo-V2-Flash | 143 |
| 10 | Gemini 3.1 Pro Preview | 131 |
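Output tokens per second is a throughput figure, and measurement conventions differ (for instance, whether time-to-first-token is included); the table does not say which convention Artificial Analysis uses. The sketch below shows one common way to compute it from a streamed response, timing only the generation window.

```python
import time

def output_tokens_per_second(token_stream):
    """Consume a stream of tokens and return throughput over the generation window.

    `token_stream` is any iterable yielding tokens; timing starts at the first
    token, so time-to-first-token is excluded. This is one convention among
    several, not necessarily the one used for the table above.
    """
    count = 0
    start = None
    for _ in token_stream:
        if start is None:
            start = time.monotonic()
        count += 1
    elapsed = time.monotonic() - start if start is not None else 0.0
    return count / elapsed if elapsed > 0 else float("nan")

# Example with a fake stream yielding 50 tokens ~20 ms apart (roughly 50 tok/s).
def fake_stream(n=50, delay=0.02):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(f"{output_tokens_per_second(fake_stream()):.0f} tok/s")
```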
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | MiMo-V2.5 | $0.72 |
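The blended figure combines input and output prices at the 3:1 ratio stated above. A minimal sketch of that calculation, with the per-direction prices invented for illustration (the table publishes only the blended number):

```python
def blended_price_per_1m(input_per_1m: float, output_per_1m: float) -> float:
    """Blend input/output prices at a 3:1 input:output token ratio.

    Three of every four tokens are assumed to be input, matching the ratio
    stated in the caption above; real workloads vary.
    """
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical prices (USD per 1M tokens), not taken from any provider's sheet.
print(f"${blended_price_per_1m(0.10, 0.30):.3f} per 1M blended tokens")
```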