The Inference Report

May 2, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous ranking, while the field beneath it shows modest consolidation rather than dramatic reshuffling. The most notable movement comes from Chinese models: GLM-5 jumped from rank 17 to rank 3 (49.8 to 62.8 percent), GLM-5.1 climbed from rank 14 to rank 6 (51.4 to 62.7 percent), and GLM-4.7 advanced from rank 44 to rank 14 (42.1 to 58.7 percent), suggesting systematic improvement in that family's code-solving capability. Kimi K2.5 rose from rank 29 to rank 16 (46.8 to 58.5 percent) and Kimi K2 Thinking moved from rank 54 to rank 21 (40.9 to 57.4 percent), indicating progress across Kimi's lineup as well.

Claude Sonnet 4.6 improved from rank 12 to rank 9 (51.7 to 60.7 percent), while Gemini 3.1 Pro Preview slipped from rank 3 to rank 7 even as its score rose from 57.2 to 62.3 percent, a sign that the field around it improved faster than it did. The Artificial Analysis index tells a different story: GPT-5.5 leads at 60.2, with Claude Opus 4.7 at rank 2 (57.3 points) and Claude Opus 4.6 down at rank 9 (53 points), revealing substantial disagreement between the two evaluations over which models are strongest.

That divergence, with SWE-rebench emphasizing repository-level problem solving and Artificial Analysis aggregating coding, math, and reasoning benchmarks into a single composite, warrants scrutiny of each methodology before treating either ranking as definitive for capability assessment.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
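
To make the five-run protocol concrete, here is a minimal sketch of how repeated runs can be reduced to a single reported score. The model names and per-run numbers are invented for illustration; this is not SWE-rebench's actual data or pipeline.

```python
from statistics import mean, stdev

# Hypothetical per-run resolved rates for two models, five independent runs
# each on the same task set. Invented numbers, not SWE-rebench results.
runs = {
    "model-a": [0.651, 0.648, 0.655, 0.660, 0.651],
    "model-b": [0.601, 0.612, 0.598, 0.609, 0.605],
}

for model, scores in runs.items():
    # Averaging over repeated runs smooths out stochastic variance in agentic
    # evaluations; the standard deviation shows how noisy a single run is.
    print(f"{model}: mean {mean(scores):.1%}, stdev {stdev(scores):.2%}")
```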

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | GPT-5.5 | 60.2 | 73 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 51 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 86 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 25 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 63 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 81 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 205 | $1.56 |
| 9 | Claude Opus 4.6 | 53 | 49 | $10.00 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
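
Artificial Analysis does not spell out its weighting here, so the following is only a generic sketch of how a composite index of this kind can be assembled: score each benchmark family, then take a weighted average. The benchmark names, scores, and weights are assumptions for illustration.

```python
# Generic composite-index sketch: weighted average over benchmark families.
# Inputs and weights below are placeholders, not Artificial Analysis's data.
scores = {"coding": 58.0, "math": 71.0, "reasoning": 64.0}
weights = {"coding": 0.4, "math": 0.3, "reasoning": 0.3}  # must sum to 1.0

composite = sum(scores[name] * weights[name] for name in scores)
print(f"composite index: {composite:.1f}")  # 63.7 with these made-up inputs
```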

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| # | Model | tok/s |
|---|-------|-------|
| 1 | Grok 4.3 | 205 |
| 2 | Qwen3.6 35B A3B | 187 |
| 3 | Gemini 3 Flash Preview | 184 |
| 4 | GPT-5 Codex | 178 |
| 5 | GPT-5.4 mini | 174 |
| 6 | GPT-5.1 Codex | 174 |
| 7 | GPT-5.4 nano | 157 |
| 8 | Qwen3.5 122B A10B | 153 |
| 9 | Gemini 3.1 Pro Preview | 132 |
| 10 | MiMo-V2-Flash | 131 |

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
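
The blended figure above weights input and output prices at the stated 3:1 ratio, so three quarters of the blended token volume is billed at the input rate and one quarter at the output rate. A minimal sketch of that arithmetic, using hypothetical per-1M-token prices rather than any provider's published rates:

```python
def blended_price_per_1m(input_price: float, output_price: float, ratio: float = 3.0) -> float:
    """Blend input/output prices per 1M tokens at a given input:output ratio."""
    return (ratio * input_price + output_price) / (ratio + 1.0)

# Hypothetical prices: $0.40 per 1M input tokens, $1.60 per 1M output tokens.
# At 3:1 this blends to (3 * 0.40 + 1.60) / 4 = $0.70 per 1M tokens.
print(f"${blended_price_per_1m(0.40, 1.60):.2f}")
```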