Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the second tier has tightened considerably: gpt-5.2-2025-12-11-medium, GLM-5, gpt-5.4-2026-03-05-medium, and GLM-5.1 all cluster within roughly a point of each other in the 62-64% range.

The most significant movement is in the mid-tier. GLM-4.7 climbed from rank 42 (42.1%) to rank 14 (58.7%), a gain of 16.6 points; Kimi K2.5 jumped from rank 27 (46.8%) to rank 16 (58.5%); and Kimi K2 Thinking advanced from rank 51 (40.9%) to rank 21 (57.4%). These look like genuine improvements in coding task resolution rather than ranking artifacts. Gemini 3.1 Pro Preview dropped from rank 3 to rank 6 despite holding 62.3%, reflecting score compression in the upper tier rather than performance degradation.

On the Artificial Analysis index, the rankings remain relatively stable at the extremes: GPT-5.5 continues to lead at 60.2 and Claude Opus 4.7 sits at 57.3. The divergence between the two leaderboards is pronounced: Claude Opus 4.6 scores 65.3% on SWE-rebench but only 53 on Artificial Analysis, suggesting the benchmarks measure distinct capabilities or that SWE-rebench has a different task difficulty distribution. Without visibility into whether the SWE-rebench test set changed or the models were simply retested, the GLM and Kimi gains warrant scrutiny as to whether they reflect algorithmic advances or evaluation variance.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
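The five-run averaging described above can be sketched as follows. The per-run numbers here are hypothetical, chosen only to illustrate how run-to-run noise is absorbed into a single reported resolution rate:

```python
# Hypothetical per-run resolution rates for one model on SWE-rebench-style
# tasks. The benchmark reports a score averaged over five independent runs
# to account for stochastic variance; these numbers are illustrative.
from statistics import mean, stdev

runs = [0.641, 0.655, 0.649, 0.660, 0.660]  # fraction of tasks resolved per run

score = mean(runs)    # reported resolution rate
spread = stdev(runs)  # run-to-run variation the averaging absorbs

print(f"resolved: {score:.1%} (±{spread:.1%} across {len(runs)} runs)")
# → resolved: 65.3% (±0.8% across 5 runs)
```

Reporting the mean of five runs keeps a single lucky or unlucky run from shifting a model's rank, which matters when the second tier is separated by fractions of a point.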
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 84 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 62 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 135 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 86 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 139 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 66 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 91 | $4.81 |
| 8 | Claude Opus 4.6 | 53 | 59 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 34 | $2.92 |
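The source does not state how Artificial Analysis weights its coding, math, and reasoning components; as a hedged illustration only, a composite index can be sketched as an unweighted mean of per-domain scores (all numbers below are hypothetical):

```python
# Hypothetical per-domain scores. The real index's weighting is not given
# in the source, so an unweighted mean is used purely for illustration.
domain_scores = {"coding": 62.0, "math": 58.5, "reasoning": 60.1}

composite = sum(domain_scores.values()) / len(domain_scores)
print(round(composite, 1))  # → 60.2
```

A composite like this explains the SWE-rebench divergence: a model strong on coding but weaker on math or reasoning can score 65.3% on a coding-only benchmark yet land lower on a blended index.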
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 200 |
| 2 | GPT-5 Codex | 198 |
| 3 | Qwen3.6 35B A3B | 197 |
| 4 | GPT-5.4 mini | 182 |
| 5 | GPT-5.4 nano | 163 |
| 6 | GPT-5.1 Codex | 159 |
| 7 | Qwen3.5 122B A10B | 156 |
| 8 | GPT-5.1 | 153 |
| 9 | Gemini 3 Pro Preview | 141 |
| 10 | Kimi K2.6 | 139 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
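The blended figure above weights input and output token prices 3:1. A minimal sketch of that calculation, using hypothetical per-direction prices (the table reports only the blended number):

```python
# Blended $/1M tokens at a 3:1 input:output ratio. The per-direction
# prices in the example call are hypothetical, not from the table.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Weighted average of per-1M-token prices: 3 parts input, 1 part output."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# e.g. $0.10/1M input and $0.30/1M output blend to $0.15/1M
print(blended_price(0.10, 0.30))  # → 0.15
```

The 3:1 weighting reflects typical workloads, where prompts (input) consume several times more tokens than completions (output), so input pricing dominates the effective cost.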