Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous round, while the Artificial Analysis index puts Claude Opus 4.7 first at 57.3, suggesting the two benchmarks are measuring different problem distributions or difficulty levels. The SWE-rebench leaderboard has consolidated into a narrow band: the top six models cluster between 62.3% and 65.3%, with gpt-5.2-2025-12-11-medium at 64.4% and GLM-5 and gpt-5.4-2026-03-05-medium both at 62.8%, a sign of diminishing returns as models approach saturation on this evaluation set.

Notable climbers on Artificial Analysis include Kimi K2.6, entering at position 4 with 53.9 points, and JT-MINI, appearing at position 113 with 25.4 points; neither reports a SWE-rebench score, making cross-benchmark validation difficult. Gemini 3.1 Pro Preview ranks second on Artificial Analysis (57.2) but only sixth on SWE-rebench (62.3), a reversal that warrants scrutiny of the underlying tasks: SWE-rebench may emphasize code generation or repository manipulation, where Claude and GPT variants perform better, while Artificial Analysis may weight reasoning or planning more heavily.

The SWE-rebench methodology itself remains opaque in the provided data. Without visibility into task design, evaluation protocol, or whether scores are statistically independent, it is unclear whether the tight clustering reflects genuine convergence in model capability or whether the benchmark has begun to saturate as a discriminator. The divergence between the two leaderboards suggests practitioners should verify performance on their specific use case rather than treating either as a universal proxy for software engineering capability.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
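SWE-rebench's five-run protocol implies each published score is an aggregate over repeated evaluations. A minimal sketch of one plausible aggregation, assuming the reported figure is the mean resolved rate across runs (the exact aggregation rule is not stated here, and the run values below are illustrative, not real data):

```python
import statistics

def aggregate_runs(resolved_rates: list[float]) -> dict:
    """Aggregate per-run resolved rates (0..1) into a mean score and a
    sample standard deviation, one reasonable way to summarize a
    five-run benchmark result and expose its stochastic variance."""
    mean = statistics.mean(resolved_rates)
    stdev = statistics.stdev(resolved_rates)  # run-to-run spread
    return {
        "score_pct": round(mean * 100, 1),
        "stdev_pct": round(stdev * 100, 1),
    }

# Hypothetical five-run results for a single model
runs = [0.65, 0.66, 0.64, 0.65, 0.67]
print(aggregate_runs(runs))
```

A spread of a point or more across runs would matter in a leaderboard where six models sit within two points of each other, which is why reporting only the mean can obscure how stable the ranking is.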
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 53 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 4 | Kimi K2.6 | 53.9 | 135 | $1.71 |
| 5 | GPT-5.3 Codex | 53.6 | 90 | $4.81 |
| 6 | Claude Opus 4.6 | 53 | 57 | $10.00 |
| 7 | Muse Spark | 52.1 | — | — |
| 8 | Qwen3.6 Max Preview | 51.8 | — | — |
| 9 | Claude Sonnet 4.6 | 51.7 | 73 | $6.00 |
| 10 | GLM-5.1 | 51.4 | 43 | $2.15 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 238 |
| 2 | GPT-5 Codex | 213 |
| 3 | Grok 4.20 0309 | 205 |
| 4 | Grok 4.20 0309 v2 | 203 |
| 5 | Gemini 3 Flash Preview | 197 |
| 6 | GPT-5.4 mini | 194 |
| 7 | GPT-5.1 Codex | 170 |
| 8 | Qwen3.5 122B A10B | 163 |
| 9 | GPT-5.4 nano | 161 |
| 10 | Gemini 3 Pro Preview | 137 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
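The blended price above folds separate input and output token prices into a single number at the stated 3:1 input/output ratio. A minimal sketch of that weighting; the per-direction prices in the example are hypothetical, since the table reports only the blended figure:

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Weighted-average price per 1M tokens for a workload consuming
    `input_ratio` input tokens for every `output_ratio` output tokens."""
    total = input_ratio + output_ratio
    return (input_ratio * input_per_1m + output_ratio * output_per_1m) / total

# Hypothetical list prices: $0.25/1M input, $0.51/1M output
print(f"${blended_price(0.25, 0.51):.3f} per 1M tokens")  # 3:1 blend
```

Because input tokens dominate the blend three to one, models with cheap input but expensive output can still rank well on this metric, so workloads that are output-heavy (long generations, verbose agents) should recompute the blend with their own ratio.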