Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, up from a fourth-place 53.0 on the Artificial Analysis index, while Gemini 3.1 Pro Preview dropped from first on Artificial Analysis (57.2) to fifth on SWE-rebench (62.3%). Kimi K2.5 climbed from 46.8 to 58.5%, a gain of 11.7 points; GLM-5 moved from tenth (49.8) to third (62.8%); and Kimi K2 Thinking jumped from 40.9 to 57.4%, suggesting that certain architectures perform disproportionately better on the SWE-rebench evaluation.

The SWE-rebench scores are substantially higher across the board, but the two numbers are not directly comparable: Artificial Analysis reports a composite index across coding, math, and reasoning benchmarks, while SWE-rebench reports a resolved-task rate on software engineering tasks alone, so some of the gap is an artifact of differing scales rather than task difficulty. The changes in relative ordering are harder to explain away, and the clustering of models between 58% and 65% on SWE-rebench, against the wider spread on Artificial Analysis, raises the question of whether SWE-rebench's task distribution favors certain model families or its evaluation criteria reward specific coding strategies. Without a detailed account of how the two benchmarks differ in test set composition and methodology, the magnitude of these shifts resists clean interpretation: it could reflect genuine capability differences on software engineering tasks, calibration differences between the benchmarks, or selection effects in which models were evaluated on which benchmark.
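To make the cross-benchmark shifts concrete, here is a minimal sketch that joins the two top-10 lists below for the four models that appear on both. The figures are transcribed from the tables; the point deltas are only suggestive, since the two scores use different units.

```python
# Scores and ranks transcribed from the two tables below, limited to the
# models that appear on both top-10 lists. Units differ: SWE-rebench is a
# resolved-task rate in %, Artificial Analysis is a composite index.
BOTH = {
    # model: (AA rank, AA score, SWE-rebench rank, SWE-rebench %)
    "Gemini 3.1 Pro Preview": (1, 57.2, 5, 62.3),
    "Claude Opus 4.6":        (4, 53.0, 1, 65.3),
    "Claude Sonnet 4.6":      (6, 51.7, 7, 60.7),
    "GLM-5":                  (10, 49.8, 3, 62.8),
}

for model, (aa_rank, aa, swe_rank, swe) in BOTH.items():
    print(f"{model}: #{aa_rank} -> #{swe_rank}, "
          f"{aa} -> {swe}% ({swe - aa:+.1f} points)")
```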
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that averaging follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
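The five-run protocol matters because individual agentic runs are noisy; averaging n runs shrinks the standard error of the score by a factor of √n. A minimal sketch with made-up per-run rates (per-run numbers are not published in the leaderboard), assuming the reported score is the mean of the five runs:

```python
from statistics import mean, stdev

# Hypothetical per-run resolved rates (%) for one model across SWE-rebench's
# five independent runs; real per-run numbers are not shown above.
runs = [60.2, 63.1, 61.5, 62.4, 62.8]

n = len(runs)
score = mean(runs)            # assuming the leaderboard reports this mean
se = stdev(runs) / n ** 0.5   # standard error shrinks by sqrt(5) vs one run
print(f"score = {score:.1f}% +/- {se:.1f} (standard error, n={n})")
```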
Artificial Analysis composite index across coding, math, and reasoning benchmarks; a rough score-per-dollar reading of these rows follows the table.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 78 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 48 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
| 6 | Claude Sonnet 4.6 | 51.7 | 57 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 54 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 70 | $4.81 |
| 9 | Qwen3.6 Plus | 50.0 | 44 | $1.13 |
| 10 | GLM-5 | 49.8 | 86 | $1.55 |
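Because this table pairs score with price, a rough score-per-dollar reading is possible. This is not an Artificial Analysis metric, just illustrative arithmetic over a few of the rows above:

```python
# (composite score, blended $/1M tokens) transcribed from the table above.
# Muse Spark is omitted: its listed price is $0.00, which would divide by zero.
rows = {
    "Gemini 3.1 Pro Preview": (57.2, 4.50),
    "GPT-5.4":                (56.8, 5.63),
    "Claude Opus 4.6":        (53.0, 10.00),
    "GLM-5.1":                (51.4, 2.15),
    "GLM-5":                  (49.8, 1.55),
}

# Crude cost-efficiency view: composite points per blended dollar.
for model, (score, price) in sorted(rows.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1],
                                    reverse=True):
    print(f"{model}: {score / price:5.1f} points per $")
```

By this reading the cheaper GLM models dominate, which is exactly the trade-off the price column is there to expose.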
Output tokens per second (higher is faster); only models with an intelligence score of at least 40 are included. A worked latency example follows the table.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 195 |
| 2 | GPT-5.1 Codex | 184 |
| 3 | GPT-5.4 nano | 180 |
| 4 | GPT-5.4 mini | 179 |
| 5 | GPT-5 Codex | 177 |
| 6 | Grok 4.20 0309 | 175 |
| 7 | Grok 4.20 0309 v2 | 172 |
| 8 | Qwen3.5 122B A10B | 154 |
| 9 | Gemini 3 Pro Preview | 137 |
| 10 | Gemini 3.1 Pro Preview | 132 |
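As a worked example of what these speeds mean in wall-clock terms, the sketch below streams a fixed-length response at decode speed only; it ignores prompt processing and time-to-first-token, so real end-to-end latency is higher.

```python
# Decode-only generation time at a few of the speeds from the table above.
speeds = {  # output tokens per second
    "Gemini 3 Flash Preview": 195,
    "Qwen3.5 122B A10B": 154,
    "Gemini 3.1 Pro Preview": 132,
}
output_tokens = 2_000  # e.g. a long multi-file diff

for model, tps in speeds.items():
    print(f"{model}: {output_tokens / tps:4.1f} s for {output_tokens} tokens")
```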
Blended cost per 1M tokens at a 3:1 input/output ratio (lower is cheaper); only models with an intelligence score of at least 40 are included. The blend arithmetic is sketched after the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
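The 3:1 blend means input tokens carry three times the weight of output tokens in the listed price. A minimal sketch of that arithmetic, with hypothetical per-direction prices, since the table publishes only the blended figure:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $ per 1M tokens at a 3:1 input/output ratio: three input
    tokens are assumed for every output token."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical per-direction prices; the table above lists only the blend.
# $0.28 in / $0.42 out happens to blend to $0.315, the figure shown for
# DeepSeek V3.2.
print(blended_price(0.28, 0.42))  # -> 0.315
```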