On SWE-rebench, the top tier has crystallized in the 60-65 percent resolve-rate range, with Claude Opus 4.6 holding first place at 65.3 percent, followed by gpt-5.2-2025-12-11-medium at 64.4 percent and a cluster of GLM and GPT variants in the 62-63 percent band. The meaningful movement comes from models that climbed substantially from prior positions: GLM-5 jumped from rank 11 to rank 3 by gaining 13 percentage points (49.8 to 62.8), GLM-4.7 surged from rank 36 to rank 14 with a 16.6-point gain (42.1 to 58.7), and Kimi K2.5 moved from rank 21 to rank 16 by adding 11.7 points (46.8 to 58.5). Gemini 3.1 Pro Preview, by contrast, slipped from rank 2 to rank 6 while holding a competitive 62.3 percent, which says less about the model regressing than about the field tightening around it.

The SWE-rebench scores show larger absolute gains across the board than the Artificial Analysis index, which could reflect genuine improvement on coding tasks or a shift in evaluation methodology; the data does not say whether the benchmark itself was recalibrated. Either way, the clustering of models between roughly 58 and 63 percent suggests diminishing returns from further optimization: the gap between first and tenth place is now just 5.4 percentage points.
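To make the movement concrete, here is a quick Python sketch that recomputes the deltas quoted above. The prior ranks and scores come directly from the text; nothing here is additional leaderboard data.

```python
# Rank and score movement on SWE-rebench, recomputed from the figures
# quoted above.
movers = {
    # model: (prior_rank, new_rank, prior_score, new_score)
    "GLM-5":     (11, 3, 49.8, 62.8),
    "GLM-4.7":   (36, 14, 42.1, 58.7),
    "Kimi K2.5": (21, 16, 46.8, 58.5),
}

for model, (r0, r1, s0, s1) in movers.items():
    print(f"{model}: rank {r0} -> {r1}, {s0}% -> {s1}% (+{s1 - s0:.1f} pts)")

# Spread between first and tenth place in the current top ten.
print(f"First-to-tenth gap: {65.3 - 59.9:.1f} points")
```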
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
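Since each model is run five times, a published score is best read as a mean with some spread. A minimal sketch of that aggregation, using invented per-run values, since the actual per-run numbers are not published here:

```python
import statistics

# Hypothetical per-run resolve rates for a single model. SWE-rebench runs
# each model five times, so a published score is presumably an aggregate
# along these lines; the values below are made up for illustration.
runs = [64.8, 65.9, 64.5, 66.1, 65.2]

mean = statistics.mean(runs)
spread = statistics.stdev(runs)  # sample standard deviation across runs
print(f"resolve rate: {mean:.1f}% +/- {spread:.1f}")
```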
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 53 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 134 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 85 | $5.63 |
| 4 | GPT-5.3 Codex | 53.6 | 93 | $4.81 |
| 5 | Claude Opus 4.6 | 53.0 | 59 | $10.00 |
| 6 | Muse Spark | 52.1 | 0 | $0.00 |
| 7 | Claude Sonnet 4.6 | 51.7 | 62 | $6.00 |
| 8 | GLM-5.1 | 51.4 | 46 | $2.15 |
| 9 | GPT-5.2 | 51.3 | 83 | $4.81 |
| 10 | Qwen3.6 Plus | 50.0 | 52 | $1.13 |
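One way to read this table is cost-adjusted: dividing the index score by the blended price gives a rough points-per-dollar figure. The ratio below is an illustrative metric of my own, not part of the Artificial Analysis index:

```python
# Index points per dollar for a few rows of the table above. Muse Spark
# is omitted: its listed price of $0.00 would divide by zero.
rows = [
    ("Claude Opus 4.7", 57.3, 10.00),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GLM-5.1", 51.4, 2.15),
    ("Qwen3.6 Plus", 50.0, 1.13),
]

for model, score, price in rows:
    print(f"{model}: {score / price:.1f} index points per blended $/1M")
```

On this reading the cheap GLM and Qwen entries dominate the frontier models by a wide margin, which is the usual price-performance story rather than a surprise.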
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 238 |
| 2 | GPT-5.1 Codex | 223 |
| 3 | Grok 4.20 0309 v2 | 212 |
| 4 | GPT-5 Codex | 211 |
| 5 | Gemini 3 Flash Preview | 207 |
| 6 | Grok 4.20 0309 | 205 |
| 7 | GPT-5.4 mini | 192 |
| 8 | Qwen3.5 122B A10B | 157 |
| 9 | GPT-5.4 nano | 156 |
| 10 | Gemini 3 Pro Preview | 141 |
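For intuition, the listed speeds translate directly into wall-clock latency for a response of a given length. The 1,000-token response size below is an assumption for illustration:

```python
# Approximate wall-clock time to stream a fixed-length response at the
# listed output speeds. The response length is an assumed value.
RESPONSE_TOKENS = 1_000

speeds = {
    "Qwen3.6 35B A3B": 238,       # fastest in the table
    "Gemini 3 Pro Preview": 141,  # slowest in the top ten
}

for model, tok_s in speeds.items():
    print(f"{model}: ~{RESPONSE_TOKENS / tok_s:.1f} s per {RESPONSE_TOKENS}-token response")
```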
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
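The 3:1 blend in the caption implies a simple weighted average: three parts input price to one part output price. A minimal sketch, with hypothetical per-direction prices since the table publishes only the blend:

```python
# Blended price per 1M tokens at the table's 3:1 input/output ratio.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return (3 * input_per_m + output_per_m) / 4

# These per-direction prices are made up purely to show the arithmetic.
print(f"${blended_price(0.25, 1.00):.3f}/1M")  # 3 parts $0.25, 1 part $1.00 -> $0.438/1M
```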