Claude Opus 4.6 moved from position 7 to 1 on SWE-rebench with a gain of 12.3 percentage points (53% to 65.3%), while GLM-5 and GLM-5.1 climbed from positions 14 and 11 to 3 and 5 respectively, each gaining over 13 points. On the Artificial Analysis index, GPT-5.5 entered at the top with 60.2 points, displacing Claude Opus 4.7, whose score held steady at 57.3; Gemini 3.1 Pro Preview remained at 57.2, and the top tier compressed significantly with minimal movement among established leaders.

SWE-rebench's gains are concentrated in the 50-65% range, where models show meaningful progress on real repository tasks, though the methodology does not specify whether these gains were measured on the same test set or on refreshed evaluation data. The Artificial Analysis leaderboard exhibits dense clustering and wholesale position shifts despite unchanged scores for many models, suggesting score rounding or reranking by secondary criteria rather than genuine performance changes.

The divergence between the two benchmarks is notable: Claude Opus 4.6 dominates SWE-rebench but ranks only 8th on Artificial Analysis at 53, while GPT-5.5 tops Artificial Analysis yet does not appear in the SWE-rebench top 34. Either the benchmarks measure different capabilities, or their evaluation protocols operate on substantially different distributions. Without documentation of evaluation scope or dates, it is unclear whether these movements reflect genuine capability gains, model updates, or methodological shifts.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
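The five-runs-per-model protocol described above can be sketched as a small aggregation step. This is an illustration of the idea, not SWE-rebench's actual code; the boolean per-task result format and the function name are assumptions.

```python
from statistics import mean, stdev

def aggregate_runs(runs):
    """Average resolved rate across repeated runs of one model.

    runs: list of runs; each run is a list of booleans,
    one per task (True = task resolved).
    Returns (mean resolved %, standard deviation in %).
    """
    # Per-run resolved rate as a percentage.
    rates = [100.0 * sum(run) / len(run) for run in runs]
    # stdev needs at least two data points.
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread
```

Reporting the mean over five runs damps run-to-run stochastic noise; the standard deviation shows how much any single run could have misled.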
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 0 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 58 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 80 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 123 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 60 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 76 | $4.81 |
| 8 | Claude Opus 4.6 | 53.0 | 50 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 36 | $2.92 |
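The composite score above folds several benchmarks into one number. Artificial Analysis does not publish its weighting here, so the sketch below assumes a plain (optionally weighted) mean over per-benchmark scores on a common 0-100 scale; the function name and the equal-weight default are illustrative assumptions, not the index's actual formula.

```python
def composite_index(scores, weights=None):
    """Weighted mean of per-benchmark scores on a common 0-100 scale.

    scores: dict of benchmark name -> score.
    weights: optional dict of benchmark name -> weight; defaults
    to equal weights (a simple average).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

With equal weights this is just the arithmetic mean, which is why a small gain on any one benchmark moves the composite only fractionally.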
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 214 |
| 2 | Gemini 3 Flash Preview | 198 |
| 3 | GPT-5 Codex | 188 |
| 4 | GPT-5.1 Codex | 179 |
| 5 | GPT-5.4 mini | 173 |
| 6 | Qwen3.5 122B A10B | 150 |
| 7 | Grok 4.20 0309 v2 | 148 |
| 8 | GPT-5.4 nano | 147 |
| 9 | Grok 4.20 0309 | 141 |
| 10 | Gemini 3.1 Pro Preview | 132 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
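The 3:1 blended price above can be reproduced from a provider's separate input and output rates. A minimal sketch, assuming simple linear per-token pricing (no caching or tiered discounts); the rates in the test comment are placeholders, not any provider's actual pricing:

```python
def blended_price(input_per_m, output_per_m, input_ratio=3, output_ratio=1):
    """Blended $ per 1M tokens at a fixed input:output token mix.

    Defaults to the 3:1 mix used by this leaderboard: three input
    tokens for every output token.
    """
    total = input_ratio + output_ratio
    return (input_ratio * input_per_m + output_ratio * output_per_m) / total
```

Because output tokens usually cost several times more than input tokens, a 3:1 blend sits much closer to the input rate than a naive average of the two list prices would.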