The Inference Report

April 19, 2026

On SWE-rebench, the top tier has crystallized around 60-65 percent resolve rates: Claude Opus 4.6 holds first place at 65.3 percent, followed by gpt-5.2-2025-12-11-medium at 64.4 percent and a cluster of GLM and GPT variants in the 62-63 percent band. The meaningful movement comes from models that climbed substantially from prior positions: GLM-5 jumped from rank 11 to rank 3 with a 13.0-point gain (49.8 to 62.8), GLM-4.7 surged from rank 36 to rank 14 with a 16.6-point gain (42.1 to 58.7), and Kimi K2.5 moved from rank 21 to rank 16 with an 11.7-point gain (46.8 to 58.5). Gemini 3.1 Pro Preview, by contrast, slipped from rank 2 to rank 6 despite a still-competitive 62.3 percent, suggesting the benchmark has become more discriminating at the high end.

SWE-rebench shows larger absolute gains across the board than the Artificial Analysis index, which could reflect either genuine improvement on coding tasks or a shift in evaluation methodology; the data does not say whether the benchmark itself was recalibrated. The clustering of models between 58 and 62 percent points to diminishing returns from further optimization: the gap between first and tenth place is now only 5.4 percentage points.
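The point gains quoted above are straightforward to recompute from the before/after resolve rates. A minimal Python sketch, using only the numbers cited in the paragraph:

```python
# Recomputing the point gains cited above from (previous, current) resolve rates.
moves = {
    "GLM-5":     (49.8, 62.8),  # rank 11 -> 3
    "GLM-4.7":   (42.1, 58.7),  # rank 36 -> 14
    "Kimi K2.5": (46.8, 58.5),  # rank 21 -> 16
}
for model, (prev, curr) in moves.items():
    print(f"{model}: +{curr - prev:.1f} points")
```

This reproduces the 13.0-, 16.6-, and 11.7-point gains in the text.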

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
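The five-run protocol described above can be sketched as follows; averaging is an assumption, since the exact aggregation is not specified here, and the run scores below are illustrative, not real data:

```python
# Hypothetical sketch of scoring a model over repeated runs to smooth out
# stochastic variance. The reported score is assumed to be the mean.
def mean_resolve_rate(run_rates: list[float]) -> float:
    """Average resolve rate (percent) over repeated runs of one model."""
    return sum(run_rates) / len(run_rates)

runs = [64.8, 65.7, 65.1, 65.5, 65.4]  # illustrative per-run resolve rates
print(round(mean_resolve_rate(runs), 1))
```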

#   Model                       Score
1   Claude Opus 4.6             65.3%
2   gpt-5.2-2025-12-11-medium   64.4%
3   GLM-5                       62.8%
4   gpt-5.4-2026-03-05-medium   62.8%
5   GLM-5.1                     62.7%
6   Gemini 3.1 Pro Preview      62.3%
7   DeepSeek-V3.2               60.9%
8   Claude Sonnet 4.6           60.7%
9   Claude Sonnet 4.5           60.0%
10  Qwen3.5-397B-A17B           59.9%

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                    Score   tok/s   $/1M
1   Claude Opus 4.7          57.3    53      $10.00
2   Gemini 3.1 Pro Preview   57.2    134     $4.50
3   GPT-5.4                  56.8    85      $5.63
4   GPT-5.3 Codex            53.6    93      $4.81
5   Claude Opus 4.6          53      59      $10.00
6   Muse Spark               52.1    0       $0.00
7   Claude Sonnet 4.6        51.7    62      $6.00
8   GLM-5.1                  51.4    46      $2.15
9   GPT-5.2                  51.3    83      $4.81
10  Qwen3.6 Plus             50      52      $1.13

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                    tok/s
1   Qwen3.6 35B A3B          238
2   GPT-5.1 Codex            223
3   Grok 4.20 0309 v2        212
4   GPT-5 Codex              211
5   Gemini 3 Flash Preview   207
6   Grok 4.20 0309           205
7   GPT-5.4 mini             192
8   Qwen3.5 122B A10B        157
9   GPT-5.4 nano             156
10  Gemini 3 Pro Preview     141

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model              $/1M
1   MiMo-V2-Flash      $0.15
2   DeepSeek V3.2      $0.315
3   GPT-5.4 nano       $0.463
4   MiniMax-M2.7       $0.525
5   KAT Coder Pro V2   $0.525
6   MiniMax-M2.5       $0.525
7   GPT-5 mini         $0.688
8   Qwen3.5 27B        $0.825
9   Qwen3.6 35B A3B    $0.844
10  GLM-4.7            $1.00
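A minimal sketch of the 3:1 input/output blend behind the caption above. The weighting formula is inferred from the stated ratio (75 percent input, 25 percent output), and the prices in the example are hypothetical, not taken from the table:

```python
# Blended $/1M tokens at a 3:1 input/output ratio:
# three input tokens for every output token, so input gets 75% weight.
def blended_cost(input_per_m: float, output_per_m: float) -> float:
    """Blended price per 1M tokens, assuming a 3:1 input/output mix."""
    return 0.75 * input_per_m + 0.25 * output_per_m

# Hypothetical model priced at $0.50/1M input, $2.00/1M output:
print(blended_cost(0.50, 2.00))  # -> 0.875
```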