The Inference Report

April 24, 2026

Claude Opus 4.6 moved from position 7 to position 1 on SWE-rebench with a gain of 12.3 percentage points (53% to 65.3%), while GLM-5 and GLM-5.1 climbed from positions 14 and 11 to 3 and 5 respectively, each gaining over 13 points. On the Artificial Analysis index, GPT-5.5 entered at the top with 60.2 points, displacing Claude Opus 4.7 from first place even though Opus 4.7's own score was unchanged at 57.3; Gemini 3.1 Pro Preview held at 57.2, and the top tier compressed significantly with little movement among the established leaders.

The SWE-rebench gains are concentrated in the 50-65% range, where models show meaningful progress on real repository tasks, though the methodology does not specify whether these gains were measured on the same test set or on refreshed evaluation data. The Artificial Analysis leaderboard shows dense clustering and wholesale position shifts despite unchanged scores for many models, which suggests either score rounding or reranking by secondary criteria rather than actual performance changes.

The divergence between the two benchmarks is notable: Claude Opus 4.6 dominates SWE-rebench but ranks only 8th on Artificial Analysis at 53, while GPT-5.5 tops Artificial Analysis but does not appear in the SWE-rebench top 34. This indicates either that the benchmarks measure different capabilities or that their evaluation protocols operate on substantially different distributions. Without documentation of evaluation scope or dates, it is unclear whether these movements reflect genuine capability gains, model updates, or methodological shifts.

Cole Brennan
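
To make the movement figures in the summary concrete, here is a minimal sketch of how rank moves and percentage-point deltas can be computed between two leaderboard snapshots. Only Claude Opus 4.6's 53% to 65.3% pair comes from the report; the other scores are placeholders, and the tiny three-model snapshot will not reproduce the real rank positions.

# Sketch: rank moves and percentage-point deltas between two leaderboard
# snapshots. Only Claude Opus 4.6's scores come from the report; the other
# entries are placeholders so the ranking computation is runnable.
previous = {"Claude Opus 4.6": 53.0, "GLM-5": 49.3, "GLM-5.1": 49.1}
current = {"Claude Opus 4.6": 65.3, "GLM-5": 62.8, "GLM-5.1": 62.7}

def ranks(snapshot):
    # Rank 1 = highest score.
    ordered = sorted(snapshot, key=snapshot.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}

prev_rank, curr_rank = ranks(previous), ranks(current)
for model in current:
    delta = current[model] - previous[model]  # percentage points
    print(f"{model}: {previous[model]}% -> {current[model]}% ({delta:+.1f} pp), "
          f"rank {prev_rank[model]} -> {curr_rank[model]}")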

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                      Score
1   Claude Opus 4.6            65.3%
2   gpt-5.2-2025-12-11-medium  64.4%
3   GLM-5                      62.8%
4   gpt-5.4-2026-03-05-medium  62.8%
5   GLM-5.1                    62.7%
6   Gemini 3.1 Pro Preview     62.3%
7   DeepSeek-V3.2              60.9%
8   Claude Sonnet 4.6          60.7%
9   Claude Sonnet 4.5          60.0%
10  Qwen3.5-397B-A17B          59.9%
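
Since SWE-rebench runs each model five times, the published score is an aggregate over repeated trials. The sketch below shows one plausible aggregation (a plain mean with spread) over hypothetical per-run resolved rates; the benchmark's actual aggregation rule is not stated in this report, so the mean is an assumption.

from statistics import mean, stdev

# Sketch: collapsing five per-run resolved rates into one score per model.
# The run-level numbers are hypothetical, and the plain mean is an assumed
# aggregation rule, not the benchmark's documented one.
runs = {
    "model-a": [0.651, 0.649, 0.655, 0.653, 0.657],
    "model-b": [0.640, 0.644, 0.646, 0.642, 0.648],
}

for model, rates in runs.items():
    print(f"{model}: {mean(rates):.1%} mean, {stdev(rates):.2%} std over {len(rates)} runs")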

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                   Score  tok/s  $/1M
1   GPT-5.5                 60.2   0      $11.25
2   Claude Opus 4.7         57.3   58     $10.00
3   Gemini 3.1 Pro Preview  57.2   132    $4.50
4   GPT-5.4                 56.8   80     $5.63
5   Kimi K2.6               53.9   123    $1.71
6   MiMo-V2.5-Pro           53.8   60     $1.50
7   GPT-5.3 Codex           53.6   76     $4.81
8   Claude Opus 4.6         53.0   50     $10.00
9   Muse Spark              52.1   0      $0.00
10  Qwen3.6 Max Preview     51.8   36     $2.92
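
The composite index above spans coding, math, and reasoning evaluations, but its exact weighting is not documented in this report. The sketch below treats it as an unweighted mean of already-normalized sub-scores; the sub-benchmark names and numbers are made up for illustration.

# Sketch: a composite index as an unweighted mean of sub-scores that are
# already on a common 0-100 scale. Both the equal weighting and the numbers
# are assumptions; the real index's weights are not published here.
sub_scores = {"coding": 62.0, "math": 58.5, "reasoning": 60.1}
composite = sum(sub_scores.values()) / len(sub_scores)
print(f"composite: {composite:.1f}")

A weighted variant would simply replace the plain mean with per-benchmark weights.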

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                   tok/s
1   Qwen3.6 35B A3B         214
2   Gemini 3 Flash Preview  198
3   GPT-5 Codex             188
4   GPT-5.1 Codex           179
5   GPT-5.4 mini            173
6   Qwen3.5 122B A10B       150
7   Grok 4.20 0309 v2       148
8   GPT-5.4 nano            147
9   Grok 4.20 0309          141
10  Gemini 3.1 Pro Preview  132
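
The throughput ranking above applies an intelligence floor before sorting by speed. A minimal filter-then-sort sketch over hypothetical records, assuming each record carries a score and a tok_s field:

# Sketch: drop models below the intelligence floor, then rank the rest by
# output throughput. The records and field names are hypothetical.
MIN_SCORE = 40
models = [
    {"name": "fast-but-weak", "score": 32, "tok_s": 250},
    {"name": "fast-and-capable", "score": 48, "tok_s": 214},
    {"name": "slow-and-capable", "score": 57, "tok_s": 58},
]

eligible = [m for m in models if m["score"] >= MIN_SCORE]
for rank, m in enumerate(sorted(eligible, key=lambda m: m["tok_s"], reverse=True), start=1):
    print(rank, m["name"], f'{m["tok_s"]} tok/s')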

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model              $/1M
1   MiMo-V2-Flash      $0.15
2   DeepSeek V4 Flash  $0.175
3   DeepSeek V3.2      $0.315
4   GPT-5.4 nano       $0.463
5   MiniMax-M2.7       $0.525
6   KAT Coder Pro V2   $0.525
7   MiniMax-M2.5       $0.525
8   GPT-5 mini         $0.688
9   Qwen3.5 27B        $0.825
10  Qwen3.6 35B A3B    $0.844
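
The blended figure above weights input and output prices 3:1, i.e. blended $/1M = (3 x input price + 1 x output price) / 4. A one-function sketch with hypothetical per-million-token prices:

# Sketch: blended $/1M tokens at a 3:1 input/output ratio.
# The example prices are hypothetical, in dollars per 1M tokens.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return (3 * input_per_m + output_per_m) / 4

# e.g. $0.50/1M input and $1.50/1M output blend to $0.75/1M.
print(f"${blended_price(0.50, 1.50):.3f} per 1M tokens")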