The Inference Report

June 3, 2026

The top of the SWE-rebench leaderboard remains stable: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, followed by Codex at 60.4% and Claude Code at 59.6%. On the Artificial Analysis index, Claude Opus 4.8 leads at 61.4, ahead of GPT-5.5 at 60.2. Without a prior Artificial Analysis snapshot to compare against, the stability of that leaderboard's top positions cannot be confirmed, and new entries such as Qwen3.7 Max in sixth place (56.6) and Qwen3.7 Plus in eleventh indicate that its sample is expanding rather than consolidating.

The larger differences are between the two benchmarks rather than within either one. Gemini 3.1 Pro Preview ranks fourth on Artificial Analysis (57.2) but tenth on SWE-rebench (51.1%), a 6.1-point gap. Kimi K2.6 shows the same pattern: eighth on Artificial Analysis (53.9) and thirteenth on SWE-rebench (46.5%), a 7.4-point gap. GLM-5.1 runs the other way in rank, placing 11th on SWE-rebench at 50.7% and 18th on Artificial Analysis at 51.4. GLM-4.7 slipped one place on Artificial Analysis, from forty-eighth to forty-ninth, while its SWE-rebench score rose from 38.2% to 42.1%, a 3.9-point gain.

These divergences suggest the benchmarks are not measuring identical capabilities, likely because SWE-rebench tests repository-level problem solving while the Artificial Analysis index is a composite across coding, math, and reasoning benchmarks.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

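| # | Model | Score |
|---|-------|-------|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
| 6 | gpt-5.4-2026-03-05-medium | 54.9% |
| 7 | Claude Opus 4.7-high | 53.1% |
| 8 | Cursor | 53.0% |
| 9 | Claude Sonnet 4.6-high | 51.3% |
| 10 | Gemini 3.1 Pro Preview | 51.1% |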
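
The five-run protocol described above can be illustrated with a small sketch. It assumes the reported score is the mean resolved rate across runs and that run-to-run spread is summarized with a standard error; SWE-rebench's actual aggregation is not specified here, and the per-run values below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n); zero if only one run.
    sem = stdev(resolved_rates) / sqrt(n) if n > 1 else 0.0
    return avg, sem


# Hypothetical per-run resolved rates for a single model (not real data).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this reading, two models whose means differ by less than their combined standard errors would not be clearly separated by rank alone.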

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

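| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | Claude Opus 4.8 | 61.4 | 59 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 67 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 53 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 123 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 79 | $5.63 |
| 6 | Qwen3.7 Max | 56.6 | 198 | $3.75 |
| 7 | Gemini 3.5 Flash | 55.3 | 216 | $3.38 |
| 8 | Kimi K2.6 | 53.9 | 39 | $1.71 |
| 9 | MiMo-V2.5-Pro | 53.8 | 46 | $0.544 |
| 10 | GPT-5.3 Codex | 53.6 | 74 | $4.81 |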
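
The cross-benchmark figures quoted in the summary can be recomputed with a short sketch. The dictionaries and the function name `cross_benchmark_gap` are illustrative; the values for models outside the top ten (Kimi K2.6 and GLM-5.1 on SWE-rebench, GLM-5.1 on Artificial Analysis) are taken from the summary text rather than the tables. The two benchmarks use different scales, so the score gap is only a rough indicator.

```python
# (rank, score) per model; values copied from this report.
artificial_analysis = {
    "Gemini 3.1 Pro Preview": (4, 57.2),
    "Kimi K2.6": (8, 53.9),
    "GLM-5.1": (18, 51.4),
}
swe_rebench = {
    "Gemini 3.1 Pro Preview": (10, 51.1),
    "Kimi K2.6": (13, 46.5),   # from the summary text
    "GLM-5.1": (11, 50.7),     # from the summary text
}


def cross_benchmark_gap(name):
    """Compare one model's standing on the two leaderboards."""
    aa_rank, aa_score = artificial_analysis[name]
    swe_rank, swe_score = swe_rebench[name]
    return {
        # Positive: the model ranks lower on SWE-rebench than on Artificial Analysis.
        "rank_shift": swe_rank - aa_rank,
        # Positive: the Artificial Analysis score is the higher of the two.
        "score_gap": round(aa_score - swe_score, 1),
    }


for model in artificial_analysis:
    print(model, cross_benchmark_gap(model))
```

Only models whose names match exactly on both leaderboards can be compared this way; entries such as "Claude Opus 4.8-xhigh" and "Claude Opus 4.8" would need manual mapping.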

Output tokens per second — higher is faster. Minimum intelligence score of 40.

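| # | Model | tok/s |
|---|-------|-------|
| 1 | Step 3.7 Flash | 402 |
| 2 | Gemini 3.5 Flash | 216 |
| 3 | MiniMax-M2.5 | 200 |
| 4 | Qwen3.7 Max | 198 |
| 5 | Grok 4.20 0309 v2 | 187 |
| 6 | Gemini 3 Flash Preview | 180 |
| 7 | GPT-5.1 Codex | 175 |
| 8 | GPT-5 Codex | 173 |
| 9 | Grok 4.20 0309 | 166 |
| 10 | Qwen3.6 35B A3B | 162 |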

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

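| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | Step 3.7 Flash | $0.438 |
| 7 | GPT-5.4 nano | $0.463 |
| 8 | MiniMax-M2.7 | $0.525 |
| 9 | KAT Coder Pro V2 | $0.525 |
| 10 | MiniMax-M2.5 | $0.525 |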
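
For readers who want to reproduce a blended figure, here is a minimal sketch. It assumes "3:1 input/output" means a weighted average with three parts input price to one part output price; the function name `blended_price` and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical list prices in dollars per 1M tokens (not real data).
price = blended_price(input_price=1.00, output_price=4.00)
print(f"${price:.2f} per 1M tokens")
```

Because output tokens are usually priced higher than input tokens, a workload that generates more output than the 3:1 mix assumes would see a higher effective cost than the blended figure.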