The Inference Report

May 30, 2026

The SWE-rebench rankings show Claude models displacing earlier entries through a proliferation of variants rather than uniform improvement. Claude Opus 4.8-xhigh entered at 56.4% (rank 5), Claude Opus 4.7-high at 53.1% (rank 7), and Claude Sonnet 4.6-high at 51.3% (rank 9). All three are marked as new entries, and their -xhigh and -high suffixes suggest configuration variants of existing models rather than new releases. The top tier remains stable: gpt-5.5-2026-04-23-xhigh holds 62.7%, Codex 60.4%, and Claude Code 59.6%.

Below the leaders, SWE-rebench scores diverge from the Artificial Analysis index. Gemini 3.1 Pro Preview scores 57.2 on Artificial Analysis but 51.1% on SWE-rebench (rank 10), a 6.1-point gap. Kimi K2.6 goes from 53.9 to 46.5% (rank 13), and GLM-4.7 from 42.1 to 38.2% (rank 14). The two figures are not on the same scale: the Artificial Analysis number is a composite index across coding, math, and reasoning benchmarks, while SWE-rebench reports a percentage score on software engineering tasks. The gaps therefore suggest either that these models perform materially worse on coding tasks specifically or that SWE-rebench's evaluation criteria diverge meaningfully from Artificial Analysis's methodology.

The Artificial Analysis leaderboard itself shows no movement in the top tier and is still led by Claude Opus 4.8 (61.4) and GPT-5.5 (60.2), with the field compressed tightly between ranks 1 and 20. Without access to SWE-rebench's exact task distribution, evaluation protocol, or scoring criteria (pass rate, time-to-solution, or something else), the divergence cannot be fully explained, but the pattern suggests the two benchmarks test distinct problem classes or apply different scoring thresholds.
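The cross-benchmark gaps quoted above are plain differences between the two scores for each model. A minimal sketch of that arithmetic follows, using only the figures from the text; the function name `benchmark_gaps` is illustrative, and because the two benchmarks use different scales the result is indicative rather than a like-for-like comparison.

```python
# Scores quoted in the summary: (Artificial Analysis index, SWE-rebench %).
scores = {
    "Gemini 3.1 Pro Preview": (57.2, 51.1),
    "Kimi K2.6": (53.9, 46.5),
    "GLM-4.7": (42.1, 38.2),
}


def benchmark_gaps(pairs):
    """Return {model: Artificial Analysis score minus SWE-rebench score}."""
    # Round to one decimal to match the precision of the published scores.
    return {name: round(aa - swe, 1) for name, (aa, swe) in pairs.items()}


gaps = benchmark_gaps(scores)

# List the models from largest gap to smallest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap:.1f} points")
```

On these numbers Kimi K2.6 shows the widest gap of the three, ahead of the Gemini 3.1 Pro Preview gap highlighted in the text.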

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                         Score
1    gpt-5.5-2026-04-23-xhigh      62.7%
2    Codex                         60.4%
3    Claude Code                   59.6%
4    gpt-5.5-2026-04-23-medium     58.9%
5    Claude Opus 4.8-xhigh         56.4%
6    gpt-5.4-2026-03-05-medium     54.9%
7    Claude Opus 4.7-high          53.1%
8    Cursor                        53.0%
9    Claude Sonnet 4.6-high        51.3%
10   Gemini 3.1 Pro Preview        51.1%
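The five-run protocol mentioned in the benchmark description can be illustrated with a short sketch. It assumes the reported score is the mean of the five runs and uses the standard error of the mean as the spread; the source does not state how SWE-rebench aggregates runs, and the per-run values below are hypothetical, not real leaderboard data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(pass_rates):
    """Return (mean, standard error of the mean) for repeated runs."""
    n = len(pass_rates)
    avg = mean(pass_rates)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    sem = stdev(pass_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores (%) for one model across five runs.
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.2f}")
```

Under that assumption, a gap between two models that is smaller than their combined standard errors is hard to distinguish from run-to-run noise.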

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                      Score   tok/s   $/1M
1    Claude Opus 4.8            61.4    67      $10.94
2    GPT-5.5                    60.2    69      $11.25
3    Claude Opus 4.7            57.3    53      $10.94
4    Gemini 3.1 Pro Preview     57.2    129     $4.50
5    GPT-5.4                    56.8    92      $5.63
6    Qwen3.7 Max                56.6    187     $3.75
7    Gemini 3.5 Flash           55.3    209     $3.38
8    Kimi K2.6                  53.9    34      $1.71
9    MiMo-V2.5-Pro              53.8    49      $0.544
10   GPT-5.3 Codex              53.6    81      $4.81

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                      tok/s
1    Gemini 3.5 Flash           209
2    Grok 4.20 0309 v2          202
3    MiniMax-M2.5               199
4    Grok 4.20 0309             197
5    Gemini 3 Flash Preview     196
6    Qwen3.7 Max                187
7    Grok 4.3                   177
8    GPT-5.1 Codex              172
9    GPT-5.4 mini               167
10   Qwen3.6 35B A3B            164
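The speed ranking applies a minimum intelligence score before sorting by throughput. A sketch of that filter-then-sort step is below. The first four rows use scores and speeds from the composite index table; the last row is made up purely to show the cutoff, and treating the threshold as inclusive (score of 40 or more) is an assumption.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # composite intelligence index
    tok_s: int    # output tokens per second


rows = [
    Row("Gemini 3.5 Flash", 55.3, 209),
    Row("Qwen3.7 Max", 56.6, 187),
    Row("Gemini 3.1 Pro Preview", 57.2, 129),
    Row("Kimi K2.6", 53.9, 34),
    Row("hypothetical-small-model", 31.0, 400),  # invented row, below the cutoff
]


def fastest(rows, min_score=40.0, top=10):
    """Drop rows below min_score, then rank the rest by tok/s, fastest first."""
    eligible = [r for r in rows if r.score >= min_score]
    return sorted(eligible, key=lambda r: r.tok_s, reverse=True)[:top]


ranked = fastest(rows)
for rank, r in enumerate(ranked, start=1):
    print(rank, r.model, r.tok_s)
```

The invented row would top the list on raw speed, but the score floor removes it before ranking.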

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                 $/1M
1    MiMo-V2-Flash         $0.15
2    MiMo-V2.5             $0.175
3    DeepSeek V4 Flash     $0.175
4    Hy3-preview           $0.20
5    DeepSeek V3.2         $0.337
6    GPT-5.4 nano          $0.463
7    MiniMax-M2.7          $0.525
8    KAT Coder Pro V2      $0.525
9    MiniMax-M2.5          $0.525
10   MiMo-V2.5-Pro         $0.544
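The blended figure combines separate input and output prices at a 3:1 ratio. A minimal sketch follows, assuming the blend is a weighted average with three parts input to one part output; the function name and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_price(input_per_m, output_per_m, input_weight=3, output_weight=1):
    """Weighted-average USD price per 1M tokens at the given input:output ratio."""
    total_weight = input_weight + output_weight
    weighted_sum = input_weight * input_per_m + output_weight * output_per_m
    return weighted_sum / total_weight


# Hypothetical list prices: $2.00 per 1M input tokens, $10.00 per 1M output tokens.
price = blended_price(2.00, 10.00)
print(f"${price:.2f} per 1M tokens (3:1 blend)")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output looks cheaper under this blend than it would for an output-heavy workload.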