The SWE-rebench rankings show Claude models gaining leaderboard positions through new configuration variants rather than through across-the-board score improvements. Claude Opus 4.8-xhigh entered at 56.4% (rank 5), Claude Opus 4.7-high at 53.1% (rank 7), and Claude Sonnet 4.6-high at 51.3% (rank 9). All three are marked as new entries, and their -xhigh and -high suffixes suggest configuration variants of existing models rather than new releases. The top tier remains stable: gpt-5.5-2026-04-23-xhigh holds 62.7%, Codex 60.4%, and Claude Code 59.6%.

Several models score lower on SWE-rebench than on Artificial Analysis. Gemini 3.1 Pro Preview scores 57.2 on Artificial Analysis but 51.1% on SWE-rebench (rank 10), a 6.1-point gap. Kimi K2.6 goes from 53.9 to 46.5% (rank 13, a 7.4-point gap), and GLM-4.7 from 42.1 to 38.2% (rank 14, a 3.9-point gap). The two scores are not on the same scale: Artificial Analysis reports a composite index across coding, math, and reasoning, while SWE-rebench reports results on software engineering tasks only. The gaps therefore suggest either that these models perform materially worse on coding tasks specifically or that SWE-rebench's evaluation criteria diverge meaningfully from Artificial Analysis's methodology.

The Artificial Analysis leaderboard itself shows no movement in the top tier and remains led by Claude Opus 4.8 (61.4) and GPT-5.5 (60.2), with the field compressed tightly between ranks 1 and 20. Without SWE-rebench's exact task distribution, evaluation protocol, and scoring criteria, the divergence between the two benchmarks cannot be fully explained, but the pattern suggests they test distinct problem classes or apply different scoring thresholds.
Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
| 6 | gpt-5.4-2026-03-05-medium | 54.9% |
| 7 | Claude Opus 4.7-high | 53.1% |
| 8 | Cursor | 53.0% |
| 9 | Claude Sonnet 4.6-high | 51.3% |
| 10 | Gemini 3.1 Pro Preview | 51.1% |
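
Since SWE-rebench runs each model five times, a reported score is presumably an aggregate over those runs. The sketch below shows one common way to summarize five runs, as a mean with a standard error. The aggregation method is an assumption (the source does not say how runs are combined), and the per-run values and the `summarize_runs` helper are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores for one model; not taken from SWE-rebench.
example_runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(example_runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```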

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | Output tok/s | Blended $/1M tokens |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 67 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 69 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 53 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 129 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 92 | $5.63 |
| 6 | Qwen3.7 Max | 56.6 | 187 | $3.75 |
| 7 | Gemini 3.5 Flash | 55.3 | 209 | $3.38 |
| 8 | Kimi K2.6 | 53.9 | 34 | $1.71 |
| 9 | MiMo-V2.5-Pro | 53.8 | 49 | $0.544 |
| 10 | GPT-5.3 Codex | 53.6 | 81 | $4.81 |
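
The cross-benchmark gaps quoted in the summary can be reproduced with a few lines of arithmetic. This is a minimal sketch: the Artificial Analysis scores come from the table above and the summary, and the SWE-rebench figures for Kimi K2.6 and GLM-4.7 come from the summary because those models fall outside the top ten shown. The dictionary names are illustrative. The two scales differ (a composite index versus a percentage on software engineering tasks), so the differences are a rough signal, not a like-for-like comparison.

```python
# Scores as quoted in this report.
aa_scores = {
    "Gemini 3.1 Pro Preview": 57.2,
    "Kimi K2.6": 53.9,
    "GLM-4.7": 42.1,
}
swe_scores = {
    "Gemini 3.1 Pro Preview": 51.1,
    "Kimi K2.6": 46.5,
    "GLM-4.7": 38.2,
}

# Gap in points, rounded to one decimal to hide floating-point noise.
gaps = {
    model: round(aa_scores[model] - swe_scores[model], 1)
    for model in aa_scores
}

# Print the models with the largest gap first.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap} points")
```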

Output tokens per second (higher is faster). Limited to models with an intelligence score of at least 40.

| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 209 |
| 2 | Grok 4.20 0309 v2 | 202 |
| 3 | MiniMax-M2.5 | 199 |
| 4 | Grok 4.20 0309 | 197 |
| 5 | Gemini 3 Flash Preview | 196 |
| 6 | Qwen3.7 Max | 187 |
| 7 | Grok 4.3 | 177 |
| 8 | GPT-5.1 Codex | 172 |
| 9 | GPT-5.4 mini | 167 |
| 10 | Qwen3.6 35B A3B | 164 |
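
To make the throughput figures concrete, the sketch below converts tokens per second into the approximate time needed to stream a response of a given length. It assumes a constant output rate and ignores time to first token, which this table does not report. The `generation_seconds` helper and the 1,000-token response length are illustrative.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Approximate streaming time, assuming a constant output rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Fastest and slowest entries from the table above.
speeds = {"Gemini 3.5 Flash": 209, "Qwen3.6 35B A3B": 164}

for model, tps in speeds.items():
    seconds = generation_seconds(1000, tps)
    print(f"{model}: {seconds:.1f} s per 1,000 output tokens")
```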

Blended cost per 1M tokens at a 3:1 input-to-output ratio (lower is cheaper). Limited to models with an intelligence score of at least 40.

| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | GPT-5.4 nano | $0.463 |
| 7 | MiniMax-M2.7 | $0.525 |
| 8 | KAT Coder Pro V2 | $0.525 |
| 9 | MiniMax-M2.5 | $0.525 |
| 10 | MiMo-V2.5-Pro | $0.544 |
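
A 3:1 blended price is usually read as a weighted average in which input tokens count three times as much as output tokens. The sketch below works under that reading, which is an assumption about how the figures above were computed. The `blended_price` helper and the example prices are hypothetical and not taken from any provider's price list.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output ratio."""
    total_weight = input_weight + output_weight
    weighted_sum = input_weight * input_price + output_weight * output_price
    return weighted_sum / total_weight


# Hypothetical prices in dollars per 1M tokens.
example = blended_price(input_price=1.00, output_price=5.00)
print(f"${example:.2f} per 1M tokens")
```

Under this weighting, a model with expensive output tokens can still rank as cheap if its input price is low, since three quarters of the weight falls on input.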