The SWE-rebench rankings show substantial churn at the top, though the movement warrants scrutiny. Two new GPT-5.5 variants (xhigh and medium) now occupy positions 1 and 4 at 62.7% and 58.9%, displacing previously dominant models. Codex and Claude Code jumped from positions 18 and 17 to positions 2 and 3, gaining 2.1 and 1.2 percentage points respectively. Claude Opus 4.6 fell from first place (65.3%) to sixth (53.1%), a 12.2-point drop that demands explanation. Several formerly high-ranked models (gpt-5.2-2025-12-11-medium, Junie, DeepSeek-V3.2, Claude Sonnet 4.5, Qwen3.5-397B-A17B) have disappeared from the benchmark entirely.

The Artificial Analysis rankings are largely stable, with identical scores and positions, which suggests the volatility is specific to SWE-rebench's evaluation methodology or dataset. Without documentation of what changed in the benchmark itself (whether test cases were added, removed, or reweighted, or whether evaluation criteria shifted), it is unclear whether these movements reflect genuine capability differences or artifacts of measurement. The scale of Claude Opus 4.6's decline is the clearest signal: a regression that large, with no corresponding change to the model itself, points toward benchmark modifications rather than model degradation. Until the SWE-rebench evaluation protocol is transparently specified, these rankings show movement, not necessarily meaningful progress.
Cole Brennan
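
The rank and score movements quoted in the commentary can be reproduced with a small snapshot diff. This is only a sketch: the function names are hypothetical, the model names and figures are the ones quoted in the commentary, and the previous scores for Codex and Claude Code are back-computed from the stated gains, not taken from a published table.

```python
# Snapshots map model name -> (rank, score in %).
# Figures come from the commentary; previous scores for Codex and
# Claude Code are derived from the stated gains (an assumption).
previous = {
    "Claude Opus 4.6": (1, 65.3),
    "Claude Code": (17, 58.4),  # 59.6 - 1.2
    "Codex": (18, 58.3),        # 60.4 - 2.1
}
current = {
    "Codex": (2, 60.4),
    "Claude Code": (3, 59.6),
    "Claude Opus 4.6": (6, 53.1),
}


def diff_snapshots(previous, current):
    """Return {model: (rank_change, score_change)} for models in both snapshots.

    rank_change is positive when the model moved up the table.
    """
    changes = {}
    for model, (new_rank, new_score) in current.items():
        if model not in previous:
            continue  # newly added model, nothing to compare against
        old_rank, old_score = previous[model]
        changes[model] = (old_rank - new_rank, round(new_score - old_score, 1))
    return changes


def dropped_models(previous, current):
    """Return models present in the previous snapshot but missing now."""
    return sorted(set(previous) - set(current))


changes = diff_snapshots(previous, current)
for model, (rank_delta, score_delta) in changes.items():
    print(f"{model}: rank {rank_delta:+d}, score {score_delta:+.1f} pts")
```

A diff like this separates rank movement from score movement, which matters here: a model can climb many places on a small score gain if the models above it were removed or re-scored.
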
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | gpt-5.4-2026-03-05-medium | 54.9% |
| 6 | Claude Opus 4.7 | 53.1% |
| 7 | Cursor | 53.0% |
| 8 | Gemini 3.1 Pro Preview | 51.1% |
| 9 | Claude Sonnet 4.6 | 51.1% |
| 10 | GLM-5.1 | 50.7% |
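
Running each model five times implies some aggregation step before a single score is reported. The sketch below shows one plausible way to do that, reporting the mean and its standard error. The run scores are made-up illustrative values, and the function name is hypothetical; how SWE-rebench actually aggregates its runs is not stated here.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated run scores."""
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator), then scale by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical resolved-rate percentages from five runs of one model.
runs = [52.0, 54.0, 53.0, 55.0, 51.0]
mean, sem = summarize_runs(runs)
print(f"score: {mean:.1f}% +/- {sem:.2f} (standard error, n={len(runs)})")
```

With only five runs the standard error is a rough guide, but it gives a sense of whether a gap between two adjacent rows is larger than run-to-run noise.
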
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 81 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 55 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 89 | $5.63 |
| 5 | Qwen3.7 Max | 56.6 | 206 | $3.75 |
| 6 | Gemini 3.5 Flash | 55.3 | 228 | $3.38 |
| 7 | Kimi K2.6 | 53.9 | 34 | $1.71 |
| 8 | MiMo-V2.5-Pro | 53.8 | 51 | $0.544 |
| 9 | GPT-5.3 Codex | 53.6 | 81 | $4.81 |
| 10 | Grok 4.3 | 53.2 | 216 | $1.56 |
Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 228 |
| 2 | Grok 4.3 | 216 |
| 3 | Qwen3.7 Max | 206 |
| 4 | GPT-5.1 Codex | 205 |
| 5 | Gemini 3 Flash Preview | 200 |
| 6 | GPT-5 Codex | 196 |
| 7 | Grok 4.20 0309 | 192 |
| 8 | Grok 4.20 0309 v2 | 189 |
| 9 | Qwen3.6 35B A3B | 170 |
| 10 | GPT-5.4 mini | 153 |
Blended cost per 1M tokens at a 3:1 input-to-output ratio; lower is cheaper. Only models with an intelligence score of at least 40 are included.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | GPT-5.4 nano | $0.463 |
| 7 | MiniMax-M2.7 | $0.525 |
| 8 | KAT Coder Pro V2 | $0.525 |
| 9 | MiniMax-M2.5 | $0.525 |
| 10 | MiMo-V2.5-Pro | $0.544 |
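
The 3:1 blended cost can be sketched as a weighted average of input and output prices. This assumes the ratio means three input tokens for every output token; the function name and the example prices are hypothetical, since the tables here do not list separate input and output prices.

```python
def blended_cost(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices in USD per 1M tokens.
example = blended_cost(input_price=2.00, output_price=10.00)
print(f"blended: ${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still rank well on blended cost.
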