The SWE-rebench rankings are stable at the top: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie 61.6%, and Codex 60.4%, with no movement in the first nine positions. Below that band, modest reshuffling reflects incremental gains among mid-tier models. Claude Sonnet 4.6 holds position 10 with its score unchanged at 51.3%, while GLM-5.1 advanced from rank 23 to 12 by improving from 40.2% to 50.7%. That 10.5-point jump points to a methodology change, a model update, or an evaluation refinement, and it is worth scrutinizing. Gemini 3.5 Flash dropped from 7 to 13 despite holding 49.5%, which suggests the ranking absorbed new entrants or was recalibrated.

The Artificial Analysis index, by contrast, saw more motion: Grok Build 0.1 0616 entered at rank 28 and Ring-1T appeared at 159 with no prior placement, which indicates either fresh model releases or expanded coverage. At the lower end, scores compress into single digits, where models cluster densely and small score shifts produce large rank swings, so those positions discriminate poorly between models. Overall, SWE-rebench appears to be settling into a stable ordering of proven performers while Artificial Analysis continues to absorb new competitors, though neither benchmark's methodology is transparent enough to confirm whether score changes reflect genuine capability shifts or evaluation adjustments.
Cole Brennan
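
The rank and score movements quoted in the summary can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Movement` class and its field names are hypothetical, and the numbers are the ones stated in the summary, not pulled from either leaderboard's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movement:
    """One model's previous and current leaderboard position (hypothetical type)."""
    model: str
    prev_rank: int
    new_rank: int
    prev_score: float  # percent
    new_score: float   # percent

    @property
    def rank_change(self) -> int:
        # Positive means the model moved up the table.
        return self.prev_rank - self.new_rank

    @property
    def score_change(self) -> float:
        # Difference in percentage points, rounded to one decimal place.
        return round(self.new_score - self.prev_score, 1)


# Figures as quoted in the summary above.
movements = [
    Movement("GLM-5.1", 23, 12, 40.2, 50.7),
    Movement("Gemini 3.5 Flash", 7, 13, 49.5, 49.5),
]

for m in movements:
    print(f"{m.model}: {m.rank_change:+d} places, {m.score_change:+.1f} points")
```

Reading the two deltas side by side makes the contrast visible: GLM-5.1 moved on a score change, while Gemini 3.5 Flash moved with no score change at all.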
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike many other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
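
To make the five-run protocol concrete, here is a minimal sketch of how repeated runs might be summarized as a mean with a standard error. The run values are hypothetical, and the source does not say which aggregate SWE-rebench actually reports, so treat the choice of statistic as an assumption.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical resolved rates (percent) from five runs of a single model.
# These are made-up numbers, not SWE-rebench data.
runs = [50.0, 52.0, 51.0, 49.0, 53.0]


def summarize(scores):
    """Return (mean, standard error of the mean) for repeated runs."""
    avg = mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(scores) / sqrt(len(scores))
    return avg, sem


avg, sem = summarize(runs)
print(f"mean = {avg:.1f}%, standard error = {sem:.2f} points")
```

A gap between two models that is smaller than a couple of standard errors is hard to distinguish from run-to-run noise, which is one reason closely spaced ranks deserve caution.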
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | n/a | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 72 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 64 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 62 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 161 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 118 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 237 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 69 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 143 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 203 | $3.75 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 237 |
| 2 | Qwen3.7 Max | 203 |
| 3 | GPT-5.4 mini | 194 |
| 4 | GPT-5.4 | 161 |
| 5 | GPT-5.2 Codex | 155 |
| 6 | Gemini 3.1 Pro Preview | 143 |
| 7 | DeepSeek V4 Flash | 121 |
| 8 | GLM-5.2 | 118 |
| 9 | DeepSeek V4 Pro | 103 |
| 10 | GLM-5.1 | 90 |
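
The speed ranking applies a minimum intelligence score before sorting by throughput. A rough sketch of that filter-then-sort step is shown below, using a few rows copied from the composite table plus one clearly made-up entry below the cutoff. The function name `fastest` and the tuple layout are assumptions for illustration.

```python
MIN_SCORE = 40.0  # minimum intelligence score stated in the caption

# (model, composite score, output tokens per second)
rows = [
    ("Claude Opus 4.8", 55.7, 72),
    ("GPT-5.4", 51.4, 161),
    ("GLM-5.2", 51.1, 118),
    ("Gemini 3.5 Flash", 50.2, 237),
    ("Qwen3.7 Max", 46.0, 203),
    ("hypothetical-small-model", 35.0, 400),  # made up; below the cutoff
]


def fastest(rows, min_score=MIN_SCORE):
    """Drop models under the score cutoff, then sort by tok/s, fastest first."""
    eligible = [r for r in rows if r[1] >= min_score]
    return sorted(eligible, key=lambda r: r[2], reverse=True)


ranked = fastest(rows)
for rank, (model, score, tps) in enumerate(ranked, start=1):
    print(f"{rank}. {model}: {tps} tok/s (score {score})")
```

The made-up entry has the highest throughput but never appears in the output, which is the point of the cutoff: raw speed only counts among models that clear the quality bar.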
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
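
The blended figure can be read as a weighted average of input and output prices. The sketch below assumes the 3:1 ratio means three parts input to one part output. The function name and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens (assumed 3:1 input-to-output blend)."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices in USD per 1M tokens: $5.00 input, $25.00 output.
example = blended_price(input_price=5.00, output_price=25.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this metric. Workloads that generate far more text than they read would see a higher effective cost than the blended number implies.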