The top of the SWE-rebench leaderboard was static: gpt-5.5-2026-04-23-xhigh held first place at 62.7%, and the next five positions were unchanged. Artificial Analysis showed modest reordering in its middle and lower tiers. The two benchmarks use different methodologies and scales, so their scores should not be compared directly. On SWE-rebench, Claude Sonnet 4.6 held #10 at 51.3% (its 47.2 on Artificial Analysis is a score on a different scale, not evidence of drift), Gemini 3.1 Pro Preview moved from #9 to #11, and GLM-5.1 jumped from #23 to #12. GLM-5.1 also gained 10.5 points on Artificial Analysis (40.2 to 50.7). GLM-4.7 is listed with the same 4.4-point gain (33.8 to 38.2) on both SWE-rebench and Artificial Analysis. On Artificial Analysis, Magistral Medium 1 and Mistral Medium 3 swapped positions around rank 190 at the 12.5-point level, and three models tied at 2.7 points reordered across ranks 360-362. The lack of movement at the top of either benchmark suggests a stable performance hierarchy. The GLM gains are worth watching to see whether they reflect genuine capability improvements or differences in how sensitive each benchmark's evaluation is.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
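
To make the five-run protocol concrete, here is a minimal sketch of how repeated runs might be collapsed into one score with an uncertainty estimate. It assumes the published percentage is a plain mean over runs, which the source does not confirm. The function name `summarize_runs` and the per-run numbers are hypothetical and are not SWE-rebench data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores (percent) for one model; illustrative only.
runs = [50.0, 52.0, 51.0, 53.0, 49.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.2f}")
```

Under this reading, a gap between two adjacent models that is smaller than their combined standard errors, such as #8 and #9 above at 53.1% and 53.0%, may not be a meaningful difference.
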
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 64 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 61 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 57 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 142 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 72 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 216 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 68 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 140 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 125 | $3.75 |
Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 216 |
| 2 | GPT-5.4 mini | 174 |
| 3 | GPT-5.4 | 142 |
| 4 | Gemini 3.1 Pro Preview | 140 |
| 5 | GPT-5.2 Codex | 140 |
| 6 | Qwen3.7 Max | 125 |
| 7 | DeepSeek V4 Flash | 110 |
| 8 | GLM-5.1 | 93 |
| 9 | GPT-5.3 Codex | 86 |
| 10 | DeepSeek V4 Pro | 86 |
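
The speed ranking above is a filter followed by a sort, which can be sketched as follows. The rows are a subset copied from the composite index table earlier in this post, plus one clearly hypothetical low-scoring model to show the threshold at work. The names `Row` and `fastest` are illustrative. Treating a `tok/s` of 0 as "no measurement" is an assumption.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    tok_s: int


def fastest(rows, min_score=40.0, top=3):
    """Keep rows at or above min_score, then sort by tok/s descending."""
    # Assumption: tok_s == 0 means the speed was not measured, so skip it.
    eligible = [r for r in rows if r.score >= min_score and r.tok_s > 0]
    return sorted(eligible, key=lambda r: r.tok_s, reverse=True)[:top]


ROWS = [
    Row("Claude Opus 4.8", 55.7, 64),
    Row("GPT-5.4", 51.4, 142),
    Row("Gemini 3.5 Flash", 50.2, 216),
    Row("Gemini 3.1 Pro Preview", 46.5, 140),
    Row("Qwen3.7 Max", 46.0, 125),
    # Hypothetical entry, not in the source: fast but below the threshold.
    Row("hypothetical-small-model", 35.0, 300),
]

for rank, row in enumerate(fastest(ROWS), start=1):
    print(rank, row.model, row.tok_s)
```

Because the sketch only sees a subset of models, its output omits entries such as GPT-5.4 mini that appear in the full table.
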
Blended cost per 1M tokens, weighted 3:1 input to output; lower is cheaper. Only models with an intelligence score of at least 40 are included.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
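
The blended figure can be read as a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three parts input tokens to one part output tokens, which is the usual interpretation but is not spelled out in the source. The function name `blended_cost` and the example prices are hypothetical.

```python
def blended_cost(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total = input_weight + output_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total


# Hypothetical prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens.
print(f"${blended_cost(2.0, 8.0):.2f}")  # $3.50
```

Workloads that generate far more output than they read, such as long-form generation, would pay closer to the output price, so a different weighting may suit them better.
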