The SWE-rebench and Artificial Analysis rankings are stable at the top, with meaningful movement in the middle tier. On SWE-rebench, the top six positions are unchanged: gpt-5.5-2026-04-23-xhigh leads at 62.7%, followed by Junie at 61.6%, Codex at 60.4%, Claude Code at 59.6%, gpt-5.5-2026-04-23-medium at 58.9%, and Claude Opus 4.8-xhigh at 56.5%.

The notable shifts occur below that ceiling. Claude Sonnet 4.6 held position 10 while its score rose from 47.2% to 51.3%, a 4.1-point gain. Gemini 3.1 Pro Preview gained 4.6 points, from 46.5% to 51.1%, yet slipped from position 9 to position 11. GLM-5.1 jumped from position 23 at 40.2% to position 12 at 50.7%, a 10.5-point improvement and the largest gain listed here. GLM-4.7 advanced from position 51 at 33.8% to position 17 at 38.2%, a 4.4-point gain. Gemini 3.5 Flash, conversely, fell from position 7 at 50.2% to position 13 at 49.5%, a 0.7-point decline.

These movements suggest either benchmark variance or genuine performance shifts in the middle tier. GLM-5.1's rise in particular warrants checking whether the test conditions or the model itself changed materially. The Artificial Analysis rankings remain consistent across the top 100 positions, with only minor reordering among tied scores in the 6 to 7-point range, which suggests a more stable evaluation methodology or a less volatile test set.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
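
The description above says each model is run five times to account for stochastic variance. A minimal sketch of how five per-run scores might be reduced to one reported figure plus a spread estimate follows. The run values are hypothetical placeholders, not benchmark data, and the aggregation (plain mean with a sample standard error) is an assumption rather than SWE-rebench's documented method.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates, in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores for one model (illustrative only).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

If the real run-to-run spread were of this size, sub-point gaps such as Claude Opus 4.7-high (53.1%) versus Cursor (53.0%) would not be distinguishable from noise.
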

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 66 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 68 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 56 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 157 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 98 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 219 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 68 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 136 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 98 | $3.75 |

Output tokens per second (higher is faster). Minimum intelligence score of 40.

| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 219 |
| 2 | GPT-5.4 mini | 174 |
| 3 | GPT-5.4 | 157 |
| 4 | GPT-5.2 Codex | 137 |
| 5 | Gemini 3.1 Pro Preview | 136 |
| 6 | DeepSeek V4 Flash | 114 |
| 7 | GLM-5.2 | 98 |
| 8 | Qwen3.7 Max | 98 |
| 9 | GPT-5.3 Codex | 88 |
| 10 | GPT-5.2 | 83 |
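
The speed table applies a score floor and then orders by throughput. A minimal sketch of that filter-then-sort step is below, using only the ten rows of the composite index table above as input. Because the real speed ranking draws on the full model list (GPT-5.4 mini, for example, is not among those ten rows), the output here will not reproduce the table exactly. The function name and the tie-handling choice are assumptions made for illustration.

```python
# (model, index score, output tokens/s) copied from the composite index table.
ROWS = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 66),
    ("GPT-5.5", 54.8, 68),
    ("Claude Opus 4.7", 53.5, 56),
    ("GPT-5.4", 51.4, 157),
    ("GLM-5.2", 51.1, 98),
    ("Gemini 3.5 Flash", 50.2, 219),
    ("Claude Sonnet 4.6", 47.2, 68),
    ("Gemini 3.1 Pro Preview", 46.5, 136),
    ("Qwen3.7 Max", 46.0, 98),
]


def speed_ranking(rows, min_score=40.0):
    """Keep rows at or above min_score, then sort by tokens/s, fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    # sorted() is stable, so tied speeds keep their input (score) order.
    return sorted(eligible, key=lambda row: row[2], reverse=True)


for rank, (model, _score, tps) in enumerate(speed_ranking(ROWS), start=1):
    print(rank, model, tps)
```
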

Blended cost per 1M tokens (3:1 input/output); lower is cheaper. Minimum intelligence score of 40.

| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
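
The "3:1 input/output" blend in the cost table can be read as a weighted average of input and output prices, three parts input to one part output. A minimal sketch under that reading follows. The function name and the example prices are hypothetical, and the exact formula used by Artificial Analysis is assumed here, not confirmed by the text.

```python
def blended_cost(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("token mix must have a positive total weight")
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices in USD per 1M tokens (not taken from any provider).
example = blended_cost(input_price=1.00, output_price=4.00)
print(f"${example:.2f} per 1M blended tokens")
```

With these placeholder prices the blend is (3 × 1.00 + 1 × 4.00) / 4 = $1.75, which shows how an output price four times the input price raises the blended figure by 75% over the input price alone.
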