The SWE-rebench rankings show minimal movement at the top, with the same fourteen models holding positions 1-14 across both periods. The Artificial Analysis index, by contrast, shows churn throughout its 384-entry list, though the pattern suggests reordering rather than genuine performance shifts.

Gemini 3.1 Pro Preview ranks 4th on Artificial Analysis at 57.2 points but 10th on SWE-rebench at 51.1%, a gap between the two benchmarks rather than a decline on either one. Kimi K2.6 shows a similar split, reportedly placing 8th on Artificial Analysis and 13th on SWE-rebench with a score 7.4 points lower. GLM-4.7 rose from 47th to 14th on Artificial Analysis while its score fell from 42.1 to 38.2 points, a counterintuitive climb that suggests ranking methodology changes or score recalibration rather than model improvement. The entry of Step 3.7 Flash at position 47 and the near-universal reshuffling below the top 100 likewise indicate that the benchmark may have adjusted its evaluation criteria, weighting scheme, or model test set.

Without documentation of methodology changes between periods, it remains unclear whether the observed movements reflect actual performance variation or administrative reorganization of the leaderboard itself. The stability at the top of SWE-rebench contrasts sharply with the volatility of Artificial Analysis, raising questions about benchmark sensitivity and whether either ranking reliably tracks incremental progress in code generation capability.
Cole Brennan
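
The rank-versus-score reasoning in the summary can be made mechanical. The following is a minimal sketch, not either leaderboard's own tooling: the `Entry` type and `classify_move` function are hypothetical names introduced here, and the only figures used are the GLM-4.7 numbers quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    rank: int      # 1 is the top of the leaderboard
    score: float   # benchmark score for that period


def classify_move(before: Entry, after: Entry) -> str:
    """Label a leaderboard move by comparing rank and score direction."""
    rank_gain = before.rank - after.rank      # positive means the model climbed
    score_gain = after.score - before.score   # positive means the score improved
    if rank_gain > 0 and score_gain < 0:
        # Climbing on a lower score hints at recalibration, not improvement.
        return "climbed on a lower score"
    if rank_gain < 0 and score_gain > 0:
        return "fell on a higher score"
    if rank_gain == 0 and score_gain == 0:
        return "unchanged"
    return "consistent"


# Figures quoted in the summary for GLM-4.7 on Artificial Analysis.
glm = classify_move(Entry(rank=47, score=42.1), Entry(rank=14, score=38.2))
print(glm)
```

A move labelled "climbed on a lower score" or "fell on a higher score" is the kind that warrants checking for a methodology change before reading it as a performance shift.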
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
| 6 | gpt-5.4-2026-03-05-medium | 54.9% |
| 7 | Claude Opus 4.7-high | 53.1% |
| 8 | Cursor | 53.0% |
| 9 | Claude Sonnet 4.6-high | 51.3% |
| 10 | Gemini 3.1 Pro Preview | 51.1% |
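
Running each model five times matters because a single run can land a point or two away from the model's typical result. As a rough illustration, the sketch below averages five runs and reports the standard error of the mean. The run values are hypothetical, and whether SWE-rebench publishes the mean or some other statistic is an assumption, not something stated here.

```python
import statistics


def summarize_runs(resolved_rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated runs."""
    mean = statistics.mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(resolved_rates) / len(resolved_rates) ** 0.5
    return mean, sem


# Hypothetical resolved rates (%) from five runs of one model; not real data.
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
mean, sem = summarize_runs(runs)
print(f"{mean:.1f}% +/- {sem:.2f}")
```

Under this reading, two models whose means differ by less than their combined standard errors should not be treated as clearly separated in the table.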
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 60 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 66 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 56 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 79 | $5.63 |
| 6 | Qwen3.7 Max | 56.6 | 201 | $3.75 |
| 7 | Gemini 3.5 Flash | 55.3 | 227 | $3.38 |
| 8 | Kimi K2.6 | 53.9 | 40 | $1.71 |
| 9 | MiMo-V2.5-Pro | 53.8 | 53 | $0.544 |
| 10 | GPT-5.3 Codex | 53.6 | 84 | $4.81 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Step 3.7 Flash | 408 |
| 2 | Gemini 3.5 Flash | 227 |
| 3 | Grok 4.20 0309 v2 | 213 |
| 4 | Qwen3.7 Max | 201 |
| 5 | GPT-5 Codex | 191 |
| 6 | Gemini 3 Flash Preview | 186 |
| 7 | Grok 4.20 0309 | 184 |
| 8 | MiniMax-M2.5 | 178 |
| 9 | GPT-5.1 Codex | 174 |
| 10 | GPT-5.4 mini | 173 |
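
The speed list is a filter followed by a sort: drop models below the score floor, then order by throughput. The sketch below applies that to a few rows copied from the composite index table. It assumes the "intelligence score" is that composite score, which is not confirmed here, and `fastest` is a hypothetical helper name.

```python
# (model, composite score, output tokens per second), copied from the
# composite index table in this digest.
ROWS = [
    ("Claude Opus 4.8", 61.4, 60),
    ("Gemini 3.1 Pro Preview", 57.2, 132),
    ("Qwen3.7 Max", 56.6, 201),
    ("Gemini 3.5 Flash", 55.3, 227),
    ("Kimi K2.6", 53.9, 40),
]


def fastest(rows, min_score=40.0):
    """Keep rows at or above min_score, then sort by tok/s, fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


for model, score, tps in fastest(ROWS):
    print(f"{model}: {tps} tok/s (score {score})")
```

Raising `min_score` shrinks the eligible pool, which is why a fast model can disappear from the list without getting any slower.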
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | Step 3.7 Flash | $0.438 |
| 7 | GPT-5.4 nano | $0.463 |
| 8 | MiniMax-M2.7 | $0.525 |
| 9 | KAT Coder Pro V2 | $0.525 |
| 10 | MiniMax-M2.5 | $0.525 |
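
A 3:1 blended price can be read as a weighted average in which three input tokens are billed for every output token. The sketch below works through that arithmetic under this assumption. The prices are hypothetical and `blended_price` is an illustrative name, not a function from Artificial Analysis.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_per_m + output_parts * output_per_m) / total_parts


# Hypothetical list prices in USD per 1M tokens; not taken from any provider.
price = blended_price(input_per_m=1.00, output_per_m=4.00)
print(f"${price:.2f} per 1M blended tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on blended cost.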