SWE-rebench's top five are unchanged this cycle, while several models just below them lost ground; the Artificial Analysis leaderboard is identical to the prior snapshot. On SWE-rebench, gpt-5.5-2026-04-23-xhigh holds 62.7%, followed by Codex (60.4%), Claude Code (59.6%), gpt-5.5-2026-04-23-medium (58.9%), and gpt-5.4-2026-03-05-medium (54.9%), all with the same scores as in prior results. Below that tier, Claude Opus 4.7 and Gemini 3.1 Pro Preview each dropped roughly 4 points (to 53.1% and 51.1% respectively), and Kimi K2.6 fell from 53.9% to 46.5%, a 7.4-point decline that leaves it outside the top ten and points to either methodology drift or model degradation.

The Artificial Analysis data, a ranked list of 383 models, shows no movement at all, which indicates that no new evaluations or score recalculations occurred. That SWE-rebench moved while Artificial Analysis did not suggests the two leaderboards refresh on different cadences, or that the SWE-rebench methodology itself is in flux. The meaningful signal here is negative: drops as large as Kimi K2.6's warrant checking whether the benchmark conditions changed, whether the model was updated, or whether previous scores were inflated. None of the reported changes is an improvement, so neither leaderboard documents actual progress this cycle.
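The score comparison in the summary can be reproduced with a small snapshot diff. This is a minimal sketch, not the method either leaderboard uses: the function names `score_deltas` and `flag_drops` and the 5-point threshold are assumptions chosen for illustration, and the only scores included are ones quoted in the summary.

```python
# Hypothetical helpers: compare two leaderboard snapshots and flag large drops.

def score_deltas(previous, current):
    """Return {model: current - previous} for models present in both snapshots."""
    return {
        model: round(current[model] - previous[model], 1)
        for model in current
        if model in previous
    }


def flag_drops(deltas, threshold=5.0):
    """List (model, delta) pairs that fell by at least `threshold` points, largest drop first."""
    drops = [(model, delta) for model, delta in deltas.items() if delta <= -threshold]
    return sorted(drops, key=lambda item: item[1])


# Scores (in percent) quoted in the summary; the top entries are unchanged.
previous = {
    "gpt-5.5-2026-04-23-xhigh": 62.7,
    "Codex": 60.4,
    "Kimi K2.6": 53.9,
}
current = {
    "gpt-5.5-2026-04-23-xhigh": 62.7,
    "Codex": 60.4,
    "Kimi K2.6": 46.5,
}

deltas = score_deltas(previous, current)
print(flag_drops(deltas))  # [('Kimi K2.6', -7.4)]
```

A flagged drop is only a prompt for investigation; the diff cannot tell whether the benchmark or the model changed.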

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | gpt-5.4-2026-03-05-medium | 54.9% |
| 6 | Claude Opus 4.7 | 53.1% |
| 7 | Cursor | 53.0% |
| 8 | Gemini 3.1 Pro Preview | 51.1% |
| 9 | Claude Sonnet 4.6 | 51.1% |
| 10 | GLM-5.1 | 50.7% |
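Because each model is run five times, a reported score can be read as an average with some run-to-run spread. The sketch below assumes the headline score is the mean of the five runs, which the description does not confirm, and the run values are invented purely for illustration.

```python
import statistics


def summarize_runs(run_scores):
    """Return (mean, standard error of the mean) for a list of per-run scores."""
    mean = statistics.mean(run_scores)
    stdev = statistics.stdev(run_scores)      # sample standard deviation
    sem = stdev / len(run_scores) ** 0.5      # standard error of the mean
    return mean, sem


# Hypothetical per-run scores (percent) for one model; not real benchmark data.
hypothetical_runs = [52.0, 54.0, 53.0, 55.0, 51.0]

mean, sem = summarize_runs(hypothetical_runs)
print(f"{mean:.1f}% +/- {sem:.2f}")  # 53.0% +/- 0.71
```

Under spread of this size, a gap as small as the 0.1 points between Claude Opus 4.7 (53.1%) and Cursor (53.0%) would not by itself separate the two models.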

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | Speed (tok/s) | Blended cost ($/1M tokens) |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 65 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 78 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 53 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 124 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 89 | $5.63 |
| 6 | Qwen3.7 Max | 56.6 | 199 | $3.75 |
| 7 | Gemini 3.5 Flash | 55.3 | 222 | $3.38 |
| 8 | Kimi K2.6 | 53.9 | 32 | $1.71 |
| 9 | MiMo-V2.5-Pro | 53.8 | 51 | $0.544 |
| 10 | GPT-5.3 Codex | 53.6 | 72 | $4.81 |

Output tokens per second; higher is faster. Limited to models with an intelligence score of at least 40.

| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 222 |
| 2 | Grok 4.3 | 213 |
| 3 | Grok 4.20 0309 v2 | 200 |
| 4 | Qwen3.7 Max | 199 |
| 5 | Gemini 3 Flash Preview | 196 |
| 6 | Grok 4.20 0309 | 182 |
| 7 | GPT-5.1 Codex | 180 |
| 8 | MiniMax-M2.5 | 176 |
| 9 | GPT-5.4 mini | 167 |
| 10 | Qwen3.6 35B A3B | 167 |
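The speed ranking amounts to a filter followed by a sort. The sketch below assumes the "intelligence score" used for the cutoff is the composite index from the previous table; `Entry` and `fastest` are hypothetical names, and the last row is an invented model included only to show the cutoff in action.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float           # composite index (assumed to be the "intelligence score")
    tokens_per_sec: float  # output speed


def fastest(entries, min_score=40.0, top_n=10):
    """Keep entries at or above min_score, then rank by output speed, fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tokens_per_sec, reverse=True)[:top_n]


entries = [
    Entry("Claude Opus 4.8", 61.4, 65),
    Entry("Gemini 3.1 Pro Preview", 57.2, 124),
    Entry("Qwen3.7 Max", 56.6, 199),
    Entry("Gemini 3.5 Flash", 55.3, 222),
    Entry("hypothetical-small-model", 35.0, 400),  # invented row, below the cutoff
]

for entry in fastest(entries, top_n=3):
    print(entry.model, entry.tokens_per_sec)
# Gemini 3.5 Flash 222
# Qwen3.7 Max 199
# Gemini 3.1 Pro Preview 124
```

The invented 400 tok/s model never appears in the output, which is the point of the minimum-score rule: raw speed alone does not earn a place in the ranking.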

Blended cost per 1M tokens (3:1 input/output ratio); lower is cheaper. Limited to models with an intelligence score of at least 40.

| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | GPT-5.4 nano | $0.463 |
| 7 | MiniMax-M2.7 | $0.525 |
| 8 | KAT Coder Pro V2 | $0.525 |
| 9 | MiniMax-M2.5 | $0.525 |
| 10 | MiMo-V2.5-Pro | $0.544 |
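A 3:1 input/output blend can be read as a weighted average in which input tokens count three times as much as output tokens. The sketch below follows that reading; `blended_price` is a hypothetical helper, and the prices passed to it are invented, since the per-direction prices are not listed here.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output token ratio."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices in dollars per 1M tokens; not taken from the table.
print(blended_price(1.00, 3.00))  # 1.5
print(blended_price(2.00, 8.00))  # 3.5
```

Because input tokens carry three quarters of the weight, a model with expensive output but cheap input can still rank well on this measure.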