Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, 12.3 points above its score of 53 on the Artificial Analysis index, though the two scales are not directly comparable. gpt-5.2-2025-12-11-medium sits second at 64.4%, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%. The SWE-rebench leaderboard shows material reshuffling in the upper tier: Kimi K2.5 climbed from rank 16 (46.8) to rank 13 (58.5), Kimi K2 Thinking jumped from rank 35 (40.9) to rank 17 (57.4), and Gemini 3 Flash Preview's score rose from 46.4 to 52.5 even as it slipped from rank 18 to rank 22, so all three improved in absolute terms on this benchmark. The Artificial Analysis leaderboard, which uses a different evaluation methodology, remains largely stable in its upper rankings, with GPT-5.4 and Gemini 3.1 Pro Preview tied at 57.2; KAT Coder Pro V2 entered at rank 23 with 43.8, and Nemotron Cascade 2 30B appeared at rank 81 with 27.7. The gap between the two benchmarks' top scores (SWE-rebench's 65.3% versus Artificial Analysis's 57.2) suggests they measure different aspects of model capability or use distinct evaluation criteria; without details on SWE-rebench's methodology relative to Artificial Analysis, it remains unclear whether the higher scores reflect easier test cases, different task distributions, or genuine performance differences on the same underlying problems. The consistent ordering within each benchmark indicates both are internally coherent, but the divergence between them argues for caution in treating either as a complete picture of coding ability.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
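The five-run protocol described above amounts to averaging per-run resolved rates. A minimal sketch of that aggregation, with made-up pass/fail results rather than SWE-rebench's actual harness or data:

```python
from statistics import mean, stdev

def score_model(run_results: list[list[bool]]) -> tuple[float, float]:
    """Aggregate independent runs into a mean resolved rate.

    run_results: one list of per-task pass/fail booleans per run.
    Returns (mean resolved rate, std dev across runs) as percentages.
    """
    rates = [100.0 * sum(run) / len(run) for run in run_results]
    return mean(rates), stdev(rates)

# Five hypothetical runs over the same four tasks:
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, False],
    [True, True, False, True],
]
m, s = score_model(runs)
print(round(m, 1), round(s, 1))  # 70.0 20.9
```

Reporting the cross-run spread alongside the mean is what makes the repeated runs worthwhile: a model whose five runs land at 75/50/100/50/75 is far less settled than one that scores 70 every time.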
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 96 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 120 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 94 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 61 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 79 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 81 | $4.81 |
| 7 | GLM-5 | 49.8 | 65 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 64 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 45 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 0 | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 242 |
| 2 | GPT-5.4 mini | 219 |
| 3 | GPT-5 Codex | 215 |
| 4 | Gemini 3 Flash Preview | 193 |
| 5 | GPT-5.4 nano | 177 |
| 6 | GPT-5.1 Codex | 155 |
| 7 | Qwen3.5 122B A10B | 145 |
| 8 | GPT-5.2 Codex | 129 |
| 9 | Gemini 3 Pro Preview | 123 |
| 10 | MiMo-V2-Flash | 123 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
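A "3:1 input/output" blend is usually computed as a weighted average of the provider's input and output prices, weighting input three times as heavily. A minimal sketch under that assumption, with illustrative prices rather than any provider's actual rates:

```python
def blended_cost(input_price: float, output_price: float,
                 input_weight: int = 3, output_weight: int = 1) -> float:
    """Weighted-average price per 1M tokens at a fixed input:output mix.

    input_price / output_price: $ per 1M input / output tokens.
    Default 3:1 weighting matches the table's stated blend.
    """
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total

# Illustrative: $0.50/1M input and $2.00/1M output at a 3:1 mix
print(blended_cost(0.50, 2.00))  # 0.875
```

Because input tokens dominate the blend, models with cheap input but expensive output can still rank well on this table even when their output price alone looks uncompetitive.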