Claude Opus 4.6 climbed from eighth to first on SWE-rebench, gaining 12.3 percentage points (53% to 65.3%). A jump that large reshuffles the entire coding-benchmark landscape, and it warrants scrutiny: did the model improve, or did the test or evaluation methodology change underneath it?

The top tier has tightened considerably. gpt-5.2-2025-12-11-medium sits at 64.4%, with GLM-5, Junie, and gpt-5.4-2026-03-05-medium all at 62.8%, a compressed band where fractional improvements matter. Below the top five, the ranking has reordered substantially. Gemini 3.1 Pro Preview dropped from third to seventh despite scoring 62.3%, while several models gained ground: GLM-5 (rank 16 at 49.8% to rank 3 at 62.8%), Kimi K2.5 (rank 28 at 46.8% to rank 16 at 58.5%), and Kimi K2 Thinking (rank 53 at 40.9% to rank 21 at 57.4%). That pattern suggests either substantial capability improvements across Chinese models or a shift in benchmark composition toward their training distribution.

On Artificial Analysis, by contrast, the top tier barely moved: GPT-5.5 still leads at 60.2, while Claude Opus 4.6 sits ninth at 53.0, a 7.2-point gap that contradicts the SWE-rebench clustering and raises questions about benchmark alignment. Grok 4.3 entered the Artificial Analysis top 100 at position eight with 53.2; most other models held their prior positions, suggesting this benchmark is more stable but possibly measuring a different capability or using different evaluation criteria. The divergence between SWE-rebench's dramatic reshuffling and Artificial Analysis's relative stability indicates these benchmarks are not measuring the same problem space, or that one has undergone an undocumented methodological revision.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
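The five-run protocol described above amounts to averaging each model's resolved-task rate across independent runs and reporting the spread. A minimal sketch of that aggregation (function name and the sample run values are hypothetical; SWE-rebench's exact reporting may differ):

```python
from statistics import mean, stdev

def aggregate_runs(resolved_fractions):
    """Average resolved-task rate across independent runs of one model.

    resolved_fractions: one resolved fraction per run (SWE-rebench
    reportedly uses five runs to smooth stochastic variance).
    Returns (mean rate, sample standard deviation).
    """
    return mean(resolved_fractions), stdev(resolved_fractions)

# Hypothetical five runs of a single model on the same task set.
avg, sd = aggregate_runs([0.64, 0.66, 0.65, 0.63, 0.67])
print(f"{avg:.1%} ± {sd:.1%}")  # 65.0% ± 1.6%
```

With a spread of roughly ±1.6 points on five runs, the 62.7% vs 62.8% gaps in the table above are well within run-to-run noise, which is exactly why the repeated-run design matters.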
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 67 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 51 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 87 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 25 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 60 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 82 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 221 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 52 | $10.00 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.3 | 221 |
| 2 | Qwen3.6 35B A3B | 185 |
| 3 | Gemini 3 Flash Preview | 184 |
| 4 | GPT-5.1 Codex | 172 |
| 5 | GPT-5.4 mini | 169 |
| 6 | GPT-5 Codex | 165 |
| 7 | GPT-5.4 nano | 162 |
| 8 | Qwen3.5 122B A10B | 148 |
| 9 | GPT-5.1 | 131 |
| 10 | Gemini 3.1 Pro Preview | 130 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |