Claude Opus 4.6 moved from fourth to first on SWE-rebench, improving from 53% to 65.3%, a 12.3-point gain that puts it ahead of gpt-5.2-2025-12-11-medium at 64.4%. GLM-5 climbed from tenth to third, jumping 13 points to 62.8% and matching gpt-5.4-2026-03-05-medium. Gemini 3.1 Pro Preview fell from first to fifth despite scoring 62.3%, suggesting either that the benchmark has tightened or that other models benefited from architectural changes rather than the field improving uniformly. Kimi K2.5 advanced from twentieth to thirteenth with an 11.7-point increase to 58.5%, and Kimi K2 Thinking jumped from forty-second to seventeenth with a 16.5-point gain to 57.4%.

The Artificial Analysis rankings, by contrast, show minimal movement in the top tier, with most models holding their positions. This suggests that SWE-rebench and Artificial Analysis measure different problem spaces, or that SWE-rebench's methodology captures recent architectural improvements that general-purpose benchmarks have not yet registered. The concentration of gains among Claude and Kimi variants, paired with Gemini's relative decline, points to task-specific optimization rather than across-the-board capability expansion. Without methodological details on how SWE-rebench was constructed or modified, it is unclear whether these shifts reflect genuine progress on software engineering tasks or a change in the benchmark's composition.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
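The five-run protocol described above amounts to reporting a mean resolved rate per model, with the spread across runs indicating stochastic variance. A minimal sketch of that aggregation (the per-run scores below are hypothetical, not published SWE-rebench data):

```python
import statistics

def summarize_runs(resolved_rates: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation across repeated runs."""
    return statistics.mean(resolved_rates), statistics.stdev(resolved_rates)

# Hypothetical resolved rates (%) for one model across five runs:
runs = [64.8, 65.9, 65.1, 65.6, 65.1]
mean, sd = summarize_runs(runs)
print(f"{mean:.1f} ± {sd:.1f}")
```

Reporting the mean of five runs smooths out run-to-run noise from sampling temperature and flaky task environments, which single-run leaderboards conflate with genuine capability differences.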
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 124 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 80 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 75 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 49 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
| 6 | Claude Sonnet 4.6 | 51.7 | 50 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 57 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 65 | $4.81 |
| 9 | Qwen3.6 Plus | 50 | 49 | $1.13 |
| 10 | GLM-5 | 49.8 | 70 | $1.55 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 0309 v2 | 191 |
| 2 | Grok 4.20 0309 | 185 |
| 3 | GPT-5.4 nano | 178 |
| 4 | Gemini 3 Flash Preview | 176 |
| 5 | GPT-5.1 Codex | 175 |
| 6 | GPT-5 Codex | 171 |
| 7 | GPT-5.4 mini | 160 |
| 8 | Qwen3.5 122B A10B | 136 |
| 9 | Gemini 3 Pro Preview | 134 |
| 10 | Gemini 3.1 Pro Preview | 124 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
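The blended price above is a weighted average of the per-million-token input and output prices at the stated 3:1 ratio. A minimal sketch of that calculation (the example prices are hypothetical, not drawn from the table):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blend per-1M-token prices at a 3:1 input:output ratio:
    three parts input price to one part output price."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Hypothetical prices for illustration:
print(blended_price(0.25, 1.00))  # 0.4375
```

The 3:1 weighting reflects typical workloads, where prompts (input) consume several times more tokens than completions (output), so cheap input pricing dominates the blended figure.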