Claude Opus 4.6 leads SWE-rebench at 65.3% while placing only fourth on the Artificial Analysis composite at 53, and Gemini 3.1 Pro Preview shows the mirror image, topping Artificial Analysis at 57.2 but sitting sixth on SWE-rebench at 62.3%. The two scores live on different scales (a resolution rate on real tasks versus a composite index), so raw point gaps between them mean little; the rank disagreement is the signal. Within SWE-rebench itself, the top tier shows compression rather than separation: positions two through five cluster between 62.7% and 64.4%, with gpt-5.2-2025-12-11-medium at 64.4%, GLM-5 and gpt-5.4-2026-03-05-medium both at 62.8%, and GLM-5.1 at 62.7%.

The divergence between the two leaderboards is most pronounced in the mid-tier: Kimi K2.5 ranks 20th on Artificial Analysis (46.8) but 16th on SWE-rebench (58.5%), Kimi K2 Thinking 42nd (40.9) against 21st (57.4%), and GLM-4.7 34th (42.1) against 14th (58.7%), suggesting these models are disproportionately strong on the agentic software-engineering tasks SWE-rebench isolates, strengths that a composite spanning coding, math, and reasoning dilutes. Across the rest of both lists, most models hold positions within a few slots of each other, which raises the question of whether the two benchmarks measure overlapping but distinct problem spaces or whether one simply carries higher variance. Without per-category breakdowns from either source, it is difficult to tell how much of the gap reflects genuine differences in software-engineering capability rather than differences in evaluation methodology.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of what five-run averaging implies for close scores follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
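Because each score above is a mean over five runs, adjacent entries in the table can sit inside run-to-run noise. A minimal sketch of that reasoning; the per-run rates below are invented for illustration, since the leaderboard publishes only the means:

```python
import statistics

# Hypothetical per-run resolved rates (%) for two closely ranked models.
# SWE-rebench publishes the mean of five runs; these individual runs are
# invented to illustrate how much noise a five-run mean can carry.
runs = {
    "model_a": [63.9, 65.1, 66.0, 64.8, 66.7],  # mean 65.3
    "model_b": [62.0, 65.2, 64.1, 63.3, 67.4],  # mean 64.4
}

for name, scores in runs.items():
    mean = statistics.mean(scores)
    # Standard error of the mean: sample std dev / sqrt(n).
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    print(f"{name}: mean={mean:.1f}%  +/-{sem:.1f} (SEM over {len(scores)} runs)")

# With these invented spreads, the two intervals overlap, so a
# 0.9-point gap between leaderboard means need not be a stable ranking.
```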
Artificial Analysis composite index across coding, math, and reasoning benchmarks. Score is the index value; tok/s is output speed; $/1M is the blended price per million tokens (3:1 input/output).
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 122 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 69 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 65 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 43 | $10.00 |
| 5 | Muse Spark | 52.1 | n/a | n/a |
| 6 | Claude Sonnet 4.6 | 51.7 | 48 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 43 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 61 | $4.81 |
| 9 | Qwen3.6 Plus | 50.0 | 53 | $1.13 |
| 10 | GLM-5 | 49.8 | 61 | $1.55 |
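The rank disagreement discussed at the top is easy to quantify for models that appear in both top-10 tables. A minimal sketch using the data transcribed from the two tables above; names that differ between the sources (for example the dated gpt-5.x checkpoints) will not match and are skipped:

```python
# Rank comparison for models that appear in both top-10 tables above.
swe_rebench = [
    "Claude Opus 4.6", "gpt-5.2-2025-12-11-medium", "GLM-5",
    "gpt-5.4-2026-03-05-medium", "GLM-5.1", "Gemini 3.1 Pro Preview",
    "DeepSeek-V3.2", "Claude Sonnet 4.6", "Claude Sonnet 4.5",
    "Qwen3.5-397B-A17B",
]
artificial_analysis = [
    "Gemini 3.1 Pro Preview", "GPT-5.4", "GPT-5.3 Codex",
    "Claude Opus 4.6", "Muse Spark", "Claude Sonnet 4.6",
    "GLM-5.1", "GPT-5.2", "Qwen3.6 Plus", "GLM-5",
]

aa_rank = {model: i for i, model in enumerate(artificial_analysis, start=1)}
for swe_pos, model in enumerate(swe_rebench, start=1):
    if model in aa_rank:  # exact-match only; differently named checkpoints are skipped
        delta = aa_rank[model] - swe_pos
        print(f"{model}: SWE-rebench #{swe_pos}, "
              f"Artificial Analysis #{aa_rank[model]} ({delta:+d})")
```

On this data the biggest swings among exact-name matches are GLM-5 (#3 versus #10) and Gemini 3.1 Pro Preview (#6 versus #1), consistent with the divergence described above.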
Output speed in tokens per second (higher is faster). Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 173 |
| 2 | GPT-5 Codex | 164 |
| 3 | GPT-5.1 Codex | 160 |
| 4 | GPT-5.4 nano | 156 |
| 5 | GPT-5.4 mini | 153 |
| 6 | Grok 4.20 0309 | 141 |
| 7 | Grok 4.20 0309 v2 | 139 |
| 8 | Gemini 3 Pro Preview | 126 |
| 9 | Gemini 3.1 Pro Preview | 122 |
| 10 | Qwen3.5 122B A10B | 119 |
Blended price per 1M tokens, weighted 3:1 input to output (lower is cheaper). Only models with an intelligence score of at least 40 are included; the blend arithmetic is sketched after the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
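As a worked example of the 3:1 blend referenced above: three of every four tokens are priced at the input rate and one at the output rate. A minimal sketch with illustrative prices, not the actual rates behind any row in the table:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output ratio.

    Three of every four tokens are charged at the input rate,
    one at the output rate.
    """
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Illustrative prices only; not taken from any row above.
print(blended_price(0.25, 1.00))  # 0.4375 -> would display as ~$0.44/1M
```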