Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, 12.3 points above its 53.0 composite score on Artificial Analysis, where it sits ninth, though the two benchmarks measure different problem sets on different scales and cannot be directly compared. The top tier has consolidated around 62-65% on SWE-rebench, with GPT-5.2-2025-12-11-medium at 64.4% and three models tied at 62.8% (GLM-5, Junie, and GPT-5.4-2026-03-05-medium), suggesting diminishing returns in coding-task performance at the frontier.

More striking are the mid-tier movements: GLM-5 climbed from position 16 to 3 on SWE-rebench, GLM-4.7 rose from 40 to 14, and Kimi K2.5 advanced from 26 to 16, indicating that Chinese model families have made substantial gains on this particular benchmark. Gemini 3.1 Pro Preview, by contrast, sits third on Artificial Analysis (57.2) but only seventh on SWE-rebench (62.3%), a relative decline that may reflect task-specific strengths rather than regression.

On Artificial Analysis, the leaderboard remains fluid, with 33 new entries across the 373-model roster, including several reasoning-focused variants and smaller-parameter models, though the top ten remain dominated by GPT and Claude variants. SWE-rebench appears more selective and stable, tracking only 34 models versus hundreds on Artificial Analysis, which makes it a tighter measure of coding capability but limits visibility into the broader performance distribution. Without methodological details on how SWE-rebench tasks differ from Artificial Analysis's evaluation protocol, the divergence in rankings suggests the two benchmarks may reward different architectural or training choices, a distinction worth investigating rather than treating them as interchangeable measures of coding prowess.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
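The five-run protocol described above can be sketched as simple averaging: each model is evaluated five times and the per-run resolved rates are combined into one reported score, damping stochastic variance. The run values below are illustrative, not actual SWE-rebench data.

```python
# Hypothetical sketch of repeated-run scoring: average the fraction of
# tasks resolved across five independent runs of the same model.
from statistics import mean, stdev

# resolved fraction per run (illustrative numbers)
runs = [0.66, 0.64, 0.65, 0.66, 0.655]

score = mean(runs)    # reported score: mean over the five runs
spread = stdev(runs)  # run-to-run spread the averaging absorbs

print(f"score = {score:.1%}, stdev = {spread:.4f}")
```

Reporting the mean rather than a single run keeps a lucky or unlucky sample from moving a model several leaderboard positions.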
Artificial Analysis composite intelligence index across coding, math, and reasoning benchmarks, alongside output speed (tok/s) and blended price per 1M tokens ($/1M).
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 74 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 56 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 89 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 31 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 63 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 87 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 112 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 48 | $10.94 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 197 |
| 2 | GPT-5 Codex | 196 |
| 3 | Qwen3.6 35B A3B | 192 |
| 4 | GPT-5.4 mini | 184 |
| 5 | GPT-5.1 Codex | 184 |
| 6 | GPT-5.4 nano | 161 |
| 7 | Qwen3.5 122B A10B | 158 |
| 8 | GPT-5.1 | 151 |
| 9 | MiMo-V2-Flash | 147 |
| 10 | MiMo-V2-Omni-0327 | 134 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
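The 3:1 blend behind the cost column can be sketched as a weighted average of input and output per-token prices: three parts input price to one part output price. The example prices below are illustrative, not actual provider list prices.

```python
# Minimal sketch of the 3:1 input:output blended price per 1M tokens,
# assuming separate per-1M prices for input and output tokens.
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    """Blend prices at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + output_per_1m) / 4

# e.g. $0.10/1M input and $0.30/1M output blend to $0.15/1M
print(f"${blended_cost(0.10, 0.30):.2f} per 1M tokens")
```

The 3:1 weighting reflects a typical workload shape in which prompts (input) consume several times more tokens than completions (output), so cheap input pricing pulls the blended figure down faster than cheap output pricing.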