Claude Opus 4.6 holds the SWE-rebench lead at 65.3%, unchanged from the previous cycle, while the tier immediately below shows modest compression: gpt-5.2-2025-12-11-medium sits at 64.4%, and GLM-5 and gpt-5.4-2026-03-05-medium both score 62.8%.

The meaningful movement occurs in the mid-field, where GLM-4.7 has climbed from rank 43 (42.1 points on Artificial Analysis) to rank 14 (58.7% on SWE-rebench), a shift that suggests either a genuine capability jump or a divergence in what these two benchmarks measure. Kimi K2.5 similarly advanced from rank 28 to rank 16, and Kimi K2 Thinking jumped from rank 53 to rank 21, indicating that Chinese models have made gains on the SWE-rebench evaluation specifically. Gemini 3.1 Pro Preview dropped from rank 3 to rank 6 on SWE-rebench (62.3%) despite holding rank 3 on Artificial Analysis (57.2), a discrepancy that raises questions about benchmark stability, or about whether SWE-rebench and Artificial Analysis weight different problem classes.

The Artificial Analysis leaderboard itself shows minimal reshuffling in the top 20, with GPT-5.5 leading at 60.2 and Claude Opus 4.7 at 57.3, suggesting those rankings have stabilized. At the lower end, Granite 4.1 models appear as new entries on Artificial Analysis (30B at rank 229, 8B at 261, 3B at 324), and QwQ 32B and Qwen3 VL 30B A3B swapped positions at ranks 160 and 161 without score change, a cosmetic reordering.

The lack of dramatic score inflation across either benchmark and the persistence of the same top performers suggest the evaluations are not drifting, though the divergence between SWE-rebench and Artificial Analysis rankings for mid-tier models warrants investigation into whether they stress different failure modes or simply employ different evaluation protocols.
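One way to quantify how much two leaderboards disagree is Spearman rank correlation over the models they share. The sketch below uses hypothetical ranks, not the actual leaderboard data, and assumes no tied ranks so the closed-form formula applies:

```python
def spearman_rho(ranks_a: list[int], ranks_b: list[int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the
    same models: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    assert len(ranks_a) == len(ranks_b)
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ranks for five shared models on two leaderboards:
swe_rebench = [1, 2, 3, 4, 5]
artificial_analysis = [2, 1, 3, 5, 4]
print(f"{spearman_rho(swe_rebench, artificial_analysis):.2f}")
```

A rho near 1 means the two benchmarks largely agree on ordering; a markedly lower value for mid-tier models than for the top tier would support the different-failure-modes hypothesis.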
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
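The five-run protocol amounts to averaging the per-run resolved fraction. A minimal sketch of that aggregation, with a hypothetical `resolved_rate` helper and made-up task outcomes (the benchmark's actual scoring pipeline is not published here):

```python
from statistics import mean

def resolved_rate(runs: list[list[bool]]) -> float:
    """Average the resolved fraction over repeated independent runs.

    `runs` holds one list of task outcomes per run of the same model,
    where True means the model's patch resolved the issue.
    """
    per_run = [sum(outcomes) / len(outcomes) for outcomes in runs]
    return mean(per_run)

# Five hypothetical runs over the same four tasks:
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, False],
    [True, True, False, True],
]
print(f"{resolved_rate(runs):.1%}")  # mean of the five per-run scores
```

Averaging over repeated runs narrows the confidence interval on each score, which matters when the top of the table is separated by fractions of a point.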
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 65 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 52 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 129 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 93 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 25 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 59 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 86 | $4.81 |
| 8 | Claude Opus 4.6 | 53 | 49 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 33 | $2.92 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 191 |
| 2 | Gemini 3 Flash Preview | 189 |
| 3 | GPT-5.1 Codex | 170 |
| 4 | GPT-5 Codex | 166 |
| 5 | GPT-5.4 nano | 160 |
| 6 | GPT-5.4 mini | 158 |
| 7 | Qwen3.5 122B A10B | 142 |
| 8 | Gemini 3.1 Pro Preview | 129 |
| 9 | Gemini 3 Pro Preview | 129 |
| 10 | GPT-5.1 | 126 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
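The 3:1 blend above can be read as a weighted average of input and output prices. A minimal sketch under that assumption, with hypothetical per-token prices (not taken from any provider's actual pricing):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $ per 1M tokens, assuming a 3:1 input:output mix,
    i.e. three input tokens for every output token."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical prices: $0.25/1M input, $1.00/1M output
print(blended_price(0.25, 1.00))
```

Because output tokens typically cost several times more than input tokens, the 3:1 weighting keeps the blended figure closer to the input price, which is why some models with expensive output still rank well on this table.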