Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the tier immediately below has solidified around 62 to 64 percent: gpt-5.2-2025-12-11-medium at 64.4%, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%.

The more telling comparison is how models place here relative to the Artificial Analysis index (the deltas are recomputed in the sketch after the second table). Claude Opus 4.6 sits at rank 1 on SWE-rebench against rank 4 on Artificial Analysis, scoring 65.3 against 53.0, a 12.3-point gap; GLM-5 lands at rank 3 against rank 7 (13 points); Kimi K2.5 at rank 13 against rank 16 (11.7 points); and Kimi K2 Thinking at rank 17 against rank 35 (16.5 points). Gemini 3.1 Pro Preview moves the other way, from rank 2 on Artificial Analysis to rank 5 here, gaining only 5.1 points (57.2 to 62.3), the smallest improvement among the leaders. The divergence is sharpest for the Claude and Kimi models, which suggests the two benchmarks weight different problem classes or evaluation criteria.

The field is also tighter here: 5.7 percentage points separate first from tenth place on SWE-rebench, while the Artificial Analysis leaderboard spreads 8.0 points, from 57.2 at the top to 49.2 at rank 10. Without access to the specific methodology differences between the two benchmarks, it is unclear whether the SWE-rebench standings reflect genuine capability differences or a benchmark that rewards particular architectural or training choices.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
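On the five-run protocol mentioned above: a minimal sketch of how per-run results might be aggregated, assuming the published score is the mean resolved rate across runs (SWE-rebench's exact aggregation is not documented in this post). The task outcomes below are toy data.

```python
# Toy five-run aggregation: each run records, per task, whether the
# model's patch resolved the issue. Averaging the per-run resolved
# rates damps run-to-run stochastic variance.
from statistics import mean, stdev

runs = [  # five runs x five tasks of made-up pass/fail outcomes
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, False, True, False],
    [True, True, True, True, False],
    [True, False, False, True, False],
]

rates = [sum(run) / len(run) for run in runs]  # resolved rate per run
print(f"score = {mean(rates):.1%}, run-to-run sd = {stdev(rates):.1%}")
```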
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 81 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 74 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 53 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 66 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 72 | $4.81 |
| 7 | GLM-5 | 49.8 | 63 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 64 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 47 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 93 | $1.50 |
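Since the commentary leans on cross-leaderboard deltas, here is a minimal sketch that recomputes them from the two top-10 tables. The scores are hardcoded from this post, and the name equivalences (treating gpt-5.2-2025-12-11-medium as Artificial Analysis's GPT-5.2, and likewise for GPT-5.4) are assumptions; models outside either top 10, such as the Kimi entries, cannot be checked from the data shown here.

```python
# Scores hardcoded from the two top-10 tables above; names normalized
# by hand. The dated "-medium" variants are assumed to match the base
# GPT entries on Artificial Analysis.
swe_rebench = {  # model -> (rank, score %)
    "Claude Opus 4.6": (1, 65.3),
    "GPT-5.2": (2, 64.4),
    "GLM-5": (3, 62.8),
    "GPT-5.4": (4, 62.8),
    "Gemini 3.1 Pro Preview": (5, 62.3),
    "Claude Sonnet 4.6": (7, 60.7),
}
artificial_analysis = {  # model -> (rank, composite score)
    "GPT-5.4": (1, 57.2),
    "Gemini 3.1 Pro Preview": (2, 57.2),
    "Claude Opus 4.6": (4, 53.0),
    "Claude Sonnet 4.6": (5, 51.7),
    "GPT-5.2": (6, 51.3),
    "GLM-5": (7, 49.8),
}

for model in sorted(swe_rebench.keys() & artificial_analysis.keys()):
    sr_rank, sr = swe_rebench[model]
    aa_rank, aa = artificial_analysis[model]
    print(f"{model}: AA rank {aa_rank} -> SWE-rebench rank {sr_rank}, "
          f"score delta {sr - aa:+.1f}")
```

The printed deltas for Claude Opus 4.6, GLM-5, and Gemini 3.1 Pro Preview reproduce the 12.3, 13, and 5.1 figures cited in the commentary. Since one scale is a percentage and the other a composite index, the deltas are only meaningful relative to one another.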
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 238 |
| 2 | GPT-5.4 mini | 198 |
| 3 | Gemini 3 Flash Preview | 184 |
| 4 | GPT-5 Codex | 181 |
| 5 | GPT-5.4 nano | 160 |
| 6 | Qwen3.5 122B A10B | 134 |
| 7 | MiMo-V2-Flash | 129 |
| 8 | GPT-5.1 Codex | 118 |
| 9 | Gemini 3 Pro Preview | 115 |
| 10 | Gemini 3.1 Pro Preview | 114 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
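For reference, the blended figure folds two per-direction prices into one number at the stated 3:1 ratio. A minimal sketch of that arithmetic, using hypothetical prices since the per-direction rates for the listed models are not given here:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output mix: of every 4M
    tokens processed, 3M are assumed input and 1M output."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical per-direction prices, for illustration only.
print(blended_price(2.00, 8.00))  # -> 3.5, i.e. $3.50 per 1M blended
```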