Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%. It sits fifth on the Artificial Analysis index with a score of 53, but the two benchmarks score different problem sets on different scales and cannot be directly compared. The top six models on SWE-rebench cluster between 62.3% and 65.3%, with gpt-5.2-2025-12-11-medium at 64.4%, GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%, GLM-5.1 at 62.7%, and Gemini 3.1 Pro Preview at 62.3%, suggesting convergence in software engineering capability at the frontier.

Relative rankings shift noticeably between the two leaderboards. Gemini 3.1 Pro Preview places second on Artificial Analysis (57.2) but sixth on SWE-rebench (62.3%), indicating the benchmarks reward different model properties, with SWE-rebench apparently favoring systems trained or optimized for repository-level code repair. GLM-4.7 moves from rank 36 on Artificial Analysis (42.1) to rank 14 on SWE-rebench (58.7%), Kimi K2.5 from rank 21 (46.8) to rank 16 (58.5%), and Kimi K2 Thinking from rank 44 (40.9) to rank 21 (57.4%), suggesting these models embody architectural or training choices that translate effectively to the SWE-rebench evaluation methodology.

The Artificial Analysis leaderboard shows no movement in its top 30 positions relative to the previous snapshot, indicating ranking stability at the frontier when measured on that benchmark. SWE-rebench's methodology, which evaluates models on real pull requests and repository contexts, appears more sensitive to model-specific optimizations than Artificial Analysis's broader composite assessment; without detailed documentation of task overlap or divergence, claims about which benchmark better predicts real-world code repair performance remain speculative.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
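The five-run averaging described above can be sketched as follows. This is a minimal illustration, not SWE-rebench's actual harness: the task results are made up, and the aggregation (mean of per-run resolved rates) is one reasonable way to smooth stochastic variance.

```python
# Aggregate a resolved rate across repeated evaluation runs.
# SWE-rebench runs each model five times; averaging damps run-to-run noise.

def resolved_rate(runs: list[list[bool]]) -> float:
    """Mean fraction of tasks resolved, averaged over runs."""
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)

# Five hypothetical runs over the same four tasks (illustrative data).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, True],
    [True, True, False, True],
    [True, False, False, True],
]
print(f"{resolved_rate(runs):.1%}")  # → 65.0%
```

With equal task counts per run, this is the same as pooling all task outcomes and dividing by the total, but the per-run form also exposes run-to-run spread if you want it.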
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 53 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 85 | $5.63 |
| 4 | GPT-5.3 Codex | 53.6 | 93 | $4.81 |
| 5 | Claude Opus 4.6 | 53 | 59 | $10.00 |
| 6 | Muse Spark | 52.1 | — | — |
| 7 | Qwen3.6 Max Preview | 51.8 | — | — |
| 8 | Claude Sonnet 4.6 | 51.7 | 75 | $6.00 |
| 9 | GLM-5.1 | 51.4 | 45 | $2.15 |
| 10 | GPT-5.2 | 51.3 | 84 | $4.81 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 242 |
| 2 | Grok 4.20 0309 v2 | 228 |
| 3 | Grok 4.20 0309 | 220 |
| 4 | GPT-5 Codex | 211 |
| 5 | Gemini 3 Flash Preview | 197 |
| 6 | GPT-5.1 Codex | 196 |
| 7 | GPT-5.4 mini | 189 |
| 8 | GPT-5.4 nano | 165 |
| 9 | Qwen3.5 122B A10B | 164 |
| 10 | Gemini 3 Pro Preview | 137 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
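The 3:1 blend above can be computed as a weighted average of input and output prices. A minimal sketch, assuming the blend is a simple 3:1 token-weighted mean; the per-token prices in the example are illustrative, not any provider's actual rates.

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens assuming 3 input tokens per output token."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Illustrative rates: $1.00/1M input, $2.00/1M output.
print(f"${blended_price(1.00, 2.00):.2f}/1M")  # → $1.25/1M
```

Because input tokens dominate the blend, a model with cheap input but expensive output can still rank well on this metric.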