Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, while on the Artificial Analysis composite it scores 53 and sits fourth, behind GPT-5.4 and Gemini 3.1 Pro Preview (both at 57.2) on a leaderboard showing minimal movement in its top tier. The 12.3-point gap between the two numbers reflects a known challenge in LLM evaluation: SWE-rebench and Artificial Analysis measure different problem distributions and solution strategies on different scales, so scores cannot be compared directly across methodologies.

On SWE-rebench, the top ten cluster tightly, with only 5.7 percentage points separating first from tenth. This suggests either that the benchmark is approaching saturation for frontier models or that its test set lacks discriminative power at the high end. GLM-5 and Kimi K2.5 show substantial SWE-rebench gains (13 and 9.7 points respectively), yet their Artificial Analysis positions remain largely stable, which points to specialized improvements on code-related tasks rather than across-the-board capability increases.

On Artificial Analysis, which covers a broader evaluation surface, Gemma 4 31B and Gemma 4 E4B enter the top 150, suggesting incremental progress in the open-source tier, though most entries below rank 40 merely shuffle positions without meaningful score changes. Neither benchmark publishes enough methodological detail to tell whether score improvements reflect genuine capability gains or dataset-specific optimization, and the absence of error bars or confidence intervals makes the statistical significance of most movements impossible to assess.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
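SWE-rebench's five-run protocol implies a simple way to attach uncertainty to a headline score, even though the published table omits it. A minimal sketch of a mean and 95% confidence interval over five runs; the per-run resolve rates below are hypothetical, not SWE-rebench data:

```python
import statistics

def mean_with_ci(runs, t_crit=2.776):
    """Mean resolve rate and 95% CI half-width.

    t_crit is the two-sided 95% Student's t value for 4 degrees
    of freedom (appropriate for 5 runs).
    """
    m = statistics.mean(runs)
    sem = statistics.stdev(runs) / len(runs) ** 0.5  # standard error
    return m, t_crit * sem

# Hypothetical per-run resolve rates (percent) for one model
runs = [64.8, 65.1, 65.9, 65.0, 65.7]
mean, half = mean_with_ci(runs)
print(f"{mean:.1f}% ± {half:.1f}")  # → 65.3% ± 0.6
```

With run-to-run spread on this order, many of the sub-point gaps in the table above would fall inside overlapping intervals, which is the significance question the analysis raises.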
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 76 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 118 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 72 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 46 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 52 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 70 | $4.81 |
| 7 | GLM-5 | 49.8 | 61 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 51 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 40 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 0 | $1.50 |
Output tokens per second, higher is faster. Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 245 |
| 2 | GPT-5.4 nano | 206 |
| 3 | Gemini 3 Flash Preview | 189 |
| 4 | GPT-5.4 mini | 185 |
| 5 | GPT-5 Codex | 172 |
| 6 | GPT-5.1 Codex | 168 |
| 7 | Qwen3.5 122B A10B | 137 |
| 8 | Gemini 3 Pro Preview | 128 |
| 9 | MiMo-V2-Flash | 125 |
| 10 | Gemini 3.1 Pro Preview | 118 |
Blended cost per 1M tokens (assuming a 3:1 input-to-output token ratio), lower is cheaper. Only models with an intelligence score of at least 40 are included.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
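The 3:1 blend in the caption above is a weighted average of the input and output prices. A minimal sketch of that arithmetic; the per-token prices used here are illustrative, not taken from the table:

```python
def blended_price(input_per_m, output_per_m, ratio=3):
    """Blended $/1M tokens, assuming `ratio` input tokens
    for every output token (3:1 by default)."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

# Illustrative prices: $0.25/1M input, $1.00/1M output
print(blended_price(0.25, 1.00))  # → 0.4375
```

Because input tokens dominate the blend, models with cheap input but expensive output can still rank well on this metric.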