Claude Opus 4.6 now leads the SWE-rebench rankings at 65.3%, a 12.3 percentage point jump from its prior score of 53%. The rest of the top tier shows consolidation rather than dramatic reshuffling: gpt-5.2-2025-12-11-medium (64.4%), GLM-5 (62.8%), and Junie (62.8%) occupy positions 2 through 4, putting the entire top four within a 2.5-point band and suggesting the frontier of coding performance has compressed into a narrow range.

The movement is meaningful in specific quarters. GLM-5 rose from rank 17 to rank 3, GLM-5.1 climbed from 14 to 6, and Kimi K2.5 advanced from 29 to 16, indicating that Chinese model families are closing the gap on the leaders. Gemini 3.1 Pro Preview, meanwhile, dropped from rank 3 to rank 7 despite holding a respectable 62.3%.

The Artificial Analysis benchmark tells a different story. It shows far less movement at the top, with GPT-5.5 still leading at 60.2 and Claude Opus 4.6 at rank 9 with 53 points, a significant divergence between the two evaluation frameworks. SWE-rebench reflects a methodology focused on software engineering tasks with specific, measurable outcomes, whereas the Artificial Analysis composite may weight different problem classes or evaluation criteria. The divergence matters: a model can rank first on one benchmark and ninth on another, which suggests neither benchmark alone captures complete coding capability. And the volume of removals from the Artificial Analysis rankings (over 100 models) without corresponding SWE-rebench entries makes it impossible to tell whether those models genuinely degraded or were simply deprioritized in evaluation cycles, a methodological gap worth keeping in mind when interpreting movement as progress.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
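As described above, SWE-rebench runs each model five times to account for stochastic variance. Below is a minimal sketch of how repeated runs could be aggregated into a single resolved rate; the counts are hypothetical, for illustration only, and do not reflect actual SWE-rebench data or its exact aggregation procedure.

```python
from statistics import mean, stdev

def aggregate_runs(resolved_per_run: list[int], total_tasks: int) -> tuple[float, float]:
    """Average resolved rate across repeated runs, plus the run-to-run spread."""
    rates = [resolved / total_tasks for resolved in resolved_per_run]
    return mean(rates), stdev(rates)

# Hypothetical numbers for illustration only.
avg, spread = aggregate_runs([42, 40, 44, 41, 43], total_tasks=65)
print(f"resolved rate: {avg:.1%} (±{spread:.1%} across 5 runs)")
```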
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | Output tok/s | Blended $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 76 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 61 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 133 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 84 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 29 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 65 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 93 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 150 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 53 | $10.94 |
| 10 | Qwen3.6 Max Preview | 51.8 | 37 | $2.92 |
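The divergence discussed at the top is easy to make concrete. The sketch below lines up the ranks from the two top-10 tables for the models that appear in both; the names and ranks come straight from those tables, and everything else is illustrative plumbing rather than either benchmark's methodology.

```python
# Ranks taken from the two top-10 tables above; only models present in both.
swe_rebench = {"Claude Opus 4.6": 1, "Gemini 3.1 Pro Preview": 7}
artificial_analysis = {"Claude Opus 4.6": 9, "Gemini 3.1 Pro Preview": 3}

for model in swe_rebench.keys() & artificial_analysis.keys():
    delta = artificial_analysis[model] - swe_rebench[model]
    print(f"{model}: SWE-rebench #{swe_rebench[model]}, "
          f"Artificial Analysis #{artificial_analysis[model]} (shift {delta:+d})")
```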
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5 Codex | 210 |
| 2 | Gemini 3 Flash Preview | 199 |
| 3 | Qwen3.6 35B A3B | 199 |
| 4 | GPT-5.1 Codex | 187 |
| 5 | GPT-5.4 mini | 184 |
| 6 | GPT-5.4 nano | 162 |
| 7 | Qwen3.5 122B A10B | 156 |
| 8 | Grok 4.3 | 150 |
| 9 | GPT-5.1 | 149 |
| 10 | MiMo-V2-Flash | 145 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | Qwen3.5 27B | $0.825 |
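The blended figure above is a weighted average of per-million-token input and output prices at the stated 3:1 ratio. A minimal sketch of that arithmetic, using hypothetical prices rather than any provider's actual rates:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: int = 3, output_weight: int = 1) -> float:
    """Blend per-1M-token prices at a 3:1 input/output ratio."""
    total = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total

# Hypothetical prices for illustration: $1.00/1M input, $4.00/1M output.
print(f"${blended_price(1.00, 4.00):.2f} per 1M tokens blended")  # -> $1.75
```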