Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, with gpt-5.2-2025-12-11-medium close behind at 64.4% and GLM-5 at 62.8%. These figures should not be read as gains over the Artificial Analysis scores below (Opus 4.6 at 53, GLM-5 at 49.8): SWE-rebench reports a resolve rate on real software engineering tasks, while Artificial Analysis publishes a composite index across coding, math, and reasoning benchmarks, so the 8-12 point differences at the top reflect different scales and constructs rather than improvement.

The rank reshuffle between the two frameworks is the more meaningful signal. Gemini 3.1 Pro Preview leads Artificial Analysis (#1 at 57.2) but sits at #5 on SWE-rebench (62.3%), while Kimi K2.5 climbs from #20 (46.8) to #13 (58.5) and Kimi K2 Thinking from #42 (40.9) to #17 (57.4), suggesting these models perform meaningfully better on the specific coding problems SWE-rebench isolates; a small sketch after the Artificial Analysis table below walks through the overlap. Without documentation of the methodological differences between the frameworks (task distribution, pass conditions, scaffolding), direct score comparison between the two systems is unreliable. What does hold up: Claude Opus 4.6 sits at or near the top of both rankings, and the clustering of several models in the 58-62% band on SWE-rebench suggests the benchmark has enough resolution to differentiate them.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
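To make the five-runs-per-model protocol concrete, here is a minimal sketch of how a resolve rate could be averaged across repeated runs. The run outcomes and the `runs` structure are hypothetical illustrations, not SWE-rebench's actual harness or data:

```python
from statistics import mean, stdev

# Hypothetical per-run results: each inner list holds task outcomes for
# one full benchmark run (True = the model's patch resolved the issue).
# SWE-rebench runs each model five times to average out stochastic variance.
runs = [
    [True, True, False, True],   # run 1
    [True, False, False, True],  # run 2
    [True, True, True, True],    # run 3
    [True, True, False, False],  # run 4
    [True, True, False, True],   # run 5
]

# Resolve rate per run; the mean across runs is the reported score.
rates = [sum(run) / len(run) for run in runs]
print(f"score = {mean(rates):.1%} (±{stdev(rates):.1%} across {len(rates)} runs)")
```

Averaging per-run rates keeps each run equally weighted, so a single lucky or unlucky run moves the reported score by at most a fifth of its deviation.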
Artificial Analysis composite index across coding, math, and reasoning benchmarks, shown with output speed (tok/s) and blended price per 1M tokens ($/1M).
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 125 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 79 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 74 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 50 | $10.00 |
| 5 | Muse Spark | 52.1 | n/a | n/a |
| 6 | Claude Sonnet 4.6 | 51.7 | 62 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 65 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 65 | $4.81 |
| 9 | Qwen3.6 Plus | 50 | 52 | $1.13 |
| 10 | GLM-5 | 49.8 | 70 | $1.55 |
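To ground the reshuffle discussed at the top, a small sketch comparing the models that appear in both top-10 tables above; the ranks are copied directly from those tables, and the comparison logic is purely illustrative:

```python
# Ranks taken from the two top-10 tables above, for models listed in both.
swe_rebench = {
    "Claude Opus 4.6": 1, "GLM-5": 3, "Gemini 3.1 Pro Preview": 5,
    "Claude Sonnet 4.6": 7,
}
artificial_analysis = {
    "Gemini 3.1 Pro Preview": 1, "Claude Opus 4.6": 4,
    "Claude Sonnet 4.6": 6, "GLM-5": 10,
}

# Positive delta = the model ranks higher (closer to #1) on SWE-rebench
# than on Artificial Analysis.
for model in sorted(swe_rebench, key=swe_rebench.get):
    aa, swe = artificial_analysis[model], swe_rebench[model]
    print(f"{model}: AA #{aa} -> SWE-rebench #{swe} ({aa - swe:+d} places)")
```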
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 0309 v2 | 192 |
| 2 | GPT-5.4 nano | 190 |
| 3 | Grok 4.20 0309 | 186 |
| 4 | Gemini 3 Flash Preview | 176 |
| 5 | GPT-5 Codex | 168 |
| 6 | GPT-5.1 Codex | 166 |
| 7 | GPT-5.4 mini | 160 |
| 8 | Gemini 3 Pro Preview | 139 |
| 9 | MiMo-V2-Flash | 129 |
| 10 | Qwen3.5 122B A10B | 128 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
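The 3:1 blend in the cost table is a weighted average of input and output token prices. A minimal sketch of that arithmetic, using hypothetical per-million-token prices rather than any provider's actual rates:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $ per 1M tokens at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical prices: $0.25/1M input and $1.00/1M output blend to
# (3 * 0.25 + 1 * 1.00) / 4 = $0.4375 per 1M tokens.
print(f"${blended_price(0.25, 1.00):.4f}/1M")
```

With three parts input to one part output, the input price carries 75% of the weight in the blended figure.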