Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, well above its 52.9 composite score on Artificial Analysis, though these are distinct benchmarks measuring different problem sets and the scores should not be compared as improvement on the same task. The SWE-rebench leaderboard shows tight clustering in the upper tier: models ranked 2 through 7 all score between 64.4% and 62.3%, suggesting convergence in code-solving capability among leading systems.

GLM-5 and GLM-5.1 both advanced on Artificial Analysis, now scoring 49.8 and 51.4 after previously sitting at positions 17 and 14, while Kimi K2 Thinking, previously ranked 54, now scores 40.9, indicating Chinese-developed models are narrowing the gap. Gemini 3.1 Pro Preview held steady at 57.2 on Artificial Analysis and appears at position 7 on SWE-rebench with 62.3%.

The SWE-rebench methodology evaluates code agents on real GitHub issues requiring multi-step reasoning and tool use, while Artificial Analysis aggregates general reasoning tasks, so raw score differences across benchmarks reflect task difficulty rather than model capability regression. Within SWE-rebench, positions 1 through 10 span only 5.3 percentage points, suggesting marginal gains require increasingly refined approaches rather than architectural leaps. The Artificial Analysis rankings show broader stratification, with positions 1 through 10 spanning 8.0 points, indicating that general reasoning benchmarks may currently be more discriminative at the frontier than specialized coding benchmarks.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
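The clustering claim can be checked directly from the table above. A minimal sketch (the `spread` helper is ours, not part of either benchmark's tooling):

```python
# Top-10 SWE-rebench scores exactly as listed in the table above.
scores = [65.3, 64.4, 62.8, 62.8, 62.8, 62.7, 62.3, 60.9, 60.7, 60.0]

def spread(vals):
    """Percentage-point gap between the best and worst score in a slice."""
    return round(max(vals) - min(vals), 1)

print(spread(scores))       # rank 1 to rank 10: 5.3 points
print(spread(scores[1:7]))  # ranks 2 through 7: 2.1 points
```

The 2.1-point band across ranks 2 through 7 is what the clustering observation refers to; the full top-10 spread is 5.3 points.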
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 65 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 63 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 128 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 41 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 54 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 76 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 81 | $1.56 |
| 9 | Claude Opus 4.6 | 52.9 | 48 | $10.94 |
| 10 | Muse Spark | 52.2 | 0 | $0.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 197 |
| 2 | GPT-5.1 Codex | 183 |
| 3 | Qwen3.6 35B A3B | 182 |
| 4 | GPT-5 Codex | 179 |
| 5 | GPT-5.4 mini | 169 |
| 6 | Qwen3.5 122B A10B | 154 |
| 7 | MiMo-V2-Flash | 149 |
| 8 | GPT-5.4 nano | 148 |
| 9 | Hy3-preview | 134 |
| 10 | Gemini 3.1 Pro Preview | 128 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | Hy3-preview | $0.143 |
| 2 | MiMo-V2-Flash | $0.15 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | DeepSeek V3.2 | $0.337 |
| 5 | GPT-5.4 nano | $0.463 |
| 6 | MiniMax-M2.7 | $0.525 |
| 7 | KAT Coder Pro V2 | $0.525 |
| 8 | MiniMax-M2.5 | $0.525 |
| 9 | Qwen3.6 35B A3B | $0.557 |
| 10 | GPT-5 mini | $0.688 |
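The "blended" figure above weights input and output token prices at a 3:1 ratio. A minimal sketch of that arithmetic, using hypothetical per-million-token prices (the $0.10 / $0.40 values below are illustrative, not any model's actual pricing):

```python
def blended_price(input_per_1m, output_per_1m, ratio=3):
    """Blended $/1M tokens assuming a ratio:1 input:output token mix."""
    return (ratio * input_per_1m + output_per_1m) / (ratio + 1)

# Hypothetical pricing: $0.10/1M input tokens, $0.40/1M output tokens.
print(round(blended_price(0.10, 0.40), 3))  # 0.175
```

In other words, a workload that reads three input tokens for every output token it generates pays three quarters of the input rate plus one quarter of the output rate.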