Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, up from fourth place and a score of 53 on the Artificial Analysis composite. The two scales are not directly comparable (percentage of tasks resolved versus a composite index), so the 12.3-point difference says more about what each benchmark measures than about any capability gain: it reflects how the two benchmarks weight code generation tasks, not an improvement in the model. The top five models cluster tightly between 62.3% and 65.3% on SWE-rebench: gpt-5.2-2025-12-11-medium at 64.4%, GLM-5 and gpt-5.4-2026-03-05-medium both at 62.8%, and Gemini 3.1 Pro Preview at 62.3%, all of which moved up relative to their Artificial Analysis rankings.

Kimi K2.5 climbed from 16th place on Artificial Analysis (46.8 points) to 13th on SWE-rebench (58.5%), while Kimi K2 Thinking advanced from 36th (40.9 points) to 17th (57.4%), suggesting that reasoning-focused variants are closing the gap with general-purpose models on software engineering tasks.

The two benchmarks diverge materially in their orderings: gpt-5.4 ranks first on Artificial Analysis at 57.2 but fourth on SWE-rebench at 62.8%, while Gemini 3.1 Pro Preview ties for first on Artificial Analysis yet ranks fifth on SWE-rebench. This implies SWE-rebench either captures different failure modes in code generation or applies stricter criteria around execution correctness rather than response quality alone. The spread between top and middle performers is also narrower on SWE-rebench (Claude Opus 4.6 to Step-3.5-Flash spans 5.7 percentage points) than on Artificial Analysis (GPT-5.4 to MiMo-V2-Pro spans 8 points). That could reflect smaller sample sizes or more binary pass/fail scoring on SWE-rebench, though neither benchmark publication provides enough methodology detail to confirm the source of the divergence.
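To make the divergence concrete, here is a minimal rank-correlation sketch over the six models that appear in both top-10 tables. It assumes the date-stamped gpt-5.2 and gpt-5.4 entries on SWE-rebench correspond to the plain GPT-5.2 and GPT-5.4 rows on Artificial Analysis; neither source confirms that mapping.

```python
# Minimal sketch: Spearman rank correlation between the two leaderboards
# over the models appearing in both top-10 tables. The gpt-5.x mapping
# between the two naming schemes is an inference, not confirmed.

# (SWE-rebench rank, Artificial Analysis rank) taken from the tables above.
shared = {
    "Claude Opus 4.6":        (1, 4),
    "gpt-5.2 (medium)":       (2, 6),
    "GLM-5":                  (3, 7),
    "gpt-5.4 (medium)":       (4, 1),
    "Gemini 3.1 Pro Preview": (5, 2),
    "Claude Sonnet 4.6":      (7, 5),
}

def subset_ranks(values):
    """Re-rank leaderboard positions into 1..n within the shared subset."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

swe = subset_ranks([v[0] for v in shared.values()])
aa  = subset_ranks([v[1] for v in shared.values()])

n = len(shared)
d2 = sum((a - b) ** 2 for a, b in zip(swe, aa))
rho = 1 - 6 * d2 / (n * (n * n - 1))  # Spearman formula, valid with no ties
print(f"Spearman rho over {n} shared models: {rho:.2f}")  # -> -0.26
```

On this shared subset the correlation comes out mildly negative (rho ≈ -0.26), consistent with the near-inversion of the top ranks noted above.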
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a toy aggregation sketch follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
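The five-run protocol matters because single-run agentic scores are noisy. The sketch below aggregates five hypothetical per-run pass rates into a mean and standard error; the per-run values are invented for illustration (chosen to average to the published 65.3%), and SWE-rebench's actual aggregation method isn't stated here.

```python
# Toy sketch: aggregate five per-run resolved rates for one model.
# The run values are invented; only the 65.3% mean matches the table.
import statistics

runs = [0.641, 0.658, 0.649, 0.662, 0.655]  # hypothetical resolved rates

mean = statistics.fmean(runs)
stderr = statistics.stdev(runs) / len(runs) ** 0.5

print(f"score: {mean:.1%} +/- {stderr:.1%}")  # -> score: 65.3% +/- 0.4%
```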
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 75 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 117 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 67 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 51 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 54 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 70 | $4.81 |
| 7 | GLM-5 | 49.8 | 57 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 52 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 40 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | n/a | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 248 |
| 2 | GPT-5.4 nano | 194 |
| 3 | Gemini 3 Flash Preview | 191 |
| 4 | GPT-5.4 mini | 184 |
| 5 | GPT-5 Codex | 159 |
| 6 | GPT-5.1 Codex | 138 |
| 7 | Qwen3.5 122B A10B | 134 |
| 8 | MiMo-V2-Flash | 123 |
| 9 | Gemini 3.1 Pro Preview | 117 |
| 10 | Gemini 3 Pro Preview | 115 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40. A worked example of the blend follows the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
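The 3:1 blend implied by the caption weights input tokens three times as heavily as output tokens. A minimal sketch of that formula is below; the `blended_price` helper and the per-token price split are hypothetical, since the table publishes only the blended figure and real input/output prices vary by provider.

```python
# Minimal sketch of the 3:1 blended-price formula implied by the caption:
# three input tokens are assumed for every output token.

def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $ per 1M tokens at a 3:1 input/output ratio."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Hypothetical split that lands on MiMo-V2-Flash's listed $0.15 blend.
print(f"${blended_price(0.10, 0.30):.2f} per 1M tokens")  # -> $0.15
```

Because the blend is input-heavy, models with cheap input but expensive output (common for reasoning-tuned variants) look better here than a 1:1 blend would suggest.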