Claude Opus 4.6 holds the SWE-rebench top position at 65.3%, unchanged from the prior measurement, while the Artificial Analysis index shows material reshuffling through the field without movement at the apex. On SWE-rebench, the top tier remains densely clustered between 60% and 65%: gpt-5.2-2025-12-11-medium sits at 64.4%, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%, suggesting diminishing returns in the high-performance region.

Gemini 3.1 Pro Preview ranks fifth on SWE-rebench (62.3%) but second on Artificial Analysis (57.2). The discrepancy itself warrants scrutiny: the two benchmarks measure different problem domains under different evaluation conditions, so direct ranking comparisons across them carry limited meaning.

Notable climbers on Artificial Analysis include Kimi K2 Thinking, up from position 37 (40.9) to 17 (57.4), and Kimi K2.5, up from 16 (46.8) to 13 (58.5); both moves suggest Kimi's reasoning variants now handle the Artificial Analysis task distribution more effectively.

The SWE-rebench methodology itself remains opaque in the provided data. Without details on how tasks are selected, whether they stress particular failure modes, or how the evaluation handles partial credit, the stability of the top rankings could reflect either a genuine performance plateau or a ceiling effect in the benchmark design. The Artificial Analysis list's expansion to 340 entries and reordering throughout suggests either new model submissions or a recalibration, but the data does not clarify which. Meaningful movement exists in the middle ranks on both benchmarks, yet the absence of methodological documentation limits interpretation of whether these shifts reflect true capability divergence or measurement artifacts.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
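The five-run protocol described above can be sketched as a simple aggregation: average the per-run resolution rates and report the spread as a sanity check on run-to-run variance. The run counts and task total below are hypothetical, not actual SWE-rebench data.

```python
from statistics import mean, stdev

def aggregate_runs(resolved_counts: list[int], total_tasks: int) -> dict:
    """Average the resolution rate over repeated runs of one model.

    SWE-rebench publishes a single score per model; averaging over
    five independent runs damps stochastic variance from sampling.
    """
    rates = [c / total_tasks for c in resolved_counts]
    return {
        "score_pct": round(100 * mean(rates), 1),
        "spread_pct": round(100 * stdev(rates), 1),  # run-to-run variability
    }

# Hypothetical: five runs on a 300-task snapshot
print(aggregate_runs([196, 194, 199, 195, 196], 300))
```

A small spread relative to the gaps between neighboring models is what makes single-number rankings like the table above meaningful at all.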
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 74 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 117 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 65 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 48 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 55 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 70 | $4.81 |
| 7 | GLM-5 | 49.8 | 57 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 51 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 40 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 0 | $1.50 |
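A composite index like the one above reduces to a weighted mean over per-axis results. Artificial Analysis does not publish its weights in the source data, so the equal weighting and per-axis scores below are assumptions for illustration only.

```python
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean across benchmark axes (coding, math, reasoning).

    The true Artificial Analysis weighting is not given in the source;
    equal weights here are an assumption.
    """
    total_w = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_w

# Hypothetical per-axis scores for one model, equally weighted
scores = {"coding": 55.0, "math": 60.0, "reasoning": 56.6}
weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}
print(round(composite_index(scores, weights), 1))  # 57.2
```

The weighting choice matters: a model strong on one axis and weak on another can move several rank positions under a different weight vector, which is one reason cross-index comparisons carry limited meaning.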
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 248 |
| 2 | GPT-5.4 nano | 195 |
| 3 | Gemini 3 Flash Preview | 190 |
| 4 | GPT-5.4 mini | 182 |
| 5 | GPT-5 Codex | 162 |
| 6 | GPT-5.1 Codex | 144 |
| 7 | Qwen3.5 122B A10B | 131 |
| 8 | MiMo-V2-Flash | 123 |
| 9 | Gemini 3.1 Pro Preview | 117 |
| 10 | Gemini 3 Pro Preview | 115 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
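The blended figure in the table is a 3:1 weighted average of input and output prices, so it can be reproduced from per-direction rates. The per-direction prices below are hypothetical; the table only publishes the blended number.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Blended $/1M tokens at the 3:1 input/output mix the table uses."""
    total = input_ratio + output_ratio
    return (input_per_m * input_ratio + output_per_m * output_ratio) / total

# Hypothetical: $0.20/1M input, $0.66/1M output
print(blended_price(0.20, 0.66))  # ≈ $0.315 blended
```

Because the mix is input-heavy, a model with cheap input but expensive output can still land low on this list; workloads with long generations (agents, chain-of-thought) will see costs closer to the output rate than the blended one.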