Claude Opus 4.6 tops the SWE-rebench rankings at 65.3%, despite sitting fourth at 53 points on the Artificial Analysis index; Gemini 3.1 Pro Preview, tied for first on Artificial Analysis at 57.2 points, ranks only fifth on the coding benchmark at 62.3%; and GLM-5 climbs from ninth at 49.8 points to third at 62.8%. The SWE-rebench scores cluster more tightly in the top tier: the gap between first and fifth is only 3 percentage points, whereas on Artificial Analysis GPT-5.4 and Gemini 3.1 Pro Preview tie at 57.2. This suggests either that the coding benchmark is more discriminative or that the models' relative strengths differ meaningfully between general reasoning and software engineering. Kimi K2.5 advances from sixteenth at 46.8 points on Artificial Analysis to thirteenth at 58.5% on SWE-rebench, and Kimi K2 Thinking jumps from thirty-seventh at 40.9 points to seventeenth at 57.4%, indicating that these models have particular strength in code-generation tasks. The SWE-rebench data provided lacks published methodology details: there is no information on test-set size, task distribution, or evaluation criteria, or on whether the results come from an initial release or continued refinement, which makes it difficult to assess whether the ranking shifts reflect genuine capability differences or methodological divergence from Artificial Analysis.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of what that aggregation might look like follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
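Because each published number aggregates five runs, a reported score like 65.3% is presumably a summary statistic rather than a single trial. SWE-rebench's exact aggregation method is not stated in this data; the sketch below assumes a simple mean over hypothetical per-run resolve rates, with the standard deviation as a rough measure of the run-to-run variance the five-run protocol is meant to absorb.

```python
from statistics import mean, stdev

# Hypothetical resolve rates for one model across five independent
# runs (invented values for illustration; SWE-rebench does not
# publish per-run numbers or its aggregation method in this data).
runs = [0.641, 0.658, 0.649, 0.662, 0.655]

score = mean(runs)    # assumption: the reported score is the mean
spread = stdev(runs)  # run-to-run stochastic variance

print(f"reported score: {score:.1%}")   # 65.3% for these made-up runs
print(f"run stdev:      {spread:.2%}")  # ~0.82%
```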
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 85 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 76 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 55 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 71 | $6.00 |
| 6 | GLM-5.1 | 51.3 | 80 | $2.15 |
| 7 | GPT-5.2 | 51.3 | 69 | $4.81 |
| 8 | Qwen3.6 Plus | 50 | 52 | $1.13 |
| 9 | GLM-5 | 49.8 | 70 | $1.55 |
| 10 | Claude Opus 4.5 | 49.7 | 67 | $10.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 0309 | 252 |
| 2 | GPT-5 Codex | 203 |
| 3 | GPT-5.4 nano | 202 |
| 4 | Gemini 3 Flash Preview | 196 |
| 5 | GPT-5.1 Codex | 191 |
| 6 | GPT-5.4 mini | 157 |
| 7 | Gemini 3 Pro Preview | 139 |
| 8 | Qwen3.5 122B A10B | 138 |
| 9 | Gemini 3.1 Pro Preview | 132 |
| 10 | MiMo-V2-Flash | 129 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
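A note on the blended figure: a 3:1 input/output blend is conventionally a weighted average of the per-direction prices, three parts input tokens to one part output. A minimal sketch under that assumption (the prices in the example are invented, not taken from any listed model):

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens, assuming a 3:1 input:output token mix."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical model priced at $0.25/1M input and $1.00/1M output:
print(f"${blended_price(0.25, 1.00):.3f} per 1M tokens")  # $0.438
```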