Claude Opus 4.6 tops SWE-rebench with a 65.3% pass rate yet ranks only fourth on the Artificial Analysis index at 53, while Gemini 3.1 Pro Preview shows the inverse split: second on Artificial Analysis but fifth on SWE-rebench at 62.3%. Divergences like these suggest the two benchmarks sample different problem distributions or apply different evaluation criteria. GLM-5 and Kimi K2.5 both climbed sharply on SWE-rebench, with Kimi K2.5 jumping from 46.8% to 58.5% and Kimi K2 Thinking advancing from 40.9% to 57.4%, a sign that reasoning-oriented model variants are gaining traction on software engineering tasks. The SWE-rebench top tier also clusters more tightly: the gap between first and fifth place is 3.0 points, versus 5.1 on Artificial Analysis (computed in the sketch after the second table), which could mean either that the benchmark is less discriminative among frontier models or that recent releases have converged on similar capability for this task class. SWE-rebench appears to emphasize practical, repository-level problem-solving over general reasoning or mathematical ability: GPT-5.4 leads the composite Artificial Analysis index at 57.2 but sits fourth on SWE-rebench (62.8% for gpt-5.4-2026-03-05-medium), hinting that task-specific tuning and scaffolding fit matter more here than raw parameter count or general reasoning prowess.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike many other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to average out stochastic variance (a sketch of that aggregation follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
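SWE-rebench reports a single headline number per model; below is a minimal sketch of how five stochastic runs might be folded into it. The run values and the averaging itself are illustrative assumptions, not SWE-rebench's published pipeline:

```python
from statistics import mean, stdev

# Hypothetical resolved rates for one model across five independent
# runs (illustrative values, not real SWE-rebench data).
runs = [0.641, 0.658, 0.649, 0.662, 0.655]

score = mean(runs)    # headline pass rate
spread = stdev(runs)  # run-to-run stochastic variance

print(f"pass rate: {score:.1%} (±{spread:.1%} over {len(runs)} runs)")
```

Reporting the mean alongside a dispersion figure makes clear how much of a small leaderboard gap could be run-to-run noise.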
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 85 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 124 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 76 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 55 | $10.00 |
| 5 | Muse Spark | 52.1 | n/a | n/a |
| 6 | Claude Sonnet 4.6 | 51.7 | 71 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 73 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 67 | $4.81 |
| 9 | Qwen3.6 Plus | 50 | 52 | $1.13 |
| 10 | GLM-5 | 49.8 | 71 | $1.55 |
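The clustering claim above can be checked directly from the two tables; a minimal sketch with the top-five scores hard-coded from them:

```python
# Top-5 scores copied from the tables above.
swe_rebench = [65.3, 64.4, 62.8, 62.8, 62.3]          # % resolved
artificial_analysis = [57.2, 57.2, 54.0, 53.0, 52.1]  # composite index

for name, top5 in [("SWE-rebench", swe_rebench),
                   ("Artificial Analysis", artificial_analysis)]:
    print(f"{name}: 1st-to-5th spread = {top5[0] - top5[-1]:.1f} points")
# SWE-rebench: 3.0 points; Artificial Analysis: 5.1 points.
```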
Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 0309 | 244 |
| 2 | Grok 4.20 0309 v2 | 215 |
| 3 | GPT-5.4 nano | 198 |
| 4 | GPT-5 Codex | 190 |
| 5 | Gemini 3 Flash Preview | 185 |
| 6 | GPT-5.1 Codex | 179 |
| 7 | GPT-5.4 mini | 161 |
| 8 | Gemini 3 Pro Preview | 134 |
| 9 | MiMo-V2-Flash | 130 |
| 10 | Qwen3.5 122B A10B | 126 |
Blended cost per 1M tokens, weighted 3:1 input to output; lower is cheaper. Only models with an intelligence score of at least 40 are included (the blend is computed as in the sketch after the table).
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
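The 3:1 blend weights input tokens three times as heavily as output tokens. A minimal sketch of that computation; the per-direction prices in the example are illustrative assumptions, not any provider's actual rates:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens at a 3:1 input-to-output token ratio."""
    return (3 * input_per_1m + output_per_1m) / 4

# Illustrative only: $0.25/1M input and $1.00/1M output
# blend to (3 * 0.25 + 1.00) / 4 = $0.4375/1M.
print(blended_price(0.25, 1.00))  # 0.4375
```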