Claude Opus 4.6 holds its lead on SWE-rebench at 65.3 percent, 12.3 points above its 53-point showing on the Artificial Analysis index, though the two evaluations measure different problem spaces and should not be compared directly. The top tier has consolidated between 62 and 65 percent on SWE-rebench: gpt-5.2-2025-12-11-medium sits at 64.4 percent, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8 percent. These four models now occupy the summit, separated by narrow margins.

Movement in the broader field is uneven. Kimi K2 Thinking jumped from 37th place at 40.9 percent to 17th at 57.4 percent, a 16.5-point gain that suggests either a model update or a change in how the benchmark handles reasoning-focused architectures, while Kimi K2.5 advanced from 16th at 46.8 percent to 13th at 58.5 percent. Gemini 3 Flash Preview improved its score from 46.4 percent to 52.5 percent even as its rank slipped from 18th to 22nd, and Nova 2.0 Lite moved from 76th at 29.7 percent to 58th at 34.5 percent on the Artificial Analysis side, a shift that reflects reranking rather than SWE-rebench performance.

The SWE-rebench scores show no obvious saturation at the top: the gap between first and fifth place is 3.0 percentage points, and the distribution below rank 10 remains steep, suggesting either that the benchmark retains discriminative power or that model capabilities on software engineering tasks continue to stratify sharply by architecture and training approach. What the data alone cannot settle is whether these gains reflect genuine improvements in code generation and repository-level reasoning, or changes in evaluation methodology, task distribution, or model selection within the rebench suite.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed for fair comparison of LLM capabilities on real-world software engineering tasks. Unlike many evaluations, it uses a standardized scaffold for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
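The five-run protocol described above amounts to reporting a mean resolved rate per model, with the repeats absorbing run-to-run noise. A minimal sketch, where the individual run values are invented for illustration and not taken from SWE-rebench:

```python
from statistics import mean, stdev

# Hypothetical run-level resolved rates for one model; SWE-rebench's actual
# per-run numbers are not published in this digest.
runs = [0.660, 0.648, 0.655, 0.651, 0.651]  # fraction of tasks resolved, 5 runs

score = mean(runs)    # the headline leaderboard score
spread = stdev(runs)  # run-to-run stochastic variance

print(f"score = {score:.1%} (±{spread:.1%} across runs)")
```

The point of the repeats is that a single run can land anywhere in that spread, so single-run comparisons between closely ranked models are unreliable.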
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 82 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 142 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 81 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 54 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 66 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 79 | $4.81 |
| 7 | GLM-5 | 49.8 | 69 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 67 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 43 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 0 | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 265 |
| 2 | GPT-5 Codex | 214 |
| 3 | GPT-5.4 nano | 209 |
| 4 | Gemini 3 Flash Preview | 197 |
| 5 | GPT-5.1 Codex | 190 |
| 6 | GPT-5.4 mini | 166 |
| 7 | Qwen3.5 122B A10B | 154 |
| 8 | Gemini 3 Pro Preview | 143 |
| 9 | Gemini 3.1 Pro Preview | 142 |
| 10 | MiMo-V2-Flash | 137 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
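The blended figure above weights input and output token prices 3:1, i.e. (3 × input + 1 × output) / 4. A short sketch, with hypothetical per-direction prices rather than any model's published rates:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended cost per 1M tokens at a 3:1 input:output token ratio."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Hypothetical prices for illustration only: $0.10/1M in, $0.30/1M out.
print(round(blended_price(0.10, 0.30), 4))  # 0.15
```

The 3:1 weighting assumes a typical coding workload reads far more tokens (context, files, diffs) than it writes, so input pricing dominates the effective cost.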