Claude Opus 4.6 tops SWE-rebench at 65.3% while placing fourth on Artificial Analysis with a composite score of 53.0. The two numbers sit on different scales (a percentage of resolved tasks versus a composite index), so the gap between them is not a "12.3-point jump"; what differs is the ordering. Gemini 3.1 Pro Preview shows the inverse pattern, leading Artificial Analysis at 57.2 but sitting fifth on SWE-rebench at 62.3%. This suggests the two benchmarks measure different aspects of code generation, or that SWE-rebench's evaluation methodology surfaces different model capabilities than Artificial Analysis does.

GLM-5 and Kimi K2.5 both place markedly higher on SWE-rebench than their Artificial Analysis scores would predict, at third (62.8% versus a 49.8 composite) and thirteenth (58.5% versus 46.8) respectively, while several models near the top of SWE-rebench rank lower on Artificial Analysis. That points either to a difference in which tasks dominate each coding evaluation or to how the two weight problem difficulty and solution quality.

Some caveats apply. SWE-rebench documents its protocol (standardized scaffolding, a continuously refreshed dataset, five runs per model), but its scoring scale differs from Artificial Analysis, the dataset refresh means task composition shifts over time, and ranking movement alone cannot distinguish genuine capability gains from benchmark-specific optimization. What is clear is that SWE-rebench produces a substantially different ordering among frontier models (a rank-correlation sketch follows the Artificial Analysis table below), which matters if teams are using it to guide development priorities; without documentation of the benchmark's task distribution, evaluation harness, and baseline stability, the practical significance of these shifts remains ambiguous.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of the five-run aggregation follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
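The five-run protocol matters because agentic coding runs are stochastic: the same model can resolve a different subset of tasks on each attempt. Below is a minimal sketch of how per-run resolved rates might be aggregated into a single leaderboard number; the five rates are invented for illustration, since SWE-rebench reports only the aggregate.

```python
import statistics

# Hypothetical per-run resolved rates for one model across five runs.
# These values are illustrative, not published SWE-rebench data.
runs = [0.641, 0.658, 0.649, 0.662, 0.655]

mean = statistics.mean(runs)          # the point estimate a leaderboard would report
stdev = statistics.stdev(runs)        # sample standard deviation across runs
stderr = stdev / len(runs) ** 0.5     # standard error of the mean

# Rough 95% interval via a normal approximation; with n=5, a t-interval is wider.
low, high = mean - 1.96 * stderr, mean + 1.96 * stderr
print(f"mean={mean:.1%}, 95% CI approx. [{low:.1%}, {high:.1%}]")
```

If run-to-run spread is on the order of a percentage point, differences of a few tenths between adjacent leaderboard entries may be within noise.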
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 122 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 79 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 65 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 43 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
| 6 | Claude Sonnet 4.6 | 51.7 | 52 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 46 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 62 | $4.81 |
| 9 | Qwen3.6 Plus | 50.0 | 52 | $1.13 |
| 10 | GLM-5 | 49.8 | 66 | $1.55 |
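One way to quantify how differently the two leaderboards order the same systems is a rank correlation over the models appearing in both top-10 tables. A sketch under two assumptions: that gpt-5.2-2025-12-11-medium and GPT-5.2 refer to the same model, and that five shared entries are enough for a rough signal.

```python
from scipy.stats import spearmanr

# (SWE-rebench %, Artificial Analysis composite) for models in both top-10 tables.
# Matching "gpt-5.2-2025-12-11-medium" to "GPT-5.2" is an assumption.
scores = {
    "Claude Opus 4.6":        (65.3, 53.0),
    "GPT-5.2":                (64.4, 51.3),
    "GLM-5":                  (62.8, 49.8),
    "Gemini 3.1 Pro Preview": (62.3, 57.2),
    "Claude Sonnet 4.6":      (60.7, 51.7),
}

swe = [s for s, _ in scores.values()]
aa = [a for _, a in scores.values()]
rho, p_value = spearmanr(swe, aa)
print(f"Spearman rho = {rho:.2f} over {len(scores)} shared models (p = {p_value:.2f})")
```

On these five models the correlation comes out near zero (rho = -0.10), consistent with the claim above that the two benchmarks produce substantially different orderings, though five data points carry little statistical weight.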
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 173 |
| 2 | GPT-5.4 nano | 162 |
| 3 | GPT-5.4 mini | 161 |
| 4 | GPT-5 Codex | 161 |
| 5 | GPT-5.1 Codex | 155 |
| 6 | Grok 4.20 0309 v2 | 151 |
| 7 | Grok 4.20 0309 | 151 |
| 8 | Qwen3.5 122B A10B | 133 |
| 9 | Gemini 3 Pro Preview | 127 |
| 10 | Gemini 3.1 Pro Preview | 122 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
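The blended figure folds input and output pricing into one number, weighting input tokens three to one: blended = (3 × price_in + price_out) / 4. A minimal sketch of that arithmetic follows; the per-direction prices are hypothetical, since the table lists only the blends.

```python
def blended_price(price_in: float, price_out: float, ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * price_in + price_out) / (ratio + 1.0)

# Hypothetical per-direction prices (not from the table above):
# $0.10/1M input and $0.30/1M output blend to $0.15/1M at 3:1.
print(blended_price(0.10, 0.30))  # -> 0.15
```

Because output tokens typically cost several times more than input tokens, the 3:1 weighting favors models with cheap input pricing; a workload with a different input/output mix could reorder this table.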