Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, up from a fourth-place 53.0 on the Artificial Analysis index, while Gemini 3.1 Pro Preview dropped from first on Artificial Analysis (57.2) to fifth on SWE-rebench (62.3%). Kimi K2.5 climbed from 46.8 to 58.5%, a gain of 11.7 points; GLM-5 moved from tenth (49.8) to third (62.8%); and Kimi K2 Thinking jumped from 40.9 to 57.4%, suggesting that certain architectures perform disproportionately better on the SWE-rebench evaluation.

The SWE-rebench scores are substantially higher across the board, but the two numbers are not directly comparable: Artificial Analysis reports a composite index across coding, math, and reasoning benchmarks, while SWE-rebench reports a resolved-task rate on software engineering tasks alone, so some of the gap is an artifact of differing scales rather than task difficulty. The changes in relative ordering are harder to explain away, and the clustering of models between 58% and 65% on SWE-rebench, against the wider spread on Artificial Analysis, raises the question of whether SWE-rebench's task distribution favors certain model families or its evaluation criteria reward specific coding strategies. Without a detailed account of how the two benchmarks differ in test set composition and methodology, the magnitude of these shifts resists clean interpretation: it could reflect genuine capability differences on software engineering tasks, calibration differences between the benchmarks, or selection effects in which models were evaluated on which benchmark.
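To make the cross-benchmark shifts concrete, here is a minimal sketch that joins the two top-10 lists below for the four models that appear on both. The figures are transcribed from the tables; the point deltas are only suggestive, since the two scores use different units.

```python
# Scores and ranks transcribed from the two tables below, limited to the
# models that appear on both top-10 lists. Units differ: SWE-rebench is a
# resolved-task rate in %, Artificial Analysis is a composite index.
BOTH = {
    # model: (AA rank, AA score, SWE-rebench rank, SWE-rebench %)
    "Gemini 3.1 Pro Preview": (1, 57.2, 5, 62.3),
    "Claude Opus 4.6":        (4, 53.0, 1, 65.3),
    "Claude Sonnet 4.6":      (6, 51.7, 7, 60.7),
    "GLM-5":                  (10, 49.8, 3, 62.8),
}

for model, (aa_rank, aa, swe_rank, swe) in BOTH.items():
    print(f"{model}: #{aa_rank} -> #{swe_rank}, "
          f"{aa} -> {swe}% ({swe - aa:+.1f} points)")
```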
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that averaging follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
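The five-run protocol matters because individual agentic runs are noisy; averaging n runs shrinks the standard error of the score by a factor of √n. A minimal sketch with made-up per-run rates (per-run numbers are not published in the leaderboard), assuming the reported score is the mean of the five runs:

```python
from statistics import mean, stdev

# Hypothetical per-run resolved rates (%) for one model across SWE-rebench's
# five independent runs; real per-run numbers are not shown above.
runs = [60.2, 63.1, 61.5, 62.4, 62.8]

n = len(runs)
score = mean(runs)            # assuming the leaderboard reports this mean
se = stdev(runs) / n ** 0.5   # standard error shrinks by sqrt(5) vs one run
print(f"score = {score:.1f}% +/- {se:.1f} (standard error, n={n})")
```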
Artificial Analysis composite index across coding, math, and reasoning benchmarks; a rough score-per-dollar reading of these rows follows the table.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 78 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 48 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
| 6 | Claude Sonnet 4.6 | 51.7 | 57 | $6.00 |
| 7 | GLM-5.1 | 51.4 | 54 | $2.15 |
| 8 | GPT-5.2 | 51.3 | 70 | $4.81 |
| 9 | Qwen3.6 Plus | 50.0 | 44 | $1.13 |
| 10 | GLM-5 | 49.8 | 86 | $1.55 |
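Because this table pairs score with price, a rough score-per-dollar reading is possible. This is not an Artificial Analysis metric, just illustrative arithmetic over a few of the rows above:

```python
# (composite score, blended $/1M tokens) transcribed from the table above.
# Muse Spark is omitted: its listed price is $0.00, which would divide by zero.
rows = {
    "Gemini 3.1 Pro Preview": (57.2, 4.50),
    "GPT-5.4":                (56.8, 5.63),
    "Claude Opus 4.6":        (53.0, 10.00),
    "GLM-5.1":                (51.4, 2.15),
    "GLM-5":                  (49.8, 1.55),
}

# Crude cost-efficiency view: composite points per blended dollar.
for model, (score, price) in sorted(rows.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1],
                                    reverse=True):
    print(f"{model}: {score / price:5.1f} points per $")
```

By this reading the cheaper GLM models dominate, which is exactly the trade-off the price column is there to expose.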
Output tokens per second (higher is faster); only models with an intelligence score of at least 40 are included. A worked latency example follows the table.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 195 |
| 2 | GPT-5.1 Codex | 184 |
| 3 | GPT-5.4 nano | 180 |
| 4 | GPT-5.4 mini | 179 |
| 5 | GPT-5 Codex | 177 |
| 6 | Grok 4.20 0309 | 175 |
| 7 | Grok 4.20 0309 v2 | 172 |
| 8 | Qwen3.5 122B A10B | 154 |
| 9 | Gemini 3 Pro Preview | 137 |
| 10 | Gemini 3.1 Pro Preview | 132 |
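As a worked example of what these speeds mean in wall-clock terms, the sketch below streams a fixed-length response at decode speed only; it ignores prompt processing and time-to-first-token, so real end-to-end latency is higher.

```python
# Decode-only generation time at a few of the speeds from the table above.
speeds = {  # output tokens per second
    "Gemini 3 Flash Preview": 195,
    "Qwen3.5 122B A10B": 154,
    "Gemini 3.1 Pro Preview": 132,
}
output_tokens = 2_000  # e.g. a long multi-file diff

for model, tps in speeds.items():
    print(f"{model}: {output_tokens / tps:4.1f} s for {output_tokens} tokens")
```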
Blended cost per 1M tokens at a 3:1 input/output ratio (lower is cheaper); only models with an intelligence score of at least 40 are included. The blend arithmetic is sketched after the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | GLM-4.7 | $1.00 |
| 10 | Kimi K2 Thinking | $1.07 |
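The 3:1 blend means input tokens carry three times the weight of output tokens in the listed price. A minimal sketch of that arithmetic, with hypothetical per-direction prices, since the table publishes only the blended figure:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $ per 1M tokens at a 3:1 input/output ratio: three input
    tokens are assumed for every output token."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical per-direction prices; the table above lists only the blend.
# $0.28 in / $0.42 out happens to blend to $0.315, the figure shown for
# DeepSeek V3.2.
print(blended_price(0.28, 0.42))  # -> 0.315
```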