Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, well above the 53 it scores on the Artificial Analysis index, though the two are separate benchmarks measuring different aspects of capability and their numbers are not directly comparable. On SWE-rebench specifically, the top tier has compressed into a narrow band: gpt-5.2-2025-12-11-medium at 64.4%, GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%, and GLM-5.1 at 62.7% occupy positions two through five, and no model scores below 60% until rank seven.

The more significant movement appears in the middle ranks. GLM-4.7 jumped from position 38 (42.1 on Artificial Analysis) to position 14 (58.7% on SWE-rebench), Kimi K2.5 advanced from position 23 (46.8) to position 16 (58.5%), and Kimi K2 Thinking climbed from position 46 (40.9) to position 21 (57.4%), suggesting these models either improved substantially or benefit from SWE-rebench's evaluation methodology relative to Artificial Analysis scoring. Gemini 3.1 Pro Preview slipped from position 2 on Artificial Analysis to position 6 on SWE-rebench (57.2 index points versus 62.3%), a rank drop that may reflect differences in how the benchmarks weight problem-solving approaches or test coverage.

The SWE-rebench results show less volatility at the extremes than Artificial Analysis, with the bottom ranks similarly stable; the compression at the top and the selective jumps in the middle suggest SWE-rebench either captures a narrower slice of coding ability or rewards specific architectural choices that certain model families exploit more effectively.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
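The benchmark description above notes that each model is run five times to damp stochastic variance. A minimal sketch of that kind of aggregation, using toy data; the five-run protocol is from the description, but the function and the exact averaging scheme are illustrative assumptions, not SWE-rebench's actual code:

```python
from statistics import mean, stdev

def aggregate_runs(resolved_per_run: list[list[bool]]) -> tuple[float, float]:
    """Average the resolved rate (in %) over repeated runs of one task set.

    Each inner list holds one pass/fail boolean per task for a single run.
    Returns (mean rate, standard deviation across runs).
    """
    rates = [100 * sum(run) / len(run) for run in resolved_per_run]
    return mean(rates), stdev(rates)

# Toy data: three tasks, five runs (not real benchmark results).
runs = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, True, False],
    [True, False, False],
]
score, spread = aggregate_runs(runs)
print(f"{score:.1f}% ± {spread:.1f}")
```

Reporting the spread alongside the mean is the point of repeated runs: two models with equal means but different variances are not equally reliable.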
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 62 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 127 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 80 | $5.63 |
| 4 | Kimi K2.6 | 53.9 | 135 | $1.71 |
| 5 | MiMo-V2.5-Pro | 53.8 | 52 | $1.50 |
| 6 | GPT-5.3 Codex | 53.6 | 77 | $4.81 |
| 7 | Claude Opus 4.6 | 53.0 | 53 | $10.00 |
| 8 | Muse Spark | 52.1 | 0 | $0.00 |
| 9 | Qwen3.6 Max Preview | 51.8 | 38 | $2.92 |
| 10 | Claude Sonnet 4.6 | 51.7 | 64 | $6.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 224 |
| 2 | GPT-5 Codex | 200 |
| 3 | Gemini 3 Flash Preview | 195 |
| 4 | GPT-5.4 mini | 182 |
| 5 | GPT-5.1 Codex | 177 |
| 6 | Grok 4.20 0309 | 162 |
| 7 | Grok 4.20 0309 v2 | 159 |
| 8 | GPT-5.4 nano | 152 |
| 9 | Qwen3.5 122B A10B | 152 |
| 10 | Kimi K2.6 | 135 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
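The cost table's caption specifies a 3:1 input/output blend. A minimal sketch of that weighting, assuming the blended figure is a simple weighted average of per-million input and output prices (the exact formula Artificial Analysis uses is an assumption here, and the prices below are hypothetical, not taken from the table):

```python
def blended_cost_per_million(input_price: float, output_price: float) -> float:
    """Blended $/1M tokens under a 3:1 input/output token mix.

    The 3:1 ratio comes from the table caption; the weighted-average
    formula itself is an assumption about how the blend is computed.
    """
    return (3 * input_price + 1 * output_price) / 4

# Hypothetical pricing: $0.30/1M input, $1.20/1M output.
print(f"${blended_cost_per_million(0.30, 1.20):.3f}")
```

Because input tokens dominate the mix three to one, models with cheap input but expensive output can still land low on this table.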