Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, up from sixth place. That is 12.3 points above its 53.0 score on the Artificial Analysis index, though the two leaderboards use different problem sets and methodologies, so direct comparison requires caution. The tier below remains tightly clustered: gpt-5.2-2025-12-11-medium scores 64.4%, GLM-5 and gpt-5.4-2026-03-05-medium both reach 62.8%, GLM-5.1 sits at 62.7%, and Gemini 3.1 Pro Preview follows at 62.3%.

Two shifts stand out: compression at the top, with five models now inside a 2.6-point band, and the significant repositioning of Chinese models. GLM-5 advanced from rank 13 to rank 3 (49.8% to 62.8%), GLM-4.7 jumped from 38 to 14 (42.1% to 58.7%), and Kimi K2.5 rose from 23 to 16 (46.8% to 58.5%). Gemini 3.1 Pro Preview's descent from second to sixth, despite a 5.1-point absolute gain to 62.3%, underscores that the entire distribution shifted upward rather than any single model failing. The SWE-rebench methodology appears to reward architectural or training choices that these frontier models now share more evenly, particularly for code completion and repository-level problem solving. Whether this convergence reflects genuine capability parity or benchmark saturation, in which multiple labs have approached the test's difficulty ceiling, remains an open question; answering it requires examining the test construction and per-model error analysis, not just leaderboard positions.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
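The five-run protocol described above can be sketched as a simple averaging step. The function below is illustrative only (the task outcomes are invented, and SWE-rebench's actual scoring pipeline is not public here); it shows how repeated runs turn per-task pass/fail outcomes into a resolved rate with a variance estimate.

```python
from statistics import mean, stdev

def resolved_rate(runs: list[list[bool]]) -> tuple[float, float]:
    """Mean and sample standard deviation of the resolved rate (in %)
    across repeated runs. Each inner list holds per-task pass/fail
    outcomes for one run; averaging over several runs damps the
    stochastic variance of a single evaluation."""
    rates = [100.0 * sum(run) / len(run) for run in runs]
    return mean(rates), stdev(rates)

# Hypothetical outcomes for a 10-task subset, five runs (not real data):
runs = [
    [True, True, False, True, True, False, True, True, True, False],   # 70%
    [True, True, True, True, False, False, True, True, True, False],   # 70%
    [True, False, False, True, True, False, True, True, True, True],   # 70%
    [True, True, False, True, True, False, True, False, True, False],  # 60%
    [True, True, False, True, True, True, True, True, True, False],    # 80%
]
avg, spread = resolved_rate(runs)
print(f"{avg:.1f}% ± {spread:.1f}")  # → 70.0% ± 7.1
```

Reporting the spread alongside the mean is what makes sub-point gaps between adjacent models, like the 0.1 points separating ranks 4 and 5, interpretable at all.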
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 62 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 127 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 82 | $5.63 |
| 4 | Kimi K2.6 | 53.9 | 135 | $1.71 |
| 5 | GPT-5.3 Codex | 53.6 | 80 | $4.81 |
| 6 | Claude Opus 4.6 | 53.0 | 53 | $10.00 |
| 7 | Muse Spark | 52.1 | 0 | $0.00 |
| 8 | Qwen3.6 Max Preview | 51.8 | 47 | $2.92 |
| 9 | Claude Sonnet 4.6 | 51.7 | 73 | $6.00 |
| 10 | GLM-5.1 | 51.4 | 44 | $2.15 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 242 |
| 2 | GPT-5 Codex | 214 |
| 3 | Gemini 3 Flash Preview | 195 |
| 4 | Grok 4.20 0309 | 177 |
| 5 | GPT-5.1 Codex | 177 |
| 6 | Grok 4.20 0309 v2 | 174 |
| 7 | GPT-5.4 mini | 174 |
| 8 | Qwen3.5 122B A10B | 159 |
| 9 | GPT-5.4 nano | 157 |
| 10 | Kimi K2.6 | 135 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
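The 3:1 blend in the caption above is a weighted average of input and output prices. A minimal sketch, with hypothetical per-million-token prices that are not taken from the table:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $ per 1M tokens, weighting input 3:1 over output,
    reflecting the typical skew of coding workloads toward input tokens."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical pricing: $0.30/1M input, $1.50/1M output.
print(round(blended_price(0.30, 1.50), 2))
```

Because the blend weights input three times as heavily, models with cheap input but expensive output tokens rank better here than a naive average of the two list prices would suggest.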