On SWE-rebench, the top tier has crystallized in the 60-65 percent resolve-rate range, with Claude Opus 4.6 holding first place at 65.3 percent, followed by gpt-5.2-2025-12-11-medium at 64.4 percent and a cluster of GLM and GPT variants in the 62-63 percent band. The meaningful movement comes from models that climbed substantially from prior positions: GLM-5 jumped from rank 11 to rank 3 by gaining 13 percentage points (49.8 to 62.8), GLM-4.7 surged from rank 36 to rank 14 with a 16.6-point gain (42.1 to 58.7), and Kimi K2.5 moved from rank 21 to rank 16 by adding 11.7 points (46.8 to 58.5). Gemini 3.1 Pro Preview, by contrast, slipped from rank 2 to rank 6 while holding a competitive 62.3 percent, which says less about the model regressing than about the field tightening around it.

The SWE-rebench scores show larger absolute gains across the board than the Artificial Analysis index, which could reflect genuine improvement on coding tasks or a shift in evaluation methodology; the data does not say whether the benchmark itself was recalibrated. Either way, the clustering of models between roughly 58 and 63 percent suggests diminishing returns from further optimization: the gap between first and tenth place is now just 5.4 percentage points.
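To make the movement concrete, here is a quick Python sketch that recomputes the deltas quoted above. The prior ranks and scores come directly from the text; nothing here is additional leaderboard data.

```python
# Rank and score movement on SWE-rebench, recomputed from the figures
# quoted above.
movers = {
    # model: (prior_rank, new_rank, prior_score, new_score)
    "GLM-5":     (11, 3, 49.8, 62.8),
    "GLM-4.7":   (36, 14, 42.1, 58.7),
    "Kimi K2.5": (21, 16, 46.8, 58.5),
}

for model, (r0, r1, s0, s1) in movers.items():
    print(f"{model}: rank {r0} -> {r1}, {s0}% -> {s1}% (+{s1 - s0:.1f} pts)")

# Spread between first and tenth place in the current top ten.
print(f"First-to-tenth gap: {65.3 - 59.9:.1f} points")
```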
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
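Since each model is run five times, a published score is best read as a mean with some spread. A minimal sketch of that aggregation, using invented per-run values, since the actual per-run numbers are not published here:

```python
import statistics

# Hypothetical per-run resolve rates for a single model. SWE-rebench runs
# each model five times, so a published score is presumably an aggregate
# along these lines; the values below are made up for illustration.
runs = [64.8, 65.9, 64.5, 66.1, 65.2]

mean = statistics.mean(runs)
spread = statistics.stdev(runs)  # sample standard deviation across runs
print(f"resolve rate: {mean:.1f}% +/- {spread:.1f}")
```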
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 53 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 134 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 85 | $5.63 |
| 4 | GPT-5.3 Codex | 53.6 | 93 | $4.81 |
| 5 | Claude Opus 4.6 | 53.0 | 59 | $10.00 |
| 6 | Muse Spark | 52.1 | 0 | $0.00 |
| 7 | Claude Sonnet 4.6 | 51.7 | 62 | $6.00 |
| 8 | GLM-5.1 | 51.4 | 46 | $2.15 |
| 9 | GPT-5.2 | 51.3 | 83 | $4.81 |
| 10 | Qwen3.6 Plus | 50.0 | 52 | $1.13 |
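One way to read this table is cost-adjusted: dividing the index score by the blended price gives a rough points-per-dollar figure. The ratio below is an illustrative metric of my own, not part of the Artificial Analysis index:

```python
# Index points per dollar for a few rows of the table above. Muse Spark
# is omitted: its listed price of $0.00 would divide by zero.
rows = [
    ("Claude Opus 4.7", 57.3, 10.00),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GLM-5.1", 51.4, 2.15),
    ("Qwen3.6 Plus", 50.0, 1.13),
]

for model, score, price in rows:
    print(f"{model}: {score / price:.1f} index points per blended $/1M")
```

On this reading the cheap GLM and Qwen entries dominate the frontier models by a wide margin, which is the usual price-performance story rather than a surprise.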
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 238 |
| 2 | GPT-5.1 Codex | 223 |
| 3 | Grok 4.20 0309 v2 | 212 |
| 4 | GPT-5 Codex | 211 |
| 5 | Gemini 3 Flash Preview | 207 |
| 6 | Grok 4.20 0309 | 205 |
| 7 | GPT-5.4 mini | 192 |
| 8 | Qwen3.5 122B A10B | 157 |
| 9 | GPT-5.4 nano | 156 |
| 10 | Gemini 3 Pro Preview | 141 |
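For intuition, the listed speeds translate directly into wall-clock latency for a response of a given length. The 1,000-token response size below is an assumption for illustration:

```python
# Approximate wall-clock time to stream a fixed-length response at the
# listed output speeds. The response length is an assumed value.
RESPONSE_TOKENS = 1_000

speeds = {
    "Qwen3.6 35B A3B": 238,       # fastest in the table
    "Gemini 3 Pro Preview": 141,  # slowest in the top ten
}

for model, tok_s in speeds.items():
    print(f"{model}: ~{RESPONSE_TOKENS / tok_s:.1f} s per {RESPONSE_TOKENS}-token response")
```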
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
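The 3:1 blend in the caption implies a simple weighted average: three parts input price to one part output price. A minimal sketch, with hypothetical per-direction prices since the table publishes only the blend:

```python
# Blended price per 1M tokens at the table's 3:1 input/output ratio.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return (3 * input_per_m + output_per_m) / 4

# These per-direction prices are made up purely to show the arithmetic.
print(f"${blended_price(0.25, 1.00):.3f}/1M")  # 3 parts $0.25, 1 part $1.00 -> $0.438/1M
```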