Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous round, while the Artificial Analysis index puts Claude Opus 4.7 first at 57.3, suggesting the two benchmarks are measuring different problem distributions or difficulty levels. The SWE-rebench leaderboard has consolidated into a narrow band: the top six models cluster between 62.3% and 65.3%, with gpt-5.2-2025-12-11-medium at 64.4% and GLM-5 and gpt-5.4-2026-03-05-medium both at 62.8%, a sign of diminishing returns as models approach saturation on this evaluation set.

Notable climbers on Artificial Analysis include Kimi K2.6, entering at position 4 with 53.9 points, and JT-MINI, appearing at position 113 with 25.4 points; neither reports a SWE-rebench score, making cross-benchmark validation difficult. Gemini 3.1 Pro Preview ranks second on Artificial Analysis (57.2) but only sixth on SWE-rebench (62.3), a reversal that warrants scrutiny of the underlying tasks: SWE-rebench may emphasize code generation or repository manipulation, where Claude and GPT variants perform better, while Artificial Analysis may weight reasoning or planning more heavily.

The SWE-rebench methodology itself remains opaque in the provided data. Without visibility into task design, evaluation protocol, or whether scores are statistically independent, it is unclear whether the tight clustering reflects genuine convergence in model capability or whether the benchmark has begun to saturate as a discriminator. The divergence between the two leaderboards suggests practitioners should verify performance on their specific use case rather than treating either as a universal proxy for software engineering capability.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
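SWE-rebench's five-run protocol implies each published score is an aggregate over repeated evaluations. A minimal sketch of one plausible aggregation, assuming the reported figure is the mean resolved rate across runs (the exact aggregation rule is not stated here, and the run values below are illustrative, not real data):

```python
import statistics

def aggregate_runs(resolved_rates: list[float]) -> dict:
    """Aggregate per-run resolved rates (0..1) into a mean score and a
    sample standard deviation, one reasonable way to summarize a
    five-run benchmark result and expose its stochastic variance."""
    mean = statistics.mean(resolved_rates)
    stdev = statistics.stdev(resolved_rates)  # run-to-run spread
    return {
        "score_pct": round(mean * 100, 1),
        "stdev_pct": round(stdev * 100, 1),
    }

# Hypothetical five-run results for a single model
runs = [0.65, 0.66, 0.64, 0.65, 0.67]
print(aggregate_runs(runs))
```

A spread of a point or more across runs would matter in a leaderboard where six models sit within two points of each other, which is why reporting only the mean can obscure how stable the ranking is.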
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 53 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 4 | Kimi K2.6 | 53.9 | 135 | $1.71 |
| 5 | GPT-5.3 Codex | 53.6 | 90 | $4.81 |
| 6 | Claude Opus 4.6 | 53 | 57 | $10.00 |
| 7 | Muse Spark | 52.1 | — | — |
| 8 | Qwen3.6 Max Preview | 51.8 | — | — |
| 9 | Claude Sonnet 4.6 | 51.7 | 73 | $6.00 |
| 10 | GLM-5.1 | 51.4 | 43 | $2.15 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 238 |
| 2 | GPT-5 Codex | 213 |
| 3 | Grok 4.20 0309 | 205 |
| 4 | Grok 4.20 0309 v2 | 203 |
| 5 | Gemini 3 Flash Preview | 197 |
| 6 | GPT-5.4 mini | 194 |
| 7 | GPT-5.1 Codex | 170 |
| 8 | Qwen3.5 122B A10B | 163 |
| 9 | GPT-5.4 nano | 161 |
| 10 | Gemini 3 Pro Preview | 137 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
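The blended price above folds separate input and output token prices into a single number at the stated 3:1 input/output ratio. A minimal sketch of that weighting; the per-direction prices in the example are hypothetical, since the table reports only the blended figure:

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Weighted-average price per 1M tokens for a workload consuming
    `input_ratio` input tokens for every `output_ratio` output tokens."""
    total = input_ratio + output_ratio
    return (input_ratio * input_per_1m + output_ratio * output_per_1m) / total

# Hypothetical list prices: $0.25/1M input, $0.51/1M output
print(f"${blended_price(0.25, 0.51):.3f} per 1M tokens")  # 3:1 blend
```

Because input tokens dominate the blend three to one, models with cheap input but expensive output can still rank well on this metric, so workloads that are output-heavy (long generations, verbose agents) should recompute the blend with their own ratio.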