Claude Opus 4.6 moved from position 7 to 1 on SWE-rebench with a gain of 12.3 percentage points (53% to 65.3%), while GLM-5 and GLM-5.1 climbed from positions 14 and 11 to 3 and 5 respectively, each gaining over 13 points. On the Artificial Analysis index, GPT-5.5 entered at the top with 60.2 points, displacing Claude Opus 4.7, whose score held steady at 57.3; Gemini 3.1 Pro Preview remained at 57.2, and the top tier compressed significantly with minimal movement among established leaders.

SWE-rebench's gains are concentrated in the 50-65% range, where models show meaningful progress on real repository tasks, though the methodology does not specify whether these gains were measured on the same test set or on refreshed evaluation data. The Artificial Analysis leaderboard exhibits dense clustering and wholesale position shifts despite unchanged scores for many models, suggesting score rounding or reranking by secondary criteria rather than genuine performance changes.

The divergence between the two benchmarks is notable: Claude Opus 4.6 dominates SWE-rebench but ranks only 8th on Artificial Analysis at 53, while GPT-5.5 tops Artificial Analysis yet does not appear in the SWE-rebench top 34. Either the benchmarks measure different capabilities, or their evaluation protocols operate on substantially different distributions. Without documentation of evaluation scope or dates, it is unclear whether these movements reflect genuine capability gains, model updates, or methodological shifts.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
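The five-runs-per-model protocol described above can be sketched as a small aggregation step. This is an illustration of the idea, not SWE-rebench's actual code; the boolean per-task result format and the function name are assumptions.

```python
from statistics import mean, stdev

def aggregate_runs(runs):
    """Average resolved rate across repeated runs of one model.

    runs: list of runs; each run is a list of booleans,
    one per task (True = task resolved).
    Returns (mean resolved %, standard deviation in %).
    """
    # Per-run resolved rate as a percentage.
    rates = [100.0 * sum(run) / len(run) for run in runs]
    # stdev needs at least two data points.
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread
```

Reporting the mean over five runs damps run-to-run stochastic noise; the standard deviation shows how much any single run could have misled.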
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 0 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 58 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 80 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 123 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 60 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 76 | $4.81 |
| 8 | Claude Opus 4.6 | 53.0 | 50 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 36 | $2.92 |
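The composite score above folds several benchmarks into one number. Artificial Analysis does not publish its weighting here, so the sketch below assumes a plain (optionally weighted) mean over per-benchmark scores on a common 0-100 scale; the function name and the equal-weight default are illustrative assumptions, not the index's actual formula.

```python
def composite_index(scores, weights=None):
    """Weighted mean of per-benchmark scores on a common 0-100 scale.

    scores: dict of benchmark name -> score.
    weights: optional dict of benchmark name -> weight; defaults
    to equal weights (a simple average).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

With equal weights this is just the arithmetic mean, which is why a small gain on any one benchmark moves the composite only fractionally.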
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 214 |
| 2 | Gemini 3 Flash Preview | 198 |
| 3 | GPT-5 Codex | 188 |
| 4 | GPT-5.1 Codex | 179 |
| 5 | GPT-5.4 mini | 173 |
| 6 | Qwen3.5 122B A10B | 150 |
| 7 | Grok 4.20 0309 v2 | 148 |
| 8 | GPT-5.4 nano | 147 |
| 9 | Grok 4.20 0309 | 141 |
| 10 | Gemini 3.1 Pro Preview | 132 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
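The 3:1 blended price above can be reproduced from a provider's separate input and output rates. A minimal sketch, assuming simple linear per-token pricing (no caching or tiered discounts); the rates in the test comment are placeholders, not any provider's actual pricing:

```python
def blended_price(input_per_m, output_per_m, input_ratio=3, output_ratio=1):
    """Blended $ per 1M tokens at a fixed input:output token mix.

    Defaults to the 3:1 mix used by this leaderboard: three input
    tokens for every output token.
    """
    total = input_ratio + output_ratio
    return (input_ratio * input_per_m + output_ratio * output_per_m) / total
```

Because output tokens usually cost several times more than input tokens, a 3:1 blend sits much closer to the input rate than a naive average of the two list prices would.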