Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the second tier has tightened considerably: gpt-5.2-2025-12-11-medium, GLM-5, gpt-5.4-2026-03-05-medium, and GLM-5.1 all cluster within roughly a point of each other in the 62-64% range.

The most significant movement is in the mid-tier. GLM-4.7 climbed from rank 42 (42.1%) to rank 14 (58.7%), a gain of 16.6 points; Kimi K2.5 jumped from rank 27 (46.8%) to rank 16 (58.5%); and Kimi K2 Thinking advanced from rank 51 (40.9%) to rank 21 (57.4%). These look like genuine improvements in coding task resolution rather than ranking artifacts. Gemini 3.1 Pro Preview dropped from rank 3 to rank 6 despite holding 62.3%, reflecting score compression in the upper tier rather than performance degradation.

On the Artificial Analysis index, the rankings remain relatively stable at the extremes: GPT-5.5 continues to lead at 60.2 and Claude Opus 4.7 sits at 57.3. The divergence between the two leaderboards is pronounced: Claude Opus 4.6 scores 65.3% on SWE-rebench but only 53 on Artificial Analysis, suggesting the benchmarks measure distinct capabilities or that SWE-rebench has a different task difficulty distribution. Without visibility into whether the SWE-rebench test set changed or the models were simply retested, the GLM and Kimi gains warrant scrutiny as to whether they reflect algorithmic advances or evaluation variance.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
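The five-run averaging described above can be sketched as follows. The per-run numbers here are hypothetical, chosen only to illustrate how run-to-run noise is absorbed into a single reported resolution rate:

```python
# Hypothetical per-run resolution rates for one model on SWE-rebench-style
# tasks. The benchmark reports a score averaged over five independent runs
# to account for stochastic variance; these numbers are illustrative.
from statistics import mean, stdev

runs = [0.641, 0.655, 0.649, 0.660, 0.660]  # fraction of tasks resolved per run

score = mean(runs)    # reported resolution rate
spread = stdev(runs)  # run-to-run variation the averaging absorbs

print(f"resolved: {score:.1%} (±{spread:.1%} across {len(runs)} runs)")
# → resolved: 65.3% (±0.8% across 5 runs)
```

Reporting the mean of five runs keeps a single lucky or unlucky run from shifting a model's rank, which matters when the second tier is separated by fractions of a point.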
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 84 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 62 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 135 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 86 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 139 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 66 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 91 | $4.81 |
| 8 | Claude Opus 4.6 | 53 | 59 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 34 | $2.92 |
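The source does not state how Artificial Analysis weights its coding, math, and reasoning components; as a hedged illustration only, a composite index can be sketched as an unweighted mean of per-domain scores (all numbers below are hypothetical):

```python
# Hypothetical per-domain scores. The real index's weighting is not given
# in the source, so an unweighted mean is used purely for illustration.
domain_scores = {"coding": 62.0, "math": 58.5, "reasoning": 60.1}

composite = sum(domain_scores.values()) / len(domain_scores)
print(round(composite, 1))  # → 60.2
```

A composite like this explains the SWE-rebench divergence: a model strong on coding but weaker on math or reasoning can score 65.3% on a coding-only benchmark yet land lower on a blended index.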
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 200 |
| 2 | GPT-5 Codex | 198 |
| 3 | Qwen3.6 35B A3B | 197 |
| 4 | GPT-5.4 mini | 182 |
| 5 | GPT-5.4 nano | 163 |
| 6 | GPT-5.1 Codex | 159 |
| 7 | Qwen3.5 122B A10B | 156 |
| 8 | GPT-5.1 | 153 |
| 9 | Gemini 3 Pro Preview | 141 |
| 10 | Kimi K2.6 | 139 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
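The blended figure above weights input and output token prices 3:1. A minimal sketch of that calculation, using hypothetical per-direction prices (the table reports only the blended number):

```python
# Blended $/1M tokens at a 3:1 input:output ratio. The per-direction
# prices in the example call are hypothetical, not from the table.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Weighted average of per-1M-token prices: 3 parts input, 1 part output."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# e.g. $0.10/1M input and $0.30/1M output blend to $0.15/1M
print(blended_price(0.10, 0.30))  # → 0.15
```

The 3:1 weighting reflects typical workloads, where prompts (input) consume several times more tokens than completions (output), so input pricing dominates the effective cost.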