Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous evaluation, while the Artificial Analysis composite shows a more fragmented picture, with GPT-5.5 leading at 60.2. The SWE-rebench top tier remains stable: gpt-5.2-2025-12-11-medium sits at 64.4%, GLM-5 and gpt-5.4-2026-03-05-medium are tied at 62.8%, and GLM-5.1 follows at 62.7%.

The notable divergences between the two leaderboards are mostly in the mid-range. GLM-4.7 ranks 42nd on Artificial Analysis (42.1 points) but 14th on SWE-rebench at 58.7%; GLM-5 goes from 16th to 3rd, Kimi K2.5 from 27th to 16th, and Kimi K2 Thinking from 51st to 21st. In the other direction, Gemini 3.1 Pro Preview ranks 3rd on Artificial Analysis at 57.2 but only 6th on SWE-rebench despite scoring 62.3%, suggesting the two evaluations measure different dimensions of problem-solving. The split is starkest for Claude Opus 4.6, which tops SWE-rebench at 65.3% yet sits 8th on Artificial Analysis at 53.0; the two scores are on different scales, so the rank difference is a more meaningful signal than the raw 12.3-point spread.

Across both benchmarks, no model appears to have regressed in absolute performance; relative rankings shifted because other models improved or were newly added to the evaluation. The SWE-rebench methodology, focused on software engineering tasks, appears to reward models differently than the broader Artificial Analysis composite, and without access to the specific test-case changes or evaluation-protocol updates, it remains unclear whether these movements reflect genuine capability shifts or methodological adjustments to the benchmark itself.
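To make those cross-leaderboard shifts concrete, here is a minimal sketch that tabulates them from the ranks quoted above. The rank values are copied from this digest's text and tables, not pulled from either benchmark's data.

```python
# Cross-leaderboard rank shifts, using only the ranks quoted in this digest.
# Positive shift = the model places higher on SWE-rebench than on the
# Artificial Analysis composite.

aa_rank = {
    "Claude Opus 4.6": 8,
    "Gemini 3.1 Pro Preview": 3,
    "GLM-4.7": 42,
    "GLM-5": 16,
    "Kimi K2.5": 27,
    "Kimi K2 Thinking": 51,
}

swe_rebench_rank = {
    "Claude Opus 4.6": 1,
    "Gemini 3.1 Pro Preview": 6,
    "GLM-4.7": 14,
    "GLM-5": 3,
    "Kimi K2.5": 16,
    "Kimi K2 Thinking": 21,
}

for model in sorted(aa_rank.keys() & swe_rebench_rank.keys()):
    shift = aa_rank[model] - swe_rebench_rank[model]
    arrow = "up" if shift > 0 else "down" if shift < 0 else "flat"
    print(f"{model:24s} AA #{aa_rank[model]:>2} -> "
          f"SWE-rebench #{swe_rebench_rank[model]:>2} ({arrow} {abs(shift)})")
```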
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a toy aggregation sketch follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
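The five-run protocol mentioned above implies each table entry aggregates repeated trials. Below is a minimal sketch of one plausible aggregation, assuming each run yields a percentage of tasks resolved; the run values and the use of a simple mean are illustrative assumptions, since SWE-rebench's exact aggregation isn't documented here.

```python
from statistics import mean, stdev

# Hypothetical resolved rates (%) for one model across five independent
# runs -- illustrative numbers, not actual SWE-rebench data.
runs = [64.8, 65.9, 65.1, 65.5, 65.2]

score = mean(runs)    # headline score, if a simple mean is used
spread = stdev(runs)  # run-to-run stochastic variance

print(f"score: {score:.1f}% (+/- {spread:.2f} over {len(runs)} runs)")
```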
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 78 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 56 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 135 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 86 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 0 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 65 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 91 | $4.81 |
| 8 | Claude Opus 4.6 | 53.0 | 48 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 34 | $2.92 |
Output tokens per second — higher is faster. Minimum intelligence score of 40. A measurement sketch follows the table.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 200 |
| 2 | GPT-5 Codex | 200 |
| 3 | Qwen3.6 35B A3B | 200 |
| 4 | GPT-5.4 mini | 175 |
| 5 | GPT-5.1 Codex | 159 |
| 6 | GPT-5.4 nano | 157 |
| 7 | Qwen3.5 122B A10B | 156 |
| 8 | GPT-5.1 | 153 |
| 9 | Gemini 3 Pro Preview | 143 |
| 10 | Gemini 3.1 Pro Preview | 135 |
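For reference, tok/s here is output throughput: tokens generated divided by generation wall-time. Below is a minimal sketch of that measurement, where any callable returning output tokens stands in for a model; real harnesses typically stream and may handle time-to-first-token separately, which this sketch ignores.

```python
import time

def output_tokens_per_second(generate, prompt: str) -> float:
    """Time one generation and return output tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt)  # any callable returning output tokens
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Toy stand-in emitting one "token" every 5 ms, i.e. roughly 200 tok/s --
# the cap shared by the three fastest entries above.
def fake_model(prompt: str) -> list[str]:
    out = []
    for _ in range(200):
        time.sleep(0.005)
        out.append("tok")
    return out

print(f"{output_tokens_per_second(fake_model, 'hello'):.0f} tok/s")
```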
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40. A worked example of the blend follows the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
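The 3:1 blend weights input tokens three times as heavily as output tokens, i.e. blended = (3 × input price + output price) / 4. A minimal sketch with hypothetical per-million prices, since this digest lists only the blended figures:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + output_per_1m) / 4

# Hypothetical prices: $0.50/1M input, $2.00/1M output.
# (3 * 0.50 + 2.00) / 4 = 3.50 / 4 = $0.875 blended.
print(f"${blended_price(0.50, 2.00):.3f} per 1M tokens")
```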