Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the tier immediately below has compressed: gpt-5.2-2025-12-11-medium, GLM-5, and gpt-5.4-2026-03-05-medium now cluster between 62.8% and 64.4%. GLM-5 climbed from #7 to #3, and its 62.8% sits a full 13 points above its Artificial Analysis composite of 49.8.

Gemini 3.1 Pro Preview fell from #2 to #5 on SWE-rebench despite scoring 62.3%, which is 5.1 points above its Artificial Analysis score of 57.2. A gap that size suggests the two benchmarks sample different problem distributions or apply different evaluation rigor.

Kimi K2.5 and Kimi K2 Thinking both posted substantial Artificial Analysis gains, 12.5 and 16.5 points respectively, and climbed the SWE-rebench ranks to #13 and #17. Improvements of that magnitude raise the question of whether the models were retrained, fine-tuned on benchmark-adjacent data, or evaluated under materially different protocols across the two systems.

The broader pattern shows Claude models and GPT variants dominating the SWE-rebench top ten while Chinese models (GLM-5, the Kimi variants, the Qwen lines) have narrowed the gap. The divergence between SWE-rebench and Artificial Analysis rankings for several mid-tier models indicates these benchmarks are not interchangeable proxies for coding ability. Finally, MiMo-V2-Omni dropped out of the Artificial Analysis rankings entirely despite previously scoring 43.4, a notable exit that warrants clarification on whether the model was discontinued or simply failed to meet evaluation criteria this cycle.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
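SWE-rebench's five-run protocol reduces the stochastic variance noted above by averaging independent runs. A minimal sketch of that aggregation, using hypothetical per-run resolved fractions (not real SWE-rebench data):

```python
from statistics import mean, stdev

# Hypothetical per-run resolved fractions for one model across five
# independent runs (illustrative only; not actual SWE-rebench output).
runs = [0.648, 0.655, 0.651, 0.660, 0.649]

score = mean(runs)    # the reported score is the mean over runs
spread = stdev(runs)  # sample std. dev. captures run-to-run variance

print(f"score = {score:.1%} ± {spread:.1%}")  # → score = 65.3% ± 0.5%
```

A single run could land half a point high or low on a benchmark this size, which is why single-run leaderboard deltas under a point are hard to interpret.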
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 74 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 113 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 78 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 51 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 72 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 72 | $4.81 |
| 7 | GLM-5 | 49.8 | 69 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 59 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 47 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 93 | $1.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5.4 nano | 221 |
| 2 | Grok 4.20 Beta 0309 | 218 |
| 3 | GPT-5.4 mini | 218 |
| 4 | Gemini 3 Flash Preview | 195 |
| 5 | GPT-5 Codex | 190 |
| 6 | Qwen3.5 122B A10B | 134 |
| 7 | MiMo-V2-Flash | 129 |
| 8 | GPT-5.1 Codex | 118 |
| 9 | Gemini 3 Pro Preview | 115 |
| 10 | Gemini 3.1 Pro Preview | 113 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
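The blended $/1M figure weights input and output token prices at the stated 3:1 ratio, i.e. three input tokens assumed per output token. A sketch of that arithmetic, with hypothetical per-direction prices (the table reports only the blended result, not the underlying rates):

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output ratio:
    a weighted average of three parts input price to one part output."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# Hypothetical rates of $0.30/1M input and $1.20/1M output tokens
# blend to $0.525/1M (illustrative only, not any listed model's rates).
print(blended_price(0.30, 1.20))  # → 0.525
```

Because input tokens carry three times the weight, models with cheap input but expensive output can still rank well on this metric; workloads with long generations will see costs closer to the output rate than the blended figure suggests.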