Claude Opus 4.6 holds first place on SWE-rebench at 65.3%, up from fourth place at 53 on the Artificial Analysis index, a 12.3-point swing that reflects either a genuine strength on software engineering tasks or a substantial methodological divergence between the two benchmarks. The SWE-rebench leaderboard shows tighter clustering at the top than Artificial Analysis: the gap between first and fifth place narrows to 3.0 points (65.3% to 62.3%), compared with 5.5 points on the older index (57.2 to 51.7), suggesting either more homogeneous model performance on software engineering tasks or differences in how the benchmark distributes credit across solution attempts.

Kimi K2.5 and Kimi K2 Thinking both advanced substantially, moving from positions 16 and 35 on Artificial Analysis (46.8 and 40.9 points respectively) to positions 13 and 17 on SWE-rebench (58.5% and 57.4%), indicating these models may have been underestimated by the prior evaluation or that they excel specifically at the code completion and repository-level reasoning SWE-rebench targets. Gemini 3 Flash Preview shows a similar pattern: its score rises from 46.4 (position 18) to 52.5% (position 22), a 6.1-point improvement that outpaces most of the field even as its relative rank slips.

The SWE-rebench evaluation appears to reward architectural choices or training data aligned with real repository work: GLM-5 and gpt-5.4-2026-03-05-medium score identically at 62.8%, yet their Artificial Analysis scores diverge by 4.2 points (49.8 vs 54), suggesting the newer benchmark may reduce noise or focus more narrowly on a specific class of engineering problems. Without documentation of what changed in the benchmark methodology, evaluation harness, or problem distribution, the magnitude of these shifts prevents confident assessment of whether they represent genuine model progress or simply a different measurement regime.
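The cross-benchmark comparisons above amount to joining the two leaderboards and computing score and rank deltas. Below is a minimal sketch of that calculation in Python, using a handful of scores transcribed from the tables in this post; the assumption that the SWE-rebench and Artificial Analysis entries map one-to-one by model name (and that a resolved-rate percentage and a composite index can be differenced point-for-point) is the author's framing, not an established equivalence.

```python
# Sketch: compare the two leaderboards by joining on model name and
# computing rank and score deltas. Scores are transcribed from the
# tables below; the name mapping between benchmarks is an assumption.

swe_rebench = {           # model -> (rank, resolved rate %)
    "Claude Opus 4.6": (1, 65.3),
    "GLM-5": (3, 62.8),
    "Gemini 3.1 Pro Preview": (5, 62.3),
    "Claude Sonnet 4.6": (7, 60.7),
}

artificial_analysis = {   # model -> (rank, composite index)
    "Claude Opus 4.6": (4, 53.0),
    "GLM-5": (7, 49.8),
    "Gemini 3.1 Pro Preview": (2, 57.2),
    "Claude Sonnet 4.6": (5, 51.7),
}

for model, (sr_rank, sr_score) in swe_rebench.items():
    aa_rank, aa_score = artificial_analysis[model]
    print(f"{model}: rank {aa_rank} -> {sr_rank}, "
          f"score {aa_score} -> {sr_score} ({sr_score - aa_score:+.1f} pts)")
```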
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffold for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a small aggregation sketch follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
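Since each model is run five times, the published figure presumably collapses those runs into one number. A minimal sketch of that aggregation, assuming the reported score is the mean resolved rate across runs (the post does not spell out the exact rule), with hypothetical per-run values:

```python
# Sketch: aggregate five independent runs into a single leaderboard score.
# Assumption: the published figure is the mean resolved rate across runs.
from statistics import mean, stdev

# Hypothetical per-run resolved rates (%) for one model over 5 runs.
runs = [64.1, 66.0, 65.5, 64.8, 66.1]

score = mean(runs)
spread = stdev(runs)
print(f"reported score ~ {score:.1f}% (+/- {spread:.1f} across runs)")
```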
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 88 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 92 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 59 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 79 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 83 | $4.81 |
| 7 | GLM-5 | 49.8 | 65 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 68 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 44 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 95 | $1.50 |
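The Artificial Analysis figure is a composite over coding, math, and reasoning benchmarks. Its actual weighting is not documented in this post; the sketch below simply illustrates the idea with an unweighted mean over hypothetical per-category scores.

```python
# Sketch of a composite index as an unweighted mean of category scores.
# The category values and the equal weighting are illustrative assumptions,
# not Artificial Analysis' actual methodology.
categories = {"coding": 55.0, "math": 52.0, "reasoning": 52.0}

composite = sum(categories.values()) / len(categories)
print(f"composite index ~ {composite:.1f}")  # ~53.0 in this example
```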
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 242 |
| 2 | GPT-5.4 mini | 219 |
| 3 | GPT-5 Codex | 218 |
| 4 | Gemini 3 Flash Preview | 192 |
| 5 | GPT-5.4 nano | 177 |
| 6 | Qwen3.5 122B A10B | 145 |
| 7 | GPT-5.1 Codex | 140 |
| 8 | MiMo-V2-Flash | 137 |
| 9 | GPT-5.2 Codex | 129 |
| 10 | Gemini 3 Pro Preview | 118 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
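The blended price weights input and output token prices 3:1, i.e. 75% of billed tokens priced as input and 25% as output. A minimal sketch of that blend, using hypothetical per-million-token prices rather than any listed model's actual rates:

```python
# Sketch: blended price per 1M tokens at a 3:1 input:output ratio.
# The input/output prices below are hypothetical examples.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    return 0.75 * input_per_m + 0.25 * output_per_m

# Example: $0.40/1M input and $2.00/1M output blend to $0.80/1M.
print(f"${blended_price(0.40, 2.00):.3f} per 1M tokens")
```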