Claude Opus 4.6 tops SWE-rebench at 65.3% yet ranks only 9th on the Artificial Analysis composite at 53, the sharpest divergence between the two leaderboards among the models listed below. The top seven SWE-rebench positions cluster tightly between 62.3% and 65.3%: gpt-5.2-2025-12-11-medium at 64.4%, GLM-5, Junie, and gpt-5.4-2026-03-05-medium tied at 62.8%, GLM-5.1 at 62.7%, and Gemini 3.1 Pro Preview at 62.3%.

The two benchmarks order models quite differently. GPT-5.5 leads Artificial Analysis at 60.2 while Claude Opus 4.6 sits 9th at 53, suggesting the two evaluations measure different capabilities or that SWE-rebench weights certain problem classes differently. The gap is even wider for the Chinese models: GLM-5 places 3rd on SWE-rebench (62.8%) against 17th on Artificial Analysis (49.8), GLM-4.7 places 14th against 44th (58.7% vs. 42.1), and Kimi K2 Thinking places 21st against 54th (57.4% vs. 40.9), indicating broad-based strength on the coding benchmark that the composite index does not capture. Gemini 3.1 Pro Preview shows the opposite pattern, placing 3rd on Artificial Analysis (57.2) but only 7th on SWE-rebench (62.3), which may reflect that a coding-specific benchmark rewards different optimization choices than a general-purpose evaluation.

The limited methodological detail published for either benchmark makes these divergences hard to interpret. Neither source discloses test set size, problem distribution, whether solutions are judged on correctness alone or also on code quality, or how edge cases are handled, so it is unclear whether the gap between Claude's dominance on SWE-rebench and its mid-tier Artificial Analysis position reflects genuine capability differences or artifacts of evaluation design.
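To make the cross-leaderboard comparison concrete, here is a minimal Python sketch that pairs each model's position on the two lists and reports the rank delta. The dictionaries are hand-filled from the figures cited above; nothing about either benchmark's internals is assumed.

```python
# Hand-filled from the rankings cited above: (rank, score) per leaderboard.
# SWE-rebench scores are % of tasks resolved; Artificial Analysis scores are
# a composite index, so the two score columns are NOT on the same scale.
swe_rebench = {
    "Claude Opus 4.6": (1, 65.3),
    "GLM-5": (3, 62.8),
    "Gemini 3.1 Pro Preview": (7, 62.3),
}
artificial_analysis = {
    "Claude Opus 4.6": (9, 53.0),
    "GLM-5": (17, 49.8),
    "Gemini 3.1 Pro Preview": (3, 57.2),
}

for model in swe_rebench:
    swe_rank, swe_score = swe_rebench[model]
    aa_rank, aa_score = artificial_analysis[model]
    print(f"{model}: SWE-rebench #{swe_rank} ({swe_score}%) vs "
          f"Artificial Analysis #{aa_rank} ({aa_score}) -> "
          f"rank delta {aa_rank - swe_rank:+d}")
```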
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
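As noted above, SWE-rebench runs each model five times to smooth out stochastic variance, but the published leaderboard shows a single figure per model and the exact aggregation rule is not documented here. The sketch below simply assumes the reported score is the mean resolved rate across the five runs, with the per-run numbers invented purely for illustration.

```python
# Hypothetical per-run resolved rates (%) for one model across five runs.
# The real harness and its aggregation rule are not published in this digest;
# this only shows how a mean and spread over repeated runs might be computed.
runs = [64.8, 65.9, 64.5, 66.1, 65.2]

mean = sum(runs) / len(runs)
variance = sum((r - mean) ** 2 for r in runs) / (len(runs) - 1)
std_dev = variance ** 0.5

print(f"mean resolved rate: {mean:.1f}%  (sample std dev {std_dev:.2f})")
```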
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 63 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 64 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 131 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 84 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 41 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 55 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 80 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 86 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 49 | $10.94 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
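The score above is described only as a composite across coding, math, and reasoning benchmarks; the category weights are not given. The sketch below assumes an unweighted mean of per-category scores, purely to illustrate how such a composite could be formed, not the index's actual formula.

```python
# Hypothetical per-category scores for one model. The real Artificial Analysis
# index may use different categories, weights, and normalization.
categories = {"coding": 58.0, "math": 63.5, "reasoning": 59.1}

composite = sum(categories.values()) / len(categories)
print(f"unweighted composite: {composite:.1f}")

# A weighted variant, if one wanted to emphasize coding (weights are made up):
weights = {"coding": 0.5, "math": 0.25, "reasoning": 0.25}
weighted = sum(categories[c] * weights[c] for c in categories)
print(f"weighted composite:   {weighted:.1f}")
```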
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 197 |
| 2 | GPT-5.1 Codex | 183 |
| 3 | GPT-5.4 mini | 182 |
| 4 | Qwen3.6 35B A3B | 182 |
| 5 | GPT-5 Codex | 171 |
| 6 | Hy3-preview | 159 |
| 7 | Qwen3.5 122B A10B | 159 |
| 8 | GPT-5.4 nano | 148 |
| 9 | MiMo-V2-Flash | 143 |
| 10 | Gemini 3.1 Pro Preview | 131 |
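Output tokens per second is a throughput figure, and measurement conventions differ (for instance, whether time-to-first-token is included); the table does not say which convention Artificial Analysis uses. The sketch below shows one common way to compute it from a streamed response, timing only the generation window.

```python
import time

def output_tokens_per_second(token_stream):
    """Consume a stream of tokens and return throughput over the generation window.

    `token_stream` is any iterable yielding tokens; timing starts at the first
    token, so time-to-first-token is excluded. This is one convention among
    several, not necessarily the one used for the table above.
    """
    count = 0
    start = None
    for _ in token_stream:
        if start is None:
            start = time.monotonic()
        count += 1
    elapsed = time.monotonic() - start if start is not None else 0.0
    return count / elapsed if elapsed > 0 else float("nan")

# Example with a fake stream yielding 50 tokens ~20 ms apart (roughly 50 tok/s).
def fake_stream(n=50, delay=0.02):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(f"{output_tokens_per_second(fake_stream()):.0f} tok/s")
```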
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | MiMo-V2.5 | $0.72 |
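The blended figure combines input and output prices at the 3:1 ratio stated above. A minimal sketch of that calculation, with the per-direction prices invented for illustration (the table publishes only the blended number):

```python
def blended_price_per_1m(input_per_1m: float, output_per_1m: float) -> float:
    """Blend input/output prices at a 3:1 input:output token ratio.

    Three of every four tokens are assumed to be input, matching the ratio
    stated in the caption above; real workloads vary.
    """
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical prices (USD per 1M tokens), not taken from any provider's sheet.
print(f"${blended_price_per_1m(0.10, 0.30):.3f} per 1M blended tokens")
```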