Claude Opus 4.6 holds the top position on SWE-rebench at 65.3 percent, unchanged from the previous cycle, while the tier below it has compressed significantly: gpt-5.2-2025-12-11-medium scores 64.4 percent, and a cluster of three models (GLM-5, Junie, and gpt-5.4-2026-03-05-medium) all sit at 62.8 percent.

The most striking movement comes from GLM-5, which jumps from rank 17 with a score of 49.8 on Artificial Analysis to rank 3 at 62.8 percent on SWE-rebench, a 13-point gain that suggests either a genuine capability improvement or a methodological difference between the two benchmarks worth examining. GLM-5.1 shows a similar trajectory, moving from rank 14 at 51.4 to rank 6 at 62.7 percent, and Kimi K2 Thinking climbs from rank 54 at 40.9 to rank 21 at 57.4 percent. Gemini 3.1 Pro Preview moves the other way: rank 3 with a score of 57.2 on Artificial Analysis but only rank 7 at 62.3 percent on SWE-rebench, a modest slip in relative position that leaves it below several newer contenders despite strong absolute performance.

The Artificial Analysis leaderboard shows GPT-5.5 leading with a composite score of 60.2, Claude Opus 4.7 at 57.3, and Gemini 3.1 Pro Preview at 57.2, a different ordering entirely from SWE-rebench. This divergence raises a methodological question: SWE-rebench appears to reward certain architectural or training choices that Artificial Analysis does not, and without clarity on what each benchmark isolates, ranking movements alone cannot confirm whether progress is real or an artifact of evaluation design.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
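The five-run protocol matters because single runs of agentic benchmarks are noisy. A minimal sketch of how repeated runs might be aggregated into a reported score (the exact SWE-rebench aggregation is an assumption here; `aggregate_runs` and the task counts are hypothetical):

```python
from statistics import mean, stdev

def aggregate_runs(resolved_per_run: list[int], total_tasks: int) -> dict:
    """Aggregate resolved-task counts from repeated runs into a score.

    resolved_per_run: number of tasks solved in each independent run.
    total_tasks: size of the task set.
    """
    rates = [r / total_tasks for r in resolved_per_run]
    return {
        "score_pct": round(100 * mean(rates), 1),    # reported score
        "spread_pct": round(100 * stdev(rates), 1),  # run-to-run variance
    }

# Hypothetical example: 5 runs over a 500-task set.
result = aggregate_runs([327, 330, 325, 329, 324], total_tasks=500)
print(result)  # {'score_pct': 65.4, 'spread_pct': 0.5}
```

Reporting the spread alongside the mean is what makes small gaps (such as the three-way tie at 62.8% below) interpretable: differences smaller than the run-to-run spread are effectively ties.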
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 75 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 52 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 125 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 78 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 38 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 62 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 79 | $4.81 |
| 8 | Grok 4.3 | 53.2 | 80 | $1.56 |
| 9 | Claude Opus 4.6 | 53.0 | 50 | $10.94 |
| 10 | Muse Spark | 52.1 | 0 | $0.00 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 198 |
| 2 | Qwen3.6 35B A3B | 188 |
| 3 | GPT-5.1 Codex | 177 |
| 4 | GPT-5.4 mini | 170 |
| 5 | GPT-5 Codex | 165 |
| 6 | GPT-5.4 nano | 156 |
| 7 | Qwen3.5 122B A10B | 152 |
| 8 | GPT-5.1 | 140 |
| 9 | MiMo-V2-Flash | 139 |
| 10 | Gemini 3 Pro Preview | 128 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.337 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | Qwen3.6 35B A3B | $0.557 |
| 9 | GPT-5 mini | $0.688 |
| 10 | MiMo-V2.5 | $0.72 |
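The blended figures above follow directly from per-direction prices at the stated 3:1 input-to-output ratio. A minimal sketch of that weighting (the per-token prices in the example are hypothetical, not drawn from the table):

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_weight: int = 3, output_weight: int = 1) -> float:
    """Blend input/output prices per 1M tokens at the given ratio.

    With the default 3:1 weighting, three-quarters of the blended
    price comes from the input rate and one quarter from the output rate.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_per_1m + output_weight * output_per_1m) / total_weight

# Hypothetical pricing: $0.10/1M input tokens, $0.30/1M output tokens.
print(f"${blended_price(0.10, 0.30):.3f}/1M blended")  # $0.150/1M blended
```

Because output tokens are typically priced several times higher than input tokens, the 3:1 blend understates costs for output-heavy workloads such as long agentic traces, which is worth keeping in mind when comparing the cheapest entries here.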