The SWE-rebench leaderboard shows stasis at the top, with Claude Code holding 52.9% and Junie at 52.1%, while significant reshuffling occurs in the middle tiers. The more interesting story is how sharply those mid-tier placements diverge between SWE-rebench and the Artificial Analysis composite index.

Claude Opus 4.5 holds position 8 with a 49.7 on Artificial Analysis but only position 12 with 43.8% on SWE-rebench, a 5.9-point gap that warrants scrutiny of whether the two evaluations weight different skills or whether the model genuinely underperforms on this task distribution. Kimi K2 Thinking sits 28th with 40.9% on SWE-rebench yet 13th with a 43.8 on Artificial Analysis, suggesting the composite index rewards capabilities that SWE-rebench's agentic software-engineering tasks do not exercise. Gemini 3 Pro Preview is 11th with a 48.4 on Artificial Analysis and 8th with 46.7% on SWE-rebench, a modest 1.7-point difference consistent with natural variance, though it still hints at methodological differences in how the two benchmarks score the same model. GLM-5 shows the starkest split: position 7 with a 49.8 on Artificial Analysis against position 15 with 42.1% on SWE-rebench, a 7.7-point gap that is difficult to attribute to random noise and suggests these benchmarks are testing different aspects of code-generation capability.

The stability of SWE-rebench's top five positions, combined with large swings in the 7-20 range, indicates the benchmark is sensitive enough to detect real differences but that the frontier models have plateaued relative to their challengers. That pattern is worth monitoring across future cycles to determine whether we are seeing genuine convergence or measurement instability. The divergence arithmetic is sketched below.
Cole Brennan
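To make the comparison concrete, here is a minimal sketch of the divergence computation behind figures like the 7.7-point GLM-5 gap. The input dicts are toy reconstructions from the numbers cited above, not an official data feed, and the two scores are on different scales (a resolve rate versus a composite index), so the gaps are indicative rather than strictly commensurable.

```python
# Minimal sketch: quantify per-model divergence between two leaderboards.
# Values are (rank, score) pairs reconstructed from the commentary above.
swe_rebench = {
    "Claude Opus 4.5": (12, 43.8),
    "GLM-5": (15, 42.1),
    "Kimi K2 Thinking": (28, 40.9),
}
artificial_analysis = {
    "Claude Opus 4.5": (8, 49.7),
    "GLM-5": (7, 49.8),
    "Kimi K2 Thinking": (13, 43.8),
}

for model in sorted(swe_rebench.keys() & artificial_analysis.keys()):
    swe_rank, swe_score = swe_rebench[model]
    aa_rank, aa_score = artificial_analysis[model]
    score_gap = aa_score - swe_score   # positive: stronger on Artificial Analysis
    rank_gap = swe_rank - aa_rank      # positive: ranked higher on Artificial Analysis
    print(f"{model}: {score_gap:+.1f} points, {rank_gap:+d} rank positions")
```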
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance; a toy illustration of that averaging follows the table.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
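The five-run protocol amounts to reporting an aggregate over repeated trials. As an illustration only (the per-run figures below are hypothetical, and the assumption that the published score is a simple mean is mine, not SWE-rebench's documentation):

```python
from statistics import mean, stdev

# Hypothetical resolved rates (%) for one model across five independent runs;
# SWE-rebench does not publish per-run data in this digest.
runs = [51.8, 53.4, 52.6, 52.9, 53.8]

score = mean(runs)    # assumed to be the figure shown on the leaderboard
spread = stdev(runs)  # a rough handle on run-to-run stochastic variance

print(f"score = {score:.1f}% ± {spread:.1f}")  # -> score = 52.9% ± 0.8
```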
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 80 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 56 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 61 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 75 | $4.81 |
| 7 | GLM-5 | 49.8 | 66 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 65 | $10.00 |
| 9 | GPT-5.2 Codex | 49.0 | 108 | $4.81 |
| 10 | Grok 4.20 Beta 0309 | 48.5 | 213 | $3.00 |
Output tokens per second — higher is faster. Only models with an intelligence score of at least 40 are listed; a sketch of this selection rule follows the table.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 213 |
| 2 | GPT-5 Codex | 203 |
| 3 | Gemini 3 Flash Preview | 179 |
| 4 | Qwen3.5 122B A10B | 159 |
| 5 | GPT-5.1 Codex | 140 |
| 6 | MiMo-V2-Flash | 127 |
| 7 | Gemini 3.1 Pro Preview | 114 |
| 8 | GPT-5.1 | 111 |
| 9 | Gemini 3 Pro Preview | 110 |
| 10 | GPT-5.2 Codex | 108 |
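The intelligence floor keeps fast-but-weak models off the speed and cost tables. A minimal sketch of that filter-then-rank rule, using one row from the composite table above and otherwise hypothetical values:

```python
# Selection rule: drop models below the intelligence floor, then rank by speed.
MIN_INTELLIGENCE = 40

models = [
    # (name, intelligence index, output tok/s)
    ("Grok 4.20 Beta 0309", 48.5, 213),      # both values from the tables above
    ("Gemini 3 Flash Preview", 44.0, 179),   # index value assumed for illustration
    ("HypotheticalDraftModel", 31.0, 400),   # fast, but excluded: below the floor
]

eligible = [m for m in models if m[1] >= MIN_INTELLIGENCE]
ranked = sorted(eligible, key=lambda m: m[2], reverse=True)
for rank, (name, _, tps) in enumerate(ranked, 1):
    print(f"{rank}. {name}: {tps} tok/s")
```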
Blended cost per 1M tokens (3:1 input/output mix) — lower is cheaper. Minimum intelligence score of 40; the blend arithmetic is sketched after the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
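The 3:1 blend weights the input price three times as heavily as the output price, reflecting a typical request mix of three input tokens per output token. A minimal sketch of the arithmetic; the price pair in the example is hypothetical, not taken from any vendor's price sheet:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input/output ratio:
    three input tokens for every output token."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical prices: $0.10/1M input, $0.30/1M output.
print(blended_price(0.10, 0.30))  # -> 0.15
```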