The SWE-rebench leaderboard shows no movement at the top: Claude Code, Junie, Claude Opus 4.6, and the two gpt-5.2 variants hold positions 1 through 5 with scores between 52.9% and 51.0%, unchanged from the prior cycle. Below that band, however, several models saw substantial rank shifts that expose inconsistencies between the two evaluation sources.

Claude Opus 4.5 dropped from position 8 (49.7 on Artificial Analysis) to position 12 (43.8% on SWE-rebench), a gap of 5.9 points that suggests either a significant performance regression or a methodological divergence between the benchmarks. Kimi K2 Thinking climbed 14 positions on SWE-rebench (27 to 13) on only a 2.9-point gain (40.9% to 43.8%), while GLM-5 fell from position 7 (49.8 on Artificial Analysis) to position 15 (42.1% on SWE-rebench), a 7.7-point drop. Kimi K2.5 fared worst, sinking from position 12 (46.8 on Artificial Analysis) to position 19 (37.9% on SWE-rebench).

These divergences raise questions about benchmark stability: SWE-rebench appears to reward certain architectural choices or fine-tuning strategies that Artificial Analysis does not, yet neither source clarifies whether the gap reflects genuine capability differences or evaluation artifacts. Without detailed methodology documentation for either benchmark, it is difficult to assess whether these swings represent real performance variation or measurement drift. At the frontier, the stability of the top five suggests that the highest-capability systems may be approaching a plateau on this task distribution, while the volatility in the 7 to 20 ranking band indicates that mid-tier models remain sensitive to benchmark design choices.
Cole Brennan
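The cross-benchmark movements discussed above reduce to simple rank deltas between the two leaderboards. A minimal sketch, using the positions cited in the analysis (hardcoded from the text, not recomputed from the tables):

```python
# (Artificial Analysis rank, SWE-rebench rank) as cited in the analysis above
positions = {
    "Claude Opus 4.5": (8, 12),
    "GLM-5": (7, 15),
    "Kimi K2.5": (12, 19),
}

for model, (aa_rank, swe_rank) in positions.items():
    delta = swe_rank - aa_rank  # positive = ranked lower on SWE-rebench
    print(f"{model}: {aa_rank} -> {swe_rank} ({delta:+d} positions)")
```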
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
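The five-run averaging that SWE-rebench uses to smooth out stochastic variance can be sketched as follows; the per-run resolved rates below are made-up illustration values, not actual SWE-rebench data:

```python
from statistics import mean, stdev

# Hypothetical resolved rates for one model across five independent runs
runs = [0.521, 0.534, 0.518, 0.529, 0.543]

score = mean(runs)    # the single reported leaderboard score
spread = stdev(runs)  # run-to-run variance the averaging absorbs
print(f"{score:.1%} ± {spread:.1%}")
```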
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 106 | $4.50 |
| 2 | GPT-5.4 | 57 | 78 | $5.63 |
| 3 | GPT-5.3 Codex | 54 | 65 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 55 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 69 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 67 | $4.81 |
| 7 | GLM-5 | 49.8 | 58 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 62 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 73 | $4.81 |
| 10 | Gemini 3 Pro Preview | 48.4 | 115 | $4.50 |
Output tokens per second (higher is faster). Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5 Codex | 186 |
| 2 | Gemini 3 Flash Preview | 166 |
| 3 | Qwen3.5 122B A10B | 154 |
| 4 | MiMo-V2-Flash | 136 |
| 5 | Gemini 3 Pro Preview | 115 |
| 6 | GPT-5.1 Codex | 114 |
| 7 | Gemini 3.1 Pro Preview | 106 |
| 8 | GLM-4.7 | 90 |
| 9 | Qwen3.5 27B | 89 |
| 10 | GPT-5.1 | 79 |
Blended cost per 1M tokens, weighted 3:1 input to output (lower is cheaper). Only models with an intelligence score of at least 40 are included.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
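The blended price above combines a model's input and output token prices at the stated 3:1 ratio. A minimal sketch of that weighting; the per-token prices in the example are hypothetical, not taken from the table:

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blend per-1M-token prices at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Hypothetical model priced at $0.50/1M input, $2.00/1M output
print(blended_price(0.50, 2.00))  # prints 0.875
```

With the 3:1 weighting, a cheap input price dominates the blend, which is why output-heavy workloads can cost noticeably more than these figures suggest.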