On SWE-rebench, the top tier has held steady: Claude Code at 52.9%, Junie at 52.1%, and Claude Opus 4.6 tied with gpt-5.2-xhigh at 51.7%, with no movement from the prior rankings. Claude Opus 4.5, by contrast, sits at position 8 with a 49.7 index score on Artificial Analysis but at position 12 with 43.8% on SWE-rebench, a 5.9-point gap that signals either methodological differences between the two benchmarks or genuine performance variance across problem distributions.

Elsewhere on Artificial Analysis, Kimi K2 Thinking climbed 14 positions (from 27 to 13) on a 2.9-point gain, while Gemini 3 Pro Preview slipped from position 10 to 11 despite holding steady at 48.4, indicating that a new entrant shifted the rankings. GLM-5 shows a similar cross-benchmark split: position 7 at 49.8 on Artificial Analysis versus position 15 at 42.1% on SWE-rebench, a 7.7-point gap that merits scrutiny as to whether it reflects model degradation or evaluation instability. Kimi K2.5 declined sharply from position 12 at 46.8% to position 19 at 37.9%, losing 8.9 points and 7 ranking positions. The Artificial Analysis leaderboard also saw new entries at position 10 (Grok 4.20 Beta) and position 40 (NVIDIA Nemotron 3 Super 120B), while LongCat Flash Lite (position 97) and Sarvam M (position 282) entered lower tiers, suggesting either benchmark expansion or periodic model rotation.

The divergence between SWE-rebench and Artificial Analysis on models like Claude Opus 4.5 and GLM-5 raises questions about benchmark sensitivity to implementation details and task sampling; without clarity on how the two evaluation methodologies differ, it is difficult to say whether these gaps reflect real capability variation or measurement artifacts.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
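The five-run protocol described above can be sketched as a simple aggregation: each run yields a resolved rate over the task set, and the reported score is the mean across runs, with the spread capturing stochastic variance. The data below is hypothetical, not taken from SWE-rebench.

```python
from statistics import mean, stdev

def resolved_rate(runs: list[list[bool]]) -> tuple[float, float]:
    """Aggregate per-run pass/fail outcomes into a mean resolved rate
    (in percent) and the run-to-run standard deviation."""
    rates = [100 * sum(r) / len(r) for r in runs]
    return mean(rates), stdev(rates)

# Hypothetical model: 5 independent runs over the same 20 tasks.
runs = [
    [True] * 11 + [False] * 9,   # 55.0%
    [True] * 10 + [False] * 10,  # 50.0%
    [True] * 11 + [False] * 9,   # 55.0%
    [True] * 10 + [False] * 10,  # 50.0%
    [True] * 12 + [False] * 8,   # 60.0%
]
avg, spread = resolved_rate(runs)  # avg = 54.0
```

Reporting the mean of several runs, rather than a single run, is what keeps a lucky or unlucky sample from moving a model several leaderboard positions.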
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 111 | $4.50 |
| 2 | GPT-5.4 | 57 | 77 | $5.63 |
| 3 | GPT-5.3 Codex | 54 | 57 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 53 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 60 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 65 | $4.81 |
| 7 | GLM-5 | 49.8 | 63 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 57 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 72 | $4.81 |
| 10 | Grok 4.20 Beta 0309 | 48.5 | 245 | $3.00 |
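Artificial Analysis does not publish its exact aggregation here, but a composite index of this kind is typically a weighted mean of per-category scores. The sketch below is purely illustrative: the category scores and equal weights are assumptions, not the provider's actual inputs or method.

```python
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-category scores on a common 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight

# Hypothetical category scores with equal weighting.
scores = {"coding": 60.0, "math": 55.0, "reasoning": 50.0}
weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}
index = composite_index(scores, weights)  # 55.0
```

One consequence of any weighted composite: two models with the same index can have very different category profiles, which is one reason a coding-specific benchmark like SWE-rebench can disagree with it.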
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 245 |
| 2 | GPT-5 Codex | 175 |
| 3 | Gemini 3 Flash Preview | 164 |
| 4 | Qwen3.5 122B A10B | 151 |
| 5 | MiMo-V2-Flash | 133 |
| 6 | Gemini 3 Pro Preview | 115 |
| 7 | Gemini 3.1 Pro Preview | 111 |
| 8 | GPT-5.1 Codex | 108 |
| 9 | Qwen3.5 27B | 87 |
| 10 | GLM-4.7 | 79 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
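The 3:1 blend used in the cost table weights input tokens three times as heavily as output tokens, reflecting a typical request mix. Given separate per-million input and output prices, the blended figure is a weighted average; the prices below are hypothetical, not drawn from the table.

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output ratio:
    (3 * input_price + 1 * output_price) / 4."""
    return (3 * input_per_m + output_per_m) / 4

# Hypothetical pricing: $0.50/1M input tokens, $2.00/1M output tokens.
cost = blended_price(0.50, 2.00)  # 0.875
```

Because most tokens in agentic and long-context workloads are input, this blend rewards cheap input pricing even when output pricing is several times higher.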