Claude Code holds the top position on SWE-rebench at 52.9%, with Junie at 52.1% and Claude Opus 4.6 at 51.7%. The most striking pattern, however, appears in the Artificial Analysis index, where the rankings have shifted substantially since the prior update: Claude Opus 4.5 dropped from position 8 (49.7) to position 12 (43.8), GLM-5 fell from position 7 (49.8) to position 15 (42.1), and Kimi K2.5 plummeted from position 13 (46.8) to position 19 (37.9), while Kimi K2 Thinking climbed from position 28 (40.9) to position 13 (43.8) and GLM-4.6 rose from position 54 (32.5) to position 22 (37.1). DeepSeek V3.2 Speciale dropped sharply from position 46 (34.1) to position 66 (29.4), the steepest single reversal in this update. A quick diff of the two snapshots, sketched below, makes the movement explicit.

The scale of these swings raises questions about benchmark stability. Shifts of 5 to 10 index points across a single update are large enough to suggest meaningful model degradation, evaluation methodology changes, or dataset variance rather than genuine performance evolution. Kimi K2 Thinking's 15-position rise paired with K2.5's 6-position fall is particularly difficult to interpret without visibility into what changed in the evaluation protocol.

On SWE-rebench, the top tier remains essentially frozen, which either reflects genuine consolidation at the capability ceiling or indicates the benchmark has reached saturation, where further discrimination is difficult. The Artificial Analysis index, by contrast, shows volatility that demands scrutiny: if these are the same models tested under consistent conditions, such reversals warrant investigating whether scoring criteria, test set composition, or model access states have shifted between runs.
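For auditing swings like these, a small snapshot diff is enough. The sketch below hard-codes the positions quoted above; the dictionaries and field layout are illustrative, since neither leaderboard exposes data in this form.

```python
# Sketch: diff two ranking snapshots to make rank/score movement explicit.
# Entries are the positions quoted in the commentary above; the data
# structures themselves are illustrative, not any leaderboard's API.
prev = {"Claude Opus 4.5": (8, 49.7), "GLM-5": (7, 49.8), "Kimi K2.5": (13, 46.8),
        "Kimi K2 Thinking": (28, 40.9), "GLM-4.6": (54, 32.5)}
curr = {"Claude Opus 4.5": (12, 43.8), "GLM-5": (15, 42.1), "Kimi K2.5": (19, 37.9),
        "Kimi K2 Thinking": (13, 43.8), "GLM-4.6": (22, 37.1)}

for model in sorted(prev.keys() & curr.keys()):
    (p_rank, p_score), (c_rank, c_score) = prev[model], curr[model]
    print(f"{model}: rank {p_rank} -> {c_rank} ({c_rank - p_rank:+d}), "
          f"score {p_score} -> {c_score} ({c_score - p_score:+.1f})")
```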
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed for fair comparison of LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that aggregation follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
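The five-run protocol implies each published score is an aggregate over repeated rollouts. A minimal sketch of that aggregation, with made-up per-run values (SWE-rebench does not publish its raw runs here):

```python
from statistics import mean, stdev

# Hypothetical resolved rates for one model across five rollouts, mirroring
# SWE-rebench's repeated-run protocol; the values themselves are made up.
runs = [0.515, 0.534, 0.528, 0.521, 0.547]

print(f"resolved rate: {mean(runs):.1%} +/- {stdev(runs):.1%}")
```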
Artificial Analysis composite index across coding, math, and reasoning benchmarks; an assumed equal-weight blend is sketched after the table.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 116 | $4.50 |
| 2 | GPT-5.4 | 57.0 | 83 | $5.63 |
| 3 | GPT-5.3 Codex | 54.0 | 61 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 55 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 61 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 61 | $4.81 |
| 7 | GLM-5 | 49.8 | 63 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 61 | $10.00 |
| 9 | GPT-5.2 Codex | 49.0 | 80 | $4.81 |
| 10 | Grok 4.20 Beta 0309 | 48.5 | 251 | $3.00 |
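The composite blends per-category results into one number. Artificial Analysis's actual weighting is not given here, so the equal-weight mean below is one possible reading, with illustrative category values, not their formula:

```python
# Assumed equal-weight composite over per-category scores; the real
# weighting is not published in this post, and the values are illustrative.
categories = {"coding": 55.0, "math": 61.0, "reasoning": 56.0}
composite = sum(categories.values()) / len(categories)
print(f"composite index: {composite:.1f}")  # 57.3 with these inputs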
Output tokens per second (higher is faster). Only models with an intelligence score of at least 40 are included; a sketch of how such throughput can be measured follows the table.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 251 |
| 2 | GPT-5 Codex | 178 |
| 3 | Gemini 3 Flash Preview | 168 |
| 4 | Qwen3.5 122B A10B | 149 |
| 5 | MiMo-V2-Flash | 128 |
| 6 | Gemini 3 Pro Preview | 125 |
| 7 | Gemini 3.1 Pro Preview | 116 |
| 8 | GPT-5.1 Codex | 107 |
| 9 | Qwen3.5 27B | 88 |
| 10 | GPT-5.4 | 83 |
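Throughput figures like these are generated tokens divided by wall-clock generation time. A minimal measurement sketch; the `stream` argument is a placeholder for any streaming endpoint, not a specific vendor SDK:

```python
import time

def tokens_per_second(stream):
    """Count streamed tokens against wall-clock time.

    `stream` is any iterable yielding tokens; a real harness would wrap a
    vendor streaming endpoint here (placeholder, no specific SDK assumed).
    """
    start = time.monotonic()
    n_tokens = sum(1 for _ in stream)
    elapsed = time.monotonic() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")
```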
Blended cost per 1M tokens at a 3:1 input:output ratio (lower is cheaper). Only models with an intelligence score of at least 40 are included; the blend formula is sketched after the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
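The 3:1 blend weights input tokens three times as heavily as output tokens. A worked sketch of the arithmetic; the per-direction prices in the example are illustrative, since the table lists only blended figures:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at the 3:1 input:output weighting used above."""
    return (3 * input_per_m + output_per_m) / 4

# Illustrative prices only: $1.00/1M input and $3.00/1M output
# blend to (3 * 1.00 + 3.00) / 4 = $1.50/1M.
print(f"${blended_price(1.00, 3.00):.2f}/1M tokens")
```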