Claude Code holds the top position on SWE-rebench at 52.9%, unchanged from the previous cycle, while a new entrant, Junie, debuts in second place at 52.1%. Claude Opus 4.6 held steady at 51.7%, now tied with gpt-5.2-2025-12-11-xhigh, but slipped one position to third, a displacement explained by Junie's entry above it rather than any change in its own score.

The most notable movement shows up when the two leaderboards are set side by side. Claude Opus 4.5 scores 49.7 on the Artificial Analysis index (position 8) but only 43.8% on SWE-rebench (position 12), a roughly six-point gap that suggests either a real weakness in agentic code-solving or a meaningful divergence in what the two benchmarks measure. GLM-5 shows a similar split, 49.8 against 42.1%, and Kimi K2.5 an even wider one, 46.8 against 37.9%, indicating that Artificial Analysis may weight certain model capabilities differently than SWE-rebench does. The pattern runs the other way for Kimi K2 Thinking, which sits three positions higher on SWE-rebench (43.8% against 40.9 on the index), and for GLM-4.6 (37.1% against 32.5), suggesting these models may have received updates or that SWE-rebench captures their strengths more clearly.

The top tier remains clustered between 51.0% and 52.9%, with no model breaking past 53%, which points to a plateau on this benchmark's task distribution rather than rapid improvement. Without full methodology details for either leaderboard, the divergence warrants scrutiny, though part of it has a straightforward explanation: Artificial Analysis is a composite across coding, math, and reasoning, while SWE-rebench measures only real-world software engineering tasks, so SWE-rebench may simply penalize approaches that the broader composite credits.
Cole Brennan
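The position swings called out above can be checked mechanically against the two top-ten tables below. A minimal sketch, with both rankings transcribed by hand from this post; model naming differs between the boards (e.g. gpt-5.2-codex vs GPT-5.2 Codex), so only exact name matches line up here, and a real comparison would normalize names first.

```python
# Compare a model's position across the two leaderboards in this post.
# Rankings are transcribed by hand from the tables below; only models
# whose names match exactly on both boards can be compared.

swe_rebench = [  # SWE-rebench top 10, best first
    "Claude Code", "Junie", "Claude Opus 4.6",
    "gpt-5.2-2025-12-11-xhigh", "gpt-5.2-2025-12-11-medium",
    "gpt-5.1-codex-max", "Claude Sonnet 4.5",
    "Gemini 3 Pro Preview", "Gemini 3 Flash Preview", "gpt-5.2-codex",
]

artificial_analysis = [  # Artificial Analysis top 10, best first
    "Gemini 3.1 Pro Preview", "GPT-5.4", "GPT-5.3 Codex",
    "Claude Opus 4.6", "Claude Sonnet 4.6", "GPT-5.2",
    "GLM-5", "Claude Opus 4.5", "GPT-5.2 Codex", "Gemini 3 Pro Preview",
]

def rank(board: list[str], model: str) -> int | None:
    """1-based position of `model` on `board`, or None if absent."""
    try:
        return board.index(model) + 1
    except ValueError:
        return None

# Positive delta: the model sits higher (better) on SWE-rebench.
for model in swe_rebench:
    a = rank(artificial_analysis, model)
    s = rank(swe_rebench, model)
    if a is not None:
        print(f"{model}: SWE-rebench #{s}, Artificial Analysis #{a}, delta {a - s:+d}")
```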
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
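The five-run protocol described above amounts to averaging resolution rates across runs. A minimal sketch of that averaging; the pass/fail results here are invented for illustration, since the benchmark publishes only the final percentages.

```python
# Sketch of five-run averaging. The run results below are made up; a
# real harness would record pass/fail per task per run from its logs.
from statistics import mean

# runs[i][task] -> True if the model resolved the task on run i
runs = [
    {"task-1": True,  "task-2": False, "task-3": True},
    {"task-1": True,  "task-2": False, "task-3": False},
    {"task-1": True,  "task-2": True,  "task-3": True},
    {"task-1": False, "task-2": False, "task-3": True},
    {"task-1": True,  "task-2": False, "task-3": True},
]

# Score each run as its fraction of resolved tasks, then average the
# five run-level scores to smooth out stochastic variance.
run_scores = [mean(run.values()) for run in runs]
print(f"per-run: {[f'{s:.0%}' for s in run_scores]}, reported: {mean(run_scores):.1%}")
```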
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 120 | $4.50 |
| 2 | GPT-5.4 | 57 | 73 | $5.63 |
| 3 | GPT-5.3 Codex | 54 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 57 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 69 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 76 | $4.81 |
| 7 | GLM-5 | 49.8 | 52 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 64 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 76 | $4.81 |
| 10 | Gemini 3 Pro Preview | 48.4 | 116 | $4.50 |
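The composite score above is an index, not a percentage, which is part of why direct score comparisons with SWE-rebench are shaky. Artificial Analysis's exact component benchmarks and weights are not given in this post; the sketch below shows only the generic shape of such an index, with hypothetical category scores and equal weights.

```python
# Illustrative only: a composite index as a weighted average of
# per-category scores. The categories, scores, and weights here are
# hypothetical, not Artificial Analysis's actual methodology.
def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

scores = {"coding": 55.0, "math": 60.0, "reasoning": 50.0}  # hypothetical
weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}    # equal weights
print(f"{composite(scores, weights):.1f}")  # 55.0
```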
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | GPT-5 Codex | 187 |
| 2 | Gemini 3 Flash Preview | 166 |
| 3 | GPT-5.1 Codex | 130 |
| 4 | Qwen3.5 122B A10B | 129 |
| 5 | Gemini 3.1 Pro Preview | 120 |
| 6 | Gemini 3 Pro Preview | 116 |
| 7 | MiMo-V2-Flash | 116 |
| 8 | GPT-5.1 | 108 |
| 9 | GLM-4.7 | 106 |
| 10 | Qwen3.5 27B | 91 |
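Both this board and the cost board apply the same intelligence floor before sorting. A minimal sketch of that filter-then-sort step; the two Gemini Pro entries take their scores from the composite table above, while "slow-model-x" is a hypothetical entry added to show the floor excluding a fast but weak model.

```python
# Assemble a speed board: drop models below the intelligence floor,
# then sort by output speed. "slow-model-x" is hypothetical.
MIN_INTELLIGENCE = 40

models = [
    {"name": "Gemini 3.1 Pro Preview", "score": 57.2, "tok_s": 120},
    {"name": "Gemini 3 Pro Preview",   "score": 48.4, "tok_s": 116},
    {"name": "slow-model-x",           "score": 35.0, "tok_s": 240},  # filtered out
]

board = sorted(
    (m for m in models if m["score"] >= MIN_INTELLIGENCE),
    key=lambda m: m["tok_s"],
    reverse=True,  # higher is faster
)
for pos, m in enumerate(board, 1):
    print(f"{pos}. {m['name']}: {m['tok_s']} tok/s")
```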
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
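The blended figures above mix input and output pricing at a 3:1 ratio, i.e. three parts input to one part output per million tokens. A minimal sketch of that blend; the per-direction prices in the example are hypothetical, since the table publishes only the blended result.

```python
# Sketch of the 3:1 blended price: three parts input to one part
# output, per 1M tokens. Example prices below are hypothetical.
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output ratio."""
    return (3 * input_per_m + 1 * output_per_m) / 4

# e.g. a model priced at $0.50/1M input and $2.00/1M output
print(f"${blended_price(0.50, 2.00):.3f} per 1M tokens")  # $0.875
```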