The SWE-rebench leaderboard shows compression at the top but no meaningful movement within the tested range. Claude Code, Junie, and Claude Opus 4.6 remain locked in the 51.7 to 52.9 percent band, with the top five models separated by less than two points, indicating a plateau in discriminative power rather than genuine progress.

Below that tier, the two benchmarks tell different stories about the same models: Claude Opus 4.5 scores 49.7 on Artificial Analysis against 43.8 on SWE-rebench, GLM-5 falls from 49.8 to 42.1, and Kimi K2.5 from 46.8 to 37.9. Gaps that large suggest differences in what the two benchmarks measure, or plain evaluation variance, rather than model regression. In the other direction, Kimi K2 Thinking gained 2.9 points to reach 43.8 on SWE-rebench and GLM-4.6 gained 4.6 points to reach 37.1 on Artificial Analysis, but these moves occur in a region where single-digit swings are routine and may reflect test-set sensitivity rather than architectural breakthroughs.

The divergence persists at the very top: GPT-5.4 and Gemini 3.1 Pro Preview both score 57.2 on Artificial Analysis versus Claude Code's 52.9 on SWE-rebench. The comparison is not even apples to apples, since the Artificial Analysis figure is a composite across coding, math, and reasoning while the SWE-rebench figure is a resolved rate on software engineering tasks alone; SWE-rebench also appears stricter or tests different problem classes, making direct comparison unreliable. Without documentation of methodology changes, evaluation-set stability, or statistical confidence intervals, the apparent volatility in mid-tier positions cannot be distinguished from noise. The real finding is not movement but stagnation at the frontier and inconsistency across benchmarks, both of which limit confidence in using either as a proxy for practical coding capability.
Cole Brennan
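To make the noise argument concrete, here is a rough sketch of the sampling error on a resolved rate, treating each benchmark task as an independent pass/fail trial. The task count is an assumption (SWE-rebench's set size is not given here; 300 is used purely for illustration), and `resolve_rate_ci` is a name of mine, not anything the benchmark publishes.

```python
import math

def resolve_rate_ci(score_pct, n_tasks, z=1.96):
    """95% normal-approximation confidence interval for a resolved rate,
    treating each task as an independent Bernoulli trial."""
    p = score_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n_tasks)
    return 100.0 * (p - z * se), 100.0 * (p + z * se)

# Top-three SWE-rebench scores from the table below; n_tasks is assumed.
for name, score in [("Claude Code", 52.9), ("Junie", 52.1), ("Claude Opus 4.6", 51.7)]:
    lo, hi = resolve_rate_ci(score, n_tasks=300)
    print(f"{name}: {score:.1f}%  (95% CI {lo:.1f}-{hi:.1f})")
```

Under that assumption the interval is roughly plus or minus 5.6 points, which dwarfs the 1.2-point spread separating the top four models; only a much larger task set would make those rank orderings statistically meaningful.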
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that aggregation follows the table).
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
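A minimal sketch of what five-run aggregation looks like. The per-run rates below are invented for illustration; SWE-rebench reports only the final score, not per-run figures.

```python
from statistics import mean, stdev

# Invented per-run resolved rates for one model across five runs.
runs = [0.536, 0.521, 0.529, 0.533, 0.525]

score = mean(runs)    # the single figure a leaderboard would report
spread = stdev(runs)  # run-to-run variability from sampling stochasticity

print(f"score: {100 * score:.1f}%  (run-to-run sd {100 * spread:.2f} pts)")
```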
Artificial Analysis composite index across coding, math, and reasoning benchmarks; a sketch of how a composite of this shape combines category scores follows the table.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 80 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 113 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 69 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 60 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 68 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 69 | $4.81 |
| 7 | GLM-5 | 49.8 | 66 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 65 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 91 | $4.81 |
| 10 | MiMo-V2-Pro | 48.8 | — | — |
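For readers unfamiliar with composite indices, a minimal sketch under an explicit assumption: Artificial Analysis does not document its weighting here, so the equal weights and all category numbers below are invented purely to show the arithmetic.

```python
def composite(scores, weights=None):
    """Weighted mean of per-category scores; equal weights by default.
    The weighting is an assumption, not Artificial Analysis's formula."""
    weights = weights or {k: 1.0 for k in scores}
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

# Invented category scores, for illustration only.
print(composite({"coding": 50.0, "math": 60.0, "reasoning": 55.0}))  # 55.0
```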
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Grok 4.20 Beta 0309 | 196 |
| 2 | Gemini 3 Flash Preview | 180 |
| 3 | GPT-5 Codex | 176 |
| 4 | Qwen3.5 122B A10B | 151 |
| 5 | MiMo-V2-Flash | 130 |
| 6 | Gemini 3.1 Pro Preview | 113 |
| 7 | Gemini 3 Pro Preview | 110 |
| 8 | GPT-5.1 Codex | 103 |
| 9 | GPT-5.2 Codex | 91 |
| 10 | Qwen3.5 27B | 90 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40. A worked example of the blend follows the table.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
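The 3:1 blend weights the input price three times as heavily as the output price. A minimal sketch of the arithmetic; the per-direction prices here are hypothetical, since the table lists only the blended figure.

```python
def blended_price(input_per_m, output_per_m, input_weight=3.0, output_weight=1.0):
    """Blended $ per 1M tokens at a 3:1 input:output token ratio."""
    total = input_weight + output_weight
    return (input_per_m * input_weight + output_per_m * output_weight) / total

# Hypothetical prices: $0.50/1M input, $2.00/1M output.
print(f"${blended_price(0.50, 2.00):.3f} per 1M tokens")  # $0.875
```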