The SWE-rebench leaderboard has consolidated at the top, with no movement among the leading seven models. Positions #10 through #12 were also unchanged: Claude Sonnet 4.6 held at #10 with 51.3 percent, Gemini 3.1 Pro Preview held at #11 with 51.1 percent, and GLM-5.1 remained at #12 with 50.7 percent. The volatility among mid-tier performers shows up when the same models are compared against Artificial Analysis.

Artificial Analysis tells a different story. GLM-5.1 ranks 23rd there at 40.2 but 12th on SWE-rebench at 50.7, a 10.5-point gap that suggests either a model update or a difference in methodology between the two benchmarks. The widest rank gap belongs to GLM-4.7, which sits at #51 on Artificial Analysis (33.8) but #17 on SWE-rebench (38.2), a 4.4-point difference, while Kimi K2.6 moves from rank 16 to rank 15 with a 3.7-point difference, 42.8 versus 46.5.

These discrepancies raise questions about evaluation methodology. Artificial Analysis is a composite index across coding, math, and reasoning, while SWE-rebench covers software engineering tasks only, so the scores are not comparable point for point. SWE-rebench appears to reward different model behaviors or architectural choices than Artificial Analysis, particularly for Chinese-developed models such as GLM and Kimi, which suggests the two benchmarks measure distinct aspects of coding capability rather than converging on a single signal. The absence of score inflation at the frontier, where the top model remains at 62.7 percent, suggests the evaluation has not become easier. Even so, the divergent rankings for identical models undermine confidence in any single leaderboard as a complete measure of coding performance.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
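
The five-run protocol mentioned above can be illustrated with a small sketch. It assumes the reported score is the mean of the runs and uses the standard error of the mean as a rough uncertainty estimate; the source does not say how SWE-rebench actually aggregates its runs, and the run values and the `summarize_runs` helper are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    n = len(run_scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(run_scores)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    sem = stdev(run_scores) / sqrt(n)
    return avg, sem


# Hypothetical resolved rates (percent) from five runs of a single model.
runs = [50.0, 52.0, 51.0, 53.0, 49.0]
avg, sem = summarize_runs(runs)
print(f"mean={avg:.1f}%  sem={sem:.2f}")  # prints: mean=51.0%  sem=0.71
```

Under these assumptions, two models whose means differ by less than roughly one or two standard errors, such as entries separated by a few tenths of a point, would be hard to tell apart from run-to-run noise.
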

Artificial Analysis composite index across coding, math, and reasoning benchmarks. The tok/s column is output tokens per second; the $/1M column is blended cost per 1M tokens.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | n/a | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 67 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 63 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 52 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 142 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 85 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 217 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 67 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 136 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 197 | $3.75 |

Output tokens per second; higher is faster. Limited to models with an intelligence score of at least 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 217 |
| 2 | Qwen3.7 Max | 197 |
| 3 | GPT-5.4 mini | 180 |
| 4 | GPT-5.4 | 142 |
| 5 | GPT-5.2 Codex | 139 |
| 6 | Gemini 3.1 Pro Preview | 136 |
| 7 | DeepSeek V4 Flash | 110 |
| 8 | GLM-5.1 | 103 |
| 9 | GPT-5.3 Codex | 95 |
| 10 | DeepSeek V4 Pro | 92 |
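
The filter-then-rank rule behind this table can be sketched as follows. The rows are copied from the Artificial Analysis top-10 table only, so the output will not match the full speed ranking, which draws on more models. Treating a missing speed as "no measurement" and the `rank_by_speed` helper are assumptions made for illustration.

```python
# (model, intelligence score, output tokens per second); None means no speed listed.
ROWS = [
    ("Claude Fable 5", 59.9, None),
    ("Claude Opus 4.8", 55.7, 67),
    ("GPT-5.5", 54.8, 63),
    ("Claude Opus 4.7", 53.5, 52),
    ("GPT-5.4", 51.4, 142),
    ("GLM-5.2", 51.1, 85),
    ("Gemini 3.5 Flash", 50.2, 217),
    ("Claude Sonnet 4.6", 47.2, 67),
    ("Gemini 3.1 Pro Preview", 46.5, 136),
    ("Qwen3.7 Max", 46.0, 197),
]


def rank_by_speed(rows, min_score=40.0):
    """Keep models at or above min_score with a measured speed, fastest first."""
    eligible = [r for r in rows if r[1] >= min_score and r[2]]
    # sorted() is stable, so ties keep their original (score-ranked) order.
    return sorted(eligible, key=lambda r: r[2], reverse=True)


ranked = rank_by_speed(ROWS)
for position, (model, _score, tok_s) in enumerate(ranked, start=1):
    print(f"{position:>2}  {model:<24} {tok_s}")
```
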

Blended cost per 1M tokens, weighted 3:1 input to output; lower is cheaper. Limited to models with an intelligence score of at least 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
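
As a rough illustration of the blended figure, the sketch below reads the 3:1 ratio as a weighted average of three parts input price to one part output price. The per-token prices in the example are hypothetical, not taken from any provider's price list, and `blended_price` is an illustrative helper.

```python
def blended_price(input_per_m, output_per_m, input_weight=3.0, output_weight=1.0):
    """Weighted average price per 1M tokens, given input and output prices per 1M."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
price = blended_price(1.00, 4.00)
print(f"${price:.2f} per 1M tokens")  # prints: $1.75 per 1M tokens
```

The weighting matters because output tokens are typically priced higher than input tokens, so a workload with a different input-to-output mix would see a different effective cost.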