On the SWE-rebench coding benchmark, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds first at 62.7%, Junie second at 61.6%, and Codex third at 60.4%. The middle ranks show more movement: GLM-5.1 jumped from 40.2% to 50.7% (rank 23 to 12), and Kimi K2.6 advanced from 42.8% to 46.5% (rank 16 to 15). Claude Sonnet 4.6 stands at 51.3% in position 10 and Gemini 3.5 Flash at 49.5% in rank 13.

The Artificial Analysis leaderboard shows more movement across its 394 entries. Claude Fable 5 leads at 59.9, and gains are marginal and concentrated among models in the 40 to 50 point range, although the largest absolute climb, GLM-4.7's move from 33.8 to 38.2 points, falls just below that range.

The two leaderboards score the same models differently: Claude Sonnet 4.6 scores 51.3% on SWE-rebench (rank 10) but 47.2 on Artificial Analysis (rank 8), and Gemini 3.5 Flash scores 49.5% against 50.2. The gap suggests the two measure different problem distributions or use different evaluation methodologies, which is expected given that SWE-rebench reports a percentage on software engineering tasks while the Artificial Analysis index is a composite across coding, math, and reasoning. It also raises the question of whether an improvement on one reflects a genuine capability gain or benchmark-specific overfitting. GLM-5.1's 10.5-point jump aside, the SWE-rebench shifts are modest in absolute terms, most under 5 percentage points, which is more consistent with natural variance than with architectural breakthroughs.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
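
The description above says each model is run five times to account for stochastic variance. As a rough illustration of how five runs might be condensed into one reported figure, the sketch below averages the runs and computes a standard error. The per-run values are hypothetical, and the choice of mean plus standard error is an assumption, not SWE-rebench's documented aggregation method.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    n = len(scores)
    avg = mean(scores)
    # Sample standard deviation divided by sqrt(n); zero if only one run exists.
    sem = stdev(scores) / sqrt(n) if n > 1 else 0.0
    return avg, sem


# Hypothetical resolved rates (%) from five runs of one model; not real data.
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this reading, two models whose averages differ by less than a couple of standard errors would be hard to separate, which is one way to judge whether a small score shift is meaningful.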
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 67 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 61 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 54 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 157 | $5.63 |
| 6 | GLM-5.2 | 50.7 | 100 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 223 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 66 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 127 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 96 | $3.75 |
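
Because the table pairs each index score with a price, one simple way to read it is as value for money. The sketch below divides score by price for a few rows copied from the table. "Points per dollar" is an illustrative metric introduced here, not one the leaderboard publishes, and it ignores speed entirely.

```python
# (model, index score, $ per 1M tokens) copied from rows of the table above.
rows = [
    ("Claude Fable 5", 59.9, 20.00),
    ("Claude Opus 4.8", 55.7, 10.00),
    ("GPT-5.5", 54.8, 11.25),
    ("GPT-5.4", 51.4, 5.63),
    ("GLM-5.2", 50.7, 2.15),
    ("Gemini 3.5 Flash", 50.2, 3.38),
]


def points_per_dollar(score, price):
    """Index points obtained per dollar spent on 1M tokens (illustrative metric)."""
    return score / price


# Sort from best to worst value for money.
ranked = sorted(rows, key=lambda r: points_per_dollar(r[1], r[2]), reverse=True)
for name, score, price in ranked:
    print(f"{name}: {points_per_dollar(score, price):.1f} points per $")
```

On these rows the ordering is close to the reverse of the score ranking: the cheapest models lead and the top-scoring model comes last.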
Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 223 |
| 2 | GPT-5.4 mini | 177 |
| 3 | GPT-5.4 | 157 |
| 4 | Gemini 3.1 Pro Preview | 127 |
| 5 | GPT-5.2 Codex | 125 |
| 6 | DeepSeek V4 Flash | 105 |
| 7 | GLM-5.2 | 100 |
| 8 | Qwen3.7 Max | 96 |
| 9 | GPT-5.2 | 79 |
| 10 | GPT-5.3 Codex | 77 |
Blended cost per 1M tokens at a 3:1 input-to-output ratio; lower is cheaper. Only models with an intelligence score of at least 40 are included.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
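
The caption for this table describes a blended cost with a 3:1 input/output mix. A minimal sketch of that arithmetic follows, assuming the ratio means a weighted average of three parts input price to one part output price. The function name and the example prices are hypothetical, not taken from any provider's price list.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
print(blended_price(1.00, 4.00))  # 1.75
```

The weighting matters because output tokens are usually priced higher than input tokens, so a workload with a different mix than 3:1 would see a different effective cost than the one listed.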