On SWE-rebench, the top tier is unchanged: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, and the next four entries stay within a narrow band (58.9% to 61.6%). The meaningful shifts are in the mid-tier. GLM-5.1 entered at 50.7% in position 12, whereas it sits at position 21 on Artificial Analysis with a score of 40.2, which suggests that SWE-rebench surfaces a different capability profile than general evaluation suites. Kimi K2.6 gained 3.7 points to reach 46.5%, while Gemini 3.5 Flash dropped 0.7 points to 49.5% despite having ranked sixth on Artificial Analysis with a score of 50.2, a possible sign that SWE-rebench's code-specific tasks penalize certain architectural choices.

On Artificial Analysis, GLM-5.2 is a new entrant at position 6 with a score of 50.7; it did not appear in the prior ranking. Most of the remaining list shows positional shuffling without score changes, so the movement appears to come mainly from model releases rather than from re-evaluation of existing systems.

SWE-rebench therefore gives a cleaner signal for coding capability than the broader Artificial Analysis suite, where most entries keep identical scores across the two snapshots; the latter behaves more like a stable archive than a live leaderboard. Neither benchmark shows the kind of discontinuous jump that would signal a methodological shift or a contamination event.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
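
The five-run protocol mentioned above can be illustrated with a minimal sketch. It assumes the reported score is the mean resolved rate across runs and that run-to-run variation is summarized by the standard error of the mean; both are assumptions for illustration, not a description of SWE-rebench's actual pipeline. The function name `summarize_runs` and the per-run numbers are hypothetical.

```python
import math
import statistics


def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(resolved_rates)
    # Sample standard deviation (n - 1 denominator), then SEM = s / sqrt(n).
    sem = statistics.stdev(resolved_rates) / math.sqrt(n)
    return mean, sem


# Hypothetical per-run scores for a single model (illustrative only, not real data).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
mean, sem = summarize_runs(runs)
print(f"{mean:.1f}% +/- {sem:.1f}")
```

Under these assumptions, two models whose means differ by less than a couple of standard errors should probably be read as tied rather than strictly ordered.
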
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | Output tok/s | Blended $/1M tokens |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 68 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 67 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 54 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 166 | $5.63 |
| 6 | GLM-5.2 | 50.7 | 114 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 203 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 63 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 127 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 106 | $3.75 |
Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are listed.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 203 |
| 2 | GPT-5.4 mini | 180 |
| 3 | GPT-5.4 | 166 |
| 4 | Gemini 3.1 Pro Preview | 127 |
| 5 | GPT-5.2 Codex | 125 |
| 6 | GLM-5.2 | 114 |
| 7 | Qwen3.7 Max | 106 |
| 8 | DeepSeek V4 Flash | 100 |
| 9 | GPT-5.3 Codex | 89 |
| 10 | GPT-5.2 | 78 |
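
The selection rule behind this table (apply a minimum intelligence score, then sort by output speed) can be sketched as follows. The first four rows reuse scores and speeds from the composite table; the last row is invented to show the cutoff at work. Treating the threshold as inclusive (`>=`) is an assumption, and the names `Entry` and `fastest` are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float      # composite intelligence index
    tok_per_s: float  # output tokens per second


def fastest(entries, min_score=40.0, top_n=10):
    """Keep entries at or above min_score, then sort by speed, fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tok_per_s, reverse=True)[:top_n]


entries = [
    Entry("GPT-5.4", 51.4, 166),
    Entry("GLM-5.2", 50.7, 114),
    Entry("Gemini 3.5 Flash", 50.2, 203),
    Entry("Qwen3.7 Max", 46.0, 106),
    Entry("hypothetical-small-model", 35.0, 400),  # made up; falls below the cutoff
]

for rank, entry in enumerate(fastest(entries), start=1):
    print(rank, entry.model, entry.tok_per_s)
```

The made-up fifth entry is the fastest by a wide margin but is excluded, which is the point of the cutoff: raw speed only counts among models that clear the quality bar.
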
Blended cost per 1M tokens at a 3:1 input-to-output ratio; lower is cheaper. Only models with an intelligence score of at least 40 are listed.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
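
The 3:1 blended cost used in the table above is a weighted average of input and output prices. A minimal sketch, assuming the blend weights three parts input to one part output; the function name `blended_cost` and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices in dollars per 1M tokens (illustrative only).
example_input_price = 5.00
example_output_price = 30.00
print(f"${blended_cost(example_input_price, example_output_price):.2f}")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure; workloads that generate far more output than they read will see a higher effective cost than the blended figure suggests.
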