On SWE-rebench, the top tier is unchanged: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie stays at 61.6%, and the Claude and GPT variants keep positions three through seven. The meaningful shifts are in the mid-tier. GLM-5.1 climbed from position 23 at 40.2% to position 12 at 50.7%, a 10.5-point gain and the largest jump in the dataset, while GLM-4.7 rose from position 52 at 33.8% to position 17 at 38.2%. Kimi K2.6 advanced from position 16 to position 15. Claude Sonnet 4.6 slipped from position 8 to position 10 with an unchanged score of 51.3%, so the drop reflects other entries passing it rather than any change in its own result.

On the Artificial Analysis index, the distribution is far less volatile: Claude Fable 5 leads at 59.9, the top 20 models cluster between 42.8 and 59.9 with mostly preserved rankings, and a new entry (Nex-N2-Pro at 41.0) appears at position 20. Further down, KAT-Coder-Pro V1 jumped 31 positions, from 83 to 52, on a 6.3-point improvement from 28.3 to 34.6.

The discrepancy between the two benchmarks is notable. The entries that rank highest on SWE-rebench (gpt-5.5-xhigh, Junie) do not dominate Artificial Analysis, where Claude Fable 5 leads without appearing in the SWE-rebench top 10. This suggests that the two metrics capture different problem-solving dimensions, or that the evaluation methodologies diverge in what they reward. Neither benchmark shows the compression or volatility typical of immature measurement systems, which indicates that both have stabilized around consistent model orderings. Without methodological detail, however, it is not possible to assess whether either captures real capability differences or primarily reflects training data overlap.
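The rank and score movements quoted above are simple differences, and a short sketch makes them easy to re-check. The `Move` class and its field names are illustrative assumptions, not part of either benchmark's tooling; the numbers are the ones cited in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """One model's change between two snapshots (hypothetical helper type)."""
    benchmark: str
    model: str
    old_rank: int
    new_rank: int
    old_score: float
    new_score: float

    @property
    def rank_gain(self) -> int:
        # Positive means the model moved up the table.
        return self.old_rank - self.new_rank

    @property
    def score_gain(self) -> float:
        # Rounded to one decimal to match the precision of the published scores.
        return round(self.new_score - self.old_score, 1)


# Figures taken from the summary text above.
moves = [
    Move("SWE-rebench", "GLM-5.1", 23, 12, 40.2, 50.7),
    Move("SWE-rebench", "GLM-4.7", 52, 17, 33.8, 38.2),
    Move("Artificial Analysis", "KAT-Coder-Pro V1", 83, 52, 28.3, 34.6),
]

for m in moves:
    print(f"{m.benchmark}: {m.model} {m.rank_gain:+d} places, {m.score_gain:+.1f} points")
```

Keeping rank and score changes as separate properties matters here because the two can diverge, as the Claude Sonnet 4.6 case shows: a rank can change while the score does not.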
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike many other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
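The description says each model is run five times to account for stochastic variance. A minimal sketch of how five runs might be summarized follows. It assumes the published score is the mean across runs, which the description does not state, and the run values are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of per-run scores."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical per-run resolved rates (percent); NOT real SWE-rebench data.
hypothetical_runs = [50.0, 52.0, 51.0, 53.0, 49.0]

avg, sem = summarize_runs(hypothetical_runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

With only five runs the standard error stays fairly wide, so score gaps of a point or less between neighbouring rows may fall within run-to-run noise.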
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 66 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 66 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 58 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 159 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 122 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 221 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 68 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 145 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 204 | $3.75 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 221 |
| 2 | Qwen3.7 Max | 204 |
| 3 | GPT-5.4 mini | 185 |
| 4 | GPT-5.4 | 159 |
| 5 | Gemini 3.1 Pro Preview | 145 |
| 6 | GPT-5.2 Codex | 139 |
| 7 | DeepSeek V4 Flash | 124 |
| 8 | GLM-5.2 | 122 |
| 9 | Nex-N2-Pro | 108 |
| 10 | GPT-5.2 | 88 |
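The speed ranking combines two steps: drop models below the minimum intelligence score, then sort by output tokens per second. The sketch below applies that to the ten rows of the composite table only, so its output omits models such as GPT-5.4 mini that appear in the speed table but not in that top 10. The function name and tuple layout are assumptions made for illustration.

```python
# (model, composite score, output tokens per second) from the composite table.
rows = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 66),
    ("GPT-5.5", 54.8, 66),
    ("Claude Opus 4.7", 53.5, 58),
    ("GPT-5.4", 51.4, 159),
    ("GLM-5.2", 51.1, 122),
    ("Gemini 3.5 Flash", 50.2, 221),
    ("Claude Sonnet 4.6", 47.2, 68),
    ("Gemini 3.1 Pro Preview", 46.5, 145),
    ("Qwen3.7 Max", 46.0, 204),
]

MIN_SCORE = 40.0  # the "minimum intelligence score" stated in the caption


def fastest(rows, min_score=MIN_SCORE, top=5):
    """Filter by minimum score, then rank by tokens per second, descending."""
    eligible = [r for r in rows if r[1] >= min_score]
    return sorted(eligible, key=lambda r: r[2], reverse=True)[:top]


for rank, (model, score, tps) in enumerate(fastest(rows), start=1):
    print(f"{rank}. {model}: {tps} tok/s (score {score})")
```

Every row in this subset already clears the threshold, so the filter only changes the result when lower-scoring models are included in `rows`.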
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
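The blended figure can be read as a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three parts input tokens to one part output tokens; the function name and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
example = blended_price(3.00, 15.00)
print(f"${example:.2f} per 1M blended tokens")  # (3*3 + 1*15) / 4 = 6.00
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure, which is worth keeping in mind for output-heavy workloads.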