The SWE-rebench rankings remain stable at the top, with gpt-5.5-2026-04-23-xhigh holding first place at 62.7%, followed by Codex at 60.4% and Claude Code at 59.6%. On Artificial Analysis, Claude Opus 4.8 leads at 61.4 and GPT-5.5 follows at 60.2. Below the top five, Qwen3.7 Max appears in sixth place (56.6), and older or specialized GPT models such as GPT-5.3 Codex (tenth, 53.6) sit further down.

The two leaderboards disagree most in the middle ranks. Gemini 3.1 Pro Preview is fourth on Artificial Analysis (57.2) but tenth on SWE-rebench (51.1%), and Kimi K2.6 is eighth on Artificial Analysis (53.9) but thirteenth on SWE-rebench (46.5%). The 6.1-point and 7.4-point gaps compare scores from two different benchmarks, so they should not be read as regressions over time or as evidence of a protocol change. GLM-5.1 moves the other way: eleventh on SWE-rebench at 50.7% and eighteenth on Artificial Analysis at 51.4. GLM-4.7 sits near the bottom of both lists, at forty-eighth and forty-ninth place with scores of 38.2 and 42.1, so a 3.9-point difference in score shifts its position by only one place.

These divergences suggest the benchmarks do not measure identical capabilities, likely because SWE-rebench emphasizes repository-level problem solving while Artificial Analysis is a composite across coding, math, and reasoning. Without a prior Artificial Analysis snapshot, the stability of that leaderboard's top positions cannot be confirmed, and new entries such as Qwen3.7 Plus at eleventh place indicate that the set of ranked models is still expanding.

Cole Brennan

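
The cross-benchmark comparison in the summary can be made concrete with a few lines of Python. This is a minimal sketch, not part of either leaderboard's tooling: the `rank_gaps` helper is a hypothetical name, and the ranks are the ones quoted in this digest for three mid-tier models.

```python
# (model, Artificial Analysis rank, SWE-rebench rank), as quoted in this digest.
ranks = [
    ("Gemini 3.1 Pro Preview", 4, 10),
    ("Kimi K2.6", 8, 13),
    ("GLM-5.1", 18, 11),
]


def rank_gaps(rows):
    """Return (model, gap) pairs, where gap = SWE-rebench rank minus AA rank.

    A positive gap means the model places lower on SWE-rebench.
    """
    return [(model, swe - aa) for model, aa, swe in rows]


gaps = rank_gaps(ranks)

# Print the largest disagreements first, regardless of direction.
for model, gap in sorted(gaps, key=lambda item: abs(item[1]), reverse=True):
    direction = "lower" if gap > 0 else "higher"
    print(f"{model}: {abs(gap)} places {direction} on SWE-rebench")
```

Comparing rank positions rather than raw scores sidesteps the fact that the two benchmarks report on different scales.
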
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
| 6 | gpt-5.4-2026-03-05-medium | 54.9% |
| 7 | Claude Opus 4.7-high | 53.1% |
| 8 | Cursor | 53.0% |
| 9 | Claude Sonnet 4.6-high | 51.3% |
| 10 | Gemini 3.1 Pro Preview | 51.1% |
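
To illustrate why repeated runs matter, the sketch below averages five per-run scores and reports a standard error. It assumes the published score is a simple mean over runs, which this digest does not confirm; `summarize_runs` is a hypothetical helper and the per-run values are made up for illustration, not taken from SWE-rebench.

```python
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for per-run scores in percent."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / n ** 0.5
    return avg, sem


# Hypothetical per-run scores for a single model (illustrative only).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

When two models are separated by less than their combined run-to-run spread, as with adjacent entries a few tenths of a point apart, their relative order is less certain than the table alone suggests.
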

Artificial Analysis composite index across coding, math, and reasoning benchmarks, shown with output speed (tok/s) and blended cost ($ per 1M tokens).

| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 59 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 67 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 53 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 123 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 79 | $5.63 |
| 6 | Qwen3.7 Max | 56.6 | 198 | $3.75 |
| 7 | Gemini 3.5 Flash | 55.3 | 216 | $3.38 |
| 8 | Kimi K2.6 | 53.9 | 39 | $1.71 |
| 9 | MiMo-V2.5-Pro | 53.8 | 46 | $0.544 |
| 10 | GPT-5.3 Codex | 53.6 | 74 | $4.81 |

Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.

| # | Model | tok/s |
|---|---|---|
| 1 | Step 3.7 Flash | 402 |
| 2 | Gemini 3.5 Flash | 216 |
| 3 | MiniMax-M2.5 | 200 |
| 4 | Qwen3.7 Max | 198 |
| 5 | Grok 4.20 0309 v2 | 187 |
| 6 | Gemini 3 Flash Preview | 180 |
| 7 | GPT-5.1 Codex | 175 |
| 8 | GPT-5 Codex | 173 |
| 9 | Grok 4.20 0309 | 166 |
| 10 | Qwen3.6 35B A3B | 162 |
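
The speed ranking applies a score threshold before sorting. A minimal sketch of that filter-then-sort step follows, using a few rows from the composite index table in this digest plus one clearly hypothetical entry that falls below the threshold. The `fastest` function and the `example-fast-model` row are assumptions for illustration.

```python
MIN_SCORE = 40  # minimum intelligence score stated for the speed ranking

# (model, composite score, output tokens per second)
models = [
    ("Claude Opus 4.8", 61.4, 59),
    ("GPT-5.5", 60.2, 67),
    ("Gemini 3.1 Pro Preview", 57.2, 123),
    ("Qwen3.7 Max", 56.6, 198),
    ("Gemini 3.5 Flash", 55.3, 216),
    ("example-fast-model", 35.0, 500),  # hypothetical row, below the threshold
]


def fastest(rows, min_score=MIN_SCORE):
    """Drop rows below min_score, then sort the rest by tok/s, fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


ranked = fastest(models)
for position, (name, score, tps) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {tps} tok/s (score {score})")
```

The result differs from the speed table above because that table draws on every eligible model, not only the handful of rows used here.
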

Blended cost per 1M tokens, weighted 3:1 input to output; lower is cheaper. Only models with an intelligence score of at least 40 are included.

| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | Step 3.7 Flash | $0.438 |
| 7 | GPT-5.4 nano | $0.463 |
| 8 | MiniMax-M2.7 | $0.525 |
| 9 | KAT Coder Pro V2 | $0.525 |
| 10 | MiniMax-M2.5 | $0.525 |
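
A 3:1 blended price is most naturally read as a weighted average of input and output prices. The sketch below assumes that interpretation; `blended_cost` is a hypothetical helper, and the prices passed to it are illustrative rather than any listed model's actual rates.

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Illustrative prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
print(blended_cost(1.00, 5.00))  # 2.0
```

Because input tokens carry three times the weight, a model with expensive output but cheap input can still rank well on this measure.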