The SWE-rebench rankings are stable at the top: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, followed by Codex at 60.4% and Claude Code at 59.6%, all unchanged from the previous cycle.

The sharper contrasts appear when the two leaderboards are set side by side. Gemini 3.1 Pro Preview ranks 4th on Artificial Analysis with 57.2 but 10th on SWE-rebench with 51.1%, a 6.1-point gap. Kimi K2.6 shows a wider one: 8th on Artificial Analysis at 53.9 against position 13 at 46.5% on SWE-rebench, a 7.4-point gap. GLM-5.1 is nearly aligned, at 50.7% on SWE-rebench and 51.4 on Artificial Analysis (position 11). GLM-4.7 diverges in the same direction as Gemini and Kimi but by less: 42.1 on Artificial Analysis (up from position 47) against 38.2% on SWE-rebench (position 14), a 3.9-point gap. Figures for positions beyond 10 fall outside the tables below.

These gaps should be read with care. SWE-rebench reports a percentage on real-world software engineering tasks, while the Artificial Analysis score is a composite index across coding, math, and reasoning, so the two numbers are not on the same scale. A lower SWE-rebench figure may therefore reflect a different problem mix or evaluation window rather than a regression, and the Gemini and Kimi results in particular warrant scrutiny on that point. The lack of movement in the top five positions on both leaderboards suggests the frontier has stabilized, while the spread among models in the 45-55% range indicates that the mid-tier is still differentiating.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
| 6 | gpt-5.4-2026-03-05-medium | 54.9% |
| 7 | Claude Opus 4.7-high | 53.1% |
| 8 | Cursor | 53.0% |
| 9 | Claude Sonnet 4.6-high | 51.3% |
| 10 | Gemini 3.1 Pro Preview | 51.1% |
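
The description above notes that each model is run five times to account for stochastic variance. As a rough illustration of how repeated runs can be summarized, the sketch below reports the mean and the standard error of the mean. The aggregation method is an assumption, not SWE-rebench's documented procedure, and the per-run numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for per-run scores in percent."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    # Sample standard deviation needs at least two runs.
    sem = stdev(resolved_rates) / sqrt(n) if n > 1 else 0.0
    return avg, sem


# Hypothetical per-run scores for one model (not real SWE-rebench data).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% ± {sem:.1f}")
```

A gap between two models that is smaller than a couple of standard errors is hard to distinguish from run-to-run noise, which is one reason to treat adjacent ranks with caution.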

Artificial Analysis composite index across coding, math, and reasoning benchmarks, shown with each model's output speed (tok/s) and blended price ($/1M tokens).
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 65 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 59 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 60 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 137 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 90 | $5.63 |
| 6 | Qwen3.7 Max | 56.6 | 188 | $3.75 |
| 7 | Gemini 3.5 Flash | 55.3 | 218 | $3.38 |
| 8 | Kimi K2.6 | 53.9 | 42 | $1.71 |
| 9 | MiMo-V2.5-Pro | 53.8 | 51 | $0.544 |
| 10 | GPT-5.3 Codex | 53.6 | 85 | $4.81 |
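
The summary quotes Gemini 3.1 Pro Preview and Kimi K2.6 at different scores on the two leaderboards. The sketch below reproduces that arithmetic from the figures given in this digest. It assumes that subtracting a SWE-rebench percentage from an Artificial Analysis index value is a meaningful rough comparison, which is debatable because the two scales measure different things.

```python
# Scores quoted in this digest: (Artificial Analysis index, SWE-rebench %).
scores = {
    "Gemini 3.1 Pro Preview": (57.2, 51.1),
    "Kimi K2.6": (53.9, 46.5),
}


def benchmark_gaps(pairs):
    """Map model name to (Artificial Analysis index - SWE-rebench %), rounded to 0.1."""
    return {model: round(aa - swe, 1) for model, (aa, swe) in pairs.items()}


gaps = benchmark_gaps(scores)
# Print the widest gap first.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:+.1f}")
```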

Output tokens per second; higher is faster. Limited to models with an intelligence score of at least 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 218 |
| 2 | Grok 4.20 0309 | 218 |
| 3 | Grok 4.20 0309 v2 | 216 |
| 4 | Gemini 3 Flash Preview | 203 |
| 5 | MiniMax-M2.5 | 191 |
| 6 | Qwen3.7 Max | 188 |
| 7 | GPT-5.4 mini | 183 |
| 8 | GPT-5.1 Codex | 182 |
| 9 | GPT-5 Codex | 170 |
| 10 | Grok 4.3 | 161 |
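
The speed ranking applies a minimum intelligence score before sorting. A minimal sketch of that filter-then-rank step follows, using a few rows from the Artificial Analysis table in this digest plus one hypothetical low-scoring model to show the cutoff. The `Entry` type and `rank_by_speed` function are illustrative names, not part of any published tooling.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float  # intelligence index
    tok_s: int    # output tokens per second


def rank_by_speed(entries, min_score=40.0):
    """Drop entries below min_score, then sort the rest fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tok_s, reverse=True)


entries = [
    Entry("Gemini 3.1 Pro Preview", 57.2, 137),
    Entry("Qwen3.7 Max", 56.6, 188),
    Entry("Gemini 3.5 Flash", 55.3, 218),
    Entry("Kimi K2.6", 53.9, 42),
    Entry("hypothetical-fast-model", 25.0, 400),  # invented row, excluded by the cutoff
]

ranked = rank_by_speed(entries)
for position, e in enumerate(ranked, start=1):
    print(position, e.model, e.tok_s)
```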

Blended cost per 1M tokens, weighted 3:1 input to output; lower is cheaper. Limited to models with an intelligence score of at least 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | GPT-5.4 nano | $0.463 |
| 7 | MiniMax-M2.7 | $0.525 |
| 8 | KAT Coder Pro V2 | $0.525 |
| 9 | MiniMax-M2.5 | $0.525 |
| 10 | MiMo-V2.5-Pro | $0.544 |
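
The blended figure combines input and output prices at a 3:1 ratio. Assuming this means a weighted average with three parts input to one part output, the calculation can be sketched as below. The per-token prices in the example are hypothetical and do not belong to any model in the table.

```python
def blended_price(input_per_m, output_per_m, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_per_m + output_parts * output_per_m) / total_parts


# Hypothetical prices in USD per 1M tokens: $2.00 input, $12.00 output.
print(f"${blended_price(2.00, 12.00):.2f}")
```

Because output tokens usually cost more than input tokens, a workload that generates more output than the 3:1 mix assumes will cost more than the blended figure suggests.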