On SWE-rebench, the top tier remains stable: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie 61.6%, and Codex 60.4%. The middle ranks are less settled. Claude Sonnet 4.6 gained 4.1 percentage points (47.2 to 51.3) yet slipped from #8 to #10, and Gemini 3.1 Pro Preview gained 4.6 points (46.5 to 51.1) while moving from #9 to #11. A higher score paired with a lower rank suggests test set changes or evaluation methodology shifts rather than architectural improvements alone. The starting ranks and scores cited for these two models, and for Gemini 3.5 Flash, match their Artificial Analysis entries (#8 at 47.2, #9 at 46.5, #7 at 50.2), so the comparisons may span two differently scaled benchmarks rather than two SWE-rebench runs.

GLM-5.1's jump from #23 to #12 is the most dramatic repositioning, a rise of 10.5 points from 40.2 to 50.7. It warrants scrutiny: either the model underwent substantial retraining or the benchmark's coding task distribution shifted to favor its strengths. Conversely, Gemini 3.5 Flash dropped from #7 to #13 on a score decline of only 0.7 points (50.2 to 49.5), a six-place fall that likely reflects tight clustering in this performance band. GLM-4.7 showed the largest absolute gain in the lower ranks, rising 4.4 points from 33.8 to 38.2, though it remains at #17 on SWE-rebench.

The Artificial Analysis index, a composite across coding, math, and reasoning with broader model coverage, presents a different ordering. Claude Fable 5 leads at 59.9, and Claude Opus 4.8 (55.7) sits above GPT-5.5 (54.8), whereas on SWE-rebench gpt-5.5-2026-04-23-xhigh (62.7%) leads Claude Opus 4.8-xhigh (56.5%). The reversal suggests the two benchmarks weight different competencies or test different problem classes.

Without fuller disclosure of the evaluation methodology, task composition, test set overlap, execution environment, or whether SWE-rebench underwent revision, it is unclear whether these shifts reflect genuine capability differences or benchmark drift. The GPT-5.5 and Claude Opus families sit near the top of both leaderboards, which provides some confidence in the top tier, but the volatility in the middle ranks indicates either genuine model differentiation in narrow domains or measurement sensitivity that limits strong inference.
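The rank and score movements cited in the commentary above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of either benchmark's tooling: the `Snapshot` type and the `describe` and `diverges` helpers are hypothetical names introduced here, and the figures are copied from the commentary as stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    rank: int     # leaderboard position (1 is best)
    score: float  # reported score


# (before, after) pairs exactly as cited in the commentary.
MOVES = {
    "Claude Sonnet 4.6": (Snapshot(8, 47.2), Snapshot(10, 51.3)),
    "Gemini 3.1 Pro Preview": (Snapshot(9, 46.5), Snapshot(11, 51.1)),
    "GLM-5.1": (Snapshot(23, 40.2), Snapshot(12, 50.7)),
    "Gemini 3.5 Flash": (Snapshot(7, 50.2), Snapshot(13, 49.5)),
}


def describe(before: Snapshot, after: Snapshot) -> tuple[int, float]:
    """Return (places gained, score delta); positive places means moving up."""
    places = before.rank - after.rank
    delta = round(after.score - before.score, 1)
    return places, delta


def diverges(places: int, delta: float) -> bool:
    """True when rank and score moved in opposite directions."""
    return places * delta < 0


for model, (before, after) in MOVES.items():
    places, delta = describe(before, after)
    flag = " (rank and score disagree)" if diverges(places, delta) else ""
    print(f"{model}: {places:+d} places, {delta:+.1f} points{flag}")
```

Under these inputs, Claude Sonnet 4.6 and Gemini 3.1 Pro Preview are the two entries flagged as moving down in rank while gaining points, which is the pattern the commentary treats as a sign of benchmark change rather than model change.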
Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |

Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 69 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 63 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 53 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 165 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 94 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 244 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 69 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 138 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 200 | $3.75 |

Output tokens per second (higher is faster). Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 244 |
| 2 | Qwen3.7 Max | 200 |
| 3 | GPT-5.4 mini | 193 |
| 4 | GPT-5.4 | 165 |
| 5 | GPT-5.2 Codex | 145 |
| 6 | Gemini 3.1 Pro Preview | 138 |
| 7 | DeepSeek V4 Flash | 110 |
| 8 | GPT-5.3 Codex | 107 |
| 9 | GLM-5.1 | 106 |
| 10 | DeepSeek V4 Pro | 103 |

Blended cost per 1M tokens (3:1 input/output); lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |
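
The "3:1 input/output" blend in the cost table can be read as a weighted average of the input and output prices, with three parts input to one part output. The sketch below assumes that reading. The `blended_price` function is a hypothetical helper, and the prices in the example are made-up illustrative values, not figures for any listed model.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: int = 3,
    output_weight: int = 1,
) -> float:
    """Weighted-average price per 1M tokens (assumed 3:1 input/output blend)."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices in dollars per 1M tokens, chosen only for illustration.
example = blended_price(input_price=1.00, output_price=4.00)
print(f"${example:.2f}")  # (3 * 1.00 + 1 * 4.00) / 4 = 1.75
```

Because input tokens carry three times the weight, a model with cheap input but expensive output can still rank well on this measure, which is worth keeping in mind for output-heavy workloads.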