On SWE-rebench, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds first place at 62.7% and the next five positions are unchanged. The meaningful movement is further down. Gemini 3.1 Pro Preview dropped from 57.2 to 51.1 on Artificial Analysis (down six positions), Gemini 3.5 Flash fell from 55.3% to 49.5% on SWE-rebench and from 55.3 to 50.2 on Artificial Analysis, and Kimi K2.6 declined from 53.9 to 46.5 on Artificial Analysis while holding steady on SWE-rebench. GLM-4.7 went the other way, improving from 42.1 to 50.7 on Artificial Analysis and moving into the top 20, though it remains at 38.2% on SWE-rebench.

The Artificial Analysis leaderboard shows broader volatility: Claude Fable 5 dropped from 64.9 to 59.9, Claude Opus 4.8 fell from 61.4 to 55.7, and GPT-5.5 declined from 60.2 to 54.8. Losses of five to six points at the very top of the index suggest either a recalibration of the benchmark methodology or a systematic change in evaluation conditions. Lower-ranked models show the largest point losses across both benchmarks, with many models in the 100-200 rank range losing 5-8 points. That raises the question of whether the drops reflect actual model degradation, benchmark recalibration, or environmental factors such as inference conditions that affect consistency.

SWE-rebench scores remain tighter and more stable than those on Artificial Analysis, which could indicate either greater robustness in that benchmark's methodology or a narrower evaluation scope that leaves less room for variance. The two leaderboards measure different things (a software engineering benchmark versus a composite index), and without clarity on how their task sets and evaluation protocols differ, the divergence makes it difficult to tell genuine capability shifts from measurement artifacts.

Cole Brennan

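
The point changes quoted in the summary can be checked with a few lines of arithmetic. The sketch below uses only the three Artificial Analysis before/after pairs stated above; the `snapshots` structure and the `score_deltas` helper are illustrative names, not part of either leaderboard's tooling.

```python
# Previous and current Artificial Analysis scores quoted in the summary above.
snapshots = {
    "Claude Fable 5": (64.9, 59.9),
    "Claude Opus 4.8": (61.4, 55.7),
    "GPT-5.5": (60.2, 54.8),
}

def score_deltas(snapshots):
    """Map each model to its change in index points, rounded to one decimal."""
    return {model: round(cur - prev, 1) for model, (prev, cur) in snapshots.items()}

deltas = score_deltas(snapshots)
for model, delta in deltas.items():
    # The explicit sign makes gains and losses easy to scan.
    print(f"{model}: {delta:+.1f}")
```

All three leaders lose between five and six points, which is the pattern the summary reads as a possible recalibration rather than three independent regressions.
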
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
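
Running each model five times makes it possible to report a spread alongside the score. The sketch below is one way to summarize repeated runs, assuming the headline number is a simple mean across runs (the source does not confirm how SWE-rebench aggregates). The run values are hypothetical and are not SWE-rebench data.

```python
import statistics

def summarize_runs(run_scores):
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    mean = statistics.mean(run_scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, sem

# Hypothetical resolved rates (%) from five runs of a single model.
hypothetical_runs = [48.0, 50.0, 49.0, 51.0, 52.0]
mean, sem = summarize_runs(hypothetical_runs)
print(f"{mean:.1f}% +/- {sem:.1f}")
```

With a spread like this, two models whose means differ by less than a point would not be clearly separable, which is worth keeping in mind when reading adjacent ranks.
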

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 68 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 67 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 57 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 191 | $5.63 |
| 6 | Gemini 3.5 Flash | 50.2 | 212 | $3.38 |
| 7 | Claude Sonnet 4.6 | 47.2 | 62 | $6.00 |
| 8 | Gemini 3.1 Pro Preview | 46.5 | 133 | $4.50 |
| 9 | Qwen3.7 Max | 46.0 | 187 | $3.75 |
| 10 | MiniMax-M3 | 44.4 | 57 | $0.525 |

Output tokens per second (higher is faster). Minimum intelligence score of 40.

| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 212 |
| 2 | GPT-5.4 | 191 |
| 3 | Qwen3.7 Max | 187 |
| 4 | GPT-5.4 mini | 187 |
| 5 | GPT-5.2 Codex | 137 |
| 6 | Gemini 3.1 Pro Preview | 133 |
| 7 | DeepSeek V4 Flash | 108 |
| 8 | GPT-5.3 Codex | 99 |
| 9 | DeepSeek V4 Pro | 84 |
| 10 | GPT-5.2 | 80 |
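
The speed ranking is a filter followed by a sort: keep models at or above the minimum intelligence score, then order by output tokens per second. The sketch below applies that to the ten rows of the composite table above. Because it sees only those ten rows, it cannot reproduce entries such as GPT-5.4 mini that appear in the full speed list; `fastest` and `MIN_SCORE` are illustrative names.

```python
# (model, composite score, output tok/s) copied from the composite table above.
rows = [
    ("Claude Fable 5", 59.9, 0),  # listed as 0 tok/s in the table
    ("Claude Opus 4.8", 55.7, 68),
    ("GPT-5.5", 54.8, 67),
    ("Claude Opus 4.7", 53.5, 57),
    ("GPT-5.4", 51.4, 191),
    ("Gemini 3.5 Flash", 50.2, 212),
    ("Claude Sonnet 4.6", 47.2, 62),
    ("Gemini 3.1 Pro Preview", 46.5, 133),
    ("Qwen3.7 Max", 46.0, 187),
    ("MiniMax-M3", 44.4, 57),
]
MIN_SCORE = 40

def fastest(rows, min_score, top_n=3):
    """Keep rows at or above min_score, then sort by tok/s, fastest first."""
    eligible = [r for r in rows if r[1] >= min_score]
    return sorted(eligible, key=lambda r: r[2], reverse=True)[:top_n]

top_three = [name for name, _, _ in fastest(rows, MIN_SCORE)]
print(top_three)
```
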

Blended cost per 1M tokens (3:1 input/output); lower is cheaper. Minimum intelligence score of 40.

| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | GLM-5.1 | $2.15 |
| 10 | Qwen3.6 Max Preview | $2.92 |
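
A blended price folds separate input and output rates into one figure. The sketch below assumes that "3:1 input/output" means three input tokens are weighted for every output token, so the blend is a weighted average with weights 3 and 1. That reading is an assumption, and the prices in the example are hypothetical rather than taken from any listed model.

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight

# Hypothetical prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens.
example = blended_cost(2.00, 8.00)
print(f"${example:.2f} per 1M tokens")
```

Under this weighting the input price dominates the blend, so a model with expensive output can still look cheap here if its input rate is low.
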