The SWE-rebench rankings show minimal movement at the top tier, with gpt-5.5-2026-04-23-xhigh holding first place at 62.7% and Junie second at 61.6%. The more revealing pattern is how far mid-table models diverge between SWE-rebench and the Artificial Analysis composite index. Gemini 3.1 Pro Preview scores 57.2 on Artificial Analysis, good for fifth place, but 51.1% on SWE-rebench, where it sits eleventh. Gemini 3.5 Flash goes from 55.3 and eighth place to 49.5% and thirteenth, and Kimi K2.6 from 53.9 and tenth place to 46.5% and fifteenth. GLM-4.7 follows the same pattern, at 42.1 on Artificial Analysis against 38.2% on SWE-rebench, which raises the question of whether the two scores measure comparable problem-solving capabilities at all. Claude Sonnet 4.6 runs the other way: it ranks tenth on SWE-rebench at 51.3% against eighteenth on Artificial Analysis, with its two scores within about 0.4 points of each other, so its higher placement appears to come from other models scoring lower on SWE-rebench rather than from a stronger result of its own.

These gaps are not like-for-like. Artificial Analysis is a composite index across coding, math, and reasoning benchmarks, while SWE-rebench measures real-world software engineering tasks under a standardized scaffold. The divergence across the same models underscores that coding ability assessments depend heavily on test selection and evaluation methodology, making absolute rankings less informative than the specific gaps they reveal about model strengths on particular problem classes.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike many other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |
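
The five-run protocol implies that each model's results are aggregated in some way. The sketch below shows one plausible way to summarize repeated runs, as a mean with a standard error. The per-run values are made-up placeholders, the function name `summarize_runs` is hypothetical, and SWE-rebench's actual aggregation method is not stated here.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run resolved rates in percent."""
    n = len(resolved_rates)
    avg = mean(resolved_rates)
    # Sample standard deviation needs at least two runs.
    sem = stdev(resolved_rates) / sqrt(n) if n > 1 else 0.0
    return avg, sem


# Hypothetical scores from five runs of one model (not leaderboard data).
runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.1f}")  # prints "51.0% +/- 0.7"
```

With a standard error of this size, two models whose means differ by a few tenths of a point would not be clearly separated, which is worth keeping in mind when reading adjacent rows such as ranks 8 and 9.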
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 64.9 | 79 | $20.00 |
| 2 | Claude Opus 4.8 | 61.4 | 66 | $10.00 |
| 3 | GPT-5.5 | 60.2 | 78 | $11.25 |
| 4 | Claude Opus 4.7 | 57.3 | 58 | $10.00 |
| 5 | Gemini 3.1 Pro Preview | 57.2 | 142 | $4.50 |
| 6 | GPT-5.4 | 56.8 | 203 | $5.63 |
| 7 | Qwen3.7 Max | 56.6 | 199 | $3.75 |
| 8 | Gemini 3.5 Flash | 55.3 | 227 | $3.38 |
| 9 | MiniMax-M3 | 54.7 | 59 | $0.525 |
| 10 | Kimi K2.6 | 53.9 | 46 | $1.71 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Step 3.7 Flash | 407 |
| 2 | MiniMax-M2.5 | 249 |
| 3 | Gemini 3.5 Flash | 227 |
| 4 | Gemini 3 Flash Preview | 226 |
| 5 | Grok 4.20 0309 v2 | 221 |
| 6 | GPT-5.1 Codex | 218 |
| 7 | Grok 4.20 0309 | 213 |
| 8 | GPT-5.4 | 203 |
| 9 | Qwen3.7 Max | 199 |
| 10 | GPT-5 Codex | 198 |
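
The speed ranking can be read as a filter followed by a sort. The sketch below applies that to a few rows copied from the composite-index table, assuming the "intelligence score" threshold refers to that index. The function name `fastest` and the final row are hypothetical and are included only to show the filter excluding a model.

```python
# (model, composite index score, output tok/s) from the composite-index table.
rows = [
    ("Gemini 3.1 Pro Preview", 57.2, 142),
    ("GPT-5.4", 56.8, 203),
    ("Qwen3.7 Max", 56.6, 199),
    ("Gemini 3.5 Flash", 55.3, 227),
    ("Kimi K2.6", 53.9, 46),
    # Hypothetical model below the threshold; not a real leaderboard entry.
    ("hypothetical-fast-model", 35.0, 500),
]

MIN_SCORE = 40  # minimum intelligence score stated in the caption


def fastest(rows, min_score=MIN_SCORE, top_n=3):
    """Keep models at or above min_score, then rank by tok/s, fastest first."""
    eligible = [r for r in rows if r[1] >= min_score]
    return sorted(eligible, key=lambda r: r[2], reverse=True)[:top_n]


for name, score, tps in fastest(rows):
    print(f"{name}: {tps} tok/s (score {score})")
```

The hypothetical 500 tok/s model never appears in the output because it fails the score threshold before the sort is applied.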
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | DeepSeek V4 Flash | $0.175 |
| 4 | Hy3-preview | $0.20 |
| 5 | DeepSeek V3.2 | $0.337 |
| 6 | Step 3.7 Flash | $0.438 |
| 7 | GPT-5.4 nano | $0.463 |
| 8 | MiniMax-M3 | $0.525 |
| 9 | MiniMax-M2.7 | $0.525 |
| 10 | KAT Coder Pro V2 | $0.525 |
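
For readers who want to reproduce a blended figure, the sketch below assumes "3:1 input/output" means a weighted average of three parts input price to one part output price. The function name `blended_price` and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_per_m, output_per_m, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
print(f"${blended_price(1.00, 4.00):.2f}")  # prints "$1.75"
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still rank well on blended cost.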