The SWE-rebench rankings show no movement from the previous report. The top tier of the 24-entry list is unchanged: GPT-5.5-xhigh leads at 62.7%, followed by the Junie agent at 61.6% and the Codex agent at 60.4%.

The Artificial Analysis index, by contrast, shows substantial churn across its 398-entry ranking, although its top tier is also stable: Claude Fable 5 holds the lead at 59.9, followed by Claude Opus 4.8 at 55.7 and GPT-5.5 at 54.8. Below the top, the ordering has shifted. GPT-5 mini dropped from #65 at 33.0 to #72 at 30.9, a loss of 2.1 points and seven positions; Mistral Small 4 fell from #126 at 20.8 to #132 at 19.6, a loss of 1.2 points and six positions; and Qwen3 Next 80B A3B fell from #134 at 19.8 to #159 at 16.7, a loss of 3.1 points and 25 positions. That a 3.1-point decline costs 25 places reflects how tightly models cluster in the 16-20 point band. The declines themselves suggest either a methodological revision or genuine performance variance.

SWE-rebench's stability raises the question of whether its agentic evaluation is less sensitive to model updates than the Artificial Analysis index, or whether the coding agents themselves have stabilized while the underlying base models continue to diverge. The mid-range instability in Artificial Analysis, where confidence intervals would plausibly overlap, warrants scrutiny of whether those score differences exceed measurement error. Without published confidence bounds for that benchmark, the ranking shifts read as plausible but not necessarily meaningful.
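
The movements cited above reduce to simple differences between the previous and current reports. The sketch below recomputes them from the figures quoted in the text; the `Movement` class and `summarize` helper are hypothetical names introduced here for illustration, not part of either benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movement:
    """One model's position in the previous and current report (hypothetical helper)."""
    model: str
    prev_rank: int
    prev_score: float
    curr_rank: int
    curr_score: float

    @property
    def rank_drop(self) -> int:
        # Positive means the model moved down the list.
        return self.curr_rank - self.prev_rank

    @property
    def score_change(self) -> float:
        # Rounded to one decimal, the precision the index reports.
        return round(self.curr_score - self.prev_score, 1)


# Figures as quoted in the summary paragraph.
MOVES = [
    Movement("GPT-5 mini", 65, 33.0, 72, 30.9),
    Movement("Mistral Small 4", 126, 20.8, 132, 19.6),
    Movement("Qwen3 Next 80B A3B", 134, 19.8, 159, 16.7),
]


def summarize(moves):
    """Return (model, positions lost, score change) for each movement."""
    return [(m.model, m.rank_drop, m.score_change) for m in moves]


if __name__ == "__main__":
    for model, rank_drop, score_change in summarize(MOVES):
        print(f"{model}: {score_change:+.1f} points, {rank_drop:+d} positions")
```

Comparing positions lost against points lost makes the clustering visible: Qwen3 Next 80B A3B loses roughly eight positions per point, while GPT-5 mini loses closer to three.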
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | Name | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
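
Because SWE-rebench lists a ± value with each score, one can check which neighbouring ranks are actually separated. The sketch below is a rough illustration only: it assumes the ± value can be treated as the half-width of an interval around the score, which the source does not state, and the function names are hypothetical.

```python
# (rank, name, score %, reported ± %) copied from the SWE-rebench table above.
ENTRIES = [
    (1, "gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    (2, "Junie", 61.6, 0.64),
    (3, "Codex", 60.4, 1.37),
    (4, "Claude Code", 59.6, 1.98),
    (5, "gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    (6, "Claude Opus 4.8-xhigh", 56.5, 1.20),
    (7, "gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    (8, "Claude Opus 4.7-high", 53.1, 1.45),
    (9, "Cursor", 53.0, 0.53),
    (10, "Claude Sonnet 4.6", 51.3, 0.55),
]


def interval(score, margin):
    """Return (low, high), treating ± as an interval half-width (an assumption)."""
    return (score - margin, score + margin)


def separated_adjacent_pairs(entries):
    """Return rank pairs (higher, lower) whose intervals do not overlap."""
    separated = []
    for higher, lower in zip(entries, entries[1:]):
        low_of_higher, _ = interval(higher[2], higher[3])
        _, high_of_lower = interval(lower[2], lower[3])
        if low_of_higher > high_of_lower:
            separated.append((higher[0], lower[0]))
    return separated


if __name__ == "__main__":
    print(separated_adjacent_pairs(ENTRIES))
```

Under that assumption, only the gaps between ranks 5 and 6 and between ranks 9 and 10 are clean; every other adjacent pair in the top ten overlaps, so an unchanged ordering there is not strong evidence of a settled ranking.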
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Index score | Output tok/s | $/1M tokens |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 64 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 65 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 84 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 50 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 87 | $6.00 |
| 6 | GPT-5.4 | 51.4 | 166 | $5.63 |
| 7 | GLM-5.2 | 51.1 | 181 | $2.15 |
| 8 | Gemini 3.5 Flash | 50.2 | 210 | $3.38 |
| 9 | Claude Sonnet 4.6 | 47.2 | 69 | $6.00 |
| 10 | Gemini 3.1 Pro Preview | 46.5 | 136 | $4.50 |
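
Since the table reports score, speed, and price side by side, one way to read it is to ask which entries are not beaten on all three at once. The sketch below applies a standard Pareto-dominance filter to the ten rows above; this is an illustrative reading added here, not a method used by Artificial Analysis, and the `Entry`, `dominates`, and `pareto_frontier` names are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float      # composite index, higher is better
    tok_s: float      # output tokens per second, higher is better
    usd_per_m: float  # price per 1M tokens, lower is better


# Rows copied from the table above.
TOP10 = [
    Entry("Claude Fable 5", 59.9, 64, 20.00),
    Entry("Claude Opus 4.8", 55.7, 65, 10.00),
    Entry("GPT-5.5", 54.8, 84, 11.25),
    Entry("Claude Opus 4.7", 53.5, 50, 10.00),
    Entry("Claude Sonnet 5", 53.4, 87, 6.00),
    Entry("GPT-5.4", 51.4, 166, 5.63),
    Entry("GLM-5.2", 51.1, 181, 2.15),
    Entry("Gemini 3.5 Flash", 50.2, 210, 3.38),
    Entry("Claude Sonnet 4.6", 47.2, 69, 6.00),
    Entry("Gemini 3.1 Pro Preview", 46.5, 136, 4.50),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on every axis and better on one."""
    no_worse = (a.score >= b.score and a.tok_s >= b.tok_s
                and a.usd_per_m <= b.usd_per_m)
    better = (a.score > b.score or a.tok_s > b.tok_s
              or a.usd_per_m < b.usd_per_m)
    return no_worse and better


def pareto_frontier(entries):
    """Keep entries that no other entry dominates, preserving table order."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


if __name__ == "__main__":
    for e in pareto_frontier(TOP10):
        print(e.model)
```

Within these ten rows, three entries are dominated: Claude Opus 4.7 by Claude Opus 4.8, Claude Sonnet 4.6 by Claude Sonnet 5, and Gemini 3.1 Pro Preview by GLM-5.2. The result only covers the top ten shown here, not the full 398-entry ranking.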
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 210 |
| 2 | Qwen3.7 Max | 200 |
| 3 | GLM-5.2 | 181 |
| 4 | GPT-5.4 mini | 168 |
| 5 | GPT-5.4 | 166 |
| 6 | Gemini 3.1 Pro Preview | 136 |
| 7 | Nex-N2-Pro | 120 |
| 8 | GPT-5.2 Codex | 120 |
| 9 | MiniMax-M3 | 98 |
| 10 | DeepSeek V4 Flash | 93 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
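
The 3:1 input/output blend named in the caption can be read as a weighted average of the per-token prices. The sketch below assumes exactly that reading; the `blended_cost` function is a hypothetical helper, and the input and output prices in the usage example are made-up illustrative values, not figures from the source.

```python
def blended_cost(input_price: float, output_price: float,
                 input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens.

    Assumes the blend is a plain weighted mean with a 3:1 input/output
    ratio by default; prices are in USD per 1M tokens.
    """
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total_weight


if __name__ == "__main__":
    # Illustrative prices only: $5.00 per 1M input tokens, $25.00 per 1M output tokens.
    print(f"${blended_cost(5.00, 25.00):.2f}")
```

Weighting input three times as heavily as output means a model with expensive output tokens can still rank as cheap when its input price is low, which is worth keeping in mind for output-heavy workloads.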