The SWE-rebench rankings have been entirely replaced with a fresh cohort of model and agent entries. OpenAI's gpt-5.5-2026-04-23-xhigh leads at 62.7% ± 0.91%, followed by the Junie agent at 61.6% and the OpenAI Codex agent at 60.4%, while Claude Fable 5 tops the Artificial Analysis index at 59.9. The new SWE-rebench lineup differs markedly from the previous leadership structure, where Claude Fable 5 held first place.

The SWE-rebench methodology appears to favor agentic systems and configurations with explicit effort settings such as "xhigh" and "medium." Its absolute scores also run higher than the Artificial Analysis figures for comparable models (62.7% for gpt-5.5 at xhigh versus 54.8 for GPT-5.5), which suggests the two benchmarks measure different aspects of coding capability or use divergent evaluation protocols. All seventeen prior SWE-rebench entries have been dropped, which points to a benchmark refresh, a shift in how systems are tested, or a change in what qualifies for inclusion. The new cohort's reported margins range from ± 0.53% to ± 1.98%, tighter than one might expect given the diversity of approaches.

The Artificial Analysis ranking remains largely stable in its upper tiers, with minor reordering and Devstral 2 moving up from position 165 to 140, but without the wholesale replacement seen in SWE-rebench. The two evaluation frameworks therefore seem to operate on different cadences or criteria. Without prior SWE-rebench data for the new entries, it is not possible to determine whether these scores represent genuine capability gains or simply reflect how agent-based systems perform on this benchmark relative to the model-only systems previously ranked.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | System | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
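
The ± figures matter when reading adjacent ranks. The sketch below shows one way a score and margin could be derived from five runs, and how to check whether two entries' ranges overlap. It assumes the margin is a standard error of the mean, which the leaderboard description does not state; the per-run values are hypothetical, and the function names are illustrative only.

```python
import math
import statistics

def mean_and_sem(run_scores):
    """Return (mean, standard error of the mean) for per-run scores."""
    mean = statistics.mean(run_scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, sem

def intervals_overlap(score_a, margin_a, score_b, margin_b):
    """True if [a - margin_a, a + margin_a] intersects [b - margin_b, b + margin_b]."""
    return (score_a - margin_a) <= (score_b + margin_b) and \
           (score_b - margin_b) <= (score_a + margin_a)

# Hypothetical per-run scores (percent) for one system; not real data.
runs = [61.0, 63.5, 62.0, 64.0, 63.0]
mean, sem = mean_and_sem(runs)
print(f"{mean:.1f}% ± {sem:.2f}%")

# Scores and margins taken from the table above.
rank1_vs_rank2 = intervals_overlap(62.7, 0.91, 61.6, 0.64)
rank1_vs_rank5 = intervals_overlap(62.7, 0.91, 58.9, 0.78)
print(rank1_vs_rank2, rank1_vs_rank5)
```

Under this reading, ranks 1 and 2 have overlapping ranges, so their order is less certain than the gap between ranks 1 and 5.
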
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 65 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 73 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 55 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 150 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 123 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 213 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 67 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 136 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 203 | $3.75 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 213 |
| 2 | Qwen3.7 Max | 203 |
| 3 | GPT-5.4 mini | 176 |
| 4 | GPT-5.4 | 150 |
| 5 | GPT-5.2 Codex | 139 |
| 6 | Gemini 3.1 Pro Preview | 136 |
| 7 | GLM-5.2 | 123 |
| 8 | Nex-N2-Pro | 117 |
| 9 | DeepSeek V4 Flash | 113 |
| 10 | MiMo-V2.5 | 84 |
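
A speed ranking with a score floor amounts to a filter followed by a sort. The sketch below applies that to the ten rows of the composite index table only, so its output will not match the full speed table, which draws on models outside that top ten. The `Entry` type and `fastest` function are hypothetical names used for illustration.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    model: str
    score: float  # composite index score
    tok_s: int    # output tokens per second

# Rows copied from the composite index table (top 10 only).
ENTRIES = [
    Entry("Claude Fable 5", 59.9, 0),
    Entry("Claude Opus 4.8", 55.7, 65),
    Entry("GPT-5.5", 54.8, 73),
    Entry("Claude Opus 4.7", 53.5, 55),
    Entry("GPT-5.4", 51.4, 150),
    Entry("GLM-5.2", 51.1, 123),
    Entry("Gemini 3.5 Flash", 50.2, 213),
    Entry("Claude Sonnet 4.6", 47.2, 67),
    Entry("Gemini 3.1 Pro Preview", 46.5, 136),
    Entry("Qwen3.7 Max", 46.0, 203),
]

def fastest(entries, min_score=40.0, limit=10):
    """Keep entries at or above min_score, then sort by tok/s, fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tok_s, reverse=True)[:limit]

top = fastest(ENTRIES, limit=3)
for rank, entry in enumerate(top, start=1):
    print(rank, entry.model, entry.tok_s)
```

Raising `min_score` shrinks the eligible pool before sorting, which is why a fast but lower-scoring model can drop out of the list entirely.
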
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
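
The blended figure can be reproduced from separate input and output prices. The sketch below assumes "3:1 input/output" means a weighted average with three parts input to one part output; the source does not spell out the formula, and the prices used are hypothetical.

```python
def blended_price(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted average price per 1M tokens, assuming a 3:1 input/output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight

# Hypothetical prices in USD per 1M tokens; not taken from any provider.
example = blended_price(2.00, 8.00)
print(f"${example:.2f}")  # (3 * 2.00 + 1 * 8.00) / 4 = 3.50
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well on this metric.
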