The SWE-rebench rankings show no movement from the previous snapshot. OpenAI's gpt-5.5-2026-04-23-xhigh holds first at 62.7% ± 0.91%, followed by the Junie agent at 61.6% ± 0.64% and OpenAI's Codex agent at 60.4% ± 1.37%; the 17 entries behind them are likewise unchanged in both position and score. The reported margins are narrow, but the ranges for first and second place overlap (61.79% to 63.61% against 60.96% to 62.24%), as do those for second and third, so the order among adjacent leaders is not clearly separated from run-to-run noise. The gap between the xhigh configuration of gpt-5.5 and its medium variant at position five (58.9% ± 0.78%) is a clearer separation: the two ranges do not overlap.

Artificial Analysis presents a different ranking, with Claude Fable 5 leading at 59.9, followed by Claude Opus 4.8 at 55.7 and GPT-5.5 at 54.8. SWE-rebench's top five is made up of agent products and OpenAI models, while Artificial Analysis places Claude models first, second, and fourth. This divergence likely reflects methodological differences. SWE-rebench measures software engineering tasks and lists agent products alongside bare models, which may reward systems optimized for tool use and iterative problem-solving, whereas the Artificial Analysis index is a composite across coding, math, and reasoning benchmarks.

The stability of SWE-rebench across this snapshot suggests either that the frontier has plateaued for the moment or that updates are not yet reflected in the data. Without historical comparison points, it remains unclear whether this stasis is typical or reflects genuine convergence in model performance on the task.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | System | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
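
One rough way to judge whether a gap in the table above exceeds the reported margins is to check whether the two score ± margin ranges overlap. The sketch below is a minimal illustration, assuming the ± values can be treated as symmetric margins around each score; the `Entry` class and `ranges_overlap` helper are hypothetical names, not part of SWE-rebench. Non-overlap is a heuristic only, not a formal significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score: float   # resolved rate, in percent
    margin: float  # reported +/- value, in percentage points (assumed symmetric)

    @property
    def low(self) -> float:
        return self.score - self.margin

    @property
    def high(self) -> float:
        return self.score + self.margin


def ranges_overlap(a: Entry, b: Entry) -> bool:
    """Return True when the two score +/- margin ranges share any point."""
    return a.low <= b.high and b.low <= a.high


# Values copied from the SWE-rebench table above.
xhigh = Entry("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91)
junie = Entry("Junie", 61.6, 0.64)
medium = Entry("gpt-5.5-2026-04-23-medium", 58.9, 0.78)

print(ranges_overlap(xhigh, junie))   # True: first and second place overlap
print(ranges_overlap(xhigh, medium))  # False: xhigh and medium are separated
```

Under this heuristic, first and second place are not clearly distinguishable, while the xhigh and medium configurations of gpt-5.5 are.
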
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 57 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 82 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 54 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 166 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 127 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 213 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 61 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 141 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 198 | $3.75 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 213 |
| 2 | Qwen3.7 Max | 198 |
| 3 | GPT-5.4 mini | 178 |
| 4 | GPT-5.4 | 166 |
| 5 | Gemini 3.1 Pro Preview | 141 |
| 6 | GPT-5.2 Codex | 138 |
| 7 | GLM-5.2 | 127 |
| 8 | DeepSeek V4 Flash | 109 |
| 9 | Nex-N2-Pro | 104 |
| 10 | GPT-5.3 Codex | 96 |
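
The speed ranking above applies a score floor before sorting. A minimal sketch of that filter-then-sort step follows, using a few rows from the composite index table plus one clearly hypothetical entry to show the floor at work. The `Row` type and `fastest` function are illustrative names, and treating the floor as inclusive (`>=`) is an assumption.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # composite index score
    tok_s: float  # output tokens per second


def fastest(rows: list[Row], min_score: float = 40.0) -> list[Row]:
    """Drop rows below the score floor, then sort by speed, fastest first."""
    eligible = [r for r in rows if r.score >= min_score]
    return sorted(eligible, key=lambda r: r.tok_s, reverse=True)


rows = [
    Row("Claude Opus 4.8", 55.7, 57),
    Row("GPT-5.4", 51.4, 166),
    Row("Gemini 3.5 Flash", 50.2, 213),
    Row("Qwen3.7 Max", 46.0, 198),
    # Hypothetical entry, included only to show the score floor excluding a row.
    Row("hypothetical-fast-model", 35.0, 400),
]

for rank, row in enumerate(fastest(rows), start=1):
    print(rank, row.model, row.tok_s)
```

The hypothetical model is the fastest in the input but is excluded because its score falls below 40.
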
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
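
For reference, a blended price at a 3:1 input/output ratio can be computed as a weighted average of the two per-token prices. The sketch below assumes the ratio means three parts input tokens to one part output tokens; the function name and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted-average price per 1M tokens for a given input:output token mix."""
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("token mix must have a positive total weight")
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices in USD per 1M tokens, for illustration only.
example = blended_price(input_price=1.00, output_price=4.00)
print(f"${example:.2f}")  # $1.75
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well on this measure.
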