The SWE-rebench rankings are unchanged from the previous snapshot, with no movement among the top 17 coding agents. OpenAI's gpt-5.5-2026-04-23-xhigh holds first at 62.7% (±0.91%), followed by Junie at 61.6% (±0.64%), and scores decline steadily through the rest of the field. The reported margins are not tight enough to separate most neighbors: within the top 10, only the gaps between ranks 5 and 6 and between ranks 9 and 10 exceed the combined margins of the two entries. Claude Opus 4.7-high (53.1% ±1.45%) and Cursor (53.0% ±0.53%), for instance, are a statistical tie, whereas Claude Sonnet 4.6 (51.3% ±0.55%) and Claude Opus 4.6-high (47.8% ±1.37%) are clearly separated.

The Artificial Analysis index shows more churn in the middle and lower tiers: Magistral Medium 1.2 dropped from position 130 to 148, while Apriel-v1.6-15B-Thinker moved up from 129 to 128. The top of the table did not move. Claude Fable 5 leads at 59.9, with Claude Opus 4.8 and GPT-5.5 holding second and third at 55.7 and 54.8. Without prior Artificial Analysis scores, it is unclear whether the shuffling in positions 128 to 148 reflects genuine performance changes or measurement variance.

The two benchmarks measure different problem spaces, which likely explains why their orderings diverge. SWE-rebench targets repository-level software engineering tasks, while Artificial Analysis is a composite across coding, math, and reasoning. Coding-specific agents such as Junie rank near the top of SWE-rebench, whereas Claude Fable 5 tops the general index. The stability in SWE-rebench suggests either that the top agents have reached a plateau or that the evaluation cannot resolve sub-point improvements.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Vendor | Model | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
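To check which adjacent placements the reported margins actually separate, the sketch below treats each ± value as the half-width of a symmetric interval and flags neighbors whose intervals do not intersect. That interpretation is an assumption (the source does not say how the margins are computed), and the function names are illustrative.

```python
# Top-10 SWE-rebench rows from the table above: (label, score %, margin %).
ROWS = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    ("Claude Opus 4.8-xhigh", 56.5, 1.20),
    ("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    ("Claude Opus 4.7-high", 53.1, 1.45),
    ("Cursor", 53.0, 0.53),
    ("Claude Sonnet 4.6", 51.3, 0.55),
]


def intervals_overlap(a, b):
    """True when [score - margin, score + margin] of a and b intersect.

    Assumption: the +/- value is a symmetric interval half-width.
    """
    _, score_a, margin_a = a
    _, score_b, margin_b = b
    return abs(score_a - score_b) <= margin_a + margin_b


def separated_adjacent_pairs(rows):
    """Return 1-based (rank, rank + 1) pairs whose intervals do not overlap."""
    return [
        (i + 1, i + 2)
        for i in range(len(rows) - 1)
        if not intervals_overlap(rows[i], rows[i + 1])
    ]


if __name__ == "__main__":
    for upper, lower in separated_adjacent_pairs(ROWS):
        print(f"rank {upper} ({ROWS[upper - 1][0]}) is separated from "
              f"rank {lower} ({ROWS[lower - 1][0]})")
```

Interval overlap is a conservative screen rather than a formal significance test; a proper comparison would need the per-run scores behind each margin.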
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 60 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 83 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 57 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 163 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 120 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 225 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 70 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 145 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 203 | $3.75 |
Output tokens per second (higher is faster). Restricted to models with an intelligence score of at least 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 225 |
| 2 | Qwen3.7 Max | 203 |
| 3 | GPT-5.4 mini | 177 |
| 4 | GPT-5.4 | 163 |
| 5 | Gemini 3.1 Pro Preview | 145 |
| 6 | GPT-5.2 Codex | 139 |
| 7 | GLM-5.2 | 120 |
| 8 | Nex-N2-Pro | 118 |
| 9 | DeepSeek V4 Flash | 114 |
| 10 | GPT-5.3 Codex | 100 |
Blended cost per 1M tokens at a 3:1 input-to-output ratio (lower is cheaper). Restricted to models with an intelligence score of at least 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
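The 3:1 blend in the caption above can be read as a weighted average of input and output prices, with input tokens weighted three times as heavily as output tokens. The sketch below is a minimal illustration of that reading; the function name is hypothetical, and the prices in the usage line are made-up inputs, not figures for any listed model.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted-average price per 1M tokens.

    Assumption: "3:1 input/output" means 3 parts input tokens to
    1 part output tokens, so the blend is (3 * input + 1 * output) / 4.
    """
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


if __name__ == "__main__":
    # Hypothetical prices: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
    print(f"${blended_price(3.00, 15.00):.2f} per 1M blended tokens")
```

Because input tokens dominate the blend, a model with cheap input and expensive output can rank better here than it would for output-heavy workloads.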