On the SWE-rebench coding benchmark, the top three are unchanged from the previous round: OpenAI's gpt-5.5-2026-04-23-xhigh model leads at 62.7% (±0.91%), followed by the Junie agent at 61.6% (±0.64%) and OpenAI's Codex agent at 60.4% (±1.37%). The Artificial Analysis index, by contrast, shows material reshuffling across its 398-model roster: Claude Fable 5 enters at #1 with 59.9 points, pushing GPT-5.5 down to #3, while Claude Sonnet 5 debuts at #5 with 53.4 points. Lower in the Artificial Analysis rankings, DeepSeek V3 climbs from #220 (10.4) to #180 (14.2), a 3.8-point gain that could reflect either improved evaluation conditions or a correction to the earlier assessment. Qwen3.5 9B drops from #101 (25) to #120 (21.4), a 3.6-point decline that raises the question of whether the methodology was consistent between rounds. SWE-rebench's narrow ± margins (nine of the top ten are under 1.5%) and static ordering are consistent with a well-controlled setup, whereas the movement and new entrants on Artificial Analysis may reflect looser evaluation criteria, frequent model updates that shift relative standing, or both. From the published figures alone, neither leaderboard gives enough methodological detail to separate genuine performance changes from variance in test conditions.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | Name | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
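One way to read the ± figures in the SWE-rebench table is to check whether neighbouring ranks are separated by more than their combined margins. The sketch below does that with the scores copied from the table. It assumes each ± value is a symmetric margin around the reported score; the source does not say whether it is a standard error, a standard deviation, or a confidence interval, so treat the overlap test as a rough heuristic rather than a significance test. The function names are illustrative only.

```python
# Scores and ± margins copied from the SWE-rebench top ten.
rows = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    ("Claude Opus 4.8-xhigh", 56.5, 1.20),
    ("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    ("Claude Opus 4.7-high", 53.1, 1.45),
    ("Cursor", 53.0, 0.53),
    ("Claude Sonnet 4.6", 51.3, 0.55),
]


def intervals_overlap(a, b):
    """True if the ranges [score - margin, score + margin] of a and b intersect."""
    _, score_a, margin_a = a
    _, score_b, margin_b = b
    return abs(score_a - score_b) <= margin_a + margin_b


def adjacent_overlaps(rows):
    """Compare each rank with the one directly below it."""
    return [
        (upper[0], lower[0], intervals_overlap(upper, lower))
        for upper, lower in zip(rows, rows[1:])
    ]


for upper, lower, overlap in adjacent_overlaps(rows):
    status = "overlap" if overlap else "separated"
    print(f"{upper} vs {lower}: {status}")
```

Under that assumption, only two adjacent gaps in the top ten exceed the combined margins: ranks 5 to 6 and ranks 9 to 10. The other seven neighbouring pairs overlap, so their ordering could plausibly change between runs.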
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 65 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 77 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 48 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 79 | $6.00 |
| 6 | GPT-5.4 | 51.4 | 157 | $5.63 |
| 7 | GLM-5.2 | 51.1 | 160 | $2.15 |
| 8 | Gemini 3.5 Flash | 50.2 | 210 | $3.38 |
| 9 | Claude Sonnet 4.6 | 47.2 | 63 | $6.00 |
| 10 | Gemini 3.1 Pro Preview | 46.5 | 128 | $4.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 210 |
| 2 | Qwen3.7 Max | 195 |
| 3 | GLM-5.2 | 160 |
| 4 | GPT-5.4 | 157 |
| 5 | GPT-5.4 mini | 154 |
| 6 | Gemini 3.1 Pro Preview | 128 |
| 7 | GPT-5.2 Codex | 118 |
| 8 | DeepSeek V4 Flash | 90 |
| 9 | MiMo-V2.5 | 86 |
| 10 | MiniMax-M3 | 84 |
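The speed ranking can be reproduced in outline as a filter followed by a sort: drop entries below the minimum intelligence score, then order the rest by output tokens per second. The sketch below uses a handful of rows from the composite index table plus one clearly hypothetical low-scoring entry to show the filter at work. The `Entry` type, the `fastest` helper, and the inclusive `>=` threshold are assumptions for illustration, not the site's actual implementation.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float  # composite index score
    tok_s: float  # output tokens per second


# A subset of rows from the composite index table, not the full roster.
entries = [
    Entry("Claude Opus 4.8", 55.7, 65),
    Entry("GPT-5.5", 54.8, 77),
    Entry("GPT-5.4", 51.4, 157),
    Entry("GLM-5.2", 51.1, 160),
    Entry("Gemini 3.5 Flash", 50.2, 210),
    Entry("Gemini 3.1 Pro Preview", 46.5, 128),
    # Hypothetical entry: fast but below the score cutoff, so it is excluded.
    Entry("hypothetical-small-model", 30.0, 400),
]


def fastest(entries, min_score=40.0, top_n=10):
    """Keep entries at or above min_score, then rank by tok/s, fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tok_s, reverse=True)[:top_n]


for rank, entry in enumerate(fastest(entries), start=1):
    print(rank, entry.model, entry.tok_s)
```

Because only a subset of the roster is included here, models such as Qwen3.7 Max that sit outside the composite top ten do not appear in this output.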
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
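For readers who want to compare these figures with a provider's published price sheet, a blended price at a 3:1 input/output ratio can be sketched as a weighted average: three parts input price to one part output price. This reading of "3:1" is an assumption, and the prices in the example are hypothetical rather than taken from any provider.

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices per 1M tokens: $3.00 input, $15.00 output.
# (3 * 3.00 + 1 * 15.00) / 4 = 6.00
example = blended_cost(3.00, 15.00)
print(f"${example:.2f} per 1M tokens")
```

A workload that generates far more output than input would weight the output price more heavily, so the blended figure is best treated as a comparison baseline rather than a cost forecast.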