The SWE-rebench leaderboard shows no movement from the previous snapshot: OpenAI's gpt-5.5-2026-04-23-xhigh holds first at 62.7%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%. The reported ± margins are small (roughly 0.5 to 2 points), but if they are read as symmetric intervals, those of adjacent top entries overlap, so the order of the first few places is not cleanly separated.

The Artificial Analysis index is nearly as static. Claude Fable 5 leads at 59.9, Claude Opus 4.8 sits at 55.7, and GPT-5.5 ranks third at 54.8; the list contains no new entries, and scores appear unchanged apart from one update further down. Two minor changes occur lower in the list: Qwen3 32B moves from rank 217 to 209 as its score rises from 10.5 to 11.5, and Sarvam 105B and Magistral Small 1.2 exchange positions around ranks 204-205 without score changes, which suggests database reorganization rather than an actual performance shift.

The methodology behind both benchmarks is only partly documented. SWE-rebench runs each model five times and reports a ± margin, but the listing gives no detail on the task distribution or on how the margin is computed. Artificial Analysis provides no uncertainty quantification at all, making it impossible to tell whether fractional score differences reflect genuine capability gaps or measurement noise.

The two benchmarks also diverge at the top: gpt-5.5-xhigh leads SWE-rebench, while GPT-5.5 ranks third on Artificial Analysis. Part of the gap is expected, since Artificial Analysis is a composite index across coding, math, and reasoning, whereas SWE-rebench targets software engineering tasks. Without more detail on what each benchmark tests, how tasks are sampled, and whether scoring is reproducible, the rankings are better read as relative indices than as direct measures of engineering competence.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Vendor | Name | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
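
One way to read the ± column above is to check whether neighbouring entries can be told apart at all. The sketch below assumes each ± value is a symmetric half-width around the score; the leaderboard does not say whether it is a standard error, a confidence interval, or something else, so treat the result as illustrative. The `Entry` type and helper names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    name: str
    score: float       # percent resolved, as listed in the table
    half_width: float  # ASSUMPTION: the "±" value is a symmetric half-width


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True if [score - hw, score + hw] of a and b share at least one point."""
    return abs(a.score - b.score) <= a.half_width + b.half_width


def adjacent_overlaps(entries: List[Entry]) -> List[Tuple[str, str, bool]]:
    """Compare each entry with the one ranked directly below it."""
    return [
        (a.name, b.name, intervals_overlap(a, b))
        for a, b in zip(entries, entries[1:])
    ]


# Top five rows copied from the SWE-rebench table.
TOP = [
    Entry("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    Entry("Junie", 61.6, 0.64),
    Entry("Codex", 60.4, 1.37),
    Entry("Claude Code", 59.6, 1.98),
    Entry("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
]

if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(TOP):
        verdict = "overlap" if overlap else "separated"
        print(f"{upper} vs {lower}: {verdict}")
```

Under that assumption every adjacent pair in the top five overlaps, while rank 1 and rank 5 do not, so single-place differences near the top deserve less weight than the ordering suggests.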
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 58 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 83 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 55 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 174 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 139 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 214 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 57 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 137 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 198 | $3.75 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 214 |
| 2 | Qwen3.7 Max | 198 |
| 3 | GPT-5.4 mini | 178 |
| 4 | GPT-5.4 | 174 |
| 5 | GLM-5.2 | 139 |
| 6 | Gemini 3.1 Pro Preview | 137 |
| 7 | GPT-5.2 Codex | 135 |
| 8 | DeepSeek V4 Flash | 109 |
| 9 | GPT-5.3 Codex | 94 |
| 10 | MiMo-V2.5 | 90 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
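
The "3:1 input/output" blend behind the cost column can be sketched as a weighted average. This is an assumed reading of the caption, not a formula stated in the table, and the input and output prices in the example are hypothetical placeholders rather than any listed model's actual rates.

```python
def blended_price(
    input_per_m: float,
    output_per_m: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted average price per 1M tokens.

    ASSUMPTION: "3:1 input/output" means three parts input price to one
    part output price, normalised by the total weight.
    """
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


if __name__ == "__main__":
    # Hypothetical prices: $5.00 per 1M input tokens, $25.00 per 1M output tokens.
    print(f"${blended_price(5.00, 25.00):.2f}")
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can rank better on this column than its output price alone would suggest.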