The SWE-rebench leaderboard is unchanged from the previous snapshot. OpenAI's gpt-5.5-2026-04-23-xhigh holds first place at 62.7% ± 0.91%, followed by the Junie agent at 61.6% ± 0.64% and OpenAI's Codex agent at 60.4% ± 1.37%. The top ten entries are identical in both rank and score, which suggests either a reporting cycle without new evaluations or variation small enough to fall within the reported uncertainty bounds.

The Artificial Analysis index, by contrast, shows modest reordering in the middle ranks: Mistral Small 3.1 and Mistral Medium 3.1 swapped places at positions 172 and 173, and two Nemotron variants shifted near positions 256 and 257. These moves appear to reflect tie-breaking rather than a meaningful difference in performance. At the lower end of the list, several models remain clustered at single-digit scores, a floor effect that suggests the benchmark has little discriminative power in that range.

Neither benchmark shows the kind of across-the-board improvement that would signal a methodological shift or a cohort of substantially more capable new models, so today's results are better read as reporting stability than as genuine progress.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | Entry | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
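One way to read the ± values is to check whether neighbouring entries can be told apart at all. The sketch below copies the top five rows from the table above (names shortened for readability) and tests whether adjacent score ranges intersect. It assumes each ± value is a symmetric half-width around the mean; the source does not say whether it is a standard error, a standard deviation, or a confidence interval, so this is a rough screen rather than a significance test. The `Entry` type and function names are illustrative.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    score: float   # mean score, in percent
    margin: float  # reported "±" value, in percentage points


# Top five rows copied from the SWE-rebench table above.
ENTRIES = [
    Entry("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    Entry("Junie", 61.6, 0.64),
    Entry("Codex", 60.4, 1.37),
    Entry("Claude Code", 59.6, 1.98),
    Entry("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
]


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True when [score - margin, score + margin] of a and b intersect.

    Assumption: the margin is a symmetric half-width around the score.
    """
    return abs(a.score - b.score) <= a.margin + b.margin


def adjacent_overlaps(entries: list[Entry]) -> list[tuple[str, str, bool]]:
    """Compare each entry with the one ranked directly below it."""
    return [
        (a.name, b.name, intervals_overlap(a, b))
        for a, b in zip(entries, entries[1:])
    ]


if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(ENTRIES):
        status = "overlap" if overlap else "separated"
        print(f"{upper} vs {lower}: {status}")
```

Under that assumption, every adjacent pair in the top five overlaps, while the first and fifth entries do not, so rank differences between neighbours here carry less weight than the ordering alone suggests.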
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 62 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 57 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 92 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 47 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 79 | $6.00 |
| 6 | GPT-5.4 | 51.4 | 155 | $5.63 |
| 7 | GLM-5.2 | 51.1 | 175 | $2.15 |
| 8 | Gemini 3.5 Flash | 50.2 | 210 | $3.38 |
| 9 | Claude Sonnet 4.6 | 47.2 | 60 | $6.00 |
| 10 | Gemini 3.1 Pro Preview | 46.5 | 138 | $4.50 |
Output speed in tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 210 |
| 2 | Qwen3.7 Max | 200 |
| 3 | GLM-5.2 | 175 |
| 4 | GPT-5.4 mini | 165 |
| 5 | GPT-5.4 | 155 |
| 6 | Gemini 3.1 Pro Preview | 138 |
| 7 | Nex-N2-Pro | 124 |
| 8 | GPT-5.2 Codex | 122 |
| 9 | MiniMax-M3 | 105 |
| 10 | GPT-5.3 Codex | 95 |
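The speed ranking applies a score threshold first and sorts by throughput second. A minimal sketch of that filter-then-sort step follows, using only the ten rows of the composite index table earlier in this report. It assumes the threshold is inclusive ("at least 40"), which the source does not state explicitly, and the function name `fastest` is illustrative. Because the input is limited to those ten rows, the result is a subset of the full speed table, which also lists models that fall outside the composite top ten.

```python
# (name, composite score, output tokens per second),
# copied from the composite index table.
ROWS = [
    ("Claude Fable 5", 59.9, 62),
    ("Claude Opus 4.8", 55.7, 57),
    ("GPT-5.5", 54.8, 92),
    ("Claude Opus 4.7", 53.5, 47),
    ("Claude Sonnet 5", 53.4, 79),
    ("GPT-5.4", 51.4, 155),
    ("GLM-5.2", 51.1, 175),
    ("Gemini 3.5 Flash", 50.2, 210),
    ("Claude Sonnet 4.6", 47.2, 60),
    ("Gemini 3.1 Pro Preview", 46.5, 138),
]


def fastest(rows, min_score=40.0, limit=10):
    """Keep rows scoring at least min_score, then sort fastest first.

    Assumption: the cutoff is inclusive (score >= min_score).
    """
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)[:limit]


if __name__ == "__main__":
    for name, score, speed in fastest(ROWS, limit=4):
        print(f"{name}: {speed} tok/s (score {score})")
```

Raising `min_score` shows how strongly the cutoff shapes the list: at 55.0 only Claude Fable 5 and Claude Opus 4.8 remain, neither of which is among the fastest models.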
Blended cost per 1M tokens at a 3:1 input-to-output ratio; lower is cheaper. Only models with an intelligence score of at least 40 are included.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
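The blended figure in the cost table combines input and output prices at a 3:1 ratio. A plausible reading, sketched below, is a weighted average in which input tokens count three times for every output token. That interpretation is an assumption, the function name `blended_price` is illustrative, and the prices in the usage example are hypothetical rather than taken from any listed model.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" means three input tokens
    for every output token.
    """
    total_weight = input_weight + output_weight
    weighted_sum = input_weight * input_price + output_weight * output_price
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
    print(f"${blended_price(1.00, 5.00):.2f} per 1M tokens")
```

With those hypothetical prices the blend works out to (3 × 1.00 + 1 × 5.00) / 4 = $2.00 per 1M tokens. Because input is weighted more heavily, a model with expensive output but cheap input can still rank well on this measure.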