The top of the SWE-rebench rankings is unchanged: OpenAI's gpt-5.5-2026-04-23-xhigh holds 62.7% and Junie's agent holds 61.6%. The Artificial Analysis benchmark, by contrast, shows substantial movement in the mid-tier and below. DeepSeek R1 rose from position 190 at 12.6% to position 144 at 18.5%, a gain of 5.9 percentage points. Mistral Small 3.1 climbed from 255 at 8.6% to 172 at 14.7%, gaining 6.1 points across an 83-position swing. Devstral Small 2 moved from 186 at 13.1% to 153 at 17.4% (up 4.3 points), and Llama 3.1 Instruct 8B climbed from 298 at 6.1% to 274 at 7.6% (up 1.5 points). Claude 4 Sonnet moved the other way, dropping from 72 at 30.7% to 83 at 28.9%, a loss of 1.8 points and 11 positions. Shifts of this size across several models suggest either that the models were re-evaluated with different configurations or that the benchmark itself changed its evaluation criteria.

The Artificial Analysis data spans 397 entries compared to 16 on SWE-rebench, which creates an asymmetry in what counts as meaningful movement: a 1-point swing in the top 10 of Artificial Analysis is roughly 2 percent of the leader's score, while the same absolute change at position 350 is nearly a 20 percent relative improvement. The new entry DiffusionGemma 26B A4B, at position 185 with 13.5%, has no prior reference point, so it is impossible to tell whether it is a newly evaluated model or a previously omitted one.

Without methodology details for either benchmark, interpretation of these shifts stays at the level of surface observation: SWE-rebench appears stable and possibly closed to new entries, while Artificial Analysis shows churn consistent with either rolling re-evaluation or score recalibration across the full roster.
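The point gains and rank swings quoted above are plain differences, and the relative-change framing is a ratio against the old score. A minimal sketch that recomputes them from the figures in the summary follows; the `moves` list and the `summarize` helper are illustrative names, not part of either benchmark's tooling.

```python
# (model, old_rank, old_score, new_rank, new_score), copied from the summary above
moves = [
    ("DeepSeek R1", 190, 12.6, 144, 18.5),
    ("Mistral Small 3.1", 255, 8.6, 172, 14.7),
    ("Claude 4 Sonnet", 72, 30.7, 83, 28.9),
    ("Devstral Small 2", 186, 13.1, 153, 17.4),
    ("Llama 3.1 Instruct 8B", 298, 6.1, 274, 7.6),
]


def summarize(old_rank, old_score, new_rank, new_score):
    """Return (points gained, places climbed, relative change in percent)."""
    points = round(new_score - old_score, 1)
    places = old_rank - new_rank  # positive means the model moved up the list
    relative = round(100 * (new_score - old_score) / old_score, 1)
    return points, places, relative


for name, *row in moves:
    points, places, relative = summarize(*row)
    print(f"{name}: {points:+.1f} pts, {places:+d} places, {relative:+.1f}% relative")
```

Printing the relative column next to the absolute one makes the asymmetry visible: a small absolute gain on a low starting score is a much larger relative change than a similar gain near the top.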
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | Name | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
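Because each score comes with a ± margin, adjacent ranks are not always clearly separated. The sketch below flags neighbouring entries whose ranges overlap. It assumes the ± value is a symmetric margin around the mean of the five runs (the table does not say whether it is a standard deviation, a standard error, or a confidence interval), and the overlap test is a rough heuristic, not a significance test. The `overlaps` and `ambiguous_neighbours` helpers are hypothetical names.

```python
# (rank, name, score, plus_minus) copied from the SWE-rebench table above
rows = [
    (1, "gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    (2, "Junie", 61.6, 0.64),
    (3, "Codex", 60.4, 1.37),
    (4, "Claude Code", 59.6, 1.98),
    (5, "gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    (6, "Claude Opus 4.8-xhigh", 56.5, 1.20),
    (7, "gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    (8, "Claude Opus 4.7-high", 53.1, 1.45),
    (9, "Cursor", 53.0, 0.53),
    (10, "Claude Sonnet 4.6", 51.3, 0.55),
]


def overlaps(a, b):
    """True when the two score +/- margin ranges share at least one value."""
    (_, _, score_a, pm_a), (_, _, score_b, pm_b) = a, b
    return abs(score_a - score_b) <= pm_a + pm_b


def ambiguous_neighbours(table):
    """Return (rank, next_rank) pairs whose ranges overlap (assumed +/- meaning)."""
    return [(a[0], b[0]) for a, b in zip(table, table[1:]) if overlaps(a, b)]


for upper, lower in ambiguous_neighbours(rows):
    print(f"ranks {upper} and {lower} overlap within their stated margins")
```

Under that reading, only the gaps between ranks 5 and 6 and between ranks 9 and 10 exceed the combined margins, so most of the ordering within the top ten should be treated as approximate.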
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 59 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 79 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 50 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 174 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 151 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 210 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 51 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 131 | $4.50 |
| 10 | Qwen3.7 Max | 46.0 | 196 | $3.75 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 210 |
| 2 | Qwen3.7 Max | 196 |
| 3 | GPT-5.4 | 174 |
| 4 | GPT-5.4 mini | 164 |
| 5 | GLM-5.2 | 151 |
| 6 | Gemini 3.1 Pro Preview | 131 |
| 7 | GPT-5.2 Codex | 127 |
| 8 | DeepSeek V4 Flash | 104 |
| 9 | MiMo-V2.5 | 91 |
| 10 | GPT-5.3 Codex | 88 |
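The speed leaderboard is a filter followed by a sort: keep models at or above the intelligence floor, then order by output tokens per second. A minimal sketch using only the ten rows of the composite table follows. It will not reproduce the full leaderboard, since models such as GPT-5.4 mini sit outside that top ten. Treating a reading of 0 tok/s as "no measurement" is an assumption made here, and `speed_ranking` is an illustrative helper name.

```python
MIN_SCORE = 40  # intelligence floor stated in the leaderboard caption

# (model, score, tokens_per_second) copied from the composite table above
composite = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 59),
    ("GPT-5.5", 54.8, 79),
    ("Claude Opus 4.7", 53.5, 50),
    ("GPT-5.4", 51.4, 174),
    ("GLM-5.2", 51.1, 151),
    ("Gemini 3.5 Flash", 50.2, 210),
    ("Claude Sonnet 4.6", 47.2, 51),
    ("Gemini 3.1 Pro Preview", 46.5, 131),
    ("Qwen3.7 Max", 46.0, 196),
]


def speed_ranking(rows, min_score=MIN_SCORE):
    """Keep models at or above min_score that have a speed reading, fastest first."""
    # Assumption: 0 tok/s means the speed was not measured, so the row is skipped.
    eligible = [r for r in rows if r[1] >= min_score and r[2] > 0]
    return sorted(eligible, key=lambda r: r[2], reverse=True)


for place, (model, score, tps) in enumerate(speed_ranking(composite), start=1):
    print(f"{place}. {model}: {tps} tok/s (score {score})")
```

Raising `min_score` shows how sensitive the ordering is to the floor: a stricter cutoff drops fast models that score just above 40.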
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
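The "3:1 input/output" blend in the cost table is most naturally read as a weighted average in which input tokens count three times as much as output tokens. The sketch below works under that assumption; the prices in the example are hypothetical and not taken from any provider's price list, and `blended_price` is an illustrative helper name.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical per-1M-token prices, for illustration only.
example = blended_price(input_price=1.00, output_price=5.00)
print(f"${example:.2f} per 1M blended tokens")
```

A different workload mix changes the figure: an output-heavy task would weight the usually higher output price more, so the table's ordering only holds for workloads close to the 3:1 ratio.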