On SWE-rebench, the top tier remains static: OpenAI's gpt-5.5-2026-04-23-xhigh model holds position one at 62.7 percent, followed by the Junie agent at 61.6 percent and OpenAI's Codex agent at 60.4 percent. The reported ± margins range from 0.53 to 1.98 percentage points across the top ten, but they are not tight enough to separate the leaders cleanly: the gap between first and second is 1.1 points, while their margins sum to 1.55 points, so the two ranges overlap. The ordering is stable, but differences between adjacent entries at the top should be read as small.

On the Artificial Analysis index, the landscape shifts more dramatically: Claude Fable 5 now leads at 59.9, displacing GPT-5.5 from the top spot, while Claude Opus 4.8 sits second at 55.7 and GPT-5.5 drops to third at 54.8. The 2.8-point difference between the SWE-rebench leader (62.7 percent) and the Artificial Analysis leader (59.9) is not a meaningful comparison, because the first is a percentage score on software engineering tasks and the second is a composite index across coding, math, and reasoning benchmarks. Within Artificial Analysis, the middle ranks show considerable churn: Llama 3.3 Instruct 70B climbed from position 258 to 242, a 16-rank jump, while several models in the 240 to 260 range shuffled positions. In a crowded band where many models cluster between 8 and 10 points, that kind of shuffling is consistent with modest score movements.

The two benchmarks do not track perfectly. The agent entries on SWE-rebench (Junie, Codex, Claude Code, and Cursor) do not appear in the Artificial Analysis top ten, and models such as Claude Fable 5, Claude Sonnet 5, GLM-5.2, and the Gemini entries do not appear in the SWE-rebench top ten. Where the lists do overlap, the ordering differs: GPT-5.5 ranks above Claude Opus 4.8 on SWE-rebench but below it on Artificial Analysis. This indicates the tests weight capabilities differently rather than measuring a single underlying skill. SWE-rebench concentrates on software engineering tasks under a standardized scaffolding, while Artificial Analysis appears broader and less transparent in methodology, making direct comparison hazardous. Rank changes alone, without the underlying score history from prior Artificial Analysis runs, do not show how significant Claude Fable 5's ascent to first is. SWE-rebench's stable ordering is consistent with real differences in capability, though the overlapping margins among its leaders limit what can be read into adjacent ranks, and the Artificial Analysis movements may reflect noise or evaluation drift rather than genuine progress.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Vendor | Entry | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
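The ± values in the table above can be used for a rough check of which adjacent ranks are actually separated. The sketch below is an illustration under stated assumptions, not part of the SWE-rebench methodology: it treats each ± value as a symmetric margin around the score (the table does not say whether it is a standard error, a standard deviation, or a confidence interval) and flags two entries as overlapping when their ranges intersect. The function names are hypothetical, and an overlap check of this kind is a heuristic, not a significance test.

```python
# (name, score in percent, ± margin in percentage points), copied from the table above.
rows = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
]


def intervals_overlap(a, b):
    """Return True if the score ± margin ranges of two rows intersect.

    Assumption: the ± value is a symmetric margin around the score.
    """
    _, score_a, margin_a = a
    _, score_b, margin_b = b
    return abs(score_a - score_b) <= margin_a + margin_b


def adjacent_overlaps(rows):
    """Compare each row with the one ranked directly below it."""
    return [
        (upper[0], lower[0], intervals_overlap(upper, lower))
        for upper, lower in zip(rows, rows[1:])
    ]


if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(rows):
        status = "overlap" if overlap else "separated"
        print(f"{upper} vs {lower}: {status}")
```

For these five rows, every adjacent pair overlaps, while the first and fifth entries do not, so the table supports a coarse ordering of the top tier more strongly than a strict rank-by-rank one.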
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 62 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 61 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 88 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 49 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 79 | $6.00 |
| 6 | GPT-5.4 | 51.4 | 167 | $5.63 |
| 7 | GLM-5.2 | 51.1 | 176 | $2.15 |
| 8 | Gemini 3.5 Flash | 50.2 | 209 | $3.38 |
| 9 | Claude Sonnet 4.6 | 47.2 | 67 | $6.00 |
| 10 | Gemini 3.1 Pro Preview | 46.5 | 140 | $4.50 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 209 |
| 2 | Qwen3.7 Max | 199 |
| 3 | GLM-5.2 | 176 |
| 4 | GPT-5.4 | 167 |
| 5 | GPT-5.4 mini | 164 |
| 6 | Gemini 3.1 Pro Preview | 140 |
| 7 | Nex-N2-Pro | 126 |
| 8 | GPT-5.2 Codex | 123 |
| 9 | MiniMax-M3 | 98 |
| 10 | DeepSeek V4 Flash | 95 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
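The "3:1 input/output" blend in the cost table's caption can be read as a weighted average of input and output prices. The sketch below assumes that reading, three parts input price to one part output price; the function name and the example prices are hypothetical and are not taken from any row in the table.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" means three parts input tokens
    for every one part output tokens.
    """
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


if __name__ == "__main__":
    # Hypothetical prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens.
    # (3 * 2.00 + 1 * 8.00) / 4 = 3.50
    print(f"${blended_price(2.00, 8.00):.2f} per 1M tokens")
```

Under this weighting the input price dominates, so a model with expensive output tokens but cheap input tokens can still rank well on blended cost.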