On SWE-rebench, the top tier remains locked in place: OpenAI's gpt-5.5-xhigh holds 62.7% ± 0.91%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%, with no movement among the leading six entries. Below that tier, Z.ai's GLM-5.2 enters at position 12 with 51.1% ± 1.13%, pushing its predecessor GLM-5.1 down to 13th place. DeepSeek-V4 Pro and MiMo-V2.5-Pro appear as new entries at 18 and 19 respectively, and two Qwen models occupy positions 22 and 23 in their first SWE-rebench appearances.

The Artificial Analysis index shows broader volatility. Claude Fable 5, which was not ranked in the earlier snapshot, leads at 59.9. GPT-5.1 dropped from 38.9 to 36.9 (position 44), and Command A+ fell from 29.3 to 22.5 (position 111), the largest documented decline. gpt-oss-20b, which had held position 171 at 14.9, has been removed from the rankings entirely.

SWE-rebench publishes an error margin with each score (between ±0.53 and ±1.98 points in the top ten), whereas the Artificial Analysis index is listed here as a single number, so close rankings are easier to judge on SWE-rebench. The two leaderboards also measure different things: SWE-rebench covers real-world software engineering tasks, while the Artificial Analysis index is a composite of coding, math, and reasoning benchmarks. On both, the frontier remains dominated by OpenAI and Anthropic systems, with newer Chinese models (Qwen, GLM variants) gaining ground in the mid-tier rather than displacing the leaders.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Provider | Name | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |
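
To make the "score ± margin" column concrete, here is a minimal Python sketch of how five repeated runs could be collapsed into a mean and an error margin, and how two entries' ranges can be compared. It assumes the margin is the standard error of the mean, which the leaderboard description does not confirm. The per-run values are invented for illustration, and `summarize_runs` and `intervals_overlap` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error of the mean) for per-run scores in percent.

    Assumption: the published margin is the standard error; the source
    does not say which statistic SWE-rebench actually reports.
    """
    n = len(resolved_rates)
    return mean(resolved_rates), stdev(resolved_rates) / sqrt(n)


def intervals_overlap(score_a, err_a, score_b, err_b):
    """True when the ranges [score - err, score + err] of two entries intersect."""
    return (score_a - err_a) <= (score_b + err_b) and (score_b - err_b) <= (score_a + err_a)


# Hypothetical per-run resolved rates for one model (not leaderboard data).
runs = [61.0, 62.0, 63.0, 64.0, 65.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% ± {sem:.2f}%")  # 63.0% ± 0.71%

# Ranks 1 and 2 from the table above: the ranges intersect.
print(intervals_overlap(62.7, 0.91, 61.6, 0.64))  # True

# Ranks 1 and 5 from the table above: the ranges do not intersect.
print(intervals_overlap(62.7, 0.91, 58.9, 0.78))  # False
```

Under this reading, the gap between the first two entries is within the reported margins, while the gap between ranks 1 and 5 is not. Overlapping ranges are only a rough screen, not a formal significance test.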
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 69 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 66 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 82 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 51 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 89 | $6.00 |
| 6 | GPT-5.4 | 51.4 | 165 | $5.63 |
| 7 | GLM-5.2 | 51.1 | 184 | $2.15 |
| 8 | Gemini 3.5 Flash | 50.2 | 214 | $3.38 |
| 9 | Claude Sonnet 4.6 | 47.2 | 69 | $6.00 |
| 10 | Gemini 3.1 Pro Preview | 46.5 | 138 | $4.50 |
Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are listed.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 214 |
| 2 | Qwen3.7 Max | 197 |
| 3 | GLM-5.2 | 184 |
| 4 | GPT-5.4 mini | 175 |
| 5 | GPT-5.4 | 165 |
| 6 | Gemini 3.1 Pro Preview | 138 |
| 7 | GPT-5.2 Codex | 125 |
| 8 | DeepSeek V4 Flash | 91 |
| 9 | Claude Sonnet 5 | 89 |
| 10 | Nex-N2-Pro | 87 |
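
The speed ranking amounts to a filter followed by a sort. The sketch below shows that logic in Python using a few rows from the composite index table. It treats the score cutoff as inclusive, and it adds one invented entry to show the cutoff taking effect. `Entry` and `rank_by_speed` are hypothetical names, not part of any Artificial Analysis tooling.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float    # composite intelligence score
    tok_per_s: int  # output tokens per second


def rank_by_speed(entries, min_score=40.0):
    """Drop entries below the score cutoff, then sort fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tok_per_s, reverse=True)


entries = [
    Entry("Claude Fable 5", 59.9, 69),
    Entry("GPT-5.5", 54.8, 82),
    Entry("GPT-5.4", 51.4, 165),
    Entry("GLM-5.2", 51.1, 184),
    Entry("Gemini 3.5 Flash", 50.2, 214),
    # Invented entry: fast but below the cutoff, so it is excluded.
    Entry("Hypothetical fast model", 35.0, 400),
]

for rank, entry in enumerate(rank_by_speed(entries), start=1):
    print(rank, entry.model, entry.tok_per_s)
```

Because the cutoff is applied before sorting, a very fast but low-scoring model never appears in the list, which is why the speed table is not simply the fastest models overall.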
Blended cost in USD per 1M tokens (3:1 input/output ratio); lower is cheaper. Only models with an intelligence score of at least 40 are listed.
| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |
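
As a rough illustration of the 3:1 blend, the Python sketch below assumes the blended figure is a weighted average of input and output prices, with three parts input to one part output. The prices passed in are hypothetical and are not taken from any provider's price list. `blended_price` is an illustrative helper name.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" means three input tokens are counted
    for every output token when averaging the two prices.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices in USD per 1M tokens: $3.00 input, $15.00 output.
print(blended_price(3.00, 15.00))  # 6.0
```

With this weighting, the input price dominates the blend, so a model with expensive output tokens can still rank as cheap if its input price is low.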