Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous measurement, while the Artificial Analysis index shows no structural movement in its top tier either: GPT-5.5 remains at 60.2 and Claude Opus 4.6 holds at 53.0.

The most notable shifts occur in the middle tiers of the Artificial Analysis leaderboard, where DeepSeek V4 Pro enters at rank 12 with 51.5, a new entry that suggests incremental capability expansion among reasoning-focused models, though the score itself is consistent with existing peers rather than a departure from them. On SWE-rebench, GLM-5 (62.8%), gpt-5.4-2026-03-05-medium (62.8%), and GLM-5.1 (62.7%) cluster tightly in the 62 to 63 percent range, indicating a plateau in absolute gains at the frontier, where further differentiation requires sub-point precision. Gemini 3.1 Pro Preview dropped from rank 3 to rank 6 on SWE-rebench (67.2% to 62.3%), a 4.9-point decline that warrants scrutiny of whether the evaluation conditions or test-set composition shifted; movements of that size in established models typically signal methodology changes rather than genuine capability loss.

The Artificial Analysis index, which tracks a broader set of models, shows no entries above 60.2, leaving a visible gap between the two benchmarks' top performers and suggesting they measure different problem distributions or difficulty profiles: SWE-rebench appears to emphasize code generation under constraints that newer models handle more effectively, while Artificial Analysis may weight reasoning and multi-step tasks more heavily. Neither benchmark exhibits the velocity that would indicate a meaningful breakthrough, and the lack of new entrants at the very top suggests the field is consolidating rather than expanding capability frontiers.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
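The five-run protocol mentioned above matters because agentic runs are stochastic: the same model can resolve a different subset of tasks on each attempt. A minimal sketch of how per-run resolved rates might be aggregated; the run data and the mean-plus-spread aggregation are illustrative assumptions, not SWE-rebench's published code:

```python
# Minimal sketch (not SWE-rebench's actual pipeline): aggregate a model's
# resolved rate across five independent runs. Task outcomes are hypothetical.
from statistics import mean, stdev

# Each run is a list of per-task outcomes: True if the model's patch
# resolved the issue, False otherwise.
runs = [
    [True, True, False, True],   # run 1
    [True, False, False, True],  # run 2
    [True, True, False, True],   # run 3
    [True, True, True, True],    # run 4
    [True, False, False, True],  # run 5
]

# Per-run resolved rate, then mean and spread across the five runs.
rates = [sum(run) / len(run) for run in runs]
print(f"resolved rate: {mean(rates):.1%} ± {stdev(rates):.1%}")
```

Reporting the cross-run spread alongside the mean is what makes sub-point gaps like the 62.8% vs. 62.7% cluster above interpretable at all.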
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 113 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 66 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 136 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 126 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 65 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 83 | $4.81 |
| 8 | Claude Opus 4.6 | 53.0 | 60 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 34 | $2.92 |
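A composite index collapses per-domain benchmark scores into one number. A minimal sketch assuming an equal-weight mean over coding, math, and reasoning; Artificial Analysis's actual weighting is not given in this digest, so both the weights and the scores below are illustrative assumptions:

```python
# Minimal sketch: combine per-domain scores into a single composite index.
# Equal weights are an assumption, not Artificial Analysis's published method.
def composite_index(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

scores = {"coding": 58.0, "math": 64.0, "reasoning": 59.0}   # hypothetical
weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}     # assumed equal
print(f"composite: {composite_index(scores, weights):.1f}")  # -> composite: 60.3
```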
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3 Flash Preview | 203 |
| 2 | Qwen3.6 35B A3B | 200 |
| 3 | GPT-5 Codex | 198 |
| 4 | GPT-5.4 mini | 192 |
| 5 | GPT-5.1 Codex | 187 |
| 6 | GPT-5.4 nano | 149 |
| 7 | Qwen3.5 122B A10B | 146 |
| 8 | Gemini 3.1 Pro Preview | 136 |
| 9 | Gemini 3 Pro Preview | 131 |
| 10 | GPT-5.1 | 131 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
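The caption's 3:1 blend weights the input and output token prices three to one, i.e. blended = (3 × input + 1 × output) / 4. A minimal sketch of that arithmetic; the per-token prices in the example are hypothetical, not quoted from any provider:

```python
# Minimal sketch: blended $/1M tokens at a 3:1 input/output ratio,
# matching the caption's weighting. Prices below are hypothetical.
def blended_cost(input_per_m: float, output_per_m: float,
                 in_ratio: int = 3, out_ratio: int = 1) -> float:
    return (input_per_m * in_ratio + output_per_m * out_ratio) / (in_ratio + out_ratio)

# e.g. a model priced at $0.50/1M input and $2.00/1M output tokens
print(f"${blended_cost(0.50, 2.00):.3f}/1M")  # -> $0.875/1M
```

The 3:1 weighting reflects the typical coding workload, where prompts (code context) are several times larger than completions, so input price dominates the effective cost.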