Claude Opus 4.6 moved from fourth to first on SWE-rebench with a 65.3% score, a 12.3-point gain over its previous 53.0%. Gemini 3.1 Pro Preview, meanwhile, held steady at 62.3% on SWE-rebench yet sits only sixth there despite topping Artificial Analysis last cycle, suggesting the two benchmarks now diverge meaningfully in what they reward. The SWE-rebench leaderboard also shows tighter clustering at the top: positions two through five span only 1.7 percentage points, leaving little separation between the leading models on coding tasks.

Chinese models made notable gains on SWE-rebench: GLM-5 climbed from tenth to third (49.8% to 62.8%), Kimi K2.5 rose from twentieth to sixteenth (46.8% to 58.5%), and GLM-4.7 advanced from thirty-fourth to fourteenth (42.1% to 58.7%). On Artificial Analysis, by contrast, the top tier remains dominated by Anthropic and OpenAI variants, with Claude Opus 4.7 newly entering at first place and Gemini 3.1 Pro Preview sliding to second.

The two indexes also moved differently in absolute terms. Artificial Analysis shows minimal movement across most positions, with entries reordering but scores remaining largely stable, whereas SWE-rebench scores rose broadly, raising the question of whether the two benchmarks measure consistent capabilities or whether SWE-rebench's evaluation methodology has shifted. JT-MINI dropped entirely from the Artificial Analysis rankings after placing 109th with 25.4 points, but no corresponding SWE-rebench removal is documented, leaving it unclear whether this reflects model discontinuation or benchmark revision.
The divergence between these two evaluation frameworks is now pronounced enough to warrant scrutiny of their test construction. If both measure code-generation ability, the gap between Gemini 3.1 Pro Preview's rankings (second on Artificial Analysis, down from first, but sixth on SWE-rebench) and Claude Opus 4.6's trajectory (fourth to first on SWE-rebench, yet only fifth on Artificial Analysis) suggests they are sampling different problem distributions or applying different evaluation criteria, rather than simply ranking the same capability differently.
Cole Brennan
Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
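The five-run protocol matters because a single agentic run over a task set is noisy. A minimal sketch of how a reported score might be aggregated from repeated runs, using hypothetical per-run resolved rates (not actual benchmark data or SWE-rebench's published code):

```python
import statistics

def aggregate_runs(per_run_scores):
    """Average the fraction of tasks resolved across independent runs.

    Returns (mean, sample standard deviation); the stdev gives a rough
    sense of run-to-run stochastic variance for one model.
    """
    mean = statistics.mean(per_run_scores)
    spread = statistics.stdev(per_run_scores) if len(per_run_scores) > 1 else 0.0
    return mean, spread

# Hypothetical five runs of one model:
runs = [0.64, 0.66, 0.65, 0.653, 0.647]
mean, spread = aggregate_runs(runs)
print(f"mean resolved rate: {mean:.1%} (run-to-run spread: {spread:.1%})")
```

Averaging over five runs shrinks the standard error of the reported score by roughly a factor of √5 relative to a single run, which is what makes sub-point gaps between adjacent leaderboard positions interpretable at all.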
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |
Artificial Analysis composite index across coding, math, and reasoning benchmarks.
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 58 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 126 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 82 | $5.63 |
| 4 | GPT-5.3 Codex | 53.6 | 81 | $4.81 |
| 5 | Claude Opus 4.6 | 53.0 | 54 | $10.00 |
| 6 | Muse Spark | 52.1 | 0 | $0.00 |
| 7 | Claude Sonnet 4.6 | 51.7 | 60 | $6.00 |
| 8 | GLM-5.1 | 51.4 | 47 | $2.15 |
| 9 | GPT-5.2 | 51.3 | 74 | $4.81 |
| 10 | Qwen3.6 Plus | 50.0 | 53 | $1.13 |
Output tokens per second — higher is faster. Minimum intelligence score of 40.
| # | Model | tok/s |
|---|---|---|
| 1 | Qwen3.6 35B A3B | 238 |
| 2 | GPT-5.1 Codex | 205 |
| 3 | GPT-5 Codex | 199 |
| 4 | Grok 4.20 0309 | 194 |
| 5 | Gemini 3 Flash Preview | 191 |
| 6 | Grok 4.20 0309 v2 | 180 |
| 7 | GPT-5.4 mini | 172 |
| 8 | GPT-5.4 nano | 155 |
| 9 | Gemini 3 Pro Preview | 133 |
| 10 | Qwen3.5 122B A10B | 130 |
Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.
| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | KAT Coder Pro V2 | $0.525 |
| 6 | MiniMax-M2.5 | $0.525 |
| 7 | GPT-5 mini | $0.688 |
| 8 | Qwen3.5 27B | $0.825 |
| 9 | Qwen3.6 35B A3B | $0.844 |
| 10 | GLM-4.7 | $1.00 |
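The blended figure above folds separate input and output token prices into one number at a 3:1 input/output ratio. A minimal sketch of that arithmetic, with hypothetical per-direction prices (the table publishes only the blended result):

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Blended $/1M tokens assuming a 3:1 input/output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight

# Hypothetical pricing: $0.25 per 1M input tokens, $1.00 per 1M output tokens.
print(f"${blended_cost(0.25, 1.00):.4f} per 1M blended tokens")
```

The 3:1 weighting reflects a typical workload where prompts (context, code, retrieved files) are much larger than completions, so a model with cheap input but expensive output can still blend to a low headline price.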