Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, compared with #11 at 52.9 on Artificial Analysis. Because both figures describe the same model, the 12.4-point difference more likely reflects meaningful differences in how the two benchmarks evaluate code-solving capability than a change in the model itself. The top of the SWE-rebench leaderboard is tightly clustered: gpt-5.2-2025-12-11-medium, GLM-5, Junie, and gpt-5.4-2026-03-05-medium all sit within 1.6 points of each other, between 62.8% and 64.4%, suggesting convergence among frontier models on this task.

Several models place far higher on SWE-rebench than on Artificial Analysis. GLM-5 goes from #19 to #3 (a 13-point difference), Kimi K2.5 from #31 to #16 (up 11.7 points), and Kimi K2 Thinking from #56 to #21 (up 16.5 points), suggesting that Chinese-developed models perform comparatively well on repository-level code tasks. Claude Sonnet 4.6 sits at #14 on Artificial Analysis (51.7) but #9 on SWE-rebench (60.7), which is consistent with some models being stronger on the problem distribution in SWE-rebench than on Artificial Analysis's evaluation. The reverse also occurs: Gemini 3.1 Pro Preview ranks #7 on SWE-rebench while holding #3 on Artificial Analysis at 57.2, illustrating that benchmark choice materially affects perceived ranking.

The divergence raises a methodological question. SWE-rebench appears to emphasize end-to-end repository modification and integration, while Artificial Analysis may weight reasoning and breadth differently. The shifts are large, but without access to the evaluation protocols themselves it is difficult to assess whether either benchmark has higher discriminative validity for production code work.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. It uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
| 6 | GLM-5.1 | 62.7% |
| 7 | Gemini 3.1 Pro Preview | 62.3% |
| 8 | DeepSeek-V3.2 | 60.9% |
| 9 | Claude Sonnet 4.6 | 60.7% |
| 10 | Claude Sonnet 4.5 | 60.0% |
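
The clustering at the top of this table can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of SWE-rebench itself: the scores are copied from the table above, and the helper name `spread` is an illustrative assumption.

```python
# Scores (percent) copied from the SWE-rebench top-10 table above.
swe_rebench = [
    ("Claude Opus 4.6", 65.3),
    ("gpt-5.2-2025-12-11-medium", 64.4),
    ("GLM-5", 62.8),
    ("Junie", 62.8),
    ("gpt-5.4-2026-03-05-medium", 62.8),
    ("GLM-5.1", 62.7),
    ("Gemini 3.1 Pro Preview", 62.3),
    ("DeepSeek-V3.2", 60.9),
    ("Claude Sonnet 4.6", 60.7),
    ("Claude Sonnet 4.5", 60.0),
]


def spread(rows):
    """Return max minus min score for the given rows, rounded to one decimal.

    `spread` is a hypothetical helper written for this example.
    """
    scores = [score for _, score in rows]
    return round(max(scores) - min(scores), 1)


# Gap between the leader and the runner-up.
lead_gap = round(swe_rebench[0][1] - swe_rebench[1][1], 1)

print("ranks 2-5 spread:", spread(swe_rebench[1:5]))
print("lead over #2:", lead_gap)
print("top-10 spread:", spread(swe_rebench))
```

Rounding to one decimal keeps floating-point noise out of the printed differences, since the table itself reports only one decimal place.
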

Artificial Analysis composite index across coding, math, and reasoning benchmarks. The tok/s column is output tokens per second, and $/1M is the blended cost per 1M tokens.

| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 72 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 54 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 130 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 90 | $5.63 |
| 5 | Qwen3.7 Max | 56.6 | 206 | $3.75 |
| 6 | Gemini 3.5 Flash | 55.3 | 233 | $3.38 |
| 7 | Kimi K2.6 | 53.9 | 32 | $1.71 |
| 8 | MiMo-V2.5-Pro | 53.8 | 51 | $1.35 |
| 9 | GPT-5.3 Codex | 53.6 | 82 | $4.81 |
| 10 | Grok 4.3 | 53.2 | 196 | $1.56 |
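
To see how much a model's position depends on the benchmark, the two top-10 lists can be joined by model name. This is a rough sketch under one loud assumption: models are matched by exact name string, so entries the two leaderboards label differently (for example `GPT-5.4` versus `gpt-5.4-2026-03-05-medium`) are treated as different models. The function name `rank_shifts` is hypothetical.

```python
# Model names in rank order, copied from the two top-10 tables above.
swe_rebench_order = [
    "Claude Opus 4.6", "gpt-5.2-2025-12-11-medium", "GLM-5", "Junie",
    "gpt-5.4-2026-03-05-medium", "GLM-5.1", "Gemini 3.1 Pro Preview",
    "DeepSeek-V3.2", "Claude Sonnet 4.6", "Claude Sonnet 4.5",
]
artificial_analysis_order = [
    "GPT-5.5", "Claude Opus 4.7", "Gemini 3.1 Pro Preview", "GPT-5.4",
    "Qwen3.7 Max", "Gemini 3.5 Flash", "Kimi K2.6", "MiMo-V2.5-Pro",
    "GPT-5.3 Codex", "Grok 4.3",
]


def to_ranks(names):
    """Map each model name to its 1-based rank."""
    return {name: rank for rank, name in enumerate(names, start=1)}


def rank_shifts(ranks_a, ranks_b):
    """For models present in both rankings, return rank_in_b - rank_in_a.

    A positive value means the model places lower (worse) in ranking b.
    Matching is by exact name, which is an assumption of this sketch.
    """
    return {
        name: ranks_b[name] - ranks_a[name]
        for name in ranks_a
        if name in ranks_b
    }


shifts = rank_shifts(to_ranks(artificial_analysis_order),
                     to_ranks(swe_rebench_order))
print(shifts)
```

With exact-name matching, only one model appears in both top-10 lists, which itself shows how little the two leaderboards overlap at the top.
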

Output tokens per second (higher is faster). Only models with an intelligence score of at least 40 are included.

| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 233 |
| 2 | Qwen3.7 Max | 206 |
| 3 | Gemini 3 Flash Preview | 204 |
| 4 | GPT-5 Codex | 202 |
| 5 | GPT-5.1 Codex | 201 |
| 6 | Grok 4.3 | 196 |
| 7 | Grok 4.20 0309 v2 | 188 |
| 8 | Grok 4.20 0309 | 185 |
| 9 | Qwen3.6 35B A3B | 170 |
| 10 | GPT-5.4 mini | 165 |
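
The speed ranking combines a score threshold with a sort by throughput. The sketch below shows that filter-then-sort step on the Artificial Analysis top-10 rows, the only rows here that list both a score and a speed. It does not reproduce the speed table above, which includes models whose scores are not shown, and the function name `fastest` and its parameters are illustrative assumptions.

```python
# (model, index score, output tok/s) copied from the Artificial Analysis
# top-10 table earlier in this document.
rows = [
    ("GPT-5.5", 60.2, 72),
    ("Claude Opus 4.7", 57.3, 54),
    ("Gemini 3.1 Pro Preview", 57.2, 130),
    ("GPT-5.4", 56.8, 90),
    ("Qwen3.7 Max", 56.6, 206),
    ("Gemini 3.5 Flash", 55.3, 233),
    ("Kimi K2.6", 53.9, 32),
    ("MiMo-V2.5-Pro", 53.8, 51),
    ("GPT-5.3 Codex", 53.6, 82),
    ("Grok 4.3", 53.2, 196),
]


def fastest(rows, min_score=40.0, top_n=10):
    """Keep rows at or above min_score, then sort by tok/s, fastest first.

    Hypothetical helper: returns (model, tok/s) pairs without
    modifying the input list.
    """
    eligible = [row for row in rows if row[1] >= min_score]
    eligible.sort(key=lambda row: row[2], reverse=True)
    return [(name, speed) for name, _, speed in eligible[:top_n]]


# With the default threshold of 40, every row above qualifies.
print(fastest(rows, top_n=3))
# Raising the threshold shows the filter changing the ranking.
print(fastest(rows, min_score=57.0))
```
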

Blended cost per 1M tokens at a 3:1 input-to-output ratio (lower is cheaper). Only models with an intelligence score of at least 40 are included.

| # | Model | $/1M |
|---|---|---|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | Hy3-preview | $0.20 |
| 4 | DeepSeek V3.2 | $0.337 |
| 5 | MiMo-V2.5 | $0.408 |
| 6 | GPT-5.4 nano | $0.463 |
| 7 | MiniMax-M2.7 | $0.525 |
| 8 | KAT Coder Pro V2 | $0.525 |
| 9 | MiniMax-M2.5 | $0.525 |
| 10 | DeepSeek V4 Pro | $0.544 |
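
A blended price at a 3:1 input-to-output ratio can be read as a token-weighted average of the input and output prices. The sketch below assumes that interpretation. The per-direction prices in the example are hypothetical, since the tables above list only blended figures, and `blended_price` is an illustrative name.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Token-weighted average price per 1M tokens.

    Assumes the 3:1 ratio means three input tokens for every output token,
    so the blend is (3 * input + 1 * output) / 4 by default.
    """
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices, NOT taken from any model in the tables:
# $1.00 per 1M input tokens and $4.00 per 1M output tokens.
example = blended_price(1.00, 4.00)
print(f"${example:.2f} per 1M blended tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still land low on this table.
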