Live rankings from SWE-Rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for every model, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that aggregation follows the table).
| Rank | Model | Resolved rate |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Claude Opus 4.6 | 51.7% |
| 3 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 4 | gpt-5.2-2025-12-11-medium | 51.0% |
| 5 | gpt-5.1-codex-max | 48.5% |
| 6 | Claude Sonnet 4.5 | 47.1% |
| 7 | Gemini 3 Pro Preview | 46.7% |
| 8 | Gemini 3 Flash Preview | 46.7% |
| 9 | gpt-5.2-codex | 45.0% |
| 10 | Codex | 44.0% |
| 11 | Claude Opus 4.5 | 43.8% |
| 12 | Kimi K2 Thinking | 43.8% |
| 13 | gpt-5.1-codex | 42.9% |
| 14 | GLM-5 | 42.1% |
| 15 | GLM-4.7 | 41.3% |
| 16 | Qwen3-Coder-Next | 40.0% |
| 17 | MiniMax M2.5 | 39.6% |
| 18 | Kimi K2.5 | 37.9% |
| 19 | Devstral-2-123B-Instruct-2512 | 37.5% |
| 20 | DeepSeek-V3.2 | 37.5% |
| 21 | GLM-4.6 | 37.1% |
| 22 | gpt-5-mini-2025-08-07-high | 35.0% |
| 23 | Kimi K2 Instruct 0905 | 34.3% |
| 24 | Devstral-Small-2-24B-Instruct-2512 | 32.1% |
| 25 | GLM-4.5 Air | 31.8% |
| 26 | MiniMax M2.1 | 31.7% |
| 27 | Qwen3-Coder-480B-A35B-Instruct | 31.7% |
| 28 | gpt-5-mini-2025-08-07-medium | 30.8% |
| 29 | GLM-4.7 Flash | 25.4% |
| 30 | gpt-oss-120b | 24.6% |
| 31 | Qwen3-235B-A22B-Instruct-2507 | 23.8% |
| 32 | DeepSeek-R1-0528 | 21.7% |
| 33 | Qwen3-Coder-30B-A3B-Instruct | 18.0% |
| 34 | Qwen3-Next-80B-A3B-Instruct | 15.4% |
| 35 | Qwen3-30B-A3B-Instruct-2507 | 7.1% |
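
Because each entry above is the outcome of five independent runs, the published number is effectively a mean resolved rate. Below is a minimal sketch of that aggregation, assuming the score is a simple average of per-run resolved rates; the function and data names are illustrative, not the actual SWE-Rebench harness API.

```python
# Illustrative only: averages resolved rates across multiple runs,
# mirroring the "run each model five times" methodology. The names
# `aggregate_score` and `runs` are hypothetical, not from SWE-Rebench.
from statistics import mean


def aggregate_score(runs: list[list[bool]]) -> float:
    """Average resolved rate across runs.

    `runs` holds one list per run, each containing a resolved/unresolved
    flag for every task in the benchmark.
    """
    per_run_rates = [sum(run) / len(run) for run in runs]
    return mean(per_run_rates)


# Example: 5 runs over 4 tasks; the model resolves 2-3 tasks per run.
runs = [
    [True, True, False, False],
    [True, True, True, False],
    [True, False, False, True],
    [True, True, False, False],
    [False, True, True, False],
]
print(f"{aggregate_score(runs):.1%}")  # 55.0%
```

Averaging over repeated runs matters here because agentic coding results are noisy: a single run can swing a model several points, which would reorder the closely spaced entries in the table above.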