The Inference Report

February 26, 2026

Live rankings from SWE-Rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike many other leaderboards, it applies the same standardized agent scaffold to every model, continuously refreshes its task set to prevent contamination, and runs each model five times to account for run-to-run stochastic variance.

#   Model                                 Score
1   Claude Code                           52.9%
2   Claude Opus 4.6                       51.7%
3   gpt-5.2-2025-12-11-xhigh              51.7%
4   gpt-5.2-2025-12-11-medium             51.0%
5   gpt-5.1-codex-max                     48.5%
6   Claude Sonnet 4.5                     47.1%
7   Gemini 3 Pro Preview                  46.7%
8   Gemini 3 Flash Preview                46.7%
9   gpt-5.2-codex                         45.0%
10  Codex                                 44.0%
11  Claude Opus 4.5                       43.8%
12  Kimi K2 Thinking                      43.8%
13  gpt-5.1-codex                         42.9%
14  GLM-5                                 42.1%
15  GLM-4.7                               41.3%
16  Qwen3-Coder-Next                      40.0%
17  MiniMax M2.5                          39.6%
18  Kimi K2.5                             37.9%
19  Devstral-2-123B-Instruct-2512         37.5%
20  DeepSeek-V3.2                         37.5%
21  GLM-4.6                               37.1%
22  gpt-5-mini-2025-08-07-high            35.0%
23  Kimi K2 Instruct 0905                 34.3%
24  Devstral-Small-2-24B-Instruct-2512    32.1%
25  GLM-4.5 Air                           31.8%
26  MiniMax M2.1                          31.7%
27  Qwen3-Coder-480B-A35B-Instruct        31.7%
28  gpt-5-mini-2025-08-07-medium          30.8%
29  GLM-4.7 Flash                         25.4%
30  gpt-oss-120b                          24.6%
31  Qwen3-235B-A22B-Instruct-2507         23.8%
32  DeepSeek-R1-0528                      21.7%
33  Qwen3-Coder-30B-A3B-Instruct          18.0%
34  Qwen3-Next-80B-A3B-Instruct           15.4%
35  Qwen3-30B-A3B-Instruct-2507            7.1%
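Since each model is run five times, the reported score is an average over independent runs rather than a single measurement. A minimal sketch of that aggregation, using made-up per-run resolved rates (the actual per-run numbers are not published in this table), might look like:

```python
from statistics import mean, stdev

# Hypothetical resolved rates (%) for one model across five independent runs.
# These numbers are illustrative, not SWE-Rebench's actual per-run data.
runs = [51.0, 53.5, 52.0, 54.0, 54.0]

score = mean(runs)    # the single number reported on the leaderboard
spread = stdev(runs)  # run-to-run variance that the averaging smooths out

print(f"mean={score:.1f}%  stdev={spread:.2f}")
```

Averaging like this matters because agentic coding runs are stochastic: the same model on the same tasks can resolve a noticeably different fraction of issues from one run to the next, so a single run can misrank models whose true scores are close.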