The Inference Report

February 26, 2026

Live rankings from SWE-Rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike many other leaderboards, it applies the same standardized agent scaffold to every model, continuously refreshes its task set to prevent contamination, and runs each model five times to account for run-to-run stochastic variance.

#   Model                                 Score
1   Claude Code                           52.9%
2   Claude Opus 4.6                       51.7%
3   gpt-5.2-2025-12-11-xhigh              51.7%
4   gpt-5.2-2025-12-11-medium             51.0%
5   gpt-5.1-codex-max                     48.5%
6   Claude Sonnet 4.5                     47.1%
7   Gemini 3 Pro Preview                  46.7%
8   Gemini 3 Flash Preview                46.7%
9   gpt-5.2-codex                         45.0%
10  Codex                                 44.0%
11  Claude Opus 4.5                       43.8%
12  Kimi K2 Thinking                      43.8%
13  gpt-5.1-codex                         42.9%
14  GLM-5                                 42.1%
15  GLM-4.7                               41.3%
16  Qwen3-Coder-Next                      40.0%
17  MiniMax M2.5                          39.6%
18  Kimi K2.5                             37.9%
19  Devstral-2-123B-Instruct-2512         37.5%
20  DeepSeek-V3.2                         37.5%
21  GLM-4.6                               37.1%
22  gpt-5-mini-2025-08-07-high            35.0%
23  Kimi K2 Instruct 0905                 34.3%
24  Devstral-Small-2-24B-Instruct-2512    32.1%
25  GLM-4.5 Air                           31.8%
26  MiniMax M2.1                          31.7%
27  Qwen3-Coder-480B-A35B-Instruct        31.7%
28  gpt-5-mini-2025-08-07-medium          30.8%
29  GLM-4.7 Flash                         25.4%
30  gpt-oss-120b                          24.6%
31  Qwen3-235B-A22B-Instruct-2507         23.8%
32  DeepSeek-R1-0528                      21.7%
33  Qwen3-Coder-30B-A3B-Instruct          18.0%
34  Qwen3-Next-80B-A3B-Instruct           15.4%
35  Qwen3-30B-A3B-Instruct-2507            7.1%
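Since each model is run five times, the reported score is an average over independent runs rather than a single measurement. A minimal sketch of that aggregation, using made-up per-run resolved rates (the actual per-run numbers are not published in this table), might look like:

```python
from statistics import mean, stdev

# Hypothetical resolved rates (%) for one model across five independent runs.
# These numbers are illustrative, not SWE-Rebench's actual per-run data.
runs = [51.0, 53.5, 52.0, 54.0, 54.0]

score = mean(runs)    # the single number reported on the leaderboard
spread = stdev(runs)  # run-to-run variance that the averaging smooths out

print(f"mean={score:.1f}%  stdev={spread:.2f}")
```

Averaging like this matters because agentic coding runs are stochastic: the same model on the same tasks can resolve a noticeably different fraction of issues from one run to the next, so a single run can misrank models whose true scores are close.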