The Inference Report

March 12, 2026

On SWE-rebench, the top tier has stabilized: Claude Code holds 52.9%, Junie sits at 52.1%, and Claude Opus 4.6 is tied with gpt-5.2-xhigh at 51.7%, with no movement from the prior rankings.

The cross-benchmark gaps are the more interesting story. Claude Opus 4.5 sits at position 8 with 49.7% on Artificial Analysis but at position 12 with 43.8% on SWE-rebench, a 5.9-point gap that signals either methodological differences between the two benchmarks or genuine performance variance across problem distributions. GLM-5 shows an even wider split: position 7 at 49.8% on Artificial Analysis against position 15 at 42.1% on SWE-rebench, a 7.7-point gap that merits scrutiny as to whether it reflects real weakness on these tasks or evaluation instability. Kimi K2.5 declined sharply, from position 12 at 46.8% to position 19 at 37.9%, losing 8.9 points and seven ranking positions. On Artificial Analysis, Kimi K2 Thinking climbed 14 positions (from 27 to 13) with a 2.9-point gain, while Gemini 3 Pro Preview slipped from position 10 to 11 despite holding steady at 48.4%, displaced by Grok 4.20 Beta, which debuted at position 10 with 48.5. Other new Artificial Analysis entries landed further down: NVIDIA Nemotron 3 Super 120B at position 40, LongCat Flash Lite at 97, and Sarvam M at 282, suggesting either benchmark expansion or periodic model rotation.

The divergence between SWE-rebench and Artificial Analysis on models like Claude Opus 4.5 and GLM-5 raises questions about benchmark sensitivity to implementation details and task sampling. Without clarity on how the two evaluation methodologies differ, it is difficult to say whether these gaps reflect real capability variation or measurement artifacts.
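
Most of the movement described above reduces to rank and score deltas between two leaderboard snapshots. A minimal sketch of that bookkeeping, using placeholder entries rather than real leaderboard data:

```python
# Sketch: rank and score deltas between two leaderboard snapshots.
# Entries are placeholders, not values from either leaderboard.
prior = {"model-a": (27, 45.5), "model-b": (10, 48.4)}
current = {"model-a": (13, 48.4), "model-b": (11, 48.4)}

for model in sorted(prior.keys() & current.keys()):
    old_rank, old_score = prior[model]
    new_rank, new_score = current[model]
    # Positive rank delta means the model climbed the leaderboard.
    print(f"{model}: rank {old_rank} -> {new_rank} ({old_rank - new_rank:+d}), "
          f"score {old_score:.1f} -> {new_score:.1f} ({new_score - old_score:+.1f} pts)")
```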

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                       Score
 1  Claude Code                 52.9%
 2  Junie                       52.1%
 3  Claude Opus 4.6             51.7%
 4  gpt-5.2-2025-12-11-xhigh    51.7%
 5  gpt-5.2-2025-12-11-medium   51.0%
 6  gpt-5.1-codex-max           48.5%
 7  Claude Sonnet 4.5           47.1%
 8  Gemini 3 Pro Preview        46.7%
 9  Gemini 3 Flash Preview      46.7%
10  gpt-5.2-codex               45.0%
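
Since SWE-rebench reruns each model five times, a published score is an aggregate over runs rather than a single pass. A minimal sketch of one plausible aggregation, a plain mean of per-run resolved rates; the benchmark's exact aggregation formula is an assumption here:

```python
# Sketch: average resolved rate across repeated runs, the way a benchmark
# that reruns each model five times might aggregate a score. The exact
# SWE-rebench aggregation is an assumption, not confirmed.

def resolved_rate(results: list[bool]) -> float:
    """Fraction of tasks the model resolved in a single run."""
    return sum(results) / len(results)

def benchmark_score(runs: list[list[bool]]) -> float:
    """Mean resolved rate over all runs (e.g., five per model)."""
    return sum(resolved_rate(run) for run in runs) / len(runs)

# Five hypothetical runs over the same four tasks:
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, False],
    [True, True, False, True],
]
print(f"{benchmark_score(runs):.1%}")  # 70.0%
```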

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                   Score  tok/s  $/1M
 1  Gemini 3.1 Pro Preview   57.2    111  $4.50
 2  GPT-5.4                  57       77  $5.63
 3  GPT-5.3 Codex            54       57  $4.81
 4  Claude Opus 4.6          53       53  $10.00
 5  Claude Sonnet 4.6        51.7     60  $6.00
 6  GPT-5.2                  51.3     65  $4.81
 7  GLM-5                    49.8     63  $1.55
 8  Claude Opus 4.5          49.7     57  $10.00
 9  GPT-5.2 Codex            49       72  $4.81
10  Grok 4.20 Beta 0309      48.5    245  $3.00
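
A composite index like this one is, at its simplest, a weighted mean of per-benchmark scores. The sketch below assumes equal weights and generic category names; Artificial Analysis's actual constituents and weighting are its own and are not reproduced here:

```python
# Sketch: a composite intelligence index as a weighted mean of
# per-benchmark scores. Category names and weights are hypothetical;
# this is not Artificial Analysis's published methodology.

def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total

scores = {"coding": 54.0, "math": 61.0, "reasoning": 50.0}
weights = {"coding": 1.0, "math": 1.0, "reasoning": 1.0}
print(f"{composite_index(scores, weights):.1f}")  # 55.0
```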

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #  Model                   tok/s
 1  Grok 4.20 Beta 0309       245
 2  GPT-5 Codex               175
 3  Gemini 3 Flash Preview    164
 4  Qwen3.5 122B A10B         151
 5  MiMo-V2-Flash             133
 6  Gemini 3 Pro Preview      115
 7  Gemini 3.1 Pro Preview    111
 8  GPT-5.1 Codex             108
 9  Qwen3.5 27B                87
10  GLM-4.7                    79
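
Throughput figures like these come from timed generation: output tokens divided by wall-clock decode time. The arithmetic, with made-up measurements:

```python
# Sketch: output tokens per second from one timed generation.
# Numbers are made up; real harnesses usually exclude or separately
# report time-to-first-token.

def tokens_per_second(output_tokens: int, seconds: float) -> float:
    return output_tokens / seconds

output_tokens = 980   # hypothetical tokens streamed back
elapsed = 4.0         # hypothetical wall-clock seconds of decoding
print(f"{tokens_per_second(output_tokens, elapsed):.0f} tok/s")  # 245 tok/s
```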

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #  Model                    $/1M
 1  MiMo-V2-Flash           $0.15
 2  DeepSeek V3.2           $0.315
 3  MiniMax-M2.5            $0.525
 4  GPT-5 mini              $0.688
 5  Qwen3.5 27B             $0.825
 6  GLM-4.7                 $1.00
 7  Kimi K2 Thinking        $1.07
 8  Qwen3.5 122B A10B       $1.10
 9  Gemini 3 Flash Preview  $1.13
10  Kimi K2.5               $1.20
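
The 3:1 blend weights input tokens three times as heavily as output tokens, so blended price = 0.75 × input price + 0.25 × output price per million tokens. A quick sketch (the prices here are hypothetical, not any provider's rate card):

```python
# Sketch: blended $/1M tokens at a 3:1 input/output ratio,
# i.e. 0.75 * input price + 0.25 * output price.

def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    return 0.75 * input_per_1m + 0.25 * output_per_1m

# Hypothetical prices per 1M tokens:
print(f"${blended_price(1.00, 3.00):.2f}/1M")  # $1.50/1M
```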