The Inference Report

March 22, 2026

The top tier on SWE-rebench remains frozen: Claude Code holds 52.9%, Junie 52.1%, and the next three models cluster between 51.7% and 51.0%, with no movement in the top five positions. Below that line, volatility increases sharply. Claude Opus 4.5 dropped from rank 8 to rank 12 while losing 5.9 percentage points (49.7% to 43.8%), a decline that suggests either benchmark drift or a methodological shift in how the evaluation weights problem categories. Kimi K2 Thinking gained 2.9 points and jumped from rank 34 to rank 13, the largest upward move in the dataset. Gemini 3 Pro Preview climbed from rank 13 to rank 8 despite losing 1.7 points, a ranking shift driven by larger losses elsewhere in the field.

The cross-benchmark picture is murkier. GLM-5 gained 7.7 points (42.1 to 49.8) and climbed from rank 15 to rank 7 on the Artificial Analysis index, yet on SWE-rebench it held at rank 15 with a slight decline, a divergence suggesting the two benchmarks sample different problem distributions or that GLM-5's movement is concentrated in specific task categories. Kimi K2.5 fell 8.9 points on Artificial Analysis (46.8 to 37.9) and lost ground on SWE-rebench while holding at rank 19, a rare case of decline on both metrics at once.

The data shows no clear pattern of across-the-board improvement; individual models post both gains and losses, which raises the question of whether these benchmarks are tracking genuine capability shifts or whether evaluation procedures, test-set composition, or model versioning have changed between measurement cycles.

Cole Brennan
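
The movers above are easiest to sanity-check as a diff between two leaderboard snapshots. Here is a minimal sketch in Python, assuming each snapshot is a plain name -> (rank, score) mapping (an assumed shape, not SWE-rebench's export format); the two entries reuse the figures quoted in the lead.

    # Compare two leaderboard snapshots: rank delta and score delta per model.
    # Snapshot shape (name -> (rank, score)) is assumed for illustration.
    # Gemini's previous score is derived from the 1.7-point loss quoted above.
    previous = {"Claude Opus 4.5": (8, 49.7), "Gemini 3 Pro Preview": (13, 48.4)}
    current  = {"Claude Opus 4.5": (12, 43.8), "Gemini 3 Pro Preview": (8, 46.7)}

    for model, (cur_rank, cur_score) in current.items():
        prev_rank, prev_score = previous[model]
        rank_delta = cur_rank - prev_rank      # negative = moved up the table
        score_delta = cur_score - prev_score   # percentage points
        print(f"{model}: rank {prev_rank}->{cur_rank} ({rank_delta:+d}), "
              f"score {score_delta:+.1f} pts")

On these entries it reports Opus 4.5 down four places and 5.9 points, and Gemini 3 Pro Preview up five places while down 1.7 points: climbing on a loss is only possible when the surrounding field loses more.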

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                       Score
 1  Claude Code                 52.9%
 2  Junie                       52.1%
 3  Claude Opus 4.6             51.7%
 4  gpt-5.2-2025-12-11-xhigh    51.7%
 5  gpt-5.2-2025-12-11-medium   51.0%
 6  gpt-5.1-codex-max           48.5%
 7  Claude Sonnet 4.5           47.1%
 8  Gemini 3 Pro Preview        46.7%
 9  Gemini 3 Flash Preview      46.7%
10  gpt-5.2-codex               45.0%
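
Since each model is run five times, the published score reads naturally as a mean over runs. Below is a minimal sketch of that aggregation; the per-run resolved rates are invented, chosen only to average to the top two figures above, and statistics is Python's standard library.

    # Aggregate five runs per model: report the mean and the spread.
    from statistics import mean, stdev

    runs = {
        "Claude Code": [53.4, 52.1, 53.0, 52.6, 53.4],  # hypothetical per-run %
        "Junie":       [51.5, 52.8, 51.9, 52.2, 52.1],
    }
    for model, scores in runs.items():
        print(f"{model}: {mean(scores):.1f}% (±{stdev(scores):.1f} across runs)")

Even a half-point spread per run is enough to blur adjacent positions in a table this tightly packed, which is one reason the sub-point gaps between ranks 3 through 5 are best read as effective ties.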

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

 #  Model                    Score   tok/s     $/1M
 1  GPT-5.4                   57.2      85    $5.63
 2  Gemini 3.1 Pro Preview    57.2     118    $4.50
 3  GPT-5.3 Codex             54        71    $4.81
 4  Claude Opus 4.6           53        51   $10.00
 5  Claude Sonnet 4.6         51.7      66    $6.00
 6  GPT-5.2                   51.3      75    $4.81
 7  GLM-5                     49.8      89    $1.55
 8  Claude Opus 4.5           49.7      58   $10.00
 9  MiniMax-M2.7              49.6      43   $0.525
10  MiMo-V2-Pro               49.2       0    $0.00
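
For the composite itself, only its shape is sketched here: an average over category scores. Artificial Analysis's actual categories, weights, and normalization are not published in this table, so the values below and the unweighted mean are assumptions for illustration, not their methodology.

    # Composite index as an unweighted mean of category scores (assumed form).
    categories = {"coding": 55.0, "math": 61.2, "reasoning": 55.4}  # hypothetical

    composite = sum(categories.values()) / len(categories)
    print(f"composite: {composite:.1f}")  # 57.2 with these invented inputs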

Output tokens per second — higher is faster. Minimum intelligence score of 40.

 #  Model                    tok/s
 1  GPT-5.4 mini               235
 2  GPT-5.4 nano               209
 3  Gemini 3 Flash Preview     193
 4  GPT-5 Codex                170
 5  Qwen3.5 122B A10B          154
 6  Grok 4.20 Beta 0309        145
 7  MiMo-V2-Flash              142
 8  GPT-5.1 Codex              122
 9  Gemini 3 Pro Preview       120
10  Gemini 3.1 Pro Preview     118

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

 #  Model                   $/1M
 1  MiMo-V2-Flash           $0.15
 2  DeepSeek V3.2           $0.315
 3  GPT-5.4 nano            $0.463
 4  MiniMax-M2.7            $0.525
 5  MiniMax-M2.5            $0.525
 6  GPT-5 mini              $0.688
 7  Qwen3.5 27B             $0.825
 8  GLM-4.7                 $1.00
 9  Kimi K2 Thinking        $1.07
10  Qwen3.5 122B A10B       $1.10
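
The blended price folds two per-token rates into one: at a 3:1 input/output mix, three of every four tokens bill at the input rate. A minimal sketch; the example prices are invented, and only the 3:1 weighting comes from the caption above.

    # Blend input and output $/1M prices at a 3:1 token ratio.
    def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
        return (3 * input_per_1m + 1 * output_per_1m) / 4

    # e.g. $0.80 in / $1.60 out blends to $1.00 per 1M tokens
    print(f"${blended_cost(0.80, 1.60):.2f}")

The weighting matters for agentic coding workloads, which often skew far more input-heavy than 3:1 once file contents and tool output land in context, so the cheap-input models here can look even cheaper in practice.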