The Inference Report

March 17, 2026

The SWE-rebench leaderboard shows stasis at the top, with Claude Code holding 52.9% and Junie close behind at 52.1%, while the middle tiers diverge sharply from the Artificial Analysis rankings. Claude Opus 4.5 sits at position 8 with 49.7% on Artificial Analysis but position 12 with 43.8% on SWE-rebench, a gap wide enough to warrant scrutiny: either the two evaluations weight different capabilities, or the model genuinely underperforms on this particular task distribution. Kimi K2 Thinking shows the opposite split, at position 28 with 40.9% on SWE-rebench against position 13 with 43.8% on Artificial Analysis, suggesting the composite index rewards strengths that repository-level software tasks do not exercise.

Gemini 3 Pro Preview is a milder case: position 11 at 48.4% on Artificial Analysis against position 8 at 46.7% on SWE-rebench, a modest score gap consistent with natural variance, though even this divergence hints at methodological differences in how the two evaluations score the same model. GLM-5 is the starkest example, falling from position 7 with 49.8% on Artificial Analysis to position 15 with 42.1% on SWE-rebench, a 7.7-point gap that is difficult to attribute to random noise and that suggests these benchmarks are testing different aspects of code generation capability.

The lack of movement in the top five positions, combined with the large swings in the 7-20 range, indicates that SWE-rebench is sensitive enough to detect real differences but that the frontier models have plateaued relative to their challengers. The pattern is worth monitoring across future cycles to determine whether we are seeing genuine convergence at the top or measurement instability further down.
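This cross-leaderboard comparison can be mechanized rather than done by hand. Below is a minimal sketch using (rank, score) pairs hand-copied from this issue's tables; treating the model names as directly comparable across the two sources is an assumption, since the sites spell and version names differently.

```python
# Sketch: quantify per-model divergence between the two leaderboards.
# Values are hand-copied from the tables in this issue.

swe_rebench = {          # model -> (rank, score %) on SWE-rebench
    "Claude Opus 4.5": (12, 43.8),
    "Kimi K2 Thinking": (28, 40.9),
    "Gemini 3 Pro Preview": (8, 46.7),
    "GLM-5": (15, 42.1),
}
artificial_analysis = {  # model -> (rank, score) on Artificial Analysis
    "Claude Opus 4.5": (8, 49.7),
    "Kimi K2 Thinking": (13, 43.8),
    "Gemini 3 Pro Preview": (11, 48.4),
    "GLM-5": (7, 49.8),
}

for model in swe_rebench:
    sr_rank, sr_score = swe_rebench[model]
    aa_rank, aa_score = artificial_analysis[model]
    # Arrow reads: Artificial Analysis rank -> SWE-rebench rank.
    print(f"{model}: rank {aa_rank}->{sr_rank}, "
          f"score gap {aa_score - sr_score:+.1f} pts")
```

Running this reproduces the gaps discussed above, including GLM-5's +7.7-point spread.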

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
| 6 | gpt-5.1-codex-max | 48.5% |
| 7 | Claude Sonnet 4.5 | 47.1% |
| 8 | Gemini 3 Pro Preview | 46.7% |
| 9 | Gemini 3 Flash Preview | 46.7% |
| 10 | gpt-5.2-codex | 45.0% |
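The five-runs-per-model protocol described above amounts to reporting a mean resolved rate across independent runs. A minimal sketch of that aggregation follows, assuming per-run results are available as one boolean per task; the data below is hypothetical and the actual SWE-rebench harness format is not specified here.

```python
from statistics import mean, stdev

# Sketch: aggregate five independent runs into one score, per the
# SWE-rebench methodology. run_results is hypothetical data:
# one list per run, one boolean per task (True = task resolved).
run_results = [
    [True, False, True, True],   # run 1
    [True, False, False, True],  # run 2
    [True, True,  True, True],   # run 3
    [True, False, True, False],  # run 4
    [True, False, True, True],   # run 5
]

per_run_scores = [100 * mean(run) for run in run_results]
print(f"mean resolved rate: {mean(per_run_scores):.1f}%")   # 70.0%
print(f"run-to-run stdev:   {stdev(per_run_scores):.1f} pts")
```

The run-to-run spread is the quantity the five-run protocol exists to average out; a single run could land anywhere in that range.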

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | GPT-5.4 | 57.2 | 80 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 56 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 61 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 75 | $4.81 |
| 7 | GLM-5 | 49.8 | 66 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 65 | $10.00 |
| 9 | GPT-5.2 Codex | 49 | 108 | $4.81 |
| 10 | Grok 4.20 Beta 0309 | 48.5 | 213 | $3.00 |
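The caption describes a composite built from category scores. The sketch below shows the simplest form of that construction, an equal-weight mean; both the weights and the per-category numbers are assumptions for illustration, since Artificial Analysis's actual weighting is not given here.

```python
# Sketch: a composite index as a weighted mean of category scores.
# Category names follow the caption above; the weights and scores
# are hypothetical (the real weighting is not published here).
category_scores = {"coding": 54.0, "math": 61.0, "reasoning": 57.0}
weights = {"coding": 1 / 3, "math": 1 / 3, "reasoning": 1 / 3}  # assumed equal

composite = sum(category_scores[c] * weights[c] for c in category_scores)
print(f"composite index: {composite:.1f}")  # 57.3
```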

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| # | Model | tok/s |
|---|-------|-------|
| 1 | Grok 4.20 Beta 0309 | 213 |
| 2 | GPT-5 Codex | 203 |
| 3 | Gemini 3 Flash Preview | 179 |
| 4 | Qwen3.5 122B A10B | 159 |
| 5 | GPT-5.1 Codex | 140 |
| 6 | MiMo-V2-Flash | 127 |
| 7 | Gemini 3.1 Pro Preview | 114 |
| 8 | GPT-5.1 | 111 |
| 9 | Gemini 3 Pro Preview | 110 |
| 10 | GPT-5.2 Codex | 108 |
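Throughput figures like these are straightforward to reproduce against any streaming endpoint. A minimal sketch, assuming only that you can obtain an iterable of output tokens as they arrive; the simulated stream below stands in for a real provider client, whose API will differ.

```python
import time

def tokens_per_second(stream):
    """Output throughput: tokens received / wall-clock seconds elapsed."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream)
    return n_tokens / (time.perf_counter() - start)

# Simulated stream for illustration; a real client would yield tokens
# over the network instead.
def simulated_stream(n=200, delay_s=0.005):
    for _ in range(n):
        time.sleep(delay_s)  # stand-in for network/generation latency
        yield "tok"

print(f"{tokens_per_second(simulated_stream()):.0f} tok/s")  # ~200 tok/s
```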

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | MiniMax-M2.5 | $0.525 |
| 4 | GPT-5 mini | $0.688 |
| 5 | Qwen3.5 27B | $0.825 |
| 6 | GLM-4.7 | $1.00 |
| 7 | Kimi K2 Thinking | $1.07 |
| 8 | Qwen3.5 122B A10B | $1.10 |
| 9 | Gemini 3 Flash Preview | $1.13 |
| 10 | Kimi K2.5 | $1.20 |
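The 3:1 blend in the caption works out to a simple weighted average of input and output prices. A minimal sketch of the arithmetic; the per-direction prices below are hypothetical and chosen only for illustration.

```python
def blended_price(input_per_1m, output_per_1m, ratio=(3, 1)):
    """Blended $/1M tokens at a fixed input:output ratio (3:1 here)."""
    w_in, w_out = ratio
    return (w_in * input_per_1m + w_out * output_per_1m) / (w_in + w_out)

# Hypothetical per-direction prices, for illustration only:
print(f"${blended_price(0.25, 1.25):.3f} / 1M tokens")  # $0.500
```

The 3:1 weighting reflects the assumption that typical workloads consume roughly three input tokens for every output token, which is why output-heavy pricing hurts less in the blended figure than the raw output rate suggests.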