The Inference Report

March 18, 2026

The SWE-rebench leaderboard shows compression at the top but no meaningful movement within the tested range. Claude Code, Junie, and Claude Opus 4.6 remain locked in the 51.7 to 52.9 percent band, and the top five models sit within two percentage points of one another, which looks less like progress than like a plateau in the benchmark's discriminative power.

Below that tier, the rankings shift but the scores tell the more interesting story. Claude Opus 4.5 posts 49.7 on Artificial Analysis against 43.8 on SWE-rebench, GLM-5 falls from 49.8 to 42.1 across the same pair, and Kimi K2.5 drops from 46.8 to 37.9. Gaps that size point to differences in what the two evaluations measure, not to model regression. Conversely, Kimi K2 Thinking jumped 2.9 points to 43.8 on SWE-rebench and GLM-4.6 gained 4.6 points to 37.1 on Artificial Analysis, but these gains sit in a region where single-digit swings are routine and may reflect test-set sensitivity rather than architectural breakthroughs.

The two benchmarks also diverge at the top: GPT-5.4 and Gemini 3.1 Pro Preview both score 57.2 on Artificial Analysis, versus Claude Code's 52.9 lead on SWE-rebench. SWE-rebench appears stricter, or tests different problem classes, which makes direct comparison unreliable. Without documented methodology changes, a stable evaluation set, or published confidence intervals, the apparent volatility in mid-tier positions cannot be distinguished from noise. The real finding is not movement but stagnation at the frontier and inconsistency across benchmarks, both of which limit confidence in using either leaderboard as a proxy for practical coding capability.
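
To make the confidence-interval point concrete: below is a minimal sketch, assuming a hypothetical evaluation size of 300 tasks (SWE-rebench does not publish a task count alongside these rankings), of the 95% Wilson interval around the two leading scores.

    import math

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a binomial pass rate."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Hypothetical task count; pass rates from the SWE-rebench table below.
    for model, rate in [("Claude Code", 0.529), ("Claude Opus 4.6", 0.517)]:
        lo, hi = wilson_interval(round(rate * 300), 300)
        print(f"{model}: {rate:.1%} (95% CI {lo:.1%} to {hi:.1%})")
    # Each interval spans roughly +/- 5.6 points, so the 1.2-point gap
    # between the two models is well inside the noise at this sample size.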

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

 #  Model                      Score
 1  Claude Code                52.9%
 2  Junie                      52.1%
 3  Claude Opus 4.6            51.7%
 4  gpt-5.2-2025-12-11-xhigh   51.7%
 5  gpt-5.2-2025-12-11-medium  51.0%
 6  gpt-5.1-codex-max          48.5%
 7  Claude Sonnet 4.5          47.1%
 8  Gemini 3 Pro Preview       46.7%
 9  Gemini 3 Flash Preview     46.7%
10  gpt-5.2-codex              45.0%
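
The five-run protocol means each published score is an aggregate. The leaderboard does not state the aggregation rule; the sketch below assumes a plain mean of per-run resolved rates, with a standard error to expose run-to-run spread. The per-run values are hypothetical.

    from statistics import mean, stdev

    def aggregate_runs(resolved_rates: list[float]) -> tuple[float, float]:
        """Mean resolved rate across repeated runs, plus its standard error."""
        m = mean(resolved_rates)
        se = stdev(resolved_rates) / len(resolved_rates) ** 0.5
        return m, se

    # Hypothetical resolved rates for one model across five runs.
    runs = [0.521, 0.534, 0.525, 0.531, 0.534]
    score, se = aggregate_runs(runs)
    print(f"reported: {score:.1%} +/- {se:.1%}")  # reported: 52.9% +/- 0.3%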

Artificial Analysis composite index across coding, math, and reasoning benchmarks, shown alongside output speed (tok/s) and blended price ($/1M).

 #  Model                   Score  tok/s  $/1M
 1  GPT-5.4                 57.2   80     $5.63
 2  Gemini 3.1 Pro Preview  57.2   113    $4.50
 3  GPT-5.3 Codex           54     69     $4.81
 4  Claude Opus 4.6         53     60     $10.00
 5  Claude Sonnet 4.6       51.7   68     $6.00
 6  GPT-5.2                 51.3   69     $4.81
 7  GLM-5                   49.8   66     $1.55
 8  Claude Opus 4.5         49.7   65     $10.00
 9  GPT-5.2 Codex           49     91     $4.81
10  MiMo-V2-Pro             48.8   0      $0.00
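
Artificial Analysis does not show its weighting in this view. As a sketch only: a composite index of this kind reduces to a weighted mean of per-benchmark scores on a common scale. The category names, scores, and equal weights below are all hypothetical.

    def composite_index(scores: dict[str, float],
                        weights: dict[str, float] | None = None) -> float:
        """Weighted mean of per-benchmark scores on a common 0-100 scale."""
        weights = weights or {k: 1.0 for k in scores}  # default: equal weights
        total = sum(weights.values())
        return sum(scores[k] * weights[k] for k in scores) / total

    # Hypothetical per-category scores for one model.
    scores = {"coding": 55.0, "math": 61.0, "reasoning": 55.6}
    print(f"{composite_index(scores):.1f}")  # 57.2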

Output tokens per second (higher is faster). Only models with an intelligence score of at least 40 are included.

 #  Model                   tok/s
 1  Grok 4.20 Beta 0309     196
 2  Gemini 3 Flash Preview  180
 3  GPT-5 Codex             176
 4  Qwen3.5 122B A10B       151
 5  MiMo-V2-Flash           130
 6  Gemini 3.1 Pro Preview  113
 7  Gemini 3 Pro Preview    110
 8  GPT-5.1 Codex           103
 9  GPT-5.2 Codex           91
10  Qwen3.5 27B             90
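
For context on how throughput figures like these are produced: output speed is completion tokens divided by generation wall time, usually measured over a streamed response. The sketch below uses hypothetical numbers; the streaming call itself is elided because it depends on the provider.

    def output_tokens_per_second(completion_tokens: int, seconds: float) -> float:
        """Output throughput: completion tokens over generation wall time."""
        return completion_tokens / seconds

    # Hypothetical measurement: 1,800 output tokens streamed in 10 seconds.
    print(f"{output_tokens_per_second(1800, 10.0):.0f} tok/s")  # 180 tok/s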

Blended cost per 1M tokens, assuming a 3:1 input-to-output token ratio (lower is cheaper). Only models with an intelligence score of at least 40 are included.

 #  Model                   $/1M
 1  MiMo-V2-Flash           $0.15
 2  DeepSeek V3.2           $0.315
 3  MiniMax-M2.5            $0.525
 4  GPT-5 mini              $0.688
 5  Qwen3.5 27B             $0.825
 6  GLM-4.7                 $1.00
 7  Kimi K2 Thinking        $1.07
 8  Qwen3.5 122B A10B       $1.10
 9  Gemini 3 Flash Preview  $1.13
10  Kimi K2.5               $1.20
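
The 3:1 blend is simple arithmetic: the input price counts three times as heavily as the output price. A minimal sketch; the per-direction prices below are hypothetical, since the table shows only the blended figure.

    def blended_price(input_per_1m: float, output_per_1m: float, ratio: float = 3.0) -> float:
        """Blended $/1M tokens at a given input:output token ratio (3:1 here)."""
        return (ratio * input_per_1m + output_per_1m) / (ratio + 1)

    # Hypothetical prices: $0.10/1M input, $0.30/1M output.
    print(f"${blended_price(0.10, 0.30):.2f}/1M")  # $0.15/1M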