The Inference Report

March 26, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the tier immediately below has compressed: gpt-5.2-2025-12-11-medium, GLM-5, and gpt-5.4-2026-03-05-medium now cluster between 62.8% and 64.4%. GLM-5 climbed from #7 to #3 and sits 13 percentage points above its Artificial Analysis score. Gemini 3.1 Pro Preview dropped from #2 to #5 on SWE-rebench despite scoring 62.3%, itself 5.1 points above its Artificial Analysis score of 57.2, which suggests the two benchmarks measure different problem distributions or apply different evaluation rigor.

Kimi K2.5 and Kimi K2 Thinking both posted substantial gains, 12.5 and 16.5 points respectively on Artificial Analysis, and moved up the SWE-rebench ranks to #13 and #17. The magnitude of those jumps raises questions: were the models retrained, fine-tuned on benchmark data, or evaluated under materially different protocols on the two systems?

The broader pattern has Claude models and GPT variants dominating the SWE-rebench top ten while Chinese models (GLM-5, the Kimi variants, the Qwen lines) narrow the gap, and the ranking divergence for several mid-tier models is a reminder that these benchmarks are not interchangeable proxies for coding ability. MiMo-V2-Omni dropped out of the Artificial Analysis rankings entirely despite previously scoring 43.4, a notable exit that deserves clarification: was the model discontinued, or did it simply fail to meet evaluation criteria this cycle?

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance (a sketch of that aggregation follows the table).

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |
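SWE-rebench's aggregation code is not reproduced in this report, so the following is a minimal sketch under one assumption: each of the five runs yields a per-task resolved/unresolved outcome, and the reported score is the mean resolved rate across runs. The function name and sample data are illustrative, not taken from the benchmark.

```python
from statistics import mean, stdev

def aggregate_runs(runs: list[list[bool]]) -> tuple[float, float]:
    """Aggregate repeated evaluation runs of one model.

    runs: one list of per-task booleans (task resolved or not) per run.
    Returns the mean resolved rate across runs and the run-to-run spread,
    both in percentage points.
    """
    per_run_rates = [100 * sum(run) / len(run) for run in runs]
    return mean(per_run_rates), stdev(per_run_rates)

# Hypothetical data: five runs over the same four tasks.
runs = [
    [True, True, False, True],   # run 1: 75% resolved
    [True, False, False, True],  # run 2: 50%
    [True, True, False, True],   # run 3: 75%
    [True, True, True, True],    # run 4: 100%
    [True, False, False, True],  # run 5: 50%
]
score, spread = aggregate_runs(runs)
print(f"{score:.1f}% +/- {spread:.1f}")  # 70.0% +/- 20.9
```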

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | GPT-5.4 | 57.2 | 74 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 113 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 78 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 51 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 72 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 72 | $4.81 |
| 7 | GLM-5 | 49.8 | 69 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 59 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 47 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 93 | $1.50 |
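The cross-benchmark gaps called out in the summary (13 points for GLM-5, 5.1 for Gemini 3.1 Pro Preview) are simple differences between the two leaderboards. A minimal sketch that recomputes them; the dictionaries are hand-copied from this issue's tables, and the two scales are not directly comparable, which is part of the point.

```python
# Scores hand-copied from this issue's two tables.
swe_rebench = {
    "Claude Opus 4.6": 65.3,
    "GLM-5": 62.8,
    "Gemini 3.1 Pro Preview": 62.3,
    "Claude Sonnet 4.6": 60.7,
}
artificial_analysis = {
    "Claude Opus 4.6": 53.0,
    "GLM-5": 49.8,
    "Gemini 3.1 Pro Preview": 57.2,
    "Claude Sonnet 4.6": 51.7,
}

# Positive delta: the model does better on SWE-rebench than on the
# Artificial Analysis composite index.
deltas = {
    model: round(swe_rebench[model] - artificial_analysis[model], 1)
    for model in swe_rebench.keys() & artificial_analysis.keys()
}
for model, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{model:<24} {delta:+.1f}")
# GLM-5 +13.0, Claude Opus 4.6 +12.3, Claude Sonnet 4.6 +9.0,
# Gemini 3.1 Pro Preview +5.1
```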

Output tokens per second; higher is faster. Only models with an intelligence score of at least 40 are included.

| # | Model | tok/s |
|---|-------|-------|
| 1 | GPT-5.4 nano | 221 |
| 2 | Grok 4.20 Beta 0309 | 218 |
| 3 | GPT-5.4 mini | 218 |
| 4 | Gemini 3 Flash Preview | 195 |
| 5 | GPT-5 Codex | 190 |
| 6 | Qwen3.5 122B A10B | 134 |
| 7 | MiMo-V2-Flash | 129 |
| 8 | GPT-5.1 Codex | 118 |
| 9 | Gemini 3 Pro Preview | 115 |
| 10 | Gemini 3.1 Pro Preview | 113 |

Blended cost per 1M tokens at a 3:1 input/output ratio; lower is cheaper. Only models with an intelligence score of at least 40 are included. A worked example of the blend follows the table.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
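How the blended figure works: at the stated 3:1 input/output ratio it is presumably a token-weighted average of the per-direction prices (Artificial Analysis's exact weighting is not spelled out here, so treat this as an assumption). The per-direction rates in the example are placeholders, not published prices.

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Token-weighted blend of input and output prices per 1M tokens."""
    total = input_ratio + output_ratio
    return (input_ratio * input_per_1m + output_ratio * output_per_1m) / total

# Hypothetical rates: $0.10 per 1M input tokens and $0.30 per 1M output tokens
# blend to $0.15 per 1M at a 3:1 ratio.
print(f"${blended_price(0.10, 0.30):.3f}")  # $0.150
```

Because the blend weights input three times as heavily as output, models with cheap input tokens look best here; an output-heavy agentic workload would reweight, and could reorder, this list.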