The Inference Report

March 28, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the tier immediately below has solidified around 62 to 64 percent: gpt-5.2-2025-12-11-medium at 64.4%, with GLM-5 and gpt-5.4-2026-03-05-medium tied at 62.8%.

The more meaningful shifts appear when the SWE-rebench ranks are set against the Artificial Analysis composite. Several models place markedly higher here than there: Claude Opus 4.6 ranks 4th on Artificial Analysis but 1st on SWE-rebench, scoring 12.3 points above its composite of 53; GLM-5 moves from 7th to 3rd (+13.0 points); Kimi K2.5 from 16th to 13th (+11.7); and Kimi K2 Thinking from 35th to 17th (+16.5). Gemini 3.1 Pro Preview goes the other way, slipping from 2nd on Artificial Analysis to 5th on SWE-rebench even though its SWE-rebench score of 62.3 sits 5.1 points above its composite. The divergence is most pronounced for the Claude and Kimi models, which suggests the two benchmarks weight different problem classes or evaluation criteria.

The spread also differs. The gap between first and tenth place on SWE-rebench is 5.7 percentage points, indicating a tight top end, while the Artificial Analysis leaderboard is steeper, with the top model at 57.2 and rank 10 at 49.2, an 8-point spread. Without access to the specific methodological differences between the two benchmarks, it is unclear whether the SWE-rebench gains reflect genuine capability improvements or a benchmark that rewards different architectural or training choices.
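The cross-benchmark deltas above can be reproduced directly from the two tables below; a minimal sketch, with the model set trimmed to three entries for brevity:

```python
# Cross-leaderboard comparison: for each model, the difference between its
# SWE-rebench score and its Artificial Analysis composite score.
# Scores are copied from the two tables in this issue.
swe_rebench = {
    "Claude Opus 4.6": 65.3,
    "GLM-5": 62.8,
    "Gemini 3.1 Pro Preview": 62.3,
}
artificial_analysis = {
    "Claude Opus 4.6": 53.0,
    "GLM-5": 49.8,
    "Gemini 3.1 Pro Preview": 57.2,
}

deltas = {
    model: round(swe_rebench[model] - artificial_analysis[model], 1)
    for model in swe_rebench
}
print(deltas)
# {'Claude Opus 4.6': 12.3, 'GLM-5': 13.0, 'Gemini 3.1 Pro Preview': 5.1}
```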

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
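The five-run protocol is straightforward to sketch with a simple mean over runs; the per-run resolve rates below are hypothetical, not actual SWE-rebench data:

```python
# Hypothetical per-run resolve rates for one model across five independent
# runs. Averaging over repeated runs is one simple way to account for the
# stochastic variance the benchmark description mentions.
runs = [0.64, 0.66, 0.65, 0.63, 0.67]  # fraction of tasks resolved per run
mean_score = sum(runs) / len(runs)
print(f"{mean_score:.1%}")  # 65.0%
```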

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
| 6 | DeepSeek-V3.2 | 60.9% |
| 7 | Claude Sonnet 4.6 | 60.7% |
| 8 | Claude Sonnet 4.5 | 60.0% |
| 9 | Qwen3.5-397B-A17B | 59.9% |
| 10 | Step-3.5-Flash | 59.6% |

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | GPT-5.4 | 57.2 | 81 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 74 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 53 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 66 | $6.00 |
| 6 | GPT-5.2 | 51.3 | 72 | $4.81 |
| 7 | GLM-5 | 49.8 | 63 | $1.55 |
| 8 | Claude Opus 4.5 | 49.7 | 64 | $10.00 |
| 9 | MiniMax-M2.7 | 49.6 | 47 | $0.525 |
| 10 | MiMo-V2-Pro | 49.2 | 93 | $1.50 |

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| # | Model | tok/s |
|---|-------|-------|
| 1 | Grok 4.20 Beta 0309 | 238 |
| 2 | GPT-5.4 mini | 198 |
| 3 | Gemini 3 Flash Preview | 184 |
| 4 | GPT-5 Codex | 181 |
| 5 | GPT-5.4 nano | 160 |
| 6 | Qwen3.5 122B A10B | 134 |
| 7 | MiMo-V2-Flash | 129 |
| 8 | GPT-5.1 Codex | 118 |
| 9 | Gemini 3 Pro Preview | 115 |
| 10 | Gemini 3.1 Pro Preview | 114 |

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V3.2 | $0.315 |
| 3 | GPT-5.4 nano | $0.463 |
| 4 | MiniMax-M2.7 | $0.525 |
| 5 | MiniMax-M2.5 | $0.525 |
| 6 | GPT-5 mini | $0.688 |
| 7 | Qwen3.5 27B | $0.825 |
| 8 | GLM-4.7 | $1.00 |
| 9 | Kimi K2 Thinking | $1.07 |
| 10 | Qwen3.5 122B A10B | $1.10 |
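The blended figure weights input and output prices 3:1, i.e. three input tokens assumed per output token. A minimal sketch of the arithmetic, using hypothetical per-token prices rather than any provider's actual rates:

```python
# Blended $/1M tokens at a 3:1 input:output ratio: input price gets
# weight 3/4 and output price gets weight 1/4.
def blended_cost(input_per_1m: float, output_per_1m: float) -> float:
    return 0.75 * input_per_1m + 0.25 * output_per_1m

# Hypothetical pricing: $2.00/1M input tokens, $10.00/1M output tokens.
print(blended_cost(2.00, 10.00))  # 4.0
```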