The Inference Report

March 31, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, 12.3 points above its Artificial Analysis composite of 53, while gpt-5.2-2025-12-11-medium sits second at 64.4% and GLM-5 and gpt-5.4-2026-03-05-medium tie at 62.8%. The SWE-rebench leaderboard shows material reshuffling in the upper tier: Kimi K2.5 climbed from rank 16 (46.8) to rank 13 (58.5), Kimi K2 Thinking jumped from rank 35 (40.9) to rank 17 (57.4), and Gemini 3 Flash Preview slipped from rank 18 to rank 22 even as its score rose from 46.4 to 52.5; in every case the absolute scores improved substantially, even where the relative position did not.

The Artificial Analysis leaderboard, which uses a different evaluation methodology, remains largely stable in its upper rankings, with GPT-5.4 and Gemini 3.1 Pro Preview tied at 57.2, though KAT Coder Pro V2 entered at rank 23 with 43.8 and Nemotron Cascade 2 30B appeared at rank 81 with 27.7.

The gap between the two benchmarks' top scores (SWE-rebench's 65.3 versus Artificial Analysis's 57.2) is at least partly explained by the fact that they measure different things: SWE-rebench reports a resolved rate on software engineering tasks, while Artificial Analysis is a composite index across coding, math, and reasoning benchmarks. Without a task-level comparison, it remains unclear whether the higher SWE-rebench numbers reflect easier test cases, a different task distribution, or genuine performance differences on overlapping problems. The consistency of model ordering within each benchmark indicates both are internally coherent, but the divergence between them argues for caution in treating either as a complete picture of coding ability.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                        Score
1    Claude Opus 4.6              65.3%
2    gpt-5.2-2025-12-11-medium    64.4%
3    GLM-5                        62.8%
4    gpt-5.4-2026-03-05-medium    62.8%
5    Gemini 3.1 Pro Preview       62.3%
6    DeepSeek-V3.2                60.9%
7    Claude Sonnet 4.6            60.7%
8    Claude Sonnet 4.5            60.0%
9    Qwen3.5-397B-A17B            59.9%
10   Step-3.5-Flash               59.6%
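SWE-rebench runs each model five times, but the description above doesn't say how those runs are combined into the single figure shown here. The sketch below simply assumes the reported score is the mean resolved rate across runs; the per-run numbers are invented for illustration.

```python
# Hypothetical five-run aggregation: averaging is an assumption about how
# SWE-rebench combines runs, and these per-run resolved rates are made up.
from statistics import mean, stdev

runs = [65.9, 64.8, 65.5, 64.7, 65.6]  # resolved rate (%) for each of 5 runs

print(f"mean = {mean(runs):.1f}%  stdev = {stdev(runs):.2f}")
```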

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                     Score   tok/s   $/1M
1    GPT-5.4                   57.2    96      $5.63
2    Gemini 3.1 Pro Preview    57.2    120     $4.50
3    GPT-5.3 Codex             54      94      $4.81
4    Claude Opus 4.6           53      61      $10.00
5    Claude Sonnet 4.6         51.7    79      $6.00
6    GPT-5.2                   51.3    81      $4.81
7    GLM-5                     49.8    65      $1.55
8    Claude Opus 4.5           49.7    64      $10.00
9    MiniMax-M2.7              49.6    45      $0.525
10   MiMo-V2-Pro               49.2    0       $1.50
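The opening note cautions that the two leaderboards diverge even though each is internally consistent. One way to make that concrete is a rank correlation over the models that appear in both top-10 tables; the sketch below uses the four unambiguous overlaps from this issue, and treating those two columns as paired observations is an editorial choice rather than anything either leaderboard publishes.

```python
# Spearman rank correlation between the two leaderboards, restricted to the
# four models that appear in both top-10 tables above. Pairing the columns
# this way is an editorial assumption made purely for illustration.
def rank(values):
    """1-based ranks with the highest value ranked 1 (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Classic d^2 formula; valid only when there are no tied scores."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(xs), rank(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

models = ["Claude Opus 4.6", "GLM-5", "Gemini 3.1 Pro Preview", "Claude Sonnet 4.6"]
swe_rebench = [65.3, 62.8, 62.3, 60.7]          # SWE-rebench resolved rate (%)
artificial_analysis = [53.0, 49.8, 57.2, 51.7]  # Artificial Analysis composite

print(f"Spearman rho over {len(models)} shared models: "
      f"{spearman(swe_rebench, artificial_analysis):.2f}")
```

With only four shared models the coefficient is far too noisy to mean much; the point is the shape of the comparison, not the number.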

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                     tok/s
1    Grok 4.20 Beta 0309       242
2    GPT-5.4 mini              219
3    GPT-5 Codex               215
4    Gemini 3 Flash Preview    193
5    GPT-5.4 nano              177
6    GPT-5.1 Codex             155
7    Qwen3.5 122B A10B         145
8    GPT-5.2 Codex             129
9    Gemini 3 Pro Preview      123
10   MiMo-V2-Flash             123
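Both the speed and cost rankings apply the same eligibility rule, a minimum intelligence score of 40, before sorting. A small sketch of that filter-and-sort step follows; the records are hypothetical placeholders, not rows pulled from Artificial Analysis.

```python
# Filter-and-sort step behind the speed and cost tables: drop anything below
# an intelligence score of 40, then rank by the relevant column.
# These records are hypothetical, not Artificial Analysis data.
from dataclasses import dataclass

@dataclass
class ModelRow:
    name: str
    score: float      # composite intelligence score
    tok_per_s: float  # output throughput
    usd_per_m: float  # blended $/1M tokens

rows = [
    ModelRow("model-a", 52.0, 180.0, 2.40),
    ModelRow("model-b", 38.5, 240.0, 0.90),  # fastest, but excluded (< 40)
    ModelRow("model-c", 44.1, 130.0, 0.60),
]

eligible = [r for r in rows if r.score >= 40]
by_speed = sorted(eligible, key=lambda r: r.tok_per_s, reverse=True)  # higher is faster
by_cost  = sorted(eligible, key=lambda r: r.usd_per_m)                # lower is cheaper

print([r.name for r in by_speed])  # ['model-a', 'model-c']
print([r.name for r in by_cost])   # ['model-c', 'model-a']
```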

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                $/1M
1    MiMo-V2-Flash        $0.15
2    DeepSeek V3.2        $0.315
3    GPT-5.4 nano         $0.463
4    MiniMax-M2.7         $0.525
5    KAT Coder Pro V2     $0.525
6    MiniMax-M2.5         $0.525
7    GPT-5 mini           $0.688
8    Qwen3.5 27B          $0.825
9    GLM-4.7              $1.00
10   Kimi K2 Thinking     $1.07
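The blended figure weights input and output pricing 3:1; since the per-direction rates aren't listed here, the sketch below just shows the arithmetic with hypothetical prices.

```python
# Blended $/1M tokens at a 3:1 input/output mix: three parts input price to
# one part output price. The example prices are hypothetical, not list prices.
def blended_cost(input_usd_per_m: float, output_usd_per_m: float) -> float:
    return (3 * input_usd_per_m + output_usd_per_m) / 4

# e.g. $0.30/1M input and $1.20/1M output blend to $0.525/1M
print(f"${blended_cost(0.30, 1.20):.3f} per 1M tokens")
```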