The Inference Report

March 30, 2026

Claude Opus 4.6 holds first place on SWE-rebench at 65.3%, up from fourth place with a score of 53 on Artificial Analysis, a 12.3-point gain that reflects either a meaningful improvement in the model itself or a substantial methodological divergence between the two benchmarks. The SWE-rebench leaderboard also shows tighter clustering at the top: the gap between first and fifth place narrows to 2.0 points (65.3% to 62.3%), versus 9.5 points in the older dataset (57.2% to 47.7%), suggesting either more homogeneous model performance on software engineering tasks or differences in how the benchmark distributes credit across solution attempts.

Kimi K2.5 and Kimi K2 Thinking both advanced substantially, jumping from positions 16 and 35 on Artificial Analysis (scores of 46.8 and 40.9) to positions 13 and 17 on SWE-rebench (58.5% and 57.4%), which suggests these models were underestimated by the prior evaluation or that they excel specifically at the code completion and repository-level reasoning SWE-rebench targets. Gemini 3 Flash Preview's score rose 6.1 points, from 46.4 on Artificial Analysis (18th) to 52.5% on SWE-rebench (22nd), a gain that outpaces most of the field even as its relative rank slipped.

The SWE-rebench evaluation appears to reward architectural choices or training data aligned with real repository work: GLM-5 and gpt-5.4-2026-03-05-medium perform nearly identically (62.8%), yet their Artificial Analysis scores diverge by 4.2 points (49.8 vs. 54), suggesting the newer benchmark may reduce noise or focus more narrowly on a specific class of engineering problems. Without documentation of changes to the benchmark methodology, evaluation harness, or problem distribution, shifts of this magnitude cannot be confidently attributed to genuine model progress rather than a different measurement regime.
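For readers who want to recompute the cross-benchmark deltas above, here is a minimal Python sketch. The scores are the ones quoted in this report, hard-coded for illustration rather than pulled from either leaderboard.

```python
# Scores quoted above: Artificial Analysis composite vs. SWE-rebench resolved rate.
# Hard-coded from the text of this report; illustration only.
artificial_analysis = {
    "Claude Opus 4.6": 53.0,
    "Kimi K2.5": 46.8,
    "Kimi K2 Thinking": 40.9,
    "Gemini 3 Flash Preview": 46.4,
}
swe_rebench = {
    "Claude Opus 4.6": 65.3,
    "Kimi K2.5": 58.5,
    "Kimi K2 Thinking": 57.4,
    "Gemini 3 Flash Preview": 52.5,
}

for model, aa_score in artificial_analysis.items():
    delta = swe_rebench[model] - aa_score
    print(f"{model}: {aa_score} -> {swe_rebench[model]} ({delta:+.1f} points)")
```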

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
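This summary does not say how the five runs are combined into a single leaderboard score; the sketch below assumes the simplest approach, a plain mean of per-run resolved rates, which is an assumption rather than documented SWE-rebench behavior.

```python
# Minimal sketch of aggregating repeated runs. Assumes (not confirmed by SWE-rebench
# documentation) that the reported score is the mean resolved rate across the five runs.
def resolved_rate(results: list[bool]) -> float:
    """Fraction of tasks the model resolved in a single run."""
    return sum(results) / len(results)

def leaderboard_score(runs: list[list[bool]]) -> float:
    """Average the per-run resolved rates across repeated runs."""
    return sum(resolved_rate(run) for run in runs) / len(runs)

# Toy example: five runs over the same four tasks.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, False],
    [True, True, False, True],
]
print(f"{leaderboard_score(runs) * 100:.1f}%")  # 70.0% for this toy data
```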

#    Model                        Score
1    Claude Opus 4.6              65.3%
2    gpt-5.2-2025-12-11-medium    64.4%
3    GLM-5                        62.8%
4    gpt-5.4-2026-03-05-medium    62.8%
5    Gemini 3.1 Pro Preview       62.3%
6    DeepSeek-V3.2                60.9%
7    Claude Sonnet 4.6            60.7%
8    Claude Sonnet 4.5            60.0%
9    Qwen3.5-397B-A17B            59.9%
10   Step-3.5-Flash               59.6%

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                     Score   tok/s   $/1M
1    GPT-5.4                   57.2    88      $5.63
2    Gemini 3.1 Pro Preview    57.2    114     $4.50
3    GPT-5.3 Codex             54      92      $4.81
4    Claude Opus 4.6           53      59      $10.00
5    Claude Sonnet 4.6         51.7    79      $6.00
6    GPT-5.2                   51.3    83      $4.81
7    GLM-5                     49.8    65      $1.55
8    Claude Opus 4.5           49.7    68      $10.00
9    MiniMax-M2.7              49.6    44      $0.525
10   MiMo-V2-Pro               49.2    95      $1.50

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                     tok/s
1    Grok 4.20 Beta 0309       242
2    GPT-5.4 mini              219
3    GPT-5 Codex               218
4    Gemini 3 Flash Preview    192
5    GPT-5.4 nano              177
6    Qwen3.5 122B A10B         145
7    GPT-5.1 Codex             140
8    MiMo-V2-Flash             137
9    GPT-5.2 Codex             129
10   Gemini 3 Pro Preview      118

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                $/1M
1    MiMo-V2-Flash        $0.15
2    DeepSeek V3.2        $0.315
3    GPT-5.4 nano         $0.463
4    MiniMax-M2.7         $0.525
5    MiniMax-M2.5         $0.525
6    GPT-5 mini           $0.688
7    Qwen3.5 27B          $0.825
8    GLM-4.7              $1.00
9    Kimi K2 Thinking     $1.07
10   Qwen3.5 122B A10B    $1.10
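For reference, the 3:1 blend noted in the caption above works out as follows; the prices in the example are hypothetical placeholders, not figures from any provider.

```python
def blended_cost(input_price: float, output_price: float) -> float:
    """Blended $/1M tokens, weighting input 3:1 over output as in the table above."""
    return (3 * input_price + output_price) / 4

# Hypothetical per-1M-token prices, for illustration only.
print(blended_cost(input_price=0.25, output_price=1.25))  # -> 0.5
```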