The Inference Report

April 27, 2026

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous cycle, while the second tier has tightened considerably around 62-64%: gpt-5.2-2025-12-11-medium, GLM-5, gpt-5.4-2026-03-05-medium, and GLM-5.1 now sit within less than two points of one another. The most significant movement is in the mid-tier. GLM-4.7 climbed from rank 42 at 42.1% to rank 14 at 58.7%, a gain of 16.6 points; Kimi K2.5 jumped from rank 27 at 46.8% to rank 16 at 58.5%; and Kimi K2 Thinking advanced from rank 51 at 40.9% to rank 21 at 57.4%. These are genuine score gains on coding tasks, not ranking artifacts produced by movement elsewhere in the table. Gemini 3.1 Pro Preview, by contrast, dropped from rank 3 to rank 6 while holding 62.3%, which reflects the compression of scores in the upper tier rather than any degradation.

On the Artificial Analysis index, the rankings remain relatively stable at the extremes, with GPT-5.5 continuing to lead at 60.2 and Claude Opus 4.7 at 57.3. The divergence between the two leaderboards is pronounced: Claude Opus 4.6 scores 65.3% on SWE-rebench but only 53 on Artificial Analysis, which suggests the benchmarks measure distinct capabilities, or that SWE-rebench has a different task-difficulty distribution. Without visibility into whether the SWE-rebench task set changed or the models were simply retested, the GLM and Kimi gains warrant scrutiny: they may reflect real algorithmic advances, or they may partly be evaluation variance.
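For readers who want to verify the deltas quoted above, here is a small Python sketch that recomputes them from the published figures; it uses no data beyond the numbers already cited in this issue.

```python
# Recompute the score deltas quoted above from the published figures.
movers = {
    # model: (previous score %, current score %)
    "GLM-4.7": (42.1, 58.7),
    "Kimi K2.5": (46.8, 58.5),
    "Kimi K2 Thinking": (40.9, 57.4),
}

for model, (previous, current) in movers.items():
    print(f"{model}: {previous}% -> {current}% (+{current - previous:.1f} points)")

# Cross-leaderboard gap for Claude Opus 4.6 (SWE-rebench % vs. Artificial Analysis index).
swe_rebench, artificial_analysis = 65.3, 53.0
print(f"Claude Opus 4.6 gap: {swe_rebench - artificial_analysis:.1f} points")
```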

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.
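The five-run protocol matters because agentic coding results are noisy from run to run. Below is a minimal sketch of how per-run resolution rates might be combined into a single score, assuming a plain mean plus sample standard deviation; SWE-rebench's actual aggregation is not documented here, and the run values are hypothetical.

```python
import statistics

def aggregate_runs(resolved_fractions: list[float]) -> tuple[float, float]:
    """Combine per-run resolution rates (fractions in [0, 1]) into a mean and
    a sample standard deviation. Hypothetical helper; SWE-rebench's actual
    aggregation may differ."""
    mean = statistics.mean(resolved_fractions)
    spread = statistics.stdev(resolved_fractions) if len(resolved_fractions) > 1 else 0.0
    return mean, spread

# Hypothetical example: five independent runs of one model over the task set.
runs = [0.655, 0.648, 0.660, 0.651, 0.653]
mean, spread = aggregate_runs(runs)
print(f"resolved: {mean:.1%} +/- {spread:.1%}")
```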

| # | Model | Score |
|---|-------|-------|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
| 6 | Gemini 3.1 Pro Preview | 62.3% |
| 7 | DeepSeek-V3.2 | 60.9% |
| 8 | Claude Sonnet 4.6 | 60.7% |
| 9 | Claude Sonnet 4.5 | 60.0% |
| 10 | Qwen3.5-397B-A17B | 59.9% |

Artificial Analysis composite index across coding, math, and reasoning benchmarks; a sketch of how such a composite can be formed follows the table.

| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | GPT-5.5 | 60.2 | 84 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 62 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 135 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 86 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 139 | $1.71 |
| 6 | MiMo-V2.5-Pro | 53.8 | 66 | $1.50 |
| 7 | GPT-5.3 Codex | 53.6 | 91 | $4.81 |
| 8 | Claude Opus 4.6 | 53.0 | 59 | $10.00 |
| 9 | Muse Spark | 52.1 | 0 | $0.00 |
| 10 | Qwen3.6 Max Preview | 51.8 | 34 | $2.92 |
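As a rough illustration of the composite described above, the sketch below takes a weighted mean of coding, math, and reasoning sub-scores. The equal default weighting and the example sub-scores are assumptions for illustration only; the sub-benchmarks and weights Artificial Analysis actually uses are not specified here.

```python
def composite_index(sub_scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Weighted mean of benchmark sub-scores on a 0-100 scale.
    Equal weights by default. Illustrative only; not the actual
    Artificial Analysis methodology."""
    if weights is None:
        weights = {name: 1.0 for name in sub_scores}
    total = sum(weights[name] for name in sub_scores)
    return sum(score * weights[name] for name, score in sub_scores.items()) / total

# Hypothetical sub-scores for a model landing near 60 overall.
print(round(composite_index({"coding": 58.0, "math": 64.0, "reasoning": 59.0}), 1))  # 60.3
```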

Output tokens per second — higher is faster. Minimum intelligence score of 40.

| # | Model | tok/s |
|---|-------|-------|
| 1 | Gemini 3 Flash Preview | 200 |
| 2 | GPT-5 Codex | 198 |
| 3 | Qwen3.6 35B A3B | 197 |
| 4 | GPT-5.4 mini | 182 |
| 5 | GPT-5.4 nano | 163 |
| 6 | GPT-5.1 Codex | 159 |
| 7 | Qwen3.5 122B A10B | 156 |
| 8 | GPT-5.1 | 153 |
| 9 | Gemini 3 Pro Preview | 141 |
| 10 | Kimi K2.6 | 139 |

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40. A worked example of the blend follows the table.

| # | Model | $/1M |
|---|-------|------|
| 1 | MiMo-V2-Flash | $0.15 |
| 2 | DeepSeek V4 Flash | $0.175 |
| 3 | DeepSeek V3.2 | $0.315 |
| 4 | GPT-5.4 nano | $0.463 |
| 5 | MiniMax-M2.7 | $0.525 |
| 6 | KAT Coder Pro V2 | $0.525 |
| 7 | MiniMax-M2.5 | $0.525 |
| 8 | GPT-5 mini | $0.688 |
| 9 | Qwen3.5 27B | $0.825 |
| 10 | Qwen3.6 35B A3B | $0.844 |
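To spell out the 3:1 blend used in the table above: input tokens are weighted three times as heavily as output tokens. A minimal sketch of the arithmetic follows; the per-million-token prices in the example are made up for illustration and are not taken from any provider's price list.

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended $/1M tokens at a 3:1 input/output token ratio:
    three parts input price to one part output price."""
    return (3 * input_per_1m + output_per_1m) / 4

# Hypothetical prices: $0.10 per 1M input tokens, $0.30 per 1M output tokens.
print(f"${blended_price(0.10, 0.30):.2f}")  # $0.15
```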