The Inference Report

June 17, 2026

On SWE-rebench, the top tier is unchanged: gpt-5.5-2026-04-23-xhigh holds 62.7%, and the next four entries remain stable within a narrow band (58.9% to 61.6%). The meaningful shifts are in the mid-tier. GLM-5.1 entered at 50.7%, which places it 12th on SWE-rebench against 21st on Artificial Analysis (40.2), suggesting that the benchmark surfaces a different capability profile than general evaluation suites do. Kimi K2.6 gained 3.7 points to reach 46.5%, while Gemini 3.5 Flash slipped 0.7 points to 49.5% even though it previously ranked sixth on Artificial Analysis at 50.2, which may indicate that SWE-rebench's code-specific tasks penalize certain architectural choices.

On Artificial Analysis, GLM-5.2 is a new entrant at position 6 with 50.7; it did not appear in the prior ranking. The rest of the list shows positional shuffling without score changes, so the movement appears to come from model releases rather than from re-evaluation of existing systems. Because most Artificial Analysis entries keep identical scores across the two snapshots, that suite behaves more like a stable archive than a live leaderboard, and SWE-rebench currently gives the cleaner signal for coding capability. Neither benchmark shows the kind of discontinuous jump that would signal a methodological shift or a contamination event.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike many other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                       Score
1   gpt-5.5-2026-04-23-xhigh    62.7%
2   Junie                       61.6%
3   Codex                       60.4%
4   Claude Code                 59.6%
5   gpt-5.5-2026-04-23-medium   58.9%
6   Claude Opus 4.8-xhigh       56.5%
7   gpt-5.4-2026-03-05-medium   54.9%
8   Claude Opus 4.7-high        53.1%
9   Cursor                      53.0%
10  Claude Sonnet 4.6           51.3%
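The five-run protocol mentioned above can be illustrated with a short sketch. The report does not say how the five runs are combined into one score, so this assumes a plain mean and also reports the sample standard deviation and standard error. The function name `summarize_runs` and the per-run values are hypothetical and are not SWE-rebench data.

```python
from statistics import mean, stdev


def summarize_runs(pass_rates):
    """Return (mean, sample standard deviation, standard error) for repeated runs."""
    n = len(pass_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(pass_rates)
    sd = stdev(pass_rates)   # sample standard deviation (n - 1 denominator)
    sem = sd / n ** 0.5      # standard error of the mean
    return avg, sd, sem


# Hypothetical resolved-rate percentages from five runs of one model (made-up values).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sd, sem = summarize_runs(runs)
print(f"mean={avg:.1f}%  sd={sd:.2f}  sem={sem:.2f}")
```

Under this assumption, a gap between two models that is smaller than their standard errors would be hard to distinguish from run-to-run noise.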

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                    Score  tok/s  $/1M
1   Claude Fable 5           59.9   0      $20.00
2   Claude Opus 4.8          55.7   68     $10.00
3   GPT-5.5                  54.8   67     $11.25
4   Claude Opus 4.7          53.5   54     $10.00
5   GPT-5.4                  51.4   166    $5.63
6   GLM-5.2                  50.7   114    $2.15
7   Gemini 3.5 Flash         50.2   203    $3.38
8   Claude Sonnet 4.6        47.2   63     $6.00
9   Gemini 3.1 Pro Preview   46.5   127    $4.50
10  Qwen3.7 Max              46     106    $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                    tok/s
1   Gemini 3.5 Flash         203
2   GPT-5.4 mini             180
3   GPT-5.4                  166
4   Gemini 3.1 Pro Preview   127
5   GPT-5.2 Codex            125
6   GLM-5.2                  114
7   Qwen3.7 Max              106
8   DeepSeek V4 Flash        100
9   GPT-5.3 Codex            89
10  GPT-5.2                  78
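The speed ranking applies a minimum intelligence score before sorting by throughput. A minimal sketch of that filter-then-sort step follows, assuming the threshold is inclusive (the report does not say). The five real rows use scores and speeds stated in this report; `hypothetical-small-model` is an invented entry that only demonstrates the cutoff, and `rank_by_speed` is an illustrative name.

```python
# (model, composite score, output tokens per second)
ROWS = [
    ("GPT-5.4", 51.4, 166),
    ("GLM-5.2", 50.7, 114),
    ("Gemini 3.5 Flash", 50.2, 203),
    ("Gemini 3.1 Pro Preview", 46.5, 127),
    ("Qwen3.7 Max", 46.0, 106),
    ("hypothetical-small-model", 35.0, 300),  # invented row, below the cutoff
]


def rank_by_speed(rows, min_score=40.0):
    """Keep rows at or above min_score, then sort fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


for position, (model, score, tps) in enumerate(rank_by_speed(ROWS), start=1):
    print(position, model, tps)
```

The invented model is the fastest entry but is excluded because its score is under 40, which is why a speed list of this kind can omit very fast but weaker models.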

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model               $/1M
1   DeepSeek V4 Flash   $0.175
2   MiMo-V2.5           $0.175
3   MiniMax-M3          $0.525
4   DeepSeek V4 Pro     $0.544
5   MiMo-V2.5-Pro       $0.544
6   MiMo-V2-Pro         $1.50
7   GPT-5.4 mini        $1.69
8   Kimi K2.6           $1.71
9   Kimi K2.7 Code      $1.71
10  GLM-5.2             $2.15
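The blended figure in the cost ranking is a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three parts input tokens to one part output tokens. The function name `blended_cost` and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_cost(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted average price per 1M tokens for a given input:output token mix."""
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
example = blended_cost(1.00, 4.00)
print(f"${example:.2f} per 1M blended tokens")
```

With these made-up prices the blend is (3 × 1.00 + 1 × 4.00) / 4 = 1.75, so a model with expensive output tokens can still look cheap on a blended basis when the weighting favors input.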