The Inference Report

June 23, 2026

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                        Score
1   gpt-5.5-2026-04-23-xhigh     62.7%
2   Junie                        61.6%
3   Codex                        60.4%
4   Claude Code                  59.6%
5   gpt-5.5-2026-04-23-medium    58.9%
6   Claude Opus 4.8-xhigh        56.5%
7   gpt-5.4-2026-03-05-medium    54.9%
8   Claude Opus 4.7-high         53.1%
9   Cursor                       53.0%
10  Claude Sonnet 4.6            51.3%
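
As a rough illustration of how five runs per model might be collapsed into one score, the sketch below averages the runs and reports a standard error. It assumes the published figure is the plain mean across runs, which the description above does not confirm. The function name `summarize_runs` and the five run values are hypothetical and are not SWE-rebench data.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(scores) / sqrt(n)
    return avg, sem


# Hypothetical resolved rates (%) from five runs of one model.
runs = [61.8, 63.1, 62.4, 63.5, 62.7]
avg, sem = summarize_runs(runs)
print(f"mean = {avg:.1f}%, standard error = {sem:.2f} points")
```

Under this assumption, a gap between two models that is smaller than a couple of standard errors may not reflect a real difference.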

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                     Score   tok/s   $/1M
1   Claude Fable 5            59.9    0       $20.00
2   Claude Opus 4.8           55.7    69      $10.00
3   GPT-5.5                   54.8    64      $11.25
4   Claude Opus 4.7           53.5    58      $10.00
5   GPT-5.4                   51.4    167     $5.63
6   GLM-5.2                   51.1    105     $2.15
7   Gemini 3.5 Flash          50.2    245     $3.38
8   Claude Sonnet 4.6         47.2    79      $6.00
9   Gemini 3.1 Pro Preview    46.5    147     $4.50
10  Qwen3.7 Max               46      205     $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                     tok/s
1   Gemini 3.5 Flash          245
2   Qwen3.7 Max               205
3   GPT-5.4 mini              202
4   GPT-5.4                   167
5   GPT-5.2 Codex             150
6   Gemini 3.1 Pro Preview    147
7   DeepSeek V4 Flash         117
8   GLM-5.2                   105
9   GPT-5.3 Codex             104
10  DeepSeek V4 Pro           103
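
The speed ranking applies a score floor before sorting by throughput. A minimal sketch of that filter-then-sort step follows, assuming the threshold is inclusive (score of 40 or higher qualifies), which the caption does not specify. The `Entry` type and `rank_by_speed` function are illustrative names. Four rows are taken from the composite index table in this report, and the last row is a hypothetical model added only to show the filter dropping an entry.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float      # composite intelligence score
    tok_per_s: float  # output tokens per second


def rank_by_speed(entries: list[Entry], min_score: float = 40.0) -> list[Entry]:
    """Drop entries below the score floor, then sort fastest first."""
    eligible = [e for e in entries if e.score >= min_score]  # inclusive floor is an assumption
    return sorted(eligible, key=lambda e: e.tok_per_s, reverse=True)


entries = [
    Entry("GPT-5.4", 51.4, 167),
    Entry("GLM-5.2", 51.1, 105),
    Entry("Gemini 3.5 Flash", 50.2, 245),
    Entry("Qwen3.7 Max", 46.0, 205),
    Entry("hypothetical-small-model", 32.0, 400),  # not a real entry; below the floor
]

ranked = rank_by_speed(entries)
for position, entry in enumerate(ranked, start=1):
    print(position, entry.model, entry.tok_per_s)
```

The hypothetical model is the fastest in the input but is excluded because its score falls below the floor.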

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model                $/1M
1   DeepSeek V4 Flash    $0.175
2   MiMo-V2.5            $0.175
3   MiniMax-M3           $0.525
4   DeepSeek V4 Pro      $0.544
5   MiMo-V2.5-Pro        $0.544
6   MiMo-V2-Pro          $1.50
7   GPT-5.4 mini         $1.69
8   Kimi K2.6            $1.71
9   Kimi K2.7 Code       $1.71
10  GLM-5.2              $2.15
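
A 3:1 input/output blend can be read as a weighted average of the two per-token prices, with input tokens counted three times for every output token. The sketch below shows that arithmetic under this reading. The function name `blended_cost` and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_cost(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted-average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $2.50 per 1M input tokens, $15.00 per 1M output tokens.
# (3 * 2.50 + 1 * 15.00) / 4 = 5.625
price = blended_cost(2.50, 15.00)
print(f"${price:.3f} per 1M tokens")
```

Because input is weighted three times as heavily, a model with cheap input and expensive output can still rank well on blended cost.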