The Inference Report

June 19, 2026

The SWE-rebench and Artificial Analysis rankings show stability at the top but meaningful movement in the middle tier. On SWE-rebench, the top six positions remain unchanged: gpt-5.5-2026-04-23-xhigh leads at 62.7%, followed by Junie at 61.6%, Codex at 60.4%, Claude Code at 59.6%, gpt-5.5-2026-04-23-medium at 58.9%, and Claude Opus 4.8-xhigh at 56.5%. The notable shifts occur below this ceiling. Claude Sonnet 4.6 held position 10 while its score rose from 47.2% to 51.3%, a 4.1-point gain; Gemini 3.1 Pro Preview gained 4.6 points, from 46.5% to 51.1%, yet slipped from position 9 to position 11; GLM-5.1 jumped from position 23 at 40.2% to position 12 at 50.7%, an extraordinary 10.5-point improvement; and GLM-4.7 advanced from position 51 at 33.8% to position 17 at 38.2%, a 4.4-point gain. Gemini 3.5 Flash, conversely, fell from position 7 at 50.2% to position 13 at 49.5%, a 0.7-point drop. These movements suggest either benchmark variance or genuine performance shifts in the middle tier, though GLM-5.1's dramatic rise warrants scrutiny of whether the test conditions or the model's capability changed materially.

Artificial Analysis rankings remain consistent across the top 100 positions, with only minor reordering among tied scores in the 6 to 7-point range, which points to a more stable evaluation methodology or a less volatile test set in that benchmark.
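The point changes and rank moves quoted for SWE-rebench can be re-derived with a few lines of arithmetic. The sketch below is only a convenience check: the figures are copied from the paragraph above, while the `moves` layout and the `summarize` helper are illustrative names and not part of either benchmark's tooling.

```python
# (model, previous rank, previous score %, current rank, current score %),
# copied from the SWE-rebench figures quoted in the summary.
moves = [
    ("Claude Sonnet 4.6", 10, 47.2, 10, 51.3),
    ("Gemini 3.1 Pro Preview", 9, 46.5, 11, 51.1),
    ("GLM-5.1", 23, 40.2, 12, 50.7),
    ("GLM-4.7", 51, 33.8, 17, 38.2),
    ("Gemini 3.5 Flash", 7, 50.2, 13, 49.5),
]


def summarize(moves):
    """Return (model, score change in points, places gained) per entry."""
    rows = []
    for model, prev_rank, prev_score, cur_rank, cur_score in moves:
        # Round to one decimal to match the precision of the published scores.
        delta = round(cur_score - prev_score, 1)
        # Positive means the model moved up the table (toward rank 1).
        places = prev_rank - cur_rank
        rows.append((model, delta, places))
    return rows


for model, delta, places in summarize(moves):
    print(f"{model:<24} {delta:+.1f} pts  {places:+d} places")
```

Keeping the score change and the rank change as separate columns makes it easy to spot cases where the two disagree, such as a model whose score rises while its position falls.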

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

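| # | Model | Score |
|---|-------|-------|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |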
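Because SWE-rebench runs each model five times, a single reported score hides some run-to-run spread. The sketch below shows one common way to summarize repeated runs, using the mean, the sample standard deviation, and the standard error of the mean. It assumes the runs are aggregated by simple averaging, which the description above does not confirm, and the `runs` values are hypothetical, not real benchmark results.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    sem = stdev / math.sqrt(len(scores))
    return mean, stdev, sem


# Hypothetical resolved rates (%) from five runs of one model.
# These numbers are made up for illustration only.
runs = [50.0, 52.0, 51.0, 53.0, 49.0]

mean, stdev, sem = summarize_runs(runs)
print(f"mean={mean:.1f}%  stdev={stdev:.2f}  sem={sem:.2f}")
```

With a spread like the one in this made-up example, a difference of under a point between two neighboring models would be hard to distinguish from noise, which is one reason to read small rank changes cautiously.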

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

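| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 66 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 68 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 56 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 157 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 98 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 219 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 68 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 136 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 98 | $3.75 |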

Output tokens per second — higher is faster. Minimum intelligence score of 40.

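| # | Model | tok/s |
|---|-------|-------|
| 1 | Gemini 3.5 Flash | 219 |
| 2 | GPT-5.4 mini | 174 |
| 3 | GPT-5.4 | 157 |
| 4 | GPT-5.2 Codex | 137 |
| 5 | Gemini 3.1 Pro Preview | 136 |
| 6 | DeepSeek V4 Flash | 114 |
| 7 | GLM-5.2 | 98 |
| 8 | Qwen3.7 Max | 98 |
| 9 | GPT-5.3 Codex | 88 |
| 10 | GPT-5.2 | 83 |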

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

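| # | Model | $/1M |
|---|-------|------|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | MiMo-V2-Pro | $1.50 |
| 7 | GPT-5.4 mini | $1.69 |
| 8 | Kimi K2.6 | $1.71 |
| 9 | Kimi K2.7 Code | $1.71 |
| 10 | GLM-5.2 | $2.15 |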
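The blended figure in this table can be read as a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three input tokens are counted for every output token. The function name and the example prices are hypothetical and do not correspond to any model listed here.

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens.

    Assumes the blend counts `input_weight` input tokens for every
    `output_weight` output tokens (3:1 by default).
    """
    total_weight = input_weight + output_weight
    weighted_sum = input_weight * input_price + output_weight * output_price
    return weighted_sum / total_weight


# Hypothetical prices in USD per 1M tokens, for illustration only.
example = blended_cost(input_price=1.00, output_price=4.00)
print(f"${example:.2f} per 1M tokens")
```

Under this weighting, input pricing dominates the blended number, so a model with expensive output tokens can still rank as cheap if its input price is low.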