The Inference Report

June 16, 2026

On SWE-rebench, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds first place at 62.7% and the next five positions are unchanged. The meaningful movement is further down the rankings and mostly on Artificial Analysis. Gemini 3.1 Pro Preview dropped from 57.2% to 51.1% on Artificial Analysis (down six positions), Gemini 3.5 Flash fell from 55.3% to 49.5% on SWE-rebench and from 55.3% to 50.2% on Artificial Analysis, and Kimi K2.6 declined from 53.9% to 46.5% on Artificial Analysis while holding steady on SWE-rebench. GLM-4.7 moved the other way, improving from 42.1% to 50.7% on Artificial Analysis and entering the top 20, though it remains at 38.2% on SWE-rebench.

The Artificial Analysis leaderboard also shows volatility at the top: Claude Fable 5 dropped from 64.9 to 59.9, Claude Opus 4.8 fell from 61.4 to 55.7, and GPT-5.5 declined from 60.2 to 54.8. Drops of 5.0, 5.7, and 5.4 points across three different models suggest either a recalibration of the benchmark methodology or a systematic change in evaluation conditions. Lower-ranked models show the largest percentage-point losses across both benchmarks, with many models ranked in the 100-200 range losing 5-8 points. That raises the question of whether the losses reflect actual model degradation, benchmark recalibration, or environmental factors such as inference conditions that affect consistency.

SWE-rebench scores remain tighter and more stable than Artificial Analysis scores, which could indicate either a more robust methodology or a narrower evaluation scope that leaves less room for variance. Without clarity on whether the two benchmarks measure identical task sets or use different evaluation protocols, the divergence between them makes it difficult to tell genuine capability shifts from measurement artifacts.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                        Score
1   gpt-5.5-2026-04-23-xhigh     62.7%
2   Junie                        61.6%
3   Codex                        60.4%
4   Claude Code                  59.6%
5   gpt-5.5-2026-04-23-medium    58.9%
6   Claude Opus 4.8-xhigh        56.5%
7   gpt-5.4-2026-03-05-medium    54.9%
8   Claude Opus 4.7-high         53.1%
9   Cursor                       53.0%
10  Claude Sonnet 4.6            51.3%
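
The description above says each model is run five times to account for stochastic variance. How SWE-rebench aggregates those runs is not stated here, so the following is only a minimal sketch of one common approach: report the mean and the standard error of the mean. The function name `summarize_runs` and the five run values are hypothetical and are not taken from the benchmark.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for repeated runs of one model."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(scores) / sqrt(len(scores))
    return avg, sem


# Hypothetical per-run scores (percent) for a single model; not real data.
example_runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(example_runs)
print(f"mean = {avg:.1f}%, standard error = {sem:.2f} points")
```

Under this assumption, a gap between two models that is smaller than roughly their combined standard errors is hard to distinguish from run-to-run noise.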

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                     Score   tok/s   $/1M
1   Claude Fable 5            59.9    0       $20.00
2   Claude Opus 4.8           55.7    68      $10.00
3   GPT-5.5                   54.8    67      $11.25
4   Claude Opus 4.7           53.5    57      $10.00
5   GPT-5.4                   51.4    191     $5.63
6   Gemini 3.5 Flash          50.2    212     $3.38
7   Claude Sonnet 4.6         47.2    62      $6.00
8   Gemini 3.1 Pro Preview    46.5    133     $4.50
9   Qwen3.7 Max               46      187     $3.75
10  MiniMax-M3                44.4    57      $0.525

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                     tok/s
1   Gemini 3.5 Flash          212
2   GPT-5.4                   191
3   Qwen3.7 Max               187
4   GPT-5.4 mini              187
5   GPT-5.2 Codex             137
6   Gemini 3.1 Pro Preview    133
7   DeepSeek V4 Flash         108
8   GPT-5.3 Codex             99
9   DeepSeek V4 Pro           84
10  GPT-5.2                   80

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model                  $/1M
1   DeepSeek V4 Flash      $0.175
2   MiMo-V2.5              $0.175
3   MiniMax-M3             $0.525
4   DeepSeek V4 Pro        $0.544
5   MiMo-V2.5-Pro          $0.544
6   MiMo-V2-Pro            $1.50
7   GPT-5.4 mini           $1.69
8   Kimi K2.6              $1.71
9   GLM-5.1                $2.15
10  Qwen3.6 Max Preview    $2.92
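
The blended figure in this table can be read as a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three input tokens for every output token. The function name `blended_cost` and the example prices are hypothetical and do not correspond to any model listed here.

```python
def blended_cost(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted-average price per 1M tokens for a given input:output token mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices in dollars per 1M tokens; not taken from the table.
example_input_price = 1.00
example_output_price = 5.00
blended = blended_cost(example_input_price, example_output_price)
print(f"blended cost: ${blended:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this assumption, a model with cheap input and expensive output can still rank well on blended cost.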