The Inference Report

June 18, 2026

On the SWE-rebench coding benchmark, the top tier is stable: gpt-5.5-2026-04-23-xhigh holds first at 62.7%, Junie is second at 61.6%, and Codex is third at 60.4%. The middle of the table shows more movement. Claude Sonnet 4.6 rose from 47.2 to 51.3 percent (position 8 to 10), GLM-5.1 jumped from 40.2 to 50.7 percent (rank 23 to 12), and Kimi K2.6 advanced from 42.8 to 46.5 percent (16 to 15), while Gemini 3.5 Flash slipped from 50.2 to 49.5 percent yet held rank 13.

The Artificial Analysis leaderboard is more volatile across its 394 entries. Claude Fable 5 leads at 59.9, but the broader distribution shows only marginal gains, concentrated among models in the 40 to 50 point range. GLM-4.7 made the largest absolute climb, from 33.8 to 38.2.

The two leaderboards diverge on identical or near-identical models: Claude Sonnet 4.6 scores 51.3 on SWE-rebench but 47.2 on Artificial Analysis, and GLM-5.1 scores 50.7 versus 40.2. This suggests they measure different problem distributions or use different evaluation methodologies (the Artificial Analysis index is a composite across coding, math, and reasoning), and it raises the question of whether an improvement on one reflects a genuine capability gain or benchmark-specific overfitting. The SWE-rebench movements are modest in absolute terms. Most shifts are under 5 percentage points, with GLM-5.1's 10.5-point jump the clear exception, which is consistent with natural variance rather than architectural breakthroughs.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                         Score
1    gpt-5.5-2026-04-23-xhigh      62.7%
2    Junie                         61.6%
3    Codex                         60.4%
4    Claude Code                   59.6%
5    gpt-5.5-2026-04-23-medium     58.9%
6    Claude Opus 4.8-xhigh         56.5%
7    gpt-5.4-2026-03-05-medium     54.9%
8    Claude Opus 4.7-high          53.1%
9    Cursor                        53.0%
10   Claude Sonnet 4.6             51.3%
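
The benchmark description above says each model is run five times to account for stochastic variance. The following is a minimal sketch of how five runs might be summarized, assuming the reported score is the mean over runs (the source does not state the aggregation method). The function name `summarize_runs` and the per-run values are hypothetical and are not SWE-rebench data.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for per-run scores in percent."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem


# Hypothetical per-run scores for one model (illustrative only).
example_runs = [50.0, 52.0, 51.0, 53.0, 49.0]
mean, sem = summarize_runs(example_runs)
print(f"mean = {mean:.1f}%, standard error = {sem:.2f} points")
```

Under this assumption, a score difference between two models that is small relative to their standard errors would be hard to distinguish from run-to-run noise.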

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                     Score   tok/s   $/1M
1    Claude Fable 5            59.9    0       $20.00
2    Claude Opus 4.8           55.7    67      $10.00
3    GPT-5.5                   54.8    61      $11.25
4    Claude Opus 4.7           53.5    54      $10.00
5    GPT-5.4                   51.4    157     $5.63
6    GLM-5.2                   50.7    100     $2.15
7    Gemini 3.5 Flash          50.2    223     $3.38
8    Claude Sonnet 4.6         47.2    66      $6.00
9    Gemini 3.1 Pro Preview    46.5    127     $4.50
10   Qwen3.7 Max               46      96      $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                     tok/s
1    Gemini 3.5 Flash          223
2    GPT-5.4 mini              177
3    GPT-5.4                   157
4    Gemini 3.1 Pro Preview    127
5    GPT-5.2 Codex             125
6    DeepSeek V4 Flash         105
7    GLM-5.2                   100
8    Qwen3.7 Max               96
9    GPT-5.2                   79
10   GPT-5.3 Codex             77
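
The speed ranking above applies a minimum intelligence score before sorting by throughput. A rough sketch of that filter-then-sort step follows, using a few rows from the Artificial Analysis composite table in this report. The function name `fastest_models` and the tuple layout are assumptions for illustration, not the leaderboard's own code.

```python
# (model, intelligence score, output tokens per second), taken from the
# Artificial Analysis composite table in this report.
ROWS = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 67),
    ("GPT-5.5", 54.8, 61),
    ("GPT-5.4", 51.4, 157),
    ("GLM-5.2", 50.7, 100),
    ("Gemini 3.5 Flash", 50.2, 223),
]


def fastest_models(rows, min_score=40.0):
    """Keep rows at or above min_score, then sort by tok/s, fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


for rank, (model, score, tps) in enumerate(fastest_models(ROWS), start=1):
    print(rank, model, tps)
```

Raising `min_score` narrows the field before ranking, so a fast but lower-scoring model can drop out of the list entirely.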

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                $/1M
1    DeepSeek V4 Flash    $0.175
2    MiMo-V2.5            $0.175
3    MiniMax-M3           $0.525
4    DeepSeek V4 Pro      $0.544
5    MiMo-V2.5-Pro        $0.544
6    MiMo-V2-Pro          $1.50
7    GPT-5.4 mini         $1.69
8    Kimi K2.6            $1.71
9    Kimi K2.7 Code       $1.71
10   GLM-5.2              $2.15
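
The cost table above uses a blended price at a 3:1 input/output ratio. One plausible reading, assumed here rather than confirmed by the source, is a weighted average with three parts input price to one part output price. The function name `blended_price` and the example prices are hypothetical.

```python
def blended_price(input_per_m, output_per_m, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    weighted = input_parts * input_per_m + output_parts * output_per_m
    return weighted / total_parts


# Hypothetical list prices in dollars per 1M tokens (illustrative only).
price = blended_price(input_per_m=1.00, output_per_m=4.00)
print(f"${price:.2f} per 1M tokens")
```

Because input tokens carry three times the weight, a model with cheap input but expensive output can still rank well on this measure.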