The Inference Report

June 25, 2026

On SWE-rebench, the top tier is unchanged: gpt-5.5-2026-04-23-xhigh holds 62.7%, Junie stays at 61.6%, and Codex, Claude Code, and the Claude and GPT variants in positions three through seven did not move. The meaningful shifts are in the mid-tier. GLM-5.1 climbed from position 23 at 40.2% to position 12 at 50.7%, a 10.5-point gain that is the largest score increase in the dataset, while GLM-4.7 rose from position 52 at 33.8% to position 17 at 38.2%. Kimi K2.6 advanced from position 16 to position 15. Claude Sonnet 4.6 slipped from position 8 to position 10 while its score held at 51.3%, which implies that other entries moved past it rather than that its own result changed.

The Artificial Analysis index is far less volatile. Claude Fable 5 leads at 59.9, and a new entry, Nex-N2-Pro at 41.0, appears at position 20, while the models above it cluster between 42.8 and 59.9 with rankings mostly preserved. Further down, KAT-Coder-Pro V1 jumped 31 positions, from 83 to 52, on a 6.3-point improvement from 28.3 to 34.6.

The discrepancy between the two benchmarks is notable. The SWE-rebench leaders (gpt-5.5-xhigh, Junie) do not dominate Artificial Analysis, where Claude Fable 5 leads even though it does not appear in the SWE-rebench top ten. This suggests that the two metrics capture different problem-solving dimensions, or that their evaluation methodologies diverge in what they reward. Neither benchmark shows the compression or volatility typical of immature measurement systems, which indicates that both have stabilized around consistent model orderings. Without more methodological detail, however, it is not possible to assess whether either captures real capability differences or primarily reflects training data overlap.

Cole Brennan
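The rank and score movements quoted in the summary are simple differences, and they can be checked with a few lines of Python. The sketch below uses only the figures stated in the summary; the `Move` class and its field names are illustrative assumptions, not part of either benchmark's tooling. Note that the first two rows are SWE-rebench percentages and the third is an Artificial Analysis index value, so the point gains are not directly comparable across benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """Previous and current leaderboard entry for one model (hypothetical helper)."""
    model: str
    old_rank: int
    new_rank: int
    old_score: float
    new_score: float

    @property
    def places_gained(self) -> int:
        # Positive means the model moved up, toward rank 1.
        return self.old_rank - self.new_rank

    @property
    def score_gain(self) -> float:
        # Rounded to one decimal to match the precision of the published scores.
        return round(self.new_score - self.old_score, 1)


# Figures as quoted in the summary text.
moves = [
    Move("GLM-5.1", 23, 12, 40.2, 50.7),           # SWE-rebench, percent
    Move("GLM-4.7", 52, 17, 33.8, 38.2),           # SWE-rebench, percent
    Move("KAT-Coder-Pro V1", 83, 52, 28.3, 34.6),  # Artificial Analysis index
]

for m in moves:
    print(f"{m.model}: {m.places_gained:+d} places, {m.score_gain:+.1f} points")
```

On these figures, GLM-5.1 has the largest score gain while GLM-4.7 has the largest rank change, which is why it helps to state which kind of "jump" is meant.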

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

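| # | Model | Score |
|---|-------|-------|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
| 6 | Claude Opus 4.8-xhigh | 56.5% |
| 7 | gpt-5.4-2026-03-05-medium | 54.9% |
| 8 | Claude Opus 4.7-high | 53.1% |
| 9 | Cursor | 53.0% |
| 10 | Claude Sonnet 4.6 | 51.3% |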
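The description above says each model is run five times to account for stochastic variance, but it does not say how the runs are combined. One common approach, shown here purely as an assumption, is to report the mean across runs together with the standard error of the mean. The function name `summarize_runs` and the per-run values are hypothetical and are not SWE-rebench data.

```python
import statistics


def summarize_runs(resolved_rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated runs of one model."""
    mean = statistics.fmean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(resolved_rates) / len(resolved_rates) ** 0.5
    return mean, sem


# Hypothetical per-run scores (percent) for a single model; not real results.
example_runs = [48.0, 50.0, 52.0, 49.0, 51.0]
mean, sem = summarize_runs(example_runs)
print(f"{mean:.1f}% +/- {sem:.1f}")
```

With a spread like this, two models whose published scores differ by less than a point would be hard to separate, which is worth keeping in mind when reading adjacent rows of the table.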

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

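| # | Model | Score | tok/s | $/1M tokens |
|---|-------|-------|-------|-------------|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 66 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 66 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 58 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 159 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 122 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 221 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 68 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 145 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 204 | $3.75 |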

Output tokens per second — higher is faster. Minimum intelligence score of 40.

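| # | Model | tok/s |
|---|-------|-------|
| 1 | Gemini 3.5 Flash | 221 |
| 2 | Qwen3.7 Max | 204 |
| 3 | GPT-5.4 mini | 185 |
| 4 | GPT-5.4 | 159 |
| 5 | Gemini 3.1 Pro Preview | 145 |
| 6 | GPT-5.2 Codex | 139 |
| 7 | DeepSeek V4 Flash | 124 |
| 8 | GLM-5.2 | 122 |
| 9 | Nex-N2-Pro | 108 |
| 10 | GPT-5.2 | 88 |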
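The speed ranking combines a threshold (a minimum intelligence score of 40) with a sort on output tokens per second. A minimal sketch of that filter-then-sort step is shown below. It uses the ten rows of the composite index table as input, so it reproduces only the subset of the speed ranking that overlaps with that table; models such as GPT-5.4 mini are missing because their scores are not listed in this report. The function name `fastest` and the assumption that the threshold is inclusive are illustrative.

```python
# (model, composite score, output tok/s) taken from the composite index table.
Row = tuple[str, float, int]

rows: list[Row] = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 66),
    ("GPT-5.5", 54.8, 66),
    ("Claude Opus 4.7", 53.5, 58),
    ("GPT-5.4", 51.4, 159),
    ("GLM-5.2", 51.1, 122),
    ("Gemini 3.5 Flash", 50.2, 221),
    ("Claude Sonnet 4.6", 47.2, 68),
    ("Gemini 3.1 Pro Preview", 46.5, 145),
    ("Qwen3.7 Max", 46.0, 204),
]


def fastest(rows: list[Row], min_score: float = 40.0, top_n: int = 5) -> list[Row]:
    """Keep rows at or above min_score, then rank by tok/s, fastest first."""
    eligible = [r for r in rows if r[1] >= min_score]
    return sorted(eligible, key=lambda r: r[2], reverse=True)[:top_n]


for rank, (model, score, tps) in enumerate(fastest(rows), start=1):
    print(f"{rank}. {model}: {tps} tok/s (score {score})")
```

The relative order of the overlapping models (Gemini 3.5 Flash, Qwen3.7 Max, GPT-5.4, Gemini 3.1 Pro Preview, GLM-5.2) matches their order in the published speed table.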

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

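| # | Model | $/1M tokens |
|---|-------|-------------|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |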
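A 3:1 input/output blend is most naturally read as a weighted average in which input tokens count three times as heavily as output tokens. The sketch below assumes that reading; the report does not spell out the formula, and the function name `blended_price` and the example prices are hypothetical rather than any provider's actual rates.

```python
def blended_price(
    input_per_1m: float,
    output_per_1m: float,
    input_parts: int = 3,
    output_parts: int = 1,
) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_per_1m + output_parts * output_per_1m) / total_parts


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
# (3 * 1.00 + 1 * 5.00) / 4 = 2.00
print(f"${blended_price(1.00, 5.00):.2f} per 1M tokens")
```

Because input is weighted three times as heavily, a model with cheap input and expensive output can rank better on this measure than its output price alone would suggest.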