The Inference Report

May 31, 2026

The SWE-rebench rankings are stable at the top: gpt-5.5-2026-04-23-xhigh holds first place at 62.7%, followed by Codex at 60.4% and Claude Code at 59.6%, all unchanged from the previous cycle.

The more striking pattern is the gap between the two benchmarks for individual models. Gemini 3.1 Pro Preview sits in position 4 on Artificial Analysis with 57.2 but in position 10 on SWE-rebench with 51.1%, a 6.1-point gap that is the most substantial in the visible rankings. Kimi K2.6 shows a wider spread: position 8 at 53.9 on Artificial Analysis against position 13 at 46.5% on SWE-rebench, a 7.4-point gap. The two scores are on different scales (a percentage score and a composite index), so these point differences are indicative rather than strictly comparable.

GLM-5.1 held at 50.7% on SWE-rebench, keeping position 11, while its Artificial Analysis score rose slightly from 51.4. GLM-4.7 diverges in a less obvious way: it improved from 38.2 to 42.1 on Artificial Analysis (rising from position 47), yet remained at 38.2% in position 14 on SWE-rebench. Either the two benchmarks measure different problem classes, or the Artificial Analysis score reflects a broader evaluation window. The SWE-rebench figures for the GLM models and Kimi K2.6 fall outside the top 10 listed in this report.

Scores for most top-10 models are broadly consistent across the two benchmarks, which suggests the measurements are reliable. The Gemini and Kimi gaps still warrant scrutiny: either the benchmarks are not testing equivalent coding difficulty, or recent model updates affected one benchmark more than the other. With no movement in the top five positions on either benchmark, the frontier appears to have stabilized, while churn in the 45-55% range points to active differentiation among mid-tier models.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#    Model                        Score
1    gpt-5.5-2026-04-23-xhigh     62.7%
2    Codex                        60.4%
3    Claude Code                  59.6%
4    gpt-5.5-2026-04-23-medium    58.9%
5    Claude Opus 4.8-xhigh        56.4%
6    gpt-5.4-2026-03-05-medium    54.9%
7    Claude Opus 4.7-high         53.1%
8    Cursor                       53.0%
9    Claude Sonnet 4.6-high       51.3%
10   Gemini 3.1 Pro Preview       51.1%
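
The five-runs-per-model protocol described above implies that each reported score summarizes several noisy measurements. The report does not say how SWE-rebench aggregates its runs, so the following is only a minimal sketch under the assumption that the score is a plain mean, with the standard error of the mean as a rough indicator of run-to-run variance. The function name `summarize_runs` and the per-run numbers are hypothetical and are not taken from the leaderboard.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard error) for a list of per-run scores in percent.

    Assumption: runs are independent and aggregated by a simple mean.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(scores) / sqrt(n)
    return avg, sem


# Hypothetical per-run scores for one model (illustrative only).
example_runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(example_runs)
print(f"{avg:.1f}% +/- {sem:.1f}")
```

Under this reading, two models whose means differ by less than a couple of standard errors, such as neighbors separated by a few tenths of a point, may not be meaningfully distinguishable.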

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#    Model                     Score   tok/s   $/1M
1    Claude Opus 4.8           61.4    65      $10.94
2    GPT-5.5                   60.2    59      $11.25
3    Claude Opus 4.7           57.3    60      $10.94
4    Gemini 3.1 Pro Preview    57.2    137     $4.50
5    GPT-5.4                   56.8    90      $5.63
6    Qwen3.7 Max               56.6    188     $3.75
7    Gemini 3.5 Flash          55.3    218     $3.38
8    Kimi K2.6                 53.9    42      $1.71
9    MiMo-V2.5-Pro             53.8    51      $0.544
10   GPT-5.3 Codex             53.6    85      $4.81

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#    Model                     tok/s
1    Gemini 3.5 Flash          218
2    Grok 4.20 0309            218
3    Grok 4.20 0309 v2         216
4    Gemini 3 Flash Preview    203
5    MiniMax-M2.5              191
6    Qwen3.7 Max               188
7    GPT-5.4 mini              183
8    GPT-5.1 Codex             182
9    GPT-5 Codex               170
10   Grok 4.3                  161

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#    Model                $/1M
1    MiMo-V2-Flash        $0.15
2    MiMo-V2.5            $0.175
3    DeepSeek V4 Flash    $0.175
4    Hy3-preview          $0.20
5    DeepSeek V3.2        $0.337
6    GPT-5.4 nano         $0.463
7    MiniMax-M2.7         $0.525
8    KAT Coder Pro V2     $0.525
9    MiniMax-M2.5         $0.525
10   MiMo-V2.5-Pro        $0.544
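
The blended figures above combine input and output prices at a 3:1 ratio. A minimal sketch of that arithmetic follows, assuming the ratio means a weighted average with three parts input price to one part output price. The function name `blended_cost` and the example prices are hypothetical and do not correspond to any model in the table.

```python
def blended_cost(input_price, output_price, input_weight=3.0, output_weight=1.0):
    """Weighted average price per 1M tokens.

    Assumption: "3:1 input/output" weights the input price by 3
    and the output price by 1.
    """
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices in USD per 1M tokens (illustrative only).
example = blended_cost(input_price=1.00, output_price=4.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this metric, so workloads that are output-heavy may see a different ordering.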