The Inference Report

June 2, 2026

The SWE-rebench rankings show minimal movement at the top, with the same fourteen models holding positions 1-14 across both periods. The Artificial Analysis index, by contrast, shows churn throughout its 384-entry list, though the pattern looks more like reordering than genuine performance shifts. The two leaderboards also diverge from each other: Gemini 3.1 Pro Preview places 4th on Artificial Analysis with 57.2 points but 10th on SWE-rebench at 51.1%, and Kimi K2.6 shows a similar gap, 8th versus 13th, with a score 7.4 points lower. Because one list reports a composite index and the other a percentage score, the raw numbers are not directly comparable. GLM-4.7 rose from 47th to 14th on Artificial Analysis even as its score fell from 42.1 to 38.2 points, a counterintuitive climb that points to ranking methodology changes or score recalibration rather than model improvement. The entry of Step 3.7 Flash at position 47 and the near-universal reshuffling below the top 100 suggest the benchmark may have adjusted its evaluation criteria, weighting scheme, or model test set. Without documentation of methodology changes between periods, it remains unclear whether the observed movements reflect actual performance variation or administrative reorganization of the leaderboard itself. The stability at the top of SWE-rebench contrasts sharply with the volatility of Artificial Analysis, raising questions about benchmark sensitivity and whether either ranking reliably tracks incremental progress in code generation capability.

Cole Brennan
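
The GLM-4.7 case in the summary, a better rank alongside a lower score, is the kind of mismatch that can be flagged mechanically. The sketch below is one possible check, assuming two leaderboard snapshots are available as dictionaries; the `Snapshot` type, the function name, and the `model-b` entry are hypothetical, and only the GLM-4.7 figures come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    rank: int      # 1 is the top of the leaderboard
    score: float   # benchmark score for that period


def rank_score_mismatches(previous, current):
    """Return models whose rank and score moved in opposite directions."""
    flagged = []
    for model in previous.keys() & current.keys():
        before, after = previous[model], current[model]
        rank_gain = before.rank - after.rank      # positive means moved up
        score_gain = after.score - before.score   # positive means scored higher
        if rank_gain * score_gain < 0:
            flagged.append(model)
    return sorted(flagged)


# GLM-4.7 figures are taken from the summary; "model-b" is made up.
previous = {"GLM-4.7": Snapshot(47, 42.1), "model-b": Snapshot(5, 50.0)}
current = {"GLM-4.7": Snapshot(14, 38.2), "model-b": Snapshot(3, 52.0)}
mismatches = rank_score_mismatches(previous, current)
print(mismatches)
```

A flagged model does not prove a methodology change, but a long list of them would be consistent with rescoring or a revised model set rather than real performance shifts.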

Daily rankings from SWE-rebench, a benchmark designed to compare LLM capabilities fairly on real-world software engineering tasks. Unlike other evaluations, it uses the same standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Model                       Score
1   gpt-5.5-2026-04-23-xhigh    62.7%
2   Codex                       60.4%
3   Claude Code                 59.6%
4   gpt-5.5-2026-04-23-medium   58.9%
5   Claude Opus 4.8-xhigh       56.4%
6   gpt-5.4-2026-03-05-medium   54.9%
7   Claude Opus 4.7-high        53.1%
8   Cursor                      53.0%
9   Claude Sonnet 4.6-high      51.3%
10  Gemini 3.1 Pro Preview      51.1%
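
The five-run protocol mentioned above can be illustrated with a short sketch of how repeated runs might be reduced to a single score with an uncertainty estimate. This is an assumption about the aggregation, not SWE-rebench's documented method; the function name and the per-run values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(resolved_rates):
    """Return (mean, standard error) for per-run scores given in percent."""
    n = len(resolved_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(resolved_rates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(resolved_rates) / sqrt(n)
    return avg, sem


# Hypothetical scores from five runs of one model (not real leaderboard data).
runs = [61.0, 63.0, 62.0, 64.0, 60.0]
avg, sem = summarize_runs(runs)
print(f"{avg:.1f}% +/- {sem:.2f}")
```

With a standard error in hand, a gap such as the 0.1 points between 7th and 8th place can be judged against run-to-run noise before it is read as a real difference.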

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                    Score  tok/s  $/1M
1   Claude Opus 4.8          61.4   60     $10.94
2   GPT-5.5                  60.2   66     $11.25
3   Claude Opus 4.7          57.3   56     $10.94
4   Gemini 3.1 Pro Preview   57.2   132    $4.50
5   GPT-5.4                  56.8   79     $5.63
6   Qwen3.7 Max              56.6   201    $3.75
7   Gemini 3.5 Flash         55.3   227    $3.38
8   Kimi K2.6                53.9   40     $1.71
9   MiMo-V2.5-Pro            53.8   53     $0.544
10  GPT-5.3 Codex            53.6   84     $4.81

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                    tok/s
1   Step 3.7 Flash           408
2   Gemini 3.5 Flash         227
3   Grok 4.20 0309 v2        213
4   Qwen3.7 Max              201
5   GPT-5 Codex              191
6   Gemini 3 Flash Preview   186
7   Grok 4.20 0309           184
8   MiniMax-M2.5             178
9   GPT-5.1 Codex            174
10  GPT-5.4 mini             173
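
A speed ranking with a minimum intelligence score amounts to a filter followed by a sort. The sketch below shows that logic under two assumptions: that the threshold is inclusive, and that ties need no special handling. The `Entry` type and the `tiny-model` row are hypothetical; the other rows reuse scores and speeds from the composite index table.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float     # composite intelligence score
    tok_per_s: int   # output tokens per second


def fastest(entries, min_score=40.0, top_n=10):
    """Keep entries at or above min_score, then rank by speed, fastest first."""
    eligible = [e for e in entries if e.score >= min_score]
    return sorted(eligible, key=lambda e: e.tok_per_s, reverse=True)[:top_n]


entries = [
    Entry("Claude Opus 4.8", 61.4, 60),
    Entry("Gemini 3.1 Pro Preview", 57.2, 132),
    Entry("Qwen3.7 Max", 56.6, 201),
    Entry("Gemini 3.5 Flash", 55.3, 227),
    Entry("tiny-model", 35.0, 500),  # hypothetical: fast but below the cutoff
]
ranked = [e.model for e in fastest(entries)]
print(ranked)
```

The cutoff explains why a very fast but weak model would not appear in the list at all, however high its throughput.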

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model               $/1M
1   MiMo-V2-Flash       $0.15
2   MiMo-V2.5           $0.175
3   DeepSeek V4 Flash   $0.175
4   Hy3-preview         $0.20
5   DeepSeek V3.2       $0.337
6   Step 3.7 Flash      $0.438
7   GPT-5.4 nano        $0.463
8   MiniMax-M2.7        $0.525
9   KAT Coder Pro V2    $0.525
10  MiniMax-M2.5        $0.525
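
A blended price at a 3:1 input/output ratio is most naturally read as a token-weighted average of the two per-token prices. The sketch below assumes that reading, which the leaderboard caption does not spell out; the function name and the example prices are hypothetical.

```python
def blended_cost(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
cost = blended_cost(1.00, 5.00)
print(f"${cost:.2f} per 1M tokens")
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well on this measure.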