The Inference Report

June 29, 2026

The SWE-rebench leaderboard shows no movement from the previous snapshot: OpenAI's gpt-5.5-2026-04-23-xhigh holds first at 62.7%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%. The reported margins are narrow (± 0.64 to ± 1.37 points among the top three), but the ranges of adjacent entries overlap, so only first and third are separated, and narrowly.

The Artificial Analysis index presents a picture of modest shuffling rather than substantive reordering. Claude Fable 5 leads at 59.9, Claude Opus 4.8 sits at 55.7, and GPT-5.5 ranks third at 54.8; the list contains no new entries, and the scores at the top appear identical to the prior rankings. Two minor changes occur in the mid-range: Qwen3 32B moves from rank 217 to 209 as its score improves from 10.5 to 11.5, and Sarvam 105B and Magistral Small 1.2 exchange positions around ranks 204-205 without score changes, which suggests database reorganization rather than an actual performance shift.

The methodology underlying both benchmarks remains opaque. SWE-rebench reports a ± margin for each score and says it runs each model five times, yet it gives no detail on the evaluation protocol, the task distribution, or how the margin is derived from those runs. Artificial Analysis provides no uncertainty quantification at all, making it impossible to assess whether fractional score differences reflect genuine capability gaps or measurement noise.

The two benchmarks also diverge at the top: gpt-5.5-xhigh leads SWE-rebench, while GPT-5.5 ranks third on Artificial Analysis. That raises the question of whether they measure the same construct, or whether one better captures real-world code repair needs. Without clarification of what each benchmark tests, how tasks are sampled, and whether scoring is reproducible, the rankings function as indices rather than measures of engineering competence.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

# | Provider | Name | Type | Score
1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91%
2 | Junie | Junie | Agent | 61.6% ± 0.64%
3 | OpenAI | Codex | Agent | 60.4% ± 1.37%
4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98%
5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78%
6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20%
7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02%
8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45%
9 | Cursor | Cursor | Agent | 53.0% ± 0.53%
10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55%
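
Whether the ± margins in the table above separate neighbouring entries can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each ± value is the half-width of a symmetric interval around the score (the source does not say whether it is a standard error or a confidence interval), and the helper names `interval`, `intervals_overlap`, and `adjacent_overlaps` are made up for this example.

```python
# Scores and ± margins (in percentage points) copied from the top four rows
# of the SWE-rebench table.
# Assumption: each ± value is the half-width of a symmetric interval.
TOP_ENTRIES = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
]


def interval(score, margin):
    """Return the (low, high) range implied by score ± margin."""
    return (score - margin, score + margin)


def intervals_overlap(a, b):
    """True if the two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def adjacent_overlaps(entries):
    """Compare each entry with the one ranked directly below it."""
    results = []
    for (name_a, s_a, m_a), (name_b, s_b, m_b) in zip(entries, entries[1:]):
        overlap = intervals_overlap(interval(s_a, m_a), interval(s_b, m_b))
        results.append((name_a, name_b, overlap))
    return results


if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(TOP_ENTRIES):
        verdict = "overlap" if overlap else "are separated"
        print(f"{upper} vs {lower}: ranges {verdict}")
```

Under that assumption, a pair whose ranges overlap cannot be ranked with confidence from these numbers alone. A proper comparison would need the per-run results, which the leaderboard excerpt does not include.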

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

# | Model | Score | tok/s | $/1M
1 | Claude Fable 5 | 59.9 | 0 | $20.00
2 | Claude Opus 4.8 | 55.7 | 58 | $10.00
3 | GPT-5.5 | 54.8 | 83 | $11.25
4 | Claude Opus 4.7 | 53.5 | 55 | $10.00
5 | GPT-5.4 | 51.4 | 174 | $5.63
6 | GLM-5.2 | 51.1 | 139 | $2.15
7 | Gemini 3.5 Flash | 50.2 | 214 | $3.38
8 | Claude Sonnet 4.6 | 47.2 | 57 | $6.00
9 | Gemini 3.1 Pro Preview | 46.5 | 137 | $4.50
10 | Qwen3.7 Max | 46 | 198 | $3.75

Output tokens per second — higher is faster. Minimum intelligence score of 40.

# | Model | tok/s
1 | Gemini 3.5 Flash | 214
2 | Qwen3.7 Max | 198
3 | GPT-5.4 mini | 178
4 | GPT-5.4 | 174
5 | GLM-5.2 | 139
6 | Gemini 3.1 Pro Preview | 137
7 | GPT-5.2 Codex | 135
8 | DeepSeek V4 Flash | 109
9 | GPT-5.3 Codex | 94
10 | MiMo-V2.5 | 90

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

# | Model | $/1M
1 | DeepSeek V4 Flash | $0.175
2 | MiMo-V2.5 | $0.175
3 | MiniMax-M3 | $0.525
4 | DeepSeek V4 Pro | $0.544
5 | MiMo-V2.5-Pro | $0.544
6 | Nex-N2-Pro | $1.00
7 | MiMo-V2-Pro | $1.50
8 | GPT-5.4 mini | $1.69
9 | Kimi K2.6 | $1.71
10 | Kimi K2.7 Code | $1.71
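
The 3:1 blend behind the cost figures above can be read as a weighted average of input and output prices, with input tokens counted three times for every output token. The sketch below shows that arithmetic. It assumes the blend is a simple weighted mean, the function name `blended_price` is made up for this example, and the prices in the usage line are hypothetical rather than taken from any provider's price list.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted-average price per 1M tokens for a given input:output mix.

    Assumption: the blend is a plain weighted mean, so a 3:1 ratio means
    three parts input price to one part output price.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


if __name__ == "__main__":
    # Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
    print(f"${blended_price(1.00, 4.00):.2f} per 1M tokens")  # (3*1.00 + 4.00) / 4 = $1.75
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can rank better on this measure than its output price alone would suggest.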