The Inference Report

June 26, 2026

The SWE-rebench rankings have been entirely replaced with a fresh cohort that mixes model entries with agent-based systems. OpenAI's gpt-5.5-2026-04-23-xhigh leads at 62.7% ± 0.91%, followed by the Junie agent at 61.6% ± 0.64% and the OpenAI Codex agent at 60.4% ± 1.37%. On the Artificial Analysis index, Claude Fable 5 remains in the top position at 59.9.

Agent products take three of the top four SWE-rebench places, and several model entries carry explicit configuration labels such as "xhigh" and "medium." SWE-rebench scores are generally higher than the Artificial Analysis values for models that appear on both lists: GPT-5.5 reaches 62.7% against 54.8, and Claude Sonnet 4.6 reaches 51.3% against 47.2, although Claude Opus 4.7 is a counterexample at 53.1% against 53.5. The two figures are not on a common scale. SWE-rebench reports a percentage score on software engineering tasks, while Artificial Analysis reports a composite index across coding, math, and reasoning, so the gap suggests that the benchmarks measure different aspects of capability or use divergent evaluation protocols.

All seventeen prior SWE-rebench entries have been dropped, which points to a benchmark refresh, a shift in how systems are tested, or a change in what qualifies for inclusion. The reported ± values in the new cohort range from 0.53% to 1.98%, tighter than one might expect given the diversity of approaches. The Artificial Analysis ranking remains largely stable in its upper tiers, with minor reordering and Devstral 2 moving up from position 165 to 140, but without the wholesale replacement seen in SWE-rebench, which suggests that the two evaluation frameworks operate on different cadences or criteria.

Without prior SWE-rebench data for the new entries, it is not possible to determine whether these scores represent genuine capability gains or simply reflect how the new agent and model configurations perform on that particular benchmark relative to the model-only systems previously ranked.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

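| # | Provider | Model | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |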
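
Since each score above comes with a ± value, a quick way to read the table is to check whether two entries' ranges overlap before treating a rank difference as meaningful. The sketch below assumes the ± value is a symmetric margin around the score; the source does not say whether it is a standard error, a standard deviation, or a confidence interval, so this is a rough heuristic and not a significance test. The `Score` type and function names are illustrative only.

```python
from typing import NamedTuple


class Score(NamedTuple):
    name: str
    mean: float    # reported score, in percent
    margin: float  # reported "±" value, in percent (assumed symmetric)


def interval(s: Score) -> tuple[float, float]:
    """Return the (low, high) range implied by score ± margin."""
    return (s.mean - s.margin, s.mean + s.margin)


def intervals_overlap(a: Score, b: Score) -> bool:
    """True if the two ± ranges share at least one point."""
    a_lo, a_hi = interval(a)
    b_lo, b_hi = interval(b)
    return a_lo <= b_hi and b_lo <= a_hi


# Values copied from the SWE-rebench table above.
gpt55_xhigh = Score("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91)
junie = Score("Junie", 61.6, 0.64)
gpt55_medium = Score("gpt-5.5-2026-04-23-medium", 58.9, 0.78)

print(intervals_overlap(gpt55_xhigh, junie))         # True
print(intervals_overlap(gpt55_xhigh, gpt55_medium))  # False
```

Under this reading, the first and second places are not clearly separated, while the gap between the xhigh and medium configurations of gpt-5.5 is larger than the two margins combined.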

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

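| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 65 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 73 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 55 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 150 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 123 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 213 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 67 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 136 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 203 | $3.75 |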

Output tokens per second — higher is faster. Minimum intelligence score of 40.

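| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 213 |
| 2 | Qwen3.7 Max | 203 |
| 3 | GPT-5.4 mini | 176 |
| 4 | GPT-5.4 | 150 |
| 5 | GPT-5.2 Codex | 139 |
| 6 | Gemini 3.1 Pro Preview | 136 |
| 7 | GLM-5.2 | 123 |
| 8 | Nex-N2-Pro | 117 |
| 9 | DeepSeek V4 Flash | 113 |
| 10 | MiMo-V2.5 | 84 |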

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

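| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |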
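
The 3:1 blended cost used in the table above can be read as a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three input tokens are counted for every output token; the function name and the example prices are hypothetical and are not taken from any provider's price list.

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
example = blended_price(input_price=1.00, output_price=5.00)
print(f"${example:.2f} per 1M tokens")  # $2.00 per 1M tokens
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure.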