The Inference Report

June 27, 2026

The SWE-rebench rankings are unchanged from the previous report, with no movement across the top 17 coding agents. OpenAI's gpt-5.5-2026-04-23-xhigh holds first at 62.7% (±0.91%), followed by the Junie agent at 61.6% (±0.64%), and scores decline steadily to 51.3% at tenth place. The reported margins are narrow, but they do not separate every placement: Claude Opus 4.7-high (53.1% ±1.45%) and Cursor (53.0% ±0.53%) overlap, as do several other adjacent pairs in the top ten, including the top two. By contrast, Claude Opus 4.6-high (47.8% ±1.37%) and Claude Sonnet 4.6 (51.3% ±0.55%) are separated by more than their combined margins.

On the Artificial Analysis index, the middle and lower tiers show more movement: Magistral Medium 1.2 dropped from position 130 to 148, while Apriel-v1.6-15B-Thinker moved from 129 to 128. The top of the table is unchanged: Claude Fable 5 leads at 59.9, with Claude Opus 4.8 and GPT-5.5 holding second and third at 55.7 and 54.8 respectively. Without prior Artificial Analysis scores, it is unclear whether the shuffling in positions 128 to 148 reflects genuine performance changes or measurement variance.

The two benchmarks measure different problem spaces: SWE-rebench targets repository-level software engineering tasks, while Artificial Analysis is a composite across coding, math, and reasoning. That helps explain why their orderings diverge: coding-specific agents such as Junie rank near the top of SWE-rebench, while Claude Fable 5 tops the general index. The stability in SWE-rebench suggests either that the top agents have reached a plateau or that the evaluation's resolution cannot detect sub-point improvements.
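Whether two SWE-rebench placements are distinguishable can be checked mechanically from the published figures. The sketch below is illustrative only: it assumes each "±" value is a symmetric margin around the score, which the source does not specify, and the names `Entry`, `intervals_overlap`, and `overlapping_neighbors` are hypothetical helpers. Interval overlap is a rough screen, not a formal significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score: float   # resolved rate, in percent
    margin: float  # reported "±" value, in percentage points

    @property
    def low(self) -> float:
        return self.score - self.margin

    @property
    def high(self) -> float:
        return self.score + self.margin


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True when the two [score - margin, score + margin] ranges intersect."""
    return a.low <= b.high and b.low <= a.high


def overlapping_neighbors(entries: list[Entry]) -> list[tuple[str, str]]:
    """Names of adjacent leaderboard entries whose ranges intersect."""
    return [
        (a.name, b.name)
        for a, b in zip(entries, entries[1:])
        if intervals_overlap(a, b)
    ]


# Top ten as published in the SWE-rebench table in this report.
top_ten = [
    Entry("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    Entry("Junie", 61.6, 0.64),
    Entry("Codex", 60.4, 1.37),
    Entry("Claude Code", 59.6, 1.98),
    Entry("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    Entry("Claude Opus 4.8-xhigh", 56.5, 1.20),
    Entry("gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    Entry("Claude Opus 4.7-high", 53.1, 1.45),
    Entry("Cursor", 53.0, 0.53),
    Entry("Claude Sonnet 4.6", 51.3, 0.55),
]

# Figure quoted in the summary text (not in the top-ten table).
opus_46_high = Entry("Claude Opus 4.6-high", 47.8, 1.37)

if __name__ == "__main__":
    for first, second in overlapping_neighbors(top_ten):
        print(f"overlap: {first} / {second}")
    print("Opus 4.6-high vs Sonnet 4.6 overlap:",
          intervals_overlap(opus_46_high, top_ten[-1]))
```

Running the script lists the adjacent pairs that the reported margins cannot separate, which is a quick way to see which rank differences deserve weight.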

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

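| # | Vendor | Model | Type | Score |
|---|--------|-------|------|-------|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |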
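The benchmark description says each model is run five times, and one plausible way to turn five runs into a "score ± margin" pair is the mean with its standard error. The source does not state how the "±" column is actually computed, so the formula here is an assumption, `summarize_runs` is a hypothetical helper, and the run values are invented purely for illustration.

```python
import math
import statistics


def summarize_runs(pass_rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated runs.

    Assumption: the margin is the sample standard deviation divided by
    sqrt(n). SWE-rebench may use a different definition.
    """
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(pass_rates)
    sem = statistics.stdev(pass_rates) / math.sqrt(len(pass_rates))
    return mean, sem


# Made-up resolved rates (percent) for five runs of one model.
hypothetical_runs = [60.0, 62.0, 61.0, 63.0, 64.0]

if __name__ == "__main__":
    mean, sem = summarize_runs(hypothetical_runs)
    print(f"{mean:.1f}% ± {sem:.2f}%")
```

With only five runs the spread estimate is itself noisy, which is one reason small rank differences should be read cautiously.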

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

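| # | Model | Score | tok/s | $/1M |
|---|-------|-------|-------|------|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 60 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 83 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 57 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 163 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 120 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 225 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 70 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 145 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 203 | $3.75 |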

Output tokens per second — higher is faster. Minimum intelligence score of 40.

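| # | Model | tok/s |
|---|-------|-------|
| 1 | Gemini 3.5 Flash | 225 |
| 2 | Qwen3.7 Max | 203 |
| 3 | GPT-5.4 mini | 177 |
| 4 | GPT-5.4 | 163 |
| 5 | Gemini 3.1 Pro Preview | 145 |
| 6 | GPT-5.2 Codex | 139 |
| 7 | GLM-5.2 | 120 |
| 8 | Nex-N2-Pro | 118 |
| 9 | DeepSeek V4 Flash | 114 |
| 10 | GPT-5.3 Codex | 100 |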

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

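| # | Model | $/1M |
|---|-------|------|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |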
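The blended figure above is described as a 3:1 input/output mix, which reads most naturally as a weighted average of the per-million input and output prices. The sketch below shows that arithmetic. `blended_price` is a hypothetical helper, the weighted-average interpretation is an assumption, and the example prices are made up; the per-direction prices behind the table are not given here.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted average of input and output prices (USD per 1M tokens).

    The default 3:1 weighting mirrors the ratio named in the table
    caption; treating it as a simple weighted mean is an assumption.
    """
    if input_weight < 0 or output_weight < 0:
        raise ValueError("weights must be non-negative")
    total_weight = input_weight + output_weight
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return (
        input_weight * input_price + output_weight * output_price
    ) / total_weight


if __name__ == "__main__":
    # Hypothetical prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
    print(f"${blended_price(1.00, 5.00):.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still rank well on this measure. Workloads that generate long outputs may see a different effective cost.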