The Inference Report

July 1, 2026

On the SWE-rebench coding benchmark, the top tier is unchanged from the previous round: OpenAI's gpt-5.5-2026-04-23-xhigh holds first place at 62.7% (±0.91%), followed by the Junie agent at 61.6% (±0.64%) and OpenAI's Codex agent at 60.4% (±1.37%).

The Artificial Analysis index, by contrast, shows material reshuffling across its 398-model roster. Claude Fable 5 enters at #1 with 59.9 points, displacing GPT-5.5 to #3, while Claude Sonnet 5 debuts at #5 with 53.4 points, pushing prior entries down. Lower in the rankings, DeepSeek V3 climbs from #220 (10.4) to #180 (14.2), a 3.8-point gain that suggests either changed evaluation conditions or a correction to the prior assessment. Qwen3.5 9B drops from #101 (25.0) to #120 (21.4), a 3.6-point decline that raises the question of whether the methodology was applied consistently between rounds.

SWE-rebench's narrow error margins (under ±1.5% for nine of the top ten entries) and static ordering are consistent with a well-controlled experimental setup, whereas the broader movement and new entrants on Artificial Analysis indicate either looser evaluation criteria or frequent model updates that shift relative standing. Neither benchmark offers the methodological transparency needed to distinguish genuine performance improvement from variance in test conditions.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Provider   Model                      Type   Score
1   OpenAI     gpt-5.5-2026-04-23-xhigh   Model  62.7% ± 0.91%
2   Junie      Junie                      Agent  61.6% ± 0.64%
3   OpenAI     Codex                      Agent  60.4% ± 1.37%
4   Anthropic  Claude Code                Agent  59.6% ± 1.98%
5   OpenAI     gpt-5.5-2026-04-23-medium  Model  58.9% ± 0.78%
6   Anthropic  Claude Opus 4.8-xhigh      Model  56.5% ± 1.20%
7   OpenAI     gpt-5.4-2026-03-05-medium  Model  54.9% ± 1.02%
8   Anthropic  Claude Opus 4.7-high       Model  53.1% ± 1.45%
9   Cursor     Cursor                     Agent  53.0% ± 0.53%
10  Anthropic  Claude Sonnet 4.6          Model  51.3% ± 0.55%
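
To make the "five runs" idea concrete, here is a minimal Python sketch of how a mean score and its margin could be derived from repeated runs, and how two leaderboard entries can be checked for overlapping margins. It assumes the reported ± is a standard error of the mean, which the leaderboard does not state; the five run values are hypothetical, while the three (score, margin) pairs are taken from the table above.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for repeated runs."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


def intervals_overlap(a, b):
    """True if the ranges score ± margin of a and b overlap.

    Each argument is a (score, margin) pair in percentage points.
    """
    (score_a, margin_a), (score_b, margin_b) = a, b
    return abs(score_a - score_b) <= margin_a + margin_b


# Hypothetical resolved rates from five runs of one model (not leaderboard data).
runs = [61.0, 62.5, 63.0, 62.0, 64.0]
avg, sem = mean_and_sem(runs)

# Top three entries from the table: (score %, reported ± %).
top3 = {
    "gpt-5.5-2026-04-23-xhigh": (62.7, 0.91),
    "Junie": (61.6, 0.64),
    "Codex": (60.4, 1.37),
}
first_vs_second = intervals_overlap(top3["gpt-5.5-2026-04-23-xhigh"], top3["Junie"])
first_vs_third = intervals_overlap(top3["gpt-5.5-2026-04-23-xhigh"], top3["Codex"])
```

Under that assumption, the margins of the first and second entries overlap while those of the first and third do not, so the gap between #1 and #2 is less clear-cut than the ordering alone suggests.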

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                   Score  tok/s  $/1M
1   Claude Fable 5          59.9   0      $20.00
2   Claude Opus 4.8         55.7   65     $10.00
3   GPT-5.5                 54.8   77     $11.25
4   Claude Opus 4.7         53.5   48     $10.00
5   Claude Sonnet 5         53.4   79     $6.00
6   GPT-5.4                 51.4   157    $5.63
7   GLM-5.2                 51.1   160    $2.15
8   Gemini 3.5 Flash        50.2   210    $3.38
9   Claude Sonnet 4.6       47.2   63     $6.00
10  Gemini 3.1 Pro Preview  46.5   128    $4.50

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                   tok/s
1   Gemini 3.5 Flash        210
2   Qwen3.7 Max             195
3   GLM-5.2                 160
4   GPT-5.4                 157
5   GPT-5.4 mini            154
6   Gemini 3.1 Pro Preview  128
7   GPT-5.2 Codex           118
8   DeepSeek V4 Flash       90
9   MiMo-V2.5               86
10  MiniMax-M3              84
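
The speed ranking amounts to a filter followed by a sort. The Python sketch below illustrates that logic on the ten rows of the composite index table, which is only a subset of the full roster, so its output will not match the speed table exactly. It assumes the score threshold is inclusive; the function name `speed_ranking` is illustrative.

```python
MIN_SCORE = 40.0  # minimum intelligence score stated in the table caption

# (model, composite score, output tok/s) as read from the composite index table.
rows = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 65),
    ("GPT-5.5", 54.8, 77),
    ("Claude Opus 4.7", 53.5, 48),
    ("Claude Sonnet 5", 53.4, 79),
    ("GPT-5.4", 51.4, 157),
    ("GLM-5.2", 51.1, 160),
    ("Gemini 3.5 Flash", 50.2, 210),
    ("Claude Sonnet 4.6", 47.2, 63),
    ("Gemini 3.1 Pro Preview", 46.5, 128),
]


def speed_ranking(rows, min_score=MIN_SCORE):
    """Keep rows at or above min_score, then sort fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    return sorted(eligible, key=lambda row: row[2], reverse=True)


ranking = speed_ranking(rows)
fastest_four = [model for model, _, _ in ranking[:4]]
```

Raising `min_score` narrows the field before sorting, which is why a fast but lower-scoring model can disappear from the list entirely.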

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model              $/1M
1   DeepSeek V4 Flash  $0.175
2   MiMo-V2.5          $0.175
3   MiniMax-M3         $0.525
4   DeepSeek V4 Pro    $0.544
5   MiMo-V2.5-Pro      $0.544
6   Nex-N2-Pro         $1.00
7   MiMo-V2-Pro        $1.50
8   GPT-5.4 mini       $1.69
9   Kimi K2.6          $1.71
10  Kimi K2.7 Code     $1.71
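
A 3:1 blended price can be read as a weighted average in which input tokens count three times for every output token. The Python sketch below shows that arithmetic; the reading of "3:1" and the per-token prices used are assumptions for illustration, not figures published by the index.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Hypothetical list prices in USD per 1M tokens (not taken from the tables).
example = blended_price(input_price=2.00, output_price=12.00)
# (3 * 2.00 + 1 * 12.00) / 4 = 4.50
```

Because input tokens dominate the weighting, a model with expensive output but cheap input can still land low in a blended ranking.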