The Inference Report

July 2, 2026

On SWE-rebench, the top tier remains locked in place: OpenAI's gpt-5.5-xhigh holds 62.7% ± 0.91%, followed by the Junie agent at 61.6% and OpenAI's Codex agent at 60.4%, with no movement among the leading six entries. Below that tier, Z.ai's GLM-5.2 enters at position 12 with 51.1% ± 1.13%, pushing its predecessor GLM-5.1 down to 13th place. DeepSeek-V4 Pro and MiMo-V2.5-Pro appear as new entries at 18 and 19 respectively, and Qwen models make their first SWE-rebench appearances at positions 22 and 23.

The Artificial Analysis index shows broader volatility. Claude Fable 5, absent from the earlier snapshot, leads at 59.9. GPT-5.1 dropped from 38.9 to 36.9 (position 44), and Command A+ fell from 29.3 to 22.5 (position 111), the largest documented decline. gpt-oss-20b, which had held position 171 at 14.9, has been removed from the rankings entirely.

The SWE-rebench data carries tighter confidence intervals than Artificial Analysis, suggesting more controlled evaluation conditions. Both benchmarks nonetheless show a frontier still dominated by OpenAI and Anthropic systems, with newer Chinese models (Qwen, GLM variants) gaining ground in the mid-tier rather than displacing the leaders.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

# | Vendor | Model | Type | Score
1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91%
2 | Junie | Junie | Agent | 61.6% ± 0.64%
3 | OpenAI | Codex | Agent | 60.4% ± 1.37%
4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98%
5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78%
6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20%
7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02%
8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45%
9 | Cursor | Cursor | Agent | 53.0% ± 0.53%
10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55%
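
As a rough illustration of how a "score ± spread" figure can be derived from five repeated runs, the sketch below averages per-run scores and reports the standard error of the mean. This is an assumption for illustration only: the report does not say which statistic SWE-rebench uses for its ± column, and the run values and the helper name mean_with_sem are hypothetical.

```python
import math
import statistics


def mean_with_sem(run_scores):
    """Return (mean, standard error of the mean) for per-run scores.

    Assumption: the spread is reported as a standard error. The actual
    statistic behind the leaderboard's ± column is not stated in the report.
    """
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(run_scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, sem


# Hypothetical resolved-rate percentages from five runs of one model.
runs = [61.0, 62.5, 63.0, 64.0, 62.0]
mean, sem = mean_with_sem(runs)
print(f"{mean:.1f}% ± {sem:.2f}%")
```

With more runs the standard error shrinks, which is one reason repeated evaluation makes small gaps between neighboring entries easier to interpret.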

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

# | Model | Score | tok/s | $/1M
1 | Claude Fable 5 | 59.9 | 69 | $20.00
2 | Claude Opus 4.8 | 55.7 | 66 | $10.00
3 | GPT-5.5 | 54.8 | 82 | $11.25
4 | Claude Opus 4.7 | 53.5 | 51 | $10.00
5 | Claude Sonnet 5 | 53.4 | 89 | $6.00
6 | GPT-5.4 | 51.4 | 165 | $5.63
7 | GLM-5.2 | 51.1 | 184 | $2.15
8 | Gemini 3.5 Flash | 50.2 | 214 | $3.38
9 | Claude Sonnet 4.6 | 47.2 | 69 | $6.00
10 | Gemini 3.1 Pro Preview | 46.5 | 138 | $4.50

Output tokens per second — higher is faster. Minimum intelligence score of 40.

# | Model | tok/s
1 | Gemini 3.5 Flash | 214
2 | Qwen3.7 Max | 197
3 | GLM-5.2 | 184
4 | GPT-5.4 mini | 175
5 | GPT-5.4 | 165
6 | Gemini 3.1 Pro Preview | 138
7 | GPT-5.2 Codex | 125
8 | DeepSeek V4 Flash | 91
9 | Claude Sonnet 5 | 89
10 | Nex-N2-Pro | 87

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

# | Model | $/1M
1 | DeepSeek V4 Flash | $0.175
2 | MiMo-V2.5 | $0.175
3 | MiniMax-M3 | $0.525
4 | DeepSeek V4 Pro | $0.544
5 | MiMo-V2.5-Pro | $0.544
6 | Nex-N2-Pro | $1.00
7 | MiMo-V2-Pro | $1.50
8 | GPT-5.4 mini | $1.69
9 | Kimi K2.6 | $1.71
10 | Kimi K2.7 Code | $1.71
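
For readers who want to reproduce a blended figure like the ones above, a minimal sketch follows. It assumes the 3:1 input/output blend is a plain weighted average of the per-1M input and output prices; the function name blended_price and the example prices are hypothetical and are not taken from any vendor's price list.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average price per 1M tokens for a given input:output mix.

    Assumption: "3:1 input/output" means three parts input tokens to one
    part output tokens, combined as a simple weighted mean.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# Illustrative prices in USD per 1M tokens (hypothetical values).
example = blended_price(input_price=1.00, output_price=4.00)
print(f"${example:.2f} per 1M tokens")
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well on this measure.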