The Inference Report

July 3, 2026

The SWE-rebench rankings show no movement from the previous report: the top tier is unchanged, with GPT-5.5-xhigh at 62.7%, the Junie agent at 61.6%, and the Codex agent at 60.4%, each holding its position in the full 24-model list. The Artificial Analysis index, by contrast, shows substantial churn across its 398-entry ranking, though its top tier is again stable: Claude Fable 5 holds the lead at 59.9, followed by Claude Opus 4.8 at 55.7 and GPT-5.5 at 54.8. Below that summit, however, the ordering has shifted measurably: GPT-5 mini dropped from #65 at 33.0 to #72 at 30.9, a loss of 2.1 points and seven positions; Mistral Small 4 fell from #126 at 20.8 to #132 at 19.6; and Qwen3 Next 80B A3B plummeted from #134 at 19.8 to #159 at 16.7. These moves suggest either a methodological revision or genuine performance variance in the 16-20 point band, where many models cluster.

SWE-rebench's immobility raises the question of whether agentic benchmarks like it are less sensitive to model updates than Artificial Analysis, or whether the coding agents themselves have stabilized while the underlying base models continue to diverge. The mid-range instability in Artificial Analysis, where confidence intervals would likely overlap, warrants scrutiny of whether those score differences exceed measurement error; without published confidence bounds for that benchmark, the ranking shifts read as plausible but not necessarily meaningful.
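The rank and score shifts quoted above reduce to simple subtraction. The following is a minimal sketch that recomputes them from the before/after figures in the text; the `Move` class and its field names are illustrative, not part of either benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """One model's before/after position on the Artificial Analysis index."""
    model: str
    old_rank: int
    old_score: float
    new_rank: int
    new_score: float

    @property
    def score_delta(self) -> float:
        # Rounded to one decimal to match the precision of the published scores.
        return round(self.new_score - self.old_score, 1)

    @property
    def rank_delta(self) -> int:
        # Positive means the model moved down the list.
        return self.new_rank - self.old_rank


# Figures copied from the paragraph above.
moves = [
    Move("GPT-5 mini", 65, 33.0, 72, 30.9),
    Move("Mistral Small 4", 126, 20.8, 132, 19.6),
    Move("Qwen3 Next 80B A3B", 134, 19.8, 159, 16.7),
]

for m in moves:
    print(f"{m.model}: {m.score_delta:+.1f} points, down {m.rank_delta} positions")
```

By this arithmetic, Mistral Small 4 lost 1.2 points and six positions, and Qwen3 Next 80B A3B lost 3.1 points and 25 positions.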

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

#   Organization  Model                      Type   Score
1   OpenAI        gpt-5.5-2026-04-23-xhigh   Model  62.7% ± 0.91%
2   Junie         Junie                      Agent  61.6% ± 0.64%
3   OpenAI        Codex                      Agent  60.4% ± 1.37%
4   Anthropic     Claude Code                Agent  59.6% ± 1.98%
5   OpenAI        gpt-5.5-2026-04-23-medium  Model  58.9% ± 0.78%
6   Anthropic     Claude Opus 4.8-xhigh      Model  56.5% ± 1.20%
7   OpenAI        gpt-5.4-2026-03-05-medium  Model  54.9% ± 1.02%
8   Anthropic     Claude Opus 4.7-high       Model  53.1% ± 1.45%
9   Cursor        Cursor                     Agent  53.0% ± 0.53%
10  Anthropic     Claude Sonnet 4.6          Model  51.3% ± 0.55%
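
The ± figures in the table allow a rough check on how separable adjacent ranks are. The sketch below assumes each ± value is a symmetric uncertainty band around the score; the source does not say whether it is a standard error, a standard deviation, or a confidence interval. It flags adjacent pairs whose bands overlap. Band overlap is only a heuristic, not a significance test, and the names `entries` and `bands_overlap` are illustrative.

```python
# (rank, score in %, reported ± in %) copied from the top-10 table above.
entries = [
    (1, 62.7, 0.91),
    (2, 61.6, 0.64),
    (3, 60.4, 1.37),
    (4, 59.6, 1.98),
    (5, 58.9, 0.78),
    (6, 56.5, 1.20),
    (7, 54.9, 1.02),
    (8, 53.1, 1.45),
    (9, 53.0, 0.53),
    (10, 51.3, 0.55),
]


def bands_overlap(a: float, err_a: float, b: float, err_b: float) -> bool:
    """True if [a - err_a, a + err_a] and [b - err_b, b + err_b] intersect."""
    return abs(a - b) <= err_a + err_b


# Compare each entry with the one ranked directly below it.
adjacent_overlaps = [
    bands_overlap(s1, e1, s2, e2)
    for (_, s1, e1), (_, s2, e2) in zip(entries, entries[1:])
]

for (r1, _, _), (r2, _, _), overlap in zip(entries, entries[1:], adjacent_overlaps):
    status = "overlap" if overlap else "separated"
    print(f"#{r1} vs #{r2}: {status}")
```

Under that assumption, only the #5/#6 and #9/#10 pairs are separated by more than their combined bands; every other adjacent pair in the top 10 overlaps.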

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

#   Model                   Score  tok/s  $/1M
1   Claude Fable 5          59.9   64     $20.00
2   Claude Opus 4.8         55.7   65     $10.00
3   GPT-5.5                 54.8   84     $11.25
4   Claude Opus 4.7         53.5   50     $10.00
5   Claude Sonnet 5         53.4   87     $6.00
6   GPT-5.4                 51.4   166    $5.63
7   GLM-5.2                 51.1   181    $2.15
8   Gemini 3.5 Flash        50.2   210    $3.38
9   Claude Sonnet 4.6       47.2   69     $6.00
10  Gemini 3.1 Pro Preview  46.5   136    $4.50

Output tokens per second — higher is faster. Minimum intelligence score of 40.

#   Model                   tok/s
1   Gemini 3.5 Flash        210
2   Qwen3.7 Max             200
3   GLM-5.2                 181
4   GPT-5.4 mini            168
5   GPT-5.4                 166
6   Gemini 3.1 Pro Preview  136
7   Nex-N2-Pro              120
8   GPT-5.2 Codex           120
9   MiniMax-M3              98
10  DeepSeek V4 Flash       93

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

#   Model              $/1M
1   DeepSeek V4 Flash  $0.175
2   MiMo-V2.5          $0.175
3   MiniMax-M3         $0.525
4   DeepSeek V4 Pro    $0.544
5   MiMo-V2.5-Pro      $0.544
6   Nex-N2-Pro         $1.00
7   MiMo-V2-Pro        $1.50
8   GPT-5.4 mini       $1.69
9   Kimi K2.6          $1.71
10  Kimi K2.7 Code     $1.71
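
The blended cost in the table above can be read as a weighted average of input and output prices. The sketch below assumes that "3:1 input/output" means three parts input tokens to one part output tokens; the function name `blended_price` and the example prices are hypothetical and do not describe any model listed here.

```python
def blended_price(
    input_per_m: float,
    output_per_m: float,
    input_parts: int = 3,
    output_parts: int = 1,
) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_per_m + output_parts * output_per_m) / total_parts


# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
example = blended_price(1.00, 4.00)
print(f"${example:.2f} per 1M blended tokens")  # (3 * 1.00 + 1 * 4.00) / 4 = 1.75
```

Because input tokens carry three times the weight, a model with cheap input and expensive output can still rank well on this measure.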