The Inference Report

July 4, 2026

On SWE-rebench, the top tier remains static: OpenAI's gpt-5.5-2026-04-23-xhigh holds position one at 62.7 percent, followed by the Junie agent at 61.6 percent and OpenAI's Codex agent at 60.4 percent. Reported standard errors range from 0.53 to 1.98 percentage points across the top ten. That is tight enough to separate the top of the table from the bottom, but not the leaders from one another: the 1.1-point gap between first and second is about the size of their combined standard error, so the ordering of the top three should be read as provisional.

On the Artificial Analysis index, the landscape shifts more dramatically: Claude Fable 5 now leads at 59.9, displacing GPT-5.5 from the top spot, while Claude Opus 4.8 sits second at 55.7 and GPT-5.5 drops to third at 54.8. The middle ranks show considerable churn: Llama 3.3 Instruct 70B climbed from position 258 to 242, a 16-rank jump, while several models in the 240 to 260 range shuffled positions. In a crowded band where many models cluster between 8 and 10 points, modest score movements are enough to produce rank changes of that size.

The two benchmarks do not track perfectly. The 2.8-point gap between the two leaders' scores (62.7 percent on SWE-rebench, 59.9 on Artificial Analysis) is between numbers on different scales, so it hints at different problem structures rather than measuring a difference in capability. The agent entries that rank highly on SWE-rebench (Junie, Codex, Claude Code, Cursor) do not appear on the Artificial Analysis list, and models such as Claude Fable 5 and Claude Sonnet 5 appear in the Artificial Analysis top ten but not in the SWE-rebench top ten. Where the lists do overlap, the ordering differs: GPT-5.5 ranks above Claude Opus 4.8 on SWE-rebench and below it on Artificial Analysis. This indicates the tests measure distinct capabilities rather than a single underlying skill. SWE-rebench concentrates on code-generation agents and models in controlled conditions, while Artificial Analysis appears broader and less transparent in methodology, making direct comparison hazardous. Without a longer history of Artificial Analysis scores, the significance of Claude Fable 5's ascent to first cannot be evaluated; SWE-rebench's stability suggests real differences in agent capability, but the Artificial Analysis movements may reflect noise or evaluation drift rather than genuine progress.

Cole Brennan

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

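| # | Provider | Model | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |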
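One way to read the ± column is to express the gap between two entries in units of their combined standard error. The sketch below does that for two pairs of rows from the table. It assumes the ± values are standard errors and that the two estimates are independent; neither assumption is confirmed by the benchmark's description, and the function name `gap_in_standard_errors` is made up for this illustration.

```python
import math


def gap_in_standard_errors(score_a, se_a, score_b, se_b):
    """Return (score_a - score_b) divided by the combined standard error.

    Assumption: se_a and se_b are standard errors of independent estimates,
    so the standard error of the difference is sqrt(se_a^2 + se_b^2).
    """
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return (score_a - score_b) / combined_se


# Scores and ± values copied from the SWE-rebench table above.
first_vs_second = gap_in_standard_errors(62.7, 0.91, 61.6, 0.64)
first_vs_tenth = gap_in_standard_errors(62.7, 0.91, 51.3, 0.55)

print(f"rank 1 vs rank 2:  {first_vs_second:.2f} combined standard errors")
print(f"rank 1 vs rank 10: {first_vs_tenth:.2f} combined standard errors")
```

Under these assumptions the first and second entries sit roughly one combined standard error apart, while the first and tenth are separated by more than ten. A gap near one standard error is usually not treated as decisive, so adjacent ranks near the top are better read as a cluster than as a strict ordering.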

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

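| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 62 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 61 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 88 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 49 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 79 | $6.00 |
| 6 | GPT-5.4 | 51.4 | 167 | $5.63 |
| 7 | GLM-5.2 | 51.1 | 176 | $2.15 |
| 8 | Gemini 3.5 Flash | 50.2 | 209 | $3.38 |
| 9 | Claude Sonnet 4.6 | 47.2 | 67 | $6.00 |
| 10 | Gemini 3.1 Pro Preview | 46.5 | 140 | $4.50 |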

Output tokens per second — higher is faster. Minimum intelligence score of 40.

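| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 209 |
| 2 | Qwen3.7 Max | 199 |
| 3 | GLM-5.2 | 176 |
| 4 | GPT-5.4 | 167 |
| 5 | GPT-5.4 mini | 164 |
| 6 | Gemini 3.1 Pro Preview | 140 |
| 7 | Nex-N2-Pro | 126 |
| 8 | GPT-5.2 Codex | 123 |
| 9 | MiniMax-M3 | 98 |
| 10 | DeepSeek V4 Flash | 95 |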

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

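| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |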
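For readers who want to reproduce a blended figure from a provider's price sheet, here is a minimal sketch of the arithmetic. It assumes "3:1 input/output" means a weighted average of three parts input price to one part output price. The function name `blended_price` and the example prices are hypothetical and are not taken from the table.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens.

    Assumption: the blend weights the input price by input_parts and the
    output price by output_parts (3:1 by default).
    """
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices, chosen only to illustrate the calculation:
# $1.00 per 1M input tokens and $5.00 per 1M output tokens.
example = blended_price(1.00, 5.00)
print(f"blended: ${example:.2f} per 1M tokens")
```

Because input tokens carry three quarters of the weight under this reading, a model with cheap input and expensive output can still rank well on blended cost. Workloads that generate far more output than input would see a different effective price.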