The Inference Report

June 30, 2026

The SWE-rebench rankings remain static at the top, with OpenAI's gpt-5.5-2026-04-23-xhigh maintaining 62.7% and Junie's agent holding 61.6%, but the Artificial Analysis benchmark shows substantial movement in the mid-tier and below. DeepSeek R1 jumped from position 190 at 12.6% to position 144 at 18.5%, a gain of 5.9 percentage points; Mistral Small 3.1 climbed from 255 at 8.6% to 172 at 14.7%, gaining 6.1 points across an 83-position swing. Devstral Small 2 moved from 186 at 13.1% to 153 at 17.4%, and Llama 3.1 Instruct 8B climbed from 298 at 6.1% to 274 at 7.6%. Gains of this size suggest either that these models were re-evaluated with different configurations or that the benchmark itself shifted its evaluation criteria. Claude 4 Sonnet moved the other way, dropping from 72 at 30.7% to 83 at 28.9%, a loss of 1.8 points and 11 places.

The Artificial Analysis data spans 397 entries compared to 16 on SWE-rebench, creating an asymmetry in what constitutes meaningful movement: a 1-point swing in the top 10 of Artificial Analysis represents roughly 2 percent of the leader's score, while the same absolute change at position 350 is nearly a 20 percent relative improvement. New entry DiffusionGemma 26B A4B, at position 185 with 13.5%, has no prior reference point, making it impossible to assess whether this is a newly evaluated model or a previously omitted one.

Without access to methodology details for either benchmark, the interpretation of these shifts remains constrained to surface observation: SWE-rebench appears stable and possibly closed to new entries, while Artificial Analysis exhibits churn consistent with either rolling re-evaluation or score recalibration across the full roster.

Cole Brennan
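The Artificial Analysis movements quoted in the commentary above are simple arithmetic, and a short script makes them easy to re-check. The sketch below is illustrative only: the `Move` class and its property names are assumptions introduced here, and the ranks and scores are copied from the commentary rather than pulled from either benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """One model's before/after position, as quoted in the commentary."""
    model: str
    old_rank: int
    old_score: float
    new_rank: int
    new_score: float

    @property
    def places_gained(self) -> int:
        # Positive means the model moved up (toward rank 1).
        return self.old_rank - self.new_rank

    @property
    def point_change(self) -> float:
        # Absolute change in score, rounded to one decimal place.
        return round(self.new_score - self.old_score, 1)

    @property
    def relative_change_pct(self) -> float:
        # Change expressed as a percentage of the old score.
        return round(100 * (self.new_score - self.old_score) / self.old_score, 1)


# Figures copied from the commentary; nothing here is fetched from a live source.
MOVES = [
    Move("DeepSeek R1", 190, 12.6, 144, 18.5),
    Move("Mistral Small 3.1", 255, 8.6, 172, 14.7),
    Move("Claude 4 Sonnet", 72, 30.7, 83, 28.9),
    Move("Devstral Small 2", 186, 13.1, 153, 17.4),
    Move("Llama 3.1 Instruct 8B", 298, 6.1, 274, 7.6),
]

for m in MOVES:
    print(
        f"{m.model:<24} {m.places_gained:+4d} places "
        f"{m.point_change:+5.1f} pts {m.relative_change_pct:+6.1f}% relative"
    )
```

Reporting the relative change next to the absolute one shows how a similar point gain weighs much more heavily on a low-scoring model than on one near the top.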

Daily rankings from SWE-rebench, a benchmark designed to fairly compare LLM capabilities on real-world software engineering tasks. Unlike other evaluations, it uses a standardized scaffolding for all models, continuously updates its dataset to prevent contamination, and runs each model five times to account for stochastic variance.

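| # | Provider | Name | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% |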
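Because each SWE-rebench score comes with a ± figure, one rough way to read the top 10 is to ask which neighbouring entries are clearly separated. The sketch below treats each ± value as a symmetric half-width around the score, which is an assumption: the source does not say whether the figure is a standard error, a standard deviation, or something else. Interval overlap is a quick heuristic, not a significance test, and the function names are hypothetical.

```python
# (rank, name, score %, plus/minus %) copied from the SWE-rebench top-10 table.
ROWS = [
    (1, "gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    (2, "Junie", 61.6, 0.64),
    (3, "Codex", 60.4, 1.37),
    (4, "Claude Code", 59.6, 1.98),
    (5, "gpt-5.5-2026-04-23-medium", 58.9, 0.78),
    (6, "Claude Opus 4.8-xhigh", 56.5, 1.20),
    (7, "gpt-5.4-2026-03-05-medium", 54.9, 1.02),
    (8, "Claude Opus 4.7-high", 53.1, 1.45),
    (9, "Cursor", 53.0, 0.53),
    (10, "Claude Sonnet 4.6", 51.3, 0.55),
]


def interval(score, plus_minus):
    """Return (low, high), treating plus_minus as a symmetric half-width (assumption)."""
    return (score - plus_minus, score + plus_minus)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def separated_pairs(rows):
    """Ranks of adjacent entries whose intervals do NOT overlap."""
    pairs = []
    for upper, lower in zip(rows, rows[1:]):
        if not overlaps(interval(upper[2], upper[3]), interval(lower[2], lower[3])):
            pairs.append((upper[0], lower[0]))
    return pairs


print(separated_pairs(ROWS))  # → [(5, 6), (9, 10)]
```

Under this reading, only the gaps between ranks 5 and 6 and between ranks 9 and 10 exceed the quoted variation; every other pair of neighbours overlaps.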

Artificial Analysis composite index across coding, math, and reasoning benchmarks.

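| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 59 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 79 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 50 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 174 | $5.63 |
| 6 | GLM-5.2 | 51.1 | 151 | $2.15 |
| 7 | Gemini 3.5 Flash | 50.2 | 210 | $3.38 |
| 8 | Claude Sonnet 4.6 | 47.2 | 51 | $6.00 |
| 9 | Gemini 3.1 Pro Preview | 46.5 | 131 | $4.50 |
| 10 | Qwen3.7 Max | 46 | 196 | $3.75 |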

Output tokens per second — higher is faster. Minimum intelligence score of 40.

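| # | Model | tok/s |
|---|---|---|
| 1 | Gemini 3.5 Flash | 210 |
| 2 | Qwen3.7 Max | 196 |
| 3 | GPT-5.4 | 174 |
| 4 | GPT-5.4 mini | 164 |
| 5 | GLM-5.2 | 151 |
| 6 | Gemini 3.1 Pro Preview | 131 |
| 7 | GPT-5.2 Codex | 127 |
| 8 | DeepSeek V4 Flash | 104 |
| 9 | MiMo-V2.5 | 91 |
| 10 | GPT-5.3 Codex | 88 |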
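The speed ranking applies a filter (minimum intelligence score of 40) and then sorts by output tokens per second. The sketch below applies that same rule to the ten rows of the composite index table only, so it reproduces a subset of the ranking rather than the full list. The function name and the `min_score` parameter are hypothetical.

```python
# (model, composite score, output tok/s) copied from the composite index top 10.
# The source lists a speed of 0 for Claude Fable 5; whether that means
# "not measured" is not stated, so the value is kept as-is.
ROWS = [
    ("Claude Fable 5", 59.9, 0),
    ("Claude Opus 4.8", 55.7, 59),
    ("GPT-5.5", 54.8, 79),
    ("Claude Opus 4.7", 53.5, 50),
    ("GPT-5.4", 51.4, 174),
    ("GLM-5.2", 51.1, 151),
    ("Gemini 3.5 Flash", 50.2, 210),
    ("Claude Sonnet 4.6", 47.2, 51),
    ("Gemini 3.1 Pro Preview", 46.5, 131),
    ("Qwen3.7 Max", 46.0, 196),
]


def fastest(rows, min_score=40.0):
    """Names of models at or above min_score, fastest first."""
    eligible = [row for row in rows if row[1] >= min_score]
    ranked = sorted(eligible, key=lambda row: row[2], reverse=True)
    return [name for name, _score, _speed in ranked]


for position, name in enumerate(fastest(ROWS), start=1):
    print(position, name)
```

Raising `min_score` (for example to 51) narrows the field to the highest-scoring models before ranking them by speed.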

Blended cost per 1M tokens (3:1 input/output) — lower is cheaper. Minimum intelligence score of 40.

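| # | Model | $/1M |
|---|---|---|
| 1 | DeepSeek V4 Flash | $0.175 |
| 2 | MiMo-V2.5 | $0.175 |
| 3 | MiniMax-M3 | $0.525 |
| 4 | DeepSeek V4 Pro | $0.544 |
| 5 | MiMo-V2.5-Pro | $0.544 |
| 6 | Nex-N2-Pro | $1.00 |
| 7 | MiMo-V2-Pro | $1.50 |
| 8 | GPT-5.4 mini | $1.69 |
| 9 | Kimi K2.6 | $1.71 |
| 10 | Kimi K2.7 Code | $1.71 |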
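The "blended" figure in the cost ranking combines input and output prices at a 3:1 ratio. A plausible reading, assumed here rather than confirmed by the source, is a weighted average with three parts input tokens to one part output tokens. The sketch below uses made-up prices purely to show the arithmetic; `blended_cost` and its parameters are hypothetical names.

```python
def blended_cost(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per 1M tokens.

    Assumes "3:1 input/output" means three parts input tokens for every
    one part output tokens; the source does not spell out the formula.
    """
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("input_parts + output_parts must be positive")
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices, not taken from any provider's price list:
# $1.00 per 1M input tokens and $4.00 per 1M output tokens.
print(blended_cost(1.00, 4.00))  # → 1.75
```

Because input tokens carry three times the weight, a model with expensive output but cheap input can still land low in a blended ranking.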