The Inference Report

May 14, 2026

Hyperscalers are spending $725 billion on AI compute this year while the market rejects what that compute produces. Google's AI Overviews cut click-through rates by 58 percent. Wikipedia banned AI-generated content by a 44-2 vote. The collision between infrastructure capacity and actual demand is reshaping where capital flows and which companies survive the correction.

The demand problem reveals itself in unexpected places. Anthropic now serves more business customers than OpenAI according to Ramp expense data, 34.4 percent versus 32.3 percent, a reversal that signals model quality and product fit matter more than first-mover advantage. The real expansion is downmarket into the 36 million small businesses that make up the U.S. economic backbone and into vertical software where AI becomes embedded rather than bolted on. Yet the finding that 97 percent of organizations have active AI initiatives while only 5 percent say their data is ready exposes the actual bottleneck: not compute, but data governance. The infrastructure spending is real. Execution capability is not. This gap is where leverage is shifting.

The labs are converging on agentic systems as the next commercial battleground, each from a different market position. OpenAI is hardening Codex for Windows sandboxes and patching supply chain vulnerabilities, practical infrastructure work that signals real deployment concerns. NVIDIA is stacking partnerships while promoting open source frameworks like Hermes Agent, positioning itself as infrastructure provider to whoever wins. Anthropic is packaging Claude for small business, a distribution play suggesting agents are table stakes for horizontal adoption rather than differentiated capability. GitHub confirms this shift: the trending repos split between infrastructure like trycua/cua that lets agents interact with desktops without going rogue, and skills layers that package domain knowledge as reusable components. Agents are moving from research artifacts to production systems. Whoever builds the sandbox, secures the supply chain, and reaches the customer first wins the next phase.

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, but the real story is convergence. Models ranked 2 through 7 cluster between 64.4% and 62.3%, suggesting the frontier is tightening. Chinese-developed models are narrowing the gap: GLM-5.1 moved to 51.4 on Artificial Analysis while Kimi K2 Thinking jumped to 40.9. Within SWE-rebench, the spread from position 1 to position 10 spans only 4.3 percentage points. Marginal gains now require refined approaches rather than architectural leaps. The infrastructure race and the benchmark race are diverging. One measures capability. The other measures whether that capability can be made useful at scale without becoming a liability.

Grant Calloway

Research Papers — Focused
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces (cs.DC)

The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observing, reproducing, and optimizing distributed machine learning (ML) workload behavior in production AI systems, one that enables efficient software-hardware (SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called the Chakra execution trace (ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
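
For a feel of what a graph-based execution trace carries, here is a minimal Python sketch: nodes with an operation type, timing, and dependency edges, plus the kind of critical-path query a replay tool might run. The field names and helper are illustrative assumptions, not the actual Chakra ET schema.

```python
# Hypothetical sketch of a graph-based execution trace in the spirit of
# Chakra ETs. Field names are illustrative, not the real Chakra schema.
from dataclasses import dataclass, field
from enum import Enum, auto

class OpType(Enum):
    COMPUTE = auto()
    MEMORY = auto()
    COMMUNICATION = auto()

@dataclass
class TraceNode:
    node_id: int
    op_type: OpType
    duration_us: float                                   # measured runtime
    data_deps: list[int] = field(default_factory=list)   # must-finish-first nodes
    ctrl_deps: list[int] = field(default_factory=list)   # ordering-only edges

def critical_path_us(nodes: dict[int, TraceNode]) -> float:
    """Longest dependency chain; a replay tool's lower bound on step time."""
    memo: dict[int, float] = {}
    def finish(nid: int) -> float:
        if nid not in memo:
            n = nodes[nid]
            deps = n.data_deps + n.ctrl_deps
            memo[nid] = n.duration_us + max((finish(d) for d in deps), default=0.0)
        return memo[nid]
    return max((finish(nid) for nid in nodes), default=0.0)
```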

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference (cs.DC)

Layerwise offloading reduces the GPU memory footprint of large diffusion transformer (DiT) inference by prefetching upcoming layers from host memory, but its effectiveness hinges on hiding prefetch latency behind per-layer computation. This assumption breaks down when the per-GPU compute workload is small. Moreover, on PCIe-only nodes, prefetch and inter-GPU collective communications such as all-reduce and all-to-all contend on the shared PCIe path, exposing prefetch latency even when compute would otherwise hide it. We revisit layerwise offloading as a co-scheduling problem between prefetch and communication, guided by a first-order analytical model that predicts when prefetch can be hidden by computation. Building on this model, we design ChunkFlow, a communication-aware, chunk-granular offloading runtime that adaptively yields to collective communication and smoothly trades GPU memory for prefetch volume. On three representative diffusion transformers running on two H100 GPUs over PCIe with Ulysses sequence parallelism, ChunkFlow delivers up to 1.28x step-time speedup over SGLang's existing layerwise offloading, reduces peak GPU memory by up to 49% over the no-offload baseline at near-identical step time once the workload is large enough, and exposes a tunable memory-latency tradeoff that recovers near-zero step-time overhead in the small-workload regime.
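
The first-order model the abstract describes reduces to a simple inequality: a layer's prefetch is hidden only if its PCIe transfer, plus any contending collective traffic on the shared path, fits inside the compute window. A toy version, with all parameter names and numbers assumed for illustration:

```python
# Back-of-the-envelope version of the kind of first-order model described
# above: prefetch is hidden only while per-layer compute time exceeds the
# PCIe transfer time left over after collectives. Numbers are illustrative.

def prefetch_is_hidden(layer_bytes: float,
                       compute_ms: float,
                       pcie_gbps: float,
                       collective_ms: float) -> bool:
    """True if a layer's weights can be staged during the compute window."""
    transfer_ms = layer_bytes / (pcie_gbps * 1e9) * 1e3  # bytes -> milliseconds
    # Collectives share the PCIe path, so they shrink the overlap window.
    return transfer_ms + collective_ms <= compute_ms

# A 300 MB layer over ~25 GB/s effective PCIe bandwidth needs ~12 ms; with
# 4 ms of contending all-reduce traffic it hides only if compute > 16 ms.
print(prefetch_is_hidden(300e6, compute_ms=20.0, pcie_gbps=25.0, collective_ms=4.0))
```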

Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum (cs.DC)

Agentic systems deployed across the compute continuum need discovery mechanisms that remain effective across cloud, edge, and intermittently connected domains. In some emerging agentic architectures, decentralized discovery is already an active design direction, placing DHT-based lookup on the path toward agent directories. This paper studies the trade-offs among major structured-overlay families for agent discovery, comparing Chord, Pastry, and Kademlia as candidate indexing substrates within a shared control-plane framework. Using a benchmark subset centered on a 4096-node stationary comparison and a representative 4096-node churn benchmark, the paper characterizes how discovery reliability, startup behavior, and control-plane overhead vary across these overlays. The goal is to clarify the operating points they expose for agent discovery across edge-to-cloud environments.
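
As a reminder of what separates the candidates, Kademlia's defining trick is its XOR distance metric, which turns "closest nodes to a key" into a sort over bit patterns. A toy sketch of that one idea, not tied to the paper's framework:

```python
# Minimal illustration of Kademlia's XOR metric, one of the three overlay
# families compared above. Toy IDs; not any paper's implementation.

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance between two node/key IDs."""
    return a ^ b

def closest_nodes(known_ids: list[int], target: int, k: int = 3) -> list[int]:
    """The k known nodes closest to the target; one step of iterative lookup."""
    return sorted(known_ids, key=lambda nid: xor_distance(nid, target))[:k]

ids = [0b1010, 0b0110, 0b1100, 0b0001]
print(closest_nodes(ids, target=0b1000))  # [10, 12, 1]: 0b1010, 0b1100 are nearest
```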

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures (cs.DC)

Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms (GQA, MLA, Gated DeltaNet, and Mamba2) on NVIDIA H200, decode draws only 137–300 W on a 700 W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: the resulting clock deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
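
The headline claim survives a back-of-the-envelope check: a cap only acts when draw exceeds it. A toy calculation using the abstract's wattage ranges, with the clock-locking numbers assumed purely for illustration:

```python
# Toy arithmetic behind the claim above: memory-bound decode sitting at
# 137-300 W never touches a 700 W limit, so capping is a no-op for decode.
# The clock-locking savings figures below are illustrative assumptions.

CAP_W = 700.0
decode_draw_w = 250.0           # within the observed 137-300 W band

print(f"cap binds during decode: {decode_draw_w > CAP_W}")  # False

# Clock locking instead lowers SM frequency, trading a little latency for
# energy. Hypothetical numbers: 5% slower decode at 30% lower power draw.
tokens = 1_000
base_s_per_tok, locked_s_per_tok = 0.020, 0.021
base_J = decode_draw_w * base_s_per_tok * tokens
locked_J = decode_draw_w * 0.70 * locked_s_per_tok * tokens
print(f"energy saved: {1 - locked_J / base_J:.0%}")         # ~26% in this toy setting
```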

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation (cs.DC)

Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
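
The redundancy argument boils down to a memoization key: if two models fix the same configuration dimensions, one operation profile can serve both. A minimal sketch of that idea, with a hypothetical cache and profiler interface rather than Dooly's actual API:

```python
# Sketch of the redundancy idea described above: an op's input dims split
# into model-fixed values (head size, layer count) that recur across models
# and request-dependent ones (batch, sequence length) swept once.
# The cache key and profiler signature are illustrative, not Dooly's API.

latency_db: dict[tuple, float] = {}

def profile_op(name: str, fixed_dims: tuple, swept_dims: tuple, measure) -> float:
    """Measure an op only if its signature is absent from the database."""
    key = (name, fixed_dims, swept_dims)
    if key not in latency_db:            # skip ops already profiled,
        latency_db[key] = measure()      # even if another model hit them first
    return latency_db[key]

# Two models sharing head_dim=128 reuse the same GEMM profile:
t1 = profile_op("qkv_gemm", fixed_dims=(128,), swept_dims=(8, 4096),
                measure=lambda: 0.42)    # stand-in for a real timing run
t2 = profile_op("qkv_gemm", fixed_dims=(128,), swept_dims=(8, 4096),
                measure=lambda: 0.99)    # never runs; cached value returned
assert t1 == t2 == 0.42
```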

A Scalable Digital Twin Framework for Energy Optimization in Data Centers (cs.DC)

This study proposes a scalable Digital Twin framework for energy optimization in data centers. The framework integrates IoT-based data acquisition, cloud computing, and machine learning techniques to enable real-time monitoring, forecasting, and intelligent energy management. A controlled small-scale data center environment was developed to monitor variables such as power consumption, temperature, and computational workload. Long Short-Term Memory (LSTM) models were employed to predict energy demand and support operational decision-making. Experimental results demonstrated improvements in energy efficiency, including reductions in power consumption and enhancements in Power Usage Effectiveness (PUE). Despite being evaluated in a constrained environment, the proposed framework demonstrates strong potential as a scalable and cost-effective solution for sustainable data center management.
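
The forecasting piece is a standard sequence model. A minimal PyTorch sketch of an LSTM mapping a telemetry window to a one-step power forecast; shapes, hyperparameters, and the random stand-in data are all assumptions, not the paper's setup:

```python
# Minimal sketch of the forecasting component: an LSTM mapping a window of
# past telemetry (power, temperature, load) to the next power reading.
# Hyperparameters are illustrative; random tensors stand in for IoT data.
import torch
import torch.nn as nn

class PowerForecaster(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # predict next power draw

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                 # (batch, window, hidden)
        return self.head(out[:, -1])          # last step -> 1-step forecast

model = PowerForecaster()
window = torch.randn(16, 48, 3)               # 16 samples, 48 readings, 3 sensors
target = torch.randn(16, 1)
loss = nn.functional.mse_loss(model(window), target)
loss.backward()                               # one training step, sans optimizer
print(loss.item())
```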

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                    Score   tok/s   $/1M tokens
1   GPT-5.5                  60.2    65      $11.25
2   Claude Opus 4.7          57.3    63      $10.94
3   Gemini 3.1 Pro Preview   57.2    128     $4.50
4   GPT-5.4                  56.8    83      $5.63
5   Kimi K2.6                53.9    41      $1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                        Score
1   Claude Opus 4.6              65.3%
2   gpt-5.2-2025-12-11-medium    64.4%
3   GLM-5                        62.8%
4   Junie                        62.8%
5   gpt-5.4-2026-03-05-medium    62.8%