The Inference Report

May 14, 2026

Hyperscalers are spending $725 billion on AI compute this year while the market rejects what that compute produces. Google's AI Overviews cut click-through rates by 58 percent. Wikipedia banned AI-generated content by a 44-2 vote. The collision between infrastructure capacity and actual demand is reshaping where capital flows and which companies survive the correction.

The demand problem reveals itself in unexpected places. Anthropic now serves more business customers than OpenAI according to Ramp expense data, 34.4 percent versus 32.3 percent, a reversal that signals model quality and product fit matter more than first-mover advantage. The real expansion is downmarket into the 36 million small businesses that make up the U.S. economic backbone and into vertical software where AI becomes embedded rather than bolted on. Yet enterprises discovering that 97 percent of organizations have active AI initiatives but only 5 percent say their data is ready exposes the actual bottleneck: not compute, but data governance. The infrastructure spending is real. Execution capability is not. This gap is where leverage is shifting.

The labs are converging on agentic systems as the next commercial battleground, each from a different market position. OpenAI is hardening Codex for Windows sandboxes and patching supply chain vulnerabilities, practical infrastructure work that signals real deployment concerns. NVIDIA is stacking partnerships while promoting open source frameworks like Hermes Agent, positioning itself as infrastructure provider to whoever wins. Anthropic is packaging Claude for small business, a distribution play suggesting agents are table stakes for horizontal adoption rather than differentiated capability. GitHub confirms this shift: the trending repos split between infrastructure like trycua/cua that let agents interact with desktops without going rogue, and skills layers that package domain knowledge as reusable components. Agents are moving from research artifacts to production systems. Whoever builds the sandbox, secures the supply chain, and reaches the customer first wins the next phase.

Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, but the real story is convergence. Models ranked 2 through 7 cluster between 64.4% and 62.3%, suggesting the frontier is tightening. Chinese-developed models are narrowing the gap: GLM-5.1 moved to 51.4 on Artificial Analysis while Kimi K2 Thinking jumped to 40.9. Within SWE-rebench, the spread from position 1 to position 10 spans only 4.3 percentage points. Marginal gains now require refined approaches rather than architectural leaps. The infrastructure race and the benchmark race are diverging. One measures capability. The other measures whether that capability can be made useful at scale without becoming a liability.

Grant Calloway

AI LabsAll labs
From the WireAll feeds
Research Papers — FocusedAll papers
RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning cs.DC

Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. However, existing synchronous on-policy GRPO (Group Relative Policy Optimization) RLVR systems finish an entire rollout before starting training, leaving the trainer GPU pool idle while rollout is still ongoing. Asynchronous RL pipelines overlap the two stages, but at the cost of training on stale data. To address these challenges, we propose RolloutPipe, a post-training framework for disaggregated RLVR systems, which turns the fixed-weight rollout into a complete-group pipeline where trainable groups move to the trainer while later groups are still being generated. RolloutPipe achieves this through two techniques including complete-group pipelining (CGP) and frontier-group dispatch (FGD). CGP dispatches each trainable complete group to the trainer FIFO as soon as group materialization finishes, and FGD is an admission policy on the Rollout node that first admits requests for the frontier groups needed to form the next training batch, so that trainer-ready groups arrive earlier and more steadily. The design starts training before the rollout completes while maintaining on-policy correctness. Evaluated on Qwen3-1.7B across four reasoning and science benchmarks and twelve rollout settings, RolloutPipe shortens the rollout-to-train-end time by 30.7%-42.3%, and lowers the trainer waiting ratio by 37%-76% compared to Slime, a state-of-the-art rollout and training system.

DMuon: Efficient Distributed Muon Training with Near-Adam Overhead cs.DC

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers such as Muon, whose updates couple entire weight matrices and require costly Newton-Schulz iterations. Vanilla Muon implementations incur more than 2x the cost of forward and backward passes. To close this gap, we present DMuon, an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module, with no framework-level modifications. Across both embodied foundation model and large language model (LLM) training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in our model training.

Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute cs.DC

The rapid expansion of artificial intelligence (AI) infrastructure is driving unprecedented growth in electricity demand from data centers. Traditional power-system planning treats large computing facilities as inflexible peak loads, leading to costly infrastructure upgrades and long delays in grid interconnection. Recent work has shown that AI clusters can reduce electricity consumption during peak demand through software-based workload orchestration. This article explores how modern GPU-based AI data centers can operate as grid-interactive assets that respond dynamically to power system conditions. We describe an architecture integrating grid signals, workload scheduling, and power telemetry for fine-grained cluster power control. Experimental results from a real-world deployment on a 130 kW GPU cluster demonstrate multiple forms of flexibility, including rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs. We further demonstrate performance-aware load shifting across geographically distributed clusters, enabling workloads to migrate toward regions with lower grid stress. Together, these capabilities transform AI infrastructure from static electricity consumers into flexible resources that support grid reliability, accelerate interconnection, and improve computing sustainability.

AI-Assisted Computational Reproducibility on the FABRIC Testbed cs.DC

Computational reproducibility remains difficult despite being central to scientific research. In this paper, we show how the international FABRIC testbed, combined with large language model (LLM) coding assistants through LoomAI, can simplify reproducing published experiments across multiple domains. We reproduced three case studies on FABRIC, covering BBR-family congestion-control evaluations, LAMMPS molecular dynamics scaling benchmarks on a CPU-only MPI cluster, and stress protein homeostasis genomics pipelines. Rather than focusing only on matching numerical outputs, we evaluate whether the reproduced experiments support the same scientific conclusions as the original studies. The AI assistant was effective in setting up the environment, adapting code, and debugging, but struggled with the analysis stages that lacked clearly defined workflows, which required human guidance to establish execution order and data dependencies. Across the case studies, the AI-assisted workflow reduced reproduction effort by roughly 4--6 times. We conclude with practical recommendations for improving AI-assisted reproducibility on research testbeds.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation cs.DC

Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU memory pool. Static weights compete with dynamic KV-cache, and KV-head-limited attention under cold, low-concurrency traffic exposes only a fraction of replicated KV capacity, leading to low GPU memory utilization and weak long-context support. We present CrossPool, a serving engine for cold MoE models that separates FFN weights and KV-cache into two GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a KV-cache pool that dynamically serves active requests while keeping attention local to KV-cache. CrossPool combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering to reduce CPU-GPU control overhead. With efficient GPU memory pooling, CrossPool underpins bursty long-context requests and outperforms the state-of-the-art kvcached-based multi-LLM serving system, reducing P99 TBT by up to $10.4\times$.

FeLoG: Scalable and Efficient Distributed Graph Embedding with Feedback Loop Mechanism cs.DC

Graph embedding maps graph nodes into low-dimensional vectors to support applications such as recommendation, fraud detection, and graph-based retrieval-augmented generation (GraphRAG). As graphs scale to billions of edges, scalable and efficient graph embedding has become increasingly important. Existing frameworks commonly adopt a sampling-training paradigm, in which mini-batches are constructed by sampling nodes and their neighbors. However, sampling is typically decoupled from evolving embedding quality, causing redundant exploration of well-trained regions while under-sampling undertrained nodes. At the system level, such decoupling further leads to excessive communication, serialized execution, and low resource utilization in distributed environments. We present FeLoG, a feedback loop-driven system for scalable distributed graph embedding. (1) FeLoG introduces feedback-coupled sampling and training, dynamically prioritizing undertrained nodes according to real-time embedding-quality feedback, thereby reducing redundant computation and accelerating convergence. (2) It employs activity-aware communication that compresses frequently occurring node sequences to reduce intra-machine PCIe traffic and selectively synchronizes frequently updated embeddings to reduce inter-machine communication. (3) It adopts a round-interleaved pipeline that overlaps next-round sampling with current-round training to improve CPU-GPU utilization. Experiments against six state-of-the-art baselines on large-scale graphs show that FeLoG achieves an average speedup of 27.9x, reduces communication cost by more than 53.1%, and sustains over 80% CPU-GPU utilization.

BenchmarksFull tables
Artificial AnalysisIntelligence Index

Composite score across coding, math, and reasoning

#ModelScoretok/s$/1M
1GPT-5.560.265$11.25
2Claude Opus 4.757.363$10.94
3Gemini 3.1 Pro Preview57.2128$4.50
4GPT-5.456.883$5.63
5Kimi K2.653.941$1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#ModelScore
1Claude Opus 4.665.3%
2gpt-5.2-2025-12-11-medium64.4%
3GLM-562.8%
4Junie62.8%
5gpt-5.4-2026-03-05-medium62.8%