The Inference Report

April 24, 2026

The infrastructure arms race is now consuming the balance sheets of the companies that claim to be building it. Meta is cutting 10 percent of its workforce to offset $135 billion in data center spending this year, while Microsoft commits $140 billion to AI investment and OpenAI, xAI, and peers plan data centers that would emit 129 million tons of greenhouse gases annually. This is not growth capital deployed strategically into products with revenue models. This is survival spending, the cost of staying in a game where the entry fee keeps rising and the winner remains unclear. The spending is also accelerating consolidation: smaller builders are being acquired by larger ones, and the companies spending the most on infrastructure are the ones that can afford to cut payroll and still outspend everyone else. Capital concentration and margin compression are pushing toward a two- or three-player market in foundation models, with everyone else building on top or fighting for scraps in narrower verticals.

The actual products being built on top of this infrastructure reveal why the spending feels mandatory. Everyone is launching agents simultaneously because the competitive window feels like it's closing. OpenAI released GPT-5.5 and workspace agents in ChatGPT. Microsoft added hosted agents to Foundry Agent Service. Google launched both an updated Gemini Enterprise app and the Gemini Enterprise Agent Platform on the same day. Anthropic's Mythos Preview has spooked financial institutions enough that UK banks are seeking access. Yet the speed of deployment is outpacing governance. An enterprise reviewing a LangChain-based research agent in preproduction still faces the problem that autonomous agents are not stable software artifacts, yet authorization frameworks treat them as if they were. Developers are adopting tools that could replace them while simultaneously worrying about displacement. The productivity gains are real and measurable. The anxiety is proportional.

OpenAI is consolidating its position as the primary vendor of production AI agents by shipping GPT-5.5 directly into Codex, its application layer for knowledge work automation, while simultaneously ensuring that layer runs on NVIDIA's infrastructure. Rather than compete on model weights alone, OpenAI is bundling model capability with workflow orchestration, automations, plugins, skills, and structured task execution, which raises switching costs for customers. NVIDIA's public embrace of Codex running on GB200 systems signals that the infrastructure vendor sees agent frameworks as the real margin driver. Meanwhile, Hugging Face's focus on browser-based transformer inference via Chrome extensions points toward a different vector: moving model execution to the edge and away from centralized inference, which could fragment the cloud-based agent stack that OpenAI and NVIDIA are building. The announcements collectively reveal a market sorting into layers, with model vendors securing inference infrastructure partnerships, application vendors building stickiness through workflow automation, and infrastructure players ensuring they own the hardware dependency. Competition is happening at integration points, not at the model level alone.

On GitHub, the trending list reveals a decisive split between two categories of developer effort: infrastructure for AI agents and tools that make those agents actually useful at scale. The agent-building layer is consolidating around concrete implementations rather than framework abstractions. Cline and similar autonomous coding agents now come with context-window optimization built in, which addresses a real constraint: LLM context is expensive and agents generate noise. Skill libraries like VoltAgent's collection of 1000+ agent skills acknowledge that agents need domain knowledge packaged as callable tools. The discovery layer shows where harder problems still live. Data annotation and curation remain foundational, while LocalAI's positioning as a hardware-agnostic inference engine reflects a practical reality: developers want to run models locally, without GPU dependencies, to cut cost and latency. Smaller repos like abliterix and fim-ai/fim-one point to where the research frontier is: not whether agents can work, but how to make them predictable, steerable, and efficient. What's conspicuously absent from the trending list is another wave of general-purpose frameworks. The market has decided those are solved problems.

Grant Calloway

Research Papers — Focused
Online Survival Analysis: A Bandit Approach under Cox PH Model stat.ML

Survival analysis is a widely used statistical framework for modeling time-to-event data under censoring. Classical methods, such as the Cox proportional hazards (Cox PH) model, offer a semiparametric approach to estimating the effects of covariates on the hazard function. Despite its importance, survival analysis has been largely unexplored in online settings, particularly within the bandit framework, where decisions must be made sequentially to optimize treatments as new data arrive over time. In this work, we take an initial step toward integrating survival analysis into a purely online learning setting under the Cox PH model, addressing key challenges including staggered entry, delayed feedback, and right censoring. We adapt three canonical bandit algorithms to balance exploration and exploitation, with theoretical guarantees of sublinear regret bounds. Extensive simulations and semi-real experiments using SEER cancer data demonstrate that our approach enables rapid and effective learning of near-optimal treatment policies.
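The Cox PH model at the core of this framework scores covariates through the partial likelihood, which the regret guarantees ultimately lean on. A minimal pure-Python sketch of the negative log partial likelihood for a single covariate (an illustration under simplifying assumptions, not the paper's implementation; ties are ignored):

```python
import math

def neg_log_partial_likelihood(beta, times, events, covariates):
    """Negative log partial likelihood for a one-covariate Cox PH model.

    times: observed time per subject (event or censoring time),
    events: 1 if the event was observed, 0 if right-censored,
    covariates: one scalar covariate per subject.
    """
    nll = 0.0
    for t_i, d_i, x_i in zip(times, events, covariates):
        if not d_i:
            # Censored subjects contribute only through risk sets below.
            continue
        # Risk set: everyone still under observation just before t_i.
        risk = sum(math.exp(beta * x_j)
                   for t_j, x_j in zip(times, covariates) if t_j >= t_i)
        nll -= beta * x_i - math.log(risk)
    return nll
```

With beta = 0 every event term reduces to the log of the risk-set size, which gives a quick sanity check on the bookkeeping.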

Properties and limitations of geometric tempering for gradient flow dynamics stat.ML

We consider the problem of sampling from a probability distribution $π$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $π$. We consider the effect of replacing $π$ with a sequence of moving targets $(π_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows. We show that convergence occurs exponentially in continuous time, providing novel bounds in both cases. We also consider popular time discretisations and explore their convergence properties. We show that in the Fisher--Rao case, replacing the target distribution with a geometric mixture of the initial and target distributions never leads to a convergence speed-up, in either continuous or discrete time. Finally, we explore the gradient flow structure of tempered dynamics and derive novel adaptive tempering schedules.
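Geometric tempering replaces the target $π$ with $π_t \propto π_0^{1-λ_t} π^{λ_t}$. When both endpoints are Gaussian, the tempered path stays Gaussian, so the schedule can be inspected in closed form. A sketch under that Gaussian assumption (my parameterization, not the paper's):

```python
def geometric_temper(m0, v0, m1, v1, lam):
    """Mean and variance of the geometric mixture
    N(m0, v0)^(1-lam) * N(m1, v1)^lam (after normalization).

    A geometric mixture of Gaussians is again Gaussian: precisions
    (inverse variances) interpolate linearly, and the mean is the
    precision-weighted average of the two means.
    """
    prec = (1 - lam) / v0 + lam / v1
    v = 1.0 / prec
    m = v * ((1 - lam) * m0 / v0 + lam * m1 / v1)
    return m, v
```

At lam = 0 the initial distribution is recovered exactly and at lam = 1 the target, so the whole tempering schedule is just a path in (mean, variance) space.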

Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms stat.ML

In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used as the reference measure by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
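The chaining argument is easy to see on a finite model set: because Gibbs weights multiply, passing client $k$'s Gibbs measure to client $k+1$ as its reference measure reproduces the centralized Gibbs measure over the pooled losses. A toy sketch with a common regularization factor lam and made-up per-client losses (the paper scales the regularization with local sample sizes; this fixes it for simplicity):

```python
import math

def gibbs(reference, loss, lam):
    """Gibbs measure over a finite model set: P(m) ~ reference[m] * exp(-loss[m]/lam)."""
    w = {m: reference[m] * math.exp(-loss[m] / lam) for m in reference}
    z = sum(w.values())
    return {m: wm / z for m, wm in w.items()}

models = ["A", "B", "C"]
prior = {m: 1.0 / 3 for m in models}
lam = 1.0
# Hypothetical cumulative losses each client observes for each candidate model.
client_losses = [{"A": 2.0, "B": 1.0, "C": 3.0},
                 {"A": 0.5, "B": 2.5, "C": 1.0}]

# Decentralized pass: each client's Gibbs measure is the next client's reference.
measure = prior
for losses in client_losses:
    measure = gibbs(measure, losses, lam)
decentralized = measure

# Centralized baseline: one Gibbs measure against the pooled losses.
pooled = {m: sum(l[m] for l in client_losses) for m in models}
centralized = gibbs(prior, pooled, lam)
```

The two measures agree exactly because the exponentials of the per-client losses multiply into the exponential of their sum before normalization.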

Efficient Symbolic Computations for Identifying Causal Effects stat.ML

Determining identifiability of causal effects from observational data under latent confounding is a central challenge in causal inference. For linear structural causal models, identifiability of causal effects is decidable through symbolic computation. However, standard approaches based on Gröbner bases become computationally infeasible beyond small settings due to their doubly exponential complexity. In this work, we study how to practically use symbolic computation for deciding rational identifiability. In particular, we present an efficient algorithm that provably finds the lowest degree identifying formulas. For a causal effect of interest, if there exists an identification formula of a prespecified maximal degree, our algorithm returns such a formula in quasi-polynomial time.
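For intuition on low-degree identifying formulas, the instrumental-variable configuration is the textbook linear SCM with latent confounding in which the causal effect is identified by a degree-one rational formula, cov(z, y) / cov(z, x). A simulation sketch (illustrative only, not the paper's algorithm; all coefficients are made up):

```python
import random

random.seed(0)

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

n = 100_000
lam_true = 1.5                      # causal effect of x on y to recover
z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)         # instrument, independent of the confounder
    ui = random.gauss(0, 1)         # latent confounder of x and y
    xi = 0.8 * zi + ui + random.gauss(0, 0.3)
    yi = lam_true * xi + ui + random.gauss(0, 0.3)
    z.append(zi); x.append(xi); y.append(yi)

# Degree-one rational identifying formula for this graph.
lam_hat = cov(z, y) / cov(z, x)
```

Naive regression of y on x would be biased by the confounder u; the ratio of covariances with the instrument is not, which is exactly the kind of low-degree formula a symbolic search would return here.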

On Bayesian Softmax-Gated Mixture-of-Experts Models stat.ML

Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.
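The softmax-gated model class is concrete enough to write down directly: the gate maps the input to mixture weights, and each expert contributes one Gaussian component. A 1-D sketch (my toy parameterization, not the paper's model):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def moe_density(y, x, gate_w, gate_b, expert_w, expert_b, sigma):
    """p(y | x) for a 1-D Gaussian mixture of experts with softmax gating.

    Gate k has logit gate_w[k]*x + gate_b[k]; expert k predicts
    N(expert_w[k]*x + expert_b[k], sigma^2).
    """
    gates = softmax([w * x + b for w, b in zip(gate_w, gate_b)])
    norm = sigma * math.sqrt(2 * math.pi)
    dens = 0.0
    for g, w, b in zip(gates, expert_w, expert_b):
        mu = w * x + b
        dens += g * math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / norm
    return dens
```

Because the gates are a softmax, they sum to one for every input, so p(y | x) is a proper density in y, a property any posterior contraction analysis starts from.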

Sparse Network Inference under Imperfect Detection and its Application to Ecological Networks stat.ML

Recovering latent structure from count data has received considerable attention in network inference, particularly when one seeks both cross-group interactions and within-group similarity patterns in bipartite networks, which are widely used in ecological research. Such networks are often sparse and inherently imperfect in their detection. Existing models mainly focus on interaction recovery, while the induced similarity graphs are much less studied. Moreover, sparsity is often not controlled and scale is unbalanced, leading to oversparse or poorly rescaled estimates that degrade structural recovery. To address these issues, we propose a framework for structured sparse nonnegative low-rank factorization with detection probability estimation. We impose nonconvex $\ell_{1/2}$ regularization on the latent similarity and connectivity structures to promote sparsity in within-group similarity and cross-group connectivity with better relative scale. The resulting optimization problem is nonconvex and nonsmooth. To solve it, we develop an ADMM-based algorithm with adaptive penalization and scale-aware initialization, and establish asymptotic feasibility and KKT stationarity of cluster points under mild regularity conditions. Experiments on synthetic and real-world ecological datasets demonstrate improved recovery of latent factors and similarity/connectivity structure relative to existing baselines.
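The paper's $\ell_{1/2}$-regularized ADMM is involved, but the nonnegative low-rank factorization it builds on can be sketched in a few lines. Here is the plain unregularized rank-1 case via alternating least squares (an illustration only: no sparsity penalty and no detection-probability model):

```python
def rank1_nmf(V, iters=50):
    """Rank-1 nonnegative factorization V ~ outer(w, h) by alternating
    least squares. With entrywise-nonnegative V, both closed-form
    updates stay nonnegative, so no projection step is needed.
    """
    m, n = len(V), len(V[0])
    w, h = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Fix h, solve the least-squares problem for w in closed form.
        hh = sum(v * v for v in h)
        w = [sum(V[i][j] * h[j] for j in range(n)) / hh for i in range(m)]
        # Fix w, solve for h.
        ww = sum(v * v for v in w)
        h = [sum(V[i][j] * w[i] for i in range(m)) / ww for j in range(n)]
    return w, h
```

The paper's contribution sits on top of this skeleton: structured sparsity on the factors, a detection-probability layer for imperfect observation, and ADMM to handle the resulting nonconvex, nonsmooth objective.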

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.5                 60.2       0  $11.25
2  Claude Opus 4.7         57.3      58  $10.00
3  Gemini 3.1 Pro Preview  57.2     132  $4.50
4  GPT-5.4                 56.8      80  $5.63
5  Kimi K2.6               53.9     123  $1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  GLM-5.1                    62.7%