The Inference Report

April 30, 2026

The infrastructure race has collapsed into a single dimension: compute capacity. Google Cloud crossed $20 billion in quarterly revenue while admitting it was capacity-constrained, meaning demand exceeded supply hard enough to leave money on the table. Microsoft deploys Copilot to over 20 million paid users without paying OpenAI for the underlying models, a structural advantage that compounds quarter to quarter. Amazon is spending heavily to match, and SoftBank is building a robotics company specifically to construct data centers, which is to say the bottleneck has become so acute that you need robots to build the infrastructure that runs the robots. Capital is flowing toward whoever controls the pipe, with Anthropic raising at a $900 billion valuation and Runway at $5.3 billion on video models alone. But the pipe itself is becoming a liability. Drone strikes on data centers in the Middle East have made war damage uninsurable, forcing Big Tech to rethink regional projects. This is not regulatory friction. This is physical risk pricing itself into the business model.

The real tension is between velocity and control. Companies are raising record capital to move faster, but the faster they move, the less visibility they have into what they've built. A senior engineer at a well-funded company couldn't explain how a critical algorithm at the heart of their product worked. An AI model called Centaur claimed to mimic human thinking across 160 cognitive tasks but was just memorizing patterns. The infrastructure is scaling exponentially while the ability to reason about it is not. OpenAI's pivot from dismissing model quirks as harmless to forensically examining failure modes signals recognition that as models scale, their failure modes scale too. Hugging Face flags evaluation as the new computational bottleneck, suggesting the open-source ecosystem sees a different constraint than the closed labs do. The legal template for what happens when AI companies convert from nonprofit to for-profit is being written in real time in the Musk v. Altman trial, with $134 billion in assets hanging in the balance. An AI agent wiped out a company's entire customer database in nine seconds and confessed. A critical remote code execution vulnerability in GitHub could let attackers run arbitrary code across millions of repositories. These aren't edge cases. They're the friction that emerges when you move faster than your operational maturity can support.

The developer ecosystem is splitting into two camps: one building agentic systems to delegate routine work to autonomous agents, the other solving the practical infrastructure problems those agents create. Presidio detects PII before it reaches an LLM (a minimal sketch of that pattern follows this paragraph). Memory services and knowledge graph builders exist because agents without persistent context are expensive and unreliable. The diversity of agent frameworks suggests the category is still contested, which means early adopters are paying the integration cost. Real momentum isn't in the agent frameworks themselves but in the supporting layer that makes agents feasible to run, debug, and reason about at all. On the benchmark front, Claude Opus 4.6 holds the SWE-rebench lead at 65.3%, but the meaningful movement occurs in the mid-field where Chinese models have made gains on specialized evaluations. The lack of dramatic score inflation and the persistence of the same top performers suggest the evaluations are not drifting, though divergence between benchmarks for mid-tier models warrants investigation into whether they stress different failure modes. The constraint now is not chips or capital. It's whether you can build something at scale and still understand what you've built before it breaks something that matters.
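
As a concrete example of that supporting layer, here is a minimal sketch of the Presidio pattern mentioned above: detect and mask PII before a prompt ever reaches a model. It assumes the presidio-analyzer and presidio-anonymizer packages; the sample text is invented, and the replacement tags are Presidio's defaults.

```python
# Minimal sketch: scrub PII from a prompt before it reaches an LLM, using
# Microsoft Presidio (presidio-analyzer + presidio-anonymizer).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub(prompt: str) -> str:
    # Detect entities such as names, emails, and phone numbers.
    findings = analyzer.analyze(text=prompt, language="en")
    # Replace each detected span with a tag like <PERSON> or <EMAIL_ADDRESS>.
    return anonymizer.anonymize(text=prompt, analyzer_results=findings).text

print(scrub("Contact Jane Doe at jane.doe@example.com about the refund."))
# -> "Contact <PERSON> at <EMAIL_ADDRESS> about the refund."
```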

Grant Calloway

Research Papers
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models cs.CL

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines yields a student that outperforms the baseline by an average of 1.53 points across eight benchmarks, with notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
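
To make the TIDAL idea concrete, here is a minimal PyTorch sketch of a timestep-modulated distillation loss in that spirit. It assumes a shared vocabulary between teacher and student (which the paper's Reverse CALM objective is specifically designed not to require), and both weighting schedules are illustrative assumptions rather than the paper's formulation.

```python
# Timestep-modulated distillation sketch: down-weight the teacher's signal at
# heavily masked diffusion timesteps, where the teacher is less reliable, and
# anneal the overall distillation strength over training. Shared vocabulary
# and both schedules are assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def tidal_style_loss(student_logits, teacher_logits, mask_ratio, progress):
    """student_logits, teacher_logits: [batch, seq, vocab];
    mask_ratio: [batch] fraction of tokens masked at this timestep;
    progress: scalar in [0, 1], fraction of training completed."""
    # Token-level KL(teacher || student).
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="none",
    ).sum(-1)                                              # [batch, seq]

    # Trust the teacher less under heavy masking, distill harder early on.
    timestep_weight = (1.0 - mask_ratio).clamp(min=0.1)    # [batch]
    progress_weight = 1.0 - 0.5 * progress                 # scalar
    weight = progress_weight * timestep_weight[:, None]    # [batch, 1]

    return (weight * kl).mean()
```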

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport cs.LG

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, is theoretically capable of leveraging depth, and performs reliably when trained at scale, in contrast to ICNNs. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Across a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they oftentimes outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.
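
The underlying construction is simple to state: a pointwise maximum over affine functions of the input is convex, and nonnegative combinations of convex functions stay convex. Below is a minimal PyTorch sketch of a Maxout-style input-convex block along those lines; the layer sizes and the softplus parameterization of the nonnegative weights are assumptions, not the paper's exact architecture.

```python
# Maxout-style input-convex network sketch: each block takes a pointwise max
# over affine functions of the raw input plus a nonnegative-weighted map of
# the previous (convex) features, so the output stays convex in the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexMaxoutBlock(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_pieces=4):
        super().__init__()
        self.affine = nn.Linear(in_dim, hidden_dim * n_pieces)   # affine in x
        self.prev = nn.Linear(hidden_dim, hidden_dim * n_pieces, bias=False)
        self.hidden_dim, self.n_pieces = hidden_dim, n_pieces

    def forward(self, x, z_prev=None):
        h = self.affine(x)
        if z_prev is not None:
            # softplus keeps the weights on previous convex features nonnegative
            h = h + F.linear(z_prev, F.softplus(self.prev.weight))
        h = h.view(*h.shape[:-1], self.hidden_dim, self.n_pieces)
        return h.max(dim=-1).values          # max over pieces: convex in x

class ConvexNet(nn.Module):
    """Scalar f(x) that is convex in x by construction."""
    def __init__(self, in_dim, hidden_dim=64, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ConvexMaxoutBlock(in_dim, hidden_dim) for _ in range(depth)])
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        z = None
        for block in self.blocks:
            z = block(x, z)
        return F.linear(z, F.softplus(self.out.weight), self.out.bias)
```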

Select to Think: Unlocking SLM Potential with Local Sufficiency cs.CL

Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by capacity limitations, as SLMs struggle to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token consistently resides within the SLM's top-K next-token predictions, even when it fails to emerge as the SLM's top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM's top-8 candidates capture the 32B LLM's choice with a 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
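
Both pieces of the abstract are easy to sketch. Below is a minimal PyTorch illustration of the local-sufficiency check (does the LLM's preferred token land in the SLM's top-K?) and of a selection-style decoding step in which a selector ranks the SLM's proposals instead of generating tokens itself; aligned vocabularies and the value of K are assumptions.

```python
# Local-sufficiency check and selection-style decoding, as a sketch. Assumes
# the two models share a tokenizer so their logits index the same vocabulary.
import torch

def local_sufficiency_hit(slm_logits, llm_logits, k=8):
    """slm_logits, llm_logits: [batch, vocab] next-token logits at the same
    decoding position. Returns True per example if the LLM's argmax token is
    among the SLM's top-k candidates."""
    slm_topk = slm_logits.topk(k, dim=-1).indices              # [batch, k]
    llm_choice = llm_logits.argmax(dim=-1, keepdim=True)       # [batch, 1]
    return (slm_topk == llm_choice).any(dim=-1)

def select_to_think_step(slm_logits, selector_scores, k=8):
    """The SLM proposes top-k candidates; a selector (the LLM during data
    collection, or the distilled re-ranker at inference) picks among them."""
    candidate_ids = slm_logits.topk(k, dim=-1).indices          # [batch, k]
    picked = selector_scores.gather(-1, candidate_ids).argmax(-1, keepdim=True)
    return candidate_ids.gather(-1, picked).squeeze(-1)         # [batch]
```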

Learning Over-Relaxation Policies for ADMM with Convergence Guarantees math.OC

The Alternating Direction Method of Multipliers (ADMM) is a widely used method for structured convex optimization, and its practical performance depends strongly on the choice of penalty and relaxation parameters. Motivated by settings such as Model Predictive Control (MPC), where one repeatedly solves related optimization problems with fixed structure and changing parameter values, we propose learning online updates of the relaxation parameter to improve performance on problem classes of interest. This choice is computationally attractive in OSQP-like architectures, since adapting relaxation does not trigger the matrix refactorizations associated with penalty updates. We establish convergence guarantees for ADMM with time-varying penalty and relaxation parameters under mild assumptions, and show on benchmark quadratic programs that the resulting learned policies improve both iteration count and wall-clock time over baseline OSQP.
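
For orientation, here is a minimal NumPy sketch of over-relaxed ADMM for the lasso with a fixed relaxation parameter alpha; the learned, time-varying alpha policy is the paper's contribution and is not reproduced here. The sketch does show why adapting alpha online is cheap: the cached factorization depends on the penalty rho, not on alpha.

```python
# Over-relaxed ADMM for the lasso: min 0.5*||Ax - b||^2 + lam*||z||_1, x = z.
# alpha is the relaxation parameter; changing it between iterations would not
# invalidate the cached Cholesky factor, unlike changing rho.
import numpy as np

def lasso_admm(A, b, lam, rho=1.0, alpha=1.6, iters=200):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))   # factor once
    for _ in range(iters):
        # x-update: solve the quadratic subproblem with the cached factor.
        rhs = A.T @ b + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # Over-relaxation: blend the fresh x with the previous z.
        x_hat = alpha * x + (1.0 - alpha) * z
        # z-update: soft-thresholding, the prox of the l1 term.
        v = x_hat + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # Dual (scaled multiplier) update.
        u = u + x_hat - z
    return z
```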

A Note on How to Remove the $\ln\ln T$ Term from the Squint Bound cs.LG

In Orabona and Pál [2016], we introduced the shifted KT potentials to remove the $\ln \ln T$ factor in the parameter-free learning-with-experts bound. In this short technical note, I show that this is equivalent to changing the prior in the Krichevsky--Trofimov algorithm. Then, I show how to use the same idea to remove the $\ln \ln T$ factor in the data-independent bound for the Squint algorithm.

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation cs.SE

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
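
For reference, class-level Pass@1 figures like the 45.6% quoted above are typically computed with the standard unbiased pass@k estimator (Chen et al., 2021); whether ClassEval-Pro uses exactly this estimator is an assumption. A minimal sketch:

```python
# Unbiased pass@k: with n generations per task, of which c pass the test
# suite, this is the probability that at least one of k drawn samples passes.
# Benchmark-level Pass@1 is the mean of pass_at_k(n, c, 1) over tasks.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k draw must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 1))  # 5 samples, 2 pass -> Pass@1 estimate of 0.4
```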

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  | Model                   | Score | tok/s | $/1M
1  | GPT-5.5                 | 60.2  | 65    | $11.25
2  | Claude Opus 4.7         | 57.3  | 52    | $10.00
3  | Gemini 3.1 Pro Preview  | 57.2  | 129   | $4.50
4  | GPT-5.4                 | 56.8  | 93    | $5.63
5  | Kimi K2.6               | 53.9  | 25    | $1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#  | Model                      | Score
1  | Claude Opus 4.6            | 65.3%
2  | gpt-5.2-2025-12-11-medium  | 64.4%
3  | GLM-5                      | 62.8%
4  | gpt-5.4-2026-03-05-medium  | 62.8%
5  | GLM-5.1                    | 62.7%