The Inference Report

March 31, 2026

From ten thousand feet, the AI industry is undergoing a visible bifurcation. Capital is flowing to infrastructure, data access, and operational efficiency rather than to models or applications. Trust is collapsing even as adoption metrics climb. And the companies winning are not the ones publishing benchmarks or announcing capabilities, but the ones solving access problems: who controls the compute, who owns the data, and who can integrate seamlessly into existing institutional relationships.

The money tells the story most clearly. Rebellions, ScaleOps, and Mistral have collectively raised over $1.3 billion to build chips, optimize GPU efficiency, and construct data centers. Starcloud reached unicorn status in 17 months by proposing space-based infrastructure. Qodo raised $70 million betting that code verification is the real constraint. None of these companies are training foundation models. They are solving the plumbing problem, because the plumbing problem is where defensible moats exist. Model capability has become a commodity.

Microsoft's revelation that only 3.3 percent of its 365 user base has paid for Copilot licenses exposes the fundamental gap: boardroom enthusiasm for AI does not translate into willingness to pay. A Quinnipiac poll shows adoption rising while trust falls, with 85 percent of Americans unwilling to accept an AI boss. Enterprise organizations report they cannot demonstrate sustained return on investment. The Pentagon temporarily blocked an effort to treat Anthropic as a supply chain risk, yet Anthropic itself leaked details of its latest model through a public data repository mishap, signaling both competitive desperation and operational fragility.

The real leverage is consolidating around data and institutional access. Mantis Biotech is building synthetic digital twins to solve medicine's data availability problem. The IRS is testing Palantir tools to surface audit targets from legacy systems. Microsoft and Amazon are launching health tools that require connecting to user medical records. A pro-AI group plans to spend $100 million on midterm elections ahead of regulatory battles. These moves reveal that once compute becomes commoditized, the constraint becomes data access and the permission to use it. Regulatory capture, data rights, and political alignment now matter more than model architecture. Google's publication of a quantum vulnerability disclosure framework exemplifies this shift: the company is establishing itself as the arbiter of a problem that does not yet exist at scale, gaining influence over governance before any actual threat surfaces. This is how a company converts future competitive advantage into present-day regulatory credibility.

The research and benchmark data reinforce the pattern. Papers are grounding claims in controlled experiments, closed-form bounds, and deployment constraints rather than leaderboard positions. Claude Opus 4.6 leads SWE-rebench at 65.3 percent, but the gap between that result and the Artificial Analysis leader's composite score of 57.2 suggests the two leaderboards measure different aspects of capability. GitHub's trending repositories show developers prioritizing practical tooling for Claude integration and agent orchestration over novel model architectures. The winners are not pushing model performance; they are reducing friction in adoption and integration. Infrastructure, data, and access are where the industry is actually investing. Models are becoming utilities.

Grant Calloway

Research Papers
Adaptive Block-Scaled Data Types cs.CL

NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
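The per-group selection idea can be sketched in a few lines. This is a toy illustration only, assuming the standard E2M1 FP4 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} and a symmetric INT4 grid; the real format's E4M3 scale factor and its sign-bit encoding of the chosen type are simplified here to a plain float scale plus a string tag.

```python
import numpy as np

# Representable values (sketch): signed E2M1 "FP4" levels and a uniform INT4 grid.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_VALUES = np.unique(np.concatenate([-FP4_LEVELS, FP4_LEVELS]))
INT4_VALUES = np.arange(-8, 8, dtype=np.float64)

def quantize_group(x, values, vmax):
    """Scale a group so its max magnitude lands on the grid's largest value,
    then round each element to the nearest representable value."""
    amax = np.abs(x).max()
    scale = amax / vmax if amax > 0 else 1.0
    idx = np.abs(x[:, None] / scale - values[None, :]).argmin(axis=1)
    return values[idx] * scale

def if4_quantize(group):
    """Adaptive per-group choice: encode with both formats and keep whichever
    gives lower squared error. (In the real format the choice is signaled via
    the unused sign bit of the E4M3 scale factor; here it is a string tag.)"""
    q_fp4 = quantize_group(group, FP4_VALUES, 6.0)
    q_int4 = quantize_group(group, INT4_VALUES, 7.0)
    if np.sum((group - q_fp4) ** 2) <= np.sum((group - q_int4) ** 2):
        return q_fp4, "fp4"
    return q_int4, "int4"
```

Roughly uniform groups tend to favor INT4's evenly spaced grid, while groups dominated by a near-maximal value favor FP4's wider dynamic range, which is the error-distribution insight the abstract describes.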

Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds cs.LG

Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers cs.CV

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.

Temporal Credit Is Free cs.LG

Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: β₂ is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.
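A minimal sketch of the recipe the abstract describes: update from immediate (one-step) derivatives only, treating the previous hidden state as a constant, with RMSprop supplying per-parameter scale normalization. The architecture, toy task, and hyperparameters below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 4                                    # hidden size, input size (illustrative)
W = rng.normal(scale=0.3, size=(n, n))          # recurrent weights
U = rng.normal(scale=0.3, size=(n, d))          # input weights
V = rng.normal(scale=0.3, size=(1, n))          # linear readout
grads_ms = [np.zeros_like(p) for p in (W, U, V)]  # RMSprop second-moment accumulators
lr, rho, eps = 1e-2, 0.99, 1e-8

def online_step(h_prev, x, y):
    """One forward pass plus an update from immediate derivatives only:
    h_prev is treated as a constant, so no Jacobian is propagated
    through time (contrast with RTRL or BPTT)."""
    h = np.tanh(W @ h_prev + U @ x)
    err = V @ h - y                             # gradient of 0.5 * squared error
    da = (V.T @ err) * (1.0 - h * h)            # backprop through this step only
    grads = (np.outer(da, h_prev), np.outer(da, x), np.outer(err, h))
    for p, m, g in zip((W, U, V), grads_ms, grads):
        m *= rho
        m += (1.0 - rho) * g * g                # per-parameter scale normalization
        p -= lr * g / (np.sqrt(m) + eps)        # RMSprop update, in place
    return h, float(0.5 * err @ err)

# Toy streaming task: regress y_t = sum(x_t) while the state carries context.
h, losses = np.zeros(n), []
for _ in range(3000):
    x = rng.normal(size=d)
    h, loss = online_step(h, x, x.sum())
    losses.append(loss)
```

Even with temporal credit truncated to a single step, the loss on this stream drops steadily, which is the cheap-adaptation behavior the abstract claims; the paper's actual evidence spans ten architectures and real neural data.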

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation cs.LG

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.
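The per-sample inference baseline the authors swap in for the SAE encoder can be sketched as classical FISTA on a fixed dictionary. Everything below (dictionary shape, sparsity level, the lam value, the synthetic data) is illustrative; only the objective, 0.5||x - Dz||^2 + lam*||z||_1, matches the setup the abstract describes.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(x, D, lam=0.1, n_iter=200):
    """Per-sample sparse inference: minimize 0.5*||x - D z||^2 + lam*||z||_1
    with FISTA (accelerated proximal gradient) on a fixed dictionary D."""
    L = np.linalg.norm(D, 2) ** 2               # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    y, t = z.copy(), 1.0
    for _ in range(n_iter):
        z_new = soft_threshold(y - D.T @ (D @ y - x) / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = z_new + (t - 1) / t_new * (z_new - z)
        z, t = z_new, t_new
    return z

# Synthetic superposition: 32 latent concepts projected into 16 dimensions.
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 32))
D /= np.linalg.norm(D, axis=0)                  # unit-norm dictionary atoms
z_true = np.zeros(32)
z_true[[3, 17]] = [1.0, -0.8]                   # sparse ground-truth code
x = D @ z_true
z_hat = fista(x, D, lam=0.01)
```

With a good dictionary, iterative inference like this recovers the sparse code where a fixed amortised encoder cannot; the paper's point is that when the dictionary itself points in the wrong directions, even this per-sample procedure fails to close the gap.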

Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-μP remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding 1.58× compute efficiency over a strong Muon baseline at 6×10^21 FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including Z-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.
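The weight-decay claim is easy to check numerically: decoupled decay moves W radially, and retracting back to the Frobenius sphere removes exactly that radial motion, so decayed and undecayed iterates agree to first order in the step size. The sketch below uses plain gradient steps rather than Muon, and all shapes and hyperparameters are illustrative, not HyperP's actual implementation.

```python
import numpy as np

def project_sphere(W, radius):
    """Retract a weight matrix onto the fixed-norm Frobenius sphere."""
    return W * (radius / np.linalg.norm(W))

def constrained_step(W, grad, lr, radius, weight_decay=0.0):
    """Gradient step with optional decoupled weight decay, followed by
    retraction to the Frobenius sphere of the given radius."""
    W = W - lr * grad - lr * weight_decay * W
    return project_sphere(W, radius)

rng = np.random.default_rng(0)
W = project_sphere(rng.normal(size=(8, 8)), radius=1.0)
g = rng.normal(size=(8, 8))

no_decay = constrained_step(W, g, lr=1e-3, radius=1.0, weight_decay=0.0)
decay = constrained_step(W, g, lr=1e-3, radius=1.0, weight_decay=0.1)
# The two iterates differ only at second order in the learning rate,
# while the unconstrained iterates would differ by lr * wd * ||W|| = 1e-4.
gap = np.linalg.norm(no_decay - decay)
```

The residual gap comes entirely from the gradient tilting W slightly off the radial direction before the retraction, which is why the no-op holds only to first order.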

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                    Score  tok/s  $/1M tokens
1  GPT-5.4                   57.2     96        $5.63
2  Gemini 3.1 Pro Preview    57.2    120        $4.50
3  GPT-5.3 Codex             54       94        $4.81
4  Claude Opus 4.6           53       61       $10.00
5  Claude Sonnet 4.6         51.7     79        $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                       Score
1  Claude Opus 4.6             65.3%
2  gpt-5.2-2025-12-11-medium   64.4%
3  GLM-5                       62.8%
4  gpt-5.4-2026-03-05-medium   62.8%
5  Gemini 3.1 Pro Preview      62.3%