The Inference Report

June 19, 2026

Like the shift from mainframe computing to distributed systems in the 1980s, the AI economy is reorganizing around who controls the supply chain, not who builds the best product. Capital is fleeing software margins and flooding into infrastructure, power, and geopolitical positioning. Apple is raising prices because AI data centers have consumed RAM supply. Gaming consoles are becoming accidental luxury goods. Accenture and traditional IT consulting are watching their business models compress as AI agents eliminate the intermediate layer where they once captured margin. The constraint is no longer technology. It is electrons and geopolitical control of the supply chains that move them.

Venture capital is pouring into whoever controls the operational layer. Baseten closed a $1.5 billion round at $13 billion valuation for inference infrastructure. General Intuition is raising $300 million at $2 billion for embodied AI trained on 2 billion videos per year. Snap spun out its AI video team into Dotmo because the cost structure no longer fits a social media company's margin requirements. Meanwhile, infrastructure itself is being nationalized. Trump is adopting a sovereign AI playbook. France's OVHcloud is building frontier models to avoid US and Chinese dependence. FERC handed data centers a fast lane to the electrical grid. SpaceX is plotting a $20 billion bond deal and exploring space-based data centers. AWS is negotiating to sell chips directly to competitors' data centers, explicitly framing a $50 billion opportunity.

The lab announcements confirm this reorientation. OpenAI is moving spend controls and analytics into enterprise products, signaling that cost management has become table-stakes rather than a differentiator. AWS, AMD, and NVIDIA are addressing network optimization, GPU utilization, and grid infrastructure for AI workloads, with NVIDIA explicitly lobbying FERC for interconnection policy changes. The application layer is fragmenting by use case rather than consolidating around general-purpose models. OpenAI is doubling down on health diagnostics and medical reasoning, a vertical where regulatory approval creates defensibility. Hugging Face is building benchmarks and testing frameworks for open-model fine-tuning. AMD is targeting embodied AI and spatial reasoning. The market is moving from hype to maintenance, from capabilities races to cost and control.

GitHub repositories and benchmark movements reflect this infrastructure-first reality. Trending code clusters around agent frameworks like Kilo and Superpowers, and specialized foundation models like GLM-5 for coding and TimesFM for forecasting. Developers are building the practical scaffolding to move computation closer to users, away from cloud dependencies. GLM-5.1 jumped 10.5 points on SWE-rebench, suggesting rapid iteration in task-specific models. The era of general-purpose models is giving way to task-specific pretrained weights. The money, the power, and the geopolitical leverage are flowing to whoever controls the pipes, the electrons, and the data pipelines. Product superiority is secondary.

Grant Calloway

AI LabsAll labs
From the WireAll feeds
Research PapersAll papers
How Transparent is DiffusionGemma? cs.LG

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning cs.CV

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.

Optimal Deterministic Multicalibration and Omniprediction cs.LG

A model is multicalibrated on a collection of group weights $G$ if it is calibrated -- i.e. unbiased even conditional on its prediction -- not just overall, but also after reweighting contexts by each $g \in G$. It is a useful property for many downstream applications and is a basic desideratum of trustworthy machine learning. Before this work, all predictors known to attain the minimax-optimal $\widetilde O(\varepsilon^{-3})$ sample complexity rate for $\varepsilon$-multicalibration were randomized, while deterministic predictors were known only with substantially worse sample complexity. Whether randomization is necessary for optimal sample complexity in multicalibration was explicitly asked by [CLNR26] and implicitly in several prior works. We resolve this open problem by giving a minimax-optimal multicalibration algorithm that outputs a deterministic predictor. We then generalize the algorithm to produce optimal deterministic predictors that satisfy outcome indistinguishability (OI) with respect to finite or finitely covered collections of tests. As an application, this also gives deterministic omnipredictors and panpredictors with optimal sample complexity, resolving open problems posed by [OKK25] and [BHHLZ25].

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation cs.IR

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups cs.LG

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $ρ(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_λ^2/τ$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.

Predictability as a Fine-Grained Measure for Privacy cs.LG

Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker's core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information about unknown individuals after observing the algorithm's output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.

BenchmarksFull tables
Artificial AnalysisIntelligence Index

Composite score across coding, math, and reasoning

#ModelScoretok/s$/1M
1Claude Fable 559.90$20.00
2Claude Opus 4.855.766$10.00
3GPT-5.554.868$11.25
4Claude Opus 4.753.556$10.00
5GPT-5.451.4157$5.63
SWE-rebench

Agentic coding on real-world software engineering tasks

#ModelScore
1gpt-5.5-2026-04-23-xhigh62.7%
2Junie61.6%
3Codex60.4%
4Claude Code59.6%
5gpt-5.5-2026-04-23-medium58.9%