The Inference Report

April 22, 2026

Amazon's $5 billion commitment to Anthropic is locking in 5 gigawatts of Trainium silicon while SpaceX pursues a $60 billion option to acquire Cursor, and neither move is primarily about the models themselves. The real constraint has shifted from model weights to compute allocation and the commercial relationships that control it. Anthropic's investigation into unauthorized access to Mythos even as it limits the model's release, Sam Altman's dismissal of Mythos as fear-based marketing, and the security vulnerabilities Mozilla found using the tool all matter less than a simpler fact: access to such capabilities is now gated by hardware availability and partnership agreements. SpaceX needs compute infrastructure and distribution channels that Musk controls. Amazon needs guaranteed capacity when Claude demand outpaces supply. Neither company is primarily acquiring intellectual property; both are securing the physical means to deploy it.

This infrastructure scarcity is forcing a parallel consolidation in how companies manage liability and legal exposure. Florida investigating OpenAI over a mass shooting, a Sullivan & Cromwell partner admitting to AI hallucinations in bankruptcy filings, and insurance companies like Beazley and QBE proposing caps on cyber payouts for AI-related incidents all reflect the same gap: no one knows yet who pays when an AI system fails. OpenAI's statement that ChatGPT bears no responsibility for the shooting is legally defensible but operationally irrelevant if regulators decide otherwise. Clarifai's deletion of 3 million OkCupid photos after an FTC settlement shows the cost of that ambiguity retroactively applied. LinkedIn's Crosscheck feature, a blind taste test for AI models, signals that when companies cannot differentiate on capability or safety, they compete on user perception and choice architecture.

The third concurrent shift is from consumer chat interfaces toward embedded, autonomous systems deployed without constant user input. Hugging Face released ml-intern to automate LLM post-training. Photon's Spectrum framework deploys agents directly to WhatsApp and Telegram. Moonshot AI's Kimi K2.6 scales to 300 sub-agents coordinating 4,000 steps. Snowflake and Adobe are positioning themselves as control planes for the agentic enterprise. Yet Jonathan Wall of Runloop notes that agents operating in stateful, long-running environments with access to APIs and shell commands expand the attack surface dramatically. Capability without control is liability, but companies are shipping these systems anyway because competitive pressure to move agents from prototype to production now exceeds the incentive to slow for safety. Connecticut's legislation on AI deepfakes in elections and expanded US government surveillance both arrive after the infrastructure they aim to govern is already deployed.

In enterprise software, the distribution channel has become the primary competitive asset. OpenAI is productizing Codex through systems integrators like Accenture, PwC, and Infosys rather than direct developer adoption, turning AI into a line item in consulting budgets. Google and IBM follow similar playbooks, embedding tools into existing software stacks and customer relationships. Hugging Face stakes a counterclaim around openness and cybersecurity, but the announcement provides no specifics on adoption. Across the sector, labs are racing to embed models into workflows controlled by firms that already have direct customer relationships, which changes the competitive dynamic away from standalone capability and toward distribution infrastructure.

Grant Calloway

Research Papers
PlayCoder: Making LLM-Generated GUI Code Playable cs.SE

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
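The abstract defines Play@k as whether at least one of k generated candidates can be played end-to-end without logical errors. Assuming it is estimated the same way as the standard pass@k metric (an unbiased combinatorial estimate over n samples, c of which succeed) — the paper may compute it differently — a minimal sketch:

```python
from math import comb

def play_at_k(n: int, c: int, k: int) -> float:
    """Estimate Play@k: the probability that at least one of k candidates,
    drawn without replacement from n generations of which c are playable
    end-to-end, is playable. Same form as the standard pass@k estimator."""
    if n - c < k:
        return 1.0  # fewer than k unplayable samples: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 generations of which 2 are playable, `play_at_k(10, 2, 3)` gives 8/15, consistent with the near-zero Play@3 scores the paper reports when c is almost always zero.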

Generalization at the Edge of Stability cs.LG

Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
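For context, the classical Lyapunov (Kaplan-Yorke) dimension that this line of work takes inspiration from is defined from the ordered Lyapunov exponents of a dynamical system. The paper's "sharpness dimension" is a new quantity, so the formula below is only the classical notion it builds on, not the paper's definition:

```latex
D_{\mathrm{KY}} \;=\; j \;+\; \frac{\sum_{i=1}^{j} \lambda_i}{\lvert \lambda_{j+1} \rvert},
\qquad
j = \max\Bigl\{\, m : \sum_{i=1}^{m} \lambda_i \ge 0 \,\Bigr\},
\qquad \lambda_1 \ge \lambda_2 \ge \cdots
```

Intuitively, it interpolates between integer dimensions to measure the effective dimension of a fractal attractor, which is the role the sharpness dimension plays for the optimizer's attractor set.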

Phase Transitions in the Fluctuations of Functionals of Random Neural Networks math.PR

We establish central and non-central limit theorems for sequences of functionals of the Gaussian output of an infinitely-wide random neural network on the d-dimensional sphere. We show that the asymptotic behaviour of these functionals as the depth of the network increases depends crucially on the fixed points of the covariance function, resulting in three distinct limiting regimes: convergence to the same functional of a limiting Gaussian field; convergence to a Gaussian distribution; or convergence to a distribution in the Qth Wiener chaos. Our proofs exploit tools that are now classical (Hermite expansions, the Diagram Formula, Stein-Malliavin techniques) alongside ideas that have not been used in similar contexts: in particular, the asymptotic behaviour is determined by the fixed-point structure of the iterative operator associated with the covariance, whose nature and stability govern the different limiting regimes.

Safe Continual Reinforcement Learning in Non-stationary Environments cs.LG

Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system's lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.
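One standard baseline in the safe-RL family the benchmark evaluates is Lagrangian relaxation: the policy maximizes reward minus a penalty on constraint cost, while the multiplier rises by dual ascent whenever cost exceeds its budget. This is a generic sketch of that textbook update, not the paper's method; the function name and step size are illustrative:

```python
def lagrangian_update(reward, cost, lam, cost_limit, lr_lambda=0.01):
    """One dual-ascent step for Lagrangian safe RL: the policy ascends
    reward - lam * cost, while lam grows when cost exceeds its budget
    and is clipped at zero so the penalty never becomes a bonus."""
    surrogate = reward - lam * cost                         # policy objective
    lam = max(0.0, lam + lr_lambda * (cost - cost_limit))   # dual ascent
    return surrogate, lam
```

Under non-stationary dynamics, the tension the paper identifies shows up here directly: lam tuned for the old environment either over-penalizes exploration after a shift or lags behind a new, more dangerous cost landscape.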

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling cs.RO

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch fuses these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.

FASTER: Value-Guided Sampling for Fast RL cs.LG

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster.
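The progressive-filtering idea can be sketched as a loop that prunes candidates mid-denoising instead of scoring only finished samples. Everything here is a placeholder (`denoise_step`, the learned `value_fn`, and the keep schedule are illustrative stand-ins, not FASTER's actual components):

```python
import numpy as np

def denoise_step(x, t):
    """Stand-in for one reverse-diffusion step of the action sampler."""
    return x * 0.9  # illustrative dynamics only

def filter_during_denoising(candidates, value_fn, steps, keep_frac=0.5):
    """Value-guided pruning: at each denoising step, score the partially
    denoised candidates with a learned value function and keep only the
    top fraction, so full denoising compute is spent on a few survivors
    rather than on every sample."""
    for t in range(steps):
        candidates = denoise_step(candidates, t)
        scores = value_fn(candidates, t)              # predicted downstream value
        k = max(1, int(len(candidates) * keep_frac))
        keep = np.argsort(scores)[-k:]                # indices of the top-k scores
        candidates = candidates[keep]
    return candidates
```

With `keep_frac=0.5`, the candidate pool halves at every step, which is where the compute savings over score-at-the-end sampling come from.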

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                    Score   tok/s   $/1M
1   Claude Opus 4.7          57.3    62      $10.00
2   Gemini 3.1 Pro Preview   57.2    127     $4.50
3   GPT-5.4                  56.8    82      $5.63
4   Kimi K2.6                53.9    135     $1.71
5   GPT-5.3 Codex            53.6    80      $4.81
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                       Score
1   Claude Opus 4.6             65.3%
2   gpt-5.2-2025-12-11-medium   64.4%
3   GLM-5                       62.8%
4   gpt-5.4-2026-03-05-medium   62.8%
5   GLM-5.1                     62.7%