The Inference Report

June 9, 2026

There is no breakthrough announcement this week, no new model claiming top-line superiority, no lab unveiling a fundamental advance in reasoning or scale. What's absent is the familiar narrative of capability leapfrog. What's present instead is the machinery of that narrative breaking down in real time.

OpenAI and Anthropic filed confidentially for IPOs valued above one trillion dollars, joining SpaceX in the private capital markets, yet neither company has demonstrated a path to profitability at scale. The funding surge is real. Private credit firms Apollo and Blackstone raised 35 billion dollars for Anthropic alone. But the foundation supporting this valuation is actively collapsing. Microsoft discovered malware in 73 packages on its own platform for the second time in weeks, and a campaign called Hades is specifically targeting Python developer environments to extract credentials from AI agents themselves. The malware executes on import, meaning the very tools builders use to ship products faster have become attack vectors. This isn't temporary friction. It's the predictable outcome of an ecosystem where speed and funding matter more than vetting, and where AI agents are now trusted to run code without human review.

Apple's strategy at WWDC reveals a different bet entirely. Rather than racing for model scale or valuation, the company is wagering that cheaper compute and integration into existing tools will win. Siri AI runs on a two-tiered model powered by Google's Gemini. Safari, Shortcuts, Photos, and Camera all gained AI features. The company is waiving cloud API costs for developers with fewer than 2 million App Store downloads. This is catch-up framed as philosophy, but the positioning is deliberate: Apple is making the case that AI should be so embedded in existing products that users stop thinking about it as a separate thing. Meanwhile, iOS app releases have exploded since agentic AI became viable, but app reviews and user engagement have declined sharply during the same period. Builders can ship code faster than ever. Nobody is shipping products users actually want.

The venture capital incentive structure is now openly broken. Mercor's founder called out Sequoia for selling the same equity to different investors at different prices, a practice he says is widespread among top-tier firms. Tools for Humanity, Sam Altman's identity verification company, is laying off staff despite Altman's imminent OpenAI IPO, suggesting his other bets are not generating revenue. The market is pricing in future dominance before any of these companies has proven it can extract sustainable value.

Across research and infrastructure, the pattern is not convergence on a single frontier but fragmentation into specialized applications, regulatory preparation, and infrastructure consolidation. The technical work is splitting into two clusters: one building agent infrastructure with memory, skills, and internet access; the other tackling the practical problem of deploying models on devices with real power and latency budgets. Both are essential. Most tools still treat them as mutually exclusive.

Grant Calloway

Research Papers
OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics cs.CV

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

An Agency-Transferring Model-Free Policy Enhancement Technique cs.LG

Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
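
The arbitration idea can be illustrated with a minimal sketch. This is not the paper's mechanism: the abstract does not say how arbitration is done, so the linear schedule `baseline_weight`, the coin-flip rule in `arbitrate`, and the two toy policies are all assumptions made for illustration.

```python
import random


def baseline_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 1.0 at step 0, 0.0 once step >= total_steps."""
    return max(0.0, 1.0 - step / total_steps)


def arbitrate(state, step, total_steps, baseline_policy, learning_policy, rng):
    """Return (action, source). The baseline acts with probability baseline_weight."""
    if rng.random() < baseline_weight(step, total_steps):
        return baseline_policy(state), "baseline"
    return learning_policy(state), "learner"


# Toy stand-ins (assumptions): a 1-D state that both policies push toward 0.
def baseline_policy(x: float) -> float:
    return -0.5 * x  # functional but suboptimal controller


def learning_policy(x: float) -> float:
    return -1.0 * x  # placeholder for a trainable network


def count_sources(total_steps: int, seed: int = 0) -> dict:
    """Count how often each policy acts over one pass of the schedule."""
    rng = random.Random(seed)
    counts = {"baseline": 0, "learner": 0}
    for step in range(total_steps):
        _, source = arbitrate(1.0, step, total_steps,
                              baseline_policy, learning_policy, rng)
        counts[source] += 1
    return counts


if __name__ == "__main__":
    print(count_sources(1000))
```

Once the weight reaches zero, `arbitrate` always returns the learner's action, which corresponds to the baseline-free final stage the abstract describes. A real implementation would also have to decide how the learner is updated on steps where the baseline acted, which this sketch leaves out.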

Causally Evaluating the Learnability of Formal Language Tasks cs.CL

Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are hard to delineate and can confound one another. To rigorously investigate the relationship between data frequency and learnability, we turn to a controlled setting using formal languages induced from probabilistic finite automata. These serve as a methodological testbed to demonstrate that standard correlational evaluation practices are inherently flawed. To enable causal analysis, we introduce the binning semiring, an algebraic object that lets us control how often a targeted property occurs in a sampled corpus. We formulate the experimental pipeline as a causal graphical model and derive decomposed Kullback-Leibler divergence metrics to measure the learnability of specific sub-tasks. Our experiments show that evaluating learnability without causal intervention leads to incorrect conclusions due to confounders in correlational analysis, and serve as a warning about correlational pitfalls in natural-language settings.
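
For readers unfamiliar with the setting, the sketch below shows the basic ingredients only: sampling a corpus from a small probabilistic finite automaton, measuring how often a property occurs, and computing a plain KL divergence. It does not implement the binning semiring or the decomposed metrics from the paper. The two-state automaton `PFA` and every function name here are hypothetical.

```python
import math
import random

# Hypothetical deterministic PFA over {a, b}. Each state maps to a list of
# (probability, symbol, next_state); symbol None means "stop".
PFA = {
    0: [(0.5, "a", 0), (0.3, "b", 1), (0.2, None, None)],
    1: [(0.6, "b", 1), (0.4, None, None)],
}


def sample_string(pfa, rng, start=0):
    """Walk the automaton, emitting symbols until a stop transition is drawn."""
    state, out = start, []
    while True:
        r, acc = rng.random(), 0.0
        for p, sym, nxt in pfa[state]:
            acc += p
            if r < acc:
                break
        if sym is None:
            return "".join(out)
        out.append(sym)
        state = nxt


def string_prob(pfa, s, start=0):
    """Exact probability of s. Valid because this PFA is deterministic."""
    state, prob = start, 1.0
    for ch in s:
        for p, sym, nxt in pfa[state]:
            if sym == ch:
                prob, state = prob * p, nxt
                break
        else:
            return 0.0  # no transition on ch from this state
    return prob * sum(p for p, sym, _ in pfa[state] if sym is None)


def empirical_distribution(pfa, n, seed=0):
    """Relative frequency of each distinct string in a sampled corpus of size n."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        s = sample_string(pfa, rng)
        counts[s] = counts.get(s, 0) + 1
    return {s: c / n for s, c in counts.items()}


def property_rate(dist, predicate):
    """Probability mass of the strings that satisfy the predicate."""
    return sum(p for s, p in dist.items() if predicate(s))


def kl_divergence(p, q):
    """Sum of p(x) * log(p(x) / q(x)) over the support of p; needs q(x) > 0 there."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


if __name__ == "__main__":
    emp = empirical_distribution(PFA, 2000)
    true = {s: string_prob(PFA, s) for s in emp}
    print(property_rate(emp, lambda s: "b" in s), kl_divergence(emp, true))
```

In this toy corpus, the rate of the property "contains b" is fixed by the automaton's transition weights, so it cannot be varied without changing everything else. That is the kind of entanglement the paper's intervention is designed to remove.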

Rethinking the Divergence Regularization in LLM RL cs.LG

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
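
The difference between a hard mask and a smooth regularizer is easiest to see in the per-token gradient weight. The sketch below is an illustration only: the abstract does not give DPPO's or DRPO's formulas, so the surrogate `adv * shift - |adv| / (2 * delta) * shift**2`, the boundary `delta`, and both function names are assumptions. Here `shift` is the sampled token's probability under the new policy minus its probability under the old one.

```python
def hard_mask_weight(adv: float, shift: float, delta: float) -> float:
    """Gradient weight under a hard mask: the token is dropped once it has moved
    past the boundary in the direction the advantage was already pushing."""
    overshoot = (adv > 0 and shift > delta) or (adv < 0 and shift < -delta)
    return 0.0 if overshoot else adv


def quadratic_weight(adv: float, shift: float, delta: float) -> float:
    """Gradient weight of the assumed surrogate
    adv * shift - |adv| / (2 * delta) * shift**2, differentiated w.r.t. shift."""
    return adv - abs(adv) * shift / delta


if __name__ == "__main__":
    delta = 0.1
    for shift in (0.0, 0.05, 0.1, 0.15, 0.2):
        print(shift,
              hard_mask_weight(1.0, shift, delta),
              quadratic_weight(1.0, shift, delta))
```

Under these assumptions, the quadratic weight equals the advantage at zero shift, shrinks linearly to zero at the boundary, and changes sign beyond it, so an overshooting token is pulled back rather than ignored. Because a probability shift lies between -1 and 1, the weight stays bounded. The hard mask instead jumps from the full advantage to zero at the boundary.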

Weighted universal approximation of differentiable maps on infinite-dimensional manifolds math.FA

We generalize the universal approximation theorem for functional input neural networks (FNN) to differentiable maps by including the approximation of the derivatives. A FNN maps the input from a possibly infinite-dimensional weighted manifold to the real-valued hidden layer, on which a non-linear scalar activation function is applied, and then returns the output into a Banach space via some linear readouts. By proving a weighted Nachbin theorem, we establish a universal approximation theorem (UAT) for differentiable maps, which goes beyond the usual formulation on compact sets and also includes the approximation of the derivatives. This leads us to approximation results for non-anticipative functionals including the horizontal and vertical derivatives. As a further application, we show that linear functions of the signature are able to approximate path space functionals including their directional derivatives.

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws cs.CV

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
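
To show what a periodic terminal law can look like, here is a minimal sketch for one simple periodically forced Ornstein-Uhlenbeck process, dX = -theta * (X - amp * cos(omega * t)) dt + sigma dW. The paper's actual forcing term is not given in the abstract, so this cosine forcing, the parameter defaults, and the function names are assumptions. For this process the forward marginal is Gaussian, with the closed-form mean and variance below.

```python
import math


def periodic_mean(t, theta=1.0, amp=1.0, omega=2.0 * math.pi):
    """Long-run mean of the forced process; periodic in t with period 2*pi/omega."""
    return (theta * amp
            * (theta * math.cos(omega * t) + omega * math.sin(omega * t))
            / (theta ** 2 + omega ** 2))


def forward_mean(x0, t, theta=1.0, amp=1.0, omega=2.0 * math.pi):
    """E[X_t | X_0 = x0]: an exponentially decaying transient plus the periodic part."""
    transient = (x0 - periodic_mean(0.0, theta, amp, omega)) * math.exp(-theta * t)
    return transient + periodic_mean(t, theta, amp, omega)


def forward_var(t, theta=1.0, sigma=1.0):
    """Var[X_t | X_0]; in this sketch the forcing moves the mean only."""
    return sigma ** 2 / (2.0 * theta) * (1.0 - math.exp(-2.0 * theta * t))


if __name__ == "__main__":
    for t in (0.0, 0.5, 5.0, 5.5):
        print(t, forward_mean(2.0, t), forward_var(t))
```

As t grows, the dependence on `x0` vanishes and the marginal approaches a Gaussian whose mean keeps oscillating with the phase of the forcing. The result is a family of terminal laws indexed by phase instead of a single invariant one. In this simplified version only the mean is periodic; the variance settles to a constant.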

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Claude Opus 4.8         61.4   62     $10.94
2  GPT-5.5                 60.2   70     $11.25
3  Claude Opus 4.7         57.3   57     $10.94
4  Gemini 3.1 Pro Preview  57.2   127    $4.50
5  GPT-5.4                 56.8   99     $5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  gpt-5.5-2026-04-23-xhigh   62.7%
2  Codex                      60.4%
3  Claude Code                59.6%
4  gpt-5.5-2026-04-23-medium  58.9%
5  Claude Opus 4.8-xhigh      56.4%