The Inference Report

June 26, 2026

The market for AI has cleaved decisively into two tiers: those controlling the infrastructure where agents run, and everyone else competing for the scraps. This week's announcements make the hierarchy visible. Notion killed its email product because users have moved to agents that manage inboxes automatically. Patronus AI raised $50 million to test agents before deployment. General Intuition secured $320 million betting that video game data trains agents that behave like humans. These aren't incremental feature releases. They're signals that the competition has shifted from model size to agent execution, and the companies positioning themselves as platforms for that execution are locking in advantage faster than regulation can respond.

The concentration of power is materializing in hardware and trust. Apple is raising prices across all products except the iPhone because AI server demand has created memory shortages. IBM claims sub-1 nanometer transistors. Amazon is committing $13 billion more to India's AI infrastructure. Alibaba was caught using 25,000 fake accounts to extract Claude's capabilities across 28.8 million exchanges, prompting Anthropic to escalate the matter directly to the Senate Banking Committee. That escalation signals model theft is now treated as a national security concern. The Linux Foundation is launching an Agent Name Service framework to establish identity and trust for AI agents operating across enterprises. These moves define the real infrastructure layer: memory, compute, identity verification, and platform orchestration. The companies controlling those layers are printing money. The companies selling models are managing state pressure, IP theft, and the burden of proving their agents outperform open-source alternatives.

OpenAI leads on agent research capability while AWS positions itself as the orchestration layer where agents actually run. GitHub is benchmarking performance across twenty-plus models, rendering model choice increasingly commoditized while locking developers into GitHub's execution environment. Google optimizes cache performance, Hugging Face simplifies vLLM deployment, and NVIDIA bundles gaming discounts with GeForce NOW to build installed base in cloud compute. The pattern is consistent: no lab is announcing breakthroughs in agent reasoning or reliability. They're announcing infrastructure, benchmarking, and pricing. The real traction is in the scaffolding around agents, not the agents themselves. DESIGN.md gives agents stable interfaces to visual systems. Anthropic's 817 cybersecurity skills map to six authoritative frameworks so agents understand domain semantics. gstack packages 23 tools into role-based clusters. These solve the friction that separates toy problems from production deployment. The winners are those who own the platform where agents operate, the data pipelines feeding them, and the trust frameworks verifying their behavior.

Grant Calloway

Research Papers
DanceOPD: On-Policy Generative Field Distillation cs.CV

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs cs.LG

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.
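To make the "scale dominance" problem concrete, here is a minimal sketch of one way instance-wise comparison can calibrate rewards. It is not RiVER's actual shaping function, which the abstract does not specify: the win-fraction rule, the function names, and the toy scores are all assumptions chosen for illustration, and the sketch does not implement the extra emphasis on top-ranked solvers.

```python
def raw_rewards(score_matrix):
    """Average raw score per candidate; large-magnitude instances dominate."""
    return [sum(row) / len(row) for row in score_matrix]


def instance_rank_rewards(score_matrix):
    """Hypothetical calibrated reward in [0, 1].

    score_matrix[c][i] is the score of candidate c on test instance i.
    For each instance, a candidate earns the fraction of other candidates
    it beats (ties count as half a win); the reward is the mean over
    instances, so every instance carries equal weight. Needs >= 2 candidates.
    """
    n = len(score_matrix)
    m = len(score_matrix[0])
    rewards = []
    for c in range(n):
        total = 0.0
        for i in range(m):
            wins = 0.0
            for other in range(n):
                if other == c:
                    continue
                if score_matrix[c][i] > score_matrix[other][i]:
                    wins += 1.0
                elif score_matrix[c][i] == score_matrix[other][i]:
                    wins += 0.5
            total += wins / (n - 1)
        rewards.append(total / m)
    return rewards


# Toy data (made up): instance 0 has scores near 1e6, instance 1 near 1e2.
scores = [
    [1_000_100, 10],  # candidate 0: marginally best on the large-scale instance
    [1_000_000, 95],  # candidate 1: best on the small-scale instance
    [999_900, 90],    # candidate 2
]

raw = raw_rewards(scores)
ranked = instance_rank_rewards(scores)
best_raw = raw.index(max(raw))
best_ranked = ranked.index(max(ranked))
print(best_raw, best_ranked)
```

With raw averages, candidate 0 wins on the strength of a 0.01% edge on the large-magnitude instance. Under the rank-based rule each instance counts equally, and candidate 1 comes out ahead.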

Autoregressive Boltzmann Generators cs.LG

Efficient sampling of molecular systems at thermodynamic equilibrium is a hallmark challenge in statistical physics. This challenge has driven the development of Boltzmann Generators (BGs), which allow rapid generation of uncorrelated equilibrium samples by combining a generative model with exact likelihoods and an importance sampling correction. However, modern BGs predominantly rely on normalizing flows (NFs), which either suffer from limited expressivity due to strict invertibility constraints (discrete time) or computationally expensive likelihoods (continuous time). In this paper, we propose Autoregressive Boltzmann Generators (ArBG), a novel autoregressive modelling framework that overcomes these limitations by departing from the flow-based BG paradigm. ArBG circumvents the topological constraints of flows and enables sequential inference-time interventions, while offering enhanced scalability by leveraging architectures effective in Large Language Models. We empirically demonstrate that ArBG leads to significant improvements over flow-based models across all benchmarks, but particularly in larger peptide systems such as the 10-residue Chignolin. Furthermore, we introduce Robin, a 132 million parameter transferable model trained with the ArBG framework which improves over the previous state-of-the-art, reducing the zero-shot energy error, E-W$_2$, on 8-residue systems by over 60%. The code can be found at the following link: https://github.com/danyalrehman/autobg.
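The "exact likelihoods plus importance sampling correction" recipe that defines a Boltzmann Generator can be sketched on a toy problem. The example below is an assumption-laden illustration, not the ArBG code: a wide Gaussian stands in for the generative model, the energy is a one-dimensional harmonic potential with kT = 1, and all function names are hypothetical.

```python
import math
import random


def energy(x):
    """Toy harmonic potential; the Boltzmann density is exp(-energy(x) / kT)."""
    return 0.5 * x * x


def proposal_log_prob(x, sigma):
    """Exact log-likelihood of the stand-in generative model, Normal(0, sigma)."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def reweighted_mean(observable, n_samples=20000, sigma=2.0, kT=1.0, seed=0):
    """Self-normalized importance sampling estimate of E_p[observable]."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    # log w = log(unnormalized target) - log(proposal likelihood)
    log_w = [-energy(x) / kT - proposal_log_prob(x, sigma) for x in xs]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * observable(x) for wi, x in zip(w, xs)) / total
    # Effective sample size: how many samples the weights are "worth".
    ess = total ** 2 / sum(wi * wi for wi in w)
    return estimate, ess


estimate, ess = reweighted_mean(lambda x: x * x)
print(estimate, ess)
```

For this potential the target is a standard normal, so the reweighted estimate of the mean of x squared should land near 1 even though the samples come from a distribution with variance 4. The effective sample size is the usual diagnostic for how much the proposal and target overlap.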

When are likely answers right? On Sequence Probability and Correctness in LLMs stat.ML

Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.
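As a small illustration of the quantity under study, the sketch below computes sequence log-probability as the sum of per-token log-probabilities and contrasts two ways of choosing among repeated responses to one prompt. The response data and helper names are hypothetical, and nothing here reproduces the paper's experiments.

```python
import math
from collections import Counter


def sequence_log_prob(token_logprobs):
    """log P(continuation | prompt) = sum of per-token conditional log-probs."""
    return sum(token_logprobs)


def pick_most_probable(responses):
    """Select the answer whose full sequence is most likely under the model."""
    best = max(responses, key=lambda r: sequence_log_prob(r["token_logprobs"]))
    return best["answer"]


def pick_majority(responses):
    """Self-consistency style selection: the most frequent final answer."""
    counts = Counter(r["answer"] for r in responses)
    return counts.most_common(1)[0][0]


# Hypothetical repeated responses to the same prompt (numbers are made up).
responses = [
    {"answer": "12", "token_logprobs": [-0.1, -0.2]},
    {"answer": "15", "token_logprobs": [-0.6, -0.7, -0.5]},
    {"answer": "15", "token_logprobs": [-0.5, -0.9, -0.4]},
]

by_probability = pick_most_probable(responses)
by_vote = pick_majority(responses)
print(by_probability, by_vote)
```

The two rules disagree on this toy input: the single short response has the highest sequence probability, while the majority answer appears in two lower-probability responses. Which rule is right depends on the data, and the abstract's finding is that sequence probability alone is not a reliable guide at this level.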

Error-Conditioned Neural Solvers cs.LG

Neural surrogate models offer fast approximate mappings from PDE parameters to solutions, but they typically treat solving as a purely statistical task: once trained, they struggle to correct their own constraint violations and extrapolate beyond the training distribution. Recent hybrid methods promote physical correctness by targeting the PDE residual via gradient descent or Gauss-Newton steps, but inherit the compute cost and instability of the underlying classical optimizers. We show, theoretically and empirically, that numerically minimizing the PDE residual can be an unreliable proxy for reconstruction accuracy in ill-conditioned systems, explaining why these methods often do not make accurate predictions despite achieving low residuals. We propose error-conditioned Neural Solvers (ENS), built on a different principle: rather than an optimization target, the PDE residual field is passed as a direct input to the network at each iteration, enabling it to read the spatial structure of its own errors and learn an update policy to iteratively correct its predictions. Across four PDE families, ENS attains the highest prediction accuracy in the large majority of settings, with gains reaching 10× on turbulent Kolmogorov flow, while avoiding the expensive compute cost of hybrid methods. ENS's learned correction policy generalizes under distribution shift, including zero-shot parameter changes and cross-equation transfer, where its relative advantage is largest in the ill-conditioned regimes where residual minimization is least reliable. Project website: https://neuralsolver.github.io/.
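The claim that a low residual need not mean a low error in ill-conditioned systems can be checked by hand on a 2×2 linear system. This is only an analogy, assuming a discretized PDE behaves like a linear system A x = b; the matrix, the candidates, and the helper names are invented for the illustration and do not come from the paper.

```python
import math


def matvec(a, x):
    """Multiply a 2x2 matrix (list of rows) by a length-2 vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1]]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(vi * vi for vi in v))


def residual_norm(a, x, b):
    """||A x - b||: how badly a candidate violates the equations."""
    ax = matvec(a, x)
    return norm([ax[0] - b[0], ax[1] - b[1]])


def error_norm(x, x_true):
    """||x - x_true||: how far a candidate is from the true solution."""
    return norm([x[0] - x_true[0], x[1] - x_true[1]])


# Nearly singular system: the two rows are almost identical.
A = [[1.0, 1.0],
     [1.0, 1.0001]]
x_true = [1.0, 1.0]
b = matvec(A, x_true)  # [2.0, 2.0001]

candidate_a = [2.0, 0.0]    # far from the truth, yet almost satisfies A x = b
candidate_b = [1.01, 1.01]  # close to the truth, with a larger residual

res_a = residual_norm(A, candidate_a, b)
res_b = residual_norm(A, candidate_b, b)
err_a = error_norm(candidate_a, x_true)
err_b = error_norm(candidate_b, x_true)
print(res_a, err_a)
print(res_b, err_b)
```

Candidate A has a residual of roughly 1e-4 but an error of about 1.41, while candidate B has a residual of roughly 0.028 and an error of about 0.014. An optimizer that ranks candidates by residual alone would prefer the worse answer.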

Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline cs.CL

Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and adversarial ties at scale has historically required intensive manual coding, while automated text-as-data methods have largely been limited to simple co-occurrence. Recent large language model (LLM) approaches offer a path forward but often rely on proprietary APIs, lack cross-lingual capability, and struggle with scalable entity resolution. We present a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. It combines span-based named-entity recognition (NER) with a three-stage linking cascade mapping mentions to language-independent Wikidata identifiers; a high-throughput, ontology-constrained mixture-of-experts model then uses guided decoding to extract directed, signed relationships grounded in a domain ontology. A full-coverage spot-check against a 3491-relation gold standard shows high textual correctness (68.2% strict to 93.7% lenient). Two large-scale case studies validate the pipeline against the public record. In Austria, it reconstructs a political party's complete lifecycle, dating internal fractures and tracking personnel into successor factions and court convictions. In a Polish corpus, it uncovers the overlapping economic and governance networks of state-enterprise patronage, alongside the structurally balanced, signed conflict network of the polarized Civic Platform (Platforma Obywatelska, PO)--Law and Justice (Prawo i Sprawiedliwość, PiS) duopoly. By bridging raw multilingual text and structured relational data, our framework provides a robust, replicable foundation for cross-national empirical computational social science.

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model            Score  tok/s  $/1M
1  Claude Fable 5   59.9   0      $20.00
2  Claude Opus 4.8  55.7   65     $10.00
3  GPT-5.5          54.8   73     $11.25
4  Claude Opus 4.7  53.5   55     $10.00
5  GPT-5.4          51.4   150    $5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Provider   Model                      Type   Score
1  OpenAI     gpt-5.5-2026-04-23-xhigh   Model  62.7% ± 0.91%
2  Junie      Junie                      Agent  61.6% ± 0.64%
3  OpenAI     Codex                      Agent  60.4% ± 1.37%
4  Anthropic  Claude Code                Agent  59.6% ± 1.98%
5  OpenAI     gpt-5.5-2026-04-23-medium  Model  58.9% ± 0.78%