The Inference Report

July 1, 2026

The AI industry is splitting into two incompatible operating systems. One side ships products at scale, embeds agents into phones and keyboards, and signs billion-dollar contracts. The other navigates a regulatory environment so unstable that policy reversals arrive faster than product cycles. Google's Nano Banana 2 Lite and Anthropic's Claude Sonnet 5 prioritize inference cost over capability because volume, not benchmarks, generates returns. Wayve's $8.5 billion valuation, Etched's $5 billion with $1 billion in contracted sales, and EquiLibre's $500 million show that investors reward execution. Acti embedding agents into smartphone keyboards, OpenClaw landing on iOS and Android, and X launching an MCP server all move in the same direction: put agents where users already are. Amazon's $1 billion forward-deployed engineering (FDE) organization mirrors OpenAI's and Anthropic's playbook, deploying agents into customer companies at speed. The infrastructure layer is consolidating around interoperability standards that reduce switching costs. Anthropic's Claude Science targets pharmaceutical workflows. Riverside generates AI-powered newsletters. Margin compression extends into vertical applications because the real money is in production deployment, not capability.

The regulatory side tells an opposite story. The Trump administration dropped export restrictions on Anthropic's Mythos and Fable models weeks after ordering the company to suspend access for foreign nationals. That whipsaw is not a policy framework; it is a signal that builders cannot rely on consistency. The National Design Studio's year-long delay on .gov website redesigns after Trump's AI mandate suggests well-resourced government initiatives stall when political ground shifts. Meanwhile, the security community moved prompt injection from theoretical risk to production threat in late 2025, and VS Code now requires users to explicitly trust code before running it. The gap between where builders are shipping and where defenders need to be is widening. Companies betting on AI are choosing speed and market share over regulatory clarity because regulatory clarity is not coming.

Labs are signaling radically different bets about where economic value concentrates. OpenAI is publishing adoption metrics and benchmarking genomics performance, moves that read as defensive: establishing market presence in life sciences before competitors lock in researcher workflows. Google and NVIDIA are moving harder into infrastructure: TabFM targets tabular data, and NVIDIA's token cost analysis and robotics software stack are explicitly framed around production deployment and cost discipline. Anthropic used Claude Sonnet 5 and Claude Science to position itself as the inference vendor for specialized workloads. AI21 Labs called out routing inefficiency, signaling that token arbitrage is becoming a real cost center. NVIDIA's announcements span GPU optimization, robotics software, synthetic data workflows, and life sciences tooling, a full-stack play that locks in developer dependency before application-layer competitors establish themselves. What's missing from today's announcements is any lab cutting prices or directly challenging NVIDIA's inference economics. That absence is the story.

GitHub's trending repos reveal the practical layer underneath these capital moves: msitarzewski/agency-agents, ogulcancelik/herdr, and obra/superpowers all treat agents as composable building blocks, solving the real problem of coordinating multiple specialized models without rewriting orchestration logic. diegosouzapw/OmniRoute addresses token waste directly, cutting usage by 15 to 95 percent across 231 providers. The infrastructure underneath these frameworks is becoming a commodity. What matters now is the glue. Security and data tooling show that developers want capability without vendor lock-in, and specialized high-quality datasets are competitive advantages. LightRAG's publication at EMNLP 2025 and activepieces' integration of 400 MCP servers signal that the plumbing for connecting agents to external tools is no longer experimental. It is infrastructure.

Grant Calloway

Research Papers
Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision cs.CL

When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models' counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different families, frequently produce explanations more faithful to their own current behaviors than to those of their training targets. This "introspective" coupling between LM explanations and behaviors occurs when training explanations remain sufficiently correlated with current behaviors over the course of training, even as behaviors themselves shift. We also show that introspective coupling tracks behavior shifts: when explanation training is provided concurrently with other post-training objectives, explanations track those shifts without requiring updated supervision. This phenomenon appears in multiple tasks, including sycophancy and refusal, and is robust to label noise. Overall, our results show that even fixed datasets of counterfactual explanations can provide scalable and generalizable post-training signal for introspection.

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents cs.LG

LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the goodness of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and renders different methodological families requiring distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference-policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.
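
The Q-alignment idea described above can be illustrated with a small sketch: for one state, compare the ordering a dense supervision signal induces over candidate actions with the ordering given by reference Q-values. The function name `pairwise_q_alignment`, the pairwise-agreement metric, and the toy numbers are assumptions made for illustration; the abstract does not specify the exact metric QVal uses.

```python
from itertools import combinations


def pairwise_q_alignment(scores, q_values):
    """Fraction of action pairs that the signal orders the same way as the reference Q-values.

    Hypothetical metric for illustration. Pairs whose reference Q-values tie
    are skipped; a tie in the signal counts as a disagreement.
    """
    if len(scores) != len(q_values):
        raise ValueError("scores and q_values must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        q_diff = q_values[i] - q_values[j]
        if q_diff == 0:
            continue  # the reference expresses no preference for this pair
        total += 1
        s_diff = scores[i] - scores[j]
        if s_diff * q_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


# Toy data (assumed): three candidate actions in a single state.
q_ref = [0.9, 0.4, 0.1]      # Q-values from a strong reference policy
signal_a = [2.0, 1.0, 0.5]   # ranks the actions exactly like the reference
signal_b = [0.2, 0.9, 0.5]   # agrees with the reference on one pair out of three

print(pairwise_q_alignment(signal_a, q_ref))
print(pairwise_q_alignment(signal_b, q_ref))
```

Because the check needs only scores and reference Q-values, it can be run before any training, which is the property the abstract emphasizes.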

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs cs.CL

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty, undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.

When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors cs.CL

While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have only offered limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur across all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that incorporating data referencing as a critic significantly improves answer accuracy by up to 12.0%, through critic-based filtering and rejection sampling. Finally, we trained a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs, and effectively assists inference for larger models.
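
As a rough illustration of critic-based rejection sampling, the sketch below accepts the first candidate answer that passes a data-referencing check. The paper's critic is a trained 4B-parameter model; `toy_critic` here is a rule-based stand-in, and every name, the table, and the candidate strings are hypothetical.

```python
import re


def cited_numbers(text):
    """Extract numeric literals that appear in a piece of reasoning text."""
    return re.findall(r"-?\d+(?:\.\d+)?", text)


def toy_critic(table, reasoning):
    """Return True when every number cited in the reasoning appears in the table.

    Assumed stand-in for a learned critic. It would wrongly flag derived
    values such as sums or differences, so it is only a toy.
    """
    cells = {str(value) for row in table for value in row.values()}
    return all(number in cells for number in cited_numbers(reasoning))


def rejection_sample(candidates, table, critic):
    """Return the first candidate the critic accepts, or None if all are rejected."""
    for candidate in candidates:
        if critic(table, candidate):
            return candidate
    return None


# Toy table and sampled answers (assumed); the first answer misquotes a cell.
table = [{"city": "Oslo", "sales": 120}, {"city": "Bergen", "sales": 85}]
candidates = [
    "Oslo sold 210 units and Bergen sold 85, so Oslo leads.",
    "Oslo sold 120 units and Bergen sold 85, so Oslo leads.",
]

chosen = rejection_sample(candidates, table, toy_critic)
print(chosen)
```

In practice the candidates would come from repeated sampling of a language model, and the critic would score reasoning steps rather than match strings.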

Freeform Preference Learning for Robotic Manipulation cs.RO

Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/

AdaJEPA: An Adaptive Latent World Model cs.LG

Latent world models enable planning from high-dimensional observations by predicting future states in a compact latent space. However, these models are typically kept frozen at test time: when their predictions become inaccurate, planning can fail, especially under test-time distribution shift. To address this, we propose AdaJEPA, an adaptive latent world model that performs test-time adaptation within the closed loop of model predictive control (MPC). After training, AdaJEPA plans and executes the first action chunk, uses the observed next-state transition as a self-supervised adaptation signal, and replans with the updated model. This closed-loop update continuously recalibrates the world model without additional expert demonstrations. Across a range of goal-reaching tasks, AdaJEPA substantially improves planning success with as few as one gradient step per MPC replanning step.
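
The plan, execute, adapt, replan loop can be sketched on a toy problem. This is not AdaJEPA itself: there is no encoder or latent space, the "world model" is a single assumed gain parameter, and the dynamics, learning rate, and step count are invented for illustration. It shows only how one gradient step per replanning step can correct a model that is wrong at test time.

```python
def run_mpc(adapt, steps=5, lr=0.2):
    """Drive a scalar state to a goal with one-step MPC; optionally adapt the model online.

    Returns the remaining distance to the goal after `steps` replanning steps.
    """
    true_gain = 0.5   # real environment after the shift: x_next = x + true_gain * u
    gain_hat = 1.0    # model fitted before the shift:    x_next = x + gain_hat * u
    x, goal = 0.0, 1.0
    for _ in range(steps):
        # Plan: pick the action the current model predicts will reach the goal.
        u = (goal - x) / gain_hat
        predicted = x + gain_hat * u
        # Execute in the real environment and observe the transition.
        x_next = x + true_gain * u
        if adapt:
            # One gradient step on the squared prediction error w.r.t. gain_hat.
            error = predicted - x_next
            gain_hat -= lr * 2.0 * error * u
        x = x_next
    return abs(goal - x)


frozen_gap = run_mpc(adapt=False)
adaptive_gap = run_mpc(adapt=True)
print(f"frozen model gap:   {frozen_gap:.4f}")
print(f"adaptive model gap: {adaptive_gap:.4f}")
```

The observed transition is the only supervision used, which mirrors the self-supervised adaptation signal the abstract describes.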

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model            Score  tok/s  $/1M
1  Claude Fable 5   59.9   0      $20.00
2  Claude Opus 4.8  55.7   65     $10.00
3  GPT-5.5          54.8   77     $11.25
4  Claude Opus 4.7  53.5   48     $10.00
5  Claude Sonnet 5  53.4   79     $6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Vendor     Model                      Type   Score
1  OpenAI     gpt-5.5-2026-04-23-xhigh   Model  62.7% ± 0.91%
2  Junie      Junie                      Agent  61.6% ± 0.64%
3  OpenAI     Codex                      Agent  60.4% ± 1.37%
4  Anthropic  Claude Code                Agent  59.6% ± 1.98%
5  OpenAI     gpt-5.5-2026-04-23-medium  Model  58.9% ± 0.78%

Trending