The Inference Report

June 17, 2026

The government's ability to weaponize AI policy is outpacing the market's capacity to adapt, and this week's regulatory moves have exposed a structural vulnerability that no amount of capital can fix. Anthropic's sudden export ban on Fable and Mythos models, justified on national security grounds, sits alongside the Trump administration's simultaneous protection of xAI's unpermitted gas turbines and its citing of Pentagon personnel's use of generative AI tools as justification for regulatory leniency. The logic is circular and intentional: AI is strategically important enough to restrict when it threatens foreign markets, and strategically important enough to exempt from Clean Air Act enforcement when domestic infrastructure is at stake. Neither company negotiated this arrangement. Both are now hostages to whichever interpretation of national interest serves the administration's immediate purpose. OpenAI's leaked financials, which show billions in annual losses, compound the precarity. Revenue is real but dwarfed by R&D spending. These companies cannot absorb extended regulatory uncertainty or market restrictions. SpaceX's $60 billion acquisition of Cursor days after going public signals that capital still flows toward consolidation plays, but the underlying bet is that the regulatory environment will remain favorable long enough for those investments to reach profitability.

The fragmentation of consumer preference and corporate strategy reveals where the real constraint lies. Sixty percent of US consumers say AI branding is a turnoff, yet companies are racing to integrate AI everywhere: Android 17 expands Gemini features, Meta launches AI search modes, OpenAI partners with Visa for direct ChatGPT purchases. ChatGPT's market share has slipped below 50 percent for the first time despite maintaining 1.1 billion monthly users, while Claude trails at 245 million. Anthropic's pause on token-based billing for its Claude Agent SDK signals that even AI-first companies understand that pricing volatility destroys adoption. The constraint is not technical capability or regulatory approval. It is whether users will tolerate the cost and uncertainty long enough for these businesses to reach profitability. Robinhood's CEO notably omitted AI from his justification for 10 percent layoffs, breaking from the industry pattern of blaming automation for headcount cuts. That silence matters more than the cuts themselves.

In research and infrastructure, the field is sorting into specialized theaters with different competitive dynamics and winners. The infrastructure race has clear outcomes: NVIDIA and AMD compete through MLPerf benchmarks and hardware partnerships, where wins compound. The application race remains fragmented: Microsoft operates in Turkish farming, NVIDIA in AR glasses, Anthropic in coding agents, Google in housing planning. The money moves fastest where switching costs are highest, which is the hardware layer. Everything else is still fighting for distribution.

Research across robotics, language models, and healthcare systems shows a methodological shift away from monolithic scaling toward conditional, task-aware allocation of computation. Inference-time steering replaces end-to-end training with post-hoc refinement loops. Variable-width transformers and adaptive computation abandon uniform capacity allocation across layers. Agentic frameworks decompose complex problems through structured reasoning rather than end-to-end prediction. On code-specific benchmarks, GLM-5.1 and GLM-5.2 rank noticeably differently on software engineering tasks than on general capability suites, suggesting that specialized evaluation surfaces different architectural strengths. Developer momentum concentrates on two problems: replacing slow tools with fast ones in Rust and compiled languages, and building the operational layer around AI applications before the AI layer itself stabilizes. LangChain4j, DeepSeek-Reasonix, and production trading engines are engineered around how modern LLM inference actually works, not around what sounds impressive. The infrastructure is consolidating. Everything else is still being built.

Grant Calloway

Research Papers

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement cs.RO

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
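
The abstract does not say how the verifier's judgments are turned into a steering decision. One common reading of a generator-verifier loop is best-of-N selection: sample several candidate actions, score each, and execute the top one. The sketch below assumes that reading. The names `steer`, `toy_generator`, and `toy_verifier` are hypothetical stand-ins, not VERITAS components, and the toy verifier is a simple distance score rather than a visual model.

```python
import random
from typing import Callable, Sequence, Tuple

Action = Tuple[float, ...]


def steer(
    observation: Sequence[float],
    generator: Callable[[Sequence[float], random.Random], Action],
    verifier: Callable[[Sequence[float], Action], float],
    num_candidates: int = 8,
    seed: int = 0,
) -> Tuple[Action, float]:
    """Sample candidate actions and return the one the verifier scores highest.

    No gradients flow through the verifier; it is only used to rank candidates.
    """
    rng = random.Random(seed)
    candidates = [generator(observation, rng) for _ in range(num_candidates)]
    scored = [(verifier(observation, action), action) for action in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


def toy_generator(obs: Sequence[float], rng: random.Random) -> Action:
    # Hypothetical noisy policy: proposes an action near the observation.
    return tuple(x + rng.gauss(0.0, 1.0) for x in obs)


def toy_verifier(obs: Sequence[float], action: Action) -> float:
    # Hypothetical score, higher is better: negative squared distance to the observation.
    return -sum((a - x) ** 2 for a, x in zip(action, obs))


if __name__ == "__main__":
    obs = (0.5, -0.2)
    _, single = steer(obs, toy_generator, toy_verifier, num_candidates=1)
    _, best_of_8 = steer(obs, toy_generator, toy_verifier, num_candidates=8)
    print("single sample score:", single)
    print("best-of-8 score:   ", best_of_8)
```

Under the same assumption, the self-improvement step would amount to keeping the selected (observation, action) pairs from successful rollouts and fine-tuning the generator on them offline.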

Variable-Width Transformers cs.CL

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped ><former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
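
To make the shape concrete, here is a minimal sketch of a width schedule that is wide at both ends and narrow in the middle, plus one possible parameter-free way to resize the residual stream between layers. Both choices are assumptions: the abstract specifies neither the interpolation (linear here) nor the resizing rule (truncate to narrow, zero-pad to widen here), and the function names are hypothetical.

```python
from typing import List


def width_schedule(num_layers: int, wide: int, narrow: int) -> List[int]:
    """Per-layer widths that shrink linearly from `wide` to `narrow` at mid-depth, then grow back."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        # frac is 1.0 at the first and last layer and 0.0 at the middle of the stack.
        frac = abs(i - mid) / mid if mid > 0 else 1.0
        widths.append(round(narrow + frac * (wide - narrow)))
    return widths


def resize_residual(x: List[float], new_width: int) -> List[float]:
    """Assumed parameter-free resize: drop trailing channels to narrow, zero-pad to widen."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


if __name__ == "__main__":
    widths = width_schedule(num_layers=5, wide=1024, narrow=512)
    print(widths)
    print("average width:", sum(widths) / len(widths))
```

The average width of such a schedule sits below the uniform `wide` baseline, which is the mechanism the abstract credits for the lower FLOPs and KV cache cost.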

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues cs.CL

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
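
The headline number is a per-paper hit rate: the share of papers where at least one agent finding lines up with a human-reported issue. A minimal sketch of that arithmetic follows, assuming the semantic matching has already been done and is supplied as a set of matched issue identifiers per paper. The function name, paper keys, and issue labels are hypothetical.

```python
from typing import Dict, Set


def blocker_hit_rate(matches: Dict[str, Set[str]]) -> float:
    """Fraction of papers with at least one agent finding matched to a human-reported issue.

    `matches` maps a paper identifier to the set of issue identifiers the agent's
    findings were judged to match; an empty set means no match for that paper.
    """
    if not matches:
        return 0.0
    hits = sum(1 for matched in matches.values() if matched)
    return hits / len(matches)


if __name__ == "__main__":
    example = {
        "paper-a": {"issue-1"},
        "paper-b": set(),
        "paper-c": {"issue-2", "issue-3"},
        "paper-d": {"issue-9"},
    }
    print(blocker_hit_rate(example))
```

A metric of this form rewards finding the right neighborhood of a problem, which fits the abstract's caveat that exact localization may still fall short.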

Sign-Rank, Index, and List Replicability: Connections and Separations cs.LG

In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the $\mathbb{Z}_2$-index and the list replicability number. We order these measures, showing that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and $\mathbb{Z}_2$-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.
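
The two relations stated in words above can be written compactly. The notation is assumed for illustration and may differ from the paper's: $\mathrm{ind}_{\mathbb{Z}_2}(\mathcal{C})$ for the $\mathbb{Z}_2$-index of a concept class $\mathcal{C}$, $\mathrm{LR}(\mathcal{C})$ for its list replicability number, and $\mathcal{C}_1 \times \mathcal{C}_2$ for the product of two classes. The constants $a$ and $b$ are placeholders, since the abstract says only that the bound is linear.

```latex
% Ordering of the two measures (constants a, b are not given in the abstract):
\[
  \mathrm{ind}_{\mathbb{Z}_2}(\mathcal{C}) \;\le\; a \cdot \mathrm{LR}(\mathcal{C}) + b
\]

% Composition result for products of concept classes:
\[
  \mathrm{LR}(\mathcal{C}_1 \times \mathcal{C}_2) \;\le\; \mathrm{LR}(\mathcal{C}_1) + \mathrm{LR}(\mathcal{C}_2)
\]
```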

EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation cs.AI

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
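
The retrieval idea can be illustrated with a textbook upper-confidence-bound score. The sketch below assumes an additive combination of semantic relevance, observed success rate, and a UCB1-style exploration bonus. The abstract does not give the actual scoring formula, so the `Rule` fields, the weight `c`, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float  # semantic similarity to the current situation, assumed in [0, 1]
    uses: int         # how many times the rule has been retrieved
    successes: int    # how many of those retrievals preceded a successful episode


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0) -> float:
    """Relevance plus success rate plus an exploration bonus that shrinks with use."""
    if rule.uses == 0:
        return math.inf  # untried rules are ranked first so they get at least one trial
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(total_uses) / rule.uses)
    return rule.relevance + mean_success + bonus


def retrieve(rules: List[Rule], k: int = 1, c: float = 1.0) -> List[Rule]:
    """Return the k rules with the highest UCB score."""
    total = sum(r.uses for r in rules)
    return sorted(rules, key=lambda r: ucb_score(r, total, c), reverse=True)[:k]


if __name__ == "__main__":
    memory = [
        Rule("check countertops for mugs", relevance=0.9, uses=10, successes=2),
        Rule("search kitchens before bedrooms", relevance=0.5, uses=10, successes=9),
    ]
    print(retrieve(memory, k=1)[0].text)
```

With equal usage counts the exploration bonus is identical for both rules, so the less relevant but historically more successful rule is selected here, which is the trade-off the abstract describes.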

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution cs.CV

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($ν$), and density ($ρ$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially-varying ($E$, $ν$, $ρ$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model            Score  tok/s  $/1M
1  Claude Fable 5   59.9   0      $20.00
2  Claude Opus 4.8  55.7   68     $10.00
3  GPT-5.5          54.8   67     $11.25
4  Claude Opus 4.7  53.5   54     $10.00
5  GPT-5.4          51.4   166    $5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  gpt-5.5-2026-04-23-xhigh   62.7%
2  Junie                      61.6%
3  Codex                      60.4%
4  Claude Code                59.6%
5  gpt-5.5-2026-04-23-medium  58.9%