The Inference Report

June 11, 2026

The industry has reached an inflection point where the capacity to deploy AI at scale now exceeds the capacity to operate it safely or profitably. Amazon borrowed $17.5 billion on top of recent bond sales, Oracle committed $70 billion to data center spending, and companies tracked by the Ramp AI Index spend $7,500 per employee per month on infrastructure. This spending is no longer investment but operational burn to maintain competitive position in an arms race with clear winners and losers. Yet the productivity gains remain theoretical. Checkmarx's survey found enterprises shipping AI-generated code they know is vulnerable. Glean's Work AI Institute data shows employees spend 6.4 hours a week on "botsitting": performing quality checks and providing context to make AI outputs usable. The human overhead required to make these systems functional is being absorbed by the workforce rather than counted as a cost of deployment.

Safety and governance are being treated as obstacles to overcome rather than problems to solve. xAI fired an engineer for raising safety concerns about Grok days before SpaceX's IPO. Anthropic initially implemented a covert policy to limit Claude's ability to develop competing AI models before researchers forced a reversal. A German court ruled that Google's AI Overview provides no value to users and violates consumer protection law. Face-recognition tools deployed by police departments in Florida led to wrongful arrests, exposing decades of flawed systems whose matches were treated as near-certain identification. Researchers testing leading AI models on a classic psychology attention task found catastrophic failure: systems scoring over 90 percent accuracy on short tasks fell to near-zero performance as complexity increased. Companies are choosing speed and market share over fixing known problems because regulatory and competitive incentives reward deployment over disclosure.

The architectural response from AI labs and the developer ecosystem reveals a shift from building capability to controlling adoption and margin. OpenAI's enterprise partnerships and EU transparency initiatives establish it as the trusted intermediary between regulators and industry, converting compliance into a competitive moat. GitHub's integration of actual code intelligence into Copilot CLI signals how AI coding tools become indispensable rather than convenient. The research literature has moved from opaque end-to-end optimization toward decomposable intermediate representations: selective routing to reduce computation, finer-grained credit assignment in reinforcement learning, and explicit reasoning grounded in structured domain knowledge. On GitHub, the developer community has stopped asking whether to build agents and started asking how to wire them together, with agentic skills frameworks and system prompt repositories consolidating around reusable capabilities. Production agents now require instrumentation, evaluation, and data plumbing layers that only emerge once the first system ships. The question driving the industry forward is no longer what these systems can do but who controls them and what margins can be extracted from their deployment.

Grant Calloway

Research Papers
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models cs.CV

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: https://github.com/elmma/mllm-reroute/
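The difference between rank-and-remove and recoverable routing can be illustrated with a toy sketch. This is not the authors' implementation: the per-stage importance scores are invented, the function names are hypothetical, and only the selection logic is modeled (no decoder blocks, hidden states, or KV cache).

```python
def top_k(candidates, scores, k):
    """Return the k candidate token ids with the highest score at this stage."""
    return sorted(candidates, key=lambda t: scores[t], reverse=True)[:k]


def rank_and_remove(stage_scores, budgets):
    """Pruning baseline: a token dropped at one stage is gone for all later stages."""
    pool = list(range(len(stage_scores[0])))
    kept_per_stage = []
    for scores, k in zip(stage_scores, budgets):
        pool = top_k(pool, scores, k)  # the candidate pool shrinks permanently
        kept_per_stage.append(sorted(pool))
    return kept_per_stage


def recoverable_routing(stage_scores, budgets):
    """Routing variant: deferred tokens skip the stage but stay in the pool."""
    pool = list(range(len(stage_scores[0])))
    routed_per_stage = []
    for scores, k in zip(stage_scores, budgets):
        selected = top_k(pool, scores, k)  # same per-stage budget as pruning
        routed_per_stage.append(sorted(selected))
        # The pool is left intact, so deferred tokens are candidates again next stage.
    return routed_per_stage


# Made-up scores: token 2 looks unimportant at stage 0 but matters at stage 1.
stage_scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.2, 0.9, 0.3],
]
budgets = [2, 1]
print(rank_and_remove(stage_scores, budgets))      # [[0, 1], [1]]
print(recoverable_routing(stage_scores, budgets))  # [[0, 1], [2]]
```

Both variants process the same number of tokens at each stage, which mirrors the abstract's claim that the compute budget class is preserved; only the routing variant can recover token 2 once its importance rises.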

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation cs.CL

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit context compression under conversational dynamics and empirically demonstrate its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate the superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.
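The retrieve, revise, and write-back loop can be pictured with a deliberately simple stand-in. In the paper the per-thread states are learned compression states; the sketch below swaps in plain dictionaries and word overlap, and every name in it (`DialogueMemory`, `min_overlap`, the `facts` argument) is a hypothetical illustration rather than the C-DIC API.

```python
def words(text):
    """Lowercased bag of words, used here as a crude retrieval key."""
    return set(text.lower().split())


class DialogueMemory:
    """Toy per-thread memory: one revisable state per conversational thread."""

    def __init__(self, min_overlap=2):
        self.threads = []  # each thread: {"keys": set of words, "state": dict}
        self.min_overlap = min_overlap

    def retrieve(self, turn):
        """Return the index of the best-matching thread, or None if none matches."""
        best, best_overlap = None, 0
        for i, thread in enumerate(self.threads):
            overlap = len(thread["keys"] & words(turn))
            if overlap > best_overlap:
                best, best_overlap = i, overlap
        return best if best_overlap >= self.min_overlap else None

    def update(self, turn, facts):
        """Retrieve a thread, revise its state with new facts, and write it back."""
        idx = self.retrieve(turn)
        if idx is None:
            self.threads.append({"keys": words(turn), "state": dict(facts)})
            return len(self.threads) - 1
        thread = self.threads[idx]
        thread["state"].update(facts)  # newer facts overwrite stale ones
        thread["keys"] |= words(turn)
        return idx


memory = DialogueMemory()
memory.update("book a flight to Paris", {"destination": "Paris"})
memory.update("what is the weather tomorrow", {"topic": "weather"})
memory.update("change the flight to Rome", {"destination": "Rome"})
print(len(memory.threads))         # 2
print(memory.threads[0]["state"])  # {'destination': 'Rome'}
```

The point of the sketch is that memory size tracks the number of threads rather than the number of turns, and that a later turn can correct an earlier entry instead of appending to it.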

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning cs.RO

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at https://jasonjzliu.com/factr2
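The re-sampling idea behind FIRST can be sketched as weighted sampling over demonstration timesteps. The abstract does not say how pre-contact and contact segments are detected or how strongly they are up-sampled, so the force threshold, the pre-contact window, and the boost factor below are all assumptions, as are the function names.

```python
import random


def sampling_weights(forces, threshold=1.0, pre_window=2, boost=3.0):
    """Give contact steps, and the steps just before them, a higher sampling weight.

    forces: estimated external force magnitude per timestep (hypothetical units).
    """
    weights = [1.0] * len(forces)
    for t, force in enumerate(forces):
        if abs(force) >= threshold:
            # Boost the contact step and the pre_window steps leading up to it.
            for s in range(max(0, t - pre_window), t + 1):
                weights[s] = boost
    return weights


def sample_batch(weights, batch_size, seed=0):
    """Draw timestep indices for a behavior-cloning batch, with replacement."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=batch_size)


# Free motion, then contact at steps 4 and 5, then free motion again.
forces = [0.0, 0.0, 0.0, 0.0, 2.5, 3.0, 0.0, 0.0]
weights = sampling_weights(forces)
print(weights)  # [1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 1.0, 1.0]
batch = sample_batch(weights, batch_size=16)
```

With these made-up settings, steps 2 through 5 are three times as likely to appear in a training batch as free-motion steps.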

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners? cs.RO

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. The project page can be found at jadee-dao.github.io/direct/.
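Two ideas in this abstract lend themselves to a small sketch: the success-cost Pareto frontier, and a router that spends more compute only when a prompt appears to need it. The configuration names, costs, and predicted success rates below are invented, and the routing rule (cheapest configuration that clears a target) is an assumed stand-in for DIRECT's learned router, not a description of it.

```python
def pareto_frontier(points):
    """Return names of (name, cost, success) points not dominated by any other.

    A point is dominated if another is no worse on both axes and better on one.
    """
    frontier = []
    for name, cost, success in points:
        dominated = any(
            c <= cost and s >= success and (c < cost or s > success)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


def route(predicted_success, costs, target):
    """Pick the cheapest configuration predicted to meet the target success rate.

    Falls back to the configuration with the best prediction if none qualifies.
    """
    good_enough = [name for name in costs if predicted_success[name] >= target]
    if good_enough:
        return min(good_enough, key=lambda name: costs[name])
    return max(costs, key=lambda name: predicted_success[name])


# Hypothetical relative costs for three planner configurations.
costs = {"small": 1.0, "small+cot": 2.5, "large": 6.0}
easy_scene = {"small": 0.90, "small+cot": 0.92, "large": 0.95}
hard_scene = {"small": 0.30, "small+cot": 0.60, "large": 0.90}

print(route(easy_scene, costs, target=0.85))  # small
print(route(hard_scene, costs, target=0.85))  # large
print(pareto_frontier([("small", 1.0, 0.5), ("small+cot", 2.5, 0.45), ("large", 6.0, 0.8)]))
# ['small', 'large']
```

In the last call, the chain-of-thought configuration costs more than the small model and succeeds less often, so it falls off the frontier; this is the "not a uniform lever" effect in miniature.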

Doc-to-Atom: Learning to Compile and Compose Memory Atoms cs.CL

Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.
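The select-and-assemble step at inference time can be sketched with rank-1 adapters and dot-product retrieval. The abstract does not specify how selected atoms are combined, so summing their low-rank updates is an assumption here, and the atom names, keys, and matrices are invented for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matmul(B, A):
    """Multiply B (m x r) by A (r x n), both given as lists of rows."""
    return [
        [sum(B[i][r] * A[r][j] for r in range(len(A))) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


def select_atoms(query, atoms, k):
    """Rank atoms by similarity between the query and each retrieval key."""
    ranked = sorted(atoms, key=lambda atom: dot(query, atom["key"]), reverse=True)
    return ranked[:k]


def assemble_delta(selected):
    """Sum the low-rank updates B @ A of the selected atoms (assumed combination rule)."""
    delta = None
    for atom in selected:
        update = matmul(atom["B"], atom["A"])
        if delta is None:
            delta = update
        else:
            delta = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(delta, update)]
    return delta


# Three hypothetical atoms, each a rank-1 adapter for a 2 x 2 weight matrix.
atoms = [
    {"name": "date", "key": [1.0, 0.0], "B": [[1], [0]], "A": [[1, 0]]},
    {"name": "person", "key": [0.0, 1.0], "B": [[0], [1]], "A": [[0, 2]]},
    {"name": "place", "key": [0.7, 0.7], "B": [[1], [1]], "A": [[1, 1]]},
]
chosen = select_atoms([1.0, 0.1], atoms, k=2)
print([atom["name"] for atom in chosen])  # ['date', 'place']
print(assemble_delta(chosen))             # [[2, 1], [1, 1]]
```

The resulting delta would be added to a frozen base weight; the unselected "person" atom contributes nothing, which is the interference-avoidance argument in the abstract.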

Redesign Mixture-of-Experts Routers with Manifold Power Iteration cs.LG

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row would encode its expert matrix into a representative vector, such that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment yields more effective MoE models.
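A minimal sketch of a power step followed by a retraction, under stated assumptions: the expert is reduced to a single matrix W, the router row is aligned with W's principal right singular vector, and the retraction is plain rescaling to a fixed norm. How the paper interleaves this with gradient updates, and which expert matrix it uses, is not given in the abstract; `power_then_retract` is a hypothetical name.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*W)]


def power_then_retract(W, r, radius=1.0):
    """One power step r <- W^T W r, then rescale r back to norm `radius`."""
    g = matvec(transpose(W), matvec(W, r))
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        raise ValueError("router row lies in the null space of W")
    return [radius * x / norm for x in g]


# Toy expert whose principal right singular direction is the first axis.
W = [[3.0, 0.0],
     [0.0, 1.0]]
r = [0.6, 0.8]  # initial router row, mostly along the weaker direction
for _ in range(20):
    r = power_then_retract(W, r)
print([round(x, 6) for x in r])  # [1.0, 0.0]
```

Each step multiplies the component along the stronger direction by 9 and the weaker one by 1, so the row rotates toward the principal direction while the retraction keeps its norm fixed.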

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

# | Model | Score | tok/s | $/1M
1 | Claude Fable 5 | 64.9 | 64 | $21.88
2 | Claude Opus 4.8 | 61.4 | 59 | $10.94
3 | GPT-5.5 | 60.2 | 55 | $11.25
4 | Claude Opus 4.7 | 57.3 | 48 | $10.94
5 | Gemini 3.1 Pro Preview | 57.2 | 127 | $4.50
SWE-rebench

Agentic coding on real-world software engineering tasks

# | Model | Score
1 | gpt-5.5-2026-04-23-xhigh | 62.7%
2 | Junie | 61.6%
3 | Codex | 60.4%
4 | Claude Code | 59.6%
5 | gpt-5.5-2026-04-23-medium | 58.9%