The Inference Report

March 18, 2026

The infrastructure wars are hardening into permanent structural divides. Nvidia and Intel are cementing agentic AI as the next revenue engine through integrated hardware stacks, SK Hynix's chairman has declared a wafer shortage persisting through 2030, and that scarcity is already priced into market expectations. Meanwhile, the government is deliberately fragmenting its vendor relationships: OpenAI is expanding classified work through AWS while the Pentagon develops alternatives to Anthropic and explores secure environments for model training on military data. Anthropic is in active legal conflict with the Justice Department over warfighting systems, with the government asserting it will decide how AI gets deployed regardless of vendor preference. This is not regulatory disagreement. This is the state claiming unilateral control over deployment regardless of builder intent.

Enterprise adoption is splintering by capability and control. Mistral is betting enterprises want to train custom models from scratch on proprietary data rather than fine-tune existing systems. Google is expanding Personal Intelligence to all US users, letting its AI tap Gmail and Google Photos for tailored responses. Microsoft unified its consumer and commercial Copilot systems while narrowing Mustafa Suleyman's role to superintelligence and frontier models. These are admissions that the integrated AI assistant play is fragmenting into specialized layers. Yann LeCun left Meta, raised $1.03 billion in four months at a $3.5 billion valuation for AMI Labs in Paris, and is explicitly betting against large language models with backing from Nvidia, Samsung, and Bezos Expeditions. One of AI's founding figures is placing capital on a different architecture entirely.

The money is flowing toward whoever controls the stack from model to inference to data to hardware, not toward individual model releases. OpenAI is positioning smaller models like GPT-5.4 mini and nano as the infrastructure layer for distributed agent workloads while capturing data on worker compensation queries. Nvidia is building the hardware and networking stack to make that distribution possible: AI grids pushing inference to telecom networks, RTX-accelerated local devices running open models, and integrations with Apple Vision Pro for professional applications. IBM's acquisition of Confluent signals that real-time data infrastructure is now essential to enterprise agents. Hugging Face and Mistral compete in the open-source compact model space but lack the infrastructure moat Nvidia controls or the API distribution OpenAI has. GitHub shows agentic framework adoption fragmenting into specialized layers, with Obra's superpowers and LangChain's deepagents addressing the concrete problem that naive agent loops hallucinate or get stuck without structured orchestration. The frontier in coding benchmarks shows stagnation rather than progress: Claude Code, Junie, and Claude Opus 4.6 remain locked in a 51.7 to 52.9 percent band on SWE-rebench with less than a percentage point separating the top six models, suggesting benchmark saturation: the test is losing discriminative power rather than revealing genuine capability gains.
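The orchestration problem described above, that naive agent loops spin forever or act on hallucinated tool names, is usually solved with an explicit step budget and a tool-name guard. A minimal sketch, with all names hypothetical and no resemblance claimed to deepagents' or superpowers' actual APIs:

```python
# Minimal sketch of a structured agent loop with an explicit step budget.
# `call_model` and the action dict schema are invented for illustration;
# real frameworks add planning, sub-agents, and persistent state on top.

def run_agent(task, call_model, tools, max_steps=8):
    """Run a tool-calling loop that cannot spin forever."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(history)      # model decides: tool call or final answer
        if action["type"] == "final":
            return action["content"]
        tool = tools.get(action["tool"])
        if tool is None:                  # guard against hallucinated tool names
            history.append({"role": "system",
                            "content": f"unknown tool {action['tool']!r}"})
            continue
        result = tool(**action.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
    return None  # budget exhausted: the caller must handle the stuck case
```

The step budget turns a silent infinite loop into an explicit failure the caller can retry or escalate.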

Grant Calloway

Research Papers
Demystifying Video Reasoning cs.CV

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
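The paper's training-free proof-of-concept, ensembling latent trajectories from identical models run with different random seeds, can be illustrated with a toy NumPy sketch. The `denoise_step` below is a stand-in for one reverse-diffusion step, not the paper's actual model; the cross-seed mixing weights are also an assumption:

```python
import numpy as np

# Toy stand-in for one reverse-diffusion step: pulls the latent toward a
# fixed "answer" as denoising proceeds. A real model predicts this update.
def denoise_step(latent, t):
    target = np.ones_like(latent)
    return latent + 0.5 * (target - latent)

def ensemble_trajectories(shape, num_seeds=4, num_steps=10):
    """Run the same denoiser from several seeded initial latents and
    share information across seeds at every step."""
    rngs = [np.random.default_rng(seed) for seed in range(num_seeds)]
    latents = [rng.standard_normal(shape) for rng in rngs]
    for t in reversed(range(num_steps)):
        latents = [denoise_step(z, t) for z in latents]
        mean = np.mean(latents, axis=0)          # ensemble consensus
        latents = [0.5 * z + 0.5 * mean for z in latents]
    return np.mean(latents, axis=0)
```

The point is structural: each seed explores a different candidate trajectory early, and averaging lets the ensemble converge on a shared answer, mirroring the Chain-of-Steps dynamic the paper describes.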

Efficient Reasoning on the Edge cs.LG

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, but these traces are verbose and stylistically redundant, making them undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy with only a minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
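The dynamic adapter-switching idea, only paying for the reasoning LoRA when a query needs it, reduces to a cheap per-query gate. The keyword gate below is a deliberately crude stand-in (the paper's router would be learned), and all names are invented for illustration:

```python
# Crude illustrative gate: decide per query whether to activate a
# reasoning LoRA adapter or answer with the base model directly.
# The cue list and routing labels are assumptions, not the paper's method.

REASONING_CUES = ("prove", "step by step", "how many", "solve", "why")

def needs_reasoning(prompt: str) -> bool:
    p = prompt.lower()
    return any(cue in p for cue in REASONING_CUES)

def route(prompt: str) -> str:
    """Return which weights to use for this prompt."""
    return "reasoning_adapter" if needs_reasoning(prompt) else "base"
```

Because LoRA weights are small, loading the adapter only on gated queries keeps the common path at base-model latency and memory.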

MessyKitchens: Contact-rich object-level 3D scene reconstruction cs.CV

Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves on previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.
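The non-penetration criterion the abstract evaluates can be illustrated with a toy axis-aligned bounding-box (AABB) overlap test; the paper's actual metric presumably operates on meshes, so this is only a coarse sketch with an invented function name:

```python
# Overlap volume of two axis-aligned boxes, each given as
# (min_xyz, max_xyz) tuples. Zero overlap means no penetration.
def aabb_penetration(box_a, box_b):
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    overlap = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        side = min(hi_a, hi_b) - max(lo_a, lo_b)
        if side <= 0:
            return 0.0          # separated along this axis
        overlap *= side
    return overlap
```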

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K cs.RO

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready, semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.

SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation cs.CV

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users first super-resolve a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, and SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/
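The three keyframe-selection modes the abstract lists (manual specification, codec I-frame extraction, random sampling) amount to three ways of picking frame indices. A hedged sketch, with a fixed interval standing in for real codec I-frame extraction and all parameter names invented:

```python
import random

def select_keyframes(num_frames, mode="interval", interval=12, k=4,
                     manual=None, seed=0):
    """Return sorted keyframe indices under one of three selection modes."""
    if mode == "manual":
        return sorted(manual or [])
    if mode == "interval":                 # crude proxy for I-frame spacing
        return list(range(0, num_frames, interval))
    if mode == "random":
        rng = random.Random(seed)
        return sorted(rng.sample(range(num_frames), k))
    raise ValueError(f"unknown mode: {mode}")
```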

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory cs.CL

Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal that the event calendar accounts for a 58.9% gain over the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
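The core data structure here, subject-verb-object event tuples with resolved date ranges in a queryable calendar, can be sketched in a few lines. Field names and the overlap-query helper are assumptions for illustration, not the paper's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    """One extracted subject-verb-object tuple with a resolved date range."""
    subject: str
    verb: str
    obj: str
    start: date
    end: date

def events_between(calendar, start, end, subject=None):
    """Return events whose range overlaps [start, end], optionally
    restricted to one subject."""
    return [e for e in calendar
            if e.start <= end and e.end >= start
            and (subject is None or e.subject == subject)]

calendar = [
    Event("user", "adopted", "a dog", date(2025, 3, 1), date(2025, 3, 1)),
    Event("user", "traveled to", "Lisbon", date(2025, 6, 10), date(2025, 6, 20)),
]
```

Time-filtered lookups like this are what let an agent answer "what did I do in June?" without re-reading months of raw dialogue.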

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                     Score   tok/s   $/1M
1   GPT-5.4                   57.2    80      $5.63
2   Gemini 3.1 Pro Preview    57.2    113     $4.50
3   GPT-5.3 Codex             54      69      $4.81
4   Claude Opus 4.6           53      60      $10.00
5   Claude Sonnet 4.6         51.7    68      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                        Score
1   Claude Code                  52.9%
2   Junie                        52.1%
3   Claude Opus 4.6              51.7%
4   gpt-5.2-2025-12-11-xhigh     51.7%
5   gpt-5.2-2025-12-11-medium    51.0%