The Inference Report

March 24, 2026

A research paper on selective reuse and adaptive computation in video diffusion landed today with minimal fanfare, but it captures the actual constraint shaping the industry: not capability, but cost of deployment. WorldCache and its methodological siblings across the research clusters reveal that the margin-compression story isn't coming from better models. It's coming from systems that avoid materializing dense intermediate products, preserve domain structure through explicit representation, and coordinate modular components through structured interfaces rather than monolithic end-to-end pipelines. This is the technical foundation beneath everything else happening today.

The competitive structure above that foundation is hardening into a clear shape. Nvidia's donation of Kubernetes drivers for GPU allocation and its release of OpenShell for agent security, alongside AMD shipping observability tools and robotics frameworks, aren't research moves. They're infrastructure plays designed to make it cheaper and faster for enterprises to run AI workloads at scale, which compresses margins for everyone who doesn't own the infrastructure layer. Gimlet Labs raised 80 million dollars specifically to enable inference across Nvidia, AMD, Intel, ARM, Cerebras, and d-Matrix chips simultaneously. That's not competition between vendors. That's the infrastructure winner positioning itself to service everyone. Meanwhile, OpenAI plans to double its workforce to 8,000 by the end of 2026, and Helion is negotiating to sell 12.5% of its power output directly to OpenAI. The capital requirements are compressing the market upward toward whoever can build the infrastructure, secure the power, and navigate regulatory uncertainty simultaneously. SoftBank's 30-billion-dollar bet on OpenAI illustrates where the money actually flows.

Regulatory pressure scatters across jurisdictions with no coherent framework and doesn't constrain the winners. It constrains entry and creates compliance costs that only well-capitalized players can absorb. Elizabeth Warren characterized the Pentagon's decision to label Anthropic a supply-chain risk as retaliation, while Siemens warned Europe against throttling innovation speed for the sake of sovereignty. The DOD creates uncertainty for vendors, senators call for suspending Nvidia AI chip exports to China, Australia expects data centers to serve local communities, and Las Vegas police deploy facial recognition while privacy advocates object. None of these moves actually prevents the consolidation. They accelerate it by raising the cost of compliance.

On the benchmarks, Claude Opus 4.6 jumped from fourth to first on SWE-rebench with 65.3% versus its previous 53%, while DeepSeek-V3.2 vaulted from twenty-first at 37.5% to sixth at 60.9%, a 23.4-point gain that represents the largest movement in the cohort. The divergence between SWE-rebench and Artificial Analysis scores suggests the benchmark itself has tightened or recalibrated. The GitHub trending split reveals pragmatism: one group treats agents as orchestrators for existing systems while the other treats them as autonomous executors in hostile environments. Both are gaining traction, but the infrastructure layer beneath both (LocalAI, Ray, ONNX Runtime) suggests developers are serious about running inference locally or at scale without vendor lock-in.

Grant Calloway

Research Papers
WorldCache: Content-Aware Caching for Accelerated Video World Models cs.CV

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
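The when-to-reuse decision the abstract describes can be illustrated at toy scale. The sketch below is not the paper's implementation: the drift metric, the threshold formula, and the `motion_score` scaling are all simplified stand-ins for WorldCache's saliency-weighted drift estimation and motion-adaptive thresholds.

```python
import math

def feature_drift(current, cached):
    """Relative L2 change between fresh features and the cached snapshot."""
    num = math.sqrt(sum((c - k) ** 2 for c, k in zip(current, cached)))
    den = math.sqrt(sum(k ** 2 for k in cached)) + 1e-8
    return num / den

def run_with_cache(features_per_step, motion_score, base_threshold=0.05):
    """Recompute only when drift exceeds a motion-adaptive threshold.

    Higher motion_score tightens the threshold, forcing more recomputation
    in dynamic scenes; low motion lets the cache be reused longer.
    Returns the number of denoising steps that actually recomputed features.
    """
    threshold = base_threshold / (1.0 + motion_score)
    cached, recomputed = None, 0
    for feats in features_per_step:
        if cached is None or feature_drift(feats, cached) > threshold:
            cached = feats  # full recompute: refresh the cache
            recomputed += 1
        # else: reuse `cached` unchanged (the zero-order-hold fallback)
    return recomputed

# Features that drift slowly across 10 denoising steps.
steps = [[1.0 + 0.01 * t] * 8 for t in range(10)]
static_scene = run_with_cache(steps, motion_score=0.0)   # few recomputes
dynamic_scene = run_with_cache(steps, motion_score=9.0)  # many recomputes
```

The real system also blends and warps cached features rather than reusing them verbatim; the point here is only the adaptive recompute-or-reuse control flow.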

End-to-End Training for Unified Tokenization and Latent Denoising cs.CV

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near-state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256×256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
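UNITE's single-stage recipe, two forward passes through one shared encoder under a joint loss, can be caricatured with a single scalar weight. Everything below (the 1-D "encoder", the inputs, the target, the learning rate) is invented for illustration; only the structure, shared parameters receiving gradients from both a tokenization-style pass and a generation-style pass, mirrors the paper's idea.

```python
def encode(x, w):
    """One shared 'Generative Encoder' weight serving both passes."""
    return w * x

def joint_step(image, noisy_cond, target, w, lr=0.05):
    # Pass 1 (tokenization): infer the latent from the fully observed image.
    z_tok = encode(image, w)
    # Pass 2 (generation): infer the latent from noise plus conditioning.
    z_gen = encode(noisy_cond, w)
    # Joint objective: both passes should land on the same target latent,
    # so gradients from both regimes shape the shared weight together.
    loss = (z_tok - target) ** 2 + (z_gen - target) ** 2
    grad = 2 * (z_tok - target) * image + 2 * (z_gen - target) * noisy_cond
    return w - lr * grad, loss

w, losses = 0.9, []
for _ in range(5):
    w, loss = joint_step(image=2.0, noisy_cond=1.8, target=1.0, w=w)
    losses.append(loss)  # the joint loss shrinks step over step
```

In the real model the encoder is a network and the generation pass feeds a denoising objective, but the gradient-flow pattern is the same: one set of weights, two conditioning regimes, one loss.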

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation cs.CV

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem, where text supervision alone is too sparse to calibrate the newly introduced motion pathway, we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model cs.CV

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
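At its simplest, the dual-temporal pathway reduces to two frame-sampling policies over the same clip: a dense recent window for the JEPA branch and a uniform large-stride sweep for the VLM branch. The helper names and the specific window/stride values below are illustrative, not taken from the paper.

```python
def dense_indices(n_frames, window):
    """Dense JEPA branch: every frame in a short recent window."""
    start = max(0, n_frames - window)
    return list(range(start, n_frames))

def thinker_indices(n_frames, stride):
    """Sparse VLM 'thinker' branch: uniform sampling at a larger stride."""
    return list(range(0, n_frames, stride))

# A 32-frame clip: the dense branch sees the last 8 frames for fine-grained
# motion cues; the sparse branch sees every 8th frame for long-horizon context.
dense = dense_indices(32, window=8)     # [24, 25, ..., 31]
sparse = thinker_indices(32, stride=8)  # [0, 8, 16, 24]
```

The two index sets overlap only at the window boundary, which is the point: the expensive semantic branch covers the whole horizon cheaply while the dense branch stays local.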

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing cs.CV

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain-of-Thought supervised fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
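Scene-graph-level editing is easy to sketch because the model's output is an updated graph, not pixels: an instruction becomes a structured transformation whose spatial coherence can be checked directly. Everything below (the toy scene, the `apply_edit` helper, the fixed offset table) is invented for illustration and is not the paper's model or benchmark.

```python
# Toy scene graph: objects with 2-D centers plus directional relations.
scene = {
    "sofa": {"pos": (4.0, 2.0)},
    "lamp": {"pos": (6.0, 2.0)},
}
relations = [("lamp", "right_of", "sofa")]

def apply_edit(scene, relations, subject, relation, anchor, gap=1.5):
    """Re-place `subject` relative to `anchor` and rewrite its relation.

    Mirrors the structured-reasoning idea at toy scale: edit the graph
    representation so the spatial constraint is satisfied by construction.
    """
    ax, ay = scene[anchor]["pos"]
    offsets = {"left_of": (-gap, 0.0), "right_of": (gap, 0.0)}
    dx, dy = offsets[relation]
    scene[subject]["pos"] = (ax + dx, ay + dy)
    # Drop the subject's stale relations and record the new one.
    new_relations = [r for r in relations if r[0] != subject]
    new_relations.append((subject, relation, anchor))
    return scene, new_relations

# Instruction: "move the lamp to the left of the sofa".
scene, relations = apply_edit(scene, relations, "lamp", "left_of", "sofa")
```

A renderer or IoU checker can then score the updated graph against ground truth, which is roughly how a layout-editing benchmark evaluates outputs.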

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models cs.CV

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.4                 57.2   82     $5.63
2  Gemini 3.1 Pro Preview  57.2   117    $4.50
3  GPT-5.3 Codex           54     65     $4.81
4  Claude Opus 4.6         53     47     $10.00
5  Claude Sonnet 4.6       51.7   54     $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  Gemini 3.1 Pro Preview     62.3%