The Inference Report

March 28, 2026

Wall Street has priced in OpenAI's exit to public markets within 18 months, yet the company is simultaneously retreating from consumer-facing products. OpenAI secured a $40 billion loan from JPMorgan and Goldman Sachs while quietly shutting down Sora, its video generation flagship, amid public backlash over AI-generated content. The move signals a strategic pivot away from flashy generative capabilities toward enterprise infrastructure and developer lock-in, evidenced by the acquisition of Astral, the Python tooling company, and by the repositioning of ChatGPT as a productivity layer for legacy enterprises rather than a breakthrough in model capability. This isn't a technical retreat. It's a capital allocation decision. The money is moving from the flashy to the durable.

The real constraint on AI deployment is no longer funding. It's power, land, and political permission. Memory chip stocks shed $100 billion in value this week as research revealed AI data centers will consume far less RAM than the shortage narrative promised, exposing how speculation inflates infrastructure demand. Simultaneously, an 82-year-old Kentucky woman's refusal to sell land to a data center developer, combined with Senate demands for annual electricity disclosures from data centers, reveals the friction between compute ambition and ground-level resistance. Anthropic won a federal court reprieve against a Pentagon ban while throttling Claude subscriptions during peak hours, a pattern suggesting that regulatory skepticism about AI vendors as national security risks remains unresolved. The financial engineering that funded AI's expansion depends on scarcity and urgency. Once those get measured and monitored, the margin between hype and reality becomes visible.

Labs are competing on deployment narratives rather than model breakthroughs. OpenAI showcased a 230-year-old company where 650 employees use ChatGPT to reshape knowledge work, while Meta advanced SAM 3.1 with multiplexing and global reasoning for real-time video detection and tracking, pushing computer vision inference efficiency for edge deployment. MIRI released a 104-minute documentary positioning the institute's perspective on AI risks as the frame for public discourse. None of these moves require new capability gains to matter; they're about application, integration, and narrative control. On the benchmarking side, Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, with the tier immediately below solidifying around 62 to 64 percent, a sign that capability differences at the top end are narrowing. The research frontier has shifted toward multimodal systems, efficient inference through hierarchical compression, and diagnostic evaluation frameworks that isolate failure modes rather than chase aggregate metrics. GitHub's trending repos reflect pragmatism over novelty: multi-agent orchestration frameworks assume agents will fail and require monitoring, while infrastructure maturation in OCR, voice, and on-device speech processing provides the building blocks that make agentic systems actually deployable.

Grant Calloway

Research Papers
Vega: Learning to Drive with Natural Language Instructions cs.CV

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use separate projection layers for each modality to broaden the model's capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving cs.RO

Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.
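The conditioning scheme the abstract describes can be sketched in a few lines. This is a hypothetical illustration, not the paper's architecture: the dimensions, the toy linear policy, and every name (`plan`, `user_embeddings`) are assumptions. It shows only the core idea of feeding a per-driver embedding alongside scene features so that one policy yields driver-specific plans.

```python
# Hypothetical sketch: condition a planner on a per-driver embedding.
# All dimensions, names, and the toy linear policy are assumptions.
import random

random.seed(0)
EMBED_DIM, SCENE_DIM = 4, 6

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

user_embeddings = {0: rand_vec(EMBED_DIM), 1: rand_vec(EMBED_DIM)}  # learned per driver
weights = [rand_vec(SCENE_DIM + EMBED_DIM) for _ in range(2)]       # toy 2-output policy

def plan(scene, driver_id):
    # Concatenate scene features with the driver's embedding, then
    # apply the shared policy to get e.g. (acceleration, steering).
    x = scene + user_embeddings[driver_id]
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]

scene = rand_vec(SCENE_DIM)
a0 = plan(scene, 0)  # driver 0's plan
a1 = plan(scene, 1)  # same scene, driver 1
print(a0 != a1)  # True: identical scene, different driver, different plan
```

Natural-language instructions would enter the same way in a real system, as an extra conditioning vector appended at planning time.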

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment cs.AI

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.
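The write-back step can be sketched under heavy assumptions. Everything below (`supports_answer`, `distill`, the string-matching heuristics) is an illustrative stand-in for the paper's LLM-based evidence distillation; only the shape of the pipeline follows the abstract: use labeled examples to filter retrieved documents, distill the evidence into a compact unit, and append it to the corpus as offline preprocessing.

```python
# Hypothetical sketch of corpus write-back. The helpers are crude
# stand-ins for learned components, not the paper's implementation.

def supports_answer(doc: str, answer: str) -> bool:
    # Crude evidence check: does the document contain the answer string?
    return answer.lower() in doc.lower()

def distill(docs: list[str], question: str) -> str:
    # Stand-in for LLM distillation: keep sentences sharing terms
    # with the question, then concatenate them into one unit.
    terms = {w for w in question.lower().split() if len(w) > 3}
    kept = []
    for doc in docs:
        for sent in doc.split(". "):
            if terms & set(sent.lower().split()):
                kept.append(sent.strip().rstrip("."))
    return ". ".join(kept) + "."

def write_back(corpus: list[str], examples: list[dict]) -> list[str]:
    # Offline preprocessing: enrich the corpus with distilled units,
    # leaving the original documents untouched.
    enriched = list(corpus)
    for ex in examples:
        evidence = [d for d in ex["retrieved"] if supports_answer(d, ex["answer"])]
        if evidence:
            enriched.append(distill(evidence, ex["question"]))
    return enriched

corpus = ["Paris is the capital of France. It lies on the Seine."]
examples = [{"question": "What is the capital of France?",
             "answer": "Paris",
             "retrieved": corpus}]
new_corpus = write_back(corpus, examples)
print(len(new_corpus))  # 2: the original document plus one distilled unit
```

Because only the corpus changes, the enriched index can then be dropped into any retrieval pipeline unchanged, which is what makes the cross-method transfer result plausible.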

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference cs.CV

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
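The three-partition cache policy can be sketched as follows. This is a minimal illustration under stated assumptions: the partition sizes are arbitrary, and `compress` is a placeholder for the paper's dual-branch compression network, mimicking the 32x token reduction by naive subsampling.

```python
# Hypothetical sketch of a sink / mid / recent KV-cache split.
# Partition sizes and the subsampling "compression" are assumptions.

SINK = 2    # early anchor frames kept at full resolution
RECENT = 4  # most recent frames kept at full resolution

def compress(tokens: list[int], factor: int = 32) -> list[int]:
    # Placeholder for the dual-branch compression network (32x reduction).
    return tokens[::factor]

def partition_cache(frames: list[list[int]]) -> dict:
    """Split frame-token history into sink / mid / recent partitions."""
    if len(frames) <= SINK + RECENT:
        return {"sink": frames[:SINK], "mid": [], "recent": frames[SINK:]}
    return {
        "sink": frames[:SINK],                               # full resolution
        "mid": [compress(f) for f in frames[SINK:-RECENT]],  # heavily compressed
        "recent": frames[-RECENT:],                          # full resolution
    }

# Toy history: 10 frames of 64 tokens each.
history = [list(range(i * 64, (i + 1) * 64)) for i in range(10)]
cache = partition_cache(history)
print(len(cache["sink"]), len(cache["mid"]), len(cache["recent"]))  # 2 4 4
print(len(cache["mid"][0]))  # each mid frame: 64 tokens down to 2
```

The paper's top-k selection and RoPE re-alignment would then operate on the mid partition, which is what bounds the cache while keeping global and local context intact.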

PixelSmile: Toward Fine-Grained Facial Expression Editing cs.CV

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

Back to Basics: Revisiting ASR in the Age of Voice Agents cs.AI

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.
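Factor-isolated evaluation of the kind described amounts to grouping a standard word error rate by (language, condition) cell rather than averaging over everything. The sketch below uses a textbook edit-distance WER; the sample records and field names are illustrative assumptions, not the benchmark's schema.

```python
# Hedged sketch: per-cell WER so degradation can be attributed to a
# specific axis. Record fields ("language", "condition") are assumed.

def wer(ref: str, hyp: str) -> float:
    # Word error rate via Levenshtein distance over words.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def factor_isolated_wer(samples: list[dict]) -> dict:
    # Average WER within each (language, condition) cell instead of
    # reporting one aggregate number.
    cells: dict = {}
    for s in samples:
        key = (s["language"], s["condition"])
        cells.setdefault(key, []).append(wer(s["ref"], s["hyp"]))
    return {k: sum(v) / len(v) for k, v in cells.items()}

samples = [
    {"language": "en", "condition": "clean",
     "ref": "turn on the lights", "hyp": "turn on the lights"},
    {"language": "en", "condition": "noisy",
     "ref": "turn on the lights", "hyp": "turn the light"},
]
report = factor_isolated_wer(samples)
print(report[("en", "clean")], report[("en", "noisy")])  # 0.0 0.5
```

A single aggregate here would report 0.25 and hide that all the error sits in the noisy condition, which is exactly the failure attribution the benchmark is arguing for.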

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                   Score   tok/s   $/1M
1   GPT-5.4                 57.2    81      $5.63
2   Gemini 3.1 Pro Preview  57.2    114     $4.50
3   GPT-5.3 Codex           54      74      $4.81
4   Claude Opus 4.6         53      53      $10.00
5   Claude Sonnet 4.6       51.7    66      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                      Score
1   Claude Opus 4.6            65.3%
2   gpt-5.2-2025-12-11-medium  64.4%
3   GLM-5                      62.8%
4   gpt-5.4-2026-03-05-medium  62.8%
5   Gemini 3.1 Pro Preview     62.3%