The Inference Report

March 25, 2026

The capital and engineering momentum in AI has decisively shifted away from consumer novelty toward the infrastructure that makes AI operational. OpenAI's shutdown of Sora joins earlier closures of consumer products, yet Kleiner Perkins just raised $3.5 billion specifically for AI companies at scale, and Databricks is acquiring startups to build security products around AI agents. The pattern is unmistakable: venture capital and established tech companies are betting on control layers, deployment stacks, and integration into existing workflows rather than on novel consumer experiences. This reflects a market maturing from novelty to operational necessity.

The consolidation extends across the full stack. JetBrains is building a control plane for AI coding agents, Mozilla developers are creating Stack Overflow equivalents for agent failures, and Anthropic is pushing Claude Code toward autonomous execution while maintaining safeguards. Arm is manufacturing its first in-house chip with Meta and OpenAI as early customers, while Agile Robots is embedding Google DeepMind's foundation models into its hardware. The infrastructure layer is consolidating around a few platforms and models. Simultaneously, Zoom is betting its edge lies in capturing interactions across video and meetings, Spotify is adding tools to prevent AI-generated content from being misattributed, and local-first products like Talat are finding traction by keeping data off the cloud. These moves reflect a hard-won lesson: consumer enthusiasm for AI features does not translate to sustainable business models, but operational necessity does.

The competitive battle has shifted fundamentally. The real contest is no longer about which company builds the best chatbot but about who controls the deployment stack, who owns the data flowing through production systems, and who can integrate AI into enterprise software without breaking existing operations. GitHub's Copilot SDK integration into issue triage demonstrates this shift concretely, moving AI tooling from chat interfaces into embedded developer workflows where switching costs are highest. AWS is aggregating third-party models into Bedrock, locking customers into the platform while sidestepping the need to build. AMD is positioning its MI300X and MI355X as practical alternatives for production deployments through targeted optimization documentation. Microsoft is drilling into vertical deployment and manufacturing precision, proving ROI in specific industries where automation directly impacts margins.

Research and benchmarking reinforce this operational focus. Papers cluster around controlled evaluation of failure modes in deployed systems, multimodal fusion under resource constraints, and off-policy learning for improved sample efficiency, all framing evaluation around what specific real-world problems models solve rather than aggregate performance metrics. GitHub's trending repositories show developers building agentic orchestration frameworks like ByteDance's Deer-Flow and TauricResearch's TradingAgents alongside infrastructure essentials like Qdrant's vector database, Trivy's vulnerability scanner, and observability tools. That latter group reflects a shift in where developers think computation should happen: not just in data centers, but distributed, constrained, observable, and sometimes disconnected from the internet entirely. The winners will be companies that solve the coordination, observability, and integration problems that come after the model is trained.

Grant Calloway

AI Labs
From the Wire
Research Papers
MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage cs.CV

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
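The negative-control failure mode the authors highlight suggests scoring detection and false alarms separately, so that a model which cries "anomaly" on every input cannot look accurate. A minimal sketch of such a scorer; the function name and output structure are my own illustration, not the MedObvious harness:

```python
def score_sanity_checks(predictions, labels):
    """Score set-level input validation with negative controls.

    predictions/labels: lists of bools, True meaning "this image set
    contains a coherence violation". Positive cases measure detection;
    negative controls (all panels normal) expose hallucinated anomalies.
    """
    pos = [p == l for p, l in zip(predictions, labels) if l]
    neg = [p == l for p, l in zip(predictions, labels) if not l]
    return {
        "detection_rate": sum(pos) / len(pos) if pos else None,
        "false_alarm_rate": 1 - sum(neg) / len(neg) if neg else None,
    }
```

Reporting the two rates separately is what makes the paper's finding visible: a model can score well on violated sets while still flagging anomalies on clean negative controls.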

Estimating Flow Velocity and Vehicle Angle-of-Attack from Non-invasive Piezoelectric Structural Measurements Using Deep Learning cs.LG

Accurate estimation of aerodynamic state variables such as freestream velocity and angle of attack (AoA) is important for aerodynamic load prediction, flight control, and model validation. This work presents a non-intrusive method for estimating vehicle velocity and AoA from structural vibration measurements rather than direct flow instrumentation such as pitot tubes. A dense array of piezoelectric sensors mounted on the interior skin of an aeroshell captures vibrations induced by turbulent boundary layer pressure fluctuations, and a convolutional neural network (CNN) is trained to invert these structural responses to recover velocity and AoA. Proof-of-concept is demonstrated through controlled experiments in Sandia's hypersonic wind tunnel spanning zero and nonzero AoA configurations, Mach 5 and Mach 8 conditions, and both constant and continuously varying tunnel operations. The CNN is trained and evaluated using data from 16 wind tunnel runs, with a temporally centered held-out interval within each run used to form training, validation, and test datasets and assess intra-run temporal generalization. Raw CNN predictions exhibit increased variance during continuously varying conditions; a short-window moving-median post-processing step suppresses this variance and improves robustness. After post-processing, the method achieves a mean velocity error relative to the low-pass filtered reference velocity below 2.27 m/s (0.21%) and a mean AoA error of 0.44° (8.25%) on held-out test data from the same experimental campaign, demonstrating feasibility of vibration-based velocity and AoA estimation in a controlled laboratory environment.
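The moving-median post-processing the authors describe is a standard variance-suppression trick and fits in a few lines. The window length below is an arbitrary placeholder, not a value taken from the paper:

```python
import numpy as np

def moving_median(preds: np.ndarray, window: int = 11) -> np.ndarray:
    """Smooth a 1-D prediction series with a centered moving median.

    Edge positions use a truncated window so the output has the same
    length as the input. Medians suppress transient spikes that a
    moving mean would smear into neighboring samples.
    """
    half = window // 2
    out = np.empty_like(preds, dtype=float)
    for i in range(len(preds)):
        lo = max(0, i - half)
        hi = min(len(preds), i + half + 1)
        out[i] = np.median(preds[lo:hi])
    return out
```

A median rather than a mean is the natural choice here because the raw CNN outputs show spiky variance during continuously varying tunnel conditions, and a median rejects isolated outliers outright instead of averaging them in.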

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions cs.CV

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text and image tokens, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
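The budgeting idea, a variable number of self-attention layers chosen per sample, can be illustrated with a toy layer-selection helper. Everything here (the layer count, the linear budget rule, the even spacing) is my own illustration of the general mechanism, not VISOR's actual policy:

```python
def select_visual_layers(complexity: float, num_layers: int = 32,
                         max_budget: int = 4) -> list:
    """Pick which decoder layers get full visual self-attention.

    complexity: a per-sample difficulty score in [0, 1] (in VISOR this
    would come from a learned policy; here it is just an input).
    Harder samples get a larger budget, spread evenly through the stack
    so refinement happens at several depths rather than all at once.
    """
    budget = max(1, round(complexity * max_budget))
    step = num_layers / (budget + 1)
    return [round(step * (i + 1)) for i in range(budget)]
```

The point of the sketch is the shape of the mechanism: compute scales with per-sample difficulty, while easy samples fall back to the cheap cross-attention path with a single refinement layer.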

Failure of contextual invariance in gender inference with large language models cs.CL

Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19–52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning cs.CV

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
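The gating idea, accept the cheap planner's answer only when it is clearly separable from the alternatives, can be sketched as a top-two probability margin test. The threshold, function names, and margin formulation below are illustrative assumptions, not the paper's actual answer-separability measure:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def accept_speculative(answer_logits, margin_threshold=0.3):
    """Gate speculative planning on answer separability.

    Accept the lightweight planner's top answer only when the
    probability gap between its two best candidates exceeds the
    threshold; otherwise fall back to the full tool-using agent.
    No oracle labels are needed: the test is self-contained.
    """
    probs = sorted(softmax(answer_logits), reverse=True)
    return (probs[0] - probs[1]) >= margin_threshold
```

The appeal of this family of gates is exactly what the abstract claims: early termination of the expensive tool chain on confident cases, while ambiguous cases still pay the full agentic cost.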

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains cs.SE

Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.
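The fallback half of that multi-provider architecture reduces to trying providers in order until one succeeds. A minimal sketch, assuming each provider is wrapped in a plain callable; all names here are hypothetical, not ReqFusion's API:

```python
def extract_with_fallback(document: str, providers: list) -> str:
    """Try each provider callable in order; return the first success.

    providers: callables taking the document text and returning
    extracted requirements. A production system would also narrow the
    exception types and add per-provider timeouts and retries.
    """
    errors = []
    for call in providers:
        try:
            return call(document)
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

The consensus mechanism the paper also mentions would run several providers on the same document and reconcile their outputs; the fallback chain above is the reliability floor underneath that.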

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                     Score  tok/s  $/1M
1  GPT-5.4                   57.2   79     $5.63
2  Gemini 3.1 Pro Preview    57.2   114    $4.50
3  GPT-5.3 Codex             54     72     $4.81
4  Claude Opus 4.6           53     49     $10.00
5  Claude Sonnet 4.6         51.7   65     $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                       Score
1  Claude Opus 4.6             65.3%
2  gpt-5.2-2025-12-11-medium   64.4%
3  GLM-5                       62.8%
4  gpt-5.4-2026-03-05-medium   62.8%
5  Gemini 3.1 Pro Preview      62.3%