The Inference Report

March 23, 2026

The infrastructure fight has moved beyond chips to the full stack of dependencies that make AI systems actually work. Amazon's Trainium has locked in commitments from Anthropic, OpenAI, and Apple, breaking Nvidia's monopoly grip on compute supply, while Elon Musk's vertical integration play for Tesla and SpaceX signals that even the best-capitalized companies now see owning their own silicon as necessary. Yet the fragility runs deeper than geopolitics. Cursor's reliance on Moonshot AI's Kimi model exposed how quickly foreign infrastructure can become a liability, and the instability of companies like Delve and Nscale, which promised to solve infrastructure problems themselves, reveals a pattern: every layer of the stack is contested, and ownership matters more than capability. Capital is concentrating around speed and focus rather than broad hedging, with Air Street's $232 million fund betting that founders who move faster than incumbents will capture disproportionate value. The real winners will be those who control compute, compliance, and the fabs that others must rent, not the teams shipping the flashiest models.

Research today treats evaluation methodology as a first-class problem rather than an afterthought. Papers on causal generative models with Kolmogorov-Arnold Networks, pixel-grounded tampering detection, and dynamic belief graphs all prioritize transparent mechanisms over black boxes, using factorized representations to expose what models actually learn. Multi-modal work on video understanding agents, surgical vision-language models, and cybersecurity classifiers solves real deployment friction by matching capability to actual resource constraints. But the largest cluster of papers exposes a harder truth: aggregate metrics and single-point estimates mask systematic disagreement about what is being measured. Faithfulness, interpretability, reasoning, and evidence grounding all degrade differently depending on how tasks are framed and whether benchmark conditions reflect the real world. Methodological choices in measurement and sampling determine what conclusions are defensible, and papers increasingly treat this as the core problem, not an implementation detail.

GitHub activity confirms that the innovation frontier has shifted to the plumbing connecting models to the world. Browser-use and deer-flow lead adoption by solving a concrete problem: agents need to interact with systems built for humans, not APIs. Financial trading and penetration testing agents follow the same pattern, wrapping domain expertise into structured action spaces that LLMs can reason over. Retrieval and observability tools like LightRAG, langextract, and Phoenix reflect a maturation cycle where early black-box agent systems are giving way to debuggable, trustworthy ones grounded in actual data. Notably absent are new foundational model releases or architectural breakthroughs. The work happening now is in connecting existing models to tasks and making those connections reliable enough to deploy.

Grant Calloway

AI Labs

No lab headlines.

From the Wire
Research Papers
From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering cs.CV

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level localization correctness, calibrate prediction confidence against true edit intensity, and measure tamper-meaning understanding via semantics-aware classification and natural-language descriptions of the predicted regions. We also re-evaluate strong existing segmentation/localization baselines alongside recent tamper detectors, revealing substantial over- and under-scoring under mask-only metrics and exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meaning, and language, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
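The over-scoring problem the abstract describes is easy to see numerically. The sketch below is a toy illustration with made-up binary maps, not the paper's metric suite: a detector that simply flags the whole object region looks perfect under mask-level scoring but collapses when scored against the pixels that were actually edited.

```python
import numpy as np

def f1(pred, gt):
    """Pixel-level F1 between two binary maps."""
    tp = np.logical_and(pred, gt).sum()
    if tp == 0:
        return 0.0
    prec, rec = tp / pred.sum(), tp / gt.sum()
    return 2 * prec * rec / (prec + rec)

# Ground truth: only a thin 2-pixel edit, sitting inside a larger object.
true_edit = np.zeros((8, 8), dtype=bool)
true_edit[3, 3:5] = True                 # the pixels actually modified

object_mask = np.zeros((8, 8), dtype=bool)
object_mask[2:6, 2:6] = True             # coarse object-level mask label

# A detector that simply predicts the whole object region.
pred = object_mask.copy()

mask_score = f1(pred, object_mask)       # scored against the coarse mask
pixel_score = f1(pred, true_edit)        # scored against true edited pixels
```

The gap between the two scores is exactly the over-scoring the benchmark is built to expose.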

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation cs.CV

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.

MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms cs.LG

Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
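For context, the finite-horizon minimum-energy control that the learned coefficient parameterizes has a closed form in the discrete-time LTI case. The sketch below is an illustrative least-norm solution via the pseudoinverse, not the paper's learned framework: it drives a double integrator (an assumed example system) to a target state in N sampled-data steps.

```python
import numpy as np

def min_energy_sequence(A, B, x0, xT, N):
    """Least-norm input sequence driving x0 -> xT in N steps
    for the discrete LTI system x_{k+1} = A x_k + B u_k."""
    # x_N = A^N x0 + [A^{N-1}B ... A B  B] [u_0; ...; u_{N-1}]
    blocks = [np.linalg.matrix_power(A, N - 1 - k) @ B for k in range(N)]
    C = np.hstack(blocks)
    resid = xT - np.linalg.matrix_power(A, N) @ x0
    u = np.linalg.pinv(C) @ resid        # minimum-energy (least-norm) solution
    return u.reshape(N, -1)

# Double integrator discretized with dt = 0.1.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
x0 = np.zeros(2)
xT = np.array([1.0, 0.0])
U = min_energy_sequence(A, B, x0, xT, N=20)

# Roll the inputs out through the dynamics to verify arrival.
x = x0
for u in U:
    x = A @ x + B @ u
```

The prescribed dynamics are respected by construction, as in the paper's framing: the inputs are only ever applied through the actual system map.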

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking cs.CV

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
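The think-act-observe pattern can be caricatured in a few lines. Below is a toy loop in which all names and the relevance scorer are hypothetical stand-ins, not VideoSeek's actual toolkit: it samples the timeline coarsely, then spends a small budget near the most promising region instead of parsing every frame.

```python
from dataclasses import dataclass, field

@dataclass
class VideoAgent:
    """Toy think-act-observe loop: observe coarsely, think about where
    the evidence is, then act by zooming in on that region."""
    n_frames: int
    observations: dict = field(default_factory=dict)

    def observe(self, idx, score_fn):
        self.observations[idx] = score_fn(idx)

    def run(self, score_fn, coarse_step=100, budget=20):
        # Act: coarse pass over the whole timeline.
        for idx in range(0, self.n_frames, coarse_step):
            self.observe(idx, score_fn)
        # Think: pick the most promising frame seen so far.
        best = max(self.observations, key=self.observations.get)
        # Act again: spend the remaining budget near that region.
        lo = max(0, best - coarse_step // 2)
        hi = min(self.n_frames, best + coarse_step // 2)
        for idx in range(lo, hi, coarse_step // budget):
            self.observe(idx, score_fn)
        return max(self.observations, key=self.observations.get)

# Stubbed relevance scorer: the answer-critical evidence is near frame 1234.
score = lambda i: -abs(i - 1234)
agent = VideoAgent(n_frames=10_000)
found = agent.run(score)
```

Even this crude two-pass policy inspects on the order of a hundred frames out of ten thousand, which is the frame-budget argument the paper makes.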

Kolmogorov-Arnold causal generative models cs.LG

Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high-stakes domains. We propose KaCGM, a causal generative model for mixed-type tabular data where each structural equation is parameterized by a Kolmogorov--Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent--child relationships, while preserving query-agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi-synthetic benchmarks show competitive performance against state-of-the-art methods. A real-world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings. Code: https://github.com/aalmodovares/kacgm
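The query-agnostic generative semantics the paper preserves are the standard SCM ones. The toy below uses hand-written structural equations standing in for KaCGM's KAN-parameterized mechanisms, just to show how the same mechanisms answer both observational and interventional queries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each mechanism is an explicit, inspectable function (KaCGM would
# learn these per-node functions with KANs from observational data).
def sample(n, do_x=None):
    u_x, u_y = rng.normal(size=(2, n))              # exogenous noise
    x = u_x if do_x is None else np.full(n, do_x)   # X := U_x, or do(X = x*)
    y = 2.0 * x + u_y                               # Y := f(X) + U_y
    return x, y

# Same mechanisms, two different queries.
_, y_obs = sample(100_000)                          # observational P(Y)
_, y_do = sample(100_000, do_x=1.0)                 # interventional P(Y | do(X=1))
```

Because the mechanism for Y is written out explicitly, the interventional mean shift (about 2.0 here) can be read directly off the structural equation, which is the auditability argument the paper makes for KAN-parameterized mechanisms.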

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning cs.CR

The use of ML in cybersecurity has long been impaired by generalization issues: Models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim at transferring knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. We set up a case study on threat classification and propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning over baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
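The alignment stage is in the family of InfoNCE-style contrastive objectives. The sketch below is a generic InfoNCE in NumPy over synthetic embeddings; the paper's exact loss, encoders, and data are not specified here. It shows the property such training relies on: correctly paired text/payload embeddings score a much lower loss than unrelated ones.

```python
import numpy as np

def info_nce(anchors, positives, temp=0.1):
    """InfoNCE over L2-normalized embeddings: the i-th anchor should
    match the i-th positive against all other positives in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temp                        # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))     # cross-entropy on matched pairs

rng = np.random.default_rng(1)
text = rng.normal(size=(32, 64))                         # description embeddings
payload_paired = text + 0.1 * rng.normal(size=(32, 64))  # aligned payload embeddings
payload_random = rng.normal(size=(32, 64))               # unrelated payloads

loss_paired = info_nce(text, payload_paired)
loss_random = info_nce(text, payload_random)
```

Minimizing a loss of this shape is what pulls payloads toward the semantically structured text embedding space in the two-stage framework the abstract describes.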

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                     Score   tok/s   $/1M
1   GPT-5.4                   57.2    84      $5.63
2   Gemini 3.1 Pro Preview    57.2    123     $4.50
3   GPT-5.3 Codex             54      68      $4.81
4   Claude Opus 4.6           53      50      $10.00
5   Claude Sonnet 4.6         51.7    65      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                       Score
1   Claude Code                 52.9%
2   Junie                       52.1%
3   Claude Opus 4.6             51.7%
4   gpt-5.2-2025-12-11-xhigh    51.7%
5   gpt-5.2-2025-12-11-medium   51.0%