The Inference Report

March 19, 2026

Nvidia's networking business quietly pulled in $11 billion last quarter while the industry obsessed over frontier model capabilities. The real consolidation story, though, is not about who builds the biggest model but about who controls the layers in between. Compression companies like Multiverse Computing are making those models cheaper and faster by the quarter, undercutting the margin narrative that once justified premium pricing. Microsoft is acquiring entire teams to fold collaboration tools directly into its stack. Snowflake is building Project SnowWork to move from answering questions about data to executing decisions on data. The execution layer, not the model layer, is where defensibility lives and where customer relationships stick.

This fragmentation exposes a structural contradiction: the industry claims to democratize AI while consolidating power across three distinct layers. Model makers are racing to lock in distribution and compute at the top. Compression and optimization companies are undercutting those margins in the middle. And underneath, whoever controls the data, inference infrastructure, and the ability to turn language models into agents that actually execute transactions owns the real leverage. OpenAI's $50 billion AWS deal is a defensive move to prevent Microsoft from controlling all three layers at once. Walmart's decision to embed Sparky directly into ChatGPT and Gemini rather than build a specialized agent shows that integrations matter more than differentiation when underlying models are interchangeable.

The incentive structure is also cracking under scrutiny. Patreon's CEO is calling out the licensing hypocrisy: AI companies claim fair use when training on creator content but license from major publishers when they need quality data. The Department of Defense labeled Anthropic a supply-chain risk because it might refuse to disable safety measures during warfighting, revealing that the real tension is between corporate control and state control, not between safety and capability. Entry-level software developers face a 20% hiring decline and wage pressure as AI adoption accelerates, which means the labor market is already pricing in the assumption that junior talent is less valuable. The infrastructure investment visible in today's research papers and GitHub trends reflects this reality: the focus has shifted from building new capabilities to building middleware, observation tools, and execution layers that lock in customers and control the path from chip to inference to decision-making. The industry is not democratizing anything. It is consolidating power while distributing the cost.

Grant Calloway

AI Labs

AI21 Labs

Mind the gap: Why production AI needs an operating system

AMD

Multi-Node Distributed Inference for Diffusion Models with xDiT

Anthropic

Sydney will become Anthropic’s fourth office in Asia-Pacific

MIRI

Mechanisms to Verify International Agreements about AI Development

NVIDIA

From Simulation to Production: How to Build Robots With AI

OpenAI

OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first

Research Papers

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs (cs.CV)

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
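For readers who want the core idea in code, score-based pruning can be sketched in a few lines. This is a minimal illustration, not the STTS module itself: the scores here are passed in by hand, whereas the paper learns them, and the function name `prune_tokens` and its `keep_ratio` parameter are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence, scores: Sequence[float], keep_ratio: float = 0.5
) -> Tuple[list, List[int]]:
    """Keep the highest-scoring tokens while preserving their original order.

    The scores are assumed to come from some upstream scorer; this sketch
    does not learn them.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    num_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors by position so sequence order is unchanged.
    kept = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept], kept


# Eight placeholder "tokens" with hand-picked importance scores.
tokens = [f"tok{i}" for i in range(8)]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]
pruned, kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)    # [1, 3, 5, 7]
print(pruned)  # ['tok1', 'tok3', 'tok5', 'tok7']
```

Keeping the survivors in their original order matters for any model that relies on positional information, which is why the indices are re-sorted after ranking.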

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (cs.CV)

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse (cs.AI)

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.
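The idea of keeping solved tasks as executable code rather than as text notes can be sketched roughly as follows. This is not AgentFactory's actual API: the `SubagentLibrary` class, the convention that every subagent exposes a `run` function, and the `word_count` example are all hypothetical, chosen only to show the save, refine, and reuse cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SubagentLibrary:
    """Hypothetical store mapping a subagent name to its Python source."""

    sources: Dict[str, str] = field(default_factory=dict)

    def save(self, name: str, source: str) -> None:
        """Store a subagent, or overwrite it with a refined version."""
        # Reject source that does not even parse before saving it.
        compile(source, f"<subagent {name}>", "exec")
        self.sources[name] = source

    def load(self, name: str) -> Callable:
        """Execute the stored source and return its `run` entry point."""
        namespace: dict = {}
        exec(self.sources[name], namespace)
        return namespace["run"]


library = SubagentLibrary()
library.save(
    "word_count",
    '''
def run(text):
    """Count whitespace-separated words in text."""
    return len(text.split())
''',
)
print(library.load("word_count")("reuse beats rewriting"))  # 3
```

Running stored source with `exec` is only reasonable for code you trust; a real system would presumably sandbox it.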

LoST: Level of Semantics Tokenization for 3D Shapes (cs.CV)

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection (cs.SE)

Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis (cs.SE)

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
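Weighted impact analysis over a code-test graph can be illustrated with a small sketch. It is not TDAD's implementation: the graph below is written by hand rather than derived from an AST, and the edge weights, the `max_depth` cutoff, and the rule of multiplying weights along a path are assumptions made for this example. The last lines also check the abstract's 70% figure against the two regression rates it quotes.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Edges point from a piece of code to whatever depends on it,
# with a weight in (0, 1] for how strong the dependency is (assumed).
Graph = Dict[str, List[Tuple[str, float]]]


def rank_impacted_tests(
    graph: Graph, changed: Iterable[str], tests: Set[str], max_depth: int = 3
) -> List[Tuple[str, float]]:
    """Rank tests by the strongest weighted path from any changed node."""
    best: Dict[str, float] = {}
    frontier = [(node, 1.0) for node in changed]
    for node, score in frontier:
        best[node] = score
    for _ in range(max_depth):
        next_frontier = []
        for node, score in frontier:
            for neighbor, weight in graph.get(node, []):
                candidate = score * weight
                # Only follow a path if it beats the best score seen so far.
                if candidate > best.get(neighbor, 0.0):
                    best[neighbor] = candidate
                    next_frontier.append((neighbor, candidate))
        frontier = next_frontier
    ranked = [(t, best[t]) for t in tests if best.get(t, 0.0) > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


graph: Graph = {
    "parse": [("load_config", 0.8), ("test_parse", 1.0)],
    "load_config": [("test_load_config", 0.9), ("test_cli", 0.5)],
}
tests = {"test_parse", "test_load_config", "test_cli", "test_unrelated"}
ranking = rank_impacted_tests(graph, changed=["parse"], tests=tests)
for name, score in ranking:
    print(f"{name}: {score:.2f}")

# Relative reduction in regressions reported in the abstract: 6.08% -> 1.82%.
reduction = (6.08 - 1.82) / 6.08
print(f"{reduction:.0%}")  # 70%
```

In this toy graph, a change to `parse` surfaces three of the four tests, with the direct test ranked first and `test_unrelated` left out entirely.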

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	Speed (tok/s)	Price ($/1M tokens)
1	GPT-5.4	57.2	71	$5.63
2	Gemini 3.1 Pro Preview	57.2	112	$4.50
3	GPT-5.3 Codex	54	66	$4.81
4	Claude Opus 4.6	53	51	$10.00
5	Claude Sonnet 4.6	51.7	56	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Code	52.9%
2	Junie	52.1%
3	Claude Opus 4.6	51.7%
4	gpt-5.2-2025-12-11-xhigh	51.7%
5	gpt-5.2-2025-12-11-medium	51.0%

GitHub Repos

Trending

jarrodwatts/claude-hud

11588 ★

A Claude Code plugin that shows what's happening - context usage, active tools, running agents, and todo progress

obra/superpowers

233678 ★

An agentic skills framework & software development methodology that works.

unslothai/unsloth

63801 ★

Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.

newton-physics/newton

4213 ★

An open-source, GPU-accelerated physics simulation engine built upon NVIDIA Warp, specifically targeting roboticists and simulation researchers.

shadps4-emu/shadPS4

30057 ★

PlayStation 4 emulator for Windows, Linux and macOS written in C++

Daily discovery

xenodium/agent-shell (LLM)

905 ★

A native Emacs buffer to interact with LLM agents powered by ACP

thu-ml/Causal-Forcing (Diffusion Models)

487 ★

Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation"

VIAME/VIAME (Deep Learning)

331 ★

Video and Image Analytics for Multiple Environments

Chevey339/kelivo (Chatbot)

2746 ★

A Flutter LLM chat client. Supports mobile and desktop.

endee-io/endee (Vector Database)

766 ★

Endee.io – A high-performance vector database, designed to handle up to 1B vectors on a single node, delivering significant performance gains through optimized indexing and execution. Also available in the cloud at https://endee.io/