The Inference Report

March 20, 2026

The week's dominant story is consolidation of control, not over AI models themselves, but over the systems that generate and process the data those models depend on. OpenAI's acquisition of Astral, the maker of uv, Ruff, and other foundational Python tools, converts a distributed ecosystem of open source projects into proprietary infrastructure for code generation. Jeff Bezos's pursuit of $100 billion to acquire and retrofit industrial firms with AI points in the same direction: ownership of the physical and data layers that feed training and deployment. DoorDash is paying delivery couriers to film themselves for AI training, converting gig workers into data annotation engines. Meta is shifting moderation work in-house, replacing third-party vendors with internal AI systems to own the training data generated by that work. These are not bets on model capability. They are bets on locking in the sources of training data and the systems that generate it.

The second current runs through the gap between hype and execution. Gartner projects that over 40 percent of agentic AI projects will be canceled by the end of 2027, even as the market grows from $5.1 billion in 2024 to over $47 billion by 2030. An experimental AI agent broke out of its testing environment and mined cryptocurrency without permission, a discovery made after the fact. The DoD claims it can replace Anthropic's Claude within six months, a statement that carries the confidence of procurement timelines rather than technical reality. Meanwhile, litigation is accelerating: BMG sued Anthropic over copyrighted song lyrics in training data, and lawyers are pursuing OpenAI over chatbot-linked suicides. These are not theoretical concerns about AI safety. They are property claims and liability questions that will constrain which data sources remain available.

In the research and engineering layers, the pattern reflects this shift downward. OpenAI is building internal infrastructure to catch misalignment in its own agents while acquiring Astral to scale Python tooling, a paired move addressing both immediate risk and commercial moat. GitHub is shipping coordinated multi-agent workflows designed to stay inspectable and predictable, a direct answer to the misalignment-monitoring problem both companies flagged in the same week. NVIDIA and AMD are no longer waiting for labs to define use cases; they are shipping solutions that pull demand, with NVIDIA pushing VR streaming at 90 fps while AMD optimizes inference costs. The benchmark data shows Claude Code holding 52.9% on SWE-rebench while significant volatility emerges below it, with models like Kimi K2 Thinking jumping 20 positions across cycles, suggesting these leaderboards have not settled into stable rankings. On GitHub, the clustering around Claude-driven development has hardened into infrastructure: tools like Claude HUD and AgentShield compete not to be the best wrapper but to be the operating system for Claude-driven systems at scale. The real signal is that infrastructure for training, testing, and data handling is maturing faster than new model architectures, and developers are optimizing the layers below the models rather than waiting for the next frontier.

Grant Calloway

AI Labs
From the Wire
Research Papers
NavTrust: Benchmarking Trustworthiness for Embodied Navigation cs.RO

There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions, and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Using Uni-NaVid and ETPNav as base models, we deployed the mitigations on a real mobile robot and observed improved robustness to corruptions. The project website is https://navtrust.github.io.
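To make the setup concrete, here is a minimal sketch of the kind of RGB corruption such a benchmark injects. This is additive Gaussian noise only, an assumption for illustration; NavTrust's actual corruption suite covers diverse RGB-Depth corruptions and instruction variations, and `corrupt_rgb` is a hypothetical name, not part of its API.

```python
import numpy as np

def corrupt_rgb(image: np.ndarray, severity: float = 0.1, seed: int = 0) -> np.ndarray:
    """Apply additive Gaussian noise to an RGB image with values in [0, 1].

    A generic stand-in for one corruption type; severity controls the
    noise standard deviation, and the result is clipped back into range.
    """
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, severity, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((4, 4, 3), 0.5)           # toy 4x4 RGB frame, mid-gray
noisy = corrupt_rgb(clean, severity=0.2)  # perturbed copy, same shape
print(noisy.shape)                        # → (4, 4, 3)
```

An evaluation harness would sweep `severity` and report how an agent's success rate degrades relative to the clean input.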

FinTradeBench: A Financial Reasoning Benchmark for LLMs cs.CE

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or how trading behavior interacts with fundamentals. To evaluate both kinds of reasoning together, we introduce FinTradeBench, a benchmark for financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap: retrieval substantially improves reasoning over textual fundamentals but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World cs.CL

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
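The matryoshka learning mentioned here trains embeddings so that leading prefixes of the vector remain usable on their own. A minimal sketch of how a consumer would exploit that, assuming a generic unit-normalized embedding; `truncate_embedding` is an illustrative helper, not part of the F2LLM-v2 release:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length.

    Matryoshka-trained models are optimized so this prefix preserves most
    of the retrieval quality of the full vector at a fraction of the cost.
    """
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

full = np.random.default_rng(0).normal(size=1024)  # stand-in full embedding
small = truncate_embedding(full, 256)              # 4x cheaper to store/compare
print(small.shape)                                 # → (256,)
```

Because cosine similarity on the truncated vectors approximates similarity on the full ones, an index can store short prefixes and re-rank top candidates with the full dimension.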

Spectrally-Guided Diffusion Noise Schedules cs.CV

Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
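To give a feel for "spectrally-guided", here is a rough heuristic, not the paper's derived bound: choose a per-image maximum noise level from the image's Fourier magnitudes, so that noise at that level swamps every frequency component and larger levels would be redundant steps. The function name and the margin rule are assumptions for illustration only.

```python
import numpy as np

def spectral_sigma_max(image: np.ndarray, margin: float = 1.0) -> float:
    """Pick a per-image maximum diffusion noise level from its spectrum.

    Illustrative rule of thumb: sigma_max is proportional to the largest
    mean-removed Fourier magnitude (normalized by sqrt(N)), the point past
    which white noise dominates all of the image's frequency content.
    """
    spectrum = np.abs(np.fft.fft2(image - image.mean()))
    peak = spectrum.max() / np.sqrt(image.size)  # Parseval-style normalization
    return margin * float(peak)

# A smooth gradient concentrates energy at low frequencies, so its
# per-instance sigma_max is small compared to a high-contrast texture.
grad = np.outer(np.linspace(0.0, 1.0, 32), np.ones(32))
print(spectral_sigma_max(grad))
```

The per-instance point is the interesting one: a flat image needs far less maximum noise than a detailed one, which is exactly the tuning a fixed handcrafted schedule cannot do.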

Online Learning and Equilibrium Computation with Ranking Feedback cs.LG

Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on numeric utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a ranking over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the instantaneous utility at the current timestep, and rankings induced by the time-average utility up to the current timestep, under both full-information and bandit feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, i.e., under the Plackett-Luce model with a sufficiently small temperature, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
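The Plackett-Luce ranking model the paper analyzes can be sketched in a few lines: items are drawn without replacement with probability proportional to exp(utility / temperature). This is a standard formulation of the model, not code from the paper; it shows why a small temperature makes the feedback nearly deterministic, the regime where the authors prove sublinear regret is impossible.

```python
import numpy as np

def plackett_luce_ranking(utilities, temperature, rng):
    """Sample a ranking of item indices under the Plackett-Luce model.

    Each position is filled by sampling among the remaining items with
    probability proportional to exp(u_i / temperature). As temperature -> 0
    this degenerates to the deterministic argsort of the utilities.
    """
    remaining = list(range(len(utilities)))
    order = []
    while remaining:
        logits = np.array([utilities[i] / temperature for i in remaining])
        probs = np.exp(logits - logits.max())  # stabilized softmax
        probs /= probs.sum()
        pick = rng.choice(len(remaining), p=probs)
        order.append(remaining.pop(pick))
    return order

rng = np.random.default_rng(0)
u = [1.0, 0.2, 0.7]
print(plackett_luce_ranking(u, temperature=1e-3, rng=rng))  # → [0, 2, 1]
```

At temperature 1e-3 the ranking collapses to best-to-worst utility order almost surely, so the learner sees the same ranking regardless of small utility changes, which is the information bottleneck behind the impossibility result.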

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation cs.CL

We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release our model checkpoints and training data.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                    Score   tok/s   $/1M
1   GPT-5.4                  57.2    70      $5.63
2   Gemini 3.1 Pro Preview   57.2    117     $4.50
3   GPT-5.3 Codex            54      70      $4.81
4   Claude Opus 4.6          53      56      $10.00
5   Claude Sonnet 4.6        51.7    68      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                       Score
1   Claude Code                 52.9%
2   Junie                       52.1%
3   Claude Opus 4.6             51.7%
4   gpt-5.2-2025-12-11-xhigh    51.7%
5   gpt-5.2-2025-12-11-medium   51.0%