The Inference Report

April 5, 2026

Anthropic is learning that owning a platform comes with costs that raw model capability cannot offset. The company's introduction of paid Claude Code access, followed immediately by charges for third-party tool integration, has created a tiered monetization structure that taxes users for the very workflows the product was designed to enable. Meanwhile, leaked Claude Code materials bundled with malware suggest the tool's internals are either more exposed than the company has disclosed or are being weaponized broadly enough to warrant immediate scrutiny. The UK government's courtship of Anthropic to expand in London adds another dimension: the company gains negotiating leverage across three markets at once, able to point to international commitments when facing US regulators while extracting subsidies and diversifying political risk. Anthropic must now play capital and jurisdictions against one another while its products are still being reverse-engineered and its pricing model stress-tested by users.

The broader competitive reality is no longer about model performance but about the layers that convert models into products. NVIDIA is positioning itself as infrastructure for physical AI through simulation and foundation models. AI21 Labs is staking a claim on systems-level engineering between models and deployment. Both moves reflect a shared understanding: value now accrues to whoever controls the parts of the stack that matter to customers, not to frontier capability alone. The shift is evident across infrastructure development on GitHub, where developers are building practical orchestration frameworks like Microsoft's agent-framework and block/goose to handle state management, tool execution, and error recovery across different LLM backends. Onyx packages chat and RAG into a self-hosted platform that works with any LLM, positioning itself as the open alternative to proprietary systems.
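
The state-management and error-recovery concerns these orchestration frameworks address can be sketched in a few lines. The tool, retry policy, and state container below are hypothetical simplifications for illustration, not the API of agent-framework or goose:

```python
import time

def run_tool(tool, args, max_retries=3, backoff=0.0):
    """Execute a tool call with simple retry-based error recovery."""
    for attempt in range(1, max_retries + 1):
        try:
            return {"status": "ok", "result": tool(**args)}
        except Exception as exc:
            if attempt == max_retries:
                return {"status": "error", "error": str(exc)}
            time.sleep(backoff * attempt)  # linear backoff between retries

class AgentState:
    """Minimal conversation state, independent of any LLM backend."""
    def __init__(self):
        self.messages = []
    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

# Hypothetical tool: fails once with a transient error, then succeeds.
calls = {"n": 0}
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient backend error")
    return f"results for {query!r}"

state = AgentState()
state.add("user", "find docs")
outcome = run_tool(flaky_search, {"query": "find docs"})
state.add("tool", outcome)
print(outcome["status"])  # recovered after one retry
```

Real frameworks layer model-driven tool selection and persistence on top, but the loop of execute, catch, retry, and record into shared state is the common core.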

The local-first movement reinforces the pattern. MLX-VLM enables fine-tuning vision language models on Mac hardware without cloud dependencies. AutoRAG and dstack solve real operational friction: evaluating and optimizing RAG pipelines in production, and provisioning GPU compute across fragmented hardware without vendor lock-in. Developers are voting with their forks for systems they can run, modify, and own, particularly where cloud pricing or vendor lock-in created friction. Information retrieval research mirrors this pragmatism, moving away from universal improvements toward domain-specific challenges. Hybrid architectures combining sparse retrieval, dense embeddings, and neural reranking outperform single-method baselines on heterogeneous documents. Careful engineering of retrieval pipelines can partially compensate for smaller language models in specialized domains, though model scale remains essential for complex reasoning. Across infrastructure, deployment tools, and research, the bottleneck has shifted from model quality to infrastructure maturity and operational control.

Grant Calloway

Research Papers — Focused
UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems cs.IR

In recent years, the scaling laws of recommendation models have attracted increasing attention; these laws govern the relationship between performance and the parameters/FLOPs of recommenders. Currently, there are three mainstream architectures for achieving scaling in recommendation models, namely attention-based, TokenMixer-based, and factorization-machine-based methods, which exhibit fundamental differences in both design philosophy and architectural structure. In this paper, we propose a unified scaling architecture for recommendation systems, namely UniMixer, to improve scaling efficiency and establish a unified theoretical framework covering the mainstream scaling blocks. By transforming the rule-based TokenMixer into an equivalent parameterized structure, we construct a generalized parameterized feature-mixing module that allows the token-mixing patterns to be optimized and learned during model training. This generalized parameterization also removes the TokenMixer constraint that the number of heads must equal the number of tokens. Furthermore, we establish a unified scaling-module design framework for recommender systems, which bridges the connections among attention-based, TokenMixer-based, and factorization-machine-based methods. To further boost scaling ROI, we design a lightweight mixing module, UniMixing-Lite, which compresses model parameters and computational cost while significantly improving model performance. Extensive offline and online experiments verify the superior scaling abilities of UniMixer.
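
The abstract's core move, replacing a fixed token-mixing rule with a learnable mixing matrix whose head count is decoupled from the token count, can be illustrated with a toy numpy sketch. All shapes and the head-combination step are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 6, 8, 3   # tokens, feature dim, heads (H need not equal T)

X = rng.normal(size=(T, D))          # token embeddings

# Rule-based TokenMixer: a fixed, hand-designed mixing pattern,
# e.g. uniform averaging over tokens.
M_fixed = np.full((T, T), 1.0 / T)

# Generalized parameterized mixing: M becomes a trainable parameter,
# one T-by-T mixing matrix per head, learned during training.
M_learned = rng.normal(size=(H, T, T)) / np.sqrt(T)

Y_fixed = M_fixed @ X                # (T, D): fixed rule-based mixing
Y_heads = M_learned @ X              # (H, T, D): per-head learned mixing
Y = Y_heads.mean(axis=0)             # combine heads (one simple choice)

print(Y_fixed.shape, Y.shape)        # (6, 8) (6, 8)
```

The point of the parameterization is visible in the shapes: the fixed rule is one special point in the space of mixing matrices, while the learned version can have any number of heads and any mixing pattern the data favors.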

Aligning Recommendations with User Popularity Preferences cs.IR

Popularity bias is a pervasive problem in recommender systems, where recommendations disproportionately favor popular items. This not only results in "rich-get-richer" dynamics and a homogenization of visible content, but can also lead to misalignment of recommendations with individual users' preferences for popular or niche content. This work studies popularity bias through the lens of user-recommender alignment. To this end, we introduce Popularity Quantile Calibration, a measurement framework that quantifies misalignment between a user's historical popularity preference and the popularity of their recommendations. Building on this notion of popularity alignment, we propose SPREE, an inference-time mitigation method for sequential recommenders based on activation steering. SPREE identifies a popularity direction in representation space and adaptively steers model activations based on an estimate of each user's personal popularity bias, allowing both the direction and magnitude of steering to vary across users. Unlike global debiasing approaches, SPREE explicitly targets alignment rather than uniformly reducing popularity. Experiments across multiple datasets show that SPREE consistently improves user-level popularity alignment while preserving recommendation quality.
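
A minimal sketch of the activation-steering idea: estimate a popularity direction from the difference of mean activations, then shift each user's activations along it by a per-user amount. The data, dimensions, and exact steering rule below are hypothetical illustrations, not SPREE's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # hidden dimension of the recommender

# Hypothetical activations collected from popular- vs niche-item contexts.
h_popular = rng.normal(0.5, 1.0, size=(200, D))
h_niche = rng.normal(-0.5, 1.0, size=(200, D))

# Popularity direction: difference of class means, normalized.
d = h_popular.mean(axis=0) - h_niche.mean(axis=0)
d /= np.linalg.norm(d)

def steer(h, user_bias, target=0.0, gain=1.0):
    """Shift an activation along the popularity direction.

    user_bias is an estimate of this user's personal popularity bias;
    both the sign and magnitude of the shift adapt per user, pushing
    the activation toward the target alignment.
    """
    alpha = gain * (target - user_bias)
    return h + alpha * d

h = rng.normal(size=D)
h_steered = steer(h, user_bias=0.8)   # over-popular user: steer toward niche
# The projection onto d decreases for a user biased toward popular items.
print(h_steered @ d < h @ d)
```

This is what distinguishes the approach from global debiasing: a user with user_bias below the target would be steered in the opposite direction, toward more popular items.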

ReFormeR: Learning and Applying Explicit Query Reformulation Patterns cs.IR

We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition, to name a few. As such, our proposed approach makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.

From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents cs.IR

Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents with mixed text-and-table content. We evaluate retrieval quality via Recall@k, MRR, and nDCG, and end-to-end generation quality via Number Match, with paired bootstrap significance testing. Our results show that (1) a two-stage pipeline combining hybrid retrieval with neural reranking achieves Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin; (2) BM25 outperforms state-of-the-art dense retrieval on financial documents, challenging the common assumption that semantic search universally dominates; and (3) query expansion methods (HyDE, multi-query) and adaptive retrieval provide limited benefit for precise numerical queries, while contextual retrieval yields consistent gains. We provide ablation studies on fusion methods and reranker depth, actionable cost-accuracy recommendations, and release our full benchmark code.
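
The fusion stage of such a two-stage hybrid pipeline is commonly implemented with reciprocal rank fusion, sketched below for a BM25 ranking and a dense-embedding ranking. The doc IDs are invented and RRF with k=60 is one standard choice, not necessarily the paper's; a cross-encoder reranker would then reorder the fused top-k:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge ranked lists from sparse (BM25)
    and dense retrievers into a single hybrid ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7", "d2"]   # lexical ranking
dense_top = ["d1", "d7", "d4", "d3"]  # embedding ranking
fused = rrf([bm25_top, dense_top])
print(fused[:3])  # d1, near the top of both lists, leads the fusion
```

Because RRF uses only ranks, it needs no score normalization across the two retrievers, which is much of its practical appeal on heterogeneous corpora.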

Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models cs.IR

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
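
Task-aware routing of the kind described can be approximated, at its simplest, by dispatching on surface features of the query. The rules and strategy names below are invented for illustration; the paper's router and strategy set are not specified here:

```python
def route(query):
    """Toy task-aware router: pick a retrieval strategy from surface
    features of the query (a real system would learn this mapping)."""
    q = query.lower()
    if any(w in q for w in ("compare", "versus", "across")):
        return "multi_document"
    if any(w in q for w in ("gene", "protein", "disease", "drug")):
        return "biomedical"
    if q.startswith(("summarize", "compress")):
        return "compression"
    return "single_document"

STRATEGIES = {
    "single_document": "dense retrieval over full-text passages",
    "multi_document": "retrieve per sub-question, then merge evidence",
    "biomedical": "domain index plus structured metadata lookup",
    "compression": "no retrieval; compress the provided text",
}

q = "Compare the two attention variants across benchmarks"
print(route(q))  # multi_document
```

The routing decision is cheap relative to retrieval itself, which is why task-aware dispatch pairs well with compact models: the savings go into running the right specialized pipeline rather than a bigger generator.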

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges cs.IR

Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. Finally, we outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                    Score   tok/s   $/1M
1   GPT-5.4                  57.2    75      $5.63
2   Gemini 3.1 Pro Preview   57.2    122     $4.50
3   GPT-5.3 Codex            54      82      $4.81
4   Claude Opus 4.6          53      48      $10.00
5   Claude Sonnet 4.6        51.7    51      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                       Score
1   Claude Opus 4.6             65.3%
2   gpt-5.2-2025-12-11-medium   64.4%
3   GLM-5                       62.8%
4   gpt-5.4-2026-03-05-medium   62.8%
5   Gemini 3.1 Pro Preview      62.3%