The Inference Report

April 5, 2026

Anthropic is learning that owning a platform comes with costs that raw model capability cannot offset. The company's introduction of paid Claude Code access, followed immediately by charges for third-party tool integration, has created a tiered monetization structure that taxes users for the very workflows the product was designed to enable. Meanwhile, leaked Claude Code materials bundled with malware suggest the tool's internals are either more exposed than the company has disclosed or are being weaponized broadly enough to warrant immediate scrutiny. The UK government's courtship of Anthropic to expand in London adds another dimension: the company gains negotiating leverage across three markets at once, able to point to international commitments when facing US regulators while extracting subsidies and diversifying political risk. Anthropic must now play capital and jurisdictions against one another while its products are still being reverse-engineered and its pricing model stress-tested by users.

The broader competitive reality is no longer about model performance but about the layers that convert models into products. NVIDIA is positioning itself as infrastructure for physical AI through simulation and foundation models. AI21 Labs is staking a claim on systems-level engineering between models and deployment. Both moves reflect a shared understanding: value now accrues to whoever controls the parts of the stack that matter to customers, not to frontier capability alone. The shift is evident across infrastructure development on GitHub, where developers are building practical orchestration frameworks like Microsoft's agent-framework and block/goose to handle state management, tool execution, and error recovery across different LLM backends. Onyx packages chat and RAG into a self-hosted platform that works with any LLM, positioning itself as the open alternative to proprietary systems.
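
The state-management and error-recovery concerns these orchestration frameworks address can be sketched in a few lines. The tool, retry policy, and state container below are hypothetical simplifications for illustration, not the API of agent-framework or goose:

```python
import time

def run_tool(tool, args, max_retries=3, backoff=0.0):
    """Execute a tool call with simple retry-based error recovery."""
    for attempt in range(1, max_retries + 1):
        try:
            return {"status": "ok", "result": tool(**args)}
        except Exception as exc:
            if attempt == max_retries:
                return {"status": "error", "error": str(exc)}
            time.sleep(backoff * attempt)  # linear backoff between retries

class AgentState:
    """Minimal conversation state, independent of any LLM backend."""
    def __init__(self):
        self.messages = []
    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

# Hypothetical tool: fails once with a transient error, then succeeds.
calls = {"n": 0}
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient backend error")
    return f"results for {query!r}"

state = AgentState()
state.add("user", "find docs")
outcome = run_tool(flaky_search, {"query": "find docs"})
state.add("tool", outcome)
print(outcome["status"])  # recovered after one retry
```

Real frameworks layer model-driven tool selection and persistence on top, but the loop of execute, catch, retry, and record into shared state is the common core.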

The local-first movement reinforces the pattern. MLX-VLM enables fine-tuning vision language models on Mac hardware without cloud dependencies. AutoRAG and dstack solve real operational friction: evaluating and optimizing RAG pipelines in production, and provisioning GPU compute across fragmented hardware without vendor lock-in. Developers are voting with their forks for systems they can run, modify, and own, particularly where cloud pricing or vendor lock-in created friction. Information retrieval research mirrors this pragmatism, moving away from universal improvements toward domain-specific challenges. Hybrid architectures combining sparse retrieval, dense embeddings, and neural reranking outperform single-method baselines on heterogeneous documents. Careful engineering of retrieval pipelines can partially compensate for smaller language models in specialized domains, though model scale remains essential for complex reasoning. Across infrastructure, deployment tools, and research, the bottleneck has shifted from model quality to infrastructure maturity and operational control.

Grant Calloway

Research Papers — Focused
UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems cs.IR

In recent years, the scaling laws of recommendation models have attracted increasing attention; these laws govern the relationship between performance and the parameters/FLOPs of recommenders. Currently, there are three mainstream architectures for achieving scaling in recommendation models, namely attention-based, TokenMixer-based, and factorization-machine-based methods, which exhibit fundamental differences in both design philosophy and architectural structure. In this paper, we propose a unified scaling architecture for recommendation systems, namely UniMixer, to improve scaling efficiency and establish a unified theoretical framework covering the mainstream scaling blocks. By transforming the rule-based TokenMixer into an equivalent parameterized structure, we construct a generalized parameterized feature-mixing module that allows the token-mixing patterns to be optimized and learned during model training. This generalized parameterization also removes the TokenMixer constraint that the number of heads must equal the number of tokens. Furthermore, we establish a unified scaling-module design framework for recommender systems, which bridges the connections among attention-based, TokenMixer-based, and factorization-machine-based methods. To further boost scaling ROI, we design a lightweight mixing module, UniMixing-Lite, which compresses model parameters and computational cost while significantly improving model performance. Extensive offline and online experiments verify the superior scaling abilities of UniMixer.
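
The abstract's core move, replacing a fixed token-mixing rule with a learnable mixing matrix whose head count is decoupled from the token count, can be illustrated with a toy numpy sketch. All shapes and the head-combination step are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 6, 8, 3   # tokens, feature dim, heads (H need not equal T)

X = rng.normal(size=(T, D))          # token embeddings

# Rule-based TokenMixer: a fixed, hand-designed mixing pattern,
# e.g. uniform averaging over tokens.
M_fixed = np.full((T, T), 1.0 / T)

# Generalized parameterized mixing: M becomes a trainable parameter,
# one T-by-T mixing matrix per head, learned during training.
M_learned = rng.normal(size=(H, T, T)) / np.sqrt(T)

Y_fixed = M_fixed @ X                # (T, D): fixed rule-based mixing
Y_heads = M_learned @ X              # (H, T, D): per-head learned mixing
Y = Y_heads.mean(axis=0)             # combine heads (one simple choice)

print(Y_fixed.shape, Y.shape)        # (6, 8) (6, 8)
```

The point of the parameterization is visible in the shapes: the fixed rule is one special point in the space of mixing matrices, while the learned version can have any number of heads and any mixing pattern the data favors.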

Aligning Recommendations with User Popularity Preferences cs.IR

Popularity bias is a pervasive problem in recommender systems, where recommendations disproportionately favor popular items. This not only results in "rich-get-richer" dynamics and a homogenization of visible content, but can also lead to misalignment of recommendations with individual users' preferences for popular or niche content. This work studies popularity bias through the lens of user-recommender alignment. To this end, we introduce Popularity Quantile Calibration, a measurement framework that quantifies misalignment between a user's historical popularity preference and the popularity of their recommendations. Building on this notion of popularity alignment, we propose SPREE, an inference-time mitigation method for sequential recommenders based on activation steering. SPREE identifies a popularity direction in representation space and adaptively steers model activations based on an estimate of each user's personal popularity bias, allowing both the direction and magnitude of steering to vary across users. Unlike global debiasing approaches, SPREE explicitly targets alignment rather than uniformly reducing popularity. Experiments across multiple datasets show that SPREE consistently improves user-level popularity alignment while preserving recommendation quality.
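
A minimal sketch of the activation-steering idea: estimate a popularity direction from the difference of mean activations, then shift each user's activations along it by a per-user amount. The data, dimensions, and exact steering rule below are hypothetical illustrations, not SPREE's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # hidden dimension of the recommender

# Hypothetical activations collected from popular- vs niche-item contexts.
h_popular = rng.normal(0.5, 1.0, size=(200, D))
h_niche = rng.normal(-0.5, 1.0, size=(200, D))

# Popularity direction: difference of class means, normalized.
d = h_popular.mean(axis=0) - h_niche.mean(axis=0)
d /= np.linalg.norm(d)

def steer(h, user_bias, target=0.0, gain=1.0):
    """Shift an activation along the popularity direction.

    user_bias is an estimate of this user's personal popularity bias;
    both the sign and magnitude of the shift adapt per user, pushing
    the activation toward the target alignment.
    """
    alpha = gain * (target - user_bias)
    return h + alpha * d

h = rng.normal(size=D)
h_steered = steer(h, user_bias=0.8)   # over-popular user: steer toward niche
# The projection onto d decreases for a user biased toward popular items.
print(h_steered @ d < h @ d)
```

This is what distinguishes the approach from global debiasing: a user with user_bias below the target would be steered in the opposite direction, toward more popular items.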

ReFormeR: Learning and Applying Explicit Query Reformulation Patterns cs.IR

We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition, to name a few. As such, our proposed approach makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.

From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents cs.IR

Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents with mixed text-and-table content. We evaluate retrieval quality via Recall@k, MRR, and nDCG, and end-to-end generation quality via Number Match, with paired bootstrap significance testing. Our results show that (1) a two-stage pipeline combining hybrid retrieval with neural reranking achieves Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin; (2) BM25 outperforms state-of-the-art dense retrieval on financial documents, challenging the common assumption that semantic search universally dominates; and (3) query expansion methods (HyDE, multi-query) and adaptive retrieval provide limited benefit for precise numerical queries, while contextual retrieval yields consistent gains. We provide ablation studies on fusion methods and reranker depth, actionable cost-accuracy recommendations, and release our full benchmark code.
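
The fusion stage of such a two-stage hybrid pipeline is commonly implemented with reciprocal rank fusion, sketched below for a BM25 ranking and a dense-embedding ranking. The doc IDs are invented and RRF with k=60 is one standard choice, not necessarily the paper's; a cross-encoder reranker would then reorder the fused top-k:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge ranked lists from sparse (BM25)
    and dense retrievers into a single hybrid ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7", "d2"]   # lexical ranking
dense_top = ["d1", "d7", "d4", "d3"]  # embedding ranking
fused = rrf([bm25_top, dense_top])
print(fused[:3])  # d1, near the top of both lists, leads the fusion
```

Because RRF uses only ranks, it needs no score normalization across the two retrievers, which is much of its practical appeal on heterogeneous corpora.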

Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models cs.IR

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
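
Task-aware routing of the kind described can be approximated, at its simplest, by dispatching on surface features of the query. The rules and strategy names below are invented for illustration; the paper's router and strategy set are not specified here:

```python
def route(query):
    """Toy task-aware router: pick a retrieval strategy from surface
    features of the query (a real system would learn this mapping)."""
    q = query.lower()
    if any(w in q for w in ("compare", "versus", "across")):
        return "multi_document"
    if any(w in q for w in ("gene", "protein", "disease", "drug")):
        return "biomedical"
    if q.startswith(("summarize", "compress")):
        return "compression"
    return "single_document"

STRATEGIES = {
    "single_document": "dense retrieval over full-text passages",
    "multi_document": "retrieve per sub-question, then merge evidence",
    "biomedical": "domain index plus structured metadata lookup",
    "compression": "no retrieval; compress the provided text",
}

q = "Compare the two attention variants across benchmarks"
print(route(q))  # multi_document
```

The routing decision is cheap relative to retrieval itself, which is why task-aware dispatch pairs well with compact models: the savings go into running the right specialized pipeline rather than a bigger generator.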

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges cs.IR

Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. Finally, we outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                    Score   tok/s   $/1M
1   GPT-5.4                  57.2    75      $5.63
2   Gemini 3.1 Pro Preview   57.2    122     $4.50
3   GPT-5.3 Codex            54      82      $4.81
4   Claude Opus 4.6          53      48      $10.00
5   Claude Sonnet 4.6        51.7    51      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                       Score
1   Claude Opus 4.6             65.3%
2   gpt-5.2-2025-12-11-medium   64.4%
3   GLM-5                       62.8%
4   gpt-5.4-2026-03-05-medium   62.8%
5   Gemini 3.1 Pro Preview      62.3%