The Inference Report

April 24, 2026

The infrastructure arms race is now consuming the balance sheets of the companies that claim to be building it. Meta is cutting 10 percent of its workforce to offset $135 billion in data center spending this year, while Microsoft commits $140 billion to AI investment and OpenAI, xAI, and peers plan data centers that would emit 129 million tons of greenhouse gases annually. This is not growth capital deployed strategically into products with revenue models. This is survival spending, the cost of staying in a game where the entry fee keeps rising and the winner remains unclear. The spending is also accelerating consolidation: smaller builders are being acquired by larger ones, and the companies spending the most on infrastructure are the ones that can afford to cut payroll and still outspend everyone else. Capital concentration and margin compression are pushing toward a two- or three-player market in foundation models, with everyone else building on top or fighting for scraps in narrower verticals.

The actual products being built on top of this infrastructure reveal why the spending feels mandatory. Everyone is launching agents simultaneously because the competitive window feels like it's closing. OpenAI released GPT-5.5 and workspace agents in ChatGPT. Microsoft added hosted agents to Foundry Agent Service. Google launched both an updated Gemini Enterprise app and the Gemini Enterprise Agent Platform on the same day. Anthropic's Mythos Preview has spooked financial institutions enough that UK banks are seeking access. Yet the speed of deployment is outpacing governance. An enterprise reviewing a LangChain-based research agent in preproduction still faces the problem that autonomous agents are not stable software artifacts, yet authorization frameworks treat them as if they were. Developers are adopting tools that could replace them while simultaneously worrying about displacement. The productivity gains are real and measurable. The anxiety is proportional.

OpenAI is consolidating its position as the primary vendor of production AI agents by shipping GPT-5.5 directly into Codex, its application layer for knowledge work automation, while simultaneously ensuring that layer runs on NVIDIA's infrastructure. Rather than compete on model weights alone, OpenAI is bundling model capability with workflow orchestration, automations, plugins, skills, and structured task execution, which raises switching costs for customers. NVIDIA's public embrace of Codex running on GB200 systems signals that the infrastructure vendor sees agent frameworks as the real margin driver. Meanwhile, Hugging Face's focus on browser-based transformer inference via Chrome extensions points toward a different vector: moving model execution to the edge and away from centralized inference, which could fragment the cloud-based agent stack that OpenAI and NVIDIA are building. The announcements collectively reveal a market sorting into layers, with model vendors securing inference infrastructure partnerships, application vendors building stickiness through workflow automation, and infrastructure players ensuring they own the hardware dependency. Competition is happening at integration points, not at the model level alone.

On GitHub, the trending list reveals a decisive split between two categories of developer effort: infrastructure for AI agents and tools that make those agents actually useful at scale. The agent-building layer is consolidating around concrete implementations rather than framework abstractions. Cline and similar autonomous coding agents now come with context-window optimization built in, which addresses a real constraint: LLM context is expensive and agents generate noise. Skill libraries like VoltAgent's collection of 1000+ agent skills acknowledge that agents need domain knowledge packaged as callable tools. The discovery layer shows where harder problems still live. Data annotation and curation remain foundational, while LocalAI's positioning as a hardware-agnostic inference engine reflects a practical reality: developers want to run models locally, without GPU dependencies, to cut cost and latency. Smaller repos like abliterix and fim-ai/fim-one point to where the research frontier is: not whether agents can work, but how to make them predictable, steerable, and efficient. What's conspicuously absent from the trending list is another wave of general-purpose frameworks. The market has decided those are solved problems.

Grant Calloway

Research Papers — Focused
Online Survival Analysis: A Bandit Approach under Cox PH Model stat.ML

Survival analysis is a widely used statistical framework for modeling time-to-event data under censoring. Classical methods, such as the Cox proportional hazards (Cox PH) model, offer a semiparametric approach to estimating the effects of covariates on the hazard function. Despite its importance, survival analysis has been largely unexplored in online settings, particularly within the bandit framework, where decisions must be made sequentially to optimize treatments as new data arrive over time. In this work, we take an initial step toward integrating survival analysis into a purely online learning setting under the Cox PH model, addressing key challenges including staggered entry, delayed feedback, and right censoring. We adapt three canonical bandit algorithms to balance exploration and exploitation, with theoretical guarantees of sublinear regret bounds. Extensive simulations and semi-real experiments using SEER cancer data demonstrate that our approach enables rapid and effective learning of near-optimal treatment policies.
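The Cox PH model at the core of this framework scores covariates through the partial likelihood, which the regret guarantees ultimately lean on. A minimal pure-Python sketch of the negative log partial likelihood for a single covariate (an illustration under simplifying assumptions, not the paper's implementation; ties are ignored):

```python
import math

def neg_log_partial_likelihood(beta, times, events, covariates):
    """Negative log partial likelihood for a one-covariate Cox PH model.

    times: observed time per subject (event or censoring time),
    events: 1 if the event was observed, 0 if right-censored,
    covariates: one scalar covariate per subject.
    """
    nll = 0.0
    for t_i, d_i, x_i in zip(times, events, covariates):
        if not d_i:
            # Censored subjects contribute only through risk sets below.
            continue
        # Risk set: everyone still under observation just before t_i.
        risk = sum(math.exp(beta * x_j)
                   for t_j, x_j in zip(times, covariates) if t_j >= t_i)
        nll -= beta * x_i - math.log(risk)
    return nll
```

With beta = 0 every event term reduces to the log of the risk-set size, which gives a quick sanity check on the bookkeeping.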

Properties and limitations of geometric tempering for gradient flow dynamics stat.ML

We consider the problem of sampling from a probability distribution $π$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $π$. We consider the effect of replacing $π$ with a sequence of moving targets $(π_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows. We show that convergence occurs exponentially in continuous time, providing novel bounds in both cases. We also consider popular time discretisations and explore their convergence properties. We show that in the Fisher--Rao case, replacing the target distribution with a geometric mixture of the initial and target distributions never leads to a convergence speed-up, in either continuous or discrete time. Finally, we explore the gradient flow structure of tempered dynamics and derive novel adaptive tempering schedules.
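Geometric tempering replaces the target $π$ with $π_t \propto π_0^{1-λ_t} π^{λ_t}$. When both endpoints are Gaussian, the tempered path stays Gaussian, so the schedule can be inspected in closed form. A sketch under that Gaussian assumption (my parameterization, not the paper's):

```python
def geometric_temper(m0, v0, m1, v1, lam):
    """Mean and variance of the geometric mixture
    N(m0, v0)^(1-lam) * N(m1, v1)^lam (after normalization).

    A geometric mixture of Gaussians is again Gaussian: precisions
    (inverse variances) interpolate linearly, and the mean is the
    precision-weighted average of the two means.
    """
    prec = (1 - lam) / v0 + lam / v1
    v = 1.0 / prec
    m = v * ((1 - lam) * m0 / v0 + lam * m1 / v1)
    return m, v
```

At lam = 0 the initial distribution is recovered exactly and at lam = 1 the target, so the whole tempering schedule is just a path in (mean, variance) space.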

Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms stat.ML

In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used as the reference measure by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
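The chaining argument is easy to see on a finite model set: because Gibbs weights multiply, passing client $k$'s Gibbs measure to client $k+1$ as its reference measure reproduces the centralized Gibbs measure over the pooled losses. A toy sketch with a common regularization factor lam and made-up per-client losses (the paper scales the regularization with local sample sizes; this fixes it for simplicity):

```python
import math

def gibbs(reference, loss, lam):
    """Gibbs measure over a finite model set: P(m) ~ reference[m] * exp(-loss[m]/lam)."""
    w = {m: reference[m] * math.exp(-loss[m] / lam) for m in reference}
    z = sum(w.values())
    return {m: wm / z for m, wm in w.items()}

models = ["A", "B", "C"]
prior = {m: 1.0 / 3 for m in models}
lam = 1.0
# Hypothetical cumulative losses each client observes for each candidate model.
client_losses = [{"A": 2.0, "B": 1.0, "C": 3.0},
                 {"A": 0.5, "B": 2.5, "C": 1.0}]

# Decentralized pass: each client's Gibbs measure is the next client's reference.
measure = prior
for losses in client_losses:
    measure = gibbs(measure, losses, lam)
decentralized = measure

# Centralized baseline: one Gibbs measure against the pooled losses.
pooled = {m: sum(l[m] for l in client_losses) for m in models}
centralized = gibbs(prior, pooled, lam)
```

The two measures agree exactly because the exponentials of the per-client losses multiply into the exponential of their sum before normalization.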

Efficient Symbolic Computations for Identifying Causal Effects stat.ML

Determining identifiability of causal effects from observational data under latent confounding is a central challenge in causal inference. For linear structural causal models, identifiability of causal effects is decidable through symbolic computation. However, standard approaches based on Gröbner bases become computationally infeasible beyond small settings due to their doubly exponential complexity. In this work, we study how to practically use symbolic computation for deciding rational identifiability. In particular, we present an efficient algorithm that provably finds the lowest degree identifying formulas. For a causal effect of interest, if there exists an identification formula of a prespecified maximal degree, our algorithm returns such a formula in quasi-polynomial time.
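For intuition on low-degree identifying formulas, the instrumental-variable configuration is the textbook linear SCM with latent confounding in which the causal effect is identified by a degree-one rational formula, cov(z, y) / cov(z, x). A simulation sketch (illustrative only, not the paper's algorithm; all coefficients are made up):

```python
import random

random.seed(0)

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

n = 100_000
lam_true = 1.5                      # causal effect of x on y to recover
z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)         # instrument, independent of the confounder
    ui = random.gauss(0, 1)         # latent confounder of x and y
    xi = 0.8 * zi + ui + random.gauss(0, 0.3)
    yi = lam_true * xi + ui + random.gauss(0, 0.3)
    z.append(zi); x.append(xi); y.append(yi)

# Degree-one rational identifying formula for this graph.
lam_hat = cov(z, y) / cov(z, x)
```

Naive regression of y on x would be biased by the confounder u; the ratio of covariances with the instrument is not, which is exactly the kind of low-degree formula a symbolic search would return here.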

On Bayesian Softmax-Gated Mixture-of-Experts Models stat.ML

Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.
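The softmax-gated model class is concrete enough to write down directly: the gate maps the input to mixture weights, and each expert contributes one Gaussian component. A 1-D sketch (my toy parameterization, not the paper's model):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def moe_density(y, x, gate_w, gate_b, expert_w, expert_b, sigma):
    """p(y | x) for a 1-D Gaussian mixture of experts with softmax gating.

    Gate k has logit gate_w[k]*x + gate_b[k]; expert k predicts
    N(expert_w[k]*x + expert_b[k], sigma^2).
    """
    gates = softmax([w * x + b for w, b in zip(gate_w, gate_b)])
    norm = sigma * math.sqrt(2 * math.pi)
    dens = 0.0
    for g, w, b in zip(gates, expert_w, expert_b):
        mu = w * x + b
        dens += g * math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / norm
    return dens
```

Because the gates are a softmax, they sum to one for every input, so p(y | x) is a proper density in y, a property any posterior contraction analysis starts from.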

Sparse Network Inference under Imperfect Detection and its Application to Ecological Networks stat.ML

Recovering latent structure from count data has received considerable attention in network inference, particularly when one seeks both cross-group interactions and within-group similarity patterns in bipartite networks, which are widely used in ecological research. Such networks are often sparse and inherently imperfect in their detection. Existing models mainly focus on interaction recovery, while the induced similarity graphs are much less studied. Moreover, sparsity is often not controlled and scale is unbalanced, leading to oversparse or poorly rescaled estimates that degrade structural recovery. To address these issues, we propose a framework for structured sparse nonnegative low-rank factorization with detection probability estimation. We impose nonconvex $\ell_{1/2}$ regularization on the latent similarity and connectivity structures to promote sparsity in within-group similarity and cross-group connectivity with better relative scale. The resulting optimization problem is nonconvex and nonsmooth. To solve it, we develop an ADMM-based algorithm with adaptive penalization and scale-aware initialization, and establish asymptotic feasibility and KKT stationarity of cluster points under mild regularity conditions. Experiments on synthetic and real-world ecological datasets demonstrate improved recovery of latent factors and similarity/connectivity structure relative to existing baselines.
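The paper's $\ell_{1/2}$-regularized ADMM is involved, but the nonnegative low-rank factorization it builds on can be sketched in a few lines. Here is the plain unregularized rank-1 case via alternating least squares (an illustration only: no sparsity penalty and no detection-probability model):

```python
def rank1_nmf(V, iters=50):
    """Rank-1 nonnegative factorization V ~ outer(w, h) by alternating
    least squares. With entrywise-nonnegative V, both closed-form
    updates stay nonnegative, so no projection step is needed.
    """
    m, n = len(V), len(V[0])
    w, h = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Fix h, solve the least-squares problem for w in closed form.
        hh = sum(v * v for v in h)
        w = [sum(V[i][j] * h[j] for j in range(n)) / hh for i in range(m)]
        # Fix w, solve for h.
        ww = sum(v * v for v in w)
        h = [sum(V[i][j] * w[i] for i in range(m)) / ww for j in range(n)]
    return w, h
```

The paper's contribution sits on top of this skeleton: structured sparsity on the factors, a detection-probability layer for imperfect observation, and ADMM to handle the resulting nonconvex, nonsmooth objective.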

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.5                 60.2       0  $11.25
2  Claude Opus 4.7         57.3      58  $10.00
3  Gemini 3.1 Pro Preview  57.2     132  $4.50
4  GPT-5.4                 56.8      80  $5.63
5  Kimi K2.6               53.9     123  $1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  GLM-5.1                    62.7%