The Inference Report

May 24, 2026

Today's developments reveal a market learning to price what can be measured and to control what cannot. The pattern spans from consulting to fan engagement to infrastructure, but it breaks down entirely when applied to shared reality itself.

IBM's Ferrari partnership and McKinsey's pricing crisis describe the same mechanism from opposite angles. When AI makes cognitive work repeatable and transparent, hourly fees collapse because clients can now see what the work actually costs to produce. Consulting firms face outcome-based pricing not because AI is better at strategy but because AI makes strategy auditable. Ferrari's superfandom play works similarly: AI extracts higher engagement value from existing audiences by making personalization measurable and scalable. Both stories are about margin compression among incumbents as their work becomes commodified. Elon Musk's shift from solar toward natural gas and orbital data centers follows the same logic, except applied to infrastructure: capital flows toward systems that lock in control rather than distribute it. These are market efficiencies redistributing rents, not creating them.

The synthetic media problem sits in a different category. Unlike consulting or engagement, AI-generated imagery threatens the foundational assumption that images carry evidentiary weight, a commons that courts, newsrooms, and citizens have relied on for decades. That cannot be repriced. The first three stories describe existing players learning to compete under transparency. The fourth describes the destruction of something that has no market substitute: shared agreement on what happened.

This tension shows up in the technical research and benchmark data as well. Across statistical methods and code-solving benchmarks, the focus has shifted from point estimates under ideal conditions toward structured estimation under realistic violations, from raw capability toward operational reliability. GitHub's trending repositories cluster around the same constraint: raw LLM coding ability is insufficient. The work now is knowledge graphs for code, skills pre-structuring, and agent orchestration that turns one-shot outputs into repeatable systems. Developers are building in layers, not replacing wholesale. The infrastructure question is no longer whether agents can code but whether they can code reliably at scale, which is a different kind of problem entirely.

Grant Calloway

AI LabsAll labs

No lab headlines.

From the WireAll feeds
Research Papers — FocusedAll papers
Causal Discovery in Structural VAR Models Under Equal Noise Variance stat.ME

Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance. Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative. We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.

Controlling False Discovery in Arbitrarily Structured Hypothesis Spaces via Reproducing Kernels stat.ME

Large-scale hypothesis testing is central to modern science, where controlling the False Discovery Rate (FDR) has become the standard approach to managing false positives across many simultaneous tests. Hypotheses rarely exist in isolation; they often exhibit structure through proximity, connectivity, or hierarchy. This structure represents both a challenge and an opportunity: while classical methods treat these dependencies as obstacles requiring conservative correction, leveraging them can substantially increase discovery power. Here, we reframe structured FDR control as a regularized learning problem. By optimizing within a suitable Reproducing Kernel Hilbert Space (RKHS), we introduce a framework that unifies continuous domains, graphs, and hierarchies under a single algorithm through kernel choice alone. This formulation enables smooth solutions in place of the piecewise-constant fits of prior methods, principled likelihood-based hyperparameter selection rather than heuristic tuning, and inference at unobserved locations which in turn supports sample-efficient experimental design. Building on this estimator, we provide two decision rules which we prove to control the FDR. We validate our method on two sources: spatial locations derived from high-dimensional real-world datasets, and a differential gene expression task utilizing protein-protein interaction graphs.

Estimating Item Difficulty with Large Language Models as Experts stat.ME

Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined 6 domains of primary-school mathematics, with empirical difficulty estimates treated as empirical reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.

Wahkon: A Statistically Principled Deep RKHS Superposition Network stat.ME

Deep learning excels at prediction but often lacks finite-sample guarantees and calibrated uncertainty; RKHS (Reproducing Kernel Hilbert Space)-based methods provide those guarantees but struggle to adapt in high dimensions. We propose Wahkon, a deep RKHS superposition network that unifies Kolmogorov's superposition principle with RKHS regularization in the smoothing-spline tradition of Wahba. This yields a finite-dimensional deep representer theorem that makes training tractable and provides explicit layerwise complexity control. We show the penalized estimator is exactly the MAP (maximum a posteriori) estimate under a hierarchical Gaussian-process prior, extending the spline/GP duality to deep compositions. Using metric-entropy arguments, we establish minimax-optimal convergence rates under mild smoothness and clarify how depth and width trade off with regularity. Empirically, Wahkon outperforms multilayer perceptrons, Neural Tangent Kernels, and Kolmogorov--Arnold Networks across simulation benchmarks and a single-cell CITE-seq study. By unifying Kolmogorov's superposition principle with RKHS regularization, Wahkon delivers accuracy, interpretability, and statistical rigor in a single framework.

Rethinking external validation for the target population: Capturing patient-level similarity with a generative model stat.ME

Background: External validation is essential for assessing the transportability of predictive models. However, its interpretation is often confounded by differences between external and development populations. This study introduces a framework to distinguish model deficiencies from case-mix effects. Method: We propose a framework that quantifies each external patient's similarity to the development data and measures performance in subgroups with varying levels of alignment to the development distribution. We use generative models, specifically autoencoders, to estimate similarity, offering a more flexible alternative to traditional linear approaches and enabling validation without sharing the original development data. The utility of autoencoder-based similarity measure is demonstrated using synthetic data, and the framework's application is illustrated using data from the Netherlands Heart Registration (NHR) to predict mortality after transcatheter aortic valve implantation. Results: Our framework revealed substantial variation in model performance across similarity-defined subgroups, differences that remain hidden under conventional external validation yet can meaningfully alter conclusions. In several settings, conventional external validation suggested poor overall performance. However, after accounting for differences in patient characteristics, for some sub-groups, the model performance was consistent with internal validation results. Conversely, apparently acceptable overall performance could mask clinically relevant performance deficits in specific subgroups. Conclusion: The proposed framework enhances the interpretability of external validation by linking model performance to population alignment with the development data. This provides a more principled basis for deciding whether a model is transportable and to which patients it can be safely applied.

A Topological Sorting Criterion for Random Causal Directed Acyclic Graphs stat.ME

Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simulations in the literature feature settings where this yields an excellent proxy for the causal order, and show that a strict increase of relatives along the causal order leads to a singular Markov equivalence class. We propose sampling time-series DAGs as a possible alternative and discuss implications for causal discovery algorithms and their evaluation on synthetic data.

BenchmarksFull tables
Artificial AnalysisIntelligence Index

Composite score across coding, math, and reasoning

#ModelScoretok/s$/1M
1GPT-5.560.266$11.25
2Claude Opus 4.757.355$10.94
3Gemini 3.1 Pro Preview57.2130$4.50
4GPT-5.456.898$5.63
5Qwen3.7 Max56.6203$3.75
SWE-rebench

Agentic coding on real-world software engineering tasks

#ModelScore
1Claude Opus 4.665.3%
2gpt-5.2-2025-12-11-medium64.4%
3GLM-562.8%
4Junie62.8%
5gpt-5.4-2026-03-05-medium62.8%
Trending