The Inference Report

May 24, 2026

Today's developments reveal a market learning to price what can be measured and to control what cannot. The pattern spans from consulting to fan engagement to infrastructure, but it breaks down entirely when applied to shared reality itself.

IBM's Ferrari partnership and McKinsey's pricing crisis describe the same mechanism from opposite angles. When AI makes cognitive work repeatable and transparent, hourly fees collapse because clients can now see what the work actually costs to produce. Consulting firms face outcome-based pricing not because AI is better at strategy but because AI makes strategy auditable. Ferrari's superfandom play works similarly: AI extracts higher engagement value from existing audiences by making personalization measurable and scalable. Both stories are about margin compression among incumbents as their work becomes commodified. Elon Musk's shift from solar toward natural gas and orbital data centers follows the same logic, except applied to infrastructure: capital flows toward systems that lock in control rather than distribute it. These are market efficiencies redistributing rents, not creating them.

The synthetic media problem sits in a different category. Unlike consulting or engagement, AI-generated imagery threatens the foundational assumption that images carry evidentiary weight, a commons that courts, newsrooms, and citizens have relied on for decades. That cannot be repriced. The first three stories describe existing players learning to compete under transparency. The fourth describes the destruction of something that has no market substitute: shared agreement on what happened.

This tension shows up in the technical research and benchmark data as well. Across statistical methods and code-solving benchmarks, the focus has shifted from point estimates under ideal conditions toward structured estimation under realistic violations, from raw capability toward operational reliability. GitHub's trending repositories cluster around the same constraint: raw LLM coding ability is insufficient. The work now is knowledge graphs for code, skills pre-structuring, and agent orchestration that turns one-shot outputs into repeatable systems. Developers are building in layers, not replacing wholesale. The infrastructure question is no longer whether agents can code but whether they can code reliably at scale, which is a different kind of problem entirely.

Grant Calloway

AI Labs

No lab headlines.

From the Wire

Research Papers: Focused

Verifying formulas for interventional distributions stat.ME

We formalize verification in causal graphical models: deciding whether a given observational formula identifies a target interventional distribution. This opens a problem complementary to identification, asking not whether any identifying formula exists, but whether the given formula is identifying. We show that even sound and complete solutions to identification do not solve verification. We propose a falsifier as a first practical route forward, prove that it induces an almost-surely correct verifier for regular exponential-family models, and use the resulting verifier to develop the gateway test, which finds all sets admissible for use in a front-door formula.
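For orientation, the classical front-door formula that the abstract's gateway test refers to is usually written as below. The notation is an assumption, not taken from the abstract: $X$ is the treatment, $Y$ the outcome, and $Z$ a candidate mediator set. Verification, as described above, asks whether a given right-hand side of this kind actually equals the left-hand side for the causal graph at hand.

```latex
P(y \mid do(x)) \;=\; \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```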

Observation-Level Watermarking and Detection for Tabular Data stat.ME

With the development of generative AI, watermarking techniques have been widely used to detect the authenticity of AI-generated data and protect the rights of users and creators. While watermarking is already well established for image and text data, watermarking tabular data remains under-explored. Existing methods primarily focus on numerical data, leaving discrete, categorical, and mixed data less studied. In this work, we propose STAMP (Single-observation Tabular Attribution and Marking Procedure), a novel framework for watermarking tabular data that can accommodate and preserve a wide range of distributions. We also develop a corresponding detection mechanism, which can reliably identify watermarks even when the sample size is as small as one. We establish theoretical guarantees for asymptotic consistency and detection accuracy. Finally, through extensive simulation studies and two real-data applications, we demonstrate that the proposed method is effective and robust to subsetting, while maintaining data fidelity and a high detection rate.

tsbootstrap: Distribution-Free Uncertainty Quantification and Conformal Prediction for Time Series stat.ME

Finance, sensing, and demand streams violate the exchangeability that IID conformal prediction and the IID bootstrap assume, and existing libraries implement either a general resampling engine or conformal calibration without the other. tsbootstrap provides block, residual, sieve, and wild resampling, classical bootstrap confidence intervals, and adaptive conformal calibrators (EnbPI, ACI, NexCP, AgACI) through a single typed API in which a specification object selects each method. In a controlled coverage study the IID bootstrap undercovers sharply under dependence; dependence-aware methods reduce the coverage deficit, with the sieve bootstrap coming nearest to nominal under short-memory linear dependence. On the shared fixed-statistic path a compiled backend runs several times faster than arch, and a streaming reduce avoids materializing the $O(Bn)$ replicate tensor, limiting peak extra memory to $O(B)$ for the statistic array. The software is MIT licensed (v0.6.1).
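The streaming idea in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not tsbootstrap's API: the function names `moving_block_bootstrap_means` and `percentile_interval` are hypothetical, the statistic is fixed to the sample mean, and a plain moving block bootstrap stands in for the library's resamplers. Each replicate is reduced to a single number and then discarded, so only the length-$B$ statistic array is kept.

```python
import numpy as np


def moving_block_bootstrap_means(x, block_len, n_boot, seed=0):
    """Bootstrap the sample mean with a moving block bootstrap.

    Hypothetical helper for illustration. Each replicate is built,
    reduced to its mean, and dropped, so the (n_boot, n) replicate
    matrix is never materialized.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(x)")
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_len)  # ceil(n / block_len)
    offsets = np.arange(block_len)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Draw block start positions, expand to indices, trim to length n.
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]).ravel()[:n]
        stats[b] = x[idx].mean()  # streaming reduce: keep one number only
    return stats


def percentile_interval(stats, level=0.90):
    """Equal-tailed percentile interval from bootstrap statistics."""
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi
```

Calling `percentile_interval(moving_block_bootstrap_means(series, block_len=10, n_boot=2000))` on a dependent series gives a rough interval for the mean. Resampling contiguous blocks rather than single points is what preserves short-range dependence; the block length of 10 here is an arbitrary choice for the example.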

Recovering Latent Structures after Variational Bayesian Variable Selection: Fit Assessment and Factor-Number Selection in Partially Exploratory Factor Analysis stat.ME

In partially exploratory factor analysis (PEFA), the loading structure and factor numbers are weakly specified. The regularized variational approximation for partially confirmatory factor analysis (PCFA VA) recovers this structure via Bayesian variable selection, using spike-and-slab priors to assign inclusion probabilities to unspecified loadings. This research introduces a post-selection assessment framework for this approach. We convert converged solutions into covariance models using either hard selection (thresholding probabilities into a sparse pattern) or soft selection (retaining them as weights for effective parameter counts). We derive the resulting degrees of freedom, absolute fit diagnostics (RMSEA, SRMR, CFI, TLI), and relative criteria (AIC, BIC, ELBO). To determine factor numbers, we propose a scale-free gain rule with a sustained-drop guard. Simulations show that absolute indices successfully track loading recovery and flag under-factoring. While the raw criteria over-factor, our gain rule accurately recovers the true dimensionality, with the ELBO variant proving most robust. Finally, a 100-item PID-5 example demonstrates that our model fits better than a confirmatory 25-facet model and concordantly recovers major structures across disjoint specifications.

An Experimental Design Approach to Evaluating Agentic AI's Autonomous Model Discovery stat.ME

Large language model coding agents increasingly perform open-ended data modeling and analysis. These agents are stochastic and adaptive, and therefore their autonomous model discovery behavior cannot be adequately characterized by a single benchmark run. In this work, we propose an experimental design and analysis framework for systematically evaluating this discovery process, quantifying its variability, and identifying important factors. The proposed framework treats these agents as stochastic model-discovery operators, which map task-specific discovery data and an optimization target to a fitted model. Specifically, we investigate two such operators, Codex and Claude Code, under controlled experimental factors including the agent's reasoning effort, the task, the optimization metric, and the composition of the training data. For each agent-task-metric combination, regression models are fit and inference is conducted for multiple responses such as output quality, dollar cost, wall-clock time, and process complexity. Furthermore, we develop a utility-aligned canonical decomposition to characterize the dominant direction of the reasoning-effort effect and to assess whether that direction aligns with a performance-cost utility direction. The proposed framework is demonstrated on a testbed of networked word-forming games, with insightful findings on reasoning effort with respect to cost and process complexity.

Significance-First Splitting: Aligning Treatment Heterogeneity Detection with Honest Estimation stat.ME

Estimating heterogeneous treatment effects (CATE) requires simultaneously detecting effect modification and quantifying estimation uncertainty. Existing tree-based methods make an uneasy trade-off: significance-based approaches (Radcliffe and Surry 2011) identify subgroup interactions directly but lack valid inference; honest causal trees (Athey and Imbens 2016) deliver nominal confidence interval coverage but use outcome-agnostic splitting criteria that sacrifice interaction sensitivity. We introduce a hybrid algorithm that fuses significance-based splitting with honest sample-splitting and cross-validation. Our splitting criterion uses the squared $t$-statistic for the treatment $\times$ side interaction ($t^2$), which is shown to be directly aligned with the honest $\text{EMSE}_\tau$ criterion when the interaction is strong. Post-hoc honest cross-validation selects the cost-complexity penalty, giving a single principled estimator with nominal CI coverage at the leaf level. For forests, we retain bootstrap count vectors to enable an infinitesimal jackknife (IJ) variance estimate of Monte-Carlo convergence rather than formal pointwise inference. On the three synthetic designs from Athey and Imbens (2016), the single tree achieves approximately 90% leaf-average CI coverage at the 90% nominal level (200 replications each); on the Criteo and Starbucks uplift datasets we match the Qini coefficient performance of S- and T-learner baselines. An open-source Python package with reproducible seeds, sklearn-compatible API, and full test coverage accompanies this work (https://codeberg.org/hadjipantelis/rattus).
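A rough sketch of the splitting criterion may help. It is written under stated assumptions and is not the rattus package's API: the names `interaction_t2` and `best_split` are hypothetical, the treatment is binary, and the $t$-statistic comes from the saturated linear model with treatment, side, and their interaction, which reduces to a comparison of four cell means with a pooled variance. Honest sample-splitting and the cross-validated pruning described in the abstract are not shown.

```python
import numpy as np


def interaction_t2(y, w, right):
    """Squared t-statistic for the treatment x side interaction.

    Hypothetical helper. y: outcomes, w: binary treatment indicator,
    right: True for units on the right side of a candidate split.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=bool)
    right = np.asarray(right, dtype=bool)
    cells = [y[right & w], y[right & ~w], y[~right & w], y[~right & ~w]]
    if any(c.size < 2 for c in cells):
        return 0.0  # not enough data in some cell to score this split
    means = [c.mean() for c in cells]
    # Difference in treatment effects between the two sides.
    effect_diff = (means[0] - means[1]) - (means[2] - means[3])
    sse = sum(((c - m) ** 2).sum() for c, m in zip(cells, means))
    sigma2 = sse / (y.size - 4)  # pooled residual variance, 4 parameters
    var = sigma2 * sum(1.0 / c.size for c in cells)
    if var == 0.0:
        return 0.0 if effect_diff == 0.0 else float("inf")
    return effect_diff ** 2 / var


def best_split(y, w, x):
    """Scan thresholds on one feature; return (threshold, t2) maximizing t2."""
    x = np.asarray(x, dtype=float)
    best_thr, best_t2 = None, 0.0
    for thr in np.unique(x)[:-1]:  # split is x <= thr versus x > thr
        t2 = interaction_t2(y, w, x > thr)
        if t2 > best_t2:
            best_thr, best_t2 = thr, t2
    return best_thr, best_t2
```

The point of scoring splits this way is that a large $t^2$ means the treatment effect itself differs across the two sides, not merely the outcome level, which is the sensitivity the abstract says outcome-agnostic criteria give up.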

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M
1	GPT-5.5	60.2	66	$11.25
2	Claude Opus 4.7	57.3	55	$10.94
3	Gemini 3.1 Pro Preview	57.2	130	$4.50
4	GPT-5.4	56.8	98	$5.63
5	Qwen3.7 Max	56.6	203	$3.75

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	Junie	62.8%
5	gpt-5.4-2026-03-05-medium	62.8%

GitHub Repos

Trending

Lum1104/Understand-Anything

43640 ★

Graphs that teach > graphs that impress. Turn any code into an interactive knowledge graph you can explore, search, and ask questions about. Works with Claude Code, Codex, Cursor, Copilot, Gemini CLI, and more.

anthropics/claude-plugins-official

30987 ★

Official, Anthropic-managed directory of high quality Claude Code Plugins.

colbymchenry/codegraph

26469 ★

Pre-indexed code knowledge graph for Claude Code — fewer tokens, fewer tool calls, 100% local

rohitg00/ai-engineering-from-scratch

33431 ★

Learn it. Build it. Ship it for others.

Fincept-Corporation/FinceptTerminal

24066 ★

FinceptTerminal is a modern finance application offering advanced market analytics, investment research, and economic data tools, designed for interactive exploration and data-driven decision-making in a user-friendly environment.

Daily discovery

Fr-e-d/GAAI-framework (Autonomous Agents)

142 ★

Turns AI coding tools into reliable software delivery systems. Drop a .gaai/ folder into any project — Discovery defines what to build, Delivery executes autonomously until criteria pass. Works with Claude Code, Codex CLI, Gemini CLI, Cursor, and more. No SDK. No package. Markdown + YAML + bash.

apache/hamilton (MLOps)

2503 ★

Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

screenpipe/screenpipe (Computer Vision)

18872 ★

screenpipe turns your computer into a personal AI that knows everything you've done. record. search. automate. all local, all private, all yours.

tensorflow/tensorflow (Neural Network)

195911 ★

An Open Source Machine Learning Framework for Everyone

alibaba/ROLL (RLHF)

3205 ★

An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models