The Inference Report

May 25, 2026

The gap between deployment speed and institutional readiness has become the defining feature of AI's move into production. Google acknowledges it is learning AI security in real time rather than having solved it beforehand. Amazon is selling wearables that collect ambient audio, betting users will accept the privacy tradeoff for convenience. A San Francisco nonprofit is replacing volunteer labor with robots to prepare meals, solving one logistics problem while raising a new question: what gets automated when human capacity fails. Robotaxis are being tested in actual traffic because simulation cannot reveal how real drivers will behave around them. The ECB called an emergency meeting with banks after discovering that recent AI models exposed previously unknown or ignored vulnerabilities in financial systems. The pattern is consistent: builders are deploying AI into wearables, autonomous vehicles, financial infrastructure, and labor workflows before the risks are fully mapped. Institutions are reacting rather than leading.

This acceleration is visible in how developers are organizing around AI. The dominant trend on GitHub is interest not in models but in the infrastructure that makes models useful. Code understanding and agent tooling lead, with repositories like Understand-Anything and CodeGraph converting source code into queryable knowledge graphs that reduce token overhead when working with Claude Code and other editors. The CLAUDE.md approach encodes behavioral guidance as configuration rather than fine-tuning. Multica, Pi, and the Claude plugin directories reflect a market settling on how to deploy coding agents as persistent, composable workers that track state and accumulate skills. Vertical specialization is emerging through repositories like Kronos, which targets the language of financial markets, and cybersecurity skills collections, which offer pre-built knowledge patterns for specific domains. Vector databases and RAG engines have consolidated as the canonical layer between documents and LLM reasoning, with Weaviate and RAGFlow representing the mature end of that market.

Beneath both trends sits a fundamental mismatch: companies and developers are moving at the speed of implementation while regulators, researchers, and risk frameworks operate at the speed of validation. The research literature reflects this gap, with methodological work centered on recovering latent structure from incomplete observations, controlling error under realistic constraints, and bridging statistical guarantees with practical inference. Work on causal discovery, measurement error, and finite-sample concentration shows sustained engagement with the problem of identifying when standard assumptions fail. Yet this rigor exists mostly in academic settings. In production, users, workers, and depositors are bearing the uncertainty while builders and institutions negotiate what safety looks like after deployment.

Grant Calloway

AI Labs

No lab headlines.

From the Wire

Research Papers: Focused

Verifying formulas for interventional distributions stat.ME

We formalize verification in causal graphical models: deciding whether a given observational formula identifies a target interventional distribution. This opens a problem complementary to identification, asking not whether any identifying formula exists, but whether the given formula is identifying. We show that even sound and complete solutions to identification do not solve verification. We propose a falsifier as a first practical route forward, prove that it induces an almost-surely correct verifier for regular exponential-family models, and use the resulting verifier to develop the gateway test, which finds all sets admissible for use in a front-door formula.
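For reference, the front-door formula mentioned above is usually written as follows. This is the standard textbook form for a treatment $x$, outcome $y$, and mediator set $z$; the symbols are not taken from the abstract, and the paper's own notation may differ. In this reading, the gateway test would be deciding which sets $z$ make the right-hand side a valid expression for the left-hand side.

$$
P(y \mid \mathrm{do}(x)) \;=\; \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
$$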

Observation-Level Watermarking and Detection for Tabular Data stat.ME

With the development of generative AI, watermarking techniques have been widely used to verify the authenticity of AI-generated data and protect the rights of users and creators. While watermarking is already well established for image and text data, watermarking tabular data remains under-explored. Existing methods primarily focus on numerical data, leaving discrete, categorical, and mixed data less studied. In this work, we propose STAMP (Single-observation Tabular Attribution and Marking Procedure), a novel framework for watermarking tabular data that can accommodate and preserve a wide range of distributions. We also develop a corresponding detection mechanism, which can reliably identify watermarks even when the sample size is as small as one. We establish theoretical guarantees for asymptotic consistency and detection accuracy. Finally, through extensive simulation studies and two real-data applications, we demonstrate that the proposed method is effective and robust to subsetting, while maintaining data fidelity and a high detection rate.

tsbootstrap: Distribution-Free Uncertainty Quantification and Conformal Prediction for Time Series stat.ME

Finance, sensing, and demand streams violate the exchangeability that IID conformal prediction and the IID bootstrap assume, and existing libraries implement either a general resampling engine or conformal calibration without the other. tsbootstrap provides block, residual, sieve, and wild resampling, classical bootstrap confidence intervals, and adaptive conformal calibrators (EnbPI, ACI, NexCP, AgACI) through a single typed API in which a specification object selects each method. In a controlled coverage study the IID bootstrap undercovers sharply under dependence; dependence-aware methods reduce the coverage deficit, the sieve nearest to nominal under short-memory linear dependence. On the shared fixed-statistic path a compiled backend runs several times faster than arch, and a streaming reduce avoids materializing the $O(Bn)$ replicate tensor, limiting peak extra memory to $O(B)$ for the statistic array. The software is MIT licensed (v0.6.1).
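The two ideas in this abstract, resampling in blocks to respect dependence and reducing each replicate to its statistic immediately so that no $O(Bn)$ tensor is stored, can be illustrated with a minimal NumPy sketch. This is not the tsbootstrap API: the function names, the AR(1) toy series, and the choice of the sample mean as the statistic are all assumptions made for illustration.

```python
import numpy as np

def moving_block_bootstrap_means(x, block_len, n_boot, seed=0):
    """Bootstrap the sample mean using overlapping (moving) blocks.

    Each replicate is reduced to its statistic right away, so only the
    length-n_boot array of statistics is stored, never an (n_boot, n) array.
    Hypothetical helper, not part of any library.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(x)")
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_len)  # ceil(n / block_len)
    offsets = np.arange(block_len)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Draw block start positions, expand to indices, trim to length n.
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets).ravel()[:n]
        stats[b] = x[idx].mean()
    return stats

def percentile_interval(stats, level=0.90):
    """Plain percentile interval from an array of bootstrap statistics."""
    alpha = 1.0 - level
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Toy AR(1) series with positive autocorrelation (illustrative data only).
rng = np.random.default_rng(42)
eps = rng.standard_normal(500)
series = np.empty(500)
series[0] = eps[0]
for t in range(1, 500):
    series[t] = 0.6 * series[t - 1] + eps[t]

# block_len=1 reduces to the ordinary IID bootstrap.
block_stats = moving_block_bootstrap_means(series, block_len=25, n_boot=400)
iid_stats = moving_block_bootstrap_means(series, block_len=1, n_boot=400)
block_ci = percentile_interval(block_stats)
iid_ci = percentile_interval(iid_stats)
print("block CI:", block_ci)
print("IID CI:  ", iid_ci)
```

On a positively autocorrelated series like this one, the IID interval would be expected to come out narrower than the block interval, which is the undercoverage mechanism the abstract describes. The block length of 25 is an arbitrary choice here; in practice it would need tuning.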

Recovering Latent Structures after Variational Bayesian Variable Selection: Fit Assessment and Factor-Number Selection in Partially Exploratory Factor Analysis stat.ME

In partially exploratory factor analysis (PEFA), the loading structure and factor numbers are weakly specified. The regularized variational approximation for partially confirmatory factor analysis (PCFA VA) recovers this structure via Bayesian variable selection, using spike-and-slab priors to assign inclusion probabilities to unspecified loadings. This research introduces a post-selection assessment framework for this approach. We convert converged solutions into covariance models using either hard selection (thresholding probabilities into a sparse pattern) or soft selection (retaining them as weights for effective parameter counts). We derive the resulting degrees of freedom, absolute fit diagnostics (RMSEA, SRMR, CFI, TLI), and relative criteria (AIC, BIC, ELBO). To determine factor numbers, we propose a scale-free gain rule with a sustained-drop guard. Simulations show absolute indices successfully track loading recovery and flag under-factoring. While raw criteria over-factor, our gain rule accurately recovers true dimensionality, with the ELBO variant proving most robust. Finally, a 100-item PID-5 example demonstrates that our model fits better than a confirmatory 25-facet model and concordantly recovers major structures across disjoint specifications.
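The difference between hard and soft selection can be made concrete with a small sketch. The inclusion probabilities below are invented, the 0.5 threshold is an assumption, and the degrees-of-freedom formula is the usual factor-analysis count (unique covariance moments minus free loadings, unique variances, and factor correlations, with factor variances fixed at 1), which may not match the paper's derivation.

```python
import numpy as np

def loading_counts(incl_prob, threshold=0.5):
    """Return the hard sparse pattern, its count, and the soft effective count."""
    incl_prob = np.asarray(incl_prob, dtype=float)
    hard_pattern = incl_prob >= threshold   # hard selection: keep or drop
    hard_count = int(hard_pattern.sum())
    soft_count = float(incl_prob.sum())     # soft selection: probabilities as weights
    return hard_pattern, hard_count, soft_count

def model_df(n_items, n_loadings, n_factors):
    """Degrees of freedom under the ASSUMED counting rule described above."""
    moments = n_items * (n_items + 1) / 2
    free = n_loadings + n_items + n_factors * (n_factors - 1) / 2
    return moments - free

# Hypothetical inclusion probabilities for 6 items on 2 factors.
probs = np.array([
    [0.98, 0.03],
    [0.95, 0.10],
    [0.91, 0.40],
    [0.05, 0.97],
    [0.12, 0.93],
    [0.55, 0.88],
])

pattern, hard_count, soft_count = loading_counts(probs)
hard_df = model_df(n_items=6, n_loadings=hard_count, n_factors=2)
soft_df = model_df(n_items=6, n_loadings=soft_count, n_factors=2)
print(hard_count, round(soft_count, 2))   # 7 6.87
print(hard_df, round(soft_df, 2))         # 7.0 7.13
```

Hard selection yields an integer parameter count and a conventional sparse model. Soft selection yields a fractional count, so borderline loadings (0.40 and 0.55 here) contribute partially instead of being rounded.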

An Experimental Design Approach to Evaluating Agentic AI's Autonomous Model Discovery stat.ME

Large language model coding agents increasingly perform open-ended data modeling and analysis. These agents are stochastic and adaptive, and therefore their autonomous model discovery behavior cannot be adequately characterized by a single benchmark run. In this work, we propose an experimental design and analysis framework for systematically evaluating this discovery process, quantifying its variability, and identifying important factors. The proposed framework treats these agents as stochastic model-discovery operators, which map task-specific discovery data and an optimization target to a fitted model. Specifically, we investigate two such operators, Codex and Claude Code, under controlled experimental factors including the agent's reasoning effort, the task, the optimization metric, and the composition of the training data. For each agent-task-metric combination, we fit regression models and conduct inference on multiple responses such as output quality, dollar cost, wall-clock time, and process complexity. Furthermore, we develop a utility-aligned canonical decomposition to characterize the dominant direction of the reasoning-effort effect and to assess whether that direction aligns with a performance-cost utility direction. The proposed framework is demonstrated on a testbed of networked word-forming games, with insightful findings on how reasoning effort relates to cost and process complexity.

Significance-First Splitting: Aligning Treatment Heterogeneity Detection with Honest Estimation stat.ME

Estimating heterogeneous treatment effects (CATE) requires simultaneously detecting effect modification and quantifying estimation uncertainty. Existing tree-based methods make an uneasy trade-off: significance-based approaches (Radcliffe and Surry 2011) identify subgroup interactions directly but lack valid inference; honest causal trees (Athey and Imbens 2016) deliver nominal confidence interval coverage but use outcome-agnostic splitting criteria that sacrifice interaction sensitivity. We introduce a hybrid algorithm that fuses significance-based splitting with honest sample-splitting and cross-validation. Our splitting criterion uses the squared $t$-statistic for the treatment $\times$ side interaction ($t^2$), which is shown to be directly aligned with the honest $\text{EMSE}_\tau$ criterion when the interaction is strong. Post-hoc honest cross-validation selects the cost-complexity penalty, giving a single principled estimator with nominal CI coverage at the leaf level. For forests, we retain bootstrap count vectors to enable an infinitesimal jackknife (IJ) variance estimate of Monte-Carlo convergence rather than formal pointwise inference. On the three synthetic designs from Athey and Imbens (2016), the single tree achieves approximately 90% leaf-average CI coverage at the 90% nominal level (200 replications per design); on the Criteo and Starbucks uplift datasets we match the Qini coefficient performance of S- and T-learner baselines. An open-source Python package with reproducible seeds, sklearn-compatible API, and full test coverage accompanies this work (https://codeberg.org/hadjipantelis/rattus).
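A rough sketch of a squared $t$-statistic splitting criterion is given below. It reads "side" as the indicator of which side of a candidate threshold an observation falls on, and fits $y \sim 1 + w + \text{side} + w \cdot \text{side}$ by ordinary least squares with homoskedastic standard errors. That reading, the function names, and the simulated data are assumptions; this is not the rattus package's API, and it omits the honest sample-splitting and cross-validation steps.

```python
import numpy as np

def interaction_t2(y, w, side):
    """Squared t-statistic of the w*side coefficient in an OLS fit."""
    X = np.column_stack([np.ones_like(y), w, side, w * side])
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    se = np.sqrt(sigma2 * xtx_inv[3, 3])      # SE of the interaction term
    return float((beta[3] / se) ** 2)

def best_split(y, w, x, min_cell=10):
    """Scan thresholds on one covariate and return (threshold, t2) with max t2."""
    best_thr, best_t2 = None, -np.inf
    for thr in np.unique(x)[:-1]:
        side = (x > thr).astype(float)
        # Require enough treated and control units on both sides so the
        # design matrix stays full rank.
        cells = [np.sum((w == a) & (side == b)) for a in (0.0, 1.0) for b in (0.0, 1.0)]
        if min(cells) < min_cell:
            continue
        t2 = interaction_t2(y, w, side)
        if t2 > best_t2:
            best_thr, best_t2 = float(thr), t2
    return best_thr, best_t2

# Simulated data: the treatment effect is 2 when x > 0.5 and 0 otherwise.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0.0, 1.0, size=n)
w = rng.integers(0, 2, size=n).astype(float)
tau = np.where(x > 0.5, 2.0, 0.0)
y = tau * w + rng.standard_normal(n)

threshold, t2 = best_split(y, w, x)
print("chosen threshold:", threshold, "t^2:", t2)
```

With an effect jump this large, the selected threshold should land close to 0.5. In a full tree the same scan would run over every covariate in each node, on the splitting half of the sample only.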

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	GPT-5.5	60.2	66	$11.25
2	Claude Opus 4.7	57.3	47	$10.94
3	Gemini 3.1 Pro Preview	57.2	125	$4.50
4	GPT-5.4	56.8	82	$5.63
5	Qwen3.7 Max	56.6	198	$3.75

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	Junie	62.8%
5	gpt-5.4-2026-03-05-medium	62.8%

GitHub Repos

Trending

Lum1104/Understand-Anything

43640 ★

Graphs that teach > graphs that impress. Turn any code into an interactive knowledge graph you can explore, search, and ask questions about. Works with Claude Code, Codex, Cursor, Copilot, Gemini CLI, and more.

rohitg00/ai-engineering-from-scratch

33431 ★

Learn it. Build it. Ship it for others.

anthropics/claude-plugins-official

30987 ★

Official, Anthropic-managed directory of high quality Claude Code Plugins.

anthropics/knowledge-work-plugins

17517 ★

Open source repository of plugins primarily intended for knowledge workers to use in Claude Cowork

multica-ai/andrej-karpathy-skills

155979 ★

A single CLAUDE.md file to improve Claude Code behavior, derived from Andrej Karpathy's observations on LLM coding pitfalls.

Daily discovery

infiniflow/ragflow (RAG)

84660 ★

RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs

Open-Source-Legal/cite (Vector Database)

1335 ★

Humans and AI agents, building knowledge bases together. Self-hosted document annotation, version control, semantic search, and MCP.

openvinotoolkit/openvino (Generative AI)

10433 ★

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

jegly/OfflineLLM (Edge AI)

147 ★

A privacy-first Android chat app that runs large language models entirely on-device. No internet, no cloud, no tracking. Built with Kotlin, Jetpack Compose, and llama.cpp with optimized ARM NEON/SVE inference.

google/pyglove (AutoML)

713 ★

Manipulating Python Programs