The Inference Report

May 21, 2026

The AI industry has abandoned the pretense that models are the product. Across infrastructure, benchmarks, and deployed systems, the consolidation pattern is unmistakable: the money, the compute, and the power are flowing toward agents, inference infrastructure, and the systems that make autonomous execution possible at scale.

Google has reorganized its entire product surface around agency. Gemini 3.5 Flash is engineered for agentic workflows and positioned as four times faster than Claude Opus 4.7 and twice as fast as Gemini 3.1 Pro. The company is unifying its coding tools under Antigravity and embedding agents across Search, Android, and enterprise platforms. Nvidia CEO Jensen Huang described a $200 billion market opportunity in CPUs for AI agents, not in inference chips for chat. This reorientation is structural, not cosmetic. Google processes 3.2 quadrillion tokens per month, and the token is now the unit of metering, pricing, and control. Whoever owns the inference layer owns the billing relationship.

The compute arms race has become visible and expensive. xAI burned $6.4 billion in 2025 and is purchasing $2.8 billion in natural gas turbines over three years while paying Anthropic $1.25 billion per month for compute. Anthropic is on track for its first profitable quarter with $10.9 billion in projected Q2 revenue, a milestone neither OpenAI nor xAI has reached. OpenAI is preparing to file for an IPO as soon as September, working with Goldman Sachs and Morgan Stanley. These companies are capital-intensive utilities, and they are being valued as such. Figure AI's continuous livestream of humanoid robots handling packages is not marketing; it is proof of concept that the market will watch robots work. The question has shifted from whether agents will exist to who controls the compute they run on.

In benchmarks and deployed code, the shift manifests as concrete technical priorities. Claude Opus 4.6 climbed 12.4 points on SWE-rebench to reach 65.3 percent, the largest single-model improvement in the dataset, while GLM-5 and Kimi K2 Thinking each gained roughly 13 to 16 points. On GitHub, the trending patterns split between agentic coding frameworks that reduce hallucination and token waste, and unglamorous infrastructure: llama.cpp and whisper.cpp remain gravitational centers for efficient local inference, now joined by quantization and pruning strategies. The secondary wave addresses production realities: observability tools like Phoenix, time-series anomaly detection, data synthesis, and ML pipeline orchestration. These don't trend virally because they solve problems that only matter once something actually ships. Agentic coding promises leverage over writing itself. The infrastructure work promises leverage over everything that comes after.

Grant Calloway

Research Papers: Focused
Watermarking Game-Playing Agents in Perfect-Information Extensive-Form Games cs.GT

Watermarking techniques for large language models (LLMs), which encode hidden information in the output so its source can be verified, have gained significant attention in recent days, thanks to their potential capability to detect accidental or deliberate misuse. Similar challenges involving model misuse also exist in the context of game-playing, such as when detecting the unauthorized use of AI tools in gaming platforms (e.g., cheating in online chess). In this paper, we initiate the study of how game-playing strategies can be watermarked. We show how the KGW watermark for LLMs can be adapted to watermark game-playing agents in perfect-information extensive-form games. The watermark can then be detected using a statistical test. We show that the degradation in the quality of the watermarked strategy profile, quantified by the expected utility, can be bounded, but there is a tradeoff between detectability and quality. In our experiments, we bootstrap the watermarking framework to various chess engines and demonstrate that a) the impact of the watermark on the quality of the strategy is negligible and b) the watermark can be detected with just a handful of games.
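
The abstract does not spell out the construction, so the following is only a minimal sketch of how a KGW-style green-list watermark might carry over to move selection, not the paper's method. Every name here (`is_green`, `watermarked_move`, `z_score`, `simulate`), the `GAMMA` split, the `tolerance` parameter, and the random move scores are assumptions made for illustration. A keyed hash marks roughly half of all (state, move) pairs as green, the agent prefers a green move among those within `tolerance` of the best score, and a one-proportion z-test checks whether observed play lands on green moves more often than chance.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of (state, move) pairs that land on the green list


def is_green(state, move, key):
    """Keyed hash of (state, move) decides green-list membership."""
    digest = hashlib.sha256(f"{key}|{state}|{move}".encode()).digest()
    return digest[0] / 256 < GAMMA


def watermarked_move(state, scored_moves, key, tolerance=0.1):
    """Prefer the best green move among moves within `tolerance` of the top score.

    Falls back to the top-scoring move when no near-best move is green, so the
    per-move loss in score never exceeds `tolerance`.
    """
    best = max(scored_moves.values())
    near_best = [m for m, s in scored_moves.items() if best - s <= tolerance]
    green = [m for m in near_best if is_green(state, m, key)]
    pool = green or near_best
    return max(pool, key=lambda m: scored_moves[m])


def z_score(observed, key):
    """One-proportion z-test on how many observed (state, move) pairs are green."""
    n = len(observed)
    hits = sum(is_green(state, move, key) for state, move in observed)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def simulate(n_positions, key, use_watermark, seed=0):
    """Toy stand-in for an engine: 8 legal moves per position with random scores."""
    rng = random.Random(seed)
    observed = []
    for i in range(n_positions):
        state = f"pos{i}"
        scores = {f"m{j}": rng.random() for j in range(8)}
        if use_watermark:
            move = watermarked_move(state, scores, key, tolerance=0.3)
        else:
            move = max(scores, key=scores.get)
        observed.append((state, move))
    return observed


if __name__ == "__main__":
    print("watermarked z:", z_score(simulate(200, "secret", True), "secret"))
    print("plain z:      ", z_score(simulate(200, "secret", False), "secret"))
```

The `tolerance` knob is where the tradeoff described in the abstract shows up in this sketch: a wider band gives the agent more chances to find a green move, which makes detection easier, at the cost of a larger worst-case drop in move quality.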

Agreement, Diversity, and Polarization Indices for Approval Elections cs.GT

An index is a function that given an election outputs a value between 0 and 1, indicating the extent to which this election has a particular feature. We seek indices that capture agreement, diversity, and polarization among voters in approval elections, and that are normalized with respect to saturation. By the latter we mean that if two elections differ by the fraction of candidates approved by an average voter, but otherwise are of similar nature, then they should have similar index values. We propose several indices, analyze their properties, and use them to (a) derive a new map of approval elections, and (b) show similarities and differences between various real-life elections from Pabulib, Preflib and other sources.

When to Ask a Question: Understanding Communication Strategies in Generative AI Tools cs.GT

Generative AI models differ from traditional machine learning tools in that they allow users to provide as much or as little information as they choose in their inputs. This flexibility often leads users to omit certain details, relying on the models to infer and fill in under-specified information based on distributional knowledge of user preferences. Such inferences may privilege majority viewpoints and disadvantage users with atypical preferences, raising concerns about fairness. Unlike more traditional recommender systems, LLMs can explicitly solicit more information from users through natural language. However, while directly eliciting user preferences could increase personalization and mitigate inequality, excessive querying places a burden on users who value efficiency. We develop a stylized model of user-LLM interaction and an objective that captures the tradeoff between user burden and preference representation. Building on the observation that individual preferences are often correlated, we analyze how AI systems should balance inference and elicitation, characterizing the optimal amount of information to solicit before content generation. Ultimately, we show that information elicitation can mitigate the systematic biases of preference inference, enabling the design of generative tools that better incorporate diverse user perspectives while maintaining efficiency. We complement this theoretical analysis with an empirical evaluation illustrating the model's predictions and exploring their practical implications.

Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance cs.GT

Generative Artificial Intelligence (AI) tools are rapidly adopted in the workplace and in education, yet the empirical evidence on AI's impact remains mixed. We propose a model of human-AI interaction to better understand and analyze several mechanisms by which AI affects productivity. In our setup, human agents with varying skill levels exert utility-maximizing effort to produce certain task outcomes with AI assistance. We find that incorporating either endogeneity in skill development or in AI unreliability can induce a productivity paradox: increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls. Moreover, we examine the long-term distributional effect of AI on skill, and demonstrate that skill polarization can emerge in steady state when accounting for heterogeneity in AI literacy -- the agent's capability to identify and adapt to inaccurate AI outputs. Our results elucidate several mechanisms that may explain the emergence of human-AI productivity paradoxes and skill polarization, and identify simple measures that characterize when they arise.

Quotient Semivalues for False-Name-Resistant Data Attribution cs.GT

Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.
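
To make the split-gain idea concrete, here is a small sketch under stated assumptions; it is not the paper's mechanism or its DataMarket-Gym benchmark. The toy game (worth 1 only when the pooled data covers every required example), the contributor names, and the `cluster_of` mapping, which stands in for evidence-backed attribution clusters, are all hypothetical. The code computes exact Shapley values over reported identities and then over clusters, merging each cluster's data with a set union as a simple stand-in for a canonical-representative operator.

```python
from itertools import permutations
from math import factorial


def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def make_value(data, required):
    """Toy monotone game: worth 1 once the pooled data covers every required example."""
    def value(coalition):
        pooled = set().union(*(data[p] for p in coalition))
        return 1.0 if required <= pooled else 0.0
    return value


def quotient_shapley(data, cluster_of, required):
    """Shapley values over clusters; the set union absorbs within-cluster duplicates."""
    merged = {}
    for identity, examples in data.items():
        merged.setdefault(cluster_of[identity], set()).update(examples)
    return shapley(list(merged), make_value(merged, required))


REQUIRED = {"x1", "x2", "y"}
honest = {"A": {"x1", "x2"}, "B": {"y"}}
split = {"A1": {"x1"}, "A2": {"x2"}, "B": {"y"}}  # A reports under two names
cluster_of = {"A1": "A", "A2": "A", "B": "B"}     # assumed provenance evidence

honest_values = shapley(list(honest), make_value(honest, REQUIRED))
naive_values = shapley(list(split), make_value(split, REQUIRED))
quotient_values = quotient_shapley(split, cluster_of, REQUIRED)
```

In this toy game the honest contributor A receives 1/2. Splitting into A1 and A2 raises A's combined share under identity-level Shapley to 2/3, and the cluster-level computation returns it to 1/2. The sketch assumes the clustering is perfect; the imperfect-provenance case the abstract bounds is not modeled here.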

The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting cs.GT

Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $Ω(\text{Var}(1/G'') (γ/β)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
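
The tension the abstract describes can be illustrated numerically with a short sketch; it is not the paper's model, its perturbation formula, or its step-function escape. The logistic `approval` curve, its `threshold` and `sharpness`, and the `approval_weight` parameter are hypothetical choices. The code grid-searches the report that maximizes the agent's expected Brier score, first alone and then with a smooth, non-affine approval bonus added.

```python
import math


def expected_brier(report, belief):
    """Expected negated Brier loss of `report` when the event has probability `belief`."""
    return -(belief * (1 - report) ** 2 + (1 - belief) * report ** 2)


def approval(report, threshold=0.7, sharpness=10.0):
    """Hypothetical smooth, non-affine approval probability (a logistic curve)."""
    return 1 / (1 + math.exp(-sharpness * (report - threshold)))


def best_report(belief, approval_weight=0.0, steps=1000):
    """Grid-search the report maximizing accuracy score plus weighted approval."""
    grid = [i / steps for i in range(steps + 1)]
    return max(
        grid,
        key=lambda r: expected_brier(r, belief) + approval_weight * approval(r),
    )


if __name__ == "__main__":
    belief = 0.6
    print("score only:         ", best_report(belief))
    print("score plus approval:", best_report(belief, approval_weight=0.2))
```

With the score alone, the best report equals the belief, which is what strict properness guarantees. Once the approval term carries any positive weight in this sketch, the best report moves above the belief, because a small upward deviation costs almost nothing in accuracy but raises the approval probability.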

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  Speed (tok/s)  Price ($/1M tokens)
1  GPT-5.5                 60.2   64             $11.25
2  Claude Opus 4.7         57.3   48             $10.94
3  Gemini 3.1 Pro Preview  57.2   138            $4.50
4  GPT-5.4                 56.8   81             $5.63
5  Qwen3.7 Max             56.6   0              $0.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  Junie                      62.8%
5  gpt-5.4-2026-03-05-medium  62.8%