The Inference Report

May 2, 2026

The market is fragmenting along a clear axis: companies that move fast and build products are pulling away from those relying on regulatory friction and litigation to protect position. In courtroom testimony against OpenAI, Musk admitted that xAI distills OpenAI's models even as he argued the company betrayed its nonprofit mission; the Pentagon, meanwhile, is signing deals with Nvidia, Microsoft, and AWS to diversify AI vendors after its own dispute with Anthropic over usage terms. Competition requires actual alternatives, but alternatives keep getting acquired. Cursor's reported $60 billion acquisition talks with SpaceX matter less for what they reveal about Cursor's value than for what they signal about consolidation math: if your product works, an acquirer will pay more than the market could ever allocate independently.

Models are commoditizing faster than the industry acknowledges. GPT-5.5 matches Mythos Preview in new cybersecurity tests, suggesting that cyber threat attribution is not any single model's breakthrough but a feature of the capability tier itself. Models tuned to prioritize user satisfaction over truthfulness make more errors, a tradeoff built into deployment rather than a hypothetical risk. The Pentagon's diversification strategy and its friction with Anthropic reveal an institution learning that single-vendor dependency creates leverage problems. Competition in AI infrastructure is real; competition in capability differentiation is narrowing. Meanwhile, Chinese models are consolidating gains on code benchmarks: GLM-5 jumped from rank 17 to rank 3 on SWE-rebench, while Kimi K2.5 climbed from rank 29 to rank 16, suggesting systematic improvement across model families rather than a breakthrough leap from any single model.

Regulatory capture is dressing itself as safety. Minnesota passes a ban on fake AI nudes with $500K fines while a new Christian cell network blocks pornography at the network level in ways adult users cannot override. English councils will trial Google AI tools to recommend planning decisions. These represent a shift from "AI companies should self-regulate" to "governments will regulate AI through whatever lever is closest at hand," often meaning regulation of user behavior rather than systems themselves. Platforms that claim they cannot moderate content at scale suddenly find themselves capable of blocking entire categories of speech when regulatory pressure arrives. That capability was always there. The question is only who decides when to use it.

Established labs are competing through distribution and positioning rather than capability announcements. Google is positioning scientific research as a partnership play built on open resources, signaling that influence in academia carries longer-term strategic value than proprietary model dominance alone. IBM is chasing immediate commercial application through consumer engagement via the Ferrari app and enterprise consulting to private equity firms. Neither move involves a breakthrough capability claim; both compete through distribution channels and positioning as a trusted advisor in a specific vertical. The real competition is over who owns the relationship when enterprises decide what to build with AI.

Grant Calloway

Research Papers — Focused
What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control cs.GT

LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why, or asked whether the deviation can be reversed. We do both. Working with four open-source models (Llama-3 and Qwen2.5, 8B to 72B parameters) playing four canonical two-player games, we establish the behavioral picture through self-play and cross-play experiments, then open up the 32-layer Llama-3-8B model and examine what actually happens during a strategic decision. The mechanistic findings are clear. Opponent history is encoded with near-perfect fidelity at the first layer (96% probe accuracy) and consumed progressively by later ones, while Nash action encoding is weak throughout, never exceeding 56%. There is no dedicated Nash module. Instead, the model privately favors the Nash action through most of its forward pass, but a prosocial override concentrated in the final layers reverses this, reaching 84% probability of cooperation at layer 30. When we inject a learned Nash direction into the residual stream, the behavior shifts bidirectionally, confirmed through concept clamping. The behavioral experiments surface six scale- and architecture-dependent findings, the most notable being that chain-of-thought reasoning worsens Nash play in small models but achieves near-perfect Nash play above 70B parameters. The cross-play experiments reveal three phenomena invisible in self-play: a small model can unravel any partner's cooperation by defecting early; two large models reinforce each other's cooperative instincts indefinitely; and who moves first in a coordination game determines which Nash equilibrium the system reaches. LLMs do not lack Nash-playing competence. They compute it, then suppress it.
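The causal intervention here is plain activation steering, and it is simple enough to sketch. Below is a minimal, hypothetical version of the residual-stream injection the abstract describes: register a forward hook on one decoder layer and add a direction vector to the hidden states. The model choice, layer index, steering strength, and the random stand-in for the learned Nash direction are all placeholder assumptions, not the paper's actual setup.

```python
# Minimal sketch of residual-stream injection (activation steering).
# Everything specific here is an assumption, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"   # assumption: any decoder-only LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

LAYER, ALPHA = 24, 4.0                 # hypothetical injection site and strength
nash_dir = torch.randn(model.config.hidden_size)  # stand-in for the learned direction
nash_dir = nash_dir / nash_dir.norm()  # flipping ALPHA's sign steers the other way

def steer(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the
    # hidden states; add the direction and pass everything else through.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * nash_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = "You are playing a repeated prisoner's dilemma. Your next move:"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```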

Computing Equilibrium beyond Unilateral Deviation cs.GT

Most familiar equilibrium concepts, such as Nash and correlated equilibrium, guarantee only that no single player can improve their utility by deviating unilaterally. They offer no guarantees against profitable coordinated deviations by coalitions. Although the literature proposes solution concepts that provide stability against multilateral deviations (e.g., strong Nash and coalition-proof equilibrium), these generally fail to exist. In this paper, we study an alternative solution concept that minimizes coalitional deviation incentives, rather than requiring them to vanish, and is therefore guaranteed to exist. Specifically, we focus on minimizing the average gain of a deviating coalition, and extend the framework to weighted-average and maximum-within-coalition gains. In contrast, the minimum-gain analogue is shown to be computationally intractable. For the average-gain and maximum-gain objectives, we prove a lower bound on the complexity of computing such an equilibrium and present an algorithm that matches this bound. Finally, we use our framework to solve the Exploitability Welfare Frontier (EWF), the maximum attainable social welfare subject to a given exploitability (the maximum gain over all unilateral deviations).
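To make the minimized quantity concrete, here is a toy computation of the average coalitional deviation gain for a pure profile in a three-player game. The payoff function, the total-improvement measure of a coalition's gain, and the uniform average over coalitions are illustrative choices, not the paper's exact definitions.

```python
# Toy version of the objective: for each coalition, the best joint deviation's
# total utility improvement, averaged over all coalitions. Illustrative only.
from itertools import combinations, product

N_PLAYERS, N_ACTIONS = 3, 2

def utility(profile):
    # Hypothetical payoffs: a player earns 1 for playing the majority action.
    majority = 1 if 2 * sum(profile) > len(profile) else 0
    return [1.0 if a == majority else 0.0 for a in profile]

def coalition_gain(profile, coalition):
    base = utility(profile)
    best = 0.0
    for joint in product(range(N_ACTIONS), repeat=len(coalition)):
        deviated = list(profile)
        for player, action in zip(coalition, joint):
            deviated[player] = action
        u = utility(deviated)
        best = max(best, sum(u[p] - base[p] for p in coalition))
    return best

profile = (0, 0, 1)
coalitions = [c for r in range(1, N_PLAYERS + 1)
              for c in combinations(range(N_PLAYERS), r)]
avg = sum(coalition_gain(profile, c) for c in coalitions) / len(coalitions)
print(f"average coalitional deviation gain at {profile}: {avg:.3f}")
```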

Optimally Auditing Adversarial Agents cs.GT

Fraud can pose a challenge in many resource allocation domains, including social service delivery and credit provision. For example, agents may misreport private information in order to gain benefits or access to credit. To mitigate this, a principal can design strategic audits to verify claims and penalize misreporting. In this paper, we introduce a general model of audit policy design as a principal-agent game with multiple agents, where the principal commits to an audit policy, and agents collectively choose an equilibrium that minimizes the principal's utility. We examine both adaptive and non-adaptive settings, depending on whether the principal's policy can be responsive to the distribution of agent reports. Our work provides efficient algorithms for computing optimal audit policies in both settings and extends these results to a setting with limited audit budgets.
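For a feel of the non-adaptive setting, the sketch below allocates a fixed audit budget across agents: an agent is deterred once the expected penalty exceeds their gain from misreporting, and probability mass goes first where it averts the most loss per unit spent. The numbers are invented and the greedy pass is only a heuristic; the paper's algorithms compute exact optima.

```python
# Toy non-adaptive audit design (an illustrative formalization, not the
# paper's): agent i misreports iff their gain exceeds the expected penalty,
# so audit probability p_i = gain_i / penalty_i deters them. Spend a budget
# of total audit probability greedily by loss averted per unit spent.
agents = [  # (gain from misreporting, penalty if caught, principal's loss)
    (3.0, 10.0, 5.0),
    (6.0, 10.0, 9.0),
    (2.0, 10.0, 1.0),
]
BUDGET = 1.0  # total audit probability mass available

candidates = []
for i, (gain, penalty, loss) in enumerate(agents):
    p_needed = gain / penalty          # smallest deterring audit probability
    candidates.append((loss / p_needed, p_needed, loss, i))

remaining, deterred = BUDGET, []
for _, p_needed, loss, i in sorted(candidates, reverse=True):
    if p_needed <= remaining:
        remaining -= p_needed
        deterred.append(i)

residual = sum(loss for j, (_, _, loss) in enumerate(agents) if j not in deterred)
print("deterred:", deterred, "residual loss:", residual)  # [0, 1], 1.0
```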

Strategic Bidding in 6G Spectrum Auctions with Large Language Models cs.GT

Efficient and fair spectrum allocation is a central challenge in 6G networks, where massive connectivity and heterogeneous services continuously compete for limited radio resources. We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks. Each user equipment (UE) acts as a rational player optimizing its long-term utility through repeated interactions. Using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness, we compare LLM-guided bidding against truthful and heuristic strategies. Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically. Results show that when the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions. However, when these assumptions break -- such as under static budget constraints -- LLMs sustain longer participation and achieve higher utilities, revealing their ability to approximate adaptive equilibria beyond static mechanism design. This work provides the first systematic evaluation of LLM bidders in repeated spectrum auctions, offering new insights into how AI-driven agents can interact strategically and reshape market dynamics in future 6G networks.
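The VCG benchmark is easiest to see in its single-item special case, where the mechanism reduces to a second-price auction: the winner pays the externality imposed on everyone else. The bids below are invented; the paper's setting is repeated multi-resource auctions with budget constraints.

```python
# Single-item VCG, i.e. a second-price auction: the winner pays the welfare
# others lose from the winner's presence, which is the second-highest bid.
def vcg_single_item(bids):
    """bids: dict of UE id -> bid. Returns (winner, VCG payment)."""
    winner = max(bids, key=bids.get)
    rest = [b for ue, b in bids.items() if ue != winner]
    # Others' best welfare without the winner, minus what they get with the
    # winner present (nothing), equals the second price.
    return winner, (max(rest) if rest else 0.0)

bids = {"UE-1": 4.2, "UE-2": 7.5, "UE-3": 6.1}  # illustrative bids
print(vcg_single_item(bids))  # ('UE-2', 6.1)
```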

Compliance Moral Hazard and the Backfiring Mandate cs.GT

Competing firms that serve shared customer populations face a fundamental information aggregation problem: each firm holds fragmented signals about risky customers, but individual incentives impede efficient collective detection. We develop a mechanism design framework for decentralized risk analytics, grounded in anti-money laundering in banking networks. Three strategic frictions distinguish our setting: compliance moral hazard, adversarial adaptation, and information destruction through intervention. A temporal value assignment (TVA) mechanism, which credits institutions using a strictly proper scoring rule on discounted verified outcomes, implements truthful reporting as a Bayes-Nash equilibrium (uniquely optimal at each edge) in large federations. Embedding TVA in a banking competition model, we show competitive pressure amplifies compliance moral hazard and poorly designed mandates can reduce welfare below autarky, a "backfiring" result with direct policy implications. In simulation using a synthetic AML benchmark, TVA achieves substantially higher welfare than autarky or mandated sharing without incentive design.
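For intuition on why a strictly proper scoring rule makes truthful reporting the best policy, here is one possible instantiation using a discounted Brier score. The discount factor, the credit formula, and the probabilities are illustrative assumptions, not the paper's TVA mechanism.

```python
# One possible reading of "strictly proper scoring rule on discounted
# verified outcomes": discounted Brier credit. All numbers are invented.
def tva_credit(reports, gamma=0.95):
    """reports: (verification delay, reported prob, verified outcome in {0,1})."""
    return sum((gamma ** d) * -((p - y) ** 2) for d, p, y in reports)

def expected_brier(p, q):
    # E[-(p - Y)^2] with Y ~ Bernoulli(q); uniquely maximized at p == q,
    # which is exactly what "strictly proper" buys the mechanism.
    return -(q * (p - 1) ** 2 + (1 - q) * p ** 2)

q = 0.7  # suppose an institution's true belief that a customer is risky
print(expected_brier(0.7, q), ">", expected_brier(1.0, q))  # ~-0.21 > ~-0.30
print(tva_credit([(1, 0.9, 1), (4, 0.2, 0)]))  # later verification earns less
```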

Is Four Enough? Automated Reasoning Approaches and Dual Bounds for Condorcet Dimensions of Elections cs.GT

In an election where $n$ voters rank $m$ candidates, a Condorcet winning set is a committee of $k$ candidates such that for any outside candidate, a majority of voters prefer some committee member. Condorcet's paradox shows that some elections admit no Condorcet winning sets with a single candidate (i.e., $k=1$), and the same can be shown for $k=2$. On the other hand, recent work proves that a set of size $k=5$ exists for every election. This leaves an important theoretical gap between the best known lower bound $(k\geq 3)$ and upper bound $(k \leq 5)$ for the number of candidates needed to guarantee existence. We aim to close the gap between the existence guarantees and impossibility results for Condorcet winning sets. We explore an automated reasoning approach to tighten these bounds. We design a mixed-integer linear program (MILP) to search for elections that would serve as counter-examples to conjectured bounds. We employ a number of optimizations, such as symmetry breaking, subsampling, and constraint generation, to enhance the search and model effectively infinite electorates. Furthermore, we analyze the dual of the linear programming relaxation as a path towards obtaining a new upper bound. Despite extensive search on moderate-sized elections, we fail to find any election requiring a committee larger than size 3. Motivated by our experimental results in this direction, we simplify the dual linear program and formulate a conjecture which, if true, implies that a winning set of size 4 always exists. Our automated reasoning results provide strong empirical evidence that the Condorcet dimension of any election may be smaller than currently known upper bounds, at least for small instances. We offer a general-purpose framework for searching elections in ranked voting and a new, concrete analytical path via duality toward proving that smaller committees suffice.
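The definition under study is simple enough to check by brute force on small profiles. The sketch below tests whether a committee is a Condorcet winning set, using the classic three-voter cyclic profile: no single candidate suffices there, though every pair does, so ruling out $k=2$ requires larger constructions like those the MILP searches for.

```python
# Direct check of the definition: S is a Condorcet winning set if every
# outside candidate c is beaten by "some member of S" for a strict majority
# of voters. The profile below is the classic three-voter cycle.
from itertools import combinations

def is_condorcet_winning_set(committee, profile):
    candidates = set(profile[0])
    for c in candidates - set(committee):
        # A voter prefers the committee to c iff her best-ranked member of
        # the committee appears above c in her ranking.
        prefer = sum(
            1 for ranking in profile
            if min(ranking.index(s) for s in committee) < ranking.index(c)
        )
        if 2 * prefer <= len(profile):  # need a strict majority
            return False
    return True

profile = [list("abc"), list("bca"), list("cab")]  # Condorcet's paradox
winning = [set(S) for k in (1, 2) for S in combinations("abc", k)
           if is_condorcet_winning_set(S, profile)]
print(winning)  # no singleton works, but every pair does
```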

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                    Score  tok/s  $/1M tokens
1  GPT-5.5                   60.2     73  $11.25
2  Claude Opus 4.7           57.3     51  $10.00
3  Gemini 3.1 Pro Preview    57.2    132  $4.50
4  GPT-5.4                   56.8     86  $5.63
5  Kimi K2.6                 53.9     25  $1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                       Score
1  Claude Opus 4.6             65.3%
2  gpt-5.2-2025-12-11-medium   64.4%
3  GLM-5                       62.8%
4  Junie                       62.8%
5  gpt-5.4-2026-03-05-medium   62.8%