The Inference Report

May 31, 2026

The infrastructure war is hardening into a bifurcated market where capital and silicon matter more than software, and the software layer itself is fracturing as its economics become visible. SoftBank's €75 billion commitment to French data centers reveals the actual hierarchy: Masayoshi Son is betting that whoever controls the substrate controls the market. Compute capacity, electricity, and silicon have become the strategic assets. Everything else is software, and software is becoming a commodified layer that users increasingly resent paying for. GitHub Copilot's shift to token-based billing sparked backlash precisely because developers could suddenly see the unit economics of what was once a loss leader; once the margin was visible, it was resented. Google's unbundling of Gemini into a separate product called Spark tests whether users will pay for AI assistants once they're separated from search. The transcription software market already shows price resistance: free services are adequate enough that paid alternatives struggle to justify their cost.

AWS is converting operational overhead into lock-in by embedding generative AI into the resilience layer itself through Resilience Hub. The move targets not AI builders but the people managing the systems those builders deploy on, recognizing that as generative AI workloads proliferate across customer infrastructure, the surface area for failure expands faster than traditional monitoring can track. By offering to own the question of what happens when these systems fail at scale, AWS deepens dependency on its ecosystem precisely when organizations transition from experimental deployments to production workloads. This is infrastructure defending itself by becoming indispensable at the operational level.

The code-solving frontier has stabilized at the top, with gpt-5.5-2026-04-23-xhigh holding first place at 62.7% on SWE-rebench, while mid-tier models churn between 45 and 55 percent. Gemini 3.1 Pro Preview dropped 6.1 points from 57.2% to 51.1%, and Kimi K2.6 fell 7.4 points from 53.9% to 46.5%, the larger of the two regressions. The divergence between SWE-rebench and Artificial Analysis scores for some models suggests either that these benchmarks test different problem classes or that recent updates affected one more than the other. Either way, it raises the question of whether measurement in the middle tier is still reliable.
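As a quick arithmetic check on the moves quoted above, the drops can be recomputed from the before and after scores. This is a minimal sketch: the names `moves` and `point_drop` are illustrative, and the only data used are the four percentages stated in the paragraph.

```python
# Before/after SWE-rebench scores as quoted in the paragraph above (percent).
moves = {
    "Gemini 3.1 Pro Preview": (57.2, 51.1),
    "Kimi K2.6": (53.9, 46.5),
}


def point_drop(before, after):
    """Return the drop in percentage points, rounded to one decimal place."""
    return round(before - after, 1)


drops = {model: point_drop(before, after) for model, (before, after) in moves.items()}
largest = max(drops, key=drops.get)

for model, drop in drops.items():
    print(f"{model}: down {drop} points")
print("largest drop:", largest)
```

Rounding to one decimal avoids floating-point noise in the subtraction; the comparison at the end simply ranks the two quoted regressions against each other.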

GitHub's trending repos tell the story of agents moving from prototype to production by building the unglamorous layer where theory meets hardware constraints. Anthropic's claude-code and skills repos, alongside cursor/plugins and EveryInc/compound-engineering-plugin, show AI systems integrating into development workflows through standardized abstractions that third parties can extend. Beneath this sits the real work: ARahim3/mlx-tune brings fine-tuning to Apple Silicon Macs, vllm-project/vllm-ascend extends inference to new accelerators, and fluxions-ai/vui reports roughly 9x realtime on a single 4090. These projects aren't flashy, but they are what makes deployed agents economical. Developers have stopped waiting for perfect solutions and are building the infrastructure themselves, from document parsing through speech generation, which suggests that the bottleneck is no longer capability but cost and efficiency in production.

Grant Calloway

AI Labs

AWS

Introducing the next generation of AWS Resilience Hub for generative AI-based SRE resilience journey

From the Wire

Research Papers: Focused

Paradoxes of Game Theoretic Equilibria and Price of Anarchy cs.GT

For decades, static solution concepts (Nash, Correlated, and Coarse Correlated Equilibria) and the Price of Anarchy (PoA) have formed the bedrock of algorithmic game theory, with no-regret learning proving fast convergence to such game-theoretic equilibria. We show that reducing multi-agent learning to static equilibrium and black-box regret analysis obscures underlying dynamic disequilibrium and game theoretic bounds. First, interior Nash equilibria lack $C^1$ vector field information, meaning agents cannot distinguish aligned from strictly opposing incentives. Inheriting this geometry, the worst-case pure Nash equilibria dictating robust PoA bounds manifest as topologically unstable strict saddles, and in canonical congestion games, as global repellers supported on almost everywhere strictly dominated strategies. Anchoring efficiency guarantees to these unstable states causes algebraic sensitivity; we prove that accommodating all strictly positive affine costs renders the PoA unbounded. Furthermore, projecting learning trajectories onto the discrete simplex of correlated play systematically accommodates non-rationalizable behavior. Evaluating dynamics via Coarse Correlated Equilibria or proximal refinements fails to preclude strictly dominated strategies. Moreover, optimal $O(1/T)$ swap-regret minimization does not preclude macroscopic turbulence, manifesting as chaotic limit sets even in minimal games. Finally, we examine the non-atomic limit of congestion games. Though considered highly stable with tight sub-linear $\Theta(p/\ln p)$ PoA bounds (where $p$ is the polynomial degree), we prove that under discrete-time learning, the unique equilibrium destabilizes into Li-Yorke chaos and global attractors whose time-averaged inefficiency degrades exponentially as $2^p$. These results necessitate re-evaluating worst-case equilibrium frameworks for dynamically grounded metrics.
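To give a feel for how the static PoA bound grows with the polynomial degree $p$, here is a minimal sketch using Pigou's two-link network. That network is a standard textbook instance and an assumption of this example; the abstract does not mention it, and the sketch says nothing about the paper's dynamic results. One unit of traffic chooses between a link of constant cost 1 and a link of cost $x^p$.

```python
def pigou_poa(p):
    """Price of Anarchy of Pigou's network with link costs 1 and x**p."""
    # Equilibrium: x**p <= 1 for any flow x in [0, 1], so all traffic
    # takes the x**p link and the total cost is 1.
    equilibrium_cost = 1.0
    # Optimum: send x on the x**p link and 1 - x on the constant link.
    # Minimising x**(p + 1) + (1 - x) gives x = (p + 1) ** (-1 / p).
    x = (p + 1) ** (-1.0 / p)
    optimal_cost = x ** (p + 1) + (1.0 - x)
    return equilibrium_cost / optimal_cost


for degree in (1, 2, 4, 8, 16):
    print(degree, round(pigou_poa(degree), 3))
```

For linear costs this reproduces the familiar 4/3 ratio, and the ratio keeps growing slowly as the degree increases, in line with the sub-linear growth the abstract cites for the static bound.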

Contextual Procurement Auctions with Bandit Learning cs.GT

We study repeated contextual procurement auctions in which producers have private costs and the platform must learn context-dependent product values from bandit feedback. The objective is welfare rather than revenue or a virtual-cost surrogate: regret is the total surplus loss relative to the full-information efficient procurement rule. We first show that the natural UCB allocation rule attains $\tilde O(\sqrt{ngT})$ welfare regret under truthful bids, but its adaptive bid-dependent learning path does not by itself give a truthfulness guarantee. To obtain exact incentives, we design a bid-independent explore-then-commit mechanism with empirical critical payments; it is dominant-strategy truthful and has $\tilde O((ng)^{1/3}T^{2/3})$ regret. We then introduce frozen-payment UCB, which estimates payments in an initial bid-independent exploration phase, freezes those payment estimates, and continues adaptive UCB allocation learning afterwards. Under a smoothed truthful-path margin condition, this mechanism gives a regret-incentive tradeoff: the near-UCB tuning attains $\tilde O(\sqrt{ngT})$ welfare regret, while the average per-round gain from any fixed deviation is at most $\tilde O(T^{-1/4})$ for fixed $n,g$. A matching lower bound shows that this frozen-payment frontier is unavoidable.
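The commit step of an explore-then-commit mechanism can be illustrated with a toy, single-context sketch. It is not the paper's mechanism: the function names, the averaging estimator, and the threshold-payment rule below are assumptions chosen to show why value estimates that never depend on bids, combined with a critical payment, leave the winner with nothing to gain from misreporting.

```python
def estimate_values(samples):
    """Average the value samples gathered for each producer during exploration.

    Exploration is assumed to ignore bids, so these estimates are bid-independent.
    """
    return [sum(s) / len(s) for s in samples]


def commit_allocation(estimates, bids):
    """Pick the producer with the highest estimated surplus (value minus bid).

    Returns (winner_index, payment), or (None, 0.0) when every estimated
    surplus is negative and the platform prefers not to procure.
    """
    surplus = [v - b for v, b in zip(estimates, bids)]
    winner = max(range(len(bids)), key=lambda i: surplus[i])
    if surplus[winner] < 0:
        return None, 0.0
    others = [s for i, s in enumerate(surplus) if i != winner]
    runner_up = max(others + [0.0])
    # Critical payment: the highest bid at which the winner would still win.
    payment = estimates[winner] - runner_up
    return winner, payment


estimates = estimate_values([[10.2, 9.8], [8.0, 8.0], [6.0, 6.0]])
print(commit_allocation(estimates, bids=[4.0, 3.0, 5.0]))
```

In this toy run the first producer wins and is paid the threshold bid rather than its own bid, so shading the bid up or down changes only whether it wins, not what it receives.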

Contextual Procurement Auctions with Bandit Learning cs.GT

We study repeated contextual procurement auctions in which the platform must learn context-dependent product values from bandit feedback. We give an exactly truthful explore-then-commit mechanism with $\widetilde O((ng)^{1/3}T^{2/3})$ regret. We also give a frozen-payment UCB mechanism with a regret-incentive tradeoff: the near-UCB tuning attains $\widetilde O(\sqrt{ngT})$ welfare regret, while for fixed $n,g$ its total incentive error is $\widetilde O(T^{3/4})$; the balanced tuning gives $\widetilde O(T^{2/3})$ on both scales. Regret is measured as welfare loss relative to the full-information efficient allocation. We prove a matching lower bound for the frozen-payment regret-incentive tradeoff.

LLM Semantic Signaling Game and Mechanism Design: Systematic Blindness, Awareness Shaping, and Mindset Dynamics cs.GT

Large language models (LLMs) increasingly mediate strategic interactions through natural language, making semantic control a critical element of communication and deception. This paper develops a semantic signaling game in which a sender selects a semantic control, an LLM generates a stochastic message, and a receiver evaluates the message using an awareness-dependent scoring mechanism. Receiver awareness is modeled as a type that determines which linguistic features are perceived and used for inference, providing a formal model of systematic blindness. The framework connects prompt-based control, statistical detection, and game-theoretic equilibrium analysis. Gaussian approximations of aggregate message scores enable likelihood-ratio decision rules, while Perfect Bayesian Nash equilibria characterize strategic behavior. The paper further develops mechanism-design approaches that reshape receiver awareness, penalize deceptive semantic controls, and modify receiver populations to induce benign pooling equilibria. Numerical experiments validate the Gaussian approximation, quantify awareness-ordering effects, analyze mindset dynamics under adaptive adversaries, and demonstrate how awareness shaping and guardrail costs reduce successful phishing attacks. The proposed framework provides a principled foundation for analyzing strategic language-mediated interactions in agentic AI systems and offers new tools for the design of robust and secure human-AI communication.
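The likelihood-ratio decision rule mentioned in the abstract can be sketched for the simplest case of two Gaussian score models. Everything specific here is hypothetical: the function names, the (mean, std) parameterisation, and the example numbers are illustrative assumptions, not values or notation from the paper.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution with the given mean and std."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def log_likelihood_ratio(score, benign, deceptive):
    """Log-likelihood ratio of 'deceptive' versus 'benign' for an aggregate score.

    benign and deceptive are (mean, std) pairs for the Gaussian approximations.
    """
    return gaussian_log_pdf(score, *deceptive) - gaussian_log_pdf(score, *benign)


def flags_deceptive(score, benign, deceptive, threshold=0.0):
    """Flag the message when the log-likelihood ratio exceeds the threshold."""
    return log_likelihood_ratio(score, benign, deceptive) > threshold


benign_model = (0.0, 1.0)      # hypothetical score distribution for benign messages
deceptive_model = (2.0, 1.0)   # hypothetical score distribution for deceptive ones
print(flags_deceptive(1.7, benign_model, deceptive_model))
```

With equal variances the rule reduces to a cutoff halfway between the two means; raising `threshold` trades missed detections for fewer false alarms.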

Projected Exploitability Descent for Nash Equilibrium Computation in Multiplayer Imperfect-Information Games cs.GT

Many important games have more than two players and imperfect information. Existing approaches for computing Nash equilibrium, the central game-theoretic solution concept, in such games either lack scalability or obtain poor performance. In this paper we introduce a new algorithm called projected exploitability descent (PED) for approximating Nash equilibria in multiplayer games of imperfect information. The algorithm works by running projected subgradient descent minimizing a proxy for the multiplayer generalized exploitability function. The objective is nonconvex and nonsmooth, but can be represented as the sum of the maxima of linear functions, for which a subgradient can easily be computed and projected to the polytope of feasible sequence-form strategies. We explore performance of PED on a generalized version of the well-studied benchmark game three-player Kuhn poker. No prior exact algorithms scale to the version of the game with deck size larger than 4, and we compare performance to the popular algorithms of fictitious play (FP) and counterfactual regret minimization (CFR). We find that PED obtains a consistent near-monotonic improvement throughout all runs, though both FP and CFR perform significantly better in the initial iterations. This inspires a hybrid algorithm FP-PED that runs FP for an initial burn-in period before switching to PED for stable long-run refinement. We can alternatively view this as a multi-step algorithm that runs FP as a pre-processing step to obtain a strong initialization for PED.
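The core loop the abstract describes, a subgradient step on a sum of maxima of linear functions followed by a projection, can be sketched generically. This is not PED itself: the sketch projects onto the probability simplex rather than the sequence-form polytope, uses a convex toy objective, and all function names and the step-size schedule are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def objective_and_subgradient(x, groups):
    """Evaluate f(x) = sum over groups of max over pieces of (a . x + b).

    groups is a list of groups; each group is a list of (a, b) linear pieces.
    A subgradient is the sum of the coefficient vectors of the maximising pieces.
    """
    value, grad = 0.0, [0.0] * len(x)
    for pieces in groups:
        best_val, best_a = max(
            ((sum(ai * xi for ai, xi in zip(a, x)) + b, a) for a, b in pieces),
            key=lambda pair: pair[0],
        )
        value += best_val
        grad = [g + ai for g, ai in zip(grad, best_a)]
    return value, grad


def projected_subgradient_descent(groups, x0, steps=200, step_size=0.5):
    """Run projected subgradient descent and return the best iterate seen."""
    x = project_to_simplex(x0)
    best_x, best_val = x, objective_and_subgradient(x, groups)[0]
    for t in range(1, steps + 1):
        _, g = objective_and_subgradient(x, groups)
        eta = step_size / t ** 0.5  # diminishing step size
        x = project_to_simplex([xi - eta * gi for xi, gi in zip(x, g)])
        val = objective_and_subgradient(x, groups)[0]
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Toy objective |x0 - x1|: the opponent's best-response payoff in matching pennies.
pennies = [[([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]]
print(projected_subgradient_descent(pennies, [1.0, 0.0]))
```

Tracking the best iterate matters because subgradient steps are not guaranteed to decrease the objective on every iteration.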

Improved Multi-Dimensional Forecasting for Swap Regret cs.GT

We study the problem of forecasting for an arbitrary number of downstream agents with unknown objectives, each of whom best responds to the forecaster's predictions. We seek a single forecaster that guarantees sublinear swap regret for all downstream agents simultaneously. For two-dimensional outcome spaces, we give a polynomial time algorithm that guarantees $\tilde{O}(\sqrt{kT})$ swap regret for any downstream agent with $k$ actions. This improves over the previously known bound of $\tilde{O}(kT^{5/8})$ and avoids the exponential in $T$ runtime of prior algorithms in this setting. Our algorithm extends nicely to other low dimensional environments, retaining $\tilde{O}(\sqrt{T})$ downstream swap regret while the exponent of $k$ in the regret bound and the exponent of $T$ in the running time both grow with dimension. For arbitrary dimension $d$, we give a forecasting algorithm that guarantees $\tilde{O}(d\sqrt{kT})$ swap regret, assuming the forecaster knows an upper bound $k$ on the number of actions available to any downstream agent, albeit with a much longer runtime. This improves upon previous high dimensional guarantees that had $\tilde{O}(T^{2/3})$ dependence and required additional behavioral assumptions.
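For readers less familiar with the quantity being bounded, swap regret compares what an agent earned against the best it could have earned by consistently remapping each action it played to some other action. A minimal sketch of that definition follows; the function signature and the toy utility are assumptions for illustration and are unrelated to the paper's forecasting algorithm.

```python
from collections import defaultdict


def swap_regret(actions, outcomes, utility, action_set):
    """Swap regret of a played sequence.

    For each action a that was played, find the single replacement b that
    would have gained the most on the rounds where a was played, then sum
    those gains. Keeping a unchanged is allowed, so each term is >= 0.
    """
    gain = defaultdict(lambda: defaultdict(float))
    for a, y in zip(actions, outcomes):
        for b in action_set:
            gain[a][b] += utility(b, y) - utility(a, y)
    return sum(max(gain[a].values()) for a in gain)


def match(action, outcome):
    """Toy utility: 1 when the action matches the outcome, else 0."""
    return 1.0 if action == outcome else 0.0


played = [0, 0, 1, 1]
seen = [1, 1, 0, 0]
print(swap_regret(played, seen, match, action_set=[0, 1]))
```

In this toy sequence the agent is wrong every round, so swapping 0 for 1 and 1 for 0 recovers all four rounds, while the best single fixed action would have recovered only two. That gap is why swap regret is the stronger benchmark.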

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	Claude Opus 4.8	61.4	65	$10.94
2	GPT-5.5	60.2	59	$11.25
3	Claude Opus 4.7	57.3	60	$10.94
4	Gemini 3.1 Pro Preview	57.2	137	$4.50
5	GPT-5.4	56.8	90	$5.63
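To make the table's price and speed columns concrete, the sketch below converts them into cost and wall-clock time for one workload. The 250,000-token job size is hypothetical, and the sketch assumes the price column is a single blended rate per million tokens and that throughput applies to every token in the job; neither assumption is stated in the table.

```python
# Rows copied from the table above: (model, score, tok/s, $ per 1M tokens).
ROWS = [
    ("Claude Opus 4.8", 61.4, 65, 10.94),
    ("GPT-5.5", 60.2, 59, 11.25),
    ("Claude Opus 4.7", 57.3, 60, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 137, 4.50),
    ("GPT-5.4", 56.8, 90, 5.63),
]

JOB_TOKENS = 250_000  # hypothetical workload size, for illustration only


def job_cost(tokens, price_per_million):
    """Dollar cost of a job at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def job_seconds(tokens, tokens_per_second):
    """Seconds needed to stream the job at a constant throughput."""
    return tokens / tokens_per_second


for model, score, speed, price in ROWS:
    cost = job_cost(JOB_TOKENS, price)
    minutes = job_seconds(JOB_TOKENS, speed) / 60
    print(f"{model}: ${cost:.2f}, {minutes:.1f} min, {score / price:.1f} points per $")

cheapest = min(ROWS, key=lambda row: row[3])[0]
print("cheapest per token:", cheapest)
```

Score per dollar is a crude ratio, but it shows how a few points of composite score at the top of the table come with roughly double the per-token price.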

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%

GitHub Repos

Trending

microsoft/markitdown

143443 ★

Python tool for converting files and office documents to Markdown.

anthropics/claude-code

136440 ★

Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.

cursor/plugins

1530 ★

Cursor plugin specification and official plugins

revfactory/harness

7889 ★

A meta-skill that designs domain-specific agent teams, defines specialized agents, and generates the skills they use.

EveryInc/compound-engineering-plugin

19242 ★

Official Compound Engineering plugin for Claude Code, Codex, Cursor, and more

Daily discovery

ARahim3/mlx-tune (Deep Learning)

1267 ★

Fine-tune LLMs on your Mac with Apple Silicon. SFT, DPO, GRPO, and Vision fine-tuning — natively on MLX. Unsloth-compatible API.

Farama-Foundation/ViZDoom (Reinforcement Learning)

2023 ★

Reinforcement Learning environments based on the 1993 game Doom

MilesCranmer/PySR (AutoML)

3558 ★

High-Performance Symbolic Regression in Python and Julia

fluxions-ai/vui (Edge AI)

673 ★

Real-time voice assistant — WebRTC streaming, faster-whisper ASR, local LLM, Vui Nano (300M) TTS. OpenAI Realtime API compatible. Voice cloning, barge-in, ~9× realtime on a 4090. Apache 2.0.

davidjurgens/potato (NLP)

383 ★

potato: the portable annotation tool