The Inference Report

March 2, 2026

The AI industry's carefully constructed narratives are colliding with an uncooperative reality across multiple fronts. The Pentagon's demand that Anthropic accept military use terms by February 27 or risk designation as a supply chain risk represents the most explicit government intervention in AI company operations since the sector's emergence, forcing a safety-first laboratory to choose between its institutional identity and contracts worth billions. This pressure arrives alongside mounting evidence that the physical world is pushing back against AI's expansion: data center builders are discovering that farmers won't sell land even for million-dollar offers, and Microsoft has retreated from aggressive community relations tactics, vowing to cover full power costs, reject local tax breaks, and replenish water usage. These are not PR troubles to be managed but structural constraints on deployment speed and geography, independent of capital availability.

The market's bifurcation is sharpening along predictable lines. AWS and IBM are positioning AI infrastructure as defensible, recurring-revenue moats with enterprise-focused compute layers and autonomous storage management, while IBM's Missile Defense Agency contract and quantum computing partnership with Cisco signal an expanding total addressable market (TAM) in defense. Anthropic, by contrast, is attempting to own the safety narrative through constitutional classifiers, alignment faking research, and a Responsible Scaling Policy version 3.0, less as philanthropy than as market differentiation for enterprise customers facing regulatory scrutiny. The coding agent space is fragmenting under price pressure: Claude Code's $200 monthly pricing has opened room for free alternatives like Block's Goose, while the NousCoder-14B model's ability to train in four days on 48 Nvidia B200 GPUs challenges the assumption that frontier capabilities require frontier resources.

The benchmark landscape exposes the same gap between perception and measurement. SWE-bench showed zero movement in its latest cycle, with the top 35 models frozen in identical positions, while the newly introduced Artificial Analysis framework produces substantially different rankings that raise questions about what these evaluations actually capture. GitHub trending reinforces the shift: the value has migrated from model capabilities to orchestration infrastructure, with memory systems, multi-agent coordination, and document conversion utilities like Microsoft's markitdown gaining traction over model training tools. The industry appears to have accepted that fine-tuning is solved and is now betting heavily on the application layer, even as public skepticism rises through movements like QuitGPT and research demonstrating that LLMs can generate near-verbatim copies of novels from training data. What remains clear is that the gap between AI's investment thesis and its operational reality is narrowing, and the companies best positioned may be those building for the world as it exists rather than as the sector imagined it.

Grant Calloway

AI Labs

No lab headlines.

From the Wire
Research Papers: Focused
JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis cs.SE

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

OpenRath: Session-Centered Runtime State for Agent Systems cs.SE

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines cs.SE

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
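
For readers unfamiliar with the "non-overlapping mean plus or minus standard deviation ranges" criterion mentioned in the abstract, the following is a minimal sketch of one way such a check could work. It is not the authors' code. The function names `mean_std_range` and `clear_win` are hypothetical, the use of the sample standard deviation is an assumption (the abstract does not say which estimator is used), and the scores are made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_std_range(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - std, mean + std) over per-trial scores.

    Assumption: sample standard deviation; needs at least two trials.
    """
    m = mean(scores)
    s = stdev(scores)
    return (m - s, m + s)


def clear_win(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if the whole range of `a` lies strictly above the range of `b`."""
    low_a, _ = mean_std_range(a)
    _, high_b = mean_std_range(b)
    return low_a > high_b


# Illustrative per-trial accuracies (invented, not taken from the paper).
method_a = [70.0, 72.0, 74.0]
method_b = [60.0, 62.0, 64.0]
method_c = [66.0, 69.0, 72.0]

print(clear_win(method_a, method_b))  # ranges are disjoint
print(clear_win(method_a, method_c))  # ranges overlap
```

Under this reading, a win counts as "clear" only when the lower edge of the winner's range sits above the upper edge of the loser's, which is a stricter bar than simply comparing means.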

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns cs.SE

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
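
The stamina idea described above can be illustrated with a small sketch. This is an assumption about how such a metric might be computed, not the benchmark's actual scoring code: here stamina is taken to be the number of consecutive turns passed before the first failure, and the names `stamina` and `mean_stamina` are hypothetical.

```python
from typing import Sequence


def stamina(turn_results: Sequence[bool]) -> int:
    """Count consecutive passed turns before the first failed turn.

    Assumption: a session ends at its first failure, so later passes
    do not count toward the score.
    """
    count = 0
    for passed in turn_results:
        if not passed:
            break
        count += 1
    return count


def mean_stamina(scenarios: Sequence[Sequence[bool]]) -> float:
    """Average stamina across several scenarios."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    return sum(stamina(s) for s in scenarios) / len(scenarios)


# Illustrative pass/fail sequences (invented, not benchmark data).
runs = [
    [True, True, False, True],  # fails on the third turn
    [True, True, True, True],   # never fails within four turns
]
print([stamina(r) for r in runs])
print(mean_stamina(runs))
```

The contrast with a fraction-of-tasks-solved metric is visible in the first run: three of four turns pass, but the score under this definition stops at the first failure.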

Before the Pull Request: Mining Multi-Agent Coordination cs.SE

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development cs.SE

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Gemini 3.1 Pro Preview  57.2   82     $4.50
2  GPT-5.3 Codex           54     74     $4.81
3  Claude Opus 4.6         53     48     $10.00
4  Claude Sonnet 4.6       51.7   29     $6.00
5  GPT-5.2                 51.3   63     $4.81
SWE-rebench

Agentic coding on real-world software engineering tasks

No benchmark data.

Trending