The Inference Report

May 4, 2026

A Harvard study showing large language models outperforming human doctors in emergency room diagnosis arrived this week alongside a more sobering reality: enterprises that have already paid billions for AI systems are discovering that vendors can degrade model performance without warning, leaving IT leaders responsible for systems they own but cannot inspect. The real tension in AI deployment today is not between skeptics and believers but between vendors extracting value through opacity and enterprises trying to build stable systems on shifting sand. When tokenization drift can degrade a model's performance without any change to data or logic, when vendors change model behavior unilaterally, and when IT has little visibility into what happens inside systems it nominally owns, enterprises face a choice between fragility and capitulation. Vendors have structured the relationship so that enterprises absorb all operational risk while vendors retain all control.

The infrastructure underneath reveals where genuine progress is happening. Mistral AI's new 128B model scores 77.6% on SWE-Bench Verified, advancing the frontier on coding agents. Sakana AI's KAME injects LLM knowledge into speech-to-speech systems without adding latency. Developers are learning to systematize prompting, moving beyond trial-and-error iteration toward reliability engineering. Today's research papers cluster around three methodological themes: structured execution and constraint enforcement for language models, multimodal representation learning under efficiency constraints, and domain-specific verification frameworks that diagnose failure modes and enforce invariants at inference time. These are unglamorous moves that matter more than grand claims about transformation.

The leaderboard movements tell a story about where capability is actually concentrating. Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, with GPT-5.2-2025-12-11-medium at 64.4%, but the more striking trend is the rise of Chinese model families: GLM-5 climbed from position 16 to 3, GLM-4.7 rose from 40 to 14, and Kimi K2.5 advanced from 26 to 16. The top tier has consolidated around 62-65% on SWE-rebench, suggesting diminishing returns at the frontier. On GitHub, the trending repositories reveal that developers have moved past asking whether agents can work and are now asking how to orchestrate them reliably. Ruflo, browserbase/skills, and czlonkowski/n8n-mcp represent different layers of the same stack for multi-agent coordination. Cordum's emergence in the discovery layer reflects a maturation curve: when agent frameworks proliferate, control planes and governance become necessary infrastructure.

What unites these signals is a shift from novelty to operations. Water utilities in Singapore have used AI to cut leakage to a rate 75% below that of England and Wales. Restaurants deploy the technology to reduce waste. Recruiters use it to clear administrative work so humans can focus on judgment calls. These are not revolutionary claims but ordinary businesses using a tool where it reduces cost or error in bounded, measurable ways. Yet the C-suite is moving faster than the work warrants. IBM's survey shows the percentage of organizations with a Chief AI Officer jumped from 26% to 76% in a single year, a pattern that reflects herd behavior and insurance against looking unprepared rather than demonstrated strategic value. When three-quarters of surveyed companies adopt the same title in twelve months, you are watching boardroom fashion, not differentiation.

Grant Calloway

Research Papers
HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs cs.CE

We introduce HyCOP, a modular framework that learns parametric PDE solution operators by composing simple modules (advection, diffusion, learned closures, boundary handling) in a query-conditioned way. Rather than learning a monolithic map, HyCOP learns a policy over short programs - which module to apply and for how long - conditioned on regime features and state statistics. Modules may be numerical sub-solvers or learned components, enabling hybrid surrogates evaluated at arbitrary query times without autoregressive rollout. Across diverse PDE benchmarks, HyCOP produces interpretable programs, delivers order-of-magnitude OOD improvements over monolithic neural operators, and supports modular transfer through dictionary updates (e.g., boundary swaps, residual enrichment). Our theory characterizes expressivity and gives an error decomposition that separates composition error from module error and doubles as a process-level diagnostic.
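
The compositional idea is easy to picture in code. Below is a minimal sketch of query-conditioned module composition in roughly the spirit HyCOP describes; the module names, the hand-written policy, and the 1D periodic grid are illustrative assumptions, not the paper's actual components.

    # Sketch only: a dictionary of simple PDE modules plus a stand-in "policy"
    # that picks a short program (which module, how many substeps) from state
    # statistics. The threshold and module set are assumptions for illustration.
    import numpy as np

    def advection(u, dt, speed=1.0):
        # First-order upwind advection step on a periodic 1D grid.
        return u - speed * dt * (u - np.roll(u, 1))

    def diffusion(u, dt, nu=0.1):
        # Explicit diffusion step on a periodic 1D grid.
        return u + nu * dt * (np.roll(u, 1) - 2 * u + np.roll(u, -1))

    MODULES = {"advect": advection, "diffuse": diffusion}

    def select_program(state_stats):
        # Stand-in for the learned policy over short programs.
        if state_stats["gradient_energy"] > 1.0:
            return [("diffuse", 4), ("advect", 2)]
        return [("advect", 4)]

    def rollout(u, dt):
        stats = {"gradient_energy": float(np.mean(np.abs(np.diff(u))))}
        program = select_program(stats)  # interpretable: a list of (module, substeps)
        for name, steps in program:
            for _ in range(steps):
                u = MODULES[name](u, dt)
        return u, program

    u0 = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
    u1, program = rollout(u0, dt=0.01)
    print(program)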

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models cs.CL

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
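
The benchmark construction it describes is easy to approximate: generate a step-wise arithmetic procedure whose steps may look back at earlier intermediate variables, then execute it to obtain the ground-truth value a model's first answer would be scored against. In the sketch below, the operation set and the look-back encoding are assumptions, not the benchmark's actual format.

    # Sketch only: build and execute a look-back arithmetic procedure.
    import random

    def make_procedure(n_steps, seed=0):
        rng = random.Random(seed)
        steps = []
        for i in range(n_steps):
            op = rng.choice(["add", "sub", "mul"])
            # References -2 and -1 denote the inputs x and y; 0..i-1 denote
            # earlier intermediate results (the look-back dependency).
            ref = rng.randrange(-2, i)
            steps.append((op, ref, rng.randint(1, 9)))
        return steps

    def execute(steps, x, y):
        # Ground-truth executor used to score a model's final answer.
        vals = [x, y]
        for op, ref, k in steps:
            v = vals[ref + 2]
            vals.append(v + k if op == "add" else v - k if op == "sub" else v * k)
        return vals[-1]

    procedure = make_procedure(n_steps=10)
    print(execute(procedure, x=3, y=7))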

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
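
The architectural move is a parallel branch next to the FFN that re-attends to cached visual embeddings, so perception does not have to compete with an ever-growing text history in self-attention. A minimal PyTorch sketch follows; the cross-attention retrieval, the learned gate, and the dimensions are assumptions for illustration rather than the paper's actual module.

    # Sketch only: a decoder block with a parallel visual-memory branch.
    import torch
    import torch.nn as nn

    class BlockWithVisualMemory(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
            # Parallel branch: retrieve from cached visual embeddings, independent
            # of how far generation has progressed (distance-agnostic pathway).
            self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(d_model))  # starts closed
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x, visual_tokens):
            h, _ = self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))
            x = x + h
            z = self.norm2(x)
            vis, _ = self.visual_attn(z, visual_tokens, visual_tokens)
            # The visual branch runs in parallel with the FFN and is added back in.
            return x + self.ffn(z) + torch.tanh(self.gate) * vis

    block = BlockWithVisualMemory()
    text_states = torch.randn(1, 128, 512)   # hidden states of generated tokens
    visual_cache = torch.randn(1, 32, 512)   # cached visual embeddings
    print(block(text_states, visual_cache).shape)  # torch.Size([1, 128, 512])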

Can Coding Agents Reproduce Findings in Computational Materials Science? cs.SE

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support (or undermine) such claims. We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only 54.1%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility. Taken together, these findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the current limitations of agentic systems in AI-for-science settings.
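
The success criterion such a benchmark implies can be sketched compactly: a run counts only if the agent both reproduces the claimed quantity and reaches the annotators' supports-or-undermines verdict. The record fields, tolerance, and scoring rule below are assumptions, not AutoMat's published protocol.

    # Sketch only: scoring one claim-reproduction attempt.
    from dataclasses import dataclass

    @dataclass
    class Claim:
        text: str
        reference_value: float
        tolerance: float        # relative tolerance accepted for reproduction
        expert_verdict: bool    # True if experts judge the claim supported

    def score_run(claim, agent_value, agent_verdict):
        # Success requires both a numerically close result and an agreeing verdict.
        close = abs(agent_value - claim.reference_value) <= claim.tolerance * abs(claim.reference_value)
        return close and (agent_verdict == claim.expert_verdict)

    claim = Claim("Bulk modulus of material X is ~150 GPa", 150.0, 0.05, True)
    print(score_run(claim, agent_value=147.2, agent_verdict=True))   # True
    print(score_run(claim, agent_value=210.0, agent_verdict=True))   # False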

Generating Statistical Charts with Validation-Driven LLM Workflows cs.LG

Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable code, dataset context, and question-answer pairs. We present a structured LLM-based workflow that decomposes chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation. By incorporating rendered-output validation, the workflow addresses visualization-specific failure modes such as readability and semantic mismatch. It treats chart generation as an inspectable process rather than a one-shot prompt-to-code task, retaining each chart with its code, dataset context, description, and question-answer pairs. Applied to UCI datasets, the workflow produces 1,500 charts from 74 datasets, spanning 24 chart families and paired with 30,003 question-answer pairs. We evaluate 16 multimodal LLMs (MLLMs) on these chart-question pairs. The results show that chart-syntax questions are nearly saturated, while value extraction, comparison, and reasoning remain more challenging, illustrating the workflow's utility for diagnostic studies of chart-grounded multimodal reasoning.
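
The workflow's defining choice is that validation happens on the rendered image rather than on the code. The sketch below mirrors that control flow, with stub functions standing in for the LLM code-synthesis and MLLM validation calls; everything beyond the loop structure is an assumption.

    # Sketch only: render, validate the rendered output, refine on failure.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    def synthesize_chart(data, proposal, path="chart.png"):
        # Stand-in for LLM code synthesis: here we just draw a bar chart.
        fig, ax = plt.subplots()
        ax.bar(data["category"], data["value"])
        ax.set_title(proposal["title"])
        fig.savefig(path)
        plt.close(fig)
        return path

    def validate_rendered(path, proposal):
        # Stand-in for rendered-output validation (readability, semantic match);
        # a real workflow would show the image to an MLLM with a rubric.
        return {"readable": True, "matches_proposal": True}

    def generate_with_refinement(data, proposal, max_rounds=3):
        report = {}
        for _ in range(max_rounds):
            path = synthesize_chart(data, proposal)
            report = validate_rendered(path, proposal)
            if all(report.values()):
                return path, report
            # In the full workflow, the failure report is fed back to the
            # code-synthesis step so the chart is repaired before re-rendering.
        return None, report

    data = {"category": ["a", "b", "c"], "value": [3, 1, 2]}
    print(generate_with_refinement(data, {"title": "Counts by category"}))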

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution cs.LG

Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., IF, GOTO, FORALL). Beyond syntactic and semantic verification of each step's output against that step's instruction, RunAgent autonomously derives and validates constraints from the description of the task and its instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error-correction mechanisms to ensure correctness. Finally, RunAgent filters the context history, retaining only relevant information during the execution of each step. Evaluations on the Natural Plan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.
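
A toy interpreter makes the constraint-guided execution concrete. The plan format, the control constructs, and the constraint hooks below are illustrative assumptions in the spirit of the description, not RunAgent's actual agentic language.

    # Sketch only: a plan interpreter with IF / GOTO / FORALL and per-step checks.
    def run_plan(plan, context):
        pc = 0
        while pc < len(plan):
            step = plan[pc]
            if step["op"] == "IF":
                pc = step["goto"] if step["cond"](context) else pc + 1
                continue
            if step["op"] == "GOTO":
                pc = step["target"]
                continue
            if step["op"] == "FORALL":
                for item in context[step["over"]]:
                    context = step["body"](context, item)
                pc += 1
                continue
            # Ordinary step: execute, then validate against its derived constraints.
            context = step["run"](context)
            for check in step.get("constraints", []):
                if not check(context):
                    raise ValueError(f"Constraint failed at step {pc}: {check.__name__}")
            pc += 1
        return context

    def positive_total(ctx):
        return ctx["total"] > 0

    plan = [
        {"op": "step", "run": lambda c: {**c, "total": 0}},
        {"op": "FORALL", "over": "items",
         "body": lambda c, x: {**c, "total": c["total"] + x}},
        {"op": "step", "run": lambda c: c, "constraints": [positive_total]},
        {"op": "IF", "cond": lambda c: c["total"] > 10, "goto": 5},
        {"op": "step", "run": lambda c: {**c, "label": "small"}},
    ]
    print(run_plan(plan, {"items": [1, 2, 3]}))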

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                      Score   tok/s   $/1M
1   GPT-5.5                     60.2      74   $11.25
2   Claude Opus 4.7             57.3      56   $10.94
3   Gemini 3.1 Pro Preview      57.2     130   $4.50
4   GPT-5.4                     56.8      89   $5.63
5   Kimi K2.6                   53.9      31   $1.71
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                        Score
1   Claude Opus 4.6              65.3%
2   gpt-5.2-2025-12-11-medium    64.4%
3   GLM-5                        62.8%
4   Junie                        62.8%
5   gpt-5.4-2026-03-05-medium    62.8%