The Inference Report

May 30, 2026

The week's most consequential story is not about what AI can do, but about the growing distance between what it can do and what companies are prepared to think through before deploying it. Amazon employees gamed an internal AI leaderboard into uselessness while ClickUp cut 22 percent of its workforce based on AI agents its executives don't fully understand. The people signing off on AI-driven layoffs demonstrably do not understand the jobs being eliminated. This gap between technical capability and institutional decision-making has become the defining liability of the moment, and it is generating organized resistance. Wikipedia editors are striking over Wikimedia layoffs. Chinese courts now bar AI-justified terminations. A UK think tank backed by the Trades Union Congress called for workers to have real say in AI rollout decisions. These movements are unconnected yet simultaneous, signaling that workers have stopped accepting "AI made us do it" as a closing argument.

The infrastructure being built to extract value from this moment is outpacing the systems proving competent enough to justify it. A startup now offers free home cleaning in exchange for permission to record the session for robot training, normalizing surveillance as a straightforward transaction. Google's Gemini Spark was given access to a user's emails, documents, and calendar to plan a birthday party and failed at the basic task. The systems for invasive data access are permanent; the competence is contingent. Meanwhile, OpenAI is securing regulated use cases and government relationships through methodical institutional positioning. Boston Children's is deploying GPT technology for rare disease diagnosis, Rosalind Biodefense operates with vetted access and government partnerships, and Braintrust uses Codex with GPT-5.5 for engineering workflows. The playbook document on third-party evaluations amounts to OpenAI defining the criteria by which it will be judged. Infrastructure players like AMD and Hugging Face are optimizing for the hardware and engineering problems that follow frontier model deployment, positioning themselves as compute partners to whoever wins.

In research and benchmarking, the pattern is methodological rigor applied to failure modes. Papers cluster around representation design as a prerequisite for generative modeling, verification mechanisms that expose when systems break, and bounded-resource optimization that trades fidelity for computational tractability. SchGen and GPIC show how task-specific semantic representations enable language models to generate structured outputs where standard formats fail. Physics Is All You Need, SoundnessBench, and Self-Trained Verification move beyond aggregate metrics to characterize how and when systems fail. VideoMLA achieves a 92.7 percent KV cache reduction through low-rank bottleneck training rather than by assuming spectral structure. The SWE-rebench rankings show Claude models entering in multiple configuration variants (Opus 4.8-xhigh at 56.4 percent, Sonnet 4.6-high at 51.3 percent) while gpt-5.5-2026-04-23-xhigh holds the top spot at 62.7 percent. Discrepancies between SWE-rebench and Artificial Analysis suggest the benchmarks are testing distinct problem classes or applying different scoring thresholds, a gap that itself deserves investigation.

Developer infrastructure is splitting into two categories: the unglamorous foundation layer of data normalization and the emerging platform layer of agent infrastructure. Markitdown, liteparse, and n2words solve the necessary problem of getting messy inputs into usable forms. The larger cluster is AI agent infrastructure, from Claude Code and Cursor plugins through open alternatives like Hermes and Project N.O.M.A.D. What distinguishes this moment is that many of these tools are explicitly designed to prevent or correct AI output problems. Stop-Slop removes AI tells from prose. Taste-Skill stops generic output. The ECC system adds security and memory to agent harnesses. These solve for quality and control, not capability. Open-source alternatives to commercial platforms suggest developers are building their own stacks rather than waiting for vendors. The signal is clear: the question has shifted from "can we build agents" to "how do we make them reliable enough to ship."

Grant Calloway

Research Papers
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software cs.AI

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]
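The failure the abstract describes, a calibrated correction that passes at the fiducial point but predicts wrong values elsewhere, can be illustrated with a toy sketch. Everything here is hypothetical: `reference` stands in for an oracle quantity, `patched_model` for a structurally wrong model rescaled by a fudge factor, and `omega_m` is only an illustrative parameter name. None of it is taken from CLAX-PT.

```python
import math

def reference(omega_m: float) -> float:
    """Stand-in oracle quantity (hypothetical, not the CLAX-PT physics)."""
    return omega_m ** 2

def patched_model(omega_m: float, fudge: float) -> float:
    """A structurally wrong (linear) model rescaled by a calibrated constant."""
    return fudge * omega_m

FIDUCIAL = 0.3
# Calibrate the fudge factor so the wrong model matches at the fiducial point only.
FUDGE = reference(FIDUCIAL) / FIDUCIAL

def passes(omega_m: float, rel_tol: float = 1e-3) -> bool:
    """Oracle-style check: does the patched model agree with the reference here?"""
    return math.isclose(patched_model(omega_m, FUDGE), reference(omega_m), rel_tol=rel_tol)

def failing_points(points):
    """Return the parameter points at which the patched model disagrees."""
    return [p for p in points if not passes(p)]

if __name__ == "__main__":
    print("passes at fiducial:", passes(FIDUCIAL))
    print("fails at:", failing_points([0.2, 0.3, 0.4]))
```

A test suite that only evaluates `passes(FIDUCIAL)` cannot tell the patch from a correct model; sweeping a few points away from the calibration exposes it immediately, which is the spirit of the first supervision practice listed in the abstract.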

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion cs.CV

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
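Two pieces of arithmetic in this abstract are easy to make concrete: the per-token memory saved by replacing per-head keys and values with a shared latent, and the "99%-energy effective rank" used to argue that pretrained attention is not low-rank. The sketch below is a minimal illustration under stated assumptions. The dimensions in the example are hypothetical and are not the paper's configuration, so the printed reduction is not expected to equal 92.7 percent.

```python
def kv_floats_per_token_mha(num_heads: int, head_dim: int) -> int:
    """Per-head layout: one key and one value vector of size head_dim per head."""
    return 2 * num_heads * head_dim

def kv_floats_per_token_latent(latent_dim: int, rope_dim: int) -> int:
    """Latent layout: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim

def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV memory removed by the latent layout."""
    base = kv_floats_per_token_mha(num_heads, head_dim)
    return 1.0 - kv_floats_per_token_latent(latent_dim, rope_dim) / base

def effective_rank(singular_values, energy: float = 0.99) -> int:
    """Smallest k whose top-k squared singular values hold `energy` of the total."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)

if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to show the calculation.
    print(kv_reduction(num_heads=16, head_dim=128, latent_dim=256, rope_dim=64))
    # A flat spectrum needs nearly every direction to reach 99% of the energy.
    print(effective_rank([1.0] * 50))
    # A sharply decaying spectrum needs very few.
    print(effective_rank([10.0, 1.0, 1.0]))
```

The contrast between the last two calls is the point of the abstract's spectral argument: a matrix whose effective rank sits near its full dimension would be poorly served by truncating its spectrum, so the quality VideoMLA retains has to come from training within the bottleneck rather than from approximating the pretrained weights.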

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation cs.RO

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
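The abstract does not spell out how the simplex volume is computed, so the following is only one plausible reading: treat the three normalized embeddings as the vertices of a triangle and measure its area with the Gram determinant of the edge vectors. The function names and the choice of area formula are assumptions made for illustration, not DynaFLIP's objective.

```python
import math

def normalize(v):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simplex_area(image_emb, text_emb, flow_emb):
    """Area of the triangle whose vertices are the three unit-normalized embeddings."""
    p0, p1, p2 = normalize(image_emb), normalize(text_emb), normalize(flow_emb)
    e1 = [b - a for a, b in zip(p0, p1)]  # edge from image to text
    e2 = [b - a for a, b in zip(p0, p2)]  # edge from image to flow
    gram_det = dot(e1, e1) * dot(e2, e2) - dot(e1, e2) ** 2
    # Clamp tiny negative values caused by floating-point rounding.
    return 0.5 * math.sqrt(max(gram_det, 0.0))

if __name__ == "__main__":
    # Mutually orthogonal embeddings: a large triangle, weak alignment.
    print(simplex_area([1, 0, 0], [0, 1, 0], [0, 0, 1]))
    # Two modalities coincide while the third is far away: the area is still zero.
    print(simplex_area([1, 0, 0], [1, 0, 0], [0, 1, 0]))
```

The second call shows one way a pure area objective can mislead under this reading: the area vanishes as soon as any two vertices coincide, regardless of where the third sits. That is consistent with the abstract's decision to pair volume minimization with a cosine regularizer and a contrastive term, although the paper's own account of the ambiguity may differ.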

LLMSurgeon: Diagnosing Data Mixture of Large Language Models cs.CL

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
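The label-shift framing can be made concrete with a small sketch. Assume a domain classifier with confusion matrix `C`, where `C[i][j]` is the probability that text from true domain `i` is labeled as domain `j`. The observed label frequencies are then `q[j] = sum_i prior[i] * C[i][j]`, and recovering the mixture means inverting that relation while keeping `prior` on the probability simplex. The solver below (projected gradient descent on a least-squares objective) is a stand-in chosen for simplicity, not LLMSurgeon's method, and the matrix and mixture in the example are made up.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate  # keep the threshold from the largest valid k
    return [max(x - theta, 0.0) for x in v]

def predicted_mix(confusion, prior):
    """Label frequencies a classifier with this confusion matrix would report."""
    n = len(prior)
    return [sum(prior[i] * confusion[i][j] for i in range(n)) for j in range(n)]

def recover_prior(confusion, observed, steps=2000):
    """Minimize 0.5 * ||C^T prior - observed||^2 over the simplex."""
    n = len(observed)
    lr = 1.0 / n  # safe step: the gradient's Lipschitz constant is at most n for row-stochastic C
    prior = [1.0 / n] * n
    for _ in range(steps):
        residual = [p - q for p, q in zip(predicted_mix(confusion, prior), observed)]
        grad = [sum(confusion[i][j] * residual[j] for j in range(n)) for i in range(n)]
        prior = project_to_simplex([p - lr * g for p, g in zip(prior, grad)])
    return prior

if __name__ == "__main__":
    # Hypothetical three-domain confusion matrix; each row sums to 1.
    confusion = [[0.8, 0.1, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.2, 0.7]]
    true_prior = [0.5, 0.3, 0.2]
    observed = predicted_mix(confusion, true_prior)
    print("naive estimate:    ", [round(x, 3) for x in observed])
    print("corrected estimate:", [round(x, 3) for x in recover_prior(confusion, observed)])
```

Reading the classifier's label frequencies directly gives the naive estimate, which is skewed by systematic confusion between domains; solving the inverse problem undoes that skew. In practice the confusion matrix itself must be estimated and calibrated, which is the part the abstract attributes to the soft confusion matrix.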

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations cs.AI

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection cs.AI

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Claude Opus 4.8         61.4   67     $10.94
2  GPT-5.5                 60.2   69     $11.25
3  Claude Opus 4.7         57.3   53     $10.94
4  Gemini 3.1 Pro Preview  57.2   129    $4.50
5  GPT-5.4                 56.8   92     $5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  gpt-5.5-2026-04-23-xhigh   62.7%
2  Codex                      60.4%
3  Claude Code                59.6%
4  gpt-5.5-2026-04-23-medium  58.9%
5  Claude Opus 4.8-xhigh      56.4%