The Inference Report

May 13, 2026

Today's news is significant only if you care about where compute actually goes and who controls it. The product launches are noise. Google pitching agentic Gemini, Anthropic expanding into legal services, OpenAI launching Daybreak for cyber defense: these are distribution plays dressed as strategy. What matters is underneath: compute costs are now the binding constraint on AI adoption, and the companies winning aren't the ones with the best models. They're the ones locking in supply first.

The infrastructure arms race is reshaping entire industries and creating collateral damage that reveals the actual priorities. Google and SpaceX are putting data centers in orbit. Grant County Public Utility District is seizing farmland through eminent domain to build transmission lines. New Jersey towns are banning data centers outright. CME is launching futures contracts for GPU rental prices, turning compute capacity into a commodity. Microsoft renegotiated its deal with OpenAI to gain first rights to new models while OpenAI gains freedom to sell elsewhere, but that only matters if OpenAI can access the chips and power it needs. It cannot. Neither can anyone else. The two-tier market is hardening: those who control compute supply will win. Those who don't will compete on margins that don't exist.

The second-order effects are already visible and troubling. GitLab is warning customers that monthly bills for AI-enabled developer tools will rise from hundreds of dollars per seat to thousands within a year, driven entirely by the volume of compute agents consume. A Microsoft benchmark called DELEGATE-52 found that 19 large language models are error-prone and unreliable at multi-step tasks outside of Python programming, yet the industry is shipping agents into healthcare, legal services, and cyber defense anyway. Medicare's new ACCESS payment model creates the first governmental mechanism to reimburse AI agents for patient monitoring between visits, accelerating adoption of systems we know aren't trustworthy at complex reasoning. A teenager died after ChatGPT recommended a drug combination. Anthropic is warning investors against secondary trading platforms, suggesting the company is bracing for scrutiny. The infrastructure is being built faster than the systems using it are being tested, and the incentives are all wrong.

The research and benchmark data confirm that progress is now coming from better problem decomposition and structural optimization, not parameter scaling. Claude Opus 4.6 jumped from 9th to 1st on SWE-rebench at 65.3 percent, a 12.3-point gain, yet it sits 9th on Artificial Analysis at 53 percent; either the two benchmarks measure different capabilities, or evaluation design is hiding what these models actually do. Chinese models moved dramatically up the coding rankings: GLM-5 climbed from 17th to 3rd, Kimi K2 Thinking from 54th to 21st. Neither benchmark discloses test set size, problem distribution, or how edge cases are handled, so there is no way to know whether these gaps reflect genuine capability differences or evaluation artifacts. On GitHub, the trending repos chase agent capabilities while discovery repos solve concrete infrastructure problems: real-time inference optimization, domain-specific applications, synthetic data pipelines. The gap between what's viral and what's actually being deployed reveals the disconnect between the narrative and the work that ships.

Grant Calloway

Research Papers
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward [cs.CV]

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
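
The reward design is the interesting part, and it maps naturally to code. Below is a minimal sketch of a DVReward-style decompositional reward under stated assumptions: `llm_decompose` and `mllm_verify` are hypothetical placeholders standing in for the paper's LLM decomposer and MLLM verifier, not its actual implementation.

```python
# Sketch of a decompositional verifiable reward: decompose the request into
# atomic yes/no questions, verify each, and average into a scalar reward.

def llm_decompose(request: str) -> list[str]:
    """Placeholder: an LLM would split the request into atomic,
    verifiable semantic and quality questions."""
    return [
        f"Does the image contain the subject described in: {request!r}?",
        "Is the image free of obvious visual artifacts?",
    ]

def mllm_verify(image, question: str) -> float:
    """Placeholder: an MLLM would answer each question,
    returning 1.0 for yes and 0.0 for no."""
    return 1.0

def dv_reward(image, request: str) -> float:
    """Average per-question verdicts instead of one holistic score,
    giving interpretable, per-criterion feedback."""
    questions = llm_decompose(request)
    return sum(mllm_verify(image, q) for q in questions) / len(questions)
```

The averaging step is where the claimed stability comes from: each atomic question is easy to verify, so the aggregate is less noisy than a single holistic scalar.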

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues [cs.CL]

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with interaction histories of up to 500 trajectories and 115M tokens. We use a context-gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding-agent-based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
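
The context-gathering formulation is easy to picture in code. Here is a toy sketch of the RAG-flavored variant: a bag-of-words overlap score stands in for a real retriever, and the class name and interface are assumptions, not the paper's implementation.

```python
# Toy context-gathering memory: ingest trajectory steps into a pool,
# then return the k most relevant steps as compact evidence for QA.

def overlap(a: str, b: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

class RunbookMemory:
    """RAG-style pool over raw trajectory steps (AgentRunbook-R flavor)."""

    def __init__(self):
        self.pool: list[str] = []

    def ingest(self, trajectory: list[str]) -> None:
        self.pool.extend(trajectory)

    def gather(self, question: str, k: int = 3) -> list[str]:
        """Return compact evidence instead of the full 115M-token history."""
        return sorted(self.pool, key=lambda s: overlap(question, s),
                      reverse=True)[:k]

memory = RunbookMemory()
memory.ingest(["clicked export button, got CSV",
               "login fails after midnight UTC"])
print(memory.gather("why does login fail at night?"))
```

The coding-agent variant replaces `gather` with an agent that actively searches trajectory files, which is where the accuracy gain and the latency cost both come from.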

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation [cs.LG]

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.
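
The core claim, that left and right orthogonal transformations leave singular values untouched, can be checked numerically. The sketch below is not the Pion update rule; it builds orthogonal factors via Cayley transforms of random skew-symmetric generators and verifies that the spectrum of the weight matrix is preserved.

```python
# Numerical check: W -> U @ W @ V.T with orthogonal U, V preserves
# the singular values of W (spectrum-preserving update).
import numpy as np

def cayley(a: np.ndarray) -> np.ndarray:
    """Map a skew-symmetric matrix to an orthogonal one: (I - A)(I + A)^-1."""
    i = np.eye(a.shape[0])
    return (i - a) @ np.linalg.inv(i + a)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))        # weight matrix
ga = rng.normal(size=(4, 4))
ga = ga - ga.T                     # skew-symmetric generator (left)
gb = rng.normal(size=(6, 6))
gb = gb - gb.T                     # skew-symmetric generator (right)
u, v = cayley(ga), cayley(gb)      # orthogonal factors

w_new = u @ w @ v.T                # spectrum-preserving "update"
print(np.allclose(np.linalg.svd(w, compute_uv=False),
                  np.linalg.svd(w_new, compute_uv=False)))  # True
```

Because the spectral norm is the largest singular value, any update of this form keeps it fixed by construction, which is the stability property the abstract highlights.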

Elastic Attention Cores for Scalable Vision Transformers [cs.CV]

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches interact directly only with a resolution-invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for a predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
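
The routing pattern is simple enough to sketch. Below is a minimal core-periphery attention step in the spirit of the abstract; the two-pass structure and shapes are assumptions, and the real VECA block will differ in detail (projections, normalization, nested training).

```python
# Core-periphery attention: N patches never attend to each other; they
# exchange information only through C core tokens, so cost is O(N*C).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def core_attention(patches, cores):
    """patches: (N, d), cores: (C, d). One gather/broadcast round trip."""
    d = patches.shape[-1]
    # Step 1: cores gather from all N patches (C x N attention map).
    cores = cores + softmax(cores @ patches.T / np.sqrt(d)) @ patches
    # Step 2: patches read back from the C cores (N x C attention map).
    patches = patches + softmax(patches @ cores.T / np.sqrt(d)) @ cores
    return patches, cores

rng = np.random.default_rng(0)
p, c = core_attention(rng.normal(size=(196, 64)), rng.normal(size=(8, 64)))
print(p.shape, c.shape)  # (196, 64) (8, 64)
```

Both attention maps are $N \times C$ rather than $N \times N$, which is the entire source of the linear scaling; since all $N$ patch tokens survive each layer, there is no $C$-way bottleneck on the token stream itself.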

Task-Adaptive Embedding Refinement via Test-time LLM Guidance [cs.CL]

We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus scale. We release our experimental code for reproducibility at https://github.com/IBM/task-aware-embedding-refinement.
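
The abstract does not spell out the update rule, but a Rocchio-style refinement illustrates the idea: the LLM's relevance verdicts on a few retrieved documents pull the query embedding toward positives and away from negatives. The `llm_judge` helper and the update coefficients here are assumptions, not the paper's method.

```python
# Rocchio-style sketch of LLM-guided query refinement: shift the query
# embedding toward LLM-judged-relevant docs, away from irrelevant ones.
import numpy as np

def llm_judge(query: str, doc: str) -> bool:
    """Placeholder: a generative LLM would judge relevance here."""
    return "solar" in doc

def refine(q_emb, doc_embs, verdicts, alpha=0.5, beta=0.25):
    """One refinement step; alpha/beta weight the pull and push terms."""
    pos = [e for e, v in zip(doc_embs, verdicts) if v]
    neg = [e for e, v in zip(doc_embs, verdicts) if not v]
    if pos:
        q_emb = q_emb + alpha * np.mean(pos, axis=0)
    if neg:
        q_emb = q_emb - beta * np.mean(neg, axis=0)
    return q_emb / np.linalg.norm(q_emb)

docs = ["solar panel specs", "pasta recipe"]
doc_embs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
verdicts = [llm_judge("solar output", d) for d in docs]
print(refine(np.array([0.5, 0.5]), doc_embs, verdicts))
```

The appeal is the cost profile: the LLM sees only a handful of documents per query, while the corpus-scale work stays in the cheap embedding space.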

Learning, Fast and Slow: Towards LLMs That Adapt Continually [cs.LG]

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but typically cannot by itself match the performance gains available through updating LLM parameters. There is no good reason to restrict learning to either in-context or in-weights alone. Humans, too, likely learn at different time scales (e.g., System 1 vs. 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb task-specific information, while allowing slow weights to stay closer to the base model and preserve general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than slow-only learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
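
The fast-slow split reduces to a training loop with two clocks. The sketch below is schematic and every helper is a hypothetical placeholder: the optimized context ("fast weights") absorbs textual feedback every step, while the parameters ("slow weights") get occasional RL updates.

```python
# Schematic fast-slow training loop: context updates every step (fast),
# parameter updates happen on a slower clock (slow).

def run_task(params, context: str) -> tuple[float, str]:
    """Placeholder: roll out the LLM and return (reward, textual feedback)."""
    return 0.0, "be more concise"

def revise_context(context: str, feedback: str) -> str:
    """Placeholder: an LLM would rewrite the context using the feedback."""
    return context + f"\n# note: {feedback}"

def rl_update(params, reward: float):
    """Placeholder: one conservative slow-weight (parameter) update."""
    return params

def fast_slow_train(params, context: str, steps: int = 100, slow_every: int = 10):
    for step in range(steps):
        reward, feedback = run_task(params, context)
        context = revise_context(context, feedback)   # fast: every step
        if step % slow_every == 0:
            params = rl_update(params, reward)        # slow: occasionally
    return params, context
```

Pushing task specifics into the fast context is what lets the slow weights stay near the base model, which is where the reported KL and plasticity gains would come from.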

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M tokens
1  GPT-5.5                  60.2     63       $11.25
2  Claude Opus 4.7          57.3     64       $10.94
3  Gemini 3.1 Pro Preview   57.2    131        $4.50
4  GPT-5.4                  56.8     84        $5.63
5  Kimi K2.6                53.9     41        $1.71

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  Junie                      62.8%
5  gpt-5.4-2026-03-05-medium  62.8%