The Inference Report

May 19, 2026

Elon Musk's two-hour loss to OpenAI in court has cleared the last major legal obstacle to Sam Altman's IPO path and signaled that judges will not second-guess the business decisions of AI founders suing their own companies. The verdict removes distraction and accelerates exactly what Musk claimed to oppose: vertical integration of deployment and customer lock-in. Anthropic acquired Stainless, the SDK automation startup used by OpenAI and Google, to control how developers interact with Claude. PwC announced it will train 30,000 staff on Claude and build business operations around it. OpenAI stood up DeployCo, a $4 billion deployment company sending engineers on-site to embed GPT models into customer workflows. These are moats, not partnerships.

The competitive pressure has shifted decisively from model capability to production deployment. NVIDIA shipped its first CPU designed specifically for agent workloads to Anthropic, OpenAI, and SpaceX. OpenAI and Dell are bundling Codex into on-premise environments. Microsoft's deployment at Regis, an aged care provider using AI for administrative work, demonstrates where the money actually moves: when AI handles workflows with measurable labor economics. Claude Opus 4.6 jumped to first place on SWE-rebench with a 12.4-point gain to 65.3%, though the magnitude of movement across benchmarks raises questions about whether underlying test sets or evaluation methodology changed between rounds. The infrastructure layer itself is consolidating faster than software. NextEra and Dominion's $420 billion merger will cement control of US data center capacity. Google committed $5 billion to a Blackstone-backed AI cloud group. Researchers at Penn demonstrated light-matter hybrid particles for ultra-efficient AI computing. The physics and real estate of AI are being claimed.

GitHub's trending repositories confirm developers have stopped building agents and started building infrastructure around them. Humanlayer's 12-factor-agents and tech-leads-club's agent-skills treat agent development as a discipline requiring principles rather than magic. CLI-Anything wraps command-line tools to make existing software agent-native. Lobehub organizes agents into operational teams. The pattern cuts across categories: practical solutions to making agents reliable and composable in production, not just capable in isolation. Supertone's on-device multilingual TTS and CloakBrowser's bot detection solve concrete friction points around latency, cost, and vendor lock-in. The companies winning are those embedding themselves into customer operations before the customer realizes they have a choice. The noise around AI safety and quality has decoupled entirely from where capital and control are flowing. Deployment wins. Everything else is theater.

Grant Calloway

AI Labs

Anthropic

Anthropic acquires Stainless

GitHub Blog

Take your local GitHub sessions anywhere

Hugging Face

Microsoft

At aged care provider Regis, AI takes on paperwork so staff can focus on residents

NVIDIA

OpenAI

OpenAI and Dell partner to bring Codex to hybrid and on-premise enterprise environments

From the Wire

Research Papers

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse α-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves accuracy comparable to full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.
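
To make the contrast between top-k and adaptive selection concrete, here is a minimal sketch using sparsemax, which is the α = 2 case of α-entmax. It is not the paper's implementation: the function names, the choice of α = 2, and the example scores are all assumptions for illustration. It shows only the first-stage block selection, not the second-stage softmax or the Triton kernel.

```python
def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): maps scores to a sparse
    probability vector. Assumed here as a stand-in for the paper's entmax."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # Position k stays in the support while 1 + k * z_k exceeds the
        # running sum of the k largest scores; keep the threshold from the
        # largest k that satisfies this.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


def select_blocks(block_scores):
    """Return indices of KV blocks that receive nonzero sparsemax weight."""
    weights = sparsemax(block_scores)
    return [i for i, w in enumerate(weights) if w > 0.0]


def select_top_k(block_scores, k):
    """Fixed-size top-k selection, for comparison."""
    ranked = sorted(range(len(block_scores)), key=lambda i: -block_scores[i])
    return sorted(ranked[:k])


if __name__ == "__main__":
    peaked = [2.0, 1.0, 0.1]        # one block clearly dominates
    flat = [1.0, 0.9, 0.8, -1.0]    # three blocks are similarly relevant
    print(select_blocks(peaked), select_top_k(peaked, 2))
    print(select_blocks(flat), select_top_k(flat, 2))
```

With the peaked scores the sparse transformation keeps a single block, and with the flat scores it keeps three, whereas top-k returns exactly k blocks in both cases.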

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability cs.DC

Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84× while preserving training correctness.
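
The difference between following a schedule and using it as a hint can be sketched with a toy single-stage simulation. Everything here is hypothetical: the task names, ready times, and durations are invented, and the model ignores inter-stage dependencies, communication, and tensor-parallel coordination, all of which the real system has to handle.

```python
def run_fixed_order(tasks, hint_order):
    """Execute tasks strictly in hint_order, idling until each one is ready.

    tasks maps a task name to (ready_at, duration).
    """
    clock = 0.0
    for name in hint_order:
        ready_at, duration = tasks[name]
        clock = max(clock, ready_at) + duration
    return clock


def run_readiness_first(tasks, hint_order):
    """Among tasks that are ready now, run the one ranked earliest by the hint."""
    rank = {name: i for i, name in enumerate(hint_order)}
    pending = set(tasks)
    clock = 0.0
    executed = []
    while pending:
        ready = [n for n in pending if tasks[n][0] <= clock]
        if not ready:
            # Nothing is runnable: advance to the next arrival.
            clock = min(tasks[n][0] for n in pending)
            continue
        chosen = min(ready, key=rank.__getitem__)
        pending.remove(chosen)
        clock += tasks[chosen][1]
        executed.append(chosen)
    return clock, executed


if __name__ == "__main__":
    # Hypothetical workload: F1 arrives late, B0 is ready early.
    tasks = {"F0": (0.0, 1.0), "F1": (3.0, 1.0), "B0": (1.0, 2.0)}
    hint = ["F0", "F1", "B0"]
    print(run_fixed_order(tasks, hint))      # fixed order idles waiting for F1
    print(run_readiness_first(tasks, hint))  # B0 fills the gap instead
```

In this toy case the fixed-order run finishes at time 6.0 because the stage idles until F1 arrives, while the readiness-first run finishes at 4.0 by executing B0 in the gap.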

Code as Agent Harness cs.CL

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make the harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop cs.CV

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate stat.ML

Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting a mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce URGE, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights as in previous work, URGE attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, URGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.
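
The general pattern of attaching a multiplicative path weight and resampling periodically can be sketched on a toy problem. This is not the paper's algorithm; the abstract does not give its weight formula. The sketch assumes a one-dimensional Brownian motion as the base process and a known constant drift (`tilt`) as the target, so the Girsanov log-weight increment is `tilt * dW - 0.5 * tilt**2 * dt`. All names and default values are hypothetical.

```python
import math
import random


def girsanov_smc(n_particles=2000, n_steps=20, horizon=1.0, tilt=1.0,
                 resample_every=5, seed=0):
    """Estimate E[X_T] under dX = tilt dt + dW by simulating dX = dW,
    reweighting each path with its Girsanov factor, and resampling."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    xs = [0.0] * n_particles
    log_w = [0.0] * n_particles
    for step in range(1, n_steps + 1):
        for i in range(n_particles):
            dw = rng.gauss(0.0, math.sqrt(dt))
            xs[i] += dw  # base dynamics only, no drift and no gradient
            # Multiplicative path weight, accumulated in log space.
            log_w[i] += tilt * dw - 0.5 * tilt * tilt * dt
        if step % resample_every == 0 and step < n_steps:
            # Multinomial resampling proportional to the current weights,
            # after which all weights are reset to uniform.
            top = max(log_w)
            weights = [math.exp(lw - top) for lw in log_w]
            xs = rng.choices(xs, weights=weights, k=n_particles)
            log_w = [0.0] * n_particles
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, xs)) / total


if __name__ == "__main__":
    # Under the tilted dynamics, E[X_T] equals tilt * horizon.
    print(girsanov_smc())
```

The weighted estimate should land near `tilt * horizon` even though every particle was simulated without drift, which is the sense in which the reweighting replaces explicit guidance in this toy setting.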

Actionable World Representation cs.AI

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Conveniently, its fully differentiable structure enables future integration with policy learning and neural dynamics.

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M
1	GPT-5.5	60.2	70	$11.25
2	Claude Opus 4.7	57.3	49	$10.94
3	Gemini 3.1 Pro Preview	57.2	117	$4.50
4	GPT-5.4	56.8	79	$5.63
5	Kimi K2.6	53.9	68	$1.71

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	Junie	62.8%
5	gpt-5.4-2026-03-05-medium	62.8%

GitHub Repos

Trending

tinyhumansai/openhuman

24209 ★

Your Personal AI super intelligence. Private, Simple and extremely powerful.

Imbad0202/academic-research-skills

36173 ★

Academic Research Skills for Claude Code: research → write → review → revise → finalize

HKUDS/CLI-Anything

39396 ★

"CLI-Anything: Making ALL Software Agent-Native" -- CLI-Hub: https://clianything.cc/

K-Dense-AI/scientific-agent-skills

24651 ★

A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing.

supertone-inc/supertonic

8556 ★

Lightning-Fast, On-Device, Multilingual TTS — running natively via ONNX.

Daily discovery

modelscope/modelscope (NLP)

9028 ★

ModelScope: bring the notion of Model-as-a-Service to life.

Team-Commonly/commonly (Autonomous Agents)

558 ★

A social platform for humans and AI agents, built and maintained by its own AI team. Connect any agent via HTTP.

1jehuang/jcode (ai)

7452 ★

Coding Agent Harness

ModelTC/LightLLM (Deep Learning)

4068 ★

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

MingSun-Tse/Efficient-Deep-Learning (Model Compression)

954 ★

Collection of recent methods on (deep) neural network compression and acceleration.