Much like the shift from open-source Unix systems to proprietary enterprise software in the 1990s, the AI market is reorganizing itself around control rather than capability. The Pentagon's swift punishment of Anthropic for refusing to lift military restrictions on Claude, followed by OpenAI's immediate announcement of a classified Defense Department contract, reveals that regulatory capture has moved from theoretical concern to operational fact. Within hours, the government had selected its preferred vendor and eliminated the alternative. This was not a negotiation. It was a supply chain decision dressed as policy. Anthropic CEO Dario Amodei called OpenAI's messaging "straight up lies," but the asymmetry speaks louder than his objections: companies that resist military use face federal bans; companies that comply win contracts and market share. The incentive structure now runs directly against safety constraints.
The broader market is already responding to this signal. Xpeng, Smack Technologies, and others are training models specifically for battlefield operations. Every AI company will now calibrate its safety practices around which government contracts it wants to win. Anthropic's principled stance against unrestricted military use has become a competitive disadvantage. Meanwhile, the lab announcements reveal a field increasingly organized around two competing visions: centralized systems designed for government and regulated sectors, versus open infrastructure for developers and enterprises. AWS, GitHub, and AMD are shipping agent frameworks and inference tools explicitly designed for private, on-premises deployment. Google, Meta, and NVIDIA are positioning specialized models and telecommunications infrastructure as the next frontier. The competition is no longer primarily about raw model capability. It is about where models run, who controls them, and what workflows they unlock.
Across the technical stack, the move toward agents and autonomy as product has accelerated. GitHub shipped multi-agent workflows and self-review. AWS launched OpenClaw for private agents. Anthropic acquired Vercept for computer use and released Claude Sonnet 4.6. Qwen released an agentic coding model; AI21 Labs published work on modular intelligence for agent orchestration. These are deployment patterns, not research papers about future possibilities. GitHub's trending repositories show developers building task-specific agents wrapped in frameworks that provide visibility and trust. The real traction is in the plumbing: Flowise's visual agent builder and AgentScope's emphasis on auditability are winning because agents are now the unit of work, but teams need ways to compose, debug, and monitor them without writing orchestration code from scratch. The agent layer is still too young for standardization and too diverse in requirements for one solution to dominate.
On the benchmark front, Claude Code maintains its lead on SWE-rebench at 52.9%, with gpt-5.2 variants and Claude Opus 4.6 clustered tightly behind it. More notable movement occurs in the mid-tier, where Kimi K2 Thinking climbed 2.9 percentage points and GLM-5 experienced a sharper decline of 7.7 points, raising questions about consistency between evaluation frameworks. These movements reflect the inherent variance in code generation evaluation, where small absolute score differences can mask meaningful changes in model capability. The research papers cluster around latent-space reformulations of classical problems, test-time adaptation mechanisms, and structured reasoning through intermediate supervision, all reflecting a shared concern with handling distribution shift and non-stationarity. The field is moving beyond fixed-parameter models toward systems that adjust their operating assumptions as conditions evolve. What connects all of this is a fundamental shift in where power resides: no longer in the lab or the model itself, but in whoever controls the constraints that determine how the system can be used.
Grant Calloway
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to effectively capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. We demonstrate that this approach achieves state-of-the-art results across all tasks through extensive experiments on a wide range of benchmark datasets, including Human3.6M, AMASS, ETH-UCY, and 3DPW.
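The "stack of self-attention modules" the abstract describes can be caricatured in a few lines. Below is a minimal sketch of one scaled dot-product self-attention module over a sequence of pose tokens; a real model would add multiple heads, residual connections, and learned pose embeddings, and all shapes and weights here are invented, not taken from the paper.

```python
import numpy as np

# One scaled dot-product self-attention module over a motion sequence.
# Shapes and random weights are illustrative only.

def self_attention(x, w_q, w_k, w_v):
    """x: (T, D) pose tokens over T frames; returns (T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # frame-to-frame affinity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax over time
    return w @ v                                   # attention-weighted mix

rng = np.random.default_rng(0)
T, D = 8, 16                       # 8 frames, 16-dim pose embedding
x = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)
```

Stacking such modules lets the same mechanism capture dependencies among joints within a pose and across frames, which is what makes one architecture serve pose-only, trajectory-only, and combined prediction.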
Data assimilation (DA) combines model forecasts and observations to estimate the optimal state of the atmosphere with its uncertainty, providing initial conditions for weather prediction and reanalyses for climate research. Yet, existing traditional and machine-learning DA methods struggle to achieve accuracy, efficiency and uncertainty quantification simultaneously. Here, we propose HLOBA (Hybrid-Ensemble Latent Observation-Background Assimilation), a three-dimensional hybrid-ensemble DA method that operates in an atmospheric latent space learned via an autoencoder (AE). HLOBA maps both model forecasts and observations into a shared latent space via the AE encoder and an end-to-end Observation-to-Latent-space mapping network (O2Lnet), respectively, and fuses them through a Bayesian update with weights inferred from time-lagged ensemble forecasts. Both idealized and real-observation experiments demonstrate that HLOBA matches dynamically constrained four-dimensional DA methods in both analysis and forecast skill, while achieving end-to-end inference-level efficiency and a theoretical flexibility that applies to any forecasting model. Moreover, by exploiting the error decorrelation property of latent variables, HLOBA enables element-wise uncertainty estimates for its latent analysis and propagates them to model space via the decoder. Idealized experiments show that this uncertainty highlights large-error regions and captures their seasonal variability.
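The element-wise Bayesian update that error decorrelation makes possible is essentially a per-dimension Kalman step. The sketch below is a hedged illustration of that idea, with a synthetic ensemble standing in for the time-lagged forecasts; dimensions, variances, and variable names are invented, not from the paper.

```python
import numpy as np

# Element-wise Bayesian (Kalman-style) update in a toy latent space.
# All quantities are synthetic stand-ins for HLOBA's inputs.

rng = np.random.default_rng(1)
d = 6                                        # toy latent dimension
truth = rng.standard_normal(d)               # unknown true latent state

ensemble = truth + rng.normal(0.0, 0.5, size=(10, d))   # lagged forecasts
x_b = ensemble.mean(axis=0)                  # background state
var_b = ensemble.var(axis=0, ddof=1)         # ensemble background variance

var_o = np.full(d, 0.1)                      # assumed observation variance
y = truth + rng.normal(0.0, np.sqrt(var_o))  # observation mapped to latent

gain = var_b / (var_b + var_o)               # per-element Kalman gain
x_a = x_b + gain * (y - x_b)                 # latent analysis
var_a = (1.0 - gain) * var_b                 # element-wise uncertainty

print(var_a < var_b)                         # analysis always sharpens
```

Because each latent element is updated independently, the uncertainty `var_a` comes out element-wise for free, which is what the decoder then propagates back to model space.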
The discovery rate of optical transients will explode to 10 million public alerts per night once the Vera C. Rubin Observatory's Legacy Survey of Space and Time comes online, overwhelming the traditional physics-based inference pipelines. A continuous-time forecasting AI model is of interest because it can deliver millisecond-scale inference for thousands of objects per day, whereas legacy MCMC codes need hours per object. In this paper, we propose SELDON, a new continuous-time variational autoencoder for panels of sparse and irregularly time-sampled (gappy) astrophysical light curves that are nonstationary, heteroscedastic, and inherently dependent. SELDON combines a masked GRU-ODE encoder with a latent neural ODE propagator and an interpretable Gaussian-basis decoder. The encoder learns to summarize panels of imbalanced and correlated data even when only a handful of points are observed. The neural ODE then integrates this hidden state forward in continuous time, extrapolating to future unseen epochs. This extrapolated time series is further encoded by deep sets to a latent distribution that is decoded to a weighted sum of Gaussian basis functions, the parameters of which are physically meaningful. Such parameters (e.g., rise time, decay rate, peak flux) directly drive downstream prioritization of spectroscopic follow-up for astrophysical surveys. Beyond astronomy, the architecture of SELDON offers a generic recipe for interpretable and continuous-time sequence modeling in any time domain where data are multivariate, sparse, heteroscedastic, and irregularly spaced.
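The interpretable Gaussian-basis decoder at the end of the pipeline is the easiest piece to make concrete: the reconstructed light curve is a weighted sum of Gaussians whose parameters (amplitude as a proxy for peak flux, center for time of peak, width for rise/decay timescale) can be read off directly. The values below are invented for illustration, not fitted to data.

```python
import numpy as np

# Interpretable Gaussian-basis decoder: a light curve as a weighted
# sum of Gaussian basis functions. All parameter values are invented.

def gaussian_basis_decode(t, amps, centers, widths):
    """Evaluate sum_k amps[k] * exp(-(t - centers[k])^2 / (2 widths[k]^2))."""
    t = np.asarray(t)[:, None]                                    # (T, 1)
    return (amps * np.exp(-0.5 * ((t - centers) / widths) ** 2)).sum(axis=1)

t = np.linspace(0.0, 10.0, 101)
amps = np.array([2.0, 0.5])       # component peak fluxes
centers = np.array([3.0, 7.0])    # times of maximum
widths = np.array([0.8, 1.5])     # rise/decay timescales

flux = gaussian_basis_decode(t, amps, centers, widths)
print(flux.shape)                 # peak flux stays close to the dominant amplitude
```

Because each basis parameter has a physical reading, thresholds on them (e.g., fast rise plus high peak flux) can drive follow-up prioritization without a separate interpretation step.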
WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. We propose a dual-helix governance framework reframing these challenges as structural governance problems that model capacity alone cannot resolve. We implement the framework as a 3-track architecture (Knowledge, Behavior, Skills) that uses a knowledge graph substrate to stabilize execution by externalizing domain facts and enforcing executable protocols, complemented by a self-learning cycle for autonomous knowledge growth. Applying this to the FutureShorelines WebGIS tool, a governed agent refactored a 2,265-line monolithic codebase into modular ES6 components. Results demonstrated a 51\% reduction in cyclomatic complexity and a 7-point increase in maintainability index. A comparative experiment against a zero-shot LLM confirms that externalized governance, not just model capability, drives operational reliability in geospatial engineering. This approach is implemented in the open-source AgentLoom governance toolkit.
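The "externalized governance" idea can be caricatured in a few lines: before acting, the agent consults a persistent store of domain facts and protocols instead of trusting what the LLM recalls across sessions. The facts and the protocol rule below are invented for illustration and are not taken from AgentLoom.

```python
# Toy externalized-governance check: actions are validated against a
# persistent knowledge substrate, not cross-session LLM memory.
# Graph contents and rules are invented.

knowledge_graph = {
    ("FutureShorelines", "module_style"): "ES6",
    ("FutureShorelines", "max_cyclomatic"): 10,
}

def governed(action, facts=knowledge_graph):
    """Allow a write only if it satisfies the stored protocol facts."""
    if action["kind"] == "write_module":
        if action["style"] != facts[("FutureShorelines", "module_style")]:
            return False, "module style violates stored protocol"
        if action["complexity"] > facts[("FutureShorelines", "max_cyclomatic")]:
            return False, "cyclomatic complexity exceeds stored bound"
    return True, "ok"

ok, _ = governed({"kind": "write_module", "style": "ES6", "complexity": 7})
bad, why = governed({"kind": "write_module", "style": "CommonJS", "complexity": 7})
print(ok, bad, why)
```

The point of the comparison against a zero-shot LLM is exactly this: the constraint lives outside the model, so it survives context limits, forgetting, and stochastic regeneration.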
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
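The test-time-training (TTT) layer that does the "zipping" can be sketched in miniature: the layer's weight matrix is the hidden scene state, updated by one gradient step of a self-supervised loss per incoming token, so cost grows linearly with stream length while the state stays fixed-size. Projections, sizes, and the learning rate below are invented, not ZipMap's.

```python
import numpy as np

# Minimal TTT-layer sketch: weights-as-state, one SGD step per token.
# All projections and hyperparameters are illustrative.

rng = np.random.default_rng(2)
d = 4
W = np.zeros((d, d))                         # hidden scene state
theta_k = rng.standard_normal((d, d)) * 0.5  # fixed key projection
theta_v = rng.standard_normal((d, d)) * 0.5  # fixed value projection
lr = 0.05

stream = rng.standard_normal((50, d))        # stand-in per-image features
for x in stream:                             # one pass: linear in length
    k, v = theta_k @ x, theta_v @ x
    err = W @ k - v                          # self-supervised residual
    W -= lr * np.outer(err, k)               # gradient step = state update

query = rng.standard_normal(d)
out = W @ (theta_k @ query)                  # read the compressed state
print(out.shape)
```

The state is O(d^2) regardless of how many frames are absorbed, which is the structural reason a stateful model can avoid the quadratic cost of all-pairs attention over the image collection.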
Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68\% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50\% with conventional embedding models twice its size, and 37\% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
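The core mechanism of reasoning-aware retrieval is simple to illustrate: embed the agent's reasoning trace jointly with its query, so documents are scored against the full intent rather than the bare query. The bag-of-words "embedder" below is a toy stand-in for a trained dense model such as AgentIR-4B; texts and scores are invented.

```python
import numpy as np

# Toy reasoning-aware retrieval: the reasoning trace is concatenated
# with the query before embedding. The embedder is a stand-in.

docs = ["the 1889 paris exposition opened the eiffel tower",
        "paris texas is a small city in the united states"]
vocab = sorted({w for d in docs for w in d.split()})

def embed(text):
    toks = text.lower().split()
    v = np.array([toks.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def score(query, reasoning=""):
    q = embed((reasoning + " " + query).strip())   # joint embedding
    return [float(q @ embed(d)) for d in docs]

plain = score("paris")                             # bare query: a tie
aware = score("paris", reasoning="the user asks about the 1889 exposition")
print(plain, aware)
```

With the bare query both documents score identically; adding the reasoning trace breaks the tie toward the exposition document, which is the signal conventional retrievers discard.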
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tok |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 96 | $4.50 |
| 2 | GPT-5.3 Codex | 54.0 | 64 | $4.81 |
| 3 | Claude Opus 4.6 | 53.0 | 53 | $10.00 |
| 4 | Claude Sonnet 4.6 | 51.7 | 59 | $6.00 |
| 5 | GPT-5.2 | 51.3 | 62 | $4.81 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Claude Opus 4.6 | 51.7% |
| 3 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 4 | gpt-5.2-2025-12-11-medium | 51.0% |
| 5 | gpt-5.1-codex-max | 48.5% |
Fully autonomous AI hacker to find actual exploits in your web apps. Shannon has achieved a 96.15% success rate on the hint-free, source-aware XBOW Benchmark.
A complete AI agency at your fingertips - From frontend wizards to Reddit community ninjas, from whimsy injectors to reality checkers. Each agent is a specialized expert with personality, processes, and proven deliverables.
Find vulnerabilities, misconfigurations, secrets, SBOM in containers, Kubernetes, code repositories, clouds and more
A set of ready-to-use Agent Skills for research, science, engineering, analysis, finance and writing.
Generate code from the terminal!
Next-gen AI+IoT framework for T2/T3/T5AI, ESP32 and more – Fast IoT and AI Agent hardware integration
Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines
Create files with fake data. In many formats. With no effort.
A simple yet powerful agent framework for personal assistants, designed to enable intelligent interaction, multi-agent collaboration, and seamless tool integration.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows