The Inference Report

March 4, 2026

Control is consolidating at the infrastructure layer while fragmenting everywhere else. Microsoft, Meta, and Amazon are locking in chip supply and power deals that give them a lasting advantage over the software vendors trying to build on top. Block cut 40 percent of its workforce because AI tools made its organizational structure obsolete, not because demand collapsed. Railway raised $100 million to challenge AWS by offering AI-native infrastructure. Perplexity announced Computer, an agent that assigns work to other agents. The pattern is unmistakable: whoever controls the orchestration layer controls value extraction.

Military leverage is reshaping commercial terms faster than market forces are. The Pentagon told Anthropic to drop its restrictions on military use or lose its contracts. OpenAI accepted the Pentagon terms that Anthropic rejected and secured the deal. Anthropic now has no choice but to comply or exit the government market entirely. Meanwhile, the cost structure of agentic AI is rewriting unit economics in real time. Anthropic's Claude Code costs up to $200 per month. Block released Goose, an open-source alternative, for free. Listen Labs raised $69 million after a viral billboard stunt. The velocity of capability deployment has inverted traditional software economics: shipping fast is cheaper than shipping perfect. But shipping fast at scale requires controlling the infrastructure underneath. That's why Microsoft's stateful AI runtime on AWS matters more than the model itself.

Consumer and enterprise willingness to pay for AI software remains flat even as capital pours into infrastructure. Samsung's Galaxy S26 costs more and is "even more chock-full of AI." Google released Nano Banana 2. Amazon made Alexa+ free for Prime members. None of this moves the needle on actual usage or revenue. Data center operators assumed they could buy farmland for a million dollars and that farmers would sell. They didn't. The political economy of AI is resolving in favor of whoever controls scarcity: chips, power, real estate, military access. Software vendors are margin-squeezed and politically exposed. Infrastructure vendors are consolidating. On coding benchmarks, Claude Code holds the top position on SWE-rebench at 52.9%, while Kimi K2 Thinking climbs from position 26 to position 12 with a 2.9 percentage point gain. Kimi K2.5 shows the largest single-cycle deterioration, dropping 8.9 points.

Grant Calloway

Research Papers — Focused
CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents cs.AI

Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments from high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
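For a concrete sense of the three dimensions the paper measures, here is a minimal Python sketch: accuracy against ground truth, a standard expected-calibration-error estimate, and pairwise agreement between auditors. The judgments, confidence values, and bin count are illustrative assumptions, not data or code from the paper.

```python
# Minimal sketch (not the paper's code): scoring VLM auditors on the three dimensions
# named in the abstract. All judgments, confidences, and ground truth below are invented.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; average |bin accuracy - bin confidence|, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def pairwise_agreement(judgments):
    """Average fraction of tasks on which each pair of auditors returns the same verdict."""
    judgments = np.asarray(judgments)  # shape: (n_auditors, n_tasks)
    n = judgments.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([np.mean(judgments[i] == judgments[j]) for i, j in pairs]))

ground_truth = np.array([1, 0, 1, 1, 0, 1])                # did the CUA really complete the task?
auditor_votes = np.array([[1, 0, 1, 0, 0, 1],              # auditor A's verdicts
                          [1, 1, 1, 1, 0, 1],              # auditor B's verdicts
                          [1, 0, 0, 1, 0, 1]])             # auditor C's verdicts
auditor_a_conf = np.array([0.9, 0.6, 0.8, 0.55, 0.7, 0.95])

print("accuracy (A):", float(np.mean(auditor_votes[0] == ground_truth)))
print("ECE (A):     ", expected_calibration_error(auditor_a_conf, auditor_votes[0] == ground_truth))
print("agreement:   ", pairwise_agreement(auditor_votes))
```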

PACED: Distillation at the Frontier of Student Competence cs.AI

Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to PACED, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight $w(p) = p^α(1 - p)^β$ derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel $w(p) = p^α(1-p)^β$ is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, worst-case efficiency loss is only $O(δ^2)$. (2) Distillation: When distilling from a larger teacher into a smaller student with forward KL, PACED achieves significant gains over the base model while keeping benchmark forgetting low. (3) Self-distillation: On instruction-tuned models with reverse KL, gains likewise exceed the baselines. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with either KL direction.
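The weighting itself is simple to compute. The sketch below evaluates the Beta-kernel weight $w(p) = p^α(1-p)^β$ on pass rates estimated from student rollouts; the choice of α = β = 1 and the rollout count are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the pass-rate weighting described above: w(p) = p**alpha * (1 - p)**beta.
# The alpha/beta values and rollout count are illustrative assumptions, not the paper's settings.
import numpy as np

def paced_weight(pass_rate, alpha=1.0, beta=1.0):
    """Beta-kernel weight: vanishes at p = 0 and p = 1, peaks at the student's competence frontier."""
    p = np.clip(pass_rate, 0.0, 1.0)
    return p ** alpha * (1.0 - p) ** beta

def estimate_pass_rate(student_solves, problem, n_rollouts=8):
    """Pass rate from student rollouts only; no teacher rollouts are needed, as the abstract notes."""
    return sum(bool(student_solves(problem)) for _ in range(n_rollouts)) / n_rollouts

# Mastered (p near 1) and out-of-reach (p near 0) problems receive near-zero weight.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"pass rate {p:.1f} -> weight {paced_weight(p):.3f}")
```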

Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios cs.AI

We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing the budget from 10M to 100M tokens yields gains of up to 59% and requires no particular technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2-1.4 of 7 steps (maximum 3).
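The reported log-linear scaling amounts to a straight-line fit of steps completed against log10 of the token budget. In the sketch below, only the 9.8-steps-at-10M-tokens point echoes the abstract; every other budget and step count is invented to make the fit runnable.

```python
# Sketch of the log-linear compute scaling reported above: steps completed vs. log10(tokens).
# Only the 9.8-steps-at-10M-tokens point echoes the abstract (Opus 4.6); the other budgets
# and step counts are invented for illustration.
import numpy as np

tokens = np.array([1e6, 3e6, 1e7, 3e7, 1e8])      # inference-time token budgets (illustrative)
steps = np.array([5.0, 7.4, 9.8, 12.1, 14.6])     # hypothetical average steps completed

slope, intercept = np.polyfit(np.log10(tokens), steps, 1)
print(f"fit: steps ~ {intercept:.1f} + {slope:.1f} * log10(tokens)")
print(f"predicted at 100M tokens: {intercept + slope * np.log10(1e8):.1f}")
```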

Reversible Lifelong Model Editing via Semantic Routing-Based LoRA cs.AI

The dynamic evolution of real-world knowledge necessitates model editing in Large Language Models. While existing methods explore modular isolation or parameter-efficient strategies, they still suffer from semantic drift or knowledge forgetting under continual updating. To address these challenges, we propose SoLA, a Semantic routing-based LoRA framework for lifelong model editing. In SoLA, each edit is encapsulated as an independent LoRA module, which is frozen after training and mapped to inputs by semantic routing, allowing dynamic activation of LoRA modules via semantic matching. This mechanism avoids the semantic drift caused by cluster updating and mitigates the catastrophic forgetting that arises from parameter sharing. More importantly, SoLA supports precise revocation of specific edits by removing the corresponding key from the semantic routing, which restores the model's original behavior. To our knowledge, SoLA is the first in the literature to achieve this reversible rollback editing capability. Furthermore, SoLA integrates the routing decision into the edited layer, eliminating the need for auxiliary routing networks and enabling end-to-end decision-making. Extensive experiments demonstrate that SoLA effectively learns and retains edited knowledge, achieving accurate, efficient, and reversible lifelong model editing.
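A minimal sketch of the routing mechanism the abstract describes, assuming cosine similarity between a query embedding and per-edit key embeddings with a fixed activation threshold; the class, the threshold value, and the opaque adapter objects are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of the routing idea described above (not the authors' implementation):
# each edit stores a frozen adapter plus a key embedding; an input activates the most similar
# key above a threshold, and deleting a key revokes that edit.
import numpy as np

class SemanticRouter:
    def __init__(self, threshold=0.8):
        self.keys = {}        # edit_id -> unit-norm key embedding
        self.adapters = {}    # edit_id -> frozen LoRA module (left opaque here)
        self.threshold = threshold

    def add_edit(self, edit_id, key_embedding, adapter):
        key = np.asarray(key_embedding, dtype=float)
        self.keys[edit_id] = key / np.linalg.norm(key)
        self.adapters[edit_id] = adapter

    def revoke(self, edit_id):
        # Reversible rollback: removing the key restores the base model's behavior for that edit.
        self.keys.pop(edit_id, None)
        self.adapters.pop(edit_id, None)

    def route(self, query_embedding):
        q = np.asarray(query_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        best_id, best_sim = None, self.threshold
        for edit_id, key in self.keys.items():
            sim = float(q @ key)
            if sim > best_sim:
                best_id, best_sim = edit_id, sim
        return self.adapters.get(best_id)  # None means: fall through to the unedited base model
```

In use, route() would be called per input to pick at most one frozen adapter; because each edit lives behind its own key, revoking one edit cannot perturb the others.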

Mind the Sim2Real Gap in User Simulation for Agentic Tasks cs.AI

As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet these simulations are frequently assumed to be faithful to real human behavior, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $τ$-bench protocol with real humans (451 participants, 165 tasks), benchmarking 31 LLM simulators across proprietary, open-source, and specialized families using the User-Sim Index (USI), a metric we introduce to quantify how well LLM simulators resemble real users' interactive behaviors and feedback. Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that inflates agent success rates above the human baseline. In evaluations, real humans provide nuanced judgments across eight quality dimensions while simulated users produce uniformly more positive feedback; rule-based rewards fail to capture the rich feedback signals that human users generate. Overall, higher general model capability does not necessarily yield more faithful user simulation. These findings highlight the importance of human validation when using LLM-based user simulators in the agent development cycle and motivate improved models for user simulation.
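The abstract does not spell out how USI is computed, so the sketch below is only an assumed proxy: it combines the gap between simulated and human feedback across the eight quality dimensions with the inflation of agent success rate over the human baseline. All numbers are invented for illustration.

```python
# Assumed proxy only; the actual USI formula is not given in the abstract.
import numpy as np

def sim2real_proxy(human_scores, sim_scores, human_success, sim_success):
    """Closer to 1 means the simulator's feedback and induced success rate track the humans'."""
    feedback_gap = float(np.mean(np.abs(np.asarray(human_scores) - np.asarray(sim_scores))))
    success_gap = abs(human_success - sim_success)
    return 1.0 - 0.5 * (feedback_gap + success_gap)

human = [0.62, 0.55, 0.70, 0.48, 0.66, 0.59, 0.51, 0.64]  # eight quality dimensions, 0-1 scale
sim   = [0.81, 0.78, 0.84, 0.75, 0.80, 0.79, 0.77, 0.82]  # uniformly more positive, as the paper finds
print("proxy index:", round(sim2real_proxy(human, sim, human_success=0.44, sim_success=0.61), 3))
```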

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning cs.AI

Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning or entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities because they rely on static, unstructured benchmarks. We propose a dynamic framework that stress-tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) achieves coverage comparable to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical and scalable evaluation of unlearning methods without manual construction of forget test sets, easing adoption for real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
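A minimal sketch of the probe-construction idea, assuming a simple (subject, relation, object) fact format: starting from facts the pre-unlearning model answers correctly, it generates alias and multi-hop variants that reach the forgotten entity only indirectly. The facts, aliases, and templates are invented; this is not the released package.

```python
# Minimal sketch of the probe-construction idea (not the released package). Facts, aliases,
# and templates below are invented for illustration.
facts = [
    ("Alan Turing", "was born in", "London"),
    ("London", "is the capital of", "the United Kingdom"),
]
aliases = {"Alan Turing": "the father of theoretical computer science"}

def single_hop_probe(subj, rel, _obj):
    return f"{subj} {rel} which city?"

def alias_probe(subj, rel, _obj):
    # Replace the subject with a well-known alias to dodge surface-level unlearning.
    return f"{aliases.get(subj, subj).capitalize()} {rel} which city?"

def multi_hop_probe(chain):
    # Reach the forgotten fact only via a second hop, so the answer routes through it implicitly.
    (s1, r1, _), (_, r2, _) = chain
    return f"Consider the city that {s1} {r1}. That city {r2} which country?"

for probe in (single_hop_probe(*facts[0]), alias_probe(*facts[0]), multi_hop_probe(facts)):
    print(probe)
```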

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                    Score   tok/s   $/1M tokens
1   Gemini 3.1 Pro Preview   57.2    84      $4.50
2   GPT-5.3 Codex            54      59      $4.81
3   Claude Opus 4.6          53      53      $10.00
4   Claude Sonnet 4.6        51.7    56      $6.00
5   GPT-5.2                  51.3    61      $4.81
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                       Score
1   Claude Code                 52.9%
2   Claude Opus 4.6             51.7%
3   gpt-5.2-2025-12-11-xhigh    51.7%
4   gpt-5.2-2025-12-11-medium   51.0%
5   gpt-5.1-codex-max           48.5%