The Inference Report

March 22, 2026

The market is sorting itself by who owns the customer relationship and can credibly deliver results, not by who controls the most advanced technology. OpenAI is doubling headcount to 8,000 by end of 2026 while Nvidia's latest conference failed to move Wall Street, a divergence that reflects investor clarity about which companies extract value versus which merely supply it. Open-weight models like Nvidia's Nemotron-Cascade 2 are hitting Gold Medal performance at 30B parameters with only 3B active, directly undercutting the efficiency moat of frontier models, yet this technical progress hasn't translated into market share because distribution and trust still dominate. Meanwhile, the compliance layer is cracking: Delve stands accused of selling fake compliance to hundreds of customers, a publisher rejected an AI-generated novel outright, and Anthropic's survey of 80,000 Claude users shows hallucinations trouble people far more than job displacement fears. Trust, not capability, is the actual constraint.

Research across multi-agent systems, interpretability, and domain-specific applications reveals a consistent finding: observability alone does not guarantee control. Mechanistic methods achieve near-perfect representation of task-relevant information yet fail to translate that knowledge into corrected outputs, while steering approaches show brittleness under deployment stress. Performance gains come from encoding domain structure into training and evaluation rather than scaling generic models. Pedagogically grounded fine-tuning, clinical benchmarks aligned to real-world needs, and neuro-symbolic architectures with declarative constraint specification all demonstrate that what matters for deployment is generalization to unseen tasks and robustness under perturbation, not aggregate metrics on standard leaderboards.

The infrastructure layer is reasserting itself. Trivy dominates vulnerability scanning with consolidated threat detection, while systemd and protobuf remain the unglamorous backbone everything depends on. On GitHub, the secondary pattern is tooling for AI operations and observability: Phoenix and Claude HUD address the friction point that models and agents are now complex enough to require visibility into internal behavior, while opendataloader and Clawith solve the unglamorous problem of getting messy PDFs and enterprise data into usable formats. The gap between what's trendy and what's useful is narrowing. Compensation is shifting too, with tokens becoming a fourth pillar of engineer pay and companies like DoorDash paying gig workers to train AI, suggesting the real pressure is showing up as cost arbitrage rather than capability breakthroughs. Whoever controls the customer relationship wins; everyone else is either a cost center or selling narrative.

Grant Calloway

AI LabsAll labs

No lab headlines.

From the WireAll feeds
Research Papers — FocusedAll papers
A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version) cs.AI

We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

Towards Understanding Specification Gaming in Reasoning Models cs.AI

Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: 1. RL reasoning training substantially increases the rate at which models exploit their specifications, 2. Increasing RL reasoning budget has a weakly positive effect on exploit rate, and 3. Test-time mitigations reduce but do not eliminate the rate of specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.

Complexity Horizons of Compressed Models in Analog Circuit Analysis cs.AI

The deployment of Large Language Models (LLMs) for specialized engineering domains, such as circuit analysis, often faces a trade-off between reasoning accuracy and computational efficiency. Traditional evaluation methods treat model performance as a flat metric, failing to account for the hierarchical nature of engineering knowledge. We propose a performance-aware model compression strategy that utilizes prerequisite graphs to optimize model selection for circuit analysis tasks. By structuring electronics design concepts as Directed Acyclic Graphs (DAGs), we can identify the specific complexity horizons of an LLM's compressed variants' tiers. Our framework introduces an agentic pipeline for generating prerequisite-based datasets and a strategic evaluation engine that dynamically cascades queries across a spectrum of compressed variants of an LLM. This approach allows to select the smallest compressed model, given its conceptual knowledge boundaries in circuit analysis. Experimental results on analog electronics datasets demonstrate that prerequisite graphs provide a granular map of model compression with respect to the performance given circuit analysis complexity. (Source Code: https://github.com/pacomesimon/LLM_prereq_graphs_circuit_analysis, Demo: https://huggingface.co/spaces/pacomesimon/LLM_prereq_graphs_circuit_analysis)

EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions cs.AI

Engineering problem solving is central to real-world decision-making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathematical problem solving, which operates on predefined formulations, engineering tasks demand open-ended analysis, feasibility-driven modeling, and iterative refinement. Although large language models (LLMs) have shown strong capabilities in reasoning and code generation, they often fail to ensure feasibility, which limits their applicability to engineering problem solving. To address this challenge, we propose EngiAgent, a multi-agent system with a fully connected coordinator that simulates expert workflows through specialized agents for problem analysis, modeling, verification, solving, and solution evaluation. The fully connected coordinator enables flexible feedback routing, overcoming the rigidity of prior pipeline-based reflection methods and ensuring feasibility at every stage of the process. This design not only improves robustness to diverse failure cases such as data extraction errors, constraint inconsistencies, and solver failures, but also enhances the overall quality of problem solving. Empirical results across four representative domains demonstrate that EngiAgent achieves substantial improvements in feasibility compared to prior approaches, establishing a new paradigm for feasibility-oriented engineering problem solving with LLMs. Our source code and data are available at https://github.com/AI4Engi/EngiAgent.

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding cs.AI

Distilling large reasoning models is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at \href{https://github.com/DISL-Lab/CoRD}{https://github.com/DISL-Lab/CoRD}.

Anon: Extrapolating Optimizer Adaptivity Across the Real Spectrum cs.AI

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity in R, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.

BenchmarksFull tables
Artificial AnalysisIntelligence Index

Composite score across coding, math, and reasoning

#ModelScoretok/s$/1M
1GPT-5.457.285$5.63
2Gemini 3.1 Pro Preview57.2118$4.50
3GPT-5.3 Codex5471$4.81
4Claude Opus 4.65351$10.00
5Claude Sonnet 4.651.766$6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#ModelScore
1Claude Code52.9%
2Junie52.1%
3Claude Opus 4.651.7%
4gpt-5.2-2025-12-11-xhigh51.7%
5gpt-5.2-2025-12-11-medium51.0%