The Inference Report

March 22, 2026

The market is sorting itself by who owns the customer relationship and can credibly deliver results, not by who controls the most advanced technology. OpenAI is doubling headcount to 8,000 by end of 2026 while Nvidia's latest conference failed to move Wall Street, a divergence that reflects investor clarity about which companies extract value versus which merely supply it. Open-weight models like Nvidia's Nemotron-Cascade 2 are hitting Gold Medal performance at 30B parameters with only 3B active, directly undercutting the efficiency moat of frontier models, yet this technical progress hasn't translated into market share because distribution and trust still dominate. Meanwhile, the compliance layer is cracking: Delve stands accused of selling fake compliance to hundreds of customers, a publisher rejected an AI-generated novel outright, and Anthropic's survey of 80,000 Claude users shows hallucinations trouble people far more than job displacement fears. Trust, not capability, is the actual constraint.

Research across multi-agent systems, interpretability, and domain-specific applications reveals a consistent finding: observability alone does not guarantee control. Mechanistic methods achieve near-perfect representation of task-relevant information yet fail to translate that knowledge into corrected outputs, while steering approaches show brittleness under deployment stress. Performance gains come from encoding domain structure into training and evaluation rather than scaling generic models. Pedagogically grounded fine-tuning, clinical benchmarks aligned to real-world needs, and neuro-symbolic architectures with declarative constraint specification all demonstrate that what matters for deployment is generalization to unseen tasks and robustness under perturbation, not aggregate metrics on standard leaderboards.

The infrastructure layer is reasserting itself. Trivy dominates vulnerability scanning with consolidated threat detection, while systemd and protobuf remain the unglamorous backbone everything depends on. On GitHub, the secondary pattern is tooling for AI operations and observability: Phoenix and Claude HUD address the friction point that models and agents are now complex enough to require visibility into internal behavior, while opendataloader and Clawith solve the unglamorous problem of getting messy PDFs and enterprise data into usable formats. The gap between what's trendy and what's useful is narrowing. Compensation is shifting too, with tokens becoming a fourth pillar of engineer pay and companies like DoorDash paying gig workers to train AI, suggesting the real pressure is showing up as cost arbitrage rather than capability breakthroughs. Whoever controls the customer relationship wins; everyone else is either a cost center or selling narrative.

Grant Calloway

AI Labs

No lab headlines.

From the Wire
Research Papers — Focused
When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution cs.AI

When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? Multi-agent language systems increasingly rely on structured interactions such as delegation and iterative refinement, yet the final output often obscures the underlying interaction topology and agent contributions. We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction. During generation, agent-specific keyed signals are embedded into the token distribution, transforming the text into a self-describing execution trace detectable only with a secret key. At detection time, a transition-aware scoring method identifies agent handover points and reconstructs the interaction graph. Experiments show that IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent language systems.
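
The abstract does not specify the embedding scheme; a minimal sketch of the general family it describes (a keyed "green-list" token bias per agent, detected afterwards by per-key scoring over sliding windows) might look like the following, with all names and parameters hypothetical:

```python
import hashlib
import random

def green_list(agent_key: str, vocab_size: int, frac: float = 0.5) -> set:
    """Derive a keyed pseudo-random 'green' subset of the vocabulary.
    Without the key, the subset is indistinguishable from random."""
    seed = int.from_bytes(hashlib.sha256(agent_key.encode()).digest()[:8], "big")
    return set(random.Random(seed).sample(range(vocab_size), int(vocab_size * frac)))

def score_segment(tokens, agent_key, vocab_size):
    """Fraction of tokens in an agent's green list: near `frac` for
    unrelated text, markedly higher for text generated under the bias."""
    greens = green_list(agent_key, vocab_size)
    return sum(t in greens for t in tokens) / len(tokens)

def attribute(tokens, agent_keys, vocab_size, window=40):
    """Assign each window to its highest-scoring key; transitions
    between window labels approximate agent handover points."""
    return [
        max(agent_keys, key=lambda k: score_segment(tokens[i:i + window], k, vocab_size))
        for i in range(0, len(tokens) - window + 1, window)
    ]
```

In a real system the green list would bias logits during sampling; the sketch shows only the detection side, which needs the secret keys but no logs or agent metadata.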

Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction cs.AI

Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges: organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering method to generate Dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our Dark models consistently produce harmful interactions and outcomes. Using our Dark models, we propose protective measures to reduce harmful outcomes in human-AI interactions.
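
The paper's subspace construction is not described in the abstract; the single-direction activation-steering primitive that such frameworks typically build on (a difference-of-means trait direction, amplified to induce a trait or projected out as a protective measure) can be sketched as follows, with all names hypothetical:

```python
import numpy as np

def trait_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction for a trait, estimated from model
    activations on trait-exhibiting (pos) vs. neutral (neg) prompts."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Push a hidden state along the trait direction; alpha > 0
    amplifies the trait (the 'Dark model' construction)."""
    return hidden + alpha * direction

def ablate(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Protective measure: project the trait direction out entirely."""
    return hidden - (hidden @ direction) * direction
```

Real steering operates on transformer hidden states at chosen layers; a subspace version would stack several such directions and project onto (or out of) their span.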

Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI cs.AI

Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework [6], which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph [8], which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit 2026 standard [10], which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce Bayesian distillation, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce warm rotation, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with structural correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.
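
Posit quires are not available in mainstream toolchains, but the property the abstract leans on, accumulating a long sum exactly and rounding only once, can be illustrated with exact rationals standing in for a quire register (a conceptual sketch, not the b-posit arithmetic itself):

```python
from fractions import Fraction

def float_sum(xs):
    """Naive float accumulation: rounds after every addition, so small
    terms added to a large running sum can vanish entirely."""
    s = 0.0
    for x in xs:
        s += x
    return s

def quire_style_sum(xs):
    """Quire-style accumulation: keep the running sum exact in a wide
    register (here, an exact rational) and round once at the end."""
    s = Fraction(0)
    for x in xs:
        s += Fraction(x)
    return float(s)
```

A small gradient contribution added to a huge accumulator is lost under naive summation but survives exact accumulation, which is why the abstract ties exact quire accumulation to reliable gradient updates.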

Don't Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows cs.AI

Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less-technical and non-technical users. It supports incremental, interactive, notebook-style development: each step is converted to code with a required set of functions and behaviors, enabling workflows to be built up piece by piece. Agents are invoked only for code generation and error recovery, not for orchestration or task execution. This agent-supported but code-first approach, combined with Skele-Code's context engineering, can reduce token costs compared with multi-agent approaches to executing workflows. Skele-Code produces modular, easily extensible, and shareable workflows, which can also be used as skills by agents or as steps in other workflows.
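
The abstract's "required set of functions and behavior" is not spelled out; one plausible reading, a fixed per-step contract over a shared state dict so that the finished workflow runs as plain code with no agent in the loop, can be sketched as follows (all names hypothetical):

```python
from typing import Any, Callable, Dict

# Hypothetical step contract: read the shared state, return new keys.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_workflow(steps, state=None):
    """Execute steps in order; each step extends the shared state.
    An agent is needed only to *write* a step (or repair a failing
    one), never to orchestrate or execute the workflow itself."""
    state = dict(state or {})
    for step in steps:
        state.update(step(state))
    return state
```

Because each step is ordinary code with a uniform signature, a whole workflow can itself be wrapped as a single step and reused inside another workflow, matching the composition the abstract describes.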

Efficient Dense Crowd Trajectory Prediction Via Dynamic Clustering cs.AI

Crowd trajectory prediction plays a crucial role in public safety and management, where it can help prevent disasters such as stampedes. Recent works address the problem by predicting individual trajectories and considering surrounding objects based on manually annotated data. However, these approaches tend to overlook dense crowd scenarios, where automation becomes harder because tracking outputs are massive, noisy, and inaccurate, resulting in high computational costs. To address these challenges, we propose and extensively evaluate a novel cluster-based approach that groups individuals with similar attributes over time, enabling faster execution through accurate group summarisation. Our plug-and-play method can be combined with existing trajectory predictors by feeding our output centroids in place of their per-pedestrian inputs. We evaluate the proposed method on several challenging dense crowd scenes and demonstrate faster processing and lower memory usage than state-of-the-art methods while maintaining accuracy.
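
The paper's attribute-based clustering is not detailed in the abstract; the plug-and-play idea, group pedestrians by similar attributes and hand the group centroids to an off-the-shelf predictor in place of individuals, can be sketched as follows (grid cell and heading sign are stand-in attributes, not the paper's):

```python
import numpy as np

def cluster_crowd(positions: np.ndarray, velocities: np.ndarray,
                  cell: float = 2.0) -> dict:
    """Group pedestrians whose grid cell and heading agree; a crude
    stand-in for the paper's attribute-based dynamic clustering."""
    groups = {}
    for pos, vel in zip(positions, velocities):
        key = (int(pos[0] // cell), int(pos[1] // cell),
               int(np.sign(vel[0])), int(np.sign(vel[1])))
        groups.setdefault(key, []).append(pos)
    return groups

def centroids(groups: dict) -> np.ndarray:
    """One centroid per group; these replace the per-pedestrian inputs
    of an existing trajectory predictor."""
    return np.array([np.mean(g, axis=0) for g in groups.values()])
```

The downstream predictor then runs on a handful of centroids instead of thousands of noisy individual tracks, which is where the claimed speed and memory savings come from.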

TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors cs.AI

Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic chatbot advice or non-scalable human-to-human consultations at teaching centers. We present TeachingCoach, a pedagogically grounded chatbot designed to support instructor professional development through real-time, conversational guidance. TeachingCoach is built on a data-centric pipeline that extracts pedagogical rules from educational resources and uses synthetic dialogue generation to fine-tune a specialized language model that guides instructors through problem identification, diagnosis, and strategy development. Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than a GPT-4o mini baseline, while a user study with higher education instructors highlights trade-offs between conversational depth and interaction efficiency. Together, these results demonstrate that pedagogically grounded, synthetic-data-driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.
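
The abstract describes the pipeline only at a high level; one synthetic fine-tuning record of the kind such a pipeline might emit (a pedagogical rule grounding a coached reply, in chat-style JSONL) could look like this, with all field contents hypothetical:

```python
import json

def make_example(rule: str, scenario: str, coached_reply: str) -> str:
    """One synthetic fine-tuning record: an instructor describes a
    problem, and the assistant's reply is grounded in an extracted
    pedagogical rule carried in the system message."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": f"Coach instructors. Ground advice in: {rule}"},
            {"role": "user", "content": scenario},
            {"role": "assistant", "content": coached_reply},
        ]
    })
```

Generating many such records from a rule bank, then fine-tuning on them, is the generic shape of the synthetic-data approach the abstract describes.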

Benchmarks
Artificial Analysis — Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.4                 57.2   85     $5.63
2  Gemini 3.1 Pro Preview  57.2   118    $4.50
3  GPT-5.3 Codex           54     71     $4.81
4  Claude Opus 4.6         53     51     $10.00
5  Claude Sonnet 4.6       51.7   66     $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Code                52.9%
2  Junie                      52.1%
3  Claude Opus 4.6            51.7%
4  gpt-5.2-2025-12-11-xhigh   51.7%
5  gpt-5.2-2025-12-11-medium  51.0%