The Inference Report

April 6, 2026

Much like the 1970s shift from mainframe computing to departmental minicomputers, AI adoption is being driven not by strategic planning but by resource scarcity. Companies are embedding disclaimers in their products while simultaneously building business models that monetize the exact use cases they claim to disclaim. Microsoft's Copilot is designated "entertainment only" in its terms of service, yet Anthropic's recent move to revoke Claude subscribers' access to OpenClaw, a popular open-source agent framework, in favor of paid credits reveals the pattern plainly: establish permissive positioning, build dependency, then extract value from the friction. The actual deployment of AI is happening where liability is lowest and labor alternatives have vanished entirely. Japan is moving physical robots from pilots into operational work because its social care system cannot fill the gap. South Korea is embedding ChatGPT into companion dolls for elderly care for the same reason. The regulatory conversation assumes adoption is a policy question, but the data shows otherwise: it is a labor shortage question, a liability question, a power question. Companies know exactly what their products will be used for, and their pricing reflects that knowledge.

Research advances are similarly moving away from raw scaling toward diagnostic depth. Rather than uniformly increasing model capacity, three distinct methodological clusters are addressing robustness through detection, staged optimization, and measurement. Federated learning systems now include server-side filtering for Byzantine attacks and learned classifiers that identify memorized training data across architectures. Long-horizon reasoning tasks are tackled through hierarchical latent world models that reduce planning complexity while maintaining zero-shot transfer. But the most consequential work measures when automated systems actually reduce human burden versus when they introduce noise and require abandonment, as demonstrated in studies of code review agents and refactoring workflows. The Behavioral Alignment Score reframes confidence evaluation around decision-theoretic utility, and StoryScope extracts interpretable narrative features to distinguish human from AI authorship. These are not incremental improvements. They are mechanisms operating at the level of the actual constraint.

Developer activity on GitHub reflects similar pragmatism. MLX-VLM addresses the concrete fact that not every AI workload runs on cloud GPUs, while sklearn-genetic-opt and FL-bench solve real problems in hyperparameter tuning and federated learning benchmarking. The agent and platform layer is consolidating around extensible foundations like Goose, pi-mono, and onyx, each making different bets on which abstractions let builders move fastest rather than competing on feature breadth. Pixeltable and Burn represent a quieter but significant trend toward infrastructure that makes honest trade-offs instead of claiming generality. Google's investments in on-device inference and local model exploration suggest the company is betting that frictionless deployment will be the actual distribution channel for edge ML. Across all three domains, the pattern is identical: solutions gain traction by solving for speed, specificity, or honest constraints, not by attempting to do everything at once.

Grant Calloway

AI Labs

No lab headlines.

From the Wire

Research Papers
Enhancing Robustness of Federated Learning via Server Learning cs.LG

This paper explores the use of server learning to enhance the robustness of federated learning against malicious attacks, even when clients' training data are not independent and identically distributed. We propose a heuristic algorithm that combines server learning and client update filtering with geometric median aggregation. We demonstrate experimentally that this approach achieves significant improvements in model accuracy even when the fraction of malicious clients is high (more than 50% in some cases) and the dataset utilized by the server is small, possibly synthetic, and not necessarily close in distribution to the clients' aggregated data.
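The geometric median at the heart of this aggregation step can be sketched in a few lines. The following is a minimal illustration using Weiszfeld's algorithm, not the paper's full pipeline (which adds server learning and client update filtering); the toy client updates are invented for the example.

```python
import numpy as np

np.random.seed(0)  # deterministic toy data

def geometric_median(updates, iters=100, eps=1e-8):
    """Approximate the geometric median via Weiszfeld's algorithm."""
    points = np.asarray(updates, dtype=float)
    median = points.mean(axis=0)              # start at coordinate-wise mean
    for _ in range(iters):
        dists = np.maximum(np.linalg.norm(points - median, axis=1), eps)
        weights = 1.0 / dists                 # closer points weigh more
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

# Six honest clients push toward [1, 1]; four malicious clients send outliers.
honest = [np.array([1.0, 1.0]) + 0.05 * np.random.randn(2) for _ in range(6)]
malicious = [np.array([50.0, -50.0])] * 4
agg = geometric_median(honest + malicious)
# The coordinate-wise mean is dragged far off; the median stays near [1, 1].
```

Because the geometric median has a breakdown point of 0.5, the aggregate stays near the honest cluster as long as malicious clients remain a minority, which is why the paper pairs it with filtering to push past that fraction.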

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence cs.CL

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-k confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
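The answer-or-abstain utility model underlying BAS can be illustrated concretely. The sketch below sweeps a risk threshold over self-reported confidences; the specific utility values (+1 for a correct answer, -t/(1-t) for a wrong one, 0 for abstaining) are an illustrative parameterization, not the paper's exact formula.

```python
import numpy as np

def realized_utility(confidences, correct, t):
    """Average utility at risk threshold t for one set of predictions."""
    total = 0.0
    for conf, ok in zip(confidences, correct):
        if conf >= t:                          # model chooses to answer
            total += 1.0 if ok else -t / (1.0 - t)
        # abstaining contributes utility 0
    return total / len(confidences)

def bas_like_score(confidences, correct):
    """Aggregate realized utility across a grid of risk thresholds."""
    thresholds = np.linspace(0.05, 0.95, 19)
    return float(np.mean([realized_utility(confidences, correct, t)
                          for t in thresholds]))

# Two models with identical accuracy (2 of 4 correct) but different confidence:
calibrated = bas_like_score([0.9, 0.9, 0.2, 0.2], [True, True, False, False])
overconfident = bas_like_score([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```

Even at equal accuracy, the calibrated model scores far higher: at strict thresholds its wrong answers abstain, while the overconfident model keeps answering and pays the steep -t/(1-t) penalty. This is the asymmetry the abstract contrasts with log loss, and the reason similar ECE can hide very different decision-level behavior.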

Hierarchical Planning with Latent World Models cs.LG

Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a modular planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-and-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success rates while requiring up to 4x less planning-time compute.
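The complexity argument is easy to see in a toy setting. The sketch below assumes a deterministic integer-line world rather than a learned latent model: a coarse planner searches over jumps of k primitive steps, so top-level search depth falls from H to roughly H/k, and a greedy low-level planner realizes each subgoal.

```python
from itertools import product

def low_level_plan(state, subgoal):
    """Greedy primitive actions (+1 / -1) toward a nearby subgoal."""
    actions = []
    while state != subgoal:
        step = 1 if subgoal > state else -1
        actions.append(step)
        state += step
    return actions

def hierarchical_plan(start, goal, k=4, max_depth=8):
    """Search coarse jumps of +/-k, then refine each with primitive steps."""
    if abs(goal - start) < k:                 # close enough: no coarse level
        return low_level_plan(start, goal)
    for depth in range(1, max_depth + 1):     # top-level depth ~ H/k, not H
        for jumps in product((k, -k), repeat=depth):
            subgoals, s = [], start
            for j in jumps:
                s += j
                subgoals.append(s)
            if abs(goal - s) < k:             # final short correction suffices
                plan, pos = [], start
                for sg in subgoals:
                    plan += low_level_plan(pos, sg)
                    pos = sg
                return plan + low_level_plan(pos, goal)
    return None                               # no plan within the depth budget

plan = hierarchical_plan(0, 13, k=4)  # three +4 jumps, then a +1 correction
```

Reaching state 13 from 0 by exhaustive search over primitive actions means depth-13 search (2^13 action sequences); the coarse level needs only depth 3 (2^3 option sequences), which is the flavor of planning-compute reduction the paper reports for latent world models.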

A Tsetlin Machine-driven Intrusion Detection System for Next-Generation IoMT Security cs.CR

The rapid adoption of the Internet of Medical Things (IoMT) is transforming healthcare by enabling seamless connectivity among medical devices, systems, and services. However, it also introduces serious cybersecurity and patient safety concerns as attackers increasingly exploit new methods and emerging vulnerabilities to infiltrate IoMT networks. This paper proposes a novel Tsetlin Machine (TM)-based Intrusion Detection System (IDS) for detecting a wide range of cyberattacks targeting IoMT networks. The TM is a rule-based and interpretable machine learning (ML) approach that models attack patterns using propositional logic. Extensive experiments conducted on the CICIoMT-2024 dataset, which includes multiple IoMT protocols and cyberattack types, demonstrate that the proposed TM-based IDS outperforms traditional ML classifiers. The proposed model achieves an accuracy of 99.5% in binary classification and 90.7% in multi-class classification, surpassing existing state-of-the-art approaches. Moreover, to enhance model trust and interpretability, the proposed TM-based model presents class-wise vote scores and clause activation heatmaps, providing clear insights into the most influential clauses and the dominant class contributing to the final model decision.
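The clause-voting inference that makes the TM interpretable is simple to sketch. In the toy below the clauses are hand-written conjunctions over invented binarized flow features; in a real Tsetlin Machine they are learned by teams of Tsetlin automata, and the per-class vote totals are analogous to the class-wise vote scores the authors surface in their heatmaps.

```python
def clause_fires(clause, x):
    """A clause is a conjunction of (feature_index, expected_value) literals."""
    return all(x[i] == v for i, v in clause)

def classify(x, clauses_per_class):
    """Sum +1/-1 clause votes per class; return the winner and all scores."""
    scores = {}
    for cls, clauses in clauses_per_class.items():
        scores[cls] = sum(polarity for clause, polarity in clauses
                          if clause_fires(clause, x))
    return max(scores, key=scores.get), scores

# Toy binarized features: [many_ports_scanned, high_packet_rate, known_device]
clauses = {
    "attack": [([(0, 1), (1, 1)], +1),   # port scan AND flood: vote attack
               ([(2, 1)], -1)],          # known device: vote against attack
    "benign": [([(0, 0), (1, 0)], +1)],  # quiet traffic: vote benign
}
label, scores = classify([1, 1, 0], clauses)
```

Because each clause is a readable propositional rule, the vote breakdown directly explains the decision, which is the interpretability property the abstract emphasizes over black-box ML classifiers.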

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction cs.CV

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as little as two lines of code.

Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT: Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding cs.AI

Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation. Existing research often studies these demands separately: robotics emphasizes control, retrieval systems emphasize memory, and alignment or assurance work emphasizes checking and oversight. This article argues that squirrel ecology offers a sharp comparative case because arboreal locomotion, scatter-hoarding, and audience-sensitive caching couple all three demands in one organism. We synthesize evidence from fox, eastern gray, and, in one field comparison, red squirrels, and impose an explicit inference ladder: empirical observation, minimal computational inference, and AI design conjecture. We introduce a minimal hierarchical partially observed control model with latent dynamics, structured episodic memory, observer-belief state, option-level actions, and delayed verifier signals. This motivates three hypotheses: (H1) fast local feedback plus predictive compensation improves robustness under hidden dynamics shifts; (H2) memory organized for future control improves delayed retrieval under cue conflict and load; and (H3) verifiers and observer models inside the action-memory loop reduce silent failure and information leakage while remaining vulnerable to misspecification. A downstream conjecture is that role-differentiated proposer/executor/checker/adversary systems may reduce correlated error under asymmetric information and verification burden. The contribution is a comparative perspective and benchmark agenda: a disciplined program of falsifiable claims about the coupling of control, memory, and verifiable action.

Benchmarks

Artificial Analysis: Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.4                 57.2   76     $5.63
2  Gemini 3.1 Pro Preview  57.2   135    $4.50
3  GPT-5.3 Codex           54     84     $4.81
4  Claude Opus 4.6         53     50     $10.00
5  Claude Sonnet 4.6       51.7   49     $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  Gemini 3.1 Pro Preview     62.3%