The Inference Report

March 2, 2026

The AI industry's carefully constructed narratives are colliding with an uncooperative reality across multiple fronts. The Pentagon's demand that Anthropic accept military use terms by February 27 or face designation as a supply chain risk represents the most explicit government intervention in AI company operations since the sector's emergence, forcing a safety-first laboratory to choose between its institutional identity and contracts worth billions. This pressure arrives alongside mounting evidence that the physical world is pushing back against AI's expansion: data center builders are discovering that farmers won't sell land even for million-dollar offers, and Microsoft has retreated from aggressive community relations tactics, vowing to cover full power costs, reject local tax breaks, and replenish water usage. These are not PR troubles to be managed but structural constraints on deployment speed and geography, independent of capital availability.

The market's bifurcation is sharpening along predictable lines. AWS and IBM are positioning AI infrastructure as a defensible, recurring-revenue moat, with enterprise-focused compute layers and autonomous storage management, while IBM's Missile Defense Agency contract and quantum computing partnership with Cisco signal an expanding defense TAM (total addressable market). Anthropic, by contrast, is attempting to own the safety narrative through constitutional classifiers, alignment-faking research, and version 3.0 of its Responsible Scaling Policy, less as philanthropy than as market differentiation for enterprise customers facing regulatory scrutiny. The coding agent space is fragmenting under price pressure: Claude Code's $200 monthly pricing has opened room for free alternatives like Block's Goose, while the NousCoder-14B model, trained in four days on 48 Nvidia B200 GPUs, challenges the assumption that frontier capabilities require frontier resources.

The benchmark landscape exposes the same gap between perception and measurement. SWE-bench showed zero movement in its latest cycle, with the top 35 models frozen in identical positions, while the newly introduced Artificial Analysis framework produces substantially different rankings, raising questions about what these evaluations actually capture. GitHub's trending repositories reinforce the shift: value has migrated from model capabilities to orchestration infrastructure, with memory systems, multi-agent coordination, and document conversion utilities like Microsoft's markitdown gaining traction over model training tools. The industry appears to have accepted that fine-tuning is solved and is now betting heavily on the application layer, even as public skepticism rises through movements like QuitGPT and research demonstrating that LLMs can generate near-verbatim copies of novels from training data. What remains clear is that the gap between AI's investment thesis and its operational reality is narrowing, and the companies best positioned may be those building for the world as it exists rather than as the sector imagined it.

Grant Calloway

AI Labs

No lab headlines.

From the Wire
Research Papers — Focused
Operationalising Information Security Management: A Procedural Framework Analysis of ISO/IEC 27001:2022 Implementation in a Financial-Technology Organisation cs.SE

Organisations operating within information-intensive environments face intensifying pressure to formalise the governance of information security. The ISO/IEC 27001:2022 standard provides a globally recognised framework for establishing, implementing, maintaining, and continually improving an Information Security Management System (ISMS). This article analyses the procedural architecture deployed in a financial-technology organisation's ISMS, examining eight core operational procedures: IT Risk Assessment and Treatment, User Code of Conduct, Password Policy, Access Control, Internet Access, Physical Security, Backup and Restore Management, and Nonconformity Root Cause Analysis and Corrective Action. Drawing on documented internal training materials, the article investigates how each procedure operationalises the requirements of Annex A controls and Clauses 6-10 of ISO 27001:2022. The paper examines the CIA Triad as a unifying evaluation criterion, the twelve-step risk assessment methodology, role-based responsibility allocation, and the interplay between corrective action governance and continual improvement. The findings suggest that a tightly integrated, multi-layered procedural hierarchy, supported by clear accountability structures and measurable risk metrics, forms the foundation of an effective ISMS implementation in financial-technology operating environments.
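
The abstract does not reproduce the twelve-step methodology, but its core risk-scoring step is easy to picture. A minimal sketch, assuming the common likelihood-times-impact scoring on 1-5 scales and an illustrative acceptance threshold; neither the scales nor the threshold is taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    asset: str
    threat: str
    likelihood: int  # 1 (rare) .. 5 (almost certain) -- assumed scale
    impact: int      # 1 (negligible) .. 5 (severe) -- assumed scale

    @property
    def score(self) -> int:
        # Classic likelihood x impact product; a common ISMS convention,
        # not necessarily the paper's exact formula.
        return self.likelihood * self.impact

def needs_treatment(risk: Risk, acceptance_threshold: int = 9) -> bool:
    """Risks scoring above the organisation's acceptance threshold
    must receive a treatment plan (per Clause 6.1.3 of ISO 27001)."""
    return risk.score > acceptance_threshold

risks = [
    Risk("customer database", "credential theft", likelihood=4, impact=5),
    Risk("office printer", "paper jam", likelihood=5, impact=1),
]
for r in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{r.asset}: score={r.score}, treat={needs_treatment(r)}")
```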

AI-Assisted Code Review as a Scaffold for Code Quality and Self-Regulated Learning: An Experience Report cs.SE

Code review is central to software engineering education but hard to scale in capstone projects due to tight deadlines, uneven peer feedback, and limited prior experience. We investigate an LLM-as-reviewer integrated directly into GitHub pull requests (human-in-the-loop) across two cohorts (more than 100 students, 2023-2024). Using a mixed-methods design combining GitHub data, reflective reports, and a targeted survey, we examine engagement and responsiveness as behavioral indicators of self-regulated learning processes. Quantitatively, the 2024 cohort produced more iterative activity (1176 vs. 581 PRs), while the technical issues observed in 2023 (227 failed AI attempts) dropped to zero after tool and instructional refinements. Despite different adoption levels (93% vs. 50% of teams using the tool), responsiveness was stable: 32% (2023) and 33% (2024) of successfully AI-reviewed PRs were followed by subsequent commits on the same PR. Qualitatively, students used the LLM's structured comments to focus reviews and discuss code quality, while guidance reduced over-reliance. We contribute: (i) an in-workflow design for an AI reviewer that scaffolds learning while mitigating cognitive offloading; (ii) a repeated cross-sectional comparison across two cohorts in authentic settings; (iii) a mixed-methods analysis combining objective GitHub metrics with student self-reports; and (iv) evidence-based pedagogical recommendations for responsible, student-led AI-assisted review.
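
The in-workflow pattern is straightforward to sketch: pull the diff for a PR, ask a model for structured comments, and post them back so students respond inside GitHub. The GitHub REST endpoints below are real; call_llm is a placeholder for whatever model endpoint a course uses, and the prompt wording is invented for illustration, not taken from the paper:

```python
import os
import requests

GITHUB = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def fetch_pr_diff(owner: str, repo: str, number: int) -> str:
    # The diff media type returns the raw unified diff for the PR.
    r = requests.get(
        f"{GITHUB}/repos/{owner}/{repo}/pulls/{number}",
        headers={**HEADERS, "Accept": "application/vnd.github.v3.diff"},
    )
    r.raise_for_status()
    return r.text

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to whichever model your course deploys.
    raise NotImplementedError("plug in your model endpoint here")

def review_pr(owner: str, repo: str, number: int) -> None:
    diff = fetch_pr_diff(owner, repo, number)
    review = call_llm(
        "You are a code reviewer for a student capstone project. "
        "Give structured, actionable comments on this diff:\n\n" + diff
    )
    # Post the review as a PR comment so students iterate in-workflow.
    r = requests.post(
        f"{GITHUB}/repos/{owner}/{repo}/issues/{number}/comments",
        headers=HEADERS,
        json={"body": review},
    )
    r.raise_for_status()
```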

Knowledge Lever Risk Management for Software Engineering: A Stochastic Framework for Mitigating Knowledge Loss cs.SE

Software engineering (SE) organizations operate in a knowledge-intensive domain where critical assets (architectural expertise, design rationale, and system intuition) are overwhelmingly tacit and volatile. The departure of key contributors or the decay of undocumented decisions can severely impair project velocity and software quality. While conventional SE risk management is optimized for schedule and budget, the intangible knowledge risks that often determine project success remain under-represented. The goal of this work is to propose and evaluate the Knowledge Lever Risk Management (KLRM) Framework, designed specifically for the software development lifecycle. The primary objectives are to: (1) recast intangible knowledge assets as active mechanisms for risk mitigation (Knowledge Levers); (2) integrate these levers into a structured four-phase architecture (Audit, Alignment, Activation, Assurance); and (3) provide a formal stochastic model to quantify the impact of lever activation on project knowledge capital. We detail the application of these levers through software-specific practices such as pair programming, architectural decision records (ADRs), and LLM-assisted development. Monte Carlo simulations demonstrate that full lever activation increases expected knowledge capital by 63.8% and virtually eliminates the probability of a knowledge crisis. Our results show that knowledge lever activation improves alignment across the project management iron triangle (scope, time, cost) by reducing rework and rediscovery costs.
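
The abstract does not spell out the stochastic model, but the shape of the Monte Carlo argument can be sketched. In the toy simulation below, every parameter (departure probability, loss fractions, the loss reduction from active levers, the crisis threshold) is an illustrative assumption, not a figure from the paper:

```python
import random

def simulate(levers_active: bool, sprints: int = 50, trials: int = 10_000):
    finals, crises = [], 0
    for _ in range(trials):
        capital = 100.0
        for _ in range(sprints):
            capital += random.gauss(1.0, 0.5)      # routine learning accrual
            if random.random() < 0.05:             # a key contributor leaves
                loss = random.uniform(0.05, 0.30)  # tacit knowledge lost
                if levers_active:                  # ADRs, pairing, etc.
                    loss *= 0.2                    # most knowledge externalised
                capital *= 1 - loss
        finals.append(capital)
        if capital < 50:                           # assumed "knowledge crisis"
            crises += 1
    return sum(finals) / trials, crises / trials

for active in (False, True):
    mean, p_crisis = simulate(active)
    print(f"levers={'on' if active else 'off'}: "
          f"mean capital={mean:.1f}, P(crisis)={p_crisis:.3f}")
```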

Can LLMs be Effective Code Contributors? A Study on Open-source Projects cs.SE

LLM-generated code is widely used, and the share of committed code produced by LLMs is expected to grow. However, LLMs are not yet at a point where they can contribute effectively to production code. We present an approach that exposes the shortcomings of LLM generation on real projects and proposes recommendations; the targets of our study are sizable open-source projects, e.g., FFmpeg and wolfSSL. First, we developed a framework that uses verification and validation to evaluate a given LLM's suitability to fix bugs or add features in an existing project. Second, we applied the framework to 212 commits (bug fixes and small feature improvements) in eight popular open-source projects and three LLMs: GPT-4o, Ministral3, and Qwen3-Coder. The success rate varied from 0% to 60% depending on the project. The LLMs failed in a variety of ways, from generating syntactically incorrect code to producing code that fails basic (static) verification or validation via the project's test suite. In particular, the LLMs struggle to generate new code and to handle contexts (functions or files) outside a certain size range, and in many cases their successes come from parroting code changes they were trained on.
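
The verify-then-validate gate the framework applies to each generated patch can be sketched as a three-stage pipeline: the patch must apply, the project must still build (static verification), and the test suite must pass (validation). The build and test commands below are per-project assumptions (a make-based project is shown), not the paper's actual harness:

```python
import subprocess

def run(cmd: list[str], cwd: str) -> bool:
    # Return True if the command exits cleanly in the project directory.
    return subprocess.run(cmd, cwd=cwd).returncode == 0

def evaluate_patch(repo: str, patch_file: str) -> str:
    # Stage 0: the LLM's patch must apply at all.
    if not run(["git", "apply", "--check", patch_file], cwd=repo):
        return "fail: patch does not apply"
    run(["git", "apply", patch_file], cwd=repo)
    # Stage 1: static verification -- the project must still build.
    if not run(["make", "-j8"], cwd=repo):
        return "fail: does not build"
    # Stage 2: validation -- the project's test suite must pass.
    if not run(["make", "check"], cwd=repo):
        return "fail: test suite"
    return "pass"
```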

Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts cs.SE

Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to select effective test inputs. However, prior evaluations of these metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts rarely considered; and (3) biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark examining how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenario (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
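
For a concrete sense of what such metrics do, one widely studied example of this kind is DeepGini, which prioritizes inputs whose softmax outputs are closest to uniform, on the intuition that the most uncertain inputs are the most likely to expose faults. A minimal sketch of the selection step only; the study's full pipeline of 15 metrics and OOD scenarios is not reproduced here:

```python
import numpy as np

def deepgini(softmax_outputs: np.ndarray) -> np.ndarray:
    """softmax_outputs: (n_inputs, n_classes). Higher score = more uncertain.
    DeepGini score = 1 - sum_i p_i^2 (Gini impurity of the output distribution)."""
    return 1.0 - np.sum(softmax_outputs ** 2, axis=1)

def select_for_labeling(softmax_outputs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most fault-revealing inputs."""
    return np.argsort(-deepgini(softmax_outputs))[:budget]

probs = np.array([[0.98, 0.01, 0.01],    # confident -> low priority
                  [0.40, 0.35, 0.25]])   # uncertain -> high priority
print(select_for_labeling(probs, budget=1))  # -> [1]
```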

An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code cs.SE

Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using zero-shot prompting at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the influence of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult, particularly for complex and context-dependent bugs in realistic development scenarios.
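
The setup is easy to reproduce locally. A sketch assuming the models are served through Ollama's REST API (the paper does not specify its serving stack); the prompt wording and keyword list are illustrative, and the keyword check mirrors the idea of an automated keyword-based evaluation rather than the paper's exact framework:

```python
import requests

OLLAMA = "http://localhost:11434/api/generate"

def detect_bug(model: str, function_source: str) -> str:
    # Zero-shot, function-level prompt: no examples, just the buggy function.
    prompt = (
        "The following Python function contains a bug. "
        "Identify the bug and the line(s) responsible:\n\n" + function_source
    )
    r = requests.post(OLLAMA, json={"model": model, "prompt": prompt,
                                    "stream": False})
    r.raise_for_status()
    return r.json()["response"]

def keyword_hit(response: str, ground_truth_keywords: list[str]) -> bool:
    # A response counts as a hit if it mentions any term tied to the known fix.
    return any(k.lower() in response.lower() for k in ground_truth_keywords)

buggy = "def mean(xs):\n    return sum(xs) / len(xs) - 1\n"
answer = detect_bug("llama3.2", buggy)
print(keyword_hit(answer, ["len", "- 1", "off-by-one"]))
```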

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                     Score   tok/s   $/1M tokens
1   Gemini 3.1 Pro Preview    57.2    82       $4.50
2   GPT-5.3 Codex             54      74       $4.81
3   Claude Opus 4.6           53      48      $10.00
4   Claude Sonnet 4.6         51.7    29       $6.00
5   GPT-5.2                   51.3    63       $4.81
SWE-rebench

Agentic coding on real-world software engineering tasks

No benchmark data.

Trending