The infrastructure for artificial intelligence is consolidating rapidly, concentrating both capability and control among a narrowing set of actors while creating new vulnerabilities for those outside that circle. The Army's decision to funnel 120 procurement actions into a single $20 billion Anduril contract exemplifies this trend in defense spending, where consolidation of capability into fewer hands has become the default procurement strategy. Meta's reported 20 percent workforce reduction alongside massive capital expenditure on AI infrastructure reveals the underlying economics: the cost of building dominant infrastructure is so high that companies must cut headcount elsewhere to justify the burn. OpenAI is moving aggressively into the transaction layer through new app integrations, positioning itself to observe and influence user behavior at scale, while North Korean operatives demonstrate how the same tools can be weaponized to impersonate workers and drain European companies. Users trust AI-generated summaries at rates approaching 40 percent despite a documented 60 percent hallucination rate, creating asymmetric risk for those relying on these outputs without verification.
Software engineering research reflects this consolidation from a different angle: the field has moved from treating agents as a research question to treating them as an engineering problem that requires scaffolding. Repositories like agency-agents and superpowers ship with built-in personality, processes, and deliverables, moving multi-agent systems from conceptual territory into applied methodology. Specialized tools handle context management through file system paradigms while others bind natural language to physical hardware actuators. The decisive shift is architectural: capability and tooling are decoupling from each other, with plugin ecosystems becoming the distribution model for agent capabilities. Fine-tuning and multi-agent coordination are now engineering problems, not research frontiers. Yet the research literature simultaneously documents fundamental gaps in practical effectiveness: patch overfitting detection fails against random baselines in realistic settings, frontier agents cannot match instruction-tuned models at post-training tasks, and sentiment analysis varies strongly within individuals. The field is building production systems while the underlying evaluation rigor remains contested.
Developers on GitHub have internalized both the consolidation and the gaps. The repositories gaining traction are those solving concrete production problems: headless browsers designed for agent interaction rather than retrofitted tools, dataset quality infrastructure, autonomous research loops, and output control mechanisms. The pattern across all three domains is identical: whoever builds infrastructure that touches users or controls interfaces accumulates leverage, data flows, and margin. Builders with direct user contact consolidate power. Everyone else either cuts costs to fund someone else's advantage or gets exploited by the tools they cannot fully trust or control.
Grant Calloway
No lab headlines.
Organisations operating within information-intensive environments face intensifying pressure to formalise the governance of information security. The ISO/IEC 27001:2022 standard provides a globally recognised framework for establishing, implementing, maintaining, and continually improving an Information Security Management System (ISMS). This article analyses the procedural architecture deployed in a financial-technology organisation's ISMS, examining eight core operational procedures: IT Risk Assessment and Treatment, User Code of Conduct, Password Policy, Access Control, Internet Access, Physical Security, Backup and Restore Management, and Nonconformity Root Cause Analysis and Corrective Action. Drawing on documented internal training materials, the article investigates how each procedure operationalises the requirements of Annex A controls and Clauses 6–10 of ISO 27001:2022. The paper evaluates the CIA Triad as a unifying evaluation criterion, the twelve-step risk assessment methodology, role-based responsibility allocation, and the interplay between corrective action governance and continual improvement. The findings suggest that a tightly integrated, multi-layered procedural hierarchy, supported by clear accountability structures and measurable risk metrics, constitutes the foundation of an effective ISMS implementation in financial-technology operating environments.
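To make the risk-metric idea concrete, here is a minimal sketch of a likelihood-times-impact score keyed to the CIA Triad. The 1-5 scales, the max-over-CIA aggregation, and the treatment threshold are illustrative assumptions, not the article's twelve-step methodology.

```python
# Minimal sketch of likelihood x impact risk scoring against the CIA Triad.
# Scales, aggregation rule, and threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Risk:
    asset: str
    likelihood: int   # 1 (rare) .. 5 (almost certain)
    impact_c: int     # impact on Confidentiality, 1..5
    impact_i: int     # impact on Integrity, 1..5
    impact_a: int     # impact on Availability, 1..5

    def score(self) -> int:
        # Score against the worst-affected CIA dimension.
        return self.likelihood * max(self.impact_c, self.impact_i, self.impact_a)

TREATMENT_THRESHOLD = 12  # scores at or above this go to the risk treatment plan

risks = [Risk("payment API", 4, 5, 4, 3), Risk("office Wi-Fi", 2, 2, 2, 3)]
for r in risks:
    action = "treat" if r.score() >= TREATMENT_THRESHOLD else "accept/monitor"
    print(f"{r.asset}: score={r.score()} -> {action}")
```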
Code review is central to software engineering education but hard to scale in capstone projects due to tight deadlines, uneven peer feedback, and limited prior experience. We investigate an LLM-as-reviewer integrated directly into GitHub pull requests (human-in-the-loop) across two cohorts (more than 100 students, 2023–2024). Using a mixed-methods design (GitHub data, reflective reports, and a targeted survey), we examine engagement and responsiveness as behavioral indicators of self-regulated learning processes. Quantitatively, the 2024 cohort produced more iterative activity (1176 vs. 581 PRs), while technical issues observed in 2023 (227 failed AI attempts) dropped to zero after tool and instructional refinements. Despite different adoption levels (93% vs. 50% of teams using the tool), responsiveness was stable: 32% (2023) and 33% (2024) of successfully AI-reviewed PRs were followed by subsequent commits on the same PR. Qualitatively, students used the LLM's structured comments to focus reviews and discuss code quality, while guidance reduced over-reliance. We contribute: (i) an in-workflow design for an AI reviewer that scaffolds learning while mitigating cognitive offloading; (ii) a repeated cross-sectional comparison across two cohorts in authentic settings; (iii) a mixed-methods analysis combining objective GitHub metrics with student self-reports; and (iv) evidence-based pedagogical recommendations for responsible, student-led AI-assisted review.
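As a concrete reading of the responsiveness figure, here is a small sketch of how the 32-33% share could be computed from PR records: the fraction of successfully AI-reviewed PRs that received at least one commit after the review. The record shape and field names are assumptions, not the study's actual GitHub schema.

```python
# Sketch: fraction of successfully AI-reviewed PRs followed by later commits on the same PR.
# Field names (ai_review_at, commit_times) are hypothetical, not the study's schema.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class PullRequest:
    ai_review_at: Optional[datetime]  # time of first successful AI review, None if absent
    commit_times: List[datetime]      # timestamps of all commits pushed to this PR

def responsiveness(prs: List[PullRequest]) -> float:
    reviewed = [pr for pr in prs if pr.ai_review_at is not None]
    if not reviewed:
        return 0.0
    followed_up = sum(
        1 for pr in reviewed
        if any(t > pr.ai_review_at for t in pr.commit_times)
    )
    return followed_up / len(reviewed)
```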
Software engineering (SE) organizations operate in a knowledge-intensive domain where critical assets (architectural expertise, design rationale, and system intuition) are overwhelmingly tacit and volatile. The departure of key contributors or the decay of undocumented decisions can severely impair project velocity and software quality. While conventional SE risk management, optimized for schedule and budget, is common, the intangible knowledge risks that determine project success remain under-represented. The goal of this research is to propose and evaluate the Knowledge Lever Risk Management (KLRM) Framework, designed specifically for the software development lifecycle. The primary objectives are to: (1) recast intangible knowledge assets as active mechanisms for risk mitigation (Knowledge Levers); (2) integrate these levers into a structured four-phase architecture (Audit, Alignment, Activation, Assurance); and (3) provide a formal stochastic model to quantify the impact of lever activation on project knowledge capital. We detail the application of these levers through software-specific practices such as pair programming, architectural decision records (ADRs), and LLM-assisted development. Stochastic Monte Carlo simulations demonstrate that full lever activation increases expected knowledge capital by 63.8% and virtually eliminates knowledge crisis probability. Our research shows that knowledge lever activation improves alignment across the project management iron triangle (scope, time, cost) by reducing rework and rediscovery costs.
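A minimal Monte Carlo sketch in the spirit of the reported simulations, comparing knowledge-capital trajectories with and without lever activation. The functional form (attrition vs. lever-driven gains), all parameters, and the crisis threshold are invented for illustration and are not the paper's KLRM model.

```python
# Illustrative Monte Carlo: knowledge capital under attrition, with/without lever activation.
# All parameters and the crisis threshold are assumptions, not the KLRM model.
import random

def simulate_capital(levers_on: bool, steps: int = 50, runs: int = 10_000,
                     crisis_threshold: float = 40.0) -> tuple[float, float]:
    finals, crises = [], 0
    for _ in range(runs):
        capital = 100.0
        for _ in range(steps):
            attrition = random.gauss(2.0, 1.0)  # turnover and decay of undocumented decisions
            gain = random.gauss(2.5, 0.8) if levers_on else random.gauss(0.5, 0.8)
            capital = max(capital - attrition + gain, 0.0)
        finals.append(capital)
        crises += capital < crisis_threshold
    return sum(finals) / runs, crises / runs

baseline = simulate_capital(levers_on=False)
activated = simulate_capital(levers_on=True)
print("expected capital (off / on):", round(baseline[0], 1), round(activated[0], 1))
print("crisis probability (off / on):", baseline[1], activated[1])
```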
LLM-generated code is widely used, and the share of committed code produced by LLMs is expected to increase. However, LLMs are not yet effective contributors to production code. We present an approach that exposes the shortcomings of LLM generation on such projects and proposes recommendations; the targets of our study are sizable open-source projects, e.g., FFmpeg and wolfSSL. First, we developed a framework that uses verification and validation to evaluate a given LLM's suitability to fix or add features to an existing project. Second, we apply the framework to 212 commits (bug fixes and small feature improvements) in eight popular open-source projects and three LLMs: GPT-4o, Ministral3, and Qwen3-Coder. The success rate varied from 0% to 60% depending on the project. The LLMs failed in a variety of ways, from generating syntactically incorrect code to producing code that fails basic (static) verification or validation via the project's test suite. In particular, the LLMs struggle with generating new code and with handling contexts (function or file) outside a certain size range, and in many cases their success comes from parroting code changes they have been trained on.
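A sketch of the verification-then-validation gate such a framework implies: a candidate LLM patch must apply, build, and pass the project's test suite before it counts as a success. The build and test commands below are placeholders, not the paper's actual tooling.

```python
# Sketch of a verification/validation gate for an LLM-generated patch.
# Build and test commands ("make", "make test") are placeholders, not the paper's tooling.
import subprocess

def run(cmd: list[str], cwd: str) -> bool:
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def evaluate_patch(repo_dir: str, patch_file: str) -> str:
    if not run(["git", "apply", "--check", patch_file], repo_dir):
        return "rejected: patch does not apply"
    run(["git", "apply", patch_file], repo_dir)
    if not run(["make"], repo_dir):            # verification: does the project still build?
        return "rejected: fails verification (build/static checks)"
    if not run(["make", "test"], repo_dir):    # validation: does the test suite still pass?
        return "rejected: fails validation (test suite)"
    return "accepted"
```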
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts rarely considered; (3) biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
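For readers unfamiliar with test selection metrics, here is one widely studied uncertainty-based example (a DeepGini-style Gini impurity over softmax outputs), which prioritizes the inputs a model is least confident about; whether this exact metric is among the 15 evaluated above is not stated.

```python
# Illustration of an uncertainty-based test selection metric (DeepGini-style Gini impurity).
# Higher impurity = less confident prediction = selected first for labeling/testing.
import numpy as np

def gini_impurity(probs: np.ndarray) -> np.ndarray:
    """probs: (n_inputs, n_classes) softmax outputs. Returns one score per input."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def select_for_testing(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain inputs."""
    return np.argsort(-gini_impurity(probs))[:budget]
```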
Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the influence of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context-dependent bugs in realistic development scenarios.
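A sketch of what a keyword-based grader of this kind might look like, separating correct, partially correct, and incorrect responses. The matching rules and keyword lists are illustrative assumptions, not the authors' exact criteria.

```python
# Sketch of a keyword-based grader: "correct" if the response names the buggy function
# and the fix target, "partial" if it only locates the problematic region.
# Matching rules and keywords are illustrative assumptions, not the paper's criteria.
def grade_response(response: str, buggy_function: str, fix_keywords: list[str]) -> str:
    text = response.lower()
    mentions_region = buggy_function.lower() in text
    mentions_fix = any(kw.lower() in text for kw in fix_keywords)
    if mentions_region and mentions_fix:
        return "correct"
    if mentions_region:
        return "partial"   # locates the problematic code without the exact fix
    return "incorrect"

print(grade_response("The bug is in parse_date; the month index is off by one.",
                     "parse_date", ["off by one", "month index"]))
```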
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tokens |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 120 | $4.50 |
| 2 | GPT-5.4 | 57.0 | 84 | $5.63 |
| 3 | GPT-5.3 Codex | 54.0 | 68 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 57 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 61 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model / Agent | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
OpenViking is an open-source context database designed specifically for AI Agents (such as openclaw). OpenViking unifies the management of the context (memory, resources, and skills) that Agents need through a file system paradigm, enabling hierarchical context delivery and self-evolution.
Official, Anthropic-managed directory of high-quality Claude Code Plugins.
Dimensional is the agentic operating system for physical space. Vibecode humanoids, quadrupeds, drones, and other hardware platforms in natural language and build multi-agent systems that work seamlessly with physical input (cameras, lidar, actuators).
Fully automatic censorship removal for language models
OpenRAG is a comprehensive, single-package Retrieval-Augmented Generation platform built on Langflow, Docling, and OpenSearch.
Firmament Autopilot Embedded System
ARIS ⚔️ (Auto-Research-In-Sleep) — Claude Code skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation via Codex MCP
An AI framework for generating and modding osu! beatmaps for all gamemodes from spectrogram inputs.
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.
A collection of papers on world models for autonomous driving (and robotics, etc.).