The market is sorting itself by who owns the customer relationship and can credibly deliver results, not by who controls the most advanced technology. OpenAI is doubling headcount to 8,000 by end of 2026 while Nvidia's latest conference failed to move Wall Street, a divergence that reflects investor clarity about which companies extract value versus which merely supply it. Open-weight models like Nvidia's Nemotron-Cascade 2 are hitting Gold Medal performance at 30B parameters with only 3B active, directly undercutting the efficiency moat of frontier models, yet this technical progress hasn't translated into market share because distribution and trust still dominate. Meanwhile, the compliance layer is cracking: Delve stands accused of selling fake compliance to hundreds of customers, a publisher rejected an AI-generated novel outright, and Anthropic's survey of 80,000 Claude users shows hallucinations trouble people far more than job displacement fears. Trust, not capability, is the actual constraint.
Research across multi-agent systems, interpretability, and domain-specific applications reveals a consistent finding: observability alone does not guarantee control. Mechanistic methods achieve near-perfect representation of task-relevant information yet fail to translate that knowledge into corrected outputs, while steering approaches show brittleness under deployment stress. Performance gains come from encoding domain structure into training and evaluation rather than scaling generic models. Pedagogically grounded fine-tuning, clinical benchmarks aligned to real-world needs, and neuro-symbolic architectures with declarative constraint specification all demonstrate that what matters for deployment is generalization to unseen tasks and robustness under perturbation, not aggregate metrics on standard leaderboards.
The infrastructure layer is reasserting itself. Trivy dominates vulnerability scanning with consolidated threat detection, while systemd and protobuf remain the unglamorous backbone everything depends on. On GitHub, the secondary pattern is tooling for AI operations and observability: Phoenix and Claude HUD address the friction point that models and agents are now complex enough to require visibility into internal behavior, while opendataloader and Clawith solve the unglamorous problem of getting messy PDFs and enterprise data into usable formats. The gap between what's trendy and what's useful is narrowing. Compensation is shifting too, with tokens becoming a fourth pillar of engineer pay and companies like DoorDash paying gig workers to train AI, suggesting the real pressure is showing up as cost arbitrage rather than capability breakthroughs. Whoever controls the customer relationship wins; everyone else is either a cost center or selling narrative.
Grant Calloway
No lab headlines.
We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: 1. RL reasoning training substantially increases the rate at which models exploit their specifications, 2. Increasing RL reasoning budget has a weakly positive effect on exploit rate, and 3. Test-time mitigations reduce but do not eliminate the rate of specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.
The deployment of Large Language Models (LLMs) for specialized engineering domains, such as circuit analysis, often faces a trade-off between reasoning accuracy and computational efficiency. Traditional evaluation methods treat model performance as a flat metric, failing to account for the hierarchical nature of engineering knowledge. We propose a performance-aware model compression strategy that utilizes prerequisite graphs to optimize model selection for circuit analysis tasks. By structuring electronics design concepts as Directed Acyclic Graphs (DAGs), we can identify the specific complexity horizons of an LLM's compressed variants' tiers. Our framework introduces an agentic pipeline for generating prerequisite-based datasets and a strategic evaluation engine that dynamically cascades queries across a spectrum of compressed variants of an LLM. This approach allows to select the smallest compressed model, given its conceptual knowledge boundaries in circuit analysis. Experimental results on analog electronics datasets demonstrate that prerequisite graphs provide a granular map of model compression with respect to the performance given circuit analysis complexity. (Source Code: https://github.com/pacomesimon/LLM_prereq_graphs_circuit_analysis, Demo: https://huggingface.co/spaces/pacomesimon/LLM_prereq_graphs_circuit_analysis)
Engineering problem solving is central to real-world decision-making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathematical problem solving, which operates on predefined formulations, engineering tasks demand open-ended analysis, feasibility-driven modeling, and iterative refinement. Although large language models (LLMs) have shown strong capabilities in reasoning and code generation, they often fail to ensure feasibility, which limits their applicability to engineering problem solving. To address this challenge, we propose EngiAgent, a multi-agent system with a fully connected coordinator that simulates expert workflows through specialized agents for problem analysis, modeling, verification, solving, and solution evaluation. The fully connected coordinator enables flexible feedback routing, overcoming the rigidity of prior pipeline-based reflection methods and ensuring feasibility at every stage of the process. This design not only improves robustness to diverse failure cases such as data extraction errors, constraint inconsistencies, and solver failures, but also enhances the overall quality of problem solving. Empirical results across four representative domains demonstrate that EngiAgent achieves substantial improvements in feasibility compared to prior approaches, establishing a new paradigm for feasibility-oriented engineering problem solving with LLMs. Our source code and data are available at https://github.com/AI4Engi/EngiAgent.
Distilling large reasoning models is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at \href{https://github.com/DISL-Lab/CoRD}{https://github.com/DISL-Lab/CoRD}.
Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity in R, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 85 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 118 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 71 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 51 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 66 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
Automate the process of making money online.
The systemd System and Service Manager
Find vulnerabilities, misconfigurations, secrets, SBOM in containers, Kubernetes, code repositories, clouds and more
Project N.O.M.A.D, is a self-contained, offline survival computer packed with critical tools, knowledge, and AI to keep you informed and empowered—anytime, anywhere.
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
A project to deploy an online app that predicts the win probability for each NBA game every day. Demonstrates end-to-end Machine Learning deployment.
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
AI Observability & Evaluation
Your agent in your terminal, equipped with local tools: writes code, uses the terminal, browses the web. Make your own persistent autonomous agent on top!
A lightweight Python package for Automatic Speech Recognition using ONNX models