The AI industry's carefully constructed narratives are colliding with an uncooperative reality across multiple fronts. The Pentagon's demand that Anthropic accept military use terms by February 27 or risk designation as a supply chain risk represents the most explicit government intervention in AI company operations since the sector's emergence, forcing a safety-first laboratory to choose between its institutional identity and contracts worth billions. This pressure arrives alongside mounting evidence that the physical world is pushing back against AI's expansion: data center builders are discovering that farmers won't sell land even for million-dollar offers, and Microsoft has retreated from aggressive community relations tactics, vowing to cover full power costs, reject local tax breaks, and replenish water usage. These are not PR troubles to be managed but structural constraints on deployment speed and geography, independent of capital availability.
The market's bifurcation is sharpening along predictable lines. AWS and IBM are positioning AI infrastructure as defensible, recurring-revenue moats with enterprise-focused compute layers and autonomous storage management, while IBM's Missile Defense Agency contract and quantum computing partnership with Cisco signal expanding defense TAM. Anthropic, by contrast, is attempting to own the safety narrative through constitutional classifiers, alignment faking research, and a Responsible Scaling Policy version 3.0, less as philanthropy than as market differentiation for enterprise customers facing regulatory scrutiny. The coding agent space is fragmenting under price pressure: Claude Code's $200 monthly pricing has opened room for free alternatives like Block's Goose, while the NousCoder-14B model's ability to train in four days on 48 Nvidia B200 GPUs challenges the assumption that frontier capabilities require frontier resources.
The benchmark landscape exposes the same gap between perception and measurement. SWE-bench showed zero movement in its latest cycle, with the top 35 models frozen in identical positions, while the newly introduced Artificial Analysis framework produces substantially different rankings that raise questions about what these evaluations actually capture. GitHub trending reinforces the shift: the value has migrated from model capabilities to orchestration infrastructure, with memory systems, multi-agent coordination, and document conversion utilities like Microsoft's markitdown gaining traction over model training tools. The industry appears to have accepted that fine-tuning is solved and is now betting heavily on the application layer, even as public skepticism rises through movements like QuitGPT and research demonstrating that LLMs can generate near-verbatim copies of novels from training data. What remains clear is that the gap between AI's investment thesis and its operational reality is narrowing, and the companies best positioned may be those building for the world as it exists rather than as the sector imagined it.
Grant Calloway
No lab headlines.
While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction. We hypothesize that reverse-engineering the latent agentic trajectories -- the planning, reasoning, and debugging steps -- behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code. Empirical results demonstrate that continuous pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.
Stream-based monitoring is a real-time safety assurance mechanism for complex cyber-physical systems such as unmanned aerial vehicles. The monitor aggregates streams of input data from sensors and other sources to give real-time statistics and assessments of the system's health. Since the monitor is a safety-critical component, it is mandatory to ensure the absence of runtime errors in the monitor. Providing such guarantees is particularly challenging when the monitor must handle unbounded data domains, like an unlimited number of airspace participants, requiring the use of dynamic data structures. This paper provides a type-safe integration of parameterized streams into the stream-based monitoring framework RTLola. Parameterized streams generalize individual streams to sets of an unbounded number of stream instances and provide a systematic mechanism for memory management. We show that the absence of runtime errors is, in general, undecidable but can be effectively ensured with a refinement type system that guarantees all memory references are either successful or backed by a default value. We report on the performance of the type analysis on example specifications from a range of benchmarks, including specifications from the monitoring of autonomous aircraft.
Programmer attribution seeks to identify or verify the author of a source code artifact using stylistic, structural, or behavioural characteristics. This problem has been studied across software engineering, security, and digital forensics, resulting in a growing and methodologically diverse set of publications. This paper presents a systematic mapping study of programmer attribution research focused on source code analysis. From an initial set of 135 candidate publications, 47 studies published between 2012 and 2025 were selected through a structured screening process. The included works are analysed along several dimensions, including authorship tasks, feature categories, learning and modelling approaches, dataset sources, and evaluation practices. Based on this analysis, we derive a taxonomy that relates stylistic and behavioural feature types to commonly used machine learning techniques and provide a descriptive overview of publication trends, benchmarks, programming languages. A content-level analysis highlights the main thematic clusters in the field. The results indicate a strong focus on closed-world authorship attribution using stylometric features and a heavy reliance on a small number of benchmark datasets, while behavioural signals, authorship verification, and reproducibility remain less explored. The study consolidates existing research into a unified framework and outlines methodological gaps that can guide future work. This manuscript is currently under review. The present version is a preprint.
Code LLMs still struggle with code execution reasoning, especially in smaller models. Existing methods rely on supervised fine-tuning (SFT) with teacher-generated explanations, primarily in two forms: (1) input-output (I/O) prediction chains and (2) natural-language descriptions of execution traces. However, intermediate execution steps cannot be explicitly verified during SFT, so the training objective can reduce to merely matching teacher explanations. Moreover, training data is typically collected without explicit control over task difficulty. We introduce ExecVerify, which goes beyond text imitation by incorporating verifiable white-box rewards derived from execution traces, including next-statement prediction and variable value/type prediction. Our work first builds a dataset with multiple difficulty levels via constraint-based program synthesis. Then, we apply reinforcement learning (RL) to reward correct answers about both intermediate execution steps and final outputs, aligning the training objective with semantic correctness at each execution step. Finally, we adopt a two-stage training pipeline that first enhances execution reasoning and then transfers to code generation. Experiments demonstrate that a 7B model trained with ExecVerify achieves performance comparable to 32B models on code reasoning benchmarks and improves pass@1 by up to 5.9\% on code generation tasks over strong post-training baselines.
Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a practical scenario. To this end, we curate datasets that reflect realistic assumptions (i.e., patches produced by tools run under the same experimental conditions). Next, we use these data to benchmark six state-of-the-art POD approaches -- spanning static analysis, dynamic testing, and learning-based approaches -- against two baselines based on random sampling (one from prior work and one proposed herein). Our results are striking: Simple random selection outperforms all POD tools for 71% to 96% of cases, depending on the POD tool. This suggests two main takeaways: (1) current POD tools offer limited practical benefit, highlighting the need for novel techniques; (2) any POD tool must be benchmarked on realistic data and against random sampling to prove its practical effectiveness. To this end, we encourage the APR community to continue improving POD techniques and to adopt our proposed methodology for practical benchmarking; we make our data and code available to facilitate such adoption.
Resolving issues on code repositories is an important part of software engineering. Various recent systems automatically resolve issues using large language models and agents, often with impressive performance. Unfortunately, most of these models and agents focus primarily on Python, and their performance on other programming languages is lower. In particular, a lot of enterprise software is written in Java, yet automated issue resolution for Java is under-explored. This paper introduces iSWE Agent, an automated issue resolver with an emphasis on Java. It consists of two sub-agents, one for localization and the other for editing. Both have access to novel tools based on rule-based Java static analysis and transformation. Using this approach, iSWE achieves state-of-the-art issue resolution rates across the Java splits of both Multi-SWE-bench and SWE-PolyBench. More generally, we hope that by combining the best of rule-based and model-based techniques, this paper contributes towards improving enterprise software development.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 82 | $4.50 |
| 2 | GPT-5.3 Codex | 54 | 74 | $4.81 |
| 3 | Claude Opus 4.6 | 53 | 48 | $10.00 |
| 4 | Claude Sonnet 4.6 | 51.7 | 29 | $6.00 |
| 5 | GPT-5.2 | 51.3 | 63 | $4.81 |
💖🧸 Self hosted, you-owned Grok Companion, a container of souls of waifu, cyber livings to bring them into our worlds, wishing to achieve Neuro-sama's altitude. Capable of realtime voice chat, Minecraft, Factorio playing. Web / macOS / Windows supported.
Production-ready implementation of InvisPose - a revolutionary WiFi-based dense human pose estimation system that enables real-time full-body tracking through walls using commodity mesh routers
🌊 The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, distributed swarm intelligence, RAG integration, and native Claude Code / Codex Integration
Python tool for converting files and office documents to Markdown.
An open-source SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skills and subagents, it handles different levels of tasks that could take minutes to hours.
A modular Swift SDK for audio processing with MLX on Apple Silicon
Open-source simulator for autonomous driving research.
VeritasGraph: Enterprise-Grade Graph RAG for Secure, On-Premise AI with Verifiable Attribution
Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.
Production-ready platform for agentic workflow development.