The Inference Report

April 4, 2026

The industry is splintering into two incompatible futures while the public face remains unified. On one side, companies are hardening infrastructure around proprietary systems, local execution, and political control. OpenAI is reorganizing leadership and acquiring media properties while Anthropic buys biotech startups, launches political action committees, and quietly ships code with known vulnerabilities to npm before attempting to DMCA 8,100 repositories. On the other side, the security perimeter is collapsing faster than it can be rebuilt. Claude Code operates with 90 percent autonomy when weaponized by state actors. Meta's AI agents trigger severity-one incidents. The Europa.eu platform lost 350 gigabytes through a supply chain attack on an open-source vulnerability scanner. HackerOne paused Internet Bug Bounty payouts after acknowledging it cannot handle open-source security anymore. The more autonomous these systems become, the less the existing security model holds.

The capital intensity required to scale inference is meeting hard physical limits. Trump's AI data center buildout is delayed across nearly 50 percent of projects because China controls key power infrastructure. Meta, Microsoft, and Google are betting billions on natural gas plants while communities prefer Amazon warehouses in their backyards. Google just added Flex Inference and Priority Inference tiers to Gemini because inference costs, not training costs, are now the binding constraint. Anthropic's 400 million dollar acquisition of Coefficient Bio and its new PAC suggest preparation for a longer game than quarterly model releases. OpenAI's move to acquire TBPN and create a special projects role signals internal focus shifting away from product velocity toward structural positioning.

Measurement and enforcement are becoming table stakes for enterprise adoption. Google is publishing work on behavioral alignment measurement while AWS ships centralized governance tooling across customer accounts. Only one is currently collecting revenue. In research, two complementary trajectories are emerging: one treats LLMs as semantic reasoners augmented with domain-specific constraints for detecting vulnerabilities, the other interrogates whether LLM outputs remain robust under variation through rigorous benchmarking. Both converge on controlled experimental design, yet the gap between laboratory conditions and real-world deployment remains substantial. Claude Opus 4.6 holds 65.3 percent on SWE-rebench, up 12.3 points, while Artificial Analysis shows minimal top-tier movement at 57.2 percent. The divergence reflects a fundamental problem: different benchmarks measure different distributions, making cross-methodology comparison unreliable.

Developer tooling is consolidating around infrastructure rather than capability. Conversational interfaces like Onyx and Prompts.Chat treat model switching and prompt management as friction to eliminate. Deeper architectural work is happening elsewhere: Microsoft's Presidio for PII redaction, Google's TimesFM for time-series forecasting, Genkit for application runtimes that use LLMs as components. The pattern across trending and discovery repositories is clear. The next wave isn't better chat. It's systems that manage state, route data, keep private data private, and learn from interaction. Agents and reinforcement learning are merging in the developer discovery set. The infrastructure bet is real. The security model is not.

Grant Calloway

Research Papers: Focused
Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA cs.SE

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only ≈ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
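The three-condition design described above can be sketched as a small calculation. This is only an illustration: the class and function names are hypothetical, the scores are made-up placeholders rather than results from the paper, and reading the closed-book score as a memorization baseline is an interpretation of the abstract, not a definition taken from it.

```python
from dataclasses import dataclass


@dataclass
class ConditionScores:
    """Mean judge scores for one model under the three conditions.

    All values used with this class here are hypothetical placeholders.
    """
    closed_book: float  # no repository access
    code_only: float    # repository with documentation removed
    documented: float   # full repository


def condition_deltas(s: ConditionScores) -> dict[str, float]:
    """Return the between-condition differences (assumed interpretation)."""
    return {
        # What the model can answer with no repository at all.
        "closed_book_baseline": s.closed_book,
        # Gain attributable to reading the code itself.
        "code_gain": s.code_only - s.closed_book,
        # Additional gain attributable to documentation.
        "doc_gain": s.documented - s.code_only,
    }


if __name__ == "__main__":
    example = ConditionScores(closed_book=0.40, code_only=0.60, documented=0.65)
    for name, value in condition_deltas(example).items():
        print(f"{name}: {value:.2f}")
```

Under this reading, a doc_gain near zero on code-derivable tasks would correspond to the abstract's "code-only ≈ documented" observation.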

On the Road to Personalized Code Intelligence: Portraiting and Assisting Developers Based on Their In-IDE Behaviors cs.SE

With the advent of large language models, research in automated software engineering has increasingly focused on leveraging these models to achieve a deeper semantic understanding of code or to engineer sophisticated agent-based processes. However, this research trajectory has largely overlooked a critical factor: the developers themselves. Programming is a deeply individualized activity; developers exhibit significant variation in their tool-chain preferences, domain-specific expertise, and problem-solving strategies. Consequently, the current paradigm of one-size-fits-all code intelligence systems struggles to accommodate the needs of individual developers. To address this gap, we introduce VirtualME, a novel IDE-embedded data infrastructure designed to model the developer by continuously capturing and interpreting their dynamic programming behaviors and preferences. VirtualME contains three components. (1) Log-level Behavior Extraction: it captures and extracts developers' log-level behaviors from the IDE. (2) Task-level Behavior Recognition: it aggregates log-level behaviors into task-level behaviors via a multi-agent pipeline. (3) Developer-personality Measurement: it builds a rule engine to distill a four-dimensional developer persona: technology stack, ability, behavioral habits, and learning style. On top of VirtualME, we propose a solution for personalized repository-level knowledge Q&A by integrating the developer persona into the Q&A agent. We evaluated VirtualME by building a multi-repository benchmark with real-world developer trajectories, balancing correctness and personalization. Experimental results show that VirtualME-enhanced answers outperform generic baselines on five dimensions, yielding an average 33.80% improvement. Our results demonstrate that abundant, continuous developer-behavior data can pave the way for adaptive and personalized code intelligence.

Offloading Score: Measuring AI Reliance Through Counterfactual Workflows cs.SE

AI tools are increasingly integrated into real-world workflows. However, existing measures of reliance on these tools focus on AI output adoption or on self-reported indicators, rather than how task effort is distributed between users and tools. Here, we introduce offloading score, a measure of reliance that quantifies the fraction of cognitive effort offloaded to an AI tool. Offloading score is simulation-based: we construct a counterfactual workflow by estimating how the user would have completed the task without the tool, and then computing the fraction of steps saved by using the tool. We validate offloading score through intrinsic evaluations of metric validity, and a controlled user study (n=40) with developers performing programming tasks using AI tools. We vary time pressure to test whether reliance measures capture the known increase in reliance under time pressure. We show that offloading score detects significantly higher reliance in time-constrained settings (+43%, p=0.018), while usage-based and self-reported baseline measures of reliance do not distinguish the conditions. We complement this with descriptive insights showing that higher reliance manifests as greater delegation of subtasks to the tool and more direct reuse of AI outputs. Finally, we demonstrate an approach of using offloading score in combination with target outcomes of a task (e.g., code understanding) to identify when reliance may be (in)appropriate. Our framework offers two contributions: an instrument users can apply to measure and reflect on their own reliance, and a quantitative signal that agent designers can utilize to mitigate overreliance.
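The "fraction of steps saved" idea can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the paper's method: the abstract does not say how steps are counted or how the counterfactual workflow is simulated, so the function name, its parameters, and the choice to clamp negative savings to zero are all assumptions.

```python
def offloading_score(counterfactual_steps: int, steps_with_tool: int) -> float:
    """Fraction of the estimated tool-free workflow that the tool saved.

    counterfactual_steps: estimated steps the user would have needed
        without the AI tool (hypothetical input; the estimation itself
        is not modeled here).
    steps_with_tool: steps the user actually performed with the tool.
    """
    if counterfactual_steps <= 0:
        raise ValueError("counterfactual_steps must be positive")
    if steps_with_tool < 0:
        raise ValueError("steps_with_tool must be non-negative")
    # Assumption: if the tool did not save any steps, report zero reliance
    # rather than a negative score.
    saved = max(counterfactual_steps - steps_with_tool, 0)
    return saved / counterfactual_steps


if __name__ == "__main__":
    # A task estimated at 20 manual steps, completed in 8 with the tool.
    print(offloading_score(20, 8))  # 0.6
```

A score of 0 means the user did all the estimated work themselves; a score of 1 means every estimated step was offloaded to the tool.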

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions cs.SE

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.

Usability Analysis of Configurator User Interfaces with Multimodal Large Language Models cs.SE

Configuration is a key technology for tailoring complex software systems, services, and products. A successful application of configurators not only depends on technical correctness, performance, and domain modeling but also on their usability. While general usability heuristics are widely used, configurator-specific criteria and tool support for systematic user interface (UI) analysis are limited. This paper explores the use of multimodal large language models (MLLMs) for scalable and semi-automated usability analysis of configurator UIs. We synthesize 18 configurator-specific usability criteria from the literature and apply these criteria in an MLLM-based analysis of 16 real-world configurators. Each criterion is assessed individually to generate severity ratings for usability issues and actionable improvement suggestions. A review of the results confirms that MLLMs can reliably identify configurator-specific usability issues and provide domain-aware improvement recommendations. Although human validation remains necessary, this approach has the potential to significantly reduce the required effort to analyze configurator usability.

CODEFUSE-DEBENCH: An Empirical Study on Readability, Recompilability, and Functionality cs.SE

Binary decompilation aims to recover binaries into high-level source code, but existing evaluations mainly rely on syntactic similarity or single-axis readability metrics, which fail to capture practical reusability. We propose a reusability-driven evaluation paradigm that measures decompiler quality along three orthogonal dimensions: readability, recompilability, and functionality. We present DEBENCH, the first automated framework for multidimensional decompilation evaluation. DEBENCH contains 240 atomic test functions, organized into 8 source files and compiled into 640 binaries. It combines LLM-as-judge readability scoring with URAF (18 sub-dimensions), iterative compile-and-repair under a fixed 50-iteration budget, and Frida-based differential dynamic tracing at the program, function, and instruction levels. We evaluate five mainstream decompilers and three repair LLMs. Our study reveals four findings. First, the reusability cliff is steep: the best decompiler-LLM pair reaches 22.3% Exact+Partial program-level behavioral overlap but only 1.2% exact stdout match, nearly 50 points below recompilability. Second, settings that maximize readability do not maximize functionality: -O3 yields the lowest readability but the highest functionality, and Clang gives lower readability than GCC but 2.6x higher functionality. Third, cross-decompiler variation at the functional level is 20x, far larger than the 1.6x cross-LLM variation, showing that progress depends more on decompiler engines than larger repair models. Fourth, failures fall into three categories: syntactic noise, type-system collapse (about 19% of repair errors), and irreversible upstream losses such as ARM64 relocation idioms and C++ ABI features.

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M tokens
1  GPT-5.4                 57.2   76     $5.63
2  Gemini 3.1 Pro Preview  57.2   118    $4.50
3  GPT-5.3 Codex           54     72     $4.81
4  Claude Opus 4.6         53     46     $10.00
5  Claude Sonnet 4.6       51.7   52     $6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  Gemini 3.1 Pro Preview     62.3%
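
The editorial's point about cross-benchmark comparison can be checked against the two leaderboards above. The sketch below lines up the rank of each model that appears in both top-five lists. Only entries with identical names are treated as the same model: whether "GPT-5.4" and "gpt-5.4-2026-03-05-medium" refer to the same configuration is not stated, so they are left unmatched. The helper name shared_ranks is illustrative.

```python
# Top-five orderings as listed in the two tables (rank 1 first).
aa_order = [
    "GPT-5.4",
    "Gemini 3.1 Pro Preview",
    "GPT-5.3 Codex",
    "Claude Opus 4.6",
    "Claude Sonnet 4.6",
]
swe_rebench_order = [
    "Claude Opus 4.6",
    "gpt-5.2-2025-12-11-medium",
    "GLM-5",
    "gpt-5.4-2026-03-05-medium",
    "Gemini 3.1 Pro Preview",
]


def shared_ranks(a: list[str], b: list[str]) -> dict[str, tuple[int, int]]:
    """Map each model present in both lists to its (rank in a, rank in b)."""
    rank_a = {name: i for i, name in enumerate(a, start=1)}
    rank_b = {name: i for i, name in enumerate(b, start=1)}
    return {name: (rank_a[name], rank_b[name]) for name in a if name in rank_b}


if __name__ == "__main__":
    for model, (aa, swe) in shared_ranks(aa_order, swe_rebench_order).items():
        print(f"{model}: Artificial Analysis #{aa}, SWE-rebench #{swe}")
    # Gemini 3.1 Pro Preview: Artificial Analysis #2, SWE-rebench #5
    # Claude Opus 4.6: Artificial Analysis #4, SWE-rebench #1
```

The two exactly matched models swap relative order between the lists, which is consistent with the claim that the benchmarks measure different distributions. With only two matched entries, this is an illustration rather than a statistical result.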