The Inference Report

April 19, 2026

The infrastructure layer is consolidating into an oligopoly while the surface layer explodes into commodity, and the winners will be whoever can operate at both levels simultaneously. Cerebras is filing for an IPO on the back of billion-dollar compute commitments from Amazon and OpenAI, signaling that the chip wars are already decided in favor of whoever can lock in capacity years in advance. Meanwhile, Tesla's robotaxi expansion to Dallas and Houston operates in a regulatory vacuum: three Texas cities represent the entire addressable market for driverless cars without a safety driver, which makes it a sandbox, not scale. Anthropic, simultaneously designated a supply-chain risk by the Pentagon and courting the Trump administration, is betting on Schematik for hardware integration while security researchers warn that its Mythos model could accelerate hacking faster than fixes can ship. The real leverage is moving from software to hardware control, and the research community is racing to catch up.

On the benchmarks, the top tier has crystallized around 60-65 percent resolve rates on SWE-rebench, with Claude Opus 4.6 holding first place at 65.3 percent. The meaningful movement comes from Chinese models climbing substantially: GLM-5 jumped 13 percentage points to rank 3, GLM-4.7 surged 16.6 points to rank 14, and Kimi K2.5 added 11.7 points. Gemini 3.1 Pro Preview dropped from rank 2 to rank 6 despite maintaining 62.3 percent, suggesting the benchmark has become more discriminating at the high end. The clustering between 58 and 62 percent indicates diminishing returns, with only 5.4 percentage points separating first from tenth place.

Computer security research exposes a consistent gap: existing defenses fail not at the boundary they claim to protect, but between reasoning and execution. SafeHarness, Parallax, and SIR-Bench demonstrate that lifecycle-integrated defense layers and cognitive-executive separation outperform isolated guardrails, while jailbreak research has shifted from prompt injection to circuit-level intervention. Privacy mechanisms show no single technique dominates; differential privacy leaks at calibrated epsilon tiers, and compositional attacks bypass defenses tuned to isolated signals. The methodological pattern is controlled evaluation exposing where defenses actually fail, not where vendors claim they work.

On the surface, agents are becoming infrastructure. OpenAI's framework and Dify's platform are drawing serious adoption because they solve orchestration, tool integration, and state management so that teams do not have to rebuild them each time. Running in parallel is a wave of tools treating AI as a control layer for existing infrastructure: Claude Desktop for Debian, better-agent-terminal, and RustDesk gaining adoption as an open alternative to TeamViewer. Developers are past the "what if we put an LLM in this" phase and into "how do we make this LLM useful for our specific constraints." Infrastructure plays like DeepGEMM and Picovoice's on-device speech engine are gaining ground alongside agent frameworks because efficiency and control, not novelty, are becoming the differentiators. The marginal cost of building something AI-powered has collapsed to nearly zero, fragmenting the surface layer while compute commitments consolidate the core.

Grant Calloway

AI Labs

MiniMax

MiniMax Speech 2.8

From the Wire

Research Papers: Focused

A Measurement Study of AI-Environment Realism Gaps in Malware-Analysis Sandboxes cs.CR

Sandboxing remains a core technique for observing suspicious program behavior, yet environment-aware malware increasingly suppresses execution when analysis is suspected. Prior generations of sandbox evasion focused on virtualization artifacts, timing discrepancies, and wear-and-tear realism. In this paper, we present the first systematic measurement study of AI-environment artifacts as a new sandbox-evasion surface. We operationalize this realism gap through AIprint, a probe framework that captures persistent artifacts left behind by AI-capable software ecosystems, including AI-assistant configuration directories, model caches, environment variables, local inference services, and package dependencies. We systematically extract 450 unique artifacts from 284 open-source AI projects on GitHub, compile them into unprivileged Windows probes, and evaluate them across seven commercial and open-source sandbox backends together with three AI-capable reference hosts. Our results show that traditional VM-detection baselines fail to reliably distinguish real AI-capable systems from modern sandboxes, whereas twelve AI-environment artifacts appear on the reference hosts and on none of the evaluated backends. A controlled 214-step installation experiment establishes a causal relationship between AI tool and package installation and measurable AI-environment artifact accumulation, while adaptive spoofing experiments reveal a fundamental operational asymmetry: reproducing convincing AI software environments is substantially more expensive than detecting shallow spoofing.
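To make the probe idea concrete, here is a minimal Python sketch of an unprivileged check for AI-environment artifacts. It is not AIprint itself: the paper compiles Windows probes from 450 extracted artifacts, whereas the three directories and three environment variables below are illustrative assumptions, and `probe_ai_artifacts` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Illustrative candidates only (assumptions, not the paper's artifact list).
CANDIDATE_DIRS = (".cache/huggingface", ".ollama", ".claude")
CANDIDATE_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_HOME")


def probe_ai_artifacts(
    home: Optional[Path] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the candidate artifacts that exist for this user.

    Only reads directory existence and environment variable names,
    so it needs no elevated privileges.
    """
    home = Path.home() if home is None else home
    environ = os.environ if environ is None else environ

    found: List[str] = []
    for rel in CANDIDATE_DIRS:
        if (home / rel).is_dir():
            found.append(f"dir:{rel}")
    for name in CANDIDATE_ENV_VARS:
        if name in environ:
            found.append(f"env:{name}")
    return found
```

Calling `probe_ai_artifacts()` with no arguments inspects the current user's home directory and environment. An empty result would be consistent with a host that has no AI tooling installed, which is the kind of signal the abstract says evasive malware could use to suspect a sandbox.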

Qubes OS Security in the Public Record cs.CR

Qubes OS is a revealing case for security measurement because its architecture makes component boundaries security-relevant. We present a protocol-driven longitudinal analysis of 109 public Qubes Security Bulletins (QSBs, 2011-2025), the official Qubes-maintained Xen Security Advisory (XSA) tracker, and a secondary vulnerability-event sensitivity series. The study measures the public advisory record rather than latent vulnerability incidence or realized compromise. The methodology combines audited deterministic component attribution, change-point analysis, overdispersion checks, severity-proxy weighting, censoring sensitivity, documentary latency lower bounds, and baseline-aware evaluation of vulnerability discovery models (VDMs). The results show persistent upstream dependence in that public record. On the official tracker, 113 of 464 XSAs affect Qubes; under primary labeling, 87 of 109 QSBs (79.8%) are attributable to Xen, CPU/microarchitectural, or other upstream components rather than Qubes-core logic, with similar results under weighted views. Change-point analyses identify 2015Q1 as the dominant break in the quarterly advisory series, while post-2018 annual disclosure rates are statistically flat. Poisson inferences are stable under dispersion diagnostics and negative-binomial sensitivity checks. The attribution codebook performs well in a stratified 30-QSB audit, and S-shaped VDMs fit descriptively but do not significantly outperform a rolling-mean baseline in short-horizon forecasts. Overall, the Qubes public advisory record appears stable, but not quiet: disclosure activity plateaus at a higher level than in the earliest years, while the observed burden remains concentrated in upstream trust anchors.
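The abstract mentions overdispersion checks without naming the diagnostic. One common quick check, shown here as an assumption rather than the authors' method, is the index of dispersion: the sample variance of the per-period counts divided by their mean. A Poisson process has variance equal to its mean, so values well above 1 suggest overdispersion and would motivate a negative-binomial sensitivity check. The quarterly counts below are made up for illustration.

```python
from statistics import mean, variance
from typing import Sequence


def dispersion_index(counts: Sequence[int]) -> float:
    """Sample variance divided by mean; close to 1 for Poisson-like counts."""
    if len(counts) < 2:
        raise ValueError("need at least two periods to estimate variance")
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero, so the index is undefined")
    return variance(counts) / m


# Made-up quarterly advisory counts, NOT data from the paper.
example_counts = [0, 4, 0, 4]
# Mean is 2 and sample variance is 16/3, so the index is 8/3.
example_index = dispersion_index(example_counts)
```

An index near 8/3, as in this toy series, would point to burstier counts than a plain Poisson model assumes; perfectly even counts such as `[2, 2, 2, 2]` give an index of 0.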

Bad Memory: Evaluating Prompt Injection Risks from Memory in Agentic Systems cs.CR

A growing class of agentic systems maintain persistent state across sessions through memory files, behavioral preferences, and knowledge bases. While this makes agents more useful and self-improving, it also creates a new attack surface for prompt injections in which malicious instructions can be embedded within persistent files and influence future behavior. In this work, we study prompt injection attacks in memory-based agentic systems using a sandboxed synthetic workspace. We evaluate two agentic systems, Anthropic Claude Code and OpenAI Codex, across four models: Claude Haiku 4.5, Claude Opus 4.7, GPT-5.2, and GPT-5.5. Our results show that although it is difficult to make an agent overwrite its own memory files using untrusted external content, payloads already planted in those files can successfully attack current and future sessions. Attack success and payload persistence vary substantially across systems, models, adversarial goals, and multi-session attack sequences. These findings show that persistent memory changes the threat model for prompt injection and motivate defenses that protect memory updates without removing useful agent adaptation.

MemPoison: Uncovering Persistent Memory Threats and Structural Blind Spots in LLM Agents cs.CR

Persistent external memory enhances agent continuity but introduces persistent security vulnerabilities: adversarial content can be injected via standard interaction channels, retained across turns, and later distort downstream behavior. To address this challenge, we propose MemPoison, a comprehensive benchmark and analysis framework featuring 1227 hand-validated cases across four attack types, three injection channels, and three representative memory substrates, evaluated on seven open-weight and three closed-weight model families. We introduce a three-tier taxonomy: (L1) direct single-record corruption, (L2) compositional multi-record corruption and (L3) context-triggered dormant corruption. Our evaluations reveal a distinct defense frontier: while baseline write-time defenses, such as consistency checks, substantially suppress direct L1 attacks, they fail to reliably suppress L2 and L3 attacks. Through mechanistic influence decomposition (MID), we demonstrate structural blind spots in write-time defenses, which admit seemingly benign records that later become harmful through joint retrieval composition or trigger-conditioned activation. Our findings advocate for shifting from static filtering to adaptive, context-sensitive memory defense strategies.
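A toy sketch can illustrate the L2 blind spot described above. It is not the MemPoison benchmark or its MID analysis: the rule (`is_harmful`), the records, and the host allowlist are all hypothetical. The point is only that a write-time filter scoring each record alone can admit two records that trip the same rule once they are retrieved together.

```python
import re
from typing import Iterable, List

ALLOWED_HOSTS = {"backup.internal.example"}  # hypothetical allowlist
HOST_PATTERN = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b")


def is_harmful(text: str) -> bool:
    """Toy rule: an upload instruction plus a host outside the allowlist."""
    lowered = text.lower()
    hosts = set(HOST_PATTERN.findall(lowered))
    return "upload" in lowered and bool(hosts - ALLOWED_HOSTS)


def write_time_filter(records: Iterable[str]) -> List[str]:
    """Admit each record only if it is harmless when judged on its own."""
    return [r for r in records if not is_harmful(r)]


def retrieval_time_check(retrieved: Iterable[str]) -> bool:
    """Judge the records jointly, as they would appear in one context."""
    return is_harmful(" ".join(retrieved))


# Hypothetical records: neither one alone matches the toy rule.
records = [
    "Backups: always upload the vault file after each session.",
    "Preferred destination for backups: files.attacker.example",
]
stored = write_time_filter(records)                 # both records are admitted
flagged_on_retrieval = retrieval_time_check(stored)  # the composition is flagged
```

The first record carries the action and the second carries the destination, so only the joint view sees both. A real defense would need far more than keyword matching; the sketch only shows why the evaluation point matters.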

The Distributed Open-Source Vulnerability Ecosystem cs.CR

Identifying known software vulnerabilities is a central task in software supply chain security management. Although publicly available vulnerability information is based on shared standards, different vulnerability scanners often report divergent results for identical software inventories. These differences do not arise solely from individual data sources or scanner implementations. They can emerge at several stages of the open-source vulnerability ecosystem. This paper presents a conceptual framework that describes vulnerability management as a distributed process of information exchange and transformation. It traces vulnerability information from its creation and standardization through enrichment to context-dependent interpretation. The analysis identifies heterogeneous information sources, divergent identity and version models, temporal change, and context-dependent assessment as major causes of inconsistent scanner findings. It then discusses the implications for interpreting analysis results, designing reproducible evaluation methods, and handling dynamic vulnerability knowledge in practice.

Setup Complete, Now You Are Compromised: Weaponizing Setup Instructions Against AI Coding Agents cs.CR

AI coding agents set up projects by reading documentation and installing the dependencies it lists, without verifying their names, sources, or known vulnerabilities. By editing only a README, requirements file, or Makefile, an attacker can redirect the agent to an untrusted registry, a known-vulnerable version, or a wrong-but-plausible name: documentation becomes a vector for code execution. We present the first systematic evaluation of package-install-time supply-chain attacks delivered through ordinary project-setup documentation across production coding-agent harnesses, probing frontier models on twelve scenarios in five attack classes, grounded in documented incidents. The same model catches an attack through one harness and installs it through another: install-time security rests on the harness-model combination, not the model alone. Agents catch blatant typosquats reliably, but plausible separator-confusion names (azurecore for azure-core) slip through, and how often depends on the harness-model pairing. Source-based attacks like registry redirection are missed almost everywhere. The source blind spot recurs on npm and Cargo, where nearly every model installs the untrusted dependency; name detection carries over less consistently across ecosystems. Security-oriented prompts recover part of the gap but only for the dimension they name; a deterministic pre-install check that verifies names, sources, and versions before any code runs closes most of it.
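The abstract does not describe how its deterministic pre-install check is built, so the following is a minimal sketch under stated assumptions. The allowlist, the placeholder "known-vulnerable" version, and the names `InstallRequest` and `check_install` are hypothetical. Name comparison uses the separator normalization from PEP 503, plus a separator-stripped form to catch confusions such as azurecore for azure-core.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set

TRUSTED_INDEXES = {"https://pypi.org/simple"}

# Hypothetical allowlist: canonical name -> versions to refuse.
# The version listed here is a placeholder, not a real advisory.
ALLOWLIST: Dict[str, Set[str]] = {"azure-core": {"0.0.1"}, "requests": set()}


@dataclass(frozen=True)
class InstallRequest:
    name: str
    version: str
    index_url: str


def _canonical(name: str) -> str:
    """PEP 503 style normalization: runs of -, _ and . become one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def _squashed(name: str) -> str:
    """Drop separators entirely so azurecore and azure-core compare equal."""
    return re.sub(r"[-_.]+", "", name).lower()


def check_install(req: InstallRequest) -> List[str]:
    """Return a list of problems; an empty list means the install may proceed."""
    problems: List[str] = []
    canon = _canonical(req.name)

    if req.index_url.rstrip("/") not in TRUSTED_INDEXES:
        problems.append(f"untrusted source: {req.index_url}")

    if canon not in ALLOWLIST:
        lookalikes = [k for k in ALLOWLIST if _squashed(k) == _squashed(req.name)]
        if lookalikes:
            problems.append(f"name {req.name!r} looks like {lookalikes[0]!r}")
        else:
            problems.append(f"unknown package: {req.name}")
    elif req.version in ALLOWLIST[canon]:
        problems.append(f"known-vulnerable version: {canon}=={req.version}")

    return problems
```

Because the check is a pure function of the requested name, version, and source, it gives the same answer regardless of which model or harness proposed the install, which is the property the abstract credits with closing most of the gap.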

Benchmarks