The Inference Report — April 2, 2026

The compute arms race has entered a new phase where infrastructure control determines market position, and the companies that can lock in power, compress hardware deployment timelines, and absorb regulatory complexity are already separating from those that cannot. Meta's commitment to ten natural gas plants powering its Hyperion data center and Cognichip's $60 million raise on claims of 75% cost reduction and 50% timeline compression in chip design are not marginal improvements but structural bets that the next cycle belongs to whoever can compress the cost and time to deploy compute at scale. Poolside's stalled Texas project and forced negotiations with Google and other cloud providers make the calculus explicit: independent infrastructure plays are losing leverage to companies that can absorb the capital requirements and regulatory complexity of massive data center deployments. The economics have shifted decisively.

This consolidation around compute is matched by a bifurcation in how companies are deploying AI. OpenAI's Gradient Labs is shipping concrete product, AI account managers in banking using GPT-4.1 and purpose-built smaller models for latency-sensitive workflows, the kind of deployment that generates revenue and defensible customer relationships. GitHub's /fleet feature for parallel agent dispatch and Hugging Face's Holo3 announcement signal the same momentum: builders are moving past single-agent orchestration into systems that distribute work across multiple models and processes. By contrast, IBM and AMD are playing a different game, announcing FedRAMP authorizations, decade-long research initiatives, and detailed MLPerf submissions that establish them as trusted intermediaries in regulated environments and foundational research. Consumer-facing applications are racing ahead through product velocity and direct customer relationships. Infrastructure players are establishing credibility through partnerships, compliance achievements, and benchmark credentials.

The technical frontier has shifted from model architecture to the systems that deploy and secure them, and that shift exposes unresolved tensions. Anthropic's accidental mass takedown of GitHub repositories containing leaked Claude Code source, followed by the discovery that Claude Code can uncover zero-day exploits in Vim and GNU Emacs in seconds, reveals that the same tool making security research trivial also makes security risk trivial to create. Anthropic's response was operational damage control, not a technical solution. Meanwhile, Slack's repositioning of Slackbot as an orchestration layer for agentic workflows, Asana's emphasis on multiplayer AI agents, and Meta's semi-formal reasoning technique for code review achieving 93% accuracy show enterprises moving past chatbots toward systems that make decisions and execute tasks across multiple applications and teams. What remains unresolved is whether the tools that make this possible can be secured, and whether the companies deploying them have any meaningful way to audit what those agents are actually doing once they are turned loose on company infrastructure.

Benchmark volatility and developer momentum toward agentic tools underscore the stakes. Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, a 12.3-point gain from its prior ranking, yet the two major benchmarks diverge materially in their orderings, implying that they capture different failure modes in code generation or apply stricter evaluation criteria around execution correctness. Reasoning-focused variants like Kimi K2.5 are closing gaps on general-purpose models in software engineering tasks. On GitHub, Claude Code and related agentic tooling have moved from research artifact to daily driver, while infrastructure projects like Pixeltable, Rerun, and Qdrant represent a category shift toward unsexy but essential data plumbing for multimodal pipelines. Simultaneously, projects like PicoLLM signal a countercurrent toward running models locally without cloud dependencies. The field is scaling inference infrastructure while constraining it, building both the pipes and the guardrails.

Grant Calloway

AI LabsAll labs

AMD

GitHub Blog

Run multiple agents at once with /fleet in Copilot CLI

Hugging Face

Holo3: Breaking the Computer Use Frontier

IBM

OpenAI

Gradient Labs gives every bank customer an AI account manager

From the WireAll feeds

Research Papers — FocusedAll papers

A Measurement Study of AI-Environment Realism Gaps in Malware-Analysis Sandboxes cs.CR

Sandboxing remains a core technique for observing suspicious program behavior, yet environment-aware malware increasingly suppresses execution when analysis is suspected. Prior generations of sandbox evasion focused on virtualization artifacts, timing discrepancies, and wear-and-tear realism. In this paper, we present the first systematic measurement study of AI-environment artifacts as a new sandbox-evasion surface. We operationalize this realism gap through AIprint, a probe framework that captures persistent artifacts left behind by AI-capable software ecosystems, including AI-assistant configuration directories, model caches, environment variables, local inference services, and package dependencies. We systematically extract 450 unique artifacts from 284 open-source AI projects on GitHub, compile them into unprivileged Windows probes, and evaluate them across seven commercial and open-source sandbox backends together with three AI-capable reference hosts. Our results show that traditional VM-detection baselines fail to reliably distinguish real AI-capable systems from modern sandboxes, whereas twelve AI-environment artifacts appear on the reference hosts and on none of the evaluated backends. A controlled 214-step installation experiment establishes a causal relationship between AI tool and package installation and measurable AI-environment artifact accumulation, while adaptive spoofing experiments reveal a fundamental operational asymmetry: reproducing convincing AI software environments is substantially more expensive than detecting shallow spoofing.

Qubes OS Security in the Public Record cs.CR

Qubes OS is a revealing case for security measurement because its architecture makes component boundaries security-relevant. We present a protocol-driven longitudinal analysis of 109 public Qubes Security Bulletins (QSBs, 2011--2025), the official Qubes-maintained Xen Security Advisory (XSA) tracker, and a secondary vulnerability-event sensitivity series. The study measures the public advisory record rather than latent vulnerability incidence or realized compromise. The methodology combines audited deterministic component attribution, change-point analysis, overdispersion checks, severity-proxy weighting, censoring sensitivity, documentary latency lower bounds, and baseline-aware evaluation of vulnerability discovery models (VDMs). The results show persistent upstream dependence in that public record. On the official tracker, 113 of 464 XSAs affect Qubes; under primary labeling, 87 of 109 QSBs (79.8\%) are attributable to Xen, CPU/microarchitectural, or other upstream components rather than Qubes-core logic, with similar results under weighted views. Change-point analyses identify 2015Q1 as the dominant break in the quarterly advisory series, while post-2018 annual disclosure rates are statistically flat. Poisson inferences are stable under dispersion diagnostics and negative-binomial sensitivity checks. The attribution codebook performs well in a stratified 30-QSB audit, and S-shaped VDMs fit descriptively but do not significantly outperform a rolling-mean baseline in short-horizon forecasts. Overall, the Qubes public advisory record appears stable, but not quiet: disclosure activity plateaus at a higher level than in the earliest years, while the observed burden remains concentrated in upstream trust anchors.

Bad Memory: Evaluating Prompt Injection Risks from Memory in Agentic Systems cs.CR

A growing class of agentic systems maintain persistent state across sessions through memory files, behavioral preferences, and knowledge bases. While this makes agents more useful and self-improving, it also creates a new attack surface for prompt injections in which malicious instructions can be embedded within persistent files and influence future behavior. In this work, we study prompt injection attacks in memory-based agentic systems using a sandboxed synthetic workspace. We evaluate two agentic systems, Anthropic Claude Code and OpenAI Codex, across four models: Claude Haiku 4.5, Claude Opus 4.7, GPT-5.2, and GPT-5.5. Our results show that although it is difficult to make an agent overwrite its own memory files using untrusted external content, payloads already planted in those files can successfully attack current and future sessions. Attack success and payload persistence vary substantially across systems, models, adversarial goals, and multi-session attack sequences. These findings show that persistent memory changes the threat model for prompt injection and motivate defenses that protect memory updates without removing useful agent adaptation.

MemPoison: Uncovering Persistent Memory Threats and Structural Blind Spots in LLM Agents cs.CR

Persistent external memory enhances agent continuity but introduces persistent security vulnerabilities: adversarial content can be injected via standard interaction channels, retained across turns, and later distort downstream behavior. To address this challenge, we propose MemPoison, a comprehensive benchmark and analysis framework featuring 1227 hand-validated cases across four attack types, three injection channels, and three representative memory substrates, evaluated on seven open-weight and three closed-weight model families. We introduce a three-tier taxonomy: (L1) direct single-record corruption, (L2) compositional multi-record corruption and (L3) context-triggered dormant corruption. Our evaluations reveal a distinct defense frontier: while baseline write-time defenses, such as consistency checks, substantially suppress direct L1 attacks, they fail to reliably suppress L2 and L3 attacks. Through mechanistic influence decomposition (MID), we demonstrate structural blind spots in write-time defenses, which admit seemingly benign records that later become harmful through joint retrieval composition or trigger-conditioned activation. Our findings advocate for shifting from static filtering to adaptive, context-sensitive memory defense strategies.

The Distributed Open-Source Vulnerability Ecosystem cs.CR

Identifying known software vulnerabilities is a central task in software supply chain security management. Although publicly available vulnerability information is based on shared standards, different vulnerability scanners often report divergent results for identical software inventories. These differences do not arise solely from individual data sources or scanner implementations. They can emerge at several stages of the open-source vulnerability ecosystem. This paper presents a conceptual framework that describes vulnerability management as a distributed process of information exchange and transformation. It traces vulnerability information from its creation and standardization through enrichment to context-dependent interpretation. The analysis identifies heterogeneous information sources, divergent identity and version models, temporal change, and context-dependent assessment as major causes of inconsistent scanner findings. It then discusses the implications for interpreting analysis results, designing reproducible evaluation methods, and handling dynamic vulnerability knowledge in practice.

Setup Complete, Now You Are Compromised: Weaponizing Setup Instructions Against AI Coding Agents cs.CR

AI coding agents set up projects by reading documentation and installing the dependencies it lists, without verifying their names, sources, or known vulnerabilities. By editing only a README, requirements file, or Makefile, an attacker can redirect the agent to an untrusted registry, a known-vulnerable version, or a wrong-but-plausible name: documentation becomes a vector for code execution. We present the first systematic evaluation of package-install-time supply-chain attacks delivered through ordinary project-setup documentation across production coding-agent harnesses, probing frontier models on twelve scenarios in five attack classes, grounded in documented incidents. The same model catches an attack through one harness and installs it through another: install-time security rests on the harness-model combination, not the model alone. Agents catch blatant typosquats reliably, but plausible separator-confusion names (azurecore for azure-core) slip through, and how often depends on the harness-model pairing. Source-based attacks like registry redirection are missed almost everywhere. The source blind spot recurs on npm and Cargo, where nearly every model installs the untrusted dependency; name detection carries over less consistently across ecosystems. Security-oriented prompts recover part of the gap but only for the dimension they name; a deterministic pre-install check that verifies names, sources, and versions before any code runs closes most of it.

BenchmarksFull tables

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M
1	GPT-5.4	57.2	75	$5.63
2	Gemini 3.1 Pro Preview	57.2	117	$4.50
3	GPT-5.3 Codex	54	67	$4.81
4	Claude Opus 4.6	53	51	$10.00
5	Claude Sonnet 4.6	51.7	54	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	gpt-5.4-2026-03-05-medium	62.8%
5	Gemini 3.1 Pro Preview	62.3%

GitHub Repos All repos

Trending

anthropics/claude-code

136440 ★

Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.

microsoft/VibeVoice

48582 ★

Open-Source Frontier Voice AI

google-research/timesfm

24683 ★

TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model developed by Google Research for time-series forecasting.

luongnv89/claude-howto

38745 ★

A visual, example-driven guide to Claude Code — from basic concepts to advanced agents, with copy-paste templates that bring immediate value.

axios/axios

108935 ★

Promise based HTTP client for the browser and node.js

Daily discovery

pixeltable/pixeltableMLOps

1598 ★

Data Infrastructure providing a declarative, incremental approach for multimodal AI workloads.

debba/tabularisMCP

914 ★

A lightweight, developer-focused database management tool. Supports MySQL, PostgreSQL and SQLite. Hackable with plugins. Built for speed, security, and aesthetics.

skye-harris/hass_local_openai_llmRAG

187 ★

Home Assistant LLM integration for local OpenAI-compatible services (llamacpp, vllm, etc)

Picovoice/picollmModel Compression

311 ★

On-device LLM Inference Powered by X-Bit Quantization

synthetichealth/syntheaSynthetic Data

3160 ★

Synthetic Patient Population Simulator