The Inference Report

April 2, 2026

The compute arms race has entered a new phase where infrastructure control determines market position, and the companies that can lock in power, compress hardware deployment timelines, and absorb regulatory complexity are already separating from those that cannot. Meta's commitment to ten natural gas plants powering its Hyperion data center and Cognichip's $60 million raise on claims of 75% cost reduction and 50% timeline compression in chip design are not marginal improvements but structural bets that the next cycle belongs to whoever can compress the cost and time to deploy compute at scale. Poolside's stalled Texas project and forced negotiations with Google and other cloud providers make the calculus explicit: independent infrastructure plays are losing leverage to companies that can absorb the capital requirements and regulatory complexity of massive data center deployments. The economics have shifted decisively.

This consolidation around compute is matched by a bifurcation in how companies are deploying AI. OpenAI's Gradient Labs is shipping concrete product, AI account managers in banking using GPT-4.1 and purpose-built smaller models for latency-sensitive workflows, the kind of deployment that generates revenue and defensible customer relationships. GitHub's /fleet feature for parallel agent dispatch and Hugging Face's Holo3 announcement signal the same momentum: builders are moving past single-agent orchestration into systems that distribute work across multiple models and processes. By contrast, IBM and AMD are playing a different game, announcing FedRAMP authorizations, decade-long research initiatives, and detailed MLPerf submissions that establish them as trusted intermediaries in regulated environments and foundational research. Consumer-facing applications are racing ahead through product velocity and direct customer relationships. Infrastructure players are establishing credibility through partnerships, compliance achievements, and benchmark credentials.

The technical frontier has shifted from model architecture to the systems that deploy and secure them, and that shift exposes unresolved tensions. Anthropic's accidental mass takedown of GitHub repositories containing leaked Claude Code source, followed by the discovery that Claude Code can uncover zero-day exploits in Vim and GNU Emacs in seconds, reveals that the same tool making security research trivial also makes security risk trivial to create. Anthropic's response was operational damage control, not a technical solution. Meanwhile, Slack's repositioning of Slackbot as an orchestration layer for agentic workflows, Asana's emphasis on multiplayer AI agents, and Meta's semi-formal reasoning technique for code review achieving 93% accuracy show enterprises moving past chatbots toward systems that make decisions and execute tasks across multiple applications and teams. What remains unresolved is whether the tools that make this possible can be secured, and whether the companies deploying them have any meaningful way to audit what those agents are actually doing once they are turned loose on company infrastructure.

Benchmark volatility and developer momentum toward agentic tools underscore the stakes. Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, a 12.3-point gain from its prior ranking, yet the two major benchmarks diverge materially in their orderings, implying that they capture different failure modes in code generation or apply stricter evaluation criteria around execution correctness. Reasoning-focused variants like Kimi K2.5 are closing gaps on general-purpose models in software engineering tasks. On GitHub, Claude Code and related agentic tooling have moved from research artifact to daily driver, while infrastructure projects like Pixeltable, Rerun, and Qdrant represent a category shift toward unsexy but essential data plumbing for multimodal pipelines. Simultaneously, projects like PicoLLM signal a countercurrent toward running models locally without cloud dependencies. The field is scaling inference infrastructure while constraining it, building both the pipes and the guardrails.

Grant Calloway

AI LabsAll labs
From the WireAll feeds
Research Papers — FocusedAll papers
Quantum-Safe Code Auditing: LLM-Assisted Static Analysis and Quantum-Aware Risk Scoring for Post-Quantum Cryptography Migration cs.CR

The impending arrival of cryptographically relevant quantum computers (CRQCs) threatens the security foundations of modern software: Shor's algorithm breaks RSA, ECDSA, ECDH, and Diffie-Hellman, while Grover's algorithm reduces the effective security of symmetric and hash-based schemes. Despite NIST standardising post-quantum cryptography (PQC) in 2024 (FIPS 203 ML-KEM, FIPS 204 ML-DSA, FIPS 205 SLH-DSA), most codebases lack automated tooling to inventory classical cryptographic usage and prioritise migration based on quantum risk. We present Quantum-Safe Code Auditor, a quantum-aware static analysis framework that combines (i) regex-based detection of 15 classes of quantum-vulnerable primitives, (ii) LLM-assisted contextual enrichment to classify usage and severity, and (iii) risk scoring via a Variational Quantum Eigensolver (VQE) model implemented in Qiskit 2.x, incorporating qubit-cost estimates to prioritise findings. We evaluate the system across five open-source libraries -- python-rsa, python-ecdsa, python-jose, node-jsonwebtoken, and Bouncy Castle Java -- covering 5,775 findings. On a stratified sample of 602 labelled instances, we achieve 71.98% precision, 100% recall, and an F1 score of 83.71%. All code, data, and reproduction scripts are released as open-source.

AutoEG: Exploiting Known Third-Party Vulnerabilities in Black-Box Web Applications cs.CR

Large-scale web applications are widely deployed with complex third-party components, inheriting security risks arising from component vulnerabilities. Security assessment is therefore required to determine whether such known vulnerabilities remain practically exploitable in real applications. Penetration testing is a widely adopted approach that validates exploitability by launching concrete attacks against known vulnerabilities in real-world black-box systems. However, existing approaches often fail to automatically generate reliable exploits, limiting their effectiveness in practical security assessment. This limitation mainly stems from two issues: (1) precisely triggering vulnerabilities with correct technical details, and (2) adapting exploits to diverse real-world deployment settings. In this paper, we propose AutoEG, a fully automated multi-agent framework for exploit generation targeting black-box web applications. AutoEG has two phases: First, AutoEG extracts precise vulnerability trigger logic from unstructured vulnerability information and encapsulates it into reusable trigger functions. Second, AutoEG uses trigger functions for concrete attack objectives and iteratively refines exploits through feedback-driven interaction with the target application. We evaluate AutoEG on 104 real-world vulnerabilities with 29 attack objectives, resulting in 660 exploitation tasks and 55,440 exploit attempts. AutoEG achieves an average success rate of 82.41%, substantially outperforming state-of-the-art baselines, whose best performance reaches only 32.88%.

Do Phone-Use Agents Respect Your Privacy? cs.CR

We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at~ https://github.com/FreedomIntelligence/MyPhoneBench.

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks cs.CR

System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.

VibeGuard: A Security Gate Framework for AI-Generated Code cs.CR

"Vibe coding," in which developers delegate code generation to AI assistants and accept the output with little manual review, has gained rapid adoption in production settings. On March 31, 2026, Anthropic's Claude Code CLI shipped a 59.8 MB source map file in its npm package, exposing roughly 512,000 lines of proprietary TypeScript. The tool had itself been largely vibe-coded, and the leak traced to a misconfigured packaging rule rather than a logic bug. Existing static-analysis and secret-scanning tools did not cover this failure mode, pointing to a gap between the vulnerabilities AI tends to introduce and the vulnerabilities current tooling is built to find. We present VibeGuard, a pre-publish security gate that targets five such blind spots: artifact hygiene, packaging-configuration drift, source-map exposure, hardcoded secrets, and supply-chain risk. In controlled experiments on eight synthetic projects (seven vulnerable, one clean control), VibeGuard achieved 100% recall, 89.47% precision (F1 = 94.44%), and correct pass/fail gate decisions on all eight projects across three policy levels. We discuss how these results inform a defense-in-depth workflow for teams that rely on AI code generation.

Automated Generation of Cybersecurity Exercise Scenarios cs.CR

There is a growing need for cybersecurity professionals with practical knowledge and experience to meet societal needs and comply with new standards and regulations. At the same time, the advances in software technology and artificial intelligence point towards a future where software agents will play an important role in protecting the computer systems that are critical for society to function. The training and development of both humans and software agents requires the design and execution of cybersecurity exercises that differ in properties such as size, scope, objectives, difficultly, etc. Cybersecurity scenarios are critical for the operation of cybersecurity exercises as they describe the scope, context, operational environment and storyline of each exercise. In this work, we present an approach to automatically generate cybersecurity scenarios that model enterprise IT systems. Our approach is able to generate a large number of scenarios that differ in multiple criteria including size, scope, difficulty, complexity and diversity. We further release as open source: a simulation and a virtualization environment that can run cybersecurity exercises based on the generated scenarios and a dataset containing 100000 sample scenarios.

BenchmarksFull tables
Artificial AnalysisIntelligence Index

Composite score across coding, math, and reasoning

#ModelScoretok/s$/1M
1GPT-5.457.275$5.63
2Gemini 3.1 Pro Preview57.2117$4.50
3GPT-5.3 Codex5467$4.81
4Claude Opus 4.65351$10.00
5Claude Sonnet 4.651.754$6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#ModelScore
1Claude Opus 4.665.3%
2gpt-5.2-2025-12-11-medium64.4%
3GLM-562.8%
4gpt-5.4-2026-03-05-medium62.8%
5Gemini 3.1 Pro Preview62.3%