The Inference Report

April 19, 2026

The infrastructure layer is consolidating into oligopoly while the surface layer explodes into commodity, and the winners will be those who can operate at both levels simultaneously. Cerebras is filing for IPO on the back of billion-dollar compute commitments from Amazon and OpenAI, signaling that the chip wars are already decided in favor of whoever can lock in capacity years in advance. Meanwhile, Tesla's robotaxi expansion to Dallas and Houston operates in a regulatory vacuum where three Texas cities represent the entire addressable market for driverless cars without a safety driver, which is a sandbox, not scale. Anthropic, simultaneously designated a supply-chain risk by the Pentagon and courting the Trump administration, is betting on Schematik for hardware integration while security researchers warn its Mythos model could accelerate hacking faster than fixes can ship. The real leverage is moving from software to hardware control, and the research community is racing to catch up.

On the benchmarks, the top tier has crystallized around 60-65 percent resolve rates on SWE-rebench, with Claude Opus 4.6 holding first place at 65.3 percent. The meaningful movement comes from Chinese models climbing substantially: GLM-5 jumped 13 percentage points to rank 3, GLM-4.7 surged 16.6 points to rank 14, and Kimi K2.5 added 11.7 points. Gemini 3.1 Pro Preview dropped from rank 2 to rank 6 despite maintaining 62.3 percent, suggesting the benchmark has become more discriminating at the high end. The clustering between 58 and 62 percent indicates diminishing returns, with only 5.4 percentage points separating first from tenth place.

Computer security research exposes a consistent gap: existing defenses fail not at the boundary they claim to protect, but between reasoning and execution. SafeHarness, Parallax, and SIR-Bench demonstrate that lifecycle-integrated defense layers and cognitive-executive separation outperform isolated guardrails, while jailbreak research has shifted from prompt injection to circuit-level intervention. Across privacy mechanisms, no single technique dominates: differential privacy still leaks at calibrated epsilon tiers, and compositional attacks bypass defenses tuned to isolated signals. The methodological pattern is controlled evaluation exposing where defenses actually fail, not where vendors claim they work.

On the surface, agents are becoming infrastructure. OpenAI's framework and Dify's platform are drawing serious adoption because they solve orchestration, tool integration, and state management without rebuilding each time. Running in parallel is a wave of tools treating AI as a control layer for existing infrastructure: Claude Desktop for Debian, better-agent-terminal, and RustDesk gaining adoption as an open alternative to TeamViewer. Developers are past the "what if we put an LLM in this" phase and into "how do we make this LLM useful for our specific constraints." Infrastructure plays like DeepGEMM and Picovoice's on-device speech engine are gaining ground alongside agent frameworks because efficiency and control, not novelty, are becoming the differentiators. The marginal cost of building something AI-powered has collapsed to nearly zero, fragmenting the surface layer while compute commitments consolidate the core.

Grant Calloway

Research Papers — Focused
Challenges and Future Directions in Agentic Reverse Engineering Systems cs.CR

Agentic systems built on large language models (LLMs) are increasingly being used for complex security tasks, including binary reverse engineering (RE). Despite recent growth in popularity and capability, these systems continue to face limitations in realistic settings. Cutting-edge systems still fail in complex RE scenarios that involve obfuscation, timing, and unusual architectures. In this work, we examine how agentic systems perform reverse engineering tasks with static, dynamic, and hybrid agents. Through an analysis of existing agentic tool usage, we identify several limitations, including token constraints, struggles with obfuscation, and a lack of program guardrails. From these findings, we outline current challenges and propose future directions for system designers to pursue from a security perspective.

AndroScanner: Automated Backend Vulnerability Detection for Android Applications cs.CR

Mobile applications rely on complex backends that introduce significant security risks, yet developers often lack the tools to assess these risks effectively. This paper presents AndroScanner, an automated pipeline for detecting vulnerabilities in Android application backends through combined static and dynamic analysis. AndroScanner extracts backend API calls from APK files using apktool, Androguard, and Frida-based dynamic instrumentation, then vets them against the OWASP API Security Top 10 using APIFuzzer. We evaluate AndroScanner on two Android applications: a purposely vulnerable bank application and a production recruitment application with over 50,000 downloads on Google Play Store. Across both applications, AndroScanner extracted 24 APIs and identified 5 vulnerabilities, including a previously unreported zero-day Excessive Data Exposure vulnerability (ranked 3rd in the OWASP API Security Top 10) in the production application. The vulnerability was responsibly disclosed to the development team prior to publication. AndroScanner is available upon request to assist developers in identifying and remediating backend security risks before deployment.
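
As a rough illustration of the pipeline's first stage, the sketch below pulls candidate backend endpoints out of decompiled string constants with a regex. This is a minimal stand-in: the real AndroScanner uses apktool, Androguard, and Frida against actual APKs, and the `extract_endpoints` function and sample strings here are invented for demonstration.

```python
import re

# Matches http(s) URLs with an optional port and path, the shape of
# backend API base URLs commonly left in decompiled string constants.
API_URL_RE = re.compile(r"https?://[\w.-]+(?::\d+)?(/[\w./-]*)?")

def extract_endpoints(strings):
    """Collect unique candidate backend endpoints from decompiled strings."""
    found = set()
    for s in strings:
        for m in API_URL_RE.finditer(s):
            found.add(m.group(0))
    return sorted(found)

# Illustrative string constants such as a decompiler might recover.
decompiled = [
    'BASE_URL = "https://api.example-bank.com/v1/accounts"',
    'retrofit.create("https://api.example-bank.com/v1/transfer")',
    'Log.d(TAG, "no network")',
]
print(extract_endpoints(decompiled))
```

Static extraction like this misses endpoints built at runtime, which is why the paper pairs it with Frida-based dynamic instrumentation before vetting the results against the OWASP API Security Top 10.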

Robustness Analysis of Machine Learning Models for IoT Intrusion Detection Under Data Poisoning Attacks cs.CR

Ensuring the reliability of machine learning-based intrusion detection systems remains a critical challenge in Internet of Things (IoT) environments, particularly as data poisoning attacks increasingly threaten the integrity of model training pipelines. This study evaluates the susceptibility of four widely used classifiers (Random Forest, Gradient Boosting Machine, Logistic Regression, and Deep Neural Network) to multiple poisoning strategies using three real-world IoT datasets. Results show that while ensemble-based models exhibit comparatively stable performance, Logistic Regression and Deep Neural Networks suffer degradation of up to 40% under label manipulation and outlier-based attacks. Such disruptions significantly distort decision boundaries, reduce detection fidelity, and undermine deployment readiness. The findings highlight the need for adversarially robust training, continuous anomaly monitoring, and feature-level validation within operational Network Intrusion Detection Systems. The study also emphasizes the importance of integrating resilience testing into regulatory and compliance frameworks for AI-driven IoT security. Overall, this work provides an empirical foundation for developing more resilient intrusion detection pipelines and informs future research on adaptive, attack-aware models capable of maintaining reliability under adversarial IoT conditions.
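
The outlier-based attack the paper describes can be reproduced in miniature: inject a block of extreme, mislabeled points into the training set and watch a logistic regression boundary degrade. This is a minimal numpy sketch on synthetic Gaussian data, not the paper's datasets or models; all parameters (cluster means, poison fraction, learning rate) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    """Two Gaussian classes separated along the x-axis."""
    x0 = rng.normal([-2.0, 0.0], 1.0, size=(n, 2))
    x1 = rng.normal([2.0, 0.0], 1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.concatenate([np.zeros(n), np.ones(n)])

def train_logreg(X, y, lr=0.1, iters=2000):
    """Plain gradient-descent logistic regression (weights + bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(((Xb @ w > 0) == (y == 1)).mean())

X_train, y_train = make_blobs(200)
X_test, y_test = make_blobs(200)

# Outlier-based poisoning: 80 extreme points deep in class-1 territory,
# all mislabeled as class 0 (20% of the training set). Because log loss
# is unbounded, these points exert a large pull on the boundary.
X_out = rng.normal([8.0, 0.0], 0.5, size=(80, 2))
X_pois = np.vstack([X_train, X_out])
y_pois = np.concatenate([y_train, np.zeros(80)])

clean_acc = accuracy(train_logreg(X_train, y_train), X_test, y_test)
pois_acc = accuracy(train_logreg(X_pois, y_pois), X_test, y_test)
print(f"clean: {clean_acc:.3f}  poisoned: {pois_acc:.3f}")
```

The same injection barely moves an ensemble of axis-aligned trees, which is consistent with the paper's finding that ensemble models are comparatively stable under these attacks.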

CBCL: Safe Self-Extending Agent Communication cs.CR

Agent communication languages (ACLs) enable heterogeneous agents to share knowledge and coordinate across diverse domains. This diversity demands extensibility, but expressive extension mechanisms can push the input language beyond the complexity classes where full validation is tractable. We present CBCL (Common Business Communication Language), an agent communication language that constrains all messages, including runtime language extensions, to the deterministic context-free language (DCFL) class. CBCL allows agents to define, transmit, and adopt domain-specific "dialect" extensions as first-class messages; three safety invariants (R1-R3), machine-checked in Lean 4 and enforced in a Rust reference implementation, prevent unbounded expansion, apply declared resource limits, and preserve the core vocabulary. We formalize the language and its safety properties in Lean 4, implement a reference parser and dialect engine in Rust with property-based and differential tests, and extract a verified parser binary. Our results demonstrate that homoiconic protocol design, where extension definitions share the same representation as ordinary messages, can be made provably safe. As autonomous agents increasingly extend their own communication capabilities, formally bounding what they can express to each other is a precondition for oversight.
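
The core idea, validating every message in a single deterministic pass with hard resource caps before any interpretation happens, can be sketched in a few lines. This toy validator is not CBCL (whose grammar is formalized in Lean 4 and implemented in Rust); the parenthesized message shape and the `MAX_DEPTH` and `MAX_TOKENS` limits are invented for illustration.

```python
# Hard caps standing in for CBCL's declared resource limits: a message
# that nests too deeply or grows too long is rejected up front, so no
# "dialect" extension can smuggle in unbounded expansion.
MAX_DEPTH = 8
MAX_TOKENS = 64

def validate(message):
    """One left-to-right deterministic pass over a parenthesized message."""
    tokens = message.replace("(", " ( ").replace(")", " ) ").split()
    if len(tokens) > MAX_TOKENS:
        return False
    depth = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
            if depth > MAX_DEPTH:
                return False
        elif tok == ")":
            depth -= 1
            if depth < 0:          # close with no matching open
                return False
    return depth == 0              # every open matched

print(validate("(ask (price widget-7))"))      # well-formed
print(validate("(" * 20 + "deep" + ")" * 20))  # exceeds the depth cap
```

Because the check is a single pass with constant-size state plus a bounded counter, it stays tractable no matter what extensions agents later negotiate, which is the paper's point about confining everything to a validable class.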

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection cs.CR

Modern large audio-language models (LALMs) power intelligent voice interactions by tightly integrating audio and text. This integration, however, expands the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. While prior work studied audio jailbreaks, the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. In this work, we reveal a previously overlooked threat, auditory prompt injection, under realistic constraints of audio data-only access and strong perceptual stealth. To systematically analyze this threat, we propose AudioHijack, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. AudioHijack employs sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward adversarial audio and generalizes to unseen user contexts. We also design a convolutional blending method that modulates perturbations into natural reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, achieving average success rates of 79%-96% on unseen user contexts with high acoustic fidelity. Real-world studies demonstrate that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defense.
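
Sampling-based gradient estimation is what lets an attack like this optimize through a non-differentiable pipeline: probe the black-box loss along random directions and average finite differences. The sketch below shows the generic antithetic estimator on a known quadratic so the estimate can be checked against the true gradient; AudioHijack's actual objective, models, and sampling schedule are far more involved, and `loss`, `sigma`, and `n` here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(x):
    """Black-box objective standing in for a non-differentiable model
    loss on a candidate perturbation; a quadratic whose true gradient
    2*(x - 3) is known, so the estimate can be sanity-checked."""
    return float(np.sum((x - 3.0) ** 2))

def nes_gradient(f, x, sigma=0.1, n=2000):
    """Antithetic sampling-based estimate: evaluate f at x +/- sigma*u
    for random Gaussian directions u and average the finite differences."""
    g = np.zeros_like(x)
    for _ in range(n):
        u = rng.normal(size=x.shape)
        g += (f(x + sigma * u) - f(x - sigma * u)) / (2 * sigma) * u
    return g / n

x = np.zeros(5)
est = nes_gradient(loss, x)
true = 2 * (x - 3.0)
print(np.round(est, 1), true)
```

Only forward evaluations of `f` are needed, which is why the technique survives audio tokenizers and other non-differentiable stages that block ordinary backpropagation.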

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization cs.CR

Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, rendering them ineffective in real-world black-box scenarios. In this work, we propose R²A, which aims to mislead black-box LLM routers to expensive models via adversarial suffix optimization. Specifically, R²A deploys a hybrid ensemble surrogate router to mimic the black-box router. A suffix optimization algorithm is further adapted for the ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that R²A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A-Attack.
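
The suffix-optimization idea reduces to black-box search: append tokens to a benign query, keeping whichever candidate most raises a surrogate router's "route to the expensive model" score. The sketch below uses a hand-written word-count scorer and greedy search purely for illustration; R²A itself optimizes against a learned hybrid ensemble surrogate, and the `HEAVY_WORDS` scorer, `VOCAB`, and `optimize_suffix` here are all invented.

```python
# Hypothetical surrogate router: routes to the expensive model when the
# fraction of "complexity-signaling" vocabulary crosses a threshold.
HEAVY_WORDS = {"prove", "derive", "formally", "rigorous", "theorem",
               "multivariate", "asymptotic"}

def router_score(query):
    """Surrogate complexity score in [0, 1]: share of heavy vocabulary."""
    words = query.lower().split()
    return sum(w in HEAVY_WORDS for w in words) / len(words)

VOCAB = ["please", "thanks", "prove", "derive", "rigorous", "now"]

def optimize_suffix(query, length=3):
    """Greedy black-box search: grow the suffix one token at a time,
    keeping whichever candidate most increases the router score."""
    suffix = []
    for _ in range(length):
        best = max(VOCAB,
                   key=lambda t: router_score(" ".join([query] + suffix + [t])))
        suffix.append(best)
    return " ".join(suffix)

query = "what is the capital of France"
adv = optimize_suffix(query)
base = router_score(query)
attacked = router_score(query + " " + adv)
print(base, attacked, adv)
```

Even this crude search inflates the routing score of a trivial query, which is the economic point of the attack: every hijacked query pays the expensive model's price for cheap-model work.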

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Claude Opus 4.7          57.3     53  $10.00
2  Gemini 3.1 Pro Preview   57.2    134  $4.50
3  GPT-5.4                  56.8     85  $5.63
4  GPT-5.3 Codex            53.6     93  $4.81
5  Claude Opus 4.6          53.0     59  $10.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  GLM-5.1                    62.7%