The Inference Report

April 19, 2026

The infrastructure layer is consolidating into an oligopoly while the surface layer explodes into commodity, and the winners will be whoever can operate at both levels simultaneously. Cerebras is filing for an IPO on the back of billion-dollar compute commitments from Amazon and OpenAI, signaling that the chip wars are already decided in favor of whoever can lock in capacity years in advance. Meanwhile, Tesla's robotaxi expansion to Dallas and Houston operates in a regulatory vacuum: three Texas cities represent the entire addressable market for driverless cars without a safety driver, which makes it a sandbox, not scale. Anthropic, simultaneously designated a supply-chain risk by the Pentagon and courting the Trump administration, is betting on Schematik for hardware integration while security researchers warn its Mythos model could accelerate hacking faster than fixes can ship. The real leverage is moving from software to hardware control, but the research community is racing to catch up.

On the benchmarks, the top tier has crystallized around 60-65 percent resolve rates on SWE-rebench, with Claude Opus 4.6 holding first place at 65.3 percent. The meaningful movement comes from Chinese models climbing substantially: GLM-5 jumped 13 percentage points to rank 3, GLM-4.7 surged 16.6 points to rank 14, and Kimi K2.5 added 11.7 points. Gemini 3.1 Pro Preview dropped from rank 2 to rank 6 despite maintaining 62.3 percent, suggesting the benchmark has become more discriminating at the high end. The clustering between 58 and 62 percent indicates diminishing returns, with only 5.4 percentage points separating first from tenth place.

Computer security research exposes a consistent gap: existing defenses fail not at the boundary they claim to protect, but between reasoning and execution. SafeHarness, Parallax, and SIR-Bench demonstrate that lifecycle-integrated defense layers and cognitive-executive separation outperform isolated guardrails, while jailbreak research has shifted from prompt injection to circuit-level intervention. Privacy mechanisms show no single technique dominates; differential privacy leaks at calibrated epsilon tiers, and compositional attacks bypass defenses tuned to isolated signals. The methodological pattern is controlled evaluation exposing where defenses actually fail, not where vendors claim they work.

On the surface, agents are becoming infrastructure. OpenAI's framework and Dify's platform are drawing serious adoption because they solve orchestration, tool integration, and state management without rebuilding each time. Running in parallel is a wave of tools treating AI as a control layer for existing infrastructure: Claude Desktop for Debian, better-agent-terminal, and RustDesk, which is gaining adoption as an open alternative to TeamViewer. Developers are past the "what if we put an LLM in this" phase and into "how do we make this LLM useful for our specific constraints." Infrastructure plays like DeepGEMM and Picovoice's on-device speech engine are gaining ground alongside agent frameworks because efficiency and control, not novelty, are becoming the differentiators. The marginal cost of building something AI-powered has collapsed to nearly zero, fragmenting the surface layer while compute commitments consolidate the core.

Grant Calloway

Research Papers: Focused
Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices cs.CR

Traditional cryptography, rooted in problems such as integer factorisation or the discrete logarithm, is inevitably vulnerable to a fully operational quantum computer. Although such a machine remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, the backbone of modern quantum-safe cryptography is the Shortest Vector Problem (SVP). We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating domain-informed SVP representation and crossover while naturally extending application to module lattices.
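
For readers unfamiliar with sieving, the sketch below shows the basic pairwise reduction that heuristic lattice sieves build on: combining two lattice vectors by addition or subtraction and keeping the result when it is shorter. It is a toy illustration only, not the paper's genetic algorithm and not Ajtai et al.'s sieve as published. The function names (`norm_sq`, `reduce_pair`, `toy_sieve`) and the two-dimensional example basis are assumptions made here for illustration.

```python
def norm_sq(v):
    """Squared Euclidean norm of an integer vector."""
    return sum(x * x for x in v)


def reduce_pair(v, w):
    """Return v - w or v + w if either is a shorter nonzero vector than v, else v."""
    best = v
    for sign in (-1, 1):
        cand = tuple(a + sign * b for a, b in zip(v, w))
        # Reject the zero vector: it is in every lattice but is not an SVP solution.
        if 0 < norm_sq(cand) < norm_sq(best):
            best = cand
    return best


def toy_sieve(vectors):
    """Repeat pairwise reductions until no vector in the pool gets shorter.

    Assumes nonzero integer input vectors. Every replacement is an integer
    combination of pool members, so the pool never leaves the lattice.
    """
    pool = [tuple(v) for v in vectors]
    changed = True
    while changed:
        changed = False
        for i in range(len(pool)):
            for j in range(len(pool)):
                if i == j:
                    continue
                reduced = reduce_pair(pool[i], pool[j])
                if reduced != pool[i]:
                    pool[i] = reduced
                    changed = True
    return min(pool, key=norm_sq)


# Hypothetical example: a skewed basis of the integer lattice Z^2.
shortest = toy_sieve([(5, 3), (3, 2)])
print(shortest, norm_sq(shortest))
```

The loop terminates because every replacement strictly lowers a positive integer norm. In high dimensions this naive pool can stall far from the true shortest vector, which is the kind of gap that larger pools and better-informed combination rules aim to close.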

Implicit Identity Technologies for LLMs: Fingerprinting and Watermarking across Datasets, Models, and Generated Content cs.CR

This paper presents a survey and taxonomy of LLM fingerprinting and watermarking for identity, ownership verification, provenance, and generated-content attribution. Large language models (LLMs) require substantial investments in data, computation, and expertise, and are increasingly deployed in high-stakes settings, making it critical to protect LLM-related assets and trace their origins. Existing work has rapidly expanded across dataset provenance, model ownership, and generated-content detection, but the field remains fragmented: fingerprinting and watermarking are often used inconsistently, and methods are typically studied within isolated asset-specific settings. To address this gap, we introduce implicit identity as a unifying abstraction for verifiable but not directly observable identity signals in LLM systems. We distinguish fingerprinting as non-intrusive identity derived from intrinsic characteristics, and watermarking as intrusive identity deliberately embedded into data, models, or generated content. We then propose a lifecycle-based taxonomy that organises techniques across datasets, models, and generated content, and further separates them by verification semantics: similarity-based attribution and keyed verification. Finally, we establish an evaluation framework centred on identifiability, robustness, and deployability, summarising representative metrics under realistic access and transformation regimes. By unifying terminology, lifecycle stages, and evaluation objectives, this survey provides a structured foundation for studying LLM identity technologies and for developing more reliable mechanisms for asset protection and provenance.

Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills cs.CR

LLM-powered coding agents increasingly participate in software development workflows by generating code, selecting dependencies, and producing package installation commands. This creates a new software supply chain risk: when an agent hallucinates a non-existent package, an attacker may register the hallucinated name and later compromise users who install it. Existing package hallucination attacks and defenses primarily focus on naturally occurring hallucinations, targeted dependency steering, or post-hoc package validation. In this paper, we introduce Neutral Prompting Attack (NPA), a highly stealthy attack paradigm in which semantically benign instructions, such as encouraging imagination and exhaustiveness, increase package hallucination propensity without containing explicit malicious intent. Unlike targeted dependency steering, NPA does not specify an attacker-chosen package. Instead, it shifts the model's dependency generation behavior toward more speculative package names. We evaluate NPA across multiple coding-oriented LLMs and package hallucination benchmarks. Our results show that NPA increases both Hallucination ASR and Pip Install ASR, changes the distribution of hallucinated package names, and evades existing static-analysis, LLM-based, and agent-based Skill defenses. These findings reveal that harmless-looking prompts can covertly manipulate hallucination behavior and create downstream software supply chain risks.

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing cs.CR

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
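
The multi-candidate alignment idea can be illustrated with a minimal sketch. It assumes plain edit distance as the alignment cost and a fixed acceptance threshold, neither of which the abstract specifies. The names `alignment_cost`, `detect`, and `max_cost` are hypothetical and chosen here for illustration, so this should not be read as AliMark's actual detector.

```python
from typing import Iterable, Sequence


def alignment_cost(extracted: Sequence[int], secret: Sequence[int]) -> int:
    """Edit distance between two bit sequences (an assumed cost function)."""
    prev = list(range(len(secret) + 1))
    for i, bit in enumerate(extracted, start=1):
        curr = [i]
        for j, key_bit in enumerate(secret, start=1):
            curr.append(min(
                prev[j] + 1,                     # extra extracted bit, e.g. after a sentence split
                curr[j - 1] + 1,                 # missing extracted bit, e.g. after a sentence merge
                prev[j - 1] + (bit != key_bit),  # aligned bits: free if equal, 1 if flipped
            ))
        prev = curr
    return prev[-1]


def detect(candidates: Iterable[Sequence[int]], secret: Sequence[int], max_cost: int = 2) -> bool:
    """Flag text as watermarked if any restructured variant aligns cheaply.

    `candidates` holds the bit sequences extracted from each restructured
    variant of the text; it must contain at least one sequence.
    """
    best = min(alignment_cost(c, secret) for c in candidates)
    return best <= max_cost


secret_bits = [1, 0, 1, 1]
variants = [
    [0, 0, 0, 0, 0, 0, 0, 0],  # a variant whose bits do not line up
    [1, 0, 0, 1, 1],           # one spurious bit, as a sentence split might produce
]
print(detect(variants, secret_bits))
```

Taking the minimum over variants means one well-aligned restructuring is enough to recover the signal, which is why insertions and deletions are priced separately from bit flips in this sketch.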

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing cs.CR

Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of research (RCR) norms or help undermine them. We introduce SciIntBench, an adversarial benchmark of 810 prompts across ten RCR categories and three scientific domains. Each scenario appears as an Overt Adversarial, Covert Adversarial, and Benign version, allowing us to jointly measure framing-sensitive refusal of misconduct and helpfulness on legitimate requests. We evaluate 16 commercial and open-weight LLMs from six providers (2024-2026), producing 12,960 responses. We find that scientific integrity alignment is strongly framing-sensitive: models refuse explicit misconduct far more reliably than covert violations, especially failing when misconduct is presented as a pressure-driven shortcut. Refusals vary by RCR category, with weaker boundaries around transparency, plagiarism, and fabrication.

KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing cs.CR

Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a claimed endpoint is actually serving the advertised model. We introduce KBF, a low-cost black-box auditing protocol that fingerprints model APIs using stable numerical recall near the knowledge boundary. Across 16 production LLM endpoints, KBF flags all 155 economically relevant substitutions without rejecting any same-model controls, remains stable under deployment variation, detects high-separation mixed-routing attacks when only 5-10% of traffic is substituted, and finds that 7 of 27 platform model cells in a six-platform shadow API audit are statistically inconsistent with their reference endpoints, with inconsistencies concentrated on premium Claude endpoints.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

# | Model                  | Score | tok/s | $/1M
1 | Claude Opus 4.7        | 57.3  | 53    | $10.00
2 | Gemini 3.1 Pro Preview | 57.2  | 134   | $4.50
3 | GPT-5.4                | 56.8  | 85    | $5.63
4 | GPT-5.3 Codex          | 53.6  | 93    | $4.81
5 | Claude Opus 4.6        | 53    | 59    | $10.00

SWE-rebench

Agentic coding on real-world software engineering tasks

# | Model                     | Score
1 | Claude Opus 4.6           | 65.3%
2 | gpt-5.2-2025-12-11-medium | 64.4%
3 | GLM-5                     | 62.8%
4 | gpt-5.4-2026-03-05-medium | 62.8%
5 | GLM-5.1                   | 62.7%