The Inference Report

March 27, 2026

Like the shift from mainframes to distributed systems in the 1970s, the AI field is fragmenting along infrastructure lines. Large players are consolidating control through network effects and regulatory favor, while the actual leverage migrates to whoever controls the compute layer and the orchestration patterns that run on top of it. Google launches tools to absorb users from competitors while pushing audio AI so conversational it obscures whether you are speaking to a human, even as these moves sit alongside research documenting manipulation risks in finance and health. OpenAI killed Sora after burning $15 million daily and shelved an erotic chatbot following internal dissent; Anthropic won a federal injunction establishing that courts will police regulatory weaponization when the target has the resources to fight back. The pattern is unmistakable: large players race toward deployed systems that touch users at scale, while smaller builders compete on efficiency and transparency, accepting lower performance in exchange for deployability.

The real competition is not over capabilities but over infrastructure. Google's TurboQuant compression, which cuts memory usage by 6x and attention computation by 8x on H100 hardware, does more to determine whether AI deployment scales or stalls than raw model performance does. Cohere's 2-billion-parameter voice model for consumer-grade GPUs and Mistral's open-source speech generation compete directly with closed systems from ElevenLabs and OpenAI by running on cheaper hardware. Meanwhile, GitHub's trending repos show agents consolidating around two layers: an orchestration tier handling agent lifecycle and message routing, and a foundation tier managing data ingestion and model serving. These aren't monolithic frameworks but composable point solutions, each solving a concrete step rather than the entire pipeline. The winning pattern treats agents as a delivery mechanism, not the novelty.
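
The two-tier split can be sketched in a few lines. Every class, method, and agent name below is hypothetical, not drawn from any of the trending repos; a stub model stands in for real serving.

```python
# Hypothetical sketch of the two-tier agent pattern: an orchestration
# tier that owns agent lifecycle and routing, and a foundation tier
# that handles model serving. All names here are illustrative.
from typing import Callable, Dict


class FoundationTier:
    """Stands in for the data-ingestion and model-serving layer."""

    def __init__(self, model_fn: Callable[[str], str]):
        self.model_fn = model_fn

    def serve(self, prompt: str) -> str:
        return self.model_fn(prompt)


class Orchestrator:
    """Registers agents and routes tasks; delegates all inference."""

    def __init__(self, foundation: FoundationTier):
        self.foundation = foundation
        self.agents: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, prompt_template: str) -> None:
        # Each "agent" is just a prompt policy over the shared foundation.
        self.agents[name] = lambda task: self.foundation.serve(
            prompt_template.format(task=task))

    def route(self, name: str, task: str) -> str:
        return self.agents[name](task)


# Usage: a stub model shows the wiring without any real LLM behind it.
stub = FoundationTier(lambda p: f"handled({p})")
orch = Orchestrator(stub)
orch.register("summarizer", "Summarize: {task}")
print(orch.route("summarizer", "release notes"))
```

The point of the split is that agents become thin prompt policies: swapping the foundation tier (a different model, a different serving stack) requires no change to the orchestration code.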

What the pattern reveals is a widening gap between timescales. Google DeepMind and NVIDIA are racing toward production systems with lower latency and broader infrastructure reach. IBM and Meta are investing in long-cycle research in quantum simulation and neuroscience that may not yield commercial products for years. The detection and generation research in audio signal processing has matured toward richer problem formulations that recognize that benign transformations such as speech enhancement create distributional shifts indistinguishable from spoofing under existing classifiers. Senate scrutiny of data center power consumption and Mark Warner's proposal to tax them for job displacement signal that the cost of AI's compute footprint is becoming a political liability. Legal intervention stopped one form of leverage when a federal judge halted the Trump administration's supply-chain-risk designation of Anthropic. Market concentration in infrastructure and orchestration continues unimpeded.

Grant Calloway

Research Papers — Focused
Echoes: A semantically-aligned music deepfake detection dataset cs.SD

We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes content generated by ten popular AI music generation systems. To prevent shortcut learning and promote robust generalization, the dataset is deliberately constructed to be challenging, enforcing semantic-level alignment between spoofed audio and bona fide references. This alignment is achieved by conditioning generated audio samples directly on bona-fide waveforms or song descriptors. We evaluate Echoes in a cross-dataset setting against three existing AI-generated music datasets using state-of-the-art Wav2Vec2 XLS-R 2B representations. Results show that (i) Echoes is the hardest in-domain dataset; (ii) detectors trained on existing datasets transfer poorly to Echoes; (iii) training on Echoes yields the strongest generalization performance. These findings suggest that provider diversity and semantic alignment help learn more transferable detection cues.
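
Detectors in this literature are typically compared by equal error rate (EER). A minimal EER computation, assuming higher scores mean "more likely bona fide" (a common convention, not something the abstract specifies):

```python
import numpy as np


def equal_error_rate(bona_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """EER: the operating point where the false-accept rate (spoofed audio
    accepted as bona fide) equals the false-reject rate (bona fide audio
    rejected). Scans every observed score as a candidate threshold."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(bona_scores < t)    # bona fide wrongly rejected
        far = np.mean(spoof_scores >= t)  # spoof wrongly accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```

With perfectly separated score distributions the EER is 0; a detector that transfers poorly to a harder dataset shows up as a sharply higher EER under this same metric.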

Variable-Length Audio Fingerprinting cs.SD

Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address limitations due to this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length, for both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-art methods in live audio identification and audio retrieval across three real-world datasets.
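
The core retrieval step, matching a query against a reference by fingerprint similarity, can be illustrated with a toy length-independent fingerprint. The frame-statistics encoder below is purely illustrative and far simpler than VLAFP's learned model:

```python
import numpy as np


def fingerprint(audio: np.ndarray, dim: int = 64) -> np.ndarray:
    """Toy fingerprint: a fixed-size, unit-norm vector from a
    variable-length signal. Splitting into `dim` frames and summarizing
    each by its energy spread makes the output length-independent.
    Illustrative stand-in, not VLAFP's encoder."""
    frames = np.array_split(audio, dim)
    feats = np.array([f.std() + 1e-8 for f in frames])
    return feats / np.linalg.norm(feats)


def match(query: np.ndarray, reference: np.ndarray) -> float:
    """Cosine similarity between unit-norm fingerprints."""
    return float(np.dot(fingerprint(query), fingerprint(reference)))
```

Because the fingerprint is unit-norm and fixed-size, any input long enough to split into `dim` frames can be compared against any other, which is the property a variable-length scheme needs; a distorted copy of a recording should score near 1.0 while unrelated audio scores lower.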

Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing cs.SD

Despite advances in cutting-edge technologies, audio signal processing continues to pose challenges and lacks the precision of the human speech processing system. We address these challenges with a real-time audio signal processing system built on time-domain techniques and reservoir computers, which are significantly easier to train. Feature extraction is a fundamental step in speech signal processing, with Mel Frequency Cepstral Coefficients (MFCCs) a dominant choice due to their perceptual relevance to human hearing. However, conventional MFCC extraction relies on computationally intensive time-frequency transformations, limiting efficiency in real-time applications. We therefore propose a novel approach that leverages reservoir computing to streamline MFCC extraction: by replacing traditional frequency-domain conversions with convolution operations, we eliminate the need for complex transformations while maintaining feature discriminability. We present an end-to-end audio processing framework that integrates this method, demonstrating its potential for efficient, real-time speech analysis. Our results contribute to the advancement of energy-efficient audio processing technologies, enabling seamless deployment in embedded systems and voice-driven applications. This work bridges the gap between biologically inspired feature extraction and modern neuromorphic computing, offering a scalable solution for next-generation speech recognition systems.
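
A minimal echo-state reservoir shows why these systems are cheap to train: the input and recurrent weights are fixed and random, so only a linear readout on the final state would ever need fitting. This is a generic reservoir sketch, not the paper's pipeline; all parameter values are illustrative.

```python
import numpy as np


def reservoir_features(signal: np.ndarray, n_res: int = 50,
                       leak: float = 0.3, seed: int = 0) -> np.ndarray:
    """Drive a fixed random reservoir with a 1-D signal and return the
    final state vector as a feature. Leaky integration plus a spectral
    radius below 1 gives the fading-memory ("echo state") property."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, n_res)            # fixed input weights
    w = rng.uniform(-0.5, 0.5, (n_res, n_res))      # fixed recurrent weights
    w *= 0.9 / np.max(np.abs(np.linalg.eigvals(w))) # scale spectral radius
    x = np.zeros(n_res)
    for u in signal:                                # time-domain update
        x = (1 - leak) * x + leak * np.tanh(w_in * u + w @ x)
    return x
```

Training then reduces to a linear regression from such state vectors to targets, which is the efficiency argument the abstract is making.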

Enhancing Efficiency and Performance in Deepfake Audio Detection through Neuron-level dropin & Neuroplasticity Mechanisms cs.SD

Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further improvements with the introduction of large models (LMs) like Wav2Vec. The success of large language models (LLMs) further demonstrates the benefits of scaling model parameters, but also highlights a bottleneck where performance gains are constrained by parameter counts. Simply stacking additional layers, as done in current LLMs, is computationally expensive and requires full retraining. Furthermore, existing low-rank adaptation methods are primarily applied to attention-based architectures, which limits their scope. Inspired by the neuronal plasticity observed in mammalian brains, we propose novel algorithms, dropin and further plasticity, that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. We evaluate these algorithms on multiple architectures, including ResNet, Gated Recurrent Neural Networks, and Wav2Vec. Experimental results using the widely recognised ASVSpoof2019 LA, PA, and FakeorReal datasets demonstrate consistent improvements in computational efficiency with the dropin approach, and maximum relative reductions in Equal Error Rate of around 39% and 66% with the dropin and plasticity approaches, respectively, across these datasets. The code and supplementary material are available at Github link.
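
The abstract does not specify how dropin works, but function-preserving neuron growth in the Net2Net style gives a flavor of how a layer can gain neurons without full retraining. Everything below is an illustrative stand-in, not the paper's algorithm:

```python
import numpy as np


def forward(w1, b1, w2, x):
    """Two-layer net: w2 @ relu(w1 @ x + b1)."""
    return w2 @ np.maximum(w1 @ x + b1, 0.0)


def widen_hidden(w1, b1, w2, extra: int, seed: int = 0):
    """Add `extra` hidden units without changing the network's function:
    duplicate a random existing unit and split its outgoing weight in
    half between the original and the copy (Net2Net-style growth).
    Hypothetical sketch, not the paper's dropin algorithm."""
    rng = np.random.default_rng(seed)
    w1, b1, w2 = w1.copy(), b1.copy(), w2.copy()
    for _ in range(extra):
        j = int(rng.integers(w1.shape[0]))
        w1 = np.vstack([w1, w1[j:j + 1]])     # copy incoming weights
        b1 = np.append(b1, b1[j])             # copy bias
        w2[:, j] /= 2.0                       # split the outgoing weight...
        w2 = np.hstack([w2, w2[:, j:j + 1]])  # ...between old and new unit
    return w1, b1, w2
```

Because the duplicated unit's activation is identical to the original's, halving and copying its outgoing column leaves the output unchanged: the widened net computes exactly the same function and can then be fine-tuned from there rather than retrained from scratch.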

What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification cs.SD

Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.
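
The online tiering step can be sketched as quantile binning of per-sample confidence scores within a batch. The thresholds and function name below are illustrative, not Curry's actual running statistics:

```python
import numpy as np


def assign_tiers(confidences: np.ndarray,
                 lo_q: float = 0.33, hi_q: float = 0.66) -> np.ndarray:
    """Bin a batch into easy/medium/hard tiers using quantiles computed
    from the batch itself, so no auxiliary annotations are needed.
    High confidence = easy; low confidence = hard. Illustrative
    stand-in for Curry's running-statistics tiering."""
    lo, hi = np.quantile(confidences, [lo_q, hi_q])
    return np.where(confidences >= hi, "easy",
                    np.where(confidences >= lo, "medium", "hard"))
```

A curriculum loss would then weight each tier differently over training, e.g. emphasizing easy samples first and hard (possibly mislabeled) samples later or not at all.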

MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates cs.SD

Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements that were developed for HuBERT can apply directly.
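
The shared-resolution idea comes down to stride arithmetic: HuBERT's 16 kHz convolutional front end has a total stride of 320 samples, about 50 frames per second, so matching that frame rate at other input rates fixes the required hop. A sketch (the function name and the divisibility check are ours, not the paper's design):

```python
def downsample_stride(sample_rate_hz: int, frame_rate_hz: int = 50) -> int:
    """Pick the total CNN stride (hop, in samples) so that waveforms at
    different sampling rates land on the same temporal resolution.
    At 16 kHz this recovers HuBERT's total stride of 320 samples."""
    hop = sample_rate_hz / frame_rate_hz
    if hop != int(hop):
        raise ValueError("sample rate must be a multiple of the frame rate")
    return int(hop)
```

With a shared 50 Hz frame rate, a 16 kHz clip and a 48 kHz clip of the same duration produce the same number of frames, so the Transformer encoder and mask-prediction objective need no changes, which is the compatibility property the abstract highlights.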

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.4                 57.2   75     $5.63
2  Gemini 3.1 Pro Preview  57.2   114    $4.50
3  GPT-5.3 Codex           54     75     $4.81
4  Claude Opus 4.6         53     49     $10.00
5  Claude Sonnet 4.6       51.7   68     $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  Gemini 3.1 Pro Preview     62.3%