The Inference Report

March 27, 2026

Like the shift from mainframe computing to distributed systems in the 1970s, the AI field is fragmenting along infrastructure lines. Large players are consolidating control through network effects and regulatory favor while the actual leverage migrates to whoever controls the compute layer and the orchestration patterns that run on top of it. Google launches tools to absorb users from competitors while simultaneously pushing audio AI so conversational it obscures whether you are speaking to a human, yet these moves sit alongside research documenting manipulation risks in finance and health. OpenAI killed Sora after burning $15 million daily and shelved an erotic chatbot following internal dissent, while Anthropic won a federal injunction establishing that courts will police regulatory weaponization when the target has resources to fight back. The pattern is unmistakable: large players race toward deployed systems that touch users at scale, while smaller builders compete on efficiency and transparency by accepting lower performance in exchange for deployability.

The real competition is not over capabilities but over infrastructure. Google's TurboQuant compression, which reduces memory usage by 6x and attention computation by 8x on H100 hardware, does far more than raw model performance to determine whether AI deployment scales or stalls. Cohere's 2-billion-parameter voice model for consumer-grade GPUs and Mistral's open-source speech generation compete directly with closed systems from ElevenLabs and OpenAI by running on cheaper hardware. Meanwhile, GitHub's trending repos show agents consolidating around two layers: an orchestration tier handling agent lifecycle and message routing, and a foundation tier managing data ingestion and model serving. These aren't monolithic frameworks but composable point solutions, each solving a concrete step rather than the entire pipeline. The winning pattern treats agents as a delivery mechanism, not the novelty.

What the pattern reveals is a widening gap between timescales. Google DeepMind and NVIDIA are racing toward production systems with lower latency and broader infrastructure reach. IBM and Meta are investing in long-cycle research in quantum simulation and neuroscience that may not yield commercial products for years. Detection and generation research in audio signal processing has matured toward richer problem formulations, which recognize that benign transformations like speech enhancement create distributional shifts indistinguishable from spoofing under existing classifiers. Senate scrutiny of data center power consumption and Mark Warner's proposal to tax data centers for job displacement signal that the cost of AI's compute footprint is becoming a political liability. Legal intervention stopped one form of leverage when a federal judge halted the Trump administration's supply-chain-risk designation of Anthropic. Market concentration in infrastructure and orchestration continues unimpeded.

Grant Calloway

Research Papers: Focused
What Does a Pathological Speech Assessment Model Know about Acoustic Features? A Case Study on Oral and Oropharyngeal Cancer Patients cs.SD

This work investigates the interpretability of a Wav2Vec 2.0-based speech intelligibility assessment model for oral and oropharyngeal cancer patients through canonical correlation analysis. By measuring the correlation between the model embeddings and eGeMAPS low-level descriptors (LLDs) as an interpretable reference, we analyze how acoustic information is encoded across the model layers. The analysis is conducted at two levels: individual LLDs, layer by layer, and feature groups (prosodic, spectral, and voice quality). Results show that the learned representations are most strongly correlated with spectral and prosodic features, with the first MFCC coefficient yielding the highest correlations across all layers. At the group level, spectral and prosodic groups achieve correlations of 0.77 and 0.71 respectively, while voice quality reaches 0.65. Beyond model interpretability, this work also offers practical guidance on acoustic feature selection for pathological speech assessment.

Supervised Post-training of Speech Foundation Models for Robust Adaptation in Speech Deepfake Detection cs.SD

Large speech foundation models have shown strong potential for speech deepfake detection, but direct fine-tuning is limited by a mismatch between self-supervised pre-training objectives and spoof-specific artifacts. To address this, we propose a mix-frame post-training strategy to create localized spoof-oriented perturbations and use frame-level supervision to encourage the SSL model to learn local inconsistencies that are critical for robust spoof detection. On ASVspoof5, we achieve a state-of-the-art EER of 4.50% for a single model without data augmentation. On ASVspoof2021 LA/DF, our method further achieves an absolute EER gap of only 0.16% between LA and DF, indicating strong and balanced robustness across distinct distortion conditions. These results show that supervised post-training provides an effective and practical way to adapt speech foundation models for robust deepfake detection.
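
For readers unfamiliar with the metric quoted above, the equal error rate (EER) is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of a threshold sweep, not the paper's or the ASVspoof toolkit's evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    bonafide: Sequence[float], spoof: Sequence[float]
) -> Tuple[float, float]:
    """Return (eer, threshold) from detector scores.

    Assumption: a higher score means "more likely bona fide", and a
    trial is accepted when its score is >= the threshold.
    """
    if len(bonafide) == 0 or len(spoof) == 0:
        raise ValueError("need at least one score in each class")

    best_gap = None
    best_eer = 0.0
    best_threshold = 0.0
    # Every observed score is a candidate threshold.
    for t in sorted(set(bonafide) | set(spoof)):
        # False acceptance: spoofed trials scoring at or above t.
        far = sum(s >= t for s in spoof) / len(spoof)
        # False rejection: bona fide trials scoring below t.
        frr = sum(s < t for s in bonafide) / len(bonafide)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2  # midpoint where the curves cross
            best_threshold = t
    return best_eer, best_threshold


# Toy scores (hypothetical, not from the paper).
eer, threshold = equal_error_rate(
    bonafide=[0.9, 0.8, 0.7, 0.3],
    spoof=[0.6, 0.4, 0.2, 0.1],
)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On real evaluation sets the two error curves rarely cross exactly at an observed score, so toolkits typically interpolate between neighboring thresholds; the midpoint used here is a simplification.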

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis cs.SD

While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique linguistic challenges, such as widespread context-dependent kanji polyphony, have yet to be adequately tackled. Here we introduce Sarashina2.2-TTS (https://github.com/sbintuitions/sarashina2.2-tts), a Japanese-centric LLM-TTS system that tackles these challenges through a dual approach: data strategy and evaluation methodology. First, we scale training to approximately 361k hours of speech, incorporating a balanced mix of Japanese and English data. Furthermore, we design a targeted data augmentation pipeline covering all 2,136 Joyo (regular-use) kanji designated by Japan's Agency for Cultural Affairs to efficiently address kanji polyphony disambiguation. Second, we introduce the Joyo Kanji Yomi Benchmark (https://github.com/sbintuitions/JoyoKanji-Yomi-Benchmark), covering all 2,136 Joyo kanji and their 4,378 readings. Alongside this benchmark, we propose Kana-CER, a metric that compares synthesized speech against reference readings in the kana space, eliminating orthographic variations to directly measure pronunciation correctness. Experiments demonstrate that our targeted data augmentation significantly improves reading accuracy. Overall, Sarashina2.2-TTS achieves state-of-the-art kanji-level reading accuracy and matches top baselines on general sentence-level pronunciation, while delivering the highest speaker similarity in zero-shot Japanese speech synthesis. Furthermore, cross-lingual evaluation reveals that Sarashina2.2-TTS is the only system that maintains stable Japanese pronunciation regardless of the prompt language, confirming that our balanced training approach improves cross-lingual robustness.
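
The idea behind a kana-space error rate can be illustrated with a short sketch. This is not the authors' Kana-CER implementation. It assumes that both the reference reading and the recognized reading of the synthesized speech are already available as kana strings (how the paper obtains them is not described in the abstract), folds katakana to hiragana so that script choice does not count as an error, and then computes a plain character error rate. The names `to_hiragana`, `edit_distance`, and `kana_cer` are hypothetical.

```python
def to_hiragana(text: str) -> str:
    """Fold katakana (U+30A1..U+30F6) to hiragana; leave other characters alone."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in text
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion
                    cur[j - 1] + 1,          # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Character error rate between two readings after folding to hiragana."""
    ref = to_hiragana(reference_kana)
    hyp = to_hiragana(hypothesis_kana)
    if not ref:
        raise ValueError("reference reading must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Same reading written in different scripts: no error.
print(kana_cer("きょう", "キョウ"))
# A wrong reading of the same kanji is penalized per kana character.
print(kana_cer("きょう", "こんにち"))
```

Comparing in kana rather than in mixed kanji-kana text means that a correct pronunciation is not penalized just because a transcript spells the word differently, which is the motivation the abstract gives for the metric.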

From Sounds to Scenes: A Benchmark for Evaluating Context-Aware Auditory Scene Understanding in Large Audio Language Models cs.SD

Recent Large Audio Language Models (LALMs) have achieved remarkable progress in audio perceptual tasks across individual acoustic layers, including speech, sound, and music. However, existing benchmarks predominantly evaluate these layers in isolation, overlooking the complex contextual relationships that arise when multiple acoustic sources co-occur in real-world auditory scenes. Real-world auditory interpretation requires Context-Aware Auditory Scene Understanding (CASU): the ability to comprehend the holistic scene by integrating sound layers. To evaluate this capability, we introduce the CASU benchmark, which assesses whether Audio LLMs can interpret auditory scenes composed of speech, acoustic events (e.g., announcements), and background environments (e.g., traffic), and reason about the logical relationships between these layers. We propose a scalable pipeline for constructing time-accurate, semi-synthetic audio streams by composing real-world scene sounds with synthetic speech. Building on this data, we design four tasks that probe scene understanding: contextual question answering, entity extraction from the scene, speaker role inference, and counterfactual reasoning where the scene is manipulated. Experiments across multiple LALMs demonstrate that effective auditory scene understanding requires integration over all auditory layers, rather than reliance on speech or sound alone, underscoring the necessity of CASU for advancing complex audio understanding in LALMs.

STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity cs.SD

Speech-to-speech translation (S2ST) should preserve not only lexical meaning, but also expressive attributes: emotion, scenario style (e.g., news reporting vs. dramatic dialogue), and nonverbal vocalizations (NVs). Moreover, collecting cross-lingual target speech that is both translation-faithful and expressively aligned with the source is difficult at scale, making reference-based evaluation impractical. We introduce STEB (Speech-to-Speech Translation Expressiveness Benchmark), a 32.6-hour Chinese-English benchmark that evaluates both standard dimensions (translation fidelity, speaker similarity, duration alignment) and expressiveness dimensions (emotion, scenario style, NV preservation). For expressiveness evaluation, STEB uses a caption-then-summarize framework that converts speech into structured expressive attributes and compares source and hypothesis attributes with an LLM judge. Human validation shows statistically significant correlations with listener judgments across all expressive dimensions. We evaluate six S2ST systems covering cascaded systems, end-to-end models, and speech large language models. Many systems, especially cascaded ones, achieve strong translation fidelity, but they still struggle with emotion preservation (best: 3.82/5) and NV preservation (best: 2.31/5). These results reveal a gap between semantic transfer and expressive transfer, identifying expressiveness preservation as an open challenge for S2ST. Audio samples are available at https://cmots.github.io/steb.github.io/.

Neuromorphic Speech Enhancement with Dual-Branch Spiking Neural Networks cs.SD

Spiking neural network (SNN)-based neuromorphic speech enhancement has emerged as a promising paradigm due to its energy efficiency, yet it still underperforms classical artificial neural network (ANN)-based approaches owing to binary activations and the lack of well-designed network architectures. To overcome this limitation, we propose a novel dual-branch spiking neural network architecture equipped with a gated spiking unit (GSU), termed GSU-DBNet. Specifically, GSU-DBNet simultaneously models the speech magnitude spectrum and complex spectrum, predicting the corresponding magnitude and complex spectral masks. Meanwhile, a dual-path GSU module is adopted to exploit temporal and frequency information for enhanced spatiotemporal feature representation. Experiments on a popular benchmark dataset show that GSU-DBNet achieves a PESQ score of 3.04 with only 394K parameters, outperforming existing SNN-based methods while using only 4.5% to 10.6% of the parameters of representative ANN-based models.

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                    Score  tok/s  $/1M tokens
1  GPT-5.4                  57.2   75     $5.63
2  Gemini 3.1 Pro Preview   57.2   114    $4.50
3  GPT-5.3 Codex            54     75     $4.81
4  Claude Opus 4.6          53     49     $10.00
5  Claude Sonnet 4.6        51.7   68     $6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                       Score
1  Claude Opus 4.6             65.3%
2  gpt-5.2-2025-12-11-medium   64.4%
3  GLM-5                       62.8%
4  gpt-5.4-2026-03-05-medium   62.8%
5  Gemini 3.1 Pro Preview      62.3%