The Inference Report

June 5, 2026

A paper on information retrieval published in 2024 has quietly documented what the market is only now pricing in: the constraint is no longer raw model capability, but rather the engineering discipline required to make retrieval systems actually work in production. The research reveals that dense embeddings and learned scoring functions, however sophisticated, leave substantial gaps when applied to latent patterns, low-frequency items, and domain-specific terminology. Solutions that persist across multiple papers combine sparse and dense signals, integrate structured metadata, and inject external knowledge through LLM-guided reasoning. The field has shifted from treating retrieval as a search problem to treating it as a state management and alignment problem. This matters because the market is now bifurcating between companies that can operationalize AI capability into defensible revenue streams and those still selling the promise of future returns. Anthropic's annualized revenue crossed $47 billion in May, up from $9 billion at the end of 2025, yet the company faces real tests ahead as it prepares for an IPO. Quantinuum continues to attract investor capital despite losing millions. This divergence reflects a market where the ability to convert AI capability into actual customer value now separates the winners from the venture-backed faith plays.

Infrastructure tooling is changing faster than applications can absorb it. Microsoft released Coreutils to reduce friction for developers moving between Windows and Linux environments. Google shipped Gemma 4 12B to run agentic workflows on consumer hardware. Microsoft's Web IQ and Rayfin SDK aim to give AI agents access to real-time web data and simplified backend deployment. These moves prioritize developer velocity and operational simplicity over raw model scale. Meanwhile, Anthropic's Claude is being used by the NSA for cyber operations, and Meta is building data centers in tents to slash costs. GitHub's trending repositories confirm this pattern: Headroom compresses LLM inputs by 60-95 percent without degrading output quality. Lance converts Parquet to a columnar format optimized for vector operations, delivering 100x faster random access. Trivy and PaddleOCR solve the upstream problem of getting structured data into AI pipelines in the first place. These aren't frameworks that ask you to rewrite your stack. They're tools that plug into existing workflows and reduce friction or expense. The constraint is no longer compute or models. It's infrastructure efficiency, developer tooling, and the ability to operationalize systems in production environments.

Distribution and integration now matter more than raw capability. Apple's pre-WWDC privacy campaign frames AI as something that only works if users trust the platform, while Poke became the first AI agent approved for Apple's Messages for Business platform. This signals Apple's strategy: controlled distribution of AI features through its own platforms, not open ecosystems. Google, Microsoft, and Meta are racing to embed AI into their existing products and services, turning them into runtime environments for agents. The lab announcements reveal the same pattern at a different scale. IBM is ceding the foundation layer to Google and building industry-specific agents on top of Gemini, effectively focusing on consulting delivery rather than competing head-to-head on models. Hugging Face is trying to own the middle layer where enterprises actually build, while NVIDIA sees durable value in gaming infrastructure and sovereign AI partnerships. The companies that already own distribution channels and user relationships are converting them into AI deployment infrastructure. Those without direct user access are competing on raw capability and hoping to be acquired or integrated. The coding automation space, measured by SWE-rebench and Artificial Analysis, is becoming a measurable competitive arena where engineering discipline matters as much as scale. The labs that won the last cycle are now building moats around deployment, customization, and operational integration rather than racing to announce bigger numbers.

Grant Calloway

Research Papers: Focused
VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions cs.IR

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.
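For readers who want the retrieval idea in concrete terms, here is a minimal sketch of nearest-neighbor lookup in a shared embedding space. It is not the VCG system: the three-dimensional vectors, the video ids, and the `retrieve` helper are all hypothetical stand-ins for what a CLIP-style encoder would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, video_vecs, k=2):
    """Return the ids of the k videos closest to the query (hypothetical helper)."""
    ranked = sorted(
        video_vecs,
        key=lambda vid: cosine(query_vec, video_vecs[vid]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings standing in for encoder output (assumed values, not real data).
videos = {
    "sneaker_review": [0.9, 0.1, 0.0],
    "lipstick_demo": [0.0, 0.2, 0.9],
    "running_shoes": [0.8, 0.3, 0.1],
}
user = [1.0, 0.2, 0.0]  # a user embedded near footwear content

top = retrieve(user, videos, k=2)
print(top)
```

Because ranking depends only on the vectors, a video with zero interaction history can be retrieved as soon as it has been embedded, which is the property the abstract relies on for cold-start items.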

Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models cs.IR

Large Recommendation Models (LRMs) have demonstrated promising capabilities in industry-scale recommendation tasks. However, holistically integrating traditional signals into these transformer-based architectures effectively and efficiently remains a major challenge. Conventional approaches that "textualize" these signals directly or create discrete item representations often lead to excessively long prompts, substantial memory footprints, and high computational overhead. To overcome these limitations, we propose "Token Factory", a framework designed to transform traditional signals into "soft tokens" that can be directly processed by LRMs. This approach enables efficient integration and compression of heterogeneous input features, preventing prompt length explosion while enhancing model performance. We detail the architecture of Token Factory and present experimental results validating its effectiveness in a production-scale recommendation environment.
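The abstract does not describe Token Factory's internals, so the following is only a rough sketch of the general soft-token idea: dense feature values are projected into a small, fixed number of embedding vectors instead of being written out as text. The random weights, feature values, and function names are assumptions for illustration, and a real system would learn the projection.

```python
import random

def make_projection(n_features, n_tokens, d_model, seed=0):
    """Random weights standing in for a learned projection (hypothetical)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
        for _ in range(n_tokens)
    ]

def to_soft_tokens(features, projection):
    """Map one feature vector to n_tokens dense vectors of width d_model."""
    return [
        [sum(w * f for w, f in zip(row, features)) for row in token_weights]
        for token_weights in projection
    ]

# Six traditional signals for one item (assumed values, e.g. CTR, age, flags).
features = [0.3, 12.0, 0.0, 1.0, 0.75, 4.0]
projection = make_projection(n_features=len(features), n_tokens=2, d_model=8)
soft_tokens = to_soft_tokens(features, projection)

# Textualizing the same signals yields at least one piece per feature;
# whitespace splitting is only a rough stand-in for a real tokenizer.
textualized = " ".join(f"feature_{i}={v}" for i, v in enumerate(features)).split()

print(len(soft_tokens), "soft tokens vs", len(textualized), "text pieces")
```

The number of soft tokens is fixed by the projection rather than by the number of input features, which is one way the prompt-length growth described in the abstract could be avoided.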

Closing the Calibration Gap in Semantic Caching cs.IR

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
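The gap between ranking quality and threshold usability can be shown with a toy example. The sketch below does not implement the paper's P-CHR AUC or CRR, whose exact definitions are not given in the abstract. It only compares two hypothetical scorers that order the same query-cache pairs identically, so any ranking-only metric rates them the same, yet they behave very differently at a fixed serving threshold.

```python
def precision_and_hit_ratio(scores, labels, threshold):
    """Precision of served cache hits and the fraction of queries served from cache."""
    served = [label for score, label in zip(scores, labels) if score >= threshold]
    hit_ratio = len(served) / len(scores)
    precision = sum(served) / len(served) if served else None
    return precision, hit_ratio

# 1 = the cached answer was correct for the query, 0 = it was not (toy labels).
labels = [1, 1, 1, 0, 1, 0, 0, 0]

# Two hypothetical scorers with the same ordering but different score scales.
scores_a = [0.95, 0.90, 0.85, 0.70, 0.65, 0.40, 0.30, 0.10]
scores_b = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92]  # compressed near 1.0

a = precision_and_hit_ratio(scores_a, labels, threshold=0.8)
b = precision_and_hit_ratio(scores_b, labels, threshold=0.8)
print("scorer A:", a)
print("scorer B:", b)
```

At a threshold of 0.8, scorer A serves three of eight queries from cache, all correctly, while scorer B serves every query and is right only half the time. Sweeping the threshold and plotting precision against hit ratio is one plausible way to build a cache-aware curve of the kind the abstract describes.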

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines cs.IR

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments cs.IR

Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.
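For context on the headline numbers, Recall@k is the fraction of relevant papers that appear in the top k results. The sketch below uses made-up paper ids and assumes that "Recall@All" means recall over everything the agent returned. The abstract does not define it, so treat that reading as a guess.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Fraction of relevant items found in the top k results (all results if k is None)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top = ranked_ids if k is None else ranked_ids[:k]
    return len(relevant.intersection(top)) / len(relevant)

# Toy data: five returned papers, four papers that actually answer the query.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p4", "p5", "p8"}

r_at_2 = recall_at_k(ranked, relevant, k=2)
r_all = recall_at_k(ranked, relevant)
print(r_at_2, r_all)
```

Under that definition, a Recall@100 of 0.314 means the best agent surfaced fewer than a third of the relevant papers within its first 100 results.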

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval cs.IR

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have overlooked grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative samples. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.

Benchmarks
Artificial Analysis: Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Claude Opus 4.8         61.4   58     $10.94
2  GPT-5.5                 60.2   62     $11.25
3  Claude Opus 4.7         57.3   52     $10.94
4  Gemini 3.1 Pro Preview  57.2   123    $4.50
5  GPT-5.4                 56.8   80     $5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  gpt-5.5-2026-04-23-xhigh   62.7%
2  Codex                      60.4%
3  Claude Code                59.6%
4  gpt-5.5-2026-04-23-medium  58.9%
5  Claude Opus 4.8-xhigh      56.4%