The Inference Report

July 4, 2026

Today's news confirms what the market has already priced in: the AI industry's center of gravity has shifted from capability announcements to operational durability. Token prices are cooling and spending growth is plateauing, yet the actual competition is intensifying precisely where it matters least to the headlines. Meta is squeezing efficiency from recycled RAM through custom chips while managing unionization pressure at DeepMind. Cloud providers are fracturing under their own outages. Security patches are accelerating because vulnerabilities now arrive faster than fixes can ship. The companies winning aren't those with the biggest models; they're the ones solving the unglamorous problems of hardware reuse, infrastructure resilience, and exploit remediation.

AMD's three infrastructure announcements in a single day exemplify this shift. The company is not launching models or frameworks. It is publishing benchmarks that directly compare competing AI agents on the same hardware, building integrated pipelines that keep data in VRAM to eliminate bottlenecks, and optimizing performance for models already shipping from competitors. AMD names Cursor Agent, Claude Code, and OpenAI Codex in its coding benchmark. It names Kimi-K2.5 and MiniMax-M2.5 in inference optimization. The message to the market is explicit: your models will run faster on our hardware if you follow our patterns. This is the work of infrastructure positioning, not foundation model competition. Database research reinforces the pattern, clustering around three interconnected frontiers: agentic systems for data manipulation, semantic integration of unstructured and structured data, and the infrastructure required to make agents reliable at scale. Rather than treating agents as stateless compute or semantic operations as isolated problems, the emerging research treats the artifacts of search and the semantics of data as first-class database objects, queryable and governed by the infrastructure itself.

Benchmark stability and GitHub trends confirm where actual differentiation is occurring. OpenAI's gpt-5.5-2026-04-23-xhigh model holds the top position on SWE-rebench at 62.7 percent with tight confidence intervals, while Claude Fable 5 leads the Artificial Analysis Intelligence Index with a composite score of 59.9, suggesting different problem structures rather than a unified capability hierarchy. The trending repositories reveal a decisive shift toward agentic tools embedded into developer workflows: Claude Code dominates with the highest star count as a terminal-resident coding agent, while supporting infrastructure clusters around agent isolation, auditable memory systems, and skill standardization. The newcomers are not flashy consumer tools but unglamorous plumbing. Elasticsearch, PyTorch, and Supabase remain foundational, but TencentCloud's CubeSandbox isolates agent execution, shodh-memory offers local auditable agent memory without API dependencies, and academic-research-skills packages domain expertise as Claude Code skills. The pattern is consolidation around a specific vision of agency: agents as terminal-native, skill-based, auditable systems that integrate into existing infrastructure rather than replace it.

Grant Calloway

AI Labs

AMD

Google DeepMind

Google DeepMind and A24 announce first-of-its-kind research partnership

Mistral

Leanstral 1.5: Proof Abundance for All

From the Wire

Research Papers: Focused

AgenticDataBench: A Comprehensive Benchmark for Data Agents cs.DB

Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive efforts for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution to automate data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios with fine-grained granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills, recurring data-centric operational patterns, and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for diverse domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.

HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report cs.DB

Hierarchical Navigable Small World (HNSW) graphs serve as the industry standard due to their logarithmic complexity and strong empirical performance. However, HNSW relies on greedy graph traversal, a heuristic that provides no theoretical guarantees of correctness. In this paper, we propose a novel "Certify-then-Rectify" framework that bridges the gap between the speed of heuristic search and the rigor of exact retrieval. Rather than discarding HNSW, our approach first employs a distribution-free statistical certifier to dynamically evaluate the quality of a standard HNSW search with minimal overhead. If certification indicates that the retrieved neighbors are of low quality, the framework safely escalates to a rigorous exact recovery algorithm. To make this exact recovery computationally feasible, we reinterpret the HNSW graph as a geometric spanner and utilize Extreme Value Theory to stochastically estimate its maximum empirical stretch factor. This allows us to mathematically bound the maximum distance of true nearest neighbors. Extensive evaluations on benchmark datasets demonstrate that our tiered framework delivers the average-case speed of HNSW while ensuring the worst-case correctness of exact search and outperforming other applicable approaches.
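The tiered control flow the abstract describes (try the fast search, check the result, escalate only when the check fails) can be sketched roughly as below. This is an illustration of the escalation pattern only, not the paper's method: `tiered_knn`, `approx_search`, and `certify` are hypothetical names, the certifier is a caller-supplied placeholder rather than the distribution-free statistical certifier, and the slow tier is plain brute force rather than the spanner-based exact recovery.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(points, query, k):
    """Brute-force exact k nearest neighbors, returned as indices (slow tier)."""
    order = sorted(range(len(points)), key=lambda i: euclidean(points[i], query))
    return order[:k]


def tiered_knn(points, query, k, approx_search, certify):
    """Run the fast search first; fall back to exact search if the check fails.

    approx_search(points, query, k) -> list of indices   (hypothetical fast tier)
    certify(points, query, candidates) -> bool           (hypothetical quality check)
    Returns (indices, tier), where tier is "fast" or "exact".
    """
    candidates = approx_search(points, query, k)
    if certify(points, query, candidates):
        return candidates, "fast"
    return exact_knn(points, query, k), "exact"
```

The design point is that the exact tier is paid for only on queries the check rejects, so average cost stays close to the fast tier as long as rejections are rare.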

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers cs.DB

LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement problem with switching costs, where items are matched by embedding similarity and hit quality is continuous rather than binary. Through experiments on two datasets from MemoryBench-Full (LoCoMo, DialSim) with 8 replacement policies, we reveal a surprising finding: classic heuristics (LRU, LFU) consistently underperform the naive FIFO baseline on semantic workloads, due to the absence of temporal locality and frequency concentration. We propose SOLAR, a learning-augmented framework that derives modification timing from regret accumulation (achieving ~17% modification rate) and content selection from Bayesian online learning over implicit retrieval feedback. We prove SOLAR achieves a constant competitive ratio ≤ 3, independent of cache size and horizon (vs. Ω(K) for FIFO), and eviction regret O(√(KT log T)), matching the Ω(√(KT)) lower bound up to logarithmic factors. Experiments demonstrate 5-75% relative improvement over FIFO at tight cache sizes, with a clearly characterized phase transition at the working set boundary. Synthetic experiments with 5000-item pools further reveal an inverted-U relationship between pool size and retrieval quality, justifying capacity constraints as a retrieval noise phenomenon rather than a storage limitation.
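To make the problem setting concrete, here is a minimal sketch of a semantic retrieval buffer in which a lookup returns a continuous match quality (cosine similarity) instead of a binary hit, with the two baseline eviction policies the abstract compares. It does not implement SOLAR; the class name `SemanticBuffer`, its methods, and the choice of cosine similarity are assumptions made for illustration.

```python
import math
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticBuffer:
    """Hypothetical fixed-capacity buffer matched by embedding similarity."""

    def __init__(self, capacity, policy="fifo"):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.capacity = capacity
        self.policy = policy
        self.items = OrderedDict()  # key -> embedding, next eviction victim first

    def lookup(self, query):
        """Return (best_key, quality); quality is continuous, not hit/miss."""
        if not self.items:
            return None, 0.0
        best_key = max(self.items, key=lambda k: cosine(self.items[k], query))
        quality = cosine(self.items[best_key], query)
        if self.policy == "lru":
            # LRU refreshes the matched item; FIFO leaves insertion order alone.
            self.items.move_to_end(best_key)
        return best_key, quality

    def insert(self, key, embedding):
        """Insert an item, evicting the front of the order when full.

        Returns the evicted key, or None if nothing was evicted.
        """
        if key in self.items:
            self.items[key] = embedding  # update in place, position unchanged
            return None
        evicted = None
        if len(self.items) >= self.capacity:
            evicted, _ = self.items.popitem(last=False)
        self.items[key] = embedding
        return evicted
```

With capacity 2, inserting `a` then `b`, querying near `a`, and then inserting `c` evicts `a` under FIFO but `b` under LRU. The two policies differ only in whether a match refreshes an item's position, which is the temporal-locality assumption the abstract reports as unhelpful on semantic workloads.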

Exploring the Semantic Gap in Agentic Data Systems: A Formative Study of Operationalization Failures in Analytical Workflows cs.DB

Large language models (LLMs) are increasingly used to generate queries, invoke tools, and construct analytical workflows. Although recent advances have substantially improved workflow generation and execution, the semantic information required to operationalize analytical concepts often lies beyond what is explicitly represented in database schemas and data values. We present a cross-domain formative study of operationalization failures in agent-generated analytical workflows. Across 236 analytical intents spanning finance, human resources, and public safety domains, we identify 153 recurring failures despite successful workflow generation and execution. Our analysis reveals five recurring classes of failures: comparative grounding, process reasoning, quantitative reasoning, role confusion, and policy grounding. These findings suggest a semantic gap between user-level analytical concepts and the information available to workflow-generation systems. More broadly, they raise questions about the admissibility of analytical operations and suggest that future agentic data systems may require richer semantic representations to bridge the gap between analytical intent and executable computation.

DA-Studio: An Agentic System for End-to-End Data Analysis cs.DB

Real-world data analysis is a multi-step process over heterogeneous inputs rather than merely producing a final answer. A practical system should autonomously organize multi-step workflows, execute generated code in a sandboxed and controllable environment, and remain inspectable through visible action traces and intermediate artifacts. Existing LLM-based analysis tools, however, often emphasize isolated subtasks, leaving limited support for complete execution-grounded workflows. We present DA-Studio (Data Analysis Studio), an interactive web-based demo system for end-to-end data analysis that is autonomous, sandboxed, and inspectable. DA-Studio integrates an action-structured analysis backend, a sandboxed execution workspace, and a browser interface for task setup, streamed action traces, artifact preview, code editing and rerunning, and report export. Through iterative action generation, code execution, and feedback incorporation, it incrementally constructs executable analysis steps from raw files and natural-language requests while exposing intermediate results and artifacts throughout the process.

SemJoin: Semantic Join Optimization cs.DB

Integrating unstructured data into relational database systems is increasingly important as demand grows for natural language querying and analysis. A semantic join, joining two tables under a natural-language predicate, can be evaluated with a large language model (LLM), but comparing every pair of tuples requires O(M x N) LLM invocations and is cost-prohibitive at scale. Existing systems reduce this cost but typically commit to a single fixed strategy (e.g., embedding similarity or one batched scheme) regardless of the data or the join predicate. We propose an LLM-agent-based decision pipeline that optimizes semantic joins by matching the execution strategy to the characteristics of the underlying tables. An LLM advisor routes each join to one of two strategies: a Cluster Join, which prunes candidates via unsupervised embedding clustering and sample-based filtering, or a Classifier strategy for predicates that reduce to a shared discrete label set. Across three diverse datasets (IMDb reviews, email contradictions, and Stack Overflow tags), the advisor consistently identifies the optimal execution strategy for each workload. This dynamic routing proves decisive: it outperforms adaptive block join (ABJ) by 20-33 F1 points across all datasets while consuming fewer tokens on two of the three, and achieves higher F1 scores than featurized-decomposition join (FDJ) at one to two orders of magnitude lower token cost.
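The cost argument is easy to check with a little arithmetic. The sketch below counts LLM invocations for the naive all-pairs join against a simplified pruning rule that only compares tuples sharing a cluster label. That rule is an assumption chosen for illustration and stands in for, but is not, the paper's Cluster Join (which also applies sample-based filtering); the function names are hypothetical.

```python
from collections import Counter


def naive_pair_count(m, n):
    """All-pairs semantic join: one LLM call per (left, right) tuple pair."""
    return m * n


def clustered_pair_count(left_labels, right_labels):
    """LLM calls if only tuples with the same cluster label are compared.

    left_labels / right_labels: one cluster label per tuple on each side.
    """
    left = Counter(left_labels)
    right = Counter(right_labels)
    # Counter returns 0 for labels that never appear on the right side.
    return sum(count * right[label] for label, count in left.items())
```

For two four-row tables with left labels `[0, 0, 1, 2]` and right labels `[0, 1, 1, 3]`, the naive join needs 16 calls while the same-cluster rule needs 4. The saving grows with table size, but so does the risk of missing true matches that land in different clusters, which is the recall trade-off any pruning strategy has to manage.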

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	62	$20.00
2	Claude Opus 4.8	55.7	61	$10.00
3	GPT-5.5	54.8	88	$11.25
4	Claude Opus 4.7	53.5	49	$10.00
5	Claude Sonnet 5	53.4	79	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Vendor	Name	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%
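One way to read the ± values is to check which entries' intervals overlap the leader's. The sketch below does that arithmetic on the five rows above. It assumes each ± figure is the half-width of an interval around the score, which the table itself does not state, and interval overlap is only a rough heuristic rather than a formal significance test.

```python
# (name, score, half_width) copied from the SWE-rebench table above.
ROWS = [
    ("gpt-5.5-2026-04-23-xhigh", 62.7, 0.91),
    ("Junie", 61.6, 0.64),
    ("Codex", 60.4, 1.37),
    ("Claude Code", 59.6, 1.98),
    ("gpt-5.5-2026-04-23-medium", 58.9, 0.78),
]


def interval(score, half_width):
    """Return (low, high), treating half_width as a symmetric margin (assumption)."""
    return (score - half_width, score + half_width)


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_with_leader(rows):
    """Names of entries whose interval overlaps the top-ranked entry's interval."""
    leader = interval(rows[0][1], rows[0][2])
    return [name for name, score, hw in rows[1:] if overlaps(leader, interval(score, hw))]


print(overlapping_with_leader(ROWS))  # ['Junie']
```

Under that reading, the leader's interval is roughly 61.79 to 63.61, and only Junie's (60.96 to 62.24) overlaps it; Codex's upper bound of 61.77 falls just short.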

GitHub Repos

Trending

usestrix/strix

35239 ★

Open-source AI hackers to find and fix your app’s vulnerabilities.

openai/codex-plugin-cc

23462 ★

Use Codex from Claude Code to review code or delegate tasks.

JuliusBrussee/caveman

83293 ★

🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman

elastic/elasticsearch

77403 ★

Free and Open Source, Distributed, RESTful Search Engine

actions/checkout

8287 ★

Action for checking out a repo

Daily discovery

willyfh/visualtorch (Deep Learning)

172 ★

VisualTorch aims to help visualize Torch-based neural network architectures.

isLinXu/paper-list (Image Generation)

144 ★

autoupdate paper list

astroautomata/PySR (AutoML)

3605 ★

High-Performance Symbolic Regression in Python and Julia

varun29ankuS/shodh-memory (Vector Database)

227 ★

Cognitive brain for Claude, AI agents & edge devices — learns with use, runs offline, single binary. Neuroscience-grounded 3-tier architecture with Hebbian learning.

redai-infra/Relax (RLHF)

457 ★

An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale