Elon Musk's lawsuit forcing OpenAI executives to defend their founding principles arrives the same week the company releases GPT-5.5 Instant and launches expanded ad offerings, crystallizing a broader pattern: as AI moves from research to revenue, the legal and financial incentives that shaped founding documents become friction points. Character.AI faces two state lawsuits over chatbots impersonating licensed psychiatrists with fabricated credentials, while OpenAI claims reduced hallucination in law and medicine; both companies operate in the same regulatory vacuum. The tension is not philosophical but structural. Regulators are beginning to notice that companies claiming safety improvements and chatbots claiming professional licenses inhabit the same unmonitored space, and that space is closing.
Capital is flowing not to whoever builds the best models but to whoever controls the layer above them. Blitzy raises $200 million at a $1.4 billion valuation to automate coding. SAP acquires Prior Labs for $1.16 billion and restricts customers to approved models, signaling that margin lives in enterprise lock-in, not model capability. ElevenLabs hits $500 million ARR with voice AI as the critical interface. Krutrim, India's first GenAI unicorn, pivots to cloud services after the model business proved unsustainable. Infrastructure vendors are racing to solve the enterprise agent problem: NVIDIA partners with ServiceNow, AWS removes the modernization tax by deploying agents against legacy systems without APIs, AMD targets GPU distribution for Mixture-of-Experts models. Yet no one has claimed the agent orchestration layer itself, the platform that audits and governs agents across multiple services. That gap is where the next round of competition will live.
The gap between announcement and outcome is widening. Gartner's survey of 350 large organizations found 80 percent reported headcount reductions from AI initiatives, yet firms cutting staff show no better returns than those retaining workers. PayPal announces $1.5 billion in savings through automation and restructuring before measuring results. Microsoft and Google are simultaneously adding governance controls for AI agents accessing corporate data, a tacit admission that deployment speed has outpaced the ability to monitor what these systems actually do inside organizations. On GitHub, developers have stopped asking whether agents can work and started working out whether they can work cheaply and accurately enough to deploy. Context-mode achieves 98 percent token reduction. Local-deep-research supports 10 search engines. The real infrastructure conversation is not about frameworks that require buying into a complete worldview but about tools that integrate into existing workflows and let agents retain something useful across sessions.
Grant Calloway
We introduce PALACE (Persistence Adaptive-Landmark Analytic Classification Engine), the data-adaptive companion to PLACE, paying only a small cross-validation cost over three knobs (budget, radii, bandwidth; $\leq 5$ choices each). A cover-theoretic core (a Lebesgue-number criterion on the landmark cover) yields four closed-form guarantees. (i) A structural lower distortion bound $\lambda(\tau;\nu)$ on $\mathcal{D}_n$ under cross-diagram non-interference, with a $(D/L)^2$ budget reduction over the uniform grid when diagrams concentrate. (ii) Equal weights $w_k = K^{-1/2}$ maximizing $\lambda$, and farthest-point-sampling positions that $2$-approximate the optimal $k$-center covering radius; both derived from training labels alone, with no gradient training. (iii) A kernel-RKHS classification rate $O((k-1)\sqrt{K}/(\gamma\sqrt{m_{\min}}))$ with binary necessity threshold $m = \Omega(\sqrt{K}/\gamma)$ from a matching Le Cam lower bound, and a closed-form filtration-selection rule. The kernel-Mahalanobis margin $\hat{\rho}_{\mathrm{Mah}}$ is the strongest closed-form ranker across the chemical-graph pool (mean Spearman $\rho \approx +0.60$); the isotropic surrogate $\hat{\gamma}/\sqrt{K}$ admits a selection-consistency rate, and $\widehat{\lambda}$ from (i) provides an independent data-level signal (positive on COX2 and PTC). (iv) A per-prediction certificate, in non-asymptotic Pinelis and asymptotic Gaussian forms, with no calibration split. Empirically, PALACE is the strongest closed-form diagram-based method on Orbit5k ($91.3 \pm 1.0\%$, matching Persformer), leads every diagram-based competitor on COX2 and MUTAG, and is competitive on DHFR (within 1 pp of ECP). At $8\times$ domain inflation, adaptive placement maintains $94\%$ accuracy while the uniform grid collapses to chance ($25\%$ on 4-class data).
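For intuition, the landmark-placement guarantee in (ii) rests on the classical farthest-point-sampling bound for $k$-center (Gonzalez, 1985). A minimal sketch, assuming Euclidean inputs; the function below is illustrative, not the PALACE implementation:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Greedy k-center: each new landmark is the point farthest from the
    current landmark set. Gonzalez (1985) shows the resulting covering
    radius is within 2x of the optimal k-center radius."""
    landmarks = [0]  # arbitrary seed point
    # dist[i] = distance from point i to its nearest chosen landmark
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest point from current landmarks
        landmarks.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[np.array(landmarks)]
```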
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
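As a rough illustration of how the reported rates could be aggregated from RadSaFE-200's option-level labels, here is a minimal sketch; the `Result` schema and the confidence threshold for "dangerous overconfidence" are assumptions, not the benchmark's released format:

```python
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool        # chosen option matches the gold answer
    high_risk: bool      # chosen option carries the high-risk-error label
    contradicts: bool    # chosen option contradicts the provided evidence
    confidence: float    # model's self-reported confidence in [0, 1]

def safety_profile(results: list[Result], conf_threshold: float = 0.9) -> dict:
    """Aggregate per-question outcomes into accuracy/safety rates like those
    in the abstract. 'Dangerous overconfidence' is sketched here as a wrong
    answer delivered above a confidence threshold (an assumption)."""
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "high_risk_error": sum(r.high_risk for r in results) / n,
        "contradiction": sum(r.contradicts for r in results) / n,
        "dangerous_overconfidence": sum(
            (not r.correct) and r.confidence >= conf_threshold for r in results
        ) / n,
    }
```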
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative, high-difficulty trajectories, a simple SFT approach can be surprisingly powerful for training frontier search agents. By introducing three simple data-synthesis modifications (scaling knowledge-graph size for richer exploration, expanding the tool set for broader functionality, and applying strict low-step filtering), we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (among 30B-scale agents using the ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch, trained with a heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
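Of the three modifications, strict low-step filtering is the simplest to sketch: keep only trajectories that succeed within a tight tool-call budget, so the SFT set favors efficient search. A minimal illustration; the trajectory schema and the step cap are assumptions, not the OpenSeeker-v2 pipeline:

```python
def low_step_filter(trajectories: list[dict], max_steps: int = 8) -> list[dict]:
    """Keep only successful trajectories that reach the answer within a
    strict tool-call budget, discarding long, meandering rollouts.
    Each trajectory dict is assumed to hold a 'steps' list of
    (thought, tool_call, observation) triples and a 'solved' flag."""
    return [
        t for t in trajectories
        if t["solved"] and len(t["steps"]) <= max_steps
    ]
```

Combined with larger knowledge graphs and a broader tool set, a filter like this keeps task difficulty high while keeping the demonstrated behavior efficient.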
We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.
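The decoupling claim can be made concrete: the decoder emits one Gaussian per UV texel, so the Gaussian count is fixed by the template's UV resolution rather than by the number or resolution of input views. A shape-level PyTorch sketch; the layer choices and the 14-channel parameterization are assumptions, not the HeadsUp architecture:

```python
import torch
import torch.nn as nn

class UVGaussianDecoder(nn.Module):
    """Decode a compact latent into Gaussian parameters on a fixed UV grid
    anchored to a neutral head template. The Gaussian count (uv_res**2) is
    independent of how many views produced the latent."""
    def __init__(self, latent_dim: int = 512, uv_res: int = 128,
                 params_per_texel: int = 14):
        super().__init__()
        # project latent to a coarse 8x8 feature map, then upsample to UV res
        self.proj = nn.Linear(latent_dim, 64 * 8 * 8)
        layers, res = [], 8
        while res < uv_res:
            layers += [nn.Upsample(scale_factor=2),
                       nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()]
            res *= 2
        # 14 channels per texel (assumed): 3 position offset + 3 scale
        # + 4 rotation quaternion + 1 opacity + 3 color
        layers.append(nn.Conv2d(64, params_per_texel, 1))
        self.decoder = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_dim) -> (B, uv_res*uv_res, params_per_texel)
        x = self.proj(z).view(-1, 64, 8, 8)
        g = self.decoder(x)
        return g.flatten(2).transpose(1, 2)
```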
AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specific workflows. Operators spend weeks hand-crafting workflows, assembling attacks, transforms, and scorers, and when results fall short, the workflows must be rebuilt. As a result, operators spend more time constructing workflows than probing targets for security and safety vulnerabilities. We introduce an AI red teaming agent built on the open-source Dreadnode SDK. The agent creates workflows grounded in 45+ adversarial attacks, 450+ transforms, and 130+ scorers. Operators can probe multi-agent, multilingual, and multimodal targets, focusing on what to probe rather than how to implement it. We make three contributions:
1. Agentic interface. Operators describe goals in natural language via the Dreadnode TUI (Terminal User Interface). The agent handles attack selection, transform composition, execution, and reporting, letting operators focus on red teaming. Weeks compress to hours.
2. Unified framework. A single framework for probing traditional ML models (adversarial examples) and generative AI systems (jailbreaks), removing the need for separate libraries.
3. Llama Scout case study. We red team Meta Llama Scout and achieve an 85% attack success rate with severity up to 1.0, using zero human-developed code.
Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
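Training with complementary positives and positive-conditioned hard negatives fits a standard InfoNCE setup; a minimal sketch of such an objective, where the batch layout and temperature are assumptions rather than the paper's recipe:

```python
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(q, pos, neg, temperature: float = 0.05):
    """InfoNCE over a batch of embeddings.
    q:   (B, D) query embeddings
    pos: (B, D) one complementary positive per query
    neg: (B, N, D) positive-conditioned hard negatives per query
    Pushes each query toward its own positive and away from its hard
    negatives plus the in-batch positives of other queries."""
    q, pos, neg = (F.normalize(t, dim=-1) for t in (q, pos, neg))
    pos_sim = (q * pos).sum(-1, keepdim=True)         # (B, 1)
    hard_sim = torch.einsum("bd,bnd->bn", q, neg)     # (B, N)
    in_batch = q @ pos.T                              # (B, B)
    # mask the diagonal: each query's own positive is already column 0
    mask = torch.eye(len(q), dtype=torch.bool, device=q.device)
    in_batch = in_batch.masked_fill(mask, float("-inf"))
    logits = torch.cat([pos_sim, hard_sim, in_batch], dim=1) / temperature
    labels = torch.zeros(len(q), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```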
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tokens |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 79 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 64 | $10.94 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 138 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 85 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 28 | $1.71 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | Junie | 62.8% |
| 5 | gpt-5.4-2026-03-05-medium | 62.8% |
Coding agent for DeepSeek models that runs in your terminal
🌊 The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, distributed swarm intelligence, RAG integration, and native Claude Code / Codex Integration
An autonomous agent for deep financial research
Open source DocuSign alternative. Create, fill, and sign digital documents ✍️
VSCode theme based on the easemate IDE and the JetBrains Islands theme
DeepSeek-native AI coding agent for your terminal. Engineered around prefix-cache stability — leave it running.
Production-ready platform for agentic workflow development.
The open source research environment for AI researchers to seamlessly train, evaluate, and scale models from local hardware to GPU clusters.
Label Studio is a multi-type data labeling and annotation tool with standardized output format
PowerMem: Your AI-Powered Long-Term Memory — Accurate, Agile, Affordable. Also offers friendly support for the OpenClaw Memory Plugin.