The Inference Report

June 7, 2026

OpenAI is simultaneously fortifying its core product against attack, overhauling ChatGPT into a services gateway ahead of a potential IPO, and negotiating equity stakes with the Trump administration while proposing a sovereign-wealth fund to address public anxiety about AI's impact. Sriram Krishnan has left his White House advisor role to shape Trump's AI policy from outside. The pattern is unmistakable: the companies building the products are also writing the policy framework, structuring the financial instruments, and placing their people in corridors of power. When a builder proposes a fund to address public concern and the administration considers taking an equity stake in that same builder, there is no separation left between the builder, the regulator, and the beneficiary.

This consolidation of control extends across the technical stack. NVIDIA is betting that consumer hardware upgrades and AI-native software will drive the next computing cycle through RTX Spark and gaming partnerships with T1 and Krafton. Hugging Face recognizes a different market reality: many enterprise workflows don't need frontier models, just efficient ones purpose-built for specific domains in finance and regulated verticals where inference cost, compliance, and interpretability matter more than raw capability. Neither company is chasing scale-at-all-costs, suggesting the industry is finally pricing the difference between capability and utility.

On the research front, the field has moved from treating model outputs as atomic units toward decomposing the processes that generate them. Self-consistency ranking, function vectors, and trajectory extrapolation error now measure latent capabilities without explicit training. MedSP1000 replaces single-turn medical QA with interactive standardized-patient scenarios. StreamMA and ReasoningFlow capture multi-agent reasoning and non-linear reasoning traces as directed acyclic graphs to reveal fine-grained behaviors like backtracking and self-correction.

In infrastructure, the agent layer is consolidating around practical web access and memory. Agent-Reach and last30days-skill solve the same constraint: reliable internet sight lines without burning through API quotas. MemPalace and CopilotKit at 33k stars show developers treating memory and agentic capability as foundational layers. The harder problems remain fragmented. LocalAI's 46k stars reflect demand for cloud-independent inference, but security-focused repos like AIJack and synthetic data work on DeepEcho suggest risk management, privacy simulation, and training data generation remain underfunded relative to the agent layer. The real gap isn't in inference or chat. It's in operational machinery: testing, security scanning, and synthetic data generation remain splintered across isolated tools rather than integrated into development workflows.

Grant Calloway

Research Papers — Focused
Characterizing Narrative Content in Web-scale LLM Pretraining Data cs.CL

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias cs.CL

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33-41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test-retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency-bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
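
To make the exact-match versus kappa gap concrete, here is a minimal sketch of both metrics on toy labels. The function names and the data are illustrative assumptions, not the paper's code or protocol; the point is only that a judge who always picks the majority label can score high raw agreement while showing no agreement beyond chance.

```python
from collections import Counter

def exact_match(judge, human):
    """Fraction of items on which the two label lists agree."""
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(judge)
    p_observed = exact_match(judge, human)
    judge_freq, human_freq = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    # Chance agreement: probability both raters pick the same label
    # if each labels independently at their own marginal rates.
    p_chance = sum(judge_freq[c] * human_freq[c] for c in labels) / (n * n)
    if p_chance == 1.0:
        return 1.0  # degenerate case: both raters use a single identical label
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical toy data: humans prefer "A" in 8 of 10 pairs,
# while the judge answers "A" every time.
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

print("exact match:", exact_match(judge, human))
print("kappa:", cohens_kappa(judge, human))
```

In this toy case raw agreement is 80% while kappa is zero, because the judge's hits are exactly what its constant answer would produce by chance.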

LaViSA: A Language and Vision Structural Ambiguity Benchmark cs.CL

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization cs.CL

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.
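
For readers unfamiliar with the ROUGE2-F1 score the report cites, the sketch below computes a bigram-overlap F1 between a candidate and a reference summary. It assumes plain whitespace tokenization and a single reference; the function names are illustrative, and this is not the VLSP scoring script, which may tokenize Vietnamese text differently and produce different numbers.

```python
from collections import Counter

def bigrams(text):
    """Count adjacent token pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """Harmonic mean of bigram precision and recall (clipped overlap counts)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count per bigram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical English example: 3 of 5 bigrams match on each side.
score = rouge2_f1("the cat sat on a mat", "the cat sat on the mat")
print(round(score, 4))
```

Here three of the five bigrams on each side overlap, so precision, recall, and F1 all come out at 0.6.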

Where Does Social Reasoning Come From? Capability Provenance in Language Models cs.CL

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text cs.CL

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M tokens
1  Claude Opus 4.8         61.4   71     $10.94
2  GPT-5.5                 60.2   64     $11.25
3  Claude Opus 4.7         57.3   66     $10.94
4  Gemini 3.1 Pro Preview  57.2   137    $4.50
5  GPT-5.4                 56.8   95     $5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                       Score
1  gpt-5.5-2026-04-23-xhigh    62.7%
2  Codex                       60.4%
3  Claude Code                 59.6%
4  gpt-5.5-2026-04-23-medium   58.9%
5  Claude Opus 4.8-xhigh       56.4%