The Inference Report

June 15, 2026

The wealth concentration accelerating inside AI companies is colliding with the infrastructure fragmentation surrounding them. OpenAI is formalizing a bet on distribution over direct sales through its $150M Partner Network, a structural move that locks customers into dependency on its API roadmap while outsourcing implementation complexity to an ecosystem of resellers and consultants. Yet simultaneously, export controls have frozen Anthropic's Fable and Mythos models, exposing how little consensus exists within the US government on enforcing dominance over advanced systems, and European organizations are systematically migrating toward open source alternatives and away from US suppliers, driven by geopolitical anxiety rather than ideology. The pattern is one of consolidation and fragmentation occurring in parallel: AI companies concentrate wealth and power while the technical and political infrastructure supporting that concentration fractures under regulatory chaos and geopolitical pressure.

Labor markets are bifurcating along similar lines. Tens of thousands of workers face layoffs as AI replaces entire functions, yet employers simultaneously demand proof that candidates can work alongside these tools. Legora, a legal AI startup, is doubling headcount on the strength of a $5.6 billion valuation and viral adoption, exemplifying how narrow the band of winners has become. The skills workers are told they need are precisely those that machines cannot replicate, a claim that assumes clear boundaries between human and machine capability that the research literature increasingly questions.

Research activity reflects this fragmentation. Work on mechanistic interpretability, hallucination diagnosis, and efficiency gains through pruning occupies significant attention, but the field has fractured into domain-specific applications: anti-spoofing, audio explainability, causal inference. Performance benchmarks reveal the fragmentation too. Coding ability leaderboards show minimal movement at the top tier, with gpt-5.5-2026-04-23-xhigh holding 62.7% on SWE-rebench, but the middle ranks reveal substantial volatility. Gemini 3.1 Pro Preview dropped from 57.2% to 51.1%, falling from fifth to eleventh place, while divergence between SWE-rebench and Artificial Analysis rankings on the same models underscores how heavily capability assessments depend on test selection rather than absolute model strength. GitHub development patterns reinforce this picture: testing infrastructure consolidates around proven frameworks like pytest and Cypress, while AI tooling proliferates into competing abstraction layers like AgenticX, aisuite, and rig that attempt to solve provider fragmentation. Open source alternatives to commercial SaaS are emerging in specific domains, suggesting a counter-movement toward self-hosted infrastructure even as centralized AI platforms accumulate resources.

Grant Calloway

Research Papers
Gaze Heads: How VLMs Look at What They Describe cs.CV

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
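
The abstract says gaze heads are found "with a simple correlation score" but does not give the formula. The sketch below is one plausible reading, not the authors' implementation: it assumes per-head attention mass over image regions has already been extracted (the `attn[head][step][region]` layout and the head names are hypothetical) and scores each head by the Pearson correlation between its attention and a one-hot indicator of the region being described.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def gaze_scores(attn, described):
    """Score each head by how well its attention follows the described region.

    attn[h][t][r]: attention mass head h puts on region r at generation step t
                   (hypothetical layout, assumed precomputed from forward passes).
    described[t]:  index of the region being described at step t.
    """
    scores = {}
    for head, per_step in attn.items():
        mass, target = [], []
        for t, row in enumerate(per_step):
            for r, a in enumerate(row):
                mass.append(a)
                target.append(1.0 if r == described[t] else 0.0)
        scores[head] = pearson(mass, target)
    return scores


def top_k_heads(scores, k):
    """Return the k head ids with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: 3 steps, 3 panels; the model describes panels 0, 1, 2 in order.
described = [0, 1, 2]
attn = {
    "L10.H3": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # tracks
    "L02.H7": [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3], [0.4, 0.3, 0.3]],  # static
}
scores = gaze_scores(attn, described)
best = top_k_heads(scores, k=1)
```

In this toy case the head whose attention moves with the described panel scores close to 1, while the head with a fixed attention pattern scores close to 0, so only the first would be selected for intervention.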

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning cs.CV

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.
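
A stage-replacement intervention of the kind the abstract mentions can be sketched as follows. This is an illustrative loop under stated assumptions, not the benchmark's code: the field names (`trace`, `gold_trace`, `gold_answer`), the stage keys, and the `toy_answer` stand-in for a model are all hypothetical.

```python
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


def repair_rate(instances, answer_fn, stage):
    """Share of initially wrong instances fixed by correcting one stage."""
    wrong = [x for x in instances if answer_fn(x["trace"]) != x["gold_answer"]]
    if not wrong:
        return 0.0
    fixed = 0
    for x in wrong:
        patched = dict(x["trace"])
        patched[stage] = x["gold_trace"][stage]  # swap in the reference stage
        fixed += answer_fn(patched) == x["gold_answer"]
    return fixed / len(wrong)


# Toy stand-in for a model: it answers correctly only if every stage is "ok".
def toy_answer(trace):
    return "A" if all(trace[s] == "ok" for s in STAGES) else "B"


gold = {s: "ok" for s in STAGES}
instances = [
    {"trace": {**gold, "visual_recognition": "bad"}, "gold_trace": gold, "gold_answer": "A"},
    {"trace": {**gold, "knowledge_recall": "bad"}, "gold_trace": gold, "gold_answer": "A"},
    {"trace": gold, "gold_trace": gold, "gold_answer": "A"},
]
rates = {s: repair_rate(instances, toy_answer, s) for s in STAGES}
```

Comparing the per-stage repair rates indicates which stage most often carries the error that flips the final answer; in the toy data, correcting either of the first two stages repairs half of the wrong answers and correcting the third repairs none.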

Persona-Pruner: Sculpting Lightweight Models for Role-Playing cs.LG

Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model's total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at https://github.com/jsu-kim/Persona-Pruner.
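
The headline figure, "reducing the performance drop from the dense model by up to 93.8% over the strongest baseline", is a relative measure. One natural reading, offered here as an assumption rather than the paper's stated definition, is the fraction of the baseline's drop that the new method recovers. The scores below are hypothetical and chosen only to show the arithmetic.

```python
def drop_reduction(dense, baseline, ours):
    """Fraction of the baseline's performance drop that is recovered.

    dense:    score of the unpruned model
    baseline: score of the strongest baseline pruning method
    ours:     score of the method being evaluated
    """
    baseline_drop = dense - baseline
    if baseline_drop <= 0:
        raise ValueError("baseline must score below the dense model")
    return 1.0 - (dense - ours) / baseline_drop


# Hypothetical judge scores, chosen only to illustrate the arithmetic.
reduction = drop_reduction(dense=8.0, baseline=6.0, ours=7.5)
print(f"{reduction:.1%}")  # 75.0%
```

Under this reading, a 93.8% reduction would mean the pruned model loses only about one sixteenth of what the strongest baseline loses relative to the dense model.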

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization cs.CL

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
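
The contrast between a single sequence-level advantage and phase-wise assignment can be illustrated with a small sketch. It is not HRPO itself, whose reward terms and normalization the abstract does not specify. It assumes GRPO-style group normalization applied separately to two hypothetical per-phase rewards (`r_stream`, `r_deep`), with hypothetical token counts per phase.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: center and scale rewards within a rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_wise_advantages(rollouts):
    """Assign per-token advantages separately for each phase.

    Each rollout is a dict with hypothetical fields:
      n_stream / n_deep : token counts in the streaming and deep phases
      r_stream / r_deep : scalar rewards credited to each phase
    """
    a_stream = group_normalize([r["r_stream"] for r in rollouts])
    a_deep = group_normalize([r["r_deep"] for r in rollouts])
    return [
        [a_s] * r["n_stream"] + [a_d] * r["n_deep"]
        for r, a_s, a_d in zip(rollouts, a_stream, a_deep)
    ]


def sequence_level_advantages(rollouts):
    """Baseline: one advantage per rollout, copied onto every token."""
    adv = group_normalize([r["r_stream"] + r["r_deep"] for r in rollouts])
    return [[a] * (r["n_stream"] + r["n_deep"]) for r, a in zip(rollouts, adv)]


# Two toy rollouts with equal total reward but opposite per-phase quality.
rollouts = [
    {"n_stream": 2, "n_deep": 3, "r_stream": 1.0, "r_deep": 0.0},
    {"n_stream": 2, "n_deep": 3, "r_stream": 0.0, "r_deep": 1.0},
]
per_phase = phase_wise_advantages(rollouts)
per_seq = sequence_level_advantages(rollouts)
```

Because both toy rollouts have the same total reward, the sequence-level advantage is zero for every token and carries no learning signal, whereas the phase-wise version credits the first rollout's streaming tokens and penalizes its deep-reasoning tokens.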

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning cs.MA

Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to enable complementary trade-offs among agents. Theoretically, we formulate cooperative MOMARL as a team-optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment cs.CL

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present during inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Claude Fable 5          64.9   79     $20.00
2  Claude Opus 4.8         61.4   66     $10.00
3  GPT-5.5                 60.2   78     $11.25
4  Claude Opus 4.7         57.3   58     $10.00
5  Gemini 3.1 Pro Preview  57.2   142    $4.50

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  gpt-5.5-2026-04-23-xhigh   62.7%
2  Junie                      61.6%
3  Codex                      60.4%
4  Claude Code                59.6%
5  gpt-5.5-2026-04-23-medium  58.9%