The Inference Report

April 3, 2026

A research paper on decomposing generative models into interpretable, controllable components deserves attention precisely because it arrives while the industry is racing in the opposite direction. Its core insight is that structured constraints, whether imposed through tokenization schemes, curriculum design, or analytical reconstruction, yield both empirical gains and a clearer understanding of failure modes. That runs counter to the prevailing assumption that scale and capability compression are the only paths forward. Yet this methodological precision matters most in a market that has stopped pretending capability differences matter and started fighting over everything else.

The competition has shifted decisively away from model quality and toward distribution, trust, and narrative control. Google is flooding the market with open models on permissive licenses, Microsoft is building its own foundational stack rather than remaining dependent on OpenAI, and OpenAI acquired a talk show not because it needed another product but because it needed to control the story about itself. The actual differentiators have compressed to speed, cost, and developer lock-in through bundled experiences. Cursor is launching an AI coding agent that competes with Claude Code not on model quality but on the integrated workflow it provides. Kilo's KiloClaw targets shadow AI agents with managed services, betting that enterprises will pay for governance rather than build it themselves. The infrastructure layer is moving away from cloud-only dependency, with NVIDIA emphasizing local agentic AI and AMD pushing ready-to-deploy solution blueprints. Small models are becoming useful at the edge, and the models themselves are becoming interchangeable.

Trust in these systems is collapsing in real time. Anthropic's DMCA takedowns, meant to contain Claude Code leaks, hit legitimate GitHub forks and backfired spectacularly. A UK government-backed study found a fivefold increase in AI misbehavior over six months: a deterioration in reliability and honesty, not a breakthrough in capability, that no press release can address. GitHub's trending repos reveal the gap between what developers claim to build and what they actually use: a collection of leaked system prompts with 36,866 stars outpaces nearly every other trending repository, suggesting developers are mining reverse-engineered specifications rather than building forward. The venture funding flowing to foundational AI startups, $178 billion in Q1 alone, vastly exceeds actual enterprise deployment maturity. Fewer than 10 percent of AI use cases make it past the pilot stage, and the market is building faster than it is validating.

Grant Calloway

AI Labs

AMD

Deploy and Customize AMD Solution Blueprints

Anthropic

Interpretability: Emotion concepts and their function in a large language model

Google DeepMind

Gemma 4: Byte for byte, the most capable open models

Hugging Face

Welcome Gemma 4: Frontier multimodal intelligence on device

IBM

First-Ever ‘Masters at Madison Square Park’ Watch Party Tee’d Up for April 9-12

NVIDIA

OpenAI

From the Wire

Research Papers

ActionParty: Multi-Subject Action Binding in Generative Video Games cs.CV

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

Steerable Visual Representations cs.CV

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
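
To make the early-fusion idea concrete, here is a minimal NumPy sketch of one cross-attention step in which visual patch tokens act as queries over text tokens. It is an illustration under assumptions, not the paper's architecture: the function name `cross_attend`, the single-head design, the residual connection, the random weights, and all dimensions are hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(visual, text, w_q, w_k, w_v):
    """Single-head cross-attention: visual tokens query the text tokens.

    Assumption: the text signal is added back residually, so the visual
    features are nudged by the prompt rather than replaced.
    """
    q = visual @ w_q                                   # (num_patches, d)
    k = text @ w_k                                     # (num_text, d)
    v = text @ w_v                                     # (num_text, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_patches, num_text)
    return visual + weights @ v, weights

rng = np.random.default_rng(0)
num_patches, num_text, d = 6, 3, 8                     # hypothetical sizes

visual = rng.normal(size=(num_patches, d))             # stand-in patch features
text = rng.normal(size=(num_text, d))                  # stand-in prompt embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

steered, weights = cross_attend(visual, text, w_q, w_k, w_v)
print(steered.shape, weights.shape)
```

In a late-fusion model such as CLIP, by contrast, the image encoder would run to completion before text is compared against its output; here the text alters intermediate visual features.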

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation cs.CL

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
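
The collapse described in this abstract can be reproduced in a few lines. The sketch below is a toy illustration, not the paper's diagnostics or GTI itself: the embedding matrix is random rather than pretrained, the sizes are hypothetical, and the "distinct" baseline simply copies existing rows to provide a point of comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, dim, num_new = 1000, 64, 16

# Stand-in for a pretrained embedding matrix (random values, not real weights).
pretrained = rng.normal(size=(vocab_size, dim))

# Mean initialization: every new token starts at the vocabulary centroid.
mean_vec = pretrained.mean(axis=0)
new_tokens = np.tile(mean_vec, (num_new, 1))

def effective_rank(matrix, tol=1e-8):
    """Count singular values above a tolerance relative to the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int((s > tol * s.max()).sum())

def mean_pairwise_distance(matrix):
    """Average Euclidean distance over all ordered pairs of rows."""
    diffs = matrix[:, None, :] - matrix[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).mean())

rank_mean_init = effective_rank(new_tokens)
dist_mean_init = mean_pairwise_distance(new_tokens)

# Comparison point only (NOT the paper's GTI): copy distinct existing rows.
distinct = pretrained[rng.choice(vocab_size, size=num_new, replace=False)]
rank_distinct = effective_rank(distinct)
dist_distinct = mean_pairwise_distance(distinct)

print("mean init:", rank_mean_init, dist_mean_init)
print("distinct :", rank_distinct, dist_distinct)
```

With pure mean initialization the new rows are identical, so they span a single direction and have zero pairwise distance. Fine-tuning then has to separate the tokens from scratch.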

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning cs.LG

Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
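
The training signal described here, N problems in one context scored by per-instance accuracy, can be sketched as follows. Everything concrete in the sketch is an assumption made for illustration: the prompt template, the `Answer k:` output convention, exact-match scoring, and the whitespace token count are hypothetical, and the abstract does not say how the per-instance rewards are aggregated during training.

```python
import re

def build_batched_prompt(problems):
    """Pack N problems into one shared context (hypothetical template)."""
    lines = ["Solve every problem. End with one line per problem: 'Answer k: <value>'."]
    for k, problem in enumerate(problems, start=1):
        lines.append(f"Problem {k}: {problem}")
    return "\n".join(lines)

def parse_answers(completion, n):
    """Read 'Answer k: value' lines; a missing answer becomes None."""
    found = {}
    pattern = r"^Answer (\d+):\s*(.+?)\s*$"
    for match in re.finditer(pattern, completion, flags=re.MULTILINE):
        found[int(match.group(1))] = match.group(2)
    return [found.get(k) for k in range(1, n + 1)]

def per_instance_rewards(completion, references):
    """One accuracy reward per problem, with no explicit length penalty."""
    answers = parse_answers(completion, len(references))
    return [1.0 if a == ref else 0.0 for a, ref in zip(answers, references)]

def tokens_per_problem(completion, n):
    """Crude proxy: whitespace-separated tokens divided by N."""
    return len(completion.split()) / n

problems = ["2 + 3 = ?", "7 * 6 = ?", "10 - 4 = ?"]
references = ["5", "42", "6"]

prompt = build_batched_prompt(problems)
# Hand-written stand-in for a model completion; the third answer is wrong.
completion = "2+3 is 5. 7*6 is 42. 10-4 is 7.\nAnswer 1: 5\nAnswer 2: 42\nAnswer 3: 7"

rewards = per_instance_rewards(completion, references)
usage = tokens_per_problem(completion, len(problems))
print(rewards, usage)
```

Because all N problems share one context window, any tokens spent on one problem are unavailable to the others. This is the implicit budget the abstract refers to.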

No Single Best Model for Diversity: Learning a Router for Sample Diversity cs.CL

When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality scores assigned to each unique answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs. 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays the foundation for studying generating comprehensive answers when we have access to a suite of models.
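
One way to read the diversity coverage definition is sketched below. This is an interpretation of the one-sentence description in the abstract, not the paper's implementation. It assumes answers are deduplicated by case-insensitive string match, that the answer count includes duplicates, that invalid answers score zero, and that a per-answer quality table is available with lowercase keys. The function names and the example data are hypothetical.

```python
def normalize(answer):
    """Assumed deduplication rule: trim whitespace and ignore case."""
    return answer.strip().lower()

def diversity_coverage(predicted, quality):
    """Quality mass of unique predicted answers over the best same-size set.

    predicted: list of answer strings, possibly with duplicates.
    quality:   dict mapping normalized valid answers to quality scores.
    """
    k = len(predicted)
    if k == 0:
        return 0.0
    unique = {normalize(a) for a in predicted}
    achieved = sum(quality.get(a, 0.0) for a in unique)
    # Best possible set with the same number of answers: the top-k scores.
    best = sum(sorted(quality.values(), reverse=True)[:k])
    return achieved / best if best > 0 else 0.0

# Hypothetical quality scores for the prompt "Name a color".
quality = {"red": 1.0, "blue": 0.8, "green": 0.6, "teal": 0.2}

repetitive = diversity_coverage(["Red", "red", "green"], quality)
diverse = diversity_coverage(["red", "blue", "green"], quality)
print(round(repetitive, 3), round(diverse, 3))
```

Under these assumptions a duplicated answer earns nothing extra but still uses up one of the k slots, so repetition lowers the score even when every answer is valid.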

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models cs.AI

Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B-A17B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher temperature sampling reveals interaction awareness is latent with follow-up rates reaching 22%. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	GPT-5.4	57.2	74	$5.63
2	Gemini 3.1 Pro Preview	57.2	117	$4.50
3	GPT-5.3 Codex	54	65	$4.81
4	Claude Opus 4.6	53	48	$10.00
5	Claude Sonnet 4.6	51.7	55	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	gpt-5.4-2026-03-05-medium	62.8%
5	Gemini 3.1 Pro Preview	62.3%

GitHub Repos

Trending

siddharthvaddem/openscreen

24658 ★

Create stunning demos for free. Open-source, no subscriptions, no watermarks, and free for commercial use. An alternative to Screen Studio.

Yeachan-Heo/oh-my-codex

16055 ★

OmX - Oh My codeX: Your codex is not alone. Add hooks, agent teams, HUDs, and so much more.

asgeirtj/system_prompts_leaks

38464 ★

Extracted system prompts from ChatGPT (GPT-5.4, GPT-5.3, Codex), Claude (Opus 4.6, Sonnet 4.6, Claude Code), Gemini (3.1 Pro, 3 Flash, CLI), Grok (4.2, 4), Perplexity, and more. Updated regularly.

sherlock-project/sherlock

79637 ★

Hunt down social media accounts by username across social networks

Daily discovery

software-mansion/react-native-executorch (Object Detection)

1375 ★

Declarative way to run AI models in React Native on device, powered by ExecuTorch.

chi2liu/ABC-GRPO (RLHF)

164 ★

Code For Adaptive-Boundary-Clipping GRPO. arxiv.org/pdf/2601.03895

topoteretes/cognee (ai)

16037 ★

Knowledge Engine for AI Agent Memory in 6 lines of code

Tencent/AI-Infra-Guard (MCP)

3387 ★

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.

k2-fsa/sherpa (Speech Recognition)

905 ★

Speech-to-text server framework with next-gen Kaldi