The Inference Report

March 14, 2026

The industry is fracturing along two incompatible timelines. Google's $32 billion Wiz acquisition, Microsoft's reshuffling of four executives into Rajesh Jha's vacated seat, and NVIDIA's inference chip preparations represent capital allocation decisions by companies that already own the distribution layer. These moves are not experimental bets. On the other side, startups are generating momentum through velocity: NanoClaw secured a Docker partnership in six weeks, Databricks shipped Genie Code, Nyne raised $5.3 million on AI agent infrastructure. The gap between these trajectories is where the real tension lives.

What separates them is not technical capability but operational stability and the ability to absorb chaos. Microsoft reshuffles leadership to accelerate a working operation. VS Code accelerates from monthly to weekly releases. Databricks adds agents to existing notebooks. The builders moving fastest are not solving the hardest problems. They are adding layers to what already works. Startups that depend on solving novel problems or maintaining singular focus are running into the real constraint of the moment: operational complexity at scale, not compute or capital. Energy production has replaced compute as the bottleneck for AI deployment. Lawyers are flagging mass casualty liability cases tied to AI chatbots. Enterprises are discovering that the pitch of replacing developers with LLMs collides with reality.

The research and infrastructure work reflects this consolidation. Papers cluster around interpretability through latent-space analysis, efficiency gains via selective computation, and agentic reasoning frameworks that decompose complex tasks into iterative refinement loops. GitHub's trending repos tell the same story: the real traction goes to tools that solve the integration problem, the testing problem, or the operational problem that comes after you have a model. Agent frameworks like Langflow and Alibaba's page-agent attack the coordination problem. Promptfoo addresses testing before deployment. Microsoft's BitNet and Google's LiteRT represent a shift in where the actual engineering work happens: not in training larger models, but in running existing ones efficiently at scale and on constrained hardware. What separates tools gaining real adoption from the noise is specificity. They solve a concrete bottleneck, not a vague aspiration.

Grant Calloway

Research Papers
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos [cs.LG]

Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at https://github.com/ExplainableML/LCS.
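
The training-free control the abstract describes can be illustrated with a closed-form edit: project a latent onto a small color subspace and shift that projection, leaving everything orthogonal to it untouched. The sketch below uses a random orthonormal basis as a stand-in for the learned Hue/Saturation/Lightness directions and a generic latent vector, so none of it is FLUX-specific; it only demonstrates the linear-algebra operation this kind of method relies on.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                     # latent channel dimension (illustrative)
z = rng.normal(size=d)     # one latent vector (one spatial position)

# Orthonormal 3-D "color" basis, a hypothetical stand-in for the
# learned H/S/L directions recovered from the VAE latent space.
B, _ = np.linalg.qr(rng.normal(size=(d, 3)))   # columns are unit directions

def shift_color(z, basis, delta):
    """Closed-form edit: move the latent's projection onto the color
    subspace by `delta` (one offset per color axis), leaving the
    orthogonal complement of the subspace untouched."""
    return z + basis @ np.asarray(delta, dtype=float)

z_edit = shift_color(z, B, [0.0, 0.0, 1.5])    # raise "lightness" only

# Coordinates in the color subspace before and after the edit.
coords_before = B.T @ z
coords_after = B.T @ z_edit
```

Because the basis is orthonormal, the edit changes exactly the targeted color coordinate and nothing else, which is what makes the manipulation training-free.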

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training [cs.CV]

Humans perceive and understand real-world spaces through a stream of visual observations. The ability to maintain and update spatial evidence from potentially unbounded video streams in a streaming fashion is therefore essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT, a framework for streaming visual spatial intelligence built on test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond the architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
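
Stripped of the architecture, the test-time-training core is a loop that takes one gradient step on a self-supervised prediction loss per incoming chunk of the stream. The toy sketch below adapts a linear fast-weight matrix on a synthetic stream generated by a fixed orthogonal map; the linear predictor, the stream, and the loss are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)

d, chunk, n_chunks, lr = 8, 16, 10, 0.2
W = np.zeros((d, d))                        # fast weights, adapted online

# Synthetic stream whose frames evolve under a fixed orthogonal map
# (the structure the fast weights should absorb at test time).
A = np.linalg.qr(rng.normal(size=(d, d)))[0]
frames = [rng.normal(size=d)]
for _ in range(chunk * n_chunks):
    frames.append(A @ frames[-1] + 0.01 * rng.normal(size=d))
frames = np.stack(frames)

def chunk_update(W, X, Y, lr):
    """One large-chunk TTT step: gradient descent on the mean squared
    next-frame prediction error with respect to the fast weights."""
    err = X @ W.T - Y                       # (chunk, d) residuals
    return W - lr * err.T @ X / len(X)

losses = []
for c in range(n_chunks):
    X = frames[c * chunk:(c + 1) * chunk]           # current frames
    Y = frames[c * chunk + 1:(c + 1) * chunk + 1]   # next frames
    losses.append(float(np.mean((X @ W.T - Y) ** 2)))
    W = chunk_update(W, X, Y, lr)
```

The point of the sketch is the update pattern, not the model: slow weights stay frozen while a small fast-weight block is fitted to the stream as it arrives, so prediction error on later chunks drops without any offline retraining.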

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models [cs.CV]

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) the MLLM text encoder exhibits insufficient reasoning depth, because single-step encoding fails to activate the Chain-of-Thought process that MLLMs need in order to provide accurate guidance for complex tasks; and (ii) the guidance remains invariant during decoding, which prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module ensures the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers carefully reasoned guidance, enabling the DiT to execute it progressively and solve complex tasks step by step. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
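
The two components can be caricatured in a few lines: a latent "thought state" refined over several iterations instead of being produced in one encoding pass, with terminal grounding measured as distance to an answer embedding. Everything below (the update rule, the embeddings, the step size) is invented for illustration and far simpler than the actual modules.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 12
answer = rng.normal(size=d)     # stand-in for a ground-truth answer embedding
h = rng.normal(size=d)          # initial thought state (one-pass encoding)

def refine(h, target, step=0.3):
    """One iterative-thought-guidance step: nudge the latent thought
    toward a grounded state (a toy contraction, not the real module)."""
    return h + step * (target - h)

states = [h]
for _ in range(10):
    states.append(refine(states[-1], answer))

# Terminal thought grounding, read off as distance to the answer:
# iteration shrinks the gap that a single encoding pass leaves open.
gaps = [float(np.linalg.norm(s - answer)) for s in states]
```

The intermediate `states` are what a framework like this would hand to the denoiser step by step, rather than conditioning every denoising step on the same one-shot encoding.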

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning [cs.CL]

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly on tasks requiring complex document-level reasoning.
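
A minimal sketch of the two-stage shape of such a pipeline, with template functions standing in for what would be LLM calls in practice; the record fields (`topic`, `claim`, `source_id`) are invented for illustration, not the paper's schema.

```python
def synthesize_qa(segment):
    """Stage 1 (Claim-Centric QA Synthesis): produce a faithful,
    isolated QA pair from one focused segment. An LLM call in
    practice; a template here."""
    return {
        "question": f"What does the segment about '{segment['topic']}' claim?",
        "answer": segment["claim"],
        "source_id": segment["id"],
    }

def reground(qa, document):
    """Stage 2 (Document-Scale Regrounding): programmatically re-embed
    the isolated pair into a full-document task, so answering it now
    requires locating the evidence in realistic context."""
    return {
        "context": " ".join(seg["text"] for seg in document),
        "question": qa["question"],
        "answer": qa["answer"],
        "evidence_span": qa["source_id"],
    }

document = [
    {"id": 0, "topic": "method", "claim": "TTT helps", "text": "Sec 1 ..."},
    {"id": 1, "topic": "results", "claim": "92.1% accuracy", "text": "Sec 2 ..."},
]
tasks = [reground(synthesize_qa(seg), document) for seg in document]
```

The split is what buys the trade-off described above: stage 1 keeps each pair faithful to one segment, stage 2 restores document-scale difficulty without another generation pass.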

Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models [cs.LG]

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.
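
The feature-matching-plus-policy-gradient core can be sketched with a unigram "language model": rollouts are rewarded by how closely their mean token-embedding matches a reference completion's feature, and a REINFORCE-style update shifts the policy toward feature-matching outputs. The vocabulary, embeddings, and unigram policy below are illustrative assumptions; strided block-parallel sampling and the energy-based/KL-regularized view are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

V, L, d = 6, 16, 4
E = rng.normal(size=(V, d))          # token embeddings (the feature map)
target = E[0].copy()                 # reference completion's mean feature
theta = np.zeros(V)                  # unigram policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(theta, n_rollouts=128, lr=3.0):
    """One on-policy policy-gradient step on the feature-matching reward."""
    p = softmax(theta)
    rollouts = [rng.choice(V, size=L, p=p) for _ in range(n_rollouts)]
    feats = np.stack([E[t].mean(axis=0) for t in rollouts])
    rewards = -np.sum((feats - target) ** 2, axis=1)   # sequence-level score
    adv = rewards - rewards.mean()                     # baseline-subtracted
    scores = np.stack([np.bincount(t, minlength=V) / L - p for t in rollouts])
    return theta + lr * (adv[:, None] * scores).mean(axis=0), float(rewards.mean())

hist = []
for _ in range(40):
    theta, r = step(theta)
    hist.append(r)
```

Note what the reward never inspects: individual tokens. Only the rollout's aggregate feature is compared to the reference, which is the dense, verifier-free supervision signal the abstract describes.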

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training [cs.AI]

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains, where output correctness or quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. We therefore conduct a rigorous study of the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, in which a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between the two: non-reasoning judges readily lead to reward hacking, while reasoning judges can produce policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve this strong performance by learning to generate highly effective adversarial outputs that also score well on popular benchmarks such as Arena-Hard by deceiving other LLM judges. Combined with our further analysis, our study highlights both important findings and room for improvement in applying (reasoning) LLM judges to non-verifiable LLM post-training.
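
The study's setup, distilling a gold judge's pairwise preferences into a smaller judge, reduces in its simplest form to Bradley-Terry logistic training on preference pairs. The sketch below uses linear feature-based judges as stand-ins for LLMs; nothing about it is specific to the paper's models or data.

```python
import numpy as np

rng = np.random.default_rng(4)

d, n_pairs, lr, steps = 10, 200, 0.5, 200
w_gold = rng.normal(size=d)          # gold judge's scoring direction
w = np.zeros(d)                      # smaller judge to be trained

A = rng.normal(size=(n_pairs, d))    # features of response A in each pair
B = rng.normal(size=(n_pairs, d))    # features of response B
labels = (A @ w_gold > B @ w_gold).astype(float)   # gold preference annotations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(steps):
    margin = (A - B) @ w             # trained judge's score difference
    p = sigmoid(margin)              # Bradley-Terry P(A preferred)
    grad = ((p - labels)[:, None] * (A - B)).mean(axis=0)
    w -= lr * grad                   # logistic-loss gradient descent

# Fraction of gold preferences the distilled judge reproduces.
agreement = float(np.mean((A @ w > B @ w) == labels))
```

High agreement on pairs like these is exactly what static benchmarks measure; the paper's point is that it does not guarantee the judge holds up once a policy is optimized against it.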

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#   Model                     Score   tok/s   $/1M tok
1   Gemini 3.1 Pro Preview    57.2    117     $4.50
2   GPT-5.4                   57      84      $5.63
3   GPT-5.3 Codex             54      64      $4.81
4   Claude Opus 4.6           53      55      $10.00
5   Claude Sonnet 4.6         51.7    60      $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#   Model                        Score
1   Claude Code                  52.9%
2   Junie                        52.1%
3   Claude Opus 4.6              51.7%
4   gpt-5.2-2025-12-11-xhigh     51.7%
5   gpt-5.2-2025-12-11-medium    51.0%