The Inference Report

April 15, 2026

The AI industry is experiencing a collision between its capital requirements and its ability to generate returns, forcing a visible separation between firms that can sustain high burn rates and those that cannot. Anthropic's $380 billion valuation now appears cheaper than OpenAI's implicit $1.2 trillion price tag, signaling that investors are beginning to distinguish between narrative and actual business fundamentals. This reckoning extends across the entire stack. Microsoft is raising Surface prices 33 percent due to RAM shortages while struggling to meet its carbon-negative pledge as data center electricity demand is projected to double by 2030. The hyperscalers priced AI services as premium goods when GPU access was scarce and alternatives nonexistent. That advantage has evaporated. Neocloud providers are undercutting them significantly, and the market is becoming ruthless about cost. When Globalstar announced a merger with Amazon for $11.6 billion to integrate satellite connectivity, it was not about innovation. It was about who controls the last mile to devices when electricity becomes the binding constraint.

Regulatory capture is replacing regulatory avoidance as the dominant industry strategy. OpenAI and Anthropic are now openly clashing over Illinois liability law, with OpenAI backing protections that would shield labs from mass casualty events while Anthropic opposes them. Simultaneously, Anthropic is briefing the Trump administration on its Mythos cybersecurity model while suing the government. This apparent contradiction reflects the actual shape of modern regulatory strategy: compete on liability frameworks while maintaining government relationships. Maine passed the first state data center construction ban, a precedent the industry fears will spread. Silicon Valley is spending millions to stop Alex Bores, a former Palantir employee, from reaching Congress after he helped pass tough AI laws. The industry's political spending is no longer primarily about shaping distant federal policy. It is about local control and preventing precedents that spread.

The practical work of deploying AI at scale is creating operational bottlenecks that model improvements alone cannot solve. GitHub is introducing Stacked PRs to handle the volume of code AI tools generate, breaking large pull requests into smaller units because traditional code review cannot keep pace. Enterprise developers are building autonomous AI agents faster than security infrastructure can contain them, forcing vendors like Curity to rebuild identity and access management from scratch. Microsoft is testing features inspired by Openclaw to make Copilot more autonomous, but experts warn this introduces major security risks. Ukraine is replacing soldiers with robots to offset drone casualties. Max Hodak's Science Corp is preparing to place a sensor in a human brain. These are no longer experiments. They are deployments. The question is no longer whether AI works. It is whether the systems built to govern it can scale as fast as the infrastructure that runs it.

The competitive advantage in AI is shifting from general-purpose capability to specialized access and operational infrastructure. OpenAI's Trusted Access for Cyber program now includes GPT-5.4-Cyber, a model explicitly designed for vetted defenders, pairing capability advancement with gating. Google DeepMind's Gemini Robotics ER 1.6 reflects a similar push into embodied reasoning, where real-world task performance can be measured and monetized. Meanwhile, NVIDIA is positioning quantum AI models as open-source infrastructure, treating the foundation as a platform layer. GitHub's work on security benchmarks and AI21's warning about coding agent benchmark inflation both point to the same problem: agent capabilities are being measured against benchmarks that do not predict real-world reliability. Security and robotics are where capability claims get tested against reality fastest. That is where the money follows.

Grant Calloway

Research Papers
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis cs.CV

Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score merely reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
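The flavor of object-level and relationship-level verification can be sketched with a toy symbolic critic. Everything below (the `Obj` record, the `adjacent` gap threshold, the relation tuples) is an invented stand-in, not SceneCritic's actual SceneOnto-grounded constraint set; it only shows the shape of the idea, checking geometry symbolically instead of asking a VLM to judge a rendered view.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    x: float          # floor-plan centre, metres
    y: float
    w: float          # footprint width
    d: float          # footprint depth

def overlaps(a: Obj, b: Obj) -> bool:
    """Axis-aligned footprint collision test."""
    return (abs(a.x - b.x) < (a.w + b.w) / 2 and
            abs(a.y - b.y) < (a.d + b.d) / 2)

def adjacent(a: Obj, b: Obj, gap: float = 0.5) -> bool:
    """Footprints within `gap` metres of touching, without overlapping."""
    dx = abs(a.x - b.x) - (a.w + b.w) / 2
    dy = abs(a.y - b.y) - (a.d + b.d) / 2
    return max(dx, dy) <= gap and not overlaps(a, b)

def critique(layout, relations):
    """Object-level collisions plus relationship-level pass/fail."""
    report = {"collisions": [], "relations": {}}
    for i, a in enumerate(layout):
        for b in layout[i + 1:]:
            if overlaps(a, b):
                report["collisions"].append((a.name, b.name))
    objs = {o.name: o for o in layout}
    for rel, s, t in relations:
        if rel == "adjacent":
            report["relations"][(rel, s, t)] = adjacent(objs[s], objs[t])
    return report

bed = Obj("bed", 1.0, 1.0, 2.0, 1.8)
stand = Obj("nightstand", 2.3, 1.0, 0.5, 0.5)
lamp = Obj("lamp", 1.0, 1.0, 0.3, 0.3)   # placed inside the bed: a violation
report = critique([bed, stand, lamp], [("adjacent", "nightstand", "bed")])
```

Because the checks are symbolic, the verdict is invariant to viewpoint and prompt phrasing, which is exactly the instability the paper attributes to VLM judges.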

Visual Preference Optimization with Rubric Rewards cs.CV

The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from arbitrary policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average from 81.14 to 82.69, whereas outcome-based filtering drops it to 75.82. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the base model's 59.48. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
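A minimal sketch of checklist-style rubric scoring, assuming toy string predicates in place of the paper's instance-specific criteria (the gating rule, the partial-credit formula, and the example criteria are all illustrative inventions, not rDPO's actual scoring function):

```python
def rubric_score(response: str, essential, additional) -> float:
    """Essential criteria gate the score; additional ones add partial credit."""
    if not all(check(response) for check in essential):
        return 0.0
    extra = sum(check(response) for check in additional)
    return 1.0 + extra / max(len(additional), 1)

def build_preference_pair(responses, essential, additional):
    """Pick best/worst on-policy responses by rubric score -> (chosen, rejected)."""
    scored = sorted(responses, key=lambda r: rubric_score(r, essential, additional))
    return scored[-1], scored[0]

essential = [lambda r: "dog" in r]           # must mention the main subject
additional = [lambda r: "red collar" in r]   # fine-grained visual detail

chosen, rejected = build_preference_pair(
    ["A dog with a red collar runs.", "A dog runs.", "A cat sleeps."],
    essential, additional)
```

The point of the design is that the preference signal is criterion-level rather than a single outcome score, so the pair construction can distinguish responses that are both "correct" but differ in fine-grained visual detail.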

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations cs.LG

The explosive growth of system logs makes streaming compression essential, yet existing log anomaly detection (LAD) methods incur severe pre-processing overhead by requiring full decompression and parsing. We introduce CLAD, the first deep learning framework to perform LAD directly on compressed byte streams. CLAD bypasses these bottlenecks by exploiting a key insight: normal logs compress into regular byte patterns, while anomalies systematically disrupt them. To extract these multi-scale deviations from opaque bytes, we propose a purpose-built architecture integrating a dilated convolutional byte encoder, a hybrid Transformer-mLSTM, and four-way aggregation pooling. This is coupled with a two-stage training strategy of masked pre-training and focal-contrastive fine-tuning to effectively handle severe class imbalance. Evaluated across five datasets, CLAD achieves a state-of-the-art average F1-score of 0.9909 and outperforms the best baseline by 2.72 percentage points. It delivers superior accuracy while completely eliminating decompression and parsing overheads, offering a robust solution that generalizes to structured streaming compressors.
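The key insight, that anomalies disrupt the regular byte patterns of compressible logs, can be demonstrated without any deep learning at all. The sketch below is not CLAD's architecture; it just uses per-window zlib compression ratios as a crude anomaly signal under that same assumption (the log lines and window size are made up):

```python
import zlib

def window_ratios(lines, window=4):
    """Per-window compression ratio over raw bytes; higher = less regular."""
    ratios = []
    for i in range(0, len(lines) - window + 1, window):
        raw = "\n".join(lines[i:i + window]).encode()
        ratios.append(len(zlib.compress(raw)) / len(raw))
    return ratios

logs = ["INFO request ok id=%d" % i for i in range(11)]
logs.append("PANIC 0x7f3a kernel oops Zq9#w!~&B%1u^")   # injected anomaly
ratios = window_ratios(logs)   # windows cover lines 0-3, 4-7, 8-11
```

The repetitive INFO windows compress well, while the window containing the high-entropy PANIC line compresses noticeably worse, which is the statistical regularity CLAD's byte encoder learns to exploit at multiple scales.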

Classical and Quantum Speedups for Non-Convex Optimization via Energy Conserving Descent quant-ph

The Energy Conserving Descent (ECD) algorithm was recently proposed (De Luca & Silverstein, 2022) as a global non-convex optimization method. Unlike gradient descent, appropriately configured ECD dynamics escape strict local minima and converge to a global minimum, making it appealing for machine learning optimization. We present the first analytical study of ECD, focusing on the one-dimensional setting for this first installment. We formalize a stochastic ECD dynamics (sECD) with energy-preserving noise, as well as a quantum analog of the ECD Hamiltonian (qECD), providing the foundation for a quantum algorithm through Hamiltonian simulation. For positive double-well objectives, we compute the expected hitting time from a local to the global minimum. We prove that both sECD and qECD yield exponential speedup over their respective gradient descent baselines: stochastic gradient descent and its quantization. For objectives with tall barriers, qECD achieves a further speedup over sECD.
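The qualitative mechanism, energy conservation letting a trajectory cross barriers that trap gradient descent, is easy to see in one dimension. The toy below is not the paper's sECD/qECD: it is plain frictionless Hamiltonian dynamics on an invented double well, integrated with a leapfrog scheme that approximately conserves E = v²/2 + f(q), so a trajectory whose energy exceeds the barrier reaches the global well while gradient descent from the same start stalls.

```python
def f(q):
    """Asymmetric double well: local minimum near q = -0.96, global near q = 1."""
    return (q**2 - 1)**2 - 0.3 * q

def grad(q):
    return 4 * q * (q**2 - 1) - 0.3

def leapfrog(q, v, dt=0.01, steps=2000):
    """Symplectic integration of the frictionless dynamics (energy-conserving)."""
    traj = [q]
    v -= 0.5 * dt * grad(q)      # half-kick to stagger velocities
    for _ in range(steps):
        q += dt * v
        v -= dt * grad(q)
        traj.append(q)
    return traj

def gd(q, lr=0.01, steps=2000):
    """Plain gradient descent for comparison."""
    for _ in range(steps):
        q -= lr * grad(q)
    return q

traj = leapfrog(-1.2, 1.2)   # E ~ 1.27 > barrier height ~ 1.01, so it escapes
q_gd = gd(-1.2)              # stalls in the shallow left well, f(q_gd) > 0
```

ECD proper adds a specific kinetic term and configuration so that the dynamics not only escape local minima but concentrate near the global one; the paper's contribution is analyzing hitting times for such dynamics and their stochastic and quantum variants.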

Representation geometry shapes task performance in vision-language modeling for CT enterography cs.CV

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7 to 14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80-0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
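The mean-vs-attention pooling contrast at the heart of the first finding can be sketched in a few lines. These are generic textbook aggregators over toy 2-dimensional slice embeddings, not the paper's implementation; the query vector and embeddings are invented:

```python
import math

def mean_pool(slices):
    """Uniform average of slice embeddings into one volume embedding."""
    d = len(slices[0])
    return [sum(s[j] for s in slices) / len(slices) for j in range(d)]

def attention_pool(slices, query):
    """Softmax(dot(query, slice)) weighted combination of slice embeddings."""
    scores = [sum(q * x for q, x in zip(query, s)) for s in slices]
    m = max(scores)
    weights = [math.exp(t - m) for t in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    d = len(slices[0])
    return [sum(weights[i] * slices[i][j] for i in range(len(slices)))
            for j in range(d)]

slices = [[1.0, 0.0], [0.0, 1.0]]                    # two toy slice embeddings
pooled_mean = mean_pool(slices)                      # blends all slices equally
pooled_attn = attention_pool(slices, [10.0, 0.0])    # focuses on the matching slice
```

Mean pooling summarizes the whole volume evenly, which plausibly suits whole-study disease grading, while attention pooling can lock onto the slice most similar to a query, which plausibly suits text-to-image retrieval; that is one reading of why the two aggregators win on different tasks.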

Toward Autonomous Long-Horizon Engineering for ML Research cs.CL

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
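A hedged sketch of the File-as-Bus pattern, agents handing off through durable workspace artifacts instead of conversation. The agent names, file names, and JSON schema below are all illustrative inventions, not AiScientist's API; the point is only that each agent re-grounds by reading files, so state survives context loss:

```python
import json
import tempfile
from pathlib import Path

def planner(workspace: Path):
    """Writes a durable plan artifact instead of messaging the next agent."""
    (workspace / "plan.json").write_text(json.dumps(
        {"stage": "experiment", "todo": ["train baseline", "ablate pooling"]}))

def engineer(workspace: Path):
    """Re-grounds on the plan file, then leaves its own durable evidence."""
    plan = json.loads((workspace / "plan.json").read_text())
    results = {task: "done" for task in plan["todo"]}
    (workspace / "results.json").write_text(json.dumps(results))

def orchestrator(workspace: Path):
    """Thin control over thick state: sequences stages, reads only summaries."""
    planner(workspace)
    engineer(workspace)
    return json.loads((workspace / "results.json").read_text())

ws = Path(tempfile.mkdtemp())
results = orchestrator(ws)
```

Because every handoff is a file, any agent (or a restarted one) can reconstruct the project state from the workspace alone, which is the continuity property the ablation suggests matters more than any single agent's reasoning.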

Benchmarks

Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                    Score  tok/s  $/1M
1  Gemini 3.1 Pro Preview   57.2   122    $4.50
2  GPT-5.4                  56.8   79     $5.63
3  GPT-5.3 Codex            53.6   65     $4.81
4  Claude Opus 4.6          53     43     $10.00
5  Muse Spark               52.1   0      $0.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  Gemini 3.1 Pro Preview     62.3%