The Inference Report

April 13, 2026

The bifurcation accelerating through AI markets in 2025 reflects a deeper misalignment between institutional deployment and technical reality. Policy incoherence is creating openings for consolidation: Trump officials encourage banks to test Anthropic's Mythos model while the Department of Defense simultaneously flags the company as a supply-chain risk, a contradiction that effectively grants preferred vendors regulatory cover. UK financial regulators are rushing to assess Mythos's cybersecurity implications only after Claude demonstrated the ability to automate vulnerability discovery, meaning institutions tasked with managing financial system risk are auditing products already in customer hands. Anthropic dominates conference floors and venture conversations while short sellers position themselves for disruption bets, suggesting even sophisticated capital sees current valuations as vulnerable to the technological change these models represent. The gap between what institutions are building and what the public has patience for is widening: 54 percent of Americans report fatigue with AI coverage itself, a consumer sentiment that sits uneasily against the acceleration of model capability, regulatory shortcuts, and capital deployment.

This institutional capture is occurring precisely as technical capability outpaces the ability to manage it. Today's research papers cluster around three methodological movements: mechanistic interpretability applied as a causal tool for safety, structured supervision to enforce causal dependence, and agentic reasoning systems that decouple inference from training through tool use. The signature across all three is the same: move beyond end-to-end optimization to explicitly model what a system must depend on, whether that is evidence, intermediate rewards, or safety margins, and verify that dependence through controlled perturbation. Yet operationalization is outrunning these safeguards. Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, and the clustering of models between 58 and 65 percent there, compared with the wider spreads seen on other benchmarks, suggests certain architectures perform disproportionately well on specific evaluation criteria. Whether this reflects genuine capability differences or selection effects in evaluation remains unclear, but the velocity of deployment doesn't wait for methodological clarity.

On GitHub, the strongest signal is the shift from building AI agents to operationalizing them. Hermes Agent, Multica, Ralph, and Claude Mem all solve the same underlying problem: autonomous systems that work reliably in production require determinism, memory, and task management. These aren't viral repos riding hype; they're addressing friction that developers hit when moving agents from demos into workflows. The secondary pattern is more telling: specialized models and vertical tools are gaining ground faster than general-purpose frameworks. Kronos targets financial markets, VoxCPM2 solves multilingual speech synthesis, RustFS competes on object storage performance, and MarkItDown converts documents to Markdown for RAG pipelines at 105K stars. The repos gaining real adoption are the ones that solve one thing well and integrate into existing workflows rather than demanding teams adopt a new philosophy. This mirrors OpenAI's own distribution strategy: packaging ChatGPT as vertical-specific workflow software with use-case framing and compliance wrapping rather than building deep domain expertise into the models themselves. The advantage isn't in specialized capability but in market penetration across industries where a general-purpose interface can capture workflow share before purpose-built competitors arrive.

Hardware and infrastructure are finally catching up to software capability. The orbital compute cluster now operational through Kepler Communications and Apple's winnowing of its smart-glass designs down to four pragmatic prototypes both point to infrastructure and hardware moving past the demo phase. Meta's construction of a Zuckerberg AI avatar for internal staff interaction signals that deployment is moving past chatbots into organizational infrastructure. Yet this acceleration into production environments, combined with regulatory shortcuts and the velocity of capital deployment, is concentrating friction precisely where institutions have the least readiness to manage it. The widening gap between what's being built and what the public has patience for is where real resistance will eventually concentrate.

Grant Calloway

Research Papers
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism cs.CL

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
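The paper's core move, pruning a compact weight set and checking the causal effect on one behavior while leaving others intact, can be illustrated on a toy linear model. This is a sketch with made-up numbers, not the paper's actual procedure or attribution method:

```python
# Toy "model": a 4x3 weight matrix mapping a feature vector to three
# output logits; pretend logit 0 drives the behavior we want to ablate.
W = [
    [0.9, 0.1, 0.2],
    [0.8, 0.0, 0.1],
    [0.05, 0.7, 0.3],
    [0.1, 0.6, 0.4],
]

def forward(W, x):
    # Plain matrix-vector product: logits[j] = sum_i x[i] * W[i][j]
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def prune_for_output(W, target, k):
    # Causal intervention: zero the k weights with the largest magnitude
    # feeding the target logit, leaving all other columns untouched.
    Wp = [row[:] for row in W]
    ranked = sorted(range(len(W)), key=lambda i: -abs(W[i][target]))
    for i in ranked[:k]:
        Wp[i][target] = 0.0
    return Wp

x = [1.0, 1.0, 1.0, 1.0]
before = forward(W, x)
after = forward(prune_for_output(W, target=0, k=2), x)
# The target logit collapses while the others are untouched: the
# signature of a compact, dissociable weight set for one behavior.
print(round(before[0], 2), round(after[0], 2))
print(before[1] == after[1] and before[2] == after[2])
```

In the paper's setting the "column" is not literal; the point is that an intervention confined to a small weight subset selectively removes harmful generation without degrading benign capabilities.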

ANTIC: Adaptive Neural Temporal In-situ Compressor cs.LG

The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal In-situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate storage reductions of several orders of magnitude and characterize how these reductions trade off against physics accuracy.
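The two-part design, a temporal selector that keeps only informative snapshots plus residual storage between kept frames, can be sketched in a single streaming pass. This is an illustrative threshold rule on toy 1-D fields; the actual system uses neural fields and continual fine-tuning, not raw deltas:

```python
def compress_stream(snapshots, tol=0.1):
    """One-pass sketch: keep a snapshot only when it differs enough from
    the last kept one, and store it as a residual (delta) for cheap coding.
    Assumes snapshots are 1-D lists of floats."""
    kept = []          # (time index, stored payload) pairs
    last = None
    for t, snap in enumerate(snapshots):
        if last is None:
            kept.append((t, snap[:]))          # first frame stored whole
            last = snap
            continue
        change = max(abs(a - b) for a, b in zip(snap, last))
        scale = max(max(abs(v) for v in last), 1e-12)
        if change / scale > tol:               # informative snapshot
            kept.append((t, [a - b for a, b in zip(snap, last)]))
            last = snap
    return kept

# A slowly drifting field with one sharp transient at t=3: the drift is
# filtered out, the transient survives as a residual.
stream = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.01], [2.0, 0.5], [2.01, 0.5]]
kept = compress_stream(stream)
print([t for t, _ in kept])
```

The selector is what makes the pipeline "in situ": the decision to discard a frame happens at simulation time, so the full trajectory never has to touch disk.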

Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision cs.CL

Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
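The supervision construction is the paper's key contribution: for each case, generate one explicit support example plus semantically controlled negatives (a counterfactual wrong-state negative and a topic-related negative) with no manual evidence annotation. A minimal sketch, with a hypothetical claim schema and toy evidence bank standing in for retrieved articles:

```python
def build_supervision(case, evidence_bank):
    """Sketch of evidence-sensitive pair construction. case is
    {"topic": ..., "state": ...}; evidence_bank maps (topic, state) to an
    evidence snippet. Negatives are generated by perturbing the claim."""
    topic, state = case["topic"], case["state"]
    examples = []
    # Positive: evidence matching both topic and state of the claim.
    examples.append({"evidence": evidence_bank[(topic, state)],
                     "claim": (topic, state), "label": 1})
    # Counterfactual wrong-state negative: same topic, flipped state.
    other_state = "absent" if state == "present" else "present"
    examples.append({"evidence": evidence_bank[(topic, other_state)],
                     "claim": (topic, state), "label": 0})
    # Topic-related negative: a different topic in the same state.
    for (t, s), ev in evidence_bank.items():
        if t != topic and s == state:
            examples.append({"evidence": ev,
                             "claim": (topic, state), "label": 0})
            break
    return examples

bank = {("edema", "present"): "fluid signal high",
        ("edema", "absent"): "no fluid signal",
        ("fracture", "present"): "cortical break seen"}
ex = build_supervision({"topic": "edema", "state": "present"}, bank)
print([e["label"] for e in ex])
```

Because the negatives are near-misses rather than random text, a verifier trained on them cannot score well by ignoring the evidence, which is exactly the evidence dependence the evaluation then probes by removing or swapping evidence.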

Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise cs.CV

Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise remains underexplored. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
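The two mechanisms, cross-modal attention from prompt tokens over visual features and a gate that modulates injection strength per sample, can be sketched with plain dot-product attention. This is illustrative, not the paper's exact operator or parameterization:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def inject_visual(prompt_tokens, visual_feats, gate):
    """Each prompt token attends over visual features; a scalar gate in
    [0, 1] (conditionally predicted per sample in the real framework)
    controls how much visual context is injected."""
    out = []
    for p in prompt_tokens:
        scores = [sum(a * b for a, b in zip(p, v)) for v in visual_feats]
        attn = softmax(scores)
        ctx = [sum(w * v[d] for w, v in zip(attn, visual_feats))
               for d in range(len(p))]
        out.append([pi + gate * ci for pi, ci in zip(p, ctx)])
    return out

prompts = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0]]
# gate=0 leaves prompts untouched; gate=1 pulls each token toward the
# visual evidence it attends to most.
unchanged = inject_visual(prompts, visual, gate=0.0)
updated = inject_visual(prompts, visual, gate=1.0)
print(unchanged == prompts, updated[0][0] > prompts[0][0])
```

The gate is the robustness lever: for samples with weak or ambiguous visual cues, a low gate falls back to the text-side prior instead of anchoring to unreliable visual evidence.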

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images cs.CV

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
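The pipeline's defining property is that it starts from nothing but a task name. A toy stand-in, with templates and a seeded sampler replacing the real LLM, T2I model, and VLM verifier; all names here are illustrative, not the paper's interfaces:

```python
import random

def generate_examples(task, n, seed=0):
    """Toy sketch of a task-name-to-dataset pipeline for a depth-order
    task: emit (t2i_prompt, question, answer) triples with no reference
    images or human annotation."""
    rng = random.Random(seed)
    objects = ["a red cube", "a blue sphere", "a green cone"]
    triples = []
    for _ in range(n):
        near, far = rng.sample(objects, 2)
        # The T2I prompt encodes the ground truth by construction, so the
        # answer is known before any image is rendered.
        t2i = f"{near} in the foreground, {far} far in the background"
        q = f"Which object is closer to the camera: {near} or {far}?"
        triples.append({"task": task, "t2i_prompt": t2i,
                        "question": q, "answer": near})
    return triples

data = generate_examples("Depth Order", n=3)
# Cheap consistency check standing in for the VLM verification pass.
print(len(data), all(d["answer"] in d["t2i_prompt"] for d in data))
```

The design choice worth noting is that supervision comes from construction, not labeling: because the generator decides the scene layout, the answer key exists before the image does, and the verifier only has to confirm the T2I model actually rendered it.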

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning cs.CV

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
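The intrinsic visual certainty estimate combines two quantities the abstract names explicitly: KL divergence between predictions under image perturbation (visual grounding) and token entropy (internal certainty). A sketch with an assumed combination rule, since the paper's exact weighting is not given here:

```python
import math

def entropy(p):
    # Shannon entropy in nats; low entropy = high internal certainty.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    # KL(p || q); a small divergence under image perturbation means the
    # prediction is visually grounded rather than driven by language priors.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def visual_certainty(p_clean, p_perturbed, alpha=0.5):
    """Illustrative combination: certainty rises as both the KL under
    perturbation and the token entropy fall. alpha is an assumed weight."""
    return math.exp(-(alpha * kl(p_clean, p_perturbed)
                      + (1 - alpha) * entropy(p_clean)))

# A confident, perturbation-stable prediction vs. a diffuse one that
# shifts when the image is perturbed.
grounded = visual_certainty([0.9, 0.05, 0.05], [0.85, 0.1, 0.05])
ungrounded = visual_certainty([0.4, 0.35, 0.25], [0.2, 0.5, 0.3])
print(grounded > ungrounded)
```

This is the signal that lets the framework supervise visual confidence without ground-truth perception labels: both terms are computed from the model's own output distributions.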

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                    Score  tok/s  $/1M
1  Gemini 3.1 Pro Preview   57.2   132    $4.50
2  GPT-5.4                  56.8   83     $5.63
3  GPT-5.3 Codex            53.6   78     $4.81
4  Claude Opus 4.6          53.0   48     $10.00
5  Muse Spark               52.1   0      $0.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Opus 4.6            65.3%
2  gpt-5.2-2025-12-11-medium  64.4%
3  GLM-5                      62.8%
4  gpt-5.4-2026-03-05-medium  62.8%
5  Gemini 3.1 Pro Preview     62.3%