Anthropic just locked in expanded compute from Google and Broadcom as its annual revenue run-rate hit $30 billion. Firmus, the Nvidia-backed Asia-Pacific data center operator, raised $1.35 billion in six months at a $5.5 billion valuation. Meanwhile, only 28 percent of AI use cases in infrastructure and operations are meeting ROI expectations, and 20 percent fail outright. The fracture lines are now visible: those who control chips, data, and cloud relationships are consolidating power, while the productivity gains AI promised are collapsing under real-world deployment.
The capital flow reveals what actually matters in this market. Uber is expanding its AWS contract to run ride-sharing features on Amazon's chips. Intel is joining Musk's Terafab project. These decisions aren't about which model performs best on benchmarks. They're about who owns the hardware and software stack when serious money gets deployed. Arcee, a 26-person startup that built an open-source LLM, is gaining traction with users, but the headline itself signals the constraint: scale in this industry still means access to billions of dollars of compute and the relationships to secure it.
But the productivity story is fracturing faster than the market is consolidating. Even at ninety percent accuracy, Google's AI Overviews serve millions of wrong answers per hour. AMD's AI Group director publicly flagged that Claude Code cuts corners on complex problems, offering answers that seem faster but don't hold up, forcing her team to stop using it. Companies are cutting engineering teams on the fantasy that AI can now build and maintain enterprise applications with minimal supervision, a miscalculation that will produce consequences beyond bad quarters. The gap between what AI can do in controlled demos and what it delivers when bolted onto real systems is widening, not closing.
Developers are responding by moving computation and data retrieval off the cloud entirely. Google's on-device model gallery and LiteRT-LM sit alongside tools like GitNexus and qmd that run models locally and build knowledge infrastructure without servers. When your docs live in your browser as a knowledge graph or your code gets indexed client-side, you stop paying per query and start owning the infrastructure. TrustGraph and CrateDB represent the backend half of that equation, building storage designed for semantic retrieval rather than retrofitted onto relational schemas. The implication is clear: developers are investing in tools that reduce dependency on centralized platforms, not deepen it. Local-first isn't a trend anymore. It's becoming the default assumption.
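None of the tools above publish this exact flow here, so take this as a minimal sketch of the local-first pattern they share: build the index once on your own machine, then answer queries with no server and no per-query bill. The toy corpus and bag-of-words scoring are illustrative stand-ins for the real embeddings and knowledge graphs these tools use.

```python
# Minimal local-first retrieval: index once, query forever, no server.
import math
from collections import Counter

docs = {
    "notes.md": "meeting notes on vector search and local indexing",
    "arch.md": "client-side knowledge graph built from the repository",
    "ops.md": "deployment costs of per-query cloud inference",
}

def tf_vector(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Built once and stored locally; queries afterward cost nothing.
index = {name: tf_vector(body) for name, body in docs.items()}

query = tf_vector("local knowledge graph")
for name, vec in sorted(index.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{cosine(query, vec):.3f}  {name}")
```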
Grant Calloway
The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language model (LLM) systems have demonstrated strong potential for understanding user intent and using tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder-LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall@K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.
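The abstract names multi-criteria scoring and diversity-aware ranking without spelling out the method. A common way to get diversity-aware ranking is maximal marginal relevance (MMR), so here is a minimal sketch under that assumption; the `relevance` and `sim` inputs are hypothetical stand-ins for Paper Circle's actual scores and pairwise similarities.

```python
# Hypothetical MMR-style diversity-aware re-ranking over scored papers.
import numpy as np

def mmr_rank(relevance: np.ndarray, sim: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Pick k items trading off relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = np.array([0.9, 0.85, 0.8, 0.4])
sim = np.array([[1.0, 0.95, 0.2, 0.1],
                [0.95, 1.0, 0.15, 0.1],
                [0.2, 0.15, 1.0, 0.05],
                [0.1, 0.1, 0.05, 1.0]])
print(mmr_rank(rel, sim, k=3))  # near-duplicate of the top paper gets demoted
```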
The static "train-then-deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to the continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers, including architectural incompatibility, computational inefficiency, and misaligned fast-weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with test-time training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically grounded objective explicitly aligned with the next-token-prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation results provide deeper insight into our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
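To make the fast-weight idea concrete, here is a toy numpy sketch of chunk-wise test-time updates to a single projection matrix. The squared-error loss and plain gradient step are simplified stand-ins, not the paper's method: In-Place TTT derives its own next-token-aligned objective and update mechanism.

```python
# Toy sketch: treat one projection matrix W as "fast weights" and nudge it
# chunk-by-chunk at inference time toward a next-token-derived target.
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_model, chunk = 64, 32, 16
W = rng.normal(scale=0.02, size=(d_hidden, d_model))  # adaptable down-projection

def ttt_step(W, h, target, lr=1e-2):
    """One chunk-wise update: gradient of squared error between the
    projection output and a target (stand-in for an NTP-aligned loss)."""
    pred = h @ W                           # (chunk, d_model)
    grad = h.T @ (pred - target) / len(h)  # (d_hidden, d_model)
    return W - lr * grad

# Stream of hidden states with targets playing the next-token role.
h_stream = rng.normal(size=(128, d_hidden))
t_stream = rng.normal(size=(128, d_model))
for s in range(0, len(h_stream), chunk):
    W = ttt_step(W, h_stream[s:s+chunk], t_stream[s:s+chunk])
print("updated fast weights:", W.shape)
```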
Churn flow, the chaotic, oscillatory regime in vertical two-phase flow, has lacked a quantitative mathematical definition for over $40$ years. We introduce the first topology-based characterization using Euler Characteristic Surfaces (ECS). We formulate unsupervised regime discovery as Multiple Kernel Learning (MKL), blending two complementary ECS-derived kernels, temporal alignment ($L^1$ distance on the $\chi(s,t)$ surface) and amplitude statistics (scale-wise mean, standard deviation, max, min), with gas velocity. Applied to $37$ unlabeled air-water trials from Montana Tech, the self-calibrating framework learns weights $\beta_{ECS}=0.14$, $\beta_{amp}=0.50$, $\beta_{ugs}=0.36$, placing $64\%$ of total weight on topology-derived features ($\beta_{ECS} + \beta_{amp}$). The ECS-inferred slug/churn transition lies $+3.81$ m/s above Wu et al.'s (2017) prediction in $2$-in. tubing, quantifying reports that existing models under-predict slug persistence in small-diameter pipes where interfacial tension and wall-to-wall interactions dominate flow. Cross-facility validation on $947$ Texas A&M University images confirms $1.9\times$ higher topological complexity in churn vs. slug ($p < 10^{-5}$). Applied to $45$ TAMU pseudo-trials, the same unsupervised framework achieves $95.6\%$ $4$-class accuracy and $100\%$ churn recall, without any labeled training data, matching or exceeding supervised baselines that require thousands of annotated examples. This work provides the first mathematical definition of churn flow and demonstrates that unsupervised topological descriptors can challenge and correct widely adopted mechanistic models.
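The kernel blend itself is simple to express; below is a sketch using the learned weights reported above. The kernel matrices are random positive semi-definite placeholders, since the real ones are computed from Euler Characteristic Surfaces and superficial gas velocity.

```python
# Sketch of the MKL blend: combine two ECS-derived kernels with a
# gas-velocity kernel using the weights the framework learned.
import numpy as np

rng = np.random.default_rng(1)
n = 37  # unlabeled trials

def random_psd(n):
    A = rng.normal(size=(n, n))
    return A @ A.T / n  # symmetric PSD placeholder kernel

K_ecs, K_amp, K_ugs = random_psd(n), random_psd(n), random_psd(n)
beta = {"ecs": 0.14, "amp": 0.50, "ugs": 0.36}  # weights reported above

K = beta["ecs"] * K_ecs + beta["amp"] * K_amp + beta["ugs"] * K_ugs
print("topology share of total weight:", beta["ecs"] + beta["amp"])  # 0.64
# K would then feed an unsupervised method (e.g. spectral clustering)
# to discover the flow regimes without labels.
```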
Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson's paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying a model's internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.
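At its core, the factorization reduces to Bayes' rule: an external-statistics prior multiplied by a likelihood ratio from internal decoding evidence. A toy sketch with made-up numbers follows; HaloProbe learns both terms from data.

```python
# Sketch of the prior-times-likelihood combination behind the factorized
# scoring idea. The probability values here are illustrative only.
def posterior_hallucination(prior: float, lr_internal: float) -> float:
    """Posterior odds = prior odds * likelihood ratio of internal evidence."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr_internal
    return post_odds / (1 + post_odds)

# A token late in the description of an often-repeated object (risky prior),
# whose internal decoding signal nonetheless looks confident (ratio < 1):
print(posterior_hallucination(prior=0.30, lr_internal=0.5))  # ~0.176
```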
Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.
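Two ingredients the abstract names are easy to sketch: a log-domain encoding of radiance and a mask of clipped pixels that the diffusion model must inpaint. The exact Log-Gamma transfer function and thresholds DiffHDR uses are not given here, so these are illustrative choices.

```python
# Illustrative log encoding plus a mask of the over/underexposed pixels
# that become generative inpainting targets.
import numpy as np

def log_encode(radiance: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Compress HDR radiance into a log-domain signal (stand-in transfer)."""
    return np.log2(radiance + eps)

def clipped_mask(ldr: np.ndarray, lo: float = 0.02, hi: float = 0.98) -> np.ndarray:
    """Pixels lost to under-/over-exposure in the 8-bit LDR frame."""
    return (ldr <= lo) | (ldr >= hi)

rng = np.random.default_rng(2)
hdr = rng.gamma(shape=2.0, scale=0.5, size=(4, 4))  # toy scene radiance
ldr = np.clip(hdr, 0.0, 1.0)                        # what an 8-bit pipeline keeps
print("log-encoded range:", log_encode(hdr).min().round(2), "to", log_encode(hdr).max().round(2))
print(clipped_mask(ldr).sum(), "of 16 pixels would be generatively inpainted")
```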
The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making page-level OCR evaluation challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the document understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a character-distribution method using the Jensen-Shannon distance. We validate the CEV's performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers comprising degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support document understanding research.
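The character-distribution variant mentioned above is straightforward to reproduce: compare the bag-of-characters of the OCR output against the reference with the Jensen-Shannon distance. A self-contained sketch follows; normalization details in the actual library may differ.

```python
# Bag-of-characters comparison via Jensen-Shannon distance.
import math
from collections import Counter

def char_dist(text: str) -> dict[str, float]:
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def js_distance(p: dict, q: dict) -> float:
    keys = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / b[k])
                   for k in keys if a.get(k, 0) > 0)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

ref = char_dist("the quick brown fox")
ocr = char_dist("tbe qu1ck brown fox")
print(f"JSD: {js_distance(ref, ocr):.3f}")  # > 0; identical texts give 0.0
```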
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tok |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 85 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 132 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 76 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 55 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 71 | $6.00 |
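Score alone hides the cost and speed tradeoff, so one useful derived reading of the table is score per dollar. The metric below is ours, not the leaderboard's; the numbers are copied from the rows above.

```python
# Score-per-dollar reading of the composite leaderboard above.
# "$/1M tok" is treated as a blended price per million tokens.
rows = [
    ("GPT-5.4", 57.2, 85, 5.63),
    ("Gemini 3.1 Pro Preview", 57.2, 132, 4.50),
    ("GPT-5.3 Codex", 54.0, 76, 4.81),
    ("Claude Opus 4.6", 53.0, 55, 10.00),
    ("Claude Sonnet 4.6", 51.7, 71, 6.00),
]
for name, score, tps, price in sorted(rows, key=lambda r: -(r[1] / r[3])):
    print(f"{name:24s} score/$ = {score / price:5.2f}   tok/s = {tps}")
```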
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
A gallery that showcases on-device ML/GenAI use cases and lets people try out and run models locally.
PersonaPlex code.
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop in a GitHub repo or ZIP file and get an interactive knowledge graph with a built-in Graph RAG Agent. Perfect for code exploration.
Mini CLI search engine for your docs, knowledge bases, meeting notes, whatever. Tracks current SOTA approaches while staying fully local.
🎤📄 An innovative tool that transforms audio or video files into text transcripts and generates concise meeting minutes. Stay organized and efficient in your meetings, and get ready for Phase 2 where we'll be open for contributions to enable real-time meeting transcription! 🚀
Database anonymization, synthetic data generation, and logical dump.
The context development platform. Store, enrich, and retrieve structured knowledge with graph-native infrastructure, semantic retrieval, and portable context cores.
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
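Since the blurb notes PostgreSQL compatibility, a stock Postgres driver should talk to it directly. A minimal sketch, assuming a default local install; the host, port, user, and `readings` table are assumptions, not taken from the project.

```python
# Query CrateDB over its PostgreSQL wire protocol with a standard driver.
# Connection details and the sample table are assumptions about a local
# default install.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="crate", dbname="doc")
with conn, conn.cursor() as cur:
    # An aggregate over a hypothetical sensor table: the kind of
    # near-real-time analytical query the blurb describes.
    cur.execute("""
        SELECT date_trunc('minute', ts) AS minute, avg(value)
        FROM readings
        GROUP BY 1
        ORDER BY 1 DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)
```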
openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 300+ supported cars.