The Inference Report — March 12, 2026

The infrastructure layer is consolidating while the application layer fragments into a thousand bets on what agentic AI actually means in practice. Nvidia is spending $26 billion to build open-weight models and striking $2 billion deals with cloud providers like Nebius, essentially funding its own customer base while courts and regulators wrestle with what safety even looks like. OpenAI is shipping agent runtime primitives and container orchestration that turn models into executable workflows, but the concrete value is accruing to companies like Rakuten and Wayfair that are buying operational leverage: incident response time cut in half, automated catalog maintenance, ticket triage that doesn't require human gatekeeping. Nemotron 3 Super, a 120-billion-parameter model with only 12 billion active parameters, is explicitly designed to run agentic workloads efficiently, with the kind of inference cost structure that makes autonomous systems economically viable. Whoever controls the agent runtime and the inference economics wins the next cycle. Model capability alone is table stakes.

But the application bets are chaotic and the safety narrative is fracturing. Character.AI was deemed uniquely unsafe by researchers, and Grammarly's AI feature was quietly shut down after a lawsuit revealed it falsely attributed suggestions to real authors without consent. Ford's new AI assistant checks seatbelt compliance in fleet vehicles, while AI meal plans for teens are cutting calories in ways nutritionists wouldn't. Anthropic is launching a think tank to examine AI's societal effects while simultaneously selling Claude to the US military for targeting decisions in Iran. The gap between what these systems are actually being deployed to do and the frameworks being built around them is widening, not closing.

The money is flowing toward builders who can show revenue and production usage, not flashy demos. Lovable hit $400 million ARR with 146 employees, Replit jumped to a $9 billion valuation and is chasing $1 billion ARR by year's end, and Forethought was acquired at a valuation reflecting years of head start in a category that barely existed two years ago. Yet the labor math is inverting: Atlassian cut 10 percent of its workforce citing AI threats, Oracle set aside $500 million for restructuring as it deploys AI coding tools, and Salesforce had to pay a premium on a $25 billion bond deal because Wall Street is spooked about AI disruption. The valuation premium for fast-growing AI startups with proven unit economics is real. The discount for mature software companies is equally real.

Grant Calloway

AI Labs

Google

Exploring the feasibility of conversational diagnostic AI in a real-world clinical study

Hugging Face

How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II

Meta AI

Four MTIA Chips in Two Years: Scaling AI Experiences for Billions

NVIDIA

OpenAI

From the Wire

Research Papers

COMIC: Agentic Sketch Comedy Generation cs.CV

We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.

LiTo: Surface Light Field Tokenization cs.CV

We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.

Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation cs.LG

We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab-lab-princeton.github.io/nefty/

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation cs.CV

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
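To make the "event curve" idea concrete, here is a minimal sketch, not the authors' implementation. The abstract says only that the curves are computed from intra-modal similarity using pretrained encoders; the specific choice below (one minus the cosine similarity of consecutive embeddings) and the toy embeddings are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def event_curve(embeddings):
    """One change score per consecutive pair of embeddings.

    Assumed definition: 1 - cosine similarity, so 0 means "nothing changed"
    and larger values mean a bigger change within this modality.
    """
    return [
        1.0 - cosine_similarity(prev, cur)
        for prev, cur in zip(embeddings, embeddings[1:])
    ]

# Invented toy embeddings standing in for pretrained encoder outputs.
# The two modalities live in unrelated feature spaces, but both change
# sharply at the same time step.
video_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
music_embeddings = [[0.0, 2.0], [0.0, 2.0], [3.0, 0.0], [3.0, 0.0]]

video_curve = event_curve(video_embeddings)
music_curve = event_curve(music_embeddings)
print(video_curve, music_curve)
```

Because each curve is computed inside a single modality, the two are directly comparable even though the embeddings are not, which is the property that would let a video curve stand in for a music curve at inference time.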

Instruction set for the representation of graphs cs.CL

We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
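The similarity claim rests on Levenshtein distance between encoded strings. The sketch below is the standard dynamic-programming edit distance, not code from the paper. The abstract does not list the nine instruction characters, so the example strings are hypothetical placeholders rather than real IsalGraph encodings.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + substitution_cost,   # substitute or match
            ))
        prev = cur
    return prev[-1]

# Hypothetical placeholder strings, not real IsalGraph instruction strings.
encoding_g1 = "abcab"
encoding_g2 = "abcb"
print(levenshtein(encoding_g1, encoding_g2))
```

If the reported correlation holds, a small string distance like this would serve as a cheap proxy for graph edit distance, which is far more expensive to compute exactly.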

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge cs.CL

The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r̄ = 0.72; absolute agreement ICC = 0.67), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
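The gap between model-level and sample-level agreement can be reproduced on toy numbers. The sketch below is a simplification under stated assumptions, not the paper's protocol: the scores are invented, there are only two judges, and sample-level agreement is a single Pearson correlation over all samples rather than an averaged statistic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank positions starting at 1; this sketch assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented scores: two judges grade the same four outputs from three models.
judge_a = {"m1": [2, 4, 3, 5], "m2": [5, 7, 6, 8], "m3": [8, 9, 7, 10]}
judge_b = {"m1": [4, 2, 5, 3], "m2": [7, 5, 8, 6], "m3": [9, 8, 10, 7]}
models = sorted(judge_a)

# Model level: compare each judge's mean score per model.
means_a = [sum(judge_a[m]) / len(judge_a[m]) for m in models]
means_b = [sum(judge_b[m]) / len(judge_b[m]) for m in models]
model_level = spearman(means_a, means_b)

# Sample level: compare the judges output by output.
flat_a = [score for m in models for score in judge_a[m]]
flat_b = [score for m in models for score in judge_b[m]]
sample_level = pearson(flat_a, flat_b)

print(f"model-level Spearman: {model_level:.2f}")
print(f"sample-level Pearson: {sample_level:.2f}")
```

In this toy case the judges rank the three models identically, yet they disagree about which individual outputs are better, so the sample-level figure comes out well below the model-level one.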

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M
1	Gemini 3.1 Pro Preview	57.2	111	$4.50
2	GPT-5.4	57	77	$5.63
3	GPT-5.3 Codex	54	57	$4.81
4	Claude Opus 4.6	53	53	$10.00
5	Claude Sonnet 4.6	51.7	60	$6.00
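One rough way to read the score and price columns together is score per dollar. The sketch below uses the five rows above and assumes the unlabeled "$/1M" column means US dollars per one million tokens. The ratio is an illustrative metric of our own choosing, not one the index reports.

```python
# (model, composite score, assumed price in USD per 1M tokens), from the table above
rows = [
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GPT-5.4", 57.0, 5.63),
    ("GPT-5.3 Codex", 54.0, 4.81),
    ("Claude Opus 4.6", 53.0, 10.00),
    ("Claude Sonnet 4.6", 51.7, 6.00),
]

def score_per_dollar(score: float, price: float) -> float:
    """Composite score points per dollar of the listed price."""
    return score / price

# Sort from best to worst value under this crude ratio.
ranked = sorted(rows, key=lambda row: score_per_dollar(row[1], row[2]), reverse=True)

for model, score, price in ranked:
    print(f"{model}: {score_per_dollar(score, price):.2f} points per dollar")
```

The ratio ignores throughput and workload mix, so treat it as a starting point for comparison rather than a ranking.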

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Code	52.9%
2	Junie	52.1%
3	Claude Opus 4.6	51.7%
4	gpt-5.2-2025-12-11-xhigh	51.7%
5	gpt-5.2-2025-12-11-medium	51.0%

GitHub Repos

Trending

msitarzewski/agency-agents

44775 ★

A complete AI agency at your fingertips - From frontend wizards to Reddit community ninjas, from whimsy injectors to reality checkers. Each agent is a specialized expert with personality, processes, and proven deliverables.

promptfoo/promptfoo

15534 ★

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

obra/superpowers

158198 ★

An agentic skills framework & software development methodology that works.

fishaudio/fish-speech

27360 ★

SOTA Open Source TTS

virattt/ai-hedge-fund

55420 ★

An AI Hedge Fund Team

Daily discovery

aurelio-labs/semantic-router NLP

3345 ★