The Inference Report

April 4, 2026

The industry is splintering into two incompatible futures while the public face remains unified. On one side, companies are hardening infrastructure around proprietary systems, local execution, and political control. OpenAI is reorganizing leadership and acquiring media properties while Anthropic buys biotech startups, launches political action committees, and quietly ships code with known vulnerabilities to npm before attempting to DMCA 8,100 repositories. On the other side, the security perimeter is collapsing faster than it can be rebuilt. Claude Code operates with 90 percent autonomy when weaponized by state actors. Meta's AI agents trigger severity-one incidents. The Europa.eu platform lost 350 gigabytes through a supply chain attack on an open-source vulnerability scanner. HackerOne paused Internet Bug Bounty payouts after acknowledging it cannot handle open-source security anymore. The more autonomous these systems become, the less the existing security model holds.

The capital intensity required to scale inference is meeting hard physical limits. Trump's AI data center buildout is delayed across nearly 50 percent of projects because China controls key power infrastructure. Meta, Microsoft, and Google are betting billions on natural gas plants while communities prefer Amazon warehouses in their backyards. Google just added Flex Inference and Priority Inference tiers to Gemini because inference costs, not training costs, are now the binding constraint. Anthropic's 400 million dollar acquisition of Coefficient Bio and its new PAC suggest preparation for a longer game than quarterly model releases. OpenAI's move to acquire TBPN and create a special projects role signals internal focus shifting away from product velocity toward structural positioning.

Measurement and enforcement are becoming table stakes for enterprise adoption. Google is publishing work on behavioral alignment measurement while AWS ships centralized governance tooling across customer accounts. Only one of the two, the AWS tooling, is currently collecting revenue. In research, two complementary trajectories are emerging: one treats LLMs as semantic reasoners augmented with domain-specific constraints for detecting vulnerabilities, while the other tests whether LLM outputs remain robust under variation through rigorous benchmarking. Both converge on controlled experimental design, yet the gap between laboratory conditions and real-world deployment remains substantial. Claude Opus 4.6 holds 65.3 percent on SWE-rebench, up 12.3 points, while the Artificial Analysis Intelligence Index shows minimal top-tier movement, with the leading composite score at 57.2. The divergence reflects a fundamental problem: different benchmarks measure different distributions, making cross-methodology comparison unreliable.

Developer tooling is consolidating around infrastructure rather than capability. Conversational interfaces like Onyx and Prompts.Chat treat model switching and prompt management as friction to eliminate. Deeper architectural work is happening elsewhere: Microsoft's Presidio for PII redaction, Google's TimesFM for time-series forecasting, Genkit for application runtimes that use LLMs as components. The pattern across trending and discovery repositories is clear. The next wave isn't better chat. It's systems that manage state, route data, keep private data private, and learn from interaction. Agents and reinforcement learning are merging in the developer discovery set. The infrastructure bet is real. The security model is not.

Grant Calloway

AI Labs

AWS

Amazon Bedrock Guardrails supports cross-account safeguards with centralized control and management

Google

Evaluating alignment of behavioral dispositions in LLMs

From the Wire

Research Papers: Focused

Natural-Language to SysMLv2 Translation via Conformance-Driven Iterative Refinement cs.SE

Model-Based Systems Engineering (MBSE) relies on formal system models as primary technical artifacts for representing requirements, structure, and behavior across the system lifecycle. With the standardization of SysMLv2 as a textual language, interest is increasing in translating natural-language descriptions directly into executable models. For practical deployment, generated models must be accepted by industrial modeling environments, not merely satisfy grammar constraints. We present a conformance-checker-driven framework for reliable natural-language-to-SysMLv2 translation that enforces production-level acceptance as the termination condition. The system embeds a SysMLv2 conformance checker within a generate-check-repair loop. Each model is evaluated using the checker, and deterministic diagnostics are incorporated into revisions until zero conformance errors are achieved. Using the production checker as the oracle ensures the framework targets deployability rather than grammar plausibility. We evaluate the approach on the full SysMBench prompt set of 151 prompts across four large language model backends, yielding 604 prompt-model cases. Single-shot generation achieves 51.16% production-conformance acceptance, while our approach achieves 100.00% conformance. By elevating production conformance from a post-processing check to a control mechanism within generation, the framework converts probabilistic outputs into production-accepted SysMLv2 artifacts suitable for loading, visualization, and engineering use.
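The generate-check-repair loop described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_check_repair`, `toy_generate`, and `toy_check` are hypothetical names, the "checker" only counts braces, and a real setup would call an LLM backend and a production SysMLv2 conformance checker instead.

```python
from typing import Callable, List, Tuple


def generate_check_repair(
    prompt: str,
    generate: Callable[[str, List[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 10,
) -> Tuple[str, int]:
    """Regenerate until the checker reports zero diagnostics.

    Returns the accepted candidate and the number of rounds used.
    The round cap is an assumption added here so the loop always ends.
    """
    diagnostics: List[str] = []
    for round_no in range(1, max_rounds + 1):
        # Diagnostics from the previous round are fed back to the generator.
        candidate = generate(prompt, diagnostics)
        diagnostics = check(candidate)
        if not diagnostics:  # zero errors is the termination condition
            return candidate, round_no
    raise RuntimeError(f"no conforming model after {max_rounds} rounds")


# Hypothetical stand-ins for the LLM and the conformance checker.
def toy_generate(prompt: str, diagnostics: List[str]) -> str:
    body = "part def Vehicle {"
    # The first attempt forgets the closing brace; the repair pass adds it.
    if any("closing brace" in d for d in diagnostics):
        body += " }"
    return body


def toy_check(model: str) -> List[str]:
    balanced = model.count("{") == model.count("}")
    return [] if balanced else ["missing closing brace"]


model, rounds = generate_check_repair("a vehicle part", toy_generate, toy_check)
print(model, rounds)
```

The point of the design is that acceptance is decided by the checker, not by the model: the generator never gets to declare its own output finished.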

Towards Reliable AI-Assisted Analog Design: Template-Constrained LLM Agents for SAR ADC Generation cs.SE

While Large Language Models (LLMs) have demonstrated significant capability in software code generation, their application to analog Electronic Design Automation (EDA) is bottlenecked. Owing to limited circuit topology understanding and data, directly prompting LLMs and multimodal models leads to hallucinations and failure to produce schematics capable of passing rigorous SPICE simulations, as we show in our work. Instead, we propose an end-to-end, multi-step LLM agentic framework ATLAS, capable of generating a functional Successive Approximation Register (SAR) Analog-to-Digital Converter (ADC) that successfully passes simulation validation. To adhere to the rigid constraints of analog design, we utilize expert knowledge to ground the LLM in its planning, selection, parameterization, and iterative modification. As part of ATLAS, we introduce Template-Constrained Generation, which, unlike other template-based works, builds towards a more generalized SAR ADC generation flow. We demonstrate a strong proof-of-concept of our framework by developing SAR ADCs across technology nodes and input specs. Overall, our expert-knowledge grounded multi-step agentic ATLAS establishes a pragmatic foundation for integrating LLMs into reliable analog design methodologies.

Stop Means Stop: Measuring and Repairing the Enforcement Gap in Agent-Framework Control Primitives cs.SE

Production LLM-agent frameworks expose control primitives -- human-in-the-loop approval gates, run cancellation, and execution timeouts -- whose names and documentation imply barrier semantics: while a run is paused, cancelled, or timed out, no gated side effect executes. We show this implied contract holds on none of the six widely used open-source frameworks we test. Model-free differential probes isolate a recurring sibling leak -- an approval gate suspends its own branch while a sibling branch's effect executes during the pause, so a later rejection cannot prevent it -- in every framework shipping a pre-execution gate (five of six), plus three further gaps: replay double-execution, cancellation orphans, and timeout zombies. The hazard is reachable, not merely constructible: under an a-priori-fixed protocol, frontier models emit the leak-triggering plan shape at pooled rates up to 14%, and when live models drive the unmodified frameworks under an approval pause, 215 of 1,200 runs execute their effect during the pause, across three schedulers and two language runtimes. To repair the measured gaps we present SOUNDGATE, an environment-external effect gate in Rust through which every side effect must be admitted, enforcing hold-until-decided, reject-cancels, dedup-on-replay, and fence-on-cancel -- one property per violation class -- under a stated complete-mediation contract discharged for network egress by kernel-enforced routes. We verify the properties over a model of the admission core (Verus; TLA+/TLC, exhaustive to 7.5e7 states; TLAPS), model-check the deployed Rust with Loom, and bridge model to code by differential conformance over 1.2e7 operations -- refinement evidence, not a mechanized proof. SOUNDGATE blocks every measured violation end-to-end on all six frameworks while releasing legitimate effects, at about 1 ms admission per write and 12k-26k durable admissions per second.
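The four properties named in this abstract (hold-until-decided, reject-cancels, dedup-on-replay, fence-on-cancel) are easier to see in a toy. The sketch below is an in-process Python illustration only: SOUNDGATE itself is described as an environment-external gate written in Rust, and the class name `EffectGate`, its methods, and the run and effect identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Effect = Tuple[str, Callable[[], None]]  # (effect_id, side-effecting function)


@dataclass
class EffectGate:
    executed: Set[str] = field(default_factory=set)   # effect ids already run
    held: Dict[str, List[Effect]] = field(default_factory=dict)
    paused: Set[str] = field(default_factory=set)     # runs awaiting approval
    fenced: Set[str] = field(default_factory=set)     # cancelled runs

    def pause(self, run_id: str) -> None:
        self.paused.add(run_id)

    def submit(self, run_id: str, effect_id: str, fn: Callable[[], None]) -> str:
        if run_id in self.fenced:        # fence-on-cancel
            return "fenced"
        if effect_id in self.executed:   # dedup-on-replay
            return "duplicate"
        if run_id in self.paused:        # hold-until-decided, for every branch
            self.held.setdefault(run_id, []).append((effect_id, fn))
            return "held"
        self.executed.add(effect_id)
        fn()
        return "executed"

    def decide(self, run_id: str, approved: bool) -> int:
        """Resolve a pause; returns how many held effects were released."""
        self.paused.discard(run_id)
        pending = self.held.pop(run_id, [])
        if not approved:                 # reject-cancels: held effects are dropped
            return 0
        released = 0
        for effect_id, fn in pending:
            if effect_id not in self.executed:
                self.executed.add(effect_id)
                fn()
                released += 1
        return released

    def cancel(self, run_id: str) -> None:
        self.fenced.add(run_id)
        self.paused.discard(run_id)
        self.held.pop(run_id, None)


log: List[str] = []
gate = EffectGate()
gate.pause("run-1")
s1 = gate.submit("run-1", "send-email", lambda: log.append("email"))  # sibling branch
gate.decide("run-1", approved=False)
s2 = gate.submit("run-2", "write-db", lambda: log.append("db"))
s3 = gate.submit("run-2", "write-db", lambda: log.append("db"))       # replayed step
gate.cancel("run-2")
s4 = gate.submit("run-2", "write-db-2", lambda: log.append("db2"))
print(s1, s2, s3, s4, log)
```

In this toy the sibling branch's email is held while the run is paused and dropped on rejection, which is the behavior the abstract reports the tested frameworks failing to provide. The sketch only works if every side effect goes through `submit`, which is the complete-mediation assumption the paper states explicitly.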

Structured Feedback Improves Repair in an LLM Agent Loop cs.SE

LLM agents often retry after external validation rejects a candidate, but the interface between validation and the next model call remains underspecified. We introduce VeriHarness, a code-controlled agent loop in which models generate candidates while external validators control acceptance, budgets, and traces. We use it to compare raw diagnostics with feedback that identifies the failure location, observed value, and admissible alternatives. Across 50 paired TextWorld games under a four-call cap, feedback containing all three fields raises terminal success from 14/50 to 36/50 for Qwen2.5-Coder-14B (+44 percentage points) and from 8/50 to 29/50 for Llama-3.1-8B (+42 points). Ablations locate most of the gain in the admissible alternatives: feedback containing only the location and observed value remains near the raw diagnostic baseline. Presenting the complete repair information in prose instead of a keyed JSON record yields nearly the same success, providing no evidence that JSON syntax itself improves repair. The ordering persists across the tested call budgets and one sampled-decoding setting.
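A small sketch of what a three-field feedback record might look like, rendered both as keyed JSON and as prose, plus the percentage-point arithmetic behind the reported gains. The class `RepairFeedback`, its field names, and the example game values are assumptions for illustration; the abstract does not give VeriHarness's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RepairFeedback:
    location: str          # where the candidate failed
    observed: str          # the value the validator saw
    admissible: List[str]  # alternatives the validator would accept

    def as_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    def as_prose(self) -> str:
        options = ", ".join(self.admissible)
        return f"At {self.location}, got '{self.observed}'. Allowed values: {options}."


def point_gain(before: int, after: int, trials: int) -> float:
    """Change in success rate, in percentage points."""
    return 100 * (after - before) / trials


# Hypothetical TextWorld-style rejection.
fb = RepairFeedback(
    location="step 2",
    observed="open chest",
    admissible=["open door", "take key"],
)
print(fb.as_json())
print(fb.as_prose())

# Figures from the abstract: 14/50 -> 36/50 and 8/50 -> 29/50.
print(point_gain(14, 36, 50))  # 44.0
print(point_gain(8, 29, 50))   # 42.0
```

Both renderings carry the same three fields, which matches the abstract's finding that the content of the feedback, and the admissible alternatives in particular, matters more than whether it arrives as JSON.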

Quantize with Confidence? An Empirical Study of Quantization for Code Generation cs.SE

The growing adoption of local inference frameworks such as Ollama has made it increasingly common for developers to run large code models on laptops and other resource-constrained hardware. In these settings, post-training quantization is essential for reducing memory footprint and enabling practical deployment, yet its impact on generated code remains insufficiently understood. We empirically evaluate six state-of-the-art quantization methods (GPTQ, AWQ, QuIP#, AQLM, BitsAndBytes, and GGUF) on two representative large code model families, Qwen2.5-Coder and CodeLlama, using the multilingual McEval and CoderEval benchmarks for Python and Java. We assess functional correctness (pass@1) together with maintainability, reliability, security, and structural complexity. We also introduce a novel analysis of robustness under varying prompt complexity, characterized by Shannon entropy and token length. Our results show that quantization techniques differ meaningfully in their impact on correctness and code quality. AQLM consistently matches or exceeds the full-precision baseline, whereas QuIP# exhibits the largest correctness degradation, particularly on complex prompts. Security attributes remain stable across models, benchmarks, and programming languages, while robustness to prompt complexity varies across techniques. These findings provide practical guidance for selecting quantization strategies for deploying large code models on resource-constrained hardware and highlight the importance of evaluating quantized models beyond functional correctness.
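For readers unfamiliar with the prompt-complexity measure mentioned here, this is one way Shannon entropy and token length could be computed for a prompt. It is a sketch under assumptions: the abstract does not say how prompts are tokenized, so whitespace splitting is used as a stand-in, and `prompt_complexity` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import Dict, List


def shannon_entropy(tokens: List[str]) -> float:
    """Entropy in bits of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    # H = sum over tokens of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def prompt_complexity(prompt: str) -> Dict[str, float]:
    tokens = prompt.split()  # assumption: whitespace tokens, not model tokens
    return {"token_length": len(tokens), "entropy_bits": shannon_entropy(tokens)}


print(prompt_complexity("a b a b"))  # {'token_length': 4, 'entropy_bits': 1.0}
print(prompt_complexity("a b c d"))  # {'token_length': 4, 'entropy_bits': 2.0}
```

Two prompts of the same length can differ in entropy, which is why the study tracks both: a repetitive prompt is long but not necessarily complex.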

NexForge: Scaling Executable Agent Tasks via Requirement-First Synthesis cs.SE

Scaling executable agent training data is bottlenecked by substrate-first methods that tie task generation to predefined tools, repositories, or skill graphs: expanding coverage requires manual expansion of the substrate, each new domain demands a bespoke pipeline, and the resulting task distributions often reflect substrate convenience rather than real-world demand. We introduce NexForge, a requirement-first framework that compiles free-form capability requirements into executable agent training data. NexForge first performs research-based demand discovery to identify representative task forms, realistic scenarios, and their relative prevalence. It then applies distribution-aware task compilation and automatically retrieves or constructs the files, repositories, dependencies, and runtime configurations required to materialize each task, followed by teacher rollout collection and trajectory distillation. The same pipeline, without any domain-specific infrastructure, produces 3,600 terminal tasks and 2,000 office tasks, improving Qwen3.5-35B-A3B Base from 22.5% to 52.0% on Terminal-Bench 2.0 and from 813 to 1338 Elo on GDPval; scaling to 43.2K terminal tasks reaches 58.4%, surpassing Claude Opus 4.6. Scaled further, NexForge-synthesized data contributes to the training of Nex-N2, a family of publicly available agent models that lift Qwen3.5-35B-A3B to 75.3% on Terminal-Bench 2.1 and to 1585 Elo on GDPval -- achieving state-of-the-art open-source performance and surpassing several frontier proprietary systems. Nex-N2 models are available at https://nex.sii.edu.cn/

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	GPT-5.4	57.2	76	$5.63
2	Gemini 3.1 Pro Preview	57.2	118	$4.50
3	GPT-5.3 Codex	54	72	$4.81
4	Claude Opus 4.6	53	46	$10.00
5	Claude Sonnet 4.6	51.7	52	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	gpt-5.4-2026-03-05-medium	62.8%
5	Gemini 3.1 Pro Preview	62.3%
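To make the benchmark divergence concrete, the short script below compares where the same model lands in the two tables. It only matches models whose names are spelled identically in both lists; treating "GPT-5.4" and "gpt-5.4-2026-03-05-medium" as the same configuration would be an assumption, so they are left unmatched. Ranks are taken as listed, including the tied scores.

```python
from typing import Dict, List, Tuple

# Rankings copied from the two tables, in listed order.
intelligence_index: List[str] = [
    "GPT-5.4",
    "Gemini 3.1 Pro Preview",
    "GPT-5.3 Codex",
    "Claude Opus 4.6",
    "Claude Sonnet 4.6",
]
swe_rebench: List[str] = [
    "Claude Opus 4.6",
    "gpt-5.2-2025-12-11-medium",
    "GLM-5",
    "gpt-5.4-2026-03-05-medium",
    "Gemini 3.1 Pro Preview",
]


def rank_pairs(a: List[str], b: List[str]) -> Dict[str, Tuple[int, int]]:
    """Map each model present in both lists to its (rank in a, rank in b)."""
    return {m: (a.index(m) + 1, b.index(m) + 1) for m in a if m in b}


for model, (ii_rank, swe_rank) in rank_pairs(intelligence_index, swe_rebench).items():
    print(f"{model}: Intelligence Index #{ii_rank}, SWE-rebench #{swe_rank}")
```

Claude Opus 4.6 moves from fourth to first and Gemini 3.1 Pro Preview from second to fifth, so the two leaderboards reverse the order of the only two models they share by name.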

GitHub Repos

Trending

Yeachan-Heo/oh-my-codex

16055 ★

OmX - Oh My codeX: Your codex is not alone. Add hooks, agent teams, HUDs, and so much more.

onyx-dot-app/onyx

30010 ★

Open Source AI Platform - AI Chat with advanced features that works with every LLM

google-research/timesfm

24683 ★

TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model developed by Google Research for time-series forecasting.

siddharthvaddem/openscreen

24658 ★

Create stunning demos for free. Open-source, no subscriptions, no watermarks, and free for commercial use. An alternative to Screen Studio.

dmtrKovalenko/fff.nvim

3784 ★

The fastest and the most accurate file search toolkit for AI agents, Neovim, Rust, C, and NodeJS

Daily discovery

microsoft/presidio Transformers

7893 ★

An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.

dograh-hq/dograhai

1959 ★

Open Source Voice Agent Platform

basetenlabs/truss-examples MLOps

220 ★

Examples of models deployable with Truss

thinkwee/AgentsMeetRL RLHF

1204 ★

Awesome List for Agentic RL

genkit-ai/genkit Vector Database

5753 ★

Open-source framework for building AI-powered apps in JavaScript, Go, and Python, built and used in production by Google