The Inference Report

April 11, 2026

The week's AI story is not about capability breakthroughs. It is about an industry discovering that it cannot control what it has built. AWS is launching Agent Registry because enterprises found that multiple AI agents in production sabotage each other: scheduling conflicts and stale context push latency from 200 milliseconds to unacceptable levels. DARPA's MATHBAC project exists for the same reason: agents need communication protocols to function together. Meta is pulling engineers into an AI unit to have autonomous agents build software, while the industry is still figuring out how to keep agents from destroying their own output. This is no longer theoretical. It is a production problem that money is now being spent to solve.

Running parallel to the operational crisis is a liability and security crisis that reveals the true cost of capability without control. A stalking victim is suing OpenAI because ChatGPT ignored three warnings, including its own mass-casualty flag, that a user was dangerous. Anthropic's Mythos model can detect critical software vulnerabilities that legacy systems miss, and cybersecurity stocks fell on the news because the defender's tool is now the attacker's tool. A Molotov cocktail was thrown at Sam Altman's home. An npm package was compromised by a nation-state. Hungarian government email passwords are circulating online ahead of elections. These are not separate incidents. They are data points in a market where AI capability has become simultaneously more valuable and more dangerous, and the companies building it have not solved the problem of controlling who uses it or what they do with it.

Research on multi-agent systems documents what practitioners are discovering in production: unchecked agent autonomy produces what researchers call a "Logic Monopoly," in which individual agents simultaneously plan, execute, and evaluate outcomes, leading to reproducible pathologies including collusion, deception, and manipulation cascades. The proposed remedy is a constitutional separation of powers: institutional frameworks that distribute authority across legislation, execution, and adjudication. On GitHub, developers are responding by treating prompt engineering as a formal discipline and building orchestration frameworks that give teams explicit control over retrieval, routing, and memory. MLflow and Haystack represent the production layer. Markitdown and agent harnesses like Archon address a genuine friction point: AI coding systems need deterministic inputs and repeatable outputs. The pattern across enterprise deployment, research, and open-source development is identical. Capability has outpaced control. Money is now flowing toward infrastructure that makes agent behavior measurable, repeatable, and debuggable.

Meanwhile, OpenAI and GitHub are not announcing breakthroughs. They are announcing verticalization, documentation, and ease of deployment. Six of seven OpenAI announcements are tutorials on deploying existing products across specific workflows. The financial services vertical gets its own resource bundle, a signal that regulated industries require pre-packaged compliance scaffolding. This is the work of a company that has already won the base model competition and is now optimizing for adoption depth and enterprise stickiness. The question underneath all of this is not whether AI is dangerous. It is whether the companies deploying it at scale can actually control what their own systems do once they leave the lab. This week's evidence suggests they cannot.

Grant Calloway

AI Labs

GitHub Blog

GitHub Copilot CLI for Beginners: Getting started with GitHub Copilot CLI

OpenAI

From the Wire

Research Papers — Focused

The Energy Society: A Simulation Environment for Studying Agent Cooperation under Survival Pressure cs.MA

LLM-based agents are increasingly deployed in multi-agent environments whose incentives can shape their behavior. We introduce The Energy Society, a minimal survival economy for studying how competitive and cooperative incentives affect emergent behavior when inference cost is directly tied to survival: Agents spend energy based on model size when generating tokens, regain energy by completing jobs or receiving donations, and deactivate if their energy reaches zero. We compare competitive and cooperative objectives against a baseline setting and several control variants. Across experiments, larger models consistently consume the most energy and spend more energy than they gain, even in those settings where token cost is not size-dependent. Cooperative incentives substantially alter behavior: agents donate to reactivate others, sometimes at the cost of their own survival, and job allocation changes. Ablations reveal that allowing agents to recommend actions to each other supports coordination and ambitious job selection, while memory helps agents calibrate risk from past outcomes. Agents rarely choose direct sabotage, but show more subtle signs of self-serving behavior in the competitive setting. The Energy Society is a compact testbed for studying the interaction between token costs and group incentives under a survival pressure. Source code is available at https://github.com/LucasBergholdt/EnergySociety
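The survival mechanics described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above). The `Agent` class, the `model_size` multiplier, the `cost_per_token` value, and all numbers in the usage lines are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical agent in a survival economy (illustrative only)."""
    name: str
    model_size: float   # assumed relative size; scales the per-token cost
    energy: float
    active: bool = True

    def generate(self, tokens: int, cost_per_token: float = 0.01) -> None:
        """Spend energy in proportion to tokens and model size."""
        if not self.active:
            return
        self.energy -= tokens * cost_per_token * self.model_size
        self._check_survival()

    def complete_job(self, reward: float) -> None:
        """Regain energy by finishing a job (active agents only)."""
        if self.active:
            self.energy += reward

    def donate(self, other: "Agent", amount: float) -> float:
        """Give up to `amount` energy; this may reactivate `other`
        and may cost the donor its own survival."""
        if not self.active:
            return 0.0
        given = min(amount, self.energy)
        self.energy -= given
        other.energy += given
        if other.energy > 0:
            other.active = True
        self._check_survival()
        return given

    def _check_survival(self) -> None:
        # An agent deactivates once its energy reaches zero.
        if self.energy <= 0:
            self.energy = 0.0
            self.active = False


# Usage with made-up numbers: the larger model burns energy faster.
small = Agent("small", model_size=1.0, energy=10.0)
large = Agent("large", model_size=8.0, energy=10.0)
for _ in range(2):
    small.generate(100)
    large.generate(100)
print(small.active, large.active)   # large has run out of energy
small.donate(large, 3.0)            # a donation brings it back
print(large.active, large.energy)
```

Tying cost to `model_size` is one simple way to model "agents spend energy based on model size"; the paper's control variants, where token cost is not size-dependent, would correspond to fixing the multiplier at a constant.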

Social Simulations: from Agent-Based Modeling to Digital Twins cs.MA

This book chapter covers the evolution of social simulation from classical agent-based models, in which agents interact according to explicitly defined behavioral rules, to AI-enhanced simulations based on Large Language Models and, ultimately, Social Digital Twins: high-fidelity, data-driven representations of real-world socio-technical systems. Along this trajectory, we discuss the main methodological foundations, applications, advantages, and limitations of each paradigm, highlighting the progressive shift from abstract models designed to investigate general social mechanisms toward increasingly realistic computational representations of specific social systems.

MetaInfer: A Knowledge Only LLM Inference Engine Generator SKILL Toolbox cs.MA

As LLM technology advances, the space of model families, compute hardware, quantization schemes, parallelization strategies, and specialized optimization kernels continues to expand, sharply increasing the code complexity and maintenance cost of general-purpose inference frameworks. Conventional software engineering uses multiple layers of abstraction to support diverse application scenarios, but these abstractions also increase system complexity and may introduce additional performance overhead. This paper presents metainfer, an 'LLM-as-Compiler' approach in which users specify only the runtime constraints of an inference program. An LLM-driven multi-agent collaboration system, coupled with a contract knowledge base, then automatically generates a compact customized inference framework that satisfies these constraints. We evaluate metainfer from three perspectives: the effect of source-code reference, the runtime behavior and performance profile of engines generated under the zero-reference constraint on CKB-covered targets, and knowledge-base evolution for new model and platform scenarios. The results show that metainfer organizes generation constraints, validation feedback, and knowledge consolidation into a continuous closed loop, enabling runnable customized inference solutions to be generated from explicit knowledge. The code is publicly available at https://github.com/MetaInfer/MetaInfer.

Distributed Agent System: Fault-Tolerant Collaboration Among Embodied Agents cs.MA

AI engineering is shifting from passive text generation by large language models (LLMs) to agent-driven task execution, creating new reliability challenges for long-horizon tasks under resource constraints and environmental uncertainty. Conventional error-elimination optimization strategies fail to address cumulative error propagation. This paper proposes Distributed Agent System (DAS), a device-edge-cloud framework for fault-tolerant collaboration among heterogeneous agents. We redefine agent reliability as system-level fault tolerance rather than single-turn zero-error accuracy, and present a two-layer fault-tolerance architecture: single-agent execution reliability via fault-tolerant alignment, and cross-agent communication reliability via semi-formal language protocols. This framework provides a practical engineering pathway for reliable collaboration among heterogeneous embodied agents in industrial scenarios.

Auditing Belief-Conditioned LLM Agents in Hidden-Information Social Deduction Games cs.MA

Evaluating LLM agents in hidden-information multi-agent settings is hard: final outcomes are high-variance and rarely reveal why an agent decided as it did. We study this in a 9-player Werewolf environment where agents act under strict, code-level information isolation, and we build an auditable framework that maintains an external belief state over hidden roles, logs belief updates and belief-action deviations as structured evidence, and supports a defensive offline improvement loop that reviews bad cases before any strategy change. Across 1,080 frozen games spanning belief-disabled, active-belief, kernel-ablation, camp-restricted, consumption-policy, and high-load arms, and including a seed-paired A0/A1 comparison, the active-belief condition is associated with substantially better good-side outcomes: in the 200-seed A0/A1 comparison the good-side win rate rises from 0.205 to 0.390 (paired McNemar $\chi^2 = 16.4$, $p < 0.001$), with fewer irreversible witch-poison errors. We do not, however, attribute this shift to belief content. Direct action-belief consistency is low ($\approx 0.21$), and giving belief only to the werewolves helps the good side more than giving it only to the good side, which argues against a simple holder-benefit account; we therefore report the effect as an association and treat its mechanism as unresolved. The contribution is the audit framework itself: it makes the effect measurable, exposes low direct action-belief consistency, rejects an unreliable forced-consumption intervention with evidence, and separates strategy effects from load confounds. We accordingly position external belief in high-noise hidden-information games primarily as an auditable cognitive baseline that also carries decision-relevant signal, turning opaque agent behavior into replayable evidence for safer, controlled iteration.
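For readers unfamiliar with the paired McNemar test quoted in this abstract, the sketch below shows how such a statistic is computed from seed-paired outcomes. The abstract reports the win rates (0.205 and 0.390 over 200 seeds, which is 41 and 78 good-side wins) but not the discordant-pair counts, so the values of `b` and `c` below are hypothetical, chosen only to match the net change of 37 games. Whether the authors applied a continuity correction is also an assumption.

```python
import math


def mcnemar_chi2(b: int, c: int, continuity: bool = True) -> float:
    """McNemar chi-square statistic from discordant pair counts.

    b: pairs won under the new condition but lost under the old one
    c: pairs lost under the new condition but won under the old one
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # Edwards continuity correction
    return diff * diff / (b + c)


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical discordant counts (NOT reported in the abstract).
b, c = 58, 21
stat = mcnemar_chi2(b, c)
p_value = chi2_sf_1df(stat)
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

Only the discordant seeds enter the statistic: seeds where both conditions won, or both lost, carry no information about which condition is better, which is why a seed-paired design can detect a shift that unpaired win rates would estimate less precisely.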

Multi-Agent LLMs Fail to Explore Each Other cs.MA

Exploration is essential for reliable autonomy in multi-agent systems, yet it remains unclear whether large language model (LLM) agents can explore effectively when interacting with one another. We show that modern LLM agents fail to do so, often exhibiting myopic and polarized interaction patterns that lead to suboptimal coordination and increased regret. We formalize this challenge as the Multi-Agent Exploration problem, modeling it as a partially observable stochastic game (POSG) in which agents must probe peers to infer their capabilities and identify effective interaction strategies. To address this, we introduce Multi-Agent Contextual Exploration (MACE), a lightweight framework that explicitly promotes exploration through structured peer selection. Across both contextual and parametric diversity settings, MACE substantially improves exploration behavior and downstream task performance. We further show theoretically that the value of exploration increases with agent diversity. Overall, our results highlight a fundamental limitation of current LLM agents and underscore the importance of explicitly guided exploration for reliable multi-agent autonomy. Code will be released at https://github.com/deeplearning-wisc/mace

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	Gemini 3.1 Pro Preview	57.2	124	$4.50
2	GPT-5.4	56.8	80	$5.63
3	GPT-5.3 Codex	53.6	75	$4.81
4	Claude Opus 4.6	53	49	$10.00
5	Muse Spark	52.1	0	$0.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Opus 4.6	65.3%
2	gpt-5.2-2025-12-11-medium	64.4%
3	GLM-5	62.8%
4	gpt-5.4-2026-03-05-medium	62.8%
5	Gemini 3.1 Pro Preview	62.3%

GitHub Repos

Trending

microsoft/markitdown

143443 ★

Python tool for converting files and office documents to Markdown.

coleam00/Archon

17801 ★

The first open-source harness builder for AI coding. Make AI coding deterministic and repeatable.

NousResearch/hermes-agent

212458 ★

rowboatlabs/rowboat

13911 ★

Open-source AI coworker, with memory

multica-ai/multica

32823 ★

The open-source managed agents platform. Turn coding agents into real teammates — assign tasks, track progress, compound skills.

Daily discovery

pytorch/pytorch (autograd)

101472 ★

Tensors and Dynamic neural networks in Python with strong GPU acceleration

mlflow/mlflow (MLOps)

26483 ★

The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

autowarefoundation/agnocast (Robotics)

191 ★

A rclcpp-compatible true zero-copy IPC middleware that supports all ROS message types, including message structs already generated by rosidl.

streamlit/streamlit (Deep Learning)

44735 ★

Streamlit — A faster way to build and share data apps.

huggingface/tokenizers (Transformers)

10612 ★

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production