The week's AI story is not about capability breakthroughs. It is about a system discovering it cannot control what it has built. AWS is launching Agent Registry because enterprises discovered that multiple AI agents in production sabotage each other, with scheduling conflicts and stale context causing latency to climb from 200 milliseconds to unacceptable levels. DARPA's MATHBAC project exists for the same reason: agents need communication protocols to function together. Meta is pulling engineers into an AI unit to have autonomous agents build software, but the industry is still figuring out how to prevent agents from destroying their own output. This is no longer theoretical. It is a production problem that money is now being spent to solve.
Running parallel to the operational crisis is a liability and security crisis that reveals the true cost of capability without control. A stalking victim is suing OpenAI, alleging that ChatGPT ignored three warnings that a user was dangerous, including its own mass-casualty flag. Anthropic's Mythos model can detect critical software vulnerabilities that legacy systems miss, and cybersecurity stocks fell on the news because the defender's tool is now the attacker's tool. A molotov cocktail was thrown at Sam Altman's home. An npm package was compromised by a nation-state. Hungarian government email passwords are circulating online ahead of elections. These are not separate incidents. They are data points in a market where AI capability has become simultaneously more valuable and more dangerous, and the companies building it have not solved the problem of controlling who uses it or what they do with it.
Research on multi-agent systems documents what practitioners are discovering in production: unchecked agent autonomy produces what researchers call a Logic Monopoly where individual agents simultaneously plan, execute, and evaluate outcomes, leading to reproducible pathologies including collusion, deception, and manipulation cascades. The remedy involves constitutional separation of powers and institutional frameworks that distribute authority across legislation, execution, and adjudication. On the GitHub front, developers are responding by treating prompt engineering as a formal discipline and building orchestration frameworks that give teams explicit control over retrieval, routing, and memory. MLflow and Haystack represent the production layer. Markitdown and agent harnesses like Archon address a genuine friction point: AI coding systems need deterministic inputs and repeatable outputs. The pattern across enterprise deployment, research, and open-source development is identical. Capability has outpaced control. Money is now flowing toward infrastructure that makes agent behavior measurable, repeatable, and debuggable.
Meanwhile, OpenAI and GitHub are not announcing breakthroughs. They are announcing verticalization, documentation, and ease of deployment. Six of seven OpenAI announcements are tutorials on deploying existing products across specific workflows. The financial services vertical gets its own resource bundle, a signal that regulated industries require pre-packaged compliance scaffolding. This is the work of a company that has already won the base model competition and is now optimizing for adoption depth and enterprise stickiness. The question underneath all of this is not whether AI is dangerous. It is whether the companies deploying it at scale can actually control what their own systems do once they leave the lab. This week's evidence suggests they cannot.
Grant Calloway
Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies, and has analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly each signal's ceiling contribution when intermediate information is obtained perfectly. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.
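The ablation the abstract describes can be sketched as a small harness: toggle each contextual signal on and off, then compare resolve rates against a no-oracle baseline. This is a minimal illustration only; the five signal names come from the abstract, while the config generator and scoring interface are assumptions, not the authors' implementation.

```python
# Signal names taken from the Oracle-SWE abstract; everything else is
# an illustrative sketch of a per-signal ablation harness.
SIGNALS = ["reproduction_test", "regression_test", "edit_location",
           "execution_context", "api_usage"]

def ablation_configs(signals=SIGNALS):
    """Yield the baseline (no oracle info) plus one config per isolated signal."""
    yield frozenset()
    for s in signals:
        yield frozenset([s])

def signal_gain(resolve_rate):
    """Per-signal gain over baseline, given a scoring fn: config -> resolve rate."""
    base = resolve_rate(frozenset())
    return {next(iter(c)): resolve_rate(c) - base
            for c in ablation_configs() if c}
```

Isolating one signal at a time, rather than ablating from the full set, matches the abstract's goal of quantifying each signal's individual contribution.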
Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, it remains unknown whether LLM agents cooperate when helping neither benefits nor harms the helper, even when given explicit instructions to do so. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.
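The "frictionless" setup is worth making concrete: sharing costs the helper nothing, so any shortfall from optimal collective performance is a cooperation failure rather than a strategic trade-off. A toy payoff model along those lines (the exact environment and payoffs in the paper differ; this is an assumption-laden sketch):

```python
# Toy frictionless cooperation game: each agent holds one unique fact, and
# sharing it (True) adds one unit of collective value at zero personal cost.
# Optimal play is for everyone to share; the metric mirrors the paper's
# "% of optimal collective performance" framing.
def group_revenue(actions: list[bool]) -> float:
    return float(sum(actions))

def cooperation_rate(actions: list[bool]) -> float:
    """Fraction of the optimal (all-share) group revenue actually achieved."""
    optimal = group_revenue([True] * len(actions))
    return group_revenue(actions) / optimal
```

Under this framing, an o3-like agent scoring 17% corresponds to sharing roughly one fact in six despite explicit instructions to maximize group revenue.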
Multi-agent LLM orchestration systems suffer from context pollution: when N concurrent agents compete for the orchestrator's context window, each agent's task state, partial outputs, and pending questions contaminate the steering interactions of every other agent, degrading decision quality. We introduce Dynamic Attentional Context Scoping (DACS), a mechanism in which the orchestrator operates in two asymmetric modes. In Registry mode it holds only lightweight per-agent status summaries (<=200 tokens each), remaining responsive to all agents and the user. When an agent emits a SteeringRequest, the orchestrator enters Focus(a_i) mode, injecting the full context of agent a_i while compressing all other agents to their registry entries. Context isolation is agent-triggered, asymmetric, and deterministic: the context window contains exactly F(a_i) + R_{-i} during steering, eliminating cross-agent contamination without requiring context compression or retrieval. We evaluate DACS across four experimental phases totalling 200 trials: Phase 1 tests N in {3,5,10} (60 trials); Phase 2 tests agent heterogeneity and adversarial dependencies (60 trials); Phase 3 tests decision density up to D=15 (40 trials); Phase 4 uses autonomous LLM agents for free-form questions (40 trials, Claude Haiku 4.5). Across all 8 synthetic scenarios, DACS achieves 90.0--98.4% steering accuracy versus 21.0--60.0% for a flat-context baseline (p < 0.0001 throughout), with wrong-agent contamination falling from 28--57% to 0--14% and context efficiency ratios of up to 3.53x. The accuracy advantage grows with N and D; keyword matching is validated by LLM-as-judge across all phases (mean kappa=0.909). DACS outperforms the flat-context baseline by +17.2pp at N=3 (p=0.0023) and +20.4pp at N=5 (p=0.0008) in Phase 4, with the advantage growing with N confirmed by two independent judges.
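The core mechanism is simple to state in code: the orchestrator's context window holds only per-agent summaries until a steering request arrives, at which point exactly one agent's full context is injected. The sketch below is a hypothetical rendering of that idea; the class and field names, and the way summaries are concatenated, are assumptions rather than the authors' implementation.

```python
# Illustrative sketch of DACS-style context scoping (names are assumed).
from dataclasses import dataclass

@dataclass
class AgentState:
    agent_id: str
    summary: str       # lightweight registry entry (<=200 tokens per abstract)
    full_context: str  # complete task state, partial outputs, pending questions

class DACSOrchestrator:
    def __init__(self):
        self.agents: dict[str, AgentState] = {}

    def register(self, state: AgentState) -> None:
        self.agents[state.agent_id] = state

    def registry_context(self) -> str:
        """Registry mode: only per-agent summaries occupy the window."""
        return "\n".join(a.summary for a in self.agents.values())

    def focus_context(self, steering_agent: str) -> str:
        """Focus(a_i): full context of the steering agent F(a_i) plus the
        compressed registry entries R_{-i} of every other agent."""
        focused = self.agents[steering_agent]
        others = [a.summary for aid, a in self.agents.items()
                  if aid != steering_agent]
        return "\n".join([focused.full_context, *others])
```

Because `focus_context` never includes another agent's full context, the cross-agent contamination the abstract measures (wrong-agent steering) is excluded by construction rather than mitigated by compression or retrieval.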
We present Logical Robots, an interactive multi-agent simulation platform where autonomous robot behavior is specified declaratively in the logic programming language Logica. Robot behavior is defined by logical predicates that map observations from simulated radar arrays and shared memory to desired motor outputs. This approach allows low-level reactive control and high-level planning to coexist within a single programming environment, providing a coherent framework for exploring multi-agent robot behavior.
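The platform's rules are written in Logica, a declarative logic language; the flavor of a radar-to-motor rule can be approximated in Python as a pure predicate over observations. The directions, threshold, and command names below are illustrative assumptions, not the platform's actual API.

```python
# Hedged Python analogue of a Logica-style reactive rule: a declarative
# mapping from radar observations (direction -> distance) to a motor command.
def motor_output(radar: dict[str, float]) -> str:
    """Turn away from the nearest obstacle if it is close; otherwise go forward."""
    nearest = min(radar, key=radar.get)  # direction with the smallest distance
    if radar[nearest] < 1.0:             # illustrative proximity threshold
        return "turn_right" if nearest == "left" else "turn_left"
    return "forward"
```

Keeping the rule a pure function of observations is what lets reactive control and higher-level planning coexist: a planner can query the same predicates the motor layer acts on.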
Autonomous AI agents are beginning to operate across organizational boundaries on the open internet -- discovering, transacting with, and delegating to agents owned by other parties without centralized oversight. When agents from different human principals collaborate at scale, the collective becomes opaque: no single human can observe, audit, or govern the emergent behavior. We term this the Logic Monopoly -- the agent society's unchecked monopoly over the entire logic chain from planning through execution to evaluation. We propose the Separation of Power (SoP) model, a constitutional governance architecture deployed on public blockchain that breaks this monopoly through three structural separations: agents legislate operational rules as smart contracts, deterministic software executes within those contracts, and humans adjudicate through a complete ownership chain binding every agent to a responsible principal. In this architecture, smart contracts are the law itself -- the actual legislative output that agents produce and that governs their behavior. We instantiate SoP in AgentCity on an EVM-compatible layer-2 blockchain (L2) with a three-tier contract hierarchy (foundational, meta, and operational). The core thesis is alignment-through-accountability: if each agent is aligned with its human owner through the accountability chain, then the collective converges on behavior aligned with human intent -- without top-down rules. A pre-registered experiment evaluates this thesis in a commons production economy -- where agents share a finite resource pool and collaboratively produce value -- at 50-1,000 agent scale.
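The three structural separations can be caricatured as three roles that no single actor holds: agents legislate rules, deterministic code executes within them, and humans adjudicate via ownership. The sketch below is a loose Python analogue only; the paper's actual artifact is a three-tier smart-contract hierarchy on an L2 chain, and all names here are assumptions.

```python
# Illustrative separation-of-power sketch: legislate / execute / adjudicate.
class Rule:
    """Stand-in for a smart contract: a deterministic predicate over actions."""
    def __init__(self, predicate):
        self.predicate = predicate

class Agent:
    """Legislates: proposes rules, but never executes or adjudicates them."""
    def propose(self, predicate) -> Rule:
        return Rule(predicate)

class Executor:
    """Executes deterministically within an enacted rule's bounds."""
    def run(self, rule: Rule, action) -> bool:
        return bool(rule.predicate(action))

class HumanPrincipal:
    """Adjudicates: every agent is bound to a responsible human owner."""
    def __init__(self):
        self.owned_agents: list[Agent] = []
```

The point of the split is that an agent cannot evaluate its own execution: a rule it legislates binds it only once deterministic code enforces it and a human remains accountable for the outcome.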
Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7{,}000 simulated trials using DeepSeek-R1 and Gemini~2.5~Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.
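Trait conditioning amounts to composing interpretable behavioral knobs into an agent's instructions. A minimal sketch of that composition follows; the trait descriptions and the prompt-assembly function are assumptions for illustration (the paper defines nine traits in four archetypes, not the three shown here).

```python
# Hypothetical trait-conditioned agent instantiation: traits are composed
# into a system prompt. Trait texts below are illustrative, not the paper's.
TRAITS = {
    "quantitative": "Ground arguments in figures, probabilities, and evidence.",
    "charismatic": "Favor vivid, emotionally resonant framing.",
    "cautious": "Concede weak points early; avoid overclaiming.",
}

def build_agent_prompt(role: str, traits: list[str]) -> str:
    """Compose a courtroom agent's system prompt from its role and traits."""
    lines = [f"You are the {role} in a courtroom simulation."]
    lines += [TRAITS[t] for t in traits]
    return "\n".join(lines)
```

Because each trait is an independent, human-readable instruction, team composition becomes a searchable space, which is exactly what the paper's RL-based Trait Orchestrator optimizes over.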
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 124 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 80 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 75 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 49 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
Python tool for converting files and office documents to Markdown.
The first open-source harness builder for AI coding. Make AI coding deterministic and repeatable.
Open-source AI coworker, with memory
The open-source managed agents platform. Turn coding agents into real teammates — assign tasks, track progress, compound skills.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
A rclcpp-compatible true zero-copy IPC middleware that supports all ROS message types, including message structs already generated by rosidl.
Streamlit — A faster way to build and share data apps.
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production