Capital is abandoning manufacturing for silicon. SoftBank's market value now exceeds Toyota's, marking the definitive end of Japan's industrial era and the beginning of its AI hardware bet. Ardian is building a five-billion-euro data center outside Paris. Intel is shipping inference GPUs by year end. The infrastructure thesis is being priced in at scale, and the money has moved decisively from pipes to software and back again to pipes. What's being tested now is whether returns justify deployment, and whether the public will tolerate the environmental and labor costs of getting there.
NVIDIA has moved beyond selling chips. At Computex, the company announced a unified stack: hardware, software layers, reference designs, open source tools, and now Hugging Face integration of NVIDIA Cosmos 3 into its model hub. The architecture makes it cheaper and faster to build on NVIDIA infrastructure than anywhere else, then calls it openness. Taiwan's manufacturing ecosystem serves as both proof of concept and supply chain lock-in. Token demand is exploding. Every factory, hospital, and developer scaling agentic AI is being funneled toward the same compute layer. This isn't about winning deals. It's about making the alternative friction-filled enough that builders stop considering it.
Beneath this confidence sits a widening credibility gap. Gen Z uses AI more than any cohort but increasingly views it as a threat rather than opportunity. Erin Brockovich's scrutiny of data center secrecy suggests environmental and transparency concerns will become friction points as capacity demands accelerate. Wall Street strategists are shrugging off bubble warnings while simultaneously betting on AI stocks to defy gravity, a posture that depends entirely on the infrastructure thesis holding.
Developer energy has already shifted from "what if we use an LLM" to "how do we actually operate these systems." GitHub trending shows agent coordination tools like Harness, CORAL, and Pi Subagents solving real problems that didn't exist two years ago. Aegis enforces runtime policies with cryptographic audit trails. Claude Code dominates, but the deeper signal is the emergence of glue tools: MarkItDown converts documents, Scrapling handles web scraping, Supermemory builds memory layers. Openpilot remains the most interesting robotics repo because it ships actual code running on actual cars, not research simulations. The practical layer is solidifying while the infrastructure layer consolidates.
Grant Calloway
Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.
Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.
Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Claude Sonnet 4.6 Refine achieves the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is approximately 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.
We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 63 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 66 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 60 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 144 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 86 | $5.63 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
Python tool for converting files and office documents to Markdown.
🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!
Hermes WebUI: The best way to use Hermes Agent from the web or from your phone!
Official Compound Engineering plugin for Claude Code, Codex, Cursor, and more
The open-source repo for docs.github.com
OpenKB: Open LLM Knowledge Base
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
ModelScope: bring the notion of Model-as-a-Service to life.
Production ready toolkit to run AI locally
Runtime policy enforcement for AI agents. Cryptographic audit trail, human-in-the-loop approvals, kill switch. Zero code changes.