No major AI safety incident or regulatory crackdown dominated the news cycle today. Instead, the story is structural: the infrastructure supporting AI development and deployment is splintering along three separate but reinforcing fault lines, each reshaping where power actually concentrates in the market.
The first fracture is geographic and political. Rural communities and Michigan jurisdictions are blocking data center construction outright, while Charlotte's mayor killed a moratorium vote that would have accelerated restrictions. This physical resistance is forcing a reckoning with inference costs that fixed pricing models can no longer absorb. GitHub's shift to usage-based billing for Copilot and Amazon's immediate counter-offer of OpenAI models on AWS after Microsoft loosened exclusivity terms both signal what happens when the bill comes due: competitive pressure forces open what looked like locked-in markets, but only for vendors with margin to spare. Companies without it will pass costs to users or exit the market.
The second fracture separates capability from control. Xiaomi open-sourced MiMo models under MIT licensing to give developers lower-cost alternatives for long-running agents, and analysts now argue enterprises don't need expensive GPUs for agentic AI workloads that run on business logic rather than raw compute. OpenAI released Symphony, an orchestration spec that lets coding agents pull work from issue trackers instead of running one task at a time, solving a bottleneck the company hit when engineers scaled Codex sessions. These are efficiency plays, but they reveal where the real margin is: not in the hardware or the model weights, but in orchestration and control. Google signed a new Pentagon contract after Anthropic refused to support domestic mass surveillance and autonomous weapons. The market doesn't restrict capability; it just reshuffles who deploys it. AWS, Microsoft, and NVIDIA now own the distribution layer to enterprise customers. Smaller labs sell differentiated capabilities through someone else's platform.
The third fracture separates the legal theater from the actual consolidation of power. Musk testified under oath about founding OpenAI to prevent a "Terminator outcome," relitigating old disputes about Altman's trustworthiness. Meanwhile, Amazon, Google, and OpenAI are quietly restructuring commercial relationships, enterprises are moving from experimentation to deployment of agents that make decisions across fragmented data, and Meta faces potential layoffs of over 700 AI training workers in Ireland. The feuding billionaires and their lawsuits occupy headlines, but the actual consolidation is happening in contract renegotiations, platform partnerships, and the unglamorous work of making agents reliable enough to control business workflows. Developers, meanwhile, are trending toward local tooling and client-side infrastructure to avoid vendor lock-in, signaling skepticism about cloud-hosted AI as currently packaged. The market is sorting itself not by capability but by who controls the last mile to the customer and who owns the integration layer.
Grant Calloway
Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.
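To make the idea of an explicit process state concrete, here is a minimal sketch in Python. It is not JustDiag's implementation: the class names, the fields, and the closure rule in `verdict` are all hypothetical, chosen only to mirror the five elements the abstract lists (evidence, findings, competing hypotheses, conflicts, next checks).

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One candidate root cause (hypothetical structure)."""
    name: str
    supporting: list[str] = field(default_factory=list)  # ids of evidence items
    refuted: bool = False


@dataclass
class DiagnosticState:
    """Illustrative process state; the field names are assumptions."""
    evidence: dict[str, str] = field(default_factory=dict)
    findings: list[str] = field(default_factory=list)
    hypotheses: list[Hypothesis] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)
    next_checks: list[str] = field(default_factory=list)

    def verdict(self) -> str:
        # Assumed rule: close only when exactly one supported hypothesis
        # survives and no contradiction is still open.
        live = [h for h in self.hypotheses if not h.refuted]
        if len(live) == 1 and live[0].supporting and not self.conflicts:
            return f"closed: {live[0].name}"
        return "open: uncertainty preserved"


pool = Hypothesis("connection pool exhaustion", supporting=["e1"])
dns = Hypothesis("DNS misconfiguration", supporting=["e2"])
state = DiagnosticState(
    evidence={"e1": "pool wait time spiked", "e2": "intermittent lookup errors"},
    hypotheses=[pool, dns],
    next_checks=["compare resolver logs against the incident window"],
)
before = state.verdict()  # two live hypotheses, so the case stays open
dns.refuted = True        # the follow-up check rules DNS out
after = state.verdict()
print(before)
print(after)
```

Under this assumed rule, a case with two surviving hypotheses stays open instead of forcing a fluent but unsupported answer, which is one way to read the abstract's "calibrated non-closure".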
Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.
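As a rough illustration of why carrying state in the runtime value makes fork and replay explicit, the sketch below uses a hypothetical `SessionSketch` class. It is not OpenRath's API; the fields and methods are assumptions, and it models only conversation chunks and lineage, leaving out sandbox placement, token usage, and tool evidence.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple id source for this sketch


@dataclass
class SessionSketch:
    """Hypothetical stand-in for a session value; not OpenRath's class."""
    chunks: list[str] = field(default_factory=list)
    parent_id: int | None = None
    session_id: int = field(default_factory=lambda: next(_ids))

    def append(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def fork(self) -> SessionSketch:
        # The branch copies the record so far and remembers its parent.
        return SessionSketch(chunks=list(self.chunks), parent_id=self.session_id)

    def replay(self) -> list[str]:
        # Replay reads the carried record rather than an external trace.
        return list(self.chunks)


root = SessionSketch()
root.append("user: fix the failing test")
branch = root.fork()
branch.append("agent: tried patch A")
print(root.replay())
print(branch.replay(), "forked from session", branch.parent_id)
```

Because the branch copies the record and stores its parent's id, its provenance can be read off the value itself.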
Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.
Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
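The convergence claim in (ii) can be illustrated with a toy append-only log. This is a sketch under assumptions and not grite's data model: `Event`, `merge`, and `owners` are hypothetical, signing and git storage are omitted, and "earliest claim wins" is an assumed tie-break rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Event:
    """One immutable log entry (hypothetical shape); sorts by ts, then actor."""
    ts: int
    actor: str
    kind: str
    task: str


def merge(a: frozenset, b: frozenset) -> frozenset:
    # Set union is commutative, associative, and idempotent, so replicas
    # that exchange logs in any order end up with the same event set.
    return a | b


def owners(log: frozenset) -> dict:
    """Resolve task ownership deterministically: the earliest claim wins."""
    claimed = {}
    for ev in sorted(log):
        if ev.kind == "claim":
            claimed.setdefault(ev.task, ev.actor)
    return claimed


replica_a = frozenset({Event(1, "agent-a", "claim", "T1")})
replica_b = frozenset({
    Event(2, "agent-b", "claim", "T1"),  # collides with agent-a's claim
    Event(3, "agent-b", "claim", "T2"),
})
merged_ab = merge(replica_a, replica_b)
merged_ba = merge(replica_b, replica_a)
print(merged_ab == merged_ba)
print(owners(merged_ab))
```

Because the losing claim stays in the merged log instead of being overwritten, a collision like agent-b's late claim on T1 is still there to be mined afterwards.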
Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 68 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 56 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 133 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 90 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 0 | $1.71 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
My personal directory of skills, straight from my .claude directory.
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop in a GitHub repo or ZIP file, and get an interactive knowledge graph with a built-in Graph RAG Agent. Perfect for code exploration
A curated list of practical Codex skills for automating workflows across the Codex CLI and API.
Open-Source Frontier Voice AI
CLI tool for configuring and monitoring Claude Code
ZenML 🙏: One AI Platform from Pipelines to Agents. https://zenml.io.
Self-hosted OpenClaw gateway + agent runtime in .NET (NativeAOT-friendly)
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
Turn any web interface into an AI agent — for humans and machines. Open-source, DOM-native SDK. Sub-second actions, no screenshots, no VMs. Websites, Chrome extensions, Electron apps, and more.
The open source research environment for AI researchers to seamlessly train, evaluate, and scale models from local hardware to GPU clusters.