The AI industry's most significant transformation this week is not happening in model benchmarks or research papers but in the physical infrastructure that underpins everything. OpenAI announced a $110 billion raise at a $730 billion valuation, with $30 billion each from SoftBank and NVIDIA plus $50 billion from Amazon, signaling that the bottleneck in AI development has shifted decisively from model capability to compute capacity. The simultaneous partnerships with both Microsoft and Amazon mean the two dominant cloud providers are now formally aligned with a single frontier lab, a consolidation of compute power without modern precedent. NVIDIA's own $30 billion commitment reveals where the real constraint lies: not in algorithms but in the physical layer of data centers, fiber optics, power delivery, and cooling.
Yet this capital surge arrives amid intensifying pressure from two directions that are reshaping the industry's trajectory. The Pentagon's pressure on Anthropic, culminating in Defense Secretary Pete Hegseth's explicit threat to use the Defense Production Act, represents the most aggressive government intervention in a private AI company to date. OpenAI's rushed Pentagon deal, announced immediately after Anthropic's public reprimand, shows how quickly the industry is reorganizing around national security priorities. The consumer backlash has been equally swift: ChatGPT uninstalls surged 295 percent after the DoD deal became public, with users fleeing to Claude. Tech workers at Amazon, Google, and Microsoft have signed open letters urging executives to reject these arrangements, but the capital and political pressure flowing the other direction appears irresistible.
The enterprise case for AI is simultaneously forcing a brutal restructuring of labor that makes the defense partnerships feel almost inevitable. Block's decision to cut 40 percent of its workforce, reducing headcount from over 10,000 to under 6,000, is not a distress signal: the company posted $10.36 billion in gross profit last year, up 17 percent. Jack Dorsey's letter to shareholders made the logic explicit: AI tools have made a leaner organization not just possible but strategically preferable, and most companies are late to this realization. Amazon announced 14,000 job cuts while increasing AI investment. The messaging is consistent across Silicon Valley: the technology has arrived, the workforce reductions are features not bugs.
The infrastructure supporting this transformation is hitting real physical constraints that threaten to cap this expansion. Iowa adopted what may be the strictest data center zoning rules in the country, yet residents say it is not enough. xAI spent $7 million on a sound wall that does little to muffle noise from its Memphis power plant, prompting local fury. Microsoft pledged to cover full power costs and reject local tax breaks for its data centers, a response to growing backlash that suggests the political economy of AI infrastructure is becoming untenable. The memory shortage driving PC shipments down over 10 percent this year is directly traceable to hyperscalers gobbling up supply for AI data centers.
The product strategy emerging from this pressure is agentic AI, which represents both an attempt to deliver on the technology's promise and a way to manage its unpredictability. Perplexity launched a "Computer" that assigns tasks to sub-agents. Salesforce rebuilt Slackbot into an autonomous agent. Anthropic released Cowork, extending Claude Code to non-technical users. The coding agent space is particularly intense: Cursor hit $2 billion in annualized revenue, Claude Code costs up to $200 per month, and free alternatives like Goose are gaining traction. The NousCoder-14B model trained in just four days on 48 Nvidia B200 GPUs demonstrates how quickly the technical floor is dropping. But the same capability that makes these tools valuable also makes them unpredictable. OpenClaw, the viral agentic tool known for being highly capable and wildly unstable, has led Meta and other firms to restrict its use due to security fears. This is the paradox at the heart of the agentic turn: the more autonomous these systems become, the more they need to be contained.
Grant Calloway
Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
Selective conformal prediction can yield substantially tighter uncertainty sets when we can identify calibration examples that are exchangeable with the test example. In interventional settings, such as perturbation experiments in genomics, exchangeability often holds only within subsets of interventions that leave a target variable "unaffected" (e.g., non-descendants of an intervened node in a causal graph). We study the practical regime where this invariance structure is unknown and must be learned from data. Our contributions are: (i) a contamination-robust conformal coverage theorem that quantifies how misclassification of "unaffected" calibration examples degrades coverage via an explicit function $g(δ,n)$ of the contamination fraction and calibration set size, providing a finite-sample lower bound that holds for arbitrary contaminating distributions; (ii) a task-driven partial causal learning formulation that estimates only the binary descendant indicators $Z_{a,i}=\mathbf{1}\{i\in\mathrm{desc}(a)\}$ needed for selective calibration, rather than the full causal graph; and (iii) algorithms for descendant discovery via perturbation intersection patterns (differentially affected variable set intersections across interventions), and for approximate distance-to-intervention estimation via local invariant causal prediction. We provide recovery conditions under which contamination is controlled. Experiments on synthetic linear structural equation models (SEMs) validate the bound: under controlled contamination up to $δ=0.30$, the corrected procedure maintains $\ge 0.95$ coverage while uncorrected selective CP degrades to $0.867$. A proof-of-concept on Replogle K562 CRISPR interference (CRISPRi) perturbation data demonstrates applicability to real genomic screens.
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target rate calibration, especially if CoT is no longer legible.
The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.
An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 81 | $4.50 |
| 2 | GPT-5.3 Codex | 54 | 59 | $4.81 |
| 3 | Claude Opus 4.6 | 53 | 51 | $10.00 |
| 4 | Claude Sonnet 4.6 | 51.7 | 43 | $6.00 |
| 5 | GPT-5.2 | 51.3 | 63 | $4.81 |
💖🧸 Self hosted, you-owned Grok Companion, a container of souls of waifu, cyber livings to bring them into our worlds, wishing to achieve Neuro-sama's altitude. Capable of realtime voice chat, Minecraft, Factorio playing. Web / macOS / Windows supported.
Anthropic's Interactive Prompt Engineering Tutorial
🌊 The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, distributed swarm intelligence, RAG integration, and native Claude Code / Codex Integration
OpenSandbox is a general-purpose sandbox platform for AI applications, offering multi-language SDKs, unified sandbox APIs, and Docker/Kubernetes runtimes for scenarios like Coding Agents, GUI Agents, Agent Evaluation, AI Code Execution, and RL Training.
Python tool for converting files and office documents to Markdown.
Ongoing research training transformer models at scale
Agent skills that make AI coding assistants write production-grade robotics software. ROS1, ROS2, design patterns, SOLID principles, and testing — for Claude Code, Cursor, Copilot, and any SKILL.md-compatible agent.
The AI agent that lives in your framework/browser
The SmythOS Runtime Environment (SRE) is an open-source, cloud-native runtime for agentic AI. Secure, modular, and production-ready, it lets developers build, run, and manage intelligent agents across local, cloud, and edge environments.
Python SDK for Milvus Vector Database