The frontier AI labs have stopped competing on model capability alone. The real race is for control over the operating layer where intelligence gets deployed, governed, and monetized. OpenAI is shipping agentic coding tools that control your desktop. Anthropic is expanding to London while negotiating Pentagon access. Google is embedding AI directly into Chrome and Photos. Whether Claude Opus 4.7 outscores a leaked competitor matters less than who owns the infrastructure where these models actually run.
Venture capital is treating AI infrastructure as the new platform layer. Factory commanded a $1.5 billion valuation after three years. Upscale AI raised $2 billion just seven months after launch. Physical Intelligence's π0.7 robot brain attracted major funding. But concentration is accelerating. First-quarter venture funding flowed overwhelmingly to large, well-funded U.S. companies even as global deal count fell. Data center delays now threaten Microsoft and OpenAI projects. Meta raised Quest headset prices by $50 to $100 citing RAM shortages. When infrastructure becomes the bottleneck, whoever controls it owns the next decade of software. AWS is tightening its relationship with Anthropic by launching Claude Opus 4.7 through Bedrock's new inference engine, positioning Amazon's infrastructure as the default deployment layer for Claude users. IBM and NVIDIA are pursuing quantum-adjacent positioning to establish themselves as infrastructure for the quantum transition. The pattern across the stack is consolidation around inference engines, API grant programs, and vertical models that embed switching costs into workflows.
The real tension surfaces in how these labs are positioning themselves against traditional software. Anthropic's Chief Product Officer left Figma's board to build competing design tools. Runway's CEO is betting AI can make fifty films instead of one blockbuster. Canva's AI assistant calls external tools. Enterprise customers are beginning to see AI not as a feature but as a replacement for entire categories of software. The margin isn't in the model; it's in the operational layer that makes models reliable enough to replace humans at scale. InsightFinder raised $15 million to diagnose where AI agents fail. Antioch built robotics simulation platforms for the same purpose. Google blocked 8.3 billion ads while suspending fewer advertisers, demonstrating how platform power compounds when you control both the model and the distribution channel.
Developers are already building for this future. GitHub's trending repositories reveal two waves of investment. One is infrastructure for AI agents: memory systems like claude-mem and knowledge engines like cognee built as separate, composable pieces rather than baked into monolithic platforms. The second is self-evolution. GenericAgent achieves full system control from a 3.3K-line seed with 6x lower token consumption than baseline approaches. EvoMap's Evolver and EvoScientist use Gene Expression Programming to let agents modify themselves. These implementations may not be production-ready yet, but they point toward a real problem: manually updating agent prompts and skills doesn't scale. Meanwhile, benchmark convergence at the frontier suggests the capability differentiation game is narrowing. Claude Opus 4.6 moved from fourth to first on SWE-rebench, climbing 12.3 points to 65.3 percent. The gap between first and second place narrowed to 0.9 points, with the top six models clustering between 62.3 and 65.3 percent. The field is consolidating not around who builds the smartest model, but around who builds the infrastructure that makes those models deployable, governable, and hard to leave.
Grant Calloway
Federated Conformal RAG (FC-RAG) provides distribution-free coverage for a bandwidth-limited swarm of weak language models, but only at a fixed horizon. We extend it to anytime-valid sequential coverage: validity at every stopping time, preserved under predictable adaptive control (recalibration, per-node bandwidth escalation, distilled-student refresh), at no extra cost in assumptions over fixed-horizon FC-RAG. Naive composition fails because FC-RAG's marginal coverage bound makes the betting e-process a non-supermartingale on adverse calibration draws, and Ville's inequality cannot be invoked. We give Anytime-FC-RAG, a sequential extension built on a summable per-step calibration-deviation budget that converts the marginal bound into a strict conditional bound on a calibration-good event, paired with a truncated betting e-process that is a nonnegative supermartingale on the entire probability space. From these two ingredients, we obtain four guarantees: time-uniform alarm validity $\mathbb{P}(\sup_t E_t \ge 1/δ_e) \le δ_e + δ_{\mathrm{cal}}$, a Hoeffding-stitched cumulative-miscoverage envelope at the same total budget, safety under any predictable controller (recalibration, bandwidth escalation, student refresh), and training-side error propagation across an unbounded sequence of Federated Probe-Logit Distillation (FPLD) refreshes via a summable training budget. As a practical consequence, an adaptive controller that escalates retrieval bandwidth only when the e-process crosses a warning threshold matches the alarm rate of a fixed-high-bandwidth schedule at substantially lower communication cost. Experiments on a GPT-2-small + MiniLM swarm across MMLU, DBpedia, and AG News verify the predicted alarm rate, detection delay, envelope coverage, and $14$-$57\%$ bandwidth savings; the alarm fires when and only when coverage genuinely breaks.
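The alarm rule in the abstract above rests on a betting e-process crossing the threshold $1/δ_e$. The sketch below shows only the generic textbook version of that idea: a fixed-bet wealth process over miscoverage indicators, which is a nonnegative supermartingale when the conditional miscoverage rate is at most `alpha`. It is not the paper's truncated e-process or its calibration budget, and the function name, the constant bet, and the parameter names are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def betting_alarm_time(
    miscovered: Iterable[int], alpha: float, delta: float, bet: float
) -> Optional[int]:
    """Return the first step at which the wealth reaches 1/delta, else None.

    miscovered: 0/1 indicators, 1 meaning the prediction set missed the truth.
    alpha:      nominal miscoverage level being monitored.
    delta:      alarm budget; by Ville's inequality a nonnegative
                supermartingale started at 1 reaches 1/delta with
                probability at most delta.
    bet:        constant bet size (hypothetical choice); it must lie in
                [0, 1/alpha] so the wealth never goes negative.
    """
    if not 0.0 < alpha < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("alpha and delta must lie in (0, 1)")
    if not 0.0 <= bet <= 1.0 / alpha:
        raise ValueError("bet must lie in [0, 1/alpha]")

    wealth = 1.0
    for step, z in enumerate(miscovered, start=1):
        # Wealth grows on a miss (z = 1) and shrinks on a cover (z = 0).
        wealth *= 1.0 + bet * (z - alpha)
        if wealth >= 1.0 / delta:
            return step
    return None
```

With `alpha=0.1`, `delta=0.05`, and `bet=2.0`, three consecutive misses multiply the wealth by 2.8 each time and trigger the alarm at step 3, while a stream with no misses never triggers it.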
Many applications require statistically valid inference across many related tasks, while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited, ground-truth labels, but commonly used methods treat tasks independently and therefore fail to exploit shared structure across related tasks. This limitation is especially important in settings where only a small number of labels are available per task. To address this issue, we introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. We complement our theoretical findings with experiments on synthetic and semi-synthetic datasets, as well as a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, we show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.
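For readers unfamiliar with the single-task baseline that the abstract above builds on, here is a minimal sketch of a power-tuned PPI estimate of a mean. It covers only the within-task rectification and power-tuning step, not the paper's cross-task recalibration. The function name and arguments are hypothetical, and the plug-in tuning weight (sample covariance over sample variance, shrunk by $1 + n/N$) is one common choice rather than necessarily the one the authors use.

```python
from statistics import fmean
from typing import Optional, Sequence


def ppi_mean(
    y_lab: Sequence[float],
    f_lab: Sequence[float],
    f_unl: Sequence[float],
    lam: Optional[float] = None,
) -> float:
    """Prediction-powered point estimate of E[Y] for a single task.

    y_lab: ground-truth labels on the small labeled sample (size n).
    f_lab: proxy predictions on the same labeled sample.
    f_unl: proxy predictions on the large unlabeled sample (size N).
    lam:   tuning weight; None selects a plug-in power-tuned value.
    """
    n, big_n = len(y_lab), len(f_unl)
    if n < 2 or len(f_lab) != n or big_n == 0:
        raise ValueError("need n >= 2 paired labeled points and N >= 1")

    mean_y, mean_f = fmean(y_lab), fmean(f_lab)
    if lam is None:
        cov = sum((y - mean_y) * (f - mean_f) for y, f in zip(y_lab, f_lab)) / (n - 1)
        var = sum((f - mean_f) ** 2 for f in f_lab) / (n - 1)
        # A constant proxy carries no information, so fall back to lam = 0.
        lam = 0.0 if var == 0.0 else cov / ((1.0 + n / big_n) * var)

    # Labeled mean, corrected by the proxy's shift between the two samples.
    return mean_y + lam * (fmean(f_unl) - mean_f)
```

Setting `lam=0` recovers the plain labeled-sample mean, and `lam=1` recovers the untuned PPI estimate, so the tuned weight interpolates between trusting only the labels and fully trusting the proxy correction.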
In randomized trials involving multiple treatments, bivariate survival outcomes present significant analytical challenges for making decisions. This paper addresses the problem of deriving optimal individualized treatment rules to maximize the joint survival probability beyond fixed time points $(t_1, t_2)$ through deep neural networks, while accounting for right censoring. We propose a novel approach that models treatment rules via stochastic policies, coupling marginal accelerated failure time models via link function to capture bivariate dependence. To enhance robustness and effectiveness of decision making, we introduce an adaptive prediction-powered method that leverages auxiliary predictions from machine learning models.
In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $Ω(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $Θ(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.
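The closed-form allocation quoted above is easy to evaluate numerically. The sketch below assumes that $\bar{w}_g$ denotes the geometric mean of the node weights $w_i$, which the abstract does not state explicitly; under that assumption the log terms sum to zero, so the per-node budgets add up to $B_{\mathrm{tot}}$. The function and argument names are hypothetical, and the sketch does not handle rounding to whole bits or clipping of negative budgets.

```python
import math
from typing import List, Sequence


def log_tilted_allocation(
    weights: Sequence[float], total_bits: float, vocab_size: int
) -> List[float]:
    """Evaluate B_i = B_tot/K + (V/2) * log2(w_i / w_bar_g) for each node.

    weights:    positive per-node weights w_i.
    total_bits: total bandwidth budget B_tot shared by the K nodes.
    vocab_size: vocabulary size V.
    Assumes w_bar_g is the geometric mean of the weights.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and strictly positive")

    k = len(weights)
    # log2 of the geometric mean equals the arithmetic mean of the log2 weights.
    log2_geo_mean = sum(math.log2(w) for w in weights) / k
    return [
        total_bits / k + (vocab_size / 2.0) * (math.log2(w) - log2_geo_mean)
        for w in weights
    ]
```

For example, two nodes with weights 1 and 4, a total budget of 20 bits, and $V = 4$ receive 8 and 12 bits respectively, while equal weights reduce to the uniform split $B_{\mathrm{tot}}/K$.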
Recent work in random matrix theory (RMT) has developed the notion of deterministic equivalents: typically linear surrogate models that approximate the spectral behavior of large nonlinear random matrices, such as nonlinear feature maps in neural networks (NNs). On the one hand, these deterministic equivalents make theoretical predictions tractable by reducing a complex model to a simpler model with properties that fall under the umbrella of classical RMT tools. On the other hand, this leaves open the question of whether the idealized linear equivalence remains meaningful for high-dimensional, nonlinearly separable data, for example when performing classification on such data. Motivated by this, we consider the conjugate kernel (CK), which is the nonlinear feature map of a feedforward NN, under a canonical nonlinearly separable dataset, the XOR problem; and we use the study of informative outlier eigenvalues in the CK and whether their corresponding eigenvectors asymptotically align with XOR labels as a proxy for nonlinear learnability. We develop a robust quadratic equivalent to the spiked CK matrix that enables a precise analysis of emergent informative spikes, as one modifies various knobs common in ML practice: sample complexity, signal-to-noise ratio (SNR), nonlinear activation choice, and pretrained features. In each of these scenarios, we derive a precise BBP-type phase transition in which linear classification via the CK eigenvectors becomes possible. Our analysis helps translate the power of deterministic equivalence tools in RMT to study problems of practical relevance in ML.
We study the Lipschitz bandit problem, where a learner sequentially maximizes an unknown Lipschitz function $f$ over a domain $\mathcal{X} \subset [0,1]^d$ using noisy pointwise evaluations. Existing regret bounds are either worst-case, scaling as $\tilde{Θ}\left(T^{(d+1)/(d+2)}\right)$, or adaptive via the zooming dimension $d_z$, yielding $\tilde{Θ}\left(T^{(d_z+1)/(d_z+2)}\right)$. However, such zooming-based guarantees are only partially instance-dependent, as they depend solely on the asymptotic growth of near-optimal level sets and fail to capture finer structural properties of $f$. We provide an analysis and an algorithm that characterizes the regret through integrals of the suboptimality gap of $f$ over its level sets. This yields regret bounds that adapt to the local growth of level sets, rather than only their asymptotic behavior. As a corollary, when the set of maximizers has dimension $d^\star>0$, we obtain improved adaptive rates of order $\tilde{\mathcal{O}}\left(T^{(d_z+1)/(\max(d_z,d^\star)+2)}\right)$, strictly improving over classical zooming bounds in this regime. Finally, we extend our analysis to the full-information setting (Lipschitz experts) and show how some of the regularity assumptions can be relaxed.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 123 | $4.50 |
| 2 | GPT-5.4 | 56.8 | 81 | $5.63 |
| 3 | GPT-5.3 Codex | 53.6 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 44 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future sessions.
Self-evolving agent: grows skill tree from 3.3K-line seed, achieving full system control with 6x less token consumption
The open-source voice synthesis studio
An open source template for building cloud agents.
🔬 Harness Vibe Research with Self-evolving AI Scientists
Full-Stack Development Platform for Building Reliable Agents
End-to-End Speech Processing Toolkit
Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipeline.
Fully OSS lightweight network video recorder system written in C with a modern JS frontend