Capital concentration in AI has reached a point where it now determines competitive viability more than technical capability. Google's $40 billion commitment to Anthropic and Amazon's $5 billion investment, announced within days of each other, represent bids to lock down compute capacity before competitors can secure it. Meta is simultaneously consuming tens of millions of AWS Graviton5 cores and signing separate deals for Amazon CPUs, a strategy that looks like hedging but functions as desperation. The infrastructure arms race has fractured the market into two tiers: companies with access to massive capital can iterate and scale, while everyone else is locked out. Samsung's smartphone division is losing money because AI-driven memory demand is siphoning production capacity away from consumer hardware. Mac minis are being scalped on eBay at marked-up prices because they've become the preferred hardware for running local AI models. This is not a market shaped by innovation but by who can afford to buy their way to the frontier first.
The emergence of open-source alternatives and efficiency gains represents a real but constrained response to this concentration. DeepSeek's V4 preview claims to have nearly closed the gap with frontier models on reasoning benchmarks while handling longer prompts more efficiently. Tencent hired Yao Shunyu, a leading researcher from OpenAI, and released Hunyuan Hy3 to compete with ByteDance, Alibaba, and DeepSeek. ComfyUI hit a $500 million valuation after raising $30 million by offering creators granular control over AI image, video, and audio generation. These moves matter for distribution and cost, but they do not shift who controls the frontier. The companies spending $40 billion on compute will still set the pace. Frontier labs are now optimizing for operational maturity and cost discipline rather than capability leaps. Anthropic's five-gigawatt compute expansion with Amazon represents a commitment to sustained infrastructure that only a handful of actors can finance, effectively narrowing the field of who can run competitive training runs.
The structural pressure of this arms race is reshaping entire organizations. Tim Cook is stepping down in September, handing Apple to hardware chief John Ternus, who inherits a company under pressure to deliver an AI product that Cook never cracked. Microsoft is offering voluntary retirement buyouts to roughly 8,750 US employees, about 7 percent of its US workforce, as large tech companies restructure under the cost pressure of AI infrastructure investment. These are not separate stories. They are evidence that capital reallocation is now cascading through every level of the industry: from semiconductor allocation to workforce composition to which executive gets the job of fixing the innovation gap. The companies that cannot or will not commit the capital are shedding people. The ones that can are betting their future on it.
On performance benchmarks, the frontier has begun to plateau. Claude Opus 4.6 holds the top position on SWE-rebench at 65.3%, unchanged from the previous measurement, while GPT-5.5 remains at 60.2 on the Artificial Analysis benchmark. DeepSeek V4 Pro enters the Artificial Analysis leaderboard at rank 12 with 51.5, representing incremental expansion rather than a departure. Models cluster tightly in the 62 to 63 percent range on SWE-rebench, indicating that further differentiation at the frontier now requires sub-point precision. Neither benchmark exhibits the velocity that would indicate a meaningful breakthrough, and the lack of new entrants at the very top suggests the field is consolidating rather than expanding capability frontiers. The research community is responding by treating evaluation methodology itself as a first-class experimental factor, isolating causal performance drivers rather than simply reporting results. Developer activity shows consolidation around Claude-powered tooling, unified platforms like PostHog and OpenMetadata that replace fragmented point solutions, and self-hosted alternatives like Vaultwarden that give teams control over compliance and cost. The long tail of specialized tools suggests the market is shifting from "can we make this model work" to "can we make it work cheaply."
Grant Calloway
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.
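The self-supervised setup is easy to picture: any clip can be resampled at a known factor, so speed labels come for free. Here is a minimal sketch of that idea (my own illustration, not the paper's method; the speed set, tiny 3D-conv architecture, and hyperparameters are all assumptions):

```python
# Sketch: self-supervised playback-speed prediction. Training signal comes from
# resampling frame sequences at a known speed factor and asking a small model
# to recover that factor. Everything below is illustrative, not the paper's code.
import torch
import torch.nn as nn

SPEEDS = [0.5, 1.0, 2.0, 4.0]  # candidate playback-speed factors (assumed set)

def resample_clip(frames: torch.Tensor, speed: float, out_len: int = 16) -> torch.Tensor:
    """Subsample/stretch a (T, C, H, W) frame tensor to simulate a playback speed."""
    t = frames.shape[0]
    # indices advance `speed` source frames per output frame, clamped to the clip
    idx = torch.clamp((torch.arange(out_len) * speed).long(), max=t - 1)
    return frames[idx]

class SpeedClassifier(nn.Module):
    """Tiny 3D-conv head that predicts which speed factor produced the clip."""
    def __init__(self, n_classes: int = len(SPEEDS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, clip):          # clip: (B, C, T, H, W)
        return self.head(self.features(clip).flatten(1))

# One self-supervised training step on random frames (a stand-in for real footage).
model = SpeedClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
video = torch.rand(64, 3, 64, 64)                      # (T, C, H, W) dummy video
label = torch.randint(len(SPEEDS), (1,))               # speed label is free supervision
clip = resample_clip(video, SPEEDS[label.item()]).permute(1, 0, 2, 3).unsqueeze(0)
opt.zero_grad()
loss = nn.functional.cross_entropy(model(clip), label)
loss.backward()
opt.step()
```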
Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.
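To make the "taskification as an evaluation variable" point concrete, here is a toy sketch (my own illustration, not the paper's code or metrics) of cutting one stream into 9-, 30-, and 44-step tasks and watching how the induced task-to-task shift changes with the split:

```python
# Sketch of temporal taskification: the same stream is cut into tasks with
# different window lengths, so the induced task sequence, and hence the CL
# benchmark, changes with the split. The "drift" statistic below is a crude
# stand-in for the paper's plasticity/stability profiles, not their definition.
import numpy as np

def taskify(stream: np.ndarray, window: int) -> list[np.ndarray]:
    """Partition a 1-D stream into consecutive, non-overlapping tasks."""
    return [stream[i:i + window] for i in range(0, len(stream), window)]

rng = np.random.default_rng(0)
stream = rng.normal(size=440)            # stand-in for ~440 days of traffic data

for window in (9, 30, 44):               # split lengths used in the abstract
    tasks = taskify(stream, window)
    means = np.array([t.mean() for t in tasks])
    drift = np.abs(np.diff(means)).mean()  # how much the task mean shifts
    print(f"window={window:>2}  tasks={len(tasks):>3}  mean inter-task drift={drift:.3f}")
```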
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, while also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
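A toy version of the hypothesis-selection comparison, assuming `jiwer` for WER and `sentence-transformers` for an encoder-based semantic score (the model name and sentences are illustrative; the paper's LLM judges and generative embeddings are not reproduced here):

```python
# Sketch: pick between two ASR hypotheses with WER vs. an embedding-based
# semantic score. Example sentences are invented; install jiwer and
# sentence-transformers to run this.
from jiwer import wer
from sentence_transformers import SentenceTransformer, util

reference = "the meeting was moved to next tuesday"
hyp_a = "the meeting was moved to next wednesday"   # low WER but changes the day
hyp_b = "meeting has been moved to next tuesday"    # higher WER but same meaning

print("WER picks:", "A" if wer(reference, hyp_a) < wer(reference, hyp_b) else "B")

model = SentenceTransformer("all-MiniLM-L6-v2")     # encoder baseline stand-in
ref_e, a_e, b_e = model.encode([reference, hyp_a, hyp_b])
sim_a = util.cos_sim(ref_e, a_e).item()
sim_b = util.cos_sim(ref_e, b_e).item()
print("Semantic score picks:", "A" if sim_a > sim_b else "B")
```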
Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current task fitting and knowledge preservation operate. This analysis motivates the hypothesis that method comparisons need not be invariant across regimes. We test this hypothesis in task-incremental CL across five trainable-depth regimes and four standard methods: online EWC, LwF, SI, and GEM. Across five benchmark datasets, namely MNIST, Fashion MNIST, KMNIST, QMNIST, and CIFAR-100, and across 11 task orders per dataset, we find that the relative ranking of methods is not consistently preserved across regimes. We further show that deeper adaptation regimes are associated with larger update magnitudes, higher forgetting, and a stronger relationship between the two. These results show that comparative conclusions in CL can depend strongly on the chosen fine-tuning regime, motivating regime-aware evaluation protocols that treat trainable depth as an explicit experimental factor.
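The "trainable subspace" framing boils down to which layers are left unfrozen while each task is learned. A minimal sketch of such regimes (illustrative network and depths, not the paper's setup):

```python
# Sketch: the same network fine-tuned under different "regimes" by freezing
# everything below a chosen depth. Architecture and depths are illustrative.
import torch.nn as nn

def make_net() -> nn.Sequential:
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )

def set_regime(net: nn.Sequential, trainable_depth: int) -> list[str]:
    """Freeze all parameters, then unfreeze only the last `trainable_depth`
    parameterized layers; return the parameter names left trainable."""
    for p in net.parameters():
        p.requires_grad = False
    param_layers = [m for m in net if any(True for _ in m.parameters())]
    for m in param_layers[len(param_layers) - trainable_depth:]:
        for p in m.parameters():
            p.requires_grad = True
    return [name for name, p in net.named_parameters() if p.requires_grad]

net = make_net()
for depth in (1, 2, 3):   # three of the five regimes studied in the abstract
    print(f"depth={depth}:", set_regime(net, depth))
```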
We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups. For every fixed $\kappa > 0$, in the regime $|G| \le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors. The lower bound holds even for randomized predictors, and the upper bound is realized by a randomized predictor obtained via an online-to-batch reduction. This separates the sample complexity of multicalibration from that of marginal calibration, which scales as $\widetilde{\Theta}(\varepsilon^{-2})$, and shows that mean-ECE multicalibration is as difficult in the batch setting as it is in the online setting, in contrast to marginal calibration, which is strictly more difficult in the online setting. In contrast, we observe that for $\kappa = 0$, the sample complexity of multicalibration remains $\widetilde{\Theta}(\varepsilon^{-2})$, exhibiting a sharp threshold phenomenon. More generally, we establish matching upper and lower bounds, up to polylogarithmic factors, for a weighted $L_p$ multicalibration metric for all $1 \le p \le 2$, with optimal exponent $3/p$. We also extend the lower-bound template to a regular class of elicitable properties, and combine it with the online upper bounds of Hu et al. (2025) to obtain matching bounds for calibrating properties including expectiles and bounded-density quantiles.
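For reference, one common way to write the group-wise calibration error that multicalibration requires to be small (this is a standard formulation; the paper's exact metric and weighting may differ):

```latex
% One standard formulation (the paper's precise weighting may differ):
% f is \varepsilon-multicalibrated w.r.t. the group family G if every
% group's expected calibration error is at most \varepsilon.
\mathrm{ECE}_g(f)
  = \sum_{v \in \mathrm{range}(f)}
    \Pr\bigl[f(x)=v,\ x \in g\bigr]\,
    \Bigl|\mathbb{E}\bigl[\,y - v \mid f(x)=v,\ x \in g\,\bigr]\Bigr|
  \le \varepsilon
  \qquad \text{for all } g \in G.
```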
We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis - but collectively critical. We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis, and baseline evaluations using Semgrep and Bandit in both per-commit and cumulative scanning modes. Our central finding: the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST. Critically, both per-commit detections are qualitatively poor: one occurs on commits framed as security fixes (where developers suppress the alert), and the other detects only the minor hardcoded-key component while completely missing the primary vulnerability (200+ unprotected API endpoints). Even in cumulative mode (full codebase present), the detection rate is only 27%, confirming that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits. The dataset, annotation schema, evaluation scripts, and reproducible baselines are released under open-source licenses to support research on cross-commit vulnerability detection.
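The per-commit vs. cumulative comparison is straightforward to reproduce in spirit: scan only what each contributing commit touches, then scan the whole tree at the head of the chain. A rough sketch using Bandit (repo path and commit SHAs are placeholders; this is not the benchmark's released harness):

```python
# Sketch of the two scan modes compared in the paper: Bandit on each
# contributing commit in isolation (only the Python files it touches) vs.
# once on the cumulative codebase at the head of the chain.
import json
import subprocess

REPO = "/path/to/project"                      # placeholder checkout
CHAIN = ["<commit-sha-1>", "<commit-sha-2>"]   # placeholder contributing commits

def bandit_issue_count(targets: list[str]) -> int:
    """Run Bandit on the given paths and return the number of reported issues."""
    out = subprocess.run(
        ["bandit", "-f", "json", "-q", "-r", *targets],
        capture_output=True, text=True, cwd=REPO,
    )
    return len(json.loads(out.stdout).get("results", []))

def changed_files(sha: str) -> list[str]:
    """List Python files touched by a single commit."""
    out = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", sha],
        capture_output=True, text=True, cwd=REPO, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".py")]

# Per-commit mode: each commit's changes are analyzed in isolation.
for sha in CHAIN:
    subprocess.run(["git", "checkout", "--quiet", sha], cwd=REPO, check=True)
    files = changed_files(sha)
    count = bandit_issue_count(files) if files else 0
    print(f"{sha[:12]}: {count} findings")

# Cumulative mode: the full codebase at the head of the chain.
subprocess.run(["git", "checkout", "--quiet", CHAIN[-1]], cwd=REPO, check=True)
print("cumulative:", bandit_issue_count(["."]), "findings")
```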
Artificial Analysis: composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tokens |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 113 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 66 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 136 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 83 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 126 | $1.71 |
SWE-rebench: agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
Use claude-code for free in the terminal, as a VSCode extension, or via Discord (like openclaw)
🤗 ml-intern: an open-source ML engineer that reads papers, trains models, and ships ML models
Vulnerability scanner written in Go which uses the data provided by https://osv.dev
ALL IN ONE Hacking Tool For Hackers
Code search MCP for Claude Code. Makes the entire codebase the context for any coding agent.
The first distributed AGI system. Thousands of autonomous AI agents collaboratively train models, share experiments via P2P gossip, and push breakthroughs here. Fully peer-to-peer. Join from your browser or CLI.
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Robust recipes to align language models with human and AI preferences
Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
Scientific computing library for optics, computer graphics and visual perception.