The AI industry is splitting into two competing visions of what agents should be, and the market is moving faster than governance can keep pace. Meta's Muse Spark and Anthropic's Mythos model exemplify the tension: both companies are simultaneously building systems for mass consumption and restricting access to select customers, hedging their bets on whether agents will function as utilities or as premium services. The real signal is in the tooling race. Microsoft's VS Code Agents app, AWS's S3 Files interface, and Astropad's Workbench are racing to make agent orchestration invisible to end users, while Microsoft's Agent Governance Toolkit and OpenAI's Child Safety Blueprint suggest the opposite: that agents will require visible, auditable control layers enterprises cannot avoid. The winner will be whoever makes the complexity disappear first, not whoever builds the most sophisticated safeguards.
Cloud providers are consolidating control across the entire stack. AWS's dual investments in both Anthropic and OpenAI, framed as cultural competency in managing competition, are a naked articulation of vertical integration strategy: cloud providers will own the infrastructure, the models, and the integrations, turning model choice into a switching-cost problem rather than a capability problem. Tubi's native app integration within ChatGPT and Atlassian's third-party agent partnerships in Confluence show the distribution layer solidifying around existing platforms. Elon Musk's decision to forgo $134 billion in damages against OpenAI and donate any settlement to a nonprofit neutralizes the optics of a billionaire lawsuit while keeping pressure on OpenAI's governance structure, which remains the actual point of contention. This is consolidation wearing a partnership label.
The coding agent market is moving at a pace that makes traditional development cycles obsolete. Claude Opus 4.6 jumped from fourth place to first on SWE-rebench with a 65.3% pass rate, a 12.3-point gain over its previous benchmark score, while reasoning-oriented variants like Kimi K2.5 climbed from 46.8% to 58.5%. Yet Meta admits to performance gaps in agentic and coding systems, and Anthropic had to restrict Mythos access and build governance toolkits. Research papers like ReCodeAgent demonstrate that a multi-agent design yields roughly 40% higher test pass rates than a single-agent ablation, but the same crop of papers also reveals that simpler dense retrieval often outperforms complex configurations and that existing reward models fail substantially on personalization. The real constraint is not capability; it is operational safety and liability. Companies are building faster than they can govern.
The developer ecosystem reflects this shift away from model competition toward infrastructure lock-in. Google's ML gallery and LocalAI both solve the same problem: the cost and latency of cloud inference have made local execution a serious alternative. Andrej Karpathy's CLAUDE.md prompt engineering file accumulated 9,500 stars, yet specialized tools like virattt's AI Hedge Fund and Strands' Python SDK are gaining traction by promising to abstract away the prompt-writing problem entirely. Ultralytics YOLO crossed 55,000 stars not because computer vision is new, but because the library became the default way to train and deploy object detection. These tools win by being the obvious choice, and the industry is rewarding whoever can hide the complexity from users while selling it as innovation to enterprises.
Grant Calloway
Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training, inspired by elastic weight consolidation, which stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-train FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.
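The elastic prior described here follows the standard elastic-weight-consolidation recipe, so a rough sketch is possible even without the paper's code. Below is a minimal, assumption-laden version of one chunk-wise fast-weight update: the Fisher accumulation rule, the EMA anchor, and every name and hyperparameter (`lam`, `ema_decay`) are illustrative guesses, not FSM's actual implementation.

```python
# Hypothetical sketch of a Fisher-weighted elastic fast-weight update,
# in the spirit of the abstract above. Not the paper's code.
import torch

def elastic_ttt_step(fast_w, grad, fisher, anchor,
                     lr=1e-2, lam=1.0, ema_decay=0.99):
    """One chunk-wise fast-weight update with an EWC-style elastic prior.

    fast_w : current fast weights (tensor)
    grad   : gradient of the chunk's self-supervised loss w.r.t. fast_w
    fisher : running Fisher-information estimate (same shape as fast_w)
    anchor : EMA of past fast weights, the elastic prior's center
    """
    # Elastic prior pulls fast weights toward the anchor, scaled
    # per-coordinate by Fisher importance: the gradient of
    # (lam / 2) * fisher * (w - anchor)^2.
    elastic_grad = lam * fisher * (fast_w - anchor)
    fast_w = fast_w - lr * (grad + elastic_grad)

    # The anchor evolves as an exponential moving average of past fast
    # weights, trading stability (old anchor) against plasticity (new).
    anchor = ema_decay * anchor + (1.0 - ema_decay) * fast_w

    # One common Fisher estimate: an EMA of squared gradients.
    fisher = ema_decay * fisher + (1.0 - ema_decay) * grad.pow(2)
    return fast_w, anchor, fisher
```

Under this reading, setting `lam = 0` recovers plain LaCT plasticity, while a large `lam` freezes the fast weights near the anchor; the interesting regime is in between.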
Exact relevance certification asks which coordinates are necessary to determine the optimal action in a coordinate-structured decision problem. The tractable families treated here admit a finite primitive basis, but optimizer-quotient realizability is maximal, so quotient shape alone cannot characterize the frontier. We prove a meta-impossibility theorem for efficiently checkable structural predicates invariant under the theorem-forced closure laws of exact certification. Structural convergence with zero-distortion summaries, quotient entropy bounds, and support-counting arguments explains why those closure laws are canonical. We establish the theorem by constructing same-orbit disagreements for four obstruction families, namely dominant-pair concentration, margin masking, ghost-action concentration, and additive/statewise offset concentration, using action-independent, pair-targeted affine witnesses. Consequently no correct tractability classifier on a closure-closed domain yields an exact characterization over these families. Here closure-orbit agreement is forced by correctness rather than assumed as an invariance axiom. The result therefore applies to correct classifiers on closure-closed domains, not only to classifiers presented through a designated admissibility package.
Generating motion-controlled videos--where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints--demands two capabilities: (1) disentangled motion control, allowing users to separately control object motion and adjust the camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motions. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and let MoRight predict the consequences (forward reasoning), or specify desired passive outcomes and have MoRight recover plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.
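Of the two mechanisms, the temporal cross-view attention that transfers canonical-view motion into the target viewpoint is the most concrete, and its rough shape can be sketched. The module below is only our reading of the abstract: the token shapes, the per-frame attention pattern, and the class name are assumptions, not MoRight's published architecture.

```python
# Speculative sketch of temporal cross-view attention: target-view tokens
# query canonical-view motion tokens frame by frame. Shapes and names are
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class TemporalCrossViewAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_tokens, canonical_tokens):
        """target_tokens:    (B, T, N, D) tokens of the target camera view
        canonical_tokens: (B, T, M, D) motion tokens in the canonical view

        For each frame t, target-view tokens attend into the canonical-view
        motion tokens, transferring object motion into the chosen viewpoint.
        """
        B, T, N, D = target_tokens.shape
        out = []
        for t in range(T):
            q = target_tokens[:, t]       # (B, N, D) queries, target view
            kv = canonical_tokens[:, t]   # (B, M, D) keys/values, canonical
            o, _ = self.attn(q, kv, kv)
            out.append(o)
        return torch.stack(out, dim=1)    # (B, T, N, D)
```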
The rapid growth of generative artificial intelligence (AI) has introduced unprecedented computational demands, driving significant increases in the energy footprint of data centers. However, existing power consumption data is largely proprietary and reported at varying resolutions, creating challenges for estimating whole-facility energy use and planning infrastructure. In this work, we present a methodology that bridges this gap by linking high-resolution workload power measurements to whole-facility energy demand. Using NLR's high-performance computing data center equipped with NVIDIA H100 GPUs, we measure the power consumption of AI workloads at 0.1-second resolution for AI training, fine-tuning, and inference jobs. Workloads are characterized using MLCommons benchmarks for model training and fine-tuning, and vLLM benchmarks for inference, enabling reproducible and standardized workload profiling. The dataset of power consumption profiles is made publicly available. These power profiles are then scaled to the whole-facility level using a bottom-up, event-driven data center energy model. The resulting whole-facility energy profiles capture realistic temporal fluctuations driven by AI workloads and user behavior, and can be used to inform infrastructure planning for grid connection, on-site energy generation, and distributed microgrids.
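The bottom-up, event-driven scaling step is straightforward to illustrate. The toy model below superimposes per-job power profiles according to job-start events and applies a PUE-style overhead factor; the profile values, arrival times, and the 1.3 overhead are placeholders, not the paper's measurements.

```python
# Toy bottom-up, event-driven scaling of per-job power profiles to the
# facility level. All numbers here are synthetic placeholders.
import numpy as np

def facility_profile(job_profiles, arrivals, horizon_s, dt=0.1, pue=1.3):
    """job_profiles: list of 1-D arrays, per-job power (W) at dt resolution
    arrivals:     list of (start_time_s, profile_index) job-start events
    Returns whole-facility power by superimposing job profiles and applying
    a PUE factor for cooling and other overheads."""
    n = int(horizon_s / dt)
    it_power = np.zeros(n)
    for start, idx in arrivals:
        p = job_profiles[idx]
        i = int(start / dt)
        j = min(i + len(p), n)
        it_power[i:j] += p[: j - i]   # event-driven superposition
    return pue * it_power

# Example: two synthetic 10-minute training jobs, the second starting 120 s
# after the first, on a 15-minute horizon at 0.1 s resolution.
rng = np.random.default_rng(0)
prof = 600 + 50 * rng.standard_normal(6000)   # ~600 W job profile
total = facility_profile([prof], [(0.0, 0), (120.0, 0)], horizon_s=900)
```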
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
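The evaluation protocol implied here is standard pairwise accuracy, which is easy to sketch. In the snippet below, `score` is a stand-in for any reward model that can condition on a user persona; the function signature and tuple layout are our own assumptions, not the benchmark's API.

```python
# Schematic evaluation loop for a pairwise reward-model benchmark of this
# kind: accuracy is the fraction of pairs where the RM scores the
# rubric-following (chosen) response above the rubric-violating (rejected)
# one. `score(prompt, response, persona)` is a hypothetical stand-in.
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(pairs: Iterable[Tuple[str, str, str, str]],
                      score: Callable[[str, str, str], float]) -> float:
    """pairs: iterable of (persona, prompt, chosen, rejected) tuples."""
    correct = 0
    total = 0
    for persona, prompt, chosen, rejected in pairs:
        # The RM should prefer the response that follows this user's rubric;
        # by construction both responses are high quality in general terms.
        if score(prompt, chosen, persona) > score(prompt, rejected, persona):
            correct += 1
        total += 1
    return correct / max(total, 1)
```

On a benchmark built this way, the reported 75.94% ceiling means even the best RM misorders roughly one pair in four when the only distinguishing factor is personal preference.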
Most repository-level code translation and validation techniques have been evaluated on a single source-target programming language (PL) pair, owing to the complex engineering effort required to adapt new PL pairs. Programming agents can enable PL-agnosticism in repository-level code translation and validation: they can synthesize code across many PLs and autonomously use existing tools specific to each PL's analysis. However, the state of the art has yet to offer a fully autonomous agentic approach for repository-level code translation and validation of large-scale programs. This paper proposes ReCodeAgent, an autonomous multi-agent approach for language-agnostic repository-level code translation and validation. Users only need to provide the project in the source PL and specify the target PL for ReCodeAgent to automatically translate and validate the entire repository. ReCodeAgent is the first technique to achieve high translation success rates across many PLs. We compare the effectiveness of ReCodeAgent with four alternative neuro-symbolic and agentic approaches to translate 118 real-world projects, with 1,975 LoC and 43 translation units per project on average. The projects cover 6 PLs (C, Go, Java, JavaScript, Python, and Rust) and 4 PL pairs (C-Rust, Go-Rust, Java-Python, Python-JavaScript). Our results demonstrate that ReCodeAgent consistently outperforms prior techniques on translation correctness, improving test pass rate by 60.8% on ground-truth tests, at an average cost of $15.3. We also perform process-centric analysis of ReCodeAgent trajectories to confirm its procedural efficiency. Finally, we investigate how design choices (a multi-agent vs. single-agent architecture) influence ReCodeAgent's performance: with a single-agent architecture, the test pass rate drops by 40.4% on average, and trajectories become 28% longer and persistently inefficient.
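The abstract does not publish the agent design, but the translate-then-validate loop it describes has a recognizable skeleton. The sketch below is a generic multi-round orchestration under our own assumptions: `translate`, `repair`, and `run_tests` stand in for agent roles and tooling that ReCodeAgent may implement quite differently.

```python
# Generic skeleton of an autonomous translate-and-validate loop, under our
# assumptions about the structure; not ReCodeAgent's actual design.
from typing import Callable, Dict, List, Tuple

def translate_repo(units: List[str],
                   translate: Callable[[str], str],
                   repair: Callable[[str, str], str],
                   run_tests: Callable[[Dict[str, str]], Tuple[bool, str]],
                   max_rounds: int = 5) -> Dict[str, str]:
    """units:     source translation units (e.g. files in the source PL)
    translate: agent call mapping a source unit to target-PL code
    repair:    agent call mapping (code, failure log) to patched code
    run_tests: runs the target PL's test suite, returns (passed, log)"""
    # Initial per-unit translation pass.
    code = {u: translate(u) for u in units}
    for _ in range(max_rounds):
        passed, log = run_tests(code)
        if passed:
            return code          # validation succeeded; done
        # Feed the failing test output back to a repair agent and retry.
        code = {u: repair(c, log) for u, c in code.items()}
    return code                  # best effort after max_rounds
```

Nothing in this loop is PL-specific: the PL-agnosticism the paper claims would live inside the agents, which pick the right compiler, test runner, and analysis tools for each target language.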
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 85 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 124 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 76 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 55 | $10.00 |
| 5 | Muse Spark | 52.1 | 0 | $0.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
A specialized Claude Code workspace for creating long-form, SEO-optimized blog content for any business. This system helps you research, write, analyze, and optimize content that ranks well and serves your target audience.
A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.
PersonaPlex code.
Unofficial API SDK for openrouter.ai in Go
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
Ultralytics YOLO 🚀
A general fine-tuning kit geared toward image/video/audio diffusion models.