Leverage has become the day's operative word, and American AI dominance is discovering its limits. The White House's order forcing Anthropic to revoke foreign access to Fable 5 and Mythos 5 was meant to protect U.S. technological advantage, but it has triggered precisely the acceleration it sought to prevent. DeepSeek closed a $7.4 billion funding round while Chinese labs slashed token prices by up to 99 percent. JPMorgan Chase and Goldman Sachs cut Claude access for Hong Kong staff. The policy is driving enterprises and competitors toward alternatives, with capital markets now treating China's labs as safer bets than American companies subject to sudden revocation. Dario Amodei is telling G7 leaders to resist splintering over AI, but the splintering is already happening. Security experts say the White House's demand that Anthropic guarantee Fable 5's guardrails cannot be circumvented before any rerelease is technically impossible to meet, yet the policy persists.
The market itself is correcting away from hype toward specificity. Uber blew through its annual AI budget in months on pilots that haven't scaled into measurable value. Meta killed its internal leaderboard. Some companies cut Claude licenses for entire departments. The enthusiasm that drove executives to push AI usage as far as it would go has collided with bills that don't reflect the initial projections. AWS is now betting the real bottleneck is not code generation but release management, testing, and safety review. Databricks is pitching context layers and ontologies as keys to trusted agents. Pramaana Labs raised $27 million to bring formal verification to AI in law, drug discovery, and tax preparation, where errors have real costs. The market is moving upstream from generic productivity tools toward domains where AI failures have measurable consequences and enterprises will actually pay for reliability.
Capital is flowing hardest toward hardware and infrastructure. Odyssey, a world model startup, just hit a $1.45 billion valuation with backing from Amazon and others. Canadian pension funds are acquiring stakes in Indian data center operators. The same geopolitical pressure driving data center investment in India means compute supply will fragment by region, making local models and local infrastructure more valuable than centralized American platforms. Nvidia's robots are learning to install GPUs and cut zip ties through teams of AI coding agents, signaling the company is solving the data problem for embodied AI at scale. Robot training data collection through vendors like XDOF is becoming a distinct business. Google believes Gemini makes conversational interaction valuable enough to justify a $99.99 smart speaker, reviving a market dormant for years. The hardware layer is where defensibility now lives, not in models that can be turned off by government order.
The research and infrastructure communities are converging on the same conclusion: raw model capability is table stakes. The real competition is over domain-specific applications and cost structure. OpenAI and Mistral are moving downstream into drug discovery, physics, and engineering acceleration, where outputs have regulatory moats and defensible business value. GitHub's trending repos reveal a pattern of modular, single-purpose tools that solve specific friction points rather than trying to be platforms. DeusData's codebase-memory-mcp cuts token overhead by 99 percent. Frameworks that let developers compose skills and tools without vendor lock-in are gaining steady adoption. The labs with the most compute are capturing high-value verticals. Infrastructure players like Hugging Face are betting on breadth and developer lock-in through tooling. Anthropic is pursuing geographic diversification and positioning safety research as a defensible moat in markets where regulation is tightening. The shift is decisive and already underway.
Grant Calloway
Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM is trained on these rewards to produce responses indistinguishable from what the user could have said. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes and contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law along several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1
We present a framework to cross-match sources from the Chandra Source Catalog (CSC v2.1) with optical sources from Gaia Data Release 3. Unlike purely spatial approaches, we use source properties such as magnitudes, colors, and distances to identify true counterparts, detect chance coincidences, and resolve ambiguities when multiple plausible candidates exist. We define a training set of high-confidence matches using NWAY, a Bayesian cross-matching framework that accounts for positional errors and source densities. We train a gradient-boosted classifier (LightGBM) on a variety of features from both catalogs. Of the ~254k unique X-ray sources, we find counterparts for ~113k sources, of which plausible multiple counterparts are found for ~7k. We find no counterparts for ~20k sources for which separation-based cross-matching does find a match, and attribute half of these to chance coincidences. We validate the pipeline on the Chandra Orion Ultradeep Project (COUP), where the machine-learning matches reproduce 95% of NWAY cross-matches without using any positional information. We release a catalog of the ~113k Chandra-Gaia counterparts, together with ~7k alternative matches and ~20k ambiguous NWAY associations, supporting future population studies of sources detectable by both Chandra and Gaia. We discuss limitations and provide a generalization of the framework that is applicable in other cross-matching scenarios.
Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
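To make the "unified score" idea concrete, here is a minimal sketch of optimistic trajectory scoring with model ensembles. The abstract does not give UBP2's actual formula, so everything below is an assumption for illustration: the function names (`rollout_return`, `unified_score`, `plan`), the pairing of ensemble members by index, the use of ensemble standard deviation as the epistemic-uncertainty term, and the bonus weight `beta` are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

# Toy types: scalar states and actions keep the sketch self-contained.
RewardModel = Callable[[float, float], float]    # (state, action) -> reward
DynamicsModel = Callable[[float, float], float]  # (state, action) -> next state
ValueModel = Callable[[float], float]            # state -> terminal value


def rollout_return(start: float, actions: Sequence[float],
                   reward_fn: RewardModel, dynamics_fn: DynamicsModel,
                   value_fn: ValueModel, gamma: float = 0.99) -> float:
    """Discounted predicted reward along one action sequence, plus terminal value."""
    state, total, discount = start, 0.0, 1.0
    for action in actions:
        total += discount * reward_fn(state, action)
        state = dynamics_fn(state, action)
        discount *= gamma
    return total + discount * value_fn(state)


def unified_score(start: float, actions: Sequence[float],
                  rewards: Sequence[RewardModel],
                  dynamics: Sequence[DynamicsModel],
                  values: Sequence[ValueModel],
                  beta: float = 1.0, gamma: float = 0.99) -> float:
    """Ensemble-mean return plus beta times ensemble disagreement.

    Assumption: member i of each ensemble is paired with member i of the
    others, and disagreement (population std of the returns) stands in
    for epistemic uncertainty.
    """
    returns = [rollout_return(start, actions, r, d, v, gamma)
               for r, d, v in zip(rewards, dynamics, values)]
    return mean(returns) + beta * pstdev(returns)


def plan(start: float, candidates: Sequence[Sequence[float]],
         rewards: Sequence[RewardModel],
         dynamics: Sequence[DynamicsModel],
         values: Sequence[ValueModel],
         beta: float = 1.0, gamma: float = 0.99) -> Sequence[float]:
    """Pick the candidate action sequence with the highest unified score."""
    return max(candidates,
               key=lambda a: unified_score(start, a, rewards, dynamics,
                                           values, beta, gamma))
```

With `beta = 0` the planner is purely exploitative and ranks candidates by mean predicted return; raising `beta` favors trajectories on which the ensemble members disagree, which is one simple way to trade exploitation against information acquisition without a separate exploration heuristic.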
Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks; the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 67 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 61 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 54 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 157 | $5.63 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 158 languages, sub-ms queries, 99% fewer tokens. Single static binary, zero dependencies.
IP addresses break, dial keys instead. Modular networking stack in Rust.
Give your AI agent eyes to see the entire internet. Read & search Twitter, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu — one CLI, zero API fees.
Meshery, the cloud native manager
An agentic skills framework & software development methodology that works.
Additional packages (components, document stores and the likes) to extend the capabilities of Haystack
Production-ready RAG Framework (Python/FastAPI). 1-line config swaps: 6 Vector DBs (Weaviate, Pinecone, Qdrant, ChromaDB, pgvector, MongoDB), 5 LLMs (Gemini, OpenAI, Claude, Ollama, OpenRouter). OpenAI-compatible API. 2100+ tests.
Ultralytics YOLO 🚀
A Large-Scale Knowledge Graph for Automated Scientific Research
RL training framework for diffusion and omni-modality models