The AI market is fragmenting into specialized layers while capital consolidates around infrastructure control. Anthropic's decision to sell Claude Fable 5 with explicit safety restrictions to the public while offering an unrestricted Claude Mythos 5 to trusted organizations reveals a deliberate business strategy: treating regulatory compliance and brand protection as premium features. This is not a technical constraint but a commercial choice, one that mirrors a broader pattern across the industry where different versions of capability are priced and distributed based on customer relationships and assumed trustworthiness. Google's price cuts on budget AI subscriptions, Meta's 168-megawatt data center deal in India, and SpaceX's plans for a million data-center satellites all point to the same bottleneck. Compute capacity and the energy to power it have become the binding constraint in AI economics, making infrastructure ownership more valuable than incremental improvements to model weights.
Capital is flowing toward this infrastructure layer at extraordinary velocity. Apollo and Blackstone finalized a $35 billion lending deal to finance Anthropic's growth, while Justin Ernest deployed nearly $500 million into startups including Anthropic, Anduril, and SpaceX through a captive network of limited partners rather than a traditional venture fund. SpaceX is going public Friday at $1.75 trillion with an AI division that lost $6.4 billion last year, a loss that reflects structural investment in compute rather than a failed business line. The real capital momentum, however, is flowing toward applied AI companies solving specific enterprise problems. Lovable claims $500 million in annualized revenue with 1 million new projects a week, while Sandstone raised $30 million in Series A funding for an AI legal assistant just six months after its seed round. These numbers suggest the money is not in model builders chasing benchmark improvements but in tools that let enterprises build applications on top of existing models.
The regulatory surface is hardening while the technology spreads, creating contradictory incentives that push companies toward fragmentation. Anthropic's decision to block queries on cybersecurity, biology, and chemistry in Fable 5 reflects calculations about public scrutiny versus private use. The UK's demand that tech companies filter photos and messages on device within three months has triggered CISO concerns that the same mechanisms could undermine enterprise security. A federal judge blocked the Trump administration's H-1B visa fee, offering temporary relief to companies hiring foreign talent for AI roles. These pressures are not aligned. They push companies to build different products for different jurisdictions, treat compliance as a product design problem rather than a cost center, and accept that the frontier is no longer a single technological line but a series of regulatory and commercial frontiers, each with its own winners and losers.
Research and infrastructure trends reinforce this fragmentation. The labs are converging on a thesis that AI's value lies not in the model itself but in how it multiplies the productive output of knowledge workers at scale. NVIDIA's role in Apple's Private Cloud Compute and the broader push to move inference onto customer infrastructure signal that the real margins are in the hardware and platform layers, not API calls. On GitHub, the market for AI agents is splintering into specialized tools rather than consolidating around platforms, with developers building narrower agents optimized for specific workflows rather than general-purpose systems. Meanwhile, libraries like DALI and ManiSkill that push computation onto hardware early, the commoditization of vector search, and the growing traction of edge-first databases all suggest developers are preparing for a world where models train and run closer to data. The company that owns the distribution layer, whether that's GitHub's CLI, Google's Meet, or AWS's Bedrock, wins regardless of whose model runs underneath.
Grant Calloway
Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all. This gap leaves practitioners unable to diagnose why standard methods underperform the best single modality, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at https://github.com/IlayMalinyak/mm_align_vs_pred.
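The contrast between two-sided and one-sided whitening can be sketched with textbook linear tools. The block below is an illustration only, not the paper's method: it assumes alignment behaves like classical CCA (both modalities whitened) and prediction behaves like a whitened-source linear regression (only the source whitened). The function names `alignment_encoder` and `prediction_encoder` are hypothetical, and the spiked model and separation ratios from the abstract are not reproduced here.

```python
import numpy as np

def inv_sqrt(cov, eps=1e-6):
    """Symmetric inverse square root of a covariance matrix (a whitening map)."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 0.0) + eps)) @ vecs.T

def covariances(X, Y):
    """Centered within-modality and cross-modality covariance estimates."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    return X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

def alignment_encoder(X, Y, k):
    """CCA-style encoder for X: whiten BOTH modalities, then take the top-k
    singular directions of the whitened cross-covariance."""
    cxx, cyy, cxy = covariances(X, Y)
    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    u, s, _ = np.linalg.svd(wx @ cxy @ wy)
    return wx @ u[:, :k], s[:k]  # s holds the canonical correlations

def prediction_encoder(X, Y, k):
    """Prediction-style encoder for X: whiten ONLY the source modality, so
    high-variance target directions weigh more than they do under CCA."""
    cxx, _, cxy = covariances(X, Y)
    wx = inv_sqrt(cxx)
    u, s, _ = np.linalg.svd(wx @ cxy)
    return wx @ u[:, :k], s[:k]
```

Under these assumptions the two encoders differ only in whether the target covariance is normalized away, which is one way to read the abstract's remark that prediction keeps whatever is cross-predictable while alignment treats both views symmetrically.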
Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT, which constructs the training objective directly from the desired target distribution. This method consistently outperforms existing variants across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
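The two choices in the Q-target decomposition can be made concrete with a small sketch. This is an assumption-laden illustration, not the paper's Target-SFT objective: the weight `alpha` on the observed token and the rule of spreading the remaining mass in proportion to the model's own prior are both hypothetical choices, as are the function names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def build_q_target(prior, observed, alpha):
    """Build a token-level target Q.

    Choice (1): put mass `alpha` on the observed token.
    Choice (2): spread the remaining 1 - alpha over the other tokens in
    proportion to the model prior (an assumed allocation rule).
    """
    rest = sum(p for i, p in enumerate(prior) if i != observed)
    if rest == 0:
        # No alternatives carry prior mass: fall back to the one-hot target.
        return [1.0 if i == observed else 0.0 for i in range(len(prior))]
    return [
        alpha if i == observed else (1.0 - alpha) * p / rest
        for i, p in enumerate(prior)
    ]

def cross_entropy(q, p):
    """Cross-entropy of model distribution p against target q."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

# Example: a 4-token vocabulary where token 2 was observed.
prior = softmax([2.0, 1.0, 0.5, -1.0])
q = build_q_target(prior, observed=2, alpha=0.8)
loss = cross_entropy(q, prior)
```

Setting `alpha = 1.0` recovers the standard one-hot SFT target, so under this sketch ordinary SFT is the special case that ignores the second choice entirely.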
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.
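The routing idea can be illustrated with a minimal sketch. The abstract does not say how EEVEE's router is implemented, so everything here is an assumption: inputs are taken to be embedding vectors, clusters are represented by centroids, and assignment is by nearest Euclidean distance. The names `nearest_cluster` and `route` are hypothetical, and the co-evolution training loop is not modeled.

```python
import math

def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to the embedding (Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: math.dist(embedding, centroids[i]),
    )

def route(embedding, centroids, prompts):
    """Assign an input to a task cluster and return that cluster's prompt.

    `prompts[i]` is the prompt configuration currently attached to cluster i.
    """
    cluster = nearest_cluster(embedding, centroids)
    return cluster, prompts[cluster]

# Example with two hypothetical task clusters in a 2-D embedding space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
prompts = ["Solve the math problem step by step.", "Write and test the code."]
cluster, prompt = route((9.0, 8.0), centroids, prompts)
```

Keeping one prompt per cluster is what lets updates learned on one kind of task stay out of the prompt used for another, which is the interference problem the abstract describes.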
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
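The distribution-matching step can be sketched as a per-token divergence between the self-teacher (question plus context) and the student (question only). The abstract does not state the divergence or its direction, so KL(teacher || student) is an assumption here, and the logits are made-up inputs rather than outputs of a real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over one sequence.

    Each argument is a list with one logit vector per token position.
    Returns the mean loss and the per-token terms.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Example: the teacher disagrees with the student only at position 1.
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
loss, per_token = self_distillation_loss(student, teacher)
```

In this toy example only the middle position contributes to the loss, which mirrors the abstract's point: context that agrees with the student on correct steps leaves those tokens untouched, while context that differs everywhere would push on every position.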
Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
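The sample-then-select loop described for FPCG can be sketched as follows. This is a simplified illustration under stated assumptions: `sample_sentence` and `probe_score` are hypothetical callables standing in for the reasoning model and the trained probe, and the probe is scored on text here, whereas the abstract describes probes that read the model's activations.

```python
from typing import Callable

def fpcg_step(
    prefix: str,
    sample_sentence: Callable[[str], str],
    probe_score: Callable[[str], float],
    num_candidates: int = 4,
) -> str:
    """Sample several candidate next sentences and keep the one whose
    continuation the probe rates as most likely to lead to the target behavior."""
    candidates = [sample_sentence(prefix) for _ in range(num_candidates)]
    return max(candidates, key=lambda s: probe_score(prefix + s))

def fpcg_generate(
    prompt: str,
    sample_sentence: Callable[[str], str],
    probe_score: Callable[[str], float],
    max_sentences: int = 3,
    num_candidates: int = 4,
) -> str:
    """Build the output sentence by sentence, steering only through selection."""
    text = prompt
    for _ in range(max_sentences):
        text += fpcg_step(text, sample_sentence, probe_score, num_candidates)
    return text
```

Because every emitted sentence is an unmodified sample from the model, selection of this kind never edits hidden states, which is consistent with the abstract's claim of steering with almost no loss in output quality.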
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 64.9 | 61 | $21.88 |
| 2 | Claude Opus 4.8 | 61.4 | 60 | $10.94 |
| 3 | GPT-5.5 | 60.2 | 54 | $11.25 |
| 4 | Claude Opus 4.7 | 57.3 | 47 | $10.94 |
| 5 | Gemini 3.1 Pro Preview | 57.2 | 127 | $4.50 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
AI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary
A vector index built on TurboQuant, written in Rust with Python bindings
We write your reusable computer vision tools. 💜
Open Source Computer Vision Library
Desktop app to manage markdown knowledge bases
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
SAPIEN Manipulation Skill Framework, an open source GPU parallelized robotics simulator and benchmark
An open-source read-along document reader server with high-quality TTS options, synchronized highlighting, and audiobook export for EPUB, PDF, DOCX, TXT, and MD.
A library to model multivariate data using copulas.
Local-first personal AI identity and memory for MCP-compatible coding tools — lessons, decisions, playbooks, and project context you can see, edit, and override.