From ten thousand feet, today's AI industry looks like a system sorting itself into layers, with power and capital concentrating at the infrastructure and regulatory levels while application builders race to ship products before the underlying costs calcify into permanent moats. The technical frontier in model performance has plateaued, but the economic frontier is moving fast, and the companies winning are those controlling the layers below the models themselves.
The clearest signal is regulatory permission itself becoming the scarce resource. Anthropic's Fable and Mythos models were blocked by the Trump administration in mid-June over cybersecurity concerns, then unblocked by June 30 after adding a new security measure, with global rollout announced July 1. The speed of reversal and minimal technical friction required to satisfy the block suggests the controls were less about genuine national security risk and more about demonstrating authority. The White House is now preparing AI model standards for announcement as soon as next week, meaning the same administration that just weaponized export controls is positioning itself to set the rules everyone else operates under. Europe is already preparing countermeasures, with the UK signaling a shift away from building for US companies and the EU weighing weaker data center climate rules after Big Tech lobbying. This pattern defines the moment: regulatory capture isn't a future risk, it's happening in real time, and the companies that can afford to negotiate with governments are the ones building the infrastructure others depend on.
Below the regulatory layer sits infrastructure competition that looks nothing like model performance racing. Ashton Kutcher's new VC fund explicitly targets infrastructure and energy backing AI labs rather than the labs themselves. Meta is building a cloud business to monetize excess compute, directly competing with AWS, Google Cloud, and Azure. SpaceX showed investors a handset-like AI device before going public, signaling another infrastructure push. Bhavin Turakhia is betting thirty million dollars of his own money to build an AI alternative to Microsoft Office, but the real play is proving someone outside the current cloud oligarchy can build and own the stack. Venice AI hit unicorn status at a sixty-five million Series A while already profitable with over seventy million annualized revenue, proving there's money in privacy-first platforms that don't depend on the same data moats. NVIDIA's dual announcement about capital partnerships and American manufacturing signals the company moving beyond selling chips to selling entire production systems, positioning itself as the operating system for continuous inference. These aren't disruptions to the leaders, they're efforts to build parallel infrastructure that doesn't require permission from whoever controls the regulatory dial.
The technical debt being created by autonomous agents is about to become someone's crisis and someone else's consolidation opportunity. AWS is launching a new OpenSearch engine that claims a seventy percent storage cost reduction for log analytics, because AI and agentic applications are generating telemetry at scales conventional observability systems weren't built to handle economically. Multiple sources flag that autonomous agents have a memory problem (they glitch, hang, or produce nonsense when context runs out) and that agents can industrialize inefficient infrastructure patterns at scale faster than humans can remediate them post-deploy. Claude Opus 4.7 was used to break into Front Gate's ticketing system and issue arbitrary festival tickets, showing that capability without control creates liability. The companies selling tools to manage this chaos (cost controls, memory solutions, governance frameworks) are positioning themselves as essential. GitHub trending data shows developers now treating AI agents as production infrastructure, with the majority of high-traction repos falling into three groups: agent frameworks; infrastructure that makes agents practical, such as sandboxing and token optimization; and tooling to extract and prepare data at scale. What's notably absent is any single dominant LLM provider's wrapper; instead, repos prioritize abstraction layers that let developers swap models without rewriting code. The industry is effectively betting that the infrastructure crisis will arrive faster than regulation can catch up, and whoever owns the solution layer owns the next consolidation wave.
Grant Calloway
LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.
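As a rough illustration of how a distributional gap over taxonomy categories can be quantified, the sketch below computes total variation distance between two sets of category labels. The choice of metric is an assumption (the abstract does not say which divergence measure the authors use), and every label other than "bridge" is a hypothetical placeholder.

```python
from collections import Counter


def category_distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of category labels into relative frequencies."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical labels for illustration only; not data from the paper.
human_labels = ["bridge", "gap", "reframe", "gap"]
llm_labels = ["bridge", "bridge", "bridge", "gap"]

gap = total_variation(
    category_distribution(human_labels),
    category_distribution(llm_labels),
)
```

With these toy labels the LLM set piles onto one category while the human set spreads across three, so the distance is well above zero.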
Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
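The layer contribution described above can be read as a simple ratio of gains. The sketch below is one plausible formalization based only on the abstract's wording; the function name, the layer indices, and all scores are hypothetical and are not results from the paper.

```python
def layer_contribution(base_score: float, full_score: float, layer_score: float) -> float:
    """Fraction of the full-RL improvement recovered by training one layer alone.

    base_score:  score of the model before RL
    full_score:  score after full-parameter RL
    layer_score: score after RL that updates only the layer in question
    """
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full RL produced no gain; contribution is undefined")
    return (layer_score - base_score) / full_gain


# Hypothetical scores for illustration only.
base, full = 40.0, 60.0
single_layer_scores = {0: 42.0, 12: 59.0, 16: 61.0, 27: 44.0}

contributions = {
    layer: layer_contribution(base, full, score)
    for layer, score in single_layer_scores.items()
}
# Rank layers from highest to lowest contribution.
ranked_layers = sorted(contributions, key=contributions.get, reverse=True)
```

Under this reading, a value above 1 corresponds to the case where training a single layer surpasses full-parameter RL, and comparing `ranked_layers` across tasks is what a rank correlation would measure.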
Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions. We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars. Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance. We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP. We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions. Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines. These results demonstrate that language can serve as a powerful and structured form of supervision for learning robust policies from suboptimal data.
Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge, a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone, without modifying the model's task-action behavior, improved the base agent's performance by roughly 2x-4x, making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.
When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human-readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p = 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n = 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
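For readers unfamiliar with the Wilson interval quoted above, here is a minimal sketch of the standard Wilson score formula. The count of 96 correct out of 105 certified is an assumption inferred from the reported 91.4% precision; the abstract does not state it directly.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: 96 of 105 certified answers correct (inferred from 91.4%).
low, high = wilson_interval(96, 105)
print(f"{low:.1%} to {high:.1%}")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and is not centered on the observed proportion, which is why the reported bounds are asymmetric around 91.4%.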
Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Transformers by 2-3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.
Composite score across coding, math, and reasoning
| # | Model | Score | Speed (tok/s) | Price ($/1M tokens) |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 69 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 66 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 82 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 51 | $10.00 |
| 5 | Claude Sonnet 5 | 53.4 | 89 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Provider | System | Type | Score |
|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% |
A complete AI agency at your fingertips - From frontend wizards to Reddit community ninjas, from whimsy injectors to reality checkers. Each agent is a specialized expert with personality, processes, and proven deliverables.
Open-source AI hackers to find and fix your app’s vulnerabilities.
"Vibe-Trading: Your Personal Trading Agent"
A comprehensive dataset of 433 fitness exercises. Each entry includes name, category, target muscle group, equipment, instructions, thumbnail image, and animation video.
An open source design system that's fully customizable and agent ready
Omnigent is an open-source AI agent framework and meta-harness: orchestrate Claude Code, Codex, Cursor, Pi, and custom agents — swap harnesses without rewriting, enforce policies and sandboxing, and collaborate in real time from any device.
Towards Robust Multimodal Sentiment Analysis with Incomplete Data
Project Tapestry aims to give every nation and participant frontier AI they can call their own — uniting a global consortium to train a shared frontier model from which partners build and own sovereign models aligned to their national, socio-cultural, and industrial needs.
SGLang is a high-performance serving framework for large language models and multimodal models.
A practical lab for building, testing, and evaluating apps with Apple's Foundation Models framework.