The government's ability to weaponize AI policy is outpacing the market's capacity to adapt, and this week's regulatory moves have exposed a structural vulnerability that no amount of capital can fix. Anthropic's sudden export ban on Fable and Mythos models, justified on national security grounds, sits alongside the Trump administration's simultaneous protection of xAI's unpermitted gas turbines and its citing of Pentagon personnel's use of generative AI tools as grounds for regulatory leniency. The logic is circular and intentional: AI is strategically important enough to restrict when it threatens foreign markets, and strategically important enough to exempt from Clean Air Act enforcement when domestic infrastructure is at stake. Neither company negotiated this arrangement. Both are now hostages to whichever interpretation of national interest serves the administration's immediate purpose. OpenAI's leaked financials showing billions in annual losses compound the precarity. Revenue is real but dwarfed by R&D spending. These companies cannot absorb extended regulatory uncertainty or market restrictions. SpaceX's $60 billion acquisition of Cursor days after going public signals that capital still flows toward consolidation plays, but the underlying bet is that the regulatory environment will remain favorable long enough for those investments to reach profitability.
The fragmentation of consumer preference and corporate strategy reveals where the real constraint lies. Sixty percent of US consumers say AI branding is a turnoff, yet companies are racing to integrate AI everywhere: Android 17 expands Gemini features, Meta launches AI search modes, OpenAI partners with Visa for direct ChatGPT purchases. ChatGPT's market share has slipped below 50 percent for the first time despite maintaining 1.1 billion monthly users, while Claude trails at 245 million. Anthropic's pause on token-based billing for its Claude Agent SDK signals that even AI-first companies understand that pricing volatility destroys adoption. The constraint is not technical capability or regulatory approval. It is whether users will tolerate the cost and uncertainty long enough for these businesses to reach profitability. Robinhood's CEO notably omitted AI from his justification for 10 percent layoffs, breaking from the industry pattern of blaming automation for headcount cuts. That silence matters more than the cuts themselves.
In research and infrastructure, the field is sorting into specialized theaters with different competitive dynamics and winners. The infrastructure race has clear outcomes: NVIDIA and AMD compete through MLPerf benchmarks and hardware partnerships, where wins compound. The application race remains fragmented: Microsoft operates in Turkish farming, NVIDIA in AR glasses, Anthropic in coding agents, Google in housing planning. The money moves fastest where switching costs are highest, which is the hardware layer. Everything else is still fighting for distribution. Research across robotics, language models, and healthcare systems shows a methodological shift away from monolithic scaling toward conditional, task-aware allocation of computation. Inference-time steering replaces end-to-end training with post-hoc refinement loops. Variable-width transformers and adaptive computation abandon uniform capacity allocation across layers. Agentic frameworks decompose complex problems through structured reasoning rather than end-to-end prediction. On code-specific benchmarks, GLM-5.1 and GLM-5.2 show significant repositioning when evaluated on software engineering tasks versus general capability suites, suggesting that specialized evaluation surfaces different architectural strengths. Developer momentum concentrates on two problems: replacing slow tools with fast ones in Rust and compiled languages, and building the operational layer around AI applications before the AI layer itself stabilizes. LangChain4j, DeepSeek-Reasonix, and production trading engines are engineered around how modern LLM inference actually works, not around what sounds impressive. The infrastructure is consolidating. Everything else is still being built.
Grant Calloway
Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
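As a rough illustration of the generator-verifier idea in the abstract above, the sketch below samples several candidate actions from a policy and keeps the one a verifier scores highest. The best-of-N selection rule, the names `steer`, `make_toy_generator`, and `toy_verifier`, and the one-dimensional toy task are all assumptions made for illustration; the abstract does not say how VERITAS combines the two components.

```python
import random
from typing import Callable, List, Tuple

Observation = List[float]
Action = List[float]


def steer(
    generator: Callable[[Observation], Action],
    verifier: Callable[[Observation, Action], float],
    observation: Observation,
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Sample candidates from the generator; return the best one and its score."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [generator(observation) for _ in range(num_candidates)]
    scores = [verifier(observation, action) for action in candidates]
    best_index = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Toy stand-ins (hypothetical): the "policy" proposes noisy 1-D actions and
# the "verifier" prefers actions close to the target stored in the observation.
def make_toy_generator(seed: int) -> Callable[[Observation], Action]:
    rng = random.Random(seed)
    return lambda obs: [rng.uniform(-1.0, 1.0)]


def toy_verifier(obs: Observation, action: Action) -> float:
    return -abs(action[0] - obs[0])


observation = [0.25]
_, single_score = steer(make_toy_generator(0), toy_verifier, observation, num_candidates=1)
best_action, steered_score = steer(
    make_toy_generator(0), toy_verifier, observation, num_candidates=32
)
```

With the same seed, the 32-candidate run can never score worse than the single-sample run, because its candidate set contains that sample. Rollouts whose verifier score clears some threshold could then be kept as fine-tuning data, which is the second use the abstract describes.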
Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
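To make the shape of such a design concrete, here is a minimal sketch of a per-layer width schedule that is wide at both ends and narrow in the middle, plus one possible parameter-free way to resize a residual vector between layers. The linear taper, the rounding to a multiple of 64, the truncate-or-zero-pad resizing rule, and the function names are all assumptions for illustration; the abstract does not describe the paper's actual schedule or resizing mechanism.

```python
from typing import List


def x_shaped_widths(
    num_layers: int, outer_width: int, middle_width: int, multiple_of: int = 64
) -> List[int]:
    """Widths that taper linearly from outer_width at the ends to middle_width."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if middle_width > outer_width:
        raise ValueError("middle_width must not exceed outer_width")
    if num_layers == 1:
        return [outer_width]
    center = (num_layers - 1) / 2
    widths = []
    for layer in range(num_layers):
        # distance is 1.0 at the first and last layer and 0.0 at the center
        distance = abs(layer - center) / center
        raw = middle_width + distance * (outer_width - middle_width)
        widths.append(max(multiple_of, round(raw / multiple_of) * multiple_of))
    return widths


def resize_residual(x: List[float], new_width: int) -> List[float]:
    """Assumed parameter-free resize: truncate to narrow, zero-pad to widen."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = x_shaped_widths(num_layers=5, outer_width=1024, middle_width=512)
average_width = sum(widths) / len(widths)
```

For five layers this gives widths of 1024, 768, 512, 768, and 1024, so the average width falls below the 1024 used by a uniform model with the same outer width, which is the source of the compute and KV-cache savings the abstract reports.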
Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the $\mathbb{Z}_2$-index and the list replicability number. We order these measures, showing that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and $\mathbb{Z}_2$-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.
Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models. But they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
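The retrieval step in the abstract above can be pictured with a small upper-confidence-bound sketch: each stored rule is scored by its relevance to the current query, its smoothed historical success rate, and an exploration bonus that shrinks as the rule is used more. The additive combination, the Laplace smoothing, and the names `Rule`, `ucb_score`, and `select_rules` are assumptions for illustration, not the paper's formulation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    text: str
    relevance: float  # semantic similarity to the current query, assumed in [0, 1]
    successes: int = 0
    uses: int = 0


def ucb_score(rule: Rule, total_uses: int, exploration: float = 1.0) -> float:
    """Relevance plus smoothed success rate plus a UCB-style exploration bonus."""
    mean_success = (rule.successes + 1) / (rule.uses + 2)  # Laplace smoothing
    bonus = exploration * math.sqrt(math.log(total_uses + 1) / (rule.uses + 1))
    return rule.relevance + mean_success + bonus


def select_rules(rules: List[Rule], k: int, exploration: float = 1.0) -> List[Rule]:
    """Return the k rules with the highest UCB score."""
    total_uses = sum(rule.uses for rule in rules)
    ranked = sorted(
        rules,
        key=lambda rule: ucb_score(rule, total_uses, exploration),
        reverse=True,
    )
    return ranked[:k]


rules = [
    Rule("A", relevance=0.9, successes=1, uses=10),  # relevant but rarely works
    Rule("B", relevance=0.8, successes=9, uses=10),  # relevant and reliable
    Rule("C", relevance=0.1, successes=0, uses=0),   # never tried
]
greedy = [rule.text for rule in select_rules(rules, k=2, exploration=0.0)]
exploring = select_rules(rules, k=1, exploration=1.0)[0].text
```

With the bonus switched off, the reliable rule B outranks the slightly more relevant but unreliable rule A. With the bonus on, the untried rule C is picked first, which is the exploration behavior that UCB-style selection is meant to provide.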
Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($ν$), and density ($ρ$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially varying ($E$, $ν$, $ρ$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state of the art. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.
Composite score across coding, math, and reasoning
| # | Model | Score | Speed (tok/s) | Price ($/1M tokens) |
|---|---|---|---|---|
| 1 | Claude Fable 5 | 59.9 | 0 | $20.00 |
| 2 | Claude Opus 4.8 | 55.7 | 68 | $10.00 |
| 3 | GPT-5.5 | 54.8 | 67 | $11.25 |
| 4 | Claude Opus 4.7 | 53.5 | 54 | $10.00 |
| 5 | GPT-5.4 | 51.4 | 166 | $5.63 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Junie | 61.6% |
| 3 | Codex | 60.4% |
| 4 | Claude Code | 59.6% |
| 5 | gpt-5.5-2026-04-23-medium | 58.9% |
freeCodeCamp.org's open-source codebase and curriculum. Learn math, programming, and computer science for free.
Rust-based platform for the Web
A self-hosted data logger for your Tesla 🚘 [main maintainer=@JakobLichterfeld]
Collection of publicly available IPTV channels from all over the world
JavaScript API for Chrome and Firefox
Semantic code searcher and codebase utility
Production-ready RAG Framework (Python/FastAPI). 1-line config swaps: 6 Vector DBs (Weaviate, Pinecone, Qdrant, ChromaDB, pgvector, MongoDB), 5 LLMs (Gemini, OpenAI, Claude, Ollama, OpenRouter). OpenAI-compatible API. 2100+ tests.
Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.
DeepSeek-native AI coding agent for your terminal. Engineered around prefix-cache stability — leave it running.
LangChain4j is an open-source Java library that simplifies the integration of LLMs into Java applications through a unified API, providing access to popular LLMs and vector databases. It makes implementing RAG, tool calling (including support for MCP), and agents easy. LangChain4j integrates seamlessly with various enterprise Java frameworks.