OpenAI's sudden shutdown of Sora after six months of public operation reveals a calculation that even market leaders can misjudge: the regulatory and reputational cost of collecting user data through accessible AI products often outweighs the strategic value of the product itself. But the week's real story lies not in what frontier labs are abandoning, but in where capital and engineering talent are actually flowing. Mistral raised $830 million to build Nvidia infrastructure in Europe as a non-US counterweight to American dominance. Eli Lilly committed $2 billion to a Hong Kong biotech firm for AI drug development, routing pharmaceutical R&D capital into China. Google carved out Google-Agent as a distinct technical entity separate from Googlebot, signaling that even search incumbents must now create new categories for real-time user-initiated AI access. The pattern is unmistakable: frontier labs face mounting regulatory scrutiny the moment they touch user data or democratize capabilities too widely, while infrastructure, tooling, and domain-specific applications accelerate precisely because they operate below the political line of sight.
The consolidation is happening at the infrastructure layer. Amazon released A-Evolve to automate agent development. Chroma shipped Context-1, a 20-billion-parameter search model. Mistral released Voxtral TTS, a four-billion-parameter open-weight speech model that directly competes with proprietary voice APIs. Multiple open-source frameworks including AIO Sandbox, nanobot, and CAI are standardizing how autonomous systems execute. This is not rhetoric about democratization. These are builders moving past language models into the systems that make them useful at scale. On GitHub, the ecosystem has already moved beyond "how do I use Claude Code" to "how do I orchestrate multiple agents." Repositories like hermes-agent and oh-my-claudecode dominate because they solve the next layer of complexity: teams don't deploy single agents, they deploy systems that coordinate multiple agents with different capabilities. The voice and robotics layers are maturing into production workloads. Developers are now solving the plumbing problems that let agents run reliably in production, from compute orchestration to persistent session context to RAG systems that don't require vector databases.
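The coordination pattern those repositories converge on can be sketched in a few lines: a coordinator routes each subtask to an agent advertising the needed capability and collects the results. The agent names and `run` callables below are hypothetical stand-ins for LLM-backed workers, not APIs from hermes-agent or oh-my-claudecode.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    capabilities: set               # e.g. {"code", "review"}
    run: Callable[[str], str]       # hypothetical stand-in for a real agent call

class Coordinator:
    """Routes each subtask to the first agent with the required capability."""
    def __init__(self, agents):
        self.agents = agents

    def dispatch(self, subtasks):
        results = []
        for capability, payload in subtasks:
            agent = next(a for a in self.agents if capability in a.capabilities)
            results.append(f"{agent.name}: {agent.run(payload)}")
        return results

# Stub agents standing in for specialized LLM-backed workers.
coder = Agent("coder", {"code"}, lambda t: f"patch for {t}")
reviewer = Agent("reviewer", {"review"}, lambda t: f"approved {t}")

coordinator = Coordinator([coder, reviewer])
print(coordinator.dispatch([("code", "issue-42"), ("review", "pr-7")]))
```

Real frameworks layer persistence, retries, and shared context on top, but the routing core is this small.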
OpenAI's Gates Foundation partnership for disaster response in Asia illustrates the operational shift. The framing positions OpenAI as the infrastructure provider for humanitarian coordination, but the significance lies in embedding OpenAI's tools into institutional workflows before competitors do, creating switching costs at the operational level. This is less about altruism and more about establishing OpenAI as the default AI layer for critical infrastructure decisions in a high-growth region. Meanwhile, research papers are moving beyond aggregate metrics toward finer-grained characterization of failure modes. Standard benchmarks are being exposed as obscuring real problems: repository-level code comprehension reveals memorization masquerading as reasoning, and state-of-the-art multimodal models plateau well below human performance when multiple temporally separated observations are required. The SWE-rebench leaderboard shows Claude Opus 4.6 at 65.3%, but the tighter clustering at the top suggests either more homogeneous model performance or substantial methodological divergence from older benchmarks, making confident assessment of genuine progress impossible without documentation of what changed. Capital and talent are flowing toward the places where the next layer of value accrues, and those places are not where the headlines suggest they should be.
Grant Calloway
Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject. The root cause is not functional incorrectness but a lack of organicity: generated code ignores project-specific conventions, duplicates functionality already provided by internal APIs, and violates implicit architectural constraints accumulated over years of development. Simply exposing an agent to the latest repository snapshot is not enough: the snapshot reveals the final state of the codebase, but not the repository-specific change patterns by which that state was reached. We introduce Learning to Commit, a framework that closes this gap through Online Repository Memory. Given a repository with a strict chronological split, the agent performs supervised contrastive reflection on earlier commits: it blindly attempts to resolve each historical issue, compares its prediction against the oracle diff, and distills the gap into a continuously growing set of skills, reusable patterns that capture coding style, internal API usage, and architectural invariants. When a new PR description arrives, the agent conditions its generation on these accumulated skills, producing changes grounded in the project's own evolution rather than generic pretraining priors. Evaluation is conducted on genuinely future, merged pull requests that could not have been seen during the skill-building phase, and spans multiple dimensions including functional correctness, code-style consistency, internal API reuse rate, and modified-region plausibility. Experiments on an expert-maintained repository with rich commit history show that Online Repository Memory effectively improves organicity scores on held-out future tasks.
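The chronological protocol the abstract describes can be sketched as follows: commits before a cutoff feed the skill memory, strictly later PRs form the held-out evaluation set, and a skill is distilled only when the blind attempt diverges from the oracle diff. `attempt_fix` and `distill_skill` are hypothetical stand-ins for the agent's attempt and reflection steps, not the paper's actual interface.

```python
def build_repository_memory(commits, cutoff_ts, attempt_fix, distill_skill):
    """Strict chronological split, then contrastive reflection on the past.

    commits: list of {"ts": int, "issue": str, "diff": str}
    attempt_fix(issue) -> predicted diff (hypothetical agent call)
    distill_skill(prediction, oracle) -> skill (hypothetical reflection call)
    """
    past = [c for c in commits if c["ts"] < cutoff_ts]
    future = [c for c in commits if c["ts"] >= cutoff_ts]  # eval only, never seen
    skills = []
    for c in sorted(past, key=lambda c: c["ts"]):
        prediction = attempt_fix(c["issue"])   # blind attempt, no oracle access
        if prediction != c["diff"]:            # gap vs. the oracle diff
            skills.append(distill_skill(prediction, c["diff"]))
    return skills, future
```

At inference time the generation step would condition on `skills` in addition to the new PR description; only the splitting and reflection loop is shown here.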
Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.
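The gradient imbalance the paper analyzes is easy to see in a toy tied model: one shared matrix E serves as input embedding (h = E[x]) and output unembedding (logits = E h), so its gradient is the sum of an input-path term (touching one row) and an output-path term (touching every row). This numpy sketch separates the two contributions and checks their sum against finite differences; it illustrates the mechanism only, not the paper's tuned-lens analysis or training intervention.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 12, 8                        # tiny vocabulary and embedding dim
E = rng.normal(0, 0.1, (V, d))      # tied embedding/unembedding matrix
x, y = 3, 7                         # input token, target token

def loss(E):
    h = E[x]                        # input path: one-row lookup
    logits = E @ h                  # output path: every row scores h
    return -logits[y] + np.log(np.exp(logits).sum())   # cross-entropy

# Analytic gradient of the tied matrix, split by path.
h = E[x]
p = np.exp(E @ h); p /= p.sum()
dlogits = p.copy(); dlogits[y] -= 1.0
grad_out = np.outer(dlogits, h)     # output-path term: touches all V rows
grad_in = np.zeros_like(E)
grad_in[x] = E.T @ dlogits          # input-path term: touches only row x

# Finite-difference check that the two terms sum to the true tied gradient.
num, eps = np.zeros_like(E), 1e-6
for i in range(V):
    for j in range(d):
        Ep, Em = E.copy(), E.copy()
        Ep[i, j] += eps; Em[i, j] -= eps
        num[i, j] = (loss(Ep) - loss(Em)) / (2 * eps)
assert np.allclose(grad_out + grad_in, num, atol=1e-5)

print(np.linalg.norm(grad_out), np.linalg.norm(grad_in))
```

The paper's intervention of scaling input gradients corresponds to multiplying `grad_in` by a factor before the update, shifting the balance between the two paths.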
Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom (2 per finger and 3 at the thumb), buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at https://ruka-hand-v2.github.io/ .
Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
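The projection idea generalizes a classic construction: averaging a weight matrix over a group orbit projects it onto the equivariant subspace, and interpolating between the original and projected weights controls the degree of equivariance. The sketch below uses horizontal flip (a two-element group) acting on a plain linear layer as the designed subspace; this is an illustrative assumption, not the paper's construction for ViT or ResNet backbones.

```python
import numpy as np

n = 6
F = np.eye(n)[::-1]                 # flip permutation matrix, F @ F = I

def equiv_error(W):
    """How far W is from commuting with the flip: ||W F - F W||."""
    return np.linalg.norm(W @ F - F @ W)

def soft_project(W, alpha):
    """Interpolate toward the flip-equivariant subspace; alpha=1 is exact."""
    W_proj = 0.5 * (W + F @ W @ F)  # group average: projection onto subspace
    return (1 - alpha) * W + alpha * W_proj

rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))
errs = [equiv_error(soft_project(W, a)) for a in (0.0, 0.5, 1.0)]
print(errs)   # error shrinks with alpha and vanishes at alpha=1
```

Because the projection is linear, the equivariance error of the interpolated weights scales as (1 - alpha) times the original error, which is the kind of explicit bound the abstract refers to.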
We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent-verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $ / 1M tokens |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 88 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 114 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 92 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 59 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 79 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
A visual, example-driven guide to Claude Code — from basic concepts to advanced agents, with copy-paste templates that bring immediate value.
Teams-first multi-agent orchestration for Claude Code
Open-Source Frontier Voice AI
Financial data platform for analysts, quants and AI agents.
openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 300+ supported cars.
Bindings and ADO.NET Provider for DuckDB
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 20+ clouds, or on-prem).
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
A tremendous feat of documentation, this guide covers Claude Code from beginner to power user, with production-ready templates for Claude Code features, guides on agentic workflows, and a lot of great learning materials, including quizzes and a handy "cheatsheet". Whether it's the "ultimate" guide to Claude Code will be up to the reader :)