OpenAI is retreating from research and consumer products while the venture capital pouring into AI infrastructure confronts a reckoning: the models are expensive to run, the returns uncertain, and the money is flowing decisively toward builders who can ship enterprise tools rather than toward labs chasing capability gains. Claude Opus 4.6 claimed the top spot on SWE-rebench with a 65.3% score, displacing Gemini 3.1 Pro Preview, but the divergence between SWE-rebench and Artificial Analysis rankings now runs so deep that the two benchmarks appear to measure different problems entirely. The company shedding Sora and its science team while executives like Kevin Weil and Bill Peebles depart for other opportunities is the visible fracture point in a larger shift: the infrastructure bill has come due, component costs are rising faster than revenue can absorb, and the moonshot phase where research mattered more than shipping is over.
The capital is moving with surgical precision toward vertical integration and developer lock-in. Cursor is raising $2 billion at a $50 billion valuation on enterprise adoption alone. Recursive, a months-old startup founded by former DeepMind and OpenAI engineers, just closed a $500 million round at $4 billion with backing from Google Ventures and Nvidia. Hugging Face is doubling down on applied infrastructure rather than competing on model weights. GitHub is treating Copilot CLI as a teaching tool and content engine. Anthropic launched Claude Design as a visual tool for product managers. The pattern across these moves is not about capability breakthroughs but about who owns the developer experience and the workflows that enterprises actually use. Oracle is selling semantic search without LLMs, positioning vector-based retrieval as an alternative to RAG systems. Builders who can ship something enterprises will integrate into existing infrastructure are getting funded. Those chasing incremental research gains or consumer experiences are not.
On GitHub, the infrastructure for coordinated agents has matured past the single-agent reasoning loop. Donchitos' Claude-Code-Game-Studios demonstrates the new baseline: 49 agents organized into a studio hierarchy with 72 workflow skills, mirroring actual production team structures. OpenAI's agents-python framework and obra's superpowers both solve the same underlying problem of routing work between agents and managing dependencies without burning tokens needlessly, but they assume you are already building systems where agents talk to agents. Agents that can only read and write are hitting their limits; those that can observe and act on their environment through browser manipulation, screen capture, and file detection are solving harder problems. The research community, meanwhile, has moved away from end-to-end optimization toward compositional analysis: breaking systems into interpretable stages, measuring failure modes in controlled settings, and fusing complementary inference strategies to preserve both rigor and practical effectiveness. The gap between AI insiders who understand the cost structure and everyone else is widening. Developers chasing token optimization are writing more code that costs more to run. The industry is spending like it is certain of returns while operating like it is uncertain of anything else.
Grant Calloway
The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
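The abstract above hinges on a controlled shortest-path environment with two generalization axes: unseen maps and longer horizons. A minimal sketch of that kind of setup is below; the function names (`shortest_path`, `make_instance`) and the open-grid simplification are my own, not the paper's.

```python
from collections import deque
import random

def shortest_path(grid, start, goal):
    """BFS shortest path on a 4-connected grid; 1 = wall, 0 = free."""
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, [start])])
    seen = {start}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), path + [(nr, nc)]))
    return None  # goal unreachable

def make_instance(size, horizon, rng):
    """Sample an open grid and a start/goal pair whose optimal path is
    exactly `horizon` steps, enabling controlled length-scaling splits:
    train on short horizons, evaluate on strictly longer ones."""
    grid = [[0] * size for _ in range(size)]
    while True:
        start = (rng.randrange(size), rng.randrange(size))
        goal = (rng.randrange(size), rng.randrange(size))
        path = shortest_path(grid, start, goal)
        if path and len(path) - 1 == horizon:
            return grid, start, goal, path
```

Holding the map distribution fixed while varying only `horizon` is what lets the failure mode (recursive instability at longer horizons) be isolated from spatial transfer.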
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$–$4.1\%$), with $33$–$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$–$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
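The directed 3-cycle diagnostic above is easy to state concretely: a judge that prefers a over b, b over c, and c over a is intransitive on that document even if its aggregate violation rate looks low. A minimal sketch, assuming pairwise preferences are stored as a dict (the representation and function name are my own, not the paper's):

```python
from itertools import permutations

def count_directed_3cycles(prefers):
    """prefers[(a, b)] is True when the judge ranks output a over b.
    Counts unordered triples {a, b, c} that form a directed 3-cycle
    a > b > c > a, i.e. a per-document intransitivity."""
    items = {x for pair in prefers for x in pair}
    cycles = set()
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a)):
            # frozenset collapses the three rotations of the same cycle
            cycles.add(frozenset((a, b, c)))
    return len(cycles)
```

A document is flagged as inconsistent as soon as this count is nonzero, which is how a 0.8–4.1% aggregate violation rate can still touch 33–67% of documents.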
MLP is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go-to optimizer used to train tabular DL models. Unlike architecture design, however, the choice of optimizer for tabular DL has not been examined systematically, despite new optimizers showing promise in other domains. To fill this gap, we benchmark \Noptimizers optimizers on \Ndatasets tabular datasets for training MLP-based models in the standard supervised learning setting under a shared experiment protocol. Our main finding is that the Muon optimizer consistently outperforms AdamW, and thus should be considered a strong and practical choice for practitioners and researchers, if the associated training efficiency overhead is affordable. Additionally, we find exponential moving average of model weights to be a simple yet effective technique that improves AdamW on vanilla MLPs, though its effect is less consistent across model variants.
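The EMA technique mentioned above is a one-liner in practice: keep a shadow copy of the weights and blend the live parameters into it after each optimizer step, then evaluate with the shadow copy. A minimal sketch over a flat parameter dict; the function name and `decay` default are illustrative, not taken from the paper.

```python
def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of model weights. Called once per
    optimizer step; ema_params is the smoothed copy used at eval time,
    params holds the live (just-updated) weights."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```

With `decay=0.999` the shadow weights average over roughly the last thousand steps, which is what smooths out the noise in AdamW's trajectory on vanilla MLPs.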
Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while humans easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with the corresponding observation, resulting in hallucination in the final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young-Zhen/VRU_Interpret .
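The bookkeeping the VRU task demands of a model is trivially mechanical, which is what makes the 100%-human / poor-model gap striking. A toy ground-truth tracker, assuming a four-way heading and left/right turns (the exact rotation vocabulary in the benchmark may differ; this is my simplification):

```python
HEADINGS = ["north", "east", "south", "west"]

def final_viewpoint(start, turns):
    """Track the facing direction through a sequence of 'left'/'right'
    quarter turns described in text, returning the final heading."""
    idx = HEADINGS.index(start)
    for turn in turns:
        idx = (idx + (1 if turn == "right" else -1)) % 4
    return HEADINGS[idx]
```

The probing result above suggests models do encode something like `idx` internally but fail to bind it to the observation associated with that heading, which is the step this tracker makes explicit.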
The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.
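The pixel-level anomaly maps described above only become a driver alert once they are thresholded and size-filtered. A minimal sketch of that last step; the function name, the fixed-threshold interface, and the `min_pixels` filter are my own illustration (in practice the threshold would be calibrated on anomaly-free validation frames), not a detail from the benchmark.

```python
import numpy as np

def anomaly_alert(anomaly_map, threshold, min_pixels=50):
    """Turn a per-pixel anomaly score map into (binary mask, alert flag).
    The mask localizes the suspicious region for the driver; the alert
    fires only when enough pixels exceed the calibrated threshold,
    suppressing isolated noisy pixels."""
    mask = anomaly_map >= threshold
    return mask, bool(mask.sum() >= min_pixels)
```

The mask is what lets the system point at *where* the unfamiliar object is without any prior model of what it might be.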
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 57.3 | 58 | $10.00 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 126 | $4.50 |
| 3 | GPT-5.4 | 56.8 | 82 | $5.63 |
| 4 | GPT-5.3 Codex | 53.6 | 81 | $4.81 |
| 5 | Claude Opus 4.6 | 53.0 | 54 | $10.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
The GEP-Powered Self-Evolution Engine for AI Agents. Genome Evolution Protocol. | evomap.ai
Self-evolving agent: grows skill tree from 3.3K-line seed, achieving full system control with 6x less token consumption
Claude Code skill to support Android app's reverse engineering
AI that sees your screen, listens to your conversations and tells you what to do
Turn Claude Code into a full game dev studio — 49 AI agents, 72 workflow skills, and a complete coordination system mirroring real studio hierarchy.
This repository contains hand-curated resources for Prompt Engineering with a focus on Generative Pre-trained Transformer (GPT), ChatGPT, PaLM, etc.
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
Fast and Accurate ML in 3 Lines of Code
A super fast Graph Database uses GraphBLAS under the hood for its sparse adjacency matrix graph representation. Our goal is to provide the best Knowledge Graph for LLM (GraphRAG).
The Complete AI Development Toolkit for Claude Code — 89 skills, 31 agents, 99 hooks. Production-ready patterns for full-stack development.