Today's news marks a genuine inflection point: the age of standalone AI products is ending, and the era of embedded AI as operational infrastructure is beginning. The evidence is structural, not speculative. Google, Amazon, and Zoom are not adding AI features to their platforms; they are embedding AI agents directly into workflows where users already spend time and money. RevenueCat's data confirms the economic reality: AI-powered apps launch with strong monetization but hemorrhage users within months. Standalone applications cannot retain attention; the winners will be platforms with existing distribution and lock-in. Meanwhile, the companies that control the infrastructure layer are consolidating it. Thinking Machines Lab secured a gigawatt of Nvidia compute through a multiyear deal that includes strategic investment from Nvidia itself, a move that signals vertical integration of training, inference, and hardware orchestration. Legora reached a $5.55 billion valuation by solving a specific pain point in a high-margin vertical. The pattern repeats: generalist features bolt onto platforms with distribution; vertical specialists win on margin and defensibility.
The operational reality of deploying these systems has forced a reckoning on safety and control. Amazon mandated senior engineer sign-off on AI-assisted code changes after outages linked to AI tooling. Claude Opus found 22 vulnerabilities in Firefox in two weeks, a reminder that AI agents find problems humans miss and miss problems humans catch. The regulatory pressure is real. The Trump administration is preparing executive action against Anthropic even as Microsoft backs the company's legal challenge to the Pentagon's supply chain risk designation. Yet the tension between caution and speed is not resolving in favor of caution. Enterprise customers demand safety certification and prompt injection resistance, which explains OpenAI's focus on instruction hierarchy and model steering. The labs are competing on the ecosystem layer now, not just model capability. Anthropic launched an institute; AI21 focused on enterprise deployment friction; GitHub declared that "AI as text" is over and execution is now the interface. These announcements signal a deliberate architectural shift away from the chatbot model that defined the last eighteen months.
Notably absent from lab announcements are claims about model scaling or capability breakthroughs. Either the gains are plateauing or the next phase of competition is happening behind closed doors. The research literature tells a different story: papers cluster around knowledge-guided constraint integration, hierarchical learning frameworks, and representational geometry control. The underlying principle is that inductive structure, when properly aligned with the problem domain, outperforms generic end-to-end learning. Carbon flux upscaling improves by grounding in the carbon balance equation; pathology reasoning improves by organizing knowledge as hierarchical memory. This represents a deliberate move away from pure scaling and toward architectures that encode what the problem already knows. On the SWE-rebench leaderboard, the top five models hold steady between 51% and 52.9%, suggesting the frontier may be approaching a plateau on this task distribution. Below that band, volatility is high and benchmark divergences are substantial, indicating that mid-tier models remain sensitive to evaluation design choices.
The final signal comes from GitHub. Agentic systems have moved from research curiosity to infrastructure problem. Developers are building two distinct layers: frameworks for orchestrating multi-agent workflows and testing infrastructure to validate them. Promptfoo's traction reflects a genuine pain point: as teams deploy agents with different LLM backends, comparing performance and detecting failure modes requires systematic testing. The repos gaining traction solve operational problems, not architectural ones. Page-agent for web automation, tgo for customer service agent teams, and domain-specific tools like IPED for forensic analysis indicate this is no longer researchers playing with prompts. These are tools for shipping production systems. The shift from novelty to necessity is the real story.
Grant Calloway
Accurately upscaling terrestrial carbon fluxes is central to estimating the global carbon budget, yet it remains challenging due to the sparse and regionally biased distribution of ground measurements. Existing data-driven upscaling products often fail to generalize beyond observed domains, leading to systematic regional biases and high predictive uncertainty. We introduce Task-Aware Modulation with Representation Learning (TAM-RL), a framework that couples spatio-temporal representation learning with a knowledge-guided encoder-decoder architecture and a loss function derived from the carbon balance equation. Across 150+ flux tower sites representing diverse biomes and climate regimes, TAM-RL improves predictive performance relative to existing state-of-the-art datasets, reducing RMSE by 8-9.6% and increasing explained variance ($R^2$) from 19.4% to 43.8%, depending on the target flux. These results demonstrate that integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
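The abstract says the loss is derived from the carbon balance equation but does not spell it out. A minimal sketch of what such a knowledge-guided penalty can look like, using the standard identity NEE = Reco − GPP (net ecosystem exchange as ecosystem respiration minus gross primary production); the function names, the penalty weight, and the idea of adding the residual to an MSE data term are illustrative assumptions, not TAM-RL's actual loss.

```python
import numpy as np

def carbon_balance_penalty(gpp_pred, reco_pred, nee_pred):
    """Penalize violations of the carbon balance identity NEE = Reco - GPP.

    Inputs are per-sample flux predictions (hypothetical model heads); the
    penalty is the mean squared residual of the identity.
    """
    residual = nee_pred - (reco_pred - gpp_pred)
    return float(np.mean(residual ** 2))

def total_loss(y_true, y_pred, gpp, reco, nee, lam=1.0):
    """Data-fit MSE plus the physics penalty, weighted by lam (illustrative)."""
    mse = float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return mse + lam * carbon_balance_penalty(gpp, reco, nee)
```

Predictions that satisfy the identity contribute zero penalty, so the constraint steers training without overriding the data term.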
A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations-feature-geometry.
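The constructive-interference claim can be illustrated with a toy embedding: correlated features placed at a small angle reinforce each other's readout, while a ReLU with a bias clips the negative interference from an opposing feature. The specific angles and bias below are assumptions chosen to make the effect visible, not the paper's construction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Three binary features embedded in two dimensions (an over-complete basis).
# Features 0 and 1 sit at a small angle, mimicking correlated features whose
# interference is positive; feature 2 points the opposite way, so its
# interference with them is negative and is clipped to zero by the ReLU.
angles = np.array([0.0, 0.5, np.pi])                     # hypothetical layout
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (3 features, 2 dims)
BIAS = 0.9  # chosen so an inactive neighbor is not falsely activated

def readout(active):
    """Superpose the active features into 2-d, then read each feature back."""
    x = W.T @ np.asarray(active, dtype=float)
    return relu(W @ x - BIAS)
```

With features 0 and 1 co-active, each reads back stronger than it would alone (constructive interference), the inactive correlated neighbor stays at zero when only one fires, and feature 2 never produces a false positive.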
A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.
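CREATE scores a set of paths on specificity and diversity, rewarding larger sets of strong, dissimilar paths. A minimal sketch of one way such a set-level utility can be computed, with a greedy Jaccard-overlap discount; both the discount rule and the accumulation order are generic assumptions, not CREATE's actual grader.

```python
def jaccard(a, b):
    """Jaccard similarity between two concept sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def set_utility(paths, specificity):
    """Greedy utility for a set of concept paths.

    Each path contributes its specificity score, discounted by its maximum
    Jaccard overlap with paths already counted, so near-duplicate paths add
    little. `paths` are lists of concept strings; `specificity` gives each
    path a score in [0, 1] (a hypothetical stand-in for the benchmark's
    specificity grading).
    """
    total, accepted = 0.0, []
    for i, p in enumerate(paths):
        overlap = max((jaccard(p, q) for q in accepted), default=0.0)
        total += specificity[i] * (1.0 - overlap)
        accepted.append(p)
    return total
```

A duplicated path contributes nothing, which is the property that makes saturation hard: raising the score requires genuinely new connections, not restatements.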
As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI "sighted guide" to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.
Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the "snowball effect" in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
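The recruitment and cross-inhibition dynamics described above can be sketched as a mean-field version of the nest-site equations, Euler-integrated: fractions committed to options A and B grow by discovery and recruitment from the uncommitted pool and shrink by cross-inhibition. Parameter names and values are illustrative, and the paper's agent-based model additionally modulates the recruitment and inhibition rates via valence and arousal rather than fixing them.

```python
def simulate(gamma_a, gamma_b, rho_a, rho_b, sigma, steps=5000, dt=0.01):
    """Mean-field sketch of the nest-site ('bee') equations.

    a, b are population fractions committed to options A and B; u = 1 - a - b
    is uncommitted. Terms: spontaneous discovery (gamma), recruitment (rho),
    and cross-inhibition (sigma), which returns committed agents to the
    uncommitted pool.
    """
    a = b = 0.0
    for _ in range(steps):
        u = 1.0 - a - b
        da = gamma_a * u + rho_a * a * u - sigma * a * b
        db = gamma_b * u + rho_b * b * u - sigma * a * b
        a += dt * da
        b += dt * db
    return a, b
```

A small recruitment advantage for A (e.g. `rho_a > rho_b` with equal discovery rates) is amplified into a clear majority, while fully symmetric parameters leave the mean-field model deadlocked, which is why the agent-based version is needed to study noise-driven symmetry breaking.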
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
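A BEV affordance heatmap still has to be turned into a metric target a local planner can consume. A minimal sketch of argmax decoding over a robot-centered grid; the grid convention (x forward, y left) and the cell resolution are assumptions for illustration, not BEACON's actual decoding step.

```python
import numpy as np

def heatmap_to_goal(heatmap, cell_size=0.1):
    """Pick a local navigation target from an ego-centric BEV affordance map.

    `heatmap` is an (H, W) grid of affordance scores with the robot at the
    grid center; `cell_size` is meters per cell (hypothetical resolution).
    Returns the ego-centric (x, y) offset of the highest-scoring cell.
    """
    h, w = heatmap.shape
    r, c = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Convert grid indices to metric offsets: rows decrease toward "forward",
    # columns increase toward "right", so y is negated for a left-positive frame.
    x = (h // 2 - r) * cell_size
    y = (w // 2 - c) * cell_size
    return x, y
```

Because the heatmap covers occluded cells, the argmax can land behind furniture or people, which is exactly the case image-space predictions tied to visible pixels cannot express.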
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $ / 1M tokens |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 106 | $4.50 |
| 2 | GPT-5.4 | 57.0 | 78 | $5.63 |
| 3 | GPT-5.3 Codex | 54.0 | 65 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 55 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 69 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
A complete AI agency at your fingertips - From frontend wizards to Reddit community ninjas, from whimsy injectors to reality checkers. Each agent is a specialized expert with personality, processes, and proven deliverables.
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
An AI Hedge Fund Team
Pytorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime, fully offline. Supports embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, RK/Axera/Ascend NPUs, and x86_64 servers, with a WebSocket server/client and bindings for 12 programming languages.
Open-source AI Agent Customer Service Platform. Build AI agent teams with LLM orchestration, RAG knowledge base, multi-channel support, and human collaboration.
[ICLR 2026] 🔥🔥🔥 MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement
Light Image Video Generation Inference Framework