The gap between what AI companies profess and what they actually build is no longer theoretical. Anthropic's Pro-Human Declaration preceded its Pentagon contract announcement by weeks, OpenAI's robotics team lost an engineer rather than accept defense work, and Google tied a $692M executive compensation package to autonomous systems and logistics infrastructure instead of language model performance. These moments expose the real incentive structure: defense contracts fund hardware development that consumer products cannot, autonomous systems and data center economics drive value creation, and the companies that control infrastructure control the market. When Samsung packs multiple AI models onto Galaxy devices and KKR invests billions in cooling infrastructure, when Grammarly slaps writer names onto features without consent, the pattern becomes clear. Safety frameworks and credibility declarations are rhetoric. The actual competition is over who owns the hardware, who signs the checks, and what infrastructure gets built.
This collision between stated values and structural incentives maps directly onto how the field is reorganizing around production realities. Information retrieval research has moved past laboratory benchmarks toward systems-level evaluation that couples retrieval quality with computational cost and business impact, revealing that improvements in recall frequently fail to translate into real gains once latency constraints and re-ranking budgets enter the equation. Similarly, across trending development frameworks, the shift is away from treating AI as a service and toward treating it as infrastructure for autonomous agents. OpenAI's Skills Catalog, Alibaba's page-agent, and the AI Hedge Fund team don't succeed because they reason better than existing models; they succeed because they decompose problems into sequences of discrete actions, route tasks to specialized components, and maintain state across multi-step workflows. Agency-agents sells each component as a specialized expert with proven deliverables. The real work isn't happening in model weights. It's happening in the orchestration layer that decides what to do with model output, in the glue code that makes agents deployable units with personality and process attached, and in the physics simulation engines and audio SDKs that embed agents into robotics pipelines and on-device inference. Infrastructure, not rhetoric, is where the actual story lives.
Grant Calloway
No lab headlines.
While reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate reasoning neither improve nor harm fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF) remained stable (0.33-0.35) across all models, even as relevance varies substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.
Generative Recommender Systems (GR) increasingly model user behavior as a sequence generation task by interleaving item and action tokens. While effective, this formulation introduces significant structural and computational inefficiencies: it doubles sequence length, incurs quadratic overhead, and relies on implicit attention to recover the causal relationship between an item and its associated action. Furthermore, interleaving heterogeneous tokens forces the Transformer to disentangle semantically incompatible signals, leading to increased attention noise and reduced representation efficiency.In this work, we propose a principled reformulation of generative recommendation that aligns sequence modeling with underlying causal structures and attention theory. We demonstrate that current interleaving mechanisms act as inefficient proxies for similarity-weighted action pooling. To address this, we introduce two novel architectures that eliminate interleaved dependencies to reduce sequence complexity by 50%: Attention-based Late Fusion for Actions (AttnLFA) and Attention-based Mixed Value Pooling (AttnMVP). These models explicitly encode the $i_n \rightarrow a_n$ causal dependency while preserving the expressive power of Transformer-based sequence modeling.We evaluate our framework on large-scale product recommendation data from a major social network. Experimental results show that AttnLFA and AttnMVP consistently outperform interleaved baselines, achieving evaluation loss improvements of 0.29% and 0.80%, and significant gains in Normalized Entropy (NE). Crucially, these performance gains are accompanied by training time reductions of 23% and 12%, respectively. Our findings suggest that explicitly modeling item-action causality provides a superior design paradigm for scalable and efficient generative ranking.
Personalized news recommendation is highly time-sensitive, as user interests are often driven by emerging events, trending topics, and shifting real-world contexts. These dynamics make it essential to model not only users' long-term preferences, which reflect stable reading habits and high-order collaborative patterns, but also their short-term, context-dependent interests that change rapidly over time. However, most existing approaches rely on a single static interaction graph, which struggles to capture both long-term preference patterns and short-term interest changes as user behavior evolves. To address this challenge, we propose a unified framework that learns user preferences from both global and local temporal perspectives. A global preference modeling component captures long-term collaborative signals from the overall interaction graph, while a local preference modeling component partitions historical interactions into stage-wise temporal subgraphs to represent short-term dynamics. Within this module, an LSTM branch models the progressive evolution of recent interests, and a self-attention branch captures long-range temporal dependencies. Extensive experiments on two large-scale real-world datasets show that our approach consistently outperforms strong baselines and delivers fresher and more relevant recommendations across diverse user behaviors and temporal settings.
Retrieval-Augmented Generation (RAG) systems typically treat documents as flat text, ignoring the structured metadata and linked relationships that knowledge graphs provide. In this paper, we investigate whether structured linked data, specifically Schema.org markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic RAG systems. We conduct a controlled experiment across four domains (editorial, legal, travel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and the Google Agent Development Kit (ADK) for agentic reasoning. Our experimental design tests seven conditions: three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) crossed with two retrieval modes (standard RAG and agentic RAG with multi-hop link traversal), plus an Enhanced+ condition that adds rich navigational affordances and entity interlinking. Our results reveal that while JSON-LD markup alone provides only modest improvements, our enhanced entity page format, incorporating llms.txt-style agent instructions, breadcrumbs, and neural search capabilities, achieves substantial gains: +29.6% accuracy improvement for standard RAG and +29.8% for the full agentic pipeline. The Enhanced+ variant, with richer navigational affordances, achieves the highest absolute scores (accuracy: 4.85/5, completeness: 4.55/5), though the incremental gain over the base enhanced format is not statistically significant. We release our dataset, evaluation framework, and enhanced entity page templates to support reproducibility.
Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impact PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.
Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response's information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 125 | $4.50 |
| 2 | GPT-5.4 | 57 | 75 | $5.63 |
| 3 | GPT-5.3 Codex | 54 | 70 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 63 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 69 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Claude Opus 4.6 | 51.7% |
| 3 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 4 | gpt-5.2-2025-12-11-medium | 51.0% |
| 5 | gpt-5.1-codex-max | 48.5% |
Skills Catalog for Codex
A complete AI agency at your fingertips** - From frontend wizards to Reddit community ninjas, from whimsy injectors to reality checkers. Each agent is a specialized expert with personality, processes, and proven deliverables.
Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
🤖 Autonomous agent framework for Elixir. Built for distributed, autonomous behavior and dynamic workflows.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
Become a cracked AI/ML Research Engineer
nanoRLHF: from-scratch journey into how LLMs and RLHF really work.
An open-source, GPU-accelerated physics simulation engine built upon NVIDIA Warp, specifically targeting roboticists and simulation researchers.
A modular Swift SDK for audio processing with MLX on Apple Silicon
Example apps for Foundation Models Framework in iOS 26 and macOS 26