The Inference Report

June 8, 2026

The AI industry is learning that liability travels faster than valuations. A lawsuit from a school shooting survivor against an AI gun detection firm exposes the chasm between marketing claims and courtroom accountability, arriving precisely as OpenAI pivots from chatbots to broader platforms and Notion discovers that service disruptions command more attention than product launches. The market is simultaneously sending contradictory signals: PhysicsX commands a $2.4 billion valuation on a Temasek-led round while software acquisition deals have collapsed to their lowest level since the pandemic, a bifurcation that separates proven operators from speculative bets. Chip stocks shed $1.3 trillion on Friday after a hot jobs report spooked rate expectations, collapsing the narrative that AI valuations operate independent of macro conditions. They do not.

NVIDIA has shifted from selling chips to architecting sovereign AI buildouts at national scale, embedding itself into the supply chain at every layer. The UK announcement frames this as geopolitical necessity; South Korean moves position NVIDIA as the technical spine of Korea's AI ambition, with gigawatt-scale facilities coming online by 2027 and memory partnerships securing the semiconductor stack. This is infrastructure capture, not model competition. Meanwhile, an IBM survey surfaces a control problem: enterprise CIOs and CTOs are deploying AI systems they don't fully govern. That gap between deployment velocity and operational control is precisely where NVIDIA's integrated platform pitch gains traction. The company is not waiting for standards; it is building the infrastructure so thoroughly that adoption becomes the path of least resistance.

Research papers cluster around three themes: diagnosing reasoning gaps through controlled benchmarking, extending learning beyond single-task boundaries through continual learning and long-horizon agents, and refining representations to align with downstream objectives. Rather than scaling existing architectures, these works target specific bottlenecks in reasoning reliability, knowledge retention, and representation quality through targeted interventions that preserve interpretability. On benchmarks, SWE-rebench shows stasis at the top while Gemini 3.1 Pro Preview dropped 6.1 points, falling from fourth to tenth place. Artificial Analysis expanded to 388 entries with the top 20 remaining nearly identical, suggesting either saturation in the current model generation or insufficient sensitivity to detect refinements in the 60-point range.

GitHub trending repos reveal two distinct movements. The first treats AI agents as research and synthesis engines that plug into existing information sources (Reddit, X, YouTube, market data) and return structured findings rather than hallucinations. The second consolidates around inference and retrieval, with llama.cpp maintaining its position as the practical standard for local execution while Milvus and TurboVec compete for vector search at scale. Beneath the headline repos, a pattern emerges around constraint and taste. Taste-skill explicitly attacks the problem that scaling models doesn't guarantee quality output. Security middleware like shellward and local integration approaches reflect growing concern about deployment risk, particularly around data leakage in agent systems. The repos gaining traction solve real integration problems rather than chase benchmark scores.

Grant Calloway

AI Labs

Hugging Face

Amazing Digital Dentures (a failed project)

IBM

New IBM Study Finds CIOs and CTOs Face Growing AI Control Gap as Enterprise Deployment Scales

NVIDIA

From the Wire

Research Papers

How reliable are LLMs when it comes to playing dice? cs.CL

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.
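For readers who want a feel for the kind of discrete probability question at issue, the sketch below computes exact answers by enumerating every dice outcome. The two problems are illustrative choices made here, not items from the paper's datasets, and `exact_probability` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product


def exact_probability(num_dice, event, given=lambda roll: True):
    """Exact P(event | given) over all rolls of `num_dice` fair six-sided dice."""
    # Keep only the outcomes consistent with the conditioning information.
    outcomes = [roll for roll in product(range(1, 7), repeat=num_dice) if given(roll)]
    favorable = sum(1 for roll in outcomes if event(roll))
    return Fraction(favorable, len(outcomes))


# Standard exercise: at least one six in four rolls.
p_at_least_one_six = exact_probability(4, lambda roll: 6 in roll)
print(p_at_least_one_six)  # 671/1296, about 0.518

# Counterintuitive variant: two dice, told that at least one shows a six.
# The intuitive answer 1/6 is wrong; only 1 of the 11 consistent outcomes is (6, 6).
p_both_six_given_one = exact_probability(
    2, lambda roll: roll == (6, 6), given=lambda roll: 6 in roll
)
print(p_both_six_given_one)  # 1/11
```

Exhaustive enumeration of this kind gives a ground truth that does not depend on how a question is worded, which is what makes disguised or misleading phrasings a clean test of a model's reasoning.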

Agentopia: Long-Term Life Simulation and Learning in Agent Societies cs.CL

Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long-term growth. In this paper, we study long-term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life-long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long-term life simulation in multi-agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define life reward to mirror human well-being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well-being in simulation, and generalizes to downstream role-playing benchmarks with +15.6% improvement.

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism cs.CV

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings cs.CL

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
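The abstract describes filtering a subspace out of an embedding with a linear transformation. The sketch below shows one generic way to do that: orthonormalize a set of unwanted directions and subtract the embedding's component along them. It is not the authors' implementation. How EmbedFilter derives the subspace from the unembedding matrix is not stated here, so the directions are simply assumed to be given, and all function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `directions`."""
    basis = []
    for d in directions:
        residual = list(d)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([r / norm for r in residual])
    return basis


def filter_subspace(embedding, basis):
    """Remove the component of `embedding` lying in the span of `basis`."""
    filtered = list(embedding)
    for b in basis:
        coeff = dot(filtered, b)
        filtered = [f - coeff * bi for f, bi in zip(filtered, b)]
    return filtered


# Assumed toy input: one unwanted direction in a 3-dimensional embedding space.
unwanted = orthonormalize([[1.0, 1.0, 0.0]])
filtered = filter_subspace([3.0, 1.0, 2.0], unwanted)
print(filtered)  # approximately [1.0, -1.0, 2.0]
```

After filtering, the embedding has no component along the removed directions, so it lies in a space of dimension d minus k (for k removed directions). That is consistent with the dimensionality reduction the abstract mentions, although the paper's exact construction may differ.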

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning cs.LG

Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the plasticity-stability conflict through adaptive sparse subspace decomposition into task-specific expert modules. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through adaptive elastic anchoring and a routing-aware regularization that jointly protect shared knowledge at both the weight and routing levels and enable a unified gating network to automatically retrieve the correct expert combination during inference. Extensive experiments across diverse domain-specific benchmarks demonstrate that SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer on LLaMA-2 7B and Qwen3-4B.

Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization cs.LG

Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the condition number $\kappa = L/\mu$ and the network spectral gap $1-\beta$. Although deterministic decentralized methods can simultaneously achieve accelerated $\sqrt{\kappa}$ and $1/\sqrt{1-\beta}$ dependences, no existing stochastic method attains both improvements at once. In this paper, we propose Multi-Gossip Accelerated DSGD (MG-ADSGD), a decentralized stochastic algorithm that combines Nesterov-type primal-dual extrapolation with multi-round fast gossip averaging. The key idea is to couple the gossip depth with the mini-batch size so that additional communication rounds simultaneously improve consensus accuracy and reduce gradient variance. We show that MG-ADSGD achieves the communication complexity \[ \widetilde{\mathcal{O}}\!\left( \frac{\sigma^2}{\mu n \varepsilon}\log\frac{1}{\varepsilon} + \sqrt{\frac{\kappa}{1-\beta}}\log\frac{1}{\varepsilon} \right), \] where $\varepsilon$ denotes the target accuracy, $n$ is the number of nodes, and $\sigma^2$ is the gradient variance. To the best of our knowledge, this bound yields the best currently available communication complexity for decentralized stochastic strongly convex optimization, up to logarithmic factors that are independent of $\varepsilon$.
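To make the role of gossip depth concrete, the sketch below runs plain multi-round gossip averaging on a small ring network. It is a minimal illustration under assumptions made here, not MG-ADSGD itself: it omits the Nesterov-type extrapolation, the mini-batch coupling, and the accelerated ("fast") gossip the abstract refers to. The ring topology, the mixing weights, and the function names are all hypothetical.

```python
def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n nodes.

    Each node keeps weight 1/2 on itself and 1/4 on each of its two neighbors.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] += 0.5
        W[i][(i - 1) % n] += 0.25
        W[i][(i + 1) % n] += 0.25
    return W


def gossip_round(W, values):
    """One communication round: every node averages with its neighbors."""
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]


def multi_gossip(W, values, rounds):
    """Apply `rounds` consecutive gossip rounds."""
    for _ in range(rounds):
        values = gossip_round(W, values)
    return values


# Four nodes, one of which starts with all the mass; the network average is 1.0.
W = ring_mixing_matrix(4)
for depth in (1, 5, 20):
    print(depth, multi_gossip(W, [4.0, 0.0, 0.0, 0.0], depth))
```

Because the matrix is doubly stochastic, each round preserves the network average, and the disagreement between nodes shrinks geometrically at a rate set by the second-largest eigenvalue magnitude $\beta$ (here 1/2). More rounds per iteration therefore buy tighter consensus at the price of more communication, which is the trade-off the abstract's bound quantifies.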

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	Claude Opus 4.8	61.4	64	$10.94
2	GPT-5.5	60.2	67	$11.25
3	Claude Opus 4.7	57.3	62	$10.94
4	Gemini 3.1 Pro Preview	57.2	136	$4.50
5	GPT-5.4	56.8	100	$5.63

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	gpt-5.5-2026-04-23-xhigh	62.7%
2	Codex	60.4%
3	Claude Code	59.6%
4	gpt-5.5-2026-04-23-medium	58.9%
5	Claude Opus 4.8-xhigh	56.4%

GitHub Repos

Trending

mvanhorn/last30days-skill

39418 ★

AI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary

opencv/opencv

88799 ★

Open Source Computer Vision Library

Leonxlnx/taste-skill

37685 ★

Taste-Skill - gives your AI good taste. stops the AI from generating boring, generic slop

NousResearch/hermes-agent

186592 ★

lfnovo/open-notebook

27639 ★

An Open Source implementation of Notebook LM with more flexibility and features

Daily discovery

isLinXu/paper-list (Object Detection)

134 ★

autoupdate paper list

OpenSTEF/openstef (Data Science)

148 ★

Automated Machine Learning pipelines. Builds the Open Short Term Energy Forecasting package.

MMMU-Benchmark/MMMU (Multimodal)

576 ★

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

skye-harris/hass_local_openai_llm (RAG)

187 ★

Home Assistant LLM integration for local OpenAI-compatible services (llamacpp, vllm, etc)

milvus-io/milvus (Vector Database)

44862 ★

Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search