The Inference Report — March 10, 2026

The AI industry is fracturing along a new fault line between those building for government and those refusing to. Anthropic's lawsuit against the Pentagon over its supply-chain-risk designation has drawn public support from more than 30 OpenAI and Google DeepMind employees, who signed an amicus brief. OpenAI itself has moved in the opposite direction, acquiring Promptfoo to strengthen its ability to deploy AI agents in critical operations, even as Caitlin Kalinowski, the company's head of robotics, resigned over inadequate safeguards in its Pentagon contract. The split reflects a deeper disagreement about what builders should accept in exchange for scale and legitimacy. The designation has already cost Anthropic material revenue as companies paused deal talks.

The money is flowing toward infrastructure and specialized models rather than consolidating around any single foundation model. Yann LeCun's AMI Labs closed a $1.03 billion seed round at a $3.5 billion valuation to build world models focused on physical understanding, while Nscale, an Nvidia-backed infrastructure startup, reached a $14.6 billion valuation on a $2 billion raise. Anthropic launched a Claude Marketplace to streamline enterprise procurement and deployed Code Review, a multi-agent system for analyzing AI-generated code. The market is settling into layers: frontier model providers compete on capability and trust, infrastructure companies capture deployment economics, and specialized tools fill gaps between raw models and production use.

The practical pressure on builders is now acute. Amazon held an engineering meeting after outages linked to AI-assisted code changes, while Microsoft's Copilot for Microsoft 365 has captured only 3 percent of its customer base despite two years in market, forcing the company to add Anthropic's Claude to its own tools. The market is no longer asking whether AI works in theory but whether it works reliably enough to deploy at scale, whether it can be audited and reviewed, and whether users can understand what it does. Lab announcements reveal a hardening focus on the operational layer: security, observability, and cost reduction in production environments. The companies that win will be those that solve these problems faster than their competitors, not those that promise the most capability.

Grant Calloway

AI Labs

AMD

AWS

AWS Weekly Roundup: Amazon Connect Health, Bedrock AgentCore Policy, GameDay Europe, and more (March 9, 2026)

GitHub Blog

Under the hood: Security architecture of GitHub Agentic Workflows

Hugging Face

IBM

SEI Engages IBM to Accelerate Enterprise Transformation Through Agentic AI

NVIDIA

OpenAI

OpenAI to acquire Promptfoo

From the Wire

Research Papers — Focused

Scalable LLM-based Coding of Dialogue in Healthcare Simulation: Balancing Coding Performance, Processing Time, and Environmental Impact cs.HC

Research shows that dialogue, the interactive process through which participants articulate their thinking, plays a central role in constructing shared understanding, coordinating action, and shaping learning outcomes in teams. Analysing dialogue content has been central to advancing team learning theory and informing the design of computer-supported collaborative learning environments, yet this progress has depended on labour-intensive qualitative coding. LLMs offer new possibilities for automating and enhancing the dialogue layer within emerging multimodal learning analytics approaches, with recent studies showing that they can approximate human coding through few-shot prompting. However, prior work has focused on replicating human coding accuracy for research purposes, rather than addressing a more educationally consequential question: how can we design prompts that allow an LLM to label team dialogue accurately and fast enough to be useful in real settings, such as in-person healthcare simulations, where results must be returned quickly and computational cost and sustainability also matter? This paper investigates how prompt design and batching strategies can be optimised to balance coding accuracy, processing time, and environmental impact in team-based healthcare simulation debriefing. Using a dataset of 11,647 utterances coded across 6 dialogue constructs, we compared 4 prompt designs across varying batch sizes, evaluating coding performance, processing time, and energy consumption, as well as the trade-offs between these metrics. Results indicate that increasing batch size improves speed and reduces energy use, but negatively impacts coding performance. Beyond demonstrating the feasibility of LLM-based qualitative analysis, this study offers practical guidance for scaling dialogue analytics in contexts where timeliness, privacy, and sustainability are critical.
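To make the batching trade-off concrete, here is a minimal Python sketch of how grouping utterances changes the number of model calls. The utterance count comes from the abstract; the batch sizes and the helper name `make_batches` are hypothetical and are not taken from the paper, which may batch differently.

```python
from math import ceil


def make_batches(utterances, batch_size):
    """Split a list of utterances into consecutive batches (hypothetical helper)."""
    return [
        utterances[i:i + batch_size]
        for i in range(0, len(utterances), batch_size)
    ]


TOTAL_UTTERANCES = 11_647            # dataset size reported in the abstract
ASSUMED_BATCH_SIZES = [1, 10, 50]    # illustrative values, not from the paper

# One model call per batch, so the call count falls as the batch size grows.
calls_per_batch_size = {
    size: ceil(TOTAL_UTTERANCES / size) for size in ASSUMED_BATCH_SIZES
}

for size, calls in calls_per_batch_size.items():
    print(f"batch size {size:>2}: {calls} calls")
```

Fewer calls is one plausible contributor to the speed and energy gains the abstract reports for larger batches. The sketch says nothing about the accuracy cost, which the paper measures empirically.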

Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions cs.HC

This study asks whether the threat of AI detection changes how people write with AI, and whether other people can tell the difference. In a two-phase controlled experiment, 21 participants wrote opinion pieces on remote work using an AI chatbot. Half were randomly warned that their submission would be scanned by an AI detection tool. The other half received no warning. Both groups had access to the same chatbot. In Phase 2, 251 independent judges evaluated 1,999 paired comparisons, each time choosing which document in the pair was written by a human. Judges were not told that both writers had access to AI. Across all evaluations, judges selected the warned writer's document as human 54.13% of the time versus 45.87% for the unwarned writer. A two-sided binomial test rejects chance guessing at p = 0.000243, and the result holds across both writing stances. Yet on every measurable text feature extracted, including AI overlap scores, lexical diversity, sentence structure, and pronoun usage, the two groups were indistinguishable. The judges are picking up on something that feature-based methods do not capture.
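The significance claim can be checked with a short Python sketch of an exact two-sided binomial test against chance. The abstract gives only the percentage, so the raw count of 1,082 is an assumption obtained by rounding 54.13% of 1,999. The function name is hypothetical, and the authors' exact procedure may differ.

```python
from math import comb


def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial test against chance (p = 0.5)."""
    # With p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the tail that is at least as extreme as the observed count.
    extreme = max(successes, trials - successes)
    tail = sum(comb(trials, k) for k in range(extreme, trials + 1))
    return min(1.0, 2 * tail / 2 ** trials)


trials = 1999                              # paired comparisons, from the abstract
warned_chosen = round(0.5413 * trials)     # assumption: count inferred from 54.13%
p_value = two_sided_binomial_p(warned_chosen, trials)

print(f"{warned_chosen} of {trials} chose the warned writer, p = {p_value:.6f}")
```

If the inferred count is right, the result should fall in the neighbourhood of the reported p-value, consistent with a small but reliable preference rather than chance guessing.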

From Rights to Rites: Expectations Management in Smart-Home AI cs.HC

Domestic voice assistants and smart-home devices are increasingly embedded in everyday routines, yet their ethics are often treated as an afterthought or delegated to compliance teams. To explore how expectations about smart-home AI are constructed and managed, we conducted 33 semi-structured interviews with designers, developers, and researchers from major smart-home platforms (Amazon Alexa, Microsoft Azure IoT, and Google Nest). Using a constructivist grounded theory approach, we develop Expectations Management (EM): a culturally embedded model describing how practitioners shape, calibrate, and repair expectations by balancing organisational rights with culturally situated rites. We show that EM differs from expectation-confirmation theory and trust-calibration by foregrounding moral judgement, situated action, and cross-cultural variation. Our analysis reveals four recurring design tensions (automation vs. autonomy, helpfulness vs. intrusiveness, personalisation vs. predictability, and transparency vs. obscurity) and distils them into a five-phase EM Design Playbook that supports moral prudence. We discuss implications for responsible smart-home design and offer guidance for human-centred AI.

Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching cs.HC

Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose the instructor presence, narrative continuity, and expressive framing that help learners connect with content. Full lecture video can partly restore these qualities, but it is time-consuming to record, revise, and reuse. This study addresses that pedagogical and production challenge by presenting a practice-based analysis of an open-source workflow for creating talking slide avatars for slide-based teaching. The workflow integrates OpenVoice for text-to-speech generation and voice cloning with Ditto-TalkingHead for audio-driven talking-image synthesis, enabling instructors to transform a script and a static portrait into a short narrated video that can be embedded in slide decks or HTML-based lecture materials. Rather than treating this workflow merely as a technical solution, the study frames talking slide avatars as multimodal communication artifacts at the intersection of digital pedagogy, aesthetic education, and art-technology practice. Using a practice-based implementation and analytic reflection approach, the study documents the production pipeline, examines its communicative and aesthetic affordances, and proposes practical guidelines for script length, image selection, pacing, disclosure, accessibility, and ethical use. The study makes three primary contributions: it presents an educator-oriented open-source production model, reframes talking avatars as an educational communication design problem, and proposes a responsible pathway for incorporating generative synthetic media into teaching. It concludes that short, transparent, and carefully designed avatars can humanize slide-based instruction while providing a reusable communicative layer for introductions, transitions, reminders, and recaps across online, hybrid, and asynchronous learning environments.

What Did They Mean? How LLMs Resolve Ambiguous Social Situations across Perspectives and Roles cs.HC

People increasingly turn to large language models (LLMs) to interpret ambiguous social situations: a delayed text reply, an unusually cold supervisor, a teacher's mixed signals, or a boundary-crossing friend. Yet in many such cases, no stable interpretation can be verified from the available evidence alone. We study how LLMs respond to these situations across four domains: early-stage romantic relationships, teacher-student dynamics, workplace hierarchies, and ambiguous friendships. Across 72 responses from GPT, Claude, and Gemini, only 9 (12.5%) genuinely preserved uncertainty. The remaining 87.5% produced interpretive closure through recurring pathways including narrative alignment, narrative reversal, normative advice under uncertainty, and hedged language that still supported a single conclusion. We further find that narrator perspective shapes the path to closure: first-person accounts more often elicited alignment, while third-person accounts invited more detached interpretation, even when the underlying situation remained comparable. Together, these findings show that LLMs do not simply assist interpersonal sensemaking; they tend to resolve ambiguity into coherent and actionable narratives. These results suggest that the central risk is not only that LLMs may misinterpret social situations, but that they may make unresolved situations feel prematurely settled. We frame this tendency as a design challenge for uncertainty-preserving social AI.

IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models cs.HC

Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals, including text and visual cues, to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. The approach is inspired by forward-inverse modeling in cognitive science: it decomposes intention understanding into goal candidate generation followed by structured inference through selection, effectively reducing hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, surpassing the baseline performance by 30% and matching human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M tokens
1	Gemini 3.1 Pro Preview	57.2	110	$4.50
2	GPT-5.4	57	78	$5.63
3	GPT-5.3 Codex	54	68	$4.81
4	Claude Opus 4.6	53	55	$10.00
5	Claude Sonnet 4.6	51.7	69	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Model	Score
1	Claude Code	52.9%
2	Junie	52.1%
3	Claude Opus 4.6	51.7%
4	gpt-5.2-2025-12-11-xhigh	51.7%
5	gpt-5.2-2025-12-11-medium	51.0%

GitHub Repos

Trending

GoogleCloudPlatform/generative-ai

15914 ★

Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI

openclaw/openclaw

365028 ★

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

karpathy/nanochat

46464 ★

The best ChatGPT that $100 can buy.

NousResearch/hermes-agent

87101 ★

pbakaus/impeccable

3340 ★

The design language that makes your AI harness better at design.

Daily discovery

langchain-ai/langgraph Generative AI

26027 ★

Build resilient language agents as graphs.

UCSC-VLAA/OpenVision Multimodal

465 ★

OpenVision (ICCV 2025), OpenVision 2 (CVPR 2026), and OpenVision 3

deepset-ai/haystack RAG

24803 ★

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

apache/airflow MLOps

44570 ★

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

debba/tabularis MCP

914 ★

A lightweight, developer-focused database management tool. Supports MySQL, PostgreSQL and SQLite. Hackable with plugins. Built for speed, security, and aesthetics.