The Inference Report

March 21, 2026

As federal courts weigh whether Anthropic poses a national security threat and the Trump administration releases a framework designed to preempt state AI regulations, the real battle is shifting beneath the political theater toward control of infrastructure and the platforms through which work flows. The government's case against Anthropic rests on technical misunderstandings, according to sworn declarations filed Friday, yet the administration's National Policy Framework emphasizes lighter-touch rules for companies while attempting to centralize authority over AI policy at the federal level. Big Tech is fracturing over these attacks, with former Trump allies offering unprecedented criticism even as the administration tries to block state-level laws. This is not philosophical disagreement about safety. It is a power struggle over who writes the rules: the executive branch, states, or the companies themselves.

Beneath the regulatory rupture, power has become the bottleneck constraining the entire enterprise. Nvidia's CEO Jensen Huang projects one trillion dollars in AI chip sales through 2027, yet energy consumption is now the north star metric alongside accuracy and engagement as engineers discover that rolling out new data centers depends on power availability, not model capability. Microsoft rolled back Copilot bloat on Windows after user and developer resistance to forced integration. OpenAI is folding ChatGPT, Codex, and its browser Atlas into a single desktop superapp, signaling a shift toward enterprise infrastructure and developer tools away from the consumer market that made it a household name. These are admissions that the consumer AI wave has peaked and the real money is in developer platforms and enterprise lock-in.

Distribution control is now the prize. WordPress.com lets AI agents write and publish posts directly. Google embedded AI into Stitch, enabling developers to describe interfaces in natural language. Amazon is building a smartphone called Transformer to integrate shopping, streaming, and voice services through Alexa. LinkedIn banned an AI agent that had conquered the platform. Each move lowers friction for adoption while raising switching costs and centralizing control through the platform. PwC told staff they must embrace AI or face replacement. Google told researchers to stop submitting AI-generated bug reports to its open-source program due to hallucinations and low quality. AI adoption is no longer optional, quality control is breaking down at scale, and the winners will be whoever owns the platform through which work flows.

The technical evidence confirms this shift. Benchmark performance has plateaued at the top tier, with Claude Code holding 52.9% on SWE-rebench and the next three entries all within 1.2 percentage points of it, signaling that incremental gains in raw capability now demand substantial effort. On GitHub, the dominant pattern is developers moving past building individual models toward building systems that orchestrate them: Claude HUD, Open-SWE, and Superpowers all solve the same problem of making autonomous agents predictable enough to trust in production. The repos gaining traction are those that make the infrastructure layers reliable: specialized data handling for AI pipelines, vector storage, dataflow definition, and domain-specific scaffolding. The boring parts of AI systems, not raw capability, are what developers are actively building on.
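
The plateau can be checked directly against the SWE-rebench table later in this issue. The following is a minimal sketch that assumes only the five scores listed there; the variable names are illustrative and not part of any benchmark tooling.

```python
# SWE-rebench scores (percent) as listed in this issue's benchmark table.
scores = {
    "Claude Code": 52.9,
    "Junie": 52.1,
    "Claude Opus 4.6": 51.7,
    "gpt-5.2-2025-12-11-xhigh": 51.7,
    "gpt-5.2-2025-12-11-medium": 51.0,
}

# Sort from highest to lowest score; ties keep their listed order.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
leader_name, leader_score = ranked[0]

# Gap between the leader and each lower-ranked entry, in percentage points.
# Rounding to one decimal removes floating-point noise from the subtraction.
gaps_to_leader = {
    name: round(leader_score - score, 1) for name, score in ranked[1:]
}

# Spread across the whole top five.
top5_spread = round(ranked[0][1] - ranked[-1][1], 1)

for name, gap in gaps_to_leader.items():
    print(f"{name}: {gap} points behind {leader_name}")
print(f"Top-five spread: {top5_spread} points")
```

On these figures the second through fourth entries trail the leader by 0.8 to 1.2 points, and the whole top five fits inside 1.9 points.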

Grant Calloway

Research Papers: Focused
Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination cs.HC

The emergence of LLM-driven information services is reshaping the conditions under which public knowledge institutions operate, threatening to absorb the editorial function these institutions exist to exercise. While LLMs offer powerful new affordances for knowledge dissemination, editorial authority is challenged by pretrained LLMs that arrive already aligned with the values and dissemination strategies of their commercial developers. This paper investigates editor participation in re-aligning LLM interfaces to editorial standards through design workshops, in a case study where we design and implement an LLM-enabled encyclopedia interface with a Nordic public knowledge institution. We introduce editorial alignment as a design practice within Participatory AI, framing AI alignment as a design process and positioning the editorial standard as a design artefact that translates editorial practice and values into alignment objectives for technical implementation. Last, we discuss how editorial alignment can create space for ongoing participation and give editors agency in LLM-mediated knowledge dissemination.

DataMagic: Transforming Tabular Data into Data Insight Video cs.HC

Data videos integrate dynamic charts, voice narration, and synchronized animations to communicate data insights as temporal narratives, making them an effective medium for improving data consumption efficiency in the data management lifecycle. However, producing high-quality data videos requires expertise spanning data analysis, narrative design, and video production. Existing approaches fall short: static visualization tools (e.g., BI dashboards) lack narrative logic and animation; authoring tools require users to pre-prepare visualizations rather than working from raw data; pixel-level video generation models cannot guarantee data fidelity or provenance. We demonstrate DataMagic, an end-to-end interactive system that transforms raw tabular data and natural language queries into narrative data-insight videos. To ensure data fidelity, DataMagic introduces the declarative specification DVSpec, which binds visual and animation elements to underlying data fields through data-driven semantic references. To address the combinatorial explosion of the design space, DataMagic adopts a Generate-then-Orchestrate multi-agent architecture that generates candidate scenes in parallel and then optimizes narrative coherence through global orchestration. Leveraging DVSpec's decoupling of logic and rendering, the system further supports three interaction modes and structured provenance-based data Q&A, transforming one-way videos into explorable interactive data interfaces. Evaluation on 109 real-world samples validates the effectiveness of DataMagic. Homepage: https://datamagic-home.github.io/

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep cs.HC

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface cs.HC

Hybrid brain-computer interfaces (BCIs) that integrate motor imagery (MI) and steady-state visual evoked potentials (SSVEP) provide high-dimensional neural decoding but typically exceed the computational limits of embedded hardware. To address this, we propose SwitchBraidNet, a compact EEG classification architecture designed for low-power deployment. The model employs a dual-path temporal braid to extract multiscale oscillatory features, an adaptive squeeze-and-excitation spatial switch for electrode gating, and a log-variance readout layer for direct band-power encoding. Furthermore, through systematic quantisation-aware training on the OpenBMI dataset, we compared SwitchBraidNet against four established baselines across FP32, FP16, and INT8 precisions. Experimental results demonstrate superior efficiency and performance, achieving MI accuracy of 69.49% (FP16), SSVEP accuracy of 93.48% (FP32), and a hybrid information transfer rate of 64.82 bits/min (FP16). With an INT8 footprint of only 3.03 KB, SwitchBraidNet maintains high accuracy across varying numerical precisions, demonstrating its suitability for low-power embedded BCI deployment.
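
The idea behind a log-variance readout can be illustrated in a few lines. For a zero-mean, band-limited signal, variance equals average power in that band, and taking the logarithm compresses its range. The sketch below is not the SwitchBraidNet layer itself; the function name, the channel layout, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
from statistics import pvariance


def log_variance_features(trial):
    """Return log(variance) per channel for one trial.

    `trial` is a list of channels, each a list of samples
    (a hypothetical layout chosen for this example).
    """
    eps = 1e-12  # guards against log(0) for a perfectly flat channel
    return [math.log(pvariance(channel) + eps) for channel in trial]


random.seed(0)
# Synthetic two-channel trial: the second channel is the first scaled by 3.
quiet = [random.gauss(0.0, 1.0) for _ in range(500)]
loud = [3.0 * sample for sample in quiet]

features = log_variance_features([quiet, loud])

# Scaling amplitude by 3 scales variance by 9, so the two features
# should differ by log(9).
print(features[1] - features[0], math.log(9))
```

Because amplitude scaling becomes an additive offset after the logarithm, a linear classifier on these features sees power ratios between channels as simple differences.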

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration cs.HC

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies cs.HC

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's τ, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub: https://github.com/13204942/SonoRate
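
The agreement statistics named in the abstract can be sketched for a pair of raters as follows. This is a plain-Python illustration under stated assumptions, not the SonoRate implementation: the rankings are hypothetical, ties are assumed absent, Kendall's τ is computed in its tau-a form, and "top-1 agreement" is one plausible reading of the top-1 selection statistic.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / total_pairs


def top1_agreement(rank_a, rank_b):
    """True when both raters place the same item first."""
    return rank_a.index(1) == rank_b.index(1)


# Hypothetical ranks two raters assigned to four models (1 = best).
expert = [1, 2, 3, 4]
generalist = [1, 3, 2, 4]

rho = spearman_rho(expert, generalist)
tau = kendall_tau(expert, generalist)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
print("Same top choice:", top1_agreement(expert, generalist))
```

With real data that contains ties, a library routine such as SciPy's `spearmanr` and `kendalltau` would be the safer choice, since the closed-form Spearman formula used here is only valid for tie-free rankings.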

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  GPT-5.4                 57.2   86     $5.63
2  Gemini 3.1 Pro Preview  57.2   117    $4.50
3  GPT-5.3 Codex           54     74     $4.81
4  Claude Opus 4.6         53     54     $10.00
5  Claude Sonnet 4.6       51.7   70     $6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Code                52.9%
2  Junie                      52.1%
3  Claude Opus 4.6            51.7%
4  gpt-5.2-2025-12-11-xhigh   51.7%
5  gpt-5.2-2025-12-11-medium  51.0%