The Inference Report

March 7, 2026

Anthropic's refusal to surrender control of Claude to Pentagon demands has split the AI market along a fault line that now determines winners and losers: companies that preserve user trust versus those that accept government terms. The choice cost Anthropic a $200 million contract but delivered something more durable. Claude's app now sees more daily installs than ChatGPT, which suffered a 295 percent surge in uninstalls after OpenAI accepted the Pentagon's conditions. This is not a debate about safety frameworks. It is a market signal that the consumer base will punish military entanglement, and that signal is reshaping how AI companies calculate their revenue mix.

The fracture extends beyond geopolitics into infrastructure and talent. Microsoft, Google, and Amazon all moved quickly to preserve Claude access through their platforms, recognizing that distribution channels matter more than any single vendor relationship. Musk failed to block California's data disclosure law, forcing xAI into transparency about its training data sources. Britain's House of Lords demanded that copyrighted material be licensed before it is used for training. The UK, Denmark, and Germany are shifting procurement toward open-source alternatives and away from US vendors. These are not ideological moves. They reflect the recognition that whoever controls the model controls leverage, and governments are moving to dilute that leverage by fragmenting it. Alibaba replaced its top AI researcher with a Google DeepMind veteran within 48 hours. DeepSeek is shipping a trillion-parameter open-weight model on Chinese silicon, signaling that the effort to break free from Nvidia's grip is moving from aspiration to product. An AI startup sued its ex-CEO for stealing 41GB of emails, exposing how fast institutional knowledge now migrates between competitors.

Meanwhile, the labs are abandoning the race for next-generation breakthroughs in favor of embedding models into workflows where revenue is immediate and measurable. OpenAI is locking in usage through application-security and financial-services partnerships. GitHub's vulnerability scanner runs on OpenAI's Codex Security agent. Descript uses OpenAI models for multilingual dubbing. AMD is positioning itself as the platform for domain-specific inference, where computational cost matters. Anthropic's Firefox partnership focuses on browser-level security rather than new capabilities. What's absent is more telling than what's present: no consumer products, no benchmark breakthroughs, only integration into existing tools and revenue streams. The labs are becoming infrastructure inside things that already work.

The benchmarks themselves signal consolidation at a plateau. Claude Code holds 52.9 percent on SWE-rebench with no movement at the top tier. Below rank four, the list shows significant churn driven by model versioning rather than performance gains. Gemini 3.1 Pro Preview and GPT-5.4 lead Artificial Analysis at roughly 57 percent but do not appear in SWE-rebench's top rankings, indicating the benchmarks measure different capabilities or use different protocols. The absence of clear improvement signals in the top tier, combined with ranking instability in the 7-20 range, suggests the field is consolidating around a performance plateau rather than advancing. GitHub's trending repos confirm the shift: developers are building orchestration frameworks and supporting infrastructure for agent systems, not chasing model scale. Airi, Qwen-Agent, and CyberStrikeAI all treat agents as orchestrated systems where specialized components handle retrieval, planning, and tool use (a minimal sketch of that pattern follows below). The plumbing that makes AI systems reproducible and composable at production scale is where engineering effort is concentrating. The market has stopped waiting for the next breakthrough and started building the systems to deploy what already exists.
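
The pattern those repos share reduces to a small amount of glue code. Below is a minimal, illustrative sketch of an agent as an orchestrated system; the class and component names are invented for this example and are not taken from Airi, Qwen-Agent, or CyberStrikeAI.

```python
# Minimal sketch of the orchestration pattern described above: specialized
# components for retrieval, planning, and tool use composed around a loop.
# All names here are illustrative, not from any of the named repos.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    retrieve: Callable[[str], list[str]]          # fetch supporting context
    plan: Callable[[str, list[str]], list[str]]   # break the goal into steps
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def run(self, goal: str) -> list[str]:
        context = self.retrieve(goal)
        results = []
        for step in self.plan(goal, context):
            tool_name, _, arg = step.partition(":")
            tool = self.tools.get(tool_name, lambda s: f"no tool for {s!r}")
            results.append(tool(arg))
        return results

# Toy wiring: a real framework would back these with a vector store,
# an LLM planner, and sandboxed tool executors.
agent = Agent(
    retrieve=lambda goal: [f"doc about {goal}"],
    plan=lambda goal, ctx: [f"search:{goal}", f"summarize:{ctx[0]}"],
    tools={"search": lambda q: f"results for {q}",
           "summarize": lambda d: f"summary of {d}"},
)
print(agent.run("flaky test in CI"))
```

In production frameworks the retrieval component is typically backed by a vector store, the planner by an LLM call, and each tool by a sandboxed executor; the orchestration loop itself stays roughly this small.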

Grant Calloway

Research Papers — Focused
DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning (cs.CL)

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by SFT on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles and that how training data is allocated between them matters. We therefore propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data into reasoning-intensive and non-reasoning-intensive subsets, with reasoning intensity estimated via LLM-based scoring. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. Extensive experiments on general STEM and mathematical benchmarks demonstrate that this principled decoupling significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering an effective and generalizable post-training recipe.
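
The core mechanism is a data split, which can be sketched in a few lines. The scorer and threshold below are stand-ins; the paper's actual rubric and cutoff are not given in the abstract.

```python
# Sketch of the difficulty-aware split described in the abstract: an
# LLM-based scorer rates reasoning intensity, and the data is decoupled
# into an SFT pool (broad, non-reasoning-intensive) and an RL pool
# (focused, difficult). The scorer below is a stand-in.
from typing import Callable

def decouple(problems: list[str],
             score_intensity: Callable[[str], float],
             threshold: float = 0.7) -> tuple[list[str], list[str]]:
    sft_pool, rl_pool = [], []
    for p in problems:
        (rl_pool if score_intensity(p) >= threshold else sft_pool).append(p)
    return sft_pool, rl_pool

# Stand-in scorer: a real pipeline would prompt an LLM to rate each
# problem's reasoning intensity against a fixed rubric.
toy_score = lambda p: 0.9 if "prove" in p else 0.2

sft_data, rl_data = decouple(
    ["define osmosis", "prove the series converges", "name the reagent"],
    toy_score,
)
# SFT runs first on sft_data to build domain knowledge; RL then runs on rl_data.
```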

MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries (cs.CL)

Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, degrading performance in downstream Question-Answering (QA) tasks, particularly multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and integrates them with entity-level summaries, avoiding the need for explicit traversal of graph edges during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain-specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%) while maintaining cross-lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER-DR_RAG.
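
As a rough illustration of the indexing idea, the sketch below builds entity-level summaries from context-derived triple descriptions so retrieval can match summaries instead of traversing edges. The function names and the two LLM stand-ins are invented for this example, not the authors' API; see the linked repository for the real implementation.

```python
# Rough shape of the MDER indexing idea: each triple keeps a context-derived
# description, and descriptions are rolled up into per-entity summaries.
from collections import defaultdict

def index_triples(triples, describe, summarize):
    """triples: (head, relation, tail, source_text) tuples.
    describe: LLM call producing a contextual description of one triple.
    summarize: LLM call condensing descriptions into an entity summary."""
    by_entity = defaultdict(list)
    for head, rel, tail, src in triples:
        desc = describe(head, rel, tail, src)
        by_entity[head].append(desc)
        by_entity[tail].append(desc)
    return {e: summarize(e, descs) for e, descs in by_entity.items()}

# Toy stand-ins for the two LLM calls:
describe = lambda h, r, t, src: f"{h} {r} {t} (context: {src})"
summarize = lambda e, descs: f"{e}: " + "; ".join(descs)

index = index_triples(
    [("Curie", "won", "Nobel Prize", "1903 physics award"),
     ("Curie", "born_in", "Warsaw", "born in Warsaw in 1867")],
    describe, summarize,
)
# DR would then decompose a question into triples and match them against
# these entity summaries instead of walking graph edges.
```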

Markovian Generation Chains in Large Language Models (cs.CL)

The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as a Markovian generation chain, where each step takes a specific prompt template and the previous output as input, without any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that the iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.
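
The process is easy to state in code. The sketch below runs a memoryless chain and reports whether it re-enters an earlier state (a recurrent set) or keeps producing novel outputs; the deterministic toy rewriter stands in for an LLM call.

```python
# The "Markovian generation chain" from the abstract, in miniature: each
# step feeds only the previous output back through a fixed prompt, and we
# check whether outputs revisit an earlier state.
def run_chain(seed: str, rephrase, max_steps: int = 50):
    seen = {seed: 0}
    current = seed
    for step in range(1, max_steps + 1):
        current = rephrase(current)    # memoryless: only the previous output
        if current in seen:            # chain entered a recurrent set
            return {"recurrent": True, "cycle_entry": seen[current], "step": step}
        seen[current] = step
    return {"recurrent": False, "novel_outputs": len(seen)}

# Deterministic toy "LLM" (temperature 0): alternates two phrasings,
# so the chain collapses to a two-state recurrent set almost immediately.
toy = lambda s: s.swapcase()
print(run_chain("The cat sat.", toy))
```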

Artificial Intelligence for Sentiment Analysis of Persian Poetry (cs.CL)

Recent advancements in Artificial Intelligence (AI) have led to the development of large language models (LLMs) capable of understanding, analyzing, and creating textual data. These models open a significant opportunity for analyzing literature, and poetry in particular. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models to grasp the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT-4o language model can reliably be used to analyze Persian poetry. The results of our sentiment analysis reveal that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's. Comparing the use of poetic meters further shows that Rumi's poems employ meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied to computer-based semantic studies where human interpretation is not required, thereby reducing potential bias in the analysis.
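
For the meter-sentiment comparison, the analysis reduces to grouping per-poem sentiment labels by meter and measuring the variety within each group. The sketch below uses invented data purely for illustration; it is not the paper's pipeline.

```python
# Given per-poem sentiment labels (from any LLM classifier) and each poem's
# meter, measure how many distinct sentiments each meter carries.
from collections import defaultdict

def sentiments_per_meter(poems):
    """poems: iterable of (meter, sentiment) pairs."""
    by_meter = defaultdict(set)
    for meter, sentiment in poems:
        by_meter[meter].add(sentiment)
    return {m: sorted(s) for m, s in by_meter.items()}

sample = [("ramal", "joy"), ("ramal", "longing"), ("hazaj", "joy")]
print(sentiments_per_meter(sample))
# A wider sentiment set per meter is the kind of signal behind the claim
# that Rumi uses meters to express a broader range of sentiments.
```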

ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions (cs.CL)

Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
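
The abstract does not give closed-form definitions, but one plausible reading of EPR is a conditional-probability ratio: how much more likely a wrong turn becomes when the previous turn was wrong. The sketch below implements that reading on toy data; the paper's exact formula may differ.

```python
# One plausible reading of the Error Propagation Rate (EPR): the probability
# a turn is wrong given the previous turn was wrong, relative to that
# probability when the previous turn was correct.
def error_propagation_rate(conversations):
    """conversations: list of per-turn correctness lists, True = correct."""
    wrong_after_wrong = wrong_after_right = after_wrong = after_right = 0
    for turns in conversations:
        for prev, cur in zip(turns, turns[1:]):
            if prev:
                after_right += 1
                wrong_after_right += not cur
            else:
                after_wrong += 1
                wrong_after_wrong += not cur
    # Toy data below keeps both denominators nonzero.
    return (wrong_after_wrong / after_wrong) / (wrong_after_right / after_right)

convs = [[True, True, False, False], [True, False, False, True], [True, True, True]]
print(round(error_propagation_rate(convs), 2))  # 1.67: errors beget errors
```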

Temporal Text Classification with Large Language Models (cs.CL)

Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models, but that these still fail to match the performance of proprietary LLMs.
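
The few-shot setting is straightforward to reproduce: dated excerpts serve as in-context examples, followed by the undated text to place. The prompt wording and example texts below are invented for illustration, not taken from the study.

```python
# Minimal sketch of a few-shot TTC prompt: dated excerpts as in-context
# examples, then an undated excerpt for the model to place in time.
def build_ttc_prompt(examples, query):
    """examples: list of (excerpt, year) pairs; query: undated excerpt."""
    shots = "\n\n".join(
        f"Text: {text}\nYear of publication: {year}" for text, year in examples
    )
    return f"{shots}\n\nText: {query}\nYear of publication:"

prompt = build_ttc_prompt(
    [("Thou art most welcome hither.", 1610),
     ("The telegram arrived at noon.", 1895)],
    "The server returned an error.",
)
print(prompt)  # send to any LLM; parse the predicted year from the reply
```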

Benchmarks
Artificial Analysis Intelligence Index

Composite score across coding, math, and reasoning

#  Model                   Score  tok/s  $/1M
1  Gemini 3.1 Pro Preview   57.2    126  $4.50
2  GPT-5.4                  57       74  $5.63
3  GPT-5.3 Codex            54       68  $4.81
4  Claude Opus 4.6          53       59  $10.00
5  Claude Sonnet 4.6        51.7     70  $6.00
SWE-rebench

Agentic coding on real-world software engineering tasks

#  Model                      Score
1  Claude Code                52.9%
2  Claude Opus 4.6            51.7%
3  gpt-5.2-2025-12-11-xhigh   51.7%
4  gpt-5.2-2025-12-11-medium  51.0%
5  gpt-5.1-codex-max          48.5%