Anthropic's refusal to surrender control of Claude to Pentagon demands has split the AI market along a fault line that now determines winners and losers: companies that preserve user trust versus those that accept government terms. The choice cost Anthropic a $200 million contract but delivered something more durable. Claude's app now sees more daily installs than ChatGPT, which suffered a 295 percent surge in uninstalls after OpenAI accepted the Pentagon's conditions. This is not a debate about safety frameworks. It is a market signal that the consumer base will punish military entanglement, and that signal is reshaping how AI companies calculate their revenue mix.
The fracture extends beyond geopolitics into infrastructure and talent. Microsoft, Google, and Amazon all moved quickly to preserve Claude access through their platforms, recognizing that distribution channels matter more than any single vendor relationship. Musk failed to block California's data disclosure law, forcing xAI into transparency about training data sources. Britain's House of Lords demanded licensing before copyrighted material is used. The UK, Denmark, and Germany are shifting procurement toward open-source alternatives and away from US vendors. These are not ideological moves. They reflect the recognition that whoever controls the model controls leverage, and governments are moving to dilute that leverage by fragmenting it. Alibaba replaced its top AI researcher with a Google DeepMind veteran within 48 hours. DeepSeek is shipping a trillion-parameter open-weight model on Chinese silicon, signaling that the effort to break free from Nvidia's grip is moving from aspiration to product. An AI startup sued its ex-CEO for stealing 41GB of emails, exposing how fast institutional knowledge now migrates between competitors.
Meanwhile, the labs are abandoning the race for next-generation breakthroughs in favor of embedding models into workflows where revenue is immediate and measurable. OpenAI is locking in usage through application security and financial services partnerships. GitHub's vulnerability scanner runs on OpenAI's Codex Security agent. Descript uses OpenAI models for multilingual dubbing. AMD is positioning itself as the platform for domain-specific inference where computational cost matters. Anthropic's Firefox partnership focuses on security at the browser level rather than announcing new capabilities. What's absent is more telling than what's present: no consumer products, no benchmark breakthroughs, only integration into existing tools and revenue streams. The labs are becoming infrastructure inside things that already work.
The benchmarks themselves signal consolidation at a plateau. Claude Code holds 52.9 percent on SWE-rebench with no movement at the top tier. Below rank four, the list shows significant churn driven by model versioning rather than performance gains. Gemini 3.1 Pro Preview and GPT-5.4 lead Artificial Analysis with composite scores of about 57 but do not appear in SWE-rebench's top rankings, indicating the benchmarks measure different capabilities or use different protocols. The absence of clear improvement signals in the top tier, combined with ranking instability in the 7-20 range, suggests the field is consolidating around a performance plateau rather than advancing. GitHub's trending repos confirm the shift: developers are building orchestration frameworks and supporting infrastructure for agent systems, not chasing model scale. Airi, Qwen-Agent, and CyberStrikeAI all treat agents as orchestrated systems where specialized components handle retrieval, planning, and tool use. The plumbing that makes AI systems reproducible and composable at production scale is where engineering effort is concentrating. The market has stopped waiting for the next breakthrough and started building the systems to deploy what already exists.
Grant Calloway
The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33-41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test-retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency-bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
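To make the gap between exact-match agreement and Cohen's kappa concrete, here is a minimal sketch using the standard two-rater kappa formula. The function names and the toy labels are illustrative assumptions, not taken from the paper, and the handling of the degenerate case is a choice made here for simplicity.

```python
from collections import Counter

def exact_match(a, b):
    """Fraction of items on which the two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = exact_match(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each drew independently from their own label distribution.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 0.0 is an assumption of this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical toy data: a judge that always answers "A" on a skewed label set.
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

agreement = exact_match(human, judge)
kappa = cohens_kappa(human, judge)
print(f"exact match = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this toy case the judge agrees with the human labels on 80 percent of items while carrying no discriminative signal, which is the kind of overstatement a chance-corrected metric is meant to expose.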
Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.
In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e., condensing each document, then aggregating and summarizing the condensed documents. We propose a novel yet simple strategy to shorten documents that is driven by the gold summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE-2 F1 score of 0.2468 on the VLSP public test set and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly increases the quantity of data for Vietnamese multi-document summarization. The additional data is made available to the community.
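For readers unfamiliar with the metric, the following is a rough sketch of how a ROUGE-2 F1 score is computed from bigram overlap. It assumes lowercased whitespace tokenization, which is a simplification: the official VLSP scoring and any Vietnamese word segmentation may differ, and the function names are hypothetical.

```python
from collections import Counter

def bigram_counts(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """ROUGE-2 F1 from clipped bigram overlap (whitespace tokens assumed)."""
    cand = bigram_counts(candidate.lower().split())
    ref = bigram_counts(reference.lower().split())
    # Counter intersection keeps the minimum count per bigram (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two bigrams matches on each side.
score = rouge2_f1("a b c", "a b d")
print(score)
```

A score of 0.2468 on this scale therefore means that, after balancing precision and recall, roughly a quarter of the bigrams are shared between system and reference summaries.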
We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 126 | $4.50 |
| 2 | GPT-5.4 | 57 | 74 | $5.63 |
| 3 | GPT-5.3 Codex | 54 | 68 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 59 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 70 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Claude Opus 4.6 | 51.7% |
| 3 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 4 | gpt-5.2-2025-12-11-medium | 51.0% |
| 5 | gpt-5.1-codex-max | 48.5% |
💖🧸 Self hosted, you-owned Grok Companion, a container of souls of waifu, cyber livings to bring them into our worlds, wishing to achieve Neuro-sama's altitude. Capable of realtime voice chat, Minecraft, Factorio playing. Web / macOS / Windows supported.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
A refined collection of Hypervelocity Engineering components (instructions, prompts, agents) to start your project off right, or upgrade your existing projects to get the most out of all Copilots
CyberStrikeAI is an AI-native security testing platform built in Go. It integrates 100+ security tools, an intelligent orchestration engine, role-based testing with predefined security roles, a skills system with specialized testing skills, and comprehensive lifecycle management capabilities.
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
Unified framework for robot learning built on NVIDIA Isaac Sim
The Open Source Feature Store for AI/ML
🎩 An Alfred 5 Workflow for using OpenAI Chat API to interact with GPT models 🤖💬 It also allows image generation/editing/understanding 🖼️, speech-to-text conversion 🎤, and text-to-speech synthesis 🔈
RAG-Fusion: multi-query generation + Reciprocal Rank Fusion for better retrieval-augmented generation. Includes evaluation harness with NFCorpus/BEIR.
The robust European language model benchmark.
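The Reciprocal Rank Fusion step named in the RAG-Fusion entry above can be sketched in a few lines. This is a generic illustration of the technique, not code from that repository: the function name, the document IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical retrieval results for three generated query variants.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1", "d4"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Because the fusion uses only ranks, it can combine results from retrievers whose raw scores are not comparable, which is one reason it pairs naturally with multi-query generation.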