Anthropic's refusal to surrender control of Claude to Pentagon demands has split the AI market along a fault line that now determines winners and losers: companies that preserve user trust versus those that accept government terms. The choice cost Anthropic a $200 million contract but delivered something more durable. Claude's app now sees more daily installs than ChatGPT, which suffered a 295 percent surge in uninstalls after OpenAI accepted the Pentagon's conditions. This is not a debate about safety frameworks. It is a market signal that the consumer base will punish military entanglement, and that signal is reshaping how AI companies calculate their revenue mix.
The fracture extends beyond geopolitics into infrastructure and talent. Microsoft, Google, and Amazon all moved quickly to preserve Claude access through their platforms, recognizing that distribution channels matter more than any single vendor relationship. Musk failed to block California's data disclosure law, forcing xAI into transparency about training data sources. Britain's House of Lords demanded licensing before copyright use. The UK, Denmark, and Germany are shifting procurement toward open-source alternatives and away from US vendors. These are not ideological moves. They reflect the recognition that whoever controls the model controls leverage, and governments are moving to dilute that leverage by fragmenting it. Alibaba replaced its top AI researcher with a Google DeepMind veteran within 48 hours. DeepSeek is shipping a trillion-parameter open-weight model on Chinese silicon, signaling that the effort to break free from Nvidia's grip is moving from aspiration to product. An AI startup sued its ex-CEO for stealing 41GB of emails, exposing how fast institutional knowledge now migrates between competitors.
Meanwhile, the labs are abandoning the race for next-generation breakthroughs in favor of embedding models into workflows where revenue is immediate and measurable. OpenAI is locking in usage through application security and financial services partnerships. GitHub's vulnerability scanner runs on OpenAI's Codex Security agent. Descript uses OpenAI models for multilingual dubbing. AMD is positioning itself as the platform for domain-specific inference where computational cost matters. Anthropic's Firefox partnership focuses on security at the browser level rather than announcing new capabilities. What's absent is more telling than what's present: no consumer products, no benchmark breakthroughs, only integration into existing tools and revenue streams. The labs are becoming infrastructure inside things that already work.
The benchmarks themselves signal consolidation at a plateau. Claude Code holds 52.9 percent on SWE-rebench with no movement at the top tier. Below rank four, the list shows significant churn driven by model versioning rather than performance gains. Gemini 3.1 Pro Preview and GPT-5.4 lead Artificial Analysis with composite scores near 57 but do not appear in SWE-rebench's top rankings, indicating the benchmarks measure different capabilities or use different protocols. The absence of clear improvement signals in the top tier, combined with ranking instability in the 7-20 range, suggests the field is consolidating around a performance plateau rather than advancing. GitHub's trending repos confirm the shift: developers are building orchestration frameworks and supporting infrastructure for agent systems, not chasing model scale. Airi, Qwen-Agent, and CyberStrikeAI all treat agents as orchestrated systems in which specialized components handle retrieval, planning, and tool use. The plumbing that makes AI systems reproducible and composable at production scale is where engineering effort is concentrating. The market has stopped waiting for the next breakthrough and started building the systems to deploy what already exists.
Grant Calloway
Diffusion language models expose an explicit denoising trajectory, making it possible to ask when different kinds of information become measurable during generation. We study three independent 32-step runs of LLaDA-8B-Base on masked WikiText-103 text, each with 1,000 probe-training sequences and 200 held-out evaluation sequences. From saved trajectories, we derive four temporal measurements: token commitment; linear recoverability of part-of-speech (POS), coarse semantic category, and token identity; confidence and entropy dynamics; and sensitivity under mid-trajectory re-masking. Across seeds, the same ordering recurs: content categories stabilize earlier than function-heavy categories, POS and coarse semantic labels remain substantially more linearly recoverable than exact lexical identity under our probe setup, uncertainty remains higher for tokens that ultimately resolve incorrectly even though late confidence becomes less calibrated, and perturbation sensitivity peaks in the middle of the trajectory. A direct/collateral decomposition shows that this peak is overwhelmingly local to the perturbed positions themselves. In this LLaDA+WikiText setting, denoising time is therefore a useful analysis axis: under our measurements, coarse labels are recovered earlier and more robustly than lexical identity, trajectory-level uncertainty tracks eventual correctness, and mid-trajectory states are the most intervention-sensitive.
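The probing methodology the abstract describes can be sketched with a minimal least-squares linear probe applied to per-step hidden states. The states below are synthetic stand-ins (no LLaDA weights are involved), and the assumption that label signal grows with denoising step is made purely for illustration:

```python
import numpy as np

def linear_probe_accuracy(states, labels, train_frac=0.8):
    # Least-squares linear probe: regress {0, 1} labels on hidden states,
    # then threshold at 0.5 on the held-out split.
    n = len(labels)
    k = int(n * train_frac)
    X = np.hstack([states, np.ones((n, 1))])          # add a bias column
    w, *_ = np.linalg.lstsq(X[:k], labels[:k].astype(float), rcond=None)
    preds = (X[k:] @ w) > 0.5
    return float((preds == labels[k:].astype(bool)).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 400)                      # e.g. binary POS class
acc_by_step = []
for step in [0, 8, 16, 24, 31]:                       # 32-step trajectory, subsampled
    strength = step / 31.0                            # assumed signal growth per step
    states = rng.normal(size=(400, 16)) + 3.0 * strength * labels[:, None]
    acc_by_step.append(linear_probe_accuracy(states, labels))
```

Plotting `acc_by_step` against denoising step is the kind of curve the paper's "linear recoverability" measurements produce; early steps sit near chance and late steps saturate.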
Large language models (LLMs) are increasingly used for complex reasoning tasks thanks to their strong instruction-following capability. However, model performance depends heavily on the open-ended nature of users' input prompts. Natural prompts often violate proper syntactic rules, producing ambiguous queries that admit multiple interpretations. Such ambiguous prompts confuse the model when choosing the correct reasoning path to answer a question. Prior works address this challenge by applying query editing during LLM inference without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism based on explicit prompt disambiguation. Specifically, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method produces a more focused attention distribution over the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only $0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method that does not disturb the internal mechanism of LLM inference.
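The pipeline's stages (risk identification, consistency checking, conflict resolution, structured rewriting) can be illustrated with a toy, rule-based stand-in. The sense lexicon and domain hint below are hypothetical placeholders for the SLM calls the paper describes:

```python
def find_semantic_risks(prompt, lexicon):
    """Stage 1: flag words with more than one known sense (stand-in for the SLM)."""
    words = [w.strip("?.,!").lower() for w in prompt.split()]
    return [w for w in words if w in lexicon]

def resolve_conflicts(risks, lexicon, domain_hint):
    """Stages 2-3: keep the sense consistent with a domain hint, dropping the rest."""
    return {w: next((s for s in lexicon[w] if domain_hint in s), lexicon[w][0])
            for w in risks}

def restructure(prompt, resolutions):
    """Stage 4: append the resolved senses as an explicit, structured clarification."""
    if not resolutions:
        return prompt
    notes = "; ".join(f"'{w}' means {sense}" for w, sense in resolutions.items())
    return f"{prompt}\n[Clarifications: {notes}]"

lexicon = {"bank": ["the financial institution", "the river bank"]}
prompt = "How steep is the bank?"
out = restructure(prompt, resolve_conflicts(find_semantic_risks(prompt, lexicon),
                                            lexicon, "river"))
```

The disambiguated prompt carries the resolved sense explicitly, so the downstream LLM never has to pick between interpretations itself.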
Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLM behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
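The discriminative test is simple to sketch: a model passes on a pair if it assigns higher probability to the in-language string. Below, a hand-specified bigram model stands in for the LLM's log-probabilities, and the formal language (ab)+ is an illustrative assumption, not the paper's actual grammar:

```python
import math

def in_language(s):
    """Toy formal language: strings over {a, b} matching (ab)+."""
    return len(s) >= 2 and len(s) % 2 == 0 and all(
        s[i:i + 2] == "ab" for i in range(0, len(s), 2))

def string_logprob(s, bigram_logp):
    """Score a string under a bigram model (stand-in for LLM log-probabilities)."""
    total, prev = 0.0, "^"
    for ch in s + "$":
        total += bigram_logp.get((prev, ch), math.log(1e-6))  # unseen bigrams penalised
        prev = ch
    return total

def discriminative_accuracy(pairs, bigram_logp):
    """Fraction of (in-language, out-of-language) pairs ranked correctly."""
    wins = sum(string_logprob(i, bigram_logp) > string_logprob(o, bigram_logp)
               for i, o in pairs)
    return wins / len(pairs)

# Bigram model that has "learned" (ab)+: after b, continue or stop with equal odds.
bigram_logp = {("^", "a"): 0.0, ("a", "b"): 0.0,
               ("b", "a"): math.log(0.5), ("b", "$"): math.log(0.5)}
pairs = [("abab", "abba"), ("ab", "ba"), ("ababab", "aabbab")]
acc = discriminative_accuracy(pairs, bigram_logp)
```

The same comparison run against an actual model's token log-probabilities is the paper's proficiency test; the formal language makes the in/out boundary exact.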
Long-context large language models remain computationally expensive to run and often fail to reliably process very long inputs, which makes context compression an important component of many systems. Existing compression approaches typically rely on trained compressors, dense retrieval-style selection, or heuristic trimming, and they often struggle to jointly preserve task relevance, topic coverage, and cross-sentence coherence under a strict token budget. To address this, we propose a training-free and model-agnostic compression framework that selects a compact set of sentences guided by structural graph priors. Our method constructs a sparse hybrid sentence graph that combines mutual k-NN semantic edges with short-range sequential edges, extracts a topic skeleton via clustering, and ranks sentences using an interpretable score that integrates task relevance, cluster representativeness, bridge centrality, and a cycle coverage cue. A budgeted greedy selection with redundancy suppression then produces a readable compressed context in original order. Experimental results on four datasets show that our approach is competitive with strong extractive and abstractive baselines, demonstrating larger gains on long-document benchmarks.
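A stripped-down version of such graph-guided selection (bag-of-words similarity, mutual k-NN edges, relevance-plus-centrality scoring, greedy budget) might look as follows. The weights and the omission of the cluster, bridge, and cycle terms are simplifying assumptions for illustration:

```python
import numpy as np

def bow_rows(texts, vocab):
    """L2-normalised bag-of-words rows over a shared vocabulary."""
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(texts), len(vocab)))
    for r, t in enumerate(texts):
        for w in t.lower().split():
            M[r, index[w]] += 1
    return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-9)

def compress(sentences, query, budget_tokens, k=2, alpha=0.7):
    vocab = sorted({w for s in sentences + [query] for w in s.lower().split()})
    X = bow_rows(sentences, vocab)
    q = bow_rows([query], vocab)[0]
    sim = X @ X.T
    np.fill_diagonal(sim, -1.0)
    topk = np.argsort(-sim, axis=1)[:, :k]
    mutual = np.zeros_like(sim)                  # mutual k-NN semantic edges
    for i in range(len(sentences)):
        for j in topk[i]:
            if i in topk[j]:
                mutual[i, j] = 1.0
    centrality = mutual.sum(axis=1) / max(mutual.sum(), 1.0)
    score = alpha * (X @ q) + (1 - alpha) * centrality
    chosen, used = [], 0
    for i in np.argsort(-score):                 # greedy selection under the budget
        cost = len(sentences[i].split())
        if used + cost <= budget_tokens:
            chosen.append(int(i))
            used += cost
    return [sentences[i] for i in sorted(chosen)]   # emit in original order

docs = ["the cat sat on the mat",
        "stocks fell sharply today",
        "the cat chased a mouse",
        "rain is expected tomorrow"]
picked = compress(docs, "cat behavior", budget_tokens=12)
```

Returning the chosen sentences in original document order is what preserves readability of the compressed context, matching the framework's "readable compressed context in original order" goal.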
In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
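At its core, the three-component design (audio encoder, adaptation layer, LLM) reduces to projecting acoustic features into the LLM's embedding space. The dimensions and random projections below are illustrative placeholders, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
AUDIO_DIM, LLM_DIM = 80, 512            # hypothetical feature/embedding sizes

def audio_encoder(frames):
    """Stand-in encoder: a fixed random projection plus nonlinearity."""
    W = rng.normal(size=(frames.shape[-1], AUDIO_DIM)) / 10.0
    return np.tanh(frames @ W)

def adapter(audio_feats, W_adapt):
    """Adaptation layer: linear map from audio features to LLM input space."""
    return audio_feats @ W_adapt

frames = rng.normal(size=(50, 160))     # 50 frames of raw acoustic features
W_adapt = rng.normal(size=(AUDIO_DIM, LLM_DIM)) / np.sqrt(AUDIO_DIM)
llm_inputs = adapter(audio_encoder(frames), W_adapt)   # one "soft token" per frame
```

The adapter's output rows play the role of soft tokens prepended to the text prompt; only this projection needs to learn the audio-to-language alignment if the encoder and LLM stay frozen.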
Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.
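The tokeniser-swap step amounts to reinitialising only text-vocabulary-dependent tensors while leaving audio weights untouched. The parameter names and sizes below are hypothetical, not Moshi's actual checkpoint layout:

```python
import numpy as np

rng = np.random.default_rng(0)
OLD_VOCAB, NEW_VOCAB, DIM = 32000, 48000, 64   # hypothetical vocab/model sizes

params = {
    "text_embedding": rng.normal(size=(OLD_VOCAB, DIM)),   # depends on text vocab
    "text_lm_head":   rng.normal(size=(OLD_VOCAB, DIM)),   # depends on text vocab
    "audio_encoder":  rng.normal(size=(DIM, DIM)),         # audio side, kept as-is
}

def swap_text_vocab(params, new_vocab, dim, rng):
    """Reinitialise text-vocab-dependent weights for the new Hindi tokeniser."""
    out = dict(params)                          # audio components retained untouched
    for name in ("text_embedding", "text_lm_head"):
        out[name] = rng.normal(size=(new_vocab, dim)) * 0.02
    return out

new_params = swap_text_vocab(params, NEW_VOCAB, DIM, rng)
```

Keeping the pre-trained audio components intact is what lets the two-stage recipe (large-scale pre-training, then conversational fine-tuning) start from a model that already understands speech.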
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 57.2 | 126 | $4.50 |
| 2 | GPT-5.4 | 57.0 | 74 | $5.63 |
| 3 | GPT-5.3 Codex | 54.0 | 68 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 59 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 70 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Claude Opus 4.6 | 51.7% |
| 3 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 4 | gpt-5.2-2025-12-11-medium | 51.0% |
| 5 | gpt-5.1-codex-max | 48.5% |
💖🧸 Self-hosted, user-owned Grok companion: a container for digital "souls" (waifu and cyber beings) that brings them into our world, aiming to reach Neuro-sama's level. Capable of real-time voice chat and of playing Minecraft and Factorio. Web / macOS / Windows supported.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
A refined collection of Hypervelocity Engineering components (instructions, prompts, agents) to start your project off right, or upgrade your existing projects to get the most out of all Copilots
CyberStrikeAI is an AI-native security testing platform built in Go. It integrates 100+ security tools, an intelligent orchestration engine, role-based testing with predefined security roles, a skills system with specialized testing skills, and comprehensive lifecycle management capabilities.
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
Unified framework for robot learning built on NVIDIA Isaac Sim
The Open Source Feature Store for AI/ML
🎩 An Alfred 5 Workflow for using OpenAI Chat API to interact with GPT models 🤖💬 It also allows image generation/editing/understanding 🖼️, speech-to-text conversion 🎤, and text-to-speech synthesis 🔈
RAG-Fusion: multi-query generation + Reciprocal Rank Fusion for better retrieval-augmented generation. Includes evaluation harness with NFCorpus/BEIR.
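Reciprocal Rank Fusion itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 the conventional constant. The doc ids below are made up for illustration:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],   # retrieval results for query variant 1
    ["d2", "d1", "d4"],   # query variant 2
    ["d2", "d5", "d1"],   # query variant 3
])
```

Documents that rank well across multiple query variants rise to the top, which is exactly the mechanism multi-query RAG-Fusion relies on to beat single-query retrieval.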
The robust European language model benchmark.