The Inference Report

July 3, 2026

As Sam Altman offers five percent equity stakes to a U.S. sovereign wealth fund and extends the framework to rivals, as if volunteering their capital, the federal government sits at the negotiating table for the first time, and the AI industry's power brokers are scrambling to shape the terms before rules get written without them. Altman's proposal is not philanthropy but a political move: a way to bind federal interests to OpenAI's success while preemptively framing the company as a national asset rather than a competitor to be regulated. Microsoft's launch of its Frontier Company, with 6,000 embedded engineers and a $2.5 billion commitment, follows similar moves by Amazon and Anthropic and shows the infrastructure layer consolidating around the major players. Smaller builders will deploy on their platforms, not compete with them. The frontier labs are also moving upstream to control the hardware itself, with Anthropic discussing a custom chip with Samsung and OpenAI partnering with Broadcom on its own silicon. Mark Zuckerberg's admission that AI development is moving slower than he hoped is the week's most honest statement: the hype cycle has outpaced actual progress, but the money and regulatory attention have not slowed.

The real constraint is not compute but energy. The 37 percent rise in Google's electricity use for AI data centers in 2025, combined with FERC's ultimatum to grid operators and growing local backlash to proposed facilities in Virginia and Nevada, means that access to power and interconnection capacity will determine who scales and who doesn't. Blackstone's decision to abandon its Virginia data center project after protests signals that political risk is rising faster than companies anticipated. At the same time, Honolulu's deployment of AI to accelerate permit reviews and Clark County's imminent data center regulations show that local governments are moving faster than federal agencies to set the rules, creating a patchwork that favors the players with enough legal and regulatory capacity to navigate it. Apple's reported negotiations with Chinese chip makers despite Pentagon blacklists reveal how real the RAM shortage is and how far manufacturers will push against policy when supply chains break.

The research literature and developer activity reveal where the field is actually moving. Papers cluster around controlled evaluation of failure modes, mechanistic diagnosis of model behavior via internal representations, and decomposition of complex objectives into tractable learning stages, treating empirical precision, rather than architectural novelty, as the primary lever. In trending GitHub repositories, the story is almost entirely about agents and the infrastructure to run them at scale: Claude Code frameworks like agency-agents, career-ops, and superpowers are not experiments but tools designed to ship, with developers grinding on cost, reliability, and composability rather than asking whether agents can be built. The discovery layer shows the underlying plumbing consolidating around orchestration platforms like Langflow and Xinference, data extraction tools like Unstract, and fine-tuning infrastructure like LlamaFactory, which has 72k stars. What is absent is striking: no major new model releases in the trending set and no new training frameworks. The investment has shifted downstream, toward deployment. This is the phase where infrastructure maturity determines which teams can actually ship, and the companies that already have energy access, or can negotiate it, will win.

Grant Calloway

AI Labs

Anthropic

More details on Fable 5’s cyber safeguards and our jailbreak framework

NVIDIA

Joyride Through July With 12 Games Coming to GeForce NOW

From the Wire

Research Papers

Distributed Attacks in Persistent-State AI Control cs.AI

As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR with the best natural cover. To study the resulting dynamics, we introduce Iterative VibeCoding, a setting for AI control, the study of safely deploying capable but potentially untrusted AI. In Iterative VibeCoding, a coding agent builds software over a sequence of PRs in a persistent codebase while pursuing a covert side task. Our benchmark includes two task families: CLI tools and Flask web services, across 20 total task variations. We use Claude Sonnet 4.5 as the attack agent and GPT-4o as the monitor. We compare gradual attacks, which distribute the side task across PRs, against non-gradual attacks concentrated in a single PR. No single monitor is robust to both: which strategy evades best (success while evading the monitor) depends on the monitor type, so a defender cannot close off both gradual and non-gradual attacks with any one monitor. High evasion (>= 65%) generalizes across model attack agent backends (Sonnet 4.5, Gemini 3.1 Pro, Kimi K2.5), confirming this is a property of the persistent-state attack surface rather than a single model's capability. Evasion also remains high across state-of-the-art monitor models and the gap between gradual and non-gradual evasion widens for more capable models. We introduce a stateful link-tracker monitor that tracks suspicious buildup across PRs. On both task families, it detects gradual attacks substantially better than diff monitors that merely see more accumulated history. Combining this stronger monitor with trajectory monitors in a four-monitor ensemble reduces gradual-attack evasion from 93% under the weakest standard diff monitor to 47%.
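To make the contrast between a per-PR monitor and a stateful one concrete, here is a minimal Python sketch. It is not the paper's link-tracker monitor, whose design the abstract does not describe. The `PRSignal` fields, the per-PR suspicion scores, and the simple additive accumulation rule are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PRSignal:
    """Hypothetical per-PR observation (an assumption, not from the paper)."""
    pr_id: int
    target: str       # assumed: the resource a suspicious change touches
    suspicion: float  # assumed: score in [0, 1] from some diff scorer


def stateless_flags(signals: List[PRSignal], threshold: float) -> List[int]:
    """Per-PR monitor: flags a PR only if its own score crosses the threshold."""
    return [s.pr_id for s in signals if s.suspicion >= threshold]


@dataclass
class StatefulMonitor:
    """Toy stateful monitor: sums suspicion per target across PRs."""
    threshold: float
    totals: Dict[str, float] = field(default_factory=lambda: defaultdict(float))

    def observe(self, signal: PRSignal) -> bool:
        # Accumulate suspicion for this target, then compare the running total.
        self.totals[signal.target] += signal.suspicion
        return self.totals[signal.target] >= self.threshold


# A gradual attack: three individually mild changes to the same target.
gradual = [
    PRSignal(1, "upload_client", 0.25),
    PRSignal(2, "upload_client", 0.25),
    PRSignal(3, "upload_client", 0.25),
]

monitor = StatefulMonitor(threshold=0.7)
stateful_hits = [s.pr_id for s in gradual if monitor.observe(s)]
per_pr_hits = stateless_flags(gradual, threshold=0.7)
```

In this toy setup no single PR crosses the threshold, so the per-PR check flags nothing, while the running total crosses it on the third PR. A real monitor would also need decay, a way to link related changes, and false-positive control, none of which is modeled here.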

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning cs.CL

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.

Program-as-Weights: A Programming Paradigm for Fuzzy Functions cs.LG

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
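As a rough sanity check on the "one fiftieth" figure, the parameter counts alone give a similar ratio. The sketch below assumes weight memory scales linearly with parameter count at a fixed precision (2 bytes per parameter is an assumption) and ignores adapters, activations, and KV cache, so it approximates rather than reproduces the paper's measurement.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1e9 bytes), weights only."""
    return params_billion * bytes_per_param


interpreter_gb = weight_memory_gb(0.6)   # Qwen3 0.6B interpreter
baseline_gb = weight_memory_gb(32.0)     # Qwen3-32B prompted directly
ratio = baseline_gb / interpreter_gb     # about 53, independent of precision

print(f"{interpreter_gb:.1f} GB vs {baseline_gb:.1f} GB, ratio ~{ratio:.0f}x")
```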

Online Safety Monitoring for LLMs cs.AI

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
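One simple way to picture a threshold monitor calibrated for risk is a conformal-style quantile rule, sketched below in Python. The abstract does not specify the paper's risk-control procedure, so this is a stand-in. The function names, the convention that a higher verifier score means "more likely unsafe", and the choice of missed unsafe outputs as the controlled risk are all assumptions.

```python
import math
from typing import Sequence


def calibrate_threshold(unsafe_scores: Sequence[float], alpha: float) -> float:
    """Pick tau so the alarm (score >= tau) misses at most ~alpha of unsafe outputs.

    unsafe_scores: verifier scores on held-out outputs known to be unsafe.
    Uses the conservative rank floor(alpha * (n + 1)) - 1 (0-indexed).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if not unsafe_scores:
        raise ValueError("need at least one unsafe calibration example")
    ordered = sorted(unsafe_scores)
    n = len(ordered)
    k = math.floor(alpha * (n + 1)) - 1
    if k < 0:
        # Too little data for this alpha: alarm on everything.
        return float("-inf")
    return ordered[min(k, n - 1)]


def should_alarm(score: float, tau: float) -> bool:
    """Real-time decision: raise the alarm when the score reaches tau."""
    return score >= tau


# Hypothetical calibration scores for seven known-unsafe outputs.
calibration = [0.9, 0.35, 0.6, 0.8, 0.55, 0.7, 0.95]
tau = calibrate_threshold(calibration, alpha=0.25)
```

With seven calibration scores and `alpha=0.25`, the rule picks the second-smallest score as the threshold, so only one calibration example falls below it. A stricter `alpha` with the same data falls back to alarming on every output.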

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning cs.AI

Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.
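The replay idea can be sketched as prompt assembly: keep the full context, then repeat the most relevant pieces just before the question. The Python below is a single-round illustration only and is not taken from the linked repository. RECONTEXT derives relevance from model-internal signals, whereas here the scores are simply passed in, and the function names and prompt layout are assumptions.

```python
from typing import List, Sequence


def select_evidence(chunks: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Return the k highest-scoring chunks, kept in their original order."""
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]


def build_replay_prompt(chunks: Sequence[str], scores: Sequence[float],
                        question: str, k: int = 2) -> str:
    """Full context first, then the replayed evidence, then the question."""
    evidence = select_evidence(chunks, scores, k)
    return "\n\n".join([
        "Context:\n" + "\n".join(chunks),               # nothing is pruned
        "Evidence (replayed):\n" + "\n".join(evidence),
        "Question: " + question,
    ])


chunks = ["Paris is in France.", "The sky is blue.", "France uses the euro."]
scores = [0.9, 0.1, 0.6]  # hypothetical relevance scores
prompt = build_replay_prompt(chunks, scores, "Which currency is used in Paris?")
```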

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates cs.AI

LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a ~3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.

Benchmarks

Intelligence Index

Composite score across coding, math, and reasoning

#	Model	Score	tok/s	$/1M
1	Claude Fable 5	59.9	64	$20.00
2	Claude Opus 4.8	55.7	65	$10.00
3	GPT-5.5	54.8	84	$11.25
4	Claude Opus 4.7	53.5	50	$10.00
5	Claude Sonnet 5	53.4	87	$6.00

SWE-rebench

Agentic coding on real-world software engineering tasks

#	Provider	Model	Type	Score
1	OpenAI	gpt-5.5-2026-04-23-xhigh	Model	62.7% ± 0.91%
2	Junie	Junie	Agent	61.6% ± 0.64%
3	OpenAI	Codex	Agent	60.4% ± 1.37%
4	Anthropic	Claude Code	Agent	59.6% ± 1.98%
5	OpenAI	gpt-5.5-2026-04-23-medium	Model	58.9% ± 0.78%

GitHub Repos

Trending

usestrix/strix

33233 ★

Open-source AI hackers to find and fix your app’s vulnerabilities.

JuliusBrussee/caveman

82001 ★

🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman

msitarzewski/agency-agents

126006 ★

A complete AI agency at your fingertips - From frontend wizards to Reddit community ninjas, from whimsy injectors to reality checkers. Each agent is a specialized expert with personality, processes, and proven deliverables.

hasaneyldrm/exercises-dataset

9610 ★

A comprehensive dataset of 433 fitness exercises. Each entry includes name, category, target muscle group, equipment, instructions, thumbnail image, and animation video.

santifer/career-ops

58221 ★

AI-powered job search system built on Claude Code. 14 skill modes, Go dashboard, PDF generation, batch processing.

Daily discovery

Zipstack/unstract Prompt Engineering

6686 ★

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

ArduPilot/ardupilot Robotics

15408 ★

ArduPlane, ArduCopter, ArduRover, ArduSub source

deepset-ai/haystack-core-integrations MLOps

199 ★

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack

hiyouga/LlamaFactory RLHF

72918 ★

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

xorbitsai/inference LLM

9408 ★

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.