The infrastructure war is hardening into a bifurcated market where capital and silicon matter more than software, and the software layer itself is fracturing under the weight of its own visibility. SoftBank's €75 billion commitment to French data centers reveals the actual hierarchy: Masayoshi Son is betting that whoever controls the substrate controls the market. Compute capacity, electricity, and silicon have become the strategic assets. Everything else is software, and software is becoming a commoditized layer that users increasingly resent paying for. GitHub Copilot's shift to token-based billing sparked backlash precisely because developers could suddenly see the unit economics of what had been a loss leader; once the margin was visible, it was resented. Google's unbundling of Gemini into a separate product called Spark tests whether users will pay for AI assistants once they're separated from search. The transcription software market already shows price resistance: free services are good enough that paid alternatives struggle to justify their cost.
AWS is converting operational overhead into lock-in by embedding generative AI into the resilience layer itself through Resilience Hub. The move targets not AI builders but the people managing the systems those builders deploy on, recognizing that as generative AI workloads proliferate across customer infrastructure, the surface area for failure expands faster than traditional monitoring can track. By offering to own the question of what happens when these systems fail at scale, AWS deepens dependency on its ecosystem precisely when organizations transition from experimental deployments to production workloads. This is infrastructure defending itself by becoming indispensable at the operational level.
The code-solving frontier has stabilized at the top, with gpt-5.5-2026-04-23-xhigh holding first place at 62.7% on SWE-rebench, while mid-tier models churn actively between 45% and 55%. Gemini 3.1 Pro Preview dropped 6.1 points from 57.2% to 51.1%, and Kimi K2.6 fell 7.4 points from 53.9% to 46.5%, the largest regression in the visible rankings. The divergence between SWE-rebench and Artificial Analysis scores for some models suggests either that the two benchmarks test different problem classes or that recent updates affected one more than the other, which warrants scrutiny of whether measurement in the middle tier is still reliable.
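The point changes quoted above are easy to re-derive. The snippet below is a small arithmetic check using only the before and after percentages given in the paragraph; the helper name `point_drop` is illustrative.

```python
# Before/after SWE-rebench scores (percent) as quoted in the paragraph above.
changes = {
    "Gemini 3.1 Pro Preview": (57.2, 51.1),
    "Kimi K2.6": (53.9, 46.5),
}


def point_drop(before: float, after: float) -> float:
    """Drop in percentage points, rounded to one decimal place."""
    return round(before - after, 1)


drops = {model: point_drop(b, a) for model, (b, a) in changes.items()}
largest = max(drops, key=drops.get)

for model, drop in drops.items():
    print(f"{model}: -{drop} points")
print(f"Largest drop: {largest}")
```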
GitHub's trending repos tell the story of agents moving from prototype to production by building the unglamorous layer where theory meets hardware constraints. Anthropic's claude-code and skills repos, alongside cursor/plugins and EveryInc/compound-engineering-plugin, show AI systems integrating into development workflows through standardized abstractions that third parties can extend. Beneath this sits the real work: ARahim3/mlx-tune brings fine-tuning to consumer hardware, vllm-project/vllm-ascend extends inference to new accelerators, and fluxions-ai/vui achieves 9x realtime performance on commodity GPUs. These aren't flashy, but they're the work that makes deployed agents economical. Developers have stopped waiting for perfect solutions and are building the infrastructure themselves, from document parsing through speech generation, revealing that the bottleneck is no longer capability but cost and efficiency in production.
Grant Calloway
Watermarking techniques for large language models (LLMs), which encode hidden information in the output so its source can be verified, have gained significant attention in recent days, thanks to their potential capability to detect accidental or deliberate misuse. Similar challenges involving model misuse also exist in the context of game-playing, such as when detecting the unauthorized use of AI tools in gaming platforms (e.g., cheating in online chess). In this paper, we initiate the study of how game-playing strategies can be watermarked. We show how the KGW watermark for LLMs can be adapted to watermark game-playing agents in perfect-information extensive-form games. The watermark can then be detected using a statistical test. We show that the degradation in the quality of the watermarked strategy profile, quantified by the expected utility, can be bounded, but there is a tradeoff between detectability and quality. In our experiments, we bootstrap the watermarking framework to various chess engines and demonstrate that a) the impact of the watermark on the quality of the strategy is negligible and b) the watermark can be detected with just a handful of games.
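The abstract does not spell out the construction, so what follows is only a sketch of the general KGW idea carried over to moves. A secret key pseudo-randomly marks roughly half of the moves in each position as "green", the agent prefers a green move among those within a small value tolerance of the best one, and a detector counts green moves and computes a z-score. The names `is_green`, `watermarked_choice`, and `z_score`, the `tolerance` parameter, and the SHA-256 keying are all illustrative assumptions, not the paper's method.

```python
import hashlib
import math


def is_green(key: str, state: str, move: str) -> bool:
    """Keyed pseudo-random split: each (state, move) pair is green about half the time."""
    digest = hashlib.sha256(f"{key}|{state}|{move}".encode()).digest()
    return digest[0] % 2 == 0


def watermarked_choice(key: str, state: str, move_values: dict, tolerance: float) -> str:
    """Pick the best green move among near-optimal moves; fall back to the best move."""
    best = max(move_values.values())
    near_best = [m for m, v in move_values.items() if v >= best - tolerance]
    green = [m for m in near_best if is_green(key, state, m)]
    pool = green or near_best  # never give up more than `tolerance` in value
    return max(pool, key=lambda m: move_values[m])


def z_score(key: str, history: list) -> float:
    """One-proportion z-test: a key-independent player picks green about half the time."""
    n = len(history)
    if n == 0:
        raise ValueError("history must contain at least one (state, move) pair")
    greens = sum(is_green(key, state, move) for state, move in history)
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)


if __name__ == "__main__":
    key = "demo-key"  # hypothetical secret
    values = {"e4": 0.30, "d4": 0.29, "c4": 0.28, "Nf3": 0.28}  # made-up evaluations
    games = [(f"position-{i}", watermarked_choice(key, f"position-{i}", values, 0.05))
             for i in range(100)]
    print(f"z-score over {len(games)} moves: {z_score(key, games):.2f}")
```

The tolerance is where the tradeoff mentioned in the abstract shows up in this toy version: a larger tolerance leaves more green candidates in each position, which strengthens the signal, but allows the chosen move to fall further below the best available one.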
An index is a function that given an election outputs a value between 0 and 1, indicating the extent to which this election has a particular feature. We seek indices that capture agreement, diversity, and polarization among voters in approval elections, and that are normalized with respect to saturation. By the latter we mean that if two elections differ by the fraction of candidates approved by an average voter, but otherwise are of similar nature, then they should have similar index values. We propose several indices, analyze their properties, and use them to (a) derive a new map of approval elections, and (b) show similarities and differences between various real-life elections from Pabulib, Preflib and other sources.
Generative AI models differ from traditional machine learning tools in that they allow users to provide as much or as little information as they choose in their inputs. This flexibility often leads users to omit certain details, relying on the models to infer and fill in under-specified information based on distributional knowledge of user preferences. Such inferences may privilege majority viewpoints and disadvantage users with atypical preferences, raising concerns about fairness. Unlike more traditional recommender systems, LLMs can explicitly solicit more information from users through natural language. However, while directly eliciting user preferences could increase personalization and mitigate inequality, excessive querying places a burden on users who value efficiency. We develop a stylized model of user-LLM interaction and an objective that captures the tradeoff between user burden and preference representation. Building on the observation that individual preferences are often correlated, we analyze how AI systems should balance inference and elicitation, characterizing the optimal amount of information to solicit before content generation. Ultimately, we show that information elicitation can mitigate the systematic biases of preference inference, enabling the design of generative tools that better incorporate diverse user perspectives while maintaining efficiency. We complement this theoretical analysis with an empirical evaluation illustrating the model's predictions and exploring their practical implications.
Generative Artificial Intelligence (AI) tools are being rapidly adopted in the workplace and in education, yet the empirical evidence on AI's impact remains mixed. We propose a model of human-AI interaction to better understand and analyze several mechanisms by which AI affects productivity. In our setup, human agents with varying skill levels exert utility-maximizing effort to produce certain task outcomes with AI assistance. We find that incorporating endogeneity in either skill development or AI unreliability can induce a productivity paradox: increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls. Moreover, we examine the long-term distributional effect of AI on skill, and demonstrate that skill polarization can emerge in steady state when accounting for heterogeneity in AI literacy -- the agent's capability to identify and adapt to inaccurate AI outputs. Our results elucidate several mechanisms that may explain the emergence of human-AI productivity paradoxes and skill polarization, and identify simple measures that characterize when they arise.
Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.
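A toy unanimity game makes the false-name problem concrete. In the sketch below, two providers A and B are both needed to create value, so each honestly earns a Shapley share of 1/2. If A registers as two identities that are both required, A's combined share rises to 2/3. Computing Shapley values over clusters instead of raw identities restores 1/2. The cluster mapping `cluster_of` is assumed to be known here; the paper's mechanism, its canonical-representative operator, and its handling of imperfect provenance are not reproduced.

```python
from fractions import Fraction
from itertools import permutations


def shapley(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: Fraction(0) for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = set()
        for p in order:
            before = value(frozenset(seen))
            seen.add(p)
            totals[p] += value(frozenset(seen)) - before
    return {p: total / len(orders) for p, total in totals.items()}


def unanimity(required):
    """Game worth 1 only when every required player is in the coalition."""
    required = frozenset(required)
    return lambda coalition: Fraction(int(required <= coalition))


# Honest report: A and B are both needed.
honest = shapley(["A", "B"], unanimity({"A", "B"}))

# Sybil report: A splits into A1 and A2, and all three identities are needed.
sybil_game = unanimity({"A1", "A2", "B"})
split = shapley(["A1", "A2", "B"], sybil_game)
split_share = split["A1"] + split["A2"]

# Quotient game: players are clusters; a cluster brings all of its identities.
cluster_of = {"A1": "A", "A2": "A", "B": "B"}  # assumed provenance evidence


def quotient_value(clusters):
    members = frozenset(i for i, c in cluster_of.items() if c in clusters)
    return sybil_game(members)


quotient = shapley(["A", "B"], quotient_value)

print("honest share of A:  ", honest["A"])
print("split share of A:   ", split_share)
print("quotient share of A:", quotient["A"])
```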
Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $\Omega(\text{Var}(1/G'')\,(\gamma/\beta)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
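A small numerical sketch can illustrate the contrast between smooth and step approval under the Brier score. It assumes the agent maximizes `beta` times the expected Brier score plus `gamma` times an approval function of the report; the abstract does not define β and γ, so that reading, the function names, and the parameter values are assumptions. With a smooth non-affine approval every type shades its report. With a step approval at a threshold, a type either reports truthfully or inflates exactly to the threshold.

```python
def expected_brier(report: float, belief: float) -> float:
    """Expected quadratic score -(report - outcome)^2 with outcome ~ Bernoulli(belief)."""
    return -(belief * (1 - report) ** 2 + (1 - belief) * report ** 2)


def best_report(belief, approval, beta=1.0, gamma=0.04, grid=1001):
    """Grid search for the report maximizing beta * score + gamma * approval."""
    reports = [i / (grid - 1) for i in range(grid)]
    return max(
        reports,
        key=lambda r: beta * expected_brier(r, belief) + gamma * approval(r),
    )


def no_approval(report: float) -> float:
    return 0.0


def smooth_approval(report: float) -> float:
    return report ** 2  # an arbitrary smooth, non-affine example


def step_approval(report: float, threshold: float = 0.5) -> float:
    return 1.0 if report >= threshold else 0.0


if __name__ == "__main__":
    for belief in (0.2, 0.4):
        print(
            f"belief={belief}: "
            f"score only -> {best_report(belief, no_approval):.3f}, "
            f"smooth -> {best_report(belief, smooth_approval):.3f}, "
            f"step -> {best_report(belief, step_approval):.3f}"
        )
```

Under these assumptions, inflating from a belief p to the threshold τ costs β(τ − p)² in expected Brier score, an amount that depends only on the distance to the threshold, so inflation pays only for types within √(γ/β) of it. With the defaults above that distance is 0.2, so a type at 0.2 stays truthful while a type at 0.4 reports 0.5.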
Composite score across coding, math, and reasoning
| # | Model | Score | Tokens/s | $/1M tokens |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 61.4 | 65 | $10.94 |
| 2 | GPT-5.5 | 60.2 | 59 | $11.25 |
| 3 | Claude Opus 4.7 | 57.3 | 60 | $10.94 |
| 4 | Gemini 3.1 Pro Preview | 57.2 | 137 | $4.50 |
| 5 | GPT-5.4 | 56.8 | 90 | $5.63 |
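One way to read the table is composite score per dollar. The snippet below is a rough sketch that assumes the last column is a price in US dollars per one million tokens and that dividing score by price is a meaningful comparison; both are assumptions, and the ratio ignores speed entirely.

```python
# (model, composite score, assumed price in USD per 1M tokens), copied from the table above.
rows = [
    ("Claude Opus 4.8", 61.4, 10.94),
    ("GPT-5.5", 60.2, 11.25),
    ("Claude Opus 4.7", 57.3, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("GPT-5.4", 56.8, 5.63),
]


def points_per_dollar(score: float, price: float) -> float:
    """Composite score points bought per dollar of listed price."""
    return score / price


ranked = sorted(rows, key=lambda r: points_per_dollar(r[1], r[2]), reverse=True)

for model, score, price in ranked:
    print(f"{model:24s} {points_per_dollar(score, price):5.2f} points per dollar")
```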
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | gpt-5.5-2026-04-23-xhigh | 62.7% |
| 2 | Codex | 60.4% |
| 3 | Claude Code | 59.6% |
| 4 | gpt-5.5-2026-04-23-medium | 58.9% |
| 5 | Claude Opus 4.8-xhigh | 56.4% |
Python tool for converting files and office documents to Markdown.
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.
Cursor plugin specification and official plugins
A meta-skill that designs domain-specific agent teams, defines specialized agents, and generates the skills they use.
Official Compound Engineering plugin for Claude Code, Codex, Cursor, and more
Fine-tune LLMs on your Mac with Apple Silicon. SFT, DPO, GRPO, and Vision fine-tuning — natively on MLX. Unsloth-compatible API.
Reinforcement Learning environments based on the 1993 game Doom
High-Performance Symbolic Regression in Python and Julia
Real-time voice assistant — WebRTC streaming, faster-whisper ASR, local LLM, Vui Nano (300M) TTS. OpenAI Realtime API compatible. Voice cloning, barge-in, ~9× realtime on a 4090. Apache 2.0.
potato: the portable annotation tool