Meanwhile, as federal courts weigh whether Anthropic poses a national security threat and the Trump administration releases a framework designed to preempt state AI regulations, the real battle is shifting beneath the political theater: toward control of infrastructure and the platforms through which work flows. The government's case against Anthropic rests on technical misunderstandings, according to sworn declarations filed Friday, while the administration's National Policy Framework pairs lighter-touch rules for companies with an attempt to centralize AI policy authority at the federal level. Big Tech is fracturing over these moves, with former Trump allies offering unprecedented criticism even as the administration tries to block state-level laws. This is not a philosophical disagreement about safety. It is a power struggle over who writes the rules: the executive branch, the states, or the companies themselves.
Beneath the regulatory rupture, power has become the bottleneck constraining the entire enterprise. Nvidia CEO Jensen Huang projects one trillion dollars in AI chip sales through 2027, yet energy consumption now sits alongside accuracy and engagement as a north-star metric, as engineers discover that rolling out new data centers depends on power availability, not model capability. Microsoft rolled back Copilot bloat on Windows after user and developer resistance to forced integration. OpenAI is folding ChatGPT, Codex, and its Atlas browser into a single desktop superapp, signaling a shift toward enterprise infrastructure and developer tools and away from the consumer market that made it a household name. These are admissions that the consumer AI wave has peaked and the real money is in developer platforms and enterprise lock-in.
Distribution control is now the prize. WordPress.com lets AI agents write and publish posts directly. Google embedded AI into Stitch, letting developers describe interfaces in natural language. Amazon is building a smartphone called Transformer to integrate shopping, streaming, and voice services through Alexa. LinkedIn banned an AI agent that had overrun the platform. Each move lowers friction for adoption while raising switching costs and centralizing control in the platform. PwC told staff to embrace AI or face replacement. Google told researchers to stop submitting AI-generated bug reports to its open-source program, citing hallucinations and low quality. AI adoption is no longer optional, quality control is breaking down at scale, and the winners will be whoever owns the platform through which work flows.
The technical evidence confirms this shift. Benchmark performance has plateaued at the top tier: Claude Code holds 52.9% on SWE-rebench, with the next three positions within 1.2 percentage points, a sign that incremental gains in raw capability now demand substantial effort. On GitHub, the dominant pattern is developers moving past building individual models toward building systems that orchestrate them: Claude HUD, Open-SWE, and Superpowers all attack the same problem of making autonomous agents predictable enough to trust in production. The repos gaining traction are those that make the infrastructure layers reliable: specialized data handling for AI pipelines, vector storage, dataflow definition, and domain-specific scaffolding. The boring parts of AI systems, not raw capability, are where developers are actively building.
Grant Calloway
AI systems based on large language models (LLMs) increasingly mediate what billions of people see, choose and buy. This creates an urgent need to quantify the systemic risks of LLM-driven market intermediation, including its implications for market fairness, competition, and the diversity of information exposure. This paper introduces ChoiceEval, a reproducible framework for auditing brand and cultural preferences in LLMs under realistic usage conditions. ChoiceEval addresses two core technical challenges: (i) generating realistic, persona-diverse evaluation queries and (ii) converting free-form outputs into comparable choice sets and quantitative preference metrics. For a given topic (e.g., running shoes, hotel chains, travel destinations), the framework segments users into psychographic profiles (e.g., budget-conscious, wellness-focused, convenience-oriented), then derives diverse prompts that reflect real-world advice-seeking and decision-making behaviour. LLM responses are converted into normalised top-k choice sets, and preference and geographic bias are quantified using metrics that are comparable across topics and personas. ChoiceEval thus provides a scalable audit pipeline for researchers, platforms, and regulators, linking model behaviour to real-world economic outcomes. Applied to Gemini, GPT, and DeepSeek across 10 topics spanning commerce and culture, using more than 2,000 questions, ChoiceEval reveals consistent preferences: the U.S.-developed models Gemini and GPT show marked favouritism toward American entities, while the China-developed DeepSeek exhibits more balanced yet still detectable geographic preferences. These patterns persist across user personas, suggesting systematic rather than incidental effects.
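To make the pipeline concrete, here is a minimal sketch of the choice-set normalisation and preference-metric step the abstract describes. The paper's own implementation is not shown; the function names, the brand-to-country mapping, and the share metric below are illustrative assumptions.

```python
# Illustrative sketch of a ChoiceEval-style audit step (not the authors' code).
# Assumes each LLM answer has already been parsed into a ranked list of brand
# names, and that we hold a hypothetical brand -> country mapping.
from collections import Counter

BRAND_COUNTRY = {  # hypothetical metadata for the "running shoes" topic
    "Nike": "US", "Brooks": "US", "New Balance": "US",
    "Adidas": "DE", "Asics": "JP", "Li-Ning": "CN",
}

def top_k_choice_set(ranked_brands, k=3):
    """Normalise a free-form ranking into a deduplicated top-k choice set."""
    seen, choice_set = set(), []
    for brand in ranked_brands:
        if brand not in seen:
            seen.add(brand)
            choice_set.append(brand)
        if len(choice_set) == k:
            break
    return choice_set

def geographic_share(responses, country="US", k=3):
    """Fraction of top-k slots, pooled over personas, held by one country."""
    counts = Counter()
    for ranked in responses:  # one parsed ranking per persona/prompt
        for brand in top_k_choice_set(ranked, k):
            counts[BRAND_COUNTRY.get(brand, "other")] += 1
    total = sum(counts.values())
    return counts[country] / total if total else 0.0

# Example: two personas' parsed answers for one topic
answers = [["Nike", "Adidas", "Brooks", "Nike"], ["Asics", "New Balance", "Nike"]]
print(geographic_share(answers))  # share of US brands in the pooled top-3 sets
```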
Artificial intelligence is increasingly embedded in human decision-making, where it can either enhance human reasoning or induce excessive cognitive dependence. This paper introduces a conceptual and mathematical framework for distinguishing cognitive amplification, in which AI improves hybrid human-AI performance while preserving human expertise, from cognitive delegation, in which reasoning is progressively outsourced to AI systems. To characterize these regimes, we define a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR). Together, these quantities provide a low-dimensional metric space for evaluating not only whether human-AI systems achieve genuine synergistic performance, but also whether such performance is cognitively sustainable for the human component over time. The framework highlights a central design tension in human-AI systems: maximizing short-term hybrid capability does not necessarily preserve long-term human cognitive competence. We therefore argue that human-AI systems should be designed under a cognitive sustainability constraint, such that gains in hybrid performance do not come at the cost of degradation in human expertise.
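The abstract names the metrics but gives no formulas, so the sketch below assumes simple operationalizations: a Dependency Ratio as the fraction of decisions in which the human adopts the AI's recommendation, and a drift rate as the slope of unaided human performance over time. Both definitions are assumptions for illustration, not the paper's.

```python
# Hedged sketch: assumed operationalizations of two of the paper's metrics.
import numpy as np

def dependency_ratio(followed_ai: np.ndarray) -> float:
    """Assumed D: fraction of decisions in which the human adopted the AI
    recommendation (1 = followed, 0 = overrode)."""
    return float(followed_ai.mean())

def human_cognitive_drift_rate(unaided_scores: np.ndarray) -> float:
    """Assumed HCDR: slope of unaided human performance over successive
    evaluation sessions; a negative slope suggests skill erosion."""
    t = np.arange(len(unaided_scores))
    slope, _ = np.polyfit(t, unaided_scores, 1)
    return float(slope)

# Under these assumed definitions, high adoption of AI advice combined with
# declining solo performance would flag a delegation rather than
# amplification regime.
print(dependency_ratio(np.array([1, 1, 1, 0, 1])))                   # 0.8
print(human_cognitive_drift_rate(np.array([0.82, 0.79, 0.74, 0.70])))  # < 0
```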
AI-based tools that mediate, enhance or generate parts of video communication may interfere with how people evaluate trustworthiness and credibility. In two preregistered online experiments (N = 2,000), we examined whether AI-mediated video retouching, background replacement and avatars affect interpersonal trust, people's ability to detect lies and confidence in their judgments. Participants watched short videos of speakers making truthful or deceptive statements across three conditions with varying levels of AI mediation. We observed that perceived trust and confidence in judgments declined in AI-mediated videos, particularly in settings in which some participants used avatars while others did not. However, participants' actual judgment accuracy remained unchanged, and they were no more inclined to suspect those using AI tools of lying. Our findings provide evidence against concerns that AI mediation undermines people's ability to distinguish truth from lies, and against cue-based accounts of lie detection more generally. They highlight the importance of trustworthy AI mediation tools in contexts where not only truth, but also trust and confidence matter.
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet evaluation practices focus primarily on model accuracy rather than on whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when the AI is wrong and underuse when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
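A brief sketch of what trace-based reliance metrics could look like in practice; the trace schema and the two rates below are assumptions for illustration, not the paper's specification.

```python
# Sketch: reliance calibration computed from interaction traces rather than
# self-reports. The Trace schema and metric names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Trace:
    ai_correct: bool      # was the AI recommendation right?
    human_accepted: bool  # did the human follow it?

def overreliance_rate(traces):
    """Share of wrong AI recommendations the human nevertheless accepted."""
    wrong = [t for t in traces if not t.ai_correct]
    return sum(t.human_accepted for t in wrong) / len(wrong) if wrong else 0.0

def underreliance_rate(traces):
    """Share of correct AI recommendations the human rejected."""
    right = [t for t in traces if t.ai_correct]
    return sum(not t.human_accepted for t in right) / len(right) if right else 0.0

traces = [Trace(True, True), Trace(True, False),
          Trace(False, True), Trace(False, False)]
print(overreliance_rate(traces), underreliance_rate(traces))  # 0.5 0.5
```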
Helping people identify and pursue personally meaningful career goals at scale remains a key challenge in applied psychology. Career coaching can improve goal quality and attainment, but its cost and limited availability restrict access. Large language model (LLM)-based chatbots offer a scalable alternative, yet the psychological mechanisms by which they might support goal pursuit remain untested. Here we report a preregistered three-arm randomised controlled trial (N = 517) comparing an AI career coach ("Leon," powered by Claude Sonnet), a structured written questionnaire covering closely matched reflective topics, and a no-support control on goal progress at a two-week follow-up. The AI chatbot produced significantly higher goal progress than the control (d = 0.33, p = .016). Compared with the written-reflection condition, the AI did not significantly improve overall goal progress, but it increased perceived social accountability. In the preregistered mediation model, perceived accountability mediated the AI-over-questionnaire effect on goal progress (indirect effect = 0.15, 95% CI [0.04, 0.31]), whereas self-concordance did not. These findings suggest that AI-assisted goal setting can improve short-term goal progress, and that its clearest added value over structured self-reflection lies in increasing felt accountability.
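For readers unfamiliar with the analysis, a generic bootstrap estimate of an indirect effect of this kind looks roughly as follows. The simulated data, variable names, and model are placeholders, not the trial's data or its preregistered script.

```python
# Generic bootstrap test of an indirect (mediation) effect:
# condition -> perceived accountability (mediator) -> goal progress.
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.integers(0, 2, n).astype(float)      # 1 = AI coach, 0 = questionnaire
m = 0.4 * x + rng.normal(size=n)             # simulated mediator
y = 0.3 * m + 0.05 * x + rng.normal(size=n)  # simulated outcome

def indirect_effect(xs, ms, ys):
    a = np.polyfit(xs, ms, 1)[0]              # path a: condition -> mediator
    design = np.column_stack([ms, xs, np.ones_like(xs)])
    b = np.linalg.lstsq(design, ys, rcond=None)[0][0]  # path b, adjusting for condition
    return a * b

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)               # resample cases with replacement
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))
print(np.percentile(boot, [2.5, 97.5]))       # bootstrap 95% CI for a*b
```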
Explainable AI (XAI) interfaces seek to make large language models more transparent, yet explanation alone does not produce understanding. Explaining a system's behavior is not the same as being able to engage with it, to probe and interpret its operations through direct manipulation. This distinction matters for scientific disciplines in particular: scientists who increasingly rely on LLMs for reading, citing, and producing literature reviews have little means of directly engaging with how these models process and transform the texts they generate. In this ongoing design research project, I argue for a shift from explainability to interpretative engagement. This shift moves away from accounts of system behavior to instead enable users to manipulate a model's intermediate representations. Drawing on textual scholarship, computational poetics, and the history of reading and writing technologies, including practices such as marginalia, glosses, indices, and annotation systems, I propose interpretative interfaces as interactive environments in which non-expert users can intervene in the representational space of a language model. More specifically, such interfaces will allow users to select a token and follow its trajectory through the model's intermediate layers. This way, they can observe how its semantic position shifts as context is processed, and possibly annotate the transformations they find useful or meaningful. Just as readers create their own maps within a book through annotations and bookmarks, interpretative interfaces will allow users to inscribe their reading of a model's internal representations. The goal of this project is to reframe AI interpretability as an interaction design problem rather than a purely technical one, and to open a path toward AI-mediated reading that supports interpretative engagement and critical stewardship of scientific knowledge.
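The core mechanism such an interface would expose is already accessible with standard tooling. A minimal sketch, using GPT-2 as a stand-in model and cosine distance as an assumed measure of how far a token's representation moves between layers (the interactive annotation layer the project describes is not implemented here):

```python
# Follow one token's hidden state across a model's layers and measure how
# far it shifts as context is processed.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

text = "The reviewers cited the paper before reading it closely."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    # hidden_states: (n_layers + 1) tensors of shape (1, seq_len, hidden_dim)
    hidden = model(**inputs).hidden_states

pos = 3  # index of the token whose trajectory we trace
trajectory = [h[0, pos] for h in hidden]
for layer in range(1, len(trajectory)):
    drift = 1 - torch.cosine_similarity(trajectory[layer],
                                        trajectory[layer - 1], dim=0)
    print(f"layer {layer:2d}: shift from previous layer = {drift.item():.3f}")
```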
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tok |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 86 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 117 | $4.50 |
| 3 | GPT-5.3 Codex | 54.0 | 74 | $4.81 |
| 4 | Claude Opus 4.6 | 53.0 | 54 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 70 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Code | 52.9% |
| 2 | Junie | 52.1% |
| 3 | Claude Opus 4.6 | 51.7% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% |
A Claude Code plugin that shows what's happening - context usage, active tools, running agents, and todo progress
An Open-Source Asynchronous Coding Agent
An agentic skills framework & software development methodology that works.
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
Generate any location from the real world in Minecraft with a high level of detail.
Full-Stack Development Platform for Building Reliable Agents
A zenoh plugin that transparently routes DDS data. DDS applications can use this plugin to leverage zenoh for geographical routing or for better-scaling discovery. For ROS2 robotic applications, use https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds
Run Cursor, Claude Code, OpenCode, or Codex with any LLM provider — deploy to IM, HTTP, or your own product.
Build, Manage and Deploy AI/ML Systems
A fast in-memory rule engine