Like the shift from mainframes to personal computers, which happened not because computers got smaller but because value migrated to whoever controlled the interface layer, this week's AI developments reveal a fundamental restructuring around control of the value chain rather than capability breakthroughs. OpenAI's termination of its exclusive Microsoft partnership and its revised $135 billion deal signal that the company has accumulated enough leverage to demand independence at the precise moment it needs to prove it can operate as a standalone for-profit entity ahead of an IPO. The company won the right to sell models on Amazon Bedrock while Microsoft receives expanded revenue sharing instead of exclusivity, a renegotiation that matters less for what it says about the current relationship than for what it says about OpenAI's confidence in its future bargaining position. Meanwhile, David Silver's $1.1 billion raise for Ineffable Intelligence at a $5.1 billion valuation, to build AI systems that learn without human data, signals that investors believe the moat lies in data efficiency rather than model scale. And regulatory actions, from the EU forcing Google to stop giving Gemini preferential treatment on Android to China blocking Meta's $2 billion acquisition of Manus, show governments using acquisition review to prevent any single entity from consolidating the AI supply chain within their borders.
The hardware layer is where this restructuring becomes concrete. OpenAI is reportedly developing a phone with Qualcomm and MediaTek targeting 2028 production, while Skye attracted investors for an AI-native iPhone home screen; the competition is not about processing speed but about whether the next interface layer will be apps or agents, and who controls it. The Musk v. Altman trial, meanwhile, unfolds as a courtroom test of whether a company founded as a nonprofit can pivot to for-profit without its founders having legal grounds to block it. Jury selection revealed negative views of Musk himself, suggesting the case may hinge less on contractual language than on whether the court accepts Altman's version of the company's mission. That verdict determines whether OpenAI can proceed to IPO as a for-profit entity, which in turn determines whether the company can raise capital independently of Microsoft and compete on hardware.
Infrastructure consolidation is accelerating across the stack. OpenAI secured FedRAMP Moderate authorization for government deployment while positioning Symphony as an open-source orchestration spec that locks developers into its API ecosystem, a familiar playbook of releasing developer tooling to capture downstream demand. IBM's Bob agent claiming 45% productivity gains suggests enterprise coding assistance is becoming table stakes rather than differentiation, and the fact that IBM is building its own agent rather than embedding a third-party model indicates margin pressure is landing on the labs themselves. AWS, meanwhile, is racing to commoditize agent infrastructure through Bedrock AgentCore CLI and Lambda S3 Files, reducing friction between model consumption and application deployment. On GitHub, the dominant pattern mirrors this shift: developers have moved past debating which LLM to use and toward building reliable systems around them, with repositories focused on agent frameworks, context management across sessions, and observability for multi-step workflows occupying the top tier. The benchmarks reveal fragmentation rather than consensus, with Claude Opus 4.6 leading SWE-rebench at 65.3% while GPT-5.5 leads Artificial Analysis at 60.2, suggesting different evaluation methodologies reward different problem-solving approaches and no single model has established dominance across all dimensions.
Grant Calloway
Adaptive programming practice often relies on fixed libraries of worked examples and practice problems, which require substantial authoring effort and may not correspond well to the logical errors and partial solutions students produce while writing code. As a result, students may receive learning content that does not directly address the concepts they are working to understand, while instructors must either invest additional effort in expanding content libraries or accept a coarse level of personalization. We present an approach for knowledge-component (KC) guided educational content generation using pattern-based KCs extracted from student code. Given a problem statement and student submissions, our pipeline extracts recurring structural KC patterns from students' code through AST-based analysis and uses them to condition a generative model. In this study, we apply this approach to worked example generation, and compare baseline and KC-conditioned outputs through expert evaluation. Results suggest that KC-conditioned generation improves topical focus and relevance to learners' underlying logical errors, providing evidence that KC-based steering of generative models can support personalized learning at scale.
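The AST-based extraction step can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's pipeline: it treats recurring parent-child node-type pairs as candidate structural KC patterns, so that two submissions with different variable names and surface syntax still share the same structural signature.

```python
import ast
from collections import Counter

def extract_kc_patterns(source: str) -> Counter:
    """Count parent->child AST node-type pairs in a program.

    Pairs that recur across many student submissions are candidate
    structural knowledge components (KCs)."""
    tree = ast.parse(source)
    patterns = Counter()
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            patterns[(type(node).__name__, type(child).__name__)] += 1
    return patterns

# Two student submissions with the same structure: a summation
# guarded by a parity test inside a loop.
a = "def f(xs):\n    s = 0\n    for x in xs:\n        if x % 2:\n            s += x\n    return s\n"
b = "def g(ys):\n    t = 0\n    for y in ys:\n        if y % 2:\n            t = t + y\n    return t\n"

# Counter intersection keeps only patterns present in both submissions.
shared = extract_kc_patterns(a) & extract_kc_patterns(b)
print(("For", "If") in shared)  # the loop-guarded-update pattern recurs
```

Note that the two submissions differ at the surface (`s += x` vs. `t = t + y`), yet the loop-and-guard structure survives the intersection; that robustness to surface variation is what makes pattern-based KCs usable for conditioning a generator.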
While the optimal sample complexity of binary classification in terms of the VC dimension is well-established, determining the optimal sample complexity of multiclass classification has remained open. The appropriate complexity parameter for multiclass classification is the DS dimension, and despite significant efforts, a gap of $\sqrt{\text{DS}}$ has persisted between the upper and lower bounds on sample complexity. Recent work by Hanneke et al. (2026) shows a novel algebraic characterization of multiclass hypothesis classes in terms of their DS dimension. Building on this, we show that the maximum hypergraph density of any multiclass hypothesis class is upper-bounded by its DS dimension. This proves a longstanding conjecture of Daniely and Shalev-Shwartz (2014). As a consequence, we determine the optimal dependence of the sample complexity on the DS dimension for multiclass as well as list learning.
In this paper, we propose a harmonized rotational gradient method, termed HRGrad, for simultaneously tackling multiscale time-dependent kinetic problems with varying small parameters. These parameters exhibit asymptotic transitions from microscopic to macroscopic physics, making it a challenging multi-task problem to solve over all ranges simultaneously. Solving tasks in different asymptotic regions often encounters gradient conflicts, which can lead to the failure of multi-task learning. To address this challenge, we explicitly encode a hidden representation of these parameters, ensuring that the corresponding solving tasks are serialized for simultaneous training. Furthermore, to mitigate gradient conflicts, we segment the prediction results to construct task losses and introduce a novel gradient alignment metric that ensures a positive dot product between the final update and each loss-specific gradient. This metric maintains consistent optimization rates for all task losses and dynamically adjusts gradient magnitudes based on conflict levels. Moreover, we provide a mathematical proof of the convergence of the HRGrad method, which is evaluated across a range of challenging asymptotic-preserving neural network (APNN) scenarios. We conduct an extensive set of experiments encompassing the Bhatnagar-Gross-Krook (BGK) equation and the linear transport equation across the full range of Knudsen numbers. Our results indicate that HRGrad effectively overcomes the "failure modes" of APNNs in these problems.
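The gradient-conflict idea can be made concrete with a small sketch. This is not HRGrad's exact rule (the paper's metric also balances optimization rates across tasks); it is a PCGrad-style projection, a standard conflict-mitigation technique, that removes the component of each task gradient pointing against another task, so the averaged update keeps a non-negative dot product with both task gradients in this two-task case.

```python
import numpy as np

def align_gradients(grads):
    """Project each task gradient away from components that conflict
    with the others, then average. Illustrative sketch of gradient
    conflict mitigation, not HRGrad's exact update rule."""
    adjusted = []
    for i, g in enumerate(grads):
        g = np.asarray(g, dtype=float).copy()
        for j, h in enumerate(grads):
            h = np.asarray(h, dtype=float)
            if i != j and g @ h < 0:          # conflict detected
                g -= (g @ h) / (h @ h) * h    # drop the component along h
        adjusted.append(g)
    return np.mean(adjusted, axis=0)

g_kinetic = np.array([1.0, 0.0])   # gradient from one asymptotic regime
g_fluid = np.array([-0.5, 1.0])    # conflicting gradient from another
u = align_gradients([g_kinetic, g_fluid])
print(u @ g_kinetic >= 0 and u @ g_fluid >= 0)  # neither task is worsened
```

The conflict check (`g @ h < 0`) and the size of the subtracted projection both scale with how strongly the gradients oppose each other, which is the "dynamically adjusts gradient magnitudes based on conflict levels" behavior the abstract describes.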
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by as few as two different thinkers in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \operatorname{poly}\log\frac{1}{\varepsilon}$.
Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across unseen specifications and diverse environments remains insufficiently understood. In this work, we introduce SpecRLBench, a benchmark designed to evaluate the generalization capabilities of LTL-based specification-guided RL methods. The benchmark spans multiple difficulty levels across navigation and manipulation domains, incorporating both static and dynamic environments, diverse robot dynamics, and varied observation modalities. Through extensive empirical evaluation, we characterize the strengths and limitations of existing approaches and reveal the challenges that emerge as specification and environment complexity increase. SpecRLBench provides a structured platform for systematic comparison and supports the development of more generalizable specification-guided RL methods. Code is available at https://github.com/BU-DEPEND-Lab/SpecRLBench.
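To make "temporally extended task" concrete: an LTL specification like "eventually reach A, and after that eventually reach B" compiles to a finite-state monitor whose progress can drive the RL reward. The following is a minimal hand-rolled sketch of such a monitor, not SpecRLBench's API.

```python
def make_monitor():
    """Finite-state monitor for the sequencing spec 'visit A, then B'.

    q=0: nothing seen yet; q=1: A seen; q=2: accepting (spec satisfied).
    """
    state = {"q": 0}
    def step(event):
        if state["q"] == 0 and event == "A":
            state["q"] = 1
        elif state["q"] == 1 and event == "B":
            state["q"] = 2
        return state["q"] == 2  # True once the spec is satisfied
    return step

step = make_monitor()
# A 'B' before any 'A' makes no progress; only A-then-B satisfies the spec.
trace = [step(e) for e in ["B", "A", "B"]]
print(trace)  # [False, False, True]
```

Generalization, in these terms, asks whether a policy trained on some monitors transfers to unseen ones; the benchmark scales this idea up to richer LTL formulas, environments, and dynamics.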
Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary assembled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping. Both tracks are deployed as Gradio applications on Hugging Face Spaces. Source code is publicly available at https://github.com/ikii-sd/pba2026-crazyrichteam.
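The class-weighted loss used for the imbalanced emotion labels can be sketched in a few lines. This numpy version mirrors the semantics of PyTorch's weighted `nn.CrossEntropyLoss` (a weighted mean of per-sample negative log-likelihoods) with "balanced" inverse-frequency weights; the class counts below are illustrative, not taken from PRDECT-ID.

```python
import numpy as np

def class_weighted_ce(logits, labels, class_counts):
    """Class-weighted cross-entropy: rare classes get larger weights.

    Weighted-mean reduction matches PyTorch's nn.CrossEntropyLoss
    with a `weight` tensor (illustrative sketch, not the repo's code)."""
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)   # "balanced" weights
    z = logits - logits.max(axis=1, keepdims=True)    # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(labels)), labels]
    w = weights[labels]
    return (w * nll).sum() / w.sum()

# 5-class emotion task; the last class is rare in the hypothetical counts.
logits = np.zeros((4, 5))        # uniform predictions for illustration
labels = np.array([0, 0, 0, 3])
loss = class_weighted_ce(logits, labels, class_counts=[300, 100, 100, 100, 20])
print(round(float(loss), 3))  # log(5) ≈ 1.609 under uniform predictions
```

With uniform predictions every sample has identical loss, so weighting leaves the value at log(5); the weights matter once the model starts fitting the majority classes at the expense of rare ones, which is exactly the imbalance the five-class emotion labels exhibit.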
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $/1M tokens |
|---|---|---|---|---|
| 1 | GPT-5.5 | 60.2 | 78 | $11.25 |
| 2 | Claude Opus 4.7 | 57.3 | 56 | $10.00 |
| 3 | Gemini 3.1 Pro Preview | 57.2 | 135 | $4.50 |
| 4 | GPT-5.4 | 56.8 | 86 | $5.63 |
| 5 | Kimi K2.6 | 53.9 | 0 | $1.71 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | GLM-5.1 | 62.7% |
My personal directory of skills, straight from my .claude directory.
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop in a GitHub repo or ZIP file, and get an interactive knowledge graph with a built-in Graph RAG Agent. Perfect for code exploration.
A curated list of practical Codex skills for automating workflows across the Codex CLI and API.
Use claude-code for free in the terminal, the VS Code extension, or via Discord, like openclaw
Beads - A memory upgrade for your coding agent
The AI agent that lives in your framework/browser
Add object detection, tracking, mobile notifications, and search to any security camera.
Terminal session manager for AI coding agents. One TUI for Claude, Gemini, OpenCode, Codex, and more.
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
AI Observability & Evaluation