The industry is splintering into two incompatible futures while the public face remains unified. On one side, companies are hardening infrastructure around proprietary systems, local execution, and political control. OpenAI is reorganizing leadership and acquiring media properties while Anthropic buys biotech startups, launches political action committees, and quietly ships code with known vulnerabilities to npm before attempting to DMCA 8,100 repositories. On the other side, the security perimeter is collapsing faster than it can be rebuilt. Claude Code operates with 90 percent autonomy when weaponized by state actors. Meta's AI agents trigger severity-one incidents. The Europa.eu platform lost 350 gigabytes through a supply chain attack on an open-source vulnerability scanner. HackerOne paused Internet Bug Bounty payouts after acknowledging it cannot handle open-source security anymore. The more autonomous these systems become, the less the existing security model holds.
The capital intensity required to scale inference is meeting hard physical limits. Trump's AI data center buildout is delayed across nearly 50 percent of projects because China controls key power infrastructure. Meta, Microsoft, and Google are betting billions on natural gas plants while communities prefer Amazon warehouses in their backyards. Google just added Flex Inference and Priority Inference tiers to Gemini because inference costs, not training costs, are now the binding constraint. Anthropic's 400 million dollar acquisition of Coefficient Bio and its new PAC suggest preparation for a longer game than quarterly model releases. OpenAI's move to acquire TBPN and create a special projects role signals internal focus shifting away from product velocity toward structural positioning.
Measurement and enforcement are becoming table stakes for enterprise adoption. Google is publishing work on behavioral alignment measurement while AWS ships centralized governance tooling across customer accounts. Only one is currently collecting revenue. In research, two complementary trajectories are emerging: one treats LLMs as semantic reasoners augmented with domain-specific constraints for detecting vulnerabilities, the other interrogates whether LLM outputs remain robust under variation through rigorous benchmarking. Both converge on controlled experimental design, yet the gap between laboratory conditions and real-world deployment remains substantial. Claude Opus 4.6 holds 65.3 percent on SWE-rebench, up 12.3 points, while Artificial Analysis shows minimal top-tier movement at 57.2 percent. The divergence reflects a fundamental problem: different benchmarks measure different distributions, making cross-methodology comparison unreliable.
Developer tooling is consolidating around infrastructure rather than capability. Conversational interfaces like Onyx and Prompts.Chat treat model switching and prompt management as friction to eliminate. Deeper architectural work is happening elsewhere: Microsoft's Presidio for PII redaction, Google's TimesFM for time-series forecasting, Genkit for application runtimes that use LLMs as components. The pattern across trending and discovery repositories is clear. The next wave isn't better chat. It's systems that manage state, route data, keep private data private, and learn from interaction. Agents and reinforcement learning are merging in the developer discovery set. The infrastructure bet is real. The security model is not.
Grant Calloway
Smart contracts are self-executing programs that manage financial transactions on blockchain networks. Developers commonly rely on third-party code libraries to improve both efficiency and security. However, improper use of these libraries can introduce hidden vulnerabilities that are difficult to detect, leading to significant financial losses. Existing automated tools struggle to identify such misuse because it often requires understanding the developer's intent rather than simply scanning for known code patterns. This paper presents LibScan, an automated detection framework that combines large language model (LLM)-based semantic reasoning with rule-based code analysis, identifying eight distinct categories of library misuse in smart contracts. To improve detection reliability, the framework incorporates an iterative self-correction mechanism that refines its analysis across multiple rounds, alongside a structured knowledge base derived from large-scale empirical studies of real-world misuse cases. Experiments conducted on 662 real-world smart contracts demonstrate that LibScan achieves an overall detection accuracy of 85.15%, outperforming existing tools by a margin of over 16 percentage points. Ablation experiments further confirm that combining both analysis approaches yields substantially better results than either method used independently.
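The hybrid design the abstract describes can be sketched in a few lines: rule-based patterns seed candidate findings, and a semantic pass iteratively refines them until the finding set stabilizes. Everything below is illustrative — the misuse categories, the `llm_review` stub standing in for the LLM, and the round budget are assumptions, not LibScan's actual implementation.

```python
# Hypothetical sketch of LibScan-style hybrid detection. The rule table,
# the stubbed semantic reviewer, and the convergence loop are all
# illustrative assumptions, not the paper's code.
import re

MISUSE_RULES = {
    "unchecked-delegatecall": re.compile(r"\.delegatecall\("),
    "deprecated-safemath": re.compile(r"using\s+SafeMath"),
}

def rule_scan(source: str) -> list[str]:
    """Flag candidate misuse categories via pattern matching."""
    return [name for name, pat in MISUSE_RULES.items() if pat.search(source)]

def llm_review(source: str, candidates: list[str]) -> list[str]:
    """Stub for the semantic pass: keep a finding only when context
    suggests real misuse (here: delegatecall without an ownership guard)."""
    confirmed = []
    for c in candidates:
        if c == "unchecked-delegatecall" and "onlyOwner" in source:
            continue  # access guard present, likely intentional use
        confirmed.append(c)
    return confirmed

def detect(source: str, rounds: int = 3) -> list[str]:
    """Iterative self-correction: re-run the semantic pass until the
    finding set stabilizes or the round budget is exhausted."""
    findings = rule_scan(source)
    for _ in range(rounds):
        refined = llm_review(source, findings)
        if refined == findings:
            break
        findings = refined
    return findings

contract = ("contract C { using SafeMath for uint; "
            "function f(address t) public { t.delegatecall(msg.data); } }")
print(detect(contract))
```

The fixed point of the review loop is what makes multi-round refinement cheap to reason about: once a round changes nothing, further rounds are skipped.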
Smart contract vulnerabilities can cause substantial financial losses due to the immutability of code after deployment. While existing tools detect vulnerabilities, they cannot effectively repair them. In this paper, we propose SCPatcher, a framework that combines retrieval-augmented generation with a knowledge graph for automated smart contract repair. We construct a knowledge graph from 5,000 verified Ethereum contracts, extracting function-level relationships to build a semantic network. This graph serves as an external knowledge base that enhances Large Language Model reasoning and enables precise vulnerability patching. We introduce a two-stage repair strategy, initial knowledge-guided repair followed by Chain-of-Thought reasoning for complex vulnerabilities. Evaluated on a diverse set of vulnerable contracts, SCPatcher achieves 81.5% overall repair rate and 91.0% compilation pass rate, substantially outperforming existing methods.
Due to their widespread use in industry, several techniques have been proposed in the literature to fuzz REST APIs. Existing fuzzers for REST APIs have focused on detecting crashes (e.g., 500 HTTP server error status codes). However, security vulnerabilities can have drastic consequences for existing cloud infrastructures. In this paper, we propose a series of novel automated oracles aimed at detecting violations of access policies in REST APIs, as well as executing traditional attacks such as SQL injection and XSS. These oracles can be integrated into existing fuzzers: once the fuzzing session is completed, a "security testing" phase is executed to verify them. When a security fault is detected, our technique outputs executable test cases in different formats, such as Java, Kotlin, Python, and JavaScript test suites. Our techniques are integrated as an extension of EvoMaster, a state-of-the-art open-source fuzzer for REST APIs. Experiments are carried out on 9 artificial examples, 8 vulnerable-by-design REST APIs with black-box testing, and 36 REST APIs from the WFD corpus with white-box testing, for a total of 52 distinct APIs. Results show that our novel oracles and their automated integration into a fuzzing process can detect security issues in several of these APIs.
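An access-policy oracle of the kind described can be understood as a pass over the recorded fuzzing traffic: a state-changing call that succeeded against a resource the caller does not own is a violation. The record shape and field names below are assumptions for illustration, not EvoMaster's API.

```python
# Minimal sketch of a post-fuzzing access-policy oracle in the spirit of
# the paper: a mutating call on user A's resource must not succeed for
# user B. The Call record and its fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Call:
    method: str
    path: str
    auth_user: str
    resource_owner: str
    status: int

def access_policy_violations(calls: list[Call]) -> list[Call]:
    """Oracle: flag state-changing calls that succeeded (2xx) even though
    the authenticated user does not own the target resource."""
    mutating = {"PUT", "PATCH", "DELETE"}
    return [
        c for c in calls
        if c.method in mutating
        and c.auth_user != c.resource_owner
        and 200 <= c.status < 300
    ]

log = [
    Call("DELETE", "/items/1", auth_user="bob", resource_owner="alice", status=204),
    Call("DELETE", "/items/2", auth_user="alice", resource_owner="alice", status=204),
    Call("PUT", "/items/3", auth_user="bob", resource_owner="alice", status=403),
]
violations = access_policy_violations(log)
print([c.path for c in violations])  # only the cross-user DELETE that succeeded
```

Because the oracle only reads recorded calls, it can run as a separate phase after fuzzing completes, exactly as the abstract describes.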
The Mining Software Repositories (MSR) field focuses on analysing the rich data contained in software repositories to derive actionable insights into software processes and products. Mining repositories at scale requires techniques capable of handling large volumes of heterogeneous data, a challenge for which language models (LMs) are increasingly well-suited. Since the advent of Transformer-based architectures, LMs have been rapidly adopted across a wide range of MSR tasks. This article presents a comprehensive survey of the use of LMs in MSR, based on an analysis of 85 papers. We examine how LMs are applied, the types of artefacts analysed, which models are used, how their adoption has evolved over time, and the extent to which studies support reproducibility and reuse. Building on this analysis, we propose a taxonomy of LM applications in MSR, identify key trends shaping the field, and highlight open challenges alongside actionable directions for future research.
Training effective software engineering agents requires large volumes of task-specific trajectories, incurring substantial data construction costs. Inspired by the "Less-Is-More" hypothesis in mathematical reasoning, we investigate its extension to agentic scenarios and propose an end-to-end training framework that achieves superior agentic capabilities with fewer but higher-quality training trajectories. This is achieved via STITCH (Sliding-memory Trajectory Inference and Task Chunking Heuristic), a coarse-to-fine mechanism that filters low-value noise and retains decision-critical tokens to maximize training signal quality. We conduct experiments across multiple agent frameworks (e.g., mini-SWE-agent, MSWE-agent), model scales (30B to 355B), and multilingual settings (Python, Java, and ArkTS). On SWE-bench Verified, models trained with STITCH achieve up to 63.16% relative improvement over base models. On Multi-SWE-bench (Java), MiniMax-M2.5-STITCH achieves 43.75% with our CodeArts Agent scaffold (+16.67%). On HarmonyOS (ArkTS), GLM-4.7-STITCH improves the compilation pass rate to 61.31% (+43.34%) with less than 1K training trajectories. Our results confirm that the "Less-Is-More" paradigm generalizes effectively to complex agentic tasks across diverse languages and model scales.
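The coarse-to-fine filtering idea — drop low-value noise, keep decision-critical content — can be sketched as a trajectory filter. To be clear, the heuristic below is invented for illustration; the paper's actual STITCH mechanism (sliding-memory inference plus task chunking) is not specified at this level of detail here.

```python
# Speculative sketch of "fewer but higher-quality" trajectory filtering:
# keep every decision step, but retain only a sliding window of recent
# observations so stale environment noise is dropped. This scoring rule
# is an invented stand-in, not the paper's STITCH algorithm.

def filter_trajectory(steps: list[dict], window: int = 4) -> list[dict]:
    """Keep action/answer steps; for observation steps, keep only the
    most recent `window`, then restore chronological order."""
    decisions = [s for s in steps if s["kind"] in {"action", "answer"}]
    observations = [s for s in steps if s["kind"] == "observation"][-window:]
    kept = decisions + observations
    kept.sort(key=lambda s: s["t"])
    return kept

trajectory = [
    {"t": 0, "kind": "action", "text": "open file"},
    {"t": 1, "kind": "observation", "text": "file contents..."},
    {"t": 2, "kind": "observation", "text": "duplicate listing"},
    {"t": 3, "kind": "action", "text": "edit line 12"},
    {"t": 4, "kind": "observation", "text": "tests pass"},
    {"t": 5, "kind": "answer", "text": "patch complete"},
]
print(len(filter_trajectory(trajectory, window=2)))  # 5 of 6 steps survive
```

The point of the sketch is the asymmetry: decisions are never dropped, while observations are treated as a bounded memory, which is where the training-signal density gain would come from.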
Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting, totaling 540 experiments (i.e., 2×3×10×3×3). Results indicate that while preference-based alignment improves adherence to design intent, it does not eliminate non-determinism, and model-level behavior strongly influences design reliability. These findings highlight that achieving dependable LLM-assisted software design requires not only effective prompting but also careful consideration of model behavior and robustness.
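Preference-based few-shot prompting, as described, amounts to seeding the prompt with stated design preferences plus exemplars that satisfy them. A minimal sketch of prompt construction — the preference list, the exemplar, and the arrow notation are placeholders, not the paper's benchmarks:

```python
# Hedged sketch of preference-based few-shot prompt construction for UML
# class-diagram synthesis. The preferences, exemplar design, and notation
# are illustrative assumptions, not the study's actual prompts.

PREFERENCES = [
    "favor composition over inheritance",
    "depend on abstractions (interfaces), not concrete classes",
]

FEW_SHOT = [
    {
        "domain": "payment processing",
        "design": "PaymentService --> <<interface>> PaymentGateway <|.. StripeGateway",
    },
]

def build_prompt(domain_description: str) -> str:
    """Assemble preferences, then exemplars, then the task itself."""
    lines = ["You are designing a UML class diagram.", "Design preferences:"]
    lines += [f"- {p}" for p in PREFERENCES]
    for ex in FEW_SHOT:
        lines.append(f"Example ({ex['domain']}): {ex['design']}")
    lines.append(f"Task: produce a class diagram for: {domain_description}")
    return "\n".join(lines)

prompt = build_prompt("a library lending system")
print(prompt)
```

The contrast with the study's other two strategies is structural: standard prompting would omit both the preference list and the exemplars, while rule-injection would state the rules without pattern-consistent examples.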
Composite score across coding, math, and reasoning
| # | Model | Score | tok/s | $ / 1M tokens |
|---|---|---|---|---|
| 1 | GPT-5.4 | 57.2 | 76 | $5.63 |
| 2 | Gemini 3.1 Pro Preview | 57.2 | 118 | $4.50 |
| 3 | GPT-5.3 Codex | 54 | 72 | $4.81 |
| 4 | Claude Opus 4.6 | 53 | 46 | $10.00 |
| 5 | Claude Sonnet 4.6 | 51.7 | 52 | $6.00 |
Agentic coding on real-world software engineering tasks
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% |
| 3 | GLM-5 | 62.8% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% |
| 5 | Gemini 3.1 Pro Preview | 62.3% |
OmX - Oh My codeX: Your codex is not alone. Add hooks, agent teams, HUDs, and so much more.
Open Source AI Platform - AI Chat with advanced features that works with every LLM
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model developed by Google Research for time-series forecasting.
Create stunning demos for free. Open-source, no subscriptions, no watermarks, and free for commercial use. An alternative to Screen Studio.
The fastest and the most accurate file search toolkit for AI agents, Neovim, Rust, C, and NodeJS
An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.
Open Source Voice Agent Platform
Examples of models deployable with Truss
Awesome List for Agentic RL
Open-source framework for building AI-powered apps in JavaScript, Go, and Python, built and used in production by Google