Research in software engineering has bifurcated into two distinct but complementary trajectories: one centered on detecting and repairing vulnerabilities in specialized domains (smart contracts, REST APIs, mobile apps) through hybrid reasoning systems that combine LLMs with rule-based or knowledge-graph analysis, and another focused on evaluating and improving the reliability of LLM-based agents and code generation systems through rigorous benchmarking and decomposition analysis.

The first trajectory, exemplified by LibScan, SCPatcher, and Enhancing REST API Fuzzing, treats LLMs as semantic reasoners augmented with domain-specific constraints, knowledge bases, or structured feedback loops to achieve precision in detecting misuse patterns or generating repairs. The second trajectory, spanning papers from Reliability of Large Language Models for Design Synthesis to ToolMisuseBench and Are Benchmark Tests Strong Enough, interrogates whether LLM outputs remain robust under variation, whether benchmarks admit false positives through weak test suites, and how agent behavior decomposes into re-solving, scaffolding, and content effects, revealing that task structure and draft quality dynamically determine the utility of multi-stage pipelines.

Across both trajectories, a methodological consensus emerges around controlled experimental design: ablation studies isolate component contributions, decomposition experiments separate orthogonal effects, and large-scale empirical studies (Java static analysis frameworks, production-derived benchmarks, autonomous agent contributions in open source) expose the gap between controlled laboratory conditions and real-world deployment, where configuration interactions, framework semantics, code churn, and reproducibility drift undermine idealized assumptions about monotonic improvement and semantic comparability.
Cole Brennan
Smart contracts are self-executing programs that manage financial transactions on blockchain networks. Developers commonly rely on third-party code libraries to improve both efficiency and security. However, improper use of these libraries can introduce hidden vulnerabilities that are difficult to detect, leading to significant financial losses. Existing automated tools struggle to identify such misuse because it often requires understanding the developer's intent rather than simply scanning for known code patterns. This paper presents LibScan, an automated detection framework that combines large language model (LLM)-based semantic reasoning with rule-based code analysis to identify eight distinct categories of library misuse in smart contracts. To improve detection reliability, the framework incorporates an iterative self-correction mechanism that refines its analysis across multiple rounds, alongside a structured knowledge base derived from large-scale empirical studies of real-world misuse cases. Experiments conducted on 662 real-world smart contracts demonstrate that LibScan achieves an overall detection accuracy of 85.15%, outperforming existing tools by a margin of over 16 percentage points. Ablation experiments further confirm that combining both analysis approaches yields substantially better results than either method used independently.
Smart contract vulnerabilities can cause substantial financial losses due to the immutability of code after deployment. While existing tools detect vulnerabilities, they cannot effectively repair them. In this paper, we propose SCPatcher, a framework that combines retrieval-augmented generation with a knowledge graph for automated smart contract repair. We construct a knowledge graph from 5,000 verified Ethereum contracts, extracting function-level relationships to build a semantic network. This graph serves as an external knowledge base that enhances Large Language Model reasoning and enables precise vulnerability patching. We introduce a two-stage repair strategy: initial knowledge-guided repair followed by Chain-of-Thought reasoning for complex vulnerabilities. Evaluated on a diverse set of vulnerable contracts, SCPatcher achieves an 81.5% overall repair rate and a 91.0% compilation pass rate, substantially outperforming existing methods.
Due to their widespread use in industry, several techniques have been proposed in the literature to fuzz REST APIs. Existing fuzzers for REST APIs have focused on detecting crashes (e.g., 500 HTTP server error status codes). However, security vulnerabilities can have drastic consequences for existing cloud infrastructures. In this paper, we propose a series of novel automated oracles aimed at detecting violations of access policies in REST APIs, as well as executing traditional attacks such as SQL Injection and XSS. These novel automated oracles can be integrated into existing fuzzers: once the fuzzing session is completed, a "security testing" phase is executed to verify them. When a security fault is detected, our technique outputs executable test cases in different formats, such as Java, Kotlin, Python, and JavaScript test suites. Our novel techniques are integrated as an extension of EvoMaster, a state-of-the-art open-source fuzzer for REST APIs. Experiments are carried out on 9 artificial examples, 8 vulnerable-by-design REST APIs with black-box testing, and 36 REST APIs from the WFD corpus with white-box testing, for a total of 52 distinct APIs. Results show that our novel oracles and their automated integration in a fuzzing process can detect security issues in several of these APIs.
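As a hedged illustration of what such an access-policy oracle might check (the function, data, and endpoint names below are invented for illustration, not EvoMaster's actual API), consider replaying the same request under different credentials and flagging any 2xx response granted to a non-owner:

```python
def check_access_policy(responses_by_user: dict) -> list:
    """responses_by_user maps (user, action) -> HTTP status code observed
    when replaying the same request with different credentials."""
    violations = []
    for (user, action), status in responses_by_user.items():
        if user == "owner":
            continue
        # Any 2xx response for a non-owner on a protected resource
        # is flagged as a potential access-control violation.
        if 200 <= status < 300:
            violations.append((user, action, status))
    return violations

# hypothetical replay of one protected resource under two identities
observed = {
    ("owner", "GET /items/1"): 200,
    ("other", "GET /items/1"): 200,    # leak: a non-owner can read the item
    ("other", "DELETE /items/1"): 403, # correctly forbidden
}
print(check_access_policy(observed))
```

The real oracles also cover injection-style attacks, but the policy check above conveys the core idea: the oracle compares responses across identities rather than looking for crashes.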
The Mining Software Repositories (MSR) field focuses on analysing the rich data contained in software repositories to derive actionable insights into software processes and products. Mining repositories at scale requires techniques capable of handling large volumes of heterogeneous data, a challenge for which language models (LMs) are increasingly well-suited. Since the advent of Transformer-based architectures, LMs have been rapidly adopted across a wide range of MSR tasks. This article presents a comprehensive survey of the use of LMs in MSR, based on an analysis of 85 papers. We examine how LMs are applied, the types of artefacts analysed, which models are used, how their adoption has evolved over time, and the extent to which studies support reproducibility and reuse. Building on this analysis, we propose a taxonomy of LM applications in MSR, identify key trends shaping the field, and highlight open challenges alongside actionable directions for future research.
Training effective software engineering agents requires large volumes of task-specific trajectories, incurring substantial data construction costs. Inspired by the "Less-Is-More" hypothesis in mathematical reasoning, we investigate its extension to agentic scenarios and propose an end-to-end training framework that achieves superior agentic capabilities with fewer but higher-quality training trajectories. This is achieved via STITCH (Sliding-memory Trajectory Inference and Task Chunking Heuristic), a coarse-to-fine mechanism that filters low-value noise and retains decision-critical tokens to maximize training signal quality. We conduct experiments across multiple agent frameworks (e.g., mini-SWE-agent, MSWE-agent), model scales (30B to 355B), and multilingual settings (Python, Java, and ArkTS). On SWE-bench Verified, models trained with STITCH achieve up to 63.16% relative improvement over base models. On Multi-SWE-bench (Java), MiniMax-M2.5-STITCH achieves 43.75% with our CodeArts Agent scaffold (+16.67%). On HarmonyOS (ArkTS), GLM-4.7-STITCH improves the compilation pass rate to 61.31% (+43.34%) with less than 1K training trajectories. Our results confirm that the "Less-Is-More" paradigm generalizes effectively to complex agentic tasks across diverse languages and model scales.
Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting, totaling 540 experiments (i.e., 2 × 3 × 10 × 3 × 3). Results indicate that while preference-based alignment improves adherence to design intent, it does not eliminate non-determinism, and model-level behavior strongly influences design reliability. These findings highlight that achieving dependable LLM-assisted software design requires not only effective prompting but also careful consideration of model behavior and robustness.
Java static analysis frameworks are commonly compared under the assumption that analysis algorithms and configurations compose monotonically and yield semantically comparable results across tools. In this work, we show that this assumption is fundamentally flawed. We present a large-scale empirical study of semantic consistency within and across four widely used Java static analysis frameworks: Soot, SootUp, WALA, and Doop. Using precision partial orders over analysis algorithms and configurations, we systematically identify violations where increased precision introduces new call-graph edges or amplifies inconsistencies. Our results reveal three key findings. First, algorithmic precision orders frequently break within frameworks due to modern language features such as lambdas, reflection, and native modeling. Second, configuration choices strongly interact with analysis algorithms, producing synergistic failures that exceed the effects of algorithm or configuration changes alone. Third, cross-framework comparisons expose irreconcilable semantic gaps, demonstrating that different frameworks operate over incompatible notions of call-graph ground truth. These findings challenge prevailing evaluation practices in static analysis and highlight the need to reason jointly about algorithms, configurations, and framework semantics when assessing precision and soundness.
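A minimal sketch of the kind of violation these precision partial orders expose, assuming call graphs reduced to edge sets (the configurations and edges here are invented for illustration): if configuration B is declared more precise than A, B's edge set should be a subset of A's, so any edge present only under B breaks the expected monotonic refinement.

```python
def precision_violations(edges_coarse: set, edges_precise: set) -> set:
    """Edges are (caller, callee) pairs; returns call-graph edges that
    appear only under the supposedly more precise configuration."""
    return edges_precise - edges_coarse

# hypothetical edge sets: a lambda call-site modeled only by the
# "more precise" analysis introduces a brand-new edge
cha_edges = {("main", "A.run"), ("main", "B.run")}
pts_edges = {("main", "A.run"), ("main", "Lambda$1.apply")}
print(precision_violations(cha_edges, pts_edges))
```

A non-empty result signals a broken precision order, which is exactly the pattern the study reports for lambdas, reflection, and native modeling.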
The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents (OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin), examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate increasing agent activity in open-source projects, although agent contributions are associated with more churn over time than human-authored code.
Continuous integration and delivery (CI/CD) pipelines are critical for sustaining the evolution of large software systems. In regulated industries with legacy technologies, however, pipelines themselves can become a source of technical debt. This paper presents an industrial case study of Bankdata, a cooperative IT provider for Danish banks, where a Jenkins-based COBOL CI/CD pipeline had grown fragile, slow, and tightly coupled to platform-specific logic. The original architecture relied on Groovy scripts spread across four repositories with runtime dependency installation, leading to long execution times, high maintenance costs, and vendor lock-in. We report on the migration to a containerized architecture featuring an abstraction layer for platform logic, simplified repository structure, and a pre-built OCI-compliant image containing COBOL tools and dependencies. The new design achieved an 82% runtime reduction. Our experience highlights lessons on abstraction, containerization, and organizational adoption, offering guidance for modernizing pipelines in legacy, high-security environments.
Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
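One plausible reading of the additive decomposition, sketched with hypothetical accuracies (the four condition names and all numbers are ours; the paper's exact protocol may differ): re-solving is the strong model's solo gain over the weak model, scaffold is the extra gain from revising a semantically null draft, and content is the residual effect of real draft content.

```python
def decompose(weak_alone, strong_alone, strong_null_draft, strong_full_draft):
    """Split the second-pass gain into three additive components."""
    resolving = strong_alone - weak_alone            # stronger-model re-solving
    scaffold = strong_null_draft - strong_alone      # structural scaffolding
    content = strong_full_draft - strong_null_draft  # draft-content effect
    # by construction the three components sum to the total revision gain
    assert abs((resolving + scaffold + content)
               - (strong_full_draft - weak_alone)) < 1e-9
    return {"re_solving": resolving, "scaffold": scaffold, "content": content}

# hypothetical accuracies: a negative content term would mean the weak
# draft's content actually hurt the reviewer
effects = decompose(0.55, 0.70, 0.74, 0.72)
print(effects)
```

Under these made-up numbers most of the gain comes from re-solving, a little from scaffolding, and the draft content is mildly harmful, mirroring the MCQ-versus-code contrast the abstract describes.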
Computational reproducibility is fundamental to trustworthy science, yet remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate the notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.
Software engineering students often struggle to appreciate empirical methods and hypothesis-driven inquiry, especially when taught in theoretical terms. This experience report explores whether grounding empirical learning in hype-driven technologies can make these concepts more accessible and engaging. We conducted a one-semester seminar framed around the currently popular topic of AI coding assistants, which attracted unusually high student interest. The course combined hands-on sessions using AI coding assistants with small, student-designed empirical studies. Classroom observations and survey responses suggest that the hype topic sparked curiosity and critical thinking. Students engaged with the AI coding assistants while questioning their limitations -- developing the kind of empirical thinking needed to assess claims about emerging technologies. Key lessons: (1) Hype-driven topics can lower barriers to abstract concepts like empirical research; (2) authentic hands-on development tasks combined with ownership of inquiry foster critical engagement; and (3) a single seminar can effectively teach both technical and research skills.
File-level defect prediction models traditionally rely on product and process metrics. While process metrics effectively complement product metrics, they often overlook commit size (the number of files changed per commit) despite its strong association with software quality. Network centrality measures on dependency graphs have also proven to be valuable product-level indicators. Motivated by this, we first redefine process metrics as commit-size-aware process metric vectors, transforming conventional scalar measures into 100-dimensional profiles that capture the distribution of changes across commit-size strata. We then model change history as a hyper co-change graph, where hyperedges naturally encode commit-size semantics. Vector centralities computed on these hypergraphs quantify size-aware node importance for source files. Experiments on nine long-lived Apache projects using five popular classifiers show that replacing scalar process metrics with the proposed commit-size-aware vectors, alongside product metrics, consistently improves predictive performance. These findings establish that commit-size-aware process metrics and hypergraph-based vector centralities capture higher-order change semantics, leading to more discriminative, better-calibrated, and statistically superior defect prediction models.
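The vectorization step can be sketched as follows (the binning scheme, with commit sizes clipped at 100 files, is an assumption of ours; the paper's exact stratification is not given in the abstract):

```python
def commit_size_profile(commit_sizes_touching_file, max_size=100):
    """Turn a scalar 'number of commits' metric into a commit-size-aware
    vector: slot k-1 counts commits of size k (files changed) that touched
    the file; sizes above max_size are clipped into the last slot."""
    profile = [0] * max_size
    for size in commit_sizes_touching_file:
        slot = min(size, max_size) - 1
        profile[slot] += 1
    return profile

# a file touched by two single-file commits, one 3-file commit,
# and one very large 250-file commit
p = commit_size_profile([1, 1, 3, 250])
print(p[0], p[2], p[99])  # 2 1 1
```

The scalar metric (4 commits) would treat all four changes identically; the profile preserves the fact that one change arrived in a sweeping 250-file commit, which is the signal the paper argues matters for defect proneness.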
As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become critical. We propose SERSEM (Selective Entropy-Weighted Scoring for Membership Inference), a novel white-box attack framework that suppresses uninformative syntactical boilerplate to amplify specific memorization signals. SERSEM utilizes a dual-signal methodology: first, a continuous character-level weight mask is derived through static Abstract Syntax Tree (AST) analysis, spellchecking-based multilingual logic detection, and offline linting. Second, these heuristic weights are used to pool internal transformer activations and calibrate token-level Z-scores from the output logits. Evaluated on a 25,000-sample balanced dataset, SERSEM achieves a global AUC-ROC of 0.7913 on the StarCoder2-3B model and 0.7867 on the StarCoder2-7B model, consistently outperforming the implemented probability-based baselines Loss, Min-K% Prob, and PAC. Our findings demonstrate that focusing on human-centric coding anomalies provides a significantly more robust indicator of verbatim memorization than sequence-level probability averages.
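The pooling step can be illustrated with a toy example (the weights and z-scores below are invented; SERSEM derives its weights from AST analysis, spellchecking, and linting rather than by hand):

```python
def weighted_membership_score(token_z_scores, token_weights):
    """Pool token-level z-scores with heuristic weights that down-weight
    syntactical boilerplate (weight near 0) and emphasize distinctive,
    human-written tokens (weight near 1)."""
    total_w = sum(token_weights)
    if total_w == 0:
        return 0.0
    return sum(z * w for z, w in zip(token_z_scores, token_weights)) / total_w

# boilerplate keyword tokens get near-zero weight; identifier tokens
# with strong memorization signal dominate the pooled score
z = [0.1, 2.4, 0.0, 1.8]
w = [0.05, 1.0, 0.05, 0.9]
print(weighted_membership_score(z, w))
```

An unweighted average over the same tokens would dilute the two high-signal tokens with boilerplate, which is the failure mode of sequence-level probability averages that the abstract criticizes.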
Most defects in mobile applications are visually observable on the device screen. To track these defects, users, testers, and developers must manually submit bug reports, especially in the absence of crashes. However, these reports are frequently ambiguous or inaccurate, often omitting essential components such as the Observed Behavior (OB), Expected Behavior (EB), or Steps to Reproduce (S2Rs). Low-quality reports hinder developers' ability to understand and reproduce defects, delaying resolution and leading to incorrect or unresolvable fixes. In this paper, we posit that providing specific app-related information (e.g., GUI interactions or specific screens where bugs appear) to LLMs as key points of context can assist in automatically generating clear, detailed, and accurate OB, EB, and S2Rs. We built and evaluated a novel approach, BugScribe, that generates bug reports in this way. To support the evaluation, we introduce a unified quality framework that defines correctness and completeness dimensions for OB, EB, and S2Rs. Using 48 bug reports from 26 Android apps, we show that BugScribe produces higher-quality and more accurate components than the original reports and outperforms recent LLM-based baselines. We envision that BugScribe can serve as a practical assistant for testers and developers by enhancing incomplete bug reports with reliable and accurate OB, EB, and S2Rs, thereby streamlining bug resolution and improving mobile app quality.
Identifying the features to be released in the next version of software, from a pool of potential candidates, is a challenging problem. User feedback from app stores is frequently used by software vendors for the evolution of apps across releases. Privacy feedback, although smaller in volume, has an outsized influence on an app's success. Multiple existing works have focused on summarizing privacy concerns at the app level and have also shown that developers utilize feedback to implement security- and privacy-related changes in subsequent releases. However, the current literature offers little support for release managers and developers in identifying privacy concerns prior to release. This gap exists because user reviews typically become available in app stores only after new features of a software system are released. In this paper, we introduce Pre-PI, a novel approach that summarizes privacy concerns for to-be-released features. Our method first maps existing features to semantically similar privacy reviews to learn feature-privacy review relations. We then simulate feedback for candidate features and generate concise summaries of privacy concerns. We evaluate Pre-PI across three real-world apps, and compare it with Hark, a state-of-the-art method that relies on post-release user feedback to identify privacy concerns. Results show that Pre-PI generates additional valid privacy concerns and identifies these concerns earlier than Hark, allowing proactive mitigation prior to release.
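The feature-to-review mapping step might look like the following sketch, using lexical overlap as a hedged stand-in for the semantic similarity a real system would compute with embeddings (the feature description, reviews, and threshold are all invented):

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two short texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def map_reviews_to_feature(feature, reviews, threshold=0.2):
    """Return reviews whose similarity to the feature description meets
    the threshold; embeddings would replace jaccard() in practice."""
    return [r for r in reviews if jaccard(feature, r) >= threshold]

reviews = [
    "location tracking runs even when the app is closed",
    "great new dark mode theme",
]
print(map_reviews_to_feature("background location tracking", reviews))
```

Once existing features are linked to their privacy reviews this way, the learned relations can be reused to simulate feedback for features that have not shipped yet.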
Retrieval-augmented generation (RAG) systems are gaining traction in enterprise settings, yet stringent data protection regulations prevent many organizations from using cloud-based services, necessitating on-premises deployments. Existing blueprints and reference architectures focus on cloud deployments and lack enterprise-grade components; comprehensive on-premises implementation frameworks remain scarce. This paper addresses this gap by presenting a comprehensive AI engineering blueprint for scalable on-premises enterprise RAG solutions. It is designed to address common challenges and streamline the integration of RAG into existing enterprise infrastructure. The blueprint provides: (1) an end-to-end reference architecture described using the 4+1 view model, (2) a reference application for on-premises deployment, and (3) best practices for tooling, development, and CI/CD pipelines, all publicly available on GitHub. Ongoing case studies and expert interviews with industry partners will assess its practical benefits.
With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.
Modern generator-based fuzzing techniques combine lightweight input generators with coverage-guided mutation as a method of exploring deep execution paths in a target program. A complementary approach in prior research focuses on creating highly customized, domain-specific generators that encode structural and semantic logic sufficient to reach deep program states; the challenge comes from the overhead of writing and testing these complex generators. We investigate whether AI coding agents can automatically synthesize such target-specific generators, and whether the resulting generators are strong enough to obviate the need for coverage guidance and mutation entirely. Our approach, Gentoo, comprises an LLM coding agent (provided terminal access and the source code of the fuzz target and its library) instructed to iteratively synthesize and refine an input generator, optionally with fine-grained predicate-level coverage feedback. We evaluate three configurations of Gentoo against human-written generators on fuzz targets for 7 real-world Java libraries. Our findings show that agent-synthesized generators achieve statistically significantly higher branch coverage than human-written baseline generators on 4 of 7 benchmarks. Critically, coverage guidance and mutation strategies are not statistically significantly beneficial for agent-synthesized generators, but are significant for all human-written generators, suggesting that the structural and semantic logic encoded in the agent generators makes coverage guidance largely unnecessary.
Clone-and-own development produces families of related software variants that evolve independently. As variants diverge, important fixes applied in one repository are often missing in others. PaReco has shown that thousands of such missed opportunity (MO) patches exist across real ecosystems, yet its textual output provides limited support for understanding where and how these fixes should be propagated. We present MOVis, a lightweight, interactive desktop tool that visualizes MO patches between a source and target variant. MOVis loads PaReco's MO classifications and presents patched and buggy hunks side-by-side, highlighting corresponding regions and exposing structural differences that hinder reuse. This design enables developers to quickly locate missed fixes, understand required adaptations, and more efficiently maintain consistency across software variants. The tool, replication package, and demonstration video are available at https://zenodo.org/records/18356553 and https://youtu.be/Ac-gjBxHJ3Y.
We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.
Tool-using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline, deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid-call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6,800 tasks and a reproducible evaluation pipeline. Baseline results show fault-specific recovery gains for schema-aware methods, while overall success remains limited under the released authorization and hard-failure settings.
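A minimal harness in the spirit of the budgeted evaluation (the budget semantics here, each call consuming the call budget and each failure consuming the retry budget, are our guess at one reasonable accounting; the benchmark's actual rules may differ):

```python
def run_with_budgets(call_fn, max_calls=5, max_retries=2):
    """Drive a tool call under explicit call and retry budgets.
    call_fn(n) returns True if the n-th invocation succeeds."""
    calls, retries = 0, 0
    while calls < max_calls:
        calls += 1
        if call_fn(calls):
            return {"success": True, "calls": calls, "retries": retries}
        retries += 1
        if retries > max_retries:
            break  # retry budget exhausted
    return {"success": False, "calls": calls, "retries": retries}

# a flaky tool that succeeds on its third invocation
result = run_with_budgets(lambda n: n == 3)
print(result)
```

Reporting calls and retries alongside success is what makes "budgeted efficiency" measurable: two agents with the same success rate can differ sharply in how much budget they burn recovering from faults.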
Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as diagnostic stressors to uncover and repair weaknesses in benchmark regression suites. Variants of the ground-truth patch that still pass the existing tests reveal under-constrained behaviors; these gaps then guide the generation of focused regression tests. A generated test is retained only if it (i) passes on the ground-truth patch, (ii) fails on at least one variant that survived the original suite, and (iii) remains valid under behavior-preserving transformations designed to guard against overfitting. Applied to SWE-bench Verified, STING finds that 77% of instances contain at least one surviving variant. STING produces 1,014 validated tests spanning 211 instances and increases patch-region line and branch coverage by 10.8% and 9.5%, respectively. Re-assessing the top-10 repair agents with the strengthened suites lowers their resolved rates by 4.2%-9.0%, revealing that a substantial share of previously passing patches exploit weaknesses in the benchmark tests rather than faithfully implementing the intended fix. These results underscore that reliable benchmark evaluation depends not only on patch generation, but equally on test adequacy.
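The three retention criteria can be sketched as a filter over executable programs (the toy programs below are ours; STING operates on real SWE-bench patches and their regression suites):

```python
def retain_test(test, ground_truth, surviving_variants, transforms):
    """Keep a generated test only if it (i) passes on the ground-truth
    patch, (ii) fails on at least one variant that survived the original
    suite, and (iii) still passes under behavior-preserving
    transformations of the ground truth (guarding against overfitting)."""
    if not test(ground_truth):                                # (i)
        return False
    if not any(not test(v) for v in surviving_variants):      # (ii)
        return False
    return all(test(t(ground_truth)) for t in transforms)     # (iii)

ground_truth = abs                   # intended fix: absolute value
variant = lambda x: x                # survives suites that never probe negatives
transforms = [lambda f: (lambda x: f(x) + 0)]  # behavior-preserving rewrite

candidate = lambda prog: prog(-2) == 2  # a test that probes the gap
print(retain_test(candidate, ground_truth, [variant], transforms))
```

A test that only exercises behavior the variant already gets right fails criterion (ii) and is discarded, since it adds no discriminating power over the existing suite.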
The increasing scale and complexity of mobile applications make automated GUI exploration essential for software quality assurance. However, existing methods often neglect state dependencies between test fragments, which leads to redundant exploration and prevents access to deep application states. We introduce EpiDroid, a black-box, pluggable framework that augments existing explorers through semantic state dependency awareness. EpiDroid distills raw traces into stable test fragments to extract underlying dependencies. It then employs a Recomposition-Replay paradigm to perform impact reasoning via LLM and deterministic replay on high-value mutable state elements. Through iterative feedback, EpiDroid refines the state-dependency graph to systematically reach deep application states. We integrated EpiDroid into both industrial and state-of-the-art research tools and evaluated it on 20 real-world apps. The results show that EpiDroid consistently improves the performance of all baselines, increasing average code coverage by 10-28% and delivering 3-4× more coverage gain compared to continuing the baselines alone from the same starting point. This demonstrates that dependency-guided recomposition unlocks deep states that forward exploration cannot access, irrespective of additional budget.
Benchmarks that reflect production workloads are better suited to evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style, and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address the challenges of constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change, and fail-to-pass tests, spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%, revealing that models making greater use of work-validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification is central to effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned so that other organizations can construct similar production-derived benchmarks.
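A multi-run stability check of the kind the abstract describes can be sketched as follows. This is an assumed shape, not the paper's implementation: a fail-to-pass sample is kept only if its tests fail deterministically before the code change and pass deterministically after it, across several runs.

```python
def is_stable_sample(run_tests, runs=5):
    """Return True if the sample's tests behave deterministically.

    run_tests(stage) -> True if the test suite passes, with
    stage in {"before", "after"} selecting the pre- or post-change code.
    """
    # the tests must fail on every run against the pre-change code...
    fails_before = all(run_tests("before") is False for _ in range(runs))
    # ...and pass on every run against the committed change
    passes_after = all(run_tests("after") is True for _ in range(runs))
    return fails_before and passes_after
```

Flaky tests, which pass or fail nondeterministically, are filtered out by either check, which is what makes the resulting fail-to-pass signal reliable.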
Promptly porting patches from a source codebase to its variants (e.g., forks and branches) is essential for mitigating propagated defects and vulnerabilities. Recent studies have explored automated patch porting to reduce manual effort and delay, but existing approaches mainly handle inconsistencies visible in a patch's local context and struggle with those requiring global mapping knowledge between codebases. We refer to such non-local inconsistencies as implicit inconsistencies; their non-local nature makes them especially challenging for developers to resolve. To address them, we propose MIP, which enables collaboration among an LLM, a compiler, and code analysis utilities. MIP adopts different strategies for different cases: when a patch's source identifiers exist in the target codebase, it leverages compiler diagnostics; otherwise, it retrieves matched code segment pairs from the two codebases as mapping knowledge for mitigation. Experiments on two representative scenarios, cross-fork and cross-branch patch porting, show that MIP successfully resolves more than twice as many patches as the best-performing baseline in both settings. A user study with our industry partner further demonstrates its practical effectiveness.
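The case split can be sketched as a per-identifier dispatch. This assumes an interface the abstract does not spell out (per-identifier granularity, symbol sets extracted beforehand); it only illustrates the decision logic.

```python
def mitigation_plan(patch_identifiers, target_symbols):
    """Map each identifier referenced by the patch to a mitigation strategy.

    patch_identifiers: identifiers the source patch refers to.
    target_symbols: set of symbols defined in the target codebase.
    """
    plan = {}
    for ident in patch_identifiers:
        if ident in target_symbols:
            # identifier resolves in the target: compile the ported patch and
            # let compiler diagnostics localize the implicit inconsistencies
            plan[ident] = "compiler-diagnostics"
        else:
            # identifier is absent: retrieve matched code-segment pairs from
            # the two codebases as mapping knowledge for the LLM
            plan[ident] = "segment-retrieval"
    return plan
```

The point of the split is that a compiler can only produce useful diagnostics for names it can resolve; everything else needs cross-codebase mapping knowledge supplied another way.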
REST APIs are widely used in industry, across many different domains. An example is Volkswagen AG, a German automobile manufacturer. Established testing approaches for REST APIs are time-consuming and require expertise from professional test engineers. Given this cost and importance, several approaches have been proposed in the scientific literature to automatically test REST APIs. The open-source, search-based fuzzer EvoMaster is one such tool from the academic literature. However, how academic prototypes can be integrated into industry and have real impact on software engineering practice requires more investigation. In this paper, we report on our experience using EvoMaster at Volkswagen AG, as EvoMaster users from 2023 to 2026. We share the lessons we learned and discuss several features that needed to be implemented in EvoMaster to make its use in an industrial context successful. Volkswagen AG provided feedback on the value of EvoMaster in industrial setups for 4 APIs. Additionally, a user study was conducted involving 11 testing specialists from 4 different companies. We further identify several real-world research challenges that still need to be solved.
With the rapid evolution of LLMs, automated software testing is witnessing a paradigm shift. While proprietary models like GPT-4o demonstrate impressive capabilities, their high deployment costs and data privacy concerns make open-source LLMs the practical choice for many academic and industrial scenarios. Automated test generation has accordingly evolved toward iterative, LLM-based workflows for constructing test suites. When utilizing open-source LLMs, we empirically observe that they lack a suite-level perspective, suffering from structural myopia: they fail to generate new tests with large marginal gain given what the current suite already covers. In this paper, taking a sequential perspective, we formalize test suite generation as an MDP and show that its objective exhibits monotone submodularity, which enables an effective relaxation of this NP-hard global optimization into a tractable step-wise greedy procedure. Guided by this insight, we propose TestDecision, which turns LLMs into neural greedy experts. TestDecision consists of two synergistic components: (1) an inference framework that constructs test suites following a step-wise greedy strategy; and (2) a reinforcement learning training pipeline that equips the base LLM with the sequential test generation ability needed to maximize marginal gain. Comprehensive evaluations on the ULT benchmark demonstrate that TestDecision significantly outperforms existing advanced methods. It improves branch coverage by 38.15-52.37% and execution pass rate by 298.22-558.88% across all base models, with a 7B backbone achieving performance comparable to the much larger proprietary GPT-5.2. Furthermore, TestDecision finds 58.43-95.45% more bugs than vanilla base LLMs and exhibits superior generalization on LiveCodeBench, proving its capability to construct high-quality test suites.
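The step-wise greedy relaxation that monotone submodularity licenses can be illustrated with classic coverage maximization: at each step, pick the candidate test that covers the most not-yet-covered branches. This is a deliberate simplification; in the real system an LLM proposes each next test rather than selecting from a fixed pool, and the candidate sets below are assumed inputs.

```python
def greedy_suite(candidates, budget):
    """Greedily build a test suite maximizing marginal branch coverage.

    candidates: {test_name: set of branch ids that test covers}
    budget: maximum number of tests to select.
    Returns (selected test names in order, total covered branches).
    """
    covered, suite = set(), []
    for _ in range(budget):
        # marginal gain of each remaining candidate w.r.t. current coverage
        name, gain = max(
            ((n, len(c - covered)) for n, c in candidates.items() if n not in suite),
            key=lambda pair: pair[1],
            default=(None, 0),
        )
        if name is None or gain == 0:
            break  # no remaining candidate adds coverage; stop early
        suite.append(name)
        covered |= candidates[name]
    return suite, covered
```

Because coverage is monotone and submodular, this greedy procedure carries the standard (1 - 1/e) approximation guarantee relative to the optimal size-limited suite, which is what makes the step-wise relaxation of the NP-hard global problem principled rather than merely convenient.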
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification, which divides into theorem proving and model checking, provides a feasible and reliable path. Unlike theorem proving, which has seen notable advances, model checking has received less attention due to the difficulty of automatically modeling programs. To fill this gap, we introduce Model-Bench, a benchmark and accompanying pipeline for evaluating and improving LLMs' program modeling capability: translating Python programs into verification-ready model checking specifications that are checkable by the benchmark's accompanying model checker. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling and suggest promising directions for future work.
Modern software systems rely heavily on Web APIs, yet creating meaningful and executable test scripts remains a largely manual, time-consuming, and error-prone task. In this paper, we present APITestGenie, a novel tool that leverages Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and prompt engineering to automatically generate API integration tests directly from business requirements and OpenAPI specifications. We evaluated APITestGenie on 10 real-world APIs, including 8 APIs comprising circa 1,000 live endpoints from an industrial partner in the automotive domain. The tool was able to generate syntactically and semantically valid test scripts for 89\% of the business requirements under test after at most three attempts. Notably, some generated tests revealed previously unknown defects in the APIs, including integration issues between endpoints. Statistical analysis identified API complexity and level of detail in business requirements as primary factors influencing success rates, with the level of detail in API documentation also affecting outcomes. Feedback from industry practitioners confirmed strong interest in adoption, substantially reducing the manual effort in writing acceptance tests, and improving the alignment between tests and business requirements.
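The "at most three attempts" behavior the abstract reports can be sketched as a generate-validate-retry loop that feeds validation errors back into the next prompt. The `generate` and `validate` callables below are hypothetical stand-ins for the tool's LLM call and its syntactic/semantic script checks.

```python
def generate_with_retries(generate, validate, requirement, spec, max_attempts=3):
    """Try to produce a valid test script in at most `max_attempts` rounds.

    generate(requirement, spec, feedback) -> candidate test script.
    validate(script) -> (ok, feedback); feedback describes the failure, if any.
    Returns (script, attempts_used) on success, or (None, max_attempts).
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        # feedback from the previous failed validation is folded into the prompt
        script = generate(requirement, spec, feedback)
        ok, feedback = validate(script)
        if ok:
            return script, attempt
    return None, max_attempts
```

Feeding the validator's error back into the prompt is what distinguishes this loop from blind resampling: each retry is conditioned on why the previous script was rejected.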