TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

  • Steven Liu,
  • Jane Luo,
  • Xin Zhang,
  • Aofan Liu,
  • Hao Liu,
  • J. Wu,
  • Ziyang Huang,
  • Yangyu Huang

arXiv

Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction, so they rarely surface defects before failures. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation as the oracle. Furthermore, to keep evaluation sustainable and reduce leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.
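As a concrete illustration of the metrics above: a generated test counts as Fail-to-Pass (F2P) if it fails on the buggy pre-fix version and passes on the repaired version, and F2P@k credits a task if any of k sampled tests achieves this. A minimal sketch of this aggregation (the paper's exact evaluation harness may differ):

```python
def f2p_rate(results):
    """results: one (fails_on_buggy, passes_on_fixed) pair per task, for a
    single generated test each. A test is Fail-to-Pass (F2P) when it fails
    on the buggy version AND passes on the repaired version."""
    hits = sum(1 for fails_buggy, passes_fixed in results
               if fails_buggy and passes_fixed)
    return hits / len(results) if results else 0.0

def f2p_at_k(per_task_samples):
    """per_task_samples: for each task, a list of booleans (one per sampled
    test), True if that sample is F2P. A task counts as solved under F2P@k
    when any of its k samples is F2P."""
    solved = sum(1 for samples in per_task_samples if any(samples))
    return solved / len(per_task_samples) if per_task_samples else 0.0
```

Note that F2P@k is always at least the single-sample F2P rate, which matches the reported gap between F2P (17.27%) and F2P@5 (29.7%).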

Related Tools

TestExplora

This repository is the official implementation of the paper "TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation". It can be used for baseline evaluation with the prompts described in the paper.

Table of Contents

  • What is TestExplora
  • Setup
  • How to Deploy TestExplora
  • Test Generation (Inference)
  • Supported Models
  • Build Benchmark
  • Contributing
  • Trademarks

What is TestExplora

TestExplora is a systematic, repository-level benchmark designed to evaluate the capability of Large Language Models to proactively discover latent software defects by generating tests. The dataset is constructed from real-world GitHub pull requests, containing 2,389 test-generation tasks sourced from 1,552 PRs across 482 repositories. Each task is designed such that the model must write test cases capable of triggering a Fail-to-Pass transition between buggy and repaired versions, reflecting true defect detection rather than passive confirmation. The benchmark further includes automatically generated documentation for test entry points to enable scalable evaluation.

Setup

Prerequisites

  • Python 3.10+
  • Docker (for local test evaluation)
  • Git

Installation

  git clone https://github.com/microsoft/TestExplora.git
  cd TestExplora

Install core dependencies:

  pip install -r requirements.txt

How to Deploy TestExplora

Test Generation (Inference)

The main entry point is testexplora/harness/inference.py. Given the benchmark dataset (JSON format), it drives the target LLM to generate test cases for each task and saves the results as test patches.

  python testexplora/harness/inference.py \
      --data_path <path_to_data.json> \
      --repo_testbed_dir <output_directory> \
      --model <model_name> \
      --test_type <whitebox|graybox|blackbox>

Output

  • test_patches.json — Generated test patches per repository and PR.
  • config.yaml — Experiment configuration for reproducibility.
  • generation.log — Detailed execution log.
  • trajectory/ — Agent trajectory files (for agent-based models).

Supported Models

The benchmark supports evaluation across a broad set of LLMs and coding agents. To reproduce or customize results for a specific model, modify the corresponding call file under testexplora/harness/call_pipeline/.

API-based Models (Direct LLM Call)

  • gpt-4o, o3-mini, o4-mini, gpt-5-mini, gpt-5, r1 — call_gpt.py
  • claude_sonnet — call_gpt.py (Anthropic via Azure)
  • gemini-2.5-pro, gemini-2.5-flash — call_gemini.py
  • Codellama-34B, Qwen3-Coder-30B — call_vllm.py

Agent-based Models (Agentic Code Exploration)

  • sweagent-* — call_sweagent.py
  • traeagent-* — call_traeagent.py

Note: Agent-based models only support the whitebox test type.

Build Benchmark

To construct a benchmark dataset similar to TestExplora from your own set of GitHub repositories, use testexplora/build_benchmark/process_data.py. It automates the end-to-end pipeline:

  1. Clone repositories and iterate over closed pull requests.
  2. Check out the base commit (pre-PR state) and extract code structure and dependency graphs.
  3. Apply the PR patch, then re-extract code structure to obtain the post-PR state.
  4. Identify changed functions/methods by mapping diff line ranges to AST-level code elements.

  python testexplora/build_benchmark/process_data.py

Before running, update the paths at the bottom of process_data.py to point to your repository-data JSON directory and a local directory for cloning repos. The script relies on two helper modules in the same directory:

  • parse_repo.py — AST-based extraction of classes, functions, methods, and their metadata from a Python repository.
  • build_dependency_graph.py — Builds inter-function dependency graphs using NetworkX, including cross-file import resolution.

Contributing

This project welcomes contributions and suggestions.
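Step 4 of the build pipeline above (mapping diff line ranges to AST-level code elements) can be sketched with Python's standard ast module. The changed_functions helper below is hypothetical, not the actual API of process_data.py:

```python
import ast

def changed_functions(source: str, changed_lines: set[int]) -> list[str]:
    """Map changed line numbers (e.g., taken from a unified diff) to the
    function/method definitions whose line ranges they overlap.
    Hypothetical illustration of the diff-to-AST mapping step."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno give the definition's 1-based line span
            if any(node.lineno <= ln <= node.end_lineno
                   for ln in changed_lines):
                hits.append(node.name)
    return hits
```

In the real pipeline this mapping would be combined with the dependency graphs from build_dependency_graph.py so that a changed element can be traced to its callers across files.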
Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.