A Note About: Critical Review of BugSwarm for Fault Localization and Program Repair
David A. Tomassi and Cindy Rubio-González
University of California, Davis, USA
{datomassi, crubio}@ucdavis.edu

Abstract.
Datasets play an important role in the advancement of software tools and facilitate their evaluation. BugSwarm [12] is an infrastructure to automatically create a large dataset of real-world reproducible failures and fixes. In this paper, we respond to Durieux and Abreu [7]'s critical review of the BugSwarm dataset, referred to in this paper as CriticalReview. We replicate CriticalReview's study and find several incorrect claims and assumptions about the BugSwarm dataset. We discuss these incorrect claims and other contributions listed by CriticalReview. Finally, we discuss general misconceptions about BugSwarm, and our vision for the use of the infrastructure and dataset.
1 Introduction

Datasets are imperative to the development and progression of software tools, not only to facilitate a fair and unbiased evaluation of their effectiveness, but also to inspire and enable the community to advance the state of the art. There have been various influential datasets developed in the Software Engineering community (e.g., [4, 5, 6, 8, 9, 10, 11]). Unfortunately, these datasets have required a substantial amount of manual effort to be created, which makes it difficult to grow them.

Recently we developed
BugSwarm [12], an infrastructure that leverages continuous integration (CI) to automatically create a dataset of reproducible failures and fixes. BugSwarm comprises an infrastructure, dataset, REST API, and website. The initial dataset (version 1.0.0, reported in [12]) consists of 3,091 pairs of failures and fixes (referred to as artifacts) mined from Java and Python projects. Because artifacts mined from open-source software are bound to have different characteristics (number of failing tests, failure reason, fix location(s), patch size, etc.), we provide a REST API and website for users to navigate and select the artifacts that fit the needs of their tools. BugSwarm is under active development, currently allowing the mining of failures and fixes that satisfy specific characteristics.

Parallel to the development of BugSwarm, Durieux and Abreu [7] conducted a review of the BugSwarm dataset (version 1.0.1) with respect to Automated Program Repair (APR) and Fault Localization (FL). The authors stated characteristics they consider necessary for artifacts to be used in studies that evaluate the state of the art in APR and FL. Additionally, the authors presented a high-level classification of failures, and discussed the cost of using
BugSwarm artifacts. In the rest of this paper we refer to [7] as CriticalReview.

One of the purposes of datasets is to facilitate the evaluation of software tools. Instead, CriticalReview uses the general requirements/current limitations of the state-of-the-art APR tools to evaluate the BugSwarm dataset. While it is important that datasets possess key characteristics (e.g., failures that are relevant to the tools under evaluation), the existence of artifacts that do not have desired characteristics does not hinder studies if users can navigate and select artifacts relevant to their studies. Limiting a dataset to only include problems that certain tools can handle would be of no benefit to our community. Furthermore, the goal of the BugSwarm dataset is to identify the kinds of problems found in real software and the environment in which these problems occur, and thus inspire the community to advance the state of the art.

In addition to general misconceptions on datasets,
CriticalReview discredits the use of the BugSwarm dataset based on multiple incorrect observations. Specifically, CriticalReview makes a false allegation against the BugSwarm paper [12]'s reported data, and presents wrong results and conclusions led by misunderstandings of Travis-CI terminology and Docker's architecture.

This paper discusses each of CriticalReview's incorrect claims, which had already been communicated to the authors of CriticalReview upon their request for feedback prior to the archival of their study. We also discuss the two other contributions of CriticalReview: a GitHub repository to store the code and build logs of the BugSwarm artifacts, and CriticalReview's own website to browse BugSwarm artifacts, both of which duplicate information already available in
BugSwarm.

The rest of this paper presents a brief overview of BugSwarm in Section 2, and describes the methodology used by CriticalReview in Section 3. We discuss the incorrect findings reported by CriticalReview in Section 4, and the rest of the contributions of CriticalReview in Section 5. Finally, we clarify some misconceptions about BugSwarm, and re-affirm its goals and intended use in Section 6.
2 Overview of BugSwarm

BugSwarm is comprised of three main components: (1) an infrastructure to automatically mine and reproduce failures and fixes from open-source projects that use continuous integration (Travis-CI), currently in the process of being open-sourced (https://github.com/BugSwarm), (2) a continuously growing dataset of real-world failures and fixes packaged in publicly available Docker images (https://hub.docker.com/r/bugswarm/images/tags) to facilitate reproducibility, and (3) a website and a REST API (https://github.com/BugSwarm/common) for dataset users to navigate and select artifacts based on a number of characteristics.
Fig. 1. Workflow for the BugSwarm toolkit.

BugSwarm's methodology to create a continuously growing dataset of real-world failures and fixes is shown in Fig. 1. We briefly describe each component below. For more details please refer to the BugSwarm paper [12].
PairMiner. PairMiner represents the first stage of the process. The role of PairMiner is to mine fail-pass job pairs from the Travis-CI build history of open-source projects hosted on GitHub. A project's build history refers to all Travis-CI builds previously triggered. A build may include many jobs; for example, a build for a Python project might include separate jobs to test with Python versions 2.6, 2.7, 3.0, etc. The input to PairMiner is the repository slug (e.g., google/auto) of the project of interest. PairMiner analyzes the project's build history to identify fail-pass build pairs, where a build fails and the next consecutive build passes. From these fail-pass build pairs, PairMiner extracts fail-pass job pairs. The output of PairMiner is the set of fail-pass job pairs found for the given project.
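The following is a minimal sketch of this mining logic, under stated assumptions: the function name and the dictionary fields ('state', 'jobs', 'config') are illustrative and do not mirror BugSwarm's actual implementation.

```python
import json

def mine_fail_pass_job_pairs(builds):
    """builds: a project's Travis-CI builds in chronological order, each a
    dict with a 'state' and a list of 'jobs' (each with 'state', 'config')."""
    pairs = []
    for failed, passed in zip(builds, builds[1:]):
        # A fail-pass *build* pair: a failing build immediately followed
        # by a passing build.
        if failed["state"] != "failed" or passed["state"] != "passed":
            continue
        # Match jobs across the two builds by configuration to extract
        # fail-pass *job* pairs (a build may run many jobs, e.g. one per
        # Python version).
        by_config = {json.dumps(j["config"], sort_keys=True): j
                     for j in passed["jobs"]}
        for job in failed["jobs"]:
            key = json.dumps(job["config"], sort_keys=True)
            if job["state"] == "failed" and key in by_config:
                pairs.append((job, by_config[key]))
    return pairs
```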
PairFilter. PairFilter takes as input the Travis-CI fail-pass job pairs from PairMiner and ensures that essential data is available to allow for reproduction: (1) the state of the project at the time the job was executed, and (2) the environment in which the job was executed. If these essentials are not available, then PairFilter discards the fail-pass job pair. PairFilter determines the Docker image that was the exact build environment for the fail-pass job pair and the specific commits that triggered each job. The output of PairFilter is the subset of fail-pass job pairs for which (1) and (2) are available.
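A hedged sketch of these availability checks follows; the 'commit' and 'base_image' fields are hypothetical names for illustration, not BugSwarm's actual schema.

```python
import subprocess

def commit_exists(repo_dir, sha):
    # `git cat-file -e` exits 0 iff the object exists in the repository,
    # i.e., the project state at that commit is still recoverable.
    return subprocess.run(["git", "-C", repo_dir, "cat-file", "-e", sha],
                          capture_output=True).returncode == 0

def image_available(image):
    # `docker manifest inspect` exits 0 iff the image exists in the registry,
    # i.e., the original build environment is still recoverable.
    return subprocess.run(["docker", "manifest", "inspect", image],
                          capture_output=True).returncode == 0

def keep_pair(repo_dir, failed_job, passed_job):
    """Keep a fail-pass job pair only if (1) the commits that triggered both
    jobs are still reachable and (2) the original base image is available."""
    return (commit_exists(repo_dir, failed_job["commit"])
            and commit_exists(repo_dir, passed_job["commit"])
            and image_available(failed_job["base_image"]))
```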
Reproducer. The goal of Reproducer is to reproduce each job in the fail-pass job pair in the same build environment in which it was originally run. The input to Reproducer is a fail-pass job pair, the commits for each version, and the Docker image for the build environment. Reproducer conducts the following: (1) generates a job script, i.e., a shell script to build the project and run regression tests, (2) matches the build environment in which the job originally ran via the Docker image from PairFilter, (3) reverts the project to the specific version, and (4) runs the code for the job in the Docker image via the job script. Reproducer can be run in parallel via multiple processes for each job pair, as shown in Fig. 1. The output of Reproducer is a build log, which is a transcript of everything that occurs at the command line during the build and testing process.
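A minimal sketch of steps (3) and (4), assuming the job script from step (1) has already been generated and the image from step (2) has been pulled; the function signature is illustrative.

```python
import subprocess

def reproduce_job(image, repo_dir, commit, job_script):
    """Revert the checkout to the job's trigger commit, then execute the
    job script inside the original base image. Returns the build log."""
    # Step (3): revert the project to the version that triggered the job.
    subprocess.run(["git", "-C", repo_dir, "checkout", "-f", commit],
                   check=True)
    # Step (4): run the job script (build + regression tests) inside the
    # Docker image; the combined output is the build log.
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{repo_dir}:/project", image,
         "bash", f"/project/{job_script}"],
        capture_output=True, text=True)
    return result.stdout + result.stderr
```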
Analyzer. The Analyzer parses the original (historical) and reproduced build logs, extracts key attributes, and compares the extracted attributes to ensure they match. The key attributes parsed are the status of the build (passed, failed, or errored) and the result of the test suite (number of tests run, number of tests failed, and names of failed tests). If the results match between the original and reproduced build logs, then metadata about the pair is added to the BugSwarm database.
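A simplified sketch of this extraction and comparison, assuming Maven Surefire logs; real build logs vary by build system, so the Analyzer needs one parser per log format, and this sketch omits the build status and failed-test names.

```python
import re

# Matches Maven Surefire test-summary lines, e.g.
# "Tests run: 12, Failures: 1, Errors: 0, Skipped: 2"
SUREFIRE = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)")

def extract_attributes(build_log):
    runs = failed = 0
    for m in SUREFIRE.finditer(build_log):
        runs += int(m.group(1))
        failed += int(m.group(2)) + int(m.group(3))
    # Simplified: the real Analyzer also extracts the build status
    # (passed/failed/errored) and the names of the failed tests.
    return {"tests_run": runs, "tests_failed": failed}

def results_match(original_log, reproduced_log):
    # A pair is kept only if the reproduced run exhibits the same test
    # results as the historical Travis-CI run.
    return extract_attributes(original_log) == extract_attributes(reproduced_log)
```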
Artifact Creation. The Reproducer and Analyzer are run five times. If a fail-pass job pair is reproducible all five times, we mark it as "reproducible". If the number of times the pair was reproducible is less than five but more than zero, it is marked as "flaky". A pair can be flaky for a variety of reasons, but primarily because of test flakiness, which can be caused by non-deterministic tests due to concurrency or environmental changes. Lastly, if a pair is reproducible zero times, it is marked as "unreproducible". A reproducible or flaky job pair is referred to as a BugSwarm artifact.

For each BugSwarm artifact, a Docker image is created that contains both versions of the code and the job scripts to build and test each version. This Docker image is then stored in our DockerHub repository (https://hub.docker.com/r/bugswarm/images). We chose to package each BugSwarm artifact in a Docker image because Docker facilitates reproducibility. Docker is also a good choice because it is lightweight and uses layering: Docker images are composed of multiple layers that can be shared across multiple images to save space, and Docker does not re-download or store a layer that is already on a system [1].
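The classification rule above is simple enough to state directly in code; this labeling matters later, in Section 4, where omitting flaky artifacts leads CriticalReview to an incorrect artifact count.

```python
def classify_pair(reproduce_successes, reproduce_attempts=5):
    """Label a fail-pass job pair after five reproduction runs."""
    if reproduce_successes == reproduce_attempts:
        return "reproducible"
    elif reproduce_successes > 0:
        return "flaky"          # e.g., non-deterministic or env-dependent tests
    else:
        return "unreproducible"

# Only reproducible and flaky pairs become BugSwarm artifacts.
print(classify_pair(5))  # reproducible
print(classify_pair(3))  # flaky
print(classify_pair(0))  # unreproducible
```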
The BugSwarm dataset is the first continuously growing dataset of reproducible real-world failures and fixes. The dataset was automatically created using the BugSwarm infrastructure without controlling for any specific attributes. Currently, the BugSwarm dataset (version 1.1.0) consists of 3,140 artifacts written in Java and Python. The artifacts are diverse, spanning build systems such as Maven, Gradle, and Ant, longevity from 2015 to 2019, and testing frameworks such as JUnit and unittest. We expect steady growth of the dataset in the coming months as the BugSwarm infrastructure is set to run on dedicated servers.
BugSwarm offers many different characteristics to filter by in order to create a subset useful for evaluating a given tool. Examples of such characteristics are: language, size of diff, build system, number of tests run, number of failed tests, patch location (e.g., source code, test code, or build files), exceptions thrown at runtime (e.g., NullPointerException), etc. The BugSwarm website and REST API allow the selection of artifacts based on the above attributes, as sketched below.
3 CriticalReview's Methodology

The goal of CriticalReview's study is to answer the following questions:

RQ1: What are the main characteristics of BugSwarm's pairs of builds regarding the requirements for APR and FL?
RQ2: What is the execution and storage cost of BugSwarm?
RQ3: Which pairs of builds meet the requirements of APR and FL?
Characteristics of BugSwarm's Pairs of Builds. CriticalReview characterizes the BugSwarm dataset with respect to requirements of current APR and FL tools: (1) behavioral bugs, (2) a test suite with passing tests defining correct behavior and failing tests defining incorrect behavior, (3) a known execution setup in terms of paths of source and test files, etc., (4) uniqueness of bugs, and (5) human patch availability. The above requires, for each artifact, the source code for the buggy version and the fixed version, the diff between the two versions, and the Travis-CI build log for the failing job.
CriticalReview queries for fully reproducible Java and Python artifacts (see Section 4.1 for further details) using the BugSwarm REST API. The resulting artifacts are then filtered for unique commits (note that multiple Travis-CI jobs may originate from a single Travis-CI build). The diff of each artifact is calculated by retrieving the buggy and fixed versions of the artifact from its corresponding Docker image, pushing the code into a branch of a new GitHub repository, and then invoking the GitHub API to retrieve the diff between the two code versions. Unique diffs are identified based on MD5 hash values, and artifacts are classified based on whether the extension of the changed files is .java or .py. Lastly, a high-level classification of the reason for failure is conducted by using regular expressions to match certain patterns (test failures, style checkers, compilation errors, etc.) in the Travis-CI build logs.
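A hedged sketch of the deduplication and classification steps just described, as we understand them. The MD5 use mirrors the text; the regular expressions are illustrative placeholders, not CriticalReview's actual patterns.

```python
import hashlib
import re

def unique_diffs(diffs):
    """Deduplicate patches by their MD5 hash, as CriticalReview does."""
    return list({hashlib.md5(d.encode()).hexdigest(): d for d in diffs}.values())

# Illustrative placeholders for the high-level failure classification.
FAILURE_PATTERNS = [
    ("test failure", re.compile(r"Tests run: \d+, Failures: [1-9]")),
    ("compilation error", re.compile(r"COMPILATION ERROR|SyntaxError")),
    ("style checker", re.compile(r"checkstyle|flake8")),
]

def classify_failure(build_log):
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(build_log):
            return label
    return "other"
```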
Execution and Storage of BugSwarm. CriticalReview estimates the size of the BugSwarm dataset for download and storage, as well as its usage cost. The size of the dataset is calculated using two metrics: counting every Docker layer, and counting every unique Docker layer. Note that Docker does not download or store a layer that is already in the system (see [1] and Section 4.2). CriticalReview gives a time estimate for download assuming a stable 80 Mbit/s connection. Finally, the cost of using the full dataset is estimated assuming a 20-minute experiment per artifact using Amazon cloud instances.
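The download-time arithmetic can be reproduced directly; reading the reported GB figures as GiB matches CriticalReview's reported 11 d 1.16 h most closely. The same formula applied to the unique layers (the only ones Docker actually downloads) shrinks the estimate to under three days.

```python
def download_days(size_gib, mbit_per_s=80):
    """Days to download `size_gib` GiB over a stable link of `mbit_per_s`."""
    bits = size_gib * 1024**3 * 8
    return bits / (mbit_per_s * 10**6) / 86400

print(f"all layers:    {download_days(8921):.2f} days")  # ~11.09 days
print(f"unique layers: {download_days(2246):.2f} days")  # ~2.79 days
```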
Pairs for APR and FL. CriticalReview lists what the paper considers the requirements to use state-of-the-art APR and FL tools: (1) artifacts that have been reproduced five times, (2) artifacts whose Docker images are available, (3) a non-empty diff, (4) a unique commit, (5) a unique diff, (6) a test case failure, and (7) only source files changed. CriticalReview then reports the number of BugSwarm artifacts that satisfy those requirements.
4 CriticalReview's Incorrect Claims

After replicating the study presented by CriticalReview and inspecting its scripts, we identified incorrect claims made by CriticalReview related to inconsistencies in the number of artifacts reported in the BugSwarm paper [12], a misleading duplication of commits in the dataset, and calculations of the storage required by the dataset. Below we discuss each incorrect claim, organized per research question as presented in [7].

4.1 RQ1: Characteristics of BugSwarm's Pairs of Builds
Incorrect Number of Artifacts. CriticalReview reports the number of "builds" reproduced five times given a BugSwarm API request listed in [7, Section III-B]:

  {"reproduce_successes": {"$gt": 4}, "lang": {"$in": ["Java", "Python"]}}

The API request returns 2,949 artifacts, while the BugSwarm paper [12] gives 3,091 artifacts. Thus, CriticalReview alleges a contradiction by the BugSwarm authors, who, according to CriticalReview, had stated that each "build" in the dataset was successfully reproduced five times.
CriticalReview states in [7, Section III-C]:

  Indeed, we considered all pairs of builds that are reproduced successfully five times like it is described in BugSwarm's paper (see Section 4-B). Surprisingly, BugSwarm authors did not consider their criteria in their final selection of the pairs of builds and consequently the reported number is in contradiction with the paper.

The original BugSwarm paper states in [12, Section IV-B]:

  We repeated the reproduction process 5 times for each pair to determine its stability. If the pair is reproducible all 5 times, then it is marked as 'reproducible'. If the pair is reproduced only sometimes, then it is marked as 'flaky'. Otherwise, the pair is said to be 'unreproducible'.

First, as discussed in the BugSwarm paper [12, Section III-C] and in Section 2 of this paper, BugSwarm is comprised of artifacts (Travis-CI job pairs); thus a request to the BugSwarm API returns the number of artifacts, not the number of builds.

Second, the BugSwarm API request used by CriticalReview returns the number of artifacts successfully reproduced five times. In other words, the query returns the number of fully reproducible artifacts. However, the BugSwarm dataset [12, Table III] includes both fully reproducible and flaky artifacts, which together account for a total of 3,091 artifacts. The correct BugSwarm REST API request needs to filter based on a number of reproduce successes greater than zero and a number of reproduce attempts equal to five:

  {"reproduce_successes": {"$gt": 0}, "reproduce_attempts": 5, "lang": {"$in": ["Java", "Python"]}}

All 3,091 artifacts included in the dataset were attempted five times. At the time CriticalReview was written (BugSwarm dataset 1.0.1 from May 2019), the number of fully reproducible artifacts was indeed 2,949 and the number of flaky artifacts was 142. There is no contradiction in the selection criteria described in the BugSwarm paper: both reproducible and flaky artifacts are included in the dataset.
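The discrepancy can be checked by running both filters side by side. The filter documents are exactly those quoted above; the endpoint URL and response shape are assumptions, as in the earlier API sketch.

```python
import json
import requests

API = "http://www.api.bugswarm.org/v1/artifacts"  # assumed endpoint

queries = {
    "fully reproducible (CriticalReview)": {
        "reproduce_successes": {"$gt": 4},
        "lang": {"$in": ["Java", "Python"]}},
    "reproducible + flaky (correct)": {
        "reproduce_successes": {"$gt": 0},
        "reproduce_attempts": 5,
        "lang": {"$in": ["Java", "Python"]}},
}

for label, q in queries.items():
    resp = requests.get(API, params={"where": json.dumps(q)})
    print(label, len(resp.json()))  # 2,949 vs. 3,091 on dataset v1.0.1
```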
Duplicate Failing Commits. CriticalReview reports a "new" finding regarding a high number of duplicate failing commits in the BugSwarm dataset that would introduce misleading results. CriticalReview states in [7, Section II-C]:

  Our second observation is that 40.08% ((2,949-1,767)/2,949) of the builds have a duplicate failing commit. It means that those 40.08% should not be considered by the approaches that only consider the source code of the application otherwise it introduces misleading results.
The BugSwarm paper states in [12, Section IV-B]:

  Recall from Section III-C that PairMiner mines job pairs. The corresponding number of reproducible unique build pairs is 1,837. The rest of the paper describes the results in terms of number of job pairs.

As stated in the BugSwarm paper [12, Section III-C] and in Section 2 of this paper, a BugSwarm artifact corresponds to a pair of jobs, not a pair of builds (as incorrectly interpreted throughout CriticalReview). A Travis-CI build can be composed of multiple jobs that test the same commit under different configurations. Early feedback from researchers in our community indicated that such artifacts can also be of interest. (The difference between the 1,767 unique commits reported by CriticalReview and the 1,837 unique build pairs reported in [12] is again due to CriticalReview omitting flaky artifacts.)

As also described in the BugSwarm paper [12, Section III-B], a given experiment may require artifacts that meet specific criteria. If such criteria require uniqueness of job pairs, as CriticalReview reports is the case for APR tools, then we provide a REST API and website that allow users to consider uniqueness when selecting artifacts of interest. Thus, having the dataset include multiple jobs from a build does not represent a problem that would introduce misleading results.

4.2 RQ2: Execution and Storage Cost

CriticalReview calculates the size of the BugSwarm dataset and provides estimated download time and cost for using the full dataset on Amazon cloud instances [7, Section 3-D]. The paper reports that the full dataset is 8,921 GB, which takes about 11 days and 1.16 hours to download using an 80 Mbit/s internet connection. Subsequently, the cost of using the BugSwarm dataset, assuming a 20-minute experiment per artifact, is $711.30 USD.

Table 1. Metrics of BugSwarm downloading and storage cost, from [7]. Sizes in gigabytes (GB).

  Metric                               Java       Python     All
  Docker layer size                    5,107      3,813      8,921
  Unique Docker layer size                                   2,246
  Avg. size per artifact               3.01       3.05       3.03
  Download all layers (80 Mbit/s)      6d 7.8h    4d 17.13h  11d 1.16h
  Download unique layers (80 Mbit/s)
Download Size Calculation. CriticalReview calculates the size of the BugSwarm dataset using two metrics: counting every Docker layer, and counting every unique Docker layer. The size of the dataset is reported (see Table 1, from CriticalReview) as 8,921 GB and 2,246 GB, respectively. However, counting every Docker layer is incorrect: Docker does not re-download or store a layer that is already on a system [1]. The average size (row 3) and download time (row 4) given in Table 1 are calculated based on all Docker layers (row 1); thus these table entries are also incorrect.
Compression Ratio. CriticalReview estimates a compression ratio that is then used to incorrectly calculate disk space. A compression ratio is unnecessary in the first place; disk space is determined by the size of unique Docker layers, already given in row 2 of Table 1. CriticalReview states in [7, Section III-D]:

  According to our observations, the ratio between download size and disk storage is 2.48x and drops to 0.41x when considering the duplicate layers. [...] Based on this observation, we estimate the total disk space required to 3,680.45 GB.

CriticalReview fails to mention that the above observations are based on 464 artifacts [2], not the full dataset. The script [3] used to calculate disk space lists 598.98 GB of storage used by the 464 artifacts. When we downloaded the same 464 artifacts, the disk space reported by the command docker system df is 353 GB, not 598.98 GB. The compression ratio is then calculated by dividing the space on disk by the size of the 464 artifacts when considering all Docker layers: 598.98 GB / 1,452.02 GB = 0.41. However, when using this compression ratio, the estimated disk space reported for the full dataset is 3,680.45 GB, which is about 64% higher than the actual size given in row 2 of Table 1, which is 2,246 GB.
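The arithmetic behind this rebuttal, stated explicitly: the ratio CriticalReview derived from the 464 artifacts, and how far its extrapolation lands from the unique-layer size that was already available in Table 1.

```python
ratio = 598.98 / 1452.02          # disk usage / all-layer size (464 artifacts)
estimate = 8921 * ratio           # CriticalReview's extrapolation
actual = 2246                     # unique-layer size, Table 1, row 2

print(f"ratio:    {ratio:.2f}")                  # 0.41
print(f"estimate: {estimate:,.0f} GB")           # ~3,680 GB
print(f"error:    {estimate / actual - 1:.0%}")  # ~64% too high
```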
Cost Calculation. Because the cost of using the BugSwarm dataset is based on incorrect estimated download and storage sizes, the cost calculations are also incorrect. Additionally, as mentioned earlier and corroborated by CriticalReview, we expect that BugSwarm users will be interested in subsets of the dataset, as opposed to the full dataset. This must be taken into account when making such cost calculations.
5 Other Contributions of CriticalReview

In addition to answering the questions described in Section 3, CriticalReview also provides a GitHub repository for the BugSwarm artifacts and a website to navigate and select artifacts. This section discusses these contributions as well as a finding regarding duplicate diffs.
GitHub Repository. One of the contributions listed by CriticalReview is a new GitHub repository to store BugSwarm artifacts (https://github.com/TQRG/BugSwarm). Specifically, there is a branch for each artifact that contains the buggy version of the code, the fixed version of the code, the diff between both versions, and the failing and passing Travis-CI build logs. The only artifact information not stored in the repository is the scripts to build the code and run regression tests.

However, the CriticalReview repository is not necessary. The buggy and fixed versions of the code can be directly accessed via the original repositories (the BugSwarm REST API and the website provide the commit information) or by downloading the BugSwarm Docker image for the artifact, which includes a copy of both versions of the code. The Travis-CI build logs can be directly accessed via the Travis-CI website using the information provided by the BugSwarm REST API, or by directly following the BugSwarm website links. Finally, the diff can be directly retrieved using the GitHub API (3-dot diff), or accessed via the BugSwarm website (2-dot diff).
Website to Browse and Select BugSwarm Artifacts. Another contribution listed by CriticalReview is a website to browse and select BugSwarm artifacts (https://tqrg.github.io/BugSwarm). The website displays the number of added/removed/modified lines and files, and allows selecting artifacts based on unique commits, unique diffs, non-empty diffs, failing tests, source code changes, a manual categorization of bug/non-bug patches, and a high-level categorization of failures.

BugSwarm already provides its own website for browsing and selection based on the same attributes listed by CriticalReview (except for their two categorizations, which are complementary to our own). The BugSwarm website also allows selecting artifacts based on the location of the fix (source files, configuration files, or test files). In addition to the website, BugSwarm provides a REST API to query the BugSwarm database directly, so one is not restricted to the options provided on the website. BugSwarm also provides a classification of artifacts based on runtime exceptions.
Duplicate Diffs. CriticalReview reports that, while controlling for unique failed commits, there are duplicate diffs among them: 198 out of 1,767 artifacts have a duplicate diff [7, Section III-C].

Recently, we discovered that Travis-CI can make a "double build" when a build comes from a pull request (PR): Travis-CI creates one build for the PR branch, and another build for the PR branch merged with the base branch (https://docs.travis-ci.com/user/pull-requests/). If no changes have been made to the base branch since the time the PR branch was created, then the diffs of both builds are the same. This explains CriticalReview's observation. Fortunately, we believe it is feasible to automatically detect these cases, and this detection will be incorporated into the BugSwarm infrastructure to avoid such cases in future versions of the dataset.
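One way such detection could look, as a hedged sketch: the field names ('event_type', 'commit', 'head_commit') and the diff accessor are illustrative assumptions, not the planned BugSwarm implementation.

```python
def is_double_build(branch_build, merge_build, diff_of):
    """True if `merge_build` is the Travis-CI merge twin of `branch_build`
    with an identical diff. `diff_of` maps a build to its unified diff."""
    return (merge_build.get("event_type") == "pull_request"
            and merge_build.get("head_commit") == branch_build.get("commit")
            and diff_of(merge_build) == diff_of(branch_build))
```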
6 Misconceptions, Goals, and Intended Use

We would like to conclude this paper by briefly clarifying a few misconceptions about BugSwarm, and by discussing our vision for the BugSwarm infrastructure and dataset.

(1) BugSwarm is more than a dataset. As described in Section 2, BugSwarm is comprised of an infrastructure to automatically create a large-scale dataset of real-world failures and fixes, a continuously growing dataset, and a REST API and website to navigate and select artifacts from the dataset based on characteristics of interest.

(2) The BugSwarm dataset is not static. One of the main contributions of the BugSwarm infrastructure is that its full automation has enabled the creation of a continuously growing dataset. As discussed in the BugSwarm paper [12], the potential for size and diversity opens new opportunities, but it also presents several challenges. Some of these challenges include data versioning (discussed in CriticalReview), and automated bug classification to increase the usefulness of the dataset.

(3) The BugSwarm dataset is not meant for a single target application. Because of the size and diversity of the BugSwarm dataset, it is unrealistic to believe that all artifacts will be relevant to one application. As a result, BugSwarm facilitates navigating and selecting artifacts based on a set of characteristics via the BugSwarm website or REST API. Thus, it is easy to select artifacts for a given application (e.g., APR or FL) beforehand.

(4) BugSwarm artifacts with specific characteristics can be "grown". The initial BugSwarm dataset was created without controlling for any particular attribute, such as diff size, patch location, or reason for failure. However, since the publication of the BugSwarm paper [12], targeted mining is now available, and thus it is possible to grow the dataset in specific directions. We believe that allowing for diverse characteristics does not hinder the evaluation of the state of the art. On the contrary, we hope that the existence of artifacts that the state of the art may not be able to handle today will further push advancement.

BugSwarm is a project under active development, and in the process of open-sourcing its infrastructure. We welcome feedback from the community. The BugSwarm dataset is publicly available on DockerHub. The website is also publicly available, and the REST API is available to anyone who would like to request a token to access the BugSwarm database.
Acknowledgments

We thank Bohan Xiao and Octavio Corona for their help in gathering some of the data discussed in this paper. This work was supported by NSF grant CNS-1629976 and a Microsoft Azure Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Microsoft.

Bibliography

[1] Docker Storage Driver. https://docs.docker.com/storage/storagedriver/ (Accessed 2019)
[2] Images downloaded by CriticalReview. https://github.com/TQRG/BugSwarm/blob/master/docs/downloaded_images.json (Accessed 2019)
[3] Script used by CriticalReview to calculate disk space. https://github.com/TQRG/BugSwarm/blob/master/script/compression_rate.py (Accessed 2019)
[4] Cifuentes, C., Hoermann, C., Keynes, N., Li, L., Long, S., Mealy, E., Mounteney, M., Scholz, B.: BegBunch: Benchmarking for C Bug Detection Tools. In: DEFECTS '09: Proceedings of the 2nd International Workshop on Defects in Large Software Systems, pp. 16–20 (2009), URL http://doi.acm.org/10.1145/1555860.1555866
[5] Dallmeier, V., Zimmermann, T.: Extraction of Bug Localization Benchmarks from History. In: 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), November 5-9, 2007, Atlanta, Georgia, USA, pp. 433–436 (2007), URL http://doi.acm.org/10.1145/1321631.1321702
[6] Do, H., Elbaum, S., Rothermel, G.: Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Softw. Engg. 10(4), 405–435 (Oct 2005), ISSN 1382-3256
[7] Durieux, T., Abreu, R.: Critical Review of BugSwarm for Fault Localization and Program Repair. CoRR abs/1905.09375 (2019)
[8] Hutchins, M., Foster, H., Goradia, T., Ostrand, T.J.: Experiments on the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. In: Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy, May 16-21, 1994, pp. 191–200 (1994), URL http://portal.acm.org/citation.cfm?id=257734.257766
[9] Just, R., Jalali, D., Ernst, M.D.: Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 437–440, ISSTA 2014, ACM (2014)
[10] Le Goues, C., Holtschulte, N., Smith, E.K., Brun, Y., Devanbu, P.T., Forrest, S., Weimer, W.: The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Trans. Software Eng. 41(12), 1236–1256 (2015), URL https://doi.org/10.1109/TSE.2015.2454513
[12] Tomassi, D.A., Dmeiri, N., Wang, Y., Bhowmick, A., Liu, Y.-C., Devanbu, P.T., Vasilescu, B., Rubio-González, C.: BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In: Proceedings of the 41st International Conference on Software Engineering, ICSE 2019 (2019)