A Note About: Critical Review of BugSwarm for Fault Localization and Program Repair
David A. Tomassi and Cindy Rubio-González
University of California, Davis, USA
{datomassi, crubio}@ucdavis.edu

Abstract.
Datasets play an important role in the advancement of software tools and facilitate their evaluation. BugSwarm [12] is an infrastructure to automatically create a large dataset of real-world reproducible failures and fixes. In this paper, we respond to Durieux and Abreu [7]'s critical review of the BugSwarm dataset, referred to in this paper as CriticalReview. We replicate CriticalReview's study and find several incorrect claims and assumptions about the BugSwarm dataset. We discuss these incorrect claims and other contributions listed by CriticalReview. Finally, we discuss general misconceptions about BugSwarm, and our vision for the use of the infrastructure and dataset.
1 Introduction

Datasets are imperative to the development and progression of software tools, not only to facilitate a fair and unbiased evaluation of their effectiveness, but also to inspire and enable the community to advance the state of the art. There have been various influential datasets developed in the Software Engineering community (e.g., [4, 5, 6, 8, 9, 10, 11]). Unfortunately, these datasets have required a substantial amount of manual effort to be created, which makes it difficult to grow them.

Recently we developed
BugSwarm [12], an infrastructure that leverages continuous integration (CI) to automatically create a dataset of reproducible failures and fixes. BugSwarm comprises an infrastructure, dataset, REST API, and website. The initial dataset (version 1.0.0, reported in [12]) consists of 3,091 pairs of failures and fixes (referred to as artifacts) mined from Java and Python projects. Because artifacts mined from open-source software are bound to have different characteristics (number of failing tests, failure reason, fix location(s), patch size, etc.), we provide a REST API and website for users to navigate and select the artifacts that fit the needs of their tools. BugSwarm is under active development, currently allowing the mining of failures and fixes that satisfy specific characteristics.

Parallel to the development of BugSwarm, Durieux and Abreu [7] conducted a review of the BugSwarm dataset (version 1.0.1) with respect to Automated Program Repair (APR) and Fault Localization (FL). The authors stated characteristics they consider necessary for artifacts to be used in studies that evaluate the state of the art in APR and FL. Additionally, the authors presented a high-level classification of failures, and discussed the cost of using
BugSwarm artifacts. In the rest of this paper we refer to [7] as CriticalReview.

One of the purposes of datasets is to facilitate the evaluation of software tools. Instead, CriticalReview uses the general requirements/current limitations of the state-of-the-art APR tools to evaluate the BugSwarm dataset. While it is important that datasets possess key characteristics (e.g., failures that are relevant to the tools under evaluation), the existence of artifacts that do not have desired characteristics does not hinder studies if users can navigate and select artifacts relevant to their studies. Limiting a dataset to only include problems that certain tools can handle would be of no benefit to our community. Furthermore, the goal of the BugSwarm dataset is to identify the kinds of problems found in real software and the environment in which these problems occur, and thus inspire the community to advance the state of the art.

In addition to general misconceptions on datasets,
CriticalReview discredits the use of the BugSwarm dataset based on multiple incorrect observations. Specifically, CriticalReview makes a false allegation against the BugSwarm paper [12]'s reported data, and presents wrong results and conclusions led by misunderstandings of Travis-CI terminology and Docker's architecture.

This paper discusses each of CriticalReview's incorrect claims, which had already been communicated to the authors of CriticalReview upon their request for feedback prior to the archival of their study. We also discuss the two other contributions of CriticalReview: a GitHub repository to store the code and build logs of the BugSwarm artifacts, and CriticalReview's own website to browse BugSwarm artifacts, both of which duplicate information already available in
BugSwarm.

The rest of this paper presents a brief overview of BugSwarm in Section 2, and describes the methodology used by CriticalReview in Section 3. We discuss the incorrect findings reported by CriticalReview in Section 4, and the rest of the contributions of CriticalReview in Section 5. Finally, we clarify some misconceptions about BugSwarm, and re-affirm its goals and intended use in Section 6.
2 Overview of BugSwarm

BugSwarm is comprised of three main components: (1) an infrastructure to automatically mine and reproduce failures and fixes from open-source projects that use continuous integration (Travis-CI), currently in the process of being open-sourced (https://github.com/BugSwarm), (2) a continuously growing dataset of real-world failures and fixes packaged in publicly available Docker images (https://hub.docker.com/r/bugswarm/images/tags) to facilitate reproducibility, and (3) a website and a REST API (https://github.com/BugSwarm/common) for dataset users to navigate and select artifacts based on a number of characteristics.
Fig. 1. Workflow for the BugSwarm toolkit.

BugSwarm's methodology to create a continuously growing dataset of real-world failures and fixes is shown in Fig. 1. We briefly describe each component below. For more details please refer to the BugSwarm paper [12].
PairMiner. PairMiner represents the first stage of the process. The role of PairMiner is to mine fail-pass job pairs from the Travis-CI build history of open-source projects hosted on GitHub. A project's build history refers to all Travis-CI builds previously triggered. A build may include many jobs; for example, a build for a Python project might include separate jobs to test with Python versions 2.6, 2.7, 3.0, etc. The input to PairMiner is the repository slug (e.g., google/auto) of the project of interest. PairMiner analyzes the project's build history to identify fail-pass build pairs, where a build fails and the next consecutive build passes. From these fail-pass build pairs, PairMiner extracts fail-pass job pairs. The output of PairMiner is the set of fail-pass job pairs found for the given project.
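The following is a minimal sketch of this mining logic, under stated assumptions: the function name and the dictionary fields ('state', 'jobs', 'config') are illustrative and do not mirror BugSwarm's actual implementation.

```python
import json

def mine_fail_pass_job_pairs(builds):
    """builds: a project's Travis-CI builds in chronological order, each a
    dict with a 'state' and a list of 'jobs' (each with 'state', 'config')."""
    pairs = []
    for failed, passed in zip(builds, builds[1:]):
        # A fail-pass *build* pair: a failing build immediately followed
        # by a passing build.
        if failed["state"] != "failed" or passed["state"] != "passed":
            continue
        # Match jobs across the two builds by configuration to extract
        # fail-pass *job* pairs (a build may run many jobs, e.g. one per
        # Python version).
        by_config = {json.dumps(j["config"], sort_keys=True): j
                     for j in passed["jobs"]}
        for job in failed["jobs"]:
            key = json.dumps(job["config"], sort_keys=True)
            if job["state"] == "failed" and key in by_config:
                pairs.append((job, by_config[key]))
    return pairs
```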
PairFilter. PairFilter takes as input the Travis-CI fail-pass job pairs from PairMiner and ensures that essential data is available to allow for reproduction: (1) the state of the project at the time the job was executed, and (2) the environment in which the job was executed. If these essentials are not available, then PairFilter discards the fail-pass job pair. PairFilter determines the Docker image that was the exact build environment for the fail-pass job pair and the specific commits that triggered each job. The output of PairFilter is the subset of fail-pass job pairs for which (1) and (2) are available.
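A hedged sketch of these availability checks follows; the 'commit' and 'base_image' fields are hypothetical names for illustration, not BugSwarm's actual schema.

```python
import subprocess

def commit_exists(repo_dir, sha):
    # `git cat-file -e` exits 0 iff the object exists in the repository,
    # i.e., the project state at that commit is still recoverable.
    return subprocess.run(["git", "-C", repo_dir, "cat-file", "-e", sha],
                          capture_output=True).returncode == 0

def image_available(image):
    # `docker manifest inspect` exits 0 iff the image exists in the registry,
    # i.e., the original build environment is still recoverable.
    return subprocess.run(["docker", "manifest", "inspect", image],
                          capture_output=True).returncode == 0

def keep_pair(repo_dir, failed_job, passed_job):
    """Keep a fail-pass job pair only if (1) the commits that triggered both
    jobs are still reachable and (2) the original base image is available."""
    return (commit_exists(repo_dir, failed_job["commit"])
            and commit_exists(repo_dir, passed_job["commit"])
            and image_available(failed_job["base_image"]))
```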
Reproducer. The goal of Reproducer is to reproduce each job in the fail-pass job pair in the same build environment in which it was originally run. The input to Reproducer is a fail-pass job pair, the commits for each version, and the Docker image for the build environment. Reproducer conducts the following: (1) generates a job script, i.e., a shell script to build the project and run regression tests, (2) matches the build environment in which the job originally ran via the Docker image from PairFilter, (3) reverts the project to the specific version, and (4) runs the code for the job in the Docker image via the job script. Reproducer can be run in parallel via multiple processes for each job pair, as shown in Fig. 1. The output of Reproducer is a build log, which is a transcript of everything that occurs at the command line during the build and testing process.
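A minimal sketch of steps (3) and (4), assuming the job script from step (1) has already been generated and the image from step (2) has been pulled; the function signature is illustrative.

```python
import subprocess

def reproduce_job(image, repo_dir, commit, job_script):
    """Revert the checkout to the job's trigger commit, then execute the
    job script inside the original base image. Returns the build log."""
    # Step (3): revert the project to the version that triggered the job.
    subprocess.run(["git", "-C", repo_dir, "checkout", "-f", commit],
                   check=True)
    # Step (4): run the job script (build + regression tests) inside the
    # Docker image; the combined output is the build log.
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{repo_dir}:/project", image,
         "bash", f"/project/{job_script}"],
        capture_output=True, text=True)
    return result.stdout + result.stderr
```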
Analyzer. The Analyzer parses the original (historical) and reproduced build logs, extracts key attributes, and compares the extracted attributes to ensure they match. The key attributes parsed are the status of the build (passed, failed, or errored) and the result of the test suite (number of tests run, number of tests failed, and names of failed tests). If the results match between the original and reproduced build logs, then metadata about the pair is added to the BugSwarm database.
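A simplified sketch of this extraction and comparison, assuming Maven Surefire logs; real build logs vary by build system, so the Analyzer needs one parser per log format, and this sketch omits the build status and failed-test names.

```python
import re

# Matches Maven Surefire test-summary lines, e.g.
# "Tests run: 12, Failures: 1, Errors: 0, Skipped: 2"
SUREFIRE = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)")

def extract_attributes(build_log):
    runs = failed = 0
    for m in SUREFIRE.finditer(build_log):
        runs += int(m.group(1))
        failed += int(m.group(2)) + int(m.group(3))
    # Simplified: the real Analyzer also extracts the build status
    # (passed/failed/errored) and the names of the failed tests.
    return {"tests_run": runs, "tests_failed": failed}

def results_match(original_log, reproduced_log):
    # A pair is kept only if the reproduced run exhibits the same test
    # results as the historical Travis-CI run.
    return extract_attributes(original_log) == extract_attributes(reproduced_log)
```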
Artifact Creation. The Reproducer and Analyzer are run five times. If a fail-pass job pair is reproducible all five times, we mark it as "reproducible". If the number of times the pair was reproducible is less than five but more than zero, it is marked as "flaky". A pair can be flaky for a variety of reasons, but primarily because of test flakiness, which can be caused by non-deterministic tests due to concurrency or environmental changes. Lastly, if a pair is reproducible zero times, it is marked as "unreproducible". A reproducible or flaky job pair is referred to as a BugSwarm artifact.

For each BugSwarm artifact, a Docker image is created that contains both versions of the code and the job scripts to build and test each version. This Docker image is then stored in our DockerHub repository (https://hub.docker.com/r/bugswarm/images). We chose to package each BugSwarm artifact in a Docker image because Docker facilitates reproducibility. Docker is also a good choice because it is lightweight and uses layering: Docker images are composed of multiple layers that can be shared across multiple images to save space, and Docker does not re-download or store a layer that is already on a system [1].
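The classification rule above is simple enough to state directly in code; this labeling matters later, in Section 4, where omitting flaky artifacts leads CriticalReview to an incorrect artifact count.

```python
def classify_pair(reproduce_successes, reproduce_attempts=5):
    """Label a fail-pass job pair after five reproduction runs."""
    if reproduce_successes == reproduce_attempts:
        return "reproducible"
    elif reproduce_successes > 0:
        return "flaky"          # e.g., non-deterministic or env-dependent tests
    else:
        return "unreproducible"

# Only reproducible and flaky pairs become BugSwarm artifacts.
print(classify_pair(5))  # reproducible
print(classify_pair(3))  # flaky
print(classify_pair(0))  # unreproducible
```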
The BugSwarm dataset is the first continuously growing dataset of reproducible real-world failures and fixes. The dataset was automatically created using the BugSwarm infrastructure without controlling for any specific attributes. Currently, the BugSwarm dataset (version 1.1.0) consists of 3,140 artifacts written in Java and Python. The artifacts are diverse, spanning build systems such as Maven, Gradle, and Ant, longevity from 2015 to 2019, and testing frameworks such as JUnit and unittest. We expect steady growth of the dataset in the coming months as the BugSwarm infrastructure is set to run on dedicated servers.
BugSwarm offers many different characteristics to filter by in order to create a subset useful for evaluating a given tool. Examples of such characteristics are: language, size of diff, build system, number of tests run, number of failed tests, patch location (e.g., source code, test code, or build files), exceptions thrown at runtime (e.g., NullPointerException), etc. The BugSwarm website and REST API allow the selection of artifacts based on the above attributes, as sketched below.
3 CriticalReview's Methodology

The goal of CriticalReview's study is to answer the following questions:

RQ1: What are the main characteristics of BugSwarm's pairs of builds regarding the requirements for APR and FL?
RQ2: What is the execution and storage cost of BugSwarm?
RQ3: Which pairs of builds meet the requirements of APR and FL?
Characteristics of BugSwarm's Pairs of Builds. CriticalReview characterizes the BugSwarm dataset with respect to requirements of current APR and FL tools: (1) behavioral bugs, (2) a test suite with passing tests defining correct behavior and failing tests defining incorrect behavior, (3) a known execution setup in terms of paths of source and test files, etc., (4) uniqueness of bugs, and (5) human patch availability. The above requires, for each artifact, the source code for the buggy version and the fixed version, the diff between the two versions, and the Travis-CI build log for the failing job.
CriticalReview queries for fully reproducible Java and Python artifacts (see Section 4.1 for further details) using the BugSwarm REST API. The resulting artifacts are then filtered for unique commits (note that multiple Travis-CI jobs may originate from a single Travis-CI build). The diff of each artifact is calculated by retrieving the buggy and fixed versions of the artifact from its corresponding Docker image, pushing the code into a branch of a new GitHub repository, and then invoking the GitHub API to retrieve the diff between the two code versions. Unique diffs are identified based on MD5 hash values, and artifacts are classified based on whether the extension of the changed files is .java or .py. Lastly, a high-level classification of the reason for failure is conducted by using regular expressions to match certain patterns (test failures, style checkers, compilation errors, etc.) in the Travis-CI build logs.
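A hedged sketch of the deduplication and classification steps just described, as we understand them. The MD5 use mirrors the text; the regular expressions are illustrative placeholders, not CriticalReview's actual patterns.

```python
import hashlib
import re

def unique_diffs(diffs):
    """Deduplicate patches by their MD5 hash, as CriticalReview does."""
    return list({hashlib.md5(d.encode()).hexdigest(): d for d in diffs}.values())

# Illustrative placeholders for the high-level failure classification.
FAILURE_PATTERNS = [
    ("test failure", re.compile(r"Tests run: \d+, Failures: [1-9]")),
    ("compilation error", re.compile(r"COMPILATION ERROR|SyntaxError")),
    ("style checker", re.compile(r"checkstyle|flake8")),
]

def classify_failure(build_log):
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(build_log):
            return label
    return "other"
```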
Execution and Storage of BugSwarm. CriticalReview estimates the size of the BugSwarm dataset for download and storage, as well as its usage cost. The size of the dataset is calculated using two metrics: counting every Docker layer, and counting every unique Docker layer. Note that Docker does not download or store a layer that is already in the system (see [1] and Section 4.2). CriticalReview gives a time estimate for download assuming a stable 80 Mbit/s connection. Finally, the cost of using the full dataset is estimated assuming a 20-minute experiment per artifact using Amazon cloud instances.
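The download-time arithmetic can be reproduced directly; reading the reported GB figures as GiB matches CriticalReview's reported 11 d 1.16 h most closely. The same formula applied to the unique layers (the only ones Docker actually downloads) shrinks the estimate to under three days.

```python
def download_days(size_gib, mbit_per_s=80):
    """Days to download `size_gib` GiB over a stable link of `mbit_per_s`."""
    bits = size_gib * 1024**3 * 8
    return bits / (mbit_per_s * 10**6) / 86400

print(f"all layers:    {download_days(8921):.2f} days")  # ~11.09 days
print(f"unique layers: {download_days(2246):.2f} days")  # ~2.79 days
```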
Pairs for APR and FL. CriticalReview lists what the paper considers the requirements to use state-of-the-art APR and FL tools: (1) artifacts that have been reproduced five times, (2) artifacts whose Docker images are available, (3) a non-empty diff, (4) a unique commit, (5) a unique diff, (6) a test case failure, and (7) only source files changed. CriticalReview then reports the number of BugSwarm artifacts that satisfy those requirements.
4 CriticalReview's Incorrect Claims

After replicating the study presented by CriticalReview and inspecting its scripts, we identified incorrect claims made by CriticalReview related to inconsistencies in the number of artifacts reported in the BugSwarm paper [12], a misleading duplication of commits in the dataset, and calculations of the storage required by the dataset. Below we discuss each incorrect claim, organized per research question as presented in [7].

4.1 RQ1: Characteristics of BugSwarm's Pairs of Builds
Incorrect Number of Artifacts. CriticalReview reports the number of "builds" reproduced five times given a BugSwarm API request listed in [7, Section III-B]:

  {"reproduce_successes": {"$gt": 4}, "lang": {"$in": ["Java", "Python"]}}

The API request returns 2,949 artifacts, while the BugSwarm paper [12] gives 3,091 artifacts. Thus, CriticalReview alleges a contradiction by the BugSwarm authors, who, according to CriticalReview, had stated that each "build" in the dataset was successfully reproduced five times.
CriticalReview states in [7, Section III-C]:

  Indeed, we considered all pairs of builds that are reproduced successfully five times like it is described in BugSwarm's paper (see Section 4-B). Surprisingly, BugSwarm authors did not consider their criteria in their final selection of the pairs of builds and consequently the reported number is in contradiction with the paper.

The original BugSwarm paper states in [12, Section IV-B]:

  We repeated the reproduction process 5 times for each pair to determine its stability. If the pair is reproducible all 5 times, then it is marked as 'reproducible'. If the pair is reproduced only sometimes, then it is marked as 'flaky'. Otherwise, the pair is said to be 'unreproducible'.

First, as discussed in the BugSwarm paper [12, Section III-C] and in Section 2 of this paper, BugSwarm is comprised of artifacts (Travis-CI job pairs); thus a request to the BugSwarm API returns the number of artifacts, not the number of builds.

Second, the BugSwarm API request used by CriticalReview returns the number of artifacts successfully reproduced five times. In other words, the query returns the number of fully reproducible artifacts. However, the BugSwarm dataset [12, Table III] includes both fully reproducible and flaky artifacts, which together account for a total of 3,091 artifacts. The correct BugSwarm REST API request needs to filter based on a number of reproduce successes greater than zero and a number of reproduce attempts equal to five:

  {"reproduce_successes": {"$gt": 0}, "reproduce_attempts": 5, "lang": {"$in": ["Java", "Python"]}}

All 3,091 artifacts included in the dataset were attempted five times. At the time CriticalReview was written (BugSwarm dataset 1.0.1 from May 2019), the number of fully reproducible artifacts was indeed 2,949 and the number of flaky artifacts was 142. There is no contradiction in the selection criteria described in the BugSwarm paper: both reproducible and flaky artifacts are included in the dataset.
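The discrepancy can be checked by running both filters side by side. The filter documents are exactly those quoted above; the endpoint URL and response shape are assumptions, as in the earlier API sketch.

```python
import json
import requests

API = "http://www.api.bugswarm.org/v1/artifacts"  # assumed endpoint

queries = {
    "fully reproducible (CriticalReview)": {
        "reproduce_successes": {"$gt": 4},
        "lang": {"$in": ["Java", "Python"]}},
    "reproducible + flaky (correct)": {
        "reproduce_successes": {"$gt": 0},
        "reproduce_attempts": 5,
        "lang": {"$in": ["Java", "Python"]}},
}

for label, q in queries.items():
    resp = requests.get(API, params={"where": json.dumps(q)})
    print(label, len(resp.json()))  # 2,949 vs. 3,091 on dataset v1.0.1
```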
Duplicate Failing Commits. CriticalReview reports a "new" finding regarding a high number of duplicate failing commits in the BugSwarm dataset that would introduce misleading results. CriticalReview states in [7, Section II-C]:

  Our second observation is that 40.08% ((2,949-1,767)/2,949) of the builds have a duplicate failing commit. It means that those 40.08% should not be considered by the approaches that only consider the source code of the application otherwise it introduces misleading results.
The BugSwarm paper states in [12, Section IV-B]:

  Recall from Section III-C that PairMiner mines job pairs. The corresponding number of reproducible unique build pairs is 1,837. The rest of the paper describes the results in terms of number of job pairs.

As stated in the BugSwarm paper [12, Section III-C] and in Section 2 of this paper, a BugSwarm artifact corresponds to a pair of jobs, not a pair of builds (as incorrectly interpreted throughout CriticalReview). A Travis-CI build can be composed of multiple jobs that test the same commit under different configurations. Early feedback from researchers in our community indicated that such artifacts can also be of interest. (The difference between the 1,767 unique commits reported by CriticalReview and the 1,837 unique build pairs reported in [12] is again due to CriticalReview omitting flaky artifacts.)

As also described in the BugSwarm paper [12, Section III-B], a given experiment may require artifacts that meet specific criteria. If such criteria require uniqueness of job pairs, as CriticalReview reports is the case for APR tools, then we provide a REST API and website that allow users to consider uniqueness when selecting artifacts of interest. Thus, having the dataset include multiple jobs from a build does not represent a problem that would introduce misleading results.

4.2 RQ2: Execution and Storage Cost

CriticalReview calculates the size of the BugSwarm dataset and provides estimated download time and cost for using the full dataset on Amazon cloud instances [7, Section 3-D]. The paper reports that the full dataset is 8,921 GB, which takes about 11 days and 1.16 hours to download using an 80 Mbit/s internet connection. Subsequently, the cost of using the BugSwarm dataset, assuming a 20-minute experiment per artifact, is $711.30 USD.

Table 1. Metrics of BugSwarm downloading and storage cost, from [7]. Sizes in gigabytes (GB).

  Metric                               Java       Python     All
  Docker layer size                    5,107      3,813      8,921
  Unique Docker layer size                                   2,246
  Avg. size per artifact               3.01       3.05       3.03
  Download all layers (80 Mbit/s)      6d 7.8h    4d 17.13h  11d 1.16h
  Download unique layers (80 Mbit/s)
Download Size Calculation. CriticalReview calculates the size of the BugSwarm dataset using two metrics: counting every Docker layer, and counting every unique Docker layer. The size of the dataset is reported (see Table 1, from CriticalReview) as 8,921 GB and 2,246 GB, respectively. However, counting every Docker layer is incorrect: Docker does not re-download or store a layer that is already on a system [1]. The average size (row 3) and download time (row 4) given in Table 1 are calculated based on all Docker layers (row 1); thus these table entries are also incorrect.
Compression Ratio. CriticalReview estimates a compression ratio that is then used to incorrectly calculate disk space. A compression ratio is unnecessary in the first place; disk space is determined by the size of unique Docker layers, already given in row 2 of Table 1. CriticalReview states in [7, Section III-D]:

  According to our observations, the ratio between download size and disk storage is 2.48x and drops to 0.41x when considering the duplicate layers. [...] Based on this observation, we estimate the total disk space required to 3,680.45 GB.

CriticalReview fails to mention that the above observations are based on 464 artifacts [2], not the full dataset. The script [3] used to calculate disk space lists 598.98 GB of storage used by the 464 artifacts. When we downloaded the same 464 artifacts, the disk space reported by the command docker system df is 353 GB, not 598.98 GB. The compression ratio is then calculated by dividing the space on disk by the size of the 464 artifacts when considering all Docker layers: 598.98 GB / 1,452.02 GB = 0.41. However, when using this compression ratio, the estimated disk space reported for the full dataset is 3,680.45 GB, which is about 64% higher than the actual size given in row 2 of Table 1, which is 2,246 GB.
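The arithmetic behind this rebuttal, stated explicitly: the ratio CriticalReview derived from the 464 artifacts, and how far its extrapolation lands from the unique-layer size that was already available in Table 1.

```python
ratio = 598.98 / 1452.02          # disk usage / all-layer size (464 artifacts)
estimate = 8921 * ratio           # CriticalReview's extrapolation
actual = 2246                     # unique-layer size, Table 1, row 2

print(f"ratio:    {ratio:.2f}")                  # 0.41
print(f"estimate: {estimate:,.0f} GB")           # ~3,680 GB
print(f"error:    {estimate / actual - 1:.0%}")  # ~64% too high
```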
Cost Calculation. Because the cost of using the BugSwarm dataset is based on incorrect estimated download and storage sizes, the cost calculations are also incorrect. Additionally, as mentioned earlier and corroborated by CriticalReview, we expect that BugSwarm users will be interested in subsets of the dataset, as opposed to the full dataset. This must be taken into account when making such cost calculations.
5 Other Contributions of CriticalReview

In addition to answering the questions described in Section 3, CriticalReview also provides a GitHub repository for the BugSwarm artifacts and a website to navigate and select artifacts. This section discusses these contributions as well as a finding regarding duplicate diffs.
GitHub Repository. One of the contributions listed by CriticalReview is a new GitHub repository to store BugSwarm artifacts (https://github.com/TQRG/BugSwarm). Specifically, there is a branch for each artifact that contains the buggy version of the code, the fixed version of the code, the diff between both versions, and the failing and passing Travis-CI build logs. The only artifact information not stored in the repository is the scripts to build the code and run regression tests.

However, the CriticalReview repository is not necessary. The buggy and fixed versions of the code can be directly accessed via the original repositories (the BugSwarm REST API and the website provide the commit information) or by downloading the BugSwarm Docker image for the artifact, which includes a copy of both versions of the code. The Travis-CI build logs can be directly accessed via the Travis-CI website using the information provided by the BugSwarm REST API, or by directly following the BugSwarm website links. Finally, the diff can be directly retrieved using the GitHub API (3-dot diff), or accessed via the BugSwarm website (2-dot diff).
Website to Browse and Select BugSwarm Artifacts. Another contribution listed by CriticalReview is a website to browse and select BugSwarm artifacts (https://tqrg.github.io/BugSwarm). The website displays the number of added/removed/modified lines and files, and allows selecting artifacts based on unique commits, unique diffs, non-empty diffs, failing tests, source code changes, a manual categorization of bug/non-bug patches, and a high-level categorization of failures.

BugSwarm already provides its own website for browsing and selection based on the same attributes listed by CriticalReview (except for their two categorizations, which are complementary to our own). The BugSwarm website also allows selecting artifacts based on the location of the fix (source files, configuration files, or test files). In addition to the website, BugSwarm provides a REST API to query the BugSwarm database directly, so one is not restricted to the options provided on the website. BugSwarm also provides a classification of artifacts based on runtime exceptions.
Duplicate Diffs. CriticalReview reports that, while controlling for unique failed commits, there are duplicate diffs among them: 198 out of 1,767 artifacts have a duplicate diff [7, Section III-C].

Recently, we discovered that Travis-CI can make a "double build" when a build comes from a pull request (PR): Travis-CI creates one build for the PR branch, and another build for the PR branch merged with the base branch (https://docs.travis-ci.com/user/pull-requests/). If no changes have been made to the base branch since the time the PR branch was created, then the diffs of both builds are the same. This explains CriticalReview's observation. Fortunately, we believe it is feasible to automatically detect these cases, and this detection will be incorporated into the BugSwarm infrastructure to avoid such cases in future versions of the dataset.
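One way such detection could look, as a hedged sketch: the field names ('event_type', 'commit', 'head_commit') and the diff accessor are illustrative assumptions, not the planned BugSwarm implementation.

```python
def is_double_build(branch_build, merge_build, diff_of):
    """True if `merge_build` is the Travis-CI merge twin of `branch_build`
    with an identical diff. `diff_of` maps a build to its unified diff."""
    return (merge_build.get("event_type") == "pull_request"
            and merge_build.get("head_commit") == branch_build.get("commit")
            and diff_of(merge_build) == diff_of(branch_build))
```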
6 Misconceptions, Goals, and Intended Use

We would like to conclude this paper by briefly clarifying a few misconceptions about BugSwarm, and by discussing our vision for the BugSwarm infrastructure and dataset.

(1) BugSwarm is more than a dataset. As described in Section 2, BugSwarm is comprised of an infrastructure to automatically create a large-scale dataset of real-world failures and fixes, a continuously growing dataset, and a REST API and website to navigate and select artifacts from the dataset based on characteristics of interest.

(2) The BugSwarm dataset is not static. One of the main contributions of the BugSwarm infrastructure is that its full automation has enabled the creation of a continuously growing dataset. As discussed in the BugSwarm paper [12], the potential for size and diversity opens new opportunities, but it also presents several challenges. Some of these challenges include data versioning (discussed in CriticalReview), and automated bug classification to increase the usefulness of the dataset.

(3) The BugSwarm dataset is not meant for a single target application. Because of the size and diversity of the BugSwarm dataset, it is unrealistic to believe that all artifacts will be relevant to one application. As a result, BugSwarm facilitates navigating and selecting artifacts based on a set of characteristics via the BugSwarm website or REST API. Thus, it is easy to select artifacts for a given application (e.g., APR or FL) beforehand.

(4) BugSwarm artifacts with specific characteristics can be "grown". The initial BugSwarm dataset was created without controlling for any particular attribute, such as diff size, patch location, or reason for failure. However, since the publication of the BugSwarm paper [12], targeted mining is now available, and thus it is possible to grow the dataset in specific directions. We believe that allowing for diverse characteristics does not hinder the evaluation of the state of the art. On the contrary, we hope that the existence of artifacts that the state of the art may not be able to handle today will further push advancement.

BugSwarm is a project under active development, and in the process of open-sourcing its infrastructure. We welcome feedback from the community. The BugSwarm dataset is publicly available on DockerHub. The website is also publicly available, and the REST API is available to anyone who would like to request a token to access the BugSwarm database.
Acknowledgments

We thank Bohan Xiao and Octavio Corona for their help in gathering some of the data discussed in this paper. This work was supported by NSF grant CNS-1629976 and a Microsoft Azure Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Microsoft.

Bibliography

[1] Docker Storage Driver. https://docs.docker.com/storage/storagedriver/ (Accessed 2019)
[2] Images downloaded by CriticalReview. https://github.com/TQRG/BugSwarm/blob/master/docs/downloaded_images.json (Accessed 2019)
[3] Script used by CriticalReview to calculate disk space. https://github.com/TQRG/BugSwarm/blob/master/script/compression_rate.py (Accessed 2019)
[4] Cifuentes, C., Hoermann, C., Keynes, N., Li, L., Long, S., Mealy, E., Mounteney, M., Scholz, B.: BegBunch: Benchmarking for C Bug Detection Tools. In: DEFECTS '09: Proceedings of the 2nd International Workshop on Defects in Large Software Systems, pp. 16–20 (2009), URL http://doi.acm.org/10.1145/1555860.1555866
[5] Dallmeier, V., Zimmermann, T.: Extraction of Bug Localization Benchmarks from History. In: 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), November 5-9, 2007, Atlanta, Georgia, USA, pp. 433–436 (2007), URL http://doi.acm.org/10.1145/1321631.1321702
[6] Do, H., Elbaum, S., Rothermel, G.: Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Softw. Engg. 10(4), 405–435 (Oct 2005), ISSN 1382-3256
[7] Durieux, T., Abreu, R.: Critical Review of BugSwarm for Fault Localization and Program Repair. CoRR abs/1905.09375 (2019)
[8] Hutchins, M., Foster, H., Goradia, T., Ostrand, T.J.: Experiments on the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. In: Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy, May 16-21, 1994, pp. 191–200 (1994), URL http://portal.acm.org/citation.cfm?id=257734.257766
[9] Just, R., Jalali, D., Ernst, M.D.: Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 437–440, ISSTA 2014, ACM (2014)
[10] Le Goues, C., Holtschulte, N., Smith, E.K., Brun, Y., Devanbu, P.T., Forrest, S., Weimer, W.: The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Trans. Software Eng. 41(12), 1236–1256 (2015), URL https://doi.org/10.1109/TSE.2015.2454513
[12] Tomassi, D.A., Dmeiri, N., Wang, Y., Bhowmick, A., Liu, Y.-C., Devanbu, P.T., Vasilescu, B., Rubio-González, C.: BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In: Proceedings of the 41st International Conference on Software Engineering, ICSE 2019 (2019)