Empirical Review of Java Program Repair Tools:
A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts
Thomas Durieux
INESC-ID and IST, University of Lisbon, [email protected]
Fernanda Madeiral
Federal University of Uberlândia, [email protected]
Matias Martinez
University of Valenciennes, [email protected]
Rui Abreu
INESC-ID and IST, University of Lisbon, [email protected]
ABSTRACT
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, and the evaluations are rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 5 benchmarks of bugs. Our goal is to reach a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not generalize across different benchmarks of bugs. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) perform better on Defects4J than on the other benchmarks, generating patches for 47% of the bugs from Defects4J compared to 10-30% of the bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts in total, which we used to find the causes of non-patch generation. These causes are reported in this paper and can help repair tool designers to improve their techniques and tools.
ACM Reference Format:
Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts. In Proceedings of The 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Software bugs decrease the quality of software systems from the point of view of their users. Manually repairing bugs is well known to be a difficult and time-consuming task. To address this activity automatically, a new field of research has emerged, named automatic program repair. Automatic program repair consists of automatically finding solutions to software bugs, without human intervention [30, 31]. The most popular approach to automatically repair bugs is to create a patch using the test suite of the program as the specification of its expected behavior. This type of approach is known as test-suite-based program repair [30], and it has been applied in several repair tools in the last decade [2, 6, 10, 13, 14, 17, 18, 23, 24, 27, 28, 37, 42, 43, 45–47, 50].

Repair tools are used in empirical evaluations so that the repairability of the repair approaches they implement can be measured. These evaluations are reported in the literature in two ways: when a new repair approach is proposed (e.g. [50]), or when a dedicated full contribution on evaluating existing repair tools is reported (e.g. [26, 33, 48]). In general, the evaluations consist of four main aspects: 1) [benchmark] the selection of benchmarks of bugs; 2) [execution] the collection of data by executing repair tools on the selected bugs; 3) [observed aspect] an investigation of the effectiveness of the repair approach regarding some criteria (e.g. repairability, correctness, and repair time); and finally 4) [comparison/discussion] the comparison of repair approaches and the discussion of the results.

A major problem with all previous evaluations focusing on repair for Java programs is that they are widely performed on the same benchmark of bugs: Defects4J [16]. In theory, this should not be a problem if Defects4J is not biased; however, no benchmark is perfect [21]. Benchmarks should reflect the representativeness of the bugs, and of the projects they come from, in the real world. The extent of the representativeness of benchmarks for real-world bugs is unknown, because even the distribution of real-world bugs is unknown. Therefore, using a single benchmark when evaluating repair tools can introduce a bias, which makes it hard to generalize the performance of repair tools.

In this paper, we report on a large experiment conducted on 11 test-suite-based repair tools for Java, using benchmarks of bugs beyond Defects4J. The primary goal of this experiment is to investigate whether the existing repair tools behave in a similar way across different benchmarks. If a repair tool performs significantly better on one benchmark than on others, we say that the repair tool overfits the benchmark. The secondary goal is to understand the causes of non-patch generation from a practical point of view, which, to the best of our knowledge, has not been the subject of investigation by the repair community.

To achieve our goals, we designed our experiment considering three out of the four main aspects usually used to evaluate repair tools: a) on benchmark, we use 5 benchmarks (including Defects4J), totaling 2,141 bugs; b) on execution, we run 11 repair tools on the 2,141 bugs, using a framework we developed to automate and simplify the execution of repair tools on different benchmarks; c) on observed aspect, we analyze the repairability of the tools, focusing on their performance across the different benchmarks. We do not target the fourth aspect of evaluations, which is about comparing repair tools.
Our goal is not to compare repair tools among themselves, but to compare the behavior of each tool across different benchmarks.

Through our experiment, we first observed that all 11 repair tools are able to generate a test-suite adequate patch for bugs from each of the 5 benchmarks. However, when analyzing the proportion of bugs patched by the repair tools per benchmark, we found that the repair tools indeed perform better on Defects4J than on the other benchmarks. Finally, we found six main reasons why repair tools do not succeed in generating patches for bugs. For instance, we observed that incorrect fault localization and multiple fault locations have a significant impact on patch generation. These reasons are valuable for repair tool designers and researchers to improve their tools.

To sum up, our contributions are:
• A large-scale experiment of 11 repair tools on 2,141 bugs from 5 benchmarks: this is the largest study on automatic program repair to date (i.e. 11 x 2,141 = 23,551 repair attempts);
• A repair execution framework, named RepairThemAll, that adds an abstraction around repair tools and benchmarks, and can be further extended to support additional repair tools and benchmarks;
• A novel study on the repairability of repair tools across multiple benchmarks, where the goal is to investigate whether the benchmark overfitting problem exists, i.e. whether repair tools perform significantly better on the extensively used benchmark Defects4J than on other benchmarks;
• A thorough study of the non-patch generation cases among the 23,551 repair attempts.

The remainder of this paper is organized as follows. Section 2 presents the literature on the evaluation of test-suite-based repair tools for Java, which grounds the motivation of our experiment. Section 3 presents the design of our study, including the research questions and the data collection and analysis. Section 4 presents the results, followed by the discussion in Section 5. Finally, Section 6 presents the related work, and Section 7 presents the final remarks.
2 REPAIR TOOLS MEET BENCHMARKS
Automatic repair tools meet benchmarks of bugs when they are evaluated. In this section, we present a review of the literature on the existing evaluations of repair tools, based on a two-step protocol: 1) gathering repair tools, and 2) gathering information on their evaluations, focusing on the benchmarks used and the number of bugs given as input to the repair tools.

To gather repair tools, we searched the existing living review on automatic program repair [32] for test-suite-based repair tools for Java, which is the focus of this work: we found 24 repair tools that meet this criterion. Then, we gathered scientific papers containing evaluations of these tools.
Table 1: Test-suite-based program repair tools for Java.
Repair tool     | Benchmark used in evaluation | #Bugs | Patched (a) | Fixed (b)
--- Generate-and-validate ---
ACS [46]        | Defects4J        | 224   | 23  | 17
ARJA [50]       | Defects4J        | 224   | 59  | 18
                | QuixBugs [48]    | 40    | 4   | 2
CapGen [42]     | Defects4J        | 224   | 25  | 22
                | IntroClassJava   | 297   | –   | 25
Cardumen [28]   | Defects4J        | 356   | 77  | –
                | QuixBugs [48]    | 40    | 5   | 3
DeepRepair [43] | Defects4J        | 374   | 51  | –
Elixir [37]     | Defects4J        | 82    | 41  | 26
                | Bugs.jar         | 127   | 39  | 22
GenProg-A [50]  | Defects4J        | 224   | 36  | –
HDRepair [18]   | Defects4J        | 90    | –   | 23
Jaid [2]        | Defects4J        | 138   | 31  | 25
jGenProg [27]   | Defects4J        | 224   | 29  | –
                | Defects4J [26]   | 224   | 27  | 5
                | QuixBugs [48]    | 40    | 2   | 0
jKali [27]      | Defects4J        | 224   | 22  | –
                | Defects4J [26]   | 224   | 22  | 1
                | QuixBugs [48]    | 40    | 2   | 1
jMutRepair [27] | Defects4J        | 224   | 17  | –
                | QuixBugs [48]    | 40    | 3   | 1
Kali-A [50]     | Defects4J        | 224   | 33  | –
LSRepair [23]   | Defects4J        | 395   | 38  | 19
PAR [17]        | PARDataset       | 119   | 27  | –
RSRepair-A [50] | Defects4J        | 224   | 44  | –
                | QuixBugs [48]    | 40    | 4   | 2
SimFix [14]     | Defects4J        | 357   | 56  | 34
SketchFix [13]  | Defects4J        | 357   | 26  | 19
SOFix [24]      | Defects4J        | 224   | –   | 23
ssFix [45]      | Defects4J        | 357   | 60  | 20
xPAR [18]       | Defects4J        | 90    | –   | 4
--- Semantics-driven ---
DynaMoth [10]   | Defects4J        | 224   | 27  | –
                | QuixBugs [48]    | 40    | 2   | 1
Nopol [47]      | ConditionDataset | 22    | 17  | 13
                | Defects4J [26]   | 224   | 35  | 5
                | QuixBugs [48]    | 40    | 3   | 1
--- Metaprogramming-based ---
NPEFix [6]      | NPEDataset       | 16    | 14  | –
                | QuixBugs [48]    | 40    | 2   | 1

(a) A patched bug means that a repair tool fixed it with a test-suite adequate patch.
(b) A fixed bug means that a repair tool fixed it with a test-suite adequate patch that was confirmed to be correct.

Two types of papers are of interest to us: the first type presents a new repair approach, including an evaluation conducted using a tool that implements the approach (e.g. [47]); the second type consists of an empirical evaluation carried out on already created tools, i.e. a dedicated work that evaluates repair tools by running them on benchmarks of bugs (e.g. [26]). We gathered 18 papers of the first type (more than one tool can be presented in the same paper) and two papers of the second type.

Table 1 summarizes our review of the existing evaluations of the 24 repair tools, based on the 20 scientific papers. The first column presents the repair tools, grouped by the well-known categories generate-and-validate and semantics-driven approaches, plus metaprogramming-based. We named the latter category for repair tools that first create a metaprogram of the program under repair and then explore it at runtime, using the runtime information to generate patches.

Each repair tool is associated in the table with one or more benchmarks used in its evaluation. When a repair tool has been evaluated on more than one benchmark (or more than once on the same benchmark), we place first the benchmark used in the paper that presented the tool (i.e. first evaluation), followed by the other benchmarks with references to the posterior studies. For instance, in the paper where jGenProg [27] is presented, there is an evaluation on Defects4J: this evaluation has no citation in the second column of the table because the evaluation is in jGenProg's own paper. Later, jGenProg was evaluated again on Defects4J [26] and also on QuixBugs [48], and these rows carry citations of the empirical evaluation papers. The table also presents additional information on the evaluations: the number of bugs given as input to the repair tools, and the number of bugs for which the tools generated a test-suite adequate patch (i.e. patched bugs) and a correct patch (i.e. fixed bugs), as reported by the gathered papers.

In total, we found 38 evaluations of the 24 repair tools. Out of the 24, 22 repair tools were evaluated on (a subset of) bugs from Defects4J, and nine of them were recently evaluated on the QuixBugs benchmark [22]. In some exceptional cases, the benchmarks Bugs.jar [36] and IntroClassJava [11] were also used. However, the number of existing evaluations, in terms of number of repair tools versus number of benchmarks, is low compared to all possible combinations. Some benchmarks were rarely or never used so far: this is partially explained by the fact that some benchmarks were published recently (e.g. Bears [25]), and thus were not available when some repair tools were published.

We also observe that three repair tools were originally evaluated on datasets that were not presented in the literature in a research paper dedicated to them (i.e. PARDataset [17], ConditionDataset [47], and NPEDataset [6]). This is the case of the first evaluations of PAR, Nopol, and NPEFix. However, these repair tools were later evaluated on formally proposed benchmarks, except for PAR, which is not publicly available. PAR was later reimplemented, resulting in the tool xPAR [18], which was then evaluated on Defects4J.
3 STUDY DESIGN
The extensive usage of Defects4J for evaluating repair tools motivates our study. Our goal is to investigate whether the repair tools have a similar performance on other benchmarks of bugs. In this section, we present the design of our study, including the research questions, the systematic selection of repair tools and benchmarks of bugs, and the data collection and analysis.

3.1 Research Questions
RQ1. [Repairability] To what extent do test-suite-based repair tools generate patches for bugs from a diversity of benchmarks?
This research question guides us towards investigating the ability of the existing repair tools to generate test-suite adequate patches for bugs from the selected benchmarks.
RQ2. [Benchmark overfitting] Is the repair tools' repairability similar across benchmarks?
The repair tools have been extensively evaluated on Defects4J. Our goal in this research question is to investigate whether they repair bugs from other benchmarks to a similar extent as they repair bugs from Defects4J.
RQ3. [Non-patch generation] What are the causes that lead repair attempts to not generate patches?
Existing evaluations focus on the successful cases, i.e. the bugs that a given repair tool generated patches for. However, to the best of our knowledge, there is no study that investigates the unsuccessful cases, i.e. where a repair tool tried to repair
a bug but no patch was generated. Our goal in this research question is to find the causes of non-patch generation so that the repair community can focus on practical limitations and improve their repair tools.

Table 2: Selected repair tools based on our inclusion criteria.

Status   | Non-fulfilled criterion             | Repair tools
Excluded | Not public (C1)                     | Elixir, PAR, SOFix, xPAR
         | Not working (C2)                    | ACS, CapGen, DeepRepair
         | Only compatible with Defects4J (C3) | LSRepair, SimFix
         | Faulty class/method required (C4)   | HDRepair, Jaid, SketchFix
         | Others                              | ssFix
Included | –                                   | ARJA, Cardumen, DynaMoth, jGenProg, GenProg-A, jKali, Kali-A, jMutRepair, Nopol, NPEFix, RSRepair-A
3.2 Repair Tool Selection
To include a Java test-suite-based program repair tool in our study, it must meet the following four inclusion criteria:
• Criterion 1 (C1): The repair tool ought to be publicly available. Our study involves the execution of repair tools, therefore tools that are not publicly available are excluded. We exclude a tool under this criterion when 1) the paper where the tool is described does not include a link to the tool, 2) we cannot find the tool on the internet, and 3) we received an answer by email from the authors of the tool explaining why the tool is not available (e.g. Elixir has a confidentiality issue), or no answer at all.
• Criterion 2 (C2): The repair tool ought to be possible to run. Some tools are publicly available but cannot be run for diverse reasons (e.g. ACS uses GitHub, which recently changed its interface and no longer allows programmed queries).
• Criterion 3 (C3): The repair tool ought to be runnable on bugs from benchmarks beyond the one used in its original evaluation. We cannot run tools on other benchmarks if they are hardcoded to specific ones (e.g. SimFix currently works only for Defects4J).
• Criterion 4 (C4): The repair tool ought to require only the source code of the program under repair and its test suite used as oracle. These two elements are the two inputs specified in the problem statement of test-suite-based automatic program repair [31].

After checking all 24 repair tools presented in Table 1, we found 12 tools that meet the inclusion criteria. One of them, ssFix, was further excluded because we had issues running it, so we ended up with 11 repair tools for our experiment. Table 2 presents the excluded and included tools; for the excluded ones, it also shows the criterion they did not meet. Note that, among the included tools, we have eight generate-and-validate tools, the two semantics-based tools, and the only metaprogramming-based tool. Therefore, we cover the three approach categories. We briefly describe each selected repair tool in the remainder of this section.

jGenProg [27] and GenProg-A [50] are Java implementations of GenProg [41], which targets C programs. GenProg is a redundancy-based repair approach [29] that generates patches using existing code (aka the ingredient) from the system under repair, i.e., it does not synthesize new code. GenProg works at the statement level,
and the repair operations are insertion, removal, and replacement of statements.

jKali [27] and Kali-A [50] are Java implementations of Kali [35]. Kali was conceived to show that most of the patches synthesized by GenProg over the ManyBugs benchmark [20] consist of avoiding the execution of code. The operators implemented in Kali are removal of statements, modification of if conditions to true and false, and insertion of return statements.

jMutRepair [27] is a Java implementation of the mutation-based repair approach presented by Debroy and Wong [4]. It considers three kinds of mutation operators: relational (e.g. ==), logical (e.g. &&), and unary (i.e. addition or removal of the negation operator !). jMutRepair performs mutations on those operators in suspicious if condition statements.

Nopol [47] is a semantics-based repair tool dedicated to repairing buggy if conditions and adding missing if preconditions. Nopol uses so-called angelic values to determine the expected behavior of suspicious statements: an angelic value is an arbitrary value that makes all failing test cases of the program under repair pass. Nopol collects those values at runtime and encodes them into a Satisfiability Modulo Theory (SMT) formula to find an expression that matches the behavior of the angelic value. When the SMT formula is satisfiable, Nopol translates the SMT solution into a source code patch.

DynaMoth [10] is a repair tool integrated into Nopol that also targets buggy and missing if conditions. The difference between DynaMoth and Nopol is that, instead of using an SMT formula to generate the patch, DynaMoth uses the Java Debug Interface to access the runtime context and collect variables and method calls. It then combines those variables and method calls to generate increasingly complex expressions until it finds one that has the expected behavior. This allows the generation of patches containing method calls with parameters, for instance.

NPEFix [6] is different from the generate-and-validate and semantics-based tools: it is a metaprogramming-based tool. This means that NPEFix modifies the program under repair to include several repair strategies that can be activated at runtime. NPEFix repairs programs that crash due to a null pointer exception. It runs the failing test case several times, activating a different repair strategy for each execution. In the end, knowing which repair strategies worked, together with information on the context in which they worked, a patch is created. Note that NPEFix works similarly to the semantics-based tools in this last step: if a patch is found, the patch is by construction already satisfactory.
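To make these operator families concrete, consider the following hedged sketch; the buggy method and the candidate patches are our own illustrations, not taken from any benchmark or from the tools' actual output:

    // A hypothetical buggy method: it should return 0 for an empty list.
    public class OperatorSketch {
        static int average(java.util.List<Integer> xs) {
            int sum = 0;
            for (int x : xs) {
                sum += x;
            }
            return sum / xs.size(); // ArithmeticException when xs is empty
        }

        // Kali-style candidate: skip the suspicious statement, e.g. insert
        //   `if (true) return 0;` before the division.
        // jMutRepair-style candidate: mutate an operator in a suspicious
        //   condition, e.g. turn `xs.size() > 0` into `xs.size() >= 0`.
        // Nopol/DynaMoth-style candidate: synthesize a guarding precondition
        //   whose runtime behavior matches the collected angelic values:
        static int averageRepaired(java.util.List<Integer> xs) {
            if (xs.size() == 0) { // synthesized precondition
                return 0;
            }
            int sum = 0;
            for (int x : xs) {
                sum += x;
            }
            return sum / xs.size();
        }
    }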
ARJA [50] is a genetic programming approach that optimizes the exploration of the search space by combining three different techniques: a patch representation that properly decouples the search subspaces of likely-buggy locations, operation types, and ingredient statements; a multi-objective search that minimizes the weighted failure rate and searches for simpler patches; and a method/variable scope matching that filters the replacement/inserted code to improve the compilation rate.
Table 3: Selected benchmarks of bugs and their sizes.

Benchmark           | Projects | Bugs  | Avg. lines of code
Defects4J [16]      | 6        | 395   | 129,592
Bugs.jar [36]       | 8        | 1,158 | 212,889
Bears [25]          | 72       | 251   | 62,597
IntroClassJava [11] | 6        | 297   | 230
QuixBugs [22]       | 40       | 40    | 190

Cardumen [28] is a test-suite-based repair approach that works at the level of expressions. It synthesizes new expressions (used to replace suspicious expressions) as follows. First, it mines templates (i.e., pieces of code at the level of expressions, where the variables are replaced by placeholders) from the code under repair. Then, to create a candidate patch that replaces a suspicious expression se, Cardumen selects a compatible template (i.e. the evaluation of the template and of se return compatible types) and creates a new expression from it by replacing all its placeholders with variables frequently used in the context of se, as illustrated in the sketch at the end of this subsection.

RSRepair-A [50] is a Java implementation of RSRepair [34]. RSRepair is a test-suite-based repair approach for C that was created to compare the performance of genetic programming (GenProg) and random search in the context of automatic program repair. It showed that in 23 out of 24 cases RSRepair finds patches faster than GenProg.
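As a toy illustration of Cardumen's template mechanism described above (the placeholder syntax and the helper are our own simplification, not Cardumen's actual internals):

    import java.util.Map;

    public class TemplateSketch {
        // A template mined elsewhere in the program under repair, with the
        // variables abstracted into typed placeholders:
        static final String TEMPLATE = "_int_0 < _array_1.length";

        // Instantiate the template by binding placeholders to variables that
        // are in scope (and frequently used) around the suspicious expression.
        static String instantiate(Map<String, String> binding) {
            String expression = TEMPLATE;
            for (Map.Entry<String, String> e : binding.entrySet()) {
                expression = expression.replace(e.getKey(), e.getValue());
            }
            return expression;
        }

        public static void main(String[] args) {
            // Candidate replacement for a suspicious expression `i <= size`:
            System.out.println(instantiate(Map.of("_int_0", "i", "_array_1", "buffer")));
            // prints: i < buffer.length
        }
    }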
3.3 Benchmark Selection
To select benchmarks of bugs for our study, we defined the following three inclusion criteria:
• Criterion 1: The benchmark must contain bugs in the Java language. This criterion excludes benchmarks such as ManyBugs [21], IntroClass [21], Codeflaws [40], and BugsJS [12].
• Criterion 2: The benchmark must be peer-reviewed, i.e. presented in the literature in a research paper dedicated to it. This criterion excludes benchmarks such as PARDataset [17], ConditionDataset [47], and NPEDataset [6].
• Criterion 3: The benchmark must include, for each bug, at least one failing test case. This criterion excludes benchmarks such as iBugs [3].

After searching the literature for benchmarks that meet our criteria, we ended up with 5 benchmarks. Table 3 summarizes them by presenting their sizes in number of projects, bugs, and lines of code. We briefly describe each of them as follows.
Defects4J [16] contains 395 bugs from six widely used open-source Java projects with an average size of 129,592 lines of Java code. The bugs were extracted by identifying bug fixing commits with the support of the bug tracking system, and by executing tests on the bug fixing program version and its reverse patch (buggy version). Despite the fact that this benchmark was first proposed to the software testing community, it has been used in several works on automatic program repair.
Bugs.jar [36] contains 1,158 bugs from eight Apache projects with an average size of 212,889 lines of Java code. It was created using the same strategy as Defects4J. Its main contribution is the high number of bugs.
Bears [25] contains 251 bugs from 72 different GitHub projects with an average size of 62,597 lines of Java code. It was created by mining software repositories based on the commit building state from the Travis continuous integration service. Bears has the largest diversity of projects compared to previous bug benchmarks.
IntroClassJava [11] contains 297 bugs from six different student projects. It is a version of the bugs from the C benchmark IntroClass [21] transpiled to Java. In the transpiled version, the projects have on average 230 lines of code.
QuixBugs [22] contains 40 single-line bugs from 40 programs, which are available in both the Java and Python languages. Each program corresponds to the implementation of one algorithm, such as Quicksort, and contains on average 190 lines of code. This is the first multi-lingual program repair benchmark.
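For flavor, a QuixBugs-style defect looks like the sketch below: a classic algorithm with a one-line bug exposed by a failing test. The program is our own illustration in the spirit of the benchmark, not a verbatim QuixBugs program:

    import java.util.ArrayList;
    import java.util.List;

    public class QuicksortSketch {
        static List<Integer> sort(List<Integer> arr) {
            if (arr.isEmpty()) {
                return new ArrayList<>();
            }
            int pivot = arr.get(0);
            List<Integer> lesser = new ArrayList<>();
            List<Integer> greater = new ArrayList<>();
            for (int x : arr.subList(1, arr.size())) {
                if (x < pivot) {
                    lesser.add(x);
                } else if (x > pivot) { // BUG: values equal to the pivot are dropped;
                    greater.add(x);     // the one-line fix is `else { greater.add(x); }`
                }
            }
            List<Integer> out = sort(lesser);
            out.add(pivot);
            out.addAll(sort(greater));
            return out;
        }

        public static void main(String[] args) {
            // A failing test: duplicates of the pivot disappear.
            System.out.println(sort(List.of(3, 1, 3))); // prints [1, 3] instead of [1, 3, 3]
        }
    }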
3.4 The RepairThemAll Framework
To run the repair tools on different benchmarks of bugs, we created an execution framework named RepairThemAll, which provides an abstraction around repair tools and benchmarks. Figure 1 illustrates the overview of the framework. It is composed of three main components: 1) the repair tool plug-in, the abstraction around repair tools, allowing the addition and removal of tools; 2) the benchmark plug-in, the abstraction around benchmarks, also allowing the addition and removal of benchmarks; and 3) the repair runner, which works as a façade for the execution of repair tools on specific bugs.

For the creation of the framework, we performed three main tasks, one for each component. First, for the repair tool plug-in, we identified the common parameters that are required by the repair tools, which we refer to as abstract parameters. We then mapped these abstract parameters to the actual parameters of each repair tool. This is necessary because the repair tools use different parameter names and input formats. For instance, an abstract parameter is the source code folder path, and the actual parameter for the source code folder path is DsrcJavaDir in ARJA, but srcjavafolder in jGenProg. We identified eight common parameters for the repair tools: (1) source code path, (2) test path, (3) binary path of the source code, (4) binary path of the tests, (5) the classpath, (6) the Java version (compliance level), (7) the failing test class name, and (8) the workspace directory. RepairThemAll also supports the setting of actual parameters of the repair tools to tune specific executions, i.e., actual parameters that are not mapped to any abstract parameter.

Then, for the benchmark plug-in, we identified the abstract operations that should be performed to use the bugs from benchmarks (e.g. checking out a buggy program given a bug id). We mapped these abstract operations to the actual operations of each benchmark when they are available. We defined three required bug usage operations for using a benchmark with repair tools: (1) checking out a specific bug (buggy source code files) at a given location, (2) compiling the buggy source code and the tests, and (3) providing information on the bug to be given as input to the repair tools (i.e. the eight parameters previously mentioned, except the workspace directory). If the bug is from a multi-module project, the source code and test paths should be relative to the module that contains the bug. Only Defects4J provides bug usage operations that fully cover the required abstract operations. Consequently, we had to build the missing operations on top of the four other benchmarks, e.g. to check out a bug from the QuixBugs benchmark.
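A minimal sketch of the two plug-in contracts, written here in Java for illustration even though RepairThemAll itself is a Python framework; all interface and method names are our assumptions:

    import java.nio.file.Path;
    import java.util.Map;

    // Hypothetical benchmark plug-in: the three required bug usage operations.
    interface BenchmarkPlugin {
        void checkout(String bugId, Path workspace);   // (1) check out the buggy revision
        void compile(String bugId, Path workspace);    // (2) compile sources and tests
        Map<String, String> bugInfo(String bugId);     // (3) the abstract parameters of the bug
    }

    // Hypothetical repair tool plug-in: translates the abstract parameters into
    // the tool's concrete command-line arguments, e.g. the abstract source code
    // folder path becomes DsrcJavaDir for ARJA but srcjavafolder for jGenProg.
    interface ToolPlugin {
        String[] concreteArguments(Map<String, String> abstractParameters);
    }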
[Figure 1: The RepairThemAll framework. The repair runner takes as input a repair tool name, a benchmark name, and a bug id; it drives a benchmark plug-in (abstract operations buginfo, checkout, and compile, mapped to concrete operations) and a tool plug-in (abstract parameters mapped to concrete parameters), checks out and compiles the bug, launches the repair attempts, and produces a normalized output and a repair attempt log.]
Finally, for the repair runner, we designed the input and output in a simple way, so that one can easily interact with the RepairThemAll framework and interpret the results. For the input, one can start an execution of repair tools on benchmarks with a simple command line: for instance, the command ./repair.py Nopol --benchmark Defects4J --id Chart-7 starts the execution of Nopol on the bug Chart-7 of Defects4J. At the end of this execution, the repair runner generates a standardized output divided into two files: the log of the repair attempt execution (repair.log), and a normalized JSON file (results.json) containing the location of the patches generated by the tool, if any, and the textual difference between the patches and their buggy program versions. This standardized output comes from the abstraction around the output format (output normalization) that we created to simplify the analysis and the readability of the results from the different repair tools.

The RepairThemAll framework currently contains 11 repair tools and 5 benchmarks of bugs, and it allows plugging in other ones, helping the repair community to compare different approaches. Moreover, the framework allows users to configure repair tool executions, such as the timeout and the limit on the number of generated patches. It is publicly available at [7], including a tutorial with all the steps to use it and to integrate new repair tools and new benchmarks.
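The paper does not spell out the schema of results.json; a plausible shape, with every field name being an assumption on our part, would be:

    {
      "repair_tool": "Nopol",
      "benchmark": "Defects4J",
      "bug_id": "Chart-7",
      "patches": [
        {
          "patch_location": "path/to/generated/patch",
          "diff": "textual difference between the patch and the buggy program version"
        }
      ]
    }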
3.5 Data Collection and Analysis
To answer our research questions, we executed the 11 repair tools on the 5 benchmarks using RepairThemAll, resulting in patches that are further used for analysis. In this section, we describe the repair tools' setup (Section 3.5.1) and their execution (Section 3.5.2), and the analysis we performed on the repair attempts to determine the possible causes of non-patch generation (Section 3.5.3).
3.5.1 Repair tool setup. For this experiment, we set the time budget to two hours per repair attempt: a repair attempt consists of the execution of one repair tool on one bug. We also configured the repair tools to stop a repair attempt once they have generated one patch. However, ARJA, GenProg-A, Kali-A, RSRepair-A, and NPEFix do not have this option: they stop their repair attempts when they consume their own tentative budget, or by timeout. Moreover, we configured the repair tools to run with one predefined random seed: due to the huge computational power required for this experiment, we were not able to run the repair tools with additional seeds. Finally, Table 4 presents the version of each repair tool that we used in this study. The logs of the repair attempts are available at [8].
Table 4: The version of each repair tool used in this study.

Repair tool                           | Framework | GitHub repository   | Commit id
ARJA, GenProg-A, Kali-A, RSRepair-A   | Arja      | yyxhdy/arja         | e60b990f9
Cardumen, jGenProg, jKali, jMutRepair | Astor     | SpoonLabs/astor     | 26ee3dfc8
DynaMoth, Nopol                       | Nopol     | SpoonLabs/nopol     | 7ba58a78d
NPEFix                                | –         | Spirals-Team/npefix | 403445b9a
3.5.2 Execution. To our knowledge, our experimental setup is the largest in patch generation studies, in terms of the number of repair tools and bugs, and also in execution time. In total, we executed 11 repair tools on 2,141 bugs from 130 open-source projects provided by the 5 selected benchmarks. This represents 23,551 repair attempts, which took 314 days and 12.5 hours of combined execution, almost a year of continuous execution. This experiment would not have been possible without the support of the Grid'5000 [1] cluster, which provided us with the computing power required to conduct this work.
3.5.3 Analysis of non-patch generation. Prior studies on patch generation mainly focus on the ability of approaches to generate patches, and do not investigate the reasons why non-patch generation happens. The study of non-patch generation is important for research progress, so that authors of repair tools can improve their tools. Since there is a lack of knowledge on the subject, we are not able to automatically detect the reasons why patches are not generated for bugs, and therefore manual analysis is required. Due to the scale of our experiment, which includes 23,551 different repair attempts, it is unrealistic to manually analyze each attempt log to understand what happened. We identify the major causes of non-patch generation by analyzing a sample of the repair attempt logs. We did not predefine the sample, because we observed during a preliminary investigation that identical behaviors happen for groups of repair attempts. For instance, we found that, for all bugs from a specific project, all the repair tools have the same issue in the fault localization. For that reason, predefining a sample is not optimal, because we would analyze attempt logs for which we already know the cause of non-patch generation.
4 RESULTS
The results of our empirical study, as well as the answers to our research questions, are presented in this section.
4.1 RQ1: Repairability of the Repair Tools
In this research question, we analyze the repairability of the 11 repair tools on the total of 2,141 bugs. For that, we calculated the number of patched bugs, and also the number of bugs commonly patched by several tools.

[Figure 2: Repairability of the 11 repair tools on 2,141 bugs.]

Figure 2 presents the repairability of the repair tools in descending order of the number of patched bugs. For each tool, it shows the number of bugs uniquely patched by the tool (dark grey), the number of patched bugs that other repair tools also patched (light grey), and the total number of patched bugs with the proportion over all 2,141 bugs included in this study. For instance, Nopol synthesizes patches for
213 bugs in total (9.9% of all bugs), where 57 are uniquely patched by Nopol, and 156 are patched by Nopol and other tools.

We observe that Nopol, DynaMoth, and ARJA are the three tools that generate test-suite adequate patches for the highest number of bugs, with respectively 213, 206, and 146 patched bugs in total. NPEFix, on the other hand, generates patches for the fewest bugs (15). This can be explained by the narrow repair scope of the tool, i.e. bugs exposed by a null pointer exception.

On bugs uniquely patched by tools, we observe that only jKali failed to generate patches for bugs that are not patched by other tools, and that DynaMoth is the tool that patches the highest number of unique bugs. However, NPEFix is the tool with the highest proportion of uniquely patched bugs: 53% of the 15 bugs patched by NPEFix are unique.

The overlap between each pair of repair tools, in number of bugs, is presented in Table 5. Where the column name and the row name are the same (main diagonal), the table presents the number of bugs uniquely patched by the tool. For instance, 20 bugs have been uniquely patched by ARJA, which patched 66 other bugs that are also patched by GenProg-A.

We observe a large overlap between repair tools that share the same patch generation framework, i.e. the framework in which the repair tools are implemented (see Table 4). For instance, ARJA has an overlap of 45% with GenProg-A, 56% with Kali-A, and 55% with RSRepair-A, all implemented in the Arja framework. However, ARJA has an overlap ranging from only 2% to 36% with the other repair tools. DynaMoth has an overlap of 55% with Nopol, but only 0% to 26% with the other tools. Each tool implemented in the Astor framework (e.g. jGenProg) has a big overlap with the other tools in Astor. Moreover, the tools in Astor also present a high overlap with the tools in the Arja framework, which share similar repair approaches.
Answer to RQ1. To what extent do test-suite-based repair tools generate patches for bugs from a diversity of benchmarks? The 11 repair tools are able to generate patches for between 15 and 213 bugs each, out of a total of 2,141 bugs. They are complementary to each other, since 10 of the 11 repair tools patch unique bugs (all but jKali). We also observe that the overlapping repairability of the tools is impacted by the similarity of the repair approaches they implement, and also by the patch generation framework in which they are implemented. The full list of the patched bugs and the textual patches are available at [9].
Table 5: The number of overlapped patched bugs per repair tool. Each row r presents the percentage of overlapped patched bugs of one tool t_r with the rest of the tools. For instance, 45% of the bugs patched by ARJA (row 2) are also patched by GenProg-A (column 3). On the contrary, 85% of the bugs patched by GenProg-A (row 3) are also patched by ARJA (column 2). The main diagonal gives the bugs uniquely patched by each tool.

           | ARJA     | GenProg-A | Kali-A   | RSRepair-A | Cardumen | jGenProg | jKali    | jMutRepair | Nopol     | DynaMoth  | NPEFix
ARJA       | 13% (20) | 45% (66)  | 56% (82) | 55% (81)   | 15% (23) | 30% (44) | 27% (40) | 19% (29)   | 36% (53)  | 32% (48)  | 2% (4)
GenProg-A  | 85% (66) | 3% (3)    | 63% (49) | 81% (63)   | 22% (17) | 40% (31) | 37% (29) | 23% (18)   | 42% (33)  | 40% (31)  | 2% (2)
Kali-A     | 69% (82) | 41% (49)  | 9% (11)  | 46% (55)   | 16% (20) | 28% (34) | 37% (44) | 23% (28)   | 47% (56)  | 45% (54)  | 1% (2)
RSRepair-A | 85% (81) | 66% (63)  | 57% (55) | 5% (5)     | 17% (17) | 38% (37) | 31% (30) | 21% (20)   | 37% (36)  | 36% (35)  | 2% (2)
Cardumen   | 50% (23) | 36% (17)  | 43% (20) | 36% (17)   | 26% (12) | 65% (30) | 45% (21) | 32% (15)   | 21% (10)  | 26% (12)  | 4% (2)
jGenProg   | 67% (44) | 47% (31)  | 52% (34) | 56% (37)   | 46% (30) | 9% (6)   | 55% (36) | 41% (27)   | 29% (19)  | 36% (24)  | 3% (2)
jKali      | 76% (40) | 55% (29)  | 84% (44) | 57% (30)   | 40% (21) | 69% (36) | 0% (0)   | 57% (30)   | 53% (28)  | 67% (35)  | 1% (1)
jMutRepair | 44% (29) | 27% (18)  | 43% (28) | 30% (20)   | 23% (15) | 41% (27) | 46% (30) | 15% (10)   | 58% (38)  | 30% (20)  | 1% (1)
Nopol      | 24% (53) | 15% (33)  | 26% (56) | 16% (36)   | 4% (10)  | 8% (19)  | 13% (28) | 17% (38)   | 26% (57)  | 53% (114) | <1% (2)
DynaMoth   | 23% (48) | 15% (31)  | 26% (54) | 16% (35)   | 5% (12)  | 11% (24) | 16% (35) | 9% (20)    | 55% (114) | 36% (75)  | <1% (1)
NPEFix     | 26% (4)  | 13% (2)   | 13% (2)  | 13% (2)    | 13% (2)  | 13% (2)  | 6% (1)   | 6% (1)     | 13% (2)   | 6% (1)    | 53% (8)
4.2 RQ2: Benchmark Overfitting
In this research question, we compare the repairability of the repair tools on the bugs from the extensively used benchmark Defects4J with their repairability on the other benchmarks included in this study: Bears, Bugs.jar, IntroClassJava, and QuixBugs.

Table 6 shows the number of bugs that have been patched by each repair tool per benchmark. We first observe that Defects4J is the benchmark with the highest number of unique patched bugs (187), which represents 47.34% of all Defects4J bugs. The next most patched benchmarks are QuixBugs, with 30%, and IntroClassJava, with 20.87% of their bugs. This difference can also be observed in the total number of generated patches per benchmark: Defects4J still dominates the ranking with 550 generated patches, even though it contains fewer bugs than Bugs.jar (395 versus 1,158 bugs).

To test whether the repairability of the repair tools is independent of Defects4J, we applied the Chi-square test to the number of patched bugs on Defects4J compared to the other benchmarks. The null hypothesis of our test is that the number of bugs patched by a given tool is independent of Defects4J. We observe in Table 6 that the p-value is smaller than the significance level (α = .05) for all repair tools. Hence, we reject the null hypothesis for the 11 tools, and we conclude that the number of bugs patched by them is dependent on Defects4J. Therefore, repair tools overfit Defects4J.
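As a worked illustration of the test, take ARJA's counts from Table 6: 86 of the 395 Defects4J bugs patched versus 60 of the 1,746 bugs of the other four benchmarks (146 patched in total). A minimal sketch, assuming a 2x2 contingency table of patched/unpatched versus Defects4J/other benchmarks (the paper does not detail the exact table layout used):

    public class ChiSquareSketch {
        public static void main(String[] args) {
            // Rows: Defects4J, other benchmarks; columns: patched, not patched.
            double[][] observed = { { 86, 395 - 86 }, { 60, 1746 - 60 } };
            double n = 2141;
            double chi2 = 0;
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {
                    double rowSum = observed[i][0] + observed[i][1];
                    double colSum = observed[0][j] + observed[1][j];
                    double expected = rowSum * colSum / n;
                    chi2 += Math.pow(observed[i][j] - expected, 2) / expected;
                }
            }
            // chi2 is about 170 with 1 degree of freedom, i.e. p << .05, so the
            // null hypothesis of independence is rejected for ARJA.
            System.out.printf("chi^2 = %.1f%n", chi2);
        }
    }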
The repairability of the repair tools on Defects4J cannot be explained by the repair approaches alone. We raise three hypotheses that can potentially explain the repairability difference between Defects4J and the other benchmarks: 1) there is a technical problem in the repair tools, 2) the bug fix isolation performed on Defects4J has an impact on repairing Defects4J bugs, and 3) the distribution of bug types in Defects4J is different from the other benchmarks.

1. [Technical problems in the repair tools] In RQ1, we observed the importance of the implementation of the tools for repairability. One hypothesis that can explain the fact that repair tools overfit Defects4J is that the authors of the repair tools have debugged and tuned their frameworks for Defects4J and, consequently, significantly improved the repairability of their tools on this specific benchmark. For instance, they may have paid attention to not letting the dependencies of the repair tools interfere with the classpath of the Defects4J bugs, in order to preserve the behavior of test executions on the Defects4J bugs. However, this issue can affect the bugs of other benchmarks.
2. [Bug fix isolation performed on Defects4J] The second hypothesis is related to the way Defects4J was created. A bug fixing commit might include changes that are not related to the actual bug fix. Given a bug fixing commit, the authors of Defects4J therefore recreated the buggy and patched program versions so that the diff between the two versions contains only changes related to the bug fix: this is called bug fix isolation. The resulting isolated bug fixes facilitate studies on patches [39]. However, such a procedure can potentially have an impact on the repairability of the repair tools. For instance, by comparing the developer patch [15] with the Defects4J patch [5] for the bug Closure-51, we observe that the method isNegativeZero has been introduced in the buggy program version, and it contains part of the logic for fixing the bug. The presence of this method in the buggy program version can simplify the generation of patches by the repair tools, or introduce an ingredient for genetic-programming-based repair approaches.
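A simplified sketch of this effect; the code below is illustrative and not the real Closure compiler source:

    // After bug fix isolation, the "buggy" version already contains the helper
    // introduced by the bug-fixing commit; only its use is missing.
    class CodeGeneratorSketch {
        // Helper already present in the Defects4J buggy version:
        static boolean isNegativeZero(double x) {
            return x == 0.0 && Double.doubleToLongBits(x) != 0L;
        }

        static String printNumber(double x) {
            // Buggy: prints -0.0 as "0". A guard calling isNegativeZero(x)
            // fixes it, and the helper is a ready-made ingredient for repair tools.
            long l = (long) x;
            return (l == x) ? Long.toString(l) : String.valueOf(x);
        }
    }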
3. [Bug type distribution in the benchmarks] Our final hypothesis is related to the distribution of the bugs in the different benchmarks. Defects4J might contain more bugs that can be patched by the repair tools compared to the other benchmarks. For that reason, the bug type distribution of each benchmark should be further analyzed and correlated with the repairability of the tools.

To our understanding, the first hypothesis is the most plausible, since we observed in RQ1 that the implementation of the repair tools has an impact on their repairability. However, additional studies should be designed to identify which hypothesis, or which combination of hypotheses, explains the repairability of the repair tools on Defects4J compared to the other benchmarks.
Answer to RQ2. Is the repair tools' repairability similar across benchmarks? There is a difference in the repairability of the 11 repair tools across benchmarks. Indeed, the repairability of all tools is significantly higher for bugs from Defects4J compared to the other four benchmarks, therefore we conclude that they overfit Defects4J. In addition, we raised three hypotheses that might explain this difference. The confirmation of those hypotheses is a full contribution in itself, therefore our study opens the opportunity for several future investigations.
Table 6: Number of bugs patched by at least one test-suite adequate patch, and the p-value of the Chi-square test of independence between the number of patched bugs from Defects4J and from the other benchmarks.

Repair tool | Bears (251) | Bugs.jar (1,158) | Defects4J (395) | IntroClassJava (297) | QuixBugs (40) | Total (2,141) | p-value
ARJA        | 12 (4%)     | 21 (1%)          | 86 (21%)        | 23 (7%)              | 4 (10%)       | 146 (7%)      | < .05
…           | …           | …                | …               | …                    | …             | …             | < .05 for every tool

Table 7: Percentage of repair attempts that failed with an error.
Repair tool | Bears | Bugs.jar | Defects4J | IntroClassJava | QuixBugs | Average
ARJA        | 24.70 | 49.56    | 1.26      | 0              | 0        | 29.93
GenProg-A   | 88.04 | 78.06    | 7.08      | 0              | 2.5      | 53.90
Kali-A      | 24.70 | 50.08    | 4.81      | 0              | 0        | 30.87
RSRepair-A  | 87.25 | 79.27    | 6.83      | 0              | 2.5      | 54.41
Cardumen    | 47.41 | 70.46    | 48.60     | 0              | 5.0      | 52.73
jGenProg    | 45.01 | 63.29    | 12.65     | 0              | 5.0      | 41.94
jKali       | 44.62 | 64.42    | 12.40     | 0              | 5.0      | 42.45
jMutRepair  | 72.11 | 66.66    | 15.44     | 13.46          | 22.5     | 49.64
Nopol       | 28.68 | 60.27    | 45.31     | 0              | 2.5      | 44.37
DynaMoth    | 27.09 | 46.97    | 4.30      | 0              | 0        | 29.37
NPEFix      | 89.24 | 86.18    | 73.16     | 0              | 2.5      | 70.62
Average     | 52.62 | 65.02    | 21.08     | 1.22           | 4.31     | 45.48
4.3 RQ3: Causes of Non-Patch Generation
In this final research question, we analyze the repair attempts that did not result in patches, and we identify the causes of non-patch generation. The goal of this research question is to give the automatic repair community insights into the causes of non-patch generation, so that authors of repair tools can improve their tools.

Table 7 and Table 8 present the percentage of repair attempts that finished due to an error and by timeout, respectively. They show that the repair attempts ending in an error or a timeout represent the majority of all repair attempts (56.49%). The Bugs.jar benchmark is the main contributor to this percentage: the size and complexity of the Bugs.jar projects show the limitations of the current automatic patch generation tools. Moreover, Table 7 shows that NPEFix is the tool with the highest error rate; this tool crashes when no null pointer exception is found in the execution of the failing test case that exposes a bug. Regarding the timeouts in Table 8, jGenProg and Cardumen are the tools most prone to reaching the timeout.
Table 8: Percentage of repair attempts that failed by timeout.

Repair tool | Bears | Bugs.jar | Defects4J | IntroClassJava | QuixBugs | Average
ARJA        | 19.52 | 18.56    | 6.07      | 0              | 0        | 13.45
GenProg-A   | 6.37  | 7.08     | 9.62      | 0              | 5.0      | 6.44
Kali-A      | 1.19  | 2.76     | 0         | 0              | 0        | 1.63
RSRepair-A  | 7.17  | 6.99     | 8.86      | 0              | 5.0      | 6.35
Cardumen    | 4.38  | 61.57    | 19.74     | 0              | 2.5      | 37.50
jGenProg    | 48.20 | 28.23    | 71.39     | 83.83          | 85.0     | 47.31
jKali       | 0.79  | 4.05     | 1.77      | 0              | 0        | 2.61
jMutRepair  | 0.39  | 3.45     | 1.01      | 0              | 0        | 2.10
Nopol       | 0.39  | 0.51     | 0         | 0              | 0        | 0.32
DynaMoth    | 0     | 0.69     | 2.27      | 0              | 0        | 0.79
NPEFix      | 0     | 4.49     | 0.75      | 0              | 0        | 2.56
Average     | 8.04  | 12.58    | 11.04     | 7.62           | 8.86     | 11.01
We then manually analyzed the execution traces of the repair attempts [8] to identify the causes of non-patch generation. The methodology for this analysis is described in Section 3.5.3; by following it, we identified six causes of non-patch generation.
1. [The repair tool cannot repair the bug]
A logical cause is that the repair tools do not have a patch that fixes the bug in their search space. For instance, NPEFix is not able to generate patches for bugs that are not related to a null pointer exception. jGenProg is not able to generate a patch when the repair ingredient is not in the source code of the application, which happens frequently for small programs like the ones in QuixBugs. New repair approaches should be created to handle this cause of non-patch generation.
2. [Incorrect fault localization]
When the fault localization does not succeed in identifying the location of the bug, the repair tools cannot generate a patch for it. This can be due to a limitation of the fault localization approach, or to the suspiciousness threshold that the repair tools use. Moreover, we identified cases where test cases that should pass are failing, and consequently the fault localization is misled. For instance, the fault localization fails on all bugs from the INRIA/spoon project (from the Bears benchmark), because it does not succeed in loading a test resource, and consequently the passing test cases fail.
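Many of these tools rank statements with spectrum-based fault localization; Ochiai is a commonly used suspiciousness metric, sketched below (a generic illustration — the paper does not state which formula each tool uses):

    public class OchiaiSketch {
        // ef: failing tests that execute the statement; nf: failing tests that
        // do not execute it; ep: passing tests that execute the statement.
        static double ochiai(int ef, int nf, int ep) {
            double denominator = Math.sqrt((double) (ef + nf) * (ef + ep));
            return denominator == 0 ? 0 : ef / denominator;
        }

        public static void main(String[] args) {
            // A faulty statement covered by the single failing test but also by
            // many passing tests scores low, and may fall under a tool's
            // suspiciousness threshold, so the real fix location is never tried:
            System.out.println(ochiai(1, 0, 99)); // 0.1
        }
    }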
3. [Multiple fault locations]
Developers frequently fix a bug at more than one location in the source code: we refer to this type of bug as a multi-location bug. However, most current repair tools and fault localization tools do not support multi-location bugs. For instance, the bug Math-1 from Defects4J has to be fixed in the exact same way at two different locations, and the two locations are specified by two failing test cases. The current tools consider that the two failing test cases specify the same bug at the same location, and consequently do not succeed in generating a multi-location patch.
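Schematically (an illustrative invention, not the real Commons Math code), the situation looks like this:

    // Two classes carrying the identical defect; each location is exposed by
    // its own failing test case.
    class FractionSketch {
        static long wholePart(double value) {
            return (int) value; // BUG: int truncation overflows for large values
            // FIX: return (long) value;
        }
    }

    class BigFractionSketch {
        static long wholePart(double value) {
            return (int) value; // BUG: the same defect at a second location
            // FIX: the exact same change as above
        }
    }

    // A tool that patches only one location still fails the other failing
    // test, so the repair attempt produces no test-suite adequate patch.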
4. [Too small time budget]
We observe that some of the repair attempts end by consuming the entire time budget. Considering the size of this experiment, it is not realistic to drastically increase the time budget. However, new approaches and optimizations could minimize this problem. In this study, we detected 2,593 repair attempts that failed by timeout. It is not possible to predict the outcome of those attempts, but a previous study [28] showed that an additional time budget might result in a higher number of patches generated by genetic programming approaches. However, in our experiment, the repair tools require 13.5 minutes on average to generate a patch, which is significantly lower than the allocated time budget (two hours).
5. [Incorrect configuration]
We also observe that the RepairThemAll framework does not always succeed in correctly computing the parameters of some bugs to give as input to the repair tools, such as the compliance level, the source folders, and the failing test cases. This results in failing repair attempts, which can be due to a bug in RepairThemAll or to an impossibility to compile the bug. For instance, NPEFix fails 215 times because of issues related to the classpath (all occurrences of InvalidClassPathException in our execution: https://git.io/fjRax).
6. [Other technical issues]
The final cause of non-patch generation is related to other technical limitations that prevent the execution of the repair tools. One of them concerns too-long command lines. The repair tools are executed from the command line, which means that all parameters must be provided in the command line. However, the size of the command line is limited, and for projects that have a long classpath, the operating system denies the execution of the command, which results in failing repair attempts. On Bugs.jar, for instance, 200 repair attempts finished with the error [Errno 7] Argument list too long (the affected repair attempts: https://git.io/fjRap). Finally, there are also other diverse issues that cause the repair tools to crash. For instance, jGenProg finished its repair attempt on the bug Flink-6bc6dbec from Bugs.jar with a
NullPointerException (log of the repair attempt: https://git.io/fjRab).

Answer to RQ3. What are the causes that lead repair attempts to not generate patches? Through an analysis of the logs of the repair attempts, we identified six causes of non-patch generation, such as incorrect fault localization. Each cause should be investigated in detail in future studies. Moreover, repair tool designers are also stakeholders in those causes, which inform them of the weaknesses of their tools and help them to understand the results of their previous evaluations.
5 DISCUSSION
Diversity of program repair benchmarks. In RQ2, we found that all 11 Java repair tools included in this study perform significantly better on the bugs from Defects4J than on the bugs from the other benchmarks. Consequently, repair tool evaluations that only use Defects4J carry a threat to external validity, since the repairability results cannot be generalized to other benchmarks. We conclude that future tools should be evaluated on diverse benchmarks to mitigate this threat.
Impact of the repair tools' engineering on repairability. While conducting this study, we observed that the implementations of the repair approaches play an important role in their ability to repair bugs. For instance, jKali and Kali-A share the same approach, but they have neither the same implementation nor the same results (see Table 6): Kali-A patches 118 bugs while jKali patches 52, with the same input. Note that this observation is also corroborated by the analysis of non-patch generation, where a significant number of causes are not related to the repair approaches themselves, but to their implementations. This observation highlights a potential bias in empirical studies on automatic program repair that compare the repairability of different repair approaches: based on this observation, those studies only compare the effectiveness of the repair tools, not of the approaches themselves.
Challenges of creating RepairThemAll. The main challenges we faced to run the repair tools are related to the creation of the RepairThemAll framework. First, we checked all test-suite-based repair tools for Java against our criteria (e.g. availability) so that we could find the suitable tools for our study (see Table 2). Then, we had to understand the repair tools we gathered, which required manual source code analysis so that we could compile them and find their inputs and requirements: the tools are diverse, sometimes undocumented, and implemented by different researchers. Once we understood the repair tools, we could plug them into the RepairThemAll framework, which contains the abstraction around the tools. Those challenges are mainly due to the lack of well-organized open-science repositories for all the repair tools. Good documentation, examples, and instructions on how to compile the tools can speed up the process of learning how to execute repair tools.
The observed repairability compared to the previous evaluations. Table 1 shows the test-suite-based repair tools for Java and the repairability results from their previous evaluations. Those results are difficult to compare with the results of our study, because the previous evaluations on Defects4J did not consider all bugs from the benchmark. On Defects4J, only Cardumen patches fewer bugs in this study compared to its previous evaluation. This can be explained by differences in the setup (such as the number of random seeds considered in the study) and by potential bugs in the version of Cardumen we use. According to those results, RepairThemAll configures the repair tools correctly for generating patches, since no major drawbacks have been observed.
Threats to validity. As with any implementation, the RepairThemAll framework is not free of bugs. A bug in the framework might impact the results we report in Section 4. However, the framework and the raw data are publicly available for other researchers and potential users to check the validity of the results.
This study focuses on test-suite adequate patches, which means that the generated patches make the test suite pass; yet, there is no guarantee that they fix the bugs. Studying patch correctness [19, 44, 49] is out of the scope of this work. Our goal is to analyze the current state of automatic program repair tools and to identify potential flaws and improvements. The conclusions of our study do not require knowledge of the correctness of the patches.

Our goal is to have a full picture of the test-suite-based repair tools for Java. In our literature review, presented in Section 2, we found 24 repair tools that compose this full picture. Our study was conducted considering only 11 of them; note that this is still the largest experiment in terms of number of repair tools (and benchmarks). However, we do not have the full picture we aimed for, which is a threat to the external validity of our results. Most of the repair tools that we did not include in our study simply cannot be run: for instance, PAR [17] is not even available. Open-source tools allow the community to generate knowledge in several directions. In our work, open-source tools allowed us to perform a novel evaluation of the state of the repair tools. Another direction is to help the development of new tools: for instance, DeepRepair [43] is built on top of Astor [27], which is a library for program repair.
6 RELATED WORK
The works related to ours are empirical studies on the repairability of multiple automatic program repair tools. Repair tools for C programs were the subject of investigation in the first empirical studies on automatic program repair. Qi et al. [35] introduced the idea of plausible (i.e. test-suite adequate) versus correct patches. They studied the patch plausibility and correctness of four generate-and-validate repair tools on the bugs from the GenProg benchmark [20] (which later became a part of the ManyBugs benchmark [21]). They found that a small number of bugs are fixed by correct patches. An incorrect, test-suite adequate patch is known as an overfitting patch, because it overfits the test suite. This problem was named the overfitting problem by Smith et al. [38], who studied it in the context of two generate-and-validate C repair tools on the bugs from the IntroClass benchmark [21]. They found that even using high-quality, high-coverage test suites results in overfitting patches. Later, Le et al. [19] analyzed the overfitting problem for semantics-based repair techniques for the C language. Their study also investigates how test suite size and provenance, the number of failing tests, and semantics-specific tool settings can affect overfitting.

For Java, Martinez et al. [26] reported on a remarkably large experiment, where three repair tools were executed on the bugs from Defects4J. The focus of their study was to measure the repairability of the repair tools and to find correct patches by manual analysis. They also found that only a small number of bugs (9/47) could be repaired with a test-suite adequate patch that is also correct. Ye et al. [48] presented a study where nine repair tools were executed on the bugs from QuixBugs. They used automatically generated test cases, based on the human-written patches, to identify incorrect patches generated by the repair tools.

Motwani et al. [33] reported on an empirical study that included seven repair tools for both the Java and C languages, using the Defects4J and ManyBugs benchmarks. They had a different focus: they investigated whether the bugs repaired by repair tools are hard and important. To do so, they used the repairability data from
To do so, they used the repairability data from previous works, and they performed a correlation analysis between the repaired bugs and measures of defect importance, human patch complexity, and test suite quality.

Table 9: Empirical studies on repair tools. [Table body not recoverable from the extraction; only the column headers "Work" and "Language" survive.]

Table 9 summarizes the mentioned studies. The main difference between our study and all the previous ones is the goal. Previous works focused on the patch overfitting problem and on advanced/correlation analyses between the repairability of tools and bug-related characteristics. We introduce the benchmark overfitting problem, which is investigated in this paper together with the causes of non-patch generation. Moreover, the scale of our study, in both repair tools and benchmarks, is much larger than that of previous studies.
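To make the overfitting problem concrete, consider the following contrived Java example (it is ours, not drawn from any benchmark or study above). A weak test suite cannot distinguish a correct patch from one that merely special-cases the tested input:

// Contrived illustration of an overfitting patch; hypothetical example,
// not taken from any benchmark.
public class AbsExample {
    // Buggy version: wrong for every negative input.
    static int abs(int x) { return x; }

    // Overfitting patch: passes the test below by special-casing the
    // single tested input, but stays wrong for all other negatives.
    static int absOverfitting(int x) { return x == -5 ? 5 : x; }

    // Correct patch: a test suite this weak cannot tell it apart
    // from the overfitting one.
    static int absCorrect(int x) { return x < 0 ? -x : x; }
}

class AbsExampleTest {
    @org.junit.Test
    public void testAbs() {
        // Both patches are test-suite adequate for this suite,
        // yet only absCorrect is correct.
        org.junit.Assert.assertEquals(5, AbsExample.absOverfitting(-5));
        org.junit.Assert.assertEquals(5, AbsExample.absCorrect(-5));
    }
}

Any additional negative input exposes the overfitting patch, e.g. absOverfitting(-7) returns -7; this is the intuition behind using automatically generated tests to identify incorrect patches, as done by Ye et al. [48].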
CONCLUSION
In this paper, we presented an empirical study including 11 repair tools and 2,141 bugs from 5 benchmarks. In total, 23,551 repair attempts were performed: to our knowledge, this is the largest experiment of its kind. The goal of our experiment is to obtain an overview of the current state of repair tools for Java in practice. For that, we scaled up previous experiments by considering more benchmarks of bugs, which combined contain bugs from 130 projects, collected with different strategies.

We found that the repair tools are able to repair bugs from benchmarks that were not initially used for their evaluation. However, our results suggest that all repair tools overfit Defects4J. Finally, we analyzed why the repair tools do not succeed in generating patches: this analysis resulted in six different causes that can inform the future development of repair tools.

Our study opens several opportunities for future investigation. First, our hypotheses on why the repair tools perform better on Defects4J can be further confirmed. For instance, one of the hypotheses is that the buggy program versions were changed in Defects4J due to bug fix isolation. A study to confirm this hypothesis is a full contribution in itself. Second, other repair tools can be executed to aggregate and scale up our study. ssFix, for instance, can in principle be run, although the issues we faced with it led to its exclusion from this work. Moreover, the tools that are hardcoded to run on Defects4J could be adapted to work with other benchmarks of bugs. Finally, an investigation of the bug type distribution in the benchmarks should also be conducted. This would provide information on how many bugs a repair tool is actually able to fix, i.e. by identifying the bugs that fall within the repair tool's targeted bug classes.
ACKNOWLEDGMENTS
We acknowledge CAPES for partially funding this research, and Marcelo Maia for discussions. This material is based upon work supported by Fundação para a Ciência e a Tecnologia (FCT), with the references PTDC/CCI-COM/29300/2017 and UID/CEC/50021/2019.
REFERENCES
[1] Daniel Balouek, Alexandra Carpen Amarie, Ghislain Charrier, Frédéric Desprez, Emmanuel Jeannot, Emmanuel Jeanvoine, Adrien Lèbre, David Margery, Nicolas Niclausse, Lucas Nussbaum, Olivier Richard, Christian Pérez, Flavien Quesnel, Cyril Rohr, and Luc Sarzyniec. 2013. Adding Virtualization Capabilities to the Grid'5000 Testbed. In Cloud Computing and Services Science, Ivan I. Ivanov, Marten van Sinderen, Frank Leymann, and Tony Shan (Eds.). Communications in Computer and Information Science, Vol. 367. Springer International Publishing, Cham, 3–20.
[2] Liushan Chen, Yu Pei, and Carlo A. Furia. 2017. Contract-Based Program Repair without the Contracts. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE '17). IEEE Press, Piscataway, NJ, USA, 637–647.
[3] Valentin Dallmeier and Thomas Zimmermann. 2007. Extraction of Bug Localization Benchmarks from History. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE '07). ACM, New York, NY, USA, 433–436.
[4] Vidroha Debroy and W. Eric Wong. 2010. Using Mutation to Automatically Suggest Fixes for Faulty Programs. In Proceedings of the 2010 Third International Conference on Software Testing, Verification and Validation (ICST '10). IEEE Computer Society, Washington, DC, USA, 65–74.
[5] Defects4J. 2011. Defects4J patch for Closure-51 bug. http://program-repair.org/defects4j-dissection/
[6] Thomas Durieux, Benoit Cornu, Lionel Seinturier, and Martin Monperrus. 2017. Dynamic Patch Generation for Null Pointer Exceptions Using Metaprogramming. In Proceedings of the 24th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '17). IEEE, Klagenfurt, Austria, 349–358.
[7] Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. The RepairThemAll framework repository. https://github.com/program-repair/RepairThemAll.
[8] Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. The repair attempts' results. https://github.com/program-repair/RepairThemAll_experiment.
[9] Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. Website for browsing the generated patches. http://program-repair.org/RepairThemAll_experiment.
[10] Thomas Durieux and Martin Monperrus. 2016. DynaMoth: Dynamic Code Synthesis for Automatic Program Repair. In Proceedings of the 11th International Workshop on Automation of Software Test (AST '16). ACM, New York, NY, USA, 85–91.
[11] Thomas Durieux and Martin Monperrus. 2016. IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs. Technical Report.
[12] In Proceedings of the 12th International Conference on Software Testing, Verification, and Validation (ICST '19). IEEE Computer Society, Washington, DC, USA, 1–12.
[13] Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards Practical Program Repair with On-Demand Candidate Generation. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 12–23.
[14] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping Program Repair Space with Existing Patches and Similar Code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '18). ACM, New York, NY, USA, 298–309.
[15] johnlenz. 2011. Human patch for Defects4J Closure-51 bug. https://github.com/google/closure-compiler/commit/a02241e5df48e44e23dc0e66dbef3fdc3c91eb3e.
[16] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 23rd International Symposium on Software Testing and Analysis (ISSTA '14). ACM, New York, NY, USA, 437–440.
[17] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic Patch Generation Learned from Human-Written Patches. In Proceedings of the 35th International Conference on Software Engineering (ICSE '13). IEEE Press, Piscataway, NJ, USA, 802–811.
[18] Xuan Bach D. Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. In Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '16). IEEE, Suita, Japan, 213–224.
[19] Xuan-Bach D. Le, Ferdian Thung, David Lo, and Claire Le Goues. 2018. Overfitting in semantics-based automated program repair. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 163–163.
[20] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 3–13.
[21] Claire Le Goues, Neal Holtschulte, Edward K. Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. 2015. The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Transactions on Software Engineering 41, 12 (Dec. 2015), 1236–1256.
[22] Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge. In Proceedings of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH Companion 2017). ACM, New York, NY, USA, 55–56.
[23] Kui Liu, Anil Koyuncu, Kisub Kim, Dongsun Kim, and Tegawendé F. Bissyandé. 2018. LSRepair: Live Search of Fix Ingredients for Automated Program Repair. In Proceedings of the 25th Asia-Pacific Software Engineering Conference (APSEC '18). IEEE Computer Society, Washington, DC, USA, 1–5.
[24] Xuliang Liu and Hao Zhong. 2018. Mining StackOverflow for Program Repair. In Proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '18). IEEE, Campobasso, Italy, 118–129.
[25] Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '19). IEEE, Hangzhou, China, 468–478.
[26] Matias Martinez, Thomas Durieux, Romain Sommerard, Jifeng Xuan, and Martin Monperrus. 2017. Automatic Repair of Real Bugs in Java: A Large-scale Experiment on the Defects4J Dataset. Empirical Software Engineering 22, 4 (Aug. 2017), 1936–1964.
[27] Matias Martinez and Martin Monperrus. 2016. ASTOR: A Program Repair Library for Java. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA '16), Demonstration Track. ACM, New York, NY, USA, 441–444.
[28] Matias Martinez and Martin Monperrus. 2018. Ultra-Large Repair Search Space with Automatically Mined Templates: the Cardumen Mode of Astor. In Proceedings of the 10th International Symposium on Search-Based Software Engineering (SSBSE '18), Lecture Notes in Computer Science, Vol. 11036, Thelma Elita Colanzi and Phil McMinn (Eds.). Springer International Publishing, Cham, 65–86.
[29] Matias Martinez, Westley Weimer, and Martin Monperrus. 2014. Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches. In Proceedings of the 36th International Conference on Software Engineering (ICSE Companion 2014). ACM, New York, NY, USA, 492–495.
[30] Martin Monperrus. 2014. A Critical Review of "Automatic Patch Generation Learned from Human-Written Patches": Essay on the Problem Statement and the Evaluation of Automatic Software Repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE '14). ACM, New York, NY, USA, 234–242.
[31] Martin Monperrus. 2018. Automatic Software Repair: a Bibliography. Comput. Surveys 51, 1, Article 17 (Jan. 2018), 24 pages.
[32] Martin Monperrus. 2018. The Living Review on Automated Program Repair. Technical Report hal-01956501. HAL/archives-ouvertes.fr.
[33] Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun. 2018. Do automated program repair techniques repair hard and important bugs? Empirical Software Engineering 23, 5 (Oct. 2018), 2901–2947.
[34] Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The Strength of Random Search on Automated Program Repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE '14). ACM, New York, NY, USA, 254–265.
[35] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An Analysis of Patch Plausibility and Correctness for Generate-and-Validate Patch Generation Systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA '15). ACM, New York, NY, USA, 24–36.
[36] Ripon K. Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R. Prasad. 2018. Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR '18). ACM, New York, NY, USA, 10–13.
[37] Ripon K. Saha, Yingjun Lyu, Hiroaki Yoshida, and Mukul R. Prasad. 2017. ELIXIR: Effective Object-Oriented Program Repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE '17). IEEE Press, Piscataway, NJ, USA, 648–659.
[38] Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE '15). ACM, New York, NY, USA, 532–543.
[39] Victor Sobreira, Thomas Durieux, Fernanda Madeiral, Martin Monperrus, and Marcelo A. Maia. 2018. Dissection of a Bug Dataset: Anatomy of 395 Patches from Defects4J. In Proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '18). IEEE, Campobasso, Italy, 130–140.
[40] Shin Hwei Tan, Jooyong Yi, Yulis, Sergey Mechtaev, and Abhik Roychoudhury. 2017. Codeflaws: A Programming Competition Benchmark for Evaluating Automated Program Repair Tools. In Proceedings of the 39th International Conference on Software Engineering Companion (ICSE-C '17). IEEE Press, Piscataway, NJ, USA, 180–182.
[41] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically Finding Patches Using Genetic Programming. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 364–374.
[42] Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-Aware Patch Generation for Better Automated Program Repair. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 1–11.
[43] Martin White, Michele Tufano, Matias Martinez, Martin Monperrus, and Denys Poshyvanyk. 2019. Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '19). IEEE, Hangzhou, China, 479–490.
[44] Qi Xin and Steven P. Reiss. 2017. Identifying Test-Suite-Overfitted Patches through Test Case Generation. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '17). ACM, New York, NY, USA, 226–236.
[45] Qi Xin and Steven P. Reiss. 2017. Leveraging Syntax-Related Code for Automated Program Repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE '17). IEEE Press, Piscataway, NJ, USA, 660–670.
[46] Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise Condition Synthesis for Program Repair. In Proceedings of the 39th International Conference on Software Engineering (ICSE '17). IEEE Press, Piscataway, NJ, USA, 416–426.
[47] Jifeng Xuan, Matias Martinez, Favio DeMarco, Maxime Clément, Sebastian Lamelas, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2016. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering 43, 1 (April 2016), 34–55.
[48] He Ye, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2019. A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark. In International Workshop on Intelligent Bug Fixing (IBF '19, co-located with SANER). IEEE, Hangzhou, China, 1–10.
[49] Zhongxing Yu, Matias Martinez, Benjamin Danglot, Thomas Durieux, and Martin Monperrus. 2019. Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System. Empirical Software Engineering 24, 1 (Feb. 2019), 33–67.
[50] Yuan Yuan and Wolfgang Banzhaf. 2018. ARJA: Automated Repair of Java Programs via Multi-Objective Genetic Programming. IEEE Transactions on Software Engineering (2018).