IBIR: Bug Report driven Fault Injection
Ahmed Khanfir, Anil Koyuncu, Mike Papadakis, Maxime Cordy, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon
SnT, University of Luxembourg, Luxembourg
{firstname.surname}@uni.lu

Abstract—Much research on software engineering and software testing relies on experimental studies based on fault injection. Fault injection, however, is not often relevant to emulate real-world software faults since it "blindly" injects large numbers of faults. It remains indeed challenging to inject few but realistic faults that target a particular functionality in a program. In this work, we introduce IBIR, a fault injection tool that addresses this challenge by exploring change patterns associated to user-reported faults. To inject realistic faults, we create mutants by re-targeting a bug report driven automated program repair system, i.e., reversing its code transformation templates. IBIR is further appealing in practice since it requires deep knowledge of neither the code nor the tests, but just of the program's relevant bug reports. Thus, our approach focuses the fault injection on the feature targeted by the bug report. We assess IBIR by considering the Defects4J dataset. Experimental results show that our approach outperforms the fault injection performed by traditional mutation testing in terms of semantic similarity with the original bug, when applied at either system or class levels of granularity, and provides better, statistically significant, estimations of test effectiveness (fault detection). Additionally, when injecting 100 faults, IBIR injects faults that couple with the real ones in 36% of the cases, while mutation testing achieves this in less than 1%. Overall, IBIR targets real functionality and injects realistic and diverse faults.

I. INTRODUCTION
A key challenge of fault injection techniques (such as mutation analysis) is to emulate the effects of real faults. This property of representativeness of the injected faults is of particular importance since fault injection techniques are widely used by researchers when evaluating and comparing bug finding, testing and debugging techniques, e.g., test generation, bug fixing, fault localisation, etc. [1]. This means that there is a high risk of mistakenly asserting test effectiveness in case the injected faults are non-representative.

Typically, fault injection techniques introduce faults by making syntactic changes in the target programs' code using a set of simple syntactic transformations [2]–[4], usually called mutation operators. These transformations have been defined based on the language syntax [5] and "blindly" mutate the entire codebase of the projects, injecting large numbers of mutants, with the hope of injecting some realistic faults. This means that there is limited control over the fault types and the locations where to inject faults. In other words, the appropriate "what" and "where" to inject faults in order to make representative fault injection has been largely ignored by existing research.

Fault injection techniques may also draw on recent research that mines fault patterns [6], [7] and demonstrates some form of realism w.r.t. real faults. These results are encouraging because they indicate that the injected faults may carry over the realism of the patterns. This may remove a potential validity threat but, at the same time, it is limited as it does not provide any control over the locations and target functionality, thus impacting fault representativeness [3], [8], [9].

This is an important limitation especially for large real-world systems because of the following two reasons: a) injecting faults everywhere escalates the application cost due to the large number of mutants introduced, and b) the results could be misleading since a tiny ratio of the injected faults are coupled to the real ones [9] and the injected set of faults does not represent the likelihood of faults appearing in the field [3]. Therefore, representativeness of the injected faults in terms of fault types and locations is of utmost importance w.r.t. both application cost and accuracy of the method.

To bypass these issues, one could use real faults (mined from the projects' repositories) or directly apply the testing approach to a set of programs and manually identify potential faults. While such a solution brings realism into the evaluations, it is often limited to few fault instances (of limited diversity), requires expensive manual effort in identifying the faults and fails to offer the experimental control required by many evaluation scenarios.

We advance in this research direction by bringing realism into fault injection via leveraging information from bug reports. Bug reports often include sufficient information for debugging techniques to localize [10], debug [11] and repair faults [12] that happened in the field. Therefore, together with specially crafted defect patterns (mined through systematic examination of real faults), such information can guide fault injection to target critical functionality, mimic real faulty behaviour and make realistic fault injection. Perhaps more importantly, the use of bug reports removes the need for knowledge of the targeted system or code.

Our method starts from the target project and a bug report written in natural language.
It then applies Information Retrieval (IR)-based fault localization [10] in order to identify the relevant places where to inject faults. It then injects recurrent fault instances (fault patterns) that were manually crafted using a systematic analysis of frequent bug fixes, prioritized according to their position and type. This way our method performs fault injection, using realistic fault patterns, by targeting the features described by the bug reports.

We implemented our approach in a system called IBIR and evaluated its ability to imitate 157 real faults. In particular, we evaluated a) the semantic similarity of real and injected faults, b) the coupling relation between injected and real faults, and c) the ability of the injected faults to indicate test effectiveness (fault detection) when tested with different test suites. Our results show that IBIR manages to imitate the targeted faults, with a median semantic similarity value of 0.58, which is significantly higher than the 0.0 achieved by traditional mutation testing when injecting the same number of faults. Interestingly, we found that IBIR injects faults that couple with the real ones in 36% of the targeted cases (injected faults couple with the real ones when they are detected only by test cases that detect the real faults; this implies that the injected faults provide good indications on whether tests are capable of detecting the coupled faults). This is achieved by injecting 100 faults per target (real) fault and it is approximately 50 times higher than the coupled mutants produced by mutation testing. Fault coupling is one of the most important testing properties [13], [14], here indicating that one can use the injected faults instead of the real ones.

Another key finding of our study is that the injected faults provide much better indication of test effectiveness (fault detection) than mutation testing, as their detection ratios discriminate between actual failing and passing test suites, while mutant detection rates cannot. This implies that the use of IBIR yields more accurate results than the use of traditional mutation testing.

Overall, our primary contributions are:
• We introduce the notion of bug report-driven fault injection. Bug reports can be used to inject realistic faults.
• We introduce a set of mutation operators based on frequently used patch patterns that are reverted to inject realistic faults.
• We present IBIR, an automatic fault injection method, which is driven by bug reports to emulate real faults.
• We provide empirical evidence demonstrating that IBIR outperforms the current state of practice in mutation testing w.r.t. fault representativeness and coupling.

II. BACKGROUND
A. Fault Localization
Fault localization is the activity of identifying the suspected fault locations, which will be transformed to generate patches. Several automated fault localization techniques have been proposed [15], such as slice-based [16], spectrum-based [17], statistics-based [18] and mutation-based [11] ones.

Fault localisation techniques based on Information Retrieval (IR) [19]–[22] exploit textual bug reports to identify code chunks relevant to the bug, without relying on test cases. IR-based fault localisation tools extract tokens from the bug report to formulate a query to be matched with the collection of documents formed by the source code files [10], [23]–[27]. Then, they rank the documents based on their relevance to the query, such that source files ranked higher are more likely to contain the fault. Recently, automated program repair methods have been designed on top of IR-based fault localization [12]. They achieve comparable performance to methods using spectrum-based fault localization, yet without relying on the assumption that test cases are available.

We leverage IR-based fault localization to achieve a different goal: instead of localising the reported bug, we aim at injecting faults at code locations that implement a functionality similar to the fault targeted by the bug report description.
B. Mutation Testing
Mutation testing is a popular fault-based testing technique [1]. It operates by inserting artificial faults into a program under test, thereby creating many different versions (named mutants) of the program. The artificial faults are injected through syntactic changes to all program locations in the original program, based on predefined rules named mutation operators. Such operators can, for instance, invert relational operators (e.g., replacing ≥ with <).

Mutants can be used to indicate the strength of test suites, based on their ability to distinguish the mutants from the original program. If there exists a test case distinguishing the original program from a particular mutant, then the mutant is said to be killed. Then, we term a mutant to be "coupled" with respect to a particular fault if the test cases that kill it are a subset of the test cases that can also detect that fault (make the program fail by exercising the fault).

Previous research has shown that the choice of mutation operators and locations can affect the fault-revealing ability of the produced mutants [28], [29]. Thus, it is important to select appropriate mutation testing strategies. Nevertheless, previous research has shown that random mutant sampling achieves comparable results with the mutation testing state of the art [8], [30], making random mutant sampling a natural baseline to compare with.

Another issue involved in mutation testing campaigns is the application cost of the method. The problem stems from the vast number of faults that are injected, which need to be executed with a large number of test suites, thereby escalating the computational demands of the method [1]. Unfortunately, the mutant execution problem becomes intractable when test execution is expensive or the test suites involve system level tests, thereby often limiting mutation testing application to the unit level. This is a major problem when performing fault tolerance [3], or other large-scale testing campaigns. Luckily, recent studies have shown that only a tiny number of the injected faults are useful [9], [30], [31], suggesting that a handful of injected faults should be sufficient to perform testing. Though, it remains an open question how to identify them.

We fill this gap by using bug report-driven fault injection. In essence, we leverage IR-based fault localization techniques to identify the locations where fault injection should happen, i.e., locations relevant to the targeted functionality described in the bug report, and apply frequent fault patterns to produce mutants that behave similar to real faults.
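For illustration, the kill and coupling relations above can be phrased over per-test outcomes. The following is a minimal sketch, not part of IBIR: the Mutant and Test abstractions and their oracles are hypothetical placeholders.

    import java.util.List;

    // Sketch of the "killed" and "coupled" relations described in this section.
    final class CouplingCheck {
        interface Mutant { }
        interface Test {
            boolean failsOn(Mutant m);      // test distinguishes mutant from original
            boolean detectsRealFault();     // test fails on the real faulty version
        }

        // A mutant is killed if at least one test distinguishes it from the original.
        static boolean isKilled(Mutant m, List<? extends Test> suite) {
            return suite.stream().anyMatch(t -> t.failsOn(m));
        }

        // A mutant is coupled to the real fault if the (non-empty) set of tests
        // that kill it is a subset of the tests that also detect the real fault.
        static boolean isCoupled(Mutant m, List<? extends Test> suite) {
            boolean killedByFailing = suite.stream()
                    .anyMatch(t -> t.failsOn(m) && t.detectsRealFault());
            boolean killedByPassing = suite.stream()
                    .anyMatch(t -> t.failsOn(m) && !t.detectsRealFault());
            return killedByFailing && !killedByPassing;
        }
    }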
C. Fix Patterns

In automated program repair [32], a common way to generate patches is to apply fix patterns [33] (also named fix templates [34] or program transformation schemes [35]) in suspicious program locations (detected by fault localization). Patterns used in the literature [33]–[41] have been defined manually or automatically (mined from bug fix datasets).

Instead of fix patterns, we use fault patterns, which are fix patterns inverted. Since fix patterns were designed from recurrent faults, their related fault patterns introduce such faults. This helps injecting faults that are similar to those described in the bug reports. IBIR inverts and uses the patterns implemented by TBar [42], as we detail in the following section.

III. APPROACH
We propose IBIR, the first fault injection approach that utilizes information extracted from bug reports to emulate real faults. A high level view of the way IBIR works is shown in Figure 1. Our approach takes as input (1) the source code of the program of interest and (2) a resolved bug report of that program, written in natural language. The objective is to inject artificial faults in the program (one by one, creating multiple faulty versions of the program) that imitate the original bug. To do so, IBIR proceeds in three steps.

First step: IBIR identifies relevant locations to inject the faults. It applies IR-based fault localization to determine, from the bug report, the code locations (statements) that are likely to be relevant to the target fault. These locations are ranked according to their likelihood of implementing the feature described by the bug report, hence are relevant for injecting faults.

Second step: IBIR applies fault patterns on the identified code locations. We build our patterns by inverting fix patterns used in automated program repair approaches [42]. Our intuition is that, since fix patterns are used to fix bugs, inverted patterns may introduce a fault similar to the original bug. For each location, we apply only patterns that are syntactically compatible with the code location. This step yields a set of faults to inject, i.e., pairs composed of a location and a pattern.

Third step: our method ranks the location-pattern pairs w.r.t. the location likelihood and the priority order of the patterns. Then IBIR takes each pair (in order) and applies the pattern to the location, injecting a fault in the program. We repeat the process until the desired number of injected faults has been produced or until all location-pattern pairs have been considered.
A. Bug Report driven Fault Localization
IR-based fault localization (IRFL) [43], [44] leverages the potential similarity between the terms used in a bug report and the program source code to identify relevant buggy code locations. It typically starts by extracting tokens from a given bug report to formulate a query to be matched in a search space of documents formed by the collection of source code files and indexed through tokens extracted from source code [10], [23]–[26], [45]. IRFL approaches then rank the documents based on a probability of relevance. Top-ranked files are likely to contain the buggy code.

We follow the same principle to identify promising locations where to inject realistic faults. We rely on the information contained in the bug report to localize the code locations with the highest similarity scores. Most IRFL techniques have focused on file-level localization, which is too coarse-grained for our purpose of injecting faults. Thus, we rather use a statement-level IRFL approach that has been successfully applied to support program repair [12].

It is to be noted that, contrary to program repair, we do not aim to identify the exact bug location. We are rather interested in locations that allow injecting realistic faults (similar to the bug). This means that IRFL may pinpoint multiple locations of interest for fault injection even if those are not buggy code locations.
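As a rough illustration of the IR step, bug-report tokens can be matched against tokenized source files with a TF-IDF/cosine scheme and the top-scored files then searched for candidate statements. This is a generic sketch of IR-based localization under those assumptions, not the specific engine used by IBIR.

    import java.util.HashMap;
    import java.util.Map;

    // Generic TF-IDF/cosine sketch of IR-based localization (illustrative only).
    // Documents are token frequency maps built from source files; the query is
    // the token frequency map of the bug report.
    final class IrRanker {
        // Weight each token by term frequency times inverse document frequency.
        static Map<String, Double> tfidf(Map<String, Integer> tokens,
                                         Map<String, Integer> docFreq, int nDocs) {
            Map<String, Double> v = new HashMap<>();
            tokens.forEach((tok, tf) -> v.put(tok,
                    tf * Math.log((double) nDocs / (1 + docFreq.getOrDefault(tok, 0)))));
            return v;
        }

        // Cosine similarity between two weighted token vectors; files are ranked
        // by decreasing cosine(query, file).
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet())
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            for (double x : a.values()) na += x * x;
            for (double x : b.values()) nb += x * x;
            return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
        }
    }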
B. Fault patterns
We start from the fix patterns developed in TBar [42], a state of the art pattern-based program repair tool. Any pattern is described by a context, i.e., an AST node type to which the pattern applies, and a recipe, a syntactical modification to be performed. For each pattern, we define a related fault injection pattern that represents the inverse of that pattern. For instance, inverting the fix pattern that consists of adding an arbitrary statement yields a remove statement fault pattern. Interestingly, some fix patterns are symmetric in the sense that their inverse pattern is also a fix pattern, e.g., inverting a Boolean connector. These patterns can thus be used for both bug fixing and fault injection. Table I enumerates the resulting set of fault injection patterns used by our approach.

Given a location (code statement) to inject a fault into, we identify the patterns that can be applied to the statement. To do so, our method starts from the AST node of the statement and visits it exhaustively, in a breadth-first manner. Each time it meets an AST node that matches the context of a fault pattern, it memorizes the node and the pattern for later application. Then the method continues until it has visited all AST nodes under the statement node. This way, we enumerate all possible applications of all fault patterns onto the location.

Since more than one pattern may apply to a given location, we prioritize them by leveraging heuristic priority rules previously defined in automated program repair methods (these were inferred from real-world bug occurrences [42]). This means that every fault injection pattern gets the priority order of its inverse fix pattern.
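The breadth-first enumeration just described might look as follows. This is a sketch only: Node, Pattern, matches(...), and the convention that a smaller priority value ranks first are hypothetical stand-ins for the actual AST and pattern types.

    import java.util.*;

    // Sketch of the breadth-first pattern-matching pass over a statement's AST.
    final class PatternMatcher {
        interface Node { List<Node> children(); }
        interface Pattern {
            boolean matches(Node n);   // does n's node type fit the pattern's context?
            int priority();            // priority inherited from the inverse fix pattern
        }

        // Enumerate every applicable (node, pattern) pair under the statement node.
        static List<Map.Entry<Node, Pattern>> applications(Node stmt, List<Pattern> patterns) {
            List<Map.Entry<Node, Pattern>> found = new ArrayList<>();
            Deque<Node> queue = new ArrayDeque<>(List.of(stmt));
            while (!queue.isEmpty()) {                  // breadth-first traversal
                Node n = queue.removeFirst();
                for (Pattern p : patterns)
                    if (p.matches(n)) found.add(Map.entry(n, p));
                queue.addAll(n.children());
            }
            // Order applications by pattern priority (smaller value = applied first).
            found.sort(Comparator.comparingInt(e -> e.getValue().priority()));
            return found;
        }
    }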
C. Fault injection
The last step consists of applying, one by one, the fault patterns to inject faults at the program locations identified by IRFL. Locations of higher ranking are considered first. Within a location, pattern applications are ordered based on the pattern priority. By applying a pattern to a corresponding AST node of the location, we inject a fault within the program before recompiling it. If the program does not compile, we discard the fault and restart with the next one. We continue the process until it reaches the desired number of (compilable) injected faults or all locations and patterns have been considered.

Fig. 1. The IBIR fault injection workflow.
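Put together, the injection loop of Section III-C can be sketched as below; the Candidate abstraction (a ranked location-pattern pair) and the compiles(...) check are hypothetical placeholders for the real implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    // Sketch of the injection loop: walk the ranked location-pattern pairs,
    // apply each pattern, and keep only faulty versions that still compile.
    final class Injector {
        interface Candidate { String applyToSource(String source); }

        static List<String> inject(String source, List<Candidate> ranked, int wanted,
                                   Predicate<String> compiles) {
            List<String> faultyVersions = new ArrayList<>();
            for (Candidate c : ranked) {             // locations first, then priority
                if (faultyVersions.size() >= wanted) break;
                String mutated = c.applyToSource(source);
                if (compiles.test(mutated))          // discard stillborn mutants
                    faultyVersions.add(mutated);
            }
            return faultyVersions;
        }
    }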
IV. RESEARCH QUESTIONS
Our approach aims at injecting faults that imitate real ones by leveraging the information included in bug reports. Therefore, a natural question to ask is how well IBIR's faults imitate the targeted (real) ones. Thus, we ask:

RQ1 (Imitating bugs): Are the IBIR faults capable of emulating, in terms of semantic similarity, the targeted (real) ones?

To answer this question, we check whether any of the injected faults imitate well the targeted ones. Following the recommendations from the mutation testing literature [9], we approximate the program behaviour through the project test suites and compare the behaviour similarity of the test cases w.r.t. their pass and failing status using the Ochiai similarity coefficient. This is a typical way of computing the semantic similarity of mutants and faults in mutation-based fault localization [11], [46].
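For reference, a standard way to write the Ochiai coefficient over pass/fail outcomes (the text does not spell the formula out; this follows its common use in mutation-based fault localization) is:

\[
\mathrm{Ochiai}(f, m) \;=\; \frac{|T_f \cap T_m|}{\sqrt{|T_f| \cdot |T_m|}}
\]

where $T_f$ is the set of test cases failing on the real fault $f$ and $T_m$ the set of test cases killing the injected fault $m$. The coefficient equals 1 when both versions fail exactly the same tests, and 0 when the failing sets are disjoint or the injected fault is never detected.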
We then turn our attention to the similarity of the injected fault sets and contrast them with mutants such as those used by modern mutation testing tools [14]. Hence we ask:

RQ2 (Comparison with mutation testing): How does IBIR compare with mutation testing, in terms of semantic similarity?

We answer this question by injecting mutants using the standard operators employed by mutation testing tools [14] and measuring their semantic similarity with the targeted faults. To make a fair comparison, we inject the same number of faults per target. For IBIR we selected the top-ranked mutants, while for mutation testing we randomly sampled mutants across the entire project codebase. Random mutant sampling forms our baseline since it performs comparably to the alternative mutant selection methods [8], [30]. Also, since we are interested in the relative differences between the injected fault sets, we repeat our experiments multiple times using the same number of faults (mutants).

Our approach identifies the locations where bugs should be injected through an IR-based fault localization method. This may give significant advantages when applied at the project level, but these may not carry over to individual classes. Such a class level of granularity may be well suited for some test evaluation tasks, such as automatic test generation [47]. To account for this, we performed mutation testing (using the traditional mutation operators) at the targeted classes (classes where the faults were fixed). To make a fair comparison we also restricted IBIR to the same classes and compared the same number of mutants. This leads us to the following question:
RQ3 (Comparison at the target class): How does IBIR compare with mutation testing, in terms of semantic similarity, when restricted to particular classes?

We answer this question by injecting faults in only the target classes using the IBIR bug patterns and the traditional mutation operators. Then we compare the two approaches the same way as we did in RQ1 and RQ2.

Up to this point, the answers to the posed questions provide evidence that using our approach yields mutants that are semantically similar to the targeted bugs. Although this is important and demonstrates the potential of our approach, it does not necessarily mean that the injected faults are strongly coupled with the real ones (mutants are coupled with real faults if they are killed only by test cases that also reveal the real faults). Mutant and fault coupling is an important property for mutants that significantly helps testing [48]. Therefore, we seek to investigate:

RQ4 (Mutant and fault coupling): How does IBIR compare with mutation testing with respect to mutant and fault coupling?

To answer this question we check whether the faults that we inject are detected only by the failing tests, i.e., only by the tests that also reveal the target fault. Compared to similarity metrics, this coupling relation is stricter and stronger.

After answering the above questions we turn our attention to the actual use of mutants in test effectiveness evaluations. Therefore, we are interested in checking the correlations between the failure rates of the sets of injected faults we introduce and the real ones. To this end, we ask:
RQ5 (Failure estimates): Are the injected faults leading to failure estimates that are representative of the real ones? How do these estimates compare with mutation testing?

The difference of RQ5 from the other RQs is that in RQ5 a set of injected faults is evaluated, while in the previous RQs only isolated mutant instances are.

TABLE I
IBIR FAULT INJECTION PATTERNS
(Pattern context categories appear as group headers; each entry gives the bug injection pattern with an example input ⇒ output.)

Insert Statement
- Insert a method call, before or after the localised statement. Example: someMethod(expression); ⇒ someMethod(expression); method(expression);
- Insert a return statement, before or after the localised statement. Example: statement; ⇒ statement; return VALUE;
- Wrap a statement with a try-catch. Example: statement; ⇒ try { statement; } catch (Exception e) { ... }
- Insert an if checker: wrap a statement with an if block. Example: statement; ⇒ if (conditional_exp) { statement; }

Mutate Class Instance Creation
- Replace an instance creation call by a cast of the super.clone() method call. Example: ... new T(); ⇒ ... (T) super.clone();

Mutate Conditional Expression
- Remove a conditional expression. Example: condExp1 && condExp2 ⇒ condExp1
- Insert a conditional expression. Example: condExp1 ⇒ condExp1 && condExp2
- Change the conditional operator. Example: condExp1 && condExp2 ⇒ condExp1 || condExp2

Mutate Data Type
- Change the declaration type of a variable. Example: T1 var ...; ⇒ T2 var ...;
- Change the casting type of an expression. Example: ... (T1) expression ...; ⇒ ... (T2) expression ...;

Mutate float or double Division
- Remove a float or a double cast from the divisor. Examples: ... dividend / (float) divisor ...; ⇒ ... dividend / divisor ...; and ... intVarExp / 10d ...; ⇒ ... intVarExp / 10 ...;
- Remove a float or a double cast from the dividend. Examples: ... (float) dividend / divisor ...; ⇒ ... dividend / divisor ...; and ... 1.0 / var ...; ⇒ ... 1 / var ...;
- Replace float or double multiplication by an int division. Examples: ... (1.0 / divisor) * dividend ... ⇒ ... dividend / divisor ...; and ... 0.5 * intVarExp ...; ⇒ ... intVarExp / 2 ...;

Mutate Literal Expression
- Change boolean, number or string literals in a statement by another literal or expression of the same type. Examples: ... string_literal1 ... ⇒ ... string_literal2 ...; and ... int_literal ... ⇒ ... int_expression ...

Mutate Method Invocation
- Replace a method call by another one. Example: ... method1(args) ... ⇒ ... method2(args) ...
- Replace a method call argument by another one. Example: ... method(arg1, arg2) ... ⇒ ... method(arg1, arg3) ...
- Remove a method call argument. Example: ... method(arg1, arg2) ... ⇒ ... method(arg1) ...
- Add an argument to a method call. Example: ... method(arg1) ... ⇒ ... method(arg1, arg2) ...

Mutate Return Statement
- Replace a return expression by another one. Example: return expr1; ⇒ return expr2;

Mutate Variable
- Replace a variable by another variable or an expression of the same type. Examples: ... var1 ... ⇒ ... var2 ...; and ... var1 ... ⇒ ... exp ...

Move Statement
- Move a statement to another position. Example: statement; ... ⇒ ... statement;

Remove Statement
- Remove a statement. Example: statement; ... ⇒ ...
- Remove a method. Example: method(args) { statement; } ⇒ ...

Mutate Operators
- Replace an Arithmetic operator. Example: ... a + b ... ⇒ ... a - b ...
- Replace an Assignment operator. Example: ... c += b ... ⇒ ... c -= b ...
- Replace a Relational operator. Example: ... a < b ... ⇒ ... a > b ...
- Replace a Conditional operator. Example: ... a && b ... ⇒ ... a || b ...
- Replace a Bitwise or a Bit Shift operator. Example: ... a & b ... ⇒ ... a | b ...
- Replace a Unary operator. Example: a++ ⇒ a--
- Change arithmetic operations order. Example: a + b * c ⇒ c + b * a
V. EXPERIMENTAL SETUP
A. Dataset & Benchmark
To evaluate IBIR we needed a set of benchmark programs, faults and bug reports. We decided to use Defects4J [49] since it is a benchmark that includes real-world bugs and is quite popular in the software engineering literature.
1) Linking the bugs with their related reports:
To identify which bug report describes a given bug in Defects4J, we followed the same process as in the study of Koyuncu et al. [12]. Unfortunately, it was not possible to link the bug reports with the defects for Joda-Time, JFreeChart and Closure because their repositories and issue tracking systems have been migrated to GitHub without any mapping of the bug report identifiers. This means that in these projects the bug identifiers that were used in the commits are meaningless. We therefore decided to ignore these projects in an attempt to make our evaluation data as clean as possible.

For the Lang and Math projects, we used the bug linking strategies that are implemented in the Jira issue tracking software and used the approach of Fischer et al. [50] and Thomas et al. [51] to map the sought bugs with the corresponding reports. Precisely, we crawled the relevant bug reports and checked their links. We selected bug reports that were tagged as "BUG" and marked as "RESOLVED" or "FIXED" and have a "CLOSED" status. Then we searched the commit logs to identify related identifiers (IDs) that link the commits with the corresponding bugs.

Our resulting bug dataset included the 171 faults of Defects4J related to the Lang and Math projects. We discarded 10 defects because they had a bug report with undesired status in the bug tracking system, or there were issues with the buggy program versions such as missing files from the repository at the reporting time. We also discarded another 4 defects because IBIR generated less than 5 mutants in total. This leaves us with 157 faults.
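A simplified version of this linking step, assuming bug-report metadata and commit logs are already available locally, could look as follows; the Report fields and the "LANG-123 / MATH-456"-style identifier regex are illustrative assumptions, not the exact crawler used in the study.

    import java.util.*;
    import java.util.regex.*;

    // Sketch of the bug-report/commit linking step for Lang and Math.
    final class BugLinker {
        record Report(String id, String type, String resolution, String status) { }

        // Keep reports tagged BUG, marked RESOLVED/FIXED, with a CLOSED status.
        static boolean eligible(Report r) {
            return r.type().equals("BUG")
                    && (r.resolution().equals("RESOLVED") || r.resolution().equals("FIXED"))
                    && r.status().equals("CLOSED");
        }

        // Search commit messages for issue identifiers such as LANG-123 or MATH-456
        // and map each identifier to the commits that mention it.
        static Map<String, List<String>> linkCommits(List<String> commitLogs) {
            Pattern idPattern = Pattern.compile("\\b(LANG|MATH)-\\d+\\b");
            Map<String, List<String>> reportToCommits = new HashMap<>();
            for (String log : commitLogs) {
                Matcher m = idPattern.matcher(log);
                while (m.find())
                    reportToCommits.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(log);
            }
            return reportToCommits;
        }
    }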
B. Experimental Procedure
To compare the fault injection techniques we need to set a common basis for comparison. We set this basis as the number of injected faults, since it forms a standard cost metric [52] that puts the studied methods under the same cost level. We used sets of 5, 10, 30, and 100 injected faults since our aim is to equip researchers with few representative faults, per targeted fault, in order to reach reasonable execution demands.

To measure how well the injected faults imitate the real ones (answering RQ1, RQ2 and RQ3) we use a semantic similarity metric (the Ochiai coefficient) between the test failures on the injected and real (targeted) faults. This coefficient quantifies the similarity level of the program behaviours exercised by the test suites and is often used in the mutation testing literature [9]. The metric takes values in the range [0, 1], with 0 indicating complete difference and 1 exact match. We treated the injected faults that were not detected by any of the test suites as equivalent mutants [53], [54]. This choice does not affect our results since we approximate the program behaviours through the projects' test suites, i.e., they are never killed.

To measure whether the injected faults couple with the existing ones (answering RQ4), we followed the process suggested by Just et al. [48] and identified whether there were any injected faults that were killed by at least one failing test (a test that detects the real fault) and not by any passing test (a test that does not detect the real fault). In RQ5 we randomly sampled 50 test suites, subsets of the accompanying test suites, that included between 10% and 30% of the test cases of the original test suite, and recorded the ratios of the injected faults that are detected when injecting 5, 10, 30 and 100 faults. We also recorded binary variables indicating whether or not each test suite detects the targeted fault. This process simulates cases where test suites of different strengths are compared. Based on these data, we computed two statistical correlation coefficients, Kendall and Pearson.

To further validate whether the two approaches provide sufficient indicators of the effectiveness of the test suites, we check whether the detection ratios of the injected faults are statistically higher when test suites detect the targeted faults than when they do not.

To reduce the influence of stochastic effects we used the Wilcoxon test with a significance level of 0.05. This helped deciding whether the differences we observe can be characterised as statistically significant. Statistical significance does not imply sizable differences and thus we also used the Vargha-Delaney effect size Â [55]. In essence, the Â values quantify the level of the differences. For instance, a value Â = 0.5 can be interpreted as a tendency of equal values in the two samples; Â > 0.5 suggests that the first set has higher values, while Â < 0.5 suggests the opposite.
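For reference, the Â statistic measures how often values from one sample exceed those of the other; a direct implementation of this definition (a sketch, not tied to any particular statistics library) is:

    // Sketch of the Vargha-Delaney Â statistic: the probability that a value
    // drawn from sample a exceeds one drawn from sample b (ties count half).
    final class EffectSize {
        static double varghaDelaneyA(double[] a, double[] b) {
            double wins = 0;
            for (double x : a)
                for (double y : b)
                    wins += (x > y) ? 1.0 : (x == y ? 0.5 : 0.0);
            return wins / ((double) a.length * b.length);   // 0.5 means no difference
        }
    }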
Fig. 2. Distribution of semantic similarities of 100 injected faults per targeted (real) fault.
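The RQ5 sampling procedure described above can be pictured as in the sketch below; the suite representation and sampling bounds follow the description in the text, while the Test abstraction and detects(...) oracle are hypothetical placeholders.

    import java.util.*;

    // Sketch of the RQ5 sampling step: draw a sub-suite of 10-30% of the original
    // tests, then record (a) the ratio of injected faults it detects and
    // (b) whether it detects the targeted real fault. Repeated 50 times per fault.
    final class SuiteSampler {
        interface Test { boolean detects(Object fault); }

        static double[] sampleOnce(List<Test> allTests, List<Object> injected,
                                   Object realFault, Random rnd) {
            int size = (int) (allTests.size() * (0.10 + rnd.nextDouble() * 0.20));
            List<Test> pool = new ArrayList<>(allTests);
            Collections.shuffle(pool, rnd);
            List<Test> suite = pool.subList(0, Math.max(1, size));

            long detected = injected.stream()
                    .filter(f -> suite.stream().anyMatch(t -> t.detects(f))).count();
            double ratio = (double) detected / injected.size();
            double detectsReal = suite.stream()
                    .anyMatch(t -> t.detects(realFault)) ? 1.0 : 0.0;
            return new double[] { ratio, detectsReal };  // correlated over all samples
        }
    }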
C. Implementation

To perform our experiments we set the following parameters in our framework. First, we limit the IR fault localization to the 20 top-ranked suspicious files per bug report. We then searched them for the exact statements where to inject faults. We also ensured that the IR engine is not trained with the bug reports that we aim to localize. Second, for the mutation testing, denoted as "Mutation" in our experiments, we used randomly sampled mutants from those produced by typical mutation operators from the mutation testing literature. In particular, we implemented the muJava intra-method mutation operators [56], which are the most frequently used [14]. Third, to reduce the noise from stillborn mutants, i.e., mutants that do not compile, we discarded, prior to our experiment, every mutant that did not compile or whose execution with the test suite exceeded a timeout of 5 minutes. Fourth, when answering RQ3, we found that there were many cases where IBIR injected less than 100 faults. To perform a fair comparison, we discarded these cases (for both approaches). This means that we always report results where both studied approaches manage to inject the same number of faults.

VI. RESULTS
A. RQ1: Semantic similarity between injected and real faults
To check whether the injected faults imitate well the targeted ones, we measured their behaviour (semantic) similarity w.r.t. the project test suites (please refer to Section V for details). Figure 2 shows the distribution of the similarity coefficient values that were recorded in our study. As can be seen, IBIR injects hundreds of faults that are similar to real ones, whereas mutation (denoted as Mutation in Figure 2) did not manage to generate any. At the same time, as typically happens in mutation testing [9], a large number of injected faults have low similarity. This is evident in our data, where mutations have 0 similarity.

To investigate whether IBIR successfully injects any fault that is similar (semantically) to the targeted ones, we collected the best similarity coefficients, per targeted fault, when injecting 5, 10, 30 and 100 faults. Figure 3 shows the distribution of these results. For more than half of the targeted faults, IBIR yields a best similarity value higher than 0.5 when injecting 100 faults, indicating that IBIR's faults imitate relatively well the targeted ones. We also observe that for many faults the best similarity values are above 0 by injecting just 10 faults. This is important since it indicates that IBIR successfully identifies relevant locations for fault injection.

Fig. 3. Semantic similarity per targeted (real) fault, top values. IBIR injects faults with higher similarity coefficients than mutation testing.

To establish a baseline and better understand the value of IBIR, we need to contrast IBIR's performance with that of mutation testing when injecting the same number of faults. Mutation testing forms the current state of the art of fault injection and thus a relevant baseline. As can be seen from Figure 3, the similarity values of mutation testing are significantly lower than those of IBIR. In the following subsection we further compare IBIR with mutation testing.
B. RQ2: IBIR vs. Mutation Testing
Figure 4 shows the distribution of the semantic similarities between real and injected faults when injecting 5, 10, 30 and 100 faults. As can be seen from the boxplots, the trend is that a large portion of faults injected by IBIR imitates the targeted ones (at least much better than mutation testing). Interestingly, in mutation testing, only outliers have their similarity above 0. In particular, mutation testing injected faults with similarity values higher than 0 for 3, 8, 19 and 40 of the targeted faults (when injecting 5, 10, 30 and 100 faults), while IBIR did so for 75, 88, 101 and 123 of the targeted faults, respectively.

Fig. 4. Semantic similarity of all injected faults. IBIR injects faults with higher similarity coefficients than mutation testing.

Fig. 5. Semantic similarity of injected faults at particular classes. IBIR injects faults with higher similarity coefficients than mutation testing.
To validate this finding, we performed a statistical test (Wilcoxon paired test) on the data of both Figures 3 and 4 to check for significant differences. Our results showed that the differences are significant, indicating the low probability of this effect happening by chance. The size of the difference is also big, with IBIR yielding Â values between 0.73 and 0.84, indicating that IBIR injects faults with higher semantic similarity to real ones in the great majority of the cases. Due to the many cases with 0 similarity values, the average similarity value of IBIR's faults is 0.166, while for mutation it is 0.002, indicating the superiority of IBIR.

C. RQ3: IBIR vs. Mutation Testing at particular classes
To check the performance of IBIR at the class level of granularity we repeated our analysis by discarding, from our priority lists, every mutant that is not located in the targeted classes, i.e., classes where the targeted faults have been fixed.
Fig. 6. Percentage of injected faults that are coupled to the real ones.
Figure 5 shows the distribution of the semantic similarities when injecting 5, 10, 30 and 100 faults at a particular class. As expected, mutation testing scores are higher than those presented before, but mutation testing still falls behind. To validate this finding, we performed a statistical test and found that the differences are significant. The size of the difference is Â = 0.6, meaning that IBIR scores higher than mutation testing in 60% of the cases. The average similarity value of the IBIR faults is 0.240, while for mutation it is 0.114, indicating that IBIR is better.
D. RQ4: Fault Coupling
The coupling between the injected and the real faults forms a fundamental assumption of fault-based testing approaches [49]. An injected fault is coupled to a real one when a test case that reveals the injected fault also reveals the real fault [49]. This implies that revealing these coupled injected faults results in revealing potential real ones. We therefore check this property in the faults we inject and contrast it with the baseline mutation testing approach.

Figure 6 shows the percentage of targeted faults for which there is at least one injected fault that is coupled to a real one. This is shown for the scenarios where 5, 10, 30 and 100 faults, per target, are injected. As we can see from these data, IBIR injects coupled faults for approximately 16% of the target faults when it aims at injecting 5 faults. This percentage increases to 36% when the number of injected faults is increased to 100.

Perhaps surprisingly, mutation testing did not perform well (it injected coupled faults for less than 1% of the targeted faults, when injecting 100 faults per target). These results differ from those reported by previous research [9], [48], because a) previous research only injected faults at the faulty classes and not the entire project, and b) previous research injected all possible mutant instances and not 100 as we do.
TABLE II
VARGHA AND DELANEY Â (IBIR VS MUTATION) OF KENDALL AND PEARSON CORRELATION COEFFICIENTS, FOR 5, 10, 30 AND 100 INJECTED FAULTS (ONE ROW PER COEFFICIENT: KENDALL, PEARSON).
Fig. 7. Kendall correlation coefficients of test suites (samples from the original project test suite). The two related variables are a) the percentage of injected faults that was detected by the sampled test suites and b) whether the targeted fault was detected or not by the same test suites.
E. RQ5: Fault detection estimates
The results presented so far provide evidence that some of the injected faults imitate well the targeted ones. Though, the question of whether the injections provide representative results of real faults remains, especially since we observe a large number of faults with low similarity values. Therefore, we check the correlations between the failure rates of the sets of injected faults and the real faults when executed with different test suites (please refer to Section V for details).

Figures 7 and 8 show the distribution of the correlation coefficients when injecting different numbers of faults. Interestingly, the results in both figures show a trend in favour of IBIR. This difference is statistically significant, shown by a Wilcoxon test, with an effect size of approximately 0.72. Table II records the effect size values, Â, for the examined strategies. In essence, these effect sizes mean that IBIR outperforms the mutant injection in 72% of the cases, suggesting that IBIR could be a much better choice than mutation testing, especially in cases of large test suites with expensive test executions.

To further validate whether IBIR's faults provide good indicators (estimates) of test effectiveness (fault detection) we split our test suites between those that detect the targeted faults and those that do not. We then tested whether the detection ratios of the injected faults in the test suite group that detects the real faults are significantly (statistically) higher than those in the group that does not detect it. In case this happens, we can conclude that test suites capable of detecting a higher number of injected faults have similarly higher chances to detect the real ones. This is important when comparing test generation techniques, where the aim is to identify the most effective (at detecting faults) technique.

Fig. 8. Pearson correlation coefficients of test suites (samples from the original project test suite). The two related variables are a) the percentage of injected faults that was detected by the sampled test suites and b) whether the targeted fault was detected or not by the same test suites.

Figure 9 records the number of faults where the test suites detecting the (real) targeted fault also detect a statistically higher number of injected faults than those test suites that do not detect it. As can be seen from these results, IBIR shows a big difference from mutation, i.e., it distinguishes between passing and failing test suites in 80 faults, while Mutation does so in 21 faults. Since statistical significance does not imply practical significance, we also measured the Vargha and Delaney Â effect size values on the same data, recorded in Figure 10. Of course it does not make sense to contrast insignificant cases, so we only performed this analysis on the results where IBIR has a statistically significant difference. Interestingly, the results demonstrate big differences (in approximately 80% of the cases) in favour of our approach.
VII. THREATS TO VALIDITY
The question of whether our findings generalise forms a typical threat to validity of empirical studies. To reduce this threat, we used real-world projects, developer test suites, real faults and their associated bug reports, from an established and independently built benchmark. Still, we have to acknowledge that these may not be representative of projects from other domains or industrial systems.

Other threats may arise from the way we handled the injected faults and mutants that were not killed by any test case. We believe that this validation process is sufficient since the test suites are relatively strong and somehow form the current state of practice, i.e., developers tend to use this particular level of testing. Though, in case the approach is put into practice things might be different. We also applied our analysis on the fixed program version provided by Defects4J. This was important in order to show that we actually inject the targeted faults. Though, our results might not hold in cases where the code has drastically changed since the time of the bug report. We believe that this threat is not of actual importance as we are concerned with fault injection at interesting program locations, which should be pinpointed by the fault localization technique we use. Still, future research should shed some light on how useful these locations and faults are.

Finally, our evaluation metrics may induce some additional threats. Our comparison basis measurement, i.e., the number of injected faults, approximates the execution cost of the techniques and their chances to provide misleading guidance [9], while the fault coupling and semantic similarity metrics approximate the effectiveness of the approaches. These are intuitive metrics, used by previous research [8], [30], which aim at providing a common ground for comparison.

Fig. 9. Number of faults where injected faults provided good indications of fault detection. Particularly, the number of cases where test suites detecting the real fault have a statistically significant difference, in terms of ratios of injected faults detected, from those that do not detect the real fault.

Fig. 10. Vargha and Delaney Â values for IBIR, computed on the detection ratios of injected faults of the test suites that detect and do not detect the (real) faults.

VIII. RELATED WORK
Software fault injection [57] has been widely studied since the 1970s. Injected faults have been used for the purposes of testing [1], debugging [11], [58], assessing fault tolerance [3], risk analysis [59], [60] and dependability evaluation [61].

Despite the many years of research, the majority of previous research focuses on the fault types. In mutation testing research, mutation operators (fault types) are usually designed based on the grammar of the targeted language [1], [5], and are then refined through empirical analysis aiming at reducing the redundancy between the injected faults [52], [62]. The most prominent mutant selection approach is that of Offutt et al. [52], who proposed a set of 5 mutation operators. This set has been incorporated in most of the modern mutation testing tools [14] and is the one that we use in our baseline.

Recently, Brown et al. [7] aimed at inferring fault patterns from bug fixes. Their results showed that a large number of mutation operators could be inferred. Along the same lines, Tufano et al. [6] developed a neural machine translation tool that learns to mutate through bug fixes. Key assumptions of these methods are a) the availability of a comprehensive number of clean bug fixing commits, and b) the absence of fault couplings [63]; these assumptions are often not met, and the inferred operators can often be reduced to what simple mutations do. For instance, the study of Brown et al. found that, with few exceptions, almost all mutation operators designed based on the C language grammar appeared in the inferred operator set. Perhaps more importantly, the studies of Natella et al. [3] and Chekam et al. [8] found that the pair of mutant location and type is what makes mutants powerful, not the type itself. Nevertheless, IBIR's goal is complementary to the above studies as it aims at injecting faults that mimic specifically targeted faults, those described in bug reports. This way, one can inject the most important and severe faults experienced.

Some studies attempt to identify the program locations where to inject faults. Sun et al. [64] suggested injecting faults in diverse places within different program execution paths. Gong et al. [65] used graph analysis to inject faults in different and diverse locations of the program spectra. Mirshokraie et al. [66] employed complexity metrics together with actual program executions to inject faults at places with good observability. These strategies aim at reducing the number of injected faults, not at mimicking any real fault as our approach does. Moreover, their results should be resembled by the random mutant sampling baseline that we use.

Random mutant sampling forms a natural cost-reduction method proposed since the early days of mutation testing [2]. Despite that, most of the mutant selection methods fail to perform better than it. Recently, Kurtz et al. [30] and Chekam et al. [8] demonstrated that selective mutation and random mutant sampling perform similarly. From this, it should be clear that despite the advances in selective mutation, simple random sampling is one of the most effective fault injection techniques. This is the reason why we adopt it as a baseline in our experiments.

Natella et al. [3] used complexity metrics as machine learning features and applied them on a set of examples in order to identify (predict) which injected faults have the potential to emulate well the behaviour of real ones. Chekam et al.
[8] also used machine learning, with many static mutant-related features, to select and rank mutants that are likely fault revealing (have a high chance to couple with a fault). These studies assume the availability of historical faults and do not aim at injecting specific faults as done by IBIR.

The relationship between injected and real faults has also received some attention [1]. The studies of Papadakis et al. [9], Just et al. [48] and Andrews et al. [53] investigated whether mutant kills and fault detection ratios follow similar trends. The results show the existence of a correlation and, thus, that mutants can be used in controlled experiments as alternatives to real faults. In the context of testing, i.e., using mutants to guide testing, injected faults can help identifying corner cases and reveal existing faults. The studies of Frankl et al. [67], Li et al. [68] and Chekam et al. [69] demonstrated that guidance from mutants leads to significantly higher fault revelation than that of other test techniques (test criteria).

IX. CONCLUSION
We presented IBIR, a bug-report driven fault injection tool. IBIR (1) equips researchers with faults (to inject) targeting the critical functionality of the target systems, (2) mimics real faulty behaviour and (3) makes relevant fault injection.

IBIR's use case is simple: given a program and some carefully selected bug reports, it injects faults emulating the related bugs, i.e., IBIR generates few faults per target bug report. This allows constructing realistic fault pools to be used for test or fault tolerance assessment.

This means that IBIR's faults can be used as substitutes of real faults in controlled studies. In a sense, IBIR can bring the missing realism into fault injection and therefore support empirical research and controlled experiments. This is important since a large number of empirical studies rely on artificially-injected faults [70], the validity of which is always in question.

While the use case of IBIR is in research studies, the tool can have applications in a wide range of software engineering tasks. It can, for instance, be used for asserting that future software releases do not introduce the same (or similar) kinds of faults. Such a situation occurs in large software projects [71], where IBIR could help by checking for some of the most severe faults experienced.

Another potential application of IBIR is fault tolerance assessment, by injecting faults similar to previously experienced ones and analysing the system responses and overall dependability.

Finally, testers could use IBIR for testing all system areas that could lead to symptoms similar to the ones observed and resolved. This will bring significant benefits when testing software clones [72] and similar functionality implementations. We hope that we will address these points in the near future.
EFERENCES[1] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. L. Traon, and M. Harman,“Chapter six - mutation testing advances: An analysis and survey,”
Advances in Computers , vol. 112, pp. 275–378, 2019. [Online].Available: https://doi.org/10.1016/bs.adcom.2018.03.015[2] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints ontest data selection: Help for the practicing programmer,”
IEEEComputer , vol. 11, no. 4, pp. 34–41, 1978. [Online]. Available:https://doi.org/10.1109/C-M.1978.218136[3] R. Natella, D. Cotroneo, J. Dur˜aes, and H. Madeira, “On faultrepresentativeness of software fault injection,”
IEEE Trans. SoftwareEng. , vol. 39, no. 1, pp. 80–96, 2013. [Online]. Available:https://doi.org/10.1109/TSE.2011.124[4] A. Lanzaro, R. Natella, S. Winter, D. Cotroneo, and N. Suri, “Errormodels for the representative injection of software defects,” in
SoftwareEngineering & Management 2015, Multikonferenz der GI-FachbereicheSoftwaretechnik (SWT) und Wirtschaftsinformatik (WI), FA WI-MAW, 17.M¨arz - 20. M¨arz 2015 , ser. LNI, vol. P-239. GI, 2015, pp. 118–119.[5] P. Ammann and J. Offutt,
Introduction to Software Testing . CambridgeUniversity Press, 2008. [Online]. Available: https://doi.org/10.1017/CBO9780511809163[6] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, andD. Poshyvanyk, “Learning how to mutate source code from bug-fixes,”in . IEEE, 2019, pp. 301–312. [Online].Available: https://doi.org/10.1109/ICSME.2019.00046[7] D. B. Brown, M. Vaughn, B. Liblit, and T. W. Reps, “Thecare and feeding of wild-caught mutants,” in
Proceedings of the2017 11th Joint Meeting on Foundations of Software Engineering,ESEC/FSE 2017 . ACM, 2017, pp. 511–522. [Online]. Available:https://doi.org/10.1145/3106237.3106280[8] T. T. Chekam, M. Papadakis, T. F. Bissyand´e, Y. L. Traon, andK. Sen, “Selecting fault revealing mutants,”
Empirical SoftwareEngineering , vol. 25, no. 1, pp. 434–487, 2020. [Online]. Available:https://doi.org/10.1007/s10664-019-09778-7[9] M. Papadakis, D. Shin, S. Yoo, and D. Bae, “Are mutationscores correlated with real fault detection?: a large scale empiricalstudy on the relationship between mutants and real faults,” in
Proceedings of the 40th International Conference on SoftwareEngineering, ICSE 2018 , 2018, pp. 537–548. [Online]. Available:https://doi.org/10.1145/3180155.3180183[10] J. Zhou, H. Zhang, and D. Lo, “Where should the bugs be fixed?more accurate information retrieval-based bug localization based onbug reports,” in
Proceedings of the 2012 International Conference onSoftware Engineering (ICSE) , 2012, pp. 14–24.[11] M. Papadakis and Y. Le Traon, “Metallaxis-fl: mutation-based faultlocalization,”
Software Testing, Verification and Reliability , vol. 25, no.5-7, pp. 605–628, 2015.[12] A. Koyuncu, K. Liu, T. F. Bissyand´e, D. Kim, M. Monperrus, J. Klein,and Y. L. Traon, “iFixR: Bug report driven program repair,” in
Proceed-ings of the 13th Joint Meeting on Foundations of Software Engineering(FSE) , 2019.[13] M. Papadakis, T. T. Chekam, and Y. L. Traon, “Mutant qualityindicators,” in . IEEE Computer Society, 2018,pp. 32–39. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/ICSTW.2018.00025[14] M. Kintis, M. Papadakis, A. Papadopoulos, E. Valvis, N. Malevris, andY. L. Traon, “How effective are mutation testing tools? an empiricalanalysis of java mutation testing tools with manual analysis and realfaults,”
Empir. Softw. Eng. , vol. 23, no. 4, pp. 2426–2463, 2018.[Online]. Available: https://doi.org/10.1007/s10664-017-9582-5[15] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A survey onsoftware fault localization,”
IEEE Transactions on Software Engineering ,vol. 42, no. 8, pp. 707–740, 2016.[16] W. E. Wong, V. Debroy, and B. Choi, “A family of code coverage-based heuristics for effective fault localization,”
Journal of Systems andSoftware , vol. 83, no. 2, pp. 188–208, 2010.[17] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “Spectrum-basedmultiple fault localization,” in
Proceedings of the 24th IEEE/ACMInternational Conference on Automated Software Engineering (ASE) ,2009, pp. 88–99. [18] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, “Scalablestatistical bug isolation,” in
Proceedings of the 26th ACM SIGPLANConference on Programming Language Design and Implementation(PLDI) , 2005, pp. 15–26.[19] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, andR. Harshman, “Indexing by latent semantic analysis,”
Journal of theAmerican Society for Information Science , vol. 41, no. 6, pp. 391–407,Sep. 1990.[20] W. B. Frakes and R. Baeza-Yates,
Information Retrieval: Data Structuresand Algorithms , 1st ed. Prentice Hall, Jun. 1992.[21] C. D. Manning and H. Sch¨utze,
Foundations of Statistical NaturalLanguage Processing , 1st ed. Cambridge, Mass: The MIT Press, Jun.1999.[22] G. Salton and M. J. McGill,
Introduction to Modern InformationRetrieval . New York, NY, USA: McGraw-Hill, Inc., 1986.[23] S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, “Bug localizationusing latent Dirichlet allocation,”
Information and Software Technology ,vol. 52, no. 9, pp. 972–990, 2010.[24] R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, “Improving buglocalization using structured information retrieval,” in
Proceedings ofthe 28th IEEE/ACM International Conference on Automated SoftwareEngineering (ASE) , 2013, pp. 345–355.[25] S. Wang and D. Lo, “Version History, Similar Report, and Structure:Putting Them Together for Improved Bug Localization,” in
Proceed-ings of the 22nd International Conference on Program Comprehension(ICPC) , 2014, pp. 53–63.[26] M. Wen, R. Wu, and S.-C. Cheung, “Locus: Locating bugs fromsoftware changes,” in
Proceedings of the 31st IEEE/ACM InternationalConference on Automated Software Engineering (ASE) , 2016, pp. 262–273.[27] K. C. Youm, J. Ahn, J. Kim, and E. Lee, “Bug Localization Based onCode Change Histories and Bug Reports,” in
[27] K. C. Youm, J. Ahn, J. Kim, and E. Lee, “Bug localization based on code change histories and bug reports,” in Proceedings of the 2015 Asia-Pacific Software Engineering Conference (APSEC), 2015, pp. 190–197.
[28] J. Andrews, L. Briand, Y. Labiche, and A. Namin, “Using mutation analysis for assessing and comparing testing coverage criteria,” IEEE Transactions on Software Engineering, vol. 32, no. 8, pp. 608–624, 2006.
[29] T. Laurent, M. Papadakis, M. Kintis, C. Henard, Y. Le Traon, and A. Ventresque, “Assessing and improving the mutation testing practice of PIT,” in 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), March 2017, pp. 430–435.
[30] B. Kurtz, P. Ammann, J. Offutt, M. E. Delamaro, M. Kurtz, and N. Gökçe, “Analyzing the validity of selective mutation with dominator mutants,” in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), 2016, pp. 571–582. [Online]. Available: https://doi.org/10.1145/2950290.2950322
[31] P. Ammann, M. E. Delamaro, and J. Offutt, “Establishing theoretical minimal sets of mutants,” in 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2014.
[32] C. Le Goues, M. Pradel, and A. Roychoudhury, “Automated program repair,” Commun. ACM, vol. 62, no. 12, pp. 56–65, 2019. [Online]. Available: https://doi.org/10.1145/3318162
[33] D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch generation learned from human-written patches,” in Proceedings of the 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 802–811.
[34] K. Liu, D. Kim, T. F. Bissyandé, S. Yoo, and Y. Le Traon, “Mining fix patterns for FindBugs violations,” IEEE Transactions on Software Engineering, 2018.
[35] J. Hua, M. Zhang, K. Wang, and S. Khurshid, “Towards practical program repair with on-demand candidate generation,” in Proceedings of the 40th International Conference on Software Engineering (ICSE), 2018, pp. 12–23.
[36] R. K. Saha, Y. Lyu, H. Yoshida, and M. R. Prasad, “Elixir: Effective object-oriented program repair,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2017, pp. 648–659.
[37] T. Durieux, B. Cornu, L. Seinturier, and M. Monperrus, “Dynamic patch generation for null pointer exceptions using metaprogramming,” in Proceedings of the 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2017, pp. 349–358.
[38] A. Koyuncu, K. Liu, T. F. Bissyandé, D. Kim, J. Klein, M. Monperrus, and Y. Le Traon, “FixMiner: Mining relevant fix patterns for automated program repair,” Empirical Software Engineering, October 2019.
[39] M. Martinez and M. Monperrus, “Ultra-large repair search space with automatically mined templates: The Cardumen mode of Astor,” in Proceedings of the 10th International Symposium on Search-Based Software Engineering (SSBSE). Springer, 2018, pp. 65–86.
[40] K. Liu, A. Koyuncu, D. Kim, and T. F. Bissyandé, “AVATAR: Fixing semantic bugs with fix patterns of static analysis violations,” in Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2019, pp. 1–12.
[41] K. Liu, D. Kim, T. F. Bissyandé, S. Yoo, and Y. Le Traon, “Mining fix patterns for FindBugs violations,” IEEE Transactions on Software Engineering, 2018.
[42] K. Liu, A. Koyuncu, D. Kim, and T. F. Bissyandé, “TBar: Revisiting template-based automated program repair,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2019, pp. 31–42.
[43] C. Parnin and A. Orso, “Are automated debugging techniques actually helping programmers?” in Proceedings of the 20th International Symposium on Software Testing and Analysis (ISSTA). ACM, 2011, pp. 199–209.
[44] Q. Wang, C. Parnin, and A. Orso, “Evaluating the usefulness of IR-based fault localization techniques,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA), 2015, pp. 1–11.
[45] C.-P. Wong, Y. Xiong, H. Zhang, D. Hao, L. Zhang, and H. Mei, “Boosting bug-report-oriented fault localization with segmentation and stack-trace analysis,” in Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2014, pp. 181–190.
[46] S. Moon, Y. Kim, M. Kim, and S. Yoo, “Ask the mutants: Mutating faulty programs for fault localization,” in Seventh IEEE International Conference on Software Testing, Verification and Validation (ICST 2014). IEEE Computer Society, 2014, pp. 153–162. [Online]. Available: https://doi.org/10.1109/ICST.2014.28
[47] G. Fraser and A. Arcuri, “Whole test suite generation,” IEEE Trans. Software Eng., vol. 39, no. 2, pp. 276–291, 2013. [Online]. Available: https://doi.org/10.1109/TSE.2012.14
[48] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, “Are mutants a valid substitute for real faults in software testing?” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), 2014, pp. 654–665. [Online]. Available: https://doi.org/10.1145/2635868.2635929
[49] R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA), 2014, pp. 437–440.
[50] M. Fischer, M. Pinzger, and H. C. Gall, “Populating a release history database from version control and bug tracking systems,” in Proceedings of the International Conference on Software Maintenance (ICSM 2003). IEEE Computer Society, 2003, p. 23. [Online]. Available: https://doi.org/10.1109/ICSM.2003.1235403
[51] S. W. Thomas, M. Nagappan, D. Blostein, and A. E. Hassan, “The impact of classifier configuration and classifier combination on bug localization,” IEEE Trans. Software Eng., vol. 39, no. 10, pp. 1427–1443, 2013. [Online]. Available: https://doi.org/10.1109/TSE.2013.27
[52] A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf, “An experimental determination of sufficient mutant operators,” ACM Trans. Softw. Eng. Methodol., vol. 5, no. 2, pp. 99–118, 1996. [Online]. Available: https://doi.org/10.1145/227607.227610
[53] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin, “Using mutation analysis for assessing and comparing testing coverage criteria,” IEEE Trans. Software Eng., vol. 32, no. 8, pp. 608–624, 2006. [Online]. Available: https://doi.org/10.1109/TSE.2006.83
[54] M. Papadakis, M. E. Delamaro, and Y. Le Traon, “Mitigating the effects of equivalent mutants with mutant classification strategies,” Sci. Comput. Program., vol. 95, pp. 298–319, 2014. [Online]. Available: https://doi.org/10.1016/j.scico.2014.05.012
[55] A. Vargha and H. D. Delaney, “A critique and improvement of the CL common language effect size statistics of McGraw and Wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
[56] Y. Ma, J. Offutt, and Y. R. Kwon, “MuJava: An automated class mutation system,” Softw. Test. Verification Reliab., vol. 15, no. 2, pp. 97–133, 2005. [Online]. Available: https://doi.org/10.1002/stvr.308
[57] J. M. Voas and G. McGraw, Software Fault Injection: Inoculating Programs against Errors. USA: John Wiley & Sons, Inc., 1997.
[58] Y. Lou, A. Ghanbari, X. Li, L. Zhang, H. Zhang, D. Hao, and L. Zhang, “Can automated program repair refine fault localization? A unified debugging approach,” in ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA, July 18-22, 2020. ACM, 2020, pp. 75–87. [Online]. Available: https://doi.org/10.1145/3395363.3397351
[59] J. Christmansson and R. Chillarege, “Generation of an error set that emulates software faults based on field data,” in Digest of Papers: FTCS-26, The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing, 1996. IEEE Computer Society, 1996, pp. 304–313. [Online]. Available: https://doi.org/10.1109/FTCS.1996.534615
[60] J. M. Voas, F. Charron, G. McGraw, K. W. Miller, and M. Friedman, “Predicting how badly ‘good’ software can behave,” IEEE Softw., vol. 14, no. 4, pp. 73–83, 1997. [Online]. Available: https://doi.org/10.1109/52.595959
[61] J. Arlat, A. Costes, Y. Crouzet, J. Laprie, and D. Powell, “Fault injection and dependability evaluation of fault-tolerant systems,” IEEE Trans. Computers, vol. 42, no. 8, pp. 913–923, 1993. [Online]. Available: https://doi.org/10.1109/12.238482
[62] M. Marcozzi, S. Bardin, N. Kosmatov, M. Papadakis, V. Prevosto, and L. Correnson, “Time to clean your test objectives,” in Proceedings of the 40th International Conference on Software Engineering (ICSE 2018), Gothenburg, Sweden, May 27 - June 03, 2018, M. Chaudron, I. Crnkovic, M. Chechik, and M. Harman, Eds. ACM, 2018, pp. 456–467. [Online]. Available: https://doi.org/10.1145/3180155.3180191
[63] A. J. Offutt, “Investigations of the software testing coupling effect,” ACM Trans. Softw. Eng. Methodol., vol. 1, no. 1, pp. 5–20, 1992. [Online]. Available: https://doi.org/10.1145/125489.125473
[64] C. Sun, F. Xue, H. Liu, and X. Zhang, “A path-aware approach to mutant reduction in mutation testing,” Information & Software Technology, vol. 81, pp. 65–81, 2017. [Online]. Available: https://doi.org/10.1016/j.infsof.2016.02.006
[65] D. Gong, G. Zhang, X. Yao, and F. Meng, “Mutant reduction based on dominance relation for weak mutation testing,” Information & Software Technology, vol. 81, pp. 82–96, 2017. [Online]. Available: https://doi.org/10.1016/j.infsof.2016.05.001
[66] S. Mirshokraie, A. Mesbah, and K. Pattabiraman, “Guided mutation testing for JavaScript web applications,” IEEE Trans. Software Eng., vol. 41, no. 5, pp. 429–444, 2015. [Online]. Available: https://doi.org/10.1109/TSE.2014.2371458
[67] P. G. Frankl, S. N. Weiss, and C. Hu, “All-uses vs mutation testing: An experimental comparison of effectiveness,” J. Syst. Softw., vol. 38, no. 3, pp. 235–253, 1997. [Online]. Available: https://doi.org/10.1016/S0164-1212(96)00154-9
[68] N. Li, U. Praphamontripong, and J. Offutt, “An experimental comparison of four unit test criteria: Mutation, edge-pair, all-uses and prime path coverage,” in Second International Conference on Software Testing, Verification and Validation (ICST 2009), Workshops Proceedings. IEEE Computer Society, 2009, pp. 220–229. [Online]. Available: https://doi.org/10.1109/ICSTW.2009.30
[69] T. T. Chekam, M. Papadakis, Y. Le Traon, and M. Harman, “An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption,” in Proceedings of the 39th International Conference on Software Engineering (ICSE 2017), 2017, pp. 597–608. [Online]. Available: https://doi.org/10.1109/ICSE.2017.61
[70] M. Papadakis, C. Henard, M. Harman, Y. Jia, and Y. Le Traon, “Threats to the validity of mutation-based test assessment,” in Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016), 2016, pp. 354–365. [Online]. Available: https://doi.org/10.1145/2931037.2931040
[71] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller, “Faults in Linux: Ten years later,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 305–318.
[72] M. Mondal, M. S. Rahman, R. K. Saha, C. K. Roy, J. Krinke, and K. A. Schneider, “An empirical study of the impacts of clones in software maintenance,” in The 19th IEEE International Conference on Program Comprehension (ICPC 2011). IEEE Computer Society, 2011, pp. 242–245. [Online]. Available: https://doi.org/10.1109/ICPC.2011.14