Practical Mutation Testing at Scale
A view from Google
Goran Petrović, Marko Ivanković, Gordon Fraser, René Just
Abstract—Mutation analysis assesses a test suite's adequacy by measuring its ability to detect small artificial faults, systematically seeded into the tested program. Mutation analysis is considered one of the strongest test-adequacy criteria. Mutation testing builds on top of mutation analysis and is a testing technique that uses mutants as test goals to create or improve a test suite. Mutation testing has long been considered intractable because the sheer number of mutants that can be created represents an insurmountable problem—both in terms of human and computational effort. This has hindered the adoption of mutation testing as an industry standard. For example, Google has a codebase of two billion lines of code and more than 500,000,000 tests are executed on a daily basis. The traditional approach to mutation testing does not scale to such an environment; even existing solutions to speed up mutation analysis are insufficient to make it computationally feasible at such a scale. To address these challenges, this paper presents a scalable approach to mutation testing based on the following main ideas: (1) Mutation testing is done incrementally, mutating only changed code during code review, rather than the entire code base; (2) Mutants are filtered, removing mutants that are likely to be irrelevant to developers, and limiting the number of mutants per line and per code review process; (3) Mutants are selected based on the historical performance of mutation operators, further eliminating irrelevant mutants and improving mutant quality. This paper empirically validates the proposed approach by analyzing its effectiveness in a code-review-based setting, used by more than 24,000 developers on more than 1,000 projects. The results show that the proposed approach produces orders of magnitude fewer mutants and that context-based mutant filtering and selection improve mutant quality and actionability. Overall, the proposed approach represents a mutation testing framework that seamlessly integrates into the software development workflow and is applicable up to large-scale industrial settings.
Index Terms—mutation testing, code coverage, test efficacy
1 Introduction
Software testing is the predominant technique for ensuring software quality, and various approaches exist for assessing test suite efficacy (i.e., a test suite's ability to detect software defects). One such approach is code coverage, which is widely used at Google [1] and measures the degree to which a test suite exercises a program. Code coverage is intuitive, cheap to compute, and well supported by commercial-grade tools. However, code coverage alone might be misleading, in particular when program statements are covered but the expected program outcome is not asserted upon [2], [3]. Another approach is mutation analysis, which systematically seeds artificial faults into a program and measures a test suite's ability to detect these artificial faults, called mutants [4]. Mutation analysis addresses the limitations of code coverage and is widely considered the best approach for evaluating test suite efficacy [5], [6], [7].
Mutation testing is an iterative testing approach that builds on top of mutation analysis and uses undetected mutants as concrete test goals for which to create test cases.

• Goran Petrović and Marko Ivanković are with Google LLC. E-mail: [email protected], [email protected]
• Gordon Fraser is with the University of Passau. E-mail: [email protected]
• René Just is with the University of Washington. E-mail: [email protected]

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
As a concrete example, consider the following fully covered, yet weakly tested, function:

  public Buffer view() {
    Buffer buf = new Buffer();
    buf.Append(this.internal_buf); // mutation: delete this line
    return buf;
  }
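The weakness is easy to reproduce. The following is a minimal sketch in Python (a hypothetical analogue of the Java example above; the Buffer and Wrapper classes are illustrative): the test fully covers view() but never asserts on the returned buffer's contents, so the statement-deletion mutant survives.

  import unittest

  class Buffer:
      def __init__(self):
          self.data = []

      def append(self, other):
          self.data.extend(other)

  class Wrapper:
      def __init__(self, internal_buf):
          self.internal_buf = internal_buf

      def view(self):
          buf = Buffer()
          buf.append(self.internal_buf)  # mutation: delete this line
          return buf

  class WeakTest(unittest.TestCase):
      def test_view(self):
          # Covers every line of view() but asserts nothing about buf.data;
          # deleting the append() call does not fail this test.
          buf = Wrapper([1, 2, 3]).view()
          self.assertIsNotNone(buf)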
The tests only exercise the function, but do not assert upon its effects on the returned buffer. This is just one example where mutation testing outperforms code coverage: even though the line that appends some content to buf is covered, a developer is not informed about the fact that no test checks for its effects. The statement-deletion mutation, on the other hand, explicitly points out this testing weakness.

Google always strives to improve test quality, and thus decided to implement and deploy a mutation system to evaluate its effectiveness. The sheer scale of Google's monolithic repository with approximately 2 billion lines of code [8], however, rendered the traditional approach to mutation testing infeasible: More than 500,000,000 test executions per day are gatekeepers for 60,000 change submissions to this code base, ensuring that 13,000 continuous integrations remain healthy on a daily basis [9]. First, at this scale, systematically mutating the entire code base would create far too many mutants, each potentially requiring many tests to be executed. Second, neither the traditionally computed mutant-detection ratio, which quantifies test suite efficacy, nor simply showing all mutants that have evaded detection to a developer would be actionable. Given that evaluating and resolving a single mutant takes several minutes [10], [11], the required developer effort for resolving all undetected mutants would be prohibitively expensive.

To make matters worse, even when applying sampling techniques to substantially reduce the number of mutants, developers at Google initially classified 85% of reported mutants as unproductive. An unproductive mutant is either trivially equivalent to the original program or it is detectable, but adding a test for it would not improve the test suite [11]. For example, mutating the initial capacity of a Java collection (e.g., new ArrayList(64) → new ArrayList(16)) creates an unproductive mutant. While it is possible to write a test that asserts on the collection capacity or expected memory allocations, it is unproductive to do so. In fact, it is conceivable that these tests, if written and added, would even have a negative impact because their change-detector nature (specifically testing the current implementation rather than the specification) violates testing best practices and causes brittle tests and false alarms.

Faced with the two major challenges in deploying mutation testing—the computational costs of mutation analysis and the fact that most mutants are unproductive—we have developed a mutation testing approach that is scalable and usable, based on three central ideas:

1) Our approach performs mutation testing on code changes, considering only changed lines of code (Section 2, based on our prior work [12]), and surfacing mutants during code review. This greatly reduces the number of lines in which mutants are created and matches a developer's unit of work for which additional tests are desirable.

2) Our approach uses transitive mutant suppression, using heuristics based on developer feedback (Section 3, based on our prior work [12]). The feedback of more than 20,000 developers on thousands of mutants over six years enabled us to develop heuristics for mutant suppression that improved the ratio of productive mutants from 15% to 89%.

3) Our approach uses probabilistic, targeted mutant selection, surfacing a restricted number of mutants based on historical performance (Section 4), further avoiding unproductive mutants.

Based on an evaluation of the proposed mutation testing framework on almost 17 million mutants and 760,000 changes, which surfaced 2 million mutants during code review (Section 5), we conclude that, taken together, these improvements make mutation testing feasible—even for industry-scale software development environments.

2 Mutation Testing at Google
Mutation testing at Google faces challenges of scale, both in terms of computation time as well as integration into the developer workflow. Even though existing work on selective mutation and other optimizations can substantially reduce the number of mutants that need to be analyzed, it remains infeasibly expensive to compute the absolute mutation score for the codebase at any given fixed point, due to the size of the code repository. It would be even more expensive to keep re-computing the mutation score in any fixed time period (e.g., daily or weekly), and it is impossible to compute the full score after each commit. In addition to the computational costs of the mutation score, we were also unable to find a good way to surface it to the developers: an overall score is neither concrete nor actionable, and it does not guide testing. The scale, however, also makes surfacing individual mutants to developers challenging, in particular in light of unproductive mutants. Mutation testing at Google is designed to overcome these challenges of scale and unproductive mutants, and therefore differs from the traditional approach to mutation testing described in the literature [13].

Figure 1 summarizes how the Mutation Testing Service at Google creates and analyzes mutants: Mutation testing is started when developers send changelists for code review. A changelist is an atomic update to the version control system, and it consists of a list of files, the operations to be performed on these files, and possibly the file contents to be modified or added, along with metadata like change description, author, etc. First, the Mutation Testing Service calculates the code coverage for the changelist (Section 2.1). Then, it creates mutants (Section 2.2) by determining which nodes of the abstract syntax tree (AST) are eligible for mutation. An AST node is eligible for mutation if it is covered by at least one test and if it is not arid (i.e., if mutated, it does not create unproductive mutants; see Section 3). The service then generates, executes, and analyzes mutants for all eligible AST nodes (Section 2.3). In the end, only a restricted set of surviving mutants is selected to be surfaced to the developer as part of the code review process (Section 2.4). This section describes the overall infrastructure and workflow of mutation testing at Google.
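For illustration only, a changelist carrying the information described above could be modeled as follows (a minimal sketch; the field names are illustrative, not Google's internal schema):

  from dataclasses import dataclass, field
  from enum import Enum

  class FileOp(Enum):
      ADD = 1
      MODIFY = 2
      DELETE = 3

  @dataclass
  class FileChange:
      path: str
      op: FileOp
      new_contents: str | None = None  # absent for deletions

  @dataclass
  class Changelist:
      files: list[FileChange]
      description: str
      author: str
      # path -> set of added or modified line numbers
      changed_lines: dict[str, set[int]] = field(default_factory=dict)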
2.1 Code Coverage

To enable mutation testing at Google, we implemented diff-based mutation testing: Mutants are only generated for lines that are changed. Once a developer is happy with their changelist, they send it to peers for code review. At this point, various static and dynamic analyses are run for that changelist and report useful findings back to the developer and the reviewers. Line coverage is one such analysis: During code reviews, overall and delta code coverage is surfaced to the developers [1]. Overall code coverage is the ratio of the number of lines covered by tests in the file to the total number of instrumented lines in the file. The number of instrumented lines is usually smaller than the total number of lines, since artifacts like comments or pure whitespace lines are not applicable for testing. Delta coverage is the ratio of the number of lines covered by tests in the added or modified lines in the changelist to the total number of added or modified lines in the changelist.

Code coverage is a prerequisite for running mutation analysis, as shown in Figure 3, because of the high cost of generating and evaluating mutants in uncovered lines, all of which would inevitably survive because the code is not tested. Once line-level coverage is available for a changelist, mutagenesis is triggered.

Google uses Bazel as its build system [14]. Build targets list their sources and dependencies explicitly. Test targets can contain multiple tests, and each test suite can contain multiple test targets. Tests are executed in parallel.
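As a concrete reading of the two ratios defined above, a minimal sketch (assuming the sets of instrumented, covered, and changed line numbers are already known):

  def overall_coverage(covered: set[int], instrumented: set[int]) -> float:
      """Covered lines / instrumented lines, for one file."""
      return len(covered & instrumented) / len(instrumented)

  def delta_coverage(covered: set[int], changed: set[int]) -> float:
      """Covered changed lines / all added-or-modified lines in the changelist."""
      return len(covered & changed) / len(changed)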
Fig. 1: Mutagenesis process: (1) For a given changelist, line coverage is computed and the code is parsed into an AST. (2) For AST nodes spanning covered lines, arid nodes are tagged as unproductive using the arid node detection heuristic. (3) Non-arid (eligible) nodes are mutated and tested. (4) Surviving mutants are surfaced as code findings.

Using the explicit dependency and source listing, test coverage analysis provides information about which test target covers which line in the source code. Results of the coverage analysis link lines of code to a set of tests covering them. Line-level coverage is used during the test execution phase, where it determines the minimal set of tests that need to be run in an attempt to kill a mutant.
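A minimal sketch of this mapping: inverting per-test line coverage into a line-to-tests index yields, for any mutated line, the smallest candidate set of tests that could possibly kill a mutant in it. Names are illustrative.

  from collections import defaultdict

  def index_coverage(test_coverage: dict[str, set[int]]) -> dict[int, set[str]]:
      """test_coverage maps a test target to the lines it covers."""
      line_to_tests: dict[int, set[str]] = defaultdict(set)
      for test, lines in test_coverage.items():
          for line in lines:
              line_to_tests[line].add(test)
      return line_to_tests

  def tests_for_mutant(line_to_tests: dict[int, set[str]], mutated_line: int) -> set[str]:
      # Only these tests can possibly kill a mutant in this line.
      return line_to_tests.get(mutated_line, set())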
2.2 Mutant Generation

Once delta coverage and line-level coverage metadata are available, the system generates mutants in affected covered lines. Affected lines are added or modified lines in the changelist, and covered lines are defined by the coverage analysis results. The mutagenesis service receives a request to generate point mutations, i.e., mutations that produce a mutant which differs from the original in one AST node on the requested line. For each supported programming language, a special mutagenesis service capable of navigating the AST of a compilation unit in that language accepts point mutation requests and replies with potential mutants.

For each point mutation request, i.e., a (file, line) tuple, a mutation operator is selected and a mutant is generated in that line if that mutation operator is applicable to it. If no mutant is generated by the mutation operator, another is selected, and so on, until either a mutant is generated or all mutation operators have been tried and no mutant could be generated. There are two mutation operator selection strategies, random and targeted, described in Section 4.

When a mutagenesis service receives a point mutation request, it first constructs an AST of the file in question, and visits each node, labeling arid nodes (Section 3) in advance, using heuristics accumulated from developer feedback about mutant productivity over the years. Arid nodes are not considered for mutation and no mutants are produced in them. Arid node labeling happens before mutagenesis starts; mutants in arid nodes are not generated and then discarded, they are never created in the first place.

The Mutation Testing Service implements mutagenesis for 10 programming languages: C++, Java, Go, Python, TypeScript, JavaScript, Dart, SQL, Common Lisp, and Kotlin. For each language, the service implements five mutation operators: AOR (Arithmetic operator replacement), LCR (Logical connector replacement), ROR (Relational operator replacement), UOI (Unary operator insertion), and SBR (Statement block removal). These mutation operators were originally introduced for Mothra [15], and Table 1 gives further details for each. In Python, the unary increment and decrement are replaced by a binary operator to achieve the same effect, due to the language design. In our experience, the ABS (Absolute value insertion) mutation operator was reported to predominantly create unproductive mutants, mostly because it acted on time- and count-related expressions that are positive and nonsensical if negated, and it is therefore not used. Note that this is due to the style and features of our codebase, and may not hold in general.

For each file in the changelist, a set of mutants is requested, one for each affected covered line. Mutagenesis is performed by traversing the ASTs in each of the languages, and decisions are often made at the AST node level because it allows for fine-grained decisions due to the amount of context available.

2.3 Test Execution

Once all mutants are generated for a changelist, a temporary state of the version control system is prepared for each of them, based on the original changelist, and then tests are executed in parallel for all those states. This makes for an efficient interaction and caching between our version control system and build system, and evaluates mutants in the fastest possible manner. Once test results are available, we randomly pick mutants from all surviving mutants to be reported.
We limit the number of reported mutants to at most 7 times the number of total files in the changelist, to ensure that the cognitive overhead of understanding the reported mutants is not too high, which might cause developers to stop using mutation testing. The factor of 7 is the result of heuristics collected over the years of running the system. Selected surviving mutants are reported in the code review UI to the author and the reviewers.
2.4 Code Review

The selected mutants are shown to developers during the code review process. Most changes to Google's monolithic codebase, except for a limited number of fully automated changes, are reviewed by developers before they are merged into the source tree. Potvin and Levenberg [8] provide a comprehensive overview of Google's development ecosystem. Reviewers can leave comments on the changed code that must be resolved by the author. A special type of comment generated by an automated analyzer is known as a finding. Unlike human-generated comments, findings do not need to be resolved by the author before submission, unless a human reviewer marks them as mandatory. Many analyzers are run automatically when a changelist is sent for review: linters, formatters, static code and build dependency analyzers, etc. The majority of analyzers are based on the Tricorder code analysis platform [16].
TABLE 1: Mutation operators implemented in the Mutation Testing Service

NAME                                      SCOPE
AOR  Arithmetic operator replacement      a + b → { a, b, a - b, a * b, a / b, a % b }
LCR  Logical connector replacement        a && b → { a, b, a || b, true, false }
ROR  Relational operator replacement      a > b → { a < b, a <= b, a >= b, true, false }
UOI  Unary operator insertion             a → { a++, a-- } ; b → !b
SBR  Statement block removal              stmt → ∅
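To illustrate how these operators act on an AST, the following sketch applies a single ROR replacement to Python source using the standard ast module; the production service instead implements per-language mutagenesis with full compiler information, so this is only an illustration:

  import ast

  class RORMutator(ast.NodeTransformer):
      """Replaces the first `a > b` comparison it finds with `a < b`."""

      def __init__(self):
          self.done = False

      def visit_Compare(self, node: ast.Compare):
          if not self.done and len(node.ops) == 1 and isinstance(node.ops[0], ast.Gt):
              self.done = True  # point mutation: change a single AST node
              return ast.copy_location(
                  ast.Compare(left=node.left, ops=[ast.Lt()],
                              comparators=node.comparators), node)
          return self.generic_visit(node)

  tree = ast.parse("if x > 0:\n    handle_positive()")
  mutant = RORMutator().visit(tree)
  print(ast.unparse(mutant))  # -> if x < 0: handle_positive()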
Fig. 2: Mutant shown in the code review tool
Fig. 3: Code coverage and mutation testing integration

We display mutation analysis results during the code review process because this maximizes the probability that the results will be considered by the developers.

The number of comments displayed during code review can be large, so it is important that all tools only produce high-quality findings that can be used immediately by the developers. Surfacing non-actionable findings during code review has a negative impact on the author and the reviewers. If an automated changelist analyzer finding (e.g., a surviving mutant) is not perceived as useful, developers can report that with a single click on the finding. If any of the reviewers consider a finding to be important, they can indicate that to the changelist author with a single click. Figure 2 shows an example mutant displayed in Critique, including the "Please Fix" and "Not useful" links in the bottom corners. This feedback is accessible to the owner of the system that created the findings, so quality metrics can be tracked and unhelpful findings triaged, and ideally prevented in the future.
Google has a large codebase with code in various programming languages. The coverage distribution per project is shown in Figure 4. Although the statement coverage of most projects is satisfactory, even with our system that does heavy suppression and selection, the number of live mutants per changelist is still significant (the median is 2 mutants, the 99th percentile is 43 mutants). To be of any use to the author and the reviewers, code findings need to be surfaced quickly, before the review is complete. To further reduce the number of mutants, mutations are never generated in uninteresting, arid lines, as described in Section 3; furthermore, we probabilistically select mutants based on their historical mutation operator performance (Section 4).

Fig. 4: Distribution of project statement coverage

3 Arid Node Detection
Some parts of the code are less interesting than others. Surfacing live mutants in uninteresting statements, for example debug logging statements, wastes the human time spent analyzing the finding and adds cognitive overhead. Because developers do not perceive adding tests to kill mutants in uninteresting nodes as improving the overall efficacy of the suite to detect faults, such mutants tend to survive. This section proposes an approach for mutant suppression and a set of heuristics for detecting AST nodes in which mutation is to be suppressed. There is a trade-off between correctness and usability of the results; the proposed heuristic may suppress mutation in relevant nodes as a side effect of reducing uninteresting node mutations. We argue that this is a good trade-off because the number of possible mutants is always orders of magnitude larger than what we could reasonably present to the developers within the existing developer tools, and it is more effective to prevent high-impact faults rather than arid faults.
3.1 Arid Nodes

Mutation operators create mutants based on the AST of a program. The AST contains nodes, which are statements, expressions, or declarations, and their child-parent relationships reflect their connections in the source code [17]. In order to prevent the generation of unproductive mutants, we identify nodes in the AST that are related to uninteresting statements, i.e., arid nodes.

Most compilers differentiate simple and compound nodes in an AST. Simple nodes have no body, e.g., a call expression names a function and parameters, but has no body. Compound nodes have at least one body, e.g., a for loop might have a body, while an if statement might have two: the then and else branches. Examples of arid nodes would be log statements, calls to memory-reserving functions like std::vector::reserve, or writes to stdout; these scenarios are typically not tested by unit tests.

The heuristic approach for labeling nodes as arid is two-fold and is defined in Equation 1:

$$arid(N) = \begin{cases} expert(N) & \text{if } simple(N) \\ 1 & \text{if } \bigwedge_{b \in N} arid(b) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Here, $N \in T$ is a node in the abstract syntax tree $T$ of a program, $simple$ is a boolean function determining whether a node is simple (compound nodes contain their children nodes), and $expert$ is a boolean function over a subset of simple statements in $T$, encoding manually curated knowledge on arid simple nodes. The first part of Equation 1 operates on simple nodes; it is represented by an expert function curated manually for each programming language and is adjusted over time. The second part operates on compound nodes and is defined recursively: a compound node is an arid node iff all of its parts are arid.

The expert function that flags simple nodes as arid is developed over time to incorporate developer feedback on reported 'Not useful' mutants. This process is manual: if we decide a certain mutation is not productive and that the whole class of mutants should not be created, the rule is added to the expert function. This is the critical part of the system because, without it, users would become frustrated with non-actionable feedback and opt out of the system altogether. Targeted mutation and careful surfacing of findings have been critical for the adoption of mutation testing at Google. There are more than a hundred rules for arid node detection in our system.
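A minimal sketch of Equation 1 over Python ASTs follows; the expert rules shown are illustrative stand-ins for the manually curated, per-language knowledge:

  import ast

  def is_simple(node: ast.AST) -> bool:
      # Compound nodes carry a body (loops, ifs, function definitions, ...).
      return not hasattr(node, "body")

  def expert(node: ast.AST) -> bool:
      # Illustrative rules: writes to stdout and capacity hints are arid.
      if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
          func = node.value.func
          name = getattr(func, "attr", None) or getattr(func, "id", "")
          return name in {"print", "reserve"}
      return False

  def arid(node: ast.AST) -> bool:
      if is_simple(node):
          return expert(node)
      # A compound node is arid iff all of its children are arid.
      children = list(ast.iter_child_nodes(node))
      return bool(children) and all(arid(c) for c in children)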
3.2 Heuristic Rules

The expert function consists of various rules, some of which are mutation-operator-specific, and some of which are universal. We distinguish between heuristics that prevent the generation of uncompilable vs. compilable yet unproductive mutants. Most heuristics deal with the latter category, but the former is also important, especially in Go, where the compiler is very sensitive to mutations (e.g., an unused import is a compiler error). For compilable mutants, we distinguish between heuristics for equivalent mutants, killable mutants, and redundant mutants, as reported in Table 2.

3.2.1 Uncompilable Mutants

A mutant should be a syntactically valid program—otherwise, it would be detected by the compiler and would not add any value for testing. There are certain mutations, especially ones that delete code, that violate this validity principle. A prime example is deleting code in Go; any unused variables or imported modules produce compiler errors. Our heuristic is to gather all used symbols and put them in a slice instead of deleting them, so that they are still referenced and the compiler is appeased.
3.2.2 Equivalent Mutants

Equivalent mutants, which are semantically equivalent to the original program, are a plague in mutation testing and cannot generally be detected automatically. However, there are some categories of equivalent mutants that can be accurately detected. For example, in Java, the specification for the size method of a java.util.Collection is that it returns a non-negative value. This means that mutations such as collection.size() == 0 → collection.size() <= 0 are guaranteed to produce an equivalent mutant.

Another example in this category is related to memoization. Memoization is often used to speed up execution, but its removal inevitably causes the generation of equivalent mutants. The following heuristic is used to detect memoization: An if statement is a cache lookup if it is of the form if a, ok := x[v]; ok { return a }, i.e., if a lookup in the map finds an element, the if block returns that element (among other values, e.g., an Error in Go). Such an if statement is a cache-lookup statement and is considered arid by the expert function, as is its full body. The following example shows a cache lookup in Go:

  var cache map[string]string

  func get(key string) string {
    if val, ok := cache[key]; ok {
      return val
    }
    value := expensiveCalculation(key)
    cache[key] = value
    return value
  }

Removing the if statement just removes caching, but does not change functional behavior, and hence yields an equivalent mutant. The program still produces the same output for the same input—albeit slower. Functional tests are not expected to detect such changes.

As a third example, a heuristic in this category avoids mutations of time specifications, because unit tests rarely test for time, and if they do, they tend to use fake clocks. Statements invoking sleep-like functionality, setting deadlines (e.g., sleep(...) or rpc.set_deadline(...)), or waiting for services to become ready (like the gRPC [18] server's Wait function that is always invoked in RPC servers, which are abundant in Google's code base) are considered arid by the expert function.

3.2.3 Unproductive Killable Mutants

Not all code is equally important. Much of it can be mutated, and those mutants could actually be killed, but such tests are not considered valuable and will not be written by experienced developers; such mutants are bad test goals. Examples in this category are increments of values in monitoring system frameworks, low-level APIs like mkdir, or flag changes: these are easy to mutate, easy to test for, and yet mostly undesirable as test goals.

A common way to implement heuristics in this category is to match function names; indeed, we suppress mutants in calls to hundreds of functions, which is responsible for the highest number of suppressions. The star example of this category is a heuristic that marks any function call arid if the function name starts with the prefix log or the object on which the function is invoked is called logger. We validated this heuristic by randomly sampling 100 nodes that were marked arid by the log heuristic, and found that 99 were indeed correctly marked, while one had marginal utility. We have fuzzy name suppression rules for more than 200 function families. For example, the following mutation of a logging statement is suppressed:

  log.infof("network speed: %v", bytes / time)   // original
  log.infof("network speed: %v", bytes + time)   // mutant (suppressed)
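A sketch of this name-based rule, transposed to Python ASTs (the prefix and receiver lists are illustrative; the real system covers more than 200 function families):

  import ast

  ARID_PREFIXES = ("log",)
  ARID_RECEIVERS = {"logger"}

  def is_arid_call(call: ast.Call) -> bool:
      func = call.func
      if isinstance(func, ast.Attribute):
          recv = func.value
          # Receiver named `logger`, e.g., logger.warn(...).
          if isinstance(recv, ast.Name) and recv.id in ARID_RECEIVERS:
              return True
          return func.attr.lower().startswith(ARID_PREFIXES)
      # Bare calls, e.g., log_event(...).
      return isinstance(func, ast.Name) and func.id.lower().startswith(ARID_PREFIXES)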
3.2.4 Redundant Mutants

There has been a lot of research on redundant mutants, targeted at reducing the cost of mutation testing. While the cost aspect is not a concern for us, because we generate at most a single mutant in a line, user experience and consistency are important concerns. In a code review context, we surface mutants in each snapshot; when the developers update their code, possibly writing tests to kill mutants, we rerun mutation testing on the new code and report new mutants. Because of this, we suppress some redundant mutants so that mutants are consistently reported, as opposed to alternating between redundant mutants, which introduces cognitive overhead and can be confusing.

As an example, in C++, the LCR mutation operator has a special case when dealing with NULL (i.e., nullptr), because of its logical equivalence with false:

ORIGINAL NODE          POTENTIAL MUTANTS
if (x != nullptr)  →   if (x)
                       if (nullptr)       [redundant]
                       if (x == nullptr)
                       if (false)         [redundant]
                       if (true)

The mutants marked as redundant are equivalent to one another because the value of nullptr as a condition is equivalent to false. Likewise, the opposite example, where the condition is if (nullptr == x), yields redundant mutants for the left-hand side. These mutations are suppressed.
3.2.5 Discussion

The highest mutant productivity gains came from the three heuristics implemented in the early days: suppression of mutations in logging statements, time-related operations (e.g., setting deadlines, timeouts, exponential backoff specifications, etc.), and configuration flags. Most of the early feedback was about unproductive mutants in such code, which is ubiquitous in the code base. While it is hard to measure exactly, there is strong indication that these suppressions account for improvements in productivity from about 15% to 80%. Additional heuristics and refinements progressively improved productivity to 89%.

Heuristics are implemented by matching AST nodes with the full compiler information available to the mutation operator. Some heuristics are unsound: they employ fuzzy name matching and recognize AST shapes, but can suppress a productive mutant. On the other hand, some heuristics make use of the full type information (like matching java.util.HashMap::size calls) and are sound. Sound heuristics are demonstrably correct, but we have had much more important improvements of perceived mutant usefulness from unsound heuristics.

For a detailed list of heuristics, please refer to Appendix A.

TABLE 2: Arid node heuristics.

HEURISTIC               COUNT   FREQUENCY
Uncompilable                1   Common
Equivalent                 13   Common
Unproductive killable      16   Very common
Redundant                   2   Uncommon
4 Mutant Selection Criteria
Once arid nodes have been identified in the AST, the next step (cf. Section 2.2) is to produce mutants for the remaining, non-arid nodes. There are two issues arising from this: First, only mutants that survive the tests can be shown to developers, whereas those that are killed just use computational resources. Many mutants never survive the test phase, and are not reported to the developer and reviewers during code review. An iterative approach, where after the first round of tests further rounds of mutagenesis could be run for lines in which mutants were killed, would use the build and test systems inefficiently, and would take much longer because of the multiple rounds. Second, not all surviving mutants are equally productive: Depending on the context, certain mutation operators may produce better mutants than others. Reporting all surviving mutants for a line would prolong the mutagenesis step and increase test evaluation costs in a prohibitive manner. Because of this, effective selection criteria not only constitute a good trade-off, but are crucial in making mutation analysis results actionable during code review. In this section, we present a basic random selection strategy, which generates one mutant per covered line and considers information about arid nodes, and a targeted selection strategy, which considers the past performance of mutation operators in similar contexts (Figure 5).
4.1 Random Selection

The basic principle of random line-based mutant selection is shown in Listing 1: For each line in a changelist, one of the mutants that can be generated for that line is selected randomly; alternatively, a mutation target is picked randomly first and then a mutation operator is randomly selected.

  function Mutagenesis(diff_ast)
    mutants ← ∅
    for line in covered_lines(diff_ast)
      mutants ∪= uniform_random(all_mutants(line))
    endfor
    return mutants

Listing 1: Naïve random selection

Since our approach to mutation testing is based on the identification of arid nodes, which should not be mutated, the random selection algorithm we use is described in Listing 2. For each language, the Mutation Testing Service implements mutation operators as AST visitors. The mutation operators available for a language are randomly shuffled, and are used one by one to try to create a mutant in the given file and line, until one succeeds. We do this for each changed line in the changelist that is covered by tests. If any mutant can be created in a line, one will be created in that line, but which one depends on the random shuffle and the AST itself (e.g., in a line without relational operators, the ROR mutation operator will not produce a mutant, but SBR might, because most lines can be deleted). If the first mutation operator in the randomly shuffled order cannot produce a mutant in a given line, either because it is not applicable to it, or because the relevant AST nodes are labeled arid, the next mutation operator is invoked, until either a mutant is produced or there are no more mutation operators left. This is done for each mutation request.

  function Mutagenesis(diff_ast)
    mutants ← ∅
    productive_ast = remove_arid_nodes(diff_ast)
    ops = shuffle({UOI, ROR, SBR, LCR, AOR})
    for line in covered_lines(productive_ast)
      for op in ops
        if can_generate(op, line)
          mutants ∪= generate_mutant(op, line)
          break
    return mutants

Listing 2: Random selection with suppression

It is important to note that many nodes are labeled as arid by our heuristic (see Section 3), and are not considered for mutation at all. Furthermore, only a single mutant per line is ever produced; all others are not considered. These design decisions proved to be the core of making mutation testing feasible at very large scale.
4.2 Targeted Selection

The targeted mutation operator selection strategy orders the operators by their perceived productivity in the mutation AST context, as shown in Listing 3.

  function Mutagenesis(diff_ast)
    mutants ← ∅
    productive_ast = remove_arid_nodes(diff_ast)
    ops = {UOI, ROR, SBR, LCR, AOR}
    for line in covered_lines(productive_ast)
      ops = order_by_historic_productivity(line, ops)
      for operator in ops
        if can_generate(operator, line)
          mutants ∪= generate_mutant(operator, line)
          break
    return mutants

Listing 3: Targeted selection with suppression
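The order_by_historic_productivity step can be sketched as follows, assuming per-operator statistics have already been looked up for the line's AST context. The stats store and the way the two signals are combined into one score are assumptions of this sketch; the underlying signals are described next.

  from dataclasses import dataclass

  @dataclass
  class OperatorStats:
      generated: int
      survived: int
      marked_productive: int
      feedback_total: int

      @property
      def score(self) -> float:
          survivability = self.survived / max(self.generated, 1)
          productivity = self.marked_productive / max(self.feedback_total, 1)
          # Combining the two signals multiplicatively is a design choice
          # of this sketch, not necessarily the production ranking.
          return survivability * productivity

  def order_by_historic_productivity(context_stats: dict[str, OperatorStats],
                                     ops: list[str]) -> list[str]:
      """context_stats holds per-operator stats for mutants in similar contexts."""
      return sorted(ops, key=lambda op: context_stats[op].score, reverse=True)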
Fig. 5: Random (1) vs. Targeted (2) mutation selection

The information about how productive mutating a particular AST node with a particular mutation operator is, is based on historical information: First, we can determine a mutation operator's survivability (i.e., the fraction of mutants produced by the operator in the past that were not killed by the existing tests) in a particular context. Second, we can determine a mutant's productivity using developer feedback: Each reported mutant can be flagged as productive or unproductive by the author of the changelist or any of the reviewers of the changelist. We consider this a strong signal because it comes from experienced professionals who understand the context of the mutant.

Using this information, we can order the mutation operators by survivability and perceived productivity, rather than using a random shuffle. For each mutant, an AST context is kept, describing the environment of the AST node that was mutated, along with the productivity feedback and whether the mutant was killed or not. When the mutagenesis service receives a point mutation request, it finds, for the nodes for which mutation is requested, similar nodes from the body of millions of previously evaluated mutants using the AST context, and then looks into the historical performance of those mutants in two categories: developer feedback on productivity and mutant survivability. Mutation operators are ordered using this metric rather than uniformly shuffled, and mutagenesis is attempted in that order, to maximize the probability that the mutant will be productive, or at least survive to be reported in the code review. For example, if we are mutating a binary expression within an if condition, we will find mutants created in a similar AST context and see how each mutation operator performed on them.

4.3 Mutation Context Similarity

In order to apply historical information about mutation productivity and effectiveness, we need to decide how similar candidate mutations are compared to past mutations. We define a mutation to be similar if it happened in a similar context, e.g., replacing a relational operator within an if condition that is the first statement in the body of a for loop, as shown in Listing 4.

As an efficient means to capture the similarity of the context of two mutations, we use the hashing framework for tree-structured data introduced by Tatikonda et al. [19], which maps an unordered tree into a multiset of simple structures referred to as pivots. Each pivot captures information about the relationship among the nodes of the tree (see Section 4.4). Finding similar mutation contexts is then reduced to finding similar pivot multisets. To identify similar pivot multisets, we produce a MinHash [20] inspired fingerprint of the pivot multiset. Because the distance in the fingerprint space correlates with the distance in the tree space, we can find similar mutation contexts efficiently by finding similar fingerprints of the node under mutation.
4.4 Tree Hashing

In order to capture the intricate relationships between nodes in the AST, we translate the AST into a multiset of pivots. A pivot is a triplet of nodes from the AST that encodes their relationship; for nodes $u$ and $v$, a pivot $p$ is the tuple $(lca, u, v)$, where $lca$ is the lowest common ancestor of nodes $u$ and $v$. The pivot represents a subtree of the AST. The set of all pivots involving a particular node describes the tree from the point of view of that node. In mutation testing, we are only interested in nodes that are close to the node being mutated, so we constrain the set of pivots to pivots containing nodes that are within a certain distance of the node considered for mutation.

In the example of replacing a relational operator in an if condition within the body of the for loop in Listing 4, one pivot might be (if, Cond, *), and another (Cond, i, kMax). All combinations of two nodes within some distance of the node being mutated in the AST in Figure 6, together with their lowest common ancestor, make pivot structures.

  for (int i = 0; i < kMax; ++i) {
    if (i < kMax / 2) {
      return i / 2;
    } else {
      return i * 2;
    }
  }

Listing 4: C++ snippet with an if statement within a for loop

Fig. 6: AST for the C++ example in Listing 4

Pivot multisets $P$ precisely preserve the structural relationships of the tree nodes (parent-child and ancestor relations), so the tree similarity of two AST subtrees $T_1$ and $T_2$ can be measured as the Jaccard index of the pivot multisets [19], as shown in Equation 2:

$$d(T_1, T_2) = Jaccard(P(T_1), P(T_2)) = \frac{|P(T_1) \cap P(T_2)|}{|P(T_1) \cup P(T_2)|} \qquad (2)$$
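To make the pivot construction concrete, a sketch over Python ASTs (naïve LCA and distance computations; node type names stand in for the per-type hash values introduced below):

  import ast
  from itertools import combinations

  def parent_map(tree: ast.AST) -> dict:
      return {child: node for node in ast.walk(tree)
              for child in ast.iter_child_nodes(node)}

  def ancestors(node, parent):
      chain = [node]
      while node in parent:
          node = parent[node]
          chain.append(node)
      return chain  # [node, ..., root]

  def lca(u, v, parent):
      seen = {id(a) for a in ancestors(u, parent)}
      return next(a for a in ancestors(v, parent) if id(a) in seen)

  def distance(u, v, parent):
      du, dv = len(ancestors(u, parent)), len(ancestors(v, parent))
      da = len(ancestors(lca(u, v, parent), parent))
      return (du - da) + (dv - da)

  def pivots(tree: ast.AST, around: ast.AST, radius: int) -> list:
      parent = parent_map(tree)
      near = [n for n in ast.walk(tree)
              if distance(n, around, parent) <= radius]
      # Each pivot is (lca, u, v), reduced here to node type names.
      return [(type(lca(u, v, parent)).__name__,
               type(u).__name__, type(v).__name__)
              for u, v in combinations(near, 2)]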
Pivot multisets are potentially quadratic in tree size, leading to costly union and intersection operations. Even a trivial if statement with a single return statement produces large pivot sets, and set operations become prohibitive. To alleviate that, a fingerprinting function is applied to convert large pivot multisets into fixed-size fingerprints.

We hash the pivots to single objects that form the multiset of representatives for the input AST. The size of the multiset can be large, especially for large programs. In order to improve the efficiency of further manipulation, we use a signature function that converts large pivot hash sets into shorter signatures. The signatures are later used to compute the similarity between the trees, taking into consideration only the AST node type and ignoring everything else, like type data or the names of identifiers.

To hash a single pivot $p = (lca, u, v)$ into a fixed-size value, we use a simple hash function proposed by Tatikonda and Parthasarathy [19]:

$$h(p) = (a \cdot lca + b \cdot u + c \cdot v) \bmod K$$

For $a$, $b$, $c$ we pick small primes, and for $K$ a large prime that fits in 32 bits. To be able to hash AST nodes, we assign sparse integer hash values to the different AST node types in each language; e.g., a C++ FunctionDecl is assigned 8500, and
CXXMethodDecl another fixed value; to hash a pivot $(lca, u, v)$, we use these assigned hashes.

The signature for such a bag of representatives is generated using a MinHashing technique. The set of pivots is permuted and hashed under that permutation. To minimize false positives and negatives (i.e., different trees hashing to similar hashes, or vice versa), this is repeated $k$ times, resulting in $k$ MinHashes.

The goal is that the signatures are similar for similar (multi)sets and dissimilar for dissimilar ones. The Jaccard similarity between two sets can be estimated by comparing their MinHash signatures in the same way [20], as shown in Equation 3. The MinHash scheme can be considered an instance of locality-sensitive hashing, in which ASTs that have a small distance to each other are transformed into hashes that preserve that property.
$$d(T_1, T_2) = \frac{|P(T_1) \cap P(T_2)|}{|P(T_1) \cup P(T_2)|} \approx \frac{|hash(T_1) \cap hash(T_2)|}{|hash(T_1) \cup hash(T_2)|} \qquad (3)$$

When mutating a node, we calculate its pivot set and hash it. We find similar AST contexts using nearest neighbor search algorithms. We observe how different mutants behave in this context and which mutation operators produce the most productive and surviving mutants. This is the basis for targeted mutation selection.
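A sketch of the fingerprinting pipeline under the definitions above; the node-type constants, the primes, and the simulated permutations are illustrative assumptions, not the production values:

  NODE_TYPE_HASH = {"FunctionDecl": 8500, "If": 9013, "Compare": 9102,
                    "For": 9230}  # sparse per-type values; illustrative

  A, B, C = 3, 5, 7            # small primes a, b, c
  K = 2_147_483_647            # a large prime that fits in 32 bits

  def pivot_hash(pivot) -> int:
      # h(p) = (a*lca + b*u + c*v) mod K, over assigned node-type hashes.
      l, u, v = (NODE_TYPE_HASH.get(t, 1) for t in pivot)
      return (A * l + B * u + C * v) % K

  def minhash_signature(pivot_multiset, k: int = 64) -> list:
      """k-MinHash: the minimum under k 'permutations', simulated here by
      re-hashing with k different affine salts. Assumes a non-empty multiset."""
      hashes = [pivot_hash(p) for p in pivot_multiset]
      return [min((h * (2 * i + 1) + i) % K for h in hashes)
              for i in range(k)]

  def estimated_similarity(sig1, sig2) -> float:
      # The fraction of agreeing MinHashes estimates the Jaccard index (Eq. 3).
      return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

Signatures produced this way can be indexed for the nearest-neighbor search described above, so similar AST contexts can be retrieved at mutagenesis time.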
5 Evaluation

In order to bring value to developers, the Mutation Testing Service at Google needs to surface few productive mutants, selected from a large pool of mutants—most of which are unproductive. Recall that a productive mutant elicits an effective test, or otherwise advances code quality [11]. Therefore, our goal is two-fold. First, we aim to select mutants with a high survival rate and productivity to maximize their utility as test objectives. Second, we aim to surface very few mutants to reduce computational effort and avoid overwhelming developers with too many findings.

In addition to the design decision of applying mutation testing at the level of changelists rather than projects, two technical solutions reduce the number of mutants: (1) mutant suppression using arid nodes and (2) one-per-line mutant selection. The first research question aims to answer how effective these two solutions are:

• RQ1 Mutant suppression. How effective are mutant suppression using arid nodes and 1-per-line mutant selection? (Section 5.2)

To understand the influence of mutation operator selection on mutant survivability and productivity in the remaining non-arid nodes, we consider historical data, including developer feedback. We aim to answer the following two research questions:

• RQ2 Mutant survivability. Does mutation operator selection influence the probability that a generated mutant survives the test suite? (Section 5.3)

• RQ3 Mutant productivity. Does mutation operator selection influence developer feedback on a generated mutant? (Section 5.4)

Having established the influence of individual mutation operators on survivability and productivity, the final question is whether mutation context can be used to improve both. Therefore, our final research question is as follows:

• RQ4 Mutation context. Does context-based selection of mutation operators improve mutant survivability and productivity? (Section 5.5)
5.1 Datasets

For our analyses, we established two datasets, one with data on all mutants, and one containing additional data on mutation context for a subset of all mutants.
Mutant dataset.
The mutant dataset contains 16,935,148 mutants across 10 programming languages: C++, Java, Go, Python, TypeScript, JavaScript, Dart, SQL, Common Lisp, and Kotlin. Table 3 summarizes the mutant dataset and gives the number and ratio of mutants per programming language, the average number of mutants per changelist, and the percentage of mutants that survive the test suite. Table 4 breaks down the numbers by mutation operator.

We created this dataset by gathering data on all mutants that the Mutation Testing Service generated since its inauguration, which refers to the date when we made the service broadly available, after the initial development of the service and its suppression rules (see Section 3.2.5). We did not perform any data filtering, hence the dataset provides information about all mutation analyses that were run.

TABLE 3: Summary of the mutant dataset. (Note that SQL, Common Lisp, and Kotlin are excluded from our analyses because of insufficient data.)

LANGUAGE        COUNT        RATIO    PER CL   SURVIVABILITY
C++             7,197,069    42.5%    23.2     12.5%
Java            2,894,772    17.1%    14.8     13.2%
Go              1,988,798    11.7%    27.6     12.5%
Python          1,689,382    10.0%    21.3     13.2%
TypeScript      1,006,531     5.9%    20.8     10.8%
JavaScript        908,014     5.4%    31.0      9.4%
Dart              581,109     3.4%    17.4     16.3%
SQL               478,975     2.8%    91.2     11.7%
Common Lisp       148,289     0.9%   179.3      2.2%
Kotlin             42,209     0.2%    20.7     11.0%
Total          16,935,148     100%    21.8     12.5%
TABLE 4: Number of mutants per mutation operator.

OPERATOR    COUNT         RATIO    SURVIVABILITY
SBR         11,522,932    68.0%    12.7%
UOI          3,137,375    18.5%     9.6%
LCR          1,305,499     7.7%    16.3%
ROR            672,009     4.0%    14.7%
AOR            297,333     1.8%    13.5%
Total       16,935,148     100%    12.5%
In total, our data collection considered 776,740 changelists that were part of the code review process. For these, 16,935,148 mutants were generated, out of which 2,110,489 were surfaced. Out of all surfaced mutants, 66,798 received explicit developer feedback. For each considered changelist, the mutant dataset contains information about:

• affected files and affected lines,
• test targets testing those affected lines,
• mutants generated for each of the affected lines,
• test results for the file at the mutated line, and
• mutation operator and context for each mutant.

Our analysis aims to study the efficacy and perceived productivity of mutants and mutation operators across programming languages. Note that our mutant dataset is likely specific to Google's code style and review practices. However, the code style is widely adopted [21], and the modern code review process is used throughout the industry [22]. Information about mutant survivability per programming language or mutation operator can be directly extracted from the dataset and allows us to answer research questions RQ1, RQ2, and RQ3.

Context dataset.
The context dataset contains 4,068,241 mutants (a subset of the mutant dataset) for the top four programming languages: C++, Java, Go, and Python. Each mutant in this dataset is enriched with the information of whether our context-based selection strategy would have selected that mutant. When generating mutants, we would also run the context-based prediction, and we persisted the prediction information along with the mutants. If the randomly chosen operator was indeed what the prediction service picked, this mutant is the one with the highest predicted value. For each mutant, the dataset contains:

• all information from the mutant dataset,
• predicted survivability and productivity for each mutation in a similar context, and
• information about whether the mutant has the highest predicted survivability/productivity.

We created this dataset by using our context-based mutation selection strategy during mutagenesis on all mutants during a limited period of time. During this time, we automatically annotated the mutants, indicating whether a mutant would be picked by the context-based mutation selection strategy, along with the mutant outcome in terms of survivability and productivity. This dataset enables the evaluation of our context-based mutation selection strategy and allows us to answer research question RQ4.

Experiment measures:
Surviving the initial test suite is a precondition for surfacing a mutant, but survivability alone is not a good measure of mutant productivity. For example, a mutation that changes the timeout of a network call likely survives the test suite but is also very likely to be unproductive (i.e., a developer will not consider writing a test for it). Hence, developer feedback indicating that a mutant is indeed (un)productive is a stronger signal. We measure mutant productivity via user feedback gathered from Critique (Section 2.4), where each surfaced mutant displays a Please fix (productive mutant) and a Not useful (unproductive mutant) link. Please fix corresponds to a request to the author of a changelist to improve the test suite based on the surfaced mutant; Not useful corresponds to a false alarm or, generally, a non-actionable code finding. 82% of all surfaced mutants with feedback were labeled as productive by developers. Note that this ratio is an aggregate over the entire data set. Since the inauguration of the Mutation Testing Service, productivity has increased over time from 80% to 89% because we generalized the feedback on unproductive mutants and created suppression rules for the expert function, described in Section 3. This means that later mutations of nodes in which mutants were found to be unproductive will be suppressed, generating fewer unproductive mutants over time. Surfaced mutants without explicit developer feedback are not considered for the productivity analysis.
5.2 Mutant Suppression (RQ1)

In order to compare our mutant-suppression approach with traditional mutagenesis, we (1) randomly sampled 5,000 changelists from the mutant dataset, (2) determined how many mutants traditional mutagenesis produces, and (3) compared the result with the number of mutants generated by our approach. (Since traditional mutation analysis is prohibitively expensive at scale, we adapted our system to generate all mutants only for the selected changelists.) Figure 7 shows the results for three strategies: no suppression (traditional), select one mutant per line, and select one mutant per line after excluding arid nodes (our approach). We include the 1-per-line approach in the analysis to evaluate the individual contribution of the arid-node suppression, beyond sampling one mutant per line.

As shown in Table 5, the median number of generated mutants is 820 for traditional mutagenesis, 77 for 1-per-line selection, and only 7 for arid-1-per-line selection. Hence, our mutant-suppression approach reduces the number of mutants by two orders of magnitude.
Fig. 7: Number of generated mutants per changelist for no suppression (traditional mutagenesis), 1-per-line, and arid-1-per-line (our approach). (Note the log-scaled vertical axis.)

TABLE 5: Mann-Whitney U test comparing the distributions of the number of mutants generated by different strategies.

STRATEGY A        STRATEGY B         P-VALUE   MEDIAN A   MEDIAN B
No suppression    1-per-line         <.0001    820        77
1-per-line        Arid-1-per-line    <.0001    77         7
No suppression    Arid-1-per-line    <.0001    820        7

Table 5 also shows the results of a Mann-Whitney U test, which confirms that the distributions are statistically significantly different.

Our mutant-suppression approach generates fewer than 20 mutants for most changelists; the 25th and 75th percentiles are 3 and 19, respectively. In contrast, the 25th and 75th percentiles for 1-per-line are 31 and 138 mutants. Traditional mutagenesis generates more than 450 mutants for most changelists (the 25th and 75th percentiles are 460 and 1734, respectively), further underscoring that this approach is impractical, even at the changelist level. Presenting hundreds of mutants, most of which are not actionable, to a developer would almost certainly result in that developer abandoning mutation testing altogether.
RQ1: Arid-node suppression and 1-per-line selection significantly reduce the number of mutants per changelist, with a median of only 7 mutants per changelist (compared to 820 mutants for traditional mutagenesis).
5.3 Mutant Survivability (RQ2)

Mutant survivability is important because we generate at most a single mutant per line—if that mutant is killed, no other mutant is generated. To be actionable, mutants have to be reported as soon as possible in the code review process, as described in Section 4. Therefore, we aim to maximize mutant survivability because it directly impacts the number of surfaced mutants.

Overall, 87.5% of all generated mutants are killed by the initial test suite. Note that this is not the same as the traditional mutation score [23] (the ratio of killed mutants to the total number of mutants) because mutagenesis is probabilistic and only generates a subset of all mutants. This means only a fraction of all possible mutants are generated and evaluated, and many other mutants are never generated because they are associated with arid nodes.

Fig. 8: Mutant survivability. (a) Survivability per programming language. (b) Survivability per mutation operator.

Fig. 9: Mutant productivity. (a) Productivity per programming language. (b) Productivity per mutation operator.

Tables 3 and 4 show the distribution of the number of mutants and mutant survivability, broken down by programming language and mutation operator. Figure 8 visualizes the mutant survivability data. Because the SBR mutation operator can be applied to almost any non-arid node in the code, it is no surprise that this mutation operator dominates the number of mutants, contributing roughly 68% of all mutants. While SBR is a prolific and versatile mutation operator, it is also the second least likely to survive the test suite: when applicable to a changelist, SBR mutants are surfaced during code review with a probability of 12.6%. Overall, mutant survivability is similar across mutation operators, with the notable exception of UOI, which has a survivability of only 9.5%. Mutant survivability is also similar across programming languages, with the exception of Dart, whose mutant survivability is noticeably higher. We conjecture that this is because Dart is mostly used for web development, which has its own testing challenges.
RQ2: Different mutation operators result in different mutant survivability; for example, the survival rate of LCR is almost twice as high as that of UOI.
Fig. 10: Improvements achieved by context-based selection, per language (C++, Go, Java, Python) and per mutation operator (AOR, LCR, ROR, SBR, UOI), for the probabilities that a mutant survives and that it is productive. (0% improvement corresponds to random selection.)
5.4 Mutant Productivity (RQ3)

Mutant productivity is the most important measure, because it directly measures the utility of a surfaced mutant. Since we only generate a single mutant in a line, that mutant ideally should not just survive the test suite but also be productive, allowing developers to improve the test suite or the source code itself. Given Google's high accuracy and actionability requirements for surfacing code findings during code reviews, we rely on developer feedback as the best available measure for mutant productivity. Specifically, we consider a mutant a developer marked with Please fix to be more productive than others. Likewise, we consider a mutant a developer marked with Not useful to be less productive than others. (Note that we excluded mutants for which no developer feedback is available from the analysis.) We compare mutant productivity across mutation operators and programming languages.

Figure 9 shows the results, indicating that mutant productivity is similar across mutation operators, with AOR and UOI mutants being noticeably less productive. For example, ROR mutants are productive 84.1% of the time, whereas UOI mutants are only productive 74.5% of the time. The differences between programming languages are even more pronounced, with Java mutants being productive 87.2% of the time, compared to Python mutants that are productive 70.6% of the time. This could be due to code conventions, common language use cases, testing frameworks, or simply the lack of heuristics. We have found that Python code generally requires more tests because there is no compiler to catch errors.
RQ3: ROR, LCR, and SBR mutants show similar productivity, whereas AOR and UOI mutants show noticeably lower productivity.
5.5 Mutation Context (RQ4)

We examine whether context-based selection of mutation operators improves mutant survivability and productivity. Specifically, we determine whether context-based selection of mutation operators increases the probability of a generated mutant to survive and to result in a Please fix request, when compared to the random-selection baseline.

Figure 10 shows that selecting mutation operators based on the AST context of the node under mutation substantially increases the probability of the generated mutant to survive and to result in a Please fix request. While improvements vary across programming languages and across mutation operators, the context-based selection consistently outperforms random selection. The largest productivity improvements are achieved for UOI, AOR, and SBR, which generate most of all mutants. Intuitively, these improvements mean that context-based selection results in twice as many productive UOI mutants (out of all generated mutants), when compared to random selection. Figure 10 also shows to what extent these improvements can be attributed to the fact that simply more mutants are surfaced. Since the improvements for productivity increase even more than those for survivability, context-based selection not only results in more surfaced mutants but also in higher productivity of the surviving mutants. Overall, the survival rate increases by over 40% and the probability that a reviewer asks for a generated mutant to be fixed increases by almost 50%.

It is important to put these improvements into context. Probabilistic diff-based mutation analysis aggressively trims down the number of considered mutants from thousands in a representative file to a mere few, and enables mutants to be effectively presented to developers as potential test targets. The random-selection approach produces fewer surviving mutants of lower productivity.
RQ4:
Context-based selection improves the probability that a generated mutant survives by more than 40% and the probability that a generated mutant is productive by almost 50%.
RELATED WORK
There are several veins of research that are related to this work. Just et al. proposed an AST-based program context model for predicting mutant effectiveness [24]. Fernandes et al. developed various rules for Java programs to detect equivalent and redundant mutants [25]. The initial results are promising for developing selection strategies that outperform random selection. Further, Zhang et al. used machine learning to predict mutation scores, both on successive versions of a given project and across projects [26]. Finally, the PIT project makes mutation testing usable by practicing developers and has gained adoption in industry [27].

There has been a lot of focus on computational costs and the equivalent mutant problem [28]. Much work aims at avoiding redundant mutants, which increase computational costs and inflate the mutation score [29], and instead favors hard-to-detect mutants [30], [31] or dominator mutants [32]. Mutant subsumption graphs have similar goals, but mutant productivity is much fuzzier than dominance or subsumption. Effectiveness for mutants is primarily defined in terms of redundancy and equivalence. This approach fails to consider that non-redundant mutants might be unproductive or that equivalent mutants can be productive [33]. In our experience, reporting equivalent mutants has been a vastly easier problem than reporting unproductive non-redundant and non-equivalent mutants.

Our approach for targeted mutant selection (Section 4) compares the context of mutants using tree hashes. This is similar to tree-based approaches (e.g., [35], [36], [37], [38], [39]) in software clone detection [34], which aims to detect that a code fragment is a copy of some original code, with or without modification. AST-based techniques can detect additional categories of modifications, such as identifier renames or type aliases, that token-based detection cannot, and this insensitivity to variable names is important for the mutation context. However, clone detection differs drastically in its goal: it cares about detecting code with the same semantics, in spite of the syntactical changes made to it. While clone detection might want to detect that an algorithm has been copied and then changed slightly, e.g., a recursion rewritten into an equivalent iterative algorithm, the mutation testing context cares only about the neighboring AST nodes: in the iterative algorithm, the most productive mutants will be those that thrived before in such code, not the ones that thrived for a recursive algorithm. In order to look up similar AST contexts in real time, as mutants are created, we require a fast method that preserves hash distances over time, so that targeted selection improves as history accumulates. For these consistency and efficiency reasons, we opted for the described tree-hashing approach. Whether alternative distance measurements can be scaled for application at Google and whether they can further improve targeted selection remains to be determined in future work.
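To make the idea concrete, the following minimal sketch (ours, not the production implementation; the node representation and the choice of hash function are illustrative assumptions) shows a Merkle-style tree hash that ignores identifier names and keys historical lookups by the syntactic neighborhood of the mutated node:

import hashlib

class Node:
    def __init__(self, kind, children=()):
        self.kind = kind              # node kind only, e.g. "BinaryOp(+)"; identifiers are ignored
        self.children = tuple(children)

def tree_hash(node):
    # Merkle-style: combine the node kind with the hashes of its children,
    # so structurally similar subtrees produce identical hashes.
    h = hashlib.sha256(node.kind.encode("utf-8"))
    for child in node.children:
        h.update(tree_hash(child))
    return h.digest()

def context_key(node, ancestors, depth=2):
    # Key the mutated node by a bounded syntactic neighborhood, so that
    # historical mutant performance for similar contexts can be looked up
    # in constant time while mutants are being created.
    context = list(ancestors[-depth:]) + [node]
    return hashlib.sha256(b"".join(tree_hash(n) for n in context)).hexdigest()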
CONCLUSIONS
Mutation testing has the potential to effectively guide software testing and advance software quality. However, many mutants represent unproductive test goals; writing tests for them does not improve test suite efficacy and, even worse, negatively affects test maintainability.

Over the past six years, we have developed a scalable mutation testing approach and mutant suppression rules that increased the ratio of productive mutants, as judged by developers, from 15% to 89% at Google. Three strategies were key to this success. First, we devised an incremental mutation testing strategy, reporting at most one mutant per line of code, targeting lines that are changed and covered. Second, we created a set of rule-based heuristics for mutant suppression, based on developer feedback and manual analyses. Third, we devised a probabilistic, targeted mutant selection approach that considers mutation context and historical mutation results.

Given the success of our mutation testing approach and the positive developer feedback, we are planning to deploy it company-wide. We expect that this step will result in additional refinements of the suppression and selection strategies in order to maintain a mutant productivity rate around 90%. Furthermore, we will investigate the long-term effects of mutation testing on developer behavior when writing tests as part of our future work.

REFERENCES

[1] M. Ivanković, G. Petrović, R. Just, and G. Fraser, "Code coverage at Google," in Proc. of ESEC/FSE, August 26–30 2019, pp. 955–963.
[2] A. J. Offutt and J. M. Voas, "Subsumption of condition coverage techniques by mutation testing," Department of Information and Software Systems Engineering, George Mason University, Tech. Rep. ISSE-TR-96-100, 1996.
[3] D. Schuler and A. Zeller, "Assessing oracle quality with checked coverage," in Proc. of ICST, 2011, pp. 90–99.
[4] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: Help for the practicing programmer," Computer, vol. 11, no. 4, pp. 34–41, 1978.
[5] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin, "Using mutation analysis for assessing and comparing testing coverage criteria," IEEE Transactions on Software Engineering, vol. 32, no. 8, pp. 608–624, 2006.
[6] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing?" in Proceedings of the International Symposium on Foundations of Software Engineering. ACM, 2014, pp. 654–665.
[7] Y. T. Chen, R. Gopinath, A. Tadakamalla, M. D. Ernst, R. Holmes, G. Fraser, P. Ammann, and R. Just, "Revisiting the relationship between fault detection, test adequacy criteria, and test set size," in Proc. of ASE, September 21–25 2020, pp. 237–249.
[8] R. Potvin and J. Levenberg, "Why Google stores billions of lines of code in a single repository," Communications of the ACM, vol. 59, pp. 78–87, 2016.
[9] "How DevOps accelerates 'Ideas to Prod' at Google," https://swampup.jfrog.com/.
[10] D. Schuler and A. Zeller, "(Un-)covering equivalent mutants," in Proc. of ICST, April 2010, pp. 45–54.
[11] G. Petrović, M. Ivanković, B. Kurtz, P. Ammann, and R. Just, "An industrial application of mutation testing: Lessons, challenges, and research directions," in Proc. of Mutation, April 2018, pp. 47–53.
[12] G. Petrović and M. Ivanković, "State of mutation testing at Google," in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2018.
[13] A. J. Offutt and R. H. Untch, "Mutation 2000: Uniting the orthogonal," Mutation Testing for the New Century, pp. 34–44, 2001.
[14] "Bazel build system," https://bazel.io/, 2015.
[15] A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf, "An experimental determination of sufficient mutant operators," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 5, no. 2, pp. 99–118, 1996.
[16] C. Sadowski, J. van Gogh, C. Jaspan, E. Soederberg, and C. Winter, "Tricorder: Building a program analysis ecosystem," in Proceedings of the International Conference on Software Engineering (ICSE), 2015.
[17] S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[18] Google Inc., "gRPC: A high performance, open-source universal RPC framework," https://grpc.io, 2006.
[19] S. Tatikonda and S. Parthasarathy, "Hashing tree-structured data: Methods and applications," in Proc. of ICDE. IEEE, 2010, pp. 429–440.
[20] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-wise independent permutations," Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.
[21] "Google style guides," https://google.github.io/styleguide/.
[22] A. Bacchelli and C. Bird, "Expectations, outcomes, and challenges of modern code review," in Proc. of ICSE. IEEE, 2013, pp. 712–721.
[23] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: Help for the practicing programmer," Computer, vol. 11, no. 4, pp. 34–41, April 1978.
[24] R. Just, B. Kurtz, and P. Ammann, "Inferring mutant utility from program context," in Proc. of ISSTA, July 2017, pp. 284–294.
[25] L. Fernandes, M. Ribeiro, L. Carvalho, R. Gheyi, M. Mongiovi, A. Santos, A. Cavalcanti, F. Ferrari, and J. C. Maldonado, "Avoiding useless mutants," in Proc. of GPCE, October 2017, pp. 187–198.
[26] J. Zhang, Z. Wang, L. Zhang, D. Hao, L. Zang, S. Cheng, and L. Zhang, "Predictive mutation testing," in Proc. of ISSTA, July 2016, pp. 342–353.
[27] H. Coles, "Real world mutation testing," http://pitest.org, last accessed January 2018.
[28] Y. Jia and M. Harman, "An analysis and survey of the development of mutation testing," IEEE TSE, vol. 37, no. 5, pp. 649–678, 2011.
[29] R. Just and F. Schweiggert, "Higher accuracy and lower run time: Efficient mutation analysis using non-redundant mutation operators," JSTVR, vol. 25, no. 5-7, pp. 490–507, 2015.
[30] X. Yao, M. Harman, and Y. Jia, "A study of equivalent and stubborn mutation operators using human analysis of equivalence," in Proc. of ICSE, May 2014, pp. 919–930.
[31] W. Visser, "What makes killing a mutant hard," in Proc. of ASE, September 2016, pp. 39–44.
[32] P. Ammann, M. E. Delamaro, and J. Offutt, "Establishing theoretical minimal sets of mutants," in Proc. of ICST, 2014, pp. 21–31.
[33] P. McMinn, C. J. Wright, C. J. McCurdy, and G. Kapfhammer, "Automatic detection and removal of ineffective mutants for the mutation analysis of relational database schemas," IEEE TSE, 2017.
[34] C. K. Roy and J. R. Cordy, "A survey on software clone detection research," Queen's School of Computing TR, vol. 541, no. 115, pp. 64–68, 2007.
[35] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier, "Clone detection using abstract syntax trees," in Proceedings of the International Conference on Software Maintenance. IEEE, 1998, pp. 368–377.
[36] W. Yang, "Identifying syntactic differences between two programs," Software: Practice and Experience, vol. 21, no. 7, pp. 739–755, 1991.
[37] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, "DECKARD: Scalable and accurate tree-based detection of code clones," in Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 2007, pp. 96–105.
[38] V. Wahler, D. Seipel, J. Wolff, and G. Fischer, "Clone detection in source code by frequent itemset techniques," in Proceedings of the Fourth IEEE International Workshop on Source Code Analysis and Manipulation. IEEE, 2004, pp. 128–135.
[39] W. S. Evans, C. W. Fraser, and F. Ma, "Clone detection via structural abstraction," Software Quality Journal, vol. 17, no. 4, pp. 309–330, 2009.
Goran Petrović
Goran Petrović is a Staff Software Engineer at Google Switzerland, Zürich. He received an MS in Computer Science from the University of Zagreb, Croatia, in 2009. His main research interests are software quality metrics and improvements, ranging from prevention of software defects to evaluation of software design reusability and maintenance costs, and automated large-scale software refactoring.
Marko Ivanković
Marko Ivanković is a Staff Software Engineer at Google Switzerland, Zürich. He received an MS in Computer Science from the University of Zagreb, in 2011. His work focuses on Software Engineering as a discipline, large-scale code base manipulation, code metrics, and developer workflows.
Gordon Fraser
Gordon Fraser is a full professor in Computer Science at the University of Passau, Germany. He received a PhD in computer science from Graz University of Technology, Austria, in 2007, worked as a post-doc at Saarland University, and was a Senior Lecturer at the University of Sheffield, UK. The central theme of his research is improving software quality, and his recent research concerns the prevention, detection, and removal of defects in software.
René Just
René Just is an Assistant Professor at the University of Washington. His research interests are in software engineering, software security, and data science, in particular static and dynamic program analysis, mobile security, and applied statistics and machine learning. He is the recipient of an NSF CAREER Award, and his research in the area of software engineering has won three ACM SIGSOFT Distinguished Paper Awards. He develops research and educational infrastructures that are widely adopted by other researchers and instructors (e.g., Defects4J and the Major mutation framework).

APPENDIX A
ARID NODE HEURISTICS
Nodes of the abstract syntax tree (AST) are arid if applying mutation operators to them or their subtrees would lead to unproductive mutants. An unproductive mutant is either trivially equivalent to the original program, or, if it is detectable, adding a test for it would not lead to an actual improvement of the test suite. The decision of whether an AST node is arid is implemented using heuristics built on developer feedback over time. In general, these heuristics are specifically tailored to the environment of the developers who provided the feedback, and a different context will require deriving new, appropriate heuristics. In this appendix, we summarize the main categories of such heuristics. We first summarize the categories of arid node heuristics that are independent of a specific programming language; then we describe heuristics developed specifically for different programming languages. For each heuristic, we provide examples of unproductive mutants that the heuristic addresses.
A.1 Language Independent Heuristics
A.1.1 Logging Frameworks
Logging statements are rarely tested outside of the code of the logging systems themselves. Mutants in logging statements are usually unproductive and would not lead to tests that improve software quality.
LOG(INFO) << "Duration: " << (absl::Now() - start);
LOG(INFO) << "Duration: " << (absl::Now() + start);
A special case of the logging statement heuristic concerns the Console class available in the browser, which can be used for logging; mutants in that code are unproductive test goals.

console.log('duration is ', new Date() - start);
console.log('duration is ', new Date() + start);
The same is true for other console methods, such as assert.
Implementation.
This is implemented using AST-level arid node tagging, matching call expressions or macros.
Soundness.
This heuristic is sound when applied to source code that does not explicitly test the logging code itself, which is easy to detect using the build system.
A.1.2 Memory and Capacity Functionality
Often it makes sense to pre-allocate memory for efficiency, when the total size is known in advance. Mutants in these memory size specifications are not good test goals; they usually create functionally equivalent code and are not killable by standard testing methods. For example (the capacity values here are illustrative):

std::vector<int> v;
v.reserve(100);   // original
v.reserve(101);   // mutant

In this example, the only consequence is that the vector may need to grow itself, which will take extra time. The same also holds for Java collections, e.g.:

List<String> list = new ArrayList<>(100);
Similar constructs exist in all programming languages, and the heuristic extends to all of these, such as std::vector::resize, reserve, shrink_to_fit, free, or delete. These represent a family of common functions of many containers in many languages, std::vector being just a representative example. Another interesting example are the cache prefetch instructions added with SSE (prefetcht0, prefetcht1, prefetcht2, and prefetchnta), accessible with __builtin or directly in an asm block.

- __builtin_prefetch(&obj, 0, 3);

Implementation.
This is implemented using AST-level arid node tagging, matching call expressions.
Soundness.
This heuristic is sound; it uses exact symbols and type names.
A.1.3 Monitoring Systems
Although it may be debatable whether monitoring logic should be tested or not, developers did not use such mutants productively and instead reported them as being unproductive. Consequently, heuristics mark AST nodes related to monitoring logic as arid.
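For illustration, a typical suppressed mutant changes a metric update (the monitoring API below is hypothetical, not a specific Google framework):

server_requests.increment()     # original: updates a monitoring counter
server_requests.increment(2)    # mutant: unproductive; tests should not assert on metrics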
Implementation.
This is implemented using AST-level arid node tagging, matching constructor or call expressions.
Soundness.
This heuristic is sound; it uses exact symbols and type names.
A.1.4 Time Related Code
Clocks are usually faked in tests, and networking calls are short-circuited to special RPC implementations for testing; it therefore rarely makes sense to mutate time expressions when used in a deadline context, because they would lead to unproductive mutants.

- ::SleepFor(absl::Seconds(5));
The same holds for other types of network code, such as setting deadlines:

context.set_deadline(std::chrono::system_clock::now() + std::chrono::milliseconds(10));
context.set_deadline(std::chrono::system_clock::now() - std::chrono::milliseconds(10));
Implementation.
This is implemented using AST-level arid node tagging, matching constructor or call expressions.
Soundness.
This heuristic is sound; it uses exact symbols and type names.
A.1.5 Tracing and Debugging
Code is often adorned with debugging and tracing information that may even be excluded in release builds, but present while testing. This code serves its purpose, but is usually impossible to test, and mutants in that code do not make good test goals.

- ASSERT_GT(input.size(), 0);
- assert(x != nullptr);
- TRACE(x);
- Preconditions.checkNotNull(v);
- exception.printStackTrace();
In general, check-failures usually make the program segfault and serve as a last line of defense, and tracing is used for debugging purposes, so neither results in productive mutants.
Implementation.
This is implemented using AST-level arid node tagging, matching constructors, call expressions, or macros.
Soundness.
This heuristic is sound; it uses exact symbols and type names.
A.1.6 Programming Model Frameworks
There are specialized frameworks for specifying complex work conceptually and then executing that work in a different way, where the code that is written serves as a model for the intent, not the real logic that gets executed. Some examples of this principle are Apache Beam and TensorFlow.
Consider the following representative Beam pipeline (a sketch; the original example is truncated in the source):

Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply(TextIO.read().from(options.getInput()));

In this example, developers usually test the components of the pipeline, but not the code assembling the pipeline. Similar examples exist for TensorFlow:

- tf.compat.v1.enable_eager_execution()
assert tf.multiply(6, 7).numpy() == 42
Implementation.
This is implemented using AST-level arid node tagging, matching constructor or call expressions.
Soundness.
This heuristic is not sound. Because it is based on best-effort matching of code structures that look arid and often are, it can suppress productive mutants.
A.1.7 Block Body Uncovered
Suppose that a block entry condition (e.g., of an if statement) is covered by tests, but the condition is not fulfilled by any test, and thus the corresponding block is not covered. Most mutants of the condition would only help the developers identify that no test covers the relevant branch yet. However, the same information is already provided by coverage, and so mutants in such if conditions are deemed unproductive. Mutants like this can indeed inform about test suite quality, but coverage is a far simpler test goal for the developers to act on in this case; for that reason, we use coverage to drive test case implementation, and mutants for their subsequent improvement.
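For illustration (hypothetical code), consider a covered condition whose body no test reaches:

if retries > MAX_RETRIES:        # condition covered by tests, but never true in any test
    raise TooManyRetriesError()  # block body uncovered

Mutating the condition (e.g., > to >=) would only restate what line coverage already reports: the raise statement is never exercised.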
Implementation.
This is implemented using AST-level arid node tagging, aided by line code coverage data.
Soundness.
While most mutants are indeed unproductive, the heuristic is not entirely sound, as there may be mutants that reveal information about boundary cases of the condition. Since coverage points out that a branch is not taken, forcing boundary-check tests before even covering both branches is premature; if the tests written for the coverage test goal do not check boundary conditions, mutants can then be reported as new test goals.
A.1.8 Arithmetic Operator with a no-op Child
In some cases, mostly due to style, code will be written with explicit zeros for some parts of an expression. For example:

data[i] + 0 * sizeof(char),
data[i] + 4 * sizeof(char),
data[i] + 8 * sizeof(char);
Mutating the binary operator + by removing the right-hand side (the 0 * sizeof(char)), leaving only the left-hand side of the binary operator (the data[i]), results in an equivalent mutant. The code is simply written in such a way because it deals with low-level instructions, and the code style requires that each offset be explicitly written and all lines equally aligned, so each offset is at the same column.
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.9 Logical Comparator of POD with Zero Values
When comparing a plain-old-data structure with its zero value, there is a possibility of creating an equivalent mutant. For example, a conditional statement if (x != 0), with x having a primitive or record type, is equivalent to if (x). In that case, mutating the condition x != 0 to the left-hand-side operand x produces an equivalent mutant.

if (x !== 0) { return 5; }
if (x) { return 5; }
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound when full type and expression information is available.
A.1.10 Logical Comparator with Null Child
When comparing something to nullptr or its corresponding value in other languages (NULL, nil, null, None, ...), picking the left (or right, depending on where the null value is) operand is equivalent to replacing the binary operator with false.

if (worker_ == nullptr)
+ if (nullptr) // `if (false)` is the equivalent mutation

In an expression of the form x != nullptr, mutating it to x produces an equivalent mutant.

if (worker_ != nullptr) worker_->DoWork();
if (worker_) worker_->DoWork();
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.11 Floating Point Equality
Floating point equality comparison, except for special values such as zero, is mostly meaningless. For a number x that is not 0, replacing f() > x with f() >= x is not a good test goal.

return normalized_score > 0.95
return normalized_score >= 0.95
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.12 Expression and Statement Deletion
Many statements can be deleted, but usually more cannot, if the code is to compile. This is obvious in itself, but it is worth listing general types of nodes that are best not deleted. Some of them are: the conditional (ternary) operator (b in a ? b : c), parameters of call expressions (a in f(a)), non-assignment binary operators, unary operators that are not a standalone statement but part of a compound expression, return, label, default, and declaration statements, blocks containing a return path within non-void functions, and the only statement in a non-void function (a function with one statement). Some of these rules change from language to language, or are applicable only in some languages, but the ideas carry over. In C++, one may have a function without a return statement that compiles with the right set of compiler flags, although the return value is undefined; in some other languages, it would fail to compile, and no amount of compiler flags could change that. Blocks can be deleted, or replaced with an empty block {}, or, in Python, a block containing only pass; an illustrative sketch follows.
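For illustration (hypothetical Python code), deleting the only return statement of a value-returning function is suppressed, whereas deleting an ordinary statement in a plain sequence is a legitimate mutant:

def scale(x):
    return x * 2      # deletion suppressed: the function would lose its only return path

def log_and_notify(user):
    log(user)         # deletion allowed: removing this statement still yields valid code
    notify(user)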
Implementation.
This is implemented using AST-level arid node tagging, matching nodes.
Soundness.
This heuristic is not sound because it might suppress some mutants that would be productive.

A.1.13 Program Flags
Program flags, passed in as arguments and parsed by a flag framework such as Abseil, are a way to configure the program. Often tests will inject fake flag values, but just as often they will ignore them; flags may be used for algorithm tweaking (maximum threads in a pool, maximum size of a cache, deadline for network operations). Other flags inform the program about the location of dependencies on the network, or resources on the file system; these are usually faked in tests and injected directly into the code using the programming API rather than flags, since the code is directly invoked rather than forked into.

- flags.DEFINE_string('name', 'Jane Random', 'Your name.')
flags.DEFINE_integer('stack_size', 1000 * 1000, 'Size of the call stack.')
flags.DEFINE_integer('stack_size', 1000 / 1000, 'Size of the call stack.')
flags.DEFINE_integer('rpc_deadline_seconds', 5 * 60, 'Network deadline.')
flags.DEFINE_integer('rpc_deadline_seconds', 5 + 60, 'Network deadline.')
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.14 Low-level APIs
If the code directly accesses the operating system using the standard libraries (e.g., glibc) or Python's os or shutil libraries (e.g., to copy files, create a directory, or print to the screen), then the program is probably some kind of utility script, and mutating these calls results in unproductive mutants: these calls are hard to mock (except in Python) and are mostly unproductive test targets. There are exceptions, e.g., an API that wraps this communication and is used by various projects, but for the most part there are few of those and many more simple utility scripts for doing basic filesystem operations. We can be sure that these are not critical programs because the standard libraries cannot use any of the standard storage systems, just local disk, and are rarely used in production.

- shutil.rmtree(dir)
- os.rename(from, to)
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.15 Stream Operations
Streams like stdout, stderr, or any other cached buffer flush when the buffer fills to some point, or on special events. Removing the flush operations on various streams should change no behavior from the test point of view, and therefore mutants of such statements are not productive test goals. The same also holds for close operations on files or other buffers.

- buffer.flush();
- file.close();
Implementation.
This is implemented using AST-level arid node tagging, matching call expressions.
Soundness.
This heuristic is not sound because there are conceivable code constructs in which buffer operations change the perceived behavior (e.g., in concurrent stream manipulation).
A.1.16 Gate Configuration
It is very common to use flags or some other mechanism to facilitate easy switching between different implementations, or to control the state of a rollout. Consider the following:

class Controller(object):
    USE_NEXT_GEN_BACKEND = True

class Controller(object):
    USE_NEXT_GEN_BACKEND = False
In this example there are two implementations, an old and a new one, but ideally both should work correctly, and then it becomes impossible for tests to distinguish that there is a difference. Similarly, a more gradual approach might have something like this:

private static final Double nextGenTrafficRatio = 0.1;
private static final Double nextGenTrafficRatio = 0.1 + 1;
Some ratio of traffic can exercise a new implementation, for easier incremental control. Mutants in such global switches, usually determinable from code style, do not make for good test goals.
Implementation.
This is implemented using AST-level arid node tagging, matching nodes.
Soundness.
This heuristic is not sound because it is guessing the meaning of a class field based on its value and location, and it might be wrong.

A.1.17 Cached Lookups
Often, values are cached/memoized to avoid redundant recalculation. Removing the cache lookup slows down the program but functionally does not change anything, producing an equivalent, and thus unproductive, mutant.

def fib(n):
    if n in cache:
        return cache[n]
    if n == 1:
        value = 1
    elif n == 2:
        value = 1
    elif n > 2:
        value = fib(n - 1) + fib(n - 2)
    cache[n] = value
    return value
Implementation.
This is implemented using AST-level arid node tagging, matching complex code structures. The code structure that is considered a cached lookup must fulfill the following: a) it must look up an input parameter in an associative container and return from it under that key if found; b) it must store the value that it otherwise returns in the same container under the same key.
Soundness.
This heuristic is not sound because it only checks for probable code structures.
A.1.18 Infinity
There are various representations of infinity in mathematical libraries in various languages. Incrementing or decrementing these produces an equivalent, and thus unproductive, mutant.

x = a.replace([numpy.inf, -numpy.inf])
x = a.replace([numpy.inf + 1, -numpy.inf])
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.19 Insensitive Arguments
Some functions are insensitive to precise values, or use them only as an indication. These, if mutated, should be mutated to a degree that changes not only the value but also the indication of that value. For example, in Python the zip builtin makes an iterator that aggregates elements from each of the iterables passed to it. The iterator stops when the shortest input iterable is exhausted, meaning that changing the size of one of the parameters is not guaranteed to affect the result.

zip(a[i:j], b[j:k], c[k:m])
zip(a[i:j + 1], b[j:k], c[k:m])
Incrementing and decrementing indices within zip parameters is thus likely to create equivalent (unproductive) mutants. Another example is given by comparator functions in any context: it is very common for comparators to take two values and return -1, 0, or 1 if one element is less than the other, equal to it, or greater than it, in whatever semantics the author defines. Commonly, any negative value implies the former and any positive value implies the latter, while zero implies equality. As an example, consider Java Collections:

list.sort((Person p1, Person p2) -> p1.getAge() - p2.getAge());
list.sort((Person p1, Person p2) -> p1.getAge() - p2.getAge() + 1);
This mutant can only be helpful when the age difference is exactly -1 or 0; for any other combination it is an equivalent mutant and thus an unproductive test target. Another Java example is the String::split method, one of whose overloaded versions takes two parameters: the regex that defines the split, and the limit that controls the number of times the pattern is applied, affecting the length of the resulting array. According to the API specification, if the limit is non-positive, the pattern will be applied as many times as possible. This means that any negative number has the same semantics.

String[] parts = key.split(",", -1);
String[] parts = key.split(",", -2);
Finally, another example is a loop specification with a step. When changing the range condition, it has to be changed by at least a full step for the change to have an effect.

x = l[1:10 + 2 * 7:14]
x = l[1:10 + 2 * 7 + 1:14]

for (int i = 1; i < 10 + 2 * 7; i += 14) { std::cout << i << std::endl; }
for (int i = 1; i < 10 + 2 * 7 + 1; i += 14) { std::cout << i << std::endl; }
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.1.20 Collection Size
The size of a collection cannot be a negative number, so when comparing the length of a container to zero, some mutants resulting from the comparison may produce unreachable code and make for unproductive test goals.

if len(l) > 0:
    return l[1]

if len(l) < 0:
    return l[1]

The same also holds for collections in other languages, although it is not always easy to detect when the length is accessed. In Java, the length method can be detected for all the standard library collections by checking the inheritance chain. In Go and Python, the len builtin function can be detected with ease, and in C++, the size method can be checked for, along with iterators or the inheritance chain.
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available, barring the redefinition of the len function in Python or hot-plugging a patched class into the Java standard library.
A.1.21 Trivial Methods
Most programming languages have different types of "boilerplate" code that is required, but rarely considered important enough to be tested by developers. For example, in Java there are methods like equals, hashCode, toString, and clone, and they are usually implemented using existing libraries like the Objects API in Java or Abseil Hash in C++. While it is possible that these methods do indeed contain bugs, the developer feedback on the productivity of corresponding mutants clearly indicates that mutants in such methods are not productive.

@Override
public boolean equals(Object o) {
  if (!(o instanceof CellData)) {
    return false;
  }
  CellData that = (CellData) o;
  return Objects.equals(exp, that.exp) && Objects.equals(text, that.text);
}

@Override
public boolean equals(Object o) {
  if (false) {
    return false;
  }
  CellData that = (CellData) o;
  return Objects.equals(exp, that.exp) && Objects.equals(text, that.text);
}
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound because it relies on the code style recommendation for implementing such methods.
A.1.22 Early Exit Optimizations
Linus Torvalds famously stated, in the kernel coding style, that "...if you need more than 3 levels of indentation, you're screwed anyway, and should fix your program." While this is sometimes hard to accomplish, having fewer things to remember is a good thing, so the code style encourages returning early where possible. Consider a mutant that deletes such an early return, as sketched below.
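For illustration (hypothetical Python code; the production heuristic matches constructs such as Java's ImmutableMap.of(), as described below):

def index_by_name(items):
    if not items:
        return {}     # early exit, kept for readability
    return {item.name: item for item in items}

Deleting the early return leaves the behavior unchanged, because the dictionary comprehension over an empty input yields an empty dictionary anyway.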
The early return just makes the code easier to understand but has no effect on the behavior, and the resulting equivalent mutant is an unproductive test goal.
Implementation.
This is implemented using AST-level arid node tagging, matching expressions. This condition triggers when an empty container (e.g., ImmutableMap.of()) is returned if one of the parameters is checked for emptiness. The checks for emptiness range from zero or null-looking expressions to invocations of len, size, or empty methods on a container of an appropriate type that depends on the language (e.g., hash maps, lists, dictionaries, trees, stacks, etc.). The empty container criterion checks for standard library containers, commonly used libraries, and internal specialized container implementations.
Soundness.
This heuristic is not sound because the mutant might not be equivalent.
A.1.23 Equality and Equivalence
Some languages have equality (==) and equivalence (===) comparison operators, where one checks whether the values look the same and the other whether they are the same. The equivalence operators check for strict equality of both type and value, while the standard equality is not strict: it applies type coercion and then compares values, making the string '77' equal to the integer 77, because the string gets coerced to an integer. The overwhelming feedback indicates that strict-to-non-strict mutants, and vice versa, make for unproductive test goals.

if (value === CarType.ECO)
if (value != CarType.ECO)

To avoid dogmatic debates, == is only mutated to != and === only to !==.
Implementation.
This is implemented using AST-level arid node tagging, matching binary operators.
Soundness.
This heuristic is not sound because it relies on the code style recommendation on comparison operators.
A.1.24 Acceptable Bounds
Gating a computed result into an acceptable bound by using Math.min, Math.max, or constrainToRange of Ints, Longs, and friends is by design unlikely to change behavior when one of the inputs is mutated. This is similar to the Insensitive Arguments heuristic, and the resulting mutants are usually unproductive.

long newCapacity = Math.min(Math.max(data.length * 2L, minCapacity), MAX_BUFFER_SIZE);
long newCapacity = Math.min(Math.max(-(data.length * 2L), minCapacity), MAX_BUFFER_SIZE);
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound; it can suppress productive mutants that can result from mathematical operations.
A.2 JavaScript
A.2.1 Closure
Closure provides a framework for library management and module registration and exporting. These are function calls, but their semantics are for the compiler at the language level, and mutants in nodes containing them make for unproductive test goals.

- goog.requireType('goog.dom.TagName');
Additional issues arise from the fact that the tests are executed in a different environment than the final compiled, obfuscated, minimized, and optimized code, where calls to these functions are potentially removed, replaced, or modified.
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.2.2 Annotations
A special case of the declaration heuristic is based on JavaScript's JSDoc method of signaling implicit match and interface types, for example @interface annotations. These are variables specially tagged in comments, and they require special handling compared to other languages where interfaces are first-class citizens of the language.

- /**
-  * @interface
-  */
apps.action.Action = function() {};
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound because it has the full type and expression information available.
A.3 Java
A.3.1 System & Runtime Classes
Mutants around the System and Runtime classes, which are used for interacting with the operating system, are usually not good test goals. This is a special case of the Low-level APIs heuristic.

- System.gc();
- Runtime.getRuntime().exec("rm -rf " + dirName);
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound; it can suppress productive mutants.
A.3.2 Dependency Injection Modules
Java frequently uses annotation-based automated dependency injection, such as Guice or Dagger. Modules provide bindings for injecting implementations or constants, and usually the tests will override the production modules and register test doubles (fakes, mocks, or test implementations), so changing the production module often has no effect on the tests because the tests override the setup. Such mutants are unproductive test goals.

- bindAsSingleton(binder, CarType.ECO, EcoImpl.class);
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound; it assumes that all automated dependency injection is overridden by tests.
A.4 Python
A.4.1 Main
Python's main entry point of a program is usually an if condition checking that the script is being invoked, and not imported by another script:

if __name__ == '__main__':
    app.run()

if __name__ != '__main__':
    app.run()
Mutants in that expression are not a good test goal.
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound, barring manipulation of the __name__ global.
A.4.2 Special Exceptions
In Python, exceptions like ValueError imply a programming defect, something a compiler might catch if one were employed, not something for which a test should be written. Otherwise, Python's type system would be testable in each function by calling the function with all possible types and asserting that the interpreter works correctly; this is not a good test goal. An AssertionError should usually mean that the code is unreachable. Another special case is a virtual method that raises NotImplementedError and is annotated with abc.abstractmethod.

@abstractmethod
def virtual_method(self):
    raise NotImplementedError()

@abstractmethod
def virtual_method(self):
    pass
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound, because it relies on the consistent usage of control flow mechanisms.
A.4.3 Version Checks
Python has two major versions, namely 2 and 3, and code can be written to work with both interpreters and language specifications. The version can be determined by reading sys.version_info. Mutants in those lines make for unproductive test goals.

if sys.version_info[0] < 3:
    from urllib import quote
else:
    from urllib.parse import quote

if @False@:
    from urllib import quote
else:
    from urllib.parse import quote
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound, because a productive mutant could conceivably appear in version detection code.
A.4.4 Multiple Return Paths
The code style requires Python programs to explicitly return None in all leaves if there are multiple return statements: it forbids relying on the implicit return None that Python would produce when some path has no return statement. Removing those return statements does not make for a good test goal.

def GetBuilder(x):
    if x < 10:
        logging.info('too small, ignoring')
        return None
    elif x > 100:
        return LargeBuilder()
    else:
        return SmallBuilder()
Implementation.
This is implemented using AST-level arid node tagging, matching complex code structures. The triggering condition is that all leaf nodes are return statements.
Soundness.
This heuristic is not sound, because it relies on the code style recommendation.
A.4.5 Print
In Python 2, print is a first-class citizen of the AST; it is not a function that is called using a CallExpr (call expression, e.g., a function or method invocation). While this is covered by the Low-level APIs heuristic, it is worth noting that Python requires handling this differently.

- print 'exiting...'
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is not sound.
A.5 Go
A.5.1 Memory Allocation
Go has a built-in make function to allocate and initialize objects of type slice, map, or chan. The size parameter is used for specifying the slice capacity, the map size, or the channel buffer capacity. The initial capacity will be grown by the runtime as needed, so changing it is undetectable by functional tests. This is a special case of the generic memory and capacity functionality, but it is worth mentioning explicitly because of the builtin status of this function and the AST handling.

buf := make([]byte, 4, 4+3*10)
buf := make([]byte, 4, 4+3/10)
Implementation.
This is implemented using AST-level arid node tagging, matching expressions.
Soundness.
This heuristic is sound, since it relies on full expression and type information. Suppressed mutants are functionally equivalent.
A.5.2 Statement Deletion
Go has a strict, opinionated compiler, and unlike most others, it has very few flags that can affect its behavior. For example, including an unused package is a compiler error, and defining an unused identifier is also a compiler error. In C++, it is easy to pass a flag to gcc or clang to make this only a warning, whereas in Go that is impossible. Deleting statements or blocks of statements almost invariably produces unbuildable code, and the mutant appears killed because the test fails (to build). There is a way to work around this, which is employed when deleting Go statements. First, the statement under deletion is traversed by a recursive AST visitor, and all symbols that are used are recorded. This includes imported package literals, variables, and functions, but excludes types and built-in functions. Once the list of used symbols is computed, the deletion can proceed in the form of a replacement: everything that was used in the statement under deletion is put into an unnamed slice of type []interface{}. While this is a "hack", it is the only way to delete code without semantically analyzing the rest of the translation unit, which would introduce many issues with byte offsets.

- var v []string
_ = []interface{}{v}

v := "-42"
- i, err := strconv.Atoi(v)
v := "-42"
_ = []interface{}{strconv.Atoi, v}
Implementation.
This is implemented using AST-level arid node tagging, matching complex expressions. The deleted code is recursively visited by a custom AST visitor that collects information about the variables and functions referenced, and extracts the full list of symbols referenced therein. The replacement slice is constructed from all eligible objects.