Towards Evidence-based Testability Measurements
Luca Guglielmo, Andrea Riboni, Giovanni Denaro
Department of Informatics, Systems and Communication (DISCo)
University of Milano-Bicocca
Viale Sarca, 336, 20125 Milan (MI), Italy
Email: [email protected], [email protected], [email protected]
Abstract—Evaluating software testability can assist software managers in optimizing testing budgets and identifying opportunities for refactoring. In this paper, we abandon the traditional approach of pursuing testability measurements based on the correlation between software metrics and test characteristics observed on past projects, e.g., the size, the organization, or the code coverage of the test cases. We propose a radically new approach that exploits automatic test generation and mutation analysis to quantify the amount of evidence about the relative hardness of identifying effective test cases. We introduce two novel evidence-based testability metrics, describe a prototype to compute them, and discuss initial findings on whether our measurements can reflect actual testability issues.
Index Terms—Software testability, mutation analysis, test case generation
I. INTRODUCTION
Accomplishing effective software testing has a key role in producing high-quality software at the state of the practice. Thus, being able to measure the testability of the software artifacts under test, i.e., the degree to which the design of the artifacts supports or hinders their own testing [1]–[4], can provide crucial information for managers. For example, the early availability of testability measures can enable informed decisions on optimizing the testing budget, or pinpoint components that should undergo refactoring before testing.

So far, the research on measuring software testability has focused on either i) exploiting the correlation between static software metrics and testing effort or quality, or ii) estimating the likelihood of detecting faults, in particular with reference to the propagation-infection-execution (PIE) model of fault sensitivity. Static software metrics capture the static structure of software artifacts, e.g., the number of lines of code, McCabe's cyclomatic complexity, or Chidamber and Kemerer's class-level metrics [5], [6]. Indeed, empirical data collected from many past software projects support the existence of correlation between static metrics and the number, the size, the complexity, the code coverage, or the mutation scores of the test suites [7]–[17]. The PIE model defines fault sensitivity as the combined probability of executing faulty locations, infecting the execution state, and propagating the effects of the infection to some observable output [18]. High fault sensitivity can be a proxy of high testability, and vice versa [18]–[21].

This paper makes the observation that the research done so far addresses testability only indirectly, either in terms of predictions about the size, complexity and coverage of test cases, or by estimating fault sensitivity scores. However, we cannot take for granted that these indirect measurements capture testability to the full extent. Moreover, the measurement approaches depend on the characteristics of the available test cases (e.g., test cases from past projects), which can be easily affected by arbitrary decisions of the testers (e.g., decisions on designing few or many test cases, or on aiming for high code coverage or ignoring code coverage). Furthermore, with reference to the body of literature on the correlation with software metrics, we recall that correlation does not necessarily entail causation; in fact, several studies yield contrasting results on which metrics are the best predictors of testability, and predictive models trained on the data of past projects rarely generalise with stable precision [7].

This paper introduces a radically new idea on how to pursue software testability measurements. We aim at directly sampling the relative easiness (or hardness) of identifying test cases that reveal the potential faults in the software modules under test. The higher the evidence of hard-to-test faults in a module, the higher the evidence that the design of that module is not facilitating the testing. On this basis, we refer to our approach as evidence-based testability measurement. Drawing on this idea, we propose an approach that simulates faults based on mutation-based fault seeding [22], [23], computes test cases in the style of search-based test generation techniques [24], and discriminates between easy-to-test and hard-to-test faults based on the results of the test generator. In particular, we classify the seeded faults as easy-to-test when the test generator is able to synthesize fault-revealing test cases out of the box.
Conversely, we classify the seeded faults as hard-to-test if the test generator reveals them only when assisted with artificial testability boosters, like programming interfaces that relax the encapsulation constraints of the tested programs, or ideal test oracles that can predicate on intermediate execution states.

The paper is organized as follows. Section II presents our approach in detail, and contextualizes it on the problem of measuring testability for object-oriented classes. Section III reports initial findings from a qualitative study on the classes of the
Closure Compiler project. Section IV summarizes our conclusions and future research plans.
II. EVIDENCE-BASED TESTABILITY MEASUREMENT
This section discusses the intuitions that underlie our approach, and formalizes those intuitions into a reference theory of evidence-based testability metrics. Next, we define a measurement framework that instantiates the reference theory heuristically, and present a prototype implementation.
A. The Rationale of Evidence-Based Testability Measurements
Our intuition is that software testability could be directly quantified if we knew in advance (i) which faults may exist in the software modules under test, (ii) which test cases the developers could use, and (iii) some criterion to analyze the test cases and argue whether they are easy or hard to identify. In this (admittedly) idealistic scenario, we could quantify the relative testability of the software modules based on the portion of easy-to-test and hard-to-test faults in each module. Intuitively, the higher the portion of hard-to-test faults in a software module, the higher the likelihood that the module may come with testability issues, and vice versa.

Drawing on these intuitions, we propose to evaluate testability as the portion of faults for which there exists some evidence (i.e., at least a test case) that those faults can be revealed with easy test cases. This is why we refer to our testability metrics as evidence-based. We define the following notion of testability:
Definition 1. Idealistic Testability
Let $M$ be a software module of a program $P$, $F$ be the set of executable faults in $P$, $T$ be the set of possible test cases for $P$, and $Reveal \subseteq F \times T$ be the relation between faults and test cases that reveal them. Let also $F(M) \subseteq F$ denote the faults located in $M$. Moreover, let $Hard : T \to \{true, false\}$ be a criterion (a predicate) to decide whether a test case is or is not hard to identify. Accordingly, let $F_{hard}(M)$ be the set of faults in $M$ that are hard to identify, that is, $F_{hard}(M) \equiv \{ f \in F(M) \mid \forall t : Reveal(f, t) \Rightarrow Hard(t) \}$. Then the testability of module $M$ is quantified as
$$Testability(M) = 1 - \frac{|F_{hard}(M)|}{|F(M)|}.$$

(Note that non-executable faults are irrelevant for testing and testability. Moreover, $F_{hard}(M)$ also includes the faults that, although being executable, cannot be revealed with any test case; executable but non-revealable faults are arguably symptoms of testability issues.)

We can further adapt the above definition to discriminate between (i) controllability issues, i.e., testability problems that depend on the difficulty of identifying test cases that execute the faults, and (ii) observability issues, i.e., testability problems that depend on the difficulty of identifying test cases that ultimately reveal the faults by producing observable malfunctions.
Definition 2. Idealistic Controllability
Let $Exec \subseteq F \times T$ be the relation between faults and test cases that execute them, and $Hard\_d : T \to \{true, false\}$ be a criterion to decide whether or not the driver part of a test case is hard to identify. The driver is the part of a test case that sets proper inputs for the module under test, aiming to drive its execution in a specific way. Correspondingly, let $F_{hard\_d}(M)$ be the set of faults in $M$ that can be executed only with test drivers that are hard to identify, that is, $F_{hard\_d}(M) \equiv \{ f \in F(M) \mid \forall t : Exec(f, t) \Rightarrow Hard\_d(t) \}$. Then the controllability of module $M$ is quantified as
$$Contr(M) = 1 - \frac{|F_{hard\_d}(M)|}{|F(M)|}.$$

Definition 3. Idealistic Observability
Let $F_r(M) \subseteq F$ be the set of faults that can be revealed with some test case, that is, $F_r(M) \equiv \{ f \in F(M) \mid \exists t : Reveal(f, t) \}$. Let $Hard\_o : T \to \{true, false\}$ be a criterion to decide whether or not the oracle part of a test case is hard to identify. The oracle is the part of a test case that evaluates the outputs against the specification to exclude or pinpoint malfunctions. Correspondingly, let $F_{hard\_o}(M)$ be the set of faults in $M$ that can be revealed only with oracles that are hard to identify, that is, $F_{hard\_o}(M) \equiv \{ f \in F_r(M) \mid \forall t : Reveal(f, t) \Rightarrow Hard\_o(t) \}$. Then the observability of module $M$ is quantified as
$$Obs(M) = 1 - \frac{|F_{hard\_o}(M)|}{|F_r(M)|}.$$

As mentioned, these definitions capture an idealistic scenario in which we know the potential faults and the possible test cases in advance, which is, of course, unrealistic. In the next subsection, we present a measurement framework that proxies these idealistic metrics by referring to concrete faults, concrete test cases, and objective decisions on evaluating the hardness of test cases and faults.
B. A Framework to Measure Evidence-Based Testability
The measurement framework that we propose in this paper addresses the testability of object-oriented classes. It instantiates the idealistic testability metrics by exploiting (i) mutation-based fault seeding to proxy the potential faults in the classes, (ii) search-based test generation to heuristically sample the possible test cases, and (iii) weak-kill analysis and (iv) de-encapsulation to discriminate hard-to-identify test drivers and test oracles.
Mutation-based fault seeding instruments programs with possible faults by using mutation operators, each describing a class of code-level modifications that may simulate faults in the program [25], [26]. For instance, replacing numeric literals is a mutation operator that produces different program versions (called mutants) by changing a numeric literal in the program with a compatible literal: it produces a mutant for each possible legal replacement. A test case that has a different outcome when executed against the original program and against a mutant m is said to kill the mutant m, meaning that it reveals the sample fault that the mutant represents. Several researchers argue that mutants are valid representatives of real faults [23]. Our measurement framework refers to the set of mutation operators defined in the tool Major [22], and exploits Major both to generate the mutants and to analyze the killed mutants.

We compute test cases with the search-based test generator EvoSuite [24], which samples the possible test cases with meta-heuristic algorithms guided by fitness functions based on coverage criteria, and generates test cases with assertion-style oracles on the observed outputs. In our evaluation framework, EvoSuite plays the role of a reference tester that works with the same capability consistently.

Mutation operators may generate a large number of mutants, including mutants that we cannot execute due to the intrinsic limitations of the test generator of choice, rather than because of testability issues. Our framework takes a twofold approach to coping with this. On the one hand, we focus only on mutants that can be executed with at least one generated test case, even if not necessarily revealed. Technically, we rely on the notion of weak-kill analysis (which belongs to the theory of mutation analysis) as provided by Major. A test case weakly kills a mutant if there exists at least one execution state that differs between the original program and the mutant.

On the other hand, we exploit de-encapsulation to empower the test generator to directly set any input and state variable of any class in the program, and thus produce input states that may be hard to generate otherwise. De-encapsulation is achieved by augmenting the interface of the classes under test with custom setters for all class variables. Next, we execute the test generator on both the original classes and their de-encapsulated versions, and determine the set of executable mutants as the ones weakly killed by at least one obtained test case. We are aware that, technically speaking, using de-encapsulated classes may lead us to generate some input states that are illegal for the original classes. Nonetheless, our measurement framework embraces this approach heuristically: observing faults that the test generator can execute only with de-encapsulation is a sign of strict class interfaces, which may pinpoint testability issues.

We estimate the hard-to-execute and hard-to-reveal mutants (the sets $F_{hard\_d}$ and $F_{hard\_o}$ of Definitions 2 and 3) as the mutants that we could either execute only with custom setters or weakly kill but not ultimately kill, respectively (the sketch below illustrates these cases on a minimal example).
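For illustration, consider the following minimal, hypothetical sketch (the class, the mutant, and the JUnit tests are ours, written by hand rather than produced by Major or EvoSuite): it shows a mutant that is killed, an execution that only weakly kills it, and a branch whose mutants are hard to execute without a custom setter.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical class under test; the literal-replacement mutant is shown inline
// as a comment rather than generated by Major.
public class PriceCalculator {
    private int threshold = 100;        // private field, no setter in the original interface

    public int discount(int price) {
        int discounted = price - 10;    // mutant: price - 11
        if (threshold < 0) {
            return 0;                   // mutants in this branch can be executed only after
        }                               // the de-encapsulation step adds a custom setter
        return discounted > threshold ? threshold : discounted;
    }
}

class PriceCalculatorTest {
    @Test
    public void killsTheMutant() {
        // Kills the mutant: the original returns 90, the mutant returns 89, and
        // the assertion observes the difference.
        assertEquals(90, new PriceCalculator().discount(100));
    }

    @Test
    public void weaklyKillsOnly() {
        // Only weakly kills the mutant: after the mutated statement the internal
        // state differs (190 vs. 189), but both versions return the threshold
        // (100), so no assertion on the output reveals the fault. This is the
        // kind of evidence we count as a hard-to-reveal (observability) case.
        assertEquals(100, new PriceCalculator().discount(200));
    }
}
```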
The rationale is that the mutants that could be executed only with custom setters might, at least in principle, be within the scope of the test generator, although it seems hard for our reference tester to identify test cases that execute those mutants under the testability constraints imposed by the actual interfaces. Similarly, mutants that we detect as weakly killed but not ultimately killed by any test case provide evidence of faults that can in principle be revealed, but may require oracles that are hard for the test generator to identify.

In summary, let $\hat{F}_{kill}$, $\hat{F}_{wkill}$ and $\hat{F}_{wkill\_noset}$ be the sets of mutants that the generated test cases kill, weakly kill, and weakly kill even without using custom setters, respectively; our testability measurement framework makes the following estimates related to the sets in Definitions 2 and 3:
• $\hat{T}$: all test cases that EvoSuite generates in bounded time for the target classes, with and without custom setters.
• $\hat{F} \equiv \hat{F}_r \equiv \hat{F}_{wkill}$: all mutants executed (at least weakly killed) by at least one test case $t \in \hat{T}$.
• $\hat{F}_{hard\_d} \equiv \hat{F}_{wkill} - \hat{F}_{wkill\_noset}$: the mutants executed (at least weakly killed) only by test cases with custom setters.
• $\hat{F}_{hard\_o} \equiv \hat{F}_{wkill} - \hat{F}_{kill}$: the mutants that can be weakly killed, but not killed.
We then estimate controllability and observability as follows.

Definition 4. Estimated Controllability
We estimate the controllability of a class $C$ as the portion of mutants that can be executed by test cases that do not rely on custom setters:
$$Contr(C) = 1 - \frac{|\hat{F}_{hard\_d}(C)|}{|\hat{F}(C)|} = \frac{|\hat{F}_{wkill\_noset}(C)|}{|\hat{F}_{wkill}(C)|}.$$

Definition 5. Estimated Observability
We estimate the observability of a class $C$ as the portion of weakly-killed mutants that are also killed:
$$Obs(C) = 1 - \frac{|\hat{F}_{hard\_o}(C)|}{|\hat{F}_r(C)|} = \frac{|\hat{F}_{kill}(C)|}{|\hat{F}_{wkill}(C)|}.$$

At the current state of our research, we do not yet provide an estimate of the overall testability (Definition 1) of a class, since our conservative assessments of the sets of executable faults (for which we allow test cases with custom setters) and revealed faults (for which we accept weak kills as sufficient evidence) do not match well the precise way in which the controllability and observability facts combine into testability in the ideal scenario of Definition 1.
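As a purely illustrative example with hypothetical numbers (not drawn from our experiments): suppose that for a class $C$ the generated test cases weakly kill 40 mutants overall, 32 of which are weakly killed even without custom setters and 25 of which are also killed. Then

$$Contr(C) = \frac{|\hat{F}_{wkill\_noset}(C)|}{|\hat{F}_{wkill}(C)|} = \frac{32}{40} = 0.80, \qquad Obs(C) = \frac{|\hat{F}_{kill}(C)|}{|\hat{F}_{wkill}(C)|} = \frac{25}{40} \approx 0.63,$$

that is, 8 mutants of $C$ count as hard to execute and 15 as hard to reveal.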
C. Prototype Implementation
We have built a testability measurement prototype on top of the mutation analysis tool Major [22] and the test generator EvoSuite [24]. We exploited the
JavaParser code-manipulation library [27] to generate classes augmented with custom setters (a simplified sketch of this step is shown at the end of this section). Our prototype executes EvoSuite six times on the original classes and six times on the classes with the custom setters, i.e., twelve times in total. For each group of six runs, it executes EvoSuite twice for each of three fitness functions, aiming to improve code coverage and to mitigate the impact of the random choices of the search-based algorithm of EvoSuite. Namely, we refer to the following fitness functions: (i) line and branch coverage, (ii) weak mutation coverage, and (iii) the composition of the two. We set a time budget of 10 minutes for each run of EvoSuite, and relied on the functionality of the tool to generate test cases that include assertions on the observed outputs (we used EvoSuite with the option assertion_strategy=ALL). For each class, we consider the union of all test cases that EvoSuite generated in the twelve runs.

Next, our prototype executes Major to collect the statistics on the killed and weakly-killed mutants by executing the test cases in two passes, in which we retain or strip off the custom setters, respectively. We ignore the mutants that belong to the code of the custom setters.

We remark that, even if the current prototype is arguably slow, in building this first implementation we gave priority to being able to gather initial data on whether our measurements can reflect actual testability issues. We leave for future work the challenge of devising an efficient implementation.
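As an illustration of the de-encapsulation step, the following is a minimal sketch of how custom setters can be generated with JavaParser. It is a simplified stand-in for what our prototype does: the class name, the setter naming scheme, and the handling of corner cases are illustrative assumptions, not the actual implementation.

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.Modifier;
import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
import com.github.javaparser.ast.body.FieldDeclaration;
import com.github.javaparser.ast.body.MethodDeclaration;
import java.nio.file.Paths;

public class DeEncapsulator {

    // Adds a public custom setter for every (non-static, non-final) field of
    // every class declared in the given source file, and returns the augmented
    // source code, ready to be compiled and handed to the test generator.
    public static String addCustomSetters(String javaFile) throws Exception {
        CompilationUnit cu = StaticJavaParser.parse(Paths.get(javaFile));
        for (ClassOrInterfaceDeclaration clazz : cu.findAll(ClassOrInterfaceDeclaration.class)) {
            for (FieldDeclaration field : clazz.getFields()) {
                if (field.isStatic() || field.isFinal()) {
                    continue;   // skip fields that a setter could not (or should not) assign
                }
                field.getVariables().forEach(variable -> {
                    String name = variable.getNameAsString();
                    // "__customSet_" is an arbitrary prefix chosen for this sketch.
                    MethodDeclaration setter =
                            clazz.addMethod("__customSet_" + name, Modifier.Keyword.PUBLIC);
                    setter.addParameter(variable.getType(), name);
                    setter.setBody(StaticJavaParser.parseBlock(
                            "{ this." + name + " = " + name + "; }"));
                });
            }
        }
        return cu.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: print the de-encapsulated version of a class.
        System.out.println(addCustomSetters(args[0]));
    }
}
```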
III. INITIAL FINDINGS
Subjects:
We executed our prototype to evaluate the testability of Closure Compiler (commit 46897c4), a JavaScript compiler written in Java, which is part of the experimental benchmarks of Defects4J [28]. The codebase of Closure Compiler consists of 483 classes (not counting inner classes) comprising 93,907 lines of code. Our prototype generated a total of 51,707 mutants for 355 classes (for the other 128 classes the mutation tool did not yield any mutant) and a total of 2,898 test cases. Overall, we classified 21,920 mutants as killed and 29,441 as weakly killed, of which 28,847 were weakly killed without using custom setters. We executed our prototype on a cloud facility, on a virtual machine equipped with Linux Ubuntu, a 48-core Intel Xeon at 2.4 GHz, and 160 GB of RAM. The experiment took a total of 200 hours.

Results:
Our measurements indicated that 85% of the classes of Closure Compiler have maximum controllability, while the classes are distributed more evenly across the range of observability values, with less than 15% of the classes scoring maximum observability. However, at the current state of our research, it is hard for us to quantitatively evaluate the precision of our measurements, because we lack a reference ground truth. We thus conducted a qualitative study on the classes with the lowest and highest Contr and Obs scores, to try to confirm our hypothesis that low and high scores according to our metrics can be reconciled with actual testability issues and boosters, which we could reveal with manual inspection of the classes.

The top parts of Table I and Table II summarize our findings for the 10 classes with the lowest controllability and observability, respectively. The bottom parts of the tables report the 5 classes with the highest scores (the ones with the most weakly-killed mutants among them). We considered only classes for which our prototype identified at least 10 weakly-killed mutants. The first four columns of the tables indicate the class name (column Class), the lines of code in the class (LOC), the number of weakly-killed mutants (W-killed), and the corresponding Contr (resp. Obs) score, which is a portion of the weakly-killed mutants. The last column reports our findings on the controllability (resp. observability) issues or boosters that we identified manually in the classes.

As expected, our metrics scored maximum values for classes that allow for controlling the execution with simple inputs of interface methods and constructors, and for observing the results in return values or with getter methods. Conversely, controllability and observability issues depend on interfaces that hamper test cases from setting relevant input values and inspecting relevant outputs. We describe the specific issues and boosters in detail at the bottom of each table. For two low-Contr classes and two low-Obs classes, we marked the result as a false positive: we traced these low scores to random behaviors of EvoSuite that missed easy-to-spot test inputs and oracles. However, we indeed reconciled the Contr and Obs scores with actual controllability and observability issues and boosters for all other classes. We observe that in both tables the amounts of lines of code oscillate almost within the same range of values across the classes with either low or high scores, suggesting no clear relation between class size and class testability.
TABLE I
QUALITATIVE STUDY OF CLASSES WITH LOW AND HIGH Contr SCORES

Class  LOC  W-killed  Contr  Findings
UnreachableCodeElimination 146 96 0.01 (i1)
FindExportableNodes 96 25 0.20 n.a.
SourceMapConsumerV2 98 95 0.30 (i1) (i2)
PeepholeOptimizationsPass 71 48 0.36 (i1)
ObjectPropertyStringPostprocess 46 37 0.47 (i3)
VarCheck 199 97 0.60 (i2)
CheckGlobalNames 145 32 0.62 n.a.
OptimizeArgumentsArray 121 22 0.64 (i1)
AnalyzeNameReferences 84 25 0.80 (i1)
FunctionTypeBuilder 605 283 0.85 (i2)
RegExpTree 1403 1356 1.00 (b1)
Fuzzer 739 864 1.00 (b1) (b2)
JsMessage 426 656 1.00 (b1) (b2)
CharRanges 311 521 1.00 (b1) (b2)
SourceMapGeneratorV2 421 458 1.00 (b1)

ISSUES
(i1) Multi-step protocol: The class interface induces an interaction protocol that requires test cases to call multiple methods in specific sequences to set the relevant input states, thus making it harder to identify tests.
(i2) Complex structured inputs: The class takes several inputs defined as complex data structures, and thus requires long test cases that go through sophisticated initialization sequences to set the relevant inputs.
(i3) Preconditioned updates: The interface methods for updating the class state are guarded by many preconditions, and thus the class challenges the testers to comply with the preconditions when specifying the test inputs.
(n.a.) We did not identify any controllability issue.
BOOSTERS
(b1) Simply-typed control inputs: The class methods are fully controllable with inputs of primitive type, string type, or types defined as flat data structures with only primitive fields and setters for all fields.
(b2) Complete field-input mapping in constructors: The test cases can rely on class constructors to assign all fields based on simply-typed inputs of the constructors, one input for each field.

TABLE II
QUALITATIVE STUDY OF CLASSES WITH LOW AND HIGH Obs SCORES

Class  LOC  W-killed  Obs  Findings
BooleanType 49 11 0.00 (i1)
SourcePosition 23 17 0.06 (i2)
NullType 53 12 0.08 (i1)
FunctionBuilder 82 22 0.09 n.a.
JvmMetrics 214 81 0.16 (i2)
CommandLineRunner 405 37 0.16 (i3)
Timer 55 17 0.18 (i2)
XtbMessageBundle 138 10 0.20 (i4)
Base64VLQ 54 29 0.24 n.a.
GraphPruner 53 18 0.28 (i1) (i3)
Token 212 101 1.00 (b1)
NameReferenceGraphReport 145 77 1.00 (b1) (b2)
BooleanLiteralSet 39 14 1.00 (b1)
SimpleDependencyInfo 42 13 1.00 (b1) (b2)
DiagnosticGroupWarningsGuard 29 11 1.00 (b1)

ISSUES
(i1) Complex observer methods: The observer methods depend on many parameters, or take complex data structures as input, thus making it harder to specify the test oracles.
(i2) Output on system streams: The class produces most output on system streams, e.g., it writes results to the console or to the GUI, making it harder to specify automatic test oracles on those results.
(i3) Structural updates on private data: The class computes structural characteristics of internal data structures with private visibility, e.g., the dimension of internal arrays, which cannot be checked with test oracles.
(i4) Asynchronous updates: The class delegates work to asynchronous methods, which then produce results via callbacks, and it is difficult for the test oracles to predicate on those results.
(n.a.) We did not identify any observability issue.

BOOSTERS
(b1) Output as simply-typed return values: The class produces all relevant outputs as return values of simple types, which can be easily checked with test oracles.
(b2) Full getter access to modified fields: The class stores its outputs in fields that the test cases can easily access with provided getter methods.
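To make the recurring findings more concrete, the following is a small hypothetical sketch (not taken from the Closure Compiler codebase) contrasting a class that exhibits controllability issue (i1) of Table I, a multi-step protocol, with a class that exhibits booster (b1), simply-typed control inputs.

```java
// Hypothetical illustration, not an excerpt from Closure Compiler.
public class ReportWriter {            // exhibits issue (i1): multi-step protocol
    private boolean opened = false;
    private boolean configured = false;
    private int written = 0;

    public void open() { opened = true; }

    public void configure(int bufferSize) {
        if (!opened) throw new IllegalStateException("call open() first");
        configured = bufferSize > 0;
    }

    // Mutants in the body below are executable only by tests that discover
    // the open() -> configure() -> write() call sequence.
    public int write(String record) {
        if (!opened || !configured) return -1;
        written += record.length();
        return written;
    }
}

class CharCounter {                    // exhibits booster (b1): simply-typed inputs
    // Fully controllable and observable through a single call with
    // primitive/string inputs and a primitive return value.
    public int count(String text, char c) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == c) n++;
        }
        return n;
    }
}
```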
IV. CONCLUSION AND FUTURE WORK
In this paper we discussed a new approach for measuring testability characteristics of object-oriented classes. Our approach tackles the testability measurement problem in a direct fashion, by sampling the fault space of the classes and discriminating the relative portions of faults that are either easy or hard to execute or to reveal with automatically generated test cases. Thus, our approach differs from previous work that attempted to measure testability based on unproven relations to test size, test coverage and fault sensitivity. We are currently working to devise an efficient implementation of our prototype of the measurement framework, and to design experiments for evaluating our metrics quantitatively.
REFERENCES
[1] R. S. Freedman, "Testability of software components," IEEE Transactions on Software Engineering, vol. 17, no. 6, pp. 553–564, 1991.
[2] J. M. Voas and K. W. Miller, "Software testability: The new verification," IEEE Software, vol. 12, no. 3, pp. 17–28, 1995.
[3] International Organization for Standardization (ISO), "International standard ISO/IEC 9126, Information technology - Product quality - Part 1: Quality model," 2001.
[4] IEEE, IEEE Standard Glossary of Software Engineering Terminology, IEEE Std., 1990.
[5] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, no. 4, pp. 308–320, 1976.
[6] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.
[7] V. Garousi, M. Felderer, and F. N. Kılıçaslan, "A survey on software testability," Information and Software Technology, vol. 108, pp. 35–64, 2019.
[8] M. Bruntink and A. van Deursen, "Predicting class testability using object-oriented metrics," in Fourth IEEE International Workshop on Source Code Analysis and Manipulation. IEEE, 2004, pp. 136–145.
[9] M. Bruntink and A. van Deursen, "An empirical study into class testability," Journal of Systems and Software, vol. 79, no. 9, pp. 1219–1232, 2006.
[10] Y. Singh, A. Kaur, and R. Malhotra, "Predicting testing effort using artificial neural network," in Proceedings of the World Congress on Engineering and Computer Science (WCECS 2008), San Francisco, USA. Newswood Limited, 2008, pp. 1012–1017.
[11] F. Toure, M. Badri, and L. Lamontagne, "Predicting different levels of the unit testing effort of classes using source code metrics: a multiple case study on open-source software," Innovations in Systems and Software Engineering, vol. 14, no. 1, pp. 15–46, 2018.
[12] N. Alshahwan, M. Harman, A. Marchetto, and P. Tonella, "Improving web application testing using testability measures," IEEE, 2009, pp. 49–58.
[13] R. C. da Cruz and M. M. Eler, "An empirical analysis of the correlation between CK metrics, test coverage and mutation score," in ICEIS (2), 2017, pp. 341–350.
[14] K. Jalbert and J. S. Bradbury, "Predicting mutation score using source code and test suite metrics," IEEE, 2012, pp. 42–46.
[15] T. M. Khoshgoftaar, E. B. Allen, and Z. Xu, "Predicting testability of program modules using a neural network," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology. IEEE, 2000, pp. 57–62.
[16] T. Yu, W. Wen, X. Han, and J. H. Hayes, "Predicting testability of concurrent programs," IEEE, 2016, pp. 168–179.
[17] V. Terragni, P. Salza, and M. Pezzè, "Measuring software testability modulo test quality," in Proceedings of the 28th International Conference on Program Comprehension, 2020, pp. 241–251.
[18] J. M. Voas, "PIE: A dynamic failure-based technique," IEEE Transactions on Software Engineering, vol. 18, no. 8, p. 717, 1992.
[19] J. Voas, L. Morell, and K. Miller, "Predicting where faults can hide from testing," IEEE Software, vol. 8, no. 2, pp. 41–48, 1991.
[20] T.-H. Tsai, C.-Y. Huang, and J.-R. Chang, "A study of applying extended PIE technique to software testability analysis," vol. 1. IEEE, 2009, pp. 89–98.
[21] R. V. Binder, "Design for testability in object-oriented systems," Communications of the ACM, vol. 37, no. 9, pp. 87–101, 1994.
[22] R. Just, "The Major mutation framework: Efficient and scalable mutation analysis for Java," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 433–436.
[23] J. H. Andrews, L. C. Briand, and Y. Labiche, "Is mutation an appropriate tool for testing experiments?" in Proceedings of the 27th International Conference on Software Engineering, 2005, pp. 402–411.
[24] G. Fraser and A. Arcuri, "EvoSuite: automatic test suite generation for object-oriented software," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 416–419.
[25] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: Help for the practicing programmer," Computer, vol. 11, no. 4, pp. 34–41, 1978.
[26] M. Pezzè and M. Young, Software Testing and Analysis: Process, Principles, and Techniques. John Wiley & Sons, 2008.
[27] N. Smith, D. van Bruggen, and F. Tomassetti, "JavaParser: visited," Leanpub, 2017.
[28] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A database of existing faults to enable controlled testing studies for Java programs," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 437–440.