EvoSpex: An Evolutionary Algorithm for Learning Postconditions
Facundo Molina∗†, Pablo Ponzio∗†, Nazareno Aguirre∗†, Marcelo Frias†‡
∗ Department of Computer Science, FCEFQyN, University of Río Cuarto, Argentina
† National Council for Scientific and Technical Research (CONICET), Argentina
‡ Department of Software Engineering, Buenos Aires Institute of Technology, Argentina
Abstract—Software reliability is a primary concern in the construction of software, and thus a fundamental component in the definition of software quality. Analyzing software reliability requires a specification of the intended behavior of the software under analysis, and at the source code level, such specifications typically take the form of assertions. Unfortunately, software many times lacks such specifications, or only provides them for scenario-specific behaviors, as assertions accompanying tests. This issue seriously diminishes the analyzability of software with respect to its reliability.

In this paper, we tackle this problem by proposing a technique that, given a Java method, automatically produces a specification of the method's current behavior, in the form of postcondition assertions. This mechanism is based on generating executions of the method under analysis to obtain valid pre/post state pairs, mutating these pairs to obtain (allegedly) invalid ones, and then using a genetic algorithm to produce an assertion that is satisfied by the valid pre/post pairs, while leaving out the invalid ones. The technique, which targets in particular methods of reference-based class implementations, is assessed on a benchmark of open source Java projects, showing that our genetic algorithm is able to generate postconditions that are stronger and more accurate than those generated by related automated approaches, as evaluated by an automated oracle assessment tool. Moreover, our technique is also able to infer an important part of manually written rich postconditions in verified classes, and reproduce contracts for methods whose class implementations were automatically synthesized from specifications.
I. INTRODUCTION
The quality of software systems is typically defined around various dimensions, such as reliability, usability, efficiency, etc. Among these, reliability is in general considered a fundamental attribute of software quality, and a primary concern in software development [12], [16]. Analyzing software reliability is strongly related to finding software defects, i.e., actual software behaviors that diverge from the expected behavior. Discovering such defects requires one to state what the expected behavior is, in other words, a specification of the software. Many times such specifications are either implicit, or stated informally, diminishing the possibility of exploiting specifications for (automated) reliability analysis.

Software specifications can appear in different forms. At the level of source code, when present, they generally manifest either as comments, i.e., informal descriptions of what the software is supposed to do, or more formally as program assertions, i.e., (usually executable) statements that assert properties that the software must satisfy at certain points during program executions. The former are more common, but cannot be straightforwardly used for automated reliability analysis. The latter, on the other hand, are readily usable for program analysis, especially when stated as contracts [21], but they are seldom found accompanying source code.
Moreover, many times program assertions state scenario-specific properties, e.g., statements that only express the expected software behavior for a test case, as opposed to the more general, and also significantly more useful, assertions associated with contract elements such as invariants and pre/postconditions.

Due to the situation described above regarding specifications at the level of source code, the specification inference problem (a special case of the well-known oracle problem [3]), i.e., taking a program without a corresponding specification and attempting to automatically produce one that captures the program's current behavior, is receiving increasing attention from the software engineering community. Automatically inferring specifications from source code is a relevant topic, as it enables a number of applications, including program comprehension, software evolution and maintenance, bug finding [31], and specification improvement [31], [15], among others.

In this paper, we tackle this problem by proposing a technique that, given a Java method, automatically produces a specification of the method's current behavior, in the form of postcondition assertions. This mechanism is based on generating valid and invalid pre/post state pairs (i.e., state pairs that represent, and do not represent, the method's current behavior, respectively), which guide a genetic algorithm to produce a JML-like assertion characterizing the valid pre/post pairs, while leaving out the invalid ones. The generation of valid pre/post pairs is based on executing the method on a bounded exhaustive test set, generated by exercising the method inputs' APIs using user-defined ranges for basic datatypes, and bounding their execution sequences.
The invalid pre/post pairs, on the other hand, are obtained by mutating valid pairs, i.e., arbitrarily modifying the post-states so that each resulting pair does not belong to the set of valid pairs. This mutation-based approach to generate invalid pairs is unsound, in the sense that it may lead to valid pairs instead, an issue that may affect the precision of the produced assertions. As we describe later on, the design of our genetic algorithm takes this into account. Because the assertion language we consider involves quantification, object navigation and reachability expressions, our approach is particularly well-suited for reference-based class implementations with (implicit) strong representation invariants, such as heap-allocated structural objects, and complex custom types.

We assess our technique on a benchmark of open source Java projects taken from [11], featuring complex implementations of reference-based classes. In these case studies, our genetic algorithm is able to generate postconditions that are stronger and more accurate than those generated by related specification-inference approaches, as evaluated by OASIs, an automated oracle assessment tool [15]. Moreover, our technique is also able to infer an important part of manually written rich postconditions (strong contracts used for verification) present in verified classes [37], and reproduce contracts for methods whose class implementations were automatically synthesized from specifications [19].

II. BACKGROUND
A. Assertions as Program Specifications
The use of assertions as program specifications dates back to the works of Hoare [13] and Floyd [9], in the context of program verification and associated with the concept of program correctness. Technically, an assertion is a statement predicating on program states, that can be used to capture assumed properties, as in the case of preconditions, or intended properties, as in the case of postconditions. A program P accompanied by a precondition pre and postcondition post is said to be (partially) correct with respect to this specification if every execution of P starting in a state that satisfies pre, whenever it terminates, does so in a state that satisfies post [13]. That is, every valid terminating execution of P, i.e., every execution satisfying the requirements stated in the precondition, must lead to a state satisfying the postcondition.

While program assertions originated in the context of program verification, they soon permeated into programming language constructs and (informal) programming methodologies. More recently, they have been central to the definition of methodologies for software design, notably design by contract [22]. Most modern imperative and object-oriented programming languages support assertions, either as built-in constructions [23] or through mature libraries such as Code Contracts [2] and JML [5]. Moreover, libraries for unit testing make extensive use of assertions to automate checking the expected results of running a test case.

Preconditions are more commonly seen in source code, e.g., within methods in the form of state and argument checks, throwing appropriate exceptions when these are found invalid, preventing normal execution. Postconditions, on the other hand, are less common.
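As a concrete illustration of such executable preconditions, consider the following sketch; the class, method names and checks are hypothetical (they are not taken from any of the paper's subjects), showing how state and argument checks abort invalid executions via exceptions:

```java
// Hypothetical class whose preconditions are enforced with runtime
// checks: invalid arguments or states abort execution with an
// exception, playing the role of executable preconditions.
public class BoundedStack {
    private final int[] items;
    private int size;

    public BoundedStack(int capacity) {
        // Precondition on the argument: the capacity must be positive.
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");
        items = new int[capacity];
    }

    public void push(int x) {
        // Precondition on the state: the stack is not full.
        if (size == items.length)
            throw new IllegalStateException("stack is full");
        items[size++] = x;
    }

    public int pop() {
        // Precondition on the state: the stack is not empty.
        if (size == 0)
            throw new IllegalStateException("stack is empty");
        return items[--size];
    }

    public int size() { return size; }
}
```

Note that nothing in this class states what `push` or `pop` must achieve on termination; that is exactly the postcondition information that is typically missing.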
Post-execution checks are commonly seen as part of test cases, although they rarely capture postconditions, in the sense of general properties that every execution must satisfy on termination; post-execution checks in tests generally state properties that should be satisfied only for the specific test where they are stated.

The assertion language that we consider in this paper is, from an expressiveness point of view, a JML-like [5] contract language. More precisely, we follow the approach used in [17], and use the Alloy notation [14]. The language supports quantifiers, navigation and reachability expressions, including navigations through one or more fields. A sample specification, generated by our technique, is shown in Figure 3. Most operators have a direct intuitive reading (equality and inequalities, boolean connectives, etc.); all and some are the universal and existential quantifiers, respectively; the dot operator (.) is relational composition and captures navigation; relational union and intersection are denoted by + and &, respectively, and can be applied to combine fields in navigations; set/relational cardinality is denoted by #; finally, * and ^ are reflexive-transitive closure and transitive closure, respectively. Closures allow the language to express reachability. For instance, the last sentence in Figure 3 expresses that for every node n reachable (in zero or more steps) from the root by traversing left and right (i.e., all nodes in the tree), it is not the case that n is included in the set of nodes reachable in one or more steps from n itself. That is, the left + right structure from the root is acyclic. It is worth mentioning that all assertions in this language can be checked at run-time, and thus we can use the language to assert properties at program points. We refer the reader to [14] for further details regarding the language.

B. Quality of Assertions
As described above, program assertions are a way of capturing the expected software behavior via expressions that convey intended properties of program states in specific parts of a program. Such expected behavior can be captured with different degrees of precision, leading to assertions of different quality. The most typical issue with program assertions is the misclassification of invalid program states as valid. This is essentially the effect of having weak assertions, that are able to detect some, but not all, faulty situations. It is rarely considered a defect in the assertion, but an inherent issue associated with a balance between expressiveness and economy/efficiency in the definition of assertions. Indeed, it is even considered methodologically correct to express weak (and efficiently checkable) assertions [22]. Following the terminology put forward in [15], a real program execution leading to an invalid program state that a corresponding assertion is unable to detect is called a false negative.

A second issue with program assertions is the dual of the previous one, i.e., the misclassification of valid program states as invalid. This issue indicates that the assertion is wrong, as it does not properly specify the intended behavior of the software. Such issues are typically considered to be specification defects. This situation can also often arise as a consequence of software evolution, when required changes in program behavior are (correctly) implemented, but the accompanying assertions are not kept in synchrony with the evolved behavior [6]. A real program execution leading to a valid program state, that a corresponding assertion classifies as an assertion violation, is a false positive, according to the terminology put forward in [15].

Assessing the quality of assertions accompanying a program is a very challenging problem, that is typically performed manually.
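To make the false negative/false positive terminology concrete, consider the following hypothetical sketch, where a program "state" is abstracted as just the list size before and after an insertion; both predicates are ours, for illustration only:

```java
// Hypothetical postconditions for an "add one element" operation,
// judged only on the list size before/after execution.
public class AssertionQuality {
    // Weak postcondition: only requires the size not to shrink.
    // A buggy run that inserts nothing still satisfies it, so the
    // assertion misses the fault: a false negative.
    static boolean weakPost(int sizeBefore, int sizeAfter) {
        return sizeAfter >= sizeBefore;
    }

    // Wrong postcondition: demands the size to grow by exactly two.
    // A correct run that inserts one element violates it, so the
    // assertion flags valid behavior: a false positive.
    static boolean wrongPost(int sizeBefore, int sizeAfter) {
        return sizeAfter == sizeBefore + 2;
    }
}
```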
A way of measuring the quality of assertions is by attempting to determine the number of false positives and false negatives that a given assertion has. This idea has been exploited in [15], where an automated mechanism for evaluating the quality of assertions, based on evolutionary computation, is proposed. The approach presented therein executes an evolutionary test generation tool (the well-known tool EvoSuite [10]) that tries to find false positives and false negatives, and when found, produces witnessing test cases, that can be used to (manually) improve the corresponding assertions. It is worth remarking that, for contracts specified in standard assertion languages, a contract can hardly be expected to fully capture the behavior of a program. As explained in [27], precisely capturing a program's intended semantics requires additional mechanisms, such as the use of model classes, that imply the manual definition of abstractions of the state space of the program being specified. In terms of the above-mentioned issues with program assertions, this means that, technically, one can very often come up with false negatives, i.e., find states that satisfy a given assertion but correspond to incorrect program behavior.

III. AN ILLUSTRATING EXAMPLE
As an illustrating example, let us consider a Java class implementing lists, partially shown in Figure 1. This class implements list operations over balanced trees, supporting insertion and deletion from the list in O(log n), as opposed to the classic array-based and linked-list based list implementations. Let us focus on method add, which inserts an element in the list. Notice how the precondition of the method is captured in the source code, checking the validity of the index for insertion and that the tree has not reached its maximum size. The method postcondition, on the other hand, is not present in this implementation. Having the postcondition has multiple applications, in particular as assertions for testing future improvements of this method, and as a declarative description of what this method does (how it operates on the data structure), among many others. Writing the specification is, however, nontrivial, and thus coming up with the right expression for the postcondition is an important problem.

A well-known tool to assist the developer in this situation is Daikon [7]. Daikon performs run-time invariant detection: it runs the program on a set of test cases, and observes which properties hold during these runs at particular program points, such as after method invocations. It then suggests as likely invariants those properties that were not falsified by any execution, or equivalently, that held true for all observed executions. The quality of the obtained invariants strongly depends on the program executions considered by Daikon (i.e., the set of tests that the user provides), and the set of candidate expressions to be considered. In particular, for method add in Figure 1, Daikon produces the postcondition shown in Figure 2, when fed with all valid tree lists of size up to 4. The shown postcondition is actually that produced by Daikon,

Fig. 1. Partial implementation of class AvlTreeList (import java.util.AbstractList; public final class AvlTreeList ...).
Fig. 2. Postcondition generated by Daikon for AvlTreeList.add(int, E)

// root
this.root != null &&
this.root.left != null &&
// height
all n : this.root.*(left+right) :
  (n.left != null => n.height > n.left.height &&
   n.right != null => n.height > n.right.height) &&
// size
old_this.root.size < this.root.size &&
this.root.size ==
Fig. 3. Postcondition generated by our tool for AvlTreeList.add(int, E)

[...] (as we explain later on, the approach to generate invalid states may unsoundly generate valid ones) in attempting to determine a method's postcondition, instead of only valid executions, as is the case with Daikon. Third, our approach is based on evolving specifications, instead of considering non-falsified candidate properties. The details of our technique are described in the next section. Let us just mention that, for method add of class
AvlTreeList, our obtained postcondition is the one shown in Figure 3. Notice how the size update (referring to the relation between the pre and post states) and the membership of the inserted element are captured, as well as some structural properties of the representation.

IV. EVOSPEX
We now present the details of our technique for inferring method postconditions. An overview is shown in Fig. 4. The technique is composed of two main phases: state generation and learning. During state generation, we produce pre/post program state pairs which are later on used in the learning phase to guide the search for suitable postcondition assertions. Two kinds of state pairs are generated: valid ones, which capture actual method behaviors that candidate assertions should satisfy; and invalid ones, which attempt to capture incorrect behaviors (pre/post pairs that do not correspond to the current method behavior), that candidate assertions should not satisfy. Valid pre/post pairs are generated by executing the target method, using a test generation technique; clearly, these pairs correspond to the behavior of the method, as they were generated from its execution. Invalid pre/post pairs, on the other hand, are generated by mutating valid pairs, going out of the set of valid pairs; contrary to the case of valid pairs, it is not guaranteed that our invalid pairs are indeed incorrect method behaviors, i.e., that they represent behaviors that are not exhibited by the method. This may clearly affect the precision of the obtained assertions, since the algorithm would be guided to avoid some allegedly invalid behaviors which are actually valid. In these situations, the obtained assertions would be stronger than necessary, leading to a higher number of false positives when evaluating assertion quality. We consider this issue in the design of our technique, in the following way. Firstly, the effectiveness in generating truly invalid pairs depends on the exhaustiveness of the set of valid pairs: the more exhaustive the set of valid pairs, the greater the chances that mutating out of this set leads to a truly incorrect method behavior.
Secondly, as the soundness of the mechanism for invalid state pair generation cannot be guaranteed, one may risk favoring incorrect assertions based on wrong invalid state assumptions. The former motivates the use of a bounded-exhaustive test generation approach for valid state pairs. The latter drives an asymmetric treatment of valid and invalid state pairs in the fitness function, which gives the reliable information provided by valid pairs a greater relevance. We further describe in this section how we handle these issues, as well as other details of the genetic algorithm, and in the next section we evaluate the technique, including an evaluation of assertion precision.
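The asymmetric treatment just described can be sketched as follows. This is a simplified model: the class and method names, the state encoding and the MAX constant are all illustrative assumptions of ours, not the tool's implementation, and the length/complexity terms of the actual fitness function (defined later in this section) are omitted:

```java
import java.util.List;
import java.util.function.BiPredicate;

// Simplified sketch of the asymmetric fitness idea: positive
// counterexamples (valid pairs rejected by a candidate) dominate
// negative ones (invalid pairs accepted by it).
public class AsymmetricFitness {
    static final double MAX = 1000.0;

    // A pre/post state pair is abstracted as a pair of int arrays.
    static double fitness(BiPredicate<int[], int[]> candidate,
                          List<int[][]> valid, List<int[][]> invalid) {
        long p = valid.stream().filter(v -> !candidate.test(v[0], v[1])).count();
        long n = invalid.stream().filter(i -> candidate.test(i[0], i[1])).count();
        // If any valid execution is rejected, the whole invalid set is
        // also counted against the candidate, so it always ranks below
        // any candidate that accepts every valid execution.
        return p > 0 ? MAX - p - invalid.size() : MAX - n;
    }
}
```

Under this scheme a candidate that satisfies all valid pairs but also all invalid ones still scores above a candidate that rejects even a single valid pair, reflecting that valid-pair information is reliable while invalid-pair information is not.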
A. Generation of Valid/Invalid Method Executions
The learning phase of our algorithm depends on a set of valid/invalid method executions, which guide the search for postcondition assertions. This is an important part of our algorithm. The overall process starts by generating runs of the target method m, collecting the pre/post states ⟨s, s′⟩ of each execution; these are the valid execution pairs V. In order to generate invalid execution pairs I, valid pairs are mutated: for a valid pair ⟨s, s′⟩, we mutate s′ into s′′, and check that ⟨s, s′′⟩ does not belong to V, in order to consider it part of I. Of course, the mutated pre/post pair may actually correspond to a valid execution of m that we had not generated in V. The effectiveness of this latter approach depends on how thorough V is (although we may still generate "unseen" valid execution pairs via mutation), motivating a bounded exhaustive approach for generating valid execution pairs.

The mechanism for generating valid execution pairs works as follows. Let C, C1, . . . , Cn be classes, and m, the target method, a method in C with parameters of types C1, . . . , Cn. The initial states for the execution of m will be tuples ⟨oC, oC1, . . . , oCn⟩ of objects of types C, C1, . . . , Cn, respectively. We build the objects to form these tuples, for each class,

[Fig. 4 diagram: target method → test generation → test cases → execution → valid pre/post states (generation phase); mutation → invalid pre/post states; EvoSpex → postcondition assertions (learning phase)]
Fig. 4. An overview of the proposed approach

bounded exhaustively, in the following way. Let Ci be a class, and b1, . . . , bl a set of builders for Ci, i.e., a set of manually identified methods that can be used to create objects of class Ci. For instance, for a set collection, builders would include constructors and insertion routines. Given a bound k (maximum length for method sequences), we build a set of objects of class Ci using a variant of Randoop [26]. Randoop randomly generates sequences of methods of Ci's API, of increasing length, by iterating a process in which previously produced traces are randomly selected, together with a method, to generate a new trace that calls this method. Our variant incorporates two main modifications to this process:
• The random selection of a method to extend a previously produced trace t (test case), implemented in [26], is replaced by a mechanism to systematically select all methods in b1, . . . , bl, leading to l different extensions of t. This is applied until the bound k is reached.
• A state matching mechanism is implemented, to reduce the number of method combinations: when a newly produced trace leads to an object that matches a previously collected one, the trace (and the object) are discarded. The state matching approach borrows the canonical object representation put forward in [29].
Besides the bound k on trace length, the state matching mechanism also requires a maximum number of objects per type, and a range for primitive types (e.g., 0..k-1 for integers). This is a k-based scope, as defined in finitization procedures in [4] (a standard issue in bounded exhaustive generation). Using the above mechanism, we build the tuples of initial states to execute m.
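The builder-based, bounded-exhaustive generation with state matching can be sketched on a toy example: a "set of integers" whose builders are the empty constructor and add(x) for x in 0..k-1. The class below is an illustration of ours (EvoSpex applies the same idea to arbitrary APIs through its Randoop variant); sorted lists play the role of canonical object representations:

```java
import java.util.*;

// Sketch of bounded-exhaustive generation with state matching for a
// toy "set" abstract datatype. Traces of length up to k are extended
// with every builder; states already seen (in canonical form) are
// discarded, as in the state matching mechanism described above.
public class BoundedExhaustive {
    static Set<List<Integer>> generate(int k) {
        Set<List<Integer>> seen = new LinkedHashSet<>();
        Deque<List<Integer>> frontier = new ArrayDeque<>();
        List<Integer> empty = new ArrayList<>();      // the constructor builder
        seen.add(empty);
        frontier.add(empty);
        for (int len = 0; len < k; len++) {           // traces of length <= k
            Deque<List<Integer>> next = new ArrayDeque<>();
            for (List<Integer> obj : frontier) {
                for (int x = 0; x < k; x++) {         // all builders add(0)..add(k-1)
                    List<Integer> extended = new ArrayList<>(obj);
                    if (!extended.contains(x)) extended.add(x);
                    Collections.sort(extended);       // canonical form
                    if (seen.add(extended))           // state matching: skip duplicates
                        next.add(extended);
                }
            }
            frontier = next;
        }
        return seen;
    }
}
```

For k = 2 this produces exactly the four distinct set states {}, {0}, {1}, {0,1}, even though there are many more builder sequences of length up to 2; state matching prunes the redundant ones.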
We execute m on each of these tuples, and collect the corresponding post-states, building in this way the set V of valid pre-states and corresponding post-states for m.

The mutations applied to produce the "invalid" pre/post state set I take a valid execution pair ⟨s, s′⟩, and create ⟨s, s′′⟩, where s′′ mutates s′ by selecting a random field in the receiving object or return value (the constituents of s′), and replacing the value by a randomly generated value of the corresponding type within the above mentioned scope. We check that the resulting pair is not in V before including it in the invalid state pair set I.

B. Chromosomes representing Candidate Postconditions
Our representation of candidate assertions is based on the encoding used in [24], where chromosomes represent conjunctions of assertions (each gene in a chromosome represents an assertion). That is, given a chromosome c, the candidate postcondition ϕc represented by c is defined as follows:

c = ⟨g1, g2, . . . , gn⟩  ⇒  ϕc = g1 ∧ g2 ∧ . . . ∧ gn

As opposed to what is most common in genetic algorithms, chromosomes have varying lengths in this representation (up to a maximum chromosome length), and gene positions are disregarded by the genetic operators (see below), due to the associativity and commutativity of conjunction. Genes need to encode complex assertions. Below we describe how genes are built, mutated and combined.
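A minimal sketch of this representation is shown below; genes are abstracted as executable predicates over a pre/post size pair, and all names and types are illustrative assumptions of ours, not the tool's actual data structures:

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of the chromosome representation: a variable-length list of
// genes, each an executable assertion over a pre/post state pair; the
// candidate postcondition is their conjunction.
public class Chromosome {
    // A pre/post state pair is abstracted as {preSize, postSize}.
    final List<Predicate<int[]>> genes;

    Chromosome(List<Predicate<int[]>> genes) { this.genes = genes; }

    // phi_c = g1 && g2 && ... && gn
    boolean evaluate(int[] pair) {
        return genes.stream().allMatch(g -> g.test(pair));
    }

    // Gene order is irrelevant, since conjunction is commutative and
    // associative; combining two chromosomes is just a gene union.
    Chromosome union(Chromosome other) {
        List<Predicate<int[]>> all = new ArrayList<>(genes);
        all.addAll(other.genes);
        return new Chromosome(all);
    }
}
```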
C. Initial Population
Let us describe how we build the initial population, to start our genetic algorithm. In order to create individuals representing "meaningful" postconditions, i.e., assertions stating properties of objects that are reachable at the end of the method executions, we take into account typing information, as in [24]. We consider a type graph built automatically from the class under analysis: nodes represent types, and each field f of type B in class A will produce an arc in the graph going from the node representing A to the node representing B. For example, if we consider the AvlTreeList class in Figure 1, the corresponding type graph would be the one shown in Figure 5. It is straightforward to see that by traversing the graph, typed expressions can be built, using the fields of the object from which the method was executed. Some examples are this.root, this.root.left, this.root.size, this.root.value, and so on. Moreover, from loops in the graph, expressions denoting sets, such as this.root.*left (the set of nodes reachable from this.root via left traversals only), this.root.*right and this.root.*(left+right), can be created (as explained earlier, we are using * for reflexive-transitive closure, as in [14]). Size-one chromosomes are created using expressions denoting a single value, evaluating these on a randomly selected subset of the valid (resp. invalid) method executions, in the following way: if the result of evaluating an expression expr in a valid (resp. invalid) tuple t returns a value v, then we create the individual ⟨expr == v⟩ (resp.
⟨expr != v⟩). In addition to these basic individuals, we also create chromosomes containing comparisons of random expressions of the same type (e.g., this.root == this.root.right), chromosomes with quantified formulas considering expressions denoting sets (e.g., all n: this.root.*(left+right) - null : n == n.right), and individuals comparing integer expressions with the cardinality of expressions denoting sets (e.g., an equality between this.root.height and the cardinality of a set expression). Finally, since the method under analysis may have a return value or a set of arguments, we also include, in the set of initial candidates, expressions comparing them against expressions of the same type (e.g., result < this.f). The expressions
Fig. 5. Type graph for the AvlTreeList example.

used to compare with the result variable or the arguments, as well as the operators, are randomly chosen.

Notice that all our initial chromosomes are size-one chromosomes. The main reason for this design choice is to allow the genetic algorithm to progressively produce complex candidate postconditions by means of the genetic operators, which we define later on in this section. While size-one chromosomes in the initial population are non-standard in genetic algorithms, in our case they help the algorithm converge more quickly to better fitted individuals. The replication package site [1] contains the results of comparing the effectiveness of our size-one chromosomes in the initial population with standard size-N chromosomes (we do not include the comparison here due to space restrictions).

D. Fitness Function
Our fitness function assesses how good a candidate postcondition is, by distinguishing between the set V of valid executions and the set I of invalid executions. To do so, before computing the fitness value of a given candidate c, we obtain the postcondition ϕc that c represents, and then compute the sets P and N of positive and negative counterexamples, respectively. These sets are defined as follows:

P = { v ∈ V | ¬ϕc(v) }    N = { i ∈ I | ϕc(i) }

where ϕc is the postcondition represented by the candidate c. Basically, the sets P and N contain those executions for which the postcondition ϕc does not behave correctly. Recall that, as opposed to the case of V, which reliably represents actual execution information of m, the set I may contain mutated executions that are considered "invalid", but correspond to actual executions of m. This motivates a definition of our fitness function that does not treat P and N symmetrically. The fitness function f(c) is computed as follows:

f(c) = (MAX − |P| − |I|) + 1/(l_c + comp_c) + mca_c/l_c,   if |P| > 0
f(c) = (MAX − |N|) + 1/(l_c + comp_c) + mca_c/l_c,         if |P| = 0

This case-based definition aims at considering the negative counterexamples only when no positive counterexamples are obtained. In fact, for arbitrary candidates c1 and c2, if c1 has no positive counterexamples and c2 has some positive counterexamples, then f is guaranteed to produce worse fitness values for c2, no matter how many negative counterexamples
these candidates have. The rationale here is to make the reliable positive-counterexample information more relevant.

The definition of the fitness function has three parts. The first term reflects the most important aspect: minimizing the number of counterexamples. When the candidate postcondition ϕc has positive counterexamples, i.e., it is falsified by a correct method execution, the whole set I of invalid executions is counted as counterexamples too; this is what guarantees our above observation regarding the prioritization of candidates with no positive counterexamples. More precisely, the first term of the function subtracts |I| when |P| > 0, to ensure that the fitness value of such an individual is lower than the fitness value of any other individual that only has negative counterexamples. The second term of the fitness function acts as a penalty regarding two aspects: the candidate length l_c and the candidate "complexity" comp_c. The candidate length is simply the number of conjuncts in the assertion, and it is considered in order to guide the algorithm towards producing smaller assertions. The candidate complexity is the sum of each conjunct's complexity. Intuitively, the complexity of an equality between two integer fields is lower than the complexity of an equality between an integer field and a set cardinality, and both of these are lower than the complexity of a quantified formula, and so on. The last term of the function acts as a reward favoring those candidates with a greater number of "method component assertions" mca_c, i.e., with a high number of conjuncts of the candidate postcondition that represent properties regarding the parameters, the result, or a relation between initial and final object states. As described, the penalty related to the candidate length and complexity, as well as the reward prioritizing the method component assertions, contribute just a fraction to the fitness value, since we want the algorithm to focus on individuals whose number of counterexamples is approaching zero.

E. Genetic Operators
During evolution, the genetic operators allow the algorithm to explore the search space of candidate solutions, by performing certain operations that produce individuals with new characteristics as well as combinations of existing ones. In particular, our algorithm implements two well-known genetic operators, namely the mutation and crossover operators. Some of these genetic operators were inspired by similar ones introduced in [24], while others are novel. Also, a custom selection operator was implemented, to keep in the population those candidates that are more suitable to be part of the real postcondition.

Each chromosome gene is selected for mutation with a fixed probability, and the operation can perform a variety of modifications depending on the shape of the selected gene expression. From a general point of view, the set of considered mutations is the following:

Gene deletion: it can be applied to any gene and simply removes the gene expression from the chromosome.
Negation: it negates the gene expression and is applied to any gene except quantified assertions.

Numeric addition/subtraction: it is only applied to genes that compare two expressions evaluating to a number, and it adds/subtracts a randomly selected numeric expression to/from the right-hand side of the comparison.
Expression replacement: it applies to any gene, and it replaces some part of the gene with a randomly selected expression of the same type.
• Expression extension: it can be applied to any gene that involves a navigational expression, and it extends this expression with a new field, for example replacing this.root by this.root.left.
• Operator replacement: it replaces an operator by an alternative one. The operators vary depending on the current gene expression. For instance, for relational equalities, the possible operators are {==, !=}; for numeric comparisons, the operators are {==, !=, <, >, <=, >=}; and for quantified expressions, the quantifiers are {all, some}.
To produce combinations of individuals, we use a fixed crossover rate. Given two randomly selected chromosomes c1 and c2, our crossover operator simply produces a new individual that contains the union of the genes of c1 and c2, and thus represents the candidate postcondition ϕc1 ∧ ϕc2. An important detail of our crossover operator is that, before selecting individuals for combination, we filter the population, keeping individuals that only have negative counterexamples, i.e., that represent formulas consistent with all valid method executions. The main reason for this policy is that we want the algorithm to join chromosomes that are already consistent with the valid method executions.
Finally, to keep in the population the best candidates of each generation, our selection operator is defined as follows: given a number n to be used as constant population size, our operator first sorts all the candidate postconditions in decreasing fitness order; the candidates moved to the next generation are then the first n/2 individuals, plus the best n/2 unary non-valid individuals, i.e., size-1 chromosomes whose only gene is a formula that still has positive counterexamples. Additionally, our operator keeps all the unary valid candidates, that is, those that only have negative counterexamples.
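As a concrete illustration, the crossover operator and some of the mutation operators above can be sketched as follows. The gene encoding (tuples of operator and operands) and the reduced operator set are simplifying assumptions of ours, not EvoSpex's actual representation.

```python
import random

# Toy encoding (our assumption): a gene is a tuple (op, lhs, rhs), and a
# chromosome is a list of genes denoting the conjunction of its genes.
NUMERIC_OPS = ["==", "!=", "<", ">", "<=", ">="]

def mutate_gene(gene, rng):
    """Apply one randomly chosen mutation to a single gene."""
    kind = rng.choice(["delete", "negate", "op_replace"])
    if kind == "delete":
        return None                              # gene deletion
    if kind == "negate":
        return ("not", gene)                     # negation of the expression
    op, lhs, rhs = gene                          # operator replacement
    return (rng.choice([o for o in NUMERIC_OPS if o != op]), lhs, rhs)

def crossover(c1, c2):
    """Union of the parents' genes: the child represents the conjunction
    of both candidate postconditions."""
    return c1 + [g for g in c2 if g not in c1]

g1 = ("==", "this.size", "old(this.size) + 1")
g2 = ("<=", "0", "this.size")
child = crossover([g1], [g1, g2])
print(child == [g1, g2])  # True: shared conjuncts are not duplicated
```

Note that crossover can only grow or keep the number of conjuncts, which is one reason the fitness function penalizes candidate length.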
This last policy in our selection operator allows us to keep in the population all the discovered valid properties, which the algorithm can use in future crossover operations.

V. EVALUATION
To evaluate our technique, we performed experiments focused on the following research questions:
RQ1: Do the oracles learned by EvoSpex have any deficiency compared to oracles produced by related tools?
RQ2: Are the assertions produced by the algorithm close to manually written contracts?
To evaluate RQ1, we need to consider programs (in our case, Java programs) for which to infer method specifications. As mentioned earlier in the paper, and as is clear from our candidate assertion state space and evolution operators, we target classes and methods with reference-based implementations, in particular classes whose internal representation has strong (implicit) invariants. As a source for our benchmark, we considered SF110 (originally used in [11]), a collection of 110 Java projects (100 random projects, plus the 10 most popular ones according to SourceForge), that covers a wide variety of software, representative of open source development. Our process for assessing postcondition assertions makes use of the OASIs tool [15], essentially, to evaluate the quality of a postcondition assertion in terms of its associated number of false positives and false negatives. Computing this number requires a manual process (as described in [15], to compute the false negatives one first needs to get rid of the false positives, which implies manually refining the produced postconditions every time OASIs reports the presence of a false positive). Therefore, we are unable to consider all 110 projects in the benchmark. We randomly selected 16 projects, skipping cases in our selection that have a clear dependency on the environment (our technique involves automated test generation, and environment dependencies seriously affect these tools). The 16 projects can be found in Table II. For each case study, we selected various methods with different behaviors for analysis, manually defined a set of builders, and then generated the corresponding valid and invalid method executions with a relatively small scope (3 for all cases). Then, we executed our tool in the following way: for each method m selected for analysis, we executed the genetic algorithm to produce a postcondition for m until it reached 30 generations or a 10-minute timeout elapsed.
We repeated this execution 10 times, and then selected the postcondition assertion produced most often among the 10 runs. Additionally, in order to compare our tool with related approaches, we executed Daikon to infer postconditions for each method m. It is important to remark that the test suites that we fed Daikon with, to produce postconditions for the methods under analysis, were exactly the same test suites that were used to generate the valid method executions in our technique (our valid bounded exhaustive suites). Both our tool and Daikon can produce assertions leading to false positives (see Section 2 for a comment on this issue), as well as redundant assertions.
The results of this experiment are shown in Tables I and II. Table I presents the postconditions generated by the tools, after removing the false positives and the redundant assertions, with the aim of giving a clear glance of the complexity of the assertions that the techniques are able to generate. We considered these assertions as the ones produced by the two techniques. We then measured the quality of the corresponding assertions by automatically computing false positives and false negatives, using the OASIs [15] tool. Table II shows the results of this quality assessment. Specifically, for each case study, we report in Table II: (i) lines of code (LoC) of the evaluated project; (ii) number of analyzed methods from the corresponding project; (iii) number of assertions produced as part of the postconditions; (iv) amount and percentage of false positives present in all generated assertions; and (v) number of methods for which false negatives were detected. Notice that, as proposed in [15], false negatives detection is performed once all the false positives have been removed from the postcondition (hence the manual task that made us consider a subset of SF110). For both false positives detection and false negatives detection, we executed OASIs with a timeout of one minute.

TABLE I: POSTCONDITIONS INFERRED BY EVOSPEX AND DAIKON AFTER REMOVING FALSE POSITIVES

jiprof - com.mentorgen.tools.profile.runtime.ClassAllocation
  getAllocCount(): int
    EvoSpex: result == this._count
    Daikon:  this._count == result && result == old(this._count)
  incAllocCount(): void
    EvoSpex: this._count == 1 + old(this._count)
    Daikon:  this._count >= 1 && this._count - old(this._count) - 1 == 0
jmca - com.soops.CEN4010.JMCA.JParser.SimpleNode
  jjtSetParent(Node n): void
    EvoSpex: n == this.parent
    Daikon:  this.parent == old(n) && this.children == old(this.children) && this.id == old(this.id) && this.parser == old(this.parser) && this.identifiers == old(this.identifiers)
bpmail - ch.bluepenguin.email.client.service.impl.EmailFacadeState
  setState(Integer ID, boolean dirtyFlag): void
    EvoSpex: ID in this.states.keySet()
    Daikon:  this.states == old(this.states)
byuic - com.yahoo.platform.yui.compressor.JavaScriptIdentifier
  preventMunging(): void
    EvoSpex: this.mungedValue == old(this.mungedValue) && this.refCount == old(this.refCount) && all n : this.declaredScope.*parentScope : n !in n.^parentScope
    Daikon:  this.mungedValue == old(this.mungedValue) && this.recCount == old(this.refcount) && this.declaredScope == old(this.declaredScope) && this.markedForMunging == false
dom4j - org.dom4j.tree.LazyList
  add(E element): boolean
    EvoSpex: old(this.size) == this.size - 1 && result == true && element in this.header.*next.element
    Daikon:  this.header == old(this.header) && this.size >= 1 && result == true && this.size - old(this.size) - 1 == 0
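To illustrate how candidate assertions like those in Table I are scored against method executions, the following sketch counts positive and negative counterexamples over pre/post state pairs. The dictionary-based state encoding and the predicate form are our own simplifications, not the tool's internal representation.

```python
def count_counterexamples(candidate, valid_pairs, invalid_pairs):
    """Positive counterexamples: valid pre/post pairs the candidate rejects.
    Negative counterexamples: invalid pairs the candidate accepts."""
    pos = sum(1 for pre, post in valid_pairs if not candidate(pre, post))
    neg = sum(1 for pre, post in invalid_pairs if candidate(pre, post))
    return pos, neg

# EvoSpex's contract for incAllocCount (Table I), written as a predicate:
inc_contract = lambda pre, post: post["_count"] == 1 + pre["_count"]

valid = [({"_count": 0}, {"_count": 1}), ({"_count": 3}, {"_count": 4})]
invalid = [({"_count": 0}, {"_count": 5})]   # a mutated post-state

print(count_counterexamples(inc_contract, valid, invalid))  # (0, 0)

# A frame-style candidate is falsified by every valid execution:
frame = lambda pre, post: post["_count"] == pre["_count"]
print(count_counterexamples(frame, valid, invalid))  # (2, 0)
```

A candidate with zero counterexamples of either kind, like inc_contract above, is exactly what the fitness function rewards.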
Problems with OASIs prevented us from reporting the number of false negatives for each method and case study; more precisely, when the tool reported the existence of false negatives, in some cases it was unable to produce the witnessing counterexamples (test cases), preventing us from measuring the number of false negatives identified by the tool in these cases. This issue was discussed with the developers of the tool. We therefore report the number of methods for which OASIs reported the existence of false negatives, rather than the number of false negatives found, as this information was not reliably produced by the tool for all cases. For instance, for project imsmart, out of the 3 methods analyzed, OASIs found one of the corresponding assertions discovered by Daikon to have false negatives, and one of the assertions discovered by EvoSpex to have a false negative, too.
The evaluation of RQ2 requires having classes with methods featuring manually written contracts. Moreover, as discussed in Section 2, assertions for run-time checking are typically weak, efficiently checkable assertions, that only weakly capture the semantics of the corresponding classes and methods [22], [27]. In order to compare with strong contracts, we took:
• A set of case studies with contracts written for the verification of object-oriented programs. More precisely, these programs are written in Eiffel [23], and the accompanying contracts were used for verification with the AutoProof tool [37], a verifier for object-oriented programs written in Eiffel.
• A set of case studies for which a data representation and method implementations are automatically synthesized from a higher-level specification.
More precisely, the synthesized implementations are taken from [19], are generated by the Cozy tool, and are guaranteed to be correct with respect to higher-level specifications, which serve as method contracts.
From [37], we specifically considered various methods and their corresponding postconditions, from the following cases:
• Composite: A tree with a consistency constraint between parent and children nodes. Each node stores a collection of its children and a reference to its parent; the client is allowed to modify any intermediate node. A value in each node should be the maximum of all children's values.
• DoublyLinkedListNode: Node in a (circular) doubly-linked list, with a structural invariant enforcing that its left and right links are consistent with its neighbors.
• Map<K,V>: Generic Map abstract datatype implementation, based on two lists that contain the keys and values, and with operations that perform linear searches on the lists.
• RingBuffer<G>: Bounded queue implemented over a circular array.
Since our tool is for Java, and these implementations are in Eiffel, we had to manually translate the whole classes into Java, for analysis with our tool (this also prevented us from considering more sophisticated case studies in this evaluation). While the translation was manual, we made an effort to make it systematic, preserving the structure of the original code, and taking into account the semantics of references (e.g., expanded types in Eiffel), array indexing in Eiffel vs. Java, etc., using the J2Eif work [36] as a guideline.
Eiffel also differs from Java in other important aspects that did not affect the translation (e.g., inheritance, visibility of features, etc.). While we did not formally verify our translation, it was code-reviewed independently by co-authors of the paper.
From [19], we considered several high-level specifications and their corresponding synthesized Java implementations:
• Polyupdate, a bag of elements that keeps track of the sum of its positive elements.
• Structure, a simple class encapsulating a function and caching a parameter.
• ListComp02, a structure composed of two collections of different elements, and operations that combine elements of the collections.
• MinFinder, a bag of elements with a min operation.
• MaxBag, a set of elements, with a max operation.
In order to infer postconditions for methods in these classes, we first generated valid and invalid method executions, as described earlier in this paper, for each of the target methods, using a scope of 4. Then, we executed our algorithm using the same configuration described for RQ1 (30 generations with a 10-minute timeout). Again, we repeated the execution 10 times and selected the most frequently obtained postcondition. Notice that our approach does not use the real contracts already accompanying the target methods. We fully ignore these in the inference approach, and only consider the methods' source code, both for the generation of valid/invalid method executions and for the actual evolutionary inference. A similar procedure was followed for the Cozy case studies. We computed postconditions for the Java implementations, and contrasted them with those in the original high-level specifications, from which the Java implementations were derived. The results of this experiment are shown in Tables III and IV.
In Table III, for each of the target methods, the column Eiffel Contracts lists the properties that are present in the original postcondition (expressed as text, for easier reference). In Table IV, the original postcondition is described in column High-level spec, in terms of the abstract state declared in the specification. In both tables, the column EvoSpex indicates which of the corresponding assertions in the original contract our evolutionary algorithm was able to infer. Finally, Table V summarizes these results, and also reports the number of invalid assertions synthesized as part of the inferred specifications, for each subject in the Eiffel and Cozy case studies.
Our tool, all the case studies, and a description of how to reproduce the experiments presented in this section can be found in the site of the replication package of our approach [1]. All the experiments were run on an Intel Core i7 3.2GHz, with 16GB of RAM, running GNU/Linux (Ubuntu 16.04).
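The "repeat 10 times, keep the most frequent postcondition" step used in both experiments can be sketched as follows; the assertion strings and run data are purely illustrative.

```python
from collections import Counter

def most_frequent(postconditions):
    """Return the postcondition produced most often across repeated runs
    (ties broken by first occurrence, as Counter does)."""
    return Counter(postconditions).most_common(1)[0][0]

runs = ["this.size == old(this.size) + 1",
        "this.size >= 1",
        "this.size == old(this.size) + 1"]
print(most_frequent(runs))  # this.size == old(this.size) + 1
```

Taking the mode over repeated runs dampens the effect of the algorithm's randomness on the reported result.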
A. Assessment
Let us briefly discuss the results of our evaluation. Regarding RQ1, the results show that our approach is able to generate postconditions containing more complex assertions than the ones produced by Daikon. This is mainly due to the fact that our technique focuses on evolving assertions targeting reference-based conditions in reference-based implementations, as opposed to Daikon, whose expressions are comparatively simpler properties that do not include complex structural constraints, membership checking, etc. (with the exception of arrays and implementations of java.util.List, for which Daikon generates interesting structural properties). Furthermore, as Table II shows, a significant number of the
assertions inferred by our technique are true positives, i.e., assertions that hold for all valid post-states of the corresponding methods, for any scope. Of course, this check for true positives is in the end manual (we carefully analyzed how each of the evaluated methods operates, and inspected the obtained assertions after filtering out assertion conjuncts as per OASIs' assessment); since the oracle deficiency analysis performed by OASIs is inherently incomplete, we cannot guarantee the truth of the remaining assertions.
As shown in Table II, in most of the case studies (13 out of 16), the percentage of false positives that our tool generates, when considering the total amount of produced assertions, is less than that produced by Daikon. Thus, comparing it with Daikon, and solely based on false positives, our assertions are significantly more precise. In fact, over a total of 200 methods analyzed, our technique had 6.7% false positives, compared to Daikon's 17.49% (in absolute terms, 35 vs. 388 false positives, an order of magnitude fewer).

TABLE II: MEASURING THE QUALITY OF POSTCONDITIONS INFERRED BY DAIKON AND EVOSPEX, USING OASIS.

Project          LOCs    Methods  Technique  Assertions  FPs  FP %   Methods w/ FNs
imsmart          1407    3        Daikon     21          2    9.52   1
                                  EvoSpex    4           1    25     1
beanbin          4784    5        Daikon     35          5    14.29  0
                                  EvoSpex    7           0    0      0
byuic            7699    7        Daikon     165         21   12.73  4
                                  EvoSpex    36          4    11.11  2
geo-google       20974   7        Daikon     93          30   32.26  0
                                  EvoSpex    10          3    30     4
templateit       3315    7        Daikon     37          4    10.81  3
                                  EvoSpex    20          0    0      2
water-simulator  9931    9        Daikon     39          3    7.69   9
                                  EvoSpex    18          3    16.67  9
dsachat          5546    9        Daikon     138         15   10.87  3
                                  EvoSpex    18          2    11.11  2
jmca             16891   9        Daikon     205         26   12.68  0
                                  EvoSpex    25          1    4      3
jni-inchi        3100    10       Daikon     122         12   9.84   2
                                  EvoSpex    50          1    2      4
bpmail           2765    11       Daikon     46          6    13.04  8
                                  EvoSpex    17          0    0      7
dom4j            42198   18       Daikon     166         27   16.27  7
                                  EvoSpex    25          2    8      10
jdbacl           28618   19       Daikon     115         17   14.78  10
                                  EvoSpex    80          3    3.75   8
jiprof           26296   20       Daikon     352         81   23.01  20
                                  EvoSpex    35          4    11.43  19
summa            119963  21       Daikon     273         67   24.54  6
                                  EvoSpex    62          5    8.06   5
corina           78144   22       Daikon     155         13   8.39   17
                                  EvoSpex    55          1    1.82   17
a4j              6618    23       Daikon     257         59   22.96  9
                                  EvoSpex    60          5    8.33   5
TOTAL                    200      Daikon     2219        388  17.49  99
                                  EvoSpex    522         35   6.70
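The false-positive rates in the TOTAL row can be cross-checked directly from the reported counts:

```python
# Recomputing the Table II totals: 388/2219 and 35/522, as percentages.
daikon_fp_rate = 100 * 388 / 2219
evospex_fp_rate = 100 * 35 / 522
print(round(daikon_fp_rate, 2), round(evospex_fp_rate, 2))  # 17.49 6.7
```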
Moreover, the relationship between the number of produced assertions (in total, 522 for EvoSpex vs. the 2219 produced by Daikon) and the identified presence of false
negatives, shows that our technique produces assertions of overall similar strength, with significantly fewer conjuncts. Daikon seems to make heavier use of specifically observed values in its assertions, leading to assertions that, while true within the provided test suite cases, are violated when considering larger scopes. Our algorithm is guided both by valid and invalid pre/post method states, giving it an advantage over Daikon, and explores a state space of candidate assertions that is less affected by specific values observed in executions.

TABLE III: COMPARING MANUALLY WRITTEN CONTRACTS (IN EIFFEL VERIFIED CLASSES) WITH POSTCONDITIONS INFERRED BY EVOSPEX. (✓ = inferred by EvoSpex)

Composite
  add_child(Composite c): void — child added ✓; c value unchanged; c children unchanged; ancestors unchanged ✓
DoublyLinkedListNode
  insert_right(DoublyLinkedListNode n): void — n left set ✓; n right set
  remove(): void — singleton ✓; neighbors connected
Map<K,V>
  count(): int — result is size ✓
  extend(K k, V v): int — key set ✓; data set ✓; other keys unchanged; other data unchanged; result is index
  remove(K k): int — key removed ✓; other keys unchanged; other data unchanged; result is index
RingBuffer<G>
  count(): int — result is size
  extend(G a_value) — value added ✓
  item(): G — result is first elem
  remove(): void — first removed
  wipe_out(): void — is empty ✓

TABLE IV: INFERRING POSTCONDITIONS OF SYNTHESIZED COLLECTIONS. (✓ = inferred by EvoSpex)

Polyupdate (x: Bag<Int>, s: Int)
  a(Integer y): void — y added to x ✓; y added to s if positive
  sm(): Integer — result is s + sum of x ✓
Structure (x: Int)
  foo(): Integer — result is x + 1 ✓
  setX(Integer y) — x = y ✓
ListComp02 (Rs: Bag<R>, Ss: Bag<S>)
  insert_r(R r): void — r added to Rs ✓
  insert_s(S s): void — s added to Ss ✓
  q(): Integer — result is the sum of Rs ⊗ Ss
MinFinder (xs: Bag<T>)
  findmin(): T — result is min of xs ✓
  chval(T x, int nv): void — inner value of T is x
MaxBag (l: Set<Int>)
  get_max(): Integer — result is max of l ✓
  add(Integer x): void — x added to l ✓
  remove(Integer x): void — x removed from l

TABLE V: SUMMARY OF EVOSPEX ASSERTIONS ON RQ2 SUBJECTS.

Subject                Methods  Manual Contracts  Inferred Assertions (Total / Invalid)
Eiffel
  Composite            1        4                 7 / 0
  DoublyLinkedListNode 2        4                 5 / 0
  Map<K,V>             …
  RingBuffer<G>        …
Cozy
  Polyupdate           2        3                 3 / 0
  Structure            2        2                 2 / 0
  ListComp02           3        3                 6 / 0
  MinFinder            2        2                 3 / 0
  MaxBag               3        3                 33 / 5
Regarding false negatives, both Daikon and our technique lead to assertions for which OASIs is able to identify false negatives (with our technique having a small margin of advantage in this respect). The conclusion is clear: the assertions obtained with both tools are weaker than the "strongest" postcondition, thus letting some mutations of the analyzed methods "pass" undetected (cf. how OASIs identifies false negatives [15]). Unfortunately, as discussed earlier, a problem with OASIs did not allow us to perform a more detailed comparison, based on the number of identified false negatives in each case. Nevertheless, by inspecting the obtained postconditions, as shown in Table I, it is apparent that our technique produces stronger assertions.
Regarding RQ2, the assertions that our technique can produce are close to those that may be defined by developers when manually specifying rich contracts. As Table III shows, our algorithm generated at least one of the exact properties defined in the original assertions for the Eiffel methods in 8 out of 11 cases. Similarly, as Table IV indicates, in 9 out of 12 methods we correctly identified at least one property of the corresponding postcondition. This confirms that our technique is capable of generating assertions that are actually true positives and are scope-independent. The remaining properties that the algorithm was not able to infer are either specific assertions regarding the arguments; complex properties over sets that are beyond the assertions that the algorithm may produce, such as the "other keys unchanged" property in the Map.extend routine; or relatively complex arithmetic constraints, such as the ones present in Cozy's ListComp02 and the RingBuffer methods (notice that our assertions concentrate on properties of reference-based fields rather than sophisticated arithmetic assertions).
Regarding the precision of the generated specifications for the Eiffel and Cozy case studies, it is also important to analyze whether the tool produces invalid assertions. As Table V shows, only 2 out of 9 subjects contained invalid assertions in the corresponding inferred postcondition, all of them assertions that hold in the bounded scenarios from which they were computed, but not when the scope is extended. These cases were the ones that involved a greater number of fields. Still, the percentage of invalid postconditions for these cases was about 15% or less (4 of 30 in one case, 5 of 33 in the other). Table V also shows that, for most case studies, EvoSpex produces additional valid assertions, compared to the corresponding postcondition. Generally, these have to do with valid information that is not explicitly mentioned in the original postcondition. For instance, for Composite.addChild, EvoSpex produced a 7-conjunct postcondition, 2 of which are in the manual contract; the remaining 5 are either trivial (e.g., the list of children is not changed, the parent is not changed), or capture valid information not in the original "ensure" (e.g., acyclicity of the parent structure). For further details, we refer the reader to the replication package site [1], where all the assertions produced for each case study can be found.

VI. RELATED WORK
Assertions can be exploited for a wide variety of activities in software development, notably program verification [5], [8] and bug finding [35], [18], but also including program comprehension, software evolution and maintenance [30], and specification improvement [15], [31], among others. Thus, the problem of automatically inferring specifications from source code, and in general the problem of producing software oracles, has received increasing attention in the last few years [3]. Techniques for inferring specifications from source code, i.e., for deriving oracles, include approaches based on program executions, such as those reported in [7], [34], as well as some recent techniques based on machine learning [32], [33], [25]. Compared to the execution-based approaches, our technique is guided both by valid and invalid executions (actually, pre/post method states); compared to the machine learning approaches, our technique concentrates on method postconditions, and produces interpretable assertions in standard assertion languages, as opposed to assertions encoded into artificial neural networks and other machine learning models. A closely related technique is that proposed in [7], with which we compare in this paper. Tools for automated test generation, notably EvoSuite [10] as well as Randoop [26] and some extensions [38], can produce assertions accompanying the generated tests. However, these assertions are scenario-specific, i.e., they capture properties particular to the generated tests, as opposed to our postconditions, which attempt to characterize general method behaviors.
Our technique embeds a mechanism for test input generation that follows a bounded exhaustive testing approach. As opposed to previous mechanisms for generating bounded exhaustive suites, e.g., via tools like Korat [4] or TestEra [17], our technique generates bounded exhaustive suites from the program's API, rather than from an invariant specification.
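The contrast with random trace construction can be sketched as a systematic enumeration of all method sequences up to a bound; the state matching mechanism that prunes redundant sequences, which the paper mentions but does not detail, is elided here.

```python
from itertools import product

def bounded_traces(methods, max_len):
    """Yield every method sequence of length 1..max_len, in order,
    instead of sampling traces randomly as Randoop does."""
    for length in range(1, max_len + 1):
        yield from product(methods, repeat=length)

traces = list(bounded_traces(["add", "remove"], 2))
print(len(traces))  # 2 sequences of length 1 + 4 of length 2 = 6
```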
In this sense, our technique is more closely related to Randoop [26], replacing the random method selection in building test traces with a systematic generation of all bounded method traces. The state matching mechanism we used in this paper is crucial in making this approach effective, but its discussion is beyond the scope of the paper. Besides producing valid method executions in the search for assertions, our technique also produces invalid program executions. This approach is based on mutating state. It is somewhat related to the oracle assessment approach (for false negatives) implemented in the OASIs tool [15], although therein the authors mutate programs (source code), as opposed to mutating state. The idea of mutating state is used elsewhere, e.g., in [20], [25].

VII. CONCLUSION
The oracle problem has become a very important problem in software engineering, and within this context, oracle derivation or inference is particularly challenging [3]. In this paper, we have proposed an evolutionary algorithm for oracle inference, in particular for inferring method assertions in the form of postconditions. Our technique features various novel characteristics, including a mechanism for generating test inputs bounded-exhaustively from a component's API, and the definition of a genetic algorithm whose state space of candidate assertions includes rich constraints involving method parameters, return values, internal object states, and the relationship between pre and post method execution states. Our experimental evaluation shows that our tool is able to produce more accurate assertions (stronger contracts in the sense of [28], with the associated benefits described therein), with a total of 6.70% false positives, compared to the 17.49% false positives of related techniques, for a set of randomly selected methods from a benchmark of open source Java projects. Furthermore, our evaluation shows that our technique is able to infer an important part of rich program assertions, taken from a set of case studies involving contracts for program verification and synthesis.
This work also opens several lines for future work. On one hand, our genetic algorithm uses a finite set of genetic operators, in particular the ones used for mutation; extending the set of operators and exploring new ones may be necessary to increase the scope of properties that the algorithm may produce, especially when dealing with more sophisticated programs.
Fitness functions in genetic algorithms play a crucial role in the quality of the solutions; adapting the fitness function of our algorithm to prioritize general aspects of method postconditions may considerably improve our results. Our experiments were based on the use of a variant of random generation for the production of bounded exhaustive test suites. Using alternative test suite generation approaches, such as fully random generation, may allow us to produce different postconditions. The existence of false negatives for our produced postcondition assertions also opens lines of improvement for our inference mechanism.

ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their helpful feedback, and the OASIs authors for their assistance in using the OASIs oracle assessment tool. This work was partially supported by ANPCyT PICT 2016-1384, 2017-1979 and 2017-2622. Facundo Molina's work is also supported by Microsoft Research, through a Latin America PhD Award.
REFERENCES

[1] EvoSpex site. https://sites.google.com/view/evospex, 2021.
[2] Mike Barnett. Code contracts for .NET: Runtime verification and so much more. In Howard Barringer, Yliès Falcone, Bernd Finkbeiner, Klaus Havelund, Insup Lee, Gordon J. Pace, Grigore Rosu, Oleg Sokolsky, and Nikolai Tillmann, editors, Runtime Verification - First International Conference, RV 2010, St. Julians, Malta, November 1-4, 2010. Proceedings, volume 6418 of Lecture Notes in Computer Science, pages 16–17. Springer, 2010.
[3] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. IEEE Trans. Software Eng., 41(5):507–525, 2015.
[4] Chandrasekhar Boyapati, Sarfraz Khurshid, and Darko Marinov. Korat: Automated testing based on Java predicates. In Phyllis G. Frankl, editor, Proceedings of the International Symposium on Software Testing and Analysis, ISSTA 2002, Roma, Italy, July 22-24, 2002, pages 123–133. ACM, 2002.
[5] Patrice Chalin, Joseph R. Kiniry, Gary T. Leavens, and Erik Poll. Beyond assertions: Advanced specification and verification with JML and ESC/Java2. In Frank S. de Boer, Marcello M. Bonsangue, Susanne Graf, and Willem P. de Roever, editors, Formal Methods for Components and Objects, 4th International Symposium, FMCO 2005, Amsterdam, The Netherlands, November 1-4, 2005, Revised Lectures, volume 4111 of Lecture Notes in Computer Science, pages 342–363. Springer, 2005.
[6] Brett Daniel, Vilas Jagannath, Danny Dig, and Darko Marinov. ReAssert: Suggesting repairs for broken unit tests. In ASE 2009, 24th IEEE/ACM International Conference on Automated Software Engineering, Auckland, New Zealand, November 16-20, 2009, pages 433–444. IEEE Computer Society, 2009.
[7] Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The Daikon system for dynamic detection of likely invariants. Sci. Comput. Program., 69(1-3):35–45, 2007.
[8] Manuel Fähndrich. Static verification for code contracts. In Radhia Cousot and Matthieu Martel, editors, Static Analysis - 17th International Symposium, SAS 2010, Perpignan, France, September 14-16, 2010. Proceedings, volume 6337 of Lecture Notes in Computer Science, pages 2–5. Springer, 2010.
[9] Robert W. Floyd. Assigning meanings to programs. In J. T. Schwartz, editor, Mathematical Aspects of Computer Science, Proceedings of Symposia in Applied Mathematics 19, pages 19–32, Providence, 1967. American Mathematical Society.
[10] Gordon Fraser and Andrea Arcuri. EvoSuite: Automatic test suite generation for object-oriented software. In SIGSOFT FSE, pages 416–419. ACM, 2011.
[11] Gordon Fraser and Andrea Arcuri. A large-scale evaluation of automated unit test generation using EvoSuite. ACM Trans. Softw. Eng. Methodol., 24(2):8:1–8:42, 2014.
[12] Carlo Ghezzi, Mehdi Jazayeri, and Dino Mandrioli. Fundamentals of Software Engineering. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 2002.
[13] Charles A. R. Hoare. An axiomatic basis for computer programming. Commun. ACM, 12(10):576–580, 1969.
[14] Daniel Jackson. Software Abstractions - Logic, Language, and Analysis. MIT Press, 2006.
[15] Gunel Jahangirova, David Clark, Mark Harman, and Paolo Tonella. Test oracle assessment and improvement. In Andreas Zeller and Abhik Roychoudhury, editors, Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, July 18-20, 2016, pages 247–258. ACM, 2016.
[16] Pankaj Jalote. An Integrated Approach to Software Engineering, Third Edition. Texts in Computer Science. Springer, 2005.
[17] Shadi Abdul Khalek, Guowei Yang, Lingming Zhang, Darko Marinov, and Sarfraz Khurshid. TestEra: A tool for testing Java programs using Alloy specifications. Pages 608–611. IEEE Computer Society, 2011.
[18] Andreas Leitner, Ilinca Ciupa, Manuel Oriol, Bertrand Meyer, and Arno Fiva. Contract driven development = test driven development - writing test cases. In Ivica Crnkovic and Antonia Bertolino, editors, Proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2007, Dubrovnik, Croatia, September 3-7, 2007, pages 425–434. ACM, 2007.
[19] Calvin Loncaric, Michael D. Ernst, and Emina Torlak. Generalized data structure synthesis. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 958–968, New York, NY, USA, 2018. Association for Computing Machinery.
[20] Muhammad Zubair Malik, Junaid Haroon Siddiqui, and Sarfraz Khurshid. Constraint-based program debugging using data structure repair. In Fourth IEEE International Conference on Software Testing, Verification and Validation, ICST 2011, Berlin, Germany, March 21-25, 2011, pages 190–199. IEEE Computer Society, 2011.
[21] Bertrand Meyer. Applying "Design by Contract". IEEE Computer, 25(10):40–51, 1992.
[22] Bertrand Meyer. Object-Oriented Software Construction, 2nd Edition. Prentice-Hall, 1997.
[23] Bertrand Meyer. Design by contract: The Eiffel method. In TOOLS 1998: 26th International Conference on Technology of Object-Oriented Languages and Systems, 3-7 August 1998, Santa Barbara, CA, USA, page 446. IEEE Computer Society, 1998.
[24] Facundo Molina, César Cornejo, Renzo Degiovanni, Germán Regis, Pablo F. Castro, Nazareno Aguirre, and Marcelo F. Frias. An evolutionary approach to translate operational specifications into declarative specifications. In Leila Ribeiro and Thierry Lecomte, editors, Formal Methods: Foundations and Applications - 19th Brazilian Symposium, SBMF 2016, Natal, Brazil, November 23-25, 2016, Proceedings, volume 10090 of Lecture Notes in Computer Science, pages 145–160, 2016.
[25] Facundo Molina, Renzo Degiovanni, Pablo Ponzio, Germán Regis, Nazareno Aguirre, and Marcelo F. Frias. Training binary classifiers as data structure invariants. In Joanne M. Atlee, Tevfik Bultan, and Jon Whittle, editors, Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 759–770. IEEE / ACM, 2019.
[26] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. Pages 75–84. IEEE Computer Society, 2007.
[27] Nadia Polikarpova, Carlo A. Furia, and Bertrand Meyer. Specifying reusable components. In Gary T. Leavens, Peter W. O'Hearn, and Sriram K. Rajamani, editors,
Proceedings of the 41st International Conference onSoftware Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31,2019 , pages 759–770. IEEE / ACM, 2019.[26] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and ThomasBall. Feedback-directed random test generation. In , pages 75–84. IEEE Computer Society, 2007.[27] Nadia Polikarpova, Carlo A. Furia, and Bertrand Meyer. Specifyingreusable components. In Gary T. Leavens, Peter W. O’Hearn, andSriram K. Rajamani, editors,
Verified Software: Theories, Tools, Experi-ments, Third International Conference, VSTTE 2010, Edinburgh, UK,August 16-19, 2010. Proceedings , volume 6217 of
Lecture Notes inComputer Science , pages 127–141. Springer, 2010.[28] Nadia Polikarpova, Carlo A. Furia, Yu Pei, Yi Wei, and BertrandMeyer. What good are strong specifications? In David Notkin, BettyH. C. Cheng, and Klaus Pohl, editors, , pages 262–271. IEEE Computer Society, 2013.[29] Pablo Ponzio, Valeria S. Bengolea, Simón Gutiérrez Brida, GastónScilingo, Nazareno Aguirre, and Marcelo F. Frias. On the effect ofobject redundancy elimination in randomly testing collection classes. InJuan Pablo Galeotti and Alessandra Gorla, editors,
Proceedings of the11th International Workshop on Search-Based Software Testing, ICSE2018, Gothenburg, Sweden, May 28-29, 2018 , pages 67–70. ACM, 2018.[30] Manoranjan Satpathy, Nils T. Siebel, and Daniel Rodríguez. Assertionsin object oriented software maintenance: Analysis and a case study.In , pages 124–135. IEEEComputer Society, 2004.[31] Todd W. Schiller and Michael D. Ernst. Reducing the barriers to writingverified specifications. In Gary T. Leavens and Matthew B. Dwyer,editors,
Proceedings of the 27th Annual ACM SIGPLAN Conference onObject-Oriented Programming, Systems, Languages, and Applications,OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25,2012 , pages 95–112. ACM, 2012.[32] Seyed Reza Shahamiri, Wan Mohd Nasir Wan-Kadir, Suhaimi Ibrahim,and Siti Zaiton Mohd Hashim. An automated framework for softwaretest oracle.
Information & Software Technology , 53(7):774–788, 2011.[33] Rahul Sharma and Alex Aiken. From invariant checking to invariantinference using randomized search.
Formal Methods in System Design ,48(3):235–256, 2016.[34] Anthony J. H. Simons. Jwalk: a tool for lazy, systematic testing of javaclasses by design introspection and user interaction.
Autom. Softw. Eng. ,14(4):369–418, 2007.35] Nikolai Tillmann and Jonathan de Halleux. Pex-white box test genera-tion for .net. In Bernhard Beckert and Reiner Hähnle, editors,
Tests andProofs, Second International Conference, TAP 2008, Prato, Italy, April9-11, 2008. Proceedings , volume 4966 of
Lecture Notes in ComputerScience , pages 134–153. Springer, 2008.[36] Marco Trudel, Manuel Oriol, Carlo A. Furia, and Martin Nordio.Automated translation of java source code to eiffel. In Judith Bishopand Antonio Vallecillo, editors,
Objects, Models, Components, Patterns -49th International Conference, TOOLS 2011, Zurich, Switzerland, June28-30, 2011. Proceedings , volume 6705 of
Lecture Notes in ComputerScience , pages 20–35. Springer, 2011. [37] Julian Tschannen, Carlo A. Furia, Martin Nordio, and Nadia Polikar-pova. Autoproof: Auto-active functional verification of object-orientedprograms. In , volume 9035 of
Lecture Notes in Computer Science ,pages 566–580. Springer, 2015.[38] Kohsuke Yatoh, Kazunori Sakamoto, Fuyuki Ishikawa, and ShinichiHoniden. Feedback-controlled random test generation. In Michal Youngand Tao Xie, editors,