EvoSpex: An Evolutionary Algorithm for Learning Postconditions
Facundo Molina∗†, Pablo Ponzio∗†, Nazareno Aguirre∗†, Marcelo Frias†‡
∗ Department of Computer Science, FCEFQyN, University of Río Cuarto, Argentina
† National Council for Scientific and Technical Research (CONICET), Argentina
‡ Department of Software Engineering, Buenos Aires Institute of Technology, Argentina
Abstract—Software reliability is a primary concern in the construction of software, and thus a fundamental component in the definition of software quality. Analyzing software reliability requires a specification of the intended behavior of the software under analysis, and at the source code level, such specifications typically take the form of assertions. Unfortunately, software many times lacks such specifications, or only provides them for scenario-specific behaviors, as assertions accompanying tests. This issue seriously diminishes the analyzability of software with respect to its reliability.

In this paper, we tackle this problem by proposing a technique that, given a Java method, automatically produces a specification of the method's current behavior, in the form of postcondition assertions. This mechanism is based on generating executions of the method under analysis to obtain valid pre/post state pairs, mutating these pairs to obtain (allegedly) invalid ones, and then using a genetic algorithm to produce an assertion that is satisfied by the valid pre/post pairs, while leaving out the invalid ones. The technique, which targets in particular methods of reference-based class implementations, is assessed on a benchmark of open source Java projects, showing that our genetic algorithm is able to generate postconditions that are stronger and more accurate than those generated by related automated approaches, as evaluated by an automated oracle assessment tool. Moreover, our technique is also able to infer an important part of manually written rich postconditions in verified classes, and reproduce contracts for methods whose class implementations were automatically synthesized from specifications.
I. INTRODUCTION
The quality of software systems is typically defined around various dimensions, such as reliability, usability, efficiency, etc. Among these, reliability is in general considered a fundamental attribute of software quality, and a primary concern in software development [12], [16]. Analyzing software reliability is strongly related to finding software defects, i.e., actual software behaviors that diverge from the expected behavior. Discovering such defects requires one to state what the expected behavior is, in other words, a specification of the software. Many times such specifications are either implicit, or stated informally, diminishing the possibility of exploiting specifications for (automated) reliability analysis.

Software specifications can appear in different forms. At the level of source code, when present, they generally manifest either as comments, i.e., informal descriptions of what the software is supposed to do, or more formally as program assertions, i.e., (usually executable) statements that assert properties that the software must satisfy at certain points during program executions. The former are more common, but cannot be straightforwardly used for automated reliability analysis. The latter, on the other hand, are readily usable for program analysis, especially when stated as contracts [21], but they are seldom found accompanying source code.
Moreover, many times program assertions state scenario-specific properties, e.g., statements that only express the expected software behavior for a test case, as opposed to the more general, and also significantly more useful, assertions associated with contract elements such as invariants and pre/postconditions.

Due to the situation described above regarding specifications at the level of source code, the specification inference problem (a special case of the well-known oracle problem [3]), i.e., taking a program without a corresponding specification and attempting to automatically produce one that captures the program's current behavior, is receiving increasing attention from the software engineering community. Automatically inferring specifications from source code is a relevant topic, as it enables a number of applications, including program comprehension, software evolution and maintenance, bug finding [31], and specification improvement [31], [15], among others.

In this paper, we tackle this problem by proposing a technique that, given a Java method, automatically produces a specification of the method's current behavior, in the form of postcondition assertions. This mechanism is based on generating valid and invalid pre/post state pairs (i.e., state pairs that represent, and do not represent, the method's current behavior, respectively), which guide a genetic algorithm to produce a JML-like assertion characterizing the valid pre/post pairs, while leaving out the invalid ones. The generation of valid pre/post pairs is based on executing the method on a bounded exhaustive test set, generated by exercising the method inputs' APIs using user-defined ranges for basic datatypes, and bounding their execution sequences.
The invalid pre/post pairs, on the other hand, are obtained by mutating valid pairs, i.e., arbitrarily modifying the post-states so that each resulting pair does not belong to the set of valid pairs. This mutation-based approach to generate invalid pairs is unsound, in the sense that it may lead to valid pairs instead, an issue that may affect the precision of the produced assertions. As we describe later on, the design of our genetic algorithm takes this into account. Because the assertion language we consider involves quantification, object navigation and reachability expressions, our approach is particularly well-suited for reference-based class implementations with (implicit) strong representation invariants, such as heap-allocated structural objects, and complex custom types.

We assess our technique on a benchmark of open source Java projects taken from [11], featuring complex implementations of reference-based classes. In these case studies, our genetic algorithm is able to generate postconditions that are stronger and more accurate than those generated by related specification-inference approaches, as evaluated by OASIs, an automated oracle assessment tool [15]. Moreover, our technique is also able to infer an important part of manually written rich postconditions (strong contracts used for verification) present in verified classes [37], and reproduce contracts for methods whose class implementations were automatically synthesized from specifications [19].

II. BACKGROUND
A. Assertions as Program Specifications
The use of assertions as program specifications dates back to the works of Hoare [13] and Floyd [9], in the context of program verification and associated with the concept of program correctness. Technically, an assertion is a statement predicating on program states, that can be used to capture assumed properties, as in the case of preconditions, or intended properties, as in the case of postconditions. A program P accompanied by a precondition pre and postcondition post is said to be (partially) correct with respect to this specification if every execution of P starting in a state that satisfies pre, whenever it terminates, does so in a state that satisfies post [13]. That is, every valid terminating execution of P, i.e., every execution satisfying the requirements stated in the precondition, must lead to a state satisfying the postcondition.

While program assertions originated in the context of program verification, they soon permeated into programming language constructs and (informal) programming methodologies. More recently, they have been central to the definition of methodologies for software design, notably design by contract [22]. Most modern imperative and object-oriented programming languages support assertions, either as built-in constructions [23] or through mature libraries such as Code Contracts [2] and JML [5]. Moreover, libraries for unit testing make extensive use of assertions to automate checking the expected results of running a test case.

Preconditions are more commonly seen in source code, e.g., within methods in the form of state and argument checks, throwing appropriate exceptions when these are found invalid, preventing normal execution. Postconditions, on the other hand, are less common.
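As a concrete illustration of such executable preconditions, consider the following sketch; the class, method names and checks are hypothetical (they are not taken from any of the paper's subjects), showing how state and argument checks abort invalid executions via exceptions:

```java
// Hypothetical class whose preconditions are enforced with runtime
// checks: invalid arguments or states abort execution with an
// exception, playing the role of executable preconditions.
public class BoundedStack {
    private final int[] items;
    private int size;

    public BoundedStack(int capacity) {
        // Precondition on the argument: the capacity must be positive.
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");
        items = new int[capacity];
    }

    public void push(int x) {
        // Precondition on the state: the stack is not full.
        if (size == items.length)
            throw new IllegalStateException("stack is full");
        items[size++] = x;
    }

    public int pop() {
        // Precondition on the state: the stack is not empty.
        if (size == 0)
            throw new IllegalStateException("stack is empty");
        return items[--size];
    }

    public int size() { return size; }
}
```

Note that nothing in this class states what `push` or `pop` must achieve on termination; that is exactly the postcondition information that is typically missing.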
Post-execution checks are commonly seen as part of test cases, although they rarely capture postconditions, in the sense of general properties that every execution must satisfy on termination; post-execution checks in tests generally state properties that should be satisfied only for the specific test where they are stated.

The assertion language that we consider in this paper is, from an expressiveness point of view, a JML-like [5] contract language. More precisely, we follow the approach used in [17], and use the Alloy notation [14]. The language supports quantifiers, navigation and reachability expressions, including navigations through one or more fields. A sample specification, generated by our technique, is shown in Figure 3. Most operators have a direct intuitive reading (equality and inequalities, boolean connectives, etc.); all and some are the universal and existential quantifiers, respectively; the dot operator (.) is relational composition and captures navigation; relational union and intersection are denoted by + and &, respectively, and can be applied to combine fields in navigations; set/relational cardinality is denoted by #; finally, * and ^ are reflexive-transitive closure and transitive closure, respectively. Closures allow the language to express reachability. For instance, the last sentence in Figure 3 expresses that for every node n reachable (in zero or more steps) from the root by traversing left and right (i.e., all nodes in the tree), it is not the case that n is included in the set of nodes reachable in one or more steps from n itself. That is, the left + right structure from the root is acyclic. It is worth mentioning that all assertions in this language can be checked at run-time, and thus we can use the language to assert properties at program points. We refer the reader to [14] for further details regarding the language.

B. Quality of Assertions
As described above, program assertions are a way of capturing the expected software behavior via expressions that convey intended properties of program states in specific parts of a program. Such expected behavior can be captured with different degrees of precision, leading to assertions of different quality. The most typical issue with program assertions is the misclassification of invalid program states as valid. This is essentially the effect of having weak assertions, that are able to detect some, but not all, faulty situations. It is rarely considered a defect in the assertion, but an inherent issue associated with a balance between expressiveness and economy/efficiency in the definition of assertions. Indeed, it is even considered methodologically correct to express weak (and efficiently checkable) assertions [22]. Following the terminology put forward in [15], a real program execution leading to an invalid program state that a corresponding assertion is unable to detect is called a false negative.

A second issue with program assertions is the dual of the previous one, i.e., the misclassification of valid program states as invalid. This issue indicates that the assertion is wrong, as it does not properly specify the intended behavior of the software. Such issues are typically considered to be specification defects. This situation can also often arise as a consequence of software evolution, when required changes in program behavior are (correctly) implemented, but the accompanying assertions are not kept in synchrony with the evolved behavior [6]. A real program execution leading to a valid program state, that a corresponding assertion classifies as an assertion violation, is a false positive, according to the terminology put forward in [15].

Assessing the quality of assertions accompanying a program is a very challenging problem, that is typically performed manually.
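To make the false negative/false positive terminology concrete, consider the following hypothetical sketch, where a program "state" is abstracted as just the list size before and after an insertion; both predicates are ours, for illustration only:

```java
// Hypothetical postconditions for an "add one element" operation,
// judged only on the list size before/after execution.
public class AssertionQuality {
    // Weak postcondition: only requires the size not to shrink.
    // A buggy run that inserts nothing still satisfies it, so the
    // assertion misses the fault: a false negative.
    static boolean weakPost(int sizeBefore, int sizeAfter) {
        return sizeAfter >= sizeBefore;
    }

    // Wrong postcondition: demands the size to grow by exactly two.
    // A correct run that inserts one element violates it, so the
    // assertion flags valid behavior: a false positive.
    static boolean wrongPost(int sizeBefore, int sizeAfter) {
        return sizeAfter == sizeBefore + 2;
    }
}
```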
A way of measuring the quality of assertions is by attempting to determine the number of false positives and false negatives that a given assertion has. This idea has been exploited in [15], where an automated mechanism for evaluating the quality of assertions, based on evolutionary computation, is proposed. The approach presented therein executes an evolutionary test generation tool (the well-known tool EvoSuite [10]) that tries to find false positives and false negatives, and when found, produces witnessing test cases, that can be used to (manually) improve the corresponding assertions. It is worth remarking that, for contracts specified in standard assertion languages, a contract can hardly be expected to fully capture the behavior of a program. As explained in [27], precisely capturing a program's intended semantics requires additional mechanisms, such as the use of model classes, that imply the manual definition of abstractions of the state space of the program being specified. In terms of the above-mentioned issues with program assertions, this means that, technically, one can very often come up with false negatives, i.e., find states that satisfy a given assertion but correspond to incorrect program behavior.

III. AN ILLUSTRATING EXAMPLE
As an illustrating example, let us consider a Java class implementing lists, partially shown in Figure 1. This class implements list operations over balanced trees, supporting insertion and deletion from the list in O(log n), as opposed to the classic array-based and linked-list based list implementations. Let us focus on method add, which inserts an element in the list. Notice how the precondition of the method is captured in the source code, checking the validity of the index for insertion and that the tree has not reached its maximum size. The method postcondition, on the other hand, is not present in this implementation. Having the postcondition has multiple applications, in particular as assertions for testing future improvements of this method, and as a declarative description of what this method does (how it operates on the data structure), among many others. Writing the specification is, however, nontrivial, and thus coming up with the right expression for the postcondition is an important problem.

A well-known tool to assist the developer in this situation is Daikon [7]. Daikon performs run-time invariant detection: it runs the program on a set of test cases, and observes which properties hold during these runs at particular program points, such as after method invocations. It then suggests as likely invariants those properties that were not falsified by any execution, or equivalently, that held true for all observed executions. The quality of the obtained invariants strongly depends on the program executions considered by Daikon (i.e., the set of tests that the user provides), and the set of candidate expressions to be considered. In particular, for method add in Figure 1, Daikon produces the postcondition shown in Figure 2, when fed with all valid tree lists of size up to 4. The shown postcondition is actually that produced by Daikon,

Fig. 1. Partial implementation of class AvlTreeList (import java.util.AbstractList; public final class AvlTreeList ...).
Fig. 2. Postcondition generated by Daikon for AvlTreeList.add(int, E)

// root
this.root != null &&
this.root.left != null &&
// height
all n : this.root.*(left+right) :
  (n.left != null => n.height > n.left.height &&
   n.right != null => n.height > n.right.height) &&
// size
old_this.root.size < this.root.size &&
this.root.size ==
Fig. 3. Postcondition generated by our tool for AvlTreeList.add(int, E)

[...] (as we explain later on, the approach to generate invalid states may unsoundly generate valid ones) in attempting to determine a method's postcondition, instead of only valid executions, as is the case with Daikon. Third, our approach is based on evolving specifications, instead of considering non-falsified candidate properties. The details of our technique are described in the next section. Let us just mention that, for method add of class
AvlTreeList, our obtained postcondition is the one shown in Figure 3. Notice how the size update (referring to the relation between the pre and post states) and the membership of the inserted element are captured, as well as some structural properties of the representation.

IV. EVOSPEX
We now present the details of our technique for inferring method postconditions. An overview is shown in Fig. 4. The technique is composed of two main phases: state generation and learning. During state generation, we produce pre/post program state pairs which are later on used in the learning phase to guide the search for suitable postcondition assertions. Two kinds of state pairs are generated: valid ones, which capture actual method behaviors that candidate assertions should satisfy; and invalid ones, which attempt to capture incorrect behaviors (pre/post pairs that do not correspond to the current method behavior), that candidate assertions should not satisfy. Valid pre/post pairs are generated by executing the target method, using a test generation technique; clearly, these pairs correspond to the behavior of the method, as they were generated from its execution. Invalid pre/post pairs, on the other hand, are generated by mutating valid pairs, going out of the set of valid pairs; contrary to the case of valid pairs, it is not guaranteed that our invalid pairs are indeed incorrect method behaviors, i.e., that they represent behaviors that are not exhibited by the method. This may clearly affect the precision of the obtained assertions, since the algorithm would be guided to avoid some allegedly invalid behaviors which are actually valid. In these situations, the obtained assertions would be stronger than necessary, leading to a higher number of false positives when evaluating assertion quality. We consider this issue in the design of our technique, in the following way. Firstly, the effectiveness in generating truly invalid pairs depends on the exhaustiveness of the set of valid pairs: the more exhaustive the set of valid pairs, the greater the chances that mutating out of this set leads to a truly incorrect method behavior.
Secondly, as the soundness of the mechanism for invalid state pair generation cannot be guaranteed, one may risk favoring incorrect assertions based on wrong invalid state assumptions. The former motivates the use of a bounded-exhaustive test generation approach for valid state pairs. The latter drives an asymmetric treatment of valid and invalid state pairs in the fitness function, which gives the reliable information provided by valid pairs a greater relevance. We further describe in this section how we handle these issues, as well as other details of the genetic algorithm, and in the next section we evaluate the technique, including an evaluation of assertion precision.
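The asymmetric treatment just described can be sketched as follows. This is a simplified model: the class and method names, the state encoding and the MAX constant are all illustrative assumptions of ours, not the tool's implementation, and the length/complexity terms of the actual fitness function (defined later in this section) are omitted:

```java
import java.util.List;
import java.util.function.BiPredicate;

// Simplified sketch of the asymmetric fitness idea: positive
// counterexamples (valid pairs rejected by a candidate) dominate
// negative ones (invalid pairs accepted by it).
public class AsymmetricFitness {
    static final double MAX = 1000.0;

    // A pre/post state pair is abstracted as a pair of int arrays.
    static double fitness(BiPredicate<int[], int[]> candidate,
                          List<int[][]> valid, List<int[][]> invalid) {
        long p = valid.stream().filter(v -> !candidate.test(v[0], v[1])).count();
        long n = invalid.stream().filter(i -> candidate.test(i[0], i[1])).count();
        // If any valid execution is rejected, the whole invalid set is
        // also counted against the candidate, so it always ranks below
        // any candidate that accepts every valid execution.
        return p > 0 ? MAX - p - invalid.size() : MAX - n;
    }
}
```

Under this scheme a candidate that satisfies all valid pairs but also all invalid ones still scores above a candidate that rejects even a single valid pair, reflecting that valid-pair information is reliable while invalid-pair information is not.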
A. Generation of Valid/Invalid Method Executions
The learning phase of our algorithm depends on a set of valid/invalid method executions, which guide the search for postcondition assertions. This is an important part of our algorithm. The overall process starts by generating runs of the target method m, collecting the pre/post states ⟨s, s′⟩ of each execution; these are the valid execution pairs V. In order to generate invalid execution pairs I, valid pairs are mutated: for a valid pair ⟨s, s′⟩, we mutate s′ into s′′, and check that ⟨s, s′′⟩ does not belong to V, in order to consider it part of I. Of course, the mutated pre/post pair may actually correspond to a valid execution of m that we had not generated in V. The effectiveness of this latter approach depends on how thorough V is (although we may still generate "unseen" valid execution pairs via mutation), motivating a bounded exhaustive approach for generating valid execution pairs.

The mechanism for generating valid execution pairs works as follows. Let C, C1, . . . , Cn be classes, and m, the target method, a method in C with parameters of types C1, . . . , Cn. The initial states for the execution of m will be tuples ⟨oC, oC1, . . . , oCn⟩ of objects of types C, C1, . . . , Cn, respectively. We build the objects to form these tuples, for each class,

[Fig. 4 diagram: target method → test generation → test cases → execution → valid pre/post states (generation phase); mutation → invalid pre/post states; EvoSpex → postcondition assertions (learning phase)]
Fig. 4. An overview of the proposed approach

bounded exhaustively, in the following way. Let Ci be a class, and b1, . . . , bl a set of builders for Ci, i.e., a set of manually identified methods that can be used to create objects of class Ci. For instance, for a set collection, builders would include constructors and insertion routines. Given a bound k (maximum length for method sequences), we build a set of objects of class Ci using a variant of Randoop [26]. Randoop randomly generates sequences of methods of Ci's API, of increasing length, by iterating a process in which previously produced traces are randomly selected, together with a method, to generate a new trace that calls this method. Our variant incorporates two main modifications to this process:
• The random selection of a method to extend a previously produced trace t (test case), implemented in [26], is replaced by a mechanism to systematically select all methods in b1, . . . , bl, leading to l different extensions of t. This is applied until the bound k is reached.
• A state matching mechanism is implemented, to reduce the number of method combinations: when a newly produced trace leads to an object that matches a previously collected one, the trace (and the object) are discarded. The state matching approach borrows the canonical object representation put forward in [29].
Besides the bound k on trace length, the state matching mechanism also requires a maximum number of objects per type, and a range for primitive types (e.g., 0..k-1 for integers). This is a k-based scope, as defined in finitization procedures in [4] (a standard issue in bounded exhaustive generation). Using the above mechanism, we build the tuples of initial states to execute m.
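The builder-based, bounded-exhaustive generation with state matching can be sketched on a toy example: a "set of integers" whose builders are the empty constructor and add(x) for x in 0..k-1. The class below is an illustration of ours (EvoSpex applies the same idea to arbitrary APIs through its Randoop variant); sorted lists play the role of canonical object representations:

```java
import java.util.*;

// Sketch of bounded-exhaustive generation with state matching for a
// toy "set" abstract datatype. Traces of length up to k are extended
// with every builder; states already seen (in canonical form) are
// discarded, as in the state matching mechanism described above.
public class BoundedExhaustive {
    static Set<List<Integer>> generate(int k) {
        Set<List<Integer>> seen = new LinkedHashSet<>();
        Deque<List<Integer>> frontier = new ArrayDeque<>();
        List<Integer> empty = new ArrayList<>();      // the constructor builder
        seen.add(empty);
        frontier.add(empty);
        for (int len = 0; len < k; len++) {           // traces of length <= k
            Deque<List<Integer>> next = new ArrayDeque<>();
            for (List<Integer> obj : frontier) {
                for (int x = 0; x < k; x++) {         // all builders add(0)..add(k-1)
                    List<Integer> extended = new ArrayList<>(obj);
                    if (!extended.contains(x)) extended.add(x);
                    Collections.sort(extended);       // canonical form
                    if (seen.add(extended))           // state matching: skip duplicates
                        next.add(extended);
                }
            }
            frontier = next;
        }
        return seen;
    }
}
```

For k = 2 this produces exactly the four distinct set states {}, {0}, {1}, {0,1}, even though there are many more builder sequences of length up to 2; state matching prunes the redundant ones.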
We execute m on each of these tuples, and collect the corresponding post-states, building in this way the set V of valid pre-states and corresponding post-states for m.

The mutations applied to produce the "invalid" pre/post state set I take a valid execution pair ⟨s, s′⟩, and create ⟨s, s′′⟩, where s′′ mutates s′ by selecting a random field in the receiving object or return value (the constituents of s′), and replacing the value by a randomly generated value of the corresponding type within the above mentioned scope. We check that the resulting pair is not in V before including it in the invalid state pair set I.

B. Chromosomes representing Candidate Postconditions
Our representation of candidate assertions is based on the encoding used in [24], where chromosomes represent conjunctions of assertions (each gene in a chromosome represents an assertion). That is, given a chromosome c, the candidate postcondition ϕc represented by c is defined as follows:

c = ⟨g1, g2, . . . , gn⟩  ⇒  ϕc = g1 ∧ g2 ∧ . . . ∧ gn

As opposed to what is most common in genetic algorithms, chromosomes have varying lengths in this representation (up to a maximum chromosome length), and gene positions are disregarded by the genetic operators (see below), due to the associativity and commutativity of conjunction. Genes need to encode complex assertions. Below we describe how genes are built, mutated and combined.
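A minimal sketch of this representation is shown below; genes are abstracted as executable predicates over a pre/post size pair, and all names and types are illustrative assumptions of ours, not the tool's actual data structures:

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of the chromosome representation: a variable-length list of
// genes, each an executable assertion over a pre/post state pair; the
// candidate postcondition is their conjunction.
public class Chromosome {
    // A pre/post state pair is abstracted as {preSize, postSize}.
    final List<Predicate<int[]>> genes;

    Chromosome(List<Predicate<int[]>> genes) { this.genes = genes; }

    // phi_c = g1 && g2 && ... && gn
    boolean evaluate(int[] pair) {
        return genes.stream().allMatch(g -> g.test(pair));
    }

    // Gene order is irrelevant, since conjunction is commutative and
    // associative; combining two chromosomes is just a gene union.
    Chromosome union(Chromosome other) {
        List<Predicate<int[]>> all = new ArrayList<>(genes);
        all.addAll(other.genes);
        return new Chromosome(all);
    }
}
```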
C. Initial Population
Let us describe how we build the initial population, to start our genetic algorithm. In order to create individuals representing "meaningful" postconditions, i.e., assertions stating properties of objects that are reachable at the end of the method executions, we take into account typing information, as in [24]. We consider a type graph built automatically from the class under analysis: nodes represent types, and each field f of type B in class A will produce an arc in the graph going from the node representing A to the node representing B. For example, if we consider the AvlTreeList class in Figure 1, the corresponding type graph would be the one shown in Figure 5. It is straightforward to see that by traversing the graph, typed expressions can be built, using the fields of the object from which the method was executed. Some examples are this.root, this.root.left, this.root.size, this.root.value, and so on. Moreover, from loops in the graph, expressions denoting sets, such as this.root.*left (the set of nodes reachable from this.root via left traversals only), this.root.*right and this.root.*(left+right), can be created (as explained earlier, we are using * for reflexive-transitive closure, as in [14]). Size-one chromosomes are created using expressions denoting a single value, evaluating these on a randomly selected subset of the valid (resp. invalid) method executions, in the following way: if the result of evaluating an expression expr in a valid (resp. invalid) tuple t returns a value v, then we create the individual ⟨expr == v⟩ (resp.
⟨expr != v⟩). In addition to these basic individuals, we also create chromosomes containing comparisons of random expressions of the same type (e.g., this.root == this.root.right), chromosomes with quantified formulas considering expressions denoting sets (e.g., all n: this.root.*(left+right) - null : n == n.right), and individuals comparing integer expressions with the cardinality of expressions denoting sets (e.g., an equality between this.root.height and the cardinality of a set expression). Finally, since the method under analysis may have a return value or a set of arguments, we also include, in the set of initial candidates, expressions comparing them against expressions of the same type (e.g., result < this.f). The expressions
Fig. 5. Type graph for the AvlTreeList example.

used to compare with the result variable or the arguments, as well as the operators, are randomly chosen.

Notice that all our initial chromosomes are size-one chromosomes. The main reason for this design choice is to allow the genetic algorithm to progressively produce complex candidate postconditions by means of the genetic operators, which we define later on in this section. While size-one chromosomes in the initial population are non-standard in genetic algorithms, in our case they help the algorithm converge more quickly to better fitted individuals. The replication package site [1] contains the results of comparing the effectiveness of our size-one chromosomes in the initial population with standard size-N chromosomes (we do not include the comparison here due to space restrictions).

D. Fitness Function
Our fitness function assesses how good a candidate postcondition is, by distinguishing between the set V of valid executions and the set I of invalid executions. To do so, before computing the fitness value of a given candidate c, we obtain the postcondition ϕc that c represents, and then compute the sets P and N of positive and negative counterexamples, respectively. These sets are defined as follows:

P = { v ∈ V | ¬ϕc(v) }    N = { i ∈ I | ϕc(i) }

where ϕc is the postcondition represented by the candidate c. Basically, the sets P and N contain those executions for which the postcondition ϕc does not behave correctly. Recall that, as opposed to the case of V, which reliably represents actual execution information of m, the set I may contain mutated executions that are considered "invalid", but correspond to actual executions of m. This motivates a definition of our fitness function that does not treat P and N symmetrically. The fitness function f(c) is computed as follows:

f(c) = (MAX − |P| − |I|) + 1/(l_c + comp_c) + mca_c/l_c,   if |P| > 0
f(c) = (MAX − |N|) + 1/(l_c + comp_c) + mca_c/l_c,         if |P| = 0

This case-based definition aims at considering the negative counterexamples only when no positive counterexamples are obtained. In fact, for arbitrary candidates c1 and c2, if c1 has no positive counterexamples and c2 has some positive counterexamples, then f is guaranteed to produce worse fitness values for c2, no matter how many negative counterexamples
these candidates have. The rationale here is to make the reliable positive-counterexample information more relevant.

The definition of the fitness function has three parts. The first term reflects the most important aspect: minimizing the number of counterexamples. When the candidate postcondition ϕc has positive counterexamples, i.e., it is falsified by a correct method execution, the whole set I of invalid executions is counted as counterexamples too; this is what guarantees our above observation regarding the prioritization of candidates with no positive counterexamples. More precisely, the first term of the function subtracts |I| when |P| > 0, to ensure that the fitness value of such an individual is lower than the fitness value of any other individual that only has negative counterexamples. The second term of the fitness function acts as a penalty regarding two aspects: the candidate length l_c and the candidate "complexity" comp_c. The candidate length is simply the number of conjuncts in the assertion, and it is considered in order to guide the algorithm towards producing smaller assertions. The candidate complexity is the sum of each conjunct's complexity. Intuitively, the complexity of an equality between two integer fields is lower than the complexity of an equality between an integer field and a set cardinality, and both of these are lower than the complexity of a quantified formula, and so on. The last term of the function acts as a reward favoring those candidates with a greater number of "method component assertions" mca_c, i.e., with a high number of conjuncts of the candidate postcondition that represent properties regarding the parameters, the result, or a relation between initial and final object states. As described, the penalty related to the candidate length and complexity, as well as the reward prioritizing the method component assertions, contribute just a fraction to the fitness value, since we want the algorithm to focus on individuals whose number of counterexamples is approaching zero.

E. Genetic Operators
During evolution, the genetic operators allow the algorithm to explore the search space of candidate solutions, by performing certain operations that produce individuals with new characteristics as well as combinations of existing ones. In particular, our algorithm implements two well-known genetic operators, namely the mutation and crossover operators. Some of these genetic operators were inspired by similar ones introduced in [24], while others are novel. Also, a custom selection operator was implemented, to keep in the population those candidates that are more suitable to be part of the real postcondition.

Each chromosome gene is selected for mutation with a fixed probability, and the operation can perform a variety of modifications depending on the shape of the selected gene expression. From a general point of view, the set of considered mutations is the following:

Gene deletion: it can be applied to any gene and simply removes the gene expression from the chromosome.
Negation: it negates the gene expression and is applied to any gene except quantified assertions.

Numeric addition/subtraction: it is only applied to genes that compare two expressions evaluating to a number, and it adds/subtracts a randomly selected numeric expression to/from the right-hand side of the comparison.
Expression replacement: it applies to any gene, and it replaces some part of the gene with a randomly selected expression of the same type.
• Expression extension: it can be applied to any gene that involves a navigational expression, and it extends this expression with a new field, for example replacing this.root by this.root.left.
• Operator replacement: it replaces an operator by an alternative one. The operators vary depending on the current gene expression. For instance, for relational equalities, the possible operators are {==, !=}; for numeric comparisons, the operators are {==, !=, <, >, <=, >=}; and for quantified expressions, the quantifiers are {all, some}.
To produce combinations of individuals, we use a fixed crossover rate. Given two randomly selected chromosomes c1 and c2, our crossover operator simply produces a new individual that contains the union of the genes of c1 and c2, and thus represents the candidate postcondition ϕc1 ∧ ϕc2. An important detail of our crossover operator is that, before selecting individuals for combination, we filter the population, keeping individuals that only have negative counterexamples, i.e., that represent formulas consistent with all valid method executions. The main reason for this policy is that we want the algorithm to join chromosomes that are already consistent with the valid method executions.
Finally, to keep in the population the best candidates of each generation, our selection operator is defined as follows: given a number n to be used as constant population size, our operator first sorts all the candidate postconditions in decreasing fitness order; the candidates moved to the next generation are then the first n/2 individuals, plus the best n/2 unary non-valid individuals, i.e., size-1 chromosomes whose only gene is a formula that still has positive counterexamples. Additionally, our operator keeps all the unary valid candidates, that is, those that only have negative counterexamples.
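As a concrete illustration, the crossover operator and some of the mutation operators above can be sketched as follows. The gene encoding (tuples of operator and operands) and the reduced operator set are simplifying assumptions of ours, not EvoSpex's actual representation.

```python
import random

# Toy encoding (our assumption): a gene is a tuple (op, lhs, rhs), and a
# chromosome is a list of genes denoting the conjunction of its genes.
NUMERIC_OPS = ["==", "!=", "<", ">", "<=", ">="]

def mutate_gene(gene, rng):
    """Apply one randomly chosen mutation to a single gene."""
    kind = rng.choice(["delete", "negate", "op_replace"])
    if kind == "delete":
        return None                              # gene deletion
    if kind == "negate":
        return ("not", gene)                     # negation of the expression
    op, lhs, rhs = gene                          # operator replacement
    return (rng.choice([o for o in NUMERIC_OPS if o != op]), lhs, rhs)

def crossover(c1, c2):
    """Union of the parents' genes: the child represents the conjunction
    of both candidate postconditions."""
    return c1 + [g for g in c2 if g not in c1]

g1 = ("==", "this.size", "old(this.size) + 1")
g2 = ("<=", "0", "this.size")
child = crossover([g1], [g1, g2])
print(child == [g1, g2])  # True: shared conjuncts are not duplicated
```

Note that crossover can only grow or keep the number of conjuncts, which is one reason the fitness function penalizes candidate length.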
This last policy in our selection operator allows us to keep in the population all the discovered valid properties, which the algorithm can use in future crossover operations.

V. EVALUATION
To evaluate our technique, we performed experiments focused on the following research questions:
RQ1: Do the oracles learned by EvoSpex have any deficiency compared to oracles produced by related tools?
RQ2: Are the assertions produced by the algorithm close to manually written contracts?
To evaluate RQ1, we need to consider programs (in our case, Java programs) for which to infer method specifications. As mentioned earlier in the paper, and as is clear from our candidate assertion state space and evolution operators, we target classes and methods with reference-based implementations, in particular classes whose internal representation has strong (implicit) invariants. As a source for our benchmark, we considered SF110 (originally used in [11]), a collection of 110 Java projects (100 random projects, plus the 10 most popular ones according to SourceForge), that covers a wide variety of software, representative of open source development. Our process for assessing postcondition assertions makes use of the OASIs tool [15], essentially, to evaluate the quality of a postcondition assertion in terms of its associated number of false positives and false negatives. Computing this number requires a manual process (as described in [15], to compute the false negatives one first needs to get rid of the false positives, which implies manually refining the produced postconditions every time OASIs reports the presence of a false positive). Therefore, we are unable to consider all 110 projects in the benchmark. We randomly selected 16 projects, skipping cases in our selection that have a clear dependency on the environment (our technique involves automated test generation, and environment dependencies seriously affect these tools). The 16 projects can be found in Table II. For each case study, we selected various methods with different behaviors for analysis, manually defined a set of builders, and then generated the corresponding valid and invalid method executions with a relatively small scope (3 for all cases). Then, we executed our tool in the following way: for each method m selected for analysis, we executed the genetic algorithm to produce a postcondition for m until it reached 30 generations or a 10-minute timeout elapsed.
We repeated this execution 10 times, and then selected the postcondition assertion produced most often among the 10 runs. Additionally, in order to compare our tool with related approaches, we executed Daikon to infer postconditions for each method m. It is important to remark that the test suites that we fed Daikon with, to produce postconditions for the methods under analysis, were exactly the same test suites that were used to generate the valid method executions in our technique (our valid bounded exhaustive suites). Both our tool and Daikon can produce assertions leading to false positives (see Section 2 for a comment on this issue), as well as redundant assertions.
The results of this experiment are shown in Tables I and II. Table I presents the postconditions generated by the tools, after removing the false positives and the redundant assertions, with the aim of giving a clear glance of the complexity of the assertions that the techniques are able to generate. We considered these assertions as the ones produced by the two techniques. We then measured the quality of the corresponding assertions by automatically computing false positives and false negatives, using the OASIs [15] tool. Table II shows the results of this quality assessment. Specifically, for each case study, we report in Table II: (i) lines of code (LoC) of the evaluated project; (ii) number of analyzed methods from the corresponding project; (iii) number of assertions produced as part of the postconditions; (iv) amount and percentage of false positives present in all generated assertions; and (v) number of methods for which false negatives were detected. Notice that, as proposed in [15], false negatives detection is performed once all the false positives have been removed from the postcondition (hence the manual task that made us consider a subset of SF110). For both false positives detection and false negatives detection, we executed OASIs with a timeout of one minute.

TABLE I: POSTCONDITIONS INFERRED BY EVOSPEX AND DAIKON AFTER REMOVING FALSE POSITIVES

jiprof - com.mentorgen.tools.profile.runtime.ClassAllocation
  getAllocCount(): int
    EvoSpex: result == this._count
    Daikon:  this._count == result && result == old(this._count)
  incAllocCount(): void
    EvoSpex: this._count == 1 + old(this._count)
    Daikon:  this._count >= 1 && this._count - old(this._count) - 1 == 0
jmca - com.soops.CEN4010.JMCA.JParser.SimpleNode
  jjtSetParent(Node n): void
    EvoSpex: n == this.parent
    Daikon:  this.parent == old(n) && this.children == old(this.children) && this.id == old(this.id) && this.parser == old(this.parser) && this.identifiers == old(this.identifiers)
bpmail - ch.bluepenguin.email.client.service.impl.EmailFacadeState
  setState(Integer ID, boolean dirtyFlag): void
    EvoSpex: ID in this.states.keySet()
    Daikon:  this.states == old(this.states)
byuic - com.yahoo.platform.yui.compressor.JavaScriptIdentifier
  preventMunging(): void
    EvoSpex: this.mungedValue == old(this.mungedValue) && this.refCount == old(this.refCount) && all n : this.declaredScope.*parentScope : n !in n.^parentScope
    Daikon:  this.mungedValue == old(this.mungedValue) && this.recCount == old(this.refcount) && this.declaredScope == old(this.declaredScope) && this.markedForMunging == false
dom4j - org.dom4j.tree.LazyList
  add(E element): boolean
    EvoSpex: old(this.size) == this.size - 1 && result == true && element in this.header.*next.element
    Daikon:  this.header == old(this.header) && this.size >= 1 && result == true && this.size - old(this.size) - 1 == 0
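To illustrate how candidate assertions like those in Table I are scored against method executions, the following sketch counts positive and negative counterexamples over pre/post state pairs. The dictionary-based state encoding and the predicate form are our own simplifications, not the tool's internal representation.

```python
def count_counterexamples(candidate, valid_pairs, invalid_pairs):
    """Positive counterexamples: valid pre/post pairs the candidate rejects.
    Negative counterexamples: invalid pairs the candidate accepts."""
    pos = sum(1 for pre, post in valid_pairs if not candidate(pre, post))
    neg = sum(1 for pre, post in invalid_pairs if candidate(pre, post))
    return pos, neg

# EvoSpex's contract for incAllocCount (Table I), written as a predicate:
inc_contract = lambda pre, post: post["_count"] == 1 + pre["_count"]

valid = [({"_count": 0}, {"_count": 1}), ({"_count": 3}, {"_count": 4})]
invalid = [({"_count": 0}, {"_count": 5})]   # a mutated post-state

print(count_counterexamples(inc_contract, valid, invalid))  # (0, 0)

# A frame-style candidate is falsified by every valid execution:
frame = lambda pre, post: post["_count"] == pre["_count"]
print(count_counterexamples(frame, valid, invalid))  # (2, 0)
```

A candidate with zero counterexamples of either kind, like inc_contract above, is exactly what the fitness function rewards.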
Problems with OASIs prevented us from reporting the number of false negatives for each method and case study; more precisely, when the tool reported the existence of false negatives, in some cases it was unable to produce the witnessing counterexamples (test cases), preventing us from measuring the number of false negatives identified by the tool in these cases. This issue was discussed with the developers of the tool. We therefore report the number of methods for which OASIs reported the existence of false negatives, rather than the number of false negatives found, as this information was not reliably produced by the tool for all cases. For instance, for project imsmart, out of the 3 methods analyzed, OASIs found one of the corresponding assertions discovered by Daikon to have false negatives, and one of the assertions discovered by EvoSpex to have a false negative, too.
The evaluation of RQ2 requires having classes with methods featuring manually written contracts. Moreover, as discussed in Section 2, assertions for run-time checking are typically weak, efficiently checkable assertions, that only weakly capture the semantics of the corresponding classes and methods [22], [27]. In order to compare with strong contracts, we took:
• A set of case studies with contracts written for the verification of object-oriented programs. More precisely, these programs are written in Eiffel [23], and the accompanying contracts were used for verification with the AutoProof tool [37], a verifier for object-oriented programs written in Eiffel.
• A set of case studies for which a data representation and method implementations are automatically synthesized from a higher-level specification.
More precisely, the synthesized implementations are taken from [19], are generated by the Cozy tool, and are guaranteed to be correct with respect to higher-level specifications, which serve as method contracts.
From [37], we specifically considered various methods and their corresponding postconditions, from the following cases:
• Composite: A tree with a consistency constraint between parent and children nodes. Each node stores a collection of its children and a reference to its parent; the client is allowed to modify any intermediate node. A value in each node should be the maximum of all children's values.
• DoublyLinkedListNode: Node in a (circular) doubly-linked list, with a structural invariant enforcing that its left and right links are consistent with its neighbors.
• Map<K,V>: Generic Map abstract datatype implementation, based on two lists that contain the keys and values, and with operations that perform linear searches on the lists.
• RingBuffer<G>: Bounded queue implemented over a circular array.
Since our tool is for Java, and these implementations are in Eiffel, we had to manually translate the whole classes into Java, for analysis with our tool (this also prevented us from considering more sophisticated case studies in this evaluation). While the translation was manual, we made an effort to make it systematic, preserving the structure of the original code, and taking into account the semantics of references (e.g., expanded types in Eiffel), array indexing in Eiffel vs. Java, etc., using the J2Eif work [36] as a guideline.
Eiffel also differs from Java in other important aspects that did not affect the translation (e.g., inheritance, visibility of features, etc.). While we did not formally verify our translation, it was code-reviewed independently by co-authors of the paper.
From [19], we considered several high-level specifications and their corresponding synthesized Java implementations:
• Polyupdate, a bag of elements that keeps track of the sum of its positive elements.
• Structure, a simple class encapsulating a function and caching a parameter.
• ListComp02, a structure composed of two collections of different elements, and operations that combine elements of the collections.
• MinFinder, a bag of elements with a min operation.
• MaxBag, a set of elements, with a max operation.
In order to infer postconditions for methods in these classes, we first generated valid and invalid method executions, as described earlier in this paper, for each of the target methods, using a scope of 4. Then, we executed our algorithm using the same configuration described for RQ1 (30 generations with a 10-minute timeout). Again, we repeated the execution 10 times and selected the most frequently obtained postcondition. Notice that our approach does not use the real contracts already accompanying the target methods. We fully ignore these in the inference approach, and only consider the methods' source code, both for the generation of valid/invalid method executions and for the actual evolutionary inference. A similar procedure was followed for the Cozy case studies. We computed postconditions for the Java implementations, and contrasted them with those in the original high-level specifications, from which the Java implementations were derived. The results of this experiment are shown in Tables III and IV.
In Table III, for each of the target methods, the column Eiffel Contracts lists the properties that are present in the original postcondition (expressed as text, for easier reference). In Table IV, the original postcondition is described in column High-level spec, in terms of the abstract state declared in the specification. In both tables, the column EvoSpex indicates which of the corresponding assertions in the original contract our evolutionary algorithm was able to infer. Finally, Table V summarizes these results, and also reports the number of invalid assertions synthesized as part of the inferred specifications, for each subject in the Eiffel and Cozy case studies.
Our tool, all the case studies, and a description of how to reproduce the experiments presented in this section can be found in the site of the replication package of our approach [1]. All the experiments were run on an Intel Core i7 3.2GHz, with 16GB of RAM, running GNU/Linux (Ubuntu 16.04).
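The "repeat 10 times, keep the most frequent postcondition" step used in both experiments can be sketched as follows; the assertion strings and run data are purely illustrative.

```python
from collections import Counter

def most_frequent(postconditions):
    """Return the postcondition produced most often across repeated runs
    (ties broken by first occurrence, as Counter does)."""
    return Counter(postconditions).most_common(1)[0][0]

runs = ["this.size == old(this.size) + 1",
        "this.size >= 1",
        "this.size == old(this.size) + 1"]
print(most_frequent(runs))  # this.size == old(this.size) + 1
```

Taking the mode over repeated runs dampens the effect of the algorithm's randomness on the reported result.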
A. Assessment
Let us briefly discuss the results of our evaluation. Regarding RQ1, the results show that our approach is able to generate postconditions containing more complex assertions than the ones produced by Daikon. This is mainly due to the fact that our technique focuses on evolving assertions targeting reference-based conditions in reference-based implementations, as opposed to Daikon, whose expressions are comparatively simpler properties that do not include complex structural constraints, membership checking, etc. (with the exception of arrays and implementations of java.util.List, for which Daikon generates interesting structural properties). Furthermore, as Table II shows, a significant number of the
assertions inferred by our technique are true positives, i.e., assertions that hold for all valid post-states of the corresponding methods, for any scope. Of course, this check for true positives is in the end manual (we carefully analyzed how each of the evaluated methods operates, and inspected the obtained assertions after filtering out assertion conjuncts as per OASIs' assessment); since the oracle deficiency analysis performed by OASIs is inherently incomplete, we cannot guarantee the truth of the remaining assertions.
As shown in Table II, in most of the case studies (13 out of 16), the percentage of false positives that our tool generates, when considering the total amount of produced assertions, is less than that produced by Daikon. Thus, comparing it with Daikon, and solely based on false positives, our assertions are significantly more precise. In fact, over a total of 200 methods analyzed, our technique had 6.7% false positives, compared to Daikon's 17.49% (in absolute terms, 35 vs. 388 false positives, an order of magnitude fewer).

TABLE II: MEASURING THE QUALITY OF POSTCONDITIONS INFERRED BY DAIKON AND EVOSPEX, USING OASIS.

Project          LOCs    Methods  Technique  Assertions  FPs  FP %   Methods w/ FNs
imsmart          1407    3        Daikon     21          2    9.52   1
                                  EvoSpex    4           1    25     1
beanbin          4784    5        Daikon     35          5    14.29  0
                                  EvoSpex    7           0    0      0
byuic            7699    7        Daikon     165         21   12.73  4
                                  EvoSpex    36          4    11.11  2
geo-google       20974   7        Daikon     93          30   32.26  0
                                  EvoSpex    10          3    30     4
templateit       3315    7        Daikon     37          4    10.81  3
                                  EvoSpex    20          0    0      2
water-simulator  9931    9        Daikon     39          3    7.69   9
                                  EvoSpex    18          3    16.67  9
dsachat          5546    9        Daikon     138         15   10.87  3
                                  EvoSpex    18          2    11.11  2
jmca             16891   9        Daikon     205         26   12.68  0
                                  EvoSpex    25          1    4      3
jni-inchi        3100    10       Daikon     122         12   9.84   2
                                  EvoSpex    50          1    2      4
bpmail           2765    11       Daikon     46          6    13.04  8
                                  EvoSpex    17          0    0      7
dom4j            42198   18       Daikon     166         27   16.27  7
                                  EvoSpex    25          2    8      10
jdbacl           28618   19       Daikon     115         17   14.78  10
                                  EvoSpex    80          3    3.75   8
jiprof           26296   20       Daikon     352         81   23.01  20
                                  EvoSpex    35          4    11.43  19
summa            119963  21       Daikon     273         67   24.54  6
                                  EvoSpex    62          5    8.06   5
corina           78144   22       Daikon     155         13   8.39   17
                                  EvoSpex    55          1    1.82   17
a4j              6618    23       Daikon     257         59   22.96  9
                                  EvoSpex    60          5    8.33   5
TOTAL                    200      Daikon     2219        388  17.49  99
                                  EvoSpex    522         35   6.70
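The false-positive rates in the TOTAL row can be cross-checked directly from the reported counts:

```python
# Recomputing the Table II totals: 388/2219 and 35/522, as percentages.
daikon_fp_rate = 100 * 388 / 2219
evospex_fp_rate = 100 * 35 / 522
print(round(daikon_fp_rate, 2), round(evospex_fp_rate, 2))  # 17.49 6.7
```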
Moreover, the relationship between the number of produced assertions (in total, 522 for EvoSpex vs. the 2219 produced by Daikon) and the identified presence of false
negatives, shows that our technique produces assertions of overall similar strength, with significantly fewer conjuncts. Daikon seems to make heavier use of specifically observed values in its assertions, leading to assertions that, while true within the provided test suite cases, are violated when considering larger scopes. Our algorithm is guided both by valid and invalid pre/post method states, giving it an advantage over Daikon, and explores a state space of candidate assertions that is less affected by specific values observed in executions.

TABLE III: COMPARING MANUALLY WRITTEN CONTRACTS (IN EIFFEL VERIFIED CLASSES) WITH POSTCONDITIONS INFERRED BY EVOSPEX. (✓ = inferred by EvoSpex)

Composite
  add_child(Composite c): void — child added ✓; c value unchanged; c children unchanged; ancestors unchanged ✓
DoublyLinkedListNode
  insert_right(DoublyLinkedListNode n): void — n left set ✓; n right set
  remove(): void — singleton ✓; neighbors connected
Map<K,V>
  count(): int — result is size ✓
  extend(K k, V v): int — key set ✓; data set ✓; other keys unchanged; other data unchanged; result is index
  remove(K k): int — key removed ✓; other keys unchanged; other data unchanged; result is index
RingBuffer<G>
  count(): int — result is size
  extend(G a_value) — value added ✓
  item(): G — result is first elem
  remove(): void — first removed
  wipe_out(): void — is empty ✓

TABLE IV: INFERRING POSTCONDITIONS OF SYNTHESIZED COLLECTIONS. (✓ = inferred by EvoSpex)

Polyupdate (x: Bag<Int>, s: Int)
  a(Integer y): void — y added to x ✓; y added to s if positive
  sm(): Integer — result is s + sum of x ✓
Structure (x: Int)
  foo(): Integer — result is x + 1 ✓
  setX(Integer y) — x = y ✓
ListComp02 (Rs: Bag<R>, Ss: Bag<S>)
  insert_r(R r): void — r added to Rs ✓
  insert_s(S s): void — s added to Ss ✓
  q(): Integer — result is the sum of Rs ⊗ Ss
MinFinder (xs: Bag<T>)
  findmin(): T — result is min of xs ✓
  chval(T x, int nv): void — inner value of T is x
MaxBag (l: Set<Int>)
  get_max(): Integer — result is max of l ✓
  add(Integer x): void — x added to l ✓
  remove(Integer x): void — x removed from l

TABLE V: SUMMARY OF EVOSPEX ASSERTIONS ON RQ2 SUBJECTS.

Subject                Methods  Manual Contracts  Inferred Assertions (Total / Invalid)
Eiffel
  Composite            1        4                 7 / 0
  DoublyLinkedListNode 2        4                 5 / 0
  Map<K,V>             …
  RingBuffer<G>        …
Cozy
  Polyupdate           2        3                 3 / 0
  Structure            2        2                 2 / 0
  ListComp02           3        3                 6 / 0
  MinFinder            2        2                 3 / 0
  MaxBag               3        3                 33 / 5
Regarding false negatives, both Daikon and our technique lead to assertions for which OASIs is able to identify false negatives (with our technique having a small margin of advantage in this respect). The conclusion is clear: the assertions obtained with both tools are weaker than the "strongest" postcondition, thus letting some mutations of the analyzed methods "pass" undetected (cf. how OASIs identifies false negatives [15]). Unfortunately, as discussed earlier, a problem with OASIs did not allow us to perform a more detailed comparison, based on the number of identified false negatives in each case. Nevertheless, by inspecting the obtained postconditions, as shown in Table I, it is apparent that our technique produces stronger assertions.
Regarding RQ2, the assertions that our technique can produce are close to those that may be defined by developers when manually specifying rich contracts. As Table III shows, our algorithm generated at least one of the exact properties defined in the original assertions for the Eiffel methods in 8 out of 11 cases. Similarly, as Table IV indicates, in 9 out of 12 methods we correctly identified at least one property of the corresponding postcondition. This confirms that our technique is capable of generating assertions that are actually true positives and are scope-independent. The remaining properties that the algorithm was not able to infer are either specific assertions regarding the arguments; complex properties over sets that are beyond the assertions that the algorithm may produce, such as the "other keys unchanged" property in the Map.extend routine; or relatively complex arithmetic constraints, such as the ones present in Cozy's ListComp02 and the RingBuffer methods (notice that our assertions concentrate on properties of reference-based fields rather than sophisticated arithmetic assertions).
Regarding the precision of the generated specifications for the Eiffel and Cozy case studies, it is also important to analyze whether the tool produces invalid assertions. As Table V shows, only 2 out of 9 subjects contained invalid assertions in the corresponding inferred postcondition, all of them assertions that hold in the bounded scenarios from which they were computed, but not when the scope is extended. These cases were the ones that involved a greater number of fields. Still, the percentage of invalid postconditions for these cases was about 15% or less (4 of 30 in one case, 5 of 33 in the other). Table V also shows that, for most case studies, EvoSpex produces additional valid assertions, compared to the corresponding postcondition. Generally, these have to do with valid information that is not explicitly mentioned in the original postcondition. For instance, for Composite.addChild, EvoSpex produced a 7-conjunct postcondition, 2 of which are in the manual contract; the remaining 5 are either trivial (e.g., the list of children is not changed, the parent is not changed), or capture valid information not in the original "ensure" (e.g., acyclicity of the parent structure). For further details, we refer the reader to the replication package site [1], where all the assertions produced for each case study can be found.

VI. RELATED WORK
Assertions can be exploited for a wide variety of activities in software development, notably program verification [5], [8] and bug finding [35], [18], but also including program comprehension, software evolution and maintenance [30], and specification improvement [15], [31], among others. Thus, the problem of automatically inferring specifications from source code, and in general the problem of producing software oracles, has received increasing attention in the last few years [3]. Techniques for inferring specifications from source code, i.e., for deriving oracles, include approaches based on program executions, such as those reported in [7], [34], as well as some recent techniques based on machine learning [32], [33], [25]. Compared to the execution-based approaches, our technique is guided both by valid and invalid executions (actually, pre/post method states); compared to the machine learning approaches, our technique concentrates on method postconditions, and produces interpretable assertions in standard assertion languages, as opposed to assertions encoded into artificial neural networks and other machine learning models. A closely related technique is that proposed in [7], with which we compare in this paper. Tools for automated test generation, notably EvoSuite [10] as well as Randoop [26] and some extensions [38], can produce assertions accompanying the generated tests. However, these assertions are scenario-specific, i.e., they capture properties particular to the generated tests, as opposed to our postconditions, which attempt to characterize general method behaviors.
Our technique embeds a mechanism for test input generation that follows a bounded exhaustive testing approach. As opposed to previous mechanisms for generating bounded exhaustive suites, e.g., via tools like Korat [4] or TestEra [17], our technique generates bounded exhaustive suites from the program's API, rather than from an invariant specification.
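The contrast with random trace construction can be sketched as a systematic enumeration of all method sequences up to a bound; the state matching mechanism that prunes redundant sequences, which the paper mentions but does not detail, is elided here.

```python
from itertools import product

def bounded_traces(methods, max_len):
    """Yield every method sequence of length 1..max_len, in order,
    instead of sampling traces randomly as Randoop does."""
    for length in range(1, max_len + 1):
        yield from product(methods, repeat=length)

traces = list(bounded_traces(["add", "remove"], 2))
print(len(traces))  # 2 sequences of length 1 + 4 of length 2 = 6
```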
In this sense, our technique is more closely related to Randoop [26], replacing the random method selection in building test traces with a systematic generation of all bounded method traces. The state matching mechanism we used in this paper is crucial in making this approach effective, but its discussion is beyond the scope of the paper. Besides producing valid method executions in the search for assertions, our technique also produces invalid program executions. This approach is based on mutating state. It is somewhat related to the oracle assessment approach (for false negatives) implemented in the OASIs tool [15], although therein the authors mutate programs (source code), as opposed to mutating state. The idea of mutating state is used elsewhere, e.g., in [20], [25].

VII. CONCLUSION
The oracle problem has become a very important problem in software engineering, and within this context, oracle derivation or inference is particularly challenging [3]. In this paper, we have proposed an evolutionary algorithm for oracle inference, in particular for inferring method assertions in the form of postconditions. Our technique features various novel characteristics, including a mechanism for generating test inputs bounded-exhaustively from a component's API, and the definition of a genetic algorithm whose state space of candidate assertions includes rich constraints involving method parameters, return values, internal object states, and the relationship between pre and post method execution states. Our experimental evaluation shows that our tool is able to produce more accurate assertions (stronger contracts in the sense of [28], with the associated benefits described therein), with a total of 6.70% false positives, compared to the 17.49% false positives of related techniques, for a set of randomly selected methods from a benchmark of open source Java projects. Furthermore, our evaluation shows that our technique is able to infer an important part of rich program assertions, taken from a set of case studies involving contracts for program verification and synthesis.
This work also opens several lines for future work. On one hand, our genetic algorithm uses a finite set of genetic operators, in particular the ones used for mutation; extending the set of operators and exploring new ones may be necessary to increase the scope of properties that the algorithm may produce, especially when dealing with more sophisticated programs.
Fitness functions in genetic algorithms play a crucial role in the quality of the solutions; adapting the fitness function of our algorithm to prioritize general aspects of method postconditions may considerably improve our results. Our experiments were based on the use of a variant of random generation for the production of bounded exhaustive test suites. Using alternative test suite generation approaches, such as fully random generation, may allow us to produce different postconditions. The existence of false negatives for our produced postcondition assertions also opens lines of improvement for our inference mechanism.

ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their helpful feedback, and the OASIs authors for their assistance in using the OASIs oracle assessment tool. This work was partially supported by ANPCyT PICT 2016-1384, 2017-1979 and 2017-2622. Facundo Molina's work is also supported by Microsoft Research, through a Latin America PhD Award.
REFERENCES

[1] EvoSpex site. https://sites.google.com/view/evospex, 2021.
[2] Mike Barnett. Code contracts for .NET: Runtime verification and so much more. In Howard Barringer, Yliès Falcone, Bernd Finkbeiner, Klaus Havelund, Insup Lee, Gordon J. Pace, Grigore Rosu, Oleg Sokolsky, and Nikolai Tillmann, editors, Runtime Verification - First International Conference, RV 2010, St. Julians, Malta, November 1-4, 2010. Proceedings, volume 6418 of Lecture Notes in Computer Science, pages 16–17. Springer, 2010.
[3] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. IEEE Trans. Software Eng., 41(5):507–525, 2015.
[4] Chandrasekhar Boyapati, Sarfraz Khurshid, and Darko Marinov. Korat: Automated testing based on Java predicates. In Phyllis G. Frankl, editor, Proceedings of the International Symposium on Software Testing and Analysis, ISSTA 2002, Roma, Italy, July 22-24, 2002, pages 123–133. ACM, 2002.
[5] Patrice Chalin, Joseph R. Kiniry, Gary T. Leavens, and Erik Poll. Beyond assertions: Advanced specification and verification with JML and ESC/Java2. In Frank S. de Boer, Marcello M. Bonsangue, Susanne Graf, and Willem P. de Roever, editors, Formal Methods for Components and Objects, 4th International Symposium, FMCO 2005, Amsterdam, The Netherlands, November 1-4, 2005, Revised Lectures, volume 4111 of Lecture Notes in Computer Science, pages 342–363. Springer, 2005.
[6] Brett Daniel, Vilas Jagannath, Danny Dig, and Darko Marinov. ReAssert: Suggesting repairs for broken unit tests. In ASE 2009, 24th IEEE/ACM International Conference on Automated Software Engineering, Auckland, New Zealand, November 16-20, 2009, pages 433–444. IEEE Computer Society, 2009.
[7] Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The Daikon system for dynamic detection of likely invariants. Sci. Comput. Program., 69(1-3):35–45, 2007.
[8] Manuel Fähndrich. Static verification for code contracts. In Radhia Cousot and Matthieu Martel, editors, Static Analysis - 17th International Symposium, SAS 2010, Perpignan, France, September 14-16, 2010. Proceedings, volume 6337 of Lecture Notes in Computer Science, pages 2–5. Springer, 2010.
[9] Robert W. Floyd. Assigning meanings to programs. In J. T. Schwartz, editor, Mathematical Aspects of Computer Science, Proceedings of Symposia in Applied Mathematics 19, pages 19–32, Providence, 1967. American Mathematical Society.
[10] Gordon Fraser and Andrea Arcuri. EvoSuite: Automatic test suite generation for object-oriented software. In SIGSOFT FSE, pages 416–419. ACM, 2011.
[11] Gordon Fraser and Andrea Arcuri. A large-scale evaluation of automated unit test generation using EvoSuite. ACM Trans. Softw. Eng. Methodol., 24(2):8:1–8:42, 2014.
[12] Carlo Ghezzi, Mehdi Jazayeri, and Dino Mandrioli. Fundamentals of Software Engineering. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 2002.
[13] Charles A. R. Hoare. An axiomatic basis for computer programming. Commun. ACM, 12(10):576–580, 1969.
[14] Daniel Jackson. Software Abstractions - Logic, Language, and Analysis. MIT Press, 2006.
[15] Gunel Jahangirova, David Clark, Mark Harman, and Paolo Tonella. Test oracle assessment and improvement. In Andreas Zeller and Abhik Roychoudhury, editors, Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, July 18-20, 2016, pages 247–258. ACM, 2016.
[16] Pankaj Jalote. An Integrated Approach to Software Engineering, Third Edition. Texts in Computer Science. Springer, 2005.
[17] Shadi Abdul Khalek, Guowei Yang, Lingming Zhang, Darko Marinov, and Sarfraz Khurshid. TestEra: A tool for testing Java programs using Alloy specifications. Pages 608–611. IEEE Computer Society, 2011.
[18] Andreas Leitner, Ilinca Ciupa, Manuel Oriol, Bertrand Meyer, and Arno Fiva. Contract driven development = test driven development - writing test cases. In Ivica Crnkovic and Antonia Bertolino, editors, Proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2007, Dubrovnik, Croatia, September 3-7, 2007, pages 425–434. ACM, 2007.
[19] Calvin Loncaric, Michael D. Ernst, and Emina Torlak. Generalized data structure synthesis. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 958–968, New York, NY, USA, 2018. Association for Computing Machinery.
[20] Muhammad Zubair Malik, Junaid Haroon Siddiqui, and Sarfraz Khurshid. Constraint-based program debugging using data structure repair. In Fourth IEEE International Conference on Software Testing, Verification and Validation, ICST 2011, Berlin, Germany, March 21-25, 2011, pages 190–199. IEEE Computer Society, 2011.
[21] Bertrand Meyer. Applying "Design by Contract". IEEE Computer, 25(10):40–51, 1992.
[22] Bertrand Meyer. Object-Oriented Software Construction, 2nd Edition. Prentice-Hall, 1997.
[23] Bertrand Meyer. Design by contract: The Eiffel method. In TOOLS 1998: 26th International Conference on Technology of Object-Oriented Languages and Systems, 3-7 August 1998, Santa Barbara, CA, USA, page 446. IEEE Computer Society, 1998.
[24] Facundo Molina, César Cornejo, Renzo Degiovanni, Germán Regis, Pablo F. Castro, Nazareno Aguirre, and Marcelo F. Frias. An evolutionary approach to translate operational specifications into declarative specifications. In Leila Ribeiro and Thierry Lecomte, editors, Formal Methods: Foundations and Applications - 19th Brazilian Symposium, SBMF 2016, Natal, Brazil, November 23-25, 2016, Proceedings, volume 10090 of Lecture Notes in Computer Science, pages 145–160, 2016.
[25] Facundo Molina, Renzo Degiovanni, Pablo Ponzio, Germán Regis, Nazareno Aguirre, and Marcelo F. Frias. Training binary classifiers as data structure invariants. In Joanne M. Atlee, Tevfik Bultan, and Jon Whittle, editors, Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 759–770. IEEE / ACM, 2019.
[26] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. Pages 75–84. IEEE Computer Society, 2007.
[27] Nadia Polikarpova, Carlo A. Furia, and Bertrand Meyer. Specifying reusable components. In Gary T. Leavens, Peter W. O'Hearn, and Sriram K. Rajamani, editors,
Proceedings of the 41st International Conference onSoftware Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31,2019 , pages 759–770. IEEE / ACM, 2019.[26] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and ThomasBall. Feedback-directed random test generation. In , pages 75–84. IEEE Computer Society, 2007.[27] Nadia Polikarpova, Carlo A. Furia, and Bertrand Meyer. Specifyingreusable components. In Gary T. Leavens, Peter W. O’Hearn, andSriram K. Rajamani, editors,
Verified Software: Theories, Tools, Experi-ments, Third International Conference, VSTTE 2010, Edinburgh, UK,August 16-19, 2010. Proceedings , volume 6217 of
Lecture Notes inComputer Science , pages 127–141. Springer, 2010.[28] Nadia Polikarpova, Carlo A. Furia, Yu Pei, Yi Wei, and BertrandMeyer. What good are strong specifications? In David Notkin, BettyH. C. Cheng, and Klaus Pohl, editors, , pages 262–271. IEEE Computer Society, 2013.[29] Pablo Ponzio, Valeria S. Bengolea, Simón Gutiérrez Brida, GastónScilingo, Nazareno Aguirre, and Marcelo F. Frias. On the effect ofobject redundancy elimination in randomly testing collection classes. InJuan Pablo Galeotti and Alessandra Gorla, editors,
Proceedings of the11th International Workshop on Search-Based Software Testing, ICSE2018, Gothenburg, Sweden, May 28-29, 2018 , pages 67–70. ACM, 2018.[30] Manoranjan Satpathy, Nils T. Siebel, and Daniel Rodríguez. Assertionsin object oriented software maintenance: Analysis and a case study.In , pages 124–135. IEEEComputer Society, 2004.[31] Todd W. Schiller and Michael D. Ernst. Reducing the barriers to writingverified specifications. In Gary T. Leavens and Matthew B. Dwyer,editors,
Proceedings of the 27th Annual ACM SIGPLAN Conference onObject-Oriented Programming, Systems, Languages, and Applications,OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25,2012 , pages 95–112. ACM, 2012.[32] Seyed Reza Shahamiri, Wan Mohd Nasir Wan-Kadir, Suhaimi Ibrahim,and Siti Zaiton Mohd Hashim. An automated framework for softwaretest oracle.
Information & Software Technology , 53(7):774–788, 2011.[33] Rahul Sharma and Alex Aiken. From invariant checking to invariantinference using randomized search.
Formal Methods in System Design ,48(3):235–256, 2016.[34] Anthony J. H. Simons. Jwalk: a tool for lazy, systematic testing of javaclasses by design introspection and user interaction.
Autom. Softw. Eng. ,14(4):369–418, 2007.35] Nikolai Tillmann and Jonathan de Halleux. Pex-white box test genera-tion for .net. In Bernhard Beckert and Reiner Hähnle, editors,
Tests andProofs, Second International Conference, TAP 2008, Prato, Italy, April9-11, 2008. Proceedings , volume 4966 of
Lecture Notes in ComputerScience , pages 134–153. Springer, 2008.[36] Marco Trudel, Manuel Oriol, Carlo A. Furia, and Martin Nordio.Automated translation of java source code to eiffel. In Judith Bishopand Antonio Vallecillo, editors,
Objects, Models, Components, Patterns -49th International Conference, TOOLS 2011, Zurich, Switzerland, June28-30, 2011. Proceedings , volume 6705 of
Lecture Notes in ComputerScience , pages 20–35. Springer, 2011. [37] Julian Tschannen, Carlo A. Furia, Martin Nordio, and Nadia Polikar-pova. Autoproof: Auto-active functional verification of object-orientedprograms. In , volume 9035 of
Lecture Notes in Computer Science ,pages 566–580. Springer, 2015.[38] Kohsuke Yatoh, Kazunori Sakamoto, Fuyuki Ishikawa, and ShinichiHoniden. Feedback-controlled random test generation. In Michal Youngand Tao Xie, editors,