Grey-Box Learning of Register Automata

Bharat Garhewal¹, Frits Vaandrager¹, Falk Howar², Timo Schrijvers¹, Toon Lenaerts¹, and Rob Smits¹

¹ Radboud University, Nijmegen, The Netherlands
  {bharat.garhewal, frits.vaandrager}@ru.nl
² Dortmund University of Technology
Abstract.
Model learning (a.k.a. active automata learning) is a highly effective technique for obtaining black-box finite state models of software components. Thus far, generalization to infinite state systems with inputs and outputs that carry data parameters has been challenging. Existing model learning tools for infinite state systems face scalability problems and can only be applied to restricted classes of systems (register automata with equality/inequality). In this article, we show how one can boost the performance of model learning techniques by extracting the constraints on input and output parameters from a run, and making this grey-box information available to the learner. More specifically, we provide new implementations of the tree oracle and equivalence oracle from the RALib tool, which use the derived constraints. We extract the constraints from runs of Python programs using an existing tainting library for Python, and compare our grey-box version of RALib with the existing black-box version on several benchmarks, including some data structures from Python's standard library. Our proof-of-principle implementation results in almost two orders of magnitude improvement in terms of numbers of inputs sent to the software system. Our approach, which can be generalized to richer model classes, also enables RALib to learn models that are out of reach of black-box techniques, such as combination locks.
Keywords:
Model learning · Active Automata Learning · Register Automata · RALib · Grey-box · Tainting
⋆ Supported by NWO TOP project 612.001.852 "Grey-box learning of Interfaces for Refactoring Legacy Software (GIRLS)".

1 Introduction

Model learning, also known as active automata learning, is a black-box technique for constructing state machine models of software and hardware components from information obtained through testing (i.e., providing inputs and observing the resulting outputs). Model learning has been successfully used in numerous applications, for instance for generating conformance test suites of software components [13], finding mistakes in implementations of security-critical protocols [8–10], learning interfaces of classes in software libraries [14], and checking that a legacy component and a refactored implementation have the same behaviour [19]. We refer to [17, 20] for surveys and further references.

In many applications it is crucial for models to describe control flow, i.e., states of a component, data flow, i.e., constraints on data parameters that are passed when the component interacts with its environment, as well as the mutual influence between control flow and data flow. Such models often take the form of extended finite state machines (EFSMs). Recently, various techniques have been employed to extend automata learning to a specific class of EFSMs called register automata, which combine control flow with guards and assignments to data variables [1, 2, 6].

While these works demonstrate that it is theoretically possible to infer such richer models, the presented approaches do not scale well and are not yet satisfactorily developed for richer classes of models (cf. [16]): existing techniques either rely on manually constructed mappers that abstract the data aspects of input and output symbols into a finite alphabet, or otherwise infer guards and assignments from black-box observations of test outputs.
The latter can be costly, especially for models where control flow depends on tests on data parameters in the input: in this case, learning an exact guard that separates two control flow branches may require a large number of queries.

One promising strategy for addressing the challenge of identifying data-flow constraints is to augment learning algorithms with white-box information extraction methods, which are able to obtain information about the System Under Test (SUT) at lower cost than black-box techniques. Several researchers have explored this idea. Giannakopoulou et al. [11] develop an active learning algorithm that infers safe interfaces of software components with guarded actions. In their model, the teacher is implemented using concolic execution for the identification of guards. Cho et al. [7] present MACE, an approach for concolic exploration of protocol behaviour. The approach uses active automata learning for discovering so-called deep states in the protocol behaviour. From these states, concolic execution is employed in order to discover vulnerabilities. Similarly, Botinčan and Babić [4] present a learning algorithm for inferring models of stream transducers that integrates active automata learning with symbolic execution and counterexample-guided abstraction refinement. They show how the models can be used to verify properties of input sanitizers in Web applications. Finally, Howar et al. [15] extend the work of [11] and integrate knowledge obtained through static code analysis about the potential effects of component method invocations on a component's state to improve the performance during symbolic queries. So far, however, white-box techniques have never been integrated with learning algorithms for register automata.

In this article, we present the first active learning algorithm for a general class of register automata that uses white-box techniques.
More specifically, we show how dynamic taint analysis can be used to efficiently extract constraints on input and output parameters from a test, and how these constraints can be used to improve the performance of the SL∗ algorithm of Cassel et al. [6]. The SL∗ algorithm generalizes the classical L∗ algorithm of Angluin [3] and has
Fig. 1: MAT framework (our addition, tainting, in red): double arrows indicate possible multiple instances of a query made by an oracle for a single query by the learner.

been used successfully to learn register automaton models, for instance of Linux and Windows implementations of TCP [9]. We have implemented the presented method on top of RALib [5], a library that provides an implementation of the SL∗ algorithm.

The integration of the two techniques (dynamic taint analysis and learning of register automata models) can be explained most easily with reference to the architecture of RALib, shown in Figure 1, which is a variation of the Minimally Adequate Teacher (MAT) framework of [3]: in the MAT framework, learning is viewed as a game in which a learner has to infer the behaviour of an unknown register automaton M by asking queries to a teacher. We postulate that M models the behaviour of a System Under Test (SUT). In the learning phase, the learner (that is, SL∗) is allowed to ask questions to the teacher in the form of tree queries (TQs), and the teacher responds with symbolic decision trees (SDTs). In order to construct these SDTs, the teacher uses a tree oracle, which queries the SUT with membership queries (MQs) and receives a yes/no reply to each. Typically, the tree oracle asks multiple MQs to answer a single tree query in order to infer causal impact and flow of data values. Based on the answers to a number of tree queries, the learner constructs a hypothesis in the form of a register automaton H. The learner submits H as an equivalence query (EQ) to the teacher, asking whether H is equivalent to the SUT model M. The teacher uses an equivalence oracle to answer equivalence queries. Typically, the equivalence oracle asks multiple MQs to answer a single equivalence query. If, for all membership queries, the output produced by the SUT is consistent with hypothesis H, the answer to the equivalence query is 'Yes' (indicating learning is complete).
Otherwise, the answer 'No' is provided, together with a counterexample (CE) that indicates a difference between H and M. Based on this CE, learning continues. In this extended MAT framework, we have constructed new implementations of the tree oracle and equivalence oracle that leverage the constraints on input and output parameters that are imposed by a program run: dynamic tainting is used to extract the constraints on parameters that are encountered during a run of a program. Our implementation learns models of Python programs, using an existing tainting library for Python [12]. Effectively, the combination of SL∗ with our new tree and equivalence oracles constitutes a grey-box learning algorithm, since we only give the learner partial information about the internal structure of the SUT. We compare our grey-box tree and equivalence oracles with the existing black-box versions of these oracles on several benchmarks, including Python's queue and set modules. Our proof-of-concept implementation results in almost two orders of magnitude improvement in terms of numbers of inputs sent to the software system. Our approach, which generalises to richer model classes, also enables RALib to learn models that are completely out of reach for black-box techniques, such as combination locks.

Outline:
Section 2 contains preliminaries; Section 3 discusses tainting in our Python SUTs; Section 4 contains the algorithms we use to answer TQs using tainting and the definition of the tainted equivalence oracle needed to learn combination lock automata; Section 5 contains the experimental evaluation of our technique; and Section 6 concludes.
2 Preliminaries

This section contains the definitions and constructions necessary to understand active automata learning for models with data flow. We first define the concept of a structure, followed by guards, data languages, register automata, and finally symbolic decision trees.

Definition 1 (Structure).
A structure 𝒮 = ⟨R, D, ℛ⟩ is a triple where R is a set of relation symbols, each equipped with an arity, D is an infinite domain of data values, and ℛ contains a distinguished n-ary relation r^ℛ ⊆ D^n for each n-ary relation symbol r ∈ R.

In the remainder of this article, we fix a structure 𝒮 = ⟨R, D, ℛ⟩, where R contains a binary relation symbol = and unary relation symbols =c, for each c contained in a finite set C of constant symbols, D equals the set ℕ of natural numbers, =^ℛ is interpreted as the equality predicate on ℕ, and to each symbol c ∈ C a natural number n_c is associated such that (=c)^ℛ = {n_c}.

Guards are a restricted type of Boolean formulas that may contain relation symbols from R.

Definition 2 (Guards).
We postulate a countably infinite set V = {v₁, v₂, . . .} of variables. In addition, there is a variable p ∉ V that will play a special role as formal parameter of input symbols; we write V⁺ = V ∪ {p}. A guard is a conjunction of relation symbols and negated relation symbols over variables. Formally, the set of guards is inductively defined as follows:

– If r ∈ R is an n-ary relation symbol and x₁, . . . , x_n are variables from V⁺, then r(x₁, . . . , x_n) and ¬r(x₁, . . . , x_n) are guards.
– If g₁ and g₂ are guards then g₁ ∧ g₂ is a guard.

Let X ⊂ V⁺. We say that g is a guard over X if all variables that occur in g are contained in X. A variable renaming is a function σ : X → V⁺. If g is a guard over X then g[σ] is the guard obtained by replacing each variable x in g by σ(x).

(¹ Our implementation is available at https://bitbucket.org/toonlenaerts/taintralib/src/basic.)

Next, we define the notion of a data language. For this, we fix a finite set of actions Σ. A data symbol α(d) is a pair consisting of an action α ∈ Σ and a data value d ∈ D. While relations may have arbitrary arity, we will assume that all actions have an arity of one to ease notation and simplify the text. A data word is a finite sequence of data symbols, and a data language is a set of data words. We denote the concatenation of data words w₁ and w₂ by w₁ · w₂, where w₁ is the prefix and w₂ is the suffix. Acts(w) denotes the sequence of actions α₁α₂ . . . α_n in w, and Vals(w) denotes the sequence of data values d₁d₂ . . . d_n in w. We refer to a sequence of actions in Σ∗ as a symbolic suffix. If w is a symbolic suffix then we write ⟦w⟧ for the set of data words u with Acts(u) = w.

Data languages may be represented by register automata, defined below.

Definition 3 (Register Automaton).
A Register Automaton (RA) is a tuple M = (L, l₀, 𝒳, Γ, λ) where

– L is a finite set of locations, with l₀ as the initial location;
– 𝒳 maps each location l ∈ L to a finite set of registers 𝒳(l);
– Γ is a finite set of transitions, each of the form ⟨l, α(p), g, π, l′⟩, where
  • l, l′ are source and target locations respectively,
  • α(p) is a parametrised action,
  • g is a guard over 𝒳(l) ∪ {p}, and
  • π is an assignment mapping from 𝒳(l′) to 𝒳(l) ∪ {p}; and
– λ maps each location in L to either accepting (+) or rejecting (−).

We require that M is deterministic in the sense that for each location l ∈ L and input symbol α ∈ Σ, the conjunction of the guards of any pair of distinct α-transitions with source l is not satisfiable. M is completely specified if, for all α-transitions out of a location, the disjunction of the guards of the α-transitions is a tautology. M is said to be simple if there are no registers in the initial location, i.e., 𝒳(l₀) = ∅. In this text, all RAs are assumed to be completely specified and simple, unless explicitly stated otherwise. Locations l ∈ L with λ(l) = + are called accepting, and locations with λ(l) = − rejecting.

Example 1 (FIFO-buffer). The register automaton displayed in Figure 2 models a FIFO-buffer with capacity 2. It has three accepting locations l₀, l₁ and l₂ (denoted by a double circle), and one rejecting "sink" location l₃ (denoted by a single circle). Function 𝒳 assigns the empty set of registers to locations l₀ and l₃, singleton set {x} to location l₁, and set {x, y} to l₂.

Fig. 2:
FIFO-buffer with a capacity of 2 modeled as a register automaton.
We now formalise the semantics of an RA. A valuation of a set of variables X is a function ν : X → D that assigns data values to variables in X. If ν is a valuation of X and g is a guard over X then ν ⊨ g is defined inductively by:

– ν ⊨ r(x₁, . . . , x_n) iff (ν(x₁), . . . , ν(x_n)) ∈ r^ℛ
– ν ⊨ ¬r(x₁, . . . , x_n) iff (ν(x₁), . . . , ν(x_n)) ∉ r^ℛ
– ν ⊨ g₁ ∧ g₂ iff ν ⊨ g₁ and ν ⊨ g₂

A state of an RA M = (L, l₀, 𝒳, Γ, λ) is a pair ⟨l, ν⟩, where l ∈ L is a location and ν : 𝒳(l) → D is a valuation of the set of registers at location l. A run of M over data word w = α₁(d₁) . . . α_n(d_n) is a sequence

⟨l₀, ν₀⟩ −α₁(d₁),g₁,π₁→ ⟨l₁, ν₁⟩ · · · ⟨l_{n−1}, ν_{n−1}⟩ −α_n(d_n),g_n,π_n→ ⟨l_n, ν_n⟩,

where
– for each 0 ≤ i ≤ n, ⟨l_i, ν_i⟩ is a state (with l₀ the initial location),
– for each 0 < i ≤ n, ⟨l_{i−1}, α_i(p), g_i, π_i, l_i⟩ ∈ Γ such that ι_i ⊨ g_i and ν_i = ι_i ∘ π_i, where ι_i = ν_{i−1} ∪ {p ↦ d_i} extends ν_{i−1} by mapping p to d_i.

A run is accepting if λ(l_n) = +, else rejecting. The language of M, notation L(M), is the set of words w such that M has an accepting run over w. Word w is accepted (rejected) under valuation ν₀ if M has an accepting (rejecting) run that starts in state ⟨l₀, ν₀⟩.

Fig. 3:
SDT for prefix Push(5) Push(7) and (symbolic) suffix Pop Pop.
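The satisfaction relation ν ⊨ g and the renaming g[σ] defined above are straightforward to prototype. A minimal sketch, assuming guards are conjunctions of (possibly negated) equality literals over the structure fixed earlier (the names and encoding are ours, not RALib's):

```python
# Hedged sketch (our own encoding): a guard is a list of
# (negated, lhs, rhs) equality literals; models() checks ν ⊨ g,
# rename() implements g[σ] from Definition 2.

def models(nu, guard):
    """ν ⊨ g for a conjunction of (possibly negated) equality literals."""
    return all((nu[a] == nu[b]) != neg for (neg, a, b) in guard)

def rename(guard, sigma):
    """g[σ]: replace each variable x by σ(x) (default: x itself)."""
    return [(neg, sigma.get(a, a), sigma.get(b, b)) for (neg, a, b) in guard]

# g ≡ p = x ∧ ¬(p = y), evaluated under ν = [p ↦ 7, x ↦ 7, y ↦ 5]
g = [(False, "p", "x"), (True, "p", "y")]
nu = {"p": 7, "x": 7, "y": 5}
print(models(nu, g))                      # True: p = x and p ≠ y hold
print(models(nu, rename(g, {"x": "y"})))  # False after the renaming x ↦ y
```
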
Example 2.
Consider the FIFO-buffer example from Figure 2. This RA has a run

⟨l₀, ν₀ = []⟩ −Push(7), g₁ ≡ ⊤, π₁ = [x ↦ p]→ ⟨l₁, ν₁ = [x ↦ 7]⟩
−Push(7), g₂ ≡ ⊤, π₂ = [x ↦ x, y ↦ p]→ ⟨l₂, ν₂ = [x ↦ 7, y ↦ 7]⟩
−Pop(7), g₃ ≡ p = x, π₃ = [x ↦ y]→ ⟨l₁, ν₃ = [x ↦ 7]⟩
−Push(5), g₄ ≡ ⊤, π₄ = [x ↦ x, y ↦ p]→ ⟨l₂, ν₄ = [x ↦ 7, y ↦ 5]⟩
−Pop(7), g₅ ≡ p = x, π₅ = [x ↦ y]→ ⟨l₁, ν₅ = [x ↦ 5]⟩
−Pop(5), g₆ ≡ p = x, π₆ = []→ ⟨l₀, ν₆ = []⟩

and thus the trace is Push(7) Push(7) Pop(7) Push(5) Pop(7) Pop(5).
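The run semantics can be made concrete with a small interpreter. A hedged sketch, assuming a hand-coded transition table for the FIFO-buffer of Figure 2 (the encoding is ours, not RALib's); guards and assignments are functions over the extended valuation ι, with the current data value bound to 'p':

```python
# Hedged sketch of the run semantics over the FIFO-buffer of Figure 2.

def run(transitions, accepting, word, initial="l0"):
    """Return True iff the RA accepts the data word."""
    loc, nu = initial, {}
    for action, d in word:
        for guard, assign, target in transitions[(loc, action)]:
            iota = dict(nu, p=d)                  # ι = ν ∪ {p ↦ d}
            if guard(iota):
                nu, loc = assign(iota), target    # ν' = ι ∘ π
                break
    return loc in accepting

# FIFO-buffer with capacity 2; l3 is the rejecting sink location.
T = {
    ("l0", "Push"): [(lambda i: True, lambda i: {"x": i["p"]}, "l1")],
    ("l0", "Pop"):  [(lambda i: True, lambda i: {}, "l3")],
    ("l1", "Push"): [(lambda i: True,
                      lambda i: {"x": i["x"], "y": i["p"]}, "l2")],
    ("l1", "Pop"):  [(lambda i: i["p"] == i["x"], lambda i: {}, "l0"),
                     (lambda i: i["p"] != i["x"], lambda i: {}, "l3")],
    ("l2", "Push"): [(lambda i: True, lambda i: {}, "l3")],
    ("l2", "Pop"):  [(lambda i: i["p"] == i["x"],
                      lambda i: {"x": i["y"]}, "l1"),
                     (lambda i: i["p"] != i["x"], lambda i: {}, "l3")],
    ("l3", "Push"): [(lambda i: True, lambda i: {}, "l3")],
    ("l3", "Pop"):  [(lambda i: True, lambda i: {}, "l3")],
}
ACCEPTING = {"l0", "l1", "l2"}

trace = [("Push", 7), ("Push", 7), ("Pop", 7),
         ("Push", 5), ("Pop", 7), ("Pop", 5)]
print(run(T, ACCEPTING, trace))   # True: the run of Example 2 is accepting
```
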
The SL∗ algorithm uses tree queries in place of membership queries. The arguments of a tree query are a prefix data word u and a symbolic suffix w, i.e., a data word with uninstantiated data parameters. The response to a tree query is a so-called symbolic decision tree (SDT), which has the form of a tree-shaped register automaton that accepts/rejects suffixes obtained by instantiating data parameters in one of the symbolic suffixes. Let us illustrate this on the FIFO-buffer example from Figure 2 for the prefix Push(5) Push(7) and the symbolic suffix Pop Pop. The acceptance/rejection of suffixes obtained by instantiating data parameters after Push(5) Push(7) can be represented by the SDT in Figure 3. In the initial location, values 5 and 7 from the prefix are stored in registers x₁ and x₂, respectively. Thus, SDTs will generally not be simple RAs. Moreover, since the leaves of an SDT have no outgoing transitions, they are also not completely specified. We use the convention that register x_i stores the i-th data value. Thus, initially, register x₁ contains value 5 and register x₂ contains value 7. The initial transitions in the SDT contain an update x₃ := p, and the final transitions an update x₄ := p. For readability, these updates are not displayed in the diagram. The SDT accepts suffixes of the form Pop(d₁) Pop(d₂) iff d₁ equals the value stored in register x₁, and d₂ equals the data value stored in register x₂. The formal definitions of an SDT and the notion of a tree oracle are presented in Appendix A. For a more detailed discussion of SDTs we refer to [6].

3 Tainting

We postulate that the behaviour of the SUT (in our case: a Python program) can be modeled by a register automaton M. In a black-box setting, observations on the SUT will then correspond to words from the data language of M. In this section, we will describe the additional observations that a learner can make in a grey-box setting, where the constraints on the data parameters that are imposed within a run become visible. In this setting, observations of the learner will correspond to what we call tainted words of M. Tainting semantics is an extension of the standard semantics in which each input value is "tainted" with a unique marker from V. In a data word w = α₁(d₁) α₂(d₂) . . . α_n(d_n), the first data value d₁ is tainted with marker v₁, the second data value d₂ with v₂, etc. While the same data value may occur repeatedly in a data word, all the markers are different.
A tainted state of an RA M = (L, l₀, 𝒳, Γ, λ) is a triple ⟨l, ν, ζ⟩, where l ∈ L is a location, ν : 𝒳(l) → D is a valuation, and ζ : 𝒳(l) → V is a function that assigns a marker to each register of l. A tainted run of M over data word w = α₁(d₁) . . . α_n(d_n) is a sequence

τ = ⟨l₀, ν₀, ζ₀⟩ −α₁(d₁),g₁,π₁→ ⟨l₁, ν₁, ζ₁⟩ · · · ⟨l_{n−1}, ν_{n−1}, ζ_{n−1}⟩ −α_n(d_n),g_n,π_n→ ⟨l_n, ν_n, ζ_n⟩,

where
– ⟨l₀, ν₀⟩ −α₁(d₁),g₁,π₁→ ⟨l₁, ν₁⟩ · · · ⟨l_{n−1}, ν_{n−1}⟩ −α_n(d_n),g_n,π_n→ ⟨l_n, ν_n⟩ is a run of M,
– for each 0 ≤ i ≤ n, ⟨l_i, ν_i, ζ_i⟩ is a tainted state,
– for each 0 < i ≤ n, ζ_i = κ_i ∘ π_i, where κ_i = ζ_{i−1} ∪ {p ↦ v_i}.

The tainted word of τ is the sequence w̄ = α₁(d₁) G₁ α₂(d₂) G₂ · · · α_n(d_n) G_n, where G_i = g_i[κ_i], for 0 < i ≤ n. We define constraints_M(τ) = [G₁, . . . , G_n].

Let w = α₁(d₁) . . . α_n(d_n) be a data word. Since register automata are deterministic, there is a unique tainted run τ over w. We define constraints_M(w) = constraints_M(τ), that is, the constraints associated to a data word are the constraints of the unique tainted run that corresponds to it. In the untainted setting a membership query for data word w leads to a response "yes" if w ∈ L(M), and a response "no" otherwise, but in a tainted setting the predicates constraints_M(w) are also included in the response, and provide additional information that the learner may use.
Consider the FIFO-buffer example from Figure 2. This RA has a tainted run

⟨l₀, [], []⟩ −Push(7)→ ⟨l₁, [x ↦ 7], [x ↦ v₁]⟩ −Push(7)→ ⟨l₂, [x ↦ 7, y ↦ 7], [x ↦ v₁, y ↦ v₂]⟩
−Pop(7)→ ⟨l₁, [x ↦ 7], [x ↦ v₂]⟩ −Push(5)→ ⟨l₂, [x ↦ 7, y ↦ 5], [x ↦ v₂, y ↦ v₄]⟩
−Pop(7)→ ⟨l₁, [x ↦ 5], [x ↦ v₄]⟩ −Pop(5)→ ⟨l₀, [], []⟩

(For readability, guards g_i and assignments π_i have been left out.) The constraints in the corresponding tainted trace can be computed as follows:

κ₁ = [p ↦ v₁]                      G₁ ≡ ⊤[κ₁] ≡ ⊤
κ₂ = [x ↦ v₁, p ↦ v₂]              G₂ ≡ ⊤[κ₂] ≡ ⊤
κ₃ = [x ↦ v₁, y ↦ v₂, p ↦ v₃]      G₃ ≡ (p = x)[κ₃] ≡ v₃ = v₁
κ₄ = [x ↦ v₂, p ↦ v₄]              G₄ ≡ ⊤[κ₄] ≡ ⊤
κ₅ = [x ↦ v₂, y ↦ v₄, p ↦ v₅]      G₅ ≡ (p = x)[κ₅] ≡ v₅ = v₂
κ₆ = [x ↦ v₄, p ↦ v₆]              G₆ ≡ (p = x)[κ₆] ≡ v₆ = v₄

and thus the tainted word is Push(7) ⊤ Push(7) ⊤ Pop(7) v₃ = v₁ Push(5) ⊤ Pop(7) v₅ = v₂ Pop(5) v₆ = v₄, and the corresponding list of constraints is [⊤, ⊤, v₃ = v₁, ⊤, v₅ = v₂, v₆ = v₄].

Various techniques can be used to observe tainted traces, for instance symbolic and concolic execution. In this work, we have used a library called "taintedstr" to achieve tainting in Python and make tainted traces available to the learner.
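The marker bookkeeping of Example 3 can be replayed mechanically. A hedged sketch, assuming the FIFO-buffer's guards are hard-coded rather than extracted by instrumenting real code (sink-path disequality guards are simplified to ⊤ here):

```python
# Replay the FIFO-buffer while tracking, per register, the marker v_i
# of the input that filled it; equality guards then become constraints
# over markers, as in the tainted run of Example 3.

def tainted_fifo_constraints(word):
    buf, taints, out = [], [], []
    for i, (action, d) in enumerate(word, start=1):
        marker = f"v{i}"                      # input d_i is tainted with v_i
        if action == "Push":
            out.append("⊤")                   # Push transitions have guard ⊤
            if len(buf) < 2:                  # capacity 2
                buf.append(d)
                taints.append(marker)
        elif buf and d == buf[0]:             # guard p = x, i.e. v_i = ζ(x)
            out.append(("=", marker, taints[0]))
            buf.pop(0)
            taints.pop(0)
        else:                                 # sink path, simplified to ⊤
            out.append("⊤")
    return out

w = [("Push", 7), ("Push", 7), ("Pop", 7),
     ("Push", 5), ("Pop", 7), ("Pop", 5)]
print(tainted_fifo_constraints(w))
# [⊤, ⊤, v3 = v1, ⊤, v5 = v2, v6 = v4], matching Example 3
```
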
Tainting in Python is achieved by using a library called "taintedstr"², which implements a "tstr" (tainted string) class. We do not discuss the entire implementation in detail, but only introduce the portions relevant to our work. The "tstr" class works by operator overloading: each operator is overloaded to record its own invocation. Amongst others, the tstr class overloads the implementation of the "__eq__" (equality) method of Python's str class. In this text, we only consider the equality method. A tstr object x can be considered as a triple ⟨o, t, cs⟩, where o is the (base) string object, t is the taint value associated with string o, and cs is a set of comparisons made by x with other objects, where each comparison c ∈ cs is a triple ⟨f, a, b⟩ with f the name of the binary method invoked on x, a a copy of x, and b the argument supplied to f. Each method f̄ in the tstr class is an overloaded implementation of the relevant (base) method f as follows:

    def f(self, other):
        self.cs.add((f.__name__, self, other))
        return self.o.f(other)

² See [12] and https://github.com/vrthra/taintedstr.

We present a short example of how such an overloaded method would work below:
Example 4 (tstr tainting).
Consider two tstr objects: x₁ = ⟨"1", 1, ∅⟩ and x₂ = ⟨"1", 2, ∅⟩. Calling x₁ == x₂ returns True as x₁.o = x₂.o. As a side-effect of f̄, the set of comparisons x₁.cs is updated with the triple c = ⟨"eq", x₁, x₂⟩. We may then confirm that x₁ is compared to x₂ by checking the taint values of the variables in comparison c: x₁.t = 1 and x₂.t = 2.

Note, our approach to tainting limits the recorded information to operations performed on a tstr object.
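The behaviour described in Example 4 can be reproduced with a few lines. A simplified sketch of the tstr idea, not the actual taintedstr API (we use a list for cs, since objects that overload __eq__ are not hashable by default):

```python
# A string wrapper whose __eq__ records every comparison it performs,
# illustrating the operator-overloading mechanism behind tstr.

class TStr:
    def __init__(self, o, taint):
        self.o = o          # base string
        self.t = taint      # taint marker
        self.cs = []        # comparisons performed by this object

    def __eq__(self, other):
        self.cs.append(("eq", self, other))   # record the invocation
        return self.o == (other.o if isinstance(other, TStr) else other)

x1, x2 = TStr("1", 1), TStr("1", 2)
print(x1 == x2)         # True: the base strings are equal
print(len(x1.cs))       # 1: the comparison was recorded on x1
print(x1.cs[0][2].t)    # 2: the taint of the object x1 was compared with
```
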
Consider the following snippet, where x₁, x₂, x₃ are tstr objects with taint values 1, 2, 3:

    if not (x_1 == x_2 or (x_2 != x_3)):

If the base values of x₁ and x₂ are equal, the Python interpreter will "short-circuit" the if-statement and the second condition, x₂ ≠ x₃, will not be evaluated. Thus, we only obtain one comparison: x₁ = x₂. On the other hand, if the base values of x₁ and x₂ are not equal, the interpreter will not short-circuit, and both comparisons will be recorded as {x₁ = x₂, x₂ ≠ x₃}. While the comparisons are stored as a set, from the perspective of the tainted trace, the guard is a single conjunction: x₁ = x₂ ∧ x₂ ≠ x₃. However, the external negation operation will not be recorded by any of the tstr objects: the negation was not performed on the tstr objects.

4 Learning Register Automata Using Tainting

Given an SUT and a tree query, we generate an SDT in the following steps: (i) construct a characteristic predicate of the tree query (Algorithm 1) using membership and guard queries, (ii) transform the characteristic predicate into an SDT (Algorithm 2), and (iii) minimise the obtained SDT (Algorithm 3).
For u = α₁(d₁) · · · α_k(d_k) a data word, ν_u denotes the valuation of {x₁, . . . , x_k} with ν_u(x_i) = d_i, for 1 ≤ i ≤ k. Suppose u is a prefix and w = α_{k+1} · · · α_{k+n} is a symbolic suffix. Then H is a characteristic predicate for u and w in M if, for each valuation ν of {x₁, . . . , x_{k+n}} that extends ν_u,

ν ⊨ H ⟺ α₁(ν(x₁)) · · · α_{k+n}(ν(x_{k+n})) ∈ L(M),

Algorithm 1:
ComputeCharacteristicPredicate
Data: A tree query consisting of prefix u = α₁(d₁) · · · α_k(d_k) and symbolic suffix w = α_{k+1} · · · α_{k+n}
Result: A characteristic predicate for u and w in M

    G := ⊤, H := ⊥, V := {x₁, . . . , x_{k+n}}
    while ∃ valuation ν for V that extends ν_u such that ν ⊨ G do
        ν := valuation for V that extends ν_u such that ν ⊨ G
        z := α₁(ν(x₁)) · · · α_{k+n}(ν(x_{k+n}))       // construct membership query
        I := ⋀_{i=k+1}^{k+n} constraints_M(z)[i]       // constraints resulting from query
        if z ∈ L(M) then                               // result of query: 'yes' or 'no'
            H := H ∨ I
        G := G ∧ ¬I
    end
    return H

that is, H characterizes the data words u′ with Acts(u′) = w such that u · u′ is accepted by M. In the case of the FIFO-buffer example from Figure 2, a characteristic predicate for prefix Push(5) Push(7) and symbolic suffix Pop Pop is x₃ = x₁ ∧ x₄ = x₂. A characteristic predicate for the empty prefix and symbolic suffix Pop is ⊥, since this trace will inevitably lead to the sink location l₃ and there are no accepting words.

Algorithm 1 shows how a characteristic predicate may be computed by systematically exploring all the (finitely many) paths of M with prefix u and suffix w using tainted membership queries. During the execution of Algorithm 1, predicate G describes the part of the parameter space that still needs to be explored, whereas H is the characteristic predicate for the part of the parameter space that has been covered. We use the notation H ≡ T to indicate syntactic equivalence, and H = T to indicate logical equivalence. Note, if there exists no parameter space to be explored (i.e., w is empty) and u ∈ L(M), the algorithm returns H ≡ ⊥ ∨ ⊤ (as the empty conjunction equals ⊤).

Example 6 (Algorithm 1).
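The loop of Algorithm 1 can be rendered concretely for the tree query with prefix Push(5) Push(7) and symbolic suffix Pop Pop; the prose below then traces the same three iterations by hand. A hedged sketch with our own constraint encoding (literal sets over parameter indices); the fresh value 101 stands in for "any value different from 5 and 7":

```python
# Each candidate valuation induces exactly one tainted path constraint,
# so a valuation satisfies G iff its constraint has not been seen yet.
from itertools import product

def tainted_query(d3, d4):
    """Run Push(5) Push(7) Pop(d3) Pop(d4) on a capacity-2 FIFO; return
    (accepted, path constraint) as ('=',i,j)/('≠',i,j) literals over x_i."""
    vals = {3: d3, 4: d4}
    buf, taints, path, ok = [5, 7], [1, 2], set(), True
    for i in (3, 4):
        if ok and vals[i] == buf[0]:
            path.add(("=", i, taints[0]))
            buf.pop(0)
            taints.pop(0)
        else:
            if ok:
                path.add(("≠", i, taints[0]))
            ok = False        # sink location: remaining guards are ⊤
    return ok, frozenset(path)

H, seen = [], set()
for d3, d4 in product([5, 7, 101], repeat=2):
    accepted, I = tainted_query(d3, d4)
    if I in seen:
        continue              # ν ⊭ G: this part of the space is covered
    seen.add(I)
    if accepted:
        H.append(I)           # H := H ∨ I

print(H)   # one disjunct: x3 = x1 ∧ x4 = x2, as computed in Example 6
```
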
Consider the FIFO-buffer example and the tree query with prefix Push(5) Push(7) and symbolic suffix Pop Pop. After the prefix, location l₂ is reached. From there, three paths are possible with actions Pop Pop: l₂ l₃ l₃, l₂ l₁ l₃ and l₂ l₁ l₀. We consider an example run of Algorithm 1.

Initially, G ≡ ⊤ and H ≡ ⊥. Let ν = [x₁ ↦ 5, x₂ ↦ 7, x₃ ↦ 1, x₄ ↦ 1]. Then ν extends ν_u and ν ⊨ G. The resulting tainted run corresponds to path l₂ l₃ l₃ and so the tainted query gives path constraint I₁ ≡ x₃ ≠ x₁ ∧ ⊤. Since the tainted run is rejecting, H ≡ ⊥ and G ≡ ⊤ ∧ ¬I₁.

In the next iteration, we set ν = [x₁ ↦ 5, x₂ ↦ 7, x₃ ↦ 5, x₄ ↦ 1]. Then ν extends ν_u and ν ⊨ G. The resulting tainted run corresponds to path l₂ l₁ l₃ and so the tainted query gives path constraint I₂ ≡ x₃ = x₁ ∧ x₄ ≠ x₂. Since the tainted run is rejecting, H ≡ ⊥ and G ≡ ⊤ ∧ ¬I₁ ∧ ¬I₂.

In the final iteration, we set ν = [x₁ ↦ 5, x₂ ↦ 7, x₃ ↦ 5, x₄ ↦ 7]. Then ν extends ν_u and ν ⊨ G. The resulting tainted run corresponds to path l₂ l₁ l₀ and the tainted query gives path constraint I₃ ≡ x₃ = x₁ ∧ x₄ = x₂. Now the tainted run is accepting, so H ≡ ⊥ ∨ I₃ and G ≡ ⊤ ∧ ¬I₁ ∧ ¬I₂ ∧ ¬I₃. As G is unsatisfiable, the algorithm terminates and returns characteristic predicate H.

Construction of a non-minimal SDT
For each tree query with prefix u and symbolic suffix w, the corresponding characteristic predicate H is sufficient to construct an SDT using Algorithm 2.

Algorithm 2: SDTConstructor
Data: Characteristic predicate H, index n = k + 1, number of suffix parameters N
Result: Non-minimal SDT T

    if n = k + N + 1 then
        l := SDT leaf node
        z := if H ⟺ ⊥ then − else +                   // value λ(l) for the leaf node
        return ⟨{l}, l, [l ↦ ∅], ∅, [l ↦ z]⟩           // RA with a single location
    else
        T := SDT node
        I_t := {i | x_n ⊙ x_i ∈ H, n > i}              // x_i may be a parameter or a constant
        if I_t = ∅ then
            t := SDTConstructor(H, n + 1, N)           // no guards present
            add t with guard ⊤ to T
        else
            g := ⋀_{i ∈ I_t} x_n ≠ x_i                 // disequality guard case
            H′ := ⋁_{f ∈ H} (f ∧ g if f ∧ g is satisfiable else ⊥)   // f is a disjunct
            t := SDTConstructor(H′, n + 1, N)
            add t with guard g to T
            for i ∈ I_t do
                g := x_n = x_i                         // equality guard case
                H′ := ⋁_{f ∈ H} (f ∧ g if f ∧ g is satisfiable else ⊥)
                t := SDTConstructor(H′, n + 1, N)
                add t with guard g to T
            end
        return T

Algorithm 2 proceeds in the following manner: for a symbolic action α(x_n) with parameter x_n, construct the potential set I_t (lines 6 & 7), that is, the set of parameters to which x_n is compared in H. For line 7, recall that H is a DNF formula, hence each literal x_j ⊙ x_k is considered in the set comprehension, rather than the conjunctions making up the predicate H. Each element x_i with i ∈ I_t can be either a formal parameter in the tree query or a constant c_i ∈ C from our chosen structure. Using I_t, we can construct the guards as follows:

– Disequality guard: the disequality guard will be g := ⋀_{i ∈ I_t} x_n ≠ x_i. We then check which disjuncts of H are still satisfiable with the addition of g, and construct the predicate H′ for the next call of Algorithm 2 (lines 13–16).
– Equality guard(s): for each parameter x_i with i ∈ I_t, the equality guard will be g := x_n = x_i. We then check which disjuncts of H are still satisfiable with the addition of g, and this becomes the predicate H′ for the next call of Algorithm 2 (lines 18–21).

At the base case (lines 1–3), the leaf is rejecting if H ⟺ ⊥, and accepting otherwise. As mentioned, at each non-leaf location l of the SDT T returned by Algorithm 2, there exists a potential set I_t. For each parameter x_i with i ∈ I_t, we know that there is a comparison between x_i and x_n in the SUT.
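Algorithm 2's recursion can be sketched compactly. Assumptions: H is a list of literal sets (our encoding, not RALib's), and the satisfiability check is deliberately naive (no equality closure), which suffices when each literal compares x_n with an earlier parameter, as in the FIFO query's predicate x₃ = x₁ ∧ x₄ = x₂:

```python
# Build a (non-minimal) decision tree from a DNF characteristic
# predicate: literals are ('=', j, i) / ('≠', j, i), meaning
# x_j = x_i / x_j ≠ x_i with i < j.

def satisfiable(conj):
    eq = {(a, b) for (op, a, b) in conj if op == "="}
    return not any((a, b) in eq for (op, a, b) in conj if op == "≠")

def sdt(H, n, last):
    """Tree for suffix parameters x_n .. x_{last-1}; H is a list of
    satisfiable conjunctions (unsatisfiable disjuncts are dropped)."""
    if n == last:                       # leaf: accepting iff H is not ⊥
        return "+" if H else "-"
    pot = sorted({i for conj in H for (op, j, i) in conj if j == n})
    if not pot:                         # no comparisons: single ⊤ branch
        return {"⊤": sdt(H, n + 1, last)}
    node = {}
    diseq = frozenset(("≠", n, i) for i in pot)
    node["∧".join(f"x{n}≠x{i}" for i in pot)] = sdt(
        [c | diseq for c in H if satisfiable(c | diseq)], n + 1, last)
    for i in pot:                       # one equality branch per x_i
        g = frozenset({("=", n, i)})
        node[f"x{n}=x{i}"] = sdt(
            [c | g for c in H if satisfiable(c | g)], n + 1, last)
    return node

# Characteristic predicate x3 = x1 ∧ x4 = x2 from the FIFO tree query:
H = [frozenset({("=", 3, 1), ("=", 4, 2)})]
print(sdt(H, 3, 5))
```

The resulting tree mirrors Figure 3: only the branch x₃ = x₁ followed by x₄ = x₂ ends in an accepting leaf.
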
Example 7 (Algorithm 2).
Consider a characteristic predicate H ≡ I₁ ∨ I₂ ∨ I₃ ∨ I₄, where

I₁ ≡ x₂ = x₁ ∧ x₃ = x₂,   I₂ ≡ x₂ = x₁ ∧ x₃ ≠ x₂,
I₃ ≡ x₂ ≠ x₁ ∧ x₃ = x₁,   I₄ ≡ x₂ ≠ x₁ ∧ x₃ ≠ x₁.

We discuss only the construction of the sub-tree rooted at node s₁ for the SDT visualised in Figure 4a; the construction of the remainder is similar.

Initially, x_n = x_{k+1} = x₂. The potential set I_t for x₂ is {x₁}, as H contains literals comparing x₂ with x₁. Consider the construction of the equality guard g := x₂ = x₁. The new characteristic predicate is H′ ≡ (I₁ ∧ g) ∨ (I₂ ∧ g), as I₃ and I₄ are unsatisfiable when conjoined with g.

For the next call, with n = 3, the current variable is x₃, with predicate H′ (from the parent instance). We obtain the potential set for x₃ as {x₂}. The equality guard is g′ := x₃ = x₂ with the new characteristic predicate H″ ≡ I₁ ∧ g ∧ g′, i.e., H″ ⟺ x₂ = x₁ ∧ x₃ = x₂ (note, I₂ ∧ g ∧ g′ is unsatisfiable). In the next call, we have n = 4, thus we compute a leaf. As H″ is not ⊥, we return an accepting leaf t₁. The disequality guard is g″ := x₃ ≠ x₂, with characteristic predicate H‴ ⟺ x₂ = x₁ ∧ x₃ = x₂ ∧ x₃ ≠ x₂ ⟺ ⊥. In the next call, we have n = 4, and we return a non-accepting leaf t₂. The two trees t₁ and t₂ are added as sub-trees with their respective guards g′ and g″ to a new tree rooted at node s₁ (see Figure 4a).
Example 7 showed a characteristic predicate H containing redundant comparisons, resulting in the non-minimal SDT in Figure 4a. We use Algorithm 3 to minimise the SDT in Figure 4a to the SDT in Figure 4b. We present an example of the application of Algorithm 3, shown for the SDT of Figure 4a. Figure 4a visualises a non-minimal SDT T, where s₁ and s₂ (in red) are essentially "duplicates" of each other: the sub-tree rooted at node s₁ is isomorphic to the sub-tree rooted at node s₂ under the relabelling "x₂ ↦ x₁". We indicate this relabelling using the notation T[s₁]⟨x₂, x₁⟩ and the isomorphism relation under the relabelling as T[s₁]⟨x₂, x₁⟩ ≃ T[s₂]. Algorithm 3 accepts the non-minimal SDT of Figure 4a and produces the equivalent minimal SDT in Figure 4b. Nodes s₁ and s₂ are merged into one node, s₁₂, marked in green. We can observe that both SDTs still encode the same decision tree. With Algorithm 3, we have completed our tainted tree oracle, and can now proceed to the tainted equivalence oracle.

Algorithm 3:
MinimiseSDT
Data: Non-minimal SDT T, current index n
Result: Minimal SDT T′

if T is a leaf then            // base case
    return T
else
    T′ := SDT node
    // minimise the lower levels
    for guard g with associated sub-tree t in T do
        add guard g with associated sub-tree MinimiseSDT(t, n + 1) to T′
    end
    // minimise the current level
    I := potential set of the root node of T′
    t_≠ := disequality sub-tree of T′ with guard ⋀_{i∈I} x_n ≠ x_i
    I′ := ∅
    for i ∈ I do
        t_i := sub-tree of T′ with guard x_n = x_i
        if t_i⟨x_i, x_n⟩ ≄ t_≠ or t_i⟨x_i, x_n⟩ is undefined then
            I′ := I′ ∪ {x_i}
            add guard x_n = x_i with corresponding sub-tree t_i to T′
    end
    add guard ⋀_{i∈I′} x_n ≠ x_i with corresponding sub-tree t_≠ to T′
    return T′

The tainted equivalence oracle (TEO), like its non-tainted counterpart, accepts a hypothesis H and verifies whether H is equivalent to the register automaton M that models the SUT. If H and M are equivalent, the oracle replies "yes"; otherwise it returns "no" together with a CE. The RandomWalk equivalence oracle in RALib constructs random traces in order to find a CE.

Definition 4 (Tainted Equivalence Oracle).
For a given hypothesis H, maximum word length n, and SUT S, a tainted equivalence oracle is a function O_E(H, n, S) that, for all tainted traces w of S with |w| ≤ n, returns w if w ∈ L(H) ⇔ w ∈ L(S) is false, and 'Yes' otherwise.

The TEO uses the construction of the characteristic predicate to find a CE: we randomly generate a symbolic suffix of specified length n (with an empty prefix) and construct a predicate H for the query. For each trace w satisfying a guard in H, we check whether w ∈ L(H) ⇔ w ∈ L(M). If this equivalence is false, w is a CE. If no such w exists, we randomly generate another symbolic suffix. In practice, we bound the number of symbolic suffixes to generate. Example 8 presents a scenario of a combination lock automaton that can be learned (relatively easily) using a TEO, but cannot be handled by normal oracles.

Example 8 (Combination Lock RA).
A combination lock is a type of RA which requires a sequence of specific inputs to 'unlock'. Figure 5 presents an RA C with a '4-digit' combination lock that can be unlocked by the sequence w = α(c_1) α(c_2) α(c_3) α(c_4), where c_1, c_2, c_3, c_4 are constants. Consider a case where a hypothesis H is being checked for equivalence against the RA C, with w ∉ L(H). While it would be difficult for a normal equivalence oracle to generate the word w randomly, the tainted equivalence oracle will record at every location the comparison of the input data value p with some constant c_i and explore all corresponding guards at the location, eventually constructing the word w. For the combination lock automaton, we may note that as the 'depth' of the lock increases, the probability of randomly finding a CE decreases.

Fig. 4: SDT Minimisation: Redundant nodes (in red, left SDT) are merged together (in green, right SDT). (a) Non-minimal SDT T. (b) Minimal SDT T′.

Fig. 5: Combination Lock C: Sequence α(1) α(9) α(6) α(2) unlocks the automaton. Error transitions have been 'merged' for conciseness. The sink state has not been drawn.

We have used stubbed versions of the Python FIFO-Queue and Set modules (from Python's queue module and standard library, respectively) for learning the FIFO and Set models, while the Combination Lock automata were constructed manually. Source code for all other models was obtained by translating existing benchmarks from [18] (see also automata.cs.ru.nl) to Python code. We also utilise a 'reset' operation: a 'reset' brings an SUT back to its initial state and is counted as an 'input' for our purposes. Furthermore, each experiment was repeated 30 times with different random seeds. Each experiment was bounded according to the following constraints: learning phase: 10 inputs and 5 × resets; testing phase: 10 inputs and 5 × resets; length of the longest word during testing: 50; and a ten-minute timeout for the learner to respond. Figure 6 gives an overview of our experimental results.
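As an illustration of the experimental setup described above, the following sketch wraps a stubbed FIFO SUT in a harness that tallies inputs and counts each reset as one input, as in our experiments. All names (FifoSUT, CountingHarness) are our own and not taken from the paper's implementation.

```python
# Illustrative sketch (hypothetical names, not the actual benchmark harness):
# a stubbed bounded FIFO SUT plus a wrapper that counts inputs and resets.

class FifoSUT:
    """FIFO of bounded capacity; push/pop return True (accepted) or False."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []

    def push(self, value):
        if len(self.queue) < self.capacity:
            self.queue.append(value)
            return True
        return False

    def pop(self, value):
        # accept only if 'value' equals the oldest stored element
        if self.queue and self.queue[0] == value:
            self.queue.pop(0)
            return True
        return False


class CountingHarness:
    """Wraps an SUT; a reset is counted like an input, as in the experiments."""
    def __init__(self, sut_factory):
        self.sut_factory = sut_factory
        self.sut = sut_factory()
        self.inputs = 0
        self.resets = 0

    def step(self, op, value):
        self.inputs += 1
        return getattr(self.sut, op)(value)

    def reset(self):
        self.resets += 1
        self.sut = self.sut_factory()


harness = CountingHarness(lambda: FifoSUT(capacity=2))
harness.step("push", 5)
harness.step("push", 7)
ok = harness.step("pop", 5)                 # True: 5 was pushed first
harness.reset()
print(ok, harness.inputs + harness.resets)  # True 4
```

The per-experiment bounds from the text (maximum numbers of inputs and resets per phase) would then simply be checks on `harness.inputs` and `harness.resets`.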
We use the notation 'TTO' to represent 'Tainted Tree Oracle' (with similar abbreviations for the other oracles). In the figure, we can see that as the size of the container increases, the difference between the fully tainted version (TTO+TEO, in blue) and the completely untainted version (NTO+NEO, in red) increases. In the case where only a tainted tree oracle is used (TTO+NEO, in green), we see that it follows the fully tainted version closely for the FIFO models and is slightly better in the case of the SET models.
Fig. 6:
Benchmark plots: The numbers of symbols used with tainted oracles (blue and green) are generally lower than with normal oracles (red and orange). Note that the y-axis is log-scaled. Additionally, normal oracles are unable to learn the Combination Lock and Repetition automata and are hence not plotted.
The addition of the TEO gives a conclusive advantage for the Combination Lock and Repetition benchmarks. The addition of the TTO by itself results in significantly fewer symbols, even without the tainted equivalence oracle (TTO vs. NTO; compare the green and red lines). With the exception of the Combination Lock and Repetition benchmarks, the TTO+TEO combination does not provide vastly better results than TTO+NEO; however, it is still (slightly) better. We note that, as expected, the NEO does not manage to provide CEs for the Repetition and Combination Lock automata. The TEO is therefore much more useful for finding CEs in SUTs which utilise constants. For complete details of the data used to produce the plots, please refer to Appendix B.
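The advantage of tainting on combination locks can be made concrete with a small sketch. The code below is illustrative only: the SUT and oracle names are ours, and tainting is mimicked by letting a run expose the constant that the input was compared against, which is the information dynamic taint tracking would recover.

```python
# Hedged sketch: why a comparison-guided ("tainted") search finds
# combination-lock counterexamples that random walks almost never find.

import random

SECRET = [1, 9, 6, 2]  # the lock's 4-digit combination (hypothetical)

def lock_step(state, p):
    """One α(p) transition; returns (new state, constant compared against)."""
    c = SECRET[state] if state < len(SECRET) else None
    if c is not None and p == c:
        return state + 1, c   # correct digit: advance towards 'unlocked'
    return 0, c               # wrong digit: back to the initial location

def random_oracle(tries, word_len, domain=range(100)):
    """Black-box random walk: succeeds only by guessing the full sequence."""
    for _ in range(tries):
        state = 0
        for _ in range(word_len):
            state, _ = lock_step(state, random.choice(list(domain)))
            if state == len(SECRET):
                return True
    return False

def tainted_oracle():
    """Grey-box: replay the constants observed in the comparisons."""
    word, state = [], 0
    while state < len(SECRET):
        _, c = lock_step(state, -1)   # probe; 'taint' reveals the constant c
        state, _ = lock_step(state, c)
        word.append(c)
    return word

print(tainted_oracle())  # [1, 9, 6, 2]
```

With a data domain of 100 values, a random walk hits the 4-digit sequence with probability 10⁻⁸ per attempt, matching the observation that the probability of randomly finding a CE vanishes as the lock's depth grows, while the guided search constructs the unlocking word in a handful of queries.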
In this article, we have presented an integration of dynamic taint analysis, a white-box technique for tracing data flow, and register automata learning, a black-box technique for inferring behavioral models of components. The combination of the two methods improves upon the state of the art both in the class of systems for which models can be generated and in performance: tainting makes it possible to infer data-flow constraints even in instances with a high essential complexity (e.g., in the case of so-called combination locks). Our implementation outperforms pure black-box learning by two orders of magnitude, with a growing impact in the presence of multiple data parameters and registers. Both improvements are important steps towards the applicability of model learning in practice, as they will help scaling to industrial use cases.

At the same time, our evaluation shows the need for further improvements. Currently, the SL∗ algorithm uses symbolic decision trees and tree queries globally, a well-understood weakness of learning algorithms that are based on observation tables. It also uses individual tree oracles for each type of operation and relies on syntactic equivalence of decision trees. A more advanced learning algorithm for extended finite state machines would be able to consume fewer tree queries and leverage semantic equivalence of decision trees. Deeper integration with white-box techniques could enable the analysis of many (and more involved) operations on data values.

Acknowledgement
We are grateful to Andreas Zeller for explaining the use oftainting for dynamic tracking of constraints, and to Rahul Gopinath for helpingus with his library for tainting Python programs. We also thank the anonymousreviewers for their suggestions.
References

[1] Aarts, F., Heidarian, F., Kuppens, H., Olsen, P., Vaandrager, F.: Automata learning through counterexample-guided abstraction refinement. In: Giannakopoulou, D., Méry, D. (eds.) 18th International Symposium on Formal
[11] Giannakopoulou, D., Rakamarić, Z., Raman, V.: Symbolic learning of component interfaces. In: Proceedings of the 19th International Conference on Static Analysis, pp. 248–264. SAS'12, Springer-Verlag, Berlin, Heidelberg (2012)
[12] Gopinath, R., Mathis, B., Höschele, M., Kampmann, A., Zeller, A.: Sample-free learning of input grammars for comprehensive software fuzzing. CoRR abs/1810.08289 (2018), http://arxiv.org/abs/1810.08289
[13] Hagerer, A., Margaria, T., Niese, O., Steffen, B., Brune, G., Ide, H.D.: Efficient regression testing of CTI-systems: Testing a complex call-center solution. Annual Review of Communication, Int. Engineering Consortium (IEC) 55, 1033–1040 (2001)
[14] Howar, F., Isberner, M., Steffen, B., Bauer, O., Jonsson, B.: Inferring semantic interfaces of data structures. In: ISoLA (1): Leveraging Applications of Formal Methods, Verification and Validation. Technologies for Mastering Change - 5th International Symposium, ISoLA 2012, Heraklion, Crete, Greece, October 15-18, 2012, Proceedings, Part I. Lecture Notes in Computer Science, vol. 7609, pp. 554–571. Springer (2012)
[15] Howar, F., Giannakopoulou, D., Rakamarić, Z.: Hybrid learning: Interface generation through static, dynamic, and symbolic analysis. In: Proceedings of the 2013 International Symposium on Software Testing and Analysis, pp. 268–279. ISSTA 2013, ACM, New York, NY, USA (2013), http://doi.acm.org/10.1145/2483760.2483783
[16] Howar, F., Jonsson, B., Vaandrager, F.W.: Combining black-box and white-box techniques for learning register automata. In: Steffen, B., Woeginger, G.J. (eds.) Computing and Software Science - State of the Art and Perspectives, Lecture Notes in Computer Science, vol. 10000, pp. 563–588.
Springer (2019), https://doi.org/10.1007/978-3-319-91908-9_26
[17] Howar, F., Steffen, B.: Active automata learning in practice. In: Bennaceur, A., Hähnle, R., Meinke, K. (eds.) Machine Learning for Dynamic Software Analysis: Potentials and Limits: International Dagstuhl Seminar 16172, Dagstuhl Castle, Germany, April 24-27, 2016, Revised Papers, pp. 123–148. Springer International Publishing (2018)
[18] Neider, D., Smetsers, R., Vaandrager, F., Kuppens, H.: Benchmarks for automata learning and conformance testing. In: Margaria, T., Graf, S., Larsen, K.G. (eds.) Models, Mindsets, Meta: The What, the How, and the Why Not? Essays Dedicated to Bernhard Steffen on the Occasion of His 60th Birthday, pp. 390–416. Springer International Publishing, Cham (2019), https://doi.org/10.1007/978-3-030-22348-9_23
[19] Schuts, M., Hooman, J., Vaandrager, F.: Refactoring of legacy software using model learning and equivalence checking: an industrial experience report. In: Ábrahám, E., Huisman, M. (eds.) Proceedings 12th International Conference on integrated Formal Methods (iFM), Reykjavik, Iceland, June 1-3. Lecture Notes in Computer Science, vol. 9681, pp. 311–325 (2016)
[20] Vaandrager, F.: Model learning. Commun. ACM 60(2), 86–95 (Feb 2017), http://doi.acm.org/10.1145/2967606
Appendix A Tree Oracle for Equalities
In this appendix, we prove that the tainted tree oracle generates SDTs which are isomorphic to the SDTs generated by the normal tree oracle as defined in [6]. In order to do so, we first introduce the constructs used by Cassel et al. [6] for generating SDTs. We begin with some preliminaries.

For a word u with Vals(u) = d_1 ... d_k, we define the potential of u. The potential of u, written pot(u), is the set of indices i ∈ {1, ..., k} for which there exists no j ∈ {1, ..., k} such that j > i and d_i = d_j. The concept of potential essentially allows unique access to a data value, abstracting away from the concrete position of a data value in a word. For a guard g over the parameter p and the variables of a word u with Vals(u) = d_1 ... d_k, a representative data value d_g^u is a data value such that ν(u) ∪ {p ↦ d_g^u} ⊨ g. Furthermore, for a word w = α · w′ (where w′ may be ε), w′ can be represented as α⁻¹w. The same notation is also extended to sets of words: α⁻¹V = {α⁻¹w | w ∈ V}. We may now define an SDT:

Definition 5 (Symbolic Decision Tree).
A Symbolic Decision Tree (SDT) is a register automaton T = (L, l_0, X, Γ, λ) where L and Γ form a tree rooted at l_0. For a location l of SDT T, we write T[l] to denote the subtree of T rooted at l.

An SDT that results from a tree query (u, w) (of a prefix word u and a symbolic suffix w) is required to satisfy a canonical form, captured by the following definition.

Definition 6 ((u, w)-tree). For any data word u with k actions and any symbolic suffix w, a (u, w)-tree is an SDT T which has runs over all data words in ⟦w⟧, and which satisfies the following restriction: whenever ⟨l, α(p), g, π, l′⟩ is the j-th transition on some path from l_0, then for each x_i ∈ X(l′) we have either (i) i < k + j and π(x_i) = x_i, or (ii) i = k + j and π(x_i) = p.

If u = α_1(d_1) ··· α_k(d_k) is a data word, then ν_u is the valuation of {x_1, ..., x_k} satisfying ν_u(x_i) = d_i, for 1 ≤ i ≤ k. Using this definition, the notion of a tree oracle, which accepts tree queries and returns SDTs, can be described as follows.

Definition 7 (Tree Oracle).
A tree oracle for a structure S is a function O which, for a data language L, a prefix word u, and a symbolic suffix w, returns a (u, w)-tree O(L, u, w) such that, for any word v ∈ ⟦w⟧, the following holds: v is accepted by O(L, u, w) under ν_u iff u · v ∈ L. A tree oracle returns equality trees, defined below:
Definition 8 (Equality Tree). An equality tree for a tree query (u, V) is a (u, V)-tree T such that:
– for each action α, there is a potential set I ⊆ pot(u) of indices such that the initial α-guards consist of the equalities of the form p = x_i for i ∈ I and one disequality of the form ⋀_{i∈I} p ≠ x_i, and
– for each initial transition ⟨l_0, α(p), g, l⟩ of T, the tree T[l] is an equality tree for (uα(d_g^u), α⁻¹V).

Cassel et al. [6] require their SDTs (equality trees) to be minimal (called maximally abstract in [6]), i.e., the SDTs must not contain any redundancies (such as Figure 4a). This can be achieved by checking whether two sub-trees are equal under some relabelling; the process of constructing a tree by relabelling an equality sub-tree is called specialisation of an equality tree:

Definition 9 (Specialisation of equality tree).
Let T be an equality tree for prefix u and set of symbolic suffixes V, and let J ⊆ pot(u) be a set of indices. Then T⟨J⟩ denotes the equality tree for (u, V) obtained from T by performing the following transformations for each α:
– Whenever T has several initial α-transitions of the form ⟨l_0, α(p), (p = x_j), l_j⟩ with j ∈ J, then all subtrees of the form (T[l_j])⟨J[(k + 1)/j]⟩ for j ∈ J must be defined and isomorphic, otherwise T⟨J⟩ is undefined. If all such subtrees are defined and isomorphic, then T⟨J⟩ is obtained from T by
1. replacing all initial α-transitions of the form ⟨l_0, α(p), (p = x_j), l_j⟩ for j ∈ J by the single transition ⟨l_0, α(p), (p = x_m), l_m⟩ where m = max(J),
2. replacing T[l_m] by (T[l_m])⟨J[(k + 1)/m]⟩, and
3. replacing all other subtrees T[l] reached by initial α-transitions (which have not been replaced in Step 1) by (T[l])⟨J⟩.
If, for some α, any of the subtrees generated in Step 2 or 3 are undefined, then T⟨J⟩ is also undefined; otherwise T⟨J⟩ is obtained after performing Steps 1–3 for each α.

Definition 10 (Necessary Potential set for Tree Oracle).
A necessary potential set I for the root location l_0 of an equality tree O(L, u, V) is a subset of pot(u) such that for each index i ∈ I the following holds:
1. O(L, uα(d_u), V_α)⟨{i, k + 1}⟩ is undefined, or
2. O(L, uα(d_u), V_α)⟨{i, k + 1}⟩ ≄ O(L, uα(d_i), V_α).

Intuitively, a necessary potential set contains the indices of data values which influence future behaviour of the SUT. Consequently, indices of data values which do not influence the behaviour of the SUT are excluded from the necessary potential set. We are now ready to define the tree oracle for equality:
Definition 11 (Tree oracle for equality).
For a language L, a prefix u, and a set of symbolic suffixes V, the equality tree O(L, u, V) is constructed as follows:
– If V = {ε}, then O(L, u, {ε}) is the trivial tree with one location l_0 and no registers. It is accepting if the word is accepted, i.e., λ(l_0) = + if u ∈ L, else λ(l_0) = −. To determine whether u ∈ L, the tree oracle performs a membership query on u.
– If V ≠ {ε}, then for each α such that V_α = α⁻¹V is non-empty,
• let I be the necessary potential set (Definition 10),
• O(L, u, V) is constructed as O(L, u, V) = (L, l_0, Γ, λ), where, letting O(L, uα(d_i), V_α) be the tuple (L_i^α, l_i^α, Γ_i^α, λ_i^α) for i ∈ (I ∪ {0}),
∗ L is the disjoint union of all L_i^α plus an additional initial location l_0,
∗ Γ is the union of all Γ_i^α for i ∈ (I ∪ {0}), and in addition the transitions of the form ⟨l_0, α(p), g_i, l_i^α⟩ with i ∈ (I ∪ {0}), where g_i is ⋀_{j∈I} p ≠ x_j for i = 0, and g_i is p = x_i for i ≠ 0, and
∗ λ agrees with each λ_i^α on L_i^α. Moreover, if ε ∈ V, then λ(l_0) = + if u ∈ L, otherwise λ(l_0) = −. Again, to determine whether u ∈ L, the tree oracle performs a membership query for u.

Intuitively, O(L, u, V) is constructed by joining the trees O(L, uα(d_i), V_α) with guard p = x_i for i ∈ I, and the tree O(L, uα(d_u), V_α) with guard ⋀_{i∈I} p ≠ x_i, as children of a new root. Note that while V is a set of symbolic suffixes, RALib technically handles tree queries sequentially, i.e., as sequential tree queries of a prefix u and a symbolic suffix w. Consequently, we treat the set of symbolic suffixes V as a singleton, referred to as 'w'. O(L, u, w) is constructed bottom-up, recursively building new 'roots' at the top with larger and larger symbolic suffixes (and consequently, shorter and shorter prefixes).
The choice of the necessary potential set I plays a crucial role: if I is larger than necessary, O(L, u, w) contains redundant guards (and is hence a 'non-minimal' SDT).

We now have a clear goal for our proof: we must show that the SDT returned by Algorithm 3 is isomorphic to the SDT returned by the tree oracle for equality as defined in Definition 11 (under the assumption that the 'set' of symbolic suffixes V is a singleton). We divide our proof into the following steps:
1. We show that Algorithm 1 produces a characteristic predicate for a tree query (u, w), which contains all the information needed for constructing an equality tree;
2. Next, we show that Algorithm 2 guarantees that, for the potential set I_t of a location l_t of the tainted equality tree T_t, the potential set I of the equivalent location l of the normal equality tree T is a subset of I_t: I ⊆ I_t; and finally,
3. We make the tainted potential set equal to the normal potential set (using Algorithm 3), so that the resulting tainted equality tree is isomorphic to the normal equality tree.
Each of the above steps corresponds to one of our algorithms. We now begin with step 1. From Algorithm 1, we can state the following lemma:

Lemma 1 (Characteristic Predicate).
For a tree query (u, w), Algorithm 1 always produces a characteristic predicate H.

Proof. We recall that, under the test hypothesis, an SUT M is deterministic and has a finite number of logically disjoint branches that can be followed from each state. Algorithm 1 initialises two variables, G := ⊤ and H := ⊥. For each word z = u · w under a valuation ν ⊨ G, we may perform a membership query on M. Each query returns the guard I = ⋀_{i=k+1}^{k+n} constraints_M(z)[i] such that ν ⊨ I, together with the acceptance of the word z in the language of M, i.e., whether z ∈ L(M).

In each iteration of the do-while loop, the variable G is updated with the negation of the previously satisfied guard I, i.e., G := G ∧ ¬I. This guarantees that any new valuation ν will not satisfy I, and hence the next iteration of the do-while loop induces a different run of M. Given that M only has a finite number of logical branches, Algorithm 1 terminates.

We also know that for each tainted word z we obtain the acceptance of z, i.e., whether z ∈ L(M). If z ∈ L(M), the variable H is updated to H ∨ I. Therefore, the predicate H returned by Algorithm 1 is the characteristic predicate for the tree query (u, w). □

After constructing the characteristic predicate, we convert it to a non-minimal SDT using Algorithm 2, providing us with the following lemma:
Lemma 2 (Non-minimal SDT).
For any location l_t of a non-minimal SDT with an equivalent location l of a minimal SDT, the necessary potential set I_t of the non-minimal SDT is a superset of the necessary potential set I of the minimal SDT: I ⊆ I_t ⊆ pot(u), where pot(u) is the potential of the prefix u of locations l_t and l.

Proof. We know that I ⊆ pot(u) by definition of the necessary potential set. For any word w = u · v where the prefix u leads to location l_t of the tainted non-minimal SDT, Algorithm 2 guarantees that the suffixes of u will be classified correctly. If the suffixes are classified correctly, we derive that I_t ⊇ I (otherwise the suffixes would not be classified correctly). Since I_t ⊇ I and I, I_t ⊆ pot(u), we conclude I ⊆ I_t ⊆ pot(u). □

Following Lemma 2, if we wish to make I = I_t, we can simply remove all elements from I_t which do not satisfy the conditions outlined in Definition 10. Since we already know that I ⊆ I_t, we can confirm that after removal of all irrelevant parameters, I = I_t. Algorithm 3 accomplishes exactly this.

Cassel et al. [6] use the concept of representative data values for constructing the SDT, while we treat the values symbolically: a representative data value 'represents' the set of data values that satisfy a guard during construction of the SDT; in our case, we simply let Z3 decide on all the values to use for our membership queries and obtain the guards about them using their taint markers as identifiers.

Theorem 1 (Isomorphism of tree oracles).
The SDTs generated by the tainted tree oracle and the untainted tree oracle for a tree query (u, w) are isomorphic.

Proof. Lemma 1 guarantees that Algorithm 1 returns a characteristic predicate H for the tree query (u, w). Application of Algorithm 2 to H constructs a non-minimal SDT. Using Lemma 2 and Algorithm 3 on the non-minimal SDT, we can conclude that the root locations of the SDTs of the tainted tree oracle and the normal tree oracle have the same necessary potential set. By inductive reasoning on the depth of the trees, the same holds for all sub-trees of both oracles, eventually reducing to the leaves, showing that the SDT of the tainted tree oracle is isomorphic to that of the normal tree oracle. □

Appendix B Detailed Benchmark results
Table 1 contains the full data used to create the plots in Figure 6.
Table 1: Benchmarks

Model | Tree Oracle | EQ Oracle | Learn Symbols (Std. Dev) | Test Symbols (Std. Dev) | Total Symbols (Std. Dev) | Learned
Abp Output | Tainted | Normal | 6.55E+02 (8.33E+01) | 1.57E+05 (1.29E+05) | 1.58E+05 (1.29E+05) | 30/30
Abp Output | Tainted | Tainted | 6.17E+02 (7.78E+01) | 1.68E+04 (1.15E+04) | 1.74E+04 (1.15E+04) | 30/30
Abp Output | Normal | Normal | 6.93E+03 (5.20E+03) | 1.57E+05 (1.29E+05) | 1.64E+05 (1.29E+05) | 30/30
Abp Output | Normal | Tainted | 6.51E+03 (3.97E+03) | 1.68E+04 (1.15E+04) | 2.33E+04 (1.29E+04) | 30/30
Lock 2 | Tainted | Normal | N-A | N-A | N-A | 0/30
Lock 2 | Tainted | Tainted | 7.10E+01 (0.00E+00) | 1.15E+03 (6.76E+02) | 1.22E+03 (6.76E+02) | 30/30
Lock 2 | Normal | Normal | N-A | N-A | N-A | 0/30
Lock 2 | Normal | Tainted | 2.00E+02 (0.00E+00) | 1.15E+03 (6.76E+02) | 1.35E+03 (6.76E+02) | 30/30
Lock 4 | Tainted | Normal | N-A | N-A | N-A | 0/30
Lock 4 | Tainted | Tainted | 2.41E+02 (0.00E+00) | 6.29E+03 (5.52E+03) | 6.53E+03 (5.52E+03) | 30/30
Lock 4 | Normal | Normal | N-A | N-A | N-A | 0/30
Lock 4 | Normal | Tainted | 3.45E+04 (0.00E+00) | 6.29E+03 (5.52E+03) | 4.08E+04 (5.52E+03) | 30/30
Lock 5 | Tainted | Normal | N-A | N-A | N-A | 0/30
Lock 5 | Tainted | Tainted | 3.80E+02 (0.00E+00) | 2.62E+04 (1.45E+04) | 2.66E+04 (1.45E+04) | 30/30
Lock 5 | Normal | Normal | N-A | N-A | N-A | 0/30
Lock 5 | Normal | Tainted | 6.35E+05 (0.00E+00) | 2.62E+04 (1.45E+04) | 6.61E+05 (1.45E+04) | 30/30
Fifo 01 | Tainted | Normal | 2.90E+01 (4.08E+00) | 1.71E+01 (6.12E+00) | 4.62E+01 (6.73E+00) | 30/30
Fifo 01 | Tainted | Tainted | 2.97E+01 (3.83E+00) | 1.38E+01 (3.58E+00) | 4.35E+01 (4.93E+00) | 30/30
Fifo 01 | Normal | Normal | 6.65E+01 (1.84E+01) | 1.71E+01 (6.12E+00) | 8.37E+01 (1.80E+01) | 30/30
Fifo 01 | Normal | Tainted | 7.07E+01 (1.74E+01) | 1.38E+01 (3.58E+00) | 8.46E+01 (1.68E+01) | 30/30
Fifo 02 | Tainted | Normal | 1.16E+02 (3.26E+01) | 6.47E+01 (2.77E+01) | 1.81E+02 (4.28E+01) | 30/30
Fifo 02 | Tainted | Tainted | 1.01E+02 (3.03E+01) | 5.10E+01 (1.55E+01) | 1.52E+02 (3.31E+01) | 30/30
Fifo 02 | Normal | Normal | 3.62E+02 (1.29E+02) | 6.47E+01 (2.77E+01) | 4.27E+02 (1.33E+02) | 30/30
Fifo 02 | Normal | Tainted | 3.50E+02 (1.48E+02) | 5.10E+01 (1.55E+01) | 4.01E+02 (1.49E+02) | 30/30
Fifo 03 | Tainted | Normal | 3.03E+02 (8.53E+01) | 1.34E+02 (5.84E+01) | 4.38E+02 (9.39E+01) | 30/30
Fifo 03 | Tainted | Tainted | 2.93E+02 (8.54E+01) | 1.05E+02 (4.69E+01) | 3.98E+02 (8.07E+01) | 30/30
Fifo 03 | Normal | Normal | 1.64E+03 (9.00E+02) | 1.34E+02 (5.84E+01) | 1.78E+03 (8.82E+02) | 30/30
Fifo 03 | Normal | Tainted | 1.93E+03 (1.34E+03) | 1.05E+02 (4.69E+01) | 2.03E+03 (1.31E+03) | 30/30
Fifo 04 | Tainted | Normal | 6.87E+02 (1.51E+02) | 2.20E+02 (1.11E+02) | 9.06E+02 (2.14E+02) | 30/30
Fifo 04 | Tainted | Tainted | 6.35E+02 (1.41E+02) | 1.62E+02 (7.53E+01) | 7.96E+02 (1.53E+02) | 30/30
Fifo 04 | Normal | Normal | 1.22E+04 (1.22E+04) | 2.20E+02 (1.11E+02) | 1.24E+04 (1.22E+04) | 30/30
Fifo 04 | Normal | Tainted | 1.19E+04 (1.21E+04) | 1.62E+02 (7.53E+01) | 1.20E+04 (1.21E+04) | 30/30
Fifo 05 | Tainted | Normal | 1.23E+03 (3.35E+02) | 3.53E+02 (2.13E+02) | 1.58E+03 (4.49E+02) | 30/30
Fifo 05 | Tainted | Tainted | 1.32E+03 (2.88E+02) | 2.24E+02 (9.79E+01) | 1.54E+03 (3.14E+02) | 29/30
Fifo 05 | Normal | Normal | 1.00E+05 (1.84E+05) | 3.19E+02 (1.67E+02) | 1.01E+05 (1.84E+05) | 25/30
Fifo 05 | Normal | Tainted | 1.28E+05 (2.08E+05) | 2.35E+02 (8.76E+01) | 1.28E+05 (2.08E+05) | 25/30
Repetition | Tainted | Normal | N-A | N-A | N-A | 0/30
Repetition | Tainted | Tainted | 1.22E+02 (0.00E+00) | 7.33E+03 (2.03E+03) | 7.45E+03 (2.03E+03) | 30/30
Repetition | Normal | Normal | N-A | N-A | N-A | 0/30
Repetition | Normal | Tainted | 8.90E+03 (1.99E+03) | 7.33E+03 (2.03E+03) | 1.62E+04 (2.26E+03) | 30/30
Set 01 | Tainted | Normal | 1.45E+02 (1.03E+02) | 1.28E+03 (1.52E+03) | 1.43E+03 (1.52E+03) | 29/30
Set 01 | Tainted | Tainted | 9.75E+01 (3.56E+01) | 1.83E+02 (1.61E+02) | 2.80E+02 (1.56E+02) | 30/30
Set 01 | Normal | Normal | 5.00E+06 (1.73E+07) | 1.28E+03 (1.52E+03) | 5.01E+06 (1.73E+07) | 29/30
Set 01 | Normal | Tainted | 2.96E+03 (6.71E+03) | 1.83E+02 (1.61E+02) | 3.15E+03 (6.69E+03) | 30/30
Set 02 | Tainted | Normal | 1.61E+03 (9.96E+02) | 8.21E+03 (1.26E+04) | 9.82E+03 (1.24E+04) | 28/30
Set 02 | Tainted | Tainted | 1.00E+03 (3.26E+02) | 2.21E+02 (2.14E+02) | 1.23E+03 (3.68E+02) | 29/30
Set 02 | Normal | Normal | 4.61E+06 (1.43E+07) | 8.60E+03 (1.31E+04) | 4.62E+06 (1.43E+07) | 25/30
Set 02 | Normal | Tainted | 4.35E+04 (7.28E+04) | 2.20E+02 (2.10E+02) | 4.37E+04 (7.29E+04) | 30/30
Set 03 | Tainted | Normal | 1.76E+04 (8.71E+03) | 5.01E+03 (9.51E+03) | 2.26E+04 (1.40E+04) | 24/30
Set 03 | Tainted | Tainted | 1.44E+04 (5.05E+03) | 6.91E+02 (8.76E+02) | 1.51E+04 (4.95E+03) | 30/30
Set 03 | Normal | Normal | 5.76E+06 (1.47E+07) | 3.94E+03 (6.48E+03) | 5.76E+06 (1.47E+07) | 14/30
Set 03 | Normal | Tainted | 2.01E+06 (3.60E+06) | 2.23E+02 (2.06E+02) | 2.01E+06 (3.60E+06) | 28/30
Sip 2015 | Tainted | Normal | 2.14E+03 (4.00E+02) | 1.89E+05 (2.60E+05) | 1.92E+05 (2.60E+05) | 10/30