A Decision Procedure for Path Feasibility of String Manipulating Programs with Integer Data Type
Taolue Chen, Matthew Hague, Jinlong He, Denghang Hu, Anthony Widjaja Lin, Philipp Rümmer, and Zhilin Wu
Affiliations: University of Surrey, UK; Royal Holloway, University of London, UK; State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China; Technical University of Kaiserslautern, Germany; Uppsala University, Sweden; University of Chinese Academy of Sciences, China; Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, China; Institute of Intelligent Software, Guangzhou, China
arXiv preprint [cs.LO]
Abstract.
Strings are widely used in programs, especially in web applications. Integer data type occurs naturally in string-manipulating programs, and is frequently used to refer to lengths of, or positions in, strings. Analysis and testing of string-manipulating programs can be formulated as the path feasibility problem: given a symbolic execution path, does there exist an assignment to the inputs that yields a concrete execution that realizes this path? Such a problem can naturally be reformulated as a string constraint solving problem. Although state-of-the-art string constraint solvers usually provide support for both string and integer data types, they mainly resort to heuristics without completeness guarantees.
In this paper, we propose a decision procedure for a class of string-manipulating programs which includes not only a wide range of string operations such as concatenation, replaceAll, reverse, and finite transducers, but also those involving the integer data-type such as length, indexof, and substring. To the best of our knowledge, this represents one of the most expressive string constraint languages that is currently known to be decidable. Our decision procedure is based on a variant of cost register automata. We implement the decision procedure, giving rise to a new solver OSTRICH+. We evaluate the performance of OSTRICH+ on a wide range of existing and new benchmarks. The experimental results show that OSTRICH+ is the first string decision procedure capable of tackling finite transducers and integer constraints, whilst its overall performance is comparable with the state-of-the-art string constraint solvers.

String-manipulating programs are notoriously subtle, and their potential bugs may bring severe security consequences. A typical example is cross-site scripting (XSS), which is among the OWASP Top 10 Application Security Risks [29]. Integer data type occurs naturally and extensively in string-manipulating programs.
An effective and increasingly popular method for identifying bugs, including XSS, is symbolic execution [11]. In a nutshell, this technique analyses static paths through the program being considered. Each of these paths can be viewed as a constraint $\varphi$ over appropriate data domains, and symbolic execution tools demand fast constraint solvers to check the satisfiability of $\varphi$. Such constraint solvers need to support all data-type operations occurring in a program.
Typically, mainstream programming languages provide standard string functions such as concatenation, replace, and replaceAll. Moreover, Web programming languages usually provide complex string operations (e.g. htmlEscape and trim), which are conveniently modelled as finite transducers, to sanitise malicious user inputs [19]. Nevertheless, apart from these operations involving only the string data type, functions such as length, indexOf, and substring, which can convert strings to integers and vice versa, are also heavily used in practice; for instance, it was reported [26] that length, indexOf, substring, and variants thereof, comprise over 80% of string function occurrences in 18 popular JavaScript applications, notably outnumbering concatenation. The introduction of integers exacerbates the intricacy of string-manipulating programs, and poses new theoretical and practical challenges in solver development.
When combining strings and integers, decidability can easily be lost; for instance, the string theory with concatenation and letter counting functions is undecidable [8,15]. Remarkably, it is still a major open problem whether the string theory with concatenation (arguably the simplest string operation) and length function (arguably the most common string-number function) is decidable [17,22]. One promising approach to retain decidability is to enforce a syntactic restriction to the constraints.
In the literature, these restrictions include solved forms [17], acyclicity [5,2,3], and the straight-line fragment (aka programs in single static assignment form) [21,13,14,18]. On the one hand, such a restriction has led to decidability of string constraint solving with complex string operations (not only concatenation, but also finite transducers) and integer operations (letter-counting, length, indexOf, etc.); see, e.g., [21]. On the other hand, there is a lot of evidence (e.g. from benchmarks) that many practical string constraints do satisfy such syntactic restrictions.
Approaches to building practical string solvers could essentially be classified into two categories. Firstly, one could support as many constraints as possible, but primarily resort to heuristics, offering no completeness/termination guarantee. This is a realistic approach since, as mentioned above, the problem involving both string and integer data types is in general undecidable. Many solvers belong to this category, e.g., CVC4 [20], Z3 [7,16], Z3-str3 [6], S3(P) [27,28], Trau [1] (or its variants Trau+ [3] and Z3-Trau [9]), ABC [10], and Slent [32]. Completeness guarantees are, however, valuable since the performance of heuristics can be difficult to predict. The second approach is to develop solvers for decidable fragments supporting both strings and integers (e.g. [17,5,2,3,21,13,14,18]). Solvers in this category include Norn [2], SLOTH [18], and OSTRICH [14]. The fragment without complex string operations (e.g. without replaceAll and finite transducers, but with length) can be handled quite well by Norn. The fragment without length constraints (but with replaceAll and finite transducers) can be handled effectively by OSTRICH and SLOTH. Moreover, most existing solvers that belong to the first category do not support complex string operations like replaceAll and finite transducers either.
This motivates the following problem: provide a decision procedure that supports both string and integer data type, with completeness guarantee and meanwhile admitting efficient implementation.
We argue that this problem is highly challenging. A deeper examination of the algorithms used by OSTRICH and SLOTH reveals that, unlike the case for Norn, it would not be straightforward to extend OSTRICH and SLOTH with integer constraints. First and foremost, the fragment handled by Norn (i.e. without transducers and replaceAll) is solvable in exponential time, even in the presence of integer constraints. This is not the case for the straight-line fragments with transducers/replaceAll, which require at least double exponential time (regardless of the integer constraints). This unfortunately manifests itself in the size of symbolic representations of the solutions. SLOTH [18] computes a representation of all solutions "eagerly" as (alternating) finite transducers. Dealing with the integer data type requires computing the Parikh images of these transducers [21], which would result in a quantifier-free linear integer arithmetic formula (LIA for short) of double exponential size, thus giving us a triple exponential time algorithm, since LIA formulas are solved in exponential time (see e.g. [30]). Lin and Barcelo [21] provided a double exponential upper bound on the length of the strings in the solution, and showed that the double exponential time theoretical complexity could be retained. This, however, does not result in a practical algorithm since it requires all strings of double exponential size to be enumerated. OSTRICH [14] adopted a "lazy" approach and computed the pre-images of regular languages step by step, which is more scalable than the "eager" approach adopted by SLOTH and results in a highly competitive solver. It uses recognisable relations (a finite union of products of regular languages) as symbolic representations.
Nevertheless, extending this approach to integer constraints is not obvious since integer constraints break the independence between different string variables in the recognisable relations.

Contribution.
We provide a decision procedure for an expressive class of string constraints involving the integer data type, which includes not only concatenation, replace/replaceAll, reverse, finite transducers, and regular constraints, but also length, indexOf and substring. The decision procedure utilizes a variant of cost-register automata introduced by Alur et al. [4], which are called cost-enriched finite automata (CEFA) for convenience. Intuitively, each CEFA records the connection between a string variable and its associated integer variables. With CEFAs, the concept of recognisable relations is then naturally extended to accommodate integers. The integer constraints, however, are detached from CEFAs rather than being part of CEFAs. This allows us to preserve the independence of string variables in the recognisable relation. The crux of the decision procedure is to compute the backward images of CEFAs under string functions, where each cost register (integer variable) might be split into several ones, thus extending, but still in the same flavour as, OSTRICH for string constraints without the integer data type [14]. Such an approach is able to treat a wide range of string functions in a generic, and yet simple, way. To the best of our knowledge, the class of string constraints considered in this paper is currently one of the most expressive string theories involving the integer data type known to enjoy a decision procedure.
We implement the decision procedure based on the recent OSTRICH solver [14], resulting in OSTRICH+. We perform experiments on a wide range of benchmark suites, including those where both replace/replaceAll/finite transducers and length/indexOf/substring occur, as well as the well-known benchmarks Kaluza and PyEx.
The results show that 1) OSTRICH+ so far is the only string constraint solver capable of dealing with finite transducers and integer constraints, and 2) its overall performance is comparable with the best state-of-the-art string constraint solvers (e.g. CVC4 and Z3-Trau) which are short of completeness guarantees.
The rest of the paper is structured as follows: Section 2 introduces the preliminaries. Section 3 defines the class of string-manipulating programs with integer data type. Section 4 presents the decision procedure. Section 5 presents the benchmarks and experiments for the evaluation. The paper is concluded in Section 6. Missing proofs, implementation details and further examples can be found in the appendix.

We write $\mathbb{N}$ and $\mathbb{Z}$ for the sets of natural and integer numbers, respectively. For $n \in \mathbb{N}$ with $n \ge 1$, $[n]$ denotes $\{1, \ldots, n\}$; for $m, n \in \mathbb{N}$ with $m \le n$, $[m, n]$ denotes $\{i \in \mathbb{N} \mid m \le i \le n\}$. Throughout the paper, $\Sigma$ is a finite alphabet, ranged over by $a, b, \ldots$.

Strings, languages, and transductions.
A string over $\Sigma$ is a (possibly empty) sequence of elements from $\Sigma$, denoted by $u, v, w, \ldots$. An empty string is denoted by $\varepsilon$. We write $\Sigma^*$ (resp., $\Sigma^+$) for the set of all (resp. nonempty) strings over $\Sigma$. For a string $u$, we use $|u|$ to denote the number of letters in $u$. In particular, $|\varepsilon| = 0$. Moreover, for $a \in \Sigma$, let $|u|_a$ denote the number of occurrences of $a$ in $u$. Assume $u = a_0 \cdots a_{n-1}$ is nonempty and $i < j \in [0, n-1]$. We let $u[i]$ denote $a_i$ and $u[i, j]$ denote the substring $a_i \cdots a_j$.
Let $u, v$ be two strings. We use $u \cdot v$ to denote the concatenation of $u$ and $v$. The string $u$ is said to be a prefix of $v$ if $v = u \cdot v'$ for some string $v'$. In addition, if $u \neq v$, then $u$ is said to be a strict prefix of $v$. If $v = u \cdot v'$ for some string $v'$, then we use $u^{-1} v$ to denote $v'$. In particular, $\varepsilon^{-1} v = v$. If $u = a_0 \cdots a_{n-1}$ is nonempty, then we use $u^{(r)}$ to denote the reverse of $u$, that is, $u^{(r)} = a_{n-1} \cdots a_0$.
A transduction over $\Sigma$ is a binary relation over $\Sigma^*$, namely, a subset of $\Sigma^* \times \Sigma^*$. We will use $T, T_1, \ldots$ to denote transductions. For two transductions $T_1$ and $T_2$, we will use $T_1 \cdot T_2$ to denote the composition of $T_1$ and $T_2$, namely, $T_1 \cdot T_2 = \{(u, w) \in \Sigma^* \times \Sigma^* \mid \text{there exists } v \in \Sigma^* \text{ s.t. } (u, v) \in T_1 \text{ and } (v, w) \in T_2\}$.

Recognisable relations.
We assume familiarity with regular languages. Recall that a regular language $L$ can be represented by a regular expression $e \in \mathsf{RegExp}$, whereby we usually write $L = \mathcal{L}(e)$.
Intuitively, a recognisable relation is simply a finite union of Cartesian products of regular languages. Formally, an $r$-ary relation $R \subseteq \Sigma^* \times \cdots \times \Sigma^*$ is recognisable if $R = \bigcup_{i=1}^{n} L^{(i)}_1 \times \cdots \times L^{(i)}_r$ where $L^{(i)}_j$ is regular for each $j \in [r]$. A representation of a recognisable relation $R = \bigcup_{i=1}^{n} L^{(i)}_1 \times \cdots \times L^{(i)}_r$ is $(\mathcal{A}^{(i)}_1, \ldots, \mathcal{A}^{(i)}_r)_{1 \le i \le n}$ such that each $\mathcal{A}^{(i)}_j$ is an NFA with $\mathcal{L}(\mathcal{A}^{(i)}_j) = L^{(i)}_j$. The tuples $(\mathcal{A}^{(i)}_1, \ldots, \mathcal{A}^{(i)}_r)$ are called the disjuncts of the representation and the NFAs $\mathcal{A}^{(i)}_j$ are called the atoms of the representation.

Automata models. A (nondeterministic) finite automaton (NFA) is a tuple $\mathcal{A} = (Q, \Sigma, \delta, I, F)$, where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation, and $I, F \subseteq Q$ are the sets of initial and final states respectively. For readability, we write a transition $(q, a, q') \in \delta$ as $q \xrightarrow{a}_{\delta} q'$ (or simply $q \xrightarrow{a} q'$). The size of an NFA $\mathcal{A}$, denoted by $|\mathcal{A}|$, is defined as the number of transitions of $\mathcal{A}$. A run of $\mathcal{A}$ on a string $w = a_1 \cdots a_n$ is a sequence of transitions $q_0 \xrightarrow{a_1} q_1 \cdots q_{n-1} \xrightarrow{a_n} q_n$ with $q_0 \in I$. The run is accepting if $q_n \in F$. A string $w$ is accepted by an NFA $\mathcal{A}$ if there is an accepting run of $\mathcal{A}$ on $w$. In particular, the empty string $\varepsilon$ is accepted by $\mathcal{A}$ if $I \cap F \neq \emptyset$. The language of $\mathcal{A}$, denoted by $\mathcal{L}(\mathcal{A})$, is the set of strings accepted by $\mathcal{A}$. An NFA $\mathcal{A}$ is said to be deterministic if $I$ is a singleton and, for every $q \in Q$ and $a \in \Sigma$, there is at most one state $q' \in Q$ such that $(q, a, q') \in \delta$. It is well-known that finite automata capture regular languages precisely.
A nondeterministic finite transducer (NFT) $\mathcal{T}$ is an extension of NFA with outputs. Formally, an NFT $\mathcal{T}$ is a tuple $(Q, \Sigma, \delta, I, F)$, where $Q, \Sigma, I, F$ are as in NFA and the transition relation $\delta$ is a finite subset of $Q \times \Sigma \times Q \times \Sigma^*$. Similarly to NFA, for readability, we write a transition $(q, a, q', u) \in \delta$ as $q \xrightarrow{a, u}_{\delta} q'$ or $q \xrightarrow{a, u} q'$. The size of an NFT $\mathcal{T}$, denoted by $|\mathcal{T}|$, is defined as the sum of the sizes of the transitions of $\mathcal{T}$, where the size of a transition $q \xrightarrow{a, u} q'$ is defined as $|u| + 3$. A run of $\mathcal{T}$ over a string $w = a_1 \cdots a_n$ is a sequence of transitions $q_0 \xrightarrow{a_1, u_1} q_1 \cdots q_{n-1} \xrightarrow{a_n, u_n} q_n$ with $q_0 \in I$. The run is accepting if $q_n \in F$. The string $u_1 \cdots u_n$ is called the output of the run. The transduction defined by $\mathcal{T}$, denoted by $T(\mathcal{T})$, is the set of string pairs $(w, u)$ such that there is an accepting run of $\mathcal{T}$ on $w$, with the output $u$. An NFT $\mathcal{T}$ is said to be deterministic if $I$ is a singleton, and, for every $q \in Q$ and $a \in \Sigma$ there is at most one pair $(q', u) \in Q \times \Sigma^*$ such that $(q, a, q', u) \in \delta$. In this paper, we are primarily interested in functional finite transducers (FFT), i.e., finite transducers that define functions instead of relations. (For instance, deterministic finite transducers are always functional.)
We will also use standard quantifier-free/existential linear integer arithmetic (LIA) formulae, which are typically ranged over by $\phi, \varphi$, etc.

In this paper, we consider logics involving two data-types, i.e., the string data-type and the integer data-type. As a convention, $u, v, \ldots$ denote string constants, $c, d, \ldots$ denote integer constants, $x, y, \ldots$ denote string variables, and $i, j, \ldots$ denote integer variables.
We consider symbolic execution of string-manipulating programs with numeric conditions (abbreviated as $\mathrm{SL}_{\mathrm{int}}$), defined by the following rules,
$$S ::= x := y \cdot z \mid x := \mathsf{replaceAll}_{e,u}(y) \mid x := \mathsf{reverse}(y) \mid x := \mathcal{T}(y) \mid x := \mathsf{substring}(y, t_1, t_2) \mid \mathsf{assert}(\varphi) \mid S; S,$$
$$\varphi ::= x \in \mathcal{A} \mid t_1 \mathrel{o} t_2 \mid \varphi \vee \varphi \mid \varphi \wedge \varphi,$$
where $e$ is a regular expression over $\Sigma$, $u \in \Sigma^*$, $\mathcal{T}$ is an FFT, $\mathcal{A}$ is an NFA, $o \in \{=, \neq, \ge, \le, >, <\}$, and $t_1, t_2$ are integer terms defined by the following rules,
$$t ::= i \mid c \mid \mathsf{length}(x) \mid \mathsf{indexOf}_v(x, i) \mid c\,t \mid t + t,$$
where $c \in \mathbb{Z}$, $v \in \Sigma^+$. We require that the string-manipulating programs are in single static assignment (SSA) form. Note that SSA form imposes restrictions only on the assignment statements, but not on the assertions. A string variable $x$ in an $\mathrm{SL}_{\mathrm{int}}$ program $S$ is called an input string variable of $S$ if it does not appear on the left-hand side of the assignment statements of $S$. A variable in $S$ is called an input variable if it is either an input string variable or an integer variable.

Semantics.
The semantics of $\mathrm{SL}_{\mathrm{int}}$ is explained as follows.
– The assignment $x := y \cdot z$ denotes that $x$ is the concatenation of two strings $y$ and $z$.
– The assignment $x := \mathsf{replaceAll}_{e,u}(y)$ denotes that $x$ is the string obtained by replacing all occurrences of $e$ in $y$ with $u$, where the leftmost and longest matching of $e$ is used. For instance, $\mathsf{replaceAll}_{(ab)^+, c}(aababaab) = ac \cdot \mathsf{replaceAll}_{(ab)^+, c}(aab) = acac$, since the leftmost and longest matching of $(ab)^+$ in $aababaab$ is $abab$. Here we require that the language defined by $e$ does not contain the empty string, in order to avoid the troublesome definition of the semantics of the matching of the empty string. The formal semantics of the replaceAll function can be found in [13].
– The assignment $x := \mathsf{reverse}(y)$ denotes that $x$ is the reverse of $y$.
– The assignment $x := \mathcal{T}(y)$ denotes that $(y, x) \in T(\mathcal{T})$.
– The assignment $x := \mathsf{substring}(y, t_1, t_2)$ denotes that $x$ is equal to the return value of $\mathsf{substring}(y, t_1, t_2)$, where
$$\mathsf{substring}(y, t_1, t_2) = \begin{cases} \varepsilon & \text{if } t_1 < 0 \vee t_1 \ge |y| \vee t_2 = 0, \\ y[t_1, \min\{t_1 + t_2 - 1, |y| - 1\}] & \text{otherwise.} \end{cases}$$
For instance, $\mathsf{substring}(abaab, -1, 1) = \varepsilon$, $\mathsf{substring}(abaab, 3, 0) = \varepsilon$, $\mathsf{substring}(abaab, 3, 2) = ab$, and $\mathsf{substring}(abaab, 3, 3) = ab$.
– The conditional statement $\mathsf{assert}(x \in \mathcal{A})$ denotes that $x$ belongs to $\mathcal{L}(\mathcal{A})$.
– The conditional statement $\mathsf{assert}(t_1 \mathrel{o} t_2)$ denotes that the value of $t_1$ is equal to (not equal to, ...) that of $t_2$, if $o \in \{=, \neq, \ge, >, \le, <\}$.
– The integer term $\mathsf{length}(x)$ denotes the length of $x$.
– The function $\mathsf{indexOf}_v(x, i)$ returns the starting position of the first occurrence of $v$ in $x$ after the position $i$, if such an occurrence exists, and $-1$ otherwise. If $i < 0$, then $\mathsf{indexOf}_v(x, i)$ returns $\mathsf{indexOf}_v(x, 0)$, and if $i \ge \mathsf{length}(x)$, then $\mathsf{indexOf}_v(x, i)$ returns $-1$. For instance, $\mathsf{indexOf}_{ab}(aaba, -1) = \mathsf{indexOf}_{ab}(aaba, 0) = 1$, $\mathsf{indexOf}_{ab}(aaba, 2) = -1$, and $\mathsf{indexOf}_{ab}(aaba, 4) = -1$.

Path feasibility problem.
Given an $\mathrm{SL}_{\mathrm{int}}$ program $S$, decide whether there are valuations of the input variables so that $S$ can execute to the end.

In this section, we present a decision procedure for the path feasibility problem of $\mathrm{SL}_{\mathrm{int}}$. A distinguished feature of the decision procedure is that it conducts backward computation which is lazy and can be done in a modular way. To support this, we extend a regular language with quantitative information of the strings in the language, giving rise to cost-enriched regular languages and corresponding finite automata (Section 4.1). The crux of the decision procedure is thus to show that the pre-images of cost-enriched regular languages under the string operations in $\mathrm{SL}_{\mathrm{int}}$ (i.e., concatenation $\cdot$, $\mathsf{replaceAll}_{e,u}$, $\mathsf{reverse}$, FFTs $\mathcal{T}$, and $\mathsf{substring}$) are representable by so-called cost-enriched recognisable relations (Section 4.2). The overall decision procedure is presented in Section 4.3, supplied by additional complexity analysis.

4.1 Cost-Enriched Regular Languages and Recognisable Relations

Let $k \in \mathbb{N}$ with $k > 0$. A $k$-cost-enriched string is $(w, (n_1, \cdots, n_k))$ where $w$ is a string and $n_i \in \mathbb{Z}$ for all $i \in [k]$. A $k$-cost-enriched language $L$ is a subset of $\Sigma^* \times \mathbb{Z}^k$. For our purpose, we identify a "regular" fragment of cost-enriched languages as follows.

Definition 1 (Cost-enriched regular languages).
Let $k \in \mathbb{N}$ with $k > 0$. A $k$-cost-enriched language is regular (abbreviated as CERL) if it can be accepted by a cost-enriched finite automaton.
A cost-enriched finite automaton (CEFA) $\mathcal{A}$ is a tuple $(Q, \Sigma, R, \delta, I, F)$ where
– $Q, \Sigma, I, F$ are defined as in NFAs,
– $R = (r_1, \cdots, r_k)$ is a vector of (mutually distinct) cost registers,
– $\delta$ is the transition relation, which is a finite set of tuples $(q, a, q', \eta)$ where $q, q' \in Q$, $a \in \Sigma$, and $\eta : R \to \mathbb{Z}$ is a cost register update function.
For convenience, we usually write $(q, a, q', \eta) \in \delta$ as $q \xrightarrow{a, \eta} q'$.
A run of $\mathcal{A}$ on a $k$-cost-enriched string $(a_1 \cdots a_m, (n_1, \cdots, n_k))$ is a transition sequence $q_0 \xrightarrow{a_1, \eta_1} q_1 \cdots q_{m-1} \xrightarrow{a_m, \eta_m} q_m$ such that $q_0 \in I$ and $n_i = \sum_{1 \le j \le m} \eta_j(r_i)$ for each $i \in [k]$ (note that the initial values of cost registers are zero). The run is accepting if $q_m \in F$. A $k$-cost-enriched string $(w, (n_1, \cdots, n_k))$ is accepted by $\mathcal{A}$ if there is an accepting run of $\mathcal{A}$ on $(w, (n_1, \cdots, n_k))$. In particular, $(\varepsilon, n)$ is accepted by $\mathcal{A}$ if $n = (0, \cdots, 0)$ and $I \cap F \neq \emptyset$. The $k$-cost-enriched language defined by $\mathcal{A}$, denoted by $\mathcal{L}(\mathcal{A})$, is the set of $k$-cost-enriched strings accepted by $\mathcal{A}$.
The size of a CEFA $\mathcal{A} = (Q, \Sigma, R, \delta, I, F)$, denoted by $|\mathcal{A}|$, is defined as the sum of the sizes of its transitions, where the size of each transition $(q, a, q', \eta)$ is $\sum_{r \in R} \lceil \log_2(|\eta(r)|) \rceil + 3$. Note here the integer constants in $\mathcal{A}$ are encoded in binary.

Remark 1.
CEFAs can be seen as a variant of Cost Register Automata [4], by admitting nondeterminism and discarding partial final cost functions. CEFAs are also closely related to monotonic counter machines [21]. The main difference is that CEFAs discard guards in transitions and allow binary-encoded integers in cost updates, while monotonic counter machines allow guards in transitions but restrict the cost updates to being monotonic and unary, i.e. $0, 1$ only.

Example 1 (CEFA for length). The string function $\mathsf{length}$ can be captured by CEFAs. For any NFA $\mathcal{A} = (Q, \Sigma, \delta, I, F)$, it is not difficult to see that the cost-enriched language $\{(w, \mathsf{length}(w)) \mid w \in \mathcal{L}(\mathcal{A})\}$ is accepted by a CEFA, i.e., $(Q, \Sigma, (r_1), \delta', I, F)$ such that for each $(q, a, q') \in \delta$, we let $(q, a, q', \eta) \in \delta'$, where $\eta(r_1) = 1$. In particular, let $\mathcal{A}_{\mathsf{len}} = (\{q_0\}, \Sigma, (r_1), \{(q_0, a, q_0, \eta) \mid a \in \Sigma, \eta(r_1) = 1\}, \{q_0\}, \{q_0\})$. In other words, $\mathcal{A}_{\mathsf{len}}$ accepts $\{(w, \mathsf{length}(w)) \mid w \in \Sigma^*\}$.
We can show that the function $\mathsf{indexOf}_v(\cdot, \cdot)$ can be captured by a CEFA as well, in the sense that, for any NFA $\mathcal{A}$ and constant string $v$, we can construct a CEFA $\mathcal{A}_{\mathsf{indexOf}_v}$ accepting $\{(w, (n, \mathsf{indexOf}_v(w, n))) \mid w \in \mathcal{L}(\mathcal{A}), n \le \mathsf{indexOf}_v(w, n) < |w|\}$.
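To make Definition 1 and Example 1 concrete, the following Python sketch represents a CEFA explicitly and computes, for a given string, every cost vector with which that string is accepted. It is an illustration only: the class `CEFA`, the method `costs`, and the helpers `length_cefa` and `a_len` are hypothetical names introduced here, and update functions are assumed to be given as dictionaries in which a missing register stands for the update 0.

```python
class CEFA:
    """Cost-enriched finite automaton; Q and Sigma are left implicit."""

    def __init__(self, registers, transitions, initial, final):
        self.registers = tuple(registers)     # R = (r_1, ..., r_k)
        self.transitions = list(transitions)  # tuples (q, a, q2, eta)
        self.initial = set(initial)           # I
        self.final = set(final)               # F

    def costs(self, word):
        """Return all vectors n such that (word, n) is accepted."""
        zero = tuple(0 for _ in self.registers)  # registers start at zero
        configs = {(q, zero) for q in self.initial}
        for letter in word:
            successors = set()
            for state, vec in configs:
                for q, a, q2, eta in self.transitions:
                    if q == state and a == letter:
                        # eta maps registers to integer increments;
                        # a register missing from the dict is updated by 0.
                        new_vec = tuple(n + eta.get(r, 0)
                                        for n, r in zip(vec, self.registers))
                        successors.add((q2, new_vec))
            configs = successors
        return {vec for state, vec in configs if state in self.final}


def length_cefa(nfa_transitions, initial, final, register="r1"):
    """Example 1: decorate every NFA transition with the update r1 += 1."""
    delta = [(q, a, q2, {register: 1}) for q, a, q2 in nfa_transitions]
    return CEFA([register], delta, initial, final)


def a_len(alphabet):
    """A_len: a single state with one self-loop per letter."""
    return length_cefa([("q0", a, "q0") for a in alphabet], {"q0"}, {"q0"})


if __name__ == "__main__":
    print(a_len("ab").costs("abba"))
    # An NFA for (ab)*, enriched with the length register.
    ab_star = length_cefa([(0, "a", 1), (1, "b", 0)], {0}, {0})
    print(ab_star.costs("abab"), ab_star.costs("aba"))
```

Tracking sets of (state, vector) pairs is only practical for short strings; it is meant to mirror the definition of a run, not to be an efficient membership test.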
The construction of $\mathcal{A}_{\mathsf{indexOf}_v}$ is slightly technical and can be found in Appendix B.
Note that $\mathcal{A}_{\mathsf{indexOf}_v}$ does not model the corner cases in the semantics of $\mathsf{indexOf}_v$, for instance, $\mathsf{indexOf}_v(w, n) = -1$ if $v$ does not occur after the position $n$ in $w$.
Given two CEFAs $\mathcal{A}_1 = (Q_1, \Sigma, R_1, \delta_1, I_1, F_1)$ and $\mathcal{A}_2 = (Q_2, \Sigma, R_2, \delta_2, I_2, F_2)$ with $R_1 \cap R_2 = \emptyset$, the product of $\mathcal{A}_1$ and $\mathcal{A}_2$, denoted by $\mathcal{A}_1 \times \mathcal{A}_2$, is defined as $(Q_1 \times Q_2, \Sigma, R_1 \cup R_2, \delta, I_1 \times I_2, F_1 \times F_2)$, where $\delta$ comprises the tuples $((q_1, q_2), \sigma, (q_1', q_2'), \eta)$ such that $(q_1, \sigma, q_1', \eta_1) \in \delta_1$, $(q_2, \sigma, q_2', \eta_2) \in \delta_2$, and $\eta = \eta_1 \cup \eta_2$.
For a CEFA $\mathcal{A}$, we use $R(\mathcal{A})$ to denote the vector of cost registers occurring in $\mathcal{A}$. Suppose $\mathcal{A}$ is a CEFA with $R(\mathcal{A}) = (r_1, \cdots, r_k)$ and $i = (i_1, \cdots, i_k)$ is a vector of mutually distinct integer variables such that $R(\mathcal{A}) \cap i = \emptyset$. We use $\mathcal{A}[i / R(\mathcal{A})]$ to denote the CEFA obtained from $\mathcal{A}$ by simultaneously replacing $r_j$ with $i_j$ for $j \in [k]$.

Definition 2 (Cost-enriched recognisable relations).
Let $(k_1, \cdots, k_l) \in \mathbb{N}^l$ with $k_j > 0$ for every $j \in [l]$. A cost-enriched recognisable relation (CERR) $\mathcal{R} \subseteq (\Sigma^* \times \mathbb{Z}^{k_1}) \times \cdots \times (\Sigma^* \times \mathbb{Z}^{k_l})$ is a finite union of products of CERLs. Formally, $\mathcal{R} = \bigcup_{i=1}^{n} L_{i,1} \times \cdots \times L_{i,l}$, where for every $j \in [l]$, $L_{i,j} \subseteq \Sigma^* \times \mathbb{Z}^{k_j}$ is a CERL. A CEFA representation of $\mathcal{R}$ is a collection of CEFA tuples $(\mathcal{A}_{i,1}, \cdots, \mathcal{A}_{i,l})_{i \in [n]}$ such that $\mathcal{L}(\mathcal{A}_{i,j}) = L_{i,j}$ for every $i \in [n]$ and $j \in [l]$.

To unify the presentation, we consider string functions $f : (\Sigma^* \times \mathbb{Z}^{k_1}) \times \cdots \times (\Sigma^* \times \mathbb{Z}^{k_l}) \to \Sigma^*$. (If there is no integer input parameter, then $k_1, \cdots, k_l$ are zero.)

Definition 3 (Cost-enriched pre-images of CERLs).
Suppose that $f : (\Sigma^* \times \mathbb{Z}^{k_1}) \times \cdots \times (\Sigma^* \times \mathbb{Z}^{k_l}) \to \Sigma^*$ is a string function, and $L \subseteq \Sigma^* \times \mathbb{Z}^{k_0}$ is a CERL defined by a CEFA $\mathcal{A} = (Q, \Sigma, R, \delta, I, F)$ with $R = (r_1, \cdots, r_{k_0})$. Then the $R$-cost-enriched pre-image of $L$ under $f$, denoted by $f^{-1}_R(L)$, is a pair $(\mathcal{R}, t)$ such that
– $\mathcal{R} \subseteq (\Sigma^* \times \mathbb{Z}^{k_1 + k_0}) \times \cdots \times (\Sigma^* \times \mathbb{Z}^{k_l + k_0})$;
– $t = (t_1, \cdots, t_{k_0})$ is a vector of linear integer terms where for each $i \in [k_0]$, $t_i$ is a term whose variables are from $\{r^{(1)}_i, \cdots, r^{(l)}_i\}$, which are fresh cost registers and are disjoint from $R$ in $\mathcal{A}$;
– $L$ is equal to the language comprising the $k_0$-cost-enriched strings
$$\left(w,\ t_1\left[d^{(1)}_1 / r^{(1)}_1, \cdots, d^{(l)}_1 / r^{(l)}_1\right], \cdots, t_{k_0}\left[d^{(1)}_{k_0} / r^{(1)}_{k_0}, \cdots, d^{(l)}_{k_0} / r^{(l)}_{k_0}\right]\right),$$
such that $w = f((w_1, c_1), \cdots, (w_l, c_l))$ for some $((w_1, (c_1, d_1)), \cdots, (w_l, (c_l, d_l))) \in \mathcal{R}$, where $c_j \in \mathbb{Z}^{k_j}$ and $d_j = (d^{(j)}_1, \cdots, d^{(j)}_{k_0}) \in \mathbb{Z}^{k_0}$ for $j \in [l]$.
The $R$-cost-enriched pre-image of $L$ under $f$, say $f^{-1}_R(L) = (\mathcal{R}, t)$, is said to be CERR-definable if $\mathcal{R}$ is a CERR.

[...] effective representation of a CERR-definable $f^{-1}_R(L) = (\mathcal{R}, t)$ in terms of CEFAs. Namely, a CEFA representation of $(\mathcal{R}, t)$ (where $t_j$ is over $\{r^{(1)}_j, \cdots, r^{(l)}_j\}$ for $j \in [k_0]$) is a tuple $((\mathcal{A}_{i,1}, \cdots, \mathcal{A}_{i,l})_{i \in [n]}, t)$ such that $(\mathcal{A}_{i,1}, \cdots, \mathcal{A}_{i,l})_{i \in [n]}$ is a CEFA representation of $\mathcal{R}$, where $R(\mathcal{A}_{i,j}) = (r'_{j,1}, \cdots, r'_{j,k_j}, r^{(j)}_1, \cdots, r^{(j)}_{k_0})$ for each $i \in [n]$ and $j \in [l]$. (The cost registers $r'_{1,1}, \cdots, r'_{1,k_1}, \cdots, r'_{l,1}, \cdots, r'_{l,k_l}$ are mutually distinct and freshly introduced.)

Example 2 ($\mathsf{substring}^{-1}_R(L)$). Let $\Sigma = \{a\}$ and $L = \{(w, |w|) \mid w \in \mathcal{L}((aa)^*)\}$. Evidently $L$ is a CERL defined by a CEFA $\mathcal{A} = (Q, \Sigma, R, \delta, \{q_0\}, \{q_0\})$ with $Q = \{q_0, q_1\}$, $R = (r_1)$ and $\delta = \{(q_0, a, q_1, \eta), (q_1, a, q_0, \eta)\}$, where $\eta(r_1) = 1$.
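Before the pre-image is described symbolically, it may help to run this example on concrete inputs. The sketch below implements $\mathsf{substring}$ following the case split in Section 3, tests membership in $L$ directly, and enumerates the in-bounds arguments $(n_1, n_2)$ for which $\mathsf{substring}(a^m, n_1, n_2)$ lands in $L$. The function names `substring`, `cost_in_L` and `preimage_points` are hypothetical, and treating a negative length argument as yielding the empty string is an assumption made here, since the definition does not single out that case.

```python
def substring(y, t1, t2):
    """substring(y, t1, t2) following the case split in Section 3."""
    if t1 < 0 or t1 >= len(y) or t2 == 0:
        return ""
    end = min(t1 + t2 - 1, len(y) - 1)  # inclusive end position
    # Assumption: a negative t2 (so that end < t1) yields the empty string.
    return y[t1:end + 1] if end >= t1 else ""


def cost_in_L(u):
    """Return |u| if (u, |u|) is in L = {(w, |w|) : w in (aa)*}, else None."""
    if set(u) <= {"a"} and len(u) % 2 == 0:
        return len(u)
    return None


def preimage_points(max_len):
    """Collect all (m, n1, n2, cost) with w = a^m, m <= max_len,
    n1 >= 0, n2 >= 0, n1 + n2 <= m, and (substring(w, n1, n2), cost) in L."""
    points = set()
    for m in range(max_len + 1):
        for n1 in range(m + 1):
            for n2 in range(m - n1 + 1):
                cost = cost_in_L(substring("a" * m, n1, n2))
                if cost is not None:
                    points.add((m, n1, n2, cost))
    return points


if __name__ == "__main__":
    for point in sorted(preimage_points(3)):
        print(point)
```

On these small instances, the enumerated points are exactly those whose length argument is even, and the recorded cost coincides with that argument.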
Since $\mathsf{substring}$ is from $\Sigma^* \times \mathbb{Z}^2$ to $\Sigma^*$, $\mathsf{substring}^{-1}_R(L)$, the $R$-cost-enriched pre-image of $L$ under $\mathsf{substring}$, is the pair $(\mathcal{R}, t)$, where $t = r^{(1)}_1$ (note that in this case $l = 1$, $k_0 = 1$, and $k_1 = 2$) and
$$\mathcal{R} = \{(w, n_1, n_2, n_2) \mid w \in \mathcal{L}(a^*),\ n_1 \ge 0,\ n_2 \ge 0,\ n_1 + n_2 \le |w|,\ n_2 \text{ is even}\},$$
which is represented by $(\mathcal{A}', t)$ such that $\mathcal{A}' = (Q', \Sigma, R', \delta', I', F')$, where
– $Q' = Q \times \{p_0, p_1, p_2\}$ (intuitively, $p_0$, $p_1$, and $p_2$ denote that the current position is before the starting position, between the starting position and ending position, and after the ending position of the substring respectively),
– $R' = (r'_{1,1}, r'_{1,2}, r^{(1)}_1)$,
– $I' = \{(q_0, p_0)\}$, $F' = \{(q_0, p_2), (q_0, p_0)\}$ (where $(q_0, p_0)$ is used to accept the 3-cost-enriched strings $(w, n_1, 0, 0)$ with $0 \le n_1 \le |w|$), and
– $\delta'$ comprises six transitions of the form $(q, p) \xrightarrow{a, \eta_i} (q', p')$ with $q, q' \in Q$, $p, p' \in \{p_0, p_1, p_2\}$ and $i \in \{1, 2, 3\}$, where $\eta_1(r'_{1,1}) = 1$, $\eta_1(r'_{1,2}) = 0$, $\eta_1(r^{(1)}_1) = 0$, $\eta_2(r'_{1,1}) = 0$, $\eta_2(r'_{1,2}) = 1$, and $\eta_2(r^{(1)}_1) = 1$, $\eta_3(r'_{1,1}) = 0$, $\eta_3(r'_{1,2}) = 0$, and $\eta_3(r^{(1)}_1) = 0$. Intuitively, $\eta_1$ counts the letters before the starting position, $\eta_2$ counts the letters of the substring (in both $r'_{1,2}$ and $r^{(1)}_1$), and $\eta_3$ leaves all registers unchanged.
Hence $\mathsf{substring}^{-1}_R(L)$ is CERR-definable.
It turns out that for each string function $f$ in the assignment statements of $\mathrm{SL}_{\mathrm{int}}$, the cost-enriched pre-images of CERLs under $f$ are CERR-definable.

Proposition 1.
Let $L$ be a CERL defined by a CEFA $\mathcal{A} = (Q, \Sigma, R, \delta, I, F)$. Then for each string function $f$ ranging over $\cdot$, $\mathsf{replaceAll}_{e,u}$, $\mathsf{reverse}$, FFTs $\mathcal{T}$, and $\mathsf{substring}$, $f^{-1}_R(L)$ is CERR-definable. In addition,
– a CEFA representation of $\cdot^{-1}_R(L)$ can be computed in time $O(|\mathcal{A}|)$,
– a CEFA representation of $\mathsf{reverse}^{-1}_R(L)$ (resp. $\mathsf{substring}^{-1}_R(L)$) can be computed in time $O(|\mathcal{A}|)$,
– a CEFA representation of $(T(\mathcal{T}))^{-1}_R(L)$ can be computed in time polynomial in $|\mathcal{A}|$ and exponential in $|\mathcal{T}|$,
– a CEFA representation of $(\mathsf{replaceAll}_{e,u})^{-1}_R(L)$ can be computed in time polynomial in $|\mathcal{A}|$ and exponential in $|e|$ and $|u|$.
The proof of Proposition 1 is given in Appendix C.

4.3 The Decision Procedure
Let $S$ be an $\mathrm{SL}_{\mathrm{int}}$ program. Without loss of generality, we assume that for every occurrence of assignments of the form $y := \mathsf{substring}(x, t_1, t_2)$, it holds that $t_1$ and $t_2$ are integer variables. This is not really a restriction, since, for instance, if in $y := \mathsf{substring}(x, t_1, t_2)$, neither $t_1$ nor $t_2$ is an integer variable, then we introduce fresh integer variables $i$ and $j$, replace $t_1, t_2$ by $i, j$ respectively, and add $\mathsf{assert}(i = t_1); \mathsf{assert}(j = t_2)$ in $S$. We present a decision procedure for the path feasibility problem of $S$ which is divided into five steps.

Step I: Reducing to atomic assertions.
Note first that in our language, each assertion is a positive Boolean combination of atomic formulas of the form $x \in \mathcal{A}$ or $t_1 \mathrel{o} t_2$ (cf. Section 3). Nondeterministically choose, for each assertion $\mathsf{assert}(\varphi)$ of $S$, a set of atomic formulas $\Phi_\varphi = \{\alpha_1, \cdots, \alpha_n\}$ such that $\varphi$ holds when the atomic formulas in $\Phi_\varphi$ are true.
Then each assertion $\mathsf{assert}(\varphi)$ in $S$ with $\Phi_\varphi = \{\alpha_1, \cdots, \alpha_n\}$ is replaced by $\mathsf{assert}(\alpha_1); \cdots; \mathsf{assert}(\alpha_n)$, and thus $S$ contains atomic assertions only.

Step II: Dealing with the case splits in the semantics of $\mathsf{indexOf}_v$ and $\mathsf{substring}$.
For each integer term of the form $\mathsf{indexOf}_v(x, i)$ in $S$, nondeterministically choose one of the following five options (which correspond to the semantics of $\mathsf{indexOf}_v$ in Section 3).
(1) Add $\mathsf{assert}(i < 0)$ to $S$, and replace $\mathsf{indexOf}_v(x, i)$ with $\mathsf{indexOf}_v(x, 0)$ in $S$.
(2) Add $\mathsf{assert}(i < 0); \mathsf{assert}(x \in \mathcal{A}_{\overline{\Sigma^* v \Sigma^*}})$ to $S$; replace $\mathsf{indexOf}_v(x, i)$ with $-1$ in $S$.
(3) Add $\mathsf{assert}(i \ge \mathsf{length}(x))$ to $S$, and replace $\mathsf{indexOf}_v(x, i)$ with $-1$ in $S$.
(4) Add $\mathsf{assert}(i \ge 0); \mathsf{assert}(i < \mathsf{length}(x))$ to $S$.
(5) Add $\mathsf{assert}(i \ge 0); \mathsf{assert}(i < \mathsf{length}(x)); \mathsf{assert}(j = \mathsf{length}(x) - i); y := \mathsf{substring}(x, i, j); \mathsf{assert}(y \in \mathcal{A}_{\overline{\Sigma^* v \Sigma^*}})$ to $S$, where $y$ is a fresh string variable, $j$ is a fresh integer variable, and $\mathcal{A}_{\overline{\Sigma^* v \Sigma^*}}$ is an NFA defining the language $\{w \in \Sigma^* \mid v \text{ does not occur as a substring in } w\}$. Replace $\mathsf{indexOf}_v(x, i)$ with $-1$ in $S$.
For each assignment $y := \mathsf{substring}(x, i, j)$, nondeterministically choose one of the following three options (which correspond to the semantics of $\mathsf{substring}$ in Section 3).
(1) Add the statements $\mathsf{assert}(i \ge 0); \mathsf{assert}(i + j \le \mathsf{length}(x))$ to $S$.
(2) Add the statements $\mathsf{assert}(i \ge 0); \mathsf{assert}(i \le \mathsf{length}(x)); \mathsf{assert}(i + j > \mathsf{length}(x)); \mathsf{assert}(i' = \mathsf{length}(x) - i)$ to $S$, and replace $y := \mathsf{substring}(x, i, j)$ with $y := \mathsf{substring}(x, i, i')$, where $i'$ is a fresh integer variable.
(3) Add the statements $\mathsf{assert}(i < 0); \mathsf{assert}(y \in \mathcal{A}_\varepsilon)$ to $S$, and remove $y := \mathsf{substring}(x, i, j)$ from $S$, where $\mathcal{A}_\varepsilon$ is the NFA defining the language $\{\varepsilon\}$.

Step III: Removing $\mathsf{length}$ and $\mathsf{indexOf}$.
For each term $\mathsf{length}(x)$ in $S$, we introduce a fresh integer variable $i$, replace every occurrence of $\mathsf{length}(x)$ by $i$, and add the statement $\mathsf{assert}(x \in \mathcal{A}_{\mathsf{len}}[i / r_1])$ to $S$. (See Example 1 for the definition of $\mathcal{A}_{\mathsf{len}}$.)
For each term $\mathsf{indexOf}_v(x, i)$ occurring in $S$, introduce two fresh integer variables $i_1$ and $i_2$, replace every occurrence of $\mathsf{indexOf}_v(x, i)$ by $i_2$, and add the statements $\mathsf{assert}(i = i_1); \mathsf{assert}(x \in \mathcal{A}_{\mathsf{indexOf}_v}[i_1 / r_1, i_2 / r_2])$ to $S$.

Step IV: Removing the assignment statements backwards.
Repeat the following procedure until $S$ contains no assignment statements.
Suppose $y := f(x_1, i_1, \cdots, x_l, i_l)$ is the last assignment of $S$, where $f : (\Sigma^* \times \mathbb{Z}^{k_1}) \times \cdots \times (\Sigma^* \times \mathbb{Z}^{k_l}) \to \Sigma^*$ is a string function and $i_j = (i_{j,1}, \cdots, i_{j,k_j})$ for each $j \in [l]$.
Let $\{\mathcal{A}_1, \cdots, \mathcal{A}_s\}$ be the set of all CEFAs such that $\mathsf{assert}(y \in \mathcal{A}_j)$ occurs in $S$ for every $j \in [s]$. Let $j \in [s]$ and $R(\mathcal{A}_j) = (r_{j,1}, \cdots, r_{j,\ell_j})$. Then from Proposition 1, a CEFA representation of $f^{-1}_{R(\mathcal{A}_j)}(\mathcal{L}(\mathcal{A}_j))$, say $\left(\left(\mathcal{B}^{(1)}_{j,j'}, \cdots, \mathcal{B}^{(l)}_{j,j'}\right)_{j' \in [m_j]}, t\right)$, can be effectively computed from $\mathcal{A}_j$ and $f$, where we write $R\left(\mathcal{B}^{(j'')}_{j,j'}\right) = \left((r')^{(j'',1)}_j, \cdots, (r')^{(j'',k_{j''})}_j, r^{(j'')}_{j,1}, \cdots, r^{(j'')}_{j,\ell_j}\right)$ for each $j' \in [m_j]$ and $j'' \in [l]$, and $t = (t_1, \cdots, t_{\ell_j})$.
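For the special case where $f$ is concatenation, the pre-image computation invoked in this step can be sketched as follows. This is an illustration under stated assumptions rather than the construction of Appendix C: it assumes the natural splitting in which a run on $x_1 \cdot x_2$ is cut at an intermediate state $q$, each cost register is duplicated into a copy for $x_1$ and a copy for $x_2$, and the term $t_i$ is the sum of the two copies. The names `CEFA`, `concat_preimage` and `costs_via_preimage` are hypothetical, and for brevity both copies keep the original register names instead of being renamed to fresh ones.

```python
class CEFA:
    """Cost-enriched finite automaton; Q and Sigma are left implicit."""

    def __init__(self, registers, transitions, initial, final):
        self.registers = tuple(registers)
        self.transitions = list(transitions)  # tuples (q, a, q2, eta)
        self.initial = set(initial)
        self.final = set(final)

    def costs(self, word):
        """Return all vectors n such that (word, n) is accepted."""
        zero = tuple(0 for _ in self.registers)
        configs = {(q, zero) for q in self.initial}
        for letter in word:
            configs = {
                (q2, tuple(n + eta.get(r, 0)
                           for n, r in zip(vec, self.registers)))
                for state, vec in configs
                for q, a, q2, eta in self.transitions
                if q == state and a == letter
            }
        return {vec for state, vec in configs if state in self.final}


def concat_preimage(aut):
    """Disjuncts (B1, B2) for y := x1 . x2 with assert(y in aut).

    B1 is aut with q as its only final state, B2 is aut with q as its
    only initial state, for every state q (assumed construction).
    """
    states = set(aut.initial) | set(aut.final)
    for q, _, q2, _ in aut.transitions:
        states.update((q, q2))
    return [(CEFA(aut.registers, aut.transitions, aut.initial, {q}),
             CEFA(aut.registers, aut.transitions, {q}, aut.final))
            for q in states]


def costs_via_preimage(aut, w1, w2):
    """Costs of w1 . w2 recovered from the disjuncts, with t_i = r1_i + r2_i."""
    result = set()
    for left, right in concat_preimage(aut):
        for c1 in left.costs(w1):
            for c2 in right.costs(w2):
                result.add(tuple(a + b for a, b in zip(c1, c2)))
    return result
```

Under these assumptions, `costs_via_preimage(aut, w1, w2)` should coincide with `aut.costs(w1 + w2)` for every pair of strings, which is the property the third item of Definition 3 asks of a pre-image.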
Note that the cost registers $(r')^{(1,1)}_{j}, \cdots, (r')^{(1,k_1)}_{j}, \cdots, (r')^{(l,1)}_{j}, \cdots, (r')^{(l,k_l)}_{j}, r^{(1)}_{j,1}, \cdots, r^{(1)}_{j,\ell_j}, \cdots, r^{(l)}_{j,1}, \cdots, r^{(l)}_{j,\ell_j}$ are mutually distinct and freshly introduced; moreover, $R(\mathcal{B}^{(j'')}_{j,j'_1}) = R(\mathcal{B}^{(j'')}_{j,j'_2})$ for distinct $j'_1, j'_2 \in [m_j]$.
Remove $y := f(x_1, \vec{i}_1, \cdots, x_l, \vec{i}_l)$, as well as all the statements assert($y \in \mathcal{A}_1$), $\cdots$, assert($y \in \mathcal{A}_s$) from $S$. For every $j \in [s]$, nondeterministically choose $j' \in [m_j]$, and add the following statements to $S$:

assert($x_1 \in \mathcal{B}^{(1)}_{j,j'}$); $\cdots$; assert($x_l \in \mathcal{B}^{(l)}_{j,j'}$); $S_{j,j',\vec{i}_1,\cdots,\vec{i}_l}$; $S_{j,\vec{t}}$

where

$S_{j,j',\vec{i}_1,\cdots,\vec{i}_l} \equiv$ assert($i_{1,1} = (r')^{(1,1)}_{j,j'}$); $\cdots$; assert($i_{1,k_1} = (r')^{(1,k_1)}_{j,j'}$); $\cdots$ assert($i_{l,1} = (r')^{(l,1)}_{j,j'}$); $\cdots$; assert($i_{l,k_l} = (r')^{(l,k_l)}_{j,j'}$)

and

$S_{j,\vec{t}} \equiv$ assert($r_{j,1} = t_1$); $\cdots$; assert($r_{j,\ell_j} = t_{\ell_j}$).

Step V: Final satisfiability checking.
In this step, $S$ contains no assignment statements and only assertions of the form assert($x \in \mathcal{A}$) and assert($t_1\ o\ t_2$), where the $\mathcal{A}$ are CEFAs and $t_1, t_2$ are linear integer terms. Let $X$ denote the set of string variables occurring in $S$. For each $x \in X$, let $\Lambda_x = \{\mathcal{A}^1_x, \cdots, \mathcal{A}^{s_x}_x\}$ denote the set of CEFAs $\mathcal{A}$ such that assert($x \in \mathcal{A}$) appears in $S$. Moreover, let $\varphi$ denote the conjunction of all the LIA formulas $t_1\ o\ t_2$ occurring in $S$. It is straightforward to observe that $\varphi$ is over $R' = \bigcup_{x \in X, j \in [s_x]} R(\mathcal{A}^j_x)$. Then the path feasibility of $S$ is reduced to the satisfiability problem of LIA formulas w.r.t. CEFAs (abbreviated as the SAT_CEFA[LIA] problem), which is defined as

deciding whether $\varphi$ is satisfiable w.r.t. $(\Lambda_x)_{x \in X}$, namely, whether there are an assignment function $\theta : R' \to \mathbb{Z}$ and strings $(w_x)_{x \in X}$ such that $\varphi[\theta(R')/R']$ holds and $(w_x, \theta(R(\mathcal{A}^j_x))) \in \mathcal{L}(\mathcal{A}^j_x)$ for every $x \in X$ and $j \in [s_x]$.

This SAT_CEFA[LIA] problem is decidable and PSPACE-complete; the proof can be found in Appendix D.
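To make the definition concrete, the following is a minimal Python sketch of a bounded brute-force check for a single string variable. It is not the PSPACE procedure of the paper: it simply enumerates strings up to a length bound, collects the register values of accepting runs, and tests the LIA formula on them. The dictionary encoding of a CEFA, the names `accepted_costs` and `bounded_sat`, and the two example automata (a length counter in the spirit of A_len and a counter of the letter a) are assumptions made for illustration.

```python
from itertools import product

def accepted_costs(cefa, word):
    """Return the register vectors of all accepting runs of `cefa` on `word`."""
    k = cefa["num_registers"]
    # A configuration is (state, register values); registers start at 0.
    configs = {(q, (0,) * k) for q in cefa["initial"]}
    for letter in word:
        nxt = set()
        for q, vals in configs:
            for src, a, dst, eta in cefa["transitions"]:
                if src == q and a == letter:
                    # Each transition adds its update vector eta to the registers.
                    nxt.add((dst, tuple(v + d for v, d in zip(vals, eta))))
        configs = nxt
    return {vals for q, vals in configs if q in cefa["final"]}

def bounded_sat(cefas, phi, alphabet, max_len):
    """Search for a word (length <= max_len) accepted by every CEFA in `cefas`
    with register values satisfying `phi`; return (word, values) or None."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            cost_sets = [accepted_costs(c, word) for c in cefas]
            for combo in product(*cost_sets):
                theta = tuple(v for vals in combo for v in vals)
                if phi(*theta):
                    return word, theta
    return None

ALPHABET = "ab"
# Hypothetical one-state automaton whose single register counts the length.
A_LEN = {"num_registers": 1, "initial": {0}, "final": {0},
         "transitions": [(0, c, 0, (1,)) for c in ALPHABET]}
# Hypothetical one-state automaton whose single register counts occurrences of "a".
A_COUNT_A = {"num_registers": 1, "initial": {0}, "final": {0},
             "transitions": [(0, "a", 0, (1,)), (0, "b", 0, (0,))]}

if __name__ == "__main__":
    # phi: the length is 3 and exactly two letters are "a".
    print(bounded_sat([A_LEN, A_COUNT_A], lambda n, m: n == 3 and m == 2, ALPHABET, 4))
```

The bound `max_len` makes the search incomplete in general; it only illustrates what a witness (a string together with a register assignment) looks like.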
Proposition 2. SAT_CEFA[LIA] is PSPACE-complete.

An example to illustrate the decision procedure can be found in Appendix ??.

Complexity analysis of the decision procedure.
Step I and Step II can be done in nondeterministic linear time. Step III can be done in linear time. In Step IV, for each input string variable x in S, at most exponentially many CEFAs can be generated for x, each of which is of at most exponential size. Therefore, Step IV can be done in nondeterministic exponential space. By Proposition 2, Step V can be done in exponential space. Therefore, we conclude that the path feasibility problem of SL_int programs is in NEXPSPACE, thus in EXPSPACE by Savitch's theorem [23].

Remark 2.
In this paper, we focus on functional finite transducers (cf. Section 2). Our decision procedure is applicable to general finite transducers as well with minor adaptation. However, the EXPSPACE complexity upper bound does not hold any more, because the distributive property $f^{-1}(L_1 \cap L_2) = f^{-1}(L_1) \cap f^{-1}(L_2)$ for regular languages $L_1, L_2$ only holds for functional finite transducers $f$.

We have implemented the decision procedure presented in the preceding section based on the recent string constraint solver OSTRICH [14], resulting in a new solver OSTRICH+. OSTRICH is written in Scala and based on the SMT solver Princess [25]. OSTRICH+ reuses the parser of Princess, but replaces the NFAs from OSTRICH with CEFAs. Correspondingly, in OSTRICH+, the pre-image computation for concatenation, replaceAll, reverse, and finite transducers is reimplemented, and a new pre-image operator for substring is added. OSTRICH+ also implements CEFA constructions for length and indexOf. More details can be found in Appendix E.

We have compared OSTRICH+ with some of the state-of-the-art solvers on a wide range of benchmarks. We discuss the benchmarks in Section 5.1 and present the experimental results in Section 5.2.

Our evaluation focuses on problems that combine string with integer constraints. To this end, we consider the following four sets of benchmarks, all in SMT-LIB 2 format.

Transducer+ is derived from the Transducer benchmark suite of OSTRICH [14]. The Transducer suite involves seven transducers: toUpper (replacing all lowercase letters with their uppercase ones) and its dual toLower, htmlEscape and its dual htmlUnescape, escapeString, addslashes, and trim. These transducers are collected from Stranger [33] and SLOTH [18]. Initially none of the benchmarks involved integers. In Transducer+, we encode four security-relevant properties of transducers [19], with the help of the functions charAt and length:
– idempotence: given $T$, whether $\forall x.\ T(T(x)) = T(x)$;
– duality: given $T_1$ and $T_2$, whether $\forall x.\ T_2(T_1(x)) = x$;
– commutativity: given $T_1$ and $T_2$, whether $\forall x.\ T_2(T_1(x)) = T_1(T_2(x))$;
– equivalence: given $T_1$ and $T_2$, whether $\forall x.\ T_1(x) = T_2(x)$.
For instance, we encode the non-idempotence of $T$ into the path feasibility of the SL_int program $y := T(x);\ z := T(y);\ S_{y \neq z}$, where $y$ and $z$ are two fresh string variables, and $S_{y \neq z}$ is the SL_int program encoding $y \neq z$ (see Appendix A for the details). We also include in Transducer+ three instances generated from a program to sanitize URLs against XSS attacks (see Appendix ?? for the details), where $T_{\mathrm{trim}}$ is used. In total, we obtain 94 instances for the Transducer+ suite.

SLOG+ is adapted from the SLOG benchmark suite [31], containing 3,511 instances about strings only. We obtain SLOG+ by choosing a string variable x for each instance, and adding the statement assert(length(x) < indexOf_a(x, a ∈ Σ. As in [14], we split SLOG+ into SLOG+(replace) and SLOG+(replaceall), comprising 3,391 and 120 instances respectively. In addition to the indexOf and length functions, the benchmarks use regular constraints and concatenation; SLOG+(replace) also contains the replace function (replacing the first occurrence), while SLOG+(replaceall) uses the replaceAll function (replacing all occurrences).

PyEx [24] contains 25,421 instances derived by the PyEx tool, a symbolic execution engine for Python programs. The PyEx suite was generated by the CVC4 group from four popular Python packages: httplib2, pip, pymongo, and requests. These instances use regular constraints, concatenation, length, substring, and indexOf functions.
Following [24], the PyEx suite is further divided into three parts: PyEx-td, PyEx-z3 and PyEx-zz, comprising 5,569, 8,414 and 11,438 instances, respectively.

Kaluza [26] is the most well-known benchmark suite in the literature, containing 47,284 instances with regular constraints, concatenation, and the length function. The 47,284 benchmarks include 28,032 satisfiable and 9,058 unsatisfiable problems in SSA form.

We compare OSTRICH+ to CVC4 [20], Z3-str3 [34], and Z3-Trau [9], as well as two configurations of OSTRICH [14] with standard NFAs. The configuration OSTRICH(1) is a direct implementation of the algorithm in [14], and does not support integer functions. In OSTRICH(2), we integrated support for the length function as in Norn [2], based on the computation of length abstractions of regular languages, and handle indexOf, substring, and charAt via an encoding to word equations. The experiments are executed on a computer with an Intel Xeon Silver 4210 2.20GHz and 2.19GHz CPU (2-core) and 8GB main memory, running 64bit Ubuntu 18.04 LTS OS and Java 1.8. We use a timeout of 30 seconds (wall-clock time), and report the number of satisfiable and unsatisfiable problems solved by each of the systems. Table 1 summarises the experimental results. We did not observe incorrect answers by any tool.

There are two additional state-of-the-art solvers, Slent and Trau+, which were not included in the evaluation. We exclude Slent [32] because it uses its own input format laut, which is different from the SMT-LIB 2 format used for our benchmarks; also, Transducer+ is beyond the scope of Slent. Trau+ [3] integrates Trau with Sloth to deal with both finite transducers and integer constraints. We were unfortunately unable to obtain a working version of Trau+, possibly because Trau requires two separate versions of Z3 to run. In addition, the algorithm in [3] focuses on length-preserving transducers, which means that Transducer+ is beyond the scope of Trau+.

| Benchmark | Output | CVC4 | Z3-str3 | Z3-Trau | OSTRICH(1) | OSTRICH(2) | OSTRICH+ |
| Transducer+ (Total: 94) | sat | – | – | – | | | |
| | unsat | – | – | – | | | |
| | inconcl. | – | – | – | 93 | 93 | 6 |
| SLOG+(replaceall) (Total: 120) | sat | 104 | – | – | | | 98 |
| | unsat | 11 | – | – | | | 12 |
| | inconcl. | 5 | – | – | 113 | 115 | 10 |
| SLOG+(replace) (Total: 3,391) | sat | 1,309 | | – | | | |
| | unsat | | | – | | | |
| | inconcl. | 0 | 447 | – | | | |
| PyEx-td (Total: 5,569) | sat | 4,224 | 4,068 | | 68 | 96 | 4,141 |
| | unsat | 1,284 | 1,289 | | 95 | 93 | 1,203 |
| | inconcl. | 61 | 212 | 8 | 5,406 | 5,380 | 225 |
| PyEx-z3 (Total: 8,414) | sat | 6,346 | 6,040 | | 76 | 100 | 5,489 |
| | unsat | 1,358 | 1,370 | | 61 | 53 | 1,239 |
| | inconcl. | 710 | 1,004 | 17 | 8,277 | 8,261 | 1,686 |
| PyEx-zz (Total: 11,438) | sat | 10,078 | 8,804 | | 71 | 98 | 9,033 |
| | unsat | 1,204 | 1,207 | | 91 | 61 | 868 |
| | inconcl. | 156 | 1,427 | 87 | 11,276 | 11,279 | 1,537 |
| Kaluza (Total: 47,284) | sat | | | | | | |

Table 1. Experimental results on different benchmark suites. '–' means that the tool is not applicable to the benchmark suite, and 'inconclusive' means that a tool gave up, timed out, or crashed.

OSTRICH+ and OSTRICH are the only tools applicable to the problems in Transducer+. With a timeout of 30s, OSTRICH+ can solve 88 of the benchmarks, but this number rises to 94 when using a longer timeout of 600s. Given the complexity of those benchmarks, this is an encouraging result. OSTRICH can only solve one of the benchmarks, because the encoding of charAt in the benchmarks using equations almost always leads to problems that are not in SSA form.

On SLOG+(replaceall), OSTRICH+ and CVC4 are very close: OSTRICH+ solves 98 satisfiable instances, slightly fewer than the 104 instances solved by CVC4, while OSTRICH+ solves one more unsatisfiable instance than CVC4 (12 versus 11). The suite is beyond the scope of Z3-str3 and Z3-Trau, which do not support replaceAll.

On SLOG+(replace), OSTRICH+, CVC4, and Z3-str3 solve a similar number of unsatisfiable problems, while CVC4 solves the largest number of satisfiable instances (1,309). The suite is beyond the scope of Z3-Trau, which does not support replace.

On the three PyEx suites, Z3-Trau consistently solves the largest number of instances by some margin. OSTRICH+ solves a similar number of instances as Z3-str3. Interpreting the results, however, it has to be taken into account that PyEx includes 1,334 instances that are not in SSA form, which are beyond the scope of OSTRICH+.

The Kaluza problems can be solved most effectively by CVC4.
OSTRICH+ can solve almost all of the around 80% of the benchmarks that are in SSA form, however.

OSTRICH+ consistently outperforms OSTRICH(1) and OSTRICH(2) in the evaluation, except for the Kaluza benchmarks. For OSTRICH(1), this is expected because most benchmarks considered here contain integer functions. For OSTRICH(2), it turns out that the encoding of indexOf, substring, and charAt as word equations usually leads to problems that are not in SSA form, and therefore are beyond the scope of OSTRICH.

In summary, we observe that OSTRICH+ is competitive with other solvers, while also being able to handle benchmarks that are beyond the scope of the other tools due to the combination of string functions (in particular transducers) and integer constraints. Interestingly, the experiments show that OSTRICH+, at least in its current state, is better at solving unsatisfiable problems than satisfiable problems; this might be an artefact of the use of nuXmv for analysing products of CEFAs. We expect that further optimisation of our algorithm will lead to additional performance improvements. For instance, a natural optimisation that is to be included in our implementation is to use standard finite automata, as opposed to CEFAs, for simpler problems such as the Kaluza benchmarks. Such a combination of automata representations is mostly an engineering effort.

In this paper, we have proposed an expressive string constraint language which can specify constraints on both strings and integers. We provided an automata-theoretic decision procedure for the path feasibility problem of this language. The decision procedure is simple, generic, and amenable to implementation, giving rise to a new solver OSTRICH+. We have evaluated OSTRICH+ on a wide range of existing and newly created benchmarks, and have obtained very encouraging results. OSTRICH+ is shown to be the first solver which is capable of tackling finite transducers and integer constraints with completeness guarantees.
Meanwhile, it demonstrates competitive performance against some of the best state-of-the-art string constraint solvers.

Acknowledgements. T. Chen and Z. Wu are supported by Guangdong Science and Technology Department grant (No. 2018B010107004); T. Chen is also supported by Overseas Grant (KFKT2018A16) from the State Key Laboratory of Novel Software Technology, Nanjing University, China and Natural Science Foundation of Guangdong Province, China (No. 2019A1515011689). M. Hague is supported by EPSRC [EP/T00021X/

References
1. P. A. Abdulla, M. F. Atig, Y. Chen, B. P. Diep, L. Holík, A. Rezine, and P. Rümmer. Flatten and conquer: a framework for efficient analysis of string constraints. In PLDI, pages 602–617, 2017.
2. P. A. Abdulla, M. F. Atig, Y. Chen, L. Holík, A. Rezine, P. Rümmer, and J. Stenman. String constraints for verification. In CAV, pages 150–166, 2014.
3. P. A. Abdulla, M. F. Atig, B. P. Diep, L. Holík, and P. Janku. Chain-free string constraints. In ATVA, pages 277–293, 2019.
4. R. Alur, L. D'Antoni, J. Deshmukh, M. Raghothaman, and Y. Yuan. Regular functions and cost register automata. In LICS, pages 13–22. IEEE Computer Society, 2013.
5. P. Barceló, D. Figueira, and L. Libkin. Graph logics with rational relations. Logical Methods in Computer Science, 9(3), 2013.
6. M. Berzish, V. Ganesh, and Y. Zheng. Z3str3: A string solver with theory-aware heuristics. In FMCAD, pages 55–59, 2017.
7. N. Bjørner, N. Tillmann, and A. Voronkov. Path feasibility analysis for string-manipulating programs. In TACAS, pages 307–321, 2009.
8. J. R. Büchi and S. Senger. Definability in the existential theory of concatenation and undecidable extensions of this theory. In Collected Works of J. R. Büchi, pages 671–683. 1990.
9. D. Bui and contributors. Z3-trau, 2019.
10. T. Bultan and contributors. Abc string solver, 2015.
11. C. Cadar and K. Sen. Symbolic execution for software testing: Three decades later. Commun. ACM, 56(2):82–90, Feb. 2013.
12. R. Cavada, A. Cimatti, M. Dorigatti, A. Griggio, A. Mariotti, A. Micheli, S. Mover, M. Roveri, and S. Tonetta. The nuXmv symbolic model checker. In CAV, pages 334–342, 2014.
13. T. Chen, Y. Chen, M. Hague, A. W. Lin, and Z. Wu. What is decidable about string constraints with the replaceall function. PACMPL, 2(POPL):3:1–3:29, 2018.
14. T. Chen, M. Hague, A. W. Lin, P. Rümmer, and Z. Wu. Decision procedures for path feasibility of string-manipulating programs with complex operations. PACMPL, 3(POPL), 2019.
15. J. D. Day, V. Ganesh, P. He, F. Manea, and D. Nowotka. RP. pages 15–29, 2018.
16. L. de Moura and N. Bjørner. Z3: an efficient SMT solver. In TACAS, pages 337–340, 2008.
17. V. Ganesh, M. Minnes, A. Solar-Lezama, and M. C. Rinard. Word equations with length constraints: What's decidable? In HVC 2012, pages 209–226, 2012.
18. L. Holík, P. Janku, A. W. Lin, P. Rümmer, and T. Vojnar. String constraints with concatenation and transducers solved efficiently. PACMPL, 2(POPL):4:1–4:32, 2018.
19. P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes. Fast and precise sanitizer analysis with BEK. In USENIX Security Symposium, 2011.
20. T. Liang, A. Reynolds, C. Tinelli, C. Barrett, and M. Deters. A DPLL(T) theory solver for a theory of strings and regular expressions. In CAV, pages 646–662, 2014.
21. A. W. Lin and P. Barceló. String solving with word equations and transducers: Towards a logic for analysing mutation XSS. In POPL, pages 123–136. ACM, 2016.
22. A. W. Lin and R. Majumdar. Quadratic word equations with length constraints, counter systems, and presburger arithmetic with divisibility. In ATVA, pages 352–369, 2018.
23. C. H. Papadimitriou. Computational complexity. Addison-Wesley, 1994.
24. A. Reynolds, M. Woo, C. Barrett, D. Brumley, T. Liang, and C. Tinelli. Scaling up DPLL(T) string solvers using context-dependent simplification. In CAV, pages 453–474, 2017.
25. P. Rümmer. A constraint sequent calculus for first-order logic with linear integer arithmetic. In LPAR, pages 274–289, 2008.
26. P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and D. Song. A symbolic execution framework for javascript. In S&P, pages 513–528, 2010.
27. M. Trinh, D. Chu, and J. Jaffar. S3: A symbolic string solver for vulnerability detection in web applications. In CCS, pages 1232–1243, 2014.
28. M. Trinh, D. Chu, and J. Jaffar. Progressive reasoning over recursively-defined strings. In CAV, pages 218–240. Springer, 2016.
29. A. van der Stock, B. Glas, N. Smithline, and T. Gigler. OWASP Top 10 – 2017, 2017.
30. K. N. Verma, H. Seidl, and T. Schwentick. On the complexity of equational horn clauses. In CADE, pages 337–352, 2005.
31. H. Wang, T. Tsai, C. Lin, F. Yu, and J. R. Jiang. String analysis via automata manipulation with logic circuit representation. In CAV, pages 241–260, 2016.
32. H.-E. Wang, S.-Y. Chen, F. Yu, and J.-H. R. Jiang. A symbolic model checking approach to the analysis of string and length constraints. In ASE, pages 623–633. ACM, 2018.
33. F. Yu, M. Alkhalaf, T. Bultan, and O. H. Ibarra. Automata-based symbolic string analysis for vulnerability detection. Form. Methods Syst. Des., 44(1):44–70, 2014.
34. Y. Zheng, X. Zhang, and V. Ganesh. Z3-str: a Z3-based string solver for web application analysis. In
ESEC / SIGSOFT FSE , pages 114–124, 2013. The SL int program S x , y encoding x , y At first, we note that the function charAt ( x , i ) which returns x [ i ] (i.e., the character of x at theposition i ) can be seen as a special case of substring , namely charAt ( x , i ) ≡ substring ( x , i , x , y is expressed as the following SL int program (denoted by S x , y ) z : = charAt ( x , i ); z : = charAt ( y , i ); assert (cid:0) length ( x ) , length ( y ) ∨ W a ∈ Σ ( z ∈ A a ∧ z ∈ A Σ \ a ) (cid:1) , where z , z are two freshly introduced string variables, and A a (resp. A Σ \ a ) is the NFA accepting { a } (resp. Σ \ { a } ). Intuitively, two strings are di ff erent if their lengths are di ff erent or otherwise,there exists some position where the characters of the two strings are di ff erent. B Construction of A indexOf v In this section, we show that the function indexOf v ( · , · ) can be captured by CEFA. We start withthe simple example for v = a . Example 3 (CEFA for indexOf a ). Let a ∈ Σ . Then A indexOf a = ( { ( q , q , q ) } , Σ, ( r , r ) , δ indexOf a , { q } , { q } ), where δ indexOf a comprises the tuples – ( q , b , q , η ) such that b ∈ Σ , η ( r ) = η ( r ) = – ( q , b , q , η ) such that b ∈ Σ , η ( r ) = η ( r ) = – ( q , a , q , η ) such that η ( r ) = η ( r ) = – ( q , b , q , η ) such that b ∈ Σ \ { a } , η ( r ) = η ( r ) = – ( q , a , q , η ) such that η ( r ) = η ( r ) = – ( q , b , q , η ) such that b ∈ Σ , η ( r ) = η ( r ) = r corresponds to the starting position i of indexOf a ( x , i ), r corresponds to the outputof indexOf a ( x , i ), q specifies that the current position is before i , q specifies that the currentposition is after i , while a has not occurred yet, and q specifies that a has occurred after i .Technically, for any NFA A and constant string v , we can construct a CEFA accepting { ( w , ( n , indexOf v ( w , n ))) | w ∈ L ( A ) , n ≤ indexOf v ( w , n ) < | w |} . 
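As an illustration of the single-letter case, the following Python sketch simulates a three-state cost-enriched automaton with two registers and collects the register values of its accepting runs. The state numbering (0, 1, 2), the concrete update constants, and the function name `index_of_cefa_costs` are assumptions of this sketch, chosen so that the accepted pairs are exactly (n, indexOf_a(w, n)) with n ≤ indexOf_a(w, n) < |w|; they are not claimed to coincide with the transition table of Example 3. Python's `str.find` is used as a stand-in for indexOf.

```python
A = "a"  # the letter being searched for (assumption of this sketch)

def index_of_cefa_costs(word):
    """Return all (r1, r2) reached by accepting runs on `word`.

    r1 plays the role of the start position, r2 the role of the result.
    State 0: still before the start position.
    State 1: at or after the start position, A not seen yet.
    State 2: A has been seen at or after the start position (accepting).
    """
    configs = {(0, 0, 0)}  # (state, r1, r2)
    for c in word:
        nxt = set()
        for q, r1, r2 in configs:
            if q == 0:
                # c lies strictly before the start position: both registers count it.
                nxt.add((0, r1 + 1, r2 + 1))
                if c == A:
                    # c sits at the start position and is A: result equals start.
                    nxt.add((2, r1, r2))
                else:
                    # c sits at the start position but is not A: only r2 advances.
                    nxt.add((1, r1, r2 + 1))
            elif q == 1:
                # Keep advancing r2 until the first A is read.
                nxt.add((2, r1, r2) if c == A else (1, r1, r2 + 1))
            else:
                # After the first A, the registers are frozen.
                nxt.add((2, r1, r2))
        configs = nxt
    return {(r1, r2) for q, r1, r2 in configs if q == 2}

if __name__ == "__main__":
    w = "bab"
    expected = {(n, w.find(A, n)) for n in range(len(w)) if w.find(A, n) >= 0}
    print(index_of_cefa_costs(w) == expected)
```

The nondeterministic guess of the start position is what lets a single automaton relate every admissible input position to its output position.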
For this purpose, we need aconcept of window profiles of string positions w.r.t. v , which are elements of {⊥ , ⊤} n − . The win-dow profiles facilitate recognising the first occurrence of v in the input string. Intuitively, given astring u , the window profile of a position i in u w.r.t. v encodes the matchings of prefixes of v tothe su ffi xes of u [0 , i ] (see [13] for the details). For π = π · · · π n − ∈ {⊥ , ⊤} n − and b ∈ Σ , we useuwp( π , b ) to represent the window profile updated from π after reading the letter b , specifically,uwp( π , b ) = π ′ such that – π ′ = ⊤ i ff b = a , – for each i ∈ [ n − π ′ i + = ⊤ i ff π i = ⊤ and b = a i + .Let WP v denote the set of window profiles of string positions w.r.t. v . From the result in [13], weknow that | WP v | ≤ | v | .Suppose v = a · · · a n with n ≥
2. Then indexOf v is captured by the CEFA A indexOf v = ( Q , Σ, R , δ, I , F ), such that – Q = { q , q } ∪ WP v ∪ WP v × [ n ], – R = ( r , r ) (where r , r represent the input and output positions of indexOf v respectively), – I = { q } , F = { q } , and – δ comprises • the tuples ( q , a , q , η ) such that a ∈ Σ , η ( r ) =
1, and η ( r ) = • the tuples ( q , a , π , η ) such that a ∈ Σ , π = θ ⊥ n − where θ = ⊤ i ff a = a , η ( r ) =
0, and η ( r ) = • the tuples ( π , a , uwp( π , a ) , η ) such that π ∈ WP u , a ∈ Σ , π n − = ⊥ or a , a n , η ( r ) = η ( r ) = • the tuples ( π , a , (uwp( π , a ) , , η ) such that π ∈ WP u , a = a , π n − = ⊥ or a , a n , η ( r ) =
0, and η ( r ) = • the tuples (( π , i ) , a , (uwp( π , a ) , i + , η ) such that π ∈ WP u , i ∈ [ n − a = a i + , π n − = ⊥ or a , a n , η ( r ) =
0, and η ( r ) = • the tuples (( π , n − , a , q , η ) such that π ∈ WP u , a = a n , η ( r ) =
0, and η ( r ) = • the tuples ( q , a , q , η ) such that a ∈ Σ , η ( r ) =
0, and η ( r ) = C Proof of Proposition 1
Proposition 1 . Let L be a CERL defined by a CEFA A = ( Q , Σ, R , δ, I , F ) . Then for each stringfunction f ranging over · , replaceAll e , u , reverse , FFTs T , and substring , f − R ( L ) is CERR-definable. In addition, – a CEFA representation of · − R ( L ) can be computed in time O ( |A| ) , – a CEFA representation of reverse − R ( L ) (resp. substring − R ( L ) ) can be computed in time O ( |A| ) , – a CEFA representation of ( T ( T )) − R ( L ) can be computed in time polynomial in |A| andexponential in |T | , – a CEFA representation of ( replaceAll e , u ) − R ( L ) can be computed in time polynomial in |A| and exponential in | e | and | u | .Proof. Let A = ( Q , Σ, R , δ, I , F ) be a CEFA with R = ( r , · · · , r k ). We show how to construct aCEFA representation of f − R ( L ) for each function f in SL int . · − R ( L ) . A CEFA representation of · − R ( L ) is given by (( A I , q , A q , F ) q ∈ Q , t ), where – A I , q = ( Q , Σ, R (1) , δ (1) , I , { q } ) and A q , F = ( Q , Σ, R (2) , δ (2) , { q } , F ) such that • R (1) = ( r (1)1 , · · · , r (1) k ), R (2) = ( r (2)1 , · · · , r (2) k ), • δ (1) comprises the tuples ( q , a , q ′ , η ′ ) satisfying that there exists η such that ( q , a , q ′ , η ) ∈ δ and for each j ∈ [ k ], and η ′ ( r (1) j ) = η ( r j ), similarly for δ (2) , – and t = ( r (1)1 + r (2)1 , · · · , r (1) k + r (2) k ).Note that the size of (( A I , q , A q , F ) q ∈ Q , t ) is O ( |A| ). reverse − R ( L ) . A CEFA representation of reverse − R ( L ) is given by ( A ( r ) , t ), where – A ( r ) = ( Q , Σ, R (1) , δ ′ , F , I ) such that • R (1) = ( r (1)1 , · · · , r (1) k ), and • δ ′ comprises the tuples ( q ′ , a , q , η ′ ) satisfying that there exists η such that ( q , a , q ′ , η ) ∈ δ ,and η ′ ( r (1) i ) = η ( r i ) for each i ∈ [ k ], – and t = ( r (1)1 , · · · , r (1) k ).Note that L ( A ( r ) ) = { ( w ( r ) , n ) | ( w , n ) ∈ L ( A ) } , and the size of ( A ( r ) , t ) is O ( |A| ). ubstring − R ( L ) . 
A CEFA representation of substring − R ( L ) is given by ( B , t ), where – B = ( Q ′ , Σ, R ′ , δ ′ , I ′ , F ′ ) such that • Q ′ = Q × { p , p , p } , (intuitively, p , p , and p denote that the current position isbefore the starting position, between the starting position and ending position, and afterthe ending position respectively) • R ′ = (cid:16) r ′ , , r ′ , , r (1)1 , · · · , r (1) k (cid:17) , (intuitively, r ′ , denotes the starting position, and r ′ , de-notes the length of the substring) • I ′ = I × { p } , F ′ = F ′ × { p } ∪ ( I ∩ F ) × { p } , • and δ ′ comprises ∗ the tuples (( q , p ) , a , ( q , p ) , η ′ ) such that q ∈ I , a ∈ Σ , and η ′ satisfies that η ′ ( r ′ , ) =
1, and η ′ ( r ′ , ) =
0, and η ′ ( r (1) j ) = j ∈ [ k ], ∗ the tuples (( q , p ) , a , ( q ′ , p ) , η ′ ) such that q ∈ I and there exists η satisfying that( q , a , q ′ , η ) ∈ δ , moreover, η ′ ( r ′ , ) = η ′ ( r ′ , ) =
1, and η ′ ( r (1) j ) = η ( r j ) for each j ∈ [ k ], ∗ the tuples (( q , p ) , a , ( q ′ , p ) , η ′ ) such that q ∈ I and there exists η satisfying that( q , a , q ′ , η ) ∈ δ , moreover, q ′ ∈ F , and η ′ ( r ′ , ) = η ′ ( r ′ , ) =
1, and η ′ ( r (1) j ) = η ( r j ) for each j ∈ [ k ], ∗ the tuples (( q , p ) , a , ( q ′ , p ) , η ′ ) such that there exists η satisfying that ( q , a , q ′ , η ) ∈ δ , η ′ ( r ′ , ) =
0, and η ′ ( r ′ , ) =
1, and η ′ ( r (1) j ) = η ( r j ) for each j ∈ [ k ], ∗ the tuples (( q , p ) , a , ( q ′ , p ) , η ′ ) such that q ′ ∈ F , and there exists η satisfying that( q , a , q ′ , η ) ∈ δ , moreover, η ′ ( r ′ , ) = η ′ ( r ′ , ) =
1, and η ′ ( r (1) j ) = η ( r j ) for each j ∈ [ k ], ∗ the tuples (( q , p ) , a , ( q , p ) , η ′ ) such that q ∈ F , η ′ ( r ′ , ) =
0, and η ′ ( r ′ , ) =
0, and η ′ ( r (1) j ) = j ∈ [ k ], – t = ( r (1)1 , · · · , r (1) k ).Note that the size of ( B , t ) is O ( |A| ). ( T ( T )) − R ( L ) . Suppose T = ( Q ′ , Σ, δ ′ , I ′ , F ′ ). Then a CEFA representation of ( T ( T )) − R ( L ) isgiven by ( B , t ), where – B simulates the run of T on the input string, meanwhile, it simulates the run of A on theoutput string of T , formally, B = ( Q ′ × Q , Σ, R (1) , δ ′′ , I ′ × I , F ′ × F ) such that • R (1) = ( r (1)1 , · · · , r (1) k ), and • δ ′′ comprises the tuples (( q ′ , q ) , a , ( q ′ , q ) , η ′ ) satisfying one of the following condi-tions, ∗ there exist u = a · · · a n ∈ Σ + and a transition sequence p a ,η −−−→ δ p · · · p n − a n ,η n −−−→ δ p n in A such that ( q ′ , a , q ′ , u ) ∈ δ ′ , p = q , p n = q , and for each j ∈ [ k ], η ′ ( r (1) j ) = η ( r j ) + · · · + η n ( r j ), ∗ ( q ′ , a , q ′ , ε ) ∈ δ ′ , q = q , and η ′ ( r (1) j ) = j ∈ [ k ], – t = ( r (1)1 , · · · , r (1) k ).Note that the number of transitions of B can be exponential in the worst case, since it summarisesthe updates of cost registers of A on the output strings of the transitions of T . More precisely, let – ℓ be the maximum length of the output strings of transitions of T , – N be the maximum number of transitions between a given pair of states of A , and – C be the maximum absolute value of the integer constants occurring in A , hen | δ ′′ | , the cardinality of δ ′′ , is bounded by | δ ′ | × | Q | × N ℓ , and the integer constants occurringin each transition of δ ′′ are bounded by ℓ C . Therefore, the size of ( B , t ) is O ( | δ ′ | × | Q | × N ℓ × k log ( ℓ C )) . Since | δ ′ | , ℓ ≤ |T | , | Q | , N , k ≤ |A| , and C ≤ |A| , we deduce that the size of ( B , t ) is O ( |T | × |A| ×|A| |T| × |A| log ( |T | )) = |A| O ( |T| ) |T | log ( |T | ) . ( replaceAll e , u ) − R ( L ) . From the result in [13], we know that a NFT T e , u = ( Q ′ , Σ, δ ′ , I ′ , F ′ ) canbe constructed to capture replaceAll e , u . 
Moreover, – | Q ′ | , as well as | δ ′ | , is 2 O ( | e | ) , – ℓ , the maximum length of the output strings of transitions of T e , u , is | u | .Then a CEFA representation of ( replaceAll e , u ) − R ( L ) can be constructed as that of ( T ( T e , u )) − R ( L ).Let N denote the maximum number of transitions between a given pair of states of A , and C bethe maximum absolute value of the integer constants occurring in A , which is bounded by 2 |A| .Then the CEFA representation of ( replaceAll e , u ) − R ( L ) is of size O ( | δ ′ | × | Q | × N ℓ × k log ( ℓ C )) = O ( | e | ) |A| |A| | u | |A| log | u | = O ( | e | ) |A| O ( | u | ) . according to the aforementioned discussion for NFTs. (cid:3) D Proof of Proposition 2
Proposition 2. The SAT_CEFA[LIA] problem is PSPACE-complete.

Proof. The lower bound follows from the PSPACE-hardness of the intersection problem of NFAs.

For the upper bound, let $\{\mathcal{A}^j_i\}_{i \in I, j \in J_i}$ be a family of CEFAs each of which carries a vector of registers $R^j_i$, and let $\varphi$ be a quantifier-free LIA formula such that the $R^j_i$ are pairwise disjoint and the variables of $\varphi$ are from $R' := \bigcup_{i,j} R^j_i$.

First, we observe that we can focus on monotonic CEFAs where the cost registers are monotone in the sense that their values are non-decreasing during the course of execution. In other words, they can only be updated with natural number (as opposed to general integer) constants. This observation is justified by the following reduction.

For each register $r \in R^j_i$, we introduce two registers $r^+, r^-$. Let $(R^j_i)^\pm$ denote the vector of registers obtained by replacing each $r \in R^j_i$ with $(r^+, r^-)$. Intuitively, for each $r \in R^j_i$, the updates of $r$ in $\mathcal{A}^j_i$ are split into non-negative ones and negative ones, with the former stored in $r^+$ and the latter in $r^-$. Suppose $(R')^\pm = \bigcup_{i,j} (R^j_i)^\pm$. Then we construct monotonic CEFAs $(\mathcal{B}^j_i)_{i \in I, j \in J_i}$ and an LIA formula $\varphi^\pm$ such that

there are an assignment function $\theta : R' \to \mathbb{Z}$ and strings $(w_i)_{i \in I}$ such that $\varphi[\theta(R')/R']$ holds and $(w_i, \theta(R^j_i)) \in \mathcal{L}(\mathcal{A}^j_i)$ for every $i \in I$ and $j \in J_i$

if and only if

there are an assignment function $\theta^\pm : (R')^\pm \to \mathbb{N}$ and strings $(w_i)_{i \in I}$ such that $\varphi^\pm[\theta^\pm((R')^\pm)/(R')^\pm]$ holds and $(w_i, \theta^\pm((R^j_i)^\pm)) \in \mathcal{L}(\mathcal{B}^j_i)$ for every $i \in I$ and $j \in J_i$.

For $i \in I$ and $j \in J_i$, the CEFA $\mathcal{B}^j_i$ is obtained from $\mathcal{A}^j_i$ by replacing each transition $(q, a, q', \eta)$ in $\mathcal{A}^j_i$ by the transition $(q, a, q', \eta')$ such that for each $r \in R^j_i$,

$$\eta'(r^+) = \begin{cases} \eta(r), & \text{if } \eta(r) \ge 0 \\ 0 & \text{otherwise} \end{cases}, \qquad \eta'(r^-) = \begin{cases} 0, & \text{if } \eta(r) \ge 0 \\ -\eta(r) & \text{otherwise.} \end{cases}$$

In addition, $\varphi^\pm$ is obtained from $\varphi$ by replacing each $r \in R'$ with $r^+ - r^-$.

It remains to prove the SAT_CEFA[LIA] problem for monotonic CEFAs is in PSPACE, namely,

given a family of monotonic CEFAs $\{\mathcal{A}^j_i\}_{i \in I, j \in J_i}$ each of which carries a vector of registers $R^j_i$ and a quantifier-free LIA formula $\varphi$ such that the $R^j_i$ are pairwise disjoint, and the variables of $\varphi$ are from $R' = \bigcup_{i,j} R^j_i$, deciding whether there are an assignment function $\theta : R' \to \mathbb{N}$ and strings $(w_i)_{i \in I}$ such that $\varphi[\theta(R')/R']$ holds and $(w_i, \theta(R^j_i)) \in \mathcal{L}(\mathcal{A}^j_i)$ for every $i \in I$ and $j \in J_i$ is in PSPACE.

We use Proposition 16 in [21] to show the result. Proposition 16 in [21] mainly considered monotonic counter machines, which can be seen as monotonic CEFAs where each transition contains no alphabet symbol, and $\eta(r) \in \{0, 1\}$ for the update function $\eta$ therein.

For each $i \in I$ and $j \in J_i$, let $(\mathcal{A}')^j_i$ be the monotonic counter machine obtained from $\mathcal{A}^j_i$ by the following two-step procedure:
1. [Remove the alphabet symbols]: Remove alphabet symbols $a$ in each transition $(q, a, q', \eta)$ of $\mathcal{A}^j_i$.
2. [From binary encoding to unary encoding]: Replace each transition $(q, q', \eta)$ such that $\ell = \max_{r \in R^j_i} \eta(r) > 1$ by the transitions $(q, p_1, \eta'_1), \cdots, (p_{\ell-1}, q', \eta'_\ell)$, where $p_1, \cdots, p_{\ell-1}$ are freshly introduced states; moreover, $\eta'_j(r) = 1$ if $\eta(r) \ge j$, and $\eta'_j(r) = 0$ otherwise.

Proposition 16 in [21] can be stated as follows. Suppose we are given a family of monotonic counter machines $\{\mathcal{C}_i\}_{i \in I}$ each of which carries a vector of counters $R_i$ and a quantifier-free LIA formula $\varphi$ such that the $R_i$ are pairwise disjoint, and the variables of $\varphi$ are from $R' = \bigcup_i R_i$. If there is an assignment function $\theta : R' \to \mathbb{N}$ such that $\varphi[\theta(R')/R']$ holds and $\theta(R_i)$ is a reachable valuation of counters in $\mathcal{C}_i$ for every $i \in I$, then there are desired $\theta$ such that for each $i \in I$ and $r \in R_i$, $\theta(r)$ is at most polynomial in the number of states in $\mathcal{C}_i$, exponential in $|R_i|$, and exponential in $|\varphi|$.

For each $i \in I$, let $\mathcal{C}_i$ be the product of monotonic counter machines $(\mathcal{A}')^j_i$ for $j \in J_i$.
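The binary-to-unary step of the two-step procedure above can be sketched as follows. The representation of a transition as a triple `(source, target, updates)` with a dictionary of non-negative register updates, and the helper names `make_fresh` and `to_unary`, are assumptions made for this illustration.

```python
from itertools import count

def make_fresh():
    """Return a generator of fresh intermediate state names p1, p2, ..."""
    counter = count(1)
    return lambda: f"p{next(counter)}"

def to_unary(transition, fresh_state):
    """Split a transition whose largest update is l > 1 into a chain of l
    transitions that each update every register by 0 or 1."""
    q, q2, eta = transition
    l = max(eta.values(), default=0)
    if l <= 1:
        return [transition]  # already in the required 0/1 form
    states = [q] + [fresh_state() for _ in range(l - 1)] + [q2]
    # The j-th transition of the chain adds 1 to r exactly when eta(r) >= j,
    # so the chain as a whole adds eta(r) to r.
    return [
        (states[j - 1], states[j], {r: 1 if v >= j else 0 for r, v in eta.items()})
        for j in range(1, l + 1)
    ]

if __name__ == "__main__":
    fresh = make_fresh()
    for step in to_unary(("q", "q2", {"r1": 3, "r2": 1, "r3": 0}), fresh):
        print(step)
```

Each original transition with largest update ℓ contributes ℓ − 1 fresh states, which is why the size of the resulting counter machine depends on the magnitude of the constants and not only on their binary length.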
From the fact that the number of states of $(\mathcal{A}')_i^j$ is at most the product of the number of transitions of $\mathcal{A}_i^j$ and $B_{\mathcal{A}_i^j}$ (where $B_{\mathcal{A}_i^j}$ denotes the maximum natural number constant $\eta(r)$ in $\mathcal{A}_i^j$), we deduce the following: if there are an assignment function $\theta : R' \to \mathbb{N}$ and strings $(w_i)_{i \in I}$ such that $\phi[\theta(R')/R']$ holds and $(w_i, \theta(R_i^j)) \in \mathcal{L}(\mathcal{A}_i^j)$ for every $i \in I$ and $j \in J_i$, then there are such $\theta$ and $(w_i)_{i \in I}$ in which, for each $i \in I$ and $r \in \bigcup_{j \in J_i} R_i^j$, $\theta(r)$ is at most polynomial in the product of the number of transitions in $\mathcal{A}_i^j$ and $B_{\mathcal{A}_i^j}$ for $j \in J_i$, exponential in $\left|\bigcup_{j \in J_i} R_i^j\right|$, and exponential in $|\phi|$.

Since the values of all the registers in $\mathcal{A}_i^j$ for $i \in I$ and $j \in J_i$ can be assumed to be at most exponential, their binary encodings can be stored in polynomial space. One can therefore nondeterministically guess the strings $(w_i)_{i \in I}$, simulate the runs of the CEFAs $\mathcal{A}_i^j$ on $w_i$ for each $i \in I$ and $j \in J_i$, and finally evaluate $\phi$ with the register values after all $\mathcal{A}_i^j$ accept, all in polynomial space. From Savitch's theorem [23], we conclude that the $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ problem for monotonic CEFAs is in PSPACE. This concludes the proof of the proposition. □

Implementation
Algorithm 1: Function checkSat for Step II-III
Input: active: set of CEFA constraints, arith: arithmetic constraints, funApps: acyclic set of assignment statements.
Result: sat if the input constraints are satisfiable, and unsat otherwise.
for each partition $(I_l)_{l \in [5]}$ of the set of $\mathrm{indexOf}_v(x, i)$ in arith and each partition $(J_l)_{l \in [3]}$ of the set of $\mathrm{substring}(x, i, j)$ in funApps do
    /* the partitions refer to (1)-(5) for $\mathrm{indexOf}_v(x, i)$ and (1)-(3) for $\mathrm{substring}(x, i, j)$ in Step II of Section 4.3 */
    /* Case splits for semantics of indexOf and substring */
    (active, arith, funApps) = indexofCaseSplit(active, arith, funApps, $(I_l)_{l \in [5]}$);
    (active, arith, funApps) = substringCaseSplit(active, arith, funApps, $(J_l)_{l \in [3]}$);
    for each $\mathrm{length}(x)$ occurring in arith do
        choose a fresh integer variable $i$;
        active ← active ∪ $\{x \in \mathcal{A}_{\mathrm{len}}[i / r_1]\}$;
        arith ← arith$[i / \mathrm{length}(x)]$;
    for each $\mathrm{indexOf}_v(x, i)$ occurring in arith do
        choose fresh integer variables $i_1, i_2$;
        active ← active ∪ $\{x \in \mathcal{A}_{\mathrm{indexOf}_v}[i_1 / r_1, i_2 / r_2]\}$;
        arith ← arith$[i_2 / \mathrm{indexOf}_v(x, i)]$ ∧ $i = i_1$;
    if BackDfsExp(active, ∅, arith, funApps) then return sat;
return unsat;

OSTRICH+ performs a depth-first exploration of the search tree resulting from repeatedly splitting the disjunctions (or unions) in the cost-enriched recognisable pre-images of CERLs under string functions, as well as the case splits in the semantics of indexOf and substring. The pseudo-code of Steps II-III of the decision procedure is given by the function checkSat in Algorithm 1, which calls two functions, indexofCaseSplit in Algorithm 2 and substringCaseSplit in Algorithm 3, for the case splits in the semantics of $\mathrm{indexOf}_v$ and substring respectively. Moreover, checkSat calls a recursive function BackDfsExp in Algorithm 4 for the depth-first exploration (Step IV of the decision procedure), which in turn calls a function CheckCefaLIASat to solve the $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ problem (Step V). Note that Step I of the decision procedure is handled by the DPLL(T) procedure in Princess and is omitted here.
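The outer loop of checkSat ranges over all ways of assigning each indexOf term to one of five cases and each substring assignment to one of three cases. A minimal Python sketch of that enumeration follows. The function names `labelled_partitions` and `case_splits`, and the use of plain strings to stand for terms, are assumptions made for illustration and are not taken from the OSTRICH+ implementation.

```python
from itertools import product


def labelled_partitions(terms, num_blocks):
    """Yield every way to distribute `terms` over `num_blocks` blocks.

    Blocks are ordered and may be empty, matching a partition (I_l) indexed
    by l in [num_blocks] where block l collects the terms handled by case l.
    """
    for labels in product(range(num_blocks), repeat=len(terms)):
        blocks = [[] for _ in range(num_blocks)]
        for term, label in zip(terms, labels):
            blocks[label].append(term)
        yield blocks


def case_splits(index_of_terms, substring_terms):
    """Enumerate all pairs of partitions explored by the outer loop."""
    for i_blocks in labelled_partitions(index_of_terms, 5):
        for j_blocks in labelled_partitions(substring_terms, 3):
            yield i_blocks, j_blocks


# Two indexOf terms and one substring assignment (placeholder strings).
splits = list(case_splits(["indexOf_a(x, i)", "indexOf_b(y, k)"],
                          ["z := substring(x, i, j)"]))
```

With $n$ indexOf terms and $m$ substring assignments the loop visits $5^n \cdot 3^m$ combinations, so the example above produces 75 of them. A generator is used here so that the combinations are produced lazily and the search can stop at the first satisfiable one.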
Optimisations for solving the $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ problem. From Proposition 2, a natural approach to solving the $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ problem is to compute an existential LIA formula defining the Parikh image of products of CEFAs, and then use off-the-shelf SMT solvers (e.g. CVC4 or Z3) to decide the satisfiability of the existential LIA formula. However, our preliminary experiments show that this approach suffers from a scalability issue, in particular the state-space explosion when computing products of CEFAs. In the implementation of the function CheckCefaLIASat in Algorithm 4, we opt to utilise the symbolic model checker nuXmv [12] to mitigate the state-space explosion during the computation of products of CEFAs. The nuXmv tool is a well-known symbolic model checker that is capable of analysing both finite and infinite state systems. Our technique is to encode $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ as an instance of the model checking problem, which can be solved by nuXmv. Since $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ is a problem for quantifier-free LIA formulas and CEFAs that contain integer variables, the $\mathrm{SAT}_{\mathrm{CEFA}}[\mathrm{LIA}]$ problem actually corresponds to the problem of model checking infinite state systems.

Algorithm 2: indexofCaseSplit for case splits in the semantics of $\mathrm{indexOf}_v$
Input: active: set of CEFA constraints, arith: arithmetic constraint, funApps: acyclic set of assignment statements, and $(I_l)_{l \in [5]}$: subsets of $\mathrm{indexOf}_v(x, i)$ terms
Result: (active, arith, funApps)
for each $\mathrm{indexOf}_v(x, i) \in I_1$ do
    arith ← arith$[\mathrm{indexOf}_v(x, 0) / \mathrm{indexOf}_v(x, i)]$ ∧ $i < 0$;
for each $\mathrm{indexOf}_v(x, i) \in I_2$ do
    active ← active ∪ $\{x \in \mathcal{A}_{\overline{\Sigma^* v \Sigma^*}}\}$;
    arith ← arith$[-1 / \mathrm{indexOf}_v(x, i)]$ ∧ $i < 0$;
for each $\mathrm{indexOf}_v(x, i) \in I_3$ do
    arith ← arith$[-1 / \mathrm{indexOf}_v(x, i)]$ ∧ $i \ge \mathrm{length}(x)$;
for each $\mathrm{indexOf}_v(x, i) \in I_4$ do
    arith ← arith$[-1 / \mathrm{indexOf}_v(x, i)]$ ∧ $i \ge 0$ ∧ $i < \mathrm{length}(x)$;
for each $\mathrm{indexOf}_v(x, i) \in I_5$ do
    choose fresh variables $y$ and $j$;
    active ← active ∪ $\{y \in \mathcal{A}_{\overline{\Sigma^* v \Sigma^*}}\}$;
    arith ← arith$[-1 / \mathrm{indexOf}_v(x, i)]$ ∧ $i \ge 0$ ∧ $i < \mathrm{length}(x)$ ∧ $j = \mathrm{length}(x) - i$;
    funApps ← funApps ∪ $\{y := \mathrm{substring}(x, i, j)\}$;

Algorithm 3: substringCaseSplit for case splits in the semantics of substring
Input: active: set of CEFA constraints, arith: arithmetic constraint, funApps: acyclic set of assignment statements, and $(J_l)_{l \in [3]}$: subsets of the assignments $y := \mathrm{substring}(x, i, j)$
Result: (active, arith, funApps)
for each $y := \mathrm{substring}(x, i, j) \in J_1$ do
    arith ← arith ∧ $i \ge 0$ ∧ $i + j \le \mathrm{length}(x)$;
for each $y := \mathrm{substring}(x, i, j) \in J_2$ do
    choose a fresh integer variable $i'$;
    arith ← arith ∧ $i \ge 0$ ∧ $i \le \mathrm{length}(x)$ ∧ $i + j > \mathrm{length}(x)$ ∧ $i' = \mathrm{length}(x) - i$;
    funApps ← funApps$[y := \mathrm{substring}(x, i, i') / y := \mathrm{substring}(x, i, j)]$;
for each $y := \mathrm{substring}(x, i, j) \in J_3$ do
    arith ← arith ∧ $i < 0$;
    active ← active ∪ $\{y \in \mathcal{A}_{\varepsilon}\}$;
    funApps ← funApps \ $\{y := \mathrm{substring}(x, i, j)\}$;

Algorithm 4:
Function BackDfsExp for Step IV (depth-first exploration)
Input: active, passive: sets of CEFA constraints, arith: arithmetic constraints, funApps: acyclic set of assignment statements.
Result: sat if the input constraints are satisfiable, and unsat otherwise.
if active = ∅ then
    /* Check whether the LIA constraint arith is satisfiable with respect to the CEFA constraints in passive (i.e. Step V). */
    return CheckCefaLIASat(passive, arith);
else
    choose a CEFA constraint $x \in \mathcal{A}$ in active with $R(\mathcal{A}) = (r_1, \cdots, r_k)$;
    if there is an assignment $x := f(y_1, \vec{i}_1, \ldots, y_l, \vec{i}_l)$ defining $x$ in funApps with $\vec{i}_j = (i_{j,1}, \cdots, i_{j,k_j})$ for $j \in [l]$ then
        compute $f^{-1}_{R(\mathcal{A})}(\mathcal{L}(\mathcal{A})) = \left( (\mathcal{A}^{(1)}_j, \cdots, \mathcal{A}^{(l)}_j)_{j \in [n]}, \vec{t} \right)$ where $R\left(\mathcal{A}^{(j')}_j\right) = \left( (r')^{(j',1)}, \cdots, (r')^{(j',k_{j'})}, r^{(j')}_1, \cdots, r^{(j')}_k \right)$ for $j \in [n]$ and $j' \in [l]$;
        active ← active \ $\{x \in \mathcal{A}\}$;
        passive ← passive ∪ $\{x \in \mathcal{A}\}$;
        for $j$ ← 1 to $n$ do
            active ← active ∪ $\{y_1 \in \mathcal{A}^{(1)}_j, \ldots, y_l \in \mathcal{A}^{(l)}_j\}$;
            arith ← arith ∧ $\bigwedge_{j' \in [l], j'' \in [k_{j'}]} i_{j',j''} = (r')^{(j',j'')}$ ∧ $\bigwedge_{j' \in [k]} r_{j'} = t_{j'}$;
            if active ∪ passive is inconsistent then
                continue; /* backtrack */
            else
                switch BackDfsExp(active, passive, arith, funApps) do
                    case sat do return sat;
                    case unsat do continue; /* backtrack */
        return unsat;
    else
        return BackDfsExp(active \ $\{x \in \mathcal{A}\}$, passive ∪ $\{x \in \mathcal{A}\}$, arith, funApps);
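The control flow of BackDfsExp can be illustrated on a heavily simplified model. The Python sketch below is not the OSTRICH+ implementation and makes several loud assumptions: languages are finite sets of strings rather than CEFAs, the only string function is binary concatenation, there are no registers and no arithmetic constraints, and the final call to CheckCefaLIASat is replaced by a plain non-emptiness check. All names (`concat_preimage`, `consistent`, `back_dfs`) are hypothetical.

```python
def concat_preimage(lang):
    """Pre-image of a finite language under concatenation.

    Returned as a finite union of products: one (prefix language,
    suffix language) pair per way of cutting a word of `lang` in two.
    """
    return [(frozenset({w[:k]}), frozenset({w[k:]}))
            for w in sorted(lang) for k in range(len(w) + 1)]


def consistent(constraints):
    """True if, for every variable, its languages share at least one word."""
    common = {}
    for var, lang in constraints:
        common[var] = common[var] & lang if var in common else set(lang)
    return all(common.values())


def back_dfs(active, passive, fun_apps):
    """Backward depth-first exploration over disjuncts of pre-images.

    active, passive: lists of (variable, language) constraints.
    fun_apps: maps x to (y, z) for each assignment x := y . z (assumed acyclic).
    """
    if not active:
        # Stand-in for Step V: no integer constraints in this toy model.
        return consistent(passive)
    (x, lang), rest = active[0], active[1:]
    passive = passive + [(x, lang)]
    if x not in fun_apps:
        return back_dfs(rest, passive, fun_apps)
    y, z = fun_apps[x]
    for lang_y, lang_z in concat_preimage(lang):
        new_active = rest + [(y, lang_y), (z, lang_z)]
        if not consistent(new_active + passive):
            continue  # backtrack
        if back_dfs(new_active, passive, fun_apps):
            return True
    return False


fun_apps = {"x": ("y", "z")}
sat_result = back_dfs([("x", {"ab"}), ("y", {"a"})], [], fun_apps)
unsat_result = back_dfs([("x", {"ab"}), ("y", {"b"})], [], fun_apps)
```

In the first call the split of "ab" into "a" and "b" is compatible with the constraint on y, so the search succeeds. In the second call no split gives y the value "b", so every branch is pruned by the consistency check and the search reports unsatisfiability. In the actual procedure each branch additionally extends arith with equations linking the integer parameters and registers, which this sketch omits.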