LTL Model Checking of Self Modifying Code
Tayssir Touili and Xin Ye CNRS,LIPN and University Paris 13 East China Normal University, Shanghai, China
Abstract.
Self modifying code is code that can modify its own instructions during the execution of the program. It is extensively used by malware writers to obfuscate their malicious code. Thus, analysing self modifying code is nowadays a big challenge. In this paper, we consider the LTL model-checking problem of self modifying code. We model such programs using self-modifying pushdown systems (SM-PDS), an extension of pushdown systems that can modify its own set of transitions during execution. We reduce the LTL model-checking problem to the emptiness problem of self-modifying Büchi pushdown systems (SM-BPDS). We implemented our techniques in a tool that we successfully applied for the detection of several self-modifying malware. Our tool was also able to detect several malwares that well-known antiviruses such as BitDefender, Kinsoft, Avira, eScan, Kaspersky, Qihoo-360, Baidu, Avast, and Symantec failed to detect.
Binary code presents several complex aspects that cannot be encountered in source code. One of these aspects is self-modifying code, i.e., code that can modify its own instructions during the execution of the program. Self-modifying code makes reverse code engineering harder. Thus, it is extensively used to protect software intellectual property. It is also heavily used by malware writers in order to make their malwares hard to analyse and detect by static analysers and anti-viruses. Thus, it is crucial to be able to analyse self-modifying code.

There are several kinds of self-modifying code. In this work, we consider self-modifying code caused by self-modifying instructions. These kinds of instructions treat code as data. This allows them to read and write into code, which is what makes them self-modifying. These self-modifying instructions are usually mov instructions, since mov allows one to access memory and to read and write into it.

Let us consider the example shown in Fig. 1. For simplicity, the addresses' length is assumed to be 1 byte. In the right box, we give, respectively, the binary code, the addresses of the different instructions, and the corresponding assembly code, obtained by translating syntactically the binary code at each address. For example, 0c is the binary code of the jump jmp. Thus, 0c 02 is translated to jmp 0x2 (jump to address 0x2). The second line is translated to push 0x9, since ff is the binary code of the instruction push. The third instruction mov 0x2 0xc will replace the first byte at address 0x2 by 0xc. Thus, at address 0x2, ff 09 is replaced by 0c 09. This means the instruction push 0x9 is replaced by the jump instruction jmp 0x9 (jump to address 0x9), etc. Therefore, this code is self-modifying: the mov instruction was able to modify the instructions of the program via its ability to read and write the memory. If we study this code without looking at the semantics of the self-modifying instructions, we will extract from it the Control Flow Graph CFG a that is on the left of the figure, and we will reach the conclusion that the call to the API function CopyFileA cannot be made. However, the correct CFG is the one on the right-hand side, CFG b, where the call to the API function CopyFileA can be reached. Thus, it is very important to be able to take into account the semantics of the self-modifying instructions in binary code.
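The effect of the mov instruction in this example can be illustrated with a small toy model. This is only a sketch: the opcodes 0c (jmp) and ff (push) come from the text above, while the memory layout, the two-byte instruction format, and the helper names `decode` and `mov` are assumptions made for illustration.

```python
# Toy model of the example. Opcodes 0x0c (jmp) and 0xff (push) are from the text;
# the memory layout and helper names are hypothetical.
OPCODES = {0x0C: "jmp", 0xFF: "push"}


def decode(memory, addr):
    """Syntactically translate the two bytes stored at addr into assembly."""
    opcode, operand = memory[addr], memory[addr + 1]
    return f"{OPCODES[opcode]} {hex(operand)}"


def mov(memory, addr, value):
    """Model `mov addr value`: overwrite the first byte at addr with value."""
    patched = dict(memory)
    patched[addr] = value
    return patched


# Bytes 0c 02 at address 0x0 and ff 09 at address 0x2 (assumed layout).
memory = {0x0: 0x0C, 0x1: 0x02, 0x2: 0xFF, 0x3: 0x09}

before = decode(memory, 0x2)
after = decode(mov(memory, 0x2, 0x0C), 0x2)
print(before, "->", after)
```

A purely syntactic disassembly sees only `before`, whereas an analysis that accounts for the mov semantics also sees `after`, which is what distinguishes the two CFGs discussed above.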
[Figure: the two CFGs, together with the binary codes, addresses, and assembly codes of the program, including the code after execution of mov 0x2 0xc]
Fig. 1: An Example of a Self-modifying Code

In this paper, we consider the LTL model-checking problem of self-modifying code. To this aim, we use Self-Modifying Pushdown Systems (SM-PDSs) [29] to model self-modifying code. Indeed, SM-PDSs were shown in [29] to be an adequate model for self-modifying code since they allow one to mimic the program's stack while taking into account the self-modifying semantics of the transitions. This is very important for binary code analysis and malware detection, since malwares are based on calls to API functions of the operating system. Thus, antiviruses check the API calls to determine whether a program is malicious or not. Therefore, to evade these antiviruses, malware writers try to hide the API calls they make by replacing calls by push and jump instructions. Thus, to be able to analyse such malwares, it is crucial to be able to analyse the program's stack. Hence the need for a model like pushdown systems and self-modifying pushdown systems, since they allow one to mimic the program's stack.

Intuitively, a SM-PDS is a pushdown system (PDS) with self-modifying rules, i.e., with rules that allow one to modify the current set of transitions during execution. This model was introduced in [29] in order to represent self-modifying code. In [29], the authors have proposed algorithms to compute finite automata that accept the forward and backward reachability sets of SM-PDSs. In this work, we tackle the problem of LTL model-checking of SM-PDSs. Since SM-PDSs are equivalent to PDSs [29], one possible approach for LTL model checking of SM-PDSs is to translate the SM-PDS to a standard PDS and then run the LTL model checking algorithm on the equivalent PDS [2,10]. But the translation from a SM-PDS to a standard PDS is exponential. Thus, performing the LTL model checking on the equivalent PDS is not efficient.

To overcome this limitation, we propose a direct LTL model checking algorithm for SM-PDSs. Our algorithm is based on reducing the LTL model checking problem to the emptiness problem of Self Modifying Büchi Pushdown Systems (SM-BPDS). Intuitively, we obtain this SM-BPDS by taking the product of the SM-PDS with a Büchi automaton accepting an LTL formula $\varphi$. Then, we solve the emptiness problem of an SM-BPDS by computing its repeating heads. This computation is based on computing labelled $pre^*$ configurations by applying a saturation procedure on labelled finite automata.

We implemented our algorithm in a tool. Our experiments show that our direct techniques are much more efficient than translating the SM-PDS to an equivalent PDS and then applying the standard LTL model checking for PDSs [2,10]. Moreover, we successfully applied our tool to the analysis of 892 self-modifying malwares. Our tool was also able to detect several self-modifying malwares that well-known antiviruses like BitDefender, Kinsoft, Avira, eScan, Kaspersky, Qihoo-360, Baidu, Avast, and Symantec were not able to detect.

Related Work.
Model checking and static analysis approaches have been widely used to analyze binary programs, for instance, in [9,5,23,11,3]. Temporal logics were chosen to describe malicious behaviors in [20,11,3,4,8]. However, these works cannot deal with self-modifying code.

POMMADE [3,4] is a malware detector based on LTL and CTL model-checking of PDSs. STAMAD [15,16,14] is a malware detector based on PDSs and machine learning. However, POMMADE and STAMAD cannot deal with self-modifying code.

Cai et al. [7] use local reasoning and separation logic to describe self-modifying code and treat program code uniformly as regular data structure. However, [7] requires programs to be manually annotated with invariants. In [26], the authors propose a formal semantics for self-modifying codes, and use that to represent self-unpacking code. This work only deals with packing and unpacking behaviours. Bonfante et al. [6] provide an operational semantics for self-modifying programs and show that they can be constructively rewritten to a non-modifying program. However, all these specifications [6,7,26] are too abstract to be used in practice.

In [1], the authors propose a new representation of self-modifying code named State Enhanced-Control Flow Graph (SE-CFG). SE-CFG extends standard control flow graphs with a new data structure, keeping track of the possible states programs can reach, and with edges that can be conditional on the state of the target memory location. It is not easy to analyse a binary program only using its SE-CFG, especially since this representation does not allow one to take into account the stack of the program.

[24] propose abstract interpretation techniques to compute an over-approximation of the set of reachable states of a self-modifying program, where for each control point of the program, an over-approximation of the memory state at this control point is provided. [18] combine static and dynamic analysis techniques to analyse self-modifying programs.
Unlike our approach, these techniques [24,18] cannot handle the program's stack.

Unpacking binary code is also considered in [13,17,22,26]. These works do not consider self-modifying mov instructions.
Outline.
The rest of the paper is structured as follows: Section 2 recalls the definition of Self Modifying pushdown systems. LTL model checking and SM-BPDSs are defined in Section 3. Section 4 solves the emptiness problem of SM-BPDS. Finally, the experiments are reported in Section 5.
We recall in this section the definition of Self-modifying Pushdown Systems [29].
Definition 1.
A Self-modifying Pushdown System (SM-PDS) is a tuple $\mathcal{P} = (P, \Gamma, \Delta, \Delta_c)$, where $P$ is a finite set of control points, $\Gamma$ is a finite set of stack symbols, $\Delta \subseteq (P \times \Gamma) \times (P \times \Gamma^*)$ is a finite set of transition rules, and $\Delta_c \subseteq P \times \Delta \times \Delta \times P$ is a finite set of modifying transition rules. If $((p, \gamma), (p', w)) \in \Delta$, we also write $\langle p, \gamma \rangle \hookrightarrow \langle p', w \rangle \in \Delta$. If $(p, r_1, r_2, p') \in \Delta_c$, we also write $p \xhookrightarrow{(r_1, r_2)} p' \in \Delta_c$. A Pushdown System (PDS) is a SM-PDS where $\Delta_c = \emptyset$.

Intuitively, a Self-modifying Pushdown System is a Pushdown System that can dynamically modify its set of rules during the execution time: rules $\Delta$ are standard PDS transition rules, while rules $\Delta_c$ modify the current set of transition rules. $\langle p, \gamma \rangle \hookrightarrow \langle p', w \rangle \in \Delta$ expresses that if the SM-PDS is in control point $p$ and has $\gamma$ on top of its stack, then it can move to control point $p'$, pop $\gamma$ and push $w$ onto the stack, while $p \xhookrightarrow{(r_1, r_2)} p' \in \Delta_c$ expresses that when the SM-PDS is in control point $p$, then it can move to control point $p'$, remove the rule $r_1$ from its current set of transition rules, and add the rule $r_2$.

Formally, a configuration of a SM-PDS is a tuple $c = (\langle p, w \rangle, \theta)$ where $p \in P$ is the control point, $w \in \Gamma^*$ is the stack content, and $\theta \subseteq \Delta \cup \Delta_c$ is the current set of transition rules of the SM-PDS. $\theta$ is called the current phase of the SM-PDS. When the SM-PDS is a PDS, i.e., when $\Delta_c = \emptyset$, a configuration is a tuple $c = (\langle p, w \rangle, \Delta)$, since there is no changing rule, so there is only one possible phase. In this case, we can also write $c = \langle p, w \rangle$. Let $\mathcal{C}$ be the set of configurations of a SM-PDS. A SM-PDS defines a transition relation $\Rightarrow_{\mathcal{P}}$ between configurations as follows: let $c = (\langle p, w \rangle, \theta)$ be a configuration, and let $r$ be a rule in $\theta$, then:

1. if $r \in \Delta_c$ is of the form $r = p \xhookrightarrow{(r_1, r_2)} p'$, such that $r_1 \in \theta$, then $(\langle p, w \rangle, \theta) \Rightarrow_{\mathcal{P}} (\langle p', w \rangle, \theta')$, where $\theta' = (\theta \setminus \{r_1\}) \cup \{r_2\}$. In other words, the transition rule $r$ updates the current set of transition rules $\theta$ by removing $r_1$ from it and adding $r_2$ to it.
2. if $r \in \Delta$ is of the form $r = \langle p, \gamma \rangle \hookrightarrow \langle p', w' \rangle$, then $(\langle p, \gamma w \rangle, \theta) \Rightarrow_{\mathcal{P}} (\langle p', w' w \rangle, \theta)$. In other words, the transition rule $r$ moves the control point from $p$ to $p'$, pops $\gamma$ from the stack and pushes $w'$ onto the stack. This transition keeps the current set of transition rules $\theta$ unchanged.

Let $\Rightarrow^*_{\mathcal{P}}$ be the transitive, reflexive closure of $\Rightarrow_{\mathcal{P}}$ and $\Rightarrow^+_{\mathcal{P}}$ be its transitive closure. An execution (a run) of $\mathcal{P}$ is a sequence of configurations $\pi = c_0 c_1 \dots$ s.t. $c_i \Rightarrow_{\mathcal{P}} c_{i+1}$ for every $i \geq 0$. Given a configuration $c$, the set of immediate predecessors (resp. successors) of $c$ is $pre_{\mathcal{P}}(c) = \{c' \in \mathcal{C} : c' \Rightarrow_{\mathcal{P}} c\}$ (resp. $post_{\mathcal{P}}(c) = \{c' \in \mathcal{C} : c \Rightarrow_{\mathcal{P}} c'\}$). These notations can be generalized straightforwardly to sets of configurations. Let $pre^*_{\mathcal{P}}$ (resp. $post^*_{\mathcal{P}}$) denote the reflexive-transitive closure of $pre_{\mathcal{P}}$ (resp. $post_{\mathcal{P}}$). We remove the subscript $\mathcal{P}$ when it is clear from the context.

We suppose w.l.o.g. that rules in $\Delta$ are of the form $\langle p, \gamma \rangle \hookrightarrow \langle p', w \rangle$ such that $|w| \leq 2$, and that the self-modifying rules $r = p \xhookrightarrow{(r_1, r_2)} p'$ in $\Delta_c$ are such that $r_1 \neq r_2$. Note that this is not a restriction, since for a given SM-PDS, one can compute an equivalent SM-PDS that satisfies these conditions [29].

Let $\mathcal{P} = (P, \Gamma, \Delta, \Delta_c)$ be a SM-PDS. It was shown in [29] that:

1. $\mathcal{P}$ can be described by an equivalent pushdown system (PDS). Indeed, since the number of phases is finite, we can encode phases in the control point of the PDS. However, this translation is not efficient since the number of control points of the equivalent PDS is $|P| \cdot 2^{O(|\Delta| + |\Delta_c|)}$.
2. $\mathcal{P}$ can also be described by an equivalent Symbolic pushdown system [27], where each SM-PDS rule is represented by a single, symbolic transition, where the different values of the phases are encoded in a symbolic way using relations between phases. This translation is not efficient either, since the size of the relations used in the symbolic transitions is $2^{O(|\Delta| + |\Delta_c|)}$.

It is shown in [29] how to describe a self-modifying binary code using a SM-PDS. The basic idea is that the control locations of the SM-PDS store the control points of the binary program and the stack mimics the program's stack. Our translation relies on the disassembler Jakstab [12] to disassemble binary code, construct the control flow graph (CFG), determine indirect jumps, and compute the possible values of used variables, registers and memory locations at each control point of the program. After getting the control flow graph whose edges are equipped with disassembled instructions, we translate the CFG into a SM-PDS as described in [29]. The non self-modifying instructions of the program define the rules $\Delta$ of the SM-PDS (which are standard PDS rules), and can be obtained following the translation of [3] that models non self-modifying instructions of the program by a PDS. Self-modifying instructions are represented using self-modifying transitions $\Delta_c$ of the SM-PDS.
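The transition relation of Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `Rule` and `ModRule`, the function `successors`, and the encoding of a configuration as a triple (control point, stack tuple, frozenset of rules) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule <p, gamma> -> <p2, w> of Delta (hypothetical encoding)."""
    p: str
    gamma: str
    p2: str
    w: tuple


@dataclass(frozen=True)
class ModRule:
    """A rule p -(r1, r2)-> p2 of Delta_c (hypothetical encoding)."""
    p: str
    r1: Rule
    r2: Rule
    p2: str


def successors(config):
    """Yield every configuration c2 such that config => c2."""
    p, stack, theta = config
    for r in theta:
        if isinstance(r, ModRule):
            # Modifying rule: stack unchanged, phase loses r1 and gains r2.
            if r.p == p and r.r1 in theta:
                yield (r.p2, stack, (theta - {r.r1}) | {r.r2})
        elif r.p == p and stack and stack[0] == r.gamma:
            # Standard PDS rule: pop gamma, push w, phase unchanged.
            yield (r.p2, r.w + stack[1:], theta)


push = Rule("p0", "a", "p1", ("b", "a"))
jump = Rule("p0", "a", "p2", ("a",))
mod = ModRule("p0", push, jump, "p0")

theta0 = frozenset({push, mod})
c0 = ("p0", ("a",), theta0)
succ0 = set(successors(c0))
```

In this toy system the modifying rule `mod` replaces `push` by `jump`, so from the phase `{mod, jump}` only the jump to `p2` remains possible, mirroring how a mov instruction rewrites a push into a jmp.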
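For more details, we refer the reader to [29].

Let $At$ be a finite set of atomic propositions. LTL formulas are defined as follows (where $A \in At$):

$$\varphi ::= A \mid \neg \varphi \mid \varphi_1 \vee \varphi_2 \mid X \varphi \mid \varphi_1 U \varphi_2$$

Formulae are interpreted on infinite words over $2^{At}$. Let $\omega = \omega^0 \omega^1 \dots$ be an infinite word over $2^{At}$. We write $\omega_i$ for the suffix of $\omega$ starting at $\omega^i$. We denote $\omega \models \varphi$ to express that $\omega$ satisfies a formula $\varphi$:

$$\begin{array}{lcl}
\omega \models A & \iff & A \in \omega^0 \\
\omega \models \neg \varphi & \iff & \omega \not\models \varphi \\
\omega \models \varphi_1 \vee \varphi_2 & \iff & \omega \models \varphi_1 \text{ or } \omega \models \varphi_2 \\
\omega \models X \varphi & \iff & \omega_1 \models \varphi \\
\omega \models \varphi_1 U \varphi_2 & \iff & \exists i \geq 0,\ \omega_i \models \varphi_2 \text{ and } \forall\, 0 \leq j < i,\ \omega_j \models \varphi_1
\end{array}$$

The temporal operators $G$ (globally) and $F$ (eventually) are defined as follows: $F \varphi = (A \vee \neg A)\, U \varphi$ and $G \varphi = \neg F \neg \varphi$. Let $W(\varphi)$ be the set of infinite words that satisfy an LTL formula $\varphi$. It is well known that $W(\varphi)$ can be accepted by Büchi automata:

Definition 2.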
A Büchi automaton $\mathcal{B}$ is a quintuple $(Q, \Gamma, \eta, q_0, F)$ where $Q$ is a finite set of states, $\Gamma$ is a finite input alphabet, $\eta \subseteq (Q \times \Gamma \times Q)$ is a set of transitions, $q_0 \in Q$ is the initial state and $F \subseteq Q$ is the set of accepting states. A run of $\mathcal{B}$ on a word $\gamma_0 \gamma_1 \dots \in \Gamma^\omega$ is a sequence of states $q_0 q_1 q_2 \dots$ s.t. $\forall i \geq 0$, $(q_i, \gamma_i, q_{i+1}) \in \eta$. An infinite word $\omega$ is accepted by $\mathcal{B}$ if $\mathcal{B}$ has a run on $\omega$ that starts at $q_0$ and visits accepting states from $F$ infinitely often.

Theorem. [19] Given an LTL formula $\varphi$, one can effectively construct a Büchi automaton $\mathcal{B}_\varphi$ which accepts $W(\varphi)$.

Definition 3. A Self Modifying Büchi Pushdown System (SM-BPDS) is a tuple $\mathcal{BP} = (P, \Gamma, \Delta, \Delta_c, G)$ where $P$ is a set of control locations, $G \subseteq P$ is a set of accepting control locations, $\Delta \subseteq (P \times \Gamma) \times (P \times \Gamma^*)$ is a finite set of transition rules, and $\Delta_c \subseteq P \times 2^{\Delta \cup \Delta_c} \times 2^{\Delta \cup \Delta_c} \times P$ is a finite set of modifying transition rules of the form $p \xhookrightarrow{(\sigma, \sigma')} p'$ where $\sigma, \sigma' \subseteq \Delta \cup \Delta_c$.

Let $\Rightarrow_{\mathcal{BP}}$ be the transition relation between configurations defined as follows: let $\theta \subseteq \Delta \cup \Delta_c$, $\gamma \in \Gamma$, $w \in \Gamma^*$, and $p \in P$, then

1. If $r : \langle p, \gamma \rangle \hookrightarrow \langle p', w' \rangle \in \Delta$ and $r \in \theta$, then $(\langle p, \gamma w \rangle, \theta) \Rightarrow_{\mathcal{BP}} (\langle p', w' w \rangle, \theta)$.
2. If $r : p \xhookrightarrow{(\sigma, \sigma')} p' \in \Delta_c$, $\sigma \cap \theta \neq \emptyset$ and $r \in \theta$, then $(\langle p, \gamma w \rangle, \theta) \Rightarrow_{\mathcal{BP}} (\langle p', \gamma w \rangle, \theta')$ where $\theta' = (\theta \setminus \sigma) \cup \sigma'$.

A run $\pi$ of $\mathcal{BP}$ is a sequence of configurations $\pi = c_0 c_1 \dots$ s.t. $c_i \Rightarrow_{\mathcal{BP}} c_{i+1}$ for every $i \geq 0$. $\pi$ is accepting iff it infinitely often visits configurations having control locations in $G$.

Let $c$ and $c'$ be two configurations of the SM-BPDS $\mathcal{BP}$. The relation $\Rightarrow^r_{\mathcal{BP}}$ is defined as follows: $c \Rightarrow^r_{\mathcal{BP}} c'$ iff there exists a configuration $(\langle g, u \rangle, \theta)$, $g \in G$, s.t. $c \Rightarrow^*_{\mathcal{BP}} (\langle g, u \rangle, \theta) \Rightarrow^+_{\mathcal{BP}} c'$.
We remove the subscript $\mathcal{BP}$ when it is clear from the context. We define $\overset{i}{\Rightarrow}$ as follows: $c \overset{i}{\Rightarrow} c'$ iff there exists a sequence of configurations $c_0 \Rightarrow_{\mathcal{BP}} c_1 \Rightarrow_{\mathcal{BP}} \dots \Rightarrow_{\mathcal{BP}} c_i$ s.t. $c_0 = c$ and $c_i = c'$.

A head of SM-BPDS is a tuple $(\langle p, \gamma \rangle, \theta)$ where $p \in P$, $\gamma \in \Gamma$ and $\theta \subseteq \Delta \cup \Delta_c$. A head $((p, \gamma), \theta)$ is repeating if there exists $v \in \Gamma^*$ such that $(\langle p, \gamma \rangle, \theta) \Rightarrow^r_{\mathcal{BP}} (\langle p, \gamma v \rangle, \theta)$. The set of repeating heads of SM-BPDS is called $Rep_{\mathcal{BP}}$. We assume w.l.o.g. that for every rule in $\Delta_c$ of the form $r : p \xhookrightarrow{(\sigma, \sigma')} p'$, $r \notin \sigma$.

Let $\mathcal{P} = (P, \Gamma, \Delta, \Delta_c)$ be a self modifying pushdown system. Let $At$ be a set of atomic propositions. Let $\nu : P \rightarrow 2^{At}$ be a labelling function. Let $\pi = (\langle p_0, w_0 \rangle, \theta_0)(\langle p_1, w_1 \rangle, \theta_1) \dots$ be an execution of the SM-PDS $\mathcal{P}$. Let $\varphi$ be an LTL formula over the set of atomic propositions $At$. We say that

$$\pi \models_\nu \varphi \iff \nu(p_0)\nu(p_1) \dots \models \varphi$$

Let $(\langle p, w \rangle, \theta)$ be a configuration of $\mathcal{P}$. We say that $(\langle p, w \rangle, \theta) \models_\nu \varphi$ iff $\mathcal{P}$ has a path $\pi$ starting at $(\langle p, w \rangle, \theta)$ such that $\pi \models_\nu \varphi$.

Our goal in this paper is to perform LTL model-checking for self-modifying pushdown systems. Since SM-PDSs can be translated to standard (symbolic) pushdown systems, one way to solve this LTL model-checking problem is to compute the (symbolic) pushdown system that is equivalent to the SM-PDS (see section 2.2), and then apply the standard LTL model-checking algorithms on standard PDSs [27]. However, this approach is not efficient (as will be witnessed later in the experiments). Thus, we need a direct approach that performs LTL model-checking on the SM-PDS, without translating it to an equivalent PDS. Let $\mathcal{B}_\varphi = (Q, 2^{At}, \eta, q_0, F)$ be a Büchi automaton that accepts $W(\varphi)$.
We compute the SM-BPDS $\mathcal{BP}_\varphi = (P \times Q, \Gamma, \Delta', \Delta'_c, G)$ by performing a kind of product between the SM-PDS $\mathcal{P}$ and the Büchi automaton $\mathcal{B}_\varphi$ as follows:

1. if $r = \langle p, \gamma \rangle \hookrightarrow \langle p', w \rangle \in \Delta$ and $(q, \nu(p), q') \in \eta$, then $\langle (p, q), \gamma \rangle \hookrightarrow \langle (p', q'), w \rangle \in \Delta'$. Let $prod(r)$ be the set of rules of $\Delta'$ obtained from the rule $r$, i.e., rules of $\Delta'$ of the form $\langle (p, q), \gamma \rangle \hookrightarrow \langle (p', q'), w \rangle$.
2. if a rule $r = p \xhookrightarrow{(r_1, r_2)} p' \in \Delta_c$ and $(q, \nu(p), q') \in \eta$, then $(p, q) \xhookrightarrow{(\sigma, \sigma')} (p', q') \in \Delta'_c$ where $\sigma = prod(r_1)$, $\sigma' = prod(r_2)$. Let $prod(r)$ be the set of rules of $\Delta'_c$ obtained from the rule $r$, i.e., rules of $\Delta'_c$ of the form $(p, q) \xhookrightarrow{(\sigma, \sigma')} (p', q')$.
3. $G = P \times F$.

We can show that:

Theorem 1.
Let $(\langle p, w \rangle, \theta)$ be a configuration of the SM-PDS $\mathcal{P}$. $(\langle p, w \rangle, \theta) \models_\nu \varphi$ iff $\mathcal{BP}_\varphi$ has an accepting run from $(\langle (p, q_0), w \rangle, prod(\theta))$ where $prod(\theta)$ is the set of rules of $\Delta' \cup \Delta'_c$ obtained from the rules of $\theta$ as described above.

Thus, LTL model-checking for SM-PDSs can be reduced to checking whether a SM-BPDS has an accepting run. The rest of the paper is devoted to this problem.
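The product between the SM-PDS and the Büchi automaton can be sketched as follows. This is only an illustration under stated assumptions: rules of $\Delta$ are encoded as tuples `(p, gamma, p2, w)`, rules of $\Delta_c$ as tuples `(p, r1, r2, p2)`, Büchi transitions as triples `(q, letter, q2)`, and the function name `product` is hypothetical.

```python
def product(delta, delta_c, eta, nu):
    """Map each SM-PDS rule r to prod(r), the product rules obtained from it.

    delta   : set of (p, gamma, p2, w)      -- standard rules
    delta_c : set of (p, r1, r2, p2)        -- modifying rules, r1 and r2 in delta
    eta     : set of (q, letter, q2)        -- Buchi transitions
    nu      : dict p -> frozenset of atomic propositions (labelling function)
    """
    prod = {}
    # Item 1: pair every standard rule with every Buchi move reading nu(p).
    for r in delta:
        p, gamma, p2, w = r
        prod[r] = frozenset(
            ((p, q), gamma, (p2, q2), w)
            for (q, letter, q2) in eta if letter == nu[p]
        )
    # Item 2: modifying rules swap the whole sets prod(r1) and prod(r2).
    for r in delta_c:
        p, r1, r2, p2 = r
        prod[r] = frozenset(
            ((p, q), prod[r1], prod[r2], (p2, q2))
            for (q, letter, q2) in eta if letter == nu[p]
        )
    return prod


r1 = ("p0", "a", "p1", ("a",))
r2 = ("p0", "a", "p0", ("a",))
rc = ("p0", r1, r2, "p0")
nu = {"p0": frozenset({"A"}), "p1": frozenset()}
eta = {("q0", frozenset({"A"}), "q1"), ("q0", frozenset(), "q0")}

prod = product({r1, r2}, {rc}, eta, nu)
```

The accepting set of item 3 would then simply be the pairs `(p, q)` with `q` an accepting Büchi state; it is omitted here to keep the sketch short.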
From now on, we fix a SM-BPDS $\mathcal{BP} = (P, \Gamma, \Delta, \Delta_c, G)$. We can show that $\mathcal{BP}$ has an accepting run starting from a configuration $c$ if and only if from $c$, it can reach a configuration with a repeating head:

Proposition 1.
A SM-BPDS $\mathcal{BP}$ has an accepting run starting from a configuration $c$ if and only if there exists a repeating head $((p, \gamma), \theta)$ such that $c \Rightarrow^*_{\mathcal{BP}} (\langle p, \gamma w \rangle, \theta)$ for some $w \in \Gamma^*$.

Proof: "$\Rightarrow$": Let $\sigma = c_0 c_1 \dots$ be an accepting run starting at configuration $c$, where $c_0 = c$ and $c_i = (\langle p_i, w_i \rangle, \theta_i)$. We construct an increasing sequence of indices $i_0, i_1, \dots$ with the property that once any of the configurations $c_{i_k}$ is reached, the rest of the run never changes the bottom $|w_{i_k}| - 1$ elements of the stack:

$$|w_{i_0}| = \min\{|w_j| \mid j \geq 0\}, \qquad |w_{i_k}| = \min\{|w_j| \mid j > i_{k-1}\},\ k \geq 1$$

Since $\mathcal{BP}$ has only finitely many different heads, there must be a head $(\langle p, \gamma \rangle, \theta)$ which occurs infinitely often as a head in the sequence $c_{i_0} c_{i_1} \dots$. Moreover, as some $g \in G$ becomes a control location infinitely often, we can find a subsequence of indices $i_{j_1}, i_{j_2}, \dots$ with the following property: for every $k \geq 1$, there exist $v, w \in \Gamma^*$ such that

$$c_{i_{j_k}} = (\langle p, \gamma w \rangle, \theta) \Rightarrow^r (\langle p, \gamma v w \rangle, \theta) = c_{i_{j_{k+1}}}$$

Because $w$ is never looked at or changed in this path, we can have $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p, \gamma v \rangle, \theta)$. This proves this direction of the proposition.

"$\Leftarrow$": Because $(\langle p, \gamma \rangle, \theta)$ is a repeating head, we can construct the following run for some $u, v, w \in \Gamma^*$, $\theta' \subseteq (\Delta \cup \Delta_c)$ and $g \in G$:

$$c \Rightarrow^* (\langle p, \gamma w \rangle, \theta) \Rightarrow^* (\langle g, u w \rangle, \theta') \Rightarrow^+ (\langle p, \gamma v w \rangle, \theta) \Rightarrow^* (\langle g, u v w \rangle, \theta') \Rightarrow^+ (\langle p, \gamma v v w \rangle, \theta) \Rightarrow^* \dots$$

Since $g$ occurs infinitely often, the run is accepting. $\Box$

Thus, since there exists an efficient algorithm to compute the $pre^*$ of SM-PDSs [29], the emptiness problem of a SM-BPDS can be reduced to computing its repeating heads.

Our goal is to compute the set of repeating heads $Rep_{\mathcal{BP}}$, i.e., the set of heads $(\langle p, \gamma \rangle, \theta)$ such that there exists $v \in \Gamma^*$ with $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p, \gamma v \rangle, \theta)$, i.e., $(\langle p, \gamma \rangle, \theta) \Rightarrow^* (\langle p, \gamma v \rangle, \theta)$ s.t. this path goes through an accepting location in $G$. To this aim, we will compute a finite graph $\mathcal{G}$ whose nodes are the heads of $\mathcal{BP}$ of the form $((p, \gamma), \theta)$, where $p \in P$, $\gamma \in \Gamma$ and $\theta \subseteq \Delta \cup \Delta_c$, and whose edges encode the reachability relation between these heads. More precisely, given two heads $((p, \gamma), \theta)$ and $((p', \gamma'), \theta')$, the fact that $((p, \gamma), \theta) \xrightarrow{b} ((p', \gamma'), \theta')$ is an edge of the graph $\mathcal{G}$ means that the configuration $(\langle p, \gamma \rangle, \theta)$ can reach a configuration having $(\langle p', \gamma' \rangle, \theta')$ as head, i.e., that there exists $v \in \Gamma^*$ s.t. $(\langle p, \gamma \rangle, \theta) \Rightarrow^* (\langle p', \gamma' v \rangle, \theta')$. Moreover, we need to keep the information whether this path visits an accepting location in $G$ or not. This information is recorded in the label $b$ of the edge: $b = 1$ means that the path visits an accepting location in $G$, i.e., that $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' v \rangle, \theta')$. Otherwise, $b = 0$. Therefore, if the graph $\mathcal{G}$ contains a loop from a head $((p, \gamma), \theta)$ to itself such that this loop goes through an edge labelled by 1, then $((p, \gamma), \theta)$ is a repeating head. Thus, computing $Rep_{\mathcal{BP}}$ can be reduced to computing the graph $\mathcal{G}$ and finding 1-labelled loops in this graph.

More precisely, we define the head reachability graph $\mathcal{G}$ as follows:

Definition 4.
The head reachability graph $\mathcal{G}$ is a tuple $(P \times \Gamma \times 2^{\Delta \cup \Delta_c}, \{0, 1\}, \delta)$ such that $((p, \gamma), \theta) \xrightarrow{b} ((p', \gamma'), \theta')$ is an edge of $\delta$ iff:

1. there exists a transition $r_c : p \xhookrightarrow{(\sigma, \sigma')} p' \in \theta \cap \Delta_c$, $\gamma = \gamma'$, $\theta' = (\theta \setminus \sigma) \cup \sigma'$, and $b = 1$ iff $p \in G$;
2. there exists a transition $\langle p, \gamma \rangle \hookrightarrow \langle p', \gamma' \rangle \in \theta \cap \Delta$, $\theta = \theta'$ and $b = 1$ iff $p \in G$;
3. there exists a transition $\langle p, \gamma \rangle \hookrightarrow \langle p'', \gamma_1 \gamma' \rangle \in \theta \cap \Delta$, for $\gamma_1 \in \Gamma$, $p'' \in P$, s.t. $(\langle p'', \gamma_1 \rangle, \theta) \Rightarrow^*_{\mathcal{BP}} (\langle p', \epsilon \rangle, \theta')$, and $b = 1$ iff $p \in G$ or $(\langle p'', \gamma_1 \rangle, \theta) \Rightarrow^r_{\mathcal{BP}} (\langle p', \epsilon \rangle, \theta')$.

Let $\mathcal{G}$ be the head reachability graph. We define $\xrightarrow{}_i$ as follows: let $((p, \gamma), \theta)$ and $((p', \gamma'), \theta')$ be two heads of $\mathcal{BP}$. We write $((p, \gamma), \theta) \xrightarrow{}_i ((p', \gamma'), \theta')$ iff there exist booleans $b_1, b_2, \dots, b_i \in \{0, 1\}$ and heads $((p_j, \gamma_j), \theta_j)$, $0 \leq j \leq i$, s.t. $\mathcal{G}$ contains the following path

$$((p_0, \gamma_0), \theta_0) \xrightarrow{b_1} ((p_1, \gamma_1), \theta_1) \xrightarrow{b_2} \dots \xrightarrow{b_i} ((p_i, \gamma_i), \theta_i)$$

where $((p_0, \gamma_0), \theta_0) = ((p, \gamma), \theta)$ and $((p_i, \gamma_i), \theta_i) = ((p', \gamma'), \theta')$.

Let $\rightarrow^*$ be the reflexive transitive closure of the graph relation $\xrightarrow{b}$, and let $\rightarrow^r$ be defined as follows: given two heads $((p, \gamma), \theta)$ and $((p', \gamma'), \theta')$, $((p, \gamma), \theta) \rightarrow^r ((p', \gamma'), \theta')$ iff there is in $\mathcal{G}$ a path between $((p, \gamma), \theta)$ and $((p', \gamma'), \theta')$ that goes through a 1-labelled edge, i.e., iff there exist heads $((p_1, \gamma_1), \theta_1)$ and $((p_2, \gamma_2), \theta_2)$ s.t.

$$((p, \gamma), \theta) \rightarrow^* ((p_1, \gamma_1), \theta_1) \xrightarrow{1} ((p_2, \gamma_2), \theta_2) \rightarrow^* ((p', \gamma'), \theta')$$

We can show that:
Theorem 2.
Let $\mathcal{BP} = (P, \Gamma, \Delta, \Delta_c, G)$ be a self-modifying Büchi pushdown system, and let $\mathcal{G}$ be its corresponding head reachability graph. A head $((p, \gamma), \theta)$ of $\mathcal{BP}$ is repeating iff $\mathcal{G}$ has a loop on the node $((p, \gamma), \theta)$ that goes through a 1-labeled edge.

The proof of this theorem relies on Lemma 1.
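Once the head reachability graph is available, the criterion of Theorem 2 amounts to a plain graph search. The following sketch assumes the graph is already given as a set of labelled edges `(source, label, target)` over opaque head identifiers; the function names and the naive quadratic search are illustrative choices, not the saturation-based procedure used to build the graph itself.

```python
def reachable(edges, start):
    """Nodes reachable from start by zero or more edges, whatever their labels."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for (src, _label, dst) in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen


def repeating_heads(nodes, edges):
    """Heads h lying on a loop h ->* u -1-> v ->* h through a 1-labelled edge."""
    reach = {n: reachable(edges, n) for n in nodes}
    return {
        h for h in nodes
        if any(label == 1 and u in reach[h] and h in reach[v]
               for (u, label, v) in edges)
    }


# Hypothetical graph: h1 and h2 form a loop through a 1-labelled edge,
# whereas h3 only has a 0-labelled self-loop.
nodes = {"h1", "h2", "h3"}
edges = {("h1", 0, "h2"), ("h2", 1, "h1"), ("h2", 0, "h3"), ("h3", 0, "h3")}
rep = repeating_heads(nodes, edges)
```

A production implementation would more likely decompose the graph into strongly connected components and keep those containing a 1-labelled edge; the naive version above is kept only because it follows the statement of the theorem literally.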
Lemma 1.
The relations $\rightarrow^*$ and $\rightarrow^r$ have the following properties: for any heads $((p, \gamma), \theta)$ and $((p', \gamma'), \theta)$:

(a) $((p, \gamma), \theta) \rightarrow^* ((p', \gamma'), \theta)$ iff $(\langle p, \gamma \rangle, \theta) \Rightarrow^* (\langle p', \gamma' v \rangle, \theta)$ for some $v \in \Gamma^*$.
(b) $((p, \gamma), \theta) \rightarrow^r ((p', \gamma'), \theta)$ iff $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' v \rangle, \theta)$ for some $v \in \Gamma^*$.

Proof: "$\Rightarrow$": Assume $((p, \gamma), \theta) \xrightarrow{}_i ((p', \gamma'), \theta)$. We proceed by induction on $i$.

(a) Basis. $i = 0$. In this case, $((p, \gamma), \theta) = ((p', \gamma'), \theta)$, then we can get $(\langle p, \gamma \rangle, \theta) \Rightarrow^* (\langle p, \gamma \rangle, \theta) = (\langle p', \gamma' \rangle, \theta)$.

Step. $i > 0$. Then there exist $p_1 \in P$, $\gamma'' \in \Gamma$ and $\theta' \subseteq \Delta \cup \Delta_c$ such that $((p, \gamma), \theta) \xrightarrow{}_1 ((p_1, \gamma''), \theta') \xrightarrow{}_{i-1} ((p', \gamma'), \theta)$. From the induction hypothesis, there exists $u \in \Gamma^*$ such that $(\langle p_1, \gamma'' \rangle, \theta') \Rightarrow^* (\langle p', \gamma' u \rangle, \theta)$. Since $((p, \gamma), \theta) \rightarrow ((p_1, \gamma''), \theta')$, we have $(\langle p, \gamma \rangle, \theta) \Rightarrow^* (\langle p_1, \gamma'' w \rangle, \theta')$ for some $w \in \Gamma^*$, hence $(\langle p, \gamma \rangle, \theta) \Rightarrow^* (\langle p', \gamma' u w \rangle, \theta)$. The property holds.

(b) $((p, \gamma), \theta) \rightarrow^r ((p, \gamma), \theta)$ cannot hold for the case $i = 0$.

Basis. $i = 1$. In this case, $((p, \gamma), \theta) \rightarrow^r ((p', \gamma'), \theta)$, then we can get $p \in G$ and $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' \rangle, \theta)$. The property holds.

Step. $i > 1$. As done in the proof of part (a) of this lemma, there exist $p_1 \in P$, $\gamma'' \in \Gamma$, $\theta' \subseteq \Delta \cup \Delta_c$ s.t. $((p, \gamma), \theta) \xrightarrow{}_1 ((p_1, \gamma''), \theta') \xrightarrow{}_{i-1} ((p', \gamma'), \theta)$. Then if $((p, \gamma), \theta) \rightarrow^r ((p', \gamma'), \theta)$, either $((p_1, \gamma''), \theta') \rightarrow^r ((p', \gamma'), \theta)$ or $((p, \gamma), \theta) \xrightarrow{1} ((p_1, \gamma''), \theta')$ holds. In the first case, i.e., $((p_1, \gamma''), \theta') \rightarrow^r ((p', \gamma'), \theta)$, by the induction hypothesis, we can have $(\langle p_1, \gamma'' \rangle, \theta') \Rightarrow^r (\langle p', \gamma' u \rangle, \theta)$, hence $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' u \rangle, \theta)$ holds. The second case depends on the rule applied to get $((p, \gamma), \theta) \xrightarrow{1} ((p_1, \gamma''), \theta')$ according to Definition 4.

- If this edge corresponds to a transition $r_c : p \xhookrightarrow{(\sigma, \sigma')} p_1 \in \theta$, then $\gamma = \gamma''$, $\theta' = (\theta \setminus \sigma) \cup \sigma'$ and $p \in G$. Since we can obtain $(\langle p, \gamma \rangle, \theta) \Rightarrow_{\mathcal{BP}} (\langle p_1, \gamma \rangle, \theta') \Rightarrow^* (\langle p', \gamma' u w \rangle, \theta)$ from part (a) and $p \in G$, then $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p_1, \gamma \rangle, \theta') \Rightarrow^* (\langle p', \gamma' u w \rangle, \theta)$. This implies that $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' v \rangle, \theta)$ for some $v \in \Gamma^*$.
- If this edge corresponds to a transition $r : \langle p, \gamma \rangle \hookrightarrow \langle p_1, \gamma'' \rangle \in \theta \cap \Delta$, then $\theta' = \theta$ and $p \in G$.
Since we can obtain $(\langle p, \gamma \rangle, \theta) \Rightarrow_{\mathcal{BP}} (\langle p_1, \gamma'' \rangle, \theta) \Rightarrow^* (\langle p', \gamma' u w \rangle, \theta)$ from part (a) and $p \in G$, then $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p_1, \gamma'' \rangle, \theta) \Rightarrow^* (\langle p', \gamma' u w \rangle, \theta)$. This implies that $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' v \rangle, \theta)$ for some $v \in \Gamma^*$.
- If this edge corresponds to a transition $r : \langle p, \gamma \rangle \hookrightarrow \langle p'', \gamma_1 \gamma'' \rangle \in \theta$, then either $p \in G$ or $(\langle p'', \gamma_1 \rangle, \theta) \Rightarrow^r (\langle p_1, \epsilon \rangle, \theta')$ holds. If $p \in G$, then we have $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p'', \gamma_1 \gamma'' \rangle, \theta)$. Otherwise, $(\langle p'', \gamma_1 \gamma'' \rangle, \theta) \Rightarrow^r (\langle p_1, \gamma'' \rangle, \theta')$. Moreover, we can obtain $(\langle p_1, \gamma'' \rangle, \theta') \Rightarrow^* (\langle p', \gamma' u \rangle, \theta)$ from part (a). Therefore, $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p_1, \gamma'' \rangle, \theta') \Rightarrow^* (\langle p', \gamma' u \rangle, \theta)$. This implies that $(\langle p, \gamma \rangle, \theta) \Rightarrow^r (\langle p', \gamma' v \rangle, \theta)$ for some $v \in \Gamma^*$.

"$\Leftarrow$": Assume $(\langle p, \gamma \rangle, \theta) \overset{i}{\Rightarrow} (\langle p', \gamma' v \rangle, \theta)$. We proceed by induction on $i$.

(a) Basis. $i = 0$. In this case, $v = \epsilon$ and $(\langle p, \gamma \rangle, \theta) = (\langle p', \gamma' \rangle, \theta)$, then $((p, \gamma), \theta) \rightarrow^* ((p', \gamma'), \theta)$ holds.

Step. $i > 0$. Then there exist $p_1 \in P$, $u \in \Gamma^*$ and $\theta' \subseteq \Delta \cup \Delta_c$ such that $(\langle p, \gamma \rangle, \theta) \Rightarrow (\langle p_1, u \rangle, \theta') \overset{i-1}{\Rightarrow} (\langle p', \gamma' v \rangle, \theta)$. There are 2 cases:

1. Case $\theta' = \theta$: There must exist a rule $r : \langle p, \gamma \rangle \hookrightarrow \langle p_1, u \rangle \in \Delta$ such that $r \in \theta'$ and $|u| \geq 1$. Let $l$ denote the minimal length of the stack on the path from $(\langle p_1, u \rangle, \theta)$ to $(\langle p', \gamma' v \rangle, \theta)$. Then $u$ can be written as $u'' \gamma_1 u'$ where $|u'| = l - 1$ ($u'$ will remain on the stack for the path). Furthermore, there exists $p'''$ such that $(\langle p_1, u'' \rangle, \theta) \Rightarrow^* (\langle p''', \epsilon \rangle, \theta'')$ for some $\theta'' \subseteq (\Delta_c \cup \Delta)$. We have $(\langle p, \gamma \rangle, \theta) \overset{k}{\Rightarrow} (\langle p''', \gamma_1 u' \rangle, \theta'')$ for $k < i$. By induction on $i$, we have $((p, \gamma), \theta) \rightarrow^* ((p''', \gamma_1), \theta'')$. Because $u'$ has to remain on the stack for the rest of the path, $v$ is of the form $v' u'$ for some $v' \in \Gamma^*$. That means $(\langle p''', \gamma_1 \rangle, \theta'') \overset{j}{\Rightarrow} (\langle p', \gamma' v' \rangle, \theta)$ for $j < i$. By the induction hypothesis, $((p''', \gamma_1), \theta'') \rightarrow^* ((p', \gamma'), \theta)$ holds. Moreover, we have $((p, \gamma), \theta) \rightarrow^* ((p''', \gamma_1), \theta'')$, hence $((p, \gamma), \theta) \rightarrow^* ((p', \gamma'), \theta)$.
2. Case $\theta' \neq \theta$: There must be a rule $r_c : p \xhookrightarrow{(\sigma, \sigma')} p_1 \in \Delta_c$ such that $r_c \in \theta$ and $\sigma \cap \theta \neq \emptyset$, then $\theta' = (\theta \setminus \sigma) \cup \sigma'$. After the execution of $r_c$, the content of the stack will remain the same, thus $u = \gamma$. Then $(\langle p, \gamma \rangle, \theta) \Rightarrow (\langle p_1, \gamma \rangle, \theta') \overset{i-1}{\Rightarrow} (\langle p', \gamma' v \rangle, \theta)$.
By the induction hypothe-sis to ( (cid:104) p , γ (cid:105) , θ (cid:48) ) i − ⇒ ( (cid:104) p (cid:48) , γ (cid:48) v (cid:105) , θ ), we can obtain that (( p , γ ) , θ (cid:48) ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ). Since ( (cid:104) p, γ (cid:105) , θ ) ⇒ ( (cid:104) p , γ (cid:105) , θ (cid:48) ), then we can have a path(( p, γ ) , θ ) → (( p , γ ) , θ (cid:48) ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ) that implies (( p, γ ) , θ ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ). The property holds.(b) ( (cid:104) p, γ (cid:105) , θ ) ⇒ r ( (cid:104) p, γ (cid:48) v (cid:105) , θ ) is impossible in 0 steps. Basis. i = 1. ( (cid:104) p, γ (cid:105) , θ ) ⇒ r ( (cid:104) p, γ (cid:105) , θ ), then p ∈ G . Thus, (( p, γ ) , θ ) → r (( p, γ ) , θ ) holds. Step. i >
1. ( (cid:104) p, γ (cid:105) , θ ) ⇒ r ( (cid:104) p (cid:48) , γ (cid:48) v (cid:105) , θ ) holds, then there exist p ∈ P, u ∈ Γ ∗ and θ (cid:48) ⊆ ∆ ∪ ∆ c such that ( (cid:104) p, γ (cid:105) , θ ) ⇒ ( (cid:104) p , u (cid:105) , θ (cid:48) ) i − ⇒ ( (cid:104) p (cid:48) , γ (cid:48) v (cid:105) , θ ).Thus, either ( (cid:104) p, γ (cid:105) , θ ) ⇒ r ( (cid:104) p , u (cid:105) , θ (cid:48) ) or ( (cid:104) p , u (cid:105) , θ (cid:48) ) ⇒ r ( (cid:104) p (cid:48) , γ (cid:48) v (cid:105) , θ ) holds.The first case implies p ∈ G. There are 2 cases:1. Case θ (cid:48) = θ : then as in the previous proof of part (a), we can have apath (( p, γ ) , θ ) → ∗ (( p (cid:48)(cid:48)(cid:48) , γ ) , θ (cid:48)(cid:48) ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ). Since p ∈ G , we getby Definition 4 (( p, γ ) , θ ) → ∗ (( p (cid:48)(cid:48)(cid:48) , γ ) , θ (cid:48)(cid:48) ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ). Thus, wehave that (( p, γ ) , θ ) → r (( p (cid:48) , γ (cid:48) ) , θ ). The property holds.2. Case θ (cid:48) (cid:54) = θ : then as in the previous proof of part (a), we can havea path (( p, γ ) , θ ) → (( p , γ ) , θ (cid:48) ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ). Since p ∈ G , we get(( p, γ ) , θ ) −→ (( p , γ ) , θ (cid:48) ) → ∗ (( p (cid:48) , γ (cid:48) ) , θ ). Thus, we have that (( p, γ ) , θ ) → r (( p (cid:48) , γ (cid:48) ) , θ ). The property holds.In the second case, ( (cid:104) p , u (cid:105) , θ (cid:48) ) ⇒ r ( (cid:104) p (cid:48) , γ (cid:48) v (cid:105) , θ ) holds. As previously, thereare 2 cases:1. Case θ (cid:48) = θ : then as in case (a) we have ( (cid:104) p , u (cid:105) , θ ) ⇒ ∗ ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ u (cid:48) (cid:105) , θ (cid:48)(cid:48) )and ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ (cid:105) , θ (cid:48)(cid:48) ) ⇒ ∗ ( (cid:104) p (cid:48) , γ (cid:48) v (cid:48) (cid:105) , θ ). 
If ( (cid:104) p , u (cid:105) , θ ) ⇒ r ( (cid:104) p (cid:48) , γ (cid:48) v (cid:105) , θ ),then either ( (cid:104) p , u (cid:105) , θ ) ⇒ r ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ u (cid:48) (cid:105) , θ (cid:48)(cid:48) ) or ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ (cid:105) , θ (cid:48)(cid:48) ) ⇒ r ( (cid:104) p (cid:48) , γ (cid:48) v (cid:48) (cid:105) , θ ).- If ( (cid:104) p , u (cid:105) , θ ) ⇒ r ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ u (cid:48) (cid:105) , θ (cid:48)(cid:48) ), let u (cid:48)(cid:48) ∈ Γ ∗ s.t. u = u (cid:48)(cid:48) γ u (cid:48) and ( (cid:104) p , u (cid:48)(cid:48) (cid:105) , θ ) ⇒ r ( (cid:104) p (cid:48)(cid:48)(cid:48) , (cid:15) (cid:105) , θ (cid:48)(cid:48) ), then, we have (( p, γ ) , θ ) → r (( p (cid:48)(cid:48)(cid:48) , γ ) , θ (cid:48)(cid:48) ). We have ( (cid:104) p, γ (cid:105) , θ ) k ⇒ ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ u (cid:48) (cid:105) , θ (cid:48)(cid:48) ) for k < i . Bythe induction on i , we have (( p, γ ) , θ ) → ∗ (( p (cid:48)(cid:48)(cid:48) , γ ) , θ (cid:48)(cid:48) ). Because u (cid:48) has to remain on the stack for the rest of the path, v is of the form v (cid:48) u (cid:48) for some v (cid:48) ∈ Γ ∗ . That means ( (cid:104) p (cid:48)(cid:48)(cid:48) , γ (cid:105) , θ (cid:48)(cid:48) ) j ⇒ ( (cid:104) p (cid:48) , γ (cid:48) v (cid:48) (cid:105) , θ ) for j
We can now prove Theorem 2.
Proof:
Let $((p, \gamma), \theta)$ be a repeating head. Then there exists some $v \in \Gamma^{*}$ such that $(\langle p, \gamma\rangle, \theta) \Rightarrow^{r} (\langle p, \gamma v\rangle, \theta)$, where $\theta \subseteq \Delta_c \cup \Delta$. By Lemma 1, this is the case if and only if $((p, \gamma), \theta) \rightarrow^{r} ((p, \gamma), \theta)$. From the definition of $\rightarrow^{r}$, this means that there exist heads $((p_1, \gamma_1), \theta')$ and $((p_2, \gamma_2), \theta'')$ such that

$$((p, \gamma), \theta) \rightarrow^{*} ((p_1, \gamma_1), \theta') \xrightarrow{1} ((p_2, \gamma_2), \theta'') \rightarrow^{*} ((p, \gamma), \theta).$$

Then $((p, \gamma), \theta)$, $((p_1, \gamma_1), \theta')$ and $((p_2, \gamma_2), \theta'')$ are all in the same loop with a 1-labelled edge. Conversely, whenever $((p, \gamma), \theta)$ is in a component with such an edge, $((p, \gamma), \theta) \rightarrow^{r} ((p, \gamma), \theta)$ holds. Lemma 1 then implies that $(\langle p, \gamma\rangle, \theta) \Rightarrow^{r} (\langle p, \gamma v\rangle, \theta)$, which means that $((p, \gamma), \theta)$ is a repeating head. $\Box$

$\mathcal{BP}$-automata

To compute $\mathcal{G}$, we need to be able to compute predecessors of configurations of the form $(\langle p', \epsilon\rangle, \theta')$, and to determine whether these predecessors were backward-reachable using some control points in $G$ (item 3 in Definition 4). To solve this question, we will label configurations $(\langle p'', w\rangle, \theta)$ s.t. $(\langle p'', w\rangle, \theta) \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$ by 1 if this path went through an accepting location in $G$, i.e., if $(\langle p'', w\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$, and by 0 if not. To this aim, we define a labelled configuration as a tuple $[(\langle p, w\rangle, \theta), b]$, s.t. $(\langle p, w\rangle, \theta)$ is a configuration and $b \in \{0, 1\}$.

Multi-automata were introduced in [2,10] to finitely represent regular infinite sets of configurations of a PDS.
Since a labelled configuration $c = [(\langle p, w\rangle, \theta), b]$ of a SM-PDS involves a PDS configuration $\langle p, w\rangle$, together with the current set of transition rules (phase) $\theta$ and a boolean $b$, in order to take into account the phases $\theta$ and these new 0/1 labels, we define labelled $\mathcal{BP}$-automata as follows:

Definition 5. Let $\mathcal{BP} = (P, \Gamma, \Delta, \Delta_c, G)$ be a SM-BPDS. A labelled $\mathcal{BP}$-automaton is a tuple $\mathcal{A} = (Q, \Gamma, T, I, F)$ where $\Gamma$ is the automaton alphabet, $Q$ is a finite set of states, $I \subseteq P \times 2^{\Delta \cup \Delta_c} \subseteq Q$ is the set of initial states, $T \subset Q \times \big((\Gamma \cup \{\epsilon\}) \times \{0, 1\}\big) \times Q$ is the set of transitions, and $F \subseteq Q$ is the set of final states. If $\big(q, [\gamma, b], q'\big) \in T$, we write $q \xrightarrow{[\gamma, b]}_T q'$. We extend this notation in the obvious way to sequences of symbols: (1) $\forall q \in Q$, $q \xrightarrow{[\epsilon, 0]}_T q$, and (2) $\forall q, q' \in Q$, $\forall b \in \{0, 1\}$, $\forall w \in \Gamma^{*}$ with $w = \gamma_1 \ldots \gamma_{n+1}$, $q \xrightarrow{[w, b]}_T q'$ iff $\exists q_1, \ldots, q_n \in Q$, $b_1, \ldots, b_{n+1} \in \{0, 1\}$ such that $b = b_1 \vee b_2 \vee \ldots \vee b_{n+1}$ and $q \xrightarrow{[\gamma_1, b_1]}_T q_1 \xrightarrow{[\gamma_2, b_2]}_T q_2 \cdots q_n \xrightarrow{[\gamma_{n+1}, b_{n+1}]}_T q'$. If $q \xrightarrow{[w, b]}_T q'$ holds, we say that $q \xrightarrow{[w, b]}_T q'$ and $q \xrightarrow{[\gamma_1, b_1]}_T q_1 \xrightarrow{[\gamma_2, b_2]}_T q_2 \cdots q_n \xrightarrow{[\gamma_{n+1}, b_{n+1}]}_T q'$ is a path of $\mathcal{A}$.

A labelled configuration $[(\langle p, w\rangle, \theta), b]$ is accepted by the automaton $\mathcal{A}$ iff there exists a path $(p, \theta) \xrightarrow{[\gamma_0, b_0]}_T q_1 \xrightarrow{[\gamma_1, b_1]}_T q_2 \cdots q_n \xrightarrow{[\gamma_n, b_n]}_T q_{n+1}$ in $\mathcal{A}$ such that $w = \gamma_0 \gamma_1 \cdots \gamma_n$, $b = b_0 \vee b_1 \vee \ldots \vee b_n$, $(p, \theta) \in I$, and $q_{n+1} \in F$. Let $L(\mathcal{A})$ be the set of labelled configurations accepted by $\mathcal{A}$.
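The acceptance condition of Definition 5 can be illustrated with a minimal Python sketch. The encoding is an assumption made for illustration only: transitions are tuples `(source, gamma, b, destination)`, a state is either a pair `(p, theta)` or any hashable name, and the function name `accepted_labels` is hypothetical. The sketch also assumes that the only $\epsilon$-move is the implicit $q \xrightarrow{[\epsilon, 0]}_T q$.

```python
def accepted_labels(trans, initial, final, p, theta, w):
    """Return every label b such that [(<p, w>, theta), b] is accepted.

    trans   : set of transitions (source, gamma, b, destination)
    initial : set of initial states, each of the form (p, theta)
    final   : set of final states
    w       : stack content as a tuple of symbols, topmost symbol first
    """
    start = (p, theta)
    if start not in initial:
        return set()
    # Each frontier entry is (current state, disjunction of labels read so far).
    frontier = {(start, 0)}
    for gamma in w:
        frontier = {
            (dst, acc | b)
            for (state, acc) in frontier
            for (src, sym, b, dst) in trans
            if src == state and sym == gamma
        }
    return {acc for (state, acc) in frontier if state in final}
```

Returning a set of labels rather than a single boolean reflects the fact that the same word may be read along several paths whose labels differ.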
$pre^{*}\big((\langle p', \epsilon\rangle, \theta')\big)$

Given a configuration of the form $(\langle p', \epsilon\rangle, \theta')$, our goal is to compute a labelled $\mathcal{BP}$-automaton $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ that accepts labelled configurations of the form $[c, b]$, where $c$ is a configuration and $b \in \{0, 1\}$, such that $c \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$ (i.e., $c \in pre^{*}\big((\langle p', \epsilon\rangle, \theta')\big)$) and $b = 1$ iff this path went through final control points, i.e., $c \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$. Otherwise, $b = 0$.

Let $p \in P$. We define $B(p) = 1$ if $p \in G$ and $B(p) = 0$ otherwise. $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big) = (Q, \Gamma, T, I, F)$ is computed as follows. Initially, $Q = I = F = \{(p', \theta')\}$ and $T = \emptyset$. We add transitions to $T$ as follows:

$\alpha_1$: If $r = \langle p, \gamma\rangle \hookrightarrow \langle p_1, w\rangle \in \Delta$ and there exists in $T$ a path $(p_1, \theta) \xrightarrow{[w, b]}_T q$ (in case $|w| = 0$, we have $w = \epsilon$) with $r \in \theta$, then add $(p, \theta)$ to $I$, and $\big((p, \theta), [\gamma, B(p) \vee b], q\big)$ to $T$.

$\alpha_2$: If $r = p \stackrel{(\sigma, \sigma')}{\hookrightarrow} p_1 \in \Delta_c$ and there exists in $T$ a transition $(p_1, \theta) \xrightarrow{[\gamma, b]}_T q$ with $r \in \theta$, where $\gamma \in \Gamma$, then add $(p, \theta')$ to $I$, and $\big((p, \theta'), [\gamma, B(p) \vee b], q\big)$ to $T$, for $\theta'$ such that $\theta = \theta' \setminus \sigma \cup \sigma'$.

The procedure above terminates since there is a finite number of states and phases.
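The saturation procedure can be sketched in Python as a fixpoint loop. This is only an illustrative sketch under several assumptions that are not taken from the paper: rules are identified by string names, a phase is a `frozenset` of rule names, the candidate phases are passed explicitly in `phases`, right-hand sides have length at most 2, and a modifying rule is required to belong to the phase in which it fires (the phase before modification). The function name `saturate_pre_star` and the tuple encodings are hypothetical.

```python
def saturate_pre_star(target, rules, mod_rules, phases, accepting):
    """Build the transitions of a labelled automaton for pre*((<p', eps>, theta')).

    target    : (p', theta'), with theta' a frozenset of rule names
    rules     : name -> (p, gamma, p1, w), encoding <p, gamma> -> <p1, w>, w a tuple
    mod_rules : name -> (p, sigma, sigma2, p1), encoding p -(sigma, sigma2)-> p1
    phases    : finite collection of candidate phases (frozensets of rule names)
    accepting : the set G of accepting control points
    Returns the set T of transitions (source, gamma, b, destination).
    """
    def B(p):
        return 1 if p in accepting else 0

    states = {target}
    trans = set()

    def paths(state, w):
        """All (b, q) such that state --[w, b]--> q in trans, for |w| <= 2."""
        if len(w) == 0:
            return {(0, state)}          # implicit q --[eps, 0]--> q
        first = {(b, q) for (s, g, b, q) in trans if s == state and g == w[0]}
        if len(w) == 1:
            return first
        return {(b1 | b2, q2)
                for (b1, q1) in first
                for (s, g, b2, q2) in trans
                if s == q1 and g == w[1]}

    changed = True
    while changed:
        changed = False
        new = set()
        # Rule alpha_1: non-modifying rules.
        for name, (p, gamma, p1, w) in rules.items():
            for (q0, theta) in list(states):
                if q0 == p1 and name in theta:
                    for (b, q) in paths((q0, theta), w):
                        new.add(((p, theta), gamma, B(p) | b, q))
        # Rule alpha_2: modifying rules; the stack symbol is left unchanged.
        for name, (p, sigma, sigma2, p1) in mod_rules.items():
            for (src, gamma, b, q) in list(trans):
                if src[0] != p1:
                    continue
                for pre in phases:
                    if name in pre and (pre - sigma) | sigma2 == src[1]:
                        new.add(((p, pre), gamma, B(p) | b, q))
        if not new <= trans:
            trans |= new
            states |= {t[0] for t in new}
            changed = True
    return trans
```

The loop stops as soon as a full pass adds no transition, which mirrors the termination argument above: there are finitely many states $(p, \theta)$, stack symbols, and labels.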
Note that by construction, $F = \{(p', \theta')\}$, and, since initially $Q = \{(p', \theta')\}$, the states of $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ are all of the form $(p, \theta)$ for $p \in P$ and $\theta \subseteq \Delta \cup \Delta_c$.

Let us explain the intuition behind rule ($\alpha_1$). Let $r = \langle p, \gamma\rangle \hookrightarrow \langle p_1, w\rangle \in \Delta$. Let $c = (\langle p_1, w w'\rangle, \theta)$ and $c' = (\langle p, \gamma w'\rangle, \theta)$. Then, if $c \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$, then necessarily $c' \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$. Moreover, $c' \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$ iff either $c \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$ or $p \in G$ (i.e., $B(p) = 1$). Thus, we would like that if the automaton $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ accepts the labelled configuration $[c, b]$ (where $b = 1$ means $c \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$), then it should also accept the labelled configuration $[c', b \vee B(p)]$ ($b \vee B(p) = 1$ means $c' \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$). Thus, if the automaton $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ contains a path of the form $\pi = (p_1, \theta) \xrightarrow{[w, b_1]}_T q \xrightarrow{[w', b_2]}_T q_f$, where $q_f \in F$, that accepts the labelled configuration $[c, b]$ with $b = b_1 \vee b_2$, then the automaton should also accept the labelled configuration $[c', b \vee B(p)]$. This configuration is accepted by the run $(p, \theta) \xrightarrow{[\gamma, B(p) \vee b_1]}_T q \xrightarrow{[w', b_2]}_T q_f$ added by rule ($\alpha_1$).

Rule ($\alpha_2$) deals with modifying rules. Let $r = p \stackrel{(r_1, r_2)}{\hookrightarrow} p_1 \in \Delta_c$. Let $c = (\langle p_1, \gamma w'\rangle, \theta)$ and $c' = (\langle p, \gamma w'\rangle, \theta'')$ s.t. $\theta = \theta'' \setminus \{r_1\} \cup \{r_2\}$. Then, if $c \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$, then necessarily $c' \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$. Moreover, $c' \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$ iff either $c \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$ or $p \in G$ (i.e., $B(p) = 1$). Thus, we need to impose that if the automaton $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ contains a path of the form $(p_1, \theta) \xrightarrow{[\gamma, b_1]}_T q \xrightarrow{[w', b_2]}_T q_f$ (where $q_f \in F$) that accepts the labelled configuration $[c, b]$, $b = b_1 \vee b_2$ ($b = 1$ means $c \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$), then necessarily the automaton should also accept the labelled configuration $[c', b \vee B(p)]$. This configuration is accepted by the run $(p, \theta'') \xrightarrow{[\gamma, B(p) \vee b_1]}_T q \xrightarrow{[w', b_2]}_T q_f$ added by rule ($\alpha_2$).

Before proving that our construction is correct, we introduce the following definition:

Definition 6. Let $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big) = (Q, \Gamma, T, I, F)$ be the labelled $\mathcal{BP}$-automaton computed by the saturation procedure above. In this section, we use $\longrightarrow{}^{i}_{T}$ to denote the transition relation of $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ obtained after adding $i$ transitions using the saturation procedure above. Let us notice that, since initially $Q = \{(p', \theta')\}$ and since rules ($\alpha_1$) and ($\alpha_2$) add at step $i$ only transitions of the form $(p, \theta) \xrightarrow{\gamma}_T q$ for a state $q$ that is already in the automaton at step $i - 1$, the states of $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ are all of the form $(p, \theta)$ for $p \in P$ and $\theta \subseteq \Delta \cup \Delta_c$. We can show that:
Lemma 2.
Let $p, p'' \in P$ and $\theta, \theta'' \subseteq \Delta \cup \Delta_c$. Let $w \in \Gamma^{*}$ and $b \in \{0, 1\}$. If a path $(p, \theta) \xrightarrow{[w, b]}_T (p'', \theta'')$ is in $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$, then $(\langle p, w\rangle, \theta) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$. Moreover, if $b = 1$, then $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$.

Proof:
Initially, the automaton contains no transitions. Let $i$ be an index such that $(p, \theta) \xrightarrow{[w, b]}{}^{i}_{T} (p'', \theta'')$ holds. We proceed by induction on $i$.

Basis. $i = 0$. Then $(p'', \theta'') \xrightarrow{[\epsilon, 0]}_T (p'', \theta'')$. This means $p'' = p'$ and $\theta'' = \theta'$, since initially $Q = \{(p', \theta')\}$. Then $(\langle p'', \epsilon\rangle, \theta'') \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$ always holds.

Step. $i > 0$. Let $t = \big((p_1, \theta_1), [\gamma, b_1], (p_2, \theta_2)\big)$ be the $i$-th transition added to $\mathcal{A}_{pre^{*}}$ and $j$ be the number of times that $t$ is used in the path $(p, \theta) \xrightarrow{[w, b]}{}^{i}_{T} (p'', \theta'')$. The proof is by induction on $j$. If $j = 0$, then we have $(p, \theta) \xrightarrow{[w, b]}{}^{i-1}_{T} (p'', \theta'')$ in the automaton, and applying the induction hypothesis (induction on $i$) we obtain $(\langle p, w\rangle, \theta) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$. So assume that $j > 0$. Then there exist $u, v \in \Gamma^{*}$ and $b', b'' \in \{0, 1\}$ such that $w = u \gamma v$, $b = b' \vee b_1 \vee b''$ and

$$(p, \theta) \xrightarrow{[u, b']}{}^{i-1}_{T} (p_1, \theta_1) \xrightarrow{[\gamma, b_1]}{}^{i}_{T} (p_2, \theta_2) \xrightarrow{[v, b'']}{}^{i}_{T} (p'', \theta'') \qquad (1)$$

The application of the induction hypothesis (induction on $i$) to $(p, \theta) \xrightarrow{[u, b']}{}^{i-1}_{T} (p_1, \theta_1)$ gives that

$$(\langle p, u\rangle, \theta) \Rightarrow^{*} (\langle p_1, \epsilon\rangle, \theta_1); \text{ moreover, if } b' = 1,\ (\langle p, u\rangle, \theta) \Rightarrow^{r} (\langle p_1, \epsilon\rangle, \theta_1) \qquad (2)$$

There are 2 cases depending on whether transition $t$ was added by saturation rule $\alpha_1$ or $\alpha_2$.

1. Case $t$ was added by rule $\alpha_1$: There exist $p_3 \in P$ and $w_3 \in \Gamma^{*}$ such that

$$r = \langle p_1, \gamma\rangle \hookrightarrow \langle p_3, w_3\rangle \in \Delta \cap \theta_1 \qquad (3)$$

and $\mathcal{A}_{pre^{*}}$ contains the following path:

$$\pi' = (p_3, \theta_1) \xrightarrow{[w_3, b_3]}{}^{i-1}_{T} (p_2, \theta_2) \xrightarrow{[v, b'']}{}^{i}_{T} (p'', \theta''), \quad b_1 = b_3 \vee B(p_1) \qquad (4)$$

Applying the transition rule $r$, we get that

$$(\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow (\langle p_3, w_3 v\rangle, \theta_1) \qquad (5)$$

By induction on $j$ (since transition $t$ is used $j - 1$ times in $\pi'$), we get from (4) that

$$(\langle p_3, w_3 v\rangle, \theta_1) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta''); \text{ moreover, if } b_3 \vee b'' = 1,\ (\langle p_3, w_3 v\rangle, \theta_1) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'') \qquad (6)$$

Putting (2), (5) and (6) together, we obtain that

$$(\langle p, w\rangle, \theta) = (\langle p, u \gamma v\rangle, \theta) \Rightarrow^{*} (\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow (\langle p_3, w_3 v\rangle, \theta_1) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$$

Furthermore, if $b = b' \vee b_1 \vee b'' = 1$, then $b' = 1$ or $b_1 \vee b'' = 1$. In the first case, $b' = 1$, we have $(\langle p, u\rangle, \theta) \Rightarrow^{r} (\langle p_1, \epsilon\rangle, \theta_1)$ from (2). Thus, we obtain $(\langle p, u \gamma v\rangle, \theta) \Rightarrow^{r} (\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$, i.e., $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$. In the second case, $b_1 \vee b'' = 1$, i.e., $B(p_1) \vee b_3 \vee b'' = 1$, which implies that $B(p_1) = 1$ (that means $p_1 \in G$ and $(\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$) or $b_3 \vee b'' = 1$ (that implies $(\langle p_3, w_3 v\rangle, \theta_1) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$ from (6)). Therefore, $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$.

2. Case $t$ was added by rule $\alpha_2$: There exist $p_3 \in P$ and $\theta_3 \subseteq \Delta \cup \Delta_c$ such that

$$r = p_1 \stackrel{(\sigma, \sigma')}{\hookrightarrow} p_3 \in \Delta_c \cap \theta_1, \quad \theta_3 = (\theta_1 \setminus \sigma) \cup \sigma' \qquad (7)$$

and the current automaton contains the following path (a self-modifying rule does not change the stack) with $r \in \theta_1$:

$$(p_3, \theta_3) \xrightarrow{[\gamma, b_3]}{}^{i-1}_{T} (p_2, \theta_2) \xrightarrow{[v, b'']}{}^{i}_{T} (p'', \theta''), \quad b_1 = B(p_1) \vee b_3 \qquad (8)$$

Applying the transition rule, we get from (7) that

$$(\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow (\langle p_3, \gamma v\rangle, \theta_3) \qquad (9)$$

We can apply the induction hypothesis (on $j$) to (8), and obtain

$$(\langle p_3, \gamma v\rangle, \theta_3) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta''); \text{ moreover, if } b_3 \vee b'' = 1,\ (\langle p_3, \gamma v\rangle, \theta_3) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'') \qquad (10)$$

From (2), (9) and (10), we get

$$(\langle p, w\rangle, \theta) = (\langle p, u \gamma v\rangle, \theta) \Rightarrow^{*} (\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow (\langle p_3, \gamma v\rangle, \theta_3) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$$

Furthermore, if $b = b' \vee b_1 \vee b'' = 1$, then $b' = 1$ or $b_1 \vee b'' = 1$. In the first case, $b' = 1$, we have $(\langle p, u\rangle, \theta) \Rightarrow^{r} (\langle p_1, \epsilon\rangle, \theta_1)$ from (2). Thus, we obtain $(\langle p, u \gamma v\rangle, \theta) \Rightarrow^{r} (\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow^{*} (\langle p'', \epsilon\rangle, \theta'')$, i.e., $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$. In the second case, $b_1 \vee b'' = 1$, i.e., $B(p_1) \vee b_3 \vee b'' = 1$, which implies that $B(p_1) = 1$ (that means $p_1 \in G$ and $(\langle p_1, \gamma v\rangle, \theta_1) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$) or $b_3 \vee b'' = 1$ (that implies $(\langle p_3, \gamma v\rangle, \theta_3) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$ from (10)). Therefore, if $b = 1$, then $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p'', \epsilon\rangle, \theta'')$. $\Box$

Lemma 3.
If there is a labelled configuration $[(\langle p, w\rangle, \theta), b]$ such that $(\langle p, w\rangle, \theta) \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$, then there is a path $(p, \theta) \xrightarrow{[w, b]}_T (p', \theta')$ in $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$. Moreover, if $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$, then $b = 1$.

Proof:
Assume $(\langle p, w\rangle, \theta) \Rightarrow^{i} (\langle p', \epsilon\rangle, \theta')$. We proceed by induction on $i$.

Basis. $i = 0$. Then $\theta = \theta'$, $p' = p$ and $w = \epsilon$. Initially, we have $Q = \{(p', \theta')\}$; therefore, by the definition of $\rightarrow_T$, we have $(p', \theta') \xrightarrow{[\epsilon, 0]}_T (p', \theta')$. We cannot have $(\langle p', \epsilon\rangle, \theta') \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$ in 0 steps.

Step. $i > 0$. Then there exists a configuration $(\langle p'', u\rangle, \theta'')$ such that

$$(\langle p, w\rangle, \theta) \Rightarrow (\langle p'', u\rangle, \theta'') \Rightarrow^{i-1} (\langle p', \epsilon\rangle, \theta')$$

We apply the induction hypothesis to $(\langle p'', u\rangle, \theta'') \Rightarrow^{i-1} (\langle p', \epsilon\rangle, \theta')$, and obtain that there exists in $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ a path $(p'', \theta'') \xrightarrow{[u, b'']}_T (p', \theta')$. If $(\langle p'', u\rangle, \theta'') \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$, then $b'' = 1$.

Let $(p_1, \theta_1)$ be a state of $\mathcal{A}_{pre^{*}}$. Let $w_1, u_1 \in \Gamma^{*}$, $\gamma \in \Gamma$ and $b''_1, b''_2 \in \{0, 1\}$ be such that $w = \gamma w_1$, $u = u_1 w_1$, $b'' = b''_1 \vee b''_2$ and

$$(p'', \theta'') \xrightarrow{[u_1, b''_1]}_T (p_1, \theta_1) \xrightarrow{[w_1, b''_2]}_T (p', \theta') \qquad (1)$$

There are two cases depending on which rule is applied to get $(\langle p, w\rangle, \theta) \Rightarrow (\langle p'', u\rangle, \theta'')$.

1. Case $(\langle p, w\rangle, \theta) \Rightarrow (\langle p'', u\rangle, \theta'')$ is obtained by a rule of the form $\langle p, \gamma\rangle \hookrightarrow \langle p'', u_1\rangle \in \Delta$. In this case, $\theta'' = \theta$. By the saturation rule $\alpha_1$, we have

$$(p, \theta'') \xrightarrow{[\gamma, b_1]}_T (p_1, \theta_1), \quad b_1 = B(p) \vee b''_1 \qquad (2)$$

Putting (1) and (2) together, we obtain that

$$\pi = (p, \theta'') \xrightarrow{[\gamma, b_1]}_T (p_1, \theta_1) \xrightarrow{[w_1, b''_2]}_T (p', \theta') \qquad (3)$$

Thus, $(p, \theta'') \xrightarrow{[\gamma w_1, b_1 \vee b''_2]}_T (p', \theta')$, i.e., $(p, \theta) \xrightarrow{[w, b]}_T (p', \theta')$ where $b = b_1 \vee b''_2$.

2. Case $(\langle p, w\rangle, \theta) \Rightarrow (\langle p'', u\rangle, \theta'')$ is obtained by a rule of the form $p \stackrel{(\sigma, \sigma')}{\hookrightarrow} p'' \in \Delta_c$, i.e., $\theta'' \neq \theta$. In this case, $u_1 = \gamma$. By the saturation rule $\alpha_2$, we obtain that

$$(p, \theta) \xrightarrow{[\gamma, b_1]}_T (p_1, \theta_1), \text{ where } \theta'' = \theta \setminus \sigma \cup \sigma',\ b_1 = B(p) \vee b''_1 \qquad (4)$$

Putting (1) and (4) together, we have the following path:

$$(p, \theta) \xrightarrow{[\gamma, b_1]}_T (p_1, \theta_1) \xrightarrow{[w_1, b''_2]}_T (p', \theta'), \text{ i.e., } (p, \theta) \xrightarrow{[w, b]}_T (p', \theta') \text{ where } b = b_1 \vee b''_2 \qquad (5)$$

Furthermore, if $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$, then $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p'', u\rangle, \theta'')$ or $(\langle p'', u\rangle, \theta'') \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$. In the first case, $p \in G$, i.e., $B(p) = 1$. In the second case, we get $b'' = 1$ (from the induction hypothesis). Thus, $b = b_1 \vee b''_2 = B(p) \vee b''_1 \vee b''_2 = B(p) \vee b'' = 1$. Therefore, if $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$, then $b = 1$. $\Box$

From these two lemmas, we get:
Theorem 3.
Let $[c, b]$ be a labelled configuration. Then $[c, b]$ is in $L\big(\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)\big)$ iff $c \in pre^{*}\big((\langle p', \epsilon\rangle, \theta')\big)$. Moreover, $c \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$ iff $b = 1$.

Proof: Let $[(\langle p, w\rangle, \theta), b]$ be a labelled configuration with $(\langle p, w\rangle, \theta) \in pre^{*}\big((\langle p', \epsilon\rangle, \theta')\big)$. Then $(\langle p, w\rangle, \theta) \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$. By Lemma 3, there exists a path $(p, \theta) \xrightarrow{[w, b]}_T (p', \theta')$ in $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$. So $[(\langle p, w\rangle, \theta), b]$ is in $L\big(\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)\big)$. Moreover, if $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$, then $b = 1$.

Conversely, let $[(\langle p, w\rangle, \theta), b]$ be a configuration accepted by $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$, i.e., there exists a path $(p, \theta) \xrightarrow{[w, b]}_T (p', \theta')$ in $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$. By Lemma 2, $(\langle p, w\rangle, \theta) \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$, i.e., $(\langle p, w\rangle, \theta) \in pre^{*}\big((\langle p', \epsilon\rangle, \theta')\big)$. Moreover, if $b = 1$, then $(\langle p, w\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$. $\Box$

$\mathcal{G}$

Based on the definition of the Head Reachability Graph $\mathcal{G}$, and on Theorem 3, we can compute $\mathcal{G}$ as follows. Initially, $\mathcal{G}$ has no edges.

$\alpha'_1$: if $r_c : p \stackrel{(\sigma, \sigma')}{\hookrightarrow} p' \in \Delta_c$, then for every phase $\theta$ such that $r_c \in \theta$ and every $\gamma \in \Gamma$, we add the edge $((p, \gamma), \theta) \xrightarrow{B(p)} ((p', \gamma), \theta_1)$ to the graph $\mathcal{G}$, where $\theta_1 = \theta \setminus \sigma \cup \sigma'$.

$\alpha'_2$: if $r : \langle p, \gamma\rangle \hookrightarrow \langle p_1, \gamma_1\rangle \in \Delta$, then for every phase $\theta$ such that $r \in \theta$, we add the edge $((p, \gamma), \theta) \xrightarrow{B(p)} ((p_1, \gamma_1), \theta)$ to the graph $\mathcal{G}$.

$\alpha'_3$: if $r : \langle p, \gamma\rangle \hookrightarrow \langle p_1, \gamma_1 \gamma'\rangle \in \Delta$, then for every phase $\theta$ such that $r \in \theta$, we add to the graph $\mathcal{G}$ the edge $((p, \gamma), \theta) \xrightarrow{B(p)} ((p_1, \gamma_1), \theta)$. Moreover, for every control point $p' \in P$ and phase $\theta'$ such that $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ contains a transition of the form $t = (p_1, \theta) \xrightarrow{[\gamma_1, b]}_T (p', \theta')$, we add to the graph $\mathcal{G}$ the edge $((p, \gamma), \theta) \xrightarrow{b \vee B(p)} ((p', \gamma'), \theta')$.

Items $\alpha'_1$ and $\alpha'_2$ are obvious. They respectively correspond to item 1 and item 2 of Definition 4 (since $B(p) = 1$ iff $p \in G$). Item $\alpha'_3$ is based on Lemma 1 and on item 3 of Definition 4. Indeed, it follows from Lemma 1 that if $\mathcal{A}_{pre^{*}}\big((\langle p', \epsilon\rangle, \theta')\big)$ contains a transition of the form $(p_1, \theta) \xrightarrow{[\gamma_1, b]}_T (p', \theta')$, then $(\langle p_1, \gamma_1\rangle, \theta) \Rightarrow^{*} (\langle p', \epsilon\rangle, \theta')$, and if $b = 1$, then $(\langle p_1, \gamma_1\rangle, \theta) \Rightarrow^{r} (\langle p', \epsilon\rangle, \theta')$. Thus, in this case, the edge $((p, \gamma), \theta) \xrightarrow{b \vee B(p)} ((p', \gamma'), \theta')$ is added to $\mathcal{G}$ (item 3 of Definition 4) since $\langle p, \gamma\rangle \hookrightarrow \langle p_1, \gamma_1 \gamma'\rangle \in \Delta$.
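Once the edges of the head reachability graph have been produced, the characterisation used in the proof of Theorem 2 (a head is repeating iff it lies in a component containing a 1-labelled edge) can be checked by plain graph search. The following Python sketch assumes the edges are already available as triples `(source, label, destination)` with `label` in `{0, 1}`; the function name `repeating_heads` and this encoding are illustrative assumptions, and nothing here is taken from the authors' tool.

```python
from collections import defaultdict


def repeating_heads(edges):
    """Return the heads lying in a strongly connected component with a 1-labelled edge."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, _, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    def reach(start, adj):
        """Nodes reachable from start along adj (start itself included)."""
        seen, stack = {start}, [start]
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    result = set()
    for u, label, v in edges:
        if label == 1 and u not in result:
            forward = reach(v, succ)
            if u in forward:
                # The 1-labelled edge u -> v lies on a cycle, so the component
                # of u is exactly the nodes reachable from v that also reach u.
                result |= forward & reach(u, pred)
    return result
```

For large graphs a linear-time strongly connected component algorithm such as Tarjan's would be preferable; the repeated searches above are kept only for readability.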
We implemented our approach in a tool and compared its performance against the approaches that consist in translating the SM-PDS to an equivalent standard (or symbolic) PDS, and then applying the standard LTL model-checking algorithms implemented in the PDS model-checker Moped [27]. All our experiments were run on Ubuntu 16.04 with a 2.7 GHz CPU and 2 GB of memory. To perform the comparison, we randomly generated several SM-PDSs and LTL formulas of different sizes. The results (CPU execution time) are shown in Table 1.

Column Size is the size of the SM-PDS ($S_1$ for non-modifying transitions $\Delta$ and $S_2$ for modifying transitions $\Delta_c$). Column LTL gives the number of transitions of the Büchi automaton generated from the LTL formula (using the tool LTL2BA [21]). Column SM-PDS gives the cost of our direct algorithm presented in this paper. Column PDS shows the cost of obtaining the equivalent PDS from the SM-PDS. Column Result reports the cost of running the LTL PDS model-checker Moped [27] on the resulting PDS. Column Total is the total cost of translating the SM-PDS into a PDS and then applying the standard LTL model-checking algorithm of Moped (Total = PDS + Result). Column Symbolic PDS reports the cost of obtaining the equivalent symbolic PDS from the SM-PDS. The Result column that follows it is the cost of running the symbolic PDS LTL model-checker Moped, and the Total column that follows is the total cost of translating the SM-PDS into a symbolic PDS and then applying the standard LTL model-checking algorithm of Moped.

As Table 1 shows, our direct algorithm (Column SM-PDS) is much more efficient than translating the SM-PDS to an equivalent (symbolic) PDS and then running the standard LTL model-checker Moped. Translating the SM-PDS to a standard PDS may take more than 20 days, whereas our direct algorithm takes only a few seconds. Moreover, since the obtained standard (symbolic) PDS is huge, Moped failed to handle several cases (the time limit that we set for Moped is 20 minutes), whereas our tool was able to deal with all the cases in only a few seconds.
As described in [4], several malicious behaviors can be described by LTL formulas. We give in what follows examples of such malicious behaviors and show how they can be described by LTL formulas:

Registry Key Injecting:
In order to get started at boot time, many malwares add themselves into the registry key listing. This behavior is typically implemented by first calling the API function GetModuleFileNameA to retrieve the path of the malware's executable file. Then, the API function RegSetValueExA is called to add the file path into the registry key listing. This malicious behavior can be described in LTL as follows:

$$\phi_{rk} = F\big(call\ GetModuleFileNameA \wedge F(call\ RegSetValueExA)\big)$$

This formula expresses that if a call to the API function GetModuleFileNameA is followed by a call to the API function RegSetValueExA, then probably a malware is trying to add itself into the registry key listing.
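To make the shape of $\phi_{rk}$ concrete, the following Python sketch checks a nested "eventually" pattern $F(s_1 \wedge F(s_2 \wedge \dots))$ on a single finite trace of API-call names. It only illustrates what the formula means under that simplifying assumption: the algorithm of this paper checks SM-PDS models, not recorded traces, and the function name `eventually_in_order` and the example trace are hypothetical.

```python
def eventually_in_order(trace, steps):
    """Return True if the finite trace satisfies F(s1 & F(s2 & ...)).

    trace: list of API-call names, one per program step.
    steps: list of sets; step k holds at a position whose call name is in steps[k].
    """
    pos = 0
    for allowed in steps:
        # Move to the earliest position >= pos whose call satisfies this step.
        while pos < len(trace) and trace[pos] not in allowed:
            pos += 1
        if pos == len(trace):
            return False
        # F means "now or later", so the next step may also match at pos.
    return True


# Hypothetical trace of API calls, for illustration only.
trace = ["LoadLibraryA", "GetModuleFileNameA", "CreateFileA", "RegSetValueExA"]
phi_rk = [{"GetModuleFileNameA"}, {"RegSetValueExA"}]

matches = eventually_in_order(trace, phi_rk)
reversed_matches = eventually_in_order(list(reversed(trace)), phi_rk)
```

Each step is a set of call names so that a disjunction of calls inside one step can be written as a set with several elements.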
[Table 1: Our approach vs. standard LTL for PDSs. Columns: Size, LTL, SM-PDS, PDS, Result, Total, Symbolic PDS, Result, Total. The instances range from $S_1$: 5, $S_2$: 2, $|\delta|$: 15 to $S_1$: 10180, $S_2$: 16, $|\delta|$: 78; the timing entries were not recoverable.]

Data-Stealing:
Stealing data from the host is a popular malicious behavior that aims to steal any valuable information, including passwords, software code, bank information, etc. To do this, the malware needs to scan the disk to find the interesting file that it wants to steal. After finding the file, the malware needs to locate it. To this aim, the malware first calls the API function GetModuleHandleA to get a base address to search for a location of the file. Then the malware starts looking for the interesting file by calling the API function FindFirstFileA. Then the API functions CreateFileMappingA and MapViewOfFile are called to access the file. Finally, the specific file can be copied by calling the API function CopyFileA. Thus, this data-stealing malicious behavior can be described by the following LTL formula:

$$\phi_{ds} = F\big(call\ GetModuleHandleA \wedge F(call\ FindFirstFileA \wedge F(call\ CreateFileMappingA \wedge F(call\ MapViewOfFile \wedge F\ call\ CopyFileA)))\big)$$

Spy-Worm:
A spy worm is a malware that can record data and send it using the Socket API functions. For example, Keylogger is a spy worm that can record the keyboard states by calling the API functions GetAsyncKeyState and GetKeyState and send them to a specific server by calling the socket function sendto. Another spy worm can also spy on an I/O device other than the keyboard. For this, it can use the API function GetRawInputData to obtain input from the specified device, and then send this input by calling the socket functions send or sendto. Thus, this malicious behavior can be described by the following LTL formula:

$$\phi_{sw} = F\big((call\ GetAsyncKeyState \vee call\ GetRawInputData) \wedge F(call\ sendto \vee call\ send)\big)$$

Appending virus:
An appending virus is a virus that inserts a copy of its code at the end of the target file. To achieve this, since the real OFFSET of the virus' variables depends on the size of the infected file, the virus has to first compute its real absolute address in the memory. To perform this, the virus has to call the sequence of instructions: $l_1$: call f; $l_2$: ....; f: pop eax;. The instruction call f will push the return address $l_2$ onto the stack. Then, the pop instruction in f will put the value of this address into the register eax. Thus, the virus can get its real absolute address from the register eax. This malicious behavior can be described by the following LTL formula:

$$\phi_{av} = \bigvee F\Big(call \wedge X(\text{top-of-stack} = a) \wedge G\,\neg\big(ret \wedge (\text{top-of-stack} = a)\big)\Big)$$

where the $\bigvee$ is taken over all possible return addresses $a$, and top-of-stack = $a$ is a predicate that indicates that the top of the stack is $a$. The subformula $call \wedge X(\text{top-of-stack} = a)$ means that there exists a procedure call having $a$ as return address. Indeed, when a procedure call is made, the program pushes its corresponding return address $a$ to the stack. Thus, at the next step, $a$ will be on the top of the stack. Therefore, the formula above expresses that there exists a procedure call having $a$ as return address, such that there is no ret instruction which will return to $a$.

Note that this formula uses predicates that indicate that the top of the stack is $a$. Our techniques work for this case as well: it suffices to encode the top of the stack in the control points of the SM-PDS. Our implementation also supports this encoding and can handle appending viruses.

Applying our tool for malware detection.
We applied our tool to detect several malwares. We use the unpack tool unpacker [28] to handle packers like [...]

[Table 2: Multiple pre* vs. our direct LTL model-checking algorithm. Columns: Example, Size, LTL, Multiple pre*. The rows list malware samples such as Tanatos.b, Netsky.c, Mydoom.c, and klez.c; sizes and timings were not recoverable.]

[Table 3: Experimental Results. Columns: Example, Size, Result, cost. In the recoverable entries, every malware sample (for example Tanatos.b, Netsky.c, Mydoom.c, klez.c, and the NGVCK-generated samples) has Result Yes, and every benign program (for example cmd.exe, notepad.exe, java.exe, and sort.exe) has Result No; sizes and costs were not recoverable.]

Column Size is the number of control locations, Column Result gives the result of our algorithm (Yes means malicious and No means benign), and Column cost gives the cost of applying our LTL model-checker to check one of the LTL properties described above.

2. Second, we abstract away the self-modifying instructions and proceed as if these instructions were not self-modifying. In this case, we translate the binary codes to standard pushdown systems as described in [3]. By using PDSs as models, none of the malwares that we consider was detected as malicious, whereas, as reported in Table 3, using self-modifying PDSs as models and applying our LTL model-checking algorithm allowed us to detect all the 892 malwares that we considered.

Note that checking the formulas $\phi_{rk}$, $\phi_{ds}$, and $\phi_{sw}$ could be done using multiple pre* queries on SM-PDSs using the pre* algorithm of [29]. However, this would be less efficient than performing our direct LTL model-checking algorithm, as shown in Table 2, where Column Size gives the number of control locations, Column LTL gives the time of applying our LTL model-checking algorithm, and Column Multiple pre* gives the cost of applying multiple pre* on SM-PDSs to check the properties $\phi_{rk}$, $\phi_{ds}$, and $\phi_{sw}$. It can be seen that applying our direct LTL model-checking algorithm is more efficient. Furthermore, the appending virus formula $\phi_{av}$ cannot be solved using multiple pre* queries. Our direct LTL model-checking algorithm is needed in this case. Note that some of the malwares we considered in our experiments are appending viruses. Thus, our algorithm and our implementation are crucial to be able to detect these malwares.

[Table 4: Detection rate: Our tool vs. well-known antiviruses. Columns: our tool, McAfee, Norman, BitDefender, Kinsoft, Avira, eScan, Kaspersky, Qihoo360, Baidu, Avast, Symantec. The detection rates were not recoverable.]

Comparison with well-known antiviruses.
We compare our tool against well-known and widely used antiviruses. Since known antiviruses update their signature database as soon as a new malware is known, in order to have a fair comparison with these antiviruses, we need to consider new malwares. We use the sophisticated malware generator NGVCK available at VX Heavens [31] to generate 205 malwares. We obfuscate these malwares with self-modifying code, and we feed them to our tool and to well-known antiviruses such as BitDefender, Kinsoft, Avira, eScan, Kaspersky, Qihoo-360, Baidu, Avast, and Symantec. Our tool was able to detect all these programs as malicious, whereas none of the well-known antiviruses was able to detect all these malwares. Table 4 reports the detection rates of our tool and the well-known antiviruses.
References
1. A. Bertrand, M. Matias, and D. Koen. A model for self-modifying code. In IHMM-Sec, 2006.
2. A. Bouajjani, J. Esparza, and O. Maler. Reachability Analysis of Pushdown Automata: Application to Model Checking. In CONCUR'97, 1997.
3. F. Song and T. Touili. Efficient malware detection using model-checking. In FM, 2012.
4. F. Song and T. Touili. LTL model-checking for malware detection. In TACAS, 2013.
5. G. Balakrishnan, T.W. Reps, N. Kidd, A. Lal, J. Lim, et al. Model checking x86 executables with codesurfer/x86 and WPDS++. In CAV, 2005.
6. G. Bonfante, J. Marion, and D. Reynaud-Plantey. A computability perspective on self-modifying programs. In SEFM, 2009.
7. H. Cai, Z. Shao, and A. Vaynberg. Certified self-modifying code. ACM SIGPLAN Notices, 42(6), 2007.
8. H. Nguyen and T. Touili. CARET model checking for malware detection. In SPIN, 2017.
9. J. Bergeron, M. Debbabi, et al. Static detection of malicious code in executable programs. Int. J. of Req. Eng, 2001(184-189), 2001.
10. J. Esparza, D. Hansel, P. Rossmanith, and S. Schwoon. Efficient algorithms for model checking pushdown systems. In CAV, 2000.
11. J. Kinder, S. Katzenbeisser, C. Schallhart, and H. Veith. Detecting malicious code by model checking. In DIMVA, 2005.
12. J. Kinder and H. Veith. Jakstab: A static analysis platform for binaries. In CAV, 2008.
13. K. Coogan, S. Debray, T. Kaochar, and G. Townsend. Automatic static unpacking of malware binaries. In WCRE'09, 2009.
14. K. Dam and T. Touili. Malware detection based on graph classification. In ICISSP, 2017.
15. K. Dam and T. Touili. Learning malware using generalized graph kernels. In ARES, 2018.
16. K. Dam and T. Touili. Precise extraction of malicious behaviors. In COMPSAC, 2018.
17. K. Gyung et al. Renovo: A hidden code extractor for packed executables. In WORM, 2007.
18. K. Roundy and B. Miller. Hybrid analysis and control of malware. In RAID, 2010.
19. M. Vardi and P. Wolper. Reasoning about infinite computations. Inf. Comput., 115(1), 1994.
20. P. Beaucamps, I. Gnaedig, and J. Marion. Behavior abstraction in malware analysis. In Runtime Verification, 2010.
21. P. Gastin and D. Oddoux. Fast LTL to Büchi automata translation. In CAV, 2001.
22. P. Royal, M. Halpin, et al. Polyunpack: Automating the hidden-code extraction of unpack-executing malware. In ACSAC, 2006.
23. P. Singh and A. Lakhotia. Static verification of worm and virus behavior in binary executables using model checking. In IAW, 2003.
24. S. Blazy, V. Laporte, and D. Pichardie. Verified abstract interpretation techniques for disassembling low-level self-modifying code. JAR, 56(3), 2016.
25. S. Cutler. malshare. https://malshare.com
26. S. Debray, K. Coogan, and G. Townsend. On the semantics of self-unpacking malware code. Tech. rep., University of Arizona, Computer Science, 2008.
27. S. Schwoon. Model-checking pushdown systems. PhD thesis, Technische Universität München, Universitätsbibliothek, 2002.
28. Unpacker Tool. Automated unpacking: A behaviour based approach. https://github.com/malwaremusings/unpacker
29. T. Touili and X. Ye. Reachability analysis of self modifying code. In ICECCS, 2017.
30. T. Touili and X. Ye. LTL model checking of self modifying code. In ICECCS, 2019.
31. V. Heaven. V.heavens. http://vxer.org/lib/
32. VirusShare. vxshare. https://virusshare.com