Cartesian Tree Matching and Indexing
Sung Gwan Park
Seoul National University, [email protected]
Amihood Amir
Bar-Ilan University, [email protected]
Gad M. Landau
University of Haifa, Israel and New York University, [email protected]
Kunsoo Park
Seoul National University, [email protected]
Abstract
We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n + m) time algorithm for single pattern matching, and an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.

2012 ACM Subject Classification Theory of computation → Pattern matching
Keywords and phrases
Cartesian tree matching, Pattern matching, Indexing, Parent-distance representation
1 Introduction

String matching is one of the fundamental problems in computer science, and it can be applied to many practical problems. In many applications string matching has variants derived from exact matching (which can be collectively called generalized matching), such as order-preserving matching [19, 20, 22], parameterized matching [4, 7, 8], jumbled matching [9], overlap matching [3], pattern matching with swaps [2], and so on. These problems are characterized by the way of defining a match, which depends on the application domains of the problems. In financial markets, for example, people want to find some patterns in the time series data of stock prices. In this case, they would like to know more about some pattern of price fluctuations than the exact prices themselves [15]. Therefore, we need a definition of match which is appropriate to handle such cases.

The Cartesian tree [27] is a tree data structure that represents an array, focusing only on the results of comparisons between numeric values in the array. In this paper we introduce a new metric of match, called
Cartesian tree matching, which means that two strings match if they have the same Cartesian trees.

© Sung Gwan Park, Amihood Amir, Gad M. Landau, and Kunsoo Park; licensed under Creative Commons License CC-BY. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Editors: Nadia Pisanti and Solon P. Pissis; Article No. 13; pp. 13:1–13:14. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

If we model the time series of stock prices as a numerical string, we can find a desired pattern in the data by solving a Cartesian tree matching problem. For example, let's assume that the pattern we want to find looks like the picture on the left of Figure 1, which is a common pattern called the head-and-shoulder [15] (in fact there are two versions of the head-and-shoulder: one is the picture in Figure 1 and the other is the picture reversed). The picture on the right of Figure 1 is the Cartesian tree corresponding to the pattern on the left. Cartesian tree matching finds every position of the text which has the same Cartesian tree as the picture on the right of Figure 1.

Even though order-preserving matching [19, 20, 22] can also be applied to finding patterns in time series data, Cartesian tree matching may be more appropriate than order-preserving matching in finding patterns. For instance, let's assume that we are looking for the pattern in Figure 1 in time series stock prices. An important characteristic of the pattern is that the price hit the bottom (head), and it has two shoulders before and after the head. But the relative order between the two shoulders (i.e., which one is higher) does not matter. If we model this pattern into order-preserving matching, then order-preserving matching imposes a relative order between the two shoulders S[2] and S[6]. Moreover, it imposes an unnecessary order between the two valleys S[3] and S[5].
Hence, order-preserving matching may not be able to find such a pattern in time series data. In contrast, the pattern in Figure 1 can be represented by one Cartesian tree, and therefore Cartesian tree matching is a more appropriate metric in such cases.

In this paper we define string matching problems based on Cartesian tree matching: single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m, and we present efficient algorithms for them. We also define an index data structure called Cartesian suffix tree, as in the cases of parameterized matching and order-preserving matching [8, 13], and present an efficient algorithm to build the Cartesian suffix tree. To obtain efficient algorithms for Cartesian tree matching, we define a representation of the Cartesian tree, called the parent-distance representation.

In Section 2 we give basic definitions for Cartesian tree matching. In Section 3 we propose an O(n + m) time algorithm for single pattern matching. In Section 4 we present an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. In Section 5 we define the Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree of a string of length n.

Figure 1
Example pattern S = (6, 2, 5, 1, 4, 3, 7) and its corresponding Cartesian tree (root S[4] = 1; its children S[2] = 2 and S[6] = 3; below them S[1] = 6, S[3] = 5, S[5] = 4, S[7] = 7).

2 Preliminaries

A string is a sequence of characters in an alphabet Σ, which is a set of integers. We assume that the comparison between any two characters can be done in O(1) time. For a string S, S[i] represents the i-th character of S, and S[i..j] represents the substring of S starting from i and ending at j.

A string S can be associated with its corresponding Cartesian tree CT(S) according to the following rules [27]: if S is an empty string, CT(S) is an empty tree; if S[1..n] is not empty and S[i] is the minimum value among S, CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree (if there are two or more minimum values, we choose the leftmost one as the root). Since each character in string S corresponds to a node in Cartesian tree CT(S), we can treat each character as a node in the Cartesian tree.

Cartesian tree matching is the problem of finding all the matches in the text which have the same Cartesian tree as a given pattern. Formally, we define it as follows:
Definition 1. (Cartesian tree matching) Given two strings, text T[1..n] and pattern P[1..m], find every 1 ≤ i ≤ n − m + 1 such that CT(T[i..i+m−1]) = CT(P[1..m]).

For example, let's consider a sample text T = (41, …) and the pattern P = (6, 2, 5, 1, 4, 3, 7). Then CT(T[5..11]) = CT(P[1..7]): the relative order between T[6] = 23 and T[10] = 22 is different from that between P[2] = 2 and P[6] = 3, but it is a match in Cartesian tree matching.

3 Single Pattern Matching in O(n + m) Time

3.1 Parent-distance representation
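As a baseline, Definition 1 can be realized directly by building the Cartesian tree of every length-m window of the text and comparing tree shapes. The following Python sketch (ours; the function names are illustrative, not from the paper) does exactly that, at a cost of roughly O(nm) tree constructions:

```python
def cartesian_tree(S):
    """Shape of CT(S) as nested (left, right) pairs; leftmost minimum is the root."""
    if not S:
        return None
    i = S.index(min(S))  # list.index returns the leftmost occurrence of the minimum
    return (cartesian_tree(S[:i]), cartesian_tree(S[i + 1:]))

def naive_ct_match(T, P):
    """All 1-based positions i with CT(T[i..i+m-1]) = CT(P[1..m]) per Definition 1."""
    m, shape = len(P), cartesian_tree(P)
    return [i + 1 for i in range(len(T) - m + 1)
            if cartesian_tree(T[i:i + m]) == shape]
```

For instance, naive_ct_match([3, 1, 2, 5], [1, 2]) returns [2, 3]: both windows rise, so their Cartesian trees coincide even though the actual values differ. The parent-distance representation below avoids this repeated tree building.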
In order to solve Cartesian tree matching without building every possible Cartesian tree, we propose an efficient representation to store the information about Cartesian trees, called the parent-distance representation.

Definition 2. (Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is an integer string PD(S)[1..n], which is defined as follows:

PD(S)[i] = i − max{ j : 1 ≤ j < i and S[j] ≤ S[i] } if such j exists, and PD(S)[i] = 0 otherwise.

For example, the parent-distance representation of a string S = (…, 1) is PD(S) = (0, …). S[j] in Definition 2 represents the parent of S[i] in Cartesian tree CT(S[1..i]); if there is no such j, S[i] is the root of Cartesian tree CT(S[1..i]). Theorem 3 shows that the parent-distance representation has a one-to-one mapping to the Cartesian tree, so it can substitute the Cartesian tree without any loss of information.
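To illustrate Definition 2 (and the equivalence that Theorem 3 establishes next), the following Python sketch of ours computes PD(S) directly from the definition, in quadratic time:

```python
def pd_naive(S):
    """PD(S) computed directly from Definition 2 in O(n^2) time (1-based distances)."""
    PD = []
    for i in range(len(S)):
        # distance to the nearest j < i with S[j] <= S[i], or 0 if no such j exists
        dist = 0
        for j in range(i - 1, -1, -1):
            if S[j] <= S[i]:
                dist = i - j
                break
        PD.append(dist)
    return PD
```

For example, pd_naive([6, 2, 5, 1, 4, 3, 7]) and pd_naive([9, 4, 8, 1, 6, 5, 7]) both return [0, 0, 1, 0, 1, 2, 1]: the two strings have the same Cartesian tree, and indeed the same parent-distance representation.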
Theorem 3.

Two strings S and S′ have the same Cartesian trees if and only if S and S′ have the same parent-distance representations.

Proof.

If two strings have different lengths, they have different Cartesian trees and different parent-distance representations, so the theorem holds. Therefore, we need only consider the case where S and S′ have the same length. Let n be the length of S and S′. We prove the theorem by induction on n.

If n = 1, S and S′ always have the same Cartesian trees with only one node. Furthermore, they have the same parent-distance representation (0). Therefore, the theorem holds when n = 1.

Let's assume that the theorem holds when n = k, and show that it holds when n = k + 1.

(⇒) Assume that S[1..k+1] and S′[1..k+1] have the same Cartesian trees (i.e., CT(S[1..k+1]) = CT(S′[1..k+1])). There are two cases.

If S[k+1] and S′[k+1] are not roots of the Cartesian trees, let S[j] be the parent of S[k+1], and S′[l] the parent of S′[k+1]. Since CT(S[1..k+1]) = CT(S′[1..k+1]), we have j = l. If we remove S[k+1] from Cartesian tree CT(S[1..k+1]), we obtain the tree CT(S[1..k]), where the left subtree of S[k+1] is attached to its parent S[j]. If we remove S′[k+1] from CT(S′[1..k+1]), we obtain CT(S′[1..k]) in the same way. Since CT(S[1..k+1]) = CT(S′[1..k+1]), we get CT(S[1..k]) = CT(S′[1..k]), and therefore PD(S)[1..k] = PD(S′)[1..k] by the induction hypothesis. Since PD(S)[k+1] = k + 1 − j and PD(S′)[k+1] = k + 1 − l (and j = l), we have PD(S) = PD(S′).

If S[k+1] and S′[k+1] are roots, we remove S[k+1] and S′[k+1] to get CT(S[1..k]) and CT(S′[1..k]). Since CT(S[1..k+1]) = CT(S′[1..k+1]), we have CT(S[1..k]) = CT(S′[1..k]), and therefore PD(S)[1..k] = PD(S′)[1..k] by the induction hypothesis. Since PD(S)[k+1] = PD(S′)[k+1] = 0 in this case, we get PD(S) = PD(S′).

(⇐) Assume that S[1..k+1] and S′[1..k+1] have the same parent-distance representations (i.e., PD(S)[1..k+1] = PD(S′)[1..k+1]). Since PD(S)[1..k] = PD(S′)[1..k], we have CT(S[1..k]) = CT(S′[1..k]) by the induction hypothesis. From CT(S[1..k]), we can derive CT(S[1..k+1]) as follows. If PD(S)[k+1] >
0, let x be S [ k + 1 − P D ( S )[ k + 1]].We insert S [ k + 1] into CT ( S [1 ..k ]) so that the parent of S [ k + 1] is x and the originalright subtree of x becomes the left subtree of S [ k + 1]. If P D ( S )[ k + 1] = 0, S [ k + 1] is theroot of CT ( S [1 ..k + 1]) and CT ( S [1 ..k ]) becomes the left subtree of S [ k + 1]. We derive CT ( S [1 ..k + 1]) from CT ( S [1 ..k ]) in the same way. Since CT ( S [1 ..k ]) = CT ( S [1 ..k ]) and P D ( S )[ k + 1] = P D ( S )[ k + 1], we can conclude that CT ( S [1 ..k + 1]) = CT ( S [1 ..k + 1]).Therefore, we have proved that there is a one-to-one mapping between Cartesian treesand parent-distance representations. (cid:74) Given a string S [1 ..n ], we can compute the parent-distance representation in linear timeusing a stack, as in [13, 14]. The main idea is that if two characters S [ i ] and S [ j ] for i < j satisfy S [ i ] > S [ j ], S [ i ] cannot be the parent of S [ k ] for any k > j . Therefore, we will onlystore S [ i ] which does not have such S [ j ] while scanning from left to right. If we store such S [ i ] only, they form a non-decreasing subsequence of S . When we consider a new value,therefore, we can pop values that are larger than the new value, find its parent, and pushthe new value and its index into the stack. Algorithm 1 describes the algorithm to compute P D ( S ).Furthermore, given the parent-distance representation of string S , we can compute theparent-distance representation of any substring S [ i..j ] easily. To compute P D ( S [ i..j ])[ k ], we .G. Park, A. Amir, G.M. Landau, and K. Park 13:5 Algorithm 1
Computing parent-distance representation of a string procedure PARENT-DIST-REP ( S [1 ..n ]) ST ← an empty stack for i ← to n do while ST is not empty do ( value, index ) ← ST.top if value ≤ S [ i ] then break ST.pop if ST is empty then P D ( S )[ i ] ← else P D ( S )[ i ] ← i − index ST.push (( S [ i ] , i )) return P D ( S )need only check whether the parent of S [ i + k −
1] is within S [ i..j ] or not (i.e., the parent isoutside if P D ( S )[ i + k − ≥ k ). P D ( S [ i..j ])[ k ] = ( P D ( S )[ i + k − ≥ kP D ( S )[ i + k −
1] otherwise. (1)For example, the parent-distance representation of string S = (2 , , , , , ,
1) is PD(S) = (0, …), and by Equation (1) we get PD(S[2..7]) = (0, …).

We can define a failure function similar to the one used in the KMP algorithm [21].
Definition 4. (Failure function) The failure function π of string P is an integer string such that:

π[q] = max{ k : CT(P[1..k]) = CT(P[q−k+1..q]) for 1 ≤ k < q } if q > 1, and π[q] = 0 if q = 1.

That is, π[q] is the largest k such that the prefix and the suffix of P[1..q] of length k have the same Cartesian trees. For example, assuming that P = (5, …), we have π = (0, …); since CT(P[1..4]) = CT(P[3..6]), we get π[6] = 4. We will present an algorithm to compute the failure function of a given string in Section 3.5.

As in the original KMP text search algorithm, we can use the failure function in order to achieve linear-time text search: scan the text from left to right, and use the failure function every time we find a mismatch between the text and the pattern. We apply this idea to Cartesian tree matching.
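Before the pseudocode, here is a self-contained Python sketch of ours combining the three ingredients of the single-pattern algorithm: the stack-based parent-distance computation (Algorithm 1), the failure function of Definition 4 evaluated through Equation (1), and the KMP-style scan with a deque. Algorithms 2 and 3 below give the precise pseudocode; all names here are illustrative.

```python
from collections import deque

def parent_dist_rep(S):
    """PD(S) in linear time with a stack of (value, 1-based index) pairs."""
    PD, st = [], []
    for i, v in enumerate(S, 1):
        while st and st[-1][0] > v:    # values > v can never be parents later
            st.pop()
        PD.append(i - st[-1][1] if st else 0)
        st.append((v, i))
    return PD

def failure_func(P):
    """Failure function of Definition 4 via parent-distance comparisons."""
    PD_P, m = parent_dist_rep(P), len(P)
    pi = [0] * (m + 1)                 # pi[1..m]; pi[0] unused
    ln = 0
    for i in range(2, m + 1):
        while ln > 0:
            # PD(P[i-ln..i])[ln+1] by Equation (1): 0 if the parent falls outside
            x = 0 if PD_P[i - 1] >= ln + 1 else PD_P[i - 1]
            if x == PD_P[ln]:          # equals PD(P[1..ln+1])[ln+1]
                break
            ln = pi[ln]
        ln += 1
        pi[i] = ln
    return pi

def ct_search(T, P):
    """All 1-based occurrences of P in T under Cartesian tree matching."""
    PD_P, pi, m = parent_dist_rep(P), failure_func(P), len(P)
    dq, ln, out = deque(), 0, []       # dq: non-decreasing (value, index) pairs
    for i, v in enumerate(T, 1):
        while dq and dq[-1][0] > v:    # back becomes nearest j with T[j] <= T[i]
            dq.pop()
        while ln > 0:
            x = i - dq[-1][1] if dq else 0   # PD(T[i-ln..i])[ln+1]
            if x == PD_P[ln]:
                break
            ln = pi[ln]
            while dq and dq[0][1] < i - ln:  # drop characters left of the window
                dq.popleft()
        ln += 1
        dq.append((v, i))
        if ln == m:
            out.append(i - m + 1)
            ln = pi[ln]
            while dq and dq[0][1] <= i - ln:
                dq.popleft()
    return out
```

For example, ct_search([4, 1, 3, 2, 6, 5, 7, 0, 2, 1], [2, 1, 3]) returns [1, 3, 5, 7]: every reported window falls, then rises above its minimum, matching CT(2, 1, 3).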
Algorithm 2
Text search of Cartesian tree matching
1:  procedure CARTESIAN-TREE-MATCH(T[1..n], P[1..m])
2:      PD(P) ← PARENT-DIST-REP(P)
3:      π ← FAILURE-FUNC(P)
4:      len ← 0
5:      DQ ← an empty deque
6:      for i ← 1 to n do
7:          Pop elements (value, index) from the back of DQ such that value > T[i]
8:          while len ≠ 0 do
9:              if PD(T[i−len..i])[len+1] = PD(P)[len+1] then
10:                 break
11:             else
12:                 len ← π[len]
13:             Delete elements (value, index) from the front of DQ such that index < i − len
14:         len ← len + 1
15:         DQ.push_back((T[i], i))
16:         if len = m then
17:             print "Match occurred at i − m + 1"
18:             len ← π[len]
19:             Delete elements (value, index) from the front of DQ such that index ≤ i − len

In order to perform a text search using O(m) space, we compute the parent-distance representation of the text online as we read the text, so that we don't need to store the parent-distance representation of the whole text, which would cost O(n) space. Furthermore, among the text characters which are matched with the pattern, we only have to store the elements that form a non-decreasing subsequence, using a deque (instead of the stack of Section 3.2) in order to delete elements at the front. Using this idea, we can keep the size of the deque always smaller than or equal to m. Therefore, we can perform the text search using O(m) space. Algorithm 2 shows the text search algorithm of Cartesian tree matching. In line 9 we need to compute x = PD(T[i−len..i])[len+1]. If the deque is empty, then x = 0. Otherwise, let (value, index) be the element at the back of the deque; then x = i − index. This computation takes constant time. Just before line 14, we do not compare PD(T[i..i])[1] and PD(P)[1] when len = 0, because they always match; therefore, we can safely perform line 14.

We compute the failure function π in a way similar to the text search, as in the KMP algorithm. However, we can compute the parent-distance representation of the pattern in O(m) time before we compute the failure function.
Hence we don't need a deque, and the computation is slightly simpler than the text search. Algorithm 3 shows the procedure to compute the failure function.

Since our algorithm for Cartesian tree matching, including the text search and the computation of the failure function, follows the KMP algorithm, it is easy to see that our algorithm correctly finds all occurrences (in the sense of Cartesian tree matching) of the pattern in the text.
Algorithm 3
Computing the failure function in Cartesian tree matching
procedure FAILURE-FUNC(P[1..m])
    PD(P) ← PARENT-DIST-REP(P)
    len ← 0
    π[1] ← 0
    for i ← 2 to m do
        while len ≠ 0 do
            if PD(P[i−len..i])[len+1] = PD(P[1..len+1])[len+1] then break
            else len ← π[len]
        len ← len + 1
        π[i] ← len

Since our algorithm checks one character of the parent-distance representation in constant time, it takes O(n) time for the text search and O(m) time to compute the failure function, as in the KMP algorithm. Therefore, our algorithm requires O(n + m) time for Cartesian tree matching using O(m) space.

There is an alternative representation of Cartesian trees, called
Cartesian tree signature [14]. The Cartesian tree signature of S[1..n] is an array L[1..n] such that L[i] equals the number of elements popped from the stack in the i-th iteration of Algorithm 1. Furthermore, the Cartesian tree signature can be represented as a bit string 0^{L[1]} 1 0^{L[2]} 1 · · · 0^{L[n]} 1 of length at most 2n, which is a succinct representation of a Cartesian tree. For example, the Cartesian tree signature of a string S = (2, …, 1) is L = (0, …). The signature is maintained together with an auxiliary array D[1..n], which is defined as follows: if S[i] is never popped out from the stack, D[i] = 0; otherwise, let S[j] be the value which popped S[i] out from the stack, and D[i] = j − i. For the string S = (2, …, 1), D = (6, …).

Using D, we can delete one character at the front of a string S[1..n] in constant time. In order to get the Cartesian tree signature L′ and its corresponding D′ for S[2..n], we do the following: if D[1] > 0, we decrease L[D[1] + 1] by one and erase L[1] from L; if D[1] = 0, we just erase L[1]. After that, we delete D[1] from D to get D′. For example, if we want to delete one character at the front of S = (2, …, 1), we decrease L[D[1] + 1] = L[7] by one, and delete L[1] and D[1]. This results in L′ = (0, …, 1) and D′ = (1, …), which are the correct L and D of S[2..7] = (7, …). String matching with Cartesian tree signatures needs both L and D, each of which uses the same space as the parent-distance representation. For Cartesian tree matching, therefore, it uses more space than Algorithm 2.
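The signature L and the array D can be obtained in one stack pass; the following Python sketch of ours (0-based arrays, illustrative names) mirrors the pop-counting description above:

```python
def ct_signature(S):
    """Cartesian tree signature L and pop-distance array D (0-based sketch).
    L[i] counts the stack pops performed when S[i] arrives; D[i] = j - i if
    S[j] later pops S[i] from the stack, and D[i] = 0 if S[i] is never popped."""
    n = len(S)
    L, D = [0] * n, [0] * n
    st = []                          # indices whose values are non-decreasing
    for i in range(n):
        while st and S[st[-1]] > S[i]:
            j = st.pop()
            L[i] += 1
            D[j] = i - j
        st.append(i)
    return L, D
```

For S = (2, 5, 4, 1, 3) this returns L = [0, 0, 1, 2, 0] and D = [3, 1, 1, 0, 0]: in 1-based terms, S[4] = 1 pops both S[3] = 4 (distance 1) and S[1] = 2 (distance 3).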
Aho-Corasick automaton for P1 = (4, …), P2 = (3, …), and P3 = (1, …).

4 Multiple Pattern Matching in O((n + m) log k) Time
In this section we extend Cartesian tree matching to the case of multiple patterns. Definition 5 gives the formal definition of multiple pattern matching.

Definition 5. (Multiple pattern Cartesian tree matching) Given a text T[1..n] and patterns P1[1..m1], P2[1..m2], ..., Pk[1..mk], where m = m1 + m2 + · · · + mk, multiple pattern Cartesian tree matching is to find every position in the text which matches at least one pattern, i.e., which has the same Cartesian tree as that of at least one pattern.

We modify the Aho-Corasick algorithm [1], using the parent-distance representation defined in Section 3.1, to do multiple pattern matching in O((n + m) log k) time. Instead of using the patterns themselves in the Aho-Corasick automaton, we use their parent-distance representations to make an automaton. Each node in the automaton corresponds to the prefix of the parent-distance representation of some pattern. We maintain two integers idx and len for every node such that the node corresponds to the parent-distance representation of the pattern prefix P_idx[1..len]. If there is more than one possible index, we store the smallest one. Each node also has a state transition function trans(x), which gets an integer x as an input and returns the next node, or reports that there is no such node. We can construct the trie and the state transition function for every node in O(m log k) time, assuming that we use a balanced binary search tree to implement the transition function. Figure 2 shows an Aho-Corasick automaton for three patterns P1 = (4, …), P2 = (3, …), and P3 = (1, …); we use the parent-distance representations PD(P1) = (0, …), PD(P2) = (0, …), and PD(P3) = (0, …,
2) to construct the automaton.

The failure function π of the Aho-Corasick automaton is defined as follows. Let q_i be a node in the automaton, and s_i the string that node q_i represents in the trie. Let s_j be the longest proper suffix of s_i which matches (in the sense of Cartesian tree matching) a prefix s_k of some pattern. The failure function of q_i is defined as the node q_k (i.e., π[q_i] = q_k). The dotted lines in Figure 2 show the failure function of each node. For example, one node q represents P[1..4], and its failure link points to the node that represents P[1..2], because P[3..4] matches P[1..2] (i.e., PD(P[1..2]) = PD(P[3..4]) = (0, …)) and P[3..4] is the longest suffix of P[1..4] that matches a prefix of some pattern. Note that the parent-distance representation of s_k may not be a suffix of the parent-distance representation of s_i. For example, one node has the parent-distance representation (0, …), while the node its failure link points to has the parent-distance representation (0, 0), which is not a suffix of (0, …).

Algorithm 4
Computing the failure function in multiple pattern matching
procedure MULTIPLE-FAILURE-FUNC(P1, P2, ..., Pk)
    for i ← 1 to k do
        PD(Pi) ← PARENT-DIST-REP(Pi)
    TR ← Build trie with the PD(Pi)'s
    for node ← breadth-first traversal of the trie do
        len ← len[node]
        idx ← idx[node]
        π[node] ← TR.root
        ptr ← parent of node in the trie
        while ptr ≠ TR.root do
            ptr ← π[ptr]
            plen ← len[ptr]
            x ← PD(P_idx[len − plen..len])[plen + 1]
            if ptr.trans(x) exists then
                π[node] ← ptr.trans(x)
                break

In Algorithm 4, node corresponds to the parent-distance representation of P_idx[1..len], and so the parent of node corresponds to the parent-distance representation of P_idx[1..len−1]. Hence ptr corresponds to the parent-distance representation of some suffix of P_idx[1..len−1], because ptr is a node that can be reached from the parent of node by following the failure links. Since ptr corresponds to some string of length plen, we can conclude that ptr represents P_idx[len−plen..len−1]. We should check whether P_idx[len−plen..len] matches some node in the trie, so we check whether ptr has a transition using x = PD(P_idx[len−plen..len])[plen+1]. If ptr has the transition ptr.trans(x), it corresponds to P_idx[len−plen..len], and we can conclude that π[node] = ptr.trans(x). If ptr doesn't have such a transition, there is no node that represents P_idx[len−plen..len], and thus we have to continue the loop.

For example, suppose that we compute the failure function of the node q in Figure 2 with idx[q] = 2 and len[q] = 4; that is, q represents P2[1..4]. The parent of q represents P2[1..3], and we start with ptr = π[parent of q]. Since len[ptr] = 2, we know that ptr represents a pattern prefix of length 2 that matches P2[2..3]. In order to check whether P2[2..4] matches some node in the trie, we compute x = PD(P2[2..4])[3] and check whether ptr.trans(x) exists. Since there is no such transition, we continue the while loop with ptr ← π[ptr]. The new ptr represents a pattern prefix of length 1 that matches P2[3..3], from len[ptr] = 1. In order to check whether P2[3..4] matches some node, we compute x = PD(P2[3..4])[2] and check whether ptr.trans(x) exists. Since there is such a transition (here x = 0), we conclude that π[q] = ptr.trans(0). Note that x may change during the while loop, which is not the case in the Aho-Corasick algorithm.

While computing the failure function, we can also compute the output function in the same way as the Aho-Corasick algorithm. The output function of node q_i is the set of patterns which match some suffix of s_i. This function is used to output all possible matches at the node.

Using the automaton defined above, we can solve multiple pattern Cartesian tree matching in O(n log k) time. The text search algorithm is essentially the same as that of the Aho-Corasick algorithm, following the trie and using the failure links in case of any mismatch. As in the single pattern case, we compute the parent-distance representation of the text online in the same way as Algorithm 2 (using a deque) to ensure O(m) space. The time complexity of our multiple pattern Cartesian tree matching is O((n + m) log k) using O(m) space, where the log k factor is due to the binary search tree in each node. Since there can be at most k outgoing edges from each node, we can perform an operation in the binary search tree in O(log k) time. Combined with the time-complexity analysis of the Aho-Corasick algorithm, this shows that our algorithm has the time complexity of O((n + m) log k). We can reduce the time complexity further to randomized O(n + m) time by using a hash instead of a binary search tree [12].

5 Cartesian Suffix Tree in O(n) Time
In this section we apply the notion of Cartesian tree matching to the suffix tree, as in the cases of parameterized matching and order-preserving matching [8, 13]. We first define the Cartesian suffix tree, and show that it can be built in randomized O(n) time or worst-case O(n log n) time using the result of Cole and Hariharan [12].

The Cartesian suffix tree is an index data structure that allows us to find an occurrence of a given pattern P[1..m] in randomized O(m) time or worst-case O(m log n) time, where n is the length of the text string. In order to store the information of Cartesian suffix trees efficiently, we again use the parent-distance representation from Section 3.1. Definition 6 gives the formal definition of the Cartesian suffix tree.

Definition 6. (Cartesian suffix tree) Given a string T[1..n], the Cartesian suffix tree of T is a compacted trie built with PD(T[i..n]) · (−1) for every 1 ≤ i ≤ n (where the special character −1 does not occur in any PD(T[i..n])).

For example, consider T = (2, …). The node marked A in Figure 3 corresponds to substring T[1..5] or T[6..10], since PD(T[1..5]) = PD(T[6..10]) = (0, …). Each node stores a suffix number together with start and end positions; for instance, node A stores the suffix number 1 or 6, and start and end positions 3 and 5.

Figure 3
Cartesian suffix tree of T = (2, …).

There are several algorithms that efficiently construct the suffix tree, such as McCreight's algorithm [24] and Ukkonen's algorithm [26]. However, the distinct right context property [16, 8] should hold in order to apply these algorithms, which means that the suffix link of every internal node should point to an explicit node. The Cartesian suffix tree does not have the distinct right context property: in Figure 3, the internal node marked with A does not satisfy this property, because PD(T[2..5]) = PD(T[7..10]) = (…, 0) and thus there is no explicit node corresponding to the parent-distance representation (0, …).

Cole and Hariharan's algorithm works on a quasi-suffix collection, which satisfies the following properties: a quasi-suffix collection is a set of n strings s_1, s_2, ..., s_n, where the length of s_i is n + 1 − i; for any two different strings s_i and s_j, s_i should not be a prefix of s_j; and for any i and j, if s_i and s_j have a common prefix of length l, then s_{i+1} and s_{j+1} should have a common prefix of length at least l − 1.

Our strings satisfy these properties. Suppose that s_i = PD(T[i..n]) · (−1) and s_j = PD(T[j..n]) · (−1) have a common prefix of length l, i.e., PD(T[i..i+l−1]) = PD(T[j..j+l−1]). Then PD(T[i+1..i+l−1]) = PD(T[j+1..j+l−1]), and thus s_{i+1} = PD(T[i+1..n]) · (−1) and s_{j+1} = PD(T[j+1..n]) · (−1) have a common prefix of length at least l − 1.

Cole and Hariharan's algorithm also requires a character oracle, which returns the i-th character of s_j in constant time. We can do this in constant time using Equation (1), once the parent-distance representation of T is computed.
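A minimal Python sketch of ours of this oracle, assuming PD(T) is available from the stack method of Algorithm 1 (all names are illustrative):

```python
def parent_dist_rep(S):
    """PD(S) in linear time with a stack, as in Algorithm 1 (1-based distances)."""
    PD, st = [], []
    for i, v in enumerate(S, 1):
        while st and st[-1][0] > v:
            st.pop()
        PD.append(i - st[-1][1] if st else 0)
        st.append((v, i))
    return PD

def make_oracle(T):
    """Constant-time character oracle for s_j = PD(T[j..n]) . (-1), 1-based j and k."""
    PD, n = parent_dist_rep(T), len(T)
    def char(j, k):
        if k == n + 2 - j:            # the terminating special character
            return -1
        v = PD[j + k - 2]             # PD(T)[j+k-1]; clip by Equation (1)
        return 0 if v >= k else v
    return char
```

For T = (2, 5, 4, 1, 3), char(2, 1), ..., char(2, 5) spell out s_2 = (0, 0, 0, 1, −1), which is exactly PD(T[2..5]) followed by the special character.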
Since we have all the properties needed to perform Cole and Hariharan's algorithm, we can construct a Cartesian suffix tree in randomized O(n) time using O(n) space [12]. In the worst case, it can be built in O(n log n) time by using a binary search tree instead of a hash table to store the children of each node in the suffix tree, because the alphabet size |Σ| is O(n). We can also modify our algorithm to construct a Cartesian suffix tree online, using the idea in [23, 25].

6 Conclusion

We have defined Cartesian tree matching and the parent-distance representation of a Cartesian tree. We developed a linear-time algorithm for single pattern matching, and an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. Finally, we defined an index data structure called Cartesian suffix tree, and showed that it can be constructed in O(n) randomized time. We believe that the notion of Cartesian tree matching, which is a new metric on string matching and indexing over numeric strings, can be used in many applications.

There have been many works on approximate generalized matching. For example, there are results for approximate order-preserving matching [11], approximate jumbled matching [10], approximate swapped matching [5], and approximate parameterized matching [6, 18]. There are also results on computing the period of a generalized string, such as computing the period in the order-preserving model [17]. Since Cartesian tree matching is first introduced in this paper, many problems, including approximate matching and computing the period in the Cartesian tree matching model, are future research topics.

Acknowledgments
S.G. Park and K. Park were supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00551, Framework of Practical Algorithms for NP-hard Graph Problems). A. Amir and G.M. Landau were partially supported by the Israel Science Foundation grant 571/14, and Grant No. 2014028 from the United States-Israel Binational Science Foundation (BSF).
References
[1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975. doi:10.1145/360825.360855.
[2] Amihood Amir, Yonatan Aumann, Gad M. Landau, Moshe Lewenstein, and Noa Lewenstein. Pattern matching with swaps. J. Algorithms, 37(2):247–266, 2000. doi:10.1006/jagm.2000.1120.
[3] Amihood Amir, Richard Cole, Ramesh Hariharan, Moshe Lewenstein, and Ely Porat. Overlap matching. Inf. Comput., 181(1):57–74, 2003. doi:10.1016/S0890-5401(02)00035-4.
[4] Amihood Amir, Martin Farach, and S. Muthukrishnan. Alphabet dependence in parameterized matching. Inf. Process. Lett., 49(3):111–115, 1994. doi:10.1016/0020-0190(94)90086-8.
[5] Amihood Amir, Moshe Lewenstein, and Ely Porat. Approximate swapped matching. Inf. Process. Lett., 83(1):33–39, 2002. doi:10.1016/S0020-0190(01)00302-7.
[6] Alberto Apostolico, Péter L. Erdös, and Moshe Lewenstein. Parameterized matching with mismatches. J. Discrete Algorithms, 5(1):135–140, 2007. doi:10.1016/j.jda.2006.03.014.
[7] Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, May 16-18, 1993, San Diego, CA, USA, pages 71–80, 1993. doi:10.1145/167088.167115.
[8] Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput., 26(5):1343–1362, 1997. doi:10.1137/S0097539793246707.
[9] Peter Burcsi, Ferdinando Cicalese, Gabriele Fici, and Zsuzsanna Lipták. Algorithms for jumbled pattern matching in strings. Int. J. Found. Comput. Sci., 23(2):357–374, 2012. doi:10.1142/S0129054112400175.
[10] Peter Burcsi, Ferdinando Cicalese, Gabriele Fici, and Zsuzsanna Lipták. On approximate jumbled pattern matching in strings. Theory Comput. Syst., 50(1):35–51, 2012. doi:10.1007/s00224-011-9344-5.
[11] Tamanna Chhabra, Emanuele Giaquinta, and Jorma Tarhio. Filtration algorithms for approximate order-preserving matching. In String Processing and Information Retrieval - 22nd International Symposium, SPIRE 2015, London, UK, September 1-4, 2015, Proceedings, pages 177–187, 2015. doi:10.1007/978-3-319-23826-5_18.
[12] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21-23, 2000, Portland, OR, USA, pages 407–415, 2000. doi:10.1145/335305.335352.
[13] Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Order-preserving indexing. Theor. Comput. Sci., 638:122–135, 2016. doi:10.1016/j.tcs.2015.06.050.
[14] Erik D. Demaine, Gad M. Landau, and Oren Weimann. On Cartesian trees and range minimum queries. Algorithmica, 68(3):610–625, 2014. doi:10.1007/s00453-012-9683-x.
[15] Tak-Chung Fu, Korris Fu-Lai Chung, Robert Wing Pong Luk, and Chak-man Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Eng. Appl. of AI, 20(3):347–364, 2007. doi:10.1016/j.engappai.2006.07.003.
[16] Raffaele Giancarlo. A generalization of the suffix tree to square matrices, with applications. SIAM J. Comput., 24(3):520–562, 1995. doi:10.1137/S0097539792231982.
[17] Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Arseny M. Shur, and Tomasz Walen. String periods in the order-preserving model. In STACS 2018, pages 38:1–38:16, 2018. doi:10.4230/LIPIcs.STACS.2018.38.
[18] Carmit Hazay, Moshe Lewenstein, and Dina Sokol. Approximate parameterized matching. In
Algorithms - ESA 2004, 12th Annual European Symposium, Bergen, Norway,September 14-17, 2004, Proceedings , pages 414–425, 2004. URL: https://doi.org/10.1007/978-3-540-30140-0_38 , doi:10.1007/978-3-540-30140-0\_38 . Jinil Kim, Amihood Amir, Joong Chae Na, Kunsoo Park, and Jeong Seop Sim. Onrepresentations of ternary order relations in numeric strings.
Mathematics in ComputerScience , 11(2):127–136, 2017. URL: https://doi.org/10.1007/s11786-016-0282-0 , doi:10.1007/s11786-016-0282-0 . C P M 2 0 1 9 Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park,Simon J. Puglisi, and Takeshi Tokuyama. Order-preserving matching.
Theor. Comput. Sci. ,525:68–79, 2014. URL: https://doi.org/10.1016/j.tcs.2013.10.006 , doi:10.1016/j.tcs.2013.10.006 . Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching instrings.
SIAM J. Comput. , 6(2):323–350, 1977. URL: https://doi.org/10.1137/0206024 , doi:10.1137/0206024 . Marcin Kubica, Tomasz Kulczynski, Jakub Radoszewski, Wojciech Rytter, and TomaszWalen. A linear time algorithm for consecutive permutation pattern matching.
Inf. Process.Lett. , 113(12):430–433, 2013. URL: https://doi.org/10.1016/j.ipl.2013.03.015 , doi:10.1016/j.ipl.2013.03.015 . Taehyung Lee, Joong Chae Na, and Kunsoo Park. On-line construction of parameterizedsuffix trees for large alphabets.
Inf. Process. Lett. , 111(5):201–207, 2011. URL: https://doi.org/10.1016/j.ipl.2010.11.017 , doi:10.1016/j.ipl.2010.11.017 . Edward M. McCreight. A space-economical suffix tree construction algorithm.
J. ACM ,23(2):262–272, 1976. URL: https://doi.org/10.1145/321941.321946 , doi:10.1145/321941.321946 . Joong Chae Na, Raffaele Giancarlo, and Kunsoo Park. On-line construction of two-dimensionalsuffix trees in O(n log n) time. Algorithmica , 48(2):173–186, 2007. URL: https://doi.org/10.1007/s00453-007-0063-x , doi:10.1007/s00453-007-0063-x . Esko Ukkonen. On-line construction of suffix trees.
Algorithmica , 14(3):249–260, 1995. URL: https://doi.org/10.1007/BF01206331 , doi:10.1007/BF01206331 . Jean Vuillemin. A unifying look at data structures.
Commun. ACM , 23(4):229–239, 1980.URL: https://doi.org/10.1145/358841.358852 , doi:10.1145/358841.358852doi:10.1145/358841.358852