Cartesian Tree Matching and Indexing
Sung Gwan Park
Seoul National University, [email protected]
Amihood Amir
Bar-Ilan University, [email protected]
Gad M. Landau
University of Haifa, Israel and New York University, [email protected]
Kunsoo Park
Seoul National University, [email protected]
Abstract
We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n + m) time algorithm for single pattern matching, and an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.

2012 ACM Subject Classification Theory of computation → Pattern matching
Keywords and phrases
Cartesian tree matching, Pattern matching, Indexing, Parent-distance representation
1 Introduction

String matching is one of the fundamental problems in computer science, and it can be applied to many practical problems. In many applications string matching has variants derived from exact matching (which can be collectively called generalized matching), such as order-preserving matching [19, 20, 22], parameterized matching [4, 7, 8], jumbled matching [9], overlap matching [3], pattern matching with swaps [2], and so on. These problems are characterized by the way of defining a match, which depends on the application domains of the problems. In financial markets, for example, people want to find some patterns in the time series data of stock prices. In this case, they would like to know more about some pattern of price fluctuations than the exact prices themselves [15]. Therefore, we need a definition of match which is appropriate to handle such cases.

The Cartesian tree [27] is a tree data structure that represents an array, focusing only on the results of comparisons between numeric values in the array. In this paper we introduce a new metric of match, called
Cartesian tree matching, which means that two strings match if they have the same Cartesian trees.

© Sung Gwan Park, Amihood Amir, Gad M. Landau, and Kunsoo Park; licensed under Creative Commons License CC-BY. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Editors: Nadia Pisanti and Solon P. Pissis; Article No. 13; pp. 13:1–13:14. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

If we model the time series of stock prices as a numerical string, we can find a desired pattern in the data by solving a Cartesian tree matching problem. For example, let's assume that the pattern we want to find looks like the picture on the left of Figure 1, which is a common pattern called the head-and-shoulder [15] (in fact there are two versions of the head-and-shoulder: one is the picture in Figure 1 and the other is the picture reversed). The picture on the right of Figure 1 is the Cartesian tree corresponding to the pattern on the left. Cartesian tree matching finds every position of the text which has the same Cartesian tree as the picture on the right of Figure 1.

Even though order-preserving matching [19, 20, 22] can also be applied to finding patterns in time series data, Cartesian tree matching may be more appropriate than order-preserving matching in finding patterns. For instance, let's assume that we are looking for the pattern in Figure 1 in time series stock prices. An important characteristic of the pattern is that the price hit the bottom (head), and it has two shoulders before and after the head. But the relative order between the two shoulders (i.e., which one is higher) does not matter. If we model this pattern into order-preserving matching, then order-preserving matching imposes a relative order between the two shoulders S[2] and S[6]. Moreover, it imposes an unnecessary order between the two valleys S[3] and S[5].
Hence, order-preserving matching may not be able to find such a pattern in time series data. In contrast, the pattern in Figure 1 can be represented by one Cartesian tree, and therefore Cartesian tree matching is a more appropriate metric in such cases.

In this paper we define string matching problems based on Cartesian tree matching: single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m, and we present efficient algorithms for them. We also define an index data structure called Cartesian suffix tree, as in the cases of parameterized matching and order-preserving matching [8, 13], and present an efficient algorithm to build the Cartesian suffix tree. To obtain efficient algorithms for Cartesian tree matching, we define a representation of the Cartesian tree, called the parent-distance representation.

In Section 2 we give basic definitions for Cartesian tree matching. In Section 3 we propose an O(n + m) time algorithm for single pattern matching. In Section 4 we present an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. In Section 5 we define the Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree of a string of length n.

Figure 1
Example pattern S = (6, 2, 5, 1, 4, 3, 7) and its corresponding Cartesian tree (root S[4] = 1; its children S[2] = 2 and S[6] = 3; below them S[1] = 6, S[3] = 5, S[5] = 4, S[7] = 7).

2 Preliminaries

A string is a sequence of characters in an alphabet Σ, which is a set of integers. We assume that the comparison between any two characters can be done in O(1) time. For a string S, S[i] represents the i-th character of S, and S[i..j] represents the substring of S starting from i and ending at j.

A string S can be associated with its corresponding Cartesian tree CT(S) according to the following rules [27]: if S is an empty string, CT(S) is an empty tree; if S[1..n] is not empty and S[i] is the minimum value among S, CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree (if there are two or more minimum values, we choose the leftmost one as the root). Since each character in string S corresponds to a node in Cartesian tree CT(S), we can treat each character as a node in the Cartesian tree.

Cartesian tree matching is the problem of finding all the matches in the text which have the same Cartesian tree as a given pattern. Formally, we define it as follows:
Definition 1. (Cartesian tree matching) Given two strings, text T[1..n] and pattern P[1..m], find every 1 ≤ i ≤ n − m + 1 such that CT(T[i..i+m−1]) = CT(P[1..m]).

For example, let's consider a sample text T = (41, …) and the pattern P = (6, 2, 5, 1, 4, 3, 7). Then CT(T[5..11]) = CT(P[1..7]): the relative order between T[6] = 23 and T[10] = 22 is different from that between P[2] = 2 and P[6] = 3, but it is a match in Cartesian tree matching.

3 Single Pattern Matching in O(n + m) Time

3.1 Parent-distance representation
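As a baseline, Definition 1 can be realized directly by building the Cartesian tree of every length-m window of the text and comparing tree shapes. The following Python sketch (ours; the function names are illustrative, not from the paper) does exactly that, at a cost of roughly O(nm) tree constructions:

```python
def cartesian_tree(S):
    """Shape of CT(S) as nested (left, right) pairs; leftmost minimum is the root."""
    if not S:
        return None
    i = S.index(min(S))  # list.index returns the leftmost occurrence of the minimum
    return (cartesian_tree(S[:i]), cartesian_tree(S[i + 1:]))

def naive_ct_match(T, P):
    """All 1-based positions i with CT(T[i..i+m-1]) = CT(P[1..m]) per Definition 1."""
    m, shape = len(P), cartesian_tree(P)
    return [i + 1 for i in range(len(T) - m + 1)
            if cartesian_tree(T[i:i + m]) == shape]
```

For instance, naive_ct_match([3, 1, 2, 5], [1, 2]) returns [2, 3]: both windows rise, so their Cartesian trees coincide even though the actual values differ. The parent-distance representation below avoids this repeated tree building.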
In order to solve Cartesian tree matching without building every possible Cartesian tree, we propose an efficient representation to store the information about Cartesian trees, called the parent-distance representation.

Definition 2. (Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is an integer string PD(S)[1..n], which is defined as follows:

PD(S)[i] = i − max{ j : 1 ≤ j < i and S[j] ≤ S[i] } if such j exists, and PD(S)[i] = 0 otherwise.

For example, the parent-distance representation of a string S = (…, 1) is PD(S) = (0, …). S[j] in Definition 2 represents the parent of S[i] in Cartesian tree CT(S[1..i]); if there is no such j, S[i] is the root of Cartesian tree CT(S[1..i]). Theorem 3 shows that the parent-distance representation has a one-to-one mapping to the Cartesian tree, so it can substitute the Cartesian tree without any loss of information.
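To illustrate Definition 2 (and the equivalence that Theorem 3 establishes next), the following Python sketch of ours computes PD(S) directly from the definition, in quadratic time:

```python
def pd_naive(S):
    """PD(S) computed directly from Definition 2 in O(n^2) time (1-based distances)."""
    PD = []
    for i in range(len(S)):
        # distance to the nearest j < i with S[j] <= S[i], or 0 if no such j exists
        dist = 0
        for j in range(i - 1, -1, -1):
            if S[j] <= S[i]:
                dist = i - j
                break
        PD.append(dist)
    return PD
```

For example, pd_naive([6, 2, 5, 1, 4, 3, 7]) and pd_naive([9, 4, 8, 1, 6, 5, 7]) both return [0, 0, 1, 0, 1, 2, 1]: the two strings have the same Cartesian tree, and indeed the same parent-distance representation.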
Theorem 3.

Two strings S and S′ have the same Cartesian trees if and only if S and S′ have the same parent-distance representations.

Proof.

If two strings have different lengths, they have different Cartesian trees and different parent-distance representations, so the theorem holds. Therefore, we need only consider the case where S and S′ have the same length. Let n be the length of S and S′. We prove the theorem by induction on n.

If n = 1, S and S′ always have the same Cartesian trees with only one node. Furthermore, they have the same parent-distance representation (0). Therefore, the theorem holds when n = 1.

Let's assume that the theorem holds when n = k, and show that it holds when n = k + 1.

(⇒) Assume that S[1..k+1] and S′[1..k+1] have the same Cartesian trees (i.e., CT(S[1..k+1]) = CT(S′[1..k+1])). There are two cases.

If S[k+1] and S′[k+1] are not roots of the Cartesian trees, let S[j] be the parent of S[k+1], and S′[l] the parent of S′[k+1]. Since CT(S[1..k+1]) = CT(S′[1..k+1]), we have j = l. If we remove S[k+1] from Cartesian tree CT(S[1..k+1]), we obtain the tree CT(S[1..k]), where the left subtree of S[k+1] is attached to its parent S[j]. If we remove S′[k+1] from CT(S′[1..k+1]), we obtain CT(S′[1..k]) in the same way. Since CT(S[1..k+1]) = CT(S′[1..k+1]), we get CT(S[1..k]) = CT(S′[1..k]), and therefore PD(S)[1..k] = PD(S′)[1..k] by the induction hypothesis. Since PD(S)[k+1] = k + 1 − j and PD(S′)[k+1] = k + 1 − l (and j = l), we have PD(S) = PD(S′).

If S[k+1] and S′[k+1] are roots, we remove S[k+1] and S′[k+1] to get CT(S[1..k]) and CT(S′[1..k]). Since CT(S[1..k+1]) = CT(S′[1..k+1]), we have CT(S[1..k]) = CT(S′[1..k]), and therefore PD(S)[1..k] = PD(S′)[1..k] by the induction hypothesis. Since PD(S)[k+1] = PD(S′)[k+1] = 0 in this case, we get PD(S) = PD(S′).

(⇐) Assume that S[1..k+1] and S′[1..k+1] have the same parent-distance representations (i.e., PD(S)[1..k+1] = PD(S′)[1..k+1]). Since PD(S)[1..k] = PD(S′)[1..k], we have CT(S[1..k]) = CT(S′[1..k]) by the induction hypothesis. From CT(S[1..k]), we can derive CT(S[1..k+1]) as follows. If PD(S)[k+1] >
0, let x be S [ k + 1 − P D ( S )[ k + 1]].We insert S [ k + 1] into CT ( S [1 ..k ]) so that the parent of S [ k + 1] is x and the originalright subtree of x becomes the left subtree of S [ k + 1]. If P D ( S )[ k + 1] = 0, S [ k + 1] is theroot of CT ( S [1 ..k + 1]) and CT ( S [1 ..k ]) becomes the left subtree of S [ k + 1]. We derive CT ( S [1 ..k + 1]) from CT ( S [1 ..k ]) in the same way. Since CT ( S [1 ..k ]) = CT ( S [1 ..k ]) and P D ( S )[ k + 1] = P D ( S )[ k + 1], we can conclude that CT ( S [1 ..k + 1]) = CT ( S [1 ..k + 1]).Therefore, we have proved that there is a one-to-one mapping between Cartesian treesand parent-distance representations. (cid:74) Given a string S [1 ..n ], we can compute the parent-distance representation in linear timeusing a stack, as in [13, 14]. The main idea is that if two characters S [ i ] and S [ j ] for i < j satisfy S [ i ] > S [ j ], S [ i ] cannot be the parent of S [ k ] for any k > j . Therefore, we will onlystore S [ i ] which does not have such S [ j ] while scanning from left to right. If we store such S [ i ] only, they form a non-decreasing subsequence of S . When we consider a new value,therefore, we can pop values that are larger than the new value, find its parent, and pushthe new value and its index into the stack. Algorithm 1 describes the algorithm to compute P D ( S ).Furthermore, given the parent-distance representation of string S , we can compute theparent-distance representation of any substring S [ i..j ] easily. To compute P D ( S [ i..j ])[ k ], we .G. Park, A. Amir, G.M. Landau, and K. Park 13:5 Algorithm 1
Computing parent-distance representation of a string procedure PARENT-DIST-REP ( S [1 ..n ]) ST ← an empty stack for i ← to n do while ST is not empty do ( value, index ) ← ST.top if value ≤ S [ i ] then break ST.pop if ST is empty then P D ( S )[ i ] ← else P D ( S )[ i ] ← i − index ST.push (( S [ i ] , i )) return P D ( S )need only check whether the parent of S [ i + k −
1] is within S [ i..j ] or not (i.e., the parent isoutside if P D ( S )[ i + k − ≥ k ). P D ( S [ i..j ])[ k ] = ( P D ( S )[ i + k − ≥ kP D ( S )[ i + k −
1] otherwise. (1)For example, the parent-distance representation of string S = (2 , , , , , ,
1) is PD(S) = (0, …), and by Equation (1) we get PD(S[2..7]) = (0, …).

We can define a failure function similar to the one used in the KMP algorithm [21].
Definition 4. (Failure function) The failure function π of string P is an integer string such that:

π[q] = max{ k : CT(P[1..k]) = CT(P[q−k+1..q]) for 1 ≤ k < q } if q > 1, and π[q] = 0 if q = 1.

That is, π[q] is the largest k such that the prefix and the suffix of P[1..q] of length k have the same Cartesian trees. For example, assuming that P = (5, …), we have π = (0, …); since CT(P[1..4]) = CT(P[3..6]), we get π[6] = 4. We will present an algorithm to compute the failure function of a given string in Section 3.5.

As in the original KMP text search algorithm, we can use the failure function in order to achieve linear-time text search: scan the text from left to right, and use the failure function every time we find a mismatch between the text and the pattern. We apply this idea to Cartesian tree matching.
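Before the pseudocode, here is a self-contained Python sketch of ours combining the three ingredients of the single-pattern algorithm: the stack-based parent-distance computation (Algorithm 1), the failure function of Definition 4 evaluated through Equation (1), and the KMP-style scan with a deque. Algorithms 2 and 3 below give the precise pseudocode; all names here are illustrative.

```python
from collections import deque

def parent_dist_rep(S):
    """PD(S) in linear time with a stack of (value, 1-based index) pairs."""
    PD, st = [], []
    for i, v in enumerate(S, 1):
        while st and st[-1][0] > v:    # values > v can never be parents later
            st.pop()
        PD.append(i - st[-1][1] if st else 0)
        st.append((v, i))
    return PD

def failure_func(P):
    """Failure function of Definition 4 via parent-distance comparisons."""
    PD_P, m = parent_dist_rep(P), len(P)
    pi = [0] * (m + 1)                 # pi[1..m]; pi[0] unused
    ln = 0
    for i in range(2, m + 1):
        while ln > 0:
            # PD(P[i-ln..i])[ln+1] by Equation (1): 0 if the parent falls outside
            x = 0 if PD_P[i - 1] >= ln + 1 else PD_P[i - 1]
            if x == PD_P[ln]:          # equals PD(P[1..ln+1])[ln+1]
                break
            ln = pi[ln]
        ln += 1
        pi[i] = ln
    return pi

def ct_search(T, P):
    """All 1-based occurrences of P in T under Cartesian tree matching."""
    PD_P, pi, m = parent_dist_rep(P), failure_func(P), len(P)
    dq, ln, out = deque(), 0, []       # dq: non-decreasing (value, index) pairs
    for i, v in enumerate(T, 1):
        while dq and dq[-1][0] > v:    # back becomes nearest j with T[j] <= T[i]
            dq.pop()
        while ln > 0:
            x = i - dq[-1][1] if dq else 0   # PD(T[i-ln..i])[ln+1]
            if x == PD_P[ln]:
                break
            ln = pi[ln]
            while dq and dq[0][1] < i - ln:  # drop characters left of the window
                dq.popleft()
        ln += 1
        dq.append((v, i))
        if ln == m:
            out.append(i - m + 1)
            ln = pi[ln]
            while dq and dq[0][1] <= i - ln:
                dq.popleft()
    return out
```

For example, ct_search([4, 1, 3, 2, 6, 5, 7, 0, 2, 1], [2, 1, 3]) returns [1, 3, 5, 7]: every reported window falls, then rises above its minimum, matching CT(2, 1, 3).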
Algorithm 2
Text search of Cartesian tree matching
1:  procedure CARTESIAN-TREE-MATCH(T[1..n], P[1..m])
2:      PD(P) ← PARENT-DIST-REP(P)
3:      π ← FAILURE-FUNC(P)
4:      len ← 0
5:      DQ ← an empty deque
6:      for i ← 1 to n do
7:          Pop elements (value, index) from the back of DQ such that value > T[i]
8:          while len ≠ 0 do
9:              if PD(T[i−len..i])[len+1] = PD(P)[len+1] then
10:                 break
11:             else
12:                 len ← π[len]
13:             Delete elements (value, index) from the front of DQ such that index < i − len
14:         len ← len + 1
15:         DQ.push_back((T[i], i))
16:         if len = m then
17:             print "Match occurred at i − m + 1"
18:             len ← π[len]
19:             Delete elements (value, index) from the front of DQ such that index ≤ i − len

In order to perform a text search using O(m) space, we compute the parent-distance representation of the text online as we read the text, so that we don't need to store the parent-distance representation of the whole text, which would cost O(n) space. Furthermore, among the text characters which are matched with the pattern, we only have to store the elements that form a non-decreasing subsequence, using a deque (instead of the stack of Section 3.2) in order to delete elements at the front. Using this idea, we can keep the size of the deque always smaller than or equal to m. Therefore, we can perform the text search using O(m) space. Algorithm 2 shows the text search algorithm of Cartesian tree matching. In line 9 we need to compute x = PD(T[i−len..i])[len+1]. If the deque is empty, then x = 0. Otherwise, let (value, index) be the element at the back of the deque; then x = i − index. This computation takes constant time. Just before line 14, we do not compare PD(T[i..i])[1] and PD(P)[1] when len = 0, because they always match; therefore, we can safely perform line 14.

We compute the failure function π in a way similar to the text search, as in the KMP algorithm. However, we can compute the parent-distance representation of the pattern in O(m) time before we compute the failure function.
Hence we don't need a deque, and the computation is slightly simpler than the text search. Algorithm 3 shows the procedure to compute the failure function.

Since our algorithm for Cartesian tree matching, including the text search and the computation of the failure function, follows the KMP algorithm, it is easy to see that our algorithm correctly finds all occurrences (in the sense of Cartesian tree matching) of the pattern in the text.
Algorithm 3
Computing the failure function in Cartesian tree matching
procedure FAILURE-FUNC(P[1..m])
    PD(P) ← PARENT-DIST-REP(P)
    len ← 0
    π[1] ← 0
    for i ← 2 to m do
        while len ≠ 0 do
            if PD(P[i−len..i])[len+1] = PD(P[1..len+1])[len+1] then break
            else len ← π[len]
        len ← len + 1
        π[i] ← len

Since our algorithm checks one character of the parent-distance representation in constant time, it takes O(n) time for the text search and O(m) time to compute the failure function, as in the KMP algorithm. Therefore, our algorithm requires O(n + m) time for Cartesian tree matching using O(m) space.

There is an alternative representation of Cartesian trees, called
Cartesian tree signature [14]. The Cartesian tree signature of S[1..n] is an array L[1..n] such that L[i] equals the number of elements popped from the stack in the i-th iteration of Algorithm 1. Furthermore, the Cartesian tree signature can be represented as a bit string 0^{L[1]} 1 0^{L[2]} 1 · · · 0^{L[n]} 1 of length at most 2n, which is a succinct representation of a Cartesian tree. For example, the Cartesian tree signature of a string S = (2, …, 1) is L = (0, …). The signature is maintained together with an auxiliary array D[1..n], which is defined as follows: if S[i] is never popped out from the stack, D[i] = 0; otherwise, let S[j] be the value which popped S[i] out from the stack, and D[i] = j − i. For the string S = (2, …, 1), D = (6, …).

Using D, we can delete one character at the front of a string S[1..n] in constant time. In order to get the Cartesian tree signature L′ and its corresponding D′ for S[2..n], we do the following: if D[1] > 0, we decrease L[D[1] + 1] by one and erase L[1] from L; if D[1] = 0, we just erase L[1]. After that, we delete D[1] from D to get D′. For example, if we want to delete one character at the front of S = (2, …, 1), we decrease L[D[1] + 1] = L[7] by one, and delete L[1] and D[1]. This results in L′ = (0, …, 1) and D′ = (1, …), which are the correct L and D of S[2..7] = (7, …). String matching with Cartesian tree signatures needs both L and D, each of which uses the same space as the parent-distance representation. For Cartesian tree matching, therefore, it uses more space than Algorithm 2.
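The signature L and the array D can be obtained in one stack pass; the following Python sketch of ours (0-based arrays, illustrative names) mirrors the pop-counting description above:

```python
def ct_signature(S):
    """Cartesian tree signature L and pop-distance array D (0-based sketch).
    L[i] counts the stack pops performed when S[i] arrives; D[i] = j - i if
    S[j] later pops S[i] from the stack, and D[i] = 0 if S[i] is never popped."""
    n = len(S)
    L, D = [0] * n, [0] * n
    st = []                          # indices whose values are non-decreasing
    for i in range(n):
        while st and S[st[-1]] > S[i]:
            j = st.pop()
            L[i] += 1
            D[j] = i - j
        st.append(i)
    return L, D
```

For S = (2, 5, 4, 1, 3) this returns L = [0, 0, 1, 2, 0] and D = [3, 1, 1, 0, 0]: in 1-based terms, S[4] = 1 pops both S[3] = 4 (distance 1) and S[1] = 2 (distance 3).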
Aho-Corasick automaton for P1 = (4, …), P2 = (3, …), and P3 = (1, …).

4 Multiple Pattern Matching in O((n + m) log k) Time
In this section we extend Cartesian tree matching to the case of multiple patterns. Definition 5 gives the formal definition of multiple pattern matching.

Definition 5. (Multiple pattern Cartesian tree matching) Given a text T[1..n] and patterns P1[1..m1], P2[1..m2], ..., Pk[1..mk], where m = m1 + m2 + · · · + mk, multiple pattern Cartesian tree matching is to find every position in the text which matches at least one pattern, i.e., which has the same Cartesian tree as that of at least one pattern.

We modify the Aho-Corasick algorithm [1], using the parent-distance representation defined in Section 3.1, to do multiple pattern matching in O((n + m) log k) time. Instead of using the patterns themselves in the Aho-Corasick automaton, we use their parent-distance representations to make an automaton. Each node in the automaton corresponds to the prefix of the parent-distance representation of some pattern. We maintain two integers idx and len for every node such that the node corresponds to the parent-distance representation of the pattern prefix P_idx[1..len]. If there is more than one possible index, we store the smallest one. Each node also has a state transition function trans(x), which gets an integer x as an input and returns the next node, or reports that there is no such node. We can construct the trie and the state transition function for every node in O(m log k) time, assuming that we use a balanced binary search tree to implement the transition function. Figure 2 shows an Aho-Corasick automaton for three patterns P1 = (4, …), P2 = (3, …), and P3 = (1, …); we use the parent-distance representations PD(P1) = (0, …), PD(P2) = (0, …), and PD(P3) = (0, …,
2) to construct the automaton.

The failure function π of the Aho-Corasick automaton is defined as follows. Let q_i be a node in the automaton, and s_i the string that node q_i represents in the trie. Let s_j be the longest proper suffix of s_i which matches (in the sense of Cartesian tree matching) a prefix s_k of some pattern. The failure function of q_i is defined as the node q_k (i.e., π[q_i] = q_k). The dotted lines in Figure 2 show the failure function of each node. For example, one node q represents P[1..4], and its failure link points to the node that represents P[1..2], because P[3..4] matches P[1..2] (i.e., PD(P[1..2]) = PD(P[3..4]) = (0, …)) and P[3..4] is the longest suffix of P[1..4] that matches a prefix of some pattern. Note that the parent-distance representation of s_k may not be a suffix of the parent-distance representation of s_i. For example, one node has the parent-distance representation (0, …), while the node its failure link points to has the parent-distance representation (0, 0), which is not a suffix of (0, …).

Algorithm 4
Computing the failure function in multiple pattern matching
procedure MULTIPLE-FAILURE-FUNC(P1, P2, ..., Pk)
    for i ← 1 to k do
        PD(Pi) ← PARENT-DIST-REP(Pi)
    TR ← Build trie with the PD(Pi)'s
    for node ← breadth-first traversal of the trie do
        len ← len[node]
        idx ← idx[node]
        π[node] ← TR.root
        ptr ← parent of node in the trie
        while ptr ≠ TR.root do
            ptr ← π[ptr]
            plen ← len[ptr]
            x ← PD(P_idx[len − plen..len])[plen + 1]
            if ptr.trans(x) exists then
                π[node] ← ptr.trans(x)
                break

In Algorithm 4, node corresponds to the parent-distance representation of P_idx[1..len], and so the parent of node corresponds to the parent-distance representation of P_idx[1..len−1]. Hence ptr corresponds to the parent-distance representation of some suffix of P_idx[1..len−1], because ptr is a node that can be reached from the parent of node by following the failure links. Since ptr corresponds to some string of length plen, we can conclude that ptr represents P_idx[len−plen..len−1]. We should check whether P_idx[len−plen..len] matches some node in the trie, so we check whether ptr has a transition using x = PD(P_idx[len−plen..len])[plen+1]. If ptr has the transition ptr.trans(x), it corresponds to P_idx[len−plen..len], and we can conclude that π[node] = ptr.trans(x). If ptr doesn't have such a transition, there is no node that represents P_idx[len−plen..len], and thus we have to continue the loop.

For example, suppose that we compute the failure function of the node q in Figure 2 with idx[q] = 2 and len[q] = 4; that is, q represents P2[1..4]. The parent of q represents P2[1..3], and we start with ptr = π[parent of q]. Since len[ptr] = 2, we know that ptr represents a pattern prefix of length 2 that matches P2[2..3]. In order to check whether P2[2..4] matches some node in the trie, we compute x = PD(P2[2..4])[3] and check whether ptr.trans(x) exists. Since there is no such transition, we continue the while loop with ptr ← π[ptr]. The new ptr represents a pattern prefix of length 1 that matches P2[3..3], from len[ptr] = 1. In order to check whether P2[3..4] matches some node, we compute x = PD(P2[3..4])[2] and check whether ptr.trans(x) exists. Since there is such a transition (here x = 0), we conclude that π[q] = ptr.trans(0). Note that x may change during the while loop, which is not the case in the Aho-Corasick algorithm.

While computing the failure function, we can also compute the output function in the same way as the Aho-Corasick algorithm. The output function of node q_i is the set of patterns which match some suffix of s_i. This function is used to output all possible matches at the node.

Using the automaton defined above, we can solve multiple pattern Cartesian tree matching in O(n log k) time. The text search algorithm is essentially the same as that of the Aho-Corasick algorithm, following the trie and using the failure links in case of any mismatch. As in the single pattern case, we compute the parent-distance representation of the text online in the same way as Algorithm 2 (using a deque) to ensure O(m) space. The time complexity of our multiple pattern Cartesian tree matching is O((n + m) log k) using O(m) space, where the log k factor is due to the binary search tree in each node. Since there can be at most k outgoing edges from each node, we can perform an operation in the binary search tree in O(log k) time. Combined with the time-complexity analysis of the Aho-Corasick algorithm, this shows that our algorithm has the time complexity of O((n + m) log k). We can reduce the time complexity further to randomized O(n + m) time by using a hash instead of a binary search tree [12].

5 Cartesian Suffix Tree in O(n) Time
In this section we apply the notion of Cartesian tree matching to the suffix tree, as in the cases of parameterized matching and order-preserving matching [8, 13]. We first define the Cartesian suffix tree, and show that it can be built in randomized O(n) time or worst-case O(n log n) time using the result of Cole and Hariharan [12].

The Cartesian suffix tree is an index data structure that allows us to find an occurrence of a given pattern P[1..m] in randomized O(m) time or worst-case O(m log n) time, where n is the length of the text string. In order to store the information of Cartesian suffix trees efficiently, we again use the parent-distance representation from Section 3.1. Definition 6 gives the formal definition of the Cartesian suffix tree.

Definition 6. (Cartesian suffix tree) Given a string T[1..n], the Cartesian suffix tree of T is a compacted trie built with PD(T[i..n]) · (−1) for every 1 ≤ i ≤ n (where the special character −1 does not occur in any PD(T[i..n])).

For example, consider T = (2, …). The node marked A in Figure 3 corresponds to substring T[1..5] or T[6..10], since PD(T[1..5]) = PD(T[6..10]) = (0, …). Each node stores a suffix number together with start and end positions; for instance, node A stores the suffix number 1 or 6, and start and end positions 3 and 5.

Figure 3
Cartesian suffix tree of T = (2, …).

There are several algorithms that efficiently construct the suffix tree, such as McCreight's algorithm [24] and Ukkonen's algorithm [26]. However, the distinct right context property [16, 8] should hold in order to apply these algorithms, which means that the suffix link of every internal node should point to an explicit node. The Cartesian suffix tree does not have the distinct right context property: in Figure 3, the internal node marked with A does not satisfy this property, because PD(T[2..5]) = PD(T[7..10]) = (…, 0) and thus there is no explicit node corresponding to the parent-distance representation (0, …).

Cole and Hariharan's algorithm works on a quasi-suffix collection, which satisfies the following properties: a quasi-suffix collection is a set of n strings s_1, s_2, ..., s_n, where the length of s_i is n + 1 − i; for any two different strings s_i and s_j, s_i should not be a prefix of s_j; and for any i and j, if s_i and s_j have a common prefix of length l, then s_{i+1} and s_{j+1} should have a common prefix of length at least l − 1.

Our strings satisfy these properties. Suppose that s_i = PD(T[i..n]) · (−1) and s_j = PD(T[j..n]) · (−1) have a common prefix of length l, i.e., PD(T[i..i+l−1]) = PD(T[j..j+l−1]). Then PD(T[i+1..i+l−1]) = PD(T[j+1..j+l−1]), and thus s_{i+1} = PD(T[i+1..n]) · (−1) and s_{j+1} = PD(T[j+1..n]) · (−1) have a common prefix of length at least l − 1.

Cole and Hariharan's algorithm also requires a character oracle, which returns the i-th character of s_j in constant time. We can do this in constant time using Equation (1), once the parent-distance representation of T is computed.
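A minimal Python sketch of ours of this oracle, assuming PD(T) is available from the stack method of Algorithm 1 (all names are illustrative):

```python
def parent_dist_rep(S):
    """PD(S) in linear time with a stack, as in Algorithm 1 (1-based distances)."""
    PD, st = [], []
    for i, v in enumerate(S, 1):
        while st and st[-1][0] > v:
            st.pop()
        PD.append(i - st[-1][1] if st else 0)
        st.append((v, i))
    return PD

def make_oracle(T):
    """Constant-time character oracle for s_j = PD(T[j..n]) . (-1), 1-based j and k."""
    PD, n = parent_dist_rep(T), len(T)
    def char(j, k):
        if k == n + 2 - j:            # the terminating special character
            return -1
        v = PD[j + k - 2]             # PD(T)[j+k-1]; clip by Equation (1)
        return 0 if v >= k else v
    return char
```

For T = (2, 5, 4, 1, 3), char(2, 1), ..., char(2, 5) spell out s_2 = (0, 0, 0, 1, −1), which is exactly PD(T[2..5]) followed by the special character.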
Since we have all the properties needed to perform Cole and Hariharan's algorithm, we can construct a Cartesian suffix tree in randomized O(n) time using O(n) space [12]. In the worst case, it can be built in O(n log n) time by using a binary search tree instead of a hash table to store the children of each node in the suffix tree, because the alphabet size |Σ| is O(n). We can also modify our algorithm to construct a Cartesian suffix tree online, using the idea in [23, 25].

6 Conclusion

We have defined Cartesian tree matching and the parent-distance representation of a Cartesian tree. We developed a linear-time algorithm for single pattern matching, and an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. Finally, we defined an index data structure called Cartesian suffix tree, and showed that it can be constructed in O(n) randomized time. We believe that the notion of Cartesian tree matching, which is a new metric on string matching and indexing over numeric strings, can be used in many applications.

There have been many works on approximate generalized matching. For example, there are results for approximate order-preserving matching [11], approximate jumbled matching [10], approximate swapped matching [5], and approximate parameterized matching [6, 18]. There are also results on computing the period of a generalized string, such as computing the period in the order-preserving model [17]. Since Cartesian tree matching is first introduced in this paper, many problems, including approximate matching and computing the period in the Cartesian tree matching model, are future research topics.

Acknowledgments
S.G. Park and K. Park were supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00551, Framework of Practical Algorithms for NP-hard Graph Problems). A. Amir and G.M. Landau were partially supported by the Israel Science Foundation grant 571/14, and Grant No. 2014028 from the United States-Israel Binational Science Foundation (BSF).
References
[1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975. doi:10.1145/360825.360855.
[2] Amihood Amir, Yonatan Aumann, Gad M. Landau, Moshe Lewenstein, and Noa Lewenstein. Pattern matching with swaps. J. Algorithms, 37(2):247–266, 2000. doi:10.1006/jagm.2000.1120.
[3] Amihood Amir, Richard Cole, Ramesh Hariharan, Moshe Lewenstein, and Ely Porat. Overlap matching. Inf. Comput., 181(1):57–74, 2003. doi:10.1016/S0890-5401(02)00035-4.
[4] Amihood Amir, Martin Farach, and S. Muthukrishnan. Alphabet dependence in parameterized matching. Inf. Process. Lett., 49(3):111–115, 1994. doi:10.1016/0020-0190(94)90086-8.
[5] Amihood Amir, Moshe Lewenstein, and Ely Porat. Approximate swapped matching. Inf. Process. Lett., 83(1):33–39, 2002. doi:10.1016/S0020-0190(01)00302-7.
[6] Alberto Apostolico, Péter L. Erdös, and Moshe Lewenstein. Parameterized matching with mismatches. J. Discrete Algorithms, 5(1):135–140, 2007. doi:10.1016/j.jda.2006.03.014.
[7] Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, May 16-18, 1993, San Diego, CA, USA, pages 71–80, 1993. doi:10.1145/167088.167115.
[8] Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput., 26(5):1343–1362, 1997. doi:10.1137/S0097539793246707.
[9] Peter Burcsi, Ferdinando Cicalese, Gabriele Fici, and Zsuzsanna Lipták. Algorithms for jumbled pattern matching in strings. Int. J. Found. Comput. Sci., 23(2):357–374, 2012. doi:10.1142/S0129054112400175.
[10] Peter Burcsi, Ferdinando Cicalese, Gabriele Fici, and Zsuzsanna Lipták. On approximate jumbled pattern matching in strings. Theory Comput. Syst., 50(1):35–51, 2012. doi:10.1007/s00224-011-9344-5.
[11] Tamanna Chhabra, Emanuele Giaquinta, and Jorma Tarhio. Filtration algorithms for approximate order-preserving matching. In String Processing and Information Retrieval - 22nd International Symposium, SPIRE 2015, London, UK, September 1-4, 2015, Proceedings, pages 177–187, 2015. doi:10.1007/978-3-319-23826-5_18.
[12] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21-23, 2000, Portland, OR, USA, pages 407–415, 2000. doi:10.1145/335305.335352.
[13] Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Order-preserving indexing. Theor. Comput. Sci., 638:122–135, 2016. doi:10.1016/j.tcs.2015.06.050.
[14] Erik D. Demaine, Gad M. Landau, and Oren Weimann. On Cartesian trees and range minimum queries. Algorithmica, 68(3):610–625, 2014. doi:10.1007/s00453-012-9683-x.
[15] Tak-Chung Fu, Korris Fu-Lai Chung, Robert Wing Pong Luk, and Chak-man Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Eng. Appl. of AI, 20(3):347–364, 2007. doi:10.1016/j.engappai.2006.07.003.
[16] Raffaele Giancarlo. A generalization of the suffix tree to square matrices, with applications. SIAM J. Comput., 24(3):520–562, 1995. doi:10.1137/S0097539792231982.
[17] Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Arseny M. Shur, and Tomasz Walen. String periods in the order-preserving model. In STACS 2018, pages 38:1–38:16, 2018. doi:10.4230/LIPIcs.STACS.2018.38.
[18] Carmit Hazay, Moshe Lewenstein, and Dina Sokol. Approximate parameterized matching. In
Algorithms - ESA 2004, 12th Annual European Symposium, Bergen, Norway,September 14-17, 2004, Proceedings , pages 414–425, 2004. URL: https://doi.org/10.1007/978-3-540-30140-0_38 , doi:10.1007/978-3-540-30140-0\_38 . Jinil Kim, Amihood Amir, Joong Chae Na, Kunsoo Park, and Jeong Seop Sim. Onrepresentations of ternary order relations in numeric strings.
Mathematics in ComputerScience , 11(2):127–136, 2017. URL: https://doi.org/10.1007/s11786-016-0282-0 , doi:10.1007/s11786-016-0282-0 . C P M 2 0 1 9 Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park,Simon J. Puglisi, and Takeshi Tokuyama. Order-preserving matching.
Theor. Comput. Sci. ,525:68–79, 2014. URL: https://doi.org/10.1016/j.tcs.2013.10.006 , doi:10.1016/j.tcs.2013.10.006 . Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching instrings.
SIAM J. Comput. , 6(2):323–350, 1977. URL: https://doi.org/10.1137/0206024 , doi:10.1137/0206024 . Marcin Kubica, Tomasz Kulczynski, Jakub Radoszewski, Wojciech Rytter, and TomaszWalen. A linear time algorithm for consecutive permutation pattern matching.
Inf. Process.Lett. , 113(12):430–433, 2013. URL: https://doi.org/10.1016/j.ipl.2013.03.015 , doi:10.1016/j.ipl.2013.03.015 . Taehyung Lee, Joong Chae Na, and Kunsoo Park. On-line construction of parameterizedsuffix trees for large alphabets.
Inf. Process. Lett. , 111(5):201–207, 2011. URL: https://doi.org/10.1016/j.ipl.2010.11.017 , doi:10.1016/j.ipl.2010.11.017 . Edward M. McCreight. A space-economical suffix tree construction algorithm.
J. ACM ,23(2):262–272, 1976. URL: https://doi.org/10.1145/321941.321946 , doi:10.1145/321941.321946 . Joong Chae Na, Raffaele Giancarlo, and Kunsoo Park. On-line construction of two-dimensionalsuffix trees in O(n log n) time. Algorithmica , 48(2):173–186, 2007. URL: https://doi.org/10.1007/s00453-007-0063-x , doi:10.1007/s00453-007-0063-x . Esko Ukkonen. On-line construction of suffix trees.
Algorithmica , 14(3):249–260, 1995. URL: https://doi.org/10.1007/BF01206331 , doi:10.1007/BF01206331 . Jean Vuillemin. A unifying look at data structures.
Commun. ACM , 23(4):229–239, 1980.URL: https://doi.org/10.1145/358841.358852 , doi:10.1145/358841.358852doi:10.1145/358841.358852