Fast Cartesian Tree Matching
Siwoo Song, Cheol Ryu, Simone Faro, Thierry Lecroq, and Kunsoo Park
Seoul National University, Seoul, Korea {swsong,cryu,kpark}@theory.snu.ac.kr
University of Catania, Catania, Italy [email protected]
Normandie University, Rouen, France [email protected]
Abstract.
Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian trees as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and an efficient verification technique for Cartesian tree matching. Any exact string matching algorithm can be used as a filter for Cartesian tree matching in our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. By experiments we show that known string matching algorithms combined in our framework of binary filtration and efficient verification produce algorithms with good performance for Cartesian tree matching.
Keywords:
Cartesian tree matching · Global-parent representation · Filtration algorithms.
1 Introduction

String matching is one of the fundamental problems in computer science. There are generalized matchings such as parameterized matching [3,5], swapped matching [1,4], overlap matching [2], jumbled matching [6], and so on. These problems are characterized by the way of defining a match, which depends on the application domains of the problems. In particular, order-preserving matching [18,17,20] and Cartesian tree matching [21] deal with the order relations between numbers.

The Cartesian tree [23] is a tree data structure that represents a string, focusing on the orders between elements of the string. Park et al. [21] introduced a metric of match called Cartesian tree matching. It is the problem of finding all substrings of a text T which have the same Cartesian trees as that of a pattern P. Cartesian tree matching can be applied to finding patterns in time series data such as share prices in stock markets, like order-preserving matching, but sometimes it may be more appropriate, as indicated in [21]. Fig. 1 shows an example of Cartesian tree matching.

Fig. 1: Cartesian tree matching, and the Cartesian tree corresponding to the pattern.

Suppose T = (10, …, 12) and P = (3, …). Then the Cartesian tree of the substring u = (15, …,
17) is the same as that of P. Note that if we use order-preserving matching instead of Cartesian tree matching as a metric, u does not match P.

String matching algorithms have been designed over the years. To speed up the search phase of string matching, algorithms based on automata and bit-parallelism were developed [14,12]. In recent years, the SIMD instruction set architecture gave rise to packed string matching, where one can compare packed data elements in parallel. In the last few years, many solutions for order-preserving matching have been proposed. Given a text of length n and a pattern of length m, Kubica et al. [20] and Kim et al. [18] gave O(n + m log m) time solutions based on the KMP algorithm. Cho et al. [11] presented an algorithm using the Boyer-Moore approach. Chhabra and Tarhio [10] presented a new practical solution based on filtration, and Chhabra et al. [9] gave a filtration algorithm using the Boyer-Moore-Horspool approach and SIMD instructions. Cantone et al. [7] proposed filtration methods using the q-neighborhood representation and SIMD instructions. These filtration methods [10,9,7] take sublinear time on average.

In this paper we introduce new representations, the prefix-parent representation and the prefix-child representation, which can be used to decide whether two strings have the same Cartesian trees or not. Using these representations, we improve the running time of the previous Cartesian tree matching algorithm in [21]. We also present a binary filtration method for Cartesian tree matching, and give an efficient verification technique for Cartesian tree matching based on the global-parent representation. In the framework of our binary filtration method and efficient verification technique, we can apply any known string matching algorithm [8,15,12] as a filter for Cartesian tree matching. In addition, we present a SIMD solution for Cartesian tree matching based on the global-parent representation, which is suitable for short patterns.
We conduct experiments comparing many algorithms for Cartesian tree matching, which show that known string matching algorithms combined in our framework of binary filtration and efficient verification produce fast algorithms for Cartesian tree matching.

This paper is organized as follows. In Section 2, we describe notations and the problem definition. In Section 3, we present an improved linear-time algorithm using new representations. In Section 4, we present the framework of binary filtration and efficient verification. In Section 5, we present a SIMD solution for short patterns. In Section 6, we give the experimental results of the previous algorithm and the proposed algorithms.
2 Preliminaries

A string is defined as a finite sequence of elements in an alphabet Σ. In this paper, we will assume that Σ has a total order <. For a string S, S[i] represents the i-th element of S, and S[i..j] represents the substring of S from the i-th element to the j-th element. If i > j, then S[i..j] is an empty string.

We will say S[i] ≺ S[j] if and only if S[i] < S[j], or S[i] and S[j] have the same value with i < j. Note that S[i] = S[j] (as elements of the string) if and only if i = j. Unless stated otherwise, the minimum is defined by ≺.

A string S can be associated with its corresponding Cartesian tree CT(S) [23] according to the following rules:
– If S is an empty string, then CT(S) is an empty tree.
– If S[1..n] is not empty and S[i] is the minimum value among S, then CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree.

Cartesian tree matching is to find all substrings of the text which have the same Cartesian trees as that of the pattern. Formally, Park et al. [21] define it as follows:
Definition 1. (Cartesian tree matching) Given two strings, text T[1..n] and pattern P[1..m], find every 1 ≤ i ≤ n − m + 1 such that CT(T[i..i+m−1]) = CT(P[1..m]).

Instead of building the Cartesian tree for every position in the text to solve Cartesian tree matching, Park et al. [21] use the following representation for a Cartesian tree.

Definition 2. (Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is a function PD_S, which is defined as follows: PD_S(i) = i − max{ j : 1 ≤ j < i, S[j] ≼ S[i] } if such j exists, and PD_S(i) = 0 otherwise.

3 Improved Linear-Time Algorithm

We introduce two new representations of a Cartesian tree. For a string S[1..n], the prefix-parent representation PP_S(i) is the index such that S[PP_S(i)] is the parent of S[i] in CT(S[1..i]); if S[i] is the root of CT(S[1..i]), we set PP_S(i) = i. The prefix-child representation PC_S(i) is the index such that S[PC_S(i)] is the child of S[i] in CT(S[1..i]); in particular, S[PC_S(i)] is the root of CT(S[1..i−1]) when PP_S(i) = i. When PP_S(i) = i − 1, there is no child of S[i] in CT(S[1..i]), and thus we set PC_S(i) as i.

Algorithm 1 Text search of Cartesian tree matching
  procedure CARTESIAN-TREE-MATCH(T[1..n], P[1..m])
    (PP_P, PC_P) ← PREFIX-PARENT-CHILD-REP(P)
    π ← FAILURE-FUNC(P)
    q ← 0
    for i ← 1 to n do
      while q ≠ 0 do
        if T[i − q − 1 + PP_P(q+1)] ≼ T[i] ≼ T[i − q − 1 + PC_P(q+1)] then break
        else q ← π[q]
      q ← q + 1
      if q = m then
        print "Match occurred at i − m + 1"
        q ← π[q]

Fig. 2 shows the prefix-parent representation (resp. the prefix-child representation) of string S = (3, …,
9) by arrows. The arrow starting from S[i] indicates PP_S(i) (resp. PC_S(i)). If PP_S(i) = i (resp. PC_S(i) = i), we omit the arrow.

The advantage of using the prefix-child representation and the prefix-parent representation is that we can check whether each text element matches the corresponding pattern element in constant time, without computing its parent-distance [21].

Theorem 1.
Given two strings P and S, assume that P[1..q] and S[1..q] have the same prefix-parent representations. If S[PP_P(q+1)] ≼ S[q+1] ≼ S[PC_P(q+1)], then P[1..q+1] and S[1..q+1] have the same prefix-parent representations, and vice versa.

Proof. (⇒) If q = 0, P[1] and S[1] always have the same prefix-parent representations. Now let's assume q ≥
1. There are three cases, in each of which we show that PP_P(q+1) = PP_S(q+1).
1. Case PP_P(q+1) = q+1: Since P[PC_P(q+1)] is the minimum element in P[1..q] and PP_P(i) = PP_S(i) for 1 ≤ i ≤ q, S[PC_P(q+1)] is also the minimum element in S[1..q]. Therefore, if S[q+1] ≼ S[PC_P(q+1)] holds, then we have PP_S(q+1) = q+1.
2. Case PP_P(q+1) = q: Since S[q] ≼ S[q+1], we have PP_S(q+1) = q.
3. Case PP_P(q+1) < q: Since P[PC_P(q+1)] is the minimum element in P[PP_P(q+1)+1..q] and PP_P(i) = PP_S(i) for 1 ≤ i ≤ q, S[PC_P(q+1)] is also the minimum element in S[PP_P(q+1)+1..q]. Therefore, if S[PP_P(q+1)] ≼ S[q+1] ≼ S[PC_P(q+1)] holds, then PP_S(q+1) = PP_P(q+1).
(⇐) It is trivial by the definitions of PP and PC. ⊓⊔
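The stack-based computation of PP and PC that Algorithm 2 below describes can be sketched in Python as follows. This is a sketch, not the authors' implementation: the tables are 1-indexed with index 0 unused, and ties follow the order ≺, so an earlier equal value is treated as smaller and only strictly larger stack entries are popped.

```python
# Sketch of the stack-based computation of the prefix-parent (PP) and
# prefix-child (PC) representations. 1-indexed; index 0 is unused.
def prefix_parent_child(p):
    n = len(p)
    PP, PC = [0] * (n + 1), [0] * (n + 1)
    stack = []  # indices j with p[j] ≺ p[k] for all j < k < i; values increase
    for i in range(1, n + 1):
        j_next = i
        while stack and p[stack[-1] - 1] > p[i - 1]:
            j_next = stack.pop()           # last popped index survives
        PC[i] = j_next                     # child of p[i] in CT(p[1..i]), or i
        PP[i] = stack[-1] if stack else i  # parent of p[i], or i if p[i] is root
        stack.append(i)
    return PP, PC
```

For example, prefix_parent_child([3, 1, 2]) returns PP = [0, 1, 2, 2] and PC = [0, 1, 1, 3]: the second element becomes the root of the two-element prefix tree with the first element as its child, and the third element hangs under the second with no child.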
Algorithm 2
Computing prefix-parent and prefix-child representations
  procedure PREFIX-PARENT-CHILD-REP(P[1..m])
    ST ← an empty stack
    for i ← 1 to m do
      j_next ← i
      while ST is not empty do
        j ← ST.top
        if P[j] ≺ P[i] then break
        ST.pop
        j_next ← j
      PC_P(i) ← j_next
      if ST is empty then PP_P(i) ← i
      else PP_P(i) ← j
      ST.push(i)
    return (PP_P, PC_P)

With the prefix-parent representation and the prefix-child representation of pattern P, we can simplify the text search. For each element T[i], we can check PP_P(q+1) = PP_{T[i−q..i]}(q+1) by comparing T[i] with the elements in T[i−q..i] whose indices correspond to PP_P(q+1) and PC_P(q+1) in P. Using this idea, we don't have to compute PP_{T[i−q..i]}(q+1). Algorithm 1 describes the algorithm to do this. We compute the failure function π in the same way as [21] does.

Given a string P[1..m], we can compute the prefix-child representation and the prefix-parent representation simultaneously in linear time using a stack. PP_P(i) = j means that P[j] ≺ P[k] for j < k < i. The same is true for PC_P(i). On the stack, therefore, we maintain only the j's which satisfy P[j] ≺ P[k] for j < k < i while scanning from i = 1 to m. Suppose that j_1, j_2, ..., j_r are on the stack when we are computing PP_P(i) and PC_P(i). (We assume that j_{r+1} = i.) Then (P[j_1], P[j_2], ..., P[j_r]) forms an increasing subsequence of P. When we consider a new index i, we pop the indices j_r, j_{r−1}, ..., j_{t+1} repeatedly until we have P[j_t] ≺ P[i]. If there exists such an index j_t, we set PP_P(i) = j_t and PC_P(i) = j_{t+1}. (If t = r, then PC_P(i) = j_{t+1} = i.) Otherwise, P[i] is the minimum element in P[1..i], and thus PP_P(i) = i and PC_P(i) = j_1. Finally, we push i onto the stack. Algorithm 2 describes the algorithm to compute PP_P and PC_P simultaneously.

4 Binary Filtration and Efficient Verification

In this section we present a practical solution based on filtration.
Our solution for Cartesian tree matching consists of two phases: filtration and verification. First, the text is filtered with some exact string matching algorithm using a binary representation. In the second phase, the potential candidates are verified using a global-parent representation.
4.1 Binary Filtration

In the filtration phase, a string S is translated into a binary representation β_S as follows.

Definition 5. (binary representation) Given a string S[1..n], the binary representation of S is a binary string β_S of length n −
1, which is defined as follows: β_S[i] = 0 if PP_S(i+1) = i, and β_S[i] = 1 otherwise, for 1 ≤ i ≤ n − 1.

One can easily check whether PP_S(i+1) = i is true or not by comparing S[i] and S[i+1]: PP_S(i+1) = i if and only if S[i] ≺ S[i+1]. The following theorem proves that the binary representation can be used to filter a text T to search for all Cartesian tree matching occurrences of a pattern P.

Theorem 2.
Let P and T be two strings of lengths m and n, respectively, and let β_P and β_T be the binary representations associated with P and T, respectively. If CT(P[1..m]) = CT(T[i..i+m−1]), then β_P[j] = β_T[i+j−1] for 1 ≤ j ≤ m − 1.

Proof.
The prefix-parent representation has a one-to-one mapping to the Cartesian tree. Therefore, if CT(P[1..m]) = CT(T[i..i+m−1]), then PP_P(j+1) = PP_{T[i..i+m−1]}(j+1) for 0 ≤ j ≤ m − 1. Since β_P[j] = 0 if and only if PP_P(j+1) = j, and likewise for the window of T, it follows that β_P[j] = β_T[i+j−1] for 1 ≤ j ≤ m − 1. ⊓⊔

Thus, when we find matches of β_P in β_T, these matches are the only possible candidates of Cartesian tree matching, which should be verified.

Cantone et al. [7] presented two filtration methods other than the binary representation to solve order-preserving matching. They used the property that T doesn't match P at position i if there are two positions j and k such that P[j] ≼ P[k] ⇔ T[i+j−1] ≼ T[i+k−
1] doesn’t hold. Thus any comparison resultbetween two positions can be used for filtration. In Cartesian tree matching,however, even if there exist such j and k , the corresponding Cartesian trees canbe the same when | j − k | >
1. Therefore, we cannot use these filtration methods for Cartesian tree matching.
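A minimal non-SIMD sketch of this filtration phase: β is computed with one ≺-comparison per position, and any exact matcher over {0, 1} then reports the candidate positions. Here Python's built-in str.find stands in for the fast filters (SBNDMq, Boyer-Moore-Horspool, Alpha skip search) used in the paper.

```python
# Binary representation of Definition 5: beta_S[i] = 0 iff S[i] ≺ S[i+1].
# Since the earlier index wins ties under ≺, the test reduces to <=.
def binary_rep(s):
    return ''.join('0' if s[i] <= s[i + 1] else '1' for i in range(len(s) - 1))

def candidates(text, pattern):
    """0-indexed positions where beta_P occurs in beta_T: by Theorem 2, the
    only places a Cartesian tree match can start (assumes len(pattern) >= 2)."""
    bt, bp = binary_rep(text), binary_rep(pattern)
    out, i = [], bt.find(bp)
    while i != -1:
        out.append(i)
        i = bt.find(bp, i + 1)
    return out
```

For instance, candidates([1, 2, 3, 2, 1, 2, 3], [1, 2, 1]) reports position 1, where the window [2, 3, 2] indeed has the same Cartesian tree as [1, 2, 1]; every reported position must still be verified.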
4.2 Verification

In the verification phase, we have to check whether the candidates found by the filtration phase are actual matches or not. This checking can be done using the prefix-parent and prefix-child representations by Theorem 1, which takes 2 comparisons
per element. In order to reduce the number of comparisons to 1, we introduce another representation as follows.
Definition 6. (Global-parent representation) Given a string S[1..n], the global-parent representation of S is a function GP_S, which is defined as follows: GP_S(i) = j if there exists j > i such that PC_S(j) = i, and GP_S(i) = PP_S(i) if such j doesn't exist.

GP_S(i) is well-defined because there is at most one j > i which satisfies PC_S(j) = i. Fig. 2 shows the global-parent representation by arrows. The arrow starting from S[i] indicates the global parent of S[i]. If GP_S(i) = i, we omit the arrow.

Theorem 3.
Two strings P[1..m] and S[1..m] have the same Cartesian trees if and only if S[GP_P(i)] ≼ S[i] for all 1 ≤ i ≤ m.

Proof.
We will prove that S[PP_P(i)] ≼ S[i] ≼ S[PC_P(i)] for all 1 ≤ i ≤ m if and only if S[GP_P(i)] ≼ S[i] for all 1 ≤ i ≤ m.
(⇒) It is trivial by the definition of GP.
(⇐) Assume S[GP_P(i)] ≼ S[i] for all 1 ≤ i ≤ m. For any 1 ≤ k ≤ m, we first show S[k] ≼ S[PC_P(k)], and then we show S[PP_P(k)] ≼ S[k].
1. (Proof of S[k] ≼ S[PC_P(k)]) There are two cases: PC_P(k) = k and PC_P(k) ≠ k. If PC_P(k) = k, then S[k] ≼ S[PC_P(k)] holds trivially. Otherwise, since GP_P(PC_P(k)) = k, we have S[k] = S[GP_P(PC_P(k))] ≼ S[PC_P(k)]. Therefore, S[k] ≼ S[PC_P(k)] holds.
2. (Proof of S[PP_P(k)] ≼ S[k]) If GP_P(k) = PP_P(k), then S[PP_P(k)] = S[GP_P(k)] ≼ S[k]. So we only have to consider the case that there is k_1 > k which satisfies PC_P(k_1) = k. Let k = k_0 < k_1 < ··· < k_r ≤ m be a sequence such that PC_P(k_{l+1}) = k_l, and there is no k_{r+1} > k_r which satisfies PC_P(k_{r+1}) = k_r. Since (k_0, k_1, ..., k_r) is a strictly increasing sequence, such a k_r always exists. Note that GP_P(k_l) = k_{l+1} except for GP_P(k_r). On the sequence, there may or may not exist j such that PP_P(k_j) = k_j.
Suppose that there exists some j such that PP_P(k_j) = k_j. Since k_{j−1} = PC_P(k_j), P[k_{j−1}] is the minimum element in P[1..k_j − 1], and thus PP_P(k_{j−1}) = k_{j−1}. Proceeding inductively, PP_P(k_l) = k_l for all l ≤ j. Thus S[PP_P(k)] ≼ S[k] holds trivially.
Now we consider the case that PP_P(k_j) ≠ k_j for all j. Then, we have S[k_0] ≽ S[k_1] ≽ ··· ≽ S[k_r] ≽ S[GP_P(k_r)] = S[PP_P(k_r)] by the assumption that S[GP_P(i)] ≼ S[i] for all i. We now show PP_P(k_r) = PP_P(k) as follows.
Since PC_P(k_r) = k_{r−1}, P[k_{r−1}] is the minimum element in P[PP_P(k_r) + 1..k_r − 1], and P[k_{r−1}] ≽ P[PP_P(k_r)]. Hence, we have PP_P(k_{r−1}) = PP_P(k_r). Inductively, we can show that PP_P(k) = PP_P(k_1) = ··· = PP_P(k_r). Therefore, S[PP_P(k)] ≼ S[k] holds. ⊓⊔

By Theorem 3, we only have to compare once for each element in the verification phase. For a potential candidate obtained from the filtration phase (say, it starts from T[i]), we compare T[i + q −
1] and T[i + GP_P(q) − 1] for q = 1 to m. The candidate is discarded when there exists q such that T[i + q − 1] ≺ T[i + GP_P(q) − 1]. The global-parent representation can be computed in linear time: we first compute GP_P(i) as PP_P(i), and then if we find j such that PC_P(j) = i we update GP_P(i) to j.

The proof of sublinearity is similar to the analysis of order-preserving matching with filtration [10]. Let's assume that the elements in the pattern P and the text T are independent of each other and the distribution is uniform. The verification phase takes time proportional to the pattern length times the number of potential candidates. When the alphabet size is |Σ|, the probability that β_P[i] = 0 (i.e., the probability that P[i] ≺ P[i+1]) is (|Σ|² + |Σ|)/(2|Σ|²), since there are |Σ|² pairs and |Σ| pairs among them have equal elements. Similarly, the probability that β_P[i] = 1 is (|Σ|² − |Σ|)/(2|Σ|²), and it is the same for β_T[i]. Therefore, the probability that β_P[i] = β_T[i] is ((|Σ|² + |Σ|)/(2|Σ|²))² + ((|Σ|² − |Σ|)/(2|Σ|²))² = 1/2 + 1/(2|Σ|²). As the pattern length increases, the number of potential candidates decreases exponentially, and the verification time approaches zero. Hence, the filtration time dominates. So if the filtration method takes sublinear time in the average case, the total algorithm takes sublinear time in the average case, too.

When we use the Boyer-Moore-Horspool algorithm [15] and the Alpha skip search algorithm [8] as the filtration method, we pack four 32-bit numbers or sixteen 8-bit numbers into a register, as in order-preserving matching algorithms [9,7]. Each pair of two corresponding packed data elements can be compared in parallel using streaming SIMD extensions (SSE) [16].
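The verification side can be sketched as follows (a sketch, not the authors' code): GP is derived from PP and PC exactly as described above, and each candidate window then costs one ≼-comparison per element by Theorem 3. The stack computation of PP and PC is inlined so the block is self-contained.

```python
# Global-parent table (Definition 6): GP(i) = j if some j > i has PC(j) = i,
# otherwise GP(i) = PP(i). All tables are 1-indexed; index 0 is unused.
def global_parent(p):
    n = len(p)
    PP, PC = [0] * (n + 1), [0] * (n + 1)
    stack = []                        # stack computation of PP and PC
    for i in range(1, n + 1):
        j_next = i
        while stack and p[stack[-1] - 1] > p[i - 1]:
            j_next = stack.pop()
        PC[i] = j_next
        PP[i] = stack[-1] if stack else i
        stack.append(i)
    GP = PP[:]
    for j in range(1, n + 1):
        if PC[j] < j:                 # PC(j) = i < j: redirect GP(i) to j
            GP[PC[j]] = j
    return GP

def prec_eq(s, a, b):
    """s[a] ≼ s[b] (1-indexed): smaller value wins, ties go to the earlier index."""
    return a == b or s[a - 1] < s[b - 1] or (s[a - 1] == s[b - 1] and a < b)

def verify(window, GP):
    """One comparison per element (Theorem 3): the window has the pattern's
    Cartesian tree iff window[GP(i)] ≼ window[i] for every i."""
    return all(prec_eq(window, GP[i], i) for i in range(1, len(GP)))
```

For a hypothetical pattern p = [3, 1, 4, 2, 8], global_parent(p) is [0, 2, 2, 4, 2, 4], and a window such as [10, 5, 20, 7, 30] (same shape) passes verification while [1, 2, 3, 4, 5] does not.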
In the case of 32-bit integers, for example, we compute (T[i+3] > T[i+4]), (T[i+2] > T[i+3]), (T[i+1] > T[i+2]), and (T[i] > T[i+1]) in parallel as in Algorithm 3, where instruction _mm_loadu_si128((__m128i*)(T + i)) loads four 32-bit integers from memory T + i into a 128-bit register, instruction _mm_cmpgt_epi32(a, b) compares four pairs of packed 32-bit integers and returns the results of the comparisons in a 128-bit register, instruction _mm_castsi128_ps casts the integer type to the float type, and instruction _mm_movemask_ps selects only the most significant bits of the 4 floats. Comparing sixteen pairs of 8-bit numbers can be done similarly.

5 SIMD Solution for Short Patterns

In this section we present an algorithm that works when the alphabet consists of 1-byte characters and the pattern length m is at most 16. As shown in Section 4.2,

Algorithm 3
Compare integers in parallel
  procedure CompareUsingSIMD(T[1..n], i)
    __m128i a ← _mm_loadu_si128((__m128i*)(T + i))
    __m128i b ← _mm_loadu_si128((__m128i*)(T + i + 1))
    __m128i r ← _mm_cmpgt_epi32(a, b)
    return _mm_movemask_ps(_mm_castsi128_ps(r))

we test T[s + i − 1] ≽ T[s + GP_P(i) −
1] for 1 ≤ i ≤ m to check for an occurrence at position s of the text T. Let W be a word of 16 bytes containing the current window of the text, i.e., W = T[s..s+15]. For 1 ≤ i ≤ m, we define W_i (the word obtained from W by shifting |i − GP_P(i)| positions to the left or to the right, depending on the sign of i − GP_P(i)) as follows: W_i = W ≪ (GP_P(i) − i) if i < GP_P(i), and W_i = W ≫ (i − GP_P(i)) if i > GP_P(i). For fixed i, we can find the positions j which satisfy W[j + i − 1] ≽ W[j + GP_P(i) −
1] for 0 ≤ j ≤
15 in parallel by comparing W_i to W using SIMD instructions. The satisfying positions for all 1 ≤ i ≤ m are the occurrences of the pattern. The details of the algorithm are as follows. We test whether W_i[j] ≼ W[j] for 0 ≤ j ≤
15 in parallel using the SIMD instruction R_i = _mm_cmpgt_epi8(W, W_i) for i < GP_P(i), or R_i = ∼_mm_cmpgt_epi8(W_i, W) for i > GP_P(i). (In order to get only the significant bits when computing R_i, we use instruction _mm_movemask_epi8.) Then we compute q = AND_{i=1}^{m} (R_i ≪ (i − 1)); the pattern occurs at position s + j of the text if q[j] = 1.
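A scalar stand-in for this window test (no intrinsics): the sketch below evaluates, for each start offset j of the window, the one comparison per pattern position that the SSE version evaluates for all 16 offsets at once via the shifted words W_i and the ANDed bitmasks R_i. GP is assumed to be the pattern's 1-indexed global-parent table (Definition 6).

```python
# Report every offset j of the window W (0-indexed) at which a pattern of
# length m <= 16, given by its global-parent table GP (1-indexed), matches
# W[j..j+m-1] by Cartesian tree. Only offsets where the whole pattern fits
# are tested; the SIMD version masks the remaining positions out instead.
def window_matches(W, GP, m):
    def succ_eq(a, b):  # W[a] ≽ W[b] (0-indexed): ties favor the earlier index
        return a == b or W[b] < W[a] or (W[b] == W[a] and b < a)
    return [j for j in range(len(W) - m + 1)
            if all(succ_eq(j + i - 1, j + GP[i] - 1) for i in range(1, m + 1))]
```

With a hypothetical pattern of length 5 whose global-parent table is [0, 2, 2, 4, 2, 4], the window (10, 12, 16, 15, 6, 14, 9, 12, 11, 14, 9, 17, 12, 13, 12, 10) yields the 0-indexed offsets [3, 5, 9], i.e., three matches.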
Let’s consider an example of the pattern P = (3 , , , ,
8) and thewindow of the text W = (10 , , , , , , , , , , , , , , , − GP P (1) = 3 − GP P (3), R = R . Moreover we do notneed to compute R , since 2 − GP P (2) = 0. Hence we compute R , R , and R . W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 W = 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 R = 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, - W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13 R = -, -, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1 W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12 R = -, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0The final result q can be computed as follows: ast Cartesian Tree Matching 11Dataset m KMP IKMP SBNDMCT BMHCT SKSCT PMCT CT 2 4 6 4 8 12 16 4 8 12 16 CTRandom 5 10.52 6.84 4.99 4.42 4.17 int 9 10.71 6.83 2.71 2.31 1.95 1.95 temp 9 5.11 3.14 1.56 1.45 1.55 1.55
R_1 = 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, -
q = R_1 AND (R_3 ≪ 2) AND (R_4 ≪ 3) AND (R_5 ≪ 4) = 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0

Therefore, we can report 3 matches. After we have tested a window of the text, we shift the current window to the right by 17 − m positions. This algorithm takes O(mn/(17 − m)) SIMD instructions.

Table 1: Execution times in seconds for random patterns in texts (Random datasets: for 100 patterns, Seoul temperatures dataset: for 1000 patterns).

6 Experiments

In this section we conduct experiments comparing the following algorithms.
– KMPCT: algorithm of Park, Amir, Landau, and Park [21]
– IKMPCT: our improved linear-time algorithm based on prefix-parent and prefix-child representations (Section 3)
– PMCT: SIMD solution for short patterns (Section 5)
– SBNDMCTq: algorithm based on the SBNDMq filtration implemented by Faro and Lecroq [13] on the binary representations of the text and the pattern (Section 4.1) and verification using the global-parent representation (Section 4.2) [12] (The following algorithms have the same framework as SBNDMCTq; only SBNDMq is replaced by another filtration method.)
– BMHCTq: algorithm based on the q-gram Boyer-Moore-Horspool filtration using SIMD instructions [15,22,9]
– SKSCTq: algorithm based on the q-gram Alpha skip search filtration using SIMD instructions [8,7]

Fig. 3: Execution times for the random character dataset.

We tested two random datasets and one real dataset, which is a time series of Seoul temperatures. The first random dataset consists of 10,000,000 random integers. The second random dataset consists of 10,000,000 random characters. The Seoul temperatures dataset consists of 658,795 integers referring to the hourly temperatures in Seoul (multiplied by ten) in the years 1907-2019. In general, temperatures rise during the day and fall at night. Therefore, the Seoul temperatures dataset has more matches than the random datasets. We picked 100 random patterns per pattern length from the random datasets and 1000 random patterns per pattern length for the Seoul temperatures dataset.

The experimental environments and parameters are as follows. All algorithms were implemented in C++11 and compiled with GNU C++ compiler version 4.8.5, with the O3 and msse4 options. The experiments were performed on CentOS Linux 7 with 128GB RAM and an Intel Xeon CPU E5-2630 processor.

Table 1 shows the total execution times of the Cartesian tree matching algorithms for random patterns (including the preprocessing). The best results are boldfaced. We choose the best results of the random character dataset from each algorithm regardless of q and present them in Fig. 3 (except KMPCT, for readability). Our linear-time algorithm IKMPCT improves upon algorithm KMPCT of [21] by about 35%. In the random character dataset, PMCT is the fastest algorithm for short patterns. However, as the pattern length grows, algorithms based on the filtration method are much faster in practice. It can be seen that SKSCT is the fastest algorithm in most cases. When the pattern length is equal to 9, BMHCT utilizing 8-grams is the fastest algorithm, irrespective of the datasets.
As the pattern length grows, SKSCT utilizing 12-grams becomes the fastest algorithm.

Regardless of the data type, the results are almost consistent. In detail, however, there are several differences. First, filtration algorithms, especially the SKSCT algorithms, are relatively slower on the Seoul temperatures dataset, because there are more matches in the Seoul temperatures dataset. Second, when q is large, the BMHCT and SKSCT algorithms are faster on the random character dataset than on the random integer dataset, because the maximum number of elements that we can compare in parallel is 16 in the character dataset while it is 4 in the integer dataset.

Acknowledgments.
Song, Ryu and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).
References
1. Amir, A., Aumann, Y., Landau, G.M., Lewenstein, M., Lewenstein, N.: Pattern matching with swaps. Journal of Algorithms (2), 247–266 (2000)
2. Amir, A., Cole, R., Hariharan, R., Lewenstein, M., Porat, E.: Overlap matching. Information and Computation (1), 57–74 (2003)
3. Amir, A., Farach, M., Muthukrishnan, S.: Alphabet dependence in parameterized matching. Information Processing Letters (3), 111–115 (1994)
4. Amir, A., Lewenstein, M., Porat, E.: Approximate swapped matching. Information Processing Letters (1), 33–39 (2002)
5. Baker, B.S.: A theory of parameterized pattern matching: Algorithms and applications. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing. pp. 71–80. ACM (1993)
6. Burcsi, P., Cicalese, F., Fici, G., Liptak, Z.: Algorithms for jumbled pattern matching in strings. International Journal of Foundations of Computer Science (2), 357–374 (2012)
7. Cantone, D., Faro, S., Kulekci, M.O.: The order-preserving pattern matching problem in practice. Discrete Applied Mathematics (2018)
8. Charras, C., Lecroq, T., Pehoushek, J.D.: A very fast string matching algorithm for small alphabets and long patterns. In: Combinatorial Pattern Matching. pp. 55–64 (1998)
9. Chhabra, T., Kulekci, M.O., Tarhio, J.: Alternative algorithms for order-preserving matching. In: Proceedings of the Prague Stringology Conference 2015. pp. 36–46 (2015)
10. Chhabra, T., Tarhio, J.: A filtration method for order-preserving matching. Information Processing Letters (2), 71–74 (2016)
11. Cho, S., Na, J.C., Park, K., Sim, J.S.: A fast algorithm for order-preserving pattern matching. Information Processing Letters (2), 397–402 (2015)
12. Durian, B., Holub, J., Peltola, H., Tarhio, J.: Improving practical exact string matching. Information Processing Letters (4), 148–152 (2010)
13. Faro, S., Lecroq, T., Borzi, S., Mauro, S.D., Maggio, A.: The string matching algorithms research tool. In: Proceedings of the Prague Stringology Conference 2016. pp. 99–113 (2016)
14.
Fredriksson, K., Grabowski, S.: Practical and optimal string matching. In: String Processing and Information Retrieval. pp. 376–387 (2005)
15. Horspool, R.N.: Practical fast searching in strings. Software: Practice and Experience (6), 501–506 (1980)
16. Intel: Intel(R) 64 and IA-32 Architectures Optimization Reference Manual (2019)
17. Kim, J., Amir, A., Na, J.C., Park, K., Sim, J.S.: On representations of ternary order relations in numeric strings. Mathematics in Computer Science (2), 127–136 (2017)
18. Kim, J., Eades, P., Fleischer, R., Hong, S.H., Iliopoulos, C.S., Park, K., Puglisi, S.J., Tokuyama, T.: Order-preserving matching. Theoretical Computer Science, 68–79 (2014)
19. Knuth, D.E., Morris, Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing (2), 323–350 (1977)
20. Kubica, M., Kulczynski, T., Radoszewski, J., Rytter, W., Walen, T.: A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters (12), 430–433 (2013)
21. Park, S.G., Amir, A., Landau, G.M., Park, K.: Cartesian tree matching and indexing. Accepted to CPM (2019), https://arxiv.org/abs/1905.08974
22. Tarhio, J., Peltola, H.: String matching in the DNA alphabet. Software: Practice and Experience (7), 851–861 (1997)
23. Vuillemin, J.: A unifying look at data structures. Communications of the ACM 23