Fast Cartesian Tree Matching
Siwoo Song, Cheol Ryu, Simone Faro, Thierry Lecroq, and Kunsoo Park
Seoul National University, Seoul, Korea {swsong,cryu,kpark}@theory.snu.ac.kr
University of Catania, Catania, Italy [email protected]
Normandie University, Rouen, France [email protected]
Abstract.
Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian trees as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and an efficient verification technique for Cartesian tree matching. Any exact string matching algorithm can be used as a filter for Cartesian tree matching in our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. By experiments we show that known string matching algorithms combined in our framework of binary filtration and efficient verification produce algorithms with good performance for Cartesian tree matching.
Keywords:
Cartesian tree matching · Global-parent representation · Filtration algorithms.
1 Introduction

String matching is one of the fundamental problems in computer science. There are generalized matchings such as parameterized matching [3,5], swapped matching [1,4], overlap matching [2], jumbled matching [6], and so on. These problems are characterized by the way of defining a match, which depends on the application domains of the problems. In particular, order-preserving matching [18,17,20] and Cartesian tree matching [21] deal with the order relations between numbers.

The Cartesian tree [23] is a tree data structure that represents a string, focusing on the orders between elements of the string. Park et al. [21] introduced a metric of match called Cartesian tree matching. It is the problem of finding all substrings of a text T which have the same Cartesian trees as that of a pattern P. Cartesian tree matching can be applied to finding patterns in time series data such as share prices in stock markets, like order-preserving matching, but sometimes it may be more appropriate, as indicated in [21]. Fig. 1 shows an example of Cartesian tree matching.

Fig. 1: Cartesian tree matching, and the Cartesian tree corresponding to the pattern.

Suppose T = (10, …, 12) and P = (3, …). Then the Cartesian tree of the substring u = (15, …,
17) is the same as that of P. Note that if we use order-preserving matching instead of Cartesian tree matching as a metric, u does not match P.

String matching algorithms have been designed over the years. To speed up the search phase of string matching, algorithms based on automata and bit-parallelism were developed [14,12]. In recent years, the SIMD instruction set architecture gave rise to packed string matching, where one can compare packed data elements in parallel. In the last few years, many solutions for order-preserving matching have been proposed. Given a text of length n and a pattern of length m, Kubica et al. [20] and Kim et al. [18] gave O(n + m log m) time solutions based on the KMP algorithm. Cho et al. [11] presented an algorithm using the Boyer-Moore approach. Chhabra and Tarhio [10] presented a new practical solution based on filtration, and Chhabra et al. [9] gave a filtration algorithm using the Boyer-Moore-Horspool approach and SIMD instructions. Cantone et al. [7] proposed filtration methods using the q-neighborhood representation and SIMD instructions. These filtration methods [10,9,7] take sublinear time on average.

In this paper we introduce new representations, the prefix-parent representation and the prefix-child representation, which can be used to decide whether two strings have the same Cartesian trees or not. Using these representations, we improve the running time of the previous Cartesian tree matching algorithm in [21]. We also present a binary filtration method for Cartesian tree matching, and give an efficient verification technique for Cartesian tree matching based on the global-parent representation. In the framework of our binary filtration method and efficient verification technique, we can apply any known string matching algorithm [8,15,12] as a filter for Cartesian tree matching. In addition, we present a SIMD solution for Cartesian tree matching based on the global-parent representation, which is suitable for short patterns.
We conduct experiments comparing many algorithms for Cartesian tree matching, which show that known string matching algorithms combined in our framework of binary filtration and efficient verification produce fast algorithms for Cartesian tree matching.

This paper is organized as follows. In Section 2, we describe notations and the problem definition. In Section 3, we present an improved linear-time algorithm using new representations. In Section 4, we present the framework of binary filtration and efficient verification. In Section 5, we present a SIMD solution for short patterns. In Section 6, we give the experimental results of the previous algorithm and the proposed algorithms.
2 Preliminaries

A string is defined as a finite sequence of elements in an alphabet Σ. In this paper, we will assume that Σ has a total order <. For a string S, S[i] represents the i-th element of S, and S[i..j] represents the substring of S from the i-th element to the j-th element. If i > j, then S[i..j] is an empty string.

We will say S[i] ≺ S[j] if and only if S[i] < S[j], or S[i] and S[j] have the same value with i < j. Note that S[i] = S[j] (as elements of the string) if and only if i = j. Unless stated otherwise, the minimum is defined by ≺.

A string S can be associated with its corresponding Cartesian tree CT(S) [23] according to the following rules:
– If S is an empty string, then CT(S) is an empty tree.
– If S[1..n] is not empty and S[i] is the minimum value among S, then CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree.

Cartesian tree matching is to find all substrings of the text which have the same Cartesian trees as that of the pattern. Formally, Park et al. [21] define it as follows:
Definition 1. (Cartesian tree matching) Given two strings, text T[1..n] and pattern P[1..m], find every 1 ≤ i ≤ n − m + 1 such that CT(T[i..i+m−1]) = CT(P[1..m]).

Instead of building the Cartesian tree for every position in the text to solve Cartesian tree matching, Park et al. [21] use the following representation for a Cartesian tree.

Definition 2. (Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is a function PD_S, which is defined as follows: PD_S(i) = i − max{ j : 1 ≤ j < i, S[j] ≼ S[i] } if such j exists, and PD_S(i) = 0 otherwise.

3 Improved Linear-Time Algorithm

We introduce two new representations of a Cartesian tree. For a string S[1..n], the prefix-parent representation PP_S(i) is the index such that S[PP_S(i)] is the parent of S[i] in CT(S[1..i]); if S[i] is the root of CT(S[1..i]), we set PP_S(i) = i. The prefix-child representation PC_S(i) is the index such that S[PC_S(i)] is the child of S[i] in CT(S[1..i]); in particular, S[PC_S(i)] is the root of CT(S[1..i−1]) when PP_S(i) = i. When PP_S(i) = i − 1, there is no child of S[i] in CT(S[1..i]), and thus we set PC_S(i) as i.

Algorithm 1 Text search of Cartesian tree matching
  procedure CARTESIAN-TREE-MATCH(T[1..n], P[1..m])
    (PP_P, PC_P) ← PREFIX-PARENT-CHILD-REP(P)
    π ← FAILURE-FUNC(P)
    q ← 0
    for i ← 1 to n do
      while q ≠ 0 do
        if T[i − q − 1 + PP_P(q+1)] ≼ T[i] ≼ T[i − q − 1 + PC_P(q+1)] then break
        else q ← π[q]
      q ← q + 1
      if q = m then
        print "Match occurred at i − m + 1"
        q ← π[q]

Fig. 2 shows the prefix-parent representation (resp. the prefix-child representation) of string S = (3, …,
9) by arrows. The arrow starting from S[i] indicates PP_S(i) (resp. PC_S(i)). If PP_S(i) = i (resp. PC_S(i) = i), we omit the arrow.

The advantage of using the prefix-child representation and the prefix-parent representation is that we can check whether each text element matches the corresponding pattern element in constant time, without computing its parent-distance [21].

Theorem 1.
Given two strings P and S, assume that P[1..q] and S[1..q] have the same prefix-parent representations. If S[PP_P(q+1)] ≼ S[q+1] ≼ S[PC_P(q+1)], then P[1..q+1] and S[1..q+1] have the same prefix-parent representations, and vice versa.

Proof. (⇒) If q = 0, P[1] and S[1] always have the same prefix-parent representations. Now let's assume q ≥
1. There are three cases, in each of which we show that PP_P(q+1) = PP_S(q+1).
1. Case PP_P(q+1) = q+1: Since P[PC_P(q+1)] is the minimum element in P[1..q] and PP_P(i) = PP_S(i) for 1 ≤ i ≤ q, S[PC_P(q+1)] is also the minimum element in S[1..q]. Therefore, if S[q+1] ≼ S[PC_P(q+1)] holds, then we have PP_S(q+1) = q+1.
2. Case PP_P(q+1) = q: Since S[q] ≼ S[q+1], we have PP_S(q+1) = q.
3. Case PP_P(q+1) < q: Since P[PC_P(q+1)] is the minimum element in P[PP_P(q+1)+1..q] and PP_P(i) = PP_S(i) for 1 ≤ i ≤ q, S[PC_P(q+1)] is also the minimum element in S[PP_P(q+1)+1..q]. Therefore, if S[PP_P(q+1)] ≼ S[q+1] ≼ S[PC_P(q+1)] holds, then PP_S(q+1) = PP_P(q+1).
(⇐) It is trivial by the definitions of PP and PC. ⊓⊔
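The stack-based computation of PP and PC that Algorithm 2 below describes can be sketched in Python as follows. This is a sketch, not the authors' implementation: the tables are 1-indexed with index 0 unused, and ties follow the order ≺, so an earlier equal value is treated as smaller and only strictly larger stack entries are popped.

```python
# Sketch of the stack-based computation of the prefix-parent (PP) and
# prefix-child (PC) representations. 1-indexed; index 0 is unused.
def prefix_parent_child(p):
    n = len(p)
    PP, PC = [0] * (n + 1), [0] * (n + 1)
    stack = []  # indices j with p[j] ≺ p[k] for all j < k < i; values increase
    for i in range(1, n + 1):
        j_next = i
        while stack and p[stack[-1] - 1] > p[i - 1]:
            j_next = stack.pop()           # last popped index survives
        PC[i] = j_next                     # child of p[i] in CT(p[1..i]), or i
        PP[i] = stack[-1] if stack else i  # parent of p[i], or i if p[i] is root
        stack.append(i)
    return PP, PC
```

For example, prefix_parent_child([3, 1, 2]) returns PP = [0, 1, 2, 2] and PC = [0, 1, 1, 3]: the second element becomes the root of the two-element prefix tree with the first element as its child, and the third element hangs under the second with no child.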
Algorithm 2
Computing prefix-parent and prefix-child representations
  procedure PREFIX-PARENT-CHILD-REP(P[1..m])
    ST ← an empty stack
    for i ← 1 to m do
      j_next ← i
      while ST is not empty do
        j ← ST.top
        if P[j] ≺ P[i] then break
        ST.pop
        j_next ← j
      PC_P(i) ← j_next
      if ST is empty then PP_P(i) ← i
      else PP_P(i) ← j
      ST.push(i)
    return (PP_P, PC_P)

With the prefix-parent representation and the prefix-child representation of pattern P, we can simplify the text search. For each element T[i], we can check PP_P(q+1) = PP_{T[i−q..i]}(q+1) by comparing T[i] with the elements in T[i−q..i] whose indices correspond to PP_P(q+1) and PC_P(q+1) in P. Using this idea, we don't have to compute PP_{T[i−q..i]}(q+1). Algorithm 1 describes the algorithm to do this. We compute the failure function π in the same way as [21] does.

Given a string P[1..m], we can compute the prefix-child representation and the prefix-parent representation simultaneously in linear time using a stack. PP_P(i) = j means that P[j] ≺ P[k] for j < k < i. The same is true for PC_P(i). On the stack, therefore, we maintain only the j's which satisfy P[j] ≺ P[k] for j < k < i while scanning from i = 1 to m. Suppose that j_1, j_2, ..., j_r are on the stack when we are computing PP_P(i) and PC_P(i). (We assume that j_{r+1} = i.) Then (P[j_1], P[j_2], ..., P[j_r]) forms an increasing subsequence of P. When we consider a new index i, we pop the indices j_r, j_{r−1}, ..., j_{t+1} repeatedly until we have P[j_t] ≺ P[i]. If there exists such an index j_t, we set PP_P(i) = j_t and PC_P(i) = j_{t+1}. (If t = r, then PC_P(i) = j_{t+1} = i.) Otherwise, P[i] is the minimum element in P[1..i], and thus PP_P(i) = i and PC_P(i) = j_1. Finally, we push i onto the stack. Algorithm 2 describes the algorithm to compute PP_P and PC_P simultaneously.

4 Binary Filtration and Efficient Verification

In this section we present a practical solution based on filtration.
Our solution for Cartesian tree matching consists of two phases: filtration and verification. First, the text is filtered with some exact string matching algorithm using a binary representation. In the second phase, the potential candidates are verified using a global-parent representation.
4.1 Binary Filtration

In the filtration phase, a string S is translated into a binary representation β_S as follows.

Definition 5. (binary representation) Given a string S[1..n], the binary representation of S is a binary string β_S of length n −
1, which is defined as follows: β_S[i] = 0 if PP_S(i+1) = i, and β_S[i] = 1 otherwise, for 1 ≤ i ≤ n − 1.

One can easily check whether PP_S(i+1) = i is true or not by comparing S[i] and S[i+1]: PP_S(i+1) = i if and only if S[i] ≺ S[i+1]. The following theorem proves that the binary representation can be used to filter a text T to search for all Cartesian tree matching occurrences of a pattern P.

Theorem 2.
Let P and T be two strings of lengths m and n, respectively, and let β_P and β_T be the binary representations associated with P and T, respectively. If CT(P[1..m]) = CT(T[i..i+m−1]), then β_P[j] = β_T[i+j−1] for 1 ≤ j ≤ m − 1.

Proof.
The prefix-parent representation has a one-to-one mapping to the Cartesian tree. Therefore, if CT(P[1..m]) = CT(T[i..i+m−1]), then PP_P(j+1) = PP_{T[i..i+m−1]}(j+1) for 0 ≤ j ≤ m − 1. Since β_P[j] = 0 if and only if PP_P(j+1) = j, and likewise for the window of T, it follows that β_P[j] = β_T[i+j−1] for 1 ≤ j ≤ m − 1. ⊓⊔

Thus, when we find matches of β_P in β_T, these matches are the only possible candidates of Cartesian tree matching, which should be verified.

Cantone et al. [7] presented two filtration methods other than the binary representation to solve order-preserving matching. They used the property that T doesn't match P at position i if there are two positions j and k such that P[j] ≼ P[k] ⇔ T[i+j−1] ≼ T[i+k−
1] doesn’t hold. Thus any comparison resultbetween two positions can be used for filtration. In Cartesian tree matching,however, even if there exist such j and k , the corresponding Cartesian trees canbe the same when | j − k | >
1. Therefore, we cannot use these filtration methods for Cartesian tree matching.
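A minimal non-SIMD sketch of this filtration phase: β is computed with one ≺-comparison per position, and any exact matcher over {0, 1} then reports the candidate positions. Here Python's built-in str.find stands in for the fast filters (SBNDMq, Boyer-Moore-Horspool, Alpha skip search) used in the paper.

```python
# Binary representation of Definition 5: beta_S[i] = 0 iff S[i] ≺ S[i+1].
# Since the earlier index wins ties under ≺, the test reduces to <=.
def binary_rep(s):
    return ''.join('0' if s[i] <= s[i + 1] else '1' for i in range(len(s) - 1))

def candidates(text, pattern):
    """0-indexed positions where beta_P occurs in beta_T: by Theorem 2, the
    only places a Cartesian tree match can start (assumes len(pattern) >= 2)."""
    bt, bp = binary_rep(text), binary_rep(pattern)
    out, i = [], bt.find(bp)
    while i != -1:
        out.append(i)
        i = bt.find(bp, i + 1)
    return out
```

For instance, candidates([1, 2, 3, 2, 1, 2, 3], [1, 2, 1]) reports position 1, where the window [2, 3, 2] indeed has the same Cartesian tree as [1, 2, 1]; every reported position must still be verified.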
4.2 Verification

In the verification phase, we have to check whether the candidates found by the filtration phase are actual matches or not. This checking can be done using the prefix-parent and prefix-child representations by Theorem 1, which takes 2 comparisons
per element. In order to reduce the number of comparisons to 1, we introduce another representation as follows.
Definition 6. (Global-parent representation) Given a string S[1..n], the global-parent representation of S is a function GP_S, which is defined as follows: GP_S(i) = j if there exists j > i such that PC_S(j) = i, and GP_S(i) = PP_S(i) if such j doesn't exist.

GP_S(i) is well-defined because there is at most one j > i which satisfies PC_S(j) = i. Fig. 2 shows the global-parent representation by arrows. The arrow starting from S[i] indicates the global parent of S[i]. If GP_S(i) = i, we omit the arrow.

Theorem 3.
Two strings P[1..m] and S[1..m] have the same Cartesian trees if and only if S[GP_P(i)] ≼ S[i] for all 1 ≤ i ≤ m.

Proof.
We will prove that S[PP_P(i)] ≼ S[i] ≼ S[PC_P(i)] for all 1 ≤ i ≤ m if and only if S[GP_P(i)] ≼ S[i] for all 1 ≤ i ≤ m.
(⇒) It is trivial by the definition of GP.
(⇐) Assume S[GP_P(i)] ≼ S[i] for all 1 ≤ i ≤ m. For any 1 ≤ k ≤ m, we first show S[k] ≼ S[PC_P(k)], and then we show S[PP_P(k)] ≼ S[k].
1. (Proof of S[k] ≼ S[PC_P(k)]) There are two cases: PC_P(k) = k and PC_P(k) ≠ k. If PC_P(k) = k, then S[k] ≼ S[PC_P(k)] holds trivially. Otherwise, since GP_P(PC_P(k)) = k, we have S[k] = S[GP_P(PC_P(k))] ≼ S[PC_P(k)]. Therefore, S[k] ≼ S[PC_P(k)] holds.
2. (Proof of S[PP_P(k)] ≼ S[k]) If GP_P(k) = PP_P(k), then S[PP_P(k)] = S[GP_P(k)] ≼ S[k]. So we only have to consider the case that there is k_1 > k which satisfies PC_P(k_1) = k. Let k = k_0 < k_1 < ··· < k_r ≤ m be a sequence such that PC_P(k_{l+1}) = k_l, and there is no k_{r+1} > k_r which satisfies PC_P(k_{r+1}) = k_r. Since (k_0, k_1, ..., k_r) is a strictly increasing sequence, such a k_r always exists. Note that GP_P(k_l) = k_{l+1} except for GP_P(k_r). On the sequence, there may or may not exist j such that PP_P(k_j) = k_j.
Suppose that there exists some j such that PP_P(k_j) = k_j. Since k_{j−1} = PC_P(k_j), P[k_{j−1}] is the minimum element in P[1..k_j − 1], and thus PP_P(k_{j−1}) = k_{j−1}. Proceeding inductively, PP_P(k_l) = k_l for all l ≤ j. Thus S[PP_P(k)] ≼ S[k] holds trivially.
Now we consider the case that PP_P(k_j) ≠ k_j for all j. Then, we have S[k_0] ≽ S[k_1] ≽ ··· ≽ S[k_r] ≽ S[GP_P(k_r)] = S[PP_P(k_r)] by the assumption that S[GP_P(i)] ≼ S[i] for all i. We now show PP_P(k_r) = PP_P(k) as follows.
Since PC_P(k_r) = k_{r−1}, P[k_{r−1}] is the minimum element in P[PP_P(k_r) + 1..k_r − 1], and P[k_{r−1}] ≽ P[PP_P(k_r)]. Hence, we have PP_P(k_{r−1}) = PP_P(k_r). Inductively, we can show that PP_P(k) = PP_P(k_1) = ··· = PP_P(k_r). Therefore, S[PP_P(k)] ≼ S[k] holds. ⊓⊔

By Theorem 3, we only have to compare once for each element in the verification phase. For a potential candidate obtained from the filtration phase (say, it starts from T[i]), we compare T[i + q −
1] and T[i + GP_P(q) − 1] for q = 1 to m. The candidate is discarded when there exists q such that T[i + q − 1] ≺ T[i + GP_P(q) − 1]. The global-parent representation can be computed in linear time: we first compute GP_P(i) as PP_P(i), and then if we find j such that PC_P(j) = i we update GP_P(i) to j.

The proof of sublinearity is similar to the analysis of order-preserving matching with filtration [10]. Let's assume that the elements in the pattern P and the text T are independent of each other and the distribution is uniform. The verification phase takes time proportional to the pattern length times the number of potential candidates. When the alphabet size is |Σ|, the probability that β_P[i] = 0 (i.e., the probability that P[i] ≺ P[i+1]) is (|Σ|² + |Σ|)/(2|Σ|²), since there are |Σ|² pairs and |Σ| pairs among them have equal elements. Similarly, the probability that β_P[i] = 1 is (|Σ|² − |Σ|)/(2|Σ|²), and it is the same for β_T[i]. Therefore, the probability that β_P[i] = β_T[i] is ((|Σ|² + |Σ|)/(2|Σ|²))² + ((|Σ|² − |Σ|)/(2|Σ|²))² = 1/2 + 1/(2|Σ|²). As the pattern length increases, the number of potential candidates decreases exponentially, and the verification time approaches zero. Hence, the filtration time dominates. So if the filtration method takes sublinear time in the average case, the total algorithm takes sublinear time in the average case, too.

When we use the Boyer-Moore-Horspool algorithm [15] and the Alpha skip search algorithm [8] as the filtration method, we pack four 32-bit numbers or sixteen 8-bit numbers into a register, as in order-preserving matching algorithms [9,7]. Each pair of two corresponding packed data elements can be compared in parallel using streaming SIMD extensions (SSE) [16].
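The verification side can be sketched as follows (a sketch, not the authors' code): GP is derived from PP and PC exactly as described above, and each candidate window then costs one ≼-comparison per element by Theorem 3. The stack computation of PP and PC is inlined so the block is self-contained.

```python
# Global-parent table (Definition 6): GP(i) = j if some j > i has PC(j) = i,
# otherwise GP(i) = PP(i). All tables are 1-indexed; index 0 is unused.
def global_parent(p):
    n = len(p)
    PP, PC = [0] * (n + 1), [0] * (n + 1)
    stack = []                        # stack computation of PP and PC
    for i in range(1, n + 1):
        j_next = i
        while stack and p[stack[-1] - 1] > p[i - 1]:
            j_next = stack.pop()
        PC[i] = j_next
        PP[i] = stack[-1] if stack else i
        stack.append(i)
    GP = PP[:]
    for j in range(1, n + 1):
        if PC[j] < j:                 # PC(j) = i < j: redirect GP(i) to j
            GP[PC[j]] = j
    return GP

def prec_eq(s, a, b):
    """s[a] ≼ s[b] (1-indexed): smaller value wins, ties go to the earlier index."""
    return a == b or s[a - 1] < s[b - 1] or (s[a - 1] == s[b - 1] and a < b)

def verify(window, GP):
    """One comparison per element (Theorem 3): the window has the pattern's
    Cartesian tree iff window[GP(i)] ≼ window[i] for every i."""
    return all(prec_eq(window, GP[i], i) for i in range(1, len(GP)))
```

For a hypothetical pattern p = [3, 1, 4, 2, 8], global_parent(p) is [0, 2, 2, 4, 2, 4], and a window such as [10, 5, 20, 7, 30] (same shape) passes verification while [1, 2, 3, 4, 5] does not.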
In the case of 32-bit integers, for example, we compute (T[i+3] > T[i+4]), (T[i+2] > T[i+3]), (T[i+1] > T[i+2]), and (T[i] > T[i+1]) in parallel as in Algorithm 3, where instruction _mm_loadu_si128((__m128i*)(T + i)) loads four 32-bit integers from memory T + i into a 128-bit register, instruction _mm_cmpgt_epi32(a, b) compares four pairs of packed 32-bit integers and returns the results of the comparisons in a 128-bit register, instruction _mm_castsi128_ps casts the integer type to the float type, and instruction _mm_movemask_ps selects only the most significant bits of the 4 floats. Comparing sixteen pairs of 8-bit numbers can be done similarly.

5 SIMD Solution for Short Patterns

In this section we present an algorithm that works when the alphabet consists of 1-byte characters and the pattern length m is at most 16. As shown in Section 4.2,

Algorithm 3
Compare integers in parallel
  procedure CompareUsingSIMD(T[1..n], i)
    __m128i a ← _mm_loadu_si128((__m128i*)(T + i))
    __m128i b ← _mm_loadu_si128((__m128i*)(T + i + 1))
    __m128i r ← _mm_cmpgt_epi32(a, b)
    return _mm_movemask_ps(_mm_castsi128_ps(r))

we test T[s + i − 1] ≽ T[s + GP_P(i) −
1] for 1 ≤ i ≤ m to check for an occurrence at position s of the text T. Let W be a word of 16 bytes containing the current window of the text, i.e., W = T[s..s+15]. For 1 ≤ i ≤ m, we define W_i (the word obtained from W by shifting |i − GP_P(i)| positions to the left or to the right, depending on the sign of i − GP_P(i)) as follows: W_i = W ≪ (GP_P(i) − i) if i < GP_P(i), and W_i = W ≫ (i − GP_P(i)) if i > GP_P(i). For fixed i, we can find the positions j which satisfy W[j + i − 1] ≽ W[j + GP_P(i) −
1] for 0 ≤ j ≤
15 in parallel by comparing W_i to W using SIMD instructions. The satisfying positions for all 1 ≤ i ≤ m are the occurrences of the pattern. The details of the algorithm are as follows. We test whether W_i[j] ≼ W[j] for 0 ≤ j ≤
15 in parallel using the SIMD instruction R_i = _mm_cmpgt_epi8(W, W_i) for i < GP_P(i), or R_i = ∼_mm_cmpgt_epi8(W_i, W) for i > GP_P(i). (In order to get only the significant bits when computing R_i, we use instruction _mm_movemask_epi8.) Then we compute q = AND_{i=1}^{m} (R_i ≪ (i − 1)); the pattern occurs at position s + j of the text if q[j] = 1.
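A scalar stand-in for this window test (no intrinsics): the sketch below evaluates, for each start offset j of the window, the one comparison per pattern position that the SSE version evaluates for all 16 offsets at once via the shifted words W_i and the ANDed bitmasks R_i. GP is assumed to be the pattern's 1-indexed global-parent table (Definition 6).

```python
# Report every offset j of the window W (0-indexed) at which a pattern of
# length m <= 16, given by its global-parent table GP (1-indexed), matches
# W[j..j+m-1] by Cartesian tree. Only offsets where the whole pattern fits
# are tested; the SIMD version masks the remaining positions out instead.
def window_matches(W, GP, m):
    def succ_eq(a, b):  # W[a] ≽ W[b] (0-indexed): ties favor the earlier index
        return a == b or W[b] < W[a] or (W[b] == W[a] and b < a)
    return [j for j in range(len(W) - m + 1)
            if all(succ_eq(j + i - 1, j + GP[i] - 1) for i in range(1, m + 1))]
```

With a hypothetical pattern of length 5 whose global-parent table is [0, 2, 2, 4, 2, 4], the window (10, 12, 16, 15, 6, 14, 9, 12, 11, 14, 9, 17, 12, 13, 12, 10) yields the 0-indexed offsets [3, 5, 9], i.e., three matches.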
Let’s consider an example of the pattern P = (3 , , , ,
8) and thewindow of the text W = (10 , , , , , , , , , , , , , , , − GP P (1) = 3 − GP P (3), R = R . Moreover we do notneed to compute R , since 2 − GP P (2) = 0. Hence we compute R , R , and R . W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 W = 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 R = 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, - W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13 R = -, -, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1 W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10 W = 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12 R = -, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0The final result q can be computed as follows: ast Cartesian Tree Matching 11Dataset m KMP IKMP SBNDMCT BMHCT SKSCT PMCT CT 2 4 6 4 8 12 16 4 8 12 16 CTRandom 5 10.52 6.84 4.99 4.42 4.17 int 9 10.71 6.83 2.71 2.31 1.95 1.95 temp 9 5.11 3.14 1.56 1.45 1.55 1.55
R_1 = 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, -
q = R_1 AND (R_3 ≪ 2) AND (R_4 ≪ 3) AND (R_5 ≪ 4) = 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0

Therefore, we can report 3 matches. After we have tested a window of the text, we shift the current window to the right by 17 − m positions. This algorithm takes O(mn/(17 − m)) SIMD instructions.

Table 1: Execution times in seconds for random patterns in texts (Random datasets: for 100 patterns, Seoul temperatures dataset: for 1000 patterns).

6 Experiments

In this section we conduct experiments comparing the following algorithms.
– KMPCT: algorithm of Park, Amir, Landau, and Park [21]
– IKMPCT: our improved linear-time algorithm based on prefix-parent and prefix-child representations (Section 3)
– PMCT: SIMD solution for short patterns (Section 5)
– SBNDMCTq: algorithm based on the SBNDMq filtration implemented by Faro and Lecroq [13] on the binary representations of the text and the pattern (Section 4.1) and verification using the global-parent representation (Section 4.2) [12] (The following algorithms have the same framework as SBNDMCTq; only SBNDMq is replaced by another filtration method.)
– BMHCTq: algorithm based on the q-gram Boyer-Moore-Horspool filtration using SIMD instructions [15,22,9]
– SKSCTq: algorithm based on the q-gram Alpha skip search filtration using SIMD instructions [8,7]

Fig. 3: Execution times for the random character dataset.

We tested two random datasets and one real dataset, which is a time series of Seoul temperatures. The first random dataset consists of 10,000,000 random integers. The second random dataset consists of 10,000,000 random characters. The Seoul temperatures dataset consists of 658,795 integers referring to the hourly temperatures in Seoul (multiplied by ten) in the years 1907-2019. In general, temperatures rise during the day and fall at night. Therefore, the Seoul temperatures dataset has more matches than the random datasets. We picked 100 random patterns per pattern length from the random datasets and 1000 random patterns per pattern length for the Seoul temperatures dataset.

The experimental environments and parameters are as follows. All algorithms were implemented in C++11 and compiled with GNU C++ compiler version 4.8.5, with the O3 and msse4 options. The experiments were performed on CentOS Linux 7 with 128GB RAM and an Intel Xeon CPU E5-2630 processor.

Table 1 shows the total execution times of the Cartesian tree matching algorithms for random patterns (including the preprocessing). The best results are boldfaced. We choose the best results of the random character dataset from each algorithm regardless of q and present them in Fig. 3 (except KMPCT, for readability). Our linear-time algorithm IKMPCT improves upon algorithm KMPCT of [21] by about 35%. In the random character dataset, PMCT is the fastest algorithm for short patterns. However, as the pattern length grows, algorithms based on the filtration method are much faster in practice. It can be seen that SKSCT is the fastest algorithm in most cases. When the pattern length is equal to 9, BMHCT utilizing 8-grams is the fastest algorithm, irrespective of the datasets.
As the pattern length grows, SKSCT utilizing 12-grams becomes the fastest algorithm.

Regardless of the data type, the results are almost consistent. In detail, however, there are several differences. First, filtration algorithms, especially the SKSCT algorithms, are relatively slower on the Seoul temperatures dataset, because there are more matches in the Seoul temperatures dataset. Second, when q is large, the BMHCT and SKSCT algorithms are faster on the random character dataset than on the random integer dataset, because the maximum number of elements that we can compare in parallel is 16 in the character dataset while it is 4 in the integer dataset.

Acknowledgments.
Song, Ryu and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).
References
1. Amir, A., Aumann, Y., Landau, G.M., Lewenstein, M., Lewenstein, N.: Pattern matching with swaps. Journal of Algorithms (2), 247–266 (2000)
2. Amir, A., Cole, R., Hariharan, R., Lewenstein, M., Porat, E.: Overlap matching. Information and Computation (1), 57–74 (2003)
3. Amir, A., Farach, M., Muthukrishnan, S.: Alphabet dependence in parameterized matching. Information Processing Letters (3), 111–115 (1994)
4. Amir, A., Lewenstein, M., Porat, E.: Approximate swapped matching. Information Processing Letters (1), 33–39 (2002)
5. Baker, B.S.: A theory of parameterized pattern matching: Algorithms and applications. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing. pp. 71–80. ACM (1993)
6. Burcsi, P., Cicalese, F., Fici, G., Liptak, Z.: Algorithms for jumbled pattern matching in strings. International Journal of Foundations of Computer Science (2), 357–374 (2012)
7. Cantone, D., Faro, S., Kulekci, M.O.: The order-preserving pattern matching problem in practice. Discrete Applied Mathematics (2018)
8. Charras, C., Lecroq, T., Pehoushek, J.D.: A very fast string matching algorithm for small alphabets and long patterns. In: Combinatorial Pattern Matching. pp. 55–64 (1998)
9. Chhabra, T., Kulekci, M.O., Tarhio, J.: Alternative algorithms for order-preserving matching. In: Proceedings of the Prague Stringology Conference 2015. pp. 36–46 (2015)
10. Chhabra, T., Tarhio, J.: A filtration method for order-preserving matching. Information Processing Letters (2), 71–74 (2016)
11. Cho, S., Na, J.C., Park, K., Sim, J.S.: A fast algorithm for order-preserving pattern matching. Information Processing Letters (2), 397–402 (2015)
12. Durian, B., Holub, J., Peltola, H., Tarhio, J.: Improving practical exact string matching. Information Processing Letters (4), 148–152 (2010)
13. Faro, S., Lecroq, T., Borzi, S., Mauro, S.D., Maggio, A.: The string matching algorithms research tool. In: Proceedings of the Prague Stringology Conference 2016. pp. 99–113 (2016)
14.
Fredriksson, K., Grabowski, S.: Practical and optimal string matching. In: String Processing and Information Retrieval. pp. 376–387 (2005)
15. Horspool, R.N.: Practical fast searching in strings. Software: Practice and Experience (6), 501–506 (1980)
16. Intel: Intel(R) 64 and IA-32 Architectures Optimization Reference Manual (2019)
17. Kim, J., Amir, A., Na, J.C., Park, K., Sim, J.S.: On representations of ternary order relations in numeric strings. Mathematics in Computer Science (2), 127–136 (2017)
18. Kim, J., Eades, P., Fleischer, R., Hong, S.H., Iliopoulos, C.S., Park, K., Puglisi, S.J., Tokuyama, T.: Order-preserving matching. Theoretical Computer Science, 68–79 (2014)
19. Knuth, D.E., Morris, Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing (2), 323–350 (1977)
20. Kubica, M., Kulczynski, T., Radoszewski, J., Rytter, W., Walen, T.: A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters (12), 430–433 (2013)
21. Park, S.G., Amir, A., Landau, G.M., Park, K.: Cartesian tree matching and indexing. Accepted to CPM (2019), https://arxiv.org/abs/1905.08974
22. Tarhio, J., Peltola, H.: String matching in the DNA alphabet. Software: Practice and Experience (7), 851–861 (1997)
23. Vuillemin, J.: A unifying look at data structures. Communications of the ACM 23