Fast and linear-time string matching algorithms based on the distances of q -gram occurrences
Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
Graduate School of Information Sciences, Tohoku University, Japan
Abstract
Given a text T of length n and a pattern P of length m, the string matching problem is a task to find all occurrences of P in T. In this study, we propose an algorithm that solves this problem in O((n+m)q) time considering the distance between two adjacent occurrences of the same q-gram contained in P. We also propose a theoretical improvement of it which runs in O(n+m) time, though it is not necessarily faster in practice. We compare the execution times of our and existing algorithms on various kinds of real and artificial datasets such as an English text, a genome sequence and a Fibonacci string. The experimental results show that our algorithm is as fast as the state-of-the-art algorithms in many cases, particularly when a pattern frequently appears in a text.

1 Introduction

The exact string matching problem is a task to find all occurrences of P in T when given a text T of length n and a pattern P of length m. A brute-force solution of this problem is to compare P with all the substrings of T of length m. It takes O(nm) time. The Knuth-Morris-Pratt (KMP) algorithm [17] is well known as an algorithm that can solve the problem in O(n+m) time. However, it is not efficient in practice, because it scans every position of the text at least once. The Boyer-Moore algorithm [3] is famous as an algorithm that can perform string matching fast in practice by skipping many positions of the text, though it has O(nm) worst-case time complexity. In this way, many efficient algorithms whose worst-case time complexity is the same as or even worse than that of the naive method have been proposed so far [16, 19, 21]. For example, the HASHq algorithm [19] focuses on the substrings of length q in a pattern and obtains a larger shift amount. However, considering that such an algorithm is embedded in software and actually used, if the worst-case input strings are given, the operation of the software may be slowed down.
Therefore, an algorithm that operates theoretically and practically fast is important. Franek et al. [14] proposed the Franek-Jennings-Smyth (FJS) algorithm, which is a hybrid of the KMP algorithm and the Sunday algorithm [21]. The worst-case time complexity of the FJS algorithm is O(n+m+σ) and it works fast in practice, where σ is the alphabet size. Kobayashi et al. [18] proposed an algorithm that improves the speed of the FJS algorithm by combining a method that extends the idea of the Quite-Naive algorithm [4]. This algorithm has the same worst-case time complexity as the FJS algorithm, and it runs faster than the FJS algorithm in many cases. The LWFRq algorithm [8] is a practically fast algorithm that works in linear time. This algorithm uses a method of quickly recognizing substrings of a pattern using a hash function. See [12, 15] for recent surveys on exact string matching algorithms.

This paper proposes two new exact string matching algorithms based on the HASHq and KMP algorithms, incorporating a new idea based on the distances of occurrences of the same q-grams. The time complexity of the preprocessing phase of the first algorithm is O(mq) and the search phase runs in O(nq) time. The second algorithm improves the theoretical complexity of the first algorithm, where the preprocessing and searching times are O(m) and O(n), respectively. Our algorithms are as fast as the state-of-the-art algorithms in many cases. Particularly, our algorithms work faster when a pattern frequently appears in a text.

This paper is organized as follows. Section 2 briefly reviews the KMP and HASHq algorithms, which are the basis of the proposed algorithms. Section 3 proposes our algorithms. Section 4 shows experimental results comparing the proposed algorithms with several other algorithms using artificial and practical data. Section 5 draws our conclusions.

2 Preliminaries

Let Σ be a set of characters called an alphabet and σ = |Σ| be its size. Σ∗ denotes the set of all strings over Σ.
The length of a string w ∈ Σ∗ is denoted by |w|. The empty string, denoted by ε, is the string of length zero. The i-th character of w is denoted by w[i] for each 1 ≤ i ≤ |w|. The substring of w starting at i and ending at j is denoted by w[i : j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i : j] = ε if i > j. A string w[1 : i] is called a prefix of w and a string w[i : |w|] is called a suffix of w. A string v is a border of w if v is both a prefix and a suffix of w. Note that the empty string is a border of any string. Moreover, a prefix, a suffix or a border v of w is called proper when v ≠ w. The length of the longest proper border of w[1 : i] for 1 ≤ i ≤ |w| is given by Bord_w[i] = max{ j | w[1 : j] = w[i−j+1 : i] and 0 ≤ j < i }. Throughout this paper, we assume Σ is an integer alphabet.
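To make the border table concrete, here is a small Python sketch (ours, not from the paper) that computes Bord_w[i] for every prefix of w using the classic failure-function recurrence:

```python
def bord(w):
    """Bord_w[i]: length of the longest proper border of w[1:i] (1-indexed prefixes)."""
    m = len(w)
    B = [0] * (m + 1)  # B[0] is unused padding; B[1] = 0 by definition
    k = 0              # length of the current border candidate
    for i in range(2, m + 1):
        # try to extend the border of w[1:i-1]; fall back while extension fails
        while k > 0 and w[k] != w[i - 1]:
            k = B[k]
        if w[k] == w[i - 1]:
            k += 1
        B[i] = k
    return B

# e.g. the pattern used later in Example 1:
print(bord("abaabbaaa")[1:])  # [0, 0, 1, 1, 2, 0, 1, 1, 1]
```

The KMP shift function reviewed below strengthens this table by additionally requiring the character following the border to differ from the mismatched one.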
The exact string matching problem is defined as follows:

Input: A text T ∈ Σ∗ of length n and a pattern P ∈ Σ∗ of length m.
Output: All positions i such that T[i : i+m−1] = P for 1 ≤ i ≤ n−m+1.

We will use a text T ∈ Σ∗ of length n and a pattern P ∈ Σ∗ of length m throughout the paper. Let us consider comparing T[i : i+m−1] and P[1 : m]. The naive method compares characters of the two strings from left to right. When a character mismatch occurs, the pattern is shifted to the right by one character. That is, we compare T[i+1 : i+m] and P[1 : m]. This naive method takes O(nm) time for matching. There are a number of ideas to shift the pattern more so that searching T for P can be performed more quickly, using shifting functions obtained by preprocessing the pattern.

2.1 The KMP algorithm

The Knuth-Morris-Pratt (KMP) algorithm [17] is well known as a string matching algorithm that has linear worst-case time complexity. When the KMP algorithm has confirmed that T[i : i+j−2] = P[1 : j−1] but T[i+j−1] ≠ P[j] for some j ≤ m, it shifts the pattern so that a suffix of T[i : i+j−2] matches a prefix of P and we do not have to re-scan any part of T[i : i+j−2] again. That is, the pattern can be shifted by j − k − 1 for k = Bord_P[j−1]. However, if P[k+1] = P[j], the same mismatch will occur again after the shift. In order to avoid this kind of mismatch, we use Strong_Bord_P[1 : m+1] given by

Strong_Bord_P(j) =
  Bord_P(m)  if j = m+1,
  max({ k | P[1 : k] = P[j−k : j−1], P[k+1] ≠ P[j], 0 ≤ k < j } ∪ {−1})  otherwise.

The amount KMP_Shift[j] of the shift is given by KMP_Shift[j] = j − Strong_Bord_P(j) − 1. This function has a domain of {1, …, m+1} and is implemented as an array in the algorithm. Hereafter, we identify some functions and the arrays that implement them.

Algorithm 1: Computing KMP_Shift

Function PreKMPShift(P)
    m ← |P|; i ← 1; j ← 0; Strong_Bord_P[1] ← −1
    while i ≤ m do
        while j > 0 and P[i] ≠ P[j] do
            j ← Strong_Bord_P[j] + 1
        i ← i + 1; j ← j + 1
        if i ≤ m and P[i] = P[j] then Strong_Bord_P[i] ← Strong_Bord_P[j]
        else Strong_Bord_P[i] ← j − 1
    for j ← 1 to m + 1 do
        KMP_Shift[j] ← j − Strong_Bord_P[j] − 1
    return KMP_Shift

Fact 1. If P[1 : j−1] = T[i : i+j−2] and P[j] ≠ T[i+j−1], then P[1 : j−k_j−1] = T[i+k_j : i+j−2] holds for k_j = KMP_Shift[j]. Moreover, there is no positive integer k < KMP_Shift[j] such that P = T[i+k : i+k+m−1].

Note that if the algorithm has confirmed T[i : i+m−1] = P, the shift is given by KMP_Shift[m+1] after reporting the occurrence of the pattern. Algorithm 1 shows a pseudocode to compute the array KMP_Shift. It runs in O(m) time. By using KMP_Shift, the KMP algorithm finds all occurrences of P in T in O(n) time.

2.2 The HASHq algorithm

The HASHq algorithm [19] is an adaptation of the Wu-Manber multiple string matching algorithm [22] to the single string matching problem. Before comparing P and T[i : i+m−1], the HASHq algorithm shifts the pattern so that the suffix q-gram T[i+m−q : i+m−1] of the text substring shall match the rightmost occurrence of the same q-gram in the pattern. For practical efficiency, we use a hash function, though it may result in aligning mismatching q-grams occasionally. The shift amount is given by shift(h(T[i+m−q : i+m−1])), where

shift(c) = m − max({ j | h(P[j−q+1 : j]) = c, q ≤ j ≤ m } ∪ {q − 1}),
h(x) = (2^{q−1} · x[1] + 2^{q−2} · x[2] + · · · + 2 · x[q−1] + x[q]) mod 2^L,

for a hash table of size 2^L. We repeatedly shift the pattern till the suffix q-grams of the pattern and the considered text substring have a matching hash value, in which case the shift amount will be 0. We then compare the characters of the pattern and the text substring from left to right. If a character mismatch occurs during the comparison, the pattern is shifted by

min({ k | h(P[m′−k : m−k]) = h(P[m′ : m]), 1 ≤ k ≤ m − q } ∪ {m′})   (1)

where m′ = m − q + 1, since the q-gram suffixes of the pattern and the text substring have the same hash values. The time complexity of the preprocessing phase for computing the shift function is O(mq). The searching phase has O(n(m+q)) time complexity. The worst-case time complexity is worse than that of the naive method, but it works fast in practice.

Fact 2. If shift(h(T[i+m−q : i+m−1])) = j ≠ m − q + 1, then h(P[m−j−q+1 : m−j]) = h(T[i+m−q : i+m−1]). There is no positive integer k < j such that P = T[i+k : i+k+m−1].

3 Proposed algorithms

3.1 The DISTq algorithm

Our proposed algorithm uses three kinds of shifting functions. The first one
HQ_Shift is essentially the same as shift, the one used in the HASHq algorithm, except for the hashing function. The second one, dist, is based on the distance of the closest occurrences of the q-grams of the same hash value in the pattern. We involve KMP_Shift as the third one to guarantee the linear-time behavior. Formally, the first shifting function is given as

HQ_Shift[c] = m − max({ j | h(P[j−q+1 : j]) = c, q ≤ j ≤ m } ∪ {q − 1}),

where h(x) = (4^{q−1} · x[1] + 4^{q−2} · x[2] + · · · + 4 · x[q−1] + x[q]) mod 2^L. Fact 2 holds for HQ_Shift. The second shifting function is defined for j = q, …, m by

dist[j] = min({ k | h(P[j′−k : j−k]) = h(P[j′ : j]), 1 ≤ k ≤ j − q } ∪ {j′}),

where j′ = j − q + 1. This function dist is a generalization of the shift (Eq. 1) used in the HASHq algorithm. We have dist[j] = k < j′ if the q-gram ending at j and the one ending at j − k have the same hash value, while no q-grams occurring between those have the same value. If no q-gram ending before j has the same hash value, then dist[j] = j′. By using this, in the situation where h(P[j−q+1 : j]) = h(T[i+j−q : i+j−1]) holds when comparing T[i : i+m−1] and P, the pattern can be shifted by dist[j].

Fact 3. Suppose that h(P[j−q+1 : j]) = h(T[i+j−q : i+j−1]). Then h(P[j−q+1−dist[j] : j−dist[j]]) = h(T[i+j−q : i+j−1]), unless dist[j] = j − q + 1. Moreover, there is no positive integer k < dist[j] such that P = T[i+k : i+k+m−1].

Those functions HQ_Shift, dist and KMP_Shift are computed in the preprocessing phase. Algorithms 2 and 3 compute the arrays HQ_Shift and dist, respectively. Figure 1 shows examples of shifting the pattern using HQ_Shift and dist.

Figure 1: Shifting a pattern using HQ_Shift and dist. [figure not recoverable from the source]

Both functions HQ_Shift and dist shift the pattern using q-gram hash values based on Facts 2 and 3, respectively. The latter can be used only when we know that the pattern and the text substring have aligned q-grams ending at j with the same hash value, and it may shift the pattern at most j − q + 1, while the former can be used anytime and the maximum possible shift is m − q + 1. The advantage of the function dist is in the computational cost. If we know that the premise of Fact 3 is satisfied, we can immediately perform the shift based on dist, while computing HQ_Shift(h(w)) for the concerned q-gram w in the text is not as cheap as dist[j]. Our algorithm exploits this advantage of the new shifting function dist.

Algorithm 2: Computing HQ_Shift

Function PreHqShift(P, q)
    m ← |P|
    for i ← 0 to 2^L − 1 do HQ_Shift[i] ← m − q + 1
    for i ← q to m do
        hash ← h(P[i−q+1 : i]); HQ_Shift[hash] ← m − i
    return HQ_Shift

Algorithm 3: Computing dist

Function PreDistArray(P, q)
    for j ← 1 to q − 1 do dist[j] ← 0
    for j ← 0 to 2^L − 1 do prevpos[j] ← 0
    for j ← q to |P| do
        hash ← h(P[j−q+1 : j])
        if prevpos[hash] = 0 then d ← j − q + 1
        else d ← j − prevpos[hash]
        dist[j] ← d; prevpos[hash] ← j
    return dist

Next, we explain our searching algorithm shown in Algorithm 4. The searching phase is divided into three: Alignment-phase, Comparison-phase, and KMP-phase. The goal of the Alignment-phase is to shift the pattern as far as possible without comparing each single character of the pattern and the text. The Alignment-phase ends when we align the pattern and a text substring that have (a) aligned q-grams of the same hash value and (b) the same first character. Suppose P and T[k−m+1 : k] are aligned at the beginning of the Alignment-phase. If s = HQ_Shift[h(T[k−q+1 : k])] ≤ m − q, by shifting the pattern by s, we find the aligned q-grams of the same hash value. Namely, h(P[m−q−s+1 : m−s]) = h(T[k−q+1 : k]). Otherwise, we shift the pattern by m − q + 1 repeatedly until we find aligned q-grams of the same hash value. When finding a position pos satisfying (a) by aligning P and T[k′−m+1 : k′] for some k′, i.e., h(P[pos−q+1 : pos]) = h(T[k′−m+pos−q+1 : k′−m+pos]), we simply check the condition (b). If P[1] and the corresponding text character match, we move to the Comparison-phase. Otherwise, we safely shift the pattern using dist[pos]. Note that although it is possible to use the function HQ_Shift(h(T[k′−q+1 : k′])) rather than dist[pos], the computation would be more expensive. After shifting the pattern by dist[pos], unless dist[pos] = pos − q + 1, the pattern and the aligned text substring still satisfy (a). However, we do not repeat the dist-shift any more, since the smaller pos becomes, the smaller the expected shift amount will be. We simply restart the Alignment-phase. Once the conditions (a) and (b) are satisfied, we move on to the Comparison-phase.

In the Comparison-phase, we check the characters from P[2] to P[m]. If a character mismatch occurs during the comparison, either the shift by KMP_Shift or by dist is possible. Therefore, we select the one where the resumption position of the character comparison goes further to the right after shifting the pattern. If the resumption position of the comparison is the same, we select the one with the larger shift amount. Recall that when the KMP algorithm finds that P[1 : j−1] = T[i : i+j−2] and P[j] ≠ T[i+j−1], the comparison resumes with T[i+j−1] and P[j − KMP_Shift[j]] if KMP_Shift[j] < j, and with T[i+j] and P[1] if KMP_Shift[j] = j. On the other hand, if we shift the pattern by dist[pos], we simply resume matching T[i+dist[pos]] and P[1]. Therefore, we should use KMP_Shift[j] rather than dist[pos] when either KMP_Shift[j] < j and dist[pos] < j − 1, or KMP_Shift[j] = j > dist[pos]. Summarizing the discussion, we shift the pattern by dist[pos] if dist[pos] ≥ j − 1 and dist[pos] ≥ KMP_Shift[j] hold. Otherwise, we shift the pattern by KMP_Shift[j]. At this moment, we may have a "partial match" between the pattern and the aligned text substring. If we have performed the KMP-shift with KMP_Shift[j] < j − 1, then we have a match between the nonempty prefixes of the pattern and the aligned text substring of length j − KMP_Shift[j] − 1. In this case, we go to the KMP-phase, where we simply perform the KMP algorithm. The KMP-phase prevents the character comparison position from returning to the left and guarantees the linear time behavior of our algorithm. If we have no partial match, we return to the Alignment-phase.
Theorem 1. The worst-case time complexity of the DISTq algorithm is O((n+m)q).

Proof. Since the proposed algorithm uses Fact 1 on the KMP algorithm to prevent the character comparison position from going back to the left, the number of character comparisons is at most 2n − m, like in the KMP algorithm. In addition, the hash value of a q-gram is calculated to perform the shift using HQ_Shift. Since the hash value calculation requires O(q) time and it is calculated at a maximum of n − q + 1 places in the text, the hash value calculation takes O(nq) time in total. Therefore, the worst-case time complexity of the searching phase is O(nq). In the preprocessing, O(mq) time is required to calculate the hash values of q-grams at m − q + 1 locations. □

Example 1.
Let P = abaabbaaa. The shifting functions dist, KMP_Shift, and HQ_Shift are shown below. The hash values are calculated by treating each character as its ASCII value, e.g. a is calculated as 97.

j             1  2  3  4  5  6  7  8  9
P[j]          a  b  a  a  b  b  a  a  a
dist[j]       -  -  1  2  3  4  5  4  7
KMP_Shift[j]  1  1  3  2  4  3  7  6  7

x               aba   baa   aab   abb   bba   aaa   others
h(x)            2041  2053  2038  2042  2057  2037
HQ_Shift[h(x)]  6     1     4     3     2     0     7

Algorithm 4: DISTq algorithm

Function DISTq(P, T, q)
    KMP_Shift ← PreKMPShift(P); HQ_Shift ← PreHqShift(P, q); dist ← PreDistArray(P, q)
    n ← |T|; m ← |P|
    i ← j ← 1; k ← m
    while k ≤ n do
        if j ≤ 1 then
            while True do                          // Alignment-phase
                sh ← HQ_Shift[h(T[k−q+1 : k])]; k ← k + sh
                if k > n then halt
                if sh ≠ m − q + 1 then
                    pos ← m − sh
                    if P[1] = T[k−m+1] then break
                    k ← k + dist[pos]
                    if k > n then halt
            j ← 2; i ← k − m + 2
            while j ≤ m and P[j] = T[i] do         // Comparison-phase
                i ← i + 1; j ← j + 1
            if j = m + 1 then output i − m
            if dist[pos] ≥ j − 1 and dist[pos] ≥ KMP_Shift[j] then j ← j − dist[pos]
            else j ← j − KMP_Shift[j]
        else
            while j ≤ m and P[j] = T[i] do         // KMP-phase
                i ← i + 1; j ← j + 1
            if j = m + 1 then output i − m
            j ← j − KMP_Shift[j]
        k ← i + m − j

Figure 2 illustrates an example run of the DISTq algorithm (q = 3) for finding P = abaabbaaa in T = abbaabbaababbabbaaabaabaabbaaa.

Figure 2: An example run of the DISTq algorithm for a pattern P = abaabbaaa and a text T = abbaabbaababbabbaaabaabaabbaaa. For each alignment of the pattern, ◦ and × indicate a match and a mismatch between the text and the pattern, respectively. The character with • is known to match the character at the corresponding position in the text without comparison. Subscript numbers show the order of character comparisons in each attempt. [picture not recoverable from the source]

Attempt 1: We shift the pattern by one character for HQ_Shift[h(T[7 : 9])] = HQ_Shift[h(baa)] = HQ_Shift[2053] = 1. Since the position of the q-gram aligned by this shift is 8, pos is updated to 8.

Attempt 2: We check whether the first character of the pattern matches the corresponding character of the text. Finding P[1] ≠ T[2], the pattern is shifted by dist[pos] = dist[8] = 4.

Attempt 3: We shift the pattern by HQ_Shift[h(T[12 : 14])] = HQ_Shift[h(bba)] = 2 and update pos to 7.

Attempt 4: We check whether the first character of the pattern matches the corresponding character of the text. From P[1] = T[8], we compare the characters of P[2 : 9] and T[9 : 16] from left to right. Since P[2] ≠ T[9], the pattern is shifted by KMP_Shift[2] or dist[pos] = dist[7]. From KMP_Shift[2] = 1 and dist[7] = 5, both dist[pos] ≥ 2 − 1 and dist[pos] ≥ KMP_Shift[2] are satisfied. Therefore, we shift the pattern by dist[pos] = dist[7] = 5.

Attempt 5: We shift the pattern by HQ_Shift[h(T[19 : 21])] = HQ_Shift[h(aba)] = HQ_Shift[2041] = 6 and update pos to 3.

Attempt 6: We check whether the first character of the pattern matches the corresponding character of the text. By P[1] = T[19], the characters of P[2 : 9] and T[20 : 27] are compared from left to right. Since P[6] ≠ T[24], the pattern is shifted by KMP_Shift[6] or dist[pos] = dist[3]. By KMP_Shift[6] = 3 > dist[3] = 1, the pattern is shifted by KMP_Shift[6] = 3.

Attempt 7: Attempt 6 shows that P[1 : 2] = T[22 : 23], that is, there is a partial match, so we continue the comparison of T[24 : 30] and P[3 : 9]. Since T[24 : 30] = P[3 : 9], the pattern occurrence position 22 is reported.

3.2 The LDISTq algorithm

The LDISTq algorithm modifies the DISTq algorithm so that the worst-case time complexity is independent of q. In the DISTq algorithm, if strings such as T = a^n and P = ba^{m−1} are given, O(nq) time is required for the searching phase because the hash values of each q-gram are calculated in the text. Since the hash function h defined in Section 3.1 is a rolling hash, if the hash value of w[i : i+q−1] has already been obtained for a string w, the hash value of w[i+1 : i+q] can be computed in constant time by h(w[i+1 : i+q]) = (4 · (h(w[i : i+q−1]) − 4^{q−1} · w[i]) + w[i+q]) mod 2^L. The LDISTq algorithm modifies the hash computation in the Alignment-phase of Algorithm 4 so that we calculate the hash value of the q-gram using the previously calculated value of the other q-gram in the incremental way, if they overlap. Similarly, the time complexity of the preprocessing phase can be reduced.

Theorem 2.
The worst-case time complexity of the LDISTq algorithm is O(n + m).

Proof. Like the DISTq algorithm, we compare characters at most 2n − m times. To calculate the hash value of a q-gram, if it overlaps with the q-gram for which the hash value has been calculated one step before, the incremental update is performed using the rolling hash. Therefore, the calculation of the hash values of q-grams takes O(n) time in total. Thus, the worst-case time complexity of matching is O(n). Calculating the hash values of q-grams in the preprocessing is performed in the same way, so it is done in O(m) time. □

4 Experiments

In this section, we compare the execution times of the proposed algorithms with the existing algorithms listed below, where algorithms that run in linear time in the input string size are marked with ⋆.

• BNDMq [20]: Backward Nondeterministic DAWG Matching algorithm using q-grams, with q = 2, …,
• SBNDMq [1]: Simplified version of the Backward Nondeterministic DAWG Matching algorithm using q-grams, with q = 2, …,
• KBNDM [7]: Factorized variant of the BNDM algorithm,
• BSDMq [10]: Backward SNR DAWG Matching algorithm using condensed alphabets with groups of q characters, with 1 ≤ q ≤ …,
• ⋆ FJS [14]: Franek-Jennings-Smyth algorithm,
• ⋆ FJS+ [18]: Modification of the FJS algorithm,
• HASHq [19]: Hashing algorithm using q-grams, with 2 ≤ q ≤ …,
• FS-w [11]: Multiple Windows version of the Fast Search algorithm [5] implemented using w sliding windows, with w = 1, …,
• IOM [6]: Improved Occurrence Matcher,
• WOM [6]: Worst Occurrence Matcher,
• SKIPq [9]: Skip-Search algorithm using q-grams, with 2 ≤ q ≤ …,
• WFRq [8]: Weak-Factors-Recognition algorithm implemented with a q-chained loop, with 2 ≤ q ≤ …,
• ⋆ LWFRq [8]: Linear-Weak-Factors-Recognition algorithm implemented with a q-chained loop, with 2 ≤ q ≤ …,
• ⋆ DISTq: Our algorithm proposed in Section 3.1 (Algorithm 4), with 1 ≤ q ≤ …,
• ⋆ LDISTq: Our algorithm proposed in Section 3.2, with 1 ≤ q ≤ ….

The programs were compiled with the -O3 optimization option.
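As an aside on the linear-time variant: the constant-time rolling-hash update that LDISTq relies on (Section 3.2) can be checked directly in Python. This is a sketch of ours, with a table size of 2^16 assumed for illustration:

```python
def qgram_hashes(w, q, mod=1 << 16):
    """All q-gram hashes of w, left to right, via the rolling update
    h(w[i+1:i+q]) = (4*(h(w[i:i+q-1]) - 4^(q-1)*w[i]) + w[i+q]) mod 2^L."""
    top = 4 ** (q - 1)
    v = 0
    for c in w[:q]:                 # hash of the first q-gram, O(q)
        v = (v * 4 + ord(c)) % mod
    hs = [v]
    for i in range(len(w) - q):     # each further hash in O(1)
        v = (4 * (v - top * ord(w[i])) + ord(w[i + q])) % mod
        hs.append(v)
    return hs

# matches the hash values listed in Example 1 (a = 97, b = 98):
print(qgram_hashes("abaabbaaa", 3))  # [2041, 2053, 2038, 2042, 2057, 2053, 2037]
```

Since Python's % operator always returns a non-negative result, the subtraction of the outgoing character needs no special casing here.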
We used the implementations in SMART [13] for all algorithms except for the FJS, FJS+ and our algorithms. The implementations of our algorithms are available at https://github.com/ushitora/distq.

Table 1: Genome sequence (σ = 4, n = 4641652). [running-time table; numeric entries not recoverable from the source]
Table 2: English text (σ = 62, n = 4017009). [running-time table; numeric entries not recoverable from the source]
Table 3: Fibonacci string (σ = 2, n = 2178309). [running-time table; numeric entries not recoverable from the source]
Table 4: Texts with frequent pattern occurrences (σ = 4, n = 4000000, m = 8). [running-time table; numeric entries not recoverable from the source]
Table 5: Texts with frequent pattern occurrences (σ = 95, n = 4000000, m = 8). [running-time table; numeric entries not recoverable from the source]

We experimented with the following strings:

1. Genome sequence (Table 1): the genome sequence of E. coli of length n = 4641652 with σ = 4, from NCBI. The patterns are randomly extracted from T of length m = 2, 4, 8, 16, 32, 64, 128, 256,
512 and 1024. We measured the total running time of 25 executions.

2. English text (Table 2): the King James version of the Bible of length n = 4017009 with σ = 62, from the Large Canterbury Corpus [2] (http://corpus.canterbury.ac.nz/). We removed the line breaks from the text. The patterns are randomly extracted from T of length m = 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024. We measured the total running time of 25 executions.

3. Fibonacci string (Table 3): generated by the recurrence Fib_1 = b, Fib_2 = a and Fib_k = Fib_{k−1} · Fib_{k−2} for k > 2. The text is fixed to T = Fib_32 of length n = 2178309. The patterns are randomly extracted from T of length m = 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024. We measured the total running time of 100 executions.

4. Texts with frequent pattern occurrences (Tables 4, 5): generated by intentionally embedding a lot of patterns. We embedded a pattern of length m = 8 occ times into a text of length n = 4000000 over an alphabet of size σ = 4 and 95, for various values of occ. More specifically, we first randomly generate a pattern and a provisional text, which may contain the pattern. Then we randomly change characters of the text until the pattern does not occur in the text. Finally we embed the pattern occ times at random positions without overlapping. We measured the total running time of 25 executions.

The best performance among three trials is recorded for each experiment. For the algorithms using parameter q or w, we report only the best results. The value of q or w giving the best performance is shown in round brackets.

Experimental results show that when the pattern is short, the SBNDMq and BSDMq algorithms have good performance in general. For the genome sequence text, WFRq, LWFRq and our algorithms are the fastest algorithms except when the pattern is very short. On the English text, SBNDMq and LWFRq run fastest for short and long patterns, respectively. On the other hand, DISTq runs almost as fast as the best algorithm on both short and long patterns. In fact, it runs faster than SBNDMq and LWFRq for long and short patterns, respectively. In the experiments on the Fibonacci string, the FJS algorithm and our algorithms have shown good results as the pattern length increases. Differently from the previous two sorts of texts, our algorithms clearly outperformed the LWFRq algorithm. Since the Fibonacci strings have many repeating structures and patterns are randomly extracted from the text, the number of occurrences of the pattern is very large in this experiment. Therefore, we hypothesize that the efficiency of the DISTq algorithms does not decrease when the number of pattern occurrences is large.
We fixed the pattern length and alphabet size and prepared data in which the number of pattern occurrences is intentionally varied. The experimental results show that our algorithms become more advantageous as the number of pattern occurrences increases. The results also show that the LDIST q algorithm is generally slower than the DIST q algorithm; this should be due to the overhead of deciding, via the rolling hash, whether to update the hash value difference in the LDIST q algorithm.

Conclusion

We proposed two new algorithms for the exact string matching problem: the DIST q algorithm and the LDIST q algorithm. We confirmed that our algorithms are as efficient as the state-of-the-art algorithms in many cases. In particular, when a pattern frequently appears in a text, our algorithms outperformed the existing algorithms. The DIST q algorithm runs in O(q(n + m)) time and the LDIST q algorithm runs in O(n + m) time. Their performances were not significantly different in our experiments; rather, the former ran faster than the latter in most cases, where the optimal value of q was relatively small.

References

[1] Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Factor oracle: A new structure for pattern matching. In Jan Pavelka, Gerard Tel, and Miroslav Bartošek, editors,
SOFSEM’99: Theory and Practice of Informatics, pages 295–310. Springer Berlin Heidelberg, 1999.

[2] Ross Arnold and Tim Bell. A corpus for the evaluation of lossless compression algorithms. In Proceedings of DCC ’97, Data Compression Conference, pages 201–210, 1997.

[3] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.

[4] Domenico Cantone and Simone Faro. Searching for a substring with constant extra-space complexity. In Proceedings of the Third International Conference on Fun with Algorithms, pages 118–131, 2004.

[5] Domenico Cantone and Simone Faro. Fast-search algorithms: New efficient variants of the Boyer-Moore pattern-matching algorithm. Journal of Automata, Languages and Combinatorics, 10:589–608, 2005.

[6] Domenico Cantone and Simone Faro. Improved and self-tuned occurrence heuristics. Journal of Discrete Algorithms, 28:73–84, 2014.

[7] Domenico Cantone, Simone Faro, and Emanuele Giaquinta. A compact representation of nondeterministic (suffix) automata for the bit-parallel approach. Information and Computation, 213:3–12, 2012. Special Issue: Combinatorial Pattern Matching (CPM 2010).

[8] Domenico Cantone, Simone Faro, and Arianna Pavone. Linear and efficient string matching algorithms based on weak factor recognition. Journal of Experimental Algorithmics, 24(1):1–20, 2019.

[9] Simone Faro. A very fast string matching algorithm based on condensed alphabets. In Riccardo Dondi, Guillaume Fertin, and Giancarlo Mauri, editors, Algorithmic Aspects in Information and Management - 11th International Conference, AAIM 2016, Bergamo, Italy, July 18-20, 2016, Proceedings, volume 9778 of Lecture Notes in Computer Science, pages 65–76. Springer, 2016.

[10] Simone Faro and Thierry Lecroq. A fast suffix automata based algorithm for exact online string matching. In Nelma Moreira and Rogério Reis, editors, Implementation and Application of Automata, pages 149–158. Springer Berlin Heidelberg, 2012.

[11] Simone Faro and Thierry Lecroq. A multiple sliding windows approach to speed up string matching algorithms. In Ralf Klasing, editor, Experimental Algorithms, pages 172–183. Springer Berlin Heidelberg, 2012.

[12] Simone Faro and Thierry Lecroq. The exact online string matching problem. ACM Computing Surveys, 45(2):1–42, 2013.

[13] Simone Faro, Thierry Lecroq, Stefano Borzì, Simone Di Mauro, and Alessandro Maggio. The string matching algorithms research tool. In Jan Holub and Jan Žďárek, editors, Proceedings of the Prague Stringology Conference 2016, pages 99–113, Czech Technical University in Prague, Czech Republic, 2016.

[14] Frantisek Franek, Christopher G. Jennings, and W. F. Smyth. A simple fast hybrid pattern-matching algorithm. Journal of Discrete Algorithms, 5(4):682–695, 2007.

[15] Saqib I. Hakak, Amirrudin Kamsin, Palaiahnakote Shivakumara, Gulshan A. Gilkar, Wazir Z. Khan, and Muhammad Imran. Exact string matching algorithms: Survey, issues, and future research directions. IEEE Access, 7:69614–69637, 2019.

[16] R. Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501–506, 1980.

[17] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.

[18] Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, and Ayumi Shinohara. An improvement of the Franek-Jennings-Smyth pattern matching algorithm. In Proceedings of the Prague Stringology Conference 2019, pages 56–68, 2019.

[19] Thierry Lecroq. Fast exact string matching algorithms. Information Processing Letters, 102(6):229–235, 2007.

[20] Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Martin Farach-Colton, editor, Combinatorial Pattern Matching, pages 14–33. Springer Berlin Heidelberg, 1998.

[21] Daniel M. Sunday. A very fast substring search algorithm.