Cadences in Grammar-Compressed Strings
Julian Pape-Lange ∗

Abstract
Cadences are structurally maximal arithmetic progressions of indices corresponding to equal characters in an underlying string. This paper provides a polynomial time detection algorithm for 3-cadences in grammar-compressed binary strings. This algorithm also translates to a linear time detection algorithm for 3-cadences in uncompressed binary strings. Furthermore, this paper proves that several variants of the cadence detection problem are NP-complete for grammar-compressed strings. As a consequence, the equidistant subsequence matching problem with patterns of length three is NP-complete for grammar-compressed ternary strings.
A sub-cadence in a string is an arithmetic progression of indices corresponding to equal characters. This concept in the context of finite sequences is quite old and dates back to van der Waerden. He showed in 1927 in [12] that for each $k$ and each alphabet size $|\Sigma|$, there is a natural number $m = m(k, |\Sigma|)$ such that each sequence of characters in $\Sigma$ with length greater than or equal to $m$ has a sub-cadence consisting of $k$ indices.

The term cadence in the context of strings was first used by Gardelle in [5] in 1964. In this paper, I use the notation of Amir et al. in [1] and say that a cadence is a sub-cadence which is structurally maximal in the sense that the extension of the arithmetic progression to the left or to the right would not result in a valid index of the string. For example, in the string $S = 10101$, the three indices $(1, 3, 5)$ form a cadence, since the indices $-1$ and $7$ are not inside the string. However, in the string $S = 01110$, the three indices $(2, 3, 4)$ do not form a cadence, since, for example, the index 1 is inside the string.

Cadences recently gained some traction. At the beginning of this year, Funakoshi and Pape-Lange proved in [4] that the number of 3-cadences can be counted in O ( n (log n ) ) time using fast Fourier transform if the underlying alphabet has constant size.

More recently, Funakoshi et al. presented the more general problem of equidistant subsequence matching in [3], which extends cadences to arbitrary arithmetic factors, and showed that techniques for cadence detection can be adapted to solve equidistant subsequence matching with similar time complexity.

Strings can be compressed by straight-line programs, which are context-free grammars whose languages contain exactly one string each. Since this grammar-based compression is able to compress some strings to logarithmic size, we are interested in which polynomial time problems on uncompressed strings can also be solved in polynomial time with respect to the compressed size of the string. For example, grammar-based compression allows for fast algorithms as the fully [...]

∗ Technische Universität Chemnitz, Straße der Nationen 62, 09111 Chemnitz, Germany.

[...] $L$-$R$-cadence, which starts and ends in given intervals, and the even/odd 3-sub-cadence, which starts at even/odd indices. I will also prove that for grammar-compressed strings, the cadence detection problem becomes NP-complete for longer cadences or 3-cadences over a ternary alphabet.
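The difference between sub-cadences and cadences can be checked mechanically on small strings. The following Python sketch is only an illustration under stated assumptions: the helper names `is_sub_cadence` and `is_cadence` are hypothetical and do not come from the paper, the string is uncompressed, and indices are 1-based as in the text.

```python
def is_sub_cadence(s: str, i: int, d: int, k: int) -> bool:
    """Check whether i, i+d, ..., i+(k-1)d are valid 1-based indices of s
    that all carry the same character."""
    n = len(s)
    if d <= 0 or i < 1 or i + (k - 1) * d > n:
        return False
    return all(s[i + j * d - 1] == s[i - 1] for j in range(k))


def is_cadence(s: str, i: int, d: int, k: int) -> bool:
    """A cadence is a sub-cadence that cannot be extended inside the string:
    the index i - d is not positive and the index i + k*d exceeds n."""
    return is_sub_cadence(s, i, d, k) and i - d <= 0 and i + k * d > len(s)


print(is_cadence("10101", 1, 2, 3))      # indices 1, 3, 5
print(is_sub_cadence("01110", 2, 1, 3))  # indices 2, 3, 4
print(is_cadence("01110", 2, 1, 3))      # index 1 is still inside the string
```

The first two calls report `True` and the last one reports `False`, matching the two examples with the strings 10101 and 01110.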
A string $S$ of length $n$ is the concatenation $S = S[1]S[2]S[3] \dots S[n]$ of characters from an alphabet $\Sigma$. Strings naturally split into runs of equal characters. For example, the string 00010101100 splits into $000 \cdot 1 \cdot 0 \cdot 1 \cdot 0 \cdot 11 \cdot 00$. In this paper, these runs of equal characters are just called runs for the sake of simplicity.

For the sub-cadences and cadences, this paper uses the definitions of Amir et al. in [1]. These definitions are slightly different from the definition by Gardelle in [5] and by Lothaire in [9]. Funakoshi and Pape-Lange present a comparison of these definitions in [4].
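The decomposition of a string into runs of equal characters can be sketched in a few lines of Python. This is a minimal illustration on an uncompressed string; the function name `runs` is an assumption made for this example only.

```python
from itertools import groupby


def runs(s: str) -> list:
    """Split s into its maximal blocks of equal consecutive characters."""
    return ["".join(group) for _, group in groupby(s)]


# Decompose the example string from the text.
print(" · ".join(runs("00010101100")))
```

For the example string 00010101100, the function returns the seven runs 000, 1, 0, 1, 0, 11 and 00.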
Definition 1. A $k$-sub-cadence is an arithmetic progression of indices given by the triple $(i, d, k)$ of integers such that $d > 0$ and $S[i] = S[i+d] = \dots = S[i+(k-1)d]$ hold.

As a special case, cadences additionally have to be structurally maximal in the sense that every extension of the underlying arithmetic progression is not contained in the integer interval $\{1, 2, 3, \dots, n\}$ anymore. More formally:

Definition 2. A $k$-cadence is a $k$-sub-cadence $(i, d, k)$ such that the inequalities $i - d \le 0$ and $n < i + kd$ hold.

In this paper, we will also consider a new special case of the sub-cadence, in which the first element and the last element of the sub-cadence are contained in given intervals:

Definition 3. For two disjoint intervals $L$ and $R$, an $L$-$R$-$k$-cadence is a $k$-sub-cadence which starts in the interval $L$ and ends in the interval $R$, i.e. $i \in L$ and $i + (k-1)d \in R$ hold.

For the compressed problems, we consider straight-line grammars in Chomsky normal form, i.e. a string is given by a grammar $G = (V, \Sigma, S, \mathrm{rhs})$ such that the set $V = \{v_1, v_2, \dots, v_{|V|}\}$ of nonterminals is ordered so that for every nonterminal $v_i \in V$ the right-hand side is either a character from the alphabet or of the form $v_j v_k$ with $j, k < i$. We also assume that the start symbol $S$ is given by $v_{|V|}$.

Since each string which has at least the size of van der Waerden's bound $m = m(k, |\Sigma|)$ has a $k$-sub-cadence, we can detect a $k$-sub-cadence by restricting the string to the first $m$ characters. Since $m$ depends only on $k$ and $\Sigma$ but not on the length of the original string, the detection algorithm for $k$-sub-cadences only uses constant time on uncompressed strings.

In this section, I will prove the following theorem:
Theorem 1. The decision problem of $k$-cadence detection on grammar-compressed strings is NP-complete if at least one of the following conditions holds:
• $k \ge 3$ and $|\Sigma| \ge 2$ and we only consider $k$-cadences with a given character,
• $k \ge 4$ and $|\Sigma| \ge 2$, or
• $k \ge 3$ and $|\Sigma| \ge 3$.

Since we can test in polynomial time whether a given candidate $(i, d, k)$ indeed forms a $k$-cadence, all three problems mentioned above belong to NP.

To show the NP-hardness, I will reduce the following problem, which Lohrey proves in Theorem 3.13 of [8] to be NP-complete, to the problems above:

input: Two strings $P$ and $P'$ over the alphabet $\{0, 1\}$ given by grammar-compression.
output: Is there an index $l$ with $P[l] = P'[l] = 1$?

Let $P$ and $P'$ be strings over the alphabet $\{0, 1\}$ given by grammar-compression. Without loss of generality, $|P'| \le |P|$ holds. Since it is more convenient if both strings have the same length, we pad the shorter string $P'$ with zeros. Also, for the cadences, it will be helpful if one of the strings is reversed. We therefore define $P'' = (P' \, 0^{|P| - |P'|})^{\mathrm{rev}}$.

In this setting, for every index $l$, the equation $P[l] = P'[l] = 1$ holds if and only if the equation $P[l] = P''[|P| + 1 - l] = 1$ holds as well.

Consider the string

$$S = \left(0^{(k-1)|P|} \cdot P \cdot 0 \cdot 0^{k|P|}\right)\left(0^{k|P|} \cdot 1 \cdot 0^{k|P|}\right)\left(0^{k|P|} \cdot 0 \cdot P'' \cdot 0^{(k-1)|P|}\right)\left(1^{2k|P|+1}\right)^{k-3}.$$

The grammar of $S$ can be built from the grammars of $P$ and $P'$ and $O(\log(k|P|))$ additional nonterminals. Since the grammar-compression of a string $P$ needs at least $\Omega(\log |P|)$ nonterminals, the compressed size of $S$ is, for fixed $k$, polynomial in the compressed size of the inputs.

If there is an index $l$ with $P[l] = P''[|P| + 1 - l] = 1$, we can construct a corresponding $k$-cadence in $S$ with character 1:

The indices $(k-1)|P| + l$ and $2(2k|P| + 1) + k|P| + 1 + (|P| + 1 - l)$ corresponding to $P[l]$ and $P''[|P| + 1 - l]$ in $S$, as well as the 1 in the second bracket at index $(2k|P| + 1) + k|P| + 1$, form an arithmetic progression starting at $i = (k-1)|P| + l$ with distance $d = 2k|P| + 1 + (|P| + 1 - l)$.

The index $l$ is bounded by $1 \le l \le |P|$, and each bracket has length $2k|P| + 1$. Therefore, $n = k(2k|P| + 1)$ holds and the inequalities

$$i + kd = ((k-1)|P| + l) + k(2k|P| + 1 + (|P| + 1 - l)) > k(2k|P| + 1) = n$$

and

$$i + (k-1)d = ((k-1)|P| + l) + (k-1)(2k|P| + 1 + (|P| + 1 - l)) \le (k|P|) + (k-1)(2k|P| + 1) + (k-1)|P| < k(2k|P| + 1) = n$$

as well as $i - d \le 0$ and $i > 0$ hold. Hence, for each $0 \le j < k$, the index $i + jd$ lies in the $(j+1)$-th bracket. Therefore, $S[i + jd] = 1$ holds for each $0 \le j < k$. This implies that $(i, d, k)$ is a $k$-cadence with character 1.

If, on the other hand, the triple $(i, d, k)$ defines a $k$-cadence with character 1 in $S$, we can find a corresponding index $l$ with $P[l] = P''[|P| + 1 - l] = 1$:

The inequalities $i - d \le 0 < i$ and $i + (k-1)d \le n < i + kd$ of the cadence imply, for each $0 \le j < k$,

$$\frac{j}{k} n < \frac{k-j}{k} i + \frac{j}{k}(i + kd) = i + jd = \frac{k-j-1}{k}(i - d) + \frac{j+1}{k}(i + (k-1)d) \le \frac{j+1}{k} n.$$

Since the brackets in the definition of $S$ divide the string into $k$ substrings of equal length, the $(j+1)$-th element of any cadence lies in the $(j+1)$-th of the $k$ brackets. Therefore, each cadence with character 1 contains the single 1 at index $(2k|P| + 1) + (k|P|) + 1$ in the second bracket. Furthermore, the first element of the arithmetic progression has to be a 1 in $P$ in the first bracket and the third element of the arithmetic progression has to be a 1 in $P''$ in the third bracket.

By construction, the two indices of these characters have the same distance to the index $(2k|P| + 1) + (k|P|) + 1$, and the two strings $P$ and $P''$ have the same distance to the index $(2k|P| + 1) + (k|P|) + 1$ as well. Therefore, the first element of the $k$-cadence and the third element of the $k$-cadence define an index $l$ with $P[l] = P''[|P| + 1 - l] = 1$.

Therefore, the string $S$ has a $k$-cadence with character 1 if and only if there is an index $l$ such that $P[l] = P'[l] = 1$ holds.

If $k > 3$ holds, there is a bracket in $S$ containing only the character 1. In this case, this bracket forces every $k$-cadence to be a $k$-cadence with character 1.
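The equivalence shown so far can be sanity-checked by brute force on tiny uncompressed inputs. The sketch below is only an illustration and not part of the proof: it writes the string $S$ out explicitly instead of using a grammar, and the function names `build_reduction_string`, `has_k_cadence_with_char` and `shares_a_one` are hypothetical helpers introduced for this example.

```python
from itertools import product


def build_reduction_string(p: str, p_prime: str, k: int) -> str:
    """Uncompressed version of the string S from the reduction (k >= 3)."""
    m = len(p)
    assert k >= 3 and len(p_prime) <= m
    # P'' is P' padded with zeros to length |P| and then reversed.
    p_rev = (p_prime + "0" * (m - len(p_prime)))[::-1]
    first = "0" * ((k - 1) * m) + p + "0" + "0" * (k * m)
    second = "0" * (k * m) + "1" + "0" * (k * m)
    third = "0" * (k * m) + "0" + p_rev + "0" * ((k - 1) * m)
    rest = ("1" * (2 * k * m + 1)) * (k - 3)  # k - 3 brackets of ones
    return first + second + third + rest


def has_k_cadence_with_char(s: str, k: int, c: str) -> bool:
    """Brute-force search for a k-cadence (i, d, k) whose characters equal c."""
    n = len(s)
    for i in range(1, n + 1):
        if s[i - 1] != c:
            continue
        d = i  # the condition i - d <= 0 forces d >= i
        while i + (k - 1) * d <= n:
            if i + k * d > n and all(s[i + j * d - 1] == c for j in range(k)):
                return True
            d += 1
    return False


def shares_a_one(p: str, p_prime: str) -> bool:
    """Is there an index l with P[l] = P'[l] = 1?"""
    return any(a == b == "1" for a, b in zip(p, p_prime))


for p, p_prime in product(["101", "010", "110"], ["011", "10", "0"]):
    s = build_reduction_string(p, p_prime, 3)
    print(p, p_prime, shares_a_one(p, p_prime), has_k_cadence_with_char(s, 3, "1"))
```

On each printed line, the two Boolean values are expected to agree, which mirrors the statement that $S$ has a $k$-cadence with character 1 exactly when $P$ and $P'$ share a 1 at some index.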
Therefore, for $k > 3$, the requirement that the underlying character has to be 1 can be dropped.

For 3-cadences on a ternary alphabet we consider the string

$$S = \left(0^{(k-1)|P|} \cdot P \cdot 0 \cdot 0^{k|P|}\right)\left(2^{k|P|} \cdot 1 \cdot 2^{k|P|}\right)\left(0^{k|P|} \cdot 0 \cdot P'' \cdot 0^{(k-1)|P|}\right).$$

Since the first and the last bracket do not contain the character 2, there are no $k$-cadences with character 2. Since the second bracket does not contain the character 0, there are no $k$-cadences with character 0 either. Therefore, all $k$-cadences use the character 1, and there is a 3-cadence in $S$ if and only if there is an index $l$ with $P[l] = P'[l] = 1$.

This concludes the proof of Theorem 1.

In this section, I will show that the problems discussed in the last section are also NP-complete for $L$-$R$-cadences instead of cadences. Even if $k = 3$ and $|\Sigma| = 2$ hold, the compressed detection problem of $L$-$R$-$k$-cadences is NP-complete. However, in this special case, there is a polynomial time detection algorithm if the length of $L$ is similar to the length of $R$. The underlying idea for this algorithm also leads to a linear time algorithm for the detection of $L$-$R$-3-cadences in uncompressed binary strings.

In uncompressed strings, the first proposed detection algorithm for 3-cadences by Amir et al. in [1] was actually a detection algorithm for $L$-$R$-3-cadences with $L = \{1, 2, \dots, \lfloor n/3 \rfloor\}$ and $R = \{\lfloor 2n/3 \rfloor + 1, \lfloor 2n/3 \rfloor + 2, \dots, n\}$. Furthermore, the algorithm of Funakoshi and Pape-Lange in [4] counts the number of 3-cadences in O ( n (log n ) ) time by counting $L$-$R$-3-cadences in O (( | L | + | R | )(log( | L | + | R | ))) time. It therefore seems reasonable to understand the $L$-$R$-cadences to be a simplification of cadences.

However, for all cadence problems discussed in the last section, the corresponding $L$-$R$-cadence problem is NP-complete too:
Lemma 1. The decision problem of $L$-$R$-$k$-cadence detection is NP-complete on grammar-compressed strings if at least one of the following conditions holds:
• $k \ge 3$ and $|\Sigma| \ge 2$ and we only consider $L$-$R$-$k$-cadences with a given character,
• $k \ge 4$ and $|\Sigma| \ge 2$, or
• $k \ge 3$ and $|\Sigma| \ge 3$.

The proofs are essentially equal to the corresponding proofs in the last section, since for $L = \{1, 2, \dots, \frac{1}{k} n\}$ and $R = \{\frac{k-1}{k} n + 1, \frac{k-1}{k} n + 2, \dots, n\}$, all $k$-cadences in the discussed string $S$ are $L$-$R$-$k$-cadences and vice versa.

Next, I will show that in the case $k = 3$ and $|\Sigma| = 2$, even if we do not require a given character, the decision problem of $L$-$R$-$k$-cadence detection is NP-complete on grammar-compressed strings:

Since we can test for every triple $(i, d, k)$ whether this triple forms an $L$-$R$-$k$-cadence, this problem belongs to NP.

To show the NP-hardness, we will, like in the last section, reduce the following NP-complete problem to the decision problem of $L$-$R$-3-cadence detection in grammar-compressed strings:

input: Two strings $P$ and $P'$ over the alphabet $\{0, 1\}$ given by grammar-compression.
output: Is there an index $l$ with $P[l] = P'[l] = 1$?

Let $P$ and $P'$ be strings over the alphabet $\{0, 1\}$ given by grammar-compression. Without loss of generality, $|P'| \le |P|$ holds. Since it is more convenient if both strings have the same length, we pad the shorter string $P'$ with zeros. Also, for the $L$-$R$-$k$-cadences, it will be helpful if in one of the strings each character is duplicated. For example, for $P' = 011$, we define $P'' = 001111$. This can be done by introducing two additional nonterminals.

Define $S = 1(0^{|P|})(P)(P'')$, $L = \{1\}$ and $R = \{1 + 2|P| + 1, 1 + 2|P| + 2, \dots, 1 + 2|P| + 2|P|\}$.

In this setting, $S[L] = 1$ and $S[R] = P''$ hold. Furthermore, for each index $1 \le l \le |P|$, the equations $P[l] = S[1 + (|P| + l)]$ and $P'[l] = P''[2l] = S[1 + 2|P| + 2l] = S[1 + 2(|P| + l)]$ hold.

Therefore, for each index $l$, the equation $P[l] = 1 = P'[l]$ holds if and only if the equation $S[1] = S[1 + (|P| + l)] = S[1 + 2(|P| + l)]$ holds. This equation, however, defines an $L$-$R$-3-cadence.

This proves that $S$ has an $L$-$R$-3-cadence if and only if there is an index $l$ such that $P[l] = P'[l] = 1$ holds.

Together with the previous lemma, this implies:

Theorem 2. For $k \ge 3$ and $|\Sigma| \ge 2$, the decision problem of $L$-$R$-$k$-cadence detection is NP-complete on grammar-compressed strings.
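The binary reduction with $L = \{1\}$ can likewise be checked by brute force on small uncompressed inputs. The following Python sketch is only illustrative: the names `build_lr_instance` and `has_lr_3_cadence` are assumptions made for this example, the strings are written out explicitly rather than grammar-compressed, and the quadratic search is a naive baseline, not an algorithm from the paper.

```python
def build_lr_instance(p: str, p_prime: str):
    """Return (S, L, R) with S = 1 0^{|P|} P P'', where P'' is the padded P'
    with every character duplicated. L and R hold 1-based indices."""
    m = len(p)
    assert len(p_prime) <= m
    padded = p_prime + "0" * (m - len(p_prime))
    doubled = "".join(c * 2 for c in padded)
    s = "1" + "0" * m + p + doubled
    left = range(1, 2)                   # L = {1}
    right = range(2 * m + 2, 4 * m + 2)  # R = {1+2|P|+1, ..., 1+2|P|+2|P|}
    return s, left, right


def has_lr_3_cadence(s: str, left, right) -> bool:
    """Naive check for a 3-sub-cadence starting in L and ending in R."""
    for i in left:
        for e in right:
            if e > i and (e - i) % 2 == 0:
                mid = (i + e) // 2
                if s[i - 1] == s[mid - 1] == s[e - 1]:
                    return True
    return False


s, left, right = build_lr_instance("101", "011")
print(s, has_lr_3_cadence(s, left, right))
```

For $P = 101$ and $P' = 011$, both strings have a 1 at index 3, and the constructed string 1000101001111 has the $L$-$R$-3-cadence with indices 1, 7 and 13.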
Since the equidistant subsequence matching problem is closely related to sub-cadences, we can similarly show that equidistant subsequence matching with patterns of length 3 on ternary strings is NP-complete on grammar-compressed strings.

Consider the pattern $P = 212$ and a string $S$ with $S[L], S[R] \in \{0, 2\}^*$ in which all other characters are either 0 or 1. Define $S'$ by

$$S'[i] = \begin{cases} 0 & \text{if } S[i] = 0, \\ 1 & \text{if } S[i] \ne 0, \end{cases}$$

i.e. the string in which all "2"s in $S$ are replaced by a "1". In this setting, the equidistant occurrences of $P$ in $S$ are exactly the $L$-$R$-3-cadences with character 1 in $S'$.

All reductions above used that we could force all cadences to use a fixed character of the string. However, surprisingly, if $L$ and $R$ have similar length, we can detect in polynomial time whether a compressed binary string has an $L$-$R$-3-cadence. Furthermore, with the same idea we can detect in linear time whether an uncompressed binary string has an $L$-$R$-3-cadence.

The remainder of this section proves the following theorem:

Theorem 3.
The decision problem of $L$-$R$-3-cadence detection in binary grammar-compressed strings can be solved in polynomial time with respect to the compressed size of the string and the additional variable $\max\left(\frac{|L|}{|R|}, \frac{|R|}{|L|}\right)$.
The decision problem of $L$-$R$-3-cadence detection in binary uncompressed strings can be solved in linear time with respect to $|L| + |R|$.

Since the first index and the third index of each 3-sub-cadence have the same parity, it is useful to divide the $L$-$R$-3-cadences according to this parity:

Definition 4.
For two disjoint intervals $L$ and $R$, an even $L$-$R$-3-cadence is a 3-sub-cadence which starts at an even index in $L$ and ends at an even index in $R$.
Similarly, an odd $L$-$R$-3-cadence is a 3-sub-cadence which starts at an odd index in $L$ and ends at an odd index in $R$.
For each set $M$, we define $M_{even} := M \cap 2\mathbb{Z}$ and $M_{odd} := M \cap (2\mathbb{Z} + 1)$, and for each $M = \{a_1, a_2, \dots, a_l\} \subset \mathbb{Z}$ with $1 \le a_1 < a_2 < a_3 < \dots < a_l \le n$, we define the string $S[M] = S[a_1]S[a_2] \dots S[a_l]$ as the subsequence of characters with indices given by $M$.

The key insight for the detection algorithm for $L$-$R$-3-cadences is that if the string does not contain $L$-$R$-3-cadences, either $S[L_{even}]$ or $S[R_{even}]$ is very structured. The following lemma implies that if $S[L_{even}]$ has the substring 01 and $S[R_{even}]$ has the substring 10 or vice versa, then $S$ has an $L$-$R$-3-cadence:

Lemma 2.
Let $S$ be a binary string and $L$ and $R$ be two intervals.
If there are indices $i$ and $j$ with
• $S[i] = S[j] \ne S[i+2] = S[j-2]$,
• $i, i+2 \in L$,
• $j, j-2 \in R$ and
• $i \equiv j \pmod 2$,
then $S$ has an $L$-$R$-3-cadence.

Proof. Since $i \equiv j \pmod 2$ holds, the number $\frac{i+j}{2}$ is an integer. Furthermore, since $S$ is binary and $S[i] = S[j] \ne S[i+2] = S[j-2]$ holds, we either have $S[i] = S[\frac{i+j}{2}] = S[j]$ or $S[i+2] = S[\frac{i+j}{2}] = S[j-2]$. In both cases, $S$ has an $L$-$R$-3-cadence.

This implies that if $S$ does not contain $L$-$R$-3-cadences, then there are only few possibilities for the subsequences $S[L_{even}]$ and $S[R_{even}]$:

Corollary 1.
Let $S$ be a binary string and $L$ and $R$ be two intervals such that $S$ has no $L$-$R$-3-cadences.
Then,
• if $S[L_{even}]$ is of the form $0^i 1^{i'}$ with $i, i' > 0$, then $S[R_{even}]$ is of the form $0^j 1^{j'}$ where $j$ and $j'$ may be equal to 0,
• if $S[L_{even}]$ is of the form $1^i 0^{i'}$ with $i, i' > 0$, then $S[R_{even}]$ is of the form $1^j 0^{j'}$ where $j$ and $j'$ may be equal to 0, and
• if $S[L_{even}]$ contains the substrings 01 and 10, then $S[R_{even}]$ is of the form $0^j$ or $1^j$.

[Figure 1: A string with 48 characters. For a short interval $L$ and an interval $R$, for each index of $R_{even}$, there is only one candidate $(i, d, k)$ for forming an $L$-$R$-3-cadence.]

We can check in linear time in uncompressed strings and in polynomial time in grammar-compressed strings whether $S[L_{even}]$ and $S[R_{even}]$ are of the form $0^i 1^j$ or $1^i 0^j$. If both $S[L_{even}]$ and $S[R_{even}]$ are of the form $0^i 1^j$, we can divide $L_{even}$ and $R_{even}$ into $L'_{even}$, $L''_{even}$, $R'_{even}$ and $R''_{even}$ such that $S[L'_{even}] = 0^i$, $S[L''_{even}] = 1^{i'}$, $S[R'_{even}] = 0^j$ and $S[R''_{even}] = 1^{j'}$.

Since there are, by construction, no even $L'$-$R''$-3-cadences and no even $L''$-$R'$-3-cadences, we only have to detect $L'$-$R'$-3-cadences and $L''$-$R''$-3-cadences. This can be done in linear time in uncompressed strings and in polynomial time in grammar-compressed strings using the following lemma, which holds by definition of the even $L$-$R$-3-cadence:

Lemma 3.
Let $S$ be a binary string and $L$ and $R$ be two intervals such that $S[L_{even}] = 0^i$ and $S[R_{even}] = 0^j$ hold for some integers $i$, $j$. Let further $l_{min} = \min(L_{even})$, $l_{max} = \max(L_{even})$, $r_{min} = \min(R_{even})$ and $r_{max} = \max(R_{even})$.
Then, $S$ has an even $L$-$R$-3-cadence if and only if

$$S\left[\left\{\tfrac{l_{min} + r_{min}}{2}, \tfrac{l_{min} + r_{min}}{2} + 1, \dots, \tfrac{l_{max} + r_{max}}{2}\right\}\right] \ne 1^{\left(\frac{l_{max} + r_{max}}{2} - \frac{l_{min} + r_{min}}{2} + 1\right)}$$

holds.

The more difficult case is that one of the two subsequences, without loss of generality $S[R_{even}]$, is more complex, i.e. it consists of multiple runs of 0s and 1s and thereby contains both substrings 01 and 10. In this case, in order to avoid $L$-$R$-3-cadences, the other subsequence $S[L_{even}]$ is a power of a single character, without loss of generality 0. This can be checked in linear time in uncompressed strings and in polynomial time in grammar-compressed strings.

It should be no surprise that this case is more difficult, since this case occurred in the proof of the NP-completeness of the compressed $L$-$R$-3-cadence detection problem. Figure 1 shows that if $L$ is a short interval, we have to check linearly many pairs with respect to $n$ in order to find an $L$-$R$-3-cadence. In order to develop a polynomial time algorithm for grammar-compressed strings, we have to use that $L$ is roughly as long as $R$.

By definition of the $L$-$R$-3-cadence we get the following lemma:

Lemma 4. Let $S$ be a binary string and $L$ and $R$ be two intervals. Let further $S[L_{even}]$ be of the form $0^i$. Define $l_{min} = \min(L_{even})$, $l_{max} = \max(L_{even})$, $r_{min} = \min(R_{even})$ and $r_{max} = \max(R_{even})$.
Then, for any $r \in R_{even}$ with $S[r] = 0$ there is an even $L$-$R$-3-cadence which uses this 0 as last element if and only if

$$S\left[\left\{\tfrac{l_{min} + r}{2}, \tfrac{l_{min} + r}{2} + 1, \dots, \tfrac{l_{max} + r}{2}\right\}\right] \ne 1^{\left(\frac{l_{max} + r}{2} - \frac{l_{min} + r}{2} + 1\right)}$$

holds.
Conversely, for any $m \in \left\{\tfrac{l_{min} + r_{min}}{2}, \tfrac{l_{min} + r_{min}}{2} + 1, \dots, \tfrac{l_{max} + r_{max}}{2}\right\}$ with $S[m] = 0$ there is an even $L$-$R$-3-cadence which uses this 0 as middle element if and only if

$$S[\{\max(2m - l_{max}, r_{min}), \max(2m - l_{max}, r_{min}) + 2, \dots, \min(2m - l_{min}, r_{max})\}]$$

is not of the form $1^j$.

With Corollary 1, Lemma 3 and Lemma 4, it is possible to efficiently either find an $L$-$R$-3-cadence or to shorten the complex interval $R$ without removing any $L$-$R$-3-cadences.

Corollary 2.
Let $S$ be a binary string and $L$ and $R$ be two intervals. Let further $S[L_{even}]$ be of the form $0^i$. Define $l_{min} = \min(L_{even})$, $l_{max} = \max(L_{even})$, $r_{min} = \min(R_{even})$ and $r_{max} = \max(R_{even})$.
If $S[R_{even}]$ is of the form $1^j$, there is no even $L$-$R$-3-cadence.
Otherwise, define $r = \min(r \in R_{even} \mid S[r] = 0)$. If $S\left[\left\{\tfrac{l_{min} + r}{2}, \tfrac{l_{min} + r}{2} + 1, \dots, \tfrac{l_{max} + r}{2}\right\}\right]$ contains a 0, the corresponding index forms an $L$-$R$-3-cadence with an index of $L_{even}$ and $r$.
Otherwise, there is no $L$-$R$-3-cadence using $r$ as third index, and furthermore, if the substring $S\left[\left\{\tfrac{l_{max} + r}{2} + 1, \tfrac{l_{max} + r}{2} + 2, \dots, \tfrac{l_{max} + r_{max}}{2}\right\}\right]$ of $S$ is of the form $1^j$, then there is no even $L$-$R$-3-cadence.
Otherwise, define $m = \min\left(m \in \left\{\tfrac{l_{max} + r}{2} + 1, \tfrac{l_{max} + r}{2} + 2, \dots, \tfrac{l_{max} + r_{max}}{2}\right\} \mid S[m] = 0\right)$. If $S[\{2m - l_{max}, 2m - l_{max} + 2, \dots, \min(2m - l_{min}, r_{max})\}]$ contains a 0, the corresponding index forms an $L$-$R$-3-cadence with an index of $L_{even}$ and $m$.
Otherwise, define $R' = R \cap \mathbb{Z}_{> 2m - l_{min}}$. There is an even $L$-$R$-3-cadence if and only if there is an even $L$-$R'$-3-cadence.

An application of this corollary can be seen in Figure 2.

By construction, the set $R'$ contains only elements greater than $2m - l_{min}$. Therefore, either $2m - l_{min} \ge r_{max}$ holds and $R'$ is the empty set, or $2m - l_{min} < r_{max}$ holds and $R'_{even}$ contains at least

$$\left\lfloor \frac{2m - l_{min} - r_{min}}{2} \right\rfloor \ge \left\lfloor \frac{l_{max} + r + 2 - l_{min} - r_{min}}{2} \right\rfloor > \frac{l_{max} - l_{min}}{2}$$

elements fewer than $R_{even}$. Therefore, the algorithm described in Corollary 2 has to be used at most $O\left(\frac{|R|}{|L|}\right)$ times.

On the other hand, the algorithm needs $O\left(|L| + (r - r_{min}) + \left(m - \tfrac{l_{min} + r_{min}}{2}\right)\right)$ time for uncompressed strings and polynomial time for grammar-compressed strings.

Also, Corollary 2 removes at least $r - r_{min}$ and at least $m - \tfrac{l_{min} + r_{min}}{2}$ from the set $R_{even}$. Therefore, even if $L$ is small and either $r - r_{min}$ or $m - \tfrac{l_{min} + r_{min}}{2}$ is large, the detection of even $L$-$R$-3-cadences can be done in $O(|L| + |R|)$ time in uncompressed strings.

By symmetry, the detection of odd $L$-$R$-3-cadences can be done in the same way as the detection of even $L$-$R$-3-cadences.

This concludes the proof of Theorem 3.

[Figure 2: A string with 48 characters after one application of Corollary 2 for given intervals $L$ and $R$. First, the index $r = 34$ is found. The minimal and maximal candidates for 3-cadences with $r$ are marked in red. Then, the index $m = 23$ is found. The minimal and maximal candidates for 3-cadences with $m$ are marked in yellow. Afterwards, the gray characters are guaranteed not to form a 3-cadence with characters from the first run of the string.]

In this section, I will show that the results of Theorem 3 also hold for the corresponding 3-cadence problems:

Theorem 4. The decision problem of 3-cadence detection in binary grammar-compressed strings can be solved in polynomial time.
The decision problem of 3-cadence detection in binary uncompressed strings can be solved in linear time.

The main idea of the algorithm of Funakoshi and Pape-Lange in [4] for counting 3-cadences in uncompressed strings was counting $L$-$R$-3-cadences for many pairs of $L$ and $R$. Therefore, we can use the detection algorithm for $L$-$R$-3-cadences given by Corollary 2 in order to detect a 3-cadence in uncompressed binary strings in O ( n log n ) time.

However, since this algorithm uses Θ( n ) pairs of $L$ and $R$, this approach does not translate into a polynomial time detection algorithm for 3-cadences in compressed binary strings. Therefore, instead of dissecting the problem of 3-cadence detection into many problems of $L$-$R$-3-cadence detection, we have to apply the ideas from the last section directly to the problem of 3-cadence detection.

Similarly to the $L$-$R$-3-cadences, there are even 3-cadences and odd 3-cadences. Without loss of generality, this paper only considers the even 3-cadences and defines:

Definition 5.
An even 3-cadence is a 3-cadence which starts with an even index.
We define the string $S_{even}$ by $S_{even} = S\left[\left\{2, 4, 6, \dots, 2\left\lfloor \tfrac{|S|}{2} \right\rfloor\right\}\right]$ to be the restriction of $S$ to the characters with even indices.

Let $i$, $d$ be two integers such that $i - d \le 0$ and $i + 3d > n$ hold. Let $L = \{1, 2, \dots, i\}$ and $R = \{i + 2d, i + 2d + 1, \dots, n\}$ be two intervals. Then each $L$-$R$-3-cadence is also a 3-cadence.

On the other hand, each 3-cadence defines integers $i$ and $d$ such that $i - d \le 0$ and $i + 3d > n$ hold. Therefore, we can use Lemma 2 to obtain that if $S$ has a 3-cadence, it also has a 3-cadence that either starts in one of the first two runs of $S_{even}$ or ends in one of the last two runs of $S_{even}$.

The main challenge for the adaptation of the detection algorithm for $L$-$R$-3-cadences to a detection algorithm for 3-cadences is that Lemma 4 does not quite work.

See, for example, the string $S = 000100011$. Since a 3-cadence can start anywhere in the first third of the string and can end anywhere in the last third of the string, we have $L = \{1, 2, 3\}$ and $R = \{7, 8, 9\}$.

In terms of $L$-$R$-3-cadences, if we ignore the actual characters of $S$, the 0 at index 7 can form an $L$-$R$-3-cadence with the index 1 as well as with the index 3. Of these two possibilities, only the arithmetic progression 3, 5, 7 forms an $L$-$R$-3-cadence. However, this arithmetic progression is not structurally maximal and hence not a 3-cadence.

On the other hand, in the string $S' = 000100110$, the 0 at index 9 can form an $L$-$R$-3-cadence as well as a 3-cadence with the index 1 as well as with the index 3. Therefore, the arithmetic progression 3, 6, 9 forms an $L$-$R$-3-cadence as well as a 3-cadence.

We therefore have to restrict the strings in Lemma 4 to those indices such that the corresponding 3-sub-cadences are structurally maximal. We assume without loss of generality that $S[2] = 0$ holds.

Lemma 5.
Let $S$ be a binary string. Define the two intervals $L = \left\{1, 2, \dots, \left\lfloor \tfrac{n}{3} \right\rfloor\right\}$ and $R = \left\{\left\lfloor \tfrac{2n}{3} \right\rfloor + 1, \left\lfloor \tfrac{2n}{3} \right\rfloor + 2, \dots, n\right\}$.
Let further $S[L_{even}]$ be of the form $0^i 1 S'$ and let $l_{min} = \min(L_{even}) = 2$ be the first index of $L_{even}$ and $l_{max} = 2 + 2(i - 1) = 2i$ be the index corresponding to the last 0 in the first run of $S[L_{even}]$.
Then, for any $r \in R_{even}$ with $S[r] = 0$ define $l'_{max} = \min\left(l_{max}, 2\left\lfloor \tfrac{r}{6} \right\rfloor, 2\left(\left\lceil \tfrac{3r - 2n}{2} \right\rceil - 1\right)\right)$.
There is an even 3-cadence which uses this 0 as last element and any of the first $i$ 0s of $S[L_{even}]$ as first element if and only if

$$S\left[\left\{\tfrac{l_{min} + r}{2}, \tfrac{l_{min} + r}{2} + 1, \dots, \tfrac{l'_{max} + r}{2}\right\}\right] \ne 1^{\left(\frac{l'_{max} + r}{2} - \frac{l_{min} + r}{2} + 1\right)}$$

holds.
Conversely, for any $m \in \left\{\tfrac{l_{min} + r_{min}}{2}, \tfrac{l_{min} + r_{min}}{2} + 1, \dots, \tfrac{l_{max} + r_{max}}{2}\right\}$ with $S[m] = 0$ define $l''_{min} = \max\left(l_{min}, 2\left(m - \left\lfloor \tfrac{n}{2} \right\rfloor\right)\right)$ and $l''_{max} = \min\left(l_{max}, 2\left\lfloor \tfrac{m}{4} \right\rfloor, 2\left(\left\lceil \tfrac{3m - n}{4} \right\rceil - 1\right)\right)$. There is an even 3-cadence which uses this 0 as middle element if and only if

$$S[\{2m - l''_{max}, 2m - l''_{max} + 2, \dots, 2m - l''_{min}\}] \ne 1^{\left(\frac{(2m - l''_{min}) - (2m - l''_{max})}{2} + 1\right)}$$

holds.

Proof. Like Lemma 4, this lemma basically holds by definition of the 3-cadence.

All indices less than or equal to $l'_{max}$ can form a 3-cadence with $r$, since $l := 2\left\lfloor \tfrac{r}{6} \right\rfloor$ is the largest even index fulfilling the inequality $l - \tfrac{r - l}{2} \le 0$ and $l := 2\left(\left\lceil \tfrac{3r - 2n}{2} \right\rceil - 1\right)$ is the largest even index fulfilling the inequality $l + 3\tfrac{r - l}{2} > n$.

Similarly, all indices between $l''_{min}$ and $l''_{max}$ can form a 3-cadence with $m$, since the index $l := 2\left(m - \left\lfloor \tfrac{n}{2} \right\rfloor\right)$ is the smallest even index fulfilling the inequality $l + 2(m - l) \le n$, the index $l := 2\left\lfloor \tfrac{m}{4} \right\rfloor$ is the largest even index fulfilling the inequality $l - (m - l) \le 0$ and the index $l := 2\left(\left\lceil \tfrac{3m - n}{4} \right\rceil - 1\right)$ is the largest even index fulfilling the inequality $l + 3(m - l) > n$.

Similarly to the case of the $L$-$R$-3-cadence, we can use this lemma to shrink the interval in which the last element of the arithmetic progression can be.

Corollary 3.
Let $S$ be a binary string. Define the two intervals $L = \left\{1, 2, \dots, \left\lfloor \tfrac{n}{3} \right\rfloor\right\}$ and $R = \{r_{min}, r_{min} + 1, \dots, n\}$ for an $r_{min} \ge \left\lfloor \tfrac{2n}{3} \right\rfloor + 1$. Let further $S[L_{even}]$ be of the form $0^i 1 S'$ and $L'_{even}$ be the set of indices of the first run in $S[L_{even}]$. Let $l_{min} = \min(L'_{even}) = 2$ be the first index of $L_{even}$ and $l_{max} = \max(L'_{even}) = 2 + 2(i - 1) = 2i$ be the index corresponding to the last 0 in the first run of $S[L_{even}]$.
If $S[R_{even}]$ is of the form $1^j$, there is no even 3-cadence using an element of $L'_{even}$ as first index.
Otherwise, define $r = \min(r \in R_{even} \mid S[r] = 0)$ and the corresponding maximal index for the first element of the 3-cadence $l'_{max} = \min\left(l_{max}, 2\left\lfloor \tfrac{r}{6} \right\rfloor, 2\left(\left\lceil \tfrac{3r - 2n}{2} \right\rceil - 1\right)\right)$.
If $S\left[\left\{\tfrac{l_{min} + r}{2}, \tfrac{l_{min} + r}{2} + 1, \dots, \tfrac{l'_{max} + r}{2}\right\}\right]$ contains a 0, the corresponding index forms a 3-cadence with an index of $L'_{even}$ and $r$.
Otherwise, there is no 3-cadence using $r$ as third index and a 0 from the first run of $S[L_{even}]$ as first index and hence, if the substring $S\left[\left\{\tfrac{l'_{max} + r}{2} + 1, \tfrac{l'_{max} + r}{2} + 2, \dots, \tfrac{l_{max} + r_{max}}{2}\right\}\right]$ of $S$ is of the form $1^j$, then there is no even 3-cadence using an element of $L'_{even}$ as first index.
Otherwise, define $m = \min\left(m \in \left\{\tfrac{l'_{max} + r}{2} + 1, \tfrac{l'_{max} + r}{2} + 2, \dots, \tfrac{l_{max} + r_{max}}{2}\right\} \mid S[m] = 0\right)$, the index $l''_{min} = \max\left(l_{min}, 2\left(m - \left\lfloor \tfrac{n}{2} \right\rfloor\right)\right)$ and $l''_{max} = \min\left(l_{max}, 2\left\lfloor \tfrac{m}{4} \right\rfloor, 2\left(\left\lceil \tfrac{3m - n}{4} \right\rceil - 1\right)\right)$.
If $S[\{2m - l''_{max}, 2m - l''_{max} + 2, \dots, 2m - l''_{min}\}]$ contains a 0, the corresponding index forms a 3-cadence with an index of $L'_{even}$ and $m$.
Otherwise, define $R' = R \cap \mathbb{Z}_{> 2m - l''_{min}}$. There is an even 3-cadence using an element of $L'_{even}$ as first index if and only if there is an even 3-cadence using an element of $L'_{even}$ as first index and an element of $R'_{even}$ as last index.

[Figure 3: A string with 48 characters after one application of Corollary 3. First, the index $r = 34$ is found. The minimal and maximal candidates for 3-cadences with $r$ are marked in red. Then, the index $m = 20$ is found. The minimal and maximal candidates for 3-cadences with $m$ are marked in yellow. Afterwards, the gray characters are guaranteed not to form a 3-cadence with characters from the first run of $S_{even}$.]

An application of this corollary can be seen in Figure 3.

In the uncompressed case, each element of the middle third and the last third has to be read at most once in order to decide whether there is a 3-cadence which starts in the first run of $S_{even}$. Furthermore, we can modify this algorithm to detect the existence of a 3-cadence which starts in the second run of $S_{even}$.
By symmetry, we can also decide in linear time whether there exists a 3-cadence which ends in one of the two last runs of $S_{\text{even}}$. Similarly, we can decide in linear time whether there is an odd 3-cadence.

This implies:

Theorem 5.
Let $S$ be a binary string. We can decide in linear time whether $S$ contains a 3-cadence. If there is a 3-cadence, this algorithm can also return such a cadence.

In the compressed case, if the first run of $S_{\text{even}}$ contains only a single index, the corresponding 3-cadences are exactly the $L$-$R$-3-cadences with $L = \{2\}$ and $R = \{\lfloor 2n/3 \rfloor + 1, \lfloor 2n/3 \rfloor + 2, \dots, n\}$. Since the detection of these $L$-$R$-3-cadences is NP-complete, if P $\neq$ NP holds, it is not possible to decide in polynomial time whether there is a 3-cadence which starts in the first run.

Luckily, it is not necessary to decide whether there is a 3-cadence which starts in the first run in order to decide whether there is a 3-cadence at all. Let $S[L_{\text{even}}]$ be of the form $0^i 1 S'$. Then the $1$ has index $2i+2$. Let $r_1$ be the smallest even index such that $(2i+2) - \frac{r_1 - (2i+2)}{2} \leq 0$ and $(2i+2) + 3\,\frac{r_1 - (2i+2)}{2} > n$ hold, i.e. the smallest index such that the arithmetic progression $2i+2, \frac{2i+2+r_1}{2}, r_1$ could form a 3-cadence if the three characters in the underlying string were equal.

In this setting, for $L = \{1, 2, \dots, 2i+2\}$ and $R = \{r_1, r_1+1, \dots, n\}$, each $L$-$R$-3-cadence is structurally maximal and therefore forms a 3-cadence. This implies that we can check in polynomial time whether $S[R_{\text{even}}]$ contains the substring $10$ and therefore whether such a 3-cadence exists. If such a 3-cadence exists, we are done.

Otherwise, $S[R_{\text{even}}]$ is of the form $0^j 1^{j'}$ with $j, j' \geq$ [...] $j'+1$ characters of $S[R_{\text{even}}]$.

It is left to show that even in the compressed case, Corollary 3 is fast enough to allow finding the 3-cadences which start in the first run of $S_{\text{even}}$ and end at an index smaller than $r_1$.

In order to do this, I will show that with each application of Corollary 3 the discarded part of $R$ doubles until $r_1$ is reached. In the worst case, the new $0$ at $r_0$ is directly at the beginning of $S[R_{\text{even}}]$. Since each 3-sub-cadence with distance greater than or equal to $\frac{n}{3}$ is a 3-cadence, we can assume that $r_0 < r_1 \leq 2i + 2 + \frac{2}{3}n$ holds and therefore $l_{\max} \geq r_0 - \frac{2}{3}n$ holds as well. Also, both $2\lfloor r_0/6 \rfloor$ and $2\left(\lceil (3r_0 - 2n)/2 \rceil - 1\right)$ are greater than or equal to $r_0 - \frac{2}{3}n$. Therefore $l'_{\max} \geq r_0 - \frac{2}{3}n$ holds.

Similarly, in the worst case, the new $0$ at $m_0$ is directly behind $\frac{l'_{\max} + r_0}{2} \geq r_0 - \frac{1}{3}n$. Therefore, in the worst case, the index $m_0$ is at $\frac{2r_0 - \frac{2}{3}n}{2} + 1$. With $l''_{\min} = \max\left(l_{\min},\, 2\left(m_0 - \lfloor n/2 \rfloor\right)\right)$, this implies that the inequality $2m_0 - l''_{\min} \geq \min\left(r_0 + \left(r_0 - \frac{2}{3}n\right),\, n - 1\right)$ holds.

Hence, under the assumption that $r_0 < r_1$ holds, one application of Corollary 3 checks for an interval of size $r_0 - \frac{2}{3}n$ whether this interval contains any last indices for a 3-cadence which starts in the first run. Therefore, we need at most $\log n$ applications of this corollary.

This implies that it can be decided in polynomial time whether a grammar-compressed binary string contains any 3-cadences.

This paper shows that we can decide in linear time whether an uncompressed binary string contains a 3-cadence. While we should expect that it is more difficult to avoid 3-cadences in binary strings than to include 3-cadences, it is surprising that it is strictly easier to decide whether there is any 3-cadence at all than to decide whether there is a 3-cadence with a given character. For the latter problem, Amir et al. have shown in [1] that we should not expect a solution with time complexity $o(n \log n)$ by reduction of the 3SUM problem with bounded elements.

For the compressed case, we have shown that we can decide in polynomial time whether a grammar-compressed binary string contains a 3-cadence. However, all even slightly harder problems have been shown to be NP-complete. These hardness results seem to indicate that cadences may not be very useful in compressed pattern matching.

While we can decide in constant time whether a string contains a $k$-sub-cadence, there are no known nontrivial bounds on the bit complexity of the detection of $k$-sub-cadences with a given character.
Closely related, it is unknown whether equidistant subsequence matching is NP-hard on compressed binary strings.

Finally, in terms of uncompressed cadence detection, it is still unknown whether we can decide with sub-quadratic bit complexity whether a given string contains a 4-cadence. The currently best result is by Funakoshi et al., who presented a detection algorithm with sub-quadratic time complexity in the word RAM model in [3].
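For comparison with the open question on 4-cadences, the naive baseline can be written down directly from the definition. This is only an illustrative sketch and not an algorithm from the paper: the function name `find_k_cadence` is hypothetical, indices are 1-based as in the text, and the search simply tries every first index $i$ and distance $d \geq i$, which gives at most quadratically many candidate pairs.

```python
from typing import Optional, Tuple


def find_k_cadence(s: str, k: int) -> Optional[Tuple[int, int]]:
    """Return (i, d) such that i, i + d, ..., i + (k - 1) d is a k-cadence
    of s (1-indexed), or None if s contains no k-cadence."""
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(s)
    for i in range(1, n + 1):
        d = i  # maximality on the left requires i - d <= 0
        while i + (k - 1) * d <= n:
            # maximality on the right requires i + k * d > n
            if i + k * d > n and all(s[i - 1 + j * d] == s[i - 1] for j in range(k)):
                return i, d
            d += 1
    return None


example_cadence = find_k_cadence("10101", 3)      # indices 1, 3, 5
example_none = find_k_cadence("01110", 3)         # 2, 3, 4 is only a sub-cadence
```

The two calls reproduce the examples from the introduction: the indices 1, 3, 5 of 10101 form a 3-cadence, while 01110 contains none.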
References

[1] Amihood Amir, Alberto Apostolico, Travis Gagie, and Gad M. Landau. String cadences. Theoretical Computer Science, 698:4–8, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo).
[2] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Trans. Inf. Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
[3] Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, and Ayumi Shinohara. Detecting k-(Sub-)Cadences and Equidistant Subsequence Occurrences. In Inge Li Gørtz and Oren Weimann, editors, volume 161 of Leibniz International Proceedings in Informatics (LIPIcs), pages 12:1–12:11, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/12137.
[4] Mitsuru Funakoshi and Julian Pape-Lange. Non-Rectangular Convolutions and (Sub-)Cadences with Three Elements. In Christophe Paul and Markus Bläser, editors, volume 154 of Leibniz International Proceedings in Informatics (LIPIcs), pages 30:1–30:16, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/11891.
[5] J. Gardelle. Cadences. Mathématiques et Sciences humaines, 9:31–38, 1964.
[6] Artur Jeż. Faster fully compressed pattern matching by recompression. ACM Trans. Algorithms, 11(3), January 2015. doi:10.1145/2631920.
[7] Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. CoRR, abs/1910.10631, 2019. URL: http://arxiv.org/abs/1910.10631.
[8] Markus Lohrey. Algorithms on Compressed Words, pages 43–65. Springer New York, New York, NY, 2014. doi:10.1007/978-1-4939-0748-9_3.
[9] M. Lothaire. Combinatorics on Words. Cambridge Mathematical Library. Cambridge University Press, 1997. URL: https://books.google.de/books?id=eATLTZzwW-sC.
[10] Julian Pape-Lange. On extensions of maximal repeats in compressed strings. CoRR, abs/2002.06265, 2020. URL: https://arxiv.org/abs/2002.06265.
[11] Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211–222, 2003. doi:10.1016/S0304-3975(02)00777-6.
[12] Bartel Leendert van der Waerden. Beweis einer Baudet'schen Vermutung. Nieuw Archief voor Wiskunde, 15:212–216, 1927.
A L-R-Cadences with Overlap
In this section, I will extend the result of Section 4 and show that all results still hold if the intervals $L$ and $R$ are allowed to overlap.

Since all NP-complete problems of Section 4 are still NP-complete for this more general notion of $L$-$R$-cadences, it is left to show that one can detect an $L$-$R$-3-cadence in linear time in uncompressed binary strings with respect to the length of the string and in polynomial time in grammar-compressed binary strings with respect to the compressed size of the string and the additional variable $\max\left(\frac{|L|}{|R|}, \frac{|R|}{|L|}\right)$.

Let $L$ and $R$ therefore be two overlapping intervals. Define the overlapping part $M = L \cap R$ as well as the two non-overlapping parts $L' = L \setminus M$ and $R' = R \setminus M$. Since each $L$-$R$-3-cadence starts with an index in $L$ and ends with an index in $R$, we can assume that each index in $L'$ is less than each index in $M$ and that each index in $M$ is less than each index in $R'$.

By construction, each $L$-$R$-3-cadence is either
• an $L'$-$M$-3-cadence,
• an $L'$-$R'$-3-cadence,
• an $M$-$M$-3-cadence or
• an $M$-$R'$-3-cadence.

The $M$-$M$-3-cadences are exactly the 3-sub-cadences on the string $S[M]$. Therefore, van der Waerden's theorem shows that if $S[M]$ contains at least 9 characters, an $M$-$M$-3-cadence is guaranteed to exist and we can find such a sub-cadence by reading the first 9 characters of $S[M]$.

We can therefore assume in the remainder of the proof that $S[M]$ contains fewer than 9 characters. In this case, we can find all $M$-$M$-3-cadences in constant time in uncompressed strings and in linear time in grammar-compressed strings.

Since $|M|$ is small, we cannot expect a detection algorithm for $L'$-$M$-3-cadences and for $M$-$R'$-3-cadences which runs in polynomial time in grammar-compressed strings. However, each $L'$-$R$-3-cadence is either an $L'$-$M$-3-cadence or an $L'$-$R'$-3-cadence.

If $L'$ is empty, then there are no $L'$-$R$-3-cadences. Otherwise, we can use the results of Section 4 to detect an $L'$-$R$-3-cadence. This takes linear time in uncompressed strings. Furthermore, in grammar-compressed strings, we can detect an $L'$-$R$-3-cadence in polynomial time with respect to the compressed size and $\max\left(\frac{|L'|}{|R|}, \frac{|R|}{|L'|}\right)$. However, since $|L| - |M| = |L'| \neq 0$ and $|M| < 9$, the ratio $\frac{|R|}{|L'|}$ is bounded from above by $9\,\frac{|R|}{|L|}$. Therefore, in grammar-compressed strings, we can detect an $L'$-$R$-3-cadence in polynomial time with respect to the compressed size and $\max\left(\frac{|L|}{|R|}, \frac{|R|}{|L|}\right)$.

Since each $L$-$R'$-3-cadence is either an $L'$-$R'$-3-cadence or an $M$-$R'$-3-cadence, we can similarly detect those sub-cadences. Furthermore, since we do not attempt to count the number of $L$-$R$-3-cadences, it is not a problem that we may find $L'$-$R'$-3-cadences twice.

This implies:

Theorem 6.
For two, not necessarily disjoint, intervals $L$ and $R$, it is possible to detect whether a binary string contains any $L$-$R$-3-cadence
• in $O(|L| + |R|)$ time in uncompressed strings and
• in polynomial time with respect to the compressed size and $\max\left(\frac{|L|}{|R|}, \frac{|R|}{|L|}\right)$ in grammar-compressed strings.
If such an $L$-$R$-3-cadence exists, we can find such a cadence in the same time.
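Two ingredients of the argument above are easy to check by hand or by machine: the split of overlapping intervals into $L'$, $M$ and $R'$, and the fact that every binary string with 9 characters contains a 3-sub-cadence. The following is a small sketch under stated assumptions. The names `split_overlap` and `find_3_sub_cadence` are hypothetical, intervals are represented as inclusive 1-indexed pairs `(lo, hi)` that are empty when `lo > hi`, and $L$ is assumed to start and end no later than $R$, as in the proof.

```python
from itertools import product
from typing import Optional, Tuple

Interval = Tuple[int, int]  # inclusive, 1-indexed; empty if lo > hi


def find_3_sub_cadence(s: str) -> Optional[Tuple[int, int, int]]:
    """Return three 1-indexed positions in arithmetic progression that carry
    equal characters, or None if s contains no 3-sub-cadence."""
    n = len(s)
    for i in range(1, n + 1):
        for d in range(1, (n - i) // 2 + 1):
            if s[i - 1] == s[i - 1 + d] == s[i - 1 + 2 * d]:
                return i, i + d, i + 2 * d
    return None


def split_overlap(L: Interval, R: Interval) -> Tuple[Interval, Interval, Interval]:
    """Return (L', M, R') with M = L ∩ R, L' = L \\ M and R' = R \\ M,
    assuming L starts and ends no later than R."""
    (l_lo, l_hi), (r_lo, r_hi) = L, R
    M = (max(l_lo, r_lo), min(l_hi, r_hi))
    return (l_lo, M[0] - 1), M, (M[1] + 1, r_hi)


# Exhaustive check of the van der Waerden bound used in the proof:
# all 2^9 binary strings with 9 characters contain a 3-sub-cadence.
every_9_has_one = all(
    find_3_sub_cadence("".join(bits)) is not None for bits in product("01", repeat=9)
)
# With 8 characters this can fail, so the bound 9 cannot be lowered.
witness_8 = find_3_sub_cadence("00110011")
parts = split_overlap((1, 10), (6, 20))
```

The exhaustive check only covers the constant-size case needed for $S[M]$; it says nothing about the compressed running time, which relies on the results of Section 4.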