A Simple Sublinear Algorithm for Gap Edit Distance
Joshua Brakensiek∗  Moses Charikar†  Aviad Rubinstein‡

Abstract
We study the problem of estimating the edit distance between two n-character strings. While exact computation in the worst case is believed to require near-quadratic time, previous work showed that in certain regimes it is possible to solve the following gap edit distance problem in sub-linear time: distinguish between inputs of distance ≤ k and ≥ k². Our main result is a very simple algorithm for this benchmark that runs in time Õ(n/√k), and in particular settles the open problem of obtaining a truly sublinear time for the entire range of relevant k.

Building on the same framework, we also obtain a k-vs-k² algorithm for the one-sided preprocessing model with Õ(n) preprocessing time and Õ(n/k) query time (improving over a recent Õ(n/k + k²)-query time algorithm for the same problem [GRS20]).

∗ Stanford University, partially supported by an NSF Graduate Research Fellowship.
† Stanford University, supported by a Simons Investigator Award, a Google Faculty Research Award and an Amazon Research Award.
‡ Stanford University.
Introduction
We study the problem of estimating the edit distance between two n-character strings. There is a classic O(n²) dynamic programming algorithm, and fine-grained complexity results from recent years suggest that it is nearly optimal [BK15, BI18, AHWW16, AB18]. There have been long lines of works on beating the quadratic time barrier with approximations [BEK+03, BJKK04, BES06, AO12, AKO10, BEG+18, CDG+18, CGKK18, HSSS19, RSSS19, BR20, KS20b, GRS20, RS20, AN20], or beyond worst case [Ukk85, Apo86, Mye86, LMS98, AK12, ABBK17, BK18, Kus19, HRS19, BSS20]. Motivated by applications where the strings may be extremely long (e.g. bioinformatics), we are interested in algorithms that run even faster, namely in sub-linear time. For exact computation in the worst case, this is unconditionally impossible: even distinguishing between a pair of identical strings and a pair that differs in a single character requires reading the entire input. But in many regimes sublinear algorithms are still possible [BEK+03, BJKK04, CK06, AN10, AO12, SS17, GKS19, NRRS19, BCLW19, RSSS19].
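The sampling intuition behind this reading lower bound can be illustrated with a small simulation (ours, not from the paper): a tester that reads s of the n positions of each string notices a single planted substitution with probability only s/n, so no algorithm making o(n) queries can solve this promise problem exactly.

```python
import random

# Illustrative sketch: two strings differing in exactly one position.
# A tester that reads s random positions of each string detects the
# difference with probability s/n, so sublinear exact testing fails.
random.seed(0)
n, s, trials = 1000, 100, 20000
A = [0] * n
B = A.copy()
B[random.randrange(n)] = 1   # plant a single substitution

hits = 0
for _ in range(trials):
    sample = random.sample(range(n), s)       # positions the tester reads
    if any(A[i] != B[i] for i in sample):
        hits += 1
print(hits / trials)  # close to s/n = 0.1
```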
Gap Edit Distance: k vs k²

We give new approximation algorithms for edit distance that run in sublinear time when the input strings are close. To best understand our contribution and how it relates to previous work, we focus on the benchmark advocated by [GKS19] of distinguishing input strings whose edit distance is ≤ k from ≳ k²; we discuss more general parameters later in this section. Notice that we can assume wlog that k < √n (otherwise the algorithm can always accept). Furthermore, for tiny k there is an unconditional easy lower bound of Ω(n/k²) for distinguishing even identical strings from ones with k² substitutions. So our goal is to design an algorithm that runs in truly sublinear time for 1 ≪ k < √n.

There are two most relevant algorithms in the literature for this setting:

• [AO12] (building on [OR07]) gave an algorithm that runs in time n^{2+o(1)}/k³; in particular, it is sublinear for k ≫ n^{1/3}.

• [GKS19] gave an algorithm that runs in time Õ(n/k + k³); in particular, it is truly sublinear for k ≪ n^{1/3}.

In particular, [GKS19] left as an open problem obtaining a sublinear algorithm for k ≈ n^{1/3}. Our main result is a very simple algorithm that runs in time Õ(n/√k) and hence is simultaneously sublinear for all relevant values of k.

Theorem (Main result (informal); see Theorem 8). We can distinguish between ED(A, B) ≤ k and ED(A, B) = Ω(k²) in Õ(n/√k) time with high probability.

Our algorithm is better than [AO12, GKS19] for n^{2/7} ≪ k ≪ n^{2/5} (and is also arguably simpler than both).

Independent work of Kociumaka and Saha

The open problem of Goldenberg, Krauthgamer, and Saha [GKS19] was also independently resolved by Kociumaka and Saha [KS20a]. They use essentially the same main algorithm (Algorithm 3 below), but substantially different techniques to implement approximate queries to the subroutine we call MaxAlign_k. Their running time (Õ(n/k + k²)) is faster than ours in the regime where our algorithm is faster than [AO12].

Edit distance with preprocessing: results and technical insights

Our starting point for this paper is the recent work of [GRS20] that designed algorithms for edit distance with preprocessing; namely, the algorithm consists of two phases:
Preprocessing where each string is preprocessed separately; and
Query where the algorithm has access to both strings and the outputs of the preprocessing phase.

A simple and efficient preprocessing procedure proposed by [GRS20] is to compute a hash table for every contiguous substring. In the query phase, this enables an O(log(n))-time implementation of a subroutine that, given indices i_A, i_B, returns the longest common (contiguous) substring of A, B starting at indices i_A, i_B (respectively). We use a simple modification of this subroutine, which we call MaxAlign_k: given only an index i_B for string B, it returns the longest common (contiguous) substring of A, B starting at indices i_A, i_B (respectively) for any i_A ∈ [i_B − k, i_B + k]. (It is not hard to see that for k-close strings, we never need to consider other choices of i_A [Ukk85].)

Given access to a MaxAlign_k oracle, we obtain the following simple greedy algorithm for k-vs-k² edit distance (see Algorithm 3 below): starting from pointer i_B = 1, at each iteration it advances i_B to the end of the next longest common substring returned by MaxAlign_k. Each increase of the pointer i_B costs at most 2k in edit distance (corresponding to the freedom to choose i_A ∈ [i_B − k, i_B + k]). Hence if i_B reaches the end of B in O(k) steps, then ED(A, B) ≤ O(k²) and we can accept; otherwise the edit distance is > k and we can reject. The above ideas suffice to solve k-vs-k² gap edit distance in Õ(k) query time after polynomial preprocessing.

Without preprocessing, we can't afford to hash the entire input strings. Instead, we subsample a ≈ 1/k-fraction of the indices from each string and compute hashes for the sampled subsequences. If the sampled indices perfectly align (with a suitable shift in [±k]), the hashes of identical contiguous substrings will be identical, whereas the hashes of substrings that are > k-far (even in Hamming distance) will be different (w.h.p.). This error is acceptable since we already incur a Θ(k)-error for each call of MaxAlign_k. This algorithm would run in Õ(n/k) time, but there is a caveat: when we construct the hash table, it is not yet possible to pick the indices so that they perfectly align (we don't know the suitable shift). Instead, we try O(√k) different shifts for each of A, B; by the birthday paradox, there exists a pair of shifts that exactly adds up to the right shift in [±k]. The total runtime is given by Õ(n/k · √k) = Õ(n/√k).

[GRS20] also considered the case where we can only preprocess one of the strings. In this case, we can mimic the strategy from the previous paragraph, but take all O(k) shifts on the preprocessed string, saving the O(√k)-factor at query time. This gives the following result:

Theorem (Informal statement of Theorem 11). We can distinguish between ED(A, B) ≤ k and ED(A, B) = Ω̃(k²) with high probability in Õ(n) preprocessing time of A and Õ(n/k) query time.

The preprocessing can be made near-linear, but in this setting our algorithm is still dominated by that of [CGK16]. There is also an additive Õ(k) term like in the preprocessing case, but it is dominated by Õ(n/k) for k < √n. This improves over the Õ(n/k + k²)-query-time algorithm in [GRS20] that used similar ideas. (A similar algorithm with low asymmetric query complexity was also introduced in [GKS19].)

Trading off running time for better approximation
By combining our algorithm with the h-wave algorithm of [LMS98], we can trade off approximation guarantee and running time in our algorithms. The running times we obtain for k vs kℓ edit distance are:

No preprocessing: Õ(n√k/ℓ + k^{5/2}/ℓ) running time for ℓ ∈ [√k, k]. (Theorem 16)

One-sided preprocessing: Õ(nk/ℓ) preprocessing time and Õ(n/ℓ + k²/ℓ) query time. (Theorem 19)

Two-sided preprocessing: Õ(nk/ℓ) preprocessing time and Õ(k²/ℓ) query time. (Corollary 20)

Organization
Section 2 gives an overview of the randomized hashing technique we use, as well as a structural lemma for close strings. Section 3 gives a meta-algorithm for distinguishing k versus k² edit distance. Sections 4, 5, 6 respectively implement this meta-algorithm for two-, zero-, and one-sided preprocessing. Appendix A explains how to trade off running time for an improved gap of k versus kℓ edit distance. Appendix B includes the proof of our structural decomposition lemma.

Preliminaries

A standard preprocessing ingredient is Rabin-Karp-style rolling hashes (e.g., [CLRS09]). We identify the alphabet Σ with 1, 2, ..., |Σ|. Assume there is also $ ∉ Σ, which we index by |Σ| + 1. Assume before any preprocessing that we have picked a prime p with Θ(log n + log |Σ|) digits as well as a uniformly random value x ∈ {0, 1, ..., p − 1}. We also have S ⊂ [n], a subsample of the indices, which allows for sublinear preprocessing of the rolling hashes while still successfully testing string matching (up to Õ(n/|S|) Hamming error).

Algorithm 1
InitRollingHash(A, S)

Input: A ∈ Σⁿ; S, a sorted array of indices to be hashed
Output: H, a list of |S| + 1 hashes

H ← [0]
c ← 0
for i ∈ S do
  c ← cx + A[i] mod p
  append c to H
return H

Observe that InitRollingHash runs in Õ(|S|) time and RetrieveRollingHash runs in Õ(1) time. The correctness guarantees follow from the following standard proposition. We assume that all indices out of the range of A[1, n] are equal to $.

Algorithm 2 RetrieveRollingHash(A, S, H, i, j)

Input: A ∈ Σⁿ; S, the array of hashed indices; H, the list of hashes; i ≤ j, indices from 1 to n
Output: h, the hash of the sampled positions of A[i, j]

i′ ← least index such that S[i′] ≥ i
j′ ← greatest index such that S[j′] ≤ j
return h ← H[j′] − H[i′ − 1] · x^{j′−i′+1} mod p

Proposition 1.
Let A, B ∈ Σⁿ and S := {1, 2, ..., n}. Let H_A = InitRollingHash(A, S) and H_B = InitRollingHash(B, S). The following holds with probability at least 1 − 1/n³ over the choice of x: for all i_A ≤ j_A and i_B ≤ j_B, we have that RetrieveRollingHash(A, S, H_A, i_A, j_A) = RetrieveRollingHash(B, S, H_B, i_B, j_B) if and only if A[i_A, j_A] = B[i_B, j_B].

This claim is sufficient for our warm-up two-sided preprocessing algorithm. However, for the other algorithms, we need to have |S| = o(n) for our hashing to be sublinear. This is captured by another claim.

Claim 2.
Let A, B ∈ Σⁿ and S ⊆ {1, 2, ..., n} be a random subset with each element included independently with probability at least α := min((4 ln n)/k, 1). Let H_A = InitRollingHash(A, S) and H_B = InitRollingHash(B, S). For any i ≤ j in {1, ..., n} we have:

(1) If A[i, j] = B[i, j], then RetrieveRollingHash(A, S, H_A, i, j) = RetrieveRollingHash(B, S, H_B, i, j).

(2) If Ham(A[i, j], B[i, j]) ≥ k, then with probability at least 1 − 2/n³ over the choice of x and S, RetrieveRollingHash(A, S, H_A, i, j) ≠ RetrieveRollingHash(B, S, H_B, i, j).

Proof.
Let A_S and B_S be the subsequences of A and B corresponding to the indices S. Note that if A[i, j] = B[i, j] then A_S[i′, j′] = B_S[i′, j′], where i′ and j′ are chosen as in RetrieveRollingHash. Property (1) then follows by Proposition 1.

If Ham(A[i, j], B[i, j]) ≥ k, the probability that there exists i₀ ∈ S ∩ [i, j] such that A[i₀] ≠ B[i₀], and thus A_S[i′, j′] ≠ B_S[i′, j′], is 1 if α = 1 and otherwise at least

1 − (1 − (4 ln n)/k)^k ≥ 1 − 1/e^{4 ln n} = 1 − 1/n⁴.

If A_S[i′, j′] ≠ B_S[i′, j′], then by Proposition 1,

RetrieveRollingHash(A, S, H_A, i, j) ≠ RetrieveRollingHash(B, S, H_B, i, j)

with probability at least 1 − 1/n³. Therefore, for a random S, the probability that RetrieveRollingHash(A, S, H_A, i, j) ≠ RetrieveRollingHash(B, S, H_B, i, j) is at least 1 − 1/n⁴ − 1/n³ > 1 − 2/n³. Thus, property (2) follows.

Definition 1 (k-alignment and approximate k-alignment). Given strings
A, B, we say that a substring B[i_B, i_B + d − 1] with 1 ≤ i_B ≤ i_B + d − 1 ≤ n is in k-alignment with A[i_A, i_A + d − 1] if |i_A − i_B| ≤ k and A[i_A, i_A + d − 1] = B[i_B, i_B + d − 1]. If |i_A − i_B| ≤ 3k and ED(A[i_A, i_A + d − 1], B[i_B, i_B + d − 1]) ≤ 3k, we say that B[i_B, i_B + d − 1] is in approximate k-alignment with A[i_A, i_A + d − 1]. We say that B[i_B, i_B + d − 1] has a k-alignment in A if there is an i_A with |i_A − i_B| ≤ k such that B[i_B, i_B + d − 1] is in k-alignment with A[i_A, i_A + d − 1], and similarly for approximate k-alignment.

Lemma 3.
Let A, B ∈ Σ* be strings such that ED(A, B) ≤ k. Then, A and B can be partitioned into 2k + 1 intervals I_1^A, ..., I_{2k+1}^A and I_1^B, ..., I_{2k+1}^B, respectively, together with a partial monotone matching π : [2k + 1] → [2k + 1] ∪ {⊥}, such that

• unmatched intervals are of length at most 1, and

• for all i in the matching, B[I_{π(i)}^B] is in k-alignment with A[I_i^A].

k vs. k²

In this section, we present GreedyMatch (Algorithm 3), a simple algorithm for distinguishing ED(A, B) ≤ O(k) from ED(A, B) ≥ Ω(k²). The algorithm assumes access to a data structure MaxAlign_k as defined below. In the following sections, we will present different implementations of this data structure for the cases of two-sided, one-sided, and no preprocessing.

Define MaxAlign_k(A, B, i_B) to be a function which returns d ∈ [0, n]. We say that an implementation of MaxAlign_k(A, B, i_B) is correct if with probability 1 it outputs the maximum d such that B[i_B, i_B + d − 1] has a k-alignment in A, and if no k-alignment exists, it outputs d = 0. We say that an implementation is approximately correct if the following are true.

1. Let d be maximal such that B[i_B, i_B + d − 1] has a k-alignment in A. With probability 1, MaxAlign_k(A, B, i_B) ≥ d.
2. With probability at least 1 − 1/n², B[i_B, i_B + MaxAlign_k(A, B, i_B) − 1] has an approximate k-alignment in A.

We say that an implementation is half approximately correct if the following are true.

1. Let d be maximal such that B[i_B, i_B + d − 1] has a k-alignment. With probability 1, MaxAlign_k(A, B, i_B) > d/2 (unless d = 0).
2. With probability at least 1 − 1/n², B[i_B, i_B + MaxAlign_k(A, B, i_B) − 1] has an approximate k-alignment in A.

Algorithm 3 GreedyMatch(A, B, k)

Input: A, B ∈ Σⁿ, k ≤ n
Output: SMALL if ED(A, B) ≤ k; LARGE if ED(A, B) > 36k²

i_B ← 1
for e from 1 to 2k + 1
  i_B ← i_B + max(MaxAlign_k(A, B, i_B), 1)
  if i_B > n then return SMALL
return LARGE

We now give the following correctness guarantee.

Lemma 4. If MaxAlign_k is approximately correct and ED(
A, B) ≤ k, then with probability 1, GreedyMatch(A, B, k) returns SMALL. If MaxAlign_k is half approximately correct and ED(A, B) ≤ k/(2 log n), then with probability 1, GreedyMatch(A, B, k) returns SMALL. If MaxAlign_k is (half) approximately correct and ED(A, B) > 36k², then with probability 1 − 1/n, GreedyMatch(A, B, k) returns LARGE. Further, GreedyMatch(A, B, k) makes O(k) calls to MaxAlign_k and otherwise runs in O(k log n) time.

Proof. If MaxAlign_k is approximately correct and ED(A, B) ≤ k, then by Lemma 3, B can be decomposed into 2k + 1 intervals such that each is either of length at most 1 or exactly matches the corresponding interval of A, up to a shift of k. In the algorithm, if i_B is in one of these intervals, then MaxAlign_k finds the rest of the interval (and perhaps more). Then, the algorithm reaches the end of B in 2k + 1 steps and outputs SMALL.

Let k′ = k/(2 log n). If MaxAlign_k is half approximately correct and ED(A, B) ≤ k′, then by Lemma 3, B can be decomposed into 2k′ + 1 intervals such that each is either of length at most 1 or exactly matches the corresponding interval of A, up to a shift of k′. In the algorithm, if i_B is in one of these intervals, then MaxAlign_k finds more than half of the remainder of the interval. Thus, it takes at most log n steps for the algorithm to get past each of the 2k′ + 1 intervals, so the algorithm reaches the end of B in (2k′ + 1)(log n) ≤ 2k + 1 steps and outputs SMALL.

For the other direction, it suffices to prove that if the algorithm outputs SMALL then ED(A, B) ≤ 36k². If MaxAlign_k is (half) approximately correct and the algorithm outputs SMALL, then with probability at least 1 − 1/n over all calls to MaxAlign_k, there exists a decomposition of B into at most 2k + 1 intervals such that each is either of length 1 or has an approximate k-alignment in A. Thus, there exists a sequence of edit operations from B to A:

1. deleting the at most 2k + 1 characters of B which do not match,
2. modifying at most 3k characters within each interval of B, and
3. adding/deleting 6k characters between each consecutive pair of approximately matching intervals (and before the first and after the last interval), since each match had a shift of up to 3k.

This is a total of 2k + 1 + 3k(2k + 1) + 6k(2k + 2) ≤ 36k² operations.
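The accounting in this step can be sanity-checked mechanically; the total 2k + 1 + 3k(2k + 1) + 6k(2k + 2) expands to 18k² + 17k + 1, which is at most 36k² for every k ≥ 1 (a quick script of ours, not from the paper):

```python
# Check of the edit-operation count in the proof above:
# deletions (2k+1) + substitutions (3k per interval) + insertions/deletions
# (6k per gap) total 18k^2 + 17k + 1 <= 36k^2 for all k >= 1.
for k in range(1, 10**4):
    total = (2*k + 1) + 3*k*(2*k + 1) + 6*k*(2*k + 2)
    assert total == 18*k**2 + 17*k + 1
    assert total <= 36*k**2
```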
Thus, if ED(A, B) > 36k², GreedyMatch(A, B, k) returns LARGE with probability at least 1 − 1/n. The runtime analysis follows by inspection.

By Lemma 4, it suffices to implement MaxAlign_k efficiently and with 1/poly(n) error probability in various models.

k vs. k² with Two-Sided Preprocessing

As a warm-up, we give an implementation of MaxAlign_k that first preprocesses A and B (separately) for poly(n) time, and then implements MaxAlign_k queries in O(log(n)) time. Algorithm 4 takes as input a string A and produces (H_A, T_A), the rolling hashes of A and a collection of hash tables. We let H_B, T_B denote the corresponding preprocessing output for B. Algorithm 5 gives a correct implementation of MaxAlign_k with the assistance of this preprocessing.

Lemma 5.
TwoSidedMaxAlign_k is a correct implementation of MaxAlign_k.

(It is not hard to improve the preprocessing time to Õ(n²). We omit the details since this algorithm would still not be optimal for the two-sided preprocessing setting.)

Algorithm 4 TwoSidedPreprocessing_k(A)

Input: A ∈ Σⁿ, k ≤ n
Output: (H_A, T_A), a collection of hashes

H_A ← InitRollingHash(A, [1, n])
T_A ← n × n matrix of empty hash tables
for i from 1 to n
  for j from i to n
    for a from −k to k
      if [i + a, j + a] ⊆ [1, n], add RetrieveRollingHash(A, [1, n], H_A, i + a, j + a) to T_A[i, j]
return (H_A, T_A)

Algorithm 5
TwoSidedMaxAlign_k(A, B, i_B)

Input: A ∈ Σⁿ, B ∈ Σⁿ, k ≤ n, i_B ∈ [1, n]
Output: d ∈ [0, n]

Binary search to find the maximal d ∈ [0, n − i_B + 1] such that RetrieveRollingHash(B, [1, n], H_B, i_B, i_B + d − 1) ∈ T_A[i_B, i_B + d − 1]
return d

Proof.
Observe that TwoSidedMaxAlign_k is correct if for all a ∈ [−k, k], RetrieveRollingHash(A, [1, n], H_A, i_B + a, i_B + d + a) = RetrieveRollingHash(B, [1, n], H_B, i_B, i_B + d) if and only if A[i_B + a, i_B + d + a] = B[i_B, i_B + d]. By Proposition 1 and the union bound, this happens with probability at least 1 − 1/n³.

Theorem 6.
When both A and B are preprocessed for poly(n) time, we can distinguish between ED(A, B) ≤ k and ED(A, B) > 36k² in Õ(k) time with probability 1 − 1/n.

Remark. Note that [CGK16]'s algorithm obtains similar guarantees while only spending O(log(n)) query time. Further, sketching algorithms for edit distance often achieve much better approximation factors, but their preprocessing is often not near-linear (e.g., [BZ16]).

Proof of Theorem 6.
By Lemma 5, TwoSidedMaxAlign_k is correct (and thus approximately correct), so by Lemma 4 it succeeds with high enough probability that GreedyMatch outputs the correct answer with probability at least 1 − 1/n. By inspection, the preprocessing runs in poly(n) time. Further, as the binary search, hash computation, and table lookup are all Õ(1) operations, TwoSidedMaxAlign_k runs in Õ(1) time, so the two-sided preprocessing version of GreedyMatch runs in Õ(k) time.

(Document exchange (e.g., [BZ16, Hae19]) is similar to the one-sided preprocessing model, but A and B are never brought together; rather, a hash of A is sent to B.)

k vs. k² with No Preprocessing

As explained in the introduction, for the no-preprocessing case, we take advantage of the fact that any c ∈ [−k, k] can be written as a√k + b, where a, b ∈ [−√k, √k]; here and throughout, √k is shorthand for ⌈√k⌉. Thus, for A we compute 2√k + 1 rolling hash tables according to S + a√k := {s + a√k : s ∈ S} ∩ [1, n] for a ∈ [−√k, √k]. Likewise, for B we compute rolling hash tables according to S − b := {s − b : s ∈ S} ∩ [1, n]. Then, if we wish to compare A[i_B + c, i_B + c + d − 1] and B[i_B, i_B + d − 1], we can instead compare the hashes of A[i_B + a√k + k, i_B + a√k + d − 1 − k] and B[i_B − b + k, i_B + d − 1 − b − k]. (We need to "shave" k from each end of the substrings, as we need to ensure that [i_B − b + k, i_B + d − 1 − b − k] ⊂ [i_B, i_B + d − 1], etc.)

Before calling GreedyMatch, we call two methods, ProcessA and ProcessB, which compute these hash tables. Note that the procedures are asymmetrical. These take Õ(n/√k) time each.

Algorithm 6
ProcessA_k(A)

Input: A ∈ Σⁿ

for a from −√k to √k
  H_{A,a√k} ← InitRollingHash(A, S + a√k)
return {H_{A,a√k} : a ∈ [−√k, √k]}

Algorithm 7 ProcessB_k(B)

for b from −√k to √k
  H_{B,b} ← InitRollingHash(B, S − b)
return {H_{B,b} : b ∈ [−√k, √k]}

Algorithm 8 MaxAlign_k(A, B, i_B)

Input: A ∈ Σⁿ, B ∈ Σⁿ, k ≤ n, i_B ∈ [1, n]

d₁ ← 2k, d₂ ← n − i_B + 1
if d₂ ≤ 2k then return d₂
while d₁ ≠ d₂ do
  d_mid ← ⌈(d₁ + d₂)/2⌉
  L_A, L_B ← ∅
  for a from −√k to √k
    h ← RetrieveRollingHash(A, S + a√k, H_{A,a√k}, i_B + k + a√k, i_B + d_mid − 1 − k + a√k)
    append h to L_A
  for b from −√k to √k
    h ← RetrieveRollingHash(B, S − b, H_{B,b}, i_B + k − b, i_B + d_mid − 1 − k − b)
    append h to L_B
  sort L_A and L_B
  if L_A ∩ L_B ≠ ∅ then d₁ ← d_mid else d₂ ← d_mid − 1
return d₁

Lemma 7.
MaxAlign_k is approximately correct.

Proof. First, consider any d ≥ 1 such that B[i_B, i_B + d − 1] has a k-alignment in A. We seek to show that MaxAlign_k(A, B, i_B) ≥ d with probability 1. Note that the output of MaxAlign_k is always at least 2k, so we may assume that d > 2k. By the definition of k-alignment, there exists c ∈ [−k, k] such that A[i_B + c, i_B + c + d − 1] = B[i_B, i_B + d − 1]. Pick a, b ∈ [−√k, √k] such that a√k + b = c, and so

A[i_B + k + a√k, i_B + d − 1 − k + a√k] = B[i_B + k − b, i_B + d − 1 − k − b].

By applying Claim 2, we have with probability 1 that

RetrieveRollingHash(A, S + a√k, H_{A,a√k}, i_B + k + a√k, i_B + d − 1 − k + a√k) = RetrieveRollingHash(B, S − b, H_{B,b}, i_B + k − b, i_B + d − 1 − k − b).

Therefore, in the implementation of MaxAlign_k(A, B, i_B), if d_mid ≤ d, then L_A and L_B have nontrivial intersection, so the output of the binary search is at least d, as desired. Thus, MaxAlign_k(A, B, i_B) outputs at least the length of the maximal k-alignment.

Second, we verify that MaxAlign_k outputs an approximate k-alignment. Let d be the output of MaxAlign_k. Either d = 2k, in which case B[i_B, i_B + d − 1] trivially is in approximate k-alignment with A[i_B, i_B + d − 1], or d > 2k. In the latter case, for that d, the binary search found that L_A ∩ L_B ≠ ∅, and so there exist a, b ∈ [−√k, √k] such that

RetrieveRollingHash(A, S + a√k, H_{A,a√k}, i_B + k + a√k, i_B + d − 1 − k + a√k) = RetrieveRollingHash(B, S − b, H_{B,b}, i_B + k − b, i_B + d − 1 − k − b).

Applying Claim 2 over all Õ(√k)² = Õ(k) comparisons of hashes made during the algorithm, with probability at least 1 − 1/n², we must have that

ED(A[i_B + k + a√k, i_B + d − 1 − k + a√k], B[i_B + k − b, i_B + d − 1 − k − b]) ≤ k.

Let c := a√k + b; then we have that

ED(A[i_B + c + k − b, i_B + c + d − 1 − k − b], B[i_B + k − b, i_B + d − 1 − k − b]) ≤ k,

so ED(A[i_B + c, i_B + c + d − 1], B[i_B, i_B + d − 1]) ≤ 3k. Since c = a√k + b ∈ [−2k, 2k] ⊆ [−3k, 3k], we have that B[i_B, i_B + d − 1] has an approximate k-alignment, as desired.

Theorem 8.
For k ≤ O(√n), with no preprocessing, we can distinguish between ED(A, B) ≤ k and ED(A, B) > 36k² in Õ(n/√k) time with probability at least 1 − 1/n.

Proof. By Lemma 7, MaxAlign_k is approximately correct, so by Lemma 4 it succeeds with high enough probability that GreedyMatch outputs the correct answer with probability at least 1 − 1/n. By inspection, both ProcessA_k and ProcessB_k run in Õ(n/√k) time in expectation. Further, MaxAlign_k runs in Õ(√k) time, so GreedyMatch runs in Õ(n/√k + k^{3/2}) = Õ(n/√k) time.

k vs. k² with One-Sided Preprocessing

For the one-sided preprocessing setting, we desire near-linear preprocessing time. To achieve that, MaxAlign_k shall be half approximately correct rather than approximately correct. Recall that, as before, we preselect S ⊂ [1, n] with each element included i.i.d. with probability q := min((4 ln n)/k, 1). In addition, we ensure that every multiple of k is in S and that n ∈ S. This only increases the size of S by n/k, and does not hurt the success probability of Claim 2. To achieve near-linear preprocessing, we only store RetrieveRollingHash(A, S + a, H_{A,a}, i + a, i + 2^{i₀} − 1 + a) when (S + a) ∩ [i + a, i + 2^{i₀} − 1 + a] changes. This happens when i ∈ (S + 1) ∪ (S − 2^{i₀} + 1).

Algorithm 9 OneSidedPreprocessA_k(A)

for a from −k to k
  H_{A,a} ← InitRollingHash(A, S + a)
T_A ← ⌊log n⌋ × ⌈n/k⌉ matrix of empty hash tables
for i₀ in [⌊log n⌋]
  for a from −k to k
    for i in ((S + 1) ∪ (S − 2^{i₀} + 1)) with [i + a, i + 2^{i₀} − 1 + a] ⊆ [n]
      h ← RetrieveRollingHash(A, S + a, H_{A,a}, i + a, i + 2^{i₀} − 1 + a)
      add h to T_A[i₀, ⌊i/k⌋ − 1]
      add h to T_A[i₀, ⌊i/k⌋]
      add h to T_A[i₀, ⌊i/k⌋ + 1]
return T_A

Claim 9.
OneSidedPreprocessA_k(A) runs in Õ(n) time in expectation.

Proof. Computing InitRollingHash(A, S + a) takes Õ(|S|) = Õ(n/k) time in expectation. Thus, computing the H_{A,a}'s takes Õ(n) time. The other loops take (amortized) Õ(1) · O(k) · Õ(n/k) = Õ(n) time.

Before we call GreedyMatch, we need to initialize the hash function for B using OneSidedProcessB(B). This takes Õ(n/k) time in expectation.

Algorithm 10
OneSidedProcessB(B)

return H_B ← InitRollingHash(B, S)

Algorithm 11 OneSidedMaxAlign_k(A, B, i_B)

Input: A ∈ Σⁿ, B ∈ Σⁿ, k ≤ n, i_B ∈ [1, n]

for d ∈ {2^{⌊log n⌋}, 2^{⌊log n⌋ − 1}, ..., 1}
  if RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i_B/k⌋] then return d
return 0

Lemma 10.
OneSidedMaxAlign_k is half approximately correct.

Proof. First, consider the maximal d′ ≥ 1 such that B[i_B, i_B + d′ − 1] has a k-alignment in A, and let d = 2^{⌊log d′⌋} be the largest power of two with d ≤ d′. We seek to show that OneSidedMaxAlign_k(A, B, i_B) ≥ d > d′/2 with probability 1. By the definition of k-alignment, there exists a ∈ [−k, k] such that A[i_B + a, i_B + d′ − 1 + a] = B[i_B, i_B + d′ − 1], and hence, with probability 1,

RetrieveRollingHash(A, S + a, H_{A,a}, i_B + a, i_B + d − 1 + a) = RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1).

Let i′_B be the least integer in ((S + 1) ∪ (S − d + 1)) ∩ [n] which is at least i_B. Since S contains every multiple of k (and n), |i′_B − i_B| ≤ k. Therefore,

RetrieveRollingHash(A, S + a, H_{A,a}, i_B + a, i_B + d − 1 + a) = RetrieveRollingHash(A, S + a, H_{A,a}, i′_B + a, i′_B + d − 1 + a) ∈ T_A[log d, ⌊i′_B/k⌋ + {−1, 0, 1}],

and ⌊i′_B/k⌋ − ⌊i_B/k⌋ ∈ {−1, 0, 1}. We conclude that RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i_B/k⌋]. Thus, OneSidedMaxAlign_k(A, B, i_B) outputs at least d, which is more than half the length of the maximal k-alignment.

Second, we verify that OneSidedMaxAlign_k outputs an approximate k-alignment. Let d be the output of OneSidedMaxAlign_k. Either d = 0, in which case B[i_B, i_B + d − 1] trivially is in approximate k-alignment with A[i_B, i_B + d − 1], or d ≥ 1. In the latter case, the search found that RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i_B/k⌋]. Thus, there exist i′_B with |⌊i′_B/k⌋ − ⌊i_B/k⌋| ≤ 1 and a ∈ [−k, k] such that

RetrieveRollingHash(A, S + a, H_{A,a}, i′_B + a, i′_B + d − 1 + a) = RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1).

Applying Claim 2 over all potential comparisons of hashes made during the algorithm, with probability at least 1 − 1/n² we must have that

ED(A[i′_B + a, i′_B + a + d − 1], B[i_B, i_B + d − 1]) ≤ k.

Note that |i′_B + a − i_B| ≤ |i′_B − i_B| + |a| ≤ 3k. Thus B[i_B, i_B + d − 1] has an approximate k-alignment, as desired.

Theorem 11.
For all A, B ∈ Σⁿ: when A is preprocessed for Õ(n) time in expectation, we can distinguish between ED(A, B) ≤ k/(2 log n) and ED(A, B) > 36k² in Õ(n/k) time with probability at least 1 − 1/n over the random bits in the preprocessing (oblivious to B).

Proof. By Lemma 10, OneSidedMaxAlign_k is half approximately correct, so by Lemma 4 it succeeds with high enough probability that GreedyMatch outputs the correct answer with probability at least 1 − 1/n. By Claim 9, the preprocessing runs in Õ(n) time. Also, OneSidedProcessB runs in Õ(n/k) time. Further, OneSidedMaxAlign_k runs in Õ(1) time (as performing the power-of-two search, computing the hash, and doing the table lookups are Õ(1) operations), so the one-sided preprocessing version of GreedyMatch runs in Õ(n/k + k) = Õ(n/k) time.

References

[AB18] Amir Abboud and Karl Bringmann. Tighter connections between formula-SAT and shaving logs. In 33rd Computational Complexity Conference, CCC 2018, pages 8:1–8:18, 2018.

[ABBK17] Amir Abboud, Arturs Backurs, Karl Bringmann, and Marvin Künnemann. Fine-grained complexity of analyzing compressed data: Quantifying improvements over decompress-and-solve. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 192–203, 2017.

[AHWW16] Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In
Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 375–388, 2016.

[AK12] Alexandr Andoni and Robert Krauthgamer. The smoothed complexity of edit distance. ACM Trans. Algorithms, 8(4):44:1–44:25, 2012.

[AKO10] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, pages 377–386, 2010.

[AN10] Alexandr Andoni and Huy L. Nguyen. Near-optimal sublinear time algorithms for Ulam distance. In
Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 76–86, 2010.

[AN20] Alexandr Andoni and Negev Shekel Nosatzki. Edit distance in near-linear time: it's a constant factor. CoRR, abs/2005.07678, 2020.

[AO12] Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. SIAM J. Comput., 41(6):1635–1648, 2012.

[Apo86] Alberto Apostolico. Improving the worst-case performance of the Hunt-Szymanski strategy for the longest common subsequence of two strings. Inf. Process. Lett., 23(2):63–69, 1986.

[BCLW19] Omri Ben-Eliezer, Clément L. Canonne, Shoham Letzter, and Erik Waingarten. Finding monotone patterns in sublinear time. CoRR, abs/1910.01749, 2019.

[BEG+
18] Mahdi Boroujeni, Soheil Ehsani, Mohammad Ghodsi, Mohammad Taghi Hajiaghayi, and Saeed Seddighin. Approximating edit distance in truly subquadratic time: Quantum and MapReduce. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 1170–1189, 2018.

[BEK+03] Tugkan Batu, Funda Ergün, Joe Kilian, Avner Magen, Sofya Raskhodnikova, Ronitt Rubinfeld, and Rahul Sami. A sublinear algorithm for weakly approximating edit distance. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA, pages 316–324, 2003.

[BES06] Tugkan Batu, Funda Ergün, and Süleyman Cenk Sahinalp. Oblivious string embeddings and edit distance approximations. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pages 792–801, 2006.

[BI18] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput., 47(3):1087–1097, 2018.

[BJKK04] Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In 45th Symposium on Foundations of Computer Science, FOCS 2004, pages 550–559, 2004.

[BK15] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In
IEEE 56th Annual Symposium onFoundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October,2015 , pages 79–97, 2015. 12BK18] Karl Bringmann and Marvin K¨unnemann. Multivariate fine-grained complexity oflongest common subsequence. In
Proceedings of the Twenty-Ninth Annual ACM-SIAMSymposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018 , pages 1216–1235, 2018.[BR20] Joshua Brakensiek and Aviad Rubinstein. Constant-factor approximation of near-linear edit distance in near-linear time. In Konstantin Makarychev, Yury Makarychev,Madhur Tulsiani, Gautam Kamath, and Julia Chuzhoy, editors,
Proccedings of the52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020,Chicago, IL, USA, June 22-26, 2020 , pages 685–698. ACM, 2020.[BSS20] Mahdi Boroujeni, Masoud Seddighin, and Saeed Seddighin. Improved algorithms foredit distance and LCS: beyond worst case. In Shuchi Chawla, editor,
Proceedings ofthe 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City,UT, USA, January 5-8, 2020 , pages 1601–1620. SIAM, 2020.[BZ16] Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and docu-ment exchange. In , pages 51–60. IEEE, 2016.[CDG +
18] Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Kouck´y, andMichael E. Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In Mikkel Thorup, editor, , pages979–990. IEEE Computer Society, 2018.[CGK16] Diptarka Chakraborty, Elazar Goldenberg, and Michal Kouck´y. Streaming algorithmsfor embedding and computing edit distance in the low distance regime. In DanielWichs and Yishay Mansour, editors,
Proceedings of the 48th Annual ACM SIGACTSymposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21,2016 , pages 712–725. ACM, 2016.[CGKK18] Moses Charikar, Ofir Geri, Michael P. Kim, and William Kuszmaul. On estimatingedit distance: Alignment, dimension reduction, and embeddings. In Ioannis Chatzi-giannakis, Christos Kaklamanis, D´aniel Marx, and Donald Sannella, editors, , volume 107 of
LIPIcs , pages 34:1–34:14.Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.[CK06] Moses Charikar and Robert Krauthgamer. Embedding the ulam metric into l Theoryof Computing , 2(11):207–224, 2006.[CLRS09] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein.
Intro-duction to algorithms . MIT press, 2009.[GKS19] Elazar Goldenberg, Robert Krauthgamer, and Barna Saha. Sublinear algorithms forgap edit distance. In , pages 1101–1120. IEEE, 2019.[GRS20] Elazar Goldenberg, Aviad Rubinstein, and Barna Saha. Does preprocessing help infast sequence comparisons? In Konstantin Makarychev, Yury Makarychev, Madhur13ulsiani, Gautam Kamath, and Julia Chuzhoy, editors,
Proccedings of the 52nd AnnualACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA,June 22-26, 2020 , pages 657–670. ACM, 2020.[Hae19] Bernhard Haeupler. Optimal document exchange and new codes for insertions anddeletions. In , pages 334–347. IEEE, 2019.[HRS19] Bernhard Haeupler, Aviad Rubinstein, and Amirbehshad Shahrasbi. Near-linear timeinsertion-deletion codes and (1+ ǫ )-approximating edit distance via indexing. In Pro-ceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing,STOC 2019, Phoenix, AZ, USA, June 23-26, 2019. , pages 697–708, 2019.[HSSS19] MohammadTaghi Hajiaghayi, Masoud Seddighin, Saeed Seddighin, and Xiaorui Sun.Approximating LCS in linear time: Beating the √ n barrier. In Proceedings of theThirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, SanDiego, California, USA, January 6-9, 2019 , pages 1181–1200, 2019.[KS20a] Tomasz Kociumaka and Barna Saha. Sublinear-time algorithms for computing &embedding gap edit distance. In
FOCS , 2020. To appear.[KS20b] Michal Kouck´y and Michael E. Saks. Constant factor approximations to edit dis-tance on far input pairs in nearly linear time. In Konstantin Makarychev, YuryMakarychev, Madhur Tulsiani, Gautam Kamath, and Julia Chuzhoy, editors,
Procced-ings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC2020, Chicago, IL, USA, June 22-26, 2020 , pages 699–712. ACM, 2020.[Kus19] William Kuszmaul. Efficiently approximating edit distance between pseudorandomstrings. In Timothy M. Chan, editor,
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA,January 6-9, 2019 , pages 1165–1180. SIAM, 2019.[LMS98] Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental stringcomparison.
SIAM J. Comput. , 27(2):557–582, 1998.[Mye86] Eugene W. Myers. An O(ND) difference algorithm and its variations.
Algorithmica ,1(2):251–266, 1986.[NRRS19] Ilan Newman, Yuri Rabinovich, Deepak Rajendraprasad, and Christian Sohler. Testingfor forbidden order patterns in an array.
Random Struct. Algorithms , 55(2):402–426,2019.[OR07] Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance.
J.ACM , 54(5):23, 2007.[RS20] Aviad Rubinstein and Zhao Song. Reducing approximate longest common subsequenceto approximate edit distance. In Shuchi Chawla, editor,
Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA,January 5-8, 2020 , pages 1591–1600. SIAM, 2020.14RSSS19] Aviad Rubinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation algo-rithms for LCS and LIS with truly improved running times. In David Zuckerman, edi-tor, , pages 1121–1145. IEEE ComputerSociety, 2019.[SS17] Michael E. Saks and C. Seshadhri. Estimating the longest increasing sequence inpolylogarithmic time.
SIAM J. Comput. , 46(2):774–823, 2017.[Ukk85] Esko Ukkonen. Algorithms for approximate string matching.
Information and Control ,64(1-3):100–118, 1985.
A Trading off running time for better approximation
In this appendix, we show how to extend the results of the main body to distinguishing edit distance k vs. kℓ.

A.1 Preliminaries: the h-wave algorithm

The works of [LMS98, GRS20] show that if one preprocesses both A and B, then the exact edit distance between A and B can be found by the following dynamic program over O(k)-size waves (each wave is called an h-wave). The DP state is represented by a table h[i, j], where i ∈ [0, k] and j ∈ [−k, k], initialized with h[0, 0] = 0 and h[0, j] = −∞ for all j ∈ [−k, k] \ {0}. The transitions for i ≥ 1 are

    h[i, j] = max{ h[i−1, j−1] + 1, h[i−1, j] + max(d, 1), h[i−1, j+1] },

where d is maximal such that A[h[i−1, j] + 1, h[i−1, j] + d] = B[h[i−1, j] − j + 1, h[i−1, j] − j + d]. Intuitively, h[i, j] is the farthest length n′ such that A[1, n′] and B[1, n′ − j] have edit distance at most i. Then, ED(A, B) ≤ k if and only if h[k, 0] = n.

A.2 Approximate h-wave and MaxShiftAlign_ℓ

We speed up the original h-wave algorithm by considering a sparsified h-wave, where we store h[i, j] with i ∈ [0, k] and j ∈ [−k, k] such that j is a multiple of ℓ. We initialize h[0, 0] = 0 and h[0, j] = −∞ for j ≠ 0 as before, but now we have the following transitions:

    h[i, j] = max{ h[i−1, j−ℓ] + ℓ (if j − ℓ ≥ −k), h[i−1, j] + max(d, ℓ), h[i−1, j+ℓ] + ℓ (if j + ℓ ≤ k) },

where d is maximal such that A[h[i−1, j] + 1 + a, h[i−1, j] + d + a] = B[h[i−1, j] − j + 1, h[i−1, j] − j + d] for some a ∈ [−ℓ, ℓ]. Note that when ℓ = 1 this mostly aligns with the h-wave algorithm, except we have h[i−1, j+ℓ] + ℓ instead of h[i−1, j+1] (as we are only seeking an approximation, we do this to make the analysis simpler).

For the approximate h-wave, it is approximately true that h[i, j] is the farthest length n′ such that A[1, n′] and B[1, n′ − j] have edit distance at most ˜O(iℓ) (see Lemmas 13 and 14 for more details). Then, if ED(A, B) ≤ ˜O(k), we have h[k, 0] ≥ n; and if h[k, 0] < n, then ED(A, B) ≥ ˜Ω(kℓ).

Unlike in the main body, here we cannot assume that the substrings of A and B being extended have a common start point. Instead we require a generalization of MaxAlign_k, which we call MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B). This algorithm finds the greatest positive integer d such that A[i_A + c, i_A + c + d − 1] = B[i_B, i_B + d − 1] for some c ∈ [−ℓ, ℓ], given the promise that |i_A − i_B| ≤ k.

Definition 2 (shifted (i_A, ℓ)-alignment and approximate shifted (i_A, ℓ)-alignment). Given strings A, B and i_A, i_B ∈ [1, n], we say that B[i_B, i_B + d − 1] has a shifted-(i_A, ℓ)-alignment with A if there is i with |i_A − i| ≤ ℓ and A[i, i + d − 1] = B[i_B, i_B + d − 1]. If ED(A[i_A, i_A + d − 1], B[i_B, i_B + d − 1]) ≤ 10ℓ, we say that B[i_B, i_B + d − 1] has an approximate shifted-(i_A, ℓ)-alignment with A.

We say that an implementation of MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) is approximately correct if, whenever |i_A − i_B| ≤ k, the following are true.

1. Let d′ be the maximal d′ such that B[i_B, i_B + d′ − 1] = A[i_A + c, i_A + c + d′ − 1] for some c ∈ [−ℓ, ℓ]. With probability 1, MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) > d′/2 (unless d′ = 0).

2. With probability at least 1 − 1/n², B[i_B, i_B + MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) − 1] has an approximate shifted-(i_A, ℓ)-alignment with A.

Algorithm 12
GreedyWave(A, B, k, ℓ)
Input: A, B ∈ Σ^n, ℓ ≤ k ≤ n
Output: SMALL if ED(A, B) ≤ ˜O(k) or LARGE if ED(A, B) ≥ ˜Ω(kℓ)

h ← matrix with indices [0, k] × ([−k, k] ∩ ℓZ)
for i from 0 to k
    for j multiples of ℓ from −k to k
        if i = 0 then
            h[0, j] ← 0 if j = 0, else −∞
        else
            h[i, j] ← h[i−1, j] + ℓ
            if j − ℓ ≥ −k then h[i, j] ← max(h[i, j], h[i−1, j−ℓ] + ℓ)
            if j + ℓ ≤ k then h[i, j] ← max(h[i, j], h[i−1, j+ℓ] + ℓ)
            h[i, j] ← max(h[i, j], h[i−1, j] + MaxShiftAlign_{ℓ,k}(A, B, h[i−1, j] + 1, h[i−1, j] − j + 1))
if h[k, 0] ≥ n then return SMALL
return LARGE
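To make the control flow of Algorithm 12 concrete, here is a minimal illustrative Python sketch of the sparsified wave. It is not the sublinear algorithm of this section: the helper naive_shift_align is a hypothetical stand-in that computes the shifted extension exactly by scanning (so each call takes linear time), whereas the paper implements MaxShiftAlign_{ℓ,k} with rolling hashes; the wave transitions and the SMALL/LARGE rule are the same.

```python
import math

def naive_shift_align(A, B, i_a, i_b, l):
    """Hypothetical stand-in for MaxShiftAlign_{l,k}: the largest d with
    B[i_b:i_b+d] == A[i_a+c:i_a+c+d] for some shift c in [-l, l].
    (0-indexed; the paper computes this sublinearly with rolling hashes.)"""
    best = 0
    for c in range(-l, l + 1):
        if i_a + c < 0:
            continue
        d = 0
        while (i_b + d < len(B) and i_a + c + d < len(A)
               and A[i_a + c + d] == B[i_b + d]):
            d += 1
        best = max(best, d)
    return best

def greedy_wave(A, B, k, l):
    """Sketch of Algorithm 12 (GreedyWave); assumes k % l == 0."""
    n = len(A)
    shifts = range(-k, k + 1, l)                 # j ranges over [-k, k] ∩ lZ
    NEG = -math.inf
    h = {j: (0 if j == 0 else NEG) for j in shifts}
    for _ in range(k):                           # waves i = 1 .. k
        new_h = {}
        for j in shifts:
            best = h[j] + l                      # stay on diagonal band j
            if j - l in h:
                best = max(best, h[j - l] + l)   # come from band j - l
            if j + l in h:
                best = max(best, h[j + l] + l)   # come from band j + l
            if h[j] > NEG and h[j] - j >= 0:     # greedy shifted extension:
                best = max(best, h[j] +          # A starts at h[j], B at h[j]-j
                           naive_shift_align(A, B, h[j], h[j] - j, l))
            new_h[j] = min(best, n)
        h = new_h
    return 'SMALL' if h[0] >= n else 'LARGE'
```

On inputs at small edit distance, the extension steps carry the wave to n within a few waves; on far inputs, each wave advances by only ℓ per step, so h[k, 0] falls short of n.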
A.3 Analysis of GreedyWave
We first prove that even if we never take any of the "shortcuts" given by MaxShiftAlign_{ℓ,k}(A, B, h[i−1, j] + 1, h[i−1, j] − j + 1), we still increase h by a quantifiable amount.

Claim 12. Consider (i, j), (i′, j′) ∈ [0, k] × ([−k, k] ∩ ℓZ) such that i′ ≥ i and |j′ − j| ≤ ℓ(i′ − i). Then h[i′, j′] ≥ h[i, j] + ℓ(i′ − i).

(Note that a shifted-(i_A, ℓ)-alignment implies an approximate shifted-(i_A, ℓ)-alignment, because ED(A[i_A, i_A + d − 1], A[i, i + d − 1]) ≤ 2ℓ whenever |i − i_A| ≤ ℓ.)

Proof. We prove this by induction on i′ − i. The base case of i′ − i = 0 is immediate. Now assume i′ − i ≥ 1 and |j′ − j| ≤ ℓ(i′ − i). Note then that there exists e ∈ {−1, 0, 1} such that j′ + eℓ is between j′ and j and |(j′ + eℓ) − j| ≤ ℓ(i′ − 1 − i). By the induction hypothesis, we know that h[i′ − 1, j′ + eℓ] ≥ h[i, j] + ℓ(i′ − i − 1). From GreedyWave, since i′ ≥ 1, we have that h[i′, j′] ≥ h[i′ − 1, j′ + eℓ] + ℓ. Combining the previous two inequalities, h[i′, j′] ≥ h[i, j] + ℓ(i′ − i).

Correctness of Algorithm 12 is proved in the following pair of lemmas.
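As a concrete baseline for what these lemmas assert, the exact ℓ = 1 wave of Section A.1 can be implemented directly. The following Python sketch is an illustrative, 0-indexed Landau–Vishkin-style version of the h-wave recurrence, deciding ED(A, B) ≤ k exactly; GreedyWave's guarantees are a bicriteria relaxation of this exact check, and this sketch omits the paper's preprocessing entirely.

```python
import math

def hwave_leq(A, B, k):
    """Exact h-wave of Appendix A.1 (Landau-Vishkin style): decides whether
    ED(A, B) <= k. After wave i, h[j] is the largest x such that
    ED(A[:x], B[:x-j]) <= i, a 0-indexed analogue of the paper's h[i, j]."""
    n, m = len(A), len(B)
    if abs(n - m) > k:
        return False
    NEG = -math.inf

    def slide(x, j):
        # free extension: match A[x] against B[x - j] while characters agree
        while x < n and x - j < m and A[x] == B[x - j]:
            x += 1
        return x

    h = {j: NEG for j in range(-k, k + 1)}
    h[0] = slide(0, 0)
    for _ in range(k):                       # waves i = 1 .. k
        new_h = {}
        for j in range(-k, k + 1):
            x = max(h.get(j - 1, NEG) + 1,   # delete a character of A
                    h[j] + 1,                # substitute one character
                    h.get(j + 1, NEG))       # insert a character of B
            if x > NEG:
                x = min(x, n, m + j)         # keep both prefixes in range
                x = slide(x, j) if x - j >= 0 else NEG
            new_h[j] = x
        h = new_h
    return h[n - m] >= n                     # ED(A, B) <= k iff the wave hits n
```

Each wave touches O(k) diagonals, and every character comparison inside slide advances some diagonal's frontier, which is the source of the O(n + k²) running time of the exact algorithm.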
Lemma 13. Assume that MaxShiftAlign_{ℓ,k} is approximately correct. If ED(A, B) ≤ k/(20⌈log n⌉), then GreedyWave(A, B, k, ℓ) outputs SMALL with probability 1.

Lemma 14. Assume that MaxShiftAlign_{ℓ,k} is approximately correct. If ED(A, B) > 10kℓ, then GreedyWave(A, B, k, ℓ) outputs LARGE with probability at least 1 − 1/n.

Proof of Lemma 13.

Notation and inductive hypothesis.
Let k′ = ⌊k/(20⌈log n⌉)⌋. Assume that ED(A, B) ≤ k′. Let I^A_1, ..., I^A_{2k′+1} and I^B_1, ..., I^B_{2k′+1} and π : [2k′+1] → [2k′+1] ∪ {⊥} be as in Lemma 3. Let (a_1, b_1), ..., (a_t, b_t) ∈ π be the matching, ordered such that a_1 ≤ ··· ≤ a_t and b_1 ≤ ··· ≤ b_t. Let A_i = max I^A_{a_i} and B_i = max I^B_{b_i}. For the boundary, we let a_0 = b_0 = 0 and A_0 = B_0 = 0. Also let a_{t+1} = b_{t+1} = 2k′ + 2 and A_{t+1} = B_{t+1} = n + 1. Let r : Z → ℓZ be the function which rounds each integer to the nearest multiple of ℓ (breaking ties by rounding down).

It suffices to prove by induction that for all i ∈ {0, 1, ..., t+1},

    h[a_i + b_i + (⌈log n⌉ + 1)i, r(A_i − B_i)] ≥ A_i.

The base case of i = 0 follows from h[0, 0] = 0 in the initialization.

Inductive hypothesis. Assume that h[a_i + b_i + (⌈log n⌉ + 1)i, r(A_i − B_i)] ≥ A_i. We seek to show that h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)(i+1), r(A_{i+1} − B_{i+1})] ≥ A_{i+1}. We complete the induction in two steps.

Step 1: h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)i + 1, r(A_{i+1} − B_{i+1})] ≥ A_i + a_{i+1} − a_i + 1.

First note that

    |(A_{i+1} − B_{i+1}) − (A_i − B_i)|
      = |(A_{i+1} − A_i − |I^A_{a_{i+1}}|) − (B_{i+1} − B_i − |I^B_{b_{i+1}}|)|
      ≤ |A_{i+1} − A_i − |I^A_{a_{i+1}}|| + |B_{i+1} − B_i − |I^B_{b_{i+1}}||
      ≤ (a_{i+1} − a_i) + (b_{i+1} − b_i).

Therefore,

    |r(A_{i+1} − B_{i+1}) − r(A_i − B_i)| ≤ (a_{i+1} − a_i) + (b_{i+1} − b_i) + ℓ
      ≤ ℓ · [(a_{i+1} − a_i) + (b_{i+1} − b_i) + 1]
      = ℓ · [(a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)i + 1) − (a_i + b_i + (⌈log n⌉ + 1)i)].

Therefore, we may apply Claim 12 to get that

    h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)i + 1, r(A_{i+1} − B_{i+1})]
      ≥ h[a_i + b_i + (⌈log n⌉ + 1)i, r(A_i − B_i)] + ℓ · [(a_{i+1} − a_i) + (b_{i+1} − b_i) + 1]
      ≥ A_i + ℓ · [(a_{i+1} − a_i) + (b_{i+1} − b_i) + 1]        (induction hypothesis)
      ≥ A_i + a_{i+1} − a_i + 1.

Step 2: h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)(i+1), r(A_{i+1} − B_{i+1})] ≥ A_{i+1}.

For j ∈ [⌈log n⌉ + 1], let

    ĥ_j := h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)i + j, r(A_{i+1} − B_{i+1})].

If ĥ_j ≥ A_{i+1} for some j ∈ [⌈log n⌉], then by Claim 12 we know that h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)(i+1), r(A_{i+1} − B_{i+1})] ≥ A_{i+1}, which finishes the inductive step. Otherwise, we know that ĥ_j ∈ [A_i + a_{i+1} − a_i + 1, A_{i+1}) for all j ∈ [⌈log n⌉]. If i = t, then A_t + a_{t+1} − a_t + 1 ≥ n + 1 = A_{t+1}, because each interval strictly between a_t and a_{t+1} has length at most 1; therefore ĥ_j ≥ A_{t+1}, so we are done in this case.

Now assume i < t and consider any j ∈ [⌈log n⌉]. Since every interval between I^A_{a_i} and I^A_{a_{i+1}} has length at most 1, we have that A_{i+1} − |I^A_{a_{i+1}}| + 1 ≤ A_i + a_{i+1} − a_i + 1. Therefore, ĥ_j ∈ [A_{i+1} − |I^A_{a_{i+1}}| + 1, A_{i+1}), so the positions ĥ_j + 1, ..., A_{i+1} of A lie inside the matched interval I^A_{a_{i+1}}, and A[ĥ_j + 1, A_{i+1}] = B[ĥ_j + 1 − A_{i+1} + B_{i+1}, B_{i+1}]. By definition of r, |r(A_{i+1} − B_{i+1}) − (A_{i+1} − B_{i+1})| ≤ ℓ/2. Since MaxShiftAlign_{ℓ,k} is approximately correct,

    MaxShiftAlign_{ℓ,k}(A, B, ĥ_j + 1, ĥ_j − r(A_{i+1} − B_{i+1}) + 1) > (A_{i+1} − ĥ_j)/2.

Therefore, ĥ_{j+1} ≥ ĥ_j + (A_{i+1} − ĥ_j)/2 = (A_{i+1} + ĥ_j)/2. Composing these inequalities, we have that

    ĥ_{j+1} > ((2^j − 1) · A_{i+1} + ĥ_1)/2^j.

For j = ⌈log n⌉, we get that

    ĥ_{⌈log n⌉+1} > ((n − 1)/n) · A_{i+1} ≥ A_{i+1} − 1,

so ĥ_{⌈log n⌉+1} = h[a_{i+1} + b_{i+1} + (⌈log n⌉ + 1)(i+1), r(A_{i+1} − B_{i+1})] ≥ A_{i+1}, as desired.

Conclusion.

Applying the inductive claim with i = t + 1 (and r(A_{t+1} − B_{t+1}) = r(0) = 0), we have that h[a_{t+1} + b_{t+1} + (⌈log n⌉ + 1)(t+1), 0] ≥ n + 1 > n. Thus, we report SMALL as long as a_{t+1} + b_{t+1} + (⌈log n⌉ + 1)(t+1) ≤ k. Observe that

    a_{t+1} + b_{t+1} + (⌈log n⌉ + 1)(t+1) ≤ (2k′+2) + (2k′+2) + (⌈log n⌉ + 1)(2k′+2) ≤ 20⌈log n⌉ · k′ ≤ k.

Therefore, our algorithm always reports SMALL when ED(A, B) ≤ k/(20⌈log n⌉).

Proof of Lemma 14.
Assume that GreedyWave(A, B, k, ℓ) outputs SMALL, and that MaxShiftAlign_{ℓ,k} never failed at being approximately correct. (Since MaxShiftAlign_{ℓ,k} is called at most ˜O(k²/ℓ) ≤ n times and each call satisfies approximate correctness with probability at least 1 − 1/n², this conditioning holds with probability at least 1 − 1/n.) For succinctness, we let h_{i,j} be shorthand for h[i, j].

We prove by induction that for all i and j for which h_{i,j} ≠ −∞,

    ED(A[1, h_{i,j}], B[1, h_{i,j} − j]) ≤ 10iℓ + |j|.

This holds for the base case i = j = 0. Now, we break into cases depending on how h_{i,j} was computed.

Case 1, h_{i,j} = h_{i−1,j} + ℓ. The prefixes of A and B each grow by ℓ characters along the same diagonal, so

    ED(A[1, h_{i,j}], B[1, h_{i,j} − j]) ≤ ℓ + ED(A[1, h_{i−1,j}], B[1, h_{i−1,j} − j]) ≤ ℓ + 10(i−1)ℓ + |j| < 10iℓ + |j|.

Case 2, h_{i,j} = h_{i−1,j−ℓ} + ℓ. Here the prefix of A grows by ℓ characters while the prefix of B is unchanged, so

    ED(A[1, h_{i,j}], B[1, h_{i,j} − j]) ≤ ℓ + ED(A[1, h_{i−1,j−ℓ}], B[1, h_{i−1,j−ℓ} − (j−ℓ)]) ≤ ℓ + 10(i−1)ℓ + |j − ℓ| ≤ 2ℓ + 10(i−1)ℓ + |j| < 10iℓ + |j|.

Case 3, h_{i,j} = h_{i−1,j+ℓ} + ℓ. Here the prefix of A grows by ℓ characters and the prefix of B by 2ℓ, so

    ED(A[1, h_{i,j}], B[1, h_{i,j} − j]) ≤ 2ℓ + ED(A[1, h_{i−1,j+ℓ}], B[1, h_{i−1,j+ℓ} − (j+ℓ)]) ≤ 2ℓ + 10(i−1)ℓ + |j + ℓ| ≤ 3ℓ + 10(i−1)ℓ + |j| < 10iℓ + |j|.

Case 4, h_{i,j} = h_{i−1,j} + d, where d = MaxShiftAlign_{ℓ,k}(A, B, h_{i−1,j} + 1, h_{i−1,j} − j + 1). Since MaxShiftAlign_{ℓ,k} is approximately correct,

    ED(A[h_{i−1,j} + 1, h_{i−1,j} + d], B[h_{i−1,j} − j + 1, h_{i−1,j} − j + d]) ≤ 10ℓ.

Thus,

    ED(A[1, h_{i,j}], B[1, h_{i,j} − j]) ≤ ED(A[1, h_{i−1,j}], B[1, h_{i−1,j} − j]) + ED(A[h_{i−1,j} + 1, h_{i−1,j} + d], B[h_{i−1,j} − j + 1, h_{i−1,j} − j + d]) ≤ 10(i−1)ℓ + |j| + 10ℓ = 10iℓ + |j|.

This completes the induction. Since GreedyWave(A, B, k, ℓ) outputs SMALL, we have that h[k, 0] ≥ n, and hence ED(A, B) ≤ 10kℓ, contradicting ED(A, B) > 10kℓ. Therefore GreedyWave outputs LARGE, as desired.

A.4 Implementing MaxShiftAlign_{ℓ,k}

A.4.1 MaxShiftAlign_{ℓ,k} with No Preprocessing
We use a nearly-identical algorithm to that of Section 5, including looking at O(√k) shifts for both A and B. But, we now sample S ⊆ [1, n] so that each element is included with probability at least min(4 ln n/ℓ, 1) (instead of min(4 ln n/k, 1)).

Algorithm 13 ProcessA_{ℓ,k}(A)
Input: A ∈ Σ^n

for a from −√k to √k
    H_{A,a√k} ← InitRollingHash(A, S + a√k)
return {H_{A,a√k} : a ∈ [−√k, √k]}

Algorithm 14 ProcessB_{ℓ,k}(B)
Input: B ∈ Σ^n

for b from −√k to √k
    H_{B,b} ← InitRollingHash(B, S − b)
return {H_{B,b} : b ∈ [−√k, √k]}

Lemma 15.
MaxShiftAlign_{ℓ,k} is approximately correct.

Proof. First, consider any d ≥ 1 such that B[i_B, i_B + d − 1] has a shifted-(i_A, ℓ)-alignment with A. We seek to show that MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) ≥ d with probability 1. Note that the output of MaxShiftAlign_{ℓ,k} is always at least 2ℓ, so we may assume that d > 2ℓ. By definition of shifted-(i_A, ℓ)-alignment, there exists c ∈ [−ℓ, ℓ] such that A[i_A + c, i_A + c + d − 1] = B[i_B, i_B + d − 1]. Since |c + i_A − i_B| ≤ k + ℓ, there exist a ∈ [⌊(i_A − i_B − ℓ)/√k⌋, ⌈(i_A − i_B + ℓ)/√k⌉] and b ∈ [−√k, √k] such that a√k + b = c + i_A − i_B. Thus, since b ∈ [−√k, √k] ⊆ [−ℓ, ℓ], we have that

    B[i_B + ℓ − b, i_B + d − ℓ − 1 − b] = A[i_A + c + ℓ − b, i_A + c + d − ℓ − 1 − b] = A[i_B + ℓ + a√k, i_B + d − ℓ − 1 + a√k].

Algorithm 15 MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B)
Input: A, B ∈ Σ^n; ℓ ≤ k ≤ n with ℓ ≥ √k; i_A, i_B ∈ [1, n] with |i_A − i_B| ≤ k

d₀ ← 2ℓ; d₁ ← n − i_B + 1
while d₀ ≠ d₁ do
    d_mid ← ⌈(d₀ + d₁)/2⌉
    L_A, L_B ← ∅
    for a from ⌊(i_A − i_B − ℓ)/√k⌋ to ⌈(i_A − i_B + ℓ)/√k⌉
        h ← RetrieveRollingHash(A, S + a√k, H_{A,a√k}, i_B + ℓ + a√k, i_B + d_mid − ℓ − 1 + a√k)
        append h to L_A
    for b from −√k to √k
        h ← RetrieveRollingHash(B, S − b, H_{B,b}, i_B + ℓ − b, i_B + d_mid − ℓ − 1 − b)
        append h to L_B
    sort L_A and L_B
    if L_A ∩ L_B ≠ ∅ then d₀ ← d_mid else d₁ ← d_mid − 1
return d₀

By applying Claim 2, we have with probability 1 that

    RetrieveRollingHash(A, S + a√k, H_{A,a√k}, i_B + ℓ + a√k, i_B + d_mid − ℓ − 1 + a√k) = RetrieveRollingHash(B, S − b, H_{B,b}, i_B + ℓ − b, i_B + d_mid − ℓ − 1 − b)

whenever d_mid ≤ d. Therefore, in the implementation of MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B), whenever d_mid ≤ d, the lists L_A and L_B have nontrivial intersection, so the output of the binary search is at least d, as desired. Thus, MaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) outputs at least the length of the maximal shifted-(i_A, ℓ)-alignment.

Second, we verify that the output corresponds to an approximate shifted-(i_A, ℓ)-alignment. Let d be the output of MaxShiftAlign_{ℓ,k}. Either d = 2ℓ, in which case B[i_B, i_B + d − 1] trivially has an approximate shifted-(i_A, ℓ)-alignment with A (two strings of length 2ℓ are within edit distance 2ℓ ≤ 10ℓ), or d > 2ℓ. In the latter case, for that d, the binary search found that L_A ∩ L_B ≠ ∅, and so there exist a ∈ [⌊(i_A − i_B − ℓ)/√k⌋, ⌈(i_A − i_B + ℓ)/√k⌉] and b ∈ [−√k, √k] such that

    RetrieveRollingHash(A, S + a√k, H_{A,a√k}, i_B + ℓ + a√k, i_B + d − ℓ − 1 + a√k) = RetrieveRollingHash(B, S − b, H_{B,b}, i_B + ℓ − b, i_B + d − ℓ − 1 − b).

Applying Claim 2 over all at most ˜O(√k · √k) = ˜O(k) comparisons of hashes made during the algorithm, with probability at least 1 − 1/n² we must have that

    ED(A[i_B + ℓ + a√k, i_B + d − ℓ − 1 + a√k], B[i_B + ℓ − b, i_B + d − ℓ − 1 − b]) ≤ ℓ.

Let c := a√k + b. Then

    ED(A[i_B + c + ℓ − b, i_B + c + d − ℓ − 1 − b], B[i_B + ℓ − b, i_B + d − ℓ − 1 − b]) ≤ ℓ,

and therefore, since the trimmed prefixes and suffixes have matching lengths and total length 2ℓ,

    ED(A[i_B + c, i_B + c + d − 1], B[i_B, i_B + d − 1]) ≤ 3ℓ.

Moreover, i_B + c = i_B + a√k + b ≥ i_B + [(i_A − i_B − ℓ) − √k] − √k ≥ i_A − 3ℓ, and likewise i_B + c ≤ i_A + 3ℓ. Therefore,

    ED(A[i_B + c, i_B + c + d − 1], A[i_A, i_A + d − 1]) ≤ 6ℓ,

since we need at most 3ℓ insertions and 3ℓ deletions to go between the two substrings. By the triangle inequality we then have that

    ED(A[i_A, i_A + d − 1], B[i_B, i_B + d − 1]) ≤ 9ℓ < 10ℓ,

as desired.

Theorem 16. If ℓ ≥ √k, with no preprocessing, we can distinguish between ED(A, B) ≤ ˜O(k) and ED(A, B) ≥ ˜Ω(kℓ) in ˜O((n + k²)√k/ℓ) time, with probability at least 1 − 1/n.

Proof. By Lemma 15, MaxShiftAlign_{ℓ,k} is approximately correct, so by Lemmas 13 and 14, GreedyWave outputs the correct answer with probability at least 1 − 1/n. Both ProcessA_{ℓ,k} and ProcessB_{ℓ,k} run in ˜O(n√k/ℓ) time. Further, each call to MaxShiftAlign_{ℓ,k} runs in ˜O(√k) time. Since GreedyWave fills ˜O(k²/ℓ) cells and makes that many calls to MaxShiftAlign_{ℓ,k}, the whole computation runs in ˜O((n + k²)√k/ℓ) time.

Setting ℓ = ˜Θ(k^{1−ε}) with ε < 1/2, we get that distinguishing edit distance k from k^{2−ε} can be done in ˜O(n·k^{ε−1/2} + k^{3/2+ε}) time.

A.4.2 MaxShiftAlign_{ℓ,k} with One-Sided and Two-Sided Preprocessing
For the one-sided preprocessing, we mimic Section 6. We preselect S ⊆ [1, n], with each element included i.i.d. with probability q := min(4 ln n/ℓ, 1), and we additionally ensure that every multiple of ℓ (and n − 1) is in S. This only increases the size of S by n/ℓ, and does not hurt the success probability of Claim 2.

Algorithm 16 OneSidedPreprocessA_{ℓ,k}(A)

for a from −k to k
    H_{A,a} ← InitRollingHash(A, S + a)
T_A ← [⌊log n⌋] × [n/ℓ] × [−k/ℓ − 1, k/ℓ + 1] matrix of empty hash tables
for e in [⌊log n⌋]
    for a from −k to k
        for i in ((S + 1) ∪ (S − 2^e + 1)) ∩ [n]
            h ← RetrieveRollingHash(A, S + a, H_{A,a}, i + a, i + 2^e − 1 + a)
            add h to T_A[e, ⌊i/ℓ⌋ + e₁, ⌊a/ℓ⌋ + e₂] for e₁, e₂ ∈ {−1, 0, 1}
return T_A

Claim 17.
OneSidedPreprocessA_{ℓ,k}(A) runs in ˜O(nk/ℓ) time in expectation.

Proof. Computing InitRollingHash(A, S + a) takes |S| = ˜O(n/ℓ) time in expectation, so computing the H_{A,a}'s takes ˜O(nk/ℓ) time. Initializing the hash tables takes ˜O(nk/ℓ) time. The remaining loops take (amortized) ˜O(1) · O(k) · ˜O(n/ℓ) = ˜O(nk/ℓ) time.

Before we call GreedyWave, we need to initialize the hash function for B using OneSidedProcessB_{ℓ,k}(B). This takes ˜O(n/ℓ) time in expectation.

Algorithm 17 OneSidedProcessB_{ℓ,k}(B)

return H_B ← InitRollingHash(B, S)

Algorithm 18 OneSidedMaxShiftAlign_{ℓ,k}(A, B, i_A, i_B)
Input: A, B ∈ Σ^n; ℓ ≤ k ≤ n; i_A, i_B ∈ [1, n] with |i_A − i_B| ≤ k

for d in (2^⌊log n⌋, 2^{⌊log n⌋−1}, ..., 2, 1)
    if RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i_B/ℓ⌋, ⌊(i_A − i_B)/ℓ⌋] then
        return d
return 0

Lemma 18.
OneSidedMaxShiftAlign_{ℓ,k} is approximately correct.

Proof. First, consider the maximal d′ ≥ 1 such that B[i_B, i_B + d′ − 1] has a shifted-(i_A, ℓ)-alignment with A. We seek to show that OneSidedMaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) > d′/2 with probability 1. Let d be the largest power of 2 with d ≤ d′. By definition of shifted-(i_A, ℓ)-alignment, there exists c ∈ [−ℓ, ℓ] such that A[i_A + c, i_A + c + d′ − 1] = B[i_B, i_B + d′ − 1]. Let a := i_A + c − i_B. By applying Claim 2, we have with probability 1 that

    RetrieveRollingHash(A, S + a, H_{A,a}, i_B + a, i_B + d − 1 + a) = RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1).

Let i′_B be the least element of (S + 1) ∪ (S − d + 1) which is at least i_B. Since every multiple of ℓ is in S (as well as n − 1), we have |i′_B − i_B| ≤ ℓ and

    RetrieveRollingHash(A, S + a, H_{A,a}, i_B + a, i_B + d − 1 + a) = RetrieveRollingHash(A, S + a, H_{A,a}, i′_B + a, i′_B + d − 1 + a) ∈ T_A[log d, ⌊i′_B/ℓ⌋, ⌊a/ℓ⌋].

Therefore, in the implementation of OneSidedMaxShiftAlign_{ℓ,k}(A, B, i_A, i_B), when the loop reaches d, we have that

    RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i′_B/ℓ⌋ + e₁, ⌊a/ℓ⌋ + e₂]

for all e₁, e₂ ∈ {−1, 0, 1}. Since |i′_B − i_B| ≤ ℓ and |a − (i_A − i_B)| ≤ ℓ, we have that

    RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i_B/ℓ⌋, ⌊(i_A − i_B)/ℓ⌋].

Thus, OneSidedMaxShiftAlign_{ℓ,k}(A, B, i_A, i_B) will output at least d > d′/2, i.e., more than half the length of the maximal shifted-(i_A, ℓ)-alignment.

Second, we verify that the output corresponds to an approximate shifted-(i_A, ℓ)-alignment. Let d be the output of OneSidedMaxShiftAlign_{ℓ,k}. Either d = 0, in which case B[i_B, i_B + d − 1] trivially has an approximate shifted-(i_A, ℓ)-alignment with A, or d ≥ 1. For that d, the search found that RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1) ∈ T_A[log d, ⌊i_B/ℓ⌋, ⌊(i_A − i_B)/ℓ⌋]. Thus, there exist i′_A with |⌊i′_A/ℓ⌋ − ⌊i_B/ℓ⌋| ≤ 1 and a ∈ [−k, k] with |⌊a/ℓ⌋ − ⌊(i_A − i_B)/ℓ⌋| ≤ 1    (1)

such that

    RetrieveRollingHash(A, S + a, H_{A,a}, i′_A + a, i′_A + d − 1 + a) = RetrieveRollingHash(B, S, H_B, i_B, i_B + d − 1).

Applying Claim 2 over all ˜O(k) potential comparisons of hashes made during the algorithm, with probability at least 1 − 1/n² we must have that

    ED(A[i′_A + a, i′_A + a + d − 1], B[i_B, i_B + d − 1]) ≤ ℓ.

By (1),

    |i′_A + a − i_A| ≤ |i′_A − i_B| + |a − (i_A − i_B)| ≤ 2ℓ + 2ℓ = 4ℓ.

Therefore,

    ED(A[i_A, i_A + d − 1], B[i_B, i_B + d − 1]) ≤ ℓ + 2 · 4ℓ = 9ℓ ≤ 10ℓ.

Therefore, B[i_B, i_B + d − 1] has an approximate shifted-(i_A, ℓ)-alignment with A, as desired.

Theorem 19.
For all A, B ∈ Σ^n: when A is preprocessed in ˜O(nk/ℓ) time in expectation, we can distinguish between ED(A, B) ≤ ˜O(k) and ED(A, B) ≥ ˜Ω(kℓ) in ˜O((n + k²)/ℓ) time, with probability at least 1 − 1/n over the random bits in the preprocessing (oblivious to B).

Proof. By Lemma 18, OneSidedMaxShiftAlign_{ℓ,k} is approximately correct, so by Lemmas 13 and 14, GreedyWave outputs the correct answer with probability at least 1 − 1/n. As proved in Claim 17, the preprocessing runs in ˜O(nk/ℓ) time. Also, OneSidedProcessB_{ℓ,k} runs in ˜O(n/ℓ) time. Further, each call to OneSidedMaxShiftAlign_{ℓ,k} runs in ˜O(1) time, as the power-of-two search, the hash computations, and the table lookups are all ˜O(1) operations. Therefore, GreedyWave runs in ˜O(k²/ℓ) time, and the whole computation takes ˜O((n + k²)/ℓ) time.

Thus, if ℓ = k^{1−ε}, the preprocessing takes ˜O(n · k^ε) time, and the query time is ˜O(n/k^{1−ε} + k^{1+ε}).

If we are in the two-sided preprocessing model, we can run OneSidedPreprocessA_{ℓ,k} and OneSidedProcessB_{ℓ,k} for both A and B. Then, we can just run GreedyWave, which takes ˜O(k²/ℓ) time. We thus have the following corollary.

Corollary 20. For all A, B ∈ Σ^n: when A and B are preprocessed in ˜O(nk/ℓ) time in expectation, we can distinguish between ED(A, B) ≤ ˜O(k) and ED(A, B) ≥ ˜Ω(kℓ) in ˜O(k²/ℓ) time, with probability at least 1 − 1/n over the random bits in the preprocessing.

B Omitted Proofs
B.1 Proof of Lemma 3
Proof.
We prove this by induction on k. If k = 0, then we can partition A and B each into a single part, and these parts are matched. Assume we have proved the theorem for all edit distances up to some k₀ ≥ 0, and consider A and B with ED(A, B) = k₀ + 1. Then there exists B′ ∈ Σ* such that ED(A, B′) = k₀ and ED(B′, B) = 1. By the induction hypothesis, B′ can be partitioned into intervals I_1, ..., I_{2k₀+1} that are each of length at most 1 or are equal to some interval of A up to a shift of k₀. We now break into cases.

Case 1: a character of B′ is substituted to make B. Let I_j be the interval this substitution occurs in. We split I_j into three (some possibly empty) intervals I^(1)_j, I^(2)_j, I^(3)_j: the intervals before, at, and after the substitution. We have the partition of B into 2k₀ + 3 intervals: I_1, ..., I_{j−1}, I^(1)_j, I^(2)_j, I^(3)_j, I_{j+1}, ..., I_{2k₀+1}. Every interval is of length at most 1 or corresponds to an equal substring of A up to a shift of k₀ ≤ k₀ + 1.

Case 2: a character of B′ is deleted to make B. Let I_j be the interval this deletion occurs in. We split I_j into three intervals I^(1)_j, I^(2)_j, I^(3)_j: the intervals before, at, and after the deletion (the middle interval is empty). We have the partition of B into 2k₀ + 3 intervals: I_1, ..., I_{j−1}, I^(1)_j, I^(2)_j, I^(3)_j, I_{j+1} − 1, ..., I_{2k₀+1} − 1, where the later intervals shift left by one position. Every interval is of length at most 1 or corresponds to an equal substring of A up to a shift of k₀ + 1.

Case 3.