Approximating the Geometric Edit Distance∗
Kyle Fox†    Xinyi Li‡

September 10, 2020
Abstract
Edit distance is a measurement of similarity between two sequences such as strings, point sequences, or polygonal curves. Many matching problems from a variety of areas, such as signal analysis, bioinformatics, etc., need to be solved in a geometric space. Therefore, the geometric edit distance (GED) has been studied. In this paper, we describe the first near-linear time algorithm that computes a strictly sublinear approximation of the GED of two point sequences in constant-dimensional Euclidean space. Specifically, we present a randomized O(n log n)-time O(√n)-approximation algorithm. Then, we generalize our result to give a randomized α-approximation algorithm for any α ∈ [√(log n), √(n/log n)], running in O((n²/α²) log n) time. Both algorithms are Monte Carlo and return approximately optimal solutions with high probability.

Introduction

Ordered sequences are frequently studied objects in the context of similarity measurements, because sequence alignment plays a vital role in trajectory comparison and pattern recognition. As a consequence, several metrics have been developed to measure the similarity of two sequences, e.g., the Fréchet distance, dynamic time warping, and their variations. Geometric edit distance, a natural extension of the string metric to geometric spaces, is the focus of this paper. This concept was formally introduced by Agarwal et al. [2]; however, a similar idea (extending string edit distance to a geometric space) has been applied in other ways during the past decade. Examples include an l_p-type edit distance for biological sequence comparison [3], ERP (Edit distance with Real Penalty) [4], EDR (Edit Distance on Real sequence) [5], TWED (Time Warping Edit Distance) [6], and a matching framework from Swaminathan et al. [7] motivated by computing the similarity of time series and trajectories. See also a survey by Wang et al. [8].

Problem statement.
Geometric Edit Distance (GED) is the minimum cost of any matching between two geometric point sequences that respects order along the sequences. The cost includes a constant penalty for each unmatched point.

Formally, let P = <p_1, ..., p_m> and Q = <q_1, ..., q_n> be two point sequences in ℝ^d for some constant d. A monotone matching M is a set of index pairs {(i_1, j_1), ..., (i_k, j_k)} such that the first elements (respectively, second elements) are distinct and, for any two elements (i, j) and (i′, j′) in M, i < i′ if j < j′. We call every unmatched point a gap point. Let Γ(M) be the set of all gap points. The cost of M is defined as

    δ(M) = Σ_{(i,j)∈M} dist(p_i, q_j) + ρ(Γ(M)),    (1)

where dist(p, q) is the distance between points p and q (i.e., the Euclidean norm), and ρ(Γ(M)) is a function of all gap points, which is known as a gap penalty function. The use of gap points and the gap penalty function allows us to recognize good matchings even in the presence of outlier points. The distance is sensitive to scaling, so we may only match point pairs that are sufficiently close together. In this paper, we use a linear gap function; that is, ρ(Γ(M)) = |Γ(M)| · ℓ, where ℓ is a constant parameter called the gap penalty. Without loss of generality, we may assume ℓ = 1 when designing algorithms.

∗A preliminary version of this work appeared in the Proceedings of the 30th International Symposium on Algorithms and Computation [1]. Most of the work was done while the second author was a student at the University of Texas at Dallas.
†The University of Texas at Dallas, [email protected]
‡The University of Utah, xin [email protected]

Definition 1.
We denote the GED between two sequences
P, Q as:
    GED(P, Q) = min_M δ(M) = min_M ( Σ_{(i,j)∈M} dist(p_i, q_j) + |Γ(M)| ),

where the minimum is taken over all monotone matchings.
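As a concrete illustration of this definition (not from the paper; the function name and the unit gap penalty ℓ = 1 are our choices), GED can be evaluated by a Levenshtein-style dynamic program in which the substitution cost is replaced by the interpoint distance:

```python
import math

def ged(P, Q):
    """O(mn)-time dynamic program for GED with gap penalty ell = 1.

    D[i][j] holds the GED of the prefixes <p_1..p_i> and <q_1..q_j>.
    """
    m, n = len(P), len(Q)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = float(i)      # p_1..p_i are all gap points
    for j in range(1, n + 1):
        D[0][j] = float(j)      # q_1..q_j are all gap points
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,    # p_i is a gap point
                D[i][j - 1] + 1,    # q_j is a gap point
                D[i - 1][j - 1] + math.dist(P[i - 1], Q[j - 1]),  # match p_i, q_j
            )
    return D[m][n]
```

Because a gap point costs 1, matching a pair is only worthwhile when their distance is below 2; for instance, ged([(0.0, 0.0)], [(5.0, 0.0)]) leaves both points unmatched at cost 2.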
Prior work. To simplify the presentation of prior work, we assume n ≥ m. It is trivial to compute GED(P, Q) in O(mn) time by simply changing the cost of substitution in the original string edit distance (Levenshtein distance) dynamic programming algorithm [9]. Assuming k is the GED, we can achieve an O(nk)-time algorithm by restricting our attention to the middle k diagonals of the dynamic programming table (see also Ukkonen [10]). There is a slightly subquadratic O(n²/log² n)-time algorithm [11] for the string edit distance, but it appears unlikely we can apply it directly to the geometric case. Accordingly, Gold and Sharir [12] proposed a different algorithm which can compute GED, as well as the closely related dynamic time warping (DTW) distance, in O(n² log log log n / log log n) time in polyhedral metric spaces. Recent papers have shown conditional lower bounds for several sequence distance measures, even with some restrictions: assuming the Strong Exponential Time Hypothesis (SETH) [13], there is no O(n^{2−δ})-time algorithm for any constant δ > 0 for several of these distances. The latter of the above results implies the same conditional lower bound for GED, even assuming the sequences consist entirely of 0s and 1s.

Given these lower bounds, it is natural to consider approximation algorithms. There is an α-approximation algorithm for the discrete Fréchet distance that runs in time O(n log n + n²/α) for any α ∈ [1, n]. Chan and Rahmati [19] improved this running time to O(n log n + n²/α²). Very recently, Kuszmaul [20] provided O(α)-approximation algorithms with O((n²/α) polylog n) running times for edit distance over arbitrary metric spaces and DTW over well-separated tree metrics.

The (discrete) Fréchet and DTW distances are defined similarly to GED; however, they use one-to-many correspondences instead of one-to-one matchings, and they disallow the use of gap points.
As in GED, DTW aims to minimize the sum of distances between corresponding points, while the discrete Fréchet distance aims to minimize the maximum distance over corresponding points.

Another O(n²/α)-time algorithm with an O(α)-approximation factor for string edit distance is to run Ukkonen's [10] O(nk)-time algorithm letting k be n/α, and unmatch all characters if this algorithm cannot return the optimal matching. Similarly, we can obtain a different O(α)-approximation algorithm for GED running in O(n²/α) time by making use of the O(nk)-time exact algorithm mentioned above. There are many other approximation algorithms specialized for the string version of edit distance. In particular, an O(√n)-approximation algorithm with linear running time can be acquired easily from an O(n + k²)-time exact algorithm [23]. More recent results include algorithms with (log n)^{O(1/ε)} [21] and constant [22] approximation ratios with different running time tradeoffs. The latest result in this line of work is an O(n^{1+ε})-time constant-factor approximation algorithm for any ε > 0. For GED, a trivial O(n)-approximation algorithm was observed by Agarwal et al. [2]. In the same paper, they also offered a subquadratic-time (near-linear time in some scenarios) approximation scheme on several well-behaved families of sequences. Using the properties of these families, they reduced the search space used to find the optimal admissible path in the dynamic programming graph [2].

Our results.
Inspired by the above applications and prior work, we commit to finding a faster approach to approximating GED between general point sequences while also returning the approximately best matching. Here, we give the first near-linear time algorithm to compute GED with a strictly sublinear approximation factor. We then generalize our result to achieve a tradeoff between the running time and the approximation factor. Both of these algorithms are Monte Carlo algorithms, returning an approximately best matching with high probability. To simplify our exposition, we assume the points are located in the plane (i.e., d = 2), and we assume the input sequences have the same length (i.e., m = n). We can easily extend our results to the unbalanced case, and our analysis implies that outside the plane, the running times and approximation ratios increase only by a factor polynomial in d.

Theorem 1.
Given two point sequences P and Q in ℝ², each with n points, there exists an O(n log n)-time randomized algorithm that computes an O(√n)-approximate monotone matching for geometric edit distance with high probability.

The intuitive idea behind this algorithm is very simple. We check whether the GED is less than each of several geometrically increasing values g, each of which is at most O(√n). For each g, we transform the geometric sequences into strings using a randomly shifted grid and run the O(n + k²)-time exact algorithm for strings [23]. If the GED is less than g, then we get an O(√n)-approximate matching. If we never find a matching of sufficiently low cost, we simply leave all points unmatched, as this empty matching is an O(√n)-approximation for GED with high probability. We give the details for this O(√n)-approximation algorithm in Section 2.

Theorem 2.
Given two point sequences P and Q in ℝ², each with n points, there exists an O((n²/α²) log n)-time randomized algorithm that computes an O(α)-approximate monotone matching for geometric edit distance with high probability for any α ∈ [√(log n), √(n/log n)].

The second algorithm uses similar techniques to the former, except we can no longer use the string edit distance algorithm as a black box. In particular, we cannot achieve our desired time-approximation tradeoff by just directly altering some parameters in our first algorithm. We discuss why in Section 3.1. To overcome these difficulties, we develop a constant-factor approximation algorithm to compute the GED of point sequences obtained by snapping points of the original sequences to the corners of grid cells. We present our extension of the O(n + k²)-time algorithm for strings in Section 4.1, and then describe our constant-factor approximation algorithm for points in Section 4.2. We note that a key component of the string algorithm and our extension is a fast method for finding maximal-length common substrings from a given pair of starting positions in two strings A and B. A similar procedure was needed in the discrete Fréchet distance approximation of Chan and Rahmati [19]. In Section 3, we present the algorithm for Theorem 2, using our constant-factor approximation algorithm for snapped point sequences as a black box. (We say an event occurs with high probability if it occurs with probability at least 1 − n^{−c} for some constant c > 0.)

O(√n)-Approximation for GED

Recall that the main part of our algorithm is a decision procedure to check whether the GED is less than a guess value g. There are two steps in this process:

1. Transform the point sequences into strings. To be specific, we partition nearby points into common groups and distant points into different groups to simulate the identical characters and different characters in the string version of edit distance.

2. Run a modification of the exact string edit distance algorithm of Landau et al. [23].
To better serve us when discussing geometric edit distance, we aim to minimize the number of insertions and deletions needed to turn S into T only; we consider substitution to have infinite cost. Details on this modified algorithm appear in Section 4.1.

We explain how to transform the point sequences into strings in Section 2.1, and we analyze the approximation factor and running time in Sections 2.2 and 2.3. For convenience, we refer to the string edit distance algorithm as
SED(S, T, k), where S and T are two strings of equal length. This algorithm returns a matching in O(n + k²) time if the edit distance is at most k. We give an outline of our algorithm as Algorithm 1. Here, c is a sufficiently large constant, and we use lg to denote the logarithm base 2.

As stated above, the transformation technique should partition nearby points into common groups and distant points into different groups. We use a randomly shifted grid to realize this idea; see Har-Peled [25] for an introduction to randomly shifted grids.

Recall that P and Q lie in ℝ². We cover the space with a grid. Let the side length of each grid cell be ∆, and let b be a vector chosen uniformly at random from [0, ∆]². Starting from an arbitrary position, the grid is shifted b_i units in each dimension i. For a point p, let id_{∆,b}(p) denote the cell which contains p in this configuration. We consider two points p_1 = (x_1, y_1) and p_2 = (x_2, y_2) in this space.

Lemma 1.
We have P( id_{∆,b}(p_1) ≠ id_{∆,b}(p_2) ) ≤ min{ (|x_1 − x_2| + |y_1 − y_2|)/∆, 1 }.

We use this observation in our algorithm and set ∆ = g/√n as each cell's side length. (This variant of the string edit distance is really closer to the shortest common supersequence length of the strings rather than the traditional Levenshtein distance, but we stick with "edit distance" for simplicity.)
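The transformation can be sketched as follows (an illustrative fragment; the function names are ours): each point is mapped to the id of its grid cell under the random shift b, so points falling into a common cell become identical characters.

```python
import math
import random

def cell_id(p, delta, b):
    """id_{delta,b}(p): the grid cell containing p, for cells of side length
    delta and a grid shifted by b = (b_1, b_2)."""
    return (math.floor((p[0] - b[0]) / delta),
            math.floor((p[1] - b[1]) / delta))

def to_strings(P, Q, delta):
    """Map point sequences P and Q to 'strings' S and T over cell ids."""
    b = (random.uniform(0, delta), random.uniform(0, delta))
    S = [cell_id(p, delta, b) for p in P]
    T = [cell_id(q, delta, b) for q in Q]
    return S, T
```

With ∆ = g/√n, Lemma 1 bounds the probability that a pair at distance d receives two different characters by roughly d√n/g, up to a small constant.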
Algorithm 1: O(√n)-approximation algorithm for GED
Input: Point sequences P and Q
Output: An approximately optimal matching for GED
1  if Σ_{i=1}^{n} dist(p_i, q_i) ≤ 1 then
2      return the matching {(1, 1), ..., (n, n)}
3  else
4      for i := 0 to ⌈lg √n⌉ do
5          g := 2^i
6          for j := 1 to ⌈c lg n⌉ do
7              Transform P, Q to strings S, T using a randomly shifted grid
8              out := SED(S, T, 12√n + 2g)
9              if out ≠ false then return out
10         end
11     end
12     return the empty matching
13 end

We claim the running time for Algorithm 1 is O(n log n). Computing Σ_{i=1}^{n} dist(p_i, q_i) takes O(n) time. In the inner loop, the transformation operation (line 7) takes O(n) time assuming use of a hash table. The running time for SED(S, T, 12√n + 2g) is O(n), because 12√n + 2g = O(√n) for g = O(√n). Summing over the outer loop and inner loop, the overall running time for Algorithm 1 is

    Σ_{i=0}^{⌈lg √n⌉} Σ_{j=1}^{⌈c lg n⌉} O(n) = O(n log n).

In this section, we show that Algorithm 1 returns an O(√n)-approximate matching with high probability.
Notation. For any monotone matching M, we define C_S(M) as the cost of the corresponding edit operations for M in the string case, and C_G(M) to be δ(M) as defined in (1) for the geometric case (as stated, there is no substitution operation in our modified string case). Let M*_G be the optimal matching for geometric edit distance, and let M*_S denote the optimal matching under the string configuration during a given iteration of the inner for loop. Our goal is to establish the relationship between C_G(M*_G) and C_G(M*_S).

Lemma 2.
Consider an iteration of the outer for loop, and suppose
GED(P, Q) ≤ g. With probability at least 1 − n^{−c}, at least one of the ⌈c lg n⌉ iterations of the inner for loop will return a matching M*_S where C_S(M*_S) ≤ 12√n + 2g.

Proof:
Let M be a monotone matching, and let UM_M be the set of unmatched indices. There are four subsets of pairs in M:

• OC_M: In each pair, both indices' points fall into One cell, and the distance between the two points is less than or equal to ∆ = g/√n (Close).
• OF_M: In each pair, both indices' points fall into One cell, and the distance between the two points is larger than ∆ = g/√n (Far).
• DC_M: In each pair, the indices' points are in Different cells, and the distance between the two points is less than or equal to ∆ = g/√n (Close).
• DF_M: In each pair, the indices' points are in Different cells, and the distance between the two points is larger than ∆ = g/√n (Far).

These sets are disjoint, so

    C_G(M*_G) = |UM_{M*_G}| + Σ_{(i,j)∈OC_{M*_G}} dist(p_i, q_j) + Σ_{(i,j)∈OF_{M*_G}} dist(p_i, q_j) + Σ_{(i,j)∈DC_{M*_G}} dist(p_i, q_j) + Σ_{(i,j)∈DF_{M*_G}} dist(p_i, q_j).    (2)

Recall that there is no substitution operation in our version of the string edit distance. Therefore, to better understand optimal matchings for string edit distance, we unmatch all the pairs in DC_{M*_G} and DF_{M*_G}, forming a new matching M*′_G. Points in one cell are regarded as identical characters while those in different cells are different characters. Therefore,

    C_S(M*′_G) = |UM_{M*_G}| + 0 · (|OC_{M*_G}| + |OF_{M*_G}|) + 2 · (|DC_{M*_G}| + |DF_{M*_G}|)
               = |UM_{M*_G}| + 2 · (|DC_{M*_G}| + |DF_{M*_G}|).

Observe that there are at most g/(g/√n) = √n pairs in DF_{M*_G} if C_G(M*_G) ≤ g. Therefore,

    C_S(M*_S) ≤ C_S(M*′_G) = |UM_{M*_G}| + 2|DC_{M*_G}| + 2|DF_{M*_G}| ≤ g + 2√n + 2|DC_{M*_G}|.    (3)

For any two points p_i, q_j, let P_D(i, j) be the probability that p_i and q_j are assigned into different cells.
From Lemma 1, we can infer P_D(i, j) ≤ 2 dist(p_i, q_j)/(g/√n). Then,

    E(|DC_{M*_G}|) ≤ Σ_{(i,j)∈DC_{M*_G}} P_D(i, j) ≤ Σ_{(i,j)∈DC_{M*_G}} 2 dist(p_i, q_j)/(g/√n) ≤ 2√n.    (4)

Therefore, E(C_S(M*_S)) ≤ 6√n + g. By Markov's inequality, P[ C_S(M*_S) ≥ 12√n + 2g ] ≤ 1/2.
In other words, SED(S, T, 12√n + 2g) will fail with probability at most 1/2 if GED(P, Q) ≤ g. So, if we test SED(S, T, 12√n + 2g) ⌈c lg n⌉ times, at least one iteration will return a matching if GED(P, Q) ≤ g with probability greater than or equal to

    1 − Π_{j=1}^{⌈c lg n⌉} P[ C_S(M*_S) ≥ 12√n + 2g ] ≥ 1 − Π_{j=1}^{⌈c lg n⌉} (1/2) = 1 − 2^{−⌈c lg n⌉} ≥ 1 − n^{−c}.

We conclude the proof of Lemma 2. □
According to Lemma 2, with high probability, we obtain a matching M*_S such that C_S(M*_S) ≤ 12√n + 2g if GED(P, Q) ≤ g. We now consider C_G(M*_S). Again, UM_M is the set of unmatched indices for a matching M. Observe that, for all (i, j) ∈ M*_S, points p_i and q_j lie in the same grid cell. Therefore, dist(p_i, q_j) ≤ √2 · g/√n if (i, j) ∈ M*_S. We have:

    C_G(M*_S) = |UM_{M*_S}| + Σ_{(i,j)∈M*_S} dist(p_i, q_j)    (5)
              ≤ 12√n + 2g + n · (√2 g/√n) = 12√n + 2g + √2 g√n.

If GED(P, Q) ≤ √n, then, with high probability, we obtain a matching M*_S by the end of the outer for loop iteration where g ≥ GED(P, Q) ≥ g/2. The cost of this matching is at most 12√n + 2g + √2 g√n = O(√n) · GED(P, Q). The same approximation bound holds if
GED(P, Q) > √n, whether or not we find a matching during the outer for loop. We conclude the proof of Theorem 1.

O(α)-Approximation for GED

We now discuss our O(α)-approximation algorithm for any α ∈ [√(log n), √(n/log n)]. A natural approach for extending our O(√n)-approximation is using the same reduction to string edit distance but letting the cell side length be a variable depending on the approximation factor α. We argue, however, that this approach does not lead to a good approximation.

Modifying the O(√n)-approximation to achieve a tradeoff

Suppose we try to modify the O(√n)-approximation algorithm by simply changing the side length of cells. Let ∆_α be the cell side length, which depends on the approximation factor α. We need to obtain a matching M*_S with high probability such that C_G(M*_S) ≤ g · O(α) during any iteration of the outer for loop with GED(P, Q) ≤ g.

There can be at most n matched pairs in M*_S. Following (5), we need n · ∆_α ≤ g · O(α), implying ∆_α ≤ O(gα/n). On the other hand, we have C_S(M*_S) ≤ C_G(M*_S) ≤ g · O(α) in our analysis. We then derived the 2√n term in (3) as 2g/∆_α. We now need 2g/∆_α ≤ g · O(α), implying ∆_α ≥ Ω(1/α). Both inequalities can hold for α = √n or for large values of g. But for small α and small g, we cannot have both inequalities be true. Therefore, we require a different approach to obtain our desired approximation factor and running time tradeoff.

O(α)-approximation algorithm based on grid-snapping

Grid-snapping.
Instead of simply grouping points by their cells as in the O(√n)-approximation algorithm, we snap points to the lower-left corners of their respective grid cells. Let P′ = <p′_1, ..., p′_n> and Q′ = <q′_1, ..., q′_n> be the sequences after grid-snapping. Let ∆ be the cell side length of the grid. We immediately obtain the following observation:

Observation 1.
For any p_i and q_j from P and Q, respectively, we have dist(p′_i, q′_j) ≤ dist(p_i, q_j) + 2√2 ∆. Moreover, if p_i and q_j are in different cells, dist(p′_i, q′_j) ≥ ∆.

We can then obtain our O(α)-approximation algorithm by altering the bound in the outer loop and the test procedure of Algorithm 1. See Algorithm 2. Here, AGED(P′, Q′, k) attempts to Approximate GED(P′, Q′) given that P′ and Q′ have their points on the corners of the grid cells. If GED(P′, Q′) ≤ k, then it returns an O(1)-approximate matching for GED(P′, Q′). Otherwise, it either returns an O(1)-approximate matching or it returns false.
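As a quick numerical sanity check of the two bounds in Observation 1 (our own sketch; the helper names are ours):

```python
import math
import random

def snap(p, delta, b):
    """Snap p to the lower-left corner of its grid cell (side length delta,
    grid shifted by b)."""
    return (b[0] + math.floor((p[0] - b[0]) / delta) * delta,
            b[1] + math.floor((p[1] - b[1]) / delta) * delta)

random.seed(7)
delta = 1.0
b = (random.uniform(0, delta), random.uniform(0, delta))
for _ in range(1000):
    p = (random.uniform(-5, 5), random.uniform(-5, 5))
    q = (random.uniform(-5, 5), random.uniform(-5, 5))
    ps, qs = snap(p, delta, b), snap(q, delta, b)
    # Each point moves at most sqrt(2)*delta, so snapping stretches the
    # distance by at most 2*sqrt(2)*delta.
    assert math.dist(ps, qs) <= math.dist(p, q) + 2 * math.sqrt(2) * delta + 1e-9
    # Corners of distinct cells differ by a multiple of delta in a coordinate.
    if ps != qs:
        assert math.dist(ps, qs) >= delta - 1e-9
```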
Algorithm 2: O(α)-approximation algorithm
Input: Point sequences P and Q
Output: An approximately optimal matching for GED
1  if Σ_{i=1}^{n} dist(p_i, q_i) ≤ 1 then
2      return the matching {(1, 1), ..., (n, n)}
3  else
4      for i := 0 to ⌈lg(n/α)⌉ do
5          g := 2^i
6          for j := 1 to ⌈c lg n⌉ do
7              Obtain P′, Q′ by applying grid-snapping to P, Q based on a randomly shifted grid
8              out := AGED(P′, Q′, (4√2 + 6)g)
9              if out ≠ false then return out
10         end
11     end
12     return the empty matching
13 end

We describe how to implement
AGED(P′, Q′, k) in Section 4.2. The running time of our implementation is O(n + k²/∆), where ∆ is the cell side length of the grid. We do grid-snapping in O(n) time. For each g = 2^i, we use cells of side length gα/n and set k to (4√2 + 6)g, so the overall running time of our O(α)-approximation algorithm is

    O(n) + Σ_{i=0}^{⌈lg(n/α)⌉} Σ_{j=1}^{⌈c lg n⌉} O(n + 2^i n/α) = Σ_{i=0}^{⌈lg(n/α)⌉} O(n log n + 2^i (n/α) log n) = O(n log n + (n²/α²) log n).

If α < √(log n), this expression exceeds n², so we could just run the classic O(n²) dynamic programming algorithm if we need an approximation factor less than √(log n). On the other hand, the O(n log n) term in the running time is asymptotically insignificant if α ≤ √(n/log n). As a result, the running time is O((n²/α²) log n) for any α ∈ [√(log n), √(n/log n)].

The analysis for the O(α)-approximation algorithm is similar to the first algorithm. First, we introduce some additional notation to that used in Section 2.3. Let C_GS(M) be the cost of any monotone matching M using distances between the grid-snapped points of P′ and Q′. Let M*_GS be the optimal matching for P′ and Q′, i.e., C_GS(M*_GS) = GED(P′, Q′). Let M_AGS be the matching returned by
AGED(P′, Q′, (4√2 + 6)g). We have the following lemma.

Lemma 3. If GED(P, Q) ≤ g, then with probability at least 1 − n^{−c}, at least one of the ⌈c lg n⌉ iterations will return a matching M_AGS.

Proof:
Recall, we said pairs of points are close if their distance is less than or equal to ∆. Similar to (2), and using Observation 1, we have

    C_GS(M*_G) = |UM_{M*_G}| + 0 · (|OC_{M*_G}| + |OF_{M*_G}|) + Σ_{(i,j)∈DC_{M*_G}} dist(p′_i, q′_j) + Σ_{(i,j)∈DF_{M*_G}} dist(p′_i, q′_j)
               ≤ |UM_{M*_G}| + ∆ · |DC_{M*_G}| + Σ_{(i,j)∈DF_{M*_G}} ( dist(p_i, q_j) + 2√2 ∆ )
               = |UM_{M*_G}| + Σ_{(i,j)∈DF_{M*_G}} dist(p_i, q_j) + ∆ · |DC_{M*_G}| + 2√2 ∆ · |DF_{M*_G}|.

If C_G(M*_G) ≤ g, then C_GS(M*_GS) ≤ g + ∆ · |DC_{M*_G}| + 2√2 ∆ · |DF_{M*_G}|. We have the same observation for DF_{M*_G} as before; that is, there are at most g/∆ pairs in DF_{M*_G}. Using the same algebra as (4), we have E(|DC_{M*_G}|) ≤ 2g/∆. So,

    E(C_GS(M*_GS)) ≤ g + ∆ · (2g/∆) + 2√2 ∆ · (g/∆) = 2√2 g + 3g.

According to Markov's inequality, we know

    P( C_GS(M*_GS) ≥ (4√2 + 6)g ) ≤ 1/2.

In Section 4.2, we prove that if C_GS(M*_GS) = GED(P′, Q′) ≤ (4√2 + 6)g, then AGED(P′, Q′, (4√2 + 6)g) will return a constant-factor approximate matching M_AGS. So, if we test AGED(P′, Q′, (4√2 + 6)g) ⌈c lg n⌉ times (using different grids each time), then with probability at least 1 − n^{−c}, at least one call to AGED(P′, Q′, (4√2 + 6)g) will return a matching M_AGS. We conclude the proof of Lemma 3. □

For each pair (i, j) in M_AGS, we have dist(p_i, q_j) ≤ dist(p′_i, q′_j) + 2√2 ∆. Therefore,

    C_G(M_AGS) = |UM_{M_AGS}| + Σ_{(i,j)∈M_AGS} dist(p_i, q_j)
               ≤ |UM_{M_AGS}| + Σ_{(i,j)∈M_AGS} dist(p′_i, q′_j) + 2√2 ∆ · |M_AGS|
               ≤ O(1) · (4√2 + 6)g + n · 2√2 ∆.

Recall, ∆ = gα/n. If we obtain a matching M_AGS during an iteration where g ≥ C_G(M*_G) = GED(P, Q) ≥ g/2, then C_G(M_AGS) ≤ O(gα) = O(α) · GED(P, Q). Finishing with the same argument as in Theorem 1, we conclude our proof of Theorem 2.
Computing AGED(P′, Q′, k)

Recall that our constant-factor approximation algorithm for the GED of grid corner points is based on a known O(n + k²)-time exact algorithm for string edit distance [23]. We first describe this exact algorithm for strings, which we refer to as SED(S, T, k), in Section 4.1. Then, in Section 4.2, we modify this string algorithm to obtain an O(1)-approximate matching for the edit distance between point sequences P′ and Q′, assuming the points lie on the corners of grid cells and GED(P′, Q′) ≤ k.

The O(n + k²) string edit distance algorithm

Dynamic programming matrix and its properties.
Let S = <s_1, s_2, ..., s_n> and T = <t_1, t_2, ..., t_n> be two strings of length n, and let D(i, j) denote the edit distance between the prefixes <s_1, ..., s_i> and <t_1, ..., t_j>.

Property 1. D(i, j) = min{ D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + |s_i t_j| }, where |s_i t_j| = 0 if s_i = t_j, and ∞ otherwise.

Property 2. D(i, 0) = i, and D(0, j) = j.

Property 3. D(i, i + h) is even if and only if h is even.

Property 4. D(i, j) − D(i − 1, j − 1) ∈ {0, 2}.

Property 4 can be easily derived from Property 3 and induction on i + j (see Lemma 3 of [10]). From Property 4, we know all the diagonals are non-decreasing. In particular, all values on diagonal h are at least |h| considering Property 2. So, we can just search the band from diagonal −k to k if the edit distance between S and T is at most k.

Figure 1. (a) The diagonal containing any entry (i, i + h) is diagonal h. (b) The algorithm slides down the diagonal until finding an entry representing distinct characters. A circle means the corresponding two characters are the same; a cross means they are different.

Algorithm for edit distance at most k. We use a greedy approach to fill the entries along each diagonal. For each value e ∈ {0, . . . , k} (the outer loop), we locate the entries whose value is e by inspecting diagonals −e to e (the inner loop). Finally, we return the best matching if D(n, n) is covered by the above search. Otherwise, the edit distance is greater than k.

The key insight is that we can implicitly find all entries containing e efficiently in each round. We first define L_{h,e} as the row index of the farthest e entry in diagonal h.

Definition 2. L_{h,e} = max{ i | D(i, i + h) = e }.

Note by Property 3, L_{h,e} is well-defined only if h ≡ e (mod 2). Observe that all values on diagonal h are at least |h|, so we can define our initial values as:

    L_{h,|h|−2} = |h| − 1 if h < 0, and −1 otherwise, where h ∈ [−k, k].

Let r = max{ L_{h−1,e−1}, L_{h+1,e−1} + 1 }. Then, D(r, r + h) = e by Properties 1 and 4. Also, if D(r, r + h) = e and s_{r+1} = t_{r+1+h}, then D(r + 1, r + 1 + h) = e. From these observations, we can compute L_{h,e} in each inner loop using Algorithm 3 below.
Algorithm 3: Computing L_{h,e} in each inner loop
1  r := max{ L_{h−1,e−1}, L_{h+1,e−1} + 1 }
2  while r + 1 ≤ n, r + h + 1 ≤ n, and s_{r+1} == t_{r+1+h} do
3      r := r + 1 ;  /* slide */
4  end
5  if r > n or r + h > n then L_{h,e} := ∞ else L_{h,e} := r end

We call lines 2 through 4 "the slide". It is straightforward to recover the optimal matching by using the L_{h,e} values to trace backwards through the dynamic programming matrix. Fig. 1(b) demonstrates this process.

We can perform slides in constant time each after some O(n)-time preprocessing at the beginning of the algorithm. Specifically, the length of a slide can be computed using a single lowest common ancestor query in the suffix tree of a string based on S and T [23]. The overall running time is O(n + k²).
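As an end-to-end illustration, our own sketch below runs the diagonal scheme with naive character-by-character slides (no suffix-tree preprocessing and no matching recovery); it returns the insertion/deletion-only edit distance of two equal-length strings when it is at most k:

```python
def sed_bounded(S, T, k):
    """Insertion/deletion-only edit distance of equal-length S and T,
    returned if it is at most k; otherwise None.

    prev[h] holds L_{h,e-1}; initial() supplies the boundary values
    L_{h,|h|-2} = |h|-1 for h < 0 and -1 otherwise.
    """
    n = len(S)
    assert len(T) == n

    def initial(h):
        return abs(h) - 1 if h < 0 else -1

    prev = {}
    for e in range(k + 1):
        cur = {}
        for h in range(-e, e + 1, 2):          # diagonals with h = e (mod 2)
            r = max(prev.get(h - 1, initial(h - 1)),
                    prev.get(h + 1, initial(h + 1)) + 1)
            # The slide: extend while characters on the diagonal match.
            # (s_{r+1} is S[r] with 0-based indexing.)
            while r + 1 <= n and r + h + 1 <= n and S[r] == T[r + h]:
                r += 1
            if h == 0 and r == n:
                return e                       # D(n, n) = e
            cur[h] = r if (r <= n and r + h <= n) else n + k + 1
        prev = cur
    return None
```

A slide here is the plain while loop; the O(n + k²) bound of [23] comes from replacing each slide with a constant-time lowest common ancestor query.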
The O(1)-approximation algorithm for GED of grid-snapped points

Notations. Similar to the string algorithm, we have a dynamic programming matrix; D′(i, j) is the edit distance between subsequences P′_i = <p′_1, ..., p′_i> and Q′_j = <q′_1, ..., q′_j> of points at the corners of grid cells. This matrix also meets Property 1 stated earlier, except that we use dist(p′_i, q′_j) instead of |s_i t_j|. In addition, we also have the following property, which is a refinement of Property 4.

Property 5. D′(i, j) − D′(i − 1, j − 1) ∈ [0, 2].

Clearly, the upper bound is 2 (just unmatch p_i and q_j). The lower bound can be proved by induction. Because the values in any diagonal are non-decreasing, we need only consider diagonals −k through k.

(Implicit) label rules. To obtain an approximate matching for the edit distance of snapped point sequences, we now label each entry in the dynamic programming matrix with an approximately tight lower bound on its value. Inspired by the string algorithm, we use non-negative integers for our labels, and the entries of any diagonal h only receive labels e where e ≡ h (mod 2). Let LA(i, j) be the label of entry (i, j), and let L′_{h,e} be the row index of the farthest entry whose label is e in diagonal h.

Definition 3. L′_{h,e} := max{ i | LA(i, i + h) = e }.

For each e from 0 to k, for each diagonal h where h ≡ e (mod 2), we (implicitly) assign label e to some entries on diagonal h:

1. If h = −e or e, i.e., this is the first iteration to assign labels to this diagonal, then we label the very beginning entry in diagonal h as e; i.e., if h = −e, LA(−h, 0) = e; otherwise, LA(0, h) = e.

2. We define a start entry (r, r + h) for each diagonal h. If h = −e or e, r is the row index of the first entry in diagonal h (r = |h| or 0); otherwise, r = max{ L′_{h−1,e−1}, L′_{h+1,e−1} + 1 }.

3. We assign the label e to entries (r, r + h) to (r + s, r + h + s), where s is the largest integer such that Σ_{i=r+1}^{r+s} dist(p′_i, q′_{i+h}) ≤ 2 (so Σ_{i=r+1}^{r+s+1} dist(p′_i, q′_{i+h}) > 2), and we set L′_{h,e} := r + s. These entries correspond to a slide in the string algorithm.

4. Finally, if (r − 1, r + h − 1) is unlabeled, we go backward up the diagonal labeling entries as e until we meet an entry that has been assigned a label previously. (Again, this step is implicit. As explained below, the actual algorithm only finds the L′_{h,e} entries.)

Fig. 2 illustrates our rules.

Figure 2.
Notations and rules for approximating SGED. (a) Notations and labels for the boundary entries. (b) Label entries following step 3.
Computing an approximately optimal matching.
Assume we have set the initial values. Our algorithm only needs to compute each L′_{h,e} as before; see Algorithm 4. Then, we guarantee the following theorem:

Theorem 3.
Suppose
GED ( P (cid:48) , Q (cid:48) ) ≤ k . We can recover a matching M ∗ (cid:48) GS using the values L (cid:48) h,e from Algorithm 4. The cost of M ∗ (cid:48) GS for point sequences P (cid:48) , Q (cid:48) is less and equal to GED ( P (cid:48) , Q (cid:48) ) . In short, we argue each label LA ( i, j ) ≤ D (cid:48) ( i, j ). We then follow a path through the matrix assuggested by the way we pick labels in Algorithm 4. The final matching has cost at most 3 LA ( n, n )which is less and equal to 3 GED ( P (cid:48) , Q (cid:48) ). The full proof appears in Section 4.3 Algorithm 4:
Computing L′_{h,e} for fixed h and e
    r := max{L′_{h−1,e−1}, L′_{h+1,e−1} + 1}
    sum := 0
    while r + 1 ≤ n and r + h + 1 ≤ n and sum + dist(p′_{r+1}, q′_{r+h+1}) ≤ 2 do
        r := r + 1
        sum := sum + dist(p′_r, q′_{r+h})
    end
    if r > n or r + h > n then L′_{h,e} := ∞ else L′_{h,e} := r

We conclude by discussing the time complexity of our algorithm. Using the same O(n) preprocessing as in [23], we can slide down maximal sequences of consecutive entries (r, r + h) with dist(p′_r, q′_{r+h}) = 0 in constant time per slide. Let ∆ be the cell side length of the grid whose cell corners contain the points of P′ and Q′. For dist(p′_r, q′_{r+h}) ≠ 0, we know dist(p′_r, q′_{r+h}) ≥ ∆ from Observation 1. Therefore, we only need to manually add distances and restart the faster portions of each slide O(1/∆) times, since the nonzero distances on a slide sum to at most 2. Thus, the total running time is

O(n + ∑_{e=0}^{k} ∑_{h=−e}^{e} 1/∆) = O(n + k²/∆).

We have the following properties for our labels, along with the following lemma.
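The label rules and Algorithm 4 together amount to a frontier computation in the style of the string algorithm. The following Python sketch (our own names and indexing conventions; an illustration, not the authors' implementation) runs Algorithm 4 for each e and each diagonal h and reports the final label LA(n, n):

```python
import math

INF = float("inf")

def approx_sged(P, Q, kmax):
    """Sketch of the labeling algorithm.  For each label e and each diagonal
    h with h ≡ e (mod 2), compute the frontier L[(h, e)]: the row index of
    the farthest entry on diagonal h that receives label e (Algorithm 4).
    Returns the final label LA(n, n), or None if it exceeds kmax."""
    n = len(P)
    assert len(Q) == n
    d = lambda i, j: math.dist(P[i - 1], Q[j - 1])   # points are 1-indexed
    L = {}
    for e in range(kmax + 1):
        for h in range(-e, e + 1, 2):
            if h == -e:
                r = -h          # first entry of diagonal h is (|h|, 0)
            elif h == e:
                r = 0           # first entry of diagonal h is (0, h)
            else:
                r = max(L[(h - 1, e - 1)], L[(h + 1, e - 1)] + 1)
            s = 0.0
            # Slide while the accumulated distance stays at most 2.
            while r + 1 <= n and r + h + 1 <= n and s + d(r + 1, r + h + 1) <= 2:
                r += 1
                s += d(r, r + h)
            # Out-of-range frontiers are recorded as infinity, as in Algorithm 4.
            L[(h, e)] = r if (r <= n and r + h <= n) else INF
            if h == 0 and L[(h, e)] == n:
                return e        # entry (n, n) received label e
    return None
```

The O(1/∆)-time slide trick from [23] is omitted here: each slide advances one entry at a time, so this sketch runs in O(k² · n) time rather than O(n + k²/∆).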
Property 6. LA(i, i + h) − LA(i + 1, i + 1 + h) ∈ {0, −2}.

Property 7. LA(i, i + h) − LA(i − 1, i + h) ∈ {−1, 1} and LA(i, i + h) − LA(i, i + h − 1) ∈ {−1, 1}.

Lemma 4.
For every entry (i, j) assigned a label, LA(i, j) ≤ D′(i, j). Note that, in particular, LA(n, n) ≤ GED(P′, Q′).

Proof:
From Property 5, we only need to prove that e is a lower bound on the value of the first entry whose label is e in each diagonal h. We proceed by induction on e.

1. If e = 0, we only label the first entry in diagonal 0 as 0, and we have 0 ≤ D′(0, 0) = 0. If e = 1, then for diagonals 1 and −1, we have 1 ≤ D′(0, 1) = D′(1, 0) = 1.

2. Assume Lemma 4 holds for all labels less than e. For e, we consider the diagonals h = −e to e. If h = −e or h = e, we know e ≤ D′(|h|, 0) = e or e ≤ D′(0, h) = e, respectively. Otherwise, let (f, f + h) be the first entry in diagonal h whose label is e. From Property 6, f = L′_{h,e−2} + 1. Fig. 3 shows the notations. From the refined Property 1, we need to discuss three cases:

Figure 3. We compute the lower bound of entries which are labeled as e.

(a) D′(f, f + h) = D′(f − 1, f + h) + 1. From Property 7, we know LA(f − 1, f + h) = e − 1 or e + 1.
• If LA(f − 1, f + h) = e − 1, then by the induction hypothesis D′(f − 1, f + h) ≥ e − 1, so D′(f, f + h) = D′(f − 1, f + h) + 1 ≥ (e − 1) + 1 = e.
• If LA(f − 1, f + h) = e + 1, then we know L′_{h+1,e−1} is less than f − 1. From our assumption and the non-decreasing property, e − 1 ≤ D′(L′_{h+1,e−1}, L′_{h+1,e−1} + h + 1) ≤ D′(f − 1, f + h), so D′(f, f + h) ≥ D′(f − 1, f + h) + 1 ≥ e.

(b) D′(f, f + h) = D′(f, f + h − 1) + 1. This case is symmetric to the one above.

(c) D′(f, f + h) = D′(f − 1, f + h − 1) + dist(p′_f, q′_{f+h}). Here LA(f − 1, f + h − 1) = e − 2, because f − 1 = L′_{h,e−2}. Let r be the row index of the first entry on the slide with label e − 2 in diagonal h, i.e., r = max{L′_{h−1,e−3}, L′_{h+1,e−3} + 1}. See Fig. 3. We define u as the row index of the first entry, walking backward from entry (f, f + h) along diagonal h, where D′(u, u + h) = min{D′(u, u + h − 1), D′(u − 1, u + h)} + 1.
• If u > r, as in Fig. 3, then u > L′_{h−1,e−3} and u − 1 > L′_{h+1,e−3}. We have D′(u, u + h − 1) ≥ D′(L′_{h−1,e−3} + 1, L′_{h−1,e−3} + h) ≥ e − 1 and D′(u − 1, u + h) ≥ D′(L′_{h+1,e−3} + 1, L′_{h+1,e−3} + h + 2) ≥ e − 1. Therefore,

D′(u, u + h) = min{D′(u, u + h − 1), D′(u − 1, u + h)} + 1 ≥ e.

By Property 5, we have D′(f, f + h) ≥ e as well.
• If u ≤ r, then, since D′(r, r + h) ≥ e − 2 by induction and the slide for label e − 2 stopped at row f − 1 (so the distances up through row f sum to more than 2),

D′(f, f + h) = D′(r, r + h) + ∑_{i=r+1}^{f} dist(p′_i, q′_{i+h}) > (e − 2) + 2 = e.

Having considered all the cases, we conclude the proof of Lemma 4. □
The bounds for the approximate matching cost C_GS(M_AGS). From Algorithm 4, note that label increases correspond to not matching a point in Line 1, and slides correspond to matching points. Let M_AGS be the resulting matching, and let U_{M_AGS} be the set of unmatched points. Then

C_GS(M_AGS) = |U_{M_AGS}| + ∑_{(i,j)∈M_AGS} dist(p′_i, q′_j) ≤ LA(n, n) + 2 · LA(n, n) = 3 · LA(n, n) ≤ 3 · GED(P′, Q′).

We conclude the proof of Theorem 3.
Acknowledgements
The authors would like to thank Anne Driemel and Benjamin Raichel for helpful discussions.
References

[1] K. Fox, X. Li, in Proceedings of the 30th International Symposium on Algorithms and Computation (2019), pp. 26:1–26:16
[2] P.K. Agarwal, K. Fox, J. Pan, R. Ying, in Proceedings of the 32nd International Symposium on Computational Geometry (2016), pp. 6:1–6:16
[3] A. Stojmirovic, Y.K. Yu, Journal of Computational Biology (4), 579 (2009)
[4] L. Chen, R. Ng, in Proceedings of the 30th International Conference on Very Large Databases (2004), pp. 792–803
[5] L. Chen, M.T. Özsu, V. Oria, in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (2005), pp. 491–502
[6] P.F. Marteau, IEEE Transactions on Pattern Analysis and Machine Intelligence (2), 306 (2009)
[7] S. Sankararaman, P.K. Agarwal, T. Mølhave, J. Pan, A.P. Boedihardjo, in Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (2013), pp. 234–243
[8] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, E. Keogh, Data Mining and Knowledge Discovery (2), 275 (2013)
[9] R.A. Wagner, M.J. Fischer, Journal of the ACM (1), 168 (1974)
[10] E. Ukkonen, Information and Control (1-3), 100 (1985)
[11] W.J. Masek, M.S. Paterson, Journal of Computer and System Sciences (1), 18 (1980)
[12] O. Gold, M. Sharir, ACM Transactions on Algorithms (4), 50 (2018)
[13] R. Impagliazzo, R. Paturi, Journal of Computer and System Sciences (2), 367 (2001)
[14] K. Bringmann, in Proceedings of the IEEE 55th Annual Symposium on Foundations of Computer Science (2014), pp. 661–670
[15] A. Abboud, A. Backurs, V.V. Williams, in Proceedings of the IEEE 56th Annual Symposium on Foundations of Computer Science (2015), pp. 59–78
[16] K. Bringmann, M. Künnemann, in Proceedings of the IEEE 56th Annual Symposium on Foundations of Computer Science (2015), pp. 79–97
[17] A. Backurs, P. Indyk, in Proceedings of the 47th Annual ACM Symposium on Theory of Computing (2015), pp. 51–58
[18] K. Bringmann, W. Mulzer, Journal of Computational Geometry (2), 46 (2016)
[19] T.M. Chan, Z. Rahmati, Information Processing Letters, pp. 72–74 (2018)
[20] W. Kuszmaul, in Proceedings of the 46th International Colloquium on Automata, Languages and Programming (2019)
[21] A. Andoni, R. Krauthgamer, K. Onak, in Proceedings of the IEEE 51st Annual Symposium on Foundations of Computer Science (2010), pp. 377–386
[22] D. Chakraborty, D. Das, E. Goldenberg, M. Koucky, M. Saks, in Proceedings of the IEEE 59th Annual Symposium on Foundations of Computer Science (2018), pp. 979–990
[23] G.M. Landau, E.W. Myers, J.P. Schmidt, SIAM Journal on Computing (2), 557 (1998)
[24] A. Andoni, N.S. Nosatzki, in Proceedings of the IEEE 61st Annual Symposium on Foundations of Computer Science (2020). To appear.
[25] S. Har-Peled,