(2+ε)-ANN for time series under the Fréchet distance
Anne Driemel and Ioannis Psarros
Hausdorff Center for Mathematics, University of Bonn, Germany
November 5, 2020
Abstract
We give the first approximate-near-neighbor data structure for time series under the continuous Fréchet distance. Given a parameter $\varepsilon \in (0,1]$, the data structure can be used to preprocess $n$ curves in $\mathbb{R}$ (aka time series), each of complexity $m$, to answer queries with a curve of complexity $k$ by either returning a curve that lies within Fréchet distance $2+\varepsilon$, or answering that there exists no curve in the input within distance $1$. In both cases, the answer is correct. Our data structure uses space in $n \cdot O(\varepsilon^{-1})^{k} + O(nm)$ and query time in $O(k)$. We show that under some conditions the approximation factor achieved by our data structure is optimal in the cell-probe model of computation. Concretely, we show that for any data structure which achieves an approximation factor less than $2$ and which supports curves of arclength at most $L$, uses a word size bounded by $O(L^{1-\varepsilon})$ for some constant $\varepsilon > 0$, and answers the query using only a constant number of probes, the number of words used to store the data structure must be at least $L^{\Omega(k)}$. Our data structure uses only a constant number of probes per query and does not have any dependency on $L$. In particular, this shows that our solution is optimal if only a constant number of probes is allowed.

Our second positive result is a probabilistic data structure based on locality-sensitive hashing, which achieves space in $O(nm)$ and query time in $O(k)$, and which answers queries with an approximation factor in $O(k)$. Both of our data structures make use of the concept of signatures, which were originally introduced for the problem of clustering time series under the Fréchet distance.

Introduction
For a long time, Indyk’s 2002 result on approximate nearest neighbor algorithms for the discrete Fréchet distance [28] was the only result known for proximity searching under the Fréchet distance. Recently, however, interest in this area has increased and several new results have been published [2, 4, 5, 6, 10, 11, 14, 15, 17, 18, 19, 20, 22, 23, 32, 34]. An intuitive definition of the Fréchet distance uses the metaphor of a person walking a dog. Imagine the dog walker being restricted to follow the path defined by the first curve while the dog is restricted to the second curve. In this analogy, the Fréchet distance is the shortest length of a dog leash that makes a dog walk feasible. See Section 1.2 for an exact definition of the distance measure. Despite the many results in this area and despite the popularity of the Fréchet distance, it is still an open problem how to build efficient data structures for it. Known results either suffer from a large approximation factor or high complexity bounds with dependency on the arclength of the curve, or only support a very restricted set of queries. See Section 1.1 for a detailed discussion of previous results.

Following Indyk’s definition, the problem of Approximate Near Neighbor (ANN) search is the following: given a distance threshold $r$ and an approximation factor $c > 1$, one has to preprocess a dataset so as to answer queries of the following form: for any given query object, if there is a data object within distance $r$, then the data structure reports any data object within distance $cr$, whereas if all data objects are at distance larger than $cr$ from the query then the data structure reports “no”. In general, for any data structuring problem, one can consider two extremal regimes of the tradeoff between query time and space. We focus on solutions which are query-time efficient with good approximation factor, while the space complexity can be high.
We assume that the input is a set of curves in $\mathbb{R}$, that is, we assume that each input element is a time series. Time series are an important type of data in signal processing, with applications in stock market analysis, sensor data analysis, genomic signal processing and many other fields. At the same time, we think that the one-dimensional case of time series already exhibits many of the algorithmic challenges of the nearest neighbor problem under the continuous Fréchet distance.

We present several solutions to this data structuring problem. All of our solutions have the property that they achieve small query time (linear in the complexity of the query). Our first data structure achieves small approximation factor $(2+\varepsilon)$ but has a high space usage (exponential in the query complexity). Our second data structure has a high approximation factor (linear in the query complexity) and low space usage (linear in the input size). We do not know if it is possible to achieve a tradeoff between approximation factor and space usage.

We supplement these positive results with an analysis in the cell-probe model indicating that the approximation factor $(2+\varepsilon)$ achieved by our data structure is almost tight. We show that, under reasonable assumptions, any data structure which stores one curve (i.e., $n = 1$) and answers queries of complexity $k$ with approximation factor strictly better than $2$ and with only a constant number of memory accesses, needs space proportional to $L^{k}$, where $L$ denotes the maximum arclength of the query curves supported by the data structure. Our $(2+\varepsilon)$-approximate data structure uses only a constant number of probes per query and does not have any dependency on $L$. In particular, this shows that our solution is optimal if only a constant number of probes is allowed. We do not know if it is possible to achieve a tradeoff by allowing a higher query time.
Most previous results on data structures for ANN search of curves concern the discrete Fréchet distance (see Section 1.5 for the exact definition). The first non-trivial ANN data structure for the discrete Fréchet distance, given in 2002 by Indyk [28], achieved approximation factor $O((\log m + \log\log n)^{t-1})$, where $m$ is the maximum length of a sequence and $t > 1$ is a trade-off parameter. More recently, in 2017, Driemel and Silvestri [18] designed a data structure which achieves approximation factor $O(k)$, where $k$ is the length of the query sequence. They show how to improve the approximation factor to $O(d^{3/2})$ at the expense of additional space usage, and a follow-up result by Emiris and Psarros [20] achieves a $(1+\varepsilon)$ approximation, at the expense of further increasing space usage. Recently, Filtser et al. [22] showed how to build a $(1+\varepsilon)$-approximate data structure using space in $n \cdot O(1/\varepsilon)^{kd}$ and with query time in $O(kd)$.

These results are relevant in our setting, since the continuous Fréchet distance can be approximated using the discrete Fréchet distance. To the best of our knowledge, all known such methods introduce a dependency on the arclength of the curves (resp. the maximum length of an edge), either in the complexity bounds or in the approximation factor. It is not at all obvious how to avoid this when approximating the continuous with the discrete Fréchet distance. This is an issue which appears, e.g., in the solution proposed by Mirzanezhad [32], where the diameter of the input (and hence the arclength) is assumed to be bounded.
The main ingredient of this data structure is the discretization of the space of query curves with a grid, leading to an approximation factor of $1+\varepsilon$, but with large storage demands: the space required for each input curve is roughly $D^{dk}$, where $D$ is the diameter of the set of vertices, $d$ is the dimension of the input space and $k$ is the complexity of the query.

Interestingly, there are some data structures for the related problem of range searching, which are especially tailored to the case of the continuous Fréchet distance and which do not have a dependency on the arclength. The subset of input curves that lie within the search radius of the query curve is called the range of the query. A range searching query should return all input curves inside the range, and a range counting query should return the number thereof. (In some applications, the input is weighted and a range searching query should return the weighted sum of the elements of the range.) In this terminology, an ANN query should return at least one element of the range, if the range is non-empty.

The work of de Berg et al. [14] focuses on preprocessing a single polygonal curve into a data structure to support range counting queries among its subcurves. Here, a query curve is restricted to be a line segment. Gudmundsson and Smid [23] consider the problem of preprocessing a tree embedded in the plane so that given a query polygonal curve, one can decide if there is a path in the tree which is within Fréchet distance $\delta$ for some threshold $\delta > 0$. Driemel and Afshani [2] consider the exact range searching problem in the general case. For $n$ curves of complexity $m$ in $\mathbb{R}^2$, their data structure uses space in $O\left(n (\log\log n)^{O(m^2)}\right)$ and the query time costs $O\left(\sqrt{n} \log^{O(m^2)} n\right)$, assuming that the complexity of the query curves is at most $\log^{O(1)} n$.
More importantly, they show lower bounds in the pointer model of computation that match the number of log factors used in the upper bounds asymptotically. Our lower bounds also hold for the case of range searching, but we assume a different computational model, namely the cell-probe model.

Definition 1 (Fréchet distance). Given two curves $\pi, \tau : [0,1] \mapsto \mathbb{R}^d$, their Fréchet distance is:
\[ d_F(\pi,\tau) = \min_{f,g : [0,1] \mapsto [0,1]} \; \max_{\alpha \in [0,1]} \|\pi(f(\alpha)) - \tau(g(\alpha))\|, \]
where $f$ and $g$ range over all continuous, non-decreasing functions with $f(0) = g(0) = 0$ and $f(1) = g(1) = 1$.

Definition 2 ($c$-ANN problem). The input consists of $n$ curves $\Pi$ in $\mathbb{R}^d$. Given a distance threshold $r > 0$ and an approximation factor $c > 1$, preprocess $\Pi$ into a data structure such that for any query $\tau$, the data structure reports as follows:
• if there exists a $\pi \in \Pi$ s.t. $d_F(\pi,\tau) \le r$, then it returns $\pi' \in \Pi$ s.t. $d_F(\pi',\tau) \le cr$,
• if $\forall \pi \in \Pi$, $d_F(\pi,\tau) \ge cr$ then it returns “no”,
• otherwise, it either replies with a curve $\pi \in \Pi$ s.t. $d_F(\pi,\tau) \le cr$, or with “no”.

The approximate nearest neighbor problem is known [25] to reduce to a sequence of ANN problems.
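The three cases of Definition 2 can be made concrete with a brute-force baseline. The following is only a sketch of the query contract, not the data structure of this paper: it scans all input curves, so its query time is linear in $n$. The names `linear_scan_ann` and `dist` are hypothetical, and `dist` stands for any routine that evaluates the distance between two curves.

```python
from typing import Callable, Optional, Sequence, TypeVar

Curve = TypeVar("Curve")


def linear_scan_ann(
    data: Sequence[Curve],
    query: Curve,
    dist: Callable[[Curve, Curve], float],
    r: float,
    c: float,
) -> Optional[Curve]:
    """Answer one c-ANN query by scanning all input curves.

    Returns some input curve at distance strictly below c * r, or None,
    which plays the role of the answer "no".
    """
    if r <= 0 or c <= 1:
        raise ValueError("Definition 2 requires r > 0 and c > 1")
    for curve in data:
        # Any curve within distance r also passes this test, because r < c * r.
        if dist(curve, query) < c * r:
            return curve
    # Every input curve is at distance at least c * r from the query.
    return None
```

The strict comparison is a design choice made here so that the second case of the definition (all curves at distance at least $cr$) always yields “no”; in the intermediate case either answer is allowed, and this sketch happens to return a curve whenever one is closer than $cr$.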
Our techniques build on a number of techniques that were previously used only for the discrete Fréchet distance, in particular the locality-sensitive hashing scheme of Driemel and Silvestri [18] and the grid approach by Filtser et al. [22]. We combine these with the concept of signatures introduced by Driemel et al. [16], which were previously only used for clustering under the Fréchet distance. In this section we give an overview of these techniques and highlight the main challenges that distinguish the discrete Fréchet distance from the continuous Fréchet distance.

The locality-sensitive hashing scheme proposed by Driemel and Silvestri [18] achieves linear space and query time in $O(k)$, with an approximation factor of $O(k)$ for the discrete Fréchet distance. The data structure is based on the idea of snapping vertices to a randomly shifted grid and then removing consecutive duplicates in the sequence of grid points produced by snapping. Any two near curves produce the same sequence of grid points with constant probability, while any two curves which are sufficiently far away from each other produce two non-equal sequences of grid points with certainty. The main argument used in the analysis of this scheme involves the optimal discrete matching of the vertices of the two curves. This analysis is not directly applicable to the continuous Fréchet distance, as the optimal matching may not be realized at the vertices of the curves.

There are several ANN data structures with fast query time and small approximation factor which store a set of representative query candidates together with precomputed answers for these queries, so that a query can be answered approximately with a lookup table. One example of this is the $(1+\varepsilon)$-ANN data structure for the $\ell_p$ norms [25], which employs a grid and stores all those grid points which are near to some data point, and a pointer to the data point that they represent.
The side-length of the grid controls the approximation factor, and using hashing for storing precomputed solutions leads to an efficient query time. A similar approach was used by Filtser et al. [22] for the $(1+\varepsilon)$-ANN problem under the discrete Fréchet distance. The algorithm discretizes the relevant parts of the query space with a canonical grid and stores representative point sequences on this grid.

There are several challenges when trying to apply the same approach to the ANN problem under the continuous Fréchet distance. Computing good representatives in this case is more intricate: two curves may be near, but some of their vertices may be far from any vertex on the other curve. Hence, picking representative curves which are defined by vertices in the proximity of the vertices of the data curve is not sufficient. In case the input consists of curves with bounded arclength only, one can enumerate all curves which are defined by grid points and lie within a given distance threshold. However, this results in a large dependency on the arclength. Whether efficient ANN data structures for the continuous Fréchet distance are possible without a dependency on the arclength of the input curves is an intriguing question, which we attempt to answer in this paper.

Signatures were first introduced by Driemel et al. [16] in the context of clustering of time series under the continuous Fréchet distance. Interestingly, they allowed the application of the grid approach to clustering. One can use a grid to discretize the space of candidate centers, as dictated by the signatures of the input time series. A newer result for clustering [12], which also works for higher dimensional polygonal curves, employs sampling techniques to achieve a bicriteria approximation for the $k$-median objective. This result is bicriterial in the sense that the medians computed by the algorithm have good clustering costs with respect to an optimal solution of smaller complexity.
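The snapping step of the locality-sensitive hashing scheme of Driemel and Silvestri [18], as summarized in this section, can be sketched for one-dimensional sequences as follows. This is an illustration under assumptions and not the implementation from [18]: the function name `grid_key` is hypothetical, and snapping is done by rounding down to the grid point `shift + i * width`, which is one possible convention.

```python
import math
import random
from typing import Sequence, Tuple


def grid_key(values: Sequence[float], width: float, shift: float) -> Tuple[int, ...]:
    """Snap each value to the grid {shift + i * width} and drop consecutive repeats."""
    key = []
    for x in values:
        # Index i of the grid point shift + i * width at or below x.
        cell = math.floor((x - shift) / width)
        if not key or key[-1] != cell:
            key.append(cell)
    return tuple(key)


rng = random.Random(0)            # fixed seed only to make the sketch reproducible
width = 1.0                       # grid side-length (illustrative value)
shift = rng.uniform(0.0, width)   # one random shift shared by all sequences
```

Two sequences that stay close to each other, such as `[0.1, 1.2]` and `[0.2, 0.3, 1.3]`, receive the same key for `shift = 0.0`, whereas a sequence that moves far away, such as `[0.1, 5.2]`, receives a different one. The key is a tuple so that it can be used directly as a dictionary key in a hash table.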
We study the $c$-ANN problem for time series under the continuous Fréchet distance. Our data structures operate in the real-RAM model, enhanced with floor function operations in constant time. For a more detailed discussion on the computational models used in this paper we refer to Appendix A.

Theorem 3 (Main Theorem). Let $\varepsilon \in (0,1]$. There is a data structure for the $(2+\varepsilon)$-ANN problem, which stores $n$ curves in $\mathbb{R}$ and supports query curves of complexity $k$ in $\mathbb{R}$, which uses space in $n \cdot O(1/\varepsilon)^{k} + O(nm)$, needs $n \cdot O(1/\varepsilon)^{k} + O(nmk)$ preprocessing time and answers a query in $O(k)$ time.

The proof of the above theorem can be found in Section 3. To achieve this result we generate a discrete approximation of the set of queries that have non-empty ranges. To this end, we employ the concept of signatures, previously introduced in [16]. The signature of a time series provides us with a selection of the local extrema of the function graph, which we use to generate candidate curves. We first apply this technique in Section 2 in combination with the triangle inequality. This does not yet yield the desired approximation factor. A more careful analysis of the involved matchings and a more intricate algorithm to build the candidate set, which we present in Section 3, yield the above theorem.

Our second main result is a lower bound in the cell-probe model of computation. The cell-probe model of computation counts the number of memory accesses (cell probes) to the data structure which are performed by a query. Given a universe of data and a universe of queries, a cell-probe data structure with performance parameters $s$, $t$, $w$ is a structure which consists of $s$ memory cells, each able to store $w$ bits, and any query can be answered by accessing $t$ memory cells. Our lower bound concerns approximate distance oracles.
A Fréchet distance oracle is a data structure which, given one input curve $\pi$, a distance threshold $r$ and an approximation factor $c > 1$, reports for any query curve $\tau$ as follows: (i) if $d_F(\pi,\tau) \le r$ then the answer is “yes”, (ii) if $d_F(\pi,\tau) > cr$ then the answer is “no”, (iii) otherwise the answer can be either “yes” or “no”.

Theorem 4.
Consider any Fréchet distance oracle with approximation factor − γ , for any γ ∈ (0 , ,in the cell-probe model, which supports curves in R as follows: it stores any polygonal curve ofarclength at most L , for L ≥ , it supports queries of arclength at most L and complexity k , where k ≤ L/ , and it achieves performance parameters t , w , s . There exist w = Ω (cid:32) kt (cid:18) Lk (cid:19) − (cid:15) (cid:33) , s = 2 Ω (cid:16) k log( L/k ) t (cid:17) such that if w < w then s ≥ s , for any constant (cid:15) > . The above theorem implies that the approximation factor of (2 + (cid:15) ) in Theorem 3 cannot besignificantly improved, unless we restrict ourselves to curves of bounded arclength or increase thenumber of probes. The proof can be found in Section 5.To achieve this result we observe that a technique first introduced by Miltersen [31] can beapplied here. Miltersen shows that lower bounds for communication problems can be translated into4ower bounds for cell-probe data structures. In particular, we use a reduction from the lopsideddisjointness problem. Such reductions are not new. A similar reduction was devised by Meintrup etal. [30] in order to lower bound the bit complexity required for sketching, but it works for polygonalcurves in R , only. To the best of our knowledge the lower bound of the above theorem is new.In addition, we extend these lower bound results to the case of the discrete Fréchet distance.Here, our reduction is more intricate. We adapt a reduction by Bringmann and Mulzer [9], whichwas used for showing lower bounds for computing the Fréchet distance. Our results show that forthe corresponding data structure problem an exponential dependence on k for the space is necessary,when the number of probes is constant. This exponential dependence on k appears, e.g., in theupper bound by Filtser et al. [22].Finally we present a data structure for the ANN problem of time series, which only needs linearspace O ( nm ) and has query time in O ( k ) . 
This improvement in the space complexity comes with a sacrifice in the approximation factor achieved by the data structure, which is now in $O(k)$. The data structure is randomized: for any fixed query, the preprocessing is correct with constant probability. The probability of success can be amplified using repetition, i.e., building several data structures independently.

Theorem 5.
There is a data structure for the (24 k + 1) -ANN problem, which stores n curves ofcomplexity m in R and supports query curves of complexity k in R , which uses space in O ( nm ) , needs O ( nm ) preprocessing time and answers a query in O ( k ) time. For a fixed query, the preprocessingsucceeds with probability at least / . To achieve this result, we combine the notion of signatures with the ideas of the locality-sensitivescheme that was previously used [18] for the discrete Fréchet distance. In the discrete case, itis sufficient to snap the vertices of the curves to a grid of well-chosen resolution and to removerepetitions of grid points along the curve to obtain a hash index with good probability. In thecontinuous case, we first compute a signature, which filters the salient points of the curve, and onlythen apply the grid snapping to this signature to obtain the hash index. The resulting data structureis surprisingly simple.
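The amplification by repetition mentioned before Theorem 5 can be illustrated with a short calculation. The symbols $p$, $N$ and $\beta$ are introduced here only for illustration and do not appear in the theorem: $p$ is a lower bound on the success probability of one copy for a fixed query, $N$ is the number of independently built copies, and $\beta$ is the target failure probability. The calculation also assumes that errors are one-sided, that is, a copy may miss a near curve but never reports a curve beyond the guaranteed distance, so that returning the answer of any copy that reports a curve is safe.

```latex
\[
  \Pr[\text{all } N \text{ copies fail}] \;\le\; (1-p)^{N} \;\le\; e^{-pN},
  \qquad\text{so}\qquad
  N \;\ge\; \frac{\ln(1/\beta)}{p}
  \;\Longrightarrow\;
  \Pr[\text{all } N \text{ copies fail}] \;\le\; \beta .
\]
```

Under these assumptions, a constant success probability per copy means that $O(\log(1/\beta))$ copies suffice, at the cost of the same factor in space and query time.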
For any $x \in \mathbb{R}$, $|x|$ denotes the absolute value of $x$. For any positive integer $n$, $[n]$ denotes the set $\{1, \ldots, n\}$. Throughout this paper, a curve is a function $[0,1] \mapsto \mathbb{R}$ and we may refer to such a curve as a time series. We can define a curve $\pi$ as $\pi := \langle x_1, \ldots, x_m \rangle$, which means that $\pi$ is obtained by linearly interpolating $x_1, \ldots, x_m$. The vertices of $\pi : [0,1] \mapsto \mathbb{R}$ are those points which are local extrema in $\pi$. So if $v_1, \ldots, v_\ell$ are the vertices of some curve $\pi$, then $\pi = \langle v_1, \ldots, v_\ell \rangle$. For any curve $\pi$, $\mathcal{V}(\pi)$ denotes the sequence of vertices of $\pi$. The number of vertices $|\mathcal{V}(\pi)|$ is called the complexity of $\pi$ and it is also denoted by $|\pi|$. For any two points $x$, $y$, $\overline{xy}$ denotes the directed line segment connecting $x$ with $y$ in the direction from $x$ to $y$. The segment defined by any two consecutive vertices is called an edge. For any two $0 \le p_a < p_b \le 1$ and any curve $\pi$, we denote by $\pi[p_a, p_b]$ the subcurve $\{\pi(x) \mid x \in [p_a, p_b]\}$. For any two curves $\pi_1$, $\pi_2$, with vertices $x_1, \ldots, x_k$ and $x_k, \ldots, x_m$ respectively, $\pi_1 \oplus \pi_2$ denotes the curve $\langle x_1, \ldots, x_k, \ldots, x_m \rangle$, that is, the concatenation of $\pi_1$ and $\pi_2$. We define the arclength $\lambda(\pi)$ of a curve $\pi$ as the total sum of lengths of the edges of $\pi$.

We refer to a pair of continuous, non-decreasing functions $f : [0,1] \mapsto [0,1]$, $g : [0,1] \mapsto [0,1]$ such that $f(0) = g(0) = 0$, $f(1) = g(1) = 1$, as a matching. If a matching $\phi = (f,g)$ of two curves $\pi$, $\tau$ satisfies $\max_{\alpha \in [0,1]} \|\pi(f(\alpha)) - \tau(g(\alpha))\| \le \delta$, then we say that $\phi$ is a $\delta$-matching of $\pi$ and $\tau$. We will also use the following concept introduced by Alt and Godau [3].

Definition 6 ($\delta$-free space). Given two curves $\pi : [0,1] \to \mathbb{R}$, $\tau : [0,1] \to \mathbb{R}$. The $\delta$-free space is the subset of the parametric space defined as $\{(x,y) \in [0,1]^2 \mid |\pi(x) - \tau(y)| \le \delta\}$.
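The notions of vertices, complexity and arclength defined above can be illustrated on a time series given as a list of interpolated values. The following is a minimal sketch under assumptions: the function names `vertices` and `arclength` are hypothetical, the input list is assumed to be non-empty, and the two endpoints are always counted as vertices.

```python
from typing import List, Sequence


def vertices(values: Sequence[float]) -> List[float]:
    """Return the endpoints and the local extrema of the interpolated time series."""
    # Collapse runs of equal consecutive values so that plateaus count once.
    pts = [values[0]]
    for x in values[1:]:
        if x != pts[-1]:
            pts.append(x)
    if len(pts) <= 2:
        return pts
    out = [pts[0]]
    for prev, cur, nxt in zip(pts, pts[1:], pts[2:]):
        # cur is a local extremum exactly when the direction changes at cur.
        if (cur - prev) * (nxt - cur) < 0:
            out.append(cur)
    out.append(pts[-1])
    return out


def arclength(values: Sequence[float]) -> float:
    """Sum of the lengths of all edges of the interpolated time series."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))
```

For the values `[0, 1, 2, 1, 1, 3, 0]` the vertex sequence is `[0, 2, 1, 3, 0]`, so the complexity is 5. Removing the points that are not local extrema does not change the curve, so the arclength is 8 for both lists.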
The standard algorithm by Alt and Godau [3] for computing the Fréchet distance between two curves $\pi$, $\tau$ finds a matching in the parametric space of the two curves, where a matching is realized by a monotone path which starts at $(0,0)$ and ends at $(1,1)$. If such a path is entirely contained in the $\delta$-free space, then $d_F(\pi,\tau) \le \delta$. The Fréchet distance is known to satisfy the triangle inequality. We use the following two observations repeatedly in the paper.
(i) For any curves $\tau_1$, $\tau_2$, $\pi_1$, $\pi_2$, which satisfy the property that the last vertex of $\tau_1$ is the first vertex of $\tau_2$ and the last vertex of $\pi_1$ is the first vertex of $\pi_2$, it holds that $d_F(\tau_1 \oplus \tau_2, \pi_1 \oplus \pi_2) \le \max\{d_F(\tau_1,\pi_1), d_F(\tau_2,\pi_2)\}$.
(ii) For any two edges $\overline{a_1 a_2}$, $\overline{b_1 b_2}$, it holds that $d_F(\overline{a_1 a_2}, \overline{b_1 b_2}) = \max\{|a_1 - b_1|, |a_2 - b_2|\}$.
These two facts imply that if $\pi_1 = \langle x_1, \ldots, x_m \rangle$ and $\pi_2 = \langle y_1, \ldots, y_m \rangle$ such that for each $i = 1, \ldots, m$, $|x_i - y_i| \le \varepsilon$, then $d_F(\pi_1, \pi_2) \le \varepsilon$. This is a key property that we exploit when we snap vertices of a curve to a grid, since it allows us to bound the distance between the original curve and the curve defined by the sequence of snapped vertices.

A crucial ingredient to our algorithms is the notion of signatures, which was first introduced in [16] and which captures critical points of the input time series. We define signatures as follows.

Definition 7 ($\delta$-signatures). A curve $\sigma : [0,1] \mapsto \mathbb{R}$ is a $\delta$-signature of $\tau : [0,1] \mapsto \mathbb{R}$ if it is a curve defined by a series of values $0 = t_1 < \cdots < t_\ell = 1$ as the linear interpolation of $\tau(t_i)$ in the order of the index $i$, and satisfies the following properties.
For ≤ i ≤ (cid:96) − the following conditions hold:i) (non-degeneracy) if i ∈ [2 , (cid:96) − then τ ( t i ) / ∈ τ ( t i − ) , τ ( t i +1 ) ,ii) (direction-preserving) if τ ( t i ) < τ ( t i +1 ) for t < t (cid:48) ∈ [ t i , t i +1 ] : τ ( t ) − τ ( t (cid:48) ) ≤ δ and if τ ( t i ) > τ ( t i +1 ) for t < t (cid:48) ∈ [ t i , t i +1 ] : τ ( t (cid:48) ) − τ ( t ) ≤ δ ,iii) (minimum edge length) if i ∈ [2 , (cid:96) − then | τ ( t i +1 ) − τ ( t i ) | > δ , and if i ∈ { , (cid:96) − } then | τ ( t i +1 ) − τ ( t i ) | > δ ,iv) (range) for t ∈ [ t i , t i +1 ] : if i ∈ [2 , (cid:96) − then τ ( t ) ∈ τ ( t i ) τ ( t i +1 ) , and if i = 1 and (cid:96) > then τ ( t ) ∈ τ ( t i ) τ ( t i +1 ) ∪ ( τ ( t i ) − δ )( τ ( t i ) + δ ) , and if i = (cid:96) − and (cid:96) > then τ ( t ) ∈ τ ( t i − ) τ ( t i ) ∪ ( τ ( t i ) − δ )( τ ( t i ) + δ ) , and if i = 1 and (cid:96) = 2 then τ ( t ) ∈ τ ( t ) τ ( t ) ∪ ( τ ( t ) − δ )( τ ( t ) + δ ) ∪ ( τ ( t ) − δ )( τ ( t ) + δ ) . For any δ > and any curve π : [0 , (cid:55)→ R of complexity m , a δ -signature of π can be computedin O ( m ) time [16]. We now state some basic results about signatures. Lemma 8 (Lemma 3.1 [16]) . It holds for any δ -signature σ of τ that d F ( σ, τ ) ≤ δ . Lemma 9 (Lemma 3.2 [16]) . Let σ with vertices v , . . . , v (cid:96) , be a δ -signature of π with vertices u , . . . , u m . Let r i = [ v i − δ, v i + δ ] , for ≤ i ≤ (cid:96) , be ranges centered at the vertices of σ orderedalong σ . It holds for any time series τ if d F ( π, τ ) ≤ δ , then τ has a vertex in each range r i , andsuch that these vertices appear on τ in the order of i . We end this section with the standard definition of the discrete Fréchet distance. For any positiveinteger m , (cid:0) R d (cid:1) m denotes the space of sequences of m real vectors of dimension d . Note that the definition in [16] contains a typo, which is corrected here. efinition 10 (Traversal) . Given P = p , . . . , p m ∈ (cid:0) R d (cid:1) m and Q = q , . . . 
, $q_k \in (\mathbb{R}^d)^k$, a traversal $T = (i_1, j_1), \ldots, (i_t, j_t)$ of $P$ and $Q$ is a sequence of pairs of indices referring to a pairing of points from the two sequences such that:
(i) $i_1 = j_1 = 1$, $i_t = m$, $j_t = k$.
(ii) $\forall (i_u, j_u) \in T$: $i_{u+1} - i_u \in \{0,1\}$ and $j_{u+1} - j_u \in \{0,1\}$.
(iii) $\forall (i_u, j_u) \in T$: $(i_{u+1} - i_u) + (j_{u+1} - j_u) \ge 1$.
For any traversal $T$, we define $d_T(P,Q) := \max_{(i,j) \in T} \|p_i - q_j\|$.

Definition 11 (Discrete Fréchet distance). Given $P = p_1, \ldots, p_m \in (\mathbb{R}^d)^m$ and $Q = q_1, \ldots, q_k \in (\mathbb{R}^d)^k$, we define the discrete Fréchet distance between $P$ and $Q$ as follows:
\[ d_{dF}(P,Q) = \min_{T \in \mathcal{T}} \; \max_{(i_u, j_u) \in T} \|p_{i_u} - q_{j_u}\|, \]
where $\mathcal{T}$ denotes the set of all possible traversals for $P$, $Q$. Thus, $d_{dF}(P,Q) = \min_{T \in \mathcal{T}} d_T(P,Q)$.

In this section, we show a data structure for the $(5+\varepsilon)$-ANN problem of time series. We initiate our exposition with a simple corollary regarding the covering properties of ranges centered at signature vertices.

Corollary 12.
Let σ τ be a δ -signature of τ and let σ π be a δ -signature of π with vertices v , . . . , v (cid:96) .Let r i := [ v i − δ, v i + 2 δ ] , for i ∈ [ (cid:96) ] , be ranges at the vertices of σ π ordered along σ π . If d F ( π, τ ) ≤ δ then, the vertices of σ τ are contained in (cid:83) (cid:96)i =1 r i and the vertices of σ τ appear in the ranges r i , in theorder of i .Proof. If d F ( π, τ ) ≤ δ then by the triangle inequality and Lemma 8, d F ( σ π , τ ) ≤ δ . Let u , . . . , u (cid:96) (cid:48) be the vertices of σ τ and define for each i ∈ [ (cid:96) (cid:48) ] , r (cid:48) i := [ u i − δ, u i + 2 δ ] . By Lemma 9, σ π has avertex in each range r (cid:48) i and these vertices appear on σ π in the order of i . The data structure
The input consists of a set Π of n curves in R , the distance threshold r > ,and the approximation error (cid:15) > . To simplify our exposition, we assume that the distance thresholdis r := 1 − (cid:15)/ , since this allows us to use signature parameters independent of (cid:15) . (To solve theproblem for a different value of r , the input set can be uniformly scaled.) Let G w := { i · w | i ∈ Z } bethe regular grid with side-length w := (cid:15)/ . Let H be a hashtable, which is initially empty. For H , weassume perfect hashing, implying that for any key of complexity O ( k ) , we need O ( k ) time to accessthe corresponding bucket. For each input curve π ∈ Π , we compute its -signature σ π , with vertices V ( σ π ) = v , . . . , v (cid:96) , and for each v i ∈ V ( σ π ) we define the range r i := [ v i − , v i + 2] . Corollary 12ensures that the vertices of the -signature of a query τ , for which it holds that d F ( π, τ ) ≤ , liein the ranges r i satisfying the order of i . Hence, we enumerate all curves with at most k vertices,chosen from the sets r ∩ G w , r ∩ G w , . . . , and satisfying the order of i , and we store them in aset C (cid:48) . Next, we compute the set C ( π ) := { σ ∈ C (cid:48) | d F ( σ, π ) ≤ } . We store C ( π ) in H as follows:for each σ ∈ C ( π ) , we use as key the sequence of its vertices V ( σ ) . Let H ( σ ) be the bucket withkey V ( σ ) . Then, for each σ ∈ C ( π ) , we store in H ( σ ) a pointer to π . Thus, after processing allinput curves, H ( σ ) contains a list of all relevant pointers to curves in Π . The total space required is O ( n · max π ∈ Π |C ( π ) | ) .Our intuition is the following. We would like the set C ( π ) to contain all those curves thatcorrespond to -signatures of query curves that have π as an approximate near neighbor in the set7 . So when presented with a query we can simply compute its -signature and do a lookup in thetable H . However, this set is infinite. 
Therefore, we snap the vertices to a grid to obtain a discrete set of bounded size.

The query algorithm
When presented with a query curve τ , we first compute a -signature σ τ , and then we compute a key by snapping the vertices to the same grid G w . Snapping to G w isimplemented as follows: if V ( σ τ ) = v , . . . , v (cid:96) then σ (cid:48) τ := (cid:104) g w ( v ) , . . . , g w ( v (cid:96) ) (cid:105) , where for any x ∈ R , g w ( x ) is the nearest point of x in G w . We perform a lookup in the hashtable H with the key V ( σ (cid:48) τ ) and return the result: if there is a bucket H ( σ (cid:48) τ ) then we return any curve which has a pointer storedthere, otherwise we return “no”. Section 2.1 contains detailed pseudocode of the basic algorithms. Lemma 13.
Let τ be a query curve of complexity k . If there exists a curve π ∈ Π such that d F ( π, τ ) ≤ − (cid:15)/ then the query algorithm returns a curve π (cid:48) ∈ Π such that d F ( π (cid:48) , τ ) ≤ (cid:15)/ .If for all curves π ∈ Π , d F ( π, τ ) > (cid:15)/ then the query algorithm returns “no”.Proof. Let π be any input curve in Π , and let τ be a query curve. Curves σ τ and σ (cid:48) τ are as definedin the description of the query algorithm. First suppose that d F ( π, τ ) ≤ − (cid:15)/ . By Lemma 8, wehave that d F ( τ, σ τ ) ≤ , and by the triangle inequality, d F ( π, σ τ ) ≤ d F ( π, τ ) + d F ( τ, σ τ ) ≤ − (cid:15)/ . Then, again, by the triangle inequality, d F ( π, σ (cid:48) τ ) ≤ d F ( π, σ τ ) + d F ( σ τ , σ (cid:48) τ ) ≤ . By Corollary 12 we are guaranteed that σ (cid:48) τ will be considered during preprocessing, so there will bea pointer to π in the bucket H ( σ (cid:48) τ ) .Now consider the case d F ( π, τ ) > (cid:15)/ . By Lemma 8, we have that d F ( τ, σ τ ) ≤ . By thetriangle inequality, d F ( π, σ τ ) ≥ d F ( π, τ ) − d F ( τ, σ τ ) > (cid:15)/ and then, d F ( π, σ (cid:48) τ ) ≥ d F ( π, σ (cid:48) τ ) − d F ( σ τ , σ (cid:48) τ ) > , which means that σ (cid:48) τ will not be assigned to C ( π ) during preprocessing. The approximation factor is (cid:15)/ − (cid:15)/ < (cid:15) , for any (cid:15) ∈ [0 , . Theorem 14.
Let (cid:15) ∈ (0 , . There is a data structure for the (5 + (cid:15) ) -ANN problem, which stores n curves in R and supports query curves of complexity k in R , which uses space in n · O (cid:0) (cid:15) (cid:1) k + O ( nm ) ,needs n · O (cid:0) (cid:15) (cid:1) k + O ( nm ) preprocessing time and answers a query in O ( k ) time.Proof. The data structure is described above. By Lemma 13 the data structure returns a correctresult. It remains to analyze the complexity. Our data structure solves the (5 + (cid:15) ) -ANN problemwith distance threshold − (cid:15)/ . The space required for each input curve is proportional to thenumber of candidate signatures computed in the preprocessing phase. Indeed, we will show nowthat |C (cid:48) | ≤ O (cid:0) (cid:15) (cid:1) k . Notice that if there exists a curve with k vertices which is within distance from π then (cid:96) ≤ k , by Lemma 9. Recall that the curves in |C (cid:48) | have vertices in the ranges r i ∩ G w and the vertices respect the order of i . If we fix the choices of t , . . . , t (cid:96) , where each t i denotes the8umber of vertices in r i ∩ G w to be used in the creation of those curves, we can produce at most (cid:81) (cid:96)i =1 | r i ∩ G w | t i distinct curves. Hence, |C (cid:48) | ≤ (cid:88) t + ... + t (cid:96) = k ∀ i : t i ≥ t ≥ ,t (cid:96) ≥ (cid:96) (cid:89) i =1 (cid:18) (cid:15) + 1 (cid:19) t i ≤ (cid:88) t + ... + t (cid:96) = k ∀ i : t i ≥ (cid:18) (cid:15) + 1 (cid:19) k ≤ (cid:18) k + (cid:96) − (cid:96) − (cid:19) · (cid:18) (cid:15) + 1 (cid:19) k = O (cid:18) (cid:15) (cid:19) k . The time to compute a signature for a curve of complexity m is O ( m ) , because we can use thealgorithm of [16]. 
So, we have a total of O ( nm ) + n · O (1 /(cid:15) ) k for the preprocessing time, n · O (1 /(cid:15) ) k space for the data structure and O ( nm ) space for storing the input curves and each query costs O ( k ) time, since we employ perfect hashing for H , and snapping a curve costs O ( k ) time assumingthat a floor function operation needs O (1) time.Deciding whether a query curve τ is near to a given curve π by only having a -signature of τ issubject to a ± error. One can find concrete worst-case examples where this approximation factor isattained. preprocess (set of time series Π ) Initialize empty hashtable H for each π ∈ Π do C ( π ) ← generate_candidates ( π ) if C ( π ) (cid:54) = ∅ then for each σ τ ∈ C ( π ) do add a pointer to π in H ( σ τ ) generate_candidates (time series π ) σ π ← -signature of π , with V ( σ π ) = v , . . . , v (cid:96) if (cid:96) > k then return ∅ for each i = 1 , . . . , (cid:96) do r i ← [ v i − , v i + 2] C (cid:48) ← ∅ for each j = 1 , . . . , (cid:96) do for each p ∈ r j ∩ G w do generate_sequences ( (cid:104) p (cid:105) , j , C (cid:48) ) C ( π ) ← ∅ for each σ τ ∈ C (cid:48) do if d F ( π, σ τ ) ≤ then C ( π ) ← C ( π ) ∪ { σ τ } return C ( π ) enerate_sequences (time series σ , integer i , returned set C (cid:48) ) // Stores in C (cid:48) all possible time series which begin with σ , have at most k vertices that belong to r (cid:48) j ∩ G w , for j = i, . . . , (cid:96) , and appear in them in the order of i . v , . . . , v t ← V ( σ ) if |V ( σ ) | ≤ k then C (cid:48) ← C (cid:48) ∪ { σ } if |V ( σ ) | < k then for each j = i, . . . , (cid:96) do for each p ∈ r j ∩ G w do σ (cid:48) ← (cid:104) v , . . . , v t , p (cid:105) generate_sequences ( σ (cid:48) , j , C (cid:48) ) query (time series τ ) σ τ ← compute a -signature of τ . σ (cid:48) τ ← snap σ τ to G w if ∃ π ∈ Π , σ (cid:48) τ ∈ C ( π ) then // check the bucket H ( σ (cid:48) τ ) report π // arbitrary π s.t. 
$\sigma'_\tau \in \mathcal{C}(\pi)$ else report “no”

$(2+\varepsilon)$

In this section, we extend the ideas developed in Section 2 to improve the approximation factor of our data structure from $(5+\varepsilon)$ to $(2+\varepsilon)$. The key to circumventing the large approximation factor resulting from the use of the triangle inequality seems to be a careful construction of matchings. For this we define the notion of a $\delta$-tight matching for two curves.

Intuitively, a $\delta$-tight matching is a matching which attains a distance of at most $\delta$ and matches as many pairs of points as possible at distance zero. Full proofs of the lemmas of this section can be found in Section 3.2.

Definition 15 ($\delta$-tight matching). Given two curves $\pi$ and $\tau$, consider a monotone path $\lambda$ through the parametric space of $\pi$ and $\tau$ consisting of two types of segments:
(i) a segment contained in the $0$-free space (corresponding to identical subcurves of $\pi$ and $\tau$),
(ii) a horizontal line segment contained in the $\delta$-free space (corresponding to a point on $\pi$ and a subcurve on $\tau$).
If $\lambda$ exists, we say $\lambda$ is a tight matching of width $\delta$ from $\pi$ to $\tau$.

Lemma 16.
Let $X = \overline{ab} \subset \mathbb{R}$ be a line segment and let $\tau : [0,1] \to \mathbb{R}$ be a curve with $[a,b] \subseteq [\tau(0), \tau(1)]$. If $d_F(X, \tau) \le \delta$ then there exists a $\delta$-tight matching from $X$ to $\tau$.

Proof Sketch. We first construct a connected path in the $\delta$-free space of the two curves that only consists of sections of the $0$-free space and horizontal line segments, but is not necessarily monotone. We do this by parametrizing the set that constitutes the $0$-free space and connecting it by horizontal line segments. We obtain an $x$-monotone connected curve from $(0,0)$ to $(1,1)$ which lies inside the $\delta$-free space. We then show that this path can be iteratively “repaired” by replacing non-monotone sections of the path with horizontal segments, while maintaining the property that the path is contained inside the $\delta$-free space. After a finite number of iterations of this procedure we obtain a $\delta$-tight matching from $X$ to $\tau$. Figure 1 illustrates the process.

Figure 1: Replacing a section of the path with a horizontal line segment in the proof of Lemma 16.

Figure 2: Example of the path constructed in the proof of Lemma 17. The left figure shows a tight matching from $X$ to $\pi$. The middle figure shows a tight matching from $X$ to $\tau$. Diagonal edges of the $0$-free space of these can be transferred to the diagram on the right, which is the free space diagram of $\pi$ and $\tau$. The final path results from connecting these diagonal segments using horizontal and vertical line segments.

In the next lemma we combine tight matchings from a line segment to show an upper bound on the Fréchet distance. Using this lemma, we can show upper bounds on the distance that are stronger than bounds obtained by the triangle inequality. Figure 2 illustrates the idea of the proof.

Lemma 17.
Let $X = \overline{ab} \subset \mathbb{R}$ be a line segment and let $\tau$ and $\pi$ be curves with $[a,b] \subseteq [\tau(0), \tau(1)]$ and $[a,b] \subseteq [\pi(0), \pi(1)]$. If $d_F(X, \tau) = \delta_1$ and $d_F(X, \pi) = \delta_2$, then $d_F(\tau, \pi) \le \max(\delta_1, \delta_2)$.

Lemma 18.
Let σ π be a δ -signature of a time series π and let σ τ be a δ -signature of a time series τ .Suppose that the first edge and the last edge of τ are both of length more than δ . If d F ( σ π , σ τ ) ≤ δ then d F ( π, τ ) ≤ δ .Proof Sketch. Let π ( p ) . . . , π ( p (cid:96) ) be the vertices of σ π and let τ ( t ) , . . . , τ ( t (cid:96) (cid:48) ) be the vertices of σ τ . The properties of the signatures of Definition 7 together with Lemma 9, imply that a weaklymonotonic matching of σ π with σ τ which attains a distance of at most δ can be structured as follows:each edge is either entirely matched with an edge, or with a vertex.11his observation combined with the range property of signatures and the assumption thatthe first and the last edge of τ are of length greater than δ allows us to focus on any pair ofmatched edges π ( p i ) π ( p i +1 ) , τ ( t j ) τ ( t j +1 ) , and show that d F ( π [ p i , p i +1 ] , τ [ t j , t j +1 ]) ≤ δ . To do that,we partition π [ p i , p i +1 ] into three subcurves π [ p i , p a ] , π [ p a , p b ] , π [ p b , p i +1 ] and τ [ t j , t j +1 ] into threesubcurves τ [ t j , t a ] , τ [ t a , t b ] , τ [ t b , t j +1 ] . The main property of this decomposition is that either π [ p i , p a ] or τ [ t j , t a ] is a point and we can easily derive d F ( π [ p i , p a ] , τ [ t j , t a ]) ≤ δ by greedily matching the othercurve. The same holds for π [ p b , p i +1 ] and τ [ t b , t j +1 ] . 
Then, to prove that d F ( π [ p a , p b ] , τ [ t a , t b ]) ≤ δ we rely on bounding d F ( π [ p a , p b ] , π ( p a ) π ( p b )) , d F ( τ [ t a , t b ] , τ ( t a ) π ( t b )) and then applying Lemma 17.Edges which are matched with vertices are special cases which can be easily handled.We conclude that each subcurve π e (or vertex) of π corresponding to some edge e (or vertex) of σ π which is matched with some edge e (or vertex) in σ τ by a matching of σ π with σ τ which attainsa distance of at most δ , can be matched with the subcurve τ e of τ which corresponds to e , suchthat d F ( π e , τ e ) ≤ δ . Hence, d F ( π, τ ) ≤ δ . Lemma 19.
Let σ π be a δ -signature of a time series π and let σ τ be a δ -signature of a time series τ . Suppose that the first edge and the last edge of τ are both of length more than δ . For any δ (cid:48) ≤ δ ,if d F ( π, τ ) ≤ δ (cid:48) then d F ( σ π , σ τ ) ≤ δ (cid:48) .Proof Sketch. Let π ( p ) . . . , π ( p (cid:96) ) be the vertices of σ π and let τ ( t ) , . . . , τ ( t (cid:96) (cid:48) ) be the vertices of σ τ . The assumption on the first and last edge together with the properties of the signatures ofDefinition 7 imply that any two consecutive δ -ranges along σ τ are disjoint. Moreover, we would liketo use the property that the length of any edge of σ π is strictly longer than δ . This is not entirelytrue for the first and last edge, but we can show something similar exploiting the fact that the edgelengths of τ are long and that the Fréchet distance of σ and τ is bounded by δ (cid:48) . As a result, we canshow that any edge π ( p i ) π ( p i +1 ) can be matched with a segment e i := τ ( h i ) τ ( h i +1 ) lying in someedge of σ τ such that for each i ∈ [ (cid:96) − , it holds that d F ( π ( p i ) π ( p i +1 ) , e i ) ≤ δ (cid:48) , which implies that d F ( σ π , σ τ ) ≤ δ (cid:48) . Lemma 16.
Let $X = \overline{ab} \subset \mathbb{R}$ be a line segment and let $\tau : [0,1] \to \mathbb{R}$ be a curve with $[a,b] \subseteq [\tau(0), \tau(1)]$. If $d_F(X, \tau) \le \delta$ then there exists a $\delta$-tight matching from $X$ to $\tau$.

Proof. Consider the $\delta$-free space of the two curves $X$ and $\tau$, which is a subset of $[0,1]^2$. We adopt the convention that a point $(x,y) \in [0,1]^2$ in this diagram corresponds to two points $X(y)$ and $\tau(x)$ (so $X$ corresponds to the vertical axis and $\tau$ corresponds to the horizontal axis). Let $x_1 \le \dots \le x_p = 1$ denote the parameter values at vertices of $\tau$. The $\delta$-free space is subdivided into cells $[0,1] \times [x_i, x_{i+1}]$. We call the intersection of the $\delta$-free space with the vertical cell boundary at $x$-coordinate $x_i$ the free space interval at index $i$ and denote it with $[\ell_i, r_i]$.

Consider the $0$-free space inside this diagram, that is, the set of points $(x,y) \in [0,1]^2$ with $X(y) = \tau(x)$. This set forms a set of paths $\lambda_1, \dots, \lambda_r$, for some $r \in \mathbb{N}$, which is $x$-monotone, since $X$ is a line segment. Therefore, we can parameterize this set by $x$. We concatenate any two $\lambda_i$ and $\lambda_{i+1}$ by adding a line segment between their endpoints. A connecting segment will be a horizontal line, either at $y = 0$ or at $y = 1$. This can easily be proved by contradiction (assume that $\lambda_i$ ends at one of these two heights and $\lambda_{i+1}$ starts at the other; then the section of $\tau$ between those endpoints would have to be disconnected). In addition, we add line segments to connect $\lambda_1$ to $(0,0)$ and to connect $\lambda_r$ to $(1,1)$. We obtain a connected path $\lambda$ from $(0,0)$ to $(1,1)$, which lies inside the $\delta$-free space, but is not necessarily monotone in $y$. Figure 1 shows an example.

We now describe how to obtain a $\delta$-tight matching from $\lambda$ by repeatedly replacing sections of $\lambda$ with horizontal line segments, until $\lambda$ is monotone in both parameters, $x$ and $y$.

Assume $\lambda$ is not monotone. Then, there exists a horizontal line that properly intersects $\lambda$ in three different points.
Consider a horizontal line at height $y$ with three distinct intersections at $(s_1, y)$, $(s_2, y)$, and $(s_3, y)$, such that
(i) the section of $\lambda$ between $s_1$ and $s_2$ lies completely above $y$,
(ii) the section of $\lambda$ between $s_2$ and $s_3$ lies completely below $y$.
There exist indices $i$ and $j$ such that $s_1 \le t_i < t_j \le s_3$ and such that $t_i$ is minimal and $t_j$ is maximal in this set of indices. Let $L$ be the line segment from $(s_1, y)$ to $(s_3, y)$. If $L$ is contained inside the $\delta$-free space, then we replace the corresponding section of $\lambda$ with $L$ and obtain monotonicity of $\lambda$ in the cell(s) $[0,1] \times [x_i, x_j]$.

Otherwise, let $i^- \in [i,j]$ be the index that maximizes $\ell_{i^-}$ and let $i^+$ be the index in $[i,j]$ which minimizes $r_{i^+}$. (Recall that $[\ell_i, r_i]$ denotes the free space interval at index $i$.) It must be that $y \notin [\ell_{i^-}, r_{i^+}]$, otherwise the line segment $L$ would be contained inside the $\delta$-free space.

Assume $y > r_{i^+}$ (the other case is symmetric and handled below). This case is illustrated in Figure 1. Let $y' = r_{i^+}$ and consider the intersections of $\lambda$ with the horizontal line at $y'$. It must be that there exist intersection points $s'_1, s'_2, s'_3$ with $s'_1 < s_1 < s'_2 < s'_3 < s_3$, such that
(i) the section of $\lambda$ between $s'_1$ and $s'_2$ lies completely above $y'$,
(ii) the section of $\lambda$ between $s'_2$ and $s'_3$ lies completely below $y'$.
Let $L'$ be the line segment from $(s'_1, y')$ to $(s'_3, y')$. Since $d_F(X, \tau) \le \delta$, it holds that $\ell_j \le r_{i^+}$ for any $j \le i^+$, otherwise there cannot be a monotone path in the $\delta$-free space. Therefore, $L'$ is contained in the $\delta$-free space and we can use it to shortcut $\lambda$ and obtain monotonicity of $\lambda$ in the cell(s) $[0,1] \times [x_i, x_j]$.

Otherwise, we have $y < \ell_{i^-}$. We handle this case symmetrically.
Let y (cid:48) = (cid:96) i − and consider theintersections of λ with the horizontal line at y (cid:48) . It must be that there exist intersection points with s (cid:48) , s (cid:48) , s (cid:48) with s (cid:48) < s (cid:48) < s < s (cid:48) < s , such that(i) the section of λ between s (cid:48) and s (cid:48) lies completely above y (cid:48) (ii) the section of λ between s (cid:48) and s (cid:48) lies completely below y (cid:48) Let L (cid:48) be the line segment from ( s (cid:48) , y (cid:48) ) to ( s (cid:48) , y (cid:48) ) . Since d F ( X, τ ) ≤ δ , it holds that (cid:96) i − ≤ r j ≤ for any j ≥ i − , otherwise there cannot be a monotone path in the δ -free space. Therefore, L (cid:48) iscontained in the δ -free space and we can use it to shortcut λ and obtain monotonicity of λ in thecell(s) [0 , × [ x i , x j ] .With each shortcutting step we obtain monotonicity of the path λ in at least one of the cells.Therefore, the process ends after a finite number of steps. Lemma 17.
Let $X = \overline{ab} \subset \mathbb{R}$ be a line segment and let $\tau$ and $\pi$ be curves with $[a,b] \subseteq [\tau(0), \tau(1)]$ and $[a,b] \subseteq [\pi(0), \pi(1)]$. If $d_F(X, \tau) = \delta_1$ and $d_F(X, \pi) = \delta_2$, then $d_F(\tau, \pi) \le \max(\delta_1, \delta_2)$.

Proof. Let $\delta = \max(\delta_1, \delta_2)$. By Lemma 16 there exists a $\delta$-tight matching from $X$ to $\tau$ and another one from $X$ to $\pi$. We construct a monotone path in the $\delta$-free space of $\tau$ and $\pi$ from these two tight matchings. In particular, we first specify diagonal segments of the constructed path, which lie in the $0$-free space, and then connect these segments with horizontal or vertical segments.

Let $S \subset [0,1]$ be the finite set of parameter values of $X$ which correspond to the horizontal segments of the tight matching from $X$ to $\pi$. Let $Q \subset [0,1]$ be the finite set of parameters of the horizontal segments of the tight matching from $X$ to $\tau$. Let $y_1 < \dots < y_r$ be the sorted list of the values $S \cup Q$ (without multiplicities). For any interval $[y_i, y_{i+1}]$ in this list, there exists a diagonal segment in both tight matchings that covers the entire interval in the $y$-direction. That is, the tight matchings match $X[y_i, y_{i+1}]$ to a subcurve on $\tau$ and a subcurve on $\pi$ that are identical. Let $\tau[t, t']$ and $\pi[p, p']$ be these subcurves. Let $\lambda_i = \overline{(t,p)(t',p')}$ be the corresponding diagonal segment of the $\delta$-free space of $\tau$ and $\pi$. Since the two subcurves are identical, $\lambda_i$ is part of the $0$-free space.

We obtain a set of diagonal segments in the $0$-free space, which we intend to connect into a piecewise-linear path where every edge is of one of three types: (i) a diagonal edge contained in the $0$-free space, (ii) a horizontal edge, (iii) a vertical edge.
For connecting two diagonal segments $\lambda_i$ and $\lambda_{i+1}$, there are three cases:
(i) $y_{i+1} \in S$ and $y_{i+1} \notin Q$: in this case $\lambda_i$ and $\lambda_{i+1}$ can be connected by a horizontal line segment.
(ii) $y_{i+1} \notin S$ and $y_{i+1} \in Q$: in this case $\lambda_i$ and $\lambda_{i+1}$ can be connected by a vertical line segment.
(iii) $y_{i+1} \in S$ and $y_{i+1} \in Q$: in this case $\lambda_i$ and $\lambda_{i+1}$ can be connected by a horizontal line segment followed by a vertical line segment.
From this, we obtain a monotone path in the $\delta$-free space of $\pi$ and $\tau$ from $(0,0)$ to $(1,1)$.

Lemma 18.
Let σ π be a δ -signature of a time series π and let σ τ be a δ -signature of a time series τ .Suppose that the first edge and the last edge of τ are both of length more than δ . If d F ( σ π , σ τ ) ≤ δ then d F ( π, τ ) ≤ δ .Proof. Let π ( p ) . . . , π ( p (cid:96) ) be the vertices of σ π and let τ ( t ) , . . . , τ ( t (cid:96) (cid:48) ) be the vertices of σ τ . Let r i := [ π ( p i ) − δ, π ( p i ) + δ ] , for i ∈ [ (cid:96) ] and let r (cid:48) i := [ τ ( t i ) − δ, τ ( t i ) + δ ] , for i ∈ [ (cid:96) (cid:48) ] . By Lemma 9,since σ π is a δ -signature of σ π , σ τ has a vertex in each range r i and such that these vertices appearin the order of i . Similarly, since σ τ is a δ -signature and hence a δ -signature of σ τ , σ π has a vertexin each range r (cid:48) i and such that these vertices appear in the order of i . Hence, a weakly monotonicmatching of σ π with σ τ which attains a distance of at most δ can be structured as follows: eachedge is either entirely matched with a vertex, or entirely matched with an edge.First, consider an edge π ( p i ) π ( p i +1 ) , ≤ i ≤ (cid:96) , which is entirely matched with some edge τ ( t j ) τ ( t j +1 ) , for some ≤ j ≤ (cid:96) (cid:48) . We will show now that d F ( π [ p i , p i +1 ] , τ [ t j , t j +1 ]) ≤ δ . By theminimum edge length property and the assumption that the first and the last edge of τ is of lengthmore than δ , we conclude that the two edges must have the same direction. Assume w.l.o.g. that π ( p i ) < π ( p i +1 ) and τ ( t j ) < τ ( t j +1 ) .If π ( p i ) < τ ( t j ) then we set p a = max { t ∈ [ p i , p i +1 ] | π ( t ) = τ ( t j ) } , t a = t j . If π ( p i ) ≥ τ ( t j ) then we set t a = min { t ∈ [ t j , t j +1 ] | τ ( t ) = π ( p i ) } , p a = p i . Similarly, if π ( p i +1 ) > τ ( t j +1 ) thenwe set p b = min { t ∈ [ p i , p i +1 ] | π ( t ) = τ ( t j +1 ) } , t b = t j +1 . 
If π ( p i +1 ) ≤ τ ( t j +1 ) then we set t b = max { t ∈ [ t j , t j +1 ] | τ ( t ) = π ( p i +1 ) } , p b = p i +1 . Let a = π ( p a ) = τ ( t a ) and b = π ( p b ) = τ ( t b ) .If π ( p i ) < τ ( t j ) then by the range property, the direction preserving property of σ π , andthe fact that | π ( p i ) − τ ( t j ) | ≤ δ , we conclude that π [ p i , p a ] ⊆ [ τ ( t j ) − δ, τ ( t j ) + 2 δ ] . Hence d F ( π [ p i , p a ] , τ [ t j , t b )]) ≤ δ . If π ( p i ) ≥ τ ( t j ) then by the range property of σ τ , the assumptionthat first and the last edges are of length larger than δ , and the fact that | π ( p i ) − τ ( t j ) | ≤ δ , τ [ t j , t a ] ⊆ [ π ( p i ) − δ, π ( p i )] . Hence, d F ( π [ p i , p a ] , τ [ t j , t b )]) ≤ δ . Applying the same argumentssymmetrically, we conclude that d F ( π [ p b , p i +1 ] , τ [ t b , t j +1 )]) ≤ δ .For π [ p a , p b ] , τ [ t a , t b ] we invoke Lemma 17 with X = ab . Notice that X is a δ -signature of π [ p a , p b ] because π [ p a , p b ] ⊆ [ a − δ, b + δ ] , by the triangle inequality and the range property of σ π ,and because the rest of the signature properties are still satisfied when we restrict π to [ p a , p b ] . Soby Lemma 8, we have d F ( π [ p a , p b ] , X ) ≤ δ . Similarly, d F ( τ [ t a , t b ] , X ) ≤ δ , because we have that τ [ t a , t b ] ⊆ [ a − δ, b + δ ] by the range property of σ τ and the assumption that the first and the lastedge are of length larger than δ , and because observing that X is a δ -signature of τ [ t a , t b ] allowsus to apply Lemma 8. Hence, by Lemma 17, d F ( π [ p a , p b ] , τ [ t a , t b ]) ≤ δ .By the minimum edge length property, the only edges that can be matched entirely with a vertexare the first and the last edge of σ π . Suppose that π ( p ) π ( p ) is entirely matched with τ ( t ) andassume w.l.o.g π ( p ) ≤ π ( p ) . Then, | π ( p ) − τ ( t ) | ≤ δ and | π ( p ) − τ ( t ) | ≤ δ . By the rangeproperty of σ π , π [ p , p ] ⊂ [ π ( p ) − δ, π ( p )] . 
By the triangle inequality, [ τ ( t ) − δ, τ ( t ) + 2 δ ] covers π [ p , p ] . The case of the last edge of σ π is symmetric.14e have shown that each subcurve π e (or vertex) of π corresponding to some edge e (or vertex)of σ π which is matched with some edge e (or vertex) in σ τ by a matching of σ π with σ τ whichattains a distance of at most δ , can be matched with the subcurve τ e of τ which corresponds to e ,such that d F ( π e , τ e ) ≤ δ . Hence, d F ( π, τ ) ≤ δ . Lemma 19.
Let σ π be a δ -signature of a time series π and let σ τ be a δ -signature of a time series τ . Suppose that the first edge and the last edge of τ are both of length more than δ . For any δ (cid:48) ≤ δ ,if d F ( π, τ ) ≤ δ (cid:48) then d F ( σ π , σ τ ) ≤ δ (cid:48) .Proof. Let π ( p ) . . . , π ( p (cid:96) ) be the vertices of σ π and let τ ( t ) , . . . , τ ( t (cid:96) (cid:48) ) be the vertices of σ τ . Let π ( p i ) π ( p i +1 ) be an edge of σ π . Assume wlog that π ( p i ) < π ( p i +1 ) . Now let h i , h i +1 ∈ [0 , beparameters such that τ [ h i , h i +1 ] is matched with π [ p i , p i +1 ] by the optimal matching between π, τ .Since d F ( π, τ ) ≤ δ (cid:48) , it holds that | π ( p i ) − τ ( h i ) | ≤ δ (cid:48) and | π ( p i +1 ) − τ ( h i +1 ) | ≤ δ (cid:48) .First, consider the case i ∈ [2 , (cid:96) − ∩ Z . By the minimum edge length property we knowthat | π ( p i ) − π ( p i +1 ) | > δ which implies that τ ( h i ) < τ ( h i +1 ) . We refer to edges going to theopposite direction of that of π ( p i ) π ( p i +1 ) as backward edges. The subcurve π [ p i , p i +1 ] can containbackward edges of length at most δ , because the existence of longer backward edges would refutethe direction preserving property. This means that there is no backward edge in τ [ h i , h i +1 ] of lengthgreater than δ because otherwise we would have d F ( π, τ ) > δ . Hence, τ ( h i ) and τ ( h i +1 ) belongto the same edge of σ τ , and d F ( π ( p i ) π ( p i +1 ) , τ ( h i ) τ ( h i +1 )) ≤ δ (cid:48) , because | π ( p i ) − τ ( h i ) | ≤ δ (cid:48) and | π ( p i +1 ) − τ ( h i +1 ) | ≤ δ (cid:48) .Now, we focus on the case i ∈ { , (cid:96) − } . In that case the minimum edge length property isweaker than in the case i ∈ [2 , (cid:96) − ∩ Z . The first edge of σ τ is of length more than δ , butwe cannot claim that | π ( p i ) − π ( p i +1 ) | > δ . Instead, we assume that | π ( p i ) − π ( p i +1 ) | ∈ ( δ, δ ] and we will exploit the fact that the first edge (resp. 
the last) of τ is long. We focus on the case i = 1 , since the case i = (cid:96) − is symmetric. By the assumption that d F ( π, τ ) ≤ δ (cid:48) , we havethat | π ( p ) − τ ( t ) | = | π (0) − τ (0) | ≤ δ (cid:48) . The range property implies that the first edge of τ isentirely contained in τ ( t ) τ ( t ) , which implies that | τ ( t ) − τ ( t ) | > δ . Since | τ ( t ) − π ( p ) | ≤ δ , | τ ( h ) − π ( p ) | ≤ δ , | π ( p ) − π ( p ) | ≤ δ and | τ ( t ) − τ ( t ) | > δ , it must hold that τ ( h ) ∈ τ ( t ) τ ( t ) .Hence, τ ( h ) and τ ( h ) belong to the same edge of σ τ , and d F ( π ( p ) π ( p ) , τ ( h ) τ ( h )) ≤ δ (cid:48) , because | π ( p ) − τ ( h ) | ≤ δ (cid:48) and | π ( p ) − τ ( h ) | ≤ δ (cid:48) .We have shown that each edge π ( p i ) π ( p i +1 ) can be matched with a segment e i := τ ( h i ) τ ( h i +1 ) lying in some edge of σ τ such that for each i ∈ [ (cid:96) − , d F ( π ( p i ) π ( p i +1 ) , e i ) ≤ δ (cid:48) , which implies that d F ( σ π , σ τ ) ≤ δ (cid:48) . The signatures fail to successfully capture the structure of a curve in the beginning and in the end.For that reason, we introduce the notions of the prefix and the suffix of a curve. The δ -prefix of acurve τ is the longest prefix of τ whose edges are shorter than δ . Similarly, the δ -suffix of τ is thelongest suffix of τ whose edges are shorter than δ . Definition 20 ( δ -prefix) . Let τ be a curve [0 , (cid:55)→ R and let t , . . . , t m = 1 be the parameterscorresponding to vertices of τ . The δ -prefix of τ is the maximal sequence τ ( t ) , τ ( t ) . . . , τ ( t (cid:96) ) forwhich it holds: for any i ∈ [ (cid:96) − , | τ ( t i ) − τ ( t i +1 ) | ≤ δ . If | τ ( t ) − τ ( t ) | > δ then the δ -prefix is τ ( t ) . efinition 21 ( δ -suffix) . Let τ be a curve [0 , (cid:55)→ R and let t , . . . , t m = 1 be the parameterscorresponding to vertices of τ . The δ -suffix of τ is the maximal sequence τ ( t (cid:96) ) , τ ( t (cid:96) +1 ) . . . 
, $\tau(t_m)$ for which it holds: for any $i \in [\ell, m-1] \cap \mathbb{Z}$, $|\tau(t_i) - \tau(t_{i+1})| \le \delta$. If $|\tau(t_m) - \tau(t_{m-1})| > \delta$ then the $\delta$-suffix is $\tau(t_m)$.

The data structure
The input consists of a set Π of n curves in R , the distance threshold r > , and the approximation error (cid:15) > . As before, we assume that the distance threshold is r := 1 − (cid:15)/ (for other values of r we scale the input curves during preprocessing). This allowsus to simplify our exposition as we are able to use signature parameters that are independent of (cid:15) .To discretize the query space, we use the regular grid G w := { i · w | i ∈ Z } , where w := (cid:15)/ . Foreach input curve π ∈ Π , we first compute sets C , C , where C contains all possible curves withat most k vertices from G w , its first vertex within distance w from the first vertex of π andedge lengths at most , and C contains all possible curves with at most k vertices from G w , its lastvertex within distance w from the last vertex of π and edge lengths at most . Then for eachpair σ ∈ C , σ ∈ C , which satisfies | σ | + | σ | ≤ k , we first partition π into three parts π [0 , p a ] , π [ p a , p b ] , π [ p b , , where π ( p a ) is the last point in π such that d F ( π [0 , p a ] , σ ) ≤ − w and π ( p b ) is the first point in π such that d F ( π [ p b , , σ ) ≤ − w . Next, we compute the -signature σ π of π [ p a , p b ] , with vertices V ( σ π ) = v , . . . , v (cid:96) and we define ranges r (cid:48) i := v i ± (2 + w ) . We use theseranges r (cid:48) i to construct C (cid:48) , the set of all possible -signatures of at most k − | σ | − | σ | + 2 verticesthat belong to r (cid:48) i ∩ G w , for i = 1 , . . . , (cid:96) , appear in them in the order of i , and have as their firstvertex the last vertex of σ , and as their last vertex the first vertex of σ . At last, we compute C ( π ) = { ( σ , σ τ σ ) ∈ C × C (cid:48) × C | d F ( σ , π [0 , p a ]) ≤ , d F ( σ , π [ p b , ≤ , d F ( σ π , σ τ ) ≤ } . 
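The grid $G_w$ and the prefix and suffix of Definitions 20 and 21 are simple to state in code. The Python sketch below assumes a curve in $\mathbb{R}$ is given as a list of its vertex values. The function names and the parameter `max_edge` (the edge-length threshold of the prefix and suffix) are illustrative choices rather than part of the construction, the grid width `w` is left to the caller, and floating-point rounding is ignored.

```python
import math


def snap(x, w):
    """Snap a real value x to a nearest point of G_w = {i * w : i in Z},
    using a single floor operation."""
    return math.floor(x / w + 0.5) * w


def snap_curve(vertices, w):
    """Snap every vertex of a curve (list of reals) to G_w."""
    return [snap(v, w) for v in vertices]


def grid_points(lo, hi, w):
    """All points of G_w that lie in the closed interval [lo, hi]."""
    first = math.ceil(lo / w)
    last = math.floor(hi / w)
    return [i * w for i in range(first, last + 1)]


def prefix(vertices, max_edge):
    """Maximal prefix in which consecutive vertices differ by at most
    max_edge; a single vertex if the first edge is already too long."""
    end = 1
    while end < len(vertices) and abs(vertices[end] - vertices[end - 1]) <= max_edge:
        end += 1
    return vertices[:end]


def suffix(vertices, max_edge):
    """Maximal suffix with all edges of length at most max_edge,
    obtained by taking the prefix of the reversed curve."""
    return prefix(vertices[::-1], max_edge)[::-1]


if __name__ == "__main__":
    curve = [0, 1, 1.5, 5, 5.5, 6]
    print(prefix(curve, 2), suffix(curve, 2))
    print(snap_curve([0.3, 0.4, -0.1], 0.25))
    print(grid_points(-0.6, 0.6, 0.25))
```

Snapping moves each vertex by at most $w/2$, and an interval of length $L$ contains at most $\lfloor L/w \rfloor + 1$ grid points, which is what makes the enumeration of candidate vertices in a range finite.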
Theintuition is that σ corresponds to a -prefix, σ corresponds to a -suffix and σ τ is the signature ofthe subcurve lying between σ and σ , for some approximate near neighbor τ , modulo snapping to G w .The complete pseudocode for this procedure is diverted to Section 3.4 (see generate_candidates ( π )).We store C ( π ) in a hashtable H as follows: for each ( σ , σ τ , σ ) ∈ C ( π ) , we use as key κ ( σ , σ τ , σ ) the sequence of vertices V ( σ ) , V ( σ τ ) , V ( σ ) . Let H ( σ , σ τ , σ ) be the bucket with key κ ( σ , σ τ , σ ) .Then, for each ( σ , σ τ , σ ) ∈ C ( π ) , we store in H ( σ , σ τ , σ ) a pointer to π . For H , we assume perfecthashing, as in the data structure of Section 2. The query algorithm
For a query curve τ with V ( τ ) = τ ( t ) , . . . , τ ( t k ) , the algorithm query ( τ )snaps τ to G w , to obtain τ (cid:48) . Then, it computes a triplet ( σ , σ (cid:48) τ , σ ) , where σ is a -prefix of τ (cid:48) , σ is -suffix of τ (cid:48) and σ (cid:48) τ is the curve obtained by snapping the -signature of the subcurve τ [ t | σ | , t k −| σ | +1 | ] to G w . A lookup in H , with key κ ( σ , σ (cid:48) τ , σ ) , suffices to report a valid answer: ifthere is a bucket H ( σ , σ (cid:48) τ , σ ) then one arbitrary curve stored in it will be reported, otherwise thealgorithm returns “no”. The complete pseudocode for the query algorithm is diverted to Section 3.4(see query ( τ )). 16 .4 Pseudocode of the improved result generate_candidates (time series π ) π ( p ) , . . . , π ( p m ) ← V ( π ) r ← [ π ( p ) − − w, π ( p ) + 1 + w ] r m ← [ π ( p m ) − − w, π ( p m ) + 1 + w ] C ← ∅ for each p ∈ r ∩ G w do generate_bounded_curves ( (cid:104) p (cid:105) , C ) C ← ∅ for each p ∈ r m ∩ G w do generate_bounded_curves ( (cid:104) p (cid:105) , C ) C ( π ) ← ∅ for each σ ∈ C , σ ∈ C such that | σ | + | σ | ≤ k do if { p ∈ [0 , | d F ( σ , π [0 , p ]) ≤ } (cid:54) = ∅ then p a ← max { p ∈ [0 , | d F ( σ , π [0 , p ]) ≤ − w } . else p a ←⊥ if { p ∈ [0 , | d F ( σ , π [ p, ≤ } (cid:54) = ∅ then p b ← min { p ∈ [0 , | d F ( σ , π [ p, ≤ − w } else p b ←⊥ σ π ← -signature of π [ p a , p b ] , with V ( σ π ) = v , . . . , v (cid:96) if (cid:96) ≤ k and p a (cid:54) = ⊥ and p b (cid:54) = ⊥ then for each i = 1 , . . . , (cid:96) do r (cid:48) i ← [ v i − − w, v i + 2 + w ] u ← the last vertex of σ u ← the first vertex of σ C (cid:48) ← {(cid:104) u , u (cid:105)} for each j = 1 , . . . 
, (cid:96) do for each p ∈ r (cid:48) j ∩ G w do generate_sequences2 ( (cid:104) u , p (cid:105) , j , u , C (cid:48) ) for each σ τ ∈ C (cid:48) do if d F ( σ , π [0 , p a ]) ≤ and d F ( σ , π [ p b , ≤ and d F ( σ π , σ τ ) ≤ then C ( π ) ← C ( π ) ∪ { ( σ , σ τ , σ ) } return C ( π ) generate_bounded_curves (time series σ , returned set C ) // Stores in C all possible curves which begin with σ , have at most k vertices from G w , and each edgehas length at most v , . . . , v t ← V ( σ ) if |V ( σ ) | ≤ k then C ← C ∪ { σ } if |V ( σ ) | < k then for each p ∈ [ v t − , v t + 4] ∩ G w do σ (cid:48) ← (cid:104) v , . . . , v t , p (cid:105) generate_bounded_curves ( σ , C ) enerate_sequences2 (time series σ , integer i , last vertex u , returned set C (cid:48) ) // Stores in C (cid:48) all possible curves which begin with σ , have at most k vertices that belong to r (cid:48) j ∩ G w ,for j = i, . . . , (cid:96) , appear in them in the order of i , and have as their last vertex u . v , . . . , v t ← V ( σ ) if |V ( σ ) | ≤ k then C (cid:48) ← C (cid:48) ∪ {(cid:104) v , . . . , v t , u (cid:105)} if |V ( σ ) | < k then for each j = i, . . . , (cid:96) do for each p ∈ r (cid:48) j ∩ G w do σ (cid:48) ← (cid:104) v , . . . , v t , p (cid:105) generate_sequences2 ( σ (cid:48) , j , u , C (cid:48) ) query (time series τ with V ( τ ) = τ ( t ) , . . . , τ ( t k ) ) τ (cid:48) ← snap τ to G w σ ← -prefix of τ (cid:48) σ ← -suffix of τ (cid:48) σ τ ← compute a -signature of τ [ t | σ | , t k −| σ | +1 | ] . σ (cid:48) τ ← snap σ τ to G w if ∃ π ∈ Π , ( σ , σ (cid:48) τ , σ ) ∈ C ( π ) then // check the bucket H (( σ , σ (cid:48) τ , σ )) report π // arbitrary π s.t. ( σ , σ (cid:48) τ , σ ) ∈ C ( π ) else report “no”. We now prove correctness of the query algorithm. We start by proving a technical lemma.
Lemma 22.
Let π and τ be two curves in R such that d F ( π, τ ) ≤ δ . Assume that τ has at leastone edge of length more than δ . Let τ ( q a ) be the last vertex of the δ (cid:48) -prefix of τ and let τ ( q b ) be the first vertex of the δ (cid:48) -suffix of τ , where δ (cid:48) ≥ δ . Let π ( p a ) be the last point in π such that d F ( π [0 , p a ] , τ [0 , q a ]) ≤ δ and let π ( p b ) be the first point in π such that d F ( π [ p b , , τ [ q b , ≤ δ . Then d F ( π [ p a , p b ] , τ [ q a , q b ]) ≤ δ .Proof. Any optimal matching for π, τ matches π ( p a ) with some point τ ( q (cid:48) a ) such that q a ≤ q (cid:48) a ,because otherwise π ( p a ) would not be the last point in π such that d F ( π [0 , p a ] , τ [0 , q a ]) ≤ δ whichwould lead to a contradiction. Since d F ( π [0 , p a ] , τ [0 , q a ]) ≤ δ , we know that | π ( p a ) − τ ( q a ) | ≤ δ .We also know that τ ( q a ) is the first endpoint in an edge of length at least δ . Hence, τ [ q a , q (cid:48) a ] is asubsegment of an edge of τ , and can be matched with π ( p a ) since both endpoints are within distance δ from it. The rest of the points in π [ p a , p b ] , τ [ q a , q b ] are matched according to the optimal matchingof π, τ , until we reach π ( p b ) . If we had reached τ ( q b ) before π ( p b ) , this would mean that π ( p b ) is notthe first point in π such that d F ( π [ p b , , τ [ q b , ≤ δ . Hence π ( p b ) is matched with some point τ ( q (cid:48) b ) with q (cid:48) b ≤ q b , and τ [ q (cid:48) b , q b ] is a subsegment in an edge of τ which means that it can be matched with π ( p b ) since both of its endpoints are within distance δ from π ( p b ) . Lemma 23.
For any query curve τ of complexity k , query ( τ ) reports as follows: if there exists π ∈ Π such that d F ( π, τ ) ≤ − (cid:15)/ then it returns π (cid:48) ∈ Π such that d F ( π (cid:48) , τ ) ≤ (cid:15)/ , if forany π ∈ Π , d F ( π (cid:48) , τ ) > (cid:15)/ , then it returns “no”. roof. Let π be any input curve in Π . First, suppose that d F ( π, τ ) ≤ − w , where w = (cid:15) , whichis the side length the grid. By the triangle inequality, d F ( π, τ (cid:48) ) ≤ d F ( π, τ ) + d F ( τ, τ (cid:48) ) ≤ − w, where τ (cid:48) denotes the curve resulting by snapping the vertices of τ to G w . Let σ be the the -prefixof τ (cid:48) and let σ be the -suffix of τ (cid:48) . We define q a := t | σ | and q b := t k −| σ | +1 . Let π ( p a ) be thelast point in π such that d F ( π [0 , p a ] , σ ) ≤ − w and let π ( p b ) be the first point in π such that d F ( π [ p b , , σ ) ≤ − w . Let τ (cid:48) ab be the curve resulting by snapping the vertices of τ [ q a , q b ] to G w .Then, by Lemma 22, d F ( π [ p a , p b ] , τ (cid:48) ab ) ≤ − w and by the triangle inequality d F ( π [ p a , p b ] , τ [ q a , q b ]) ≤ d F ( π [ p a , p b ] , τ (cid:48) ab ) + d F ( τ (cid:48) ab , τ [ q a , q b ]) ≤ − w. Now, by Lemma 19, d F ( σ π , σ τ ) ≤ − w , where σ π is a -signature of π [ p a , p b ] and σ τ is a -signatureof τ [ q a , q b ] . During preprocessing, we definitely consider σ since we enumerate all possible curveswhich have their first vertex within distance w from the first vertex of π , the length of each oneof their edges is at most , and their vertices are in G w . Similarly, we definitely consider σ since weenumerate all possible curves which have their last vertex within distance w from the last vertexof π , he length of each one of their edges is at most and their vertices are in G w . Let v , . . . , v (cid:96) bethe vertices of σ π . By Corollary 12, we know that the union of the intervals [ v i − , v i + 2] , i = 1 , . . . , (cid:96), cover the vertices of σ τ . 
Hence, the union of the intervals r (cid:48) i = [ v i − (2 + w ) , v i + (2 + w )] cover thevertices of σ (cid:48) τ , the curve obtained by snapping the vertices of σ τ to G w , which means that we alsoconsider σ (cid:48) τ in C (cid:48) . By the triangle inequality, d F ( σ π , σ (cid:48) τ ) ≤ d F ( σ π , σ τ ) + d F ( σ τ , σ (cid:48) τ ) ≤ , and the answer will be correctly computed in the preprocessing phase.Now assume that d F ( σ π , σ (cid:48) τ ) ≤ , d F ( π [0 , p a ] , σ ) ≤ and d F ( π [ p b , , σ ) ≤ . By Lemma 18, d F ( π [ p a , p b ] , τ (cid:48) [ q a , q b ]) ≤ . Then, d F ( π, σ ⊕ τ (cid:48) [ q a , q b ] ⊕ σ ) ≤ and by the triangle inequality, d F ( π, τ ) ≤ d F ( π, σ ⊕ τ (cid:48) [ q a , q b ] ⊕ σ ) + d F ( τ, σ ⊕ τ (cid:48) [ q a , q b ] ⊕ σ ) ≤ w. We conclude that the approximation factor is w − w < (cid:15) , for w = (cid:15)/ and (cid:15) ∈ [0 , . Theorem 3 (Main Theorem) . Let (cid:15) ∈ (0 , . There is a data structure for the (2 + (cid:15) ) -ANN problem,which stores n curves in R and supports query curves of complexity k in R , which uses space in n · O (cid:0) (cid:15) (cid:1) k + O ( nm ) , needs n · O (cid:0) (cid:15) (cid:1) k + O ( nmk ) preprocessing time and answers a query in O ( k ) time.Proof. The data structure is described above. By Lemma 11 the data structure solves the (2+ (cid:15) ) -ANNwith distance threshold − w . We can solve for any distance threshold by scaling the ambient space.It remains to prove our complexity bounds. The algorithm generate_candidates first computessets C and C . We bound the cardinality of those sets as follows: |C | ≤ k (cid:88) i =1 (cid:18) δ(cid:15) + 2 (cid:19) i ≤ k (cid:88) i =1 (cid:18) (cid:15) (cid:19) i = (22 /(cid:15) ) k +1 − /(cid:15) − ≤ · (cid:18) (cid:15) (cid:19) k = O (cid:18) (cid:15) (cid:19) k . Similarly, |C | = O (cid:0) (cid:15) (cid:1) k . 19e proceed by analyzing the steps in the main loop of generate_candidates . 
To compute p a ,we compute the free-space diagram of σ and π , as in [3], in O ( mk ) time. We can then find the edgeof π with the largest index that contains a point which is reachable in the free-space diagram by amonotone path, and then focusing on the two edges realizing that point, we can compute p a . Similarlyfor p b The signature of a curve of complexity at most m can be computed in time O ( m ) by theresult of [16]. We can now upper bound the number of triplets ( σ , σ τ , σ ) which will be produced by generate_candidates . Assuming that σ and σ are fixed, we have |C (cid:48) | = O (cid:0) (cid:15) (cid:1) k −| σ |−| σ | similarlyto the proof of Theorem 14, and because the first and the last vertex are fixed. In total, over alliterations, the number of all triplets ( σ , σ τ , σ ) produced by the algorithm is upper bounded by (cid:88) t + t ≤ k O (cid:18) (cid:15) (cid:19) t · O (cid:18) (cid:15) (cid:19) t · O (cid:18) (cid:15) (cid:19) k − t − t = (cid:88) t + t ≤ k O (cid:18) (cid:15) (cid:19) k = O ( k ) · O (cid:18) (cid:15) (cid:19) k . Hence, the preprocessing time is O ( n · k ) · O (cid:0) (cid:15) (cid:1) k + O ( nmk ) and the space needed is O ( n · k ) ·O (cid:0) (cid:15) (cid:1) k + O ( nk ) assuming that we only store indices for the curves and not the actual curves (thiswould just require an additional O ( mn ) of space). The query time is O ( k ) because we need O ( k ) time to compute curves τ (cid:48) , σ , σ , σ τ , σ (cid:48) τ and then we rely on the guarantees of perfect hashing toprobe H in O ( k ) time. O ( k ) -ANN data structure with linear space In this section, we present a data structure for the ANN problem with approximation factor of order O ( k ) , but with linear space O ( nm ) and query time in O ( k ) . Our main ingredient is a properly-tunedrandomly shifted grid. Let w > be a fixed parameter and z chosen uniformly at random from the set [0 , w ] . 
The function $g_{w,z}(x) = \lfloor w^{-1}(x - z) \rfloor$ induces a random partition of the line. We make use of the following standard bound on the probability that a bounded set is not entirely contained in a cell.

Claim 24.
Let $X \subseteq \mathbb{R}$ be a set such that $\mathrm{diam}(X) \le \Delta$, and let $w > 0$. Then,
$$\Pr_z\left[\exists x \in X\ \exists y \in X : g_{w,z}(x) \ne g_{w,z}(y)\right] \le \frac{\Delta}{w}.$$

Proof.
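Let $a, b \in \mathbb{R}$. Then,
$$\Pr_z\left[\left\lfloor \frac{a - z}{w} \right\rfloor \ne \left\lfloor \frac{b - z}{w} \right\rfloor\right] = \min\left\{\frac{|a - b|}{w},\, 1\right\} \le \frac{|a - b|}{w}.$$
The claim then follows by setting $a = \min X$, $b = \max X$: since $g_{w,z}$ is monotone, all of $X$ lies in a single cell whenever $a$ and $b$ do.

The data structure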
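The randomly shifted grid and the bound of Claim 24 can be checked on a toy instance. The following Python sketch is only an illustration: the helper names `grid_cell` and `separation_fraction` are hypothetical, and instead of sampling $z$ at random it enumerates evenly spaced shifts in $[0, w)$ with exact rational arithmetic, so that the fraction of shifts separating two points can be compared with $|a - b| / w$.

```python
from fractions import Fraction
import math


def grid_cell(x, w, z):
    """Index of the grid cell containing x, i.e. floor((x - z) / w)."""
    return math.floor((Fraction(x) - Fraction(z)) / Fraction(w))


def separation_fraction(a, b, w, steps=1000):
    """Fraction of the shifts z = 0, w/steps, 2w/steps, ... that put
    a and b into different cells (a discretised stand-in for Pr_z)."""
    w = Fraction(w)
    separated = 0
    for i in range(steps):
        z = w * i / steps  # exact rational shift in [0, w)
        if grid_cell(a, w, z) != grid_cell(b, w, z):
            separated += 1
    return Fraction(separated, steps)
```

For instance, `separation_fraction(3, 7, 10)` evaluates to 2/5, which equals $|a - b| / w$, whereas two points further apart than $w$ are separated by every shift.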
The input consists of a set $\Pi$ of $n$ curves in $\mathbb{R}$ and a distance threshold $r > 0$. We assume that the distance threshold is $r := 1$ (to solve the problem for a different value of $r$, the input set can be uniformly scaled), and we set $w = 48k$. We sample $z$ uniformly at random from $[0, w]$. Let $H$ be a hashtable, which is initially empty. For $H$, we assume perfect hashing, implying that for any key of complexity $O(k)$, we need $O(k)$ time to access the corresponding bucket. For each input curve $\pi \in \Pi$, we compute its $1$-signature $\sigma_\pi$, with vertices $V(\sigma_\pi) = v_1, \dots, v_\ell$, and then we compute the curve $\sigma'_\pi = \langle g_{w,z}(v_1), \dots, g_{w,z}(v_\ell) \rangle$. For each $\pi \in \Pi$ such that $|V(\sigma_\pi)| \le k$, we use as key the sequence of vertices $V(\sigma'_\pi)$. Let $H(\pi)$ be the bucket with key $V(\sigma'_\pi)$. We store in $H(\pi)$ a pointer to $\pi$. Thus, after processing all input curves, $H(\pi)$ contains a list of all relevant pointers to curves in $\Pi$.

Figure 3: (a) An input time series $\pi$. The red points are vertices of its $\delta$-signature $\sigma_\pi$, and the orange rectangles correspond to ranges of radius $\delta$. (b) A query time series $\tau$. Blue lines correspond to grid points. Each vertex is snapped to a grid point. Snapping $V(\sigma_\pi)$ to the grid produces the sequence $b, c, a, c, b$. The key is $V(\sigma'_\pi) = V(\langle b, c, a, c, b \rangle) = b, c, a, c, b$. Snapping $V(\tau)$ to the grid produces the sequence $b, c, b, b, a, c, b$. The key is $V(\tau') = V(\langle b, c, b, b, a, c, b \rangle) = b, c, a, c, b$. The randomly shifted grid has been successfully chosen, since $d_F(\pi, \tau) \le \delta$ and the two keys are identical.

The query algorithm
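The key computation shared by preprocessing and queries can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: signatures are assumed to be computed elsewhere and passed in as vertex lists, a Python `dict` stands in for the perfect hash table $H$, and all function names (`grid_cell`, `extrema`, `key_of`, `build`, `query`) are hypothetical. The helper `extrema` models $V(\cdot)$ by keeping only the endpoints and the local extrema of a snapped sequence, as in the example of Figure 3.

```python
import math
from collections import defaultdict


def grid_cell(x, w, z):
    """Cell index of x in the grid with side length w shifted by z."""
    return math.floor((x - z) / w)


def extrema(seq):
    """Drop repeated values, then drop interior values that are not local extrema."""
    dedup = [v for i, v in enumerate(seq) if i == 0 or v != seq[i - 1]]
    if len(dedup) <= 2:
        return tuple(dedup)
    kept = [dedup[0]]
    for prev, cur, nxt in zip(dedup, dedup[1:], dedup[2:]):
        if (cur - prev) * (nxt - cur) < 0:  # direction changes at cur
            kept.append(cur)
    kept.append(dedup[-1])
    return tuple(kept)


def key_of(vertices, w, z):
    """Snap each vertex to its grid cell and keep the resulting extrema."""
    return extrema([grid_cell(v, w, z) for v in vertices])


def build(signatures, k, w, z):
    """signatures: dict mapping a curve id to the vertices of its signature
    (assumed to be precomputed). Returns the bucket table."""
    table = defaultdict(list)
    for curve_id, sig in signatures.items():
        if len(sig) <= k:  # only signatures of complexity at most k are stored
            table[key_of(sig, w, z)].append(curve_id)
    return table


def query(table, query_vertices, w, z):
    """Return some curve id stored under the query's key, or None for "no"."""
    bucket = table.get(key_of(query_vertices, w, z))
    return bucket[0] if bucket else None
```

With $w = 1$, $z = 0$ and the cells $a, b, c$ numbered 0, 1, 2, a query whose snapped vertices are 1, 2, 1, 1, 0, 2, 1 collapses to the key (1, 2, 0, 2, 1), mirroring the sequences in Figure 3.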
When presented with a query curve $\tau$, with vertices $u_1, \dots, u_k$, we first compute the curve $\tau' = \langle g_{w,z}(u_1), \dots, g_{w,z}(u_k) \rangle$ and then we perform a lookup in the hashtable $H$ with the key $V(\tau')$ and return the result: if there is a bucket $H(\tau')$ then we return any curve which has a pointer stored there, otherwise we return “no”. Figure 3 shows an example of how keys are computed, both in the case of input curves and in the case of query curves.

First we focus on any two curves $\pi$, $\tau$ such that $d_F(\pi, \tau) \le \delta$. We show that any edge of $\tau$ which is matched to points in the same subcurve $\pi[p_i, p_{i+1}]$, where $p_i, p_{i+1}$ are the parameters that correspond to two consecutive signature vertices of $\pi$, and which has the opposite direction to that of $\overline{\pi(p_i)\pi(p_{i+1})}$, must be short. This will allow us to argue that any such edge will likely collapse by snapping its vertices to a randomly shifted grid.

Claim 25.
Consider any two curves $\pi$, $\tau$ in $\mathbb{R}$ such that $d_F(\pi, \tau) \le \delta$. Let $\sigma_\pi = \pi(p_1), \dots, \pi(p_\ell)$ be a $\delta$-signature of $\pi$. Let $0 \le t_1 < t_2 \le 1$ be parameters such that each of $\tau(t_1)$, $\tau(t_2)$ is matched with at least one point in $\pi[p_i, p_{i+1}]$, for some $i \in [\ell - 1]$, by an optimal matching. Then,
• if $\pi(p_i) < \pi(p_{i+1})$ then $\tau(t_2) \ge \tau(t_1) - 4\delta$,
• if $\pi(p_i) > \pi(p_{i+1})$ then $\tau(t_2) \le \tau(t_1) + 4\delta$.

Proof. We prove the case $\pi(p_i) < \pi(p_{i+1})$. The second case is symmetric. Let $\phi$ be an optimal matching between $\pi$ and $\tau$. Let $p \in [p_i, p_{i+1}]$ be such that $\pi(p)$ is matched with $\tau(t_1)$ by $\phi$ and let $p' \in [p_i, p_{i+1}]$ be such that $\pi(p')$ is a point matched with $\tau(t_2)$ by $\phi$. By the monotonicity of $\phi$, we may assume $p \le p'$. By the direction-preserving property of $\delta$-signatures, if $\pi(p_i) < \pi(p_{i+1})$ then $\pi(p) - \pi(p') \le 2\delta$. Since $|\pi(p) - \tau(t_1)| \le \delta$ and $|\pi(p') - \tau(t_2)| \le \delta$, we have $\tau(t_2) \ge \tau(t_1) - 4\delta$.

Lemma 9 shows that there exist vertices of $\tau$ which stab the intervals $[\pi(p_i) - \delta, \pi(p_i) + \delta]$ in the order of $i$. The following claim shows that any subcurve of $\tau$ defined by two vertices of $\tau$ stabbing $[\pi(p_i) - \delta, \pi(p_i) + \delta]$ and $[\pi(p_{i+1}) - \delta, \pi(p_{i+1}) + \delta]$ must be entirely contained in the interval $[\min\{\pi(p_i), \pi(p_{i+1})\} - 2\delta, \max\{\pi(p_i), \pi(p_{i+1})\} + 2\delta]$. In other words, $\tau$ must satisfy a weak analogue of the range property satisfied by signatures.

Claim 26.
Consider any two curves π , τ in R such that d F ( π, τ ) ≤ δ . Let σ π = π ( p ) , . . . , π ( p (cid:96) ) bea δ -signature of π . Let t j < · · · < t j (cid:96) = 1 be parameters corresponding to vertices of τ such that ∀ i ∈ [ (cid:96) ] , | τ ( t j i ) − π ( p i ) | ≤ δ . Then, for each i ∈ { , . . . , (cid:96) − } , • if π ( p i ) is a local minimum, then for any x ∈ τ [ t j i , t j i +1 ] , it holds x ≥ π ( p i ) − δ , • if π ( p i ) is a local maximum, then for any x ∈ τ [ t j i , t j i +1 ] , it holds x ≤ π ( p i ) + 2 δ .Proof. An optimal matching of π with τ matches each π ( p i ) , i ∈ { , . . . , (cid:96) − } , with points in τ [ t j i − , t j i +1 ] . This follows by the monotonicity of an optimal matching, the range property of δ -signatures, the minimum edge length property of δ -signatures and the triangle inequality. Supposenow that π ( p i ) is a local minimum. If i ∈ { , . . . , (cid:96) − } then π ( p i − ) is matched with some point in τ [ t j i − , t j i ] and π ( p i +1 ) is matched with some point in τ [ t j i , t j i +2 ] . If i = 2 then π ( p i − ) is matchedwith τ ( t j i − ) and π ( p i +1 ) is matched with some point in τ [ t j i , t j i +2 ] . If i = (cid:96) − then π ( p i − ) ismatched with some point in τ [ t j i − , t j i ] and π ( p i +1 ) is matched with τ ( t j i +1 ) . However, if thereexists a point x in τ [ t j i − , t j i +1 ] such that x < π ( p i ) − δ , then by the minimum edge length propertyand the range property of δ -signatures, x cannot be matched with any point in π [ p i − , p i +1 ] . Thisimplies that the matching is either non-continuous or non-optimal, leading to a contradiction. For i = 1 , by the range property of δ -signatures and and the triangle inequality we have that for any x ∈ τ [ t j , t j ] , it holds x ≥ π ( p ) − δ . The same arguments can be applied symmetrically when π ( p i ) is a local maximum. Lemma 27.
Let π be a curve in R and let σ π be a δ -signature of π with vertices π ( p ) , . . . , π ( p (cid:96) ) .Let τ be a curve in R with vertices τ ( t ) , . . . , τ ( t k ) . If d F ( π, τ ) ≤ δ then for the two curves σ (cid:48) π = (cid:104) g w,z ( π ( p )) , . . . , g w,z ( π ( p (cid:96) )) (cid:105) , τ (cid:48) = (cid:104) g w,z ( τ ( t )) , . . . , g w,z ( τ ( t k )) (cid:105) it holds V ( σ (cid:48) π ) = V ( τ (cid:48) ) withprobability at least kδ/w , where z is chosen uniformly at random from [0 , w ] .Proof. We consider an optimal matching φ of π with τ . For each i ∈ [ (cid:96) ] , we define r i := [ π ( p i ) − δ, π ( p i ) − δ ] . Lemma 9 implies that there exist parameters t j < · · · < t j (cid:96) = 1 correspondingto vertices of τ such that ∀ i ∈ [ (cid:96) ] , τ ( t j i ) ∈ r i . We first bound the length of edges of any τ [ t j i , t j i +1 ] which are directed backwards with respect to the direction of π ( p i ) , π ( p i +1 ) . We assume that π ( p i ) < π ( p i +1 ) , since the other case is symmetric. Let t < t ∈ [ t j i , t j i +1 ] be two parameterscorresponding to two consecutive vertices of τ [ t j i , t j i +1 ] such that τ ( t ) > τ ( t ) . Let φ be an optimalmatching of π with τ . We consider three cases regarding the position of τ ( t ) :i) if τ ( t ) ∈ [ π ( p i ) , π ( p i +1 ] \ ( r i ∪ r i +1 ) then τ ( t ) can only be matched, by φ , with points of π [ p i , p i +1 ] and since τ ( t ) < τ ( t ) , τ ( t ) can only be matched by φ with points of π [ p i , p i +1 ] .Claim 25 implies | τ ( t ) − τ ( t ) | ≤ δ .ii) If τ ( t ) ∈ r i then by Claim 26 and the fact that τ ( t ) < τ ( t ) , we know that | τ ( t ) − τ ( t ) | ≤ δ .22ii) If τ ( t ) ∈ r i +1 \ r i then• if τ ( t ) ∈ r i +1 then | τ ( t ) − τ ( t ) | ≤ δ .• if τ ( t ) / ∈ r i +1 then τ ( t ) can only be matched, by φ , with points in π [ p i , p i +1 ] . 
Since t < t , we conclude that τ ( t ) can also be matched only with points from π [ p i , p i +1 ] .Claim 25 then implies | τ ( t ) − τ ( t ) | ≤ δ .Hence, the length of any edge of any sub-curve τ [ t j i , t j i +1 ] which is directed backwards with respectto the direction of π ( p i ) , π ( p i +1 ) , has length at most δ .For each i ∈ [ k − , we define A i as the event that we have g w,z ( τ ( t i )) = g w,z ( τ ( t i +1 )) and I S ⊆ [ k − denotes the set of indices i such that | τ ( t i ) − τ ( t i +1 ) | ≤ δ . For each i ∈ [ (cid:96) ] , we define B i as the event that for any two points x, y ∈ r i we have g w,z ( x ) = g w,z ( y ) . We claim that if theevent S = (cid:84) i ∈ I S A i ∩ (cid:84) ki =1 B i occurs then V ( σ (cid:48) π ) = V ( τ (cid:48) ) . The event (cid:84) ki =1 B i directly implies thatfor each i ∈ [ (cid:96) ] , g w,z ( π ( p i )) = g w,z ( τ ( t j i )) . Hence, applying g w,z ( · ) to the vertices V ( τ ) , we obtain asequence V ( τ ) (cid:48) of the form g w,z ( π ( p )) , . . . , g w,z ( π ( p )) , . . . , g w,z ( π ( p (cid:96) )) . Now, consider any signatureedge π ( p i ) π ( p i +1 ) and suppose that π ( p i ) ≤ π ( p i +1 ) . The event (cid:84) i ∈ I S A i implies that for any edge τ ( t ) τ ( t ) of τ [ t j i , t j i +1 ] with the opposite direction of that of π ( p i ) π ( p i +1 ) , i.e. τ ( t ) < τ ( t ) , we have g w,z ( τ ( t )) = g w,z ( τ ( t ) . Moreover, g w,z ( · ) is monotone, which implies that for any two consecutivevertices τ ( t ) , τ ( t ) in τ [ t j i , t j i +1 ] , regardless of the their direction, we have g w,z ( τ ( t )) ≤ g w,z ( τ ( t ) .The same arguments apply symmetrically in the case π ( p i ) > π ( p i +1 ) . In that case any twoconsecutive vertices τ ( t ) , τ ( t ) in τ [ t j i , t j i +1 ] , satisfy g w,z ( τ ( t )) ≥ g w,z ( τ ( t ) . 
Hence, the sequence $V(\tau)'$ remains monotonic between $g_{w,z}(\tau(t_{j_i})) = g_{w,z}(\pi(p_i))$ and $g_{w,z}(\tau(t_{j_{i+1}})) = g_{w,z}(\pi(p_{i+1}))$, for any $i \in [\ell - 1]$. This implies that there are no local extrema in $\tau'$ between $g_{w,z}(\tau(t_{j_i}))$ and $g_{w,z}(\tau(t_{j_{i+1}}))$, and hence the two time series $\tau'$ and $\sigma'_\pi$ are identical.

We now upper bound the probability of the complementary event $\overline{S}$:
$$\Pr\left[\overline{S}\right] = \Pr\left[\bigcup_{i \in I_S} \overline{A_i} \cup \bigcup_{i=1}^{k} \overline{B_i}\right] \le \sum_{i \in I_S} \Pr\left[\overline{A_i}\right] + \sum_{i=1}^{k} \Pr\left[\overline{B_i}\right] \le |I_S| \cdot \frac{4\delta}{w} + k \cdot \frac{2\delta}{w} \le \frac{6k\delta}{w},$$
where the first inequality holds by a union bound and the second follows from Claim 24, since each edge indexed by $I_S$ has length at most $4\delta$ and each range $r_i$ has diameter $2\delta$.

Lemma 28.
Let $\pi$ be a curve in $\mathbb{R}$ and let $\sigma_\pi$ be a $\delta$-signature of $\pi$ with vertices $\pi(p_1), \dots, \pi(p_\ell)$. Let $\tau$ be a curve in $\mathbb{R}$ with vertices $\tau(t_1), \dots, \tau(t_k)$. For the two curves $\sigma'_\pi = \langle g_{w,z}(\pi(p_1)), \dots, g_{w,z}(\pi(p_\ell)) \rangle$, $\tau' = \langle g_{w,z}(\tau(t_1)), \dots, g_{w,z}(\tau(t_k)) \rangle$, if $V(\sigma'_\pi) = V(\tau')$ then $d_F(\pi, \tau) \le 2w + \delta$.

Proof. By the triangle inequality,
$$d_F(\sigma_\pi, \tau) \le d_F(\sigma_\pi, \sigma'_\pi) + d_F(\sigma'_\pi, \tau) \le d_F(\sigma_\pi, \sigma'_\pi) + d_F(\sigma'_\pi, \tau') + d_F(\tau', \tau) = d_F(\sigma_\pi, \sigma'_\pi) + d_F(\tau', \tau),$$
where the last step uses $d_F(\sigma'_\pi, \tau') = 0$, which holds because $V(\sigma'_\pi) = V(\tau')$. Notice that $\sigma'_\pi$ and $\tau'$ are curves resulting by snapping the vertices of $\sigma_\pi$ and $\tau$, respectively, to grid points within distance $w$. Hence, $d_F(\sigma_\pi, \sigma'_\pi) \le w$ and $d_F(\tau', \tau) \le w$, which imply $d_F(\sigma_\pi, \tau) \le 2w$. Then by the triangle inequality and Lemma 8,
$$d_F(\pi, \tau) \le d_F(\pi, \sigma_\pi) + d_F(\sigma_\pi, \tau) \le 2w + \delta.$$

Theorem 5. There is a data structure for the $(24k + 1)$-ANN problem, which stores $n$ curves of complexity $m$ in $\mathbb{R}$ and supports query curves of complexity $k$ in $\mathbb{R}$, which uses space in $O(nm)$, needs $O(nm)$ preprocessing time and answers a query in $O(k)$ time. For a fixed query, the preprocessing succeeds with probability at least $1/2$.

Proof. The data structure is described in Section 4.1. The hashtable $H$ stores in each bucket all relevant pointers to curves in $\Pi$. Hence the total storage is in $O(nm)$. Assuming perfect hashing, a query costs $O(k)$ time. Correctness follows from Lemmas 27 and 28 for $\delta = r = 1$ and $w = 12k$. With these values, Lemma 28 bounds the distance of any reported curve by $2w + \delta = 24k + 1$.

In this section, we study lower bounds on the cell-probe complexity of distance oracles for the Fréchet distance and the discrete Fréchet distance. We focus on the decision version of the problem.
Inparticular, we say a distance oracle with input curve π , threshold r > , and approximation factor c > , is a data structure which reports as follows: for any query τ , if d F ( π, τ ) ≤ r then it outputs“yes”, else if d F ( π, τ ) ≥ cr then it outputs “no” and otherwise both answers are acceptable. This canbe viewed as a special case of the c -ANN problem. To show our lower bounds, we employ a techniquefirst introduced by Miltersen [31], which implies that lower bounds for communication problemscan be translated into lower bounds for cell-probe data structures. The following communicationproblem is known as the lopsided (or asymmetric) disjointness problem. Definition 29 ( ( k, U ) -Disjointness) . Alice receives a set S , of size k , from a universe [ U ] = { . . . U } ,and Bob receives T ⊂ [ U ] of size m ≤ U . They need to decide whether T ∩ S = ∅ . A randomized [ a, b ] -protocol for a communication problem is a protocol in which Alice sends a bits, Bob sends b bits, and the error probability is bounded away from / . The following resultby Pătraşcu gives a lower bound on the randomized asymmetric communication complexity of the ( k, U ) -Disjointness problem. Theorem 30 (Theorem 1.4 [33]) . Assume Alice receives a set S , | S | = k and Bob receives a set T , | T | = m , both sets coming from a universe of size U , such that k ≤ m ≤ U . In any randomized,two-sided error communication protocol deciding disjointness of S and T , either Alice sends at least δk log (cid:0) Uk (cid:1) bits or Bob sends at least Ω (cid:16) k (cid:0) Uk (cid:1) − C · δ (cid:17) bits, for any δ > , and C = 1799 . We now define the distance threshold estimation problem (DTEP), where two parties mustdetermine whether two curves are near or far. This is basically the communication version of ourdata structure problem (for n = 1 ). Definition 31 ( ( k, U ) -Fréchet DTEP) . 
Given parameters c ≥ , r > , Alice receives a curve τ of complexity k in R d , Bob receives a curve π of complexity m ≤ U in R d . If d F ( π, τ ) ≤ r thenthey must output “yes”. If d F ( π, τ ) ≥ cr then they must output “no”. Otherwise, both answers areacceptable. Similarly, we define the ( k, U ) -Discrete Fréchet DTEP. Definition 32 ( ( k, U ) -Discrete Fréchet DTEP) . Given parameters c ≥ , r > , Alice receives acurve τ of complexity k in R d , Bob receives a curve π of complexity m ≤ U in R d . If d dF ( π, τ ) ≤ r then they must output “yes”. If d dF ( π, τ ) ≥ cr then they must output “no”. Otherwise, both answersare acceptable. .1 A cell-probe lower bound for the Fréchet distance We first reduce the lopsided set disjointness problem to the problem of approximating the Fréchetdistance of two curves in R . A similar reduction appears in [30], which however works for curves in R and it is used to lower bound the complexity of sketching.First consider an instance of the set disjointness problem: Alice has a set A = { α , . . . , α k } ⊂ [ U ] and Bob has a set B = { β , . . . , β m } ⊂ [ U ] , where U is the size of the universe. We now describeour main gadgets which will be used to define one curve of complexity O ( k ) for A and one curve ofcomplexity O ( U − m ) for B . For each i ∈ [ U ] :• If i ∈ A then x i − := 4 i + 4 , x i := 4 i ,• If i / ∈ A then x i − := 4 i , x i := 4 i ,• If i ∈ B then y i − := 4 i , y i := 4 i ,• If i / ∈ B then y i − := 4 i + 3 , y i := 4 i + 1 ,We now define ˜ x := (cid:104) , x , . . . , x U , U + 5 (cid:105) and ˜ y := (cid:104) , y , . . . , y U , U + 5 (cid:105) . Notice that the numberof vertices of ˜ x is k + 2 , and the number of vertices of ˜ y is U − m ) + 2 , because we only take intoaccount vertices which are local extremes. The arclength of any of ˜ x , ˜ y is at most U + 2 . Theorem 33. If A ∩ B = ∅ then d F (˜ x, ˜ y ) ≤ . If A ∩ B (cid:54) = ∅ then d F (˜ x, ˜ y ) ≥ .Proof. 
If there is no i ∈ A ∩ B then there is a monotonic matching which implies d F (˜ x, ˜ y ) ≤ .For any i ∈ [ U ] , let ˜ x i := (cid:104) i, x i − , x i , i + 4 (cid:105) and ˜ y i := (cid:104) i, y i − , y i , i + 4 (cid:105) . To show that, it issufficient to show that for any i ∈ [ U ] , d F (˜ x i , ˜ y i ) ≤ . If i / ∈ A and i ∈ B then the two subcurves arejust straight line segments and their distance is . If i / ∈ A and i / ∈ B then ˜ x i is a line segment and ˜ y i consists of three line segments forming a zig-zag. The matching works as follows: it first matchesthe interval [4 i, i + 2] of ˜ x i with the interval [4 i, i + 2] of ˜ y i by moving in both curves at the samespeed, then it stops moving in ˜ x i , while it moves from i + 2 to y i − and then to y i and then to i + 2 in ˜ y i . The matching continues by moving in the two remaining subsegments at the samespeed. This is a matching that attains d F (˜ x i , ˜ y i ) ≤ , because i + 2 is within distance from any of y i − , y i . Finally if i ∈ A and i / ∈ B then the matching works as follows: it first matches [4 i, x i − ] with [4 i, y i − ] , then it matches ( x i − , x i ] with ( y i − , y i ] , and it finally matches ( x i , i + 4] with ( y i , i + 4] . Since it basically matches pairs of line segments having endpoints at distance at most from each other, the Fréchet distance is again at most .Suppose now that there is an i such that i ∈ A and i ∈ B . Let v be the first appearanceof the point i + 4 in ˜ x , and let u be the second appearance of the point i in ˜ x . Assume that d F (˜ x, ˜ y ) = δ < . Then, v is matched with some point z in ˜ y which lies within distance δ . However,there is no point in ˜ y which lies within distance δ from u , and appears in ˜ y after z . This impliesthat δ ≥ , because the matching required by the definition of the Fréchet distance has to bemonotonic.We use a technique of obtaining cell-probe lower bounds first introduced by Miltersen [31]. 
For a static data structure problem with input $p \in P$, which computes $f(p, q)$ for any query $q \in Q$, we consider the communication problem where Alice gets $q \in Q$, Bob gets $p \in P$, and they must determine $f(p, q)$. If there is a solution to the data structure problem with parameters $s$, $w$ and $t$ (the number of memory cells, the word size in bits, and the number of probes per query), then there is a protocol for the communication problem, with $t$ rounds of communication, where Alice sends $\lceil \log s \rceil$ bits in each of her messages and Bob sends $w$ bits in each of his messages. The protocol is a simple simulation of the assumed data structure where Alice sends indices to memory cells and Bob responds with the cell content. Theorem 30, combined with Theorem 33, implies lower bounds for cell-probe Fréchet distance oracles.

Theorem 34.
Consider any Fréchet distance oracle with approximation factor − γ , for any γ ∈ (0 , , in the cell-probe model, which supports curves in R as follows: it stores any polygonalcurve of arclength at most L , for L ≥ , it supports queries of arclength at most L and complexity k ,where k ≤ L/ , and it achieves performance parameters t , w , s . There exist w = Ω (cid:32) kt (cid:18) Lk (cid:19) − (cid:15) (cid:33) , s = 2 Ω (cid:16) k log( L/k ) t (cid:17) such that if w < w then s ≥ s , for any constant (cid:15) > .Proof. By Theorem 33, if there exists a randomized [ a, b ] -protocol for the communication problem,in which, Alice gets any curve x of complexity k + 2 and arclength at most U + 2 , Bob getsany curve y of complexity U − m ) + 2 of arclength at most U + 2 and they can decide whether d F ( x, y ) ≤ or d F ( x, y ) ≥ , then they can solve the ( k, U ) -Disjointness problem.By Theorem 30, for any δ > , there exists b = Ω (cid:16) k (cid:0) Uk (cid:1) − δ (cid:17) , such that a randomized [ a, b ] -protocol for ( k, U ) -Disjointness, for any k ≤ m ≤ U , requires either a ≥ δk log (cid:0) Uk (cid:1) or b ≥ b .Hence, for any δ > , and any k ≤ m ≤ U , if there exists a randomized [ a, b ] -protocol for the (2 k + 2 , U − m ) + 2) -Fréchet DTEP for any curves of arclength at most U + 2 , then either a ≥ δk log (cid:0) Uk (cid:1) or b ≥ b .The simulation argument implies that if there exists a cell-probe data structure with parameters t , w , s for curves in R , with query complexity k + 2 , and arclength at most U + 2 , then thereexists a randomized [2 t log s, tw ] -protocol for the Fréchet DTEP. Hence it should be either that t log s ≥ δk log (cid:0) Uk (cid:1) or tw ≥ b . There exists a w = Ω (cid:16) k t (cid:0) Uk (cid:1) − δ (cid:17) such that if w < w ≤ b ,then s ≥ δk log( U/k )2 t . The theorem is now implied by setting δ = (cid:15)/ , L = 12 U + 2 and rescaling k ← k + 2 . 
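The simulation argument used in the proof above can be illustrated with a toy example. The Python sketch below is an illustration under stated assumptions rather than part of the construction: the function names and the sorted-array data structure are hypothetical, chosen only to show how each probe costs Alice $\lceil \log s \rceil$ bits (a cell index) and Bob $w$ bits (the cell content).

```python
def simulate(cells, word_bits, query_algorithm, q):
    """Run a cell-probe query as a two-party protocol and count the bits sent.

    Alice holds the query q and runs query_algorithm; Bob holds the memory
    cells. Every probe is one round: an address from Alice, a word from Bob.
    """
    s = len(cells)
    address_bits = max(1, (s - 1).bit_length())  # ceil(log2 s) for s >= 2
    sent = {"alice": 0, "bob": 0, "probes": 0}

    def probe(address):
        sent["alice"] += address_bits  # Alice sends the index of a cell
        sent["bob"] += word_bits       # Bob answers with the cell content
        sent["probes"] += 1
        return cells[address]

    answer = query_algorithm(q, probe)
    return answer, sent


def make_membership_query(n):
    """Toy query algorithm: binary search in a sorted array of n cells."""
    def run(x, probe):
        lo, hi = 0, n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            value = probe(mid)
            if value == x:
                return True
            if value < x:
                lo = mid + 1
            else:
                hi = mid - 1
        return False
    return run
```

In this toy setting a data structure with $s = 8$ cells of $w = 5$ bits that answers a query with $t = 3$ probes yields a protocol in which Alice sends $3 \cdot 3 = 9$ bits and Bob sends $3 \cdot 5 = 15$ bits, which is the accounting that turns communication lower bounds into trade-offs between $t$, $w$ and $s$.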
In this section, we focus on distance oracles for the discrete Fréchet distance, in the cell-probe model.Our reductions use points in a bounded subset of R d requiring O ( d ) bits for their description. Next,we define domains of sequences which satisfy this property. Definition 35 (Bounded domain) . We say that a point sequence P = p , . . . , p m has a boundeddomain S ⊂ R d if there exist constants C > , λ > such that for all i ∈ [ m ] , p i ∈ S and eachelement of λ · p i is an integer lying in [ − C, C ] ∩ Z . In the remainder, we reduce ( k, U ) -Disjointness to ( k, U ) -Discrete Fréchet DTEP and concludewith lower bounds for discrete Fréchet distance oracles in the cell-probe model. We consider twocases for ( k, U ) -Discrete Fréchet DTEP. First, we assume that points belong to a bounded domain X ⊂ R and | X | = O (1) . Second, we consider the high-dimensional case where points are chosenfrom some bounded domain X ⊂ R O (log m ) , where m ≤ U . We want to construct point sequences, one for each input set of Alice and Bob, such that there existsa common element in Alice’s and Bob’s input sets, if and only if the discrete Fréchet distance of the26 ysy y β β α α β (cid:48) β (cid:48) α (cid:48) α (cid:48) w w (cid:48) x x Figure 4: Points used in our gadgets. The blue disks of radius are centered at w and w (cid:48) andthey cover α , α , y and α (cid:48) , α (cid:48) , y respectively. The red disk of radius centered at y covers β , β , w, w (cid:48) , s, x and the red disk of radius centered at y covers β (cid:48) , β (cid:48) , w, w (cid:48) , s, x .two sequences is less or equal than a given threshold. Our reduction takes some of its main ideasfrom [9]. Our gadgets use the following points (see Fig. 4): α = ( − . , . , α = ( − . , − . α (cid:48) = (1 . , . , α (cid:48) = (1 . , − . , β = ( − . , . ,β = ( − . , − . , β (cid:48) = (0 . , . , β (cid:48) = (0 . , − . , s = (0 , , w = ( − . , ,w (cid:48) = (0 . , , y = ( − . , , y = (0 . , , x = ( − . 
, − , x = (0 . , − . Let D = (cid:100) log U (cid:101) , where U is the size of the universe in the ( k, U ) -Disjointness instance. We furtherassume that D is even for convenience. We treat elements of the universe as binary vectors: Alice’sset corresponds to a set { a , . . . , a k } , where each a i ∈ { , } D , and Bob’s set corresponds to a set { b , . . . , b k } , where each b i ∈ { , } D . For each vector a i ∈ { , } D we have a gadget A i whichis a sequence of points constructed as follows: for each odd coordinate j we either put α or α depending on whether ( a i ) j is or and for each even coordinate j we either put α (cid:48) or α (cid:48) dependingon whether ( a i ) j is or . For example, for the vector (0 , , , (assuming that it belongs toAlice) we create a , a (cid:48) , a , a (cid:48) . Similarly for each vector b i we have a gadget B i which is a sequenceof points constructed as follows: for each odd coordinate j we either put β or β depending onwhether ( b i ) j is or and for each even coordinate j we either put β (cid:48) or β (cid:48) depending on whether ( b i ) j is or . Given two sequences P = p , . . . , p m and Q = q , . . . , q m , we say that a traversal T = ( i , j ) , . . . , ( i m , j m ) is parallel if for all k = 1 , . . . , m we have i k = j k = k . Lemma 36.
Let a i , b j ∈ { , } D . If a i = b j then d dF ( A i , B j ) ≤ . If a i (cid:54) = b j then d dF ( A i , B j ) ≥ √ .Moreover, for any non-parallel traversal T , we have d T ( A i , B j ) ≥ .Proof. If a i = b j then the parallel traversal gives d dF ( A i , B j ) ≤ . If a i (cid:54) = b j then d dF ( A i , B j ) ≥ √ .To see that notice that (cid:107) β − α (cid:107) = (cid:107) β − α (cid:107) = (cid:107) β (cid:48) − α (cid:48) (cid:107) = (cid:107) β (cid:48) − α (cid:48) (cid:107) = √ . Furthermore,for each z, w ∈ { , } (cid:107) a z − b (cid:48) w (cid:107) > and (cid:107) a (cid:48) z − b w (cid:107) > .We define W = (cid:13) Dm/ i =1 ( w ◦ w (cid:48) ) . Given a , . . . , a k and b , . . . , b m , we construct two point sequencesas follows: P = W ◦ x ◦ (cid:13) mi =1 ( s ◦ B i ) ◦ s ◦ x ◦ W,Q = (cid:13) ki =1 ( y ◦ A i ◦ y ) . Lemma 37.
Let a , . . . , a k ∈ { , } D and b , . . . , b m ∈ { , } D . If there exist i, j such that a i = b j then d dF ( P, Q ) ≤ . roof. We assume that there exist i ∗ ∈ [ k ] , j ∗ ∈ [ m ] such that a i ∗ = b j ∗ . We describe one traversal T which achieves d T ( P, Q ) ≤ and hence d dF ( P, Q ) ≤ .1. The first D ( i ∗ − points of W are matched with the first ( i ∗ − D + 2) points of q . Inparticular, for each i = 1 , . . . , i ∗ − (i) w is matched with y , (ii) T proceeds in parallel for (cid:13) D/ j =1 ( w ◦ w (cid:48) ) and A i , (iii) w (cid:48) is matched with y .2. T remains in y and it matches it with the rest of W . Then, x is matched with y .3. y is matched with all points in (cid:13) j ∗ − j =1 ( s ◦ B i ) .4. T proceeds in parallel for A i ∗ and B j ∗ .5. T remains in y and proceeds only in p until it reaches W .6. The first D ( m − i ∗ ) points of W are matched with the rest of Q as in step 1.7. T remains in y (the last point of q ) and it proceeds in P until the end.Points w , w (cid:48) are within distance from any of y , y , α , α , α (cid:48) , α (cid:48) . Points y are within distance from x , s and any of β , β , β (cid:48) , β (cid:48) . By Lemma 36, d dF ( A i ∗ , B j ∗ ) ≤ . Then y is within distance from x , s and any of β , β , β (cid:48) , β (cid:48) . wP = Q = x . . . p y i . . . p y i . . . x w. . . y i a i y i . . . Figure 5: x is matched with y i , p y i is the last point in p which is matched with y i and p y i is thefirst point in p which is matched with y i . Lemma 38.
Let a , . . . , a k ∈ { , } D and b , . . . , b m ∈ { , } D . If there are no i, j such that a i = b j ,then d dF ( P, Q ) ≥ . .Proof. Consider some traversal T . We assume that T matches x with some y and no other pointof Q . Likewise, x is matched with some y and no other point of Q . If these assumptions do nothold and x or x are matched with some other point then d T ( P, Q ) ≥ . . Furthermore we assumethat each s is matched with either a y or a y , because otherwise d T ( P, Q ) ≥ . . Now let y i bethe i th appearance of point y in Q and assume that x is matched with it. Let p y i be the last pointin P which is matched with y i and let p y i be the first point in P which is matched with y i (seeFig. 5). We consider all cases for p y i :• If p y i is x then the first appearance of s is matched with one of α , α , α (cid:48) , α (cid:48) and hence d dF ( P, Q ) ≥ . .• If p y i is the j th appearance of s then: – If j = k + 1 then the first point of A i is matched with either s or x . Hence, the distanceis at least . . 28 If j < k + 1 , then by our initial assumption that s is always matched with either a y ora y , p y i cannot appear after the ( j + 1) th appearance of s . Hence, a subsequence of B j is compared to A i . By Lemma 36 this implies that d dF ( P, Q ) ≥ √ .• If p y i is a point of some gadget B j then the same reasoning implies that a subsequence of B j is compared to A i . By Lemma 36 this implies that d dF ( P, Q ) ≥ √ .• If p y i ∈ { w, w (cid:48) } then this means that x is matched with y i because of monotonicity of thematching, but then the distance is at least . .We conclude that if there are no i, j such that a i = b j , then d dF ( P, Q ) ≥ . . Theorem 39.
Suppose that there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with approximation factor c < . where Alice receives a sequence of k(2 + ⌈log(U)⌉) points in X ⊂ R^2 and Bob receives a sequence of ⌈log(U)⌉m + m + 3 points in X, where X is a bounded domain and |X| ≤ . Then there exists a randomized [a, b]-protocol for the (k, U)-Disjointness problem in a universe [U], where Alice receives a set S ⊂ [U] of size k and Bob receives a set T ⊆ [U] of size m.

Proof. First Alice and Bob convert their inputs to their binary representation. Alice uses her binary vectors a_1, …, a_k and constructs a sequence of points Q = ⊙_{i=1}^{k} (y_1 ∘ A_i ∘ y_2), as described above. Similarly, Bob uses his binary vectors b_1, …, b_m and constructs P = W ∘ x_1 ∘ ⊙_{i=1}^{m} (s ∘ B_i) ∘ s ∘ x_2 ∘ W. Then, Alice and Bob run the assumed [a, b]-protocol which allows them to determine whether d_dF(P, Q) ≤ or d_dF(P, Q) ≥ . . If d_dF(P, Q) ≤ then the answer to the (k, U)-Disjointness instance is “yes” and if d_dF(P, Q) ≥ . then the answer is “no”. Lemmas 37 and 38 imply that in either case the answer is correct.

Theorem 40.
Consider any discrete Fréchet distance oracle in the cell-probe model which supports point sequences from bounded domains in R^2, as follows: for any k ≤ m ≤ U, it stores any point sequence of length m, it supports queries of length k, and it achieves performance parameters t, w, s, and approximation factor c < . . There exist

w_0 = Ω( (k / (t log m)) · (U/k)^{1−ε} ),   s_0 = 2^{Ω( k · log(U/k) / (t log m) )},

such that if w < w_0, then s ≥ s_0, for any constant ε > 0.

Proof. By Theorem 39, for sufficiently large k′ = O(k log U) and m′ = O(m log U), there exists a bounded domain X ⊂ R^2, for which if there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with approximation factor c < . , Alice’s input length equal to k′, Bob’s input length equal to m′, then there exists a randomized [a, b]-protocol for (k, U)-Disjointness, where Alice receives a set S ⊆ [U] and Bob receives a set T ⊆ [U] of size m, with k ≤ m ≤ U.

Now consider the following randomized [a, b]-protocol. First, Alice and Bob use public random coins to map all elements of U to random bit strings of dimension D = 2 log(mk). By a union bound over at most mk different elements of U, distinct elements in T ∪ S will be mapped to distinct bit strings with probability at least 1 − (mk)^{−1}. Then, Alice and Bob use the protocol of Theorem 39 to solve (k, U)-Disjointness in a universe of size m^{O(1)}. Hence, for sufficiently large k″ = Θ(k log m) and m″ = Θ(m log m), there exists a bounded domain X ⊂ R^2, for which if there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with approximation factor c < . , Alice’s input length equal to k″, Bob’s input length equal to m″, then there exists a randomized [a, b]-protocol for (k, U)-Disjointness in an arbitrary universe [U], where Alice receives a set S ⊆ [U] and Bob receives a set T ⊆ [U] of size m, with k ≤ m.

By Theorem 30, for any δ > 0, a randomized [a, b]-protocol for (k, U)-Disjointness, for any m ≤ U, where U is the size of the universe, requires either a ≥ δk log(U/k) or b ≥ b_0, where b_0 = Ω( k (U/k)^{1−δ} ). Hence, for any δ > 0, and any k, m, such that k ≤ m, if there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with the above-mentioned input parameters, then either a ≥ δk log(U/k) or b ≥ b_0.

The simulation argument implies that if there exists a cell-probe discrete Fréchet distance oracle with parameters t, w, s for point sequences of size k″ and m″, for points in D, then there exists a randomized [2t log s, tw]-protocol for the discrete Fréchet DTEP. Hence, it should be that either 2t log s ≥ δk log(U/k) or tw ≥ b_0. In other words, if w < b_0, then s ≥ 2^{δk log(U/k) / (2t)}. Rescaling for k″ = Θ(k log m) and m″ = Θ(m log m) implies that there exist

w_0 = Ω( k″ / (t log m) ) · (U log m / k″)^{1−δ} = Ω( k″ / (t log m″) ) · (U / k″)^{1−δ}

and

s_0 = 2^{Ω( (δk″ / (t log m)) · log(U log m / k″) )} = 2^{Ω( (δk″ / (t log m″)) · log(U / k″) )}

such that if w < w_0, then s ≥ s_0. The theorem is now implied by just renaming variables k″, m″ and setting δ = ε/ .

The reduction in the previous section uses point sequences in the plane.
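Both reductions bound the discrete Fréchet distance d_dF(P, Q), that is, the minimum over all traversals of the largest distance between matched points. As a reminder of the quantity involved, the following is a minimal sketch of the standard quadratic-time dynamic program. The function name `discrete_frechet` and the representation of points as coordinate tuples are illustrative assumptions and are not notation from this paper; both sequences are assumed to be non-empty.

```python
import math


def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two non-empty point sequences.

    P and Q are sequences of equal-dimension coordinate tuples.
    dp[i][j] holds the cost of the best traversal that ends by
    matching P[i] with Q[j].
    """
    n, m = len(P), len(Q)
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            dist = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                reach = 0.0  # every traversal starts at the first pair
            else:
                # A traversal step advances in P, in Q, or in both.
                reach = min(
                    dp[i - 1][j] if i > 0 else math.inf,
                    dp[i][j - 1] if j > 0 else math.inf,
                    dp[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                )
            dp[i][j] = max(dist, reach)
    return dp[n - 1][m - 1]


if __name__ == "__main__":
    # Two parallel unit segments at vertical distance 1.
    print(discrete_frechet([(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]))
```

The lemmas of this section argue about exactly this minimum: an upper bound exhibits one cheap traversal, while a lower bound shows that every traversal contains an expensive matched pair.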
We now describe a second reduction to show a dependency on the ambient dimension d of the point sequences in case d is sufficiently high. For all i ∈ [U], e_i ∈ R^U denotes the vector of the standard basis, i.e. the vector with all elements equal to 0 except the i-th coordinate which is 1. We use the following points in R^{U+2}:

w = (1, , , …, ), x_1 = (1, − , , …, ), s = (0, , , …, ), b̃_i = (0, , e_i),
x_2 = (− , , , …, ), y_1 = (1, , , …, ), ã_i = (1, , e_i), y_2 = (0, , , …, ).

Given S = {s_1, …, s_k}, T = {t_1, …, t_m} as in Definition 29, we construct the following point sequences:

P = w ∘ x_1 ∘ ⊙_{i=1}^{m} (s ∘ b̃_{t_i}) ∘ s ∘ x_2 ∘ w,
Q = ⊙_{i=1}^{k} (y_1 ∘ ã_{s_i} ∘ y_2).

Notice that P is a point sequence of length 2m + 5 and Q is a point sequence of length 3k. All points lie in R^{U+2}. Point w serves as a skipping gadget since it is near to any point of Q, and points s, x_1, x_2, y_1, y_2 are needed for synchronization: x_1 is close to y_1 but no other point in Q, x_2 is close to y_2 but no other point in Q, and s is close to both y_1 and y_2 but no other point in Q. Our analysis is very similar to the one of Section 5.4. A new key component is the use of random projections, and in particular the random projection by Achlioptas [1] to reduce the dimension.

Lemma 41. If S ∩ T ≠ ∅ then d_dF(P, Q) ≤ √ .

Proof. Let i*, j* be such that s_{i*} = t_{j*} ∈ T ∩ S. We describe a traversal which achieves distance √ :
1. w is matched with all points of Q before y_1 ∘ ã_{s_{i*}} ∘ y_2.
2. x_1 is matched with y_1.
3. y_1 is matched with all points of P before b̃_{t_{j*}}.
4. ã_{s_{i*}} is matched with b̃_{t_{j*}}.
5. y_2 is matched with the rest of P.
6. w is matched with the rest of Q.

Only the following distances appear in the above matching: ‖w − y_1‖, ‖w − y_2‖, ‖x_1 − y_1‖, ‖s − y_1‖, ‖ã_{s_{i*}} − b̃_{t_{j*}}‖, ‖s − y_2‖, ‖x_2 − y_2‖, {‖w − ã_{s_i}‖ | i ≠ i*}, {‖y_1 − b̃_{t_j}‖ | j < j*}, {‖y_2 − b̃_{t_j}‖ | j > j*}, and all of them are at most √ .

Figure 6: x_1 is matched with y_1^i, p_{y_1}^i is the last point in P which is matched with y_1^i and p_{y_2}^i is the first point in P which is matched with y_2^i.

Lemma 42. If S ∩ T = ∅ then d_dF(P, Q) ≥ √ .

Proof. Consider the optimal traversal for P and Q. We assume that x_1 is matched with some y_1 and no other point of Q. Likewise, x_2 is matched with some y_2 and no other point of Q. If these assumptions do not hold and x_1 or x_2 are matched with some other point then d_dF(P, Q) ≥ √ . Furthermore we assume that each s is matched with either a y_1 or a y_2, because otherwise d_dF(P, Q) ≥ √ .

Now let y_1^i be the i-th appearance of point y_1 in Q and assume that x_1 is matched with it. Let p_{y_1}^i be the last point in P which is matched with y_1^i and let p_{y_2}^i be the first point in P which is matched with y_2^i (see Fig. 6). We consider all cases for p_{y_1}^i:

• If p_{y_1}^i is x_1 then at least one of the following must happen:
  – the first appearance of s is matched with ã_{s_i} and hence d_dF(P, Q) ≥ √ ,
  – x_1 is matched with ã_{s_i} and hence d_dF(P, Q) ≥ √ .
• If p_{y_1}^i is the j-th appearance of s then:
  – If j = k + 1 then ã_{s_i} is matched with either s or x_2 (or both). Hence, the distance is at least √ .
  – If j < k + 1, then by our initial assumption that s is always matched with either a y_1 or a y_2, p_{y_2}^i cannot appear after the (j + 1)-th appearance of s. Hence, b̃_j is matched with ã_i (because s is assumed not to be matched with ã_{s_i}). This implies that d_dF(P, Q) ≥ .
• If p_{y_1}^i is x_2 then one of the aforementioned assumptions is not satisfied and hence d_dF(P, Q) ≥ √ .
• If p_{y_1}^i is some point b̃_{t_j} then ã_{s_i} is either matched with b̃_{t_j} or with s. Hence, d_dF(P, Q) ≥ √ .
• If p_{y_1}^i is w then this means that x_2 is matched with y_1^i because of monotonicity of the matching, but then the distance is at least √ .

We conclude that if T ∩ S = ∅, then d_dF(P, Q) ≥ √ .

Both point sequences P and Q consist of points in {−1, 0, 1}^{U+2}. In order to reduce the dimension, we will use the following (slightly rephrased) result by Achlioptas.

Theorem 43 (Theorem 1.1 [1]). Let P be an arbitrary set of n points in R^d. Given ε, β > 0, let

d_0 = (4 + 2β) / (ε²/2 − ε³/3) · log n.

For integer d′ ≥ d_0, let R be a d′ × d random matrix with each R(i, j) being an independent random variable following the uniform distribution in {−1, 1}. With probability at least 1 − n^{−β}, for all u, v ∈ P:

‖(1/√d′) Ru − (1/√d′) Rv‖² ∈ (1 ± ε) ‖u − v‖².

Theorem 44.
Suppose that there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with approximation factor c < √( / ) where Alice receives a sequence of 3k points in ([− , ] ∩ Z)^{Θ(log m)} and Bob receives a sequence of 2m + 5 points in ([− , ] ∩ Z)^{Θ(log m)}. Then there exists a randomized [a, b]-protocol for the (k, U)-Disjointness problem in a universe [U], where Alice receives a set S ⊆ [U] of size k and Bob receives a set T ⊆ [U] of size m.

Proof. Alice constructs a sequence Q of 3k points as described above and similarly Bob constructs a sequence P of 2m + 5 points. Let S be the set of all points in P, Q. Alice and Bob use a source of public random coins to construct the same Johnson-Lindenstrauss randomized mapping. In particular, we use Theorem 43. Let R be a d′ × d matrix with each element R(i, j) chosen uniformly at random from {−1, 1} and let f : R^d → R^{d′} be the function which maps any vector v ∈ R^d to Rv. Alice and Bob sample f(·) and project their points to dimension d′ = O(log(m + k)) = O(log m). With high probability, for any two points x, y ∈ S we have

‖f(x) − f(y)‖ ∈ [ . · d′ · ‖x − y‖, . · d′ · ‖x − y‖ ].

Each element of vector f(x) is produced by an inner product of a vector of d random signs and a vector of at least d − zeros and at most elements from {−1, 1}. Hence, ‖f(x)‖_∞ ≤ and moreover f(x) ∈ Z^{d′}. Let f(P) and f(Q) be the two point sequences after randomly projecting the points. By Lemmas 41 and 42 we get that if T ∩ S ≠ ∅ then d_dF(f(P), f(Q)) ≤ √( . · d′) and if T ∩ S = ∅ then d_dF(f(P), f(Q)) ≥ √( . d′).

Hence Alice and Bob can now use the assumed protocol for computing the discrete Fréchet distance and decide whether T ∩ S ≠ ∅ or T ∩ S = ∅.

Theorem 45.
There exists d_0 = O(log m), such that the following holds. Consider any discrete Fréchet distance oracle in the cell-probe model which supports point sequences in R^d, d ≥ d_0, as follows: for any k ≤ m ≤ U, it stores any sequence of length m, it supports queries of length k, and it achieves performance parameters t, w, s, and approximation factor c < √( / ). There exist

w_0 = Ω( (k/t) · (U/k)^{1−ε} ),   s_0 = 2^{Ω( k log(U/k) / t )}

such that if w < w_0 then s ≥ s_0, for any constant ε > 0.

Proof. By Theorem 44, there exists a set X ⊂ R^d, for which if there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with approximation factor c < √( / ), Alice’s input length equal to k′, Bob’s input length equal to m′, then there exists a randomized [a, b]-protocol for (k, U)-Disjointness in an arbitrary universe [U], where Alice receives a set S ⊆ [U] and Bob receives a set T ⊆ [U] of size m, with k ≤ m ≤ U. By Theorem 30, for any δ > 0, a randomized [a, b]-protocol for (k, U)-Disjointness, for any m ≤ U, where U is the size of the universe, requires either a ≥ δk log(U/k) or b ≥ b_0, where b_0 = Ω( k (U/k)^{1−δ} ). Hence, for any δ > 0, and any k, m, such that k ≤ m, if there exists a randomized [a, b]-protocol for the discrete Fréchet DTEP with the above-mentioned input parameters, then either a ≥ δk log(U/k) or b ≥ b_0.

The simulation argument implies that if there exists a cell-probe data structure with parameters t, w, s for point sequences of size k and m, for points in R^d, then there exists a randomized [2t log s, tw]-protocol for the discrete Fréchet DTEP. Hence it should be either that 2t log s ≥ δk log(U/k) or tw ≥ b_0. There exists a w_0 = Ω( (k/t) · (U/k)^{1−δ} ) such that if w < w_0, then s ≥ 2^{δk log(U/k) / (2t)}.
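The step from the communication bound to the space bound can be unpacked as follows. This is only a worked restatement, assuming that Alice's communication in the simulated protocol is a = 2t log s, as in the protocol parameters stated above, and that logarithms are taken base 2:

```latex
\begin{align*}
2t \log s \;\ge\; \delta k \log\frac{U}{k}
\quad\Longrightarrow\quad
\log s \;\ge\; \frac{\delta k \log(U/k)}{2t}
\quad\Longrightarrow\quad
s \;\ge\; 2^{\frac{\delta k \log(U/k)}{2t}} .
\end{align*}
```

So whenever the word size is too small for Bob's side of the alternative to hold, the number of cells must be exponential in k log(U/k)/t.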
The theorem is now implied by just rescaling δ = ε/ and substituting for k′ = Θ(k), m′ = Θ(m).

We have described and analyzed a (2 + ε)-ANN data structure for time series under the continuous Fréchet distance. In doing so, we have presented the new technique of constructing so-called tight matchings, which may be of independent interest. We also showed lower bounds in the cell-probe model, which indicate that a better approximation cannot be achieved, unless we allow space usage depending on the arclength of the time series or allow a superconstant number of probes. In addition, we have also presented an O(k)-ANN randomized data structure for time series under the Fréchet distance, with optimal space usage and optimal query time.

Several open questions remain. We discuss the two main research directions:

1. Are there data structures with similar guarantees for the ANN problem under the continuous Fréchet distance for curves in the plane (or higher dimensions)? Our approach uses signatures, which are tailored to the 1-dimensional setting. A related concept for curves in higher dimensions is the curve simplification. It is an open problem whether simplifications can be applied in place of signatures to obtain similar results.
2. The lower bounds presented in this paper are only meaningful when the approximation factor is a small constant and the number of probes is also constant. Are there ANN data structures for the continuous Fréchet distance (in any dimension) with query time polynomial in k and/or m, which do not have exponential dependency on k or m, and which do not depend on the arclength (resp. the maximum edge length) of the curves?

One of the aspects that make our results and these open questions interesting is that known generic approaches designed for more general classes of metric spaces cannot be applied. There exist several data structures which operate on general metric spaces with bounded doubling dimension (see e.g.
[8, 26, 29]). However, the doubling dimension of the metric space defined over the space of time series with the continuous Fréchet distance is unbounded [16].

One of the aspects that make our problem difficult is that the Fréchet distance does not exhibit a norm structure. In this sense it is very similar to the well-known Hausdorff distance for sets, which is equally challenging from the point of view of data structures (see also the discussion in [21, 27]). We hope that answering the above research questions will lead to new techniques for handling such distance measures.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. doi:10.1016/S0022-0000(03)00025-4.
[2] Peyman Afshani and Anne Driemel. On the complexity of range searching among curves. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 898–917, 2018. doi:10.1137/1.9781611975031.58.
[3] Helmut Alt and Michael Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 05:75–91, 1995. doi:10.1142/S0218195995000064.
[4] Boris Aronov, Omrit Filtser, Michael Horton, Matthew J. Katz, and Khadijeh Sheikhan. Efficient nearest-neighbor query and clustering of planar curves. In Algorithms and Data Structures - 16th International Symposium, WADS 2019, Edmonton, AB, Canada, August 5-7, 2019, Proceedings, pages 28–42, 2019. URL: https://doi.org/10.1007/978-3-030-24766-9_3, doi:10.1007/978-3-030-24766-9_3.
[5] Maria Astefanoaei, Paul Cesaretti, Panagiota Katsikouli, Mayank Goswami, and Rik Sarkar. Multi-resolution sketches and locality sensitive hashing for fast trajectory processing. In International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2018), volume 10, 2018.
[6] Julian Baldus and Karl Bringmann. A fast implementation of near neighbors queries for Fréchet distance (GIS Cup). In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’17, pages 99:1–99:4, 2017.
[7] Alberto Bertoni, Giancarlo Mauri, and Nicoletta Sabadini. Simulations among classes of random access machines and equivalence among numbers succinctly represented. Ann. Discrete Math., 25:65–90, 1985.
[8] Alina Beygelzimer, Sham M. Kakade, and John Langford. Cover trees for nearest neighbor. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 97–104, 2006. URL: https://doi.org/10.1145/1143844.1143857, doi:10.1145/1143844.1143857.
[9] Karl Bringmann and Wolfgang Mulzer. Approximability of the discrete Fréchet distance. JoCG, 7(2):46–76, 2016. doi:10.20382/jocg.v7i2a4.
[10] Kevin Buchin, Yago Diez, Tom van Diggelen, and Wouter Meulemans. Efficient trajectory queries under the Fréchet distance (GIS Cup). In Proc. 25th Int. Conference on Advances in Geographic Information Systems (SIGSPATIAL), pages 101:1–101:4, 2017.
[11] Kevin Buchin, Yago Diez, Tom van Diggelen, and Wouter Meulemans. Efficient trajectory queries under the Fréchet distance (GIS Cup). In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’17, New York, NY, USA, 2017. Association for Computing Machinery. URL: https://doi.org/10.1145/3139958.3140064, doi:10.1145/3139958.3140064.
[12] Maike Buchin, Anne Driemel, and Dennis Rohde. Approximating (k, ℓ)-median clustering for polygonal curves. CoRR, abs/2009.01488, 2020. URL: https://arxiv.org/abs/2009.01488, arXiv:2009.01488.
[13] Timothy M. Chan. Well-separated pair decomposition in linear time? Inf. Process. Lett., 107(5):138–141, August 2008. URL: https://doi.org/10.1016/j.ipl.2008.02.008, doi:10.1016/j.ipl.2008.02.008.
[14] Mark De Berg, Atlas F Cook, and Joachim Gudmundsson. Fast Fréchet queries. Computational Geometry, 46(6):747–755, 2013.
[15] Anne Driemel and Sariel Har-Peled. Jaywalking your dog: Computing the Fréchet distance with shortcuts. SIAM Journal on Computing, 42(5):1830–1866, 2013.
[16] Anne Driemel, Amer Krivosija, and Christian Sohler. Clustering time series under the Fréchet distance. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 766–785, 2016. URL: https://doi.org/10.1137/1.9781611974331.ch55, doi:10.1137/1.9781611974331.ch55.
[17] Anne Driemel, Jeff M. Phillips, and Ioannis Psarros. The VC dimension of metric balls under Fréchet and Hausdorff distances. In Proc. 35th International Symposium on Computational Geometry, pages 28:2–28:16, 2019.
[18] Anne Driemel and Francesco Silvestri. Locality-sensitive hashing of curves. In Proc. 33rd International Symposium on Computational Geometry, pages 37:1–37:16, 2017.
[19] Fabian Dütsch and Jan Vahrenhold. A filter-and-refinement-algorithm for range queries based on the Fréchet distance (GIS Cup). In Proc. 25th Int. Conference on Advances in Geographic Information Systems (SIGSPATIAL), pages 100:1–100:4, 2017.
[20] Ioannis Z. Emiris and Ioannis Psarros. Products of Euclidean metrics and applications to proximity questions among curves. In Proc. 34th Int. Symposium on Computational Geometry (SoCG), volume 99 of LIPIcs, pages 37:1–37:13, 2018.
[21] Martin Farach-Colton and Piotr Indyk. Approximate nearest neighbor algorithms for Hausdorff metrics via embeddings. In Proc. 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 171–180, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814589, doi:10.1109/SFFCS.1999.814589.
[22] Arnold Filtser, Omrit Filtser, and Matthew J. Katz. Approximate nearest neighbor for curves - simple, efficient, and deterministic. CoRR, abs/1902.07562, 2019. arXiv:1902.07562.
[23] Joachim Gudmundsson and Michiel Smid. Fast algorithms for approximate Fréchet matching queries in geometric trees. Computational Geometry, 48(6):479–494, 2015. doi:10.1016/j.comgeo.2015.02.003.
[24] Sariel Har-Peled. Geometric Approximation Algorithms. American Mathematical Society, Boston, MA, USA, 2011.
[25] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012. URL: http://dx.doi.org/10.4086/toc.2012.v008a014, doi:10.4086/toc.2012.v008a014.
[26] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148–1184, 2006. URL: https://doi.org/10.1137/S0097539704446281, doi:10.1137/S0097539704446281.
[27] Piotr Indyk. On approximate nearest neighbors in non-Euclidean spaces. In Proc. 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 148–155, 1998. URL: https://doi.org/10.1109/SFCS.1998.743438, doi:10.1109/SFCS.1998.743438.
[28] Piotr Indyk. Approximate nearest neighbor algorithms for Fréchet distance via product metrics. In Symposium on Computational Geometry, pages 102–106, 2002.
[29] Robert Krauthgamer and James R. Lee. Navigating nets: simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004, pages 798–807, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982913.
[30] Stefan Meintrup, Alexander Munteanu, and Dennis Rohde. Random projections and sampling algorithms for clustering of high-dimensional polygonal curves. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 12807–12817, 2019.
[31] Peter Bro Miltersen. Lower bounds for union-split-find related problems on random access machines. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, STOC ’94, pages 625–634, New York, NY, USA, 1994. ACM. doi:10.1145/195058.195415.
[32] Majid Mirzanezhad. On the approximate nearest neighbor queries among curves under the Fréchet distance, 2020. arXiv:2004.08444.
[33] Mihai Patrascu. Unifying the landscape of cell-probe lower bounds. SIAM J. Comput., 40(3):827–847, 2011. doi:10.1137/09075336X.
[34] Martin Werner and Dev Oliver. ACM SIGSPATIAL GIS Cup 2017: Range queries under Fréchet distance. SIGSPATIAL Special, 10(1):24–27, June 2018. doi:10.1145/3231541.3231549.
Computational models
Our data structures operate in the real-RAM model. That is, we assume that the machine can store and access real numbers in constant time and the operations (+, −, ×, ÷, ≤) can be performed in constant time on these real numbers. In addition, we assume that the floor function of a real number can be computed in constant time. This model is commonly used in the literature, see for example [13, 24]. Nonetheless, the use of this computational model is controversial, since it allows all PSPACE and O(log(Cr) + log(ε)), where C is the largest coordinate of any of the input points and r is the parameter that defines the query radius of the ANN data structure. Moreover, the space and the number of cell probes to the data structure are unaffected by this change.

Our lower bounds hold in the cell-probe model. In this model of computation we are interested in the number of memory accesses (cell probes) to the data structure which are performed by a query. Given a universe of data and a universe of queries, a cell-probe data structure with performance parameters s, t, w, is a structure which consists of s memory cells, each able to store w bits, and any query can be answered by accessing tt