Approximating (k, ℓ)-Median Clustering for Polygonal Curves
Maike Buchin∗  Anne Driemel†  Dennis Rohde‡
November 4, 2020
Abstract
In 2015, Driemel, Krivošija and Sohler introduced the (k, ℓ)-median problem for clustering polygonal curves under the Fréchet distance. Given a set of input curves, the problem asks to find k median curves of at most ℓ vertices each that minimize the sum of Fréchet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this paper, we present a randomized bicriteria-approximation algorithm that works for polygonal curves in R^d and achieves approximation factor (1 + ε) with respect to the clustering costs. The algorithm has worst-case running-time linear in the number of curves, polynomial in the maximum number of vertices per curve, i.e. their complexity, and exponential in d, ℓ, ε and δ, i.e., the failure probability. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with similar cost as an optimal median curve of complexity ℓ, but of complexity at most 2ℓ − 2, and whose vertices can be computed efficiently. We combine this lemma with the superset-sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest.

∗Faculty of Mathematics, Ruhr-University Bochum, Germany, [email protected]
†Hausdorff Center for Mathematics, University of Bonn, Germany, [email protected]
‡Faculty of Mathematics, Ruhr-University Bochum, Germany, [email protected]

Introduction
Since the development of k-means – the pioneer of modern computational clustering – the last 65 years have brought a diversity of specialized [6, 7, 13, 19, 20, 31, 32] as well as generalized clustering algorithms [2, 5, 24]. However, in most cases clustering of point sets was studied. Many clustering problems indeed reduce to clustering of point sets, but for sequential data like time-series and trajectories – which arise in the natural sciences, medicine, sports, finance, ecology, audio/speech analysis, handwriting and many more – this is not the case. Hence, we need specialized clustering methods for these purposes, cf. [1, 12, 18, 29, 30].

A promising branch of this active research deals with (k, ℓ)-center and (k, ℓ)-median clustering – adaptations of the well-known Euclidean k-center and k-median clustering. In (k, ℓ)-center clustering, respectively (k, ℓ)-median clustering, we are given a set of n polygonal curves in R^d of complexity (i.e., the number of vertices of the curve) at most m each and want to compute k centers that minimize the objective function – just as in Euclidean k-clustering. In addition, the centers are restricted to have complexity at most ℓ each to prevent over-fitting – a problem specific to sequential data. A great benefit of regarding the sequential data as polygonal curves is that we introduce an implicit linear interpolation. This does not require any additional storage space, since we only need to store the vertices of the curves, which are the sequences at hand. We compare the polygonal curves by their Fréchet distance, a continuous distance measure which takes the entire course of the curves into account, not only the pairwise distances among their vertices. Therefore, irregularly sampled sequences are automatically handled by the interpolation, which is desirable in many cases. Moreover, Buchin et al.
[10] showed, by using heuristics, that the (k, ℓ)-clustering objectives yield promising results on trajectory data.

This branch of research formed only recently, about twenty years after Alt and Godau developed an algorithm to compute the Fréchet distance between polygonal curves [4]. Several papers have since studied this type of clustering [9–11, 15, 26]. However, all of these clustering algorithms, except the approximation-schemes for polygonal curves in R [15] and the heuristics in [10], choose a k-subset of the input as centers. (This is also often called discrete clustering.) This k-subset is later simplified, or all input curves are simplified before choosing a k-subset. Either way, using these techniques one cannot achieve an approximation factor of less than 2. This is because there need not be an input curve whose distance to its median is less than the average distance of a curve to its median. Driemel et al. [15], who were the first to study clustering of polygonal curves under the Fréchet distance in this setting, already overcame this problem in one dimension by defining and analyzing δ-signatures, which are succinct representations of classes of curves that allow synthetic center-curves to be constructed. However, it seems that δ-signatures are only applicable in R. Here, we extend their work and obtain the first randomized bicriteria approximation algorithm for (k, ℓ)-median clustering of polygonal curves in R^d.

Driemel et al. [15] introduced the (k, ℓ)-center and (k, ℓ)-median objectives and developed the first approximation-schemes for these objectives, for curves in R. Furthermore, they proved that (k, ℓ)-center as well as (k, ℓ)-median clustering is NP-hard, where k is a part of the input and ℓ is fixed.
Also, they showed that the doubling dimension of the metric space of polygonal curves under the Fréchet distance is unbounded, even when the complexity of the curves is bounded.

Following this work, Buchin et al. [9] developed a constant-factor approximation algorithm for (k, ℓ)-center clustering in R^d. Furthermore, they provide improved results on the hardness of approximating (k, ℓ)-center clustering under the Fréchet distance: the (k, ℓ)-center problem is NP-hard to approximate within a factor of (1.5 − ε) for curves in R and within a factor of (2.25 − ε) for curves in R^d, where d ≥ 2, in both cases even if k = 1. Furthermore, for the (k, ℓ)-median variant, Buchin et al. [11] proved NP-hardness using a similar reduction. Again, the hardness holds even if k = 1. Also, they provided (1 + ε)-approximation algorithms for (k, ℓ)-center, as well as (k, ℓ)-median clustering, under the discrete Fréchet distance. Nath and Taylor [28] give improved algorithms for (1 + ε)-approximation of (k, ℓ)-median clustering under the discrete Fréchet and Hausdorff distances. Recently, Meintrup et al. [26] introduced a practical (1 + ε)-approximation algorithm for discrete k-median clustering under the Fréchet distance, when the input adheres to a certain natural assumption, i.e., the presence of a certain number of outliers.

Our algorithms build upon the clustering algorithm of Kumar et al. [25], which was later extended by Ackermann et al. [2]. This algorithm is a recursive approximation scheme that employs two phases in each call. In the so-called candidate phase it computes candidates by taking a sample S from the input set T and running an algorithm on each subset of S of a certain size. Which algorithm to use depends on the metric at hand.
The idea behind this is simple: if T contains a cluster T′ that takes a constant fraction of its size, then a constant fraction of S is from T′ with high probability. By brute-force enumeration of all subsets of S, we find this subset S′ ⊆ T′, and if S is taken uniformly and independently at random from T, then S′ is a uniform and independent sample from T′. Ackermann et al. proved for various metric and non-metric distance measures that S′ can be used for computing candidates that contain a (1 + ε)-approximate median for T′ with high probability. The algorithm recursively calls itself for each candidate to eventually evaluate these together with the candidates for the remaining clusters.

The second phase of the algorithm is the so-called pruning phase, where it partitions its input according to the candidates at hand into two sets of equal size: one with the smaller distances to the candidates and one with the larger distances to the candidates. It then recursively calls itself with the second set as input. The idea behind this is that small clusters now become large enough to find candidates for them. Furthermore, the partitioning yields a provably small error. Finally, it returns the set of k candidates that together evaluated best.

We present several algorithms for approximating (1, ℓ)-median clustering of polygonal curves under the Fréchet distance; see Fig. 1 for an illustration of the operation principles of our algorithms. While the first one, Algorithm 1, yields only a coarse approximation (factor 34), it is suitable as plugin for the following two algorithms, Algorithms 2 and 4, due to its asymptotically fast running-time. These algorithms yield a better approximation (factor 3 + ε, respectively 1 + ε). Additionally, Algorithms 2 and 4 are not only able to yield an approximation for the input set T, but for a cluster T′ ⊆ T that takes a constant fraction of T.
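The superset-sampling idea described above can be sketched with Euclidean points standing in for curves. The following is a simplified illustration only, not the paper's algorithm: the sample sizes are arbitrary, and the mean of a subset stands in for the (much more involved) candidate computation on curves.

```python
import itertools
import math
import random

def best_median_via_superset_sampling(points, sample_size=12, subset_size=4):
    """Sketch of superset sampling: draw a uniform sample S with replacement,
    enumerate all subsets of a fixed size, and treat each subset's mean as a
    candidate median.  If a cluster T' covers a constant fraction of the input,
    some enumerated subset lies entirely inside T' with high probability, and
    that subset is itself a uniform sample from T'."""
    S = [random.choice(points) for _ in range(sample_size)]  # with replacement
    candidates = [
        tuple(sum(coord) / subset_size for coord in zip(*subset))  # subset mean
        for subset in itertools.combinations(S, subset_size)
    ]
    # Evaluate every candidate against the full input (1-median cost).
    cost = lambda c: sum(math.dist(c, p) for p in points)
    return min(candidates, key=cost)
```

Note the brute-force enumeration: the number of candidates is binomial in the sample size, which is why the sample must have constant size for the overall running-time to stay linear in the input.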
We would like to use these as plugins to the (1 + ε)-approximation algorithm for k-median clustering by Ackermann et al. [2], but that would require our algorithms to comply with the sampling properties. For an input set T, the weak sampling property expresses that a constant-size set of candidates can be computed that contains a (1 + ε)-approximate median for T with high probability, by taking a constant-size uniform and independent sample of T. Further, the running-time for computing the candidates depends only on the size of the sample, the size of the candidate set and the failure probability parameter. The strong sampling property is defined similarly, but instead of a candidate set, an approximate median can be computed directly, and the running-time may only depend on the size of the sample. In our algorithms, the running-time for computing the candidate set depends on m, which is a parameter of the input. Additionally, our first algorithm for computing candidates, which contain a (3 + ε)-approximate (1, ℓ)-median with high probability, does not achieve the required approximation-factor of (1 + ε). However, looking into the analysis of Ackermann et al., any algorithm for computing candidates, with some guaranteed approximation-factor, can be used in the recursive approximation-scheme. Therefore, we decided to generalize the k-median clustering algorithm of Ackermann et al. [2].

Figure 1: From left to right: symbolic depiction of the operation principles of Algorithms 1, 2 and 4. Among all approximate ℓ-simplifications (depicted in blue) of the input curves (depicted in black), Algorithm 1 returns the one that evaluates best (the solid curve) with respect to a sample of the input. Algorithm 2 does not return a single curve, but a set of candidates. These include the curve returned by Algorithm 1 plus all curves with ℓ vertices from the cubic grids, covering balls of certain radius centered at the vertices of an input curve that is close to a median, w.h.p. Algorithm 4 is similar to Algorithm 2 but does not only cover the vertices of a single curve, but of multiple curves. We depict the best approximate median that can be generated from the grids in solid green.

Nath and Taylor [28] use a similar approach, but they developed yet another way to compute candidates: they define and analyze g-coverability, which is a generalization of the notion of doubling dimension, and indeed, for the discrete Fréchet distance the proof builds upon the doubling dimension of points in R^d. However, the doubling dimension of polygonal curves under the Fréchet distance is unbounded, even when the complexities of the curves are bounded, and it is an open question whether g-coverability holds for the continuous Fréchet distance.

We circumvent this by taking a different approach using the idea of shortcutting. It is well-known that shortcutting a polygonal curve (that is, replacing a subcurve by the line segment connecting its endpoints) does not increase its Fréchet distance to a line segment. This idea has been used before for a variety of Fréchet-distance related problems [3, 8, 14, 15]. Specifically, we introduce two new shortcutting lemmata. These lemmata guarantee the existence of good approximate medians, with complexity at most 2ℓ − 2 and whose vertices can be computed efficiently. The first one enables us to return candidates which contain a (3 + ε)-approximate median for a cluster inside the input that takes a constant fraction of the input, w.h.p., and we call it simple shortcutting.
The second one enables us to return candidates which contain a (1 + ε)-approximate median for a cluster inside the input that takes a constant fraction of the input, w.h.p., and we call it advanced shortcutting. All in all, we obtain as main result, following from Corollary 7.5:

Theorem 1.1.
Given a set T of n polygonal curves in R^d, of complexity at most m each, parameter values ε ∈ (0, 1) and δ ∈ (0, 1), and constants k, ℓ ∈ N, there exists an algorithm which computes a set C of k polygonal curves, each of complexity at most 2ℓ − 2, such that with probability at least (1 − δ) it holds that

cost(T, C) = ∑_{τ ∈ T} min_{c ∈ C} d_F(c, τ) ≤ (1 + ε) ∑_{τ ∈ T} min_{c ∈ C∗} d_F(c, τ) = (1 + ε) cost(T, C∗),

where C∗ is an optimal (k, ℓ)-median solution for T under the Fréchet distance d_F(·, ·). The algorithm has worst-case running-time linear in n, polynomial in m and exponential in δ, ε, d and ℓ.

The paper is organized as follows. First we present a simple and fast 34-approximation algorithm for (1, ℓ)-median clustering. Then, we present the (3 + ε)-approximation algorithm for (1, ℓ)-median clustering of a cluster inside the input that takes a constant fraction of the input, which builds upon simple shortcutting and the 34-approximation algorithm. Then, we present a more practical modification of the (3 + ε)-approximation algorithm, which achieves a (5 + ε)-approximation for (1, ℓ)-median clustering. Following this, we present the similar but more involved (1 + ε)-approximation algorithm for (1, ℓ)-median clustering of a cluster inside the input that takes a constant fraction of the input, which builds upon the advanced shortcutting and the 34-approximation algorithm. Finally, we present the generalized recursive k-median approximation-scheme, which leads to our main result.

Preliminaries

Here we introduce all necessary definitions. In the following, d ∈ N is an arbitrary constant. By ‖·‖ we denote the Euclidean norm, and for p ∈ R^d and r ∈ R_{≥0} we denote by B(p, r) = {q ∈ R^d | ‖p − q‖ ≤ r} the closed ball of radius r with center p. By S_n we denote the symmetric group of degree n.
We give a standard definition of grids:

Definition 2.1 (grid). Given a number r ∈ R_{>0}, for p = (p_1, . . . , p_d) ∈ R^d we define by G(p, r) = (⌊p_1/r⌋ · r, . . . , ⌊p_d/r⌋ · r) the r-grid-point of p. Let X ⊆ R^d be a subset of R^d. The grid of cell width r that covers X is the set G(X, r) = {G(p, r) | p ∈ X}.

Such a grid partitions the set X into cubic regions, and for each r ∈ R_{>0} and p ∈ X we have that ‖p − G(p, r)‖ ≤ √d · r. We give a standard definition of polygonal curves:

Definition 2.2 (polygonal curve). A (parameterized) curve is a continuous mapping τ: [0, 1] → R^d. A curve τ is polygonal, iff there exist v_1, . . . , v_m ∈ R^d, no three consecutive on a line, called τ's vertices, and t_1, . . . , t_m ∈ [0, 1] with t_1 < · · · < t_m, t_1 = 0 and t_m = 1, called τ's instants, such that τ connects every two contiguous vertices v_i = τ(t_i), v_{i+1} = τ(t_{i+1}) by a line segment. We call the line segments v_1 v_2, . . . , v_{m−1} v_m the edges of τ, and m the complexity of τ, denoted by |τ|.

Sometimes we will argue about a sub-curve τ of a given curve σ. We will then refer to τ by restricting the domain of σ, denoted by σ|_X, where X ⊆ [0, 1].

Definition 2.3 (Fréchet distance). Let H denote the set of all continuous bijections h: [0, 1] → [0, 1] with h(0) = 0 and h(1) = 1, which we call reparameterizations. The Fréchet distance between curves σ and τ is defined as d_F(σ, τ) = inf_{h ∈ H} max_{t ∈ [0,1]} ‖σ(t) − τ(h(t))‖.

Sometimes, given two curves σ, τ, we will refer to an h ∈ H as matching between σ and τ. Note that there need not exist a matching h ∈ H such that max_{t ∈ [0,1]} ‖σ(t) − τ(h(t))‖ = d_F(σ, τ). This is due to the fact that in some cases a matching realizing the Fréchet distance would need to match multiple points p_1, . . . , p_n on τ to a single point q on σ, which is not possible since matchings need to be bijections; but the p_1, . . . , p_n can get matched arbitrarily close to q, realizing d_F(σ, τ) in the limit, which we formalize in the following lemma:

Lemma 2.4. Let σ, τ: [0, 1] → R^d be curves and let r = d_F(σ, τ). There exists a sequence (h_i)_{i=1}^∞ in H, such that lim_{i→∞} max_{t ∈ [0,1]} ‖σ(t) − τ(h_i(t))‖ = r.

Proof. Define ρ: H → R_{≥0}, h ↦ max_{t ∈ [0,1]} ‖σ(t) − τ(h(t))‖, with image R = {ρ(h) | h ∈ H}. Per definition, we have d_F(σ, τ) = inf R = r.

For any non-empty subset X of R that is bounded from below and for every ε > 0 it holds that there exists an x ∈ X with inf X ≤ x < inf X + ε, by definition of the infimum. Since R ⊆ R_{≥0} and inf R exists, for every ε > 0 there exists an r′ ∈ R with inf R ≤ r′ < inf R + ε.

Now, let a_i = 1/i be a zero sequence. For every i ∈ N there exists an r_i ∈ R with r ≤ r_i < r + a_i, thus lim_{i→∞} r_i = r.

Let ρ^{−1}(r′) = {h ∈ H | ρ(h) = r′} be the preimage of r′ under ρ. Since ρ is a function, |ρ^{−1}(r′)| ≥ 1 for each r′ ∈ R. Now, for i ∈ N, let h_i be an arbitrary element from ρ^{−1}(r_i). By definition it holds that lim_{i→∞} max_{t ∈ [0,1]} ‖σ(t) − τ(h_i(t))‖ = lim_{i→∞} ρ(h_i) = lim_{i→∞} r_i = r = inf R, which proves the claim.

Now we introduce the classes of curves we are interested in.

Definition 2.5 (polygonal curve classes). For d ∈ N, we define by X^d the set of equivalence classes of polygonal curves (where two curves are equivalent, iff they can be made identical by a reparameterization) in ambient space R^d. For m ∈ N we define by X^d_m the subclass of polygonal curves of complexity at most m.

Simplification is a fundamental problem related to curves, which appears as a sub-problem in our algorithms.
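For intuition about the distance measure of Definition 2.3, recall that the closely related discrete Fréchet distance, which only aligns vertex sequences and ignores the edges, admits a short dynamic program. This is an illustrative aside, not one of the algorithms of this paper, which use the continuous Fréchet distance:

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two polygonal curves given as
    lists of vertex tuples.  Standard coupling dynamic program: c(i, j) is
    the best achievable maximum vertex distance over all couplings of the
    prefixes P[0..i] and Q[0..j]."""
    dist = lambda p, q: math.dist(p, q)

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 and j == 0:
            return dist(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), dist(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), dist(P[i], Q[0]))
        # advance on P, on Q, or on both, whichever keeps the maximum smallest
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)),
                   dist(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)
```

The continuous distance used throughout the paper is computed instead with the algorithm of Alt and Godau [4]; the discrete variant only upper-bounds it.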
Definition 2.6 (minimum-error ℓ-simplification). For a polygonal curve τ ∈ X^d we denote by simpl(α, τ) an α-approximate minimum-error ℓ-simplification of τ, i.e., a curve σ ∈ X^d_ℓ with d_F(τ, σ) ≤ α · d_F(τ, σ′) for all σ′ ∈ X^d_ℓ.

Now we define the (k, ℓ)-median clustering problem for polygonal curves.

Definition 2.7 ((k, ℓ)-median clustering). The (k, ℓ)-median clustering problem is defined as follows, where k, ℓ ∈ N are fixed (constant) parameters of the problem: given a finite and non-empty set T ⊂ X^d_m of polygonal curves, compute a set of k curves C∗ ⊂ X^d_ℓ, such that cost(T, C∗) = ∑_{τ ∈ T} min_{c∗ ∈ C∗} d_F(τ, c∗) is minimal.

We call cost(·, ·) the objective function, and we often write cost(T, c) as shorthand for cost(T, {c}). The following theorem of Indyk [23] is useful for evaluating the cost of a curve at hand.

Theorem 2.8 ([23, Theorem 31]). Let ε ∈ (0, 1) and T ⊂ X^d be a set of polygonal curves. Further let W be a non-empty sample, drawn uniformly and independently at random from T, with replacement. For τ, σ ∈ T with cost(T, τ) > (1 + ε) cost(T, σ) it holds that Pr[cost(W, τ) ≤ cost(W, σ)] < exp(−ε² |W| / 64).

The following concentration bound applies to independent Bernoulli trials, which are a special case of Poisson trials where each trial has the same probability of success. Kumar et al. [25] use this to bound the probability that a subset S′ of an independent and uniform sample S from a set T is entirely contained in a subset T′ of T. They call this superset-sampling.

Lemma 2.9 (Chernoff bound for independent Poisson trials, [27, Theorem 4.5]). Let X_1, . . . , X_n be independent Poisson trials. For δ ∈ (0, 1) it holds that

Pr[∑_{i=1}^n X_i ≤ (1 − δ) E[∑_{i=1}^n X_i]] ≤ exp(−δ² E[∑_{i=1}^n X_i] / 2).

34-Approximation for (1, ℓ)-Median

Here, we present Algorithm 1, a 34-approximation algorithm for (1, ℓ)-median clustering, which is based on the following facts: we can obtain a 3-approximate solution to the (1, ℓ)-median for a given set T = {τ_1, . . . , τ_n} ⊂ X^d_m of polygonal curves in terms of objective value, i.e., we obtain one of the at least n/2 input curves that are within distance 2 · cost(T, c∗)/n to an optimal (1, ℓ)-median c∗ for T, w.h.p., by uniformly and independently sampling a sufficient number of curves from T. There are at least n/2 of these curves by an averaging argument, and these curves have cost up to 3 · cost(T, c∗) by the triangle inequality. The sample has size depending only on a parameter determining the failure probability, and we can improve the running-time even more by using Theorem 2.8 and evaluating the cost of each curve in the sample of candidates against another sample of similar size instead of against the complete input. Though, we then have to accept an approximation factor of 6 (if we set ε = 1 in Theorem 2.8).
That is indeed acceptable, since we only obtain an approximate solution in terms of objective value and completely ignore the bound on the number of vertices of the center curve, which is a disadvantage of this approach and results in the lower bound of cost(T, c∗) not necessarily holding (if ℓ < m). To fix this, we simplify the candidate curve that evaluated best against the second sample, using an efficient minimum-error ℓ-simplification approximation algorithm, which downgrades the approximation factor to 6 + 7α, where α is the approximation factor of the minimum-error ℓ-simplification.

However, Algorithm 1 is very fast in terms of the input size. Indeed, it has worst-case running-time independent of n and sub-quartic in m. Algorithm 1 has the purpose of providing an approximate median for a given set of polygonal curves: the bicriteria approximation algorithms (Algorithms 2 and 4), which we present afterwards and which are capable of generating center curves with up to 2ℓ − 2 vertices, need an approximate median (and its approximation factor) to bound the optimal objective value. Furthermore, there is a case where Algorithms 2 and 4 may fail to provide a good approximation, but it can be proven that the result of Algorithm 1 is then a very good approximation, which can be used instead.

Algorithm 1 (1, ℓ)-Median by Simplification
procedure (1, ℓ)-Median-34-Approximation(T = {τ_1, . . . , τ_n}, δ)
    S ← sample ⌈2(− ln(δ/2))⌉ curves from T uniformly and independently with replacement
    γ ← ⌈−64 ln(δ / (2⌈2(− ln(δ/2))⌉))⌉
    W ← sample γ curves from T uniformly and independently with replacement
    t ← arbitrary elem. from arg min_{s ∈ S} cost(W, s)
    return simpl(α, t)    ▷ e.g. combining [4, 22]

Next, we prove the quality of approximation of Algorithm 1.
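The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a hedged sketch, not the paper's implementation: `dist` stands in for the continuous Fréchet distance, `simplify` for an α-approximate minimum-error ℓ-simplification (e.g. Imai–Iri combined with Alt–Godau), and the concrete sample sizes shown are only indicative of the O(ln(1/δ)) order of growth.

```python
import math
import random

def median_by_simplification(T, delta, dist, simplify):
    """Sketch of Algorithm 1: sample candidate curves, pick the one that
    evaluates best against a second, independent sample, then simplify it.
    T is a list of curves, delta the failure-probability parameter;
    dist and simplify are caller-supplied stand-ins."""
    s_size = math.ceil(2 * (-math.log(delta / 2)))
    S = [random.choice(T) for _ in range(s_size)]          # candidate sample
    gamma = math.ceil(-64 * math.log(delta / (2 * s_size)))
    W = [random.choice(T) for _ in range(gamma)]           # evaluation sample
    cost = lambda c: sum(dist(c, tau) for tau in W)        # cost w.r.t. W only
    t = min(S, key=cost)
    return simplify(t)
```

Note that the cost of each candidate is evaluated against the sample W rather than all of T, which is exactly where Theorem 2.8 enters and where the factor 2 (with ε = 1) is incurred.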
Theorem 3.1.
Given a parameter δ ∈ (0, 1) and a set T = {τ_1, . . . , τ_n} ⊂ X^d_m of polygonal curves, Algorithm 1 returns with probability at least 1 − δ a polygonal curve c ∈ X^d_ℓ, such that cost(T, c∗) ≤ cost(T, c) ≤ (6 + 7α) · cost(T, c∗), where c∗ is an optimal (1, ℓ)-median for T and α is the approximation-factor of the utilized minimum-error ℓ-simplification approximation algorithm.

Proof. First, we know that d_F(τ, simpl(α, τ)) ≤ α · d_F(τ, c∗), for each τ ∈ T.

Now, there are at least n/2 curves in T that are within distance at most 2 cost(T, c∗)/n to c∗. Otherwise the cost of the remaining curves would exceed cost(T, c∗), which is a contradiction. Hence each s ∈ S has probability at least 1/2 to be within distance 2 cost(T, c∗)/n to c∗. Since the elements of S are sampled independently, we conclude that the probability that every s ∈ S has distance to c∗ greater than 2 cost(T, c∗)/n is at most (1 − 1/2)^{|S|} ≤ exp(−⌈2(− ln(δ/2))⌉/2) ≤ δ/2.

Now, assume there is an s ∈ S with d_F(s, c∗) ≤ 2 cost(T, c∗)/n. We do not want any t ∈ S \ {s} with cost(T, t) > 2 cost(T, s) to have cost(W, t) ≤ cost(W, s). Using Theorem 2.8 we conclude that this happens with probability at most

exp(−⌈−64 ln(δ / (2⌈2(− ln(δ/2))⌉))⌉ / 64) ≤ δ / (2⌈2(− ln(δ/2))⌉) ≤ δ / (2|S|),

for each t ∈ S \ {s}.

Using a union bound over all bad events, we conclude that with probability at least 1 − δ, Algorithm 1 samples a curve s ∈ S with d_F(s, c∗) ≤ 2 cost(T, c∗)/n and returns the simplification c = simpl(α, t) of a curve t ∈ S with cost(T, t) ≤ 2 cost(T, s). The triangle inequality yields

∑_{τ ∈ T} (d_F(t, c∗) − d_F(τ, c∗)) ≤ ∑_{τ ∈ T} d_F(t, τ) ≤ 2 ∑_{τ ∈ T} d_F(s, τ) ≤ 2 ∑_{τ ∈ T} (d_F(τ, c∗) + d_F(c∗, s)),

which is equivalent to

n · d_F(t, c∗) ≤ cost(T, c∗) + 2 cost(T, c∗) + 2n · (2 cost(T, c∗)/n)  ⇔  d_F(t, c∗) ≤ 7 cost(T, c∗)/n.

Hence, we have

cost(T, c) = ∑_{τ ∈ T} d_F(τ, simpl(α, t)) ≤ ∑_{τ ∈ T} (d_F(τ, t) + d_F(t, simpl(α, t))) ≤ 2 cost(T, s) + ∑_{τ ∈ T} α · d_F(t, c∗) ≤ 2 ∑_{τ ∈ T} (d_F(τ, c∗) + d_F(c∗, s)) + 7α · cost(T, c∗) ≤ 2 cost(T, c∗) + 4 cost(T, c∗) + 7α · cost(T, c∗) = (6 + 7α) cost(T, c∗).

The lower bound cost(T, c∗) ≤ cost(T, c) follows from the fact that the returned curve has ℓ vertices and that c∗ has minimum cost among all curves with ℓ vertices.

The following lemma enables us to obtain a concrete approximation-factor and worst-case running-time of Algorithm 1.

Lemma 3.2 (Buchin et al. [9, Lemma 7.1]). Given a curve σ ∈ X^d_m, a 4-approximate minimum-error ℓ-simplification can be computed in O(m³ log m) time.

The simplification algorithm used for obtaining this statement is a combination of the algorithm by Imai and Iri [22] and the algorithm by Alt and Godau [4]. Combining Theorem 3.1 and Lemma 3.2, we obtain the following corollary.

Corollary 3.3.
Given a parameter δ ∈ (0, 1) and a set T ⊂ X^d_m of polygonal curves, Algorithm 1 returns with probability at least 1 − δ a polygonal curve c ∈ X^d_ℓ, such that cost(T, c∗) ≤ cost(T, c) ≤ 34 · cost(T, c∗), where c∗ is an optimal (1, ℓ)-median for T, in time O(m² log(m)(− ln δ)² + m³ log m), when the algorithms by Imai and Iri [22] and Alt and Godau [4] are combined for ℓ-simplification.

Proof. We use Lemma 3.2 together with Theorem 3.1, which yields an approximation factor of 6 + 7 · 4 = 34. Now, drawing the first sample takes time O(− ln δ). Drawing the second sample also takes time O(− ln δ), and evaluating the samples against each other takes time O(m² log(m)(− ln δ)²). Simplifying one of the curves that evaluates best takes time O(m³ log m). We conclude that Algorithm 1 has running-time O(m² log(m)(− ln δ)² + m³ log m).

(3 + ε)-Approximation for (1, ℓ)-Median by Simple Shortcutting

Here, we present Algorithm 2, which returns candidates containing a (3 + ε)-approximate (1, ℓ)-median of complexity at most 2ℓ − 2, for a cluster contained in the input that takes a constant fraction of the input, w.h.p. Algorithm 2 can be used as plugin in our generalized version (Algorithm 5, Section 7) of the algorithm by Ackermann et al. [2].

In contrast to Nath and Taylor [28], we cannot use the property that the vertices of a median must be found in the balls of radius d_F(τ, c∗), centered at τ's vertices, where c∗ is an optimal (1, ℓ)-median for a given input T, of which τ is an element. This is an immediate consequence of using the continuous Fréchet distance.

We circumvent this by proving the following shortcutting lemmata. We start with the simplest, which states that we can indeed search the aforementioned balls, if we accept a resulting curve of complexity at most 2ℓ − 2. See Fig. 2 for a visualization.

Lemma 4.1 (shortcutting using a single polygonal curve).
Let σ, τ ∈ X^d be polygonal curves. Let v^τ_1, . . . , v^τ_{|τ|} be the vertices of τ and let r = d_F(σ, τ). There exists a polygonal curve σ′ ∈ X^d with every vertex contained in at least one of B(v^τ_1, r), . . . , B(v^τ_{|τ|}, r), with d_F(σ′, τ) ≤ d_F(σ, τ) and |σ′| ≤ 2|σ| − 2.

Proof. Let v^σ_1, . . . , v^σ_{|σ|} be the vertices of σ. Further, let t^σ_1, . . . , t^σ_{|σ|} and t^τ_1, . . . , t^τ_{|τ|} be the instants of σ and τ, respectively. Also, for h ∈ H (recall that H is the set of all continuous bijections h: [0, 1] → [0, 1] with h(0) = 0 and h(1) = 1), let r_h = max_{t ∈ [0,1]} ‖σ(t) − τ(h(t))‖ be the distance realized by h. We know from Lemma 2.4 that there exists a sequence (h_x)_{x=1}^∞ in H, such that lim_{x→∞} r_{h_x} = d_F(σ, τ) = r.

Now, fix an arbitrary h ∈ H and assume there is a vertex v^σ_i of σ, with instant t^σ_i, that is not contained in any of B(v^τ_1, r_h), . . . , B(v^τ_{|τ|}, r_h). Let j be the maximum of {1, . . . , |τ| − 1}, such that t^τ_j ≤ h(t^σ_i) ≤ t^τ_{j+1}. So v^σ_i is matched to the edge τ(t^τ_j) τ(t^τ_{j+1}) by h. We modify σ in such a way that v^σ_i is replaced by two new vertices that are elements of B(v^τ_j, r_h) and B(v^τ_{j+1}, r_h), respectively.

Figure 2: Visualization of a simple shortcut. The black curve is an input curve that is close to an optimal median, which is depicted in red. By inserting the blue shortcut we can find a curve that has the same distance to the black curve as the median, but with all vertices contained in the balls centered at the black curve's vertices.

Namely, let t^− be the maximum of [0, t^σ_i), such that σ(t^−) ∈ B(v^τ_j, r_h), and let t^+ be the minimum of (t^σ_i, 1], such that σ(t^+) ∈ B(v^τ_{j+1}, r_h). These are the instants when σ leaves B(v^τ_j, r_h) before visiting v^σ_i and when σ enters B(v^τ_{j+1}, r_h) after visiting v^σ_i, respectively.
Let σ′_h be the piecewise-defined curve, defined just like σ on [0, t^−] and [t^+, 1], but on (t^−, t^+) it connects σ(t^−) and σ(t^+) with the line segment

s(t) = (1 − (t − t^−)/(t^+ − t^−)) σ(t^−) + ((t − t^−)/(t^+ − t^−)) σ(t^+).

We know that ‖σ(t^−) − τ(h(t^−))‖ ≤ r_h and ‖σ(t^+) − τ(h(t^+))‖ ≤ r_h. Note that t^τ_j ≤ h(t^−) and h(t^+) ≤ t^τ_{j+1}, since σ(t^−) and σ(t^+) are closest points to v^σ_i on σ that have distance r_h to v^τ_j and v^τ_{j+1}, respectively, by definition. Therefore, τ has no vertices between the instants h(t^−) and h(t^+). Now, h can be used to match σ′_h restricted to [0, t^−) to τ restricted to [0, h(t^−)) and σ′_h restricted to (t^+, 1] to τ restricted to (h(t^+), 1] with distance at most r_h. Since σ′_h restricted to [t^−, t^+] and τ restricted to [h(t^−), h(t^+)] are just line segments, they can be matched to each other with distance at most max{‖σ′_h(t^−) − τ(h(t^−))‖, ‖σ′_h(t^+) − τ(h(t^+))‖} ≤ r_h. We conclude that d_F(σ′_h, τ) ≤ r_h. Because this modification works for every h ∈ H, we have d_F(σ′_h, τ) ≤ r_h for every h ∈ H. Thus, lim_{x→∞} d_F(σ′_{h_x}, τ) ≤ d_F(σ, τ) = r.

Now, to prove the claim, for every h ∈ H we apply this modification to v^σ_i and successively to every other vertex of the resulting curve σ′_h that is not contained in one of the balls, until every vertex of σ′_h is contained in a ball. Note that the modification is repeated at most |σ| − 2 times for every h ∈ H, since the start and end vertex of σ must be contained in B(v^τ_1, r_h) and B(v^τ_{|τ|}, r_h), respectively. Therefore, the number of vertices of every σ′_h can be bounded by 2 · (|σ| − 2) + 2, since only the interior vertices can fail to lie in a ball and each such vertex is replaced by two new vertices. Thus, |σ′_h| ≤ 2|σ| − 2.

We now present Algorithm 2, which works similarly to Algorithm 1, but uses shortcutting instead of simplification. As a consequence, we can achieve an approximation factor of 3 + ε, instead of a factor of 3 + α + ε, where α ≥ 1 is the approximation factor of the simplification algorithm used in Algorithm 1. To achieve an approximation factor of 3 + ε using simplification, one would need to compute optimal minimum-error ℓ-simplifications of the input curves, and to the best of our knowledge, there is no such algorithm for the continuous Fréchet distance.

In contrast to Algorithm 1, Algorithm 2 utilizes the superset-sampling technique by Kumar et al. [25], i.e., the concentration bound in Lemma 2.9, to obtain an approximate (1, ℓ)-median for a cluster T′ contained in the input T that takes a constant fraction of T. Therefore, it has running time exponential in the size of the sample S. A further difference is that we need an upper and a lower bound on the cost of an optimal (1, ℓ)-median for T′ to properly set up the grids we use for shortcutting. The lower bound can be obtained by simple estimation, using Markov's inequality. For the upper bound we utilize a case distinction, which guarantees that if we fail to obtain an upper bound on the optimal cost, then the result of Algorithm 1 is a good approximation (factor 2 + 2ε) and can be used instead of a best curve obtained by shortcutting.

Algorithm 2 has several parameters: β determines the size (in terms of a fraction of the input) of the smallest cluster inside the input for which an approximate median can be computed, δ determines the probability of failure of the algorithm, and ε determines the approximation factor.

Algorithm 2 (1, ℓ)-Median for Subset by Simple Shortcutting
procedure (1, ℓ)-Median-(3 + ε)-Candidates(T = {τ_1, …, τ_n}, β, δ, ε)
    ε′ ← ε/3, C ← ∅
    S ← sample ⌈8β(ε′)^{−1}(ln(4) − ln(δ))⌉ curves from T uniformly and independently with replacement
    for S′ ⊆ S with |S′| = |S|/(2β) do
        c ← (1, ℓ)-Median-34-Approximation(S′, δ/4)    ▷ Algorithm 1
        ∆ ← cost(S′, c), ∆_l ← δn∆/(4 · 34 · |S′|), ∆_u ← ∆/ε′, C ← C ∪ {c}
        for s ∈ S′ do
            P ← ∅
            for i ∈ {1, …, |s|} do
                P ← P ∪ G(B(v^s_i, (1 + ε′)∆_u), ε′∆_l/(n√d))    ▷ v^s_i: i-th vertex of s
            C ← C ∪ set of all polygonal curves with 2ℓ − 2 vertices from P
    return C

We prove the quality of approximation of Algorithm 2.
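To make the grid primitive concrete, the following minimal Python sketch (our own illustration, not the paper's implementation; all function names are ours) builds a grid cover of a ball, in the spirit of G(B(v, r), w), and enumerates candidate curves with a fixed number of vertices from the union of such covers, as Algorithm 2 does with 2ℓ − 2 vertices:

```python
import itertools
import numpy as np

def grid_cover(center, radius, width):
    """Axis-parallel grid of cell width `width` covering the ball B(center, radius).

    Keeps the grid points close enough to the ball that every point of the
    ball lies within width * sqrt(d) / 2 of some returned point."""
    center = np.asarray(center, dtype=float)
    d = center.size
    steps = int(np.ceil(radius / width))
    axes = [np.arange(-steps, steps + 1) * width + c for c in center]
    slack = width * np.sqrt(d)  # cells this far out may still intersect the ball
    return [np.array(p) for p in itertools.product(*axes)
            if np.linalg.norm(np.array(p) - center) <= radius + slack]

def candidate_curves(vertices, radius, width, num_vertices):
    """Enumerate polygonal curves with `num_vertices` vertices drawn from the
    union of grid covers of balls around the given vertices."""
    P = []
    for v in vertices:
        P.extend(grid_cover(v, radius, width))
    return itertools.product(P, repeat=num_vertices)
```

The number of enumerated candidates is |P|^num_vertices, which is the source of the m^{O(ℓ)} factor in the candidate bounds below.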
Theorem 4.2.
Given three parameters β ∈ [1, ∞), δ, ε ∈ (0, 1) and a set T = {τ_1, …, τ_n} ⊂ X^d_m of polygonal curves, with probability at least 1 − δ the set of candidates that Algorithm 2 returns contains a (3 + ε)-approximate (1, ℓ)-median with up to 2ℓ − 2 vertices for any T′ ⊆ T with |T′| ≥ |T|/β.

Proof. We assume that |T′| ≥ |T|/β. Let n′ be the number of sampled curves in S that are elements of T′. Clearly, E[n′] ≥ Σ_{i=1}^{|S|} 1/β = |S|/β. Also, n′ is a sum of independent Bernoulli trials. A Chernoff bound (cf. Lemma 2.9) yields:

Pr[n′ < |S|/(2β)] ≤ Pr[n′ < E[n′]/2] ≤ exp(−|S|/(8β)) ≤ exp((ln(δ) − ln(4))/ε′) = (δ/4)^{1/ε′} ≤ δ/4.

In other words, with probability at most δ/4 no subset S′ ⊆ S of cardinality at least |S|/(2β) is a subset of T′. We condition the rest of the proof on the contrary event, denoted by E_{T′}, namely, that there is a subset S′ ⊆ S with S′ ⊆ T′ and |S′| ≥ |S|/(2β). Note that S′ is then a uniform and independent sample of T′.

Now, let c∗ ∈ argmin_{c ∈ X^d_ℓ} cost(T′, c) be an optimal (1, ℓ)-median for T′. The expected distance between s ∈ S′ and c∗ is

E[d_F(s, c∗) | E_{T′}] = Σ_{τ ∈ T′} d_F(c∗, τ)/|T′| = cost(T′, c∗)/|T′|.

By linearity we have
E[cost(S′, c∗) | E_{T′}] = (|S′|/|T′|) cost(T′, c∗). Markov's inequality yields:

Pr[(δ|T′|/(4|S′|)) cost(S′, c∗) > cost(T′, c∗) | E_{T′}] ≤ δ/4.

We conclude that with probability at most δ/4 we have (δ|T′|/(4|S′|)) cost(S′, c∗) > cost(T′, c∗). Using Markov's inequality again, for every s ∈ S′ we have

Pr[d_F(s, c∗) > (1 + ε) cost(T′, c∗)/|T′| | E_{T′}] ≤ 1/(1 + ε),

and therefore, by independence,

Pr[⋀_{s ∈ S′} (d_F(s, c∗) > (1 + ε) cost(T′, c∗)/|T′|) | E_{T′}] ≤ (1/(1 + ε))^{|S′|} ≤ exp(−ε|S|/(4β)).

Hence, with probability at most exp(−ε|S|/(4β)) ≤ (δ/4)² ≤ δ/4 there is no s ∈ S′ with d_F(s, c∗) ≤ (1 + ε) cost(T′, c∗)/|T′|. Also, with probability at most δ/4, Algorithm 1 fails to compute a 34-approximate (1, ℓ)-median c ∈ X^d_ℓ for S′, cf. Corollary 3.3.

Using a union bound over these bad events, we conclude that with probability at least 1 − δ all of the following events occur simultaneously:

• There is a subset S′ ⊆ S of cardinality at least |S|/(2β) that is a uniform and independent sample of T′,
• there is a curve s ∈ S′ with d_F(s, c∗) ≤ (1 + ε) cost(T′, c∗)/|T′|,
• Algorithm 1 computes a polygonal curve c ∈ X^d_ℓ with cost(S′, c∗_{S′}) ≤ cost(S′, c) ≤ 34 cost(S′, c∗_{S′}), where c∗_{S′} ∈ X^d_ℓ is an optimal (1, ℓ)-median for S′,
• and it holds that (δ|T′|/(4|S′|)) cost(S′, c∗) ≤ cost(T′, c∗).

Since c∗_{S′} is an optimal (1, ℓ)-median for S′, we get the following from the last two items:

cost(T′, c∗) ≥ (δ|T′|/(4|S′|)) cost(S′, c∗) ≥ (δ|T′|/(4|S′|)) cost(S′, c∗_{S′}) ≥ (δ|T′|/(4|S′|)) cost(S′, c)/34.

We now distinguish between two cases:
Case 1: d_F(c, c∗) ≥ (1 + 2ε) cost(T′, c∗)/|T′|. The triangle inequality yields

d_F(c, s) ≥ d_F(c, c∗) − d_F(c∗, s) ≥ d_F(c, c∗) − (1 + ε) cost(T′, c∗)/|T′| ≥ (1 + 2ε) cost(T′, c∗)/|T′| − (1 + ε) cost(T′, c∗)/|T′| = ε cost(T′, c∗)/|T′|.

As a consequence, cost(S′, c) ≥ ε cost(T′, c∗)/|T′|, i.e., cost(T′, c∗)/|T′| ≤ cost(S′, c)/ε.

Now, let v^s_1, …, v^s_{|s|} be the vertices of s. By Lemma 4.1 there exists a polygonal curve c′ with up to 2ℓ − 2 vertices, every vertex contained in one of B(v^s_1, d_F(c∗, s)), …, B(v^s_{|s|}, d_F(c∗, s)), and

d_F(s, c′) ≤ d_F(s, c∗) ≤ (1 + ε) cost(T′, c∗)/|T′| ≤ (1 + ε) cost(S′, c)/ε.

The set of candidates that Algorithm 2 returns therefore contains a curve c′′ with up to 2ℓ − 2 vertices from the union of the grid covers and distance at most

ε′∆_l/n = ε′ (δ|T′|/(4|S′|)) (cost(S′, c)/34) / |T′| ≤ ε′ cost(T′, c∗)/|T′| ≤ ε cost(T′, c∗)/|T′|

between every corresponding pair of vertices of c′ and c′′. We conclude that d_F(c′, c′′) ≤ ε cost(T′, c∗)/|T′|. We can now bound the cost of c′′ as follows:

cost(T′, c′′) = Σ_{τ ∈ T′} d_F(τ, c′′)
≤ Σ_{τ ∈ T′} (d_F(τ, c′) + ε cost(T′, c∗)/|T′|)
≤ Σ_{τ ∈ T′} (d_F(τ, c∗) + d_F(c∗, c′)) + ε cost(T′, c∗)
≤ Σ_{τ ∈ T′} (d_F(τ, c∗) + d_F(c∗, s) + d_F(s, c′)) + ε cost(T′, c∗)
≤ (3 + 3ε) cost(T′, c∗).
Case 2: d_F(c, c∗) < (1 + 2ε) cost(T′, c∗)/|T′|. The cost of c can easily be bounded:

cost(T′, c) ≤ Σ_{τ ∈ T′} (d_F(τ, c∗) + d_F(c∗, c)) < cost(T′, c∗) + (1 + 2ε) cost(T′, c∗) = (2 + 2ε) cost(T′, c∗).

The claim follows by rescaling ε by 1/3.

Next, we analyse the worst-case running time of Algorithm 2 and the number of candidates it returns.

Theorem 4.3.
Algorithm 2 has running time, and returns a number of candidates, bounded by 2^{O(βε^{−1}(−ln(δ)))} · m^{O(ℓ)} · (β(−ln(δ))(εδ)^{−1})^{O(ℓd)}.

Proof. The sample S has size O(βε^{−1}(−ln(δ))) and sampling it takes time O(βε^{−1}(−ln(δ))). Let n_S = |S|. The for-loop runs

binom(n_S, n_S/(2β)) ∈ O(2^{n_S}) ⊆ 2^{O(βε^{−1}(−ln(δ)))}

times. In each iteration, we run Algorithm 1, taking time O(m log(m)(−ln(δ)) + m³ log(m)) (cf. Corollary 3.3), we compute the cost of the returned curve with respect to S′, taking time O(ε^{−1}(−ln(δ)) · m log(m)), and per curve in S′ we build up to m grids of

((1 + ε′)∆_u / (ε′∆_l/(n√d)))^d = (136√d|S′|(1 + ε′)/((ε′)²δ))^d ∈ O(β^d(−ln(δ))^d ε^{−3d} δ^{−d})

points each. For each curve s ∈ S′, Algorithm 2 then enumerates all combinations of 2ℓ − 2 points from these up to m grids, resulting in

O(m^{2ℓ−2} β^{(2ℓ−2)d} (−ln(δ))^{(2ℓ−2)d} ε^{−3(2ℓ−2)d} δ^{−(2ℓ−2)d})

candidates per s ∈ S′, per iteration of the for-loop. Thus, for constant ℓ and d, Algorithm 2 computes O(poly(m, β, δ^{−1}, ε^{−1})) candidates per iteration of the for-loop, and enumeration also takes time O(poly(m, β, δ^{−1}, ε^{−1})) per iteration of the for-loop. All in all, we have running time and number of candidates bounded by 2^{O(βε^{−1}(−ln(δ)))} · m^{O(ℓ)} · (β(−ln(δ))(εδ)^{−1})^{O(ℓd)}.

5 (1, ℓ)-Median by Simple Shortcutting

The following algorithm is a modification of Algorithm 2. It is more practical, since it needs to cover only up to m (small) balls using grids. Unfortunately, it is not compatible with the superset-sampling technique and can therefore not be used as a plugin in Algorithm 5.

Algorithm 3 (1, ℓ)-Median by Simple Shortcutting
procedure (1, ℓ)-Median-(5 + ε)(T = {τ_1, …, τ_n}, δ, ε)
    ĉ ← (1, ℓ)-Median-34-Approximation(T, δ/4)    ▷ Algorithm 1
    ∆ ← cost(T, ĉ)/34, ε′ ← ε/9, P ← ∅
    S ← sample ⌈2(ε′)^{−1}(ln(4) − ln(δ))⌉ curves from T uniformly and independently with replacement
    W ← sample ⌈2(ε′)^{−2}(ln(4|S|) − ln(δ))⌉ curves from T uniformly and independently with replacement
    c ← argmin_{s ∈ S} cost(W, s)
    for i ∈ {1, …, |c|} do
        P ← P ∪ G(B(v^c_i, (3 + 4ε′) · 34∆/n), ε′∆/(n√d))    ▷ v^c_i: i-th vertex of c
    C ← set of all polygonal curves with 2ℓ − 2 vertices from P
    return argmin_{c′ ∈ C} cost(T, c′)

We prove the quality of approximation of Algorithm 3.
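The first stage of Algorithm 3, picking the curve of a small sample S that evaluates best against a second witness sample W, can be sketched as follows. This is a schematic of ours: it uses the discrete Fréchet distance as a stand-in for the continuous one (which the paper computes via the algorithm of Alt and Godau), and all function names are ours.

```python
import math
import random
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between point sequences P and Q (tuples of
    points); a stand-in for the continuous distance used in the paper."""
    @lru_cache(maxsize=None)
    def d(i, j):
        dist = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return dist
        if i == 0:
            return max(d(0, j - 1), dist)
        if j == 0:
            return max(d(i - 1, 0), dist)
        return max(min(d(i - 1, j), d(i - 1, j - 1), d(i, j - 1)), dist)
    return d(len(P) - 1, len(Q) - 1)

def cost(curves, center):
    """Sum of distances from the input curves to a center curve."""
    return sum(discrete_frechet(t, center) for t in curves)

def median_by_sampling(T, s_size, w_size, rng=random):
    """Draw samples S and W with replacement and return the curve of S that
    evaluates best against W (the first stage of Algorithm 3)."""
    S = [rng.choice(T) for _ in range(s_size)]
    W = [rng.choice(T) for _ in range(w_size)]
    return min(S, key=lambda s: cost(W, s))
```

The sample sizes s_size and w_size would be set as in the pseudocode above; here they are left as plain parameters.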
Theorem 5.1.
Given two parameters δ, ε ∈ (0, 1) and a set T = {τ_1, …, τ_n} ⊂ X^d_m of polygonal curves, with probability at least 1 − δ Algorithm 3 returns a (5 + ε)-approximate (1, ℓ)-median for T with up to 2ℓ − 2 vertices.

Proof. Let c∗ ∈ argmin_{c ∈ X^d_ℓ} cost(T, c) be an optimal (1, ℓ)-median for T. The expected distance between s ∈ S and c∗ is

E[d_F(s, c∗)] = Σ_{i=1}^n d_F(c∗, τ_i)/n = cost(T, c∗)/n.

Now, using Markov's inequality, for every s ∈ S we have

Pr[d_F(s, c∗) > (1 + ε) cost(T, c∗)/n] ≤ (cost(T, c∗)/n) · ((1 + ε) cost(T, c∗)/n)^{−1} = 1/(1 + ε),

therefore by independence

Pr[⋀_{s ∈ S} (d_F(s, c∗) > (1 + ε) cost(T, c∗)/n)] ≤ (1/(1 + ε))^{|S|} ≤ exp(−ε|S|/2).

Hence, with probability at most exp(−ε⌈2ε^{−1}(ln(4) − ln(δ))⌉/2) ≤ δ/4 there is no s ∈ S with d_F(s, c∗) ≤ (1 + ε) cost(T, c∗)/n. Now, assume there is an s ∈ S with d_F(s, c∗) ≤ (1 + ε) cost(T, c∗)/n. We do not want any t ∈ S \ {s} with d_F(t, c∗) > (1 + ε) d_F(s, c∗) to have cost(W, t) ≤ cost(W, s). Using Theorem 2.8, we conclude that this happens with probability at most

exp(−ε²⌈2ε^{−2}(ln(4|S|) − ln(δ))⌉/2) ≤ δ/(4|S|)

for each t ∈ S \ {s}. Also, with probability at most δ/4, Algorithm 1 fails to compute a 34-approximate (1, ℓ)-median ĉ ∈ X^d_ℓ for T, cf. Corollary 3.3.

Using a union bound over these bad events, we conclude that with probability at least 1 − δ, Algorithm 3 samples a curve t ∈ S with cost(T, t) ≤ (1 + ε) cost(T, s) and Algorithm 1 computes a 34-approximate (1, ℓ)-median ĉ ∈ X^d_ℓ for T, i.e.,

cost(T, c∗) ≤ 34∆ = cost(T, ĉ) ≤ 34 cost(T, c∗).

Let v^t_1, …, v^t_{|t|} be the vertices of t. By Lemma 4.1 there exists a polygonal curve c′ with up to 2ℓ − 2 vertices, every vertex contained in one of B(v^t_1, d_F(c∗, t)), …, B(v^t_{|t|}, d_F(c∗, t)), and d_F(t, c′) ≤ d_F(t, c∗). Using the triangle inequality yields

Σ_{τ ∈ T} (d_F(t, c∗) − d_F(τ, c∗)) ≤ Σ_{τ ∈ T} d_F(t, τ) ≤ (1 + ε) Σ_{τ ∈ T} d_F(s, τ) ≤ (1 + ε) Σ_{τ ∈ T} (d_F(τ, c∗) + d_F(c∗, s)),

which is equivalent to

n · d_F(t, c∗) ≤ (2 + ε) cost(T, c∗) + (1 + ε) n (1 + ε) cost(T, c∗)/n ⇔ d_F(t, c∗) ≤ (3 + 4ε) cost(T, c∗)/n.

Hence, we have d_F(t, c′) ≤ d_F(t, c∗) ≤ (3 + 4ε) cost(T, c∗)/n ≤ (3 + 4ε) · 34∆/n.

In the last step, Algorithm 3 returns a curve c′′ from the set C of all curves with up to 2ℓ − 2 vertices from P, the union of the grid covers, that evaluates best. We can assume that C contains a curve c′′ with distance at most ε∆/n ≤ ε cost(T, c∗)/n between every corresponding pair of vertices of c′ and c′′, so that d_F(c′, c′′) ≤ ε∆/n ≤ ε cost(T, c∗)/n. We can now bound the cost of c′′ as follows:

cost(T, c′′) = Σ_{τ ∈ T} d_F(τ, c′′)
≤ Σ_{τ ∈ T} (d_F(τ, c′) + ε∆/n)
≤ Σ_{τ ∈ T} (d_F(τ, t) + d_F(t, c′)) + ε cost(T, c∗)
≤ (1 + ε) cost(T, s) + (3 + 5ε) cost(T, c∗)
≤ (1 + ε) Σ_{τ ∈ T} (d_F(τ, c∗) + d_F(c∗, s)) + (3 + 5ε) cost(T, c∗)
≤ (1 + ε) cost(T, c∗) + (1 + ε)² cost(T, c∗) + (3 + 5ε) cost(T, c∗)
≤ (5 + 9ε) cost(T, c∗).

The claim follows by rescaling ε by 1/9.

We analyse the worst-case running time of Algorithm 3.

Theorem 5.2.
Algorithm 3 has running time O(nm^{2ℓ−1} log(m) ε^{−(2ℓ−2)d} + ε^{−3}(−ln(δε))² m² log(m) + m³ log(m)).

Proof. Algorithm 1 has running time O(m log(m)(−ln(δ)) + m³ log(m)). The sample S has size O(ε^{−1}(−ln(δ))) and the sample W has size O(ε^{−2}(−ln(δε))). Evaluating each curve of S against W takes time O(|S| · |W| · m² log(m)) ⊆ O(ε^{−3}(−ln(δε))² m² log(m)), using the algorithm of Alt and Godau [4] to compute the distances. Now, c has up to m vertices and every grid consists of

((3 + 4ε′) · 34∆/n / (ε′∆/(n√d)))^d = (34(3 + 4ε′)√d/ε′)^d ∈ O(ε^{−d})

points (for constant d). Therefore, we have O(mε^{−d}) points in P and Algorithm 3 enumerates all combinations of 2ℓ − 2 points from P, taking time O(m^{2ℓ−2} ε^{−(2ℓ−2)d}). Afterwards, these candidates are evaluated, which takes time O(nm log(m)) per candidate, using the algorithm of Alt and Godau [4] to compute the distances. All in all, we then have running time O(nm^{2ℓ−1} log(m) ε^{−(2ℓ−2)d} + ε^{−3}(−ln(δε))² m² log(m) + m³ log(m)).

Figure 3: Visualization of an advanced shortcut. The black curves are input curves and the red curve is an optimal median. By inserting the blue shortcut we can find a curve that has distance no larger than the median's to all the black curves but one, and with all vertices contained in the balls centered at the black curves' vertices.

6 (1 + ε)-Approximation for (1, ℓ)-Median by Advanced Shortcutting

Now we present Algorithm 4, which returns candidates containing a (1 + ε)-approximate (1, ℓ)-median of complexity at most 2ℓ − 2 for a cluster contained in the input that takes a constant fraction of the input, w.h.p. Before we present the algorithm, we present our second shortcutting lemma.
Here, we do not introduce shortcuts with respect to a single curve, but with respect to several curves: by introducing shortcuts with respect to ε|T| well-chosen curves from the given set T ⊂ X^d_m of polygonal curves, for a given ε ∈ (0, 1), we preserve the distances to at least (1 − ε)|T| curves from T. In this context, well-chosen means that there exists a certain number of subsets of T, from each of which we have to pick a curve for shortcutting. This will enable the high quality of approximation of Algorithm 4, and we formalize it in the following lemma.

Lemma 6.1 (shortcutting using a set of polygonal curves). Let σ ∈ X^d be a polygonal curve with |σ| > 2 vertices and let T = {τ_1, …, τ_n} ⊂ X^d be a set of polygonal curves. For i ∈ {1, …, n}, let r_i = d_F(τ_i, σ), and for j ∈ {1, …, |τ_i|}, let v^{τ_i}_j be the j-th vertex of τ_i. For any ε ∈ (0, 1) there are 2|σ| − 4 subsets T_1, …, T_{2|σ|−4} ⊆ T of εn/(2|σ|) curves each (not necessarily disjoint), such that for every subset T′ ⊆ T containing at least one curve out of each T_k ∈ {T_1, …, T_{2|σ|−4}}, a polygonal curve σ′ ∈ X^d exists with every vertex contained in

⋃_{τ_i ∈ T′} ⋃_{j ∈ {1, …, |τ_i|}} B(v^{τ_i}_j, r_i),

with d_F(τ, σ′) ≤ d_F(τ, σ) for each τ ∈ T \ (T_1 ∪ ⋯ ∪ T_{2|σ|−4}), and with |σ′| ≤ 2|σ| − 2.

The idea is the following; see Fig. 3 for a visualization. One can argue that every vertex v of σ not contained in any of the balls centered at the vertices of the curves in T (and of radius according to their distance to σ) can be shortcut by connecting the last point p^− before v (in terms of the parameter of σ) contained in one ball and the first point p^+ after v contained in one ball. This does not increase the Fréchet distances between σ and the τ ∈ T, because only matchings among line segments are affected by this modification.
Furthermore, most distances are preserved when we do not actually use the last and first ball before and after v, but one of the εn/(2|σ|) balls before and one of the εn/(2|σ|) balls after v, which is the key to the following proof.

Proof of Lemma 6.1.
For the sake of simplicity, we assume that εn/(2|σ|) is integral. Let ℓ = |σ|. For i ∈ {1, …, n}, let v^{τ_i}_1, …, v^{τ_i}_{|τ_i|} be the vertices of τ_i with instants t^{τ_i}_1, …, t^{τ_i}_{|τ_i|}, and let v^σ_1, …, v^σ_ℓ be the vertices of σ with instants t^σ_1, …, t^σ_ℓ. Also, for h ∈ H (recall that H is the set of all continuous bijections h: [0,1] → [0,1] with h(0) = 0 and h(1) = 1) and i ∈ {1, …, n}, let r_{i,h} = max_{t ∈ [0,1]} ‖σ(t) − τ_i(h(t))‖ be the distance realized by h with respect to τ_i. We know from Lemma 2.4 that for each i ∈ {1, …, n} there exists a sequence (h_{i,x})_{x=1}^∞ in H such that lim_{x→∞} r_{i,h_{i,x}} = d_F(σ, τ_i) = r_i.

In the following, given arbitrary h_1, …, h_n ∈ H, we describe how to modify σ such that its vertices can be found in the balls around the vertices of the τ ∈ T, of radii determined by h_1, …, h_n. Later we will argue that this modification can be applied using the h_{1,x}, …, h_{n,x}, for each x ∈ ℕ, in particular.

Now, fix arbitrary h_1, …, h_n ∈ H and, for an arbitrary k ∈ {2, …, |σ| − 1}, fix the vertex v^σ_k of σ with instant t^σ_k. For i ∈ {1, …, n}, let s_i be the maximum of {1, …, |τ_i| − 1} such that t^{τ_i}_{s_i} ≤ h_i(t^σ_k) ≤ t^{τ_i}_{s_i+1}. Namely, v^σ_k is matched to a point on the line segment v^{τ_1}_{s_1}v^{τ_1}_{s_1+1}, …, v^{τ_n}_{s_n}v^{τ_n}_{s_n+1}, respectively, by h_1, …, h_n.

For i ∈ {1, …, n}, let t^−_i be the maximum of [0, t^σ_k] such that σ(t^−_i) ∈ B(v^{τ_i}_{s_i}, r_{i,h_i}), and let t^+_i be the minimum of [t^σ_k, 1] such that σ(t^+_i) ∈ B(v^{τ_i}_{s_i+1}, r_{i,h_i}). These are the instants when σ visits B(v^{τ_i}_{s_i}, r_{i,h_i}) before or when visiting v^σ_k, and when σ visits B(v^{τ_i}_{s_i+1}, r_{i,h_i}) when or after visiting v^σ_k, respectively. Furthermore, there is a permutation α ∈ S_n of the index set {1, …, n} such that

t^−_{α^{−1}(1)} ≤ ⋯ ≤ t^−_{α^{−1}(n)}. (I)

Also, there is a permutation β ∈ S_n of the index set {1, …, n} such that

t^+_{β^{−1}(1)} ≤ ⋯ ≤ t^+_{β^{−1}(n)}. (II)

Additionally, for each i ∈ {1, …, n} we have

t^{τ_i}_{s_i} ≤ h_i(t^−_i) (III)

and

h_i(t^+_i) ≤ t^{τ_i}_{s_i+1}, (IV)

because σ(t^−_i) and σ(t^+_i) are closest points to v^σ_k on σ that have distance at most r_{i,h_i} to v^{τ_i}_{s_i} and v^{τ_i}_{s_i+1}, respectively, by definition. We will now use Eqs.
(I) to (IV) to prove that an advanced shortcut only affects matchings among line segments, and hence we can easily bound the resulting distances for at least (1 − ε)n of the curves. Let

I_{v^σ_k}(h_1, …, h_n) = {τ_{α^{−1}((1 − ε/(2ℓ))n + 1)}, …, τ_{α^{−1}(n)}},  O_{v^σ_k}(h_1, …, h_n) = {τ_{β^{−1}(1)}, …, τ_{β^{−1}(εn/(2ℓ))}}.

I_{v^σ_k}(h_1, …, h_n) is the set of the last εn/(2ℓ) curves whose balls are visited by σ before or when σ visits v^σ_k. Similarly, O_{v^σ_k}(h_1, …, h_n) is the set of the first εn/(2ℓ) curves whose balls are visited by σ when or immediately after σ visited v^σ_k. We now modify σ such that v^σ_k is replaced by two new vertices that are elements of at least one B(v^{τ_i}_j, r_{i,h_i}), for a τ_i ∈ I_{v^σ_k}(h_1, …, h_n), respectively for a τ_i ∈ O_{v^σ_k}(h_1, …, h_n), and j ∈ {1, …, |τ_i|}, each.

Let σ′_{h_1,…,h_n} be the piecewise-defined curve, defined just like σ on [0, t^−_{α^{−1}(k_1)}] and [t^+_{β^{−1}(k_2)}, 1] for arbitrary k_1 ∈ {(1 − ε/(2ℓ))n + 1, …, n} and k_2 ∈ {1, …, εn/(2ℓ)}, but on (t^−_{α^{−1}(k_1)}, t^+_{β^{−1}(k_2)}) it connects σ(t^−_{α^{−1}(k_1)}) and σ(t^+_{β^{−1}(k_2)}) with the line segment

γ(t) = (1 − (t − t^−_{α^{−1}(k_1)})/(t^+_{β^{−1}(k_2)} − t^−_{α^{−1}(k_1)})) σ(t^−_{α^{−1}(k_1)}) + ((t − t^−_{α^{−1}(k_1)})/(t^+_{β^{−1}(k_2)} − t^−_{α^{−1}(k_1)})) σ(t^+_{β^{−1}(k_2)}).
We now argue that for all τ_i ∈ T \ (I_{v^σ_k}(h_1, …, h_n) ∪ O_{v^σ_k}(h_1, …, h_n)) the Fréchet distance between σ′_{h_1,…,h_n} and τ_i is upper bounded by r_{i,h_i}. First, note that by definition h_1, …, h_n are strictly increasing functions, since they are continuous bijections that map 0 to 0 and 1 to 1. As an immediate consequence, we have that

t^{τ_i}_{s_i} ≤ h_i(t^−_i) ≤ h_i(t^−_{α^{−1}(k_1)}) (V)

for each τ_i ∈ T \ I_{v^σ_k}(h_1, …, h_n), and

h_i(t^+_{β^{−1}(k_2)}) ≤ h_i(t^+_i) ≤ t^{τ_i}_{s_i+1} (VI)

for each τ_i ∈ T \ O_{v^σ_k}(h_1, …, h_n), using Eqs. (I) to (IV). Therefore, each τ_i ∈ T \ (I_{v^σ_k}(h_1, …, h_n) ∪ O_{v^σ_k}(h_1, …, h_n)) has no vertex between the instants h_i(t^−_{α^{−1}(k_1)}) and h_i(t^+_{β^{−1}(k_2)}). We also know that for each τ_i ∈ T

‖σ(t^−_{α^{−1}(k_1)}) − τ_i(h_i(t^−_{α^{−1}(k_1)}))‖ ≤ r_{i,h_i} (VII)

and

‖σ(t^+_{β^{−1}(k_2)}) − τ_i(h_i(t^+_{β^{−1}(k_2)}))‖ ≤ r_{i,h_i}. (VIII)

Let D_{s,σ} = [0, t^−_{α^{−1}(k_1)}), D_{m,σ} = [t^−_{α^{−1}(k_1)}, t^+_{β^{−1}(k_2)}] and D_{e,σ} = (t^+_{β^{−1}(k_2)}, 1]. Also, for i ∈ {1, …, n}, let D_{s,τ_i} = [0, h_i(t^−_{α^{−1}(k_1)})), D_{m,τ_i} = [h_i(t^−_{α^{−1}(k_1)}), h_i(t^+_{β^{−1}(k_2)})] and D_{e,τ_i} = (h_i(t^+_{β^{−1}(k_2)}), 1]. Now, for each τ_i ∈ T \ (I_{v^σ_k}(h_1, …, h_n) ∪ O_{v^σ_k}(h_1, …, h_n)) we use h_i to match σ′_{h_1,…,h_n} restricted to D_{s,σ} to τ_i restricted to D_{s,τ_i} and σ′_{h_1,…,h_n} restricted to D_{e,σ} to τ_i restricted to D_{e,τ_i} with distance at most r_{i,h_i}. Since σ′_{h_1,…,h_n} restricted to D_{m,σ} and τ_i restricted to D_{m,τ_i} are just line segments by Eqs.
(V) and (VI), they can be matched to each other with distance at most

max{‖σ(t^−_{α^{−1}(k_1)}) − τ_i(h_i(t^−_{α^{−1}(k_1)}))‖, ‖σ(t^+_{β^{−1}(k_2)}) − τ_i(h_i(t^+_{β^{−1}(k_2)}))‖},

which is at most r_{i,h_i} by Eqs. (VII) and (VIII). We conclude that d_F(σ′_{h_1,…,h_n}, τ_i) ≤ r_{i,h_i}. Because this modification works for every h_1, …, h_n ∈ H, we conclude that d_F(σ′_{h_1,…,h_n}, τ_i) ≤ r_{i,h_i} for every h_1, …, h_n ∈ H and τ_i ∈ T \ (I_{v^σ_k}(h_1, …, h_n) ∪ O_{v^σ_k}(h_1, …, h_n)). Thus,

lim_{x→∞} d_F(σ′_{h_{1,x},…,h_{n,x}}, τ_i) ≤ d_F(σ, τ_i) = r_i

for each τ_i ∈ T \ (I_{v^σ_k}(h_{1,x}, …, h_{n,x}) ∪ O_{v^σ_k}(h_{1,x}, …, h_{n,x})).

Now, to prove the claim, for each combination h_1, …, h_n ∈ H, we apply this modification to v^σ_k and successively to every other vertex of the resulting curve σ′_{h_1,…,h_n}, except its first and last vertex, since these must be elements of B(v^{τ_i}_1, r_{i,h_i}) and B(v^{τ_i}_{|τ_i|}, r_{i,h_i}), respectively, for each i ∈ {1, …, n}, by definition of the Fréchet distance. Since the modification is repeated at most |σ| − 2 times for each combination h_1, …, h_n ∈ H, we conclude that the number of vertices of each σ′_{h_1,…,h_n} can be bounded by 2 · (|σ| − 2) + 2. T_1, …, T_{2|σ|−4} are therefore all the I_{v^σ_k}(h_{1,x}, …, h_{n,x}) and O_{v^σ_k}(h_{1,x}, …, h_{n,x}) for k ∈ {2, …, |σ| − 1}, when x → ∞. Note that every I_{v^σ_k}(h_{1,x}, …, h_{n,x}) and O_{v^σ_k}(h_{1,x}, …, h_{n,x}) is determined by the visiting order of the balls, and since their radii converge, these sets do, too.

We now present Algorithm 4, which is nearly identical to Algorithm 2 but uses the advanced shortcutting lemma. Furthermore, like Algorithm 2, it can be used as a plugin in the recursive k-median approximation scheme (Algorithm 5) that we present in Section 7.

Algorithm 4 (1, ℓ)-Median for Subset by Advanced Shortcutting
procedure (1, ℓ)-Median-(1 + ε)-Candidates(T = {τ_1, …, τ_n}, β, δ, ε)
    ε′ ← ε/c (for a constant c ≥ 1 fixed in the analysis), C ← ∅
    S ← sample ⌈8βℓ(ε′)^{−1}(ln(4(2ℓ − 4)) − ln(δ))⌉ curves from T uniformly and independently with replacement
    for S′ ⊆ S with |S′| = |S|/(2β) do
        c ← (1, ℓ)-Median-34-Approximation(S′, δ/4)    ▷ Algorithm 1
        ∆ ← cost(S′, c), ∆_l ← δn∆/(4 · 34 · |S′|), ∆_u ← ∆/ε′
        C ← C ∪ {c}, P ← ∅
        for s ∈ S′ do
            for i ∈ {1, …, |s|} do
                P ← P ∪ G(B(v^s_i, 2ℓ∆_u/ε′), ε′∆_l/(n√d))    ▷ v^s_i: i-th vertex of s
        C ← C ∪ set of all polygonal curves with 2ℓ − 2 vertices from P
    return C

We prove the quality of approximation of Algorithm 4.
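The superset-sampling outer loop that Algorithms 2 and 4 share, trying every subset S′ ⊆ S of cardinality |S|/(2β) and running a constant-factor approximation on each, can be sketched as follows (a schematic of ours; `approx_median` stands in for Algorithm 1):

```python
from itertools import combinations

def superset_sampling_candidates(S, beta, approx_median):
    """Enumerate every subset S' of the sample S of size |S| / (2 * beta) and
    run a constant-factor (1, ell)-median approximation on each; the returned
    list collects one candidate per subset."""
    size = max(1, len(S) // (2 * beta))
    candidates = []
    for S_prime in combinations(S, size):
        candidates.append(approx_median(list(S_prime)))
    return candidates
```

The loop runs binom(|S|, |S|/(2β)) ∈ 2^{O(|S|)} times, which is where the running time exponential in the sample size comes from.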
Theorem 6.2.
Given three parameters β ∈ [1, ∞), δ ∈ (0, 1), ε ∈ (0, 0.25] and a set T = {τ_1, …, τ_n} ⊂ X^d_m of polygonal curves, with probability at least 1 − δ the set of candidates that Algorithm 4 returns contains a (1 + ε)-approximate (1, ℓ)-median with up to 2ℓ − 2 vertices for any T′ ⊆ T with |T′| ≥ |T|/β.

In the following proof we make use of a case distinction developed by Nath and Taylor [28, Proof of Theorem 10], which is a key ingredient in enabling the (1 + ε)-approximation, though the domain of ε has to be restricted to (0, 0.25].

Proof of Theorem 6.2.
We assume that |T′| ≥ |T|/β. Let n′ be the number of sampled curves in S that are elements of T′. Clearly, E[n′] ≥ Σ_{i=1}^{|S|} 1/β = |S|/β. Also, n′ is a sum of independent Bernoulli trials. A Chernoff bound (cf. Lemma 2.9) yields:

Pr[n′ < |S|/(2β)] ≤ Pr[n′ < E[n′]/2] ≤ exp(−|S|/(8β)) ≤ exp(ℓ(ln(δ) − ln(4(2ℓ − 4)))/ε′) = (δ/(4(2ℓ − 4)))^{ℓ/ε′} ≤ δ/4.

In other words, with probability at most δ/4 no subset S′ ⊆ S of cardinality at least |S|/(2β) is a subset of T′. We condition the rest of the proof on the contrary event, denoted by E_{T′}, namely, that there is a subset S′ ⊆ S with S′ ⊆ T′ and |S′| ≥ |S|/(2β). Note that S′ is then a uniform and independent sample of T′.

Now, let c∗ ∈ X^d_ℓ be an optimal (1, ℓ)-median for T′. The expected distance between s ∈ S′ and c∗ is

E[d_F(s, c∗) | E_{T′}] = Σ_{τ ∈ T′} d_F(c∗, τ)/|T′| = cost(T′, c∗)/|T′|.

By linearity we have

E[cost(S′, c∗) | E_{T′}] = (|S′|/|T′|) cost(T′, c∗). Markov's inequality yields:

Pr[(δ|T′|/(4|S′|)) cost(S′, c∗) > cost(T′, c∗) | E_{T′}] ≤ δ/4.
We conclude that with probability at most δ/4 we have (δ|T′|/(4|S′|)) cost(S′, c∗) > cost(T′, c∗).

Now, from Lemma 6.1 we know that there are 2ℓ − 4 subsets T′_1, …, T′_{2ℓ−4} ⊆ T′, of cardinality ε|T′|/(2ℓ) each and not necessarily disjoint, such that for every set W ⊆ T′ that contains at least one curve τ ∈ T′_i for each i ∈ {1, …, 2ℓ − 4}, there exists a curve c′ ∈ X^d_{2ℓ−2} which has all of its vertices contained in

⋃_{τ ∈ W} ⋃_{j ∈ {1, …, |τ|}} B(v^τ_j, d_F(τ, c∗)),

and for at least (1 − ε)|T′| curves τ ∈ T′ \ (T′_1 ∪ ⋯ ∪ T′_{2ℓ−4}) it holds that d_F(τ, c′) ≤ d_F(τ, c∗).

There are up to ε|T′|/(4ℓ) curves with distance to c∗ at least 4ℓ cost(T′, c∗)/(ε|T′|); otherwise the cost of these curves would exceed cost(T′, c∗), which is a contradiction. Later we will prove that each ball we cover has radius at most 4ℓ cost(T′, c∗)/(ε|T′|). Therefore, for each i ∈ {1, …, 2ℓ − 4} we may have to ignore up to half of the curves τ ∈ T′_i, since we do not cover the balls of radius d_F(τ, c∗) centered at their vertices. For each i ∈ {1, …, 2ℓ − 4} and s ∈ S′ we now have

Pr[s ∈ T′_i ∧ d_F(s, c∗) ≤ 4ℓ cost(T′, c∗)/(ε|T′|) | E_{T′}] ≥ ε/(4ℓ).

Therefore, by independence, for each i ∈ {1, …, 2ℓ − 4} the probability that no s ∈ S′ is an element of T′_i and has distance to c∗ at most 4ℓ cost(T′, c∗)/(ε|T′|) is at most

(1 − ε/(4ℓ))^{|S′|} ≤ exp(−(ε/(4ℓ)) · (4ℓ/ε)(ln(4(2ℓ − 4)) − ln(δ))) = exp(ln(δ/(4(2ℓ − 4)))) = δ/(4(2ℓ − 4)).
Also, with probability at most $\delta/4$ Algorithm 1 fails to compute a $34$-approximate $(1,\ell)$-median $c \in \mathbb{X}^d_\ell$ for $S'$, cf. Corollary 3.3. Using a union bound over these bad events, we conclude that with probability at least $1 - 3\delta/4$ all of the following events occur simultaneously:

1. There is a subset $S' \subseteq S$ of cardinality at least $|S|/(2\beta)$ that is a uniform and independent sample of $T'$,
2. for each $i \in \{1,\dots,2\ell-2\}$, $S'$ contains at least one curve from $T'_i$ with distance to $c^*$ up to $\frac{4\ell\,\mathrm{cost}(T',c^*)}{\varepsilon|T'|}$,
3. Algorithm 1 computes a polygonal curve $c \in \mathbb{X}^d_\ell$ with $\mathrm{cost}(S', c^*_{S'}) \leq \mathrm{cost}(S', c) \leq 34\,\mathrm{cost}(S', c^*_{S'})$, where $c^*_{S'} \in \mathbb{X}^d_\ell$ is an optimal $(1,\ell)$-median for $S'$,
4. and it holds that $\frac{\delta|T'|}{4|S'|}\,\mathrm{cost}(S', c^*) \leq \mathrm{cost}(T', c^*)$.

Let $B_{c^*} = \left\{\tau \in T' \;\middle|\; d_F(\tau, c^*) \leq \frac{\mathrm{cost}(T',c^*)}{\varepsilon|T'|}\right\}$, $T'_{c^*} = T' \cap B_{c^*}$ and $B_c = \left\{\tau \in T' \;\middle|\; d_F(\tau, c) \leq \frac{\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|}\right\}$. First, note that $|T' \setminus B_{c^*}| \leq \varepsilon|T'|$, otherwise $\mathrm{cost}(T' \setminus B_{c^*}, c^*) > \mathrm{cost}(T', c^*)$, which is a contradiction, and therefore $|T'_{c^*}| \geq (1-\varepsilon)|T'|$. We now distinguish two cases:

Case 1: $|T'_{c^*} \setminus B_c| > 2\varepsilon|T'_{c^*}|$. We have $2\varepsilon|T'_{c^*}| \geq (1-\varepsilon)2\varepsilon|T'| \geq \varepsilon|T'|$, hence $\Pr\!\left[d_F(s, c) > \frac{\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|} \,\middle|\, E_{T'}\right] \geq \varepsilon$ for each $s \in S'$. Using independence we conclude that with probability at most
$$(1-\varepsilon)^{|S'|} \leq \exp\!\left(-\varepsilon \cdot \frac{4\ell(\ln(4(2\ell-2)) - \ln(\delta))}{\varepsilon}\right) = \left(\frac{\delta}{4(2\ell-2)}\right)^{4\ell} \leq \frac{\delta}{4}$$
no $s \in S'$ has distance to $c$ greater than $\frac{\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|}$.
Using a union bound again, we conclude that with probability at least $1 - \delta$ Items 1 to 4 occur simultaneously and at least one $s \in S'$ has distance to $c$ greater than $\frac{\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|}$, hence $\mathrm{cost}(S', c) > \frac{\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|}$, equivalently $\frac{\mathrm{cost}(S',c)}{\varepsilon} > \frac{\mathrm{cost}(T',c^*)}{|T'|}$, and thus we indeed cover the balls of radius at most $\frac{4\ell\,\mathrm{cost}(T',c^*)}{\varepsilon|T'|} < \frac{4\ell\,\mathrm{cost}(S',c)}{\varepsilon^2}$.

In the last step, Algorithm 4 returns a set $C$ of all curves with up to $2\ell-2$ vertices from the grids, which contains one curve, denoted by $c''$, with the same number of vertices as $c'$ (recall that this is the curve guaranteed by Lemma 6.1) and distance at most $\frac{\varepsilon}{|T'|}\,\mathrm{cost}(T', c^*)$ between every corresponding pair of vertices of $c'$ and $c''$. We conclude that $d_F(c', c'') \leq \frac{\varepsilon}{|T'|}\,\mathrm{cost}(T', c^*)$. Also, recall that $d_F(\tau, c') \leq d_F(\tau, c^*)$ for $\tau \in T' \setminus (T'_1 \cup \dots \cup T'_{2\ell-2})$. Further, $T'$ contains at least $\frac{|T'|}{2}$ curves with distance at most $\frac{2\,\mathrm{cost}(T',c^*)}{|T'|}$ to $c^*$, otherwise the cost of the remaining curves would exceed $\mathrm{cost}(T', c^*)$, which is a contradiction, and since $\varepsilon < 1/4$ there is at least one curve $\sigma \in T' \setminus (T'_1 \cup \dots \cup T'_{2\ell-2})$ with $d_F(\sigma, c') \leq d_F(\sigma, c^*) \leq \frac{2\,\mathrm{cost}(T',c^*)}{|T'|}$ by the pigeonhole principle.
We can now bound the cost of $c''$ as follows:
$$\begin{aligned}
\mathrm{cost}(T', c'') &= \sum_{\tau \in T'} d_F(\tau, c'') \\
&\leq \sum_{\tau \in T' \setminus (T'_1 \cup \dots \cup T'_{2\ell-2})} \left(d_F(\tau, c') + \frac{\varepsilon}{|T'|}\,\mathrm{cost}(T', c^*)\right) + \sum_{\tau \in (T'_1 \cup \dots \cup T'_{2\ell-2})} \left(d_F(\tau, c^*) + d_F(c^*, \sigma) + d_F(\sigma, c') + d_F(c', c'')\right) \\
&\leq (1+\varepsilon)\,\mathrm{cost}(T', c^*) + \sum_{\tau \in (T'_1 \cup \dots \cup T'_{2\ell-2})} \frac{(2+2+\varepsilon)\,\mathrm{cost}(T', c^*)}{|T'|} \\
&\leq \mathrm{cost}(T', c^*) + \varepsilon\,\mathrm{cost}(T', c^*) + 5\varepsilon\,\mathrm{cost}(T', c^*) = (1+6\varepsilon)\,\mathrm{cost}(T', c^*).
\end{aligned}$$

Case 2: $|T'_{c^*} \setminus B_c| \leq 2\varepsilon|T'_{c^*}|$. Again, we distinguish two cases:
Case 2.1: $d_F(c, c^*) \leq \frac{4\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|}$. We can easily bound the cost of $c$:
$$\mathrm{cost}(T', c) \leq \sum_{\tau \in T'} \left(d_F(\tau, c^*) + d_F(c^*, c)\right) \leq (1+4\varepsilon)\,\mathrm{cost}(T', c^*).$$
Case 2.2: $d_F(c, c^*) > \frac{4\varepsilon\,\mathrm{cost}(T',c^*)}{|T'|}$. Recall that $|T'_{c^*}| \geq (1-\varepsilon)|T'|$. We have
$$|T' \setminus B_c| \leq |T' \setminus T'_{c^*}| + 2\varepsilon|T'_{c^*}| = |T'| - (1-2\varepsilon)|T'_{c^*}| \leq |T'| - (1-2\varepsilon)(1-\varepsilon)|T'| = (3\varepsilon - 2\varepsilon^2)|T'| < \frac{|T'|}{2}.$$
Hence, $|T' \cap B_c| \geq (1 - 3\varepsilon + 2\varepsilon^2)|T'| > \frac{|T'|}{2}$. Assume we assign all curves to $c$ instead of to $c^*$. For $\tau \in T' \cap B_c$ we now have a decrease in cost of $d_F(\tau, c^*) - d_F(\tau, c)$, which can be bounded as follows:
$$d_F(\tau, c^*) - d_F(\tau, c) \geq d_F(\tau, c^*) - \frac{\varepsilon\,\mathrm{cost}(T', c^*)}{|T'|} \geq d_F(c, c^*) - d_F(\tau, c) - \frac{\varepsilon\,\mathrm{cost}(T', c^*)}{|T'|} \geq d_F(c, c^*) - \frac{2\varepsilon\,\mathrm{cost}(T', c^*)}{|T'|} > \frac{d_F(c, c^*)}{2}.$$
For $\tau \in T' \setminus B_c$ we have an increase in cost of $d_F(\tau, c) - d_F(\tau, c^*) \leq d_F(c, c^*)$. Let the overall increase in cost be denoted by $\alpha$, which can be bounded as follows:
$$\alpha < |T' \setminus B_c| \cdot d_F(c, c^*) - |T' \cap B_c| \cdot \frac{d_F(c, c^*)}{2}.$$
By the fact that $|T' \setminus B_c| < \frac{|T' \cap B_c|}{2}$ for our choice of $\varepsilon$, we conclude that $\alpha < 0$, which is a contradiction because $c^*$ is an optimal $(1,\ell)$-median for $T'$. Therefore, Case 2.2 cannot occur. Rescaling $\varepsilon$ by a constant proves the claim.

We analyse the worst-case running-time of Algorithm 4 and the number of candidates it returns.

Theorem 6.3.
Algorithm 4 has running-time and returns a number of candidates $2^{O\left(\frac{\beta(-\ln(\delta))}{\varepsilon} + \log(m)\right)}$.

Proof. The sample $S$ has size $O\!\left(\frac{\beta(-\ln(\delta))}{\varepsilon}\right)$ and sampling it takes time $O\!\left(\frac{\beta(-\ln(\delta))}{\varepsilon}\right)$. Let $n_S = |S|$. The for-loop runs $\binom{n_S}{\lceil n_S/(2\beta)\rceil} \leq 2^{n_S} \in 2^{O\left(\frac{\beta(-\ln(\delta))}{\varepsilon}\right)}$ times. In each iteration, we run Algorithm 1, taking time polynomial in $m$ and $-\ln(\delta)$ (cf. Corollary 3.3), we compute the cost of the returned curve with respect to $S'$, taking time $O\!\left(\frac{-\ln(\delta)}{\varepsilon} \cdot m\log(m)\right)$, and per curve in $S'$ we build up to $m$ grids of size
$$\left(\frac{\sqrt{d}\,|S'|\,(1+\varepsilon)}{\varepsilon\,\delta}\right)^{d} \in O\!\left(\frac{\beta^{d}\,(-\ln(\delta))^{d}}{\varepsilon^{2d}\,\delta^{d}}\right)$$
each. Algorithm 4 then enumerates all combinations of $2\ell-2$ points from up to $|S'| \cdot m$ grids, resulting in
$$O\!\left(\frac{m^{2\ell-2}\,\beta^{(2\ell-2)(d+1)}\,(-\ln(\delta))^{(2\ell-2)(d+1)}}{\varepsilon^{(2\ell-2)(d+1)}\,\delta^{(2\ell-2)d}}\right)$$
candidates per iteration of the for-loop. Thus, Algorithm 4 computes $O(\operatorname{poly}(m, \beta, \delta^{-1}, \varepsilon^{-1}))$ candidates per iteration of the for-loop and enumeration also takes time $O(\operatorname{poly}(m, \beta, \delta^{-1}, \varepsilon^{-1}))$ per iteration of the for-loop. All in all, we have running-time and number of candidates $2^{O\left(\frac{\beta(-\ln(\delta))}{\varepsilon} + \log(m)\right)}$.

(1 + ε)-Approximation for (k, ℓ)-Median

We generalize the algorithm of Ackermann et al. [2] in the following way: instead of drawing a uniform sample and running a problem-specific algorithm on this sample in the candidate phase, we only run a problem-specific "plugin"-algorithm in the candidate phase, thus dropping the framework around the sampling property. We think that the problem-specific algorithms used in [2] do not fulfill the role of a plugin, since parts of the problem-specific operations, e.g.
the uniform sampling, remain in the main algorithm. Here, we separate the problem-specific operations from the main algorithm: any algorithm can serve as plugin if it is able to return candidates for a cluster that takes a constant fraction of the input, where the fraction is an input parameter of the algorithm, and some approximation factor is guaranteed (w.h.p.). The calls to the candidate-finder plugin do not even need to be independent (stochastically), allowing adaptive algorithms.

Now, let $\mathcal{X} = (X, \rho)$ be an arbitrary space, where $X$ is any non-empty (ground-)set and $\rho\colon X \times X \to \mathbb{R}_{\geq 0}$ is a distance function (not necessarily a metric). We introduce a generalized definition of $k$-median clustering. Let the medians be restricted to come from a predefined subset $Y \subseteq X$.

Definition 7.1 (generalized $k$-median). The generalized $k$-median clustering problem is defined as follows, where $k \in \mathbb{N}$ is a fixed (constant) parameter of the problem: given a finite and non-empty set $Z \subseteq X$, compute a set $C$ of $k$ elements from $Y$, such that $\mathrm{cost}(Z, C) = \sum_{z \in Z} \min_{c \in C} \rho(z, c)$ is minimal.

The following algorithm, Algorithm 5, can approximate every $k$-median problem compatible with Definition 7.1, when provided with such a problem-specific plugin-algorithm for computing candidates. In particular, it can approximate the $(k,\ell)$-median problem for polygonal curves under the Fréchet distance, when provided with Algorithm 2 or Algorithm 4. Then, we have $\mathcal{X} = \mathbb{X}^d$, $Y \subseteq \mathbb{X}^d_\ell \subseteq \mathbb{X}^d = X$ and $Z \subseteq \mathbb{X}^d_m \subseteq \mathbb{X}^d = X$. Note that the algorithm computes a bicriteria approximation, that is, the solution is approximated in terms of the cost and the number of vertices of the center curves, i.e., the centers come from $\mathbb{X}^d_{2\ell-2}$.

Algorithm 5 has several parameters. The first parameter $C$ is the set of centers found yet and $\kappa$ is the number of centers yet to be found.
The following parameters concern only the plugin-algorithm used within the algorithm: $\beta$ determines the size (in terms of a fraction of the input) of the smallest cluster for which an approximate median can be computed, $\delta$ determines the probability of failure of the plugin-algorithm and $\varepsilon$ determines the approximation factor of the plugin-algorithm.

Algorithm 5 works as follows: if it has already computed some centers (and there are still centers to compute) it does pruning: some clusters might be too small for the plugin-algorithm to compute approximate medians for them. Algorithm 5 then calls itself recursively with only half of the input: the elements with larger distances to the centers yet found. This way the small clusters will eventually take a larger fraction of the input and can be found in the candidate phase. In this phase Algorithm 5 calls its plugin and for each candidate that the plugin returned, it calls itself recursively: adding the candidate at hand to the set of centers yet found and decrementing $\kappa$ by one. Eventually, all combinations of computed candidates are evaluated against the original input and the centers that together evaluated best are returned.

Algorithm 5 Recursive Approximation-Scheme for $k$-Median Clustering

procedure $k$-Median($T, C, \kappa, \beta, \delta, \varepsilon$)
    if $\kappa = 0$ then return $C$
    if $C \neq \emptyset$ then    ▷ Pruning Phase
        $P \leftarrow$ set of $\lfloor |T|/2 \rfloor$ elements $\tau \in T$, such that $\min_{c \in C} \rho(\tau, c) \leq \min_{c \in C} \rho(\sigma, c)$ for each $\sigma \in T \setminus P$
        $C' \leftarrow$ $k$-Median($T \setminus P, C, \kappa, \beta, \delta, \varepsilon$)
    else $C' \leftarrow \emptyset$
    ▷ Candidate Phase
    $K \leftarrow$ 1-Median-Candidates($T, \beta, \delta/k, \varepsilon$)
    for $c \in K$ do
        $C_c \leftarrow$ $k$-Median($T, C \cup \{c\}, \kappa - 1, \beta, \delta, \varepsilon$)
    $\mathcal{C} \leftarrow \{C'\} \cup \bigcup_{c \in K} \{C_c\}$
    return $\arg\min_{C'' \in \mathcal{C}} \mathrm{cost}(T, C'')$

The quality of approximation and the worst-case running-time of Algorithm 5 are stated in the following two theorems, which we prove further below. The proofs are adaptations of corresponding proofs in [2]. We provide them for the sake of completeness.
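For intuition, the recursion above can be sketched for points on the real line with $\rho(x,y) = |x - y|$. The plugin below is a deliberately trivial stand-in that proposes every input point as a 1-median candidate; all names are illustrative, not the paper's implementation, and the parameters $\beta$, $\delta$, $\varepsilon$ are carried along but unused by this toy plugin (the paper splits the failure probability as $\delta/k$ across plugin calls).

```python
def cost(T, C):
    # cost(T, C) = sum over T of the distance to the nearest center in C
    return sum(min(abs(t - c) for c in C) for t in T)

def one_median_candidates(T, beta, delta, eps):
    # Toy plugin: propose every input point as a candidate 1-median.
    return list(T)

def k_median(T, C, kappa, beta, delta, eps):
    if kappa == 0:
        return C
    solutions = []
    if C and len(T) > 1:
        # Pruning phase: drop the floor(|T|/2) points closest to the current
        # centers and recurse on the remaining, farther half.
        T_sorted = sorted(T, key=lambda t: min(abs(t - c) for c in C))
        solutions.append(k_median(T_sorted[len(T) // 2:], C, kappa,
                                  beta, delta, eps))
    # Candidate phase: try each proposed candidate as the next center.
    for c in one_median_candidates(T, beta, delta, eps):
        solutions.append(k_median(T, C + [c], kappa - 1, beta, delta, eps))
    return min(solutions, key=lambda S: cost(T, S))

T = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]
print(sorted(k_median(T, [], 2, 5.0, 0.1, 0.5)))  # one center per cluster
```

Because this toy plugin enumerates all input points, the candidate phase alone already explores every pair of centers; the pruning phase becomes essential only when the plugin, like Algorithm 2 or Algorithm 4, can only serve clusters holding at least a $1/\beta$ fraction of its input.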
Theorem 7.2.
Let $T = \{\tau_1, \dots, \tau_n\} \subseteq X$, $\alpha \in [1, \infty)$ and 1-Median-Candidates be an algorithm that, given three parameters $\beta \in [1, \infty)$, $\delta, \varepsilon \in (0,1)$ and a set $T \subseteq X$, returns with probability at least $1 - \delta$ an $(\alpha + \varepsilon)$-approximate $1$-median for any $T' \subseteq T$, if $|T'| \geq \frac{|T|}{\beta}$.
Algorithm 5 called with parameters $(T, \emptyset, k, \beta, \delta, \varepsilon)$, where $\beta \in (2k, \infty)$ and $\delta, \varepsilon \in (0,1)$, returns with probability at least $1 - \delta$ a set $C = \{c_1, \dots, c_k\}$ with
$$\mathrm{cost}(T, C) \leq \left(1 + \frac{4k^2}{\beta - 2k}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*),$$
where $C^*$ is an optimal set of $k$ medians for $T$.

Theorem 7.3.
Let $T(n, \beta, \delta, \varepsilon)$ denote the worst-case running-time of 1-Median-Candidates for an arbitrary input-set $T$ with $|T| = n$ and let $C(n, \beta, \delta, \varepsilon)$ denote the maximum number of candidates it returns. Also, let $T_\rho$ denote the worst-case running-time needed to compute $\rho$ for an input element and a candidate.
If $T$ and $C$ are non-decreasing in $n$, Algorithm 5 has running-time
$$O\!\left(C(n, \beta, \delta, \varepsilon)^{k+2} \cdot n \cdot T_\rho + C(n, \beta, \delta, \varepsilon)^{k+1} \cdot T(n, \beta, \delta, \varepsilon)\right).$$

Now we state our main results, which follow from Theorems 4.2 and 4.3, respectively Theorems 6.2 and 6.3, and Theorems 7.2 and 7.3.
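The induction behind Theorem 7.3 can be checked numerically. The sketch below instantiates the running-time recurrence of the recursive scheme with hypothetical stand-ins (not from the paper) for the plugin's running-time, its candidate count and the distance-evaluation time, and verifies a closed-form bound whose shape matches the theorem's $O(C^{k+2} \cdot n \cdot T_\rho + C^{k+1} \cdot T)$.

```python
from functools import lru_cache

# Hypothetical stand-ins: Cc candidates per plugin call, Trho per distance
# evaluation, a linear plugin running-time Tcand, and the constant c from
# the recurrence. None of these values come from the paper.
c, Cc, Trho = 1.0, 3.0, 2.0
def Tcand(n):
    return 5.0 * n  # any function non-decreasing in n works here

@lru_cache(maxsize=None)
def T(n, kappa):
    # Recurrence: prune half the input (T(n//2, kappa)), recurse once per
    # candidate (Cc * T(n, kappa-1)), call the plugin, evaluate solutions.
    if kappa == 0:
        return c
    if kappa >= n:
        return c * n
    return Cc * T(n, kappa - 1) + T(n // 2, kappa) + Tcand(n) + c * n * Trho * Cc

def bound(n, kappa):
    # Closed form c * 4**kappa * Cc**(kappa+1) * n * f(n) with
    # f(n) = c * (Tcand(n)/n + Trho*Cc); for kappa = k this is of order
    # Cc**(k+2) * n * Trho + Cc**(k+1) * Tcand(n).
    f = c * (Tcand(n) / n + Trho * Cc)
    return c * 4 ** kappa * Cc ** (kappa + 1) * n * f

ok = all(T(n, kappa) <= bound(n, kappa)
         for n in (2 ** i for i in range(1, 12)) for kappa in range(6))
print(ok)  # True: the recurrence respects the claimed bound
```

Replacing `Tcand` by any other non-decreasing function leaves the check intact, mirroring the monotonicity assumption in the theorem.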
Corollary 7.4.
Given two parameters $\delta, \varepsilon \in (0,1)$ and a set $T \subset \mathbb{X}^d_m$ of polygonal curves, Algorithm 5 endowed with Algorithm 2 as 1-Median-Candidates and run with parameters $\left(T, \emptyset, k, \frac{20k^2}{\varepsilon} + 2k, \delta, \varepsilon/5\right)$ returns with probability at least $1 - \delta$ a set $C \subset \mathbb{X}^d_{2\ell-2}$ that is a $(3 + \varepsilon)$-approximate solution to the $(k,\ell)$-median problem for $T$. Algorithm 5 then has running-time $n \cdot 2^{O\left(\frac{k^2(-\ln(\delta))}{\varepsilon^{2}} + \log(m)\right)}$.

Corollary 7.5.
Given two parameters $\delta \in (0,1)$, $\varepsilon \in (0, 0.1]$ and a set $T \subset \mathbb{X}^d_m$ of polygonal curves, Algorithm 5 endowed with Algorithm 4 as 1-Median-Candidates and run with parameters $\left(T, \emptyset, k, \frac{20k^2}{\varepsilon} + 2k, \delta, \varepsilon/5\right)$ returns with probability at least $1 - \delta$ a set $C \subset \mathbb{X}^d_{2\ell-2}$ that is a $(1 + \varepsilon)$-approximate solution to the $(k,\ell)$-median problem for $T$. Algorithm 5 then has running-time $n \cdot 2^{O\left(\frac{k^2(-\ln(\delta))}{\varepsilon^{2}} + \log(m)\right)}$.

Proof of Theorem 7.2.
For $k = 1$, the claim trivially holds. We now distinguish two cases. In the first case the principle of the proof is presented in all its detail. In the second case we only show how to generalize the first case to $k > 2$.

Case 1: $k = 2$. Let $C^* = \{c^*_1, c^*_2\}$ be an optimal set of $k$ medians for $T$ with clusters $T^*_1$ and $T^*_2$, respectively, that form a partition of $T$. For the sake of simplicity, assume that $n$ is a power of $2$ and w.l.o.g. assume that $|T^*_1| \geq \frac{|T|}{2} > \frac{|T|}{\beta}$. Let $C_1$ be the set of candidates returned by 1-Median-Candidates in the initial call. With probability at least $1 - \delta/k$, there is a $c_1 \in C_1$ with $\mathrm{cost}(T^*_1, c_1) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_1, c^*_1)$. We distinguish two cases:

Case 1.1:
There exists a recursive call with parameters $(T', \{c_1\}, 1, \beta, \delta, \varepsilon)$ and $|T^*_2 \cap T'| \geq \frac{|T'|}{\beta}$. First, we assume that $T'$ is the maximum cardinality input with $|T^*_2 \cap T'| \geq \frac{|T'|}{\beta}$ occurring in a recursive call of the algorithm. Let $C_2$ be the set of candidates returned by 1-Median-Candidates in this call. With probability at least $1 - \delta/k$, there is a $c_2 \in C_2$ with $\mathrm{cost}(T^*_2 \cap T', c_2) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_2 \cap T', \tilde{c})$, where $\tilde{c}$ is an optimal median for $T^*_2 \cap T'$.

Let $P$ be the set of elements of $T$ removed in the $m \in \mathbb{N}$, $m \leq \log_2(n)$, pruning phases between obtaining $c_1$ and $c_2$. Without loss of generality we assume that $P \neq \emptyset$. For $i \in \{1, \dots, m\}$, let $P_i$ be the elements removed in the $i$th (in the order of the recursive calls occurring) pruning phase. Note that the $P_i$ are pairwise disjoint, we have $P = \cup_{i=1}^m P_i$ and $|P_i| = \frac{n}{2^i}$. Since $T = T^*_1 \uplus (T^*_2 \cap T') \uplus (T^*_2 \cap P)$, we have
$$\mathrm{cost}(T, \{c_1, c_2\}) \leq \mathrm{cost}(T^*_1, c_1) + \mathrm{cost}(T^*_2 \cap T', c_2) + \mathrm{cost}(T^*_2 \cap P, c_1). \quad \text{(I)}$$
Our aim is now to prove that the number of elements wrongly assigned to $c_1$, i.e., $T^*_2 \cap P$, is small and further, that their cost is a fraction of the cost of the elements correctly assigned to $c_1$, i.e., $T^*_1$. We define $R_0 = T$ and for $i \in \{1, \dots, m\}$ we define $R_i = R_{i-1} \setminus P_i$. The $R_i$ are the elements remaining after the $i$th pruning phase. Note that by definition $|R_i| = \frac{n}{2^i} = |P_i|$. Since $R_m = T'$ is the maximum cardinality input with $|T^*_2 \cap T'| \geq \frac{|T'|}{\beta}$, we have that $|T^*_2 \cap R_i| < \frac{|R_i|}{\beta}$ for all $i \in \{0, \dots, m-1\}$. Also, for each $i \in \{1, \dots, m\}$ we have $P_i \subseteq R_{i-1}$, therefore
$$|T^*_2 \cap P_i| \leq |T^*_2 \cap R_{i-1}| < \frac{|R_{i-1}|}{\beta} = \frac{2}{\beta}\cdot\frac{n}{2^i} \quad \text{(II)}$$
and as immediate consequence
$$|T^*_1 \cap P_i| = |P_i| - |T^*_2 \cap P_i| > |P_i| - \frac{|R_{i-1}|}{\beta} = \left(1 - \frac{2}{\beta}\right)\frac{n}{2^i}.$$
(III) This tells us that mainly the elements of $T^*_1$ are removed in the pruning phases and only very few elements of $T^*_2$. By definition, we have for all $i \in \{1, \dots, m-1\}$, $\sigma \in P_i$ and $\tau \in P_{i+1}$ that $\rho(\sigma, c_1) \leq \rho(\tau, c_1)$, hence
$$\frac{\mathrm{cost}(T^*_2 \cap P_i, c_1)}{|T^*_2 \cap P_i|} \leq \frac{\mathrm{cost}(T^*_1 \cap P_{i+1}, c_1)}{|T^*_1 \cap P_{i+1}|}.$$
Combining this inequality with Eqs. (II) and (III) we obtain for $i \in \{1, \dots, m-1\}$:
$$\frac{\beta 2^i}{2n}\,\mathrm{cost}(T^*_2 \cap P_i, c_1) < \frac{2^{i+1}}{(1 - 2/\beta)n}\,\mathrm{cost}(T^*_1 \cap P_{i+1}, c_1)$$
$$\Leftrightarrow \mathrm{cost}(T^*_2 \cap P_i, c_1) < \frac{4}{\beta - 2}\,\mathrm{cost}(T^*_1 \cap P_{i+1}, c_1). \quad \text{(IV)}$$
We still need such a bound for $i = m$. Since $|R_m| = |P_m|$ and also $R_m \subseteq R_{m-1}$ we can use Eq. (II) to obtain:
$$|T^*_1 \cap R_m| = |R_m| - |T^*_2 \cap R_m| \geq |R_m| - |T^*_2 \cap R_{m-1}| > \left(1 - \frac{2}{\beta}\right)\frac{n}{2^m}. \quad \text{(V)}$$
Also, we have for all $\sigma \in P_m$ and $\tau \in R_m$ that $\rho(\sigma, c_1) \leq \rho(\tau, c_1)$ by definition, thus
$$\frac{\mathrm{cost}(T^*_2 \cap P_m, c_1)}{|T^*_2 \cap P_m|} \leq \frac{\mathrm{cost}(T^*_1 \cap R_m, c_1)}{|T^*_1 \cap R_m|}.$$
We combine this inequality with Eq. (II) and Eq. (V) and obtain:
$$\frac{\beta 2^m}{2n}\,\mathrm{cost}(T^*_2 \cap P_m, c_1) < \frac{2^m}{(1 - 2/\beta)n}\,\mathrm{cost}(T^*_1 \cap R_m, c_1) \Leftrightarrow \mathrm{cost}(T^*_2 \cap P_m, c_1) < \frac{2}{\beta - 2}\,\mathrm{cost}(T^*_1 \cap R_m, c_1).$$
(VI) We are now ready to bound the cost of the elements of $T^*_2$ wrongly assigned to $c_1$. Combining Eq. (IV) and Eq. (VI) yields:
$$\mathrm{cost}(T^*_2 \cap P, c_1) = \sum_{i=1}^m \mathrm{cost}(T^*_2 \cap P_i, c_1) < \frac{4}{\beta - 2}\sum_{i=1}^{m-1}\mathrm{cost}(T^*_1 \cap P_{i+1}, c_1) + \frac{2}{\beta - 2}\,\mathrm{cost}(T^*_1 \cap R_m, c_1) < \frac{4}{\beta - 2}\,\mathrm{cost}(T^*_1, c_1).$$
Here, the last inequality holds, because $P_2, \dots, P_m$ and $R_m$ are pairwise disjoint. Also, we have
$$\mathrm{cost}(T^*_2 \cap T', c_2) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_2 \cap T', \tilde{c}) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_2 \cap T', c^*_2) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_2, c^*_2).$$
Finally, using Eq. (I) and a union bound, with probability at least $1 - \delta$ the following holds:
$$\mathrm{cost}(T, \{c_1, c_2\}) < (\alpha + \varepsilon)\,\mathrm{cost}(T^*_1, c^*_1) + (\alpha + \varepsilon)\,\mathrm{cost}(T^*_2, c^*_2) + \frac{4}{\beta - 2}(\alpha + \varepsilon)\,\mathrm{cost}(T^*_1, c^*_1)$$
$$< \left(1 + \frac{4}{\beta - 2}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*) = \left(1 + \frac{4k}{k\beta - 2k}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*) \leq \left(1 + \frac{4k^2}{\beta - 2k}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*).$$
Case 1.2:
For all recursive calls with parameters $(T', \{c_1\}, 1, \beta, \delta, \varepsilon)$ it holds that $|T^*_2 \cap T'| < \frac{|T'|}{\beta}$. After $\log_2(n)$ pruning phases we end up with a singleton $\{\sigma\} = T'$ as input set. Since $|T^*_2 \cap T'| < \frac{|T'|}{\beta}$, it must be that $|T^*_2 \cap T'| < \frac{1}{\beta} < 1$ and thus $\sigma \in T^*_1$. Let $C_2$ be the set of candidates returned by 1-Median-Candidates in this call. With probability at least $1 - \delta/k$ there is a $c_2 \in C_2$ with $\mathrm{cost}(\{\sigma\}, c_2) \leq (\alpha + \varepsilon)\,\mathrm{cost}(\{\sigma\}, \tilde{c})$, where $\tilde{c}$ is an optimal median for $\{\sigma\}$. Since $\mathrm{cost}(T^*_2 \cap P, c_1)$ is bounded as in Case 1.1, by a union bound we have with probability at least $1 - \delta$:
$$\mathrm{cost}(T, \{c_1, c_2\}) \leq \mathrm{cost}(T^*_1 \setminus \{\sigma\}, c_1) + \mathrm{cost}(T^*_2 \cap P, c_1) + \mathrm{cost}(\{\sigma\}, c_2)$$
$$\leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_1, c^*_1) + \mathrm{cost}(T^*_2 \cap P, c_1) \leq \left(1 + \frac{4}{\beta - 2}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*) \leq \left(1 + \frac{4k^2}{\beta - 2k}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*).$$

Case 2: $k > 2$. We only prove the generalization of Case 1.1 to $k > 2$; the remainder of the proof is analogous to Case 1. For the sake of brevity, for $i \in \mathbb{N}$, we define $[i] = \{1, \dots, i\}$. Let $C^* = \{c^*_1, \dots, c^*_k\}$ be an optimal set of $k$ medians for $T$ with clusters $T^*_1, \dots, T^*_k$, respectively, that form a partition of $T$. For the sake of simplicity, assume that $n$ is a power of $2$ and w.l.o.g. assume $|T^*_1| \geq \dots \geq |T^*_k|$. For $i \in [k]$ and $j \in [k] \setminus [i]$ we define $T^*_{i,j} = \biguplus_{t=i}^j T^*_t$.

Let $\mathcal{T}_0 = T$ and let $(\mathcal{T}_j = \mathcal{T}_{j-1} \setminus P_j)_{j=1}^m$ be the sequence of input sets in the recursive calls of the $m \in \mathbb{N}$, $m \leq \log_2(n)$, pruning phases, where $P_j$ is the set of elements removed in the $j$th (in the order of the recursive calls occurring) pruning phase. Let $\mathfrak{T} = \{\mathcal{T}_0\} \cup \{\mathcal{T}_j \mid j \in [m]\}$. For $i \in [k]$, let $T_i$ be the maximum cardinality set in $\mathfrak{T}$ with $|T^*_i \cap T_i| \geq \frac{|T_i|}{\beta}$.
Note that by assumption and since $\beta > 2k$, $T_1 = T$ must hold, and also $T_j \subseteq T_i$ for $j \in [k] \setminus [i]$. Using a union bound, with probability at least $1 - \delta$, for each $i \in [k]$ the call of 1-Median-Candidates with input $T_i$ yields a candidate $c_i$ with
$$\mathrm{cost}(T^*_i \cap T_i, c_i) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_i \cap T_i, \tilde{c}_i) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_i \cap T_i, c^*_i) \leq (\alpha + \varepsilon)\,\mathrm{cost}(T^*_i, c^*_i), \quad \text{(I)}$$
where $\tilde{c}_i$ is an optimal $1$-median for $T^*_i \cap T_i$. Let $C = \{c_1, \dots, c_k\}$ be the set of these candidates and for $i \in [k-1]$, let $P_i = T_i \setminus T_{i+1}$ denote the set of elements of $T$ removed by the pruning phases between obtaining $c_i$ and $c_{i+1}$. Note that the $P_i$ are pairwise disjoint. By definition, the sets $T^*_1 \cap T_1, \dots, T^*_k \cap T_k, T^*_{2,k} \cap P_1, \dots, T^*_{k,k} \cap P_{k-1}$ form a partition of $T$, therefore
$$\mathrm{cost}(T, \{c_1, \dots, c_k\}) \leq \sum_{i=1}^k \mathrm{cost}(T^*_i \cap T_i, c_i) + \sum_{i=1}^{k-1} \mathrm{cost}\!\left(T^*_{i+1,k} \cap P_i, \{c_1, \dots, c_i\}\right)$$
$$\leq (\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) + \sum_{i=1}^{k-1} \mathrm{cost}\!\left(T^*_{i+1,k} \cap P_i, \{c_1, \dots, c_i\}\right). \quad \text{(II)}$$
Now, it only remains to bound the cost of the wrongly assigned elements of the $T^*_{i+1,k}$. For $i \in [k-1]$, let $n_i = |T_i|$ and w.l.o.g. assume that $P_i \neq \emptyset$ for each $i \in [k-1]$. Each $P_i$ is the disjoint union $\biguplus_{j=1}^{m_i} P_{i,j}$ of $m_i \in \mathbb{N}$ sets of elements of $T$ removed in the interim pruning phases and it holds that $|P_{i,j}| = \frac{n_i}{2^j}$. We now prove for each $i \in [k-1]$ and $j \in [m_i]$ that $P_{i,j}$ contains a large number of elements from $T^*_{1,i}$ and only a few elements from $T^*_{i+1,k}$.

For $i \in [k-1]$, we define $R_{i,0} = T_i$ and for $j \in [m_i]$ we define $R_{i,j} = R_{i,j-1} \setminus P_{i,j}$. By definition, $|R_{i,j}| = \frac{n_i}{2^j} = |P_{i,j}|$, $R_{i,j_1} \supset R_{i,j_2}$ for each $j_1 \in [m_i]$ and $j_2 \in [m_i] \setminus [j_1]$, and also $R_{i,m_i} = T_{i+1}$. Thus, $|T^*_t \cap R_{i,j}| < \frac{|R_{i,j}|}{\beta}$ for all $i \in [k-1]$, $j \in \{0, \dots, m_i - 1\}$ and $t \in [k] \setminus [i]$.
As immediate consequence we obtain $|T^*_{i+1,k} \cap R_{i,j}| \leq \frac{k}{\beta}|R_{i,j}|$. Since $P_{i,j} \subseteq R_{i,j-1}$ for all $i \in [k-1]$ and $j \in [m_i]$, we have
$$|T^*_{i+1,k} \cap P_{i,j}| \leq |T^*_{i+1,k} \cap R_{i,j-1}| \leq \frac{k}{\beta}|R_{i,j-1}| = \frac{2k}{\beta}\cdot\frac{n_i}{2^j}, \quad \text{(III)}$$
which immediately yields
$$|T^*_{1,i} \cap P_{i,j}| = |P_{i,j}| - |T^*_{i+1,k} \cap P_{i,j}| \geq \left(1 - \frac{2k}{\beta}\right)\frac{n_i}{2^j}. \quad \text{(IV)}$$
Now, by definition we know for all $i \in [k-1]$, $j \in [m_i] \setminus \{m_i\}$, $\sigma \in P_{i,j}$ and $\tau \in P_{i,j+1}$ that $\min_{c \in \{c_1,\dots,c_i\}} \rho(\sigma, c) \leq \min_{c \in \{c_1,\dots,c_i\}} \rho(\tau, c)$. Thus,
$$\frac{\mathrm{cost}(T^*_{i+1,k} \cap P_{i,j}, \{c_1, \dots, c_i\})}{|T^*_{i+1,k} \cap P_{i,j}|} \leq \frac{\mathrm{cost}(T^*_{1,i} \cap P_{i,j+1}, \{c_1, \dots, c_i\})}{|T^*_{1,i} \cap P_{i,j+1}|}.$$
Combining this inequality with Eqs. (III) and (IV) yields for $i \in [k-1]$ and $j \in [m_i] \setminus \{m_i\}$:
$$\frac{\beta 2^j}{2kn_i}\,\mathrm{cost}\!\left(T^*_{i+1,k} \cap P_{i,j}, \{c_1, \dots, c_i\}\right) \leq \frac{2^{j+1}}{(1 - \frac{2k}{\beta})n_i}\,\mathrm{cost}\!\left(T^*_{1,i} \cap P_{i,j+1}, \{c_1, \dots, c_i\}\right)$$
$$\Leftrightarrow \mathrm{cost}\!\left(T^*_{i+1,k} \cap P_{i,j}, \{c_1, \dots, c_i\}\right) \leq \frac{4k}{\beta - 2k}\,\mathrm{cost}\!\left(T^*_{1,i} \cap P_{i,j+1}, \{c_1, \dots, c_i\}\right). \quad \text{(V)}$$
For each $i \in [k-1]$ we still need an upper bound on $\mathrm{cost}(T^*_{i+1,k} \cap P_{i,m_i}, \{c_1, \dots, c_i\})$. Since $|R_{i,m_i}| = |P_{i,m_i}|$ and also $R_{i,m_i} \subseteq R_{i,m_i-1}$ we can use Eq. (III) to obtain
$$|T^*_{1,i} \cap R_{i,m_i}| = |R_{i,m_i}| - |T^*_{i+1,k} \cap R_{i,m_i}| \geq |R_{i,m_i}| - |T^*_{i+1,k} \cap R_{i,m_i-1}| > \left(1 - \frac{2k}{\beta}\right)\frac{n_i}{2^{m_i}}. \quad \text{(VI)}$$
By definition we also know for all $i \in [k-1]$, $\sigma \in P_{i,m_i}$ and $\tau \in R_{i,m_i}$ that $\min_{c \in \{c_1,\dots,c_i\}} \rho(\sigma, c) \leq \min_{c \in \{c_1,\dots,c_i\}} \rho(\tau, c)$. Thus,
$$\frac{\mathrm{cost}(T^*_{i+1,k} \cap P_{i,m_i}, \{c_1, \dots, c_i\})}{|T^*_{i+1,k} \cap P_{i,m_i}|} \leq \frac{\mathrm{cost}(T^*_{1,i} \cap R_{i,m_i}, \{c_1, \dots, c_i\})}{|T^*_{1,i} \cap R_{i,m_i}|}.$$
Combining this inequality with Eqs. (III) and (VI) yields:
$$\frac{\beta 2^{m_i}}{2kn_i}\,\mathrm{cost}\!\left(T^*_{i+1,k} \cap P_{i,m_i}, \{c_1, \dots, c_i\}\right) < \frac{2^{m_i}}{(1 - \frac{2k}{\beta})n_i}\,\mathrm{cost}\!\left(T^*_{1,i} \cap R_{i,m_i}, \{c_1, \dots, c_i\}\right)$$
$$\Leftrightarrow \mathrm{cost}\!\left(T^*_{i+1,k} \cap P_{i,m_i}, \{c_1, \dots, c_i\}\right) < \frac{2k}{\beta - 2k}\,\mathrm{cost}\!\left(T^*_{1,i} \cap R_{i,m_i}, \{c_1, \dots, c_i\}\right). \quad \text{(VII)}$$
We can now give the following bound, combining Eqs. (V) and (VII), for each $i \in [k-1]$:
$$\mathrm{cost}\!\left(T^*_{i+1,k} \cap P_i, \{c_1, \dots, c_i\}\right) = \sum_{j=1}^{m_i} \mathrm{cost}\!\left(T^*_{i+1,k} \cap P_{i,j}, \{c_1, \dots, c_i\}\right)$$
$$< \sum_{j=1}^{m_i-1} \frac{4k}{\beta - 2k}\,\mathrm{cost}\!\left(T^*_{1,i} \cap P_{i,j+1}, \{c_1, \dots, c_i\}\right) + \frac{2k}{\beta - 2k}\,\mathrm{cost}\!\left(T^*_{1,i} \cap R_{i,m_i}, \{c_1, \dots, c_i\}\right)$$
$$< \frac{4k}{\beta - 2k}\,\mathrm{cost}\!\left(T^*_{1,i} \cap T_i, \{c_1, \dots, c_i\}\right). \quad \text{(VIII)}$$
Here, the last inequality holds, because $P_{i,2}, \dots, P_{i,m_i}$ and $R_{i,m_i}$ are pairwise disjoint subsets of $T_i$. Now, we plug this bound into Eq. (II). Note that $T^*_j \cap T_i \subseteq T^*_j \cap T_j$ for each $i \in [k]$ and $j \in [i]$ by definition. We obtain:
$$\mathrm{cost}(T, \{c_1, \dots, c_k\}) \leq (\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) + \sum_{i=1}^{k-1} \mathrm{cost}\!\left(T^*_{i+1,k} \cap P_i, \{c_1, \dots, c_i\}\right)$$
$$< (\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) + \frac{4k}{\beta - 2k}\sum_{i=1}^{k-1} \mathrm{cost}\!\left(T^*_{1,i} \cap T_i, \{c_1, \dots, c_i\}\right)$$
$$\leq (\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) + \frac{4k}{\beta - 2k}\sum_{i=1}^{k-1}\sum_{t=1}^i \mathrm{cost}(T^*_t \cap T_i, c_t)$$
$$\leq (\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) + \frac{4k}{\beta - 2k}\sum_{i=1}^{k-1}\sum_{t=1}^i \mathrm{cost}(T^*_t \cap T_t, c_t)$$
$$\leq (\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) + \frac{4k^2}{\beta - 2k}\sum_{i=1}^{k} \mathrm{cost}(T^*_i \cap T_i, c_i)$$
$$\leq \left(1 + \frac{4k^2}{\beta - 2k}\right)(\alpha + \varepsilon)\sum_{i=1}^k \mathrm{cost}(T^*_i, c^*_i) = \left(1 + \frac{4k^2}{\beta - 2k}\right)(\alpha + \varepsilon)\,\mathrm{cost}(T, C^*).$$
The last inequality follows from Eq. (I).

The following analysis of the worst-case running-time of Algorithm 5 is a slight adaptation of [2, Theorem 2.8], which is also provided for the sake of completeness.
Proof of Theorem 7.3.
For the sake of simplicity, we assume that $n$ is a power of $2$. If $\kappa = 0$, Algorithm 5 has running-time $c_1 \in O(1)$ and if $\kappa \geq n$, Algorithm 5 has running-time $c_2 \cdot n \in O(n)$. Let $T(n, \kappa, \beta, \delta, \varepsilon)$ denote the worst-case running-time of Algorithm 5 for an input set $T$ with $|T| = n$. If $n > \kappa \geq 1$, Algorithm 5 has running-time at most $c_3 \cdot (n \cdot T_\rho + n) \in O(n \cdot T_\rho)$ to obtain $P$, $T(n/2, \kappa, \beta, \delta, \varepsilon)$ for the recursive call in the pruning phase, $T(n, \beta, \delta, \varepsilon)$ to obtain the candidates, $C(n, \beta, \delta, \varepsilon) \cdot T(n, \kappa-1, \beta, \delta, \varepsilon)$ for the recursive calls in the candidate phase, one for each candidate, and $c_4 \cdot n \cdot T_\rho \cdot C(n, \beta, \delta, \varepsilon) \in O(n \cdot T_\rho \cdot C(n, \beta, \delta, \varepsilon))$ to eventually evaluate the candidate sets. Let $c = \max\{c_1, c_2, c_3, c_4\}$. We obtain the following recurrence relation:
$$T(n, \kappa, \beta, \delta, \varepsilon) \leq \begin{cases} c & \text{if } \kappa = 0 \\ cn & \text{if } \kappa \geq n \\ C(n,\beta,\delta,\varepsilon) \cdot T(n, \kappa-1, \beta, \delta, \varepsilon) + T(n/2, \kappa, \beta, \delta, \varepsilon) \\ \quad + T(n, \beta, \delta, \varepsilon) + cn \cdot T_\rho \cdot C(n, \beta, \delta, \varepsilon) & \text{else.} \end{cases}$$
Let $f(n, \beta, \delta, \varepsilon) = c \cdot \left(\frac{T(n,\beta,\delta,\varepsilon)}{n} + T_\rho \cdot C(n,\beta,\delta,\varepsilon)\right)$. We prove that $T(n, \kappa, \beta, \delta, \varepsilon) \leq c \cdot 4^\kappa \cdot C(n, \beta, \delta, \varepsilon)^{\kappa+1} \cdot n \cdot f(n, \beta, \delta, \varepsilon)$, by induction on $n, \kappa$. For $\kappa = 0$ we have $T(n, \kappa, \beta, \delta, \varepsilon) \leq c \leq cn \leq c \cdot 4^0 \cdot C(n,\beta,\delta,\varepsilon) \cdot n \cdot f(n,\beta,\delta,\varepsilon)$. For $\kappa \geq n$ we have $T(n, \kappa, \beta, \delta, \varepsilon) \leq cn \leq c \cdot 4^\kappa \cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1} \cdot n \cdot f(n,\beta,\delta,\varepsilon)$. Now, let $n > \kappa \geq 1$ and assume the claim holds for $T(n', \kappa', \beta, \delta, \varepsilon)$, for each $\kappa' \in \{0, \dots, \kappa-1\}$ and $n' \in \{1, \dots, n-1\}$. We have:
$$T(n, \kappa, \beta, \delta, \varepsilon) \leq C(n,\beta,\delta,\varepsilon) \cdot T(n, \kappa-1, \beta, \delta, \varepsilon) + T(n/2, \kappa, \beta, \delta, \varepsilon) + T(n, \beta, \delta, \varepsilon) + cn \cdot T_\rho \cdot C(n, \beta, \delta, \varepsilon)$$
$$\leq C(n,\beta,\delta,\varepsilon) \cdot c \cdot 4^{\kappa-1} \cdot C(n,\beta,\delta,\varepsilon)^{\kappa} \cdot n \cdot f(n,\beta,\delta,\varepsilon) + c \cdot 4^\kappa \cdot C(n/2,\beta,\delta,\varepsilon)^{\kappa+1} \cdot \frac{n}{2} \cdot f(n/2,\beta,\delta,\varepsilon) + n \cdot f(n,\beta,\delta,\varepsilon)$$
$$\leq \left(\frac{1}{4} + \frac{1}{2} + \frac{1}{4^\kappa\,C(n,\beta,\delta,\varepsilon)^{\kappa+1}}\right) c \cdot 4^\kappa \cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1} \cdot n \cdot f(n,\beta,\delta,\varepsilon) \leq c \cdot 4^\kappa \cdot C(n,\beta,\delta,\varepsilon)^{\kappa+1} \cdot n \cdot f(n,\beta,\delta,\varepsilon).$$
The last inequality holds, because $\frac{1}{4^\kappa\,C(n,\beta,\delta,\varepsilon)^{\kappa+1}} \leq \frac{1}{4}$, and the claim follows by induction.

Conclusion

We have developed bicriteria approximation algorithms for $(k,\ell)$-median clustering of polygonal curves under the Fréchet distance. While it proved to be relatively easy to obtain a good approximation where the centers have up to $2\ell-2$ vertices in reasonable time, a way to obtain good approximate centers with up to $\ell$ vertices in reasonable time is not in sight. This is due to the continuous Fréchet distance: the vertices of a median need not be anywhere near a vertex of an input curve, resulting in a huge search space. If we cover the whole search space by, say, grids, the worst-case running-times of the resulting algorithms become dependent on the arc-lengths of the input curves' edges, which is not acceptable. We note that $g$-coverability of the continuous Fréchet distance would imply the existence of sublinear-size $\varepsilon$-coresets for $(k,\ell)$-center clustering of polygonal curves under the Fréchet distance. It is an interesting open question whether $g$-coverability holds for the continuous Fréchet distance. In contrast to the doubling dimension, which was shown to be infinite even for curves of bounded complexity [15], the VC-dimension of metric balls under the continuous Fréchet distance is bounded in terms of the complexities $\ell$ and $m$ of the curves [16]. Whether this bound can be combined with the framework by Feldman and Langberg [17] to achieve faster approximations for the $(k,\ell)$-median problem under the continuous Fréchet distance is an interesting open problem. The general relationship between the VC-dimension of range spaces derived from metric spaces and their doubling properties is a topic of ongoing research, see for example Huang et al. [21].

References

[1] C. Abraham, P. A.
Cornillon, E. Matzner-Løber, and N. Molinari. Unsupervised curve clusteringusing b-splines.
Scandinavian Journal of Statistics , 30(3):581–595, 2003.[2] Marcel R. Ackermann, Johannes Blömer, and Christian Sohler. Clustering for metric andnonmetric distance measures.
ACM Trans. Algorithms , 6(4):59:1–59:26, 2010.[3] Pankaj K. Agarwal, Sariel Har-Peled, Nabil H. Mustafa, and Yusu Wang. Near-linear timeapproximation algorithms for curve simplification. In Rolf Möhring and Rajeev Raman, editors,
Algorithms - ESA 2002, pages 29–41. Springer, 2002. [4] Helmut Alt and Michael Godau. Computing the Fréchet distance between two polygonal curves.
International Journal of Computational Geometry & Applications , 5:75–91, 1995.[5] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering withBregman divergences.
Journal of Machine Learning Research , 6:1705–1749, 2005.[6] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering.
Machine Learning , 56(1-3):89–113, 2004.[7] Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support vector clustering.
Journal of Machine Learning Research , 2:125–137, 2001.[8] Kevin Buchin, Maike Buchin, and Carola Wenk. Computing the Fréchet distance betweensimple polygons.
Comput. Geom. , 41(1-2):2–20, 2008.[9] Kevin Buchin, Anne Driemel, Joachim Gudmundsson, Michael Horton, Irina Kostitsyna, MaartenLöffler, and Martijn Struijs. Approximating (k, l)-center clustering for curves. In
Proceedings ofthe Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 2922–2938, 2019.[10] Kevin Buchin, Anne Driemel, Natasja van de L’Isle, and André Nusser. klcluster: Center-based clustering of trajectories. In
Proceedings of the 27th ACM SIGSPATIAL InternationalConference on Advances in Geographic Information Systems , pages 496–499, 2019.[11] Kevin Buchin, Anne Driemel, and Martijn Struijs. On the hardness of computing an averagecurve. In Susanne Albers, editor, , volume 162 of
LIPIcs , pages 19:1–19:19. Schloss Dagstuhl - Leibniz-Zentrum fürInformatik, 2020.[12] Jeng-Min Chiou and Pai-Ling Li. Functional clustering and identifying substructures oflongitudinal data.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) ,69(4):679–699, 2007.[13] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression.
IEEE Trans. InformationTheory , 51(4):1523–1545, 2005.[14] Anne Driemel and Sariel Har-Peled. Jaywalking your dog: Computing the Fréchet distancewith shortcuts.
SIAM Journal on Computing , 42(5):1830–1866, 2013.[15] Anne Driemel, Amer Krivosija, and Christian Sohler. Clustering time series under the Fréchetdistance. In
Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on DiscreteAlgorithms , pages 766–785, 2016.[16] Anne Driemel, Jeff M. Phillips, and Ioannis Psarros. The VC dimension of metric balls underFréchet and Hausdorff distances. In ,pages 28:1–28:16, 2019.[17] Dan Feldman and Michael Langberg. A unified framework for approximating and clusteringdata. In Lance Fortnow and Salil P. Vadhan, editors,
Proceedings of the 43rd ACM Symposiumon Theory of Computing , pages 569–578. ACM, 2011.[18] Luis Angel Garcia-Escudero and Alfonso Gordaliza. A proposal for robust curve clustering.
Journal of Classification , 22(2):185–201, 2005.3019] Sudipto Guha and Nina Mishra. Clustering data streams. In Minos N. Garofalakis, JohannesGehrke, and Rajeev Rastogi, editors,
Data Stream Management - Processing High-Speed DataStreams , Data-Centric Systems and Applications, pages 169–187. Springer, 2016.[20] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In
Proceedings of the 36th Annual ACM Symposium on Theory of Computing , pages 291–300, 2004.[21] Lingxiao Huang, Shaofeng H.-C. Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering(with outliers) in doubling metrics. In , pages 814–825. IEEE Computer Society, 2018.[22] Hiroshi Imai and Masao Iri. Polygonal Approximations of a Curve — Formulations andAlgorithms.
Machine Intelligence and Pattern Recognition , 6:71–86, January 1988.[23] Piotr Indyk.
High-dimensional Computational Geometry . PhD thesis, Stanford University, CA,USA, 2000.[24] Stephen C. Johnson. Hierarchical clustering schemes.
Psychometrika , 32(3):241–254, 1967.[25] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ ε )-approximationalgorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEESymposium on Foundations of Computer Science , FOCS ’04, page 454–462. IEEE ComputerSociety, 2004.[26] Stefan Meintrup, Alexander Munteanu, and Dennis Rohde. Random projections and sam-pling algorithms for clustering of high-dimensional polygonal curves. In
Advances in NeuralInformation Processing Systems 32 , pages 12807–12817, 2019.[27] Michael Mitzenmacher and Eli Upfal.
Probability and Computing: Randomization and Proba-bilistic Techniques in Algorithms and Data Analysis . Cambridge University Press, USA, 2ndedition, 2017.[28] Abhinandan Nath and Erin Taylor. k-median clustering under discrete Fréchet and Hausdorffdistances. In Sergio Cabello and Danny Z. Chen, editors, , volume 164 of
LIPIcs , pages 58:1–58:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.[29] François Petitjean and Pierre Gançarski. Summarizing a set of time series by averaging: Fromsteiner sequence to compact multiple alignment.
Theoretical Computer Science , 414(1):76 – 91,2012.[30] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method fordynamic time warping, with applications to clustering.
Pattern Recognition , 44(3):678 – 693,2011.[31] Satu Elisa Schaeffer. Graph clustering.
Computer Science Review , 1(1):27 – 64, 2007.[32] René Vidal. Subspace clustering.