MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension
Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal
Department of Information Engineering, University of Padova, Padova, Italy ({ceccarel,capri,geppo}@dei.unipd.it)
Department of Computer Science, Brown University, Providence, RI, USA (eli [email protected])
Abstract
Given a dataset of points in a metric space and an integer k, a diversity maximization problem requires determining a subset of k points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization focused on the sequential setting. In this work we present space- and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an (α + ε)-approximation ratio, for any constant ε > 0, where α is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real-world and synthetic datasets, scaling up to over a billion points.

Diversity maximization is a fundamental primitive in massive data analysis, which provides a succinct summary of a dataset while preserving the diversity of the data [1, 27, 33, 34]. This summary can be presented visually to the user or can be used as a core for further processing of the dataset. In this paper we present novel efficient algorithms for diversity maximization in popular computation models for massive data processing, namely Streaming and MapReduce.
Diversity Measures and their Applications:
Given a dataset of points in a metric space and a constant k, a solution to the diversity maximization problem is a subset of k points that maximizes some diversity objective measure defined in terms of the distances between the points.

∗ This work was published in the Proceedings of the VLDB Endowment [10].

Combinations of relevance ranking and diversity maximization have been explored in a variety of applications, including web search [5], e-commerce [7], recommendation systems [35], aggregate websites [28] and query-result navigation [14] (see [31, 1, 23] for further references on the applications of diversity maximization). The common problem in all these applications is that even after filtering and ranking for relevance, the output set is often too large to be presented to the user. A practical solution is to present a diverse subset of the results so the user can evaluate the variety of options and possibly refine the search.

There are a number of ways to formulate the goal of finding a set of k points which are as diverse, or as far from each other, as possible. Conceptually, a k-diversity maximization problem can be formulated in terms of a specific graph-theoretic measure defined on sets of k points, seen as the nodes of a clique where each edge is weighted with the distance between its endpoints [12]. Several diversity measures are defined in Table 1. While the most appropriate ones in the context of web search, e-commerce, aggregator systems and query-result navigation are the remote-edge and the remote-clique measures [17, 1], the results in this paper also extend to the other measures in the table, which have important applications in analyzing network performance, locating strategic facilities or noncompeting franchises, or determining initial solutions for iterative clustering algorithms or heuristics for hard optimization problems such as TSP [21, 12, 31]. We include all of these measures here to demonstrate the versatility of our approach to a variety of diversity criteria. We want to stress that different measures characterize the diversity of a set in a different fashion: indeed, an optimal solution with respect to one measure is not necessarily optimal with respect to another measure.

Distance Metric:
All the diversity criteria listed in Table 1 are known to be NP-hard for general metric spaces. Following a number of recent works [2, 15, 25, 19, 8, 9], we parameterize our results in terms of the doubling dimension of the metric space. Recall that a metric space has doubling dimension D if any ball of radius r can be covered by at most 2^D balls of radius r/2. While our methods yield provably tight bounds in spaces of bounded doubling dimension (e.g., any bounded-dimension Euclidean space), they have the ability of providing good approximations in more general spaces based on important practical distance functions, such as the cosine distance in web search [5] and the dissimilarity (Jaccard) distance in database queries [26].
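As an illustration of the practical distance functions just mentioned, here is a small Python sketch (ours, not part of the paper) of the cosine distance between real vectors and the Jaccard dissimilarity between finite sets. Note that, unlike the Jaccard distance, the cosine distance is not a metric in the strict sense, which is precisely the situation the paragraph above alludes to.

```python
import math

def cosine_distance(u, v):
    # 1 - cos(angle between u and v); u, v are nonzero real vectors
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (nu * nv)

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| for two finite, nonempty sets
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)
```

For instance, orthogonal vectors have cosine distance 1, parallel vectors have cosine distance 0, and the sets {1, 2} and {2, 3} have Jaccard distance 2/3.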
Massive Data Computation Models:
Since the applications of diversity maximization are mostly in the realm of massive data analysis, it is important to develop efficient algorithms for computational settings that can handle very large datasets. The Streaming and MapReduce models are widely recognized as suitable computational frameworks for big-data processing. The Streaming model [30] copes with large data volumes through an on-the-fly computation on the streamed dataset, storing only very limited information in the process, while the MapReduce model [24, 29] enables the handling of large datasets through the massive availability of resource-limited processing elements working in parallel. The major challenge in both models is devising strategies which work under the constraint that the number of data items that a single processor can access simultaneously is substantially limited.
Related work.
Diversity maximization has been studied in the literature under different names (e.g., p-Dispersion, Max-Min Facility Dispersion, etc.). An extensive account of the existing formulations is provided in [12]. All of these problems are known to be NP-hard, and several sequential approximation algorithms have been proposed. Table 1 summarizes the best known results for general metric spaces. There are also some specialized results for spaces with bounded doubling dimension: for the remote-clique problem, a polynomial-time approximation algorithm on the Euclidean plane and a polynomial-time (1 + ε)-approximation algorithm on d-dimensional spaces with rectilinear distances, for any positive constants ε and d, are presented in [16]. In [21] it is shown that a natural greedy algorithm attains a 2.309 approximation factor on the Euclidean plane for remote-tree.

Problem            | Diversity measure                                      | Sequential approximation
remote-edge        | min_{p,q ∈ S} d(p,q)                                   | 2 (2) [32]
remote-clique      | Σ_{p,q ∈ S} d(p,q)                                     | 2 (−) [22]
remote-star        | min_{c ∈ S} Σ_{q ∈ S∖{c}} d(c,q)                       | 2 (−) [12]
remote-bipartition | min_{Q ⊂ S, |Q| = ⌊|S|/2⌋} Σ_{q ∈ Q, z ∈ S∖Q} d(q,z)   | 3 (−) [12]
remote-tree        | w(MST(S))                                              | 4 (2) [21]
remote-cycle       | w(TSP(S))                                              | 3 (2) [21]

Table 1: Diversity measures considered in this paper. w(MST(S)) (resp., w(TSP(S))) denotes the minimum weight of a spanning tree (resp., Hamiltonian cycle) of the complete graph whose nodes are the points of S and whose edge weights are the pairwise distances among the points. The last column lists the best known approximation factor, the lower bound under the hypothesis P ≠ NP (in parentheses), and the related references.

                   | Previous [23, 4] (general metric spaces) | Our results (bounded doubling dimension)
remote-edge        | 3                                        | 1 + ε
remote-clique      | 6 + ε                                    | 1 + ε
remote-star        | 12                                       | 1 + ε
remote-bipartition | 18                                       | 1 + ε
remote-tree        | 4                                        | 1 + ε
remote-cycle       | 3                                        | 1 + ε

Table 2: Approximation factors of the composable core-sets computed by our algorithm, compared with previous approaches.

Recently, the remote-clique problem has been considered under matroid constraints [1, 11], which generalize the cardinality constraints considered in previous literature.

In recent years, the notion of (composable) core-set has been introduced as a key tool for the efficient solution of optimization problems on large datasets. A core-set [3], with respect to a given computational objective, is a (small) subset of the entire dataset which contains a good approximation to the optimal solution for the entire dataset. A composable core-set [23] is a collection of core-sets, one for each subset in an arbitrary partition of the dataset, such that the union of these core-sets contains a good core-set for the entire dataset. The approximation factor attained by a (composable) core-set is defined as the ratio between the value of the global optimal solution and the value of the optimal solution on the (composable) core-set. For the problems listed in Table 1, composable core-sets with constant approximation factors have been devised in [23, 4] (see Table 2). As observed in [23], (composable) core-sets may become key ingredients for developing efficient algorithms for the MapReduce and Streaming frameworks, where the memory available for a processor's local operations is typically much smaller than the overall input size.

In recent years, the characterization of data through the doubling dimension of the space it belongs to has been increasingly used for algorithm design and analysis in a number of contexts, including clustering [2], nearest neighbour search [15], routing [25], machine learning [19], and graph analytics [8, 9].
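To make the measures of Table 1 concrete, the following minimal Python sketch (ours, not part of the paper) evaluates the remote-edge, remote-clique, and remote-star objectives of a set S under a user-supplied distance function; points are assumed distinct.

```python
from itertools import combinations

def remote_edge(S, dist):
    # min_{p,q in S} d(p,q): the smallest pairwise distance
    return min(dist(p, q) for p, q in combinations(S, 2))

def remote_clique(S, dist):
    # sum of all pairwise distances
    return sum(dist(p, q) for p, q in combinations(S, 2))

def remote_star(S, dist):
    # min over centers c of the total distance from c to the other points
    return min(sum(dist(c, q) for q in S if q != c) for c in S)
```

For instance, on the one-dimensional set {0, 1, 3} with d(a, b) = |a − b|, the three measures evaluate to 1, 6, and 3, respectively, which also illustrates that the measures rank sets differently.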
Our contribution.
In this paper we develop efficient algorithms for diversity maximization in the Streaming and MapReduce models. At the heart of our algorithms are novel constructions of (composable) core-sets. In contrast to [23, 4], where different constructions are devised for each diversity objective, we provide a unique construction technique for all of the six objective functions. While our approach is applicable to general metric spaces, on spaces of bounded doubling dimension our (composable) core-sets feature a 1 + ε approximation factor, for any fixed 0 < ε ≤ 1, for all of the six diversity objectives, with the core-set size increasing as a function of 1/ε. The approximation factor is significantly better than the ones attained by the known composable core-sets in general metric spaces, which are reported in Table 2 for comparison.

Once a core-set (possibly obtained as the union of composable core-sets) is extracted from the data, the best known sequential approximation algorithm can be run on it to derive the final solution. The resulting approximation ratio attained in this fashion combines two sources of error: (1) the approximation loss in replacing the entire dataset with a core-set; and (2) the approximation factor of the sequential approximation algorithm executed on the core-set. On metric spaces of bounded doubling dimension, the combined approximation ratio attained by our algorithms for any of the six diversity objective functions considered in the paper is bounded by (α + ε), for any constant 0 < ε ≤ 1, where α is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same maximum diversity criterion.

Our algorithms require only one pass over the data in the streaming setting, and only two rounds in MapReduce. To the best of our knowledge, for all six diversity problems, our streaming algorithms are the first ones that yield approximation ratios close to those of the best sequential algorithms using space independent of the input stream size. Also, we remark that the parallel strategy at the base of the MapReduce algorithms can be effectively ported to other models of parallel computation.

Finally, we provide experimental evidence of the practical relevance of our algorithms on both synthetic and real-world datasets. In particular, we show that higher accuracy is achievable by increasing the size of the core-sets, and that the MapReduce algorithm is considerably faster (up to three orders of magnitude) than its state-of-the-art competitors. Also, we provide evidence that the proposed approach is highly scalable. We want to remark that our work provides the first substantial experimental study on the performance of diversity maximization algorithms on large instances of up to billions of data points.

The rest of the paper is organized as follows. In Section 2, we introduce some fundamental concepts and useful notations. In Section 3, we identify sufficient conditions for a subset of points to be a core-set with provable approximation guarantees. These properties are then crucially exploited by the streaming and MapReduce algorithms described in Sections 4 and 5, respectively. Section 6 discusses how the higher memory requirements of four of the six diversity problems can be reduced, while Section 7 reports on the results of the experiments.

Let (D, d) be a metric space. The distance between two points u, v ∈ D is denoted by d(u, v).
Moreover, we let d(p, S) = min_{q ∈ S} d(p, q) denote the minimum distance between a point p ∈ D and an element of a set S ⊆ D. Also, for a point p ∈ D, the ball of radius r centered at p is the set of all points in D at distance at most r from p. The doubling dimension of a space is the smallest D such that any ball of radius r is covered by at most 2^D balls of radius r/2. It is easy to see, by iterating this decomposition, that for any 0 < ε ≤ 1 any ball of radius r can be covered by at most (1/ε)^D balls of radius εr. For ease of presentation, in this paper we concentrate on metric spaces of constant doubling dimension D, although the results can be immediately extended to nonconstant D by suitably adjusting the ranges of variability of the parameters involved. Several relevant metric spaces have constant doubling dimension, a notable case being the Euclidean space of constant dimension D, which has doubling dimension O(D) [20].

Let div : 2^D → R be a diversity function that maps a set S ⊂ D to some nonnegative real number. In this paper, we will consider the instantiations of the function div listed in Table 1, which were introduced and studied in [12, 23, 4]. For a specific diversity function div, a set S ⊂ D of size n and a positive integer k ≤ n, the goal of the diversity maximization problem is to find some subset S′ ⊆ S of size k that maximizes the value div(S′). In the following, we refer to the k-diversity of S as

    div_k(S) = max_{S′ ⊆ S, |S′| = k} div(S′).

The notion of core-set [3] captures the idea of a small set of points that approximates some property of a larger set.

Definition 1.
Let div(·) be a diversity function, k be a positive integer, and β ≥ 1. A set T ⊆ S, with |T| ≥ k, is a β-core-set for S if

    div_k(T) ≥ div_k(S)/β.

In [23, 4], the concept of core-set is extended so that, given an arbitrary partition of the input set, the union of the core-sets of each subset in the partition is a core-set for the entire input set.

Definition 2.
Let div(·) be a diversity function, k be a positive integer, and β ≥ 1. A function c(S) that maps S ⊂ D to one of its subsets computes a β-composable core-set w.r.t. div if, for any collection of disjoint sets S_1, ..., S_ℓ ⊂ D with |S_i| ≥ k, we have

    div_k( ∪_{i=1}^{ℓ} c(S_i) ) ≥ div_k( ∪_{i=1}^{ℓ} S_i ) / β.

Consider a set S ⊆ D and a subset T ⊆ S. We define the range of T as r_T = max_{p ∈ S∖T} d(p, T), and the farness of T as ρ_T = min_{c ∈ T} {d(c, T∖{c})}. Moreover, we define the optimal range r∗_k for S w.r.t. k to be the minimum range of a subset of k points of S. Similarly, we define the optimal farness ρ∗_k for S w.r.t. k to be the maximum farness of a subset of k points of S. Observe that ρ∗_k is also the value of the optimal solution to the remote-edge problem.

In this section we identify some properties that, when exhibited by a set of points, guarantee that the set is a (1 + ε)-core-set for the diversity problems listed in Table 1. In the subsequent sections we will show how core-sets with these properties can be obtained in the streaming and MapReduce settings. In fact, when we discuss the MapReduce setting, we will also show that these properties yield composable core-sets featuring tighter approximation factors than existing ones, for spaces with bounded doubling dimension.

First, we need to establish a fundamental relation between the optimal range r∗_k and the optimal farness ρ∗_k for a set S. To this purpose, we observe that the classical greedy approximation algorithm proposed in [18] for finding a subset of minimum range (k-center problem) gives in fact a good approximation to both measures. We refer to this algorithm as GMM. Consider a set of points S and a positive integer k < |S|. Let T = GMM(S, k) be the subset of k points returned by the algorithm for this instance. The algorithm initializes T with an arbitrary point a ∈ S.
Then, greedily, it adds to T the point of S∖T which maximizes the distance from the already selected points, until T has size k. It is known that the returned set T is such that r_T ≤ 2 r∗_k [18], and it is easily seen that r_T ≤ ρ_T (a property referred to as the anticover property). This immediately implies the following fundamental relation.

Fact 1.
Given a set S and k > 1, we have r∗_k ≤ ρ∗_k.

Let S be a set belonging to a metric space of doubling dimension D. In what follows, div(·) denotes the diversity function of the problem under consideration, and O denotes an optimal solution to the problem with respect to instance S. Consider a subset T ⊆ S. Intuitively, T is a good core-set for some diversity measure on S if, for each point of the optimal solution O, it contains a point sufficiently close to it. We formalize this intuition by suitably adapting the notion of proxy function introduced in [23]. Given a core-set T ⊆ S, we aim at defining a function p : O → T such that the distance between o and p(o) is bounded, for any o ∈ O. For some problems this function will be required to be injective, whereas for some others, injectivity will not be needed. We begin by studying the remote-edge and the remote-cycle problems.

Lemma 1.
For any given ε > 0, let ε′ be such that (1 − ε′) = 1/(1 + ε). A set T ⊆ S is a (1 + ε)-core-set for the remote-edge and the remote-cycle problems if |T| ≥ k and there is a function p : O → T such that, for any o ∈ O, d(o, p(o)) ≤ (ε′/2) ρ∗_k.

Proof. Consider the remote-edge problem first, and observe that div_k(T) ≤ div(O) = ρ∗_k. By applying the triangle inequality and the stated property of the proxy function p we get

    div_k(T) ≥ min_{o1,o2 ∈ O} d(p(o1), p(o2))
             ≥ min_{o1,o2 ∈ O} {d(o1, o2) − d(o1, p(o1)) − d(o2, p(o2))}
             ≥ min_{o1,o2 ∈ O} d(o1, o2) − ε′ ρ∗_k = div(O)(1 − ε′) = div(O)/(1 + ε).

Note that p(·) does not need to be injective: in fact, if two points of the optimal solution are mapped into the same proxy, the first inequality trivially holds, its right-hand side being zero.

Consider now the remote-cycle problem. Note that div_k(T) ≤ div(O). Let ρ̄ = div(O)/k and observe that ρ∗_k ≤ ρ̄. Let P = {p(o) : o ∈ O} ⊆ T be the image of the proxy function. Following the argument given in [23, 4], consider TSP(P), an optimal tour on P. We build a weighted graph G whose vertex set is O ∪ P and whose edges are those induced by TSP(P) plus two copies of the edge (o, p(o)), for each o ∈ O. The weight of an edge (u, v) is d(u, v). Clearly, the resulting graph G is connected and all its vertices have even degree, therefore it admits an Euler tour T_E of its edges. From T_E we obtain a cycle C of O by shortcutting all nodes that are not in O. By repeated applications of the triangle inequality during shortcutting, and by the fact that d(o, p(o)) ≤ (ε′/2) ρ̄, we obtain:

    w(TSP(O)) ≤ w(C) ≤ w(T_E) ≤ w(TSP(P)) + k ε′ ρ̄ ≤ div_k(T) + ε′ div(O).

Therefore, div(O) ≤ div_k(T)/(1 − ε′) = div_k(T)(1 + ε). As in the case of the remote-edge problem, the injectivity of p(·) is not necessary.

Injectivity is instead required for the remote-clique, remote-star, remote-bipartition, and remote-tree problems, which are considered next.

Lemma 2.
For a given ε > 0, let ε′ be such that (1 − ε′) = 1/(1 + ε). A set T ⊆ S is a (1 + ε)-core-set for the remote-clique, remote-star, remote-bipartition, and remote-tree problems if |T| ≥ k and there is an injective function p : O → T such that, for any o ∈ O, d(o, p(o)) ≤ (ε′/2) ρ∗_k.

Proof. Observe that for each of the four problems it holds that div_k(T) ≤ div(O). Let us consider the remote-clique problem first, and define

    ρ̄ = div(O)/C(k,2) = Σ_{o1,o2 ∈ O} d(o1, o2) / C(k,2),

where C(k,2) = k(k − 1)/2. Clearly, ρ∗_k ≤ ρ̄. By combining this observation with the triangle inequality we have

    div_k(T) ≥ Σ_{o1,o2 ∈ O} d(p(o1), p(o2))
             ≥ Σ_{o1,o2 ∈ O} [d(o1, o2) − d(o1, p(o1)) − d(o2, p(o2))]
             ≥ Σ_{o1,o2 ∈ O} d(o1, o2) − C(k,2) ε′ ρ̄ = div(O)(1 − ε′) = div(O)/(1 + ε).

The injectivity of p(·) is needed in this case for the first inequality above to be true, since k distinct proxies are needed to get a feasible solution. The argument for the other problems is virtually identical, and we omit it for brevity.

In the Streaming model [30] one processor with a limited-size main memory is available for the computation. The input is provided as a continuous stream of items which is typically too large to fit in main memory, hence it must be processed on the fly within the limited memory budget. Streaming algorithms aim at performing as few passes as possible (ideally just one) over the input.

In [23], the authors propose the following use of composable core-sets to approximate diversity in the streaming model. The stream of n input points is partitioned into √(n/k) blocks of size √(kn) each, and a core-set of size k is computed from each block and kept in memory. At the end of the pass, the final solution is computed on the union of the core-sets, whose total size is √(kn).
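The block-based scheme of [23] described above can be sketched as follows (a Python illustration under our own simplifications: we use the greedy farthest-first selection, called GMM later in this paper, as the per-block core-set construction, and we solve the final instance exhaustively, which is feasible only because the union of the core-sets has size about √(kn); [23] relies on its own core-set constructions and on a sequential approximation algorithm instead).

```python
from itertools import combinations, islice
import math

def gmm(S, k, dist):
    # farthest-first traversal: start from S[0], then repeatedly add the
    # point maximizing its distance from the points selected so far
    T = [S[0]]
    while len(T) < k:
        T.append(max(S, key=lambda p: min(dist(p, t) for t in T)))
    return T

def remote_edge(S, dist):
    return min(dist(p, q) for p, q in combinations(S, 2))

def stream_remote_edge(stream, n, k, dist):
    # one pass: blocks of about sqrt(k*n) points, a size-k core-set per
    # block, exhaustive search on the union of the core-sets at the end
    block_size = max(k, math.isqrt(k * n))
    it = iter(stream)
    union = []
    while True:
        block = list(islice(it, block_size))
        if not block:
            break
        union.extend(gmm(block, min(k, len(block)), dist))
    return max(remote_edge(list(c), dist) for c in combinations(union, k))
```

The working memory is the union of the core-sets, of size about √(kn), which is exactly the dependence on n that the construction of this section removes.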
In this section, we show that substantial savings (a space requirement independent of n) can be obtained by computing a single core-set from the entire stream, through two suitable variants of the 8-approximation doubling algorithm for the k-center problem presented in [13], which are described below.

Let k, k′ be two positive integers, with k ≤ k′. The first variant, dubbed SMM(S, k, k′), works in phases and maintains in memory a set T of at most k′ + 1 points. Each Phase i is associated with a distance threshold d_i, and is divided into a merge step and an update step. Phase 1 starts after an initialization in which the first k′ + 1 points of the stream are added to T, and d_1 is set equal to min_{c ∈ T} d(c, T∖{c}). At the beginning of Phase i, with i ≥ 1, the following invariant holds. Let S_i be the prefix of the stream processed so far. Then:

1. ∀ p ∈ S_i, d(p, T) ≤ 2 d_i;
2. ∀ t1, t2 ∈ T, with t1 ≠ t2, we have d(t1, t2) ≥ d_i.

Observe that the invariant holds at the beginning of Phase 1. The merge step operates on a graph G = (T, E) where there is an edge (t1, t2) between two points t1 ≠ t2 ∈ T if d(t1, t2) ≤ 2 d_i. In this step, the algorithm seeks a maximal independent set I ⊆ T of G, and sets T = I. The update step accepts new points from the stream. Let p be one such new point. If d(p, T) ≤ 4 d_i, the algorithm discards p, otherwise it adds p to T. The update step terminates when either the stream ends or the (k′ + 1)-st point is added to T. At the end of the step, d_{i+1} is set equal to 2 d_i. As shown in [13], at the end of the update step, the set T and the threshold d_{i+1} satisfy the above invariants for Phase i + 1.

To be able to use SMM for computing a core-set for our diversity problems, we have to make sure that the set T returned by the algorithm contains at least k points. However, in the algorithm described above the last phase could end with |T| < k. To fix this situation, we modify the algorithm so as to retain in memory, for the duration of each phase, the set M of points that have been removed from T during the merge step performed at the beginning of the phase. Consider the last phase. If at the end of the stream we have |T| < k, we can pick k − |T| arbitrary points from M and add them to T. Note that we can always do so because |M ∪ I| = k′ + 1 ≥ k, where I is the independent set found during the last merge step.

Suppose that the input set S belongs to a metric space with doubling dimension D. We have:

Lemma 3.
For any 0 < ε′ ≤ 1, let k′ = (32/ε′)^D · k, and let T be the set of points returned by SMM(S, k, k′). Then, given an arbitrary set X ⊆ S with |X| = k, there exists a function p : X → T such that, for any x ∈ X, d(x, p(x)) ≤ (ε′/2) ρ∗_k.

Proof. Let r∗_{k′} be the optimal range for S w.r.t. k′. Also, let r_T = max_{p ∈ S} d(p, T) be the range of T, and let ρ∗_k be the optimal farness for S w.r.t. k. Suppose that SMM(S, k, k′) performs ℓ phases. It is immediate to see that r_T ≤ 4 d_ℓ. Moreover, since at the beginning of Phase ℓ the set T contains k′ + 1 points at pairwise distance at least d_ℓ, two of these points must fall in the same cluster of an optimal k′-center clustering of S, whence d_ℓ ≤ 2 r∗_{k′} and r_T ≤ 8 r∗_{k′} (see [13]). Consider now an optimal clustering of S with k centers and range r∗_k and, for notational convenience, define ε′′ = ε′/32. Since the space has doubling dimension D, each of the k optimal clusters can be covered by (1/ε′′)^D = k′/k balls of radius at most ε′′ r∗_k; hence there exist k′ balls in the space (centered at points not necessarily in S) of radius at most ε′′ r∗_k which contain all of the points in S. By choosing one arbitrary center in S for each such ball, we obtain a feasible solution to the k′-center problem for S with range at most 2 ε′′ r∗_k. Consequently, r∗_{k′} ≤ 2 ε′′ r∗_k. Hence, we have that r_T ≤ 8 r∗_{k′} ≤ 16 ε′′ r∗_k = (ε′/2) r∗_k. By Fact 1, we know that r∗_k ≤ ρ∗_k. Therefore, we have r_T ≤ (ε′/2) ρ∗_k. Given a set X ⊆ S of size k, the desired proxy function p(·) is the one that maps each point x ∈ X to the closest point in T. By the discussion above, we have that d(x, p(x)) ≤ (ε′/2) ρ∗_k.

For the diversity problems mentioned in Lemma 2, we need that, for each point of an optimal solution, the final core-set extracted from the data stream contains a distinct point very close to it. In what follows, we describe a variant of SMM, dubbed
SMM-EXT, which ensures this property. Algorithm SMM-EXT proceeds as SMM, but maintains for each t ∈ T a set E_t of at most k delegate points close to t, including t itself. More precisely, at the beginning of the algorithm, T is initialized with the first k′ + 1 points of the stream, as before, and E_t is set equal to {t}, for each t ∈ T. In the merge step of Phase i, with i ≥ 1, iteratively for each point t1 not included in the independent set I, we determine an arbitrary point t2 ∈ I such that d(t1, t2) ≤ 2 d_i, and let E_{t2} inherit min{|E_{t1}|, k − |E_{t2}|} points of E_{t1}. Note that one such point t2 must exist, otherwise I would not be a maximal independent set. Also, note that a point t2 ∈ I may inherit points from sets associated with different points not in I. Consider the update step of Phase i and let p be a new point from the stream. Let t ∈ T be the point currently in T which is closest to p. If d(p, t) > 4 d_i, we add p to T and set E_p = {p}. If instead d(p, t) ≤ 4 d_i and |E_t| < k, then we add p to E_t; otherwise, we discard it. Finally, we define T′ = ∪_{t ∈ T} E_t to be the output of the algorithm, and observe that T ⊆ T′.

Lemma 4.
For any 0 < ε′ ≤ 1, let k′ = (64/ε′)^D · k, and let T′ be the set of points returned by SMM-EXT(S, k, k′). Then, given an arbitrary set X ⊆ S with |X| = k, there exists an injective function p : X → T′ such that, for any x ∈ X, d(x, p(x)) ≤ (ε′/2) ρ∗_k.

Proof. Let r_{T′} = max_{p ∈ S} d(p, T′) be the range of T′, and suppose that SMM-EXT(S, k, k′) performs ℓ phases. By defining ε′′ = ε′/64 and by reasoning as in the proof of Lemma 3, we can show that r_{T′} ≤ 4 d_ℓ ≤ 16 ε′′ ρ∗_k. Consider a point x ∈ X. If x ∈ T′, then we define p(x) = x. Otherwise, suppose that x is discarded during Phase j, for some j, because either in the merge or in the update step the set E_t that was supposed to host it had already k points. Let T_i denote the set T at the end of Phase i, for any i ≥ 1. A simple inductive argument shows that at the end of each Phase i, with j ≤ i ≤ ℓ, there is a point t ∈ T_i such that |E_t| = k and d(x, t) ≤ 4 d_i. In particular, there exists a point t ∈ T_ℓ such that |E_t| = k and d(x, t) ≤ 4 d_ℓ ≤ 16 ε′′ ρ∗_k. Since E_t ⊆ T′, any point in E_t is at distance at most 4 d_ℓ ≤ 16 ε′′ ρ∗_k from t. Hence, since |X| = k, we can select a proxy p(x) for x from among the k points in E_t such that d(x, p(x)) ≤ 32 ε′′ ρ∗_k = (ε′/2) ρ∗_k and p(x) is not a proxy for any other point of X.

It is easy to see that the set T characterized in Lemma 3 satisfies the hypotheses of Lemma 1. Similarly, the set T′ of Lemma 4 satisfies the hypotheses of Lemma 2. Therefore, as a consequence of these lemmas, for metric spaces with bounded doubling dimension D, we have that SMM and SMM-EXT compute (1 + ε)-core-sets for the problems listed in Table 1, as stated by the following two theorems.

Theorem 1.
For any 0 < ε ≤ 1, let ε′ be such that (1 − ε′) = 1/(1 + ε), and let k′ = (32/ε′)^D · k. Algorithm SMM(S, k, k′) computes a (1 + ε)-core-set for the remote-edge and remote-cycle problems using O((1/ε)^D k) memory.

Theorem 2.
For any 0 < ε ≤ 1, let ε′ be such that (1 − ε′) = 1/(1 + ε), and let k′ = (64/ε′)^D · k. Algorithm SMM-EXT(S, k, k′) computes a (1 + ε)-core-set for the remote-clique, remote-star, remote-bipartition, and remote-tree problems using O((1/ε)^D k^2) memory.

Streaming Algorithm. The core-sets discussed above can be immediately applied to yield the following streaming algorithm for diversity maximization. Let S be the input stream of n points. One pass over the data is performed using SMM or SMM-EXT, depending on the problem, to compute a core-set in main memory. At the end of the pass, a sequential approximation algorithm is run on the core-set to compute the final solution. The following theorem is immediate.
Theorem 3.
Let S be a stream of n points of a metric space of doubling dimension D, and let A be a linear-space sequential approximation algorithm for any one of the problems of Table 1, returning a solution S′ ⊆ S with div_k(S) ≤ α div(S′), for some constant α ≥ 1. Then, for any 0 < ε ≤ 1, there is a 1-pass streaming algorithm for the same problem yielding an approximation factor of α + ε, with memory

• Θ((α/ε)^D k) for the remote-edge and the remote-cycle problems;
• Θ((α/ε)^D k^2) for the remote-clique, the remote-star, the remote-bipartition, and the remote-tree problems.

Recall that a MapReduce (MR) algorithm [24, 29] executes as a sequence of rounds where, in a round, a multiset X of key-value pairs is transformed into a new multiset Y of pairs by applying a given reducer function (simply called reducer) independently to each subset of pairs of X having the same key. The model features two parameters M_T and M_L, where M_T is the total memory available to the computation, and M_L is the maximum amount of memory locally available to each reducer. Typically, we seek MR algorithms that, on an input of size n, work in as few rounds as possible while keeping M_T = O(n) and M_L = O(n^δ), for some 0 ≤ δ < 1.

Consider a set S belonging to a metric space of doubling dimension D, and a partition of S into ℓ disjoint sets S_1, S_2, ..., S_ℓ. In what follows, div(·) denotes the diversity function of the problem under consideration, and O denotes an optimal solution to the problem with respect to instance S = ∪_{i=1}^{ℓ} S_i. Also, we let ρ∗_{k,i} be the optimal farness for S_i w.r.t. k, with 1 ≤ i ≤ ℓ, and let ρ∗_k be the optimal farness for S w.r.t. k. Clearly, ρ∗_{k,i} ≤ ρ∗_k, for every 1 ≤ i ≤ ℓ.

The basic idea of our MR algorithms is the following. First, each set S_i is mapped to a reducer, which computes a core-set T_i ⊆ S_i.
Then, the core-sets are aggregated into one single core-set T = ∪_{i=1}^ℓ T_i in one reducer, and a sequential approximation algorithm is run on T, yielding the final output. We are thus employing the composable core-sets framework introduced in [23]. The following lemma shows that if we run Algorithm GMM from Section 3 on each S_i, with 1 ≤ i ≤ ℓ, and then take the union of the outputs, the resulting set satisfies the hypotheses of Lemma 1.

Lemma 5.
For any 0 < ε′ ≤ 1, let k′ = (8/ε′)^D · k, and let T = ∪_{i=1}^ℓ GMM(S_i, k′). Then, given an arbitrary set X ⊆ S with |X| = k, there exists a function p : X → T such that, for any x ∈ X, d(x, p(x)) ≤ (ε′/2) ρ*_k.

Proof. Fix an arbitrary index i, with 1 ≤ i ≤ ℓ, and let T_i = {c_1, c_2, ..., c_{k′}}, where c_j denotes the point added to T_i at the j-th iteration of GMM(S_i, k′). Let also T_i(k) = {c_1, c_2, ..., c_k} and d_k = d(c_k, T_i(k) \ {c_k}). From the anticover property exhibited by GMM, which holds for any prefix of the points selected by the algorithm, we have r_{T_i(k)} ≤ d_k ≤ ρ_{T_i(k)} ≤ ρ*_k. Define ε″ = ε′/
8. Since S_i can be covered with k balls of radius at most d_k, and the space has doubling dimension D, there exist k′ balls in the space (centered at points not necessarily in S_i) of radius at most ε″ d_k that contain all the points of S_i. By choosing one arbitrary center in S_i in each such ball, we obtain a feasible solution to the k′-center problem for S_i with range at most 2ε″ d_k, which implies that the cost of the optimal solution to k′-center is at most 2ε″ d_k. As a consequence, GMM(S_i, k′) will return a 2-approximate solution T_i to k′-center, and we have r_{T_i} ≤ 4ε″ d_k ≤ 4ε″ ρ*_k. Let now T = ∪_{i=1}^ℓ T_i and r_T = max_{1 ≤ i ≤ ℓ} r_{T_i}. We have that r_T ≤ 4ε″ ρ*_k, hence, for any set X ⊆ S, the desired proxy function p(·) is obtained by mapping each x ∈ X to the closest point in T. By the observations on the range of T, we have d(x, p(x)) ≤ 4ε″ ρ*_k = (ε′/2) ρ*_k. □

Algorithm 1: GMM-EXT(S, k, k′)
  T′ ← GMM(S, k′)
  Let T′ = {c_1, c_2, ..., c_{k′}}
  T ← ∅
  for j ← 1 to k′ do
    C_j ← {p ∈ S : c_j = arg min_{c ∈ T′} d(c, p) ∧ p ∉ C_h with h < j}
    E_j ← {c_j} ∪ {arbitrary min{|C_j| − 1, k − 1} points in C_j \ {c_j}}
    T ← T ∪ E_j
  end
  return T

For the diversity problems considered in Lemma 2 (remote-clique, remote-star, remote-bipartition, and remote-tree) the proxy function is required to be injective. Therefore, we develop an extension of the GMM algorithm, dubbed
GMM-EXT (see Algorithm 1 above), which first determines a kernel T′ of k′ ≥ k points by running GMM(S, k′), and then augments T′ by computing the clustering of S whose centers are the points of T′ and picking from each cluster its center together with up to k − 1 additional points of the cluster, all of which are added to T.

As before, let S_1, S_2, ..., S_ℓ be disjoint subsets of a metric space of doubling dimension D. We have:

Lemma 6.
For any 0 < ε′ ≤ 1, let k′ = (16/ε′)^D · k, and let T = ∪_{i=1}^ℓ GMM-EXT(S_i, k, k′). Then, given an arbitrary set X ⊆ S with |X| = k, there exists an injective function p : X → T such that, for any x ∈ X, d(x, p(x)) ≤ (ε′/2) ρ*_k.

Proof. For any 1 ≤ i ≤ ℓ, let T_i = GMM-EXT(S_i, k, k′) be the result of the invocation of GMM-EXT on S_i. By defining ε″ = ε′/
16 and by reasoning as in Lemma 5, we have that the range of the set T′_i computed by the call to GMM(S_i, k′) within GMM-EXT(S_i, k, k′) is r_{T′_i} ≤ 4ε″ ρ*_k. Fix an arbitrary index i, with 1 ≤ i ≤ ℓ, and consider, for 1 ≤ j ≤ k′, the sets C_{i,j} and E_{i,j} as determined by Algorithm GMM-EXT(S_i, k, k′), and define X_{i,j} = X ∩ C_{i,j}. Since |X_{i,j}| ≤ min{k, |C_{i,j}|} = |E_{i,j}|, we can associate each point x ∈ X_{i,j} with a distinct proxy p(x) ∈ E_{i,j}. Since both x and p(x) belong to C_{i,j}, by the triangle inequality we have that d(x, p(x)) ≤ 2 r_{T′_i} ≤ 8ε″ ρ*_k = (ε′/2) ρ*_k. Since the input sets S_1, S_2, ..., S_ℓ are disjoint, all the X_{i,j} are disjoint as well. This ensures that we can find a distinct proxy for each point of X in T = ∪_{i=1}^ℓ T_i, hence the proxy function is injective. □

The two lemmas above guarantee that the set of points obtained by invoking GMM or GMM-EXT on the partitioned input complies with the hypotheses of Lemmas 1 and 2 of Section 3. Therefore, for metric spaces with bounded doubling dimension D, we have that GMM and
GMM-EXT compute (1 + ε)-composable core-sets for the problems listed in Table 1, as stated by the following two theorems.

Theorem 4.
For any 0 < ε ≤ 1, let ε′ be such that (1 − ε′) = 1/(1 + ε), and let k′ = (8/ε′)^D · k. The algorithm GMM(S, k′) computes a (1 + ε)-composable core-set for the remote-edge and remote-cycle problems.

Theorem 5. For any 0 < ε ≤ 1, let ε′ be such that (1 − ε′) = 1/(1 + ε), and let k′ = (16/ε′)^D · k. The algorithm GMM-EXT(S, k, k′) computes a (1 + ε)-composable core-set for the remote-clique, remote-star, remote-bipartition, and remote-tree problems.

MapReduce Algorithm.
The composable core-sets discussed above can be immediately applied to yield the following MR algorithm for diversity maximization. Let S be the input set of n points, and consider an arbitrary partition of S into ℓ subsets S_1, S_2, ..., S_ℓ, each of size n/ℓ. In the first round, each S_i is assigned to a distinct reducer, which computes the corresponding core-set T_i according to algorithm GMM or
GMM-EXT, depending on the problem. In the second round, the union of the ℓ core-sets T = ∪_{i=1}^ℓ T_i is concentrated within the same reducer, which runs a sequential approximation algorithm on T to compute the final solution. We have:

Theorem 6.
Let S be a set of n points of a metric space of doubling dimension D, and let A be a linear-space sequential approximation algorithm for any one of the problems of Table 1, returning a solution S′ ⊆ S with div_k(S) ≤ α div(S′), for some constant α ≥ 1. Then, for any 0 < ε ≤ 1, there is a 2-round MR algorithm for the same problem yielding an approximation factor of α + ε, with M_T = O(n) and

• M_L = Θ(√((α/ε)^D k n)) for the remote-edge and the remote-cycle problems;
• M_L = Θ(k √((α/ε)^D n)) for the remote-tree, the remote-clique, the remote-star, and the remote-bipartition problems.

Proof. Set ε′ such that 1/(1 − ε′) = 1 + ε/α, and recall that the remote-edge and the remote-cycle problems admit composable core-sets of size k′ = (8/ε′)^D k, while the problems remote-tree, remote-clique, remote-star, and remote-bipartition have core-sets of size kk′, with k′ = (16/ε′)^D k. Suppose that the above MR algorithm is run with ℓ = √(n/k′) for the former group of two problems, and with ℓ = √(n/(kk′)) for the latter group of four problems. Observe that, by the choice of ℓ, both the size of each S_i and the size of the aggregate set |T| are O(M_L), therefore the stipulated bounds on the local memory of the reducers are met.
The bound on the approximation factor of the resulting algorithm follows from the fact that Theorems 4 and 5 imply that, for all problems, div_k(S) ≤ (1 + ε/α) div_k(T), while the properties of algorithm A yield div_k(T) ≤ α div(S′); combining the two inequalities gives div_k(S) ≤ (α + ε) div(S′). □

Theorem 6 implies that, on spaces of constant doubling dimension, we can get approximations to remote-edge and remote-cycle in 2 rounds of MR which are almost as good as the best sequential approximations, with polynomially sublinear local memory M_L = O(√(kn)) for values of k up to n^{1−δ}, while for the remaining four problems we get polynomially sublinear local memory M_L = O(k√n) for values of k = O(n^{1/2−δ}), for 0 ≤ δ <
1. In fact, for these four latter problems and the same range of values for k, we can obtain substantial memory savings, either by using randomization (in two rounds) or deterministically with an extra round (as will be shown in Section 6.2). We have:

Theorem 7.
For the problems of remote-clique, remote-star, remote-bipartition, and remote-tree, we can obtain a randomized 2-round MR algorithm with the same approximation guarantees stated in Theorem 6, holding with high probability, and with

M_L = Θ(√((α/ε)^D k n log n)) for k = O((ε^D n log n)^{1/3}), and
M_L = Θ((α/ε)^D k²) for k = Ω((ε^D n log n)^{1/3}) and k = O(n^{1/2−δ}), for every δ ∈ [0, 1/2),

where α is the approximation guarantee given by the current best sequential algorithms referenced in Table 1.

Proof. We fix ε′ and k′ as in the proof of Theorem 6 and, at the beginning of the first round, we use random keys to partition the n points of S among ℓ = Θ(min{√(n/(k′ log n)), n/(kk′)}) reducers. Fix any of the four problems under consideration and let O be a given optimal solution. A simple balls-into-bins argument suffices to show that, with high probability, none of the ℓ subsets of the partition contains more than Θ(max{log n, k/ℓ}) out of the k points of O. Therefore, it is sufficient that, within each subset of the partition, GMM-EXT selects up to that many delegate points per cluster (rather than k − 1), and the stated memory bounds follow. □

Let T = ∪_{i=1}^ℓ T_i be as in the proof of Theorem 5. If |T| > M_L, we may re-apply the core-set-based strategy using T as the new input. The following theorem shows that this recursive strategy can still guarantee an approximation comparable to the sequential one, as long as the local memory M_L is not too small.

Theorem 8.
Let S be a set of n points of a metric space of doubling dimension D, and let A be a linear-space sequential approximation algorithm for any one of the problems of Table 1, returning a solution S′ ⊆ S with div_k(S) ≤ α div(S′), for some constant α ≥ 1. Then, for any 0 < ε ≤ 1 and 0 < γ ≤ 1/2, there is an O((1 − γ)/γ)-round MR algorithm for the same problem yielding an approximation factor of α + ε, with M_T = O(n) and

• M_L = Θ((α 2^{(1−γ)/γ}/ε)^D k n^γ) for the remote-edge and the remote-cycle problems;
• M_L = Θ((α 2^{(1−γ)/γ}/ε)^D k² n^γ) for the remote-clique, the remote-star, the remote-bipartition, and the remote-tree problems.

Proof. Let ε′ be such that 1/(1 − ε′) = 1 + ε/(α(2^{(1−γ)/γ} − 1)). Recall that the remote-edge and the remote-cycle problems admit composable core-sets of size k′ = (8/ε′)^D k, while the problems remote-tree, remote-clique, remote-star, and remote-bipartition have core-sets of size kk′, with k′ = (16/ε′)^D k. We may apply the following recursive strategy. We partition the input set S into n/M_L sets of size M_L and compute the corresponding core-sets. Let T be the union of these core-sets. If |T| > M_L, then we recursively apply the same strategy using T as the new input set; otherwise, we send T to a single reducer where algorithm A is applied. By the choice of the parameters, it follows that in all cases (1 − γ)/γ rounds suffice to shrink the input set to size at most M_L. The resulting approximation factor with respect to div_k(S) will then be at most

α (1 + ε/(α(2^{(1−γ)/γ} − 1)))^{(1−γ)/γ} ≤ α (1 + (2^{(1−γ)/γ} − 1) · ε/(α(2^{(1−γ)/γ} − 1))) = α + ε,

where the inequality follows from the known fact that (1 + a)^b ≤ 1 + (2^b − 1)a for every a ∈ [0, 1] and b ≥ 1, and from the observation that, by the choice of γ, we have (1 − γ)/γ ≥ 1. □

Saving memory: generalized core-sets
Consider the problems remote-clique, remote-star, remote-bipartition, and remote-tree. Our core-sets for these problems are obtained by exploiting the sufficient conditions stated in Lemma 2, which require the existence of an injective proxy function that maps the points of an optimal solution into close points of the core-set. To ensure this property, our strategy so far has been to add more points to the core-sets. More precisely, the core-set is composed of a kernel of k′ points, augmented by selecting, for each kernel point, up to k − 1 delegate points, so that, for each point o of an optimal solution O, there exists a distinct close proxy among the delegates of the kernel point closest to o, as required by Lemma 2.

In order to reduce the core-set size, the augmentation can be done implicitly, by keeping track only of the number of delegates that must be added for each kernel point. A set of pairs (p, m_p) is then returned, where p is a kernel point and m_p is the number of delegates for p (including p itself). The intuition behind this approach is the following. The set of pairs described above can be viewed as a compact representation of a multiset, where each point p of the kernel appears with multiplicity m_p. If, for a given diversity measure, we solve the natural generalization of the maximization problem on the multiset, then we can transform the obtained multiset solution into a feasible solution for S by selecting, for each multiple occurrence of a kernel point, a distinct close enough point in S. In what follows we illustrate this idea in more detail.

Let S be a set of points. A generalized core-set T for S is a set of pairs (p, m_p), with p ∈ S and m_p a positive integer referred to as the multiplicity of p, where the first components of the pairs are all distinct. We define its size s(T) to be the number of pairs it contains, and its expanded size as m(T) = Σ_{(p, m_p) ∈ T} m_p.
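As a toy illustration of this compact representation (Python; the point values and helper names are ours), the pairs can be derived from the kernel's clustering by recording min{|C_p|, k} delegates per kernel point, and the two size measures are computed directly on the pair list:

```python
# Kernel points paired with the sizes of their clusters (toy values).
k = 3
clusters = {(0.0, 0.0): 5, (5.0, 5.0): 1, (9.0, 0.0): 2}

# Implicit augmentation: store only a multiplicity per kernel point,
# i.e. the number of delegates (the point itself included), capped at k.
T = [(p, min(csize, k)) for p, csize in clusters.items()]

def size(T):
    """s(T): number of pairs actually stored."""
    return len(T)

def expanded_size(T):
    """m(T): total multiplicity, i.e. the size of the represented multiset."""
    return sum(m for _, m in T)

print(size(T), expanded_size(T))  # 3 6
```

The memory saving is exactly the point of the construction: s(T) pairs are kept instead of the m(T) explicit delegates.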
Moreover, we define the expansion of a generalized core-set T as the multiset formed by including, for each pair (p, m_p) ∈ T, m_p replicas of p. Given two generalized core-sets T_1 and T_2, we say that T_1 is a coherent subset of T_2, and write T_1 ⊑ T_2, if for every pair (p, m_p) ∈ T_1 there exists a pair (p, m′_p) ∈ T_2 with m′_p ≥ m_p. For a given diversity function div and a generalized core-set T for S, we define the generalized diversity of T, denoted by gen-div(T), to be the value of div when applied to the expansion of T, where the m_p replicas of the same point p are viewed as m_p distinct points at distance 0 from one another. We also define the generalized k-diversity of T as

gen-div_k(T) = max_{T′ ⊑ T : m(T′) = k} gen-div(T′).

Let T be a generalized core-set for a set of points S. A set I(T) ⊆ S with |I(T)| = m(T) is referred to as a δ-instantiation of T if, for each pair (p, m_p) ∈ T, it contains m_p distinct delegate points (including p), each at distance at most δ from p, with the requirement that the sets of delegates associated with any two pairs in T are disjoint. The following lemma ensures that the difference between the generalized diversity of T and the diversity of any of its δ-instantiations is bounded.

Lemma 7.
Let T be a generalized core-set for S with m(T) = k, and consider the remote-clique, remote-star, remote-bipartition, and remote-tree problems. For any δ-instantiation I(T) of T we have that

div(I(T)) ≥ gen-div(T) − f(k) · 2δ,

where f(k) = (k choose 2) for remote-clique, f(k) = k − 1 for remote-star and remote-tree, and f(k) = ⌊k/2⌋ · ⌈k/2⌉ for remote-bipartition.

Proof. Recall that gen-div(T) is defined over the expansion of T, where each pair (p, m_p) ∈ T is represented by m_p occurrences of p. We create a 1-1 correspondence between the expansion of T and I(T) by mapping each occurrence of a point p into a distinct proxy chosen among the delegates for (p, m_p) in I(T). The lemma follows by noting that both gen-div(T) and div(I(T)) are expressed in terms of sums of f(k) distances and that, by the triangle inequality, for any two points p_1, p_2 in the multiset (possibly two occurrences of the same point p) the distance of the corresponding proxies is at least d(p_1, p_2) − 2δ. □

It is important to observe that the best sequential approximation algorithms for the remote-clique, remote-star, remote-bipartition, and remote-tree problems (see Table 1), which are essentially based on either finding a maximal matching or running GMM on the input set [22, 12, 21], can be easily adapted to work on inputs with multiplicities. We have:
Fact 2.
The best existing sequential approximation algorithms for remote-clique, remote-star, remote-bipartition, and remote-tree can be adapted to obtain, from a given generalized core-set T, a coherent subset T̂ ⊑ T with expanded size m(T̂) = k and gen-div(T̂) ≥ (1/α) gen-div_k(T), where α is the same approximation ratio achieved on the original problems. The adaptation works in space O(s(T)).

Using generalized core-sets we can lower the memory requirements for the remote-tree, remote-clique, remote-star, and remote-bipartition problems to match those of the other two problems, at the expense of an extra pass on the data. We have:
Theorem 9.
For the problems of remote-clique, remote-star, remote-bipartition, and remote-tree, we can obtain a 2-pass streaming algorithm with approximation factor α + ε and memory Θ((α²/ε)^D k), for any 0 < ε < 1, where α is the approximation guarantee given by the current best sequential algorithms referenced in Table 1.

Proof. Let ε̄ be such that α + ε = α/(1 − ε̄), and observe that ε̄ = Θ(ε/α). In the first pass we determine a generalized core-set T of size k′ = (64α/ε̄)^D · k, by suitably adapting the SMM-EXT algorithm to maintain counts rather than delegates for each kernel point. Let r_T denote the maximum distance of a point of S from the closest point x such that (x, m_x) is in T. Using the argument in the proof of Lemma 3, setting ε′ = ε̄/(2α), it is easily shown that r_T ≤ (ε′/2) ρ*_k = (ε̄/(4α)) ρ*_k. Therefore, we can establish an injective map p(·) from O to the expansion of T. Let us focus on the remote-clique problem (the argument for the other three problems is virtually identical), and define ρ̄ = div(O)/(k choose 2). By reasoning as in the proof of Lemma 2, we can show that gen-div_k(T) ≥ div(O)(1 − ε̄/(2α)).

At the end of the pass, the best sequential algorithm for the problem, adapted as stated in Fact 2, is used to compute in memory a coherent subset T̂ ⊑ T with m(T̂) = k and such that gen-div(T̂) ≥ div(O)(1 − ε̄/(2α))/α. The second pass starts with T̂ in memory and computes an r_T-instantiation I(T̂) by selecting, for each pair (p, m_p) ∈ T̂, m_p distinct delegates at distance at most r_T ≤ (ε̄/(4α)) ρ̄ from p. Note that a point from the data stream could be a feasible delegate for multiple pairs. Such a point must be retained as long as the appropriate delegate count for each such pair has not been met. By applying Lemma 7 with δ = (ε̄/(4α)) ρ̄, we get div(I(T̂)) ≥ div(O)/(α + ε).
Since ε̄ = Θ(ε/α), the space required is Θ((α/ε̄)^D k) = Θ((α²/ε)^D k). □

Let div be a diversity function, k be a positive integer, and β ≥
1. A function c(S) that maps a set of points S to a generalized core-set T for S computes a β-composable generalized core-set for div if, for any collection of disjoint sets S_1, ..., S_ℓ, we have that

gen-div_k(∪_{i=1}^ℓ c(S_i)) ≥ (1/β) div_k(∪_{i=1}^ℓ S_i).

Table 3: Memory requirements of our streaming and MapReduce approximation algorithms. (For MapReduce we report only the size of M_L, since M_T is always linear in n.) The approximation factor of each algorithm is α + ε, where α is the constant approximation factor of the sequential algorithms listed in Table 1.

  r-edge, r-cycle:
    Streaming, 1 pass: Θ((1/ε)^D k)
    Streaming, 2 passes: −
    MapReduce, 2 rounds det.: Θ(√((1/ε)^D k n))
    MapReduce, 2 rounds randomized: −
    MapReduce, 3 rounds det.: −
  r-clique, r-star, r-bipartition, r-tree:
    Streaming, 1 pass: Θ((1/ε)^D k²)
    Streaming, 2 passes: Θ((1/ε)^D k)
    MapReduce, 2 rounds det.: Θ(k √((1/ε)^D n))
    MapReduce, 2 rounds randomized: max{Θ((1/ε)^D k²), Θ(√((1/ε)^D k n log n))}
    MapReduce, 3 rounds det.: Θ(√((1/ε)^D k n))

Consider a simple variant of GMM-EXT, which we refer to as
GMM-GEN, which on input S, k, and k′ returns a generalized core-set T for S of size s(T) = k′ and expanded size m(T) ≤ kk′, as follows: for each point c_i of the kernel set T′ = GMM(S, k′), algorithm
GMM-GEN returns a pair (c_i, m_{c_i}), where m_{c_i} is equal to the size of the set E_i computed in the i-th iteration of the for loop of GMM-EXT.

Lemma 8.
For any ε′ > 0, define k′ = (16α/ε′)^D k. Algorithm GMM-GEN computes a β-composable generalized core-set for the remote-clique, remote-star, remote-bipartition, and remote-tree problems, with 1/β = 1 − ε′/(2α).

Proof. Given a collection of disjoint sets S_1, ..., S_ℓ, let T_i = GMM-GEN(S_i, k, k′) and T = ∪_{i=1}^ℓ T_i, and consider the expansion of T. Let us focus on the remote-clique problem (the argument for the other three problems is virtually identical) and define ρ̄ = div(O)/(k choose 2). By reasoning along the lines of the proof of Theorem 9, we can establish an injective map p from O to the expansion of T such that, for any o ∈ O, d(o, p(o)) ≤ (ε′/(4α)) ρ̄. Let T̂ be the generalized core-set whose expansion into a multiset yields the k points of the image of p. We have

gen-div_k(T) ≥ gen-div(T̂) ≥ div(O) (1 − ε′/(2α)). □

We are now able to show that
GMM-GEN computes a high-quality β-composable generalized core-set, which can then be employed in a 3-round MR algorithm to approximate the solution to the four problems under consideration with lower memory requirements. This result is summarized in the following theorem.

Theorem 10.
For the problems of remote-clique, remote-star, remote-bipartition, and remote-tree, we can obtain a 3-round MR algorithm with approximation factor α + ε and M_L = Θ(√((α²/ε)^D k n)), for any 0 < ε < 1, where α is the approximation guarantee given by the current best sequential algorithms referenced in Table 1.

Proof. Consider the remote-clique problem (the argument for the other three problems is virtually identical) and define ρ̄ = div(O)/(k choose 2). Let ε′ be such that α + ε = α/(1 − ε′), and observe that ε′ = Θ(ε/α). Also, set k′ = (16α/ε′)^D · k. For ℓ = √(n/k′), consider an arbitrary partition of the input set S into ℓ subsets S_1, S_2, ..., S_ℓ, each of size M_L = n/ℓ = √(nk′). In the first round, each reducer applies GMM-GEN to a distinct subset S_i to compute a generalized core-set of size k′. In the second round, these generalized core-sets are aggregated into a single generalized core-set T, whose size is ℓk′ = √(nk′) = M_L, and such that the maximum distance of a point of S from the closest point x with (x, m_x) ∈ T is r_T ≤ (ε′/(4α)) ρ̄. Then, one reducer applies to T the best sequential algorithm for the problem, adapted as stated in Fact 2, to compute a coherent subset T̂ ⊑ T with m(T̂) = k and such that

gen-div(T̂) ≥ (1/α) gen-div_k(T) ≥ (1 − ε′/(2α)) (1/α) div(O),

where the last inequality follows from Lemma 8. In the third round, T̂ is distributed to the ℓ reducers, which are able to compute an instantiation I(T̂) of T̂ as follows. For each pair (p, m_p) ∈ T̂ such that p ∈ S_i, the i-th reducer selects m_p distinct delegates from S_i at distance at most r_T ≤ (ε′/(4α)) ρ̄ from p.
By Lemma 7, we have that

div(I(T̂)) ≥ (1 − ε′/(2α)) (1/α) div(O) − (ε′/(2α)) div(O) = (1/α) (1 − ε′/(2α) − ε′/2) div(O) ≥ (1/α) (1 − ε′) div(O) = (1/(α + ε)) div(O).

As for the memory bound, we have that M_L = √(nk′) = Θ(√((α²/ε)^D k n)). □

A synopsis of the main theoretical results presented in the paper is given in Table 3.

Experiments

We ran extensive experiments on a cluster of 16 machines, each equipped with 18GB of RAM and an Intel I7 processor. To the best of our knowledge, ours is the first work on diversity maximization in the MapReduce and Streaming settings which complements theoretical findings with an experimental evaluation. The MapReduce algorithm has been implemented within the Spark framework, whereas the streaming algorithm has been implemented in Scala, simulating a streaming setting. Since optimal solutions are out of reach for the input sizes that we considered, for each dataset we computed approximation ratios with respect to the best solution found by many runs of our MapReduce algorithm with maximum parallelism and large local memory.

We ran our experiments on both synthetic and real-world datasets. Synthetic datasets are generated randomly from the three-dimensional Euclidean space in the following way: for a given k, k points are randomly picked on the surface of the unit-radius sphere centered at the origin of the space, so as to ensure the existence of a set of far-away points, and the other points are chosen uniformly at random in the concentric sphere of radius 0.8. Among all the distributions used to test our algorithms, on which we do not report for brevity, we found that this is the most challenging, hence the most interesting to demonstrate. To test our algorithm on real-world workloads we used the musiXmatch dataset [6].
This dataset contains the lyrics of 237,662 songs, each represented by the vector of word counts of the most frequent 5,000 words across the entire dataset; the dimensionality of the space of these vectors is therefore 5,000. We filter out songs represented by fewer than 10 frequent words, obtaining a dataset of 234,363 songs. The reason for this filtering is that one can build an optimal solution using songs with short, non-overlapping word lists; removing these songs thus makes the dataset more challenging for our algorithm. On this dataset, as the distance between two vectors u and v we use the cosine distance, defined as dist(u, v) = arccos(u · v / (‖u‖ ‖v‖)). This distance is closely related to the cosine similarity commonly used in Information Retrieval [26]. For brevity, we will report the results only for the remote-edge problem; we observed similar behaviors for the other diversity measures, which are all implemented in our software. All results reported in this section are obtained as averages over at least 10 runs. The code is available as free software at https://github.com/Cecca/diversity-maximization
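The cosine distance defined above can be computed as follows (a straightforward Python rendition; the function name is an illustrative choice of ours, and vectors are plain lists of word counts):

```python
import math

def cosine_distance(u, v):
    """dist(u, v) = arccos(u.v / (||u|| ||v||)): the angle between u and v.
    Unlike plain cosine similarity, the angle is a metric on vector
    directions, so it can serve as the distance of a metric space."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(c)
```

For example, two orthogonal count vectors (songs with disjoint word lists) are at distance π/2, the maximum possible for non-negative vectors.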
Figure 1: Approximation ratio of the streaming algorithm for different values of k and k′ on the musiXmatch dataset.

Figure 2: Approximation ratio of the streaming algorithm for different values of k and k′ on a synthetic dataset of 100 million points.

The first set of experiments investigates the behavior of the streaming algorithm for various values of k, as well as the impact of the core-set size, as controlled by the parameter k′, on the approximation quality. The results of these experiments are reported in Figure 1, for the musiXmatch dataset, and in Figure 2, for a synthetic dataset of 100 million points, generated as explained above. First, we observe that as k increases the remote-edge measure becomes harder to approximate: finding a higher number of diverse elements is more difficult. On the real-world dataset, because of the high dimensionality of its space, we test the influence of k′ on the approximation with a geometric progression of k′ (Figure 1). On the synthetic dataset instead (Figure 2), since ℝ³ has a smaller doubling dimension, the effect of k′ is evident already with small values, therefore we use a linear progression. As expected, increasing k′ improves the accuracy of the algorithm on both datasets. Observe that, although the theory suggests that good approximations require rather large values of k′ = Ω(k/ε^D), in practice our experiments show that relatively small values of k′, not much larger than k, already yield very good approximations, even for the real-world dataset, whose doubling dimension is unknown.

In Figure 3, we consider the performance of the kernel of the streaming algorithm; that is, we concentrate on the time taken by the algorithm to process each point, ignoring the cost of streaming data from memory.
The rationale is that data may be streamed from sources with very different throughput: our goal is to show the maximum rate that can be sustained by our algorithm, independently of the source of the stream. We report results for the same combinations of parameters shown in Figure 1. As expected, the throughput is inversely proportional to both k and k′, with values ranging from 3,078 to 544,920 points/s. The throughput supported by our algorithm makes it amenable to be used in streaming pipelines: for instance, in 2013 Twitter averaged 5,700 tweets/s and peaked at 143,199 tweets/s.¹ In this scenario, it is likely that the bottleneck of the pipeline would be the data acquisition rather than our core-set construction. As for the synthetic dataset, the throughput of the algorithm exhibits a behavior with respect to k and k′ similar to the one reported in Figure 3, but with higher values, ranging from 78,260 to 850,615 points/s, since the distance function is cheaper to compute.

¹ https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
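The kind of throughput measurement described above can be sketched as follows (an illustrative Python harness of ours; `process_point` stands for the streaming kernel's per-point update, and the stream is pre-loaded so that only processing time is measured):

```python
import time

def measure_throughput(stream, process_point):
    """Feed pre-loaded points to the kernel and report points/s,
    excluding the cost of acquiring the stream itself."""
    points = list(stream)          # load first, so I/O is not timed
    t0 = time.perf_counter()
    for p in points:
        process_point(p)
    elapsed = time.perf_counter() - t0
    return len(points) / elapsed
```

Measuring the kernel in isolation gives the maximum sustainable rate, which can then be compared against the arrival rate of the intended source.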
Figure 3: Throughput of the kernel of the streaming algorithm on the musiXmatch dataset.

Figure 4: Approximation ratios for the MR algorithm for different values of k and k′ on a synthetic dataset of 100 million points.

We demonstrate our MapReduce algorithm on the same datasets used in the previous section. For this set of experiments we fixed k = 128 and varied two parameters: the size of the core-sets, as controlled by k′, and the parallelism (i.e., the number of reducers). Because the solution returned by the MapReduce algorithm for k′ = k turns out to be already very good, we use a geometric progression for k′ to highlight the dependency of the approximation factor on k′. The results are reported in Figure 4. For a fixed level of parallelism, we observe that the approximation ratio decreases as k′ increases, in accordance with the theory. Moreover, we observe that the approximation ratios are in general better than the ones attained by the streaming algorithm, plausibly because in MapReduce we use a 2-approximation k′-center algorithm to build the core-sets, while in Streaming only a weaker 8-approximation k′-center algorithm is available.

Figure 4 also reveals that if we fix k′ and increase the level of parallelism, the approximation ratio tends to decrease. Indeed, the final core-set obtained by aggregating the ones produced by the individual reducers grows larger as the parallelism increases, thus containing more information on the input set. Instead, if we fix the product of k′ and the level of parallelism, hence the size of the aggregate core-set, we observe that increasing the parallelism is mildly detrimental to the approximation quality.
This is to be expected, since with a fixed space budget in the second round, in the first round each reducer is forced to build a smaller and less accurate core-set as the parallelism increases.
The experiments for the real-world musiXmatch dataset (figures omitted for brevity) highlight that the GMM k′-center algorithm returns very good core-sets on this high-dimensional dataset, yielding approximation ratios very close to 1 even for low values of k′. As remarked above, the more pronounced dependence on k′ in the streaming case may be the result of the weaker approximation guarantees of its core-set construction.
Since in real scenarios the input might not be distributed randomly among the reducers, we also experimented with an "adversarial" partitioning of the input: each reducer was given points coming from a region of small volume, so as to obfuscate a global view of the pointset. With such adversarial partitioning, the approximation ratios worsen by up to 10%. On the other hand, as k′ increases, the time required by a random shuffle of the points among the reducers becomes negligible with respect to the overall running time. Thus, randomly shuffling the points at the beginning may prove cost-effective if larger values of k′ are affordable.
Table 4: Approximation and running time (s) of CPPU and AFZ for various values of k.
Figure 5: Scalability of our algorithms for different numbers of points and processors. The running time for one processor is obtained with the streaming algorithm.
In Table 4, we compare our MapReduce algorithm (dubbed CPPU) against its state-of-the-art competitor presented in [4] (dubbed AFZ). Since no code was available for AFZ, we implemented it in MapReduce with the same optimizations used for CPPU. We remark that AFZ employs different core-set constructions for the various diversity measures, whereas our algorithm uses the same construction for all diversity measures. In particular, for remote-edge, AFZ is equivalent to CPPU with k′ = k, hence the comparison is less interesting and can be derived from the behavior of CPPU itself. Instead, for remote-clique, the core-set construction used by AFZ is based on local search and may exhibit highly superlinear complexity. For remote-clique, we performed the comparison with various values of k, on datasets of 4 million points in 2-dimensional Euclidean space, using 16 reducers (AFZ was prohibitively slow for higher dimensions and bigger datasets). The datasets were generated as described in the introduction to the experimental section. Also, we ran CPPU with k′ = 128 in all cases, so as to ensure a good approximation ratio at the expense of a slight increase of the running time. As Table 4 shows, CPPU is in all cases at least three orders of magnitude faster than AFZ, while achieving better quality at the same time.
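The composable core-set scheme used by CPPU can be summarized by the following sketch (function names are ours). It uses the GMM farthest-first traversal [18] as the core-set primitive and, for concreteness, runs the same traversal as the sequential algorithm in the second round, which is a 2-approximation for remote-edge; it is a simplified model of the two rounds, not our actual MapReduce implementation:

```python
def gmm(points, m, dist):
    """GMM / farthest-first traversal [18]: greedily add the point
    farthest from the current centers; O(m * len(points)) distance
    computations overall."""
    centers = [points[0]]
    # d[i] = distance from points[i] to its closest center so far
    d = [dist(p, points[0]) for p in points]
    while len(centers) < m:
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])
        d = [min(dj, dist(pj, points[i])) for dj, pj in zip(d, points)]
    return centers

def two_round_diversity(partitions, k, kprime, dist):
    # Round 1: each reducer builds a kprime-point core-set of its
    # partition with GMM, independently and in parallel.
    local_coresets = [gmm(part, kprime, dist) for part in partitions]
    # Shuffle: the union of the core-sets is sent to a single reducer.
    union = [p for coreset in local_coresets for p in coreset]
    # Round 2: a sequential diversity algorithm is run for k points
    # on the union (here GMM again, for the remote-edge measure).
    return gmm(union, k, dist)
```

Note that the same round-1 construction serves all diversity measures; only the sequential algorithm in round 2 changes with the objective.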
We report on the scalability of our MR algorithm on datasets drawn from R, ranging from 100 million points (the same dataset used in Subsections 7.1 and 7.2) up to 1.6 billion points. We fixed the size s of the memory required by the final reducer and varied the number of processors used. On a single machine, instead of running MapReduce, which makes little sense, we run the streaming algorithm with k′ = 2048, so as to have a final core-set of the same size as the ones found in the MapReduce runs. For a given number of processors p and number of points n, we run the corresponding experiment only if n/p points fit into the main memory of a single processor. As shown in Figure 5, for a fixed dataset size, our MapReduce algorithm exhibits super-linear scalability: doubling the number of processors results in a 4-fold gain in running time (at the expense of a mild worsening of the approximation ratio, as pointed out in Subsection 7.2). The reason is that each reducer performs O(ns/(kp²)) work to build its core-set, where p is the number of reducers, since the core-set construction involves s/(kp) iterations, with each iteration requiring the scan of n/p points.
For the dataset with 100 million points, the MR algorithm outperforms the streaming algorithm in every processor configuration. It must be remarked that the running time reported in Figure 5 for the streaming algorithm takes into account also the time needed to stream data from main memory (unlike the throughput reported in Figure 3). This is to ensure a fair comparison with MapReduce, where we also take into account the time needed to shuffle data between the first and the second round, and the setup time of the rounds.
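The super-linear scaling can be sanity-checked with a back-of-the-envelope model of the round-1 cost per reducer (the parameter values below are illustrative, not the ones of our experiments):

```python
def per_reducer_work(n, s, k, p):
    """Model of round-1 work per reducer: s/(k*p) core-set
    iterations, each scanning the n/p points assigned to the
    reducer, i.e. n*s/(k*p**2) distance computations in total."""
    iterations = s / (k * p)
    points_per_iteration = n / p
    return iterations * points_per_iteration

# Doubling the number of processors quarters the per-reducer work,
# consistent with the observed 4-fold gain in running time.
w_8 = per_reducer_work(n=10**8, s=2**15, k=128, p=8)
w_16 = per_reducer_work(n=10**8, s=2**15, k=128, p=16)
print(w_8 / w_16)  # 4.0
```

The quadratic dependence on p arises because increasing the parallelism both shrinks each reducer's share of the input and reduces the number of core-set iterations it must perform.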
Also, we note that the streaming algorithm appears to be faster than what the MR algorithm would be if executed on a single processor, probably because the former is more cache-friendly.
If we fix the number of processors, we observe that our algorithm exhibits linear scalability in the number of points. Finally, in a set of experiments, omitted for brevity, we verified that for a fixed number of processors the time increases linearly with k′. Both these behaviors are in accordance with the theory.
Acknowledgments
Part of this work was done while the authors were visiting the Department of Computer Science at Brown University. This work was supported, in part, by MIUR of Italy under project AMANDA, and by the University of Padova under project CPDA152255/15 "Resource-Tradeoffs Based Design of Hardware and Software for Emerging Computing Platforms". The work of Eli Upfal was supported in part by NSF grant IIS-1247581 and NIH grant R01-CA180776.
References
[1] Z. Abbassi, V. S. Mirrokni, and M. Thakur. Diversity maximization under matroid constraints. In Proc. ACM KDD, pages 32–40, 2013.
[2] M. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and nonmetric distance measures. ACM Trans. on Algorithms, 6(4):59, 2010.
[3] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
[4] S. Aghamolaei, M. Farhadi, and H. Zarrabi-Zadeh. Diversity maximization via composable coresets. In Proc. CCCG, pages 38–48, 2015.
[5] A. Angel and N. Koudas. Efficient diversity-aware search. In Proc. SIGMOD, pages 781–792, 2011.
[6] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proc. ISMIR, 2011.
[7] S. Bhattacharya, S. Gollapudi, and K. Munagala. Consideration set generation in commerce search. In Proc. WWW, pages 317–326, 2011.
[8] M. Ceccarello, A. Pietracaprina, G. Pucci, and E. Upfal. Space and time efficient parallel graph decomposition, clustering, and diameter approximation. In Proc. ACM SPAA, pages 182–191, 2015.
[9] M. Ceccarello, A. Pietracaprina, G. Pucci, and E. Upfal. A practical parallel algorithm for diameter approximation of massive weighted graphs. In Proc. IEEE IPDPS, 2016.
[10] M. Ceccarello, A. Pietracaprina, G. Pucci, and E. Upfal. MapReduce and streaming algorithms for diversity maximization in metric spaces of bounded doubling dimension. PVLDB, 10(5):469–480, 2017.
[11] A. Cevallos, F. Eisenbrand, and R. Zenklusen. Max-sum diversity via convex programming. In Proc. SoCG, volume 51, page 26, 2016.
[12] B. Chandra and M. Halldórsson. Approximation algorithms for dispersion problems. J. of Algorithms, 38(2):438–465, 2001.
[13] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. SIAM J. on Computing, 33(6):1417–1440, 2004.
[14] Z. Chen and T. Li. Addressing diverse user preferences in SQL-query-result navigation. In Proc. SIGMOD, pages 641–652, 2007.
[15] R. Cole and L. Gottlieb. Searching dynamic point sets in spaces with bounded doubling dimension. In Proc. ACM STOC, pages 574–583, 2006.
[16] S. Fekete and H. Meijer. Maximum dispersion and geometric maximum weight cliques. Algorithmica, 38(3):501–511, 2004.
[17] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In Proc. WWW, pages 381–390, 2009.
[18] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
[19] L. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient classification for metric data. IEEE Trans. on Information Theory, 60(9):5750–5759, 2014.
[20] A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In Proc. IEEE FOCS, pages 534–543, 2003.
[21] M. Halldórsson, K. Iwano, N. Katoh, and T. Tokuyama. Finding subsets maximizing minimum structures. SIAM Journal on Discrete Mathematics, 12(3):342–359, 1999.
[22] R. Hassin, S. Rubinstein, and A. Tamir. Approximation algorithms for maximum dispersion. Operations Research Letters, 21(3):133–137, 1997.
[23] P. Indyk, S. Mahabadi, M. Mahdian, and V. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proc. ACM PODS, pages 100–108, 2014.
[24] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In Proc. ACM-SIAM SODA, pages 938–948, 2010.
[25] G. Konjevod, A. Richa, and D. Xia. Dynamic routing and location services in metrics of low doubling dimension. In Distributed Computing, pages 379–393. Springer, 2008.
[26] J. Leskovec, A. Rajaraman, and J. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014.
[27] M. Masin and Y. Bukchin. Diversity maximization approach for multiobjective optimization. Operations Research, 56(2):411–424, 2008.
[28] S. Munson, D. Zhou, and P. Resnick. Sidelines: An algorithm for increasing diversity in news and opinion aggregators. In Proc. ICWSM, 2009.
[29] A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for MapReduce computations. In Proc. ACM ICS, pages 235–244, 2012.
[30] P. Raghavan and M. Henzinger. Computing on data streams. In Proc. DIMACS Workshop External Memory and Visualization, volume 50, page 107, 1999.
[31] D. Rosenkrantz, S. Ravi, and G. Tayi. Approximation algorithms for facility dispersion. In Handbook of Approximation Algorithms and Metaheuristics.
[32] A. Tamir. Obnoxious facility location on graphs. SIAM J. on Discrete Mathematics, 4(4):550–567, 1991.
[33] Y. Wu. Active learning based on diversity maximization. Applied Mechanics and Materials, 347(10):2548–2552, 2013.
[34] Y. Yang, Z. Ma, F. Nie, X. Chang, and A. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. Int. J. of Computer Vision, 113(2):113–127, 2015.
[35] C. Yu, L. Lakshmanan, and S. Amer-Yahia. Recommendation diversification using explanations. In Proc. IEEE ICDE, 2009.