Distributed Partial Clustering
Sudipto Guha
University of Pennsylvania, Philadelphia, PA 19104, United States
[email protected]
Yi Li
Nanyang Technological University
[email protected]
Qin Zhang
Indiana University Bloomington, Bloomington, IN 47401, United States
[email protected]
ABSTRACT
Recent years have witnessed an increasing popularity of algorithm design for distributed data, largely due to the fact that massive datasets are often collected and stored in different locations. In the distributed setting, communication typically dominates the query processing time, so it becomes crucial to design communication-efficient algorithms for queries on distributed data. Simultaneously, it has been widely recognized that partial optimizations, where we are allowed to disregard a small part of the data, provide significantly better solutions. The motivation for disregarding points often arises from noise and other phenomena that are pervasive in large data scenarios.

In this paper we focus on partial clustering problems, k-center, k-median and k-means, in the distributed model, and provide algorithms with communication sublinear in the input size. As a consequence we develop the first algorithms for the partial k-median and k-means objectives that run in subquadratic time. We also initiate the study of distributed algorithms for clustering uncertain data, where each data point may fall into one of multiple locations under a certain probability distribution.

1 INTRODUCTION

The challenge of optimization over large quantities of data has brought communication-efficient distributed algorithms to the fore. From the perspective of optimization, it has also become clear that partial optimizations, where we are allowed to disregard a small part of the input, enable us to provide significantly better solutions compared with those which are forced to account for the whole input. While several algorithms for distributed clustering have been proposed, partial optimizations for clustering problems, introduced by Charikar et al. [4], have not received as much attention.
While the results of Chen [6] improve the approximation ratios, the running times of the k-median and k-means versions have not been improved, and the (at least) quadratic running times have remained a barrier.

Sudipto Guha was supported in part by NSF award 1546151. Qin Zhang was supported in part by NSF CCF-1525024 and IIS-1633215.

In this paper we study partial clustering under the standard (k, t)-median/means/center objective functions, where k is the number of centers we can use and t is the maximum number of points
we can ignore. In the distributed setting, let s denote the number of sites. The (k, t)-center problem has recently been studied by Malkomes et al. [19], who gave a 2-round O(1)-approximation algorithm with ˜O(sk + st) bits of communication, assuming that each point can be encoded in ˜O(1) bits. In fact, we observe that results from streaming algorithms [14] can provide 1-round O(1)-approximation algorithms with ˜O(sk + st) bits of communication for (k, t)-center, (k, t)-median, and (k, t)-means. However, in many scenarios of interest we have n > t ≫ k and t ≫ s, so the st term imposes a significant communication burden. In this paper we reduce the ˜O(st) term to ˜O(t) for the (k, t)-center problem, as well as for the (k, t)-median and (k, t)-means problems, and unify their treatment. We also provide the first subquadratic algorithms for the median and means versions of this problem.

Large data sets often have erroneous values. Stochastic optimization has recently attracted a lot of attention in the field of databases, and has developed into a subfield called 'uncertain/probabilistic databases' (see, e.g., [20]). For the clustering problem, a method of choice is to first model the underlying uncertainty and then cluster the uncertain data. Clustering under uncertainty has been studied in centralized models [8, 15], but the algorithms proposed therein do not consider communication costs. Note that communicating a distribution (for an uncertain point) typically requires significantly more communication than communicating a deterministic point, and thus black-box adaptations of centralized algorithms do not work well in the distributed setting. In this paper we propose communication-efficient distributed algorithms for handling both data uncertainty and partial clustering.

SPAA '17, Washington DC, USA. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. DOI: 10.1145/3087556.3087568
To the best of our knowledge, neither distributed clustering of uncertain data nor partial clustering of uncertain data has been studied before. We note that both problems are fairly natural, and likely to become increasingly useful as distributed cloud computing becomes commonplace.

Models and Problems.
We study the clustering problems in the coordinator model, in which there are s sites and one central coordinator, connected by a star communication network with the coordinator at the center. Direct communication between sites can, however, be simulated by routing via the coordinator, which at most doubles the communication. The computation proceeds in rounds. In each round, the coordinator sends a message (possibly empty) to each site, and every site sends a message (possibly empty) back to the coordinator. The coordinator outputs the answer at the end. The input A is partitioned into (A_1, ..., A_s) among the s sites. Let n_i = |A_i|, and let n = |A| = ∑_{i∈[s]} n_i be the total input size. (We hide polylog(n) factors in the ˜O notation.)

We consider clustering over a graph with n nodes and an oracle distance function d(·,·). An easy example is a set of points in Euclidean space. More complicated examples correspond to documents and images represented in a feature space, with the distance function computed via a kernel. We now give the definitions of (k, t)-center/median/means.

Definition 1.1 ((k, t)-center/median/means). Let A be a set of n points and let k, t be integer parameters (1 ≤ k ≤ n, 0 ≤ t ≤ n). In the (k, t)-median problem we want to compute

min_{K, O ⊆ A} ∑_{p ∈ A\O} d(p, K) subject to |K| ≤ k and |O| ≤ t,

where d(p, K) = min_{x∈K} d(p, x). We typically call K the centers and O the outliers. In the (k, t)-means and (k, t)-center problems we replace the objective function ∑_{p∈A\O} d(p, K) with ∑_{p∈A\O} d(p, K)² and max_{p∈A\O} d(p, K), respectively.

In the definition above, we assume that the centers are chosen from the input points. In Euclidean space, this restriction affects the approximation ratio by a factor of at most 2.

For uncertain data, we follow the assigned clustering introduced in [8]. Let P be a finite set of points in a metric space.
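As a concrete illustration of Definition 1.1, for a fixed candidate center set K the three objectives can be evaluated by dropping the t largest point-to-center distances. A minimal sketch (the helper name and toy data are ours, not the paper's):

```python
# Sketch: evaluating the (k, t)-median/means/center objectives of
# Definition 1.1 for a *given* center set, by ignoring the t points
# that are farthest from their nearest centers.

def partial_costs(points, centers, t, d):
    """Return (median, means, center) costs after ignoring t outliers."""
    # distance of each point to its nearest center, smallest first
    dists = sorted(min(d(p, c) for c in centers) for p in points)
    kept = dists[:len(dists) - t] if t > 0 else dists
    median_cost = sum(kept)                    # sum of distances
    means_cost = sum(x * x for x in kept)      # sum of squared distances
    center_cost = max(kept) if kept else 0.0   # largest kept distance
    return median_cost, means_cost, center_cost

# toy example on the line with d(x, y) = |x - y|; 100.0 acts as noise
pts = [0.0, 1.0, 2.0, 100.0]
print(partial_costs(pts, [1.0], 1, lambda x, y: abs(x - y)))  # → (2.0, 2.0, 1.0)
```

With t = 1 the far-away point 100.0 is ignored, which is exactly the benefit of the partial objectives discussed above.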
There are n input nodes A, where node j follows an independent distribution D_j over P. Each site i knows the distributions D_j associated with the nodes j ∈ A_i.

Definition 1.2 (Clustering Uncertain Data).
In clustering with uncertainty, the output is a subset K ⊆ P of size k (the centers), a subset O ⊆ A of size at most t (the ignored nodes), as well as a mapping π : A → K. In every realization σ : A → P of the values of the input nodes, node j ∈ A (now realized as σ(j) ∈ P) is assigned to the same center π(j) ∈ K. In uncertain (k, t)-median, the goal is to minimize the expected cost

E_{σ ∼ ∏_{j∈A} D_j} [ ∑_{j∈A\O} d(σ(j), π(j)) ] = ∑_{j∈A\O} E_{σ∼D_j} [d(σ(j), π(j))].   (1)

The definition of uncertain (k, t)-means is basically the same as that of uncertain (k, t)-median, except that we replace the objective function (1) with ∑_{j∈A\O} E_{σ∼D_j} [d(σ(j), π(j))²]. For uncertain (k, t)-center, we have two objectives:

max_{j∈A\O} E_{σ∼D_j} [d(σ(j), π(j))]   (2)

E_{σ∼∏_j D_j} [ max_{j∈A\O} d(σ(j), π(j)) ]   (3)

Note that these two objectives are not equivalent, since E and max do not commute in Equation (3) and we cannot equate it to (2). Equation (2) is in the same spirit as Equation (1), and corresponds to a per-point measurement; we term this problem uncertain (k, t)-center-pp. Equation (3) corresponds to a more global measurement; we term this problem uncertain (k, t)-center-g. This version was considered in [8, 15].

Our Results.
We present our main results in Table 1, restricting attention to 2-round algorithms; the full set of results can be found in Appendix A. We use T to denote the runtime to compute the 1-median/means of a node distribution, B the information needed to encode a point, and I the information needed to encode a node in the uncertain data case. In the Local Time column, the first entry is the local computation time at each site, and the second is the local computation time at the coordinator. Observe that the total running time is ˜O(∑_i n_i²), which becomes ˜O(n²/s) if the partitions are balanced. This shows that we can reduce the running time by distributing the clustering across many sites.

In particular, we have obtained the following. All algorithms finish in 2 rounds in the coordinator model. We say a solution is an (α, β)-approximation if it is a solution of cost at most αC while excluding at most βt points, where C is the optimum cost for excluding t points.

(1) We give (O(1), 1)-approximation algorithms with ˜O((sk + t)B) communication for the (k, t)-median (Section 3) and the (k, t)-center (Theorem 4.3) problems; the lower bounds in [5] for the t = 0 case suggest that this communication cost is close to optimal. We also give an (O(1 + 1/ϵ), 1 + ϵ)-approximation algorithm with ˜O((sk + t)B) communication for the (k, t)-median (with better running time) and the (k, t)-means (Theorem 3.6) problems.

(2) We show that for (k, t)-median/means and (k, t)-center-pp the above results are achievable even on uncertain data (Theorem 5.6). For uncertain (k, t)-center-g we obtain an (O(1 + 1/ϵ), 1 + ϵ)-approximation algorithm with ˜O(skB + tI + s log ∆) communication, where I is the information needed to encode the distribution of an uncertain point, and ∆ is the ratio between the maximum and minimum pairwise distances in the dataset (Theorem 5.14).

Our results for the (k, t)-center problem improve those in [19].
As far as we are aware, our results on distributed (k, t)-median/means and on uncertain input are the first of their kind. Our results for distributed (k, t)-median/means also lead to subquadratic-time constant-factor approximation centralized algorithms, which have been an open problem for many years.

Technical Overview.
The high-level idea of our algorithms is fairly natural: each site first performs a preclustering, i.e., it computes some local solution on its own dataset. Then each site sends the centers of the local solution, the number of points attached to each center, and the ignored points to the coordinator, who then solves the induced weighted clustering problem.

A major difficulty is to determine how many points to ignore in the local solution at each site. Certainly, to be safe, each site can ignore t points and send all t ignored points to the coordinator for a final decision. This would however incur Θ(st) bits of communication. To reduce the communication of this part to O(t), we hope to find {t_1, ..., t_s} such that ∑_i t_i = t and each site i sends a solution with just t_i ignored points. At the cost of an extra round of communication, we solve the minimization problem ∑_i f_i(t_i) subject to ∑_i t_i = t for convex functions {f_i}. It is tempting to take f_i(t_i) to be the cost of the local solution with t_i ignored points on site i; however, such an f_i is not necessarily convex. The remedy is to take the lower convex hull of f_i instead, which can be shown to have only a mild effect on the solution cost. The convex hull of t points can be found in O(t log t) time, and we can further reduce the runtime, without compromising the approximation ratio, by computing local solutions on each site for only log t geometrically increasing values of t_i.

Table 1: Results based on 2-round algorithms. T denotes the runtime to compute the 1-median/means of a node distribution, B the information encoding a point, and I the information encoding a node in the uncertain data case. ∆ is the ratio between the maximum pairwise distance and the minimum pairwise distance in the dataset.

Objective | Approx. | Centers | Ignored | Rounds | Total Comm. | Local Time
median | O(1) | k | t | 2 | ˜O((sk+t)B) | ˜O(n_i²), ˜O(kt(sk+t))
median | O(1+1/ϵ) | k | (1+ϵ)t | 2 | ˜O((sk+t)B) | ˜O(n_i²), ˜O((sk+t)²)
means | O(1+1/ϵ) | k | (1+ϵ)t | 2 | ˜O((sk+t)B) | ˜O(n_i²), ˜O((sk+t)²)
center | O(1) | k | t | 2 | ˜O((sk+t)B) | ˜O((k+t)n_i), ˜O((sk+t)²)
uncertain median/means/center-pp | as in the regular case above | | | | | + O(n_i T), unchanged
center-g | O(1+1/ϵ) | k | (1+ϵ)t | 2 | ˜O(skB + tI + s log ∆) | ˜O(n_i log ∆), ˜O((sk+t)²)

Figure 1: An example of a compressed graph.

For uncertain data, it is natural to reduce the clustering problems to the deterministic case. To this end, we 'collapse' each node j to its optimal center in P. For instance, for the (k, t)-median problem, each node j is 'collapsed' to y_j = arg min_{y∈P} E_σ[d(σ(j), y)], called the 1-median of node j. It may be tempting to consider the clustering problem on the set of 1-medians, but the 'collapse' cost is then lost; hence we construct a compressed graph G that allows us to keep track of the collapse costs. The graph looks like a clique with tentacles; see Figure 1. The 1-medians form a clique in G with edge weights being the distances in the underlying metric space; for each 1-median y_j, we add a tentacle (an edge) from y_j to a new vertex p_j with edge weight equal to the collapse cost E_σ[d(σ(j), y_j)]. We manage to show that the original clustering problem is equivalent, up to a constant factor in cost, to the clustering problem on the compressed graph where the facility vertices are the 1-medians {y_j} and the demand vertices are {p_j}. Our previous framework for deterministic data is then applied to the compressed graph.

Lastly, for the global center problem with uncertain data, we build upon the approach developed in [15], which uses a truncated distance function L_τ(x, y) = max{d(x, y) − τ, 0} instead of the usual metric distance d(·,·). Our algorithm performs a parametric search on τ, and applies our previous framework to solve the global problem using local solutions.
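The lower-convex-hull step in the overview above can be made concrete as follows; `lower_hull` and `make_f` are our own names for a sketch of the site-side computation, not the paper's code:

```python
# Sketch: lower convex hull of {(q, cost with q ignored points)} and
# the induced convex function f(q) used to allocate outlier budgets.

def lower_hull(pts):
    """Lower convex hull of 2D points, by Andrew's monotone chain."""
    pts = sorted(pts)
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop the middle point if it is not strictly below segment hull[-2]..p
            if (x2 - x1) * (p[1] - y1) <= (p[0] - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def make_f(pairs):
    """Piecewise-linear interpolation of the lower hull: a convex f."""
    hull = lower_hull(pairs)
    def f(q):
        for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
            if x1 <= q <= x2:
                return y1 + (y2 - y1) * (q - x1) / (x2 - x1)
        raise ValueError("q out of range")
    return f

# hypothetical local costs with q ignored points (not convex at q = 1)
pairs = [(0, 10.0), (1, 9.5), (2, 4.0), (4, 1.0)]
f = make_f(pairs)   # f(1) interpolates the hull: 7.0, below the raw 9.5
```

The non-convex value at q = 1 is replaced by the hull interpolation, which is exactly the "mild effect on the solution cost" mentioned above.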
Now, in the analysis of the approximation ratio, we need to relate the optimum solution to the solution under the truncated distance function, which is a fairly nontrivial task.

For a general discrete distribution on m points in Euclidean space with P being the whole space, T = O(m) [10]; for special distributions such as the normal distribution, T = O(1).

Related Work.
In the centralized model, Charikar et al. give a 3-approximation algorithm for (k, t)-center, and an (O(1), O(1)) bicriteria algorithm for (k, t)-median [4]. The bicriteria relaxation was later removed by Chen [6], who designed an O(1)-approximation algorithm using ˜O(k(k+t)n) time. Feldman and Schulman studied the (k, t)-median problem with different loss functions using the coreset technique [12].

On uncertain data, Cormode and McGregor considered k-center/median/means where each D_i is a discrete distribution [8]. Guha and Munagala provided a technique to reduce the uncertain k-center problem to the deterministic k-median problem [15]. Wang and Zhang studied the special case of k-center on the line [21]. We refer the reader to the survey by Aggarwal [1].

Clustering on distributed data has been studied only recently. In the coordinator model, in d-dimensional Euclidean space, Balcan et al. obtained O(1)-approximation algorithms with ˜O((kd + sk)B) bits of communication for both k-median and k-means [2]. Their results on k-means were further improved by Liang et al. [18] and Cohen et al. [7]. Chen et al. provided a set of lower bounds for these problems [5]. In the MapReduce model, Ene et al. designed several O(1)-approximation O(1)-round algorithms for the k-center and k-median problems [11]. Im and Moseley further studied the partial clustering variant [16]; however, their algorithms require communication polynomial in n. Cormode et al. studied the k-center maintenance problem in the distributed data stream model, where the coordinator keeps track of the cluster centers at every time step [9].

Notation.
We use the following notation in this paper.

• sol(Z, k, t, d): a solution (computed by an algorithm) to the median/means/center problem on point set Z with at most k centers and at most t outliers, under the distance function d;
• opt(Z, k, t, d): an optimal solution to the median/means/center problem on point set Z with at most k centers and at most t outliers, under d;
• C_sol(Z, k, t, d): the cost of the solution sol(Z, k, t, d);
• C_opt(Z, k, t, d): the cost of the solution opt(Z, k, t, d);
• π(j): the center to which point j is attached.

When Z lies in a metric space and d agrees with the distance function on the metric space, we omit the parameter d in the notations above.

Combining Preclustering Solutions.
We review a theorem from [14], which concerns 'combining' local solutions into a global solution. The problems considered in the theorem have no outliers (t = 0) and lie in a metric space, so we abbreviate the notation sol(Z, k, t, d) to sol(Z, k), etc.

Theorem 2.1 ([14]).
Suppose that A = A_1 ⊎ ··· ⊎ A_s (disjoint union) and {sol(A_i, k)} are the preclustering solutions at the sites. Let M = {π(j) : j ∈ A} and L = ∑_{j∈A} d(j, π(j)), where π(j) denotes the preclustering assignment. Consider the weighted k-median problem on M, where the weight of m ∈ M is defined to be the number of points that are assigned to m in the preclustering, that is, |{j : j ∈ A, π(j) = m}|. Then

(i) There exists a weighted k-median solution sol(M, k) such that C_sol(M, k) ≤ 2(L + C_opt(A, k)).
(ii) Given any weighted k-median solution sol(M, k), there exists a k-median solution sol(A, k) such that C_sol(A, k) ≤ C_sol(M, k) + L.

Consequently, there exists a k-median solution sol(A, k), with centers restricted to M, such that C_sol(A, k) ≤ 2γ(L + C_opt(A, k)) + L, where γ is the best approximation ratio for the k-median problem.

Corollary 2.2.
The result in Theorem 2.1 extends to

(i) the k-center problem;
(ii) the k-means problem with weaker constants, using a relaxed triangle inequality;
(iii) the (k, t)-median/means/center approximation on the weighted point set M (with γ being the corresponding bicriteria approximation ratio), provided the preclustering does not ignore any points. Otherwise the total number of ignored points is the sum of the points ignored in the clustering and preclustering phases.

3 (k, t)-MEDIAN AND (k, t)-MEANS

Our algorithm for distributed (k, t)-median clustering is provided in Algorithm 1. For integer pairs (i, q), we use the lexicographical order as the partial order, that is, (i_1, q_1) ≺ (i_2, q_2) if i_1 < i_2, or i_1 = i_2 and q_1 < q_2. (4)

Remark 1.
In Line 17 of Algorithm 1, (i) no input point is ignored in the preclustering; (ii) if the preclustering aggregated q points at a center but the coordinator's algorithm chooses fewer than q copies (so as to exclude exactly t points), the proofs are not affected in any way.

We begin with a theorem about approximating (k, t)-median or means with a different trade-off from that in [4].

Theorem 3.1 (Proof Omitted).
Let ϵ > 0. We can compute sol(Z, k, (1+ϵ)t) and sol(Z, (1+ϵ)k, t) for the (k, t)-median problem in ˜O(|Z|²) time such that

C_sol(Z, k, (1+ϵ)t) ≤ O(1 + 1/ϵ) · C_opt(Z, k, t), and
C_sol(Z, (1+ϵ)k, t) ≤ O(1 + 1/ϵ) · C_opt(Z, k, t).

The result extends to the (k, t)-means problem with a slightly larger constant.

Algorithm 1
Distributed (k, (1+ϵ)t)-median clustering

Input: A = A_1 ⊎ ··· ⊎ A_s; parameters k ≥ 1, t ≥ 0, ρ > 1
Output: sol(A, k, (1+ϵ)t) such that C_sol(A, k, (1+ϵ)t) = O(1 + 1/ϵ) · C_opt(A, k, t)

1: for each site i do
2:    I ← {⌊ρ^r⌋ : 1 ≤ r ≤ ⌊log_ρ t⌋, r ∈ Z} ∪ {0, t}
3:    Compute sol(A_i, k, q) for each q ∈ I
4:    Compute the (lower) convex hull of the point set {(q, C_sol(A_i, k, q))}_{q∈I}, which induces a function f_i(·) defined on {0, ..., t}
5:    Send the function f_i(·) to the coordinator
6: end for
7: Coordinator computes ℓ(i, q) = f_i(q − 1) − f_i(q) for each 1 ≤ i ≤ s and each 1 ≤ q ≤ t
8: Coordinator stably sorts all {ℓ(i, q)} in decreasing order
9: Coordinator finds the ℓ(i₀, q₀) of rank ρt and sends ℓ(i₀, q₀), i₀ and q₀ to all sites
10: for each site i do
11:    t_i ← max{q : ℓ(i, q) ≥ ℓ(i₀, q₀)}    ▷ define max ∅ = 0
12:    if i = i₀ then
13:       t_i ← min{q ∈ I : q ≥ q₀ and C_sol(A_i, k, q) = f_i(q)}
14:    end if
15:    Send the coordinator the 2k centers built in sol(A_i, k, t_i), the number of points attached to each center, and the t_i unassigned points
16: end for
17: Coordinator considers the union of the centers obtained from each site and the unassigned points, applies Theorem 3.1, and outputs sol(A, k, (1+ϵ)t)

Throughout the rest of the section, we denote by t*_i the number of points from A_i that are ignored in the global optimum solution opt(A, k, t). We need the following lemmas.

Lemma 3.2.
It holds that ∑_i C_opt(A_i, k, t*_i) ≤ 2 · C_opt(A, k, t). For (k, t)-means the constant changes from 2 to 4.

Proof.
We shall use an argument from [14]. Let π_opt be the center projection function and K the set of optimum centers in the optimal solution opt(A, k, t). For each A_i, we construct a solution sol(A_i, k, t*_i) by excluding the points excluded in opt(A, k, t) and choosing {arg min_{u∈A_i} d(u, κ) : κ ∈ K} to be the centers. Then, by the triangle inequality,

C_sol(A_i, k, t*_i) ≤ 2 ∑_{x∈A_i} d(x, π_opt(x)).

Summing over i yields ∑_i C_sol(A_i, k, t*_i) ≤ 2 C_opt(A, k, t). The result for k-means follows from additionally applying the triangle inequality in the form (a + b)² ≤ 2(a² + b²). □

Lemma 3.3.
The t_1, ..., t_s computed in Step 11 of Algorithm 1 minimize ∑_i f_i(t_i) subject to ∑_i t_i ≤ ρt and 0 ≤ t_i ≤ t.

Proof.
Suppose that t′_1, ..., t′_s is a minimizer. Since f_i(·) is non-increasing for all i, it must hold that ∑_i t′_i = ρt. By the definition of t_i, it also holds that ∑_i t_i = ρt. If (t′_1, ..., t′_s) ≠ (t_1, ..., t_s), there must exist i, j such that t′_i > t_i and t′_j < t_j. By the definition of t_i and the sorting of {ℓ(i, q)}, we know that ℓ(i, t_i + 1) ≤ ℓ(i₀, q₀) and ℓ(j, t_j) ≥ ℓ(i₀, q₀). From the convexity of f_i and the facts that t′_i ≥ t_i + 1 and t′_j + 1 ≤ t_j, it follows that

f_i(t′_i − 1) − f_i(t′_i) ≤ ℓ(i₀, q₀) ≤ f_j(t′_j) − f_j(t′_j + 1),

which means that increasing t′_j by 1 and decreasing t′_i by 1 will not decrease the sum

G(t′_1, ..., t′_s) := ∑_i (f_i(0) − f_i(t′_i)).

Therefore ∑_i f_i(t′_i) = ∑_i f_i(0) − G(t′_1, ..., t′_s) will not increase. We can continue this procedure until (t′_1, ..., t′_s) = (t_1, ..., t_s). □

('Stably' means that when ℓ(i_1, q_1) = ℓ(i_2, q_2), the sorting algorithm puts ℓ(i_1, q_1) before ℓ(i_2, q_2) if (i_1, q_1) ≺ (i_2, q_2) as defined in (4). 'Element of rank r' means the r-th element in the sorted list.)

Lemma 3.4.
It holds for all i ≠ i₀ that t_i ∈ I and C_sol(A_i, k, t_i) = f_i(t_i), where i₀ is computed in Step 9 and the t_i's in Step 11 of Algorithm 1.

Proof.
Since 0 ∈ I, we only need to consider the i's with t_i ≠ 0. By the definition of i₀ and q₀, it must hold that

ℓ(i, t_i) ≥ ℓ(i₀, q₀) > ℓ(i, t_i + 1) for i < i₀,
ℓ(i, t_i) > ℓ(i₀, q₀) ≥ ℓ(i, t_i + 1) for i > i₀,

which implies that ℓ(i, t_i) > ℓ(i, t_i + 1) whenever i ≠ i₀, i.e.,

f_i(t_i − 1) − f_i(t_i) > f_i(t_i) − f_i(t_i + 1), i ≠ i₀.

Hence (t_i, f_i(t_i)) is a vertex of the convex hull for all i ≠ i₀, that is, t_i ∈ I and f_i(t_i) = C_sol(A_i, k, t_i). □
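The allocation that Lemmas 3.3 and 3.4 analyze can be viewed as taking the largest marginal decreases ℓ(i, q) = f_i(q − 1) − f_i(q) across all sites. The following small check, with hypothetical convex non-increasing cost functions of our own (not the paper's code), confirms that this greedy choice matches the brute-force optimum:

```python
# Sketch: allocating a total outlier budget T across sites by taking
# the T largest marginal decreases l(i, q) = f_i(q-1) - f_i(q).
# For convex, non-increasing f_i this matches the brute-force optimum.
from itertools import product

def greedy_alloc(fs, T, tmax):
    gains = [(fs[i](q - 1) - fs[i](q), i, q)
             for i in range(len(fs)) for q in range(1, tmax + 1)]
    # decreasing by gain; ties broken by (i, q), as in the stable sort
    gains.sort(key=lambda g: (-g[0], g[1], g[2]))
    alloc = [0] * len(fs)
    for _, i, q in gains[:T]:
        alloc[i] = max(alloc[i], q)   # taking the q-th marginal means t_i >= q
    return alloc

# two hypothetical convex, non-increasing local cost functions
fs = [lambda q: (4 - q) ** 2, lambda q: 10 - 2 * q]
T, tmax = 3, 3
alloc = greedy_alloc(fs, T, tmax)
best = min((sum(f(t) for f, t in zip(fs, ts)), ts)
           for ts in product(range(tmax + 1), repeat=2) if sum(ts) == T)
assert sum(alloc) == T and sum(f(t) for f, t in zip(fs, alloc)) == best[0]
```

Convexity is what makes the per-site marginals non-increasing in q, so the selected marginals always form a prefix {1, ..., t_i} at each site.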
Lemma 3.5.
Let ρ = 2. It holds that ∑_i C_sol(A_i, k, t_i) ≤ O(1) · C_opt(A, k, t) and ∑_i t_i ≤ 3t, where t_1, ..., t_s are computed in Step 11 of Algorithm 1 and the exceptional one may later be updated.

Proof.
Let t̂_i = min{q ∈ I : q ≥ t*_i}. It follows from Lemma 3.2 with ∑_i t*_i ≤ t that

2 C_opt(A, k, t) ≥ ∑_i C_opt(A_i, k, t*_i) ≥ ∑_i C_opt(A_i, k, t̂_i) ≥ (1/c) ∑_i C_sol(A_i, k, t̂_i)

for a constant c, where the last inequality follows from Theorem 3.1 (applied with ϵ = ρ − 1 = 1). Note that t̂_i ≤ 2t*_i, and thus ∑_i t̂_i ≤ 2 ∑_i t*_i ≤ 2t, and

∑_i C_sol(A_i, k, t̂_i) ≥ ∑_i f_i(t̂_i) ≥ ∑_i f_i(t_i),

where the last inequality follows from Lemma 3.3, and the t_i's are computed in Step 11. Now, by Lemma 3.4, f_i(t_i) = C_sol(A_i, k, t_i) for all except one i. The exceptional t_i will be replaced by a bigger value, which will not increase f_i(t_i) by the monotonicity of f_i, and the first part follows. This update will increase ∑_i t_i by at most t, and thus ∑_i t_i ≤ 3t. □

Lemma 3.5 and Theorem 3.1 together give the following. Note that |I| = O(log t).

Theorem 3.6.
For the distributed (k, t)-median problem, Algorithm 1 with ρ = 2 outputs sol(A, k, (1+ϵ)t) satisfying C_sol(A, k, (1+ϵ)t) ≤ O(1 + 1/ϵ) · C_opt(A, k, t). The sites communicate a total of ˜O(sk + t) bits of information with the coordinator over 2 rounds. The runtime at each site is ˜O(n_i²) and the runtime at the coordinator is ˜O((sk + t)²). The same result holds for (k, t)-means with larger constants in the approximation ratio and the runtime.

Proof.
The communication cost is straightforward. By Lemma 3.5, the coordinator solves the problem on at most 2sk + 3t points. The claims on the approximation ratio and the runtime then follow from Theorem 3.1, noting that it takes O(|I| log |I|) = ˜O(1) time to find the convex hull. □

If we were only interested in the clustering and not in the list of ignored points, we could set ρ = 1 + δ and change lines 12 to 15 of Algorithm 1 to the following. The sites do not send the ignored nodes but just their number, and the exceptional site runs a slightly more convoluted algorithm.

if i ≠ i₀ then
   Send the coordinator t_i, the 2k centers built in sol(A_i, k, t_i) and the number of points attached to each center
else
   t_{i,1} ← max{q ∈ I : q ≤ t_i and C_sol(A_i, k, q) = f_i(q)}
   t_{i,2} ← min{q ∈ I : q ≥ t_i and C_sol(A_i, k, q) = f_i(q)}
   Combine sol(A_i, k, t_{i,1}) and sol(A_i, k, t_{i,2}) to form a solution sol(A_i, k, t_i) by taking the union of the medians, attaching each point to the closest center among the combined centers, and ignoring the points with the largest t_i distances
   Send to the coordinator t_i, the combined centers and the number of points attached to each center
end if

Observe that Lemma 3.5 still holds with ∑_i t_i ≤ (1 + δ)t, since we are not changing the exceptional t_i. For the exceptional site i, suppose that t_i = (1 − θ)t_{i,1} + θt_{i,2} for some θ ∈ (0, 1); we then have (1 − θ)f_i(t_{i,1}) + θf_i(t_{i,2}) ≤ f_i(t_i). We now argue the next critical lemma.

Lemma 3.7. C_sol(A_i, k, t_i) ≤ (1 − θ) f_i(t_{i,1}) + θ f_i(t_{i,2}).

Proof.
We will prove the lemma by carefully designing an assignment of n − t_i points to the 4k centers whose cost is bounded above by the right-hand side. Since choosing the minimum n − t_i distances can only result in a smaller value, the lemma then follows.

For j = 1, 2, let π_j be the center projection function in sol(A_i, k, t_{i,j}) and P_j the set of clustered points in sol(A_i, k, t_{i,j}). For x ∈ P_1 ∩ P_2, we attach x to the nearer of the two centers π_1(x) and π_2(x), and the incurred cost is

min{d(x, π_1(x)), d(x, π_2(x))} ≤ (1 − θ) d(x, π_1(x)) + θ d(x, π_2(x)).   (5)

For x ∈ P_1 △ P_2, since only one of π_1(x) and π_2(x) exists, we abbreviate it as π(x) for simplicity. Define h(x) for each x ∈ P_1 △ P_2 as

h(x) = (1 − θ) · d(x, π(x)) if x ∈ P_1 \ P_2;  θ · d(x, π(x)) if x ∈ P_2 \ P_1.

Let r = |P_1 ∩ P_2|, r_1 = |P_1 \ P_2| and r_2 = |P_2 \ P_1|. It holds that r + r_1 = n − t_{i,1} and r + r_2 = n − t_{i,2}, thus r_1 > r_2 and (1 − θ)r_1 + θr_2 = n − t_i − r.

Define Q_1 = P_1 \ P_2 and Q_2 = P_2 \ P_1. Pick x = arg min_{z∈Q_1∪Q_2} h(z). If x ∈ Q_1, pick an arbitrary u ∈ Q_2; otherwise pick u ∈ Q_1. Attach x to π(x) in the 4k-center solution we are constructing and mark u as an outlier. Note that this incurs a cost of

d(x, π(x)) ≤ (1 − θ) d(x, π(x)) + θ d(u, π(u)) if x ∈ Q_1;  (1 − θ) d(u, π(u)) + θ d(x, π(x)) if x ∈ Q_2,   (6)

by our choice of x, because one of the combination terms is exactly h(x) and it is smaller than h(u), which is exactly the other term. Then we remove x and u from Q_1 or Q_2 depending on the case.

Now |Q_1| = r_1 − 1 and |Q_2| = r_2 − 1, and note that (1 − θ)(r_1 − 1) + θ(r_2 − 1) = n − t_i − r − 1. Since r_1 > r_2, we can continue this process until Q_2 = ∅. At this point we have run the procedure above r_2 times, and it holds that (1 − θ)|Q_1| = n − t_i − r − r_2. Note that |Q_1| ≥ n − t_i − r − r_2, so we can choose E ⊆ Q_1 to be the points with the smallest n − t_i − r − r_2 values of h. Attach the points in E to their respective centers and mark the remaining points in Q_1 as outliers. This incurs a cost of

∑_{x∈E} d(x, π(x)) ≤ ((n − t_i − r − r_2) / |Q_1|) ∑_{x∈Q_1} d(x, π(x)) = (1 − θ) ∑_{x∈Q_1} d(x, π(x)).   (7)

In total we have assigned r + r_2 + (n − t_i − r − r_2) = n − t_i points, as desired. The desired upper bound on the cost follows from (i) summing both sides of (5) over P_1 ∩ P_2; (ii) summing both sides of (6) over x and the corresponding u during the pairing procedure; and (iii) Equation (7). Note that (ii) covers (P_1 △ P_2) \ Q_1, where Q_1 is the post-pairing set. □

As a consequence of Lemma 3.7, C_sol(A_i, k, t_i) ≤ f_i(t_i). Thus the upper bound on the approximation ratio still holds. Finally, note that |I| = ˜O(1/δ), and we conclude that

Theorem 3.8.
For the distributed (k, t)-median problem, the modified Algorithm 1 with ρ = 1 + δ outputs sol(A, k, (1 + ϵ + δ)t) satisfying C_sol(A, k, (1 + ϵ + δ)t) ≤ O(1 + 1/ϵ) · C_opt(A, k, t). The sites communicate a total of ˜O(sδ⁻¹ + skB) bits of information with the coordinator over 2 rounds. The runtime on site i is ˜O(n_i²/δ) and the runtime on the coordinator is ˜O((sk)²). The same result holds for (k, t)-means with a larger constant in the approximation ratio.

We now show an unusual application of Theorem 3.6 in speeding up existing constant-factor approximation algorithms for (k, t)-median (or means). Note that the centralized bicriteria approximation algorithms of Charikar et al. [4] take superquadratic time on n points, and while the modifications in Theorem 3.1 improve the running time to ˜O(n²), this leaves open the important question: are there algorithms with provable constant-factor approximation guarantees which are subquadratic?
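One route to such subquadratic algorithms, which the paper develops next, is to simulate the distributed algorithm sequentially: split the input into s chunks, precluster each chunk, and re-cluster the weighted chunk centers. The sketch below uses a farthest-first heuristic purely as a stand-in for the bicriteria subroutine; every name here is ours, not the paper's:

```python
# Sketch of sequentially simulating a distributed clustering algorithm:
# split the input into s chunks, "precluster" each chunk, then cluster
# the union of the (weighted) chunk centers.  farthest_first is only a
# placeholder for the bicriteria approximation subroutine.

def farthest_first(points, k, d):
    """Heuristic: pick k spread-out centers, attach points to the nearest."""
    centers = [points[0]]
    while len(centers) < k and len(centers) < len(points):
        centers.append(max(points, key=lambda p: min(d(p, c) for c in centers)))
    assign = {p: min(centers, key=lambda c: d(p, c)) for p in points}
    return centers, assign

def simulate_distributed(points, k, s, d):
    chunks = [points[i::s] for i in range(s)]   # arbitrary split into s pieces
    weighted = {}                               # chunk center -> weight
    for chunk in chunks:
        centers, assign = chunk_result = farthest_first(chunk, k, d)
        for p, c in assign.items():
            weighted[c] = weighted.get(c, 0) + 1
    # the "coordinator" clusters the weighted centers of all chunks
    final_centers, _ = farthest_first(list(weighted), k, d)
    return final_centers
```

Each chunk is clustered on only n/s points, which is where the speed-up in the sequel comes from.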
Observe that the question is even more pertinent in the context of unicriterion approximation, for which the only known result is a ˜O(nkt)-time constant-factor approximation of (k, t)-median [6]. In the sequel we show that the running time can be brought down to almost linear. The improvement arises from the fact that we can simulate a distributed algorithm sequentially.

Lemma 3.9.
Suppose that we are given an Õ(n^{1+α}k)-time algorithm for bicriteria approximation which produces 2k centers or 2t outliers with approximation factor γ, where α ≤ 1. Then we can produce a similar algorithm with running time

  Õ(t²) + Õ(n^{2(1+α)/(2+α)} k²)

and approximation cγ for some absolute constant c > 1.

Proof.
We will apply Theorem 3.6 after dividing the data arbitrarily into s pieces of size n/s. The sequential simulation of the s sites takes time Õ(s(n/s)^{1+α}k), by the assumption of the lemma. The coordinator requires time Õ((sk + t)²) = Õ(s²k²) + Õ(t²). Observe that we can now balance n^{1+α} = s^{2+α}, which provides us the optimum s to use and achieves a running time of

  Õ(t²) + Õ(s²k²) = Õ(t²) + Õ(n^{2(1+α)/(2+α)} k²). ∎

Theorem 3.10.
Let α > 0 and suppose that t ≤ √n. There exists a centralized algorithm for the (k, t)-median problem that runs in Õ(n^{1+α}k²) time and outputs a solution sol(A, k, t) satisfying C_sol(A, k, t) ≤ (1 + 1/α)^{O(1)} · C_opt(A, k, t).

Proof.
Note that the algorithm in Theorem 3.1 has runtime Õ(n²), so we can take α = 1 and γ = O(1); the additive Õ(t²) term is Õ(n) by our assumption that t ≤ √n. Repeatedly applying Lemma 3.9 j times gives an algorithm of runtime Õ(n^{1+1/(2^j−1)}k²) and approximation ratio (cγ)^j. Letting j = log₂(1 + 1/α), the runtime becomes O(n^{1+α}k²) and the approximation ratio becomes (cγ)^j = (1 + 1/α)^{log₂(cγ)} = (1 + 1/α)^{O(1)}. ∎

Remark 2.
We remark that (i) the theorem above also holds for solutions in which the number of centers, instead of the number of outliers, is relaxed; and (ii) for the unicriterion approximation, if we use the algorithm of runtime Õ(n²tk) from [6] instead of the result of Theorem 3.1, we need to balance s and s(n/s)^{1+α} in the analogue of Lemma 3.9, which eventually leads to an algorithm of runtime O(n^{1+α}tk), provided that t ≤ n^{1/3}.

4 (k, t)-CENTER CLUSTERING

Our algorithm for (k, t)-center clustering is presented in Algorithm 2. It is similar to Algorithm 1, only simpler, because the preclustering stage admits a simpler algorithm due to Gonzalez [13]. For the k-center problem on a point set Z of n points, Gonzalez's algorithm outputs a re-ordering of the points of Z, say p₁, …, p_n, such that for each 1 ≤ r ≤ n, the solution sol(Z, r) that chooses {p₁, …, p_r} as the r centers is a 2-approximation for the r-center problem on Z, i.e., C_sol(Z, r) ≤ 2 · C_opt(Z, r).

The core argument is that the k-center algorithm of Gonzalez can be used to simultaneously (a) precluster the local data into local solutions and (b) provide a witness that can be compared globally.

Algorithm 2 Distributed (k, t)-center clustering
1: for each site i do
2:   Run Gonzalez's algorithm and obtain a re-ordering {a₁, …, a_{n_i}} of the points in A_i
3:   for each 1 ≤ q ≤ t do
4:     Compute ℓ(i, q) ← min{d(a_j, a_{k+q}) : j < k + q}
5:   end for
6: end for
7: Sites and coordinator sort {ℓ(i, q)}, and follow the subsequent steps as in Algorithm 1, where the coordinator in the last step runs the algorithm in [4] for the k-center problem with exactly t outliers.

Remark 3.
In Algorithm 2, (i) none of the original points is ignored in the preclustering, and (ii) it is possible that the preclustering aggregated q points but the coordinator's algorithm chooses fewer than q copies so as to exclude exactly t points. This does not affect the proofs for (k, t)-center clustering.

We now analyze the performance of Algorithm 2. Denote by t*_i the number of points ignored from A_i in the global optimum solution opt(A, k, t). First we show two structural lemmas.

Lemma 4.1. C_opt(A, k, t) ≥ max_i C_opt(A_i, k, t*_i).

Proof.
Use the same argument as in the proof of Lemma 3.2. ∎
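The preclustering step of Algorithm 2, Gonzalez's farthest-point traversal, can be sketched in a few lines. The following is a minimal Python illustration (function and variable names such as `gonzalez_reorder` and `gaps` are our own choices, not from the paper); the value `gaps[r]` is the distance from the r-th selected point to the previously selected points, which is exactly the quantity min{d(a_j, a_r) : j < r} that Algorithm 2 reports as ℓ(i, q) for r = k + q.

```python
import math

def gonzalez_reorder(points, dist, m=None):
    """Gonzalez's farthest-point traversal: return the first m points of the
    re-ordering together with, for each selected point, its distance to the
    previously selected points."""
    n = len(points)
    m = n if m is None else min(m, n)
    order = [0]                                   # start from an arbitrary point
    # nearest[x] = distance from point x to the closest selected point so far
    nearest = [dist(points[0], p) for p in points]
    gaps = [0.0]                                  # gap of the first point is 0 by convention
    for _ in range(1, m):
        r = max(range(n), key=lambda x: nearest[x])   # farthest remaining point
        gaps.append(nearest[r])
        order.append(r)
        for x in range(n):                        # update distances to the selected set
            nearest[x] = min(nearest[x], dist(points[r], points[x]))
    return order, gaps

def euclid(p, q):
    return math.dist(p, q)

# Ten points on a line: 0, 1, ..., 9.
pts = [(float(i),) for i in range(10)]
order, gaps = gonzalez_reorder(pts, euclid)
# For each 1 <= r <= n, the first r points of `order` form a 2-approximate
# set of r centers; a site in Algorithm 2 would only compute the first
# k + t entries and report l(i, q) = gaps[k + q].
print(order[:3], gaps[:3])
```

As Theorem 4.3 notes, each site only needs the first k + t points of the traversal, which is why the site runtime is Õ((k + t)n_i).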
Lemma 4.2. max_i C_opt(A_i, k, t*_i) ≥ min_{Σ_i t_i ≥ t} ( max_i C_opt(A_i, k, t_i) ).
It follows from the fact that Σ_i t*_i = t. ∎

Theorem 4.3.
For the distributed (k, t)-center problem, Algorithm 2 outputs sol(A, k, t) satisfying C_sol(A, k, t) ≤ O(1) · C_opt(A, k, t). The sites communicate a total of Õ((sk + t)B) bits of information to the coordinator over 2 rounds. The runtime on site i is Õ((k + t)n_i) and the runtime on the coordinator is Õ((sk + t)²).

Proof.
The approximation ratio follows from an argument similar to that of Theorem 3.6, using Lemmas 4.1 and 4.2. The coordinator runtime follows from [4, Theorem 3.1] and the site runtime from [13], noting that we need only the first k + t points of the re-ordering of each A_i. The communication cost is clear from Algorithm 2. ∎

5 CLUSTERING UNCERTAIN DATA

Recall that in the setting of clustering with uncertainty there is an underlying metric space (P, d). We are given a set of input nodes j ∈ A which correspond to distributions D_j on P. In this section we use nodes to refer to the input and points to refer to deterministic objects in the metric space P. We denote by σ(j) a realization of node j and by π(j) the center node to which j is attached. Our goal in the (k, t)-median problem in this context is to compute

  min_{K⊆P, O⊆A, |K|≤k, |O|≤t} Σ_{j∈A\O} ( min_{π(j)∈K} E_σ[d(σ(j), π(j))] ).  (8)

For (k, t)-means we use d²(·, ·), and for (k, t)-center-pp we use max_j instead of Σ_j.

Define d̂ : A × P → R as d̂(j, p) = E_σ[d(σ(j), p)]; the objective function (8) is then reduced to the usual (k, t)-median problem with the new distance function d̂. However, this definition only allows the computation of the distance between an input node and a point in P. To extend d̂ to a pair of input nodes, the site holding A_i would need to know the point set ∪_{j∈A_{i′}} supp(D_j) from some other site i′. This would blow up the communication cost, and thus naively using this distance function in combination with the algorithms developed previously will not work well. To circumvent this issue we combine the notion of 1-median introduced in [8] with the framework of Theorem 2.1, and introduce a compression scheme to evaluate distances.

Definition 5.1.
For each node j, define its 1-median and 1-mean to be

  y_j = argmin_{y∈P} E_σ[d(σ(j), y)]  and  y′_j = argmin_{y∈P} E_σ[d²(σ(j), y)],

respectively.

Definition 5.2 (Compressed graph).
The compressed graph G(A) is a weighted graph on the vertices P ∪ {p_j}_{j∈A}, where the edges are as follows: (1) each pair (u, v) ∈ P × P is an edge with weight d(u, v), and (2) for each j ∈ A, the vertex p_j is connected only to y_j, with weight ℓ_j = E_σ[d(σ(j), y_j)]. Define the distance d_G(u, v) between two vertices u, v in G to be the length of the shortest path between u and v in G.

For the compressed graph G, we can also consider the following (k, t)-median problem, where we restrict the demand points to {p_j} and the possible centers to {y_j}, and the distance function is the length of the shortest path in G. We continue to use the notation sol(G, k, t), C_sol(G, k, t), etc., to denote a solution and the corresponding cost of the (k, t)-median problem on G. The following two lemmas show that the (k, t)-median problem in Equation (8) is, up to a constant factor in the approximation ratio, equivalent to the (k, t)-median problem on the compressed graph.

Lemma 5.3.
If there exists a solution sol(A, k, t) of cost C_sol(A, k, t) to the objective in Equation (8), then there exists a solution sol(G(A), k, t) on the compressed graph such that C_sol(G(A), k, t) ≤ 5 · C_sol(A, k, t).

Proof.
Let A′ be the set of clustered nodes in the feasible (k, t)-median solution of the original problem with the objective in (8). Define the set of center points M = {y_j : j ∈ A′}. For each j ∈ A′, let y_{π(j)} = argmin_{y∈M} d(π(j), y). Let sol(G(A), k, t) be the solution that connects each point p_j (j ∈ A′) to y_{π(j)} in the compressed graph G. We upper bound the cost C_sol(G(A), k, t):

C_sol(G(A), k, t) = Σ_{j∈A′} d_G(y_{π(j)}, p_j)   (definition of C_sol)
 = Σ_{j∈A′} ( d(y_{π(j)}, y_j) + d_G(y_j, p_j) )   (definition of d_G)
 ≤ Σ_{j∈A′} d(y_{π(j)}, π(j)) + Σ_{j∈A′} d(π(j), y_j) + Σ_{j∈A′} d_G(y_j, p_j)   (triangle inequality)
 ≤ 2 Σ_{j∈A′} d(π(j), y_j) + Σ_{j∈A′} ℓ_j,

where the last line follows from d(y_{π(j)}, π(j)) ≤ d(π(j), y_j), by the definition (optimality) of y_{π(j)}.

Observe that for any realization σ(j), it holds that d(y_j, π(j)) ≤ d(y_j, σ(j)) + d(σ(j), π(j)). Taking the expectation over σ,

  d(y_j, π(j)) ≤ E_σ d(y_j, σ(j)) + E_σ d(σ(j), π(j)) = ℓ_j + E_σ d(σ(j), π(j)).

Summing over j ∈ A′,

  Σ_{j∈A′} d(y_j, π(j)) ≤ Σ_{j∈A′} ℓ_j + Σ_{j∈A′} E_σ d(σ(j), π(j)) ≤ Σ_{j∈A′} ℓ_j + C_sol(A, k, t).  (9)

We next bound Σ_{j∈A′} ℓ_j. This is exactly the cost of connecting each j ∈ A′ to its 1-median, i.e., the cost of an optimal solution with at most n − t centers for A′. The optimal cost with n − t centers is clearly at most that with k centers, and hence Σ_{j∈A′} ℓ_j ≤ C_sol(A, k, t). Therefore

  C_sol(G(A), k, t) ≤ 4 · C_sol(A, k, t) + C_sol(A, k, t) = 5 · C_sol(A, k, t),

as claimed. ∎

Lemma 5.4.
If there exists a solution sol(G(A), k, t) of cost C_sol(G(A), k, t) on the compressed graph, then there exists a solution sol(A, k, t) for the problem formulated in (8) such that C_sol(A, k, t) ≤ 2 · C_sol(G(A), k, t).

Proof.
Let A′′ be the set of clustered nodes in sol(G(A), k, t). An argument similar to the center-count increase in Lemma 5.3 yields Σ_{j∈A′′} ℓ_j ≤ C_sol(G(A), k, t). Suppose that p_j is assigned to π(j) in sol(G(A), k, t) in the compressed graph; note that π(j) ∈ P. Let sol(A, k, t) be the solution that attaches j to π(j) in P. Its cost can be bounded as

C_sol(A, k, t) = Σ_{j∈A′′} E_σ[d(σ(j), π(j))]   (definition of C_sol)
 ≤ Σ_{j∈A′′} E_σ[d(σ(j), y_j)] + Σ_{j∈A′′} d(y_j, π(j))   (triangle inequality)
 ≤ Σ_{j∈A′′} ℓ_j + Σ_{j∈A′′} d_G(p_j, π(j))   (definition of d_G, see below)
 ≤ 2 · C_sol(G(A), k, t),   (definition of C_sol)

where the third line follows from d_G(p_j, π(j)) = d(p_j, y_j) + d(y_j, π(j)) ≥ d(y_j, π(j)). ∎

The equivalence between the original problem and the one on the compressed graph also holds for the (k, t)-center-pp and (k, t)-means problems.

Lemma 5.5.
Lemma 5.3 and Lemma 5.4 both hold (a) for (k, t)-center-pp with the same constants; and (b) for (k, t)-means with slightly larger constants.

Proof. (a) Observe that Σ_j is replaced with max_j and Equation (9) becomes

  max_{j∈A′} d(y_j, π(j)) ≤ max_{j∈A′} ℓ_j + C_sol(A, k, t).

The remainder of the equations hold under this transformation.
Algorithm 3
A Compression Scheme for Distributed Partial Clustering of Uncertain Data
1: for each site i do
2:   Compute ℓ_j = E_σ[d(σ(j), y_j)] for all j ∈ A_i
3:   Construct the compressed graph of A_i as described in Definition 5.2
4:   Run any algorithm from Section 3 or Section 4 on the compressed graph, with the following change: whenever the site has to communicate p_j, it also sends y_j (or y′_j) and the value of E_σ[d(σ(j), y_j)] (or E_σ[d²(σ(j), y′_j)])
5: end for

(b) Note that we used the triangle inequality in the proof above. Although the square of the distance does not obey the triangle inequality, we can nevertheless apply (a + b)² ≤ 2a² + 2b² after the triangle inequality. The derivations above then go through and the results hold with slightly larger constants. ∎

The overall algorithm is summarized in Algorithm 3. Note that we cannot simply cluster the {y_j}; the graph is necessary. To implement the algorithm, we need to show that each site is able to compute the distance function individually. Indeed, any site that contains p_j also contains the corresponding y_j or y′_j and the value E_σ[d(σ(j), y_j)] or E_σ[d²(σ(j), y′_j)], respectively. Therefore the distance oracle on the graph can be implemented by the site in constant time.

Theorem 5.6.
For the distributed (k, t)-median problem, Algorithm 3 outputs sol(A, k, (1+ϵ)t) such that C_sol(A, k, (1+ϵ)t) = O(1 + 1/ϵ) · C_opt(A, k, t). The sites communicate a total of Õ((sk + t)B) bits of information to the coordinator over 2 rounds. The runtime on site i is Õ(n_i + n_i T), where T is the runtime to compute a 1-median, and the runtime on the coordinator is Õ((sk + t)²). The same result holds for the (k, t)-means and center-pp problems with larger constants.

Proof.
By Lemma 5.4 for the median problem and Lemma 5.5 for the means and center-pp problems, it suffices to show that we can solve the (k, t)-median problem on the compressed graph. The result then follows from Theorem 3.6 and Theorem 3.8 with the following amendments: when a site sends the t or t_i potential outliers, it needs to send the y_j and the corresponding values E_σ[d(σ(j), y_j)] or E_σ[d²(σ(j), y′_j)], which at most doubles the communication cost. The runtime increases by O(n_i T) due to Step 2, since computing each ℓ_j on the compressed graph takes O(T) time. ∎

The other results claimed in Table 2 follow from analogous amendments to Theorem 3.8.
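The compressed graph of Definition 5.2 never needs to be materialized: each p_j is a degree-one vertex hanging off its 1-median y_j, so the distance between two nodes factors as d_G(p_j, p_{j′}) = ℓ_j + d(y_j, y_{j′}) + ℓ_{j′}, which is exactly the constant-time distance oracle mentioned above. The following is a hypothetical Python sketch (the function names and the brute-force 1-median search over a finite P are our own illustration; discrete node distributions are given as lists of (point, probability) pairs):

```python
def one_median(dist_support, metric_points, metric):
    """Brute-force 1-median of a discrete node distribution:
    y_j = argmin_{y in P} E_sigma[d(sigma(j), y)], returned together with
    l_j = E_sigma[d(sigma(j), y_j)]  (Definition 5.1)."""
    best_y, best_cost = None, float("inf")
    for y in metric_points:
        cost = sum(p * metric(x, y) for x, p in dist_support)
        if cost < best_cost:
            best_y, best_cost = y, cost
    return best_y, best_cost

def compressed_distance(node_a, node_b, metric):
    """d_G(p_j, p_j') on the compressed graph of Definition 5.2, i.e. the
    path p_j -- y_j -- y_j' -- p_j', evaluated in constant time from the
    pairs (y_j, l_j) that the sites exchange."""
    (y1, l1), (y2, l2) = node_a, node_b
    return l1 + metric(y1, y2) + l2

metric = lambda u, v: abs(u - v)
P = [0.0, 1.0, 2.0, 10.0]
# Node 1 realizes at 0 or 2 with equal probability; node 2 is the point 10 w.p. 1.
n1 = one_median([(0.0, 0.5), (2.0, 0.5)], P, metric)
n2 = one_median([(10.0, 1.0)], P, metric)
print(n1, n2, compressed_distance(n1, n2, metric))
```

This also makes the communication amendment in the proof of Theorem 5.6 concrete: shipping the pair (y_j, ℓ_j) alongside p_j is all another site needs to evaluate distances to p_j.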
The global k-center case. We now focus on (k, t)-center-g. In this setting the D_j are independent and we optimize

  min_{K⊆P, O⊆A, |K|≤k, |O|≤t} E_{σ∼∏_j D_j} [ max_{j∈A\O} d(σ(j), π(j)) ].

Definition 5.7 (Truncated distance [15]).
For τ ≥ 0, define L_τ : P × P → R as L_τ(u, v) = max{d(u, v) − τ, 0} and ρ_τ : A × P → R as ρ_τ(j, u) = E_σ[L_τ(σ(j), u)]. Note that L_τ(·, ·) is not a metric for τ > 0.

Definition 5.8. Given a node set Z ⊆ A, let P(Z) ⊆ P be the associated point set corresponding to the possible realizations of the nodes in Z. Let sol(Z, k, t, ρ_τ) and opt(Z, k, t, ρ_τ) be a solution produced by an algorithm and the global optimum solution, respectively, to the (k, t)-median problem on the node set Z where the centers are restricted to P(Z) and the weighted assignment cost of assigning node j ∈ Z to center m ∈ P(Z) is ρ_τ(j, m). The costs C_sol(Z, k, t, ρ_τ) and C_opt(Z, k, t, ρ_τ) are defined analogously.

Let d_min and d_max denote the minimum and the maximum distance, respectively, between two distinct points in P, and let ∆ = d_max/d_min. The algorithm is presented in Algorithm 4.

Algorithm 4
Algorithm for (k, t)-center-g
1: All parties compute d_min and d_max
2: Each party creates T = {2^i · d_min/18 : 0 ≤ i ≤ ⌈log ∆⌉ + 5}
3: for each τ ∈ T do
4:   All parties run Algorithm 2 with the following changes: when it calls Algorithm 1 as a subroutine, sol(A_i, k, q) in Algorithm 1 is replaced with sol(A_i, k, q, ρ_τ), and the sites obtain the numbers of local outliers {t_i(τ)}
5: end for
6: Coordinator finds τ̂ = min{τ ∈ T : Σ_i C_sol(A_i, k, t_i(τ), ρ_τ) ≤ τ}
7: Coordinator solves (k, t)-center-g on the preclustering solutions sol(A_i, k, t_i(τ̂), ρ_τ̂) and outputs sol(A, k, (1+ϵ)t)

We now analyze the performance of Algorithm 4. We first show an analogue of Theorem 3.1: we can compute a constant approximation to C_opt(Z, k, t, ρ_τ). The proof is omitted.

Lemma 5.9.
Let τ ≥ 0. For the (k, t)-center problem on Z, we can compute in Õ((k + t)|Z|) time sol(Z, k, (1+ϵ)t, ρ_τ) or sol(Z, (1+ϵ)k, t, ρ_τ) such that

  C_sol(Z, k, (1+ϵ)t, ρ_τ) ≤ max{6, 6/ϵ} · C_opt(Z, k, t, ρ_τ),
  C_sol(Z, (1+ϵ)k, t, ρ_τ) ≤ max{6, 6/ϵ} · C_opt(Z, k, t, ρ_τ).

We next show that the τ̂ computed in Step 6 is a good choice of τ and ensures that the preclustering solutions sol(A_i, k, t_i(τ̂), ρ_τ̂) can be combined into a good global solution. Specifically, we have the following two lemmas.

Lemma 5.10.
The τ̂ computed in Step 6 satisfies the following two conditions:
(i) Σ_i C_sol(A_i, k, t_i(τ̂), ρ_τ̂) ≤ τ̂;
(ii) Σ_i C_opt(A_i, k, t′_i, ρ_τ̂) = Ω(τ̂) for all {t′_i} such that Σ_i t′_i ≤ t.

Proof.
Note that for τ_max = max T > d_max it always holds that ρ_{τ_max} ≡ 0. Thus the condition Σ_i C_sol(A_i, k, t_i(τ_max), ρ_{τ_max}) ≤ τ_max holds, so τ̂ exists and satisfies condition (i).

Next we show that condition (ii) holds. Let {t′_i} be an arbitrary sequence satisfying Σ_i t′_i ≤ t. Similarly to the proof of Lemma 3.3, one can show that Σ_i C_sol(A_i, k, t′_i, ρ_τ̂) ≥ Σ_i C_sol(A_i, k, t_i(τ̂), ρ_τ̂), using the fact that Σ_i t′_i ≤ t < ρt = Σ_i t_i(τ̂). Combining this with Lemma 5.9 applied with ϵ = 1, we have that

  6 Σ_i C_opt(A_i, k, t′_i, ρ_τ̂) ≥ Σ_i C_sol(A_i, k, t′_i, ρ_τ̂) ≥ Σ_i C_sol(A_i, k, t_i(τ̂), ρ_τ̂) = Ω(τ̂),

whence condition (ii) follows. ∎
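Steps 2 and 6 of Algorithm 4, the geometric grid T of truncation thresholds and the choice of the smallest τ whose total preclustering cost is at most τ, can be sketched as follows. This is a hedged illustration: the grid base d_min/18 and the number of extra levels mirror the (partially garbled in extraction) constants of Algorithm 4 and should be read as assumptions, and `total_cost` is a hypothetical black-box stand-in for the distributed subroutine Σ_i C_sol(A_i, k, t_i(τ), ρ_τ).

```python
import math

def truncated_distance(d, tau):
    """L_tau(u, v) = max{d(u, v) - tau, 0}, as in Definition 5.7."""
    return max(d - tau, 0.0)

def rho(realizations, u, tau, metric):
    """rho_tau(j, u) = E_sigma[L_tau(sigma(j), u)] for a discrete node,
    given as a list of (realization point, probability) pairs."""
    return sum(p * truncated_distance(metric(x, u), tau) for x, p in realizations)

def choose_tau(d_min, d_max, total_cost, extra_levels=5):
    """Steps 2 and 6 of Algorithm 4: build the geometric grid T and return
    the smallest tau in T whose total preclustering cost is at most tau.
    `total_cost(tau)` stands in for sum_i C_sol(A_i, k, t_i(tau), rho_tau)."""
    levels = math.ceil(math.log2(d_max / d_min)) + extra_levels
    T = [(2 ** i) * d_min / 18 for i in range(levels + 1)]
    for tau in T:
        if total_cost(tau) <= tau:
            return tau
    return T[-1]  # at tau_max the truncated distances vanish, so the test holds

metric = lambda u, v: abs(u - v)
node = [(0.0, 0.5), (4.0, 0.5)]
print(rho(node, 0.0, 1.0, metric))  # 0.5 * 0 + 0.5 * 3 = 1.5
```

Since total_cost is non-increasing in τ while the right-hand side τ grows geometrically, the scan terminates within the O(log ∆) grid levels, which is where the log ∆ factors in Theorem 5.14 come from.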
Suppose that τ̂ satisfies conditions (i) and (ii) of Lemma 5.10. Then a γ-approximation of the weighted center-g problem induced by the preclustering sol(A_i, k, t_i(τ̂), ρ_τ̂) is an O(γ)-approximation of C_opt(A, k, t).

To prove this lemma, we need the following two auxiliary lemmas.
Lemma 5.12. 2 · C_opt(A, k, t, ρ_τ) ≥ Σ_i C_opt(A_i, k, t*_i, ρ_τ), where t*_i is the number of nodes ignored from A_i in the global optimum solution opt(A, k, t, ρ_τ).
Fix a realization of the nodes. The proof mimics that of Lemma 3.2 for each realization; it uses the observation that L_τ(u₁, u₂) + L_τ(u₂, u₃) ≥ L_{2τ}(u₁, u₃) and then takes the expectation. ∎

Lemma 5.13. If C_opt(Z, k, t, ρ_τ) ≥ τ then C_opt(Z, k, t) ≥ τ/2.

Proof.
The case t = 0 follows directly. For t > 0, let Z′ ⊆ Z be the set of clustered nodes in opt(Z, k, t); then C_opt(Z′, k, 0, ρ_τ) = C_opt(Z, k, t, ρ_τ) ≥ τ, and thus C_opt(Z, k, t) = C_opt(Z′, k, 0) ≥ τ/2. ∎
It follows from Lemma 5.12 and condition (ii) of Lemma 5.10 that

  2 · C_opt(A, k, t, ρ_τ̂) ≥ Σ_i C_opt(A_i, k, t*_i, ρ_τ̂) = Ω(τ̂),

where t*_i is the number of nodes ignored from A_i in the global optimum solution opt(A, k, t, ρ_τ̂). It then follows from Lemma 5.13 that C_opt(A, k, t) = Ω(τ̂). Abbreviate t_i(τ̂) as t_i.

Let A*_i ⊆ A_i be the set of nodes clustered in the global optimum solution opt(A, k, t). Consider "collapsing" the nodes in A*_i to their corresponding centers in sol(A_i, k, t_i, ρ_τ̂) while keeping the same centers as in sol(A, k, t). If a node in A*_i is marked as an outlier in sol(A_i, k, t_i, ρ_τ̂) then it is not moved, and it continues to be excluded from the calculation. This movement increases the expectation of the maximum assignment cost by at most 6τ̂ + C_sol(A_i, k, t_i, ρ_τ̂). Now consider the same process where we collapse A*_i for all i. The total increase across the different i is 6τ̂ + Σ_i C_sol(A_i, k, t_i, ρ_τ̂), because the increase of 6τ̂ arises from distance truncation and is common. Thus we achieve a solution of cost at most

  γ ( C_opt(A, k, t) + 6τ̂ + Σ_i C_sol(A_i, k, t_i, ρ_τ̂) ).

Now consider "expanding" the nodes of A_i from the preclustering back to the distributions D_j. By the same logic the expected maximum can increase by at most 2τ̂ + Σ_i C_sol(A_i, k, t_i, ρ_τ̂), which by condition (i) of Lemma 5.10 totals to O(γτ̂) = O(γ) · C_opt(A, k, t). The lemma follows. ∎
Objective | Approx. | Centers | Ignored | Rounds | Total Comm. | Time (site i, coordinator)
median | O(1) | k | t | 1 | Õ((sk + st)B) | Õ(n_i), Õ(k²s²t²)
       |      | k | t | 2 | Õ((sk + t)B) | Õ(n_i), Õ(k²t²(sk + t)²)
       |      | k | (1+δ)t | 2 | Õ(s/δ + skB) | Õ(n_i), Õ(s²k²)
means/median | O(1 + 1/ϵ) | k, (1+ϵ)t or (1+ϵ)k, t | 1 | Õ((sk + st)B) | Õ(n_i), Õ((sk + st)²)
             |            | k, (1+ϵ)t or (1+ϵ)k, t | 2 | Õ((sk + t)B) | Õ(n_i), Õ((sk + t)²)
             |            | k | (1+ϵ+δ)t | 2 | Õ(s/δ + skB) | Õ(n_i), Õ((sk)²)
             |            | (1+ϵ)k | (1+δ)t | 2 | Õ(s/δ + skB) | Õ(n_i), Õ((sk)²)
center | O(1) | k | t | 1 | Õ((sk + st)B) | Õ((k + t)n_i), Õ((sk + st)²)
       |      | k | t | 2 | Õ((sk + t)B) | Õ((k + t)n_i), Õ((sk + t)²)
       |      | k | (1+δ)t | 2 | Õ(s/δ + skB) | Õ(n_i), Õ((sk)²)
uncertain median/means/center-pp | as in the regular case above, with site time increased by O(n_i T) and coordinator time unchanged
center-g | O(1 + 1/ϵ) | k | (1+ϵ)t | 2 | Õ(skB + tI + s log ∆) | Õ(n_i log ∆), Õ((sk + t)²)
         | O(1)       | k | t | 1 | Õ(s(kB + tI) log ∆) | Õ((k + t)n_i log ∆), Õ(s²(k + t)²)

Table 2: Our results. T denotes the runtime to compute the 1-median/1-mean of a node distribution, I the number of bits encoding a node in the uncertain-data case, B the number of bits encoding a point, and ∆ the ratio between the maximum and the minimum pairwise distance in the dataset.
For the distributed (k, t)-center-g problem, Algorithm 4 outputs sol(A, k, (1+ϵ)t) satisfying C_sol(A, k, (1+ϵ)t) = O(1 + 1/ϵ) · C_opt(A, k, t). The sites communicate a total of Õ(skB + s log ∆ + tI) bits of information to the coordinator over 2 rounds, where I is the bit complexity of encoding a node. The runtime at site i is Õ((k + t)n_i log ∆) and the runtime at the coordinator is Õ((sk + t)²).

Proof.
The claim on the approximation ratio follows from Lemma 5.11. To determine τ̂, the communication cost increases by a factor of log ∆; to send the preclustering solutions, the communication cost for sending the outliers increases by a factor of I. The runtime follows from Lemma 5.9 with an increase by a factor of log ∆. ∎

We remark that the dependence on log ∆ can be removed with another pass in which each site computes a τ_i using binary search; the discussion is omitted in the interest of simplicity. The other results claimed in Table 2 follow from analogous amendments to Theorem 3.8.

REFERENCES

[1] Charu C. Aggarwal. A survey of uncertain data clustering algorithms. In
Data Clustering: Algorithms and Applications, pages 457–482. 2013.
[2] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In Proceedings of NIPS, pages 1995–2003, 2013.
[3] Moses Charikar and Sudipto Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of FOCS, pages 378–388, 1999.
[4] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of SODA, pages 642–651, 2001.
[5] Jiecao Chen, He Sun, David P. Woodruff, and Qin Zhang. Communication-optimal distributed clustering. In Proceedings of NIPS, 2016.
[6] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of SODA, pages 826–835, 2008.
[7] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of STOC, pages 163–172, 2015.
[8] Graham Cormode and Andrew McGregor. Approximation algorithms for clustering uncertain data. In Proceedings of PODS, pages 191–200, 2008.
[9] Graham Cormode, S. Muthukrishnan, and Wei Zhuang. Conquering the divide: Continuous clustering of distributed data streams. In Proceedings of ICDE, pages 1036–1045. IEEE, 2007.
[10] M. E. Dyer. On a multidimensional search technique and its application to the Euclidean one centre problem. SIAM J. Comput., 15(3):725–738, 1986.
[11] Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using MapReduce. In Proceedings of SIGKDD, pages 681–689, 2011.
[12] Dan Feldman and Leonard J. Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of SODA, pages 1343–1354, 2012.
[13] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985.
[14] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Trans. on Knowl. and Data Eng., 15(3):515–528, 2003.
[15] Sudipto Guha and Kamesh Munagala. Exceeding expectations and clustering uncertain data. In Proceedings of PODS, pages 269–278, 2009.
[16] Sungjin Im and Benjamin Moseley. Brief announcement: Fast and better distributed MapReduce algorithms for k-center clustering. In Proceedings of SPAA, pages 65–67, 2015.
[17] Kamal Jain and Vijay V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM, 48(2):274–296, 2001.
[18] Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, and David P. Woodruff. Improved distributed principal component analysis. In Proceedings of NIPS, pages 3113–3121, 2014.
[19] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Proceedings of NIPS, pages 1063–1071, 2015.
[20] Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic Databases. Morgan & Claypool Publishers, 1st edition, 2011.
[21] Haitao Wang and Jingru Zhang. One-dimensional k-center on uncertain data. Theoretical Computer Science, 602:114–124, 2015.
A THE FULL SET OF OUR RESULTS
We summarize the full set of our results in Table 2. Besides the main results that already appear in Table 1, all the 1-round results in Table 2 basically follow from setting t_i = t for all sites i. The results for (k, t)-median/means that ignore (1+δ)t or (1+ϵ+δ)t points basically follow from Theorem 3.8, where for (k, t)-median with k centers (unicriterion) we need to apply the 1-round result again, and for (k, t)-median/means with (1+ϵ)k centers we simply use the second inequality of Theorem 3.1 instead of the first one in the final clustering step at the coordinator. The result for (k, t)-center that ignores (1+δ)t points is due to the following modification: the sites do not send all of their (1+δ)t local outliers to the coordinator, and thereafter the coordinator performs the second-level clustering with (another) t outliers, so that we have (1+δ)t outliers in total.

B PROOF OF THEOREM 3.1
Proof.
The result in [4] prioritized the approximation ratio and used [3] instead of [17]; however, the former increases the running time to Õ(n³) in order to get the better approximation factor. Using the latter result of [17], we get the running time of the first part of the theorem. To observe the quality guarantee, note that [17] creates two solutions with k₁ and k₂ centers, each of which ignores exactly t outliers, where k₁ < k < k₂. Although not explicitly stated in [17], as observed in [4] the algorithm is applicable to the outlier case, since we can simply stop the algorithm when there are t points unprocessed.

Set a = (k₂ − k)/(k₂ − k₁) and b = 1 − a, and consider the convex combination of the two solutions. The convex combination of their costs is a 3-approximation. To get exactly k centers, we iteratively pair off every center in the small solution with its nearest (remaining) center in the large solution. With probability a we choose all the centers in the small solution, and otherwise we choose the paired centers in the large solution. In the latter case we have chosen k₁ centers, and we choose the remaining k − k₁ centers at random from the remaining centers in the large solution. Note that every center in the large solution is chosen with probability at least 1 − a.

In the current case we also have two solutions, each of which ignores exactly t outliers. Notice that if a point is labeled an outlier in one solution and not in the other, it must be directly connected to a center (in the language of [17]). In the case where we choose the centers in the small solution, all the points that were directly connected to a center continue to satisfy that 6 times the dual value is greater than the distance to the center plus 6 times the payment towards the centers ([4] reduced this to 4 based on [3]). If we choose all the centers in the small solution then we cannot have more than t outliers.
If we choose the large solution then we may exceed t outliers: if all the points labeled outliers in the small solution were excluded, some of the points clustered in the large solution (but not in the small one) cannot be accommodated because the corresponding center was not chosen. But this happens with probability at most 1 − (1 − a) = a, and therefore in expectation we lose an extra a·t outliers. If a ≥ ϵ/2, we choose the small solution, which provides a 6/ϵ approximation. Otherwise we run the rounding part of [17] multiple times and choose a solution with at most t + ϵt outliers (which happens with constant probability, using Markov's inequality).

For the second part, observe that if k₁ + k₂ ≥ (1+ϵ)k then a > ϵ and we have a 3/ϵ approximation with k centers. Otherwise we use all of the k₁ + k₂ < (1+ϵ)k centers. Now we can assert that the distance cost plus 3 times the cost towards the centers is at most 3 times the dual value for all points not marked as outliers in both solutions. Thus the set of outliers is the intersection of the outlier sets of the two solutions, of size at most t. The theorem follows.

Note that the rounding argument above uses the triangle inequality. While the triangle inequality does not hold for squares of distances (as in the k-means objective function), we instead use 2(a² + b²) ≥ (a + b)². ∎

C PROOF OF LEMMA 5.9
Proof.
The proof is similar to that of Theorem 3.1. The only different part is the accounting for the truncation. For the (1+ϵ)k result we note a pseudo-triangle inequality (see [15, Lemma 4.1]): ρ_τ(j, m) ≤ ρ_τ(j, m′) + ρ_τ(i, m′) + ρ_τ(i, m) for any m′, since in this case we assign points within three hops. For the (1+ϵ)t result we assign within 9 hops: each point has a center in the large and the small solutions within 3 hops, and the pairing of the centers in the two solutions shows that the pair of a center in the small solution exists within 6 hops. The whole argument for Theorem 3.1 then goes through. ∎