Streaming Balanced Clustering
Hossein Esfandiari (Google, [email protected])    Vahab Mirrokni (Google, [email protected])    Peilin Zhong (Columbia, [email protected])
Abstract
Clustering of data points in a metric space is among the most fundamental problems in computer science, with many applications in data mining, information retrieval and machine learning. Due to the necessity of clustering large datasets, several streaming algorithms have been developed for different variants of clustering problems such as the k-median and k-means problems. However, despite the importance of the context, the current understanding of balanced clustering (or, more generally, capacitated clustering) in the streaming setting is very limited. The only previously known streaming approximation algorithm for capacitated clustering requires three passes and only handles insertions.

In this work, we develop the first single-pass streaming algorithm for a general class of clustering problems that includes capacitated k-median and capacitated k-means in Euclidean space, using only poly(k d log ∆) space, where k is the number of clusters, d is the dimension and ∆ is the maximum relative range of a coordinate. This algorithm only violates the capacity constraint by a (1 + ε) factor. Interestingly, unlike the previous algorithm, our algorithm handles both insertions and deletions of points. To provide this result we define a decomposition of the space via certain curved half-spaces. We use this decomposition to design a strong coreset of size poly(k d log ∆) for balanced clustering. Then, we show that this coreset is implementable in the streaming and distributed settings. Note that d log ∆ is the space required to represent one point.

Introduction
Clustering of data points in a metric space is among the most fundamental problems in computer science, with many applications in data mining, information retrieval and machine learning. In many applications there are natural constraints on the sizes of the clusters. To capture such constraints, balanced and capacitated clustering have been introduced and widely studied in the classical setting [ASS17, ABM+18, BRU16, Li17, DL16, XHX+19]. Due to the necessity of clustering large datasets, several streaming algorithms have been developed for different variants of clustering problems, such as k-median [COP03, FS05, FL11, GMMO00, HPM04, Che09, BIP+16, BFL16, BFL+17] and k-means [FMS07, Che09, FL11, BLLM16, BFL16, HSYZ18]. However, despite the importance of the context, there is no one-pass streaming algorithm known for the balanced or capacitated version of these problems with non-trivial guarantees. The only previously known approximation algorithm in this context is a three-pass insertion-only streaming algorithm for a general class of capacitated k-clustering in ℓ_r [BBLM14]. In capacitated k-clustering in ℓ_r, the objective is to assign all of the points to k centers such that, while respecting the capacity constraints, the total sum of the r-th powers of the distances is minimized. Note that this definition extends capacitated k-median (for r = 1), capacitated k-means (for r = 2) and capacitated k-center (for r = ∞). We say a solution is an (α, β)-approximate solution if its cost is at most α times that of the optimum and it violates the capacity constraints by at most a factor β. Given a regular sequential (α, β)-approximation algorithm for capacitated k-clustering in ℓ_r, the previous paper provides an (O(rα), β)-approximation three-pass streaming algorithm. Unfortunately, there is a large constant hidden in O(rα).

In this paper, we develop the first single-pass streaming algorithm for capacitated k-clustering in ℓ_r, using only poly(k d log ∆) space, where d is the dimension and ∆ is the maximum relative range of a coordinate (roughly the ratio between the maximum distance and the minimum distance; we define this notion in the next subsection and discuss it further in Section 2). Given an (α, β)-approximation algorithm for weighted capacitated k-clustering in ℓ_r, and arbitrary positive numbers η and ε, our algorithm provides a ((1 + ε)α, (1 + η)β)-approximate solution (when η, ε or β are not constant, they appear polynomially in the space). Interestingly, unlike the previous algorithm, this algorithm handles both insertions and deletions of data points. Regardless of time complexity, for arbitrary ε, η ∈ (0, 1), our result directly implies (1 + ε, 1 + η)-approximation streaming algorithms for capacitated k-median and capacitated k-means.

By applying the (O(1/ε), 1 + ε)-approximation algorithm of [DL16] for capacitated k-median, and using proper parameters, we obtain a polynomial-time (O(1/ε), 1 + ε)-approximation streaming algorithm for capacitated k-median in poly(k d log ∆) space. Similarly, by applying the (69 + ε)-approximation fixed-parameter tractable algorithm of [XHX+19] for capacitated k-means, we obtain a fixed-parameter tractable (69 + ε, 1 + ε)-approximation streaming algorithm for capacitated k-means in poly(k d log ∆) space. Furthermore, if d is much larger than k/ε, we can apply [MMR19] to reduce the dimension to poly(k/ε). Then our streaming algorithm only needs d · poly(k log ∆) space, though the dependence on k and 1/ε becomes slightly larger.

It is easy to convert our streaming algorithm to a distributed algorithm. We use the same distributed model as [KVW14, WZ16, BWZ16, SWZ17, SWZ19]. In this model we have s machines, where the i-th machine holds a subset of the input points. There is one coordinator, and communication happens only between the coordinator and the other machines. The goal is to bound the communication. In Subsection 1.1 we present our results, and in Subsection 1.2 we discuss our technical contributions.

Other Related Works.
Guha et al. studied the k-median problem in the streaming setting and provided the first single-pass O(1/ε)-approximation algorithm for this problem using O(n^ε) space [GMMO00]. Later, Charikar et al. improved both the approximation factor and the space requirement and provided a constant-approximation streaming algorithm that stores O(k log n) points [COP03]. Braverman et al. developed constant-approximation algorithms in the streaming sliding-window model for both k-median and k-means using O(k log n) space [BLLM16]. In Euclidean space, Har-Peled and Mazumdar gave a (1 + ε)-approximation insertion-only streaming algorithm for k-means and k-median [HPM04]. They developed a coreset that requires O(k ε^{-d} log n) space and used it to provide a streaming algorithm that requires O(k ε^{-d} log^{d+2} n) space. Later, Har-Peled and Kushal provided a coreset of size O(k ε^{-d}) for these problems [HPK05]. In high-dimensional spaces, Chen presented a streaming algorithm for both problems using O(k d ε^{-2} log n) space [Che09].

In this paper, we suppose all input and output points are in {1, 2, · · · , ∆}^d for some ∆, d ∈ Z_{≥1}. This assumption is without loss of generality, since if the clustering cost is non-zero, we can always discretize the space while changing the cost by an arbitrarily small multiplicative error [BFL+17, HSYZ18]. Given a point set Q ⊆ [∆]^d, a strong (η, ε)-coreset of Q for capacitated k-clustering in ℓ_r is a subset of points Q′ ⊆ Q with weights w′ : Q′ → R_{>0} such that for any capacity t ≥ ⌈|Q|/k⌉ and any set of k centers Z = {z_1, z_2, · · · , z_k},

1/(1 + ε) · cost^{(r)}_{(1+η)t}(Q, Z) ≤ cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε) · cost^{(r)}_t(Q, Z),

where cost^{(r)}_t(Q, Z) denotes the capacitated clustering cost in ℓ_r with respect to centers Z and capacity t, and similarly cost^{(r)}_{t′}(Q′, Z, w′) denotes the weighted version of the ℓ_r capacitated clustering cost. We refer readers to Section 2 for the formal definitions of cost^{(r)}_t(Q, Z) and cost^{(r)}_{t′}(Q′, Z, w′). In this paper, we give the first strong coreset construction for capacitated k-clustering. The size of our coreset is poly(ε^{-1} η^{-1} k d log ∆), and it can be constructed in near-linear time.

Theorem 1.1 (Restatement of Theorem 3.19). For a constant r ≥ 1, given k ∈ Z_{≥1}, ε, η ∈ (0, 0.5) and a point set Q ⊆ [∆]^d with |Q| = n, there is a randomized algorithm which takes O(nd log(nd∆)) time and outputs a subset of points Q′ ⊆ Q with weights w′ : Q′ → R_{>0} such that with probability at least 0.9, (Q′, w′) is a strong (η, ε)-coreset of Q for capacitated k-clustering in ℓ_r and |Q′| ≤ poly(ε^{-1} η^{-1} k d log ∆).

Our coreset can also be constructed efficiently in the streaming and distributed settings. The streaming model studied in this paper is the dynamic streaming model, which allows both insertions and deletions of points.
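To make the cost notation above concrete before the formal definitions in Section 2, here is a brute-force reference sketch of the capacitated cost (our own illustration, not the paper's algorithm; it enumerates all capacity-respecting assignments and is exponential in |Q|, so it is only usable on tiny instances):

```python
import itertools
import math

def cost_t(Q, Z, t, r):
    """Brute-force cost_t^{(r)}(Q, Z): the minimum, over all assignments of Q
    to the centers Z with every cluster of size at most t, of the sum of
    r-th powers of distances; math.inf if no feasible assignment exists."""
    dist = lambda p, z: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, z)))
    best = math.inf
    for assign in itertools.product(range(len(Z)), repeat=len(Q)):
        if all(assign.count(i) <= t for i in range(len(Z))):
            best = min(best, sum(dist(p, Z[i]) ** r for p, i in zip(Q, assign)))
    return best

Z = [(0, 0), (10, 0)]
Q = [(0, 0), (1, 0), (9, 0), (10, 0)]    # balanced data: the capacity changes nothing
Q2 = [(0, 0), (1, 0), (2, 0), (10, 0)]   # skewed data: the capacity forces a move
assert cost_t(Q, Z, t=4, r=1) == cost_t(Q, Z, t=2, r=1) == 2.0
assert cost_t(Q2, Z, t=4, r=1) == 3.0    # points 0, 1, 2 all go to center (0, 0)
assert cost_t(Q2, Z, t=2, r=1) == 9.0    # one of them is pushed to (10, 0)
```

The last two lines show the phenomenon the capacity constraint creates: tightening t from 4 to 2 forces one nearby point to a far center, which is exactly why a point's cost can no longer be read off from its nearest center alone.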
Theorem 1.2 (Restatement of Theorem 4.5). For a constant r ≥ 1, given k ∈ Z_{≥1}, ε, η ∈ (0, 0.5) and a point set Q ⊆ [∆]^d obtained by a stream of insertions and deletions, there is a streaming algorithm which takes one pass over the stream and with probability at least 0.9 outputs a strong (η, ε)-coreset (Q′, w′) of Q for capacitated k-clustering in ℓ_r. Furthermore, both |Q′| and the space of the streaming algorithm are at most poly(ε^{-1} η^{-1} k d log ∆).
In the distributed model, the input is distributed among s machines. Each machine can only communicate with the coordinator. The goal in this model is to design a protocol with small communication cost.

Theorem 1.3 (Restatement of Theorem 4.7). For a constant r ≥ 1, given k ∈ Z_{≥1}, ε, η ∈ (0, 0.5) and a point set Q ⊆ [∆]^d partitioned among s machines, there is a distributed protocol which on termination with probability at least 0.9 leaves a subset of points Q′ ⊆ Q with weights w′ : Q′ → R_{>0} such that (Q′, w′) is a strong (η, ε)-coreset of Q for capacitated k-clustering in ℓ_r, and the size of the coreset is at most poly(ε^{-1} η^{-1} k d log ∆). Furthermore, the total communication cost is at most s · poly(ε^{-1} η^{-1} k d log ∆) bits.

Let us first discuss how to construct a strong coreset for capacitated k-means. Later we will show how to generalize the idea to capacitated k-clustering in ℓ_r for general r ≥ 1.

Our starting point is a common partitioning approach [Che09, BFL+17, HSYZ18] for k-clustering coreset construction. The size-n input point set Q ⊆ [∆]^d is partitioned into poly(k d log ∆) parts P_1, P_2, · · · , P_s such that if we move all points in each part to an arbitrary point in that part, the optimal k-means cost of the moved points does not change by too much. In other words,

Σ_{i=1}^s |P_i| · (max_{p,q ∈ P_i} dist(p, q))² ≤ poly(k d log ∆) · OPT_{k-means},   (1)

where OPT_{k-means} = min_{Z ⊂ [∆]^d : |Z| = k} Σ_{p ∈ Q} dist²(p, Z).

Let us briefly review the sampling-based strong coreset construction for the standard k-means problem. Consider a fixed set of k centers Z ⊂ [∆]^d. Each point p ∈ P_i is sampled with probability poly(ε^{-1} k d log ∆)/|P_i|, and each sampled point is assigned a weight w(p) equal to the inverse of its sampling probability. The expected number of sampled points is poly(ε^{-1} k d log ∆), and Σ_{sampled p ∈ P_i} w(p) · dist²(p, Z) is an unbiased estimator of Σ_{p ∈ P_i} dist²(p, Z). By the triangle inequality, ∀ p′, q′ ∈ P_i, |dist(p′, Z) − dist(q′, Z)| ≤ max_{p,q ∈ P_i} dist(p, q). This upper bounds the variance of the cost of each sampled point. By the Bernstein inequality, with high probability, for every part P_i, the difference between Σ_{sampled p ∈ P_i} w(p) · dist²(p, Z) and Σ_{p ∈ P_i} dist²(p, Z) is at most ε Σ_{p ∈ P_i} dist²(p, Z) + ε/poly(k d log ∆) · |P_i| · (max_{p,q ∈ P_i} dist(p, q))². The additive error term ε/poly(k d log ∆) · |P_i| · (max_{p,q ∈ P_i} dist(p, q))² is acceptable, since by Equation (1) the total additive error is bounded by ε · OPT_{k-means} and thus becomes a small relative error. Notice that the total number of choices of Z is at most ∆^{kd}.
By taking a union bound over all possible choices of Z, with high probability, the sampled points together with their weights form a strong coreset for k-means of size poly(ε^{-1} k d log ∆). We refer readers to [Che09, FL11, BFL16, BFL+17, HSYZ18] for more details and history.

Unfortunately, the above analysis breaks for capacitated k-means. Due to the capacity constraints, a point may not be assigned to its closest center in a capacitated k-means solution. If we look only at the samples from the given point set, it is unclear how to determine the cost of each sampled point without looking at the entire point set. This is an obstacle to obtaining an unbiased estimator of the capacitated k-means cost. To construct an unbiased estimator, we need a simple way to determine the cost of each sampled point. Again, consider a fixed set of k centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d. If we know that each point p ∈ Q is assigned to the center π(p) ∈ Z, then Σ_{sampled p} w(p) dist²(p, π(p)) is an unbiased estimator of the clustering cost with respect to the assignment π : Q → Z, i.e., Σ_p dist²(p, π(p)). In addition, Σ_{sampled p : π(p) = z_i} w(p) is an unbiased estimator of the number of points assigned to the center z_i. If for every assignment π, Σ_{sampled p} w(p) dist²(p, π(p)) is a good approximation of Σ_p dist²(p, π(p)), and ∀ i ∈ [k], Σ_{sampled p : π(p) = z_i} w(p) is a good estimate of the size of the cluster with center z_i, then the capacitated clustering cost of the samples is a good approximation of the capacitated clustering cost of Q, provided we allow some relaxation of the capacity constraints.

Figure 1: A cross denotes a point assigned to z_i and a dot denotes a point assigned to z_j. We can find a hyperplane separating the two clusters. Suppose a point on the right side is assigned to z_i and a point on the left side is assigned to z_j. By the Pythagorean theorem, the total cost of these two points is a² + b² + c² + d². If we switch the assignments of these two points, the cost is a² + b′² + c² + d′², which is smaller since b′² + d′² < b² + d². Thus, if two clusters cannot be separated by a hyperplane, the assignment cannot be optimal.
However, the number of possible choices of the assignment π can be as large as k^n. This implies that if we want a good estimate for every assignment π, we need at least Ω(log(k^n)) = Ω(n log k) samples, which is even worse than keeping the entire point set Q. The issue with the above attempt is that it asks for uniformly good estimates over all possible assignments. To handle this issue, we should reduce the number of assignments that we care about. The observation is that if an assignment π is not an optimal assignment for any capacity constraint, then we do not care about the quality of the estimated cost for π. We then hope that the number of assignments which can be optimal for some capacity constraint is small.

Our main technical contribution is finding a good structure of the possible optimal assignments; this structure can be used to upper bound the number of such assignments. Consider an assignment π : Q → Z which is optimal for some capacity constraint. For two centers z_i, z_j, by the Pythagorean theorem there must be a (d − 1)-dimensional hyperplane separating the points in the cluster with center z_i from the points in the cluster with center z_j (see Figure 1), and this hyperplane is perpendicular to the line connecting z_i and z_j. Let H_(i,j) denote the half-space on the side of the hyperplane containing z_i. Similarly, we denote by H_(j,i) the other side, containing z_j. For every pair of centers z_i, z_j, we can always define the half-space H_(i,j) in the above way. It is clear that π(p) = z_i if and only if p ∈ ∩_{j ≠ i} H_(i,j). In other words, the assignment π is determined by the set of all half-spaces {H_(i,j) | i ≠ j}. Since Q ⊆ [∆]^d, the number of possible half-spaces H_(i,j) for each pair i, j is at most ∆^d. The total number of possible sets of half-spaces {H_(i,j) | i ≠ j} is at most (∆^d)^{k(k−1)/2} ≤ ∆^{O(dk²)}. This implies that the number of possible optimal assignments is at most ∆^{O(dk²)}, which is much less than the total number of all possible assignments.

Although we are now able to obtain an unbiased estimator of the cost of the optimal assignment π, there remains an issue in bounding the variance of the cost of the samples. For two points p, q from the same part P_i, the difference between dist(p, π(p)) and dist(q, π(q)) may be much larger than dist(p, q) (see Figure 2 for an example). While the difference between dist(p, π(p)) and dist(q, π(q)) can be arbitrarily large in general, it is upper bounded by dist(p, q) if p and q are assigned to the same center, i.e., π(p) = π(q). This observation motivates us to further conceptually partition P_i into k regions, where each region contains the points assigned to the same center. If each region either contains no point of P_i or contains at least |P_i|/poly(ε^{-1} k d log ∆) points, then we can estimate Σ_{p ∈ P_i} dist²(p, π(p)) by sampling each point p ∈ P_i with probability poly(ε^{-1} k d log ∆)/|P_i|, and with high probability the total estimation error becomes a small relative error, as discussed earlier in this section. Unfortunately, there could be a region containing far fewer than |P_i|/poly(ε^{-1} k d log ∆) points. These points may contribute a lot to the cost, since they may be far away from their center, yet we may sample none of them due to the relatively low sampling rate, which can cause a large approximation error.

Figure 2: p is assigned to the center z_1 and q is assigned to the center z_2. The difference between dist(p, z_1) and dist(q, z_2) depends on dist(z_1, z_2) and thus can be arbitrarily large.

To handle this, we develop a method to transfer the optimal assignment π to another assignment π′ : Q → Z such that π′ approximately satisfies the capacity constraints and does not increase the cost by too much. Furthermore, for each part P_i and center z_j, either no point of P_i is assigned to z_j by π′, or at least |P_i|/poly(ε^{-1} k d log ∆) points of P_i are assigned to z_j by π′. If we estimate the cost of π′, the variance of the cost of the sampled points has a good upper bound. Thus, with high probability, we can estimate the cost of π′. By taking a union bound over all possible choices of Z and all possible transferred assignments π′, we can prove that the sampled points form a strong coreset for capacitated k-means with high probability.

Next, let us discuss how to extend the above idea for capacitated k-means to general capacitated k-clustering in ℓ_r for r ≥ 1. The main difficulty is that we may not be able to find a hyperplane separating two clusters, since we cannot apply the Pythagorean theorem to the ℓ_r cost (see Figure 3). Fortunately, we can find a curved hyperplane that separates two clusters. Consider two centers z_i and z_j. We define a curved hyperplane as {x ∈ R^d | dist^r(x, z_i) − dist^r(x, z_j) = a} for some parameter a ∈ R. For an assignment π : Q → Z, if dist^r(p, z_i) − dist^r(p, z_j) < a and dist^r(q, z_i) − dist^r(q, z_j) > a but π(p) = z_j and π(q) = z_i, then π cannot be an optimal assignment, since dist^r(p, z_i) + dist^r(q, z_j) < dist^r(q, z_i) + dist^r(p, z_j), which implies that switching the assigned centers of p and q gives an assignment with smaller cost. Thus, for an optimal assignment and any two clusters under that assignment, there always exists a curved hyperplane separating the two clusters.
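The exchange arguments above can be checked mechanically on toy data. The sketch below (our own names and example; squared distances, as in capacitated k-means) searches two clusters for a pair of points whose swap lowers the cost; an optimal assignment admits no such pair, which is exactly the hyperplane-separation property:

```python
def sq(p, q):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

def improving_swap(points, assign, zi, zj):
    """Return a pair (p, q), assigned to zi and zj respectively, whose swap
    strictly lowers the total squared-distance cost, or None if cluster i
    entirely precedes cluster j in the order given by
    f(x) = dist^2(x, zi) - dist^2(x, zj), i.e. a separating hyperplane exists."""
    f = lambda x: sq(x, zi) - sq(x, zj)
    for p in points:
        for q in points:
            if assign[p] == zi and assign[q] == zj and f(p) > f(q):
                return p, q  # swapping p and q strictly decreases the cost
    return None

zi, zj = (0, 0), (10, 0)
points = [(1, 0), (2, 0), (8, 0), (9, 0)]
# a capacity-2 assignment that mixes the two sides: not optimal
bad = {(1, 0): zi, (8, 0): zi, (2, 0): zj, (9, 0): zj}
# the separated assignment: no improving swap exists
good = {(1, 0): zi, (2, 0): zi, (8, 0): zj, (9, 0): zj}
assert improving_swap(points, bad, zi, zj) is not None
assert improving_swap(points, good, zi, zj) is None
```

Note that a swap preserves both cluster sizes, so the improvement never violates the capacity constraint; this is why the argument applies to optimal assignments under any capacity.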
If we replace the half-spaces in the previous paragraphs with the half-spaces defined by the above curved hyperplanes, the argument works for k-clustering in ℓ_r for general r ≥ 1.

Since the partition P_1, P_2, · · · , P_s and its ℓ_r variant can be computed in the streaming model (with both insertions and deletions) and in the distributed model [BFL+17, HSYZ18], and we only need to sample the points in each part at a uniform sampling rate, our strong coreset construction can be easily implemented in the streaming and distributed settings.
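The curved-hyperplane structure can also be exercised numerically: sorting points by f_r(x) = dist^r(x, z_i) − dist^r(x, z_j) and cutting at a threshold yields exactly the curved half-spaces, and no swap across the cut can lower the ℓ_r cost (a toy sketch of ours, here with r = 1, the capacitated k-median case):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def f_r(p, zi, zj, r):
    # score defining the curved hyperplane {x : dist^r(x, zi) - dist^r(x, zj) = a}
    return dist(p, zi) ** r - dist(p, zj) ** r

zi, zj, r = (0, 0), (6, 0), 1
pts = [(1, 2), (2, -1), (4, 3), (5, 0)]
pts.sort(key=lambda p: f_r(p, zi, zj, r))
# any prefix of this order is one side of a curved hyperplane; cut at capacity 2
cluster_i, cluster_j = pts[:2], pts[2:]
cost = sum(dist(p, zi) ** r for p in cluster_i) + sum(dist(p, zj) ** r for p in cluster_j)
# swapping any pair across the cut changes the cost by f_r(p) - f_r(q) >= 0,
# so no swap can improve: the prefix assignment is optimal for these cluster sizes
for p in cluster_j:
    for q in cluster_i:
        swapped = cost - dist(p, zj) ** r - dist(q, zi) ** r \
                       + dist(p, zi) ** r + dist(q, zj) ** r
        assert swapped >= cost - 1e-9
```

The design point is that the total cost equals a constant (assigning everyone to z_j) plus the sum of f_r over cluster i, so for fixed cluster sizes the optimum is always a prefix of this order; that is the threshold structure the half-space counting argument exploits.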
Preliminaries
We use [n] to denote the set {1, 2, · · · , n}. For any x ∈ R and a ∈ R_{>0}, we use x ± a to denote the interval (x − a, x + a). For any x ∈ R_{≥0} and ε ∈ (0, 1), we use (1 ± ε) · x to denote the interval ((1 − ε) · x, (1 + ε) · x). Consider two points x, y ∈ R^d. If ∃ i ∈ [d] such that x_1 = y_1, x_2 = y_2, · · · , x_{i−1} = y_{i−1} and x_i < y_i, then x is smaller than y in the alphabetical order. For r ≥ 1, we use ‖x‖_r to denote the ℓ_r norm of x ∈ R^d, i.e., ‖x‖_r = (Σ_{i=1}^d |x_i|^r)^{1/r}. We use dist(x, y) to denote the Euclidean distance between x and y, i.e., dist(x, y) = ‖x − y‖_2.

Fact 2.1.
For r ≥ 1 and any x, y, z ∈ R^d, dist^r(x, z) ≤ 2^{r−1} (dist^r(x, y) + dist^r(y, z)).

Proof. dist^r(x, z) ≤ (dist(x, y) + dist(y, z))^r ≤ 2^{r−1} (dist^r(x, y) + dist^r(y, z)), where the first step follows from the triangle inequality and the second step follows from the convexity of t ↦ t^r.

Definition 2.2 (Half-space). Consider r ≥ 1 and two points z_1, z_2 ∈ [∆]^d. Sort all points x_1, x_2, · · · , x_{∆^d} ∈ [∆]^d such that ∀ i ∈ [∆^d − 1], either dist^r(x_i, z_1) − dist^r(x_i, z_2) < dist^r(x_{i+1}, z_1) − dist^r(x_{i+1}, z_2), or dist^r(x_i, z_1) − dist^r(x_i, z_2) = dist^r(x_{i+1}, z_1) − dist^r(x_{i+1}, z_2) and x_i is smaller than x_{i+1} in the alphabetical order. Let t be an arbitrary integer in [∆^d]; then the set H = {x_1, x_2, · · · , x_t} is an ℓ_r-half-space corresponding to (z_1, z_2, t).

Given a set of points Z ⊂ R^d and a point x ∈ R^d, we define dist(x, Z) = dist(Z, x) = min_{y ∈ Z} dist(x, y). For two sets of points P, Q ⊂ R^d, we define the distance between P and Q as dist(P, Q) = min_{p ∈ P, q ∈ Q} dist(p, q). Consider an arbitrary set S. If S = S_1 ∪ S_2 ∪ · · · ∪ S_s and ∀ i ≠ j ∈ [s], S_i ∩ S_j = ∅, then S_1, S_2, · · · , S_s is a partition of S, and each S_i is called a part of the partition. We use S = S_1 ∪̇ S_2 ∪̇ · · · ∪̇ S_s to denote that S_1, S_2, · · · , S_s partition S. For a point set Q ⊂ [∆]^d, a set of centers Z = {z_1, z_2, · · · , z_k} ⊆ [∆]^d with |Z| = k and a size parameter t ≥ |Q|/k, we define

cost^{(r)}_t(Q, Z) = min_{S_1, S_2, · · · , S_k : Q = S_1 ∪̇ S_2 ∪̇ · · · ∪̇ S_k, ∀ i ∈ [k], |S_i| ≤ t}  Σ_{i=1}^k Σ_{p ∈ S_i} dist^r(p, z_i).

Figure 3: The optimal clusters for capacitated k-median may not be separable by a hyperplane, but it is possible to separate the clusters by a curved hyperplane. For example, two clusters of capacitated k-median in 2-dimensional space may be separated by a branch of a hyperbola.

For t = ∞, we define cost^{(r)}_∞(Q, Z) = Σ_{p ∈ Q} dist^r(p, Z). For convenience of notation, we use cost^{(r)}(Q, Z) to denote cost^{(r)}_∞(Q, Z) for short. Similarly, we can define a weighted version of the cost function. Suppose each point p ∈ Q has a weight w(p). We define

cost^{(r)}_t(Q, Z, w) = min_{S_1, S_2, · · · , S_k : Q = S_1 ∪̇ S_2 ∪̇ · · · ∪̇ S_k, ∀ i ∈ [k], Σ_{p ∈ S_i} w(p) ≤ t}  Σ_{i=1}^k Σ_{p ∈ S_i} w(p) · dist^r(p, z_i).

If there is no partition Q = S_1 ∪̇ S_2 ∪̇ · · · ∪̇ S_k satisfying ∀ i ∈ [k], Σ_{p ∈ S_i} w(p) ≤ t, we define cost^{(r)}_t(Q, Z, w) = ∞. We denote cost^{(r)}(Q, Z, w) = cost^{(r)}_∞(Q, Z, w) = Σ_{p ∈ Q} w(p) · dist^r(p, Z).

Fact 2.3.
Consider a point set Q ⊆ [∆]^d and a parameter k ∈ Z_{≥1}. For r ≥ 1 and ε, η ∈ (0, 0.5), if Q′ ⊆ Q and w′ : Q′ → R_{>0} satisfy that ∀ t ≥ |Q|/k and ∀ Z ⊂ [∆]^d with |Z| = k,

1/(1 + ε) · cost^{(r)}_{(1+η)t}(Q, Z) ≤ cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε) · cost^{(r)}_t(Q, Z),

then for any Ẑ ⊂ [∆]^d with |Ẑ| = k which satisfies

cost^{(r)}_{(1+η)βt}(Q′, Ẑ, w′) ≤ α · min_{Z ⊂ [∆]^d : |Z| = k} cost^{(r)}_{(1+η)t}(Q′, Z, w′)

for some α, β ≥ 1, we have

cost^{(r)}_{(1+O(η))βt}(Q, Ẑ) ≤ (1 + O(ε)) α · min_{Z ⊆ [∆]^d : |Z| = k} cost^{(r)}_t(Q, Z).

Proof.
We have

cost^{(r)}_{(1+η)βt}(Q, Ẑ) ≤ (1 + ε) · cost^{(r)}_{(1+η)βt}(Q′, Ẑ, w′) ≤ (1 + ε) α · min_{Z ⊂ [∆]^d : |Z| = k} cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε)² α · min_{Z ⊂ [∆]^d : |Z| = k} cost^{(r)}_t(Q, Z),

where the first and the last steps follow from the strong coreset property of (Q′, w′) (applied with capacities βt and t, respectively), and the second step follows from the assumption that Ẑ is an (α, β)-approximation on the coreset. Notice that 1 + η = 1 + O(η) and (1 + ε)² = 1 + O(ε) since η, ε ∈ (0, 0.5). We complete the proof.

The Coreset Construction
In this section we show an offline construction of the coreset and give an analysis of the algorithm. In Section 3.1, we show how to partition the input point set. The partitioning scheme follows the common thread of work [Che09, BFL+17, HSYZ18]. Next, we describe our algorithm in Algorithm 2. To prove the correctness of our algorithm, we develop a novel half-space argument. We show the details of the argument in Section 3.2; this is the main technical contribution of this section. In Section 3.3, given a set of centers and capacity constraints, we show how to efficiently compute a good assignment for the coreset and build a good representation of a good assignment for the original point set.
In this section, we use a common approach to partition the point set. We refer readers to [Che09, BFL+17, HSYZ18] for more details and history of this partitioning approach. We defer all missing proofs of this section to Appendix A.

Let us partition the space [∆]^d by a randomly shifted hierarchical grid structure. Without loss of generality, we suppose ∆ = 2^L for some integer L. We choose a vector v ∈ R^d such that each entry is an i.i.d. sample drawn uniformly from [0, ∆]. Then we impose L + 1 levels of grids G_0, G_1, · · · , G_L, where the grid G_i partitions the space R^d into cells with side length g_i = ∆/2^i, and there is a cell which has a corner at location v. More precisely, ∀ i ∈ {0, 1, · · · , L},

G_i = { C | C = [v_1 + g_i t_1, v_1 + g_i (t_1 + 1)) × · · · × [v_d + g_i t_d, v_d + g_i (t_d + 1)), t_1, t_2, · · · , t_d ∈ Z }.

For convenience, we also define the grid G_{−1} in the same way. Since each cell in G_{−1} has side length g_{−1} = 2∆, there must be a cell which contains all the points in [∆]^d. If C ∈ G_i, C′ ∈ G_j and C ⊆ C′, then we call cell C′ an ancestor of cell C. For a point p ∈ R^d, if p ∈ C for some cell C ∈ G_i, then we define c_i(p) = C. Similarly, for a point set P ⊂ R^d, if P ⊆ C for some cell C ∈ G_i, then we denote c_i(P) = C.

Consider an input point set Q ⊆ [∆]^d and a parameter k ∈ Z_{≥1}. We denote OPT^{(r)}_{k-clus} = min_{Z ⊆ [∆]^d : |Z| ≤ k} cost^{(r)}(Q, Z). Let us review the heavy cell partitioning scheme (Algorithm 1). In Algorithm 1, once the heavy cells are determined, the partitioning of Q = ∪̇_{i=0}^L ∪̇_{j=1}^{s_i} Q_{i,j} is determined. Thus, we explicitly store all the heavy cells and only conceptually partition Q into Q_{0,1}, Q_{0,2}, · · · , Q_{0,s_0}, Q_{1,1}, · · · , Q_{1,s_1}, · · · , Q_{L,s_L} for the analysis.

Definition 3.1. If the estimated size τ(C ∩ Q) computed in Algorithm 1 satisfies either τ(C ∩ Q) ∈ |C ∩ Q| ± 0.1 T_i(o) or τ(C ∩ Q) ∈ (1 ± 0.1) · |C ∩ Q|, then the estimated size τ(C ∩ Q) is good for C.

If the guess o is close to OPT^{(r)}_{k-clus}, then with good probability the number of heavy cells cannot be too large. The main reason is that there cannot be too many center cells. Let Z* ⊂ [∆]^d with |Z*| ≤ k be an optimal solution of the standard ℓ_r k-clustering problem on Q, i.e., cost^{(r)}(Q, Z*) = OPT^{(r)}_{k-clus}. We call a cell C ∈ G_i a center cell if dist(C, Z*) ≤ g_i/d. Let F denote the event that the total number of center cells is at most kL.

Lemma 3.2 (Lemma 14 of [HSYZ18]). F happens with probability at least 0.99.

Algorithm 1: Partitioning via Heavy Cells
Predetermined: o ∈ [1, ∆^d · (√d ∆)^r], a guess of the optimal standard ℓ_r k-clustering cost.
Input: Q ⊆ [∆]^d, k ∈ Z_{≥1}.
  Impose randomly shifted grids G_{−1}, G_0, · · · , G_L.
  for i := −1 → L − 1 do
    T_i(o) ← 0.1 · o/(√d g_i)^r.
    for C ∈ G_i with C ∩ Q ≠ ∅ do
      Estimate the size of |C ∩ Q| up to some precision, and let τ(C ∩ Q) be the estimated size.
      If τ(C ∩ Q) ≥ T_i(o) and all the ancestors of C are heavy, mark C as heavy. Otherwise, if all the ancestors of C are marked as heavy, mark C as crucial.
    end for
  end for
  For C ∈ G_L, if all the ancestors of C are heavy, mark C as crucial.
  ∀ i ∈ {0, 1, · · · , L}, let s_i denote the number of heavy cells in G_{i−1}. Partition Q = ∪̇_{i=0}^L ∪̇_{j=1}^{s_i} Q_{i,j}: Q_{i,j} = ∪_{crucial Ĉ ∈ G_i : Ĉ ⊂ C} (Ĉ ∩ Q), where C is the j-th heavy cell in G_{i−1}.
Output:
All cells marked as heavy in G_{−1}, G_0, G_1, · · · , G_{L−1}.

In the remainder of the paper, we condition on the event F. Conditioned on F, it can be shown that there cannot be too many heavy cells.

Lemma 3.3 (Number of heavy cells). Suppose o ≤ OPT^{(r)}_{k-clus}. If the estimated size τ(C ∩ Q) computed in Algorithm 1 is good (Definition 3.1) for every cell C, then conditioned on F, the number of heavy cells output by Algorithm 1, Σ_{i=0}^L s_i, is at most (k + d^{1.5r}) L · OPT^{(r)}_{k-clus}/o.

In the following, we show that removing small parts does not change the cost of balanced (capacitated) k-clustering by too much.

Lemma 3.4. ∀ i ∈ {0, 1, · · · , L}, let P_i = {Q_{i,j} | j ∈ [s_i]}, where the Q_{i,j} are the parts computed in Algorithm 1. Let P^N_0 ⊆ P_0, P^N_1 ⊆ P_1, · · · , P^N_L ⊆ P_L be arbitrary subsets of parts satisfying ∀ i ∈ {0, 1, · · · , L}, ∀ P ∈ P^N_i, |P| ≤ γ T_i(o), where γ = min( η/(2^{3r} kL), ε/(2^{3r} (k + d^{1.5r}) L) ) for some arbitrary ε, η ∈ (0, 0.5). Suppose o ≤ OPT^{(r)}_{k-clus}, Σ_{i=0}^L s_i ≤ (k + d^{1.5r}) L, and ∀ i ∈ {−1, 0, · · · , L − 1}, every heavy cell C ∈ G_i satisfies |C ∩ Q| ≥ 0.5 T_i(o). Then ∀ t ≥ |Q|/k and all Z ⊆ [∆]^d with |Z| = k,

cost^{(r)}_t(Q \ Q^N, Z) ≤ cost^{(r)}_t(Q, Z)  and  cost^{(r)}_{(1+η)·t}(Q, Z) ≤ (1 + ε) cost^{(r)}_t(Q \ Q^N, Z),

where Q^N = ∪_{i=0}^L ∪_{P ∈ P^N_i} P.

Our coreset construction is shown in Algorithm 2. In the remainder of this section, we analyze the algorithm. Before we start the proof, let us assume that both τ(∪_{j=1}^{s_i} Q_{i,j}) and τ(Q_{i,j}) are good estimates of |∪_{j=1}^{s_i} Q_{i,j}| and |Q_{i,j}|, respectively.

Definition 3.5.
If the estimated size τ(∪_{j=1}^{s_i} Q_{i,j}) in Algorithm 2 satisfies either τ(∪_{j=1}^{s_i} Q_{i,j}) ∈ Σ_{j=1}^{s_i} |Q_{i,j}| ± 0.1 T_i(o) or τ(∪_{j=1}^{s_i} Q_{i,j}) ∈ (1 ± 0.1) Σ_{j=1}^{s_i} |Q_{i,j}|, then τ(∪_{j=1}^{s_i} Q_{i,j}) is good. If the estimated size τ(Q_{i,j}) in Algorithm 2 satisfies either τ(Q_{i,j}) ∈ |Q_{i,j}| ± 0.1 γ T_i(o) or τ(Q_{i,j}) ∈ (1 ± 0.1) |Q_{i,j}|, then τ(Q_{i,j}) is good.

Algorithm 2: Coreset Construction
Predetermined: o ∈ [1, ∆^d · (√d ∆)^r], a guess of the optimal standard ℓ_r k-clustering cost.
Input: Q ⊆ [∆]^d, k ∈ Z_{≥1}, η, ε ∈ (0, 0.5).
  L ← log ∆, γ ← 2^{−(3r+10)} min( η/(kL), ε/((k + d^{1.5r}) L) ), ξ ← 2^{−(3r+10)} min(ε, η)/(k (k + d^{1.5r}) L), λ ← 2^{3r} k² d L ⌈log(kdL)⌉.
  Compute the partition Q = ∪̇_{i=0}^L ∪̇_{j=1}^{s_i} Q_{i,j}. Set T_i(o) ← 0.1 · o/(√d g_i)^r. //Algorithm 1
  If Σ_{i=0}^L s_i > (k + d^{1.5r}) L, return FAIL.
  For 0 ≤ i ≤ L, let τ(∪_{j=1}^{s_i} Q_{i,j}) be an estimate of Σ_{j=1}^{s_i} |Q_{i,j}| up to some precision.
  Return FAIL if ∃ i ∈ {0, 1, · · · , L}, τ(∪_{j=1}^{s_i} Q_{i,j}) > (k + d^{1.5r}) L · T_i(o).
  for i = 0 → L do
    P^I_i ← ∅, Q′_i ← ∅, φ_i ← min( 1, 2^{3r+10} · λξ/(γ T_i(o)) ).
    For j ∈ [s_i], let τ(Q_{i,j}) be an estimate of |Q_{i,j}| up to some precision. If τ(Q_{i,j}) ≥ γ T_i(o), P^I_i ← P^I_i ∪ {Q_{i,j}}.
    Let ĥ_i : [∆]^d → {0, 1} be a λ-wise independent hash function such that ∀ p ∈ [∆]^d, Pr[ĥ_i(p) = 1] = φ_i.
    For each P ∈ P^I_i and each p ∈ P, add p to Q′_i if ĥ_i(p) = 1.
  end for
Output: Q′ = ∪_{i=0}^L Q′_i, w′ : Q′ → R_{>0}.

Consider a solution of the k-clustering problem. Each cluster can be seen as the set of points which are assigned to the same center.
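The reason Algorithm 2 samples via a hash function ĥ_i rather than independent coin flips is that the decision for a point is then a function of the point itself, so a deletion in the dynamic stream can exactly undo an earlier insertion. A minimal sketch of this idea, with SHA-256 standing in for the λ-wise independent family (the paper's choice, needed for the concentration bounds; ours is only an illustration, and the names are our own):

```python
import hashlib
from collections import Counter

def keep(point, level, phi):
    # deterministic "coin flip": hash the (level, point) pair and keep the
    # point iff the hash lands in the lowest phi-fraction of the range.
    # The answer is reproducible, so an insertion and a later deletion of
    # the same point in a dynamic stream make the same decision.
    h = hashlib.sha256(f"{level}:{point}".encode()).digest()
    return int.from_bytes(h[:4], "big") < phi * 2**32

# maintain the sampled multiset under a stream of insertions and deletions
sampled = Counter()
stream = [("+", (3, 4)), ("+", (7, 1)), ("+", (3, 4)), ("-", (3, 4))]
for op, p in stream:
    if keep(p, level=2, phi=0.5):
        sampled[p] += 1 if op == "+" else -1
# each surviving point would receive weight 1/phi in the coreset
```

Because keep(p, ...) always returns the same answer for the same p, the counter never goes negative on a valid stream, and the final sample depends only on the multiset of surviving points, exactly what a dynamic-stream coreset needs.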
Thus, a solution of k-clustering can be represented by an assignment mapping.

Definition 3.6 (Assignment). Given a point set P ⊆ [∆]^d where each point p ∈ P has a weight w(p) ∈ R_{≥0}, and a set of centers Z ⊂ [∆]^d with |Z| = k for some k ∈ Z_{≥1}, an assignment of P to the centers Z is a mapping π : P → Z. The clustering cost of π is denoted as cost^{(r)}(π) = ∑_{p∈P} w(p)·dist^r(p, π(p)). The size vector s(π) ∈ R^k of the clusters is defined as ∀i ∈ [k], s(π)_i = ∑_{p∈P : π(p)=z_i} w(p). If we do not specify the weight function w(·) explicitly in the context, we suppose each point p ∈ P has w(p) = 1.

Next, we show that an assignment mapping can be defined by a set of half-spaces.

Definition 3.7 (Assignment half-spaces). Given a set of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k for some k ∈ Z_{≥1}, a set of assignment half-spaces H corresponding to Z has (k choose 2) half-spaces (Definition 2.2), i.e., H = {H_{(i,j)} | 1 ≤ i < j ≤ k}, where H_{(i,j)} is a half-space corresponding to (z_i, z_j, t_{(i,j)}) for some integer t_{(i,j)} ∈ [∆^d]. For j < i, we denote H_{(i,j)} as [∆]^d \ H_{(j,i)}. For a point set P ⊆ [∆]^d, if for every p ∈ P there always exists a unique i ∈ [k] such that ∀j ∈ [k], j ≠ i, it holds that p ∈ H_{(i,j)}, we say H is valid for P, and the assignment mapping π : P → Z corresponding to H is defined as: ∀p ∈ P, π(p) = z_i, where i satisfies ∀j ≠ i, p ∈ H_{(i,j)}.

Consider the input point set Q and centers Z. We can always find a set of half-spaces such that we can use these half-spaces to determine the assigned center for each point p ∈ Q without looking at other points in Q. In the following, we formalize this argument and extend it to the weighted case.

Lemma 3.8 (Cost and assignment half-spaces).
Consider a point set Q with at most m different weights, i.e., Q = Q_1 ∪̇ Q_2 ∪̇ · · · ∪̇ Q_m ⊆ [∆]^d, where ∀i ∈ [m], every p ∈ Q_i has weight w(p) = w_i. For any set of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k for some k ∈ Z_{≥1}, and any t ≥ 0 with cost^{(r)}_t(Q, Z, w) ≠ ∞, there always exist m sets of assignment half-spaces H^{(1)}, H^{(2)}, · · · , H^{(m)} corresponding to Z such that ∀i ∈ [m], H^{(i)} is valid for Q_i, and cost^{(r)}_t(Q, Z, w) = ∑_{i=1}^{m} cost^{(r)}(π_i) and ‖∑_{i=1}^{m} s(π_i)‖_∞ ≤ t, where π_i : Q_i → Z is an assignment mapping corresponding to H^{(i)}.

Proof. Let π* : Q → Z be an optimal assignment mapping, i.e., cost^{(r)}_t(Q, Z, w) = cost^{(r)}(π*). For l ∈ [m], let us construct H^{(l)} = {H^{(l)}_{(i,j)} | i < j ∈ [k]} as follows. Consider Q_l, z_i, and z_j (i < j). We can sort Q_l = {p_1, p_2, · · · , p_{|Q_l|}} such that ∀a ∈ [|Q_l| − 1], either dist^r(p_a, z_i) − dist^r(p_a, z_j) < dist^r(p_{a+1}, z_i) − dist^r(p_{a+1}, z_j), or dist^r(p_a, z_i) − dist^r(p_a, z_j) = dist^r(p_{a+1}, z_i) − dist^r(p_{a+1}, z_j) and p_a is smaller than p_{a+1} in the alphabetic order. Consider the largest a such that π*(p_a) = z_i and the smallest a′ such that π*(p_{a′}) = z_j.

Claim 3.9. If a > a′, then w(p_a)·dist^r(p_a, z_j) + w(p_{a′})·dist^r(p_{a′}, z_i) = w(p_a)·dist^r(p_a, z_i) + w(p_{a′})·dist^r(p_{a′}, z_j), and the alphabetic order of p_{a′} is smaller than that of p_a.

Proof. Since both p_a and p_{a′} are from Q_l, we know that w(p_a) = w(p_{a′}). Since a > a′, we have:

dist^r(p_{a′}, z_i) − dist^r(p_{a′}, z_j) ≤ dist^r(p_a, z_i) − dist^r(p_a, z_j).
If dist^r(p_{a′}, z_i) − dist^r(p_{a′}, z_j) < dist^r(p_a, z_i) − dist^r(p_a, z_j), then

w(p_a)·dist^r(p_a, z_j) + w(p_{a′})·dist^r(p_{a′}, z_i) < w(p_a)·dist^r(p_a, z_i) + w(p_{a′})·dist^r(p_{a′}, z_j),

which implies that if we switch the assignments of p_a and p_{a′}, we get a better solution, contradicting the optimality of π*.

If a < a′, we can find a half-space H^{(l)}_{(i,j)} such that every p ∈ Q_l with π*(p) = z_i satisfies p ∈ H^{(l)}_{(i,j)} and every p ∈ Q_l with π*(p) = z_j satisfies p ∈ H^{(l)}_{(j,i)}. Otherwise, by Claim 3.9 we can switch the assignments of p_a and p_{a′}, which neither increases the cost nor changes the number of points assigned to each center, and we can try to construct H^{(l)} for the switched assignment mapping. Notice that the switch decreases the sum of the alphabetic ranks of the points assigned to z_i. Therefore, the switching operation terminates.

As discussed in the above lemma, the assigned center of a point may be determined by a set of half-spaces. We can define the set of points (not necessarily in Q) which should be assigned to the same center according to the half-spaces as a region. Most regions are the intersection of several half-spaces. However, in some situations the assignment half-spaces may not be valid for the underlying input points, and thus there are some points which cannot be assigned to any center according to the half-spaces. In this case, we define an additional region for these points.

Definition 3.10 (Regions induced by assignment half-spaces). Given a set of assignment half-spaces H = {H_{(i,j)} | i < j ∈ [k]}, if R_0 = {p ∈ [∆]^d | ∀i ∈ [k], ∃j ≠ i, p ∉ H_{(i,j)}} and ∀i ∈ [k], R_i = {p ∈ [∆]^d | ∀j ∈ [k], j ≠ i, p ∈ H_{(i,j)}}, then (R_0, R_1, · · · , R_k) are the regions induced by H.
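Definition 3.10 can be read as a small membership routine: a point lands in R_i exactly when it is on z_i's side of every pairwise half-space, and in R_0 when no center wins all of its comparisons. The sketch below uses ordinary Euclidean bisector half-spaces as a stand-in for the paper's half-spaces of Definition 2.2; that choice, and the helper names, are illustrative assumptions.

```python
def region_index(p, centers, in_halfspace):
    """Return i such that p lies in region R_i induced by the assignment
    half-spaces; 0 is the leftover region R_0. in_halfspace(p, i, j)
    decides p in H_(i,j) for i < j; for j < i, H_(i,j) is the
    complement of H_(j,i), as in Definition 3.7."""
    k = len(centers)
    def side(p, i, j):  # is p on z_i's side of the (i, j) half-space?
        return in_halfspace(p, i, j) if i < j else not in_halfspace(p, j, i)
    for i in range(1, k + 1):
        if all(side(p, i, j) for j in range(1, k + 1) if j != i):
            return i
    return 0  # no center wins every comparison: p falls in R_0

def make_bisector_test(centers):
    """Example half-space: p in H_(i,j) iff p is at least as close to z_i
    as to z_j (squared Euclidean); ties go to the smaller index."""
    def in_halfspace(p, i, j):
        zi, zj = centers[i - 1], centers[j - 1]
        di = sum((a - b) ** 2 for a, b in zip(p, zi))
        dj = sum((a - b) ** 2 for a, b in zip(p, zj))
        return di <= dj
    return in_halfspace
```

The key property the analysis uses is visible here: the region of p depends only on p, the centers, and the half-space parameters, never on the other input points.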
Consider a set of points P and a set of assignment half-spaces H. Let (R_0, R_1, · · · , R_k) be the regions induced by H. H may not be valid for P, or there may be some region R_i for i ∈ [k] such that R_i ∩ P ≠ ∅ but |P ∩ R_i| is small. In this case, we want to find an assignment mapping which is almost determined by H and in which each non-empty cluster is large enough.

To achieve this, we can check each point p ∈ P. If p ∈ R_0, or p ∈ R_i for some i such that |P ∩ R_i| is small, we assign p to z_{i*}, where i* ≠ 0 is an index such that region R_{i*} contains many points. Notice that though there may be no assignment mapping for P corresponding to H (since H may be invalid for P), we can always define a transferred assignment mapping for P according to H.

Definition 3.11 (Assignment transfer). Given a threshold T ∈ R_{≥0}, let P ⊂ [∆]^d be a point set where each point p ∈ P has a weight w(p) ∈ R_{≥0} such that ∑_{p∈P} w(p) ≥ 0.9T. Consider a set of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k for some k ∈ Z_{≥1}, a set of assignment half-spaces H = {H_{(i,j)} | i < j ∈ [k]} corresponding to Z, and B = (b_0, b_1, · · · , b_k) ∈ R^{k+1}_{≥0} such that ∀i ∈ {0, 1, · · · , k}, b_i satisfies either b_i ∈ (1 ± ξ)·∑_{p∈R_i∩P} w(p) or b_i ∈ ∑_{p∈R_i∩P} w(p) ± ξT, where ξ ∈ (0, 0.1) and (R_0, R_1, · · · , R_k) are the regions induced by H. Let i* = arg max_{i∈[k]} b_i. A transferred assignment mapping π : P → Z corresponding to (H, B, ξ, T) is defined as:

∀p ∈ P, π(p) = z_i if i ∈ [k], b_i ≥ ξT, and p ∈ R_i; and π(p) = z_{i*} otherwise.

Similar to Definition 3.6, if we do not specify the weights w(·), each point has weight 1. The following lemma shows that if the assignment half-spaces H are valid for the point set P, then the cost of the transferred assignment mapping is close to the cost of the assignment mapping corresponding to H, and furthermore, the number of points whose centers are changed is small.
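The transfer rule of Definition 3.11 can be transcribed directly: keep a point's own region's center only when that region's estimated mass b_i clears the ξT threshold, and otherwise reroute the point to the center of the region with the largest estimate. The `region_membership` callback below (returning the index of the region containing p, 0 for the leftover region) is a hypothetical helper for illustration, not part of the definition.

```python
def transferred_assignment(points, centers, b, xi, T, region_membership):
    """Definition 3.11: map each p to z_i when p lies in R_i (i >= 1) and
    b[i] >= xi * T; otherwise map p to z_{i*}, the center whose region has
    the largest estimate. b has k+1 entries; b[0] corresponds to R_0."""
    k = len(centers)
    i_star = max(range(1, k + 1), key=lambda i: b[i])  # arg max over [k]
    pi = {}
    for p in points:
        i = region_membership(p)
        if i >= 1 and b[i] >= xi * T:
            pi[p] = centers[i - 1]       # region is heavy enough: keep it
        else:
            pi[p] = centers[i_star - 1]  # light region or R_0: reroute
    return pi
```

Regions whose estimate falls below ξT are emptied into the heaviest region, which is exactly why every non-empty cluster of the transferred mapping is large (cf. Claim 3.15).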
Lemma 3.12 (Transferred assignment does not change the cost too much). Given a threshold T ∈ R_{≥0}, let P ⊆ [∆]^d be a point set where each point p ∈ P has a weight w(p) ∈ R_{≥0} such that ∑_{p∈P} w(p) ≥ 0.9T, and ∀p, q ∈ P, dist(p, q) ≤ √d·g for some g ∈ R_{≥0}. Consider a set of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k for some k ∈ Z_{≥1}, a set of assignment half-spaces H corresponding to Z, and B = (b_0, b_1, · · · , b_k) ∈ R^{k+1}_{≥0} which satisfies the condition mentioned in Definition 3.11 for some ξ ∈ (0, 1/(100k)). If H is valid for P, then

cost^{(r)}(π′) ≤ (1 + 2^{r+4}k·ξ)·cost^{(r)}(π) + ξ·2^{r+1}kT(√d·g)^r and ‖s(π′) − s(π)‖_1 ≤ 16kξ·∑_{p∈P} w(p),

where π : P → Z is an assignment mapping corresponding to H, and π′ : P → Z is a transferred assignment mapping corresponding to (H, B, ξ, T).

Proof. Let i* = arg max_{i∈[k]} b_i. Let (R_0, R_1, · · · , R_k) be the regions induced by H. Since H is valid for P, R_0 ∩ P = ∅. By the pigeonhole principle, we know that

max_{i∈[k]} ∑_{p∈R_i∩P} w(p) ≥ 0.9T/k ≥ T/(2k).

Thus,

b_{i*} ≥ min(T/(2k) − ξT, (1/2)·T/(2k)) ≥ T/(4k),

where the second inequality follows from ξ ≤ 1/(100k). Therefore,

∑_{p∈P : π(p)=z_{i*}} w(p) = ∑_{p∈R_{i*}∩P} w(p) ≥ min(b_{i*} − ξT, b_{i*}/2) ≥ T/(8k).
(2)We have cost ( r ) ( π (cid:48) ) − cost ( r ) ( π ) ≤ (cid:88) p ∈ P : π (cid:48) ( p ) (cid:54) = π ( p ) w ( p ) · dist r ( p, π (cid:48) ( p ))= (cid:88) p ∈ P : π (cid:48) ( p ) (cid:54) = π ( p ) w ( p ) · dist r ( p, z i ∗ ) ≤ (cid:88) p ∈ P : π (cid:48) ( p ) (cid:54) = π ( p ) w ( p ) · (cid:32) r − ( √ dg ) r + 2 r − (cid:80) q ∈ P : π ( q )= z i ∗ w ( q ) · dist r ( q, z i ∗ ) (cid:80) q ∈ P : π ( q )= z i ∗ w ( q ) (cid:33) ≤ k · ξT · (cid:32) r − ( √ dg ) r + 2 r − (cid:80) q ∈ P : π ( q )= z i ∗ w ( q ) · dist r ( q, z i ∗ ) (cid:80) q ∈ P : π ( q )= z i ∗ w ( q ) (cid:33) ≤ ξ · r +1 kT ( √ dg ) r + 2 r +4 k · (cid:88) q ∈ P : π ( q )= z i ∗ w ( q ) · dist ( q, z i ∗ ) ≤ ξ · (cid:16) r +1 kT ( √ dg ) r + 2 r +4 k · cost ( r ) ( π ) (cid:17) , where the third step follows from that there is a point q (cid:48) ∈ P such that π ( q (cid:48) ) = z i ∗ and dist r ( q (cid:48) , z i ∗ ) ≤ (cid:80) q ∈ P : π ( q )= z i ∗ w ( q ) · dist r ( q, z i ∗ ) (cid:80) q ∈ P : π ( q )= z i ∗ w ( q ) by averaging argument, and dist r ( p, z i ∗ ) ≤ r − dist r ( p, q (cid:48) ) + 2 r − dist r ( q (cid:48) , z i ∗ ) (Fact 2.1), the forthstep follows from (cid:80) π (cid:48) ( p ) (cid:54) = π ( p ) w ( p ) ≤ (cid:80) i ∈{ , , , ··· ,k }\{ i ∗ } : b i < ξT max(2 b i , b i + ξT ) ≤ k · ξT , the fifthstep follows from Equation (2).Now, let us consider s ( π (cid:48) ) . We have (cid:107) s ( π (cid:48) ) − s ( π ) (cid:107) ≤ (cid:88) p ∈ P : π (cid:48) ( p ) (cid:54) = π ( p ) w ( p ) ≤ (cid:88) i ∈{ , , , ··· ,k }\{ i ∗ } : b i < ξT max(2 b i , b i + ξT ) ≤ ξ · kT ≤ kξ (cid:88) p ∈ P w ( p ) , where the last step follows from (cid:80) p ∈ P w ( p ) ≥ . T .The following lemma is a concentration bound for summation of random variables with limitedindependence. We need following lemma since we want to prove that our algorithm only needs lim-ited independence. 
If fully independent random samples were allowed in the algorithm, the Bernstein inequality would suffice for the analysis.

Lemma 3.13 ([BR94]). Consider an even integer λ ≥ 2 and a random variable X = ∑_{i=1}^{n} X_i, where the X_i are λ-wise independent random variables taking values in [0, M]. For any a > 0,

Pr[|X − µ| > a] ≤ ((µλM + λ²M²)/a²)^{λ/2},

where µ = E[X].

Consider a set of points P and a set of assignment half-spaces H. Let (R_0, R_1, · · · , R_k) be the regions induced by H. We randomly sample a subset of points P′ from P. We can use the number of points sampled from each region R_i to estimate the number of points in R_i. We can also use the total number of sampled points to estimate the total number of points in P. Based on the estimated number of points in each region, we can define transferred assignment mappings for both P′ and P. We argue that the cost of the transferred assignment mapping for P′ is a good estimation of the cost of the transferred assignment mapping for P. Furthermore, we can use the number of samples in P′ assigned to each center to estimate the number of points in P assigned to each center.

Lemma 3.14 (Estimating the cost of transferred assignment via sampling). Given a threshold T ∈ R_{≥0}, let P ⊆ [∆]^d be a point set such that each point has weight w(p) = 1 and |P| ≥ T. Furthermore, ∀p, q ∈ P, dist(p, q) ≤ √d·g for some g ∈ R_{≥0}. Consider a set of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k for some k ∈ Z_{≥1}, and a set of assignment half-spaces H = {H_{(i,j)} | i < j ∈ [k]} corresponding to Z. For some arbitrary ξ, δ ∈ (0, 0.1), let P′ be a random subset of P such that each point p ∈ P is chosen λ-wise independently with probability φ, where λ = 100k⌈log(k/δ)⌉, φ = min(1, 100·2^{2r}·λ/(ξ²T)). Let each p ∈ P′ have weight w′(p) = 1/φ.
Let (R_0, R_1, · · · , R_k) be the regions (Definition 3.10) induced by H. Let B = (b_0, b_1, · · · , b_k) be such that ∀i ∈ {0, 1, · · · , k}, b_i = ∑_{p∈R_i∩P′} w′(p). With probability at least 1 − δ, all of the following events happen:

1. ∀i ∈ {0, · · · , k}, either b_i ∈ (1 ± ξ)·|R_i ∩ P| or b_i ∈ |R_i ∩ P| ± ξT,
2. ∑_{p∈P′} w′(p) ≥ 0.9T,
3. |cost^{(r)}(π′) − cost^{(r)}(π)| ≤ ξ·(|P|·(√d·g)^r + cost^{(r)}(π)), where both π : P → Z and π′ : P′ → Z are transferred assignment mappings corresponding to (H, B, ξ, T),
4. ‖s(π) − s(π′)‖_1 ≤ ξ|P|.

Proof. We only need to consider the case when φ < 1. Let us first consider event 1. We have ∀i ∈ {0, 1, · · · , k},

E[b_i] = ∑_{p∈R_i∩P} φ·(1/φ) = |R_i ∩ P|.

Consider i ∈ {0, 1, · · · , k}. If |R_i ∩ P| ≤ T, by Lemma 3.13,

Pr[|b_i − |R_i ∩ P|| > ξT] ≤ ((φ|R_i ∩ P|λ + λ²)/(φξT)²)^{λ/2} ≤ ((φTλ + λ²)/(φξT)²)^{λ/2} ≤ δ/(10(k + 1)),

where the last inequality follows from λ ≥
40 log(k/δ) and φ ≥ 100λ/(ξ²T). If |R_i ∩ P| > T, by Lemma 3.13 again,

Pr[|b_i − |R_i ∩ P|| > ξ|R_i ∩ P|] ≤ ((φ|R_i ∩ P|λ + λ²)/(φξ|R_i ∩ P|)²)^{λ/2} ≤ δ/(10(k + 1)),

where the last inequality also follows from λ ≥
40 log(k/δ) and φ ≥ 100λ/(ξ²T). By taking a union bound, event 1 happens with probability at least 1 − δ/4.

Consider event 2. We have

E[∑_{p∈P′} w′(p)] = ∑_{p∈P} φ·(1/φ) = |P|.

By Lemma 3.13,

Pr[||P| − ∑_{p∈P′} w′(p)| > 0.1|P|] ≤ ((φ|P|λ + λ²)/(0.1φ|P|)²)^{λ/2} ≤ δ/4,

where the last step follows from λ ≥
40 log(1/δ) and φ ≥ λ/T ≥ λ/|P|.

Next, let us consider event 3. Consider an arbitrary B̂ = (b̂_0, b̂_1, · · · , b̂_k) which satisfies that ∀i ∈ {0, 1, · · · , k}, either b̂_i ∈ |R_i ∩ P| ± ξT or b̂_i ∈ (1 ± ξ)|R_i ∩ P|. Let π̂ : P → Z be a transferred assignment mapping corresponding to (H, B̂, ξ, T). Let î* = arg max_{i∈[k]} b̂_i.

Claim 3.15. ∀i ∈ [k], either s(π̂)_i = 0 or s(π̂)_i ≥ ξT.

Proof. Consider i ≠ î*. If ∃p ∈ P such that π̂(p) = z_i, then by the definition of the transferred assignment mapping we have b̂_i ≥ ξT, and thus s(π̂)_i = |R_i ∩ P| ≥ min(b̂_i − ξT, 0.5·b̂_i) ≥ ξT. The above argument implies that ∀i ≠ î*, either s(π̂)_i = 0 or s(π̂)_i ≥ ξT.

For î*, if b̂_{î*} ≥ ξT, then s(π̂)_{î*} ≥ |R_{î*} ∩ P| ≥ min(b̂_{î*} − ξT, 0.5·b̂_{î*}) ≥ ξT. If b̂_{î*} < ξT, then ∀i ≠ î*, b̂_i < ξT, which implies that s(π̂)_{î*} = |P| ≥ T ≥ ξT.

For i ∈ [k], let R̂_i = {p ∈ P | π̂(p) = z_i}. We have

E_{P′}[∑_{p∈P′∩R̂_i} w′(p)·dist^r(p, z_i)] = ∑_{p∈R̂_i} φ·(1/φ)·dist^r(p, z_i) = ∑_{p∈R̂_i} dist^r(p, z_i).

We only need to handle the situation when R̂_i ≠ ∅, i.e., s(π̂)_i ≥ ξT. Consider two cases; the first case is that dist(z_i, R̂_i) ≤ √d·g.
In this case, by Lemma 3.13, we have: Pr (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) · dist r ( p, z i ) − (cid:88) p ∈ ˆ R i dist r ( p, z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > ξ | ˆ R i | ( √ dg ) r ≤ (cid:16)(cid:80) p ∈ ˆ R i dist r ( p, z i ) (cid:17) · λ · (cid:16) φ · (dist( z i , ˆ R i ) + √ dg ) r (cid:17) + λ · (cid:16) φ · (dist( z i , ˆ R i ) + √ dg ) r (cid:17) ξ | ˆ R i | ( √ dg ) r λ/ ≤ (cid:32) λφ · | ˆ R i | (dist( z i , ˆ R i ) + √ dg ) r + λ φ · (dist( z i , ˆ R i ) + √ dg ) r ξ | ˆ R i | ( √ dg ) r (cid:33) λ/ (cid:32) λφ · | ˆ R i | (2 √ dg ) r + λ φ · (2 √ dg ) r ξ | ˆ R i | ( √ dg ) r (cid:33) λ/ ≤ δ k k , where the first and the second step follows from triangle inequality, and the last step follows fromthat φ ≥ · r λ/ ( ξ T ) ≥ · r λ/ ( ξ | ˆ R i | ) and λ ≥ k log( k/δ ) . The second case is that dist( z i , ˆ R i ) > √ dg . In this case, by Lemma 3.13, we have: Pr (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) · dist r ( p, z i ) − (cid:88) p ∈ ˆ R i dist r ( p, z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ ξ (cid:88) p ∈ ˆ R i dist r ( p, z i ) ≤ (cid:16)(cid:80) p ∈ ˆ R i dist r ( p, z i ) (cid:17) · λ · (cid:16) φ · (dist( z i , ˆ R i ) + √ dg ) r (cid:17) + λ · (cid:16) φ · (cid:16) dist( z i , ˆ R i ) + √ dg (cid:17) r (cid:17) ξ (cid:16)(cid:80) p ∈ ˆ R i dist r ( p, z i ) (cid:17) λ/ ≤ (cid:32) λφ · | ˆ R i | (dist( z i , ˆ R i ) + √ dg ) r + λ φ · (dist( z i , ˆ R i ) + √ dg ) r ξ | ˆ R i | dist r ( z i , ˆ R i ) (cid:33) λ/ ≤ (cid:32) λφ · | ˆ R i | (2 dist( z i , ˆ R i )) r + λ φ · (2 dist( z i , ˆ R i )) r ξ | ˆ R i | dist r ( z i , ˆ R i ) (cid:33) λ/ ≤ δ k k , where the first and the second step follows from triangle inequality, and the last step follows from φ ≥ · r λ/ ( ξ T ) ≥ · r λ/ ( ξ | ˆ R i | ) and λ ≥ k log( k/δ ) .Thus, we know that with probability at least − δ k k , 
(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) · dist r ( p, z i ) − (cid:88) p ∈ ˆ R i dist r ( p, z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ξ | ˆ R i | ( √ dg ) r + (cid:88) p ∈ ˆ R i dist r ( p, z i ) . By taking union bound over i ∈ [ k ] , with probability at least − δ k k , we have: (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) w (cid:48) ( p ) dist r ( p, ˆ π ( p )) − cost ( r ) (ˆ π ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ k (cid:88) i =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) · dist r ( p, z i ) − (cid:88) p ∈ ˆ R i dist r ( p, z i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ k (cid:88) i =1 ξ | ˆ R i | ( √ dg ) r + (cid:88) p ∈ ˆ R i dist ( p, z i ) ≤ ξ (cid:16) | P | ( √ dg ) r + cost ( r ) (ˆ π ) (cid:17) . Notice that though different ˆ B may induce a different assignment mapping ˆ π , the total number ofpossible ˆ π cannot be too large. This is because ˆ i ∗ only has k choices and for i ∈ { , , · · · , k } \ { ˆ i ∗ } , p ∈ R i ∩ P is assigned to z ˆ i ∗ or every point p ∈ R i ∩ P is assigned to z i . Basedon this observation, the total number of different assignment mapping ˆ π is upper bounded by k · k .By taking union bound over all possible ˆ π , with probability at least − δ/ , for any possible choiceof ˆ B , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) w (cid:48) ( p ) dist r ( p, ˆ π ( p )) − cost ( r ) (ˆ π ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ξ (cid:16) | P | ( √ dg ) r + cost ( r ) (ˆ π ) (cid:17) . Condition on event 1, B is a possible choice of ˆ B . 
Thus, with probability at least − δ/ , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) w (cid:48) ( p ) dist r ( p, π ( p )) − cost ( r ) ( π ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ξ (cid:16) | P | ( √ dg ) r + cost ( r ) ( π ) (cid:17) . Notice that, by the construction of π and π (cid:48) , ∀ p ∈ P (cid:48) , we have π (cid:48) ( p ) = π ( p ) . Thus, with probabilityat least − δ/ , | cost ( r ) ( π (cid:48) ) − cost ( r ) ( π ) | ≤ ξ ( | P | ( √ dg ) r + cost ( r ) ( π )) Finally, let us consider event 4. Similar to the argument for event 3, let us still consider anarbitrary ˆ B = (ˆ b , ˆ b , · · · , ˆ b k ) which satisfies that ∀ i ∈ { , , · · · , k } , either ˆ b i ∈ | R i ∩ P | ± ξT or ˆ b i ∈ (1 ± ξ ) | R i ∩ P | . Let ˆ π : P → Z be a transferred assignment mapping corresponding to ( H , ˆ B, ξ, T ) . Let ˆ i ∗ = arg max i ∈ [ k ] ˆ b i . For i ∈ [ k ] , let ˆ R i = { p ∈ P | ˆ π ( p ) = z i } . We have E P (cid:48) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) = (cid:88) p ∈ ˆ R i φ · φ = | ˆ R i | = s (ˆ π ) i . By Claim 3.15, s (ˆ π ) i is either or at least ξT . We only need to handle the situation when s (ˆ π ) i isat least ξT . By Lemma 3.13, Pr (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) − s (ˆ π ) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > ξs (ˆ π ) i ≤ (cid:32) | ˆ R i | · λ · φ + λ · φ ξ | ˆ R i | (cid:33) λ/ ≤ δ k k , where the last step follows from that φ ≥ λ/ ( ξ T ) ≥ λ/ ( ξ | ˆ R i | ) and λ ≥ k log( k/δ ) . Bytaking union bound over all i ∈ [ k ] , with probability at least − δ k k , k (cid:88) i =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) p ∈ P (cid:48) ∩ ˆ R i w (cid:48) ( p ) − s (ˆ π ) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ξ · k (cid:88) i =1 s (ˆ π ) i = ξ | P | . 
By taking a union bound over all possible different π̂, with probability at least 1 − δ/4, for any possible choice of B̂, we have

∑_{i=1}^{k} |∑_{p∈P′∩R̂_i} w′(p) − s(π̂)_i| ≤ ξ|P|.

B is a possible choice of B̂. Thus, with probability at least 1 − δ/4,

∑_{i=1}^{k} |∑_{p∈P′∩R̂_i} w′(p) − s(π)_i| ≤ ξ|P|.

Notice that, by the construction of π and π′, ∀p ∈ P′, we have π′(p) = π(p). Thus, with probability at least 1 − δ/4, ‖s(π) − s(π′)‖_1 ≤ ξ|P|.

By taking a union bound over all four events, we complete the proof.

Notice that the previous lemma only works for P and a fixed set of assignment half-spaces. To make the argument work for P and all sets of assignment half-spaces, we just need to slightly raise the sampling probability and take a union bound over all possible sets of assignment half-spaces.

Lemma 3.16 (Estimation is good for all choices of centers and half-spaces). Given a threshold T ∈ R_{≥0}, let P ⊆ [∆]^d be a point set such that each point has weight w(p) = 1 and |P| ≥ T. Furthermore, ∀p, q ∈ P, dist(p, q) ≤ √d·g for some g ∈ R_{≥0}. Let ξ, δ ∈ (0, 0.1), k ∈ Z_{≥1}, L = log ∆. Let P′ be a random subset of P such that each point p ∈ P is chosen λ-wise independently with probability φ, where λ = 2000k²dL⌈log(k/δ)⌉, φ = min(1, 100·2^{2r}·λ/(ξ²T)). Let each p ∈ P′ have weight w′(p) = 1/φ.
With probability at least 1 − δ/2, the following event happens:

• For any choice of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k and any choice of a set of assignment half-spaces H = {H_{(i,j)} | i < j ∈ [k]} corresponding to Z, B^{Z,H} = (b^{Z,H}_0, b^{Z,H}_1, · · · , b^{Z,H}_k) satisfies that ∀i ∈ {0, 1, · · · , k}, either b^{Z,H}_i ∈ (1 ± ξ)·|R^{Z,H}_i ∩ P| or b^{Z,H}_i ∈ |R^{Z,H}_i ∩ P| ± ξT, where (R^{Z,H}_0, R^{Z,H}_1, · · · , R^{Z,H}_k) are the regions induced by H (Definition 3.10) and ∀i ∈ {0, 1, · · · , k}, b^{Z,H}_i = ∑_{p∈R^{Z,H}_i∩P′} w′(p); and furthermore,

|cost^{(r)}(π′_{Z,H}) − cost^{(r)}(π_{Z,H})| ≤ ξ·(|P|·(√d·g)^r + cost^{(r)}(π_{Z,H})) and ‖s(π′_{Z,H}) − s(π_{Z,H})‖_1 ≤ ξ|P|,

where both π_{Z,H} : P → Z and π′_{Z,H} : P′ → Z are transferred assignment mappings corresponding to (H, B^{Z,H}, ξ, T).

In addition, with probability at least 1 − δ/2, ∑_{p∈P′} w′(p) ≥ 0.9T.

Proof. By event 2 of Lemma 3.14, with probability at least 1 − δ/2, ∑_{p∈P′} w′(p) ≥ 0.9T.

By Lemma 3.14 again, for any fixed Z and H, the second event happens with probability at least 1 − δ/(2∆^{2dk²}). Since Z ⊂ [∆]^d and |Z| = k, the total number of possible choices of Z is at most ∆^{dk}. For a particular choice of Z, since H contains (k choose 2) ≤ k² half-spaces and there are ∆^d possible choices for each half-space, the total number of possible H corresponding to Z is at most ∆^{dk²}. Hence the total number of possible choices of (Z, H) is at most ∆^{dk}·∆^{dk²} ≤ ∆^{2dk²}. By taking a union bound over all possible choices of (Z, H), with probability at least 1 − δ/2, the second event happens.

We are now ready to prove the correctness of our algorithm.
In Lemma 3.18, we will prove that if o is a good approximation of OPT^{(r)}_{k-clus}, Algorithm 2 does not output FAIL with high probability. In the following lemma, we suppose this happens. Furthermore, we assume that the estimated sizes τ(C ∩ Q), τ(∪_{j=1}^{s_i} Q_{i,j}), τ(Q_{i,j}) in Algorithm 1 and Algorithm 2 are good estimations of |C ∩ Q|, |∪_{j=1}^{s_i} Q_{i,j}| and |Q_{i,j}| respectively. For the offline algorithm, it is easy to compute the exact values of |C ∩ Q|, |∪_{j=1}^{s_i} Q_{i,j}| and |Q_{i,j}|. For the streaming and distributed algorithms, we will explain how to get good estimations with high probability in Section 4. In the following lemma, we show that if o is a suitable choice, then we can output a strong coreset with high probability.

At a high level, to prove correctness we apply Lemma 3.12 and Lemma 3.16 to each part Q_{i,j} for which |Q_{i,j}| is Ω(γT_i(o)). We can show that the total error induced by each part Q_{i,j} is relatively small.

Lemma 3.17 (Correctness of the construction). Suppose the following conditions are satisfied:
1. Algorithm 2 does not output FAIL,
2. o ≤ OPT^{(r)}_{k-clus},
3. all estimated sizes τ(∪_{j=1}^{s_i} Q_{i,j}), τ(Q_{i,j}) in Algorithm 2 and τ(C ∩ Q) in Algorithm 1 called by Algorithm 2 are good (Definition 3.5, Definition 3.1).

Let Q′ ⊆ Q, w′ : Q′ → R_{≥0} be the output of Algorithm 2. With probability at least 0.9, ∀t ≥ |Q|/k, Z ⊂ [∆]^d with |Z| = k,

cost^{(r)}_{(1+η)t}(Q, Z) ≤ (1 + ε)·cost^{(r)}_t(Q′, Z, w′) and cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε)·cost^{(r)}_t(Q, Z).

Proof.
By line 9 of Algorithm 2, ∀i ∈ {0, 1, · · · , L}, ∀P ∈ P^I_i, we have |P| ≥ min(0.9·τ(P), τ(P) − 0.1·γT_i(o)) ≥ 0.5·γT_i(o). Consider an arbitrary P ∈ P^I_i. Since ∀p, q ∈ P, c_{i−1}(p) = c_{i−1}(q), we have dist(p, q) ≤ √d·g_{i−1} = 2√d·g_i. Let P′ = P ∩ Q′. Let E(P) be the following events:

1. ∑_{p∈P′} w′(p) ≥ 0.9·0.5·γT_i(o) = 0.45·γT_i(o).
2. For any choice of centers Z = {z_1, z_2, · · · , z_k} ⊂ [∆]^d with |Z| = k and any choice of a set of assignment half-spaces H = {H_{(j,j′)} | j < j′ ∈ [k]} corresponding to Z, B^{P,Z,H} = (b^{P,Z,H}_0, b^{P,Z,H}_1, · · · , b^{P,Z,H}_k) satisfies that ∀j ∈ {0, 1, · · · , k}, either b^{P,Z,H}_j ∈ (1 ± ξ)·|R^{Z,H}_j ∩ P| or b^{P,Z,H}_j ∈ |R^{Z,H}_j ∩ P| ± ξ·0.5·γT_i(o), where (R^{Z,H}_0, R^{Z,H}_1, · · · , R^{Z,H}_k) are the regions (Definition 3.10) induced by H, and b^{P,Z,H}_j = ∑_{p∈P∩Q′∩R^{Z,H}_j} w′(p); and furthermore,

|cost^{(r)}(π′_{P′,Z,H}) − cost^{(r)}(π_{P,Z,H})| ≤ ξ·(|P|·(2√d·g_i)^r + cost^{(r)}(π_{P,Z,H})) and ‖s(π′_{P′,Z,H}) − s(π_{P,Z,H})‖_1 ≤ ξ|P|,

where both π_{P,Z,H} : P → Z and π′_{P′,Z,H} : P′ → Z are transferred assignment mappings corresponding to (H, B^{P,Z,H}, ξ, 0.5·γT_i(o)).

By Lemma 3.16, with probability at least 1 − 1/(10(k + d^{0.5r})L), E(P) happens. Notice that ∑_{i=0}^{L} |P^I_i| ≤ ∑_{i=0}^{L} s_i ≤ (k + d^{0.5r})L. By taking a union bound over all i ∈ {0, 1, · · · , L}, P ∈ P^I_i, with probability at least 0.9, ∀i ∈ {0, 1, · · · , L}, P ∈ P^I_i, the event E(P) happens. In the remainder of the proof, we condition on all E(P).

First, let us focus on proving cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε)·cost^{(r)}_t(Q, Z). Let Q^I = ∪_{i=0}^{L} ∪_{P∈P^I_i} P.
According to Lemma 3.8, there is a set of assignment half-spaces H I correspondingto Z such that H I = { H I ( j,j (cid:48) ) | j < j (cid:48) ∈ [ k ] } is valid for Q I and cost ( r ) t ( Q I , Z ) = cost ( r ) ( π I ) , (3)19here π I : Q I → Z is an assignment mapping corresponding to H I . For i ∈ { , , · · · , L } , P ∈P Ii , let π IP : P → Z be the assignment mapping such that ∀ p ∈ P, π IP ( p ) = π I ( p ) . Let ˆ π IP : P → Z be a transferred assignment mapping corresponding to ( H I , B P,Z, H I , ξ, . γT i ( o )) ,where B P,Z, H I = ( b P,Z, H I , b P,Z, H I , · · · , b P,Z, H I k ) is the same as defined in the event E ( P ) , i.e., ∀ j ∈ { , , · · · , k } , b P,Z, H I j = (cid:80) p ∈ P ∩ Q (cid:48) ∩ R Z, H Ij w (cid:48) ( p ) , where ( R Z, H I , R Z, H I , · · · , R Z, H I k ) are regions(Definition 3.10) induced by H I . Let ˆ π IP ∩ Q (cid:48) : P ∩ Q (cid:48) → Z be a transferred assignment mappingwhich is also corresponding to ( H I , B P,Z, H I , ξ, . γT i ( o )) , where each point p ∈ P ∩ Q (cid:48) has weight w (cid:48) ( p ) . We have: cost ( r ) t ( Q, Z ) ≥ cost ( r ) t ( Q I , Z ) = cost ( r ) ( π I ) = L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) ( π IP ) , (4)where the first step follows from that Q I is a subset of Q , the second step follows from Equation (3).Consider i ∈ { , , · · · , L } and P ∈ P I . Algorithm 2 tells us | P | ≥ . γT i ( o ) . Since all points in P are in the same cell in G i − , we know that ∀ p, q ∈ P , dist( p, q ) ≤ √ d · g i . Then by Lemma 3.12, (1 + 2 r +4 k · ξ ) · cost ( r ) ( π IP ) ≥ cost ( r ) (ˆ π IP ) − ξ · r +1 k · . γT i ( o ) · ( √ d · g i ) r . (5)Since r +4 k ξ ≤ (cid:15)/ and ξ ≤ (cid:15) · r k ( k + d . r ) L , we have: (1 + (cid:15)/ · cost ( r ) t ( Q, Z ) ≥ (1 + (cid:15)/ · L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) ( π IP ) (Equation (4)) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16) cost ( r ) (ˆ π IP ) − ξ · r +1 k · . 
γT i ( o ) · ( √ d · g i ) r (cid:17) (cid:16) r +4 k ξ ≤ (cid:15) and Equation (5) (cid:17) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:18) cost ( r ) (ˆ π IP ) − (cid:15) k + d . r ) L · . T i ( o ) · ( √ dg i ) r (cid:19) (cid:18) ξ ≤ (cid:15) · r k ( k + d . r ) L (cid:19) = L (cid:88) i =0 (cid:88) P ∈P Ii (cid:18) cost ( r ) (ˆ π IP ) − (cid:15) k + d . r ) L · . o (cid:19) (cid:18) T i ( o ) = 0 . o ( √ dg i ) r (cid:19) = L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ) − (cid:15) k + d . r ) L · . o · L (cid:88) i =0 |P Ii | ( a ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ) − (cid:15) · o (cid:32) L (cid:88) i =0 |P Ii | ≤ k + d . r ) L (cid:33) ( b ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ) − (cid:15) · cost ( r ) t ( Q, Z ) , (6)where step (a) follows from (cid:80) Li =0 |P Ii | ≤ (cid:80) Li =0 s i ≤ k + d . r ) L which is according to Algo-rithm 2, and step (b) follows from o ≤ OPT ( r ) k -clus ≤ cost ( r ) t ( Q, Z ) . Consider i ∈ { , , · · · , L } and20 ∈ P I . Event E ( P ) shows that (1 + ξ ) · cost ( r ) (ˆ π IP ) ≥ cost ( r ) ( π IP ∩ Q (cid:48) ) − ξ · | P | ( √ d · g i ) r . (7)Since ξ ≤ (cid:15) · r ( kL + d . r ) L , we have: (1 + ξ ) · L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16) cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − ξ · | P | ( √ d · g i ) r (cid:17) (Equation (7)) ( a ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − L (cid:88) i =0 ξ · · ( kL + d . r ) T i ( o )( √ d · g i ) r ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − L (cid:88) i =0 (cid:15)L · T i ( o ) · ( √ dg i ) r (cid:18) ξ ≤ (cid:15) · r ( kL + d . r ) L (cid:19) = L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − L (cid:88) i =0 (cid:15)L · . o ( T i ( o ) = 0 . 
o/ ( √ dg i ) r ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − (cid:15) · o ( b ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − (cid:15) · cost ( r ) t ( Q, Z ) , (8)where step (a) follows from (cid:80) P ∈P Ii | P | ≤ (cid:80) s i j =1 | Q i,j | ≤ · ( kL + d . r ) T i ( o ) and step (b) followsfrom that o ≤ OPT ( r ) k -clus ≤ cost ( r ) t ( Q, Z ) . Consider the total weights of points assigned to eachcenter by ˆ π IP ∩ Q (cid:48) for all P ∈ P Ii and all i ∈ { , , · · · , L } . We have: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) i =0 (cid:88) P ∈P Ii s (ˆ π IP ∩ Q (cid:48) ) − s ( π I ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) i =0 (cid:88) P ∈P Ii (cid:0) s (ˆ π IP ∩ Q (cid:48) ) − s (ˆ π IP ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) i =0 (cid:88) P ∈P Ii (cid:0) s (ˆ π IP ) − s ( π IP ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (triangle inequality) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13) s (ˆ π IP ∩ Q (cid:48) ) − s (ˆ π IP ) (cid:13)(cid:13) ∞ + L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13) s (ˆ π IP ) − s ( π IP ) (cid:13)(cid:13) ∞ (triangle inequality) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13) s (ˆ π IP ∩ Q (cid:48) ) − s (ˆ π IP ) (cid:13)(cid:13) + L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13) s (ˆ π IP ) − s ( π IP ) (cid:13)(cid:13) ( ∀ x ∈ R k , (cid:107) x (cid:107) ∞ ≤ (cid:107) x (cid:107) ) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13) s (ˆ π IP ) − s ( π IP ) (cid:13)(cid:13) (event E ( P ) )21 L (cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii kξ | P | (Lemma 3.12) ≤ ξ | Q I | + 16 kξ | Q I | ≤ ηt (cid:18) ξ ≤ η k , Q I ⊆ Q, t ≥ | Q | k (cid:19) . 
Since (cid:107) s ( π I ) (cid:107) ∞ ≤ t , we know that L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) ≥ cost ( r )(1+ η ) t ( Q (cid:48) , Z, w (cid:48) ) . By combining above inequality with Equation (6) and Equation (8), we have: (1 + (cid:15)/ · cost ( r ) t ( Q, Z ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ) − (cid:15) · cost ( r ) t ( Q, Z ) ≥
11 + ξ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π IP ∩ Q (cid:48) ) − (cid:15) · cost ( r ) t ( Q, Z ) − (cid:15) · cost ( r ) t ( Q, Z ) ≥
11 + ξ (cid:16) cost ( r )(1+ η ) t ( Q (cid:48) , Z, w (cid:48) ) − (cid:15) · cost ( r ) t ( Q, Z ) (cid:17) − (cid:15) · cost ( r ) t ( Q, Z ) . Since ξ ≤ (cid:15)/ and (cid:15) ∈ (0 , . , we can conclude that cost ( r )(1+ η ) t ( Q (cid:48) , Z, w (cid:48) ) ≤ (1 + (cid:15) ) · cost ( r ) t ( Q, Z ) . Next, let us focus on proving (1 + (cid:15) ) · cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) ≥ cost ( r )(1+ η ) t ( Q, Z ) . We only need to con-sider the case when cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) (cid:54) = ∞ . Since Q (cid:48) has at most L + 1 different weights, accordingto Lemma 3.8, there are m sets of assignment half-spaces H (cid:48) (0) = (cid:110) H (cid:48) (0)( j,j (cid:48) ) | j < j (cid:48) ∈ [ k ] (cid:111) , H (cid:48) (1) = (cid:110) H (cid:48) (1)( j,j (cid:48) ) | j < j (cid:48) ∈ [ k ] (cid:111) , · · · , H (cid:48) ( L ) = (cid:110) H (cid:48) ( L )( j,j (cid:48) ) | j < j (cid:48) ∈ [ k ] (cid:111) corresponding to Z such that ∀ i ∈{ , , · · · , L } , H (cid:48) ( i ) is valid for Q (cid:48) i , and cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) = L (cid:88) i =0 cost ( r ) ( π Q (cid:48) i ) , (9)where π Q (cid:48) i : Q (cid:48) i → Z is an assignment mapping corresponding to H (cid:48) ( i ) . For i ∈ { , , · · · , L } , P ∈ P Ii ,let π P ∩ Q (cid:48) i : P ∩ Q (cid:48) i → Z be the assignment mapping such that ∀ p ∈ P ∩ Q (cid:48) i , π P ∩ Q (cid:48) i ( p ) = π Q (cid:48) i ( p ) , and let ˆ π P ∩ Q (cid:48) i : P ∩ Q (cid:48) i → Z be a transferred assignment mapping corresponding to ( H (cid:48) ( i ) , B P,Z, H (cid:48) ( i ) , ξ, . 
γT i ( o )) , where B P,Z, H (cid:48) ( i ) = ( b P,Z, H (cid:48) ( i ) , b P,Z, H (cid:48) ( i ) , · · · , b P,Z, H (cid:48) ( i ) k ) is the sameas defined in the event E ( P ) , i.e., ∀ j ∈ { , , · · · , k } , b P,Z, H (cid:48) ( i ) j = (cid:80) p ∈ P ∩ Q (cid:48) i ∩ R Z, H(cid:48) ( i ) j w (cid:48) ( p ) , where ( R Z, H ( i ) , R Z, H ( i ) , · · · , R Z, H ( i ) k ) are regions (Definition 3.10) induced by H (cid:48) ( i ) . Let ˆ π P : P → Z alsobe a transferred assignment mapping corresponding to ( H (cid:48) ( i ) , B P,Z, H (cid:48) ( i ) , ξ, . γT i ( o )) .22e have: cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) = L (cid:88) i =0 cost ( r ) ( π Q (cid:48) i ) = L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) ( π P ∩ Q (cid:48) i ) , (10)where the second step follows from Equation (9). Notice that when we compute cost ( r ) ( π Q (cid:48) i ) and cost ( r ) ( π P ∩ Q (cid:48) i ) , each point p has weight w (cid:48) ( p ) . For i ∈ { , , · · · , L } and P ∈ P I , consider the pointssampled from P , i.e., P ∩ Q (cid:48) i . Event E ( P ) tells us that (cid:80) p ∈ P ∩ Q (cid:48) i w (cid:48) ( p ) ≥ . · . γT i ( o ) . Since allpoints in P are in the same cell in G i − , we know that ∀ p, q ∈ P , dist( p, q ) ≤ √ d · g i which impliesthat ∀ p, q ∈ P ∩ Q (cid:48) i , dist( p, q ) ≤ √ d · g i . Then by Lemma 3.12, we have: (1 + 2 r +4 k ξ ) · cost ( r ) ( π P ∩ Q (cid:48) i ) ≥ cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − ξ · r +1 k · . γT i ( o ) · ( √ d · g i ) r . (11)Since r +4 k ξ ≤ (cid:15)/ and ξ ≤ (cid:15) · r k ( k + d . r ) L , we have: (1 + (cid:15)/ · cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) ≥ (1 + (cid:15)/ · L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) ( π P ∩ Q (cid:48) i ) (Equation (10)) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16) cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − ξ · r +1 k · . 
γT i ( o ) · ( √ d · g i ) r (cid:17) (cid:16) r +4 k ξ ≤ (cid:15) and Equation (11) (cid:17) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:18) cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − (cid:15) k + d . r ) L · T i ( o ) · ( √ dg i ) r (cid:19) (cid:18) ξ ≤ (cid:15) · r k ( k + d . r ) L (cid:19) = L (cid:88) i =0 (cid:88) P ∈P Ii (cid:18) cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − (cid:15) k + d . r ) L · . o (cid:19) (cid:18) T i ( o ) = 0 . o ( √ dg i ) r (cid:19) = L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − (cid:15) k + d . r ) L · . o · L (cid:88) i =0 |P Ii | ( a ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − (cid:15) · o (cid:32) L (cid:88) i =0 |P Ii | ≤ k + d . r ) L (cid:33) ( b ) ≥ L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ∩ Q (cid:48) i ) − (cid:15) · cost ( r )(1+ η ) t ( Q, Z ) , (12)where step (a) follows from (cid:80) Li =0 |P Ii | ≤ (cid:80) Li =0 s i ≤ k + d . r ) L which is according to Algo-rithm 2, and step (b) follows from o ≤ OPT ( r ) k -clus ≤ cost ( r )(1+ η ) t ( Q, Z ) . Consider i ∈ { , , · · · , L } and P ∈ P I . Event E ( P ) shows that cost ( r ) (ˆ π P ∩ Q (cid:48) i ) ≥ (1 − ξ )cost ( r ) (ˆ π P ) − ξ | P | ( √ d · g i ) r . (13)Since ξ ≤ (cid:15) · r ( kL + d . r ) L , L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ∩ Q (cid:48) i ) L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16) (1 − ξ ) · cost ( r ) (ˆ π P ) − ξ | P | ( √ d · g i ) r (cid:17) (Equation (13)) ( a ) ≥ (1 − ξ ) L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) − L (cid:88) i =0 ξ · kL + d . r ) T i ( o )( √ d · g i ) r ≥ (1 − ξ ) L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) − L (cid:88) i =0 (cid:15)L · T i ( o ) · ( √ dg i ) r (cid:18) ξ ≤ (cid:15) · r ( kL + d . r ) L (cid:19) =(1 − ξ ) L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) − L (cid:88) i =0 (cid:15)L · . o ( T i ( o ) = 0 . 
o/ ( √ dg i ) r ) ≥ (1 − ξ ) L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) − (cid:15) · o ( b ) ≥ (1 − ξ ) L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) − (cid:15) · cost ( r )(1+ η ) t ( Q, Z ) , (14)where step (a) follows from (cid:80) P ∈P Ii | P | ≤ (cid:80) s i j =1 | Q i,j | ≤ · ( kL + d . r ) T i ( o ) and step (b) followsfrom that o ≤ OPT ( r ) k -clus ≤ cost ( r )(1+ η ) t ( Q, Z ) . Consider the total number of points assigned to eachcenter by ˆ π P for all P ∈ P Ii and all i ∈ { , , · · · , L } . We have: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) i =0 (cid:88) P ∈P Ii s (ˆ π P ) − L (cid:88) i =0 s ( π Q (cid:48) i ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16) s (ˆ π P ) − s ( π P ∩ Q (cid:48) i ) (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13)(cid:13) s (ˆ π P ) − s ( π P ∩ Q (cid:48) i ) (cid:13)(cid:13)(cid:13) ∞ (triangle inequality) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16)(cid:13)(cid:13)(cid:13) s (ˆ π P ) − s (ˆ π P ∩ Q (cid:48) i ) (cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13) s (ˆ π P ∩ Q (cid:48) i ) − s ( π P ∩ Q (cid:48) i ) (cid:13)(cid:13)(cid:13) ∞ (cid:17) (triangle inequality) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii (cid:16)(cid:13)(cid:13)(cid:13) s (ˆ π P ) − s (ˆ π P ∩ Q (cid:48) i ) (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) s (ˆ π P ∩ Q (cid:48) i ) − s ( π P ∩ Q (cid:48) i ) (cid:13)(cid:13)(cid:13) (cid:17) (cid:16) ∀ x ∈ R k , (cid:107) x (cid:107) ∞ ≤ (cid:107) x (cid:107) (cid:17) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii (cid:13)(cid:13)(cid:13) s (ˆ π P ∩ Q (cid:48) i ) − s ( π P ∩ Q (cid:48) i ) (cid:13)(cid:13)(cid:13) (event E ( P ) ) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii kξ (cid:88) p ∈ P ∩ Q (cid:48) i w (cid:48) ( p ) (Lemma 3.12)24 L 
(cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii kξ (cid:13)(cid:13)(cid:13) s (cid:16) ˆ π P ∩ Q (cid:48) i (cid:17)(cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13) s (cid:16) ˆ π P ∩ Q (cid:48) i (cid:17)(cid:13)(cid:13)(cid:13) = (cid:88) p ∈ P ∩ Q (cid:48) i w (cid:48) ( p ) ≤ L (cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii kξ ( (cid:107) s (ˆ π P ) (cid:107) + ξ | P | ) (event E ( P ) ) = L (cid:88) i =0 (cid:88) P ∈P Ii ξ | P | + L (cid:88) i =0 (cid:88) P ∈P Ii kξ (1 + ξ ) | P | ( (cid:107) s (ˆ π P ) (cid:107) = | P | ) ≤ ξ | Q I | + 16 kξ (1 + ξ ) | Q I | ≤ ηt/ (cid:18) ξ ≤ η k , Q I ⊆ Q, t ≥ | Q | k (cid:19) . Since (cid:107) (cid:80) Li =0 s ( π Q (cid:48) i ) (cid:107) ∞ ≤ t , we know that L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) ≥ cost ( r )(1+ η/ t ( Q I , Z ) . (15)Notice that ∀ i ∈ { , , · · · , L } , j ∈ [ s i ] with Q i,j (cid:54)∈ P Ii , we know that | Q i,j | ≤ γT i ( o ) due toAlgorithm 2. Because γ ≤ min (cid:16) η · r kL , (cid:15) · r ( k + d . r ) L (cid:17) and we can apply Lemma 3.4, we have cost ( r )(1+ η/ t ( Q I , Z ) ≥ cost ( r )(1+ η/ t ( Q, Z ) / (1 + (cid:15)/ ≥ cost ( r )(1+ η ) t ( Q, Z ) / (1 + (cid:15)/ , where the last step follows from (1 + η/ ≤ η . By Equation (15), we have: L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ) ≥ cost ( r )(1+ η ) t ( Q, Z ) / (1 + (cid:15)/ . By Equation (14), we have: L (cid:88) i =0 (cid:88) P ∈P Ii cost ( r ) (ˆ π P ∩ Q (cid:48) i ) ≥ (1 − ξ )cost ( r )(1+ η ) t ( Q, Z ) / (1 + (cid:15)/ − (cid:15) · cost ( r )(1+ η ) t ( Q, Z ) . By Equation (12), we have: (1 + (cid:15)/ · cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) ≥ (1 − ξ )cost ( r )(1+ η ) t ( Q, Z ) / (1 + (cid:15)/ − · (cid:15) · cost ( r )(1+ η ) t ( Q, Z ) . Since ξ ≤ (cid:15)/ and (cid:15) ≤ . , we can reorder the terms in the above equation to conclude that cost ( r )(1+ η ) t ( Q, Z ) ≤ (1 + (cid:15) )cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) . 
Next, we consider the success probability and the size of the coreset. As shown in Lemma 3.2, F happens with probability at least . , i.e., with high probability, there should not be too manycenter cells. Furthermore, as explained before previous lemma, we can suppose all the estimatedsizes τ ( C ∩ Q ) , τ (cid:16)(cid:83) s i j Q i,j (cid:17) and τ ( Q i,j ) are good estimations to | C ∩ Q | , (cid:12)(cid:12)(cid:12)(cid:83) s i j =1 Q i,j (cid:12)(cid:12)(cid:12) and | Q i,j | .Condition on these events, Algorithm 2 does not output FAIL, and with high probability, the sizeof the outputted coreset is small. 25 emma 3.18 (Success probability and the size) . Suppose the following conditions are satisfied:1. F happens (Lemma 3.2),2. OPT ( r ) k -clus / ≤ o ≤ OPT ( r ) k -clus ,3. All estimated size τ (cid:16)(cid:83) s i j =1 Q i,j (cid:17) in Algorithm 2 and τ ( C ∩ Q ) in Algorithm 1 called by Algo-rithm 2 are good (Definition 3.5, Definition 3.1).Algorithm 2 does not return FAIL, and with probability at least . , | Q (cid:48) | ≤ · · r +10) rk d ( k + d . r ) L log( kdL )min( (cid:15), η ) . Proof.
By Lemma 3.3, since event F , (cid:80) Li =0 s i ≤ k + d . r ) L · ≤ k + d . r ) L whichimplies that Algorithm 2 does not return FAIL in line 5.Recall that Z ∗ ⊂ [∆] d with | Z ∗ | ≤ k is the optimal solution of the standard (cid:96) r k -clusteringproblem of Q , i.e., cost ( r ) ( Q, Z ∗ ) = OPT ( r ) k -clus , and a cell C ∈ G i a center cell if dist( C, Z ∗ ) ≤ g i /d .Due to F , the total number of center cells is at most kL . Let us consider (cid:80) s i j =1 | Q i,j | for anarbitrary i ∈ { , , · · · , L } . We have: s i (cid:88) j =1 | Q i,j | = (cid:88) C ∈ G i : C is crucial | C ∩ Q | = (cid:88) C ∈ G i : C is crucial, and is a center cell | C ∩ Q | + (cid:88) C ∈ G i : C is crucial, but is not a center cell | C ∩ Q |≤ kL · . T i ( o ) + (cid:88) C ∈ G i : C is crucial, but is not a center cell | C ∩ Q |≤ kL · . T i ( o ) + OPT ( r ) k -clus ( g i /d ) r ≤ kL · . T i ( o ) + 100 d . r T i ( o ) · OPT ( r ) k -clus o ≤ kL · . T i ( o ) + 1000 d . r T i ( o ) ≤ kL + d . r ) T i ( o ) , where the first step follows from the construction of Q i,j , the third step follows from that thenumber of center cells is at most kL and each crucial cell has at most . T i ( o ) points, theforth step follows from that each point p in the non-center cell has distance to Z ∗ at least g i /d , thefifth step follows from the definition of T i ( o ) , and the sixth step follows from o ≥ OPT ( r ) k -clus . Thus, τ (cid:16)(cid:83) s i j =1 Q i,j (cid:17) ≤ kL + d . r ) T i ( o ) which implies that Algorithm 2 does not return FAIL inline 6.Let us analyze the size of the coreset. We have E [ | Q (cid:48) | ] = L (cid:88) i =0 E [ | Q (cid:48) i | ] ≤ L (cid:88) i =0 φ i s i (cid:88) j =1 | Q i,j | ≤ L (cid:88) i =0 φ i · kL + d . r ) T i ( o ) ≤ · · r +10) rk d ( k + d . r ) L log( kdL )min( (cid:15), η ) , φ i ≤ · r +10) · k d ( k + d . r ) L log( kdL )min( (cid:15), η ) T i ( o ) . By Markov’s inequality, with probability at least . , | Q (cid:48) | ≤ · · r +10) rk d ( k + d . 
r ) L log( kdL )min( (cid:15), η ) . The only thing remaining is to find a suitable parameter o for Algorithm 2. Actually, we canenumerate o exponentially and thus there must be some o which is a good choice. For the goodchoice of o , we can output the coreset with high probability. Thus, we can conclude the followingtheorem. Theorem 3.19 (Offline algorithm) . Consider a point set Q ⊆ [∆] d which contains n points andparameters k ∈ Z ≥ , (cid:15), η ∈ (0 , . . For constant r ≥ , there is a randomized algorithm whichoutputs a subset of points Q (cid:48) ⊆ Q and weights w (cid:48) : Q (cid:48) → R > in time O ( nd log ( nd ∆)) such thatwith probability at least . ,1. ∀ t ≥ n/k, Z ⊂ [∆] d with | Z | = k , cost ( r )(1+ η ) t ( Q, Z ) ≤ (1 + (cid:15) )cost ( r ) t ( Q (cid:48) , Z, w (cid:48) ) and cost ( r )(1+ η ) t ( Q (cid:48) , Z, w (cid:48) ) ≤ (1 + (cid:15) )cost ( r ) t ( Q, Z ) , | Q (cid:48) | ≤ poly( (cid:15) − η − kd log ∆) ,Proof. Algorithm 2 needs a parameter o which is an approximation of OPT ( r ) k -clus . We can enumerateall possible o ∈ (cid:110) , , , · · · , n · (cid:16) √ d ∆ (cid:17) r (cid:111) . We choose the smallest o such that Algorithm 2 doesnot output FAIL.Let us consider the running time of Algorithm 2. In line 4 of Algorithm 2, we call Algorithm 1.In Algorithm 1, for each p ∈ Q , we can update the number of points in c i ( p ) for i ∈ {− , , , · · · , L } .Then, for each p ∈ Q , we can check whether c i ( p ) is heavy or not for i ∈ {− , , , · · · , L } . Thus,the total running time of Algorithm 1 is O ( ndL ) . In Algorithm 2, for each point p ∈ Q weshould find the level i ∈ { , , · · · , L } such that c i ( p ) is crucial which takes O ( dL ) time. Toconclude, the total running time of Algorithm 2 is O ( ndL ) . Thus, the overall running time is O ( ndL ) · log( n · ( √ d ∆) r ) = O ( nd log ( nd ∆)) .Let us consider the correctness. According to Lemma 3.2, with probability at least . , F happens. 
According to Lemma 3.18, if OPT^(r)_k-clus / ≤ o ≤ OPT^(r)_k-clus, Algorithm 2 does not output FAIL. Thus we can find an o ≤ OPT^(r)_k-clus such that Algorithm 2 does not output FAIL with probability at least . . If Algorithm 2 does not output FAIL and o ≤ OPT^(r)_k-clus, then, by Lemma 3.17, with probability at least . , ∀t ≥ n/k, Z ⊂ [∆]^d with |Z| = k,

cost^(r)_{(1+η)t}(Q, Z) ≤ (1 + ε) cost^(r)_t(Q′, Z, w′) and cost^(r)_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε) cost^(r)_t(Q, Z),

and furthermore, according to Lemma 3.18, with probability at least . , |Q′| ≤ poly(ε^{-1} η^{-1} kdL).

3.3 Assignment Construction via Coreset

In the classic k-clustering problem, once the centers are determined, each point should simply be assigned to its closest center. But in the capacitated k-clustering problem, even when the centers are determined, it is nontrivial to assign points to centers. In this section, we discuss how to construct a good assignment for the input point set Q given k centers Z = {z_1, z_2, · · · , z_k} and the coreset (Q′, w′) obtained by our construction.

Firstly, given a capacity t′ ≥ max(∑_{q∈Q′} w′(q), |Q|)/k, we want to find an assignment π′ : Q′ → Z such that cost^(r)(π′) ≤ (1 + ε) cost^(r)_{t′}(Q′, Z, w′) and ‖s(π′)‖_∞ ≤ (1 + η) t′. Given centers Z, finding an assignment that satisfies the capacity constraint for weighted points is NP-hard in general, since the bin packing problem reduces to this feasibility problem. If we relax the problem to the fractional version, i.e., allow the weight of a point to be split among multiple centers, then the optimal assignment for the relaxed problem can be found via minimum-cost flow [BBLM14].
Given a fractional assignment, we can use the following procedure to reduce the number of points whose weight is split among multiple centers:

1. Build a bipartite graph as follows: create a vertex for each point and each center, and add an edge between a point vertex and a center vertex if a non-zero fraction of the weight of the point is assigned to the center.

2. Find an arbitrary simple cycle in the bipartite graph. If there is no cycle, finish the procedure. Suppose the cycle corresponds to points p_1, p_2, · · · , p_m and centers z_1, z_2, · · · , z_m, where p_i connects to both z_i and z_{(i mod m)+1}. Notice that ∑_{i=1}^m dist^r(p_i, z_i) = ∑_{i=1}^m dist^r(p_i, z_{(i mod m)+1}), since the given fractional assignment is optimal.

3. Let a be the minimum weight assigned from p_i to z_i over all i ∈ [m]. For each p_i, move a weight of a from z_i to z_{(i mod m)+1}.

4. Repeat the above steps for the new fractional assignment.

Each iteration of the above procedure removes at least one edge between a point and a center, so the procedure runs in polynomial time. Since at the end of the procedure there is no cycle in the constructed bipartite graph, the number of points whose weight is split among multiple centers is at most k − 1. For each of these at most k − 1 points, we modify its assignment so that all of its weight goes to the closest center. Thus, we obtain an integral assignment π′ : Q′ → Z. Furthermore, we know that ‖s(π′)‖_∞ ≤ t′ + (k − 1) · max_{p∈Q′} w′(p). For p ∈ Q′, by Algorithm 2, if p ∈ Q_{i,j}, then |Q_{i,j}| ≥ . γT_i(o) and w′(p) ≤ ξ γT_i(o), which implies that w′(p) ≤ η|Q|/k (due to the choice of ξ). Therefore, we can conclude that ‖s(π′)‖_∞ ≤ (1 + η)t′ and cost^(r)(π′) ≤ cost^(r)_{t′}(Q′, Z, w′).
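The cycle-elimination procedure above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's pseudocode: the dict-based representation of the fractional assignment and all function names are ours. The support graph is searched for a simple cycle, and the minimum weight on one alternating class of cycle edges is shifted onto the other class, which preserves every point's total weight and every center's load while deleting at least one edge per round.

```python
from collections import defaultdict

def cancel_cycles(assign):
    """Reduce the number of fractionally split points in a bipartite
    fractional assignment.  `assign` maps (point, center) -> weight > 0.
    While the support graph contains a cycle, shift the minimum weight
    around it; every round deletes at least one edge, so an acyclic
    support (a forest) is reached in polynomially many rounds."""
    assign = {e: w for e, w in assign.items() if w > 0}
    while True:
        edges = _find_cycle(assign)
        if edges is None:
            return assign
        a = min(assign[e] for e in edges[0::2])   # amount to shift
        for e in edges[0::2]:                     # decreased edges
            assign[e] -= a
            if assign[e] <= 1e-12:
                del assign[e]
        for e in edges[1::2]:                     # increased edges
            assign[e] += a

def _find_cycle(assign):
    """Return one cycle of the support graph as an alternating list of
    (point, center) edges, or None if the graph is a forest."""
    adj = defaultdict(list)
    for (p, z) in assign:
        adj[('p', p)].append(('z', z))
        adj[('z', z)].append(('p', p))

    def dfs(u, par, path, on_path):
        on_path[u] = len(path)
        path.append(u)
        for v in adj[u]:
            if v == par:
                continue
            if v in on_path:
                return path[on_path[v]:]          # cycle as node list
            got = dfs(v, u, path, on_path)
            if got is not None:
                return got
        path.pop()
        del on_path[u]
        return None

    for root in list(adj):
        cyc = dfs(root, None, [], {})
        if cyc is not None:
            out = []
            for i in range(len(cyc)):
                u, v = cyc[i], cyc[(i + 1) % len(cyc)]
                out.append((u[1], v[1]) if u[0] == 'p' else (v[1], u[1]))
            return out
    return None
```

In an optimal fractional solution the two alternating directions around a cycle have equal total cost, so the shift leaves the cost unchanged; the sketch does not check optimality and simply picks a fixed direction.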
Notice that π′ may not be representable by a small number of sets of assignment half-spaces. Thus, we need to apply a switching argument, similar to the proof of Lemma 3.8, to modify π′. The modification of π′ proceeds as follows:

1. For each Q′_i (see Algorithm 2) do the following:

(a) Let π′_{Q′_i} : Q′_i → Z be the assignment mapping satisfying ∀p ∈ Q′_i, π′_{Q′_i}(p) = π′(p).

(b) Since all points in Q′_i have the same weight, we can use minimum-cost flow to find an assignment mapping π̃_{Q′_i} : Q′_i → Z such that s(π̃_{Q′_i}) = s(π′_{Q′_i}) and cost^(r)(π̃_{Q′_i}) is minimized.

(c) If there exist p, q ∈ Q′_i such that π̃_{Q′_i}(p) = z_j, π̃_{Q′_i}(q) = z_{j′} (j < j′), dist^r(q, z_j) − dist^r(q, z_{j′}) = dist^r(p, z_j) − dist^r(p, z_{j′}), and the alphabetic order of q is smaller than that of p (due to the optimality of π̃_{Q′_i}, dist^r(q, z_j) − dist^r(q, z_{j′}) < dist^r(p, z_j) − dist^r(p, z_{j′}) can never happen), then switch the assigned centers of p and q, i.e., π̃_{Q′_i}(q) ← z_j, π̃_{Q′_i}(p) ← z_{j′}.

(d) Repeat the above step until no switching happens. Let π′′_{Q′_i} : Q′_i → Z be the final π̃_{Q′_i} after all switching.

2. Let π′′ : Q′ → Z satisfy ∀i ∈ {0, 1, · · · , L}, p ∈ Q′_i, π′′(p) = π′′_{Q′_i}(p).

Consider step 1c. After each switching, for every l ∈ [s(π̃_{Q′_i})_j], the alphabetic order of the point assigned to z_j with the l-th smallest alphabetic order cannot increase. Thus, the whole procedure finishes in polynomial time.
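The switching rule of step 1c can be illustrated on plain Euclidean points. The following is a toy sketch, not the paper's procedure: it assumes the assignment is already cost-optimal, takes it as a dict from point tuples to center indices (these names and the representation are ours), and repeats swaps until stable. Tied points are reordered so that the lexicographically smaller point sits at the smaller-indexed center.

```python
def switch_ties(points, centers, assign, r=2):
    """Toy version of the switching step (1c): among points whose cost
    difference between a pair of centers is tied, the lexicographically
    smaller point must end up at the smaller-indexed center.  `assign`
    maps point tuples to center indices and is modified in place."""
    def cost(p, z):
        # dist^r for Euclidean distance; for r = 2 this is the squared distance
        return sum((a - b) ** 2 for a, b in zip(p, z)) ** (r / 2)
    changed = True
    while changed:
        changed = False
        for p in points:
            for q in points:
                j, jp = assign[p], assign[q]
                if j < jp and q < p:
                    dp = cost(p, centers[j]) - cost(p, centers[jp])
                    dq = cost(q, centers[j]) - cost(q, centers[jp])
                    if abs(dp - dq) < 1e-12:
                        # tie: q takes the smaller-indexed center
                        assign[p], assign[q] = jp, j
                        changed = True
    return assign
```

Breaking all cost ties by a fixed total order on the points is what makes the final assignment expressible by half-spaces: no two points can end up on "the wrong sides" of the bisector between a pair of centers.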
Consider the properties of π′′. It is easy to see that cost^(r)(π′′) = ∑_{i=0}^L cost^(r)(π̃_{Q′_i}) ≤ cost^(r)(π′) and s(π′′) = ∑_{i=0}^L s(π̃_{Q′_i}) = s(π′). Furthermore, by Definition 3.7, for each i ∈ {0, 1, · · · , L}, we can compute a set of assignment half-spaces H_i = {H_i(j, j′) | j < j′} such that ∀p ∈ Q′_i, π′′(p) = z_j if and only if ∀j′ ≠ j, p ∈ H_i(j, j′).

It remains to construct an assignment mapping π : Q → Z for the original point set. For each i ∈ {0, 1, · · · , L} and each P ∈ P^I_i (see Algorithm 2 for P^I_i), we can construct a transferred assignment mapping π_P : P → Z corresponding to (H_i, B_{P,i}, ξ, . γT_i(o)), where B_{P,i} = (b_{P,i,1}, b_{P,i,2}, · · · , b_{P,i,k}) and b_{P,i,j} = ∑_{p ∈ P ∩ Q′_i : ∀j′ ≠ j, p ∈ H_i(j,j′)} w′(p). According to the proof of Theorem 3.19, we can condition on o ≤ OPT^(r)_k-clus. Similar to the proof of Lemma 3.17, conditioned on E(P) for all P, we can show that

∑_{i=0}^L ∑_{P ∈ P^I_i} ∑_{p ∈ P} dist^r(p, π_P(p)) ≤ (1 + O(ε)) ∑_{p ∈ Q′} w′(p) · dist^r(p, π′′(p)),

and

‖∑_{i=0}^L ∑_{P ∈ P^I_i} s(π_P)‖_∞ ≤ (1 + O(η)) · ‖s(π′′)‖_∞.

Then we construct π : Q → Z as follows:

1. if ∃i ∈ {0, · · · , L}, P ∈ P^I_i such that p ∈ P, let π(p) ← π_P(p);

2.
otherwise, let π(p) ← arg min_{z∈Z} dist(p, z).

According to the proof of Lemma 3.4 (see Appendix), we can show that

|{p ∈ Q | ∀i ∈ {0, 1, · · · , L}, P ∈ P^I_i, p ∉ P}| ≤ O(η) · |Q|/k

and

∑_{p ∈ Q : ∀i ∈ {0,1,··· ,L}, P ∈ P^I_i, p ∉ P} dist^r(p, Z) ≤ O(ε) · cost^(r)(⋃_{i=0}^L ⋃_{P ∈ P^I_i} P, Z).

Thus,

∑_{p ∈ Q} dist^r(p, π(p)) ≤ (1 + O(ε)) · ∑_{p ∈ Q′} w′(p) · dist^r(p, π′′(p)) ≤ (1 + O(ε)) cost^(r)_{t′}(Q′, Z, w′),

and ‖s(π)‖_∞ ≤ (1 + O(η)) · t′. Notice that P^I_i can be determined by the heavy cells outputted by Algorithm 1 and the estimated number of points in their children cells. By the above argument, if we store this information together with the coreset (Q′, w′), we can determine the desired assignment mapping π for any capacity t′ and centers Z in poly(|Q′|) time.

In this section, we discuss how to obtain the estimation τ(C ∩ Q) in line 7 of Algorithm 1, the estimation τ(⋃_{j=1}^{s_i} Q_{i,j}) in line 6, and the estimation τ(Q_{i,j}) in line 9 of Algorithm 2. For convenience, we suppose that no two points in the input point set share the same coordinates. We can obtain the estimations by the procedure in Algorithm 3.

Algorithm 3
Estimation of Number of Points via Sampling
1. Let λ′ ← dL.

2. Let h_0, h_1, · · · , h_L : [∆]^d → {0, 1} be λ′-wise independent hash functions, where ∀i ∈ {0, 1, · · · , L}, p ∈ [∆]^d, it satisfies Pr[h_i(p) = 1] = ψ_i = min(λ′/T_i(o), 1).

3. For i ∈ {0, 1, 2, · · · , L} and each cell C ∈ G_i, set τ(C ∩ Q) ← (1/ψ_i) · ∑_{p ∈ C ∩ Q} h_i(p). Use these estimated values in Algorithm 1.

4. Let h′_0, h′_1, · · · , h′_L : [∆]^d → {0, 1} be λ′-wise independent hash functions, where ∀i ∈ {0, 1, · · · , L}, p ∈ [∆]^d, it satisfies Pr[h′_i(p) = 1] = ψ′_i = min(λ′/(γT_i(o)), 1).

5. In Algorithm 2, for i ∈ {0, 1, · · · , L} set

τ(⋃_{j=1}^{s_i} Q_{i,j}) ← (1/ψ′_i) ∑_{C ∈ G_i : C is crucial} ∑_{p ∈ C ∩ Q} h′_i(p),

and for j ∈ [s_i] set

τ(Q_{i,j}) ← (1/ψ′_i) ∑_{C ∈ G_i : C is a crucial child of the j-th heavy cell in G_{i−1}} ∑_{p ∈ C ∩ Q} h′_i(p).

Lemma 4.1.
With probability at least . , Algorithm 3 satisfies following conditions: If two points share the same coordinate, we can assume that each point has a unique tag so we can distinguishthem. . ∀ i ∈ { , , · · · , L } , C ∈ G i , τ ( C ∩ Q ) is good (Definition 3.1).2. ∀ i ∈ { , , · · · , L } , τ (cid:16)(cid:83) s i j =1 Q i,j (cid:17) is good (Definition 3.5).3. ∀ i ∈ { , , · · · , L } , j ∈ [ s i ] , τ ( Q i,j ) is good (Definition 3.5).Proof. Consider an arbitrary i ∈ { , , · · · , L } and an arbitrary cell C ∈ G i . Notice that E [ τ ( C ∩ Q )] = ψ i (cid:80) p ∈ C ∩ Q E [ h i ( p )] = | C ∩ Q | . If | C ∩ Q | ≥ T i ( o ) , by Lemma 3.13, we have Pr[ | τ ( C ∩ Q ) − | C ∩ Q || ≥ . | C ∩ Q | ] ≤ (cid:32) | C ∩ Q | λ (cid:48) · ψ i + λ (cid:48) · ψ i . | C ∩ Q | (cid:33) λ (cid:48) / ≤ . · d . Similarly, if | C ∩ Q | < T i ( o ) , then by Lemma 3.13, Pr[ | τ ( C ∩ Q ) − | C ∩ Q || ≥ . T i ( o )] ≤ (cid:32) | C ∩ Q | λ (cid:48) · ψ i + λ (cid:48) · ψ i . | T i ( o ) | (cid:33) λ (cid:48) / ≤ . · d . Since there are at most (∆ / i ) d non-empty cells in G i , the total number of non-empty cells isat most d . By taking union bound over all such cells, with probability at least . , ∀ i ∈{ , , · · · , L } , C ∈ G i , either τ ( C ∩ Q ) ∈ | C ∩ Q | ± . T i ( o ) or τ ( C ∩ Q ) ∈ (1 ± . · | C ∩ Q | .By using Lemma 3.13 similar to the above argument, with probability at least . , ∀ i ∈{ , , · · · , L } , either τ (cid:16)(cid:83) s i j =1 Q i,j (cid:17) ∈ (cid:80) s i j =1 | Q i,j | ± . T i ( o ) or τ (cid:16)(cid:83) s i j =1 Q i,j (cid:17) ∈ (1 ± . (cid:80) s i j =1 | Q i,j | .We also have that, with probability . , ∀ i ∈ { , , · · · , L } , j ∈ [ s i ] , either τ ( Q i,j ) ∈ | Q i,j | ± . γT i ( o ) or τ ( Q i,j ) ∈ (1 ± . | Q i,j | . By taking union bound over all failure events, we completethe proof. 
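The per-cell estimator of Algorithm 3, step 3 — keep each point with probability ψ_i via a hash function and rescale the count by 1/ψ_i — can be sketched as follows. This is an illustration only: the analysis needs a λ′-wise independent family, for which a salted cryptographic hash stands in here, and the function names are ours.

```python
import hashlib

def make_sampler(seed, psi):
    """Stand-in for a limited-independence hash h : points -> {0, 1} with
    Pr[h(p) = 1] = psi.  A salted SHA-256 digest is mapped to a uniform
    value in [0, 1) and thresholded at psi."""
    def h(p):
        digest = hashlib.sha256((seed + repr(p)).encode()).digest()
        u = int.from_bytes(digest[:8], 'big') / 2.0 ** 64
        return 1 if u < psi else 0
    return h

def estimate_cell_size(cell_points, h, psi):
    """Unbiased estimate of |C ∩ Q|: count the sampled points in the cell
    and rescale by 1/psi, as in step 3 of Algorithm 3."""
    return sum(h(p) for p in cell_points) / psi
```

With ψ_i = 1 the estimate is exact; for ψ_i < 1 it is unbiased, and its concentration around |C ∩ Q| is roughly what the "good estimation" guarantees of Lemma 4.1 capture.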
In this section, we discuss how to implement the coreset construction in the streaming model. We consider a streaming model which allows both insertions and deletions. The model is described in the following.
Dynamic streaming model.
Initially, Q is an empty point set. There is a stream of insertions and deletions, (p_1, ±), (p_2, ±), · · · , where (p_i, +) denotes inserting a point p_i ∈ [∆]^d into Q, and (p_i, −) denotes deleting p_i from Q. Each deletion (p_i, −) is guaranteed to refer to a point p_i that is in Q before the deletion. A dynamic streaming algorithm is allowed a single pass over the stream. At the end of the stream, the algorithm stores some information regarding Q. The space complexity of the algorithm is the total number of bits used by the algorithm during the stream.

In this section, we introduce a dynamic streaming algorithm which can output a coreset for ℓ_r balanced k-clustering using space poly(ε^{-1} kdL) bits for constant r.

Lemma 4.2 (Lemma 19 in [HSYZ18]). For i ∈ {0, 1, · · · , L}, α, β ∈ Z_{≥1}, δ ∈ (0, 0.5), there is a dynamic streaming algorithm Storing(G_i, α, β, δ) which uses O(αβdL · log(αβ/δ)) bits to process a stream of insertions and deletions of points such that

1. if the algorithm does not output FAIL, it returns
• a set C of all non-empty cells and the number of points f(C) in each cell C ∈ C,
• the set S of points in all the non-empty cells that contain at most β points;

2. if |C| ≤ α, then with probability at least 1 − δ, the algorithm does not output FAIL.

Next, let us describe how to use the above subroutine to implement our coreset construction algorithm. The idea is that we only store some information described by a small number of bits, and at the end of the stream, we can use this information to implement Algorithm 1, Algorithm 2 and Algorithm 3. The description is in Algorithm 4.
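To make the interface of Lemma 4.2 concrete before the full construction, here is a toy stand-in for Storing(G_i, α, β, δ). The real subroutine recovers the same information from small sketches; this version (an illustration with our own class and parameter names) simply tracks the stream exactly, ignoring the space bound, and only shows what the subroutine returns.

```python
class StoringToy:
    """Interface-level stand-in for Storing(G_i, alpha, beta, delta):
    process insertions/deletions of grid points, then report all
    non-empty cells with their counts, the points inside the small
    cells (at most beta points), or FAIL if there are too many cells."""
    def __init__(self, grid_size, alpha, beta):
        self.g, self.alpha, self.beta = grid_size, alpha, beta
        self.cells = {}                    # cell id -> {point -> multiplicity}

    def _cell(self, p):
        return tuple(x // self.g for x in p)

    def update(self, p, delta):            # delta is +1 (insert) or -1 (delete)
        c = self._cell(p)
        cnt = self.cells.setdefault(c, {})
        cnt[p] = cnt.get(p, 0) + delta     # valid deletions keep counts >= 0
        if cnt[p] == 0:
            del cnt[p]
        if not cnt:
            del self.cells[c]

    def query(self):
        if len(self.cells) > self.alpha:
            return 'FAIL'
        f = {c: sum(m.values()) for c, m in self.cells.items()}
        S = [p for c, m in self.cells.items() if f[c] <= self.beta
               for p, mult in m.items() for _ in range(mult)]
        return f, S
```

Note that a deletion cancels the matching insertion exactly, so the final state depends only on the multiset Q at the end of the stream, which is the behavior the dynamic streaming model requires.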
Algorithm 4
Coreset Construction over a Dynamic Stream
1. Let λ ← rk dL ⌈log(kdL)⌉.

2. Let h_0, h_1, · · · , h_L, h′_0, h′_1, · · · , h′_L, ĥ_0, ĥ_1, · · · , ĥ_L : [∆]^d → {0, 1} be λ-wise independent hash functions: ∀i ∈ {0, 1, · · · , L}, p ∈ [∆]^d,

Pr[h_i(p) = 1] = ψ_i (ψ_i defined in Algorithm 3),
Pr[h′_i(p) = 1] = ψ′_i (ψ′_i defined in Algorithm 3),
Pr[ĥ_i(p) = 1] = φ_i (φ_i defined in Algorithm 1).
3. For the input stream (p_1, ±), (p_2, ±), · · · , create 3(L + 1) sub-streams. For each i ∈ {0, 1, · · · , L}:

• There are three sub-streams, containing all the insertions/deletions of the points p_j with h_i(p_j) = 1, with h′_i(p_j) = 1, and with ĥ_i(p_j) = 1, respectively. Run the subroutines Storing(G_i, α_i, β_i, δ), Storing(G_i, α′_i, β′_i, δ), and Storing(G_i, α̂_i, β̂_i, δ) (Lemma 4.2) on the corresponding sub-streams in parallel, with δ = . /(3 · (L + 1)). Let (C_i ⊂ G_i, f_i : C_i → Z_{≥0}, S_i ⊆ [∆]^d), (C′_i ⊂ G_i, f′_i : C′_i → Z_{≥0}, S′_i ⊆ [∆]^d), and (Ĉ_i ⊂ G_i, f̂_i : Ĉ_i → Z_{≥0}, Ŝ_i ⊆ [∆]^d) be the respective outputs, where α_i = 10(k + d . r ψ_i T_i(o)) L, α′_i = 10(k + d . r ψ′_i T_i(o)) L, α̂_i = 10(k + d . r φ_i T_i(o)) L, and β_i = β′_i = 1, β̂_i = 4 · (k + d . r) L φ_i T_i(o).

4. For each i ∈ {0, 1, · · · , L} and each cell C ∈ G_i, if C ∈ C_i, set τ(C ∩ Q) ← (1/ψ_i) · f_i(C); set τ(C ∩ Q) ← 0 otherwise. Run Algorithm 1 based on such τ(C ∩ Q) for all cells C.

5. For each i ∈ {0, 1, · · · , L}, set τ(⋃_{j=1}^{s_i} Q_{i,j}) ← (1/ψ′_i) ∑_{C ∈ C′_i : C is crucial} f′_i(C), and ∀j ∈ [s_i] set τ(Q_{i,j}) ← (1/ψ′_i) ∑_{C ∈ C′_i : C is a crucial child of the j-th heavy cell in G_{i−1}} f′_i(C).
6. Run Algorithm 2 based on all τ(⋃_{j=1}^{s_i} Q_{i,j}) and τ(Q_{i,j}). To obtain Q′_i in line 11 of Algorithm 2, do the following: Q′_i ← ⋃_{p ∈ Ŝ_i : c_i(p) is crucial, τ(Q_{i,j}) ≥ γT_i(o)} {p}, where c_{i−1}(p) is the j-th heavy cell in G_{i−1}.

Output Q′ and w′ returned by Algorithm 2.

Lemma 4.3 (Correctness of the streaming algorithm). If o ≤ OPT^{(r)}_{k-clus} and none of the subroutines of Algorithm 4 outputs FAIL, then with probability at least 0.9, the output Q′, w′ of Algorithm 4 satisfies that ∀t ≥ |Q|/k, Z ⊂ [∆]^d with |Z| = k, cost^{(r)}_{(1+η)t}(Q, Z) ≤ (1 + ε)·cost^{(r)}_t(Q′, Z, w′) and cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε)·cost^{(r)}_t(Q, Z).

Proof. Due to Lemma 4.2, since none of the subroutines
Storing outputs FAIL, ∀i ∈ {0, 1, · · · , L}, C ∈ G_i,

• f_i(C) = Σ_{p ∈ C∩Q} h_i(p) if C ∈ C_i, and Σ_{p ∈ C∩Q} h_i(p) = 0 otherwise,

• f′_i(C) = Σ_{p ∈ C∩Q} h′_i(p) if C ∈ C′_i, and Σ_{p ∈ C∩Q} h′_i(p) = 0 otherwise.

Thus, in step 4 of Algorithm 4, τ(C ∩ Q) = (1/ψ_i)·Σ_{p ∈ C∩Q} h_i(p). Similarly, in step 5 of Algorithm 4,

τ(⋃_{j=1}^{s_i} Q_{i,j}) = (1/ψ′_i)·Σ_{C ∈ G_i : C is crucial} Σ_{p ∈ C∩Q} h′_i(p),

τ(Q_{i,j}) = (1/ψ′_i)·Σ_{C ∈ G_i : C is a crucial child of the j-th heavy cell in G_{i−1}} Σ_{p ∈ C∩Q} h′_i(p).

By Lemma 4.1, with probability at least 0.9,

1. ∀i ∈ {0, 1, · · · , L}, C ∈ G_i, either τ(C ∩ Q) ∈ |C ∩ Q| ± 0.1·T_i(o) or τ(C ∩ Q) ∈ (1 ± 0.1)·|C ∩ Q|,

2. ∀i ∈ {0, 1, · · · , L}, either τ(⋃_{j=1}^{s_i} Q_{i,j}) ∈ Σ_{j=1}^{s_i} |Q_{i,j}| ± 0.1·T_i(o) or τ(⋃_{j=1}^{s_i} Q_{i,j}) ∈ (1 ± 0.1)·Σ_{j=1}^{s_i} |Q_{i,j}|,

3. ∀i ∈ {0, 1, · · · , L}, j ∈ [s_i], either τ(Q_{i,j}) ∈ |Q_{i,j}| ± 0.1·γT_i(o) or τ(Q_{i,j}) ∈ (1 ± 0.1)·|Q_{i,j}|.

Consider step 6 of Algorithm 4. Notice that since Algorithm 2 does not output FAIL, we have ∀i ∈ {0, 1, · · · , L}, Σ_{j=1}^{s_i} |Q_{i,j}| ≤ O((kL + d^{1.5r})·T_i(o)). Thus, ∀i ∈ {0, 1, · · · , L},

E[Σ_{j=1}^{s_i} Σ_{p ∈ Q_{i,j}} ĥ_i(p)] ≤ φ_i · O((kL + d^{1.5r})·T_i(o)).

By Markov's inequality and a union bound over all i ∈ {0, 1, · · · , L}, with probability at least 0.9, ∀i ∈ {0, 1, · · · , L},

Σ_{j=1}^{s_i} Σ_{p ∈ Q_{i,j}} ĥ_i(p) ≤ 4·φ_i·(k + d^{1.5r})·L·T_i(o),

which implies that ∀i ∈ {0, 1, · · · , L}, j ∈ [s_i], {p ∈ Q_{i,j} | ĥ_i(p) = 1} ⊆ Ŝ_i. Thus, the construction of Q′_i in step 6 of Algorithm 4 is equivalent to the construction of Q′_i in line 11 of Algorithm 2. Due to the correctness (Lemma 3.17) of Algorithm 2, we complete the proof.
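Steps 2–5 of Algorithm 4 rest on one primitive: subsample the stream with a λ-wise independent hash function and rescale the counts by 1/ψ_i to estimate set sizes. A minimal self-contained sketch of that primitive, with hypothetical names and a standard random-polynomial hash family standing in for the paper's construction:

```python
import random

# Sketch (not the paper's exact construction): a lam-wise independent hash
# family via a random degree-(lam-1) polynomial over a prime field, thresholded
# so that Pr[h(p) = 1] is about psi. All names here are illustrative.
PRIME = (1 << 61) - 1  # a Mersenne prime larger than the point universe

def make_hash(lam, psi, rng):
    """Return h: int -> {0,1} with Pr[h(p)=1] ~ psi, lam-wise independent."""
    coeffs = [rng.randrange(PRIME) for _ in range(lam)]
    def h(p):
        v = 0
        for c in coeffs:          # Horner evaluation of the polynomial at p
            v = (v * p + c) % PRIME
        return 1 if v < psi * PRIME else 0
    return h

def estimate_size(points, h, psi):
    """Unbiased estimate of |C ∩ Q|: count sampled points, rescale by 1/psi."""
    return sum(h(p) for p in points) / psi

rng = random.Random(0)
psi = 0.1
h = make_hash(8, psi, rng)
cell = list(range(1, 200001))       # stand-in for the points of one cell
est = estimate_size(cell, h, psi)
# With n = 2*10^5 and psi = 0.1 the relative error concentrates far below 10%.
assert abs(est - len(cell)) / len(cell) < 0.1
```

Because the polynomial has λ random coefficients, any λ evaluations are mutually independent, which is exactly the property the concentration bounds (Lemma 4.1) need.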
Lemma 4.4 (Space complexity and success probability of the streaming algorithm). For constant r, the space needed by Algorithm 4 is at most poly(ε^{−1}η^{−1}kdL). Furthermore, if OPT^{(r)}_{k-clus}/2 ≤ o ≤ OPT^{(r)}_{k-clus}, then with probability at least 0.9, none of the subroutines of Algorithm 4 outputs FAIL, and the output Q′ of Algorithm 4 satisfies |Q′| ≤ poly(ε^{−1}η^{−1}kdL).

Proof. Suppose r is a constant. Since all the hash functions are λ-wise independent, the total number of random bits needed is at most poly(ε^{−1}η^{−1}kdL). Since ∀i ∈ {0, 1, · · · , L}, ψ_i, ψ′_i, φ_i ≤ poly(ε^{−1}η^{−1}kdL)/T_i(o), we have α_i, α′_i, α̂_i, β_i, β′_i, β̂_i ≤ poly(ε^{−1}η^{−1}kdL). By Lemma 4.2, the space complexity of each subroutine Storing (Lemma 4.2) is at most poly(ε^{−1}η^{−1}kdL) bits. Since there are 3(L + 1) parallel subroutines of Storing, the total space needed is at most poly(ε^{−1}η^{−1}kdL) bits.

Recall that Z* ⊂ [∆]^d with |Z*| ≤ k is the optimal solution of the standard ℓ_r k-clustering problem on Q. A cell C ∈ G_i is a center cell if dist(C, Z*) ≤ g_i/d. By Lemma 3.2, F happens with probability at least 0.9, i.e., the total number of center cells is at most kL.

Consider i ∈ {0, 1, · · · , L}. The number of points in non-center cells in G_i is at most OPT^{(r)}_{k-clus}/(g_i/d)^r ≤ d^{1.5r}·T_i(o). Thus, for i ∈ {0, 1, · · · , L}, we have:

E[|{C ∈ G_i | ∃p ∈ C ∩ Q, h_i(p) = 1}|] ≤ kL + ψ_i·d^{1.5r}·T_i(o),
E[|{C ∈ G_i | ∃p ∈ C ∩ Q, h′_i(p) = 1}|] ≤ kL + ψ′_i·d^{1.5r}·T_i(o),
E[|{C ∈ G_i | ∃p ∈ C ∩ Q, ĥ_i(p) = 1}|] ≤ kL + φ_i·d^{1.5r}·T_i(o).

By Markov's inequality and a union bound, with probability at least 0.9
, for i ∈ {0, 1, · · · , L}, we have:

|{C ∈ G_i | ∃p ∈ C ∩ Q, h_i(p) = 1}| ≤ 10·(k + d^{1.5r}·ψ_i·T_i(o))·L = α_i,
|{C ∈ G_i | ∃p ∈ C ∩ Q, h′_i(p) = 1}| ≤ 10·(k + d^{1.5r}·ψ′_i·T_i(o))·L = α′_i,
|{C ∈ G_i | ∃p ∈ C ∩ Q, ĥ_i(p) = 1}| ≤ 10·(k + d^{1.5r}·φ_i·T_i(o))·L = α̂_i,

and thus none of the subroutines Storing outputs FAIL. According to the proof of Lemma 4.3, with probability at least 0.9,

1. ∀i ∈ {0, 1, · · · , L}, C ∈ G_i, τ(C ∩ Q) is good (Definition 3.1),
2. ∀i ∈ {0, 1, · · · , L}, τ(⋃_{j=1}^{s_i} Q_{i,j}) is good (Definition 3.5),
3. ∀i ∈ {0, 1, · · · , L}, j ∈ [s_i], τ(Q_{i,j}) is good (Definition 3.5).

According to Lemma 3.18, with probability at least 0.9, the subroutine Algorithm 2 does not output FAIL. By the union bound, Algorithm 4 does not output FAIL with probability at least 0.9.

Theorem 4.5 (Streaming algorithm). Suppose the input point set Q ⊆ [∆]^d is obtained from a dynamic stream. For a constant r ≥ 1, given ε, η ∈ (0, 0.5), k ∈ Z_{≥1}, there is a dynamic streaming algorithm which uses one pass over the stream and with probability at least 0.9 can output a subset Q′ ⊆ Q and weights w′ : Q′ → R_{>0} such that

1. ∀t ≥ |Q|/k, Z ⊂ [∆]^d with |Z| = k, cost^{(r)}_{(1+η)t}(Q, Z) ≤ (1 + ε)·cost^{(r)}_t(Q′, Z, w′) and cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε)·cost^{(r)}_t(Q, Z),

2. |Q′| ≤ poly(ε^{−1}η^{−1}kd log ∆).

Furthermore, the space complexity of the algorithm is at most poly(ε^{−1}η^{−1}kd log ∆).

Proof. We can enumerate o ∈ {1, 2, 4, · · · , ∆^d·(√d·∆)^r} and run Algorithm 4 in parallel for each possible o. By [HSYZ18], there is a dynamic streaming algorithm which uses one pass over the stream and can give a constant-factor approximation to OPT^{(r)}_{k-clus} with probability at least 0.9.
We can run such an algorithm in parallel, and at the end of the stream, we can find an o such that OPT^{(r)}_{k-clus}/2 ≤ o ≤ OPT^{(r)}_{k-clus}. We conclude the proof by applying Lemma 4.3 and Lemma 4.4 for such an o.

Next, we convert our streaming algorithm to a distributed algorithm. The distributed model is described as follows.
Distributed model.
We study the same model as in [KVW14, WZ16, BWZ16, SWZ17, SWZ19]. There are s machines, where the i-th machine holds a subset Q^(i) ⊆ [∆]^d of the input points. One machine is designated as the coordinator. Communication is only allowed between the machines and the coordinator. The communication cost of a protocol is the total number of bits communicated between the machines and the coordinator.

We show that there is a communication-efficient distributed protocol which has a similar behavior as the subroutine shown in Lemma 4.2.

Lemma 4.6.
Suppose a point set is partitioned over s machines where each machine holds a subset of the points. For α, β ∈ Z_{≥1}, i ∈ {0, 1, · · · , L}, there is a distributed protocol which requires O(s · αβdL log d) bits such that the protocol either leaves C ⊂ G_i, f : C → Z_{≥0} and S ⊆ [∆]^d on the coordinator or outputs FAIL on the coordinator, where

• C is the set which contains all the non-empty cells,
• f(C) denotes the number of points in the cell C ∈ C,
• S contains the points in all non-empty cells that contain at most β points,

and furthermore, if |C| ≤ α, the protocol does not output FAIL on the coordinator.

Proof. The protocol is described as follows:

1. The coordinator sends the randomly shifted vector to each machine so that each machine learns the grid G_i.
2. The j-th machine finds the non-empty cells C^(j) ⊆ G_i based on its local point set, computes the number of local points f^(j)(C) of each cell C ∈ C^(j), and constructs S^(j), the set of local points in all cells that contain at most β local points.
3. If |C^(j)| ≤ α, the j-th machine sends C^(j), f^(j) and S^(j) to the coordinator. Otherwise, the j-th machine sends FAIL to the coordinator.
4. The coordinator outputs FAIL if it receives any FAIL from the other machines. Otherwise, C ← ⋃_{j=1}^{s} C^(j), for each cell C ∈ C, f(C) ← Σ_{j : C ∈ C^(j)} f^(j)(C), and S ← ⋃_{j=1}^{s} S^(j).

The first step needs O(dL log d) bits per machine. Consider the third step. The j-th machine needs O(αdL) bits to represent C^(j), O(α·dL) bits to represent f^(j), and O(αβdL) bits to represent S^(j). Thus, the overall communication cost is at most O(s · αβdL log d) bits.

Notice that if |C| ≤ α, then |C^(j)| ≤ α for all j ∈ [s] since C^(j) ⊆ C. If the total number of points in a cell C is at most β, then any machine can hold at most β local points in C. The above argument concludes the proof of correctness.
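A minimal in-process sketch of the four steps above (hypothetical names; the randomly shifted grid G_i is simplified to fixed one-dimensional cells of side g, and the coordinator is simulated locally):

```python
from collections import Counter

# Each "machine" reports its non-empty cells, per-cell local counts, and the
# local points lying in cells with at most beta local points, or FAIL if it
# sees more than alpha non-empty cells.
def machine_message(points, g, alpha, beta):
    cells = Counter(p // g for p in points)      # cell id -> local count
    if len(cells) > alpha:
        return "FAIL"
    small = [p for p in points if cells[p // g] <= beta]
    return dict(cells), small

def coordinator(messages):
    merged, kept = Counter(), []
    for msg in messages:
        if msg == "FAIL":                        # any local FAIL aborts
            return "FAIL"
        cells, small = msg
        merged.update(cells)                     # sum per-cell counts
        kept.extend(small)
    return merged, kept

machines = [[1, 2, 3, 17], [2, 18, 19]]          # points split across 2 machines
msgs = [machine_message(pts, g=8, alpha=10, beta=2) for pts in machines]
out = coordinator(msgs)
assert out != "FAIL"
merged, kept = out
assert merged[0] == 4  # cell [0,8): points 1,2,3 on machine 0 and 2 on machine 1
assert merged[2] == 3  # cell [16,24): points 17, 18, 19
```

As in the lemma, the β threshold is applied to local counts per machine, so a point of S held by some machine is always shipped: a cell with at most β points globally has at most β points on every machine.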
Theorem 4.7.
Suppose a point set Q ⊆ [∆]^d is partitioned across s machines. For a constant r ≥ 1, given ε, η ∈ (0, 0.5), k ∈ Z_{≥1}, there is a distributed protocol which, on termination, with probability at least 0.9 leaves a subset of points Q′ ⊆ Q and weights w′ : Q′ → R_{>0} on the coordinator such that

1. ∀t ≥ |Q|/k, Z ⊂ [∆]^d with |Z| = k, cost^{(r)}_{(1+η)t}(Q, Z) ≤ (1 + ε)·cost^{(r)}_t(Q′, Z, w′) and cost^{(r)}_{(1+η)t}(Q′, Z, w′) ≤ (1 + ε)·cost^{(r)}_t(Q, Z),

2. |Q′| ≤ poly(ε^{−1}η^{−1}kd log ∆).

Furthermore, the total communication cost of the protocol is at most s · poly(ε^{−1}η^{−1}kd log ∆) bits.

Proof. By [FL11, BFL16, BFL+
17, HSYZ18], there is a distributed protocol using s · poly(ε^{−1}η^{−1}kd log ∆) bits of communication which leaves a constant-factor approximation of OPT^{(r)}_{k-clus} on the coordinator with probability at least 0.9. Then the coordinator can broadcast the approximation to every machine, and all the machines can agree on the same o such that OPT^{(r)}_{k-clus}/2 ≤ o ≤ OPT^{(r)}_{k-clus}.

Then the protocol simulates Algorithm 4. For step 3 of Algorithm 4, we can use the protocol shown in Lemma 4.6 instead. Due to the choices of α_i, α′_i, α̂_i, β_i, β′_i, β̂_i, the total communication cost is at most s · poly(ε^{−1}η^{−1}kdL). For the remaining steps of Algorithm 4, we can simulate them on the coordinator. We conclude our proof by applying Lemma 4.3 and Lemma 4.4.

References

[ABM +
18] Marek Adamczyk, Jarosław Byrka, Jan Marcinkowski, Syed M Meesum, and Michał Włodarczyk. Constant factor FPT approximation for capacitated k-median. arXiv preprint arXiv:1809.05791, 2018.

[ASS17] Hyung-Chan An, Mohit Singh, and Ola Svensson. LP-based algorithms for capacitated facility location. SIAM Journal on Computing, 46(1):272–306, 2017.

[BBLM14] MohammadHossein Bateni, Aditya Bhaskara, Silvio Lattanzi, and Vahab Mirrokni. Distributed balanced clustering via mapping coresets. In Advances in Neural Information Processing Systems (NIPS), pages 2591–2599, 2014.

[BFL16] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.

[BFL+17] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F Yang. Clustering high dimensional dynamic data streams. In ICML. https://arxiv.org/pdf/1706.03887, 2017.

[BIP+16] Arturs Backurs, Piotr Indyk, Eric Price, Ilya Razenshteyn, and David P Woodruff. Nearly-optimal bounds for sparse recovery in generic norms, with applications to k-median sketching. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 318–337. SIAM, 2016.

[BKS12] Binay Bhattacharya, Tsunehiko Kameda, and Zhao Song. Computing minmax regret 1-median on a tree network with positive/negative vertex weights. In International Symposium on Algorithms and Computation, pages 588–597. Springer, 2012.

[BKS14] Binay Bhattacharya, Tsunehiko Kameda, and Zhao Song. A linear time algorithm for computing minmax regret 1-median on a tree network. Algorithmica, 70(1):2–21, 2014.

[BLLM16] Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering problems on sliding windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1374–1390. Society for Industrial and Applied Mathematics, 2016.

[BR94] Mihir Bellare and John Rompel. Randomness-efficient oblivious sampling. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 276–287. IEEE, 1994.

[BRU16] Jarosław Byrka, Bartosz Rybicki, and Sumedha Uniyal. An approximation algorithm for the uniform capacitated k-median problem with (1 + ε) capacity violation. In International Conference on Integer Programming and Combinatorial Optimization, pages 262–274. Springer, 2016.

[BWZ16] Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 236–249. ACM, https://arxiv.org/pdf/1504.06729, 2016.

[Che09] Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.

[COP03] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pages 30–39. ACM, 2003.

[DL16] Gökalp Demirci and Shi Li. Constant approximation for capacitated k-median with (1 + ε)-capacity violation. arXiv preprint arXiv:1603.02324, 2016.

[FL11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pages 569–578, 2011.

[FMS07] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, pages 11–18. ACM, 2007.

[FS05] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pages 209–217. ACM, 2005.

[GMMO00] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In FOCS, pages 359–366, 2000.

[HPK05] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. In Proceedings of the Twenty-First Annual Symposium on Computational Geometry, pages 126–134. ACM, 2005.

[HPM04] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 291–300. ACM, 2004.

[HSYZ18] Wei Hu, Zhao Song, Lin F Yang, and Peilin Zhong. Nearly optimal dynamic k-means clustering for high-dimensional data. arXiv preprint arXiv:1802.00459, 2018.

[KVW14] Ravindran Kannan, Santosh S Vempala, and David P Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of The 27th Conference on Learning Theory (COLT), pages 1040–1057, 2014.

[Li17] Shi Li. On uniform capacitated k-median beyond the natural LP relaxation. ACM Transactions on Algorithms (TALG), 13(2):22, 2017.

[LLMR15] Silvio Lattanzi, Stefano Leonardi, Vahab Mirrokni, and Ilya Razenshteyn. Robust hierarchical k-center clustering. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 211–218. ACM, 2015.

[MMR19] Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1027–1038. ACM, 2019.

[SWZ17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.

[SWZ19] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2772–2789. Society for Industrial and Applied Mathematics, 2019.

[WZ16] David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. https://arxiv.org/pdf/1601.07721, 2016.

[XHX+19] Yicheng Xu, Rolf H. Möhring, Dachuan Xu, Yong Zhang, and Yifei Zou. A constant parameterized approximation for hard-capacitated k-means. arXiv preprint arXiv:1901.04628, 2019.
A Missing Details of Section 3.1
Fact A.1.
Suppose o ≤ OPT^{(r)}_{k-clus}. If the estimated size τ(C ∩ Q) in line 7 of Algorithm 1 is good (Definition 3.1) for every cell, then there is a unique cell C ∈ G_{−1} which contains [∆]^d and is marked as heavy.

Proof. As mentioned, since each cell in G_{−1} has side length g_{−1}, there is a unique cell C ∈ G_{−1} such that [∆]^d ⊂ C. Thus, C ∩ Q = Q. Notice that OPT^{(r)}_{k-clus} ≤ (√d·∆)^r·|Q| ≤ (√d·g_{−1})^r·|Q|. Since T_{−1}(o) = 0.1·o/(√d·g_{−1})^r ≤ 0.1·OPT^{(r)}_{k-clus}/(√d·g_{−1})^r, C must be marked heavy by Algorithm 1.

Proof of Lemma 3.2. Suppose Z* = {z*_1, z*_2, · · · , z*_k}. For i ∈ {0, 1, · · · , L}, j ∈ [k], consider the grid G_i and the center z*_j. For l ∈ [d], let X_l be the indicator random variable such that X_l = 1 if and only if the distance between z*_j and the boundary of the l-th dimension of G_i is at most g_i/d. Notice that each time z*_j is close to a boundary, the number of cells close to z*_j may increase by a factor of 2. Therefore, the number of cells which have distance at most g_i/d to z*_j is at most 2^{Σ_{l=1}^{d} X_l}, whose expectation is at most

Π_{l=1}^{d} E[2^{X_l}] = Π_{l=1}^{d} (1 + E[X_l]) ≤ (1 + 2/d)^d ≤ e^2.

Thus, the expectation of the total number of center cells is at most (L + 1)k·e^2. By Markov's inequality, with probability at least 0.9, the total number of center cells is at most kL.

Proof of Lemma 3.3.
Condition on F; the total number of center cells is at most kL (Lemma 3.2). Consider a cell C ∈ G_i which is not a center cell. If C is heavy, then

Σ_{p ∈ C∩Q} dist^r(p, Z*) ≥ |C ∩ Q|·(g_i/d)^r ≥ 0.9·T_i(o)·(g_i/d)^r ≥ o/(120·d^{1.5r}).

Since Σ_{C ∈ G_i} Σ_{p ∈ C∩Q} dist^r(p, Z*) = Σ_{p ∈ Q} dist^r(p, Z*) = OPT^{(r)}_{k-clus}, the total number of heavy cells which are not center cells is at most (L + 1)·OPT^{(r)}_{k-clus}/(o/(120·d^{1.5r})) = 120·d^{1.5r}·(L + 1)·OPT^{(r)}_{k-clus}/o. Together with the upper bound on the number of center cells, we can conclude the proof.

Proof of Lemma 3.4.
Let Q_I = Q \ Q_N. Since Q_I is a subset of Q, cost^{(r)}_t(Q_I, Z) ≤ cost^{(r)}_t(Q, Z) is obviously true by the definition of cost^{(r)}_t(·, ·). In the remainder of the proof, let us focus on proving cost^{(r)}_{(1+η)·t}(Q, Z) ≤ (1 + ε)·cost^{(r)}_t(Q_I, Z). Before we prove the statement, there are several observations.

Claim A.2. |Q_N| ≤ η|Q|/k.

Proof.

|Q_N| = Σ_{i=0}^{L} Σ_{P ∈ P^N_i} |P| ≤ Σ_{i=0}^{L} Σ_{P ∈ P^N_i} γT_i(o) ≤ Σ_{i=0}^{L} Σ_{P ∈ P^N_i} 2^{r+2}·γ·|c_{i−1}(P) ∩ Q| ≤ 2^{r+2}·γ·(L + 1)·|Q| ≤ η|Q|/k,

where the third step follows from the fact that c_{i−1}(P) is a heavy cell and thus |c_{i−1}(P) ∩ Q| ≥ 0.9·T_{i−1}(o) ≥ T_i(o)/2^{r+1}, and the fifth step follows from γ ≤ η/(2^{r+3}kL).

Claim A.3.
For i ∈ {0, 1, · · · , L}, if C ∈ G_{i−1} is marked as heavy by Algorithm 1, then |C ∩ Q_I| ≥ (1 − 2^{r+2}(L − i)γ)·|C ∩ Q| ≥ (1 − 2^{r+2}Lγ)·0.9·T_{i−1}(o).

Proof. We prove it by induction. Consider the case i = L. If there is no heavy cell in G_{L−1}, the claim is true. Otherwise, consider a heavy cell C ∈ G_{L−1}. Since all the children of C are crucial, all the points in the children of C will be in the same part, whose size is at least 0.9·T_{L−1}(o) > γT_L(o). Therefore |C ∩ Q_I| = |C ∩ Q|.

Suppose the claim is true for i + 1, i + 2, · · · , L. If there is no heavy cell in G_{i−1}, the claim is true. Otherwise, consider a heavy cell C ∈ G_{i−1}. There are two cases. The first case is that all points from Q in the crucial children of C are also in Q_I. In this case, we have

|C ∩ Q_I| = Σ_{C′ ∈ G_i : C′ is a crucial child of C} |C′ ∩ Q_I| + Σ_{C′ ∈ G_i : C′ is a heavy child of C} |C′ ∩ Q_I|
≥ Σ_{C′ ∈ G_i : C′ is a crucial child of C} |C′ ∩ Q| + Σ_{C′ ∈ G_i : C′ is a heavy child of C} (1 − 2^{r+2}(L − i − 1)γ)·|C′ ∩ Q|
≥ (1 − 2^{r+2}(L − i)γ)·|C ∩ Q|.

The second case is that none of the points from Q in the crucial children of C is in Q_I. In this case, we have

Σ_{C′ ∈ G_i : C′ is a heavy child of C} |C′ ∩ Q| = |C ∩ Q| − Σ_{C′ ∈ G_i : C′ is a crucial child of C} |C′ ∩ Q| ≥ |C ∩ Q| − γT_i(o) ≥ (1 − 2^{r+2}γ)·|C ∩ Q|,

where the last step follows from |C ∩ Q| ≥ 0.9·T_{i−1}(o) ≥ T_i(o)/2^{r+1}. Thus,

|C ∩ Q_I| = Σ_{C′ ∈ G_i : C′ is a heavy child of C} |C′ ∩ Q_I|
≥ (1 − 2^{r+2}(L − i − 1)γ)·Σ_{C′ ∈ G_i : C′ is a heavy child of C} |C′ ∩ Q|
≥ (1 − 2^{r+2}(L − i − 1)γ)·(1 − 2^{r+2}γ)·|C ∩ Q|
≥ (1 − 2^{r+2}(L − i)γ)·|C ∩ Q|.
It is enough to prove the lemma:

cost^{(r)}_{(1+η)t}(Q, Z)
≤ cost^{(r)}_{t+η|Q|/k}(Q, Z)
≤ cost^{(r)}_t(Q_I, Z) + cost^{(r)}_0(Q_N, Z)
= cost^{(r)}_t(Q_I, Z) + Σ_{i=0}^{L} Σ_{P ∈ P^N_i} Σ_{p ∈ P} dist^r(p, Z)
≤ cost^{(r)}_t(Q_I, Z) + Σ_{i=0}^{L} Σ_{P ∈ P^N_i} Σ_{p ∈ P} (2^{r−1}·(√d·g_{i−1})^r + 2^{r−1}·Σ_{q ∈ c_{i−1}(P)∩Q_I} dist^r(q, Z)/|c_{i−1}(P) ∩ Q_I|)
≤ cost^{(r)}_t(Q_I, Z) + Σ_{i=0}^{L} Σ_{P ∈ P^N_i} Σ_{p ∈ P} (2^{r−1}·(√d·g_{i−1})^r + 2^{r+1}·cost^{(r)}_0(Q_I, Z)/T_{i−1}(o))
≤ cost^{(r)}_t(Q_I, Z) + Σ_{i=0}^{L} |P^N_i|·γT_i(o)·(2^{r−1}·(√d·g_{i−1})^r + 2^{r+1}·cost^{(r)}_0(Q_I, Z)/T_{i−1}(o))
= cost^{(r)}_t(Q_I, Z) + Σ_{i=0}^{L} |P^N_i|·γ·(0.1·2^{2r−1}·o + 2^{2r+1}·cost^{(r)}_0(Q_I, Z))
≤ cost^{(r)}_t(Q_I, Z) + 20000·(k + d^{1.5r})·L·γ·(0.1·2^{2r−1}·o + 2^{2r+1}·cost^{(r)}_0(Q_I, Z))
≤ cost^{(r)}_t(Q_I, Z) + 20000·(k + d^{1.5r})·L·γ·(0.1·2^{2r−1}·cost^{(r)}_{(1+η)t}(Q, Z) + 2^{2r+1}·cost^{(r)}_0(Q_I, Z))
≤ cost^{(r)}_t(Q_I, Z) + (ε/4)·(cost^{(r)}_{(1+η)t}(Q, Z) + cost^{(r)}_0(Q_I, Z))
≤ cost^{(r)}_t(Q_I, Z) + (ε/4)·(cost^{(r)}_{(1+η)t}(Q, Z) + cost^{(r)}_t(Q_I, Z)),

where the first step follows from t ≥ |Q|/k, the second step follows from Claim A.2, and the fourth step follows from the fact that since c_{i−1}(P) ∩ Q_I ≠ ∅ (Claim A.3), there must be a point q′ ∈ c_{i−1}(P) ∩ Q_I such that dist^r(q′, Z) ≤ Σ_{q ∈ c_{i−1}(P)∩Q_I} dist^r(q, Z)/|c_{i−1}(P) ∩ Q_I| by an averaging argument, and furthermore, due to dist(p, q′) ≤ √d·g_{i−1} and Fact 2.1, we have

dist^r(p, Z) ≤ 2^{r−1}·dist^r(p, q′) + 2^{r−1}·dist^r(q′, Z) ≤ 2^{r−1}·(√d·g_{i−1})^r + 2^{r−1}·Σ_{q ∈ c_{i−1}(P)∩Q_I} dist^r(q, Z)/|c_{i−1}(P) ∩ Q_I|.

The fifth step follows from |c_{i−1}(P) ∩ Q_I| ≥ T_{i−1}(o)/4 (Claim A.3), the sixth step follows from ∀P ∈ P^N_i, |P| ≤ γT_i(o), the seventh step follows from T_i(o) = 0.1·o/(√d·g_i)^r = 2^r·T_{i−1}(o) and g_{i−1} = 2g_i, the eighth step follows from |P^N_i| ≤ s_i and Σ_{i=0}^{L} s_i ≤ 20000·(k + d^{1.5r})·L, the ninth step follows from o ≤ OPT^{(r)}_{k-clus} ≤ cost^{(r)}_{(1+η)t}(Q, Z), the tenth step follows from choosing γ small enough that 20000·(k + d^{1.5r})·L·2^{2r+1}·γ ≤ ε/4, and the last step follows from cost^{(r)}_0(Q_I, Z) ≤ cost^{(r)}_t(Q_I, Z).

By rearranging the terms of the above inequality, we have (1 − ε/4)·cost^{(r)}_{(1+η)t}(Q, Z) ≤ (1 + ε/4)·cost^{(r)}_t(Q_I, Z). Since ε ∈ (0, 0.5), we can conclude that cost^{(r)}_{(1+η)t}(Q, Z) ≤ (1 + ε)·cost^{(r)}_t(Q_I, Z).
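The final rearrangement relies on an elementary implication; a short check, assuming the (partially garbled) coefficient in the display is ε/4 (any fixed fraction of ε works the same way):

```latex
\text{If } \bigl(1-\tfrac{\varepsilon}{4}\bigr)A \le \bigl(1+\tfrac{\varepsilon}{4}\bigr)B
\text{ with } A, B \ge 0, \text{ then }
A \le \frac{1+\varepsilon/4}{1-\varepsilon/4}\,B \le (1+\varepsilon)B
\quad \text{for every } \varepsilon \in (0, 2],
\]
\[
\text{since }\;
(1+\varepsilon)\bigl(1-\tfrac{\varepsilon}{4}\bigr) - \bigl(1+\tfrac{\varepsilon}{4}\bigr)
= \tfrac{\varepsilon}{2} - \tfrac{\varepsilon^2}{4}
= \tfrac{\varepsilon}{4}\,(2-\varepsilon) \;\ge\; 0.
```

In particular the restriction ε ∈ (0, 0.5) in the lemma is comfortably within the range where the implication holds.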