FPT Approximation for Constrained Metric k-Median/Means

Dishant Goyal, Ragesh Jaiswal, and Amit Kumar
Department of Computer Science and Engineering, Indian Institute of Technology Delhi.
{Dishant.Goyal, rjaiswal, amitk}@cse.iitd.ac.in
Abstract.
The metric k-median problem over a metric space (X, d) is defined as follows: given a set L ⊆ X of facility locations and a set C ⊆ X of clients, open a set F ⊆ L of k facilities such that the total service cost, defined as Φ(F, C) := Σ_{x ∈ C} min_{f ∈ F} d(x, f), is minimised. The metric k-means problem is defined similarly using squared distances (i.e., d^2(·,·) instead of d(·,·)). In many applications there are additional constraints that any solution needs to satisfy. For example, to balance the load among the facilities in resource allocation problems, a capacity u is imposed on every facility. That is, no more than u clients can be assigned to any facility. This problem is known as the capacitated k-means/k-median problem. Likewise, various other applications have different constraints, which give rise to different constrained versions of the problem, such as the r-gather, fault-tolerant, and outlier k-means/k-median problems. Surprisingly, for many of these constrained problems, no constant-factor approximation algorithm is known. Moreover, the unconstrained problem itself is known [ABM+19] to be W[2]-hard when parameterized by k. We give FPT algorithms with constant approximation guarantees for a range of constrained k-median/means problems. For some of the constrained problems, ours is the first constant-factor approximation algorithm, whereas for others we improve or match the approximation guarantees of previous works. We work within the unified framework of Ding and Xu [DX15], which allows us to simultaneously obtain algorithms for a range of constrained problems. In particular, we obtain a (3 + ε)-approximation and a (9 + ε)-approximation for the constrained versions of the k-median and k-means problems, respectively, in FPT time. In many practical settings of the k-median/means problem, one is allowed to open a facility at any client location, i.e., C ⊆ L. For this special case, our algorithm gives a (2 + ε)-approximation and a (4 + ε)-approximation for the constrained versions of the k-median and k-means problems, respectively, in FPT time. Since our algorithm is based on a simple sampling technique, it can also be converted to a constant-pass log-space streaming algorithm. In particular, here are some of the main highlights of this work:
1. For the uniform capacitated k-median/means problems, our results match the previously known results of Addad et al. [CAL19].
2. For the r-gather k-median/means problem (clustering with a lower bound on the size of clusters), our FPT approximation bounds are better than what was previously known.
3. Our approximation bounds for the fault-tolerant, outlier, and uncertain versions are better than all previously known results, albeit in FPT time.
4. For certain constrained settings such as the chromatic, l-diversity, and semi-supervised k-median/means problems, we obtain the first constant-factor approximation algorithms, to the best of our knowledge.
5. Since our algorithms are based on a simple sampling-based approach, we also obtain constant-pass log-space streaming algorithms for most of the above-mentioned problems.
The metric k-means and k-median problems are similar. We combine the discussion of these problems by giving a definition of the k-service problem that encapsulates both.

Definition 1 (k-service problem). Let (X, d) be a metric space, k > 0 be any integer, and ℓ ≥ 1 be any real number. Given a set L ⊆ X of feasible facility locations and a set C ⊆ X of clients, find a set F ⊆ L of k facilities that minimises the total service cost: Φ(F, C) ≡ Σ_{j ∈ C} min_{i ∈ F} d^ℓ(i, j).

Note that the k-service problem is also studied with respect to a more general cost function Σ_{j ∈ C} min_{i ∈ F} δ(i, j), where δ(i, j) denotes the cost of assigning a client j ∈ C to a facility i ∈ F. We consider the special case δ(i, j) ≡ d^ℓ(i, j). For ℓ = 1, the problem is known as the k-median problem, and for ℓ = 2, the problem is known as the k-means problem. The above definition is motivated by the facility location problem and differs from it in two ways. First, in the facility location problem, one is allowed to open any number of facilities. Second, one has to pay an additional facility establishment cost for every open facility. Thus the k-service problem is basically the facility location problem for a fixed number of facilities.

The k-service problem can also be viewed as a clustering problem, where the goal is to group objects that are similar to each other. Clustering algorithms are commonly used in data mining, pattern recognition, and information retrieval [JMF99]. However, the notion of a cluster differs across applications. For example, some applications consider a cluster to be a dense region of points in the data space [EKSX96, ABKS99], while others consider it a highly connected sub-graph of a graph [HS00].
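To make the objective of Definition 1 concrete, the service cost Φ can be evaluated directly. Here is a minimal Python sketch; the one-dimensional metric, point values, and function name are ours, purely for illustration:

```python
# Service cost Phi_l(F, C) from Definition 1: each client pays the
# l-th power of its distance to the nearest open facility.

def service_cost(F, C, d, l=1):
    """Phi_l(F, C) = sum over clients x of min_{f in F} d(f, x)^l."""
    return sum(min(d(f, x) for f in F) ** l for x in C)

d = lambda a, b: abs(a - b)          # a toy metric on the real line
clients = [0, 1, 5, 6]
facilities = [1, 5]

print(service_cost(facilities, clients, d, l=1))  # k-median cost: 1+0+0+1 = 2
print(service_cost(facilities, clients, d, l=2))  # k-means cost:  1+0+0+1 = 2
```

With ℓ = 1 this is the k-median objective and with ℓ = 2 the k-means objective, matching the special cases named above.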
Likewise, various models have been developed in the past that capture clustering properties in different ways [XT15]. The k-means and k-median problems are examples of the center-based clustering model. In this model, the objects are mapped to points in a metric space such that the distance between points captures the degree of dissimilarity between them. In other words, the closer two points are, the more similar they are to each other. In order to measure the quality of a clustering, a center (known as the cluster representative) is assigned to each cluster, and the cost is measured based on the distances of the points to their respective cluster centers. The objective is then to obtain a clustering with minimum cost. To view a k-median instance as a clustering instance, consider the client set as a set of data points and the facility locations as the feasible centers. In a feasible solution, the clients assigned to the same facility are considered part of the same cluster, and the corresponding facility acts as their cluster center. During our discussion, we will use the terms center and facility interchangeably. Similarly, we can view the k-means problem as a clustering problem where the cost is measured with respect to squared distances.

Various variants of the k-median/means problem have been studied in the clustering literature. For example, the Euclidean k-means problem (where C ⊆ L = R^d) is NP-hard even for a fixed k or a fixed dimension d [Das08, ADHP09, MNV12, Vat09]. This opens the question of designing a PTAS (polynomial-time approximation scheme) for the problem when either the number of clusters or the dimension is fixed. Indeed, various PTASs are known under such conditions [KSS10, FMS07a, Che09a, JKS14, FRS16, CAKM16]. In general, it is known that the problem cannot be approximated within a constant factor, unless P = NP [ACKS15, CC19]. The hardness results in the previous paragraph were for the Euclidean setting.
These problems may be harder in general metric spaces, which is indeed what has been shown. The metric k-median problem is hard to approximate within a factor of (1 + 2/e), and the metric k-means problem is hard to approximate within a factor of (1 + 8/e) [GK99, JMS02]. On the positive side, various constant-factor approximation algorithms are known for the k-means (and k-median) problems in the metric and Euclidean settings [KMN+02, CGTS02, AGK+04, GT08, LS13, BPR+17, ANSW17]. Improving these bounds is not the goal of this paper. Instead, we undertake the task of improving/obtaining approximation bounds for a more general class of problems called the constrained k-means/k-median problems. Let us see what these problems are and why they are important.

For many real-world applications, the classical (unconstrained) k-means and k-median problems do not entirely capture the desired clustering properties. For example, consider the popular k-anonymity principle [Swe02]. The principle provides anonymity to a public database while keeping it meaningful at the same time. One way to achieve this is to cluster the data and release only partial information related to the clusters obtained. Further, to protect the data from re-identification attacks, the clustering should be done in such a way that each cluster gets at least r data points. This method is popularly known as r-gather clustering [APF+
10] (see the formal definition in Table 1). Likewise, various other applications impose a specific set of constraints on the clusters. Such applications have been studied extensively; a survey appears in Section 1.1 of [DX15]. We collectively mention these problems in Table 1 and their known approximation results in Table 2. We discuss these problems and their known results in detail in Section B of the Appendix.

An important distinction between the constrained problems and their unconstrained counterparts is the idea of locality. In simple words, the locality property says that points which are close to each other should be part of the same cluster. This property holds for the unconstrained version of the problem. However, it may not hold for many of the constrained versions, where minimising the clustering cost is not the only requirement. To understand this, consider a center set F = {f_1, f_2, ..., f_k} and let {C_1, ..., C_k} denote the clustering of the dataset for which the cost function is minimised. That is, C_i contains all the points for which f_i is the closest center in the set F. Note that the clustering {C_1, ..., C_k} just minimises the distance-based cost function and may not satisfy any additional constraint that the clustering may need to satisfy in a constrained setting.

Table 1. Constrained k-service problems with efficient partition algorithms (see Sections 4 and 5.3 in [DX15] and references therein). The (*)-marked problems were not discussed in [DX15]; we mention their partition algorithms in Section E.

1. r-gather k-service problem* ((r, k)-GService): Find a clustering C = {C_1, ..., C_k} with minimum Ψ*(C) such that for all i, |C_i| ≥ r_i.
2. r-capacity k-service problem* ((r, k)-CaService): Find a clustering C = {C_1, ..., C_k} with minimum Ψ*(C) such that for all i, |C_i| ≤ r_i.
3. l-diversity k-service problem ((l, k)-DService): Given that every client has an associated colour, find a clustering C = {C_1, ..., C_k} with minimum Ψ*(C) such that for all i, the fraction of points sharing the same colour inside C_i is at most 1/l.
4. Chromatic k-service problem (k-ChService): Given that every client has an associated colour, find a clustering C = {C_1, ..., C_k} with minimum Ψ*(C) such that no C_i contains two points with the same colour.
5. Fault-tolerant k-service problem ((l, k)-FService): Given a value l_p for every client p, find a clustering C = {C_1, ..., C_k} and a set F = {f_1, f_2, ..., f_k} of k centers such that the sum of the service costs of the points to their l_p nearest centers out of F is minimised.
6. Semi-supervised k-service problem (k-SService): Given a target clustering C' = {C'_1, ..., C'_k} and a constant α, find a clustering C = {C_1, ..., C_k} and a center set F such that the cost α · Ψ(F, C) + (1 − α) · Dist(C', C) is minimised, where Dist denotes the set-difference distance.
7. Uncertain k-service problem (k-UService): Given a discrete probability distribution for every client, i.e., for a point p ∈ C there is a set D_p = {p_1, ..., p_h} such that p takes the value p_i with probability t_p^i and Σ_{i=1}^{h} t_p^i ≤ 1, find a clustering C = {C_1, ..., C_k} so that the expected cost Ψ*(C) is minimized.
8. Outlier k-service problem* ((k, m)-OService): Find a set Z ⊆ C of size m and a clustering C' = {C'_1, ..., C'_k} of the set C' := C \ Z such that Ψ*(C') is minimized.

In a constrained setting we may need an algorithm that, given a center set {f_1, ..., f_k} as input, outputs a clustering {C̄_1, ..., C̄_k} which, in addition to minimising Σ_i Σ_{x ∈ C̄_i} d^ℓ(x, f_i), also satisfies certain clustering constraints. Such an algorithm is called a partition algorithm. In the unconstrained setting, the partition algorithm simply assigns each point to the closest center in F. However, designing such an efficient partition algorithm for the constrained versions of the problem is a non-trivial task. Ding and Xu [DX15] gave partition algorithms for all the problems mentioned in Table 1 (see Sections 4 and 5.3 of [DX15]). Though these algorithms were specifically designed for the Euclidean space, they can be generalized to any metric space. We will see that such a partition algorithm is crucial in the design of our FPT algorithms.

The partition algorithm gives us a way of going from a center set to a clustering. What about the reverse direction? Given a clustering C = {C_1, C_2, ..., C_k}, can we find a center set that gives the minimum clustering cost? The solution is simple. Construct a complete weighted bipartite graph G = (V_l, V_r, E), where a vertex in V_l corresponds to a facility location in L and a vertex in V_r corresponds to a cluster C_j ∈ C. The weight on an edge (i, j) ∈ V_l × V_r is equal to the cost of assigning the cluster C_j to the i-th facility, i.e., Σ_{x ∈ C_j} d^ℓ(x, i). Then we can easily obtain an optimal assignment by finding a minimum-cost perfect matching in the graph G.
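For small k, this minimum-cost matching can be computed by brute force over the k! cluster-to-facility assignments. The following Python sketch (function names and the toy instance are ours) finds the best one-to-one assignment of clusters to a candidate center set:

```python
# Given a clustering C_1,...,C_k and a candidate center set F of size k,
# find the cheapest one-to-one assignment of clusters to centers.
# Brute force over all k! permutations stands in for a min-cost perfect
# matching routine; this is only sensible for small k.
from itertools import permutations

def best_assignment_cost(F, clustering, d, l=1):
    cluster_cost = lambda f, Ci: sum(d(f, x) ** l for x in Ci)
    return min(
        sum(cluster_cost(f, Ci) for f, Ci in zip(perm, clustering))
        for perm in permutations(F)
    )

d = lambda a, b: abs(a - b)          # toy metric on the real line
clustering = [[0, 1, 2], [10, 11]]   # clusters C_1, C_2
F = [11, 1]                          # centers, order unknown a priori
print(best_assignment_cost(F, clustering, d, l=1))  # 1<->C_1, 11<->C_2: (1+0+1)+(1+0) = 3
```

A real implementation would replace the enumeration with a polynomial-time matching routine (e.g., the Hungarian algorithm) on the bipartite graph G described above.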
Let us denote this minimum cost by M_CPM(C, L). Thus, it is sufficient to output an optimal clustering for a constrained k-service instance. In fact, all problems in Table 1 only require us to output an optimal clustering for the problem.

Ding and Xu [DX15] suggested the following unified framework for considering any constrained k-means/k-median problem, modeling an arbitrary set of constraints using feasible clusterings. Note that they studied the problem in the Euclidean space, where C ⊆ L = R^d, whereas we study the problem in general metric spaces, where L and C are discrete and separate sets. We will use a few more definitions to define the problem. A k-center-set is a set of k distinct elements from L, and for any k-center-set F = {f_1, ..., f_k} and clustering C = {C_1, ..., C_k}, we will use the cost function:

Ψ(F, C) ≡ min over permutations π of { Σ_{i=1}^{k} Σ_{x ∈ C_i} d^ℓ(x, f_{π(i)}) }.

Definition 2 (Constrained k-service problem). Let (X, d) be a metric space, k > 0 be any integer, and ℓ ≥ 1 be any real number. Given a set L ⊆ X of feasible facility locations, a set C ⊆ X of clients, and a set 𝒞 of feasible clusterings, find a clustering C = {C_1, C_2, ..., C_k} in 𝒞 that minimizes the following objective function: Ψ*(C) ≡ min over k-center-sets F of Ψ(F, C).

Note that Ψ(F, C) is M_CPM(C, L), the minimum-cost perfect matching discussed earlier. The key component of the above definition is the set of feasible clusterings 𝒞. Using this, we can define any constrained version of the problem. Note that 𝒞 can have exponential size. However, for many problems it can be defined concisely using a simple set of mathematical constraints. For example, 𝒞 for the r-gather problem can be defined as 𝒞 := {C | for every cluster C_i ∈ C, |C_i| ≥ r_i}, where C = {C_1, C_2, ..., C_k} is a partitioning of the client set. Note that we consider the hard assignment model for the problem.
That is, one cannot open more than one facility at a location. This differs from the soft assignment model, where one can open multiple facilities at a location. The soft version can be stated in terms of the hard version by allowing L to be a multi-set and creating k copies of each location in L. It has been observed that soft-assignment models are easier and allow better approximation guarantees than hard-assignment models [CHK12, Li16]. For our discussion, we will call a center-set a soft center-set if it contains some facility location multiple times; otherwise we call it a hard center-set. In fact, a soft center-set is a multi-set; we will avoid using the term multi-set to keep our discussion simple.

As observed in past works [DX15, BJK18], any constrained version of k-median/means can be solved using a partition algorithm for this version together with a solution to a very general "list" version of the clustering problem, which we discuss next. Let us define this problem, which we call the list k-service problem. This will help us solve the constrained k-service problem.

Definition 3 (List k-service problem). Let α be a fixed constant. Let I = (L, C, k, d, ℓ) be any instance of the k-service problem and let C = {C_1, C_2, ..., C_k} be an arbitrary clustering of the client set C. The goal of the problem is: given I, find a list 𝓛 of k-center-sets (i.e., each element of the list is a set of k distinct elements from L) such that, with probability at least 1/2, there is a k-center-set F in the list such that Ψ(F, C) ≤ α · Ψ*(C).

Note that the algorithm in the above setup does not get access to the clustering C and yet is supposed to find good centers (a constant-factor α-approximation) for this clustering. Given this, it is easy to see that finding a single set of k centers that is good for C is not possible. However, finding a reasonably small list of k-center-sets such that at least one of the k-center-sets in the list is good may be feasible.
This is the main realization behind the formulation of the list version of the problem. The other reason is that, since the target clustering is allowed to be a completely arbitrary partition of the client set C, we can use a solution of the list k-service problem to solve any constrained k-service problem as long as there is a partition algorithm. The following theorem combines the list k-service algorithm and the partition algorithm for a constrained version of the problem to produce a constant-factor approximation algorithm for this problem.

Theorem 1.
Let I = (C, L, k, d, ℓ, 𝒞) be any instance of any constrained k-service problem and let A_𝒞 be the corresponding partition algorithm. Let B be an algorithm for the list k-service problem that runs in time T_B for instance (C, L, k, d, ℓ). There is an algorithm that, with probability at least 1/2, outputs a clustering C ∈ 𝒞 which is an α-approximation for the constrained k-service instance. The running time of the algorithm is O(T_B + |𝓛| · T_A), where T_A is the running time of the partition algorithm. (This notion of the list version of the clustering problem was implicitly present in the work of Ding and Xu [DX15]; Bhattacharya et al. [BJK18] formalized it as the list k-means problem.)

Proof. The algorithm is as follows. We first run algorithm B to obtain a list 𝓛. For every k-center-set in the list, the algorithm runs the partition algorithm A_𝒞 on it. The algorithm then outputs the k-center-set that gives the minimum clustering cost. Let F' be this k-center-set and C' be the corresponding clustering. We claim that (F', C') is an α-approximation for the constrained k-service problem.

Let C* be an optimal solution for the constrained k-service instance (C, L, k, d, ℓ, 𝒞) and let F* denote the corresponding k-center-set. By the definition of the list k-service problem, with probability at least 1/2, there is a k-center-set F in the list 𝓛 such that Ψ(F, C*) ≤ α · Ψ(F*, C*). Let C = A_𝒞(F) ∈ 𝒞 be the clustering corresponding to F. Thus, Ψ(F, C) ≤ Ψ(F, C*) ≤ α · Ψ(F*, C*). Since F' gives the minimum-cost clustering in the list, we have Ψ(F', C') ≤ Ψ(F, C). Therefore, Ψ(F', C') ≤ α · Ψ(F*, C*).

Since the algorithm runs the partition procedure for every center set in the list, the running time of this step is |𝓛| · T_A. Picking a minimum-cost clustering from the list takes O(|𝓛|) time. Hence the overall running time is O(T_B + |𝓛| · T_A).

Now suppose we are given a list 𝓛 of size g(k) (for some function g) and a partition algorithm for the problem with polynomial running time. Then by Theorem 1, we get an FPT algorithm for the constrained k-service problem. Since efficient partition algorithms exist for many of the constrained k-service problems, it makes sense to design an algorithm for the list k-service problem that outputs a list of size at most g(k). We design such an algorithm in Section D of this paper. We also need to make sure that the partition algorithms for the constrained problems in Table 1 exist, so that our plan of approaching the constrained problem via the list problem can be executed. Indeed, Ding and Xu [DX15] gave partition algorithms for a number of constrained problems. We make additions to their list, which allows us to discuss new problems in this work. These additions, and other discussions on approaching specific constrained problems using the list problem, appear in Section E of the Appendix. What we note here is that the approximation guarantee for the list problem carries over to all the constrained problems in Table 1.

We now look at our main result for the list k-service problem and its main implications for the constrained problems. We will show the following result for the list k-service problem.

Theorem 2 (Main Theorem).
Let 0 < ε ≤ 1. Let (C, L, k, d, ℓ) be any k-service instance and let C = {C_1, C_2, ..., C_k} be any arbitrary clustering of the client set. There is an algorithm that, with probability at least 1/2, outputs a list 𝓛 of size (k/ε)^{O(kℓ)}, such that there is a k-center-set S ∈ 𝓛 with Ψ(S, C) ≤ (3^ℓ + ε) · Ψ*(C). Moreover, the running time of the algorithm is O(n · (k/ε)^{O(kℓ)}). For the special case when C ⊆ L, the algorithm gives a (2^ℓ + ε)-approximation guarantee.

Using the above theorem together with Theorem 1, we obtain the following main results for the constrained k-means and k-median problems.

Corollary 1 (k-means). For any constrained version of the metric k-means problem with an efficient partition algorithm, there is a (9 + ε)-approximation algorithm with an FPT running time of (k/ε)^{O(k)} · n^{O(1)}. For the special case when C ⊆ L, the algorithm gives a (4 + ε)-approximation guarantee.

Corollary 2 (k-median). For any constrained version of the metric k-median problem with an efficient partition algorithm, there is a (3 + ε)-approximation algorithm with an FPT running time of (k/ε)^{O(k)} · n^{O(1)}. For the special case when C ⊆ L, the algorithm gives a (2 + ε)-approximation guarantee.

Note that by Theorem 1, as long as the running time of the partition algorithm is g(k) · n^{O(1)}, the total running time of the algorithm stays FPT. All the problems in Table 1 either have an efficient partition algorithm (polynomial in n and k) or a partition algorithm with an FPT running time. We discuss these partition algorithms in Section E of the Appendix. Therefore, all the problems given in Table 1 admit a (9 + ε)-approximation and a (3 + ε)-approximation for the k-means and k-median objectives respectively. It should be noted that, other than the problems mentioned in Table 1, our algorithm works for any problem that fits the framework of the constrained k-service problem (i.e., Definition 2) and has a partition algorithm. This makes the approach extremely versatile, since one may be able to solve more problems that arise in the future. (We note that new ways of modelling fairness in clustering are giving rise to new clustering problems with fairness constraints, and some of these new problems may fit into this framework.) The known results on the constrained problems in Table 1 are summarised in Table 2. Even though our work does not address the k-center problem, we state results on k-center just to convey the state of the art for these problems. Note that for all these problems we obtain an FPT-time (9 + ε)-approximation and (3 + ε)-approximation for k-means and k-median respectively. For the special case when C ⊆ L (a facility can be opened at any client location), we obtain an FPT-time (4 + ε)-approximation and (2 + ε)-approximation for k-means and k-median respectively. There are some subtle differences between the problems in Table 1 and Table 2; these allow us to compare our results with known results, and we will highlight them in the related work section.

Table 2. Known results for the constrained clustering problems (columns: metric k-center; metric k-median; metric k-means). Note that for all the problems below we obtain an FPT-time (3 + ε)-approximation and (9 + ε)-approximation for k-median and k-means respectively. For the special case when C ⊆ L (a facility can be opened at any client location), we obtain an FPT-time (2 + ε)-approximation and (4 + ε)-approximation for k-median and k-means respectively.

1. r-gather k-service (uniform case): 2-approx. [APF+10]; 7.2-approx. [Din18] (for C = L, in FPT time); 86.9-approx. [Din18] (for C = L, in FPT time).
2. r-capacity k-service (uniform case): 6-approx. [KS00]; (3 + ε)-approx. [CAL19] (in FPT time); (9 + ε)-approx. [CAL19] (in FPT time).
3. l-diversity k-service: (2 + ε)-approx. [LYZ10]; no result known; no result known.
4. Chromatic k-service: no result known; no result known; no result known.
5. Fault-tolerant k-service: 3-approx. [KPS00]; 93-approx. [HHL+16]; no result known.
6. Semi-supervised k-service: no result known; no result known; no result known.
7. Uncertain k-service (assigned version): 10-approx. [AJ18]; (6.35 + ε)-approx. [CM08] (for C ⊆ L); (74 + ε)-approx. [CM08] (for C ⊆ L).
8. Outlier k-service: 3-approx. [CKMN01]; (7 + ε)-approx. [KLS18]; (53 + ε)-approx. [KLS18].
For the Euclidean k-means and k-median (where C ⊆ L = R^d), all the constrained problems have an FPT-time (1 + ε)-approximation algorithm [DX15, BJK18].

Moreover, we can convert our algorithms to streaming algorithms using the technique of Goyal et al. [GJK19]. We basically require a streaming version of our algorithm for the list k-service problem and a streaming partition algorithm for the constrained k-service problem. In Section 1.5, we design a constant-pass log-space streaming algorithm for the list k-service problem. We already know streaming partition algorithms for the various constrained k-service problems [GJK19]. This gives a streaming algorithm for all the problems given in Table 1 except for the l-diversity and chromatic k-service problems. Although single-pass streaming algorithms are considered more useful, it is interesting to know that there is a constant-pass streaming algorithm for many constrained versions of the k-service problem.

A unified framework for constrained k-means/k-median problems was introduced by Ding and Xu [DX15]. Using this framework, they designed a PTAS (for fixed k) for various constrained clustering problems. However, their study was limited to the Euclidean space, where C ⊆ L = R^d. Their results were obtained through an algorithm for the list version of the k-means problem (even though it was not formally defined in their work). The running time of this algorithm was O(nd · (log n)^{k · poly(k/ε)}) and the list size was (log n)^{k · poly(k/ε)}. Bhattacharya et al.
[BJK18] formally defined and studied the list k-service problem. They obtained a faster algorithm for the list problem, improving the running time to O(nd · (k/ε)^{O(log(k/ε))}) and the list size to (k/ε)^{O(log(k/ε))} for the constrained k-means/k-median problem. Recently, Goyal et al. [GJK19] designed streaming algorithms for various constrained versions of the problem by extending the previous work of Bhattacharya et al. [BJK18]. In this paper, we study the problem in general metric spaces while treating L and C as separate sets. More importantly, we design an algorithm that gives better approximation guarantees than the previously known algorithms by taking advantage of FPT running time. Moreover, for many problems, it is the first algorithm that achieves a constant approximation in FPT running time. Please see Table 2 for the known results on the problem. We have a detailed discussion of these problems in Section B of the Appendix.

In the introduction, we would specifically like to discuss the result of Addad et al. [CAL19] for the capacitated k-service problem. Their definition of the capacitated k-service problem is different from the one mentioned in Table 1 that we consider. Following is their definition of the capacitated k-service problem.

Definition 4 (Addad et al. [CAL19]).
Given an instance I = (C, L, k, d, ℓ) of the k-service problem and a capacity function r : L → Z+, find a set F ⊆ L of k facilities such that the assignment cost Σ_{j ∈ C} min_{i ∈ F} d^ℓ(j, i) is minimized, and no more than r_i clients are assigned to a facility i ∈ L.

Note that in the above definition a facility has a capacity of r_i, whereas in our definition a cluster has a capacity of r_i. This is an important difference, since in our case we can assign a cluster of size r_i to any facility location, which is not possible under their definition. However, for uniform capacities the problem definitions are equivalent and the results become comparable. We match the approximation guarantees obtained by Addad et al. [CAL19] for the uniform case, even though using very different techniques.

As we mentioned earlier, the unconstrained metric k-median problem is hard to approximate within a factor of (1 + 2/e), and the metric k-means problem is hard to approximate within a factor of (1 + 8/e). Surprisingly, these lower bounds persist even if we allow an FPT running time [CAGK+19]. Thus, the approximability of the unconstrained k-means and k-median problems in the metric setting is fairly well understood. On the other hand, our understanding of most constrained versions of the problem is still far from complete. We believe that our work is an important step in understanding constrained problems in general metric spaces.

In this section, we discuss our sampling-based algorithm for the list k-service problem. As described earlier, an FPT algorithm for the list k-service problem gives an FPT algorithm for any constrained version of the k-service problem that has an efficient or FPT-time partition algorithm. Our sampling-based algorithm is similar to the algorithm of Goyal et al. [GJK19] that was specifically designed for the Euclidean setting.
However, working in a metric space instead of a Euclidean space poses challenges, as some of the main tools used for the analysis in the Euclidean setting cannot be used in metric spaces. We carefully devise and prove new sampling lemmas that make the high-level analysis of Goyal et al. [GJK19] go through. Our algorithm is based on D^ℓ-sampling. Given a center set F, D^ℓ-sampling a point from the client set C w.r.t. F means sampling using the distribution in which the sampling probability of a client x ∈ C is Φ(F, {x})/Φ(F, C) = min_{f ∈ F} d^ℓ(f, x) / Σ_{y ∈ C} min_{f ∈ F} d^ℓ(f, y). In case F is empty, D^ℓ-sampling is the same as uniform sampling. Following is our algorithm for the list k-service problem:

List-k-service(C, L, k, d, ℓ, ε)
  Inputs: a k-service instance (C, L, k, d, ℓ) and an accuracy parameter ε
  Output: a list 𝓛, each element of 𝓛 being a k-center-set
  Constants: β, γ, η (parameters fixed by the analysis, each of order (k/ε)^{O(ℓ)}; η also depends on the approximation factor α of step (1))
  (1)  Run any α-approximation algorithm for the unconstrained k-service instance (C, C, k, d, ℓ) and let F be the obtained center set (k-means++ [AV07] is one such algorithm)
  (2)  𝓛 ← ∅
  (3)  Repeat 2^k times:
  (4)      Sample a multi-set M of ηk points from C using D^ℓ-sampling w.r.t. center set F
  (5)      M ← M ∪ F
  (6)      T ← ∅
  (7)      For every point x in M:
  (8)          T ← T ∪ {the k points in L that are closest to x}
  (9)      For all subsets S of T of size k:
  (10)         𝓛 ← 𝓛 ∪ {S}
  (11) return(𝓛)

Algorithm 1.1.
Algorithm for the list k-service problem

Let us discuss some of the main ideas of the algorithm and its analysis. Note that in the first step, we obtain a center-set F ⊆ C which is an α-approximation for the unconstrained k-service instance (C, C, k, d, ℓ). That is, Φ(F, C) ≤ α · OPT(C, C). One such algorithm is the k-means++ algorithm [AV07], which gives an O(4^ℓ · log k)-approximation guarantee and has running time O(nk). Now, let us see how the center-set F can help us. Let us focus on any cluster C_i of a target clustering C = {C_1, . . . , C_k}. We note that the closest facility to a uniformly sampled client from any client set C_i provides, in expectation, a constant approximation to the optimal 1-median/means cost of C_i. This is formalized in the next lemma. This lemma (or a similar version) has been used in multiple other works for analysing sampling based algorithms. Lemma 1.
Let S ⊆ C be any subset of clients and let f* be any center in L. If we uniformly sample a point x from S and open a facility at the closest location in L, then the following inequality holds: E[Φ(t(x), S)] ≤ 3^ℓ · Φ(f*, S), where t(x) is the closest facility location to x.

Unfortunately, we cannot uniformly sample from C_i directly since C_i is not known to us. Given this, our main objective should be to use F to approximate uniform sampling from C_i, so that we can achieve a constant approximation for C_i. Let us do a case analysis based on the distance of the points in C_i from the nearest point in F. Consider the following two possibilities. The first possibility is that the points in C_i are close to F. If this is the case, we can pick a point from F instead of sampling from C_i. This incurs some extra cost; however, the cost is small and can be bounded. To cover this first possibility, the algorithm adds the entire set F to the set of sampled points M (see line (5) of the algorithm). The second possibility is that the points in C_i are far away from F. In this case, we can D^ℓ-sample points from C. Since the points in C_i are far away, the sampled set contains a good portion of points from C_i, and these points are almost uniformly distributed. We will show that almost uniform sampling is sufficient to apply Lemma 1 to C_i. However, we have to sample a large number of points to boost the success probability. This requirement is taken care of by line (4) of the algorithm. Note that we may need a hybrid approach for the analysis, since the real case may be a combination of the first and second possibilities. Most of the ingenuity of this work lies in formulating and proving appropriate sampling lemmas that make this hybrid analysis work.
To apply Lemma 1, we need to fulfil one more condition: we need the closest facility location to a sampled point. This requirement is handled by lines (7) and (8) of the algorithm.
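Lemma 1 is easy to check empirically. The sketch below (a quick sanity experiment of ours, not the paper's proof) computes the exact expectation over a uniformly random x ∈ S on a random line metric and compares it against 3^ℓ · Φ(f*, S) for the best single facility f* ∈ L:

```python
# Empirical check of Lemma 1 on a random line metric: if x is a uniformly
# random client of S and t(x) is the facility location nearest to x, then
# E[Phi(t(x), S)] <= 3^ell * Phi(f*, S) for the best single facility f*.
import random

random.seed(0)
S = [random.uniform(0, 100) for _ in range(40)]       # client positions
L = [random.uniform(0, 100) for _ in range(10)]       # facility positions

def phi(center, points, ell):
    return sum(abs(center - p) ** ell for p in points)

for ell in (1, 2):
    opt = min(phi(f, S, ell) for f in L)              # Phi(f*, S)
    expected = sum(phi(min(L, key=lambda f: abs(f - x)), S, ell)
                   for x in S) / len(S)               # exact E over uniform x
    assert expected <= 3 ** ell * opt
```

The bound holds deterministically here: for ℓ = 1 it follows from d(t(x), j) ≤ d(x, f*) + d(x, j) and E[Φ(x, S)] ≤ 2Φ(f*, S), and for ℓ = 2 from the analogous squared-triangle-inequality argument.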
However, note that the algorithm picks the k closest facility locations instead of just one facility location. We will show that this step is crucial for obtaining a hard-assignment solution for the problem. Finally, the algorithm adds all the potential center sets to a list L (see lines (9) and (10) of the algorithm). The algorithm repeats this procedure 2^k times to boost the success probability (see line (3) of the algorithm). We will show the following result, from which our main theorem (Theorem 2) trivially follows. Theorem 3.
Let 0 < ε ≤ 1. Let (C, L, k, d, ℓ) be any k-service instance and let C = {C_1, C_2, . . . , C_k} be any arbitrary clustering of the client set. The algorithm List-k-service(C, L, k, d, ℓ, ε), with probability at least 1/2, outputs a list L of size (k/ε)^{O(kℓ)} such that there is a k-center set S ∈ L in the list with Ψ(S, C) ≤ (3^ℓ + ε) · Ψ*(C). Moreover, the running time of the algorithm is O(n · (k/ε)^{O(kℓ)}). For the special case C ⊆ L, the approximation guarantee is (2^ℓ + ε). The details of the analysis are given in Appendix D.
We gave sampling based algorithms and showed an approximation guarantee of (3^ℓ + ε) (and (2^ℓ + ε) for the special case C ⊆ L). In this subsection, we show that our analysis of the approximation factor is tight. More specifically, we will show that our algorithm does not provide a better than (3^ℓ − δ′) approximation guarantee for arbitrarily small δ′ > 0 (and (2^ℓ − δ′) for the case C ⊆ L). To show this, we create a bad instance for the problem in the following manner. We create the instance using an undirected weighted graph where C ∪ L is the vertex set of the graph and the shortest weighted path between two vertices defines the distance metric. The set C is partitioned into the subsets C_1, C_2, . . . , C_k, and L is partitioned into the subsets L_1, L_2, . . . , L_k. The sub-graphs over C_1 ∪ L_1, C_2 ∪ L_2, . . . , and C_k ∪ L_k are all identical to each other. Let us describe the sub-graph over the vertex set C_i ∪ L_i in general. In this sub-graph, all the clients are connected to a common facility location f*_i with an edge of unit weight. Also, every client is connected to a distinct set of k facility locations, each with an edge of weight (1 − δ). We denote this set by T(x) for a client x ∈ C_i. Figure 1 gives the complete description of this sub-graph. Lastly, every pair of sub-graphs C_i ∪ L_i and C_j ∪ L_j is connected by an edge (f*_i, f*_j) of weight Δ ≫ |C|. This completes the construction of the bad instance.
Let us define a target clustering on the instance. Consider the unconstrained k-service problem. It is easy to see that C = {C_1, C_2, . . . , C_k} is an optimal clustering for this instance.
The optimal cost of a cluster C_i is Φ(f*_i, C_i) = |C_i|, and the optimal cost of the entire instance is OPT = Σ_i |C_i| = |C|.
Now, we will show that no list L produced by the algorithm List-k-service contains a center-set that provides a better than (3^ℓ − δ′)-approximation for C. To show this, let us examine every center-set in the list L produced by List-k-service. Note that the set T obtained in line (8) of the algorithm does not contain any optimal facility location f*_i, because f*_i does not belong to T(x) for any client x. Therefore, no center set in the list contains any of the optimal facility locations {f*_1, . . . , f*_k}. Let us evaluate the clustering cost corresponding to every center set in the list. Let F = {f_1, f_2, . . . , f_k} be a center-set in the list. We have two possibilities for the facilities in F. The first possibility is that at least two facilities in F belong to the same sub-graph C_i ∪ L_i. In this case, the cost of the target clustering is Ψ(F, C) > Δ ≫ OPT. So, in this case, F gives an unbounded clustering cost. Let us consider the second possibility, namely that all facilities in F belong to different sub-graphs. Without loss of generality, we can assume that f_i ∈ L_i. Since f_i cannot be the optimal facility location, we can further assume that f_i ∈ T(x) for some x ∈ C_i. The cost of a cluster in this case is Φ(f_i, C_i) = (3 − δ)^ℓ (|C_i| − 1) + (1 − δ)^ℓ > (3 − δ)^ℓ (|C_i| − 1). Hence, the overall cost of the instance is Ψ(F, C) > (3 − δ)^ℓ · (|C| − k) ≥ (3 − δ)^ℓ · |C| − 3^ℓ k ≥ (3^ℓ − δ′) · |C|, for δ′ = 3^{ℓ−1} · ℓδ + 3^ℓ k/|C|. Therefore, the list does not contain any center set that provides a better than (3^ℓ − δ′) approximation guarantee for C. Theorem 4.
For any 0 < δ′ ≤ 1, there are instances of the k-service problem for which the algorithm List-k-service(C, L, k, d, ℓ, ε) does not provide a better than (3^ℓ − δ′) approximation guarantee.
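The cost calculations behind this lower bound can be replayed numerically. Below is a small Python sketch (our own encoding of the single-cluster gadget of Figure 1; the values of n and δ are arbitrary illustrative choices) that evaluates the cost of a cluster C_i for the three kinds of centers arising in the analysis: the optimal facility f*_i, a facility from some T(x), and, for the case C ⊆ L treated later, a client location.

```python
# Single-cluster gadget: n clients at distance 1 from a common facility f*,
# each client x also owning private facilities at distance (1 - delta).
# Shortest paths give d(x, y) = 2 for distinct clients and
# d(t, y) = 3 - delta for a private facility t of x and any other client y.
n, delta = 1000, 1e-3

def cluster_cost(kind, ell):
    if kind == "opt":        # open f*: every client pays 1
        return n
    if kind == "private":    # open some t in T(x): the case the list hits
        return (1 - delta) ** ell + (3 - delta) ** ell * (n - 1)
    if kind == "client":     # C subset of L: open the client x itself
        return 2 ** ell * (n - 1)

for ell in (1, 2):
    opt = cluster_cost("opt", ell)
    print(ell, cluster_cost("private", ell) / opt, cluster_cost("client", ell) / opt)
# the ratios approach 3^ell and 2^ell as n grows and delta -> 0
```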
Fig. 1.
An undirected weighted sub-graph on C_i ∪ L_i.

Now, let us examine the same bad instance when we have the flexibility to open a facility at a client location. In this case, we have a third possibility: F = {f_1, f_2, . . . , f_k} such that f_i is some client location in C_i. The cost of a cluster in this case is Φ(f_i, C_i) = 2^ℓ · (|C_i| − 1), and the overall cost of the instance is Ψ(F, C) = 2^ℓ · |C| − 2^ℓ · k = (2^ℓ − δ′) · |C|, for δ′ = 2^ℓ · k/|C|. So, for the special case C ⊆ L, we obtain the following theorem. Theorem 5.
For any 0 < δ′ ≤ 1, there are instances of the k-service problem (with C ⊆ L) for which the algorithm List-k-service(C, L, k, d, ℓ, ε) does not provide a better than (2^ℓ − δ′) approximation guarantee.

In this subsection, we discuss how to obtain a constant-pass streaming algorithm using the ideas of Goyal et al. [GJK19]. Our offline algorithm has two main components: the list k-service algorithm and the partition algorithm. The list k-service procedure is common to all constrained versions of the problem; however, the partition algorithm differs across constrained versions. First, let us convert the List-k-service(C, L, k, d, ℓ, ε) algorithm to a streaming algorithm.
1. In the first pass, we run a streaming α-approximation algorithm for the instance (C, C, k, d, ℓ). For this, we can use the streaming algorithm of Braverman et al. [BMO+11], which has space complexity O(k log n).
2. In the second pass, we perform the D^ℓ-sampling step using the reservoir sampling technique [Vit85].
3. In the third pass, we find the k closest facility locations for every point in M.
This gives us the following result. Theorem 6.
There is a 3-pass streaming algorithm for the list k-service problem, with running time O(n · f(k, ε)) and space complexity f(k, ε) · log n, where f(k, ε) = (k/ε)^{O(kℓ)}.

Now, let us discuss the partition algorithms in the streaming setting. For the l-diversity and chromatic k-service problems, it is known that there is no deterministic log-space streaming algorithm [GJK19]. For the remaining constrained problems, there are streaming partition algorithms, which are discussed in [GJK19] (for the Euclidean setting) and in Section E of the Appendix (metric setting). Note that the streaming partitioning algorithms do not give an optimal partitioning, but only a partitioning that is close to the optimal one. Each algorithm makes at most 3 passes over the data set and has logarithmic space complexity. The partition algorithm, together with the list k-service algorithm, gives the following main results. Theorem 7.
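The D^ℓ-sampling pass can be implemented with weighted reservoir sampling. The sketch below is our own code (the paper cites Vitter's reservoir technique [Vit85]; for concreteness we use the Efraimidis-Spirakis weighted variant, and all names and helper signatures are ours): each streamed client x gets the key u^{1/w(x)} with w(x) = min_{f∈F} d(x, f)^ℓ, and the client with the largest key is kept. Running s independent reservoirs yields s i.i.d. D^ℓ-samples in a single pass.

```python
import random

# One-pass D_ell-sampling via weighted reservoir sampling: for each of the
# s reservoirs, keep the streamed client whose random key u^(1/w) is
# largest; the kept client is distributed proportionally to its weight w.
def stream_d_ell_samples(stream, F, d, ell, s, rng=random):
    best_keys = [-1.0] * s
    best_items = [None] * s
    for x in stream:
        w = min(d(x, f) for f in F) ** ell
        if w == 0:
            continue                      # zero-weight clients are never sampled
        for i in range(s):
            key = rng.random() ** (1.0 / w)
            if key > best_keys[i]:
                best_keys[i] = key
                best_items[i] = x
    return best_items
```

On a stream with weights (0, 1, 2), for example, the zero-weight client is never returned and the others appear with frequencies close to 1/3 and 2/3.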
For the following constrained k-service problems, there is a constant-pass streaming algorithm that gives a (3^ℓ + ε)-approximation guarantee:
1. r-gather k-service problem
2. r-capacity k-service problem
3. Fault-tolerant k-service problem
4. Semi-supervised k-service problem
5. Uncertain k-service problem (assigned case)
The algorithm has space complexity O(f(k, ε, ℓ) · log n) and running time O(f(k, ε, ℓ) · n^{O(1)}), where f(k, ε, ℓ) = (k/ε)^{O(kℓ)}. Further, the algorithm gives a (2^ℓ + ε)-approximation guarantee when C ⊆ L. Theorem 8.
For the outlier k-service problem, there is a 5-pass streaming algorithm that gives a (3^ℓ + ε)-approximation guarantee. The algorithm has space complexity O(f(k, m, ε, ℓ) · log n) and running time f(k, m, ε, ℓ) · n^{O(1)}, where f(k, m, ε, ℓ) = ((k + m)/ε)^{O(kℓ)}. Further, the algorithm gives a (2^ℓ + ε)-approximation guarantee when C ⊆ L.

In this paper, we worked within the unified framework of Ding and Xu [DX15] to obtain simple sampling based algorithms for a range of constrained k-median/means problems in general metric spaces. Surprisingly, even working within this high-level framework, we obtained approximation guarantees that are better than (or match) those of known results designed specifically for particular constrained problems. On the one hand, this shows the versatility of the unified approach combined with the sampling method. On the other hand, it encourages us to try to design algorithms with better approximation guarantees for these constrained problems. Our matching approximation lower bound for the sampling algorithm suggests that further improvement may not be possible through sampling based ideas. On the lower bound side, it may be useful to obtain results similar to those for the unconstrained setting, where approximation lower bounds of (1 + 2/e) and (1 + 8/e) are known for k-median and k-means respectively, even for FPT-time algorithms [CAGK+19]. Acknowledgements
The authors would like to thank Anup Bhattacharya for useful discussions.
References
AAB+10. Ankit Aggarwal, L. Anand, Manisha Bansal, Naveen Garg, Neelima Gupta, Shubham Gupta, and Surabhi Jain. A 3-approximation for facility location with uniform capacities. In Friedrich Eisenbrand and F. Bruce Shepherd, editors, Integer Programming and Combinatorial Optimization, pages 149–162, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
ABC+15. Hyung-Chan An, Aditya Bhaskara, Chandra Chekuri, Shalmoli Gupta, Vivek Madan, and Ola Svensson. Centrality of trees for capacitated k-center. Math. Program., 154(1–2):29–53, December 2015.
ABKS99. Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD Rec., 28(2):49–60, June 1999.
ABM+19. Marek Adamczyk, Jaroslaw Byrka, Jan Marcinkowski, Syed M. Meesum, and Michal Wlodarczyk. Constant-factor FPT approximation for capacitated k-median. In Michael A. Bender, Ola Svensson, and Grzegorz Herman, editors, 27th Annual European Symposium on Algorithms (ESA 2019), volume 144 of Leibniz International Proceedings in Informatics (LIPIcs), pages 1:1–1:14, Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
ACKS15. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry (SoCG 2015), volume 34 of Leibniz International Proceedings in Informatics (LIPIcs), pages 754–767, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
ADHP09. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn., 75(2):245–248, May 2009.
AFK+
05. Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, and An Zhu. Approximation algorithms for k-anonymity. In Proceedings of the International Conference on Database Theory (ICDT 2005), November 2005.
AGGN08. Barbara M. Anthony, Vineet Goyal, Anupam Gupta, and Viswanath Nagarajan. A plant location guide for the unsure. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '08, pages 1164–1173, USA, 2008. Society for Industrial and Applied Mathematics.
AGK+04. Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544–562, 2004.
AJ18. Sharareh Alipour and Amir Jafari. Improvements on the k-center problem for uncertain data. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS '18, pages 425–433, New York, NY, USA, 2018. Association for Computing Machinery.
ANSW17. S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 61–72, Oct 2017.
APF+10. Gagan Aggarwal, Rina Panigrahy, Tomás Feder, Dilys Thomas, Krishnaram Kenthapadi, Samir Khuller, and An Zhu. Achieving anonymity via clustering. ACM Trans. Algorithms, 6(3), July 2010.
AS13. Sara Ahmadian and Chaitanya Swamy. Improved approximation guarantees for lower-bounded facility location. In Thomas Erlebach and Giuseppe Persiano, editors, Approximation and Online Algorithms, pages 257–271, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
AS16. Sara Ahmadian and Chaitanya Swamy. Approximation algorithms for clustering problems with lower bounds and outliers. In Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016), volume 55 of Leibniz International Proceedings in Informatics (LIPIcs), pages 69:1–69:15, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
AS17. Hyung-Chan An and Ola Svensson. Recent Developments in Approximation Algorithms for Facility Location and Clustering Problems, pages 1–19. Springer Singapore, Singapore, 2017.
ASS17. Hyung-Chan An, Mohit Singh, and Ola Svensson. LP-based algorithms for capacitated facility location. SIAM Journal on Computing, 46(1):272–306, 2017.
AV07. David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics.
AY08. C. C. Aggarwal and P. S. Yu. A framework for clustering uncertain data streams. In 2008 IEEE 24th International Conference on Data Engineering, pages 150–159, 2008.
AY09. Charu C. Aggarwal and Philip S. Yu. A survey of uncertain data algorithms and applications. IEEE Trans. on Knowl. and Data Eng.,
IEEETrans. on Knowl. and Data Eng. , 21(5):609623, May 2009.BBM04. Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. A probabilistic framework for semi-supervisedclustering. In
Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining , KDD 04, page 5968, New York, NY, USA, 2004. Association for Computing Machinery.BGG12. Manisha Bansal, Naveen Garg, and Neelima Gupta. A 5-approximation for capacitated facility location.In Leah Epstein and Paolo Ferragina, editors,
Algorithms – ESA 2012 , pages 133–144, Berlin, Heidelberg,2012. Springer Berlin Heidelberg.BJK18. Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Faster algorithms for the constrained k-meansproblem.
Theor. Comp. Sys. , 62(1):93115, January 2018.BKP93. J. Barilan, G. Kortsarz, and D. Peleg. How to allocate network centers.
J. Algorithms , 15(3):385415,November 1993.BMO +
11. Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, Michael Shindler, and BrianTagiku. Streaming k-means on well-clusterable data. In
Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms , SODA 11, page 2640, USA, 2011. Society for Industrial andApplied Mathematics.BPR +
17. Jaros(cid:32)law Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improvedapproximation for k-median and positive correlation in budgeted optimization.
ACM Trans. Algorithms ,13(2), March 2017.R05. R. J. Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In , pages 217–228, 2005.BRU16. Jaros(cid:32)law Byrka, Bartosz Rybicki, and Sumedha Uniyal. An approximation algorithm for uniform ca-pacitated k-median problem with 1 + (cid:15) capacity violation. In Quentin Louveaux and Martin Skutella,editors,
Integer Programming and Combinatorial Optimization , pages 262–274, Cham, 2016. SpringerInternational Publishing.BSS10. Jaroslaw Byrka, Aravind Srinivasan, and Chaitanya Swamy. Fault-tolerant facility location: A random-ized dependent lp-rounding algorithm. In Friedrich Eisenbrand and F. Bruce Shepherd, editors,
IntegerProgramming and Combinatorial Optimization , pages 244–257, Berlin, Heidelberg, 2010. Springer BerlinHeidelberg.CAGK +
19. Vincent Cohen-Addad, Anupam Gupta, Amit Kumar, Euiwoong Lee, and Jason Li. Tight FPT approximations for k-median and k-means. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), volume 132 of Leibniz International Proceedings in Informatics (LIPIcs), pages 42:1–42:14, Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
CAKM16. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 353–364, 2016.
CAL19. Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), volume 132 of Leibniz International Proceedings in Informatics (LIPIcs), pages 41:1–41:14, Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
CC19. V. Cohen-Addad and K. C.S. Inapproximability of clustering in Lp metrics. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 519–539, 2019.
CCKN06. Michael Chau, Reynold Cheng, Ben Kao, and Jackey Ng. Uncertain data mining: An example in clustering location data. In Wee-Keong Ng, Masaru Kitsuregawa, Jianzhong Li, and Kuiyu Chang, editors, Advances in Knowledge Discovery and Data Mining, pages 199–204, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
CGR98. Shiva Chaudhuri, Naveen Garg, and R. Ravi. The p-neighbor k-center problem. Information Processing Letters, 65(3):131–134, 1998.
CGTS02. Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65(1):129–149, 2002.
Che08. Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '08, pages 826–835, USA, 2008. Society for Industrial and Applied Mathematics.
Che09a. Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
Che09b. Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
CHK12. M. Cygan, M. Hajiaghayi, and S. Khuller. LP rounding for k-centers with non-uniform hard capacities. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 273–282, 2012.
CKMN01. Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In
Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '01, pages 642–651, USA, 2001. Society for Industrial and Applied Mathematics.
CKT06. Deepayan Chakrabarti, Ravi Kumar, and Andrew Tomkins. Evolutionary clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 554–560, New York, NY, USA, 2006. Association for Computing Machinery.
CM08. Graham Cormode and Andrew McGregor. Approximation algorithms for clustering uncertain data. In Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '08, pages 191–200, New York, NY, USA, 2008. Association for Computing Machinery.
CSZ+07. Yun Chi, Xiaodan Song, Dengyong Zhou, Koji Hino, and Belle L. Tseng. Evolutionary spectral clustering by incorporating temporal smoothness. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 153–162, New York, NY, USA, 2007. Association for Computing Machinery.
CW99. Fabián A. Chudak and David P. Williamson. Improved approximation algorithms for capacitated facility location problems. In Proceedings of the 7th International IPCO Conference on Integer Programming and Combinatorial Optimization, pages 99–113, Berlin, Heidelberg, 1999. Springer-Verlag.
Das08. Sanjoy Dasgupta. The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California San Diego, 2008.
DBE99. Ayhan Demiriz, Kristin Bennett, and M. Embrechts. Semi-supervised clustering using genetic algorithms. Artif. Neural Netw. Eng, 09 1999.
Din18. Hu Ding. Faster balanced clusterings in high dimension, 2018.
DL16. Gökalp Demirci and Shi Li. Constant approximation for capacitated k-median with (1+epsilon)-capacity violation. In Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016), volume 55 of Leibniz International Proceedings in Informatics (LIPIcs), pages 73:1–73:14, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
DMZ13. Riccardo Dondi, Giancarlo Mauri, and Italo Zoppis. The l-diversity problem: Tractability and approximability. Theor. Comput. Sci., 511:159–171, November 2013.
DX11. Hu Ding and Jinhui Xu. Solving the chromatic cone clustering problem via minimum spanning sphere. In Proceedings of the 38th International Colloquium Conference on Automata, Languages and Programming - Volume Part I, ICALP '11, pages 773–784, Berlin, Heidelberg, 2011. Springer-Verlag.
DX12. Hu Ding and Jinhui Xu. Chromatic clustering in high dimensional space, 2012.
DX15. Hu Ding and Jinhui Xu. A unified framework for clustering constrained data without locality property. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 1471–1490, USA, 2015. Society for Industrial and Applied Mathematics.
EKSX96. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD '96, pages 226–231. AAAI Press, 1996.
FKRS19. Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2), February 2019.
FMS07a. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, SCG '07, pages 11–18, New York, NY, USA, 2007. ACM.
FMS07b. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, SCG '07, pages 11–18, New York, NY, USA, 2007. ACM.
FRS16. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 365–374, 2016.
FS12. Dan Feldman and Leonard J. Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '12, pages 1343–1354, USA, 2012. Society for Industrial and Applied Mathematics.
FZH+19. Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Improved algorithms for clustering with outliers. In Pinyan Lu and Guochuan Zhang, editors, 30th International Symposium on Algorithms and Computation (ISAAC 2019), volume 149 of Leibniz International Proceedings in Informatics (LIPIcs), pages 61:1–61:12, Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
GCB. Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Unsupervised and semi-supervised clustering: a brief survey.
GJK19. Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. Streaming PTAS for constrained k-means, 2019.
GK99. Sudipto Guha and Samir Khuller. Greedy strikes back: Improved facility location algorithms. Journal of Algorithms,
Journalof Algorithms , 31(1):228 – 248, 1999.GKKM07. Gabriel Ghinita, Panagiotis Karras, Panos Kalnis, and Nikos Mamoulis. Fast data anonymization withlow information loss. In
Proceedings of the 33rd International Conference on Very Large Data Bases ,VLDB 07, page 758769. VLDB Endowment, 2007.GM09. Sudipto Guha and Kamesh Munagala. Exceeding expectations and clustering uncertain data. In
Pro-ceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of DatabaseSystems , PODS 09, page 269278, New York, NY, USA, 2009. Association for Computing Machinery.GMM00. S. Guha, A. Meyerson, and K. Munagala. Hierarchical placement and network design problems. In
Proceedings of the 41st Annual Symposium on Foundations of Computer Science , FOCS 00, page 603,USA, 2000. IEEE Computer Society.GMM01. Sudipto Guha, Adam Meyerson, and Kamesh Munagala. Improved algorithms for fault tolerant facilitylocation. In
Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms , SODA01, page 636641, USA, 2001. Society for Industrial and Applied Mathematics.MMO00. S. Guha, N. Mishra, R. Motwani, and L. OCallaghan. Clustering data streams. In
Proceedings of the41st Annual Symposium on Foundations of Computer Science , FOCS 00, page 359, USA, 2000. IEEEComputer Society.GT08. Anupam Gupta and Kanat Tangwongsan. Simpler analyses of local search algorithms for facility location.
CoRR , abs/0809.2554, 2008.GTC06. Jing Gao, Pang-Ning Tan, and Haibin Cheng. Semi-supervised clustering with partial background in-formation. In
Proceedings of the 2006 SIAM International Conference on Data Mining , pages 489–493.SIAM, 2006.HHL +
16. Mohammadtaghi Hajiaghayi, Wei Hu, Jian Li, Shi Li, and Barna Saha. A constant factor approximation algorithm for fault-tolerant k-median. ACM Trans. Algorithms, 12(3), April 2016.
HS00. Erez Hartuv and Ron Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4):175–181, 2000.
JKS14. Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple D²-sampling based PTAS for k-means and other clustering problems. Algorithmica, 70(1):22–46, 2014.
JMF99. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, September 1999.
JMM+03. Kamal Jain, Mohammad Mahdian, Evangelos Markakis, Amin Saberi, and Vijay V. Vazirani. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. J. ACM, 50(6):795–824, November 2003.
JMS02. Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 731–740, New York, NY, USA, 2002. Association for Computing Machinery.
KLS18. Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 646–659, New York, NY, USA, 2018. Association for Computing Machinery.
KM00. D. R. Karger and M. Minkoff. Building Steiner trees with incomplete global knowledge. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS '00, page 613, USA, 2000. IEEE Computer Society.
KMN+02. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth Annual Symposium on Computational Geometry, SCG '02, pages 10–18, New York, NY, USA, 2002. Association for Computing Machinery.
KP05. Hans-Peter Kriegel and Martin Pfeifle. Density-based clustering of uncertain data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 672–677, New York, NY, USA, 2005. Association for Computing Machinery.
KPR98. Madhukar R. Korupolu, C. Greg Plaxton, and Rajmohan Rajaraman. Analysis of a local search heuristic for facility location problems. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '98, pages 1–10, USA, 1998. Society for Industrial and Applied Mathematics.
KPS00. Samir Khuller, Robert Pless, and Yoram J. Sussmann. Fault tolerant k-center problems. Theoretical Computer Science,
TheoreticalComputer Science , 242(1):237 – 245, 2000.KS00. Samir Khuller and Yoram J. Sussmann. The capacitated k-center problem.
SIAM J. Discret. Math. ,13(3):403418, May 2000.KSS10. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clusteringproblems in any dimensions.
J. ACM , 57(2):5:1–5:32, February 2010.KV00. Jain Kamal and Vijay V. Vazirani. An approximation algorithm for the fault tolerant metric facility loca-tion problem. In Klaus Jansen and Samir Khuller, editors,
Approximation Algorithms for CombinatorialOptimization , pages 177–182, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.LDR05. Kristen LeFevre, David J. DeWitt, and Raghu Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In
Proceedings of the 2005 ACM SIGMOD International Conference on Management ofData , SIGMOD 05, page 4960, New York, NY, USA, 2005. Association for Computing Machinery.Li16. Shi Li. Approximating capacitated k-median with (1 + ε )k open facilities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms , SODA 16, page 786796, USA, 2016.Society for Industrial and Applied Mathematics.Li17. Shi Li. On uniform capacitated k-median beyond the natural lp relaxation.
ACM Trans. Algorithms ,13(2), January 2017.Li19. Shi Li. On facility location with general lower bounds. In
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , SODA 19, page 22792290, USA, 2019. Society for Industrialand Applied Mathematics.S13. Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. In
Proceedings of theForty-Fifth Annual ACM Symposium on Theory of Computing , STOC 13, page 901910, New York, NY,USA, 2013. Association for Computing Machinery.LSS04. Retsef Levi, David B. Shmoys, and Chaitanya Swamy. Lp-based approximation algorithms for capaci-tated facility location. In Daniel Bienstock and George Nemhauser, editors,
Integer Programming andCombinatorial Optimization , pages 206–218, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.LSS13. Christiane Lammersen, Melanie Schmidt, and Christian Sohler. Probabilistic k-median clustering in datastreams. In Thomas Erlebach and Giuseppe Persiano, editors,
Approximation and Online Algorithms ,pages 70–81, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.LYZ10. Jian Li, Ke Yi, and Qin Zhang. Clustering with diversity. In
Proceedings of the 37th InternationalColloquium Conference on Automata, Languages and Programming , ICALP10, page 188200, Berlin, Hei-delberg, 2010. Springer-Verlag.MKGV07. Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubrama-niam. L-diversity: Privacy beyond k-anonymity.
ACM Trans. Knowl. Discov. Data , 1(1):3es, March2007.MMK08. Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center clustering withoutliers and with anonymity. In Ashish Goel, Klaus Jansen, Jos´e D. P. Rolim, and Ronitt Rubinfeld,editors,
Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques ,pages 165–178, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.MNV12. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is np-hard.
Theoretical Computer Science , 442:13 – 21, 2012. Special Issue on the Workshop on Algorithmsand Computation (WALCOM 2009).MP03. Mohammad Mahdian and Martin P´al. Universal facility location. In Giuseppe Di Battista and Uri Zwick,editors,
Algorithms - ESA 2003 , pages 409–421, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.MW04. Adam Meyerson and Ryan Williams. On the complexity of optimal k-anonymity. In
Proceedings of theTwenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems , PODS04, page 223228, New York, NY, USA, 2004. Association for Computing Machinery.NKC +
06. W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip. Efficient clustering of uncertaindata. In
Sixth International Conference on Data Mining (ICDM’06) , pages 436–445, 2006.PTW01. M. P´al, ´E. Tardos, and T. Wexler. Facility location with nonuniform hard capacities. In
Proceedings ofthe 42nd IEEE Symposium on Foundations of Computer Science , FOCS 01, page 329, USA, 2001. IEEEComputer Society.Sam01. P. Samarati. Protecting respondents identities in microdata release.
IEEE Trans. on Knowl. and DataEng. , 13(6):10101027, November 2001.SS08. Chaitanya Swamy and David B. Shmoys. Fault-tolerant facility location.
ACM Trans. Algorithms , 4(4),August 2008.Svi10. Zoya Svitkina. Lower-bounded facility location.
ACM Trans. Algorithms , 6(4), September 2010.Swe02. Latanya Sweeney. K-anonymity: A model for protecting privacy.
Int. J. Uncertain. Fuzziness Knowl.-Based Syst. , 10(5):557570, October 2002.Vat09. Andrea Vattani. The hardness of k-means clustering in the plane. Technical report, Department ofComputer Science and Engineering, University of California San Diego, 2009.Vit85. J S Vitter. Random sampling with a reservoir.
ACM Trans. Math. Software , 11(1):37 – 57, 1985.WC00. Kiri Wagstaff and Claire Cardie. Clustering with instance-level constraints. In
Proceedings of the Sev-enteenth International Conference on Machine Learning , ICML 00, page 11031110, San Francisco, CA,USA, 2000. Morgan Kaufmann Publishers Inc.WCRS01. Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schr¨odl. Constrained k-means clustering withbackground knowledge. In
Proceedings of the Eighteenth International Conference on Machine Learning ,ICML 01, page 577584, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.XMX +
20. Yicheng Xu, Rolf H M¨ohring, Dachuan Xu, Yong Zhang, and Yifei Zou. A constant fpt approximationalgorithm for hard-capacitated k-means.
Optimization and Engineering , pages 1–14, 2020.XT15. Dongkuan Xu and Yingjie Tian. A comprehensive survey of clustering algorithms.
Annals of DataScience , 2, 08 2015.XYT10. Xiaokui Xiao, Ke Yi, and Yufei Tao. The hardness and approximation algorithms for l-diversity. In
Proceedings of the 13th International Conference on Extending Database Technology , EDBT 10, page135146, New York, NY, USA, 2010. Association for Computing Machinery.ZCY04. Jiawei Zhang, Bo Chen, and Yinyu Ye. A multi-exchange local search algorithm for the capacitatedfacility location problem. In Daniel Bienstock and George Nemhauser, editors,
Integer Programming andCombinatorial Optimization , pages 219–233, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.ZP11. Bin Zhou and Jian Pei. The k-anonymity and l-diversity approaches for privacy preservation in socialnetworks against neighborhood attacks.
Knowl. Inf. Syst. , 28(1):4777, July 2011.
Appendix
To keep the high-level discussion concise, we gave a general introduction in the main paper highlighting our main results. We use the appendix to present the paper in detail, section-wise: first a detailed discussion of the constrained problems, followed by the technical sections, starting with the preliminaries and then a section for each of the technical contributions discussed in the main paper.
B Discussion on Constrained Problems
In this section, we will look at each of the constrained problems in Table 1 in more detail.
B.1 r-Gather k-Service Problem As Table 1 shows, the problem is defined with a lower bound constraint on the cluster sizes, i.e., each cluster C_i must contain at least r_i clients. A similar problem can be defined where the lower bound constraint is on the facilities instead of the clusters; that is, a facility i ∈ L must serve at least r_i clients. Note that the two problems are not the same. However, the definitions become equivalent in the uniform setting, where the lower bound is the same for every cluster (i.e., r_1 = r_2 = ... = r_k). The second problem arises in the context of facility location (where there is no bound on the number of facilities but there is a facility opening cost) and is called the lower bound facility location (LBLF) problem. The first problem in the uniform setting is called the r-gather problem, and we refer to the version where the r_i's may differ as the non-uniform r-gather problem.
The lower bound facility location problem was first introduced by Guha et al. [GMM00] and Karger et al. [KM00] to solve various network design problems. Both of these works gave bi-criteria approximation algorithms for the problem, where the lower bound constraint is violated by a constant factor and the solution cost is at most a constant times the optimal. Svitkina [Svi10] gave the first constant-approximation algorithm for the uniform LBLF problem; its approximation guarantee of 448 was later improved to 82.6 by Ahmadian and Swamy [AS13]. Recently, Shi Li [Li19] gave a 4000-approximation algorithm for the non-uniform lower bound facility location problem.
For the (uniform) r-gather k-median problem, the algorithms of [Svi10] and [AS13] can be adapted to obtain an O(1)-approximation guarantee [AS16]. Ding [Din18] gave an FPT (3λ + 2)-approximation and (18λ + 16)-approximation algorithm for the uniform r-gather k-median and k-means problems, respectively, under the assumption C = L.
Here, λ denotes the approximation guarantee of any unconstrained k-median or k-means algorithm. We can use the FPT algorithm of Addad et al. [CAGK+19] for the unconstrained problems, which has λ = 1 + 2/e for k-median and λ = 1 + 8/e for k-means; moreover, these bounds are tight conditioned on some recent complexity-theoretic conjectures. So, the algorithm of Ding [Din18] gives a (5 + 6/e) ≈ 7.21 approximation for the uniform r-gather k-median problem and a (34 + 144/e) ≈ 86.98 approximation for the uniform r-gather k-means problem. Even though this algorithm is FPT in k, its advantage over the previous algorithms is that it considers both lower and upper bounds on the sizes of the clusters. Under the assumption that C ⊆ L (C = L is a special case of C ⊆ L), we design (2 + ε)- and (4 + ε)-approximation algorithms for the non-uniform r-gather k-median and k-means problems, respectively, in FPT time. This improves over the result of Ding [Din18]. Moreover, our algorithm also handles simultaneous lower and upper bounds on the cluster sizes.
The r-gather k-center problem is widely used in privacy-preserving data publication. The problem was first introduced by Aggarwal et al. [APF+
10] and is based on the famous k-anonymity principle of Sweeney [Swe02]. A 2-approximation algorithm is known for the uniform r-gather k-center problem in the offline setting (Section 2.4 of [APF+10]). Moreover, the k-center problem (for r = 1) does not admit better than a (2 − ε)-approximation for any ε >
0, assuming P ≠ NP. Moreover, a 6-approximation is known for the uniform r-gather k-center problem in the streaming setting [MMK08].
B.2 r-Capacity k-Service Problem As Table 1 shows, the problem is defined by an upper bound constraint on the size of the clusters. This constraint is useful in scenarios where a facility can serve only a certain number of clients due to limited resources. A similar problem can be defined where capacities are imposed on individual facilities instead of clusters. Note that the two problems are different. However, in the case of uniform capacities (i.e., r_1 = r_2 = ... = r_k), both definitions are equivalent. The first problem in the uniform setting is called the r-capacity problem, and we refer to the version where the r_i's may differ as the non-uniform r-capacity problem. The second problem is called the capacitated k-median/means problem. It arises in the context of facility location, where there is no bound on the number of open facilities but there is a facility opening cost. The facility location problem with upper bounds on capacities is known as the capacitated facility location (CFL) problem.
The capacitated facility location problem has been studied extensively [PTW01, MP03, ZCY04, BGG12, CW99, KPR98, AAB+...]. However, for the capacitated k-median/k-means problem, no constant-factor approximation is known even in the uniform setting. Various bi-criteria approximation algorithms are known for the problem [Li17, Li16, BRU16, DL16], which violate either the capacity constraint or the cardinality constraint (the constraint on the number of open facilities) by a constant factor. The problem has also been studied from the perspective of fixed-parameter tractability with k as the parameter [XMX+20, ABM+19, CAL19]. The algorithm of Addad et al. [CAL19] is based on the coreset technique and gives a 3- and 9-approximation for the k-median and k-means objectives, respectively, in FPT time. Our algorithm gives the same bounds as that of Addad et al. [CAL19], but for the non-uniform r-capacity problem. Note that the uniform version is a special case of the non-uniform version, and the uniform r-capacity problem is equivalent to the uniform capacitated k-median/means problem. This means that our algorithm is also a 3- and 9-approximation algorithm for the uniform capacitated k-median and k-means problems, respectively. So, in the uniform setting, we match the result of Addad et al. [CAL19]. Moreover, our algorithm can be converted to a streaming algorithm. The capacitated k-center problem has also been well studied [BKP93, KS00, CHK12, ABC+15], both in the non-uniform setting and in the uniform setting, where a 6-approximation is known [KS00]. Some interesting open problems for the capacitated problems are discussed in the survey of An and Svensson [AS17].
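For a fixed set of centers, the lower-bound (r-gather) and capacity constraints discussed above affect only the client-to-center assignment step. The following minimal sketch illustrates this; it is ours, not the paper's partition algorithm (which would use a min-cost-flow computation), and the brute-force search is only viable for tiny instances:

```python
# Illustrative sketch (not an algorithm from this paper): once k centers are
# fixed, the cheapest assignment of clients under capacity and/or lower-bound
# constraints can be found exactly.  For tiny instances a brute-force search
# over all k^n assignments suffices; in general this is a min-cost flow.
from itertools import product

def constrained_assignment_cost(dist, capacity, lower=0, ell=1):
    """dist[i][j]: distance from client i to center j.
    Returns the cheapest assignment cost such that every center serves
    at least `lower` and at most `capacity` clients
    (k-median cost for ell=1, k-means cost for ell=2)."""
    n, k = len(dist), len(dist[0])
    best = None
    for assign in product(range(k), repeat=n):        # all k^n assignments
        loads = [assign.count(j) for j in range(k)]
        if any(l > capacity or l < lower for l in loads):
            continue                                   # violates constraints
        cost = sum(dist[i][assign[i]] ** ell for i in range(n))
        best = cost if best is None else min(best, cost)
    return best

# Clients at 0, 1, 2 on a line; centers at 0 and 10.
dist = [[0, 10], [1, 9], [2, 8]]
print(constrained_assignment_cost(dist, capacity=3))           # no constraint binds: 3
print(constrained_assignment_cost(dist, capacity=2))           # capacity binds: 9
print(constrained_assignment_cost(dist, capacity=3, lower=1))  # lower bound binds: 9
```

Note how both the capacity and the lower bound force one client away from its closest center, raising the cost from 3 to 9; this is exactly the tension the constrained problems above formalize.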
B.3 l-Diversity k-Service Problem The problem is motivated by the l-diversity principle, a popular method used for the privacy preservation of public databases. Consider a medical database in which the data entries are composed of sensitive attributes like 'name of the disease' and non-sensitive attributes like 'age', 'gender', 'zipcode', etc. The goal is to keep this information anonymous while keeping it meaningful at the same time, so that it can be used for research purposes. To tackle this problem, Sweeney [Swe02] proposed the k-anonymity principle. The principle is very popular, and various privacy-preserving algorithms are based on it [Swe02, BR05, LDR05, AFK+05, MW04]. Aggarwal et al. [APF+10] proposed a clustering-based anonymity principle, based on ideas similar to the k-anonymity principle. In this model, the data is grouped into clusters such that each cluster contains at least r entries. Instead of publishing the original data, the cluster centers, cluster sizes, and cluster radii are published. In this way, privacy is preserved while the data remains meaningful. This method is popularly known as r-gather clustering; we have already discussed algorithms for r-gather clustering in Subsection B.1. The k-anonymity principle is susceptible to linking attacks [Sam01, MKGV07] when a cluster contains many similar data entries. To counter this, Machanavajjhala et al. [MKGV07] introduced the l-diversity principle. According to this principle, a group of data entries must not contain more than a 1/l fraction of entries with the same sensitive attribute. Representing all data items with the same sensitive attribute by the same color gives rise to the l-diversity clustering problem. There are many approximation and heuristic algorithms for problems based on the l-diversity principle [MKGV07, XYT10, GKKM07, ZP11, DMZ13]. However, these algorithms are not based on the l-diversity clustering formulation. Unlike the r-gather clustering problem, the l-diversity clustering problem has not been studied well. Li et al. [LYZ10] gave the first constant approximation algorithm for the l-diversity clustering problem corresponding to the k-center objective (without any constraint on the number of centers). However, they considered a stronger version of the l-diversity problem where all points in a cluster must be distinct. Ding and Xu [DX15] and Bhattacharya et al. [BJK18] studied the problem in the Euclidean space and designed a PTAS for fixed values of k. No constant-approximation is known for the l-diversity k-means/k-median problem in general metric spaces.
In this paper, we study the problem in metric spaces and design an FPT algorithm that gives a (3 + ε)-approximation for the k-median objective and a (9 + ε)-approximation for the k-means objective.
B.4 Chromatic k-Service Problem The problem was formulated by Ding and Xu [DX12] and has certain applications in cell biology [DX11]. Ding and Xu [DX15, DX12] gave a PTAS for the chromatic k-median and k-means problems in the Euclidean space (i.e., C ⊆ L = R^d), and Bhattacharya et al. [BJK18] improved the running time of the algorithm. In this work, we study the problem in general metric spaces. We give an FPT-time (3 + ε)-approximation algorithm for the chromatic k-median problem and a (9 + ε)-approximation algorithm for the chromatic k-means problem.
B.5 Fault-Tolerant k-Service Problem In certain settings, some facilities (say, servers in a distributed network) may fail after some time. In that case, a client must be assigned to some other facility location/server. To provide backup against failures, we study the problem in the fault-tolerance setting, where we consider the cost of a client with respect to multiple facility locations. The formal definition of the problem is stated in Table 1.
Jain and Vazirani [KV00] gave the first approximation algorithm for the fault-tolerant facility location problem. Recall that in facility location, the number of open facilities is not bounded, but there is a cost of opening a facility. Their algorithm had an O(log(l_m))-approximation guarantee, where l_m := max_{x ∈ C} {l_x} is the maximum requirement of a client. Subsequently, better approximation algorithms were developed that gave constant-approximation guarantees for the problem [GMM01, JMM+03, SS08, BSS10]. Byrka et al. [BSS10] gave a 1.7245-approximation for the general problem. Better guarantees are known for the uniform setting, where each client has the same requirement l, i.e., l_x = l for every x ∈ C. Swamy and Shmoys [SS08] gave a 1.52-approximation guarantee for the uniform fault-tolerant facility location problem.
For the fault-tolerant k-median problem, the first approximation algorithm was given by Anthony et al. [AGGN08]; it had an O(log n)-approximation guarantee. Recently, Hajiaghayi et al. [HHL+16] gave a constant, 93-approximation guarantee for the problem. For the uniform fault-tolerant k-median problem, a better approximation guarantee of 4 is known due to Swamy and Shmoys [SS08]. In this work, we give an algorithm with an improved approximation guarantee of 3 for the general case, by allowing an FPT running time. This is a significant improvement over the 93-approximation of Hajiaghayi et al. [HHL+16]. For the fault-tolerant k-means problem, no approximation algorithm is known yet. We give the first approximation algorithm for the problem, which gives a (9 + ε)-approximation guarantee in FPT time. For the fault-tolerant k-center problem, Chaudhuri et al. [CGR98] and Khuller et al. [KPS00] gave a 3-approximation algorithm. However, they considered the objective function where the cost of a client is taken as its distance to its l_x-th closest facility location, whereas in this work, we consider the cost of a client as the sum of its distances to its l_x closest facility locations.
B.6 Semi-Supervised k-Service Problem Clustering is typically studied in an unsupervised setting, where the class labels are not known for any data point. However, in various practical scenarios, we have some background knowledge about the class labels, or some kind of supervision in terms of must-link and cannot-link constraints [WC00, WCRS01, BBM04]. Clustering based on this prior knowledge is known as semi-supervised clustering. There are several ways in which the extra information can be used to obtain a better clustering of the data (see Section 2 of [GCB]). One such way is to add an extra penalty term to the cost function for violating the known constraints [DBE99, CKT06, CSZ+
07, GTC06]. We define such a cost function for the semi-supervised clustering problem. Suppose we are given the target clustering C_T = {C'_1, C'_2, ..., C'_k}. The goal of semi-supervised clustering is to output a clustering C = {C_1, ..., C_k} with centers F = {f_1, ..., f_k} that minimizes the following objective function:
Ψ(F, C) := α · Σ_{i=1}^{k} Φ(f_i, C_i) + (1 − α) · Dist(C_T, C),
for some constant α ≥ 0, where Dist(C_T, C) denotes the set-difference distance. A similar cost function is considered in the context of evolutionary clustering [CKT06]. In the evolutionary clustering problem, the task is to output a sequence of clusterings for timestamped data. At every time step t, we have to output a clustering that does not deviate largely from the clustering at the previous time step (t − 1); in other words, the clustering at time step (t − 1) acts as the target clustering C_T for the clustering at time step t.
Ding and Xu [DX15] gave an FPT algorithm that gives a (1 + ε)-approximation for the semi-supervised k-median/k-means problem in Euclidean space. Further, Bhattacharya et al. [BJK18] improved the running time of the algorithm. No approximation algorithm is known for the problem in general metric spaces. In this work, we give the first constant-approximation for the problem in general metric spaces. In particular, we give a (3 + ε)-approximation and a (9 + ε)-approximation algorithm for the semi-supervised k-median and k-means problems, respectively, in FPT time.
B.7 Uncertain k-Service Problem (Probabilistic Clustering) Data from sources such as sensor networks, forecasting, and demographics has many uncertainties due to errors and noise in the measured values [AY09]. Due to this imprecision, the data is modeled probabilistically rather than deterministically. Such imprecise data needs to be clustered for various data-mining purposes, and various heuristic algorithms are known for the clustering of uncertain datasets [CCKN06, NKC+
06, KP05, AY08]. The problem has also been studied from a theoretical standpoint. The first theoretical study was done by Cormode and McGregor [CM08], who designed approximation algorithms for the uncertain k-center, k-means, and k-median problems. They defined the input instance in the following manner: a client j ∈ C is represented by a random variable X_j such that j is present at location x ∈ X with probability t_jx, i.e., Pr[X_j = x] = t_jx. Naturally, we have Σ_{x ∈ X} t_jx ≤ 1 for every j ∈ C. Note that the total probability could be less than one, since it is possible that a client does not exist at all. The cost function is defined in two ways, unassigned and assigned, as follows.
1. Unassigned Cost: In this case, we output a center set F that minimizes the following cost function:
k-median: Σ_{(x_1, x_2, ..., x_n) ∈ X^n} Π_{j=1}^{n} Pr[X_j = x_j] · Σ_{j=1}^{n} d(x_j, F)
k-means: Σ_{(x_1, x_2, ..., x_n) ∈ X^n} Π_{j=1}^{n} Pr[X_j = x_j] · Σ_{j=1}^{n} d^2(x_j, F)
k-center: Σ_{(x_1, x_2, ..., x_n) ∈ X^n} Π_{j=1}^{n} Pr[X_j = x_j] · max_j {d(x_j, F)}
Here, n := |C|, and d(x, F) := min_{f ∈ F} {d(x, f)} denotes the distance of x to the closest facility location. Let F* be an optimal center set corresponding to the above objective function. We assign a client j to a facility location in F* based on its realized position: if x_j is the realized position of client j, then we assign j to the facility location that is closest to x_j.
2. Assigned Cost: In this case, we assign a client to a cluster center prior to its realization. Therefore, we assume that for a client j, all its realizations are assigned to the same center. The goal is to output a center set F and an assignment σ: C → F that minimizes the following cost function:
k-median: Σ_{(x_1, x_2, ..., x_n) ∈ X^n} Π_{j=1}^{n} Pr[X_j = x_j] · Σ_{j=1}^{n} d(x_j, σ(j))
k-means: Σ_{(x_1, x_2, ..., x_n) ∈ X^n} Π_{j=1}^{n} Pr[X_j = x_j] · Σ_{j=1}^{n} d^2(x_j, σ(j))
k-center: Σ_{(x_1, x_2, ..., x_n) ∈ X^n} Π_{j=1}^{n} Pr[X_j = x_j] · max_j {d(x_j, σ(j))}
Both of these clustering criteria are useful [CM08, AJ18]. For the probabilistic metric k-center problem, Cormode and McGregor [CM08] gave bi-criteria approximation algorithms corresponding to the unassigned objective function. Guha and Munagala [GM09] gave the first constant approximation algorithm for the probabilistic metric k-center problem (for both the assigned and unassigned cases). Recently, Alipour and Jafari [AJ18] improved the approximation guarantee to 10 for the (assigned) probabilistic k-center problem in general metric spaces. For the Euclidean space (where C ⊆ L = R^d), the same authors gave an FPT algorithm with an approximation guarantee of 3 + ε, for any ε > 0, for the probabilistic k-center problem. Let us now discuss the probabilistic k-median and k-means problems. The unassigned case of these problems is quite simple: both can be reduced to their weighted unconstrained counterparts by linearity of expectation (see Section 5 of [CM08]). Thus, in the Euclidean space, both problems admit a (1 + ε)-approximation algorithm [FMS07b, Che09b, KSS10, JKS14]. In metric spaces, we get a (2.
675 + ε)-approximation for the probabilistic k-median problem [BPR+17] and a (9 + ε)-approximation for the probabilistic k-means problem [ANSW17]. Therefore, in this work, we study these problems with respect to their assigned objectives only.
First, let us discuss the assigned case of the probabilistic k-means/k-median problems in the Euclidean space (where C ⊆ L = R^d). Cormode and McGregor [CM08] gave FPT (1 + ε)- and (3 + ε)-approximation algorithms corresponding to the assigned k-means and k-median objectives, respectively. Recently, Ding and Xu [DX15] improved the approximation guarantee for the assigned k-median problem to (1 + ε), and Bhattacharya et al. [BJK18] further improved the running time of the algorithm.
Now, let us discuss the (assigned) probabilistic k-means/k-median problems in general metric spaces. Lammersen et al. [LSS13] gave the first coreset construction for the assigned version of the probabilistic k-median problem. Cormode and McGregor [CM08] reduced these problems to their weighted unconstrained counterparts, with a certain loss in the approximation factor. In particular, they gave a (2α + 1)-approximation algorithm for the probabilistic k-median problem and an (8α + 2)-approximation algorithm for the probabilistic k-means problem. Here, α is the approximation guarantee of any unconstrained k-median/k-means algorithm. The current best approximation guarantee for the unconstrained k-median problem is (2.675 + ε) [BPR+17], and for the k-means problem it is (9 + ε) [ANSW17]. Substituting these α values, we obtain a (6.
35 + ε)-approximation for the probabilistic k-median problem and a (74 + ε)-approximation for the probabilistic k-means problem. All the above-stated approximation guarantees assume C ⊆ L = X for discrete metric spaces. In this work, we improve these approximation guarantees by taking advantage of FPT running time. We give a (2 + ε)- and a (4 + ε)-approximation for the probabilistic k-median and k-means problems (C ⊆ L), respectively, in FPT time. Moreover, for the general case, where C and L are arbitrary sets, our algorithm gives a (3 + ε)-approximation and a (9 + ε)-approximation for the probabilistic k-median and k-means problems, respectively, in FPT time. (Cormode and McGregor [CM08] did not state the (8α + 2)-approximation for the probabilistic k-means problem explicitly. However, this result can be obtained using the same technique used to obtain the (2α + 1)-approximation for the probabilistic k-median problem, stated in Theorem 10 of [CM08]: replacing the triangle inequality in that theorem with the approximate triangle inequality for the k-means objective gives an (8α + 2)-approximation guarantee.)
B.8 Outlier k-Service Problem The outlier problem is central to the clustering domain, since removing a few outliers (noisy data points) from the data may greatly improve the cost and quality of the clustering. The notion of the outlier problem was first introduced by Charikar et al. [CKMN01]. The authors gave a 3-approximation algorithm for both the outlier facility location problem and the outlier k-center problem. For the outlier k-median problem, they gave a bi-criteria approximation algorithm that gives a 4(1 + 1/ε)-approximation guarantee while violating the number of outliers by a factor of (1 + ε). Chen [Che08] gave the first constant-approximation algorithm for the outlier k-median problem. Recently, Krishnaswamy et al. [KLS18] improved the approximation guarantee for the outlier k-median problem to (7.081 + ε). Moreover, they gave a (53.002 + ε)-approximation for the outlier k-means problem. Friggstad et al. [FKRS19] gave a bi-criteria algorithm for the outlier k-means problem that gives a (25 + ε)-approximation guarantee and opens (1 + ε)k facilities.
All the above-stated algorithms are polynomial-time algorithms. Now, let us discuss FPT algorithms for the problem. Feng et al. [FZH+
19] gave a (6 + ε)-approximation for the outlier k-means problem with C ⊆ L, with an FPT running time of O(n · β^k · ((k + m)/ε)^k), for some constant β > 0. In this work, we improve on this result by giving a (4 + ε)-approximation algorithm for the outlier k-means problem with C ⊆ L and a (2 + ε)-approximation algorithm for the outlier k-median problem with C ⊆ L = X, with a running time of O(n · ((k + m)/ε)^{O(k)}). For the general case, where C and L are arbitrary sets, our algorithm gives a (9 + ε)-approximation guarantee for the outlier k-means problem and a (3 + ε)-approximation guarantee for the outlier k-median problem, with the same running time.
For the Euclidean version of the problem (i.e., C ⊆ L = R^d), Feldman and Schulman [FS12] gave a PTAS for the outlier k-median problem with an FPT running time of O(nd · (m + k)^{O(m + k)} + (ε^{-1} k log n)^{O(1)}). Recently, Feng et al. [FZH+19] improved this running time to O(nd · ((m + k)/ε)^{(k/ε)^{O(1)}}).
C Preliminaries
We introduce some notation and identities that we will use often in our discussions. We define the unconstrained k-service cost of a set S with respect to a center set F as:
Φ(F, S) := Σ_{x ∈ S} min_{f ∈ F} d^ℓ(f, x).
For a singleton set {f}, we denote Φ({f}, S) by Φ(f, S). We denote the optimal (unconstrained) k-service cost of an instance by OPT(L, C). Let us now look at some of the identities that we will use in our analysis. The following binomial approximation allows us to simplify terms with large exponents.
Fact 1 (Binomial Approximation) For ε · n ≤ 1/2, we have (1 + ε)^n ≤ (1 + 2εn).
We use the next fact to trade off between two values a and b; we will choose the value of δ according to our requirement.
Fact 2
For any δ > 0, we have (a + b)^ℓ ≤ (1 + δ)^ℓ · b^ℓ + (1 + 1/δ)^ℓ · a^ℓ.
Proof. There are two possibilities: a ≤ δ · b or a > δ · b. In the first case, we have (a + b)^ℓ ≤ (1 + δ)^ℓ · b^ℓ. In the second case, we have b < a/δ, and hence (a + b)^ℓ ≤ (1 + 1/δ)^ℓ · a^ℓ. Hence we get the required result.
Since we are working with metric spaces, the triangle inequality becomes a powerful tool for analysis. We need to generalize the triangle inequality since we are dealing with a general cost function d^ℓ. The following inequality is a generalization of the triangle inequality and follows from the power-mean inequality.
Fact 3 (Approximate triangle inequality) For any points a, b, c ∈ X,
d^ℓ(a, b) ≤ 2^{ℓ−1} · (d^ℓ(a, c) + d^ℓ(c, b)).
Similarly, for any four points a, b, c, d ∈ X, we have
d^ℓ(a, b) ≤ 3^{ℓ−1} · (d^ℓ(a, c) + d^ℓ(c, d) + d^ℓ(d, b)).
Let OPT(C, C) denote the optimal cost of an unconstrained k-service instance when facilities are only allowed to open at client locations. The following fact easily follows from the power-mean inequality (for a detailed proof, see Theorem 2.1 of [GMMO00]).
Fact 4 OPT(C, C) ≤ 2^ℓ · OPT(L, C).
The following lemma demonstrates the effectiveness of uniform sampling in obtaining a constant-factor approximation. This lemma (or a similar version) has been used in multiple other works to analyse sampling-based algorithms.
Lemma 2. Let S ⊆ C be any subset of clients and let f* be any center in L. If we uniformly sample a point x from S and open a facility at the closest location in L, then the following holds:
E[Φ(t(x), S)] ≤ 3^ℓ · Φ(f*, S),
where t(x) denotes the facility location in L closest to x.
Proof. The proof follows from the following sequence of inequalities:
E[Φ(t(x), S)] = (1/|S|) · Σ_{x ∈ S} Φ(t(x), S)
  = (1/|S|) · Σ_{x ∈ S} Σ_{x' ∈ S} d^ℓ(t(x), x')
  ≤ (3^{ℓ−1}/|S|) · Σ_{x ∈ S} Σ_{x' ∈ S} (d^ℓ(f*, x') + d^ℓ(x, f*) + d^ℓ(t(x), x))   (using Fact 3)
  ≤ (3^{ℓ−1}/|S|) · Σ_{x ∈ S} Σ_{x' ∈ S} (d^ℓ(f*, x') + d^ℓ(x, f*) + d^ℓ(f*, x))   (using the definition of t(x))
  = (3^{ℓ−1}/|S|) · (|S| · Φ(f*, S) + |S| · Φ(f*, S) + |S| · Φ(f*, S))
  = 3^ℓ · Φ(f*, S).
This completes the proof of the lemma.
In the next lemma, we show that the flexibility to open a facility at any client location gives a better approximation guarantee.
Lemma 3.
Let $S \subseteq C$ be any subset of clients and let $f^*$ be any center in $L$. If we uniformly sample a point $x$ from $S$ and open a facility at $x$, then the following bound holds: $E[\Phi(x, S)] \le 2^{\ell} \cdot \Phi(f^*, S)$.

Proof.
The proof follows from the following inequalities:
\begin{align*}
E[\Phi(x, S)] &= \frac{1}{|S|} \sum_{x \in S} \Phi(x, S) = \frac{1}{|S|} \sum_{x \in S} \sum_{x' \in S} d^{\ell}(x, x') \\
&\le \frac{2^{\ell-1}}{|S|} \sum_{x \in S} \sum_{x' \in S} \big( d^{\ell}(f^*, x') + d^{\ell}(x, f^*) \big), && \text{(using Fact 3)} \\
&= \frac{2^{\ell-1}}{|S|} \cdot 2 |S| \cdot \Phi(f^*, S) = 2^{\ell} \cdot \Phi(f^*, S).
\end{align*}
This completes the proof of the lemma.

Algorithm for List k-Service

In this section, we design and analyze the list $k$-service algorithm. As described earlier, an FPT algorithm for the list $k$-service problem gives an FPT algorithm for any constrained version of the $k$-service problem that has an efficient or FPT-time partition algorithm. Our sampling-based algorithm is similar to the algorithm of Goyal et al. [GJK19], which was specifically designed for the Euclidean setting. However, the analysis differs at various steps since we study the problem in general metric spaces, whereas Goyal et al. [GJK19] studied the problem in the Euclidean space, where $C \subseteq L = \mathbb{R}^d$. Following is our algorithm for the list $k$-service problem:

List-k-service$(C, L, k, d, \ell, \varepsilon)$
Inputs: $k$-service instance $(C, L, k, d, \ell)$ and accuracy $\varepsilon$
Output: A list $\mathcal{L}$, each element of $\mathcal{L}$ being a $k$-center set
Constants: $\beta = 4^{\ell-1} \cdot \Big( \frac{\ell^{\ell} \cdot 2^{\ell+4} \cdot 3^{\ell+3}}{\varepsilon^{\ell+1}} + 1 \Big)$; $\gamma = \frac{\ell^{\ell} \cdot 2^{\ell+5} \cdot 3^{\ell+1}}{\varepsilon^{\ell}}$; $\eta = \frac{\alpha \beta \gamma k \cdot 3^{\ell+2}}{\varepsilon^{2}}$
(1) Run any $\alpha$-approximation algorithm for the unconstrained $k$-service instance $(C, C, k, d, \ell)$ and let $F$ be the obtained center set ($k$-means++ [AV07] is one such algorithm.)
(2) $\mathcal{L} \leftarrow \emptyset$
(3) Repeat $2^{k}$ times:
(4)   Sample a multi-set $M$ of $\eta k$ points from $C$ using $D^{\ell}$-sampling w.r.t. center set $F$
(5)   $M \leftarrow M \cup F$
(6)   $T \leftarrow \emptyset$
(7)   For every point $x$ in $M$:
(8)     $T \leftarrow T \cup \{ k$ points in $L$ that are closest to $x \}$
(9)   For all subsets $S$ of $T$ of size $k$:
(10)    $\mathcal{L} \leftarrow \mathcal{L} \cup \{S\}$
(11) return($\mathcal{L}$)

Algorithm 1.2.
Algorithm for the list $k$-service problem.

We start with the main intuition in the next subsection before going into the details of the proof.
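As a concrete companion to Algorithm 1.2, the following is a minimal Python sketch of the listing above. It assumes the metric is given as a plain Python function; the $\alpha$-approximate center set $F$ of line (1) and the sample size $\eta$ are passed in as parameters rather than computed from the constants of the listing, so this is an illustrative skeleton, not the full algorithm.

```python
import itertools
import random

def list_k_service(C, L, k, dist, ell, eta, F):
    """Sketch of Algorithm 1.2 (List-k-service) over a generic metric.

    C, L : lists of client / facility points (hashable)
    dist : the metric d(., .)
    ell  : the power in the objective (ell = 1: k-median, ell = 2: k-means)
    eta  : per-cluster sample size (a parameter here, standing in for the
           constant eta of the listing)
    F    : an alpha-approximate center set for the unconstrained instance
           (C, C, k, d, ell), e.g. obtained via k-means++ -- line (1)
    """
    candidates = []                                  # the output list of k-center sets
    for _ in range(2 ** k):                          # line (3): boost success probability
        # line (4): D^ell-sampling of eta*k clients w.r.t. the center set F
        weights = [min(dist(x, f) for f in F) ** ell for x in C]
        M = random.choices(C, weights=weights, k=eta * k) if sum(weights) > 0 else []
        M = M + list(F)                              # line (5): add F itself
        T = set()                                    # line (6)
        for x in M:                                  # lines (7)-(8): k closest facilities
            T.update(sorted(L, key=lambda f: dist(x, f))[:k])
        for S in itertools.combinations(T, k):       # lines (9)-(10): all size-k subsets
            candidates.append(set(S))
    return candidates                                # line (11)
```

Since $F$ is always added to $M$ in line (5), the list deterministically contains the $k$ closest facilities of every point of $F$, regardless of the random samples.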
D.1 Algorithm Description and Intuition
In the first step, we obtain a center set $F \subseteq C$ that is an $\alpha$-approximation for the unconstrained $k$-service instance $(C, C, k, d, \ell)$. That is, $\Phi(F, C) \le \alpha \cdot OPT(C, C)$. One such algorithm is the $k$-means++ algorithm [AV07], which gives an $O(4^{\ell} \cdot \log k)$-approximation guarantee and has running time $O(nk)$. Now, let us see how the set $F$ can help us. Let us focus on any cluster $C_i$ of a target clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$. Our main objective is to uniformly sample a point from $C_i$, so that we can achieve a constant approximation for $C_i$ using Lemma 2. We will do a case analysis based on the distance of the points in $C_i$ from the nearest point in $F$. Consider the following two possibilities. The first possibility is that the points in $C_i$ are close to $F$. In this case, we can uniformly sample a point from $F$ instead of $C_i$. Note that we cannot uniformly sample from $C_i$ even if we wanted to, since $C_i$ is not known to us. This incurs some extra cost; however, the cost is small and can be bounded easily. To cover this first possibility, the algorithm adds the entire set $F$ to the set of sampled points $M$ (see line (5) of the algorithm). The second possibility is that the points in $C_i$ are far away from $F$. In this case, we can $D^{\ell}$-sample points from $C$. Since the points in $C_i$ are far away, the sampled set contains a good portion of points from $C_i$, and these points are almost uniformly distributed. We will show that almost uniform sampling is sufficient to apply Lemma 2 to $C_i$. However, we have to sample a large number of points to boost the success probability. This requirement is taken care of by line (4) of the algorithm. Note that we may need to use a hybrid approach in the analysis, since the real case may be a combination of the first and the second possibility.

To apply Lemma 2, we need to fulfil one more condition: we need the facility location closest to a sampled point.
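To make the role of Lemma 2 concrete, the following self-contained snippet computes the expectation $E[\Phi(t(x), S)]$ exactly (by averaging over all choices of the uniformly sampled client $x$) on a toy one-dimensional instance with $\ell = 2$, and checks it against the $3^{\ell} \cdot \Phi(f^*, S)$ bound. The instance and all numbers are made up purely for illustration.

```python
# Illustration of Lemma 2 for ell = 2 (k-means costs) on a toy 1-D metric:
# E[Phi(t(x), S)] over a uniformly random client x in S is computed exactly
# and compared against the 3^ell * Phi(f*, S) bound of the lemma.

def phi(center, S, ell):
    """Cost of serving every client in S from a single center."""
    return sum(abs(center - x) ** ell for x in S)

def expected_cost(S, L, ell):
    """Exact E[Phi(t(x), S)] for x uniform in S; t(x) = closest facility."""
    total = 0.0
    for x in S:
        t_x = min(L, key=lambda f: abs(f - x))   # facility location closest to x
        total += phi(t_x, S, ell)
    return total / len(S)

S = [0.0, 1.0, 2.0, 10.0]        # clients (illustrative)
L = [0.5, 3.0, 9.0]              # facility locations (illustrative)
ell = 2
f_star = min(L, key=lambda f: phi(f, S, ell))    # best single center in L
assert expected_cost(S, L, ell) <= 3 ** ell * phi(f_star, S, ell)
```

On this instance the expectation is $111$ against a bound of $9 \cdot 63 = 567$; the bound is loose here, but it is worst-case over all metrics and subsets.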
This requirement is handled by lines (7) and (8) of the algorithm. However, note that the algorithm picks the $k$ closest facility locations instead of just one. We will show that this step is crucial for obtaining a hard-assignment solution to the problem. Finally, the algorithm adds all the potential center sets to a list $\mathcal{L}$ (see lines (9) and (10) of the algorithm). The algorithm repeats this procedure $2^{k}$ times to boost the success probability (see line (3) of the algorithm). We will show the following main result.

Theorem 9.
Let $0 < \varepsilon \le 1$. Let $(C, L, k, d, \ell)$ be any $k$-service instance and let $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ be any arbitrary clustering of the client set. The algorithm List-k-service$(C, L, k, d, \ell, \varepsilon)$, with probability at least $1/2$, outputs a list $\mathcal{L}$ of size $(k/\varepsilon)^{O(k\ell)}$ such that there is a $k$-center set $S \in \mathcal{L}$ with
\[ \Psi(S, \mathcal{C}) \le (3^{\ell} + \varepsilon) \cdot \Psi^*(\mathcal{C}). \]
Moreover, the running time of the algorithm is $O\big(n \cdot (k/\varepsilon)^{O(k\ell)}\big)$.

D.2 Analysis
Let $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ be the (unknown) target clustering and let $F^* = \{f^*_1, f^*_2, \ldots, f^*_k\}$ be the corresponding optimal center set. Let $\Delta(C_i)$ denote the cost of a cluster $C_i$ with respect to $F^*$, i.e., $\Delta(C_i) = \Phi(f^*_i, C_i)$. Let us classify the clusters into two categories, $W$ and $H$:
\[ W := \Big\{ C_i \;\Big|\; \Phi(F, C_i) \le \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C), \text{ for } 1 \le i \le k \Big\}, \qquad H := \Big\{ C_i \;\Big|\; \Phi(F, C_i) > \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C), \text{ for } 1 \le i \le k \Big\}. \]
In other words, $W$ contains the low-cost clusters and $H$ contains the high-cost clusters with respect to $F$. Now, let us look at the set $M$ obtained in lines (4) and (5) of the algorithm. The set $M$ contains some $D^{\ell}$-sampled points from $C$ together with the center set $F$. We show that $M$ has the following property.

Property-I: For any cluster $C_i \in \{C_1, C_2, \ldots, C_k\}$, with probability at least $1/2$ there is a point $s_i$ in $M$ such that the following holds:
\[ \Phi(t(s_i), C_i) \le \begin{cases} \big(3^{\ell} + \varepsilon\big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C), & \text{if } C_i \in W \\[2pt] \big(3^{\ell} + \varepsilon\big) \cdot \Delta(C_i), & \text{if } C_i \in H \end{cases} \]
where $t(s_i)$ denotes any facility location that is at least as close to $s_i$ as $f^*_i$, i.e., $d(s_i, t(s_i)) \le d(s_i, f^*_i)$.

First, let us see how this property gives the desired result. By Fact 4, we have $OPT(C, C) \le 2^{\ell} \cdot OPT(L, C)$. Moreover, the optimal cost $OPT(L, C)$ of the unconstrained $k$-service instance is always at most the constrained $k$-service cost $\sum_{i=1}^{k} \Delta(C_i)$. Therefore, Property-I implies that $T_s := \{t(s_1), t(s_2), \ldots, t(s_k)\}$ is a $(3^{\ell} + \varepsilon)$-approximation for $\mathcal{C}$ with probability at least $1/2^{k}$. Now, note that the facility locations closest to $s_i$ satisfy the definition of $t(s_i)$. Moreover, the algorithm adds one such facility location to the set $T$ (see line (8) of the algorithm). Thus there is a center set $T_s$ in the list that gives a $(3^{\ell} + \varepsilon)$-approximation for $\mathcal{C}$. To boost the success probability to $1/2$, the algorithm repeats the procedure $2^{k}$ times (see line (3) of the algorithm). Based on these arguments, it looks like we have the desired result. However, there is one issue that we need to take care of. Recall that we are looking for a hard assignment for the problem, and the set $T_s$ could be a soft center set, since the closest facility locations might coincide for different $s_i$'s. In other words, $t(s_i)$ could be the same as $t(s_j)$ for some $i \ne j$. At the end of this section we will show that there is indeed a hard center set in the list $\mathcal{L}$ that gives the required approximation for the problem. For now, let us prove Property-I for $M$ and the target clusters. First, consider the case of the low-cost clusters.

Case 1: $\Phi(F, C_i) \le \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C)$

For a point $x \in X$, let $c(x)$ denote the location in $F$ closest to $x$. Based on this definition, consider the multi-set $M_i := \{c(x) \mid x \in C_i\}$. Since $C_i$ has a low cost with respect to $F$, the points in $C_i$ are close to points in $F$. Consider uniformly sampling a point from $M_i$. In the next lemma, we show that a uniformly sampled point from $M_i$ is a good enough center for $C_i$.

Lemma 4.
Let $p$ be a point sampled uniformly at random from $M_i$. Then the following bound holds:
\[ E[\Phi(t(p), C_i)] \le \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C). \]

Proof.
The proof follows from the following sequence of inequalities.
\begin{align*}
E[\Phi(t(p), C_i)] &= \frac{1}{|C_i|} \sum_{p \in M_i} \Phi(t(p), C_i) = \frac{1}{|C_i|} \sum_{p \in M_i} \sum_{x \in C_i} d^{\ell}(x, t(p)) = \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} d^{\ell}\big(x, t(c(x'))\big) \\
&\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', c(x')) + d(c(x'), t(c(x'))) \big)^{\ell}, && \text{by the triangle inequality} \\
&\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', c(x')) + d(c(x'), f^*_i) \big)^{\ell}, && \text{by the defn. of } t(c(x')) \\
&\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + 2\, d(x', c(x')) + d(x', f^*_i) \big)^{\ell}, && \text{by the triangle inequality.}
\end{align*}
Let us use Fact 2, setting $a = 2 \cdot d(x', c(x'))$ and $b = d(x, x') + d(x', f^*_i)$. We get the following expression, for any $\delta > 0$:
\begin{align*}
E[\Phi(t(p), C_i)] &\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \Big( \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot \big( d(x, x') + d(x', f^*_i) \big)^{\ell} + (1+\delta)^{\ell} \cdot 2^{\ell} \cdot d^{\ell}(x', c(x')) \Big) \\
&= \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', f^*_i) \big)^{\ell} + (1+\delta)^{\ell} \cdot \frac{1}{|C_i|} \sum_{x' \in C_i} |C_i| \cdot 2^{\ell} \cdot d^{\ell}(x', c(x')).
\end{align*}
As in the proof of Lemma 2, we have $\frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', f^*_i) \big)^{\ell} \le 3^{\ell} \cdot \Delta(C_i)$. Thus, we get:
\begin{align*}
E[\Phi(t(p), C_i)] &\le \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot 3^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i) \\
&= \Big(1 + \frac{\varepsilon}{\ell \cdot 3^{\ell+2}}\Big)^{\ell} \cdot 3^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i), && \text{by substituting } \delta = \frac{\ell \cdot 3^{\ell+2}}{\varepsilon} \\
&\le \Big(1 + \frac{2 \ell \varepsilon}{\ell \cdot 3^{\ell+2}}\Big) \cdot 3^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i), && \text{by Fact 1} \\
&\le \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + 2^{\ell} (2\delta)^{\ell} \cdot \Phi(F, C_i), && \because 1 + \delta \le 2\delta \text{ for } \varepsilon \le 1 \\
&\le \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + 4^{\ell} \delta^{\ell} \cdot \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C), && \because \Phi(F, C_i) \le \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C) \\
&\le \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{\alpha k} \cdot \Phi(F, C), && \because \gamma = \frac{\ell^{\ell} \cdot 2^{\ell+5} \cdot 3^{\ell+1}}{\varepsilon^{\ell}} \\
&\le \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C), && \because \Phi(F, C) \le \alpha \cdot OPT(C, C).
\end{align*}
This completes the proof of the lemma.

Since the above lemma bounds the average cost over the sampled point, there has to be a point $p$ in $M_i$ such that $\Phi(t(p), C_i) \le \big(3^{\ell} + \frac{\varepsilon}{2}\big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C)$. Since $M_i$ is composed only of points from $F$, and we keep the entire set $F$ in $M$ (see line (5) of the algorithm), Property-I is satisfied for every cluster $C_i \in W$. Let us now prove Property-I for the high-cost clusters.

Case 2: $\Phi(F, C_i) > \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C)$.

Since the cost of the cluster is high, some points of $C_i$ are far away from the center set $F$. We partition $C_i$ into two sets, $C_i^n$ and $C_i^f$, as follows.
\[ C_i^n := \big\{ x \in C_i \;\big|\; d^{\ell}(c(x), x) \le R^{\ell} \big\}, \qquad C_i^f := \big\{ x \in C_i \;\big|\; d^{\ell}(c(x), x) > R^{\ell} \big\}, \qquad \text{where } R^{\ell} = \frac{1}{\beta} \cdot \frac{\Phi(F, C_i)}{|C_i|}. \]
In other words, $C_i^n$ is the set of points that are near the center set $F$, and $C_i^f$ is the set of points that are far from $F$. Recall that our prime objective is to obtain a uniform sample from $C_i$, so that we can apply Lemma 2. To achieve this, we consider sampling from $C_i^n$ and $C_i^f$ separately. The idea is as follows. To sample a point from $C_i^f$, we use the $D^{\ell}$-sampling technique and show that it gives an almost uniform sample from $C_i^f$. For $C_i^n$, we use $F$ as its proxy and sample a point from $F$ instead. However, doing so incurs an extra cost. We will show that the extra cost is proportional to $\Phi(F, C_i^n)$, which can be bounded easily. To bound the extra cost we will use the following lemma.

Lemma 5.
For $R^{\ell} = \frac{1}{\beta} \cdot \frac{\Phi(F, C_i)}{|C_i|}$, we have $\Phi(F, C_i^n) \le \frac{\varepsilon^{\ell+1}}{\ell^{\ell} \cdot 2^{\ell+3} \cdot 3^{\ell+3}} \cdot \Delta(C_i)$.

Proof. We have
\begin{align*}
\Delta(C_i) \ge \Phi(f^*_i, C_i^n) = \sum_{x \in C_i^n} d^{\ell}(f^*_i, x) &\ge \sum_{x \in C_i^n} \Big( \frac{d^{\ell}(c(x), f^*_i)}{2^{\ell-1}} - d^{\ell}(x, c(x)) \Big), && \text{by Fact 3} \\
&= \sum_{x \in C_i^n} \frac{d^{\ell}(c(x), f^*_i)}{2^{\ell-1}} - \Phi(F, C_i^n) \ge \sum_{x \in C_i^n} \frac{d^{\ell}(c(f^*_i), f^*_i)}{2^{\ell-1}} - \Phi(F, C_i^n).
\end{align*}
Using Fact 3, we get $\Phi(c(f^*_i), C_i) \le 2^{\ell-1} \cdot \big( \Delta(C_i) + |C_i| \cdot d^{\ell}(c(f^*_i), f^*_i) \big)$. Since $\Phi(F, C_i) \le \Phi(c(f^*_i), C_i)$, we get
\[ d^{\ell}(c(f^*_i), f^*_i) \ge \frac{\Phi(F, C_i) - 2^{\ell-1} \cdot \Delta(C_i)}{2^{\ell-1} \, |C_i|}. \]
Using this, the previous expression simplifies to:
\begin{align*}
\Delta(C_i) &\ge |C_i^n| \cdot \Big( \frac{\Phi(F, C_i) - 2^{\ell-1} \cdot \Delta(C_i)}{4^{\ell-1} \cdot |C_i|} \Big) - \Phi(F, C_i^n) \\
&= \frac{|C_i^n| \cdot \beta R^{\ell}}{4^{\ell-1}} - \frac{|C_i^n| \cdot \Delta(C_i)}{2^{\ell-1} \cdot |C_i|} - \Phi(F, C_i^n), && \because R^{\ell} = \frac{1}{\beta} \cdot \frac{\Phi(F, C_i)}{|C_i|} \\
&\ge \Phi(F, C_i^n) \cdot \frac{\beta}{4^{\ell-1}} - \frac{|C_i^n| \cdot \Delta(C_i)}{2^{\ell-1} \cdot |C_i|} - \Phi(F, C_i^n), && \because \Phi(F, C_i^n) \le |C_i^n| \cdot R^{\ell} \\
&\ge \frac{(\beta - 4^{\ell-1})}{4^{\ell-1}} \cdot \Phi(F, C_i^n) - \Delta(C_i), && \because |C_i^n| \le |C_i| \le 2^{\ell-1} \cdot |C_i|.
\end{align*}
On rearranging the terms of the expression, we get
\[ \Phi(F, C_i^n) \le \frac{2 \cdot 4^{\ell-1}}{\beta - 4^{\ell-1}} \cdot \Delta(C_i) \le \frac{\varepsilon^{\ell+1}}{\ell^{\ell} \cdot 2^{\ell+3} \cdot 3^{\ell+3}} \cdot \Delta(C_i), \qquad \because \beta = 4^{\ell-1} \cdot \Big( \frac{\ell^{\ell} \cdot 2^{\ell+4} \cdot 3^{\ell+3}}{\varepsilon^{\ell+1}} + 1 \Big). \]
Hence proved.

Now, let us prove the main result. We need to define a few things. Since we are using $F$ as a proxy for $C_i^n$, we define the multi-set $M_i^n := \{c(x) \mid x \in C_i^n\}$. Let us define another multi-set $M_i := C_i^f \cup M_i^n$. In the following lemma we show that there is a point in $M_i$ that is a good center for $C_i$. The lemma is similar to Lemma 4 for the low-cost clusters.

Lemma 6.
Let $p$ be a point sampled uniformly at random from $M_i$. Then the following bound holds:
\[ E[\Phi(t(p), C_i)] \le \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i). \]

Proof.
\[ E[\Phi(t(p), C_i)] = \frac{1}{|C_i|} \sum_{p \in M_i} \Phi(t(p), C_i) = \frac{1}{|C_i|} \Big( \sum_{x' \in C_i^n} \Phi(t(c(x')), C_i) + \sum_{x' \in C_i^f} \Phi(t(x'), C_i) \Big). \]
Let us evaluate these two terms separately.
1. The first term:
\begin{align*}
\sum_{x' \in C_i^n} \Phi(t(c(x')), C_i) &= \sum_{x' \in C_i^n} \sum_{x \in C_i} d^{\ell}(x, t(c(x'))) \\
&\le \sum_{x' \in C_i^n} \sum_{x \in C_i} \big( d(x, x') + d(x', c(x')) + d(c(x'), t(c(x'))) \big)^{\ell}, && \text{by the triangle inequality} \\
&\le \sum_{x' \in C_i^n} \sum_{x \in C_i} \big( d(x, x') + d(x', c(x')) + d(c(x'), f^*_i) \big)^{\ell}, && \text{by the defn. of } t(c(x')) \\
&\le \sum_{x' \in C_i^n} \sum_{x \in C_i} \big( d(x, x') + 2\, d(x', c(x')) + d(x', f^*_i) \big)^{\ell}, && \text{by the triangle inequality} \\
&\le \sum_{x' \in C_i^n} \sum_{x \in C_i} \Big( \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot \big( d(x, x') + d(x', f^*_i) \big)^{\ell} + 2^{\ell} (1+\delta)^{\ell} \cdot d^{\ell}(x', c(x')) \Big), && \text{by Fact 2.}
\end{align*}
2. The second term:
\[ \sum_{x' \in C_i^f} \Phi(t(x'), C_i) \le \sum_{x' \in C_i^f} \sum_{x \in C_i} \big( d(x, x') + d(x', t(x')) \big)^{\ell} \le \sum_{x' \in C_i^f} \sum_{x \in C_i} \big( d(x, x') + d(x', f^*_i) \big)^{\ell}, \]
where the first inequality is the triangle inequality and the second uses the defn. of $t(x')$.
On combining the two terms we get the following expression:
\begin{align*}
E[\Phi(t(p), C_i)] &\le \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', f^*_i) \big)^{\ell} + \frac{(1+\delta)^{\ell}}{|C_i|} \sum_{x' \in C_i^n} |C_i| \cdot 2^{\ell} \, d^{\ell}(x', c(x')).
\end{align*}
As in the proof of Lemma 2, we have $\frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', f^*_i) \big)^{\ell} \le 3^{\ell} \cdot \Delta(C_i)$. Thus, we get:
\begin{align*}
E[\Phi(t(p), C_i)] &\le \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot 3^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i^n) \\
&= \Big(1 + \frac{\varepsilon}{\ell \cdot 3^{\ell+3}}\Big)^{\ell} \cdot 3^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i^n), && \text{by substituting } \delta = \frac{\ell \cdot 3^{\ell+3}}{\varepsilon} \\
&\le \Big(1 + \frac{2 \ell \varepsilon}{\ell \cdot 3^{\ell+3}}\Big) \cdot 3^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i^n), && \text{by Fact 1} \\
&\le \Big(3^{\ell} + \frac{\varepsilon}{4}\Big) \cdot \Delta(C_i) + 2^{\ell} (2\delta)^{\ell} \cdot \Phi(F, C_i^n), && \because 1 + \delta \le 2\delta \text{ for } \varepsilon \le 1 \\
&\le \Big(3^{\ell} + \frac{\varepsilon}{4}\Big) \cdot \Delta(C_i) + \frac{\varepsilon}{4} \cdot \Delta(C_i), && \text{by Lemma 5} \\
&= \Big(3^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i).
\end{align*}
This completes the proof of the lemma.

We obtain the following corollary using Markov's inequality.

Corollary 3.
For any $0 < \varepsilon \le 1$ and a point $p$ sampled uniformly at random from $M_i$, we have:
\[ \Pr\Big[ \Phi(t(p), C_i) \le \big(3^{\ell} + \varepsilon\big) \cdot \Delta(C_i) \Big] > \frac{\varepsilon}{3^{\ell+2}}. \]
Let us call a point $p$ a good point if $t(p)$ gives a $(3^{\ell} + \varepsilon)$-approximation for $C_i$.

Corollary 4.
There are at least $\Big\lceil \frac{\varepsilon \cdot |C_i|}{3^{\ell+2}} \Big\rceil$ good points in $M_i$.

Now our goal is to obtain one such good point from $M_i$. If $F$ contains any good point, then we are done, since the algorithm adds the entire set $F$ to $M$; as a result, Property-I is satisfied for the cluster $C_i$. On the other hand, if $F$ does not contain any good point, then there is no good point in $M_i^n$ either. This simply means that all the good points are present in $C_i^f$. To sample these good points we use the $D^{\ell}$-sampling technique. Let $G \subseteq C_i^f$ denote the set of good points. Then we have $|G| \ge \Big\lceil \frac{\varepsilon \cdot |C_i|}{3^{\ell+2}} \Big\rceil$. Using this fact, we will prove the following lemma.

Lemma 7.
For any point $p \in C_i^f$ and a $D^{\ell}$-sampled point $x \in C$, we have:
\[ \Pr[x = p] \ge \frac{\varepsilon}{\alpha \beta \gamma k \, |C_i|} =: \tau, \qquad \text{and} \qquad \Pr\Big[ \Phi(t(x), C_i) \le \big(3^{\ell} + \varepsilon\big) \cdot \Delta(C_i) \Big] \ge \frac{\varepsilon^{2}}{\alpha \beta \gamma k \cdot 3^{\ell+2}}. \]

Proof.
For any point $p \in C_i^f$,
\[ \Pr[x = p] = \frac{d^{\ell}(p, c(p))}{\Phi(F, C)} \ge \frac{R^{\ell}}{\Phi(F, C)} = \frac{1}{\beta \, |C_i|} \cdot \frac{\Phi(F, C_i)}{\Phi(F, C)} \ge \frac{\varepsilon}{\alpha \beta \gamma k \, |C_i|}. \]
Let $Z$ be the indicator random variable with $Z = 1$ if $\Phi(t(x), C_i) \le (3^{\ell} + \varepsilon) \cdot \Delta(C_i)$ and $Z = 0$ otherwise. Since $Z = 1$ whenever $x$ is a good point,
\[ \Pr[Z = 1] \ge \sum_{p \in G} \Pr[x = p] \ge |G| \cdot \frac{\varepsilon}{\alpha \beta \gamma k \, |C_i|} \ge \frac{\varepsilon^{2}}{\alpha \beta \gamma k \cdot 3^{\ell+2}}. \]
This completes the proof of the lemma.

The above lemma states that every point in $C_i^f$ has a sampling probability of at least $\tau$. Moreover, a sampled point gives a $(3^{\ell} + \varepsilon)$-approximation for $C_i$ with probability at least $\frac{\varepsilon^{2}}{\alpha \beta \gamma k \cdot 3^{\ell+2}}$. To boost this probability, we sample $\eta := \frac{\alpha \beta \gamma k \cdot 3^{\ell+2}}{\varepsilon^{2}}$ points independently from $C$ using $D^{\ell}$-sampling (see line (4) of the algorithm). It follows that, with probability at least $1/2$, there is a point in the sampled set that gives a $(3^{\ell} + \varepsilon)$-approximation for $C_i$. Hence, Property-I is satisfied for $C_i$. Also note that, in line (4) of the algorithm, we sample $\eta \cdot k$ points, i.e., $\eta$ points corresponding to each cluster. Hence, Property-I holds for every cluster in $H$.

Since Property-I is satisfied for every cluster in $W$ and $H$, we can finally claim that $T_s = \{t(s_1), t(s_2), \ldots, t(s_k)\}$ is a $(3^{\ell} + \varepsilon)$-approximation for $\mathcal{C}$ with probability at least $1/2^{k}$. However, as described earlier, $T_s$ could be a soft center set, since $t(s_i)$ can be the same as $t(s_j)$ for some $i \ne j$. To obtain a hard center set, we make use of line (8) of the algorithm. In line (8), the algorithm pulls out the $k$ closest points from $L$ instead of just one. Note that it is not necessary to open a facility at the closest location in $L$. Rather, we can open a facility at any location $f$ in $L$ that is at least as close to $s_i$ as $f^*_i$, i.e., $d(s_i, f) \le d(s_i, f^*_i)$.

Let $T(s_i)$ denote the set of the $k$ facility locations closest to $s_i$. We show that there is a hard center set $T_h \subseteq \cup_i T(s_i)$ such that $T_h = \{f_1, \ldots, f_k\}$ and $d(s_i, f_i) \le d(s_i, f^*_i)$ for every $1 \le i \le k$. We define $T_h$ using the following simple subroutine:

FindFacilities
- $T_h \leftarrow \emptyset$
- For $i \in \{1, \ldots, k\}$:
-   if ($f^*_i \in T(s_i)$): $T_h \leftarrow T_h \cup \{f^*_i\}$
-   else:
-     Let $f \in T(s_i)$ be any facility such that $f$ is not in $T_h$
-     $T_h \leftarrow T_h \cup \{f\}$

Lemma 8. $T_h = \{f_1, f_2, \ldots, f_k\}$ contains exactly $k$ distinct facilities such that for every $1 \le i \le k$ we have $d(s_i, f_i) \le d(s_i, f^*_i)$.

Proof. First, let us show that all the facilities in $T_h$ are distinct. Since $f^*_i$ is different for different clusters, the if statement adds facilities to $T_h$ that are distinct. In the else part, we only add a facility to $T_h$ that is not already present in $T_h$.
Thus the else statement also adds distinct facilities to $T_h$.

Now, let us prove the second property, i.e., $d(s_i, f_i) \le d(s_i, f^*_i)$ for every $1 \le i \le k$. The property is trivially true for the facilities added in the if statement. For a facility added in the else statement, we know that $T(s_i)$ does not contain $f^*_i$. Since $T(s_i)$ is the set of the $k$ facility locations closest to $s_i$, every facility location $f \in T(s_i)$ satisfies $d(s_i, f) \le d(s_i, f^*_i)$. Thus any facility added in the else statement has $d(s_i, f) \le d(s_i, f^*_i)$. This completes the proof.

Thus $T_h \in \mathcal{L}$ is a hard center set, which gives the $(3^{\ell} + \varepsilon)$-approximation for the problem. This completes the analysis of the algorithm.

Now, suppose we are given the flexibility to open a facility at a client location; in other words, suppose it is given that $C \subseteq L$. In this case, we can directly open the facilities at the locations $\{s_1, s_2, \ldots, s_k\}$ instead of the $t(s_i)$'s, and we would not need lines (7) and (8) of the algorithm. Further, we can show that Lemmas 4 and 6 would give a $(2^{\ell} + \varepsilon)$-approximation for this special case. However, note that $\{s_1, s_2, \ldots, s_k\}$ is still a soft center set. To obtain a hard center set we do need to consider the $k$ closest facility locations for a point $s_i$. In that case, Lemmas 4 and 6 cannot provide a $(2^{\ell} + \varepsilon)$-approximation. Therefore, we make some slight changes in the analyses of Lemmas 4 and 6 to get a $(2^{\ell} + \varepsilon)$-approximation. We discuss these details in the following subsection.

D.3 Analysis for $C \subseteq L$

The algorithm
List-k-service$(C, L, k, d, \ell, \varepsilon)$ gives a better approximation guarantee if it is allowed to open a facility at a client location, i.e., if $C \subseteq L$. We will show the following result.

Theorem 10.
Let $0 < \varepsilon \le 1$. Let $(C, L, k, d, \ell)$ be any $k$-service instance with $C \subseteq L$ and let $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$ be any arbitrary clustering of the client set. The algorithm List-k-service$(C, L, k, d, \ell, \varepsilon)$, with probability at least $1/2$, outputs a list $\mathcal{L}$ of size $(k/\varepsilon)^{O(k\ell)}$ such that there is a $k$-center set $S \in \mathcal{L}$ with
\[ \Psi(S, \mathcal{C}) \le (2^{\ell} + \varepsilon) \cdot \Psi^*(\mathcal{C}). \]
Moreover, the running time of the algorithm is $O\big(n \cdot (k/\varepsilon)^{O(k\ell)}\big)$.

Since we are dealing with a special case, all the previous lemmas (i.e., Lemmas 4, 5, 6, and 7) remain valid here. We will obtain improved versions of Lemmas 4 and 6 and consequently a better approximation guarantee ($2^{\ell}$ instead of $3^{\ell}$). We show the following property for the set $M$.

Property-II: For any cluster $C_i \in \{C_1, C_2, \ldots, C_k\}$, with probability at least $1/2$ there is a point $s_i$ in $M$ such that the following holds:
\[ \Phi(u_i(s_i), C_i) \le \begin{cases} \big(2^{\ell} + \varepsilon\big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C), & \text{if } C_i \in W \\[2pt] \big(2^{\ell} + \varepsilon\big) \cdot \Delta(C_i), & \text{if } C_i \in H. \end{cases} \]
Here, for any point $x \in C \cup L$, $u_i(x)$ denotes a location that is at least as close to $x$ as the point of $C_i$ closest to $x$. In other words, if $p = \arg\min_{y \in C_i} d(y, x)$ is the closest location in $C_i$, then $d(x, u_i(x)) \le d(x, p)$.

By Property-II, we claim that $\{u_1(s_1), u_2(s_2), \ldots, u_k(s_k)\}$ is a $(2^{\ell} + \varepsilon)$-approximation for $\mathcal{C}$. The reasoning for this claim is the same as the one we provided in the previous subsection. Moreover, $s_i$ is always a possible candidate for $u_i(s_i)$, and it is added to the set $T$ in line (8) of the algorithm. Therefore, $\mathcal{L}$ contains a set that is a $(2^{\ell} + \varepsilon)$-approximation for $\mathcal{C}$. However, this set is only a soft center set. Later we will show that there is a hard center set in $\mathcal{L}$ that gives the desired approximation for the problem. For now, let us prove Property-II for the target clusters.

Case 1: $\Phi(F, C_i) \le \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C)$

Recall that we defined the multi-set $M_i := \{c(x) \mid x \in C_i\}$, where $c(x)$ denotes the location in $F$ closest to $x$. We show the following result.

Lemma 9.
Let $p$ be a point sampled uniformly at random from $M_i$. Then the following bound holds:
\[ E[\Phi(u_i(p), C_i)] \le \Big(2^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C). \]

Proof.
The proof follows from the following sequence of inequalities.
\begin{align*}
E[\Phi(u_i(p), C_i)] &= \frac{1}{|C_i|} \sum_{p \in M_i} \Phi(u_i(p), C_i) = \frac{1}{|C_i|} \sum_{p \in M_i} \sum_{x \in C_i} d^{\ell}(x, u_i(p)) = \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} d^{\ell}\big(x, u_i(c(x'))\big) \\
&\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', c(x')) + d(c(x'), u_i(c(x'))) \big)^{\ell}, && \text{by the triangle inequality} \\
&\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + d(x', c(x')) + d(c(x'), x') \big)^{\ell}, && \text{by the defn. of } u_i(c(x')) \\
&= \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \big( d(x, x') + 2\, d(x', c(x')) \big)^{\ell}.
\end{align*}
Let us use Fact 2 by setting $b = d(x, x')$ and $a = 2 \cdot d(x', c(x'))$. We obtain the following expression, for any $\delta > 0$:
\begin{align*}
E[\Phi(u_i(p), C_i)] &\le \frac{1}{|C_i|} \sum_{x' \in C_i} \sum_{x \in C_i} \Big( \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot d^{\ell}(x, x') + (1+\delta)^{\ell} \cdot 2^{\ell} \, d^{\ell}(x', c(x')) \Big) \\
&= \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot E[\Phi(x, C_i)] + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i) \\
&\le \big(1 + \tfrac{1}{\delta}\big)^{\ell} \cdot 2^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i), && \text{by Lemma 3} \\
&= \Big(1 + \frac{\varepsilon}{\ell \cdot 2^{\ell+2}}\Big)^{\ell} \cdot 2^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i), && \text{by substituting } \delta = \frac{\ell \cdot 2^{\ell+2}}{\varepsilon} \\
&\le \Big(1 + \frac{2 \ell \varepsilon}{\ell \cdot 2^{\ell+2}}\Big) \cdot 2^{\ell} \cdot \Delta(C_i) + 2^{\ell} (1+\delta)^{\ell} \cdot \Phi(F, C_i), && \text{by Fact 1} \\
&\le \Big(2^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + 2^{\ell} (2\delta)^{\ell} \cdot \Phi(F, C_i), && \because 1 + \delta \le 2\delta \text{ for } \varepsilon \le 1 \\
&\le \Big(2^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + 4^{\ell} \delta^{\ell} \cdot \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C), && \because \Phi(F, C_i) \le \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C) \\
&\le \Big(2^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{\alpha k} \cdot \Phi(F, C), && \because \gamma = \frac{\ell^{\ell} \cdot 2^{\ell+5} \cdot 3^{\ell+1}}{\varepsilon^{\ell}} \\
&\le \Big(2^{\ell} + \frac{\varepsilon}{2}\Big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C), && \because \Phi(F, C) \le \alpha \cdot OPT(C, C).
\end{align*}
This completes the proof of the lemma.

By the above lemma, we can claim that there is a point $p$ in $M_i$ such that $\Phi(u_i(p), C_i) \le \big(2^{\ell} + \frac{\varepsilon}{2}\big) \cdot \Delta(C_i) + \frac{\varepsilon^{\ell+1}}{k} \cdot OPT(C, C)$.
Since $M_i$ is composed only of points from $F$, and $F$ is contained in $M$, we can say that Property-II is satisfied for every cluster $C_i \in W$. Now, let us analyze the high-cost clusters.

Case 2: $\Phi(F, C_i) > \frac{\varepsilon}{\alpha \gamma k} \cdot \Phi(F, C)$.

We reuse $C_i^n$ and $C_i^f$ as defined earlier. As before, let us define the multi-set $M_i^n := \{c(x) \mid x \in C_i^n\}$ and another multi-set $M_i := C_i^f \cup M_i^n$. We show the following result.

Lemma 10.
Let p be a point sampled uniformly at random from M i . Then the following bound holds: E [ Φ ( u i ( p ) , C i )] ≤ (cid:16) (cid:96) + ε (cid:17) · ∆ ( C i ) Proof. E [ Φ ( u i ( p ) , C i )] = 1 | C i | · (cid:88) p ∈ M i Φ ( u i ( p ) , C i ) = 1 | C i | · (cid:88) x (cid:48) ∈ C ni Φ ( u i ( c ( x (cid:48) )) , C i ) + (cid:88) x (cid:48) ∈ C fi Φ ( u i ( x (cid:48) ) , C i ) et us evaluate these two terms separately.1. The first term: (cid:88) x (cid:48) ∈ C ni Φ ( u i ( c ( x (cid:48) )) , C i ) = (cid:88) x (cid:48) ∈ C ni (cid:88) x ∈ C i d (cid:96) ( x, u i ( c ( x (cid:48) ))) ≤ (cid:88) x (cid:48) ∈ C ni (cid:88) x ∈ C i ( d ( x, x (cid:48) ) + d ( x (cid:48) , c ( x (cid:48) )) + d ( c ( x (cid:48) ) , u i ( c ( x (cid:48) ))))) (cid:96) , by triangle-inequality ≤ (cid:88) x (cid:48) ∈ C ni (cid:88) x ∈ C i ( d ( x, x (cid:48) ) + d ( x (cid:48) , c ( x (cid:48) )) + d ( c ( x (cid:48) ) , x (cid:48) ))) (cid:96) , by the defn. of u i ( c ( x (cid:48) ))= (cid:88) x (cid:48) ∈ C ni (cid:88) x ∈ C i ( d ( x, x (cid:48) ) + 2 d ( x (cid:48) , c ( x (cid:48) )))) (cid:96) , ≤ (cid:88) x (cid:48) ∈ C ni (cid:88) x ∈ C i (cid:32)(cid:18) δ (cid:19) (cid:96) · d (cid:96) ( x, x (cid:48) ) + 2 (cid:96) (1 + δ ) (cid:96) · d (cid:96) ( x (cid:48) , c ( x (cid:48) )) (cid:33) by Fact 22. The second term: (cid:88) x (cid:48) ∈ C fi Φ ( u i ( x (cid:48) ) , C i ) = (cid:88) x (cid:48) ∈ C fi (cid:88) x ∈ C i ( d ( x, x (cid:48) ) + d ( x (cid:48) , u i ( x (cid:48) ))) (cid:96) , by triangle-inequality= (cid:88) x (cid:48) ∈ C fi (cid:88) x ∈ C i d (cid:96) ( x, x (cid:48) ) , by the defn. of u i ( x (cid:48) )Let us now combine the two terms. 
We get the following expression:

  E[Φ(u_i(p), C_i)]
    ≤ ( (1 + δ)^ℓ / |C_i| ) · Σ_{x' ∈ C_i} Σ_{x ∈ C_i} d^ℓ(x, x') + ( (1 + 1/δ)^ℓ / |C_i| ) · Σ_{x' ∈ C_i^n} |C_i| · 2^ℓ · d^ℓ(x', c(x'))
    = (1 + δ)^ℓ · E_{x' ∈ C_i}[Φ(x', C_i)] + 2^ℓ (1 + 1/δ)^ℓ · Φ(F, C_i^n)
    ≤ (1 + δ)^ℓ · 2^ℓ · Δ(C_i) + 2^ℓ (1 + 1/δ)^ℓ · Φ(F, C_i^n),   by Lemma 2
    = ( 1 + ε/(ℓ · 2^{ℓ+3}) )^ℓ · 2^ℓ · Δ(C_i) + 2^ℓ (1 + 1/δ)^ℓ · Φ(F, C_i^n),   by substituting δ = ε/(ℓ · 2^{ℓ+3})
    ≤ ( 1 + 2ℓ · ε/(ℓ · 2^{ℓ+3}) ) · 2^ℓ · Δ(C_i) + 2^ℓ (1 + 1/δ)^ℓ · Φ(F, C_i^n),   by Fact 1
    = ( 2^ℓ + ε/4 ) · Δ(C_i) + 2^ℓ (1 + 1/δ)^ℓ · Φ(F, C_i^n)
    ≤ ( 2^ℓ + ε/4 ) · Δ(C_i) + 2^ℓ · (2/δ)^ℓ · Φ(F, C_i^n),   since 1 + 1/δ ≤ 2/δ for δ ≤ 1, which holds for ε ≤ 1
    ≤ ( 2^ℓ + ε/4 ) · Δ(C_i) + (ε/4) · Δ(C_i),   by Lemma 5
    = ( 2^ℓ + ε/2 ) · Δ(C_i)

This completes the proof of the lemma. We obtain the following corollary using Markov's inequality.

Corollary 5. If we sample a point p ∈ M_i uniformly at random, then for ε ≤ 1:

  Pr[ Φ(u_i(p), C_i) ≤ (2^ℓ + ε) · Δ(C_i) ] ≥ ε / (2^{ℓ+1} + 2ε) > ε / 2^{ℓ+2}

Since Corollary 3 is similar to Corollary 5, all the arguments made in the previous subsection remain valid here. Thus we can claim that, with probability at least 1/2, there is a point x in the sampled set M such that u_i(x) gives a (2^ℓ + ε)-approximation for C_i. In other words, Property-II is satisfied for every cluster in H.

Now let us show that there is a hard center-set T_h = {f_1, f_2, ..., f_k} in the list which gives the desired approximation.
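Before turning to T_h, the repetition argument implicit above can be made concrete: assuming a per-sample success probability of the form p = ε/2^{ℓ+2}, taking ⌈ln(2k)/p⌉ independent samples per cluster makes a fixed cluster fail with probability at most 1/(2k), so a union bound over the k clusters leaves an overall failure probability of at most 1/2. A small arithmetic sketch (the function name and the exact constant in p are our assumptions, not the paper's notation):

```python
import math

def samples_needed(eps, ell, k):
    """Number of uniform samples per cluster so that, by a union bound over
    the k clusters, every cluster gets a good candidate with probability
    at least 1/2. Assumes per-sample success probability p = eps / 2**(ell + 2)."""
    p = eps / 2 ** (ell + 2)
    # (1 - p)^n <= e^(-p*n) <= 1/(2k) when n >= ln(2k)/p
    return math.ceil(math.log(2 * k) / p)
```

For instance, in the k-means setting (ℓ = 2) with ε = 0.5 and k = 10 this gives 96 samples per cluster; the count depends only on k, ℓ, and ε, which is what keeps the overall list size FPT.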
Our argument is based on the fact that we can open a facility at any location u_i(s_i), which is at least as close to s_i as the closest location in C_i (see Property-II). Let p_i denote a location in C_i that is closest to s_i, i.e., p_i = arg min_{x ∈ C_i} d(x, s_i). We will show that d(s_i, f_i) ≤ d(s_i, p_i) for every 1 ≤ i ≤ k. The following analysis is very similar to the analysis in the previous section; the only difference is that f*_i is replaced with p_i.

Let T(s_i) denote a set of k closest facility locations for a sampled point s_i ∈ M (see line (8) of the algorithm). We define T_h = {f_1, f_2, ..., f_k} using the following simple subroutine:

FindFacilities
  - T_h ← ∅
  - For i ∈ {1, ..., k}:
      - if (p_i ∈ T(s_i)): T_h ← T_h ∪ {p_i}
      - else:
          - Let f ∈ T(s_i) be any facility such that f is not in T_h
          - T_h ← T_h ∪ {f}

Since the size of each T(s_i) is exactly k, T_h contains k facilities at the end of the for-loop.

Lemma 11.
The subroutine picks a set T_h = {f_1, f_2, ..., f_k} of k distinct facilities such that d(s_i, f_i) ≤ d(s_i, p_i) for every 1 ≤ i ≤ k.

Proof. First, let us show that the facilities in T_h are distinct. Since the p_i's are different for different clusters, the if statement adds distinct facilities to T_h. Moreover, in the else statement, we only add a facility to T_h if it is not already present in T_h. This ensures that the facilities added to T_h are distinct.

Now, let us prove the second property, i.e., d(s_i, f_i) ≤ d(s_i, p_i) for every 1 ≤ i ≤ k. The property is trivially true for the facilities added in the if statement. For the facilities added in the else statement, we know that T(s_i) does not contain p_i. Since T(s_i) is a set of k closest facility locations, every facility f in T(s_i) satisfies d(s_i, f) ≤ d(s_i, p_i). Thus any facility added in the else statement has d(s_i, f) ≤ d(s_i, p_i). This completes the proof.

Thus T_h ∈ L is a hard center-set, which gives a (2^ℓ + ε)-approximation for the problem. This completes the analysis of the algorithm when C ⊆ L.

D.4 Algorithm for Outlier k-Service

The outlier k-service problem does not fit the framework of the constrained k-service problem since a clustering is defined over the set C \ Z instead of C. Due to this, we cannot use the list k-service algorithm directly. To overcome this issue, we modify the algorithm List-k-service so that it gives a good center-set for any clustering defined over the set C \ Z. The modification concerns only the first step of the algorithm: there, we run an α-approximation algorithm for the (k + m)-means problem over the point set C (the k-means++ algorithm with (k + m) centers instead of k centers is one such algorithm, with α = O(log k)), and we keep the rest of the algorithm the same. Hence, we obtain a center-set F of size (k + m) such that Φ(F, C) ≤ α · OPT(C, C). Here
OPT(C, C) denotes the optimal cost for the unconstrained (k + m)-service instance (C, C, k + m, d, ℓ), i.e., with (k + m) centers. Let C := {C_1, C_2, ..., C_k} be an optimal clustering for the outlier k-service instance defined over C \ Z, where the set Z of size m is the set of outliers. Let F* := {f*_1, f*_2, ..., f*_k} denote an optimal center-set for C. Let Δ(C_i) denote the optimal cost of each cluster, i.e., Δ(C_i) = Φ(f*_i, C_i). On the basis of the previous analysis of the list k-service algorithm, we obtain the following property for the set M.

Property III: For any cluster C_i ∈ {C_1, C_2, ..., C_k}, there is a point s_i in M such that, with probability at least 1/
2, the following holds:

  Φ(t(s_i), C_i) ≤ (3^ℓ + ε/2) · Δ(C_i) + (ε/(2^{ℓ+1} k)) · OPT(C, C),   if C_i ∈ W
  Φ(t(s_i), C_i) ≤ (3^ℓ + ε/2) · Δ(C_i),   if C_i ∈ H

where t(s_i) denotes a facility location that is at least as close to s_i as f*_i, i.e., d(s_i, t(s_i)) ≤ d(s_i, f*_i).

The following lemma bounds the optimal (k + m)-service cost in terms of the optimal outlier k-service cost.

Lemma 12.
OPT(C, C) ≤ 2^ℓ · Σ_{i=1}^{k} Δ(C_i)

Proof.
Let F_c ⊆ C \ Z denote an optimal center set for C when the centers are restricted to C \ Z. Let us see why Ψ(F_c, C) ≤ 2^ℓ · Σ_{i=1}^{k} Δ(C_i); this is the standard factor-2^ℓ loss incurred by moving each optimal center to the closest client of its cluster. Consider each point in Z as a cluster of its own, and denote these singleton clusters by {C_{k+1}, C_{k+2}, ..., C_{k+m}}. Define a new clustering C' := {C_1, C_2, ..., C_{k+m}} having (k + m) clusters, and let F' := F_c ∪ Z be a center-set for C'. Thus we get Ψ(F', C') ≤ 2^ℓ · Σ_{i=1}^{k} Δ(C_i), since every point of Z is served by itself at zero cost. Moreover, since OPT(C, C) is the optimal cost for the unconstrained (k + m)-service instance (C, C, k + m, d, ℓ) (i.e., with k + m centers), we have OPT(C, C) ≤ Ψ(F', C'). Hence OPT(C, C) ≤ 2^ℓ · Σ_{i=1}^{k} Δ(C_i).

Using the above claim and Property III, we can say that {t(s_1), t(s_2), ..., t(s_k)} gives a (3^ℓ + ε)-approximation for the outlier k-service problem. Hence we claim that, with probability at least 1/2, there is a center-set in the list that gives a (3^ℓ + ε)-approximation for the outlier k-service problem. However, the list size is larger in this case; in particular, it is ((k + m)/ε)^{O(kℓ)}.

The outlier k-service problem also admits a partition algorithm with running time O(nk) (see Section E). Therefore, we can apply Theorem 1 to the outlier k-service problem, which gives the following main theorem for the outlier problem.

Theorem 11.
There is a (3^ℓ + ε)-approximation algorithm for the outlier k-service problem with running time O(n · ((k + m)/ε)^{O(kℓ)}). For the special case when C ⊆ L, the algorithm gives an approximation guarantee of (2^ℓ + ε).

It should be noted that for small values of k, our algorithm improves over existing algorithms. The best known polynomial-time approximation guarantee for the outlier version of the metric k-median problem is 7 and for the metric k-means problem is 53 [KLS18]. Our algorithm gives a (3 + ε)-approximation for the outlier k-median problem and a (9 + ε)-approximation for the outlier k-means problem, albeit in FPT time. Moreover, the running time of our algorithm is only linear in n.

E Partition Algorithms
We would like to make some important observations about the constrained problems mentioned in Table 1. The r-gather and r-capacity k-service problems that we consider in this work are more general than the ones considered by Ding and Xu [DX15], in that the r_i values for different clusters need not be the same. Ding and Xu [DX15] considered only the uniform version of the problem (where all r_i's are the same) and designed a partition algorithm for it. However, their algorithm can easily be generalized to the non-uniform version. We describe these algorithms in the subsections of this section.

Let us first discuss the last four problems in Table 1: the fault-tolerant, semi-supervised, uncertain, and outlier k-service problems. These problems do not strictly follow the definition of the constrained k-service problem, due to which Theorem 1 does not hold for them. Let us see why. The fault-tolerant, semi-supervised, and uncertain k-service problems have objective functions different from the one mentioned in Definition 2. This distinction is crucial since the list L only provides a center set that gives a constant approximation with respect to the cost function Ψ*(C); therefore, Theorem 1 cannot be applied directly. For the outlier k-service problem, a feasible clustering is defined over C \ Z instead of C, whereas L only provides a good center set for clusterings of C and not of C \ Z; therefore, Theorem 1 does not work here either. This misfit can be evaded by defining these problems differently so that they fit the definition of the constrained k-service problem. Let us consider these problems one by one.

1. An instance of the fault-tolerant k-service problem can be reduced to a special instance of the chromatic k-service problem (see Section 4.5 of [DX15]). Since we can apply Theorem 1 to the reduced instance, we can also obtain a clustering for the original instance.

2.
For the semi-supervised k-service problem, we do not need another equivalent definition. In fact, it can be shown that the list L provides a good center-set for the objective function Ψ'(F, C) := α · Ψ(F, C) + (1 − α) · Dist(C', C) as well (see Section 4.4 of [GJK19]). Therefore, we can apply Theorem 1 to a semi-supervised k-service instance.

3. For the uncertain k-service problem (assigned case), Cormode and McGregor [CM08] gave an equivalent definition as follows: given a weighted point set D := ∪_{p=1}^{n} D_p such that a point p_i ∈ D_p carries a weight of t_{ip}, the task is to find a clustering D := {D_1, D_2, ..., D_k} of D with minimum Ψ*(D), such that all points in D_p are assigned to the same cluster. This definition follows from the linearity of expectation. Moreover, the weighted instance can easily be converted to an unweighted one by replacing each weighted point with an appropriate number of unit-weight points (see Section 5 of [CM08]). The problem then fits the definition of the constrained k-service problem, and hence it can be solved using Theorem 1.

4. Lastly, we have the outlier k-service problem. We do not define the problem differently; instead, we construct a new list L' of center-sets such that it contains a good center-set for any clustering defined over the set C \ Z (for any set Z of size m). However, this new list has a larger size, which is a function of both k and m. Therefore, the FPT algorithm for the outlier k-service problem is parameterized by both k and m. On the positive side, only k appears in the exponent. We give the details of the algorithm in Section D.4. Since the problem is not discussed in [DX15], we give its partition algorithm in a subsection here.

We expect that more problems can fit, directly or indirectly, in the framework of the constrained k-service problem.
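The weighted-to-unweighted conversion used in item 3 can be sketched as follows; this is a minimal illustration (the helper name and the integer-scaling convention are our assumptions, not notation from [CM08]):

```python
def unweight(points, weights, scale=1):
    """Replace each point of weight w by round(w * scale) unit-weight copies.

    For rational weights (e.g., probabilities), choose `scale` as a common
    denominator so that every w * scale is an integer; clustering costs on
    the replicated instance then equal `scale` times the weighted costs.
    """
    out = []
    for p, w in zip(points, weights):
        out.extend([p] * round(w * scale))
    return out
```

For example, weights (0.5, 0.25, 0.25) with scale=4 produce 2 + 1 + 1 = 4 unit-weight points, and any unweighted k-service algorithm can then be run on the replicated set.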
Since it already encapsulates a wide range of problems, this framework is important to study.

Next, we design the partition algorithms for the non-uniform r-gather, r-capacity, and outlier k-service problems. For the rest of the problems mentioned in Table 1, efficient partition algorithms (in the Euclidean space) are already known (see Sections 4 and 5.3 of [DX15]). These algorithms can easily be extended to general metric spaces.

E.1 r-gather k-service

Partition Problem: Given a k-center-set F, we have to find a clustering C = {C_1, ..., C_k} with minimum Ψ(F, C), such that each cluster C_i ∈ C contains at least r_i points.

Let us describe the algorithm. Since we do not know which cluster is assigned to which center, we consider all possible assignments. Without loss of generality, assume that F = {f_1, f_2, ..., f_k} and that C_i is assigned to f_i. Now, we know that at least r_i clients must be assigned to f_i. Let us reduce the problem to a flow problem on a complete bipartite graph G = (V_l, V_r, E). Vertices in V_l correspond to the points in F; vertices in V_r correspond to the points in C; and an edge (u, v) ∈ E carries a weight of d^ℓ(u, v) with a demand of 0 and a capacity of 1. Create a new source vertex, join it to every vertex in V_l, and put a demand of r_i on the edge to f_i. Similarly, create a new sink vertex, join it to every vertex in V_r, and put a demand of 1 and a capacity of 1 on each of these edges. It is easy to see that a feasible flow on this graph corresponds to a partitioning of C satisfying the r-gather constraints. Now we can apply any standard min-cost flow algorithm to obtain the optimal partitioning of C.

The running time of the algorithm is k^k · n^{O(1)}, where the k^k term appears because we exhaustively guess the assignment of demands to the facilities in F, and the n^{O(1)} term appears due to the min-cost flow algorithm.
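The flow construction above can be sketched concretely for one guessed assignment of the lower bounds r_i (the k^k enumeration is omitted). This is an illustrative implementation, not the paper's: the demand arcs are emulated by a large negative cost -B, a standard alternative to lower-bounded arcs, and all names are ours.

```python
def min_cost_flow_assign(dist, r):
    """Min-cost assignment with lower bounds: dist is a k x n matrix of
    d^ell values; facility i must serve at least r[i] clients. Returns
    (assignment lists, optimal cost). Successive shortest paths with
    Bellman-Ford, since the forced arcs have negative cost."""
    k, n = len(dist), len(dist[0])
    B = 1 + sum(sum(row) for row in dist)          # larger than any total cost
    S, T, N = 0, 1 + k + n, 2 + k + n              # node ids
    graph = [[] for _ in range(N)]                 # adjacency: edge indices
    edges = []                                     # [to, cap, cost, flow]

    def add(u, v, cap, cost):
        graph[u].append(len(edges)); edges.append([v, cap, cost, 0])
        graph[v].append(len(edges)); edges.append([u, 0, -cost, 0])

    for i in range(k):
        add(S, 1 + i, r[i], -B)                    # mandatory r[i] units
        add(S, 1 + i, n - r[i], 0)                 # optional extra units
        for j in range(n):
            add(1 + i, 1 + k + j, 1, dist[i][j])   # facility -> client
    for j in range(n):
        add(1 + k + j, T, 1, 0)                    # client -> sink

    total = 0
    for _ in range(n):                             # push one unit per round
        d = [float('inf')] * N; d[S] = 0; prev = [-1] * N
        for _ in range(N - 1):                     # Bellman-Ford
            for u in range(N):
                if d[u] == float('inf'):
                    continue
                for e in graph[u]:
                    to, cap, cost, fl = edges[e]
                    if cap - fl > 0 and d[u] + cost < d[to]:
                        d[to] = d[u] + cost; prev[to] = e
        assert d[T] < float('inf'), "infeasible: sum(r) > n"
        total += d[T]
        v = T
        while v != S:                              # augment along the path
            e = prev[v]
            edges[e][3] += 1; edges[e ^ 1][3] -= 1
            v = edges[e ^ 1][0]

    assign = [[] for _ in range(k)]
    for i in range(k):
        for e in graph[1 + i]:
            to, cap, cost, fl = edges[e]
            if fl == 1 and 1 + k <= to < 1 + k + n:
                assign[i].append(to - 1 - k)
    return assign, total + B * sum(r)              # undo the -B bookkeeping
```

Running this over all k^k (in fact k!) guesses of which lower bound goes with which facility, and keeping the cheapest feasible outcome, matches the k^k · n^{O(1)} bound stated above.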
Ding and Xu [DX15] used the same method to obtain the partition algorithm for the uniform r-gather k-service problem; however, no guessing was involved there because each facility had the same demand.

Theorem 12.
There is a partition algorithm for the r-gather k-service problem, with a running time of k^{O(k)} · n^{O(1)}.

Now, let us discuss the partition algorithm in the streaming setting. Goyal et al. [GJK19] designed a streaming partition algorithm for the uniform r-gather/r-capacity k-service problem. The same algorithm can be generalized to the non-uniform setting in general metric spaces. Let us briefly see what their algorithm does. The algorithm first creates a small (in terms of space) representation G' = (V'_l, V'_r, E') of the original graph G = (V_l, V_r, E) that satisfies the following definition:

Definition 5 (Definition 4 of [GJK19]). G' = (V'_l, V'_r, E') is an edge-weighted bipartite graph such that a number n_v is associated with each vertex v ∈ V'_r. G' represents the pair (F, C), where F is a set of k centers and C is a set of n points, such that it satisfies the following conditions:
– The set V'_l = F. Each point p ∈ C is mapped to a unique vertex v in V'_r – call this vertex φ(p). Further, n_v is equal to |φ^{-1}(v)|.
– For each point p ∈ C and center f ∈ F, the weight of the edge (φ(p), f) in G' is within (1 ± ε) of d^ℓ(p, f).

Moreover, the graph G' can be constructed using a two-pass streaming algorithm, as described by the following theorem.

Theorem 13 (Theorem 7 of [GJK19]).
Given a pair (F, C) of k centers and n points respectively, there is a single-pass streaming algorithm which builds a bipartite graph G' = (V'_l, V'_r, E'). The space used by this algorithm (which includes the size of G') is O( 2^k · k · (k + log n + log Δ) · ( 2^k · k · log n + k^k · log^k(1/ε) ) ). Further, the dependence on log Δ can be removed by adding one more pass to the algorithm, where Δ denotes the aspect ratio max_{x ∈ C, f ∈ F} d(x, f) / min_{x ∈ C, f ∈ F} d(x, f).

It is easy to see that a flow on the graph G' corresponds to a partitioning of the client set. Moreover, the cost of a flow is at most (1 + ε) times the cost of the corresponding partitioning. Therefore, we first solve the min-cost flow problem on the graph G'; this gives us an optimal flow on G'. Using it, we can obtain a partitioning of C, which takes one additional pass. For the complete details, see Section 4.2 of [GJK19]. We state the final result as follows.

Theorem 14.
Consider the partition problem for an r-gather k-service instance (C, L, k, d, ℓ). For a center-set F ⊆ L, there is a 3-pass streaming algorithm that outputs a partitioning C = {C_1, C_2, ..., C_k} of the client set such that the cost of the partitioning is at most (1 + ε) times the optimal partitioning cost. Moreover, the algorithm has space complexity f(k, ε) · log n and running time f(k, ε) · n^{O(1)}, where f(k, ε) = k^{O(k)} · log^k(1/ε).

E.2 r-capacity k-service

Partition Problem: Given a k-center-set F, find a clustering C = {C_1, ..., C_k} with minimum Ψ(F, C), such that each cluster C_i contains at most r_i points, for 1 ≤ i ≤ k.

The algorithm is similar to the partition algorithm for the r-gather k-service problem. We can exhaustively guess the capacities of the facilities and then reduce the problem to a flow problem on a complete bipartite graph. The only difference is that the edges connecting the source vertex to V_l = F have a demand of 0 and a capacity of r_i. Similarly, we can design a streaming partition algorithm for the problem. We state the final results as follows.

Theorem 15.
There is a partition algorithm for the r-capacity k-service problem, with a running time of k^{O(k)} · n^{O(1)}.

Theorem 16.
Consider the partition problem for an r-capacity k-service instance (C, L, k, d, ℓ). For a center-set F ⊆ L, there is a 3-pass streaming algorithm that outputs a partitioning C = {C_1, C_2, ..., C_k} of the client set such that the cost of the partitioning is at most (1 + ε) times the optimal partitioning cost. Moreover, the algorithm has space complexity f(k, ε) · log n and running time f(k, ε) · n^{O(1)}, where f(k, ε) = k^{O(k)} · log^k(1/ε).

E.3 Outlier k-service

Partition Problem: Given a center set F, find a set Z ⊆ C of size m and a clustering C = {C_1, ..., C_k} of C \ Z such that Ψ(F, C) is minimized.

The algorithm is simple. First, define the distance of a point x from the set F as min_{f ∈ F} d(f, x). Based on these distances, define the outlier set Z as the set of m points that are farthest from F. The clustering can then be obtained by simply running the Voronoi partitioning algorithm on C \ Z, with F as the center-set. Thus we obtain the following result.

Theorem 17. There is a partition algorithm for the outlier k-service problem, with a running time of O(nk).

The same algorithm can be converted to a streaming algorithm by maintaining the m points farthest from the center set. In the first pass, we identify the set Z; in the second pass, we output the required clustering by simply assigning each remaining point to its closest center. We state the final result as follows.

Theorem 18.
Consider the partition problem for an outlier k-service instance (C, L, k, d, ℓ). For a center-set F ⊆ L, there is a two-pass streaming algorithm that finds a set Z of size m and outputs an optimal partitioning C = {C_1, C_2, ..., C_k} of C \ Z. The algorithm has space complexity O(m + k) and running time O(nk).
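The E.3 partition algorithm fits in a few lines; the sketch below is a minimal illustration (the function name and the Euclidean stand-in for the general metric are our choices):

```python
import heapq

def outlier_partition(clients, centers, m, ell=1):
    """Remove the m clients farthest from the center set as outliers Z,
    then Voronoi-assign the rest; returns (Z, clusters, cost). O(nk) time."""
    def d(a, b):  # Euclidean distance as a stand-in for a general metric
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # distance of each client from the set F: minimum over the centers
    dist = [min(d(x, f) for f in centers) for x in clients]
    # Z = indices of the m farthest clients
    Z = set(heapq.nlargest(m, range(len(clients)), key=lambda i: dist[i]))
    # Voronoi partition of the remaining clients; cost uses d^ell
    clusters = [[] for _ in centers]
    cost = 0.0
    for i, x in enumerate(clients):
        if i in Z:
            continue
        j = min(range(len(centers)), key=lambda t: d(x, centers[t]))
        clusters[j].append(x)
        cost += d(x, centers[j]) ** ell
    return Z, clusters, cost
```

The two-pass streaming variant follows the same plan: the first pass maintains the m current farthest points to fix Z, and the second pass emits the closest-center assignment for C \ Z.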