Differentially Private Clustering via Maximum Coverage
Matthew Jones, Huy Lê Nguyễn, Thy Nguyen
Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115
{jones.m, hu.nguyen, nguyen.thy2}@northeastern.edu
August 31, 2020
Abstract
This paper studies the problem of clustering in metric spaces while preserving the privacy of individual data. Specifically, we examine differentially private variants of the k-medians and Euclidean k-means problems. We present polynomial algorithms with constant multiplicative error and lower additive error than the previous state of the art for each problem. Additionally, our algorithms use a clustering algorithm without differential privacy as a black box. This allows practitioners to control the trade-off between runtime and approximation factor by choosing a suitable clustering algorithm to use.
1 Introduction

In this work, we study the problem of clustering while preserving the privacy of individuals in the dataset. Clustering is an important routine in many machine learning tasks, such as image segmentation [24, 21], collaborative filtering [16, 22], and time series analysis [18]. Thus, improving the performance of private clustering has great potential for improving other private machine learning tasks. We consider the clustering problem with differential privacy, a privacy framework that requires the algorithm to be insensitive to small changes in the dataset [9]. Formally, suppose we are given a metric d, a set V of n points in the metric, and a (private) set of demand points D ⊆ V. Our objective is to choose a set F ⊂ V of size k to minimize the following objective:

$$\sum_{v \in D} \min_{f \in F} d(v, f)^p \qquad (1)$$

This is the k-medians problem when p = 1 and the k-means problem when p = 2.

The k-medians problem has been studied extensively in the literature. Previous work in [15] proved that k-medians is NP-hard. There is a long line of work on approximation algorithms for k-medians without differential privacy [6, 1, 5, 14, 13, 4]. The state of the art is a 2.675 + ε approximation by [4].
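As a concrete reference point, objective (1) is straightforward to evaluate for a fixed candidate solution. The sketch below is our own illustration (the function name is ours, not from the paper):

```python
import math

def clustering_cost(demand_points, centers, dist, p=1):
    """Objective (1): each demand point pays the p-th power of its distance
    to the nearest center (p=1 gives k-medians, p=2 gives k-means)."""
    return sum(min(dist(v, f) for f in centers) ** p for v in demand_points)

# Tiny example in the Euclidean plane.
D = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
F = [(0.5, 0.0), (10.0, 10.0)]
print(clustering_cost(D, F, math.dist))  # 0.5 + 0.5 + 0.0 = 1.0
```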
The private k-medians problem was studied in [12], which shows that any ε_p-differentially private algorithm for the k-medians problem must have cost at least $\mathrm{OPT} + \Omega(\Delta \cdot k \ln(n/k)/\epsilon_p)$, where OPT is the optimal cost. The paper also provides a polynomial time (ε_p, 0)-differentially private algorithm with cost $6\,\mathrm{OPT} + O(\Delta k^2 \log^2(n)/\epsilon_p)$. [23] states a variant of the algorithm for (ε_p, δ_p)-differential privacy with cost

$$O\left(\mathrm{OPT} + \frac{\Delta k^{1.5}}{\epsilon_p} \log\frac{n}{\beta} \sqrt{\log n \cdot \log(1/\delta_p)}\right)$$

where β is the failure probability.

Reference | Objective | Multiplicative Error | Additive Error
Gupta et al. [12] | k-medians | O(1) | O(k^2 log^2 n)
Stemmer et al. [23] | k-medians | O(1) | O(k^{1.5} log^{1.5} n)
Ours | k-medians | O(1) | O(k log n)
Feldman et al. [11] | Euclidean k-means | O(k log n) | O(k √d log(nd) · log*(|X| √d))
Balcan et al. [2] | Euclidean k-means | O(log^3 n) | O((k^2 + d) log^5 n)
Stemmer et al. [23] | Euclidean k-means | O(1) | O((k log(n log k))^{1.5} + d^{0.51} (log log n)^{O(1)} (k log k)^{1.01})
Ours | Euclidean k-means | O(1) | O(k log n + d^{0.51} (log log n)^{O(1)} (k log k)^{1.01})

Table 1: Comparison of our clustering algorithms with prior works, omitting dependence on ε_p, δ_p.

Our contribution is a new polynomial time (ε_p, δ_p)-differentially private k-medians algorithm with a constant multiplicative factor and improved additive error:

$$O\left(\mathrm{OPT} + \frac{k\Delta}{\epsilon_p} \log n \log\left(\frac{e}{\delta_p}\right)\right)$$

Note that our additive error is linear in k instead of k^2 or k^{1.5} as in the previous works [12, 23], and almost matches the lower bound of [12] up to lower order terms (their original lower bound is for ε_p-privacy, but it extends to (ε_p, δ_p)-privacy with inverse polynomial δ_p). By using the non-private algorithm of [4] as a subroutine, our multiplicative factor is 6.35 + ε, which is slightly worse than the 6 of [12].
The k-means problem has also been studied extensively, with a long line of work on privacy-preserving approximation algorithms [3, 19, 10, 11, 2, 23]. As an extension of our techniques, we also provide an (ε_p, δ_p)-differentially private algorithm for the Euclidean k-means problem with constant multiplicative error and better additive cost compared to previous work (see Table 1). Our algorithm has cost

$$O\left(\mathrm{OPT} + k \log n + d^{0.51} (\log\log n)^{O(1)} (k \log k)^{1.01}\right).$$

Again in this setting, our additive error is almost linear in k, as opposed to k^{1.5} in previous work [23].

In addition to the improved performance guarantees, our k-medians and Euclidean k-means algorithms use a non-private clustering algorithm as a black box. This allows practitioners to control the trade-off between the approximation guarantee and the runtime of the clustering algorithm, and even to use heuristics with good empirical performance.

Our techniques for k-medians and Euclidean k-means involve two main steps. In the first step, we iterate through distance thresholds from small to large and apply a differentially private Maximum Coverage algorithm to select centers that cover almost as many points as the optimal solution at those thresholds. In the second step, we create a new dataset based on the potential centers and apply a non-private clustering algorithm to this dataset. The new dataset is created by moving each demand point to its nearest potential center, and then applying the Laplace mechanism [8] to report the number of points at each potential center. This ensures that privacy is preserved for the new dataset and that no additional privacy cost is incurred when we apply the non-private clustering algorithm in the final step.

2 Related Work

Table 1 summarizes the performance of previous work in comparison with our algorithms. In the table, only the bound of [12] is for ε-differential privacy; the others are for δ-approximate ε-differential privacy. [11] studies the discrete d-dimensional space X^d.

For the k-medians problem, [12] gives an algorithm with a constant multiplicative approximation and additive error polynomial in the number of centers and logarithmic in the number of points. The algorithm uses the local search approach of [1]. Our algorithm, on the other hand, can be used with any non-private k-medians algorithm.

For the Euclidean k-means problem, [2] proposes the strategy of first identifying a set of potential centers with low k-means cost and then applying the techniques of [12] to find the final centers among the potential centers. However, their potential centers are only guaranteed to contain a solution with multiplicative approximation O(log^3 n). This result was improved by [23], which constructs a set of potential centers containing a solution with a constant multiplicative approximation.

Another approach to the k-means problem is via the 1-cluster problem: given a set of input points in R^d and t ≤ n, the goal is to find a center that covers at least t points with the smallest radius. The work of [11] shows that the k-means problem can be solved by running an algorithm for the 1-cluster problem multiple times to find several balls that cover most of the data points, with O(k log n) multiplicative error. [20] proposed an improved algorithm for the 1-cluster problem, resulting in a differentially private k-means algorithm with O(k) multiplicative error.
3 Preliminaries

3.1 Differential Privacy

Differential privacy is a privacy definition for computations run against sensitive input datasets. Its requirement, informally, is that the computation behaves similarly on any two input datasets that are nearly identical. Formally:
Definition 3.1. ([7]) A randomized algorithm M has δ_p-approximate ε_p-differential privacy, or (ε_p, δ_p)-differential privacy, if for any two input sets A and B whose symmetric difference is a single element, and for any set of outcomes S ⊆ Range(M),

Pr[M(A) ∈ S] ≤ exp(ε_p) · Pr[M(B) ∈ S] + δ_p.

If δ_p = 0, we say that M is ε_p-differentially private. An algorithm with (ε_p, 0)-differential privacy guarantees that every output of M(A) is (almost) equally likely to be observed on neighboring datasets, whereas in (ε_p, δ_p)-differential privacy, δ_p dictates the probability that ε_p-privacy fails to hold [7]. In this way, (ε_p, δ_p)-differential privacy is a relaxation of ε_p-differential privacy. We use an error parameter ε for utility, and we use ε_p and δ_p to denote the parameters for differential privacy (or ε_s and δ_s for Algorithm 1, to distinguish the privacy parameters of different algorithms).

One of the most basic constructions for differentially private algorithms is the Laplace mechanism.

Definition 3.2. (ℓ₁ sensitivity) A function f : N^{|X|} → R^k has ℓ₁ sensitivity Δf if ‖f(A) − f(A′)‖₁ ≤ Δf for all A, A′ whose symmetric difference is a single element.

Theorem 3.3. (Laplace mechanism [8]) Let the function f : N^{|X|} → R^k have ℓ₁ sensitivity Δf and let ε_p > 0. The mechanism M that on input A outputs f(A) + Lap(Δf/ε_p) is (ε_p, 0)-differentially private, where Lap(Δf/ε_p) denotes a random variable following the Laplace distribution with scale parameter b = Δf/ε_p.

Another tool for constructing differentially private algorithms that we use in this work is the exponential mechanism. This construction is parameterized by a query function q(A, r) mapping a pair of an input dataset A and a candidate result r to a real value. Given q and a privacy parameter ε_p, the mechanism selects an output in favor of high score values:

$$\Pr[\mathcal{E}_q^{\epsilon}(A) = r] \propto \exp(\epsilon_p \, q(A, r)) \qquad (2)$$
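As an illustration of these two building blocks, here is a minimal Python sketch of the Laplace mechanism of Theorem 3.3 and the selection rule (2); the helper names are ours, not the paper's:

```python
import math
import random

def laplace_mechanism(value, sensitivity, eps_p):
    """Theorem 3.3: release value + Lap(sensitivity / eps_p)."""
    scale = sensitivity / eps_p
    # The difference of two i.i.d. Exp(1) variables is Laplace(0, 1).
    return value + scale * (random.expovariate(1.0) - random.expovariate(1.0))

def exponential_mechanism(candidates, score, eps_p):
    """Rule (2): sample r with probability proportional to exp(eps_p * q(A, r)).
    Scores are shifted by their maximum purely for numerical stability."""
    best = max(score(r) for r in candidates)
    weights = [math.exp(eps_p * (score(r) - best)) for r in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```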
Theorem 3.4. ([17]) The exponential mechanism, when used to select an output r ∈ R, gives 2ε_pΔ-differential privacy, where Δ is the sensitivity of q. Letting R_OPT be the subset of R achieving q(A, r) = max_r q(A, r), it ensures that

$$\Pr\left[q\left(A, \mathcal{E}_q^{\epsilon}(A)\right) < \max_r q(A, r) - \frac{\ln(|R|/|R_{\mathrm{OPT}}|)}{\epsilon_p} - \frac{t}{\epsilon_p}\right] \le \exp(-t).$$
3.2 Maximum Coverage

Our differentially private k-medians algorithm solves the Maximum Coverage problem as a subproblem. The Maximum Coverage problem is defined as follows: given a universe U of items, a family S of subsets of U, and a parameter z, the goal is to select z sets in S that cover the most elements of U. Formally, we seek

$$\arg\max_{\mathcal{C} \subseteq \mathcal{S},\, |\mathcal{C}| = z} \Big|\bigcup_{c \in \mathcal{C}} c\Big|.$$

Algorithm 1: Maximum Coverage
Input: Set system (U, S), a private set R ⊆ U to cover, ε_s, δ_s, m
R_1 ← R, S_1 ← S, ε′ ← ε_s / (2 ln(e/δ_s))
for i = 1, 2, ..., m do
    Pick a set S from S_i with probability proportional to exp(ε′ |S ∩ R_i|)
    Output set S
    R_{i+1} ← R_i \ S, S_{i+1} ← S_i − {S}
end

Our approach to solving private Maximum Coverage is based on the Unweighted Set Cover algorithm in [12]. To preserve privacy, Algorithm 1 chooses sets using the exponential mechanism, with probability related to the improvement in coverage gained by choosing the set.

Assume that there exists a selection of z sets that covers U. A classic fact for the maximum coverage problem is that if we build the family C by always selecting the set in S \ C that covers the largest number of uncovered elements of U, then after z iterations, |∪_{c∈C} c| ≥ (1 − 1/e)|U|. Here we show an observation that will be useful for our algorithm later. The proof is in the appendix.
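A direct rendering of Algorithm 1 in Python (a sketch under our own naming; the input sets are Python sets, and the exponential weights are shifted by their maximum for numerical stability, which does not change the sampling distribution):

```python
import math
import random

def private_max_coverage(sets, private_R, eps_s, delta_s, m):
    """Algorithm 1: m rounds of the exponential mechanism, scoring each
    remaining set by how many still-uncovered private points it covers."""
    eps_prime = eps_s / (2 * math.log(math.e / delta_s))
    remaining = set(private_R)              # R_i: uncovered private points
    available = list(sets)                  # S_i: sets not yet chosen
    chosen = []
    for _ in range(m):
        scores = [len(S & remaining) for S in available]
        top = max(scores)
        weights = [math.exp(eps_prime * (s - top)) for s in scores]
        idx = random.choices(range(len(available)), weights=weights, k=1)[0]
        S = available.pop(idx)              # S_{i+1} = S_i - {S}
        chosen.append(S)
        remaining -= S                      # R_{i+1} = R_i \ S
    return chosen
```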
Lemma 3.5. For ε > 0, if we always select a set that covers at least half as many uncovered elements as the set that covers the most uncovered elements, then after 2z ln(1/ε) iterations, |∪_{c∈C} c| ≥ (1 − ε)|U|.

4 The k-medians Algorithm

Given a set of points V, a metric d : V × V → R, a private set of demand points D ⊆ V, and a value k < |V|, the objective of the k-medians problem is to select a set of points (centers) F ⊂ V, |F| = k, to minimize cost(F) = Σ_{v∈D} d(v, F), where d(v, F) = min_{f∈F} d(v, f). Let Δ = max_{u,v∈V} d(u, v) be the diameter of the metric space. We use ε as the approximation parameter for the maximum coverage problem in Lemma 3.5, and ε_p and δ_p as privacy parameters. Let B_t(v) be the ball of radius t centered at v, i.e., the set of all points in the metric space within distance t of v.

Our approach is based on the Maximum Coverage problem. One way to compute the clustering cost is, for every distance threshold t, to count the number of points within distance t of the centers and integrate the counts from 0 to the maximum distance. Thus, if for every threshold t the number of points farther than t from our solution's centers is not much more than the number of points farther than t from the optimal centers, then our cost is not much larger than the optimal cost. Our algorithm therefore goes through distance thresholds from small to large and tries to "cover" as many points as possible using fresh centers each time. For each threshold, we aim to cover 1 − ε times the number of points the optimal solution can cover. The result is that our clustering cost is not much larger than the optimal cost, albeit using more centers. Since we use exponentially growing thresholds, we only use O(log n) times more centers, which results in a small error due to privacy noise. The full algorithm is described in Algorithm 2.

The algorithm begins with a discretization of the distance thresholds, whose goal is as in the discussion above. We apply Algorithm 1 to select a set of points that cover a large set of demand points across the different distance thresholds. Note that the objective cost of any set of centers under the discretization scheme is not too far from its actual cost, as we will show in Lemma 4.3. Thus, the set of centers that we find across the different thresholds t should also have cost similar to OPT.

Our final step is to obtain the final set of centers from the potential centers. To preserve privacy in this step, we create a new dataset similar to the original one. In this new dataset, every demand point is shifted to the closest center from the previous step, and we apply the Laplace mechanism to the number of demand points assigned to each center to preserve privacy.
Algorithm 2: The k-medians algorithm
Input: a set of points V, a private set of demand points D ⊆ V, a metric d, parameters k, ε, ε_p, δ_p
1:  V′ ← D, C ← ∅, r ← ⌈log_{1+ε}(n) + 1⌉
2:  for i from 1 to r do
3:      Set t_i = (1 + ε)^{i−1} Δ/n
4:      Run Algorithm 1 on the set system U = V, S = {B_{t_i}(v) ∩ V′ : v ∈ V} and private set R = V′, with ε_s = ε_p/2, δ_s = δ_p, for m = 2k ln(1/ε) iterations to get C_i
5:      V_i ← ∪_{v∈C_i} B_{t_i}(v) ∩ V′
6:      V′ ← V′ \ V_i
7:      C ← C ∪ C_i
8:  end
9:  Assign each point in D to its closest point c ∈ C
10: Let n_c be the total number of points assigned to c, for each c ∈ C
11: For each c ∈ C, set n′_c = n_c + Y_c where Y_c ∼ Lap(2/ε_p)
12: Run a k-medians algorithm to select k centers from V, with demand points at each c ∈ C with multiplicity n′_c

At this point, we can use a k-medians algorithm on this dataset to output the final k centers. In this new problem, although the objective changes because of the shifting and the Laplace mechanism, the points that can be selected as centers are the same as before. Thus, the cost of the centers returned in the final step is at most the cost of this new objective plus the total shifting distance. In the following sections, we first analyze the privacy and then the utility of this algorithm.

We first show that this algorithm is (ε_p, δ_p) differentially private. To show this, we first show that the entirety of the for loop is (ε_p/2, δ_p) differentially private, and then take advantage of composition and apply Theorem 3.3 at line 11 to obtain the final result. Note that the analysis of the loop very closely follows the proof of privacy of Unweighted Set Cover in [12]; their algorithm selects sets in a particular order to form a cover, while our algorithm selects candidate centers with increasing distance thresholds, where a center is assumed to cover all demand points within its distance threshold. The significant difference between the two proofs is that our algorithm may select the same center twice with different distance thresholds, while a set is never chosen twice in set cover in [12]. This re-selection of the same center does not affect the differential privacy of the algorithm, because the privacy analysis hinges on which demand points have been covered, not which centers have been selected. As a result, we save an additional log n factor on privacy, which removes a log n term from the additive error in Lemma 4.4 and carries through to the additive error in the utility. We include the proof in the appendix and omit it here, due to its close similarity to [12].
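The following sketch strings the whole pipeline together in Python. It is our own compressed rendering, not the paper's reference code: the exponential-mechanism selection is inlined (so, unlike the pseudocode, an already-chosen candidate is not removed from the pool, which only wastes selections), and `blackbox(points, weights, k)` stands for any non-private weighted k-medians solver the practitioner prefers:

```python
import math
import random

def private_k_medians(V, D, dist, k, eps, eps_p, delta_p, blackbox):
    """Sketch of Algorithm 2 with a user-supplied non-private final solver."""
    n = len(V)
    Delta = max(dist(u, v) for u in V for v in V)          # metric diameter
    r = math.ceil(math.log(n) / math.log(1 + eps) + 1)     # number of thresholds
    m = math.ceil(2 * k * math.log(1 / eps))               # selections per threshold
    eps_prime = (eps_p / 2) / (2 * math.log(math.e / delta_p))
    uncovered, C = set(D), []
    for i in range(1, r + 1):
        t = (1 + eps) ** (i - 1) * Delta / n
        for _ in range(m):
            # Exponential mechanism over candidate centers, scored by the
            # number of still-uncovered demand points in the radius-t ball.
            scores = [sum(1 for u in uncovered if dist(u, v) <= t) for v in V]
            top = max(scores)
            w = [math.exp(eps_prime * (s - top)) for s in scores]
            v = random.choices(V, weights=w, k=1)[0]
            C.append(v)
            uncovered = {u for u in uncovered if dist(u, v) > t}
    # Snap each demand point to its nearest selected center, then privatize counts.
    counts = {c: 0 for c in C}
    for u in D:
        counts[min(counts, key=lambda c: dist(u, c))] += 1
    lap = lambda scale: scale * (random.expovariate(1.0) - random.expovariate(1.0))
    weighted = {c: counts[c] + lap(2 / eps_p) for c in counts}
    return blackbox(list(weighted), list(weighted.values()), k)
```

Swapping in a different `blackbox` trades approximation factor against runtime, exactly as discussed in the introduction.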
Lemma 4.1. The for loop in Algorithm 2 preserves (ε_p/2, δ_p) differential privacy.

The function affected by the Laplace mechanism in line 11 returns the vector of counts n_c. For sets A, A′ as in Theorem 3.3, f(A) and f(A′) differ by exactly one in one coordinate of this vector, and therefore ‖f(A) − f(A′)‖₁ = 1, so the function has ℓ₁ sensitivity 1. Thus, line 11 is (ε_p/2, 0)-differentially private.

Lemma 4.2.
Algorithm 2 is (ε_p, δ_p) differentially private.

We define t_0 = 0, t_1 = Δ/n, t_2 = Δ(1 + ε)/n, ..., t_r = Δ as shorthand for the thresholds. Also, let o_i be the number of points at distance in the range [t_{i−1}, t_i) from their center in the optimal solution (which we denote OPT), and let a_i be the number of points at distance in the range [t_{i−1}, t_i) from their closest point in C after the for loop in Algorithm 2. To bound the performance of our solution, we first show that discretizing the distance thresholds at the t_i's, instead of integrating from 0 to Δ, introduces negligible error in the cost of the solution (see Lemma 4.3). Next, for each distance threshold, Lemma 4.4 uses the approximation guarantee of maximum coverage to show that we efficiently cover demand points using not many more centers than OPT. Crucially, Lemma 4.5 shows that by covering almost as well as OPT at every distance threshold, our solution has cost not much more than that of OPT. The following lemmas bound OPT by the threshold approximation and then bound our solution by that approximation, respectively.
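For intuition, the discretized quantity Σ_i o_i t_i appearing in Lemma 4.3 below simply rounds every assignment distance up to the next threshold. A small sketch, again our own illustration:

```python
import math

def discretized_cost(D, centers, dist, eps, Delta):
    """Sum of o_i * t_i: each point's distance is rounded up to a threshold
    t_i = (1 + eps)^(i-1) * Delta / n covering it (distances in [t_{i-1}, t_i) pay t_i)."""
    n = len(D)
    total = 0.0
    for u in D:
        d = min(dist(u, f) for f in centers)
        if d <= Delta / n:
            i = 1                                    # the floor threshold t_1 = Delta / n
        else:
            i = math.ceil(math.log(d * n / Delta, 1 + eps)) + 1
        total += (1 + eps) ** (i - 1) * Delta / n    # t_i >= d
    return total
```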
Lemma 4.3. $\sum_{i=1}^{r} o_i t_i \le (1 + \epsilon)\,\mathrm{OPT} + \Delta$
Proof.
For any u ∈ D and set of centers F, define d′(u, F) as the minimum distance threshold t_i that is larger than d(u, F). If d(u, OPT) > Δ/n, then d′(u, OPT) ≤ (1 + ε) d(u, OPT). If d(u, OPT) ≤ Δ/n, then d′(u, OPT) = Δ/n ≤ d(u, OPT) + Δ/n. Summing over all u ∈ D yields the bound.

For each distance threshold t_i, the following lemma shows that the algorithm covers almost as many points as OPT. Its proof is left to the appendix. The result follows mostly from Lemma 3.5 and the error of the exponential mechanism in Algorithm 1.
Lemma 4.4.
Consider iteration i of the for loop and let M_i be the maximum coverage of k centers with radius t_i over points in V′. With high probability, at line 5,

$$|V_i| \ge (1-\epsilon) M_i - \frac{k \ln n \ln(e/\delta_p)}{\epsilon_p}.$$

The next lemma relates the cost of our solution to that of OPT, given that we cover almost as well as OPT at every distance threshold.
Lemma 4.5. $$\sum_{i=1}^{r} a_i t_i \le \frac{1-\epsilon}{1-\epsilon-\epsilon^2} \sum_{i=1}^{r} o_i t_i + \frac{\Delta k \ln n \ln(e/\delta_p)}{\epsilon_p (1-\epsilon-\epsilon^2)}$$

Proof.
Let O_i = Σ_{j=i}^r o_j, A_i = Σ_{j=i}^r a_j, and E = k ln n ln(e/δ_p)/ε_p. Given a threshold t_i, we know that the centers in OPT cover n − O_{i+1} points within distance t_i. At threshold t_i, Algorithm 2 has already covered n − A_i points, so we know there is a solution covering an additional (n − O_{i+1}) − (n − A_i) = A_i − O_{i+1} points. By the guarantee of the greedy set cover algorithm in Lemma 4.4, we cover a_i ≥ (1 − ε)(A_i − O_{i+1}) − E new points at the next iteration. Substituting A_i = a_i + A_{i+1}, we have

$$a_i \ge \frac{1-\epsilon}{\epsilon}(A_{i+1} - O_{i+1}) - \frac{E}{\epsilon}.$$

Notice that

$$\sum_{i=1}^{r} a_i t_i = \sum_{i=1}^{r} A_i (t_i - t_{i-1}) \le \sum_{i=1}^{r} \left(\frac{\epsilon a_{i-1}}{1-\epsilon} + O_i + \frac{E}{1-\epsilon}\right)(t_i - t_{i-1}) = \sum_{i=1}^{r} \frac{\epsilon a_{i-1}}{1-\epsilon}(t_i - t_{i-1}) + \sum_{i=1}^{r} o_i t_i + \frac{\Delta E}{1-\epsilon}.$$

The equalities are because of the telescoping sums. Also notice that

$$\sum_{i=1}^{r} \frac{\epsilon a_{i-1}}{1-\epsilon}(t_i - t_{i-1}) = \sum_{i=1}^{r-1} \frac{\epsilon a_i}{1-\epsilon}(t_{i+1} - t_i) \le \sum_{i=1}^{r-1} \frac{\epsilon^2}{1-\epsilon}\, a_i t_i.$$

The last inequality is because t_{i+1} = (1 + ε) t_i for all 1 ≤ i ≤ r − 1, by definition. We are also able to drop the term i = 0 from the last two sums because a_0 = 0. Thus,

$$\sum_{i=1}^{r} a_i t_i - \sum_{i=1}^{r-1} \frac{\epsilon^2}{1-\epsilon}\, a_i t_i = \frac{1-\epsilon-\epsilon^2}{1-\epsilon}\sum_{i=1}^{r-1} a_i t_i + a_r t_r \le \sum_{i=1}^{r} o_i t_i + \frac{\Delta E}{1-\epsilon}$$

$$\implies \sum_{i=1}^{r} a_i t_i \le \frac{1-\epsilon}{1-\epsilon-\epsilon^2}\sum_{i=1}^{r} o_i t_i + \frac{\Delta E}{1-\epsilon-\epsilon^2} \le \frac{1-\epsilon}{1-\epsilon-\epsilon^2}\left((1+\epsilon)\,\mathrm{OPT} + \Delta\right) + \frac{\Delta E}{1-\epsilon-\epsilon^2},$$

which gives us a bound on the cost of snapping points in D to points in C.

Lemma 4.6.
Consider the k-medians problem in the last line of Algorithm 2, where demand points in D are shifted to points in C and Laplace noise is applied. With high probability, the optimal objective cost of this new k-medians problem is at most

$$\mathrm{OPT} + \sum_i a_i t_i + \frac{4\Delta k \ln(1/\epsilon)}{\epsilon_p}\left(\frac{\ln n}{\ln(1+\epsilon)} + 2\right)$$

where OPT is the cost of the original k-medians problem.
We leave the details of this proof to the appendix. Briefly, we can use OPT of the original problem as a candidate solution to the new problem. In this case, the original points incur the cost of OPT plus the cost of snapping to the centers in C, which is Σ_i a_i t_i. The last term is, with high probability, an upper bound on the error introduced by the Laplace mechanism.

Using an approximation algorithm for the k-medians problem with approximation factor M in the last step of Algorithm 2, the entire cost gains a multiplicative factor M. Therefore, we can summarize the utility of Algorithm 2 in the following lemma and an even simpler corollary:

Lemma 4.7.
With high probability, Algorithm 2 preserves (ε_p, δ_p) differential privacy and solves the k-medians problem with cost

$$O(M(1+\epsilon))\,\mathrm{OPT} + O\left(\frac{M k \Delta}{\epsilon_p} \ln n \left(\ln\left(\frac{e}{\delta_p}\right) + \frac{\ln(1/\epsilon)}{\ln(1+\epsilon)}\right)\right)$$

where the black-box k-medians algorithm used in the last step of Algorithm 2 has approximation factor M, and ε is a small positive constant.

The full proof of this lemma can be found in the appendix. In short, we merge the results of Lemmas 4.3, 4.5, and 4.6, and bound the final cost using the private problem's cost, the snapping cost, and the triangle inequality.
Corollary 4.8.
By using a constant-approximation non-private algorithm for k-medians, there is an (ε_p, δ_p)-differentially private algorithm for the k-medians problem that, with high probability, outputs a solution of cost

$$O\left(\mathrm{OPT} + \frac{k\Delta}{\epsilon_p}\ln(n)\left(\ln\left(\frac{e}{\delta_p}\right) + \frac{\ln(1/\epsilon)}{\ln(1+\epsilon)}\right)\right).$$

Note that our algorithm allows any k-medians algorithm to be used in the last step. One can choose a preferred trade-off between runtime and performance to select a suitable algorithm. This is in contrast to the approach in [12], where the algorithm builds on the k-medians algorithm of [1].

5 Euclidean k-means

In the Euclidean k-means problem, instead of having a discrete set of points, V is defined to be all of R^d. We wish to select a set of points (centers) F ⊂ R^d, |F| = k, to minimize cost(F) = Σ_{v∈D} d(v, F)². In this section, we apply our result to improve the additive error of the approach in [23].

The strategy in [23] is to first identify a polynomial-size set of candidate centers that contains a subset of k candidate centers with low k-means cost. Then, the algorithm uses a private discrete k-means algorithm to select the final k centers, with low cost, from the set of candidate centers. More concretely, the algorithm in [23] is guaranteed to output an (ε_p, δ_p)-private set of candidate centers Y of size at most ε_p n log(k/β) such that, with probability at least 1 − β, there exists a subset of k centers with constant multiplicative error and additive error of

$$O\left(T^{\frac{1}{1-a-b}} \cdot w^{\frac{1}{1-a-b}} \cdot k^{\frac{1}{1-a-b}}\right)\Delta^2,$$

where a, b are small constant parameters of the Locality Sensitive Hashing scheme used in [23], T = Θ(log log n), and

$$w = O\left(\frac{\sqrt{d}}{\epsilon_p} \cdot \log\log n \cdot \log\left(\frac{k}{\beta}\right)\sqrt{\log\frac{\log\log n}{\delta_p}}\right).$$

Note that a, b can be chosen arbitrarily small at the cost of making the multiplicative approximation factor a larger constant. The work of [23] focuses on the regime where a, b are small and 1/(1 − a − b) ≤ 1.01.
The resulting additive error for identifying candidate centers is Õ_{ε_p,δ_p}(k^{1.01} d^{0.51}).

The performance bottleneck of the Euclidean k-means algorithm is the procedure that selects the final k centers from a candidate set. We can apply our algorithm to the potential centers returned by [23] to improve the algorithm's performance. Note that our algorithm can be applied to the k-means objective by passing the appropriate distance function. Although the square of the Euclidean distance is not a metric, we can still apply the same algorithm and analysis to get the bound for the k-means objective. The only difference is in the last step: instead of running a k-medians algorithm on the returned centers, we run a k-means algorithm to get the final centers.

Rather than replicate the entire proof, we will only review the parts that are affected by the change in the distance function. Furthermore, we will extend the proof to all distance functions d^p for any natural number p.

Notably, the privacy analysis is independent of the distance function and is therefore unaffected. In fact, the only steps in the proofs of section 4 that involve the distance function are Lemma 4.3 and Lemmas 4.6 and 4.7. For the distance function d^p, Lemma 4.3 is amended as follows:

Lemma 5.1. $\sum_{i=1}^{r} o_i t_i^p \le (1 + \epsilon)^p\,\mathrm{OPT} + \Delta^p$
Proof.
In the case where (d(u, OPT))^p > (Δ/n)^p, we now have (d′(u, OPT))^p ≤ (1 + ε)^p (d(u, OPT))^p. Otherwise, the proof is functionally identical to that of Lemma 4.3.

For Lemmas 4.6 and 4.7, we directly address and resolve the main issue this distance function faces, which is that the triangle inequality does not hold when p > 1. However, we can use the following lemma, which we prove in the appendix:
Lemma 5.2.
In any metric space and for p ≥ 1, (d(a, b))^p ≤ 2^{p−1}((d(a, c))^p + (d(b, c))^p).

Using this lemma, we see that wherever the triangle inequality would be applied, we gain an additional factor of 2^{p−1}. This affects the leading constants of the approximation factor and the additive error, but does not affect the asymptotic cost. When p = 2, this is the distance function used in the k-means problem. Therefore, the only changes to Algorithm 2 necessary to make it a functional k-means algorithm are to use the squared Euclidean distance as the input distance function and to run a black-box k-means algorithm in the last step, rather than a black-box k-medians algorithm.
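Concretely, reusing the `private_k_medians` sketch from section 4, the k-means variant only swaps the distance function and the final black box (again our own illustration; `kmeans_blackbox` is any non-private weighted k-means solver, and Y is the candidate set of [23]):

```python
import math

# Squared Euclidean distance plays the role of d^p with p = 2.
def sq_euclidean(u, v):
    return math.dist(u, v) ** 2

def private_k_means(Y, D, k, eps, eps_p, delta_p, kmeans_blackbox):
    """Run the section-4 sketch over the candidate centers Y of [23],
    with squared distances and a k-means black box in the last step."""
    return private_k_medians(Y, D, sq_euclidean, k, eps, eps_p, delta_p,
                             blackbox=kmeans_blackbox)
```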
Then, if we are trying to minimize the objective with distance function d^p and we use a black-box algorithm for this objective in the last step of Algorithm 2, the proofs of section 4, using Lemma 5.1 instead of Lemma 4.3, yield the following lemma and corollary:

Lemma 5.3. Given a problem equivalent to k-means but with distance function d^p, and a discrete algorithm for that problem with approximation factor M, there exists an (ε_p, δ_p) differentially-private algorithm for that problem which, with high probability, has objective cost at most

$$O\left(2^{p-1} M (1+\epsilon)^p\right)\mathrm{OPT} + O\left(\frac{M k \Delta^p}{\epsilon_p}\left(\ln n + 2^{p-1}\ln|Y| \ln\left(\frac{e}{\delta_p}\right)\right)\right)$$

Corollary 5.4. There is an (ε_p, δ_p)-differentially private algorithm for the Euclidean k-means problem that, with probability at least 1 − β, returns a solution with a constant multiplicative factor and an additive error of

$$O\left(\Delta^2\left(T^{\frac{1}{1-a-b}} \cdot w^{\frac{1}{1-a-b}} \cdot k^{\frac{1}{1-a-b}}\right) + \frac{\Delta^2 k}{\epsilon_p}\left(\ln n + \ln\left(\epsilon_p n \log\left(\frac{k}{\beta}\right)\right)\ln\left(\frac{e}{\delta_p}\right)\right)\right)$$

Note that our algorithm results in a better additive term compared to applying [12] to the potential centers. Specifically, the second additive term is almost linear in k instead of k^{1.5}, making the entire additive error almost linear in k.

Broader Impact

Clustering has many applications in machine learning, such as image segmentation [24, 21], collaborative filtering [16, 22], and time series analysis [18]. Privacy is a major concern when input data contains sensitive information, and differential privacy [8] has become a rigorous framework for ensuring privacy in algorithms. Differentially private algorithms for clustering problems thus give each individual in the input a robust privacy guarantee.

Our improved utility guarantees will perhaps encourage the adoption of privacy-preserving algorithms as replacements for their non-private counterparts. Furthermore, our approach allows other clustering algorithms to be used as a black box. We believe this further improves the applicability of private clustering algorithms, making it easier to incorporate privacy guarantees into existing clustering frameworks. The limitations of this work are that the privacy guarantee requires certain assumptions on the input data, such as the data being bounded, and that the utility guarantee has an additive error that is only meaningful when the dataset has a large enough number of participants. When applying the algorithm, the curator has to ensure that these assumptions hold to protect the privacy of the participants.
References

[1] Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing 33, 3 (2004), 544–562.
[2] Balcan, M.-F., Dick, T., Liang, Y., Mou, W., and Zhang, H. Differentially private clustering in high-dimensional Euclidean spaces. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 322–331.
[3] Blum, A., Dwork, C., McSherry, F., and Nissim, K. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2005), pp. 128–138.
[4] Byrka, J., Pensyl, T. W., Rybicki, B., Srinivasan, A., and Trinh, K. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Transactions on Algorithms 13, 2 (2017), 23:1–23:31.
[5] Charikar, M., Guha, S., Tardos, É., and Shmoys, D. B. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences 65, 1 (2002), 129–149.
[6] Chrobak, M., Kenyon, C., and Young, N. The reverse greedy algorithm for the metric k-median problem. Information Processing Letters 97, 2 (2006), 68–72.
[7] Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (2006), Springer, pp. 486–503.
[8] Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography (2006), Lecture Notes in Computer Science, pp. 265–284.
[9] Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407.
[10] Feldman, D., Fiat, A., Kaplan, H., and Nissim, K. Private coresets. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing (2009), pp. 361–370.
[11] Feldman, D., Xiang, C., Zhu, R., and Rus, D. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In International Conference on Information Processing in Sensor Networks (IPSN) (2017), IEEE, pp. 3–16.
[12] Gupta, A., Ligett, K., McSherry, F., Roth, A., and Talwar, K. Differentially private combinatorial optimization. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (2010), SIAM, pp. 1106–1125.
[13] Jain, K., Mahdian, M., Markakis, E., Saberi, A., and Vazirani, V. V. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM 50, 6 (2003), 795–824.
[14] Jain, K., and Vazirani, V. V. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM 48, 2 (2001), 274–296.
[15] Kariv, O., and Hakimi, S. L. An algorithmic approach to network location problems. I: The p-centers. SIAM Journal on Applied Mathematics 37, 3 (1979), 513–538.
[16] McSherry, F., and Mironov, I. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009), pp. 627–636.
[17] McSherry, F., and Talwar, K. Mechanism design via differential privacy. In Annual IEEE Symposium on Foundations of Computer Science (FOCS) (2007), IEEE.
[18] Zakaria, J., Mueen, A., and Keogh, E. Clustering time series using unsupervised-shapelets. In International Conference on Data Mining (ICDM) (2012).
[19] Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (2007), pp. 75–84.
[20] Nissim, K., and Stemmer, U. Clustering algorithms for the centralized and local models. In Algorithmic Learning Theory (2018), pp. 619–653.
[21] Patel, V. M., Van Nguyen, H., and Vidal, R. Latent space sparse subspace clustering. In Proceedings of the IEEE International Conference on Computer Vision (2013), pp. 225–232.
[22] Schafer, J. B., Frankowski, D., Herlocker, J., and Sen, S. Collaborative filtering recommender systems. In The Adaptive Web. Springer, 2007, pp. 291–324.
[23] Stemmer, U., and Kaplan, H. Differentially private k-means with constant multiplicative error. In Advances in Neural Information Processing Systems (2018), pp. 5431–5441.
[24] Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 3861–3870.
A Missing Proofs
Theorem A.1. If Z_1, Z_2, ..., Z_n are i.i.d. random variables that follow the exponential distribution with parameter λ > 0, then

$$\Pr\left[\sum_{i=1}^{n} Z_i \ge \frac{2n}{\lambda}\right] \le \left(\frac{2}{e}\right)^n$$

Proof. Recall that the moment generating function of Z_i is E[exp(tZ_i)] = λ/(λ − t) for t < λ, and that E[exp(t Σ_{i=1}^n Z_i)] = (λ/(λ − t))^n by the properties of moment generating functions. We have

$$\Pr\left[\sum_{i=1}^{n} Z_i \ge 2n/\lambda\right] = \Pr\left[\exp\left(t \sum_{i=1}^{n} Z_i\right) \ge \exp(2tn/\lambda)\right] \le \mathrm{E}\left[\exp\left(t\sum_{i=1}^{n} Z_i\right)\right] \Big/ \exp(2tn/\lambda) = \left(\frac{\lambda}{(\lambda-t)\exp(2t/\lambda)}\right)^n.$$

The inequality follows from Markov's inequality. Setting t = λ/2 yields the theorem.
Lemma A.2. (Lemma 3.5)
For ε > 0, if we always select a set that covers at least half as many uncovered elements as the set that covers the most uncovered elements, then after 2z ln(1/ε) iterations, |∪_{c∈C} c| ≥ (1 − ε)|U|.

Proof. Define U_j to be the number of uncovered elements after j iterations and A_j to be the number of newly covered elements at iteration j. We will show by induction that U_{j+1} ≤ (1 − 1/(2z))^{j+1}|U|. In each iteration, there exists some set that can cover at least a 1/z fraction of the remaining uncovered elements; otherwise, it would not be possible for the optimal solution to cover all elements in just z steps. Thus, because we always select a set that covers at least half as many elements as the set that covers the most uncovered elements, A_j ≥ U_{j−1}/(2z). This proves the base case. For the inductive step, assume that U_j ≤ (1 − 1/(2z))^j |U|. Then

$$U_{j+1} \le U_j - \frac{U_j}{2z} \le \left(1 - \frac{1}{2z}\right)^j |U| \left(1 - \frac{1}{2z}\right) = \left(1 - \frac{1}{2z}\right)^{j+1} |U|,$$

where the last inequality uses the induction hypothesis. Now, note that after z · i iterations,

$$U_{z \cdot i} \le \left(1 - \frac{1}{2z}\right)^{z \cdot i} |U| \le \left(\frac{1}{e}\right)^{i/2} |U|.$$

For i ≥ 2 ln(1/ε), the right-hand side is at most ε|U|, as desired.

Lemma A.3. (Lemma 4.1)
The for loop in Algorithm 2 preserves (ε_p/2, δ_p) differential privacy.

Proof. Let A and B be two sets of demand points from the same point set V, such that the symmetric difference of A and B is a single element I ∈ V. For the sake of this proof, we care about the order in which points from V are selected across the iterations of Algorithm 1. The same point v ∈ V may be selected in multiple iterations of Algorithm 1 by being chosen at different thresholds; in order to treat these as separate instances, we record the order of selected points together with the current threshold, as (point, threshold) pairs. We denote the order of (point, threshold) pairs selected on input A as π^A, and on input B as π^B. Note that on any set of demand points, the total number of selections is the same in each iteration of Algorithm 1, so when checking equality of two selection orders π^A and π^B from the same point set V, the thresholds in the pairs always match, and equality depends only on the points in the pairs.

First, we fix some order π and bound the ratio between Pr[π^A = π] and Pr[π^B = π]. We denote by h the total number of selections over all calls to Algorithm 1 in the for loop. Note that this value h is independent of D given V, and is therefore the same for both A and B. We also write s_{i,j}(D) to denote the size of S_j in Algorithm 1 (with V′ updated after each iteration of the for loop) after the first i − 1 selections on demand set D. Then, the probability that we make the same choice as π at selection i, given that we made the same choices as π in the first i − 1 selections, is

$$\frac{\exp(\epsilon' \cdot s_{i,\pi_i}(D))}{\sum_j \exp(\epsilon' \cdot s_{i,j}(D))}$$

where the numerator is the relative probability that Algorithm 1 picks π_i, and the denominator is the sum of the relative probabilities over all possible choices j. Therefore,

$$\frac{\Pr[M(A) = \pi]}{\Pr[M(B) = \pi]} = \prod_{i=1}^{h}\left(\frac{\exp(\epsilon' s_{i,\pi_i}(A)) / \sum_j \exp(\epsilon' s_{i,j}(A))}{\exp(\epsilon' s_{i,\pi_i}(B)) / \sum_j \exp(\epsilon' s_{i,j}(B))}\right) = \frac{\exp(\epsilon' s_{t,\pi_t}(A))}{\exp(\epsilon' s_{t,\pi_t}(B))} \cdot \prod_{i=1}^{t}\left(\frac{\sum_j \exp(\epsilon' s_{i,j}(B))}{\sum_j \exp(\epsilon' s_{i,j}(A))}\right)$$

where t is the index such that S_{π_t} is the first set in π containing I. After t choices, the remaining demand points of A and B are identical since I is covered, so all of the following probabilities are equal and those terms cancel to a multiplicative factor of 1. Also, since the items in π before π_t do not contain I, the relative probabilities of choosing those items are the same, so the numerators cancel for those indices.

If A contains I and B does not, then the first term is exp(ε′), since s_{t,π_t}(A) = s_{t,π_t}(B) + 1, and each term in the product is at most 1, since s_{i,j}(A) ≥ s_{i,j}(B). Therefore, the whole expression is at most exp(ε′). Since ε′ ≤ ε_s for δ_p = δ_s ≤ 1, it follows that the for loop is (ε_p/2, 0) differentially private in this case, which implies the weaker (ε_p/2, δ_p) differential privacy.

Now, suppose B contains I and A does not. In this case, the first term is exp(−ε′) < 1. In instance B, every (point, threshold) pair which contains I covers exactly one more item than that pair in A, and all other sets remain the same size. We denote the set of such pairs as S_I. Therefore, we have

$$\frac{\Pr[M(A) = \pi]}{\Pr[M(B) = \pi]} \le \prod_{i=1}^{t}\left(\frac{(\exp(\epsilon') - 1)\sum_{j \in S_I}\exp(\epsilon' s_{i,j}(A)) + \sum_j \exp(\epsilon' s_{i,j}(A))}{\sum_j \exp(\epsilon' s_{i,j}(A))}\right) = \prod_{i=1}^{t}\left(1 + (\exp(\epsilon') - 1)\, p_i(A)\right) \le \prod_{i=1}^{t}\exp\left((\exp(\epsilon') - 1)\, p_i(A)\right)$$

where p_i(A) is the probability that a set containing I is chosen at step i of the algorithm running on instance A, conditioned on picking the sets S_{π_1}, ..., S_{π_{i−1}} in the previous steps. The last step follows because 1 + x ≤ exp(x) when x ≥ 0.

For an instance A and an element I, we say that an order of chosen (point, threshold) pairs σ is q-bad if the sum $\sum_i p_i(A) \cdot \mathbb{1}(I \text{ uncovered at step } i)$ is larger than q, where p_i(A) is as defined above. We call σ q-good if it is not q-bad. We first consider the case when π is (ln δ_p^{−1})-good. Since the index t corresponds to the first set in π containing I, we have $\sum_{i=1}^{t-1} p_i(A) \le \ln \delta_p^{-1}$. Continuing the earlier analysis,

$$\frac{\Pr[M(A) = \pi]}{\Pr[M(B) = \pi]} \le \prod_{i=1}^{t}\exp((\exp(\epsilon') - 1)\, p_i(A)) \le \exp\left(2\epsilon' \sum_{i=1}^{t} p_i(A)\right) \le \exp\left(2\epsilon'\left(\ln\frac{1}{\delta_p} + p_t(A)\right)\right) \le \exp\left(2\epsilon'\left(\ln\frac{1}{\delta_p} + 1\right)\right) \le \exp(\epsilon_p/2).$$

Thus, for any (ln δ_p^{−1})-good output π, we have Pr[M(A) = π]/Pr[M(B) = π] ≤ exp(ε_p/2).

Lemma A.4.
For any instance A and any I ∈ A, the probability that the output π is q-bad is bounded by exp(−q).

Thus, for any set P of selection orders, we have

$$\Pr[M(A) \in P] = \sum_{\pi \in P}\Pr[M(A) = \pi] = \sum_{\substack{\pi \in P:\ \pi \text{ is } (\ln \delta_p^{-1})\text{-good}}}\Pr[M(A) = \pi] + \sum_{\substack{\pi \in P:\ \pi \text{ is } (\ln \delta_p^{-1})\text{-bad}}}\Pr[M(A) = \pi] \le \sum_{\substack{\pi \in P:\ \pi \text{ is } (\ln \delta_p^{-1})\text{-good}}}\exp(\epsilon_p/2)\Pr[M(B) = \pi] + \delta_p \le \exp(\epsilon_p/2)\Pr[M(B) \in P] + \delta_p.$$

Therefore, we have shown (ε_p/2, δ_p) differential privacy in both cases.

Lemma A.5. (Lemma 4.4)
Consider iteration i of the for loop and let M_i be the maximum coverage of k centers with radius t_i over points in V′. With high probability, at line 5,

$$|V_i| \ge (1-\epsilon) M_i - \frac{k \ln n \ln(e/\delta_p)}{\epsilon_p}.$$

Proof. The sets in the family S in line 4 are in one-to-one correspondence with the points of V, and the set S_v ∈ S corresponding to v ∈ V consists exactly of the points that are not already covered by an existing center in C (at its respective threshold) but are within distance t_i of v. Therefore, the items covered by the centers in C_i are all within distance t_i of their closest centers, so the change in the coverage of C is at least the size of the set coverage from C_i.

Our analysis is similar to [12]. The main difference is that instead of covering all points that OPT can cover, we aim to cover a (1 − ε) fraction of them within an additive error, hence we run 2k ln(1/ε) iterations instead of 2k ln n. Let |R_i| be the number of remaining elements yet to be covered, and define L_i = max_{S∈S} |S ∩ R_i|, the largest number of uncovered elements covered by any set in S.

By Theorem 3.4, the exponential mechanism ensures that with probability at most 1/n we select a center with coverage less than L_i − 2 ln(n)/ε′. When L_i > 4 ln(n)/ε′, we are therefore guaranteed to choose a center that covers at least L_i/2 uncovered elements. Hence, as long as L_i > 4 ln(n)/ε′, we always take the greedy option of Lemma 3.5 and are guaranteed to have |V_i| ≥ (1 − ε)M_i with probability at least 1 − 1/n. However, when L_i ≤ 4 ln(n)/ε′, although we are no longer guaranteed to take the greedy action, OPT can cover at most k L_i ≤ 4k ln(n)/ε′ of the points yet to be covered at radius t_i. Thus, the algorithm loses at most O(k ln(n)/ε′) = O(k ln n ln(e/δ_p)/ε_p) points.

Lemma A.6. (Lemma 4.6)
Consider the k-medians problem in the last line of Algorithm 2, where demand points in D are shifted to points in C and Laplace noise is applied. With high probability, the optimal objective cost of this new k-medians problem is at most

$$\mathrm{OPT} + \sum_i a_i t_i + \frac{4\Delta k \ln(1/\epsilon)}{\epsilon_p}\left(\frac{\ln n}{\ln(1+\epsilon)} + 2\right)$$

where OPT is the cost of the original k-medians problem.

Proof. After assigning every point in D to the closest point in C, we run a k-medians algorithm on a multiset defined by C, where each element c ∈ C has multiplicity n′_c as in line 11 of Algorithm 2. Recall that the absolute value of a random variable following the Laplace distribution with scale parameter b follows an exponential distribution with parameter 1/b. Also, by Theorem A.1, the sum of exponential variables is less than twice its expectation with high probability; for the sake of completeness, we include the proof of this fact in Theorem A.1. It is also significant that, since we call Algorithm 1 a total of ⌈log_{1+ε}(n) + 1⌉ = ⌈ln n / ln(1+ε) + 1⌉ times with m = 2k ln(1/ε), we select at most |C| ≤ 2k ln(n) ln(1/ε)/ln(1+ε) + 4k ln(1/ε) centers before calling the black-box k-medians algorithm. With all this preliminary information, we prove the claim.

The last term in the bound is due to the Laplace mechanism, where noise is applied to the count of each center c ∈ C. Each of the centers in C has Laplace noise applied to it with scale parameter 2/ε_p. Therefore, with high probability, at most (4k ln(1/ε)/ε_p)(ln(n)/ln(1+ε) + 2) demand points are "added" by line 11 of the algorithm, and each of these points is at distance at most Δ from its closest center, which yields the last term in the bound.

The first two terms come from the fact that we shift points in line 9 of Algorithm 2, together with the triangle inequality. For each of the original demand points v ∈ D, let the cluster center in OPT closest to v be OPT_v, and let the point in C closest to v be denoted c_v. By the triangle inequality, d(c_v, OPT_v) ≤ d(v, c_v) + d(v, OPT_v). Since OPT of the original k-medians problem is a candidate solution for the new k-medians problem, the objective cost of using OPT upper bounds the optimal cost. Summing over all v ∈ D, the objective cost of the shifted demand points is therefore bounded as

$$\sum_{v \in D} d(c_v, \mathrm{OPT}_v) \le \sum_i a_i t_i + \mathrm{OPT}$$

since Σ_i a_i t_i is an upper bound on the cost of shifting the demand points to centers in C, and the sum of the d(v, OPT_v) is exactly OPT. Thus, the first two terms come from the cost of shifting the real demand points, and the last term comes from the Laplace noise. Note that there may be a better solution to this new k-medians problem than the original problem's OPT, but this is consistent with the bound being an inequality.
Lemma A.7. (Lemma 4.7)
With high probability, Algorithm 2 preserves (ε_p, δ_p) differential privacy and solves the k-medians problem with cost

$$O(M(1+\epsilon))\,\mathrm{OPT} + O\left(\frac{M k \Delta}{\epsilon_p} \ln n \left(\ln\left(\frac{e}{\delta_p}\right) + \frac{\ln(1/\epsilon)}{\ln(1+\epsilon)}\right)\right)$$

where the black-box k-medians algorithm used in the last step of Algorithm 2 has approximation factor M and ε is a small positive constant.

Proof. Combining the results of Lemmas 4.3, 4.5, and 4.6, we see that with high probability the optimal cost of the k-medians problem in the final line of Algorithm 2 is at most

$$\left(1 + \frac{(1-\epsilon)(1+\epsilon)}{1-\epsilon-\epsilon^2}\right)\mathrm{OPT} + \left(\frac{4k\ln(1/\epsilon)}{\epsilon_p}\left(\frac{\ln n}{\ln(1+\epsilon)} + 2\right) + \frac{k\ln n \ln(e/\delta_p)}{\epsilon_p(1-\epsilon-\epsilon^2)} + \frac{1-\epsilon}{1-\epsilon-\epsilon^2}\right)\Delta.$$

For simplicity, we use big-O notation going forward, so this is the same as

$$(2 + O(\epsilon))\,\mathrm{OPT} + O\left(\frac{k\Delta}{\epsilon_p}\left(\frac{\ln(n)\ln(1/\epsilon)}{\ln(1+\epsilon)} + \ln n \ln\left(\frac{e}{\delta_p}\right)\right)\right).$$

Since the k-medians algorithm used in the last line has approximation factor M, the objective cost of the shifted k-medians problem is M times this bound. To obtain the objective cost for the original problem, note that for any demand point v ∈ D, the distance between v and a center is at most d(c_v, v) plus the distance from c_v to that center, by the triangle inequality. Therefore, the original objective cost only adds the shifting cost on top of the modified k-medians problem's objective cost, which only slightly affects the factor in front of OPT:

$$(2M + 1 + (M+1)\,O(\epsilon))\,\mathrm{OPT} + O\left(\frac{M k \Delta}{\epsilon_p}\left(\frac{\ln(n)\ln(1/\epsilon)}{\ln(1+\epsilon)} + \ln n \ln\left(\frac{e}{\delta_p}\right)\right)\right)$$

which simplifies to the claim.

Lemma A.8. (Lemma 5.2)
In any metric space and for p ≥ 1, (d(a, b))^p ≤ 2^{p−1}((d(a, c))^p + (d(b, c))^p).

Proof. Since the function f(x) = x^p is convex over the non-negative reals, and distance is a non-negative function,

$$\left(\frac{d(a,c)}{2} + \frac{d(b,c)}{2}\right)^p \le \frac{1}{2}(d(a,c))^p + \frac{1}{2}(d(b,c))^p.$$

Hence,

$$(d(a,b))^p \le (d(a,c) + d(b,c))^p \le \frac{1}{2}(2\,d(a,c))^p + \frac{1}{2}(2\,d(b,c))^p = 2^{p-1}\left((d(a,c))^p + (d(b,c))^p\right).$$