Differentially private k -means clustering via exponential mechanism and max cover
aa r X i v : . [ c s . D S ] S e p Differentially private k -means clustering via exponentialmechanism and max cover Anamay Chaturvedi ∗ Huy Lê Nguy˜ên † Eric Xu ‡ September 3, 2020
Abstract
We introduce a new ( ǫ p , δ p ) -differentially private algorithm for the k -means clustering problem. Givena dataset in Euclidean space, the k -means clustering problem requires one to find k points in that spacesuch that the sum of squares of Euclidean distances between each data point and its closest respectivepoint among the k returned is minimised. Although there exist privacy-preserving methods with goodtheoretical guarantees to solve this problem ([4, 13]), in practice it is seen that it is the additive errorwhich dictates the practical performance of these methods. By reducing the problem to a sequence ofinstances of maximum coverage on a grid, we are able to derive a new method that achieves lower additiveerror then previous works. For input datasets with cardinality n and diameter ∆ , our algorithm has an O (∆ ( k log n log(1 /δ p ) /ǫ p + k p d log(1 /δ p ) /ǫ p )) additive error whilst maintaining constant multiplicativeerror. We conclude with some experiments and find an improvement over previously implemented workfor this problem. Clustering is a well-studied problem in theoretical computer science. The objective can vary between iden-tifying good cluster sets, as one might desire in unsupervised learning, or good cluster centers, which maybe framed as more of an optimization problem. One relatively general variant of this problem is to find afixed number of centers k for a given dataset D of size n such that the sum of distances of each point tothe closest center is minimized. When the ambient space is Euclidean and the distance is the square of theEuclidean metric this is known as the k -means problem. Although solving the k -means problem is NP hard[3, 6, 16], practical algorithmic solutions with good approximation guarantees are well-known [15, 20, 12, 2].When algorithms handle sensitive information (for example location data), an important requirementthat they might be expected to fulfill is that of being differentially private [8]. Differential privacy providesa framework for capturing the loss in privacy that occurs when sensitive data is processed. This frameworkencompasses different models that vary depending on who the trusted parties are, what is considered sensitivedata, and how loss in privacy is measured. In this work we are interested in the centralized model ofdifferential privacy, where we assume that the algorithm whose privacy loss we want to bound is executedby a trusted curator with access to many agents’ private information. This trusted curator publicly revealsthe output obtained at the end of their computation, which is when all privacy loss occurs.In the theoretical study of the k -means problem, reducing the worst-case multiplicative approximationfactor has been the focus of a major line of work [12, 2]. However, it is important to note that Lloyd’s algo-rithm, which has a tight sub-optimal multiplicative guarantee of O (log k ) , works quite well in practice. Thisbehaviour can be understood by showing (as in [1]) that Lloyd’s finds a solution with constant multiplica-tive error with constant probability, or that for a general class of datasets satisfying a certain separabilitycondition [20] the multiplicative error again has a strong O (1) bound. ∗ Khoury College of Computer Sciences, Northeastern University . . . . . . . . . . . . . . . . . . . . . . . . [email protected] † Khoury College of Computer Sciences, Northeastern University . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
[email protected] ‡ Khoury College of Computer Sciences, Northeastern University . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . [email protected] eference Multiplicative Error Additive Error Balcan et al. [4] O (cid:0) log n (cid:1) ˜ O (cid:0) k + d (cid:1) Kaplan and Stemmer [13] O (1 /γ ) ˜ O (cid:0) k . + d . γ k γ (cid:1) Jones, Nguyen and Nguyen [11] O (1 /γ ) ˜ O (cid:0) k + d . γ k γ (cid:1) Ours O (1) ˜ O ( k √ d ) Table 1: Comparison of our clustering algorithms with prior works where we omit all log factors in theadditive error, the dependence on privacy parameters, set δ p = 1 /n . and suppress the common ∆ factorin the additive error.In contrast, when the algorithm is required to be ( ǫ p , δ p ) -differentially private, no pure multiplicativeapproximation is attainable and additive error is necessary. Differential privacy forces a lower bound onthe clustering cost - morally, for an easy clustering instance a perfect solution would reveal too muchinformation about the sensitive dataset. This is formalised for the closely related discrete k -medians problemin theorem 4.4 of [10] which shows that there is a family of instances whose optimal clustering cost is butany differentially private algorithm must incur an Ω(∆ k (log n/k ) /ǫ p ) expected cost. In practice, for manydatasets it is seen that although the non-private clustering cost naturally decreases as the number of centers k increases, the costs incurred by differentially private algorithms quickly plateau (as in the experiments of[4]), suggesting that they have reached their limit in the additive error. Given this fundamental barrier, amajor question is: Question:
Is it possible to obtain a finite approximation with additive error nearly linear in k ? The work [10] initiated the study of differentially private clustering algorithms and gave a private algorithmfor the k -means problem with constant factor multiplicative approximation and ˜ O ( k . ) additive error infixed dimensions. Motivated by applications in high dimensions, the work [4] developed a different methodthat scales to high dimensions but with an O (log n ) multiplicative approximation and ˜ O ( k . ) additiveerror. Subsequently, the work [13] gave an algorithm with O (1 /γ ) approximation and ˜ O ( k . + k γ d . γ ) additive error with one key idea (among others) being the application of locality sensitive hashing (LSH).The trade-off between the multiplicative and additive errors is in part a consequence of the trade-off betweenthe distance approximation quality of LSH and the number of hash functions. Recently, the work [11]improved the additive error to ˜ O ( k γ d . γ ) but the multiplicative error remains O (1 /γ ) . In this work, wesimultaneously improve upon the minimum additive error in this trade-off to nearly linear in k and eliminatethe resulting blow-up in the multiplicative factor using a very different approach. We introduce a differentially private k -means clustering algorithm for the global model of differential privacy.This method has an O (1) multiplicative approximation and an O (∆ ( k log n log(1 /δ p ) /ǫ p + k p d log(1 /δ p ) /ǫ p )) additive approximation guarantee. The additive error is nearly linear in k in contrast with a polynomialin k overhead in previous works, and the multiplicative error is a constant, which is competitive with allprevious works. Apart from theoretical value, the algorithm also exhibits an improvement experimentallyover earlier work on synthetic and real-world datasets [4]. For a specific setting of parameters with constantsapplicable for experiments, we have the following bound. More general asymptotic bounds can be found inthe subsequent sections. The discrete k -medians problem has an identical formulation to the k -means except that the distance function is a metric,not squared, and the centers come from a public finite set and not the whole ambient space. heorem 1.1. There is an ( ǫ p , δ p ) differentially private algorithm for the k -means problem that achieves autility bound of O (1) f D ( OPT D ) + O (cid:18) k ∆ log n log 1 /δ p ǫ p (cid:19) + O ∆ k p d log 1 /δ p ǫ p ! , where D is the input dataset, f D ( OPT D ) is the optimal k -means cost for the input dataset D , d is the ambientdimension of D , n is the cardinality of D , ∆ is the diameter of D , and the failure probability of the algorithmis polynomially small in n . For lower bounds, we extend the construction of [10] originally for the discrete k -medians problem to oursetting and show that a linear dependence on k in the additive error is necessary for any finite multiplicativeapproximation. Theorem 1.2.
For any < ǫ p , δ p ≤ and integer k , there is a family of k -means instances over thecube [0 , ∆ / √ d ] d with d = O (ln( k/ ( ǫ p δ p ))) dimensions such that the optimal clustering cost is but any ( ǫ p , δ p ) -differentially private algorithm would incur an expected cost of Ω (cid:16) ∆ k ln( ǫ p /δ p ) ǫ p (cid:17) . The same construction also implies a lower bound for ( ǫ p , -differential privacy. Theorem 1.3.
For any < ǫ p ≤ and integers k and d = Ω(ln( k )) , there is a family of k -means instancesover the cube [0 , ∆ / √ d ] d such that the optimal clustering cost is but any ( ǫ p , -differentially private algo-rithm would incur an expected cost of Ω (cid:16) ∆ kdǫ p (cid:17) . In [10], the authors gave an algorithm for solving the discrete version of the problem and subsequentworks have focused on identifying a good discretization of the continuous domain and then invoking theiralgorithm for the discrete case. A recent approach by [13] uses locality sensitive hashing (LSH) to identify asmall discrete set of points that serve as potential centers. Inherent in this approach is a trade-off betweenthe multiplicative approximation and the size of this discrete set, which comes from the trade-off in LSHbetween the approximation and the number of hash functions. The number of discrete candidate centersdirectly impacts the additive error and thereby brings about a trade-off between the multiplicative andadditive errors.In this work, we avoid this trade-off by going back to the basics and using the most natural approach:discretizing the space using a grid and using all grid points as candidate centers. We can preprocess the datato reduce dimensions to O ((log n ) /ǫ ) and preserve all distances. However, there can be as many as ( n ) log n many points in the grid that we construct since the grid size must start from /n for negligible additiveerror. It is not clear how to implement a selection algorithm (such as the exponential mechanism) on sucha large number of choices. In fact, it was this hurdle, identified in [4], that prompted subsequent works tofind alternative approaches.An important observation is that a large number of choices is not inherently difficult since it is not hardto sample uniformly among them. Our task is nontrivial since the k -means cost objective is a complexfunction. To simplify the sampling weights, we exploit the connection between clustering and coverage andreduce the problem to finding maximum coverage: count the number of data points within a given radiusof each candidate center. The notable advantage is that in maximum coverage, the value of each center isan integer in the range from to n instead of a real number corresponding to a potential improvement inthe k -means cost. The centers can hence be partitioned into n classes and we just need to sample uniformlyamong them. The crucial observation is that there are at most n O (1 /ǫ ) grid points within the thresholdradius of any data point, meaning that there are only a polynomial number of grid points with non-zerocoverage. Thus, all but a polynomial number of choices have the same coverage of making it possible toimplement the exponential mechanism in polynomial time.Given the implementation of the exponential mechanism for coverage, we follow the approach of [11] tocover the points using clusters of increasing radii. Note that the approach goes back to the non-private3oreset construction of [5]. However, the use of coverage for dealing with each radius has another crucialadvantage: like in [11], using the technique of [10], the privacy loss only increases by a log 1 /δ p factor eventhough the algorithm has Ω( k ) adaptive rounds of exponential mechanism.To obtain a comparable coverage to the optimal solution at each radius r , we use a bi-criteria relaxationof maximum coverage and pick many more cluster centers than in the optimal solution. 
We increase r multiplicatively with factor (1 + ǫ ) from /n to the diameter and invoke this subroutine for each such r ,taking the union of all O ( log nǫ ) sets of candidate centers generated this way to generate a set C of “good"centers derived from many grids.By moving each point p ∈ D ′ to its closest candidate center grid [ p ] in C (with some additional noise), weend up privately constructing a proxy dataset D ′′ with total movement of data points on the order of the k -means cost (by virtue of C containing a good k -means solution for D ′ ). X p ∈ D ′ d ( p, grid [ p ]) ≃ cost ( D ′ ) By the triangle inequality, this bound on the total movement means that any k -means solution for D ′′ isimmediately a good k -means solution for the dataset D ′ as well, with a constant multiplicative overhead inthe cost. A k -means solution can be constructed for D ′′ using any non-private k -means clustering algorithm,and finally D can be clustered by using noisy averages of clusters in D ′ .We are able to use a parsimonious privacy budget primarily because of the round independent privacyanalysis for the grid-based proxy construction routine using the exponential mechanism which behaves wellunder composition. There are other privacy preserving noise additions when snapping the data points to thegrid for the proxy dataset construction and the final cluster centers but apart from a √ d dependence on theambient dimension they are dominated by the privacy expenditure of the exponential mechanism.We finish with an experimental evaluation of our algorithm, in which we find that this method performsbetter than an implementation of previous work [4]. We are given a dataset D of n points that lies in a ball B ∆ / (0) (the ball of radius ∆ / centered at ) in somehigh dimensional space R d . The goal is to find a set of k points S = { µ , . . . , µ k } such that P p ∈ D d ( p, S ) isminimal. Here d ( · , · ) : R d × R d → R is the square of the Euclidean distance, that is d ( p, q ) := P di =1 ( p i − q i ) .We abuse notation to set d ( p, S ) := min µ ∈ S d ( p, µ ) . We define f D ( S ) = X p ∈ D d ( p, S ) , so when S is a set of size k , f D ( S ) is the k -means cost of the solution S for the dataset D . There are a couple of closely related definitions of central differential privacy which can be trivially relatedto each other. For clarity we disambiguate the situation by uniformly adhering to the formalization in [9].
Definition 2.1.
We say that two datasets
D, D ′ ∈ X n are neighbouring if | D △ D ′ | = 1 , i.e. there is exactlyone element in their symmetric difference. We say that an algorithm A is ( ǫ, δ ) -differentially private if forany two neighbouring input datasets D, D ′ and any measurable output set S lying in the co-domain of A , P ( A ( D ) ∈ S ) ≤ e ǫ P ( A ( D ′ ) ∈ S ) + δ. We now mention some standard tools from the literature of differential privacy.4 emma 2.2 (Exponential Mechanism, [17]) . Let input set D ⊂ X , range R , and utility function q : X × R → R . The Exponential Mechanism M E ( D, q, ǫ E ) with privacy parameter ǫ E outputs an element r ∈ R sampledaccording to the distribution P ( M E ( D, q, ǫ E ) = r ) := exp (cid:18) ǫ E · q ( D, r )2∆ q (cid:19) , where ∆ q is the sensitivity of the utility q ; i.e. ∆ q = max r ∈ R | A ∆ B | =1 | q ( A, r ) − q ( B, r ) | . The Exponential Mechanism is ( ǫ E , -differentially private and with probability at least − γ , | max r ∈ R q ( D, r ) − q ( D, M E ( D, q, ǫ E )) | ≤ qǫ E log (cid:18) | R | γ (cid:19) . Lemma 2.3 (Laplace mechanism, [8]) . Given any function f : N | X | → R k , the Laplace mechanism withprivacy parameter ǫ L is defined as M L ( x, f ( · ) , ǫ L ) := f ( x ) + ( Y , . . . , Y k ) where Y i are i.i.d. ∼ Lap (cid:16) ∆ fǫ L (cid:17) , Lap ( x | b ) = b exp (cid:16) − | x | b (cid:17) and ∆ f = max | X △ Y | =1 | f ( X ) − f ( Y ) | , i.e. the ℓ sensitivity of f over all pairs of neighbouring datasets. The Laplace mechanism is ( ǫ L , -differentiallyprivate. Theorem 2.4 (Basic composition, [7]) . If a sequence of algorithms M i with ( ǫ i , δ i ) -differential privacyguarantees are composed in order then the composite process satisfies ( P mi =1 ǫ i , P mi =1 δ i ) -differential privacy. We state the parallel composition theorem from [18], slightly modifying the proof to extend this resultto be able to use ( ǫ, δ ) -differentially private subroutines where δ = 0 . Theorem 2.5 (Parallel composition, [18]) . Let M i each provide ( ǫ i , δ i ) -differential privacy. Let { D i : i ∈ Λ } be arbitrary disjoint subsets of the input domain D . For input dataset X , the sequence of M i ( X ∩ D i ) provides (max i ǫ i , max i δ i ) differential privacy.Proof. Let A and B be neighbouring datasets, A i = A ∩ D i and B i = B ∩ D i for i ∈ Λ . Let M i be ( ǫ i , δ i ) differentially private subroutines for i ∈ Λ . Since | A △ B | = 1 , there is at most one partition set D i ∗ of thedomain such that A i ∗ = B i ∗ . We bound the ratio of probabilities of the output tuple as P (( M i ( A i )) i ∈ Λ = ( S i ) i ∈ Λ) P (( M i ( B i )) i ∈ Λ = ( S i ) i ∈ Λ ) = Y i ∈ Λ P ( M i ( A i ) = S i P ( M i ( B i ) = S i = P ( M i ∗ ( A i ∗ ) = S i ∗ ) P ( M i ∗ ( B i ∗ ) = S i ∗ ) Since M i ∗ is ( ǫ i ∗ , δ i ∗ ) -differentially private, it follows that with probability − δ i ∗ , this ratio is bounded by exp( ǫ ∗ i ) . Since this is true for every pair of neighbouring datasets, we can summarise this by saying that thissequence of operations is (max i ǫ i , max i δ i ) - differentially private. 
Here we describe the noisy averaging algorithm from [19] which we use in the last step to derive clustercenters privately from cluster IDs and data without revealing the cluster member vectors or the cluster size.The authors of that work use the Gaussian mechanism in conjunction with a bound on the sensitivity of the5veraging operator and addition of Laplace noise to the number of points in the cluster to ensure privacy.This additional step of adding Laplace noise is necessary because in this setting the cluster sizes are derivedfrom sensitive information, and since the Gaussian noise added to mask the exact mean depends upon thecluster size, the noise parameter must itself be masked by additional noise (which is data-independent).In [19], the authors use a slightly different definition of privacy where datasets
D, D ′ are consideredneighboring if D = D ′ \{ p } ∪ { p ′ } for some p ∈ D ′ . The statements here are with the privacy parametersmodified to fit the definition of differential privacy that we are working with; ∆ is any upper bound on thediameter of the input set. Data:
Multiset V of vectors in R d , predicate g , parameters ǫ, δ Set ˆ m = |{ v ∈ V : g ( v ) = 1 }| + Lap (5 /ǫ ) − ǫ ln(2 /δ ) . If ˆ m < , output a uniformly random point inthe domain B ∆ / (0) . Denote σ = ǫ ˆ m p . /δ ) , and let η ∈ R d be a random noise vector with each coordinate sampledindependently from N (0 , σ ) . return g ( V ) + η Algorithm 1:
NoisyAVG[19, Algorithm 5]
Theorem 2.6 (Privacy [19, Theorem A.3] and noise [19, Observation A.1] bounds for algorithm 1) . Al-gorithm 1 is an ( ǫ, δ ) -differentially private algorithm for ǫ ≤ / . Further, if V and g are such that m = |{ v ∈ V : g ( v ) = 1 }| ≥ A (cid:16) ǫ ln (cid:16) βδ (cid:17)(cid:17) for sufficiently large constant A , then with probability − β ,algorithm 1 returns g ( V ) + η where η is a vector where every coordinate is sampled i.i.d. from N (0 , σ ) forsome σ ≤ ǫm p /δ ) , where ∆ is the diameter of the input set. In this subsection we list some technical lemmata that we will find useful to refer to in the main body ofthis work.
Lemma 2.7 (Greedy set cover bicriteria solution guarantee) . Let there be a set of elements U , a family ofsets S ⊂ U , and the promise that there is some subfamily of sets Z ⊂ S that covers U . Suppose that for Y = 2 ⌈|Z| log 1 /ǫ ⌉ + 1 we iteratively pick a collection of sets C = { c , . . . , c Y } ⊂ S . We denote the set ofelements not picked by the i th iteration U = U and U i = U i − \ c i − . If c i ∩ U i is at least half as big as max c ∈S ( c ∩ U i ) for all i , then S will cover (1 − ǫ ) of all elements in U . Formally, if C = { c i : c i ∈ S , i = 1 , . . . , ⌈ |Z| log 1 /ǫ ⌉} ,U i = U \ ( c ∪ · · · ∪ c i − ) where | c i ∩ U i | ≥ max c ∈ S ( c ∩ U i )2 , then | S i c i | ≥ (1 − ǫ ) | U | .Proof. The idea behind this proof is simple; it suffices to show that in every iteration we always cover acertain fraction of the thus far uncovered elements. Telescoping this multiplicative guarantee will give us ourdesired result. Formally, we know that since S z ∈Z z = U , the union S z ∈Z z also covers the set of unpickedelements U i − . Enumerating the elements of U i by summing the cardinalities of its intersections with themembers of Z , we get X z ∈Z | z ∩ U i − | = | U i − | max z ∈Z | z ∩ U i − | ≥ | U i − ||Z|⇒ | c i ∩ U i − | ≥ | U i − | |Z|⇒ | c i | ≥ | U i − | |Z| Since U i − = U i ⊔ c i , | U i | ≤ (cid:18) − |Z| (cid:19) | U i − |⇒ | U i | ≤ (cid:18) − |Z| (cid:19) i − | U | It follows that for i − > log − |Z| ǫ = log ǫ log 1 − |Z| > log ǫ − |Z| = 2 |Z| log 1 /ǫ, | U Y | < ǫ | U | and so the size of the complement | c ∪ · · · ∪ c Y | is ≥ (1 − ǫ ) | U | . Lemma 2.8.
In any metric space with a metric d ( · , · ) and p ≥ , d p ( a, b ) ≤ p − ( d p ( a, c ) + d p ( c, b )) .Proof. Applying Jensen’s inequality with the function g ( x ) = x p we have (cid:18) d ( a, c )2 + d ( c, b )2 (cid:19) p ≤ d p ( a, c ) + d p ( c, b )2 . Since d is a metric, by the triangle inequality d ( a, b ) ≤ d ( a, c ) + d ( c, b ) ⇒ d p ( a, b )2 p ≤ (cid:18) d ( a, c )2 + d ( c, b )2 (cid:19) p ≤ d p ( a, c ) + d p ( c, b )2 ⇒ d p ( a, b ) ≤ p − ( d p ( a, c ) + d p ( c, b )) Lemma 2.9 (Concentration bound for privacy analysis, [10]) . Let R , . . . , R n be some Bernoulli randomvariables, R i ∼ Ber ( p i ) , i.e. R i = 1 with probability p i and with probability − p i , where R i may dependarbitrarily on R , . . . R i − . Let Z j = Q ji =1 (1 − R i ) . Then P n X i =1 p i Z i > q ! ≤ exp( − q ) . Following the construction in theorem 4.4 of [10], we derive lower bounds for the k -means clustering problemin the ( ǫ, δ ) and ( ǫ, -differential privacy regimes. Theorem 3.1.
For any < ǫ p , δ p ≤ and integer k , there is a family of k -means instances over thecube [0 , ∆ / √ d ] d with d = O (ln( k/ ( ǫ p δ p ))) dimensions such that the optimal clustering cost is but any ( ǫ p , δ p ) -differentially private algorithm would incur an expected cost of Ω (cid:16) ∆ k ln( ǫ p /δ p ) ǫ p (cid:17) . roof. Let the ambient dimension d = Θ(ln( k/ (( e ǫ p − δ p )) and W be the set of codewords of an errorcorrecting code with constant rate and constant relative distance in { , } d . The dimension d and codewords W are chosen so that | W | ≥ k/ (( e ǫ p − δ p ) . Let L = ln(( e ǫ p − / (4 δ p )) / (2 ǫ p ) . Our input domain is the unitcube [0 , d with diameter ∆ = √ d . Note that for other values of ∆ , we can simply re-scale the construction.Suppose M is an arbitrary ( ǫ p , δ p ) -differentially private algorithm that on input D ⊂ [0 , d outputs a setof k locations. Let M ′ be the algorithm that first runs M on the input and then snaps each output point tothe nearest point in W . By post-processing, M ′ is ( ǫ p , δ p ) -differentially private. Furthermore, observe thatif the input points are located at a subset of W then the cost of M ′ is within a factor of the cost of M .Let A be a size k subset of W chosen uniformly at random and the dataset D A is a multiset containing eachpoint in A with multiplicity L . Note that the optimal cost for D A is .We would like to analyze φ = E A,M ′ [ | A ∩ M ′ ( D A ) | ] /k . We have: kφ = E A,M ′ "X i ∈ A i ∈ M ′ ( D A ) = k E A,M ′ E i ∈ A [1 i ∈ M ′ ( D A ) ]= k E i ∈ W E A,M ′ [1 i ∈ M ′ ( D A ) | i ∈ A ] Let i ′ be an random point in W not in A . Changing A to A ′ = A \ { i } ∪ { i ′ } requires changing L elementsof D A . Notice that for random A \ { i } in W \ { i } and random i ′ in W \ A , we have that A ′ is still a uniformlyrandom subset of W \ { i } . Thus, E i ∈ W E A ′ ,M ′ [1 i ∈ M ′ ( D A ′ ) | i A ′ ] ≥ (cid:18) E i ∈ W E A,M ′ [1 i ∈ M ′ ( D A ) | i ∈ A ] (cid:19) exp( − ǫ p · L ) − δ p e ǫ p − Here we use the fact that M ′ is ( ǫ p , δ p ) -differentially private, and that the δ p losses in expectation decreasegeometrically with factor exp( − ǫ p ) so the net leakage from the δ term can be lower bounded by the sum ofan infinite geometric progression. Continuing, E i ∈ W E A ′ ,M ′ [1 i ∈ M ′ ( D A ′ ) ] ≥ φ exp( − ǫ p · L ) − δ p / ( e ǫ p − ≥ φδ p / ( e ǫ p − − δ p / ( e ǫ p − Since the output M ′ ( D A ′ ) has at most k points, the LHS is at most k/ | W | . Thus, φ ≤ ( k/ | W | + δ p / ( e ǫ p − / (4 δ p / ( e ǫ p − ≤ / .For each point in A \ M ′ ( D A ) , the algorithm incurs a cost of Θ( L ∆ ) due to the multiplicity of L ofpoints in D A and the fact that all points in W are at distance Θ(∆) apart. Therefore, the expected cost of M ′ , and consequently the cost of M , is Ω( kL ∆ ) = Ω (cid:16) ∆ k ln( ǫ p /δ p ) ǫ p (cid:17) .By the same proof, one can also obtain a lower bound for ( ǫ p , -differential privacy. Theorem 3.2.
For any < ǫ p ≤ and integers k and d = Ω(ln( k )) , there is a family of k -means instancesover the cube [0 , ∆ / √ d ] d such that the optimal clustering cost is but any ( ǫ p , -differentially private algo-rithm would incur an expected cost of Ω (cid:16) ∆ kdǫ p (cid:17) .Proof. Let W be the set of codewords of an error correcting code with constant relative rate and constantrelative distance in { , } d . Note that | W | = 2 Ω( d ) . Let L = ln( | W | / (2 k )) / (2 ǫ p ) . Our input domain isthe unit cube [0 , d with diameter ∆ = √ d . Note that for other values of ∆ , we can simply re-scale theconstruction.Suppose M is an arbitrary ( ǫ p , -differentially private algorithm that on input D ⊂ [0 , d outputs a setof k locations. Let M ′ be the algorithm that first runs M on the input and then snaps each output point to8he nearest point in W . By post-processing, M ′ is ( ǫ p , -differentially private. Furthermore, observe that ifthe input points are located at a subset of W then the cost of M ′ is within a factor of the cost of M . Let A be a size k subset of W chosen uniformly at random and the dataset D A is a multiset containing eachpoint in A with multiplicity L . Note that the optimal cost for D A is .We would like to analyze φ = E A,M ′ [ | A ∩ M ′ ( D A ) | ] /k . We have: kφ = E A,M ′ "X i ∈ A i ∈ M ′ ( D A ) = k E i ∈ W E A,M ′ [1 i ∈ M ′ ( D A ) | i ∈ A ] Let i ′ be an arbitrary point in W not in A . Changing A to A ′ = A \ { i } ∪ { i ′ } requires changing L elementsof D A . Thus, E i ∈ W E A ′ ,M ′ [1 i ∈ M ′ ( D A ′ ) ] ≥ φ exp( − ǫ p · L ) Since the output M ′ ( D A ′ ) has at most k points, the LHS is at most k/ | W | . Thus, φ ≤ ( k/ | W | ) exp(2 Lǫ p ) ≤ / .For each point in A \ M ′ ( D A ) , the algorithm incurs a cost of Θ( L ∆ ) due to the multiplicity of L ofpoints in D A and the fact that all points in W are at distance Θ(∆) apart. Therefore, the expected cost of M ′ , and consequently the cost of M , is Ω( kL ∆ ) = Ω (cid:16) ∆ kdǫ p (cid:17) . We introduce some notation to make the analysis of algorithms 2 and 3 easier. • OPT D , optimal solution: We let OPT D be the lexicographically minimal optimal k -means set for thedataset D . The lexicographic minimality is just for uniqueness, it has no other significance. • ǫ , multiplicative approximation constant: We let ǫ be an approximation constant that is used in theJohnson-Lindenstrauss transform and such that ǫ is the factor the grid unit length and thresholdincrease by in each iteration of the loop on line 8 to line 12. We will require that ǫ ∈ (0 , and bebounded away from , say ǫ ≤ . . • m , number of iterations: We let m = ⌈ log ǫ n ⌉ , the total number of iterations for which algorithm 3is called. • r i , threshold radii: For i ∈ { , . . . , m } , we let r i = t i √ dǫ = (1+ ǫ ) i − n , the i th threshold radius used forcomputing the max cover bi-criteria relaxation. For notational convenience we set r = 0 , and notethat r increases geometrically from r = n to r m = 2 . • G i , t i , grid and unit length: For i ∈ { , . . . , m } , we let G i = {− , − t i , − t i , . . . , − t i , } d ′ ,where t i = ǫn √ d (1 + ǫ ) i − , the grid unit length in the i th iteration. Note that | G i | = ⌊ t i ⌋ d ′ . • ⌊·⌋ ( i ) , floor to grid function: We let ⌊ v ⌋ ( i ) for any vector v ∈ R d denote (( t i ⌊ v t i ⌋ ) , . . . , ( t i ⌊ v d ′ t i ⌋ )) , i.e. ⌊ v ⌋ ( i ) is the coordinate-wise “floor" of v in the grid of unit length t i . • o i , ideal thresholded objectives: For i ∈ { , . . . 
, m } , we let o i = { p ∈ D ′ : d ( p, OPT D ′ ) ∈ [ r i − , r i ) } .Since D ′ ⊂ B (0) , D ′ = ⊔ mi =1 o i • a i , set of points covered in i th call: For i ∈ { , . . . , m } , we let a i = B r i + t i √ d ′ ( C i ) ∩ D ′ where C i is theset of points returned by algorithm 3 when called in the i th iteration of algorithm 2.9 B r ( · ) : We let B r ( S ) be the union of all balls of radius r whose center is an element of S . We abusenotation to let B r ( g ) := B r ( { g } ) and observe that | B r ( · ) | is a monotonic positive submodular functionfor any r ≥ , as is | T ∩ B r ( · ) | , for any fixed set T . • cover : We let cover [ g ] denote that set of uncovered data points that are covered within the radius ( r i + t i √ d ′ ) around g in the i th call to algorithm 3. This is a subset of the data points in B (1+ ǫ ) r i • grid : We let grid [ p ] denote the grid point that covers p in the call to algorithm 3 that removes p from D ′ . • center : We let center [ p ] denote the closest element of OPT D ′ to the datapoint p ∈ D ′ .At a high level the algorithm can be described in four steps. Step 1:
First the dataset D ⊂ B (0 , Λ / ⊂ R d is preprocessed via dimension reduction, scaling andprojection to produce a dataset D ′ ⊂ B (0) ⊂ R d ′ where d = O ((log n ) /ǫ ) . Note that with high probabilitywe do not need to project any point and so need not account for it in the privacy analysis; however, byprojecting instead of re-scaling, we preserve privacy. To start with a finite number of candidate centers weconstruct multi-dimensional grids of side lengths t i and observe that if µ is a center of a cluster with radius r i in the optimal solution, then by the triangle inequality a ball of radius r i + t i √ d ′ centered at ⌊ µ ⌋ ( i ) (the“floor" of µ the in grid) contains all the points of the same cluster. Step 2:
Next, for geometrically increasing grid unit lengths t i with growth factor (1 + ǫ ) starting from ǫ/n √ d ′ and increasing to / √ d ′ we create grids and identify possible centers of clusters with radius in theinterval [ r i − , r i ) . This is done by counting the number of datapoints within r i + t i √ d ′ of every grid point.To ensure a polynomial time method this enumeration is done by iterating over datapoints and adding eachdatapoint to the grid points which could be valid cluster centers - we do this by keeping in hand a set ofvalid offsets V i and simply incrementing counts for all grid points within an offset of ⌊ p ⌋ ( i ) . Since we arelooking for a k -means solution there could be as many as k clusters for any given radius, which requires usto greedily identify the k/ǫ best grid points to obtain close to optimal coverage (see lemma 2.7). We takethe union of all log ǫ / (1 /n ) = O ((log n ) /ǫ ) sets of k/ǫ points so found to construct the set C . Step 3:
Once this set C containing a good cluster solution is identified, the idea is to construct the proxydataset D ′′ by moving each point to its closest point in C . However, constructing the proxy dataset in thisway means accessing the sensitive data again. In order to maintain privacy in this step, instead of directlymoving datapoints to points in C , we compute the counts n c of the number of datapoints that would ideallybe moved to c and add Laplace noise to n c to get ˜ n c . D ′′ then contains ˜ n c copies of c for all c ∈ C . Step 4:
In the final step we apply any non-private k -means clustering algorithm to D ′′ to get some clustercenters S ′′ . We cluster D ′ using these cluster centers to get clusters C ′ , and define final clusters for D by identifying points with their images under the Johnson Lindenstrauss map. Since this step again usessensitive data we use the Gaussian mechanism to return noisy averages of these clusters C ′ for the dataset D to derive the set of k -means S .The formal pseudocode algorithm 3 requires some additional justification; the construction of the offsetset V i , and the polynomial time implementation of the exponential mechanism. Claim 4.1.
A data point p is within distance r i + t i √ d ′ of a grid point t i b for b ∈ Z d ′ only if d ′ X j =1 min(( ⌊ p ⌋ ( i ) j − t i b j ) , ( ⌊ p ⌋ ( i ) j − t i ( b j + 1)) ) ≤ ( r i + t i √ d ′ ) . ata: D ⊂ R d dataset, | D ′ | = n . Result: S = { ˜ µ , . . . , ˜ µ k } ⊂ R d T ∼ JohnsonLindenstrauss ( n, ǫ ) D ′ ← T ( D ) d ′ ← dim ( T ) = O ((log n ) /ǫ ) Scale D ′ down by a factor of ∆2(1+ ǫ ) and project to B (0) Let T ′ be the composition of T with the scaling and projection so that T ′ ( D ) = D ′ r ← /n t ← ǫ/ ( n √ d ′ ) for i = 1 , . . . , m = ⌈ log ǫ n ⌉ do C i ← algorithm D ′ , t i , r i ) r i +1 ← (1 + ǫ ) r i . t i +1 ← (1 + ǫ ) t i . end D ′ ← T ′ ( D ) ; // resetting the dataset to account for points lost in call toalgorithm 3 C = S mi =1 C i Assign all points in D ′ to their closest point c ∈ C Let n c be the number of points in D ′ assigned to c For each c ∈ C set n ′ c = n c + Lap (cid:16) ǫ L (cid:17) Let D ′′ be the dataset where every c ∈ C is repeated n ′ c times S ′′ = { µ ′′ , . . . , µ ′′ k } ← Lloyd ( D ′′ ) D ′ i ← { p ∈ D ′ : arg min µ ′′ ∈ S ′′ d ( p, µ ′′ ) = µ ′′ i } for i = 1 , . . . , k for i = 1 , . . . , k do ˜ µ i = algorithm D, D ′ i , ǫ G , δ G ) ; // D ′ i ( p ) indicates whether T ′ ( p ) ∈ D ′ i for p ∈ D end return ˜ S = { ˜ µ , . . . , ˜ µ k } Algorithm 2:
Private k -means11 ata: D ′ dataset (passed by reference), t i grid unit length, r i threshold radius Result: set C i ⊂ G i C i ← ∅ repeat k ′ times cover ← empty linked list V i ← { v : v ∈ N d ′ , P d ′ j =1 ( t i v j ) < ( r i + t i √ d ′ ) } for all p ∈ D ′ do for all v ∈ V i do for all s ∈ { , } d ′ do t i b = ⌊ p ⌋ ( i ) + t i s + (2 s − ¯1) t i v ; // where ¯1 is the all-ones vector if d ( t i b, p ) < ( r i + t i √ d ′ ) then cover [ t i b ] += { p } end end end end totalCover ← for g ∈ cover do totalCover += exp (cid:16) ǫ E | cover [ g ] | (cid:17) end totalCover += | G i | − len [ cover ] Let P samp = 1 − | G i | totalCover . if Ber ( P samp ) = 1 then g ← pick i ∈ [ len [ cover ]] ∼ P ( g ) ∝ exp (cid:16) ǫ E | cover [ g ] | (cid:17) − else g ← pick i uniformly at random from G i end C i ← C i ∪ { g } D ′ ← D ′ \ cover [ g ] end return C i Algorithm 3:
Private grid set cover12 et V i = { v : v ∈ N d ′ , P d ′ j =1 t i v j < ( r i + t i √ d ′ ) } . If t i b is a grid point such that d ( p, t i b ) < ( r i + t i √ d ′ ) thenfor some s ∈ { , } d and v ∈ V i , t i b = ⌊ p ⌋ ( i ) + t i s + (2 s − ¯1) t i v , where ¯1 = (1 , , . . . , , the d ′ -dimensionalall-ones vector.Proof. Informally, p j is a real number and the t i b j lie on regularly spaced intervals on the number line. Since ⌊ p ⌋ ( i ) j and ⌊ p ⌋ ( i ) j + t i are the two closest grid points to p , any grid point must be closer to one of theseneighbours than it is to p j .If p is within distance r i + t i √ d ′ of a grid point t i b for b ∈ Z d ′ then by definition d ′ X j =1 ( p j − t i b j ) ≤ ( r i + t i √ d ′ ) . If p j ≥ t i b j then p j = t i b j + t i x + y for some x ∈ N and y ∈ [0 , . Since ⌊ p ⌋ ( i ) j = t i b j + t i x , ( ⌊ p ⌋ ( i ) j − t i b j ) < ( p j − t i b j ) . Else, if p j < t i b j then p j = t i b j − t i x − y , with same ranges for x and y . Then ⌊ p ⌋ ( i ) j + t i = t i b j − t i x so ( ⌊ p ⌋ ( i ) j − t i b j + t i ) < ( p j − t i b j ) . Therefore we have that min(( ⌊ p ⌋ ( i ) j − t i b j ) , ( ⌊ p ⌋ ( i ) j − t i ( b j + 1)) ) < ( p j − t i b j ) . Summing up this inequality over the index j and using the display above gives us the desiredresult.Let s j = 0 if p j ≥ t i b j and s j = 1 if p j < t i b j . Tracing the proof of the first half and letting v be suchthat t i v j = min(( ⌊ p ⌋ ( i ) j − t i b j ) , ( ⌊ p ⌋ ( i ) j − t i ( b j + 1)) ) , it follows that t i b j = ⌊ p ⌋ ( i ) j + t i s j + (2 s j − v j for all j ∈ [ k ] . Putting together all coefficients this gives us that t i b = ⌊ p ⌋ ( i ) + t i s + (2 s − ¯1) t i v for some v ∈ V i . Claim 4.2.
After computing the cover of each grid point, algorithm 3 executes the exponential mechanismcorrectly and in polynomial time.Proof.
First we note that there are at most polynomially many grid points whose cover is updated inany call to algorithm 3. From claim 4.1 we know that for any data point the only grid points whosecover must be updated lie in V i . It will hence suffice to show that | V i | < n O (1 /ǫ ) . To get the numberof unsigned d ′ -dimensional ordered tuples v for which P i t i v i < ( r i + t i √ d ′ ) ⇔ P i v i < d ′ ( ǫ + 1) , itsuffices to count the number of ways of partitioning d ′ ( ǫ + 1) + d ′ + 1 balls into d ′ + 1 distinguishablebins. We can do this by placing the balls in a line and choosing d ′ gaps between them. It follows that | V | = 2 d ′ (cid:0) d ′ ( ǫ +1) + d ′ +1 d ′ +1 (cid:1) = O (cid:16) d ′ ǫ (cid:17) < n O (1 /ǫ ) , using that d ′ = O (cid:16) log nǫ (cid:17) .To execute the exponential mechanism, we want that the grid point g ∈ G i be sampled with the probability P ( g ) given by the expression P ( g ) = exp (cid:16) ǫ E | cover [ g ] | (cid:17)P h ∈ G i exp (cid:16) ǫ E | cover [ h ] | (cid:17) . Since all but polynomially many grid points { g : cover [ g ] = 0 } are being sampled with exactly the sameprobability, which also happens to be the smallest value any point is sampled with, we can use the law oftotal probability to write this sampling distribution as a uniform distribution on the entire grid with someprobability − P samp , and a second distribution with P ′ supported only on the polynomially many gridpoints with non-zero cover with probability P samp . P ( g ) = P samp P ′ ( g ) + (1 − P samp ) 1 | G i | Letting g be any grid point with cover [ g ] = ∅ , so that P ′ ( g ) = 0 , (1 − P samp ) 1 | G i | = 1 P h ∈ G i exp (cid:16) ǫ E | cover [ h ] | (cid:17) P samp = 1 − | G i | P h ∈ G exp (cid:16) ǫ E | cover [ h ] | (cid:17) Putting together the last 3 displays, we get an expression for P ′ ( g ) : P samp P ′ ( g ) = exp (cid:16) ǫ E | cover [ g ] | (cid:17) − P h ∈ G exp (cid:16) ǫ E | cover [ h ] | (cid:17) ⇒ P ′ ( g ) = exp (cid:16) ǫ E | cover [ g ] | (cid:17) − P h ∈ G i exp (cid:16) ǫ E | cover [ h ] | (cid:17) P h ∈ G exp (cid:16) ǫ E | cover [ h ] | (cid:17)P h ∈ G i exp (cid:16) ǫ E | cover [ h ] | (cid:17) − | G i | = exp (cid:16) ǫ E | cover [ g ] | (cid:17) − P h ∈ G i exp (cid:16) ǫ E | cover [ h ] | (cid:17) − | G i | Suppressing the normalization, the derived expression can be summarised as P ′ ( g ) ∝ exp (cid:16) ǫ E | cover [ g ] | (cid:17) − . The outline of the utility analysis is as follows; we know that the optimal k -means solution OPT D ′ leads to k clusters, each of which has some radius between and . We will try to catch these clusters at thresholdradii r = (1 /n ) , r = (1 + ǫ )(1 /n ) , r = (1 + ǫ ) (1 /n ) , . . . , r m = 2 for i = 1 , . . . m = log ǫ n . If o i isthe number of points in D ′ such that for all p ∈ o i , d ( p, D ′ ) ∈ [ r i − , r i ) , then we can relate the cost of theoptimal k -means solution as P mi =1 | o i | r i ǫ ≤ f D ′ ( OPT D ′ ) < m X i =1 | o i | r i . We show that the sets a i , i.e. the set of points covered within r i + t i √ d ′ is close to the number of pointsthat lie within r i of their closest mean in OPT D ′ (lemma 5.1). It will then follow that moving each datapoint to its closest point in C = ∪ mi =1 C i will lead to moving datapoints a total distance of ∼ f D ′ ( OPT D ′ ) (lemma 5.2). 
Similarly it will also follow that the proxy dataset D ′′ constructed by enumerating points in C with multiplicity the number of points moved to them will have a similar clustering cost (lemma 5.5), andthat cluster centers for D ′′ also work well as cluster centers for D ′ (lemma 5.7). We identify points in D withtheir images in D ′ under T ′ and show that noisy averaging of clusters in D so found gives us a good solutionfor the k -means problem for D (theorem 5.9). Lemma 5.1.
With probability − γ | a l | ≥ (1 − ǫ ) | o l | − O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) . where ǫ E is the privacy parameter used in the exponential mechanism.Proof. Since the ℓ distance between µ and ⌊ µ ⌋ ( l ) is at most t l √ d ′ , it follows from the definition of o l that o l ⊂ B r l + t l √ d ′ ( {⌊ µ ⌋ : µ ∈ OPT D ′ } ) . We can hence apply lemma 2.7 with the promise that the set of k ballswith centers in { arg min g ∈ G ′ d ( g, c ) : c ∈ OPT D ′ } and radii r l + t l √ d ′ cover the set o l , and that the k balls liein the family of sets { B r l + t l √ d ′ ( g ) : g ∈ G l } .We let h denote the submodular function | o l ∩ B t l + r l √ d ( · ) | and ∆ i h denote the marginal utility function,i.e. the increase in h when picking the i th element from the domain and adding it to the set of i − elements14lready picked. If g EMi is the i th element picked by the exponential mechanism, the lemma 2.2 guaranteegives us that with probability − γ , ∆ i h ( g EMi ) ≥ max g ∈ G l \{ g EM ,...,g EMi − } ∆ i h ( g ) − ǫ E log | G l | γ . This implies that when max g ∈ G l \{ g EM ,...,g EMi − } ∆ i h ( g ) ≥ ǫ E log | G l | γ , with probability − γ , ∆ i h ( g EMi ) ≥ max g ∈ G l \{ g EM ,...,g EMi − } ∆ i h ( g )2 . (1)Let g EM , . . . , g EMk ′ be the grid points chosen by the exponential mechanism in the course of algorithm 3.Note that this implies B r i + t i √ d ′ ( { g EM , . . . , g EMk ′ } ) ∩ o l ⊂ a l ⇒ h ( { g EM , . . . , g EMk ′ } ) ≤ | a l | . (2)If j is the greatest index for which max g ∈ G l \{ g EM ,...,g EMi − } ∆ j h ( g ) ≥ ǫ E log | G l | γ (noting that the maximum possiblemarginal increase in cover is non-increasing), then let g MAXj +1 , . . . , g MAXk ′ be the grid points with maximalmarginal utility in the round indicated by the subscript. Combining lemma 2.7 with eq. (1) gives us thatwith probability at least − jγ ≥ − k ′ γh ( { g EM , . . . , g EMj , g
MAXj +1 , . . . , g MAXk ′ } ) ≥ (1 − ǫ ) | o l |⇒ j X i =1 ∆ i h ( g EMi ) + k ′ X i = j +1 ∆ i h ( g MAXi ) ≥ (1 − ǫ ) | o l |⇒ j X i =1 ∆ i h ( g EMi ) + k ′ X i = j +1 ǫ E log | G l | γ ≥ (1 − ǫ ) | o l |⇒ h ( { g EM , . . . , g EMk ′ } ) + 4 k ′ ǫ E log | G l | γ ≥ (1 − ǫ ) | o l | We recall that | G l | is j t l k d ′ , and that t l is at least ǫn √ d ′ . Using this, and eq. (2), we see that with probability − k ′ γ , | a l | ≥ (1 − ǫ ) | o l | − O k ′ d ′ ǫ E log n √ d ′ γǫ ! . We absorb the k ′ = kǫ factor into the failure probability γ and noting that d ′ = O (cid:16) log nǫ (cid:17) , k ′ = kǫ and k < n gives us | a l | ≥ (1 − ǫ ) | o l | − O (cid:18) k log nǫ · ǫ E log kn √ log nγǫ (cid:19) ≥ (1 − ǫ ) | o l | − O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) .
15e see that although the cluster radii thresholds are r i for i = 1 . . . m , discretization leads to slightlyinflated cluster radii r i + t i √ d = (1 + ǫ ) r i . This also leads to a slight multiplicative inflation in the totalmovement, showing up as the (1 + ǫ ) coefficient of P mi =1 | a i | r i in the following result bounding the totalmovement of points when constructing the proxy dataset D ′′ (without addition of noise). Lemma 5.2.
The total movement of points p ∈ D ′ to the closest point grid [ p ] ∈ C is bounded by the followinginequalities X p ∈ D ′ d ( p, grid [ p ]) ≤ (1 + ǫ ) m X i =1 | a i | r i ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) . We break this proof down into a couple of smaller steps.
Lemma 5.3.
The thresholded cost obeys the bound m X i =1 | o i | r i ≤ (1 + ǫ ) f D ′ ( OPT D ′ ) + 1 . (3) Proof.
We recall that o i = { p ∈ D ′ : d ( p, OPT D ′ ) ∈ [ r i − , r i ) } . Since D ′ ∈ B (0) , and r m = 2 , ∪ mi =1 a i = D ′ .From this we have that f ( OPT D ′ ) = m X i =1 X p ∈ o i d ( p, OPT D ′ )= X p ∈ o d ( p, OPT ) + m X i =2 X p ∈ o i d ( p, OPT D ′ ) . We note that for d ( p, OPT D ′ ) ∈ [0 , r ) = [0 , /n ) ⇒ n + d ( p, OPT D ′ ) > r , since r = 1 /n so X p ∈ o n + d ( p, OPT D ′ ) > X p ∈ o r ⇒ X p ∈ o d ( p, OPT D ′ ) ≥ | o | r . For d ( p, OPT D ′ ) ∈ [ r i − , r i ) for i = 1 , since r i r i − = (1 + ǫ ) , it follows that d ( p, OPT D ′ ) > r i ǫ , so summing overall such p we have X d ( p, OPT D ′ ) ∈ [ t i − ,t i ) d ( p, OPT D ′ ) > | o i | r i ǫ . From the last two displays we have m X i =1 | o i | r i ≤ (1 + ǫ ) f D ′ ( OPT D ′ ) + 1 . Lemma 5.4.
The ideal thresholded cost P mi =1 | o i | r i can be related to the achieved thresholded cost P mi =1 | a i | r i by the following inequality m X i =1 | a i | r i ≤ − ǫ − ǫ − ǫ m X i =1 ( | o i | r i ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) . (4)16 roof. We define O i = P mj = i | o j | and A i = P mj = i | a j | . We then have m X i =1 | a i | r i = m X i =1 A i ( r i − r i − ) . (5)We note that centers in OPT D ′ cover n − O i +1 points at a maximum distance of r i . We also know thatalgorithm 2 has already covered n − A i points at a distance of r i − + t i − √ d ′ . It then follows that there aresome k grid points in G i (snapping the centers in OPT D ′ to grid) that cover at least ( n − O i +1 ) − ( n − A i ) = A i − O i +1 uncovered points at a distance of at a distance of r i + t i √ d ′ . From the lemma 5.1 guarantee, weknow that | a i | ≥ (1 − ǫ )( A i − O i +1 ) − E, where E = O (cid:16) k log nǫ E · poly( ǫ ) log nγ (cid:17) . Since A i = | a i | + A i +1 , we have that | a i | ≥ (cid:18) − ǫǫ (cid:19) ( A i +1 − O i +1 ) − Eǫ ⇒ A i +1 ≤ (cid:18) ǫ − ǫ (cid:19) | a i | + O i +1 + E − ǫ . Substituting this in eq. (5), we continue as follows: m X i =1 | a i | r i = m X i =1 A i ( r i − r i − ) ≤ m X i =1 (cid:18)(cid:18) ǫ − ǫ (cid:19) | a i − | + O i + E − ǫ (cid:19) ( r i − r i − )= m X i =1 (cid:18) ǫ | a i − | − ǫ (cid:19) ( r i − r i − ) + m X i =1 | o i | r i + E − ǫ ( r m − r ) ≤ m X i =1 (cid:18) ǫ | a i − | − ǫ (cid:19) r i − + m X i =1 | o i | r i + 2 E − ǫ ⇒ m X i =1 | a i | r i (cid:18) − ǫ − ǫ (cid:19) ≤ m X i =1 | o i | r i + 2 E − ǫ ⇒ m X i =1 | a i | r i ≤ − ǫ − ǫ − ǫ m X i =1 | o i | r i + 2 E − ǫ − ǫ Substituting the order term E and using that ǫ is bounded away from , we get the desired inequality. Proof of lemma 5.2.
We recall that a i = D ′ ∩ B r i + t i √ d ′ ( C i ) . Since D ′ ∈ B (0) , and r m = 2 , ∪ mi =1 a i = D ′ . Itfollows that X p ∈ D ′ d ( p, grid [ p ]) = m X i =1 X p ∈ a i d ( p, grid [ p ]) ≤ m X i =1 X p ∈ a i r i + t i √ d ≤ m X i =1 | a i | ( r i + t i √ d ) (1 + ǫ ) m X i =1 | a i | r i This proves the first inequality. From eq. (3) and eq. (4) we see that P mi =1 | a i | r i obeys the inequality m X i =1 | a i | r i ≤ − ǫ − ǫ − ǫ ((1 + ǫ ) f D ′ ( OPT D ′ ) + 1) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) ⇒ (1 + ǫ ) m X i =1 | a i | r i ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) ((1 + ǫ ) f D ′ ( OPT D ′ ) + 1) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) (6)with probability − γ . Absorbing smaller order terms into the additive error and using lemma 5.2, we get m X i =1 d ( p, grid [ p ]) ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) . In lemma 5.5 we bound the k -means cost of the proxy dataset D ′′ in terms of the k -means cost of D ′ . Lemma 5.5.
With probability − γ , f D ′′ ( OPT D ′ ) ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) . Since D ′′ is constructed by using noisy counts of candidate centers, we defer its proof and first derive atechnical lemma to bound the error added by the privacy preserving algorithm. Lemma 5.6.
Let X , . . . X k ′ be i.i.d. Lap ( ǫ L ) random variables. Then for any positive constants c , . . . , c k ′ ,with probability − γ , X := k ′ X i =1 c i | X i | ≤ (log 2) k ′ X i =1 c i ǫ L + 2 max i c i ǫ L log 1 /γ. Proof.
The moment generating function of c | X | for X ∼ Lap ( ǫ L ) is ǫ L ǫ L − ct = − ctǫL for | ct | < ǫ L . It thenfollows that for t ≤ ǫ L max i c i , M X ( t ) = k ′ Y i =1 − c i tǫ L . We use the Chernoff bound P ( X > ∆ n ) = P ( e tX > e t ∆ n ) ≤ E [ e tX ] e t ∆ n ≤ M X ( t ) e t ∆ n . If we require that this event occur with probability at most γ , we derive a bound on ∆ n that would suffice. k ′ Y i =1 − c i tǫ L ! · e t ∆ n ≤ γ k ′ X i =1 log − c i tǫ L ! − t ∆ n ≤ log γ ⇔ t ∆ n ≥ − k ′ X i =1 log (cid:18) − c i tǫ L (cid:19) + log 1 /γ For c i tǫ L < . , − log (cid:16) − c i tǫ L (cid:17) < (log 2) c i tǫ L by convexity of − log(1 − x ) in the argument x . Restricting t to [0 , ǫ L i c i ] , we continue. ⇐ t ∆ n ≥ log 2 k ′ X i =1 c i tǫ L + log 1 /γ ⇔ ∆ n ≥ log 2 k ′ X i =1 c i ǫ L + 1 t log 1 /γ Setting t = ǫ L i c i , we get the desired result. Proof of lemma 5.5. In D ′′ each grid point occurs with multiplicity n ′ c = n c + X c where X c ∼ Lap (cid:16) ǫ L (cid:17) . f D ′′ ( OPT D ′ ) = X p ∈ D ′′ d ( p, OPT D ′ )= X c ∈ C n ′ c d ( c, OPT D ′ )= X c ∈ C (cid:18) n c d ( c, arg min µ ∈ D ′′ d ( c, µ )) + X c d ( c, arg min µ ∈ D ′′ d ( c, µ )) (cid:19) ≤ X c ∈ C n c d ( c, arg min µ ∈ OPT D ′ d ( c, µ )) + X c ∈ C | X c | · Using that points in D ′′ are enumerated by running over centers in c ∈ C with multiplicity n c , and byapplying lemma 5.6, we get that with probability − γ , f D ′′ ( OPT D ′ ) ≤ X p ∈ D ′ d ( grid [ p ] , arg min µ ∈ OPT D ′ d ( grid [ p ] , µ )) + O (cid:18) k log nǫ · ǫ L (cid:19) + O (cid:18) log 1 /γǫ L (cid:19) ≤ X p ∈ D ′ (cid:18) d ( grid [ p ] , p ) + d ( p, arg min µ ∈ OPT D ′ d ( grid [ p ] , µ )) (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) . In the above we drop log 1 /γ as it is asymptotically dominated by the other error term for any failureprobability polynomially small in n . Using lemma 5.2 to simplify the first term, we have that with probability − γ , f D ′′ ( OPT D ′ ) ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + 2 f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) . In lemma 5.7 we bound the error incurred on using the k -means solution for the proxy dataset D ′′ forthe dataset D ′ . 19 emma 5.7. Let A be the clustering algorithm used in line 20 of algorithm 2. If A has the utility guarantee f S ( A ( S )) ≤ E M · f S ( OPT S ) + E A then f D ′ ( A ( D ′′ )) ≤ (8 E M + 2 + (8 E M + 4) ǫ ) f D ′ ( OPT D ′ ) + 2 E A + O (cid:18) k log nǫ E log knγ (cid:19) . Proof.
From the clustering algorithm guarantee we have that f D ′′ ( A ( D ′′ )) ≤ E M · f D ′′ ( OPT D ′′ ) + E A . By definition, f D ′′ ( OPT D ′′ ) < f D ′′ ( OPT D ′ ) . Substituting the bound from lemma 5.5, we get f D ′′ ( A ( D ′′ )) ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1) f D ′ ( OPT D ′ ) + 2 E A + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) . So then f D ′ ( A ( D ′′ )) = X p ∈ D ′ d ( p, arg min µ ∈A ( D ′′ ) d ( p, µ )) ≤ X p ∈ D ′ d ( p, arg min µ ∈A ( D ′′ ) d ( grid [ p ] , µ )) ≤ X p ∈ D ′ d ( p, grid [ p ]) + d ( grid [ p ] , arg min µ ∈A ( D ′′ ) d ( grid [ p ] , µ )) ! ≤ X p ∈ D ′ d ( p, grid [ p ]) + 2 f D ′′ ( A ( D ′′ )) ≤ · (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) +2 E M · (cid:18) ǫ − ǫ − ǫ (cid:19) f D ′ ( OPT D ′ ) + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) + 2 E A ≤ (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1) f D ′ ( OPT D ′ ) + 2 E A + O (cid:18) k log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k log nǫ L · poly( ǫ ) (cid:19) . To complete the utility analysis, we need to account for the projection and scaling as well as the Gaussiannoise added to maintain privacy. In lemma 5.8 we derive an expression for the utility without accounting forany noise in and theorem 5.9 we derive an expression for the net utility guarantee of algorithm 2.
Lemma 5.8.
If we cluster D according to its projected and scaled version D ′ , we get a set S of size k suchthat f D ( S ) ≤ (1 + ǫ ) (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1) f D ( OPT D ) + 2(1 + ǫ )∆ E A ++ O (cid:18) k ∆ log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k ∆ log nǫ L · poly( ǫ ) (cid:19) . Proof.
To scale the data from a ball that lies within a diameter of ∆ to a diameter of , we note that theclustering cost is multiplied by a factor of ∆ / . To account for the projection we recall that the k -means20ost can be expressed without explicit reference to the means themselves. We let { D ′ i : i ∈ [ k ] } be thepartition of the data D ′ into k clusters, where the i th cluster is centered at µ ′ i and S ′ = { µ ′ i : i ∈ [ k ] } . f D ′ ( S ′ ) = X i ∈ [ k ] X p ∈ D ′ i d ( p, µ ′ i )= X i ∈ [ k ] X p ∈ D ′ i k p − µ ′ i k = X i ∈ [ k ] | D ′ i | X p = q ∈ D ′ i k p − q k . Since the Johnson Lindenstrauss transform preserves the ℓ norm squared within a multiplicative factor of (1 ± ǫ ) , it follows from the display above that the cost of clustering D according to its image D ′ is at most (1 + ǫ ) f D ′ ( S ′ ) . Denoting the cluster centers derived in this fashion by S , this gives us f D ( S ) ≤ (1 + ǫ ) (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1)∆ f D ′ ( OPT D ′ ) + ∆ E A O (cid:18) k ∆ log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k ∆ log nǫ L · poly( ǫ ) (cid:19) ⇒ f D ( S ) ≤ (1 + ǫ ) (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1) f D ( OPT D ) + 2(1 + ǫ )∆ E A ++ O (cid:18) k ∆ log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k ∆ log nǫ L · poly( ǫ ) (cid:19) , where we use that ∆ f D ′ ( OPT D ′ ) = f D ( OPT D ) . Theorem 5.9.
Algorithm 2 returns a set of points ˜ S such that E h f D ( ˜ S ) i ≤ (1 + ǫ ) (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1) f D ( OPT D ) + 2(1 + ǫ )∆ E A ++ O (cid:18) k ∆ log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k ∆ log nǫ L · poly( ǫ ) (cid:19) + O k ∆ p d log 1 /δ G ǫ G ! + O (cid:18) k ∆ log n/δ G ǫ G (cid:19) . Proof.
The final set of points returned, denoted ˜ S , is obtained by using algorithm 1. From the statementof theorem 2.6, we know that for the i th cluster if | D i | ≥ A (cid:16) ǫ G log (cid:16) nkδ G (cid:17)(cid:17) with sufficiently large constant A then with probability − kn , algorithm 1 returns µ i + g i where g i is sampled from N (0 , σ ) for some σ < ǫ G | D i | p /δ G ) . We let c = 4 p /δ G ) so that σ < c ∆ ǫ G | D i | . We can upper bound the clusteringcost by assuming that cluster sets remain the same even with the noisy means, and then add up the costcluster by cluster. f D ( ˜ S ) = X p ∈ D d ( p, ˜ S ) ≤ X i ∈ [ k ] X p ∈ D i d ( p, ˜ µ i ) X p ∈ D i d ( p, ˜ µ i ) = X p ∈ D i k x − ˜ µ i k = X p ∈ D i k p − µ i + g i k X p ∈ D i h p − µ i + g i , p − µ i + g i i = X p ∈ D i k p − µ i k + * X p ∈ D i p − µ i , g i + + X p ∈ D i k g i k = f D i ( { µ i } ) + h ( | D i | − µ i , g i i + X i | D i |k g i k , where in the last step we use that P p ∈ D i p = | D i | µ . If | D i | ≥ A (cid:16) ǫ G log (cid:16) nkδ G (cid:17)(cid:17) for sufficiently large constant A , then taking the expectation, we get ⇒ E X p ∈ D i d ( p, ˜ S ) ≤ f D i ( { µ i } ) + | D i | E d X j =1 k g i k ≤ f D i ( { µ i } ) + | D i | (cid:18) c ∆ | D i | ǫ G (cid:19) d ≤ f D i ( { µ i } ) + c ∆ | D i | ǫ G d. If | D i | ≥ c √ dǫ G then this is at most f D i ( { µ i } ) + c ∆ ǫ G √ d . On the other hand, if | D i | < c √ dǫ G , we observe that theclustering cost f D i (˜ µ i ) can be at most c √ dǫ G ∆ unconditionally. Similarly if | D i | = O (cid:16) ǫ G log (cid:16) nkδ G (cid:17)(cid:17) , then theclustering cost f D i (˜ µ i ) can be at most O (cid:16) ǫ G log (cid:16) nkδ G (cid:17) ∆ (cid:17) . With probability − n the large cluster costbound holds for all clusters simultaneously and we then have E X p ∈ D d ( p, ˜ S ) ≤ X i ∈ [ k ] E X p ∈ D i d ( p, ˜ µ i ) ≤ (cid:18) − n (cid:19) X i : | D i |≥√ d f D i ( { µ i } ) + c ∆ ǫ G √ d + 1 n ∆ n + X i : | D i | < √ d c √ dǫ G ∆ + X i : | D i | < ǫG log (cid:16) nkδG (cid:17) O (cid:18) ǫ G log (cid:18) nkδ G (cid:19) ∆ (cid:19) ≤ X i ∈ k f D i ( { µ i } ) + k c ∆ ǫ G √ d ! + ∆ + k ∆ c √ dǫ G + O (cid:18) kǫ G log (cid:18) nkδ G (cid:19) ∆ (cid:19) = f D ( S ) + O k ∆ p d log 1 /δ G ǫ G ! + O (cid:18) k ∆ log n/δ G ǫ G (cid:19) . Substituting the bound on f D ( S ) from lemma 5.8, we get E h f D ( ˜ S ) i ≤ (1 + ǫ ) (cid:18) ǫ − ǫ − ǫ (cid:19) ( E M + 1) f D ( OPT D ) + 2(1 + ǫ )∆ E A ++ O (cid:18) k ∆ log nǫ E · poly( ǫ ) log nγ (cid:19) + O (cid:18) k ∆ log nǫ L · poly( ǫ ) (cid:19) + O k ∆ p d log 1 /δ G ǫ G ! + O (cid:18) k ∆ log n/δ G ǫ G (cid:19) . Privacy
Privacy

The main result of this section is the following:
Theorem 6.1.
Algorithm 2 is $\left(e\epsilon_E \ln \delta_E^{-1} + \epsilon_L + \epsilon_G,\; \delta_E + \delta_G\right)$-differentially private.

From the basic (theorem 2.4) and parallel (theorem 2.5) composition laws of differential privacy, together with the privacy guarantees of the Laplace mechanism (lemma 2.3) and algorithm 1 (theorem 2.6), most of the expression for the bound on privacy loss claimed in this result follows relatively straightforwardly. To bound the privacy loss incurred in the calls to algorithm 3, we adapt a technique from [10]. We use this technique in the following lemma to show that the privacy loss from using the exponential mechanism many times successively can be bounded in terms of the sum of expected gains in the cover. For the set cover function, this sum of expected gains can be shown to decay exponentially using lemma 2.9, which leads to a strong bound on the privacy loss.
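For concreteness, the following sketch (our own illustration, with hypothetical names such as exp_mech_pick) implements one draw of the exponential mechanism over a candidate grid, scoring each candidate by the number of still-uncovered points it would cover, in the spirit of the mechanism analysed below.

```python
import numpy as np

def exp_mech_pick(candidates, data, covered, radius, eps_e, rng):
    """One exponential-mechanism step: sample a grid point with probability
    proportional to exp(eps_e/2 * |newly covered points|)."""
    # Utility of each candidate: number of yet-uncovered points within `radius`.
    scores = np.array([
        np.sum(~covered & (np.linalg.norm(data - c, axis=1) <= radius))
        for c in candidates
    ])
    # Subtract the max score before exponentiating for numerical stability;
    # this leaves the sampling distribution unchanged.
    weights = np.exp(eps_e / 2 * (scores - scores.max()))
    probs = weights / weights.sum()
    return rng.choice(len(candidates), p=probs)

rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2))
candidates = rng.uniform(-3, 3, size=(50, 2))   # stand-in for the grid
covered = np.zeros(len(data), dtype=bool)
idx = exp_mech_pick(candidates, data, covered, radius=0.5, eps_e=1.0, rng=rng)
```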
Lemma 6.2.
The subroutine in lines 8–12 of algorithm 2 that constructs the set of centers $C$ (over $m$ iterations) is $\left(e\epsilon_E \ln \delta_E^{-1},\; \delta_E\right)$-differentially private.

Proof. Let $A$ and $B$ be two neighbouring datasets, i.e. $A \,\triangle\, B = \{I\}$. To show that this subroutine (denoted $\mathcal A$) is $(\epsilon, \delta)$-differentially private, we need to show that the ratio $P(\mathcal A(A) = C) / P(\mathcal A(B) = C)$ is bounded from above by $e^\epsilon$ with probability $1 - \delta$, where $C$ is an arbitrary sequence of grid points $c_1, \ldots, c_{km/\epsilon}$ that might be picked in the thresholded max-cover subroutine. We have

$$\frac{P(\mathcal A(A) = C)}{P(\mathcal A(B) = C)} = \prod_{i=1}^{km/\epsilon} \frac{P(\mathcal A(A)_i = c_i \mid c_1, \ldots, c_{i-1})}{P(\mathcal A(B)_i = c_i \mid c_1, \ldots, c_{i-1})}, \qquad P(\mathcal A(A)_i = c_i \mid c_1, \ldots, c_{i-1}) = \frac{\exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_A[c_i]|\right)}{\sum_g \exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_A[g]|\right)}$$
$$\Rightarrow\; \frac{P(\mathcal A(A) = C)}{P(\mathcal A(B) = C)} = \prod_{i=1}^{km/\epsilon} \frac{\exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_A[c_i]|\right)}{\exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_B[c_i]|\right)} \cdot \prod_{i=1}^{km/\epsilon} \frac{\sum_g \exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_B[g]|\right)}{\sum_g \exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_A[g]|\right)}.$$

If $A \setminus B = \{I\}$, then the second factor is at most $1$ and the first factor is at most $\exp\left(\frac{\epsilon_E}{2}\right)$, since $\mathrm{cover}_A[c_i] \setminus \mathrm{cover}_B[c_i]$ can contain at most the data point $I$, and that too for at most one index $i$, since cover counts only yet-uncovered data points. Conversely, if $B \setminus A = \{I\}$, then the first factor is at most $1$ and we need to bound the second factor. We observe that this ratio of sums can be written as an expectation by factoring out the indicator of $I$, as follows (using $e^x \le 1 + ex$ for $x \in [0,1]$):

$$\prod_{i=1}^{km/\epsilon} \frac{\sum_g \exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_B[g]|\right)}{\sum_g \exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_A[g]|\right)} = \prod_{i=1}^{km/\epsilon} \mathbb{E}_{g \propto \exp\left(\frac{\epsilon_E}{2} |\mathrm{cover}_A[g]|\right)} \left[ \exp\left(\frac{\epsilon_E}{2} \mathbf{1}_{I \in \mathrm{cover}_B[g]}\right) \right] \le \prod_{i=1}^{km/\epsilon} \left( 1 + \frac{e \epsilon_E}{2}\, \mathbb{E}\left[\mathbf{1}_{I \in \mathrm{cover}_B[g]}\right] \right) \le \prod_{i=1}^{km/\epsilon} \exp\left( \frac{e \epsilon_E}{2}\, \mathbb{E}\left[\mathbf{1}_{I \in \mathrm{cover}_B[g]}\right] \right) = \exp\left( \frac{e \epsilon_E}{2} \sum_{i=1}^{km/\epsilon} \mathbb{E}\left[\mathbf{1}_{I \in \mathrm{cover}_B[g]}\right] \right).$$

To bound the sum of expectations in the exponent, we use lemma 2.9 with $R_i = \mathbf{1}_{I \in \mathrm{cover}_B[c_i]}$ and $p_i = \mathbb{E}[R_i]$ if $I$ has not been picked by the $(i-1)$-th round, and $R_i = \mathrm{Ber}(0)$ otherwise. We see that $Z_j = \prod_{i=1}^{j} (1 - R_i)$ then simply indicates the event that $I$ has not been covered by the $j$-th round. With these definitions, $\sum_{i=1}^{km/\epsilon} p_i Z_i = \sum_{i=1}^{km/\epsilon} \mathbb{E}\left[\mathbf{1}_{I \in \mathrm{cover}_B[g]}\right]$ and

$$P\left[ \sum_{i=1}^{km/\epsilon} \mathbb{E}\left[\mathbf{1}_{I \in \mathrm{cover}_B[g]}\right] > q \right] < \exp(-q).$$

If $\sum_{i=1}^{km/\epsilon} \mathbb{E}\left[\mathbf{1}_{I \in \mathrm{cover}_B[g]}\right] < q$, then we say that the sequence $C$ is $q$-good; if a sequence is not $q$-good, it is called $q$-bad. Letting $q = \ln \delta_E^{-1}$, we see that the probability of an arbitrary sequence being $\ln \delta_E^{-1}$-good is at least $1 - \delta_E$. This means that with probability $1 - \delta_E$,

$$\frac{P(\mathcal A(A) = C)}{P(\mathcal A(B) = C)} \le \exp\left( \frac{e \epsilon_E}{2} \ln \delta_E^{-1} \right) \le \exp\left( e \epsilon_E \ln \delta_E^{-1} \right).$$

Putting everything together (in either case the ratio is at most $\exp\left(e\epsilon_E \ln \delta_E^{-1}\right)$ with probability at least $1 - \delta_E$), we see that this subroutine satisfies $\left(e\epsilon_E \ln \delta_E^{-1},\; \delta_E\right)$-differential privacy.
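Reusing exp_mech_pick from the previous sketch, the loop below illustrates the shape of the subroutine that lemma 6.2 analyses: covered points are removed from all later utilities, so each data point can contribute a marginal gain at most once. This is a sketch of the structure only, not of algorithm 2's exact thresholding.

```python
import numpy as np

# Assumes exp_mech_pick from the previous sketch is in scope.
def private_max_cover(candidates, data, radius, eps_e, num_rounds, rng):
    """Greedy private cover: each round picks one center by the exponential
    mechanism, then marks the points it covers so they contribute no further
    marginal gain in subsequent rounds."""
    covered = np.zeros(len(data), dtype=bool)
    centers = []
    for _ in range(num_rounds):
        idx = exp_mech_pick(candidates, data, covered, radius, eps_e, rng)
        centers.append(candidates[idx])
        covered |= np.linalg.norm(data - candidates[idx], axis=1) <= radius
    return centers, covered
```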
Proof of theorem 6.1. We divide the privacy analysis into two halves. First, we bound the loss in privacy that occurs when constructing the proxy dataset $D''$. From lemma 6.2 we know that in the $m$ calls to algorithm 3, the net loss in privacy is $\left(e\epsilon_E \ln \delta_E^{-1}, \delta_E\right)$. In the calculation of the noisy counts, we see that two neighbouring datasets can differ in their true counts by at most 1 unit, at one center of $C$, from which it follows that the $\ell_1$ sensitivity of the tuple of all counts is 1 unit; this justifies the choice of parameter in the Laplace mechanism. Using the basic composition theorem 2.4 along with the privacy loss bound for the Laplace mechanism (lemma 2.3), we see that the net loss in privacy on releasing the proxy dataset $D''$ is $\left(e\epsilon_E \ln \delta_E^{-1} + \epsilon_L, \delta_E\right)$.

We now have that $D''$ is publicly known and that the low-dimensional domain can be partitioned by identifying each point in the domain with the closest point in the set returned by the non-private clustering algorithm used (a Voronoi diagram). In the second half of the analysis we use the parallel composition theorem (theorem 2.5) of [18] along with algorithm 1 (theorem 2.6). Since each application of algorithm 1 on the separate clusters is $(\epsilon_G, \delta_G)$-differentially private, we apply parallel composition (theorem 2.5) to conclude that the net privacy loss over all $k$ applications is still $(\epsilon_G, \delta_G)$. Using basic composition, we conclude that algorithm 2 is $\left(e\epsilon_E \ln \delta_E^{-1} + \epsilon_L + \epsilon_G,\; \delta_E + \delta_G\right)$-differentially private.
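As a small bookkeeping aid (our own illustration, not part of algorithm 2), the composed guarantee of theorem 6.1 can be tabulated directly from the stage-wise parameters.

```python
import math

def total_privacy(eps_e, delta_e, eps_l, eps_g, delta_g):
    """Basic composition of the three stages bounded above: the
    exponential-mechanism subroutine (lemma 6.2), the Laplace counts,
    and the parallel-composed noisy averages."""
    eps_cover = math.e * eps_e * math.log(1.0 / delta_e)   # lemma 6.2
    return eps_cover + eps_l + eps_g, delta_e + delta_g

# Example with illustrative stage-wise values (ours, not the paper's settings).
print(total_privacy(eps_e=0.02, delta_e=1e-6, eps_l=0.3, eps_g=0.3, delta_g=1e-6))
```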
Experiments

Figure 1: Empirical comparison of algorithm 2 and the private k-means clustering algorithm from [4]: k-means objective versus the number of centers at ε = 1, on the synthetic dataset and on MNIST, for algorithm 2, non-private Lloyd's, and Balcan et al.

In this section we present an experimental comparison between algorithm 2, the differentially private k-means clustering algorithm from [4], and the non-private Lloyd's algorithm. Although there are other works with strong theoretical guarantees (such as [13]), we are not aware of any implementations of those methods. The comparison here is done for two datasets: a synthetic dataset reproducing the construction in [4], and the MNIST dataset [14].

The empirical results shown here for Balcan et al.'s algorithm [4] come largely from their MATLAB implementation available on GitHub. Some corrections were made to the implementation of [4]: although the pseudocode uses a noisy count of the cluster sizes when computing the noisy average of the clusters found, their implementation used the non-private exact count. We replaced this subroutine with algorithm 1, so as to use the best method we know for privately computing the average.

Implementation details: The privacy parameters were set to ε = 1 and δ polynomially small in n for both algorithms. For each algorithm and dataset we let the number of centers k take several values starting from k = 2. Our implementation of the algorithm, similar to [4], projects to a subspace of dimension on the order of log(n) rather than the full O(log(n)/ε²); note that this does not have any effect upon the privacy guarantee.

At the conclusion of both algorithms, we run one round of the differentially private Lloyd's algorithm; adding this call yielded better empirical results both for the algorithm of [4] and for ours. The addition of these rounds of Lloyd's requires adjusting the privacy parameters by a constant factor but otherwise does not affect the privacy guarantees of the original algorithms. Although [4] satisfies (ε, 0)-differential privacy and hence uses the Laplace mechanism for its noisy average, we replaced this step with the noisyAVG routine of [19] for a fair comparison. The non-private Lloyd's algorithm was executed with 10 iterations. Figure 1 records the averages and standard deviations over five runs of each experiment.
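A minimal rendering of the dimension-reduction step described above might look as follows; the exact target dimension and scaling here are our placeholder choices, not the implementation's.

```python
import numpy as np

def jl_project(data, target_dim, rng):
    """Project onto a random low-dimensional subspace.

    The Gaussian projection matrix is drawn independently of the data, so
    applying it consumes no privacy budget; the 1/sqrt(target_dim) scaling
    keeps squared Euclidean distances approximately preserved.
    """
    d = data.shape[1]
    proj = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    return data @ proj

rng = np.random.default_rng(3)
data = rng.normal(size=(10_000, 784))          # MNIST-like shape, subsampled
target_dim = max(2, int(np.log(len(data))))    # placeholder, on the order of log n
low_dim = jl_project(data, target_dim, rng)
```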
Datasets: The synthetic dataset comprises 50000 points randomly sampled from a mixture of 64 Gaussians, reproducing the construction in [4]. The MNIST dataset uses the raw pixels; it comprises 70000 points with 784 features each.
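A dataset in the spirit of the synthetic one above can be generated in a few lines; the ambient dimension, cluster spread, and uniform mixture weights here are our assumptions, since the exact construction follows [4].

```python
import numpy as np

rng = np.random.default_rng(4)
num_points, num_clusters, dim = 50_000, 64, 10   # dim and spread are our guesses

# Mixture of 64 spherical Gaussians around random, well-separated centers.
true_centers = rng.uniform(-1.0, 1.0, size=(num_clusters, dim))
labels = rng.integers(num_clusters, size=num_points)
data = true_centers[labels] + rng.normal(scale=0.02, size=(num_points, dim))

def kmeans_objective(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    best = np.full(len(points), np.inf)
    for c in centers:                            # loop keeps memory modest
        best = np.minimum(best, ((points - c) ** 2).sum(axis=1))
    return best.sum()

print(kmeans_objective(data, true_centers))      # objective at the true centers
```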
Results: As can be seen in fig. 1, our algorithm achieves a lower k-means objective than that of [4] on both the synthetic dataset and the MNIST dataset. As in the experimental results of [4], increasing the number of centers decreases the cost of the non-private algorithm but did not produce a concomitant decrease in the cost of the private algorithms. This behavior suggests that the algorithms are limited by their additive errors, and that further decreasing those errors, even in the constants, would narrow the gap with their non-private counterparts.

References

[1] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15–28. Springer, 2009.

[2] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17-97–FOCS17-156, 2020.

[3] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

[4] Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. Differentially private clustering in high-dimensional euclidean spaces. In International Conference on Machine Learning, pages 322–331, 2017.

[5] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.

[6] Sanjoy Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, 2008.

[7] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 371–380, 2009.

[8] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[9] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[10] Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1106–1125. SIAM, 2010.

[11] Matthew Jones, Huy L. Nguyen, and Thy Nguyen. Differentially private clustering via maximum coverage. Unpublished manuscript, 2020.

[12] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2–3):89–112, 2004.

[13] Haim Kaplan and Uri Stemmer. Differentially private k-means with constant multiplicative error. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 5436–5446, 2018.

[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[15] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[16] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In International Workshop on Algorithms and Computation, pages 274–285. Springer, 2009.

[17] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), pages 94–103. IEEE Computer Society, 2007.

[18] Frank D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 19–30, 2009.

[19] Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Locating a small cluster privately. In Tova Milo and Wang-Chiew Tan, editors, Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26–July 1, 2016, pages 413–427. ACM, 2016.

[20] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem.