A note on differentially private clustering with large additive error
Huy L. Nguyen
Northeastern University
September 29, 2020
Abstract
In this note, we describe a simple approach to obtain a differentially private algorithm for k-clustering with nearly the same multiplicative factor as any non-private counterpart, at the cost of a large polynomial additive error. The approach combines a simple geometric observation, independent of privacy considerations, with any existing private algorithm achieving a constant approximation.

In this note, we consider the problem of finding an approximate clustering solution with differential privacy in Euclidean space. The problem has been studied extensively with many different objective functions; some of the popular ones are the k-median and k-means objectives. Recently, the work [3] gave algorithms for these objectives achieving almost the same multiplicative error as any non-private counterpart, together with a large polynomial additive error. In this note, we describe a simple alternative approach to achieve a similar result. For concreteness, we focus on the k-median objective, but a similar proof also works for the k-means objective.

Definition 1.
In the Euclidean k-median problem, we are given a dataset D of n points in R^d. The goal is to find a set S of k centers minimizing the following objective:

min_S Σ_{p∈D} d(p, S) = min_S Σ_{p∈D} min_{c∈S} d(p, c),

where d(p, q) denotes the Euclidean distance between two points p and q. We use d(p, S) as shorthand for min_{q∈S} d(p, q).

A major part of their work is in developing a private bi-criteria algorithm for points in R^d with poly(k, log n, d) centers and clustering cost at most ε times the optimal cost plus a polynomial additive error. We show that this result can be obtained using a simple observation independent of privacy considerations. Note that the observation holds more generally for metric spaces with doubling dimension d.

Claim 2. Consider a dataset D of n points in R^d and a constant ε ∈ (0, 1/2). Let O_k be the optimal k-median solution and OPT_k be the optimal k-median cost for the dataset. Then for a certain k' = k(1/ε)^{O(d)} log(n/ε), we have OPT_{k'} ≤ O(ε · OPT_k).

Proof.
Suppose O_k = {c_1, ..., c_k} and suppose the optimal cost is Rn. We will construct a new solution S with k' centers. Let T be the set of exponentially growing thresholds T = {εR, εR(1+ε), εR(1+ε)², ..., nR}. For each center c_i and threshold t ∈ T, we cover the ball B(c_i, t) (the ball centered at c_i with radius t) using balls of radius εt and include all their centers in the solution S. We also include all c_i in S. It is clear that |S| = k(1/ε)^{O(d)} |T| = k(1/ε)^{O(d)} log(n/ε).

Next we show that the clustering cost of S is at most O(εRn). Consider a point p in the dataset at distance r = d(p, O_k) from its nearest center c_i in O_k. If r ≤ εR then we just note that its distance to the nearest center in S is also at most r (since c_i ∈ S). If εR < r ≤ nR then consider the minimum threshold t ∈ T such that t ≥ r. Since p ∈ B(c_i, t), we include a center at distance at most εt from p. By the minimality of t, we have t ≤ (1+ε)r. Thus, p is at most (1+ε)εr away from some center in S. The total clustering cost for S is bounded by

Σ_{p∈D} d(p, S) ≤ Σ_{p∈D: d(p,O_k)>εR} (1+ε)ε · d(p, O_k) + nεR ≤ (1+ε)εnR + nεR ≤ 3nεR.

Combining the above observation with an arbitrary private constant-approximation algorithm for k-median, such as [4], we obtain the following result:

Corollary 3.
There is an (ε_p, δ_p)-differentially private algorithm that works on data in the unit ball in R^d and outputs k' = k(1/ε)^{O(d)} log(n/ε) centers such that the k'-median clustering cost is at most O(ε · OPT_k) + poly(k, log n, (1/ε)^d) · log(1/δ_p)/ε_p with probability at least 1 − 1/n.

For completeness, we include a brief description of the remaining steps to obtain an approximate solution using the bi-criteria solution.
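As an aside, the net construction in the proof of Claim 2 is easy to simulate. The sketch below (hypothetical helper names, not from the paper) evaluates the k-median objective of Definition 1 on a toy dataset and builds the threshold set T from the proof, whose size grows only logarithmically in n/ε.

```python
# Sketch (hypothetical helpers, not from the paper): the k-median objective from
# Definition 1 and the exponentially growing threshold set T from the proof of Claim 2.
import math

def kmedian_cost(points, centers):
    # Sum over points of the distance to the nearest center.
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def thresholds(R, eps, n):
    # T = {eps*R, eps*R*(1+eps), eps*R*(1+eps)^2, ...}, capped at n*R.
    T, t = [], eps * R
    while t <= n * R:
        T.append(t)
        t *= 1 + eps
    return T

D = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
S = [(0.0, 0.0), (10.0, 0.0)]
print(kmedian_cost(D, S))   # 1.0: only (1, 0) is away from its nearest center

T = thresholds(R=1.0, eps=0.5, n=1000)
print(len(T))               # |T| = ceil(log_{1+eps}(n/eps)) = 19 here
```

The ratio between consecutive thresholds is exactly (1+ε), which is what gives the t ≤ (1+ε)r bound used in the proof.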
Theorem 4.
Suppose there is a non-private algorithm with an α approximation for k-median in R^d. As a consequence, there is an (ε_p, δ_p)-private algorithm for data in B(0, 1) that finds a solution with k-median cost (α + O(ε)) OPT_k + poly((k/ε)^{log(1/ε)/ε}, log n) · d log(1/δ_p)/ε_p with probability 1 − 1/k.

Proof. The algorithm follows similar steps as those of Balcan et al. for k-means [1].

1. Project the data to d' = O(ε^{-2} log k) dimensions and project the results to the ball B(0, log n).
2. Run an (ε_p/2, δ_p)-private algorithm on the projected data to find a bi-criteria solution with k' = k(1/ε)^{O(d')} log(n/ε) centers.
3. Use the Laplace mechanism to compute the approximate number of points assigned to each center.
4. Run a non-private algorithm on a new dataset where the points are the k' centers and each center has multiplicity equal to the approximate number of points assigned to it, i.e. snapping each point to its nearest center.
5. Partition the data according to each point's closest center produced in step 4. For each cluster, use a private algorithm to recover an approximate optimal center in the original high dimension.

By [5], projecting to d' = O(ε^{-2} log k) dimensions using a random Gaussian matrix preserves the clustering cost within a (1+ε) factor with probability 1 − 1/k. By the standard argument using the concentration of the χ² distribution, with probability 1 − 1/n, the resulting points are also contained within the ball B(0, log n) in R^{d'}. Thus, with probability 1 − 1/n, the step of projecting to the ball B(0, log n) does not move any point.
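The projection and counting steps (1 and 3) can be made concrete. The sketch below is a simplified illustration under stated assumptions, not the algorithm of [1]: it applies a scaled random Gaussian matrix for the dimension reduction and adds Laplace noise of scale 1/ε_p to each cluster count, which suffices because each count has sensitivity 1.

```python
# Sketch (simplified, not the algorithm of [1]): Gaussian random projection
# and Laplace-noised cluster counts, as in steps 1 and 3 of the proof.
import math
import random

def gaussian_project(points, d_prime, seed=0):
    """Project points to d' dimensions with a random Gaussian matrix,
    scaled by 1/sqrt(d') so squared norms are preserved in expectation."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d_prime)]
    scale = 1 / math.sqrt(d_prime)
    return [
        tuple(scale * sum(g[j] * p[j] for j in range(d)) for g in G)
        for p in points
    ]

def noisy_counts(counts, eps_p, seed=0):
    """Laplace mechanism: each count has sensitivity 1, so Lap(1/eps_p) noise
    gives eps_p-differential privacy for the vector of counts per neighboring
    change of a single point's assignment."""
    rng = random.Random(seed)
    def lap(b):
        # A Laplace(b) sample is the difference of two Exp(mean b) samples.
        return rng.expovariate(1 / b) - rng.expovariate(1 / b)
    return [c + lap(1 / eps_p) for c in counts]

projected = gaussian_project([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], d_prime=2)
nc = noisy_counts([100, 250], eps_p=1.0)
print(projected)
print(nc)  # close to the true counts; error is O(1/eps_p) per count
```

The per-count error of O(1/ε_p) is what drives the "inaccurate counts" term in the cost analysis of step 3.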
The reason we include this step is to protect privacy in the low-probability event where the projection fails.

In step 2, the algorithm produces a solution with cost O(ε · OPT_k) + poly((k/ε)^{log(1/ε)/ε}, log n) · log(1/δ_p)/ε_p.

In step 3, the number of points at each center is accurate up to additive error poly((k/ε)^{log(1/ε)/ε}, log n)/ε_p per count. Thus, the new dataset has optimal k-median cost (1 + O(ε)) OPT_k + poly((k/ε)^{log(1/ε)/ε}, log n) · log(1/δ_p)/ε_p (the original optimal cost plus the increase due to snapping points to centers and the inaccurate counts).

In step 4, the non-private clustering algorithm produces a solution with cost α(1 + O(ε)) OPT_k + α · poly((k/ε)^{log(1/ε)/ε}, log n) · log(1/δ_p)/ε_p.

In step 5, we can use the private convex empirical risk minimization algorithm [2] to compute an approximate 1-median solution for each cluster separately. The algorithm works for convex Lipschitz risk functions, and the 1-median cost function is a convex 1-Lipschitz function. The algorithm has additive error poly(k) · d/ε_p.

The result follows by adding up the costs in steps 4 and 5.

References

[1] Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. Differentially private clustering in high-dimensional Euclidean spaces. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of
Proceedings of Machine Learning Research, pages 322–331. PMLR, 2017.

[2] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, pages 464–473. IEEE Computer Society, 2014.

[3] Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. Differentially private clustering: Tight approximation ratios, 2020.

[4] Haim Kaplan and Uri Stemmer. Differentially private k-means with constant multiplicative error. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors,