Discrete Distribution Estimation with Local Differential Privacy: A Comparative Analysis
Ba Dung Le
Charles Sturt University, NSW, Australia
Cyber Security Cooperative Research Centre
Email: [email protected]
Tanveer Zia
Charles Sturt University, NSW, Australia
Cyber Security Cooperative Research Centre
Email: [email protected]
Abstract—Local differential privacy is a promising privacy-preserving model for statistical aggregation of user data that prevents user privacy leakage from the data aggregator. This paper focuses on the problem of estimating the distribution of discrete user values with local differential privacy. We review and present a comparative analysis of the performance of the existing discrete distribution estimation algorithms in terms of their accuracy on benchmark datasets. Our evaluation benchmarks include real-world and synthetic datasets of categorical individual values, with the number of individuals ranging from hundreds to millions and the domain size up to a few hundred values. The experimental results show that the Basic RAPPOR algorithm generally performs best on the benchmark datasets in the high privacy regime, while the k-RR algorithm often gives the best estimation in the low privacy regime. In the medium privacy regime, the k-RR, k-subset, and HR algorithms are fairly competitive with each other and generally better than the Basic RAPPOR and CMS algorithms.
I. INTRODUCTION
In today's hyper-connected world, collecting consumer statistics has become a common practice for companies seeking to understand consumer insights and improve services and products [5], [3], [12]. Statistical data collection is highly visible in Internet of Things (IoT) applications such as Smart Homes, where energy consumption statistics from homeowners can be collected to optimize energy utilization, and Smart Cities, where road transport statistics from motor vehicles can be collected to improve urban mobility [10]. However, the collection of consumer data requires companies to protect consumer or user privacy to comply with enacted privacy laws and regulations [11].

Local differential privacy (LDP) [9], [4] is a promising privacy-preserving model for statistical aggregation of user data that prevents user privacy leakage from the data aggregator. LDP achieves its privacy guarantee by introducing random noise into user data before transmitting them to the data aggregator, while keeping the aggregated statistics accurate. Since the data aggregator cannot confidently know the raw user data, users have plausible deniability and, therefore, their privacy remains protected to some degree.

Consider a mechanism M as a function that perturbs a value v to a value s and returns s as a noisy representation of v. The formal definition of local differential privacy is given below.

Definition 1 (ε-Local differential privacy [5]):
A randomized mechanism M satisfies ε-local differential privacy if, for all pairs of values v and v′ given by users and all sets S of possible outputs of M,

Pr[M(v) ∈ S] ≤ exp(ε) × Pr[M(v′) ∈ S],

where the probability space is over the randomness of the mechanism M.

The parameter ε, called the privacy loss or privacy budget, takes a positive real value specified by the data aggregator to control the strength of the privacy protection. The smaller the value of ε, the higher the probability that the noisy representations of different user values coincide or, in other words, the less likely the data aggregator is to learn the true user values. Therefore, a smaller value of ε gives stronger privacy protection. However, the smaller the value of ε, the worse the accuracy of the aggregated statistics, because the noise eliminates useful statistical information about the true user data. Thus, the privacy budget ε needs to be carefully chosen to balance privacy protection and statistical accuracy [6]. In practice, ε is generally set to a value from 0.01 to 1 for a high privacy regime and from 1 to 10 for a low privacy regime [5].

Discrete distribution estimation with LDP is the task of estimating the distribution of discrete user values from privatized data produced by LDP mechanisms, without accessing the true data [7]. In this paper, we consider the following scenario of discrete distribution estimation. There is a set of n users U = {u_1, u_2, .., u_n}, and each user u_i holds one value taken from a set of d values V = {v_1, v_2, .., v_d}. Let c(v_j) be the number of users holding value v_j. The (frequency) distribution of the user values is a vector f = (f_{v_1}, f_{v_2}, .., f_{v_d}) with f_{v_j} = c(v_j)/n. Discrete distribution estimation algorithms with LDP perturb user values at the user side and send the privatized data to the data aggregator for distribution aggregation.
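To make Definition 1 concrete, the inequality can be checked numerically for a simple mechanism. The sketch below is illustrative only; the function names `rr_probs` and `satisfies_ldp` are introduced here and are not from the paper. It verifies that binary randomized response with p = e^ε/(e^ε + 1) satisfies ε-LDP, and fails the check when run with a larger true ε against a smaller claimed budget:

```python
import math

def rr_probs(eps):
    """Transition matrix of binary randomized response:
    report the true bit with probability p = e^eps / (e^eps + 1)."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return [[p, 1 - p], [1 - p, p]]  # rows: true bit, columns: reported bit

def satisfies_ldp(probs, eps, tol=1e-9):
    """Check Definition 1 over every pair of inputs and every output."""
    bound = math.exp(eps)
    return all(
        probs[v][s] <= bound * probs[w][s] + tol
        for v in range(2) for w in range(2) for s in range(2)
    )
```

For example, `satisfies_ldp(rr_probs(1.0), 1.0)` holds (the likelihood ratio is exactly e^ε), while `satisfies_ldp(rr_probs(2.0), 1.0)` does not.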
Figure 1 illustrates the steps of discrete distribution estimation algorithms with LDP.

Fig. 1: Discrete distribution estimation with Local differential privacy.

Algorithms for discrete distribution estimation with LDP have been proposed in the literature [7], [13], [5], [12], [1]. However, there is a lack of comparison of these algorithms regarding estimation accuracy and, thus, it is difficult to select a suitable LDP distribution estimation algorithm for a particular dataset. Some of the existing work [13], [1] includes a comparison of discrete distribution algorithms, but the comparison is made solely on synthetic datasets that are dissimilar to real-world datasets, such as those used in our evaluation.

In this paper, we review and present a comparative analysis of the performance of the existing discrete distribution estimation algorithms [7], [13], [5], [12], [1] in terms of their accuracy on benchmarks of real-world and synthetic datasets. The real-world datasets include the Statlog (Australian Credit Approval), the Adult, and the USCensusData1990 datasets, which are publicly available at the UCI Machine Learning Repository [2]. These datasets contain categorical individual values with dataset sizes from hundreds to millions of individuals and domain sizes up to a few hundred values. The synthetic datasets are generated to be similar to the real-world benchmark datasets in dataset size and domain size.

II. ALGORITHMS
A. The Randomized response technique [15]
The Randomized response (RR) technique is perhaps the earliest proposed algorithm for distribution estimation that guarantees LDP. The algorithm was originally designed for estimating the distribution of binary values, for example, "yes" or "no" only. The two user values can be represented as a single bit, with value 1 indicating a user value and value 0 indicating its negation.

The RR algorithm perturbs a user value v by returning this value with probability p and returning its negation with probability (1 − p), for p ≠ 1/2. The perturbation mechanism of the RR algorithm satisfies LDP with privacy budget ε = ln(p/(1 − p)).

The original distribution of user values is aggregated as follows. Let f′_v be the proportion of the perturbed values equal to v; the estimated proportion of value v in the true user data is [15]

f̂_v = (f′_v + p − 1) / (2p − 1).
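The perturbation and debiasing steps above can be sketched as a small simulation. This is a minimal illustration under the definitions in this section, not the implementation of [15]; the function names are ours:

```python
import random

def rr_perturb(bit, p):
    """Report the true bit with probability p, its negation otherwise."""
    return bit if random.random() < p else 1 - bit

def rr_estimate(reports, p):
    """Debias the observed proportion of 1s:
    f_hat = (f' + p - 1) / (2p - 1)."""
    f_prime = sum(reports) / len(reports)
    return (f_prime + p - 1) / (2 * p - 1)
```

With many users, the debiased estimate concentrates around the true proportion even though each individual report is noisy.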
B. The generalized Randomized response algorithm [8], [7]

The generalized Randomized response algorithm, named k-RR, is a generalization of the RR algorithm to the case where users have more than two values.

Given the privacy budget ε, the k-RR algorithm privatizes a user value by sending this value with probability e^ε/(e^ε + d − 1) and sending one of the remaining domain values with probability 1/(e^ε + d − 1) to the data aggregator. When d = 2, the k-RR algorithm is identical to the RR algorithm with probability p = e^ε/(e^ε + 1).

The k-RR algorithm aggregates the distribution of user values using the maximum likelihood estimator [7].
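A sketch of the k-RR mechanism follows. For brevity it pairs the perturbation with the simple unbiased debiasing estimator rather than the projected maximum-likelihood estimator of [7]; the two agree in expectation, and all names here are ours:

```python
import math
import random

def krr_perturb(v, d, eps):
    """k-RR: keep v with probability e^eps/(e^eps + d - 1); otherwise
    report one of the other d - 1 values uniformly at random."""
    if random.random() < math.exp(eps) / (math.exp(eps) + d - 1):
        return v
    other = random.randrange(d - 1)
    return other if other < v else other + 1  # skip v itself

def krr_estimate(reports, d, eps):
    """Unbiased frequency estimate: f_hat_v = (c_v/n - q) / (p - q)."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1 / (math.exp(eps) + d - 1)
    counts = [0] * d
    for r in reports:
        counts[r] += 1
    return [(c / n - q) / (p - q) for c in counts]
```

Because p + (d − 1)q = 1, the estimated frequencies sum to exactly 1 (though individual entries may be slightly negative before any projection step).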
C. The k-subset algorithm [13]

The k-subset algorithm privatizes a user value by sending a set of values sampled from the user value domain to the data curator. The number of sampled values is defined by a parameter k. For a user value v, the privatized output is a particular size-k subset S_v containing v with probability (de^ε/(ke^ε + d − k)) / C(d, k), and a particular size-k subset not containing v with probability (d/(ke^ε + d − k)) / C(d, k), where C(d, k) is the binomial coefficient. When k = 1, the k-subset algorithm is equivalent to the k-RR algorithm.

Given all the privatized sets of user values, the k-subset algorithm aggregates the true distribution of user values as follows. Let c′(v) be the number of times the value v occurs in a privatized set. Given

g_k = ke^ε/(ke^ε + d − k) and h_k = (ke^ε/(ke^ε + d − k)) · (k − 1)/(d − 1) + ((d − k)/(ke^ε + d − k)) · k/(d − 1),

the estimated proportion of the value v in the true user data is [13]

f̂_v = (c′(v) − h_k·n) / ((g_k − h_k)·n).
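The per-subset probabilities above are equivalent to a simpler two-stage sampler: with probability g_k = ke^ε/(ke^ε + d − k) output v together with k − 1 uniformly chosen other values, and otherwise output k uniformly chosen values excluding v. The sketch below (our names, not the implementation of [13]) uses this view:

```python
import math
import random

def subset_perturb(v, d, k, eps):
    """k-subset mechanism: output a size-k subset of {0, .., d-1};
    subsets containing v are e^eps times likelier than those without it."""
    g = k * math.exp(eps) / (k * math.exp(eps) + d - k)
    others = [u for u in range(d) if u != v]
    if random.random() < g:
        return {v} | set(random.sample(others, k - 1))
    return set(random.sample(others, k))

def subset_estimate(reports, d, k, eps):
    """Debiased estimator with g_k = Pr[v in S | input v] and
    h_k = Pr[v in S | input u != v]."""
    n = len(reports)
    e = math.exp(eps)
    g = k * e / (k * e + d - k)
    h = g * (k - 1) / (d - 1) + ((d - k) / (k * e + d - k)) * k / (d - 1)
    counts = [0] * d
    for s in reports:
        for u in s:
            counts[u] += 1
    return [(c - h * n) / ((g - h) * n) for c in counts]
```

Since each report contributes exactly k counts and g_k + (d − 1)h_k = k, the estimates again sum to 1.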
D. The Basic RAPPOR algorithm [5]

The Basic RAPPOR algorithm is a modified version of the RAPPOR algorithm [5] that maps a user value to a single bit in a bit sequence, or one-hot vector. A one-hot vector representing a value v is a bit sequence with only the bit at position v set to 1 and all other bits set to 0.

To privatize a user value, the Basic RAPPOR algorithm applies a mechanism based on the Randomized response technique [15], simply called randomized response, to perturb every single bit of the one-hot vector. The perturbed bit vector initially has all bits set to 0. Each output bit is then set to 1 with probability q if the corresponding bit in the input vector is 1, and with probability p if the corresponding bit in the input vector is 0.

Given all the privatized bit vectors, the Basic RAPPOR algorithm aggregates the true distribution of user values as follows. Let c′(v) be the number of privatized bit vectors in which the bit at position v is set to 1. The estimated proportion of the value v in the true user data is [5]

f̂_v = (c′(v) − p·n) / ((q − p)·n).
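The bitwise perturbation and the debiasing formula above can be sketched directly (an illustrative simulation with generic p and q, not the authors' code; choosing valid p, q for a given ε follows [5]):

```python
import random

def brappor_perturb(v, d, p, q):
    """One-hot encode v, then set each output bit to 1 with probability q
    where the input bit is 1 and probability p where it is 0."""
    return [int(random.random() < (q if j == v else p)) for j in range(d)]

def brappor_estimate(reports, d, p, q):
    """f_hat_v = (c'(v) - p*n) / ((q - p)*n)."""
    n = len(reports)
    counts = [sum(r[j] for r in reports) for j in range(d)]
    return [(c - p * n) / ((q - p) * n) for c in counts]
```

Note that, unlike k-RR, each user sends d bits rather than a single value, which is why the one-hot algorithms behave differently as the domain size grows.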
E. The Count Mean Sketch algorithm [12]

The Count Mean Sketch (CMS) algorithm, similar to the Basic RAPPOR algorithm, represents a user value as a one-hot vector and applies randomized response to every bit of the one-hot vector to privatize the user value. However, the CMS algorithm independently flips each bit of the one-hot vector with probability 1/(e^{ε/2} + 1) and keeps each bit unchanged with the complementary probability.

The CMS algorithm aggregates the true distribution of user values from the perturbed bit vectors as follows. Let c′(v) be the number of perturbed bit vectors in which the bit at position v is set to 1. Given c_ε = (e^{ε/2} + 1)/(e^{ε/2} − 1), the estimated proportion of the value v in the true user data is [12]

f̂_v = 1/2 + (c′(v)/n − 1/2) · c_ε.
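A sketch of this simplified, sketch-free variant of CMS follows (the full algorithm of [12] also hashes values into a smaller sketch; that step is omitted here, and the names are ours):

```python
import math
import random

def cms_perturb(v, d, eps):
    """One-hot encode v and flip each bit independently with
    probability 1 / (e^(eps/2) + 1)."""
    flip = 1 / (math.exp(eps / 2) + 1)
    # XOR of the true bit with an independent flip event
    return [int((j == v) != (random.random() < flip)) for j in range(d)]

def cms_estimate(reports, d, eps):
    """f_hat_v = 1/2 + (c'(v)/n - 1/2) * c_eps,
    with c_eps = (e^(eps/2) + 1) / (e^(eps/2) - 1)."""
    n = len(reports)
    c_eps = (math.exp(eps / 2) + 1) / (math.exp(eps / 2) - 1)
    counts = [sum(r[j] for r in reports) for j in range(d)]
    return [0.5 + (c / n - 0.5) * c_eps for c in counts]
```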
F. The Hadamard response algorithm [1]

The Hadamard response (HR) algorithm privatizes a user value v in a domain of size d by returning a value v′ in a domain of size d′ with d ≤ d′ ≤ 2d. To choose v′, supposing d′ is a power of two, the HR algorithm first creates a Hadamard matrix H_{d′} ∈ {+1, −1}^{d′×d′} as follows: H_1 = [+1] and

H_m = [[H_{m/2}, H_{m/2}], [H_{m/2}, −H_{m/2}]]

with m = 2^j for 1 ≤ j ≤ log_2(d′).

The HR algorithm then creates a set of values S_v of size s with s ≤ d′. The set S_v consists of the column indices at which the (v + 1)-th row of the Hadamard matrix (with row indices starting from 0) contains a '+1'.

To privatize v, the HR algorithm returns an element of S_v with probability e^ε/(se^ε + d′ − s) and returns an element of the size-d′ domain outside S_v with probability 1/(se^ε + d′ − s). When d′ = d, s = 1, and S_v = {v}, the HR algorithm is equivalent to the k-RR algorithm.

The original distribution of user values is aggregated using the following estimation. Let c′(v) be the number of privatized values that fall in S_v. The estimated proportion of v in the true user data is

f̂_v = (2(e^ε + 1)/(e^ε − 1)) · (c′(v)/n − 1/2).
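The construction above can be sketched for the case s = d′/2, where each nonzero Hadamard row has exactly d′/2 entries equal to +1 (a simplified illustration, not the public implementation of [1]; names are ours):

```python
import math
import random

def hadamard(m):
    """Sylvester construction: H_1 = [1], H_2m = [[H, H], [H, -H]]."""
    H = [[1]]
    while len(H) < m:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def hr_perturb(v, d_prime, eps):
    """Report an element of S_v (columns with +1 in row v+1) with raised
    probability, and any value outside S_v otherwise."""
    H = hadamard(d_prime)
    S_v = [j for j in range(d_prime) if H[v + 1][j] == 1]
    s = len(S_v)  # d_prime / 2 in this construction
    if random.random() < s * math.exp(eps) / (s * math.exp(eps) + d_prime - s):
        return random.choice(S_v)
    return random.choice([j for j in range(d_prime) if H[v + 1][j] != 1])

def hr_estimate_one(reports, v, d_prime, eps):
    """f_hat(v) = (2(e^eps + 1)/(e^eps - 1)) * (c'(v)/n - 1/2)."""
    H = hadamard(d_prime)
    S_v = {j for j in range(d_prime) if H[v + 1][j] == 1}
    c = sum(1 for r in reports if r in S_v)
    e = math.exp(eps)
    return (2 * (e + 1) / (e - 1)) * (c / len(reports) - 0.5)
```

The key property is that for any other input u ≠ v, a privatized report lands in S_v with probability exactly 1/2 (orthogonality of Hadamard rows), so only the true value shifts c′(v)/n away from 1/2.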
III. EXPERIMENTAL RESULTS

In this section, we compare the performance of k-RR, Basic RAPPOR (bRAPPOR), CMS, k-subset, and HR in terms of accuracy on benchmark datasets. The benchmark datasets contain discrete values specifying user attributes. The compared algorithms first privatize these values and then estimate the discrete distribution of the original values from the privatized data, without accessing the original data.
A. Experiment setting
Previous work evaluating discrete distribution estimation algorithms with LDP tests the algorithms either on synthetic data only [7], [13], [1] or on real-world datasets that are not publicly available [5], [12]. We use publicly available real-world datasets as well as synthetic datasets generated to be similar to these real-world datasets in dataset size and domain size.

For the real-world datasets, we use three datasets of categorical individual values publicly available at the UC Irvine Machine Learning Repository [2]: the Statlog (Australian Credit Approval), the Adult, and the USCensusData1990 datasets. For each dataset, we select only the attributes with a categorical data type and estimate the distribution of the categorical values of each attribute. The number of samples in each dataset, the selected attributes, and the value domain size of each attribute are listed in Table I. The attributes of each dataset are listed in increasing order of value domain size. For detailed descriptions of these attributes, readers are referred to [2].

TABLE I: Real-world datasets with the number of individuals (n), the selected attributes, and the value domain size (d).
Dataset        n          Attribute   d
Statlog        690        A4          3
                          A6          8
                          A5          14
Adult          32,561     Race        5
                          Occ         15
                          Country     42
USCensus1990   2,458,285  Military    5
                          Rvetserv    12
                          Race        63
                          PoB         283
We generate synthetic datasets of categorical values following the Geometric distribution. For the synthetic datasets with up to 20,000 samples (small datasets), the generated values are taken from two categorical domains of 20 and 100 values. For the synthetic datasets with 20,000 to 400,000 samples (large datasets), the generated values are taken from two categorical domains of 100 and 500 values.

We implement the k-RR, Basic RAPPOR, CMS, and k-subset algorithms following [7], [13], [5], [12]. The implementation of the Hadamard response algorithm is publicly provided, as listed in [1]. The algorithms are executed with default parameter values. The performance of the algorithms is compared in three privacy regimes, with ε = 0.5 for the high privacy regime, ε = 1 for the medium privacy regime, and ε = 2 for the low privacy regime.

TABLE II: Performance (MAE) of the compared algorithms (k-RR, bRAPPOR, CMS, k-subset, HR; Mean and Std. per algorithm) on the real-world datasets for ε = 0.5.

TABLE III: Performance (MAE) of the compared algorithms on the real-world datasets for ε = 1.

TABLE IV: Performance (MAE) of the compared algorithms on the real-world datasets for ε = 2.

The accuracy of the distribution estimation is measured by the Mean Absolute Error (MAE), defined as
MAE(f̂, f) = (1/d) Σ_{i=1}^{d} |f̂_{v_i} − f_{v_i}|,

where f̂_{v_i} and f_{v_i} are the estimated and true proportions of user value v_i, respectively. The MAE is the average of the absolute differences between the elements of the estimated distribution and the true distribution of user values. The smaller the MAE, the more accurate the distribution estimation.
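The metric is a one-liner; a small sketch (our function name) matching the definition above:

```python
def mae(f_hat, f):
    """Mean Absolute Error between an estimated and a true distribution."""
    assert len(f_hat) == len(f)
    return sum(abs(a - b) for a, b in zip(f_hat, f)) / len(f)
```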
B. Results

Table II, Table III, and Table IV show the accuracy of the evaluated algorithms on the real-world datasets for the three privacy regimes. The results for the Statlog and Adult datasets are averaged over one hundred runs; the results for the USCensus dataset are averaged over 10 runs. The column Mean lists the average of the MAE values obtained over all runs of the algorithm, and the column Std. lists the standard deviation of the MAE values. The higher the Std. value, the more the estimation accuracy fluctuates. The best results (the lowest MAE values) are highlighted in bold.

As can be seen from the tables, the Basic RAPPOR algorithm performs best on most of the datasets in the high privacy regime (ε = 0.5). In the medium privacy regime (ε = 1), k-RR performs better than the other algorithms on the datasets with a small number of samples, while on the datasets with a large number of samples, k-subset and HR are fairly competitive with k-RR. In the low privacy regime (ε = 2), the k-RR algorithm generally performs better than the other algorithms.

Figure 2, Figure 3, Figure 4, and Figure 5 illustrate the performance of the compared algorithms on the synthetic datasets; panels (a), (b), and (c) of each figure show ε = 0.5, ε = 1, and ε = 2, respectively.

Fig. 2: Performance (MAE) of the compared algorithms on the small synthetic datasets (n ≤ 20,000) with d = 20.
Fig. 3: Performance (MAE) of the compared algorithms on the small synthetic datasets (n ≤ 20,000) with d = 100.
Fig. 4: Performance (MAE) of the compared algorithms on the large synthetic datasets (20,000 < n ≤ 400,000) with d = 100.
Fig. 5: Performance (MAE) of the compared algorithms on the large synthetic datasets (20,000 < n ≤ 400,000) with d = 500.

According to the figures, in the high privacy regime (ε = 0.5), Basic RAPPOR generally outperforms the other algorithms on the small datasets with a small value domain size (Figure 2a) and on the large datasets with a large value domain size (Figure 5a). HR is better than the other algorithms on the small datasets with a large value domain size (Figure 3a), while k-subset is the best performing algorithm on the large datasets with a small value domain size (Figure 4a).

In the medium privacy regime (ε = 1), k-RR, k-subset, and HR are competitive with each other and generally give the best estimation. In the low privacy regime (ε = 2), k-RR generally has the highest accuracy, but the performance of k-subset and HR is often comparable with that of k-RR.

The experimental results indicate that the relative accuracy of the compared algorithms depends on the characteristics of the evaluation datasets, such as dataset size and value domain size, as well as on the privacy protection level. The dependency of algorithm performance on dataset characteristics has previously been observed in [1] for a set of synthetic datasets. Our results further confirm this finding for real-world datasets and for a set of synthetic datasets with similar characteristics to these real-world data. The performance dependency of the compared algorithms on dataset characteristics is likely due to their approaches to privatizing user values. While k-RR and k-subset privatize user values by directly perturbing the domain values only, HR perturbs user values using the domain values and a pre-defined number of values outside the user value domain. Basic RAPPOR and CMS privatize user values using one-hot vector representations involving a large number of values outside the user value domain. The perturbation steps of the algorithms thus introduce noise into the privatized data differently, depending on value domain size, dataset size, and privacy protection level.
The estimation accuracy of the algorithms is, therefore, differently sensitive to these parameters.

IV. CONCLUSIONS
We have presented a comparative analysis of the performance of discrete distribution estimation algorithms with local differential privacy on real-world and synthetic datasets. The real-world datasets include three datasets of categorical individual values publicly available at the UC Irvine Machine Learning Repository [2]: the Statlog (Australian Credit Approval), the Adult, and the USCensusData1990 datasets. The synthetic datasets are generated to be similar to the real-world datasets in dataset size and domain size. We conclude that the Basic RAPPOR algorithm generally performs best, in terms of estimation accuracy, on the evaluation datasets in the high privacy regime. In the medium privacy regime, k-RR, k-subset, and HR give fairly comparable results and are generally better than Basic RAPPOR and CMS. In the low privacy regime, the k-RR algorithm often gives the best estimation.

Our empirical evaluation is based on executing the algorithms with default parameter values. To further understand the relative performance of the algorithms, it would be necessary to compare them under different parameter settings. For example, the accuracy of the Basic RAPPOR algorithm could be improved by adopting the optimal choice of parameter values discussed in [14]. In the limited scope of this paper, we have focused only on the estimation accuracy of the algorithms. Further work will be required to evaluate the algorithms in other respects, such as computational complexity and communication cost.

ACKNOWLEDGMENT
The work has been supported by the Cyber Security Research Centre Limited, whose activities are partially funded by the Australian Government's Cooperative Research Centres Programme. We would like to thank the anonymous reviewers for their constructive and helpful comments on an earlier version of the manuscript.

REFERENCES

[1] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1120–1129, 2019.
[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
[3] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pages 3571–3580, 2017.
[4] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438. IEEE, 2013.
[5] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.
[6] Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C. Pierce, and Aaron Roth. Differential privacy: An economic method for choosing epsilon. In 2014 IEEE 27th Computer Security Foundations Symposium, pages 398–410. IEEE, 2014.
[7] Peter Kairouz, Keith Bonawitz, and Daniel Ramage. Discrete distribution estimation under local privacy. arXiv preprint arXiv:1602.07387, 2016.
[8] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems, pages 2879–2887, 2014.
[9] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
[10] Daniele Miorandi, Sabrina Sicari, Francesco De Pellegrini, and Imrich Chlamtac. Internet of things: Vision, applications and research challenges. Ad Hoc Networks, 10(7):1497–1516, 2012.
[11] General Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ), 59(1-88):294, 2016.
[12] ADP Team et al. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), 2017.
[13] Shaowei Wang, Liusheng Huang, Pengzhan Wang, Yiwen Nie, Hongli Xu, Wei Yang, Xiang-Yang Li, and Chunming Qiao. Mutual information optimally local private discrete distribution estimation. arXiv preprint arXiv:1607.08025, 2016.
[14] Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. Locally differentially private protocols for frequency estimation. In 26th USENIX Security Symposium (USENIX Security 17), pages 729–745, 2017.
[15] Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.