Locally Differentially Private Frequency Estimation with Consistency
Tianhao Wang, Milan Lopuhaä-Zwakenberg, Zitao Li, Boris Skoric, Ninghui Li
Purdue University, Eindhoven University of Technology
{tianhaowang, li2490, ninghui}@purdue.edu, {m.a.lopuhaa, b.skoric}@tue.nl

Abstract—Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in the industry. A basic building block is frequency oracle (FO) protocols, which estimate frequencies of values. While several FO protocols have been proposed, their design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols by exploiting the knowledge that all individual frequencies should be non-negative and sum up to one can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conducted extensive experimental evaluations to understand which methods should be used for different query tasks.

I. INTRODUCTION
Differential privacy (DP) [12] has been accepted as the de facto standard for data privacy. Recently, techniques for satisfying DP in the local setting, which we call LDP, have been studied and deployed. In this setting, there are many users and one aggregator. The aggregator does not see the actual private data of each individual. Instead, each user sends randomized information to the aggregator, who attempts to infer the data distribution based on that. LDP techniques have been deployed by companies like Apple [1], Google [14], Microsoft [9], and Alibaba [32]. Examples of use cases include collecting users' default browser homepage and search engine, in order to understand the unwanted or malicious hijacking of user settings; or frequently typed emojis and words, to help with keyboard typing recommendation.

The fundamental tools in LDP are mechanisms to estimate frequencies of values. Existing research [14], [5], [31], [2], [36] has developed frequency oracle (FO) protocols, where the aggregator can estimate the frequency of any chosen value in the specified domain (the fraction of users reporting that value). While these protocols were designed to provide unbiased estimations of individual frequencies while minimizing the estimation variance [31], they can perform poorly for some tasks. In [17], it is shown that when one wants to query the frequency of all values in the domain, one can obtain significant accuracy improvement by exploiting the belief that the distribution likely follows a power law. Also, some applications naturally require querying the sums of frequencies for values in a subset. For example, with the estimation of each emoji's frequency, one may be interested in understanding which categories of emojis are more popular and need to issue subset frequency queries. For another example, in [38], multiple attributes are encoded together and reported using LDP, and recovering the distribution for each attribute separately requires computing the frequencies of sets of encoded values. For frequencies of a subset of values, simply summing up the estimations of all values is far from optimal, especially when the input domain is large.

We note that the problem of answering queries using information obtained from the frequency oracle protocols is an estimation problem. Existing methods such as those in [31] do not utilize any prior knowledge of the distribution to be estimated. Due to the significant amount of noise needed to satisfy LDP, the estimations for many values may be negative. Also, some LDP protocols may result in a total sum of frequencies different from one. In this paper, we show that one can develop better estimation methods by exploiting the universal fact that all frequencies are non-negative and sum up to 1.

Interestingly, when taking advantage of such prior knowledge, one introduces biases in the estimations. For example, when we impose the non-negativity constraint, we are introducing positive biases in the estimation as a side effect. Essentially, when we exploit prior beliefs, the estimations will be biased towards the prior beliefs. These biases can cause some queries to be much more inaccurate. For example, changing all negative estimations to zero improves accuracy for frequency estimations of individual values.
However, the introduced positive biases accumulate for range queries. Different methods of utilizing the prior knowledge introduce different forms of biases, and thus have different impacts for different kinds of queries.

In this paper, we consider 10 different methods, which utilize prior knowledge differently. Some methods enforce only non-negativity; some other methods enforce only that all estimations sum to 1; and other methods enforce both. These methods can also be combined with the "Power" method in [17] that exploits the power-law assumption.

We evaluate these methods on three tasks: frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We find that there is no single method that outperforms the other methods for all tasks. A method that exploits only non-negativity performs the best for individual values; a method that exploits only the summing-to-one constraint performs the best for frequent values; and a method that enforces both can be applied in conjunction with Power to perform the best for subsets of values.

To summarize, the main contributions of this paper are threefold:

• We introduced the consistency properties as a way to improve accuracy for FO protocols under LDP, and summarized 10 different post-processing methods that exploit the consistency properties differently.

• We established theoretical relationships between Constrained Least Squares and Maximum Likelihood Estimation, and analyzed which (if any) estimation biases are introduced by these methods.

• We conducted extensive experiments on both synthetic and real-world datasets; the results improve the understanding of the strengths and weaknesses of the different approaches.
Roadmap.
In Section II, we give the problem definition, followed by the background information on FO protocols in Section III. We present the post-processing methods in Section IV. Experimental results are presented in Section V. Finally, we discuss related work in Section VI and provide concluding remarks in Section VII.

II. PROBLEM SETTING
We consider the setting where there are many users and one aggregator. Each user possesses a value v from a finite domain D, and the aggregator wants to learn the distribution of values among all users, in a way that protects the privacy of individual users. More specifically, the aggregator wants to estimate, for each value v ∈ D, the fraction of users having v (the number of users having v divided by the population size). Such protocols are called frequency oracle (FO) protocols under Local Differential Privacy (LDP), and they are the key building blocks of other LDP tasks.

Privacy Requirement. An FO protocol is specified by a pair of algorithms: Ψ is used by each user to perturb her input value, and Φ is used by the aggregator. Each user sends Ψ(v) to the aggregator. The formal privacy requirement is that the algorithm Ψ(·) satisfies the following property:

Definition 1 (ε-Local Differential Privacy). An algorithm Ψ(·) satisfies ε-local differential privacy (ε-LDP), where ε ≥ 0, if and only if for any input v, v' ∈ D, we have
$$\forall y \in \Psi(D): \quad \Pr[\Psi(v) = y] \le e^{\epsilon} \Pr[\Psi(v') = y],$$
where Ψ(D) is discrete and denotes the set of all possible outputs of Ψ.

Since a user never reveals v to the aggregator and reports only Ψ(v), the user's privacy is still protected even if the aggregator is malicious.

Utility Goals.
The aggregator uses Φ, which takes the vector of all reports from users as the input, and produces $\tilde{\mathbf{f}} = \langle \tilde{f}_v \rangle_{v \in D}$, the estimated frequencies of the values v ∈ D (i.e., the fraction of users who have input value v). As Ψ is a randomized function, the resulting f̃ is inaccurate. In existing work, the design goal for Ψ and Φ is that the estimated frequency for each v is unbiased, and the variance of the estimation is minimized. As we will show in this paper, these may not result in the most accurate answers to different queries.

In this paper, we consider three different query scenarios: 1) query the frequency of every value in the domain, 2) query the aggregate frequencies of subsets of values, and 3) query the frequencies of the most frequent values. For each value or set of values, we compute its estimate and the ground truth, and calculate their difference, measured by Mean of Squared Error (MSE).

Consistency.
We will show that the utility of existing mechanisms can be improved by enforcing the following consistency requirement.
Definition 2 (Consistency). The estimated frequencies are consistent if and only if the following two conditions are satisfied:
1) The estimated frequency of each value is non-negative.
2) The sum of the estimated frequencies is 1.

III. FREQUENCY ORACLE PROTOCOLS
We review the state-of-the-art frequency oracle protocols. We utilize the generalized view from [31] to present the protocols, so that our post-processing procedures can be applied to all of them.
A. Generalized Random Response (GRR)
This FO protocol generalizes the randomized response technique [35]. Here each user with private value v ∈ D sends the true value v with probability p, and with probability 1 − p sends a randomly chosen v' ∈ D \ {v}. Suppose the domain D contains d = |D| values; the perturbation function is formally defined as
$$\forall y \in D \quad \Pr\left[\Psi_{\mathrm{GRR}(\epsilon,d)}(v) = y\right] = \begin{cases} p = \frac{e^{\epsilon}}{e^{\epsilon}+d-1}, & \text{if } y = v \\ q = \frac{1}{e^{\epsilon}+d-1}, & \text{if } y \ne v \end{cases} \quad (1)$$
This satisfies ε-LDP since p/q = e^ε.

From a population of n users, the aggregator receives a length-n vector y = ⟨y_1, y_2, ..., y_n⟩, where y_i ∈ D is the reported value of the i-th user. The aggregator counts the number of times each value v appears in y and produces a length-d vector c of natural numbers. Observe that the components of c sum up to n, i.e., Σ_{v∈D} c_v = n. The aggregator then obtains the estimated frequency vector f̃ by scaling each component of c as follows:
$$\tilde{f}_v = \frac{c_v/n - q}{p - q} = \frac{c_v/n - \frac{1}{e^{\epsilon}+d-1}}{\frac{e^{\epsilon}-1}{e^{\epsilon}+d-1}}$$
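To make the mechanism concrete, the following is a minimal sketch of GRR perturbation and aggregation in Python/NumPy; the function names and structure are our own choices for illustration and are not from the paper.

```python
import numpy as np

def grr_perturb(v, d, eps, rng=np.random.default_rng()):
    """Perturb one value v in {0, ..., d-1} with Generalized Random Response."""
    p = np.exp(eps) / (np.exp(eps) + d - 1)
    if rng.random() < p:
        return v
    # otherwise report one of the other d-1 values uniformly at random
    other = rng.integers(d - 1)
    return other if other < v else other + 1

def grr_aggregate(reports, d, eps):
    """Unbiased frequency estimates from GRR reports (Equation (1) and its estimator)."""
    n = len(reports)
    p = np.exp(eps) / (np.exp(eps) + d - 1)
    q = 1.0 / (np.exp(eps) + d - 1)
    c = np.bincount(reports, minlength=d)   # c_v: how many users reported value v
    return (c / n - q) / (p - q)            # f~_v = (c_v/n - q) / (p - q)
```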
As shown in [31], the estimation variance of GRR grows linearly in d; hence the accuracy deteriorates fast when the domain size d increases. This motivated the development of other FO protocols.

B. Optimized Local Hashing (OLH)
This FO deals with a large domain size d by first using a random hash function to map an input value into a smaller domain of size g, and then applying randomized response to the hash value in the smaller domain. In OLH, the reporting protocol is
$$\Psi_{\mathrm{OLH}(\epsilon)}(v) := \langle H, \Psi_{\mathrm{GRR}(\epsilon,g)}(H(v)) \rangle,$$
where H is randomly chosen from a family of hash functions that hash each value in D to {1, ..., g}, and Ψ_{GRR(ε,g)} is given in (1), while operating on the domain {1, ..., g}. The hash family should have the property that the distribution of each v's hashed result is uniform over {1, ..., g} and independent from the distributions of other input values in D. Since H is chosen independently of the user's input v, H by itself carries no meaningful information. Such a report ⟨H, r⟩ can be represented by the set Y = {y ∈ D | H(y) = r}. The use of a hash function can be viewed as a compression technique, which results in a constant-size encoding of a set. For a user with value v, the probability that v is in the set Y represented by the randomized report ⟨H, r⟩ is p = e^ε / (e^ε + g − 1), and the probability that a value other than v is in Y is q = 1/g.

For each value v ∈ D, the aggregator first computes the vector c of how many times each value is in the reported sets. More precisely, let Y_i denote the set defined by user i; then c_v = |{i | v ∈ Y_i}|. The aggregator then scales it:
$$\tilde{f}_v = \frac{c_v/n - 1/g}{p - 1/g} \quad (2)$$
In OLH, both the hashing step and the randomization step result in information loss. The choice of the parameter g is a tradeoff between losing information during the hashing step and losing information during the randomization step. It is found that the estimation variance, when viewed as a continuous function of g, is minimized when g = e^ε + 1 (or the closest integer to e^ε + 1 in practice) [31].
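The sketch below simulates OLH under the same caveats as the GRR sketch above (our own naming, not the paper's code). For simplicity it draws a fresh fully random mapping per user in place of a universal hash family, and it reuses grr_perturb from the GRR sketch to randomize the hashed value.

```python
import numpy as np

def olh_simulate(values, d, eps, master_seed=0):
    """Simulate OLH for user values in {0, ..., d-1}; return frequency estimates (Eq. (2))."""
    g = int(round(np.exp(eps))) + 1                 # g close to e^eps + 1
    p = np.exp(eps) / (np.exp(eps) + g - 1)
    q = 1.0 / g
    n = len(values)
    rng = np.random.default_rng(master_seed)
    support = np.zeros(d)                           # support[v] = c_v = #{i : v in Y_i}
    for v in values:
        seed = int(rng.integers(2**31))
        h = np.random.default_rng(seed).integers(g, size=d)   # H_i: D -> {0, ..., g-1}
        y = grr_perturb(h[v], g, eps)               # randomized response on the hashed value
        support += (h == y)                         # every value whose hash equals the report is supported
    return (support / n - q) / (p - q)
```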
C. Other FO Protocols

Several other FO protocols have been proposed. While they take different forms when originally proposed, in essence, they all have the user report some encoding of a subset Y ⊆ D, so that the user's true value has a probability p of being included in Y and any other value has a probability q < p of being included in Y.
The estimation method used in GRR and OLH (namely, f̃_v = (c_v/n − q)/(p − q)) equally applies.

Optimized Unary Encoding [31] encodes a value in a size-d domain using a length-d binary vector, and then perturbs each bit independently. The resulting bit vector encodes a set of values. It is found in [31] that when d is large, one should flip the 1 bit with probability 1/2, and flip a 0 bit with probability 1/(e^ε + 1). This results in the same values of p, q as OLH, and has the same estimation variance, but has higher communication cost (linear in the domain size d).

Subset Selection [36], [30] reports a randomly selected subset of a fixed size k. The sensitive value v is included in the set with probability p = 1/2. For any other value, it is included with probability q = p · (k−1)/(d−1) + (1−p) · k/(d−1). To minimize the estimation variance, k should be an integer equal or close to d/(e^ε + 1). Ignoring the integer constraint, we have
$$q = \frac{1}{2}\cdot\frac{2k-1}{d-1} = \frac{1}{2}\cdot\frac{\frac{2d}{e^{\epsilon}+1}-1}{d-1} = \frac{1}{e^{\epsilon}+1}\cdot\frac{d-(e^{\epsilon}+1)/2}{d-1} < \frac{1}{e^{\epsilon}+1}.$$
Its variance is smaller than that of OLH. However, as d increases, the term (d − (e^ε+1)/2)/(d−1) gets closer and closer to 1. For a larger domain, this offers essentially the same accuracy as OLH, with higher communication cost (linear in the domain size d).

Hadamard Response [4], [2] is similar to Subset Selection with k = d/2, where the Hadamard transform is used to compress the subset. The benefit of adopting this protocol is to reduce the communication bandwidth (each user's report is of constant size). While it is similar to OLH with g = 2, its aggregation part Φ is faster, because evaluating a Hadamard entry is practically faster than evaluating hash functions. However, this FO is sub-optimal when g = 2 is sub-optimal.

D. Accuracy of Frequency Oracles
In [31], it is proved that f̃_v = (c_v/n − q)/(p − q) produces unbiased estimates; that is, ∀v ∈ D, E[f̃_v] = f_v. Moreover, f̃_v has variance
$$\sigma_v^2 = \frac{q(1-q) + f_v(p-q)(1-p-q)}{n(p-q)^2} \quad (3)$$
As c_v follows a Binomial distribution, by the central limit theorem, the estimate f̃_v can be viewed as the true value f_v plus Normally distributed noise:
$$\tilde{f}_v \approx f_v + \mathcal{N}(0, \sigma_v^2). \quad (4)$$
When d is large and ε is not too large, f_v(p−q)(1−p−q) is dominated by q(1−q). Thus, one can approximate Equations (3) and (4) by ignoring the f_v term. Specifically,
$$\sigma^2 \approx \frac{q(1-q)}{n(p-q)^2}, \quad (5)$$
$$\tilde{f}_v \approx f_v + \mathcal{N}(0, \sigma^2). \quad (6)$$
As the probability that each user's report supports each value is independent, we focus on post-processing f̃ instead of Y.
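For example, the approximate noise scale of Equation (5) for OLH with the (continuous) optimal g can be computed as in the following sketch; the helper name and the example parameters are our own.

```python
import numpy as np

def olh_noise_scale(eps, n):
    """Approximate per-value standard deviation of OLH estimates (Equation (5))."""
    g = np.exp(eps) + 1                      # continuous optimal choice of g
    p = np.exp(eps) / (np.exp(eps) + g - 1)  # equals 1/2 at this g
    q = 1.0 / g
    sigma2 = q * (1 - q) / (n * (p - q) ** 2)
    return p, q, np.sqrt(sigma2)

# e.g., eps = 1 and one million users
p, q, sigma = olh_noise_scale(1.0, 10**6)
print(p, q, sigma)   # sigma is the noise scale used by the post-processing methods below
```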
IV. TOWARDS CONSISTENT FREQUENCY ORACLES

While existing state-of-the-art frequency oracles are designed to provide unbiased estimations while minimizing the variance, it is possible to further reduce the variance by performing post-processing steps that use prior knowledge to adjust the estimations. For example, exploiting the property that all frequency counts are non-negative can reduce the variance; however, simply turning all negative estimations to 0 introduces a systematic positive bias in all estimations. By also ensuring the property that the sum of all estimations must add up to 1, one ensures that the sum of the biases over all estimations is 0. However, even though the biases cancel out when summing over the whole domain, they still exist. There are different post-processing methods that were explicitly proposed or implicitly used. They result in different combinations of variance reduction and bias distribution. Selecting a post-processing method is similar to considering the bias-variance tradeoff in selecting a machine learning algorithm.

We study the properties of several post-processing methods, aiming to understand how they compare under different settings, and how they relate to each other. Our goal is to identify efficient post-processing methods that can give accurate estimations for a wide variety of queries. We first present the baseline method that does not do any post-processing.

• Base: We use the standard FO as presented in Section III to obtain estimations of each value.

Base has no bias, and its variance can be analytically computed (e.g., using [31]).
A. Baseline Methods
When the domain is large, there will be many values in the domain that have a zero or very low true frequency; their estimations may be negative. To overcome negativity, we describe three methods: Base-Pos, Post-Pos, and Base-Cut.

• Base-Pos: After applying the standard FO, we convert all negative estimations to 0.

This satisfies non-negativity, but the sum of all estimations is likely to be above 1. This reduces variance, as it turns erroneous negative estimations to 0, closer to the true value. As a result, for each individual value, Base-Pos results in an estimation that is at least as accurate as the Base method. However, this introduces a systematic positive bias, because some negative noise is removed or reduced by the process, but the positive noise is never removed. This positive bias will be reflected when answering subset queries, for which Base-Pos results in biased estimations. For larger-range queries, the bias can be significant.
Lemma 1.
Base-Pos will introduce positive bias to all values.

Proof.
The outputs of the standard FO are unbiased estimations, which means for any v,
$$f_v = \mathbb{E}[\tilde{f}_v] = \mathbb{E}[\tilde{f}_v \cdot \mathbb{1}_{\tilde{f}_v \ge 0}] + \mathbb{E}[\tilde{f}_v \cdot \mathbb{1}_{\tilde{f}_v < 0}]$$
As Base-Pos changes all negative estimated frequencies to 0, we have
$$\mathbb{E}[f'_v] = \mathbb{E}[\tilde{f}_v \cdot \mathbb{1}_{\tilde{f}_v \ge 0}]$$
After enforcing the non-negativity constraint, the bias is therefore E[f'_v] − f_v = −E[f̃_v · 1_{f̃_v < 0}] > 0.

• Post-Pos: For each query result, if it is negative, we convert it to 0.

This method does not post-process the estimated distribution. Rather, it post-processes each query result individually. For subset queries, as the results are typically positive, Post-Pos is similar to Base. On the other hand, when the query is on a single item, Post-Pos is equivalent to Base-Pos. Post-Pos still introduces a positive bias, but the bias is smaller for subset queries. However, Post-Pos may give inconsistent answers in the sense that the query result on A ∪ B, where A and B are disjoint, may not equal the sum of the query results for A and B separately.

• Base-Cut: After the standard FO, convert everything below some sensitivity threshold to 0.

The original design goal for frequency oracles is to recover frequencies for frequent values, and oftentimes there is a sensitivity threshold so that only estimations above the threshold are considered. Specifically, for each value, we compare its estimation with a threshold
$$T = F^{-1}\left(1 - \frac{\alpha}{d}\right)\sigma, \quad (7)$$
where d is the domain size, F^{-1} is the inverse of the cumulative distribution function of the standard normal distribution, and σ is the standard deviation of the LDP mechanism (i.e., as in Equation (5)). In Base-Cut, estimations below the threshold are considered to be noise. When using such a threshold, for any value v ∈ D whose original count is 0, the probability that it will have an estimated frequency above T (or the probability that a zero-mean Gaussian variable with standard deviation σ is above T) is at most α/d. Thus, when we observe an estimated frequency above T, the expected number of values whose true frequency is 0 is (by the union bound) at most d × α/d = α. In [14], it is recommended to set α = 5%, following conventions in the statistical community.

Empirically we observe that α = 5% performs poorly, because such a threshold can be too high when the population size is not very large and/or ε is not large. A large threshold results in all except a few estimations being below the threshold and set to 0. We note that the choice of α is trading off false positives against false negatives. Given a large domain, there are likely between several and a few dozen values that have quite high frequencies, with most of the remaining values having low true counts. We want to keep an estimation if it is a lot more likely to be from a frequent value than from a very low frequency one. In this paper, we choose to set α = 2, which ensures that the expected number of false positives, i.e., values with very low true frequencies but estimated frequencies above T, is around 2. If there are around 20 values that are truly frequent and have estimated frequencies above T, then the ratio of true positives to false positives when using this threshold is 10:1.

This method ensures that all estimations are non-negative. It does not ensure that the sum of estimations is 1. The resulting estimations are either high (above the chosen threshold) or zero. The estimation for each item with non-zero frequency is subject to two bias effects.
The negative bias effect is caused by the situation where estimations are cut to zero. The positive bias effect occurs when a large positive noise pushes the estimation above the threshold, so that the resulting estimation is higher than the true frequency.
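As a concrete illustration of the two estimate-level fixes above, here is a minimal sketch; the function names are ours, and the threshold uses the approximate σ from Equation (5).

```python
import numpy as np
from scipy.stats import norm

def base_pos(est):
    """Base-Pos: clip negative estimates to 0."""
    return np.maximum(est, 0)

def base_cut(est, sigma, alpha=2.0):
    """Base-Cut: zero out every estimate below the threshold T of Equation (7)."""
    d = len(est)
    T = norm.ppf(1 - alpha / d) * sigma   # F^{-1}(1 - alpha/d) * sigma
    return np.where(est >= T, est, 0)
```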
B. Normalization Methods

We now explore several methods that normalize the estimated frequencies of the whole domain to ensure that the sum of the estimates equals 1. When the estimations are normalized to sum to 1, the sum of the biases over the whole domain has to be 0.

Lemma 2.
If a normalization method adjusts the unbiased estimates so that they add up to 1, the sum of the biases it introduces over the whole domain is 0.

Proof. Denote f'_v as the estimated frequency of value v after post-processing. By linearity of expectation, we have
$$\sum_{v \in D}\left(\mathbb{E}[f'_v] - f_v\right) = \mathbb{E}\left[\sum_{v \in D} f'_v\right] - \sum_{v \in D} f_v = \mathbb{E}[1] - 1 = 0$$

One standard way to do such normalization is through additive normalization:

• Norm: After the standard FO, add δ to each estimation so that the overall sum is 1.

This method was formally proposed for the centralized setting of DP [16] and is used in the local setting, e.g., [28], [22]. Note that the method does not enforce non-negativity.
For GRR, Hadamard Response, and Subset Selection, this method actually does nothing, since each user reports a single value, and the estimations already sum to 1.
For OLH, however, each user reports a randomly selected subset whose size is a random variable, and Norm would change the estimations. It can be proved that Norm is unbiased:
Lemma 3.
Norm provides unbiased estimation for each value.

Proof.
By the definition of Norm, we have Σ_{v∈D} f'_v = Σ_{v∈D} (f̃_v + δ) = 1. As the frequency oracle outputs unbiased estimations, i.e., E[f̃_v] = f_v, we have
$$\mathbb{E}\left[\sum_{v \in D} f'_v\right] = 1 = \mathbb{E}\left[\sum_{v \in D} (\tilde{f}_v + \delta)\right] = \sum_{v \in D} \mathbb{E}[\tilde{f}_v] + d \cdot \mathbb{E}[\delta] = 1 + d \cdot \mathbb{E}[\delta] \implies \mathbb{E}[\delta] = 0$$
Thus E[f'_v] = E[f̃_v + δ] = E[f̃_v] + 0 = f_v.

Besides sum-to-one, if a method also ensures non-negativity, we first state that it introduces a positive bias to values whose frequencies are close to 0.
Lemma 4.
If a normalization method adjusts the unbiased estimates so that they add up to 1 and are non-negative, then it introduces positive biases to values that are sufficiently close to 0.

Proof. As the estimates are non-negative and sum up to 1, some of the estimates must be positive. For a value close to 0, there is some probability that its estimation is positive, but the probability that its estimation is negative is 0. Thus the expectation of its estimation is positive, leading to a positive bias.

Lemma 4 shows that the biases for any method that ensures both constraints cannot all be zero. Thus different methods are essentially different ways of distributing the biases. Next we present three such normalization methods.

• Norm-Mul: After the standard FO, convert negative values to 0. Then multiply each value by a multiplicative factor so that the sum is 1.

More precisely, given the estimation vector f̃, we find γ such that
$$\sum_{v \in D} \max(\gamma \times \tilde{f}_v, 0) = 1,$$
and assign f'_v = max(γ × f̃_v, 0) as the estimations. This results in a consistent FO. Kairouz et al. [19] evaluated this method, and it performs well when the underlying dataset distribution is smooth. This method results in positive biases for low-frequency items, but negative biases for high-frequency items. Moreover, the higher an item's true frequency, the larger the magnitude of the negative bias. The intuition is that γ is typically in the range [0, 1]; multiplying by such a factor may result in the estimations of high-frequency values being significantly lower than their true values. When the distribution is skewed, which is the more interesting case for LDP, the method performs poorly.
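A minimal sketch of Norm-Mul follows; since γ > 0, the constraint simplifies to γ = 1 / Σ_v max(f̃_v, 0) (the function name is ours).

```python
import numpy as np

def norm_mul(est):
    """Norm-Mul: zero out negatives, then rescale so the estimates sum to 1."""
    pos = np.maximum(est, 0)
    gamma = 1.0 / pos.sum()    # assumes at least one positive estimate
    return gamma * pos
```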
• Norm-Sub: After the standard FO, convert negative values to 0, while maintaining an overall sum of 1 by adding δ to each remaining value.

More precisely, given the estimation vector f̃, we want to find δ such that
$$\sum_{v \in D} \max(\tilde{f}_v + \delta, 0) = 1$$
Then the estimation for each value v is f'_v = max(f̃_v + δ, 0). This extends the method Norm and results in consistency. Norm-Sub was used by Kairouz et al. [19] and Bassily [3] to process results for some FOs. Under Norm-Sub, low-frequency values have positive biases, and high-frequency items have negative biases. The distribution of biases, however, is more even when compared to Norm-Mul.
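A small sketch of Norm-Sub is given below; it finds δ by bisection on the monotone function δ ↦ Σ_v max(f̃_v + δ, 0). The function name, bracketing bounds, and tolerance are our own choices.

```python
import numpy as np

def norm_sub(est, tol=1e-12):
    """Norm-Sub: find delta so that sum_v max(est_v + delta, 0) = 1, via bisection."""
    est = np.asarray(est, dtype=float)
    total = lambda delta: np.maximum(est + delta, 0).sum()
    lo = -est.max()            # total(lo) = 0 <= 1
    hi = 1.0 - est.min()       # total(hi) >= d >= 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < 1:
            lo = mid
        else:
            hi = mid
    return np.maximum(est + hi, 0)
```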
• Norm-Cut: After the standard FO, convert negative and small positive values to 0 so that the total sums up to 1.

We note that under Norm-Sub, higher frequency items have higher negative biases. One natural idea to address this is to turn the low estimations to 0 to ensure consistency, without changing the estimations of high-frequency values. This is the idea of Norm-Cut. More precisely, given the estimation vector f̃, there are two cases. When Σ_{v∈D} max(f̃_v, 0) ≤ 1, we simply change each negative estimation to 0. When Σ_{v∈D} max(f̃_v, 0) > 1, we want to find the smallest θ such that
$$\sum_{v \in D \,:\, \tilde{f}_v \ge \theta} \tilde{f}_v \le 1$$
Then the estimation for each value v is 0 if f̃_v < θ and f̃_v if f̃_v ≥ θ. This is similar to Base-Cut in that both methods change all estimated values below some threshold to 0. The differences lie in how the threshold is chosen. This results in non-negative estimations, and typically results in estimations that sum up to 1, but it might result in a sum < 1.
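A sketch of Norm-Cut under the same assumptions (our own naming): keeping estimates from largest to smallest while the kept total stays at most 1 is equivalent to choosing the smallest θ above.

```python
import numpy as np

def norm_cut(est):
    """Norm-Cut: keep only the largest estimates so that the kept mass is at most 1."""
    est = np.asarray(est, dtype=float)
    pos = np.maximum(est, 0)
    if pos.sum() <= 1:
        return pos
    out = np.zeros_like(est)
    order = np.argsort(est)[::-1]      # values from largest to smallest estimate
    mass = 0.0
    for v in order:
        if mass + est[v] > 1:          # adding this value would exceed a total of 1
            break
        out[v] = est[v]
        mass += est[v]
    return out
```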
C. Constrained Least Squares

From a more principled point of view, we note that what we are doing here is essentially solving a Constraint Inference (CI) problem, for which CLS (Constrained Least Squares) is a natural solution. This approach was proposed in [16], but without the constraint that the estimates are non-negative (which leads to Norm). Here we revisit this approach with the consistency constraint (i.e., both requirements in Definition 2).

• CLS: After the standard FO, use least squares with constraints (summing-to-one and non-negativity) to recover the values.

Specifically, given the estimates f̃ by the FO, the method outputs f' that is a solution of the following problem:
$$\begin{aligned} \text{minimize:} \quad & \|\mathbf{f}' - \tilde{\mathbf{f}}\|_2 \\ \text{subject to:} \quad & \forall v\; f'_v \ge 0 \\ & \textstyle\sum_v f'_v = 1 \end{aligned}$$
We can use the KKT conditions [21], [20] to solve the problem. The process is presented in Appendix A. In the solution, we partition the domain D into D_0 and D_1, where D_0 ∩ D_1 = ∅ and D_0 ∪ D_1 = D. For v ∈ D_0, we assign f'_v = 0. For v ∈ D_1,
$$f'_v = \tilde{f}_v - \frac{1}{|D_1|}\left(\sum_{v \in D_1} \tilde{f}_v - 1\right)$$
Norm-Sub is the solution to the Constrained Least Squares (CLS) formulation of the problem, and $\delta = -\frac{1}{|D_1|}\left(\sum_{v \in D_1} \tilde{f}_v - 1\right)$ is the δ we want to find in Norm-Sub.
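As a numerical sanity check (not the closed-form KKT solution of Appendix A), the CLS problem can also be handed to a generic constrained solver; one can verify that, up to solver tolerance, its output matches norm_sub above. The function name and solver choice are ours.

```python
import numpy as np
from scipy.optimize import minimize

def cls(est):
    """Constrained least squares: project est onto the probability simplex."""
    d = len(est)
    constraints = ({'type': 'eq', 'fun': lambda f: f.sum() - 1},)
    bounds = [(0, None)] * d
    res = minimize(lambda f: np.sum((f - est) ** 2),
                   x0=np.full(d, 1.0 / d),
                   bounds=bounds, constraints=constraints)
    return res.x
```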
D. Maximum Likelihood Estimation

Another more principled way of looking at this problem is to view it as recovering a distribution given some LDP reports. For this problem, one standard solution is Bayesian inference. In particular, we want to find the f' such that
$$\Pr[\mathbf{f}' \mid \tilde{\mathbf{f}}] = \frac{\Pr[\tilde{\mathbf{f}} \mid \mathbf{f}'] \cdot \Pr[\mathbf{f}']}{\Pr[\tilde{\mathbf{f}}]} \quad (8)$$
is maximized. Note that we require f' to satisfy ∀v f'_v ≥ 0 and Σ_v f'_v = 1. In (8), Pr[f'] is the prior, and the prior distribution influences the result. In our setting, as we assume there is no such prior, Pr[f'] is uniform; that is, Pr[f'] is a constant. The denominator Pr[f̃] is also a constant that does not influence the result. As a result, we are seeking the f' that is the maximum likelihood estimator (MLE), i.e., the one that maximizes Pr[f̃ | f'].

For this method, Kairouz et al. [19] derived the exact MLE solution for GRR and RAPPOR [14]. We compute Pr[f̃ | f'] using the general form of Equation (4), which states that, given the original distribution f', the vector f̃ is a set of independent random variables, where each component f̃_v follows a Gaussian distribution with mean f'_v and variance σ'²_v. The likelihood of f̃ given f' is thus
$$\Pr[\tilde{\mathbf{f}} \mid \mathbf{f}'] = \prod_v \Pr[\tilde{f}_v \mid f'_v] \approx \prod_v \frac{1}{\sqrt{2\pi\sigma'^2_v}}\, e^{-\frac{(f'_v - \tilde{f}_v)^2}{2\sigma'^2_v}} = \frac{1}{\sqrt{(2\pi)^d \prod_v \sigma'^2_v}}\, e^{-\sum_v \frac{(f'_v - \tilde{f}_v)^2}{2\sigma'^2_v}}. \quad (9)$$
To differentiate from [19], we call it MLE-Apx.

• MLE-Apx: First use the standard FO, then compute the MLE with constraints (summing-to-one and non-negativity) to recover the values.

In Appendix B, we use the KKT conditions [21], [20] to obtain an efficient solution. In particular, we partition the domain D into D_0 and D_1, where D_0 ∩ D_1 = ∅ and D_0 ∪ D_1 = D. For v ∈ D_0, f'_v = 0; for v ∈ D_1,
$$f'_v = \frac{q(1-q)x + \tilde{f}_v(p-q)}{p - q + (p-q)(1-p-q)x} \quad (10)$$
where
$$x = \frac{\sum_{v \in D_1} \tilde{f}_v(p-q) - (p-q)}{(p-q)(1-p-q) - |D_1|\, q(1-q)}$$
We can rewrite Equation (10) as f'_v = f̃_v · γ + δ, where
$$\gamma = \frac{p-q}{p - q + (p-q)(1-p-q)x}, \qquad \delta = \frac{q(1-q)x}{p - q + (p-q)(1-p-q)x}$$
Hence MLE-Apx appears to represent some hybrid of Norm-Sub and Norm-Mul. In our evaluation, we observe that Norm-Sub and MLE-Apx give very close results, as γ ≈ 1.
Furthermore, when the f_v component of the variance is dominated by the other component (as in Equation (5)), the CLS formulation is equivalent to our MLE formulation.

TABLE I
SUMMARY OF METHODS

Method | Description | Non-neg | Sum to 1 | Complexity
Base-Pos | Convert negative est. to 0 | Yes | No | O(d)
Post-Pos | Convert negative query result to 0 | Yes | No | N/A
Base-Cut | Convert est. below threshold T to 0 | Yes | No | O(d)
Norm | Add δ to est. | No | Yes | O(d)
Norm-Mul | Convert negative est. to 0, then multiply positive est. by γ | Yes | Yes | O(d)
Norm-Cut | Convert negative and small positive est. below θ to 0 | Yes | Almost | O(d)
Norm-Sub | Convert negative est. to 0 while adding δ to positive est. | Yes | Yes | O(d)
MLE-Apx | Convert negative est. to 0, then add δ to positive est. | Yes | Yes | O(d)
Power | Fit Power-Law dist., then minimize expected squared error | Yes | No | O(√n · d)
PowerNS | Apply Norm-Sub after Power | Yes | Yes | O(√n · d)

E. Least Expected Square Error
Jia et al. [17] proposed a method in which one first assumes that the data follows some type of distribution (with unknown parameters), then uses the estimates to fit the parameters of the distribution, and finally updates the estimates to achieve the least expected squared error.

• Power: Fit a distribution, and then minimize the expected squared error.
Formally, for each value v, the estimate f̃_v given by the FO is regarded as the sum of two parts: the true frequency f_v and noise following the normal distribution (as shown in Equation (6)). The method then finds the f'_v that minimizes E[(f_v − f'_v)² | f̃_v]. To solve this problem, the authors estimate the true distribution of f_v from the estimates f̃ (where f̃ is the vector of the f̃_v's). In particular, it is assumed in [17] that the distribution follows Power-Law or Gaussian. The distributions can be determined by one or two parameters, which can be fitted from the estimation f̃. Given Pr[x] as the probability that f_v = x under the fitted distribution, and Pr[x ∼ N(0, σ²)] as the pdf of x drawn from the Normal distribution with mean 0 and standard deviation σ (as in Equation (6)), one can then minimize the objective. Specifically, for each value v ∈ D, output
$$f'_v = \int \frac{\Pr\left[(\tilde{f}_v - x) \sim \mathcal{N}(0, \sigma^2)\right] \cdot \Pr[x] \cdot x}{\int \Pr\left[(\tilde{f}_v - y) \sim \mathcal{N}(0, \sigma^2)\right] \cdot \Pr[y]\, dy}\, dx. \quad (11)$$
We fit Pr[x] with the Power-Law distribution and call the method Power. Using this method requires knowledge and/or an assumption of the distribution to be estimated. If there is too much noise, or the underlying distribution is different, forcing the observations to fit a distribution could lead to poor accuracy. Moreover, this method does not ensure that the frequencies sum up to 1, as Equation (11) only considers the frequency of each value v independently. To make the result consistent, we use Norm-Sub to post-process the results of Power, since Power is close to CLS, and Norm-Sub is the solution to CLS. We call it PowerNS.

• PowerNS: First use the standard FO, then use Power to recover the values, and finally use Norm-Sub to further process the results.

F. Summary of Methods

In summary, Norm-Sub is the solution to the Constrained Least Squares (CLS) formulation of the problem. Furthermore, when the f_v component of the variance is dominated by the other component (as in Equation (5)), the CLS formulation is equivalent to our MLE formulation. In that case, Norm-Sub is equivalent to MLE-Apx.

Table I gives a summary of the methods. First of all, all of the methods preserve the frequency order of the values, i.e., f'_{v_1} ≤ f'_{v_2} iff f̃_{v_1} ≤ f̃_{v_2}. The methods can be classified into three classes: first, enforcing non-negativity only — Base-Pos, Post-Pos, Base-Cut, and Power fall in this category; second, enforcing summing-to-one only — only Norm is in this class; third, enforcing both requirements simultaneously — Norm-Mul, Norm-Cut, Norm-Sub, and PowerNS satisfy both requirements.

V. EVALUATION
As we are optimizing multiple utility metrics together, it is hard to theoretically compare the different methods. In this section, we run experiments to empirically evaluate them. At a high level, our evaluations show that different methods perform differently in different settings, and to achieve the best utility, it may or may not be necessary to exploit all the consistency constraints. As a result, we conclude that for full-domain queries, Base-Cut performs the best; for set-value queries, PowerNS performs the best; and for high-frequency-value queries, Norm performs the best.
A. Experimental Setup
Datasets.
We run experiments on two datasets (one synthetic and one real).

• Synthetic: Zipf's distribution with 1024 values and 1 million reports. We use s = 1. in this distribution.
• Emoji: the daily emoji usage data. We use the average emoji usage of an emoji keyboard, which gives a total count of n = 884427 with d = 1573 different emojis.

Setup.
The FO protocols and post-processing algorithms are implemented in Python 3.6.6 using NumPy 1.15; all the experiments are conducted on a PC with an Intel Core i7-4790 3.60GHz CPU and 16GB of memory. Although the post-processing methods can be applied to any FO protocol, we focus on simulating OLH, as it provides near-optimal utility with reasonable communication bandwidth.

[Fig. 1. Log-scale distribution of the Zipf's dataset fixing ε = 1; the x-axis is the sorted value index and the y-axis is its count. The blue line is the ground truth; the green dots are the estimations by the different methods: (a) Base (Post-Pos), (b) Base-Pos, (c) Base-Cut, (d) Norm, (e) Norm-Mul, (f) Norm-Cut, (g) Norm-Sub, (h) Power, (i) PowerNS.]
Metrics.
We evaluate three scenarios 1) estimate the fre-quency of every value in the domain (full-domain), 2) estimatethe aggregate frequencies of a subset of values (set-value),and 3) estimate the frequencies of the most frequent values(frequent-value).We use the metrics of
Mean of Squared Error (MSE). MSEmeasures the mean of squared difference between the estimateand the ground truth for each (set of) value. For full-domain,we compute MSE = 1 d (cid:88) v ∈ D ( f v − f (cid:48) v ) . For frequent-value, we consider the top k values with highest f v instead of the whole domain D ; and for set-value, insteadof measuring errors for singletons, we measure errors for sets,that is, we first sum the frequencies for a set of values, andthen measure the difference. Plotting Convention.
Unless otherwise specified, for each dataset and each method, we repeat each experiment multiple times and report the mean and standard deviation of the results. The standard deviation is typically very small, and barely noticeable in the figures. Because there are 11 algorithms (10 post-processing methods plus Base), and for any single metric there are often multiple methods that perform very similarly, their lines would overlap. To make Figures 4–8 readable, we plot results in two separate figures on the same row. On the left, we plot 6 methods: Base, Base-Pos, Post-Pos, Norm, Norm-Mul, and Norm-Sub. On the right, we plot Norm-Sub together with the remaining 5 methods: MLE-Apx, Base-Cut, Norm-Cut, Power, and PowerNS. We mainly want to compare the methods in the right column.

B. Bias-variance Evaluation
Figure 1 shows the true distribution of the synthetic Zipf's dataset and the mean of the estimations. As we plot the count estimations (instead of frequency estimations), the variance is larger (an n² multiplicative factor over the frequency estimations). We thus estimate 5000 times in order to make the mean stabilize. In Figure 2, we subtract the ground truth from the estimation mean and plot the difference, which represents the empirical bias. It can be seen that Base and Norm are unbiased. Base-Pos introduces a systematic positive bias. Base-Cut gives unbiased estimations for the first few most frequent values, as their true frequencies are much greater than the threshold T used to cut estimations below it to 0. As the noise is close to a normal distribution, the probability that a high-frequency value is estimated to be below T is exponentially small. A similar analysis also holds for the low-frequency values, whose estimates are unlikely to be above T. On the other hand, for values in between, the two biases compete with each other. At some point, the two effects cancel out, leading to unbiased estimations. But this point depends on the whole distribution, and thus is hard to find analytically. For Norm-Cut, similar reasoning applies, with the difference that the threshold in Norm-Cut is typically smaller. For Norm-Sub, each value is influenced by two factors: subtraction by the same amount, and conversion to 0 if negative. The high-frequency values mostly see the first factor; the low-frequency values are mostly affected by the second factor; and for the values in between, the two factors compete against each other. We see an increasing line for Norm-Sub. Finally, Power changes little for the top estimations, but more for the low ones, thus leading to a similar shape as Norm-Cut. The shape of PowerNS is close to Power because PowerNS applies Norm-Sub, which subtracts some amount from the estimations, after Power.

[Fig. 2. Bias of count estimation for the Zipf's dataset fixing ε = 1, with panels (a) Base (Post-Pos), (b) Base-Pos, (c) Base-Cut, (d) Norm, (e) Norm-Mul, (f) Norm-Cut, (g) Norm-Sub, (h) Power, (i) PowerNS, each annotated with its bias sum.]

Figure 3 shows the variance of the estimations among the 5000 runs. First of all, the variance is similar for all the values in Base and Norm, with Norm being slightly better (smaller) than Base. For all other methods, the variance drops with the rank, because for low-frequency values, the estimates are mostly zeros.

C. Full-domain Evaluation
Figure 4 shows the MSE when querying the frequency of every value in the domain. Note that the MSE is composed of the (square of the) bias shown in Figure 2 and the variance in Figure 3. We vary ε over a range of values. Let us first focus on the figures on the left. Base performs very close to Norm, since the adjustment of Norm can be either positive or negative, as the expected value of the estimation sum is 1. As Base-Pos (which is equivalent to Post-Pos in this setting) converts negative results to 0, its MSE is around half that of Base (note the y-axis is in log-scale). Norm-Sub is able to reduce the MSE of Base by about a factor of 10 and 100 in the Zipf's and Emoji datasets, respectively. Norm-Mul behaves differently from the other methods. In particular, its MSE decreases much more slowly than that of the other methods. This is because Norm-Mul multiplies the original estimations by the same factor. The higher the estimate, the greater the adjustment. Since the estimations are individually unbiased, this is not the correct adjustment.

For the right part of Figure 4, we observe that Norm-Sub and MLE-Apx perform almost exactly the same, validating the prediction from the theoretical analysis. Norm-Sub, MLE-Apx, Power, PowerNS, and Base-Cut perform very similarly. On these two datasets, PowerNS performs the best. Note that PowerNS works well when the distribution is close to Power-Law. For an unknown distribution, we still recommend Base-Cut. This is because, if one considers the average accuracy of all estimations, the dominating source of errors comes from the fact that many values have true frequencies close or equal to 0 and are randomly perturbed. Base-Cut keeps the high-frequency values unchanged, and converts results below the threshold T to 0. Norm-Cut also converts low estimations to 0, but its threshold θ is likely to be lower than T, because θ is chosen to achieve a sum of 1.

Benefit of Post-Processing.
We demonstrate the benefit of post-processing by measuring the relationship between n and n', so that n records with post-processing achieve the same accuracy as n' records without it. In particular, we vary n and measure the errors for the different methods. We then calculate n' using Equation (3). In particular, the analytical MSE for n' records is
$$\frac{1}{d}\sum_v \sigma_v^2 = \frac{q(1-q)}{n'(p-q)^2} + \frac{1}{d}\sum_v \frac{f_v(1-p-q)}{n'(p-q)} = \frac{q(1-q)}{n'(p-q)^2} + \frac{1}{d}\cdot\frac{1-p-q}{n'(p-q)}.$$
Given the empirical MSE, we can obtain the n' that achieves the same error analytically. Note that this MSE does not depend on the distribution. Thus we only evaluate on the Zipf's dataset. The result is shown in Figure 5. We vary the size of the dataset n and plot the value of n'. The higher the line, the better the method performs. Base and Norm are two straight lines with a slope of 1, verifying the analytical variance. The n' value for Norm-Mul grows even more slowly than that of Base, indicating the harm of using Norm-Mul as a post-processing method. The performance of the other methods follows a similar trend to the full-domain MSE (as shown in the upper row of Figure 4), with PowerNS giving the best performance, which translates into a substantial saving in the number of users needed.

[Fig. 3. Variance of count estimation for the Zipf's dataset fixing ε = 1, with panels (a) Base (Post-Pos), (b) Base-Pos, (c) Base-Cut, (d) Norm, (e) Norm-Mul, (f) Norm-Cut, (g) Norm-Sub, (h) Power, (i) PowerNS; the y-axes are scaled down by n².]
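As a concrete illustration (our own helper name, assuming OLH's p and q), the analytical MSE expression above can be inverted to obtain n' from an empirically measured MSE:

```python
import numpy as np

def equivalent_n(mse, eps, d):
    """Number of users n' whose analytical OLH MSE matches the given empirical MSE."""
    g = np.exp(eps) + 1
    p = np.exp(eps) / (np.exp(eps) + g - 1)
    q = 1.0 / g
    # MSE(n') = [q(1-q)/(p-q)^2 + (1/d)*(1-p-q)/(p-q)] / n'  =>  n' = numerator / MSE
    numerator = q * (1 - q) / (p - q) ** 2 + (1 - p - q) / (d * (p - q))
    return numerator / mse
```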
D. Set-value Evaluation

Estimating set-values plays an important role in the interactive data analysis setting (e.g., estimating which category of emojis is more popular). Keeping ε = 1, we evaluate the performance of the different methods while changing the size of the set. For the set-value queries, we uniformly sample ρ% × |D| elements from the domain and evaluate the MSE between the sum of their true frequencies and their estimated frequencies. Formally, define D_{s_ρ} as a random subset of D with ρ% × |D| elements, and define f_{D_{s_ρ}} = Σ_{v ∈ D_{s_ρ}} f_v. We sample D_{s_ρ} multiple times and measure the MSE between f_{D_{s_ρ}} and f'_{D_{s_ρ}}. Overall, the MSE of set-value queries is greater than that of the full-domain evaluation, because the errors of the individual estimations accumulate.

Vary ρ. Following the layout convention, we show results for set-value estimations in Figure 6, where we first vary ρ. Overall, the approaches that exploit the summing-to-1 requirement, including Norm, Norm-Mul, Norm-Sub, MLE-Apx, Norm-Cut, and PowerNS, perform well, especially when ρ is large. Moreover, their MSE is symmetric around ρ = 50, because, as the results are normalized, estimating set-values for ρ > 50 is equivalent to estimating the rest. When ρ = 90, the best norm-based method, PowerNS, outperforms any of the non-norm-based methods by orders of magnitude.

For each specific method, it is observed that the MSE for Base-Pos is higher than for the other methods, because it only turns negative estimates to 0, introducing a systematic bias. Post-Pos is slightly better than Base, as it turns negative query results to 0. In the settings we evaluated, Base-Cut also outperforms Base; this happens because converting estimates below the threshold T to 0 is more likely to make the summation f'_{D_{s_ρ}} close to one. Finally, Power only converts negative estimations to be positive, introducing a systematic bias; PowerNS further makes them sum to 1, thus achieving better utility than all