Locally Differentially Private Frequency Estimation with Consistency
Tianhao Wang, Milan Lopuhaä-Zwakenberg, Zitao Li, Boris Skoric, Ninghui Li
Purdue University, Eindhoven University of Technology
{tianhaowang, li2490, ninghui}@purdue.edu, {m.a.lopuhaa, b.skoric}@tue.nl

Abstract—Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in the industry. A basic building block is frequency oracle (FO) protocols, which estimate frequencies of values. While several FO protocols have been proposed, their design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols by exploiting the knowledge that all individual frequencies should be non-negative and sum up to one can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conducted extensive experimental evaluations to understand which methods should be used for different query tasks.

I. INTRODUCTION
Differential privacy (DP) [12] has been accepted as the de facto standard for data privacy. Recently, techniques for satisfying DP in the local setting, which we call LDP, have been studied and deployed. In this setting, there are many users and one aggregator. The aggregator does not see the actual private data of each individual. Instead, each user sends randomized information to the aggregator, who attempts to infer the data distribution based on that. LDP techniques have been deployed by companies like Apple [1], Google [14], Microsoft [9], and Alibaba [32]. Examples of use cases include collecting users' default browser homepage and search engine, in order to understand the unwanted or malicious hijacking of user settings; or frequently typed emojis and words, to help with keyboard typing recommendation.

The fundamental tools in LDP are mechanisms to estimate frequencies of values. Existing research [14], [5], [31], [2], [36] has developed frequency oracle (FO) protocols, where the aggregator can estimate the frequency of any chosen value in the specified domain (the fraction of users reporting that value). While these protocols were designed to provide unbiased estimations of individual frequencies while minimizing the estimation variance [31], they can perform poorly for some tasks. In [17], it is shown that when one wants to query the frequency of all values in the domain, one can obtain significant accuracy improvement by exploiting the belief that the distribution likely follows a power law. Also, some applications naturally require querying the sums of frequencies for values in a subset. For example, with the estimation of each emoji's frequency, one may be interested in understanding which categories of emojis are more popular and need to issue subset frequency queries. For another example, in [38], multiple attributes are encoded together and reported using LDP, and recovering the distribution for each attribute separately requires computing the frequencies of sets of encoded values. For frequencies of a subset of values, simply summing up the estimations of all values is far from optimal, especially when the input domain is large.

We note that the problem of answering queries using information obtained from the frequency oracle protocols is an estimation problem. Existing methods such as those in [31] do not utilize any prior knowledge of the distribution to be estimated. Due to the significant amount of noise needed to satisfy LDP, the estimations for many values may be negative. Also, some LDP protocols may result in a total sum of frequencies different from one. In this paper, we show that one can develop better estimation methods by exploiting the universal fact that all frequencies are non-negative and sum up to 1.

Interestingly, when taking advantage of such prior knowledge, one introduces biases in the estimations. For example, when we impose the non-negativity constraint, we are introducing positive biases in the estimation as a side effect. Essentially, when we exploit prior beliefs, the estimations will be biased towards the prior beliefs. These biases can cause some queries to be much more inaccurate. For example, changing all negative estimations to zero improves accuracy for frequency estimations of individual values.
However, the introduced positive biases accumulate for range queries. Different methods of utilizing the prior knowledge introduce different forms of biases, and thus have different impacts for different kinds of queries.

In this paper, we consider 10 different methods, which utilize prior knowledge differently. Some methods enforce only non-negativity; some other methods enforce only that all estimations sum to 1; and other methods enforce both. These methods can also be combined with the "Power" method in [17] that exploits the power-law assumption.

We evaluate these methods on three tasks: frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We find that there is no single method that outperforms the other methods for all tasks. A method that exploits only non-negativity performs the best for individual values; a method that exploits only the summing-to-one constraint performs the best for frequent values; and a method that enforces both can be applied in conjunction with Power to perform the best for subsets of values.

To summarize, the main contributions of this paper are threefold:

• We introduced the consistency properties as a way to improve accuracy for FO protocols under LDP, and summarized 10 different post-processing methods that exploit the consistency properties differently.

• We established theoretical relationships between Constrained Least Squares and Maximum Likelihood Estimation, and analyzed which (if any) estimation biases are introduced by these methods.

• We conducted extensive experiments on both synthetic and real-world datasets; the results improve the understanding of the strengths and weaknesses of the different approaches.
Roadmap.
In Section II, we give the problem definition, followed by the background information on FO protocols in Section III. We present the post-processing methods in Section IV. Experimental results are presented in Section V. Finally, we discuss related work in Section VI and provide concluding remarks in Section VII.

II. PROBLEM SETTING
We consider the setting where there are many users and one aggregator. Each user possesses a value v from a finite domain D, and the aggregator wants to learn the distribution of values among all users, in a way that protects the privacy of individual users. More specifically, the aggregator wants to estimate, for each value v ∈ D, the fraction of users having v (the number of users having v divided by the population size). Such protocols are called frequency oracle (FO) protocols under Local Differential Privacy (LDP), and they are the key building blocks of other LDP tasks.

Privacy Requirement. An FO protocol is specified by a pair of algorithms: Ψ is used by each user to perturb her input value, and Φ is used by the aggregator. Each user sends Ψ(v) to the aggregator. The formal privacy requirement is that the algorithm Ψ(·) satisfies the following property:

Definition 1 (ε-Local Differential Privacy). An algorithm Ψ(·) satisfies ε-local differential privacy (ε-LDP), where ε ≥ 0, if and only if for any input v, v' ∈ D, we have
$$\forall y \in \Psi(D): \quad \Pr[\Psi(v) = y] \le e^{\epsilon} \Pr[\Psi(v') = y],$$
where Ψ(D) is discrete and denotes the set of all possible outputs of Ψ.

Since a user never reveals v to the aggregator and reports only Ψ(v), the user's privacy is still protected even if the aggregator is malicious.

Utility Goals.
The aggregator uses Φ, which takes the vector of all reports from users as the input, and produces $\tilde{\mathbf{f}} = \langle \tilde{f}_v \rangle_{v \in D}$, the estimated frequencies of the values v ∈ D (i.e., the fraction of users who have input value v). As Ψ is a randomized function, the resulting f̃ is inaccurate. In existing work, the design goal for Ψ and Φ is that the estimated frequency for each v is unbiased, and the variance of the estimation is minimized. As we will show in this paper, these may not result in the most accurate answers to different queries.

In this paper, we consider three different query scenarios: 1) query the frequency of every value in the domain, 2) query the aggregate frequencies of subsets of values, and 3) query the frequencies of the most frequent values. For each value or set of values, we compute its estimate and the ground truth, and calculate their difference, measured by Mean of Squared Error (MSE).

Consistency.
We will show that the utility of existing mechanisms can be improved by enforcing the following consistency requirement.
Definition 2 (Consistency). The estimated frequencies are consistent if and only if the following two conditions are satisfied:
1) The estimated frequency of each value is non-negative.
2) The sum of the estimated frequencies is 1.

III. FREQUENCY ORACLE PROTOCOLS
We review the state-of-the-art frequency oracle protocols. We utilize the generalized view from [31] to present the protocols, so that our post-processing procedures can be applied to all of them.
A. Generalized Random Response (GRR)
This FO protocol generalizes the randomized response technique [35]. Here each user with private value v ∈ D sends the true value v with probability p, and with probability 1 − p sends a randomly chosen v' ∈ D \ {v}. Suppose the domain D contains d = |D| values; the perturbation function is formally defined as
$$\forall y \in D \quad \Pr\left[\Psi_{\mathrm{GRR}(\epsilon,d)}(v) = y\right] = \begin{cases} p = \frac{e^{\epsilon}}{e^{\epsilon}+d-1}, & \text{if } y = v \\ q = \frac{1}{e^{\epsilon}+d-1}, & \text{if } y \ne v \end{cases} \quad (1)$$
This satisfies ε-LDP since p/q = e^ε.

From a population of n users, the aggregator receives a length-n vector y = ⟨y_1, y_2, ..., y_n⟩, where y_i ∈ D is the reported value of the i-th user. The aggregator counts the number of times each value v appears in y and produces a length-d vector c of natural numbers. Observe that the components of c sum up to n, i.e., Σ_{v∈D} c_v = n. The aggregator then obtains the estimated frequency vector f̃ by scaling each component of c as follows:
$$\tilde{f}_v = \frac{c_v/n - q}{p - q} = \frac{c_v/n - \frac{1}{e^{\epsilon}+d-1}}{\frac{e^{\epsilon}-1}{e^{\epsilon}+d-1}}$$
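To make the mechanism concrete, the following is a minimal sketch of GRR perturbation and aggregation in Python/NumPy; the function names and structure are our own choices for illustration and are not from the paper.

```python
import numpy as np

def grr_perturb(v, d, eps, rng=np.random.default_rng()):
    """Perturb one value v in {0, ..., d-1} with Generalized Random Response."""
    p = np.exp(eps) / (np.exp(eps) + d - 1)
    if rng.random() < p:
        return v
    # otherwise report one of the other d-1 values uniformly at random
    other = rng.integers(d - 1)
    return other if other < v else other + 1

def grr_aggregate(reports, d, eps):
    """Unbiased frequency estimates from GRR reports (Equation (1) and its estimator)."""
    n = len(reports)
    p = np.exp(eps) / (np.exp(eps) + d - 1)
    q = 1.0 / (np.exp(eps) + d - 1)
    c = np.bincount(reports, minlength=d)   # c_v: how many users reported value v
    return (c / n - q) / (p - q)            # f~_v = (c_v/n - q) / (p - q)
```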
As shown in [31], the estimation variance of GRR grows linearly in d; hence the accuracy deteriorates fast when the domain size d increases. This motivated the development of other FO protocols.

B. Optimized Local Hashing (OLH)
This FO deals with a large domain size d by first using a random hash function to map an input value into a smaller domain of size g, and then applying randomized response to the hash value in the smaller domain. In OLH, the reporting protocol is
$$\Psi_{\mathrm{OLH}(\epsilon)}(v) := \langle H, \Psi_{\mathrm{GRR}(\epsilon,g)}(H(v)) \rangle,$$
where H is randomly chosen from a family of hash functions that hash each value in D to {1, ..., g}, and Ψ_{GRR(ε,g)} is given in (1), while operating on the domain {1, ..., g}. The hash family should have the property that the distribution of each v's hashed result is uniform over {1, ..., g} and independent from the distributions of other input values in D. Since H is chosen independently of the user's input v, H by itself carries no meaningful information. Such a report ⟨H, r⟩ can be represented by the set Y = {y ∈ D | H(y) = r}. The use of a hash function can be viewed as a compression technique, which results in a constant-size encoding of a set. For a user with value v, the probability that v is in the set Y represented by the randomized report ⟨H, r⟩ is p = e^ε / (e^ε + g − 1), and the probability that a value other than v is in Y is q = 1/g.

For each value v ∈ D, the aggregator first computes the vector c of how many times each value is in the reported sets. More precisely, let Y_i denote the set defined by user i; then c_v = |{i | v ∈ Y_i}|. The aggregator then scales it:
$$\tilde{f}_v = \frac{c_v/n - 1/g}{p - 1/g} \quad (2)$$
In OLH, both the hashing step and the randomization step result in information loss. The choice of the parameter g is a tradeoff between losing information during the hashing step and losing information during the randomization step. It is found that the estimation variance, when viewed as a continuous function of g, is minimized when g = e^ε + 1 (or the closest integer to e^ε + 1 in practice) [31].
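The sketch below simulates OLH under the same caveats as the GRR sketch above (our own naming, not the paper's code). For simplicity it draws a fresh fully random mapping per user in place of a universal hash family, and it reuses grr_perturb from the GRR sketch to randomize the hashed value.

```python
import numpy as np

def olh_simulate(values, d, eps, master_seed=0):
    """Simulate OLH for user values in {0, ..., d-1}; return frequency estimates (Eq. (2))."""
    g = int(round(np.exp(eps))) + 1                 # g close to e^eps + 1
    p = np.exp(eps) / (np.exp(eps) + g - 1)
    q = 1.0 / g
    n = len(values)
    rng = np.random.default_rng(master_seed)
    support = np.zeros(d)                           # support[v] = c_v = #{i : v in Y_i}
    for v in values:
        seed = int(rng.integers(2**31))
        h = np.random.default_rng(seed).integers(g, size=d)   # H_i: D -> {0, ..., g-1}
        y = grr_perturb(h[v], g, eps)               # randomized response on the hashed value
        support += (h == y)                         # every value whose hash equals the report is supported
    return (support / n - q) / (p - q)
```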
C. Other FO Protocols

Several other FO protocols have been proposed. While they take different forms when originally proposed, in essence, they all have the user report some encoding of a subset Y ⊆ D, so that the user's true value has a probability p of being included in Y and any other value has a probability q < p of being included in Y.
The estimation method used in GRR and OLH (namely, f̃_v = (c_v/n − q)/(p − q)) equally applies.

Optimized Unary Encoding [31] encodes a value in a size-d domain using a length-d binary vector, and then perturbs each bit independently. The resulting bit vector encodes a set of values. It is found in [31] that when d is large, one should flip the 1 bit with probability 1/2, and flip a 0 bit with probability 1/(e^ε + 1). This results in the same values of p, q as OLH, and has the same estimation variance, but has higher communication cost (linear in the domain size d).

Subset Selection [36], [30] reports a randomly selected subset of a fixed size k. The sensitive value v is included in the set with probability p = 1/2. For any other value, it is included with probability q = p · (k−1)/(d−1) + (1−p) · k/(d−1). To minimize the estimation variance, k should be an integer equal or close to d/(e^ε + 1). Ignoring the integer constraint, we have
$$q = \frac{1}{2}\cdot\frac{2k-1}{d-1} = \frac{1}{2}\cdot\frac{\frac{2d}{e^{\epsilon}+1}-1}{d-1} = \frac{1}{e^{\epsilon}+1}\cdot\frac{d-(e^{\epsilon}+1)/2}{d-1} < \frac{1}{e^{\epsilon}+1}.$$
Its variance is smaller than that of OLH. However, as d increases, the term (d − (e^ε+1)/2)/(d−1) gets closer and closer to 1. For a larger domain, this offers essentially the same accuracy as OLH, with higher communication cost (linear in the domain size d).

Hadamard Response [4], [2] is similar to Subset Selection with k = d/2, where the Hadamard transform is used to compress the subset. The benefit of adopting this protocol is to reduce the communication bandwidth (each user's report is of constant size). While it is similar to OLH with g = 2, its aggregation part Φ is faster, because evaluating a Hadamard entry is practically faster than evaluating hash functions. However, this FO is sub-optimal when g = 2 is sub-optimal.

D. Accuracy of Frequency Oracles
In [31], it is proved that f̃_v = (c_v/n − q)/(p − q) produces unbiased estimates; that is, ∀v ∈ D, E[f̃_v] = f_v. Moreover, f̃_v has variance
$$\sigma_v^2 = \frac{q(1-q) + f_v(p-q)(1-p-q)}{n(p-q)^2} \quad (3)$$
As c_v follows a Binomial distribution, by the central limit theorem, the estimate f̃_v can be viewed as the true value f_v plus Normally distributed noise:
$$\tilde{f}_v \approx f_v + \mathcal{N}(0, \sigma_v^2). \quad (4)$$
When d is large and ε is not too large, f_v(p−q)(1−p−q) is dominated by q(1−q). Thus, one can approximate Equations (3) and (4) by ignoring the f_v term. Specifically,
$$\sigma^2 \approx \frac{q(1-q)}{n(p-q)^2}, \quad (5)$$
$$\tilde{f}_v \approx f_v + \mathcal{N}(0, \sigma^2). \quad (6)$$
As the probability that each user's report supports each value is independent, we focus on post-processing f̃ instead of Y.
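For example, the approximate noise scale of Equation (5) for OLH with the (continuous) optimal g can be computed as in the following sketch; the helper name and the example parameters are our own.

```python
import numpy as np

def olh_noise_scale(eps, n):
    """Approximate per-value standard deviation of OLH estimates (Equation (5))."""
    g = np.exp(eps) + 1                      # continuous optimal choice of g
    p = np.exp(eps) / (np.exp(eps) + g - 1)  # equals 1/2 at this g
    q = 1.0 / g
    sigma2 = q * (1 - q) / (n * (p - q) ** 2)
    return p, q, np.sqrt(sigma2)

# e.g., eps = 1 and one million users
p, q, sigma = olh_noise_scale(1.0, 10**6)
print(p, q, sigma)   # sigma is the noise scale used by the post-processing methods below
```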
IV. TOWARDS CONSISTENT FREQUENCY ORACLES

While existing state-of-the-art frequency oracles are designed to provide unbiased estimations while minimizing the variance, it is possible to further reduce the variance by performing post-processing steps that use prior knowledge to adjust the estimations. For example, exploiting the property that all frequency counts are non-negative can reduce the variance; however, simply turning all negative estimations to 0 introduces a systematic positive bias in all estimations. By also ensuring the property that the sum of all estimations must add up to 1, one ensures that the sum of the biases over all estimations is 0. However, even though the biases cancel out when summing over the whole domain, they still exist. There are different post-processing methods that were explicitly proposed or implicitly used. They result in different combinations of variance reduction and bias distribution. Selecting a post-processing method is similar to considering the bias-variance tradeoff in selecting a machine learning algorithm.

We study the properties of several post-processing methods, aiming to understand how they compare under different settings, and how they relate to each other. Our goal is to identify efficient post-processing methods that can give accurate estimations for a wide variety of queries. We first present the baseline method that does not do any post-processing.

• Base: We use the standard FO as presented in Section III to obtain estimations of each value.

Base has no bias, and its variance can be analytically computed (e.g., using [31]).
A. Baseline Methods
When the domain is large, there will be many values in the domain that have a zero or very low true frequency; their estimations may be negative. To overcome negativity, we describe three methods: Base-Pos, Post-Pos, and Base-Cut.

• Base-Pos: After applying the standard FO, we convert all negative estimations to 0.

This satisfies non-negativity, but the sum of all estimations is likely to be above 1. This reduces variance, as it turns erroneous negative estimations to 0, closer to the true value. As a result, for each individual value, Base-Pos results in an estimation that is at least as accurate as the Base method. However, this introduces a systematic positive bias, because some negative noise is removed or reduced by the process, but the positive noise is never removed. This positive bias will be reflected when answering subset queries, for which Base-Pos results in biased estimations. For larger-range queries, the bias can be significant.
Lemma 1.
Base-Pos will introduce positive bias to all values.

Proof.
The outputs of the standard FO are unbiased estimations, which means for any v,
$$f_v = \mathbb{E}[\tilde{f}_v] = \mathbb{E}[\tilde{f}_v \cdot \mathbb{1}_{\tilde{f}_v \ge 0}] + \mathbb{E}[\tilde{f}_v \cdot \mathbb{1}_{\tilde{f}_v < 0}]$$
As Base-Pos changes all negative estimated frequencies to 0, we have
$$\mathbb{E}[f'_v] = \mathbb{E}[\tilde{f}_v \cdot \mathbb{1}_{\tilde{f}_v \ge 0}]$$
After enforcing the non-negativity constraint, the bias is therefore E[f'_v] − f_v = −E[f̃_v · 1_{f̃_v < 0}] > 0.

• Post-Pos: For each query result, if it is negative, we convert it to 0.

This method does not post-process the estimated distribution. Rather, it post-processes each query result individually. For subset queries, as the results are typically positive, Post-Pos is similar to Base. On the other hand, when the query is on a single item, Post-Pos is equivalent to Base-Pos. Post-Pos still introduces a positive bias, but the bias is smaller for subset queries. However, Post-Pos may give inconsistent answers in the sense that the query result on A ∪ B, where A and B are disjoint, may not equal the sum of the query results for A and B separately.

• Base-Cut: After the standard FO, convert everything below some sensitivity threshold to 0.

The original design goal for frequency oracles is to recover frequencies for frequent values, and oftentimes there is a sensitivity threshold so that only estimations above the threshold are considered. Specifically, for each value, we compare its estimation with a threshold
$$T = F^{-1}\left(1 - \frac{\alpha}{d}\right)\sigma, \quad (7)$$
where d is the domain size, F^{-1} is the inverse of the cumulative distribution function of the standard normal distribution, and σ is the standard deviation of the LDP mechanism (i.e., as in Equation (5)). In Base-Cut, estimations below the threshold are considered to be noise. When using such a threshold, for any value v ∈ D whose original count is 0, the probability that it will have an estimated frequency above T (or the probability that a zero-mean Gaussian variable with standard deviation σ is above T) is at most α/d. Thus, when we observe an estimated frequency above T, the expected number of values whose true frequency is 0 is (by the union bound) at most d × α/d = α. In [14], it is recommended to set α = 5%, following conventions in the statistical community.

Empirically we observe that α = 5% performs poorly, because such a threshold can be too high when the population size is not very large and/or ε is not large. A large threshold results in all except a few estimations being below the threshold and set to 0. We note that the choice of α is trading off false positives against false negatives. Given a large domain, there are likely between several and a few dozen values that have quite high frequencies, with most of the remaining values having low true counts. We want to keep an estimation if it is a lot more likely to be from a frequent value than from a very low frequency one. In this paper, we choose to set α = 2, which ensures that the expected number of false positives, i.e., values with very low true frequencies but estimated frequencies above T, is around 2. If there are around 20 values that are truly frequent and have estimated frequencies above T, then the ratio of true positives to false positives when using this threshold is 10:1.

This method ensures that all estimations are non-negative. It does not ensure that the sum of estimations is 1. The resulting estimations are either high (above the chosen threshold) or zero. The estimation for each item with non-zero frequency is subject to two bias effects.
The negative bias effect is caused by the situation where estimations are cut to zero. The positive bias effect occurs when a large positive noise pushes the estimation above the threshold, so that the resulting estimation is higher than the true frequency.
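As a concrete illustration of the two estimate-level fixes above, here is a minimal sketch; the function names are ours, and the threshold uses the approximate σ from Equation (5).

```python
import numpy as np
from scipy.stats import norm

def base_pos(est):
    """Base-Pos: clip negative estimates to 0."""
    return np.maximum(est, 0)

def base_cut(est, sigma, alpha=2.0):
    """Base-Cut: zero out every estimate below the threshold T of Equation (7)."""
    d = len(est)
    T = norm.ppf(1 - alpha / d) * sigma   # F^{-1}(1 - alpha/d) * sigma
    return np.where(est >= T, est, 0)
```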
B. Normalization Methods

We now explore several methods that normalize the estimated frequencies of the whole domain to ensure that the sum of the estimates equals 1. When the estimations are normalized to sum to 1, the sum of the biases over the whole domain has to be 0.

Lemma 2.
If a normalization method adjusts the unbiased estimates so that they add up to 1, the sum of the biases it introduces over the whole domain is 0.

Proof. Denote f'_v as the estimated frequency of value v after post-processing. By linearity of expectation, we have
$$\sum_{v \in D}\left(\mathbb{E}[f'_v] - f_v\right) = \mathbb{E}\left[\sum_{v \in D} f'_v\right] - \sum_{v \in D} f_v = \mathbb{E}[1] - 1 = 0$$

One standard way to do such normalization is through additive normalization:

• Norm: After the standard FO, add δ to each estimation so that the overall sum is 1.

This method was formally proposed for the centralized setting of DP [16] and is used in the local setting, e.g., [28], [22]. Note that the method does not enforce non-negativity.
For GRR, Hadamard Response, and Subset Selection, this method actually does nothing, since each user reports a single value, and the estimations already sum to 1.
For OLH, however, each user reports a randomly selected subset whose size is a random variable, and Norm would change the estimations. It can be proved that Norm is unbiased:
Lemma 3.
Norm provides unbiased estimation for each value.

Proof.
By the definition of Norm, we have Σ_{v∈D} f'_v = Σ_{v∈D} (f̃_v + δ) = 1. As the frequency oracle outputs unbiased estimations, i.e., E[f̃_v] = f_v, we have
$$\mathbb{E}\left[\sum_{v \in D} f'_v\right] = 1 = \mathbb{E}\left[\sum_{v \in D} (\tilde{f}_v + \delta)\right] = \sum_{v \in D} \mathbb{E}[\tilde{f}_v] + d \cdot \mathbb{E}[\delta] = 1 + d \cdot \mathbb{E}[\delta] \implies \mathbb{E}[\delta] = 0$$
Thus E[f'_v] = E[f̃_v + δ] = E[f̃_v] + 0 = f_v.

Besides sum-to-one, if a method also ensures non-negativity, we first state that it introduces a positive bias to values whose frequencies are close to 0.
Lemma 4.
If a normalization method adjusts the unbiased estimates so that they add up to 1 and are non-negative, then it introduces positive biases to values that are sufficiently close to 0.

Proof. As the estimates are non-negative and sum up to 1, some of the estimates must be positive. For a value close to 0, there is some probability that its estimation is positive, but the probability that its estimation is negative is 0. Thus the expectation of its estimation is positive, leading to a positive bias.

Lemma 4 shows that the biases for any method that ensures both constraints cannot all be zero. Thus different methods are essentially different ways of distributing the biases. Next we present three such normalization methods.

• Norm-Mul: After the standard FO, convert negative values to 0. Then multiply each value by a multiplicative factor so that the sum is 1.

More precisely, given the estimation vector f̃, we find γ such that
$$\sum_{v \in D} \max(\gamma \times \tilde{f}_v, 0) = 1,$$
and assign f'_v = max(γ × f̃_v, 0) as the estimations. This results in a consistent FO. Kairouz et al. [19] evaluated this method, and it performs well when the underlying dataset distribution is smooth. This method results in positive biases for low-frequency items, but negative biases for high-frequency items. Moreover, the higher an item's true frequency, the larger the magnitude of the negative bias. The intuition is that γ is typically in the range [0, 1]; multiplying by such a factor may result in the estimations of high-frequency values being significantly lower than their true values. When the distribution is skewed, which is the more interesting case for LDP, the method performs poorly.
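A minimal sketch of Norm-Mul follows; since γ > 0, the constraint simplifies to γ = 1 / Σ_v max(f̃_v, 0) (the function name is ours).

```python
import numpy as np

def norm_mul(est):
    """Norm-Mul: zero out negatives, then rescale so the estimates sum to 1."""
    pos = np.maximum(est, 0)
    gamma = 1.0 / pos.sum()    # assumes at least one positive estimate
    return gamma * pos
```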
• Norm-Sub: After the standard FO, convert negative values to 0, while maintaining an overall sum of 1 by adding δ to each remaining value.

More precisely, given the estimation vector f̃, we want to find δ such that
$$\sum_{v \in D} \max(\tilde{f}_v + \delta, 0) = 1$$
Then the estimation for each value v is f'_v = max(f̃_v + δ, 0). This extends the method Norm and results in consistency. Norm-Sub was used by Kairouz et al. [19] and Bassily [3] to process results for some FOs. Under Norm-Sub, low-frequency values have positive biases, and high-frequency items have negative biases. The distribution of biases, however, is more even when compared to Norm-Mul.
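A small sketch of Norm-Sub is given below; it finds δ by bisection on the monotone function δ ↦ Σ_v max(f̃_v + δ, 0). The function name, bracketing bounds, and tolerance are our own choices.

```python
import numpy as np

def norm_sub(est, tol=1e-12):
    """Norm-Sub: find delta so that sum_v max(est_v + delta, 0) = 1, via bisection."""
    est = np.asarray(est, dtype=float)
    total = lambda delta: np.maximum(est + delta, 0).sum()
    lo = -est.max()            # total(lo) = 0 <= 1
    hi = 1.0 - est.min()       # total(hi) >= d >= 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < 1:
            lo = mid
        else:
            hi = mid
    return np.maximum(est + hi, 0)
```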
• Norm-Cut: After the standard FO, convert negative and small positive values to 0 so that the total sums up to 1.

We note that under Norm-Sub, higher frequency items have higher negative biases. One natural idea to address this is to turn the low estimations to 0 to ensure consistency, without changing the estimations of high-frequency values. This is the idea of Norm-Cut. More precisely, given the estimation vector f̃, there are two cases. When Σ_{v∈D} max(f̃_v, 0) ≤ 1, we simply change each negative estimation to 0. When Σ_{v∈D} max(f̃_v, 0) > 1, we want to find the smallest θ such that
$$\sum_{v \in D \,:\, \tilde{f}_v \ge \theta} \tilde{f}_v \le 1$$
Then the estimation for each value v is 0 if f̃_v < θ and f̃_v if f̃_v ≥ θ. This is similar to Base-Cut in that both methods change all estimated values below some threshold to 0. The differences lie in how the threshold is chosen. This results in non-negative estimations, and typically results in estimations that sum up to 1, but it might result in a sum < 1.
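A sketch of Norm-Cut under the same assumptions (our own naming): keeping estimates from largest to smallest while the kept total stays at most 1 is equivalent to choosing the smallest θ above.

```python
import numpy as np

def norm_cut(est):
    """Norm-Cut: keep only the largest estimates so that the kept mass is at most 1."""
    est = np.asarray(est, dtype=float)
    pos = np.maximum(est, 0)
    if pos.sum() <= 1:
        return pos
    out = np.zeros_like(est)
    order = np.argsort(est)[::-1]      # values from largest to smallest estimate
    mass = 0.0
    for v in order:
        if mass + est[v] > 1:          # adding this value would exceed a total of 1
            break
        out[v] = est[v]
        mass += est[v]
    return out
```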
C. Constrained Least Squares

From a more principled point of view, we note that what we are doing here is essentially solving a Constraint Inference (CI) problem, for which CLS (Constrained Least Squares) is a natural solution. This approach was proposed in [16], but without the constraint that the estimates are non-negative (which leads to Norm). Here we revisit this approach with the consistency constraint (i.e., both requirements in Definition 2).

• CLS: After the standard FO, use least squares with constraints (summing-to-one and non-negativity) to recover the values.

Specifically, given the estimates f̃ by the FO, the method outputs f' that is a solution of the following problem:
$$\begin{aligned} \text{minimize:} \quad & \|\mathbf{f}' - \tilde{\mathbf{f}}\|_2 \\ \text{subject to:} \quad & \forall v\; f'_v \ge 0 \\ & \textstyle\sum_v f'_v = 1 \end{aligned}$$
We can use the KKT conditions [21], [20] to solve the problem. The process is presented in Appendix A. In the solution, we partition the domain D into D_0 and D_1, where D_0 ∩ D_1 = ∅ and D_0 ∪ D_1 = D. For v ∈ D_0, we assign f'_v = 0. For v ∈ D_1,
$$f'_v = \tilde{f}_v - \frac{1}{|D_1|}\left(\sum_{v \in D_1} \tilde{f}_v - 1\right)$$
Norm-Sub is the solution to the Constrained Least Squares (CLS) formulation of the problem, and $\delta = -\frac{1}{|D_1|}\left(\sum_{v \in D_1} \tilde{f}_v - 1\right)$ is the δ we want to find in Norm-Sub.
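As a numerical sanity check (not the closed-form KKT solution of Appendix A), the CLS problem can also be handed to a generic constrained solver; one can verify that, up to solver tolerance, its output matches norm_sub above. The function name and solver choice are ours.

```python
import numpy as np
from scipy.optimize import minimize

def cls(est):
    """Constrained least squares: project est onto the probability simplex."""
    d = len(est)
    constraints = ({'type': 'eq', 'fun': lambda f: f.sum() - 1},)
    bounds = [(0, None)] * d
    res = minimize(lambda f: np.sum((f - est) ** 2),
                   x0=np.full(d, 1.0 / d),
                   bounds=bounds, constraints=constraints)
    return res.x
```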
D. Maximum Likelihood Estimation

Another more principled way of looking at this problem is to view it as recovering a distribution given some LDP reports. For this problem, one standard solution is Bayesian inference. In particular, we want to find the f' such that
$$\Pr[\mathbf{f}' \mid \tilde{\mathbf{f}}] = \frac{\Pr[\tilde{\mathbf{f}} \mid \mathbf{f}'] \cdot \Pr[\mathbf{f}']}{\Pr[\tilde{\mathbf{f}}]} \quad (8)$$
is maximized. Note that we require f' to satisfy ∀v f'_v ≥ 0 and Σ_v f'_v = 1. In (8), Pr[f'] is the prior, and the prior distribution influences the result. In our setting, as we assume there is no such prior, Pr[f'] is uniform; that is, Pr[f'] is a constant. The denominator Pr[f̃] is also a constant that does not influence the result. As a result, we are seeking the f' that is the maximum likelihood estimator (MLE), i.e., the one that maximizes Pr[f̃ | f'].

For this method, Kairouz et al. [19] derived the exact MLE solution for GRR and RAPPOR [14]. We compute Pr[f̃ | f'] using the general form of Equation (4), which states that, given the original distribution f', the vector f̃ is a set of independent random variables, where each component f̃_v follows a Gaussian distribution with mean f'_v and variance σ'²_v. The likelihood of f̃ given f' is thus
$$\Pr[\tilde{\mathbf{f}} \mid \mathbf{f}'] = \prod_v \Pr[\tilde{f}_v \mid f'_v] \approx \prod_v \frac{1}{\sqrt{2\pi\sigma'^2_v}}\, e^{-\frac{(f'_v - \tilde{f}_v)^2}{2\sigma'^2_v}} = \frac{1}{\sqrt{(2\pi)^d \prod_v \sigma'^2_v}}\, e^{-\sum_v \frac{(f'_v - \tilde{f}_v)^2}{2\sigma'^2_v}}. \quad (9)$$
To differentiate from [19], we call it MLE-Apx.

• MLE-Apx: First use the standard FO, then compute the MLE with constraints (summing-to-one and non-negativity) to recover the values.

In Appendix B, we use the KKT conditions [21], [20] to obtain an efficient solution. In particular, we partition the domain D into D_0 and D_1, where D_0 ∩ D_1 = ∅ and D_0 ∪ D_1 = D. For v ∈ D_0, f'_v = 0; for v ∈ D_1,
$$f'_v = \frac{q(1-q)x + \tilde{f}_v(p-q)}{p - q + (p-q)(1-p-q)x} \quad (10)$$
where
$$x = \frac{\sum_{v \in D_1} \tilde{f}_v(p-q) - (p-q)}{(p-q)(1-p-q) - |D_1|\, q(1-q)}$$
We can rewrite Equation (10) as f'_v = f̃_v · γ + δ, where
$$\gamma = \frac{p-q}{p - q + (p-q)(1-p-q)x}, \qquad \delta = \frac{q(1-q)x}{p - q + (p-q)(1-p-q)x}$$
Hence MLE-Apx appears to represent some hybrid of Norm-Sub and Norm-Mul. In our evaluation, we observe that Norm-Sub and MLE-Apx give very close results, as γ ≈ 1.
Furthermore, when the f_v component of the variance is dominated by the other component (as in Equation (5)), the CLS formulation is equivalent to our MLE formulation.

TABLE I
SUMMARY OF METHODS

Method | Description | Non-neg | Sum to 1 | Complexity
Base-Pos | Convert negative est. to 0 | Yes | No | O(d)
Post-Pos | Convert negative query result to 0 | Yes | No | N/A
Base-Cut | Convert est. below threshold T to 0 | Yes | No | O(d)
Norm | Add δ to est. | No | Yes | O(d)
Norm-Mul | Convert negative est. to 0, then multiply positive est. by γ | Yes | Yes | O(d)
Norm-Cut | Convert negative and small positive est. below θ to 0 | Yes | Almost | O(d)
Norm-Sub | Convert negative est. to 0 while adding δ to positive est. | Yes | Yes | O(d)
MLE-Apx | Convert negative est. to 0, then add δ to positive est. | Yes | Yes | O(d)
Power | Fit Power-Law dist., then minimize expected squared error | Yes | No | O(√n · d)
PowerNS | Apply Norm-Sub after Power | Yes | Yes | O(√n · d)

E. Least Expected Square Error
Jia et al. [17] proposed a method in which one first assumes that the data follows some type of distribution (with unknown parameters), then uses the estimates to fit the parameters of the distribution, and finally updates the estimates to achieve the least expected squared error.

• Power: Fit a distribution, and then minimize the expected squared error.
Formally, for each value v, the estimate f̃_v given by the FO is regarded as the sum of two parts: the true frequency f_v and noise following the normal distribution (as shown in Equation (6)). The method then finds the f'_v that minimizes E[(f_v − f'_v)² | f̃_v]. To solve this problem, the authors estimate the true distribution of f_v from the estimates f̃ (where f̃ is the vector of the f̃_v's). In particular, it is assumed in [17] that the distribution follows Power-Law or Gaussian. The distributions can be determined by one or two parameters, which can be fitted from the estimation f̃. Given Pr[x] as the probability that f_v = x under the fitted distribution, and Pr[x ∼ N(0, σ²)] as the pdf of x drawn from the Normal distribution with mean 0 and standard deviation σ (as in Equation (6)), one can then minimize the objective. Specifically, for each value v ∈ D, output
$$f'_v = \int \frac{\Pr\left[(\tilde{f}_v - x) \sim \mathcal{N}(0, \sigma^2)\right] \cdot \Pr[x] \cdot x}{\int \Pr\left[(\tilde{f}_v - y) \sim \mathcal{N}(0, \sigma^2)\right] \cdot \Pr[y]\, dy}\, dx. \quad (11)$$
We fit Pr[x] with the Power-Law distribution and call the method Power. Using this method requires knowledge and/or an assumption of the distribution to be estimated. If there is too much noise, or the underlying distribution is different, forcing the observations to fit a distribution could lead to poor accuracy. Moreover, this method does not ensure that the frequencies sum up to 1, as Equation (11) only considers the frequency of each value v independently. To make the result consistent, we use Norm-Sub to post-process the results of Power, since Power is close to CLS, and Norm-Sub is the solution to CLS. We call it PowerNS.

• PowerNS: First use the standard FO, then use Power to recover the values, and finally use Norm-Sub to further process the results.

F. Summary of Methods

In summary, Norm-Sub is the solution to the Constrained Least Squares (CLS) formulation of the problem. Furthermore, when the f_v component of the variance is dominated by the other component (as in Equation (5)), the CLS formulation is equivalent to our MLE formulation. In that case, Norm-Sub is equivalent to MLE-Apx.

Table I gives a summary of the methods. First of all, all of the methods preserve the frequency order of the values, i.e., f'_{v_1} ≤ f'_{v_2} iff f̃_{v_1} ≤ f̃_{v_2}. The methods can be classified into three classes: first, enforcing non-negativity only — Base-Pos, Post-Pos, Base-Cut, and Power fall in this category; second, enforcing summing-to-one only — only Norm is in this class; third, enforcing both requirements simultaneously — Norm-Mul, Norm-Cut, Norm-Sub, and PowerNS satisfy both requirements.

V. EVALUATION
As we are optimizing multiple utility metrics together, it is hard to theoretically compare the different methods. In this section, we run experiments to empirically evaluate them. At a high level, our evaluations show that different methods perform differently in different settings, and to achieve the best utility, it may or may not be necessary to exploit all the consistency constraints. As a result, we conclude that for full-domain queries, Base-Cut performs the best; for set-value queries, PowerNS performs the best; and for high-frequency-value queries, Norm performs the best.
A. Experimental Setup
Datasets.
We run experiments on two datasets (one synthetic and one real).

• Synthetic: Zipf's distribution with 1024 values and 1 million reports. We use s = 1. in this distribution.
• Emoji: the daily emoji usage data. We use the average emoji usage of an emoji keyboard, which gives a total count of n = 884427 with d = 1573 different emojis.

Setup.
The FO protocols and post-processing algorithms are implemented in Python 3.6.6 using NumPy 1.15; all the experiments are conducted on a PC with an Intel Core i7-4790 3.60GHz CPU and 16GB of memory. Although the post-processing methods can be applied to any FO protocol, we focus on simulating OLH, as it provides near-optimal utility with reasonable communication bandwidth.

[Fig. 1. Log-scale distribution of the Zipf's dataset fixing ε = 1; the x-axis is the sorted value index and the y-axis is its count. The blue line is the ground truth; the green dots are the estimations by the different methods: (a) Base (Post-Pos), (b) Base-Pos, (c) Base-Cut, (d) Norm, (e) Norm-Mul, (f) Norm-Cut, (g) Norm-Sub, (h) Power, (i) PowerNS.]
Metrics.
We evaluate three scenarios 1) estimate the fre-quency of every value in the domain (full-domain), 2) estimatethe aggregate frequencies of a subset of values (set-value),and 3) estimate the frequencies of the most frequent values(frequent-value).We use the metrics of
Mean of Squared Error (MSE). MSEmeasures the mean of squared difference between the estimateand the ground truth for each (set of) value. For full-domain,we compute MSE = 1 d (cid:88) v ∈ D ( f v − f (cid:48) v ) . For frequent-value, we consider the top k values with highest f v instead of the whole domain D ; and for set-value, insteadof measuring errors for singletons, we measure errors for sets,that is, we first sum the frequencies for a set of values, andthen measure the difference. Plotting Convention.
Unless otherwise specified, for each dataset and each method, we repeat each experiment multiple times and report the mean and standard deviation of the results. The standard deviation is typically very small, and barely noticeable in the figures. Because there are 11 algorithms (10 post-processing methods plus Base), and for any single metric there are often multiple methods that perform very similarly, their lines would overlap. To make Figures 4–8 readable, we plot results in two separate figures on the same row. On the left, we plot 6 methods: Base, Base-Pos, Post-Pos, Norm, Norm-Mul, and Norm-Sub. On the right, we plot Norm-Sub together with the remaining 5 methods: MLE-Apx, Base-Cut, Norm-Cut, Power, and PowerNS. We mainly want to compare the methods in the right column.

B. Bias-variance Evaluation
Figure 1 shows the true distribution of the synthetic Zipf's dataset and the mean of the estimations. As we plot the count estimations (instead of frequency estimations), the variance is larger (an n² multiplicative factor over the frequency estimations). We thus estimate 5000 times in order to make the mean stabilize. In Figure 2, we subtract the ground truth from the estimation mean and plot the difference, which represents the empirical bias. It can be seen that Base and Norm are unbiased. Base-Pos introduces a systematic positive bias. Base-Cut gives unbiased estimations for the first few most frequent values, as their true frequencies are much greater than the threshold T used to cut estimations below it to 0. As the noise is close to a normal distribution, the probability that a high-frequency value is estimated to be below T is exponentially small. A similar analysis also holds for the low-frequency values, whose estimates are unlikely to be above T. On the other hand, for values in between, the two biases compete with each other. At some point, the two effects cancel out, leading to unbiased estimations. But this point depends on the whole distribution, and thus is hard to find analytically. For Norm-Cut, similar reasoning applies, with the difference that the threshold in Norm-Cut is typically smaller. For Norm-Sub, each value is influenced by two factors: subtraction by the same amount, and conversion to 0 if negative. The high-frequency values mostly see the first factor; the low-frequency values are mostly affected by the second factor; and for the values in between, the two factors compete against each other. We see an increasing line for Norm-Sub. Finally, Power changes little for the top estimations, but more for the low ones, thus leading to a similar shape as Norm-Cut. The shape of PowerNS is close to Power because PowerNS applies Norm-Sub, which subtracts some amount from the estimations, after Power.

[Fig. 2. Bias of count estimation for the Zipf's dataset fixing ε = 1, with panels (a) Base (Post-Pos), (b) Base-Pos, (c) Base-Cut, (d) Norm, (e) Norm-Mul, (f) Norm-Cut, (g) Norm-Sub, (h) Power, (i) PowerNS, each annotated with its bias sum.]

Figure 3 shows the variance of the estimations among the 5000 runs. First of all, the variance is similar for all the values in Base and Norm, with Norm being slightly better (smaller) than Base. For all other methods, the variance drops with the rank, because for low-frequency values, the estimates are mostly zeros.

C. Full-domain Evaluation
Figure 4 shows the MSE when querying the frequency of every value in the domain. Note that the MSE is composed of the (square of the) bias shown in Figure 2 and the variance in Figure 3. We vary ε over a range of values. Let us first focus on the figures on the left. Base performs very close to Norm, since the adjustment of Norm can be either positive or negative, as the expected value of the estimation sum is 1. As Base-Pos (which is equivalent to Post-Pos in this setting) converts negative results to 0, its MSE is around half that of Base (note the y-axis is in log-scale). Norm-Sub is able to reduce the MSE of Base by about a factor of 10 and 100 in the Zipf's and Emoji datasets, respectively. Norm-Mul behaves differently from the other methods. In particular, its MSE decreases much more slowly than that of the other methods. This is because Norm-Mul multiplies the original estimations by the same factor. The higher the estimate, the greater the adjustment. Since the estimations are individually unbiased, this is not the correct adjustment.

For the right part of Figure 4, we observe that Norm-Sub and MLE-Apx perform almost exactly the same, validating the prediction from the theoretical analysis. Norm-Sub, MLE-Apx, Power, PowerNS, and Base-Cut perform very similarly. On these two datasets, PowerNS performs the best. Note that PowerNS works well when the distribution is close to Power-Law. For an unknown distribution, we still recommend Base-Cut. This is because, if one considers the average accuracy of all estimations, the dominating source of errors comes from the fact that many values have true frequencies close or equal to 0 and are randomly perturbed. Base-Cut keeps the high-frequency values unchanged, and converts results below the threshold T to 0. Norm-Cut also converts low estimations to 0, but its threshold θ is likely to be lower than T, because θ is chosen to achieve a sum of 1.

Benefit of Post-Processing.
We demonstrate the benefit of post-processing by measuring the relationship between n and n', so that n records with post-processing achieve the same accuracy as n' records without it. In particular, we vary n and measure the errors for the different methods. We then calculate n' using Equation (3). In particular, the analytical MSE for n' records is
$$\frac{1}{d}\sum_v \sigma_v^2 = \frac{q(1-q)}{n'(p-q)^2} + \frac{1}{d}\sum_v \frac{f_v(1-p-q)}{n'(p-q)} = \frac{q(1-q)}{n'(p-q)^2} + \frac{1}{d}\cdot\frac{1-p-q}{n'(p-q)}.$$
Given the empirical MSE, we can obtain the n' that achieves the same error analytically. Note that this MSE does not depend on the distribution. Thus we only evaluate on the Zipf's dataset. The result is shown in Figure 5. We vary the size of the dataset n and plot the value of n'. The higher the line, the better the method performs. Base and Norm are two straight lines with a slope of 1, verifying the analytical variance. The n' value for Norm-Mul grows even more slowly than that of Base, indicating the harm of using Norm-Mul as a post-processing method. The performance of the other methods follows a similar trend to the full-domain MSE (as shown in the upper row of Figure 4), with PowerNS giving the best performance, which translates into a substantial saving in the number of users needed.

[Fig. 3. Variance of count estimation for the Zipf's dataset fixing ε = 1, with panels (a) Base (Post-Pos), (b) Base-Pos, (c) Base-Cut, (d) Norm, (e) Norm-Mul, (f) Norm-Cut, (g) Norm-Sub, (h) Power, (i) PowerNS; the y-axes are scaled down by n².]
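As a concrete illustration (our own helper name, assuming OLH's p and q), the analytical MSE expression above can be inverted to obtain n' from an empirically measured MSE:

```python
import numpy as np

def equivalent_n(mse, eps, d):
    """Number of users n' whose analytical OLH MSE matches the given empirical MSE."""
    g = np.exp(eps) + 1
    p = np.exp(eps) / (np.exp(eps) + g - 1)
    q = 1.0 / g
    # MSE(n') = [q(1-q)/(p-q)^2 + (1/d)*(1-p-q)/(p-q)] / n'  =>  n' = numerator / MSE
    numerator = q * (1 - q) / (p - q) ** 2 + (1 - p - q) / (d * (p - q))
    return numerator / mse
```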
D. Set-value Evaluation

Estimating set-values plays an important role in the interactive data analysis setting (e.g., estimating which category of emojis is more popular). Keeping ε = 1, we evaluate the performance of the different methods while changing the size of the set. For the set-value queries, we uniformly sample ρ% × |D| elements from the domain and evaluate the MSE between the sum of their true frequencies and their estimated frequencies. Formally, define D_{s_ρ} as a random subset of D with ρ% × |D| elements, and define f_{D_{s_ρ}} = Σ_{v ∈ D_{s_ρ}} f_v. We sample D_{s_ρ} multiple times and measure the MSE between f_{D_{s_ρ}} and f'_{D_{s_ρ}}. Overall, the MSE of set-value queries is greater than that of the full-domain evaluation, because the errors of the individual estimations accumulate.

Vary ρ. Following the layout convention, we show results for set-value estimations in Figure 6, where we first vary ρ. Overall, the approaches that exploit the summing-to-1 requirement, including Norm, Norm-Mul, Norm-Sub, MLE-Apx, Norm-Cut, and PowerNS, perform well, especially when ρ is large. Moreover, their MSE is symmetric around ρ = 50, because, as the results are normalized, estimating set-values for ρ > 50 is equivalent to estimating the rest. When ρ = 90, the best norm-based method, PowerNS, outperforms any of the non-norm-based methods by orders of magnitude.

For each specific method, it is observed that the MSE for Base-Pos is higher than for the other methods, because it only turns negative estimates to 0, introducing a systematic bias. Post-Pos is slightly better than Base, as it turns negative query results to 0. In the settings we evaluated, Base-Cut also outperforms Base; this happens because converting estimates below the threshold T to 0 is more likely to make the summation f'_{D_{s_ρ}} close to one. Finally, Power only converts negative estimations to be positive, introducing a systematic bias; PowerNS further makes them sum to 1, thus achieving better utility than all