Lossless Compression of Efficient Private Local Randomizers
Vitaly Feldman (Apple)    Kunal Talwar (Apple)
Abstract
Locally Differentially Private (LDP) reports are commonly used for collection of statistics and machine learning in the federated setting. In many cases the best known LDP algorithms require sending prohibitively large messages from the client device to the server (such as when constructing histograms over a large domain or learning a high-dimensional model). This has led to significant efforts on reducing the communication cost of LDP algorithms. At the same time LDP reports are known to have relatively little information about the user's data due to randomization. Several schemes are known that exploit this fact to design low-communication versions of LDP algorithms, but all of them do so at the expense of a significant loss in utility. Here we demonstrate a general approach that, under standard cryptographic assumptions, compresses every efficient LDP algorithm with negligible loss in privacy and utility guarantees. The practical implication of our result is that in typical applications the message can be compressed to the size of the server's pseudo-random generator seed. More generally, we relate the properties of an LDP randomizer to the power of a pseudo-random generator that suffices for compressing the LDP randomizer. From this general approach we derive low-communication algorithms for the problems of frequency estimation and high-dimensional mean estimation. Our algorithms are simpler and more accurate than existing low-communication LDP algorithms for these well-studied problems.
1 Introduction

We consider the problem of collecting statistics and machine learning in the setting where data is held on a large number of user devices. The data held on devices in this federated setting is often sensitive and thus needs to be analyzed with privacy preserving techniques. One of the key approaches to private federated data analysis relies on the use of locally differentially private (LDP) algorithms to ensure that the report sent by a user's device reveals little information about that user's data. Specifically, a randomized algorithm R : X → Y is an ε-DP local randomizer if for every possible output y ∈ Y and any two possible values of user data x, x′ ∈ X, Pr[R(x) = y] and Pr[R(x′) = y] are within a factor of e^ε (where the probability is taken solely with respect to the randomness of the algorithm R).

The concept of a local randomizer dates back to the work of Warner [War65], where it was used to encourage truthfulness in surveys. In the context of modern data analysis it was introduced by Evfimievski et al. [EGS03] and then related to differential privacy in the seminal work of Dwork et al. [DMNS06]. Local randomizers are also used for collection of statistics and machine learning in several industrial applications [EPK14; App17; DKY17]. Practical applications such as building a histogram over a large domain or training a model with millions of parameters [MRTZ18] require applying the randomizer to high-dimensional data. Many of the standard and most accurate ways to randomize such data result in reports whose size scales linearly with the dimension of the problem. Communication from the user devices is often significantly constrained in practical applications. This limits the scope of problems in which we can achieve the best known utility-privacy trade-off and motivates significant research interest in designing communication-efficient LDP algorithms.
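To make the ε-DP local randomizer definition concrete, here is a minimal sketch (our own illustration, not from the paper) of classical binary randomized response, together with a direct check of the e^ε ratio condition; the function names are ours:

```python
import math
import random

def randomized_response(x: int, eps: float) -> int:
    """eps-DP local randomizer for one bit: report x truthfully with
    probability e^eps / (e^eps + 1), and flip it otherwise."""
    p_truth = math.exp(eps) / (math.exp(eps) + 1.0)
    return x if random.random() < p_truth else 1 - x

def output_prob(x: int, y: int, eps: float) -> float:
    """Pr[R(x) = y] for the randomizer above."""
    p_truth = math.exp(eps) / (math.exp(eps) + 1.0)
    return p_truth if y == x else 1.0 - p_truth

# The definition requires Pr[R(x) = y] <= e^eps * Pr[R(x') = y]
# for every pair of inputs x, x' and every output y.
eps = 1.0
worst_ratio = max(
    output_prob(x, y, eps) / output_prob(xp, y, eps)
    for x in (0, 1) for xp in (0, 1) for y in (0, 1)
)
```

For this randomizer the worst-case ratio equals e^ε exactly, so the privacy bound is tight.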
1.1 Our contribution

In this work, we explore practical and theoretical aspects of compressing outputs of LDP mechanisms. We focus on the ε ≥ 1 regime. Known general compression schemes increase ε by a constant factor, which makes the approach impractical in this regime, since the error of LDP algorithms typically scales as e^{−ε/2} when ε > 1. In addition, the central privacy guarantees resulting from amplification by shuffling also scale as e^{−ε/2}.

We propose a general approach to compressing an arbitrary local randomizer that preserves both the privacy and accuracy (or utility) of the randomizer. At a high level, it is based on replacing the true random bits used to generate the output with pseudo-random bits that can be described using a short seed. For a randomizer R : X → Y, we do this by first picking a fixed reference distribution ρ that is data-independent and ε-close (in the standard sense of differential privacy) to the output distributions of R for all possible inputs x ∈ X. Existence of such a reference distribution is exactly the definition of the deletion version of local differential privacy [EFMRSTT20], and thus our results are easiest to describe in this model. A sample from ρ typically requires many random bits to generate but, by replacing random bits with pseudo-randomly generated ones, we obtain a distribution over values in Y that can be described using a short seed. In addition, under standard cryptographic assumptions, a random sample from this distribution is computationally indistinguishable from ρ. Given an input x, we can now emulate R(x) by performing rejection sampling relative to pseudo-random samples from ρ. A special case of this idea appears in the work of Mishra and Sandler [MS06], who apply it to the problem of estimating sets of counting queries.

A crucial question is whether this scheme satisfies ε differential privacy. We show that the answer is yes if the pseudo-random generator (PRG) used is strong enough to fool a certain test that looks at the ratio of the output density of R(x) to ρ. This ratio is typically efficiently computable whenever the randomizer itself is efficiently computable. Thus under standard cryptographic assumptions, the privacy is preserved (up to a negligible loss).
Similarly, when the processing of the reports on the server side is done by an efficient algorithm, the utility will be preserved. See Theorem 3.5 for a formal statement. Asymptotically, this result implies that if we assume that there exists an exponentially strong PRG, then the number of bits that needs to be communicated is logarithmic in the running time of the rejection sampler we defined. An immediate practical implication of this result is that in most applications the output of the local randomizer can be compressed to the size of the seed of the system's pseudo-random generator (PRG) without any observable effect on utility or privacy. This size is typically less than 1024 bits. We remark that when implementing a randomizer in practice, true randomness is replaced with pseudo-randomly generated bits with an (implicit) assumption that this does not affect privacy or utility guarantees. Thus the assumptions underlying our analysis are similar to those that are already present in practical implementations of differentially private algorithms.

We demonstrate that this approach also extends to the (more common) replacement notion of local differential privacy and also to (ε, δ)-DP randomizers. In the latter case the randomizer needs to be modified to allow subsampling via simple truncation. This step adds δ to both privacy and utility guarantees of the algorithm. For replacement DP this version also requires a more delicate analysis and a stronger set of tests for the PRG. A detailed description of these results is given in Section 3.

An important property of our analysis is that we do not need to follow the general recipe for specific randomizers. Firstly, for some randomizers it is possible to directly sample from the desired distribution over seeds instead of using rejection sampling that requires e^ε trials (in expectation).
In addition, it may be possible to ensure that privacy and utility are preserved without appealing to general cryptographically secure PRGs and associated computational assumptions. In particular, one can leverage a variety of sophisticated results from complexity theory, such as k-wise independent PRGs and PRGs for bounded-space computation [Nis92], to achieve unconditional and more efficient compression.

We apply this fine-grained approach to the problem of frequency estimation over a discrete domain. In this problem the domain X = [k] and the goal is to estimate the frequency of each element j ∈ [k] in the dataset. This is one of the central and most well-studied problems in private (federated) data analysis. However, for ε > 1, existing approaches either require communication on the order of k bits, or do not achieve the best known accuracy in some important regimes (see Sec. 1.2 for an overview). The best accuracy for this problem is achieved by the (asymmetric) RAPPOR algorithm [EPK14] (which has two versions depending on whether it is used with replacement or deletion privacy) and also by the closely related Subset Selection algorithm [WHWNXYLQ16; YB18]. We observe that a pairwise-independent PRG suffices to fool both the privacy and utility conditions for this randomizer. Thus we can compress RAPPOR to O(log k + ε) bits losslessly and unconditionally using a standard construction of a pairwise-independent PRG [LLW06]. The structure of the PRG also allows us to sample the seeds efficiently without rejection sampling. The details of this construction appear in Section 3.

As an additional application of our techniques, we consider the problem of estimating the mean of d-dimensional vectors in ℓ2-norm. This problem is a key part of various machine learning algorithms, most notably stochastic gradient descent. In the ε ≥ 1 regime, a low-communication (⌈ε⌉ log d bits) and asymptotically optimal algorithm was recently given by Chen et al. [CKÖ20]. It is however less accurate empirically and more involved than the algorithm of Bhowmick et al. [BDFKR19] that communicates a d-dimensional vector. Using our general result we can losslessly compress the algorithm from [BDFKR19] to O(log d + ε) bits. One limitation of this approach is the O(e^ε d) complexity of rejection sampling in this case, which can be prohibitive for large ε. However, we show a simple reduction of the ε ≥ 1 case to ⌈ε⌉ invocations of a randomizer with ε < 1.
This general reduction allows us to reduce the running time to O(⌈ε⌉d) and also use a simple and low-communication randomizer that is (asymptotically) optimal only when ε < 1.

1.2 Related work

As mentioned, the closest in spirit to our work is the use of rejection sampling in the work of Mishra and Sandler [MS06]. Their analysis can be seen as a special case of ours, but they only prove that the resulting algorithm satisfies 2ε-DP. Rejection sampling on a sample from the reference distribution is also used in existing compression schemes [BS15; BNS19] as well as earlier work on private compression in the two-party setting [MMPRTV10]. These approaches assume that the sample is shared between the client and the server, namely, they require shared randomness. Shared randomness is incompatible with the setting where the report is anonymized and is not directly linked to the user that generated it. As pointed out in [BS15], a simple way to overcome this problem is to include a seed to a PRG in the output of the randomizer and have the server generate the same sample from the reference distribution as the client. While superficially this approach seems similar to ours, its analysis and properties are different. For example, in our setting only the seed for a single sample that passes rejection sampling is revealed to the server, whereas in [BS15; BNS19] all samples from the reference distribution are known to the server and the privacy analysis does not depend on the strength of the PRG. More importantly, unlike previous approaches, our compression scheme is essentially lossless (although at the cost of requiring assumptions for the privacy analysis).

Computational Differential Privacy (CDP) [MPRV09] is a notion of privacy that defends against computationally bounded adversaries. Our compression algorithm can be easily shown to satisfy the strongest Sim-CDP definition.
At the same time, our privacy bounds also hold for computationally unbounded adversaries as long as the LDP algorithm itself does not lead to a distinguisher. This distinction allows us to remove computational assumptions for specific LDP randomizers.

For both deletion and replacement privacy the best results for frequency estimation are achieved by variants of the RAPPOR algorithm [EPK14] and also by the closely-related Subset Selection algorithm [WHWNXYLQ16; YB18]. Unfortunately, both RAPPOR and Subset Selection have very high communication cost of ≈ kH(1/(e^ε + 1)), where H is the binary entropy function. This has led to numerous and still ongoing efforts to design low-communication protocols for the problem [HKR12; EPK14; BS15; KBR16; WHWNXYLQ16; WBLJ17; YB18; ASZ19; AS19; BNS19; BNST20; CKÖ20].

A number of low-communication algorithms achieve asymptotically optimal bounds in the ε < 1 regime. For ε > 1, an algorithm using O(ε) bits of communication is known, but it relies on shared randomness and matches the bounds achieved by RAPPOR only when e^ε is an integer. Acharya and Sun [AS19] and Chen et al. [CKÖ20] give closely related approaches that are asymptotically optimal and use log k bits of communication (without shared randomness). However, both the theoretical bounds and empirical results for these algorithms are noticeably worse than those of (asymmetric) RAPPOR and Subset Selection (e.g. plots in [CKÖ20] show that these algorithms are noticeably less accurate for ε = 5 than Subset Selection). The constructions in [AS19; CKÖ20] and their analysis are also substantially more involved than RAPPOR.

A closely related problem is finding "heavy hitters", namely all elements j ∈ [k] with counts higher than some given threshold. In this problem the goal is to avoid linear runtime dependence on k that would result from doing frequency estimation and then checking all the estimates.
This problem is typically solved using a "frequency oracle", which is an algorithm that for a given j ∈ [k] returns an estimate of the number of j's held by users (typically without computing the entire histogram) [BS15; BNST20; BNS19]. Frequency estimation is also closely related to the discrete distribution estimation problem, in which inputs are sampled from some distribution over [k] and the goal is to estimate the distribution [YB18; ASZ19; AS19]. Indeed, bounds for frequency estimation can be translated directly to bounds on distribution estimation by adding the sampling error.

Mean estimation has attracted a lot of attention in recent years as it is an important subroutine in differentially private (stochastic) gradient descent algorithms [BST14; ACGMMTZ16] used in private federated learning [Kai+19]. Indeed, private federated optimization algorithms aggregate updates to the model coming from each client in a batch of clients by getting a private estimate of the average update. When the models are large, the dimensionality of the update d leads to significant communication cost. Thus reducing the communication cost of mean estimation has been studied in many works with [ASYKM18; GDDKS20; CKÖ20; GKMM19] or without privacy [AGLTV17; FTMARRK20; SYKM17; GKMM19; MT20].

In the absence of communication constraints and for ε ≤ d, the optimal ε-LDP protocols for this problem achieve an expected squared ℓ2 error of Θ(d / (n min(ε, ε^2))) [DJW18; DR19]. When ε ≤ 1, the randomizer of Duchi et al. [DJW18] also achieves the optimal O(d / (nε^2)) bound. Recent work of Erlingsson et al. [EFMRSTT20] gives a low-communication version of this algorithm. Building on the approach in [DJW18], Bhowmick et al. [BDFKR19] describe the PrivUnit algorithm that achieves the asymptotically optimal accuracy also when ε > 1 (at a communication cost of O(d)). An alternative approach in the ε < 1 regime is based on Kashin's representation, which converts vectors in the d-dimensional ℓ2 ball into vectors with uniformly bounded coordinates in dimension O(d) [LV10]. Their work does not explicitly discuss the communication cost and assumes that the server can pick the randomizer used at each client. However, it is easy to see that a single bit suffices to answer a counting query and therefore an equivalent randomizer can be implemented using ⌈log d⌉ + 1 bits of communication (or just 1 bit if shared randomness is used). Chen et al. [CKÖ20] give a randomizer based on the same idea that also achieves the asymptotically optimal bound when ε > 1 and uses ⌈ε⌉ log d bits of communication. Computing Kashin's representation is more involved than the algorithms in [DJW18; BDFKR19]. In addition, as we demonstrate empirically, the variance of the estimate resulting from this approach is nearly a factor of 5× larger for typical parameters of interest.

Footnotes: The error of asymmetric RAPPOR (namely, 0 and 1 are flipped with different probabilities) is essentially identical to that of the Subset Selection randomizer. Comparisons with RAPPOR often use the symmetric RAPPOR, which is substantially worse than the asymmetric version for the replacement notion of differential privacy. See Section 4 for details. Plots in [CKÖ20] also compare their algorithm with PrivUnit, yet as their code at [Kas] shows and was confirmed by the authors, they implemented the algorithm from [DJW18] instead of PrivUnit, which is much worse than PrivUnit for ε = 5. The authors also confirmed that the parameters stated in their figures are incorrect and so cannot be directly compared to our results.

2 Preliminaries
For a positive integer k we denote [k] = {1, . . . , k}. For an arbitrary set S we use x ∼ S to mean that x is chosen randomly and uniformly from S.

Differential privacy (DP) is a measure of stability of a randomized algorithm. It bounds the change in the distribution on the outputs when one of the inputs is either removed or replaced with an arbitrary other element. The most common way to measure the change in the output distribution is via approximate infinity divergence. More formally, we say that two probability distributions µ and ν over a (finite) domain Y are (ε, δ)-close if for all E ⊆ Y,

  e^{−ε} (µ(E) − δ) ≤ ν(E) ≤ e^ε µ(E) + δ.

This condition is equivalent to Σ_{y∈Y} |µ(y) − e^ε ν(y)|_+ ≤ δ and Σ_{y∈Y} |ν(y) − e^ε µ(y)|_+ ≤ δ, where |a|_+ := max{a, 0} [DR14]. We also say that two random variables P and Q are (ε, δ)-close if their probability distributions are (ε, δ)-close. We abbreviate (ε, 0)-close as ε-close.

Algorithms in the local model of differential privacy and federated data analysis rely on the notion of a local randomizer.

Definition 2.1.
An algorithm R : X → Y is an (ε, δ)-DP local randomizer if for all pairs x, x′ ∈ X, R(x) and R(x′) are (ε, δ)-close.

We will also use the add/delete variant of differential privacy, which was defined for local randomizers in [EFMRSTT20].
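The hockey-stick form of (ε, δ)-closeness above can be checked mechanically for finite distributions. A sketch with our own helper names, applied to the two output distributions of binary randomized response:

```python
import math

def hockey_stick(mu, nu, eps):
    """Sum over y of |mu(y) - e^eps * nu(y)|_+ for finite distributions
    given as lists of probabilities."""
    e = math.exp(eps)
    return sum(max(m - e * n, 0.0) for m, n in zip(mu, nu))

def are_close(mu, nu, eps, delta):
    """(eps, delta)-closeness: both one-sided divergences are at most delta."""
    return hockey_stick(mu, nu, eps) <= delta and hockey_stick(nu, mu, eps) <= delta

# Binary randomized response with eps = 1: the output distributions on
# inputs 0 and 1 are (1, 0)-close, but not (0.5, 0)-close.
p = math.exp(1.0) / (math.exp(1.0) + 1.0)
mu, nu = [p, 1.0 - p], [1.0 - p, p]
```

Here `are_close(mu, nu, 1.0, 0)` holds (up to floating-point slack), while `are_close(mu, nu, 0.5, 0)` fails, matching the fact that this randomizer is 1-DP but not 0.5-DP.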
Definition 2.2.
An algorithm R : X → Y is a deletion (ε, δ)-DP local randomizer if there exists a reference distribution ρ such that for all data points x ∈ X, R(x) and ρ are (ε, δ)-close.

It is easy to see that a replacement (ε, δ)-DP algorithm is also a deletion (ε, δ)-DP algorithm, and that a deletion (ε, δ)-DP algorithm is also a replacement (2ε, δ)-DP algorithm.

Fooling and Pseudorandomness:
The notion of pseudorandomness relies on the ability to distinguish between the output of the generator and true randomness using a family of tests, where a test is a boolean function (or algorithm).
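For intuition, the distinguishing game behind this definition can be simulated directly; a toy sketch (our own helper names), where the identity test distinguishes a fair coin from a biased one with advantage about 0.1:

```python
import random

def advantage(sample_p, sample_q, test, n=200_000, seed=0):
    # Monte Carlo estimate of |Pr[test(P) = 1] - Pr[test(Q) = 1]|
    # for a boolean test and two samplers.
    rng = random.Random(seed)
    hits_p = sum(test(sample_p(rng)) for _ in range(n)) / n
    hits_q = sum(test(sample_q(rng)) for _ in range(n)) / n
    return abs(hits_p - hits_q)

def fair(rng):
    return int(rng.random() < 0.5)

def biased(rng):
    return int(rng.random() < 0.6)

def identity_test(y):
    # The test D(y) = y.
    return y

adv = advantage(fair, biased, identity_test)
```

A pseudorandom distribution is one for which no test in the allowed family achieves advantage above β.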
Definition 2.3.
Let D be a family of boolean functions over some domain Y. We say that two random variables P and Q over Y are (D, β)-indistinguishable if for all D ∈ D,

  |Pr[D(P) = 1] − Pr[D(Q) = 1]| ≤ β.

We say that P and Q are (T, β)-computationally indistinguishable if P and Q are (D, β)-indistinguishable with D being the family of all tests that can be computed in time T (for some fixed computational model such as boolean circuits).

We now give a definition of a pseudo-random number generator.
Definition 2.4 (Pseudo-random generator). We say that an algorithm G : {0,1}^n → {0,1}^m, where m ≫ n, β-fools a family of tests D if G(s) for s ∼ {0,1}^n is (D, β)-indistinguishable from r for r ∼ {0,1}^m. We refer to such an algorithm as a (D, β)-PRG and also use (T, β)-PRG to refer to G that β-fools all tests running in time T.

Standard cryptographic assumptions (namely, that one-way functions exist) imply that for any m and T that are polynomial in n there exists an efficiently computable (T, β)-PRG G for negligible β (namely, β = 1/n^{ω(1)}). For a number of standard approaches to cryptographically-secure PRGs, no tests are known that can distinguish the output of the PRG from true randomness with β = 2^{−o(n)} in time T = 2^{o(n)}. For example, finding such a test for a PRG based on SHA-3 would be a major breakthrough. To capture the assumption that such a test does not exist we refer to a (T, β)-PRG for β = 2^{−Ω(n)} and T = 2^{Ω(n)} as an exponentially strong PRG.
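As an illustration of the definition, an extendable-output hash such as SHAKE-128 can serve as a heuristic stand-in for a cryptographically strong PRG (treating it as such is an assumption, in line with the SHA-3 remark above); a minimal sketch:

```python
import hashlib

def prg(seed: bytes, t_bits: int) -> list:
    # Expand a short seed into t_bits pseudo-random bits using SHAKE-128
    # as a stand-in for a (T, beta)-PRG; this is a heuristic assumption.
    n_bytes = (t_bits + 7) // 8
    stream = hashlib.shake_128(seed).digest(n_bytes)
    bits = []
    for byte in stream:
        for i in range(8):
            bits.append((byte >> i) & 1)
    return bits[:t_bits]
```

The expansion is deterministic: the same seed always yields the same t-bit string, which is what lets the server reproduce the client's pseudo-random sample from the seed alone.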
3 Local Pseudo-Randomizers
In this section we describe a general way to compress LDP randomizers that relies on the complexity of the randomizer and subsequent processing. We will first describe the result for deletion ε-DP and then give the versions for replacement DP and (ε, δ)-DP.

For the purpose of this result we first need to quantify how much randomness a local randomizer needs. We will say that a randomizer R : X → Y is t-samplable if there exists a deterministic algorithm R∅ : {0,1}^t → Y such that for r chosen randomly and uniformly from {0,1}^t, R∅(r) is distributed according to the reference distribution of R (denoted by ρ). Typically, for efficiently computable randomizers, t is polynomial in the size of the output log(|Y|) and ε. Note that every value y in the support of ρ is equal to R∅(r) for some r. Thus every element that can be output by R can be represented by some r ∈ {0,1}^t.

Our goal is to compress the communication by restricting the output from all the values that can be represented by r ∈ {0,1}^t to those values in Y that can be represented by a t-bit string generated from a seed of length ℓ ≪ t using some PRG G : {0,1}^ℓ → {0,1}^t. We could then send the seed to the server and let the server first generate the full t-bit string using G and then run R∅ on it. The challenge is to do this efficiently while preserving the privacy and utility guarantees of R.

Our approach is based on the fact that we can easily sample from the pseudo-random version of the reference distribution ρ by outputting R∅(G(s)) for a random and uniform seed s. This leads to a natural way to define a distribution over seeds on an input x: a seed s is output with probability proportional to

  Pr[R(x) = R∅(G(s))] / Pr_{r∼{0,1}^t}[R∅(r) = R∅(G(s))].

Specifically, we define the desired randomizer as follows.

Definition 3.1.
For a t-samplable deletion DP local randomizer R : X → Y and a function G : {0,1}^ℓ → {0,1}^t, let R[G] denote the local randomizer that, given x ∈ X, outputs s ∈ {0,1}^ℓ with probability proportional to

  Pr[R(x) = R∅(G(s))] / Pr_{r∼{0,1}^t}[R∅(r) = R∅(G(s))].

For some combinations of a randomizer R and PRG G there is an efficient way to implement R[G] directly (as we show in one of our applications). In the general case, when such an algorithm may not exist, we can sample from R[G](x) by applying rejection sampling to uniformly generated seeds. A special case of this approach is implicit in the work of Mishra and Sandler [MS06] (albeit with a weaker analysis). Rejection sampling only requires an efficient algorithm for computing the ratio of densities above to sufficiently high accuracy. We describe the resulting algorithm below.
Algorithm 1 R[G, γ]: PRG compression of R

Input: x ∈ X; ε, γ > 0; seeded PRG G : {0,1}^ℓ → {0,1}^t; t-samplable deletion ε-DP randomizer R.
  J = ⌈e^ε ln(1/γ)⌉
  for j = 1, . . . , J do
    Sample a random seed s ∼ {0,1}^ℓ.
    y = R∅(G(s))
    Sample b from Bernoulli( Pr[R(x) = y] / (e^ε Pr_{r∼{0,1}^t}[R∅(r) = y]) )
    if b == 1 then BREAK
  Send s

Naturally, the output of this randomizer can be decompressed by applying R∅ ∘ G to it. It is also clear that the communication cost of the algorithm is ℓ bits.

Next we describe the general condition on the PRG G that suffices for ensuring that the algorithm that outputs a random seed with the correct probability is differentially private.

Lemma 3.2.
For a t-samplable deletion ε-DP local randomizer R : X → Y and G : {0,1}^ℓ → {0,1}^t, let D denote the following family of tests, which take r′ ∈ {0,1}^t as an input:

  D := { ind( Pr[R(x) = R∅(r′)] / Pr_{r∼{0,1}^t}[R∅(r) = R∅(r′)] ≥ θ ) | x ∈ X, θ ∈ [0, e^ε] },

where ind denotes the {0,1} indicator function of a condition. If G β-fools D for β < 1/(2e^ε), then R[G] is a deletion (ε + 2e^ε β)-DP local randomizer. Furthermore, for every γ > 0, R[G, γ] is a deletion (ε + 2e^ε β)-DP local randomizer.

Proof. We demonstrate that if R[G] is not a deletion (ε + 2e^ε β)-DP randomizer, then there exists a test in D that distinguishes the output of G from true randomness with success probability at least β. To analyze the privacy guarantees of R[G] we let the reference distribution ρ_G be the uniform distribution over {0,1}^ℓ. For brevity, for y ∈ Y we denote the density ratio of R(x) to ρ at y by

  π_x(y) := Pr[R(x) = y] / Pr_{r∼{0,1}^t}[R∅(r) = y].

Then R[G] outputs a seed s with probability

  µ_x(s) := π_x(R∅(G(s))) / Σ_{s′∈{0,1}^ℓ} π_x(R∅(G(s′))).

By definition of our reference distribution, ρ_G(s) = 2^{−ℓ} for all s. Therefore

  µ_x(s) / ρ_G(s) = µ_x(s) / 2^{−ℓ} = π_x(R∅(G(s))) / E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))].

We observe that, by the fact that R(x) is ε-DP, we have that π_x(R∅(G(s))) ∈ [e^{−ε}, e^ε]. Therefore, to show that R[G] is (ε + 2e^ε β)-DP, it suffices to show that the denominator is in the range [e^{−2e^ε β}, e^{2e^ε β}]. To show this, we assume for the sake of contradiction that it is not true.
Namely, that either

  E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))] > e^{2e^ε β} > 1 + e^ε β   or   E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))] < e^{−2e^ε β} < 1 − e^ε β,

where we used the assumption that β < 1/(2e^ε) in the second inequality.

We first deal with the case when E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))] > 1 + e^ε β (the other case is essentially identical). Observe that for true randomness we have

  E_{r′∼{0,1}^t}[π_x(R∅(r′))] = E_{r′∼{0,1}^t}[ Pr[R(x) = R∅(r′)] / Pr_{r∼{0,1}^t}[R∅(r) = R∅(r′)] ] = 1.

Using the fact that π_x(y) ∈ [0, e^ε], we have that

  E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))] = ∫_0^{e^ε} Pr_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′))) ≥ θ] dθ

and, similarly,

  E_{r′∼{0,1}^t}[π_x(R∅(r′))] = ∫_0^{e^ε} Pr_{r′∼{0,1}^t}[π_x(R∅(r′)) ≥ θ] dθ.

Thus, by our assumption,

  ∫_0^{e^ε} ( Pr_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′))) ≥ θ] − Pr_{r′∼{0,1}^t}[π_x(R∅(r′)) ≥ θ] ) dθ > e^ε β.

Since the interval of integration has length e^ε, this means that there exists θ ∈ [0, e^ε] such that

  Pr_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′))) ≥ θ] − Pr_{r′∼{0,1}^t}[π_x(R∅(r′)) ≥ θ] > β.

Note that ind(π_x(R∅(r′)) ≥ θ) ∈ D for all x ∈ X and θ ∈ [0, e^ε], contradicting our assumption on G. Thus we obtain that E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))] ≤ 1 + e^ε β. We arrive at a contradiction in the case when E_{s′∼{0,1}^ℓ}[π_x(R∅(G(s′)))] < 1 − e^ε β in exactly the same way.

To show that R[G, γ] is a deletion (ε + 2e^ε β)-DP local randomizer, we observe that for every x, conditioned on accepting one of the samples, R[G, γ](x) outputs a sample distributed exactly according to R[G](x).
If R[G, γ](x) does not accept any samples, then it outputs a sample from the reference distribution ρ_G. Thus, given that R[G](x) is (ε + 2e^ε β)-close to ρ_G, we have that the output distribution of R[G, γ](x) is also (ε + 2e^ε β)-close to ρ_G.

Unlike the preservation of privacy, the conditions on the PRG under which we can ensure that the utility of R is preserved depend on the application. Here we describe a general result that relies only on the efficiency of the randomizer to establish computational indistinguishability of the output of our compressed randomizer from the output of the original one.

As the first step, we show that when used with the identity G, the resulting randomizer is γ-close in total variation distance to R.

Lemma 3.3.
Let R be a deletion ε-DP t-samplable local randomizer. Then for the identity function ID_t : {0,1}^t → {0,1}^t and any γ > 0, we have that R[ID_t, γ] is a deletion ε-DP local randomizer and for every x ∈ X, TV(R∅(R[ID_t, γ](x)), R(x)) ≤ γ.

Proof. When applied with G = ID_t, y is distributed according to the reference distribution of R. Thus the algorithm performs standard rejection sampling until it accepts a sample or exceeds the bound J on the number of steps. Note that deletion DP implies that

  Pr[R(x) = y] / (e^ε Pr_{r∼{0,1}^t}[R∅(r) = y]) ≤ 1.

At each step, conditioned on success, the algorithm samples s such that R∅(s) is distributed identically to R(x). Further, the acceptance probability at each step is

  E_{y∼ρ}[ Pr[R(x) = y] / (e^ε Pr_{r∼{0,1}^t}[R∅(r) = y]) ] = Σ_{y∈Y} Pr[R(x) = y] / e^ε = 1/e^ε.

Thus the probability that all steps reject is ≤ (1 − e^{−ε})^J ≤ γ. It follows that TV(R∅(R[ID_t, γ](x)), R(x)) is bounded by γ, as claimed.

We can now state the implications of using a sufficiently strong PRG on the output of the randomizer.

Lemma 3.4.
Let R be a deletion ε-DP t-samplable local randomizer and let G : {0,1}^ℓ → {0,1}^t be a (T, β)-PRG. Let T(R, G, γ) denote the running time of R[G, γ] and assume that T > T(R, G, γ). Then for all x ∈ X, R∅(G(R[G, γ](x))) is (T′, β′)-computationally indistinguishable from R(x), where β′ = γ + e^ε ln(1/γ) β and T′ = T − T(R, G, γ).

Proof. By Lemma 3.3, TV(R∅(R[ID_t, γ](x)), R(x)) ≤ γ and thus it suffices to prove that R∅(G(R[G, γ](x))) is (T′, e^ε ln(1/γ) β)-computationally indistinguishable from R∅(R[ID_t, γ](x)). Towards a contradiction, suppose that there exists a test D′ running in time T′ such that for some x,

  | Pr[D′(R∅(G(R[G, γ](x)))) = 1] − Pr[D′(R∅(R[ID_t, γ](x))) = 1] | ≥ e^ε ln(1/γ) β.

We claim that there exists a test for distinguishing G(s) for s ∼ {0,1}^ℓ from a truly random string r ∼ {0,1}^t. Note that R∅(G(R[G, γ])) can be seen as R[G, γ] that directly outputs y = R∅(G(s)) instead of s itself. Next we observe that R∅(G(R[G, γ])) uses the output of G at most J = e^ε ln(1/γ) times in place of the truly random t-bit strings used by R∅(R[ID_t, γ]). Thus, by the standard hybrid argument, one of those applications can be used to test G with success probability at least e^ε ln(1/γ) β / J = β. This test requires running a hybrid between R∅(G(R[G, γ])) and R∅(R[ID_t, γ]) in addition to D′ itself. The resulting test runs in time T′ + T(R, G, γ) = T. Thus we obtain a contradiction to G being a (T, β)-PRG.

As a direct corollary of Lemmas 3.2 and 3.4 we obtain a general way to compress efficient LDP randomizers.
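Putting the pieces together, here is a toy end-to-end sketch of Algorithm 1 (all parameter choices are ours; SHAKE-128 stands in for the PRG, and k-ary randomized response with the uniform reference distribution plays the role of the deletion ε-DP randomizer):

```python
import hashlib
import math
import random

K, EPS, GAMMA = 16, 1.0, 1e-6   # domain size, privacy parameter, failure bound
T_BITS, SEED_BITS = 128, 32     # illustrative sizes; t-bit strings, short seeds

def R_null(bits: int) -> int:
    # Sampler for the uniform reference distribution rho over [K]: reduce
    # t random bits mod K (exact, since 2**T_BITS is a multiple of K).
    return bits % K

def G(seed: int) -> int:
    # Toy PRG: expand a short seed into T_BITS bits with SHAKE-128,
    # used here as a stand-in for a cryptographically strong generator.
    raw = hashlib.shake_128(seed.to_bytes(SEED_BITS // 8, "big")).digest(T_BITS // 8)
    return int.from_bytes(raw, "big")

def density_R(x: int, y: int) -> float:
    # Pr[R(x) = y] for K-ary randomized response, which is deletion EPS-DP
    # with respect to the uniform reference distribution.
    e = math.exp(EPS)
    return e / (e + K - 1) if y == x else 1.0 / (e + K - 1)

def compress(x: int) -> int:
    # Algorithm 1: rejection-sample a seed s so that R_null(G(s)) is
    # (approximately) distributed as R(x); only s is communicated.
    J = math.ceil(math.exp(EPS) * math.log(1.0 / GAMMA))
    s = 0
    for _ in range(J):
        s = random.getrandbits(SEED_BITS)
        y = R_null(G(s))
        accept_p = density_R(x, y) / (math.exp(EPS) / K)
        if random.random() < accept_p:
            break
    return s  # if all J rounds reject (prob. <= GAMMA), s is a fresh uniform seed

def decompress(s: int) -> int:
    # Server side: apply R_null after expanding the seed with G.
    return R_null(G(s))
```

Communication is SEED_BITS bits regardless of K, and the server recovers the report as `decompress(compress(x))`; the choice of J guarantees that all rounds reject with probability at most γ, matching the bound in Lemma 3.3.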
Theorem 3.5.
Let R be a deletion ε -DP t -samplable local randomizer, let G : { , } (cid:96) → { , } t be ( T, β ) -PRG for β < / (2 e ε ) . Let T ( R , G, γ ) be the running time of R [ G, γ ] and assume that T > T ( R , G, γ ) .Then R [ G, γ ] is a deletion ( ε + 2 e ε β ) -DP local randomizer and for all x ∈ X , R ∅ ( G ( R [ G, γ ]( x ))) is ( T (cid:48) , β (cid:48) ) -computationally indistinguishable from R ( x ) , where β (cid:48) = γ + e ε ln(1 /γ ) β and T (cid:48) = T − T ( R , G, γ ) .Proof. The second part of the claim is exactly Lemma 3.4. To see the first part of the claim note that byour assumption
T > T(R, G, γ), and computation of the ratio of densities Pr[R(x) = R_∅(r′)] / (e^ε · Pr_{r∼{0,1}^t}[R_∅(r) = R_∅(r′)]) for any r′ ∈ {0,1}^t is part of R[G, γ]. This implies that the test family D defined in Lemma 3.2 can be computed in time T. Applying Lemma 3.2 now gives us the privacy claim.

By plugging an exponentially strong PRG G into Theorem 3.5 we obtain that if an LDP protocol based on R runs in time T, then its communication can be compressed to O(log(T + T(R, G, γ))) bits with negligible effect on privacy and utility. We also remark that even without making any assumptions on G, R[G, γ] satisfies 2ε-DP. In other words, failure of the PRG does not lead to a significant privacy violation, beyond the degradation of the privacy parameter ε by a factor of two. Lemma 3.6.
Let R be a deletion ε-DP t-samplable local randomizer and let G : {0,1}^ℓ → {0,1}^t be an arbitrary function. Then R[G, γ] is a deletion 2ε-DP local randomizer.

Proof. As in the proof of Lemma 3.2, we observe that if we take the reference distribution to be uniform over {0,1}^ℓ, then, conditioned on accepting a sample, the seed s is output with probability μ_x(s) such that

μ_x(s)/ρ_G(s) = μ_x(s)/2^{−ℓ} = π_x(R_∅(G(s))) / E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))].

By the fact that R(x) is ε-DP we have that for every s′ ∈ {0,1}^ℓ, π_x(R_∅(G(s′))) ∈ [e^{−ε}, e^ε], and thus μ_x(s)/ρ_G(s) ∈ [e^{−2ε}, e^{2ε}].

We now show that the same approach can be used to compress a replacement ε_r-DP randomizer R. To do this we first let ρ be some reference distribution relative to which R is deletion ε-DP for some ε ≤ ε_r. One possible way to define ρ is to pick some fixed x_0 ∈ X and let ρ be the distribution of R(x_0). In this case ε = ε_r. But other choices of ρ are possible that give an easy-to-sample distribution and ε < ε_r. In fact, for some standard randomizers, such as addition of Laplace noise, we will get ε = ε_r/2. We assume that ρ is t-samplable and, given a PRG G : {0,1}^ℓ → {0,1}^t, we define R[G] as in Def. 3.1 and R[G, γ] as in Algorithm 1. The randomizer R is deletion ε-DP, so all the results we proved apply to it as well (with the deletion ε and not the replacement ε_r). In addition we show that replacement privacy is preserved as well. Lemma 3.7.
For a t-samplable deletion ε-DP and replacement ε_r-DP local randomizer R : X → Y and G : {0,1}^ℓ → {0,1}^t, let D denote the following family of tests, which take r′ ∈ {0,1}^t as an input:

D := { ind( Pr[R(x) = R_∅(r′)] / Pr_{r∼{0,1}^t}[R_∅(r) = R_∅(r′)] ≥ θ ) | x ∈ X, θ ∈ [0, e^ε] }.

If G β-fools D for β < 1/(2e^ε), then R[G] is a replacement (ε_r + 4e^ε β)-DP local randomizer. Furthermore, for every γ > 0, R[G, γ] is a replacement (ε_r + 4e^ε β)-DP local randomizer.

Proof. As in the proof of Lemma 3.2, for y ∈ Y we denote the density ratio of R(x) to ρ at y by

π_x(y) := Pr[R(x) = y] / Pr_{r∼{0,1}^t}[R_∅(r) = y]

and note that R[G] outputs a seed s with probability

μ_x(s) := π_x(R_∅(G(s))) / Σ_{s′∈{0,1}^ℓ} π_x(R_∅(G(s′))).

Thus for two inputs x, x′ ∈ X and any s ∈ {0,1}^ℓ we have that

μ_x(s)/μ_{x′}(s) = [π_x(R_∅(G(s))) / π_{x′}(R_∅(G(s)))] · [Σ_{s′∈{0,1}^ℓ} π_{x′}(R_∅(G(s′))) / Σ_{s′∈{0,1}^ℓ} π_x(R_∅(G(s′)))]
 = [Pr[R(x) = R_∅(G(s))] / Pr[R(x′) = R_∅(G(s))]] · [E_{s′∼{0,1}^ℓ}[π_{x′}(R_∅(G(s′)))] / E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))]].

Now R is ε_r-replacement-DP and therefore the first term satisfies

Pr[R(x) = R_∅(G(s))] / Pr[R(x′) = R_∅(G(s))] ∈ [e^{−ε_r}, e^{ε_r}].

At the same time, we showed in Lemma 3.2 that E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] ∈ [e^{−2e^ε β}, e^{2e^ε β}] and also E_{s′∼{0,1}^ℓ}[π_{x′}(R_∅(G(s′)))] ∈ [e^{−2e^ε β}, e^{2e^ε β}].
Therefore μ_x(s)/μ_{x′}(s) ∈ [e^{−ε_r−4e^ε β}, e^{ε_r+4e^ε β}].

To show that R[G, γ] is a replacement (ε_r + 4e^ε β)-DP local randomizer, we observe that for every x, R[G, γ](x) is a mixture of R[G](x) and ρ_G. As we showed, R[G](x) is (ε_r + 4e^ε β)-close to R[G](x′), and we also know from Lemma 3.2 that ρ_G is (ε + 2e^ε β)-close to R[G](x′). By quasi-convexity we obtain that R[G, γ](x) is (ε_r + 4e^ε β)-close to R[G](x′). We also know that R[G, γ](x) is (ε + 2e^ε β)-close to ρ_G. Appealing to quasi-convexity again, we obtain that R[G, γ](x) is (ε_r + 4e^ε β)-close to R[G, γ](x′).

(ε, δ)-DP

We next extend our approach to (ε, δ)-DP randomizers. The approach here is similar, except that for some outputs y = R_∅(G(s)) the prescribed acceptance probability in the original approach would be larger than one. To handle this, we simply truncate this ratio at 1 to get a probability. Algorithm 2 is identical to Algorithm 1 except for this truncation in the step where we sample b.

Algorithm 2 R[G, γ]: PRG compression of a deletion (ε, δ)-DP R
Input: x ∈ X; ε, γ >
0; seeded PRG G : {0,1}^ℓ → {0,1}^t; t-samplable deletion (ε, δ)-DP randomizer R.
 J = e^ε ln(1/γ)/(1 − δ)
 for j = 1, …, J do
  Sample a random seed s ∈ {0,1}^ℓ.
  y = R_∅(G(s))
  Sample b from Bernoulli( min(1, Pr[R(x) = y] / (e^ε · Pr_{r∼{0,1}^t}[R_∅(r) = y])) )
  if b == 1 then BREAK
 Send s

The proof is fairly similar to that for the pure DP randomizer. We start with a lemma that relates the properties of the PRG to the properties of the randomizer that need to be preserved in order to ensure that it satisfies deletion (ε′, δ′)-LDP.

Lemma 3.8. For a t-samplable deletion (ε, δ)-DP local randomizer R : X → Y and G : {0,1}^ℓ → {0,1}^t, let D denote the following family of tests, which take r′ ∈ {0,1}^t as an input:

D := { ind( Pr[R(x) = R_∅(r′)] / Pr_{r∼{0,1}^t}[R_∅(r) = R_∅(r′)] ≥ θ ) | x ∈ X, θ ∈ [0, e^ε] }.

Suppose that
G β-fools D, and let π_x(y) := min(Pr[R(x) = y], e^ε·Pr_{r∼{0,1}^t}[R_∅(r) = y]) / Pr_{r∼{0,1}^t}[R_∅(r) = y]. Then

E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] ∈ [1 − δ − e^ε β, 1 + e^ε β]

and

E_{s′∼{0,1}^ℓ}[ |1 − e^ε·π_x(R_∅(G(s′)))|_+ ] ≤ δ + β.

Proof.
Let ρ_G be the uniform distribution over {0,1}^ℓ. Let ν_x(y) := Pr[R(x) = y] and let ν̃_x(y) := min(ν_x(y), e^ε·ρ(y)). Note that ν̃_x(·) does not necessarily define a probability distribution. For S = {y : ν̃_x(y) < ν_x(y)}, we have

ν_x(S) = Σ_{y∈S} ν_x(y) = Σ_{y∈S} ν̃_x(y) + Σ_{y∈S} (ν_x(y) − ν̃_x(y)) = e^ε ρ(S) + Σ_y (ν_x(y) − ν̃_x(y)) = e^ε ρ(S) + (1 − Σ_y ν̃_x(y)).

Then deletion (ε, δ)-DP of R implies that Σ_y ν̃_x(y) ≥ 1 − δ. Observe that this implies that for true randomness we have

E_{r′∼{0,1}^t}[π_x(R_∅(r′))] = E_{r′∼{0,1}^t}[ ν̃_x(R_∅(r′)) / ρ(R_∅(r′)) ] = E_{y∼ρ}[ ν̃_x(y)/ρ(y) ] = Σ_{y∈Y} ρ(y)·ν̃_x(y)/ρ(y) = Σ_{y∈Y} ν̃_x(y) ∈ [1 − δ, 1].

Using the fact that π_x(y) ∈ [0, e^ε] we have that

E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] = ∫_0^{e^ε} Pr_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′))) ≥ θ] dθ

and, similarly,

E_{r′∼{0,1}^t}[π_x(R_∅(r′))] = ∫_0^{e^ε} Pr_{r′∼{0,1}^t}[π_x(R_∅(r′)) ≥ θ] dθ.

Therefore

| E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] − E_{r′∼{0,1}^t}[π_x(R_∅(r′))] | ≤ ∫_0^{e^ε} | Pr_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′))) ≥ θ] − Pr_{r′∼{0,1}^t}[π_x(R_∅(r′)) ≥ θ] | dθ ≤ e^ε β,

where in the last step we have used the property of the pseudo-random generator that it β-fools D, and the fact that for θ ∈ [0, e^ε), ν̃_x(y)/ρ(y) < θ if and only if ν_x(y)/ρ(y) < θ. The first part of the claim follows.

For the second part of the claim we first note that deletion (ε, δ)-DP of R implies that

E_{r′∼{0,1}^t}[ |1 − e^ε·π_x(R_∅(r′))|_+ ] = E_{y∼ρ}[ |1 − e^ε·π_x(y)|_+ ] = Σ_{y∈Y} ρ(y)·|1 − e^ε·π_x(y)|_+ = Σ_{y∈Y} |ρ(y) − e^ε·ν̃_x(y)|_+ = Σ_{y∈Y} |ρ(y) − e^ε·ν_x(y)|_+ ≤ δ.

Also note that

E_{r′∼{0,1}^t}[ |1 − e^ε·π_x(R_∅(r′))|_+ ] = ∫_0^1 Pr_{r′∼{0,1}^t}[ 1 − e^ε·π_x(R_∅(r′)) ≥ θ ] dθ = 1 − ∫_0^1 Pr_{r′∼{0,1}^t}[ π_x(R_∅(r′)) ≥ θ/e^ε ] dθ.

Similarly,

E_{s′∼{0,1}^ℓ}[ |1 − e^ε·π_x(R_∅(G(s′)))|_+ ] = 1 − ∫_0^1 Pr_{s′∼{0,1}^ℓ}[ π_x(R_∅(G(s′))) ≥ θ/e^ε ] dθ.

Thus, by the same argument as before, the fact that G β-fools D implies that

E_{s′∼{0,1}^ℓ}[ |1 − e^ε·π_x(R_∅(G(s′)))|_+ ] ≤ E_{r′∼{0,1}^t}[ |1 − e^ε·π_x(R_∅(r′))|_+ ] + β ≤ δ + β.
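To make Algorithm 2 and the quantities controlled by Lemma 3.8 concrete, here is a small self-contained sketch (our own illustration, not part of the paper). It uses a toy deletion-(ε, δ)-DP randomizer on a binary output space; `toy_prg`, `decode`, and all parameter choices are hypothetical stand-ins, and the rejection sampler uses the truncated acceptance probability from Algorithm 2.

```python
import math
import random

# Toy deletion-(eps, delta)-DP randomizer on Y = {0, 1}: with probability
# 1 - delta it runs binary randomized response, with probability delta it
# leaks x in the clear (the "bad" event that delta accounts for). The
# reference distribution rho (output of R on an "empty" input) is a uniform bit.
eps, delta, gamma = 0.2, 0.2, 1e-3
rho = {0: 0.5, 1: 0.5}

def nu(x, y):
    """nu_x(y) = Pr[R(x) = y] for the toy randomizer."""
    rr = math.exp(eps) / (math.exp(eps) + 1) if y == x else 1 / (math.exp(eps) + 1)
    return (1 - delta) * rr + delta * (1.0 if y == x else 0.0)

def toy_prg(seed, t=64):
    """Stand-in for a PRG G: deterministically expand a short seed to t bits."""
    return random.Random(seed).getrandbits(t)

def decode(bits):
    """R_emptyset: turn t random bits into a sample from rho (lowest bit)."""
    return bits & 1

def compress(x, seed_bits=16, rng=random):
    """Algorithm 2 sketch: rejection-sample a short seed s so that
    decode(toy_prg(s)) is (approximately) distributed as R(x); the
    acceptance probability is the density ratio truncated at 1."""
    J = math.ceil(math.exp(eps) * math.log(1 / gamma) / (1 - delta))
    s = 0
    for _ in range(J):
        s = rng.getrandbits(seed_bits)
        y = decode(toy_prg(s))
        if rng.random() < min(1.0, nu(x, y) / (math.exp(eps) * rho[y])):
            break
    return s  # the message: seed_bits bits instead of t
```

The truncated distribution ν̃_x(y) = min(ν_x(y), e^ε ρ(y)) predicts both the exact expectation bound of Lemma 3.8 (here Σ_y ν̃_x(y) ≈ 0.97 ∈ [1 − δ, 1]) and the output frequencies of the compressed randomizer.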
We can now give an analogue of Lemma 3.2 for deletion ( ε, δ )-DP randomizers.
Lemma 3.9.
For a t-samplable deletion (ε, δ)-DP local randomizer R : X → Y and G : {0,1}^ℓ → {0,1}^t, let D denote the following family of tests, which take r′ ∈ {0,1}^t as an input:

D := { ind( Pr[R(x) = R_∅(r′)] / Pr_{r∼{0,1}^t}[R_∅(r) = R_∅(r′)] ≥ θ ) | x ∈ X, θ ∈ [0, e^ε] }.

If G β-fools D, where δ + e^ε β < 1/2, then R[G] is a deletion (ε + 2δ + 2e^ε β, δ + β)-DP local randomizer. Furthermore, for every γ > 0, R[G, γ] is a deletion (ε + 2δ + 2e^ε β, δ + β)-DP local randomizer.

Proof. As before, we let the reference distribution ρ_G be the uniform distribution over {0,1}^ℓ. Using the definitions in the proof of Lemma 3.8 we observe that R[G](x) outputs s with probability

μ_x(s) := π_x(R_∅(G(s))) / Σ_{s′∈{0,1}^ℓ} π_x(R_∅(G(s′))) = [ν̃_x(R_∅(G(s))) / ρ(R_∅(G(s)))] / (2^ℓ · E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))]) = ρ_G(s) · [ν̃_x(R_∅(G(s))) / ρ(R_∅(G(s)))] / E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))].

By the definition of ν̃_x, the numerator satisfies ν̃_x(R_∅(G(s)))/ρ(R_∅(G(s))) ≤ e^ε. In addition, by Lemma 3.8 the denominator satisfies E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] ≥ 1 − δ − e^ε β. Therefore

μ_x(s) ≤ ρ_G(s) · e^ε/(1 − δ − e^ε β) ≤ e^{ε+2δ+2e^ε β}·ρ_G(s).
For the other side of (ε, δ)-closeness we simply observe that, by Lemma 3.8,

Σ_{s∈{0,1}^ℓ} |ρ_G(s) − e^{ε+e^ε β}·μ_x(s)|_+ = Σ_{s∈{0,1}^ℓ} | ρ_G(s) − e^{ε+e^ε β}·ρ_G(s)·π_x(R_∅(G(s))) / E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] |_+ ≤ Σ_{s∈{0,1}^ℓ} |ρ_G(s) − e^ε·ρ_G(s)·π_x(R_∅(G(s)))|_+ = E_{s∼{0,1}^ℓ}[ |1 − e^ε·π_x(R_∅(G(s)))|_+ ] ≤ δ + β.

Finally, to establish that R[G, γ] is (ε + 2δ + 2e^ε β, δ + β)-DP we, as before, appeal to quasi-convexity.

To establish the utility guarantees for R[G, γ] we follow the same approach: we first establish the utility guarantees for R[ID_t, γ] and then use the properties of G. Lemma 3.10.
Let R be a deletion (ε, δ)-DP t-samplable local randomizer. Then for the identity function ID_t : {0,1}^t → {0,1}^t and any γ > 0 we have that R[ID_t, γ] is a deletion ε-DP local randomizer and for every x ∈ X, TV(R_∅(R[ID_t, γ](x)), R(x)) ≤ δ + γ.

Proof. Conditioned on accepting a sample, R[ID_t, γ] outputs a sample from the truncated version of the distribution of R(x). Specifically, y is output with probability ν̄_x(y) := ν̃_x(y) / Σ_{y′∈Y} ν̃_x(y′), where ν_x(y) := Pr[R(x) = y] and ν̃_x(y) := min(ν_x(y), e^ε·Pr_r[R_∅(r) = y]). From the proof of Lemma 3.8 we know that Σ_{y∈Y} ν̃_x(y) ≥ 1 − δ. Thus

TV(ν_x, ν̄_x) = (1/2)·Σ_{y∈Y} |ν_x(y) − ν̄_x(y)| ≤ (1/2)·Σ_{y∈Y} ( |ν_x(y) − ν̃_x(y)| + |ν̃_x(y) − ν̄_x(y)| ) = (1/2)·Σ_{y∈Y} ( ν_x(y) − ν̃_x(y) + ν̄_x(y) − ν̃_x(y) ) ≤ δ.

Truncation of the distribution also reduces the probability that a sample is accepted. Specifically,

E_{y∼ρ}[ ν̃_x(y)/(e^ε·ρ(y)) ] = Σ_{y∈Y} ν̃_x(y)/e^ε ≥ (1 − δ)/e^ε.

R[ID_t, γ] tries at least e^ε ln(1/γ)/(1 − δ) samples and therefore, as in the proof of Lemma 3.3, failure to accept any sample adds at most γ to the total variation distance.

From here we can directly obtain the analogues of Lemma 3.4 and Theorem 3.5. Finally, to deal with the replacement version of (ε, δ)-DP we combine the ideas used in Lemmas 3.7 and 3.9. The main distinction is a somewhat stronger test that we need to fool in this case. Lemma 3.11.
For a t-samplable replacement (ε_r, δ_r)-DP and deletion (ε, δ)-DP local randomizer R : X → Y and G : {0,1}^ℓ → {0,1}^t, let D and D_r denote the following families of tests, which take r′ ∈ {0,1}^t as an input:

D := { ind( Pr[R(x) = R_∅(r′)] / ρ(R_∅(r′)) ≥ θ ) | x ∈ X, θ ∈ [0, e^ε] };
D_r := { ind( ν̃_x(R_∅(r′))/ρ(R_∅(r′)) − e^{ε_r}·ν̃_{x′}(R_∅(r′))/ρ(R_∅(r′)) ≥ θ ) | x, x′ ∈ X, θ ∈ [0, e^ε] },

where ρ is the reference distribution of R and ν̃_x(y) := min(Pr[R(x) = y], e^ε·ρ(y)). If G β-fools
D ∪ D_r, where δ + e^ε β < 1/2, then R[G] is a replacement (ε_r + 2δ + 3e^ε β, 2δ_r + 2e^ε β)-DP local randomizer. Furthermore, for every γ > 0, R[G, γ] is a replacement (ε_r + 2δ + 3e^ε β, 2δ_r + 2e^ε β)-DP local randomizer.

Proof. First we observe that R being (ε_r, δ_r)-replacement-DP implies that ν̃_x and ν̃_{x′} are (ε_r, δ_r)-close in the following sense:

E_{r′∼{0,1}^t}[ | ν̃_x(R_∅(r′))/ρ(R_∅(r′)) − e^{ε_r}·ν̃_{x′}(R_∅(r′))/ρ(R_∅(r′)) |_+ ] = E_{y∼ρ}[ | ν̃_x(y)/ρ(y) − e^{ε_r}·ν̃_{x′}(y)/ρ(y) |_+ ] = Σ_{y∈Y} | ν̃_x(y) − e^{ε_r}·ν̃_{x′}(y) |_+ ≤ Σ_{y∈Y} | ν_x(y) − e^{ε_r}·ν_{x′}(y) |_+ ≤ δ_r,

where we used the fact that if ν_{x′}(y) > ν̃_{x′}(y) then ν̃_{x′}(y) = e^ε·ρ(y) ≥ ν̃_x(y), and so |ν̃_x(y) − e^{ε_r}·ν̃_{x′}(y)|_+ = |ν̃_x(y) − e^{ε_r}·ν_{x′}(y)|_+ = 0. Using the decomposition

E_{r′∼{0,1}^t}[ | ν̃_x(R_∅(r′))/ρ(R_∅(r′)) − e^{ε_r}·ν̃_{x′}(R_∅(r′))/ρ(R_∅(r′)) |_+ ] = ∫_0^{e^ε} Pr_{r′∼{0,1}^t}[ ν̃_x(R_∅(r′))/ρ(R_∅(r′)) − e^{ε_r}·ν̃_{x′}(R_∅(r′))/ρ(R_∅(r′)) ≥ θ ] dθ

and the fact that G β-fools D_r, we obtain that

E_{s′∼{0,1}^ℓ}[ | ν̃_x(R_∅(G(s′)))/ρ(R_∅(G(s′))) − e^{ε_r}·ν̃_{x′}(R_∅(G(s′)))/ρ(R_∅(G(s′))) |_+ ] ≤ δ_r + e^ε β.  (1)
By Lemma 3.8 we have that for π_x(y) := ν̃_x(y)/ρ(y) it holds that

ζ_x := E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] ∈ [1 − δ − e^ε β, 1 + e^ε β].

Following the notation in Lemma 3.9, we know that the distribution of R[G](x) is

μ_x(s) = ρ_G(s)·[ν̃_x(R_∅(G(s)))/ρ(R_∅(G(s)))] / E_{s′∼{0,1}^ℓ}[π_x(R_∅(G(s′)))] = ρ_G(s)·ν̃_x(R_∅(G(s))) / (ζ_x·ρ(R_∅(G(s)))).

For ε′ = ε_r + 2δ + 3e^ε β we obtain:

Σ_{s′∈{0,1}^ℓ} |μ_x(s′) − e^{ε′}·μ_{x′}(s′)|_+
 = E_{s′∼{0,1}^ℓ}[ | μ_x(s′)/ρ_G(s′) − e^{ε′}·μ_{x′}(s′)/ρ_G(s′) |_+ ]
 = (1/ζ_x)·E_{s′∼{0,1}^ℓ}[ | ν̃_x(R_∅(G(s′)))/ρ(R_∅(G(s′))) − e^{ε′}·(ζ_x/ζ_{x′})·ν̃_{x′}(R_∅(G(s′)))/ρ(R_∅(G(s′))) |_+ ]
 ≤ (1/(1 − δ − e^ε β))·E_{s′∼{0,1}^ℓ}[ | ν̃_x(R_∅(G(s′)))/ρ(R_∅(G(s′))) − e^{ε′}·((1 − δ − e^ε β)/(1 + e^ε β))·ν̃_{x′}(R_∅(G(s′)))/ρ(R_∅(G(s′))) |_+ ]
 ≤ (1/(1 − δ − e^ε β))·E_{s′∼{0,1}^ℓ}[ | ν̃_x(R_∅(G(s′)))/ρ(R_∅(G(s′))) − e^{ε_r}·ν̃_{x′}(R_∅(G(s′)))/ρ(R_∅(G(s′))) |_+ ]
 ≤ 2(δ_r + e^ε β),
where we used that (1 + e^ε β)/(1 − δ − e^ε β) ≤ e^{2δ+3e^ε β} and 1/(1 − δ − e^ε β) ≤ 2, since δ + e^ε β < 1/2.

4 Frequency Estimation

In this section we apply our approach to the problem of frequency estimation over a discrete domain. In this problem on domain X = [k], the goal is to estimate the frequency of each element j ∈ [k] in the dataset. Namely, for S = (x_1, …, x_n) ∈ X^n we let c(S) ∈ {0, 1, …, n}^k be the vector of the counts of each of the elements in S: c(S)_j = |{i | x_i = j}|. In the frequency estimation problem the goal is to design a local randomizer and a decoding/aggregation algorithm that outputs a vector c̃ that is close to c(S). Commonly studied metrics are (the expected) ℓ_∞, ℓ_1 and ℓ_2 norms of c̃ − c(S). In most regimes of interest, n is large enough that all these errors are essentially determined by the variance of the estimate of each count produced by the randomizer, and therefore the choice of the metric does not affect the choice of the algorithm.

The randomizer used in the RAPPOR algorithm [EPK14] is defined by two parameters α_0 and α_1. The algorithm first converts the input j to the indicator vector of j (also referred to as one-hot encoding). It then randomizes each bit in this encoding: if the bit is 0 then 1 is output with probability α_0 (and 0 with probability 1 − α_0), and if the bit is 1 then 1 is output with probability α_1.

For deletion privacy the optimal error is achieved by the symmetric setting α_0 = 1/(e^ε + 1) and α_1 = e^ε/(e^ε + 1) [EFMRSTT20]. This makes the algorithm equivalent to applying the standard binary randomized response to each bit. A simple analysis shows that this results in a standard deviation for each count of √n·e^{ε/2}/(e^ε − 1) [EPK14; WBLJ17]. For replacement privacy the optimal error is achieved by an asymmetric version in which α_0 = 1/(e^ε + 1) but α_1 = 1/2. The resulting standard deviation for each count is dominated by 2√n·e^{ε/2}/(e^ε − 1) [WBLJ17]. (We remark that several works analyze the symmetric RAPPOR algorithm in the replacement setting. This requires setting α_0 = 1 − α_1 = 1/(e^{ε/2} + 1), resulting in a substantially worse algorithm than the asymmetric version.)

Note that the resulting encoding has ≈ k/(e^ε + 1) ones in expectation. A closely related Subset Selection algorithm [WHWNXYLQ16; YB18] maps inputs to bit vectors of length k with exactly ⌈k/(e^ε + 1)⌉ ones (that can be thought of as a subset of [k]). An input j is mapped with probability ≈ e^ε/(e^ε + 1) to a random subset that includes j and with probability ≈ 1/(e^ε + 1) to a random subset that does not include j.

4.1 Pairwise-independent RAPPOR

While we can use our general result to compress communication in RAPPOR, in this section we exploit the specific structure of the randomizer. Specifically, the tests needed for privacy are fooled as long as the marginals of the PRG are correct. Moreover, the accuracy is preserved as long as the bits are randomized in a pairwise-independent way. Thus we can simply use a standard derandomization technique for pairwise-independent random variables. Specifically, to obtain a Bernoulli random variable with bias α we will use a finite field F_p of size p such that αp is an integer (or, in general, sufficiently close to an integer) and p is a prime larger than k. This allows us to treat inputs in [k] as non-zero elements of F_p. We will associate all elements of the field that are smaller (in the regular order over the integers) than αp with 1 and the rest with 0. We denote this indicator function of the event z < αp by bool(z). Now, for a randomly and uniformly chosen element z ∈ F_p, we have that bool(z) is distributed as a Bernoulli random variable with bias α. We note that this approach is a special case of a more general approach in which we associate each j with a non-zero element of an inner product space F_q^d, where q is a prime power.
This more general approach (described in Section A) allows one to reduce some of the computational overheads in decoding. As mentioned, we associate each index j ∈ [k] with the element j of F_p. We can describe an affine function φ over F_p using its two coefficients φ_0 and φ_1: for z ∈ F_p we define φ(z) = φ_0 + zφ_1, where addition and multiplication are in the field F_p. Each such function encodes a vector in F_p^k as φ([k]) := φ(1), φ(2), …, φ(k). Let Φ := {φ | φ_0, φ_1 ∈ F_p} be the family of all such functions. For a randomly chosen function from this family, the values of the function on two distinct non-zero inputs are uniformly distributed and pairwise independent: for any j_1 ≠ j_2 ∈ [k] and a_1, a_2 ∈ F_p we have that

Pr_{φ∼Φ}[φ(j_1) = a_1 and φ(j_2) = a_2] = Pr_{φ∼Φ}[φ(j_1) = a_1] · Pr_{φ∼Φ}[φ(j_2) = a_2] = 1/p².

In particular, if we use the encoding of φ as a boolean vector bool(φ([k])) := bool(φ(1)), bool(φ(2)), …, bool(φ(k)), then for φ ∼ Φ and any j_1 ≠ j_2 ∈ [k], bool(φ(j_1)) and bool(φ(j_2)) are independent Bernoulli random variables with bias α.

Finally, for every index j ∈ [k] and bit b ∈ {0, 1} we denote the set of functions φ whose encoding has bit b in position j by Φ_{j,b}:

Φ_{j,b} := {φ ∈ Φ | bool(φ(j)) = b}.  (2)

We can now describe the randomizer, which we refer to as Pairwise-Independent (PI) RAPPOR, for general α_1 > α_0. Algorithm 3
PI-RAPPOR randomizer
Input: index j ∈ [k]; parameters 0 < α_0 < α_1 < 1; prime p ≥ k + 1 s.t. α_0·p ∈ ℕ.
 Sample b from Bernoulli(α_1)
 Sample φ uniformly at random from Φ_{j,b} defined in eq. (2)
 Send φ

The server side of frequency estimation with pairwise-independent RAPPOR consists of a decoding step that converts φ to bool(φ([k])), followed by the same debiasing and aggregation as for standard RAPPOR. We describe it as a frequency oracle to emphasize that each count can be computed individually. We start by establishing several general properties of PI-RAPPOR. First we establish that the privacy guarantees of PI-RAPPOR are identical to those of RAPPOR. Lemma 4.1.
The PI-RAPPOR randomizer (Alg. 3) is a deletion ln(max{α_1/α_0, (1 − α_0)/(1 − α_1)})-DP and replacement ln(α_1(1 − α_0)/(α_0(1 − α_1)))-DP local randomizer.

Algorithm 4 Server-side frequency oracle for PI-RAPPOR
Input: 0 < α_0 < α_1 < 1; index j ∈ [k]; prime p > k; reports φ^1, …, φ^n from n users.
 sum = 0
 for i ∈ [n] do
  sum += bool(φ^i(j))
 c̃_j = (sum − α_0·n)/(α_1 − α_0)
 Return c̃_j

Proof.
While it is easy to analyze the privacy guarantees of PI-RAPPOR directly, it is instructive to show that these guarantees follow from our general compression technique. Specifically, there is a natural way to sample from the reference distribution of RAPPOR relative to which our pairwise-independent PRG fools the density tests given in Lemma 3.2.

To sample from the reference distribution of RAPPOR we pick k values z_1, …, z_k randomly, independently and uniformly from F_p and then output bool(z_1), bool(z_2), …, bool(z_k) (we note that samplability is defined using the uniform distribution over binary strings of length t as an input, but any other distribution can be used instead). By our choice of the parameter p and the definition of bool, this gives k i.i.d. samples from Bernoulli(α_0), which is the reference distribution for RAPPOR. Let R denote the RAPPOR randomizer. For any j ∈ [k] and z′ ∈ F_p^k the ratio of densities at z′ satisfies:

Pr[R(j) = R_∅(z′)] / Pr_{z∼F_p^k}[R_∅(z) = R_∅(z′)] = α_1/α_0 if bool(z′_j) = 1, and (1 − α_1)/(1 − α_0) otherwise.

With probability α_1 the PI-RAPPOR algorithm samples φ uniformly from Φ_{j,1}, and with probability 1 − α_1 it samples φ uniformly from Φ_{j,0}. This means that PI-RAPPOR is exactly equal to R[G], where G : F_p^2 → F_p^k is defined as G(φ) = φ(1), φ(2), …, φ(k). Now, to prove that PI-RAPPOR has the same deletion privacy guarantees as RAPPOR, it suffices to prove that G fools the density tests above, which holds because bool(φ(j)) for φ ∼ Φ is distributed in the same way as bool(z_j) for z ∼ F_p^k. To prove that PI-RAPPOR has the same replacement privacy guarantees as RAPPOR, we use the same reference distribution and apply Lemma 3.7.

Second, we establish that the utility guarantees of PI-RAPPOR are identical to those of RAPPOR. This follows directly from the fact that the utility is determined by the variance of the estimate of each individual count in each user's contribution.
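As a concrete illustration of Algorithms 3 and 4, the following self-contained sketch (our own illustration, not from the paper; the parameter choices are arbitrary) implements the PI-RAPPOR client and the server-side frequency oracle for the symmetric setting, and can be used to confirm empirically that the estimates are unbiased.

```python
import math
import random

p, k, eps = 103, 20, 1.0             # illustrative parameters: prime p > k
t0 = round(p / (math.exp(eps) + 1))  # threshold alpha0 * p, an integer
alpha0, alpha1 = t0 / p, 1 - t0 / p  # symmetric setting: alpha1 = 1 - alpha0
rng = random.Random(0)

def encode(j):
    """Client (Alg. 3): sample b ~ Bernoulli(alpha1), then a uniform affine
    map phi(z) = phi0 + z*phi1 over F_p conditioned on bool(phi(j)) == b."""
    b = rng.random() < alpha1
    phi1 = rng.randrange(p)
    # pick phi(j) uniformly in [0, t0) if b, else in [t0, p); solve for phi0
    u = rng.randrange(t0) if b else t0 + rng.randrange(p - t0)
    return ((u - j * phi1) % p, phi1)

def estimate(reports, j):
    """Server (Alg. 4): decode bit j of every report and debias the sum."""
    s = sum(1 for (phi0, phi1) in reports if (phi0 + j * phi1) % p < t0)
    return (s - len(reports) * alpha0) / (alpha1 - alpha0)

# dataset: 2000 users hold item 3 and 1000 hold item 7; each report is just
# two elements of F_p (about 2*log2(p) bits), independent of k
reports = [encode(x) for x in [3] * 2000 + [7] * 1000]
```

Here estimate(reports, 3) should be close to 2000 (with standard deviation roughly √(n e^ε)/(e^ε − 1) ≈ 53 for these parameters), estimate(reports, 7) close to 1000, and the estimate for any absent item close to 0.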
The variance of the estimate of c(S)_j is a sum of c(S)_j variances for the randomization of a 1 and n − c(S)_j variances for the randomization of a 0. These variances are identical for RAPPOR and PI-RAPPOR, leading to identical exact bounds. Standard results on concentration of sums of independent random variables imply that the bounds on the variance translate easily into high-probability bounds and also into bounds on the expectation of the ℓ_∞, ℓ_1 and ℓ_2 errors. Lemma 4.2.
For any dataset S ∈ [k]^n, the estimate c̃ computed by the PI-RAPPOR algorithm (Algs. 3, 4) satisfies:
• E[c̃] = c(S);
• for all j ∈ [k], Var[c̃_j] = c(S)_j·(1 − α_0 − α_1)/(α_1 − α_0) + n·α_0(1 − α_0)/(α_1 − α_0)².

For the symmetric case α_0 = 1 − α_1 this simplifies to Var[c̃_j] = n·α_1(1 − α_1)/(1 − 2α_0)². In particular, the expected squared ℓ_2 error is

E[‖c̃ − c(S)‖²_2] = n·(1 − α_0 − α_1)/(α_1 − α_0) + nk·α_0(1 − α_0)/(α_1 − α_0)².

Proof. We first note that

c̃_j = Σ_{i∈[n]} (bool(φ^i(j)) − α_0)/(α_1 − α_0),

where φ^i is the output of the PI-RAPPOR randomizer on input x_i. Thus, to prove the claim about the expectation it suffices to prove that for every i,

E[(bool(φ^i(j)) − α_0)/(α_1 − α_0)] = ind(x_i = j),

and to prove the claim about the variance it suffices to prove that

Var[(bool(φ^i(j)) − α_0)/(α_1 − α_0)] = ind(x_i = j)·(1 − α_0 − α_1)/(α_1 − α_0) + α_0(1 − α_0)/(α_1 − α_0)².

If x_i = j then both claims follow directly from the fact that, by the definition of the PI-RAPPOR randomizer, in this case the distribution of bool(φ^i(j)) is Bernoulli(α_1). If, on the other hand, x_i ≠ j, we use the pairwise independence of bool(φ(x_i)) and bool(φ(j)) for φ ∼ Φ to infer that conditioning on bool(φ(x_i)) = b (for any b) does not affect the distribution of bool(φ(j)). Thus, if x_i ≠ j then bool(φ^i(j)) is distributed as Bernoulli(α_0), and we can verify the desired property directly. Finally,

E[‖c̃ − c(S)‖²_2] = E[Σ_{j∈[k]} (c̃_j − c(S)_j)²] = Σ_{j∈[k]} Var[c̃_j] = n·(1 − α_0 − α_1)/(α_1 − α_0) + nk·α_0(1 − α_0)/(α_1 − α_0)².

For RAPPOR these bounds are stated in [WBLJ17], who also demonstrate that optimizing α_0 and α_1 subject to the replacement privacy parameter being ε, while ignoring the first term in the variance (since it is typically dominated by the second term), leads to the asymmetric version α_0 = 1/(e^ε + 1) and α_1 = 1/
2. For deletion privacy the optimal setting α_0 = 1 − α_1 = 1/(e^ε + 1) follows from standard optimization of binary randomized response. Thus we obtain the following utility bounds for ε-DP versions of PI-RAPPOR. Corollary 4.3.
For any ε > 0 and a setting of p that ensures that p/(e^ε + 1) ∈ ℕ, PI-RAPPOR with α_0 = 1 − α_1 = 1/(e^ε + 1) satisfies deletion ε-DP and, for every dataset S ∈ [k]^n, the estimate c̃ computed by PI-RAPPOR satisfies: E[c̃] = c(S); for all j ∈ [k], Var[c̃_j] = n·e^ε/(e^ε − 1)²; and E[‖c̃ − c(S)‖²_2] = nk·e^ε/(e^ε − 1)². Corollary 4.4.
For any ε > 0 and a setting of p that ensures that p/(e^ε + 1) ∈ ℕ, PI-RAPPOR with α_0 = 1/(e^ε + 1) and α_1 = 1/2 is replacement ε-DP and, for every dataset S ∈ [k]^n, the estimate c̃ computed by PI-RAPPOR satisfies: E[c̃] = c(S); for all j ∈ [k], Var[c̃_j] = c(S)_j + 4n·e^ε/(e^ε − 1)²; and E[‖c̃ − c(S)‖²_2] = n + 4nk·e^ε/(e^ε − 1)².

Note that in the setting where S is sampled i.i.d. from some distribution over [k] defined by frequencies f_1, …, f_k, the term c(S)_j in the variance is comparable to the sampling variance. This is true since c(S)_j ≈ nf_j, and for a sum of n Bernoulli random variables with bias f_j ≪
1, the variance is nf_j(1 − f_j) ≈ nf_j. In most practical regimes of frequency estimation with LDP, the sampling error is much lower than the error introduced by the local randomizers. This justifies optimizing the parameters based on the second term alone.

Finally, we analyze the computational and communication costs of PI-RAPPOR. We first bound these for the client. Lemma 4.5.
The PI-RAPPOR randomizer (Alg. 3) can be implemented in Õ(log p) time and uses 2⌈log p⌉ bits of communication.

Proof. Any function φ ∈ Φ is represented by two elements of F_p, which implies the claimed bound on the communication cost. The running time of PI-RAPPOR is dominated by the time needed to pick a random and uniform element of Φ_{j,b}. This can be done by picking φ_1 ∈ F_p randomly and uniformly. We then need to pick φ_0 randomly and uniformly from the set {φ_0 | bool(φ_0 + jφ_1) = b}. Given the result of the multiplication jφ_1, this can be done in O(log p) time. For example, for b = 1 this set is equal to {−jφ_1, −jφ_1 + 1, …, −jφ_1 + α_0·p − 1}, where all arithmetic operations are in F_p. The set consists of at most two contiguous ranges of integers, and thus a random and uniform element can be chosen in O(log p) time. Multiplication in F_p can be done in O(log(p)·(log log p)²) time (e.g. [MVOV18]), but in most practical settings the standard Montgomery modular multiplication, which takes O(log²(p)) time, would be sufficiently fast.

The analysis of the running time of decoding and aggregation is similarly straightforward, since decoding each bit of a message takes time dominated by a single multiplication in F_p. Lemma 4.6.
For every j ∈ [k], the server side of PI-RAPPOR (Alg. 4) computes c̃_j in time Õ(n log p). In particular, the entire histogram is computed in time Õ(kn log p).

Note that the construction of the entire histogram on the server is relatively expensive. In Section A we show an alternative algorithm that runs faster when k ≫ n. For comparison, we note that aggregation in the compression schemes in [ASZ19] and [CKÖ20] can be done in Õ(n + k). However, these schemes require Ω(k) computation on each client, and thus the entire system also performs Ω(nk) computation. They also do not give a frequency oracle, since the decoding time of even a single message is linear in k.

Finally, we need to discuss how to pick p. In addition to the condition that p is a prime larger than k, our algorithm requires that α₀p be an integer. We observe that while, in general, we cannot always set α₀ to exactly 1/(e^ε + 1), by picking p that is sufficiently large relative to max{e^ε, 1/ε} we get an ε′-DP PI-RAPPOR algorithm for ε′ that is slightly smaller than ε (which also implies that its utility is slightly worse). We make this formal below.

Lemma 4.7.
There exists a constant c₀ such that for any ε > 0, k ∈ ℕ, ∆ > 0 and any prime p ≥ c₀ · max{e^ε, 1/ε}/∆, we have that symmetric PI-RAPPOR with parameter α₀ = ⌈p/(e^ε + 1)⌉/p satisfies deletion ε-DP and outputs an estimate that satisfies: for all j ∈ [k], Var[c̃_j] ≤ n(1 + ∆) · e^ε/(e^ε − 1)². Further, PI-RAPPOR with α₀ = ⌈p/(e^ε + 1)⌉/p and α₁ = 1/2 satisfies replacement ε-DP and outputs an estimate that satisfies: for all j ∈ [k], Var[c̃_j] ≤ c(S)_j + 4n(1 + ∆) · e^ε/(e^ε − 1)².

Proof. We first note that, by our definition, α₀p = ⌈p/(e^ε + 1)⌉ and is therefore an integer (as required by PI-RAPPOR). We denote ε′ = ln(1/α₀ − 1) (so that α₀ = 1/(e^{ε′} + 1)) and note that ε′ ≤ ε. Thus the symmetric PI-RAPPOR satisfies ε-DP. We now note that |1/(e^{ε′} + 1) − 1/(e^ε + 1)| ≤ 1/p. This implies that the variance of PI-RAPPOR satisfies:

Var[c̃_j] = n · α₀(1 − α₀)/(1 − 2α₀)² = n · (1/(e^{ε′} + 1)) · (1 − 1/(e^{ε′} + 1)) / (1 − 2/(e^{ε′} + 1))² ≤ n · (1/(e^ε + 1) + 1/p) · (1 − 1/(e^ε + 1)) / (1 − 2/(e^ε + 1) − 2/p)².

If ε ≤ 1 then 1/(e^{ε′} + 1) ≥ 1/(e + 1) and 1 − 2/(e^{ε′} + 1) ≥ ε′/(e + 1). Thus the addition/subtraction of 1/p to these quantities for p ≥ c₀/(ε∆) increases the bound by at most a multiplicative factor of (1 + ∆) (for a sufficiently large constant c₀). Otherwise (if ε > 1), 1/(e^{ε′} + 1) ≥ 1/(2e^ε) and 1 − 2/(e^{ε′} + 1) ≥ (e − 1)/(e + 1). Thus the addition/subtraction of 1/p to these quantities for p ≥ c₀ · e^ε/∆ increases the bound by at most a multiplicative factor of (1 + ∆) (for a sufficiently large constant c₀).

The analysis for replacement DP is analogous.

In practice, setting ∆ = 1/100 will make the loss of accuracy insignificant. Thus we can conclude that PI-RAPPOR with p ≥ c₀ · max{k, e^ε, 1/ε} for a sufficiently large constant c₀ achieves essentially the same guarantees as RAPPOR. This means that the communication cost of PI-RAPPOR is 2 log₂(max{k, e^ε, 1/ε}) + O(1). Also, we are typically interested in compression when k ≫ max{e^ε, 1/ε}, and in that case the communication cost is 2 log₂(k) + O(1).

5 Mean Estimation
In this section, we consider the problem of mean estimation in the ℓ₂ norm, for ℓ₂-norm-bounded vectors. Formally, each client has a vector x_i ∈ 𝔹₂^d, where 𝔹₂^d := {x ∈ ℝ^d | ‖x‖₂ ≤ 1}. Our goal is to compute the mean of these vectors privately, and we measure the error in the ℓ₂ norm. In the literature this problem is often studied in the statistical setting where the x_i's are sampled i.i.d. from some distribution supported on 𝔹₂^d and the goal is to estimate the mean of this distribution. In this setting, the expected squared ℓ₂ distance between the mean of the distribution and the mean of the samples is at most 1/n and is dominated by the privacy error in the regime that we are interested in (ε < d).

In the absence of communication constraints and for ε < d, the optimal ε-LDP protocols for this problem achieve an expected squared ℓ₂ error of Θ(d/(n · min(ε, ε²))) [DJW18; DR19]. Here and in the rest of the section we focus on replacement DP, both for consistency with existing work and since for this problem the dependence on ε is linear (when 1 < ε < d), and thus the difference between replacement and deletion is less important.

If one is willing to relax to (ε, δ)- or concentrated differential privacy [DR16; BS16; Mir17] guarantees, then standard Gaussian noise addition achieves the asymptotically optimal bound. When ε ≤ 1, the randomizer of Duchi et al. [DJW18] (which we refer to as PrivHS) also achieves the optimal O(d/(nε²)) bound. Recent work of Erlingsson et al. [EFMRSTT20] gives a low-communication version of PrivHS. Specifically, in the context of federated optimization they show that PrivHS is equivalent to sending a single bit and a randomly and uniformly generated unit vector. This vector can be sent using a seed to a PRG. Bhowmick et al. [BDFKR19] describe the PrivUnit algorithm that achieves the optimal bound also when ε > 1. Unfortunately, PrivUnit has a high communication cost of Ω(d).

By applying Theorem 3.5 to PrivUnit or Gaussian noise addition, we can immediately obtain a low-communication algorithm with negligible effect on privacy and utility. This gives us an algorithm that communicates a single seed and has the asymptotically optimal privacy-utility trade-off. Implementing PrivUnit requires sampling uniformly from a spherical cap {v | ‖v‖₂ = 1, ⟨x̃, v⟩ ≥ α} for α ≈ √(ε/d). Using standard techniques this can be done with high accuracy using Õ(d) random bits and Õ(d) time. Further, for every x the resulting densities can be computed easily given the surface area of the cap. Overall, each rejection sampling trial can be computed in Õ(d) time, and thus this approach to compression requires time Õ(e^ε · d). This implies that given an exponentially strong PRG G, we can compress PrivUnit to O(log(dn) + ε) bits with negligible effects on utility and privacy. In most settings of interest, the computational cost Õ(e^ε · d) is not much larger than the typical cost of computing the vector itself, e.g. by back propagation in the case of gradients of neural networks (e.g. ε = 8 requires ≈ e⁸ ≈ 3000 expected sampling trials). For larger values of ε, however, this cost can be significant. We therefore describe a simple reduction from the case ε > 1 to a randomizer with privacy parameter ε′ = ε/m that preserves asymptotic optimality, where m ≤ ε is an integer. The algorithm simply runs m copies of the ε′-DP randomizer and sends all the reports. The estimates produced from these reports are averaged by the server. This reduces the expected number of rejection sampling trials to m·e^{ε/m}. Below we describe the reduction and state the resulting guarantees.

Lemma 5.1.
Assume that for some ε > 0 there exists a local ε-DP randomizer R_ε : 𝔹₂^d → Y and a decoding procedure decode : Y → ℝ^d that for all x ∈ 𝔹₂^d satisfies E[decode(R_ε(x))] = x and E[‖decode(R_ε(x)) − x‖₂²] ≤ α_ε. Further assume that R_ε uses ℓ bits of communication and runs in time T. Then for every integer m ≥ 1 there is a local (mε)-DP randomizer R_{mε} : 𝔹₂^d → Y^m and a decoding procedure decode_m : Y^m → ℝ^d that uses mℓ bits of communication, runs in time mT, and for every x ∈ 𝔹₂^d satisfies E[decode_m(R_{mε}(x))] = x and E[‖decode_m(R_{mε}(x)) − x‖₂²] ≤ α_ε/m.

In particular, if for every ε ∈ (1/2, 1], α_ε ≤ cd/ε² for some constant c, then for every ε > 0 there is a local ε-DP randomizer R′_ε and a decoding procedure decode′ that uses ⌈ε⌉ℓ bits of communication, runs in time ⌈ε⌉T, and for every x ∈ 𝔹₂^d satisfies E[decode′(R′_ε(x))] = x and E[‖decode′(R′_ε(x)) − x‖₂²] ≤ 2cd/min{ε, ε²}.

Proof. The randomizer R_{mε}(x) runs R_ε(x) m times independently to obtain y₁, …, y_m and outputs these values. To decode we define decode_m(y₁, …, y_m) := (1/m)(decode(y₁) + ··· + decode(y_m)). By (simple) composition of differential privacy, R_{mε} is (εm)-DP. The utility claim follows directly from linearity of expectation and independence of the estimates:

E[‖decode_m(R_{mε}(x)) − x‖₂²] = (1/m) · E[‖decode(R_ε(x)) − x‖₂²] ≤ α_ε/m.

For the second part of the claim we define R′_ε as follows. For ε ≤ 1, R′_ε(x) just outputs R_ε(x), and in this case decode′ is the same as decode. For ε > 1, we let m = ⌈ε⌉ and apply the lemma to R_{ε′} for ε′ = ε/⌈ε⌉. Note that ε′ ∈ (1/2, 1], and therefore the resulting bound on the variance is

E[‖decode′(R′_ε(x)) − x‖₂²] ≤ cd/(⌈ε⌉ · ε′²) = cd/(ε · ε′) ≤ 2cd/ε.

For example, by using the reduction in Lemma 5.1 we can reduce the computational cost to Õ(⌈ε⌉d) while increasing the communication to O(⌈ε⌉ log d). The server-side reconstruction now requires sampling and averaging n⌈ε⌉ d-dimensional vectors; thus the server running time is Õ(ndε).

This reduction allows one to achieve different trade-offs between computation, communication, and closeness to the accuracy of the original randomizer. As an additional benefit, we no longer need an LDP randomizer that is optimal in the ε > 1 regime: we can use m = ⌈ε⌉ and get an asymptotically optimal algorithm for ε > 1 from a randomizer for ε′ ∈ (1/2, 1]. Thus instead of PrivUnit we can use the low-communication version of PrivHS from [EFMRSTT20]. This bypasses the need for our compression algorithm and makes the privacy guarantees unconditional.
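The reduction of Lemma 5.1 is simple enough to sketch in code. Below is a minimal illustration with illustrative names: the base randomizer here is a toy unbiased perturbation standing in for an actual ε′-DP randomizer such as PrivUnit or PrivHS, and the decoder averages the per-report unbiased estimates, which reduces the expected squared error by a factor of m.

```python
import random

def split_randomize(base_rand, m, x, rnd):
    # Lemma 5.1 reduction (sketch): run m independent copies of an
    # (eps/m)-DP base randomizer on x and send all m reports.
    return [base_rand(x, rnd) for _ in range(m)]

def split_decode(base_decode, reports):
    # Average the per-report unbiased estimates; by independence and
    # linearity the expected squared error drops by a factor of m.
    ests = [base_decode(y) for y in reports]
    m, d = len(ests), len(ests[0])
    return [sum(e[i] for e in ests) / m for i in range(d)]

# Stand-in "randomizer": an unbiased noisy report. This toy is NOT
# differentially private; a real instantiation would use an eps'-DP
# randomizer such as PrivUnit or PrivHS.
def toy_rand(x, rnd):
    return [xi + rnd.gauss(0.0, 1.0) for xi in x]

def toy_decode(y):
    return y

def mean_sq_err(m, trials, rnd):
    x = [1.0] + [0.0] * 9
    total = 0.0
    for _ in range(trials):
        est = split_decode(toy_decode, split_randomize(toy_rand, m, x, rnd))
        total += sum((e - xi) ** 2 for e, xi in zip(est, x))
    return total / trials

rnd = random.Random(0)
e1 = mean_sq_err(1, 2000, rnd)  # one report with the full budget
e4 = mean_sq_err(4, 2000, rnd)  # four reports, budget split four ways
print(round(e1 / e4, 1))  # the ratio is close to 4
```

Running the comparison with m = 1 versus m = 4 shows the empirical squared error dropping by roughly a factor of 4, matching the α_ε/m bound in the lemma.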
Remark 5.2. We remark that the compression of PrivUnit can be easily made unconditional. The reference distribution ρ of PrivUnit is uniform over a sphere of some radius B(d, ε) = O(√(d/min(ε, ε²))). It is not hard to see that both the privacy and utility guarantees of PrivUnit are preserved by any PRG G which preserves Pr_{v∼ρ}[⟨x, v⟩ ≥ θ] for every vector x sufficiently well (up to some 1/poly(d, n, e^ε/ε) accuracy). Note that these tests are halfspaces and have VC dimension d. Therefore, by the standard ε-net argument, a random sample of size O(dB(d, ε)/γ) from the reference distribution will, with high probability, give a set of points S that γ-fools the tests (for any γ > 0). By choosing γ = 1/poly(d, n, e^ε/ε) we can ensure that the effect on privacy and accuracy is negligible (relative to the error introduced due to privacy). Thus one can compress the communication to log₂(|S|) = O(log(dn/ε) + ε) bits unconditionally (with negligible effect on accuracy and privacy).

While
PrivUnit, SQKR and the repeated version of PrivHS (using Lemma 5.1) are asymptotically optimal, the accuracy they achieve in practice may be different. Therefore we empirically compare these algorithms. In our first comparison we consider four algorithms. The PrivHS algorithm outputs a vector whose norm is fully determined by the parameters d and ε: the output vector has norm

B(d, ε) = ((e^ε + 1)/(e^ε − 1)) · (√π/2) · d · Γ((d − 1)/2 + 1)/Γ(d/2 + 1).

The expected squared error is then easily seen to lie between (B² − 1)/n and B²/n when averaging over n samples. For the large-dimensional settings of interest B² ≫ 1, and we use B²/n as a proxy. As an example, for d = 2000 and ε = 8 the proxy is accurate to within ≈ 0.5% (and the proxy is even more accurate when d is significantly larger, which is the typical setting where compression is needed). For SQKR, we use the implementation provided by the authors at [Kas] (specifically, the second version of the code, which optimizes some of the parameters). We show error bars for the empirical squared error based on 20 trials.

The PrivUnit algorithm internally splits its privacy budget ε into two parts ε₁ and ε₂ = ε − ε₁. As in the case of PrivHS, the output of PrivUnit (for fixed d, ε₁, ε₂) has a fixed squared norm, which is the proxy we use for variance. We first consider the default split used in the experiments in [BDFKR19] and refer to it as PrivUnit. In addition, we optimize the splitting so as to minimize the variance proxy by evaluating the expression for the variance proxy as a function of θ = ε₁/ε, for 101 values of θ = 0.00, 0.01, 0.02, …, 0.99, 1.00. We call this algorithm PrivUnitOptimized. Note that since we are optimizing θ to minimize the norm proxy, this optimization is data-independent and need only be done once for a fixed ε. For both variants of PrivUnit, we use the norm proxy in our evaluation; as discussed above, in high-dimensional settings of interest the proxy is nearly exact.

Figure 1: (Left) Expected squared ℓ₂ error of the mechanisms PrivHS, PrivUnit, PrivUnitOptimized and SQKR for values of ε between 1 and 8. (Right) Expected squared ℓ₂ error of the mechanisms PrivHS, PrivUnit and PrivUnitOptimized for a total ε = 8, as a function of the number of repetitions of the mechanism with a proportionately smaller ε. The SQKR v2 line is for a single run with ε = 8 without splitting. Both plots use n = 10,000 and d = 1,000; error bars for SQKR are computed based on 10 trials.

Figure 1 (Left) compares the expected squared error of these algorithms for d = 1,000, n = 10,
000 and ε taking integer values from 1 to 8. These plots show that both PrivUnit and PrivUnitOptimized are more accurate than PrivHS and SQKR in the whole range of parameters. While PrivHS is competitive for small ε, it does not improve with ε for large ε. SQKR consistently has about 5× higher expected squared error than PrivUnitOptimized and over 2× higher error compared to PrivUnit. Thus in the large-ε regime, the ability to compress PrivUnitOptimized gives a 5× improvement in error compared to previous compressed algorithms. We also observe that PrivUnitOptimized is noticeably better than PrivUnit. Our technique being completely general, it will apply losslessly to any better local randomizers that may be discovered in the future.

As discussed earlier, one way to reduce the computational cost of compressed PrivUnitOptimized is to use Lemma 5.1. For instance, instead of running PrivUnitOptimized with ε = 8, we may run it twice with ε = 4 and average the results on the server. Asymptotically this gives the same expected squared error, and we empirically evaluate the effect of such splitting on the expected error. Figure 1 (Right) shows the results for PrivHS, PrivUnit and PrivUnitOptimized. We plot the single-repetition version of SQKR for comparison. The SQKR algorithm does not get more efficient for smaller ε, and thus splitting it makes it worse in every aspect. As its error grows quickly with splitting, we do not plot the split version of SQKR in these plots. The results demonstrate that splitting does have some cost in terms of expected squared error: going from ε = 8 to two runs of ε = 4 costs us about 2× in expected squared error, and the error continues to increase as we split more. These results can inform picking an appropriate point on the computation cost-error trade-off, and they suggest that for ε around 8 the choice in most cases will be between not splitting and splitting into two mechanisms. Note that even with two or three repetitions, PrivUnitOptimized has at least 2× smaller error compared to PrivHS and SQKR. For
PrivHS, the sweet spot seems to be splitting into multiple mechanisms each with a relatively small ε.

References

[ACGMMTZ16] M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. "Deep Learning with Differential Privacy". In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS). 2016, pp. 308–318.

[AGLTV17] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. "QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding". In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017, pp. 1709–1720. url: https://proceedings.neurips.cc/paper/2017/file/6c340f25839e6acdc73414517203f5f0-Paper.pdf.

[App17] Apple's Differential Privacy Team. "Learning with Privacy at Scale". In: Apple Machine Learning Journal (2017).

arXiv preprint arXiv:1905.11888 (2019).

[ASYKM18] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan. "cpSGD: Communication-efficient and differentially-private distributed SGD". In: Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018, pp. 7564–7575. url: https://proceedings.neurips.cc/paper/2018/file/21ce689121e39821d07d04faab328370-Paper.pdf.

[ASZ19] J. Acharya, Z. Sun, and H. Zhang. "Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication". In: Proceedings of Machine Learning Research. Vol. 89. PMLR, 2019, pp. 1120–1129.

[BBGN19] B. Balle, J. Bell, A. Gascón, and K. Nissim. "The Privacy Blanket of the Shuffle Model". In: Advances in Cryptology – CRYPTO 2019. Cham: Springer International Publishing, 2019, pp. 638–667.

[BDFKR19] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers. Protection Against Reconstruction and Its Applications in Private Federated Learning. 2019. arXiv preprint.

[BEMMRLRKTS17] A. Bittau, U. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnes, and B. Seefeld. "Prochlo: Strong Privacy for Analytics in the Crowd". In: Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). 2017, pp. 441–459.

[BNS19] M. Bun, J. Nelson, and U. Stemmer. "Heavy hitters and the structure of local privacy". In: ACM Transactions on Algorithms (TALG).

[BNST20] R. Bassily, K. Nissim, U. Stemmer, and A. Thakurta. "Practical Locally Private Heavy Hitters". In: Journal of Machine Learning Research (2020).

[BS15] R. Bassily and A. Smith. "Local, Private, Efficient Protocols for Succinct Histograms". In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. 2015, pp. 127–135.

[BS16] M. Bun and T. Steinke. "Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds". In: Theory of Cryptography (TCC 2016). Berlin, Heidelberg: Springer-Verlag, 2016, pp. 635–658. url: https://doi.org/10.1007/978-3-662-53641-4_24.

[BST14] R. Bassily, A. Smith, and A. Thakurta. "Private Empirical Risk Minimization, Revisited". In: CoRR abs/1405.7085 (2014). url: http://arxiv.org/abs/1405.7085.

[CKÖ20] W.-N. Chen, P. Kairouz, and A. Özgür. "Breaking the Communication-Privacy-Accuracy Trilemma". In: arXiv preprint arXiv:2007.11707 (2020).

[CSUZZ19] A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. "Distributed Differential Privacy via Shuffling". In: Advances in Cryptology – EUROCRYPT 2019. Cham: Springer International Publishing, 2019, pp. 375–403.

[DJW18] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. "Minimax optimal procedures for locally private estimation". In: Journal of the American Statistical Association (2018).

[DKY17] B. Ding, J. Kulkarni, and S. Yekhanin. "Collecting Telemetry Data Privately". In: Advances in Neural Information Processing Systems. 2017, pp. 3574–3583.

[DMNS06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. "Calibrating noise to sensitivity in private data analysis". In: TCC. 2006, pp. 265–284.

[DR14] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9.3-4. 2014, pp. 211–407. url: http://dx.doi.org/10.1561/0400000042.

[DR16] C. Dwork and G. N. Rothblum. "Concentrated Differential Privacy". In: arXiv preprint arXiv:1603.01887 (2016).

[DR19] J. Duchi and R. Rogers. "Lower Bounds for Locally Private Estimation via Communication Complexity". In: Proceedings of the Thirty-Second Conference on Learning Theory. Vol. 99. Proceedings of Machine Learning Research. PMLR, 2019, pp. 1161–1191. url: http://proceedings.mlr.press/v99/duchi19a.html.

[EFMRSTT20] U. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, S. Song, K. Talwar, and A. Thakurta. "Encode, Shuffle, Analyze Privacy Revisited: Formalizations and Empirical Evaluation". 2020. arXiv preprint.

[EFMRTT19] U. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. "Amplification by Shuffling: From Local to Central Differential Privacy via Anonymity". In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '19). 2019, pp. 2468–2479.

[EGS03] A. V. Evfimievski, J. Gehrke, and R. Srikant. "Limiting privacy breaches in privacy preserving data mining". In: PODS. 2003, pp. 211–222.

[EPK14] Ú. Erlingsson, V. Pihur, and A. Korolova. "RAPPOR: Randomized aggregatable privacy-preserving ordinal response". In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 2014, pp. 1054–1067.

[FGV15] V. Feldman, C. Guzman, and S. Vempala. "Statistical Query Algorithms for Mean Vector Estimation and Stochastic Convex Optimization". In: CoRR abs/1512.09170 (2015). Extended abstract in SODA 2017. url: http://arxiv.org/abs/1512.09170.

[FMT20] V. Feldman, A. McMillan, and K. Talwar. "Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling". In: CoRR abs/2012.12803 (2020). url: https://arxiv.org/abs/2012.12803.

[FTMARRK20] F. Faghri, I. Tabrizian, I. Markov, D. Alistarh, D. M. Roy, and A. Ramezani-Kebrya. "Adaptive Gradient Quantization for Data-Parallel SGD". In: Advances in Neural Information Processing Systems. Vol. 33. 2020.

[GDDKS20] A. M. Girgis, D. Data, S. Diggavi, P. Kairouz, and A. T. Suresh. Shuffled Model of Federated Learning: Privacy, Communication and Accuracy Trade-offs. 2020. arXiv preprint.

[GKMM19] V. Gandikota, D. Kane, R. K. Maity, and A. Mazumdar. "vqSGD: Vector quantized stochastic gradient descent". In: arXiv preprint arXiv:1911.07971 (2019).

[HKR12] J. Hsu, S. Khanna, and A. Roth. "Distributed private heavy hitters". In: International Colloquium on Automata, Languages, and Programming. Springer, 2012, pp. 461–472.

[Kai+19] P. Kairouz et al. Advances and Open Problems in Federated Learning. 2019. arXiv preprint.

[Kas] An implementation of the Kashin-based mean estimation scheme. https://github.com/WeiNingChen/Kashin-mean-estimation. Accessed: 2021-02-17.

[KBR16] P. Kairouz, K. Bonawitz, and D. Ramage. "Discrete distribution estimation under local privacy". In: arXiv preprint arXiv:1602.07387 (2016).

[LLW06] M. Luby, M. G. Luby, and A. Wigderson. Pairwise Independence and Derandomization. Vol. 4. Now Publishers Inc, 2006.

[LV10] Y. Lyubarskii and R. Vershynin. "Uncertainty principles and vector quantization". In: IEEE Transactions on Information Theory (2010).

[Mir17] I. Mironov. "Rényi Differential Privacy". In: IEEE 30th Computer Security Foundations Symposium (CSF). 2017, pp. 263–275.

[MMPRTV10] A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. Vadhan. "The Limits of Two-Party Differential Privacy". In: FOCS. 2010, pp. 81–90.

[MPRV09] I. Mironov, O. Pandey, O. Reingold, and S. Vadhan. "Computational Differential Privacy". In: Advances in Cryptology – CRYPTO 2009. Berlin, Heidelberg: Springer, 2009, pp. 126–142.

[MRTZ18] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. "Learning Differentially Private Recurrent Language Models". In: ICLR. 2018.

[MS06] N. Mishra and M. Sandler. "Privacy via Pseudorandom Sketches". In: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '06). 2006, pp. 143–152. url: https://doi.org/10.1145/1142351.1142373.

[MT20] P. Mayekar and H. Tyagi. "Limits on Gradient Compression for Stochastic Optimization". In: IEEE International Symposium on Information Theory (ISIT). 2020, pp. 2658–2663.

[MVOV18] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 2018.

[Nis92] N. Nisan. "Pseudorandom generators for space-bounded computation". In: Combinatorica (1992).

[SYKM17] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. "Distributed Mean Estimation with Limited Communication". In: Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 3329–3337. url: http://proceedings.mlr.press/v70/suresh17a.html.

[War65] S. L. Warner. "Randomized response: A survey technique for eliminating evasive answer bias". In: Journal of the American Statistical Association (1965).

[WBLJ17] T. Wang, J. Blocki, N. Li, and S. Jha. "Locally Differentially Private Protocols for Frequency Estimation". In: 26th USENIX Security Symposium. Vancouver, BC: USENIX Association, Aug. 2017, pp. 729–745.

[WHWNXYLQ16] S. Wang, L. Huang, P. Wang, Y. Nie, H. Xu, W. Yang, X.-Y. Li, and C. Qiao. "Mutual information optimally local private discrete distribution estimation". In: arXiv preprint arXiv:1607.08025 (2016).

[YB18] M. Ye and A. Barg. "Optimal schemes for discrete distribution estimation under locally differential privacy". In: IEEE Transactions on Information Theory (2018).
A A generalization of PI-RAPPOR
In the more general version of PI-RAPPOR we let q be any prime power and assume that α₀q is an integer. We will rely on some arbitrary order on the elements of F_q and denote the α₀q smallest elements by F₁. We denote the indicator function of the event z ∈ F₁ by bool(z). As before, for a randomly and uniformly chosen element z ∈ F_q we have that bool(z) is distributed as a Bernoulli random variable with bias α₀. For efficiency, the order needs to be chosen in a way that allows computing bool(z) and generating a random element of the field in some range in O(log q) time.

We will associate each index j ∈ [k] with a distinct non-zero element z(j) ∈ F_q^d, where d := ⌈log_q(k + 1)⌉ (in particular, q^{d−1} < k + 1 ≤ q^d). We can describe an affine function φ over F_q^d using the vector of its d + 1 coefficients φ₀, …, φ_d, and for z ∈ F_q^d we define φ(z) = φ₀ + Σ_{u∈[d]} z_u φ_u, where addition and multiplication are in the field F_q. For brevity we will also write φ(j) := φ(z(j)). Each such function encodes a vector in F_q^k as φ([k]) := (φ(1), φ(2), …, φ(k)). The family of functions defined by all (d + 1)-tuples of coefficients is Φ := {φ | φ ∈ F_q^{d+1}}. For a randomly chosen function from this family, the values of the function on two distinct non-zero elements are uniformly distributed and pairwise independent: for any j₁ ≠ j₂ ∈ [k] and a₁, a₂ ∈ F_q we have that

Pr_{φ∼Φ}[φ(j₁) = a₁ and φ(j₂) = a₂] = Pr_{φ∼Φ}[φ(j₁) = a₁] · Pr_{φ∼Φ}[φ(j₂) = a₂] = 1/q².

For every index j ∈ [k] and bit b ∈ {0, 1} we denote the set of functions φ whose encoding has bit b in position j by Φ_{j,b}:

Φ_{j,b} := {φ ∈ Φ | bool(φ(j)) = b}.   (3)

The generalization of the PI-RAPPOR randomizer is described below.

Algorithm 5 General PI-RAPPOR randomizer
Input: an index j ∈ [k], 0 < α₀ < α₁ < 1, and a prime power q s.t. α₀q ∈ ℕ
d = ⌈log_q(k + 1)⌉
Sample Bernoulli b with bias α₁
Sample φ randomly and uniformly from Φ_{j,b} defined in eq. (3)
Send φ

The server side of the frequency estimation can be done exactly as before. We can convert each φ to bool(φ(j)) and then aggregate the results. In addition, we describe an alternative algorithm that runs in time Õ(k|Φ| + n). This algorithm is faster than direct computation when |Φ| < n. In this case we can first count the number of times each φ is used and then decode each φ only once. This approach is based on an idea in [BNST20], which also relies on pairwise independence to upper bound the total number of encodings. We note that in the simpler version of PI-RAPPOR |Φ| ≥ (k + 1)², whereas in the generalized version |Φ| can be as low as roughly (k + 1) · (e^ε + 1).

Algorithm 6
Private histograms with PI-RAPPOR
Input: 0 < α₀ < α₁ < 1, and a prime power q
Receive φ₁, …, φ_n from the n users.
for φ ∈ Φ do n_φ = 0
for i ∈ [n] do n_{φ_i} += 1
sum = (−nα₀, …, −nα₀)
for φ ∈ Φ do sum += n_φ · bool(φ([k]))
c̃ = (1/(α₁ − α₀)) · sum
Return c̃

It is easy to see that pairwise independence implies that the privacy and utility guarantees of the generalized PI-RAPPOR are the same as for the simpler version we described before. The primary difference is in the computational costs.
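To illustrate the counting idea behind Alg. 6, here is a minimal sketch for the special case where q is a prime and d = 1, so a report is a pair (φ₀, φ₁) and φ(j) = φ₀ + j·φ₁ mod q (names and parameters are illustrative). It counts how many users sent each distinct φ and decodes every distinct φ only once, which matches direct per-report decoding and is faster when |Φ| ≪ n:

```python
import random
from collections import Counter

def bool_enc(z, s):
    # bool(z): indicator that z is among the s = alpha0*q "smallest" field elements.
    return 1 if z < s else 0

def count_based_sums(reports, q, k, s):
    # Alg. 6 sketch: count how many users sent each distinct phi, then
    # decode every distinct phi only once (fast when |Phi| << n).
    counts = Counter(reports)  # n_phi for each distinct report
    sums = [0] * (k + 1)
    for (p0, p1), n_phi in counts.items():
        for j in range(1, k + 1):  # decode this phi for all k indices at once
            sums[j] += n_phi * bool_enc((p0 + j * p1) % q, s)
    return sums

def naive_sums(reports, q, k, s):
    # Direct per-report decoding, for comparison.
    sums = [0] * (k + 1)
    for (p0, p1) in reports:
        for j in range(1, k + 1):
            sums[j] += bool_enc((p0 + j * p1) % q, s)
    return sums

rnd = random.Random(1)
q, k, s, n = 11, 5, 3, 200  # toy parameters; s plays the role of alpha0*q
reports = [(rnd.randrange(q), rnd.randrange(q)) for _ in range(n)]
assert count_based_sums(reports, q, k, s) == naive_sums(reports, q, k, s)
```

The debiased estimate is then obtained from these sums as c̃_j = (sum_j − nα₀)/(α₁ − α₀), exactly as in the analysis of the simple PI-RAPPOR.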
Lemma A.1.
The general PI-RAPPOR randomizer (Alg. 5) can be implemented in Õ(log k) time and uses ⌈log|Φ|⌉ ≤ log k + 2 log q + 1 bits of communication.

Proof. The running time of PI-RAPPOR is dominated by the time to pick a random and uniform element of Φ_{j,b}. This can be done by picking φ₁, …, φ_d ∈ F_q randomly and uniformly. We then need to pick φ₀ randomly and uniformly from the set {φ₀ | bool(φ₀ + Σ_{i∈[d]} z(j)_i φ_i) = b}. Given the result of the inner product Σ_{i∈[d]} z(j)_i φ_i, this can be done in O(log q) time as explained in the proof of Lemma 4.5. The computation of the inner product can be done in time d · Õ(log q) = Õ(log k).

The analysis of the running time of the aggregation algorithm (Alg. 6) follows from the discussion above.

Lemma A.2. The server side of the histogram construction for generalized PI-RAPPOR (Alg. 6) can be done in time Õ(n + k|Φ| log k).

The running time depends on |Φ| ≤ q²(k + 1) and thus also depends on the choice of q. The effect of the choice of q on the utility of Algorithm 6 is the same as the effect of the choice of p on the utility of the simple PI-RAPPOR (given in Lemma 4.7). Thus we can assume that q = O(max{e^ε, 1/ε}).