Local Differential Privacy Is Equivalent to Contraction of E_γ-Divergence
Shahab Asoodeh†, Maryam Aliakbarpour∗, and Flavio P. Calmon†
†Harvard University, ∗University of Massachusetts Amherst
Abstract
We investigate the local differential privacy (LDP) guarantees of a randomized privacy mechanism via its contraction properties. We first show that LDP constraints can be equivalently cast in terms of the contraction coefficient of the E_γ-divergence. We then use this equivalent formulation to express the LDP guarantees of privacy mechanisms in terms of contraction coefficients of arbitrary f-divergences. When combined with standard estimation-theoretic tools (such as Le Cam's and Fano's converse methods), this result allows us to study the trade-off between privacy and utility in several hypothesis testing, minimax estimation, and Bayesian estimation problems.

I. INTRODUCTION
A major challenge in modern machine learning applications is balancing statistical efficiency with the privacy of individuals from whom data is obtained. In such applications, privacy is often quantified in terms of differential privacy (DP) [1]. DP has several variants, including approximate DP [2], Rényi DP [3], and others [4–7]. Arguably, the most stringent flavor of DP is local differential privacy (LDP) [8, 9]. Intuitively, a randomized mechanism (or a Markov kernel) is said to be locally differentially private if its output does not vary significantly with arbitrary perturbations of the input. More precisely, a mechanism is said to be ε-LDP (or pure LDP) if the privacy loss random variable, defined as the log-likelihood ratio of the output for any two different inputs, is smaller than ε with probability one. One can also consider an approximate variant of this constraint: K is said to be (ε, δ)-LDP if the privacy loss random variable does not exceed ε with probability at least 1 − δ (see Def. 1 for the formal definition).

The study of statistical efficiency under LDP constraints has gained considerable traction, e.g., [8–18]. Almost all of these works consider ε-LDP and provide meaningful bounds only for sufficiently small values of ε (i.e., the high-privacy regime). For instance, Duchi et al. [10] studied minimax estimation problems under ε-LDP constraints and showed that for ε ≤ 1, the price of privacy is a reduction of the effective sample size from n to ε²n. A slightly improved version of this result appeared in [13, 19]. More recently, Duchi and Rogers [20] developed a framework based on the strong data processing inequality (SDPI) [21] and derived lower bounds for minimax estimation risk under ε-LDP that hold for any ε ≥ 0.

In this work, we develop an SDPI-based framework for studying hypothesis testing and estimation problems under (ε, δ)-LDP, extending the results of [20] to approximate LDP.
In particular, we derive bounds for both the minimax and Bayesian estimation risks that hold for any ε ≥ 0 and δ ≥ 0. Interestingly, when setting δ = 0, our bounds can be slightly stronger than [10].

Our main mathematical tool is an equivalent expression for LDP in terms of the E_γ-divergence. Given γ ≥ 1, the E_γ-divergence between two distributions P and Q is defined as

E_γ(P‖Q) := (1/2) ∫ |dP − γ dQ| − (1/2)(γ − 1).   (1)

We show that a mechanism K is (ε, δ)-LDP if and only if

E_γ(PK‖QK) ≤ δ E_γ(P‖Q)

for γ = e^ε and any pair of distributions (P, Q), where PK denotes the output distribution of K when the input distribution is P. Thus, the approximate LDP guarantee of a mechanism is fully characterized by its contraction under E_γ-divergence. When combined with standard statistical techniques, including Le Cam's and Fano's methods [22, 23], E_γ-contraction leads to general lower bounds for the minimax and Bayesian risks under (ε, δ)-LDP for any ε ≥ 0 and δ ∈ [0, 1]. In particular, we show that the price of privacy in this case is a reduction of the effective sample size from n to n[1 − e^{−ε}(1 − δ)]².

There exist several results connecting pure LDP to the contraction properties of the KL divergence D_KL and the total variation distance TV. For instance, for any ε-LDP mechanism K, it is shown in [10, Theorem 1] that D_KL(PK‖QK) ≤ 4(e^ε − 1)² TV²(P, Q), and in [13, Theorem 6] that TV(PK, QK) ≤ ((e^ε − 1)/(e^ε + 1)) TV(P, Q), for any pair (P, Q). Inspired by these results, we further show that if K is (ε, δ)-LDP then

D_f(PK‖QK) ≤ [1 − e^{−ε}(1 − δ)] D_f(P‖Q)

for any f-divergence D_f and any pair (P, Q).

Notation.
For a random variable X, we write P_X and 𝒳 for its distribution (i.e., X ∼ P_X) and its alphabet, respectively. For any set A, we denote by P(A) the set of all probability distributions on A. Given two sets 𝒳 and 𝒵, a Markov kernel (i.e., channel) K is a mapping from 𝒳 to P(𝒵) given by x ↦ K(·|x). Given P ∈ P(𝒳) and a Markov kernel K : 𝒳 → P(𝒵), we let PK denote the output distribution of K when the input distribution is P, i.e., PK(·) = ∫ K(·|x) P(dx). Also, we use BSC(ω) to denote the binary symmetric channel with crossover probability ω. For sequences {a_n} and {b_n}, we write a_n ≳ b_n to indicate a_n ≥ C b_n for some universal constant C.

II. PRELIMINARIES

A. f-Divergences

Given a convex function f : (0, ∞) → ℝ such that f(1) = 0, the f-divergence between two probability measures P ≪ Q is defined as [24, 25]

D_f(P‖Q) := E_Q[f(dP/dQ)].   (2)

Due to the convexity of f, we have D_f(P‖Q) ≥ f(1) = 0. If, furthermore, f is strictly convex at 1, then equality holds if and only if P = Q. Popular examples of f-divergences include f(t) = t log t, corresponding to the KL divergence; f(t) = (1/2)|t − 1|, corresponding to the total variation distance; and f(t) = (t − 1)², corresponding to the χ²-divergence. In this paper, we are mostly concerned with an important sub-family of f-divergences associated with f_γ(t) = max{t − γ, 0} for a parameter γ ≥ 1. The corresponding f-divergence, denoted by E_γ(P‖Q), is called the E_γ-divergence (or sometimes hockey-stick divergence [26]) and is explicitly defined in (1). It appeared in [27] for proving channel coding converse results and was also used in [7, 28–30] for characterizing privacy guarantees of iterative algorithms in terms of other variants of DP.

B. Contraction Coefficient
All f-divergences satisfy the data processing inequality, i.e., D_f(PK‖QK) ≤ D_f(P‖Q) for any pair of probability distributions (P, Q) and any Markov kernel K [24]. However, in many cases this inequality is strict. The contraction coefficient of a Markov kernel K under the f-divergence D_f, denoted η_f(K), is the smallest number η such that D_f(PK‖QK) ≤ η D_f(P‖Q) for any pair of probability distributions (P, Q). Formally, η_f(K) is defined as

η_f(K) := sup_{P,Q ∈ P(𝒳): D_f(P‖Q) ≠ 0} D_f(PK‖QK) / D_f(P‖Q).   (3)

Contraction coefficients have been studied for several f-divergences; e.g., η_TV for the total variation distance was studied in [31–33], η_KL for the KL divergence in [34–39], and η_{χ²} for the χ²-divergence in [33, 39, 40]. In particular, Dobrushin [31] showed that η_TV has a remarkably simple two-point characterization:

η_TV(K) = sup_{x₁,x₂ ∈ 𝒳} TV(K(·|x₁), K(·|x₂)).

Similarly, one can plug the E_γ-divergence into (3) and define the contraction coefficient η_γ(K) of a Markov kernel K under E_γ-divergence. This contraction coefficient has recently been studied in [30] for deriving approximate DP guarantees for online algorithms. In particular, it was shown [30, Theorem 3] that η_γ enjoys a similar two-point characterization, i.e.,

η_γ(K) = sup_{x₁,x₂ ∈ 𝒳} E_γ(K(·|x₁)‖K(·|x₂)).

Since E_1(P‖Q) = TV(P, Q), this is a natural extension of Dobrushin's result.

C. Local Differential Privacy
Suppose K is a randomized mechanism mapping each x ∈ 𝒳 to a distribution K(·|x) ∈ P(𝒵). One can view K as a Markov kernel (i.e., channel) K : 𝒳 → P(𝒵).

Definition 1 ([8, 9]). A mechanism K : 𝒳 → P(𝒵) is (ε, δ)-LDP for ε ≥ 0 and δ ∈ [0, 1] if

sup_{x,x′ ∈ 𝒳} sup_{A ⊂ 𝒵} [K(A|x) − e^ε K(A|x′)] ≤ δ.   (4)

K is said to be ε-LDP if it is (ε, 0)-LDP. Let Q_{ε,δ} be the collection of all Markov kernels K with the above property. When δ = 0, we use Q_ε to denote Q_{ε,0}.

Interactivity in Privacy-Preserving Mechanisms:
Suppose there are n users, each in possession of a datapoint X_i, i ∈ [n] := {1, ..., n}. The users wish to apply a mechanism K_i that generates a privatized version of X_i, denoted by Z_i. We say that the collection of mechanisms {K_i} is non-interactive if K_i is entirely determined by X_i and independent of (X_j, Z_j) for j ≠ i. When all users apply the same mechanism K, we can view Z^n := (Z_1, ..., Z_n) as independent applications of K to each X_i. We denote this overall mechanism by K^{⊗n}. If interactions between users are permitted, then K_i need not depend only on X_i. In this case, we denote the overall mechanism {K_i}_{i=1}^n by K^n. In particular, the sequentially interactive [10] setting refers to the case where the input of K_i depends on both X_i and the outputs Z^{i−1} of the (i − 1) previous mechanisms.

III. LDP AS THE CONTRACTION OF E_γ-DIVERGENCE
We show next that the (ε, δ)-LDP constraint, with δ not necessarily equal to zero, is equivalent to the contraction of E_γ-divergence.

Theorem 1.
A mechanism K is (ε, δ)-LDP if and only if η_{e^ε}(K) ≤ δ, or equivalently,

K ∈ Q_{ε,δ} ⟺ E_{e^ε}(PK‖QK) ≤ δ E_{e^ε}(P‖Q), ∀ P, Q.
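For finite alphabets, both sides of this equivalence can be checked numerically. The sketch below (function names and the example kernel are our own illustration, not from the paper) computes the smallest δ in Definition 1, using the fact that the sup over events A is attained at A = {z : K(z|x) > e^ε K(z|x′)}, and then verifies the contraction inequality on randomly drawn input pairs (P, Q):

```python
import numpy as np

def E_gamma(p, q, gamma):
    # hockey-stick divergence E_gamma(P || Q) = sum_z [p(z) - gamma * q(z)]_+
    return float(np.sum(np.maximum(p - gamma * q, 0.0)))

def ldp_delta(K, eps):
    """Smallest delta for which the rows of K satisfy Definition 1; by
    Theorem 1 this equals the two-point coefficient eta_{e^eps}(K)."""
    gamma = np.exp(eps)
    return max(E_gamma(K[x], K[xp], gamma)
               for x in range(len(K)) for xp in range(len(K)))

K = np.array([[0.6, 0.3, 0.1],      # an illustrative 2-input, 3-output kernel
              [0.1, 0.5, 0.4]])
eps = 0.5
delta = ldp_delta(K, eps)

# Theorem 1: E_{e^eps}(PK || QK) <= delta * E_{e^eps}(P || Q) for all P, Q.
rng = np.random.default_rng(0)
for _ in range(1000):
    P, Q = rng.dirichlet([1.0, 1.0]), rng.dirichlet([1.0, 1.0])
    assert E_gamma(P @ K, Q @ K, np.exp(eps)) <= delta * E_gamma(P, Q, np.exp(eps)) + 1e-12
```

Taking γ = 1 in `E_gamma` recovers the total variation distance, so the same routine also evaluates Dobrushin's coefficient η_TV.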
We note that Duchi et al. [10] showed that if K is ε-LDP then D_KL(PK‖QK) ≤ 4(e^ε − 1)² TV²(P, Q). They then informally concluded from this result that ε-LDP acts as a contraction on the space of probability measures. Theorem 1 makes this observation precise. According to Theorem 1, a mechanism K is ε-LDP if and only if E_{e^ε}(PK‖QK) = 0 for all distributions P and Q. An example of such Markov kernels is given next.

Example 1. (Randomized response mechanism)
Let 𝒳 = 𝒵 = {0, 1} and consider the mechanism given by the binary symmetric channel BSC(ω_ε) with ω_ε := 1/(e^ε + 1). This is often called the randomized response mechanism [41] and denoted by K_RR^ε. This simple mechanism is well known to be ε-LDP, which can now be verified via Theorem 1. Let P = Bernoulli(p) and Q = Bernoulli(q) with p, q ∈ [0, 1]. Then PK_RR^ε = Bernoulli(p ∗ ω_ε) and QK_RR^ε = Bernoulli(q ∗ ω_ε), where a ∗ b := a(1 − b) + b(1 − a). It is straightforward to verify that

[p ∗ ω_ε − e^ε (q ∗ ω_ε)]_+ + [(1 − p ∗ ω_ε) − e^ε (1 − q ∗ ω_ε)]_+ = 0

for any p and q, implying E_{e^ε}(PK_RR^ε‖QK_RR^ε) = 0. When |𝒳| = k ≥ 2, a simple generalization of this mechanism, called k-ary randomized response, has been reported in the literature (see, e.g., [13, 19]) and is defined by 𝒵 = 𝒳 with

K_kRR^ε(x|x) = e^ε/(k − 1 + e^ε) and K_kRR^ε(z|x) = 1/(k − 1 + e^ε) for z ≠ x.

Again, it can be verified that for this mechanism we have E_{e^ε}(PK_kRR^ε‖QK_kRR^ε) = 0 for all P and Q.

The E_γ-divergence underlies all other f-divergences, in the sense that any f-divergence can be represented in terms of E_γ-divergences [42, Corollary 3.7]. Thus, an LDP constraint implies that a Markov kernel contracts all f-divergences, in a spirit similar to the E_γ-contraction in Theorem 1.

Lemma 1.
Let K ∈ Q_{ε,δ} and ϕ(ε, δ) := 1 − (1 − δ)e^{−ε}. Then η_f(K) ≤ ϕ(ε, δ) or, equivalently,

D_f(PK‖QK) ≤ ϕ(ε, δ) D_f(P‖Q), ∀ P, Q ∈ P(𝒳).

Notice that this lemma holds for any f-divergence and for the entire family of (ε, δ)-LDP mechanisms. However, it can be improved if one considers a particular mechanism or a specific f-divergence. For instance, it is known that η_KL(BSC(ω)) = (1 − 2ω)² [21]. Thus, we have η_KL(K_RR^ε) = ((e^ε − 1)/(e^ε + 1))² for the randomized response mechanism K_RR^ε (cf. Example 1), while Lemma 1 only implies η_KL(K_RR^ε) ≤ 1 − e^{−ε}. Unfortunately, η_KL is difficult to compute in closed form for general Markov kernels, in which case Lemma 1 provides a useful alternative.

Next, we extend Lemma 1 to the non-interactive mechanism. Fix an (ε, δ)-LDP mechanism K and consider the corresponding non-interactive mechanism K^{⊗n}. To obtain upper bounds on η_f(K^{⊗n}) directly through Lemma 1, we would first need to derive the privacy parameters of K^{⊗n} in terms of ε and δ (e.g., by applying composition theorems). Instead, we can use the tensorization properties of contraction coefficients (see, e.g., [38, 39]) to relate η_f(K^{⊗n}) to η_f(K) and then apply Lemma 1, as described next.

Lemma 2.
Let K ∈ Q_{ε,δ} and ϕ_n(ε, δ) := 1 − e^{−nε}(1 − δ)^n. Then η_f(K^{⊗n}) ≤ ϕ_n(ε, δ) for any n ≥ 1.

Each of the next three sections provides a different application of the contraction characterization of LDP.

IV. PRIVATE MINIMAX RISK
Let X^n = (X_1, ..., X_n) be n independent and identically distributed (i.i.d.) samples drawn from a distribution P in a family P ⊆ P(𝒳). Let also θ : P → 𝒯 be a parameter of a distribution that we wish to estimate. Each user has a sample X_i and applies a privacy-preserving mechanism K_i to obtain Z_i. In general, we can assume that the K_i are sequentially interactive. Given the sequence {Z_i}_{i=1}^n, the goal is to estimate θ(P) through an estimator Ψ : 𝒵^n → 𝒯. The quality of such an estimator is assessed by a semi-metric ℓ : 𝒯 × 𝒯 → ℝ_+, which is used to define the minimax risk

R_n(P, ℓ, ε, δ) := inf_{K^n ⊂ Q_{ε,δ}} inf_Ψ sup_{P ∈ P} E[ℓ(Ψ(Z^n), θ(P))].   (5)

The quantity R_n(P, ℓ, ε, δ) uniformly characterizes the optimal rate of private statistical estimation over the family P using the best possible estimator and privacy-preserving mechanisms in Q_{ε,δ}. In the absence of privacy constraints (i.e., Z^n = X^n), we denote the minimax risk by R_n(P, ℓ).

The first step in deriving information-theoretic lower bounds for the minimax risk is to reduce the above estimation problem to a testing problem [22, 23, 43]. To do so, we need to construct an index set V with |V| < ∞ and a family of distributions {P_v, v ∈ V} ⊆ P such that ℓ(θ(P_v), θ(P_v′)) ≥ τ for all v ≠ v′ in V, for some τ > 0. The canonical testing problem is then defined as follows: Nature chooses a random variable V uniformly at random from V, and then, conditioned on V = v, the samples X^n are drawn i.i.d. from P_v, denoted by X^n ∼ P_v^{⊗n}. Each X_i is then fed to a mechanism K_i to generate Z_i. It is well known [22, 23, 43] that R_n(P, ℓ) ≥ τ P_e(V|X^n), where P_e(V|X^n) denotes the probability of error in guessing V given X^n. Replacing X^n by its (ε, δ)-privatized samples Z^n in this result, one can obtain a lower bound on R_n(P, ℓ, ε, δ) in terms of P_e(V|Z^n).
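As a concrete illustration of this reduction (our own toy computation, not an example from the paper), the following sketch evaluates the two-point testing bound P_e(V|Z^n) ≥ (1/2)[1 − TV(P_0^{⊗n}K^{⊗n}, P_1^{⊗n}K^{⊗n})] exactly for binary samples privatized by a non-interactive randomized-response kernel:

```python
import numpy as np
from itertools import product

def output_dist(p, K, n):
    """Exact pmf of Z^n when X_i ~ Bernoulli(p) i.i.d. and each sample is
    passed independently through the 2x2 kernel K (non-interactive case)."""
    pz = np.array([1 - p, p]) @ K        # single-letter output distribution
    return np.array([np.prod([pz[z] for z in zs])
                     for zs in product([0, 1], repeat=n)])

eps, n = 0.5, 10
w = 1.0 / (np.exp(eps) + 1.0)            # randomized response = BSC(w)
K = np.array([[1 - w, w], [w, 1 - w]])

m0, m1 = output_dist(0.4, K, n), output_dist(0.6, K, n)
tv_private = 0.5 * float(np.abs(m0 - m1).sum())
tv_clear = 0.5 * float(np.abs(output_dist(0.4, np.eye(2), n)
                              - output_dist(0.6, np.eye(2), n)).sum())

# Privatization shrinks the total variation, so the guessing-error bound grows:
assert tv_private <= tv_clear
print(0.5 * (1 - tv_private))            # lower bound on P_e(V | Z^n)
```

Enumerating all 2^n output sequences is only feasible for small n, but it makes the contraction effect visible without any asymptotics.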
Hence, the remaining challenge is to lower-bound P_e(V|Z^n) over the choice of mechanisms {K_i}. There are numerous techniques for this purpose, depending on V. We focus on two such approaches, namely Le Cam's and Fano's methods, which bound P_e(V|Z^n) in terms of the total variation distance and mutual information, respectively, and hence allow us to invoke Lemmas 1 and 2.

A. Locally Private Le Cam's Method
Le Cam's method is applicable when V is a binary set containing, say, P_0 and P_1. In its simplest form, it relies on the inequality (see [22, Lemma 1] or [23, Theorem 2.2])

P_e(V|X^n) ≥ (1/2)[1 − TV(P_0^{⊗n}, P_1^{⊗n})].

Thus, it yields the following lower bound for the non-private minimax risk:

R_n(P, ℓ) ≥ (τ/2)[1 − TV(P_0^{⊗n}, P_1^{⊗n})]   (6)
         ≥ (τ/2)[1 − √(n D_KL(P_0‖P_1)/2)],   (7)

for any P_0 ≠ P_1 in P, where the second inequality follows from Pinsker's inequality and the chain rule of the KL divergence. In the presence of privacy, the estimator Ψ depends on Z^n instead of X^n, which is generated by a sequentially interactive mechanism K^n. To write the private counterpart of (6), we need to replace P_0^{⊗n} and P_1^{⊗n} with P_0^{⊗n}K^n and P_1^{⊗n}K^n, the corresponding marginals of Z^n, respectively. A lower bound for R_n(P, ℓ, ε, δ) is therefore obtained by deriving an upper bound for TV(P_0^{⊗n}K^n, P_1^{⊗n}K^n) for all K^n ⊂ Q_{ε,δ}.

Lemma 3.
Let P_0, P_1 ∈ P satisfy ℓ(θ(P_0), θ(P_1)) ≥ τ. Then we have

R_n(P, ℓ, ε, δ) ≥ (τ/2)[1 − √(n ϕ²(ε, δ) D_KL(P_0‖P_1)/2)].

Comparing with the original non-private Le Cam method (7), we observe that the effect of (ε, δ)-LDP is to reduce the effective sample size from n to (1 − e^{−ε}(1 − δ))²n. Setting δ = 0, this result strengthens Duchi et al. [10, Corollary 2], where the effective sample size was shown to be ε²n for sufficiently small ε.

Example 2. (One-dimensional mean estimation)
For some k > 1, we assume P is given by

P = P_k := {P ∈ P(𝒳) : |E_P[X]| ≤ 1, E_P[|X|^k] ≤ 1}.

The goal is to estimate θ(P) = E_P[X] under the squared ℓ_2 metric ℓ(θ, θ′) = (θ − θ′)². This problem was first studied in [10, Proposition 1], where it was shown that R_n(P_k, ℓ, ε, 0) ≳ (nε²)^{−(k−1)/k}, only for ε ≤ 1. Applying our framework to this example, we obtain a similar lower bound that holds for all ε ≥ 0 and δ ∈ [0, 1].

Corollary 1.
For all k > 1, ε ≥ 0, and δ ∈ (0, 1), we have

R_n(P_k, ℓ, ε, δ) ≳ min{1, [n ϕ²(ε, δ)]^{−(k−1)/k}}.   (8)

It is worth instantiating this corollary for some special values of k. Consider first the usual setting of finite variance, i.e., k = 2. In the non-private case, it is known that the sample mean has a mean-squared error that scales as 1/n. According to Corollary 1, this rate worsens to 1/(ϕ(ε, δ)√n) in the presence of the (ε, δ)-LDP requirement. As k → ∞, the moment condition E_P[|X|^k] ≤ 1 implies the boundedness of X. In this case, Corollary 1 implies the more standard lower bound (ϕ²(ε, δ) n)^{−1}.

B. Locally Private Fano's Method
Le Cam's method involves only a pair of distributions (P_0, P_1) in P. However, it is possible to derive a stronger bound by considering a larger subset of P and applying Fano's inequality (see, e.g., [22]). We follow this path to obtain a better minimax lower bound for the non-interactive setting.

Consider the index set V = {1, ..., |V|}. The non-private Fano method relies on Fano's inequality to write a lower bound for P_e(V|X^n) in terms of mutual information:

R_n(P, ℓ) ≥ τ [1 − (I(X^n; V) + log 2)/(log |V|)].   (9)

To incorporate privacy into this result, we need to derive an upper bound for I(Z^n; V) over all choices of mechanisms {K_i}. Focusing on non-interactive mechanisms, the following lemma exploits Lemma 2 to obtain such an upper bound.

Lemma 4.
Given X^n and V as described above, let Z^n be constructed by applying K^{⊗n} to X^n. If K is (ε, δ)-LDP, then we have

I(Z^n; V) ≤ ϕ_n(ε, δ) I(X^n; V) ≤ (n ϕ_n(ε, δ)/|V|²) Σ_{v,v′ ∈ V} D_KL(P_v‖P_v′).

This lemma can be compared with [10, Corollary 1], where it was shown that

I(Z^n; V) ≤ (e^ε − 1)² (n/|V|²) Σ_{v,v′ ∈ V} D_KL(P_v‖P_v′).   (10)

The bound (10) is looser than Lemma 4, and it only holds for δ = 0.

Example 3. (High-dimensional mean estimation in an ℓ_∞-ball) For a parameter r < ∞, define

P_∞ := {P ∈ P(B_∞^d(r))},   (11)

where B_∞^d(r) := {x ∈ ℝ^d : ‖x‖_∞ ≤ r} is the ℓ_∞-ball of radius r in ℝ^d. The goal is to estimate the mean θ(P) = E[X] given the private views Z^n. This example was first studied in [10, Proposition 3], which states that R_n(P_∞, ℓ, ε, 0) ≳ r² min{1/(ε√n), d/(nε²)} for ε ∈ (0, 1]. In the following, we use Lemma 4 to derive a similar lower bound for any ε ≥ 0 and δ ∈ (0, 1), albeit slightly weaker than [10, Proposition 3].

Corollary 2.
For the non-interactive setting, we have

R_n(P_∞, ℓ, ε, δ) ≳ r² min{1/√(n ϕ_n(ε, δ)), d/(n ϕ_n(ε, δ))}.   (12)

V. PRIVATE BAYESIAN RISK
In the minimax setting, the worst-case parameter is considered, which usually leads to over-pessimistic bounds. In practice, the parameter that incurs the worst-case risk may appear with very small probability. To capture this prior knowledge, it is reasonable to assume that the true parameter is sampled from an underlying prior distribution. In this case, we are interested in the Bayes risk of the problem.

Let P = {P_{X|Θ}(·|θ) : θ ∈ 𝒯} be a collection of parametric probability distributions on 𝒳, where the parameter space 𝒯 is endowed with a prior P_Θ, i.e., Θ ∼ P_Θ. Given an i.i.d. sequence X^n drawn from P_{X|Θ}, the goal is to estimate Θ from a privatized sequence Z^n via an estimator Ψ : 𝒵^n → 𝒯. Here, we focus on the non-interactive setting. Define the private Bayes risk as

R_n^Bayes(P_Θ, ℓ, ε, δ) := inf_{K ∈ Q_{ε,δ}} inf_Ψ E[ℓ(Θ, Ψ(Z^n))],   (13)

where the expectation is taken with respect to the randomness of both Θ and Z^n. It is evident that R_n^Bayes(P_Θ, ℓ, ε, δ) must depend on the prior P_Θ. This dependence can be quantified by

L(ζ) := sup_{t ∈ 𝒯} Pr(ℓ(Θ, t) ≤ ζ),   (14)

for ζ < sup_{θ,θ′ ∈ 𝒯} ℓ(θ, θ′). Xu and Raginsky [44] showed that the non-private Bayes risk (i.e., Z^n = X^n), denoted by R_n^Bayes(P_Θ, ℓ), is lower bounded as

R_n^Bayes(P_Θ, ℓ) ≥ sup_{ζ>0} ζ [1 − (I(Θ; X^n) + log 2)/log(1/L(ζ))].   (15)

Replacing I(Θ; X^n) with I(Θ; Z^n) in this result and applying Lemma 2 (similarly to Lemma 4), we can directly convert (15) into a lower bound for R_n^Bayes(P_Θ, ℓ, ε, δ).

Corollary 3.
In the non-interactive setting, we have

R_n^Bayes(P_Θ, ℓ, ε, δ) ≥ sup_{ζ>0} ζ [1 − (ϕ_n(ε, δ) I(Θ; X^n) + log 2)/log(1/L(ζ))].

In the following theorem, we provide a lower bound for R_n^Bayes(P_Θ, ℓ, ε, δ) that directly involves the E_γ-divergence and thus can lead to tighter bounds than Corollary 3. For any pair of random variables (A, B) ∼ P_AB with marginals P_A and P_B and a constant γ ≥ 1, we define their E_γ-information as

I_γ(A; B) := E_γ(P_AB‖P_A P_B).

Theorem 2.
Let K be an (ε, δ)-LDP mechanism. Then, for n = 1 we have

R_1^Bayes(P_Θ, ℓ, ε, δ) ≥ sup_{ζ>0} ζ [1 − δ I_{e^ε}(Θ; X) − e^ε L(ζ)],

and for n > 1 in the non-interactive setting we have

R_n^Bayes(P_Θ, ℓ, ε, δ) ≥ sup_{ζ>0} ζ [1 − ϕ_n(ε, δ) I_{e^ε}(Θ; X^n) − e^ε L(ζ)].

We compare Theorem 2 with Corollary 3 in the next example.
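For finite (or discretized) models, the E_γ-information appearing in Theorem 2 is directly computable from the joint distribution. A minimal sketch (the function name and the example matrices are ours, for illustration only):

```python
import numpy as np

def E_gamma_information(P_AB, gamma):
    """I_gamma(A; B) = E_gamma(P_AB || P_A x P_B) for a joint pmf matrix P_AB,
    computed as the sum over (a, b) of [P_AB(a,b) - gamma * P_A(a) P_B(b)]_+."""
    P_AB = np.asarray(P_AB, float)
    P_A = P_AB.sum(axis=1, keepdims=True)
    P_B = P_AB.sum(axis=0, keepdims=True)
    return float(np.maximum(P_AB - gamma * P_A * P_B, 0.0).sum())

# Independent variables carry no E_gamma-information:
assert E_gamma_information(np.outer([0.3, 0.7], [0.5, 0.5]), 1.0) < 1e-12

# gamma = 1 recovers TV(P_AB, P_A P_B):
P_AB = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(round(E_gamma_information(P_AB, 1.0), 10))   # 0.3
```

For continuous priors, as in the example below, the same quantity can be evaluated by numerically integrating E_γ(P_{X^n|θ}‖P_{X^n}) over the prior.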
Example 4.
Suppose Θ is uniformly distributed on [0, 1], P_{X|Θ=θ} = Bernoulli(θ), and ℓ(θ, θ′) = |θ − θ′|. As mentioned earlier, L(ζ) ≤ min{2ζ, 1}. We can write, for γ = e^ε,

I_γ(Θ; X^n) = ∫₀¹ E_γ(P_{X^n|θ}‖P_{X^n}) dθ.   (16)

A straightforward calculation shows that P_{X^n|θ}(x^n) = θ^{s(x^n)}(1 − θ)^{n−s(x^n)} for any θ ∈ [0, 1], and P_{X^n}(x^n) = s(x^n)!(n − s(x^n))!/(n + 1)!, where s(x^n) is the number of 1's in x^n. Given these marginal and conditional distributions, one can obtain, after algebraic manipulations,

I_γ(Θ; X^n) = (1/(n + 1)) Σ_{s=0}^{n} ∫₀¹ [θ^s (1 − θ)^{n−s} (n + 1)!/(s!(n − s)!) − γ]_+ dθ.

Plugging this into Theorem 2, we arrive at a maximization problem that can be solved numerically. Similarly, we compute I(Θ; X^n) = ∫₀¹ D_KL(P_{X^n|θ}‖P_{X^n}) dθ, plug it into Corollary 3, and numerically solve the resulting optimization problem. In Fig. 1, we compare these two lower bounds for a fixed small δ and n = 20, indicating the advantage of Theorem 2 for small ε.

Remark 1.
The proof of Theorem 2 leads to the following lower bound for the non-private Bayes risk:

R_n^Bayes(P_Θ, ℓ) ≥ sup_{ζ>0, γ≥0} ζ [1 − I_γ(Θ; X^n) − γ L(ζ) − (1 − γ)_+].   (17)

For a comparison with (15), consider the following example. Suppose Θ is a uniform random variable on [0, 1] and P_{X|Θ=θ} = Bernoulli(θ). We are interested in the Bayes risk with respect to the ℓ_1-loss function ℓ(θ, θ′) = |θ − θ′|. It can be shown that I(Θ; X) = 0.19 nats, while

I_γ(Θ; X) = (1 − γ/2)² for γ ∈ [0, 2], and I_γ(Θ; X) = 0 otherwise.   (18)

Moreover, L(ζ) = sup_{t ∈ [0,1]} Pr(|Θ − t| ≤ ζ) ≤ min{2ζ, 1}. It can be verified that (15) gives R_1^Bayes(P_Θ, ℓ) ≥ 0.046, whereas our bound (17) yields R_1^Bayes(P_Θ, ℓ) ≥ 0.074.

Fig. 1. Comparison of the lower bounds obtained from Theorem 2 and the private version of [44, Theorem 1] described in Corollary 3 for Example 4, assuming a fixed small δ and n = 20.

VI. PRIVATE HYPOTHESIS TESTING
We now turn our attention to the well-known problem of binary hypothesis testing under local differential privacy constraints. Suppose n i.i.d. samples X^n drawn from a distribution Q ∈ P(𝒳) are observed. Let now each X_i be mapped to Z_i via a mechanism K_i ∈ Q_{ε,δ} (i.e., sequential interaction is permitted). The goal is to distinguish the null hypothesis H_0 : Q = P_0 from the alternative H_1 : Q = P_1 given Z^n. Let T be a binary statistic generated from a randomized decision rule P_{T|Z^n} : 𝒵^n → P({0, 1}), where T = 1 indicates that H_0 is rejected. The type I and type II error probabilities corresponding to this statistic are given by Pr(T = 1|H_0) and Pr(T = 0|H_1), respectively. To capture the optimal trade-off between type I and type II error probabilities, it is customary to define β_n^{ε,δ}(α) := inf Pr(T = 0|H_1), where the infimum is taken over all kernels P_{T|Z^n} such that Pr(T = 1|H_0) ≤ α and over non-interactive mechanisms K^{⊗n} with K ∈ Q_{ε,δ}. In the following corollary, we apply Lemma 1 to obtain an asymptotic lower bound for β_n^{ε,δ}(α).

Corollary 4.
We have, for any ε ≥ 0 and δ ∈ [0, 1],

lim inf_{n→∞} (1/n) log β_n^{ε,δ}(α) ≥ −ϕ(ε, δ) D_KL(P_0‖P_1).   (19)

A similar result was proved by Kairouz et al. [13, Sec. 3] that holds only for sufficiently "small" (albeit unspecified) ε and δ = 0. When compared to the Chernoff–Stein lemma [45, Theorem 11.8.3], which establishes D_KL(P_0‖P_1) as the asymptotic exponential decay rate of the type II error probability in the non-private setting, the above corollary, once again, justifies the reduction of the effective sample size from n to ϕ(ε, δ)n in the presence of the (ε, δ)-LDP requirement.

VII. MUTUAL INFORMATION OF LDP MECHANISMS
Viewing mutual information as a utility measure, we may consider maximizing mutual information under local differential privacy as yet another privacy–utility trade-off. To formalize this, let X ∼ P_X. The goal is to characterize the supremum of I(X; Z) over K ∈ Q_{ε,δ}, i.e., the maximum information shared between X and its (ε, δ)-LDP representation. Such mutual information bounds under local DP have appeared in the literature; e.g., McGregor et al. [46] provided a result that roughly states I(X; Z) ≤ ε for K ∈ Q_ε, and Kairouz et al. [13, Corollary 15] showed, for sufficiently small ε,

sup_{K ∈ Q_ε} I(X; Z) ≤ P_X(A)(1 − P_X(A)) ε²,   (20)

where A ⊂ 𝒳 satisfies A ∈ arg min_{B ⊂ 𝒳} |P_X(B) − 1/2|. Next, we provide an upper bound for the mutual information under LDP that holds for all ε ≥ 0 and δ ∈ [0, 1].

Corollary 5.
We have, for any ε ≥ 0 and δ ∈ [0, 1],

sup_{K ∈ Q_{ε,δ}} I(X; Z) ≤ ϕ(ε, δ) H(X).   (21)

REFERENCES

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in
Proc. Theory of Cryptography (TCC), Berlin, Heidelberg, 2006, pp. 265–284.
[2] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, "Our data, ourselves: Privacy via distributed noise generation," in EUROCRYPT, S. Vaudenay, Ed., 2006, pp. 486–503.
[3] I. Mironov, "Rényi differential privacy," in Proc. Computer Security Found. (CSF), 2017, pp. 263–275.
[4] M. Bun and T. Steinke, "Concentrated differential privacy: Simplifications, extensions, and lower bounds," in Theory of Cryptography, 2016, pp. 635–658.
[5] C. Dwork and G. N. Rothblum, "Concentrated differential privacy," arXiv:1603.01887, 2016. [Online]. Available: http://arxiv.org/abs/1603.01887
[6] J. Dong, A. Roth, and W. J. Su, "Gaussian differential privacy," arXiv:1905.02383, 2019.
[7] S. Asoodeh, J. Liao, F. P. Calmon, O. Kosut, and L. Sankar, "Three variants of differential privacy: Lossless conversion and applications," to appear in Journal on Selected Areas in Information Theory (JSAIT), 2021.
[8] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting privacy breaches in privacy preserving data mining," in Proc. ACM Symp. Principles of Database Systems (PODS), 2003, pp. 211–222.
[9] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, "What can we learn privately?" SIAM J. Comput., vol. 40, no. 3, pp. 793–826, Jun. 2011.
[10] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Local privacy, data processing inequalities, and statistical minimax rates," in Proc. Symp. Foundations of Computer Science, 2013, pp. 429–438. [Online]. Available: https://arxiv.org/abs/1302.3203
[11] M. Gaboardi, R. Rogers, and O. Sheffet, "Locally private mean estimation: z-test and tight confidence intervals," in Proc. Machine Learning Research, 2019, pp. 2545–2554.
[12] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers, "Protection against reconstruction and its applications in private federated learning," arXiv:1812.00984, 2018.
[13] P. Kairouz, S. Oh, and P. Viswanath, "Extremal mechanisms for local differential privacy,"
Journal of Machine Learning Research, vol. 17, no. 17, pp. 1–51, 2016.
[14] L. P. Barnes, W. N. Chen, and A. Özgür, "Fisher information under local differential privacy," IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 645–659, 2020.
[15] J. Acharya, C. L. Canonne, and H. Tyagi, "Inference under information constraints I: Lower bounds from chi-square contraction," IEEE Transactions on Information Theory, vol. 66, no. 12, pp. 7835–7855, 2020.
[16] M. Ye and A. Barg, "Optimal schemes for discrete distribution estimation under locally differential privacy," IEEE Trans. Inf. Theory, vol. 64, no. 8, pp. 5662–5676, 2018.
[17] D. Wang and J. Xu, "On sparse linear regression in the local differential privacy model," IEEE Trans. Inf. Theory, pp. 1–1, 2020.
[18] A. Rohde and L. Steinberger, "Geometrizing rates of convergence under local differential privacy constraints," Ann. Statist., vol. 48, no. 5, pp. 2646–2670, 2020.
[19] P. Kairouz, K. Bonawitz, and D. Ramage, "Discrete distribution estimation under local privacy," in Proc. Int. Conf. Machine Learning, vol. 48, 2016, pp. 2436–2444.
[20] J. Duchi and R. Rogers, "Lower bounds for locally private estimation via communication complexity," in Proc. Conference on Learning Theory, 2019, pp. 1161–1191.
[21] R. Ahlswede and P. Gács, "Spreading of sets in product spaces and hypercontraction of the Markov operator,"
Ann. Probab., vol. 4, no. 6, pp. 925–939, 1976.
[22] B. Yu, Assouad, Fano, and Le Cam. Springer New York, 1997, pp. 423–435.
[23] A. B. Tsybakov, Introduction to Nonparametric Estimation, 1st ed. Springer, 2008.
[24] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[25] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B, vol. 28, pp. 131–142, 1966.
[26] N. Sharma and N. A. Warsi, "Fundamental bound on the reliability of quantum information transmission," CoRR, vol. abs/1302.5281, 2013. [Online]. Available: http://arxiv.org/abs/1302.5281
[27] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
[28] B. Balle, G. Barthe, and M. Gaboardi, "Privacy amplification by subsampling: Tight analyses via couplings and divergences," in NeurIPS, 2018, pp. 6280–6290.
[29] B. Balle, G. Barthe, M. Gaboardi, and J. Geumlek, "Privacy amplification by mixing and diffusion mechanisms," in
NeurIPS , 2019, pp. 13 277–13 287.[30] S. Asoodeh, M. Diaz, and F. P. Calmon, “Privacy analysis of online learning algorithms via contractioncoefficients,” arXiv 2012.11035 , 2020.[31] R. L. Dobrushin, “Central limit theorem for nonstationary markov chains. I,”
Theory Probab. Appl. , vol. 1,no. 1, pp. 65–80, 1956.[32] P. Del Moral, M. Ledoux, and L. Miclo, “On contraction properties of markov kernels,”
Probab. Theory Relat.Fields , vol. 126, pp. 395–420, 2003.[33] J. E. Cohen, Y. Iwasa, G. Rautu, M. Beth Ruskai, E. Seneta, and G. Zbaganu, “Relative entropy under mappingsby stochastic matrices,”
Linear Algebra and its Applications , vol. 179, pp. 211 – 235, 1993.[34] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On hypercontractivity and a data processing inequality,”in , 2014, pp. 3022–3026.[35] Y. Polyanskiy and Y. Wu, “Strong data-processing inequalities for channels and bayesian networks,” in
Convexity and Concentration , E. Carlen, M. Madiman, and E. M. Werner, Eds. New York, NY: SpringerNew York, 2017, pp. 211–249.[36] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,”
IEEE Trans. Inf.Theory , vol. 62, no. 1, pp. 35–55, Jan 2016.[37] F. P. Calmon, Y. Polyanskiy, and Y. Wu, “Strong data processing inequalities for input constrained additivenoise channels,”
IEEE Trans. Inf. Theory , vol. 64, no. 3, pp. 1879–1892, 2018.[38] A. Makur and L. Zheng, “Comparison of contraction coefficients for f -divergences,” Probl. Inf. Trans. , vol. 56,pp. 103–156, 2020.[39] M. Raginsky, “Strong data processing inequalities and φ -sobolev inequalities for discrete channels,” IEEETrans. Inf. Theory , vol. 62, no. 6, pp. 3355–3389, June 2016.[40] H. S. Witsenhausen, “On sequences of pairs of dependent random variables,”
SIAM Journal on AppliedMathematics , vol. 28, no. 1, pp. 100–113, 1975.[41] S. L. Warner, “Randomized response: A survey technique for eliminating evasive answer bias,”
Journal of theAmerican Statistical Association , vol. 60, no. 309, pp. 63–69, 1965.[42] J. Cohen, J. Kemperman, and G. Zb˘aganu,
Comparisons of Stochastic Matrices, with Applications inInformation Theory, Economics, and Population Sciences . Birkhäuser, 1998.[43] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of convergence,”
Ann. Statist. ,vol. 27, no. 5, pp. 1564–1599, 10 1999.[44] A. Xu and M. Raginsky, “Converses for distributed estimation via strong data processing inequalities,” in
IEEE Int. Sympos. Inf. Theory (ISIT) , 2015, pp. 2376–2380. [45] T. M. Cover and J. A. Thomas, Elements of information theory . John Wiley & Sons, 2012.[46] A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. Vadhan, “The limits of two-partydifferential privacy,” in
Proc. of the 51st Annual IEEE Symposium on Foundations of Computer Science(FOCS ‘10) , 23–26 October 2010, p. 81–90.[47] I. Csiszár and J. Körner,
Information Theory: Coding Theorems for Discrete Memoryless Systems . CambridgeUniversity Press, 2011. A PPENDIX
We begin with some alternative expressions for the $E_\gamma$-divergence that are useful for the subsequent proofs. It is straightforward to show that for any $\gamma > 0$, we have
$$E_\gamma(P\|Q) = \frac{1}{2}\int \big|\mathrm{d}P - \gamma\,\mathrm{d}Q\big| - \frac{|1-\gamma|}{2} \qquad (22)$$
$$= \sup_{A\subset\mathcal{X}}\big[P(A) - \gamma Q(A)\big] - (1-\gamma)^+ \qquad (23)$$
$$= P\Big(\log\tfrac{\mathrm{d}P}{\mathrm{d}Q} > \log\gamma\Big) - \gamma\, Q\Big(\log\tfrac{\mathrm{d}P}{\mathrm{d}Q} > \log\gamma\Big) - (1-\gamma)^+. \qquad (24)$$
The proof of Theorem 1 relies on the following theorem, recently proved by the authors in [30, Theorem 3].

Theorem 3.
For any $\gamma \geq 1$ and Markov kernel $K$ with input alphabet $\mathcal{X}$, we have
$$\eta_\gamma(K) = \sup_{x,x'\in\mathcal{X}} E_\gamma\big(K(\cdot|x)\,\|\,K(\cdot|x')\big). \qquad (25)$$
Notice that for $\gamma = 1$, this theorem reduces to the well-known result of Dobrushin [31], which states that
$$\eta_{\mathsf{TV}}(K) = \sup_{x,x'\in\mathcal{X}} \mathsf{TV}\big(K(\cdot|x), K(\cdot|x')\big). \qquad (26)$$

Proof of Theorem 1.
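Before the formal argument, the equivalence can be illustrated numerically. For binary randomized response, which is $(\varepsilon, 0)$-LDP, the contraction coefficient $\eta_{e^\varepsilon}(K)$ computed via Theorem 3 and expression (23) vanishes exactly, while $\gamma = 1$ recovers Dobrushin's coefficient (26). The helper names below are ours, not the paper's; this is a sketch, not part of the proof.

```python
import math

# E_gamma via Eq. (23); the sup over events A is attained at {x : p(x) > gamma*q(x)}.
def e_gamma(p, q, gamma):
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q)) - max(1.0 - gamma, 0.0)

# Theorem 3: eta_gamma(K) is the sup of E_gamma over pairs of rows of the kernel.
def eta_gamma(K, gamma):
    return max(e_gamma(K[x], K[y], gamma) for x in range(len(K)) for y in range(len(K)))

eps = 1.0
a = math.exp(eps) / (1.0 + math.exp(eps))    # randomized response: keep input w.p. a
K = [[a, 1.0 - a], [1.0 - a, a]]
assert abs(eta_gamma(K, math.exp(eps))) < 1e-12          # (eps, 0)-LDP: delta = 0
assert abs(eta_gamma(K, 1.0) - (2.0 * a - 1.0)) < 1e-12  # gamma = 1: eta_TV, Eq. (26)
```

The first assertion is exactly the "only if" direction of Theorem 1 for $\delta = 0$: the sup of $E_{e^\varepsilon}$ over pairs of rows equals the smallest admissible $\delta$.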
It follows from Theorem 3 that
$$\eta_{e^\varepsilon}(K) \leq \delta \iff \sup_{x,x'\in\mathcal{X}} E_{e^\varepsilon}\big(K(\cdot|x)\,\|\,K(\cdot|x')\big) \leq \delta, \qquad (27)$$
which, according to (23), implies
$$\eta_{e^\varepsilon}(K) \leq \delta \iff \sup_{x,x'\in\mathcal{X}}\; \sup_{A\subset\mathcal{Z}} \big[K(A|x) - e^\varepsilon K(A|x')\big] \leq \delta.$$
Hence, in light of Definition 1, $K$ is $(\varepsilon,\delta)$-LDP if and only if $\eta_{e^\varepsilon}(K) \leq \delta$.

Proof of Lemma 1.
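The sandwich bound (28) at the heart of this proof can first be checked numerically on random distributions. This is a sketch with our own helper names, not part of the proof.

```python
import random

def e_gamma(p, q, gamma):
    # E_gamma via the sup-over-events form, Eq. (23)
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q)) - max(1.0 - gamma, 0.0)

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def rand_dist(m):
    w = [random.random() for _ in range(m)]
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
for _ in range(1000):
    p, q = rand_dist(4), rand_dist(4)
    gamma = 1.0 + 4.0 * random.random()    # gamma >= 1
    e = e_gamma(p, q, gamma)
    # Eq. (28): 1 - gamma*(1 - TV(P,Q)) <= E_gamma(P||Q) <= TV(P,Q)
    assert 1.0 - gamma * (1.0 - tv(p, q)) - 1e-12 <= e <= tv(p, q) + 1e-12
```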
We first show the following upper and lower bounds for the $E_\gamma$-divergence in terms of the total variation distance.

Claim. For any distributions $P$ and $Q$ on $\mathcal{X}$ and any $\gamma \geq 1$, we have
$$1 - \gamma\big(1 - \mathsf{TV}(P,Q)\big) \leq E_\gamma(P\|Q) \leq \mathsf{TV}(P,Q). \qquad (28)$$

Proof of Claim. The upper bound is immediate from the definition of the $E_\gamma$-divergence (and holds for any $\gamma \geq 1$). For the lower bound, note that
$$\gamma\,\mathsf{TV}(P,Q) = \max_{A\subset\mathcal{X}}\big[\gamma P(A) - \gamma Q(A)\big] = \max_{A\subset\mathcal{X}}\big[P(A) - \gamma Q(A) + (\gamma-1)P(A)\big] \leq \max_{A\subset\mathcal{X}}\big[P(A) - \gamma Q(A)\big] + (\gamma-1) = E_\gamma(P\|Q) + \gamma - 1,$$
where the last equality follows from (23). This immediately yields the lower bound in (28).
According to this claim, we can write for $\gamma \geq 1$
$$\mathsf{TV}(P,Q) \leq 1 - \frac{1 - E_\gamma(P\|Q)}{\gamma}.$$
Replacing $P$ and $Q$ with $K(\cdot|x)$ and $K(\cdot|x')$, respectively, for some $x$ and $x'$ in $\mathcal{X}$, we obtain
$$\mathsf{TV}\big(K(\cdot|x), K(\cdot|x')\big) \leq 1 - \frac{1 - E_\gamma\big(K(\cdot|x)\,\|\,K(\cdot|x')\big)}{\gamma}.$$
Taking the supremum over $x$ and $x'$ on both sides and invoking Theorem 3 and (26), we conclude that
$$\eta_{\mathsf{TV}}(K) \leq 1 - \frac{1 - \eta_\gamma(K)}{\gamma}. \qquad (29)$$
It is known [33, 39] that for any Markov kernel $K$ and any convex function $f$ we have
$$\eta_f(K) \leq \eta_{\mathsf{TV}}(K), \qquad (30)$$
from which the desired result follows immediately.

Proof of Lemma 2.
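The tensorization bound established below, $\eta_{\mathsf{TV}}(P_{Z^n|X^n}) \leq 1-(1-\varphi(\varepsilon,\delta))^n$, can be illustrated for $n = 2$ by computing the Dobrushin coefficient of a two-fold product kernel directly. Helper names are ours; a sketch, not part of the proof.

```python
from itertools import product

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kron(p, q):
    # product of two rows: the row of K x K at the input pair (x1, x2)
    return [a * b for a in p for b in q]

# a 3-input, 3-output kernel (rows are conditional distributions)
K = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
eta1 = max(tv(K[x], K[y]) for x in range(3) for y in range(3))  # Dobrushin, Eq. (26)
rows = [kron(K[x1], K[x2]) for x1, x2 in product(range(3), repeat=2)]
eta2 = max(tv(r, s) for r in rows for s in rows)
# two-fold instance of the bound proved below
assert eta2 <= 1.0 - (1.0 - eta1) ** 2 + 1e-12
```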
Given $n$ mechanisms $K_1, K_2, \dots, K_n$, we consider the non-interactive mechanism $P_{Z^n|X^n}$ given by
$$P_{Z^n|X^n}(z^n|x^n) = \prod_{i=1}^n K_i(z_i|x_i).$$
If $K_i \in \mathcal{Q}_{\varepsilon,\delta}$ for $i \in [n]$, then we have $\eta_{e^\varepsilon}(K_i) \leq \delta$. According to (29), it thus leads to $\eta_{\mathsf{TV}}(K_i) \leq \varphi(\varepsilon,\delta)$. Invoking [35, Corollary 9] (see also [44, Lemma 3] and [38, Eq. (62)]), we obtain
$$\eta_{\mathsf{TV}}(P_{Z^n|X^n}) \leq \max_{i\in[n]}\Big[1 - \big(1 - \eta_{\mathsf{TV}}(K_i)\big)^n\Big] \leq 1 - \big(1 - \varphi(\varepsilon,\delta)\big)^n = \varphi_n(\varepsilon,\delta).$$

Proof of Lemma 3.
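Step (c) below — the contraction $D_{\mathsf{KL}}(P_1 K\,\|\,P_2 K)\le \varphi(\varepsilon,\delta)\, D_{\mathsf{KL}}(P_1\|P_2)$ supplied by Lemma 1 — can be checked for binary randomized response, for which $\delta = 0$ and $\varphi(\varepsilon,0) = 1-e^{-\varepsilon}$. Helper names are ours; a sketch, not part of the proof.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def push(p, K):
    # output distribution P K of the kernel K under input distribution p
    return [sum(pi * K[i][z] for i, pi in enumerate(p)) for z in range(len(K[0]))]

eps = 0.8
a = math.exp(eps) / (1.0 + math.exp(eps))
K = [[a, 1.0 - a], [1.0 - a, a]]    # randomized response, (eps, 0)-LDP
phi = 1.0 - math.exp(-eps)          # phi(eps, 0)
P1, P2 = [0.9, 0.1], [0.4, 0.6]
assert kl(push(P1, K), push(P2, K)) <= phi * kl(P1, P2) + 1e-12
```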
Recall that $X^n$ is an i.i.d. sample from a distribution $P$ and each $Z_i$, $i \in [n]$, is obtained by applying $K_i$ to $X_i$. Note that, by assumption, $K_i$ specifies the conditional distribution $P_{Z_i|X_i, Z^{i-1}}$. Let $M_1^n$ and $M_2^n$ denote the distribution of $Z^n$ when $P = P_1$ and $P = P_2$, respectively. Thus, we have for $P = P_1$ and any $z^n \in \mathcal{Z}^n$
$$M_1^n(z^n) = \prod_{i=1}^n P_{Z_i|Z^{i-1}}(z_i|z^{i-1}) \qquad (31)$$
$$= \prod_{i=1}^n \big(P_{X_i|Z^{i-1}=z^{i-1}}\, K_i\big)(z_i) \qquad (32)$$
$$= \prod_{i=1}^n \big(P_1 K_i\big)(z_i). \qquad (33)$$
Having this in mind, we can write
$$\mathsf{TV}^2(M_1^n, M_2^n) \overset{(a)}{\leq} \frac{1}{2}\, D_{\mathsf{KL}}(M_1^n\|M_2^n) \qquad (34)$$
$$\overset{(b)}{=} \frac{1}{2}\sum_{i=1}^n D_{\mathsf{KL}}\big(P_1 K_i\,\|\,P_2 K_i\big) \qquad (35)$$
$$\overset{(c)}{\leq} \frac{n}{2}\,\varphi(\varepsilon,\delta)\, D_{\mathsf{KL}}(P_1\|P_2), \qquad (36)$$
where (a) follows from Pinsker's inequality, (b) is due to the chain rule of KL divergence, and (c) is an application of Lemma 1. Plugging (36) into (6), we obtain the desired result.

Proof of Corollary 1.
Fix $\omega \in (0, 1]$ and consider two distributions $P_1$ and $P_2$ on $\{-\omega^{-1/k}, 0, \omega^{-1/k}\}$ defined as
$$P_1(-\omega^{-1/k}) = \omega,\quad P_1(0) = 1-\omega, \qquad\text{and}\qquad P_2(\omega^{-1/k}) = \omega,\quad P_2(0) = 1-\omega.$$
It can be verified that both $P_1$ and $P_2$ belong to $\mathcal{P}_k$. Note that $\ell(\theta(P_1), \theta(P_2)) = 2\omega^{(k-1)/k}$. Let $M_1^n = P_1^{\otimes n} K^n$ and $M_2^n = P_2^{\otimes n} K^n$ be the corresponding output distributions of the mechanism $K^n = K_1 \times \cdots \times K_n$, the composition of the mechanisms $K_i$. Le Cam's bound for the $\ell$-metric yields
$$R_n(\mathcal{P}_k, \ell, \varepsilon, \delta) \geq \frac{\omega^{(k-1)/k}}{2}\big(1 - \mathsf{TV}(M_1^n, M_2^n)\big) \geq \frac{\omega^{(k-1)/k}}{2}\big(1 - H(M_1^n, M_2^n)\big), \qquad (37)$$
where the last inequality follows from the fact that $\mathsf{TV}(P,Q) \leq H(P,Q)$, with $H(P,Q)$ being the Hellinger distance. Notice that $M_1^n = \prod_{i=1}^n (P_1 K_i)$ and $M_2^n = \prod_{i=1}^n (P_2 K_i)$, where each $K_i$, $i \in [n]$, is $(\varepsilon,\delta)$-LDP. It is well known that
$$H^2\Big(\prod_{i=1}^n P_i,\; \prod_{i=1}^n Q_i\Big) = 2 - 2\prod_{i=1}^n \Big(1 - \frac{H^2(P_i, Q_i)}{2}\Big).$$
Thus,
$$H^2(M_1^n, M_2^n) = 2 - 2\prod_{i=1}^n \Big(1 - \frac{H^2(P_1 K_i, P_2 K_i)}{2}\Big) \leq 2 - 2\prod_{i=1}^n \Big(1 - \frac{\varphi(\varepsilon,\delta)}{2} H^2(P_1, P_2)\Big) = 2 - 2\Big(1 - \frac{\varphi(\varepsilon,\delta)}{2} H^2(P_1, P_2)\Big)^n = 2 - 2\big(1 - \omega\varphi(\varepsilon,\delta)\big)^n. \qquad (38)$$
Hence, we obtain
$$\mathsf{TV}(M_1^n, M_2^n) \leq \sqrt{2 - 2\big(1 - \omega\varphi(\varepsilon,\delta)\big)^n}. \qquad (39)$$
Plugging (38) into (37), we obtain
$$R_n(\mathcal{P}_k, \ell, \varepsilon, \delta) \geq \frac{\omega^{(k-1)/k}}{2}\Big[1 - \sqrt{2}\sqrt{1 - \big(1 - \omega\varphi(\varepsilon,\delta)\big)^n}\Big]. \qquad (40)$$
Now, choose $\omega = \min\Big\{1,\; \frac{1}{\varphi(\varepsilon,\delta)}\Big[1 - \Big(\frac{7}{8}\Big)^{1/n}\Big]\Big\}$. Notice that we assume $\delta > 0$ and hence $\varphi(\varepsilon,\delta) > 0$ regardless of $\varepsilon$. Plugging this choice of $\omega$ into the above bound, we obtain
$$R_n(\mathcal{P}_k, \ell, \varepsilon, \delta) \gtrsim \big(\varphi(\varepsilon,\delta)\big)^{-\frac{k-1}{k}}\Big[1 - \Big(\frac{7}{8}\Big)^{1/n}\Big]^{\frac{k-1}{k}} \gtrsim \big(n\,\varphi(\varepsilon,\delta)\big)^{-\frac{k-1}{k}}. \qquad (41)$$

Proof of Lemma 4.
Note that we have the Markov chain $V - X^n - Z^n$. It has been shown in [47, Problem 15.12] (see also [34]) that for any channel $P_{B|A}$ connecting a random variable $A$ to $B$, we have
$$\eta_{\mathsf{KL}}(P_{B|A}) = \sup_{P_{AU}:\, U - A - B} \frac{I(U;B)}{I(U;A)}. \qquad (42)$$
Replacing $A$ and $B$ with $X^n$ and $Z^n$, respectively, in the above equation, we obtain
$$I(Z^n; V) \leq \eta_{\mathsf{KL}}(K^{\otimes n})\, I(X^n; V) \leq \eta_{\mathsf{KL}}(K^{\otimes n})\, \frac{n}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} D_{\mathsf{KL}}(P_v\|\bar{P}),$$
where $K^{\otimes n} = P_{Z^n|X^n}$ and $\bar{P} = \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} P_v$; here we used $I(X^n;V) \leq n\, I(X;V)$ and the fact that $I(X;V) = \frac{1}{|\mathcal{V}|}\sum_{v} D_{\mathsf{KL}}(P_v\|\bar{P})$ for uniform $V$. The desired result then follows from Lemma 2 and the convexity of the KL divergence.

Proof of Corollary 2.
The proof strategy is as follows: we first construct a set of probability distributions $\{P_v\}$ for $v$ taking values in a finite set $\mathcal{V}$, and then apply Fano's inequality (9) where $V$ is a uniform random variable on $\mathcal{V}$. Duchi et al. [10, Lemma 6] showed that, for some integer $k \in [d]$, there exists a subset $\mathcal{V}_k$ of the $k$-dimensional hypercube $\{-1,+1\}^k$ satisfying $\|v - v'\|_1 \geq k/2$ for each $v, v' \in \mathcal{V}_k$ with $v \neq v'$, while $|\mathcal{V}_k|$ is at least $\lceil e^{k/8}\rceil$. If $k < d$, one can extend $\mathcal{V}_k \subset \mathbb{R}^k$ to a subset of $\mathbb{R}^d$ by considering $\mathcal{V} = \mathcal{V}_k \times \{0\}^{d-k}$. Fix $\omega \in (0, 1]$ and define a distribution $P_v \in \mathcal{P}(\mathbb{B}_d(r))$ for $v \in \mathcal{V}$ as follows: choose an index $j \in [k]$ uniformly and set $P_v(r e_j) = \frac{1+\omega v_j}{2}$ and $P_v(-r e_j) = \frac{1-\omega v_j}{2}$, where $e_j$ is the standard basis vector in $\mathbb{R}^d$. Given $v \in \mathcal{V}_k$, let $X \sim P_v$ be a random variable taking values in $\{\pm r e_j\}_{j=1}^k$ and $X^n$ be an i.i.d. sample of $X$. Furthermore, as before, let $Z^n$ be a privatized sample of $X^n$ obtained by $K^{\otimes n}$ with $K$ being an $(\varepsilon,\delta)$-LDP mechanism. To apply Fano's inequality, we first need to bound $I(Z^n;V)$. According to Lemma 4, we have
$$I(Z^n;V) \leq \varphi_n(\varepsilon,\delta)\,\frac{n}{|\mathcal{V}_k|^2}\sum_{v,v'} D_{\mathsf{KL}}(P_v\|P_{v'}) \qquad\text{and}\qquad I(Z^n;V) \leq \varphi_n(\varepsilon,\delta)\, I(X^n;V). \qquad (43)$$
Hence, bounding $I(Z^n;V)$ reduces to bounding $I(X^n;V)$. To this end, first notice that $I(X^n;V) \leq n\, I(X;V)$. Let $K$ be a uniform random variable on $[k]$, independent of $V$, that chooses the coordinate of $V$. Note that $K$ can be determined by $X$, and hence
$$I(X;V) = I(X,K;V) = I(X;V|K) \leq \log 2 - h_b\Big(\frac{1-\omega}{2}\Big) \leq \omega^2 \log 2,$$
where the last inequality follows from the fact that $h_b(a) \geq 4a(1-a)\log 2$ for $a \in [0,1]$, due to the concavity of entropy. Consequently, we can write
$$I(Z^n;V) \leq n\,\omega^2\,\varphi_n(\varepsilon,\delta)\log 2. \qquad (44)$$
Applying Fano's inequality, we obtain
$$R_n(\mathcal{P}, \ell, \varepsilon, \delta) \geq \frac{r^2\omega^2}{k}\Big[1 - \frac{\big(1 + n\,\omega^2\varphi_n(\varepsilon,\delta)\big)\log 2}{\log|\mathcal{V}|}\Big] \qquad (45)$$
$$\geq \frac{r^2\omega^2}{k}\Big[1 - \frac{8\big(1 + n\,\omega^2\varphi_n(\varepsilon,\delta)\big)\log 2}{k}\Big]. \qquad (46)$$
Setting $\omega = \min\big\{1, \tfrac{k}{\sqrt{n\varphi_n(\varepsilon,\delta)}}\big\}$ and assuming $k \geq 16$, we can write
$$R_n(\mathcal{P}, \ell, \varepsilon, \delta) \gtrsim r^2 \max_{k\in[d]} \min\Big\{\frac{1}{k},\; \frac{k}{n\,\varphi_n(\varepsilon,\delta)}\Big\}. \qquad (47)$$
By choosing $k = \min\big\{\sqrt{n\,\varphi_n(\varepsilon,\delta)},\, d\big\}$, we obtain
$$R_n(\mathcal{P}, \ell, \varepsilon, \delta) \gtrsim r^2 \min\Big\{\frac{1}{\sqrt{n\,\varphi_n(\varepsilon,\delta)}},\; \frac{d}{n\,\varphi_n(\varepsilon,\delta)}\Big\}. \qquad (48)$$

Proof of Theorem 2.
Let $\hat\Theta = \Psi(Z^n)$ be an estimate of $\Theta$ for some $\Psi$, and let $p_\zeta := P_{\Theta\hat\Theta}\big(\ell(\Theta,\hat\Theta)\leq\zeta\big)$ and $q_\zeta := (P_\Theta P_{\hat\Theta})\big(\ell(\Theta,\hat\Theta)\leq\zeta\big)$, i.e., $p_\zeta$ and $q_\zeta$ correspond to the probability of the event $\{\ell(\Theta,\hat\Theta)\leq\zeta\}$ under the joint and product distributions, respectively. By definition, we have for any $\gamma \geq 1$
$$I_\gamma(\Theta;\hat\Theta) = E_\gamma\big(P_{\Theta\hat\Theta}\,\|\,P_\Theta P_{\hat\Theta}\big) = \sup_{A\subset\mathcal{T}\times\mathcal{T}}\big[P_{\Theta\hat\Theta}(A) - \gamma (P_\Theta P_{\hat\Theta})(A)\big] \geq p_\zeta - \gamma q_\zeta \geq p_\zeta - \gamma L(\zeta),$$
where the last inequality follows from the fact that $q_\zeta \leq L(\zeta)$, which can be shown as follows:
$$q_\zeta = \int_{\mathcal{T}}\int_{\mathcal{T}} \mathbb{1}\{\ell(\theta,\hat\theta)\leq\zeta\}\, P_\Theta(\mathrm{d}\theta)\, P_{\hat\Theta}(\mathrm{d}\hat\theta) \leq \sup_{t\in\mathcal{T}} \int_{\mathcal{T}} \mathbb{1}\{\ell(\theta,t)\leq\zeta\}\, P_\Theta(\mathrm{d}\theta) = L(\zeta).$$
Recalling that $\Pr\big(\ell(\Theta,\hat\Theta)>\zeta\big) = 1 - p_\zeta$, the above thus implies
$$\Pr\big(\ell(\Theta,\hat\Theta)>\zeta\big) \geq 1 - I_\gamma(\Theta;\hat\Theta) - \gamma L(\zeta). \qquad (49)$$
Since, by Markov's inequality, $\mathbb{E}\big[\ell(\Theta,\hat\Theta)\big] \geq \zeta \Pr\big(\ell(\Theta,\hat\Theta)\geq\zeta\big)$, we can write by setting $\gamma = e^\varepsilon$
$$R^{\mathsf{Bayes}}_n(P_\Theta, \ell, \varepsilon, \delta) \geq \zeta\Big[1 - I_{e^\varepsilon}(\Theta;\hat\Theta) - e^\varepsilon L(\zeta)\Big] \geq \zeta\Big[1 - I_{e^\varepsilon}(\Theta;Z^n) - e^\varepsilon L(\zeta)\Big],$$
where the second inequality comes from the data processing inequality for $I_\gamma$. To further lower bound the right-hand side, we write
$$I_{e^\varepsilon}(\Theta;Z^n) = \int_{\mathcal{T}} E_{e^\varepsilon}\big(P_{Z^n|\Theta=\theta}\,\|\,P_{Z^n}\big)\, P_\Theta(\mathrm{d}\theta) \leq \eta_{e^\varepsilon}(P_{Z^n|X^n}) \int_{\mathcal{T}} E_{e^\varepsilon}\big(P_{X^n|\Theta=\theta}\,\|\,P_{X^n}\big)\, P_\Theta(\mathrm{d}\theta) = \eta_{e^\varepsilon}(P_{Z^n|X^n})\, I_{e^\varepsilon}(\Theta;X^n),$$
where the inequality follows from the definition of the contraction coefficient. When $n = 1$, we have $\eta_{e^\varepsilon}(K) \leq \delta$ as $K = P_{Z|X}$ is assumed to be $(\varepsilon,\delta)$-LDP. For $n > 1$, we invoke Lemma 2 to obtain $\eta_{e^\varepsilon}(P_{Z^n|X^n}) \leq \varphi_n(\varepsilon,\delta)$.

Proof of Corollary 4.
Let $\beta_n(\alpha) := \beta_{\infty,n}(\alpha)$ be the non-private trade-off between the type I and type II error probabilities (i.e., $Z^n = X^n$). According to the Chernoff–Stein lemma (see, e.g., [45, Theorem 11.8.3]), we have
$$\lim_{n\to\infty} \frac{1}{n}\log \beta_n(\alpha) = -D_{\mathsf{KL}}(P_0\|P_1). \qquad (50)$$
Assume now that $Z^n$ is the output of $K^{\otimes n}$ for an $(\varepsilon,\delta)$-LDP mechanism $K$. According to (50), we obtain that
$$\lim_{n\to\infty} \frac{1}{n}\log \beta^{\varepsilon,\delta}_n(\alpha) = -\sup_{K\in\mathcal{Q}_{\varepsilon,\delta}} D_{\mathsf{KL}}(P_0 K\,\|\,P_1 K). \qquad (51)$$
Applying Lemma 1, we obtain the desired result.

Proof of Corollary 5.
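The mutual-information bound proved below can also be checked numerically: for binary randomized response with a uniform input, $I(X;Z)$ indeed stays below $\varphi(\varepsilon,0)\, H(X)$. Helper names are ours; a sketch, not part of the proof.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def mutual_info(p, K):
    # I(X;Z) = H(Z) - H(Z|X) for input distribution p and kernel K
    pz = [sum(pi * K[i][z] for i, pi in enumerate(p)) for z in range(len(K[0]))]
    return entropy(pz) - sum(pi * entropy(K[i]) for i, pi in enumerate(p))

eps = 1.0
a = math.exp(eps) / (1.0 + math.exp(eps))
K = [[a, 1.0 - a], [1.0 - a, a]]    # randomized response, (eps, 0)-LDP
phi = 1.0 - math.exp(-eps)          # phi(eps, 0)
p = [0.5, 0.5]
assert 0.0 <= mutual_info(p, K) <= phi * entropy(p) + 1e-12
```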
Consider the (trivial) Markov chain $X - X - Z$. According to (42), we can write
$$I(X;Z) \leq \eta_{\mathsf{KL}}(K)\, I(X;X) = \eta_{\mathsf{KL}}(K)\, H(X).$$
The desired result then follows from Lemma 1.