Regularized Loss Minimizers with Local Data Perturbation: Consistency and Data Irrecoverability
Zitao Li, Department of Computer Science, Purdue University, West Lafayette, IN 47907, [email protected]
Jean Honorio, Department of Computer Science, Purdue University, West Lafayette, IN 47907, [email protected]
Abstract
We introduce a new concept, data irrecoverability, and show that the well-studied concept of data privacy implies data irrecoverability. We show that several regularized loss minimization problems can use locally perturbed data with theoretical guarantees of generalization, i.e., loss consistency. Our results quantitatively connect the convergence rates of the learning problems to the impossibility, for any adversary, of recovering the original data from the perturbed observations. In addition, we show several examples where the convergence rates with perturbed data exceed the convergence rates with original data only by a constant factor related to the amount of perturbation, i.e., noise.
In recent years, as machine learning algorithms are gradually embedded into different on-line services, there is increasing concern about privacy leakage from service providers. On the other hand, the enhancement of user experience and the promotion of advertisement must rely on user data. Thus, there is a natural conflict between privacy and usefulness of data. Whether data can be protected, while remaining useful, has become an interesting topic.

To resolve this conflict, several frameworks have been proposed. Since 2006, differential privacy [Dwork et al., 2006, Dwork and Lei, 2009] has been considered as a formal definition of privacy. The core idea of differential privacy is to eliminate the effect of individual records on the output of learning algorithms, by introducing randomization into the process. There is already a large number of differentially private algorithms for different purposes [Dwork, 2008, Wainwright et al., 2012, Abadi et al., 2016, Jain and Thakurta, 2014, Bassily et al., 2014, Chaudhuri et al., 2011]. More recently, local privacy [Duchi et al., 2013, Near, 2018, Erlingsson et al., 2014], a stronger setting to protect individuals' privacy, has been proposed. In local privacy, data providers randomize data before releasing it to a learning algorithm. Locally private algorithms for machine learning problems have been further developed in [Smith et al., 2017, Kasiviswanathan and Jin, 2016].

In this paper, we discuss the effect of locally perturbed data on several problems in machine learning that can be modeled as the minimization of an empirical loss, with a finite number of training samples randomly drawn from some unknown data distribution. In these problems, the expected loss is usually defined as the expected value of the empirical loss, with respect to the data distribution. The minimizers of the empirical loss and the expected loss are called the empirical minimizer and the true hypothesis, respectively.
One of the most important measurements of learning success is loss consistency, which describes the difference between the expected loss of the empirical minimizer and that of the true hypothesis. In [Honorio and Jaakkola, 2014], a general framework was proposed to analyze loss consistency for various problems, including the estimation of exponential family distributions, generalized linear models, matrix factorization, nonparametric regression and max-margin matrix factorization. Additionally, [Honorio and Jaakkola, 2014] also used loss consistency to establish other forms of consistency as corollaries. That is, loss consistency implies norm consistency (small distance between the empirical minimizer and the true hypothesis), sparsistency (recovery of the sparsity pattern of the true hypothesis) and sign consistency (recovery of the signs of the true hypothesis).
Contributions.
We generalize the concept of privacy by defining the concept of data irrecoverability. We show that under our framework, the convergence rates of several learning problems with perturbed data are similar to the convergence rates with original data. More specifically, our contributions can be summarized as follows:

• We define the concept of data irrecoverability, and show an intuitive relationship between privacy and data irrecoverability (Theorem 1).

• We show how perturbed data affect the loss consistency of several problems, by extending the assumptions and the framework of [Honorio and Jaakkola, 2014]. That is, we prove a perturbed loss consistency guarantee for regularized loss minimization (Theorem 2).

• Our framework allows us to analyze several empirical loss minimization problems, such as maximum likelihood estimation for exponential family distributions, generalized linear models with fixed design, exponential-family PCA, nonparametric generalized regression and max-margin matrix factorization. We find that introducing noise with dimension-independent variance can make it difficult enough to recover the original data, while only increasing the convergence rate within a constant factor (Theorems 5 to 13) with respect to the results reported in Honorio and Jaakkola [2014].
In this section, we first formalize our definition of perturbed data and the irrecoverability of perturbed data. Then we define the empirical loss minimization problems and our main assumptions.
First we show a general definition of privacy which is used in both differential and local privacy.
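As a concrete instance (an illustration, not part of the paper's development), the classical Laplace randomizer M(x) = x + Lap(Δ/ε) satisfies the (ε, 0) case of the privacy definition below. The sketch checks numerically, on a grid of outputs, that the output-density ratio for two inputs at distance Δ = 1 stays within e^ε:

```python
import math

def laplace_pdf(z, x, b):
    """Density of the output M(x) = x + Laplace(0, b), evaluated at z."""
    return math.exp(-abs(z - x) / b) / (2.0 * b)

def worst_case_ratio(x, x_prime, b, grid):
    """sup_z pdf(z | x) / pdf(z | x'), approximated on a grid of outputs z."""
    return max(laplace_pdf(z, x, b) / laplace_pdf(z, x_prime, b) for z in grid)

# Inputs bounded in [0, 1], so the sensitivity is 1; scale b = 1/eps gives eps-privacy.
eps = 0.5
b = 1.0 / eps
grid = [i / 100.0 for i in range(-500, 501)]
ratio = worst_case_ratio(0.0, 1.0, b, grid)
assert ratio <= math.exp(eps) + 1e-9  # (eps, 0)-privacy: ratio bounded by e^eps
```

The bound is tight here: the ratio equals e^ε for any output z ≤ min(x, x′).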
Definition 1 (Privacy). An algorithm M : X → Z satisfies (ε, δ)-privacy, where ε > 0 and δ ∈ [0, 1), if and only if for any input x, x′ ∈ X and S ∈ σ(Z), we have

P_M[M(x) ∈ S] ≤ e^ε P_M[M(x′) ∈ S] + δ,

where P_M denotes that the probability is over random draws made by the algorithm M, and σ(Z) denotes a σ-algebra on Z.

In the above definition, differential privacy assumes that x and x′ are datasets that differ in a single data point. While M is a general mechanism in differential privacy, in local privacy M is a particular mechanism that adds noise to the data before releasing it to the learner.

Data irrecoverability.
The definition of privacy can be considered as a forward mapping from data to the output of the algorithm. Here we analyze the backward mapping. That is, we focus on how likely the original data can be recovered from the algorithm output. Next we provide our formal definition.
Definition 2 (Data Irrecoverability). For any privacy-preserving algorithm M : X → Z and any conceivable adversary A : Z → X, we say that the original data X is irrecoverable if the following holds:

inf_A P_{X,M}[A(M(X)) ≠ X] ≥ γ,

for some constant γ ∈ (0, 1).

Our definition of data irrecoverability is more general than that of privacy. We can show that (ε, δ)-privacy implies data irrecoverability. Thus, in this case, our Definition 2 is more general than Definition 1.

Theorem 1 (Privacy implies data irrecoverability). For any privacy-preserving algorithm M : X → Z that satisfies (ε, δ)-privacy, and any conceivable adversary A : Z → X, data irrecoverability follows. That is:

inf_A P_{X,M}[A(M(X)) ≠ X] ≥ 1 − (b(ε, δ) + log 2) / H(X),

where H(X) is the entropy of X and

b(ε, δ) = inf_{x′ ∈ X} log ∫_{z ∈ Z} (e^ε P_M(M(x′) = z) + δ) dz,

provided that H(X) > b(ε, δ) + log 2. Note that b can be understood as an infimum of a log-partition function.

Proof. We invoke Definition 1 for sets S of size 1. In this case we have S = {z} for z ∈ Z, and therefore M(x) ∈ S is equivalent to M(x) = z. We can describe the data process with the Markov chain X → M(X) → X̂, where X̂ is the recovered data. Next, for a fixed and arbitrary x′ ∈ X, we define the distribution Q as follows:

Q(z) = (e^ε P_M(M(x′) = z) + δ) / ∫_{z′ ∈ Z} (e^ε P_M(M(x′) = z′) + δ) dz′.

The denominator is a partition function. It is easy to see that Q is a valid distribution since ∫_{z ∈ Z} Q(z) dz = 1. Then we can bound the mutual information between X and M(X)
in the following way:

I(X; M(X)) ≤ (1/|X|) Σ_{x ∈ X} KL(P_M(M(x)) ‖ Q)
= (1/|X|) Σ_{x ∈ X} ∫_{z ∈ Z} P_M(M(x) = z) log(P_M(M(x) = z) / Q(z)) dz
≤ (1/|X|) Σ_{x ∈ X} ∫_{z ∈ Z} P_M(M(x) = z) log((e^ε P_M(M(x′) = z) + δ) / Q(z)) dz
= (1/|X|) Σ_{x ∈ X} ∫_{z ∈ Z} P_M(M(x) = z) dz · log ∫_{z′ ∈ Z} (e^ε P_M(M(x′) = z′) + δ) dz′
= log ∫_{z′ ∈ Z} (e^ε P_M(M(x′) = z′) + δ) dz′.

The first inequality comes from equation 5.1.4 in [Duchi, 2016]. The second inequality comes from the definition of (ε, δ)-privacy. Since x′ is an arbitrary choice in our argument, we can take the infimum with respect to x′ and get a tight bound on the mutual information:

I(X; M(X)) ≤ inf_{x′ ∈ X} log ∫_{z ∈ Z} (e^ε P_M(M(x′) = z) + δ) dz = b(ε, δ).

Then, by Fano's inequality [Cover and Thomas, 2012], we have:

inf_A P_{X,M}[A(M(X)) ≠ X] ≥ 1 − (I(X; M(X)) + log 2) / H(X) ≥ 1 − (b(ε, δ) + log 2) / H(X),

and we prove our claim.

Corollary 1.
For any privacy-preserving algorithm M : X → Z that satisfies (ε, 0)-privacy, and any conceivable adversary A : Z → X, data irrecoverability follows. That is:

inf_A P_{X,M}[A(M(X)) ≠ X] ≥ 1 − (ε + log 2) / H(X),

where H(X) is the entropy of X, provided that H(X) > ε + log 2.

Proof. When δ = 0, since ∫_{z ∈ Z} P_M(M(x′) = z) dz = 1 for all x′ ∈ X, we have:

b(ε, 0) = inf_{x′ ∈ X} log ∫_{z ∈ Z} e^ε P_M(M(x′) = z) dz = log ( e^ε ∫_{z ∈ Z} P_M(M(x′) = z) dz ) = ε.

By Theorem 1, we prove our claim.

In the particular case of local privacy, we can capture the randomness of the algorithm M(·) by denoting M : X × H → Z, where M also takes a random parameter η ∈ H. In order to quantify the noise, we denote the variance of the noise distribution Q as σ_η². To formalize the empirical loss minimization problems, we define the problems as a tuple
Π = (H, D, Q, L̂, R) for a hypothesis class H, a data distribution D, a noise distribution Q, an empirical loss L̂ and a regularizer R. For simplicity, we assume that H is a normed vector space. Let θ be a hypothesis such that θ ∈ H. For the original empirical problem (without noise), let L̂(θ) denote the empirical loss of n samples from an unknown data distribution D; and let L(θ) = E_D[L̂(θ)] denote the expected loss for data from distribution D.

Furthermore, let ψ(x, η) denote a mapping X × H → Z. Then, we let L̂_η(θ) denote the empirical loss of the n perturbed samples ψ(x^(1), η^(1)), . . . , ψ(x^(n), η^(n)), where x^(1), . . . , x^(n) are samples from the unknown data distribution D, and η^(1), . . . , η^(n) are noise from distribution Q. Similarly, we let L_η(θ) = E_{D,Q}[L̂_η(θ)] denote the expected loss of perturbed data, where the expectation is taken with respect to both the data distribution D and the noise distribution Q.

Let R(θ) be a regularizer and λ_n > 0 be a penalty parameter. The empirical minimizer θ̂* and the perturbed empirical minimizer θ̂*_η are given by:

θ̂* = argmin_{θ ∈ H} L̂(θ) + λ_n R(θ),   (1)

θ̂*_η = argmin_{θ ∈ H} L̂_η(θ) + λ_n R(θ).   (2)

We use a relaxed optimality assumption, defining a ξ-approximate empirical minimizer θ̂ and a perturbed ξ-approximate empirical minimizer θ̂_η with the following property for ξ ≥ 0:

L̂(θ̂) + λ_n R(θ̂) ≤ ξ + min_{θ ∈ H} L̂(θ) + λ_n R(θ),   (3)

L̂_η(θ̂_η) + λ_n R(θ̂_η) ≤ ξ + min_{θ ∈ H} L̂_η(θ) + λ_n R(θ).   (4)

The true hypothesis is defined as:

θ* = argmin_{θ ∈ H} L(θ),   (5)

while the perturbed true hypothesis is defined as:

θ*_η = argmin_{θ ∈ H} L_η(θ).   (6)

To give a simple example of an empirical loss, if ℓ(x | θ) is the loss of sample x given θ, then the empirical loss is L̂(θ) = (1/n) Σ_i ℓ(x^(i) | θ), where the samples x^(1), . . . , x^(n) are drawn from a distribution D.
Then, L(θ) = E_{x∼D}[ℓ(x | θ)] is the expected loss of x drawn from a data distribution D. If we use perturbed data ψ(x^(1), η^(1)), . . . , ψ(x^(n), η^(n)) and ℓ(ψ(x, η) | θ) is the loss of ψ(x, η) given θ, then the perturbed empirical loss becomes L̂_η(θ) = (1/n) Σ_i ℓ(ψ(x^(i), η^(i)) | θ). We will analyze how perturbed data work with the loss functions of different learning models in Section 3.

Loss consistency is defined as an upper bound of:

L(θ̂) − L(θ*).   (7)

Similarly, in this paper, we define perturbed loss consistency as an upper bound of:

L(θ̂_η) − L(θ*).   (8)

In the following, we introduce some reasonable assumptions to justify loss consistency and perturbed loss consistency. These assumptions also characterize the subset of machine learning problems analyzed in this paper. Next we present our main assumptions.
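To make these definitions concrete, here is a minimal numerical sketch (not from the paper; the Gaussian mean-estimation setting and all constants are illustrative assumptions): take the squared loss ℓ(x | θ) = (x − θ)², the additive perturbation ψ(x, η) = x + η with Gaussian noise, and no regularizer (λ_n = 0). Then θ̂_η is the mean of the perturbed samples, θ* = μ, and the excess expected loss L(θ̂_η) − L(θ*) equals (θ̂_η − μ)², which vanishes as n grows:

```python
import random

random.seed(0)
mu, sigma_x, sigma_eta = 1.5, 1.0, 2.0  # illustrative constants

def excess_loss(n):
    """L(theta_hat_eta) - L(theta*) for squared loss; equals (theta_hat_eta - mu)^2."""
    xs = [random.gauss(mu, sigma_x) for _ in range(n)]
    zs = [x + random.gauss(0.0, sigma_eta) for x in xs]  # psi(x, eta) = x + eta
    theta_hat_eta = sum(zs) / n                          # perturbed empirical minimizer
    return (theta_hat_eta - mu) ** 2

assert excess_loss(100_000) < 0.01  # loss consistency despite the perturbation
```

Since E[η] = 0, the perturbation is unbiased for the sufficient statistic here, which is why consistency survives; the noise only inflates the variance of the estimate.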
Scaled Uniform Convergence.
Our first assumption is scaled uniform convergence, in contrast to regular uniform convergence. Although both scaled uniform convergence and regular uniform convergence can be used to describe the difference between the empirical and expected loss for all θ, regular uniform convergence provides a bound that is the same for all θ, while scaled uniform convergence provides a bound that depends on a function of θ. We present the assumption formally in what follows:

Assumption A (Scaled uniform convergence). Let c : H → [0, +∞) be the scale function. The empirical loss L̂_η is close to its expected value L_η, such that their absolute difference is proportional to the scale of the hypothesis θ. That is, with probability at least 1 − δ over draws of n samples:

(∀θ ∈ H) |L̂_η(θ) − L_η(θ)| ≤ ε_{n,δ} c(θ),   (9)

where the rate ε_{n,δ} is nonincreasing with respect to n and δ. Furthermore, assume lim_{n→+∞} ε_{n,δ} = 0 for δ ∈ (0, 1).

Super-Scale Regularizers.
Next, we borrow the super-scale regularizer assumption from [Honorio and Jaakkola, 2014], which defines regularizers lower-bounded by a scale function.
Assumption B (Super-scale regularization [Honorio and Jaakkola, 2014]). Let c : H → [0, +∞) be the scale function. Let r : [0, +∞) → [0, +∞) be a function such that:

(∀z ≥ 0) z ≤ r(z).   (10)

The regularizer R is bounded as:

(∀θ ∈ H) r(c(θ)) ≤ R(θ) < +∞.   (11)

Note that the above assumption implies c(θ) ≤ R(θ).

Bounded Perturbed Loss.
Perturbed loss consistency describes the difference between the expected loss of the perturbed ξ-approximate empirical minimizer and that of the true hypothesis. Next, we introduce an assumption on the difference between the expected loss for perturbed data and that for original data.

Assumption C (Bounded perturbed loss). Let c : H → [0, +∞) be the scale function. The expected loss of the perturbed data L_η is close to the expected loss of the original data L, such that their absolute difference is proportional to the scale of the hypothesis θ. That is:

(∀θ ∈ H) |L_η(θ) − L(θ)| ≤ ε′_n c(θ).

In this part, we formally show perturbed loss consistency, a worst-case guarantee of the difference between the expected loss under the original data distribution D of the ξ-approximate empirical minimizer from perturbed data, θ̂_η, and that of the true hypothesis θ*.

Theorem 2 (Perturbed loss consistency). Under Assumption A with rate ε_{n,δ}, Assumption B for regularizers, and Assumption C with rate ε′_n, perturbed regularized loss minimization is loss-consistent. That is, for α ≥ 1 and λ_n = αε_{n,δ}, with probability at least 1 − δ:

L(θ̂_η) − L(θ*) ≤ ε_{n,δ}(α R(θ*_η) + c(θ*_η)) + ε′_n c(θ*) + ξ,   (12)

provided that ε′_n ≤ ε_{n,δ}.

In the following section, we show that the problems that we study either have a larger ε_{n,δ} than the ones in [Honorio and Jaakkola, 2014] and ε′_n = 0, or have the same ε_{n,δ} as the ones in [Honorio and Jaakkola, 2014] and ε′_n > 0. Thus, loss consistency for perturbed data leads to a larger upper bound when compared to using original data. Fortunately, we show that the difference is only in constant factors.

In this section, we show that several popular problems can be analyzed with our novel framework.
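The constant-factor claim can be illustrated with a hypothetical simulation (not from the paper; the mean-estimation setting and constants are assumptions): for mean estimation with unbiased additive noise, the expected excess loss grows from σ_x²/n to (σ_x² + σ_η²)/n, i.e., by the constant factor (σ_x² + σ_η²)/σ_x²:

```python
import random

random.seed(1)
mu, sigma_x, sigma_eta, n, trials = 0.0, 1.0, 2.0, 50, 4000

def mean_sq_error(perturbed):
    """Monte Carlo estimate of E[(theta_hat - mu)^2] for the sample mean."""
    total = 0.0
    for _ in range(trials):
        xs = [random.gauss(mu, sigma_x)
              + (random.gauss(0.0, sigma_eta) if perturbed else 0.0)
              for _ in range(n)]
        total += (sum(xs) / n - mu) ** 2
    return total / trials

ratio = mean_sq_error(True) / mean_sq_error(False)
# Expected ratio: (sigma_x^2 + sigma_eta^2) / sigma_x^2 = 5
assert 3.0 < ratio < 8.0
```

Both estimators converge at the same 1/n rate; the perturbation changes only the leading constant, matching the discussion above.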
This includes maximum likelihood estimation for exponential family distributions, generalized linear models with fixed design, exponential-family PCA, nonparametric generalized regression and max-margin matrix factorization. For the first four examples, in Subsections 3.1 to 3.4, we focus on a special class of algorithms that perform unbiased data perturbation. In Subsection 3.5, we focus on an algorithm that performs a sign-flipping data perturbation.
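Both kinds of perturbation trade estimation accuracy for irrecoverability. As a hypothetical illustration of Definition 2 (the one-bit-per-entry setting and the sign decoder are assumptions for this sketch, not the paper's constructions): for a dataset of n independent uniform bits observed through additive Gaussian noise, the best adversary decodes each bit by its sign, and its probability of failing to recover the whole dataset approaches 1 as σ_η grows:

```python
import math

def flip_prob(sigma_eta):
    """Error probability of the optimal (sign) adversary for one bit x in {-1,+1}
    observed as z = x + N(0, sigma_eta^2): equals Phi(-1/sigma_eta)."""
    return 0.5 * (1.0 + math.erf(-1.0 / (sigma_eta * math.sqrt(2.0))))

def recovery_failure(n, sigma_eta):
    """P[the best adversary fails to recover all n i.i.d. bits exactly]."""
    return 1.0 - (1.0 - flip_prob(sigma_eta)) ** n

# More noise -> higher failure probability; with enough noise the whole
# dataset is irrecoverable with probability gamma close to 1.
assert recovery_failure(100, 2.0) > recovery_failure(100, 0.5)
assert recovery_failure(100, 2.0) > 0.99
```

This is the mechanism behind the minimum-noise conditions in the theorems below: even a small per-entry error probability compounds over n entries, so a dimension-independent noise variance already makes exact reconstruction of the dataset overwhelmingly unlikely.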
Definition 3 (Unbiased Data Perturbation). Let ψ(x, η) denote a mapping X × H → Z, where x ∈ X is the original data sample drawn from D and η ∈ H is the noise drawn from Q. We say that the function ψ(x, η) is unbiased if it satisfies the following restriction:

E_Q[ψ(x, η)] = t(x),   (13)

for all x ∈ X, where t(x) is the sufficient statistic for a particular machine learning problem.

All the problems mentioned above fulfill Assumption C. We further analyze the new convergence rate ε_{n,δ} based on perturbed data. In addition, we also adopt an information-theoretic approach to show how much noise is necessary to guarantee data irrecoverability. Fano's inequality is usually used for a restricted ensemble, i.e., a subclass of the original class of interest. If a subclass is difficult for data denoising, then the original class will be at least as difficult for data denoising. The use of restricted ensembles is customary for information-theoretic lower bounds [Santhanam and Wainwright, 2012, Wang et al., 2010].

Table 1 summarizes the convergence rates achieved for several examples using our proposed framework. Table 1 also shows, in its last column, the minimum noise variance needed to achieve data irrecoverability.

Super-Scale Regularizers.
In Table 1, we show the convergence rates for different regularizers, which are shown to fulfill Assumption B in [Honorio and Jaakkola, 2014]. We can categorize the regularizers in the following way:

• Norm regularizers: This includes the ℓ₁-norm [Ravikumar et al., 2008] and the k-support norm [Argyriou et al., 2012] for sparsity promotion, multitask ℓ_{1,2} and ℓ_{1,∞}-norms for overlapping groups [Jacob et al., 2009, Mairal et al., 2010] or non-overlapping groups [Negahban and Wainwright, 2011, Obozinski et al., 2011], and the trace norm [Bach, 2008, Srebro et al., 2004] for low-rank regularization. All these regularizers fulfill Assumption B with c(θ) = ‖θ‖ and r(z) = z.

• Functions of norms: The Tikhonov regularizer [Hsu et al., 2012], which can be written as R(θ) = ‖θ‖² + ‖θ‖, fulfills Assumption B with c(θ) = ‖θ‖ and r(z) = z² + z.

• Mixtures of norms: This includes the sparse and low-rank prior [Richard et al., 2012] and the elastic net [Zou and Hastie, 2005]. The sparse and low-rank prior, R(θ) = ‖θ‖₁ + ‖θ‖_tr, fulfills Assumption B with c(θ) = ‖θ‖₁ or ‖θ‖_tr and r(z) = z. The elastic net [Zou and Hastie, 2005], R(θ) = ‖θ‖₁ + ‖θ‖², fulfills Assumption B with c(θ) = ‖θ‖₁ or ‖θ‖ and r(z) = z or z² + z, respectively.

• Dirty models, as described in [Jalali et al., 2010], R(θ) = ‖θ^(1)‖_{1,1} + ‖θ^(2)‖_{1,∞} where θ = θ^(1) + θ^(2), fulfill Assumption B with r(z) = z and c(θ) = ‖θ‖_{1,∞}.

• Other priors: The total variation prior [Zhang and Wang, 2010], which can be described as R(θ) = ‖θ‖ + f(θ) where f(θ) ≥ 0, fulfills Assumption B with c(θ) = ‖θ‖ and r(z) = z.

Before discussing various examples, we present two technical lemmas that are useful for the analysis of perturbed loss consistency.

Lemma 3.
Given the sufficient statistic t(x), assume that for all j, t_j(x) follows a sub-Gaussian distribution with parameter σ_x, and that the conditional distribution of ψ_j(x, η) for any fixed x is sub-Gaussian with parameter σ_η. Then ψ_j(x, η) follows a sub-Gaussian distribution with parameter σ such that σ² = σ_x² + σ_η².

Lemma 4.
Given the sufficient statistic t(x), assume that for all j, t_j(x) has variance at most σ_x², and that the conditional distribution of ψ_j(x, η) for any fixed x has variance at most σ_η². Then ψ_j(x, η) has variance at most σ² = σ_x² + σ_η².

Table 1: New convergence rates ε_{n,δ} with data irrecoverability and minimum noise for the examples in Section 3, Theorems 5 to 13. Our new rates are of similar order as previous results without perturbation [Honorio and Jaakkola, 2014], and only increase by a factor from σ_x to √(σ_x² + σ_η²). The convergence rates ε_{n,δ} are for n samples with respect to p-dimensional sufficient statistics, i.e., θ ∈ H = R^p (for exponential-family PCA, θ ∈ H = R^{n₁×n₂} and n = n₁ × n₂), with probability at least 1 − δ. β ∈ (0, 1/2) is a parameter for nonparametric regression. σ_x and σ_η are the parameters of sub-Gaussian distributions or maximum variances as described in Lemmas 3 and 4. Rates were not optimized. All rates follow from the specific regularizer and norm inequalities. NA means "not applicable" and NG means "no guarantees" in the table.
[Table 1. Rows: the example problems of Section 3 (MLE for exponential family distributions, GLM with fixed design, exponential-family PCA, nonparametric regression, and max-margin matrix factorization with δ = 0), each under the sub-Gaussian and the finite-variance assumption. Columns: the regularizers listed above (sparsity ℓ₁ [Ravikumar et al.], elastic net [Zou and Hastie], total variation [Zhang and Wang], sparsity and low-rank [Richard et al.], quasi-convex ℓ₁ + ℓ_p with p < 1, k-support norm [Argyriou et al.], Tikhonov [Hsu et al.], multitask ℓ_{1,∞}, dirty multitask, multitask ℓ_{1,2} [Jacob et al.], overlap multitask ℓ_{1,2} and ℓ_{1,∞} with maximum group size g [Jacob et al., Mairal et al.], and low-rank), with rates of order, e.g., √(σ_x² + σ_η²) √(log p / n) for ℓ₁ in the sub-Gaussian MLE case. Last column: the minimum noise that makes data reconstruction impossible, σ_η² of order 1/((1 − γ) log 2) for the additive perturbations, and q ∈ (1/2, 1/2 + (1 − γ) log 2 / 8) for max-margin matrix factorization.]

First, we focus on the problem of maximum likelihood estimation (MLE) for exponential family distributions [Kakade et al., 2010, Ravikumar et al., 2008] with arbitrary norm
regularization. This includes, for instance, the problem of learning the parameters (and possibly the structure) of Gaussian and discrete MRFs. We provide a new convergence rate ε_{n,δ} with perturbed data and provide an impossibility result for the recovery of the original data.

To define the problem, let t(x) be the sufficient statistic and Z(θ) = ∫_x e^{⟨t(x), θ⟩} be the partition function. Given n i.i.d. samples, let T̂ = (1/n) Σ_i t(x^(i)) be the original empirical sufficient statistic, and let T = E_{x∼D}[t(x)] be the expected sufficient statistic. After we perturb the n samples, denote by T̂_η = (1/n) Σ_i ψ(x^(i), η^(i)) the empirical sufficient statistic for perturbed data, and by T_η = E_{x∼D, η∼Q}[ψ(x, η)] the expected sufficient statistic after perturbation. We further define the empirical loss functions as shown below:

• L̂(θ) = −⟨T̂, θ⟩ + log Z(θ): empirical negative log-likelihood for original data T̂,

• L̂_η(θ) = −⟨T̂_η, θ⟩ + log Z(θ): empirical negative log-likelihood for privatized data T̂_η.

Similarly, L(θ) = −⟨T, θ⟩ + log Z(θ) and L_η(θ) = −⟨T_η, θ⟩ + log Z(θ) are the expected negative log-likelihoods for the original data and the perturbed data, respectively.

Theorem 5.
The model above fulfills Assumption A, and Assumption C with ε′_n = 0. Assume that for all j, t_j(x) follows a sub-Gaussian distribution with parameter σ_x. Suppose the conditional distribution of ψ_j(x, η) for any fixed x is sub-Gaussian with parameter σ_η; then ψ_j(x, η) follows a sub-Gaussian distribution with parameter σ such that σ² = σ_x² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ √((1/n) log(1/δ))) for n independent samples.

Similarly, assume that for all j, t_j(x) has variance at most σ_x². Suppose the conditional distribution of ψ_j(x, η) for any fixed x has variance at most σ_η²; then ψ_j(x, η) has variance at most σ² = σ_x² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ √(1/(nδ))).

For example, if one uses the ℓ₁ regularizer [Ravikumar et al., 2008], the rate is ε_{n,δ} = σ √((1/n)(log p + log(1/δ))) for the sub-Gaussian case, and ε_{n,δ} = σ √(p/(nδ)) for the bounded-variance case. As a comparison, the rates with original data [Honorio and Jaakkola, 2014] are σ_x √((1/n)(log p + log(1/δ))) and σ_x √(p/(nδ)), respectively.

Data Irrecoverability.
Next we provide an example to show how local perturbation can prevent an adversary from recovering the original data. Based on the example, we analyze the minimum noise that guarantees data irrecoverability. In what follows, we consider recovering the data up to permutation, since the ordering of i.i.d. samples in a dataset is not relevant.

Consider a simple example, MLE for an Ising model with zero mean. Let θ ∈ H = R^p and let x^(i) ∈ {−1, +1}^{√p} be samples drawn from some unknown distribution. Denote X = {x^(1), x^(2), . . . , x^(n)}. The sufficient statistic is t(x^(i)) = x^(i) x^(i)ᵀ, and the empirical sufficient statistic is T̂ = (1/n) Σ_i x^(i) x^(i)ᵀ. We add noise in the following way: we sample n times from N(0, σ_η² I) to get η^(i), i = 1, . . . , n, and then add the noise to the samples, obtaining X_η = {x^(1) + η^(1), . . . , x^(n) + η^(n)}. The perturbed sufficient statistic becomes T̂′_η = (1/n) Σ_i (x^(i) + η^(i))(x^(i) + η^(i))ᵀ. Finally, we publish T̂_η, which we obtain by removing the diagonal entries of T̂′_η and by clamping the non-diagonal entries of T̂′_η to the range [−1, 1].

Theorem 6.
If we perturb T̂ as mentioned above, γ ≤ 1 − 1/(n√p), n ≤ √p/2 and the noise variance fulfills σ_η² ≥ 1/((1 − γ) log 2), then any adversary will fail to recover the original data up to permutation with probability greater than γ. That is,

inf_A P_{X,η}[A(X_η) ≠ X] ≥ γ.

Generalized linear models unify different models, including linear regression (when Gaussian noise is assumed), logistic regression and compressed sensing with exponential-family noise [Rish and Grabarnik, 2009]. For simplicity, we focus on the fixed design model, in which y is a random variable and x is a constant vector. Let t(y) be the sufficient statistic and Z(ν) = ∫_y e^{t(y)ν} be the partition function. Then the empirical loss functions are defined in the following way:

• L̂(θ) = (1/n) Σ_i [−t(y^(i)) ⟨x^(i), θ⟩ + log Z(⟨x^(i), θ⟩)]: empirical negative log-likelihood for original data y^(i) given their linear predictions ⟨x^(i), θ⟩,

• L̂_η(θ) = (1/n) Σ_i [−ψ(y^(i), η^(i)) ⟨x^(i), θ⟩ + log Z(⟨x^(i), θ⟩)]: empirical negative log-likelihood for privatized data y^(i) given their linear predictions ⟨x^(i), θ⟩.

Similarly, L(θ) = E_{(∀i) y^(i)∼D_i}[L̂(θ)] and L_η(θ) = E_{(∀i) y^(i)∼D_i, η^(i)∼Q}[L̂_η(θ)] are the expected negative log-likelihoods for the original and the perturbed data, respectively.

Theorem 7.
The model above fulfills Assumption A, and Assumption C with ε′_n = 0. Assume that t(y) follows a sub-Gaussian distribution with parameter σ_y. Suppose the conditional distribution of ψ(y, η) for any fixed y is sub-Gaussian with parameter σ_η; then ψ(y, η) follows a sub-Gaussian distribution with parameter σ such that σ² = σ_y² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ √((1/n) log(1/δ))).

Similarly, assume that t(y) has variance at most σ_y², and that the conditional distribution of ψ(y, η) for any fixed y has variance at most σ_η²; then ψ(y, η) has variance at most σ² = σ_y² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ √(1/(nδ))).

As a comparison, the rates with original data [Honorio and Jaakkola, 2014] are O(σ_y √((1/n) log(1/δ))) and O(σ_y √(1/(nδ))), respectively.

Data Irrecoverability.
Next we provide an example and show the minimum noise needed to achieve data irrecoverability. Here, we only consider protecting y. Assume that y^(i) ∈ {+1, −1} is drawn from some unknown data distribution. Let the sufficient statistic be t(y) = y. Denote Y = {y^(1), . . . , y^(n)}. We sample n times from N(0, σ_η²) to get η^(1), . . . , η^(n). Then we perturb the data as ψ(y, η) = y + η. Finally, we publish Y_η = {y^(1) + η^(1), . . . , y^(n) + η^(n)} and all the corresponding x^(i).

Theorem 8.
If we perturb Y as mentioned above, γ ≤ 1 − 1/n and the noise variance fulfills σ_η² ≥ 1/((1 − γ) log 2), then any adversary will fail to recover the original data with probability greater than γ. That is,

inf_A P_{Y,η}[A(Y_η) ≠ Y] ≥ γ.

Exponential-family PCA was first introduced in [Collins et al., 2001] as a generalization of Gaussian PCA. We assume that each entry of the random matrix X ∈ R^{n₁×n₂} is independent, and might follow a different distribution. The hypothesis space for this problem is θ ∈ H = R^{n₁×n₂}. Let t(x_ij) be the sufficient statistic and Z(ν) = ∫_{x_ij} e^{t(x_ij)ν} be the partition function. The empirical loss functions are defined as follows:

• L̂(θ) = (1/n) Σ_ij [−t(x_ij) θ_ij + log Z(θ_ij)]: empirical negative log-likelihood for original data x_ij,

• L̂_η(θ) = (1/n) Σ_ij [−ψ(x_ij, η_ij) θ_ij + log Z(θ_ij)]: empirical negative log-likelihood for privatized data ψ(x_ij, η_ij).

Denote by L(θ) = E_{(∀ij) x_ij∼D_ij}[L̂(θ)] and L_η(θ) = E_{(∀ij) x_ij∼D_ij, η_ij∼Q}[L̂_η(θ)] the expected negative log-likelihoods for the original and the perturbed data.

Theorem 9.
The model above fulfills Assumption A, and Assumption C with ε′_n = 0. Assume that t(x_ij) follows a sub-Gaussian distribution with parameter σ_x. Suppose the conditional distribution of ψ(x_ij, η_ij) for any fixed x_ij is sub-Gaussian with parameter σ_η; then ψ(x_ij, η_ij) follows a sub-Gaussian distribution with parameter σ such that σ² = σ_x² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ √((1/n) log(1/δ))).

Similarly, assume that t(x_ij) has variance at most σ_x², and that the conditional distribution of ψ(x_ij, η_ij) for any fixed x_ij has variance at most σ_η²; then ψ(x_ij, η_ij) has variance at most σ² = σ_x² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ √(1/(nδ))).

As a comparison, the rates with original data [Honorio and Jaakkola, 2014] are O(σ_x √((1/n) log(1/δ))) and O(σ_x √(1/(nδ))), respectively.

Data Irrecoverability.

Next we provide an example and show the minimum noise needed to achieve data irrecoverability. Assume that for all ij, x_ij ∈ {−1, +1}. We perturb the data so that ψ(x_ij, η_ij) = x_ij + η_ij, where η_ij ∼ N(0, σ_η²). Let X denote the original data and X_η the perturbed data; that is, the (i, j)-th entry of X_η is ψ(x_ij, η_ij).

Theorem 10.
If we perturb X as mentioned above, γ ≤ 1 − 1/n and the noise variance fulfills σ_η² ≥ 1/((1 − γ) log 2), then any adversary will fail to recover the original data with probability greater than γ. That is,

inf_A P_{X,η}[A(X_η) ≠ X] ≥ γ.

In nonparametric generalized regression with exponential-family noise, the goal is to learn a function which can be represented in an infinite-dimensional orthonormal basis. One instance of this problem is the Gaussian case provided in [Ravikumar et al., 2005], with orthonormal basis functions depending on single coordinates. Here we allow the number of basis functions to grow with more samples. For simplicity, we analyze the fixed design model, i.e., y is a random variable and x is a constant.

Let X be the domain of x. Let θ : X → R be a predictor. Let t(y) be the sufficient statistic and Z(ν) = ∫_y e^{t(y)ν} be the partition function. We define the empirical loss functions in the following way:

• L̂(θ) = (1/n) Σ_i [−t(y^(i)) θ(x^(i)) + log Z(θ(x^(i)))]: empirical negative log-likelihood for original data y^(i) given their predictions θ(x^(i));

• L̂_η(θ) = (1/n) Σ_i [−ψ(y^(i), η^(i)) θ(x^(i)) + log Z(θ(x^(i)))]: empirical negative log-likelihood for privatized data ψ(y^(i), η^(i)) given their predictions θ(x^(i)).

Then denote by L(θ) = E_{(∀i) y^(i)∼D_i}[L̂(θ)] and L_η(θ) = E_{(∀i) y^(i)∼D_i, η^(i)∼Q}[L̂_η(θ)] the expected negative log-likelihoods for the original and the perturbed data.

Theorem 11.
The model above fulfills Assumption A, and Assumption C with ε′_n = 0. Assume that t(y) follows a sub-Gaussian distribution with parameter σ_y. Suppose the conditional distribution of ψ(y, η) for any fixed y is sub-Gaussian with parameter σ_η; then ψ(y, η) follows a sub-Gaussian distribution with parameter σ, such that σ² = σ_y² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ (1/n^{(1−β)/2}) √(log(p/δ))) with n independent samples and O(e^{n^β}) basis functions, where β ∈ (0, 1/2).

Similarly, assume that t(y) has variance at most σ_y², and that the conditional distribution of ψ(y, η) for any fixed y has variance at most σ_η²; then ψ(y, η) has variance at most σ², such that σ² = σ_y² + σ_η². Thus, we can obtain a rate ε_{n,δ} ∈ O(σ (1/n^{(1−β)/2}) √(p/δ)) for n independent samples and O(n^β) basis functions, where β ∈ (0, 1/2).

As a comparison, the rates with original data [Honorio and Jaakkola, 2014] are O(σ_y (1/n^{(1−β)/2}) √(log(p/δ))) and O(σ_y (1/n^{(1−β)/2}) √(p/δ)) respectively.

Data Irrecoverability.
In the case of nonparametric generalized regression with fixed design, we can perturb the data y in the same way as for generalized linear models with fixed design. Therefore, Theorem 8 also holds for nonparametric generalized regression.

The max-margin matrix factorization problem was introduced in [Srebro et al., 2004], which used a hinge loss. Here we generalize the loss function to be Lipschitz continuous. Let f : R → R be a K-Lipschitz continuous loss function. Assume the entries of the random matrix X ∈ {−1, +1}^{n₁×n₂} are independent. Let n = n₁n₂. We perturb each of the entries in matrix X as ψ(x_ij, η_ij) = x_ij η_ij, where P[η_ij = 1] = q and P[η_ij = −1] = 1 − q. We define the empirical loss functions in the following way:

• L̂(θ) = (1/n) Σ_ij f(x_ij θ_ij): empirical risk of predicting the binary value x_ij ∈ {−1, +1} by using sgn(θ_ij);

• L̂_η(θ) = (1/n) Σ_ij f(ψ(x_ij, η_ij) θ_ij): empirical risk of predicting the privatized data ψ(x_ij, η_ij) by using sgn(θ_ij).

Theorem 12.
The model above fulfills Assumption A with probability 1 (i.e., δ = 0), scale function c(θ) = ‖θ‖₁ and rate ε_{n,0} = O(1/n). The model also fulfills Assumption C with ε′_n ∈ O(K(1 − q)/n) and scale function c(θ) = ‖θ‖₁.
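To make the Assumption C bound above concrete, here is a small numerical sanity check (a sketch, not code from the paper): it uses the hinge loss f(z) = max(0, 1 − z), which is 1-Lipschitz, and verifies that the gap between the empirical loss on the original data and the exact η-expectation of the perturbed empirical loss never exceeds 2K(1 − q)/n · ‖θ‖₁. The matrix sizes, the value of q, and the random θ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 30, 40
n = n1 * n2
q = 0.8          # P[eta_ij = +1]; an entry is flipped with probability 1 - q
K = 1.0          # hinge loss f(z) = max(0, 1 - z) is 1-Lipschitz

f = lambda z: np.maximum(0.0, 1.0 - z)      # K-Lipschitz loss
X = rng.choice([-1.0, 1.0], size=(n1, n2))  # original binary data
theta = rng.normal(size=(n1, n2))           # arbitrary parameter matrix

# Empirical loss on the original data: (1/n) sum_ij f(x_ij theta_ij)
L_hat = f(X * theta).mean()

# Exact eta-expectation of the perturbed empirical loss:
# E_eta[L_hat_eta] = (1/n) sum_ij [ q f(x_ij theta_ij) + (1 - q) f(-x_ij theta_ij) ]
L_hat_eta_mean = (q * f(X * theta) + (1 - q) * f(-X * theta)).mean()

bias = abs(L_hat - L_hat_eta_mean)
bound = 2 * K * (1 - q) / n * np.abs(theta).sum()   # 2K(1-q)/n * ||theta||_1
assert bias <= bound
print(bias, bound)
```

Because the η-expectation is available in closed form, the check is deterministic given X and θ; no Monte Carlo sampling over η is needed.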
As a comparison, the rate with original data [Honorio and Jaakkola, 2014] is O(1/n).

Data Irrecoverability.
We show that data irrecoverability can be achieved in this model. Let X denote the original data and X_η denote the perturbed data. That is, the (i, j)-th entry of X_η is ψ(x_ij, η_ij) = x_ij η_ij, where P[η_ij = 1] = q and P[η_ij = −1] = 1 − q.

Theorem 13.
If we perturb X as mentioned above, γ ≤ 1 − 2/n, and q ∈ (1/2, 1/2 + (1 − γ) log 2 / 8), then any adversary will fail to recover the original data with probability greater than γ. That is,

inf_A P_{X,η}[A(X_η) ≠ X] ≥ γ.

As a corollary of our result on perturbed loss consistency, we believe that norm consistency, sparsistency and sign consistency as in [Honorio and Jaakkola, 2014] can also be proved under our framework of data irrecoverability. In addition, there are several problems that our current framework cannot accommodate, such as nonparametric clustering with exponential families. We need to explore new mathematical characterizations in the context of these problems.
References
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. NIPS, 2012.
F. Bach. Consistency of trace norm minimization. JMLR, 2008.
R. Bassily, A. Smith, and A. Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. NIPS, 2001.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
J. Duchi. Global Fano method, 2016. URL https://web.stanford.edu/class/stats311/Lectures/lec-06.pdf.
J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 429–438. IEEE, 2013.
C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380. ACM, 2009.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In S. Halevi and T. Rabin, editors, Theory of Cryptography, pages 265–284, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-32732-5.
Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067. ACM, 2014.
J. Honorio and T. Jaakkola. A unified framework for consistency of regularized loss minimizers. In International Conference on Machine Learning, pages 136–144, 2014.
D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. COLT, 2012.
L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. NIPS, 2009.
P. Jain and A. G. Thakurta. Near dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pages 476–484, 2014.
A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. NIPS, 2010.
S. Kakade, O. Shamir, K. Sridharan, and A. Tewari. Learning exponential families in high-dimensions: Strong convexity and sparsity. AISTATS, 2010.
S. P. Kasiviswanathan and H. Jin. Efficient private empirical risk minimization for high-dimensional learning. In International Conference on Machine Learning, pages 488–497, 2016.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. NIPS, 2010.
J. Near. Differential privacy at scale: Uber and Berkeley collaboration. Enigma 2018, 2018.
S. Negahban and M. Wainwright. Simultaneous support recovery in high dimensions: Benefits and perils of block ℓ₁/ℓ∞-regularization. IEEE Transactions on Information Theory, 2011.
G. Obozinski, M. Wainwright, and M. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2011.
P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: Sparse additive models. NIPS, 2005.
P. Ravikumar, G. Raskutti, M. Wainwright, and B. Yu. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ₁-regularized MLE. NIPS, 2008.
E. Richard, P. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. ICML, 2012.
I. Rish and G. Grabarnik. Sparse signal recovery with exponential-family noise. Allerton, 2009.
N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.
A. Smith, A. Thakurta, and J. Upadhyay. Is interaction necessary for distributed private learning? In Security and Privacy (SP), 2017 IEEE Symposium on, pages 58–77. IEEE, 2017.
N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. NIPS, 2004.
M. J. Wainwright, M. I. Jordan, and J. C. Duchi. Privacy aware learning. In Advances in Neural Information Processing Systems, pages 1430–1438, 2012.
W. Wang, M. J. Wainwright, and K. Ramchandran. Information-theoretic bounds on model selection for Gaussian Markov random fields. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1373–1377. IEEE, 2010.
B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.
B. Zhang and Y. Wang. Learning structural changes of Gaussian graphical models in controlled experiments. UAI, 2010.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. Royal Statistical Society, 2005.
Proofs
A.1 Proof of Theorem 2
Proof.
By definition, we have

L_η(θ*_η) − L_η(θ*) ≤ 0,   (14)

because θ*_η = argmin_{θ∈H} L_η(θ). By Assumptions A and B, and by setting λ_n = αε_{n,δ} for some α ≥ 2, we have

L_η(θ̂_η) − L_η(θ*_η) ≤ ε_{n,δ}(−α r(c(θ̂_η)) + c(θ̂_η)) + ε_{n,δ}(α R(θ*_η) + c(θ*_η)) + ξ.   (15)

By Assumption C, and since ε′_n ≤ ε_{n,δ}, we have

L(θ̂_η) − L(θ*)
= (L(θ̂_η) − L_η(θ̂_η)) + (L_η(θ̂_η) − L_η(θ*_η)) + (L_η(θ*_η) − L_η(θ*)) + (L_η(θ*) − L(θ*))
≤ ε′_n c(θ̂_η) + ε_{n,δ}(−α r(c(θ̂_η)) + c(θ̂_η)) + ε_{n,δ}(α R(θ*_η) + c(θ*_η)) + ξ + 0 + ε′_n c(θ*)
≤ ε_{n,δ}(−α r(c(θ̂_η)) + 2c(θ̂_η)) + ε_{n,δ}(α R(θ*_η) + c(θ*_η)) + ξ + ε′_n c(θ*)
≤ ε_{n,δ}(α R(θ*_η) + c(θ*_η)) + ε′_n c(θ*) + ξ.

The first inequality is based on Assumption C and the two inequalities (14) and (15) mentioned above. The second inequality comes from ε′_n ≤ ε_{n,δ}. The third inequality comes from α ≥ 2, Assumption B, and the elimination of the negative terms.

A.2 Proof of Lemma 3
Since t_j(x) follows a sub-Gaussian distribution, we have E_x[e^{λ(t_j(x) − E_x[t_j(x)])}] ≤ e^{σ_x²λ²/2}. Since the conditional random variable ψ_j(x, η) for any fixed x follows a sub-Gaussian distribution, we have E_η[e^{λ(ψ_j(x,η) − E_η[ψ_j(x,η)])} | x] ≤ e^{σ_η²λ²/2}. Thus, for the random variable ψ_j(x, η) over both x and η, we get:

E_{x,η}[e^{λ(ψ_j(x,η) − E_{x,η}[ψ_j(x,η)])}]
= E_{x,η}[e^{λ(ψ_j(x,η) − t_j(x) + t_j(x) − E_{x,η}[ψ_j(x,η)])}]
= E_{x,η}[e^{λ(ψ_j(x,η) − E_η[ψ_j(x,η)] + t_j(x) − E_x[t_j(x)])}]
= E_x[e^{λ(t_j(x) − E_x[t_j(x)])} E_η[e^{λ(ψ_j(x,η) − E_η[ψ_j(x,η)])} | x]]
≤ E_x[e^{λ(t_j(x) − E_x[t_j(x)])} e^{σ_η²λ²/2}]
= e^{(σ_x² + σ_η²)λ²/2}.

Thus, ψ_j(x, η) will also be sub-Gaussian with parameter σ such that σ² = σ_x² + σ_η².

A.3 Proof of Lemma 4
Since t_j(x) has variance at most σ_x², and ψ_j(x, η) for any fixed x has variance at most σ_η², then for the random variable ψ_j(x, η) over x and η, we have:

E_{x,η}[(ψ_j(x,η) − E_{x,η}[ψ_j(x,η)])²]
= E_{x,η}[(ψ_j(x,η) − t_j(x) + t_j(x) − E_{x,η}[ψ_j(x,η)])²]
= E_{x,η}[(ψ_j(x,η) − t_j(x))² + 2(ψ_j(x,η) − t_j(x))(t_j(x) − E_{x,η}[ψ_j(x,η)]) + (t_j(x) − E_{x,η}[ψ_j(x,η)])²]
= E_x[E_η[(ψ_j(x,η) − E_η[ψ_j(x,η)])² | x]] + 2E_x[(t_j(x) − E_x[t_j(x)]) E_η[ψ_j(x,η) − E_η[ψ_j(x,η)]]] + E_x[(t_j(x) − E_x[t_j(x)])²]
≤ σ_η² + σ_x².

We have the last inequality because E_η[ψ_j(x,η) − E_η[ψ_j(x,η)]] = 0. Thus, ψ_j(x, η) has variance at most σ_η² + σ_x².

A.4 Proof of Theorem 5
Claim i.
The maximum likelihood estimation for exponential family distributions fulfills Assumption A with probability at least 1 − δ, scale function c(θ) = ‖θ‖ and rate ε_{n,δ}, provided that the dual norm fulfills ‖T̂_η − T_η‖* ≤ ε_{n,δ}. The problem also fulfills Assumption C with ε′_n = 0.

Proof. First we show that L_η(θ) = L(θ) for any θ. Recall that E_η[ψ(x, η)] = t(x). We have

L_η(θ) = −⟨T_η, θ⟩ + log Z(θ) = −⟨T, θ⟩ + log Z(θ) = L(θ).

For proving that Assumption C holds, note that L_η(θ) = L(θ) for any θ, and thus ε′_n = 0.

For proving that Assumption A holds, we invoke Claim i in [Honorio and Jaakkola, 2014]; that is, for all θ,

|L̂_η(θ) − L_η(θ)| = |⟨T̂_η − T_η, θ⟩| ≤ ‖T̂_η − T_η‖* ‖θ‖ ≤ ε_{n,δ} ‖θ‖.

Let θ ∈ H = R^p. Let ‖·‖* = ‖·‖∞ and ‖·‖ = ‖·‖₁. According to Lemma 3 and Lemma 4, the variance of ψ_j(x, η) is σ² = σ_x² + σ_η². We now focus on proving that ‖T̂_η − T_η‖* ≤ ε_{n,δ}, which is the precondition of Claim i.

Sub-Gaussian case and ℓ∞-norm. For sub-Gaussian ψ_j(x, η), 1 ≤ j ≤ p, with parameter σ, by the union bound and independence:

P[‖T̂_η − T_η‖* > ε]
= P[(∃j) |(1/n) Σ_i ψ_j(x^(i), η^(i)) − E_{x∼D}[t_j(x)]| > ε]
= P[(∃j) |(1/n) Σ_i ψ_j(x^(i), η^(i)) − E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]]| > ε]
≤ 2p P[(1/n) Σ_i ψ_j(x^(i), η^(i)) − E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]] > ε]
= 2p P[exp(t(Σ_i ψ_j(x^(i), η^(i)) − n E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]])) > exp(tnε)]
≤ 2p E[exp(t(Σ_i ψ_j(x^(i), η^(i)) − n E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]]))] / exp(tnε)
= 2p Π_{i=1}^n E[exp(t(ψ_j(x^(i), η^(i)) − E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]]))] / exp(tnε)
≤ 2p exp(σ²t²n/2 − tnε)
≤ 2p exp(−nε²/(2σ²)) = δ,

where the last step chooses t = ε/σ². By solving for ε, we have ε_{n,δ} = σ √((2/n)(log p + log 2/δ)).

Finite variance case and ℓ∞-norm.
For ψ_j(x, η), 1 ≤ j ≤ p, with finite variance σ², by the union bound and Chebyshev's inequality:

P[‖T̂_η − T_η‖* > ε]
= P[(∃j) |(1/n) Σ_i ψ_j(x^(i), η^(i)) − E_{x∼D}[t_j(x)]| > ε]
= P[(∃j) |(1/n) Σ_i ψ_j(x^(i), η^(i)) − E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]]| > ε]
≤ p P[|(1/n) Σ_i ψ_j(x^(i), η^(i)) − E_{x∼D}[E_{η∼Q}[ψ_j(x, η)]]| > ε]
≤ p σ²/(nε²).

By solving for ε, we have ε_{n,δ} = σ √(p/(nδ)).

A.5 Proof of Theorem 6
Proof.
Using Fano's inequality, we show that any adversary fails to recover the original data X, up to permutation, with probability at least γ. We can describe the data process with the Markov chain X → X_η → T̂′_η → T̂_η → X̂, where X̂ is the output of A. The mutual information of X and X_η can be bounded by using the pairwise KL divergence bound [Yu, 1997]:

I[X; X̂] ≤ I[X; T̂_η] ≤ I[X; X_η] = n I[x^(i); x^(i)_η]
≤ (n√p/4) Σ_{x_j∈{−1,+1}} Σ_{x′_j∈{−1,+1}} KL(P_{x_{ηj}|x_j} ‖ P_{x_{ηj}|x′_j})
≤ (n√p/4) Σ_{x_j} Σ_{x′_j} (x_j − x′_j)²/(2σ_η²)
≤ 2n√p/σ_η².

Because we require correctness only up to permutation, the hypothesis space has size k = 2^{n√p}/n! ≥ 2^{n√p}/n^n. By Fano's inequality [Cover and Thomas, 2012],

P[X̂ ≠ X] ≥ 1 − (I[X; T̂_η] + log 2)/log k ≥ 1 − (2n√p/σ_η² + log 2)/(n√p log 2 − n log n).

In order to have P[X̂ ≠ X] ≥ γ, we require

(2n√p/σ_η² + log 2)/(n√p log 2 − n log n) ≤ 1 − γ,

that is,

σ_η² ≥ 2 / ((1 − γ)(log 2 − (log n)/√p) − (log 2)/(n√p)).

Thus, if n ≥ 4/((1 − γ)√p) and n ≤ 2^{√p/2}, it suffices that σ_η² ≥ 8/((1 − γ) log 2).

A.6 Proof of Theorem 7

Claim ii.
The generalized linear model with fixed design fulfills Assumption A with probability at least 1 − δ, scale function c(θ) = ‖θ‖ and rate ε_{n,δ}, provided that the dual norm fulfills ‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{y∼D_i, η∼Q}[ψ(y^(i), η^(i))]) x^(i)‖* ≤ ε_{n,δ}. The problem also fulfills Assumption C with ε′_n = 0.

Proof. We first show that L_η(θ) = L(θ) for any θ. Recall that E_η[ψ(y, η)] = t(y). We have

L_η(θ) = E_{(∀i) y^(i)∼D_i, η^(i)∼Q}[L̂_η(θ)]
= E_{(∀i) y^(i)∼D_i, η^(i)∼Q}[(1/n) Σ_i −ψ(y^(i), η^(i)) ⟨x^(i), θ⟩ + log Z(⟨x^(i), θ⟩)]
= E_{(∀i) y^(i)∼D_i}[(1/n) Σ_i −E_{η^(i)∼Q}[ψ(y^(i), η^(i))] ⟨x^(i), θ⟩ + log Z(⟨x^(i), θ⟩)]
= E_{(∀i) y^(i)∼D_i}[(1/n) Σ_i −t(y^(i)) ⟨x^(i), θ⟩ + log Z(⟨x^(i), θ⟩)]
= L(θ).

For proving that Assumption C holds, note that L_η(θ) = L(θ) for any θ, and thus ε′_n = 0.

For proving that Assumption A holds, we invoke Claim ii in [Honorio and Jaakkola, 2014]; that is, for all θ,

|L̂_η(θ) − L_η(θ)|
= |(1/n) Σ_i ψ(y^(i), η^(i)) ⟨x^(i), θ⟩ − (1/n) Σ_i E_{D,Q}[ψ(y^(i), η^(i))] ⟨x^(i), θ⟩|
= |⟨(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) x^(i), θ⟩|
≤ ‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) x^(i)‖* ‖θ‖
≤ ε_{n,δ} ‖θ‖.

Let θ ∈ H = R^p. Let ‖·‖* = ‖·‖∞ and ‖·‖ = ‖·‖₁. Let ‖x‖* ≤ B for all x, and thus |x^(i)_j| ≤ B for all i, j. According to Lemma 3 and Lemma 4, the variance of ψ(y, η) is σ² = σ_y² + σ_η². We now focus on proving that ‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) x^(i)‖* ≤ ε_{n,δ}, which is the precondition of Claim ii.

Sub-Gaussian case and ℓ∞-norm.
By Claim ii, and by the union bound and independence, if ψ(y, η) is sub-Gaussian, then

P[‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) x^(i)‖* > ε]
= P[(∃j) |(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{y∼D}[t(y^(i))]) x^(i)_j| > ε]
= P[(∃j) |(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{y∼D}[E_{η∼Q}[ψ(y^(i), η^(i))]]) x^(i)_j| > ε]
≤ 2p exp(−nε²/(2σ²B²)).

Thus, ε_{n,δ} = σB √((2/n)(log p + log 2/δ)).

Finite variance case and ℓ∞-norm. If ψ(y, η) has variance at most σ², then by Claim ii, and by the union bound and Chebyshev's inequality,

P[‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) x^(i)‖* > ε]
= P[(∃j) |(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{y∼D}[t(y^(i))]) x^(i)_j| > ε]
= P[(∃j) |(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{y∼D}[E_{η∼Q}[ψ(y^(i), η^(i))]]) x^(i)_j| > ε]
≤ p (σB)²/(nε²).

By solving for ε, we have ε_{n,δ} = σB √(p/(nδ)).

A.7 Proof of Theorem 8
Proof.
Using Fano's inequality, we show that any adversary fails to recover the original data Y with probability at least γ. We can describe the data process with the Markov chain Y → Y_η → Ŷ, where Ŷ is the output of A. The mutual information of Y and Y_η can be bounded by using the pairwise KL divergence bound [Yu, 1997]:

I[Y; Ŷ] ≤ I[Y; Y_η] ≤ (1/k²) Σ_{Y∈F} Σ_{Y′∈F} KL(P_{Y_η|Y} ‖ P_{Y_η|Y′})
= (n/4) Σ_{y∈{−1,+1}} Σ_{y′∈{−1,+1}} KL(P_{y_η|y} ‖ P_{y_η|y′})
≤ 2n/σ_η².

The hypothesis space has size k = 2^n, so by Fano's inequality [Cover and Thomas, 2012],

P[Ŷ ≠ Y] ≥ 1 − (I(Y; Ŷ) + log 2)/log k ≥ 1 − (2n/σ_η² + log 2)/(n log 2).

In order to have P[Ŷ ≠ Y] ≥ γ, we require

(2n/σ_η² + log 2)/(n log 2) ≤ 1 − γ, i.e., 2/(σ_η² log 2) + 1/n ≤ 1 − γ.

Thus, if n > 2/(1 − γ), we have

σ_η² ≥ 2/((1 − γ − 1/n) log 2), for which σ_η² ≥ 4/((1 − γ) log 2) suffices.

A.8 Proof of Theorem 9
Claim iii.
The exponential family PCA fulfills Assumption A with probability at least 1 − δ, scale function c(θ) = ‖θ‖ and rate ε_{n,δ}, provided that the dual norm fulfills ‖(1/n)(ψ(x₁₁, η₁₁) − E_{x∼D₁₁, η∼Q₁₁}[ψ(x, η)], …, ψ(x_{n₁n₂}, η_{n₁n₂}) − E_{x∼D_{n₁n₂}, η∼Q_{n₁n₂}}[ψ(x, η)])‖* ≤ ε_{n,δ}. The problem also fulfills Assumption C with ε′_n = 0.

Proof. We first show that L_η(θ) = L(θ) for any θ. We have

L_η(θ) = E_{(∀ij) x_ij∼D_ij, η_ij∼Q_ij}[L̂_η(θ)]
= E_{(∀ij) x_ij∼D_ij, η_ij∼Q_ij}[(1/n) Σ_ij −ψ(x_ij, η_ij) θ_ij + log Z(θ_ij)]
= E_{(∀ij) x_ij∼D_ij}[(1/n) Σ_ij −E_{η_ij∼Q_ij}[ψ(x_ij, η_ij)] θ_ij + log Z(θ_ij)]
= E_{(∀ij) x_ij∼D_ij}[(1/n) Σ_ij −t(x_ij) θ_ij + log Z(θ_ij)]
= L(θ).

For proving that Assumption C holds, note that L_η(θ) = L(θ) for any θ, and thus ε′_n = 0.

For proving that Assumption A holds, we have for all θ

|L̂_η(θ) − L_η(θ)| = |(1/n) Σ_ij ψ(x_ij, η_ij) θ_ij − (1/n) Σ_ij E_{x∼D_ij, η∼Q_ij}[ψ(x, η)] θ_ij|
= |(1/n) Σ_ij (ψ(x_ij, η_ij) − E_{x∼D_ij, η∼Q_ij}[ψ(x, η)]) θ_ij|
≤ ‖(1/n)(ψ(x₁₁, η₁₁) − E_{x∼D₁₁, η∼Q₁₁}[ψ(x, η)], …, ψ(x_{n₁n₂}, η_{n₁n₂}) − E_{x∼D_{n₁n₂}, η∼Q_{n₁n₂}}[ψ(x, η)])‖* ‖θ‖
≤ ε_{n,δ} ‖θ‖.

Recall that θ ∈ H = R^{n₁×n₂} and n = n₁ × n₂. Let ‖·‖* = ‖·‖∞ and ‖·‖ = ‖·‖₁. According to Lemma 3 and Lemma 4, the variance of ψ(x_ij, η_ij) is σ² = σ_x² + σ_η². We now focus on proving the dual-norm bound above, which is the precondition of Claim iii.

Sub-Gaussian case and ℓ∞-norm. If ψ(x_ij, η_ij) is sub-Gaussian, by Claim iii, and by the union bound and independence, we have

P[‖(1/n)(ψ(x₁₁, η₁₁) − E_{x∼D₁₁, η∼Q₁₁}[ψ(x, η)], …, ψ(x_{n₁n₂}, η_{n₁n₂}) − E_{x∼D_{n₁n₂}, η∼Q_{n₁n₂}}[ψ(x, η)])‖* > ε]
= P[(∃ij) |ψ(x_ij, η_ij) − E_{x∼D_ij}[t(x_ij)]| > nε]
≤ 2n exp(−(nε)²/(2σ²)).

Setting δ = 2n exp(−(nε)²/(2σ²)) and solving for ε, we have ε_{n,δ} = (σ/n) √(2(log 2n + log 1/δ)).

Finite variance case and ℓ∞-norm. If ψ(x_ij, η_ij) has variance at most σ², by Claim iii, and by the union bound and Chebyshev's inequality:

P[‖(1/n)(ψ(x₁₁, η₁₁) − E_{x∼D₁₁, η∼Q₁₁}[ψ(x, η)], …, ψ(x_{n₁n₂}, η_{n₁n₂}) − E_{x∼D_{n₁n₂}, η∼Q_{n₁n₂}}[ψ(x, η)])‖* > ε]
= P[(∃ij) |ψ(x_ij, η_ij) − E_{x∼D_ij}[t(x_ij)]| > nε]
≤ n σ²/(nε)².

Setting δ = nσ²/(nε)², we have ε_{n,δ} = σ/√(nδ).

A.9 Proof of Theorem 10

Proof.
Using Fano's inequality, we show that any adversary fails to recover the original data X with probability at least γ. We can describe the data process with the Markov chain X → X_η → X̂, where X̂ is the output of A. The mutual information of X and X_η can be bounded by using the pairwise KL divergence bound [Yu, 1997]:

I[X; X̂] ≤ I[X; X_η] ≤ (1/k²) Σ_{X∈F} Σ_{X′∈F} KL(P_{X_η|X} ‖ P_{X_η|X′})
= (n/4) Σ_{x_ij∈{−1,+1}} Σ_{x′_ij∈{−1,+1}} KL(P_{x_η|x_ij} ‖ P_{x_η|x′_ij})
≤ 2n/σ_η².

The hypothesis space has size k = 2^n. By Fano's inequality [Cover and Thomas, 2012], we have

P[X̂ ≠ X] ≥ 1 − (I(X; X_η) + log 2)/log k ≥ 1 − (2n/σ_η² + log 2)/(n log 2).

In order to have P[X̂ ≠ X] ≥ γ, we require (2n/σ_η² + log 2)/(n log 2) ≤ 1 − γ. Thus, if n > 2/(1 − γ), we have

σ_η² ≥ 2/((1 − γ − 1/n) log 2), for which σ_η² ≥ 4/((1 − γ) log 2) suffices.

A.10 Proof of Theorem 11
Claim iv.
Let φ₁, …, φ_∞ be an infinite-dimensional orthonormal basis, and let φ(x) = (φ₁(x), …, φ_∞(x)). We represent the function θ : X → R by using the infinite-dimensional orthonormal basis. That is, θ(x) = Σ_{j=1}^∞ ν^{(θ)}_j φ_j(x) = ⟨ν^{(θ)}, φ(x)⟩, where ν^{(θ)} = (ν^{(θ)}_1, …, ν^{(θ)}_∞). In the latter, the superindex (θ) allows for associating the infinite-dimensional coefficient vector ν with the original function θ. Then, we define the norm of the function θ with respect to the infinite-dimensional orthonormal basis. That is, ‖θ‖ = ‖ν^{(θ)}‖.

Nonparametric generalized regression with fixed design fulfills Assumption A with probability at least 1 − δ, scale function c(θ) = ‖θ‖ and rate ε_{n,δ}, provided that the dual norm fulfills ‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{y∼D_i, η∼Q}[ψ(y^(i), η^(i))]) φ(x^(i))‖* ≤ ε_{n,δ}. This problem also fulfills Assumption C with ε′_n = 0.

Proof. We first show that L_η(θ) = L(θ). We have

L_η(θ) = E_{(∀i) y^(i)∼D_i, η^(i)∼Q}[(1/n) Σ_i −ψ(y^(i), η^(i)) θ(x^(i)) + log Z(θ(x^(i)))]
= E_{(∀i) y^(i)∼D_i}[(1/n) Σ_i −E_{η^(i)∼Q}[ψ(y^(i), η^(i))] θ(x^(i)) + log Z(θ(x^(i)))]
= E_{(∀i) y^(i)∼D_i}[(1/n) Σ_i −t(y^(i)) θ(x^(i)) + log Z(θ(x^(i)))]
= L(θ).

For proving that Assumption C holds, note that L_η(θ) = L(θ) for any θ, and thus ε′_n = 0.

For proving that Assumption A holds, we have for all θ

|L̂_η(θ) − L_η(θ)| = |(1/n) Σ_i ψ(y^(i), η^(i)) θ(x^(i)) − (1/n) Σ_i E_{D,Q}[ψ(y^(i), η^(i))] θ(x^(i))|
= |⟨(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ(x^(i)), ν^{(θ)}⟩|
≤ ‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ(x^(i))‖* ‖ν^{(θ)}‖
≤ ε_{n,δ} ‖θ‖.

Let x ∈ X = R^p. Let ‖·‖* = ‖·‖∞ and ‖·‖ = ‖·‖₁. Let ‖φ(x)‖* ≤ B for all x, and thus |φ_j(x^(i))| ≤ B for all i, j.

The complexity of our nonparametric model grows with more samples. Assume that we have q_n orthonormal basis functions ϕ₁, …, ϕ_{q_n} : R → R. Let q_n be increasing with respect to the number of samples n. With these bases, we define q_n p orthonormal basis functions of the form φ_j(x) = ϕ_k(x_l) for j = 1, …, q_n p, k = 1, …, q_n, l = 1, …, p.

According to Lemma 3 and Lemma 4, the variance of ψ(y, η) is σ² = σ_y² + σ_η². We now focus on proving that ‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ(x^(i))‖* ≤ ε_{n,δ}, which is the precondition of Claim iv.

Sub-Gaussian case with ℓ∞-norm. Let ψ(y^(i), η^(i)) be sub-Gaussian with parameter σ for all i. Therefore ψ(y^(i), η^(i)) φ_j(x^(i)) is sub-Gaussian with parameter σB for all i. By Claim iv, and by the union bound, sub-Gaussianity and independence,

P[‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ(x^(i))‖* > ε]
= P[(∃j) |(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ_j(x^(i))| > ε]
≤ 2 q_n p exp(−nε²/(2σ²B²)) = δ.

By solving for ε, we have ε_{n,δ} = σB √((2/n)(log p + log q_n + log 2/δ)).

Finite variance case with ℓ∞-norm. Let ψ(y^(i), η^(i)) have variance at most σ² for all i. Therefore ψ(y^(i), η^(i)) φ_j(x^(i)) has variance at most (σB)² for all i. By Claim iv, and by the union bound and Chebyshev's inequality,

P[‖(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ(x^(i))‖* > ε]
= P[(∃j) |(1/n) Σ_i (ψ(y^(i), η^(i)) − E_{D,Q}[ψ(y^(i), η^(i))]) φ_j(x^(i))| > ε]
≤ q_n p (σB)²/(nε²) = δ.

By solving for ε, we have ε_{n,δ} = σB √(q_n p/(nδ)).

A.11 Proof of Theorem 12
Claim v.
Max-margin matrix factorization fulfills Assumption A with probability 1, scale function c(θ) = ‖θ‖₁ and rate ε_{n,δ} = O(1/n). Furthermore, max-margin matrix factorization fulfills Assumption C with ε′_n = 2K(1 − q)/n and c(θ) = ‖θ‖₁.

Proof. To prove that this problem fulfills Assumption A, we have:

|L̂_η(θ) − L_η(θ)| = |(1/n) Σ_ij (f(x_ij η_ij θ_ij) − E_{D,Q}[f(x_ij η_ij θ_ij)])|
= |(1/n) Σ_ij (1[x_ij η_ij = +1] f(θ_ij) + 1[x_ij η_ij = −1] f(−θ_ij) − P[x_ij η_ij = +1] f(θ_ij) − P[x_ij η_ij = −1] f(−θ_ij))|
= |(1/n) Σ_ij ((1[x_ij η_ij = +1] − P[x_ij η_ij = +1]) f(θ_ij) + (1[x_ij η_ij = −1] − P[x_ij η_ij = −1]) f(−θ_ij))|
≤ (1/n) Σ_ij (|1[x_ij η_ij = +1] − P[x_ij η_ij = +1]| |f(θ_ij)| + |1[x_ij η_ij = −1] − P[x_ij η_ij = −1]| |f(−θ_ij)|)
≤ (1/n) Σ_ij 2K|θ_ij| = (2K/n) ‖θ‖₁,

where the last inequality uses the fact that the two indicator deviations sum to zero, so f(0) can be subtracted from both f(θ_ij) and f(−θ_ij) before applying the K-Lipschitz property.

To prove that this problem fulfills Assumption C, let K be the Lipschitz constant of f. Note that

E_η[L̂_η(θ)] = (1/n) Σ_ij q f(x_ij θ_ij) + (1 − q) f(−x_ij θ_ij).

Thus, we have

|L̂(θ) − E_Q[L̂_η(θ)]|
= |(1/n) Σ_ij f(x_ij θ_ij) − ((1/n) Σ_ij q f(x_ij θ_ij) + (1 − q) f(−x_ij θ_ij))|
= (1/n) |Σ_ij (1 − q) f(x_ij θ_ij) − (1 − q) f(−x_ij θ_ij)|
= ((1 − q)/n) |Σ_ij f(x_ij θ_ij) − f(−x_ij θ_ij)|
≤ ((1 − q)/n) Σ_ij K|2 x_ij θ_ij|
≤ (2K(1 − q)/n) ‖θ‖₁.

By Jensen's inequality:

|L(θ) − L_η(θ)| = |E_{D,Q}[L̂(θ) − L̂_η(θ)]| ≤ E_D|L̂(θ) − E_Q[L̂_η(θ)]| ≤ (2K(1 − q)/n) ‖θ‖₁.

A.12 Proof of Theorem 13

Proof.
Using Fano's inequality, we show that any adversary fails to recover the original data X with probability at least γ. Denote P[η_ij = 1] = q and P[η_ij = −1] = 1 − q. We can describe the data process with the Markov chain X → X_η → X̂, where X̂ is the output of A. The mutual information of X and X_η can be bounded by using the pairwise KL divergence bound [Yu, 1997]:

I[X; X̂] ≤ I[X; X_η] = n I[x_ij; η_ij x_ij]
≤ (n/4) Σ_{x_ij∈{±1}} Σ_{x′_ij∈{±1}} KL(P_{x_η|x_ij} ‖ P_{x_η|x′_ij})
≤ n (q log(q/(1−q)) + (1−q) log((1−q)/q))
= n (2q − 1) log(q/(1−q)).

The hypothesis space has size k = 2^n. By Fano's inequality [Cover and Thomas, 2012], we have

P[X̂ ≠ X] ≥ 1 − (I[X; X̂] + log 2)/log k ≥ 1 − (n(2q − 1) log(q/(1−q)) + log 2)/(n log 2).

In order to have P[X̂ ≠ X] ≥ γ, we require

(n(2q − 1) log(q/(1−q)) + log 2)/(n log 2) < 1 − γ,
(2q − 1) log(q/(1−q)) < (1 − γ − 1/n) log 2.

Note that, since log(1 + u) ≤ u,

(2q − 1) log(q/(1−q)) ≤ (2q − 1)²/(1 − q) = 2(2q − 1)²/(2 − 2q).

Let g = (1 − γ − 1/n) log 2 with g ∈ (0, log 2); writing s = 2q − 1, we can solve 2s²/(1 − s) < g. Solving the inequality above, we get q ∈ (1/2, 1/2 + (−g + √(g(g + 8)))/8). A sufficient condition for the latter is q ∈ (1/2, 1/2 + (1 − γ − 1/n) log 2 / 4). If we further assume that n > 2/(1 − γ), we can have q ∈ (1/2, 1/2 + (1 − γ) log 2 / 8).
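As a closing sanity check of the admissible range of q derived above (a sketch based on our reading of the reconstructed bound; the grids over γ and n below are arbitrary choices), the script verifies numerically that every q strictly inside (1/2, 1/2 + (1 − γ) log 2 / 8) satisfies the Fano condition (2q − 1) log(q/(1 − q)) < (1 − γ − 1/n) log 2 whenever n > 2/(1 − γ).

```python
import math

def fano_condition(q, gamma, n):
    # The condition (2q - 1) log(q/(1-q)) < (1 - gamma - 1/n) log 2
    # required in the proof of Theorem 13.
    return (2 * q - 1) * math.log(q / (1 - q)) < (1 - gamma - 1 / n) * math.log(2)

ok = True
for gamma in [0.1, 0.3, 0.5, 0.7, 0.9]:
    n = int(2 / (1 - gamma)) + 10                  # ensure n > 2/(1 - gamma)
    q_max = 0.5 + (1 - gamma) * math.log(2) / 8    # threshold from Theorem 13
    for k in range(1, 100):
        q = 0.5 + (q_max - 0.5) * k / 100          # q strictly inside (1/2, q_max)
        ok = ok and fano_condition(q, gamma, n)
print(ok)
```

The bound is quite loose in practice: near the threshold, the left-hand side is an order of magnitude below the right-hand side, consistent with the slack introduced by the log(1 + u) ≤ u relaxation in the proof.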