Distributionally-Robust Machine Learning Using Locally Differentially-Private Data
Farhad Farokhi, Senior Member, IEEE
Abstract—We consider machine learning, particularly regression, using locally differentially private datasets. The Wasserstein distance is used to define an ambiguity set centered at the empirical distribution of the dataset corrupted by local differential privacy noise. The ambiguity set is shown to contain the probability distribution of the unperturbed, clean data. The radius of the ambiguity set is a function of the privacy budget, the spread of the data, and the size of the problem. Hence, machine learning with locally differentially private datasets can be rewritten as a distributionally-robust optimization. For general distributions, the distributionally-robust optimization problem can be relaxed as a regularized machine learning problem with the Lipschitz constant of the machine learning model as a regularizer. For linear and logistic regression, this regularizer is the dual norm of the model parameters. For Gaussian data, the distributionally-robust optimization problem can be solved exactly to find an optimal regularizer. This approach results in an entirely new regularizer for training linear regression models. Training with this novel regularizer can be posed as a semi-definite program. Finally, the performance of the proposed distributionally-robust machine learning training is demonstrated on practical datasets.
Index Terms—Local differential privacy, machine learning, distributionally-robust optimization, regularization, Wasserstein distance.
1 INTRODUCTION

Advances in artificial intelligence, particularly machine learning, have opened new possibilities for data analytics to address important societal challenges. However, these achievements can be stifled by privacy concerns. Local differential privacy, a variant of the popular differential privacy framework [1], [2], has been touted as an approach for providing privacy guarantees in the presence of an untrusted aggregator or data analyst [3]–[5]. This is because, with local differential privacy, the data can be freely aggregated and shared. Even commercial entities, such as Microsoft and Apple, have started using local differential privacy to deploy privacy-preserving data aggregation mechanisms [6]–[8].

The additive noise in local differential privacy can degrade the performance of machine learning models trained on perturbed privacy-preserving datasets. Several studies have looked into providing bounds for the performance degradation caused by local differential privacy noise as a function of the privacy budget and the dataset size [5], [9]–[12]. These studies, however, do not use recent advances in distributionally-robust optimization and machine learning (see, e.g., [13]–[15]) to compute robust machine learning models with out-of-sample performance guarantees in the presence of local differential privacy noise. They are more concerned with quantifying the effect of privacy-preserving noise on established machine learning algorithms.

Distributionally-robust optimization considers uncertain stochastic programs [16] with the ambiguity set for the distribution modeled by discrete distributions [17], moment constraints [18], the Kullback-Leibler divergence [19], and the Wasserstein distance [13]. Distributionally-robust optimization has shown significant promise in adversarial machine learning [20], [21] and machine learning with outlier data [22]. However, so far, this approach has not been used for training robust machine learning models based on datasets perturbed using local differential privacy.

In this paper, we use the Wasserstein distance to define an ambiguity set centered at the empirical distribution of the training dataset that is corrupted with local differential privacy noise. This ambiguity set is shown to contain the probability distribution of the unperturbed data. The radius of the ambiguity set is a function of the privacy budget, the spread of the data, and the size of the problem (i.e., the number of inputs and outputs of the machine learning model). Armed with this description of the ambiguity set, we can cast the problem of learning with locally differentially private data as a distributionally-robust optimization problem. We show that, for general distributions, an upper bound for the worst-case expected loss in the distributionally-robust optimization problem is the empirical sample-averaged loss plus a term proportional to the Lipschitz constant of the loss function. Using this, we can relax the distributionally-robust optimization problem as a regularized machine learning problem with the Lipschitz constant as a regularizer. For linear and logistic regression models, this regularizer is equal to the dual norm of the model parameters. For Gaussian data, the distributionally-robust optimization problem can be solved exactly to find an optimal regularizer for the problem. This approach results in an entirely new regularizer for linear regression. We finally demonstrate the performance of the proposed distributionally-robust optimization problems on practical datasets.

F. Farokhi is with the Department of Electrical and Electronic Engineering at the University of Melbourne, Australia. E-mail: [email protected]. Manuscript received June 25, 2020.
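As a minimal numerical illustration of the central tool used throughout this paper, note that the 1-Wasserstein distance between two empirical distributions with the same number of uniformly weighted samples on the real line reduces to the average absolute difference between sorted samples (a sketch assuming numpy; the function name is ours):

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size empirical
    distributions on the real line: with uniform weights, the optimal
    coupling matches sorted samples, so W1 is the mean absolute
    difference of the order statistics."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    assert xs.shape == ys.shape
    return float(np.mean(np.abs(xs - ys)))

# Shifting a sample by a constant c shifts the distribution, so W1 = |c|.
print(wasserstein_1d([0.0, 1.0, 2.0], [0.5, 1.5, 2.5]))  # -> 0.5
```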
2 OVERVIEW OF THE WASSERSTEIN DISTANCE
In this section, a brief overview of the Wasserstein distance is provided. The set of probability distributions $\mathbb{Q}$ supported on the set $\Xi \subseteq \mathbb{R}^m$ such that $\mathbb{E}^{\mathbb{Q}}\{\|\xi\|\} < \infty$ is denoted by $\mathcal{M}(\Xi)$. For all $\mathbb{P}_1, \mathbb{P}_2 \in \mathcal{M}(\Xi)$, the Wasserstein distance $W_q : \mathcal{M}(\Xi) \times \mathcal{M}(\Xi) \to \mathbb{R}_{\geq 0} := \{x \in \mathbb{R} \mid x \geq 0\}$ is
\[
W_q(\mathbb{P}_1, \mathbb{P}_2) := \inf_{\Pi} \Bigg\{ \bigg[\int_{\Xi^2} \|\xi_1 - \xi_2\|^q\, \Pi(\mathrm{d}\xi_1, \mathrm{d}\xi_2)\bigg]^{1/q} : \Pi \text{ is a joint distribution of } \xi_1 \text{ and } \xi_2 \text{ with marginals } \mathbb{P}_1 \text{ and } \mathbb{P}_2 \text{, respectively} \Bigg\}.
\]
The Wasserstein distance is symmetric, i.e., $W_q(\mathbb{P}_1, \mathbb{P}_2) = W_q(\mathbb{P}_2, \mathbb{P}_1)$. The Wasserstein distance satisfies the triangle inequality [23, p. 170], i.e., $W_q(\mathbb{P}_1, \mathbb{P}_3) \leq W_q(\mathbb{P}_1, \mathbb{P}_2) + W_q(\mathbb{P}_2, \mathbb{P}_3)$. Also, the Wasserstein distance is convex [24, Lemma 2.1], i.e., $W_q(\alpha \mathbb{P}_1 + (1-\alpha)\mathbb{P}_2, \mathbb{Q}) \leq \alpha W_q(\mathbb{P}_1, \mathbb{Q}) + (1-\alpha) W_q(\mathbb{P}_2, \mathbb{Q})$ for all $\alpha \in [0, 1]$. For $q = 1$, the duality theorem of Kantorovich and Rubinstein [25] implies that
\[
W_1(\mathbb{P}_1, \mathbb{P}_2) = \sup_{f \in \mathcal{L}} \Bigg\{ \int_{\Xi} f(\xi)\, \mathbb{P}_1(\mathrm{d}\xi) - \int_{\Xi} f(\xi)\, \mathbb{P}_2(\mathrm{d}\xi) \Bigg\},
\]
where $\mathcal{L}$ denotes the set of all Lipschitz functions with Lipschitz constant upper bounded by one, i.e., all functions $f$ such that $|f(\xi_1) - f(\xi_2)| \leq \|\xi_1 - \xi_2\|$ for all $\xi_1, \xi_2 \in \Xi$.

3 DISTRIBUTIONALLY-ROBUST MACHINE LEARNING WITH PRIVATE DATA
We consider supervised learning based on a training dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{p_x}$ is the input or feature vector (e.g., pixels of an image or features extracted from it) and $y_i \in \mathcal{Y} \subseteq \mathbb{R}^{p_y}$ is the output or label (e.g., image content). The training dataset is composed of independently and identically distributed (i.i.d.) samples from a probability distribution $\mathbb{P}$.

Training a machine learning model refers to extracting a model $M : \mathbb{R}^{p_x} \times \mathbb{R}^{p_\theta} \to \mathbb{R}^{p_y}$ (sometimes referred to as a hypothesis) to describe the relationship between inputs and outputs distributed according to $\mathbb{P}$. This can be done by solving the stochastic program
\[
J^* := \min_{\theta \in \Theta} \mathbb{E}^{\mathbb{P}}\{\ell(M(x;\theta), y)\}, \tag{1}
\]
where $\theta$ is the machine learning model parameter, $\Theta \subseteq \mathbb{R}^{p_\theta}$ is the set of feasible parameters, and $\ell : \mathbb{R}^{p_y} \times \mathbb{R}^{p_y} \to \mathbb{R}$ is the loss function. An example of a loss function is $\ell(M(x;\theta), y) = (M(x;\theta) - y)^\top (M(x;\theta) - y)$. The existence of a minimizer in (1), and of the subsequent approximations in this paper, is guaranteed if the loss function is continuous and the feasible set is compact. This problem is sometimes referred to as expected risk minimization.

In the absence of knowledge of the distribution $\mathbb{P}$, the training dataset $\{(x_i, y_i)\}_{i=1}^n$, i.e., samples from this distribution, can be used to solve the sample-averaged approximation problem
\[
\hat{J} := \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell(M(x_i;\theta), y_i). \tag{2}
\]
The approximation in (2) is often the starting point of machine learning. This is because (2) can be a good proxy for (1), when $n$ is large enough, in the sense of probably approximately correct (PAC) learnability [26]. We make the following standing assumption regarding the distribution of the training data.

Assumption 1. $\mathbb{E}^{\mathbb{P}}\{\exp(\|\xi\|^a)\} < \infty$ for some $a > 1$.

This assumption implies that $\mathbb{P}$ is a light-tailed distribution.
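The sample-averaged approximation (2) can be sketched for the linear-model, quadratic-loss case, in which the minimizer has a closed form via the normal equations (a sketch assuming numpy; function names and data are illustrative):

```python
import numpy as np

def sample_average_loss(theta, X, y):
    """Empirical objective in (2) for a linear model M(x; theta) = theta^T x
    with quadratic loss (M(x; theta) - y)^2."""
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2))

def fit_least_squares(X, y):
    """Minimizer of (2) in the linear/quadratic case, via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                          # noiseless samples from P
theta_hat = fit_least_squares(X, y)
print(np.allclose(theta_hat, theta_true))   # -> True
```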
All probability distributions with compact support set are light-tailed; however, unbounded noises, such as Gaussian or Laplace, are also light-tailed. This is often an implicit assumption in the machine learning literature because, for heavy-tailed distributions, the sample average of the loss in (2) may not even converge to the expected loss in (1), and hence PAC learnability might not hold [27], [28].

Due to privacy concerns, the training dataset is sometimes replaced with a noisy dataset $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^n$ in which
\[
\tilde{x}_i := x_i + w_i, \tag{3a}
\]
\[
\tilde{y}_i := y_i, \tag{3b}
\]
where $(w_i)_{i=1}^n$ are independently and identically distributed (i.i.d.) according to the probability distribution $\mathbb{W}$. Local differential privacy is a useful and versatile notion of privacy.

Definition 1 (Local Differential Privacy).
The reporting mechanism with additive noise in (3) is $(\epsilon, \delta)$-locally differentially private if, for all $(x, y), (x', y') \in \mathcal{X} \times \mathcal{Y}$ and any Lebesgue-measurable set $\mathcal{A} \subseteq \mathcal{X} \times \mathcal{Y}$,
\[
\mathbb{P}\{(\tilde{x}_i, \tilde{y}_i) \in \mathcal{A} \mid x_i = x,\, y_i = y\} \leq \exp(\epsilon)\, \mathbb{P}\{(\tilde{x}_i, \tilde{y}_i) \in \mathcal{A} \mid x_i = x',\, y_i = y'\} + \delta.
\]

Assumption 2.
$\mathcal{X} \subseteq [\underline{x}, \bar{x}]^{p_x}$.

The box-constraint nature of Assumption 2 is not, strictly speaking, necessary; however, it simplifies the closed-form expressions of the results. The results can be readily extended to any compact set by using the diameter of the set. It is widely known that we can ensure local differential privacy with Laplace and Gaussian additive noises. This is explored in the next theorem.

Theorem 1.
The following statements hold:
1) For $\epsilon > 0$, the mechanism in (3) is $\epsilon$-locally differentially private if $w_i$ is a vector of zero-mean i.i.d. Laplace noise with scale $\Delta/\epsilon$, where $\Delta := (\bar{x} - \underline{x}) p_x$;
2) For $\epsilon, \delta > 0$, the mechanism in (3) is $(\epsilon, \delta)$-locally differentially private if $w_i$ is a vector of zero-mean i.i.d. Gaussian noise with standard deviation $\sigma := \sqrt{2\log(1.25/\delta)}\,\Delta/\epsilon$.

Proof:
The proof for the Laplace mechanism follows from [2, Theorem 3.6] while noting that the $\ell_1$-sensitivity of the query (which is equal to the identity function for local differential privacy) is given by $\Delta$. The proof for the Gaussian noise follows from [2, Theorem A.1]. Note that the $\ell_1$-sensitivity is an upper bound for the $\ell_2$-sensitivity.

The privacy-preserving records in the training dataset $(\tilde{x}_i, \tilde{y}_i)_{i=1}^n$ are independently and identically distributed (i.i.d.) according to $\mathbb{D}$, which can be characterized by the convolution of $\mathbb{P}$ and $\mathbb{W}$. We can define the empirical probability distribution
\[
\widehat{\mathbb{D}}_n := \frac{1}{n} \sum_{i=1}^n d_{(\tilde{x}_i, \tilde{y}_i)},
\]
where $d_\xi$ is the Dirac distribution at $\xi$. Following the definition of the Dirac distribution, we have
\[
\frac{1}{n} \sum_{i=1}^n \ell(M(\tilde{x}_i;\theta), \tilde{y}_i) = \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;\theta), y)\}, \tag{4}
\]
and, as a result, we can rewrite the sample-averaged problem (2), evaluated on the privatized records, as
\[
\hat{J} := \min_{\theta \in \Theta} \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;\theta), y)\}. \tag{5}
\]
We can show that the empirical probability distribution $\widehat{\mathbb{D}}_n$ is in a vicinity of the original probability distribution $\mathbb{P}$ with high probability.

Theorem 2.
Assume that $\mathbb{W}$ is the distribution in Theorem 1. There exist constants $c_1, c_2 > 0$ such that
\[
\mathbb{D}^n\Big\{ W_1(\widehat{\mathbb{D}}_n, \mathbb{P}) \leq \zeta(\gamma) + \sqrt{2p}\,\Delta/\epsilon \Big\} \geq 1 - \gamma,
\]
for the Laplace mechanism and
\[
\mathbb{D}^n\Big\{ W_1(\widehat{\mathbb{D}}_n, \mathbb{P}) \leq \zeta(\gamma) + \sqrt{2p\log(1.25/\delta)}\,\Delta/\epsilon \Big\} \geq 1 - \gamma,
\]
for the Gaussian mechanism, where
\[
\zeta(\gamma) := \begin{cases} \left(\dfrac{\log(c_1/\gamma)}{c_2 n}\right)^{1/\max\{p,2\}}, & n \geq \dfrac{\log(c_1/\gamma)}{c_2}, \\[1em] \left(\dfrac{\log(c_1/\gamma)}{c_2 n}\right)^{1/a}, & n < \dfrac{\log(c_1/\gamma)}{c_2}, \end{cases}
\]
for all $n \geq 1$, $p = p_x + p_y \neq 2$, and $\gamma > 0$.

Proof:
Note that, since $\mathbb{P}$ is light-tailed, $\mathbb{D}$ is also light-tailed if we use the privacy-preserving noises in Theorem 1. Following [13], we know that
\[
\mathbb{D}^n\big\{ W_1(\widehat{\mathbb{D}}_n, \mathbb{D}) \leq \zeta(\gamma) \big\} \geq 1 - \gamma.
\]
Using [29, Lemma 8.6], we get $W_1(\mathbb{D}, \mathbb{P}) \leq W_1(d_0, \mathbb{W}) \leq \mathbb{E}^{\mathbb{W}}\{\|w\|\} \leq \sqrt{\mathbb{E}^{\mathbb{W}}\{\|w\|^2\}}$, where $d_0$ is the Dirac distribution at zero and the last inequality follows from Jensen's inequality [30, p. 27]. Furthermore, for the Laplace noise, we get $\mathbb{E}^{\mathbb{W}}\{\|w\|^2\} = \mathrm{trace}(\mathbb{E}^{\mathbb{W}}\{ww^\top\}) = 2p\Delta^2/\epsilon^2$ and, as a result, $W_1(\mathbb{D}, \mathbb{P}) \leq \sqrt{2p}\,\Delta/\epsilon$. Therefore, $W_1(\widehat{\mathbb{D}}_n, \mathbb{P}) \leq \zeta(\gamma) + \sqrt{2p}\,\Delta/\epsilon$ whenever $W_1(\widehat{\mathbb{D}}_n, \mathbb{D}) \leq \zeta(\gamma)$, which implies that $\mathbb{D}^n\{ W_1(\widehat{\mathbb{D}}_n, \mathbb{P}) \leq \zeta(\gamma) + \sqrt{2p}\,\Delta/\epsilon \} \geq 1 - \gamma$. The proof for the Gaussian noise is the same, with the exception that $\mathbb{E}^{\mathbb{W}}\{\|w\|^2\} = 2p\log(1.25/\delta)\Delta^2/\epsilon^2$.

Hence, if we select $\rho$ large enough, the original distribution $\mathbb{P}$ belongs to the ambiguity set $\{\mathbb{G} : W_1(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho\}$ with high probability. This observation motivates training the machine learning model by solving the distributionally-robust optimization problem
\[
\hat{J}_n := \min_{\theta \in \Theta}\ \sup_{\mathbb{G} :\, W_1(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho} \mathbb{E}^{\mathbb{G}}\{\ell(M(x;\theta), y)\}, \tag{6}
\]
for some constant $\rho > 0$. The correct value of $\rho$ is discussed in the next theorem.

Theorem 3.
Assume that $\mathbb{W}$ is the distribution in Theorem 1. If $\rho = \zeta(\beta) + \sqrt{2p}\,\Delta/\epsilon$ for the Laplace mechanism or $\rho = \zeta(\beta) + \sqrt{2p\log(1.25/\delta)}\,\Delta/\epsilon$ for the Gaussian mechanism, then
\[
\mathbb{D}^n\big\{ J^* \leq \mathbb{E}^{\mathbb{P}}\{\ell(M(x;\hat{\theta}_n), y)\} \leq \hat{J}_n \big\} \geq 1 - \beta, \tag{7}
\]
where $\beta \in (0,1)$ is a significance parameter and the trained model parameter $\hat{\theta}_n \in \Theta$ is the minimizer of (6).

The optimization problem in (6) involves taking a supremum over probability distributions. This is an infinite-dimensional optimization problem and is hence computationally difficult to solve. Therefore, we relax this problem in the remainder of this section.

Proposition 1.
Assume that $\ell(M(x;\theta), y)$ is $L(\theta)$-Lipschitz continuous in $(x,y)$ for each fixed $\theta \in \Theta$. Then,
\[
\sup_{\mathbb{G} :\, W_1(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho} \mathbb{E}^{\mathbb{G}}\{\ell(M(x;\theta), y)\} \leq \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;\theta), y)\} + L(\theta)\rho.
\]

Proof:
The proof follows from the duality theorem of Kantorovich and Rubinstein [25].

Now, we can define the regularized sample-averaged optimization problem
\[
\tilde{J}_n := \min_{\theta \in \Theta} \Big[ \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;\theta), y)\} + \rho L(\theta) \Big]. \tag{8}
\]
We can still prove a performance guarantee for the optimizer of (8). This is done in the next theorem.

Theorem 4.
Assume that $\mathbb{W}$ is the distribution in Theorem 1. If $\rho = \zeta(\beta) + \sqrt{2p}\,\Delta/\epsilon$ for the Laplace mechanism or $\rho = \zeta(\beta) + \sqrt{2p\log(1.25/\delta)}\,\Delta/\epsilon$ for the Gaussian mechanism, then
\[
\mathbb{D}^n\big\{ J^* \leq \mathbb{E}^{\mathbb{P}}\{\ell(M(x;\tilde{\theta}_n), y)\} \leq \tilde{J}_n \big\} \geq 1 - \beta, \tag{9}
\]
where the trained model parameter $\tilde{\theta}_n \in \Theta$ is the minimizer of (8).

Proof:
The proof is similar to that of [13, Theorem 3.4], with an extra step relying on Proposition 1. By selecting $\rho$ as in the statement of the theorem, $\mathbb{P}$ belongs to a ball around $\widehat{\mathbb{D}}_n$ with radius $\rho$ with probability greater than or equal to $1-\beta$ according to Theorem 2. Therefore, with probability of at least $1-\beta$,
\[
\mathbb{E}^{\mathbb{P}}\{\ell(M(x;\tilde{\theta}_n), y)\} \leq \sup_{\mathbb{G} :\, W_1(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho} \mathbb{E}^{\mathbb{G}}\{\ell(M(x;\tilde{\theta}_n), y)\}.
\]
Proposition 1 states that
\[
\sup_{\mathbb{G} :\, W_1(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho} \mathbb{E}^{\mathbb{G}}\{\ell(M(x;\tilde{\theta}_n), y)\} \leq \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;\tilde{\theta}_n), y)\} + \rho L(\tilde{\theta}_n) = \tilde{J}_n.
\]
Therefore, with probability of at least $1-\beta$, $\mathbb{E}^{\mathbb{P}}\{\ell(M(x;\tilde{\theta}_n), y)\} \leq \tilde{J}_n$.

Theorem 4 shows that, by regularizing the sample-averaged cost function, we can train a machine learning model that performs better in the presence of locally differentially private noise. This is an interesting observation demonstrating the value of regularization in machine learning with private data.

Remark 1 (Large Datasets).
In the limit of large $n$, $\zeta(\gamma) \approx 0$. Therefore, following Theorem 4, $\rho = O\big(p_x (p_x + p_y)^{1/2}/\epsilon\big)$ for the Laplace mechanism and $\rho = O\big(p_x (p_x + p_y)^{1/2} \log(1/\delta)^{1/2}/\epsilon\big)$ for the Gaussian mechanism. This implies that the regularization weight $\rho$ should increase when reducing the privacy budget $\epsilon$. Also, higher-dimensional problems, i.e., those with larger $p_x$ or $p_y$, require larger regularization weights $\rho$.

Remark 2 (Linear and Logistic Regression).
Without loss of generality, consider $p_y = 1$. If $p_y > 1$, each output can be treated independently. In this case, $M(x;\theta) = \theta^\top [x^\top\ 1]^\top$ and $\ell(M(x;\theta), y) = (M(x;\theta) - y)^2/2$. We also assume that $(x,y)$ belongs to the compact set $\mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^{p_x} \times \mathbb{R}$. Following [31], we know that $L(\theta) = (X + 1 + Y)\|\theta\|_*$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, $X = \max_{x \in \mathcal{X}} \|x\|$, and $Y = \max_{y \in \mathcal{Y}} |y|$. An alternative to the quadratic loss function is $\ell(M(x;\theta), y) = |M(x;\theta) - y|$. Again, following [31], $L(\theta) = \|\theta\|_*$. For logistic regression, $M(x;\theta) = 1/(1 + \exp(-[x^\top\ 1]\theta))$ and $\ell(M(x;\theta), y) = -y \log(M(x;\theta)) - (1-y)\log(1 - M(x;\theta))$, for which, following [31], $L(\theta) = (Y + X + 2)\|\theta\|_*$.

4 LINEAR REGRESSION WITH GAUSSIAN DATA
In this section, we consider the specific case in which $(x_i, y_i)_{i=1}^n$ is Gaussian with mean $\mu$ and covariance $\Sigma$. Furthermore, we assume that $M(x;A,B) = Ax + B$, with $A$ and $B$ modeling the machine learning model parameters (instead of $\theta$), and $\ell(M(x;A,B), y) = \|M(x;A,B) - y\|_2^2$. Therefore, by using the Gaussian mechanism in Theorem 1, $(\tilde{x}_i, \tilde{y}_i)_{i=1}^n$ is also Gaussian. Note that Assumption 2 no longer holds in this section (as the Gaussian process behind the data has infinite support). However, with high probability, the data belongs to a bounded set. Therefore, we can adopt local randomized differential privacy instead of local differential privacy [32]–[34].

In the Gaussian linear regression case described above, we can redefine the empirical probability distribution $\widehat{\mathbb{D}}_n$ to be Gaussian with mean $\hat{\mu}_n$ and covariance $\widehat{\Sigma}_n$, where
\[
\hat{\mu}_n := \frac{1}{n} \sum_{i=1}^n \begin{bmatrix} \tilde{x}_i \\ \tilde{y}_i \end{bmatrix}, \qquad \widehat{\Sigma}_n := \frac{1}{n-1} \sum_{i=1}^n \left( \begin{bmatrix} \tilde{x}_i \\ \tilde{y}_i \end{bmatrix} - \hat{\mu}_n \right) \left( \begin{bmatrix} \tilde{x}_i \\ \tilde{y}_i \end{bmatrix} - \hat{\mu}_n \right)^{\!\top}.
\]

Theorem 5.
Assume that $\mathbb{W}$ is the Gaussian distribution in Theorem 1. With high probability, for any $\varepsilon > 0$, there exists $n_\varepsilon \geq 1$ such that if $n \geq n_\varepsilon$,
\[
W_2(\widehat{\mathbb{D}}_n, \mathbb{P}) \leq \varepsilon + \sqrt{2p\log(1.25/\delta)}\,\Delta/\epsilon.
\]

Proof:
The proof is similar to that of Theorem 2, with the exception of using the fact that $W_2(\widehat{\mathbb{D}}_n, \mathbb{D}) \to 0$ with probability one as $n \to \infty$ [35, Theorem 2.1].

In this section, we consider the big-data regime ($n \gg 1$), so, without loss of generality, $\varepsilon \simeq 0$ and $\mu \simeq \hat{\mu}_n$. Therefore, by using Theorem 5, the original distribution $\mathbb{P}$ belongs to the ambiguity set $\{\mathbb{G} : \mathbb{G} \text{ is Gaussian} \wedge \mathbb{E}^{\mathbb{G}}\{(x,y)\} = \hat{\mu}_n \wedge W_2(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho\}$ if we select $\rho = \sqrt{2p\log(1.25/\delta)}\,\Delta/\epsilon$. This observation motivates training the machine learning model by solving the distributionally-robust optimization problem
\[
\hat{J}_n := \min_{A,B}\ \sup_{\substack{\mathbb{G} \text{ is Gaussian} \\ \mathbb{E}^{\mathbb{G}}\{(x,y)\} = \hat{\mu}_n \\ W_2(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho}} \mathbb{E}^{\mathbb{G}}\{\ell(M(x;A,B), y)\}, \tag{10}
\]
for the constant $\rho = \sqrt{2p\log(1.25/\delta)}\,\Delta/\epsilon$. Following [36, Proposition 7], if $\widehat{\mathbb{D}}_n$ and $\mathbb{G}$ are both Gaussian with the same mean $\mu \simeq \hat{\mu}_n$, we get
\[
W_2(\widehat{\mathbb{D}}_n, \mathbb{G})^2 = \mathrm{trace}\Big(\Sigma_{\mathbb{G}} + \widehat{\Sigma}_n - 2\big(\widehat{\Sigma}_n^{1/2} \Sigma_{\mathbb{G}} \widehat{\Sigma}_n^{1/2}\big)^{1/2}\Big), \tag{11}
\]
where $\Sigma_{\mathbb{G}}$ denotes the covariance of $\mathbb{G}$. The expected risk is given by
\[
\begin{aligned}
\mathbb{E}^{\mathbb{G}}\{\ell(M(x;A,B), y)\} &= \mathrm{trace}\big(\mathbb{E}^{\mathbb{G}}\{(Ax + B - y)(Ax + B - y)^\top\}\big) \\
&= \mathrm{trace}\Bigg(\mathbb{E}^{\mathbb{G}}\Bigg\{ [A\ \ {-I}] \begin{bmatrix} x \\ y \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}^{\!\top} [A\ \ {-I}]^\top + BB^\top + [A\ \ {-I}] \begin{bmatrix} x \\ y \end{bmatrix} B^\top + B \begin{bmatrix} x \\ y \end{bmatrix}^{\!\top} [A\ \ {-I}]^\top \Bigg\}\Bigg) \\
&= \mathrm{trace}\Big( [A\ \ {-I}] \big(\Sigma_{\mathbb{G}} + \hat{\mu}_n \hat{\mu}_n^\top\big) [A\ \ {-I}]^\top + BB^\top + [A\ \ {-I}] \hat{\mu}_n B^\top + B \hat{\mu}_n^\top [A\ \ {-I}]^\top \Big).
\end{aligned}
\]
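As a numerical aside, the Gaussian Wasserstein distance in (11) can be evaluated with symmetric eigendecompositions for the matrix square roots (a sketch assuming numpy; function names are ours):

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric positive semi-definite square root via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussians(S1, S2):
    """2-Wasserstein distance between Gaussians with equal means, per (11):
    W2^2 = trace(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    R = psd_sqrt(S2)
    cross = psd_sqrt(R @ S1 @ R)
    return float(np.sqrt(max(np.trace(S1 + S2 - 2.0 * cross), 0.0)))

S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(w2_gaussians(S, S))                      # identical covariances -> ~0
print(w2_gaussians(np.eye(2), 4 * np.eye(2)))  # commuting case -> sqrt(2)
```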
Using [14, Proposition 2.8], it can be deduced that
\[
\sup_{\substack{\mathbb{G} \text{ is Gaussian} \\ \mathbb{E}^{\mathbb{G}}\{(x,y)\} = \hat{\mu}_n \\ W_2(\mathbb{G}, \widehat{\mathbb{D}}_n) \leq \rho}} \mathbb{E}^{\mathbb{G}}\{\ell(M(x;A,B), y)\} = \mathrm{trace}\Big( [A\ \ {-I}] \hat{\mu}_n \hat{\mu}_n^\top [A\ \ {-I}]^\top + BB^\top + [A\ \ {-I}] \hat{\mu}_n B^\top + B \hat{\mu}_n^\top [A\ \ {-I}]^\top \Big) + f(A),
\]
where
\[
f(A) := \inf_{\xi}\ \Bigg[ \xi\big(\rho^2 - \mathrm{trace}(\widehat{\Sigma}_n)\big) + \xi^2\, \mathrm{trace}\Bigg( \Big(\xi I - \begin{bmatrix} A^\top \\ -I \end{bmatrix} [A\ \ {-I}] \Big)^{-1} \widehat{\Sigma}_n \Bigg) \Bigg], \quad \mathrm{s.t.}\ \ \xi I \succ \begin{bmatrix} A^\top \\ -I \end{bmatrix} [A\ \ {-I}].
\]
Further,
\[
\mathrm{trace}\Big( [A\ \ {-I}] \hat{\mu}_n \hat{\mu}_n^\top [A\ \ {-I}]^\top + BB^\top + [A\ \ {-I}] \hat{\mu}_n B^\top + B \hat{\mu}_n^\top [A\ \ {-I}]^\top \Big) = \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;A,B), y)\} - \mathrm{trace}\Big( [A\ \ {-I}]\, \widehat{\Sigma}_n\, [A\ \ {-I}]^\top \Big).
\]
Therefore, the optimization problem in (10) can be rewritten as
\[
\hat{J}_n := \min_{A,B}\ \mathbb{E}^{\widehat{\mathbb{D}}_n}\{\ell(M(x;A,B), y)\} + \lambda(A), \tag{12}
\]
where $\lambda(A) := f(A) - \mathrm{trace}\big([A\ \ {-I}]\, \widehat{\Sigma}_n\, [A\ \ {-I}]^\top\big)$ is the optimal regularizer for linear regression with locally differentially private Gaussian data. This regularizer is completely novel in the context of linear regression. In what follows, we provide a more tractable formulation for (12).

Theorem 6.
The solution to (12) is given by
\[
\min_{A,B,\xi,Z}\ \ \xi\big(\rho^2 - \mathrm{trace}(\widehat{\Sigma}_n)\big) + \mathrm{trace}(Z) + \mathrm{trace}\Big( [A\ \ {-I}] \hat{\mu}_n \hat{\mu}_n^\top [A\ \ {-I}]^\top + BB^\top + [A\ \ {-I}] \hat{\mu}_n B^\top + B \hat{\mu}_n^\top [A\ \ {-I}]^\top \Big), \tag{13a}
\]
\[
\mathrm{s.t.} \quad \begin{bmatrix} Z & \xi\widehat{\Sigma}_n^{1/2} & 0 \\ \xi\widehat{\Sigma}_n^{1/2} & \xi I & \begin{bmatrix} A^\top \\ -I \end{bmatrix} \\ 0 & [A\ \ {-I}] & I \end{bmatrix} \succeq 0. \tag{13b}
\]

Proof:
Let $Z \succeq 0$ be such that
\[
\xi^2\, \widehat{\Sigma}_n^{1/2} \Big(\xi I - \begin{bmatrix} A^\top \\ -I \end{bmatrix} [A\ \ {-I}]\Big)^{-1} \widehat{\Sigma}_n^{1/2} \preceq Z.
\]
Using the Schur complement [37], we can transform this inequality into
\[
\begin{bmatrix} Z & \xi\widehat{\Sigma}_n^{1/2} \\ \xi\widehat{\Sigma}_n^{1/2} & \xi I - \begin{bmatrix} A^\top \\ -I \end{bmatrix} [A\ \ {-I}] \end{bmatrix} \succeq 0.
\]
Again, using the Schur complement, this matrix inequality can be transformed into
\[
\begin{bmatrix} Z & \xi\widehat{\Sigma}_n^{1/2} & 0 \\ \xi\widehat{\Sigma}_n^{1/2} & \xi I & \begin{bmatrix} A^\top \\ -I \end{bmatrix} \\ 0 & [A\ \ {-I}] & I \end{bmatrix} \succeq 0. \tag{14}
\]
Finally, using the Schur complement, the constraint in computing $f(A)$ can be rewritten as
\[
\begin{bmatrix} \xi I & \begin{bmatrix} A^\top \\ -I \end{bmatrix} \\ [A\ \ {-I}] & I \end{bmatrix} \succeq 0. \tag{15}
\]
Note that (15) is implied by (14), as its left-hand side is a principal submatrix of the left-hand side of (14), and thus need not be added to the constraints.

5 EXPERIMENTAL RESULTS
Here, we demonstrate the performance of distributionally-robust machine learning on two practical datasets.
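Both experiments privatize the training features with the Gaussian mechanism of Theorem 1 before training. A minimal sketch of such a privatization step (assuming numpy; the function name and seed handling are ours, and the constant 1.25 is taken from the standard Gaussian-mechanism analysis [2, Theorem A.1]):

```python
import numpy as np

def gaussian_mechanism(X, eps, delta, x_min=0.0, x_max=1.0, seed=0):
    """Privatize features per the Gaussian mechanism of Theorem 1:
    each record receives i.i.d. zero-mean Gaussian noise with standard
    deviation sigma = sqrt(2 * log(1.25/delta)) * Delta / eps, where
    Delta = (x_max - x_min) * p_x on the box [x_min, x_max]^{p_x}."""
    X = np.asarray(X, dtype=float)
    p_x = X.shape[1]
    Delta = (x_max - x_min) * p_x
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * Delta / eps
    rng = np.random.default_rng(seed)
    return X + rng.normal(scale=sigma, size=X.shape)

X = np.random.default_rng(1).uniform(size=(50, 4))    # features scaled to [0, 1]
X_tilde = gaussian_mechanism(X, eps=5.0, delta=1e-6)  # labels stay unperturbed, per (3b)
print(X_tilde.shape)  # -> (50, 4)
```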
The first dataset contains information on roughly 887,000 loans [38]. The inputs contain loan information, e.g., total loan size, and borrower information, e.g., age. The outputs are the interest rates of the loans. Categorical features, e.g., state of residence and loan grade, are encoded with integer numbers. Unique identifiers, e.g., identity, and irrelevant attributes, e.g., URLs, are removed. We scale all input attributes and outputs to be between zero and one to meet Assumption 2. We consider the linear regression framework in Remark 2 and Section 4. We use the Gaussian mechanism in Theorem 1 to generate locally differentially private datasets with δ = 10− and ε ∈ [10, ]/p_x.

[Fig. 1. Performance of the linear regression for the loan dataset trained on the locally differentially private dataset and tested on the original probability distribution; legend: (12); (8) with ρ = 10−; (2).]

[Fig. 2. Performance of the linear regression for the Adult dataset trained on the locally differentially private dataset and tested on the original probability distribution; legend: (12); (8) with ρ = 10−; (2).]

We use 50 entries of the dataset for training and the remaining entries for evaluation. Note that we can use such a low number of training entries because we are using linear regression. Figure 1 illustrates the performance of the linear regression for the loan dataset trained on the locally differentially private dataset and tested on the original probability distribution. Regularization clearly improves the out-of-sample performance of the model. Furthermore, the optimal regularizer of Section 4 can perform better than the generic regularizer based on the Lipschitz constant of the loss function in (8).

The second dataset contains nearly 49,000 records from the 1994 Census database [39]. The records contain features, e.g., age and education. The output is a binary number indicating whether an individual earns more than $50,000.
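In both experiments, the Lipschitz-regularized objective (8) with the dual-norm regularizer of Remark 2 can be minimized by simple subgradient descent. A sketch for the $\ell_2$ norm, which is its own dual (assuming numpy; the step size, iteration count, and synthetic data are illustrative):

```python
import numpy as np

def train_regularized(X, y, rho, lr=0.1, iters=500):
    """Subgradient descent on the regularized objective (8) for linear
    regression: mean((theta^T x - y)^2)/2 + rho * ||theta||_2, the
    l2 norm being self-dual (Remark 2)."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n       # gradient of the smooth part
        norm = np.linalg.norm(theta)
        if norm > 0:
            grad = grad + rho * theta / norm   # subgradient of rho * ||theta||_2
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
theta_reg = train_regularized(X, y, rho=0.5)
theta_erm = train_regularized(X, y, rho=0.0)
print(np.linalg.norm(theta_reg) < np.linalg.norm(theta_erm))  # regularization shrinks theta
```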
Similarly, we scale all input attributes and outputs to be between zero and one in line with Assumption 2. We consider the linear regression framework in Remark 2 and Section 4. We use the Gaussian mechanism in Theorem 1 to generate locally differentially private datasets with δ = 10− and ε ∈ [10, ]/p_x. We use 50 entries of the dataset for training and the remaining entries for evaluation. Again, we can use such a low number of training entries because we are using linear regression. Figure 2 illustrates the performance of the linear regression for the Adult dataset trained on the locally differentially private dataset and tested on the original probability distribution. Regularization clearly improves the performance of the model.

6 CONCLUSIONS
We considered machine learning, particularly regression, using locally differentially private datasets. We posed machine learning with locally differentially private datasets as a distributionally-robust optimization with an ambiguity set parameterized by the Wasserstein distance. For general distributions, the distributionally-robust optimization problem was relaxed as a regularized machine learning problem with the Lipschitz constant of the machine learning model as a regularizer. For Gaussian data, the distributionally-robust optimization problem was solved exactly to find an optimal regularizer.

REFERENCES

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography (S. Halevi and T. Rabin, eds.), Berlin, Heidelberg: Springer, pp. 265–284, 2006.
[2] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
[3] P. Kairouz, S. Oh, and P. Viswanath, "Extremal mechanisms for local differential privacy," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 492–542, 2016.
[4] R. Dewri, "Local differential perturbations: Location privacy under approximate knowledge attackers," IEEE Transactions on Mobile Computing, vol. 12, no. 12, pp. 2360–2372, 2013.
[5] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Local privacy and statistical minimax rates," in , pp. 429–438, 2013.
[6] X. Ren, C.-M. Yu, W. Yu, S. Yang, X. Yang, J. A. McCann, and S. Y. Philip, "LoPub: High-dimensional crowdsourced data publication with local differential privacy," IEEE Transactions on Information Forensics and Security, vol. 13, no. 9, pp. 2151–2166, 2018.
[7] Ú. Erlingsson, V. Pihur, and A. Korolova, "RAPPOR: Randomized aggregatable privacy-preserving ordinal response," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067, 2014.
[8] J. Tang, A. Korolova, X. Bai, X. Wang, and X. Wang, "Privacy loss in Apple's implementation of differential privacy on MacOS 10.12," arXiv preprint arXiv:1709.02753, 2017.
[9] A. Smith, A. Thakurta, and J. Upadhyay, "Is interaction necessary for distributed private learning?," in , pp. 58–77, IEEE, 2017.
[10] D. Wang, M. Gaboardi, and J. Xu, "Empirical risk minimization in non-interactive local differential privacy revisited," in Advances in Neural Information Processing Systems, pp. 965–974, 2018.
[11] K. Zheng, W. Mou, and L. Wang, "Collect at once, use effectively: Making non-interactive locally private learning possible," in Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 4130–4139, 2017.
[12] D. Wang, A. Smith, and J. Xu, "Noninteractive locally private learning of linear models via polynomial approximations," in Algorithmic Learning Theory, pp. 898–903, 2019.
[13] P. M. Esfahani and D. Kuhn, "Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations," Mathematical Programming, vol. 171, no. 1–2, pp. 115–166, 2018.
[14] V. A. Nguyen, D. Kuhn, and P. M. Esfahani, "Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator," arXiv preprint arXiv:1805.07194, 2018.
[15] D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh, "Wasserstein distributionally robust optimization: Theory and applications in machine learning," in Operations Research & Management Science in the Age of Analytics, pp. 130–166, INFORMS, 2019.
[16] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, vol. 28. Princeton University Press, 2009.
[17] K. Postek, D. den Hertog, and B. Melenberg, "Computationally tractable counterparts of distributionally robust constraints on risk measures," SIAM Review, vol. 58, no. 4, pp. 603–650, 2016.
[18] E. Delage and Y. Ye, "Distributionally robust optimization under moment uncertainty with application to data-driven problems," Operations Research, vol. 58, no. 3, pp. 595–612, 2010.
[19] Z. Hu and L. J. Hong, "Kullback-Leibler divergence constrained distributionally robust optimization," Available at Optimization Online, 2013.
[20] A. Sinha, H. Namkoong, and J. Duchi, "Certifiable distributional robustness with principled adversarial training," in Proceedings of the Machine Learning and Computer Security Workshop (co-located with Conference on Neural Information Processing Systems 2017), vol. 2, 2017.
[21] F. Farokhi, "Regularization helps with mitigating poisoning attacks: Distributionally-robust machine learning using the Wasserstein distance," arXiv preprint arXiv:2001.10655, 2020.
[22] R. Chen and I. C. Paschalidis, "A distributionally robust optimization approach for outlier detection," in , pp. 352–357, IEEE, 2018.
[23] A. Prügel-Bennett, The Probability Companion for Engineering and Computer Science. Cambridge University Press, 2020.
[24] G. C. Pflug and A. Pichler, Multistage Stochastic Optimization. Springer Series in Operations Research and Financial Engineering, Springer International Publishing, 2014.
[25] L. V. Kantorovich and G. Rubinshtein, "On a space of totally additive functions," Vestnik Leningradskogo Universiteta, vol. 13, pp. 52–59, 1958.
[26] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[27] C. Brownlees, E. Joly, and G. Lugosi, "Empirical risk minimization for heavy-tailed losses," The Annals of Statistics, vol. 43, no. 6, pp. 2507–2536, 2015.
[28] O. Catoni, "Challenging the empirical mean and empirical variance: A deviation study," Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol. 48, pp. 1148–1185, 2012.
[29] P. J. Bickel and D. A. Freedman, "Some asymptotic theory for the bootstrap," The Annals of Statistics, vol. 9, no. 6, pp. 1196–1217, 1981.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley, 2012.
[31] F. Farokhi, "Regularization helps with mitigating poisoning attacks: Distributionally-robust machine learning using the Wasserstein distance," arXiv preprint arXiv:2001.10655, 2020. https://arxiv.org/abs/2001.10655.
[32] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber, "Privacy: Theory meets practice on the map," in Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286, 2008.
[33] B. I. Rubinstein and F. Aldà, "Pain-free random differential privacy with sensitivity sampling," in International Conference on Machine Learning, pp. 2950–2959, 2017.
[34] R. Hall, A. Rinaldo, and L. Wasserman, "Random differential privacy," Journal of Privacy and Confidentiality, vol. 4, no. 2, pp. 43–59, 2012.
[35] T. Rippl, A. Munk, and A. Sturm, "Limit laws of the empirical Wasserstein distance: Gaussian distributions," Journal of Multivariate Analysis, vol. 151, pp. 90–109, 2016.
[36] C. R. Givens and R. M. Shortt, "A class of Wasserstein metrics for probability distributions," The Michigan Mathematical Journal, vol. 31, no. 2, pp. 231–240, 1984.
[37] F. Zhang,