Median regression with differential privacy∗

E Chen, Ying Miao and Yu Tang

School of Mathematical Sciences, Soochow University
Faculty of Engineering, Information and Systems, University of Tsukuba

Apr. 25, 2020
Abstract:
Median regression analysis has robustness properties which make it attractive compared with regression based on the mean, while differential privacy can protect individual privacy during statistical analysis of certain datasets. In this paper, three privacy preserving methods are proposed for median regression. The first algorithm is based on a finite smoothing method, the second provides an iterative way, and the last one further employs the greedy coordinate descent approach. Privacy preserving properties of these three methods are all proved. Accuracy bounds or convergence properties of these algorithms are also provided. Numerical calculation shows that the first method has better accuracy than the others when the sample size is small. When the sample size becomes larger, the first method needs more time while the second method needs less time with well-matched accuracy. The third method costs less time in both cases, but it highly depends on the step size.

Keywords: median regression, differential privacy, l1 sensitivity, Laplace mechanism

MSC2010: 62F30, 68W20

∗ Supported by NNSF of China (11671290)

1 Introduction
Personal privacy information may be exposed with the unprecedented availability of datasets, so there is an increasing requirement that statistical analysis of such datasets should protect individual privacy. As [6] describes, differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population. Over the past few years, differential privacy has been investigated in machine learning [1] and has been applied in the real world, see for example [8]. Recently, [3] formulates a general lower bound argument for minimax risks with differential privacy constraints, and applies this argument to high-dimensional mean estimation and linear regression problems.

In this paper, three privacy preserving methods are proposed for median regression, which is a special case of quantile regression. Quantile regression was first introduced in [12], and aims to estimate and conduct inference about conditional quantile functions. In recent years, quantile regression has become a comprehensive method for statistical analysis of response models and has been widely used in practice, for instance in survival analysis and economics; see for example [14], [20] and [15]. The fact that median regression takes the least absolute deviation as its objective function to estimate parameters has long been known among statisticians [12].

Denote a dataset of n i.i.d. samples of independent variables as X, where each observation contains d variables x_1, x_2, ..., x_d. In the regression setting, we assume Y_i is the response for case i, x_ij is the value of predictor j for case i, and β_j is the regression coefficient corresponding to predictor j, where 1 ≤ i ≤ n, 1 ≤ j ≤ d. In this paper, we consider the linear l1 regression problem, i.e., minimizing the following function:

    F(µ, β) = (1/n) Σ_{i=1}^{n} |r_i(µ, β)|,    (1)

where r_i(µ, β) = µ + X_i β − Y_i (i = 1, 2, ..., n), X_i represents the i-th row of X, and β = (β_1, ..., β_d)^T. Without loss of generality, assume that |Y_i| ≤ B (B is a positive number) and ||X_i||_2 ≤ 1 for i = 1, ..., n. In vector form, r(µ, β) = µ1 + Xβ − Y represents a set of linear functions in R^n with Y = (y_1, ..., y_n)^T, where 1 is an n-dimensional column vector whose elements are all 1. In addition, ridge penalized regression is more stable than simple linear regression, and its objective function can be viewed as minimizing the criterion

    L(µ, β) = F(µ, β) + (λ/2) β^T β,    (2)

where λ is a fixed regularization parameter.

2 Differential privacy

We consider a dataset x as a collection of observations from a universe X. It is convenient to represent databases by their histograms: x ∈ N^{|X|}, in which each entry x_i represents the number of elements in the database x of type i ∈ X. For example, suppose the universe X contains 5 types of records, which we denote by {1, 2, 3, 4, 5}.
If a dataset x consists of the three records 1, 1 and 2, it can be represented as the 5-dimensional vector (2, 1, 0, 0, 0); similarly, the vector (2, 1, 1, 0, 0) represents another dataset y with 4 records.

Differential privacy is based on the neighbourhood of a database. When applying differential privacy in practice, it is key to define the precise condition under which two databases x and y are considered to be neighbouring. There are two possible choices, producing two types of differential privacy: one is called unbounded differential privacy [5] and the other is called bounded differential privacy [7]. Bounded differential privacy assumes that both x and y have the same size n and that y can be obtained from x by replacing exactly one record, while unbounded differential privacy does not require x and y to have the same fixed size and holds the view that y can be obtained from x by adding or deleting exactly one record. In this paper, we adopt bounded differential privacy as our choice and use the notation x ∼ y if x and y are neighbouring.
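To make the histogram representation and the neighbouring relation concrete, the following short Python sketch (our own illustration; the record types, dataset contents and helper names are not taken from the paper) encodes datasets over the universe {1, 2, 3, 4, 5} as count vectors and checks the bounded notion of neighbouring, under which two datasets of equal size differ by replacing exactly one record.

    import numpy as np

    UNIVERSE = [1, 2, 3, 4, 5]          # the 5 record types of the running example

    def histogram(records):
        """Represent a dataset as a count vector over the universe."""
        return np.array([records.count(t) for t in UNIVERSE])

    def are_neighbours(hx, hy):
        """Bounded differential privacy: same size, one record replaced."""
        same_size = hx.sum() == hy.sum()
        # replacing one record moves one unit of count between two bins
        return bool(same_size and np.abs(hx - hy).sum() == 2)

    x = histogram([1, 1, 2])            # (2, 1, 0, 0, 0), three records
    y = histogram([1, 1, 2, 3])         # (2, 1, 1, 0, 0), four records
    z = histogram([1, 1, 3])            # x with its record "2" replaced by "3"
    print(are_neighbours(x, y))         # False: different sizes
    print(are_neighbours(x, z))         # True: exactly one record replaced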
Definition 2.1. A randomized algorithm M with domain N^{|X|} is (ε, δ)-differentially private if for all S ⊆ Range(M) and for all datasets x, y ∈ N^{|X|} with x ∼ y:

    Pr(M(x) ∈ S) ≤ exp(ε) Pr(M(y) ∈ S) + δ.

Intuitively, this definition guarantees that a randomized algorithm behaves similarly on slightly different input datasets, which achieves the purpose of protecting individual privacy in some sense. Next, a randomized algorithm named the Laplace mechanism, which is an effective method for privacy preserving, will be introduced. First, we need the concept of l1 sensitivity.
Definition 2.2. The l1 sensitivity of a function f : N^{|X|} → R^k is:

    Δf = max_{x, y ∈ N^{|X|}, x ∼ y} ||f(x) − f(y)||_1.

The l1 sensitivity of a function f captures the magnitude by which a single individual's data can change the function f in the worst case. It is noteworthy that Δf is an important quantity in the Laplace mechanism.
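As a small illustration (our own example, not taken from the paper), consider the histogram query f(x) = x on the running universe of 5 record types. Under bounded differential privacy, replacing one record moves one unit of count from one bin to another, so the l1 sensitivity of the histogram query is 2:

    import numpy as np

    def l1_distance(hx, hy):
        """l1 distance between the histograms of two datasets."""
        return np.abs(np.asarray(hx) - np.asarray(hy)).sum()

    x = np.array([2, 1, 0, 0, 0])       # records {1, 1, 2}
    z = np.array([2, 0, 1, 0, 0])       # one record replaced: {1, 1, 3}
    print(l1_distance(x, z))            # 2, the worst case over all neighbouring pairs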
Definition 2.3. Given any function f : N^{|X|} → R^k, the Laplace mechanism is defined as:

    M_L(x, f(·), ε) = f(x) + (Y_1, ..., Y_k),

where Y_i (i = 1, ..., k) are i.i.d. random variables drawn from the Laplace distribution Lap(Δf/ε). The density function of the Laplace distribution (centered at 0) Lap(c) is:

    Lap(x | c) = (1/(2c)) exp(−|x|/c).
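A minimal Python sketch of the Laplace mechanism of Definition 2.3, applied to the histogram query of the running example (the function and variable names are our own; numpy's Laplace sampler takes exactly the scale c = Δf/ε used above):

    import numpy as np

    def laplace_mechanism(f_x, sensitivity, epsilon, rng=None):
        """Release f(x) plus i.i.d. Lap(sensitivity / epsilon) noise on each coordinate."""
        rng = np.random.default_rng() if rng is None else rng
        scale = sensitivity / epsilon
        return f_x + rng.laplace(loc=0.0, scale=scale, size=np.shape(f_x))

    hist = np.array([2.0, 1.0, 0.0, 0.0, 0.0])      # histogram of the dataset x
    noisy_hist = laplace_mechanism(hist, sensitivity=2.0, epsilon=0.5)
    print(noisy_hist)                   # a (0.5, 0)-differentially private release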
The following lemma can be found in textbooks; see for example Theorem 3.6 of [6].

Lemma 2.1.
The Laplace mechanism preserves (ε, 0)-differential privacy.

3 Privacy preserving methods for median regression

In this section, we put forward three privacy preserving algorithms for l1 regression and calculate their privacy parameters respectively.

The finite smoothing method is an important tool for solving nondifferentiable problems, for instance the median regression problem studied in [16]. In addition, [16] proves that the solution of the smoothed function estimates the solution of the original function well. This idea is applied in Algorithm 1 by an analogous technique.

Since the absolute value function is not differentiable at the cuspidal point, a smooth method for minimizing function (2) is considered. Let γ be a nonnegative parameter which indicates the degree of approximation. Define

    ρ_γ(t) = t²/(2γ),     if |t| ≤ γ,
             |t| − γ/2,   if |t| > γ.    (3)

Then the nondifferentiable function F(µ, β) is approximated by the Huber M-estimator (see [2]). Denote F_γ(µ, β) = (1/n) Σ_{i=1}^{n} ρ_γ(r_i(µ, β)) and L_γ(µ, β) = F_γ(µ, β) + (λ/2) β^T β. The sign vector s_γ(µ, β) = (s_1(µ, β), ..., s_n(µ, β))^T is given by
    s_i(µ, β) = −1,   if r_i(µ, β) < −γ,
                 0,   if −γ ≤ r_i(µ, β) ≤ γ,
                 1,   if r_i(µ, β) > γ.    (4)

Let w_i(µ, β) = 1 − s_i²(µ, β); then

    ρ_γ(r_i(µ, β)) = (1/(2γ)) w_i(µ, β) r_i²(µ, β) + s_i(µ, β) [r_i(µ, β) − (γ/2) s_i(µ, β)].    (5)

Denote by W_γ(µ, β) the diagonal n × n matrix whose diagonal elements are the w_i(µ, β). Thus W_γ(µ, β) has value 1 in the diagonal elements related to small residuals and 0 elsewhere. For µ ∈ R and β ∈ R^d, the derivatives of F_γ(µ, β) are

    ∂F_γ(µ, β)/∂β = (1/n) X^T [(1/γ) W_γ(µ, β) r(µ, β) + s_γ(µ, β)]

and

    ∂F_γ(µ, β)/∂µ = (1/n) 1^T [(1/γ) W_γ(µ, β) r(µ, β) + s_γ(µ, β)].

It can be verified that L_γ(µ, β) is convex and that a minimizer of L(µ, β) is close to a minimizer of L_γ(µ, β) when γ is close to zero. Furthermore, according to Theorem 1 in [16], the l1 solution can be detected once γ is below a certain positive threshold, so it is not necessary to let γ converge to zero in order to find a minimizer of L_γ(µ, β). This observation is essential for the efficiency and the numerical stability of the algorithm to be described in this paper. In addition, following the algorithm in [4], the first privacy preserving algorithm for median regression is stated as follows.

Algorithm 1:
Inputs: privacy parameter ε, design matrix X, response vector Y, regularization parameter λ and approximation parameter γ.
1. Generate a random vector b from the density function h(b) ∝ exp(−(ε/4)||b||). To implement this, pick the l2 norm of b from the Gamma distribution Γ(d + 1, 4/ε), and the direction of b uniformly at random.
2. Compute (µ*, β*) = argmin_{µ, β} L_γ(µ, β) + b^T ω/n + µ²/(2√n), where ω = (µ, β) is a (d + 1)-dimensional vector and n is the number of rows of X.
3. Output (µ*, β*).

This algorithm is very similar to the smoothing median regression convex program in [16], and therefore its running time is similar to that of smoothing regression. In fact, (µ*, β*) can be obtained by the interior point method. Similar to the proof in [4], we can show that Algorithm 1 is privacy preserving.
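A minimal Python sketch of Algorithm 1 (our own illustration: the Huber smoothing ρ_γ and the noise draw follow the reconstruction above, and a generic quasi-Newton solver stands in for the interior point method; function names and defaults are ours):

    import numpy as np
    from scipy.optimize import minimize

    def huber(r, gamma):
        """Smoothed absolute value rho_gamma of equation (3)."""
        return np.where(np.abs(r) <= gamma, r**2 / (2 * gamma), np.abs(r) - gamma / 2)

    def sample_b(d, eps, rng):
        """Draw b in R^(d+1) with density proportional to exp(-(eps/4)*||b||)."""
        norm = rng.gamma(shape=d + 1, scale=4.0 / eps)       # ||b|| ~ Gamma(d+1, 4/eps)
        direction = rng.standard_normal(d + 1)
        return norm * direction / np.linalg.norm(direction)  # uniformly random direction

    def algorithm1_sketch(X, y, eps, lam, gamma, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        b = sample_b(d, eps, rng)

        def objective(omega):                                # omega = (mu, beta_1, ..., beta_d)
            mu, beta = omega[0], omega[1:]
            r = mu + X @ beta - y
            return (np.mean(huber(r, gamma)) + 0.5 * lam * beta @ beta
                    + b @ omega / n + mu**2 / (2 * np.sqrt(n)))

        res = minimize(objective, np.zeros(d + 1), method="L-BFGS-B")
        return res.x[0], res.x[1:]                           # (mu*, beta*)

Under the assumed bounds ||X_i||_2 ≤ 1 and |Y_i| ≤ B, this is the objective perturbation recipe of [4] adapted to the smoothed median loss.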
Theorem 3.1. Given a set of n samples X_1, ..., X_n over R^d, with labels Y_1, ..., Y_n, where for each i, ||X_i|| ≤ 1 and |Y_i| ≤ B, the output of Algorithm 1 preserves (ε, 0)-differential privacy.

Proof. Let a_1 and a_2 be two row vectors over R^d with l2 norm at most 1 and y_1, y_2 ∈ [−B, B]. Consider the two inputs D_1 and D_2, where D_2 is obtained from D_1 by replacing one record (a_1, y_1) with (a_2, y_2). For convenience, assume that the first n − 1 records of the two datasets are the same. Given the output ω* = (µ*, β*) of Algorithm 1, there is a unique value of b that maps the input to the output. This uniqueness holds because both the regularization function and the loss function are differentiable everywhere. Denote ã_1 = (1, a_1) and ã_2 = (1, a_2). Let the values of the (d + 1)-dimensional vector b for D_1 and D_2 be b_1 and b_2, respectively. Since ω* is the value that minimizes both optimization problems, the derivative of both optimization functions at ω* is 0. This implies that for every b_1 in the first case, there exists a b_2 in the second case such that

    b_1 + ã_1 [(1/γ) W_γ(µ*, β*)(µ* + a_1^T β* − y_1) + s_γ(µ*, β*)] = b_2 + ã_2 [(1/γ) W_γ(µ*, β*)(µ* + a_2^T β* − y_2) + s_γ(µ*, β*)],

where, with a slight abuse of notation, W_γ(µ*, β*) and s_γ(µ*, β*) denote the weight and the sign associated with the modified record. According to the definitions of W_γ(µ*, β*) and s_γ(µ*, β*), it is clear that

    −1 ≤ (1/γ) W_γ(µ*, β*)(µ* + a_1^T β* − y_1) + s_γ(µ*, β*) ≤ 1

and

    −1 ≤ (1/γ) W_γ(µ*, β*)(µ* + a_2^T β* − y_2) + s_γ(µ*, β*) ≤ 1.

Since ||ã_1|| ≤ 2 and ||ã_2|| ≤ 2, we have ||b_1 − b_2|| ≤ 4, which implies that −4 ≤ ||b_1|| − ||b_2|| ≤ 4. Therefore, for any (a_1, y_1) and (a_2, y_2),

    P((µ*, β*) | X_1, ..., X_{n−1}, Y_1, ..., Y_{n−1}, X_n = a_1, Y_n = y_1) / P((µ*, β*) | X_1, ..., X_{n−1}, Y_1, ..., Y_{n−1}, X_n = a_2, Y_n = y_2) = h(b_1)/h(b_2) = exp(−(ε/4)(||b_1|| − ||b_2||)),

where h(b_i), i = 1, 2, denotes the density of b_i. Since −4 ≤ ||b_1|| − ||b_2|| ≤ 4, this ratio is bounded by exp(ε), and the theorem is obtained.

According to Lemma 1 in [4], theoretical results for the accuracy of parameter estimation can be given for Algorithm 1.
Lemma 3.1. Let G(ω) and g(ω) be two convex functions which are continuous and differentiable at all points. If ω_1 = argmin_ω G(ω) and ω_2 = argmin_ω G(ω) + g(ω), then ||ω_1 − ω_2|| ≤ g_1/G_1. Here, g_1 = max_ω ||∇g(ω)|| and G_1 = min_v min_ω v^T ∇² G(ω) v over all unit vectors v.

The main idea of the proof is to examine the gradient and the Hessian of the functions G and g around ω_1 and ω_2.

Lemma 3.2. If ||b|| is a random variable drawn from Γ(d + 1, 4/ε), then with probability 1 − α, ||b|| ≤ 4(d + 1) log((d + 1)/α)/ε.

Proof. A random variable drawn from Γ(d + 1, 4/ε) can be written as the sum of d + 1 independent identically distributed random variables, each of which is distributed as an exponential random variable with mean 4/ε. Using a union bound, we see that with probability 1 − α, the values of all d + 1 of these variables are upper bounded by 4 log((d + 1)/α)/ε. Therefore, with probability at least 1 − α, ||b|| ≤ 4(d + 1) log((d + 1)/α)/ε.
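The tail bound of Lemma 3.2 is easy to check numerically. The following sketch (our own illustration with arbitrary parameter values) draws many copies of ||b|| from Γ(d + 1, 4/ε) and verifies that the bound is exceeded with frequency well below α, as the union bound argument guarantees:

    import numpy as np

    def check_gamma_bound(d=5, eps=0.5, alpha=0.05, trials=100_000, seed=0):
        """Empirical exceedance rate of the bound in Lemma 3.2."""
        rng = np.random.default_rng(seed)
        # ||b|| ~ Gamma(shape=d+1, scale=4/eps): a sum of d+1 exponentials with mean 4/eps
        norms = rng.gamma(shape=d + 1, scale=4.0 / eps, size=trials)
        bound = 4.0 * (d + 1) * np.log((d + 1) / alpha) / eps
        return norms.mean(), bound, np.mean(norms > bound)

    mean_norm, bound, exceed_rate = check_gamma_bound()
    print(mean_norm, bound, exceed_rate)    # exceed_rate should be far below alpha = 0.05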
Theorem 3.2. Given an l1 regression problem with regularization parameter λ, let ω_1 be the classifier that minimizes L_γ(µ, β) + µ²/(2√n), and let ω_2 be the classifier output by Algorithm 1. Then, with probability 1 − α,

    ||ω_1 − ω_2|| ≤ 4(d + 1) log((d + 1)/α) / (ε n min(λ, 1/√n)).

Proof. According to Lemma 3.1, we take G(ω) = L_γ(µ, β) + µ²/(2√n) and g(ω) = b^T ω/n. Because F_γ(µ, β) is a convex function, if we define the second derivative of F_γ(µ, β) to be 0 at nondifferentiable points, then the Hessian matrix of F_γ(µ, β) is positive semidefinite. Notice that ∂²(µ²/(2√n))/∂µ² = 1/√n and ∇²((λ/2) β^T β) = λI, where I is the identity matrix of size d × d. Hence, for any unit vector v, G_1 = min_v min_ω v^T ∇² G(ω) v ≥ min(λ, 1/√n) and g_1 = ||b||/n, so

    ||ω_1 − ω_2|| ≤ ||b|| / (n min(λ, 1/√n)).

Since ||b|| is a random variable drawn from Γ(d + 1, 4/ε), according to Lemma 3.2, with probability 1 − α, ||b|| ≤ 4(d + 1) log((d + 1)/α)/ε, and the theorem is obtained.

When n is sufficiently large, ω_2 approximates ω_1 well and ω_1 is close to the true parameter argmin_ω L_γ(ω).

The second algorithm is based on an iterative technique, which was first proposed in [17]. This iterative technique combines absolute deviation regression with least squares regression. Hence, at the heart of the technique is any standard least squares curve fitting algorithm. The basic least squares algorithm minimizes the criterion

    I = (1/n) Σ_{i=1}^{n} w_i r_i²(µ, β) + (λ/2) β^T β,    (6)

where the weighting factors w_i are positive real numbers. Based on the Lagrange multiplier approach, for a fixed λ, there exists a unique value v such that minimizing equation (6) is equivalent to minimizing the constrained problem

    I = (1/n) Σ_{i=1}^{n} w_i r_i²(µ, β),   s.t. β^T β ≤ v.

Considering the (t + 1)-th iteration, we take w_i as 1/(|r_i^{(t)}| + e), where r_i^{(t)} is the residual of the i-th sample at the t-th iteration. Then the iterative process can be written as

    I(t + 1) = (1/n) Σ_{i=1}^{n} (r_i^{(t+1)})² / (|r_i^{(t)}| + e) + (λ/2) β^T β.    (7)

If ||r_i^{(t)} − r_i^{(t+1)}|| ≈ 0 for i = 1, 2, ..., n, then (7) is close to L(µ, β). In practice, we set e as a small positive value (such as e = 0.05).
Algorithm 2:
Inputs: privacy parameter ε, design matrix X, response vector Y, regularization parameter λ, tolerance parameter τ and the number of iterations N.
Initialize the algorithm with µ̂(0) and β̂(0), and compute (µ̂(1), β̂(1)) = argmin_{µ, β} I(1).
for t = 1, ..., N − 1 do
    if ||µ̂(t) − µ̂(t − 1)|| > τ or ||β̂(t) − β̂(t − 1)|| > τ then
        (µ̂(t + 1), β̂(t + 1)) = argmin_{µ, β} I(t + 1)
    else
        (µ̂(N), β̂(N)) := (µ̂(t), β̂(t)); break
    end if
end for
Output (µ̂, β̂) := (µ̂(N), β̂(N)) + U, where U is a (d + 1)-dimensional Laplace random variable with parameter c = 8(√(dv) + B) / (ε n e min(2/(2(√(dv) + B) + e), λ)).
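The following Python sketch (our own illustrative rendering; the calibrated noise parameter c above is abstracted into a single scale argument, and the initialization is simplified to unit weights) shows the reweighted ridge least squares step that Algorithm 2 iterates, followed by the final Laplace perturbation.

    import numpy as np

    def weighted_ridge_step(X, y, w, lam):
        """Minimize (1/n)*sum_i w_i*(mu + X_i beta - y_i)^2 + (lam/2)*beta'beta."""
        n, d = X.shape
        Z = np.column_stack([np.ones(n), X])          # design with an intercept column
        D = np.eye(d + 1); D[0, 0] = 0.0              # the intercept is not penalized
        A = (2.0 / n) * Z.T @ (w[:, None] * Z) + lam * D
        rhs = (2.0 / n) * Z.T @ (w * y)
        return np.linalg.solve(A, rhs)                # theta = (mu, beta_1, ..., beta_d)

    def algorithm2_sketch(X, y, lam, noise_scale, e=0.05, tol=1e-6, n_iter=200, seed=0):
        """Iteratively reweighted least squares for median regression, then Laplace noise."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta = weighted_ridge_step(X, y, np.ones(n), lam)
        for _ in range(n_iter):
            resid = theta[0] + X @ theta[1:] - y
            w = 1.0 / (np.abs(resid) + e)             # w_i = 1 / (|r_i^(t)| + e)
            new_theta = weighted_ridge_step(X, y, w, lam)
            converged = np.max(np.abs(new_theta - theta)) < tol
            theta = new_theta
            if converged:
                break
        return theta + rng.laplace(scale=noise_scale, size=d + 1)

Calibrating noise_scale to the constant c of Algorithm 2 is what yields the (ε, 0) guarantee stated below.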
Theorem 3.3. Given a set of n samples X_1, ..., X_n over R^d, with labels Y_1, ..., Y_n, where for each i (1 ≤ i ≤ n), ||X_i|| ≤ 1 and |Y_i| ≤ B, the output of Algorithm 2 preserves (ε, 0)-differential privacy.

Proof. Denote ω = (µ̂(N), β̂(N)) and the l1 sensitivity of ω by s(ω). Let a_1 and a_2 be two vectors over R^d with l2 norm at most 1 and y_1, y_2 ∈ [−B, B]. Consider the two inputs D_1 and D_2, where D_2 is obtained from D_1 by changing one record (a_1, y_1) into (a_2, y_2). For convenience, assume that the first n − 1 records of the two datasets are the same. Take G(ω) = I(N) and

    g(ω) = (1/n) w (µ̂(N) + a_2^T β̂(N) − y_2)² − (1/n) w (µ̂(N) + a_1^T β̂(N) − y_1)².

Similar to the proof of Theorem 3.2, we obtain

    g_1 = max_ω ||∇g(ω)|| ≤ (2/n) |w| (|µ̂(N)| + |a_1^T β̂(N)| + |y_1|) + (2/n) |w| (|µ̂(N)| + |a_2^T β̂(N)| + |y_2|).

Notice that (µ̂(N), β̂(N)) = argmin_{µ, β} I(N), so ∂I(N)/∂µ = 0 at µ = µ̂(N), that is,

    Σ_{i=1}^{n} w_i (µ̂(N) + X_i β̂(N) − Y_i) = 0  ⟺  µ̂(N) = − Σ_{i=1}^{n} w_i (X_i β̂(N) − Y_i) / Σ_{i=1}^{n} w_i.

Since 0 < w_i ≤ 1/e, |Y_i| ≤ B (i = 1, ..., n) and ||β̂(N)||_1 ≤ √d ||β̂(N)||_2 ≤ √(dv), we have |µ̂(N)| ≤ √(dv) + B. Notice that the above inequalities still hold at the t-th (t ≥ 2) iteration, and hence 1/(2(√(dv) + B) + e) ≤ w_i ≤ 1/e. Then we obtain

    g_1 = max_ω ||∇g(ω)|| ≤ 8(√(dv) + B)/(ne).

In addition, denote F_e(ω) = (1/n) Σ_{i=1}^{n} w_i r_i²(µ, β). It can be checked that F_e(ω) is convex, that ∂²F_e(ω)/∂µ² = (2/n) Σ_{i=1}^{n} w_i ≥ 2/(2(√(dv) + B) + e), and that ∇²((λ/2) β^T β) = λI, where I is the identity matrix of size d × d. Hence G_1 ≥ min(2/(2(√(dv) + B) + e), λ) and

    s(ω) ≤ 8(√(dv) + B) / (n e min(2/(2(√(dv) + B) + e), λ)).

According to Lemma 2.1, the result then follows directly from the composition theorem.
For e > 0, define a perturbation of L(µ, β) as

    L_e(µ, β) = (1/n) Σ_{i=1}^{n} [|r_i(µ, β)| − e ln(e + |r_i(µ, β)|)] + (λ/2) β^T β.

[10] proves that the iterative least squares algorithm without adding noise is a special case of Majorization-Minimization (MM) algorithms (see [11]) for the objective function L_e(µ, β), and obtains convergence results.
Proposition 3.1. For linear median regression with a full-rank covariate matrix X, the iterative least squares algorithm without adding noise converges to the unique minimizer of L_e(µ, β).
Proposition 3.2. If (µ̂_e, β̂_e) minimizes L_e(µ, β), then any limit point of (µ̂_e, β̂_e) as e tends to 0 minimizes L(µ, β). If L(µ, β) has a unique minimizer (µ̃, β̃), then lim_{e→0} (µ̂_e, β̂_e) = (µ̃, β̃).

The proof of the above propositions can be found in [10].
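As a quick numerical illustration of the MM descent property behind Proposition 3.1 (our own sketch on toy data, with λ = 0 for simplicity), each noiseless reweighted least squares step should not increase the perturbed objective L_e:

    import numpy as np

    def L_e(theta, X, y, e):
        """Perturbed median regression objective (lambda = 0 for simplicity)."""
        r = theta[0] + X @ theta[1:] - y
        return np.mean(np.abs(r) - e * np.log(e + np.abs(r)))

    def irls_step(theta, X, y, e):
        """One noiseless reweighted least squares step with weights 1/(|r_i| + e)."""
        Z = np.column_stack([np.ones(X.shape[0]), X])
        w = 1.0 / (np.abs(Z @ theta - y) + e)
        return np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * y))

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.laplace(scale=1.0, size=200)
    theta, e = np.zeros(3), 0.05
    for _ in range(8):
        print(round(L_e(theta, X, y, e), 6))   # a non-increasing sequence of objective values
        theta = irls_step(theta, X, y, e)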
Theorem 3.4. Given an l1 regression problem with regularization parameter λ, let ω_1 be the classifier that minimizes L_e(µ, β), and let ω_2 be the classifier output by Algorithm 2. Then, with probability 1 − α,

    ||ω_1 − ω_2|| ≤ 8(√(dv) + B)(d + 1) log((d + 1)/α) / (ε n e min(2/(2(√(dv) + B) + e), λ)).

Proof. Since ||b|| is a random variable drawn from Γ(d + 1, c) with c = 8(√(dv) + B)/(ε n e min(2/(2(√(dv) + B) + e), λ)), with probability 1 − α we have ||b|| ≤ 8(√(dv) + B)(d + 1) log((d + 1)/α)/(ε n e min(2/(2(√(dv) + B) + e), λ)), and the theorem is obtained.

Therefore, for fixed small e, if n is sufficiently large, accuracy can be ensured in practice.

In [1], the authors argue that adding noise to the estimated parameters after optimization would destroy the utility of the learned model. Hence, we prefer a more sophisticated method to control the influence of the training data during the training process, especially in the stochastic gradient descent computation. [19] declares that greedy coordinate descent is an effective method for l1 regression, where l1 regression means median regression, so we apply this idea to minimize the objective function L(µ, β) in a similar way. Although L(µ, β) is nondifferentiable, it does possess directional derivatives along each forward or backward coordinate direction. For example, if e_k is the coordinate direction along which β_k varies, then the objective function (2) has directional derivatives

    d_{e_k^+} L(µ, β) = lim_{τ→0^+} [L(µ, β + τ e_k) − L(µ, β)]/τ = d_{e_k^+} F(µ, β) + λ β_k

and

    d_{e_k^−} L(µ, β) = lim_{τ→0^+} [L(µ, β − τ e_k) − L(µ, β)]/τ = d_{e_k^−} F(µ, β) − λ β_k.

In l1 regression, the coordinate direction derivatives are

    d_{e_k^+} F(µ, β) = (1/n) Σ_{i=1}^{n} { −x_ik  if r_i(µ, β) < 0;  x_ik  if r_i(µ, β) > 0;  |x_ik|  if r_i(µ, β) = 0 },    (8)

and

    d_{e_k^−} F(µ, β) = (1/n) Σ_{i=1}^{n} { x_ik  if r_i(µ, β) < 0;  −x_ik  if r_i(µ, β) > 0;  |x_ik|  if r_i(µ, β) = 0 }.    (9)

In the greedy coordinate descent process [9], we update the parameter β_k based on min{d_{e_k^+} L(µ, β), d_{e_k^−} L(µ, β)}. If both coordinate directional derivatives are nonnegative, the update of β_k stops. In addition, µ̂ = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂), where n_0 = n/N. By the method of batch gradients [18], the t-th iteration only employs records with batch size n_0, which means that L(µ̂(t), β̂(t)) in the algorithm is calculated on the subset (X(t), Y(t)). The algorithm is described as follows.

Algorithm 3:
Inputs: privacy parameter ε, design matrix X, response vector Y, regularization parameter λ, positive number ℓ and the number of iterations N.
Randomly split (X, Y) into N disjoint subsets (X(t), Y(t)) of size n_0 = n/N.
Initialize the algorithm with a vector (µ̂(0), β̂(0)) (such as the solution of l2 regression).
for t = 0, 1, 2, ..., N − 1 do
    η_t = ℓ/(t + 1)
    for k = 1, 2, ..., d do
        β̂_k(t + 0.5) = β̂_k(t) − η_t (min{d_{e_k^+} L(µ̂(t), β̂(t)), d_{e_k^−} L(µ̂(t), β̂(t))}),
        β̂_k(t + 1) = β̂_k(t + 0.5) + U_t, where U_t ∼ Lap(2η_t/(ε n_0)).
    end for
    µ̂(t + 1) = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂(t + 1)), the sum running over the current batch.
end for
Output β̂ := β̂(N), µ̂ := µ̂(N).
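A compact Python sketch of the noisy greedy coordinate descent of Algorithm 3 (our own illustrative rendering; batching, step sizes and the noise scale 2η_t/(ε n_0) follow the description above, while the initializer, function names and defaults are ours):

    import numpy as np

    def dir_derivs(mu, beta, Xb, yb, lam, k):
        """Forward/backward directional derivatives of L along coordinate k, eqs (8)-(9)."""
        r = mu + Xb @ beta - yb
        xk = Xb[:, k]
        fwd = np.mean(np.where(r > 0, xk, np.where(r < 0, -xk, np.abs(xk))))
        bwd = np.mean(np.where(r > 0, -xk, np.where(r < 0, xk, np.abs(xk))))
        return fwd + lam * beta[k], bwd - lam * beta[k]

    def algorithm3_sketch(X, y, lam, eps, ell, N, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        n0 = n // N
        order = rng.permutation(n)                    # random split into N disjoint batches
        beta, mu = np.zeros(d), float(np.mean(y))     # a simple initializer
        for t in range(N):
            eta = ell / (t + 1)
            batch = order[t * n0:(t + 1) * n0]
            Xb, yb = X[batch], y[batch]
            for k in range(d):
                fwd, bwd = dir_derivs(mu, beta, Xb, yb, lam, k)
                if min(fwd, bwd) >= 0:                # both directional derivatives nonnegative
                    continue
                beta[k] -= eta * min(fwd, bwd)
                beta[k] += rng.laplace(scale=2 * eta / (eps * n0))   # per-coordinate noise
            mu = float(np.mean(yb - Xb @ beta))
        return mu, beta

The choices lam=0.002, ell=0.1 and N=40 would mirror the simulation settings reported below, with eps set to whatever privacy budget is desired.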
Theorem 3.5. Given a set of n samples X_1, ..., X_n over R^d with labels Y_1, ..., Y_n, where for each i (1 ≤ i ≤ n), ||X_i|| ≤ 1 and |Y_i| ≤ B, the output of Algorithm 3 preserves (ε, 0)-differential privacy.

Proof. Because of sample splitting, for (x, y) ∈ (X(t), Y(t)) for some 0 ≤ t ≤ N − 1, it suffices to prove the privacy guarantee for the t-th iteration of the algorithm: any iteration prior to the t-th one does not depend on (x, y), while any iteration after the t-th one is differentially private by post-processing [6]. At the t-th iteration, the algorithm first updates the intermediate estimate of β_k:

    β̂_k(t + 0.5) = β̂_k(t) − η_t (min{d_{e_k^+} L(µ̂(t), β̂(t)), d_{e_k^−} L(µ̂(t), β̂(t))}).

Let a_1 and a_2 be two vectors over R^d with l2 norm at most 1 and y_1, y_2 ∈ [−B, B]. Consider the two inputs D_1 and D_2, where D_2 is obtained from D_1 by changing one record (a_1, y_1) into (a_2, y_2); for convenience, assume that the other records of the two batches are the same. Denote by Dir_1(t) the directional derivative min{d_{e_k^+} L(µ̂(t), β̂(t)), d_{e_k^−} L(µ̂(t), β̂(t))} for the dataset D_1, and by Dir_2(t) that for the dataset D_2. Notice that β̂(t) does not depend on (X(t), Y(t)), so by Lemma 2.1, β̂_k(t + 1) is (ε, 0)-differentially private provided that for the two inputs D_1 and D_2 we have

    ||η_t [Dir_1(t) − Dir_2(t)]|| ≤ 2η_t/n_0,

since the added noise U_t has scale 2η_t/(ε n_0). This is true, because ||η_t [Dir_1(t) − Dir_2(t)]|| ≤ (η_t/n_0)(||a_1|| + ||a_2||) ≤ 2η_t/n_0, so the privacy guarantee for β̂ is proved. In addition, since µ̂ = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂), it is differentially private by post-processing [6]. Then the theorem is obtained.

[19] points out that coordinate descent may fail for a nondifferentiable function, since a point at which all coordinate directional derivatives are nonnegative is not necessarily a minimum point. However, if a suitable approximate value can be obtained quickly, this shortcoming can be accepted in practice. The following theorem shows that the estimated parameters become stable when the number of iterations N is large.
Theorem 3.6. Given a set of n samples X_1, ..., X_n over R^d with labels Y_1, ..., Y_n (for each i, ||X_i|| ≤ 1 and |Y_i| ≤ B), Algorithm 3 is convergent in probability with rate O(1/t).

Proof. Consider the t-th iteration for β_k. Since |Dir(t)| ≤ (1/n_0) Σ_{i=1}^{n_0} |x_ik| ≤ 1 and β̂_k(t + 1) = β̂_k(t) − η_t Dir(t) + U_t, we have |β̂_k(t + 1) − β̂_k(t)| ≤ |η_t| + |U_t| = O_p(1/t), where O_p(1/t) indicates convergence in probability with rate O(1/t). Since µ̂ = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂), it is also convergent in probability with rate O(1/t). Then the theorem is obtained.
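A tiny numerical illustration of the 1/t decay in Theorem 3.6 (our own sketch with arbitrary illustrative values): the per-coordinate change at iteration t is dominated by the step size η_t = ℓ/(t + 1) plus Laplace noise of scale 2η_t/(ε n_0), both of which shrink like 1/t.

    import numpy as np

    rng = np.random.default_rng(2)
    ell, eps, n0 = 0.1, 0.5, 1000              # illustrative values only
    for t in [1, 10, 100, 1000]:
        eta = ell / (t + 1)
        noise = abs(rng.laplace(scale=2 * eta / (eps * n0)))
        print(t, round(eta + noise, 8))        # an upper bound on the change, roughly ~ 1/t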
4 Simulated results

Denote by n the number of samples. Here we let n take two values: 5000 and 5000000. In fact, when n is small, such as 100, Algorithm 1 can perform well, but Algorithm 2 requires a bigger n (otherwise the added noise would be large, resulting in a big estimation error). Consider the following example with three independent variables x_1, x_2, x_3, where y_i = 2 + 3x_{i1} − 4x_{i3} + u_i and u_i obeys the Laplace distribution Lap(2), for i = 1, ..., n. We assume that, for each i (1 ≤ i ≤ n), the l2 norm of X_i is less than 1 and the absolute value of Y_i is less than 2. In practice, we take λ = 0.002 in the objective function. In Algorithm 1, the parameter γ is taken as 0.05. In Algorithm 2, we set the parameter e = 0.05, a small tolerance parameter τ, and the number of iterations N = 200; in fact, Algorithm 2 tends to converge in fewer than 30 iterations. In Algorithm 3, we set ℓ = 0.1, step size η_t = ℓ/(t + 1) and the number of iterations N = 40. In addition, privacy parameters (ε, δ) with δ = 0 are used for all the above algorithms.
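The simulated data can be generated along the following lines (a sketch under our own normalization choices; the paper does not spell out how the bounds on X_i and Y_i are enforced, so the scaling below is only one possibility):

    import numpy as np

    def simulate(n, seed=0):
        """Draw (X, y) from y = 2 + 3*x1 - 4*x3 + Laplace(2) noise, with ||X_i||_2 <= 1."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(-1.0, 1.0, size=(n, 3))
        X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
        y = 2.0 + 3.0 * X[:, 0] - 4.0 * X[:, 2] + rng.laplace(scale=2.0, size=n)
        return X, y

    X, y = simulate(5000)
    print(X.shape, y[:3])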
The results are listed in Table 1 and Table 2. They show that Algorithm 1 performs better than the others when n = 5000. However, when n becomes much bigger, Algorithm 1 costs much more time. Notice that when n = 5000000, the noise added in Algorithm 2 becomes small, which makes the estimated result precise. In addition, Algorithm 3 costs less time in both cases, but it depends highly on the initial value and the step size η_t, which is a common problem for gradient descent methods [18].

Table 1: Estimated results with sample size 5000

         Algorithm 1    Algorithm 2    Algorithm 3    True value
β_2        -0.0295        13.2762        -0.6099         0
β_3        -4.0835       -14.2089        -3.2283        -4

Table 2: Estimated results with sample size 5000000

         Algorithm 1    Algorithm 2    Algorithm 3    True value
β_3        -3.9460        -3.9205        -3.9918        -4

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar and L. Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 308-318.
[2] I. Barrodale and F. D. K. Roberts. 1973. An improved algorithm for discrete l1 linear approximation. SIAM Journal on Numerical Analysis. 839-848.
[3] T. Cai, Y. Wang and L. Zhang. 2019. The cost of privacy: optimal rates of convergence for parameter estimation with differential privacy. Under review: arXiv:1902.04495v3.
[4] K. Chaudhuri and C. Monteleoni. 2009. Privacy-preserving logistic regression. In Proceedings of the 21st International Conference on Neural Information Processing Systems. 289-296.
[5] C. Dwork. 2006. Differential privacy. In International Colloquium on Automata, Languages and Programming. 1-12.
[6] C. Dwork and A. Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9. 211-407.
[7] C. Dwork, F. McSherry, K. Nissim and A. Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference. 265-284.
[8] U. Erlingsson, V. Pihur and A. Korolova. 2014. RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 1054-1067.
[9] J. Friedman, T. Hastie, H. Höfling and R. Tibshirani. 2007. Pathwise coordinate optimization. The Annals of Applied Statistics. 1(2). 302-332.
[10] D. R. Hunter and K. Lange. 2000. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics. 9. 60-77.
[11] D. R. Hunter and K. Lange. 2004. A tutorial on MM algorithms. The American Statistician. 58(1). 30-37.
[12] R. Koenker and G. Bassett. 1978. Regression quantiles. Econometrica. 46. 33-50.
[13] D. Kifer and A. Machanavajjhala. 2011. No free lunch in data privacy. In International Conference on Management of Data. 193-204.
[14] R. Koenker and O. Geling. 2001. Reappraising medfly longevity: a quantile regression survival analysis. Journal of the American Statistical Association. 96. 458-468.
[15] R. Koenker and K. F. Hallock. 2001. Quantile regression. Journal of Economic Perspectives. 15. 143-156.
[16] K. Madsen and H. B. Nielsen. 1993. A finite smoothing algorithm for linear l1 estimation.