Median regression with differential privacy∗

E Chen, Ying Miao and Yu Tang

School of Mathematical Sciences, Soochow University
Faculty of Engineering, Information and Systems, University of Tsukuba

Apr. 25, 2020
Abstract:
Median regression analysis has robustness properties which make it attractive compared with regression based on the mean, while differential privacy can protect individual privacy during statistical analysis of certain datasets. In this paper, three privacy preserving methods are proposed for median regression. The first algorithm is based on a finite smoothing method, the second provides an iterative way, and the last one further employs the greedy coordinate descent approach. Privacy preserving properties of these three methods are all proved. Accuracy bounds or convergence properties of these algorithms are also provided. Numerical calculation shows that the first method has better accuracy than the others when the sample size is small. When the sample size becomes larger, the first method needs more time while the second method needs less time with well-matched accuracy. The third method costs less time in both cases, but it highly depends on the step size.

Keywords: median regression, differential privacy, l1 sensitivity, Laplace mechanism

MSC2010: 62F30, 68W20

∗ Supported by NNSF of China (11671290)

1 Introduction
Personal privacy information may be exposed with the unprecedented availability of datasets, so there is an increasing requirement that statistical analysis of such datasets should protect individual privacy. As [6] describes, differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population. Over the past few years, differential privacy has been investigated in machine learning [1] and has been applied in the real world, see for example [8]. Recently, [3] formulates a general lower bound argument for minimax risks with differential privacy constraints, and applies this argument to high-dimensional mean estimation and linear regression problems.

In this paper, three privacy preserving methods are proposed for median regression, which is a special case of quantile regression. Quantile regression was first introduced in [12], and aims to estimate and conduct inference about conditional quantile functions. In recent years, quantile regression has become a comprehensive method for statistical analysis of response models and has been widely used in practice, for instance in survival analysis and economics; see for example [14], [20] and [15]. The fact that median regression takes the least absolute deviation as its objective function to estimate parameters has long been known among statisticians [12].

Denote a dataset of n i.i.d. samples of independent variables as X, where each observation contains d variables x_1, x_2, ..., x_d. In the regression setting, we assume Y_i is the response for case i, x_ij is the value of predictor j for case i, and β_j is the regression coefficient corresponding to predictor j, where 1 ≤ i ≤ n, 1 ≤ j ≤ d. In this paper, we consider the linear l1 regression problem, i.e., minimizing the following function:

    F(µ, β) = (1/n) Σ_{i=1}^{n} |r_i(µ, β)|,    (1)

where r_i(µ, β) = µ + X_i β − Y_i (i = 1, 2, ..., n), X_i represents the i-th row of X, and β = (β_1, ..., β_d)^T. Without loss of generality, assume that |Y_i| ≤ B (B is a positive number) and ||X_i||_2 ≤ 1 for i = 1, ..., n. In vector form, r(µ, β) = µ1 + Xβ − Y represents a set of linear functions in R^n with Y = (y_1, ..., y_n)^T, where 1 is an n-dimensional column vector whose elements are all 1. In addition, ridge penalized regression is more stable than simple linear regression, and its objective function can be viewed as minimizing the criterion

    L(µ, β) = F(µ, β) + (λ/2) β^T β,    (2)

where λ is a fixed regularization parameter.

2 Differential privacy

We consider a dataset x as a collection of observations from a universe X. It is convenient to represent databases by their histograms: x ∈ N^{|X|}, in which each entry x_i represents the number of elements in the database x of type i ∈ X. For example, suppose the universe X contains 5 types of records, which we denote by {1, 2, 3, 4, 5}.
If a dataset x consists of the three records 1, 1 and 2, it can be represented as the 5-dimensional vector (2, 1, 0, 0, 0); similarly, the vector (2, 1, 1, 0, 0) represents another dataset y with 4 records.

Differential privacy is based on the neighbourhood of a database. When applying differential privacy in practice, it is key to define the precise condition under which two databases x and y are considered to be neighbouring. There are two possible choices, producing two types of differential privacy: one is called unbounded differential privacy [5] and the other is called bounded differential privacy [7]. Bounded differential privacy assumes that both x and y have the same size n and that y can be obtained from x by replacing exactly one record, while unbounded differential privacy does not require x and y to have the same fixed size and holds the view that y can be obtained from x by adding or deleting exactly one record. In this paper, we adopt bounded differential privacy as our choice and use the notation x ∼ y if x and y are neighbouring.
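To make the histogram representation and the neighbouring relation concrete, the following short Python sketch (our own illustration; the record types, dataset contents and helper names are not taken from the paper) encodes datasets over the universe {1, 2, 3, 4, 5} as count vectors and checks the bounded notion of neighbouring, under which two datasets of equal size differ by replacing exactly one record.

    import numpy as np

    UNIVERSE = [1, 2, 3, 4, 5]          # the 5 record types of the running example

    def histogram(records):
        """Represent a dataset as a count vector over the universe."""
        return np.array([records.count(t) for t in UNIVERSE])

    def are_neighbours(hx, hy):
        """Bounded differential privacy: same size, one record replaced."""
        same_size = hx.sum() == hy.sum()
        # replacing one record moves one unit of count between two bins
        return bool(same_size and np.abs(hx - hy).sum() == 2)

    x = histogram([1, 1, 2])            # (2, 1, 0, 0, 0), three records
    y = histogram([1, 1, 2, 3])         # (2, 1, 1, 0, 0), four records
    z = histogram([1, 1, 3])            # x with its record "2" replaced by "3"
    print(are_neighbours(x, y))         # False: different sizes
    print(are_neighbours(x, z))         # True: exactly one record replaced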
Definition 2.1. A randomized algorithm M with domain N^{|X|} is (ε, δ)-differentially private if for all S ⊆ Range(M) and for all datasets x, y ∈ N^{|X|} with x ∼ y:

    Pr(M(x) ∈ S) ≤ exp(ε) Pr(M(y) ∈ S) + δ.

Intuitively, this definition guarantees that a randomized algorithm behaves similarly on slightly different input datasets, which achieves the purpose of protecting individual privacy in some sense. Next, a randomized algorithm named the Laplace mechanism, which is an effective method for privacy preserving, will be introduced. First, we need the concept of l1 sensitivity.
Definition 2.2. The l1 sensitivity of a function f : N^{|X|} → R^k is:

    Δf = max_{x, y ∈ N^{|X|}, x ∼ y} ||f(x) − f(y)||_1.

The l1 sensitivity of a function f captures the magnitude by which a single individual's data can change the function f in the worst case. It is noteworthy that Δf is an important quantity in the Laplace mechanism.
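As a small illustration (our own example, not taken from the paper), consider the histogram query f(x) = x on the running universe of 5 record types. Under bounded differential privacy, replacing one record moves one unit of count from one bin to another, so the l1 sensitivity of the histogram query is 2:

    import numpy as np

    def l1_distance(hx, hy):
        """l1 distance between the histograms of two datasets."""
        return np.abs(np.asarray(hx) - np.asarray(hy)).sum()

    x = np.array([2, 1, 0, 0, 0])       # records {1, 1, 2}
    z = np.array([2, 0, 1, 0, 0])       # one record replaced: {1, 1, 3}
    print(l1_distance(x, z))            # 2, the worst case over all neighbouring pairs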
Definition 2.3. Given any function f : N^{|X|} → R^k, the Laplace mechanism is defined as:

    M_L(x, f(·), ε) = f(x) + (Y_1, ..., Y_k),

where Y_i (i = 1, ..., k) are i.i.d. random variables drawn from the Laplace distribution Lap(Δf/ε). The density function of the Laplace distribution (centered at 0) Lap(c) is:

    Lap(x | c) = (1/(2c)) exp(−|x|/c).
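A minimal Python sketch of the Laplace mechanism of Definition 2.3, applied to the histogram query of the running example (the function and variable names are our own; numpy's Laplace sampler takes exactly the scale c = Δf/ε used above):

    import numpy as np

    def laplace_mechanism(f_x, sensitivity, epsilon, rng=None):
        """Release f(x) plus i.i.d. Lap(sensitivity / epsilon) noise on each coordinate."""
        rng = np.random.default_rng() if rng is None else rng
        scale = sensitivity / epsilon
        return f_x + rng.laplace(loc=0.0, scale=scale, size=np.shape(f_x))

    hist = np.array([2.0, 1.0, 0.0, 0.0, 0.0])      # histogram of the dataset x
    noisy_hist = laplace_mechanism(hist, sensitivity=2.0, epsilon=0.5)
    print(noisy_hist)                   # a (0.5, 0)-differentially private release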
The following lemma can be found in textbooks; see for example Theorem 3.6 of [6].

Lemma 2.1.
The Laplace mechanism preserves (ε, 0)-differential privacy.

3 Privacy preserving methods for median regression

In this section, we put forward three privacy preserving algorithms for l1 regression and calculate their privacy parameters respectively.

The finite smoothing method is an important tool for solving nondifferentiable problems, for instance the median regression problem studied in [16]. In addition, [16] proves that the solution of the smoothed function estimates the solution of the original function well. This idea is applied in Algorithm 1 by an analogous technique.

Since the absolute value function is not differentiable at the cuspidal point, a smooth method for minimizing function (2) is considered. Let γ be a nonnegative parameter which indicates the degree of approximation. Define

    ρ_γ(t) = t²/(2γ),     if |t| ≤ γ,
             |t| − γ/2,   if |t| > γ.    (3)

Then the nondifferentiable function F(µ, β) is approximated by the Huber M-estimator (see [2]). Denote F_γ(µ, β) = (1/n) Σ_{i=1}^{n} ρ_γ(r_i(µ, β)) and L_γ(µ, β) = F_γ(µ, β) + (λ/2) β^T β. The sign vector s_γ(µ, β) = (s_1(µ, β), ..., s_n(µ, β))^T is given by
    s_i(µ, β) = −1,   if r_i(µ, β) < −γ,
                 0,   if −γ ≤ r_i(µ, β) ≤ γ,
                 1,   if r_i(µ, β) > γ.    (4)

Let w_i(µ, β) = 1 − s_i²(µ, β); then

    ρ_γ(r_i(µ, β)) = (1/(2γ)) w_i(µ, β) r_i²(µ, β) + s_i(µ, β) [r_i(µ, β) − (γ/2) s_i(µ, β)].    (5)

Denote by W_γ(µ, β) the diagonal n × n matrix whose diagonal elements are the w_i(µ, β). Thus W_γ(µ, β) has value 1 in the diagonal elements related to small residuals and 0 elsewhere. For µ ∈ R and β ∈ R^d, the derivatives of F_γ(µ, β) are

    ∂F_γ(µ, β)/∂β = (1/n) X^T [(1/γ) W_γ(µ, β) r(µ, β) + s_γ(µ, β)]

and

    ∂F_γ(µ, β)/∂µ = (1/n) 1^T [(1/γ) W_γ(µ, β) r(µ, β) + s_γ(µ, β)].

It can be verified that L_γ(µ, β) is convex and that a minimizer of L(µ, β) is close to a minimizer of L_γ(µ, β) when γ is close to zero. Furthermore, according to Theorem 1 in [16], the l1 solution can be detected once γ is below a certain positive threshold, so it is not necessary to let γ converge to zero in order to find a minimizer of L_γ(µ, β). This observation is essential for the efficiency and the numerical stability of the algorithm to be described in this paper. In addition, following the algorithm in [4], the first privacy preserving algorithm for median regression is stated as follows.

Algorithm 1:
Inputs: privacy parameter ε, design matrix X, response vector Y, regularization parameter λ and approximation parameter γ.
1. Generate a random vector b from the density function h(b) ∝ exp(−(ε/4)||b||). To implement this, pick the l2 norm of b from the Gamma distribution Γ(d + 1, 4/ε), and the direction of b uniformly at random.
2. Compute (µ*, β*) = argmin_{µ, β} L_γ(µ, β) + b^T ω/n + µ²/(2√n), where ω = (µ, β) is a (d + 1)-dimensional vector and n is the number of rows of X.
3. Output (µ*, β*).

This algorithm is very similar to the smoothing median regression convex program in [16], and therefore its running time is similar to that of smoothing regression. In fact, (µ*, β*) can be obtained by the interior point method. Similar to the proof in [4], we can show that Algorithm 1 is privacy preserving.
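A minimal Python sketch of Algorithm 1 (our own illustration: the Huber smoothing ρ_γ and the noise draw follow the reconstruction above, and a generic quasi-Newton solver stands in for the interior point method; function names and defaults are ours):

    import numpy as np
    from scipy.optimize import minimize

    def huber(r, gamma):
        """Smoothed absolute value rho_gamma of equation (3)."""
        return np.where(np.abs(r) <= gamma, r**2 / (2 * gamma), np.abs(r) - gamma / 2)

    def sample_b(d, eps, rng):
        """Draw b in R^(d+1) with density proportional to exp(-(eps/4)*||b||)."""
        norm = rng.gamma(shape=d + 1, scale=4.0 / eps)       # ||b|| ~ Gamma(d+1, 4/eps)
        direction = rng.standard_normal(d + 1)
        return norm * direction / np.linalg.norm(direction)  # uniformly random direction

    def algorithm1_sketch(X, y, eps, lam, gamma, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        b = sample_b(d, eps, rng)

        def objective(omega):                                # omega = (mu, beta_1, ..., beta_d)
            mu, beta = omega[0], omega[1:]
            r = mu + X @ beta - y
            return (np.mean(huber(r, gamma)) + 0.5 * lam * beta @ beta
                    + b @ omega / n + mu**2 / (2 * np.sqrt(n)))

        res = minimize(objective, np.zeros(d + 1), method="L-BFGS-B")
        return res.x[0], res.x[1:]                           # (mu*, beta*)

Under the assumed bounds ||X_i||_2 ≤ 1 and |Y_i| ≤ B, this is the objective perturbation recipe of [4] adapted to the smoothed median loss.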
Theorem 3.1. Given a set of n samples X_1, ..., X_n over R^d, with labels Y_1, ..., Y_n, where for each i, ||X_i|| ≤ 1 and |Y_i| ≤ B, the output of Algorithm 1 preserves (ε, 0)-differential privacy.

Proof. Let a_1 and a_2 be two row vectors over R^d with l2 norm at most 1 and y_1, y_2 ∈ [−B, B]. Consider the two inputs D_1 and D_2, where D_2 is obtained from D_1 by replacing one record (a_1, y_1) with (a_2, y_2). For convenience, assume that the first n − 1 records of the two datasets are the same. Given the output ω* = (µ*, β*) of Algorithm 1, there is a unique value of b that maps the input to the output. This uniqueness holds because both the regularization function and the loss function are differentiable everywhere. Denote ã_1 = (1, a_1) and ã_2 = (1, a_2). Let the values of the (d + 1)-dimensional vector b for D_1 and D_2 be b_1 and b_2, respectively. Since ω* is the value that minimizes both optimization problems, the derivative of both optimization functions at ω* is 0. This implies that for every b_1 in the first case, there exists a b_2 in the second case such that

    b_1 + ã_1 [(1/γ) W_γ(µ*, β*)(µ* + a_1^T β* − y_1) + s_γ(µ*, β*)] = b_2 + ã_2 [(1/γ) W_γ(µ*, β*)(µ* + a_2^T β* − y_2) + s_γ(µ*, β*)],

where, with a slight abuse of notation, W_γ(µ*, β*) and s_γ(µ*, β*) denote the weight and the sign associated with the modified record. According to the definitions of W_γ(µ*, β*) and s_γ(µ*, β*), it is clear that

    −1 ≤ (1/γ) W_γ(µ*, β*)(µ* + a_1^T β* − y_1) + s_γ(µ*, β*) ≤ 1

and

    −1 ≤ (1/γ) W_γ(µ*, β*)(µ* + a_2^T β* − y_2) + s_γ(µ*, β*) ≤ 1.

Since ||ã_1|| ≤ 2 and ||ã_2|| ≤ 2, we have ||b_1 − b_2|| ≤ 4, which implies that −4 ≤ ||b_1|| − ||b_2|| ≤ 4. Therefore, for any (a_1, y_1) and (a_2, y_2),

    P((µ*, β*) | X_1, ..., X_{n−1}, Y_1, ..., Y_{n−1}, X_n = a_1, Y_n = y_1) / P((µ*, β*) | X_1, ..., X_{n−1}, Y_1, ..., Y_{n−1}, X_n = a_2, Y_n = y_2) = h(b_1)/h(b_2) = exp(−(ε/4)(||b_1|| − ||b_2||)),

where h(b_i), i = 1, 2, denotes the density of b_i. Since −4 ≤ ||b_1|| − ||b_2|| ≤ 4, this ratio is bounded by exp(ε), and the theorem is obtained.

According to Lemma 1 in [4], theoretical results for the accuracy of parameter estimation can be given for Algorithm 1.
Lemma 3.1. Let G(ω) and g(ω) be two convex functions which are continuous and differentiable at all points. If ω_1 = argmin_ω G(ω) and ω_2 = argmin_ω G(ω) + g(ω), then ||ω_1 − ω_2|| ≤ g_1/G_1. Here, g_1 = max_ω ||∇g(ω)|| and G_1 = min_v min_ω v^T ∇² G(ω) v over all unit vectors v.

The main idea of the proof is to examine the gradient and the Hessian of the functions G and g around ω_1 and ω_2.

Lemma 3.2. If ||b|| is a random variable drawn from Γ(d + 1, 4/ε), then with probability 1 − α, ||b|| ≤ 4(d + 1) log((d + 1)/α)/ε.

Proof. A random variable drawn from Γ(d + 1, 4/ε) can be written as the sum of d + 1 independent identically distributed random variables, each of which is distributed as an exponential random variable with mean 4/ε. Using a union bound, we see that with probability 1 − α, the values of all d + 1 of these variables are upper bounded by 4 log((d + 1)/α)/ε. Therefore, with probability at least 1 − α, ||b|| ≤ 4(d + 1) log((d + 1)/α)/ε.
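The tail bound of Lemma 3.2 is easy to check numerically. The following sketch (our own illustration with arbitrary parameter values) draws many copies of ||b|| from Γ(d + 1, 4/ε) and verifies that the bound is exceeded with frequency well below α, as the union bound argument guarantees:

    import numpy as np

    def check_gamma_bound(d=5, eps=0.5, alpha=0.05, trials=100_000, seed=0):
        """Empirical exceedance rate of the bound in Lemma 3.2."""
        rng = np.random.default_rng(seed)
        # ||b|| ~ Gamma(shape=d+1, scale=4/eps): a sum of d+1 exponentials with mean 4/eps
        norms = rng.gamma(shape=d + 1, scale=4.0 / eps, size=trials)
        bound = 4.0 * (d + 1) * np.log((d + 1) / alpha) / eps
        return norms.mean(), bound, np.mean(norms > bound)

    mean_norm, bound, exceed_rate = check_gamma_bound()
    print(mean_norm, bound, exceed_rate)    # exceed_rate should be far below alpha = 0.05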
Theorem 3.2. Given an l1 regression problem with regularization parameter λ, let ω_1 be the classifier that minimizes L_γ(µ, β) + µ²/(2√n), and let ω_2 be the classifier output by Algorithm 1. Then, with probability 1 − α,

    ||ω_1 − ω_2|| ≤ 4(d + 1) log((d + 1)/α) / (ε n min(λ, 1/√n)).

Proof. According to Lemma 3.1, we take G(ω) = L_γ(µ, β) + µ²/(2√n) and g(ω) = b^T ω/n. Because F_γ(µ, β) is a convex function, if we define the second derivative of F_γ(µ, β) to be 0 at nondifferentiable points, then the Hessian matrix of F_γ(µ, β) is positive semidefinite. Notice that ∂²(µ²/(2√n))/∂µ² = 1/√n and ∇²((λ/2) β^T β) = λI, where I is the identity matrix of size d × d. Hence, for any unit vector v, G_1 = min_v min_ω v^T ∇² G(ω) v ≥ min(λ, 1/√n) and g_1 = ||b||/n, so

    ||ω_1 − ω_2|| ≤ ||b|| / (n min(λ, 1/√n)).

Since ||b|| is a random variable drawn from Γ(d + 1, 4/ε), according to Lemma 3.2, with probability 1 − α, ||b|| ≤ 4(d + 1) log((d + 1)/α)/ε, and the theorem is obtained.

When n is sufficiently large, ω_2 approximates ω_1 well and ω_1 is close to the true parameter argmin_ω L_γ(ω).

The second algorithm is based on an iterative technique, which was first proposed in [17]. This iterative technique combines absolute deviation regression with least squares regression. Hence, at the heart of the technique is any standard least squares curve fitting algorithm. The basic least squares algorithm minimizes the criterion

    I = (1/n) Σ_{i=1}^{n} w_i r_i²(µ, β) + (λ/2) β^T β,    (6)

where the weighting factors w_i are positive real numbers. Based on the Lagrange multiplier approach, for a fixed λ, there exists a unique value v such that minimizing equation (6) is equivalent to minimizing the constrained problem

    I = (1/n) Σ_{i=1}^{n} w_i r_i²(µ, β),   s.t. β^T β ≤ v.

Considering the (t + 1)-th iteration, we take w_i as 1/(|r_i^{(t)}| + e), where r_i^{(t)} is the residual of the i-th sample at the t-th iteration. Then the iterative process can be written as

    I(t + 1) = (1/n) Σ_{i=1}^{n} (r_i^{(t+1)})² / (|r_i^{(t)}| + e) + (λ/2) β^T β.    (7)

If ||r_i^{(t)} − r_i^{(t+1)}|| ≈ 0 for i = 1, 2, ..., n, then (7) is close to L(µ, β). In practice, we set e as a small positive value (such as e = 0.05).
Algorithm 2:
Inputs: privacy parameter ε, design matrix X, response vector Y, regularization parameter λ, tolerance parameter τ and the number of iterations N.
Initialize the algorithm with µ̂(0) and β̂(0), and compute (µ̂(1), β̂(1)) = argmin_{µ, β} I(1).
for t = 1, ..., N − 1 do
    if ||µ̂(t) − µ̂(t − 1)|| > τ or ||β̂(t) − β̂(t − 1)|| > τ then
        (µ̂(t + 1), β̂(t + 1)) = argmin_{µ, β} I(t + 1)
    else
        (µ̂(N), β̂(N)) := (µ̂(t), β̂(t)); break
    end if
end for
Output (µ̂, β̂) := (µ̂(N), β̂(N)) + U, where U is a (d + 1)-dimensional Laplace random variable with parameter c = 8(√(dv) + B) / (ε n e min(2/(2(√(dv) + B) + e), λ)).
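The following Python sketch (our own illustrative rendering; the calibrated noise parameter c above is abstracted into a single scale argument, and the initialization is simplified to unit weights) shows the reweighted ridge least squares step that Algorithm 2 iterates, followed by the final Laplace perturbation.

    import numpy as np

    def weighted_ridge_step(X, y, w, lam):
        """Minimize (1/n)*sum_i w_i*(mu + X_i beta - y_i)^2 + (lam/2)*beta'beta."""
        n, d = X.shape
        Z = np.column_stack([np.ones(n), X])          # design with an intercept column
        D = np.eye(d + 1); D[0, 0] = 0.0              # the intercept is not penalized
        A = (2.0 / n) * Z.T @ (w[:, None] * Z) + lam * D
        rhs = (2.0 / n) * Z.T @ (w * y)
        return np.linalg.solve(A, rhs)                # theta = (mu, beta_1, ..., beta_d)

    def algorithm2_sketch(X, y, lam, noise_scale, e=0.05, tol=1e-6, n_iter=200, seed=0):
        """Iteratively reweighted least squares for median regression, then Laplace noise."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta = weighted_ridge_step(X, y, np.ones(n), lam)
        for _ in range(n_iter):
            resid = theta[0] + X @ theta[1:] - y
            w = 1.0 / (np.abs(resid) + e)             # w_i = 1 / (|r_i^(t)| + e)
            new_theta = weighted_ridge_step(X, y, w, lam)
            converged = np.max(np.abs(new_theta - theta)) < tol
            theta = new_theta
            if converged:
                break
        return theta + rng.laplace(scale=noise_scale, size=d + 1)

Calibrating noise_scale to the constant c of Algorithm 2 is what yields the (ε, 0) guarantee stated below.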
Theorem 3.3. Given a set of n samples X_1, ..., X_n over R^d, with labels Y_1, ..., Y_n, where for each i (1 ≤ i ≤ n), ||X_i|| ≤ 1 and |Y_i| ≤ B, the output of Algorithm 2 preserves (ε, 0)-differential privacy.

Proof. Denote ω = (µ̂(N), β̂(N)) and the l1 sensitivity of ω by s(ω). Let a_1 and a_2 be two vectors over R^d with l2 norm at most 1 and y_1, y_2 ∈ [−B, B]. Consider the two inputs D_1 and D_2, where D_2 is obtained from D_1 by changing one record (a_1, y_1) into (a_2, y_2). For convenience, assume that the first n − 1 records of the two datasets are the same. Take G(ω) = I(N) and

    g(ω) = (1/n) w (µ̂(N) + a_2^T β̂(N) − y_2)² − (1/n) w (µ̂(N) + a_1^T β̂(N) − y_1)².

Similar to the proof of Theorem 3.2, we obtain

    g_1 = max_ω ||∇g(ω)|| ≤ (2/n) |w| (|µ̂(N)| + |a_1^T β̂(N)| + |y_1|) + (2/n) |w| (|µ̂(N)| + |a_2^T β̂(N)| + |y_2|).

Notice that (µ̂(N), β̂(N)) = argmin_{µ, β} I(N), so ∂I(N)/∂µ = 0 at µ = µ̂(N), that is,

    Σ_{i=1}^{n} w_i (µ̂(N) + X_i β̂(N) − Y_i) = 0  ⟺  µ̂(N) = − Σ_{i=1}^{n} w_i (X_i β̂(N) − Y_i) / Σ_{i=1}^{n} w_i.

Since 0 < w_i ≤ 1/e, |Y_i| ≤ B (i = 1, ..., n) and ||β̂(N)||_1 ≤ √d ||β̂(N)||_2 ≤ √(dv), we have |µ̂(N)| ≤ √(dv) + B. Notice that the above inequalities still hold at the t-th (t ≥ 2) iteration, and hence 1/(2(√(dv) + B) + e) ≤ w_i ≤ 1/e. Then we obtain

    g_1 = max_ω ||∇g(ω)|| ≤ 8(√(dv) + B)/(ne).

In addition, denote F_e(ω) = (1/n) Σ_{i=1}^{n} w_i r_i²(µ, β). It can be checked that F_e(ω) is convex, that ∂²F_e(ω)/∂µ² = (2/n) Σ_{i=1}^{n} w_i ≥ 2/(2(√(dv) + B) + e), and that ∇²((λ/2) β^T β) = λI, where I is the identity matrix of size d × d. Hence G_1 ≥ min(2/(2(√(dv) + B) + e), λ) and

    s(ω) ≤ 8(√(dv) + B) / (n e min(2/(2(√(dv) + B) + e), λ)).

According to Lemma 2.1, the result then follows directly from the composition theorem.
For e > 0, define a perturbation of L(µ, β) as

    L_e(µ, β) = (1/n) Σ_{i=1}^{n} [|r_i(µ, β)| − e ln(e + |r_i(µ, β)|)] + (λ/2) β^T β.

[10] proves that the iterative least squares algorithm without adding noise is a special case of Majorization-Minimization (MM) algorithms (see [11]) for the objective function L_e(µ, β), and obtains convergence results.
Proposition 3.1. For linear median regression with a full-rank covariate matrix X, the iterative least squares algorithm without adding noise converges to the unique minimizer of L_e(µ, β).
Proposition 3.2. If (µ̂_e, β̂_e) minimizes L_e(µ, β), then any limit point of (µ̂_e, β̂_e) as e tends to 0 minimizes L(µ, β). If L(µ, β) has a unique minimizer (µ̃, β̃), then lim_{e→0} (µ̂_e, β̂_e) = (µ̃, β̃).

The proof of the above propositions can be found in [10].
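As a quick numerical illustration of the MM descent property behind Proposition 3.1 (our own sketch on toy data, with λ = 0 for simplicity), each noiseless reweighted least squares step should not increase the perturbed objective L_e:

    import numpy as np

    def L_e(theta, X, y, e):
        """Perturbed median regression objective (lambda = 0 for simplicity)."""
        r = theta[0] + X @ theta[1:] - y
        return np.mean(np.abs(r) - e * np.log(e + np.abs(r)))

    def irls_step(theta, X, y, e):
        """One noiseless reweighted least squares step with weights 1/(|r_i| + e)."""
        Z = np.column_stack([np.ones(X.shape[0]), X])
        w = 1.0 / (np.abs(Z @ theta - y) + e)
        return np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * y))

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.laplace(scale=1.0, size=200)
    theta, e = np.zeros(3), 0.05
    for _ in range(8):
        print(round(L_e(theta, X, y, e), 6))   # a non-increasing sequence of objective values
        theta = irls_step(theta, X, y, e)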
Theorem 3.4. Given an l1 regression problem with regularization parameter λ, let ω_1 be the classifier that minimizes L_e(µ, β), and let ω_2 be the classifier output by Algorithm 2. Then, with probability 1 − α,

    ||ω_1 − ω_2|| ≤ 8(√(dv) + B)(d + 1) log((d + 1)/α) / (ε n e min(2/(2(√(dv) + B) + e), λ)).

Proof. Since ||b|| is a random variable drawn from Γ(d + 1, c) with c = 8(√(dv) + B)/(ε n e min(2/(2(√(dv) + B) + e), λ)), with probability 1 − α we have ||b|| ≤ 8(√(dv) + B)(d + 1) log((d + 1)/α)/(ε n e min(2/(2(√(dv) + B) + e), λ)), and the theorem is obtained.

Therefore, for fixed small e, if n is sufficiently large, accuracy can be ensured in practice.

In [1], the authors argue that adding noise to the estimated parameters after optimization would destroy the utility of the learned model. Hence, we prefer a more sophisticated method to control the influence of the training data during the training process, especially in the stochastic gradient descent computation. [19] declares that greedy coordinate descent is an effective method for l1 regression, where l1 regression means median regression, so we apply this idea to minimize the objective function L(µ, β) in a similar way. Although L(µ, β) is nondifferentiable, it does possess directional derivatives along each forward or backward coordinate direction. For example, if e_k is the coordinate direction along which β_k varies, then the objective function (2) has directional derivatives

    d_{e_k^+} L(µ, β) = lim_{τ→0^+} [L(µ, β + τ e_k) − L(µ, β)]/τ = d_{e_k^+} F(µ, β) + λ β_k

and

    d_{e_k^−} L(µ, β) = lim_{τ→0^+} [L(µ, β − τ e_k) − L(µ, β)]/τ = d_{e_k^−} F(µ, β) − λ β_k.

In l1 regression, the coordinate direction derivatives are

    d_{e_k^+} F(µ, β) = (1/n) Σ_{i=1}^{n} { −x_ik  if r_i(µ, β) < 0;  x_ik  if r_i(µ, β) > 0;  |x_ik|  if r_i(µ, β) = 0 },    (8)

and

    d_{e_k^−} F(µ, β) = (1/n) Σ_{i=1}^{n} { x_ik  if r_i(µ, β) < 0;  −x_ik  if r_i(µ, β) > 0;  |x_ik|  if r_i(µ, β) = 0 }.    (9)

In the greedy coordinate descent process [9], we update the parameter β_k based on min{d_{e_k^+} L(µ, β), d_{e_k^−} L(µ, β)}. If both coordinate directional derivatives are nonnegative, the update of β_k stops. In addition, µ̂ = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂), where n_0 = n/N. By the method of batch gradients [18], the t-th iteration only employs records with batch size n_0, which means that L(µ̂(t), β̂(t)) in the algorithm is calculated on the subset (X(t), Y(t)). The algorithm is described as follows.

Algorithm 3:
Inputs: privacy parameter ε, design matrix X, response vector Y, regularization parameter λ, positive number ℓ and the number of iterations N.
Randomly split (X, Y) into N disjoint subsets (X(t), Y(t)) of size n_0 = n/N.
Initialize the algorithm with a vector (µ̂(0), β̂(0)) (such as the solution of l2 regression).
for t = 0, 1, 2, ..., N − 1 do
    η_t = ℓ/(t + 1)
    for k = 1, 2, ..., d do
        β̂_k(t + 0.5) = β̂_k(t) − η_t (min{d_{e_k^+} L(µ̂(t), β̂(t)), d_{e_k^−} L(µ̂(t), β̂(t))}),
        β̂_k(t + 1) = β̂_k(t + 0.5) + U_t, where U_t ∼ Lap(2η_t/(ε n_0)).
    end for
    µ̂(t + 1) = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂(t + 1)), the sum running over the current batch.
end for
Output β̂ := β̂(N), µ̂ := µ̂(N).
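A compact Python sketch of the noisy greedy coordinate descent of Algorithm 3 (our own illustrative rendering; batching, step sizes and the noise scale 2η_t/(ε n_0) follow the description above, while the initializer, function names and defaults are ours):

    import numpy as np

    def dir_derivs(mu, beta, Xb, yb, lam, k):
        """Forward/backward directional derivatives of L along coordinate k, eqs (8)-(9)."""
        r = mu + Xb @ beta - yb
        xk = Xb[:, k]
        fwd = np.mean(np.where(r > 0, xk, np.where(r < 0, -xk, np.abs(xk))))
        bwd = np.mean(np.where(r > 0, -xk, np.where(r < 0, xk, np.abs(xk))))
        return fwd + lam * beta[k], bwd - lam * beta[k]

    def algorithm3_sketch(X, y, lam, eps, ell, N, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        n0 = n // N
        order = rng.permutation(n)                    # random split into N disjoint batches
        beta, mu = np.zeros(d), float(np.mean(y))     # a simple initializer
        for t in range(N):
            eta = ell / (t + 1)
            batch = order[t * n0:(t + 1) * n0]
            Xb, yb = X[batch], y[batch]
            for k in range(d):
                fwd, bwd = dir_derivs(mu, beta, Xb, yb, lam, k)
                if min(fwd, bwd) >= 0:                # both directional derivatives nonnegative
                    continue
                beta[k] -= eta * min(fwd, bwd)
                beta[k] += rng.laplace(scale=2 * eta / (eps * n0))   # per-coordinate noise
            mu = float(np.mean(yb - Xb @ beta))
        return mu, beta

The choices lam=0.002, ell=0.1 and N=40 would mirror the simulation settings reported below, with eps set to whatever privacy budget is desired.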
Theorem 3.5. Given a set of n samples X_1, ..., X_n over R^d with labels Y_1, ..., Y_n, where for each i (1 ≤ i ≤ n), ||X_i|| ≤ 1 and |Y_i| ≤ B, the output of Algorithm 3 preserves (ε, 0)-differential privacy.

Proof. Because of sample splitting, for (x, y) ∈ (X(t), Y(t)) for some 0 ≤ t ≤ N − 1, it suffices to prove the privacy guarantee for the t-th iteration of the algorithm: any iteration prior to the t-th one does not depend on (x, y), while any iteration after the t-th one is differentially private by post-processing [6]. At the t-th iteration, the algorithm first updates the intermediate estimate of β_k:

    β̂_k(t + 0.5) = β̂_k(t) − η_t (min{d_{e_k^+} L(µ̂(t), β̂(t)), d_{e_k^−} L(µ̂(t), β̂(t))}).

Let a_1 and a_2 be two vectors over R^d with l2 norm at most 1 and y_1, y_2 ∈ [−B, B]. Consider the two inputs D_1 and D_2, where D_2 is obtained from D_1 by changing one record (a_1, y_1) into (a_2, y_2); for convenience, assume that the other records of the two batches are the same. Denote by Dir_1(t) the directional derivative min{d_{e_k^+} L(µ̂(t), β̂(t)), d_{e_k^−} L(µ̂(t), β̂(t))} for the dataset D_1, and by Dir_2(t) that for the dataset D_2. Notice that β̂(t) does not depend on (X(t), Y(t)), so by Lemma 2.1, β̂_k(t + 1) is (ε, 0)-differentially private provided that for the two inputs D_1 and D_2 we have

    ||η_t [Dir_1(t) − Dir_2(t)]|| ≤ 2η_t/n_0,

since the added noise U_t has scale 2η_t/(ε n_0). This is true, because ||η_t [Dir_1(t) − Dir_2(t)]|| ≤ (η_t/n_0)(||a_1|| + ||a_2||) ≤ 2η_t/n_0, so the privacy guarantee for β̂ is proved. In addition, since µ̂ = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂), it is differentially private by post-processing [6]. Then the theorem is obtained.

[19] points out that coordinate descent may fail for a nondifferentiable function, since a point at which all coordinate directional derivatives are nonnegative is not necessarily a minimum point. However, if a suitable approximate value can be obtained quickly, this shortcoming can be accepted in practice. The following theorem shows that the estimated parameters become stable when the number of iterations N is large.
Theorem 3.6. Given a set of n samples X_1, ..., X_n over R^d with labels Y_1, ..., Y_n (for each i, ||X_i|| ≤ 1 and |Y_i| ≤ B), Algorithm 3 is convergent in probability with rate O(1/t).

Proof. Consider the t-th iteration for β_k. Since |Dir(t)| ≤ (1/n_0) Σ_{i=1}^{n_0} |x_ik| ≤ 1 and β̂_k(t + 1) = β̂_k(t) − η_t Dir(t) + U_t, we have |β̂_k(t + 1) − β̂_k(t)| ≤ |η_t| + |U_t| = O_p(1/t), where O_p(1/t) indicates convergence in probability with rate O(1/t). Since µ̂ = (1/n_0) Σ_{i=1}^{n_0} (Y_i − X_i β̂), it is also convergent in probability with rate O(1/t). Then the theorem is obtained.
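A tiny numerical illustration of the 1/t decay in Theorem 3.6 (our own sketch with arbitrary illustrative values): the per-coordinate change at iteration t is dominated by the step size η_t = ℓ/(t + 1) plus Laplace noise of scale 2η_t/(ε n_0), both of which shrink like 1/t.

    import numpy as np

    rng = np.random.default_rng(2)
    ell, eps, n0 = 0.1, 0.5, 1000              # illustrative values only
    for t in [1, 10, 100, 1000]:
        eta = ell / (t + 1)
        noise = abs(rng.laplace(scale=2 * eta / (eps * n0)))
        print(t, round(eta + noise, 8))        # an upper bound on the change, roughly ~ 1/t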
4 Simulated results

Denote by n the number of samples. Here we let n take two values: 5000 and 5000000. In fact, when n is small, such as 100, Algorithm 1 can perform well, but Algorithm 2 requires a bigger n (otherwise the added noise would be large, resulting in a big estimation error). Consider the following example with three independent variables x_1, x_2, x_3, where y_i = 2 + 3x_{i1} − 4x_{i3} + u_i and u_i obeys the Laplace distribution Lap(2), for i = 1, ..., n. We assume that, for each i (1 ≤ i ≤ n), the l2 norm of X_i is less than 1 and the absolute value of Y_i is less than 2. In practice, we take λ = 0.002 in the objective function. In Algorithm 1, the parameter γ is taken as 0.05. In Algorithm 2, we set the parameter e = 0.05, a small tolerance parameter τ, and the number of iterations N = 200; in fact, Algorithm 2 tends to converge in fewer than 30 iterations. In Algorithm 3, we set ℓ = 0.1, step size η_t = ℓ/(t + 1) and the number of iterations N = 40. In addition, privacy parameters (ε, δ) with δ = 0 are used for all the above algorithms.
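The simulated data can be generated along the following lines (a sketch under our own normalization choices; the paper does not spell out how the bounds on X_i and Y_i are enforced, so the scaling below is only one possibility):

    import numpy as np

    def simulate(n, seed=0):
        """Draw (X, y) from y = 2 + 3*x1 - 4*x3 + Laplace(2) noise, with ||X_i||_2 <= 1."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(-1.0, 1.0, size=(n, 3))
        X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
        y = 2.0 + 3.0 * X[:, 0] - 4.0 * X[:, 2] + rng.laplace(scale=2.0, size=n)
        return X, y

    X, y = simulate(5000)
    print(X.shape, y[:3])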
The results are listed in Table 1 and Table 2. They show that Algorithm 1 performs better than the others when n = 5000. However, when n becomes much bigger, Algorithm 1 costs much more time. Notice that when n = 5000000, the noise added in Algorithm 2 becomes small, which makes the estimated result precise. In addition, Algorithm 3 costs less time in both cases, but it depends highly on the initial value and the step size η_t, which is a common problem for gradient descent methods [18].

Table 1: Estimated results with sample size 5000

         Algorithm 1    Algorithm 2    Algorithm 3    True value
β_2        -0.0295        13.2762        -0.6099         0
β_3        -4.0835       -14.2089        -3.2283        -4

Table 2: Estimated results with sample size 5000000

         Algorithm 1    Algorithm 2    Algorithm 3    True value
β_3        -3.9460        -3.9205        -3.9918        -4

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar and L. Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 308-318.
[2] I. Barrodale and F. D. K. Roberts. 1973. An improved algorithm for discrete l1 linear approximation. SIAM Journal on Numerical Analysis. 839-848.
[3] T. Cai, Y. Wang and L. Zhang. 2019. The cost of privacy: optimal rates of convergence for parameter estimation with differential privacy. Under review: arXiv:1902.04495v3.
[4] K. Chaudhuri and C. Monteleoni. 2009. Privacy-preserving logistic regression. In Proceedings of the 21st International Conference on Neural Information Processing Systems. 289-296.
[5] C. Dwork. 2006. Differential privacy. In International Colloquium on Automata, Languages and Programming. 1-12.
[6] C. Dwork and A. Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9. 211-407.
[7] C. Dwork, F. McSherry, K. Nissim and A. Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference. 265-284.
[8] U. Erlingsson, V. Pihur and A. Korolova. 2014. RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 1054-1067.
[9] J. Friedman, T. Hastie, H. Höfling and R. Tibshirani. 2007. Pathwise coordinate optimization. The Annals of Applied Statistics. 1(2). 302-332.
[10] D. R. Hunter and K. Lange. 2000. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics. 9. 60-77.
[11] D. R. Hunter and K. Lange. 2004. A tutorial on MM algorithms. The American Statistician. 58(1). 30-37.
[12] R. Koenker and G. Bassett. 1978. Regression quantiles. Econometrica. 46. 33-50.
[13] D. Kifer and A. Machanavajjhala. 2011. No free lunch in data privacy. In International Conference on Management of Data. 193-204.
[14] R. Koenker and O. Geling. 2001. Reappraising medfly longevity: a quantile regression survival analysis. Journal of the American Statistical Association. 96. 458-468.
[15] R. Koenker and K. F. Hallock. 2001. Quantile regression. Journal of Economic Perspectives. 15. 143-156.
[16] K. Madsen and H. B. Nielsen. 1993. A finite smoothing algorithm for linear l1 estimation.