Private Incremental Regression
Shiva Prasad Kasiviswanathan∗   Kobbi Nissim†   Hongxia Jin‡

∗ Samsung Research America, [email protected]
† Georgetown University, [email protected]. Supported by NSF grant CNS1565387 and grants from the Sloan Foundation.
‡ Samsung Research America, [email protected]

Abstract
Data is continuously generated by modern data sources, and a recent challenge in machine learning has been to develop techniques that perform well in an incremental (streaming) setting. A variety of offline machine learning tasks are known to be feasible under differential privacy, where generic constructions exist that, given a large enough input sample, perform tasks such as PAC learning, Empirical Risk Minimization (ERM), regression, etc. In this paper, we investigate the problem of private machine learning where, as is common in practice, the data is not given at once, but rather arrives incrementally over time.

We introduce the problems of private incremental ERM and private incremental regression, where the general goal is to always maintain a good empirical risk minimizer for the history observed under differential privacy. Our first contribution is a generic transformation of private batch ERM mechanisms into private incremental ERM mechanisms, based on the simple idea of invoking the private batch ERM procedure at some regular time intervals. We take this construction as a baseline for comparison. We then provide two mechanisms for the private incremental regression problem. Our first mechanism is based on privately constructing a noisy incremental gradient function, which is then used in a modified projected gradient procedure at every timestep. This mechanism has an excess empirical risk of ≈ √d, where d is the dimensionality of the data. While from the results of Bassily et al. [2] this bound is tight in the worst case, we show that certain geometric properties of the input and constraint set can be used to derive significantly better results for certain interesting regression problems. Our second mechanism, which achieves this, is based on the idea of projecting the data to a lower-dimensional space using random projections, and then adding privacy noise in this low-dimensional space. The mechanism overcomes the issues of adaptivity inherent in the use of random projections in online streams, and uses recent developments in high-dimensional estimation to achieve an excess empirical risk bound of ≈ T^{1/3}W^{2/3}, where T is the length of the stream and W is the sum of the Gaussian widths of the input domain and the constraint set that we optimize over.

1 Introduction

Most modern data, such as documents, images, social media data, sensor data, and mobile data, naturally arrive in a streaming fashion, giving rise to the challenge of incremental machine learning, where the goal is to build and publish a model that evolves as data arrives. Learning algorithms are frequently run on sensitive data, such as location information in a mobile setting, and the results of such analyses could leak sensitive information. For example, Kasiviswanathan et al. [32] show how the results of many convex ERM problems can be combined to carry out reconstruction attacks in the spirit of Dinur and Nissim [11]. Given this, a natural direction to explore is whether we can carry out incremental machine learning without leaking any significant information about individual entries in the data. For example, a data scientist might want to continuously update the regression parameter of a linear model built on a stream of user profile data gathered from an ongoing survey, but these updates should not reveal whether any one person participated in the survey or not.

Differential privacy [15] is a rigorous notion of privacy that is now widely studied in computer science and statistics.
Intuitively, differential privacy requires that datasets differing in only one entry induce similar distributions on the output of a (randomized) algorithm. One of the strengths of differential privacy comes from the large variety of machine learning tasks that it allows: good generic constructions exist for tasks such as PAC learning [4, 31] and Empirical Risk Minimization [41, 34, 25, 26, 48, 27, 2, 12, 51, 47]. These constructions, however, are typically focused
[email protected] ‡ Samsung Research America, [email protected] n the batch (offline) setting, where information is first collected and then analyzed. Considering an incrementalsetting, it is natural to ask whether these tasks can still be performed with high accuracy, under differential privacy.In this paper, we introduce the problem of private incremental empirical risk minimization (ERM) and providealgorithms for this new setting. Our particular focus will be on the problem of private incremental linear regression.Let us start with a description of the traditional batch convex ERM framework. Given a dataset and a constraintspace C , the goal in ERM is to pick a θ ∈ C that minimizes the empirical error (risk) . Formally, given n datapoints z , . . . , z n from some domain Z , and a closed, convex set C ⊆ R d , consider the optimization problem: min θ ∈C J ( θ ; z , . . . , z n ) where J ( θ ; z , . . . , z n ) = n X i =1 ( θ ; z i ) . (1)The loss function J : C ×Z n → R measures the fit of θ ∈ C to the given data z , . . . , z n , the function : C ×Z → R is the loss associated with a single datapoint and is assumed to be convex in the first parameter θ for every z ∈ Z . It iscommon to assume that the loss function has certain properties, e.g., positive valued. The M-estimator (true empiricalrisk minimizer) ˆ θ associated with a given a function J ( θ ; z , . . . , z n ) ≥ is defined as: ˆ θ ∈ argmin θ ∈C J ( θ ; z , . . . , z n ) = argmin θ ∈C n X i =1 ( θ ; z i ) . This type of program captures a variety of empirical risk minimization (ERM) problems, e.g., the MLE (MaximumLikelihood Estimators) for linear regression is captured by setting ( θ ; z ) = ( y −h x , θ i ) in (1), where z = ( x , y ) for x ∈ R d and y ∈ R . Similarly, the MLE for logistic regression is captured by setting ( θ ; z ) = ln(1+exp( − y h x , θ i )) .Another common example is the support vector machine (SVM), where ( θ ; z ) = hinge( y h x , θ i ) , where hinge( a ) =1 − a if a ≤ and otherwise.The main focus of this paper will be on a particularly important ERM problem of linear regression. Linearregression is a popular statistical technique that is commonly used to model the relationship between the outcome(label) and the explanatory variables (covariates). Informally, in a linear regression, given n covariate-response pairs ( x , y ) , . . . , ( x n , y n ) ∈ R d × R , we wish to find a (regression) parameter vector ˆ θ such that h x i , ˆ θ i ≈ y i for most i ’s. Specifically, let y = ( y , . . . , y n ) ∈ R n denote a vector of the responses, and let X ∈ R n × d be the designmatrix where x ⊤ i is the i th row. Consider the linear model: y = Xθ ⋆ + w , where w is the noise vector, the goal inlinear regression is to estimate the unknown regression vector θ ⋆ . Assuming that the noise vector w = ( w , . . . , w n ) follows a (sub)Gaussian distribution, estimating the vector θ ⋆ amounts to solving the “ordinary least squares” (OLS)problem: ˆ θ ∈ argmin θ n X i =1 ( y i − h x i , θ i ) . Typically, for additional guarantees such as sparsity, stability, etc., θ is constrained to be from a convex set C ⊂ R d .Popular choices of C include the L -ball (referred to as Lasso regression) and L -ball (referred to as Ridge regression).In an incremental setting, the ( x i , y i ) ’s arrive over time, and the goal in incremental linear regression is to maintainover time (an estimate of) the regression parameter. We provide a more detailed background on linear regression inAppendix A.1. Incremental Setting.
Incremental Setting. In an incremental setting, the data arrives in a stream at discrete time intervals. The incremental setting is a variant of the traditional batch setting, capturing the fact that modern data is rarely collected at one single time; more commonly, data gathering and analysis are interleaved. An incremental algorithm is modeled as follows: at each timestep the algorithm receives an input from the stream, computes, and produces outputs. Typically, constraints are placed on the algorithm in terms of the availability of some computational resource (such as memory or computation time). In this paper, the challenge in this setting comes from the differential privacy constraint, because frequent releases about the data can lead to privacy loss (see, e.g., [11, 32]).

We focus on the incremental ERM problem. In this problem setting, the data z_1, . . . , z_T ∈ Z arrives one point at each timestep t ∈ {1, . . . , T}. The goal of an incremental ERM algorithm is to release, at each timestep t, an estimator that minimizes the risk measured on the data z_1, . . . , z_t. In more concrete terms, the goal is to output θ̂_t at every t ∈ {1, . . . , T}, where:

Incremental ERM:  θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t j(θ; z_i).
(This formulation also captures regularized ERM, in which an additional convex function R(θ) is added to the loss function to penalize certain types of solutions, e.g., to "penalize" the "complexity" of θ. The loss function J(θ; z_1, . . . , z_n) then equals Σ_{i=1}^n j(θ; z_i) + R(θ), which is the same as replacing j(θ; z_i) by j(θ; z_i) + R(θ)/n in (1).)

The goal of a private incremental ERM algorithm is to estimate θ̂_t at every t ∈ {1, . . . , T} under differential privacy. In this paper, we develop the first private incremental ERM algorithms. There is a long line of work on designing differentially private algorithms for empirical risk minimization problems in the batch setting [41, 34, 26, 48, 27, 2, 12, 51, 47]. A naive approach for transforming the existing batch techniques to the incremental model is to use them to recompute the outcome after each datapoint's arrival. However, for achieving an overall fixed level of differential privacy, this results in an unsatisfactory loss in terms of utility. Precise statements can be obtained using the composition properties of differential privacy (as in Theorem A.4), but informally, if a differentially private algorithm is executed T times on the same input, then the privacy parameter (ǫ in Definition 4) degrades by a factor of ≈ √T, and this affects the overall utility of this approach, as the utility bounds typically depend inversely on the privacy parameter. Therefore, the above naive approach suffers an additional multiplicative factor of ≈ √T over the risk bounds obtained for the batch case. Our goal is to obtain risk bounds in the incremental setting that are comparable to those in the batch setting, i.e., bounds that do not suffer from this additional penalty of √T.
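To make the √T penalty concrete, here is a back-of-the-envelope sketch (the exact constants come from the advanced composition bound of Theorem A.4; the numbers below are purely illustrative):

```python
import math

# Naive approach: re-running a batch ERM mechanism at all T timesteps forces
# (by advanced composition) a per-release budget of roughly
# eps / sqrt(2 T ln(1/delta)). Since utility bounds scale inversely with the
# privacy parameter, the risk blows up by a factor of about sqrt(T).
eps, delta, T = 1.0, 1e-6, 100_000
eps_per_release = eps / math.sqrt(2 * T * math.log(1 / delta))
print(eps_per_release)           # ~ 6e-4 per release
print(eps / eps_per_release)     # utility blow-up factor, ~ sqrt(T)
```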
Generalization. For the kind of problems we consider in this paper, if the process generating the data satisfies the conditions for generalization (e.g., if the data stream consists of datapoints sampled independently and identically from an unknown distribution), the incremental ERM solution converges to the true risk minimizer on the distribution (via uniform convergence and other ideas; refer to [53, 43, 2] for more details). In this case, a model learned in an incremental fashion will have good predictive accuracy on unseen arriving data. If, however, the data does not satisfy these conditions, then θ̂_t can be viewed as a "summarizer" for the data seen so far. Generating these summaries can be useful in many applications; e.g., the regression parameter can be used to explain the associations between the outcomes (y_i's) and the covariates (x_i's). Such associations are regularly used in domains such as public health, the social sciences, and biological studies to understand whether specific variables are important (e.g., statistically significant) or unimportant predictors of the outcome. In practice, these associations need to be constantly re-evaluated over time as new data arrives.

Comparison to Online Learning.
Online learning (or sequential prediction) is another well-studied setting for learning when the data arrives in sequential order. There are differences between the goals of incremental and online learning. In online ERM learning, the aim is to provide an estimator θ̃_t that can be used for future prediction. More concretely, at time t, an online learner chooses θ̃_t, then the adversary picks z_t, and the learner suffers loss j(θ̃_t; z_t). Online learners try to minimize the regret, defined as the difference between the cumulative loss of the learner and the cumulative loss of the best fixed decision at the end of T rounds [42]. In the incremental setting, the algorithm gets to observe z_t before committing to the estimator, and the goal is to ensure that at each timestep t the algorithm maintains a single estimator that minimizes the risk on the history. There are strong lower bounds on the achievable regret for online ERM learning; in particular, even under the differential privacy constraint, the excess risk bounds for incremental learning that we obtain here have a better dependence on T (the stream length) than the regret lower bounds for online ERM. The incremental learning model should be viewed as a variant of the batch (offline) learning model in which the data arrives over time and the goal is to output intermediate results.

1.1 Our Results

Before stating our results, let us define how we measure the success of our algorithms. As is standard in ERM, we measure the quality of our algorithms by the worst-case (over inputs) excess (empirical) risk, defined as the difference from the minimum possible risk over a function class. In an incremental setting, we want this excess risk to always be small, for any sequence of inputs. The following definition captures this requirement. All our algorithms are randomized; they take a confidence parameter β and produce bounds that hold with probability at least 1 − β.

Definition 1.
A (randomized) incremental algorithm is an (α, β)-estimator for loss function J if, with probability at least 1 − β over the coin flips of the algorithm, for each t ∈ {1, . . . , T}, after processing a prefix of the stream of length t, it generates an output θ_t ∈ C that satisfies the following bound on the excess (empirical) risk:

J(θ_t; z_1, . . . , z_t) − J(θ̂_t; z_1, . . . , z_t) ≤ α,

where θ̂_t ∈ argmin_{θ∈C} J(θ; z_1, . . . , z_t) is the true empirical risk minimizer. Here, α is referred to as the bound on the excess risk.

(A point to note about the above description: while we have the privacy constraint, we have placed no computational constraints on the algorithm. In particular, the above description also allows algorithms that at time t use the whole input history z_1, . . . , z_t. However, as we discuss in Sections 4 and 5, our proposed approaches for private incremental regression are also efficient in terms of their resource requirements.)
Incremental ERM Problem | Bound on the Excess (Empirical) Risk under (ǫ, δ)-Differential Privacy (α in Definition 1)
1. Convex function (using a generic transformation) | (Td)^{1/3} log^{2/3}(1/δ) / ǫ^{2/3}
2. Strongly convex function (using a generic transformation) | √d log(1/δ) / (ν^{1/2} ǫ)
3. Linear regression (Mech 1) | √d √(log(1/δ)) / ǫ
4. Linear regression (Mech 2) | T^{1/3}W^{2/3} √(log(1/δ)) / ǫ + T^{1/6}W^{1/3} √OPT + T^{1/4}W^{1/2} OPT^{1/4}
(where W = w(X) + w(C) and OPT is the minimum empirical risk at timestep T)

Table 1: Summary of our results. The stream length is T and d is the dimensionality of the data (the number of covariates in the case of regression). The bounds are stated for the setting where both the Lipschitz constant of the loss function and ‖C‖ are scaled to 1. The bounds ignore polylog factors in d and T, and each entry gives the bound when it is below T, i.e., the bounds should be read as min{T, ·}. ν is the strong convexity parameter. For the regression problem, X is the domain from which the inputs (covariates) are drawn and C is the constraint space. The exact results are provided in Theorems 3.1, 4.2, and 5.7, respectively.

In this paper, we propose incremental ERM algorithms that provide a differential privacy guarantee on data streams. Informally, differential privacy requires that the output of a data analysis mechanism not be overly affected by any single entry in the input dataset. In the incremental setting, we insist that this guarantee hold at each timestep for all outputs produced up to that timestep (precise definition in Section 2). This implies that an adversary is unable to determine whether a particular datapoint was present or not in the input stream by observing the output of the algorithm over time. Two parameters, ǫ and δ, control the level of privacy. Very roughly, ǫ is an upper bound on the amount of influence an individual datapoint has on the outcome, and δ is the probability that this bound fails to hold (for a precise interpretation, refer to [30]), so the definition becomes more stringent as ǫ, δ → 0. Therefore, while the parameters α, β measure the accuracy of an incremental algorithm, the parameters ǫ, δ represent its privacy risk. Our private incremental algorithms take ǫ, δ as parameters and satisfy: (a) the differential privacy constraint with parameters ǫ, δ; and (b) the (α, β)-estimator property (Definition 1) for every β > 0 and some α (the parameter that the algorithm tries to minimize).

There is a trivial differentially private incremental ERM algorithm that completely ignores the input and outputs, at every t ∈ {1, . . . , T}, some fixed θ₀ ∈ C (this scheme is private as the output is always independent of the input). The excess risk of this algorithm is at most 2TL‖C‖, where L is the Lipschitz constant of j (Definition 8) and ‖C‖ is the maximum attained norm in the convex set C (Definition 2); this follows as, in each timestep t ∈ {1, . . . , T}, J(θ₀; z_1, . . . , z_t) − J(θ̂_t; z_1, . . . , z_t) ≤ tL‖θ₀ − θ̂_t‖ ≤ tL(‖θ₀‖ + ‖θ̂_t‖) ≤ 2tL‖C‖ ≤ 2TL‖C‖. All bounds presented in this paper, as is also true for all other existing results in the private ERM literature, are only interesting in the regime where they are less than this trivial bound.

For the purposes of this section, we make some simplifying assumptions and omit dependence on all but the key variables, the dimension d and the stream length T (the omitted parameters include ǫ, δ, β, ‖C‖, and the Lipschitz constant of the loss function). Slightly more detailed bounds are stated in Table 1. All our algorithms run in time polynomial in d and T (the exact bounds depend on the time needed for Euclidean projection onto the constraint set, which differs based on the constraint set). Additionally, our private incremental regression algorithms, which utilize the Tree Mechanism of [16, 7], have a space requirement whose dependence on the stream length T is only logarithmic.

(1) A Generic Transformation. A natural first question is whether non-trivial private ERM algorithms exist in general for the incremental setting.
Our first contribution is to answer this question in the affirmative: we present a simple generic transformation of private batch ERM algorithms into private incremental ERM algorithms. The construction idea is simple: rather than invoking the batch ERM algorithm at every timestep, the batch ERM algorithm is invoked every τ timesteps, where τ is chosen to balance the excess risk factor coming from the stale risk minimizer (because of inaction) against the excess risk factor coming from the increased privacy noise due to reuse of the data. Using this idea along with recent results of Bassily et al. [2] for private batch ERM, we obtain an excess risk bound (α in Definition 1) of Õ(min{(Td)^{1/3}, T}). (For simplicity of exposition, the Õ(·) notation hides factors polynomial in log T and log d.) Using this same framework, we also show that when the loss function is strongly convex (Definition 9) the excess risk bound can be improved to Õ(min{√d, T}). (Note, however, that a linear regression instance typically does not satisfy the strong convexity requirement.)

A follow-up question is: how much worse is this private incremental ERM risk bound compared to the best known private batch ERM risk bound (with a sample size of T datapoints)? In the batch setting, the results of Bassily et al. [2] establish that for any convex ERM problem it is possible to achieve an excess risk bound of Õ(min{√d, T}) (which is also tight in general). Therefore, our transformation from the batch to the incremental setting causes the excess risk to increase by at most a factor of ≈ max{T^{1/3}/d^{1/6}, 1}. Note that even in a low-dimensional setting (small d), the factor increase in excess risk of ≈ T^{1/3} (as now max{T^{1/3}/d^{1/6}, 1} ≈ T^{1/3}) is much smaller than the factor increase of ≈ √T for the earlier described naive approach based on using a private batch ERM algorithm at every timestep. The situation only improves for higher-dimensional data.

(2) Private Incremental Linear Regression Using Tree Mechanism. We show that we can improve the generic construction from (1) for the important problem of linear regression. We do so by introducing the notion of a private gradient function (Definition 5) that allows differentially private evaluation of the gradient at any θ ∈ C. More formally, for any θ ∈ C, a private gradient function allows differentially private evaluation of the gradient at θ to within a small error (with high probability). (Intuitively, this implies that for any θ ∈ C, the output of a private gradient function cannot be used to distinguish two streams that are almost the same, which is the differential privacy guarantee.) Since the data arrives continually, our algorithm utilizes the Tree Mechanism of [16, 7] to continually update the private gradient function. Now, given access to a private gradient function, we can use any first-order convex optimization technique (such as projected gradient descent) to privately estimate the regression parameter: since these optimization techniques operate by iteratively taking steps in the direction opposite to the gradient evaluated at the current point, we can use the private gradient function for evaluating all the required gradients. Using this, we design a private regression algorithm for the incremental setting that achieves an excess risk bound of Õ(min{√d, T}). It is easy to observe that, for private incremental linear regression, this result improves the bounds from the generic construction above for any choice of d and T (as min{√d, T} ≤ min{(Td)^{1/3}, T}).
Ignoring polylog factors, this worst-case bound matches the lower bounds on the excess risk for squared loss in the batch case [2], implying that this bound cannot be improved in general (an excess risk upper bound for a problem in the incremental setting trivially holds for the same problem in the batch setting).

(3) Private Incremental Linear Regression: Going Beyond the Worst Case.
The noise added in our previous solution (2) grows approximately as the square root of the input dimension (for sufficiently large T), which could be prohibitive for a high-dimensional input. While in a worst-case setup this, as discussed above, seems unavoidable, we investigate whether certain geometric properties of the input/output space can be used to obtain better results in certain interesting scenarios.

A natural strategy for reducing the dependence of the excess risk bounds on d is to use dimensionality reduction techniques such as the Johnson-Lindenstrauss transform (JLT): the server performing the computation can choose a random projection matrix Φ ∈ R^{m×d}, which is then used for projecting all the x_i's (covariates) onto a lower-dimensional space. (This use of random projections for linear regression is different from their typical use in prior work [57], where they are used not for reducing dimensionality but rather for reducing the number of samples used in the regression computation, and hence for improving running time.) The advantage is that, using the techniques from (2), one can privately minimize the excess risk in the projected subspace, and in doing so the dependence on the dimension in the excess risk is reduced to ≈ √m (from ≈ √d). However, there are two significant challenges in implementing this idea for incremental regression, both of which we overcome using geometric ideas.

The first challenge is that this only solves the problem in the projected subspace, whereas our goal is to produce an estimate of the true empirical risk minimizer. To achieve this we need to "lift" the solution from the projected subspace back to the original space. We do so using recent developments in the problem of high-dimensional estimation with constraints [54]. The
Gaussian width of the constraint space C plays an important role in this analysis. The Gaussian width is a well-studied quantity in convex geometry that captures the geometric complexity of any set of vectors: for a set S ⊆ R^d, it is defined as w(S) = E_{g∼N(0,I_d)}[sup_{a∈S} ⟨a, g⟩]. The rough idea here is that a good estimation (lifting) can be done from few observations (small m) as long as the constraint set C has small Gaussian width. Many popular sets have low Gaussian width; e.g., the width of the L₁-ball B₁^d is Θ(√(log d)), and that of the set of all sparse vectors in R^d with at most k non-zero entries is Θ(√(k log(d/k))).

The second challenge comes from the incremental nature of input generation, because it allows for generation of x_i's after Φ is fixed. (Note that this issue arises independently of the differential privacy requirement, and holds even in a non-private incremental setting.) This is an issue because the guarantees of random projections, such as the JL transform, only hold if the inputs on which they are applied are chosen before choosing the transformation. For example, given a random projection matrix Φ ∈ R^{m×d} with m ≪ d, it is simple to generate x such that the norm of x is substantially different from the norm of Φx. To deal with this kind of adaptive choice of inputs, we again rely on the geometric properties of the problem. In particular, we use Gordon's theorem, which states that one can embed a set of points S on a unit sphere into a (much) lower-dimensional space R^m using a Gaussian random matrix Φ such that sup_{a∈S} |‖Φa‖ − ‖a‖| is small (with high probability), provided m is at least the square of the Gaussian width of S [21]. (Recent results [5] show that other distributions for generating Φ provide similar guarantees.) In a sense, w(S) can be thought of as the "effective dimension" of S, so projecting the data onto an m ≈ w(S)² dimensional space suffices for guaranteeing the above condition.

Using the above geometric ideas, and the Tree Mechanism to incrementally construct the private gradient function (as in (2)), we present our second private algorithm for incremental regression, with an excess risk bound of
Õ(min{T^{1/3}W^{2/3} + T^{1/6}W^{1/3}√OPT + T^{1/4}W^{1/2}·OPT^{1/4}, T}), where W = w(X) + w(C), X ⊂ R^d is the domain from which the x_i's (covariates) are drawn, and OPT is the true minimum empirical risk at time T. As we discuss in Section 5, for many practically interesting regression instances, such as when X is a domain of sparse vectors and C is bounded by an L₁-ball (as in the popular Lasso regression) or is a polytope defined by polynomially (in the dimension) many vertices, W ≈ polylog(d), in which case the risk bound simplifies to
Õ(min{T^{1/3} + T^{1/6}√OPT + T^{1/4}·OPT^{1/4}, T}). In most practical scenarios, when the relationship between the covariates and responses is (close to) linear, one would also expect OPT ≪ T. These bounds show that, for certain instances, it is possible to design differentially private risk minimizers in the incremental setting with excess risk that depends only poly-logarithmically on the dimensionality of the data, a desired feature in a high-dimensional setting.

Organization.
In Section 1.2, we discuss related work. In Section 2, we present some preliminaries. Our generic transformation from a private batch ERM algorithm to a private incremental ERM algorithm is given in Section 3. We present techniques that improve upon these bounds for the problem of private incremental regression in Sections 4 and 5. The appendices contain some proof details and supplementary material: in Appendix B, we analyze the convergence rate of a noisy projected gradient descent technique, and in Appendix C, we present the Tree Mechanism of [16, 7].

1.2 Related Work
Private ERM.
Starting from the works of Chaudhuri et al. [8, 9], private convex ERM problems have been studied in various settings, including the low-dimensional setting [41, 34], high-dimensional sparse regression [34, 48], online learning [25, 49, 27, 37], local privacy [12], and interactive settings [26, 51]. Bassily et al. [2] presented algorithms that, for a general convex loss function j(θ; z) that is 1-Lipschitz (Definition 8) for every z, achieve an expected excess risk of ≈ √d under (ǫ, δ)-differential privacy and ≈ d under ǫ-differential privacy (ignoring the dependence on other parameters for simplicity). We use their batch mechanisms in our generic construction to obtain risk bounds for incremental ERM problems (Theorem 3.1). They also showed that these bounds cannot be improved in general, even for the least-squares regression function. (Better risk bounds are achievable under strong convexity assumptions [2, 47].) However, if the constraint space has low Gaussian width (such as the L₁-ball), Talwar et al. [47, 46] recently showed that, under (ǫ, δ)-differential privacy, the above bound can be improved by exploiting the geometric properties of the constraint space. An analogous result under ǫ-differential privacy for the class of generalized linear functions (which includes linear and logistic regression) was recently obtained by Kasiviswanathan and Jin [29]. Our excess risk bounds based on Gaussian width (presented in Section 5) use a lifting procedure similar to that of Kasiviswanathan and Jin [29]. All of the above algorithms operate in the batch setting; ours is the first work dealing with private ERM problems in an incremental setting.

Private Online Convex Optimization.
Differentially private algorithms have also been designed for a large class of online (convex optimization) learning problems, in both the full-information and bandit settings [25, 49, 27]. Adapting the popular Follow-the-Approximate-Leader framework [23], Smith and Thakurta [49] obtained regret bounds for private online learning with nearly optimal dependence on T (though the dependence on the dimensionality d in these results is much worse than the known lower bounds). As discussed earlier, incremental learning is a variant of batch learning, with goals different from those of online learning.

Private Incremental Algorithms. Dwork et al. [16] introduced the problem of counting under incremental (continual) observations. The goal is to monitor a stream of T bits and continually release a counter of the number of 1's observed so far, under differential privacy. The elegant Tree Mechanism introduced by Dwork et al. [16] and Chan et al. [7] solves this problem, under ǫ-differential privacy, with error roughly log^{1.5} T. The versatility of this mechanism has been utilized in different ways in subsequent works [49, 17, 24]. We use the Tree Mechanism as a basic building block for computing the private gradient function incrementally. Dwork et al. [16] also achieve pan-privacy for their continual release (meaning that the mechanism preserves differential privacy even when an adversary can observe snapshots of the mechanism's internal states), a property that we do not investigate in this paper.
Use of JL Transform for Privacy.
The use of the JL transform for achieving differential privacy with better utility has been well documented for a variety of computational tasks [58, 3, 33, 44, 52]. Blocki et al. [3] showed that if Φ ∈ R^{m×n} is a Gaussian random matrix of appropriate dimension, then ΦX ∈ R^{m×d} is differentially private if the least singular value of the matrix X ∈ R^{n×d} is "sufficiently" large; the required bound on the least singular value was recently improved by Sheffet [44]. Here the privacy comes from the randomization inherent in the transform. However, these results require that the projection matrix be kept private, which is an issue in an incremental setting, where an adversary could learn about Φ over time. Kenthapadi et al. [33] use the Johnson-Lindenstrauss transform to publish a private sketch that enables estimation of the distance between users. Their main idea is to project a d-dimensional user feature vector into a lower m-dimensional space by first applying a random Johnson-Lindenstrauss transform and then adding Gaussian noise to each entry of the resulting vector. None of these results deal with an incremental setting, where applying the JL transform is itself a challenge because of the adaptivity issues.

Traditional Streaming Algorithms.
The literature on streaming algorithms is replete with techniques that can solve linear regression and related problems in various streaming models of computation, under various computational resource constraints; we refer the reader to the survey by Woodruff [57] for more details. However, incremental regression under differential privacy poses a different challenge than that faced by traditional streaming algorithms: the solution (regression parameter) at each timestep depends on all the datapoints observed in the past, and frequent releases about earlier points can lead to privacy loss.
2 Preliminaries

Notation and Data Normalization.
We denote [n] = {1, . . . , n}. Vectors are in column-wise fashion and denoted by boldface letters. For a vector v, v⊤ denotes its transpose, ‖v‖ its Euclidean (L₂-)norm, and ‖v‖₁ its L₁-norm. For a matrix M, ‖M‖ denotes its spectral norm (which equals its largest singular value), and ‖M‖_F its Frobenius norm. We use 0 to denote the d-dimensional vector of all zeros. The d-dimensional unit ball in the L_p-norm centered at the origin is denoted by B_p^d. I_d represents the d × d identity matrix. N(µ, Σ) denotes the Gaussian distribution with mean vector µ and covariance matrix Σ. For a variable n, we use poly(n) to denote a polynomial function of n and polylog(n) to denote poly(log n).

We assume all streams are of a fixed length T, which is known to the algorithm. We make this assumption only to simplify the discussion: in our generic transformation for incremental ERM this assumption can be removed straightforwardly, whereas in our algorithms for private incremental regression it can be removed using a simple trick introduced by Chan et al. [7]. For a stream Γ, we use Γ_t to denote the stream prefix of length t.

Throughout this paper, we use ℓ and L to denote the least-squares loss on a single datapoint and on a collection of datapoints, respectively; namely, ℓ(θ; (x, y)) = (y − ⟨x, θ⟩)² and L(θ; (x_1, y_1), . . . , (x_n, y_n)) = Σ_{i=1}^n ℓ(θ; (x_i, y_i)). In Appendix A, we review a few additional definitions related to convex functions and Gaussian concentration. For a set of vectors, we define its diameter as the maximum attained norm in the set.
Definition 2 (Diameter of a Set). The diameter ‖C‖ of a closed set C ⊆ R^d is defined as ‖C‖ = sup_{θ∈C} ‖θ‖.

(Chan et al. [7] presented a scheme that provides a generic way of converting the Tree Mechanism, which requires prior knowledge of T, into a mechanism that does not. They also showed that this new mechanism, referred to as the Hybrid Mechanism, achieves asymptotically the same error guarantees as the Tree Mechanism. The same ideas work in our case too, and the asymptotic excess risk bounds are not affected.)

For improving the worst-case dependence on the dimension d, we exploit the geometric properties of the input and constraint space. We use the well-studied quantity of Gaussian width, which captures the L₂-geometric complexity of a set S ⊆ R^d.

Definition 3 (Gaussian Width). Given a closed set S ⊆ R^d, its Gaussian width w(S) is defined as:

w(S) = E_{g∼N(0,I_d)}[ sup_{a∈S} ⟨a, g⟩ ].

In particular, w(S) can be thought of as the "effective dimension" of S. Many popular convex sets have low Gaussian width; e.g., the width of both the unit L₁-ball in R^d (B₁^d) and the standard d-dimensional probability simplex is Θ(√(log d)), and the width of any ball B_p^d for 1 ≤ p ≤ ∞ is ≈ d^{1−1/p}. For a set C contained in B₂^d, w(C) is always O(√d). Another prominent set with low Gaussian width is that of sparse vectors: the set of all k-sparse vectors (with at most k non-zero entries) in R^d has Gaussian width Θ(√(k log(d/k))).
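The definition lends itself to a direct Monte-Carlo check; the following short sketch (illustrative, not from the paper) estimates w(S) for two of the examples above:

```python
import numpy as np

# Monte-Carlo estimate of w(S) = E_g sup_{a in S} <a, g> (Definition 3).
# For the L1 ball, sup_{a in B_1^d} <a, g> = ||g||_inf ~ sqrt(2 log d);
# for the L2 ball, sup_{a in B_2^d} <a, g> = ||g||_2 ~ sqrt(d).
d, trials = 1000, 2000
g = np.random.default_rng(1).normal(size=(trials, d))
print(np.abs(g).max(axis=1).mean(), np.sqrt(2 * np.log(d)))  # w(B_1^d) estimate
print(np.linalg.norm(g, axis=1).mean(), np.sqrt(d))          # w(B_2^d) estimate
```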
Differential Privacy on Streams. We will consider differential privacy on data streams [16]. A stream is a sequence of points from some domain set Z. Two streams Γ, Γ′ ∈ Z* of the same length are neighbors if there exist a datapoint z ∈ Γ and a z′ ∈ Z such that changing z in Γ to z′ yields the stream Γ′. The result of an algorithm processing a stream is a sequence of outputs.

Definition 4 (Event-level differential privacy [15, 16]). Algorithm Alg is (ǫ, δ)-differentially private if for all neighboring streams Γ, Γ′ and for all sets R of possible output sequences, we have

Pr[Alg(Γ) ∈ R] ≤ exp(ǫ) · Pr[Alg(Γ′) ∈ R] + δ,

where the probability is taken over the randomness of the algorithm. When δ = 0, the algorithm Alg is ǫ-differentially private.

We provide additional background on differential privacy, along with some techniques for achieving it, in Appendix A.2.
3 A Generic Transformation

We present a generic transformation for converting any private batch ERM algorithm into a private incremental ERM algorithm; we take this construction as a baseline for comparison for our private incremental regression algorithms. Mechanism 1 (PrivIncERM) describes this simple transformation. At every timestep, Mechanism PrivIncERM outputs θ^priv_t, a differentially private approximation of θ̂_t ∈ argmin_{θ∈C} J(θ; z_1, . . . , z_t), where

J(θ; z_1, . . . , z_t) = Σ_{i=1}^t j(θ; z_i).

The idea is to perform "relevant" computations only every τ timesteps, thereby ensuring that no z_i is used in more than T/τ invocations of the private batch ERM algorithm (for simplicity, assume that T is a multiple of τ). This idea is reminiscent of mini-batch processing ideas commonly used in big data processing [6]. In Theorem 3.1, the parameter τ is set to balance the increase in excess risk due to lack of updates on the estimator against the increase in excess risk due to the change in the privacy parameter ǫ (which arises from multiple interactions with the data).

Mechanism PrivIncERM invokes a differentially private (batch) ERM algorithm at timesteps t that are multiples of τ, and at all other timesteps it simply outputs the result from the previous timestep. In Step 5 of Mechanism PrivIncERM any differentially private batch ERM algorithm can be used, and this step dominates the time complexity of the mechanism. Here we present excess risk bounds obtained by invoking Mechanism PrivIncERM with the differentially private ERM algorithms of Bassily et al. [2] and Talwar et al. [46]. As mentioned earlier, the bounds of Bassily et al. [2] are tight in the worst case. But as shown by Talwar et al. [46], if the constraint space C has a small Gaussian width, then these bounds can be improved. The bounds of Talwar et al. depend on the curvature constant C_j of j, defined as:

C_j = sup_{z∈Z} sup_{θ_a, θ_b ∈ C, l ∈ (0,1], θ_c = θ_a + l(θ_b − θ_a)} (2/l²) · ( j(θ_c; z) − j(θ_a; z) − ⟨θ_c − θ_a, ∇j(θ_a; z)⟩ ).

(In the practice of differential privacy, we generally think of ǫ as a small non-negligible constant, and of δ as a parameter that is cryptographically small.)
Mechanism 1: PrivIncERM(ǫ, δ)
Input: A stream Γ = z_1, . . . , z_T, where each z_t is from the domain Z ⊆ R^d, and τ ∈ N
Output: A differentially private estimate of θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t j(θ; z_i) at every timestep t ∈ [T]
1: Set ǫ′ ← ǫ/(2√(2(T/τ) ln(2/δ))) and δ′ ← δτ/(2T)
2: θ^priv_0 ← 0
3: for all t ∈ [T] do
4:   if t is a multiple of τ then
5:     θ^priv_t ← output of an (ǫ′, δ′)-differentially private algorithm minimizing J(θ; z_1, . . . , z_t)
6:   else
7:     θ^priv_t ← θ^priv_{t−1}
8:   end
9:   Return θ^priv_t
10: end
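For illustration, here is a hedged toy implementation of Mechanism 1 for least squares over an L₂-ball; Step 5 is instantiated with simple output perturbation as a stand-in for any (ǫ′, δ′)-private batch ERM solver, and the noise calibration below is illustrative rather than the paper's:

```python
import numpy as np

def priv_inc_erm(stream, tau, eps, delta, radius, rng):
    """Toy version of Mechanism 1 (PrivIncERM) for least squares over the
    L2 ball of the given radius. The batch step is a placeholder for any
    (eps', delta')-DP batch ERM algorithm; its noise scale is illustrative.
    """
    T = len(stream)
    eps_p = eps / (2 * np.sqrt(2 * (T / tau) * np.log(2 / delta)))
    delta_p = delta * tau / (2 * T)
    d = len(stream[0][0])
    theta, outputs = np.zeros(d), []
    for t in range(1, T + 1):
        if t % tau == 0:  # Step 5: private batch ERM on the prefix z_1..z_t
            X = np.array([x for x, _ in stream[:t]])
            y = np.array([yi for _, yi in stream[:t]])
            theta = np.linalg.lstsq(X, y, rcond=None)[0]
            sigma = np.sqrt(2 * np.log(1.25 / delta_p)) / (eps_p * t)
            theta = theta + sigma * rng.normal(size=d)
            theta *= min(1.0, radius / np.linalg.norm(theta))  # project onto C
        outputs.append(theta.copy())  # off-cycle steps reuse the last estimate
    return outputs
```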
For linear regression, where j(θ; z) = ℓ(θ; (x, y)) = (y − ⟨x, θ⟩)², with ‖x‖ ≤ 1 and |y| ≤ 1, it follows that C_ℓ ≤ 8‖C‖² [10].

We now show that Mechanism PrivIncERM is event-level differentially private (Definition 4), and analyze its utility under various invocations of Step 5.

Theorem 3.1.
Mechanism PrivIncERM is (ǫ, δ)-differentially private with respect to a single datapoint change in the stream Γ. Also:

1. (Using Theorem 2.4 of Bassily et al. [2].) If the function j(θ; z) : C × Z → R is a positive-valued function that is convex with respect to θ over the domain C ⊆ R^d, then for any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ^priv_t generated by Mechanism PrivIncERM with τ = ⌈(Td)^{1/3}/ǫ^{2/3}⌉ satisfies:

J(θ^priv_t; Γ_t) − min_{θ∈C} J(θ; Γ_t) = O( min{ (Td)^{1/3} L‖C‖ log^{2/3}(1/δ) polylog(T/β) / ǫ^{2/3}, TL‖C‖ } ),

where L is the Lipschitz constant of the function j.

2. (Using Theorem 2.4 of Bassily et al. [2].) If the function j(θ; z) : C × Z → R is a positive-valued function that is ν-strongly convex with respect to θ over the domain C ⊆ R^d, then for any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ^priv_t generated by Mechanism PrivIncERM with τ = ⌈√(dL)/(ν^{1/2} ǫ ‖C‖^{1/2})⌉ satisfies:

J(θ^priv_t; Γ_t) − min_{θ∈C} J(θ; Γ_t) = O( min{ √d L^{3/2} ‖C‖^{1/2} log(1/δ) polylog(T/β) / (ν^{1/2} ǫ), TL‖C‖ } ),

where L is the Lipschitz constant of the function j.

3. (Using Theorem 2.6 of Talwar et al. [46].) If the function j(θ; z) : C × Z → R is a positive-valued function that is convex with respect to θ over the domain C ⊆ R^d, then for any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ^priv_t generated by Mechanism PrivIncERM with τ = ⌈√(T w(C)) C_j^{1/4}/((L‖C‖)^{3/4} ǫ^{1/2})⌉ satisfies:

J(θ^priv_t; Γ_t) − min_{θ∈C} J(θ; Γ_t) = O( min{ √(T w(C)) C_j^{1/4} (L‖C‖)^{1/4} log^{1/2}(1/δ) polylog(T/β) / ǫ^{1/2}, TL‖C‖ } ),

where L is the Lipschitz constant and C_j is the curvature constant of the function j.

Proof. Privacy Analysis.
Each z_i is accessed at most T/τ times by the algorithm invoked in Step 5. Let l = T/τ. By using the composition theorem (Theorem A.4) with δ* = δ/2, it follows that the entire algorithm is (ǫ′√(2l ln(2/δ)) + 2lǫ′², l·(δ/(2l)) + δ/2)-differentially private. We set ǫ′ = ǫ/(2√(2l ln(2/δ))). With this setting of ǫ′ we have 2lǫ′² ≤ ǫ/2, and therefore ǫ′√(2l ln(2/δ)) + 2lǫ′² ≤ ǫ. Hence, Mechanism PrivIncERM is (ǫ, δ)-differentially private.

Utility Analysis. If T < τ, the algorithm never accesses the data, and in that case the excess risk can be bounded by
2TL‖C‖. Now assume T ≥ τ, and note that the algorithm performs no computation when t is not a multiple of τ. Let Γ_t denote the prefix of the stream Γ up to time t, and let t lie in the interval [jτ, (j + 1)τ) for some j ∈ N, so that θ^priv_t = θ^priv_{jτ}. Let θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t j(θ; z_i). The excess risk at time t can be split as:

Σ_{i=1}^t j(θ^priv_t; z_i) − Σ_{i=1}^t j(θ̂_t; z_i) = [ Σ_{i=1}^{jτ} j(θ^priv_{jτ}; z_i) − Σ_{i=1}^{jτ} j(θ̂_t; z_i) ] + Σ_{i=jτ+1}^t ( j(θ^priv_{jτ}; z_i) − j(θ̂_t; z_i) ).

As j is positive-valued, Σ_{i=1}^{jτ} j(θ̂_t; z_i) ≥ Σ_{i=1}^{jτ} j(θ̂_{jτ}; z_i); and as each j(·; z_i) is L-Lipschitz over C, each of the at most τ terms in the second sum is at most L‖θ^priv_{jτ} − θ̂_t‖ ≤ 2L‖C‖. Hence, we get:

Σ_{i=1}^t j(θ^priv_t; z_i) − Σ_{i=1}^t j(θ̂_t; z_i) ≤ Σ_{i=1}^{jτ} j(θ^priv_{jτ}; z_i) − Σ_{i=1}^{jτ} j(θ̂_{jτ}; z_i) + 2τL‖C‖.

Using the results of Bassily et al. [2] or Talwar et al. [46] to bound Σ_{i=1}^{jτ} j(θ^priv_{jτ}; z_i) − Σ_{i=1}^{jτ} j(θ̂_{jτ}; z_i), setting τ to balance the opposing terms, and a final union bound provide the claimed bounds.
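The balancing act in the proof can also be seen numerically; the following sketch (constants and polylog factors dropped, numbers illustrative) minimizes the sum of the staleness term (≈ τ) and the privacy-noise term (≈ √(dT/τ)/ǫ) over τ:

```python
import math

# The minimizer scales as (T d)^{1/3}/eps^{2/3}, matching Theorem 3.1(1).
T, d, eps = 10_000, 100, 1.0
tau_best = min(range(1, T + 1),
               key=lambda tau: tau + math.sqrt(d * T / tau) / eps)
print(tau_best, (T * d) ** (1 / 3) / eps ** (2 / 3))  # same order of magnitude
```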
4 Private Incremental Linear Regression Using the Tree Mechanism

We now focus on the problem of private incremental linear regression. Our first approach for this problem is based on a private incremental computation of the gradient. The algorithm is particularly effective in the regime of large T and small d. A central idea of our approach is the construction of a private gradient function, defined as follows.

Definition 5. Let
C ⊆ R^d. Algorithm Alg computes an (α, β)-accurate gradient of the loss function J(θ; z_1, . . . , z_t) with respect to θ ∈ C if, given z_1, . . . , z_t ∈ Z, it outputs a function g_t : C → R^d such that:

(i) Privacy: Alg is (ǫ, δ)-differentially private (as in Definition 4), i.e., for all neighboring streams Γ, Γ′ ∈ Z* and all sets R of functions from C to R^d,

Pr[Alg(Γ) ∈ R] ≤ exp(ǫ) · Pr[Alg(Γ′) ∈ R] + δ.

(ii) Utility: The function g_t is an (α, β)-approximation to the true gradient, in that

Pr_Alg[ max_{z_1,...,z_t ∈ Z, θ ∈ C} ‖g_t(θ) − ∇J(θ; z_1, . . . , z_t)‖ ≥ α ] ≤ β.

Note that the output of Alg in the above definition is a function g_t. The first requirement on Alg specifies that it satisfies the differential privacy condition (Definition 4). The second requirement on
Alg specifies that, for any θ ∈ C, g_t gives a "sufficiently" accurate estimate of the true gradient ∇J(θ; z_1, . . . , z_t).

Let Γ = (x_1, y_1), . . . , (x_T, y_T) represent the stream of covariate-response pairs; we use Γ_t to denote its prefix (x_1, y_1), . . . , (x_t, y_t). Consider the gradient of the loss function L(θ; Γ_t), where X_t ∈ R^{t×d} is the matrix with rows x_1⊤, . . . , x_t⊤ and y_t = (y_1, . . . , y_t):

∇L(θ; Γ_t) = 2(X_t⊤X_tθ − X_t⊤y_t) = 2( Σ_{i=1}^t x_i x_i⊤ θ − Σ_{i=1}^t x_i y_i ).   (2)

A simple observation from the gradient form (2) is that if we can maintain the streaming sum of x_1y_1, . . . , x_Ty_T and the streaming sum of x_1x_1⊤, . . . , x_Tx_T⊤, then we can maintain the necessary ingredients for computing ∇L(θ; Γ_t) for any t ∈ [T]. We use this observation to construct a private gradient function g_t : C → R^d at every timestep t. The idea is to privately maintain Σ_{i=1}^t x_i x_i⊤ and Σ_{i=1}^t x_i y_i over the stream using the Tree Mechanism of [16, 7]. We present the entire construction of the Tree Mechanism in Appendix C. The rough idea behind this mechanism is to build a binary tree where the leaves are the actual inputs from the stream, and the internal nodes store the partial sums of all the leaves in their subtrees. For a stream of vectors υ_1, . . . , υ_T in the unit ball, the Tree Mechanism allows private estimation of Σ_{i=1}^t υ_i, for every t ∈ [T], with error roughly √d log T (under (ǫ, δ)-differential privacy, ignoring other parameters). Somewhat similar to our approach, Smith and Thakurta [49] also use the Tree Mechanism to maintain the sum of gradients in their private online learning algorithm; however, unlike our approach, they do not construct a private gradient function.
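The following is a compact, self-contained sketch of the Tree Mechanism interface used below (the class name, Gaussian noise calibration, and bookkeeping are illustrative; the precise construction and privacy accounting appear in Appendix C). Algorithm 2 below invokes it once on the stream of vectors x_iy_i and once on the flattened outer products x_ix_i⊤:

```python
import numpy as np

class TreeMech:
    """Sketch of the Tree Mechanism [16, 7] for private prefix sums of vectors.

    Stream element t (L2 norm <= 1) sits at a leaf of a binary tree over [1, T];
    every node holds a noisy partial sum of the leaves below it, and a prefix
    sum is assembled from <= log2(T) completed dyadic nodes. sigma is an
    illustrative noise scale; the paper calibrates it to (eps, delta). A
    space-optimal version would also discard completed children.
    """
    def __init__(self, T, dim, sigma, rng):
        self.T, self.dim, self.sigma, self.rng = T, dim, sigma, rng
        self.levels = int(np.ceil(np.log2(T))) + 1
        self.partial = {}   # (level, index) -> running exact sum of the node
        self.noisy = {}     # (level, index) -> frozen noisy value of the node

    def update(self, t, v):  # t is 1-indexed; v is the t-th stream element
        for lev in range(self.levels):
            idx = (t - 1) >> lev
            self.partial[(lev, idx)] = self.partial.get((lev, idx), 0) + v
            if t % (1 << lev) == 0:  # node complete: freeze a noisy copy once
                self.noisy[(lev, idx)] = (self.partial[(lev, idx)]
                    + self.sigma * self.rng.normal(size=self.dim))

    def prefix_sum(self, t):  # private estimate of v_1 + ... + v_t
        s, pos = np.zeros(self.dim), 0
        for lev in reversed(range(self.levels)):  # dyadic decomposition of t
            if (t - pos) >= (1 << lev):
                s += self.noisy[(lev, pos >> lev)]
                pos += 1 << lev
        return s
```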
Algorithm 2: PrivIncReg1(ǫ, δ)
Input: A stream Γ = (x_1, y_1), . . . , (x_T, y_T), where each (x_t, y_t) in Γ is from the domain X × Y, where X ⊂ R^d with ‖X‖ ≤ 1 and Y ⊂ R with ‖Y‖ ≤ 1
Output: A differentially private estimate of θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)² at every timestep t ∈ [T]
1: Set ǫ′ ← ǫ/2, δ′ ← δ/2, κ ← log^{3/2}(T)·√(log(2/δ′))/ǫ′, α′ ← O(κ‖C‖√d), and r ← Θ((T‖C‖²/α′)²)
2: for all t ∈ [T] do
3:   q_t ← output of TreeMech(ǫ′, δ′) at time t when invoked on the stream x_1y_1, . . . , x_Ty_T
4:   Q_t ← output of TreeMech(ǫ′, δ′) at time t when invoked on the stream x_1x_1⊤, . . . , x_Tx_T⊤, which can be viewed as d²-dimensional vectors (the outputs are converted back into d × d matrices)
5:   Define a private gradient function g_t : C → R^d as: g_t(θ) = 2(Q_tθ − q_t)
6:   θ^priv_t ← NoisyProjGrad(C, g_t, r) (described in Appendix B)
7:   Return θ^priv_t
8: end

Once the private gradient function g_t is released, it can be used to obtain an approximation to the estimator θ̂_t using any traditional gradient-based optimization technique. In this paper, we use a variant of the classical projected gradient descent approach, described in Algorithm NoisyProjGrad (Appendix B).
Algorithm NoisyProjGrad is an iterative algorithm that takes in the constraint set C, the private gradient function g_t, and a parameter r denoting the number of iterations, and returns θ^priv_t, a private estimate of the regression parameter. An important point to note is that evaluating the private gradient function at different θ's, as needed by any gradient descent technique, does not affect the privacy parameters ǫ and δ. We analyze the convergence of Algorithm NoisyProjGrad in Appendix B; this convergence result (Corollary B.2) is used to set the parameter r in Algorithm PrivIncReg1.

Algorithm PrivIncReg1 is (ǫ, δ)-differentially private with respect to a single datapoint change in the stream Γ. This follows as the L₂-sensitivity of the stream in both the invocations (Steps 3 and 4) of Algorithm TreeMech is less than 2 (because of the normalization on the x_i's and y_i's); the standard composition properties of differential privacy (Theorem A.3) then give that Algorithm PrivIncReg1 is (ǫ, δ)-differentially private. The algorithm only requires O(d² log T) space (see Appendix C), therefore having only a logarithmic dependence on the length of the stream. The running time of the algorithm is dominated by Steps 4 and 6; at every timestep t, Step 4 has time complexity O(d² log T), whereas the time complexity of Step 6 is r times the time complexity of projecting a datapoint onto C (the P_C operation defined in Appendix B).
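A minimal sketch of the NoisyProjGrad loop (the step size, the iterate averaging, and the interface are illustrative choices; the exact variant and its analysis are in Appendix B):

```python
import numpy as np

def noisy_proj_grad(project, g, r, theta0, eta):
    """Projected gradient descent driven by an already-private gradient
    function g, so the r gradient evaluations below consume no additional
    privacy budget."""
    theta, iterates = theta0.copy(), []
    for _ in range(r):
        theta = project(theta - eta * g(theta))
        iterates.append(theta)
    return np.mean(iterates, axis=0)  # averaged iterate

# Usage with Algorithm 2's private gradient (Q_t, q_t from the Tree Mechanism):
#   g = lambda theta: 2 * (Q_t @ theta - q_t)
#   project = lambda v: v / max(1.0, np.linalg.norm(v))   # C = unit L2 ball
```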
We now analyze the utility of Algorithm PrivIncReg1, using the error bound on the Tree Mechanism from Proposition C.1.

Lemma 4.1.
For any β > 0, θ ∈ C, and t ∈ [T], with probability at least 1 − β, the function g_t defined in Algorithm PrivIncReg1 satisfies:

‖g_t(θ) − ∇L(θ; Γ_t)‖ = O( κ‖C‖(√d + √(log(1/β))) ).

Proof.
Applying Proposition C.1, we know that with probability at least 1 − β,

‖q_t − Σ_{i=1}^t x_i y_i‖ = O( κ(√d + √(log(1/β))) ) and ‖Q_t − Σ_{i=1}^t x_i x_i⊤‖ = O( κ(√d + √(log(1/β))) ).

Therefore, with probability at least 1 − β,

‖g_t(θ) − ∇L(θ; Γ_t)‖ = 2‖ (Q_t − Σ_{i=1}^t x_i x_i⊤)θ − (q_t − Σ_{i=1}^t x_i y_i) ‖ = O( κ‖C‖(√d + √(log(1/β))) ),

where we used the fact that ‖θ‖ ≤ ‖C‖ for any θ ∈ C.

Theorem 4.2.
Algorithm PrivIncReg1 is (ǫ, δ)-differentially private with respect to a single datapoint change in the stream Γ. For any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ^priv_t generated by Algorithm PrivIncReg1 satisfies:

L(θ^priv_t; Γ_t) − min_{θ∈C} L(θ; Γ_t) = O( log^{3/2}(T) √(log(1/δ)) ‖C‖² (√d + √(log(T/β))) / ǫ ).

Proof.
The (ǫ, δ)-differential privacy follows from the global sensitivity bound discussed above. Fix any t ∈ [T] and let θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)². Combining Lemma 4.1 and Corollary B.2, with probability at least 1 − rβ′,

L(θ^priv_t; Γ_t) − L(θ̂_t; Γ_t) = O( κ‖C‖²(√d + √(log(1/β′))) ).

Replacing β′ by β/(rT), substituting for κ, and taking a union bound over all t ∈ [T] gives the claimed result.

Remark 4.3.
For linear regression instances, which typically do not satisfy the strong convexity property, the ≈ min{(Td)^{1/3}, T} risk bound obtained from Mechanism PrivIncERM is substantially worse than the ≈ min{√d, T} risk bound provided by Algorithm PrivIncReg1. The dependence on the dimension d is tight in the worst case, due to the ≈ √d excess risk lower bounds established by Bassily et al. [2].

Remark 4.4.
The techniques developed in this section (based on the Tree Mechanism) can be applied to any convex ERM problem whose gradient has a linear form, in which case we obtain an excess risk bound as in Theorem 4.2. It is an interesting open question to obtain similar bounds for general convex ERM problems in the incremental setting.
5 Private Incremental Linear Regression: Going Beyond the Worst Case

The noise added in Theorem 4.2 for privacy grows roughly as √d. While this seems unavoidable in the worst case, we ask whether it is possible to go beyond this worst-case bound under some realistic assumptions. An intuitive idea for overcoming this curse of dimensionality is to project (compress) the data to a lower dimension before the addition of noise. In this section, we use this and other geometric ideas to cope with the high dependence on dimensionality in Theorem 4.2. The resulting bound depends on the Gaussian widths of the input and constraint space, which for many interesting problem instances are much smaller than √d. We mention some specific instantiations and extensions of our result in Section 5.2, including a scenario where not all inputs are drawn from a domain with a "small" Gaussian width.

Our general approach in this section is based on a simple principle: reduce the dimensionality of the problem, solve it privately in the lower-dimensional space, and then "lift" the solution back to the original space. Our lifting procedure is similar to that used by Kasiviswanathan and Jin [29] in their recent work on bounding the excess risk of private ERM in the batch setting.

Fix t and consider the following projected least-squares problem:

L_proj(θ; Γ_t; Φ) = Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨Φx_i, Φθ⟩ )²,   (3)

where Φ ∈ R^{m×d} is a random projection matrix (to be defined later). The loss function L_proj is also referred to as compressed least-squares in the literature [36, 20, 28]. (The scaling factor ‖x_i‖/‖Φx_i‖ is for simplicity of analysis only; one could omit it and still obtain the same results using a slightly different analysis. Also, without loss of generality, we assume x_i ≠ 0 for all i.)
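A small numerical sketch (parameters illustrative, not from the paper) of the compressed objective (3): project the covariates with a Gaussian Φ and compare L_proj against L; the gap shrinks as m grows relative to the geometry of the inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, t = 2000, 100, 500
X = rng.normal(size=(t, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
theta = rng.normal(size=d); theta /= np.linalg.norm(theta)
y = X @ theta + 0.05 * rng.normal(size=t)

Phi = rng.normal(size=(m, d)) / np.sqrt(m)   # entries i.i.d. N(0, 1/m)
PX = X @ Phi.T                               # projected covariates Phi x_i
scale = 1.0 / np.linalg.norm(PX, axis=1)     # ||x_i||/||Phi x_i|| (here ||x_i|| = 1)
L = np.sum((y - X @ theta) ** 2)             # original least-squares loss
L_proj = np.sum((y - scale * (PX @ (Phi @ theta))) ** 2)  # objective (3)
print(L, L_proj)                             # converge as m increases
```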
This result has found severalinteresting applications in high-dimensional convex geometry, statistics, and compressed sensing [39]. Theorem 5.1 (Gordon [21]) . Let ˜Φ be an m × d random matrix, whose rows φ ⊤ , . . . , φ ⊤ m are i.i.d. Gaussian randomvectors in R d chosen according to the standard normal distribution N ( , I d ) . Let Φ = ˜Φ / √ m . Let S be a set ofpoints in R d . There is a constant C > such that for any < γ, β < , Pr (cid:20) sup a ∈ S (cid:12)(cid:12) k Φ a k − k a k (cid:12)(cid:12) ≥ γ k a k (cid:21) ≤ β, provided that m ≥ Cγ max n w ( S ) , log (cid:16) β o(cid:17) . Note that the w ( S ) is defined for all sets, not just convex sets, a fact that we use below as the input domain X may not be convex. As a simple corollary to the above theorem it also follows that, Corollary 5.2.
Under the setting of Theorem 5.1, there exists a constant C′ > 0 such that for any 0 < γ, β < 1,

Pr[ ∃ a, b ∈ S: |⟨Φa, Φb⟩ − ⟨a, b⟩| ≥ γ‖a‖‖b‖ ] ≤ β,

provided that m ≥ (C′/γ²)·max{ w(S)², log(1/β) }.

Applying the above corollary to the set of vectors in
X ∪ C, and noting that w(X ∪ C) ≤ w(X) + w(C), gives that if m = Θ((1/γ²)·max{ (w(X) + w(C))², log(1/β) }), then

Pr[ sup_{x∈X, θ∈C} |⟨Φx, Φθ⟩ − ⟨x, θ⟩| ≥ γ‖x‖‖θ‖ ] ≤ β.    (5)
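Since m in (5) is driven by (w(X) + w(C))² rather than the ambient dimension, it is worth seeing how different Gaussian widths can be. A small Monte Carlo sketch using the definition w(S) = E sup_{a∈S} ⟨g, a⟩, with arbitrary dimensions (both suprema below have closed forms):

    import numpy as np

    rng = np.random.default_rng(1)
    d, trials = 1000, 500
    G = rng.standard_normal((trials, d))

    w_B2 = np.linalg.norm(G, axis=1).mean()   # sup over the unit L2-ball is ||g||_2
    w_B1 = np.abs(G).max(axis=1).mean()       # sup over the unit L1-ball is ||g||_inf
    print(w_B2, np.sqrt(d))                   # w(B_2^d) ~ sqrt(d)
    print(w_B1, np.sqrt(2 * np.log(d)))       # w(B_1^d) ~ sqrt(2 log d), much smaller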
We now present a mechanism (Algorithm 3) for private incremental linear regression based on minimizing the projected least-squares objective (3) under differential privacy. The idea is again to construct a private gradient function g_t, but now of the function L_proj (instead of L, as done for the mechanism of Section 4).

Let X_t be a matrix with rows x_1^⊤, ..., x_t^⊤, and let X̃_t ∈ R^{t×d} be a matrix with rows x̃_1^⊤, ..., x̃_t^⊤. As before, let y^t be the vector (y_1, ..., y_t). Under this notation, L_proj(θ; Γ^t; Φ) from (3) can be re-expressed as:

L_proj(θ; Γ^t; Φ) = ‖y^t − X̃_t Φ^⊤ Φθ‖².

The gradient of L_proj with respect to Φθ equals:

∇_{Φθ} L_proj(θ; Γ^t; Φ) = ∂‖y^t − (X̃_t Φ^⊤)(Φθ)‖² / ∂(Φθ) = 2((X̃_t Φ^⊤)^⊤ (X̃_t Φ^⊤))(Φθ) − 2(X̃_t Φ^⊤)^⊤ y^t.
(One could also use other, better, constructions of Φ, such as those that create a sparse Φ matrix, using recent results by Bourgain et al. [5] extending Theorem 5.1 to other distributions.)

Algorithm 3: PrivIncReg(ǫ, δ)
Input: A stream
Γ = (x_1, y_1), ..., (x_T, y_T), where each (x_t, y_t) in Γ is from the domain X × Y, where
X ⊂ R^d with ‖X‖ ≤ 1 and Y ⊂ R with ‖Y‖ ≤ 1
Output: θ_t^priv, a differentially private estimate of θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)² at every timestep t ∈ [T]
1: Set ǫ′ ← ǫ/2, δ′ ← δ/2, κ ← log^{3/2}(T)·√(log(1/δ′))/ǫ′, α′ ← O(κ‖C‖√m), r ← Θ((T‖C‖/α′)²), γ ← (w(X) + w(C))^{1/3}/T^{1/3}, and m ← Θ((1/γ²)·max{ (w(X) + w(C))², log(T/β) })
2: Let Φ ← m × d random matrix with entries drawn i.i.d. from N(0, 1/m)
3: for all t ∈ [T] do
4:   Let x̃_t ← (‖x_t‖/‖Φx_t‖)·x_t
5:   q_t ← output of TreeMech(ǫ′, δ′, 2) at time t when invoked on the stream (Φx̃_1)y_1, ..., (Φx̃_T)y_T
6:   Q_t ← output of TreeMech(ǫ′, δ′, 2) at time t when invoked on the stream (Φx̃_1)(Φx̃_1)^⊤, ..., (Φx̃_T)(Φx̃_T)^⊤, which can be viewed as m²-dimensional vectors (the outputs are converted back to m × m matrices)
7:   Define a private gradient function g_t : ΦC → R^m as: g_t(ϑ) = 2(Q_t ϑ − q_t)
8:   ϑ_t^priv ← NoisyProjGrad(ΦC, g_t, r) (described in Appendix B)
9:   θ_t^priv ← argmin_{θ∈R^d} ‖θ‖_C subject to Φθ = ϑ_t^priv (can be solved using any convex optimization technique)
10:  Return θ_t^priv
11: end

Note that ∇_{Φθ} L_proj ∈ R^m. Let ΦC = { Φθ : θ ∈ C }. Note that for a convex C, ΦC ⊂ R^m is also convex. In Algorithm 3, ϑ_t^priv is a private estimate of ϑ̂_t, where

ϑ̂_t ∈ argmin_{ϑ∈ΦC} Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨Φx_i, ϑ⟩ )².
Algorithm 3 only requires O(m² log T + log d) space, and is therefore slightly more memory efficient than the mechanism of Section 4 (as m ≤ d). The time complexity can be analyzed as for that mechanism.

Algorithm 3 is (ǫ, δ)-differentially private with respect to a single datapoint change in the stream Γ. The L₂-sensitivity for both invocations of Algorithm TreeMech is 2. In Step 6, this holds because

max_{x_a, x_b ∈ X} ‖(Φx̃_a)(Φx̃_a)^⊤ − (Φx̃_b)(Φx̃_b)^⊤‖_F ≤ ‖(Φx̃_a)(Φx̃_a)^⊤‖_F + ‖(Φx̃_b)(Φx̃_b)^⊤‖_F = ‖Φx̃_a‖² + ‖Φx̃_b‖² = ‖x_a‖² + ‖x_b‖² ≤ 2,

where the second-to-last equality follows because, for every x ∈ X, ‖Φx̃‖ = ‖x‖ (by construction). Since the Φx_i's live in the projected subspace (R^m), the noise needed for differential privacy (in Steps 5 and 6) of Algorithm 3 roughly scales as √m.

In Step 9 of Algorithm 3, we lift ϑ_t^priv into the original d-dimensional constraint space C. Since ϑ_t^priv ∈ ΦC, we know that there exists a θ_t^true ∈ C such that Φθ_t^true = ϑ_t^priv. The goal is then to estimate θ_t^true from Φθ_t^true. Again, the geometry of C (its Gaussian width) plays an important role, as it controls the diameter of high-dimensional random sections of C (referred to as the M* bound [35, 54]). We refer the reader to the excellent tutorial by Vershynin [54] for more details.

We define the Minkowski functional, as commonly used in geometric functional analysis and convex analysis.

Definition 6 (Minkowski functional). For any vector θ ∈ R^d, the Minkowski functional of C is the non-negative number ‖θ‖_C defined by the rule: ‖θ‖_C = inf{ ρ ∈ R : θ ∈ ρC }.

For the typical situation in ERM problems, where C is a symmetric convex body, ‖·‖_C defines a norm. The optimization problem solved in Step 9 of Algorithm 3 is convex if C is convex, and hence can be efficiently solved. The existence of θ_t^priv follows from the following theorem.

Theorem 5.3 ([54]). Let Φ be an m × d matrix whose rows φ_1^⊤, ..., φ_m^⊤ are i.i.d. Gaussian random vectors in R^d chosen according to the standard normal distribution N(0, I_d). Let C be a convex set. Given v = Φu and Φ, let û be the solution to the following convex program: min_{u′∈R^d} ‖u′‖_C subject to Φu′ = v. Then for any β > 0, with probability at least 1 − β,

sup_{u : v = Φu} ‖u − û‖ = O( w(C)/√m + ‖C‖·√(log(1/β))/√m ).

The next thing to verify is that the θ_t^priv generated by Algorithm 3 is in C. This is simple: by the definition of the Minkowski functional, any closed convex set C (containing the origin) satisfies C = { θ ∈ R^d : ‖θ‖_C ≤ 1 }. Hence, ‖θ_t^true‖_C ≤ 1. The choice of θ_t^priv in Step 9 ensures that ‖θ_t^priv‖_C ≤ ‖θ_t^true‖_C ≤ 1, which guarantees that θ_t^priv ∈ C. Finally, note that the lifting is a post-processing operation on a differentially private output ϑ_t^priv, and hence does not affect the differential privacy guarantee.
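To make Step 9 concrete: when C is the unit L₁-ball, ‖·‖_C is the L₁-norm, and the lifting program becomes basis pursuit, which can be solved as a linear program. Below is a minimal SciPy sketch of this special case; the sizes, the synthetic sparse θ, and the use of a noiseless v = Φθ are illustrative assumptions, not the algorithm's actual private iterate.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    d, m, k = 200, 60, 5
    Phi = rng.standard_normal((m, d)) / np.sqrt(m)

    theta_true = np.zeros(d)                     # a sparse point of C = B_1^d
    support = rng.choice(d, size=k, replace=False)
    theta_true[support] = rng.standard_normal(k) / np.sqrt(k)
    theta_true /= max(1.0, np.abs(theta_true).sum())   # ensure theta_true is in B_1^d
    v = Phi @ theta_true                         # stand-in for the private point

    # min ||theta||_1 s.t. Phi theta = v, as an LP over (u, w), theta = u - w, u, w >= 0.
    res = linprog(c=np.ones(2 * d), A_eq=np.hstack([Phi, -Phi]), b_eq=v,
                  bounds=[(0, None)] * (2 * d))
    theta_hat = res.x[:d] - res.x[d:]
    print(np.linalg.norm(theta_hat - theta_true))      # small once m >~ w(C)^2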
Utility Analysis of Algorithm 3. In Lemma 5.4, using the fact that the Gaussian noise for privacy (in the Tree Mechanism) is added on a lower dimensional (m) instance, we show that the difference between L_proj(θ_t^priv; Γ^t; Φ) and L_proj(θ̂_t; Γ^t; Φ) (the minimum empirical risk) roughly scales as √m, for sufficiently large m. The Lipschitz constant of the function L_proj(θ; Γ^t; Φ) is O(‖ΦC‖), which by Theorem 5.1 is O(‖C‖) with probability at least 1 − β, when m = Θ((1/γ²)·max{ (w(X) + w(C))², log(T/β) }). Let E be the event that the above Lipschitz bound holds.
Lemma 5.4. For any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ_t^priv generated by Algorithm 3 satisfies:

L_proj(θ_t^priv; Γ^t; Φ) − L_proj(θ̂_t; Γ^t; Φ) = O( √m · log^{3/2}T · √(log(1/δ)) · ‖C‖² / ǫ ),

where θ̂_t ∈ argmin_{θ∈C} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)².

Proof. Let us condition on the event E. By definition,

min_{ϑ∈ΦC} Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨Φx_i, ϑ⟩ )² ≡ min_{θ∈C} Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨Φx_i, Φθ⟩ )².

Since the inputs are m-dimensional and ϑ_t^priv = Φθ_t^priv, an analysis similar to that of Theorem 4.2 gives that, with probability at least 1 − β, for each t ∈ [T],

Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨Φx_i, Φθ_t^priv⟩ )² − min_{θ∈C} Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨Φx_i, Φθ⟩ )² = O( log^{3/2}T · √(log(1/δ)) · ‖C‖² · (√m + √(log(T/β))) / ǫ ).

In other words, with probability at least 1 − β, for each t ∈ [T],

L_proj(θ_t^priv; Γ^t; Φ) − min_{θ∈C} L_proj(θ; Γ^t; Φ) = O( log^{3/2}T · √(log(1/δ)) · ‖C‖² · (√m + √(log(T/β))) / ǫ ).

Noting that min_{θ∈C} L_proj(θ; Γ^t; Φ) ≤ L_proj(θ̂_t; Γ^t; Φ), and removing the conditioning on E (by adjusting β), completes the proof.

Using properties of random projections, we now bound L_proj(θ̂_t; Γ^t; Φ) in terms of L(θ̂_t; Γ^t).
Lemma 5.5. Let Φ be a random matrix as defined in Theorem 5.1 with m = Θ((1/γ²)·max{ (w(X) + w(C))², log(T/β) }), and let β > 0. Then with probability at least 1 − β, for each t ∈ [T],

L_proj(θ̂_t; Γ^t; Φ) ≤ L(θ̂_t; Γ^t) + 4γ²‖C‖²t + 2γ‖C‖·√(t·L(θ̂_t; Γ^t)) + 2√2·γ^{3/2}‖C‖^{3/2}·t^{3/4}·L(θ̂_t; Γ^t)^{1/4}.

Proof. Fix a t ∈ [T]. From Theorem 5.1, for the chosen value of m, with probability at least 1 − β,

L(θ̂_t; (x̃_1, y_1), ..., (x̃_t, y_t)) = Σ_{i=1}^t ( y_i − ⟨x̃_i, θ̂_t⟩ )² = Σ_{i=1}^t ( y_i − (‖x_i‖/‖Φx_i‖)·⟨x_i, θ̂_t⟩ )²
≤ Σ_{i=1}^t ( |y_i − ⟨x_i, θ̂_t⟩| + γ‖θ̂_t‖ )² ≤ Σ_{i=1}^t ( |y_i − ⟨x_i, θ̂_t⟩| + γ‖C‖ )²
≤ L(θ̂_t; Γ^t) + γ²‖C‖²t + 2γ‖C‖·Σ_{i=1}^t |y_i − ⟨x_i, θ̂_t⟩|
≤ L(θ̂_t; Γ^t) + γ²‖C‖²t + 2γ‖C‖·√(t·L(θ̂_t; Γ^t)).    (6)

Now consider L_proj(θ̂_t; Γ^t; Φ). From Corollary 5.2, for the chosen value of m, with probability at least 1 − β,

L_proj(θ̂_t; Γ^t; Φ) = Σ_{i=1}^t ( y_i − ⟨Φx̃_i, Φθ̂_t⟩ )² ≤ Σ_{i=1}^t ( |y_i − ⟨x̃_i, θ̂_t⟩| + γ‖θ̂_t‖ )² ≤ Σ_{i=1}^t ( |y_i − ⟨x̃_i, θ̂_t⟩| + γ‖C‖ )²
= L(θ̂_t; (x̃_1, y_1), ..., (x̃_t, y_t)) + γ²‖C‖²t + 2γ‖C‖·Σ_{i=1}^t |y_i − ⟨x̃_i, θ̂_t⟩|
≤ L(θ̂_t; (x̃_1, y_1), ..., (x̃_t, y_t)) + γ²‖C‖²t + 2γ‖C‖·√(t·L(θ̂_t; (x̃_1, y_1), ..., (x̃_t, y_t))),

where the first inequality is an application of Corollary 5.2. Substituting the result from (6), and taking a union bound over all t ∈ [T] (i.e., replacing β by β/T), completes the proof.

The next step is to lower bound L_proj(θ_t^priv; Γ^t; Φ) in terms of L(θ_t^priv; Γ^t).
Lemma 5.6. For any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ_t^priv generated by Algorithm 3 satisfies:

L(θ_t^priv; Γ^t) ≤ L_proj(θ_t^priv; Γ^t; Φ) + 2γ‖C‖·√(T·L(θ_t^priv; Γ^t)) + 2√2·γ^{3/2}‖C‖^{3/2}·T^{3/4}·L(θ_t^priv; Γ^t)^{1/4} + 4γ²‖C‖²T.
Proof. Fix a t ∈ [T]. For the chosen value of m,

Pr[ ∃ i ∈ [t] : | ‖Φx_i‖ − ‖x_i‖ | ≥ γ‖x_i‖ ] ≤ β.

Similarly, for the chosen value of m, by using (5),

Pr[ ∃ i ∈ [t] : |⟨Φx_i, Φθ_t^priv⟩ − ⟨x_i, θ_t^priv⟩| ≥ γ‖x_i‖‖θ_t^priv‖ ] ≤ β.

Using arguments as in Lemma 5.5, but focusing on the lower bounds, we get that with probability at least 1 − β,

L_proj(θ_t^priv; Γ^t; Φ) ≥ L(θ_t^priv; Γ^t) − 2γ‖C‖·√(T·L(θ_t^priv; Γ^t)) − 2√2·γ^{3/2}‖C‖^{3/2}·T^{3/4}·L(θ_t^priv; Γ^t)^{1/4} − 4γ²‖C‖²T.

Taking a union bound over all t ∈ [T] completes the proof.

Putting together Lemmas 5.4, 5.5, and 5.6, and simple arithmetic manipulation, gives that with probability at least 1 − β, for each t ∈ [T],

L(θ_t^priv; Γ^t) − L(θ̂_t; Γ^t) = O( √m · log^{3/2}T · √(log(1/δ)) · ‖C‖² / ǫ ) + 8γ²‖C‖²T + 2γ‖C‖·√(T·L(θ̂_t; Γ^t)) + 2γ‖C‖·√(T·L(θ_t^priv; Γ^t)) + 2√2·γ^{3/2}‖C‖^{3/2}·T^{3/4}·L(θ_t^priv; Γ^t)^{1/4} + 2√2·γ^{3/2}‖C‖^{3/2}·T^{3/4}·L(θ̂_t; Γ^t)^{1/4}.    (7)

To simplify (7) in terms of its dependence on L(θ_t^priv; Γ^t), start by noting that it is of the form

p − a√p − a^{3/2}·p^{1/4} − b − 2a² ≤ 0,

where p = L(θ_t^priv; Γ^t), a = 2γ‖C‖√T, and

b = L(θ̂_t; Γ^t) + O( √m · log^{3/2}T · √(log(1/δ)) · ‖C‖² / ǫ ) + 2γ‖C‖·√(T·L(θ̂_t; Γ^t)) + 2√2·γ^{3/2}‖C‖^{3/2}·T^{3/4}·L(θ̂_t; Γ^t)^{1/4}.

Solving for p from p − a√p − a^{3/2}·p^{1/4} − b − 2a² = 0 (we used the WolframAlpha solver [56]), and using that to simplify (7), yields that with probability at least 1 − β, for each t ∈ [T],

L(θ_t^priv; Γ^t) − L(θ̂_t; Γ^t) = O( √m · log^{3/2}T · √(log(1/δ)) · ‖C‖² / ǫ + γ²‖C‖²T + γ‖C‖·√(T·L(θ̂_t; Γ^t)) + γ^{3/2}‖C‖^{3/2}·T^{3/4}·L(θ̂_t; Γ^t)^{1/4} ).    (8)
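As a quick numeric sanity check on this step (the values of a and b below are arbitrary), the following sketch solves the reconstructed equation for p and confirms that the root is O(b + a√b + a^{3/2}·b^{1/4} + a²), which is the order used to pass from (7) to (8):

    import numpy as np
    from scipy.optimize import brentq

    def solve_p(a, b):
        # Root in p of: p - a*sqrt(p) - a^{3/2}*p^{1/4} - b - 2*a^2 = 0.
        f = lambda p: p - a * np.sqrt(p) - a ** 1.5 * p ** 0.25 - b - 2 * a ** 2
        return brentq(f, 0.0, 1e15)

    for a, b in [(1e2, 1e6), (1e3, 1e6), (1e2, 1e8)]:
        p = solve_p(a, b)
        bound = b + a * np.sqrt(b) + a ** 1.5 * b ** 0.25 + a ** 2
        print(p / bound)   # stays O(1), consistent with the simplification into (8)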
Theorem 5.7. Algorithm 3 is (ǫ, δ)-differentially private with respect to a single datapoint change in the stream Γ. For any β > 0, with probability at least 1 − β, for each t ∈ [T], the θ_t^priv generated by Algorithm 3 satisfies:

L(θ_t^priv; Γ^t) − L(θ̂_t; Γ^t) = O( T^{1/3}·W^{2/3}·log^{3/2}T·‖C‖²·√(log(1/δ)·log(1/β)) / ǫ + T^{1/6}·W^{1/3}·‖C‖·√(L(θ̂_t; Γ^t)) + T^{1/4}·W^{1/2}·‖C‖^{3/2}·L(θ̂_t; Γ^t)^{1/4} ),

where θ̂_t ∈ argmin_{θ∈C} L(θ; Γ^t) and W = w(X) + w(C).

Proof. The (ǫ, δ)-differential privacy follows from the established global sensitivity bound. For the utility analysis, we start from (8) and substitute γ = (w(X) + w(C))^{1/3}/T^{1/3} to get the claimed bound. The value of γ is picked to balance the various opposing factors.
Remark 5.8. Since L(θ̂_t; Γ^t) is a non-decreasing function of t, in the above theorem L(θ̂_t; Γ^t) on the right-hand side could be replaced by L(θ̂_T; Γ^T) (defined as OPT in the introduction).
Instantiations of Theorem 5.7. We start this discussion by mentioning a few instantiations of Theorem 5.7. Let OPT = L(θ̂_T; Γ^T) (recall that OPT ≤ T). For simplicity, below we ignore the dependence on the privacy and confidence parameters.

For arbitrary X and C, with just an L₂-diameter assumption as in Theorem 4.2, W = w(X) + w(C) = O(√d), and therefore the excess risk bound provided by Theorem 5.7 (accounting for the trivial excess risk bound of T) is Õ(min{ T^{1/3}d^{1/3} + T^{1/6}d^{1/6}·√OPT + T^{1/4}d^{1/4}·OPT^{1/4}, T }), which is worse than the Õ(min{√d, T}) excess risk bound provided by Theorem 4.2. However, as we discuss below, for many interesting high-dimensional problems one can get substantially better bounds using Theorem 5.7. This happens when W is "small".

In many practical regression instances, the input data is high-dimensional and sparse [38, 22, 59], which leads to a small w(X). For example, if the x_i's are k-sparse, then w(X) = O(√(k·log(d/k))). Another common scenario is to have the x_i's come from a ball of bounded L₁-diameter, in which case w(X) = O(√(log d)). Under any of these assumptions, let us look at different choices of constraint spaces (C) that have a small Gaussian width. A Monte Carlo probe of two of these widths is sketched after this list.

• If C = conv{a_1, ..., a_l} is the convex hull of vectors a_1, ..., a_l ∈ R^d such that ‖a_i‖ ≤ c for all i ∈ [l], with c ∈ R⁺, then w(C) = O(c·√(log l)). A popular subcase is the cross-polytope B₁^d (the unit L₁-ball). For example, the popular Lasso formulation [50] used for high-dimensional linear regression is:

θ̂_t ∈ argmin_{θ ∈ cB₁^d} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)².

Another popular subcase is the probability simplex, where C = { θ ∈ R^d : Σ_i θ_i = 1 and θ_i ≥ 0 for all i ∈ [d] }.

• The group/block L₂,1-norm is another prominent sparsity-inducing norm used in many applications [1]. For a vector θ ∈ R^d and a parameter k, this norm is defined as:

‖θ‖_{k,L_{2,1}} = Σ_{i=1}^{⌈d/k⌉} √( Σ_{j=(i−1)k+1}^{min{ik,d}} |θ_j|² ).

(There are generalizations of this norm that can handle different group (block) sizes.) If C denotes the unit ball centered at the origin with respect to the ‖·‖_{k,L_{2,1}}-norm, then the Gaussian width of C is O(√(k·log(d/k))) [47].

• L_p-balls (1 < p < 2) are another popular choice of constraint space [40]. The regression problem in this case is defined as:

θ̂_t ∈ argmin_{θ ∈ cB_p^d} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)²,

and w(cB_p^d) = O(c·d^{1−1/p}).

As the reader may notice, in all of the above problem settings, W is much smaller than √d. As a comparison to the bound in Theorem 4.2: if W = O(polylog(d)), then Theorem 5.7 yields an excess risk bound of Õ(T^{1/3} + T^{1/6}·√OPT + T^{1/4}·OPT^{1/4}). This is significantly better than the Õ(√d) risk bound from Theorem 4.2 for many settings of T, d, and OPT, e.g., if d ≫ T^{4/3}.
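As promised above, these widths are easy to probe by Monte Carlo through the definition w(S) = E sup_{a∈S} ⟨g, a⟩; for instance, for the k-sparse covariates and the L_p-ball constraint set mentioned in this list (d, k, and p here are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)
    d, k, trials = 2000, 10, 300
    G = rng.standard_normal((trials, d))

    # k-sparse unit vectors: sup <g, a> is the L2-norm of the k largest |g_i|.
    w_sparse = np.linalg.norm(np.sort(np.abs(G), axis=1)[:, -k:], axis=1).mean()
    print(w_sparse, np.sqrt(k * np.log(d / k)))   # same order, far below sqrt(d)

    # Unit L_p ball with p = 1.5: sup <g, a> = ||g||_q with q = p/(p-1) = 3.
    w_lp = np.linalg.norm(G, ord=3, axis=1).mean()
    print(w_lp, d ** (1 - 1 / 1.5))               # same order: d^{1-1/p} = d^{1/3}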
One could also compare the result of Theorem 5.7 to the bound obtained by applying the differentially private ERM algorithm of Talwar et al. [46] in the generic mechanism (Theorem 3.1, Part 3). It is hard to do a precise comparison because of the dependence on different parameters in these bounds. In general, when OPT is not very big (say, ≪ T^{2/3}) and W = O(polylog(d)), the excess risk bound from Theorem 5.7 is significantly better than the Õ(√T) risk bound obtained in Theorem 3.1, Part 3.

Extension to a case where not all inputs are drawn from a domain with small Gaussian width.
The previous analysis assumes that w(X) is small (i.e., all inputs are drawn from a domain with small Gaussian width). We now show that the techniques and results of the previous section extend to a more robust setting, where not all inputs are assumed to come from a domain with small Gaussian width.

In particular, we assume that there exists a set G ⊆ X such that w(G) is small, and that only some of the inputs in the stream come from G (e.g., only a fraction of the covariates could be sparse). We also assume that the algorithm has access to an oracle which, given a point x ∈ X, returns whether x ∈ G or not. The goal of the algorithm is to perform private incremental linear regression on the inputs from G. In a non-private world this is trivial, as the algorithm can simply ignore (x_t, y_t) when x_t is not in G, but this operation is not private. However, a simple change to Algorithm 3 handles this scenario without a breach of privacy. The idea is to check whether x_t ∈ G: if so, (x_t, y_t) is used exactly as in Algorithm 3; otherwise, (x_t, y_t) is replaced by (0, 0) before invoking Algorithm TreeMech (in Steps 5 and 6 of Algorithm 3). With this change, the resulting algorithm is (ǫ, δ)-differentially private, and with probability at least 1 − β, for each t ∈ [T], its output θ_t^priv will satisfy:

Σ_{x_i∈G, i∈[t]} (y_i − ⟨x_i, θ_t^priv⟩)² − Σ_{x_i∈G, i∈[t]} (y_i − ⟨x_i, θ̂_t⟩)² = O( T^{1/3}·W^{2/3}·log^{3/2}T·‖C‖²·√(log(1/δ)·log(1/β)) / ǫ + T^{1/6}·W^{1/3}·‖C‖·√(L(θ̂_t; Γ^t)) + T^{1/4}·W^{1/2}·‖C‖^{3/2}·L(θ̂_t; Γ^t)^{1/4} ),

where θ̂_t ∈ argmin_{θ∈C} Σ_{x_i∈G, i∈[t]} (y_i − ⟨x_i, θ⟩)² and W = w(G) + w(C).
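A minimal sketch of the preprocessing step just described, with a hypothetical membership oracle in_G (here instantiated, purely for illustration, as a sparsity check):

    import numpy as np

    def in_G(x, k=10):
        # Hypothetical oracle for membership in G; here G = { k-sparse covariates }.
        return np.count_nonzero(x) <= k

    def preprocess(x_t, y_t):
        # Replace out-of-domain inputs by (0, 0) instead of dropping them, so the
        # TreeMech streams in Steps 5 and 6 always receive exactly one item per step.
        if in_G(x_t):
            return x_t, y_t
        return np.zeros_like(x_t), 0.0

    x = np.zeros(100); x[:3] = 1.0
    print(preprocess(x, 0.7))                  # kept: x is 3-sparse
    print(preprocess(np.ones(100), 0.7)[1])    # replaced by (0, 0)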
References

[1] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[2] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. In
FOCS. IEEE, 2014. [3] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss transform itself preserves differential privacy. In
Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposiumon , pages 410–419. IEEE, 2012.[4] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In
PODS, pages 128–138. ACM, 2005. [5] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. In
Proceedings of the 47th ACM Symposium on Theory of Computing . Association forComputing Machinery, 2015.[6] John Canny and Huasha Zhao. Bidmach: Large-scale learning with zero memory allocation. In
BigLearning,NIPS Workshop , 2013.[7] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics.
ACM Transactions onInformation and System Security (TISSEC) , 14(3):26, 2011.[8] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In
Advances in NeuralInformation Processing Systems , pages 289–296, 2009.[9] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimiza-tion.
The Journal of Machine Learning Research , 12:1069–1109, 2011.[10] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm.
ACM Transactionson Algorithms (TALG) , 6(4):63, 2010.[11] Irit Dinur and Kobbi Nissim. Revealing Information while Preserving Privacy. In
PODS , pages 202–210. ACM,2003.[12] John C Duchi, Michael Jordan, Martin J Wainwright, et al. Local privacy and statistical minimax rates. In
Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on , pages 429–438. IEEE,2013.[13] John C Duchi, Michael I Jordan, and Martin J Wainwright. Privacy aware learning.
Journal of the ACM (JACM) ,61(6):38, 2014.[14] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves:Privacy via distributed noise generation. In
EUROCRYPT , LNCS, pages 486–503. Springer, 2006.[15] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in privatedata analysis. In
TCC , volume 3876 of
LNCS , pages 265–284. Springer, 2006.[16] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual ob-servation. In
Proceedings of the forty-second ACM symposium on Theory of computing , pages 715–724. ACM,2010.[17] Cynthia Dwork, Moni Naor, Omer Reingold, and Guy Rothblum. Pure differential privacy for rectangle queriesvia private partitions. In
ASIACRYPT , 2015.[18] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy.
Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. [19] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In
Foundations ofComputer Science (FOCS), 2010 51st Annual IEEE Symposium on , pages 51–60. IEEE, 2010.[20] Mahdi Milani Fard, Yuri Grinberg, Joelle Pineau, and Doina Precup. Compressed least-squares regression onsparse spaces. In
AAAI , 2012.[21] Yehoram Gordon.
On Milman's inequality and random subspaces which escape through a mesh in R^n. Springer, 1988. [22] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[23] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In
Learning theory , pages 499–513. Springer, 2006.[24] Justin Hsu, Zhiyi Huang, Aaron Roth, Tim Roughgarden, and Zhiwei Steven Wu. Private matchings and al-locations. In
Proceedings of the 46th Annual ACM Symposium on Theory of Computing , pages 21–30. ACM,2014.[25] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In
COLT 2012 ,pages 24.1–24.34, 2012.[26] Prateek Jain and Abhradeep Thakurta. Differentially private learning with kernels. In
Proceedings of the 30thInternational Conference on Machine Learning (ICML-13) , pages 118–126, 2013.[27] Prateek Jain and Abhradeep Guha Thakurta. (near) dimension independent risk bounds for differentially privatelearning. In
Proceedings of The 31st International Conference on Machine Learning , pages 476–484, 2014.[28] Ata Kab´an. New bounds on compressive linear least squares regression. In
The 17-th International Conferenceon Artificial Intelligence and Statistics (AISTATS 2014) , volume 33, pages 448–456, 2014.[29] Shiva Kasiviswanathan and Hongxia Jin. Efficient private empirical risk minimization for high-dimensionallearning. In
ICML, 2016. [30] Shiva P. Kasiviswanathan and Adam Smith. On the 'semantics' of differential privacy: A Bayesian formulation.
Journal of Privacy and Confidentiality , 6(1):1, 2014.[31] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Whatcan we learn privately? In
FOCS , pages 531–540. IEEE Computer Society, 2008.[32] Shiva Prasad Kasiviswanathan, Mark Rudelson, and Adam Smith. The power of linear reconstruction attacks.In
Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 1415–1433.SIAM, 2013.[33] Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. Privacy via the johnson-lindenstrauss transform.
Journal of Privacy and Confidentiality , 5(1):39–71, 2013.[34] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression.
Journal of Machine Learning Research , 1:41, 2012.[35] Michel Ledoux and Michel Talagrand.
Probability in Banach Spaces: isoperimetry and processes , volume 23.Springer Science & Business Media, 2013.[36] Odalric Maillard and R´emi Munos. Compressed least-squares regression. In
Advances in Neural InformationProcessing Systems , pages 1213–1221, 2009.[37] Nikita Mishra and Abhradeep Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. In
UAI, pages 592–601, 2015. [38] Jelani Nelson and Huy L. Nguyễn. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In
Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on , pages117–126. IEEE, 2013.[39] G¨otz E Pfander.
Sampling Theory, a Renaissance. Springer, 2015. [40] Azar Rahimi, Jingjia Xu, and Linwei Wang. Lp-norm regularization in volumetric imaging of cardiac current sources.
Computational and mathematical methods in medicine , 2013, 2013.[41] Benjamin IP Rubinstein, Peter L Bartlett, Ling Huang, and Nina Taft. Learning in a large function space:Privacy-preserving mechanisms for svm learning. arXiv preprint arXiv:0911.5708 , 2009.[42] Shai Shalev-Shwartz. Online learning and online convex optimization.
Foundations and Trends in MachineLearning , 4(2):107–194, 2011.[43] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniformconvergence.
The Journal of Machine Learning Research , 11:2635–2670, 2010.[44] Or Sheffet. Private approximations of the 2nd-moment matrix using existing techniques in linear regression. arXiv preprint arXiv:1507.00056 , 2015.[45] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentiallyprivate updates. In
Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE , pages245–248. IEEE, 2013.
[46] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly optimal private Lasso. In
Advances in Neural Infor-mation Processing Systems , pages 3007–3015, 2015.[47] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Private empirical risk minimization beyond the worst case:The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, To appear in NIPS , 2015.[48] Abhradeep Guha Thakurta and Adam Smith. Differentially private feature selection via stability arguments, andthe robustness of the lasso. In
Conference on Learning Theory , pages 819–850, 2013.[49] Abhradeep Guha Thakurta and Adam Smith. (nearly) optimal algorithms for private online learning in full-information and bandit settings. In
Advances in Neural Information Processing Systems , pages 2733–2741,2013.[50] Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society.Series B (Methodological) , pages 267–288, 1996.[51] Jonathan Ullman. Private multiplicative weights beyond linear queries. In
Proceedings of the 34th ACM Sym-posium on Principles of Database Systems , pages 303–312. ACM, 2015.[52] Jalaj Upadhyay. Randomness efficient fast-johnson-lindenstrauss transform with applications in differentialprivacy and compressed sensing. arXiv preprint arXiv:1410.2470 , 2014.[53] Vladimir Vapnik.
The nature of statistical learning theory . Springer Science & Business Media, 2013.[54] Roman Vershynin. Estimation in high dimensions: a geometric perspective. arXiv preprint arXiv:1405.5103 ,2014.[55] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. In
Advances in NeuralInformation Processing Systems , pages 2451–2459, 2010.[56] WolframAlpha. , 2016.[57] David P. Woodruff. Sketching as a tool for numerical linear algebra.
Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014. [58] Shuheng Zhou, Katrina Ligett, and Larry Wasserman. Differential privacy with compression. In
InformationTheory, 2009. ISIT 2009. IEEE International Symposium on , pages 2718–2722. IEEE, 2009.[59] Shuheng Zhou, Larry Wasserman, and John D Lafferty. Compressed regression. In
Advances in Neural Infor-mation Processing Systems , pages 1713–1720, 2008.
A Additional Preliminaries
We start by reviewing some standard definitions in convex optimization. In our setting, for a loss function ℓ(θ; z), all the following properties (such as convexity, Lipschitzness, and strong convexity) are defined with respect to the first argument θ. In the following, we use the notation ∇ℓ(θ; z) to denote the gradient (if it exists) or any subgradient of the function ℓ(·; z) at θ.

Definition 7 (Convex Functions). A loss function ℓ : C × Z → R is convex with respect to θ over the domain C if for all z ∈ Z the following inequality holds: ℓ(λθ_a + (1−λ)θ_b; z) ≤ λ·ℓ(θ_a; z) + (1−λ)·ℓ(θ_b; z) for all θ_a, θ_b ∈ C and λ ∈ [0, 1]. For a continuously differentiable ℓ, this inequality can be equivalently replaced with: ℓ(θ_b; z) ≥ ℓ(θ_a; z) + ⟨∇ℓ(θ_a; z), θ_b − θ_a⟩ for all θ_a, θ_b ∈ C, where ∇ℓ(θ_a; z) is the gradient of ℓ(·; z) at θ_a.

Definition 8 (Lipschitz Functions). A loss function ℓ : C × Z → R is L-Lipschitz with respect to θ over the domain C if for all z ∈ Z and θ_a, θ_b ∈ C, we have |ℓ(θ_a; z) − ℓ(θ_b; z)| ≤ L·‖θ_a − θ_b‖. If ℓ is a convex function, then ℓ is L-Lipschitz iff for all θ ∈ C and all subgradients g of ℓ at θ we have ‖g‖ ≤ L.

Definition 9 (Strongly Convex Functions). A loss function ℓ : C × Z → R is ν-strongly convex if for all z ∈ Z, all θ_a, θ_b ∈ C, and all subgradients g of ℓ(θ_a; z), we have ℓ(θ_b; z) ≥ ℓ(θ_a; z) + ⟨g, θ_b − θ_a⟩ + (ν/2)·‖θ_b − θ_a‖² (i.e., ℓ is bounded below by a quadratic function tangent at θ_a).

Gaussian Norm Bounds.
Let N(0, σ²) denote the Gaussian (normal) distribution with mean 0 and variance σ². We use the following standard result on the spectral norm (largest singular value) of an i.i.d. Gaussian random matrix throughout this paper.
Proposition A.1. Let A be an N × n matrix whose entries are independent standard normal random variables. Then for every t > 0, with probability at least 1 − 2e^{−t²/2}, one has ‖A‖ = O(√N + √n + t). In particular, with probability at least 1 − β, ‖A‖ = O(√N + √n + √(log(1/β))).

A.1 Background on Linear Regression

Linear regression is a statistical method used to create a linear model. It attempts to model the relationship between two variables (known as covariate-response pairs) by fitting a linear function to observed data. More formally, given y = Xθ* + w, where y = (y_1, ..., y_n) ∈ R^n is a vector of observed responses, X ∈ R^{n×d} is the covariate matrix whose ith row x_i^⊤ represents the covariates (features) of the ith observation, and w = (w_1, ..., w_n) is a noise vector, the goal of linear regression is to estimate the unknown regression vector θ*.

Assuming that the noise vector w follows a (sub)Gaussian distribution, estimating θ* amounts to solving the ordinary least-squares problem:

θ̂ ∈ argmin_{θ∈R^d} ‖y − Xθ‖² = argmin_{θ∈R^d} Σ_{i=1}^n (y_i − ⟨x_i, θ⟩)².

Typically, for additional guarantees such as sparsity and stability, constraints are added to the least-squares estimator. This leads to the constrained linear regression formulation:

Linear Regression: θ̂ ∈ argmin_{θ∈C} ‖y − Xθ‖² = argmin_{θ∈C} Σ_{i=1}^n (y_i − ⟨x_i, θ⟩)²,    (9)

for some convex set C ⊆ R^d. In this paper, we work with this formulation. Two well-known regression problems are obtained by choosing C as the L₂-/L₁-ball:

Ridge Regression: θ̂ ∈ argmin_{θ∈cB₂^d} Σ_{i=1}^n (y_i − ⟨x_i, θ⟩)²,    Lasso Regression: θ̂ ∈ argmin_{θ∈cB₁^d} Σ_{i=1}^n (y_i − ⟨x_i, θ⟩)²,

where c ∈ R⁺. Another popular example is Elastic-net regression, which combines the Lasso and Ridge regressions. Note that, while in this paper we focus on the constrained formulation of regression, by duality and the KKT conditions the constrained formulation is equivalent to a penalized (regularized) formulation.
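For concreteness, a minimal numpy sketch of the (unconstrained) ordinary least-squares estimator under the generative model y = Xθ* + w; all sizes and the noise level here are arbitrary:

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 500, 20
    X = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    y = X @ theta_star + 0.1 * rng.standard_normal(n)   # y = X theta* + w

    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]    # ordinary least squares
    print(np.linalg.norm(theta_hat - theta_star))       # small when n >> d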
In the Streaming Setting. Let
Γ = (x_1, y_1), ..., (x_T, y_T) denote a data stream, and let Γ^t denote the prefix of Γ of length t. Informally, the goal of incremental linear regression is to release, at each timestep t ∈ [T], a θ_t ∈ C that minimizes L(θ; Γ^t) = Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)².

Definition 10 (Adaptation of Definition 1 for Incremental Linear Regression). A randomized streaming algorithm is an (α, β)-estimator for incremental linear regression if, with probability at least 1 − β over the coin flips of the algorithm, for each t ∈ [T], after processing a prefix of the stream of length t, it generates an output θ_t ∈ C that satisfies the following bound on the excess (empirical) risk:

Σ_{i=1}^t (y_i − ⟨x_i, θ_t⟩)² − min_{θ∈C} Σ_{i=1}^t (y_i − ⟨x_i, θ⟩)² ≤ α.

A.2 Background on Differential Privacy
In this section, we review some basic constructions in differential privacy. The literature on differential privacy is now rich with tools for constructing differentially private analyses, and we refer the reader to the survey by Dwork and Roth [18] for a comprehensive review of developments there. One of the most basic techniques for achieving differential privacy is adding noise to the outcome of a computed function, where the noise magnitude is scaled to the (global) sensitivity of the function, defined as:
Definition 11 (Sensitivity). Let f be a function mapping streams Γ ∈ Z* to R^d. The L₂-sensitivity ∆₂ of f is the maximum of ‖f(Γ) − f(Γ′)‖ over neighboring streams Γ, Γ′.

Theorem A.2 (Framework of Global Sensitivity [15]). Let f : Z* → R^d be a function with L₂-sensitivity ∆₂. The algorithm that on an input Γ outputs f(Γ) + Y, where Y ∼ N(0, 2∆₂²·ln(2/δ)/ǫ²)^d, is (ǫ, δ)-differentially private.
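A minimal sketch of the resulting Gaussian mechanism, with the noise calibration of Theorem A.2; the usage example in the comment is hypothetical:

    import numpy as np

    def gaussian_mechanism(f_value, delta2, eps, delta, rng=np.random.default_rng()):
        # Release f(Gamma) + N(0, sigma^2 I), where sigma^2 = 2*delta2^2*ln(2/delta)/eps^2
        # as in Theorem A.2; delta2 is the L2-sensitivity of f.
        sigma = delta2 * np.sqrt(2.0 * np.log(2.0 / delta)) / eps
        return f_value + rng.normal(0.0, sigma, size=np.shape(f_value))

    # e.g., privately release a sum of vectors of norm <= 1 (sensitivity 2):
    # noisy_sum = gaussian_mechanism(x.sum(axis=0), delta2=2.0, eps=0.5, delta=1e-6)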
Composition theorems for differential privacy allow a modular design of privacy-preserving algorithms based on algorithms for simpler subtasks:

Theorem A.3 ([14]). A mechanism that permits k adaptive interactions with mechanisms that preserve (ǫ, δ)-differential privacy (and does not access the database otherwise) ensures (kǫ, kδ)-differential privacy.

A stronger composition is also possible, as shown by Dwork et al. [19].
Theorem A.4 ([19]). Let ǫ, δ, δ* > 0 and ǫ ≤ 1. A mechanism that permits k adaptive interactions with mechanisms that preserve (ǫ, δ)-differential privacy ensures (ǫ·√(2k·ln(1/δ*)) + 2kǫ², kδ + δ*)-differential privacy.

B Noisy Projected Gradient Descent
In this section, we investigate the convergence rate of noisy projected gradient descent. Williams and McSherry first investigated gradient descent with noisy updates for probabilistic inference [55]; noisy stochastic variants of gradient descent have also been studied in various private convex optimization settings [25, 12, 45, 2]. Other convex optimization techniques, such as mirror descent [13, 47] and the Frank-Wolfe scheme [47], have also been considered for designing private ERM algorithms.

For completeness, in this section we present an analysis of the projected gradient descent procedure that operates with access only to a private gradient function (Definition 5). The analysis relies on standard ideas from the convex optimization literature. Consider the following constrained optimization problem:

min_{θ∈C} f(θ), where C ⊆ R^d,    (10)

where f : R^d → R is a convex function and C is some non-empty closed convex set. Define the projection of θ ∈ R^d onto a convex set C as: P_C(θ) = argmin_{z∈C} ‖θ − z‖². The projected gradient descent algorithm uses the following update to solve (10):
ProjGrad(C, r): Initialize θ_1 ∈ C. Repeat for k = 1, ..., r: θ_{k+1} = P_C(θ_k − η_k·∇f(θ_k)). Output θ̄ = (1/r)·Σ_{k=1}^r θ_k,    (11)

for some stepsize η_k. Here P_C(θ) denotes the projection of θ onto C. Let g : R^d → R^d be an (α, β)-approximation of the true gradient of f (as in Definition 5):

Pr[ max_{θ∈C} ‖g(θ) − ∇f(θ)‖ > α ] ≤ β.

The noisy projected gradient descent is a simple modification of the projected gradient descent algorithm (11), where a noisy gradient is used instead of the true gradient. In other words, noisy projected gradient descent takes the form:
NoisyProjGrad(C, g, r): Initialize θ_1 ∈ C. Repeat for k = 1, ..., r: θ_{k+1} = P_C(θ_k − η_k·g(θ_k)). Output θ̄ = (1/r)·Σ_{k=1}^r θ_k.    (12)

The following proposition analyzes the convergence of the above NoisyProjGrad procedure, assuming g is an (α, β)-approximation of the true gradient of f. Note that the proposition holds even if f is not differentiable, in which case ∇f(θ) represents any subgradient of f at θ.
Proposition B.1. Suppose f is convex and L-Lipschitz, and let ‖C‖ be the diameter of C. Let g : R^d → R^d be an (α, β)-approximation of the true gradient of f. Then after r steps of Algorithm NoisyProjGrad, starting from any θ_1 ∈ C with the constant stepsize η_k = ‖C‖/(√r·(α + L)), with probability at least 1 − rβ,

f(θ̄) − f(θ*) ≤ (α + L)‖C‖/√r + α‖C‖.
Proof. Let θ* ∈ C be an optimal solution of (10). Write g(θ_k) = ∇f(θ_k) + e(θ_k). By Definition 8, L ≥ max_{θ∈C} ‖∇f(θ)‖. Now,

‖θ_{k+1} − θ*‖² = ‖P_C(θ_k − η_k·g(θ_k)) − P_C(θ*)‖² ≤ ‖θ_k − η_k·g(θ_k) − θ*‖²
≤ ‖θ_k − θ*‖² + 2η_k·⟨g(θ_k), θ* − θ_k⟩ + η_k²·‖g(θ_k)‖²
= ‖θ_k − θ*‖² + 2η_k·⟨∇f(θ_k) + e(θ_k), θ* − θ_k⟩ + η_k²·‖∇f(θ_k) + e(θ_k)‖²
≤ ‖θ_k − θ*‖² + 2η_k·⟨∇f(θ_k), θ* − θ_k⟩ + 2η_k·‖e(θ_k)‖·‖C‖ + η_k²·‖∇f(θ_k) + e(θ_k)‖²
≤ ‖θ_k − θ*‖² + 2η_k·(f(θ*) − f(θ_k)) + 2η_k·‖e(θ_k)‖·‖C‖ + η_k²·(‖∇f(θ_k)‖² + ‖e(θ_k)‖² + 2‖∇f(θ_k)‖·‖e(θ_k)‖)
≤ ‖θ_k − θ*‖² + 2η_k·(f(θ*) − f(θ_k)) + 2η_k·‖e(θ_k)‖·‖C‖ + η_k²·(L² + ‖e(θ_k)‖² + 2L·‖e(θ_k)‖).

The first inequality follows because projection cannot increase distances (the projection operator is contractive). By the assumption on g(θ_k), with probability at least 1 − β, ‖e(θ_k)‖ ≤ α. Hence, with probability at least 1 − β,

‖θ_{k+1} − θ*‖² ≤ ‖θ_k − θ*‖² + 2η_k·(f(θ*) − f(θ_k)) + 2η_k·α‖C‖ + η_k²·(L² + α² + 2αL).

Summing the above expression over k and taking a union bound yields that, with probability at least 1 − rβ,

0 ≤ ‖θ_{r+1} − θ*‖² ≤ ‖θ_1 − θ*‖² + 2·Σ_{k=1}^r η_k·(f(θ*) − f(θ_k)) + 2rη_k·α‖C‖ + rη_k²·(L + α)².

Rearranging the above and setting the constant stepsize η_k = ‖C‖/(√r·(α + L)) gives, with probability at least 1 − rβ,

Σ_{k=1}^r (f(θ_k) − f(θ*)) ≤ ‖C‖·√r·(α + L) + rα‖C‖.

Now, by Jensen's inequality,

f(θ̄) = f( (1/r)·Σ_{k=1}^r θ_k ) ≤ (1/r)·Σ_{k=1}^r f(θ_k).

Therefore,

f(θ̄) − f(θ*) ≤ (1/r)·Σ_{k=1}^r (f(θ_k) − f(θ*)) ≤ (α + L)‖C‖/√r + α‖C‖.
Corollary B.2. Setting r = ((α + L)‖C‖/ζ)² in Proposition B.1 gives that, with probability at least 1 − rβ, f(θ̄) − f(θ*) ≤ ζ + α‖C‖. If α > 0, then setting r = (L/α)² gives f(θ̄) − f(θ*) = O(α‖C‖).

After constructing the private gradient function, evaluating the gradient function at any θ can be done without affecting the privacy budget, as this is just a post-processing of private outputs.
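A minimal Python sketch of NoisyProjGrad with the stepsize from Proposition B.1; the objective, the α-accurate gradient oracle, and the constraint set in the example are illustrative assumptions:

    import numpy as np

    def noisy_proj_grad(project, g, r, diam, L, alpha, theta0):
        # Noisy projected gradient descent (12) with the constant stepsize
        # eta = ||C|| / (sqrt(r) * (alpha + L)); returns the average iterate.
        eta = diam / (np.sqrt(r) * (alpha + L))
        theta, avg = theta0.copy(), np.zeros_like(theta0)
        for _ in range(int(r)):
            theta = project(theta - eta * g(theta))
            avg += theta / r
        return avg

    # Illustration: f(theta) = ||A theta - b||^2 over C = the unit L2-ball, with a
    # noisy (roughly alpha-accurate) gradient oracle; all sizes are arbitrary.
    rng = np.random.default_rng(5)
    A, b = rng.standard_normal((100, 10)), rng.standard_normal(100)
    alpha = 0.1
    g = lambda th: 2 * A.T @ (A @ th - b) + (alpha / np.sqrt(10)) * rng.standard_normal(10)
    project = lambda th: th / max(1.0, np.linalg.norm(th))   # P_C for the unit L2-ball
    L = 2 * np.linalg.norm(A, 2) * (np.linalg.norm(A, 2) + np.linalg.norm(b))  # crude bound
    theta_bar = noisy_proj_grad(project, g, r=2000, diam=2.0, L=L,
                                alpha=alpha, theta0=np.zeros(10))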
C Tree Mechanism for Continually Releasing Private Sums

Given a bit stream b_1, ..., b_T ∈ {0, 1}, the private streaming counter problem is to release, at every timestep t, (an approximation to) Σ_{i=1}^t b_i while satisfying differential privacy (Definition 4). Chan et al. [7] and Dwork et al. [16] proposed an elegant differentially private mechanism (referred to as the Tree Mechanism) for this problem. We use this mechanism as a basic building block in our private incremental regression algorithms. For completeness, in Algorithm TreeMech we present the entire Tree Mechanism as applied to a set of vectors. Given a data stream Υ = υ_1, ..., υ_T ∈ Z, the algorithm releases at each timestep t (an approximation to) the sum Σ_{i=1}^t υ_i, while satisfying differential privacy.

Algorithm TreeMech can be viewed as releasing partial sums over different dyadic ranges at each timestep t; computing the final sum is simply a post-processing of these partial sums. At most log T partial sums are used for constructing each private sum. The following proposition follows by using the standard upper deviation inequality for Gaussian random variables (Proposition A.1) in the analysis of the Tree Mechanism from [16, 7]. Another advantage of this mechanism is that it can be implemented with small memory, as only O(log t) partial sums are needed at any time t.
Proposition C.1. Algorithm TreeMech is (ǫ, δ)-differentially private with respect to a single datapoint change in the stream Υ. For any β > 0 and t ∈ [T], with probability at least 1 − β, the s_t computed by Algorithm TreeMech satisfies:

‖ s_t − Σ_{i=1}^t υ_i ‖ = O( ∆₂·(√d + √(log(1/β)))·log^{3/2}T·√(log(1/δ)) / ǫ ),

where ∆₂ is the L₂-sensitivity of the sum function from Definition 11. (Dwork et al. [17] have recently improved the bounds for the private streaming counter problem in the case where the bit stream is sparse, i.e., has many fewer 1's than 0's.)

Algorithm 4: TreeMech(ǫ, δ, ∆₂)
Input: A stream
Υ = υ_1, ..., υ_T, where each υ_t is from the domain Z ⊆ R^d, and ∆₂ = max_{υ,υ′∈Z} ‖υ − υ′‖
Output: A differentially private estimate of Σ_{i=1}^t υ_i at every timestep t ∈ [T]
1: for all t ∈ [T] do
2:   Express t = Σ_{j=0}^{log t} 2^j·Bin_j(t) (where Bin_j(t) is the bit at the jth index in the binary representation of t)
3:   i ← min_{0≤j≤log T} { j : Bin_j(t) ≠ 0 }
4:   a_i ← Σ_{j<i} a_j + υ_t (a new partial sum; the partial sums a_j for j < i are no longer needed)
5:   â_i ← a_i + N(0, σ²)^d, where σ is calibrated to ∆₂ and (ǫ, δ) as in [16, 7]
6:   Output s_t ← Σ_{j : Bin_j(t)=1} â_j
7: end
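A compact Python sketch of Algorithm 4; the noise scale sigma is left as an assumed, externally supplied parameter, to be calibrated to ∆₂ and (ǫ, δ) as in [16, 7]:

    import numpy as np

    def tree_mech(stream, sigma, rng=np.random.default_rng()):
        # Tree Mechanism (Algorithm 4): each prefix sum is assembled from at most
        # log T noisy dyadic partial sums; the exact partial sums are never released.
        exact, noisy = {}, {}                       # partial sums keyed by dyadic level
        for t, upsilon in enumerate(stream, start=1):
            i = (t & -t).bit_length() - 1           # lowest set bit of t (level i)
            a = np.asarray(upsilon, dtype=float)
            for j in range(i):                      # fold lower levels into level i
                a = a + exact.pop(j)
                noisy.pop(j)
            exact[i] = a
            noisy[i] = a + rng.normal(0.0, sigma, size=a.shape)
            yield sum(noisy[j] for j in noisy if (t >> j) & 1)

    # Usage: privately release running sums of 8 vectors in R^3 (sigma illustrative).
    for s_t in tree_mech([np.ones(3)] * 8, sigma=0.5):
        print(s_t)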