Adaptive Sequential Optimization with Applications to Machine Learning
Craig Wilson and Venugopal V. Veeravalli∗

Coordinated Science Lab and Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
{wilson60,vvv}@illinois.edu

∗This work was supported by the NSF under award CCF 11-11342 through the University of Illinois at Urbana-Champaign.
October 25, 2018
Abstract
A framework is introduced for solving a sequence of slowly changing optimization problems, including those arising in regression and classification applications, using optimization algorithms such as stochastic gradient descent (SGD). The optimization problems change slowly in the sense that the minimizers change at either a fixed or bounded rate. A method based on estimates of the change in the minimizers and properties of the optimization algorithm is introduced for adaptively selecting the number of samples needed from the distributions underlying each problem in order to ensure that the excess risk, i.e., the expected gap between the loss achieved by the approximate minimizer produced by the optimization algorithm and the exact minimizer, does not exceed a target level. Experiments with synthetic and real data are used to confirm that this approach performs well.
1 Introduction

Consider solving a sequence of machine learning problems, such as regression or classification, by minimizing the expected value of a fixed loss function $\ell(x,z)$ at each time $n$:

$$\min_{x \in X} \left\{ f_n(x) \triangleq \mathbb{E}_{z_n \sim p_n}[\ell(x, z_n)] \right\} \quad \forall n \geq 1. \tag{1}$$

For regression, $z_n$ corresponds to the predictor and response pair at time $n$, and $x$ parameterizes the regression model. For classification, $z_n$ corresponds to the feature and label pair at time $n$, and $x$ parameterizes the classifier. Although motivated by regression and classification, our framework works for any loss function $\ell(x,z)$ that satisfies certain properties discussed later. In the learning context, a task consists of the loss function $\ell(x,z)$ and the distribution $p_n$, so our problem can be viewed as learning a sequence of tasks.

The problems change slowly at a constant but unknown rate, in the sense that

$$\|x_n^* - x_{n-1}^*\| = r \quad \forall n \geq 2, \tag{2}$$

with $x_n^*$ the minimizer of $f_n(x)$. In an extended version of this paper [?], we also consider slow changes at a bounded but unknown rate,

$$\|x_n^* - x_{n-1}^*\| \leq r \quad \forall n \geq 2. \tag{3}$$

We compute an approximate minimizer $x_n$ of each function $f_n(x)$ using $K_n$ samples from the distribution $p_n$ by applying an optimization algorithm. We evaluate the quality of the approximate minimizers $x_n$ through an excess risk criterion $e_n$, i.e.,

$$\mathbb{E}[f_n(x_n)] - f_n(x_n^*) \leq e_n.$$

Our goal is to select the number of samples $K_n$ required to achieve a desired excess risk $e$ for each $n$ with $r$ unknown. As $r$ is unknown, we construct estimates of $r$. Given an estimate of $r$, we determine selection rules for the number of samples $K_n$ that achieve a target excess risk $e$.

Our problem has connections with multi-task learning (MTL) and transfer learning. In multi-task learning, one tries to learn several tasks simultaneously, as in [2], [3], and [4], by exploiting the relationships between the tasks. In transfer learning, knowledge from a source task is transferred to a target task, either with or without additional training data for the target task [5]. Multi-task learning could be applied to our problem by running an MTL algorithm each time a new task arrives, while remembering all prior tasks; however, this approach incurs a memory and computational burden. Transfer learning lacks the sequential nature of our problem. For multi-task and transfer learning, there are theoretical guarantees on regret for some algorithms [6].

We can also consider the concept drift problem, in which we observe a stream of incoming data that potentially changes over time, and the goal is to predict some property of each piece of data as it arrives. After prediction, we incur a loss that is revealed to us. For example, we could observe a feature $w_n$ and predict the label $y_n$, as in [7]. Some approaches for concept drift use iterative algorithms such as SGD, but without specific models of how the data changes. As a result, only simulation results showing good performance are available. There are also bandit approaches in which one of a finite number of predictors must be applied to the data, as in [8]; for these, there are regret guarantees using techniques for analyzing bandit problems.

Another relevant model is sequential supervised learning (see [9]), in which we observe a stream of data consisting of feature/label pairs $(w_n, y_n)$ at time $n$, with $w_n$ the feature vector and $y_n$ the label. At time $n$, we want to predict $y_n$ given $w_n$.
One approach to this problem, studied in [10] and [11], is to look at the $L$ most recent pairs $\{(w_{n-i}, y_{n-i})\}_{i=1}^{L}$ and develop a predictor at time $n$ by applying a supervised learning algorithm to this training data. Another approach is to assume that there is an underlying hidden Markov model (HMM) [12]: the label $y_n$ represents the hidden state, the pair $(w_n, y_n)$ represents a noisy observation of this state, and HMM inference techniques are used to estimate $y_n$.

2 r Known
For the analysis, we need the following assumptions on the functions $f_n(x)$ and the optimization algorithm:

A.1
For the optimization algorithm under consideration, there is a function $b(d, K_n)$ such that $\mathbb{E}[f_n(x_n)] - f_n(x_n^*) \leq b(d, K_n)$, with $K_n$ the number of samples from $p_n$ and $\mathbb{E}\|x_n(0) - x_n^*\| \leq d$, where $x_n(0)$ is the initial point of the optimization algorithm at time $n$. Finally, $b(d, K_n)$ is non-decreasing in $d$.

A.2
Each loss function $\ell(x,z)$ is differentiable in $x$. Each $f_n(x)$ is strongly convex with parameter $m$, i.e.,

$$f_n(y) \geq f_n(x) + \langle \nabla_x f_n(x), y - x \rangle + \tfrac{m}{2}\|y - x\|^2.$$

A.3

$\operatorname{diam}(X) < +\infty$.

A.4
We can find initial points $x_1$ and $x_2$ that satisfy the excess risk criterion with $e_1$ and $e_2$ known, i.e., $\mathbb{E}[f_i(x_i)] - f_i(x_i^*) \leq e_i$ for $i = 1, 2$.

Remarks: For assumption A.1, we assume that the bound $b(d, K_n)$ depends on the number of samples $K_n$ and not on the number of iterations. For SGD, the number of iterations generally equals $K_n$, as each sample is used to produce one noisy gradient. In addition, we often set $x_n(0) = x_{n-1}$. See Appendix A for a discussion of useful $b(d, K_n)$ bounds. For assumption A.4, we can fix $K_i$ and set $e_i = b(\operatorname{diam}(X), K_i)$ for $i = 1, 2$.
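To make the sequential structure concrete, here is a minimal sketch (ours, not the paper's experimental code) of the loop that the framework analyzes: at each time $n$, draw $K_n$ fresh samples from $p_n$ and run SGD started from the previous approximate minimizer, matching the remark $x_n(0) = x_{n-1}$ above. The sampler `sample_p`, gradient oracle `grad_loss`, and step-size schedule `step` are application-supplied placeholders.

```python
import numpy as np

def sgd_task(x0, samples, grad_loss, step):
    """One pass of SGD over the K_n samples of task n, started at x0."""
    x = np.array(x0, dtype=float)
    for k, z in enumerate(samples, start=1):
        x = x - step(k) * grad_loss(x, z)   # one noisy gradient per sample
    return x

def sequential_minimizers(x1, sample_p, grad_loss, K, N, step):
    """Approximate minimizers x_1, ..., x_N with warm starts x_n(0) = x_{n-1}."""
    xs = [np.array(x1, dtype=float)]
    for n in range(2, N + 1):
        samples = sample_p(n, K(n))          # K_n fresh samples from p_n
        xs.append(sgd_task(xs[-1], samples, grad_loss, step))
    return xs
```

For constrained problems, a projection onto $X$ would be applied after each step; it is omitted here for brevity.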
In this section, we assume that $r$ in (2) or (3) is known; whether (2) or (3) holds does not affect the analysis here. Later we estimate $r$, and in that case whether (2) or (3) holds matters substantially.

We want to find a bound $e_n$ on the excess risk at time $n$ in terms of $K_n$ and $r$, i.e., $e_n$ such that $\mathbb{E}[f_n(x_n)] - f_n(x_n^*) \leq e_n$. The idea is to start with the bounds from assumption A.4 and proceed inductively using the previous $e_{n-1}$ and $r$ from (2). Suppose that $e_{n-1}$ bounds the excess risk at time $n-1$. Using the triangle inequality, strong convexity, and (2), we have

$$\mathbb{E}\|x_{n-1} - x_n^*\| \leq \mathbb{E}\|x_{n-1} - x_{n-1}^*\| + \|x_n^* - x_{n-1}^*\| \leq \sqrt{\frac{2}{m}\big(\mathbb{E}[f_{n-1}(x_{n-1})] - f_{n-1}(x_{n-1}^*)\big)} + \|x_n^* - x_{n-1}^*\| \leq \sqrt{\frac{2e_{n-1}}{m}} + r. \tag{4}$$

In comparison, we could use the estimate $\operatorname{diam}(X)$ to bound $\mathbb{E}\|x_{n-1} - x_n^*\|$ and select $K_n$ accordingly. If the bound in (4) is much smaller than $\operatorname{diam}(X)$, then we need significantly fewer samples $K_n$ to guarantee a desired excess risk. Now, setting $x_n(0) = x_{n-1}$ and using the bound $b(d, K_n)$ from assumption A.1, we can set

$$e_n = b\!\left(\sqrt{\frac{2e_{n-1}}{m}} + r,\; K_n\right) \quad \forall n \geq 2.$$

To achieve $e_n \leq e$ for all $n$, we set $K_1 = \min\{K \geq 1 \mid b(\operatorname{diam}(X), K) \leq e\}$ and $K_n = K^*$ for $n \geq 2$, with

$$K^* = \min\left\{K \geq 1 \;\middle|\; b\!\left(\sqrt{\frac{2e}{m}} + r,\; K\right) \leq e\right\}. \tag{5}$$
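As an illustration of the selection rule, the sketch below searches for the smallest $K$ meeting the target excess risk, first with the conservative diameter bound and then with the drift-aware bound from (5). The bound `b` is whatever function assumption A.1 supplies; the closed form used in the example is a hypothetical stand-in, not one of the bounds from Appendix A.

```python
import math

def smallest_K(b, d, e, K_max=10**6):
    """min{K >= 1 : b(d, K) <= e}, or None if not reached by K_max."""
    for K in range(1, K_max + 1):
        if b(d, K) <= e:
            return K
    return None

def sample_sizes(b, diam_X, m, r, e):
    """K_1 from the diameter bound and K* from the drift-aware bound (5)."""
    K1 = smallest_K(b, diam_X, e)
    K_star = smallest_K(b, math.sqrt(2.0 * e / m) + r, e)
    return K1, K_star

# Hypothetical bound of the form b(d, K) = (m/2) d^2 rho^K + c/K.
rho, c, m = 0.9, 0.05, 1.0
b = lambda d, K: 0.5 * m * d ** 2 * rho ** K + c / K
print(sample_sizes(b, diam_X=2.0, m=m, r=0.1, e=0.2))  # -> (22, 4)
```

The gap between $K_1$ and $K^*$ illustrates the savings promised by (4) when the drift $r$ is small relative to $\operatorname{diam}(X)$.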
3 r Estimation

In practice, we do not know $r$, so we must construct an estimate $\hat{r}_n$ using the samples from each distribution $p_n$. We introduce two approaches to estimating the one step change $\|x_i^* - x_{i-1}^*\|$, and methods to combine these one step estimates under assumptions (2) and (3). We show that for our estimate $\hat{r}_n$ and appropriately chosen sequences $\{t_n\}$, for all $n$ large enough, $\hat{r}_n + t_n \geq r$ almost surely. With this property, analysis similar to that of Section 2 holds.

3.1 Selection of K_n

One source of difficulty in estimating $r$ is that we allow $K_n$ to be selected in a data-dependent way, so $K_n$ is itself a random variable. We assume that $K_n$ is selected using only information available at the end of time $n-1$. To make this precise, we define a filtration of sigma algebras describing the available information. First, we define the sigma algebra $\mathcal{K}_0$ containing all the information on the initial conditions of our algorithm. For example, we may start at a random point $x_0$, and then $\mathcal{K}_0 = \sigma(x_0)$; $\mathcal{K}_0$ may also contain information about $K_1$ and $K_2$. Next, we define the filtration

$$\mathcal{K}_n = \sigma\big(\{z_n(k)\}_{k=1}^{K_n}\big) \vee \mathcal{K}_{n-1} \quad \forall n \geq 1, \tag{6}$$

where $\mathcal{F} \vee \mathcal{G} = \sigma(\mathcal{F} \cup \mathcal{G})$ is the merge operator for sigma algebras. The sigma algebra $\mathcal{K}_n$ contains all the information available to us at the end of time $n$. We assume that $K_n$ is $\mathcal{K}_{n-1}$-measurable to capture the idea that $K_n$ is chosen using only information available at the end of time $n-1$.

3.2 One Step Estimates

We first estimate the one step changes $\|x_i^* - x_{i-1}^*\|$, with the estimate denoted by $\tilde{r}_i$. Implicitly, we assume that all one step estimates are capped at $\operatorname{diam}(X)$, since trivially $\|x_n^* - x_{n-1}^*\| \leq \operatorname{diam}(X)$.

We construct the first estimate $\tilde{r}_i$ of the one step change $\|x_i^* - x_{i-1}^*\|$ as follows. Using the triangle inequality and variational inequalities from [13] yields

$$\|x_i^* - x_{i-1}^*\| \leq \|x_i - x_{i-1}\| + \|x_i - x_i^*\| + \|x_{i-1} - x_{i-1}^*\| \leq \|x_i - x_{i-1}\| + \frac{1}{m}\|\nabla_x f_i(x_i)\| + \frac{1}{m}\|\nabla_x f_{i-1}(x_{i-1})\|.$$

We then approximate $\|\nabla_x f_i(x_i)\| = \|\mathbb{E}_{z_i \sim p_i}[\nabla_x \ell(x_i, z_i)]\|$ by $\big\|\frac{1}{K_i}\sum_{k=1}^{K_i} \nabla_x \ell(x_i, z_i(k))\big\|$ to yield the following estimate, which we call the direct estimate:

$$\tilde{r}_i \triangleq \|x_i - x_{i-1}\| + \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i} \nabla_x \ell(x_i, z_i(k))\right\| + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}} \nabla_x \ell(x_{i-1}, z_{i-1}(k))\right\|.$$
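In code, the direct estimate is a few lines given the stored iterates and per-sample gradients; the sketch below is our illustration, with the strong convexity parameter $m$ assumed known, and computes $\tilde{r}_i$ exactly as defined above.

```python
import numpy as np

def direct_estimate(x_i, x_prev, grads_i, grads_prev, m):
    """Direct one step estimate of ||x*_i - x*_{i-1}||.

    grads_i: (K_i, d) array with rows grad_x l(x_i, z_i(k));
    grads_prev: gradients at x_{i-1} on the previous task's samples;
    m: strong convexity parameter from A.2.
    """
    g_i = np.linalg.norm(np.mean(grads_i, axis=0))        # ~ ||grad f_i(x_i)||
    g_prev = np.linalg.norm(np.mean(grads_prev, axis=0))  # ~ ||grad f_{i-1}(x_{i-1})||
    return np.linalg.norm(x_i - x_prev) + g_i / m + g_prev / m
```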
The second estimate is based on an integral probability metric. Given a class of functions $F$ where each $f \in F$ maps $Z \to \mathbb{R}$, an integral probability metric (IPM) [14] between two distributions $p$ and $q$ is defined to be

$$\gamma_F(p, q) \triangleq \sup_{f \in F}\big|\mathbb{E}_{z \sim p}[f(z)] - \mathbb{E}_{\tilde{z} \sim q}[f(\tilde{z})]\big|.$$

We consider an extension of this idea, which we call a vector IPM, in which the class of functions $F$ maps $Z \to X$:

$$\gamma_F^V(p, q) \triangleq \sup_{f \in F}\big\|\mathbb{E}_{z \sim p}[f(z)] - \mathbb{E}_{\tilde{z} \sim q}[f(\tilde{z})]\big\|. \tag{7}$$

Lemma 1 shows that a vector IPM can be used to bound the change in minimizer at time $i$; it follows from variational inequalities in [13] and the assumption that $\{\nabla_x \ell(x, \cdot) : x \in X\} \subset F$.

Lemma 1. Assume that $\{\nabla_x \ell(x, \cdot) : x \in X\} \subset F$. Then $\|x_i^* - x_{i-1}^*\| \leq \frac{1}{m}\gamma_F^V(p_i, p_{i-1})$.

Proof. By exploiting variational inequalities from [13], we can show that

$$\|x_i^* - x_{i-1}^*\| \leq \frac{1}{m}\|\nabla_x f_i(x_{i-1}^*) - \nabla_x f_{i-1}(x_{i-1}^*)\| = \frac{1}{m}\big\|\mathbb{E}_{z_i \sim p_i}[\nabla_x \ell(x_{i-1}^*, z_i)] - \mathbb{E}_{z_{i-1} \sim p_{i-1}}[\nabla_x \ell(x_{i-1}^*, z_{i-1})]\big\|.$$

Since $\{\nabla_x \ell(x, \cdot) : x \in X\} \subset F$, we have

$$\|\nabla_x f_i(x_{i-1}^*) - \nabla_x f_{i-1}(x_{i-1}^*)\| \leq \sup_{f \in F}\big\|\mathbb{E}_{z_i \sim p_i}[f(z_i)] - \mathbb{E}_{z_{i-1} \sim p_{i-1}}[f(z_{i-1})]\big\| = \gamma_F^V(p_i, p_{i-1}). \;\;∎$$

We cannot compute this vector IPM, since we do not know the distributions $p_i$ and $p_{i-1}$. Instead, we plug in the empirical distributions $\hat{p}_i$ and $\hat{p}_{i-1}$ to yield the estimate $\frac{1}{m}\gamma_F^V(\hat{p}_i, \hat{p}_{i-1})$. This estimate is biased upward, which ensures that $\|x_i^* - x_{i-1}^*\| \leq \mathbb{E}\big[\frac{1}{m}\gamma_F^V(\hat{p}_i, \hat{p}_{i-1})\big]$.

Our estimate is still not in closed form, since there is a supremum over $F$ in the computation of $\gamma_F^V(\hat{p}_i, \hat{p}_{i-1})$. For the class of functions

$$F = \big\{f \;\big|\; \|f(z) - f(\tilde{z})\| \leq \rho(z, \tilde{z})\big\}, \tag{8}$$

we can compute an upper bound $G_i$ on $\gamma_F^V(\hat{p}_i, \hat{p}_{i-1})$, yielding a computable estimate $\tilde{r}_i = \frac{1}{m}G_i$. Set $\tilde{z}_i(k) = z_i(k)$ if $1 \leq k \leq K_i$ and $\tilde{z}_i(k) = z_{i-1}(k - K_i)$ if $K_i + 1 \leq k \leq K_i + K_{i-1}$. From (7), we have

$$\gamma_F^V(\hat{p}_i, \hat{p}_{i-1}) = \sup_{f \in F}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i} f(\tilde{z}_i(k)) - \frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}} f(\tilde{z}_i(K_i + k))\right\|.$$

We can relax this supremum by maximizing over the function values $f(\tilde{z}_i(k))$, denoted by $a_k$, in the following non-convex quadratically constrained quadratic program (QCQP):

$$\begin{aligned} \text{maximize} \quad & \left\|\frac{1}{K_i}\sum_{k=1}^{K_i} a_k - \frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}} a_{K_i+k}\right\|^2 \\ \text{subject to} \quad & \|a_k - a_j\| \leq \rho(\tilde{z}_i(k), \tilde{z}_i(j)) \quad \forall k < j. \end{aligned}$$

The constraints ensure that the function values $a_k$ can correspond to a function in $F$ from (8). The value of this QCQP may not equal the vector IPM exactly, but it provides an upper bound. Finally, we note that this QCQP can be converted to its dual form to yield an SDP, which is often easier to solve.

The direct estimate is easier to compute but may be loose if $\|x_n - x_n^*\|$ is large; in that case the vector IPM approach is in general tighter. However, the vector IPM is more difficult to compute, due to the need to solve a QCQP or SDP and to verify the inclusion condition of Lemma 1. Also, the number of constraints in the QCQP or SDP grows quadratically in the number of samples.

3.3 Combined Estimates

Assuming that $\|x_i^* - x_{i-1}^*\| = r$ as in (2), we average the one step estimates $\tilde{r}_i$ to yield a better estimate

$$\hat{r}_n = \frac{1}{n-1}\sum_{i=2}^{n} \tilde{r}_i$$

of $r$ at each time $n$ under (2). To analyze the behavior of the combined estimates, we use the sub-Gaussian concentration inequalities detailed in Appendix B. Lemma 22 is of particular importance to our analysis.

3.3.1 Direct Estimate

The difficulty in analyzing the direct estimate is that, in approximating $\frac{1}{m}\|\nabla f_i(x_i)\|$ by $\frac{1}{m}\big\|\frac{1}{K_i}\sum_{k=1}^{K_i} \nabla_x \ell(x_i, z_i(k))\big\|$, the iterate $x_i$ depends on all the samples $\{z_i(k)\}_{k=1}^{K_i}$.
To illustrate the problem, consider drawing two independent copies $\{z_i(k)\}_{k=1}^{K_i} \sim p_i$ and $\{\tilde{z}_i(k)\}_{k=1}^{K_i} \sim p_i$ of the samples. Suppose that we use the second copy $\{\tilde{z}_i(k)\}_{k=1}^{K_i}$ to compute an iterate $\tilde{x}_i$ using our optimization algorithm of choice, started from $x_{i-1}$. Then we approximate $\frac{1}{m}\|\nabla f_i(\tilde{x}_i)\|$ by

$$\frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x \ell(\tilde{x}_i, z_i(k))\right\|.$$

Since $\tilde{x}_i$ is independent of $\{z_i(k)\}_{k=1}^{K_i}$, this quantity is the norm of an average of independent random variables conditioned on $\tilde{x}_i$, which allows us to apply standard concentration inequalities for norms of random variables, as in [15]. In this section, we argue that re-using the samples $\{z_i(k)\}_{k=1}^{K_i}$ to compute $x_i$ is not too far from using a second independent draw $\{\tilde{z}_i(k)\}_{k=1}^{K_i}$.

For the analysis, we need the following additional assumptions:

B.1
The loss function $\ell(x,z)$ has uniformly Lipschitz continuous gradients in $x$ with modulus $L$, i.e.,

$$\|\nabla_x \ell(x, z) - \nabla_x \ell(\tilde{x}, z)\| \leq L\|x - \tilde{x}\| \quad \forall z \in Z.$$

B.2
Assuming $X$ is $d$-dimensional, each component $j$ of the gradient error $\nabla_x \ell(x, z_n) - \nabla_x f_n(x)$ satisfies

$$\mathbb{E}\left[\exp\left\{s\big(\nabla_x \ell(x, z_n) - \nabla_x f_n(x)\big)_j\right\} \,\middle|\, x\right] \leq \exp\left\{\frac{C_g}{2d}s^2\right\}.$$

Assumption B.1 is reasonable if the space $Z$ containing $z$ is compact. Although in practice the distribution of the gradient error could depend on $x$, we assume that the bound $C_g$ does not depend on $x$. We can view this as a pessimistic assumption corresponding to choosing the worst case bound over $x$ and the resulting $C_g$. This is a common assumption in high-probability analyses of optimization algorithms, as in [16] for example.

To proceed, we first define two other useful estimates of $r$. As discussed above, suppose that we make a second independent draw of samples $\{\tilde{z}_i(k)\}_{k=1}^{K_i}$ from $p_i$ and use these samples to compute $\tilde{x}_i$ in the same manner as $x_i$, started from $x_{i-1}$, except with $\{\tilde{z}_i(k)\}_{k=1}^{K_i}$ used in place of $\{z_i(k)\}_{k=1}^{K_i}$. Then define

$$\tilde{r}_i^{(1)} \triangleq \|\tilde{x}_i - \tilde{x}_{i-1}\| + \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x \ell(\tilde{x}_i, z_i(k))\right\| + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_x \ell(\tilde{x}_{i-1}, z_{i-1}(k))\right\|.$$

This has the same form as the direct estimate, with $\tilde{x}_i$ in place of $x_i$. Next, define

$$\tilde{r}_i^{(2)} \triangleq \|\tilde{x}_i - \tilde{x}_{i-1}\| + \frac{1}{m}\|\nabla f_i(\tilde{x}_i)\| + \frac{1}{m}\|\nabla f_{i-1}(\tilde{x}_{i-1})\|.$$

This is in fact the bound that inspired the direct estimate. We also define the averaged estimates

$$\hat{r}_n^{(1)} \triangleq \frac{1}{n-1}\sum_{i=2}^{n}\tilde{r}_i^{(1)}, \qquad \hat{r}_n^{(2)} \triangleq \frac{1}{n-1}\sum_{i=2}^{n}\tilde{r}_i^{(2)}.$$

We know that $\hat{r}_n^{(2)} \geq r$. Thus, if we can control the gap between the pair $\hat{r}_n$ and $\hat{r}_n^{(1)}$ and the pair $\hat{r}_n^{(1)}$ and $\hat{r}_n^{(2)}$, then we can ensure that $\hat{r}_n$ plus an appropriate constant upper bounds $r$ for all $n$ large enough, as desired. First, we show that $\hat{r}_n^{(1)}$ upper bounds $r$ eventually.

Lemma 2.
Suppose that the following conditions hold:

1. B.1-B.2 hold.
2. The sequence $\{t_n\}$ satisfies $\sum_{n=2}^{\infty} e^{-C(n-1)t_n^2} < \infty$ for all $C > 0$.

Then for all $n$ large enough, it holds that $\hat{r}_n^{(1)} + \hat{C}_n^{(1)} + t_n \geq r$ almost surely, with

$$\hat{C}_n^{(1)} \triangleq \frac{1}{m(n-1)}\left(\sqrt{\frac{C_g}{K_1}} + 2\sum_{i=2}^{n-1}\sqrt{\frac{C_g}{K_i}} + \sqrt{\frac{C_g}{K_n}}\right).$$

Proof.
By the triangle and reverse triangle inequalities,

$$m\,\big|\tilde{r}_i^{(1)} - \tilde{r}_i^{(2)}\big| \leq \left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| + \left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\big(\nabla_x \ell(\tilde{x}_{i-1}, z_{i-1}(k)) - \nabla_x f_{i-1}(\tilde{x}_{i-1})\big)\right\|.$$

Averaging over $i = 2, \ldots, n$ and applying the triangle inequality again gives

$$\big|\hat{r}_n^{(1)} - \hat{r}_n^{(2)}\big| \leq \frac{1}{m(n-1)}\left(\left\|\frac{1}{K_1}\sum_{k=1}^{K_1}\big(\nabla_x \ell(\tilde{x}_1, z_1(k)) - \nabla_x f_1(\tilde{x}_1)\big)\right\| + 2\sum_{i=2}^{n-1}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| + \left\|\frac{1}{K_n}\sum_{k=1}^{K_n}\big(\nabla_x \ell(\tilde{x}_n, z_n(k)) - \nabla_x f_n(\tilde{x}_n)\big)\right\|\right). \tag{9}$$

We analyze the behavior of this bound using Lemma 22 of Appendix B. Define the filtration

$$\mathcal{F}_i = \sigma\left(\bigcup_{j=1}^{i}\{z_j(k)\}_{k=1}^{K_j} \cup \bigcup_{j=1}^{i+1}\{\tilde{z}_j(k)\}_{k=1}^{K_j}\right) \vee \mathcal{K}_0, \quad i = 1, \ldots, n, \tag{10}$$

with $\mathcal{K}_0$ from (6). Note that $\mathcal{K}_{i-1} \subset \mathcal{F}_{i-1}$, so $K_i$ is $\mathcal{F}_{i-1}$-measurable. In addition, $\tilde{x}_i$, but not $x_i$, is $\mathcal{F}_{i-1}$-measurable. Define the random variables

$$V_i = \left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| - \mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| \,\middle|\, \mathcal{F}_{i-1}\right], \quad i = 1, \ldots, n.$$

Clearly, $V_i$ is $\mathcal{F}_i$-measurable, since $V_i$ is a function of $\tilde{x}_i$, $K_i$, and $\{z_i(k)\}_{k=1}^{K_i}$, all of which are $\mathcal{F}_i$-measurable. Conditioned on $\mathcal{F}_{i-1}$, the sum

$$\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big) \tag{11}$$

is an average of i.i.d. random variables.
We now work with the conditional measure $\mathbb{P}\{\cdot \mid \mathcal{F}_{i-1}\}$ to compute the sub-Gaussian norms of (11), defined in (24) and (25) of Appendix B. By assumption B.2, each component satisfies

$$\tau^2\Big(\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)_j\Big) \leq \frac{C_g}{d}.$$

Therefore, applying Lemma 24 and using the conditional independence of the samples given $\mathcal{F}_{i-1}$ yields

$$\tau^2\left(\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right) \leq \frac{C_g}{K_i}.$$

Applying Lemma 25 from [17] to the conditional distribution $\mathbb{P}\{\cdot \mid \mathcal{F}_{i-1}\}$, we have

$$\mathbb{P}\left\{\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| > t \,\middle|\, \mathcal{F}_{i-1}\right\} \leq 2\exp\left\{-\frac{K_i t^2}{2C_g}\right\}.$$

Since the conditional expectation of the norm is non-negative,

$$\mathbb{P}\{V_i > t \mid \mathcal{F}_{i-1}\} \leq \mathbb{P}\left\{\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| > t \,\middle|\, \mathcal{F}_{i-1}\right\} \leq 2\exp\left\{-\frac{K_i t^2}{2C_g}\right\} \leq 2\exp\left\{-\frac{t^2}{2C_g}\right\}.$$

Since $\mathbb{E}[V_i \mid \mathcal{F}_{i-1}] = 0$,
we can apply Lemma 26 with $c = 1/(2C_g)$ to yield

$$\mathbb{E}\big[e^{sV_i} \mid \mathcal{F}_{i-1}\big] \leq \exp\big\{8C_g s^2\big\}.$$

This shows that the collection of random variables $\{V_i\}_{i=1}^n$ and the filtration $\{\mathcal{F}_i\}_{i=1}^n$ satisfy the conditions of Lemma 22. Before applying Lemma 22, we bound the conditional expectations. A direct computation conditioned on $\mathcal{F}_{i-1}$, using the conditional independence of the centered samples, gives

$$\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\|^2 \,\middle|\, \mathcal{F}_{i-1}\right] = \frac{1}{K_i^2}\sum_{k=1}^{K_i}\mathbb{E}\big[\|\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\|^2 \mid \mathcal{F}_{i-1}\big] \stackrel{(a)}{=} \frac{1}{K_i^2}\sum_{k=1}^{K_i}\sum_{q=1}^{d}\mathbb{E}\big[\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)_q^2 \mid \mathcal{F}_{i-1}\big] \stackrel{(b)}{\leq} \frac{1}{K_i^2}\sum_{k=1}^{K_i} d\,\frac{C_g}{d} \leq \frac{C_g}{K_i},$$

where (a) decomposes the squared norm into vector components and (b) follows since a centered sub-Gaussian random variable with parameter $C_g/d$ satisfies $\mathbb{E}[(\cdot)_q^2 \mid \mathcal{F}_{i-1}] \leq C_g/d$. Then, by Jensen's inequality,

$$\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\right\| \,\middle|\, \mathcal{F}_{i-1}\right] \leq \sqrt{\frac{C_g}{K_i}}.$$

Define the constants $a_1 = a_n = \frac{1}{m(n-1)}$ and $a_2 = \cdots = a_{n-1} = \frac{2}{m(n-1)}$, so that $\|a\|^2 \leq \frac{4}{m^2(n-1)}$. By (9) and Lemma 22 with this choice of $a$,

$$\mathbb{P}\left\{|\hat{r}_n^{(1)} - \hat{r}_n^{(2)}| > \sum_{i=1}^{n} a_i\sqrt{\frac{C_g}{K_i}} + t\right\} \leq \mathbb{P}\left\{\sum_{i=1}^{n} a_i V_i > t\right\} \leq \exp\left\{-\frac{c\,m^2(n-1)t^2}{C_g}\right\}$$

for an absolute constant $c > 0$. Combining this bound with $\hat{r}_n^{(2)} \geq r$ yields

$$\sum_{n=2}^{\infty}\mathbb{P}\left\{\hat{r}_n^{(1)} < r - \hat{C}_n^{(1)} - t_n\right\} \leq \sum_{n=2}^{\infty}\mathbb{P}\left\{|\hat{r}_n^{(1)} - \hat{r}_n^{(2)}| > \hat{C}_n^{(1)} + t_n\right\} \leq \sum_{n=2}^{\infty}\exp\left\{-\frac{c\,m^2(n-1)t_n^2}{C_g}\right\} < \infty,$$

since $\hat{C}_n^{(1)} = \sum_{i=1}^{n} a_i\sqrt{C_g/K_i}$ as claimed. The result follows from the Borel-Cantelli lemma. ∎

Next, we show that $\hat{r}_n$ eventually upper bounds $\hat{r}_n^{(1)}$ under a general assumption on the optimization algorithm. When the conditions of Lemmas 2 and 3 are both satisfied, $\hat{r}_n$ plus a constant upper bounds $r$.

Lemma 3.
Suppose that the following conditions hold:

1. B.1-B.2 hold.
2. There exist bounds $\mathbb{E}[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}] \leq C(K_i)$ for $i = 1, \ldots, n$.
3. The sequence $\{t_n\}$ satisfies $\sum_{n=2}^{\infty} e^{-C(n-1)t_n^2} < \infty$ for all $C > 0$.

Then for all $n$ large enough, it holds that $\hat{r}_n + \hat{C}_n + t_n \geq \hat{r}_n^{(1)}$ almost surely, with

$$\hat{C}_n \triangleq \frac{\big(1 + \frac{L}{m}\big)}{n-1}\left(C(K_1) + 2\sum_{i=2}^{n-1} C(K_i) + C(K_n)\right).$$

Proof. By the triangle inequality, the reverse triangle inequality, and the Lipschitz continuity of $\nabla_x \ell(x,z)$ in $x$ from assumption B.1,

$$|\tilde{r}_i - \tilde{r}_i^{(1)}| \leq \big|\|x_i - x_{i-1}\| - \|\tilde{x}_i - \tilde{x}_{i-1}\|\big| + \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x \ell(x_i, z_i(k)) - \nabla_x \ell(\tilde{x}_i, z_i(k))\big)\right\| + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\big(\nabla_x \ell(x_{i-1}, z_{i-1}(k)) - \nabla_x \ell(\tilde{x}_{i-1}, z_{i-1}(k))\big)\right\| \leq \left(1 + \frac{L}{m}\right)\big(\|x_i - \tilde{x}_i\| + \|x_{i-1} - \tilde{x}_{i-1}\|\big),$$

so that

$$|\hat{r}_n - \hat{r}_n^{(1)}| \leq \frac{1}{n-1}\sum_{i=2}^{n}|\tilde{r}_i - \tilde{r}_i^{(1)}| \leq \frac{\big(1+\frac{L}{m}\big)}{n-1}\left(\|x_1 - \tilde{x}_1\| + 2\sum_{i=2}^{n-1}\|x_i - \tilde{x}_i\| + \|x_n - \tilde{x}_n\|\right).$$

We again apply Lemma 22 of Appendix B to this upper bound, using the sigma algebras

$$\mathcal{F}_i = \sigma\left(\bigcup_{j=1}^{i}\{z_j(k)\}_{k=1}^{K_j} \cup \bigcup_{j=1}^{i}\{\tilde{z}_j(k)\}_{k=1}^{K_j}\right) \vee \mathcal{K}_0, \quad i = 1, \ldots, n. \tag{12}$$

Define the random variables

$$V_i = \|x_i - \tilde{x}_i\| - \mathbb{E}\big[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}\big].$$

Clearly, $V_i$ is $\mathcal{F}_i$-measurable. Since $-\operatorname{diam}(X) \leq V_i \leq \operatorname{diam}(X)$ and
$\mathbb{E}[V_i \mid \mathcal{F}_{i-1}] = 0$, we can apply the conditional version of Hoeffding's lemma (Lemma 23) to yield
$$\mathbb{E}\big[e^{sV_i} \mid \mathcal{F}_{i-1}\big] \leq \exp\left\{\tfrac{1}{2}\operatorname{diam}(X)^2 s^2\right\}.$$

The collection $\{V_i\}_{i=1}^n$ and the filtration $\{\mathcal{F}_i\}_{i=1}^n$ satisfy the conditions of Lemma 22. Before applying Lemma 22, we bound the conditional expectations: by assumption,

$$\mathbb{E}\big[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}\big] \leq C(K_i), \quad i = 1, \ldots, n,$$

so that

$$\frac{\big(1+\frac{L}{m}\big)}{n-1}\left(\mathbb{E}\big[\|x_1 - \tilde{x}_1\| \mid \mathcal{F}_0\big] + 2\sum_{i=2}^{n-1}\mathbb{E}\big[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}\big] + \mathbb{E}\big[\|x_n - \tilde{x}_n\| \mid \mathcal{F}_{n-1}\big]\right) \leq \frac{\big(1+\frac{L}{m}\big)}{n-1}\left(C(K_1) + 2\sum_{i=2}^{n-1}C(K_i) + C(K_n)\right) = \hat{C}_n.$$

Set $a_1 = a_n = \frac{(1+L/m)}{n-1}$ and $a_2 = \cdots = a_{n-1} = \frac{2(1+L/m)}{n-1}$, so that $\|a\|^2 \leq \frac{4n(1+L/m)^2}{(n-1)^2}$. Applying the bound above and Lemma 22 with this choice of $a$ yields

$$\mathbb{P}\big\{|\hat{r}_n - \hat{r}_n^{(1)}| > \hat{C}_n + t\big\} \leq \mathbb{P}\left\{\sum_{i=1}^{n} a_i V_i > t\right\} \leq \exp\left\{-\frac{(n-1)^2 t^2}{8n\big(1+\frac{L}{m}\big)^2\operatorname{diam}(X)^2}\right\}.$$

Finally,

$$\sum_{n=2}^{\infty}\mathbb{P}\big\{\hat{r}_n < \hat{r}_n^{(1)} - \hat{C}_n - t_n\big\} \leq \sum_{n=2}^{\infty}\mathbb{P}\big\{|\hat{r}_n - \hat{r}_n^{(1)}| > \hat{C}_n + t_n\big\} < \infty,$$

and the claim follows from the Borel-Cantelli lemma. ∎

If Lemmas 2 and 3 both hold for the sequence $\{t_n/2\}$, then for all $n$ large enough it holds that

$$\hat{r}_n + \hat{C}_n + \hat{C}_n^{(1)} + t_n \geq r$$

almost surely.

Lemma 4. It always holds that

$$\mathbb{E}\big[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}\big] \leq 2\sqrt{\frac{2}{m}\,b\big(\operatorname{diam}(X), K_i\big)}.$$

Therefore, the choice $C(K_i) \triangleq 2\sqrt{\frac{2}{m}b(\operatorname{diam}(X), K_i)}$ satisfies the conditions of Lemma 3.

Proof. Using the sigma algebras defined in (12),

$$\mathbb{E}[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}] \leq \mathbb{E}[\|x_i - x_i^*\| \mid \mathcal{F}_{i-1}] + \mathbb{E}[\|\tilde{x}_i - x_i^*\| \mid \mathcal{F}_{i-1}] \leq \mathbb{E}\left[\sqrt{\tfrac{2}{m}\big(f_i(x_i) - f_i(x_i^*)\big)} \,\middle|\, \mathcal{F}_{i-1}\right] + \mathbb{E}\left[\sqrt{\tfrac{2}{m}\big(f_i(\tilde{x}_i) - f_i(x_i^*)\big)} \,\middle|\, \mathcal{F}_{i-1}\right] \leq \sqrt{\tfrac{2}{m}\mathbb{E}\big[f_i(x_i) - f_i(x_i^*) \mid \mathcal{F}_{i-1}\big]} + \sqrt{\tfrac{2}{m}\mathbb{E}\big[f_i(\tilde{x}_i) - f_i(x_i^*) \mid \mathcal{F}_{i-1}\big]} \leq 2\sqrt{\tfrac{2}{m}\,b\big(\operatorname{diam}(X), K_i\big)},$$

where the second inequality uses strong convexity and the third follows from Jensen's inequality. ∎

This choice of $C(K_n)$ works for any algorithm with an associated bound $b(d, K)$. For any particular algorithm, we believe that tighter bounds independent of $\operatorname{diam}(X)$ can be produced by copying the Lyapunov analysis used to analyze SGD in Appendix A. The analysis becomes algorithm dependent in that case and is omitted.

Finally, we state an overall theorem for the direct estimate, giving general combined conditions under which $\hat{r}_n$ upper bounds $r$.

Theorem 1.
If B.1-B.2 hold and the sequence $\{t_n\}$ satisfies $\sum_{n=2}^{\infty} e^{-Cnt_n^2} < \infty$ for all $C > 0$, then there is a sequence of constants $\{C_n\}$ such that, for all $n$ large enough, $\hat{r}_n + C_n + t_n \geq r$ almost surely.

Proof. Combine Lemmas 2 and 3 to yield the result with $C_n = \hat{C}_n + \hat{C}_n^{(1)}$. ∎

3.3.2 IPM Estimate

We first derive a version of Hoeffding's inequality that allows for some dependence among the random variables; we then use this concentration inequality to analyze $\hat{r}_n$ for the IPM estimate. Given an integer $W$, we construct a cover of $\{1, 2, \ldots, n\}$ by dividing the set into $W$ groups of integers spaced $W$ apart, i.e.,

$$A_j = \left\{j,\; j + W,\; j + 2W,\; \ldots,\; j + \left\lfloor\frac{n - j}{W}\right\rfloor W\right\}, \quad j = 1, \ldots, W. \tag{13}$$

Note that $\{1, 2, \ldots, n\} = \bigcup_{j=1}^{W} A_j$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. The proof of Lemma 5 below is nearly identical to the proof of the extension of Hoeffding's inequality from [18], with Lemma 22 used instead. If we refer to a filtration $\mathcal{F}_i$ with $i < 0$, we implicitly refer to $\mathcal{F}_0$.
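The cover (13) is straightforward to construct; the following sketch (our illustration) builds it and checks that it partitions $\{1, \ldots, n\}$.

```python
def cover(n, W):
    """Cover of {1, ..., n} by W groups of indices spaced W apart, as in (13)."""
    return [list(range(j, n + 1, W)) for j in range(1, W + 1)]

groups = cover(10, 3)   # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
assert sorted(i for g in groups for i in g) == list(range(1, 11))
```

Within each group, consecutive indices are $W$ apart, which is exactly the spacing that condition 3 of Lemma 5 below exploits.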
Lemma 5 (Dependent Hoeffding's Inequality). Suppose we are given a collection of random variables $\{V_i\}_{i=1}^n$ and a filtration $\{\mathcal{F}_i\}_{i=1}^n$ such that:

1. $a_i \leq V_i \leq b_i$ for constants $a_i$ and $b_i$, $i = 1, \ldots, n$;
2. $V_i$ is $\mathcal{F}_i$-measurable, $i = 1, \ldots, n$;
3. given an integer $W$ and a cover $\{A_j\}_{j=1}^W$ as in (13), for each $j$ it holds that

$$\mathbb{E}\big[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\big] = 0, \quad i = 1, \ldots, \left\lfloor\frac{n-j}{W}\right\rfloor, \quad \text{and} \quad \mathbb{E}\big[V_j \mid \mathcal{F}_0\big] = 0.$$

Then it holds that

$$\mathbb{P}\left\{\sum_{i=1}^{n} V_i > t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\} \quad \text{and} \quad \mathbb{P}\left\{\sum_{i=1}^{n} V_i < -t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}.$$

Proof.
Define

$$U_j \triangleq \sum_{i=0}^{\lfloor (n-j)/W \rfloor} V_{j+iW}, \quad j = 1, \ldots, W.$$

Let $\{p_j\}_{j=1}^W$ be a probability distribution on $\{1, \ldots, W\}$, to be specified later. By Jensen's inequality,

$$\exp\left\{s\sum_{i=1}^{n} V_i\right\} = \exp\left\{\sum_{j=1}^{W} p_j \frac{s}{p_j}U_j\right\} \leq \sum_{j=1}^{W} p_j\exp\left\{\frac{s}{p_j}U_j\right\},$$

so that

$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n} V_i\right\}\right] \leq \sum_{j=1}^{W} p_j\,\mathbb{E}\left[\exp\left\{\frac{s}{p_j}U_j\right\}\right].$$

Consider one term. Since $a_{j+iW} \leq V_{j+iW} \leq b_{j+iW}$ and $\mathbb{E}[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}] = 0$, the conditional version of Hoeffding's lemma (Lemma 23) yields

$$\mathbb{E}\big[e^{sV_{j+iW}} \mid \mathcal{F}_{j+(i-1)W}\big] \leq \exp\left\{\frac{(b_{j+iW} - a_{j+iW})^2 s^2}{8}\right\}.$$

Applying Lemma 22 to $\{V_{j+iW}\}_i$ and $\{\mathcal{F}_{j+iW}\}_i$ then gives

$$\mathbb{E}\left[\exp\left\{\frac{s}{p_j}U_j\right\}\right] \leq \exp\left\{\frac{s^2 c_j}{8p_j^2}\right\}, \quad \text{with} \quad c_j = \sum_{i=0}^{\lfloor (n-j)/W \rfloor}(b_{j+iW} - a_{j+iW})^2,$$

so that

$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n} V_i\right\}\right] \leq \sum_{j=1}^{W} p_j\exp\left\{\frac{s^2 c_j}{8p_j^2}\right\}.$$

Let $p_j = \sqrt{c_j}/T$ with $T = \sum_{j=1}^{W}\sqrt{c_j}$. Then

$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n} V_i\right\}\right] \leq \exp\left\{\frac{T^2 s^2}{8}\right\}.$$

Applying the Chernoff bound [19] and optimizing over $s$ yields

$$\mathbb{P}\left\{\sum_{i=1}^{n} V_i > t\right\} \leq \exp\big\{-2t^2/T^2\big\}.$$

Bounding $T$ with the Cauchy-Schwarz inequality gives

$$T^2 \leq \left(\sum_{j=1}^{W} 1\right)\left(\sum_{j=1}^{W} c_j\right) = W\sum_{i=1}^{n}(b_i - a_i)^2,$$

and the result follows. The proof for the other tail is nearly identical. ∎

If condition 3 of Lemma 5 does not hold, then it still holds that

$$\mathbb{P}\left\{\sum_{i=1}^{n} V_i > \sum_{j=1}^{W}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\mathbb{E}\big[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\big] + t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}.$$

If we can bound each conditional expectation $\mathbb{E}[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}] \leq C_{j+iW}$
by an $\mathcal{F}_{j+(i-1)W}$-measurable random variable, then we have

$$\mathbb{P}\left\{\sum_{i=1}^{n} V_i > \sum_{i=1}^{n} C_i + t\right\} \leq \mathbb{P}\left\{\sum_{j=1}^{W}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\big(V_{j+iW} - \mathbb{E}[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}]\big) > t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}.$$

We have the following lemma characterizing the performance of the IPM estimate.
Lemma 6.
For the IPM estimate and any sequence $\{t_n\}$ such that

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{(n-1)t_n^2}{4\operatorname{diam}(X)^2}\right\} < \infty,$$

for all $n$ large enough it holds that $\hat{r}_n + t_n \geq r$ almost surely.

Proof. Define the random variables $V_i = \tilde{r}_i - \mathbb{E}[\tilde{r}_i \mid \mathcal{K}_{i-1}]$, with $\{\mathcal{K}_i\}$ defined in (6). We have $-\operatorname{diam}(X) \leq V_i \leq \operatorname{diam}(X)$. Clearly, $V_i$ is $\mathcal{K}_i$-measurable and
$\mathbb{E}[V_i \mid \mathcal{K}_{i-1}] = 0$. None of the random variables $\{z_i(k)\}_{k=1}^{K_i}$ and $\{z_{i-1}(k)\}_{k=1}^{K_{i-1}}$ are $\mathcal{K}_{i-2}$-measurable, and $\tilde{r}_i$ and $\tilde{r}_{i-1}$ both depend on the samples from $p_{i-1}$, so we apply Lemma 5 with $W = 2$ to obtain

$$\mathbb{P}\left\{\sum_{i=2}^{n} V_i < -(n-1)t\right\} \leq \exp\left\{-\frac{2\big((n-1)t\big)^2}{2(n-1)\big(2\operatorname{diam}(X)\big)^2}\right\} = \exp\left\{-\frac{(n-1)t^2}{4\operatorname{diam}(X)^2}\right\}.$$

Also, regardless of how many samples $K_i$ and $K_{i-1}$ are taken, the IPM estimate is biased upward, so $\mathbb{E}[\tilde{r}_i \mid \mathcal{K}_{i-1}] \geq r$. Therefore,

$$\mathbb{P}\{\hat{r}_n < r - t\} \leq \mathbb{P}\left\{\sum_{i=2}^{n}\tilde{r}_i < \sum_{i=2}^{n}\mathbb{E}[\tilde{r}_i \mid \mathcal{K}_{i-1}] - (n-1)t\right\} = \mathbb{P}\left\{\sum_{i=2}^{n} V_i < -(n-1)t\right\} \leq \exp\left\{-\frac{(n-1)t^2}{4\operatorname{diam}(X)^2}\right\}.$$

Since $\sum_{n=2}^{\infty}\exp\{-(n-1)t_n^2/(4\operatorname{diam}(X)^2)\} < \infty$, it follows that $\sum_{n=2}^{\infty}\mathbb{P}\{\hat{r}_n + t_n < r\} < \infty$, which by the Borel-Cantelli lemma guarantees that, for all $n$ large enough, $\hat{r}_n + t_n \geq r$ almost surely. ∎

3.4 Estimation Under the Bounded Rate Model

We now consider estimating $r$ in the case that $\|x_n^* - x_{n-1}^*\| \leq r$, as in (3). We set $r_i \triangleq \|x_i^* - x_{i-1}^*\|$ and make the following assumption.

B.3
Assume that we have estimators $\hat{h}_W : \mathbb{R}^W \to \mathbb{R}$ such that:

1. $\mathbb{E}[\hat{h}_W(r_j, \ldots, r_{j-W+1})] \geq r$ for all $j \geq W \geq 1$;
2. for any random variables $\{\tilde{r}_i\}$ such that $\mathbb{E}[\tilde{r}_i] \geq \mathbb{E}[r_i]$, we have

$$\mathbb{E}\big[\hat{h}_W(\tilde{r}_j, \ldots, \tilde{r}_{j-W+1})\big] \geq \mathbb{E}\big[\hat{h}_W(r_j, \ldots, r_{j-W+1})\big].$$

For example, if $r_i \stackrel{\text{iid}}{\sim} \text{Unif}[0, r]$, then

$$\hat{h}_W(r_i, r_{i+1}, \ldots, r_{i+W-1}) = \frac{W+1}{W}\max\{r_i, r_{i+1}, \ldots, r_{i+W-1}\}$$

is an estimator of $r$ with the required properties. Also, note that the two conditions in B.3 together imply that

$$\mathbb{E}\big[\hat{h}_W(\tilde{r}_j, \ldots, \tilde{r}_{j-W+1})\big] \geq \mathbb{E}\big[\hat{h}_W(r_j, \ldots, r_{j-W+1})\big] \geq r.$$

Given an estimator satisfying assumption B.3, we compute $\tilde{r}^{(i)} = \hat{h}_W(\tilde{r}_i, \tilde{r}_{i-1}, \ldots, \tilde{r}_{i-W+1})$ and set

$$\hat{r}_n = \frac{1}{n-1}\sum_{i=2}^{n}\tilde{r}^{(i)} = \frac{1}{n-1}\sum_{i=2}^{n}\hat{h}_{\min\{W,\,i-1\}}\big(\tilde{r}_i, \tilde{r}_{i-1}, \ldots, \tilde{r}_{\max\{i-W+1,\,2\}}\big), \tag{14}$$

so that

$$\mathbb{E}[\hat{r}_n] = \frac{1}{n-1}\sum_{i=2}^{n}\mathbb{E}\big[\tilde{r}^{(i)}\big] \geq r.$$

Lemma 7 (IPM Single Step Estimates). For the estimator in (14) computed using the IPM one step estimates $\tilde{r}_i$, and any sequence $\{t_n\}$ such that

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{(n-1)t_n^2}{2(W+1)\operatorname{diam}(X)^2}\right\} < \infty,$$

it holds that $\hat{r}_n + t_n \geq r$ almost surely for all $n$ large enough.

Proof. We copy the proof of Lemma 6 with $W + 1$ in place of $2$, since $\tilde{r}^{(i)}$ and $\tilde{r}^{(j)}$ share samples whenever $|i - j| \leq W + 1$ but not when $|i - j| > W + 1$. This yields

$$\mathbb{P}\{\hat{r}_n < r - t\} \leq \exp\left\{-\frac{(n-1)t^2}{2(W+1)\operatorname{diam}(X)^2}\right\}.$$

We pay a price of a factor $W + 1$ in the exponent for the dependence among the windowed estimates $\tilde{r}^{(i)}$. By the Borel-Cantelli lemma, $\hat{r}_n + t_n \geq r$ almost surely for all $n$ large enough, as long as the series above converges. ∎
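For the uniform example in B.3, the windowed estimator and the combined estimate (14) can be computed as below (our sketch; the window length `W` is an application-dependent choice, and `r_tilde` stores the one step estimates $\tilde{r}_2, \tilde{r}_3, \ldots$ in order).

```python
import numpy as np

def h_hat(window):
    """Estimator of B.3 for the Unif[0, r] example: (W+1)/W times the max."""
    W = len(window)
    return (W + 1) / W * max(window)

def combined_estimate(r_tilde, W=5):
    """Combined bounded-rate estimate (14): average the windowed estimates,
    truncating the window at early times exactly as in (14)."""
    vals = [h_hat(r_tilde[max(0, p - W):p]) for p in range(1, len(r_tilde) + 1)]
    return float(np.mean(vals))
```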
To analyze the direct estimate, we need the following additional assumption.

B.4
Suppose that there exist absolute constants $\{\beta_i\}_{i=1}^W$ for any fixed $W$ such that

$$\big|\hat{h}_W(p_1, \ldots, p_W) - \hat{h}_W(q_1, \ldots, q_W)\big| \leq \sum_{i=1}^{W}\beta_i|p_i - q_i| \quad \forall p, q \in \mathbb{R}^W.$$

For the uniform example, we have

$$\left|\frac{W+1}{W}\max\{p_1, \ldots, p_W\} - \frac{W+1}{W}\max\{q_1, \ldots, q_W\}\right| \leq \frac{W+1}{W}\max\{|p_1 - q_1|, \ldots, |p_W - q_W|\} \leq \frac{W+1}{W}\sum_{i=1}^{W}|p_i - q_i|,$$

so $\beta_1 = \cdots = \beta_W = \frac{W+1}{W}$. Under assumption B.4, we can then show that

$$\hat{r}_n = \frac{1}{n-W}\sum_{i=W+1}^{n}\tilde{r}^{(i)}$$

eventually upper bounds $r$, by copying the proofs of the lemmas behind Theorem 1.

Lemma 8 (Direct Single Step Estimates). Suppose that the following conditions hold:

1. B.1-B.4 hold.
2. The sequence $\{t_n\}$ satisfies $\sum_{n=W+1}^{\infty} e^{-C(n-W)t_n^2} < \infty$ for all $C > 0$.
3. There are bounds $C(K)$ such that $\mathbb{E}[\|x_i - \tilde{x}_i\| \mid \mathcal{F}_{i-1}] \leq C(K_i)$.

Then for all $n$ large enough, it holds that $\hat{r}_n + \hat{U}_n + \hat{V}_n + t_n \geq r$ almost surely, with

$$\hat{U}_n = \frac{2\big(1+\frac{L}{m}\big)\sum_{j=1}^{W}\beta_j}{n-W}\sum_{i=1}^{n} C(K_i) \quad \text{and} \quad \hat{V}_n = \frac{2\sum_{j=1}^{W}\beta_j}{m(n-W)}\sum_{i=1}^{n}\sqrt{\frac{C_g}{K_i}}.$$
Proof. Define $\tilde{r}_i^{(1)}$ and $\tilde{r}_i^{(2)}$ as in Lemmas 2 and 3, and let $\hat{r}_n^{(2)}$ denote the windowed combination built from the $\tilde{r}_i^{(2)}$, so that $\mathbb{E}[\hat{r}_n^{(2)}] \geq r$ by B.3. First, by B.4,

$$|\hat{r}_n - \hat{r}_n^{(2)}| \leq \frac{1}{n-W}\sum_{i=W+1}^{n}\sum_{j=i-W+1}^{i}\beta_{i-j+1}\Big(|\tilde{r}_j - \tilde{r}_j^{(1)}| + |\tilde{r}_j^{(1)} - \tilde{r}_j^{(2)}|\Big) \leq \frac{\sum_{j=1}^{W}\beta_j}{n-W}\sum_{i=1}^{n}\Big(|\tilde{r}_i - \tilde{r}_i^{(1)}| + |\tilde{r}_i^{(1)} - \tilde{r}_i^{(2)}|\Big).$$

Second, define $U_i \triangleq \|x_i - \tilde{x}_i\|$ and $V_i \triangleq \big\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x\ell(\tilde{x}_i, z_i(k)) - \nabla_x f_i(\tilde{x}_i)\big)\big\|$. As in the proofs of Lemmas 2 and 3,

$$|\tilde{r}_i - \tilde{r}_i^{(1)}| \leq \left(1 + \frac{L}{m}\right)(U_i + U_{i-1}) \quad \text{and} \quad |\tilde{r}_i^{(1)} - \tilde{r}_i^{(2)}| \leq \frac{1}{m}(V_i + V_{i-1}),$$

so that

$$|\hat{r}_n - \hat{r}_n^{(2)}| \leq \frac{2\big(1+\frac{L}{m}\big)\sum_{j=1}^{W}\beta_j}{n-W}\sum_{i=1}^{n} U_i + \frac{2\sum_{j=1}^{W}\beta_j}{m(n-W)}\sum_{i=1}^{n} V_i.$$

Since $\frac{2(1+L/m)\sum_j\beta_j}{n-W}\sum_{i=1}^{n}\mathbb{E}[U_i \mid \mathcal{F}_{i-1}] \leq \hat{U}_n$ and $\frac{2\sum_j\beta_j}{m(n-W)}\sum_{i=1}^{n}\mathbb{E}[V_i \mid \mathcal{F}_{i-1}] \leq \hat{V}_n$, a union bound gives

$$\mathbb{P}\big\{|\hat{r}_n - \hat{r}_n^{(2)}| > \hat{U}_n + \hat{V}_n + t\big\} \leq \mathbb{P}\left\{\frac{2\big(1+\frac{L}{m}\big)\sum_j\beta_j}{n-W}\sum_{i=1}^{n} U_i > \hat{U}_n + \frac{t}{2}\right\} + \mathbb{P}\left\{\frac{2\sum_j\beta_j}{m(n-W)}\sum_{i=1}^{n} V_i > \hat{V}_n + \frac{t}{2}\right\}.$$

Applying Lemma 22 to each term, exactly as in the proofs of Lemmas 2 and 3, yields

$$\mathbb{P}\big\{|\hat{r}_n - \hat{r}_n^{(2)}| > \hat{U}_n + \hat{V}_n + t\big\} \leq \exp\left\{-\frac{c_1(n-W)t^2}{\big(1+\frac{L}{m}\big)^2\big(\sum_{j=1}^{W}\beta_j\big)^2\operatorname{diam}(X)^2}\right\} + \exp\left\{-\frac{c_2(n-W)m^2t^2}{C_g\big(\sum_{j=1}^{W}\beta_j\big)^2}\right\}$$

for absolute constants $c_1, c_2 > 0$. Then

$$\sum_{n=W+1}^{\infty}\mathbb{P}\big\{\hat{r}_n < r - \hat{U}_n - \hat{V}_n - t_n\big\} \leq \sum_{n=W+1}^{\infty}\mathbb{P}\big\{|\hat{r}_n - \hat{r}_n^{(2)}| > \hat{U}_n + \hat{V}_n + t_n\big\} < \infty,$$

and by the Borel-Cantelli lemma, $\hat{r}_n + \hat{U}_n + \hat{V}_n + t_n \geq r$ almost surely for all $n$ large enough. ∎

3.5 Parameter Estimation

We may need to estimate parameters of the functions $\{f_n\}$, such as the strong convexity parameter $m$, to compute $b(d, K)$. We need the following assumption on the bound:

D.1
Suppose that our bound $b(d, K, y)$ is parameterized by $y$, which depends on properties of the function $\ell(x,z)$ and the distributions $\{p_n\}_{n=1}^{\infty}$, and that

$$y \leq y' \;\Leftrightarrow\; b(d, K, y) \leq b(d, K, y').$$

D.2
There exists a true set of parameters $y^*$ such that $y_n = y^*$ for all $n \geq 1$.

D.3
The spaces $X$ and $Z$ are compact.

D.4
There exists a constant $L$ such that $\|\nabla_x\ell(x,z) - \nabla_x\ell(\tilde{x},z)\| \leq L\|x - \tilde{x}\|$.

D.5
Suppose that we know that the parameters satisfy $y \in P$ with $P$ compact.

D.6
Suppose that each $f_n(x)$ has Lipschitz continuous gradients with modulus $M$.

As a consequence of Assumptions D.3 and D.4, there exists a constant $G$ such that

$$\|\nabla_x\ell(x,z)\| \leq G \quad \forall x \in X,\; z \in Z.$$

Satisfying Assumption D.5 is usually easy due to the compactness assumption in D.3. In most cases, we have

$$y = (-m,\, M,\, A,\, B),$$

where $m$ is the strong convexity parameter, $M$ is the Lipschitz gradient modulus, and the pair $(A, B)$ controls gradient growth, i.e.,

$$\mathbb{E}\|\nabla_x\ell(x,z)\|^2 \leq A + B\|x - x^*\|^2.$$

We parameterize using $-m$, since smaller $m$ increases the bound $b(d, K)$. We present several general methods for estimating these parameters, although in practice, problem specific estimators based on the form of the loss may offer better performance. As an example, we present problem specific estimates for the penalized quadratic loss

$$\ell(x, z) = \tfrac{1}{2}\big(y - w^{\top}x\big)^2 + \tfrac{\lambda}{2}\|x\|^2.$$

As in estimating $r$, we produce one time instant estimates $\tilde{m}_i$, $\tilde{M}_i$, $\tilde{A}_i$, and $\tilde{B}_i$ at time $i$ and combine them. We only examine the case under Assumption D.2, although we could handle inequality constraints as in the estimation of $r$. We combine estimates by averaging to yield:

1. $\hat{m}_n = \frac{1}{n}\sum_{i=1}^{n}\tilde{m}_i$;
2. $\hat{M}_n = \frac{1}{n}\sum_{i=1}^{n}\tilde{M}_i$;
3. $\hat{A}_n = \frac{1}{n}\sum_{i=1}^{n}\tilde{A}_i$;
4. $\hat{B}_n = \frac{1}{n}\sum_{i=1}^{n}\tilde{B}_i$.

3.5.1 Estimating the Strong Convexity Parameter and Lipschitz Gradient Modulus

We seek one step estimators $\tilde{m}_n$ and $\tilde{M}_n$ such that $\mathbb{E}[\tilde{m}_n \mid \mathcal{K}_{n-1}] \leq m$ and $\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}] \geq M$, with $\{\mathcal{K}_n\}$ defined in (6).

Hessian Method:
We exploit the fact that $\nabla_{xx}f_n(x) \succeq mI$ for all $x \in X$, which implies $\lambda_{\min}(\nabla_{xx}f_n(x)) \geq m$ for all $x \in X$. This suggests that, given $\{z_n(k)\}_{k=1}^{K_n}$, we set

$$\tilde{m}_n \triangleq \min_{x \in X}\lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x, z_n(k))\right).$$

Since $\lambda_{\min}(A) = \min_{v : \|v\|=1}\langle Av, v\rangle$, $\lambda_{\min}(A)$ is a concave function of $A$, so by Jensen's inequality,

$$\mathbb{E}[\tilde{m}_n \mid \mathcal{K}_{n-1}] = \mathbb{E}\left[\min_{x \in X}\lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x, z_n(k))\right) \,\middle|\, \mathcal{K}_{n-1}\right] \leq \min_{x \in X}\lambda_{\min}\left(\mathbb{E}\left[\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x, z_n(k)) \,\middle|\, \mathcal{K}_{n-1}\right]\right) = \min_{x \in X}\lambda_{\min}\big(\nabla_{xx}f_n(x)\big) = m.$$

Similarly, we can set

$$\tilde{M}_n \triangleq \max_{x \in X}\lambda_{\max}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x, z_n(k))\right).$$

Since $\lambda_{\max}(A) = \max_{v:\|v\|=1}\langle Av, v\rangle$ is a convex function of $A$, Jensen's inequality gives $\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}] \geq M$.

Gradient Method To Compute $\tilde{m}_n$: To carry out the minimization over $x$, we can use gradient descent together with eigenvalue perturbation results [20]. Suppose that we have a base matrix $T$ with eigenvectors $v_i$ and eigenvalues $\lambda_i$, and we want to find the eigenvectors $\bar{v}_i$ and eigenvalues $\bar{\lambda}_i$ of a perturbed matrix $\bar{T}$:

$$T v_i = \lambda_i v_i, \qquad \bar{T}\bar{v}_i = \bar{\lambda}_i\bar{v}_i.$$
In particular, we want to relate $\bar{\lambda}_i$ to $\lambda_i$. With $\delta T \triangleq \bar{T} - T$, we have, to first order,

$$\delta\lambda_i = v_i^{\top}(\delta T)v_i \qquad \text{and} \qquad \frac{\partial\lambda_i}{\partial T_{jk}} = v_i(j)\,v_i(k)\,(2 - \delta_{jk}),$$

where $\delta_{jk}$ is the Kronecker delta and the factor $2 - \delta_{jk}$ accounts for the symmetry of $T$. Suppose we are given a matrix-valued function $T(x)$ with $T(x)v(x) = \lambda_{\min}(x)v(x)$. Then it holds that

$$\nabla_x\lambda_{\min}(T(x)) = \sum_{j \leq k}\frac{\partial\lambda_{\min}}{\partial T_{jk}}\nabla_x T_{jk}(x) = \sum_{j \leq k} v(j)\,v(k)\,(2 - \delta_{jk})\,\nabla_x T_{jk}(x).$$

We can then use projected gradient descent to solve

$$\min_{x \in X}\lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x, z_n(k))\right).$$

Starting from any $x(0)$, we compute

$$x(p) = P_X\left[x(p-1) - \mu\,\nabla_x\lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x(p-1), z_n(k))\right)\right], \quad p = 1, \ldots, P,$$

and set

$$\hat{m}_n \triangleq \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_{xx}\ell(x(P), z_n(k))\right). \tag{15}$$
Heuristic Method: For any two points $x$ and $y$, strong convexity gives

$$f_n(y) \geq f_n(x) + \langle\nabla f_n(x), y - x\rangle + \tfrac{m}{2}\|y - x\|^2.$$

Suppose that we have $N$ points $x(1), \ldots, x(N)$. Then for any two distinct points $x(i)$ and $x(j)$,

$$m \leq \frac{2\big(f_n(x(i)) - f_n(x(j)) - \langle\nabla f_n(x(j)),\, x(i) - x(j)\rangle\big)}{\|x(i) - x(j)\|^2}.$$

This suggests the estimator

$$\hat{m}_n \triangleq \min_{i \neq j}\frac{2\Big(\frac{1}{K_n}\sum_{k=1}^{K_n}\ell(x(i), z_n(k)) - \frac{1}{K_n}\sum_{k=1}^{K_n}\ell(x(j), z_n(k)) - \big\langle\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_x\ell(x(j), z_n(k)),\, x(i) - x(j)\big\rangle\Big)}{\|x(i) - x(j)\|^2} \tag{16}$$

for the strong convexity parameter. Taking expectations and using Jensen's inequality for the minimum,

$$\mathbb{E}[\hat{m}_n] \leq \min_{i \neq j}\frac{2\big(f_n(x(i)) - f_n(x(j)) - \langle\nabla f_n(x(j)),\, x(i) - x(j)\rangle\big)}{\|x(i) - x(j)\|^2}.$$

It is difficult to compare this estimator to $m$ exactly; all we can say is that $m$ is also bounded above by the right-hand side. In practice, this method produces estimates close to $m$. Similarly, we can set

$$\hat{M}_n \triangleq \max_{i \neq j}\frac{2\Big(\frac{1}{K_n}\sum_{k=1}^{K_n}\ell(x(i), z_n(k)) - \frac{1}{K_n}\sum_{k=1}^{K_n}\ell(x(j), z_n(k)) - \big\langle\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_x\ell(x(j), z_n(k)),\, x(i) - x(j)\big\rangle\Big)}{\|x(i) - x(j)\|^2}. \tag{17}$$
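The heuristic estimates (16)-(17) require only empirical loss and gradient evaluations at a handful of probe points; a minimal sketch follows (ours, with `emp_loss` and `emp_grad` denoting the sample averages of $\ell$ and $\nabla_x\ell$ over $\{z_n(k)\}_{k=1}^{K_n}$).

```python
import numpy as np

def heuristic_m_M(points, emp_loss, emp_grad):
    """Estimates (16)-(17) of m and M from probe points x(1), ..., x(N)."""
    quotients = []
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if i == j:
                continue
            gap = emp_loss(xi) - emp_loss(xj) - emp_grad(xj) @ (xi - xj)
            quotients.append(2.0 * gap / np.linalg.norm(xi - xj) ** 2)
    return min(quotients), max(quotients)   # (m_hat_n, M_hat_n)
```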
Problem Specific: For the penalized quadratic loss, we have $\nabla_{xx}\ell(x,z) = \lambda I + ww^{\top}$, so

$$\nabla_{xx}f_n(x) = \lambda I + \mathbb{E}\big[w_n w_n^{\top}\big].$$

This suggests the simple closed-form estimates

$$\tilde{m}_n = \lambda + \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}w_n(k)w_n(k)^{\top}\right) \quad \text{and} \quad \tilde{M}_n = \lambda + \lambda_{\max}\left(\frac{1}{K_n}\sum_{k=1}^{K_n}w_n(k)w_n(k)^{\top}\right).$$

Again, by Jensen's inequality, it holds that $\mathbb{E}[\tilde{m}_n \mid \mathcal{K}_{n-1}] \leq m$ and $\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}] \geq M$.
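For the penalized quadratic, the closed-form estimates reduce to an eigenvalue computation on the empirical second moment matrix of the features (our sketch):

```python
import numpy as np

def quad_m_M(W_samples, lam):
    """One step estimates for l(x, z) = (1/2)(y - w'x)^2 + (lam/2)||x||^2.
    W_samples: (K_n, d) array with rows w_n(1), ..., w_n(K_n)."""
    S = W_samples.T @ W_samples / W_samples.shape[0]  # empirical E[w w']
    eigs = np.linalg.eigvalsh(S)                      # ascending eigenvalues
    return lam + eigs[0], lam + eigs[-1]              # (m~_n, M~_n)
```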
Combining Estimates: We now look at combining the single time instant estimates of the strong convexity parameter and the Lipschitz gradient modulus.
Lemma 9.
Choose $t_n$ such that for all $C > 0$ it holds that $\sum_{n=1}^{\infty} e^{-Cnt_n^2} < +\infty$. Then for all $n$ large enough, it holds that:

1. $\hat{m}_n - t_n \leq m$;
2. $\hat{M}_n + t_n \geq M$

almost surely.

Proof.
By the compactness of the space $P$ containing $y$, the one step estimates are bounded, so the conditional version of Hoeffding's lemma (Lemma 23) yields

$$\mathbb{E}\big[e^{s(\tilde{m}_i - \mathbb{E}[\tilde{m}_i \mid \mathcal{K}_{i-1}])} \mid \mathcal{K}_{i-1}\big] \leq \exp\left\{\frac{s_m^2 s^2}{2}\right\} \quad \text{and} \quad \mathbb{E}\big[e^{s(\tilde{M}_i - \mathbb{E}[\tilde{M}_i \mid \mathcal{K}_{i-1}])} \mid \mathcal{K}_{i-1}\big] \leq \exp\left\{\frac{s_M^2 s^2}{2}\right\}$$

for some constants $s_m$ and $s_M$ derived from Hoeffding's lemma. Applying Lemma 22, it follows that

$$\mathbb{P}\left\{\hat{m}_n > \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\tilde{m}_i \mid \mathcal{K}_{i-1}] + t_n\right\} \leq \exp\left\{-\frac{nt_n^2}{2s_m^2}\right\}.$$

We know that $\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\tilde{m}_i \mid \mathcal{K}_{i-1}] \leq m$, so

$$\mathbb{P}\{\hat{m}_n > m + t_n\} \leq \exp\left\{-\frac{nt_n^2}{2s_m^2}\right\}.$$

Similarly, for the Lipschitz gradient modulus,

$$\mathbb{P}\{\hat{M}_n < M - t_n\} \leq \exp\left\{-\frac{nt_n^2}{2s_M^2}\right\}.$$

As before,

$$\sum_{n=1}^{\infty}\mathbb{P}\{\hat{m}_n > m + t_n\} < +\infty \quad \text{and} \quad \sum_{n=1}^{\infty}\mathbb{P}\{\hat{M}_n < M - t_n\} < +\infty,$$

which ensures that almost surely, for all $n$ large enough, $\hat{m}_n - t_n \leq m$ and $\hat{M}_n + t_n \geq M$. ∎

For Lemma 9, we need $t_n$ to decay no faster than $O(n^{-1/2})$.

3.5.2 Estimating Gradient Parameters

From Assumption D.6, it holds that

$$\mathbb{E}\|\nabla_x\ell(x,z)\|^2 = \mathbb{E}\big\|\nabla_x\ell(x^*,z) + \big(\nabla_x\ell(x,z) - \nabla_x\ell(x^*,z)\big)\big\|^2 \leq 2\,\mathbb{E}\|\nabla_x\ell(x^*,z)\|^2 + 2\,\mathbb{E}\|\nabla_x\ell(x,z) - \nabla_x\ell(x^*,z)\|^2 \leq 2\,\mathbb{E}\|\nabla_x\ell(x^*,z)\|^2 + 2M^2\|x - x^*\|^2.$$

Thus, we can set $B = 2M^2$ and $A = 2\,\mathbb{E}\|\nabla_x\ell(x^*,z)\|^2$. This suggests that, given an estimate $\tilde{M}_n$ of $M$, we set

$$\tilde{B}_n = 2\tilde{M}_n^2.$$

Then by Jensen's inequality,

$$\mathbb{E}[\tilde{B}_n \mid \mathcal{K}_{n-1}] = 2\,\mathbb{E}[\tilde{M}_n^2 \mid \mathcal{K}_{n-1}] \geq 2\big(\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}]\big)^2 \geq 2M^2 = B.$$

Lemma 10.
Choose $t_n$ such that for all $C > 0$ it holds that $\sum_{n=1}^{\infty} e^{-Cnt_n^2} < +\infty$. Then for all $n$ large enough, it holds that $\hat{B}_n + t_n \geq B$ almost surely.

Proof.
By reasoning identical to that for the strong convexity parameter and the Lipschitz gradient modulus, it holds that

$$\mathbb{P}\{\hat{B}_n < B - t_n\} \leq \exp\left\{-\frac{nt_n^2}{2s_B^2}\right\}.$$

Since $\sum_{n=1}^{\infty}\exp\{-nt_n^2/(2s_B^2)\} < +\infty$, for all $n$ large enough it holds that $\hat{B}_n + t_n \geq B$ almost surely. ∎

To estimate $A$, consider using a point $x$ to approximate $x^*$. It holds that

$$\mathbb{E}\|\nabla_x\ell(x^*,z)\|^2 \leq 2\,\mathbb{E}\|\nabla_x\ell(x,z)\|^2 + 2\,\mathbb{E}\|\nabla_x\ell(x^*,z) - \nabla_x\ell(x,z)\|^2 \leq 2\,\mathbb{E}\|\nabla_x\ell(x,z)\|^2 + 2M^2\|x - x^*\|^2 \leq 2\,\mathbb{E}\|\nabla_x\ell(x,z)\|^2 + 2\left(\frac{M}{m}\right)^2\|\nabla f(x)\|^2,$$

using $\|x - x^*\| \leq \frac{1}{m}\|\nabla f(x)\|$. Since $A = 2\,\mathbb{E}\|\nabla_x\ell(x^*,z)\|^2$, this suggests the one step estimate

$$\tilde{A}_n(x) = \frac{4}{K_n}\sum_{k=1}^{K_n}\|\nabla_x\ell(x, z_n(k))\|^2 + 4\left(\frac{\hat{M}_{n-1} + t_{n-1}}{\hat{m}_{n-1} - t_{n-1}}\right)^2\left\|\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_x\ell(x, z_n(k))\right\|^2.$$
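Given per-sample gradients at a point $x$ and the inflated and deflated parameter estimates, the one step estimate $\tilde{A}_n(x)$ is a short computation (our sketch; the numeric prefactors follow the reconstruction of the display above).

```python
import numpy as np

def A_estimate(grads, M_hat, m_hat, t):
    """One step estimate A~_n(x) from grads[k] = grad_x l(x, z_n(k))."""
    mean_sq_norm = np.mean(np.sum(grads ** 2, axis=1))  # avg of ||grad l||^2
    norm_sq_mean = np.sum(np.mean(grads, axis=0) ** 2)  # ||avg grad l||^2
    ratio = (M_hat + t) / (m_hat - t)
    return 4.0 * mean_sq_norm + 4.0 * ratio ** 2 * norm_sq_mean
```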
Lemma 11. For any $x$, possibly random but not a function of $\{z_n(k)\}_{k=1}^{K_n}$, and all $n$ large enough, it holds that $\mathbb{E}[\tilde{A}_n \mid \mathcal{K}_{n-1}] \geq A$.

Proof.
For any $x$ possibly random but not a function of $\{z_n(k)\}_{k=1}^{K_n}$, it holds that

$$\mathbb{E}[\tilde{A}_n \mid \mathcal{K}_{n-1}] = \mathbb{E}\left[\frac{4}{K_n}\sum_{k=1}^{K_n}\|\nabla_x\ell(x, z_n(k))\|^2 \,\middle|\, \mathcal{K}_{n-1}\right] + 4\left(\frac{\hat{M}_{n-1}+t_{n-1}}{\hat{m}_{n-1}-t_{n-1}}\right)^2\mathbb{E}\left[\left\|\frac{1}{K_n}\sum_{k=1}^{K_n}\nabla_x\ell(x, z_n(k))\right\|^2 \,\middle|\, \mathcal{K}_{n-1}\right] \geq 4\,\mathbb{E}\|\nabla_x\ell(x, z_n)\|^2 + 4\left(\frac{\hat{M}_{n-1}+t_{n-1}}{\hat{m}_{n-1}-t_{n-1}}\right)^2\|\nabla f_n(x)\|^2,$$

where the last step uses Jensen's inequality. By our prior analysis, almost surely for all $n$ sufficiently large,

$$\frac{\hat{M}_{n-1}+t_{n-1}}{\hat{m}_{n-1}-t_{n-1}} \geq \frac{M}{m},$$

and so for all $n$ sufficiently large (dependent on the estimation of $m$ and $M$),

$$\mathbb{E}[\tilde{A}_n \mid \mathcal{K}_{n-1}] \geq 4\,\mathbb{E}\|\nabla_x\ell(x, z_n)\|^2 + 4\left(\frac{M}{m}\right)^2\|\nabla f_n(x)\|^2 \geq 2\,\mathbb{E}\|\nabla_x\ell(x_n^*, z_n)\|^2 = A. \;\;∎$$
Combining Estimates for $A$: In practice, we use $\tilde{A}_n(x_n)$, which complicates the analysis because $x_n$ is computed using the same samples $\{z_n(k)\}_{k=1}^{K_n}$.

Lemma 12.
Choose $t_n$ such that for all $C > 0$ it holds that $\sum_{n=1}^{\infty} e^{-Cnt_n^2} < +\infty$. Then for all $n$ large enough, it holds that $\hat{A}_n + t_n \geq A$ almost surely.

Proof. Consider the following three estimates of $A$, all computed with knowledge of $m$ and $M$ and with $\tilde{x}_n$ as in Lemma 2:

$$\tilde{A}_i^{(1)} = \frac{4}{K_i}\sum_{k=1}^{K_i}\|\nabla_x\ell(x_i, z_i(k))\|^2 + 4\left(\frac{M}{m}\right)^2\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(x_i, z_i(k))\right\|^2,$$

$$\tilde{A}_i^{(2)} = \frac{4}{K_i}\sum_{k=1}^{K_i}\|\nabla_x\ell(\tilde{x}_i, z_i(k))\|^2 + 4\left(\frac{M}{m}\right)^2\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(\tilde{x}_i, z_i(k))\right\|^2,$$

$$\tilde{A}_i^{(3)} = 4\,\mathbb{E}\|\nabla_x\ell(\tilde{x}_i, z_i)\|^2 + 4\left(\frac{M}{m}\right)^2\|\nabla f_i(\tilde{x}_i)\|^2.$$

Define the averaged estimates $\hat{A}_n^{(j)} = \frac{1}{n}\sum_{i=1}^{n}\tilde{A}_i^{(j)}$ for $j = 1, 2, 3$. We always have $\tilde{A}_i^{(3)} \geq A$, so $\hat{A}_n^{(3)} \geq A$.

First, we show that $\hat{A}_n^{(1)}$ is close to $\hat{A}_n^{(2)}$. Using the gradient bound $\|\nabla_x\ell\| \leq G$ and the Lipschitz continuity from D.4,

$$|\tilde{A}_i^{(1)} - \tilde{A}_i^{(2)}| \leq 8\left(1 + \left(\frac{M}{m}\right)^2\right)GL\,\|x_i - \tilde{x}_i\|,$$

yielding

$$|\hat{A}_n^{(1)} - \hat{A}_n^{(2)}| \leq 8\left(1 + \left(\frac{M}{m}\right)^2\right)GL\left(\frac{1}{n}\sum_{i=1}^{n}\|x_i - \tilde{x}_i\|\right).$$

Second,

$$|\hat{A}_n^{(2)} - \hat{A}_n^{(3)}| \leq \left|\frac{1}{n}\sum_{i=1}^{n}\frac{4}{K_i}\sum_{k=1}^{K_i}\Big(\|\nabla_x\ell(\tilde{x}_i,z_i(k))\|^2 - \mathbb{E}\big[\|\nabla_x\ell(\tilde{x}_i,z_i)\|^2 \mid \mathcal{F}_{i-1}\big]\Big)\right| + 8\left(\frac{M}{m}\right)^2 G\,\frac{1}{n}\sum_{i=1}^{n}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\big(\nabla_x\ell(\tilde{x}_i,z_i(k)) - \nabla f_i(\tilde{x}_i)\big)\right\|.$$

Combining the two displays bounds $|\hat{A}_n^{(1)} - \hat{A}_n^{(3)}|$ by the sum of the three error terms. The first and third terms in this bound can be controlled by the analysis of the direct estimate, and the second term by Lemma 22.
This shows that
$$P\left(\hat A^{(1)}_n < A - \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} - t_n\right) \le P\left(\hat A^{(1)}_n < \hat A^{(3)}_n - \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} - t_n\right) \le P\left(|\hat A^{(1)}_n - \hat A^{(3)}_n| > \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} + t_n\right) \le C\exp\left\{-\frac{nt_n^2}{2\sigma_A^2}\right\}.$$
Since
$$\sum_{n=1}^{\infty}P\left(\hat A^{(1)}_n < A - \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} - t_n\right) \le \sum_{n=1}^{\infty}C\exp\left\{-\frac{nt_n^2}{2\sigma_A^2}\right\} < +\infty,$$
almost surely for all $n$ large enough it holds that
$$\hat A^{(1)}_n + \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} + t_n \ge A.$$
There exists a random variable $\tilde N$ such that $n \ge \tilde N$ implies $\frac{\hat M_n + t_n}{\hat m_n - t_n} \ge \frac{M}{m}$. Then for $n \ge \tilde N$, it holds that
$$\hat A_n - \hat A^{(1)}_n = \frac{1}{n}\sum_{i=1}^{n}\left[\left(\frac{\hat M_{i-1}+t_{i-1}}{\hat m_{i-1}-t_{i-1}}\right)^2 - \left(\frac{M}{m}\right)^2\right]\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(x_i, z_i(k))\right\|^2 \ge \frac{1}{n}\sum_{i=1}^{\tilde N-1}\left[\left(\frac{\hat M_{i-1}+t_{i-1}}{\hat m_{i-1}-t_{i-1}}\right)^2 - \left(\frac{M}{m}\right)^2\right]\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(x_i, z_i(k))\right\|^2,$$
since the terms with $i \ge \tilde N$ are non-negative. Because $t_n$ can decay only as fast as $C/\sqrt{n}$, while the remaining finite sum is $O(1/n)$, it follows that
$$\frac{1}{n}\sum_{i=1}^{\tilde N-1}\left[\left(\frac{M}{m}\right)^2 - \left(\frac{\hat M_{i-1}+t_{i-1}}{\hat m_{i-1}-t_{i-1}}\right)^2\right]\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(x_i, z_i(k))\right\|^2 - t_n < 0$$
for $n$ large enough. This implies that
$$\hat A_n + \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} + 2t_n \ge \hat A^{(1)}_n + \frac{1}{n}\sum_{i=1}^{n}\frac{C_i}{\sqrt{K_i}} + t_n \ge A$$
for $n$ large enough.

Using these estimates, we have constructed estimates $\hat y_n$ such that for all $n$ large enough it holds that $\hat y_n + C_n + t_n \ge y^*$ for appropriate constants $C_n$ almost surely. Therefore, by assumption, for all $n$ large enough it holds that $b(\delta, K, y^*) \le b(\delta, K, \hat y_n + t_n)$.
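For concreteness, the averaged estimate $\hat A_n$ above is easy to compute online from the per-task gradient samples. The following is a minimal sketch of one update (our own, not the authors' implementation); the gradient array, the estimates M_hat and m_hat, and the margin t are assumed inputs, and the squared-norm form follows the reconstruction above.

```python
import numpy as np

def update_A_hat(grads, M_hat, m_hat, t, running_sum, n):
    """One online update of the averaged estimate hat-A_n of
    A = E||grad_x l(x*, z)||^2, following the estimator above.

    grads:       (K, d) array of gradients grad_x l(x, z(k)) for one task.
    M_hat/m_hat: current estimates of the gradient-Lipschitz and strong
                 convexity constants; t is the confidence margin t_{n-1}.
    running_sum: sum of the previous tilde-A_i terms; n is the task index.
    """
    ratio = (M_hat + t) / (m_hat - t)            # inflated M/m ratio
    term1 = np.mean(np.sum(grads ** 2, axis=1))  # (1/K) sum_k ||grad||^2
    term2 = ratio ** 2 * np.sum(grads.mean(axis=0) ** 2)  # ||avg grad||^2 term
    running_sum += term1 + term2                 # tilde-A_n for this task
    return running_sum, running_sum / n          # (new sum, hat-A_n)
```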
r Estimation with Estimated Parameters

Our analysis of estimating $r$ assumes that we know the parameters of the function, in particular the strong convexity parameter $m$. We now argue that the effect of using estimated parameters instead is minimal. This is possible because we know that for all $n$ large enough it holds that $\hat y_n \ge y^*$ almost surely.
Lemma 13. We want to estimate a non-negative parameter $f^*$ by producing a sequence of estimates $f_i$ for all $i \ge 1$ and averaging to produce $\hat f_n = \frac{1}{n}\sum_{i=1}^{n}f_i$, where the estimates $f_i$ depend on an auxiliary sequence $y_i$ in the sense that $f_i = f_i(y_i)$. Suppose that the following conditions hold:
1. There exists a random variable $\tilde N$ such that $n \ge \tilde N$ implies that $\hat y_n \ge y^*$.
2. $\mathbb{E}[f_i(y^*_i)] \ge f^*$.
Then it follows that
$$\liminf_{n\to\infty}\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}f_i\right] \ge f^*.$$

Proof. It holds that
$$\frac{1}{n}\sum_{i=1}^{n}f_i = \frac{1}{n}\sum_{i=1}^{\tilde N-1}f_i(y_i) + \frac{1}{n}\sum_{i=\tilde N}^{n}f_i(y_i) \ge \frac{1}{n}\sum_{i=1}^{\tilde N-1}f_i(y_i) + \frac{1}{n}\sum_{i=\tilde N}^{n}f_i(y^*_i), \quad (18)$$
where the inequality uses condition 1 together with the monotonicity of $f_i$ in its argument. Therefore, it follows that
$$\liminf_{n\to\infty}\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}f_i\right] \ge \liminf_{n\to\infty}\mathbb{E}\left[\frac{1}{n}\sum_{i=\tilde N}^{n}f_i(y^*_i)\right] \ge f^*.$$

We can extend all the concentration inequalities for estimating $r$ as well by extending the inequality in (18) to yield
$$\frac{1}{n}\sum_{i=1}^{n}f_i \ge \frac{1}{n}\sum_{i=1}^{\tilde N-1}f_i(y_i) + \frac{1}{n}\sum_{i=\tilde N}^{n}f_i(y^*_i) \ge \frac{1}{n}\sum_{i=1}^{\tilde N-1}\left(f_i(y_i) - f_i(y^*_i)\right) + \frac{1}{n}\sum_{i=1}^{n}f_i(y^*_i) = \frac{1}{n}\sum_{i=1}^{n}f_i(y^*_i) + o(1).$$
We have already analyzed $\frac{1}{n}\sum_{i=1}^{n}f_i(y^*_i)$, so for large enough $n$ we recover the previous results, since the $o(1)$ term goes to 0.
Choosing K_n with r Unknown

We now examine the case with $r$ unknown. We extend the work of Section 2 using the estimates of $r$ from Section 3. Our analysis depends on the following crucial assumptions:
C.1 For appropriate sequences $\{t_n\}$, for all $n$ sufficiently large it holds that $\hat r_n + t_n \ge r$ almost surely.
C.2 $b(\delta, K_n)$ factors as $b(\delta, K_n) = a(K_n)\delta + b(K_n)$.
We have demonstrated that assumption C.1 holds for the direct and IPM estimates of $r$ under (2) and (3). Note that whether we assume (2) or (3) does not matter for the analysis.

Choosing K_n

We start with a general result showing that for any choice of $K_n$ such that $K_n \ge K^*$ for all $n$ large enough, the excess risk is controlled in the sense that
$$\limsup_{n\to\infty}\left(\mathbb{E}[f_n(x_n)] - f_n(x^*_n)\right) \le e.$$
We then apply this result to two different selection rules for $K_n$. Consider the function
$$f_K(v) = a(K)\left(\sqrt{\frac{2v}{m}} + r\right)^2 + b(K)$$
derived from assumption C.2. Note that as a function of $v$, $f_K(v)$ is increasing and strictly concave. First, suppose that we select $K^*$ as defined in (5). Then by definition it holds that $f_{K^*}(e) \le e$. We study the fixed points of the function $f_{K^*}(v)$:
Lemma 14. The function $f_{K^*}(v)$ has a unique positive fixed point $\bar v$ with:
1. $\bar v = f_{K^*}(\bar v) \le e$;
2. $f'_{K^*}(\bar v) < 1$.

Proof.
We have $f_{K^*}(0) = a(K^*)r^2 + b(K^*) > 0$ and $\lim_{v\to 0}f_{K^*}(v) = f_{K^*}(0)$. Since $f_{K^*}(0) > 0$, there exists a positive $a$ sufficiently small that $f_{K^*}(a) > a$. Next, expanding $f_K(v)$ yields
$$f_K(v) = \frac{2}{m}a(K)v + 2a(K)r\sqrt{\frac{2}{m}}\sqrt{v} + a(K)r^2 + b(K).$$
Since $f_{K^*}(e) \le e$, we must have $\frac{2}{m}a(K^*) \le 1$. Suppose that $\frac{2}{m}a(K^*) = 1$. Then
$$f_{K^*}(e) = e + \sqrt{2m}\,r\sqrt{e} + \frac{m}{2}r^2 + b(K^*) > e.$$
This is a contradiction, so it holds that $\frac{2}{m}a(K^*) < 1$, and hence $v - f_{K^*}(v) \to \infty$ as $v \to \infty$. Therefore, there exists a point $b > a$ such that $f_{K^*}(b) < b$. It is easy to check that $f_{K^*}(v)$ is increasing and strictly concave. Therefore, we can apply Theorem 3.3 from [21] to conclude that there exists a unique positive fixed point $\bar v$ of $f_{K^*}(v)$.

Next, suppose that $f'_{K^*}(\bar v) > 1$. Then by Taylor's Theorem, for $v > \bar v$ sufficiently close to $\bar v$ we have $f_{K^*}(v) > v$. Since $v - f_{K^*}(v) \to \infty$ as $v \to \infty$, the Intermediate Value Theorem implies that there is another fixed point on $[v, \infty)$. This is a contradiction, since $\bar v$ is the unique positive fixed point. Therefore, it holds that $f'_{K^*}(\bar v) \le 1$.

Now, suppose that $f'_{K^*}(\bar v) = 1$. Since $f_{K^*}(v)$ is strictly concave, its derivative is strictly decreasing [22]. Therefore, on $[0, \bar v)$ it holds that $f'_{K^*}(v) > 1$, so
$$f_{K^*}(\bar v) = f_{K^*}(0) + \int_0^{\bar v}f'_{K^*}(v)\,dv \ge f_{K^*}(0) + \bar v > \bar v.$$
This is a contradiction, so it must be that $f'_{K^*}(\bar v) < 1$.

To understand repeated application of $f_{K^*}(v)$, we can study a fixed point iteration involving $f_K(v)$. Define the $n$-fold composition mapping
$$f^{(n)}_K(v) \triangleq (f_K \circ \cdots \circ f_K)(v).$$
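Lemma 15 below shows that iterating $f_{K^*}$ converges to the fixed point $\bar v$, which also gives a practical way to compute $\bar v$ numerically. A minimal sketch with purely illustrative (hypothetical) constants $a(K^*)$, $b(K^*)$, $m$, and $r$:

```python
def f_K(v, a, b, m, r):
    """f_K(v) = a(K) * (sqrt(2 v / m) + r)^2 + b(K), from assumption C.2."""
    return a * ((2.0 * v / m) ** 0.5 + r) ** 2 + b

# Hypothetical constants; in the framework, a(K), b(K) come from the SGD
# bounds of Appendix A, and (2/m) * a(K*) < 1 guarantees convergence.
a, b, m, r = 0.05, 0.01, 1.0, 1.0

v = 1.0  # any positive starting point works, by Lemma 15 below
for _ in range(200):
    v = f_K(v, a, b, m, r)
print("fixed point v_bar ~", v)  # v_bar = f_K(v_bar) up to round-off
```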
Lemma 15. For any $v > 0$, it holds that $\lim_{n\to\infty}f^{(n)}_{K^*}(v) = \bar v$.

Proof. Following [23], for the fixed point $\bar v$ it holds that $|f_{K^*}(v) - \bar v| \le f'_{K^*}(\bar v)|v - \bar v|$. Therefore, applying the fixed point property repeatedly yields
$$|f^{(n)}_{K^*}(v) - \bar v| \le \left(f'_{K^*}(\bar v)\right)^n|v - \bar v|.$$
By Lemma 14, it holds that $f'_{K^*}(\bar v) < 1$, so the right-hand side goes to zero.

We now turn to choosing $K_n$ using the estimates of $r$. The extension of this argument to the case when we also estimate the function parameters $y$ is straightforward. If we have
$$p\left(\{z_n(k)\}_{k=1}^{K_n} \mid x_{n-1}, K_n\right) = \prod_{k=1}^{K_n}p_n(z_n(k)),$$
then
$$\mathbb{E}[f_n(x_n) \mid x_{n-1}, K_n] - f_n(x^*_n) \le b\left(\left(\sqrt{\frac{2}{m}\left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)} + r\right)^2, K_n\right).$$
Therefore, it holds that
$$\mathbb{E}[f_n(x_n)] - f_n(x^*_n) \le \mathbb{E}\left[b\left(\left(\sqrt{\frac{2}{m}\left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)} + r\right)^2, K_n\right)\right].$$
Suppose that we set $\mathcal K_\infty = \sigma\left(\{K_n\}_{n=1}^{\infty} \cup \{\hat r_n\}_{n=1}^{\infty}\right)$, the $\sigma$-algebra generated by $\{\hat r_n\}$ and thus $\{K_n\}$. Then we do not have
$$p\left(\{z_n(k)\}_{k=1}^{K_n} \mid \mathcal K_\infty\right) = \prod_{k=1}^{K_n}p_n(z_n(k)),$$
since $K_{n+1}, K_{n+2}, \ldots$ are functions of $\{z_n(k)\}_{k=1}^{K_n}$. We do not even have
$$\mathbb{E}[f_n(x_n) \mid \mathcal K_\infty] - f_n(x^*_n) \le b\left(\left(\sqrt{\frac{2}{m}\left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)} + r\right)^2, K_n\right).$$
However, we would expect that this is not too far from true. Conceptually, we consider running our approach twice on independent samples. The first run determines the required number of samples $\{K_n\}_{n=1}^{\infty}$. We then run our process a second time with these fixed choices of $\{K_n\}_{n=1}^{\infty}$ and independent samples, as in Figure 1. For the second run, it is true that
$$p\left(\{z^{(2)}_n(k)\}_{k=1}^{K_n} \mid \mathcal K_\infty\right) = \prod_{k=1}^{K_n}p_n(z^{(2)}_n(k))$$
and
$$\mathbb{E}\left[f_n(x^{(2)}_n) \mid \mathcal K_\infty\right] - f_n(x^*_n) \le b\left(\left(\sqrt{\frac{2}{m}\left(f_{n-1}(x^{(2)}_{n-1}) - f_{n-1}(x^*_{n-1})\right)} + r\right)^2, K_n\right).$$
In practice, we do not need to run our process twice; this is only a proof technique. For the second run, the recursion
$$e^{(2)}_n = b\left(\left(\sqrt{\frac{2}{m}e^{(2)}_{n-1}} + r\right)^2, K_n\right) \quad \forall n \ge 2,$$
with $e^{(2)}_1 = e_1$ from Assumption A.4, bounds the excess risk of the second run: $\mathbb{E}[f_n(x^{(2)}_n) \mid \mathcal K_\infty] - f_n(x^*_n) \le e^{(2)}_n$. Then it follows that $\mathbb{E}[f_n(x^{(2)}_n)] - f_n(x^*_n) \le \mathbb{E}[e^{(2)}_n]$.

Figure 1: Two Run Process. (First run: receive $\{z_{n-1}(k)\}_{k=1}^{K_{n-1}}$, optimize $x_{n-1}$, compute $\hat r_{n-1}$, choose $K_n$. Second run: receive $\{z^{(2)}_n(k)\}_{k=1}^{K_n}$, optimize $x^{(2)}_n$, compute the excess risk bound.)

We now argue that $\mathbb{E}[e^{(2)}_n]$ also bounds the excess risk of the first run.

Lemma 16. For the first run, it holds that $\mathbb{E}[f_n(x_n)] - f_n(x^*_n) \le \mathbb{E}[e^{(2)}_n]$.

Proof.
We proceed by induction. For $n = 1, 2$, we know that $\mathbb{E}[f_n(x_n)] - f_n(x^*_n) \le \mathbb{E}[e^{(2)}_n]$ by definition. Next, suppose that $\mathbb{E}[f_{n-1}(x_{n-1})] - f_{n-1}(x^*_{n-1}) \le \mathbb{E}[e^{(2)}_{n-1}]$. We have
$$\mathbb{E}[f_n(x_n)] - f_n(x^*_n) \le \mathbb{E}\left[a(K_n)\left(\sqrt{\frac{2}{m}\left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)} + r\right)^2 + b(K_n)\right],$$
so it holds that
$$\mathbb{E}[e^{(2)}_n] - \left(\mathbb{E}[f_n(x_n)] - f_n(x^*_n)\right) \ge \mathbb{E}\left[a(K_n)\left(\left(\sqrt{\frac{2}{m}e^{(2)}_{n-1}} + r\right)^2 - \left(\sqrt{\frac{2}{m}\left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)} + r\right)^2\right)\right]$$
$$= \frac{2}{m}\mathbb{E}\left[a(K_n)\left(e^{(2)}_{n-1} - \left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)\right)\right] + 2r\sqrt{\frac{2}{m}}\,\mathbb{E}\left[a(K_n)\left(\sqrt{e^{(2)}_{n-1}} - \sqrt{f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})}\right)\right].$$
For the second expectation, write the difference of square roots as
$$\sqrt{e^{(2)}_{n-1}} - \sqrt{f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})} = \frac{e^{(2)}_{n-1} - \left(f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})\right)}{\sqrt{e^{(2)}_{n-1}} + \sqrt{f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})}}.$$
Both expectations are then shown to be non-negative by the same truncation argument: replace $a(K_n)$ by $\max\{a(K_n), 1/q\}$, truncate the denominator at level $t$ using the indicator of $\left\{\sqrt{e^{(2)}_{n-1}} + \sqrt{f_{n-1}(x_{n-1}) - f_{n-1}(x^*_{n-1})} \le t\right\}$, apply the Monotone Convergence Theorem to remove the truncations as $q, t \to \infty$, and invoke the inductive hypothesis $\mathbb{E}[f_{n-1}(x_{n-1})] - f_{n-1}(x^*_{n-1}) \le \mathbb{E}[e^{(2)}_{n-1}]$. Therefore $\mathbb{E}[f_n(x_n)] - f_n(x^*_n) \le \mathbb{E}[e^{(2)}_n]$.

Theorem 2.
Under assumptions C.1 and C.2, and with $K_n \ge K^*$ for all $n$ large enough almost surely, with $K^*$ from (5), we have
$$\limsup_{n\to\infty}\left(\mathbb{E}[f_n(x_n)] - f_n(x^*_n)\right) \le e.$$

Proof.
Let $\bar v$ be the fixed point associated with $f_{K^*}(v)$ from Lemma 14. We know that $\bar v = f_{K^*}(\bar v) \le e$ and that $f^{(n)}_{K^*}(v) \to \bar v \le e$ for any $v > 0$. Since $K_n \ge K^*$ for all $n$ large enough almost surely, there exists a random variable $\tilde N$ such that $n \ge \tilde N \Rightarrow K_n \ge K^*$. Then, almost surely,
$$\limsup_{n\to\infty}e^{(2)}_n \le \limsup_{n\to\infty}\left(f_{K_n}\circ\cdots\circ f_{K_{\tilde N}}\right)(e_{\tilde N-1}) \le \limsup_{n\to\infty}f^{(n-\tilde N+1)}_{K^*}(e_{\tilde N-1}) = \bar v \le e,$$
using the fact that $f_K(v)$ is non-increasing in $K$. Therefore,
$$\limsup_{n\to\infty}\left(\mathbb{E}[f_n(x_n)] - f_n(x^*_n)\right) \le \limsup_{n\to\infty}\mathbb{E}\left[e^{(2)}_n\right] \le \mathbb{E}\left[\limsup_{n\to\infty}e^{(2)}_n\right] \le e.$$

We first consider updating all past excess risk bounds as we go. At time $n$, we plug in $\hat r_{n-1} + t_{n-1}$ in place of $r$ and follow the analysis of Section 2. Define, for $i = 1, \ldots, n$,
$$\hat e^{(n)}_i = b\left(\left(\sqrt{\frac{2}{m}\hat e^{(n)}_{i-1}} + (\hat r_{n-1} + t_{n-1})\right)^2, K_i\right).$$
If it holds that $\hat r_{n-1} + t_{n-1} \ge r$, then $\mathbb{E}[f_i(x_i)] - f_i(x^*_i) \le \hat e^{(n)}_i$ for $i = 1, \ldots, n$. Assumption C.1 guarantees that this holds for all $n$ large enough almost surely. We can thus set $K_n$ equal to the smallest $K$ such that
$$b\left(\left(\sqrt{\frac{2}{m}\max\{\hat e^{(n-1)}_{n-1}, e\}} + (\hat r_{n-1} + t_{n-1})\right)^2, K\right) \le e$$
for all $n \ge 2$. The maximum in this definition ensures that when $\hat r_{n-1} + t_{n-1} \ge r$, we have $K_n \ge K^*$ with $K^*$ from (5). We can therefore apply Theorem 2.

Updating all past estimates of the excess risk bounds from time 1 up to $n$ imposes a computational and memory burden. Suppose that instead, for all $n \ge 2$,
$$K_n = \min\left\{K \ge 1 \,\middle|\, b\left(\left(\sqrt{\frac{2e}{m}} + (\hat r_{n-1} + t_{n-1})\right)^2, K\right) \le e\right\}. \quad (20)$$
This is the same form as the choice in (5) with $\hat r_{n-1} + t_{n-1}$ in place of $r$. Due to assumption C.1, for all $n$ large enough it holds that $\hat r_n + t_n \ge r$ almost surely. Then, by the monotonicity assumption in A.1, for all $n$ large enough we pick $K_n \ge K^*$ almost surely. We can therefore apply Theorem 2.
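The selection rule (20) is a one-dimensional search: increase $K$ until the plug-in bound falls below the target $e$. A sketch under the factored form of assumption C.2, with a purely illustrative bound function standing in for the SGD bounds of Appendix A:

```python
def choose_K(e, r_hat, m, bound, K_max=10**6):
    """Smallest K with b((sqrt(2e/m) + r_hat)^2, K) <= e, as in (20).

    bound(delta, K) plays the role of b(delta, K); r_hat should already
    include the confidence margin t_{n-1}.
    """
    delta = ((2.0 * e / m) ** 0.5 + r_hat) ** 2
    for K in range(1, K_max + 1):
        if bound(delta, K) <= e:
            return K
    raise ValueError("target excess risk unreachable within K_max")

# Purely illustrative factored bound b(delta, K) = a(K) * delta + b(K).
toy_bound = lambda delta, K: 0.5 * 0.9 ** K * delta + 0.1 / K
print(choose_K(e=0.1, r_hat=1.0, m=1.0, bound=toy_bound))  # -> 23
```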
Experiments

We focus on two regression applications, with synthetic and real data, as well as two classification applications, with synthetic and real data. For the synthetic regression problem, we can explicitly compute $r$ and $x^*_n$ and exactly evaluate the performance of our method. It is straightforward to check that all requirements in A.1 through A.4 are satisfied for the problems considered in this section. We apply the "do not update past excess risk" choice of $K_n$ here.

Synthetic Regression

Consider a regression problem with synthetic data using the penalized quadratic loss
$$\ell(x, z) = \left(y - w^\top x\right)^2 + \lambda\|x\|^2$$
with $z = (w, y) \in \mathbb{R}^{d+1}$. The distribution of $z_n$ is zero mean Gaussian with covariance matrix
$$\begin{bmatrix}\sigma_w^2 I & r_{w_n,y_n} \\ r_{w_n,y_n}^\top & \sigma_{y_n}^2\end{bmatrix}.$$
Under these assumptions, we can analytically compute the minimizers $x^*_n$ of $f_n(x) = \mathbb{E}_{z_n\sim p_n}[\ell(x, z_n)]$. We change only $r_{w_n,y_n}$ and $\sigma_{y_n}^2$ appropriately to ensure that $\|x^*_n - x^*_{n-1}\| = r$ holds for all $n$. We find approximate minimizers using SGD with $\lambda = 0.1$, and we estimate $r$ using the direct estimate.

We let $n$ range from 1 to 20 with $r = 1$, a target excess risk $e = 0.1$, and $K_n$ from (20). We average over twenty runs of our algorithm. Figure 2 shows $\hat r_n$, our estimate of $r$, which is above $r$ in general. Figure 3 shows the number of samples $K_n$, which settles down over time. We can exactly compute $f_n(x_n) - f_n(x^*_n)$, and so by averaging over the twenty runs of our algorithm, we can estimate the excess risk (denoted "sample average estimate"). Figure 4 shows this estimate of the excess risk, the target excess risk, and our bound on the excess risk from Section 4.3. We achieve at least our targeted excess risk.

Figure 2: r Estimate
Figure 3: K_n
Figure 4: Excess Risk
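For illustration, the following hypothetical sketch mirrors the structure of this experiment: run SGD on the penalized quadratic loss for each task, form the direct estimate of $r$ from successive approximate minimizers, and pick the next sample size via a routine like choose_K above. The drift model, constants, and step sizes are all assumptions, not the exact experimental setup.

```python
import numpy as np

def sgd(x0, W, Y, lam, step0=0.1):
    """SGD on l(x, z) = (y - w.x)^2 + lam * ||x||^2 with 1/k step decay."""
    x = x0.copy()
    for k, (w, y) in enumerate(zip(W, Y), start=1):
        grad = -2.0 * (y - w @ x) * w + 2.0 * lam * x
        x = x - (step0 / k) * grad
    return x

rng = np.random.default_rng(0)
d, lam, K = 5, 0.1, 50
x_prev, x_hist = np.zeros(d), []
for n in range(20):
    # Hypothetical drift: the target parameter rotates at a constant rate,
    # so ||x*_n - x*_{n-1}|| is (approximately) a fixed r.
    theta = 0.05 * n
    x_star = np.zeros(d)
    x_star[0], x_star[1] = np.cos(theta), np.sin(theta)
    W = rng.standard_normal((K, d))
    Y = W @ x_star + 0.1 * rng.standard_normal(K)
    x_prev = sgd(x_prev, W, Y, lam)
    x_hist.append(x_prev.copy())
    if n >= 1:
        # Direct estimate of r from successive approximate minimizers;
        # the next K_n would come from choose_K(e, r_hat + t_n, m, bound).
        r_hat = np.mean([np.linalg.norm(x_hist[i] - x_hist[i - 1])
                         for i in range(1, len(x_hist))])
```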
Real-Data Regression: PSID

The Panel Study of Income Dynamics (PSID) surveyed individuals every year to gather demographic and income data annually from 1981 to 1997 [24]. We want to predict an individual's annual income ($y$) from several demographic features ($w$), including age, education, and work experience, chosen based on previous economic studies [25]. Conceptually, the idea is to rerun the survey process and determine how many samples we would need if we wanted to solve this regression problem to within a desired excess risk criterion $e$.

We use the same loss function, direct estimate for $r$, and minimization algorithm as in the synthetic regression problem. Income is adjusted for inflation to 1997 dollars, with mean $20,294. We average over twenty runs of our algorithm by resampling without replacement [26]. We compare against taking an equivalent number of samples up front. Figure 5 shows the test losses over time, evaluated over twenty percent of the available samples. The test loss for our approach is substantially less than that of taking the same number of samples up front. The square roots of the average test losses over this time period for our approach and for all samples up front are $1153 ± 352 and $2805 ± 424, respectively, in 1997 dollars.
Figure 5: Test Loss
Synthetic Classification

Consider a binary classification problem using
$$\ell(x, z) = \left((1 - y(w^\top x))_+\right)^2 + \lambda\|x\|^2$$
with $z = (w, y) \in \mathbb{R}^d \times \mathbb{R}$ and $(y)_+ = \max\{y, 0\}$. This is a smoothed version of the hinge loss used in support vector machines (SVM) [26]. We suppose that at time $n$, the two classes have features drawn from Gaussian distributions with the same covariance matrix $\sigma^2 I$ but different means $\mu^{(1)}_n$ and $\mu^{(2)}_n$, i.e., $w_n \mid \{y_n = i\} \sim \mathcal N(\mu^{(i)}_n, \sigma^2 I)$. The class means move slowly over uniformly spaced points on a unit sphere in $\mathbb{R}^d$, as in Figure 6, to ensure that (2) holds. We find approximate minimizers using SGD with $\lambda = 0.1$, and we estimate $r$ using the direct estimate with $t_n \propto n^{-1/2}$.

Figure 6: Evolution of Class Means

We let $n$ range from 1 to 20 and target an excess risk $e = 0.1$. We average over twenty runs of our algorithm. As a comparison, if our algorithm takes $\{K_n\}_{n=1}^{20}$ samples, then we consider taking $\sum_{n=1}^{20}K_n$ samples up front at $n = 1$. Figure 7 shows $\hat r_n$, our estimate of $r$. Figure 8 shows the average test loss for both sampling strategies. To compute the test loss, we draw $T_n$ additional samples $\{z^{\text{test}}_n(k)\}_{k=1}^{T_n}$ from $p_n$ and compute $\frac{1}{T_n}\sum_{k=1}^{T_n}\ell(x_n, z^{\text{test}}_n(k))$. We see that our approach achieves substantially smaller test loss than taking all samples up front.
Figure 7: r Estimate
Figure 8: Test Loss
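The smoothed hinge loss used in these classification experiments is simple to implement; a sketch (our own) of the loss, its gradient, and the holdout test loss $\frac{1}{T_n}\sum_k \ell(x_n, z^{\text{test}}_n(k))$:

```python
import numpy as np

def smoothed_hinge(x, w, y, lam):
    """l(x, z) = max(0, 1 - y * w.x)^2 + lam * ||x||^2, y in {-1, +1}."""
    margin = max(0.0, 1.0 - y * (w @ x))
    return margin ** 2 + lam * (x @ x)

def smoothed_hinge_grad(x, w, y, lam):
    """Gradient of the loss; squaring the hinge makes it differentiable."""
    margin = max(0.0, 1.0 - y * (w @ x))
    return -2.0 * margin * y * w + 2.0 * lam * x

def test_loss(x, W, Y, lam):
    """Holdout test loss (1/T) sum_k l(x, z_test(k))."""
    return float(np.mean([smoothed_hinge(x, w, y, lam) for w, y in zip(W, Y)]))
```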
Real-Data Classification: GSS

The General Social Survey (GSS) surveyed individuals every year to gather socio-economic data annually from 1981 to 2013 [27]. We want to predict an individual's marital status ($y$) from several demographic features ($w$), including age, education, etc. We model this as a binary classification problem using the loss
$$\ell(x, z) = \left((1 - y(w^\top x))_+\right)^2 + \lambda\|x\|^2$$
with $z = (w, y) \in \mathbb{R}^d \times \mathbb{R}$ and $(y)_+ = \max\{y, 0\}$, the same smoothed version of the SVM hinge loss as before [26]. We find approximate minimizers using SGD with $\lambda = 0.1$.

Figure 9 shows the test loss. We see that our approach achieves smaller test loss than taking all samples up front. We also plot receiver operating characteristics (ROC) [26] to characterize the performance of our classifiers; in particular, we plot the ROC for 1974 in Figure 10 and the ROC for 2012 in Figure 11. By examining the ROCs, we see that taking all samples up front is much better in 1974 but much worse in 2012.

Figure 9: Test Loss

Figure 10: ROC for 1974

Figure 11: ROC for 2012
Conclusion

We introduced a framework for adaptively solving a sequence of optimization problems, with applications to machine learning. We developed estimates of the change in the minimizers, which are used to determine the number of samples $K_n$ needed to achieve a target excess risk $e$. Experiments with synthetic and real data demonstrate that this approach is effective.

References

[1] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, The MIT Press, 2012.
[2] A. Agarwal, H. Daumé, and S. Gerber, "Learning multiple tasks using manifold regularization," in NIPS, 2011, pp. 46-54.
[3] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), New York, NY, USA, 2004, pp. 109-117, ACM.
[4] Y. Zhang and D. Yeung, "A convex formulation for learning task relationships in multi-task learning," CoRR, vol. abs/1203.3536, 2012.
[5] S. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
[6] A. Agarwal, A. Rakhlin, and P. Bartlett, "Matrix regularization techniques for online multitask learning," Tech. Rep. UCB/EECS-2008-138, EECS Department, University of California, Berkeley, Oct. 2008.
[7] Z. Towfic, J. Chu, and A. Sayed, "Online distributed classification in the midst of concept drifts," Neurocomputing, vol. 112, pp. 138-152, 2013.
[8] C. Tekin, L. Canzian, and M. van der Schaar, "Context adaptive big data stream mining," in Allerton Conference, 2014, pp. 46-54.
[9] T. Dietterich, "Machine learning for sequential data: A review," in Structural, Syntactic, and Statistical Pattern Recognition, 2002, pp. 15-30.
[10] T. Fawcett and F. Provost, "Adaptive fraud detection," Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 291-316, 1997.
[11] N. Qian and T. Sejnowski, "Predicting the secondary structure of globular proteins using neural network models," Journal of Molecular Biology, vol. 202, pp. 865-884, Aug. 1988.
[12] Y. Bengio and P. Frasconi, "Input-output HMM's for sequence processing," IEEE Transactions on Neural Networks, vol. 7, no. 5, pp. 1231-1249, 1996.
[13] A. Dontchev and R. Rockafellar, Implicit Functions and Solution Mappings: A View from Variational Analysis, Springer, New York, NY, 2009.
[14] B. Sriperumbudur, "On the empirical estimation of integral probability metrics," Electronic Journal of Statistics, pp. 1550-1599, 2012.
[15] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," Tech. Rep., University of Michigan, 2012.
[16] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization, vol. 19, pp. 1574-1609, 2009.
[17] V. V. Buldygin and E. D. Pechuk, "Inequalities for the distributions of functionals of sub-Gaussian vectors," Theory of Probability and Mathematical Statistics, pp. 25-36, 2010.
[18] S. Janson, "Large deviations for sums of partly dependent random variables," Random Structures & Algorithms, vol. 24, pp. 234-248, 2004.
[19] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press, 2013.
[20] L. Trefethen, Numerical Linear Algebra, SIAM, 1997.
[21] J. Kennan, "Uniqueness of positive fixed points for increasing concave functions on R^n: An elementary result," Review of Economic Dynamics, vol. 4, pp. 893-899, 2001.
[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.
[23] A. Granas and J. Dugundji, Fixed Point Theory, Springer-Verlag, 2003.
[24] "Panel Study of Income Dynamics: public use dataset," Survey Research Center, 2015.
[25] S. Jenkins and P. Van Kerm, "Trends in income inequality, pro-poor income growth, and income mobility," Oxford Economic Papers, vol. 58, no. 3, pp. 531-548, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, New York, 2001.
[27] "General Social Survey," National Opinion Research Center, 2015.
[28] F. Bach and E. Moulines, "Non-asymptotic analysis of stochastic approximation algorithms for machine learning," in Advances in Neural Information Processing Systems (NIPS), Spain, 2011.
[29] D. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
[30] L. Bottou, "Online learning and stochastic approximations," 1998.
[31] A. Nedić and S. Lee, "Analysis of mirror descent for strongly convex functions," arXiv, 2013.
[32] Yu. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Norwell, MA, USA, 2004.
[33] R. Antonini and Y. Kozachenko, "A note on the asymptotic behavior of sequences of generalized sub-Gaussian random vectors," Random Operators and Stochastic Equations, vol. 13, pp. 39-52, 2005.
A Examples of b(δ, K)

For this section, we drop the $n$ index for convenience. The bounds of this form depend on the strong convexity parameter $m$ and an assumption on how the gradients grow. In general, we assume that
$$\mathbb{E}_{z\sim p}\|\nabla_x\ell(x, z)\|^2 \le A + B\|x - x^*\|^2.$$
The base algorithm we look at is SGD. First, we generate iterates $x(0), \ldots, x(K)$ through SGD as follows:
$$x(\ell+1) = P_X\left[x(\ell) - \mu(\ell+1)\nabla_x\ell(x(\ell), z(\ell+1))\right], \quad \ell = 0, \ldots, K-1,$$
with $x(0)$ fixed. We then combine the iterates to yield a final approximate minimizer $\bar x(K) = f(x(0), \ldots, x(K))$. For our choice of $f$, we look at two cases (a code sketch follows the definition below):
1. No iterate averaging, i.e., $f(x(0), \ldots, x(K)) = x(K)$.
2. Iterate averaging, i.e., for a convex combination $\{\lambda(\ell)\}_{\ell=0}^{K}$, $f(x(0), \ldots, x(K)) = \sum_{\ell=0}^{K}\lambda(\ell)x(\ell)$.
Define
$$d(\ell) \triangleq \|x(\ell) - x^*\|^2. \quad (21)$$
First, we bound $\mathbb{E}[d(\ell)]$ in Lemma 17.
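The code sketch referenced above: projected SGD generating $x(0), \ldots, x(K)$ and the two combination rules. The projection onto $X$ is left abstract; a Euclidean ball is used purely for illustration.

```python
import numpy as np

def sgd_iterates(x0, grad_oracle, steps, radius=None):
    """x(l+1) = P_X[x(l) - mu(l+1) grad_x l(x(l), z(l+1))], l = 0..K-1.

    steps = [mu(1), ..., mu(K)]; grad_oracle(x) returns a stochastic
    gradient. X is illustrated as a Euclidean ball of the given radius.
    """
    xs = [np.asarray(x0, dtype=float)]
    for mu in steps:
        x = xs[-1] - mu * grad_oracle(xs[-1])
        if radius is not None:
            nrm = np.linalg.norm(x)
            if nrm > radius:
                x = x * (radius / nrm)   # projection onto the ball
        xs.append(x)
    return xs

def combine(xs, weights=None):
    """Case 1 (weights None): return x(K). Case 2: convex combination."""
    if weights is None:
        return xs[-1]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(xs), axes=1)
```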
Lemma 17. Suppose that the function $f(x)$ has Lipschitz continuous gradients. Then it holds that
$$\mathbb{E}[d(\ell)] \le \prod_{k=1}^{\ell}\left(1 - 2m\mu(k) + B\mu(k)^2\right)d(0) + A\sum_{k=1}^{\ell}\prod_{i=k+1}^{\ell}\left(1 - 2m\mu(i) + B\mu(i)^2\right)\mu(k)^2.$$

Proof.
Following the standard SGD analysis (see [16]), it holds that
$$d(\ell) \le \|x(\ell-1) - x^* - \mu(\ell)\nabla_x\ell(x(\ell-1), z(\ell))\|^2 \le d(\ell-1) - 2\mu(\ell)\langle x(\ell-1) - x^*, \nabla_x\ell(x(\ell-1), z(\ell))\rangle + \mu(\ell)^2\|\nabla_x\ell(x(\ell-1), z(\ell))\|^2.$$
Then it follows that
$$\mathbb{E}[d(\ell) \mid x(\ell-1)] \le d(\ell-1) - 2\mu(\ell)\langle x(\ell-1) - x^*, \nabla f(x(\ell-1))\rangle + \mu(\ell)^2\mathbb{E}\left[\|\nabla_x\ell(x(\ell-1), z(\ell))\|^2 \mid x(\ell-1)\right] \le \left(1 - 2m\mu(\ell) + B\mu(\ell)^2\right)d(\ell-1) + \mu(\ell)^2A,$$
and hence
$$\mathbb{E}[d(\ell)] \le \left(1 - 2m\mu(\ell) + B\mu(\ell)^2\right)\mathbb{E}[d(\ell-1)] + \mu(\ell)^2A.$$
Since $B \ge m^2$ (which may be assumed by enlarging $B$), we have
$$2m\mu(\ell) - B\mu(\ell)^2 = \frac{m^2}{B} - \left(\sqrt{B}\mu(\ell) - \frac{m}{\sqrt{B}}\right)^2 \le \frac{m^2}{B} \le 1 \;\Rightarrow\; 1 - 2m\mu(\ell) + B\mu(\ell)^2 \ge 0,$$
so the recursion can be unrolled to yield
$$\mathbb{E}[d(\ell)] \le \prod_{k=1}^{\ell}\left(1 - 2m\mu(k) + B\mu(k)^2\right)d(0) + A\sum_{k=1}^{\ell}\prod_{i=k+1}^{\ell}\left(1 - 2m\mu(i) + B\mu(i)^2\right)\mu(k)^2.$$

The bound in Lemma 17 can be further bounded in closed form following [28]. Define
$$\varphi_\beta(t) = \begin{cases}\frac{t^\beta - 1}{\beta}, & \text{if } \beta \neq 0 \\ \log(t), & \text{if } \beta = 0.\end{cases}$$
If $\mu(\ell) = C\ell^{-\alpha}$, it holds that
$$\mathbb{E}[d(\ell)] \le \begin{cases}2\exp\left\{2BC^2\varphi_{1-2\alpha}(\ell)\right\}\exp\left\{-\frac{mC}{2}\ell^{1-\alpha}\right\}\left(\mathbb{E}[d(0)] + \frac{A}{B}\right) + \frac{2AC}{m\ell^{\alpha}}, & \text{if } 0 \le \alpha < 1 \\ \frac{\exp\{2BC^2\}}{\ell^{mC}}\left(\mathbb{E}[d(0)] + \frac{A}{B}\right) + 2AC^2\frac{\varphi_{mC/2-1}(\ell)}{\ell^{mC/2}}, & \text{if } \alpha = 1.\end{cases}$$
First, consider no averaging, $f(x(0), \ldots, x(K)) = x(K)$.
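Since the bound of Lemma 17 is a finite recursion, it (and hence the $b(\delta, K)$ of Lemma 18 below) can be evaluated exactly rather than through the closed form. A sketch, assuming step sizes $\mu(\ell) = C\ell^{-\alpha}$ satisfying the conditions above:

```python
def d_bound(delta0, K, m, A, B, C=0.5, alpha=1.0):
    """Unrolled Lemma 17 recursion: a bound on E[d(K)] with mu(l) = C/l^alpha.

    Assumes C is small enough that 1 - 2*m*mu + B*mu^2 stays in [0, 1].
    """
    bound = delta0
    for l in range(1, K + 1):
        mu = C / l ** alpha
        bound = (1.0 - 2.0 * m * mu + B * mu * mu) * bound + A * mu * mu
    return bound

def b_no_averaging(delta0, K, m, M, A, B, **kw):
    """b(delta, K) = (M/2) * (bound on E[d(K)]), as in Lemma 18."""
    return 0.5 * M * d_bound(delta0, K, m, A, B, **kw)
```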
Lemma 18. With arbitrary step sizes, and assuming that $f(x)$ has Lipschitz continuous gradients with modulus $M$, it holds that $\mathbb{E}[f(\bar x(K))] - f(x^*) \le \frac{M}{2}\mathbb{E}[d(K)]$, and therefore we set
$$b(\delta, K) = \frac{M}{2}\left(\prod_{\ell=1}^{K}\left(1 - 2m\mu(\ell) + B\mu(\ell)^2\right)\delta + A\sum_{\ell=1}^{K}\prod_{i=\ell+1}^{K}\left(1 - 2m\mu(i) + B\mu(i)^2\right)\mu(\ell)^2\right).$$

Proof.
Using the descent lemma from [29], it holds that $\mathbb{E}[f(\bar x(K))] - f(x^*) \le \frac{M}{2}\mathbb{E}[d(K)]$. Plugging in the bound from Lemma 17 yields the bound $b(\delta, K)$.

Next, we introduce a bound inspired by [30] for the case where $f(x(0), \ldots, x(K))$ corresponds to forming a convex combination of the iterates.
Lemma 19. With a constant step size $\mu$ and averaging weights
$$\lambda(\ell) = \begin{cases}\frac{\gamma(\ell)}{\sum_{t=1}^{K}\gamma(t)}, & \text{if } \ell \ge 1 \\ 0, & \text{if } \ell = 0,\end{cases} \quad \text{where } \gamma(\ell) = \left(1 - m\mu + B\mu^2\right)^{-\ell},$$
it holds that
$$b(\delta, K) = \frac{\delta}{2\mu\sum_{\ell=1}^{K}\gamma(\ell)} + \frac{A\mu}{2}.$$

Proof.
By strong convexity, it holds that
$$-\langle x(\ell-1) - x^*, \nabla f(x(\ell-1))\rangle \le -\frac{m}{2}\|x(\ell-1) - x^*\|^2 - \left(f(x(\ell-1)) - f(x^*)\right).$$
Following the Lyapunov-style analysis of Lemma 17, it holds that
$$\mathbb{E}[d(\ell)] \le \left(1 - m\mu + B\mu^2\right)\mathbb{E}[d(\ell-1)] - 2\mu\left(\mathbb{E}[f(x(\ell-1))] - f(x^*)\right) + A\mu^2.$$
Rearranging, using the telescoping sum, and using convexity, it holds that
$$\mathbb{E}[f(\bar x(K))] - f(x^*) \le \frac{d(0)}{2\mu\sum_{t=1}^{K}\gamma(t)} + \frac{A\mu}{2}.$$
If we set $\mu = \frac{1}{\sqrt{K}}$, then it holds that $b(\delta, K) = O\left(\frac{1}{\sqrt{K}}\right)$ for Lemma 19.
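The weights of Lemma 19 grow geometrically in $\ell$, so later iterates dominate the average. A sketch that computes them in a numerically stable way by rescaling with the contraction factor (assumed to lie in $(0, 1)$):

```python
def lemma19_weights(K, m, B, mu):
    """Weights lambda(l) proportional to (1 - m*mu + B*mu^2)^(-l), l >= 1.

    Rescaling every term by q**K keeps all intermediate values in (0, 1],
    which avoids overflow for large K; lambda(0) = 0.
    """
    q = 1.0 - m * mu + B * mu * mu    # contraction factor, assumed in (0, 1)
    w = [0.0] + [q ** (K - l) for l in range(1, K + 1)]
    s = sum(w)
    return [wi / s for wi in w]
```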
We next consider an extension of the averaging scheme in [31]. The bound in [31] only applies when $B = 0$, so we extend it slightly to handle $B > 0$.

Lemma 20.
Consider the choice of step sizes $\mu(\ell) = \frac{2}{m\ell}$ for $\ell \ge 1$, with the convention $\mu(0) = 1$, and the averaging weights $\lambda(\ell) = \frac{1/\mu(\ell)}{\sum_{t=0}^{K}1/\mu(t)}$. Then
$$b(\delta, K) = \frac{\delta + (K+1)A + B\sum_{\ell=1}^{K}g(\ell)}{1 + \frac{m}{4}K(K+1)},$$
where $g(\ell)$ is any sequence with $\mathbb{E}[d(\ell)] \le g(\ell)$. Note that we can use the bound in Lemma 17 here.

Proof.
Using the Lyapunov-style analysis, we have
$$\mathbb{E}[d(\ell)] \le \left(1 - m\mu(\ell) + B\mu(\ell)^2\right)\mathbb{E}[d(\ell-1)] - 2\mu(\ell)\left(\mathbb{E}[f(x(\ell))] - f(x^*)\right) + A\mu(\ell)^2.$$
Dividing by $\mu(\ell)^2$,
$$\frac{1}{\mu(\ell)^2}\mathbb{E}[d(\ell)] \le \left(\frac{1 - m\mu(\ell)}{\mu(\ell)^2} + B\right)\mathbb{E}[d(\ell-1)] - \frac{2}{\mu(\ell)}\left(\mathbb{E}[f(x(\ell))] - f(x^*)\right) + A.$$
With $\mu(\ell) = C/\ell$, it holds that
$$\frac{1 - m\mu(\ell)}{\mu(\ell)^2} - \frac{1}{\mu(\ell-1)^2} = \frac{1}{\mu(\ell)^2} - \frac{m}{\mu(\ell)} - \frac{1}{\mu(\ell-1)^2} = \frac{\ell^2}{C^2} - \frac{m\ell}{C} - \frac{(\ell-1)^2}{C^2} = \frac{(2 - mC)\ell - 1}{C^2}.$$
As long as $mC \ge 2$, i.e., $C \ge \frac{2}{m}$ (satisfied by our choice $C = \frac{2}{m}$), we get
$$\frac{1}{\mu(\ell)^2}\mathbb{E}[d(\ell)] - \frac{1}{\mu(\ell-1)^2}\mathbb{E}[d(\ell-1)] \le B\,\mathbb{E}[d(\ell-1)] - \frac{2}{\mu(\ell)}\left(\mathbb{E}[f(x(\ell))] - f(x^*)\right) + A.$$
Summing and rearranging yields
$$\sum_{\ell=1}^{K}\frac{2}{\mu(\ell)}\left(\mathbb{E}[f(x(\ell))] - f(x^*)\right) \le d(0) + (K+1)A + B\sum_{\ell=1}^{K}\mathbb{E}[d(\ell)]$$
with $\mu(0) = 1$. Normalizing by
$$\sum_{t=0}^{K}\frac{1}{\mu(t)} = 1 + \sum_{t=1}^{K}\frac{mt}{2} = 1 + \frac{m}{4}K(K+1)$$
and using convexity with the weights $\lambda(\ell)$ gives
$$\mathbb{E}[f(\bar x(K))] - f(x^*) \le \frac{d(0) + (K+1)A + B\sum_{\ell=1}^{K}\mathbb{E}[d(\ell)]}{1 + \frac{m}{4}K(K+1)},$$
which is the claimed bound.

For the choice of step sizes in Lemma 20, the bound in Lemma 17 gives $\mathbb{E}[d(\ell)] = O\left(\frac{1}{\ell}\right)$. Since $\sum_{\ell=1}^{K}\frac{1}{\ell} = O(\log K)$, it holds that
$$\mathbb{E}[f(\bar x(K))] - f(x^*) = O\left(\frac{d(0)}{K^2} + \frac{\log(K)}{K^2} + \frac{1}{K}\right).$$
Note that a rate of $O\left(\frac{1}{K}\right)$ is minimax optimal for stochastic minimization of a strongly convex function [32].

Next, we look at a special case of averaging for functions such that
$$\mathbb{E}\left\|\nabla_x\ell(x, z) - \nabla_x\ell(\tilde x, z) - \nabla_{xx}\ell(\tilde x, z)(x - \tilde x)\right\|^2 = 0.$$
Lemma 21. Assume that $\mathbb{E}\|\nabla_x\ell(x, z) - \nabla_x\ell(\tilde x, z) - \nabla_{xx}\ell(\tilde x, z)(x - \tilde x)\|^2 = 0$, and select step sizes $\mu(\ell) = C\ell^{-\alpha}$ with $\alpha > 1/2$ and the uniform averaging weights
$$\lambda(\ell) = \begin{cases}\frac{1}{K}, & \text{if } \ell \ge 1 \\ 0, & \text{if } \ell = 0.\end{cases}$$
Then it holds that
$$\left(\mathbb{E}[\bar d(K)]\right)^{1/2} \le \frac{1}{mK}\sum_{k=1}^{K-1}\left|\frac{1}{\mu(k+1)} - \frac{1}{\mu(k)}\right|\left(\mathbb{E}[d(k)]\right)^{1/2} + \frac{1}{mK\mu(1)}\left(\mathbb{E}[d(0)]\right)^{1/2} + \frac{1}{mK\mu(K)}\left(\mathbb{E}[d(K)]\right)^{1/2} + \sqrt{\frac{A}{m^2K}} + \sqrt{\frac{B}{m^2K^2}\sum_{k=1}^{K}\mathbb{E}[d(k-1)]}$$
with $\bar d(K) = \|\bar x(K) - x^*\|^2$. If in addition $f$ has Lipschitz continuous gradients with modulus $M$, then it holds that
$$\mathbb{E}[f(\bar x(K))] - f(x^*) \le \frac{M}{2}\mathbb{E}[\bar d(K)].$$

Proof.
Suppose that we set $\bar x(K) = \frac{1}{K}\sum_{k=1}^{K}x(k)$. Under the assumption, almost surely
$$\nabla_{xx}f(x^*)(x(k-1) - x^*) = \nabla_x\ell(x(k-1), z(k-1)) - \nabla_x\ell(x^*, z(k-1)) + \left[\nabla_{xx}f(x^*) - \nabla_{xx}\ell(x^*, z(k-1))\right](x(k-1) - x^*),$$
yielding
$$\nabla_{xx}f(x^*)(\bar x(K) - x^*) = \frac{1}{K}\sum_{k=1}^{K}\nabla_x\ell(x(k-1), z(k-1)) - \frac{1}{K}\sum_{k=1}^{K}\nabla_x\ell(x^*, z(k-1)) + \frac{1}{K}\sum_{k=1}^{K}\left[\nabla_{xx}f(x^*) - \nabla_{xx}\ell(x^*, z(k-1))\right](x(k-1) - x^*).$$
First, using the SGD recursion $\nabla_x\ell(x(k-1), z(k-1)) = \frac{1}{\mu(k)}\left(x(k-1) - x(k)\right)$, we have
$$\frac{1}{K}\sum_{k=1}^{K}\nabla_x\ell(x(k-1), z(k-1)) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{\mu(k)}\left((x(k-1) - x^*) - (x(k) - x^*)\right) = \frac{1}{K}\sum_{k=1}^{K-1}\left(\frac{1}{\mu(k+1)} - \frac{1}{\mu(k)}\right)(x(k) - x^*) + \frac{1}{K\mu(1)}(x(0) - x^*) - \frac{1}{K\mu(K)}(x(K) - x^*).$$
Second, we have
$$\mathbb{E}\left\|\frac{1}{K}\sum_{k=1}^{K}\nabla_x\ell(x^*, z(k-1))\right\|^2 = \frac{1}{K^2}\sum_{k=1}^{K}\mathbb{E}\|\nabla_x\ell(x^*, z(k-1))\|^2 \le \frac{A}{K}.$$
Third, we have
$$\mathbb{E}\left\|\frac{1}{K}\sum_{k=1}^{K}\left[\nabla_{xx}f(x^*) - \nabla_{xx}\ell(x^*, z(k-1))\right](x(k-1) - x^*)\right\|^2 \le \frac{B}{K^2}\sum_{k=1}^{K}\mathbb{E}[d(k-1)].$$
Combining these bounds with Minkowski's inequality and the strong-convexity estimate $m\left(\mathbb{E}[\bar d(K)]\right)^{1/2} \le \left(\mathbb{E}\|\nabla_{xx}f(x^*)(\bar x(K) - x^*)\|^2\right)^{1/2}$ yields
$$\left(\mathbb{E}[\bar d(K)]\right)^{1/2} \le \frac{1}{mK}\sum_{k=1}^{K-1}\left|\frac{1}{\mu(k+1)} - \frac{1}{\mu(k)}\right|\left(\mathbb{E}[d(k)]\right)^{1/2} + \frac{1}{mK\mu(1)}\left(\mathbb{E}[d(0)]\right)^{1/2} + \frac{1}{mK\mu(K)}\left(\mathbb{E}[d(K)]\right)^{1/2} + \sqrt{\frac{A}{m^2K}} + \sqrt{\frac{B}{m^2K^2}\sum_{k=1}^{K}\mathbb{E}[d(k-1)]}.$$
This gives $\mathbb{E}[\bar d(K)] = O\left(\frac{1}{K}\right)$ as long as $\mu(\ell) = C\ell^{-\alpha}$ with $\frac{1}{2} < \alpha \le 1$.

B Useful Concentration Inequalities
For our analysis of both the direct and IPM estimates, we need the following key technical lemma from [33]. This lemma controls the concentration of sums of random variables that are sub-Gaussian conditioned on a particular filtration $\{\mathcal F_i\}_{i=1}^{n}$. Such a collection of random variables is referred to as a sub-Gaussian martingale sequence. We include the proof for completeness.

Lemma 22 (Theorem 7.5 of [33]). Suppose we have a collection of random variables $\{V_i\}_{i=1}^{n}$ and a filtration $\{\mathcal F_i\}_{i=1}^{n}$ such that for each random variable $V_i$ it holds that:
1. $\mathbb{E}\left[e^{sV_i} \mid \mathcal F_{i-1}\right] \le e^{\sigma_i^2s^2/2}$ with $\sigma_i$ a constant;
2. $V_i$ is $\mathcal F_i$-measurable.
Then for every $a \in \mathbb{R}^n$ it holds that
$$P\left(\sum_{i=1}^{n}a_iV_i > t\right) \le \exp\left\{-\frac{t^2}{2\tau_n^2}\right\} \quad \forall t > 0$$
and
$$P\left(\sum_{i=1}^{n}a_iV_i < -t\right) \le \exp\left\{-\frac{t^2}{2\tau_n^2}\right\} \quad \forall t > 0,$$
with $\tau_n^2 = \sum_{i=1}^{n}\sigma_i^2a_i^2$.

Proof.
We bound the moment generating function of $\sum_{i=1}^{n}a_iV_i$ by induction. As a base case, we have
$$\mathbb{E}\left[e^{sa_1V_1}\right] = \mathbb{E}\left[\mathbb{E}\left[e^{sa_1V_1} \mid \mathcal F_0\right]\right] \le e^{\sigma_1^2a_1^2s^2/2}.$$
Assume for induction that
$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{j}a_iV_i\right\}\right] \le \exp\left\{\left(\sum_{i=1}^{j}\sigma_i^2a_i^2\right)\frac{s^2}{2}\right\}.$$
Then
$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{j+1}a_iV_i\right\}\right] = \mathbb{E}\left[\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{j}a_iV_i\right\}e^{sa_{j+1}V_{j+1}} \,\middle|\, \mathcal F_j\right]\right] \overset{(a)}{=} \mathbb{E}\left[\exp\left\{s\sum_{i=1}^{j}a_iV_i\right\}\mathbb{E}\left[e^{sa_{j+1}V_{j+1}} \mid \mathcal F_j\right]\right] \overset{(b)}{\le} \mathbb{E}\left[\exp\left\{s\sum_{i=1}^{j}a_iV_i\right\}\right]e^{\sigma_{j+1}^2a_{j+1}^2s^2/2} \overset{(c)}{\le} \exp\left\{\left(\sum_{i=1}^{j+1}\sigma_i^2a_i^2\right)\frac{s^2}{2}\right\},$$
where (a) follows since $\sum_{i=1}^{j}a_iV_i$ is $\mathcal F_j$-measurable, (b) follows since $\mathbb{E}[e^{sa_{j+1}V_{j+1}} \mid \mathcal F_j] \le e^{\sigma_{j+1}^2a_{j+1}^2s^2/2}$, and (c) is the inductive assumption. This proves that
$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n}a_iV_i\right\}\right] \le \exp\left\{\left(\sum_{i=1}^{n}\sigma_i^2a_i^2\right)\frac{s^2}{2}\right\} = \exp\left\{\frac{\tau_n^2s^2}{2}\right\}.$$
Using the Chernoff bound [19], we have
$$P\left(\sum_{i=1}^{n}a_iV_i > t\right) \le e^{-st}\,\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n}a_iV_i\right\}\right] \le \exp\left\{-st + \frac{\tau_n^2s^2}{2}\right\}.$$
Optimizing the bound over $s$ yields
$$P\left(\sum_{i=1}^{n}a_iV_i > t\right) \le \exp\left\{-\frac{t^2}{2\tau_n^2}\right\}.$$
The proof for the other tail is similar.

If the random variables instead satisfy:
1. $\mathbb{E}\left[\exp\left\{s\left(V_i - \mathbb{E}[V_i \mid \mathcal F_{i-1}]\right)\right\} \,\middle|\, \mathcal F_{i-1}\right] \le e^{\sigma_i^2s^2/2}$ with $\sigma_i$ a constant;
2. $V_i$ is $\mathcal F_i$-measurable;
then Lemma 22 can be applied to $\{V_i - \mathbb{E}[V_i \mid \mathcal F_{i-1}]\}_{i=1}^{n}$ to yield
$$P\left(\sum_{i=1}^{n}a_iV_i > \sum_{i=1}^{n}a_i\mathbb{E}[V_i \mid \mathcal F_{i-1}] + t\right) \le \exp\left\{-\frac{t^2}{2\tau_n^2}\right\}.$$
If we can upper bound the conditional expectations $\mathbb{E}[V_i \mid \mathcal F_{i-1}] \le C_i$ by $\mathcal F_{i-1}$-measurable random variables $C_i$, then we have
$$P\left(\sum_{i=1}^{n}a_iV_i > \sum_{i=1}^{n}a_iC_i + t\right) \le P\left(\sum_{i=1}^{n}a_iV_i > \sum_{i=1}^{n}a_i\mathbb{E}[V_i \mid \mathcal F_{i-1}] + t\right) \le \exp\left\{-\frac{t^2}{2\tau_n^2}\right\}.$$
For our analysis, we generally cannot compute $\mathbb{E}[V_i \mid \mathcal F_{i-1}]$, but we can find "nice" $C_i$. To find $\sigma_i$ for use in Lemma 22, we frequently use the following conditional version of Hoeffding's Lemma.

Lemma 23 (Conditional Hoeffding's Lemma). If a random variable $V$ and a sigma algebra $\mathcal F$ satisfy $a \le V \le b$ and $\mathbb{E}[V \mid \mathcal F] = 0$, then
$$\mathbb{E}\left[e^{sV} \mid \mathcal F\right] \le \exp\left\{\frac{(b-a)^2s^2}{8}\right\}.$$

Proof.
We follow the standard proof of Hoeffding's Lemma from [19]. Since $e^{sx}$ is convex, it follows that
$$e^{sx} \le \frac{b-x}{b-a}e^{sa} + \frac{x-a}{b-a}e^{sb}, \quad a \le x \le b.$$
Therefore, taking the conditional expectation with respect to $\mathcal F$ yields
$$\mathbb{E}\left[e^{sV} \mid \mathcal F\right] \le \frac{b - \mathbb{E}[V \mid \mathcal F]}{b-a}e^{sa} + \frac{\mathbb{E}[V \mid \mathcal F] - a}{b-a}e^{sb}. \quad (22)$$
Let $h = s(b-a)$, $p = \frac{-a}{b-a}$, and $L(h) = -hp + \log\left(1 - p + pe^h\right)$. Then we have
$$e^{L(h)} = \frac{b}{b-a}e^{sa} + \frac{-a}{b-a}e^{sb} = \frac{b - \mathbb{E}[V \mid \mathcal F]}{b-a}e^{sa} + \frac{\mathbb{E}[V \mid \mathcal F] - a}{b-a}e^{sb} \quad (23)$$
since $\mathbb{E}[V \mid \mathcal F] = 0$. Since $L(0) = L'(0) = 0$ and $L''(h) \le \frac{1}{4}$, it holds that $L(h) \le \frac{h^2}{8} = \frac{(b-a)^2s^2}{8}$. Combining this bound on $L(h)$ with (22) and (23) yields the result.

Before proceeding with our analysis, we need to introduce a few useful concentration inequalities for sub-Gaussian vector-valued random variables. First, for a scalar random variable $x$, define the sub-Gaussian norm
$$\tau(x) = \inf\left\{a > 0 \,\middle|\, \mathbb{E}\left[e^{sx}\right] \le e^{a^2s^2/2} \;\forall s \ge 0\right\}. \quad (24)$$
Clearly, if $\tau(x) < +\infty$, then $x$ is sub-Gaussian. Second, for a random vector $v$ in $\mathbb{R}^d$, define
$$B(v) = \sum_{i=1}^{d}\tau\left((v)_i\right), \quad (25)$$
where $(v)_i$ is the $i$th component of $v$. We define $v$ to be sub-Gaussian if $B(v) < +\infty$.

Of crucial importance in our analysis is the norm of an average of vector-valued sub-Gaussian random variables. The following lemma describes how to control the sub-Gaussian norm in this situation.
Lemma 24. Suppose that $\{v_i\}_{i=1}^{K}$ is a collection of independent sub-Gaussian random variables in $\mathbb{R}^d$. Then it holds that
$$B\left(\frac{1}{K}\sum_{i=1}^{K}v_i\right) \le \frac{1}{K}\sum_{j=1}^{d}\sqrt{\sum_{i=1}^{K}\tau\left((v_i)_j\right)^2}.$$
If in addition the random variables $\{v_i\}_{i=1}^{K}$ satisfy $\max_{i=1,\ldots,K}\max_{j=1,\ldots,d}\tau((v_i)_j) \le \bar\tau$, then it holds that
$$B\left(\frac{1}{K}\sum_{i=1}^{K}v_i\right) \le \frac{\bar\tau d}{\sqrt{K}}.$$

Proof. We analyze one component of the sum $\frac{1}{K}\sum_{i=1}^{K}v_i$. It holds that
$$\mathbb{E}\left[\exp\left\{s\left(\frac{1}{K}\sum_{i=1}^{K}v_i\right)_j\right\}\right] = \mathbb{E}\left[\exp\left\{\frac{s}{K}\sum_{i=1}^{K}(v_i)_j\right\}\right] = \prod_{i=1}^{K}\mathbb{E}\left[\exp\left\{\frac{s}{K}(v_i)_j\right\}\right] \le \prod_{i=1}^{K}\exp\left\{\frac{1}{2}\frac{1}{K^2}\tau\left((v_i)_j\right)^2s^2\right\} = \exp\left\{\frac{1}{2}\left(\frac{1}{K^2}\sum_{i=1}^{K}\tau\left((v_i)_j\right)^2\right)s^2\right\}.$$
This implies that
$$\tau\left(\left(\frac{1}{K}\sum_{i=1}^{K}v_i\right)_j\right) \le \frac{1}{K}\sqrt{\sum_{i=1}^{K}\tau\left((v_i)_j\right)^2}$$
and so
$$B\left(\frac{1}{K}\sum_{i=1}^{K}v_i\right) \le \frac{1}{K}\sum_{j=1}^{d}\sqrt{\sum_{i=1}^{K}\tau\left((v_i)_j\right)^2}.$$
Finally, if $\tau((v_i)_j) \le \bar\tau$, then we have
$$B\left(\frac{1}{K}\sum_{i=1}^{K}v_i\right) \le \frac{1}{K}\sum_{j=1}^{d}\sqrt{\sum_{i=1}^{K}\bar\tau^2} = \frac{\bar\tau d}{\sqrt{K}}.$$

Example 3.2 from [17], a consequence of Theorem 3.1 in [17], is useful for the concentration of the norm of sub-Gaussian vector random variables.
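As a quick numerical sanity check on Lemmas 24 and 25 (with the tail constant as reconstructed here), one can compare the Monte Carlo tail of $\|\frac{1}{K}\sum_i v_i\|$ against the bound $2\exp\{-t^2/(2B^2)\}$ with $B \le \bar\tau d/\sqrt{K}$; bounded coordinates are sub-Gaussian with $\bar\tau$ taken from Hoeffding's lemma:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, trials, t = 3, 100, 20000, 0.5
# Coordinates uniform on [-1, 1]: zero mean, sub-Gaussian with tau <= 1
# by Hoeffding's lemma (Lemma 23).
v = rng.uniform(-1.0, 1.0, size=(trials, K, d))
norms = np.linalg.norm(v.mean(axis=1), axis=1)  # ||(1/K) sum_i v_i||
B = 1.0 * d / np.sqrt(K)                        # Lemma 24: B <= tau_bar*d/sqrt(K)
print("empirical tail :", np.mean(norms > t))
print("Lemma 25 bound :", 2.0 * np.exp(-t ** 2 / (2.0 * B ** 2)))
```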
Lemma 25 (Example 3.2 of [17]). If $v$ is a random vector in $\mathbb{R}^d$ with $B(v) < +\infty$, then
$$P\{\|v\| > t\} \le 2\exp\left\{-\frac{t^2}{2B(v)^2}\right\}.$$
Finally, we will also need to deal with dependent random variables that are sub-Gaussian with respect to a particular filtration.
Lemma 26.
Suppose that a random variable $V$ and a sigma algebra $\mathcal F$ satisfy:
1. $\mathbb{E}[V \mid \mathcal F] = 0$;
2. $P\{|V| > t \mid \mathcal F\} \le 2e^{-ct^2}$ with $c$ a constant.
Then it holds that
$$\mathbb{E}\left[e^{sV} \mid \mathcal F\right] \le \exp\left\{\left(\frac{4}{c}\right)s^2\right\}$$
for all $s \ge 0$.

Proof. We adapt the characterization of sub-Gaussian random variables in [15]. First, for any $a < c$ we have
$$\mathbb{E}\left[e^{aV^2} \mid \mathcal F\right] \le 1 + \int_0^{\infty}2ate^{at^2}P\{|V| > t \mid \mathcal F\}\,dt \le 1 + \int_0^{\infty}4ate^{-(c-a)t^2}\,dt = 1 + \frac{2a}{c-a}.$$
Setting $a = \frac{c}{2}$ yields the bound $\mathbb{E}[e^{aV^2} \mid \mathcal F] \le 3$. Since $\mathbb{E}[V \mid \mathcal F] = 0$, by a Taylor expansion we have
$$\mathbb{E}\left[e^{sV} \mid \mathcal F\right] = 1 + \int_0^1(1-y)\,\mathbb{E}\left[(sV)^2e^{ysV} \mid \mathcal F\right]dy \le \left(1 + \frac{s^2}{a}\right)e^{s^2/a} \le \exp\left\{\frac{2s^2}{a}\right\} = \exp\left\{\left(\frac{4}{c}\right)s^2\right\}.$$