Making the Last Iterate of SGD Information Theoretically Optimal
A Preprint
Prateek Jain
Microsoft Research, Bengaluru, India
[email protected]

Dheeraj Nagaraj*
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology, Cambridge, USA 02139
[email protected]

Praneeth Netrapalli
Microsoft Research, Bengaluru, India
[email protected]
May 30, 2019

Abstract
Stochastic gradient descent (SGD) is one of the most widely used algorithms for large scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the most preferred choice in practice. The best known results for the last point of SGD [1], however, are suboptimal compared to information theoretic lower bounds by a $\log T$ factor, where $T$ is the number of iterations. [2] shows that, in fact, this additional $\log T$ factor is tight for the standard step size sequences of $\Theta\left(\frac{1}{\sqrt{t}}\right)$ and $\Theta\left(\frac{1}{t}\right)$ for non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) when applied to non-smooth, convex functions, the best known step-size sequences still lead to $O(\log T)$-suboptimal convergence rates (on the final iterate). The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of the last point of SGD as well as GD. We achieve this by designing a modification scheme that converts one sequence of step sizes to another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence. We also show that our result holds with high probability. We validate our results through simulations which demonstrate that the new step size sequence indeed improves the final iterate significantly compared to the standard step size sequences.

Keywords: Stochastic Gradient Descent · Machine Learning · Convex Optimization
Stochastic Gradient Descent (SGD) is one of the most popular algorithms for solving large-scale empirical risk minimization (ERM) problems [3, 4, 5]. The algorithm updates the iterates using stochastic gradients obtained by sampling data points uniformly at random. The algorithm has been studied for several decades [6] but there are still significant gaps between practical implementations and theoretical analyses. In particular, the standard analyses hold only for some kind of average of iterates, but most practitioners just use the final iterate of SGD. So, [7] asked the natural question of whether the final iterate of SGD, as opposed to the average of iterates, is provably good. It was partly answered in [1], which gave a sub-optimality bound for the last point of SGD, but the obtained sub-optimality rates are $O(\log T)$ worse than the information theoretically optimal rates; $T$ is the number of iterations.

* Accepted for presentation at the Conference on Learning Theory (COLT) 2019

[2] showed that the above result is tight for the standard step-size sequence used by most existing theoretical results. The extra logarithmic factor is not due to the stochastic nature of SGD. In fact, even for subgradient descent (GD) when applied to general non-smooth, convex functions, the last point's convergence rates are sub-optimal by an $O(\log T)$ factor.

So, this work addresses the following two fundamental questions:

"Does there exist a step-size sequence for which the last point of SGD, when applied to general convex functions as well as to strongly-convex functions, has optimal error (sub-optimality) rate?", and,

"Does there exist a step-size sequence for which the last point of GD, when applied to general non-smooth convex functions, has optimal error (sub-optimality) rate?"

In this paper, we answer both questions in the affirmative. That is, we provide novel step size sequences and show that the final iterate of SGD run with these step size sequences has the information theoretically optimal error (suboptimality) rate. In particular, for general non-smooth convex functions, our results ensure an error rate of $O\left(\frac{1}{\sqrt{T}}\right)$, and for strongly-convex functions, the error rate is $O\left(\frac{1}{T}\right)$. We also present high-probability versions, i.e., we show that with probability at least $1 - \delta$, the suboptimality is $O\left(\sqrt{\frac{\log\frac{1}{\delta}}{T}}\right)$ and $O\left(\frac{\log\frac{1}{\delta}}{T}\right)$ respectively (see Theorems 1 and 2).
For GD, we show that a similarly modified step-size sequence leads to suboptimality of $O\left(\frac{1}{T}\right)$ and $O\left(\frac{1}{\sqrt{T}}\right)$ for non-smooth convex functions, with and without strong convexity respectively, which is optimal.

In general, SGD takes the iterates near the optimum but, since the objective isn't smooth near the optimizer $x^*$, the gradients don't become small even when the points are close to $x^*$. Standard step sizes don't decay appreciably with time to ensure fast enough convergence to $x^*$. Therefore the iterates $x_t$, after going close to $x^*$, start oscillating around it without actually approaching it (see Section 4 for concrete examples). Our new step sizes, given in Section 2.1, ensure that the step sizes decay fast enough after a certain point, making the iterates go closer to the optimum $x^*$. The exact mode of this decay ensures that the last iterate approaches the optimum at the information theoretic rate.

Our results utilize a general step size modification scheme which ensures that the upper bounds for the average function value with the original step sizes get transferred to the last iterate when the modified step sizes are used (see Theorems 3 and 4). A key technical contribution of the paper is the proof of Theorem 2, which constructs a sequence of averaging schemes that are 'good' with high probability, such that the last averaging scheme consists only of the last iterate and hence lets us conclude that the last iterate is 'good' with high probability.

Our new step-size sequence requires that the number of iterations or horizon $T$ is known a priori. In contrast, standard step-size sequences do not require $T$ a priori, and hence guarantee any-time results. Information about $T$ a priori helps us in ensuring that we do not drop the step-size too early; only after we are close to the optimum does the step size drop rapidly.
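The oscillation described above is easy to see numerically. The sketch below is our own illustration (not code from the paper): it runs subgradient descent on the non-smooth function $F(x) = |x|$, once with the constant step size $1/\sqrt{T}$ and once with a step size additionally halved over geometrically shrinking final phases, in the spirit of the schedule of Section 2.1; the helper `phase` and all constants are our choices.

```python
import math

T = 4096  # horizon, assumed known in advance

def phase(t):
    # Index i with T_i < t <= T_{i+1}, for T_i = T - ceil(T * 2**-i) (our reading of Eq. (2)).
    if t <= T // 2:
        return 0
    if t >= T:
        return int(math.log2(T))
    return math.ceil(-math.log2(1.0 - t / T)) - 1

def subgrad(x):
    # A subgradient of F(x) = |x|.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def run(step):
    x = 0.9  # start away from the optimum x* = 0
    for t in range(1, T + 1):
        x -= step(t) * subgrad(x)
    return x

x_standard = run(lambda t: 1.0 / math.sqrt(T))                 # constant step: oscillates at scale 1/sqrt(T)
x_modified = run(lambda t: 2.0 ** (-phase(t)) / math.sqrt(T))  # halved phase by phase: settles near 0
print(abs(x_standard), abs(x_modified))
```

On this example the standard schedule leaves the final iterate oscillating at the scale of its (constant) step size, while the phase-wise halving drives the final iterate orders of magnitude closer to the optimum.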
In fact, we conjecture that in the absence of a priori information about $T$, no step-size sequence can ensure the information theoretically optimal error rates for the final iterate of SGD. As a step towards proving this, we show that in the case of strongly convex objectives, any choice of step sizes with infinite horizon (i.e., without the knowledge of the total number of iterations) is either suboptimal almost surely or suboptimal in expectation for infinitely many points. We show this in Theorem 5.

Related Work: Averaging was first used in the stochastic approximation setting by [8] to show optimal rates of convergence. Gradient Descent type methods have been shown to achieve information theoretically optimal error rates in the convex and strongly convex settings when averaging of iterates is used ([9], [10], [11], [12], Epoch GD in [13], SGD [14] and [15]). The question of the last iterate was first considered in [1], which gives bounds of $O\left(\frac{\log T}{\sqrt{T}}\right)$ and $O\left(\frac{\log T}{T}\right)$ in expectation for the general case and the strongly convex case respectively. [2] show matching high probability bounds and show that for the standard step sizes ($O\left(\frac{1}{\sqrt{t}}\right)$ in the general case and $O\left(\frac{1}{t}\right)$ in the strongly convex case), the logarithmically-suboptimal bounds are tight.

Organization: The setting and main results are presented in Section 2. In particular, Section 2.1 describes the general step size modification considered and states key results regarding this modification, and the lower bound is presented in Section 2.2. Key technical ideas are developed in Section 3 and the main theorems are proved. We present some experimental results in Section 4 and conclude in Section 5. Skipped proofs of technical lemmas are given in the appendix.
Consider the following optimization problem:
$$\min_{x \in \mathcal{W}} F(x), \qquad (1)$$
where the objective function $F: \mathbb{R}^d \to \mathbb{R}$ is a convex function and $\mathcal{W} \subset \mathbb{R}^d$ is a closed convex set. Let the global minimizer of $F(\cdot)$ be $x^* \in \mathcal{W}$. We start the SGD algorithm at a point $x_1 \in \mathcal{W}$ and iteratively obtain estimates $x_t$ for the minimizer of $F(\cdot)$. We assume that at each time step, we have access to an independent, unbiased estimate $\hat{g}_t$ of a subgradient $g_t \in \partial F$. That is, $\mathbb{E}[\hat{g}_t(x)] = g_t(x) \in \partial F(x)$ for every $x \in \mathcal{W}$, and $(\hat{g}_t - g_t)_{t=1}^T$ are independent. We pick step sizes $(\alpha_t)_{t=1}^T \geq 0$. Let $\Pi_{\mathcal{W}}$ be the projection operator onto the set $\mathcal{W}$. The SGD algorithm is given as follows:

Input: total time $T$ and step sizes $\alpha_t$.
Output: $x_T$.
for $t \leftarrow 1$ to $T$ do
    $x_{t+1} \leftarrow \Pi_{\mathcal{W}}(x_t - \alpha_t \hat{g}_t(x_t))$
end
Algorithm 1: Stochastic Gradient Descent

Henceforth, we will retain the assumptions made above. Whenever we use $g_t(x)$, it is implied that $g_t(x) \in \partial F(x)$. Throughout the paper, we assume that $F$ is a Lipschitz continuous convex function.

Assumption 1 (Lipschitz Continuity). $F: \mathbb{R}^d \to \mathbb{R}$ is a $G$-Lipschitz continuous convex function over the closed convex set $\mathcal{W}$, i.e., $\|g(x)\| \leq G$ for every $x \in \mathcal{W}$ and every $g(x) \in \partial F(x)$. Furthermore, the stochastic gradients $\hat{g}$ satisfy $\|\hat{g}(x)\| \leq G$ almost surely for every $x \in \mathcal{W}$.

Assumption 2 (Closed and bounded set). The diameter of the closed convex set $\mathcal{W}$ is bounded by $D$, i.e., $\mathrm{diam}(\mathcal{W}) \leq D$.

Assumption 3 (Strong convexity). Let $\lambda > 0$. A convex function $F$ is said to be $\lambda$ strongly convex over $\mathcal{W}$ iff $F(y) \geq F(x) + \langle \nabla F(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$ for all $x, y \in \mathcal{W}$.

Step size sequence for general convex functions: we first define
$$k := \inf\{i : T \cdot 2^{-i} \leq 1\}, \qquad T_i := T - \lceil T \cdot 2^{-i} \rceil,\ 0 \leq i \leq k, \qquad \text{and } T_{k+1} := T. \qquad (2)$$
Clearly, $T_0 < T_1 < \ldots < T_k = T - 1 < T_{k+1} = T$. We note in particular that $T_1 \approx T/2$. Let $C > 0$ be arbitrary. Then, we choose the step size $\alpha_t$ as follows:
$$\alpha_t = \frac{C \cdot 2^{-i}}{\sqrt{T}} \quad \text{when } T_i < t \leq T_{i+1},\ 0 \leq i \leq k. \qquad (3)$$
The theorem below provides a suboptimality guarantee for the SGD algorithm with the step-size sequence mentioned above.

Theorem 1 (SGD/GD Last Point for General Convex Functions). Let Assumptions 1 and 2 hold. Given $T \geq 4$, let $x_1, \ldots, x_T$ be the iterates of SGD (Algorithm 1) with step size $\alpha_t$ as defined in Equation (3). Then, the following holds:
$$\mathbb{E}[F(x_T)] \leq F(x^*) + \frac{4 D^2}{C\sqrt{T}} + \frac{11 G^2 C}{\sqrt{T}}.$$
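A small sketch of this construction (our code, with hypothetical helper names) computes the breakpoints $T_i$ of Equation (2) and the step size of Equation (3):

```python
import math

def breakpoints(T):
    # Equation (2): k = inf{i : T * 2**-i <= 1}, T_i = T - ceil(T * 2**-i), and T_{k+1} = T.
    k = next(i for i in range(T + 1) if T * 2.0 ** (-i) <= 1)
    Ts = [T - math.ceil(T * 2.0 ** (-i)) for i in range(k + 1)]
    return Ts + [T]  # T_0 = 0 < T_1 ≈ T/2 < ... < T_k = T - 1 < T_{k+1} = T

def alpha(t, T, C=1.0):
    # Equation (3): alpha_t = C * 2**-i / sqrt(T) when T_i < t <= T_{i+1}.
    Ts = breakpoints(T)
    i = next(j for j in range(len(Ts) - 1) if Ts[j] < t <= Ts[j + 1])
    return C * 2.0 ** (-i) / math.sqrt(T)

print(breakpoints(32))               # [0, 16, 24, 28, 30, 31, 32]
print(alpha(16, 32), alpha(17, 32))  # C/sqrt(32) on the first half, then halved
```

The step size stays at $C/\sqrt{T}$ for the first half of the horizon and is then halved on each of the geometrically shrinking later phases.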
In particular, if we choose $C = D/G$, we have:
$$\mathbb{E}[F(x_T)] \leq F(x^*) + \frac{15\, GD}{\sqrt{T}}.$$
Furthermore, the following holds w.p. $\geq 1 - \delta$ for any $0 < \delta < 1/e$:
$$F(x_T) = F(x^*) + O\left(\frac{D^2}{C\sqrt{T}} + \frac{C G^2 \log\frac{1}{\delta}}{\sqrt{T}}\right) \leq F(x^*) + O\left(DG\sqrt{\frac{\log\frac{1}{\delta}}{T}}\right).$$
Finally, under the same assumptions, the GD update ($x_{t+1} = \Pi_{\mathcal{W}}(x_t - \alpha_t \nabla F(x_t))$) with the same step-size sequence given in (3) also ensures the following after $T$ iterations:
$$F(x_T) \leq F(x^*) + \frac{4 D^2}{C\sqrt{T}} + \frac{11 G^2 C}{\sqrt{T}}.$$
We will prove this theorem in Section 3 after developing some general ideas.
Remarks: (1) Note that the bounds on sub-optimality (for SGD and GD) are information theoretically optimal up to constants.
(2) Our result on the expected sub-optimality improves upon that of [1] by a multiplicative $\log T$ factor, and our result on the high probability sub-optimality improves upon [2] by a multiplicative factor of $\log T \sqrt{\log\frac{1}{\delta}}$. On the other hand, our step-size sequence requires a priori knowledge of $T$. We conjecture that for any any-time algorithm (i.e., without a priori knowledge of $T$), the expected error rate of GD, $\frac{\log T}{\sqrt{T}}$, is information theoretically optimal.
(3) The rate obtained above for the last point of GD (in the deterministic setting) is also optimal in the gradient oracle model and, to the best of our knowledge, is the first such result for the last point of GD.

Step size sequence for strongly-convex functions: Let $F(\cdot)$ be $\lambda$ strongly convex (Assumption 3). Let $k := \inf\{i : T \cdot 2^{-i} \leq 1\}$. We pick $\alpha_t$ as follows:
$$\alpha_t = \frac{2^{-i}}{\lambda t}, \quad \forall\ T_i < t \leq T_{i+1},\ 0 \leq i \leq k. \qquad (4)$$
We now present our result for the last point of SGD under the strong-convexity assumption.

Theorem 2 (SGD Last Point for Strongly Convex Functions). Let $F$ satisfy Assumptions 1 and 3. Then the following holds for the $T$-th iterate of the SGD algorithm (Algorithm 1) when run with the step size sequence given in Equation (4):
$$\mathbb{E}[F(x_T)] \leq F(x^*) + \frac{130\, G^2}{\lambda T}.$$
Furthermore, the following holds for all $0 < \delta \leq 1/e$ with probability at least $1 - \delta$:
$$F(x_T) = F(x^*) + O\left(\frac{G^2 \log\frac{1}{\delta}}{\lambda T}\right).$$
Under the same assumptions, the GD update ($x_{t+1} = \Pi_{\mathcal{W}}(x_t - \alpha_t \nabla F(x_t))$) with the same step-size sequence given in (4) also ensures the following after $T$ iterations:
$$F(x_T) \leq F(x^*) + \frac{130\, G^2}{\lambda T}.$$
Here again, we note that the result is information theoretically optimal up to a $\log(1/\delta)$ factor.

Theorems 1 and 2 are consequences of our general results on step size modification that we present below. Consider an SGD step size sequence $(\gamma_t)_{t=1}^T$.
We obtain the modified step size sequence $(\alpha_t)_{t=1}^T$ as follows:
$$\alpha_t := 2^{-i} \gamma_t \quad \forall\ T_i < t \leq T_{i+1} \text{ and } 0 \leq i \leq k. \qquad (5)$$
Under certain mild conditions, we will show that the last iterate of SGD with step size $\alpha_t$ is as good as the average iterate of SGD with step size $\gamma_t$. We make these notions precise below:

Assumption 4 (Slowly Decreasing Step Size Sequence). We call a step size sequence $(\gamma_t)$ 'decreasing' if $\gamma_{t+1} \leq \gamma_t$. We say that the step size sequence $\gamma_t$ has 'at most polynomial decay' with decay constant $0 < \beta \leq 1$ if $\gamma_{2t} \geq \beta \gamma_t$ for every $t \geq 1$.

We have the following general theorem:
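As a sketch (our code, under our reading of Assumption 4 as $\gamma_{2t} \geq \beta\gamma_t$), the modification of Equation (5) can be applied to any base sequence; instantiated with $\gamma_t = C/\sqrt{T}$ it recovers Equation (3), and with $\gamma_t = 1/(\lambda t)$ it recovers Equation (4):

```python
import math

def modify(gamma, T):
    # Equation (5): alpha_t = 2**-i * gamma(t) for T_i < t <= T_{i+1}, with T_i from Equation (2).
    k = next(i for i in range(T + 1) if T * 2.0 ** (-i) <= 1)
    Ts = [T - math.ceil(T * 2.0 ** (-i)) for i in range(k + 1)] + [T]
    alpha = {}
    for i in range(len(Ts) - 1):
        for t in range(Ts[i] + 1, Ts[i + 1] + 1):
            alpha[t] = 2.0 ** (-i) * gamma(t)
    return alpha

T, lam, C = 64, 1.0, 1.0
a_general = modify(lambda t: C / math.sqrt(T), T)  # recovers Equation (3)
a_strong = modify(lambda t: 1.0 / (lam * t), T)    # recovers Equation (4)
print(a_general[32], a_general[33])  # 0.125 then 0.0625: halved right after T_1 = T/2
print(a_strong[T])                   # 2**-k / (lam * T)
```

The constant sequence satisfies the polynomial-decay condition with $\beta = 1$ and the $1/(\lambda t)$ sequence with $\beta = 1/2$, which is where the decay constant of Assumption 4 enters the bounds below.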
Theorem 3.
Let $(\gamma_t)_{t=1}^T$ be a decreasing step size sequence with at most polynomial decay with decay constant $0 < \beta \leq 1$. Let the iterates of SGD with step size $\gamma_t$ be $y_1, \ldots, y_T$. Let $\alpha_t$ be the modification of $\gamma_t$ as defined in Equation (5). Let the iterates of SGD with step size $\alpha_t$ be $x_1, \ldots, x_T$. Then, for all $T \geq 4$, we have:
$$\mathbb{E}[F(x_T)] \leq \frac{6\, G^2 \gamma_T}{\beta} + \inf_{\lceil T/2 \rceil \leq t \leq T} \mathbb{E}[F(y_t)].$$
We also give a high probability version of Theorem 3.
Theorem 4.
Let $T \geq 4$. Let $q^{(0)}$ be any arbitrary fixed probability distribution over the set $\{\lceil T/2 \rceil, \ldots, T_1\}$. With probability at least $1 - \delta$, we have:
$$F(x_T) \leq \gamma_T\, G^2 \left(120 \log\tfrac{1}{\delta} + 400\right) \cdot \frac{1}{\beta} + \sum_{s=\lceil T/2 \rceil}^{T_1} q^{(0)}(s)\, F(y_s).$$
That is, the above theorems show that, compared to any weighted average of function values of iterates in the $[T/2, T]$ iterations, the error is not significantly larger if $\beta$ is reasonably large and $\gamma_T$ is small. Now, using standard analysis, we can ensure a small average function value for iterates in the $[T/2, T]$ iterations. A small value of $\gamma_T$ and the bound on $\beta$ hold trivially for standard step-size sequences.

See Section 3 for detailed proofs of the above theorems. We first develop the general technique and prove key lemmas in the next section, and then present proofs for all the theorems.

The step size modification procedure described above assumed knowledge of the horizon $T$ (this is not a setback in practice). We now study the case of infinite horizon SGD. In this section we state our bounds on the last iterate of 'any time' (infinite horizon) SGD in the case of strongly convex objectives. We will first introduce the notion of suboptimality that we consider. In particular, we look at two kinds of 'bad performance' in infinite horizon SGD for non-smooth strongly convex optimization. Consider any infinite step size sequence $\gamma_t$.

1. The sequence $\gamma_t$ is said to be 'bad in expectation' if for an objective $F$ satisfying Assumptions 1, 2 and 3, some choice of subgradient oracle, and SGD iterates $(x_t)_{t \in \mathbb{N}}$ with step size $\gamma_t$, there is a fixed subsequence $\{t_k\}_{k \in \mathbb{N}}$ such that $\lim_{k \to \infty} t_k\, \mathbb{E}[F(x_{t_k}) - F(x^*)] = \infty$.

2. The sequence $\gamma_t$ is said to be 'bad almost surely' if for an objective $F$ satisfying Assumptions 1, 2 and 3, some choice of subgradient oracle, and SGD iterates $(x_t)_{t \in \mathbb{N}}$ with step size $\gamma_t$, with probability $1$ there exists a random infinite sequence of times $\{t_k\}$ such that $\lim_{k \to \infty} t_k\, [F(x_{t_k}) - F(x^*)] = \infty$.

We give a 'no free lunch' theorem: that is, we show that any infinite horizon step-size sequence for non-smooth strongly convex optimization is either 'bad in expectation' or 'bad almost surely'. More precisely, we will show that if any infinite horizon SGD is 'good in expectation' for every $t$ for every strongly convex function, then it is 'bad almost surely' for some function $F$.

Theorem 5.
Consider infinite horizon SGD with step size $\gamma_t$ such that Assumptions 1, 2 and 3 hold for the objective function. Then, for any choice of $\gamma_t > 0$, the algorithm is either bad in expectation or bad almost surely.

We give the proof in Section B.
Recall the definition of $T_i$ from Section 2. The rough idea behind the proof is as follows: we will find a 'good point' in the range $[\lceil T/2 \rceil, T_1]$ and then show that this implies that there is a 'good point' between $T_1 \approx T/2$ and $T_2 \approx 3T/4$, and so on, until we conclude that $x_T$ is a good point.

To this end, we first provide a key lemma that bounds the total weighted deviation of SGD iterates from a given iterate $x_{t_1}$ (in terms of function value); i.e., it intuitively shows that once we find an iterate with a small function value, the remaining iterates cannot deviate from it significantly. The lemma uses a trick that was first used in [16] and then also in [1].

Lemma 1.
Let $x_1, \ldots, x_T$ be the output of the SGD algorithm (Algorithm 1) with step size sequence $\alpha_t$ defined by (3). Then, given any $1 \leq t_1 < t_2 \leq T$,
$$\sum_{t=t_1}^{t_2} \alpha_t\, \mathbb{E}[F(x_t) - F(x_{t_1})] \leq \sum_{t=t_1}^{t_2} G^2 \alpha_t^2.$$

Proof.
By convexity of $\mathcal{W}$ (non-expansiveness of the projection $\Pi_{\mathcal{W}}$), we have:
$$\|x_{t+1} - x_{t_1}\| = \|\Pi_{\mathcal{W}}(x_t - \alpha_t \hat{g}_t(x_t)) - x_{t_1}\| \leq \|x_t - \alpha_t \hat{g}_t(x_t) - x_{t_1}\|.$$
Taking squares and expanding on both sides,
$$\|x_{t+1} - x_{t_1}\|^2 \leq \|x_t - x_{t_1}\|^2 + \alpha_t^2 \|\hat{g}_t(x_t)\|^2 - 2\alpha_t \langle \hat{g}_t(x_t), x_t - x_{t_1} \rangle.$$
Taking expectations on both sides, and realizing that $\hat{g}_t - g_t$ is independent of $x_t$ and $x_{t_1}$, we conclude:
$$\mathbb{E}\left[\|x_{t+1} - x_{t_1}\|^2\right] \leq \mathbb{E}\left[\|x_t - x_{t_1}\|^2\right] + \alpha_t^2 G^2 - 2\alpha_t\, \mathbb{E}\langle g_t, x_t - x_{t_1} \rangle.$$
Here we have used the fact that $\mathbb{E}[\hat{g}_t(x_t) \mid x_{t_1}, x_t] = g_t(x_t)$. Using convexity, $\langle g_t, x_t - x_{t_1} \rangle$ is lower bounded by $F(x_t) - F(x_{t_1})$. We conclude that:
$$\mathbb{E}\left[\|x_{t+1} - x_{t_1}\|^2\right] \leq \mathbb{E}\left[\|x_t - x_{t_1}\|^2\right] + \alpha_t^2 G^2 - 2\alpha_t\, \mathbb{E}[F(x_t) - F(x_{t_1})].$$
The result now follows by summing the above from $t = t_1$ to $t = t_2$.

We now provide a high probability version of Lemma 1. To this end, we construct an exponential super-martingale that, when combined with a Chernoff bound, leads to an exponential concentration bound. The method used is somewhat similar to the one used in [2], but our technique is tailored to Lemma 1 and is more concise.

For simplicity of exposition, we first define a few key quantities. Let $1 \leq t_1 < t_2 \leq T$ and $r = t_2 - t_1 + 1$. We define the sequence $L_t$ for $t_1 \leq t \leq t_2$ as follows:
$$L_{t_2} = \frac{1}{e \cdot r}, \qquad L_{t-1} = L_t + L_t^2, \quad t_1 \leq t - 1 < t_2. \qquad (6)$$
Using Lemma 3, $\frac{1}{e \cdot r} \leq L_t \leq \frac{1}{r}$. Now, for any $l$ such that $t_1 \leq l \leq t_2$, we define the following random variables:
$$A(l, t_2) := \sum_{t=l}^{t_2} L_t\left[\alpha_t \left(F(x_t) - F(x_l)\right) - \alpha_t^2 G^2\right], \qquad A^*(t_1, t_2) := \sum_{t=t_1}^{t_2} L_t\left[\alpha_t \left(F(x_t) - F(x^*)\right) - \alpha_t^2 G^2\right]. \qquad (7)$$
We note the difference between $A^*(t_1, t_2)$ and $A(l, t_2)$: $A(l, t_2)$ considers suboptimality with respect to $x_l$, whereas $A^*(t_1, t_2)$ considers the suboptimality with respect to the optimizer $x^*$.

Lemma 2.
Let $A$ and $A^*$ be as defined by (7). Let $p(t_1), \ldots, p(t_2)$ be any probability distribution over $\{t_1, \ldots, t_2\}$. We let $(p.A)(t_1, t_2) := \sum_{l=t_1}^{t_2} p(l)\, A(l, t_2)$. Also, let $\alpha_t$ be a decreasing step size sequence. Then,
$$\mathbb{P}\left[(p.A)(t_1, t_2) > \eta\right] \leq \exp\left(-\frac{\eta}{8 \alpha_{t_1}^2 G^2}\right).$$
Additionally, if $\mathrm{diam}(\mathcal{W}) \leq D$ almost surely, we have:
$$\mathbb{P}\left[A^*(t_1, t_2) > \eta\right] \leq \exp\left(\frac{D^2 L_{t_1}}{8 \alpha_{t_1}^2 G^2}\right) \exp\left(-\frac{\eta}{8 \alpha_{t_1}^2 G^2}\right).$$

Lemma 3.
Let $\Gamma > 0$ be fixed. Let $\lambda_1 = \frac{\Gamma}{r e}$ and $\lambda_2 = \lambda_1 + \frac{\lambda_1^2}{\Gamma}, \ldots, \lambda_{i+1} = \lambda_i + \frac{\lambda_i^2}{\Gamma}$. Then, for every $i \leq r$,
$$\lambda_i \leq \left(1 + \frac{1}{r}\right)^i \lambda_1.$$

See Section A for proofs of the above lemmata. We also require the following technical lemma:
Lemma 4.
Let $T_i$ be as defined in Section 2. Then, for all $0 \leq i \leq k - 1$:
$$2\left(T_{i+2} - T_{i+1}\right) \geq T_{i+1} - T_i.$$

Proof.
The lemma follows from the fact that $\frac{\lceil a \rceil}{2} \leq \lceil a/2 \rceil \leq \frac{\lceil a \rceil + 1}{2}$.

Henceforth, we will assume that $\gamma_t$ is a decreasing step size sequence with at most polynomial decay (the decay constant being $\beta$). We let $\alpha_t$ be the modification of $\gamma_t$ as defined in Equation (5). Let
$$\tau_i := \arg\inf_{T_i < t \leq T_{i+1}} \mathbb{E}[F(x_t)], \quad 0 \leq i \leq k + 1. \qquad (8)$$

Lemma 5. Let the $x_t$ be the iterates of SGD (Algorithm 1) with the modified step size sequence $\alpha_t$ of $\gamma_t$ defined in (5); the sequence $\gamma_t$ satisfies Assumption 4. Let $T_i$, $k$ be as defined by (2), and $\tau_i$, $0 \leq i \leq k + 1$, be as defined in (8). Also, let $T \geq 4$. Then, the following holds for all $i \in [k]$:
$$\mathbb{E}\left[F(x_{\tau_{i+1}}) - F(x_{\tau_i})\right] \leq \frac{G^2 \gamma_T\, 2^{-i}}{\beta}, \qquad \mathbb{E}\left[F(x_{\tau_1}) - F(x_{\tau_0})\right] \leq \frac{5\, G^2 \gamma_T}{\beta}.$$

Proof. We first consider $i \geq 1$. If $\mathbb{E}[F(x_{\tau_{i+1}})] \leq \mathbb{E}[F(x_{\tau_i})]$, the proof is done. Else, using Lemma 1 with $t_1 = \tau_i$ and $t_2 = T_{i+2}$, and the fact that $\alpha_t$ is a decreasing sequence, we get:
$$\frac{\sum_{t=\tau_i}^{T_{i+2}} \alpha_t\, \mathbb{E}[F(x_t) - F(x_{\tau_i})]}{T_{i+2} - \tau_i + 1} \leq \frac{\sum_{t=\tau_i}^{T_{i+2}} G^2 \alpha_t^2}{T_{i+2} - \tau_i + 1} \leq G^2 \alpha_{T_i+1}^2. \qquad (9)$$
By the definition of $\tau_i$, $\mathbb{E}[F(x_{\tau_i})] \leq \mathbb{E}[F(x_t)]$ whenever $T_i < t \leq T_{i+1}$. Hence,
$$G^2\, 2^{-2i} \gamma_{T_i+1}^2 = G^2 \alpha_{T_i+1}^2 \geq \frac{\sum_{t=\tau_i}^{T_{i+2}} \alpha_t\, \mathbb{E}[F(x_t) - F(x_{\tau_i})]}{T_{i+2} - \tau_i + 1} \geq \frac{\sum_{t=T_{i+1}+1}^{T_{i+2}} \alpha_t\, \mathbb{E}[F(x_t) - F(x_{\tau_i})]}{T_{i+2} - \tau_i + 1}, \qquad (10)$$
where the first equality follows from the definition of $\alpha_t$ in (5), the first inequality follows from Equation (9), and the final inequality follows from the fact that $\mathbb{E}[F(x_t) - F(x_{\tau_i})] \geq 0$ when $T_i < t \leq T_{i+1}$ (see the definition of $\tau_i$ in (8)).

Now, by using the above inequality with the assumption $\mathbb{E}[F(x_{\tau_{i+1}})] \geq \mathbb{E}[F(x_{\tau_i})]$, and the fact that $T_{i+2} - T_i \geq T_{i+2} - \tau_i + 1$, we have:
$$G^2\, 2^{-2i} \gamma_{T_i+1}^2 \geq \alpha_{T_{i+2}}\, \frac{T_{i+2} - T_{i+1}}{T_{i+2} - T_i}\, \mathbb{E}\left[F(x_{\tau_{i+1}}) - F(x_{\tau_i})\right] \overset{\zeta}{\geq} \frac{\alpha_{T_{i+2}}}{3}\, \mathbb{E}\left[F(x_{\tau_{i+1}}) - F(x_{\tau_i})\right] = \frac{2^{-i-1} \gamma_{T_{i+2}}}{3}\, \mathbb{E}\left[F(x_{\tau_{i+1}}) - F(x_{\tau_i})\right] \geq \frac{2^{-i-1} \beta \gamma_{T_{i+1}}}{3}\, \mathbb{E}\left[F(x_{\tau_{i+1}}) - F(x_{\tau_i})\right], \qquad (11)$$
where $\zeta$ follows from Lemma 4.
The equality follows from the definition of $\alpha_t$, and the last inequality follows from the $\beta$-slowly-decaying assumption on $\gamma_t$ (Assumption 4). Rearranging, we obtain the result for the case $i \geq 1$. The proof for the case $i = 0$ follows with minor modifications to the arguments given above.

We now present a high probability version of Lemma 5.

Lemma 6. Consider the setting of Lemma 5. Let $0 \leq i \leq k$ and define $t_0 = T_i + 1$ for $1 \leq i$ and $t_0 = \lceil T/2 \rceil$ for $i = 0$. Let $q^{(i)}$ be any probability distribution over $\{t_0, \ldots, T_{i+1}\}$. Let
$$p^{(i+1)}(t) := \frac{L_t \alpha_t}{\sum_{s=T_{i+1}+1}^{T_{i+2}} L_s \alpha_s},$$
where $t \in [T_{i+1} + 1, T_{i+2}]$ and the sequence $(L_t)_{T_{i+1}+1}^{T_{i+2}}$ is defined by (6). Then, for any $\delta_i \in (0, 1)$ and $i \in [1, k - 1]$, the following holds with probability at least $1 - \delta_i$:
$$\sum_{t=T_{i+1}+1}^{T_{i+2}} p^{(i+1)}(t)\, F(x_t) \leq \frac{G^2 \gamma_T\, 2^{-i}}{\beta}\left(15 + 120 \log\tfrac{1}{\delta_i}\right) + \sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s)\, F(x_s).$$
For $i = 0$, the following holds with probability at least $1 - \delta_0$:
$$\sum_{t=T_1+1}^{T_2} p^{(1)}(t)\, F(x_t) \leq \frac{G^2 \gamma_T}{\beta}\left(15 + 120 \log\tfrac{1}{\delta_0}\right) + \sum_{s=\lceil T/2 \rceil}^{T_1} q^{(0)}(s)\, F(x_s).$$

Proof. We will only show the case $1 \leq i \leq k - 1$. The $i = 0$ case follows by a similar proof. For $T_i \leq t \leq T_{i+1}$, we define $\Gamma(t) = \sum_{s=t+1}^{T_{i+2}} \alpha_s L_s$. We define $\kappa$ as follows over $\{T_i + 1, \ldots, T_{i+2}\}$:
$$\kappa(T_i + 1) := \frac{\Gamma(T_{i+1})}{\Gamma(T_i + 1)} \cdot q^{(i)}(T_i + 1), \qquad \kappa(t) := \frac{\Gamma(T_{i+1})}{\Gamma(t)}\, q^{(i)}(t) + \frac{\alpha_t L_t \cdot \left(\sum_{s=T_i+1}^{t-1} \kappa(s)\right)}{\Gamma(t)}, \quad t \in (T_i + 1, T_{i+1}], \qquad \kappa(t) := 0, \quad \forall\, t > T_{i+1}. \qquad (12)$$
From Lemma 7, we conclude that $\kappa$ is a probability distribution over $\{T_i + 1, \ldots, T_{i+2}\}$. From Lemma 2, we conclude that with probability at least $1 - \delta_i$:
$$(\kappa.A)(t_1, t_2) \leq 8\, \alpha_{t_1}^2 G^2 \log\tfrac{1}{\delta_i}. \qquad (13)$$
We will show that when this event happens, the inequality in the statement of the lemma holds.
If $\sum_{t=T_{i+1}+1}^{T_{i+2}} p^{(i+1)}(t)\, F(x_t) \leq \sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s)\, F(x_s)$, then the statement of the lemma holds trivially. Now assume $\sum_{t=T_{i+1}+1}^{T_{i+2}} p^{(i+1)}(t)\, F(x_t) > \sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s)\, F(x_s)$. We use the fact that $\kappa$ is supported over $\{T_i + 1, \ldots, T_{i+1}\}$ and hence:
$$(\kappa.A)(t_1, t_2) = \sum_{l=T_i+1}^{T_{i+1}} \sum_{t=l}^{T_{i+2}} \kappa(l)\, L_t\left[\alpha_t\left(F(x_t) - F(x_l)\right) - \alpha_t^2 G^2\right].$$
We exchange the summations and collect the coefficients of the term $F(x_t)$ to conclude:
$$(\kappa.A)(t_1, t_2) = \sum_{t=T_{i+1}+1}^{T_{i+2}} L_t\left(\alpha_t F(x_t) - \alpha_t^2 G^2\right) - \sum_{s=T_i+1}^{T_{i+1}} \left(\alpha_s^2 G^2 \sigma_s L_s + 2 F(x_s)\left(\sigma_s \alpha_s L_s - \kappa(s)\Gamma(s-1)\right)\right),$$
where $\sigma(s) := \sum_{t=T_i+1}^{s} \kappa(t)$ (the empty sum being $0$ by definition). By the definitions $\sigma(s) = \kappa(s) + \sigma(s-1)$, $\Gamma(s-1) = \alpha_s L_s + \Gamma(s)$ and $\kappa(s) = \frac{\Gamma(T_{i+1})}{\Gamma(s)}\, q^{(i)}(s) + \frac{\alpha_s L_s}{\Gamma(s)}\, \sigma(s-1)$, we conclude:
$$(\kappa.A)(t_1, t_2) = \sum_{t=T_{i+1}+1}^{T_{i+2}} \alpha_t L_t F(x_t) - \sum_{t=T_{i+1}+1}^{T_{i+2}} \alpha_t^2 G^2 L_t - \sum_{s=T_i+1}^{T_{i+1}} \alpha_s^2 G^2 L_s \sigma_s - \sum_{t=T_{i+1}+1}^{T_{i+2}} \alpha_t L_t \sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s)\, F(x_s). \qquad (14)$$
We recall that $p^{(i+1)}(t) \cdot \left(\sum_{s=T_{i+1}+1}^{T_{i+2}} \alpha_s L_s\right) = \alpha_t L_t$ whenever $T_{i+1} < t \leq T_{i+2}$. The rest of the proof is similar to Equation (11) in Lemma 5. We use the facts that $\alpha_t$ is the modification of $\gamma_t$, that $\gamma_t$ has at most polynomial decay, that $\frac{1}{e(T_{i+2} - T_i)} \leq L_t \leq \frac{1}{T_{i+2} - T_i}$, and Lemma 4 in Equation (14) to conclude the result.

Lemma 7. Let $\kappa$ be as defined in (12). Then, $\kappa$ is a probability distribution over $\{T_i + 1, \ldots, T_{i+2}\}$.

The proof of this lemma is given in Section A.

Proof of Theorem 3. Recall the definition of $\tau_i$ in (8). Clearly, $\tau_{k+1} = T$.
Summing the bounds in Lemma 5, we conclude:
$$\mathbb{E}[F(x_T)] = \mathbb{E}[F(x_{\tau_{k+1}})] = \mathbb{E}[F(x_{\tau_0})] + \sum_{i=0}^{k} \mathbb{E}\left[F(x_{\tau_{i+1}}) - F(x_{\tau_i})\right] \leq \mathbb{E}[F(x_{\tau_0})] + \frac{5\, G^2 \gamma_T}{\beta} + \sum_{i=1}^{k} \frac{G^2 \gamma_T\, 2^{-i}}{\beta} \leq \frac{6\, G^2 \gamma_T}{\beta} + \inf_{\lceil T/2 \rceil \leq t \leq T} \mathbb{E}[F(x_t)].$$
We conclude the result by noting that $x_t = y_t$ for all $t \leq T_1$.

Proof of Theorem 4. This proof is similar to the proof of Theorem 3, but instead of Lemma 5 we use Lemma 6. In Lemma 6, we pick $q^{(i)} = p^{(i)}$ for $1 \leq i \leq k - 1$ and we let $q^{(0)}$ be arbitrary. We let $\delta_i = \frac{\delta}{2^{i+2}}$. By the union bound, the inequalities in the statement of Lemma 6 hold for all $0 \leq i \leq k - 1$ simultaneously with probability at least $1 - \sum_{i=0}^{k-1} \delta_i \geq 1 - \delta$. Summing all these inequalities, we conclude:
$$\sum_{t=T_k+1}^{T_{k+1}} p^{(k)}(t)\, F(x_t) \leq \gamma_T\, G^2\left(120 \log\tfrac{1}{\delta} + 400\right)\frac{1}{\beta} + \sum_{s=\lceil T/2 \rceil}^{T_1} q^{(0)}(s)\, F(x_s).$$
We note that the distribution $p^{(k)}$ has unit mass on the point $T_{k+1} = T$ and that $x_t = y_t$ when $t \leq T_1$ to conclude the result.

Proof of Theorem 1. We note that the step size defined in Equation (3) is the modification of the standard step size $\gamma_t = \frac{C}{\sqrt{T}}$. Let $y_t$ be the output of SGD under the assumptions of the theorem when step size $\gamma_t$ is used. Using the fact that the infimum is smaller than any weighted average, we have:
$$\inf_{\lceil T/2 \rceil \leq t \leq T} \mathbb{E}[F(y_t) - F(x^*)] \leq \frac{1}{T - \lceil T/2 \rceil + 1}\sum_{t=\lceil T/2 \rceil}^{T} \mathbb{E}[F(y_t) - F(x^*)] \leq \frac{2}{T}\sum_{t=1}^{T} \mathbb{E}[F(y_t) - F(x^*)] \overset{\zeta}{\leq} \frac{2}{T}\left[\frac{D^2 \sqrt{T}}{C} + C G^2 \sqrt{T}\right] \leq \frac{2}{\sqrt{T}}\left[\frac{D^2}{C} + C G^2\right],$$
where the second inequality follows from $T/2 \leq T - \lceil T/2 \rceil + 1$ and $\zeta$ follows from the standard analysis [6]. We note that $\gamma_t$ satisfies the conditions of Theorem 3 with $\beta = 1$. We invoke Theorem 3 to conclude the bound in expectation. The above proof in expectation also works for GD: we take $\hat{g}_t = \nabla F$, and SGD is the same as GD; here each $x_t$ and $y_t$ is a deterministic point mass.
Therefore, the expectation bound for the last iterate of SGD holds for the last iterate of GD.

We will now prove the high probability bound. Let $t_1 = \lceil T/2 \rceil$, $t_2 = T$, and $\alpha_t = \gamma_t = \frac{C}{\sqrt{T}}$ for $t \in [t_1, t_2]$. Then, using Lemma 2, the following holds with probability at least $1 - \delta$:
$$A^*(t_1, t_2) \leq D^2 L_{t_1} + 8\, \alpha_{t_1}^2 G^2 \log\tfrac{1}{\delta}.$$
Using $\frac{1}{e(t_2 - t_1 + 1)} \leq L_t \leq \frac{1}{t_2 - t_1 + 1}$ and proceeding similarly as above, we have, w.p. $\geq 1 - \delta$:
$$\inf_{\lceil T/2 \rceil \leq t \leq T} F(y_t) - F(x^*) \leq \frac{D^2}{C\sqrt{T}} + \frac{C G^2 \log\frac{2}{\delta}}{\sqrt{T}}.$$
The theorem now follows by using Theorem 4 with $\beta = 1$, $q^{(0)}(t) = \frac{1}{T_1 - \lceil T/2 \rceil + 1}$, and a union bound.

Proof of Theorem 2. We note that the step size defined in Equation (4) is the modification of the standard step size $\gamma_t = \frac{1}{\lambda t}$ used for strongly convex functions (see [14]). Let $y_1, \ldots, y_T$ be the output of SGD when step size $\gamma_t$ is used. From Theorem 5 in [14], we conclude that:
$$\inf_{\lceil T/2 \rceil \leq t \leq T} \mathbb{E}[F(y_t) - F(x^*)] \leq \frac{1}{T - \lceil T/2 \rceil + 1}\sum_{t=\lceil T/2 \rceil}^{T} \mathbb{E}[F(y_t) - F(x^*)] \leq O\left(\frac{G^2}{\lambda T}\right).$$
The expectation bound follows from using the above equation with Theorem 3 and noting that $\gamma_t$ satisfies the required conditions with $\beta = 1/2$. We get high probability bounds by invoking the high probability bounds for suffix averaging from [2], i.e., w.p. at least $1 - \delta$:
$$\inf_{\lceil T/2 \rceil \leq t \leq T} F(y_t) - F(x^*) \leq O\left(\frac{G^2 \log\frac{1}{\delta}}{\lambda T}\right).$$
The result now follows by using Theorem 4 with $\beta = 1/2$ and $q^{(0)}(t) = \frac{1}{T_1 - \lceil T/2 \rceil + 1}$.

We now empirically compare the SGD last point with our step-size sequence (Our Method) against the standard step size sequence (Standard) as well as the averaged iterates of SGD (Averaged). We apply these methods to two non-smooth problems: a) Lasso regression, b) linear SVM training.

Lasso Regression: We consider gradient descent for $F(x) = \frac{1}{n}\sum_{i=1}^{n} \left(\langle a_i, x \rangle - b_i\right)^2 + \lambda \|x\|_1$ for $x \in \mathbb{R}^d$. Here $a_i \sim \mathcal{N}(0, I_d)$ and $b_i = \langle a_i, x^* \rangle + z_i$ for some $s$-sparse vector $x^*$ and $z_i \sim \mathcal{N}(0, \sigma^2)$.
The $a_i$ and $z_i$ are all independent.

Figure 1: (a) $F(x)$ vs. number of iterations (log-loss) for the Lasso regression (Section 4), comparing Standard, Our Method, and Averaged. Here $d = 100$, $s = \|x^*\|_0 = 60$, $n = 80$, $C = 4$, $\sigma = 0.$, $\lambda = 0.$ and $T = 2$. The green line indicates the running average of the iterates from $1$ to $t$. (b) SVM loss (15) vs. number of iterations, comparing Standard, Our Method, and Suffix Averaged. We pick $d = 30$, $\sigma = 5$, $\eta = 1$, $n = 500$, $\lambda = 0.$, $T = 2$. (c) Average over independent SGD runs for the SVM loss. The green line is the loss of the average of the last $T$ iterates. In all the cases, the SGD last point with our step-size sequence produces a smaller objective value than the standard step size as well as the averaged iterates of SGD.

We use the step sizes $\gamma_t = \frac{C}{\sqrt{T}}$ and let $\alpha_t$ be the modification of $\gamma_t$ as given in Section 2 for a total of $T$ iterations. Since the objective is not smooth, the gradient doesn't vanish near the optimum. Therefore, when the standard step size was picked, the iterate $x_t$ kept oscillating around the infimum but never really reached it. In contrast, our method decreased the step size after some time, which allows better convergence to the optimum (see Figure 1(a)).

Training SVMs: We consider training SVMs, a typical example where non-smooth SGD is heavily used [4]. For our experiments, we generate data as follows. Let $a_i \sim \mathcal{N}(0, \sigma^2 I_d)$ and the label $b_i = \mathrm{sgn}(a_i(1) + z_i)$ where $z_i \sim \mathcal{N}(0, \eta^2)$. We generate $n = 500$ points in $d = 30$ dimensions. The SVM training problem is now:
$$F(x) = \frac{1}{n}\sum_{i=1}^{n} \max\left(0, 1 - b_i \langle a_i, x \rangle\right) + \lambda \|x\|^2, \qquad (15)$$
where $\lambda = 0.$. Since the objective is $\lambda$ strongly convex, we consider step sizes of $\gamma_t := \frac{1}{\lambda t}$ for the standard method and the modified step sizes given in Equation (4) for our method.
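A toy version of this comparison can be sketched as follows (our code, with shrunken dimensions and our own constants; the hinge-loss data model follows the description above, and `phase` encodes our reading of the $T_i$ of Equation (2)):

```python
import math
import random

random.seed(0)
d, n, lam, sigma, T = 10, 200, 0.5, 2.0, 4096

# Synthetic data: b_i = sgn(a_i(1) + z_i), as described above (dimensions shrunk).
data = []
for _ in range(n):
    a = [random.gauss(0.0, sigma) for _ in range(d)]
    b = 1.0 if a[0] + random.gauss(0.0, 1.0) > 0 else -1.0
    data.append((a, b))

def svm_loss(x):
    # Objective (15): averaged hinge loss plus lambda * ||x||^2 regularizer.
    hinge = sum(max(0.0, 1.0 - b * sum(ai * xi for ai, xi in zip(a, x))) for a, b in data) / n
    return hinge + lam * sum(xi * xi for xi in x)

def subgradient(x, a, b):
    # Subgradient of the per-sample objective at x.
    g = [2.0 * lam * xi for xi in x]
    if b * sum(ai * xi for ai, xi in zip(a, x)) < 1.0:
        g = [gi - b * ai for gi, ai in zip(g, a)]
    return g

def sgd(step):
    x = [0.0] * d
    for t in range(1, T + 1):
        a, b = random.choice(data)  # one sample per step
        g = subgradient(x, a, b)
        x = [xi - step(t) * gi for xi, gi in zip(x, g)]
    return x

def phase(t):
    # Index i with T_i < t <= T_{i+1}.
    if t <= T // 2:
        return 0
    if t >= T:
        return int(math.log2(T))
    return math.ceil(-math.log2(1.0 - t / T)) - 1

standard = sgd(lambda t: 1.0 / (lam * t))             # gamma_t = 1/(lambda t)
ours = sgd(lambda t: 2.0 ** (-phase(t)) / (lam * t))  # Equation (4)
print(svm_loss(standard), svm_loss(ours))
```

The two printed losses give the Standard vs. Our Method comparison for one random run; Figure 1 reports this comparison over many runs.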
Figure 1(b) plots the loss during a typical run of SGD, and Figure 1(c) the loss averaged over independent runs of SGD for the same problem with the same initial point. The last point of SGD with the modified step size sequence (Our Method), in blue, consistently outperformed the standard SGD (Standard), in red. The green line denotes the loss of the average of the last $T$ iterates.

We studied the fundamental question of sub-optimality of the last point of SGD/GD for general non-smooth convex functions as well as for strongly-convex functions. We proposed a novel step-size sequence that leads to information theoretically optimal rates in both of the above mentioned settings. Our result proves a more general statement for any "modified step-size" of a decaying standard step-size, and uses a novel technique of tracking the best iterate in each time interval and ensuring that the later iterates do not significantly deviate from the best iterate in the previous time interval. We also provide a high-probability bound using a super-martingale technique from [2]. Simulations show that our step-size indeed leads to a better last point than the standard step-size sequences.

Our approach fundamentally exploits the assumption that we know the total number of iterations $T$ a priori. Hence, our result does not provide an any-time algorithm. In contrast, existing any-time results have an extra $\log T$ multiplicative factor in the sub-optimality. We conjecture that this gap is fundamental and that every any-time algorithm would suffer from the extra $\log T$ factor. We give lower bounds for the strongly convex case to show that for any choice of step sizes, the algorithm is either sub-optimal in expectation or almost surely so infinitely often.

Acknowledgements

This research was partially supported by ONR N00014-17-1-2147 and the MIT-IBM Watson AI Lab.

References

[1] Ohad Shamir and Tong Zhang.
Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.

[2] Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. arXiv preprint arXiv:1812.05217, 2018.

[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[4] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[5] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

[6] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[7] Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? In Conference on Learning Theory, pages 47–1, 2012.

[8] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[9] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

[10] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

[11] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[12] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2009.

[13] Elad Hazan and Satyen Kale.
Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

[14] Alexander Rakhlin, Ohad Shamir, Karthik Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, volume 12, pages 1571–1578. Citeseer, 2012.

[15] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

[16] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM, 2004.

A Proofs of Technical Lemmas

A.1 Proof of Lemma 2

Proof of Lemma 2. We fix $l$ such that $t_1 \le l \le t_2$. In this proof, we will freely use the fact that $\alpha_s \ge \alpha_t$ whenever $s \le t$. Let $l \le t \le t_2$ and define $\Delta_t = \langle \hat g_t(x_t) - g_t(x_t),\, x_t - x_l \rangle$. We note that the iterates $x_t$ are random variables and are functions of $\hat g_1, \dots, \hat g_{t-1}$ only. We define the sigma-field $\mathcal F_t := \sigma(\hat g_1, \dots, \hat g_t)$.

We use the following notation for the sake of convenience: $D_t := \|x_t - x_l\|^2$. Clearly, $D_t$ is $\mathcal F_{t-1}$ measurable and $\Delta_t$ is $\mathcal F_t$ measurable. It is clear from the definition of $\Delta_t$ that $\mathbb E[\Delta_t \mid \mathcal F_{t-1}] = 0$ and $|\Delta_t| \le 2G\|x_t - x_l\| = 2G\sqrt{D_t}$. By Hoeffding's lemma, we conclude that for any $\mu \in \mathbb R$:

$$\mathbb E\left[\exp(\mu \Delta_t) \mid \mathcal F_{t-1}\right] \le \exp\left(2G^2 D_t \mu^2\right) \qquad (16)$$

Let $\lambda = \frac{1}{16\,\alpha_{t_1}^2 G^2}$. For $l \le t \le t_2$, consider

$$M_t := \exp\left(\sum_{s=l}^{t} -2\lambda L_s \alpha_s \Delta_s + \lambda (L_s - L_{s-1}) D_s\right).$$

Clearly, $M_t$ is $\mathcal F_t$ measurable, and $M_l = 1$ since $D_l = \Delta_l = 0$ almost surely.
We will show that $M_t$ is a supermartingale:

$$\mathbb E[M_t \mid \mathcal F_{t-1}] = M_{t-1}\, \mathbb E\left[\exp\left(-2\lambda L_t \alpha_t \Delta_t + \lambda (L_t - L_{t-1}) D_t\right) \mid \mathcal F_{t-1}\right] = M_{t-1} \exp\left(\lambda (L_t - L_{t-1}) D_t\right) \mathbb E\left[\exp\left(-2\lambda L_t \alpha_t \Delta_t\right) \mid \mathcal F_{t-1}\right] \le M_{t-1} \exp\left(\lambda (L_t - L_{t-1}) D_t + 8\lambda^2 L_t^2 \alpha_t^2 G^2 D_t\right) \le M_{t-1},$$

where the first inequality follows from Equation (16) and the last from the choice of $\lambda$. Therefore,

$$\mathbb E[M_{t_2}] \le 1 \qquad (17)$$

From the proof of Lemma 1, for $l \le t \le t_2$ we have:

$$\|x_{t+1} - x_l\|^2 \le \|x_t - x_l\|^2 + \alpha_t^2 \|\hat g_t(x_t)\|^2 - 2\alpha_t \langle \hat g_t(x_t), x_t - x_l\rangle \le \|x_t - x_l\|^2 + \alpha_t^2 G^2 - 2\alpha_t \langle g_t(x_t), x_t - x_l\rangle - 2\alpha_t \langle \hat g_t(x_t) - g_t(x_t), x_t - x_l\rangle \le \|x_t - x_l\|^2 + \alpha_t^2 G^2 + 2\alpha_t \left(F(x_l) - F(x_t)\right) - 2\alpha_t \Delta_t \qquad (18)$$

In the third step, we have used the convexity of $F(\cdot)$. Reordering Equation (18) and using the notation defined above:

$$2\alpha_t \left(F(x_t) - F(x_l)\right) - \alpha_t^2 G^2 \le D_t - D_{t+1} - 2\alpha_t \Delta_t.$$

Multiplying the equation above by $L_t$ and summing from $t = l$ to $t = t_2$, noting the fact that $D_l = 0$ and $D_{t_2+1} \ge 0$, we conclude:

$$\sum_{t=l}^{t_2} L_t\left[2\alpha_t (F(x_t) - F(x_l)) - \alpha_t^2 G^2\right] \le \sum_{t=l}^{t_2} L_t\left(D_t - D_{t+1} - 2\alpha_t \Delta_t\right) \le \sum_{t=l}^{t_2} -2L_t \alpha_t \Delta_t + (L_t - L_{t-1}) D_t \qquad (19)$$

We recall the random variable

$$A(l, t_2) := \sum_{t=l}^{t_2} L_t\left[2\alpha_t (F(x_t) - F(x_l)) - \alpha_t^2 G^2\right].$$

From Equations (17) and (19), we conclude that for every $l$ such that $t_1 \le l \le t_2$: $\mathbb E[\exp(\lambda A(l, t_2))] \le \mathbb E[M_{t_2}] \le 1$. By convexity of the exponential function, we have:

$$\mathbb E\left[\exp\left(\lambda (p \cdot A)(t_1, t_2)\right)\right] \le \sum_{l=t_1}^{t_2} p(l)\, \mathbb E\left[\exp(\lambda A(l, t_2))\right] \le 1.$$

By the Chernoff bound, we conclude:

$$\mathbb P\left[(p \cdot A)(t_1, t_2) > \eta\right] \le \exp\left(-\frac{\eta}{16\,\alpha_{t_1}^2 G^2}\right).$$

The case of $A^*(t_1, t_2)$ proceeds similarly, but this time we use $x^*$ in place of $x_l$. We define $D_t^* := \|x_t - x^*\|^2$, $\Delta_t^* := \langle \hat g_t(x_t) - g_t(x_t), x_t - x^*\rangle$ and

$$M_t^* := \exp\left(\sum_{i=t_1}^{t} -2\lambda L_i \alpha_i \Delta_i^* + \lambda (L_i - L_{i-1}) D_i^*\right).$$

We note that for $t_1 < t \le t_2$, $\mathbb E[M_t^* \mid \mathcal F_{t-1}] \le M_{t-1}^*$ and $D_t^* \le D^2$.
Therefore,

$$\mathbb E[M_{t_2}^*] \le \mathbb E[M_{t_1}^*] \le \exp\left(\lambda L_{t_1} D^2 + 8\lambda^2 L_{t_1}^2 \alpha_{t_1}^2 G^2 D^2\right) \le \exp\left(\frac{D^2}{4\,\alpha_{t_1}^2 G^2}\right).$$

Here we have used the fact that $L_{t_1} \le 2$. Noting that $\exp(\lambda A^*(t_1, t_2)) \le M_{t_2}^*$, we use the Chernoff bound to conclude the result.

A.2 Proof of Lemma 3

Proof of Lemma 3. We prove this by induction. The assertion is true for $i = 0$. Suppose it is true for $i = k \le r - 1$. Then,

$$\lambda_{k+1} = \lambda_k (1 + \Gamma \lambda_k) \le \left(1 + \tfrac{1}{r}\right)^k \lambda_0 \left(1 + \Gamma \left(1 + \tfrac{1}{r}\right)^k \lambda_0\right) \le \left(1 + \tfrac{1}{r}\right)^k \lambda_0 \left(1 + \tfrac{1}{r}\right) = \left(1 + \tfrac{1}{r}\right)^{k+1} \lambda_0,$$

where the second inequality uses $(1 + \frac{1}{r})^k \le e$ for $k \le r$ together with $\Gamma \lambda_0 \le \frac{1}{re}$. Thus we have proved the assertion by induction.

A.3 Proof of Lemma 7

Proof of Lemma 7. We take the definitions of the terms used from the proof of Lemma 6. It is clear from the definition that $\kappa(t) \ge 0$. Since $\kappa(t) = 0$ for $t > T_{i+1}$, it is sufficient to show that $\sum_{s=T_i+1}^{T_{i+1}} \kappa(s) = 1$.

We define $\sigma(t) = \sum_{s=T_i+1}^{t} \kappa(s)$ (an empty sum denotes $0$). By the definition of $\kappa$, for $T_i + 1 \le t \le T_{i+1}$,

$$\sigma(t) = \frac{\Gamma(T_{i+1})}{\Gamma(t)}\, q^{(i)}(t) + \frac{\Gamma(t-1)}{\Gamma(t)}\, \sigma(t-1).$$

Continuing the above recursion, we conclude:

$$\sigma(t) = \frac{\Gamma(T_{i+1})}{\Gamma(t)} \sum_{s=T_i+1}^{t} q^{(i)}(s).$$

Since $q^{(i)}$ is a probability distribution over $\{T_i + 1, \dots, T_{i+1}\}$, we conclude $\sigma(T_{i+1}) = \sum_{s=T_i+1}^{T_{i+1}} q^{(i)}(s) = 1$.

B Proofs of Lower Bounds

We will prove Theorem 5 for $G = 5$ and $\mu = 1$ for the sake of convenience. We can handle the general case by considering the transformation $\tilde F(x) = \frac{\mu}{G^2} F\left(\frac{G}{\mu} x\right)$. We scale the domain as $\tilde{\mathcal D} := \frac{\mu}{G} \mathcal D$. If $F$ is $\mu$-strongly convex and $G$-Lipschitz, then $\tilde F$ is $1$-strongly convex and $1$-Lipschitz. We take the subgradient oracle for $\tilde F$ to be $\tilde g_t(x) := \frac{1}{G}\, \hat g_t\left(\frac{Gx}{\mu}\right)$. It is easy to check that if SGD for $F(\cdot)$ with step sizes $\alpha_t$ produces the iterates $x_t$, then starting from $\tilde x_1 := \frac{\mu}{G} x_1$ and using step sizes $\tilde\alpha_t := \mu \alpha_t$ and the subgradient oracle defined above, the iterates for $\tilde F$ are $\tilde x_t = \frac{\mu}{G} x_t$. Therefore, $\tilde F(\tilde x_t) = \frac{\mu}{G^2} F(x_t)$ and the proof below goes through seamlessly. This is similar to the rescaling used for the lower bounds in [2].

Without loss of generality, we will restrict our attention to strictly positive step size sequences: $\gamma_t > 0$. We further restrict the possible values of $\gamma_t$ in the following lemma:

Lemma 8. If the step size sequence $\gamma_t$ is such that there is an infinite sequence of times $t_k$ with $\lim_{k\to\infty} t_k \gamma_{t_k} = \infty$, then SGD is bad in expectation. Therefore, we can restrict our consideration to step size sequences of the form $\gamma_t = O\left(\frac{1}{t}\right)$.

Proof. Consider the function $F : [-1, 1] \to \mathbb R$ defined by $F(x) = |x| + \frac{x^2}{2}$. $F$ has a global optimum at $x = 0$ and it is strongly convex. Let $\epsilon_t$ be a sequence of i.i.d. Rademacher random variables (i.e., uniform over $\{-1, +1\}$). We let the subgradient oracle return $\hat g_t(x) = \mathrm{sgn}(x) + x + 3\epsilon_t$. Clearly, $|x_{t+1}| = \min\left(|x_t - \gamma_t(\mathrm{sgn}(x_t) + x_t + 3\epsilon_t)|, 1\right)$. $\epsilon_t$ is independent of $x_t$, and conditioned on the value of $x_t$, with probability at least $\frac{1}{2}$, $\epsilon_t$ has the opposite sign to $x_t$. When this happens, $\mathrm{sgn}(x_t) + x_t + 3\epsilon_t$ has the opposite sign to $x_t$ and $|\mathrm{sgn}(x_t) + x_t + 3\epsilon_t| \ge 1$. Therefore, under this event, $|x_t - \gamma_t(\mathrm{sgn}(x_t) + x_t + 3\epsilon_t)| \ge |x_t| + \gamma_t \ge \gamma_t$. Therefore, we conclude:

$$\mathbb E|x_{t+1}| \ge \frac{1}{2}\min(1, \gamma_t).$$

Considering the fact that $(t_k + 1)\, \mathbb E\left(F(x_{t_k+1}) - F(0)\right) \ge t_k\, \mathbb E|x_{t_k+1}| \ge \frac{1}{2}\min\left(t_k, \gamma_{t_k} t_k\right) \to \infty$, we conclude that SGD with this step size is bad in expectation.

Henceforth, we will restrict our attention, without loss of generality, to step size sequences such that $\gamma_t = O\left(\frac{1}{t}\right)$. We will first consider the function $F(x) = \frac{x^2}{2}$ over the set $[-1, 1]$.
Let the infinite-horizon learning rate be $(\gamma_t)_{t \in \mathbb N}$. At each time instant, the subgradient oracle returns $x + \epsilon_t$, where $\epsilon_t$ is a sequence of i.i.d. uniform random variables over $\{-1, +1\}$ (that is, Rademacher random variables). Let the iterates of SGD be $z_t$, with $z_1 = 1$.

Lemma 9. Let $T_0$ be the smallest time such that $\gamma_t < 1$ for all $t \ge T_0$. Then, for every $t \ge T_0$,

$$\mathbb E z_{t+1}^2 = (1 - \gamma_t)^2\, \mathbb E z_t^2 + \gamma_t^2 \quad \text{and} \quad \mathbb E z_t^2 \ge \frac{1}{t - T_0 + 1}.$$

Proof. Suppose $\gamma_t \ge 1$. Then:

$$|z_t(1 - \gamma_t) + \gamma_t \epsilon_t| \ge \gamma_t |\epsilon_t| - |z_t(1 - \gamma_t)| \ge \gamma_t - |1 - \gamma_t| = 1.$$

Therefore, when $\gamma_t \ge 1$, $|z_{t+1}| = 1$. In particular, $|z_{T_0}| = 1$ almost surely. When $t \ge T_0$,

$$|z_t - \gamma_t(z_t + \epsilon_t)| \le |(1 - \gamma_t) z_t| + \gamma_t |\epsilon_t| = (1 - \gamma_t)|z_t| + \gamma_t \le 1.$$

Therefore, when $t \ge T_0$, the iteration of SGD won't leave the set $[-1, 1]$ almost surely, so there is no need for the projection step to obtain the next iterate. That is, for $t \ge T_0$, $z_{t+1} = z_t(1 - \gamma_t) - \epsilon_t \gamma_t$. Squaring and taking expectations, we conclude:

$$\mathbb E z_{t+1}^2 = (1 - \gamma_t)^2\, \mathbb E z_t^2 + \gamma_t^2 \ge \inf_{\gamma \in \mathbb R}\left[(1 - \gamma)^2\, \mathbb E z_t^2 + \gamma^2\right] = \frac{\mathbb E z_t^2}{1 + \mathbb E z_t^2}.$$

Clearly, $\mathbb E z_{T_0}^2 = 1 = \frac{1}{T_0 - T_0 + 1}$. Using induction in the equation above, we conclude: $\mathbb E z_t^2 \ge \frac{1}{t - T_0 + 1}$ for every $t \ge T_0$.

We divide $\mathbb N$ into time intervals of the form $I_k := \{2^k, 2^k + 1, \dots, 2^{k+1} - 1\}$. We have the following lemma:

Lemma 10. Suppose $\gamma_t \le \frac{C}{t}$ for some constant $C \ge 1$, and that there exist positive infinite sequences $c_k$ and $d_k$ with $\lim_{k\to\infty} c_k = \infty$ and $\lim_{k\to\infty} d_k = 0$ such that, for every $k$, at least one of the two conditions below holds:

1. $\sum_{t \in I_k} \gamma_t^2 \ge c_k\, 2^{-k} \left(\sum_{t \in I_k} \gamma_t\right)^2$

2. $\sum_{t \in I_k} \gamma_t \le d_k$

Then, SGD with step size $\gamma_t$ is bad in expectation.

Proof. We consider the optimization problem considered in Lemma 9, i.e., minimizing $F(x) = \frac{x^2}{2}$. Let $T_k = 2^k$. We assume the contrary, that is, $\mathbb E z_{T_k}^2 \le \frac{L}{T_k}$ for every $k$, for some $L > 0$. As shown in the second inequality of Lemma 9, irrespective of the choice of $\gamma_1, \dots, \gamma_{T_k - 1}$, $\mathbb E z_{T_k}^2 \ge \frac{1}{T_k}$.
From the first equality in Lemma 9, we conclude that for $t \in I_k$, $\mathbb E z_{t+1}^2 = (1 - \gamma_t)^2\, \mathbb E z_t^2 + \gamma_t^2$. Since $\gamma_t \le \frac{C}{t}$, we can take $k$ large enough so that $\gamma_t \le \frac{1}{2}$ for every $t \in I_k$. Using the fact that $(1 - \gamma_t)^2 \ge \exp\left(-\frac{2\gamma_t}{1 - \gamma_t}\right) \ge \exp(-4\gamma_t)$, we obtain

$$\mathbb E z_{t+1}^2 \ge e^{-4\gamma_t}\, \mathbb E z_t^2 + \gamma_t^2.$$

Unravelling the recursion above, we conclude:

$$\mathbb E z_{T_{k+1}}^2 \ge e^{-4\sum_{t \in I_k} \gamma_t}\left[\mathbb E z_{T_k}^2 + \sum_{t \in I_k} \gamma_t^2\right] \qquad (20)$$

We define $S_k := \sum_{t \in I_k} \gamma_t$.

1. Suppose that for a particular $k$ the first item in the statement of the lemma holds. By assumption, $\sum_{t \in I_k} \gamma_t^2 \ge \frac{c_k}{T_k} S_k^2$. Using this in Equation (20), we conclude:

$$\mathbb E z_{T_{k+1}}^2 \ge e^{-4S_k}\left[\mathbb E z_{T_k}^2 + \frac{c_k}{T_k} S_k^2\right]$$

Now, since $\gamma_t \le \frac{C}{t}$, we have $S_k \le C$. Therefore,

$$\mathbb E z_{T_{k+1}}^2 \ge e^{-4S_k}\, \mathbb E z_{T_k}^2 + \frac{c_k}{T_k} S_k^2 e^{-4C} = \mathbb E z_{T_k}^2\left[e^{-4S_k} + \frac{c_k S_k^2 e^{-4C}}{T_k\, \mathbb E z_{T_k}^2}\right] \ge \mathbb E z_{T_k}^2\left[e^{-4S_k} + \frac{c_k S_k^2 e^{-4C}}{L}\right] \ge \mathbb E z_{T_k}^2 \inf_{x \ge 0}\left[e^{-4x} + \frac{x^2 c_k e^{-4C}}{L}\right] \qquad (21)$$

In the third step, we have used the fact that $T_k\, \mathbb E z_{T_k}^2 \le L$. We now consider the function $h : \mathbb R_+ \to \mathbb R_+$ given by $h(x) = e^{-x} + \kappa x^2$ for some $\kappa > 0$. Clearly, $h$ is convex, bounded below, and tends to infinity as $x \to \infty$. Therefore, it has a unique minimizer $t^*$, the unique point such that $h'(t^*) = 0$. That is, $t^*$ is the unique point which satisfies $2\kappa t^* = e^{-t^*} \le 1$. Therefore, $t^* \le \frac{1}{2\kappa}$, and $h(t^*) \ge e^{-t^*} \ge e^{-1/(2\kappa)} \ge 1 - \frac{1}{2\kappa}$. In Equation (21), taking $\kappa$ proportional to $\frac{c_k e^{-4C}}{L}$ (the rescaling $x \to 4x$ only affects the constant), we conclude:

$$\mathbb E z_{T_{k+1}}^2 \ge \mathbb E z_{T_k}^2\left(1 - \frac{C'}{c_k}\right)$$

where $C'$ is a constant depending only on $L$ and $C$.
2. Suppose that for a particular $k$ the second item in the statement of the lemma holds. Then, by Equation (20), we have:

$$\mathbb E z_{T_{k+1}}^2 \ge e^{-4d_k}\, \mathbb E z_{T_k}^2 \ge (1 - 4d_k)\, \mathbb E z_{T_k}^2 \qquad (22)$$

From Equations (21) and (22), we conclude that there exists a constant $\bar C$ depending only on $C$ and $L$ such that:

$$\mathbb E z_{T_{k+1}}^2 \ge \left(1 - \bar C \max\left(d_k, \tfrac{1}{c_k}\right)\right) \mathbb E z_{T_k}^2 \qquad (23)$$

Since $\max(d_k, \frac{1}{c_k}) \to 0$, we can choose $k$ large enough so that $\sup_{s > k} \bar C \max(d_s, \frac{1}{c_s}) \le 1 - e^{-\epsilon}$ for arbitrary $\epsilon > 0$. From Equation (23), it follows that for arbitrary $K \in \mathbb N$, $\mathbb E z_{T_{k+K}}^2 \ge e^{-\epsilon K}\, \mathbb E z_{T_k}^2$. By Lemma 9, $\mathbb E z_{T_k}^2 \ge \frac{1}{T_k} = 2^{-k}$. By our assumption, $\mathbb E z_{T_{k+K}}^2 \le L\, 2^{-k-K}$. Therefore, we conclude: $L\, 2^{-k-K} \ge e^{-\epsilon K}\, 2^{-k}$ for every $K \in \mathbb N$. This cannot hold for any finite $L$ when we take $\epsilon < \log 2$. This contradicts our assumption. Therefore, SGD with step size $\gamma_t$ is bad in expectation.

We will show that if the conditions for $\gamma_t$ in Lemma 8 or those in Lemma 10 don't hold, then SGD is bad almost surely. We recall the definition of the interval $I_k = \{2^k + 1, \dots, 2^{k+1}\}$. We prove the following lemma to inspect how frequently long, contiguous segments of $\epsilon_t$ are all equal to $1$ for $t \in I_k$. We take $\tau_k := 2^{\lfloor \log_2(k/2) \rfloor}$. We note that $\frac{k}{4} \le \tau_k \le \frac{k}{2}$. We can divide $I_k$ into $|I_k|/\tau_k$ contiguous, disjoint intervals, each of size $\tau_k$. We call these intervals $J_k(i)$ for $i \in \{1, \dots, |I_k|/\tau_k\}$. We let $A_k$ be the event that for some $i \in \{1, \dots, |I_k|/\tau_k\}$, $\epsilon_t = 1$ for all $t \in J_k(i)$. In particular, the event $A_k$ implies that there is a contiguous length-$\tau_k$ sequence of $\epsilon_t$ all equal to $1$ in $I_k$.

Lemma 11. $\mathbb P(A_k^c) \le C k\, 2^{-k/2}$ for some absolute constant $C$.

Proof. We subdivide the interval $I_k$ into disjoint subintervals of length $\tau_k$. There are $2^k/\tau_k$ such intervals. The event $A_k$ holds if over one such subinterval, the random signs are all $1$.
The probability of a given subinterval having all signs equal to $1$ is $p_{\tau_k} := 2^{-\tau_k}$. Therefore, we conclude:

$$\mathbb P(A_k) = 1 - (1 - p_{\tau_k})^{\lfloor 2^k/\tau_k \rfloor} \ge 1 - e^{-p_{\tau_k} \lfloor 2^k/\tau_k \rfloor} \ge 1 - \frac{1}{e\, p_{\tau_k} \lfloor 2^k/\tau_k \rfloor}.$$

Here we have used the inequality $x e^{-x} \le \frac{1}{e}$ for $x > 0$. Therefore, we conclude that $\mathbb P(A_k^c) \le C k\, 2^{-k/2}$ for some absolute constant $C$.

We now consider the same function which was considered in Lemma 8, i.e., $F : [-1, 1] \to \mathbb R$ defined by $F(x) = |x| + \frac{x^2}{2}$. $F$ has a global optimum at $x = 0$ and it is strongly convex. Let $\epsilon_t$ be a sequence of i.i.d. Rademacher random variables (i.e., uniform over $\{-1, +1\}$). We let the subgradient oracle return $\hat g_t(x) = \mathrm{sgn}(x) + x + 3\epsilon_t$. Let the iterates of SGD for $F$ with step sizes $\gamma_t$ be $y_t$.

Lemma 12. Suppose $\gamma_t \le \frac{C}{t}$, and there exist an infinite sequence $(k_r)_{r \in \mathbb N}$ and fixed constants $c, d > 0$ such that both of the following conditions hold:

1. $\sum_{t \in I_{k_r}} \gamma_t^2 \le c\, 2^{-k_r} \left(\sum_{t \in I_{k_r}} \gamma_t\right)^2$

2. $\sum_{t \in I_{k_r}} \gamma_t \ge d$

We note that these conditions are the negations of the conditions for $\gamma_t$ in Lemma 8 and Lemma 10. Then SGD with step size $\gamma_t$ is bad almost surely.

Proof. We will show that there exists a sequence of independent events $B_{k_r}$, $r \in \mathbb N$, such that $\mathbb P(B_{k_r}) \ge p > 0$ uniformly, and whenever $B_{k_r}$ holds,

$$\max_{t \in I_{k_r}} \frac{t}{\log t}\left[F(y_t) - F(0)\right] \ge \delta$$

for some constant $\delta > 0$. We note that $p$ and $\delta$ depend only on $C$, $d$ and $c$. We define random times $T_{\max}, T_{\min} \in I_k$ as follows:

1. If the event $A_k^c$ holds, pick a uniformly random element $i$ from $\{1, \dots, |I_k|/\tau_k\}$, independent of everything else. Set $T_{\max} := \max J_k(i)$ and $T_{\min} := \min J_k(i)$.

2. If the event $A_k$ holds, pick a uniformly random element $i$ from $\{i : \epsilon_t = 1 \text{ for all } t \in J_k(i)\}$, independent of everything else. Set $T_{\max} := \max J_k(i)$ and $T_{\min} := \min J_k(i)$.

We note that by symmetry, $i$ is uniformly distributed over the set $\{1, \dots, |I_k|/\tau_k\}$. We will show that, when the event $A_k$ holds, one of the following is true:

1. $y_{T_{\max}} = -1$.

2.
$y_{T_{\min}} - y_{T_{\max}} \ge \sum_{t \in J_k(i)} \gamma_t$.

Suppose the event $A_k$ holds. Then for $T_{\min} \le t < T_{\max}$, $y_{t+1} = \max\left(y_t - \gamma_t(y_t + \mathrm{sgn}(y_t) + 3), -1\right)$. Since under the event $A_k$, $\epsilon_t = 1$ for every $t \in J_k(i)$, we conclude that $\gamma_t(y_t + \mathrm{sgn}(y_t) + 3) \ge \gamma_t$. That is, SGD drifts in the negative direction irrespective of the value of the iterate. It is therefore clear that if for some $T_{\min} \le t \le T_{\max}$ the iterate $y_t$ hits $-1$, then $y_{T_{\max}} = -1$. Now suppose that $y_t > -1$ for every $t$ in this range. Then $y_{t+1} \le y_t - \gamma_t$, and unraveling this recursion, it follows that $y_{T_{\min}} - y_{T_{\max}} \ge \sum_{t \in J_k(i)} \gamma_t$. Therefore, it follows that when the event $A_k$ holds:

$$\max_{t \in I_k} F(y_t) \ge \max\left(F(y_{T_{\max}}), F(y_{T_{\min}})\right) \ge \max\left(|y_{T_{\max}}|, |y_{T_{\min}}|\right) \ge \min\Big(1, \tfrac{1}{2}\sum_{t \in J_k(i)} \gamma_t\Big) \qquad (24)$$

It is clear that since $\gamma_t \le C/t$, for $k$ large enough, $\sum_{t \in J_k(i)} \gamma_t \le 2$. Therefore, we conclude that for $k$ large enough, when the event $A_k$ holds,

$$\max_{t \in I_k} F(y_t) \ge \frac{1}{2}\sum_{t \in J_k(i)} \gamma_t.$$

Fix $0 < \beta < 1$. We now consider $E_k$ to be the event $\left\{\sum_{t \in J_k(i)} \gamma_t \ge \beta \frac{\tau_k}{|I_k|} \sum_{t \in I_k} \gamma_t\right\}$. By symmetry, $i$ is uniformly distributed over $\{1, \dots, |I_k|/\tau_k\}$. Therefore,

$$\mathbb E\left[\sum_{t \in J_k(i)} \gamma_t\right] = \frac{\tau_k}{|I_k|} \sum_{t \in I_k} \gamma_t$$

and

$$\mathbb E\left[\Big(\sum_{t \in J_k(i)} \gamma_t\Big)^2\right] = \frac{\tau_k}{|I_k|} \sum_{i=1}^{|I_k|/\tau_k} \Big(\sum_{t \in J_k(i)} \gamma_t\Big)^2 \le \frac{\tau_k}{|I_k|} \sum_{i=1}^{|I_k|/\tau_k} \sum_{t, s \in J_k(i)} \frac{\gamma_t^2 + \gamma_s^2}{2} \le \frac{\tau_k^2}{|I_k|} \sum_{t \in I_k} \gamma_t^2$$

Now, when $k$ is part of the infinite sequence $(k_r)$, by assumption we have:

$$\mathbb E\left[\Big(\sum_{t \in J_k(i)} \gamma_t\Big)^2\right] \le c \left(\frac{\tau_k}{|I_k|} \sum_{t \in I_k} \gamma_t\right)^2$$

Therefore, by the Paley-Zygmund inequality, whenever $k$ is part of the infinite sequence $(k_r)$, for every $\beta < 1$,

$$\mathbb P\left[\sum_{t \in J_k(i)} \gamma_t \ge \beta \frac{\tau_k}{|I_k|} \sum_{t \in I_k} \gamma_t\right] \ge (1 - \beta)^2 \frac{\left(\mathbb E \sum_{t \in J_k(i)} \gamma_t\right)^2}{\mathbb E \left(\sum_{t \in J_k(i)} \gamma_t\right)^2} \ge \frac{(1 - \beta)^2}{c}$$

Recalling the definition of $E_k$, we conclude $\mathbb P(E_{k_r}) \ge \frac{(1 - \beta)^2}{c}$. We will now define the event $B_k := E_k \cap A_k$.
The events $B_k$ are all independent by definition. When the event $B_{k_r}$ holds, clearly, from Equation (24), we conclude:

$$\max_{t \in I_{k_r}} F(y_t) \ge \frac{1}{2}\sum_{t \in J_{k_r}(i)} \gamma_t \ge \frac{\beta}{2} \cdot \frac{\tau_{k_r}}{|I_{k_r}|} \sum_{t \in I_{k_r}} \gamma_t \ge \frac{\beta \tau_{k_r} d}{2 |I_{k_r}|}$$

The second inequality follows from the definition of $E_k$. Using the fact that any $t \in I_{k_r}$ satisfies $t \le 2|I_{k_r}|$ and that $\tau_{k_r} = \Theta(k_r)$, we conclude that for some fixed $\delta > 0$, the following holds whenever the event $B_{k_r}$ holds:

$$\max_{t \in I_{k_r}} \frac{t}{\log t}\left[F(y_t) - F(0)\right] \ge \delta \qquad (25)$$

Moreover,

$$\mathbb P(B_{k_r}) \ge \mathbb P(E_{k_r}) - \mathbb P(A_{k_r}^c) \ge \frac{(1 - \beta)^2}{c} - O\left(k_r 2^{-k_r/2}\right)$$

It is clear that we can find a $p > 0$ such that for all $k_r$ large enough, $\mathbb P(B_{k_r}) > p$. Since the $B_{k_r}$ are independent events, it follows that infinitely many of them occur with probability $1$. From Equation (25), we conclude that SGD with step sizes $\gamma_t$ is bad almost surely.

Proof of Theorem 5. We will conclude this from Lemmas 8, 10 and 12. It is sufficient to show that any strictly positive infinite sequence $\gamma_t$ satisfies at least one of the following conditions:

1. There is an infinite sequence of times $t_k$ such that $\lim_{k\to\infty} t_k \gamma_{t_k} = \infty$. In this case, by Lemma 8, we conclude that SGD is bad in expectation.

2. There exists a $C$ such that $\gamma_t \le \frac{C}{t}$, and there exist infinite sequences $c_k \to \infty$ and $d_k \to 0$ such that for every $k$, either $\sum_{t \in I_k} \gamma_t^2 \ge c_k 2^{-k} \left(\sum_{t \in I_k} \gamma_t\right)^2$ or $\sum_{t \in I_k} \gamma_t \le d_k$. In this case, by Lemma 10, we conclude that SGD is bad in expectation.

3. There exists a $C$ such that $\gamma_t \le \frac{C}{t}$, and there exist fixed positive constants $c$ and $d$ such that for some infinite subsequence $(k_r)$, $\sum_{t \in I_{k_r}} \gamma_t^2 \le c\, 2^{-k_r} \left(\sum_{t \in I_{k_r}} \gamma_t\right)^2$ and $\sum_{t \in I_{k_r}} \gamma_t \ge d$. In this case, by Lemma 12, we conclude that the algorithm is bad almost surely.

It is therefore sufficient to show that if conditions 1 and 2 don't hold, then condition 3 holds. The negation of condition 1 is that $\gamma_t \le \frac{C}{t}$ for some $C > 0$.
Now, we denote

$$\eta_k := \frac{2^k \sum_{t \in I_k} \gamma_t^2}{\left(\sum_{t \in I_k} \gamma_t\right)^2} \quad \text{and} \quad \lambda_k := \sum_{t \in I_k} \gamma_t.$$

The statement "$\eta_k \ge c_k$ or $\lambda_k \le d_k$ for every $k$, for some $c_k \to \infty$ and $d_k \to 0$" is equivalent to $\eta_k + \frac{1}{\lambda_k} \to \infty$, which is in turn equivalent to the statement that for every subsequence $(k_r)$, $\eta_{k_r} + \frac{1}{\lambda_{k_r}} \to \infty$. Therefore, the negation of condition 2 is equivalent to at least one of the following conditions being true:

1. There exists an infinite sequence $(t_k)$ such that $t_k \gamma_{t_k} \to \infty$.

2. There exists an infinite subsequence $(k_r)$ such that $\eta_{k_r} + \frac{1}{\lambda_{k_r}} \le M$ for some $M > 0$. That is, $\eta_{k_r} \le M =: c$ and $\lambda_{k_r} \ge \frac{1}{M} =: d$.

In the first case, condition 1 holds; in the second case, condition 3 holds.
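The recursion at the heart of Lemma 9 is easy to check numerically. The following illustrative Monte Carlo sketch (not from the paper; the step-size choice $\gamma_t = \frac{1}{t+1}$ and the run count are assumptions made for the demonstration) tracks $\mathbb E z_t^2$ for SGD on $F(x) = \frac{x^2}{2}$ over $[-1, 1]$ with the oracle $x + \epsilon_t$, and compares it against the $\frac{1}{t - T_0 + 1}$ lower bound (here $T_0 = 1$ since $\gamma_t < 1$ for all $t$).

```python
import numpy as np

# Monte Carlo check of the moment recursion behind Lemma 9:
#   E[z_{t+1}^2] = (1 - gamma_t)^2 E[z_t^2] + gamma_t^2 >= 1/(t - T0 + 1).
# With gamma_t = 1/(t+1) the recursion attains the bound with equality,
# so the empirical second moment should track 1/t closely.
rng = np.random.default_rng(0)
runs, T = 20000, 200
z = np.ones(runs)                      # z_1 = 1, as in the construction
second_moments = []
for t in range(1, T + 1):
    gamma = 1.0 / (t + 1)              # gamma_t < 1 for all t, so T0 = 1
    eps = rng.choice([-1.0, 1.0], size=runs)
    z = (1.0 - gamma) * z - gamma * eps  # |z| stays <= 1: no projection needed
    second_moments.append(float(np.mean(z ** 2)))
```

Since the cross term $\mathbb E[z_t \epsilon_t]$ vanishes, the second moment obeys the recursion exactly, and `second_moments[j]` estimates $\mathbb E z_{j+2}^2 = \frac{1}{j+2}$, never dropping below the information-theoretic $\Omega(1/t)$ floor regardless of the step-size choice.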