Closing the convergence gap of SGD without replacement
CClosing the convergence gap of SGD without replacement
Shashank Rajput [email protected]
Anant Gupta [email protected]
Dimitris Papailiopoulos [email protected]
University of Wisconsin - Madison
Abstract
Stochastic gradient descent without replacement sampling is widely used in practice for modeltraining. However, the vast majority of SGD analyses assumes data is sampled with replacement,and when the function minimized is strongly convex, an O (cid:0) T (cid:1) rate can be established when SGDis run for T iterations. A recent line of breakthrough works on SGD without replacement (SGDo)established an O (cid:0) nT (cid:1) convergence rate when the function minimized is strongly convex and is a sumof n smooth functions, and an O (cid:16) T + n T (cid:17) rate for sums of quadratics. On the other hand, thetightest known lower bound postulates an Ω (cid:16) T + n T (cid:17) rate, leaving open the possibility of betterSGDo convergence rates in the general case. In this paper, we close this gap and show that SGDwithout replacement achieves a rate of O (cid:16) T + n T (cid:17) when the sum of the functions is a quadratic, andoffer a new lower bound of Ω (cid:0) nT (cid:1) for strongly convex functions that are sums of smooth functions. Stochastic gradient descent (SGD) is a widely used first order optimization technique used to approximatelyminimize a sum of functions F ( x ) = 1 n n (cid:88) i =1 f i ( x ) . In its most general form, SGD produces a series of iterates x i +1 = x i − α · g ( x, ξ i )where x i is the i -th iterate, g ( x, ξ i ) is a stochastic gradient defined below, ξ i is a random variable thatdetermines the choice of a single or a subset of sampled functions f i , and α represents the step size. With-and without replacement sampling of the individual component functions are regarded as some of themost popular variants of SGD. During SGD with replacement sampling, the stochastic gradient is equalto g ( x, ξ i ) = ∇ f ξ i ( x ) and ξ i is a uniform number in { , . . . , n } , i.e. , a with replacement sample fromthe set of gradients ∇ f , . . . , ∇ f n . In the case of without replacement sapling, the stochastic gradient isequal to g ( x, ξ i ) = ∇ f ξ i ( x ) and ξ i is the i -th ordered element in a random permutation of the numbers in { , . . . , n } , i.e. , a without-replacement sample.In practice, SGD without replacement is much more widely used compared to its with replacementcounterpart, as it can empirically converge significantly faster (Bottou, 2009; Recht and R´e, 2013; 2012).However, in the land of theoretical guarantees, with replacement SGD has been the focal point ofconvergence analyses. This is because analyzing stochastic gradients sampled with replacement aresignificantly more tractable. The reason is simple: in expectation, the stochastic gradient is equal tothe “true” gradient of F , i.e. , E ξ i ∇ f ξ i ( x ) = ∇ F ( x ). This makes SGD amenable to analyses very similarto that of vanilla gradient descent (GD), which has been extensively studied under a large variety offunction classes and geometric assumptions, e.g. , see Bubeck et al. (2015).Unfortunately, the same cannot be said for SGD without replacement, which has long resisted non-vacuous convergence guarantees. For example, although we have long known that SGD with replacementcan achieve a O (cid:0) T (cid:1) rate for strongly convex functions F , for many years the best known bounds forSGD without replacement did not even match that rate, in contrast to empirical evidence. However,a recent series of breakthrough results on SGD without replacement has established similar or betterconvergence rates than SGD with replacement. 1 a r X i v : . [ c s . 
L G ] J u l is strongly convex and a sum of n quadraticsLower bound, Safran andShamir (2019) Ω (cid:18) T + n T (cid:19) Upper bound, HaoChen andSra (2018) ˜ O (cid:18) T + n T (cid:19) Our upper bound,Theorem 1 ˜ O (cid:18) T + n T (cid:19) F is strongly convex and a sum of n smooth functionsLower bound, Safran andShamir (2019) Ω (cid:18) T + n T (cid:19) Upper bound, Nagaraj et al.(2019) ˜ O (cid:18) nT (cid:19) Our lower bound,Theorem 2 Ω (cid:18) nT (cid:19) Table 1: Comparison of our lower and upper bounds to current state-of-the-art results. Our matching boundsestablish information theoretically optimal rates for SGD. We note that the ˜ O ( · ) notation hides logarithmicfactors. G¨urb¨uzbalaban et al. (2015) established for the first time that for sums of quadratics or smoothfunctions, there exist parameter regimes under which SGDo achieves an O ( n /T ) rate compared to the O (1 /T ) rate of SGD with replacement sampling. In this case, if n is considered a constant, then SGDobecomes T times faster than SGD with replacement. Shamir (2016) showed that for one epoch, i.e., onepass over the n functions, SGDo achieves a convergence rate of O (1 /T ). More recently, HaoChen andSra (2018) showed that for functions that are sums of quadratics, or smooth functions under a Hessiansmoothness assumption, one could obtain an even faster rate of O (cid:16) T + n T (cid:17) . Nagaraj et al. (2019) showthat for Lipschitz convex functions, SGDo is at least as fast as SGD with replacement, and for functionsthat are strongly convex and sum of n smooth components one can achieve a rate of O (cid:0) nT (cid:1) . This latterresult was the first convergence rate that provably establishes the superiority of SGD without replacementeven for the regime that n is not a constant, as long as the number of iterations T grows faster than thenumber n of function components.This new wave of upper bounds has also been followed by new lower bounds. Safran and Shamir (2019)establish that there exist sums of quadratics on which SGDo cannot converge faster than Ω (cid:16) T + n T (cid:17) .This lower bound gave rise to a gap between achievable rates and information theoretic impossibility. Onone hand, SGDo on n quadratics has a rate of at least Ω (cid:16) T + n T (cid:17) and at most O (cid:16) T + n T (cid:17) . On theother hand, for the more general class of strongly convex functions that are sums of smooth functions thebest rate is O (cid:0) nT (cid:1) . This leaves open the question of whether the upper or lower bounds are loose. Thisis precisely the gap we close in this work. Our Contributions:
In this work, we establish tight bounds for SGDo. We close the gap betweenlower and upper bounds on two of the function classes that prior works have focused on: strongly convexfunctions that are i) sums of quadratics and ii) sums of smooth functions. Specifically, for i) , we offertighter convergence rates, i.e. , an upper bound that matches the lower bound given by Safran and Shamir(2019); as a matter of fact our convergence rates apply to general quadratic functions that are stronglyconvex, which is a little more general of a function class. For ii) , we provide a new lower bound thatmatches the upper bound by Nagaraj et al. (2019). A detailed comparison of current and proposedbounds can be found in Table 1.A few words on the techniques used are in order. For our convergence rate on quadratic functions, weheavily rely on and combine the approaches used by Nagaraj et al. (2019) and HaoChen and Sra (2018).The convergence rate analyses proposed by HaoChen and Sra (2018) can be tightened by a more carefulanalysis that employs iterate coupling similar to the one used by Nagaraj et al. (2019), combined withnew bounds on the deviation of the stochastic, without-replacement gradient from the true gradient of F .For our lower bound, we use a similar construction to the one used by Safran and Shamir (2019), withthe difference that each of the individual function components is not a quadratic function, but rather apiece-wise quadratic. This particular function has the property we need: it is smooth, but not quadratic.By appropriately scaling the sharpness of the individual quadratics we construct a function that behavesin a way that SGD without replacement cannot converge faster than a rate of n/T , no matter what stepsize one chooses.We note that although our methods have an optimal dependence on n and T , we believe that thedependence on function parameters, e.g. , strong convexity, Lipschitz, and smoothness, can potentially beimproved. 2 Related Work
The recent flurry of work on without replacement sampling in stochastic optimization extends to severalvariants of stochastic algorithms beyond SGD. In (Lee and Wright, 2019; Wright and Lee, 2017), theauthors provide convergence rates for random cyclic coordinate descent, establishing for the first timethat it can provably converge faster than stochastic coordinate descent with replacement sampling. Thiswork is complemented by a lower bound on the gap between the random and non-random permutationvariant of coordinate descent (Sun and Ye, 2019). Several other works have focused on the randompermutation variant of coordinate descent, e.g. , see (Gurbuzbalaban et al., 2019a; Sun et al., 2019). In(Gurbuzbalaban et al., 2019b), novel bounds are given for incremental Newton based methods. Menget al. (2019) present convergence bounds for with replacement sampling and distributed SGD. Finally,Ying et al. (2018) present asymptotic bounds for SGDo for strongly convex functions, and show that witha constant step size it approaches the global optimizer to within smaller error radius compared to SGDwith replacement. In (Shamir, 2016), linear convergence is established for a without replacement variantof SVRG.
We focus on using SGDo to approximately find x ∗ , the global minimizer of the following unconstrainedminimization problem min x ∈ R d (cid:32) F ( x ) := 1 n n (cid:88) i =1 f i ( x ) (cid:33) . In our convergence bounds, we denote by T the total number of iterations of SGDo, and by K thenumber of epochs, i.e. , passes over the data. Hence, T = nK. In our derivations, we denote by x ji the i -th iterate of the j -th epoch. Consequentially, we have that x j +10 ≡ x jn .Our results in the following sections rely on the following assumptions. Assumption 1. (Convexity of Components) f i is convex for all i ∈ [ n ] . Assumption 2. (Strong Convexity) F is strongly convex with strong convexity parameter µ , that is ∀ x, y : F ( y ) ≥ F ( x ) + (cid:104)∇ F ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) Assumption 3. (Bounded Domain) ∀ x : (cid:107) x − x ∗ (cid:107) ≤ D. Assumption 4. (Bounded Gradients) ∀ i, x : (cid:107)∇ f i ( x ) (cid:107) ≤ G. Assumption 5. (Lipschitz Gradients) The functions f i are L -smooth, that is ∀ i, x, y : (cid:107)∇ f i ( x ) − ∇ f i ( y ) (cid:107) ≤ L (cid:107) x − y (cid:107) . In this section, we will focus on strongly convex functions that are quadratic. We will provide a tightconvergence rate that improves upon the the existing rates and matches the Ω (cid:16) T + n T (cid:17) lower boundby Safran and Shamir (2019) up to logarithmic factors.For strongly convex functions that are a sum of smooth functions, Nagaraj et al. (2019) offer a rateof O (cid:0) nT (cid:1) , whereas for strongly convex quadratics HaoChen and Sra (2018) give a convergence rate of O (cid:16) T + n T (cid:17) . A closer comparison of these two rates reveals that neither of them can be tight due tothe following observation. Assume that n (cid:28) K . Then, that implies (cid:18) T + n T (cid:19) < nT .
3t the same time, if we assume that the number of data points is significantly larger than the number ofepochs that we run SGDo for, i.e. , n (cid:29) K we have that (cid:18) T + n T (cid:19) > nT . In comparison, the known lower bound for quadratics given by Safran and Shamir (2019) isΩ (cid:16) T + n T (cid:17) . This makes one wonder what is the true convergence rate of SGDo in this case. We settlethe optimal rates for quadratics here by providing an upper bound which, up to logarithmic factors,matches the best known lower bound.For the special case of one dimensional quadratics, Safran and Shamir (2019) proved an upper boundmatching the one we prove in this paper. Further, the paper conjectures that the proof can be extendedto the generic multidimensional case. However, the authors say that the main technical barrier for thisextension is that it requires a special case of a matrix-valued arithmetic-geometric mean inequality, whichhas only been conjectured to be true but not yet proven. The authors further conjecture that their proofcan be extended to general smooth and strongly convex functions, which turns out to not be true, as weshow in Corollary 1. On the other hand, we believe that our proof can be extended to the more generalfamily of strongly convex functions, where the Hessian is Lipschitz, similar to the the way HaoChen andSra (2018) extend their proof to that case.In addition to Assumptions 1-5 above, here we also assume the following: Assumption 6. F is a quadratic function F ( x ) = 12 x T Hx + b T x + c, where H is a positive semi-definite matrix. Note that this assumption is a little more general than the assumption that F is a sum of quadratics.Also, note that this assumption, in combination with the assumptions on strong convexity and Lipschitzgradients implies bounds on the minimum and maximum eigenvalues of the Hessian of F , that is, µI (cid:52) H (cid:52) LI, where I is the identity matrix and A (cid:52) B means that x T ( A − B ) x ≤ x . Theorem 1.
Under Assumptions 1-6, let the step size of SGDo be α = 8 log TT µ and the number of epochs be K ≥ L µ log T. Then, after T iterations SGDo achieves the following rate E [ (cid:107) x T − x ∗ (cid:107) ] = ˜ O (cid:18) T + n T (cid:19) , where ˜ O ( · ) hides logarithmic factors. The exact upper bound and the full proof of this Theorem are given in Appendix A, but we give aproof sketch in the next subsection.At this point, we would like to remark that the bound on epochs K ≥ L µ log T may be a bitsurprising as K and T are dependent. However, note that since T = nK , we can show that the boundon K above is satisfied if we set the number of epochs to be greater than C log n for some constant C . Furthermore, we note that the dependence of K on Lµ ( i.e. , the condition number of F ) is mostprobably not optimal. In particular both (Nagaraj et al., 2019) and (HaoChen and Sra, 2018) have abetter dependence on the condition number.The proof for Theorem 1 uses ideas from the works of HaoChen and Sra (2018) and Nagaraj et al.(2019). In particular, one of the central ideas in these two papers is that they aim to quantify the amount4f progress made by SGDo over a single epoch. Both analyses decompose the progress of the iterates inan epoch as n steps of full gradient descent plus some noise term.Similar to (HaoChen and Sra, 2018), we use the fact that the Hessian H of F is constant, which helpsus better estimate the value of gradients around the minimizer. In contrast to that work, we do notrequire all individual components f i to be quadratic, but rather the entire F to be a quadratic function.An important result proved by (Nagaraj et al., 2019) is that during an epoch, the iterates do not steeroff too far away from the starting point of the epoch. This allows one to obtain a reasonably good boundon the noise term, when one tries to approximate the stochastic gradient with the true gradient of F . Inour analysis, we prove a slightly different version of the same result using an iterate coupling argumentsimilar to the one in (Nagaraj et al., 2019).The analysis of (Nagaraj et al., 2019) relies on computing the Wasserstein distance between theunconditional distribution of iterates and the distribution of iterates given a function sampled duringan iteration. In our analysis, we use the same coupling, but we bypass the Wasserstein framework that(Nagaraj et al., 2019) suggests and directly obtain a bound on how far the coupled iterates move awayfrom each other during the course of an epoch. This results, in our view, to a somewhat simpler andshorter proof. Now we give an overview of the proof. As mentioned before, similar to the previous works, the key ideais to perform a tight analysis of the progress made during an epoch. This is captured by the followingLemma.
Lemma 1.
Let the SGDo step size be α = 4 l log TT µ and the total number of epochs be K ≥ L µ log T ,where l ≤ . Then for any epoch, E (cid:104) (cid:107) x j − x ∗ (cid:107) (cid:105) ≤ (cid:16) − nαµ (cid:17) (cid:107) x j − − x ∗ (cid:107) + 16 nα G L µ − + 20 n α G L . (1)Given the result in Lemma 1, proving Theorem 1 is a simple exercise. To do so, we simply unroll therecursion (1) for K consecutive epochs. For ease of notation, define C := 16 G L µ − and C := 20 G L .Then, E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ (cid:16) − nαµ (cid:17) E (cid:2) (cid:107) x K − x ∗ (cid:107) (cid:3) + C nα + C n α ≤ (cid:16) − nαµ (cid:17) E (cid:2) (cid:107) x K − − x ∗ (cid:107) (cid:3) + ( C nα + C n α ) (cid:16) (cid:16) − nαµ (cid:17)(cid:17) ... ≤ (cid:16) − nαµ (cid:17) K +1 E (cid:2) (cid:107) x − x ∗ (cid:107) (cid:3) + ( C nα + C n α ) K (cid:88) j =1 (cid:16) − nαµ (cid:17) j − = (cid:16) − nαµ (cid:17) K +1 (cid:107) x − x ∗ (cid:107) + ( C nα + C n α ) K (cid:88) j =1 (cid:16) − nαµ (cid:17) j − . We can now use the fact that (1 − x ) ≤ e − x and (cid:0) − nαµ (cid:1) ≤
1, to get the following bound: E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ e − nαµ K (cid:107) x − x ∗ (cid:107) + ( C nα + C n α ) K. By setting the step size to be α = l log TT µ and noting that T = nK , we get that E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ e − n l log TTµ µ K (cid:107) x − x ∗ (cid:107) + ( nα C + α n C ) K = e − l log T (cid:107) x − x ∗ (cid:107) + ˜ O (cid:18) T + n T (cid:19) = (cid:107) x − x ∗ (cid:107) T l + ˜ O (cid:18) T + n T (cid:19) . Noting that (cid:107) x − x ∗ (cid:107) ≤ D and choosing l = 2 gives us the result of Theorem 1.5 .2 With- and without-replacement stochastic gradients are close One of the key lemmas in (Nagaraj et al., 2019) establishes that once SGDo iterates get close enough tothe global minimizer x ∗ , then any iterate at any time during an epoch x ji stays close to the iterate at thebeginning of that epoch. To be more precise, the lemma we refer to is the following. Lemma 2. [Nagaraj et al. (2019, Lemma 5)]
Under the assumptions of Theorem 1, E [ (cid:107) x ji − x j (cid:107) ] ≤ iα G + 2 iα ( F ( x j ) − F ( x ∗ )) . We would like to note that Lemma 2 is slightly different from the one in (Nagaraj et al., 2019), whichinstead uses E [ F ( x ji ) − F ( x ∗ )] rather than ( F ( x ji ) − F ( x ∗ )), but their proof can be adapted to obtain theversion written above. For the formal version of Lemma 2, please see Lemma 6 in the Appendix.Now, consider the case when the iterates are very close to the optimum and hence F ( x ji ) − F ( x ∗ ) ≈ E [ (cid:107) x ji − x j (cid:107) ] does not grow quadratically in i which would genericallyhappen for i gradient steps, but it rather grows linearly in i . This is an important and useful fact forSGDo: it shows that all iterates within an epoch remain close to x j .Hence, since the iterates of SGDo do not move too much during an epoch, then the gradients computedthroughout the epoch at points x ji should be well approximated by gradients computed on the x j iterate.Roughly, this translates to the following observation: the n gradient steps taken through a single epochare almost equal to n steps of full gradient descent computed at x j . This is in essence what allows SGDoto achieve better convergence than SGD - an epoch can be approximated by n steps of gradient descent.Now, let σ j represent the random permutation of the n functions f i during the j -th epoch. Thus, σ j ( i ) is the index of the function chosen at the i -th iteration of the j -th epoch. Proving Lemma 2 requiresproving that the function value of f σ j ( i ) ( x ji ), in expectation, is almost equal to F ( x ji ). In particular, weprove the following claim in our supplemental material. Claim 1. [Nagaraj et al. (2019, Lemma 4)] If α ≤ L , then for any epoch j and i -th ordered iterate duringthat epoch (cid:12)(cid:12)(cid:12)(cid:12) E (cid:104) F ( x ji ) − f σ j ( i ) ( x ji ) (cid:12)(cid:12)(cid:12) x j (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) ≤ αG . (2)This claim establishes that SGDo behaves almost like SGD with replacement, for which the followingis true: E [ f σ j ( i ) ( x ji )] = E [ F ( x ji )] . To prove this claim, (Nagaraj et al., 2019) consider the conditionaldistribution of iterates, given the current function index, that is x ji | σ i ( j ), and the unconditional distributionof the iterates x ji . Then, they prove that the absolute difference | E [ F ( x ji )] − E [ f σ j ( i ) ( x ji )] | can be upperbounded by the Wasserstein distance between these two distributions. To further upper bound theWasserstein distance, they propose a coupling between the two distributions. To prove our slightlydifferent version of Lemma 2, we proved (2) without using this Wasserstein framework. Instead, we usethe same coupling argument to directly get a bound on (2). Below we explain the coupling and provide ashort intuition.Consider the conditional distribution of σ j | σ j ( i ) = s . If we take the distribution of σ | σ ( i ) = 1, wecan generate the support of σ j | σ j ( i ) = s by taking all permutations σ | σ ( i ) = 1 and by swapping 1 and s among them. This is essentially a coupling between these two distributions, proposed in (Nagaraj et al.,2019). Now, if we use this coupling to convert a permutation in σ | σ ( i ) = 1 to a permutation σ | σ ( i ) = s ,the corresponding x i | σ ( i ) = 1 and x i | σ ( i ) = s would be within a distance of 2 αG . 
This distance bound isLemma 2 of (Nagaraj et al., 2019).We can now use such distance bound, and let v (1 ,s ) denote a (random) vector whose norm is less than2 αG . Then, E (cid:2) f σ ( i ) ( x i ) (cid:3) = 1 n n (cid:88) s =1 E (cid:2) f σ ( i ) ( x i ) | σ ( i ) = s (cid:3) = 1 n n (cid:88) s =1 E [ f s ( x i ) | σ ( i ) = s ]= 1 n n (cid:88) s =1 E (cid:2) f s (cid:0) x i + v (1 ,s ) (cid:1) | σ ( i ) = 1 (cid:3) ≤ n n (cid:88) s =1 E (cid:2) f s ( x i ) + (2 αG ) | σ ( i ) = 1 (cid:3) = E [ F ( x i ) | σ ( i ) = 1] + 2 αG . s ∈ { , . . . , n } : E (cid:2) f σ ( i ) ( x i ) (cid:3) ≤ E [ F ( x i ) | σ ( i ) = s ] + 2 αG . Therefore, E (cid:2) f σ ( i ) ( x i ) (cid:3) ≤ n n (cid:88) s =1 E [ F ( x i ) | σ ( i ) = s ] + 2 αG ≤ E [ F ( x i )] + 2 αG . Similarly, we can prove that E (cid:2) f σ ( i ) ( x i ) (cid:3) ≥ E [ F ( x i )] − αG . Combining these two results we obtain (2). The detailed proof of Claim 1 is provided in the appendix.The full proof of Theorem 1 requires some more nuanced bounding derivations, and the completedetails can be found in Appendix A.
In the previous section, we establish that for quadratic functions the Ω (cid:16) T + n T (cid:17) lower-bound by Safranand Shamir (2019) is essentially tight. This still leaves open the possibility that a tighter lower boundmay exist for strongly convex functions that are not quadratic. After all, the best convergence rate knownfor strongly convex functions that are sums of smooth functions is of the order of n/T .Indeed, in this section, we show that the convergence rate of O (cid:0) nT (cid:1) established by Nagaraj et al.(2019) is tight.For a certain constant C (see Appendix B for the formal version of the theorem), we show the followingtheorem Theorem 2.
There exists a strongly convex function F that is the sum of n smooth convex functions,such that for any step size T ≤ α ≤ Cn , the error after T total iterations of SGDo satisfies E [ (cid:107) x T − x ∗ (cid:107) ] = Ω (cid:16) nT (cid:17) . The full proof of this theorem is provided in Appendix B, but we give an intuitive explanation of theproof later in this section.Note that the theorem above establishes the existence of a function for which SGDo converges atrate Ω( nT ), but only for the step size range T ≤ α ≤ Cn . This is the range of the most interest becausemost of the upper bounds and convergence guarantees of SGDo (and SGD) work in this step size range.However, it would still be desirable to get a function on which SGDo converges at rate Ω( n/T ) for allstep sizes. Such a function would be difficult to optimize, no matter how much we tune the step size.Indeed, we show that based on Theorem 2, we can create such a function. To do that, we use a functionproposed by Safran and Shamir (2019, Proposition 1), which converges slowly outside of the step sizerange T ≤ α ≤ Cn .Safran and Shamir (2019) show that there exists a strongly convex function F , which is the sumof n quadratics, such that for step size α ≤ T , the expected error satisfies E [ (cid:107) x T − x ∗ (cid:107) ] = Ω(1) (seethe proof of Proposition 1, pg. 10-12 in their paper). Further, for the same function F , the proof ofthat proposition can be adapted directly to get E [ (cid:107) x T − x ∗ (cid:107) ] = Ω (cid:0) n (cid:1) for any step size α ≥ Cn , for anyconstant C .Using this function F and the function F from Theorem 2, we can create a function on which SGDoconverges at rate Ω( nT ) for all step sizes. Corollary 1.
There exists a 2-Dimensional strongly convex function that is the sum of n smooth convexfunctions, such that for any α > E [ (cid:107) x T − x ∗ (cid:107) ] = Ω (cid:16) nT (cid:17) . E [ (cid:107) x T − x ∗ (cid:107) ] = Ω (cid:0) nT (cid:1) . Next, we try to explain thefunction construction and proof technique behind Theorem 2. The construction of the lower bound issimilar to the one used by Safran and Shamir (2019). The difference is that the prior work considersquadratic functions, while we consider a slightly modified piece-wise quadratic function.Specifically, we construct the following function F ( x ) = n (cid:80) ni =1 f i ( x ) as F ( x ) = x , if x ≥ Lx , if x < , where n is an even number. Of the n component functions f i , half of them are defined as follows:if i ≤ n f i ( x ) = x Gx , if x ≥ Lx Gx , if x < , and the other half of the functions are defined as follows:if i > n f i ( x ) = x − Gx , if x ≥ Lx − Gx , if x < . For our construction, we set L to be a big enough positive constant. See for example, Fig. 1. −15 −10 −5 0 5 10 15020040060080010001200 F ( x ) f ( x ) f ( x ) Figure 1: Lower bound construction. Note that f ( x ) represents the component functions of the first kind, and f ( x ) represents the component functions of the second kind, and F ( x ) represents the overall function. Next we ought to verify that this function abides to Assumptions 1-5. Note that Assumption 1 issatisfied, as it can be seen that functions f i ’s are all continuous and convex. Next, we need to show thatAssumption 2 holds, that is F is strongly convex. We will show that this is true by proving the followingequivalent definition of strong convexity: a function f is µ -strongly convex if g ( x ) := f ( x ) − µ (cid:107) x (cid:107) isconvex. We can see that this is true for F with µ = 1.In the proof of Theorem 2, we initialize at the origin. In that case, in the proof we also prove thatAssumptions 3 and 4 hold. In particular, we show that the iterates do not go outside of a bounded8omain, and inside this domain, the gradient is bounded by G . Finally, let us focus on Assumption 5. Toprove that these functions have Lipschitz gradients, we need to show ∀ x, y : |∇ f i ( x ) − ∇ f i ( y ) | ≤ L | x − y | . If xy ≥
0, that is x and y lie on the same side of the origin, then this is simple to see because they bothlie on the same quadratic. Otherwise WLOG, assume x < y >
0. Also, assume WLOG that f i isfunction of the first kind, that is i ≤ n and hence the linear term in f i ( x ) is Gx . Then, |∇ f i ( x ) − ∇ f i ( y ) | = (cid:12)(cid:12)(cid:12)(cid:12) Lx + G − y − G (cid:12)(cid:12)(cid:12)(cid:12) = y − Lx ≤ Ly − Lx ≤ L | y − x | . Overall, the difficulty in the analysis comes from the fact that unlike the functions considered bySafran and Shamir (2019), our functions are piece-wise quadratics.Let us initialize at x = 0 (the minimizer). We will show that in expectation, at the end of K epochs,the iterate would be at a certain distance (in expectation). Note that the progress made over an epoch isjust the sum of gradients (multiplied by − α ) over the epoch: x jn − x j = − α n (cid:88) i =1 ∇ f σ j ( i ) ( x i )where σ j ( i ) represents the index of the i -th function chosen in the j -th epoch. Next, note that thegradients from the linear components ± G x are equal to ± G , that is they are constant. Thus, they willcancel out over an epoch.However the gradients from the quadratic components do not cancel out, and in fact that part of thegradient will not even be unbiased, in the sense that if x t ≥
0, the gradient at x t from the quadraticcomponent x will be less in magnitude than the gradient from the quadratic component Lx at − x t .The idea is to now ensure that if an epoch starts off near the minimizer, then the iterates spend acertain amount of time in the x < Lx ,which makes the sum of the gradients at the end of the epoch biased away from the minimizer.To ensure that the iterates spend some time in the x < x ≈
0, the gradient contribution of the quadratic terms would be small, and the dominating componentduring an epoch would come from the linear terms. What this means is that in the middle of an epoch, itis the linear terms which contribute the most towards the “iterate movement”, even though at the end ofthat epoch their gradients get cancelled out and what remains is the contribution of the quadratic terms.Then, to obtain a lower bound matching the upper bound given by Nagaraj et al. (2019), observe thatit is indeed this contribution of the linear terms that we require to get a tight bound on. This is because,the upper bound from the aforementioned work was also in fact directly dependent on the movementof iterates away from the minimizer during an epoch, caused by the stochasticity in the gradients (cf.Lemma 5 of Nagaraj et al. (2019)). We give below the informal version of the main lemma for the proof:
Lemma 3. [Informal]
Let ( σ , . . . , σ n ) be a random permutation of { +1 , . . . , +1 (cid:124) (cid:123)(cid:122) (cid:125) n times , − , . . . , − (cid:124) (cid:123)(cid:122) (cid:125) n times } . Then for i < n/ , E (cid:104)(cid:12)(cid:12)(cid:12)(cid:80) ij =1 σ j (cid:12)(cid:12)(cid:12)(cid:105) ≥ C √ i, where C is a universal constant. Please see Lemma 12 in Appendix B for the formal version of this lemma.For the purpose of intuition, ignore the contribution of gradients from the quadratic terms. Then, thelemma above says that during an epoch, the gradients from the linear terms would move the iteratesapproximately Ω (cid:0) α √ n G (cid:1) away from the minimizer (after we multiply by the step size α ).9his implies that in the middle of an epoch, with (almost) probability 1 / x ≈ − Ω (cid:0) α √ n G (cid:1) and with (almost) probability 1 / x ≈ Ω (cid:0) α √ n G (cid:1) . Hence,over the epoch, the accumulated quadratic gradients multiplied by the step size would look like n (cid:88) i =1 E (cid:104) − α ( L x ji < + x ji ≥ ) x ji (cid:105) ≈ − α n (cid:88) i =1 (cid:18) L Ω (cid:18) − α √ n G (cid:19) + 12 Ω (cid:18) α √ n G (cid:19)(cid:19) = Ω( Lα n √ n ) . If this happens for K epochs, we get that the accumulated error would be Ω( Lα n √ nK ) = Ω (cid:16) √ nK (cid:17) for α ∈ [1 /nK, /n ]. Since E [ | x T | ] ≥ / √ nK , we know that E [ | x T − | ] ≥ /nK = n/T . Since 0 is theminimizer of our function in this setting, we have constructed a case where SGDo achieves error E [ | x T − x ∗ | ] ≥ n/T . This completes the sketch of the proof and the complete proof of Theorem 2 is given in Appendix B.
30 50 70 90 110 130 150 170 190
Number of epochs K −2 −1 | x T − x * | K K K
30 90 150
Number of component functions n −1 | x T − x * | (log T ) / n n n Figure 2: Running SGDo on the function F used in our lower bound (Theorem 2) confirms that the rate ofconvergence of SGDo on this function is indeed Ω( nK ) = Ω( nT ). The curves are normalized so that they beginat the same point. To verify our lower bound of Theorem 2, we ran SGDo on the function described in Eq. (5) with L = 4. The step size regimes that were considered were α = T , TT , TT , TT , and n . The plot for α = TT is shown in Figure 2. The plots for the other step size regimes are provided in Appendix D.The step size regimes considered cover the range specified in the statement of Theorem 2. Looking atFigure 2 (and the figures in Appendix D for the other step size regimes), the dependence of the convergencerate on K indeed looks exactly like 1 /K . However, looking at the figures for the dependence of theconvergence rate on n , we see that they look like (log T ) n . This suggests that the tightest possible lowerbound for SGDo with constant step size on strongly convex smooth functions might have a logarithmicterm in the numerator. Next, we explain the details of the experiment.Consider any one of the step size regimes specified above, say α = TT . For this regime, we ran twoexperiments:1. We fix n = 500 and vary K from 30 to 200, and2. we fix K = 500 and vary n from 30 to 200.Consider the first experiment, where n = 500 and K is varied. For each value of K , say K = 50, we set α = TT = nK ) nK = ∗ ∗ and ran SGDo with this constant step size α on the sum of n = 500functions for K = 50 epochs, and the final error was recorded. This was repeated 1000 times to reducevariance. The final mean error after these 1000 runs gave us one point, which we plotted for K = 50 onthe top subfigure of Figure 2. Repeating the same for all values of K from 30 to 200 gave us the top10ubfigure of Figure 2. The same procedure was followed for the second experiment where we fix K andvary n , and that gave us the bottom subfigure of Figure 2. The optimization was initialized at the origin,that is x = 0. These pairs of experiments were performed for all values of step size regimes in the list( T , TT , TT , TT , and n ).Now, we justify the ranges of n and K considered in our experiments. We wanted to verify that thelower bound on the error of SGDo is indeedΩ (cid:16) nT (cid:17) = Ω (cid:18) nK (cid:19) [Theorem 2]instead of the previously known best lower boundΩ (cid:18) T + n T (cid:19) = Ω (cid:18) nK (cid:18) n + 1 K (cid:19)(cid:19) . [Safran and Shamir (2019)]Looking at the RHS of the two equations above, we can see that the dependence of the two lowerbounds on K differs only when n (cid:29) K and the dependence on n differs only when K (cid:29) n . Thus forexample, when we wanted to check dependence on K , we set n = 500 which was bigger than every K inthe range 30 to 200.The code for these experiments is available at https://github.com/shashankrajput/SGDo . Theorem 2 hints that for faster convergence rates in the epoch based random shuffling SGD, we wouldnot just require smooth and strongly convex functions, but also potentially require that the Hessians ofsuch functions to be Lipschitz.We conjecture that Hessian Lipschitzness is sufficient to get the convergence rate of Theorem 1. Wethink that this is interesting, because the optimal rates for both SGD with replacement and vanillagradient descent only require strong convexity and gradient smoothness. 
However, here we prove that anoptimal rate for SGDo requires the function to be quadratic as well (or at the very least have a LipschitzHessian), and SGDo seems to converge slower if the Hessian is not Lipschitz.
SGD without replacement has long puzzled researchers. From a practical point of view, it always seems tooutperform SGD with replacement, and is the algorithm of choice for training modern machine learningmodels. From a theoretical point of view, SGDo has resisted tight convergence analysis that establish itsperformance benefits. A recent wave of work established that indeed SGDo can be faster than SGD withreplacement sampling, however a gap still remained between the achievable rates and the best knownlower bounds.In this paper we settle the optimal performance of SGD without replacement for functions that arequadratics, and strongly convex functions that are sums of n smooth functions. Our results indicate thata possible improvement in convergence rates may require a fundamentally different step size rule andsignificantly different function assumptions.As future directions, we believe that it would be interesting to establish rates for variants of SGDothat do not re-permute the functions at every epoch. This is something that is common in practice,where a random permutation is only performed once every few epochs without a significant drop inperformance. Current theoretical bounds are inadequate to explain this phenomenon, and a new theoreticalbreakthrough may be required to tackle it.We however believe that one of the strongest new theoretical insights introduced by (Nagaraj et al.,2019) and used in our analyses can be of significance in a potential attempt to analyze other variantsof SGDo as the one above. This insight is that of iterate coupling. That is the property that SGDoiterates are only mildly perturbed after swapping only two elements of a permutation. Such a property isreminiscent to that of algorithmic stability, and a deeper connection between that and iterate coupling isleft as a meaningful intellectual endeavor for future work.11 cknowledgements We would like to thank the ICML reviewers for their constructive feedback in improving the structureof the Appendix. The authors also attribute the motivation for Corollary 1 (see the paragraph afterTheorem 2) to the comments of Reviewer
References
L´eon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In
Proceedingsof the symposium on learning and data science, Paris , 2009.Benjamin Recht and Christopher R´e. Parallel stochastic gradient algorithms for large-scale matrixcompletion.
Mathematical Programming Computation , 5(2):201–226, 2013.Benjamin Recht and Christopher R´e. Beneath the valley of the noncommutative arithmetic-geometricmean inequality: conjectures, case-studies, and consequences. arXiv preprint arXiv:1202.4184 , 2012.S´ebastien Bubeck et al. Convex optimization: Algorithms and complexity.
Foundations and Trends R (cid:13) inMachine Learning , 8(3-4):231–357, 2015.Itay Safran and Ohad Shamir. How good is sgd with random shuffling? arXiv preprint arXiv:1908.00045v3 ,2019.Jeffery Z HaoChen and Suvrit Sra. Random shuffling beats sgd after finite epochs. arXiv preprintarXiv:1806.10077 , 2018.Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. Sgd without replacement: Sharper rates forgeneral smooth convex functions. In International Conference on Machine Learning , pages 4703–4711,2019.Mert G¨urb¨uzbalaban, Asu Ozdaglar, and PA Parrilo. Why random reshuffling beats stochastic gradientdescent.
Mathematical Programming , pages 1–36, 2015.Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. D. Lee, M. Sugiyama,U. V. Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Information Processing Systems29 , pages 46–54. Curran Associates, Inc., 2016.Ching-Pei Lee and Stephen J Wright. Random permutations fix a worst case for cyclic coordinate descent.
IMA Journal of Numerical Analysis , 39(3):1246–1275, 2019.Stephen J Wright and Ching-pei Lee. Analyzing random permutations for cyclic coordinate descent. arXiv preprint arXiv:1706.00908 , 2017.Ruoyu Sun and Yinyu Ye. Worst-case complexity of cyclic coordinate descent: O ( n ) gap with randomizedversion. Mathematical Programming , pages 1–34, 2019.Mert Gurbuzbalaban, Asuman Ozdaglar, Nuri Denizcan Vanli, and Stephen J Wright. Randomness andpermutations in coordinate descent methods.
Mathematical Programming , pages 1–28, 2019a.Ruoyu Sun, Zhi-Quan Luo, and Yinyu Ye. On the efficiency of random permutation for admm andcoordinate descent.
Mathematics of Operations Research , 2019.M Gurbuzbalaban, A Ozdaglar, and PA Parrilo. Convergence rate of incremental gradient and incrementalnewton methods.
SIAM Journal on Optimization , 29(4):2542–2565, 2019b.Qi Meng, Wei Chen, Yue Wang, Zhi-Ming Ma, and Tie-Yan Liu. Convergence analysis of distributedstochastic gradient descent with shuffling.
Neurocomputing , 337:46–57, 2019.Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H Sayed. Stochastic learning under random reshufflingwith constant step-sizes.
IEEE Transactions on Signal Processing , 67(2):474–489, 2018.12urii Nesterov.
Introductory lectures on convex optimization: A basic course , volume 87. Springer Science& Business Media, 2013.Cristinel Mortici. On gospers formula for the gamma function.
Journal of Mathematical Inequalities , 5,12 2011. doi: 10.7153/jmi-05-53. 13
Proof of Theorem 1
Theorem 1. (Formal version) Under Assumptions 1-6, let the step size of SGDo be α = 4 l log TT µ , where l ≤ and the number of epochs be K ≥ L µ log T. Then after T iterations SGDo, E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ (cid:107) x − x ∗ (cid:107) T l + 2 G L log TT µ + 2 G L n log TT µ . Proof.
The proof for upper bound uses the framework of HaoChen and Sra (2018), combined with somecrucial ideas from Nagaraj et al. (2019).In the block diagram below we connect the pieces needed to establish the proof. All lemmas andproofs follow.
Theorem 1Lemma 1 (Per Epoch Progress)
Lemma 4 (Upper bound on
𝔼[||𝑅|| ! ] ) Lemma 5 (Lower bound on ⟨𝑥 " − 𝑥 ∗ ,𝔼 𝑅 ⟩ ) Claim 3 (Upper bound on
𝔼[||𝐵|| ! ] ) Claim 2
𝔼 𝐴
Lemma 6
Iterates don’t steer off too far during epoch(Upper bound on
𝔼[||𝑥 $% − 𝑥 "% || ! ] ) Claim 1
With- and without- replacement SGD are close(Upper bound on
𝔼 |𝐹(𝑥 $% − 𝑓 & $ 𝑥 "% |] ) Figure 3: A dependency graph for the proof of Theorem 1, giving short descriptions of the components required.
The proof strategy is to quantify the progress made during each epoch and then simply unrolling thatfor K epochs. Towards that end, we have the following lemma Lemma 1.
Let the SGDo step size be α = 4 l log TT µ and the total number of epochs be K ≥ L µ log T ,where l ≤ . Then for any epoch, E (cid:104) (cid:107) x j − x ∗ (cid:107) (cid:105) ≤ (cid:16) − nαµ (cid:17) (cid:107) x j − − x ∗ (cid:107) + 16 nα G L µ − + 20 n α G L . (1)As mentioned before, we apply this lemma recursively to all epochs. For ease of notation, define14 := 16 G L µ − and C := 20 G L . Then, E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ (cid:16) − nαµ (cid:17) E (cid:2) (cid:107) x K − x ∗ (cid:107) (cid:3) + C nα + C n α ≤ (cid:16) − nαµ (cid:17) E (cid:2) (cid:107) x K − − x ∗ (cid:107) (cid:3) + ( C nα + C n α ) (cid:16) (cid:16) − nαµ (cid:17)(cid:17) ... ≤ (cid:16) − nαµ (cid:17) K +1 E (cid:2) (cid:107) x − x ∗ (cid:107) (cid:3) + ( C nα + C n α ) K (cid:88) j =1 (cid:16) − nαµ (cid:17) j − = (cid:16) − nαµ (cid:17) K +1 (cid:107) x − x ∗ (cid:107) + ( C nα + C n α ) K (cid:88) j =1 (cid:16) − nαµ (cid:17) j − . We can now use the fact that (1 − x ) ≤ e − x and (cid:0) − nαµ (cid:1) ≤
1, to get the following bound: E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ e − nαµ K (cid:107) x − x ∗ (cid:107) + ( C nα + C n α ) K. By setting the stepsize to be α = TT µ and noting that T = nK , we get that E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ e − n l log TTµ µ K (cid:107) x − x ∗ (cid:107) + ( nα C + α n C ) K = e − l log T (cid:107) x − x ∗ (cid:107) + 2 C log TT µ + 2 C n log TT µ . Substituting the values of C and C back into the inequality gives E (cid:2) (cid:107) x Kn − x ∗ (cid:107) (cid:3) ≤ (cid:107) x − x ∗ (cid:107) T l + 2 G L log TT µ + 2 G L n log TT µ . A.1 Proof of Lemma 1
Proof.
Throughout this proof we will be working inside an epoch, so we skip using the super script j in x ji which denotes the j -th epoch. Thus in this proof, x refers to the iterate at beginning of that epoch.Let σ denote the permutation of [ n ] used in this epoch. Therefore at the i -th iteration of this epoch, wetake a descent step using the gradient of f σ ( i ) .Next we define the error term (same as the error term defined in (HaoChen and Sra, 2018)) R := n (cid:88) i =1 (cid:0) ∇ f σ ( i ) ( x i − ) − ∇ F ( x ) (cid:1) = n (cid:88) i =1 (cid:0) ∇ f σ ( i ) ( x i − ) − ∇ f σ ( i ) ( x ) (cid:1) . (3)Then, using the iterative relation for gradient descent, we get (cid:107) x n − x ∗ (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) x − n (cid:88) i =1 α ∇ f σ ( i ) ( x i − ) (cid:33) − x ∗ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:107) x − x ∗ (cid:107) − α (cid:42) x − x ∗ , n (cid:88) i =1 ∇ f σ ( i ) ( x i − ) (cid:43) + α (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 ∇ f σ ( i ) ( x i − ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:107) x − x ∗ (cid:107) − nα (cid:104) x − x ∗ , ∇ F ( x ) (cid:105) − α (cid:104) x − x ∗ , R (cid:105) + α (cid:107) n ∇ F ( x ) + R (cid:107) a ) ≤ (cid:107) x − x ∗ (cid:107) − nα (cid:104) x − x ∗ , ∇ F ( x ) (cid:105) + 2 α n (cid:13)(cid:13) ∇ F ( x ) (cid:107) − α (cid:104) x − x ∗ , R (cid:105) + 2 α (cid:107) R (cid:13)(cid:13) b ) ≤(cid:107) x − x ∗ (cid:107) − nα (cid:18) µ (cid:107) x − x ∗ (cid:107) + 12 L (cid:107)∇ F ( x ) (cid:107) (cid:19) + 2 α n (cid:107)∇ F ( x ) (cid:107) + 2 α (cid:107) R (cid:107) − α (cid:104) x − x ∗ , R (cid:105) = (1 − nαµ ) (cid:107) x − x ∗ (cid:107) − nα (cid:18) L − nα (cid:19) (cid:107)∇ F ( x ) (cid:107) + 2 α (cid:107) R (cid:107) − α (cid:104) x − x ∗ , R (cid:105) . (4)The inequality ( a ) above comes from the fact that (cid:107) a + b (cid:107) ≤ (cid:107) a (cid:107) + 2 (cid:107) b (cid:107) . For inequality ( b ), we usedthe property of strong convexity given by Theorem 3 below:15 heorem 3. [Nesterov (2013, Theorem 2.1.11)] For an L -smooth and µ -strongly convex function F , (cid:104)∇ F ( x ) − ∇ F ( y ) , x − y (cid:105) ≥ µ (cid:107) x − y (cid:107) + 12 L (cid:107)∇ F ( x ) − ∇ F ( y ) (cid:107) NOTE: We have slightly modified the original theorem by using the fact that L ≥ µ . Thus, to upper bound the expected value of (cid:107) x n − x ∗ (cid:107) , Ineq. (4) says that we need to control theexpected magnitude of the error term, which is E [2 α (cid:107) R (cid:107) ]; and its expected alignment with ( x − x ∗ ),which is E [ − α (cid:104) x − x ∗ , R (cid:105) ]. To achieve that we introduce the following two lemmas. Lemma 4. If α ≤ / nL , then the magnitude of the error is bounded above, in expectation: E [ (cid:107) R (cid:107) ] ≤ L n (cid:107) x − x ∗ (cid:107) + 5 n α L G . Lemma 5. If α ≤ / nL , then the error term’s alignment with x − x ∗ is bounded below, in expectation: (cid:104) x − x ∗ , E [ R ] (cid:105) ≥ − αn (cid:107)∇ F ( x ) (cid:107) − (cid:18) nµ α n L µ (cid:19) (cid:107) x − x ∗ (cid:107) − α L G n µ − nα G L µ . For Lemma 4 and Lemma 5, we will show later (see Ineq. (7)) that our choice of parameters ensure α ≤ / nL .Substituting the inequalities from these two lemmas into Ineq. 
(4), we get that E [ (cid:107) x n − x ∗ (cid:107) ] ≤ (1 − nαµ ) (cid:107) x − x ∗ (cid:107) − nα (cid:18) L − nα (cid:19) (cid:107)∇ F ( x ) (cid:107) + 2 α ( L n (cid:107) x − x ∗ (cid:107) + 5 n α L G ) − α (cid:18) − αn (cid:107)∇ F ( x ) (cid:107) − (cid:18) nµ α n L µ (cid:19) (cid:107) x − x ∗ (cid:107) − α L G n µ − nα G L µ (cid:19) = (cid:18) − nαµ + 2 α L n + αµn n L α µ (cid:19) (cid:107) x − x ∗ (cid:107) − nα (cid:18) L − nα (cid:19) (cid:107)∇ F ( x ) (cid:107) + 10 n α L G + 20 α L G n µ + 16 nα G L µ = (cid:16) − nαµ α L n + 4 µ − n L α (cid:17) (cid:107) x − x ∗ (cid:107) − nα (cid:18) L − nα (cid:19) (cid:107)∇ F ( x ) (cid:107) + 10 n α L G + 20 µ − α L G n + 16 nα G L µ − . (5)We will prove the following inequalities shortly nαµ − α L n − µ − n L α c ) ≥ L − αn ( d ) ≥ n α L G e ) ≥ µ − α L G n . Finally, using ( c ), ( d ) and ( e ) in Ineq. (5), we get E [ (cid:107) x n − x ∗ (cid:107) ] ≤ (cid:16) − nαµ (cid:17) (cid:107) x − x ∗ (cid:107) + 20 n α L G + 16 nα G L µ − . This completes the proof. The only thing left is to prove the inequalities ( c ), ( d ) and ( e ), which we’ll donext.( c ) and ( d ): It can be shown that 2 α L n ≥ µ − n L α (See ( e ) below). So to prove ( c ), it issufficient to show that nαµ ≥ α L n . α = l log TT µ ≤ TT µ , T = nK , and K ≥ L µ log T . Then, α ≤ TT µ = 8 log
TnKµ ≤ µ log T nL µ log T = µ nL . (6)This is equivalent to nαµ ≥ α L n . Thus, we have proven ( c ). To prove ( d ), we continue on the theseries of inequalities: α ≤ µ nL ≤ nL ≤ nL . (7)This proves ( d ).( e ): α ≤ µ L n [Using Ineq. (6)] ≤ µ L n = ⇒ n α L G ≥ µ − α L G n . A.2 Proof of Lemma 4
Proof.
Throughout this proof we will be working inside an epoch, so we skip using the super script j in x ji which denotes the j -th epoch. Thus in this proof, x refers to the iterate at beginning of that epoch.Let σ denote the permutation of [ n ] used in this epoch. Therefore at the i -th iteration of this epoch, wetake a descent step using the gradient of f σ ( i ) . E (cid:2) (cid:107) R (cid:107) (cid:3) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) ∇ f σ ( i ) ( x i − ) − ∇ f σ ( i ) ( x ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ E (cid:32) n (cid:88) i =1 (cid:13)(cid:13) ∇ f σ ( i ) ( x i − ) − ∇ f σ ( i ) ( x ) (cid:13)(cid:13)(cid:33) . [Triangle inequality]Now, if we can bound (cid:107) x i − − x (cid:107) , then we can also bound (cid:107)∇ f σ ( i ) ( x i − ) − ∇ f σ ( i ) ( x ) (cid:107) using gradientLipschitzness. To bound (cid:107) x i − − x (cid:107) , we use Lemma 6 below which says that if an epoch starts close tothe minimizer x ∗ , then during that epoch, the iterates do now wander off too far away from the beginningof that epoch. Lemma 6. [Nagaraj et al. (2019, Lemma 5), but proved slightly differently] If α ≤ /L then, E [ (cid:107) x ji − x j (cid:107) ] ≤ E [ (cid:107) x ji − x j (cid:107) ] ≤ iα G + 2 iα ( F ( x j ) − F ( x ∗ )) ≤ iα G + iαL (cid:107) x j − x ∗ (cid:107) . E (cid:34)(cid:18) n (cid:88) i =1 (cid:107)∇ f σ ( i ) ( x i − ) − ∇ f σ ( i ) ( x ) (cid:107) (cid:19) (cid:35) ≤ L E (cid:32) n (cid:88) i =1 (cid:107) x i − − x (cid:107) (cid:33) [Gradient Lipschitzness]= L E n (cid:88) i =1 n (cid:88) j =1 (cid:107) x i − − x (cid:107) (cid:107) x j − − x (cid:107) = L n (cid:88) i =1 n (cid:88) j =1 E [ (cid:107) x i − − x (cid:107) (cid:107) x j − − x (cid:107) ] ≤ L n (cid:88) i =1 n (cid:88) j =1 (cid:114) E (cid:104) (cid:107) x i − − x (cid:107) (cid:105)(cid:114) E (cid:104) (cid:107) x j − − x (cid:107) (cid:105) [Cauchy-Scwartz inequality] ≤ L n (2 nαL (cid:107) x − x ∗ (cid:107) + 5 nα G ) [Using Lemma 6] ≤ L n (cid:107) x − x ∗ (cid:107) + 5 n α G L , (8)where we used the assumption that α ≤ / nL in the last step. A.3 Proof of Lemma 5
Proof.
Throughout this proof we will be working inside an epoch, so we skip using the super script j in x ji which denotes the j -th epoch. Thus in this proof, x refers to the iterate at beginning of that epoch.Let σ denote the permutation of [ n ] used in this epoch. Therefore at the i -th iteration of this epoch, wetake a descent step using the gradient of f σ ( i ) . R = n (cid:88) i =1 (cid:2) ∇ f σ ( i ) ( x i − ) − ∇ f σ ( i ) ( x ) (cid:3) = n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( i ) ( x ) [By the definition of SGDo iterations: x i +1 = x i − α ∇ f σ ( i ) ( x i )]= n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( i ) ( x )+ ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) − ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) − ∇ f σ ( i ) ( x ) + n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = A + B, (9)where A := n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) − ∇ f σ ( i ) ( x ) , and B := n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) . and B are the same terms as the ones defined in HaoChen and Sra (2018). The difference in ouranalysis and HaoChen and Sra (2018) is that we get tighter bounds on these terms.In the following, we use u to denote a random vector with norm less than or equal to 1. Also, assumethat H is the Hessian of the quadratic function F . Claim 2. E [ A ] = (2 nαGL ) u − α n ( n − H ∇ F ( x ) . Claim 3. If α ≤ / nL , then (cid:107) E [ B ] (cid:107) ≤ n L α (cid:107) x − x ∗ (cid:107) + 5 n L α G . Using Eq. (9) and Claim 2: (cid:104) x − x ∗ , E [ R ] (cid:105) = (cid:104) x − x ∗ , E [ A ] + E [ B ] (cid:105) = (cid:104) x − x ∗ , E [ A ] (cid:105) + (cid:104) x − x ∗ , E [ B ] (cid:105) = − α n ( n − (cid:104) x − x ∗ , H ∇ F ( x ) (cid:105) + (cid:104) (2 nαGL ) u, x − x ∗ (cid:105) + (cid:104) x − x ∗ , E [ B ] (cid:105) = − α n ( n − (cid:107)∇ F ( x ) (cid:107) + (cid:104) (2 nαGL ) u, x − x ∗ (cid:105) + (cid:104) x − x ∗ , E [ B ] (cid:105) . (10)For the middle term in Eq. (10), we use Cauchy-Schwarz inequality and the AM-GM inequality asfollows (cid:104) (2 nαGL ) u, x − x ∗ (cid:105) ≥ −(cid:107) (2 nαGL ) u (cid:107)(cid:107) x − x ∗ (cid:107)≥ − (cid:20) λ (cid:107) x − x ∗ (cid:107) + 12 λ (2 nαGL ) (cid:21) , (11)where λ is a positive number.We bound the last term in Eq. (10) similarly, (cid:104) x − x ∗ , E [ B ] (cid:105) ≥ −(cid:107) x − x ∗ (cid:107)(cid:107) E [ B ] (cid:107)≥ − (cid:20) λ (cid:107) x − x ∗ (cid:107) + 12 λ (cid:107) E [ B ] (cid:107) (cid:21) . (12)Setting λ = µn and continuing on from Eq. (10), we get (cid:104) x − x ∗ , E [ R ] (cid:105) = − α n ( n − (cid:107)∇ F ( x ) (cid:107) + (cid:104) (2 nαGL ) u, x − x ∗ (cid:105) + (cid:104) x − x ∗ , E [ B ] (cid:105)≥ − αn (cid:107)∇ F ( x ) (cid:107) − (cid:20) λ (cid:107) x − x ∗ (cid:107) + 12 λ (2 nαGL ) (cid:21) − (cid:20) λ (cid:107) x − x ∗ (cid:107) + 12 λ (cid:107) E [ B ] (cid:107) (cid:21) [Using Ineq. (11) and (12)]= − αn (cid:107)∇ F ( x ) (cid:107) − µn (cid:107) x − x ∗ (cid:107) − µn (2 nαGL ) − µn (cid:107) E [ B ] (cid:107) [Using the value of λ = µn ] ≥ − αn (cid:107)∇ F ( x ) (cid:107) − (cid:16) nµ α n µ − L (cid:17) (cid:107) x − x ∗ (cid:107) − µ − α L G n − nα G L µ − . [Using Claim 3] A.3.1 Proof of Claim 2
Proof.
The proof strategy of this claim is to use the fact that the Hessian of a quadratic function isconstant, and thus we can use it calculate exactly the gradient difference of a quadratic function. In this19roof, we will use u to denote a random variable with norm at most 1. E [ A ] = E n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ t ( j ) ( x ) − ∇ f σ ( i ) ( x ) = n (cid:88) i =1 E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) − ∇ f σ t ( i ) ( x ) ( f ) = n (cid:88) i =1 E ∇ F x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) + (2 αGL ) u − ∇ F ( x ) = (2 nαGL ) u + n (cid:88) i =1 E ∇ F x − α i − (cid:88) j =1 ∇ f σ t ( j ) ( x ) − ∇ F ( x ) = (2 nαGL ) u − α n (cid:88) i =1 E H i − (cid:88) j =1 ∇ f σ t ( j ) ( x ) [For quadratics, ∇ F ( x ) − ∇ F ( y ) = H ( x − y )]= (2 nαGL ) u − αH n (cid:88) i =1 i − (cid:88) j =1 E (cid:2) ∇ f σ t ( j ) ( x ) (cid:3) = (2 nαGL ) u − α n ( n − H ∇ F ( x ) . We used Claim 4 for ( f ). This concludes the proof of Claim 2, and the proof of Claim 4 is provided next. Claim 4. E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = E ∇ F x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) + (2 αGL ) u Proof of Claim 4
The proof for this claim uses the iterate coupling technique.
Proof. E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = 1 n n (cid:88) s =1 E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ ( i ) = s [Since ∀ i, j : P ( σ ( i ) = j ) = 1 /n ]= 1 n n (cid:88) s =1 E ∇ f s x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ ( i ) = s . Note that the distribution of ( σ | σ ( i ) = s ) can be created from the distribution of ( σ | σ ( i ) = 1) by takingall permutations from σ | σ ( i ) = 1 and swapping 1 and s in the permutations (this is essentially a couplingbetween the two distributions, same as the one in (Nagaraj et al., 2019)). This means that when we converta permutation from ( σ | σ ( i ) = 1) to ( σ | σ ( i ) = s ) in this manner, the sum (cid:80) i − j =1 ∇ f σ ( j ) ( x ) would have achange of at most one component f σ ( j ) before and after the swap, and furthermore the component will havea norm of at most G . Thus because of the swap (adding a component and removing one), the norm changesby at most 2 G . In the following, we use u p , v ( p,q ) , w ( p,q ) and u to denote random vectors with norms atmost 1. Hence, the sum ( (cid:80) i − j =1 ∇ f σ ( j ) ( x ) | σ ( i ) = s ) is equal to (2 Gv (1 ,s ) + (cid:80) i − j =1 ∇ f σ ( j ) ( x ) | σ ( i ) = 1).20hen, continuing on the sequence of equalities, E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = 1 n n (cid:88) s =1 E ∇ f s x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) | σ ( i ) = s = 1 n n (cid:88) s =1 E ∇ f s x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) + 2 αGv (1 ,s ) | σ ( i ) = 1 = 1 n n (cid:88) s =1 E ∇ f s x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) + (2 αGL ) w (1 ,s ) | σ ( i ) = 1 [Using gradient Lipschitzness]= E ∇ F x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) | σ ( i ) = 1 + (2 αGL ) u . Similarly, for any s : E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = E ∇ F x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) | σ ( i ) = s + (2 αGL ) u s . Hence, E ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) = 1 n n (cid:88) s =1 E ∇ F x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) | σ ( i ) = s + (2 αGL ) u s = E ∇ F x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) + (2 αGL ) u. .3.2 Proof of Claim 3 Proof. (cid:107) E [ B ] (cid:107) ≤ ( E [ (cid:107) B (cid:107) ]) [Jensen’s inequality]= E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ E n (cid:88) i =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( i ) x − α i − (cid:88) j =1 ∇ f σ ( j ) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) [Triangle inequality] ≤ E n (cid:88) i =1 Lα (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) i − (cid:88) j =1 ∇ f σ ( j ) ( x j − ) − ∇ f σ ( j ) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) [Gradient Lipschtizness] ≤ L α E n (cid:88) i =1 i − (cid:88) j =1 (cid:107)∇ f σ ( j ) ( x j − ) − ∇ f σ ( j ) ( x ) (cid:107) [Triangle inquality] ≤ L α E n (cid:88) i =1 n (cid:88) j =1 (cid:107)∇ f σ ( j ) ( x j − ) − ∇ f σ ( j ) ( x ) (cid:107) = n L α E n (cid:88) j =1 (cid:107)∇ f σ ( j ) ( x j − ) − ∇ f σ ( j ) ( x ) (cid:107) [The inner sum is independent of i ] ≤ n L α E n (cid:88) j =1 (cid:107)∇ f σ ( j ) ( x j − ) − ∇ f σ ( j ) ( x ) (cid:107) . 
Now, we have already proved an upper bound on $\mathbb{E}\big[\big(\sum_{j=1}^n \|\nabla f_{\sigma(j)}(x_{j-1}) - \nabla f_{\sigma(j)}(x)\|\big)^2\big]$ in the proof of Lemma 4 in Subsection A.2, under exactly the same assumptions as the ones for this claim (see Ineq. (8) in that proof). In particular, there we proved that
\[
\mathbb{E}\Big[\Big(\sum_{j=1}^{n}\big\|\nabla f_{\sigma(j)}(x_{j-1}) - \nabla f_{\sigma(j)}(x)\big\|\Big)^2\Big] \le L^2 n^2\,\|x - x^*\|^2 + 5 n^3 \alpha^2 G^2 L^2.
\]
Substituting this inequality in the chain of inequalities above gives us the result.
A.4 Proof of Lemma 6
We will only prove $\mathbb{E}[\|x_i^j - x_0^j\|^2] \le 5 i \alpha^2 G^2 + 2 i \alpha\,(F(x_0^j) - F(x^*))$; the rest follows from Jensen's inequality and gradient Lipschitzness. We will use the following claim, which is just Nagaraj et al. (2019, Lemma 4) proved in a slightly different way: we skip the Wasserstein framework but use the same coupling.

Claim 1. [Nagaraj et al. (2019, Lemma 4)] If $\alpha \le \frac{1}{L}$, then for any epoch $j$ and $i$-th ordered iterate during that epoch,
\[
\Big|\mathbb{E}\big[F(x_i^j) - f_{\sigma_j(i)}(x_i^j)\,\big|\,x_0^j\big]\Big| \le 2\alpha G^2. \qquad (2)
\]
The rest of the proof is identical to the proof in (Nagaraj et al., 2019). Because we work inside a single epoch in this proof, we skip the superscript in the notation.
\[
\begin{aligned}
\|x_{i+1} - x\|^2 &= \|x_i - x\|^2 - 2\alpha\langle\nabla f_{\sigma(i)}(x_i),\, x_i - x\rangle + \alpha^2\|\nabla f_{\sigma(i)}(x_i)\|^2 \\
&\le \|x_i - x\|^2 - 2\alpha\langle\nabla f_{\sigma(i)}(x_i),\, x_i - x\rangle + \alpha^2 G^2 \qquad \text{[Bounded gradients]} \\
&\le \|x_i - x\|^2 + 2\alpha\big(f_{\sigma(i)}(x) - f_{\sigma(i)}(x_i)\big) + \alpha^2 G^2 \qquad \text{[Convexity of } f_{\sigma(i)}\text{]}
\end{aligned}
\]
Taking expectation on both sides, conditioned on the epoch start $x$:
\[
\begin{aligned}
\mathbb{E}[\|x_{i+1} - x\|^2 \mid x] &\le \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha\,\mathbb{E}[f_{\sigma(i)}(x) - f_{\sigma(i)}(x_i) \mid x] + \alpha^2 G^2 \\
&= \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha F(x) + 2\alpha\,\mathbb{E}[-f_{\sigma(i)}(x_i) \mid x] + \alpha^2 G^2 \\
&= \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha F(x) + 2\alpha\,\mathbb{E}[F(x_i) - f_{\sigma(i)}(x_i) - F(x_i) \mid x] + \alpha^2 G^2 \\
&\le \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha F(x) + 2\alpha\,\mathbb{E}[F(x_i) - f_{\sigma(i)}(x_i) - F(x^*) \mid x] + \alpha^2 G^2 \qquad \text{[Since } x^* \text{ is the minimizer of } F\text{]} \\
&= \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha\big(F(x) - F(x^*)\big) + 2\alpha\,\mathbb{E}[F(x_i) - f_{\sigma(i)}(x_i) \mid x] + \alpha^2 G^2 \\
&\le \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha\big(F(x) - F(x^*)\big) + 2\alpha\,(2\alpha G^2) + \alpha^2 G^2 \qquad \text{[Using Claim 1]} \\
&= \mathbb{E}[\|x_i - x\|^2 \mid x] + 2\alpha\big(F(x) - F(x^*)\big) + 5\alpha^2 G^2.
\end{aligned}
\]
Unrolling this for $i$ iterations gives us the required result.
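As a quick numerical sanity check of Lemma 6 (a sketch on an assumed toy problem, not the construction used later in this paper), one can simulate SGDo epochs and compare $\mathbb{E}[\|x_i - x_0\|^2]$ with the bound $5i\alpha^2 G^2 + 2i\alpha(F(x_0) - F(x^*))$:

```python
import numpy as np

# Monte-Carlo sketch of Lemma 6 on a toy 1-D problem (assumed setup):
# f_p(x) = (x - b_p)^2 / 2 with mean(b) = 0, so F is 1-strongly convex,
# L = 1, x* = 0, and G below is a crude bound on the gradients encountered.
rng = np.random.default_rng(1)
n, alpha, trials = 50, 1e-3, 2000
b = rng.normal(size=n)
b -= b.mean()                                   # center so that x* = 0
G = np.abs(b).max() + 1.0                       # gradient bound near x* (assumption)
x0 = 0.3                                        # start of the epoch
F = lambda x: 0.5 * np.mean((x - b) ** 2)

for i in (10, 25, 50):
    sq = 0.0
    for _ in range(trials):
        x = x0
        for p in rng.permutation(n)[:i]:
            x -= alpha * (x - b[p])             # SGDo step on f_p
        sq += (x - x0) ** 2
    bound = 5 * i * alpha**2 * G**2 + 2 * i * alpha * (F(x0) - F(0.0))
    print(f"i={i:2d}: E|x_i-x_0|^2 = {sq/trials:.2e} <= {bound:.2e}")
```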
A.4.1 Proof of Claim 1

The proof of this claim also uses the iterate coupling technique, similar to the proof of Claim 4.
Proof.
As written in the claim statement, we assume that the start of the epoch, $x_0^j$, is given. In this proof we work inside an epoch, so we skip the superscript $j$ in $x_i^j$.
\[
\mathbb{E}\big[f_{\sigma(i)}(x_i)\big] = \frac{1}{n}\sum_{s=1}^n \mathbb{E}\big[f_{\sigma(i)}(x_i) \mid \sigma(i)=s\big] = \frac{1}{n}\sum_{s=1}^n \mathbb{E}\big[f_s(x_i) \mid \sigma(i)=s\big].
\]
Note that the distribution of $(\sigma \mid \sigma(i)=s)$ can be created from the distribution of $(\sigma \mid \sigma(i)=1)$ by taking all permutations from $(\sigma \mid \sigma(i)=1)$ and swapping $1$ and $s$ in them (this is essentially a coupling between the two distributions, the same as the one in (Nagaraj et al., 2019)). This means that when we convert a permutation from the distribution $(\sigma \mid \sigma(i)=1)$ to a permutation from the distribution $(\sigma \mid \sigma(i)=s)$ in this manner, the corresponding iterates $(x_i \mid \sigma(i)=1)$ and $(x_i \mid \sigma(i)=s)$ are within a distance of $2\alpha G$ of each other. Here is why this is true: let $x'$ be an iterate reached using a permutation $\sigma'$ from the distribution $(\sigma \mid \sigma(i)=1)$. Now, create $\sigma''$ by swapping $1$ and $s$ in $\sigma'$. Then $\sigma''(i)=s$, and hence $\sigma''$ lies in the distribution $(\sigma \mid \sigma(i)=s)$. Let $x''$ be the iterate reached using $\sigma''$. Then we can use Lemma 2 from (Nagaraj et al., 2019), adapted to our setting:

Lemma 7. [Nagaraj et al. (2019, Lemma 2)] Let $\alpha \le 1/L$. Then almost surely, $\forall i \in [n]$, $\|x' - x''\| \le 2G\alpha$.
In the following, we use $v_{(1,s)}$ to denote a random vector with norm at most 1, and $w_{(1,s)}$, $u_s$, and $u$ to denote random scalars with absolute value at most 1. Then, using Lemma 7, $(x_i \mid \sigma(i)=s)$ is equal to $(x_i + 2\alpha G\, v_{(1,s)} \mid \sigma(i)=1)$ (similar to what we did in the proof of Claim 4).
\[
\begin{aligned}
\mathbb{E}\big[f_{\sigma(i)}(x_i)\big] &= \frac{1}{n}\sum_{s=1}^n \mathbb{E}\big[f_s(x_i) \mid \sigma(i)=s\big] \\
&= \frac{1}{n}\sum_{s=1}^n \mathbb{E}\big[f_s\big(x_i + 2\alpha G\, v_{(1,s)}\big) \mid \sigma(i)=1\big] \\
&= \frac{1}{n}\sum_{s=1}^n \mathbb{E}\big[f_s(x_i) + (2\alpha G^2)\, w_{(1,s)} \mid \sigma(i)=1\big] \qquad \text{[Bounded gradients]} \\
&= \mathbb{E}[F(x_i) \mid \sigma(i)=1] + (2\alpha G^2)\, u_1.
\end{aligned}
\]
Similarly, for any $s$: $\mathbb{E}[f_{\sigma(i)}(x_i)] = \mathbb{E}[F(x_i) \mid \sigma(i)=s] + (2\alpha G^2)\, u_s$. Hence,
\[
\mathbb{E}\big[f_{\sigma(i)}(x_i)\big] = \frac{1}{n}\sum_{s=1}^n \big(\mathbb{E}[F(x_i) \mid \sigma(i)=s] + (2\alpha G^2)\, u_s\big) = \mathbb{E}[F(x_i)] + (2\alpha G^2)\, u.
\]
All the calculations in this proof assumed that the initial point of the epoch, $x_0^j$, is known. Thus, the equation above implies
\[
\Big|\mathbb{E}\big[F(x_i^j) - f_{\sigma(i)}(x_i^j)\,\big|\,x_0^j\big]\Big| \le 2\alpha G^2.
\]
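The coupling in Lemma 7 is easy to probe numerically; below is a sketch (on the same kind of assumed toy quadratics as before, with illustrative constants) that swaps two components of a permutation and measures how far apart the coupled iterates $x_i'$ and $x_i''$ can drift:

```python
import numpy as np

# Empirical check of Lemma 7's bound ||x' - x''|| <= 2*G*alpha on a toy
# 1-D problem (assumed setup): f_p(x) = (x - b_p)^2/2, alpha <= 1/L with L = 1.
rng = np.random.default_rng(2)
n, alpha = 40, 1.0 / 40
b = rng.normal(size=n)
b -= b.mean()
G = np.abs(b).max() + 1.0      # crude bound on gradient norms along the epoch (assumption)

def iterate(perm, steps):
    x = 0.0
    for p in perm[:steps]:
        x -= alpha * (x - b[p])
    return x

worst = 0.0
for _ in range(500):
    i = int(rng.integers(2, n + 1))                  # compare the i-th iterate x_i
    perm = rng.permutation(n)
    j0 = int(np.where(perm == 0)[0][0])
    perm[j0], perm[i - 1] = perm[i - 1], perm[j0]    # condition on sigma(i) = 0
    s = int(rng.integers(1, n))
    coupled = perm.copy()
    js = int(np.where(coupled == s)[0][0])
    coupled[i - 1], coupled[js] = coupled[js], coupled[i - 1]  # swap components 0 and s
    worst = max(worst, abs(iterate(perm, i - 1) - iterate(coupled, i - 1)))
print(f"max |x_i' - x_i''| = {worst:.4f}  vs  2*G*alpha = {2 * G * alpha:.4f}")
```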
B Proof of Theorem 2

Theorem 2 (Formal version). There exists an initialization point $x_0$ and a $1$-strongly convex function $F$ that is the mean of $n$ smooth convex functions which have $L$-Lipschitz gradients ($L \ge 2^{18}$), such that if $\frac{1}{nK} \le \alpha \le \frac{2^{-40}}{nL}$, $K \ge 2^{40} L$, $n \ge 64$, and $n$ is a multiple of 4, then
\[
\mathbb{E}\big[\|x_T - x^*\|^2\big] \ge 2^{-98}\,\frac{G^2 n}{T^2}.
\]

Remarks:
1. Because $\mu = 1$, the condition number is just $L$. Note that the lower bound provided above is independent of $L$.
2. The theorem and its proof have not been optimized with respect to the universal constants involved, and these can probably be improved substantially. In particular, the experiments in Subsection 5.1 use the same construction of functions with much better values of the constants.

Proof.
As with Theorem 1, we start with a block diagram of the components required to establish the lower bound.
[Figure 4: A dependency graph for the proof of Theorem 2, giving short descriptions of the components of the proof. Lemma 8 ($\mathbb{E}[x_0^j]$ increases when $|x_0^j|$ is small) rests on Lemma 11 and Lemma 12 (the key lemma, which gives $\mathbb{E}[x_i^j - x_0^j \mid \cdot\,] < 0$ with constant probability); Lemma 6 (iterates do not stray too far during an epoch, i.e., an upper bound on $\mathbb{E}[\|x_i^j - x_0^j\|^2]$) and Corollary 2 ($\mathbb{E}[|x_i^j - x_0^j|]$ is not too large) support these; Lemma 9 ($x_0^j$ does not decrease too much even when $|x_0^j|$ is big) feeds, together with Lemma 8, into Theorem 2.]
The function $F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ that we construct is of the form
\[
F(x) = \begin{cases} \dfrac{x^2}{2} & \text{if } x \ge 0, \\[2pt] \dfrac{L x^2}{2} & \text{if } x < 0, \end{cases}
\]
where, among the $n$ component functions $f_i$, half are defined as follows (we call these the functions of the first kind): if $i \le n/2$,
\[
f_i(x) = \begin{cases} \dfrac{x^2}{2} + \dfrac{G x}{2} & \text{if } x \ge 0, \\[2pt] \dfrac{L x^2}{2} + \dfrac{G x}{2} & \text{if } x < 0, \end{cases}
\]
and the other half (the functions of the second kind): if $i > n/2$,
\[
f_i(x) = \begin{cases} \dfrac{x^2}{2} - \dfrac{G x}{2} & \text{if } x \ge 0, \\[2pt] \dfrac{L x^2}{2} - \dfrac{G x}{2} & \text{if } x < 0. \end{cases}
\]
Let $\sigma_j$ be the permutation of the functions $f_i$ used in the $j$-th epoch. Then $\sigma_j$ can be represented by a permutation of the following multiset:
\[
\{\underbrace{+1, \dots, +1}_{n/2 \text{ times}},\ \underbrace{-1, \dots, -1}_{n/2 \text{ times}}\}.
\]
Accordingly, if $\sigma_i^j = +1$, we assume that in the $i$-th iteration of the $j$-th epoch a function of the first kind was sampled; similarly, if $\sigma_i^j = -1$, we assume that in the $i$-th iteration of the $j$-th epoch a function of the second kind was sampled.

[Figure 5: Lower bound construction. $f_1(x)$ represents the functions of the first kind, $f_2(x)$ represents the functions of the second kind, and $F(x)$ represents the overall function.]

Assumptions 1-5, except Assumptions 3 and 4, have already been verified for this function in the main text of this paper; thus, we only prove 3 and 4 here. We will initialize at $x_0 = 0$. Next, we prove that Assumption 4 is satisfied for our lower-bound construction. The minimizer of the functions of the first kind is at $x = -G/2L$, and the minimizer of the functions of the second kind is at $x = G/2$. Between these two values, the norm of the gradient of either function kind is always at most $G$. Thus, it is sufficient to show that the iterates stay between the two minimizers. The step size we have chosen is small enough (smaller than $1/L$) to ensure that the iterates do not go outside the two minimizers. To see why this is so, consider the case when one is doing gradient descent on $g(x) = a x^2/2$. If the step length satisfies $\alpha < 1/a$, then the next iterate $x_{t+1} = x_t - \alpha(a x_t) = x_t(1 - \alpha a)$ and the current iterate $x_t$ lie on the same side of the minimizer (which is $x^* = 0$); that is, the iterates never 'cross over' the minimizer. A similar logic implies that the iterates in our case stay within $[-G/2L,\, G/2]$, and hence the gradients stay bounded by $G$. A small numerical sketch of this construction and of the no-crossing property is given after Lemma 9 below.

Next, we have the following key lemma of the proof:

Lemma 8. If $n \ge 64$ is a multiple of 4, $L \ge 2^{18}$, $\alpha \le \frac{2^{-40}}{nL}$ and $|x_0^j| \le 2^{-10} G\alpha\sqrt{n}$, then
\[
\mathbb{E}\big[x_0^{j+1}\big] \ge x_0^j + 2^{-15}\, L G\alpha^2 n\sqrt{n}.
\]
The lemma says that, as long as $|x_0^j|$ is small, $\mathbb{E}[x_0^{j+1}]$ should keep increasing at rate $\Omega(\alpha^2 n\sqrt{n})$ until $|x_0^j| = \Omega(\alpha\sqrt{n})$. This is exactly what we need, because for the range of step lengths $\alpha$ specified in the theorem statement, $|x_0^j| = \Omega(\alpha\sqrt{n})$ is the claimed error lower bound. The rest of the proof is just making this intuition rigorous. The following helper lemma says that $\mathbb{E}[|x_0^{j+1}|]$ does not decrease by too much, even if $|x_0^j| > 2^{-10} G\alpha\sqrt{n}$.

Lemma 9. If $\alpha \le \frac{1}{nL}$, then
\[
\mathbb{E}\big[|x_0^{j+1}|\big] \ge |x_0^j|\,(1 - 2L\alpha n) - 2 L G\alpha^2 n\sqrt{n},
\]
and further, if $|x_0^j| > 2^{-10} G\alpha\sqrt{n}$, then
\[
\mathbb{E}\big[x_0^{j+1}\big] \ge x_0^j - 2^{12}\,|x_0^j|\, L\alpha n.
\]
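Here is the promised minimal numerical sketch of the construction (with illustrative values of $n$, $G$, $L$, $K$ and a step size in the prescribed regime; these are not the constants of the theorem). It implements the component gradients and checks that the iterates never leave $[-G/2L,\, G/2]$:

```python
import numpy as np

# Sketch of the lower-bound construction. The gradient of a component at x is
#   (L if x <= 0 else 1) * x + (G/2) * sigma,  sigma = +1 (first kind) / -1 (second kind).
n, G, L, K = 64, 1.0, 4.0, 100          # illustrative constants (assumptions)
alpha = 1.0 / (n * K)
signs = np.array([+1.0] * (n // 2) + [-1.0] * (n // 2))
rng = np.random.default_rng(3)

def grad(x, s):
    return (L if x <= 0 else 1.0) * x + 0.5 * G * s

x = 0.0
lo, hi = -G / (2 * L), G / 2            # minimizers of the two component kinds
inside = True
for _ in range(K):                      # K epochs of SGDo
    for s in rng.permutation(signs):
        x -= alpha * grad(x, s)
        inside &= (lo - 1e-12 <= x <= hi + 1e-12)
print(f"final iterate {x:+.3e}; stayed inside [-G/2L, G/2] throughout: {inside}")
```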
Next, we aim to use the two lemmas above to obtain lower bounds on the unconditioned expectation $\mathbb{E}[|x_0^{j+1}|]$. For this, we consider the following two cases.

Case 1: $\mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] > 2^{-29} G\alpha\sqrt{n}$. Then, decomposing the expectation into conditional expectations,
\[
\begin{aligned}
\mathbb{E}\big[|x_0^j|\big] &= \mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] + \mathbb{P}\big(|x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big] \\
&\ge \mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] \\
&> 2^{-29} G\alpha\sqrt{n}. \qquad (13)
\end{aligned}
\]

Case 2: Otherwise, $\mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] \le 2^{-29} G\alpha\sqrt{n}$. Again decomposing the expectation into conditional expectations,
\[
\begin{aligned}
\mathbb{E}\big[x_0^{j+1}\big] &= \mathbb{P}\big(|x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[x_0^{j+1} \,\big|\, |x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big] + \mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[x_0^{j+1} \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] \\
&\ge \mathbb{P}\big(|x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big)\Big(\mathbb{E}\big[x_0^j \,\big|\, |x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big] + 2^{-15} L G\alpha^2 n\sqrt{n}\Big) \\
&\quad + \mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\Big(\mathbb{E}\big[x_0^j \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] - 2^{12} L\alpha n\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big]\Big) \qquad \text{[Using Lemmas 8 and 9]} \\
&= \mathbb{E}\big[x_0^j\big] + \mathbb{P}\big(|x_0^j| \le 2^{-10} G\alpha\sqrt{n}\big)\, 2^{-15} L G\alpha^2 n\sqrt{n} - \mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big]\, 2^{12} L\alpha n. \qquad (14)
\end{aligned}
\]
In the last step above, we gathered the sum of conditional expectations back into an unconditional expectation. By the assumption of this case,
\[
\mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big] \le 2^{-29} G\alpha\sqrt{n}.
\]
Since $\mathbb{E}[|x_0^j| \mid |x_0^j| > 2^{-10} G\alpha\sqrt{n}] > 2^{-10} G\alpha\sqrt{n}$, this implies that $\mathbb{P}(|x_0^j| > 2^{-10} G\alpha\sqrt{n}) \le 2^{-19} \le \frac{1}{2}$, and thus $\mathbb{P}(|x_0^j| \le 2^{-10} G\alpha\sqrt{n}) \ge \frac{1}{2}$. Using this in Ineq. (14),
\[
\begin{aligned}
\mathbb{E}\big[x_0^{j+1}\big] &\ge \mathbb{E}\big[x_0^j\big] + \frac{1}{2}\, 2^{-15} L G\alpha^2 n\sqrt{n} - \mathbb{P}\big(|x_0^j| > 2^{-10} G\alpha\sqrt{n}\big)\,\mathbb{E}\big[|x_0^j| \,\big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\big]\, 2^{12} L\alpha n \\
&\ge \mathbb{E}\big[x_0^j\big] + 2^{-16} L G\alpha^2 n\sqrt{n} - 2^{-29} G\alpha\sqrt{n}\cdot 2^{12} L\alpha n \qquad \text{[By the assumption of this case]} \\
&\ge \mathbb{E}\big[x_0^j\big] + 2^{-17} L G\alpha^2 n\sqrt{n}. \qquad (15)
\end{aligned}
\]
What we have shown using the two cases is the following: if for some epoch $j$ we have $\mathbb{E}[|x_0^j|] \le 2^{-29} G\alpha\sqrt{n}$, then looking at Ineq. (13) tells us that we are in Case 2. Then $\mathbb{E}[x_0^{j+1}] \ge \mathbb{E}[x_0^j] + 2^{-17} L G\alpha^2 n\sqrt{n}$; that is, $\mathbb{E}[x_0^{j+1}]$ increases by $\Omega(\alpha^2 n\sqrt{n})$. This shows that if we initialize $x_0$ at 0, then until $\mathbb{E}[|x_0^j|] > 2^{-29} G\alpha\sqrt{n}$, the expected error keeps increasing at rate $\Omega(\alpha^2 n\sqrt{n})$. Thus, given the step size regime considered in this theorem, there is some epoch where the error reaches $\Omega(\alpha\sqrt{n})$, which is the desired lower bound. However, what we want to show is that at the end of the $K$-th epoch the error is still $\Omega(\alpha\sqrt{n})$; we prove this next. We initialize $x_0 = 0$ and run $K$ epochs.
Then, because $\mathbb{E}[|x_0^j|] \ge \mathbb{E}[x_0^j]$, we have shown in the previous paragraph that
\[
\mathbb{E}\big[|x_0^j|\big] \ge \min\big\{2^{-29},\ 2^{-17} L\alpha n K\big\}\, G\alpha\sqrt{n} \quad \text{for some } 0 \le j \le K.
\]
Now, for our given range of $\alpha$, we know that $L\alpha n K \ge L$, which in turn is at least $2^{18}$. Thus, $\mathbb{E}[|x_0^j|] \ge 2^{-29} G\alpha\sqrt{n}$. To complete the proof, we next show that once $\mathbb{E}[|x_0^j|] \ge 2^{-29} G\alpha\sqrt{n}$, the expectation $\mathbb{E}[|x_0^t|]$ remains above $2^{-49} G\alpha\sqrt{n}$ for all $t > j$. The strategy is to show that if $\mathbb{E}[|x_0^j|]$ starts falling below $2^{-29} G\alpha\sqrt{n}$, then $\mathbb{E}[x_0^j]$ starts increasing. We also have Lemma 10 (given below), which says that $\mathbb{E}[|x_0^j|] \ge \mathbb{E}[x_0^j] \ge 0$ always. All of this together, with some simple arithmetic, gives a bound on how much $\mathbb{E}[|x_0^j|]$ can decrease.

Lemma 10. If $\alpha \le 1/L$, then $\forall i, j$: $\mathbb{E}[x_i^j] \ge 0$.

We formalize the argument of the previous paragraph next. Let $j$ be such that $\mathbb{E}[|x_0^j|] \ge 2^{-29} G\alpha\sqrt{n}$ and $\mathbb{E}[|x_0^{j+1}|] < 2^{-29} G\alpha\sqrt{n}$. Then,
\[
\begin{aligned}
\mathbb{E}\big[|x_0^{j+1}|\big] &\ge \mathbb{E}\big[|x_0^j|\big](1 - 2L\alpha n) - 2 L G\alpha^2 n\sqrt{n} \qquad \text{[Using Lemma 9]} \\
&\ge 2^{-29} G\alpha\sqrt{n}\,(1 - 2L\alpha n) - 2 L G\alpha^2 n\sqrt{n} \\
&\ge 2^{-29} G\alpha\sqrt{n} - 2^{-68} G\alpha\sqrt{n} - 2^{-39} G\alpha\sqrt{n} \qquad \text{[Since } \alpha \le \tfrac{2^{-40}}{nL}\text{]} \\
&\ge 2^{-30} G\alpha\sqrt{n}.
\end{aligned}
\]
For subsequent epochs $l > j$, we want to show that $\mathbb{E}[|x_0^l|]$ does not fall below $\Omega(\alpha\sqrt{n})$. Assume that for some $l > j$ we have $\mathbb{E}[|x_0^l|] < 2^{-29} G\alpha\sqrt{n}$, because otherwise $\mathbb{E}[|x_0^l|] = \Omega(\alpha\sqrt{n})$ and we are done. Because $\mathbb{E}[|x_0^l|] < 2^{-29} G\alpha\sqrt{n}$, Ineq. (13) lets us infer that we are in Case 2 in epoch $l$. Therefore, Ineq. (15) implies that in each such epoch $l$, $\mathbb{E}[x_0^{l+1}]$ increases by at least $2^{-17} L G\alpha^2 n\sqrt{n}$, whereas Lemma 9 says that $\mathbb{E}[|x_0^{l+1}|]$ can decrease by at most $2L\,\mathbb{E}[|x_0^l|]\,\alpha n + 2 L G\alpha^2 n\sqrt{n} \le (2^{-28} + 2)\, L G\alpha^2 n\sqrt{n} \le 3 L G\alpha^2 n\sqrt{n}$ per epoch. Further, we have the two facts that $\forall l$: $0 \le \mathbb{E}[x_0^l] \le \mathbb{E}[|x_0^l|]$. Combining all of these with simple arithmetic shows that $\mathbb{E}[|x_0^l|]$ can decrease to at most
\[
2^{-30} G\alpha\sqrt{n}\left(\frac{2^{-17}\, L G\alpha^2 n\sqrt{n}}{2^{-17}\, L G\alpha^2 n\sqrt{n} + 3\, L G\alpha^2 n\sqrt{n}}\right) \ge 2^{-49} G\alpha\sqrt{n}.
\]
After this, $\mathbb{E}[|x_0^l|]$ has to keep increasing, because $\mathbb{E}[x_0^l]$ keeps increasing until we enter Case 1, and then this cycle may repeat; but we have shown that, regardless, $\mathbb{E}[|x_0^l|]$ always remains above $2^{-49} G\alpha\sqrt{n}$. Finally, given $\alpha \ge \frac{1}{nK}$, we get that
\[
\mathbb{E}\big[|x_T|\big] \ge 2^{-49} G\alpha\sqrt{n} \ge \frac{2^{-49} G}{\sqrt{n}\, K}.
\]
Applying Jensen's inequality to this gives
\[
\mathbb{E}\big[|x_T|^2\big] \ge \frac{2^{-98} G^2}{n K^2} = \frac{2^{-98} G^2 n}{T^2},
\]
which is the claimed bound.

B.1 Proof of Lemma 8

The gradient computed at $x_i^j$ can be written as $(\mathbb{1}_{x_i^j \le 0}\, L + \mathbb{1}_{x_i^j > 0})\, x_i^j + \frac{G}{2}\sigma_i^j$. Then, $x_0^{j+1} - x_0^j$ is just the sum of the gradient steps taken through the epoch:
\[
\begin{aligned}
\mathbb{E}\big[x_0^{j+1}\big] &= x_0^j - \alpha\,\frac{G}{2}\sum_{i=1}^n \sigma_i^j - \alpha\sum_{i=0}^{n-1}\mathbb{E}\big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\big] \\
&= x_0^j - \alpha\sum_{i=0}^{n-1}\mathbb{E}\big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\big] \qquad \text{[Since } \textstyle\sum_{i=1}^n \sigma_i^j = 0\text{]} \\
&= x_0^j - \alpha\sum_{i=n/4}^{n/2}\mathbb{E}\big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\big] - \alpha\sum_{i \notin [\frac{n}{4},\frac{n}{2}]}\mathbb{E}\big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\big] \\
&= x_0^j - \alpha\sum_{i=n/4}^{n/2}\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j > 0\Big)\,\mathbb{E}\Big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \\
&\quad - \alpha\sum_{i=n/4}^{n/2}\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j \le 0\Big)\,\mathbb{E}\Big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] \\
&\quad - \alpha\sum_{i \notin [\frac{n}{4},\frac{n}{2}]}\mathbb{E}\big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\big]. \qquad (16)
\end{aligned}
\]
We decomposed the expectation into the sum of conditional expectations to obtain the last equality. The following inequalities help us bound $\mathbb{E}[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j]$ in Eq. (16) by simple expressions. Let $r$ be any random variable. Then,
\[
\begin{aligned}
\mathbb{E}\big[(\mathbb{1}_{r \le 0} L + \mathbb{1}_{r > 0})\, r\big] &= L\,\mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] + \mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0] \\
&= L\big(\mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] + \mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0]\big) - (L - 1)\,\mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0] \\
&\le L\big(\mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] + \mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0]\big) \qquad \text{[Since } L \ge 1 \text{ and } \mathbb{E}[r \mid r > 0] \ge 0\text{]} \\
&= L\,\mathbb{E}[r], \qquad (17)
\end{aligned}
\]
and
\[
\begin{aligned}
\mathbb{E}\big[(\mathbb{1}_{r \le 0} L + \mathbb{1}_{r > 0})\, r\big] &= L\,\mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] + \mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0] \\
&= \big(\mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] + \mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0]\big) + (L - 1)\,\mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] \\
&\le \mathbb{P}(r \le 0)\,\mathbb{E}[r \mid r \le 0] + \mathbb{P}(r > 0)\,\mathbb{E}[r \mid r > 0] \qquad \text{[Since } L \ge 1 \text{ and } \mathbb{E}[r \mid r \le 0] \le 0\text{]} \\
&= \mathbb{E}[r]. \qquad (18)
\end{aligned}
\]
We also have the following lemmas and corollary which, along with the two inequalities above, help lower bound the RHS of Eq. (16).

Lemma 11. If $|x_0^j| \le \sqrt{n}\, G\alpha$ and $\alpha \le \frac{2^{-40}}{nL}$, then for $n/4 \le i \le n/2$ we have
\[
\mathbb{E}\Big[x_i^j - x_0^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \le -\frac{\sqrt{i}\,\alpha G}{256}.
\]

Lemma 12. If $n \ge 64$ is a multiple of 4 and $i \le n/2$, then
\[
\frac{\sqrt{i}}{64} \le \mathbb{E}\Big[\Big|\sum_{p=1}^i \sigma_p^j\Big|\Big] \le \sqrt{i}.
\]
Further, for any $i \in [n/4,\, n/2]$,
\[
\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j < 0\Big) \ge \frac{1}{4} \quad \text{and} \quad \mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j > 0\Big) \ge \frac{1}{4}.
\]

Corollary 2. If $\alpha \le \frac{1}{nL}$ and $i \in [\frac{n}{4},\frac{n}{2}]$, then for any $h \in [1, n]$,
\[
\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] \le 4\Big(\sqrt{5h}\, G\alpha + |x_0^j|\sqrt{h\alpha L}\Big),
\quad \text{and} \quad
\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \le 4\Big(\sqrt{5h}\, G\alpha + |x_0^j|\sqrt{h\alpha L}\Big).
\]

Now, we handle each of the three terms in Eq. (16) individually.

• If $i \in [n/4, n/2]$ and $\big(\sum_{p=1}^i \sigma_p^j\big) > 0$:
\[
\begin{aligned}
\mathbb{E}\Big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big]
&\le L\,\mathbb{E}\Big[x_i^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \qquad \text{[Using Ineq. (17) with } r = (x_i^j \mid \textstyle\sum_{p=1}^i \sigma_p^j > 0)\text{]} \\
&= L\,\mathbb{E}\Big[x_i^j - x_0^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] + L x_0^j \\
&\le -\frac{L}{256}\sqrt{\frac{n}{4}}\, G\alpha + 2^{-10}\, L G\sqrt{n}\,\alpha \qquad \text{[Using Lemma 11 and } |x_0^j| \le 2^{-10} G\alpha\sqrt{n}\text{]} \\
&\le -2^{-10}\, L G\sqrt{n}\,\alpha. \qquad (19)
\end{aligned}
\]

• If $i \in [n/4, n/2]$ and $\big(\sum_{p=1}^i \sigma_p^j\big) \le 0$:
\[
\begin{aligned}
\mathbb{E}\Big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big]
&\le \mathbb{E}\Big[x_i^j \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] \qquad \text{[Using Ineq. (18) with } r = (x_i^j \mid \textstyle\sum_{p=1}^i \sigma_p^j \le 0)\text{]} \\
&\le \mathbb{E}\Big[|x_i^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] + |x_0^j| \\
&\le 4\Big(\sqrt{5i}\, G\alpha + |x_0^j|\sqrt{i\alpha L}\Big) + |x_0^j| \qquad \text{[Using Corollary 2 with } h = i\text{]} \\
&\le 4\big(\sqrt{5i}\, G\alpha + |x_0^j|\big) + |x_0^j| \qquad \text{[Since } \alpha \le 1/nL\text{]} \\
&\le 4\big(\sqrt{5i}\, G\alpha + 2^{-10} G\alpha\sqrt{n}\big) + 2^{-10} G\alpha\sqrt{n} \qquad \text{[Since } |x_0^j| \le 2^{-10} G\alpha\sqrt{n}\text{]} \\
&\le 8\, G\sqrt{n}\,\alpha. \qquad (20)
\end{aligned}
\]

• If $i \notin [n/4, n/2]$:
\[
\begin{aligned}
\mathbb{E}\big[(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\big]
&\le \mathbb{E}\big[x_i^j\big] \qquad \text{[Using Ineq. (18) with } r = x_i^j\text{]} \\
&\le \mathbb{E}\big[|x_i^j - x_0^j|\big] + |x_0^j| \\
&\le \sqrt{5i}\, G\alpha + |x_0^j|\sqrt{i\alpha L} + |x_0^j| \qquad \text{[Using Lemma 6 with } x^* = 0\text{]} \\
&\le \sqrt{5i}\, G\alpha + 2|x_0^j| \qquad \text{[Since } \alpha \le 1/nL\text{]} \\
&\le 8\, G\sqrt{n}\,\alpha. \qquad (21)
\end{aligned}
\]

Continuing from Eq. (16) and substituting Ineqs. (19), (20) and (21):
\[
\begin{aligned}
\mathbb{E}\big[x_0^{j+1}\big] &\ge x_0^j - \alpha\sum_{i=n/4}^{n/2}\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j > 0\Big)\big({-2^{-10}\, L G\sqrt{n}\,\alpha}\big) - \alpha\sum_{i=n/4}^{n/2}\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j \le 0\Big)\big(8 G\sqrt{n}\,\alpha\big) - \alpha\sum_{i \notin [\frac{n}{4},\frac{n}{2}]}\big(8 G\sqrt{n}\,\alpha\big) \\
&\ge x_0^j + \alpha\sum_{i=n/4}^{n/2}\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j > 0\Big)\, 2^{-10}\, L G\sqrt{n}\,\alpha - \alpha\sum_{i=1}^{n} 8 G\sqrt{n}\,\alpha \\
&\ge x_0^j + \alpha\,\frac{n}{4}\cdot\frac{1}{4}\cdot 2^{-10}\, L G\sqrt{n}\,\alpha - \alpha n\,\big(8 G\sqrt{n}\,\alpha\big) \qquad \text{[Using Lemma 12]} \\
&= x_0^j + 2^{-14}\, L G\alpha^2 n\sqrt{n} - 8\, G\alpha^2 n\sqrt{n} \\
&\ge x_0^j + 2^{-15}\, L G\alpha^2 n\sqrt{n}. \qquad \text{[Since } L \ge 2^{18}\text{]}
\end{aligned}
\]
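A Monte-Carlo sketch of the drift in Lemma 8 (with illustrative constants far milder than the lemma's, and a step size at the top of the $\alpha \le 1/(nL)$ range used in this appendix, so that the effect is visible above the sampling noise): starting an epoch at $x_0^j = 0$, the estimated $\mathbb{E}[x_0^{j+1}]$ should come out positive, on the scale of $L G\alpha^2 n\sqrt{n}$.

```python
import numpy as np

# Monte-Carlo estimate of the per-epoch drift E[x_0^{j+1}] when x_0^j = 0,
# for the piecewise-quadratic construction (illustrative constants, not the
# lemma's; the lemma's regime makes the drift far too small to sample).
n, G, L = 64, 1.0, 32.0
alpha = 1.0 / (n * L)
signs = np.array([+1.0] * (n // 2) + [-1.0] * (n // 2))
rng = np.random.default_rng(4)

def end_of_epoch(x):
    for s in rng.permutation(signs):
        x -= alpha * ((L if x <= 0 else 1.0) * x + 0.5 * G * s)
    return x

trials = 20000
drift = np.mean([end_of_epoch(0.0) for _ in range(trials)])
scale = L * G * alpha**2 * n * np.sqrt(n)
print(f"estimated E[x_0^(j+1)] = {drift:+.3e}  (scale L*G*alpha^2*n*sqrt(n) = {scale:.3e})")
```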
B.2 Proof of Lemma 9

We start by bounding the difference between the iterates at the starts of two consecutive epochs. The gradient computed at $x_i^j$ can be written as $(\mathbb{1}_{x_i^j \le 0}\, L + \mathbb{1}_{x_i^j > 0})\, x_i^j + \frac{G}{2}\sigma_i^j$. Then, $x_0^{j+1} - x_0^j$ is just the sum of the gradient steps taken through the epoch:
\[
\begin{aligned}
\mathbb{E}\big[|x_0^{j+1} - x_0^j|\big] &= \mathbb{E}\Big[\Big|{-\alpha}\,\frac{G}{2}\sum_{i=1}^n \sigma_i^j - \alpha\sum_{i=0}^{n-1}(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\Big|\Big] \\
&= \mathbb{E}\Big[\Big|{-\alpha}\sum_{i=0}^{n-1}(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, x_i^j\Big|\Big] \qquad \text{[Since } \textstyle\sum_{i=1}^n \sigma_i^j = 0\text{]} \\
&\le \alpha\,\mathbb{E}\Big[\sum_{i=0}^{n-1}(\mathbb{1}_{x_i^j \le 0} L + \mathbb{1}_{x_i^j > 0})\, |x_i^j|\Big] \le \alpha L\,\mathbb{E}\Big[\sum_{i=0}^{n-1}|x_i^j|\Big] \le \alpha L\,\mathbb{E}\Big[\sum_{i=0}^{n-1}\big(|x_0^j - x_i^j| + |x_0^j|\big)\Big] \\
&\le \alpha L\sum_{i=0}^{n-1}\big(\sqrt{5i}\,\alpha G + |x_0^j|\sqrt{i\alpha L} + |x_0^j|\big) \qquad \text{[Using Lemma 6 with } x^* = 0\text{]} \\
&\le \alpha L\sum_{i=0}^{n-1}\big(\sqrt{5i}\,\alpha G + 2|x_0^j|\big) \qquad \text{[Since } \alpha \le 1/nL\text{]} \\
&\le \alpha L\Big(\tfrac{3}{2} n\sqrt{n}\,\alpha G + 2 n |x_0^j|\Big). \qquad [\textstyle\sum_{i < n}\sqrt{5i} \le \tfrac{3}{2} n\sqrt{n}]
\end{aligned}
\]
Thus, we have shown the following:
\[
\mathbb{E}\big[|x_0^{j+1} - x_0^j|\big] \le 2 L |x_0^j|\alpha n + 2\, L G\alpha^2 n\sqrt{n}. \qquad (22)
\]
Then,
\[
\mathbb{E}\big[|x_0^{j+1}|\big] = \mathbb{E}\big[|x_0^j + x_0^{j+1} - x_0^j|\big] \ge |x_0^j| - \mathbb{E}\big[|x_0^{j+1} - x_0^j|\big] \ge |x_0^j|(1 - 2L\alpha n) - 2\, L G\alpha^2 n\sqrt{n}. \qquad \text{[Using Ineq. (22)]}
\]
This proves the first inequality of the lemma. For the second inequality,
\[
\begin{aligned}
\mathbb{E}\Big[x_0^{j+1} \,\Big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\Big] &= \mathbb{E}\Big[x_0^j + x_0^{j+1} - x_0^j \,\Big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\Big] \\
&\ge x_0^j - \mathbb{E}\Big[|x_0^{j+1} - x_0^j| \,\Big|\, |x_0^j| > 2^{-10} G\alpha\sqrt{n}\Big] \\
&\ge x_0^j - \big(2|x_0^j| L\alpha n + 2\, L G\alpha^2 n\sqrt{n}\big) \qquad \text{[Using Ineq. (22)]} \\
&\ge x_0^j - \big(2|x_0^j| L\alpha n + 2^{11}|x_0^j| L\alpha n\big) \qquad \text{[Since } |x_0^j| > 2^{-10} G\alpha\sqrt{n}\text{]} \\
&\ge x_0^j - 2^{12}\,|x_0^j| L\alpha n.
\end{aligned}
\]
B.3 Proof of Lemma 10

Let us denote the iterate at iteration $t$ by $x_{t,L}$, where the $L$ in the subscript makes the dependence on $L$ explicit. Now, consider the problem setup when $L = 1$. In that case, the two kinds of component functions are simply quadratics instead of piecewise quadratics:
\[
f(x) = \frac{x^2}{2} + \frac{G x}{2} \quad \text{and} \quad f(x) = \frac{x^2}{2} - \frac{G x}{2}.
\]
Then, due to the symmetry of the function, the expected value of the iterate at any iteration $t$ is 0: $\mathbb{E}[x_{t,1}] = 0$.

Now, take the problem setup for the case $L \ge 1$, where we denote the iterates by $x_{t,L}$. We couple the two runs so that both $x_{t,L}$ and $x_{t,1}$ are created using the exact same random permutations of the functions, but $x_{t,1}$ uses $L = 1$ and $x_{t,L}$ uses $L \ge 1$. We then show by induction that $x_{t,L} \ge x_{t,1}$; since $\mathbb{E}[x_{t,1}] = 0$, this gives $\mathbb{E}[x_{t,L}] \ge 0$.

• Base case: both runs are initialized at 0, and thus $x_{0,1} = x_{0,L} = 0$.

• Inductive case (assume $x_{i,L} \ge x_{i,1}$): we break the analysis into the following three cases. In each of them, note that regardless of which function kind is sampled, the contribution of the linear part of the gradient (the $\pm G/2$ term) is the same for both $x_{i+1,1}$ and $x_{i+1,L}$, since we use the exact same permutation of functions for both runs.

1. Case $x_{i,1} \le x_{i,L} \le 0$: here,
\[
x_{i+1,L} - x_{i+1,1} = (1 - \alpha L)\, x_{i,L} - (1 - \alpha)\, x_{i,1} \ge (1 - \alpha L)\, x_{i,L} - (1 - \alpha L)\, x_{i,1} \ge 0. \qquad \text{[Since } \alpha \le 1/L \text{ and } x_{i,1} \le 0\text{]}
\]

2. Case $0 \le x_{i,1} \le x_{i,L}$: both runs see unit curvature, hence
\[
x_{i+1,L} - x_{i+1,1} = (1 - \alpha)\, x_{i,L} - (1 - \alpha)\, x_{i,1} \ge 0. \qquad \text{[Since } \alpha \le 1/L \le 1\text{]}
\]

3. Case $x_{i,1} \le 0 \le x_{i,L}$: here,
\[
x_{i+1,L} - x_{i+1,1} = (1 - \alpha)\, x_{i,L} - (1 - \alpha L)\, x_{i,1} \ge 0, \qquad \text{[Since } \alpha \le 1/L\text{]}
\]
because the first term is nonnegative and the second is nonpositive.

B.4 Proof of Lemma 11

The gradient computed at $x_i^j$ can be written as $(\mathbb{1}_{x_i^j \le 0}\, L + \mathbb{1}_{x_i^j > 0})\, x_i^j + \frac{G}{2}\sigma_i^j$. Then,
\[
\begin{aligned}
\mathbb{E}\Big[x_i^j - x_0^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big]
&= \mathbb{E}\Big[{-\alpha}\sum_{p=1}^i \nabla f_{\sigma_p^j}(x_{p-1}^j) \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \\
&= -\alpha\,\mathbb{E}\Big[\sum_{p=1}^i \Big((\mathbb{1}_{x_{p-1}^j > 0} + L\,\mathbb{1}_{x_{p-1}^j \le 0})\, x_{p-1}^j + \frac{G}{2}\sigma_p^j\Big) \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \\
&\le -\alpha\,\mathbb{E}\Big[{-\Big|\sum_{p=1}^i (\mathbb{1}_{x_{p-1}^j > 0} + L\,\mathbb{1}_{x_{p-1}^j \le 0})\, x_{p-1}^j\Big|} + \frac{G}{2}\sum_{p=1}^i \sigma_p^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \\
&\le -\alpha\,\mathbb{E}\Big[\frac{G}{2}\Big|\sum_{p=1}^i \sigma_p^j\Big| - L\sum_{p=1}^i |x_{p-1}^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \\
&\le -\alpha\,\mathbb{E}\Big[\frac{G}{2}\Big|\sum_{p=1}^i \sigma_p^j\Big| - L\sum_{p=1}^i \big(|x_0^j - x_{p-1}^j| + |x_0^j|\big) \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big]. \qquad (23)
\end{aligned}
\]
We have the following helpful lemma, which lets us control the first term in the expression above.

Lemma 12. If $n \ge 64$ is a multiple of 4 and $i \le n/2$, then
\[
\frac{\sqrt{i}}{64} \le \mathbb{E}\Big[\Big|\sum_{p=1}^i \sigma_p^j\Big|\Big] \le \sqrt{i}.
\]
Further, for any $i \in [n/4,\, n/2]$,
\[
\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j < 0\Big) \ge \frac{1}{4} \quad \text{and} \quad \mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j > 0\Big) \ge \frac{1}{4}.
\]
We also use the following corollary (of Lemma 12 and Lemma 6).
Corollary 2. If $\alpha \le \frac{1}{nL}$ and $i \in [\frac{n}{4},\frac{n}{2}]$, then for any $h \in [1, n]$,
\[
\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] \le 4\Big(\sqrt{5h}\, G\alpha + |x_0^j|\sqrt{h\alpha L}\Big),
\quad \text{and} \quad
\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \le 4\Big(\sqrt{5h}\, G\alpha + |x_0^j|\sqrt{h\alpha L}\Big).
\]
Continuing from (23), and using Lemma 12 and Corollary 2 (together with the fact that, by symmetry, $\mathbb{E}\big[|\sum_{p=1}^i \sigma_p^j| \,\big|\, \sum_{p=1}^i \sigma_p^j > 0\big] \ge \mathbb{E}\big[|\sum_{p=1}^i \sigma_p^j|\big]$), we have that for $n/4 \le i \le n/2$,
\[
\mathbb{E}\Big[x_i^j - x_0^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \le -\frac{\sqrt{i}\,\alpha G}{128} + \alpha L i\Big(4\sqrt{5i}\, G\alpha + 4|x_0^j|\sqrt{i\alpha L} + |x_0^j|\Big).
\]
Thus, if $|x_0^j| \le \sqrt{n}\, G\alpha$ and $\alpha \le \frac{2^{-40}}{nL}$, then for $n/4 \le i \le n/2$ the second term is at most half of the first, and we get
\[
\mathbb{E}\Big[x_i^j - x_0^j \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \le -\frac{\sqrt{i}\,\alpha G}{256}.
\]

B.5 Proof of Lemma 12
Proof.
We skip the superscript $j$ in the notation because we will be working inside an epoch. Let $s_i := \sum_{p=1}^i \sigma_p$. First, we prove the upper bound. We have that
\[
\mathbb{E}[|s_i|] \le \sqrt{\mathbb{E}\Big[\Big(\sum_{p=1}^i \sigma_p\Big)^2\Big]} \qquad \text{[Jensen's inequality]}
= \sqrt{\sum_{p=1}^i \mathbb{E}[\sigma_p^2] + 2\sum_{k<l}\mathbb{E}[\sigma_k\sigma_l]} \le \sqrt{i},
\]
where the last step uses $\mathbb{E}[\sigma_p^2] = 1$ and, under without-replacement sampling, $\mathbb{E}[\sigma_k\sigma_l] = -\frac{1}{n-1} \le 0$ for $k \ne l$.

For the lower bound, decompose
\[
\mathbb{E}[|s_i|] = \mathbb{E}[|s_{i-1}|] + \mathbb{P}(s_{i-1} = 0) + \mathbb{P}(s_{i-1}\sigma_i > 0) - \mathbb{P}(s_{i-1}\sigma_i < 0). \qquad (24)
\]
Intuitively, because $\mathbb{P}(s_{i-1}\sigma_i > 0) \approx \mathbb{P}(s_{i-1}\sigma_i < 0)$, we expect $\mathbb{E}[|s_i|] \approx \sum_{p<i}\mathbb{P}(s_p = 0)$, which is relatively easy to compute using combinatorics and gives $\mathbb{E}[|s_i|] = \Omega(\sqrt{i})$. We do the exact calculations next. Continuing from Eq. (24) and decomposing the probabilities as sums of conditional probabilities,
\[
\begin{aligned}
\mathbb{E}[|s_i|] &= \mathbb{E}[|s_{i-1}|] + \mathbb{P}(s_{i-1} = 0) + \sum_{p=1}^{i-1}\mathbb{P}(|s_{i-1}| = p)\,\mathbb{P}\big(s_{i-1}\sigma_i > 0 \,\big|\, |s_{i-1}| = p\big) \\
&\quad - \sum_{p=1}^{i-1}\mathbb{P}(|s_{i-1}| = p)\,\mathbb{P}\big(s_{i-1}\sigma_i < 0 \,\big|\, |s_{i-1}| = p\big).
\end{aligned}
\]
The term $\mathbb{P}(s_{i-1}\sigma_i > 0 \mid |s_{i-1}| = p)$ has a closed form: assume WLOG that $s_{i-1} = p > 0$. Then $\mathbb{P}(s_{i-1}\sigma_i > 0 \mid s_{i-1} = p) = \mathbb{P}(\sigma_i = +1 \mid s_{i-1} = p)$ is just the probability of sampling a `$+1$' when, out of the $n - i + 1$ remaining entries, exactly $(n - i + 1 - p)/2$ are `$+1$'s, which equals $\frac{(n-i+1-p)/2}{n-i+1}$. The other cases are handled similarly. This gives
\[
\begin{aligned}
\mathbb{E}[|s_i|] &= \mathbb{E}[|s_{i-1}|] + \mathbb{P}(s_{i-1} = 0) + \sum_{p=1}^{i-1}\mathbb{P}(|s_{i-1}| = p)\,\frac{(n-i+1-p)/2}{n-i+1} - \sum_{p=1}^{i-1}\mathbb{P}(|s_{i-1}| = p)\,\frac{(n-i+1+p)/2}{n-i+1} \\
&= \mathbb{E}[|s_{i-1}|] + \mathbb{P}(s_{i-1} = 0) - \sum_{p=1}^{i-1}\mathbb{P}(|s_{i-1}| = p)\,\frac{p}{n-i+1} \\
&= \mathbb{E}[|s_{i-1}|]\Big(1 - \frac{1}{n-i+1}\Big) + \mathbb{P}(s_{i-1} = 0) \\
&\ge \mathbb{E}[|s_{i-1}|]\Big(1 - \frac{2}{n}\Big) + \mathbb{P}(s_{i-1} = 0) \qquad \text{[Since } i \le n/2\text{]} \\
&\ge \mathbb{E}[|s_{i-1}|]\Big(1 - \frac{2}{n}\Big) + \mathbb{1}_{\{i-1 \text{ even}\}}\,\mathbb{P}(s_{i-1} = 0). \qquad [s_{i-1}\text{ can be 0 only if } i - 1 \text{ is even}]
\end{aligned}
\]
To compute $\mathbb{P}(s_{i-1} = 0)$ for even $i - 1$, note that it is just the ratio of the number of ways of choosing $(i-1)/2$ positions for `$+1$'s among the first $i - 1$ draws and $(n-i+1)/2$ positions for the remaining `$+1$'s among the last $n - i + 1$ draws, to the total number of ways of choosing the $n/2$ positions of the `$+1$'s among all $n$:
\[
\mathbb{P}(s_{i-1} = 0) = \frac{\binom{i-1}{(i-1)/2}\binom{n-i+1}{(n-i+1)/2}}{\binom{n}{n/2}}. \qquad (25)
\]
To bound this combinatorial expression, we use the following approximation by Mortici (2011, Theorem 1):
\[
\sqrt{\pi\,(2k + 0.33)}\Big(\frac{k}{e}\Big)^k \le k! \le \sqrt{\pi\,(2k + 0.36)}\Big(\frac{k}{e}\Big)^k.
\]
Using this, one obtains $\mathbb{P}(s_{i-1} = 0) \ge \frac{1}{8\sqrt{i}}$ for even $i - 1$ with $i \le n/2$. Continuing the sequence of inequalities and unrolling the recursion (only every other step contributes),
\[
\mathbb{E}[|s_i|] \ge \frac{1}{8\sqrt{i}}\sum_{p=0}^{\lfloor i/2\rfloor - 1}\Big(1 - \frac{2}{n}\Big)^p = \frac{1}{8\sqrt{i}}\cdot\frac{n}{2}\Big(1 - \Big(1 - \frac{2}{n}\Big)^{\lfloor i/2\rfloor}\Big).
\]
We use the inequality $\forall x \in [0,1],\ m \ge 0:\ (1-x)^m \le \frac{1}{1+mx}$ to upper bound $(1 - 2/n)^{\lfloor i/2\rfloor}$. Thus, we get
\[
\mathbb{E}[|s_i|] \ge \frac{1}{8\sqrt{i}}\cdot\frac{n}{2}\cdot\frac{2\lfloor i/2\rfloor/n}{1 + 2\lfloor i/2\rfloor/n} = \frac{\lfloor i/2\rfloor}{8\sqrt{i}\,\big(1 + 2\lfloor i/2\rfloor/n\big)} \ge \frac{i/4}{8\sqrt{i}\cdot\frac{3}{2}} = \frac{\sqrt{i}}{48} \ge \frac{\sqrt{i}}{64},
\]
where we used $\lfloor i/2\rfloor \ge i/4$ and $1 + 2\lfloor i/2\rfloor/n \le 1 + i/n \le 3/2$ (the case $i = 1$ is trivial, since $\mathbb{E}[|s_1|] = 1$). This proves the lower bound of the lemma.

Next, we prove that $\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j < 0\big) \ge \frac{1}{4}$ and $\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j > 0\big) \ge \frac{1}{4}$. First, $\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j < 0\big) = \mathbb{P}\big(\sum_{p=1}^i \sigma_p^j > 0\big)$ by symmetry, and hence it is sufficient to show $\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j = 0\big) \le \frac{1}{2}$. For odd $i$, $\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j = 0\big) = 0$ trivially; thus, we focus only on even $i$. Towards that end,
\[
\mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j = 0\Big) = \mathbb{P}(s_i = 0) = \frac{\binom{i}{i/2}\binom{n-i}{(n-i)/2}}{\binom{n}{n/2}} \qquad \text{[Using Eq. (25)]}
\le \frac{2}{\sqrt{i}} \qquad \text{[Using Mortici (2011, Theorem 1), as above]}
\le \frac{4}{\sqrt{n}} \le \frac{1}{2}. \qquad \text{[Since } i \ge n/4 \text{ and } n \ge 64\text{]}
\]
Thus, $\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j < 0\big) = \mathbb{P}\big(\sum_{p=1}^i \sigma_p^j > 0\big) \ge \frac{1 - 1/2}{2} = \frac{1}{4}$.

B.6 Proof of Corollary 2
\[
\begin{aligned}
\mathbb{E}\big[|x_h^j - x_0^j|\big] &= \mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j \le 0\Big)\,\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] + \mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j > 0\Big)\,\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j > 0\Big] \\
&\ge \mathbb{P}\Big(\sum_{p=1}^i \sigma_p^j \le 0\Big)\,\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big].
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
\mathbb{E}\Big[|x_h^j - x_0^j| \,\Big|\, \sum_{p=1}^i \sigma_p^j \le 0\Big] &\le \frac{\mathbb{E}\big[|x_h^j - x_0^j|\big]}{\mathbb{P}\big(\sum_{p=1}^i \sigma_p^j \le 0\big)} \\
&\le 4\,\mathbb{E}\big[|x_h^j - x_0^j|\big] \qquad \text{[Using Lemma 12]} \\
&\le 4\Big(\sqrt{5h}\, G\alpha + |x_0^j|\sqrt{h\alpha L}\Big). \qquad \text{[Using Lemma 6 with } x^* = 0\text{]}
\end{aligned}
\]
The other inequality can be proved similarly.
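As a sanity check of the two-sided bound in Lemma 12 (a sketch with assumed parameters; the constants match the reconstruction above), one can estimate $\mathbb{E}|s_i|$ and the sign probabilities directly:

```python
import numpy as np

# Monte-Carlo check of Lemma 12: s_i is the prefix sum of a random permutation
# of n/2 '+1's and n/2 '-1's (illustrative n).
rng = np.random.default_rng(5)
n, trials = 64, 20000
sigma = np.array([+1] * (n // 2) + [-1] * (n // 2))
prefix = np.cumsum(np.array([rng.permutation(sigma) for _ in range(trials)]), axis=1)
for i in (n // 4, n // 3, n // 2):
    s_i = prefix[:, i - 1]
    print(f"i={i:2d}: sqrt(i)/64 = {np.sqrt(i)/64:.3f} <= E|s_i| = {np.abs(s_i).mean():.3f}"
          f" <= sqrt(i) = {np.sqrt(i):.3f};  P(s_i > 0) = {(s_i > 0).mean():.3f}")
```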
C Proof of Corollary 1

Let $F_1(x)$ be the 1-dimensional function from Theorem 2 with $L = 2^{18}$, and let $F_2(x)$ be the 1-dimensional function from Proposition 1 in Safran and Shamir (2019, p. 10-12) with $\lambda = 1$. In particular, under this setting, the function from Proposition 1 in their paper is the following:
\[
F_2(x) = \frac{1}{n}\sum_{i=1}^n f_{2,i}(x) = \frac{x^2}{2},
\]
where for $i \le n/2$: $f_{2,i}(x) = \frac{x^2}{2} + \frac{Gx}{2}$, and for $i > n/2$: $f_{2,i}(x) = \frac{x^2}{2} - \frac{Gx}{2}$. In particular, it is the same function as the one in Theorem 2 of this paper if $L$ were set to 1. Then, $F_1(x)$ and $F_2(x)$ satisfy the following properties (for each $j \in \{1, 2\}$):

1. $F_j(x) = \frac{1}{n}\sum_{i=1}^n f_{j,i}(x)$.
2. $F_j(x)$ satisfies Assumption 2; $\forall i$, the $f_{j,i}(x)$ satisfy Assumptions 1 and 5; and, in the 1-dimensional space, Assumptions 3 and 4 are satisfied. We prove these for our function $F_1(x)$ in the proof of Theorem 2; they can be proved similarly for $F_2(x)$.
3. There is a step size range $A_j \subset \mathbb{R}$ and an initialization $x_{j,0}$ such that after $K$ epochs of SGDo on $F_j(x)$ with any constant step size $\alpha \in A_j$,
\[
\mathbb{E}\big[\|x_T - x_j^*\|^2\big] = \mathbb{E}\big[\|x_T\|^2\big] \ge \frac{C_l\, G^2 n}{T^2}, \qquad (26)
\]
where $x_j^* = 0$ is the minimizer of $F_j(x)$ and $C_l$ is a universal constant. For $F_1$, we have shown that $A_1 = \big[\frac{1}{T}, \frac{C}{n}\big]$ for a universal constant $C$. The proof of Proposition 1 in Safran and Shamir (2019, p. 10-12) can be modified slightly so that $F_2$ satisfies Ineq. (26) with $A_2 = [0, \infty) \setminus A_1$.

For the rest of this proof, we work in a 2-dimensional space. A 2-dimensional vector will be represented as $x = [x^{(1)}, x^{(2)}]$; the superscript in this section does not denote the epoch, it denotes the coordinate. $\forall i$, we define the 2-dimensional functions $f_i(x) = f_{1,i}(x^{(1)}) + f_{2,i}(x^{(2)})$ and $F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$. Then, $F(x)$ satisfies Assumption 2 and, $\forall i$, the $f_i(x)$ satisfy Assumptions 1 and 5. Let $x^* = [x_1^*, x_2^*] = [0, 0]$ denote the minimizer of $F(x)$. Consider the 2-dimensional square defined by $\{x : \|x - x^*\|_\infty \le D\}$. Then, in this domain, Assumptions 3 and 4 are satisfied with the constants $D' := D\sqrt{2}$ and $G' := G\sqrt{2}$.

Note that the gradient of $f_i(x)$ with respect to $x^{(1)}$ is just $\nabla f_{1,i}(x^{(1)})$, and with respect to $x^{(2)}$ it is just $\nabla f_{2,i}(x^{(2)})$. Thus, SGDo essentially operates independently along each of the two coordinates. Now, for any step length $\alpha \in \mathbb{R}^+$, we know that $\alpha \in A_1$ or $\alpha \in A_2$. Then, using Ineq. (26), at least one of the following is true:
\[
\mathbb{E}\big[\|x_T^{(1)} - x_1^*\|^2\big] \ge \frac{C_l\, G^2 n}{T^2} \quad \text{or} \quad \mathbb{E}\big[\|x_T^{(2)} - x_2^*\|^2\big] \ge \frac{C_l\, G^2 n}{T^2}.
\]
Therefore, the overall error satisfies $\mathbb{E}\big[\|x_T - x^*\|^2\big] \ge \frac{C_l\, G^2 n}{T^2}$.
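A compact sketch of the 2-dimensional construction (illustrative constants; grad1 is the Theorem 2 component gradient with curvature $L$ on $x < 0$, and grad2 is its $L = 1$ variant attributed above to Safran and Shamir (2019)); note how a single permutation drives both coordinates independently:

```python
import numpy as np

# SGDo on the 2-D function of Corollary 1: each component f_i is the sum of a
# 1-D component along each coordinate, so one permutation updates both
# coordinates independently. Constants are illustrative assumptions.
n, G, L, K = 64, 1.0, 4.0, 50
alpha = 1.0 / (n * K)
signs = np.array([+1.0] * (n // 2) + [-1.0] * (n // 2))
grad1 = lambda x, s: (L if x <= 0 else 1.0) * x + 0.5 * G * s   # Theorem 2 component
grad2 = lambda x, s: x + 0.5 * G * s                            # L = 1 variant
rng = np.random.default_rng(6)

x = np.zeros(2)
for _ in range(K):
    for s in rng.permutation(signs):
        x -= alpha * np.array([grad1(x[0], s), grad2(x[1], s)])
print(f"x_T = {x},  ||x_T - x*||^2 = {float(x @ x):.3e}   (x* = [0, 0])")
```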
D Numerical results

[Figure 6: Running SGDo with different step size regimes on the function $F$ used in our lower bound, Theorem 2: (a) $\alpha = 1/T$, (b) $\alpha = 2\log T/T$, (c) $\alpha = 8\log T/T$, (d) $\alpha = 1/n$. For each regime, the final error $\|x_T - x^*\|^2$ is plotted against the number of epochs $K$ (with reference slopes of the form $K^{-1}$, $K^{-2}$) and against the number of component functions $n$ (with reference slopes of the form $n^{-1}$, $(\log T)^2/n$). The setup is the same as described in Subsection 5.1.]