On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization
Hao Yu, Rong Jin

Machine Intelligence Technology Lab, Alibaba Group (U.S.) Inc., Bellevue, WA. Correspondence to: Hao Yu <[email protected]>.

Abstract
For SGD-based distributed stochastic optimization, computation complexity, measured by the convergence rate in terms of the number of stochastic gradient calls, and communication complexity, measured by the number of inter-node communication rounds, are the two most important performance metrics. The classical data-parallel implementation of SGD over $N$ workers can achieve linear speedup of its convergence rate but incurs an inter-node communication round at each batch. We study the benefit of using dynamically increasing batch sizes in parallel SGD for stochastic non-convex optimization by characterizing the attained convergence rate and the required number of communication rounds. We show that for stochastic non-convex optimization under the P-L condition, the classical data-parallel SGD with exponentially increasing batch sizes can achieve the fastest known $O(1/(NT))$ convergence with linear speedup using only $\log(T)$ communication rounds. For general stochastic non-convex optimization, we propose a Catalyst-like algorithm to achieve the fastest known $O(1/\sqrt{NT})$ convergence with only $O(\sqrt{NT}\log(TN))$ communication rounds.
1. Introduction
Consider solving the following stochastic optimization problem
$$\min_{x\in\mathbb{R}^m} f(x) \triangleq \mathbb{E}_{\zeta\sim\mathcal{D}}[F(x;\zeta)] \qquad (1)$$
with a fixed yet unknown distribution $\mathcal{D}$, only by accessing i.i.d. stochastic gradients $\nabla F(\cdot;\zeta)$. Most machine learning applications can be cast into the above stochastic optimization, where $x$ refers to the machine learning model, the random variables $\zeta\sim\mathcal{D}$ refer to instance-label pairs, and $F(x;\zeta)$ refers to the corresponding loss function. For example, consider a simple least-squares linear regression problem: let $\zeta_i = (a_i, b_i)\sim\mathcal{D}$ be training data collected offline or online, where each $a_i$ is a feature vector and $b_i$ is its label; then $F(x;\zeta_i) = (a_i^T x - b_i)^2$. (Note that if the training data come from a finite set collected offline, the stochastic optimization can also be written as a finite-sum minimization, which is a special case of stochastic optimization with a known uniform distribution $\mathcal{D}$. However, for online training, since the pairs $(a_i, b_i)$ are generated gradually and disclosed to us one by one, we need to solve the more challenging stochastic optimization with unknown distribution $\mathcal{D}$. The algorithms developed in this paper do not require any knowledge of the distribution $\mathcal{D}$.)
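For concreteness, the stochastic gradient used by all algorithms in this paper is just the gradient of the per-sample loss at a freshly drawn sample. A minimal sketch for the least-squares example above, where the helper sample_pair is a stand-in for the offline or online data source (an assumption for illustration, not part of the paper):

```python
import numpy as np

def least_squares_stochastic_grad(x, sample_pair):
    # sample_pair() is a stand-in for drawing (a_i, b_i) ~ D, offline or online
    a, b = sample_pair()
    # gradient of F(x; zeta_i) = (a^T x - b)^2 with respect to x
    return 2.0 * (a @ x - b) * a
```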
1. Smoothness: The objective function $f(x)$ in problem (1) is smooth with modulus $L$.
2. Unbiased gradients with bounded variances:
Assume there exists a stochastic first-order oracle (SFO) that provides independent unbiased stochastic gradients $\nabla F(x;\zeta)$ satisfying $\mathbb{E}_{\zeta\sim\mathcal{D}}[\nabla F(x;\zeta)] = \nabla f(x)$, $\forall x$. The unbiased stochastic gradients have bounded variance, i.e., there exists a constant $\sigma > 0$ such that
$$\mathbb{E}_{\zeta\sim\mathcal{D}}\|\nabla F(x;\zeta) - \nabla f(x)\|^2 \le \sigma^2 \qquad (2)$$

When solving the stochastic optimization (1) only with sampled stochastic gradients, the computation complexity, which is also known as the convergence rate, is measured by the decay law of the solution error with respect to the number of accesses of the stochastic first-order oracle (SFO) that provides the sampled stochastic gradients (Nemirovsky & Yudin, 1983; Ghadimi et al., 2016). For strongly convex stochastic minimization, SGD-type algorithms (Nemirovski et al., 2009; Hazan & Kale, 2014; Rakhlin et al., 2012) can achieve the optimal $O(1/T)$ convergence rate; that is, the error is ensured to be at most $O(1/T)$ after $T$ accesses of stochastic gradients. For non-convex stochastic minimization, which is the case of training deep neural networks, SGD-type algorithms can achieve an $O(1/\sqrt{T})$ convergence rate. (For general non-convex functions, the convergence rate is usually measured in terms of $\|\nabla f(x)\|^2$, which in some sense can be considered the counterpart of $f(x) - f(x^*)$ in the convex case (Nesterov, 2004; Ghadimi & Lan, 2013).) Classical SGD-type algorithms can be accelerated by utilizing multiple workers/nodes to follow a parallel SGD (PSGD) procedure, where each worker computes local stochastic gradients in parallel, aggregates all local gradients, and updates its own local solution using the average of all gradients. Such a data-parallel training strategy with $N$ workers has $O(1/(NT))$ convergence for strongly convex minimization and $O(1/\sqrt{NT})$ convergence for smooth non-convex stochastic minimization, both of which are $N$ times faster than SGD with a single worker (Dekel et al., 2012; Ghadimi & Lan, 2013; Lian et al., 2015). This is known as the linear speedup (with respect to the number of nodes) property of PSGD. (The linear speedup property is desirable for parallel computing algorithms since it means the algorithm's computation capability can be expanded with perfect horizontal scalability.)

However, such linear speedup is often not attainable in practice because PSGD involves additional coordination and communication cost, as most other distributed/parallel algorithms do. In particular, PSGD requires aggregating local batch gradients among all workers after every evaluation of a local batch gradient. The corresponding communication cost for gradient aggregation is quite heavy and often becomes the performance bottleneck.

Since the number of inter-node communication rounds in PSGD over multiple nodes is equal to the number of batches, it is desirable to use larger batch sizes to reduce communication overhead, as long as the large batch sizes do not damage the overall computation complexity (in terms of the number of accesses of the SFO). For training deep neural networks, practitioners have observed that SGD using dynamically increasing batch sizes can converge to similar test accuracy within the same number of epochs but with a significantly smaller number of batches when compared with SGD with small fixed batch sizes (Devarakonda et al., 2017; Smith et al., 2018).
The idea of using large or increasing batch sizes can be partially backed by some recent theoretical works (Bottou et al., 2018; De et al., 2017). It is shown in (De et al., 2017) that if the batch size is sufficiently large such that the randomness, i.e., the variance, is dominated by the gradient magnitude, then SGD essentially degrades to deterministic gradient descent. However, in the worst case, e.g., for stochastic optimization (1) or large-scale optimization with a limited budget of SFO accesses, SGD with the large batch sizes considered in (De et al., 2017) can have worse convergence performance than SGD with fixed small batch sizes (Bottou & Bousquet, 2008; Bottou et al., 2018). For strongly convex stochastic minimization, it is proven in (Friedlander & Schmidt, 2012; Bottou et al., 2018) that SGD with exponentially increasing batch sizes attains the same $O(1/T)$ convergence as SGD with fixed small batch sizes, where $T$ is the number of accesses of the SFO. The results in (Friedlander & Schmidt, 2012; Bottou et al., 2018) are encouraging since they mean that using exponentially increasing batch sizes can preserve the low $O(1/T)$ computation complexity with $\log(T)$ communication complexity, which is significantly lower than the $O(T)$ required by SGD with fixed batch sizes for distributed strongly convex stochastic minimization. However, the computation and communication complexity remains under-explored for distributed stochastic non-convex optimization, which is the case of training deep neural networks. While (Smith & Le, 2018; Smith et al., 2018) justify SGD with increasing batch sizes by relating it to the integration of a stochastic differential equation for which decreasing learning rates can roughly compensate the effect of increasing batch sizes, a rigorous theoretical characterization of its computation and communication complexity (as in (Nemirovski et al., 2009; Bottou et al., 2018)) is missing for stochastic non-convex optimization. In general, it remains unclear: "Can using dynamic batch sizes in parallel SGD yield the same fast $O(1/\sqrt{NT})$ convergence rate (with linear speedup with respect to the number of nodes) as the classical PSGD for non-convex optimization?" and "What is the corresponding communication complexity of using dynamic batch sizes to solve distributed non-convex optimization?"

Our Contributions:
This paper aims to characterize both the computation and communication complexity of using dynamically increasing batch sizes in SGD to solve stochastic non-convex optimization with $N$ parallel workers. We first consider non-convex optimization satisfying the Polyak-Lojasiewicz (P-L) condition, which can be viewed as a generalization of strong convexity to non-convex optimization. We show that by simply increasing the batch sizes exponentially at each worker (formally described in Algorithm 1) in the classical data-parallel SGD, we can solve non-convex optimization with the fast $O(1/(NT))$ convergence using only $O(\log(T))$ communication rounds. For general stochastic non-convex optimization (without the P-L condition), we propose a Catalyst-like (Lin et al., 2015; Paquette et al., 2018) approach (formally described in Algorithm 2) that wraps Algorithm 1 with an outer loop that iteratively introduces auxiliary problems. We show that Algorithm 2 solves general stochastic non-convex optimization with $O(1/\sqrt{NT})$ computation complexity and $O(\sqrt{NT}\log(TN))$ communication complexity. In both cases, using dynamic batch sizes achieves the linear speedup of convergence with communication complexity less than that of existing communication-efficient parallel SGD methods with fixed batch sizes (Stich, 2018; Yu et al., 2018).
2. Non-Convex Minimization Under the P-L Condition
This section considers problem (1) satisfying the Polyak-Lojasiewicz (P-L) condition defined in Assumption 2.
Assumption 2.
The objective function $f(x)$ in problem (1) satisfies the Polyak-Lojasiewicz (P-L) condition with modulus $\mu > 0$. That is,
$$\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^*\big), \quad \forall x \qquad (3)$$
where $f^*$ is the global minimum of problem (1).

The P-L condition was originally introduced by Polyak in (Polyak, 1963) and holds for many machine learning models. Neither the convexity of $f(x)$ nor the uniqueness of its global minimizer is required by the P-L condition. In particular, the P-L condition is weaker than many other popular conditions used in the optimization literature, e.g., strong convexity and the error bound condition (Karimi et al., 2016). See, e.g., Fact 1.

Fact 1 (Appendix A in (Karimi et al., 2016)). If a smooth function $\phi: \mathbb{R}^m \mapsto \mathbb{R}$ is strongly convex with modulus $\mu > 0$, then it satisfies the P-L condition with the same modulus $\mu$.

One important example is $f(x) = g(Ax)$ with strongly convex $g(\cdot)$ and a possibly rank-deficient matrix $A$, e.g., $f(x) = \|Ax - b\|^2$ used in least-squares regression. While $f(x) = g(Ax)$ is not strongly convex when $A$ is rank deficient, it turns out that such $f(x)$ always satisfies the P-L condition (Karimi et al., 2016).

Consider the Communication Reduced Parallel Stochastic Gradient Descent (CR-PSGD) algorithm described in Algorithm 1. The inputs of CR-PSGD are: (1) $N$, the number of parallel workers; (2) $T$, the total number of gradient evaluations at each worker; (3) $x_1$, the common initial point at each worker; (4) $\gamma > 0$, the learning rate; (5) $B_1$, the initial SGD batch size at each worker; and (6) $\rho > 1$, the batch size scaling factor. Compared with the classical PSGD, our CR-PSGD makes only the minor change that each worker exponentially increases its own SGD batch size by the factor $\rho$. Since $B_t$ increases exponentially, it is easy to see that the "while" loop in Algorithm 1 terminates after at most $O(\log T)$ iterations. Meanwhile, we note that inter-worker communication is used only to aggregate the individual batch SGD averages and happens only once in each "while" loop iteration. As a consequence, CR-PSGD involves only $O(\log T)$ rounds of communication. The remaining part of this section proves that CR-PSGD has $O(1/(NT))$ convergence. A minimal runnable sketch of the procedure is given right after the pseudocode below.

Algorithm 1 CR-PSGD$(f, N, T, x_1, B_1, \rho, \gamma)$
Input: $N$, $T$, $x_1 \in \mathbb{R}^m$, $\gamma$, $B_1$ and $\rho > 1$. Initialize $t = 1$.
while $\sum_{\tau=1}^{t} B_\tau \le T$ do
  Each worker $i$ observes $B_t$ unbiased i.i.d. stochastic gradients at point $x_t$, given by $g_{i,j} \triangleq \nabla F(x_t; \zeta_{i,j})$, $j \in \{1, \ldots, B_t\}$, $\zeta_{i,j} \sim \mathcal{D}$, and calculates its batch SGD average $\bar{g}_{t,i} = \frac{1}{B_t}\sum_{j=1}^{B_t} g_{i,j}$.
  Aggregate all $\bar{g}_{t,i}$ from the $N$ workers and compute their average $\bar{g}_t = \frac{1}{N}\sum_{i=1}^{N}\bar{g}_{t,i}$.
  Update $x_{t+1}$ over all $N$ workers in parallel via $x_{t+1} = x_t - \gamma\bar{g}_t$.
  Set $B_{t+1} = \lfloor \rho^t B_1 \rfloor$, where $\lfloor z \rfloor$ denotes the largest integer no greater than $z$.
  Update $t \leftarrow t + 1$.
end while
Return: $x_t$
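The following is a minimal single-process sketch of Algorithm 1, useful only to make the loop structure concrete; it is an illustration under stated assumptions, not a distributed implementation. Here grad_oracle is an assumed stand-in for one worker's batch-averaged call to the SFO of Assumption 1.

```python
import numpy as np

def cr_psgd(grad_oracle, num_workers, T, x1, B1, rho, gamma):
    """Single-process simulation of CR-PSGD (Algorithm 1).

    grad_oracle(x, b) is assumed to return the average of b i.i.d. unbiased
    stochastic gradients of f at x (one worker's local batch-SGD average).
    """
    x = np.asarray(x1, dtype=float)
    t, B_t, total_sfo = 1, B1, B1
    while total_sfo <= T:                        # while sum_{tau <= t} B_tau <= T
        # each worker i computes its local batch average g_bar_{t,i} (in parallel in a real system)
        local_avgs = [grad_oracle(x, B_t) for _ in range(num_workers)]
        g_bar = np.mean(local_avgs, axis=0)      # the single communication round of this iteration
        x = x - gamma * g_bar                    # common update x_{t+1} = x_t - gamma * g_bar_t
        B_t = int(np.floor(rho ** t * B1))       # B_{t+1} = floor(rho^t * B_1)
        total_sfo += B_t
        t += 1
    return x
```

In a true multi-node deployment, the list comprehension is replaced by local computation on each worker and the averaging step by a single all-reduce, which is exactly the one communication round per loop iteration counted in the analysis.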
Similar ideas of exponentially increasing batch sizes appear in other works, e.g., (Hazan & Kale, 2014; Zhang et al., 2013), for different purposes and with different algorithm dynamics. In this paper, we explore this idea in the context of parallel stochastic optimization. It is impressive that such a simple idea enables us to obtain a parallel algorithm achieving the fast $O(1/(NT))$ convergence with only $O(\log T)$ rounds of communication for stochastic optimization under the P-L condition. When considering stochastic strongly convex minimization, which is a subclass of stochastic optimization under the P-L condition, the $O(\log T)$ communication complexity attained by our CR-PSGD is significantly less than the $O(\sqrt{NT})$ communication complexity attained by the local SGD method in (Stich, 2018).

The next simple lemma relates the per-iteration error to the batch sizes and is a key property for establishing the convergence rate of Algorithm 1.

Lemma 1.
Consider problem (1) under Assumptions 1-2. If we choose $\gamma < \frac{1}{L}$ in Algorithm 1, then for all $t \in \{1, 2, \ldots\}$, we have
$$\mathbb{E}[f(x_{t+1}) - f^*] \le (1 - \nu)\,\mathbb{E}[f(x_t) - f^*] + \frac{\gamma(2 - L\gamma)}{2 N B_t}\sigma^2 \qquad (4)$$
where $f^*$ is the global minimum of problem (1) and $\nu \triangleq \frac{1}{2}\gamma\mu(1 - L\gamma)$ satisfies $0 < \nu < 1$.

Proof. Fix $t \ge 1$. By the smoothness of $f(x)$ in Assumption 1, we have
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&\overset{(a)}{=} f(x_t) - \gamma\langle\nabla f(x_t), \bar{g}_t\rangle + \frac{L\gamma^2}{2}\|\bar{g}_t\|^2 \\
&= f(x_t) + \gamma\langle\bar{g}_t - \nabla f(x_t), \bar{g}_t\rangle - \gamma\|\bar{g}_t\|^2 + \frac{L\gamma^2}{2}\|\bar{g}_t\|^2 \\
&\overset{(b)}{\le} f(x_t) + \frac{\gamma}{2}\|\bar{g}_t - \nabla f(x_t)\|^2 + \frac{\gamma(L\gamma - 1)}{2}\|\bar{g}_t\|^2 \\
&\overset{(c)}{\le} f(x_t) + \frac{\gamma(L\gamma - 1)}{4}\|\nabla f(x_t)\|^2 + \frac{\gamma(2 - L\gamma)}{2}\|\bar{g}_t - \nabla f(x_t)\|^2 \\
&\overset{(d)}{\le} f(x_t) + \frac{1}{2}\gamma\mu(L\gamma - 1)\big(f(x_t) - f^*\big) + \frac{\gamma(2 - L\gamma)}{2}\|\bar{g}_t - \nabla f(x_t)\|^2
\end{aligned} \qquad (5)$$
where (a) follows by substituting $x_{t+1} = x_t - \gamma\bar{g}_t$; (b) follows by applying the elementary inequality $\langle u, v\rangle \le \frac{1}{2}\|u\|^2 + \frac{1}{2}\|v\|^2$ with $u = \bar{g}_t - \nabla f(x_t)$ and $v = \bar{g}_t$; (c) follows by noting that $L\gamma - 1 < 0$ under our selection of $\gamma$ and applying the elementary inequality $\|u + v\|^2 \ge \frac{1}{2}\|u\|^2 - \|v\|^2$ with $u = \nabla f(x_t)$ and $v = \bar{g}_t - \nabla f(x_t)$; and (d) follows by noting that $\frac{\gamma}{4}(L\gamma - 1) < 0$ under our selection of $\gamma$ and that $\|\nabla f(x_t)\|^2 \ge 2\mu(f(x_t) - f^*)$ by Assumption 2.

Defining $\nu \triangleq \frac{1}{2}\gamma\mu(1 - L\gamma)$, subtracting $f^*$ from both sides of (5), and rearranging terms yields
$$f(x_{t+1}) - f^* \le (1 - \nu)\big(f(x_t) - f^*\big) + \frac{\gamma(2 - L\gamma)}{2}\|\bar{g}_t - \nabla f(x_t)\|^2 \qquad (6)$$
Taking expectations on both sides and noting that $\mathbb{E}[\|\bar{g}_t - \nabla f(x_t)\|^2] \le \frac{\sigma^2}{N B_t}$, which follows from Assumption 1 and the fact that each $\bar{g}_t$ is the average of $N B_t$ i.i.d. stochastic gradients evaluated at the same point, yields
$$\mathbb{E}[f(x_{t+1}) - f^*] \le (1 - \nu)\,\mathbb{E}[f(x_t) - f^*] + \frac{\gamma(2 - L\gamma)}{2 N B_t}\sigma^2.$$
It remains to verify that $0 < \nu < 1$. Since $\gamma < \frac{1}{L}$, it is easy to see $\nu > 0$. Next, we show $\frac{1}{2}\gamma\mu(1 - L\gamma) < 1$. By the smoothness of $f(x)$ (and Fact 3 in Supplement 6.1), we have
$$\|\nabla f(x)\|^2 \le 2L\big(f(x) - f^*\big), \quad \forall x \qquad (7)$$
By Assumption 2, we have
$$\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^*\big), \quad \forall x \qquad (8)$$
Inequalities (7) and (8) together imply that $\mu \le L$, which further implies that $\frac{1}{2}\gamma\mu(1 - L\gamma) \le \frac{1}{2}\gamma L(1 - L\gamma) < 1$.
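As a quick numerical illustration (not part of the analysis), one can iterate the right-hand side of (4) to see how the error bound behaves when the batch size grows geometrically; all constants below are arbitrary placeholder values:

```python
L, mu, sigma2, N = 1.0, 0.1, 1.0, 10           # placeholder problem constants
gamma = 0.5 / L                                # satisfies gamma < 1/L
nu = 0.5 * gamma * mu * (1 - L * gamma)        # contraction factor from Lemma 1
B1 = 2
err, B = 1.0, B1                               # err plays the role of E[f(x_t) - f*]
for t in range(1, 30):
    err = (1 - nu) * err + gamma * (2 - L * gamma) / (2 * N * B) * sigma2   # recursion (4)
    B = int(rho_B := 1.2 ** t * B1)            # B_{t+1} = floor(rho^t * B_1) with rho = 1.2
    print(t, round(err, 6))
# the bound contracts geometrically while the variance term shrinks like 1/B_t
```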
Remark 1. Note that by adapting steps (b) and (c) of (5) in the proof of Lemma 1, i.e., by using inequalities with slightly different coefficients for the squared-norm terms, we can obtain (4) with different values of $\nu$. Larger $\nu$ (with possibly more stringent conditions on the selection of $\gamma$) may lead to faster convergence (of the same order) for Algorithm 1. This paper does not explore this direction further since the current simple analysis already suffices to provide the desired order of convergence and communication. Such a finer development of $\nu$ can improve the constant factors in the rates but not their order. Nevertheless, it is worth pointing out that a finer development of $\nu$ can help guide practitioners in tuning Algorithm 1 for their specific minimization problems.

The $O(\frac{1}{NT})$ convergence with $O(\log T)$ communication rounds is summarized in Theorem 1.
Theorem 1. Consider problem (1) under Assumptions 1-2. Let $T > 0$ be a given constant. If we choose $B_1 \ge 1$, $\gamma < \frac{1}{L}$, and $1 < \rho < \frac{1}{1-\nu}$, where $\nu \triangleq \frac{1}{2}\gamma\mu(1 - L\gamma)$, in Algorithm 1, then the final output $x_t$ returned by Algorithm 1 satisfies
$$\mathbb{E}[f(x_t) - f^*] \le \frac{c_1(f(x_1) - f^*)}{T^{1+\delta}} + \frac{c_2}{NT} = O\Big(\frac{1}{T^{1+\delta}}\Big) + O\Big(\frac{1}{NT}\Big) \qquad (9)$$
where $\delta \triangleq \log_\rho\big(\frac{1}{1-\nu}\big) - 1 > 0$, $c_1 = \frac{1}{1-\nu}\big(\frac{B_1}{\rho-1}\big)^{1+\delta}$, $c_2 = \frac{\rho^2\gamma(2 - L\gamma)\sigma^2}{(1 - (1-\nu)\rho)(\rho - 1)}$, and $f^*$ is the minimum value of problem (1). (It is shown at the end of the proof of Lemma 1 that $0 < \nu < 1$ under the selection $\gamma < \frac{1}{L}$, so the interval $(1, \frac{1}{1-\nu})$ for $\rho$ is non-empty.)

Proof. See Supplement 6.2.
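As a small illustration of the $O(\log T)$ communication count, the number of loop iterations of Algorithm 1 can be estimated directly from the geometric batch schedule; the values of B1 and rho below are arbitrary placeholders:

```python
import math

def cr_psgd_rounds(T, B1=1, rho=1.05):
    # approximate number of "while" iterations (= communication rounds) of Algorithm 1:
    # the smallest t with B1*(rho^t - 1)/(rho - 1) > T, ignoring the floor applied to each B_t
    return math.ceil(math.log(T * (rho - 1) / B1 + 1, rho))

for T in (10**3, 10**4, 10**5, 10**6):
    print(T, cr_psgd_rounds(T))
# the round count grows roughly like log(T); fixed-batch PSGD with the same per-worker
# budget and batch size B1 would need about T/B1 communication rounds instead
```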
Remark 2.
Since $\delta > 0$, $O(\frac{1}{T^{1+\delta}})$ decays faster than $O(\frac{1}{NT})$ when $T$ is sufficiently large. In fact, we can even explicitly choose a suitable $\rho$ to make $\delta$ sufficiently large; e.g., we can choose $1 < \rho < \sqrt{\frac{1}{1-\nu}}$ to ensure $\delta > 1$, so that $O(\frac{1}{T^{1+\delta}}) < O(\frac{1}{T^2})$. In this case, as long as $T \ge N$, which is almost always true in practice, the error bound on the right side of (29) has order $O(\frac{1}{NT})$.

Recall that if $f(x)$ is strongly convex with modulus $\mu$, then it satisfies Assumption 2 with the same $\mu$ by Fact 1. Furthermore, if $f(x)$ is strongly convex with modulus $\mu > 0$, then problem (1) has a unique minimizer $x^*$ and $\|x - x^*\|^2 \le \frac{2}{\mu}(f(x) - f(x^*))$ for any $x$ (see, e.g., Fact 4 in Supplement 6.1). Thus, we have the following corollary of Theorem 1.

Corollary 1.
Consider problem (1) under Assumption 1, where $f(x)$ is strongly convex with modulus $\mu > 0$. Under the same conditions as in Theorem 1, the final output $x_t$ returned by Algorithm 1 satisfies
$$\mathbb{E}[\|x_t - x^*\|^2] \le \frac{2c_1(f(x_1) - f(x^*))}{\mu T^{1+\delta}} + \frac{2c_2}{\mu N T} = O\Big(\frac{1}{T^{1+\delta}}\Big) + O\Big(\frac{1}{NT}\Big) \qquad (10)$$
where $\delta$, $c_1$, $c_2$ are the positive constants defined in Theorem 1 and $x^*$ is the unique minimizer of problem (1).
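For completeness, Corollary 1 follows in one step from Theorem 1 combined with Fact 4 in Supplement 6.1, using the constants defined above:
$$\mathbb{E}[\|x_t - x^*\|^2] \;\le\; \frac{2}{\mu}\,\mathbb{E}[f(x_t) - f(x^*)] \;\le\; \frac{2c_1(f(x_1) - f(x^*))}{\mu T^{1+\delta}} + \frac{2c_2}{\mu N T}.$$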
Remark 3. Recall that $O(1/T)$ convergence is optimal for stochastic strongly convex optimization (Nemirovsky & Yudin, 1983; Rakhlin et al., 2012) over a single node. Since the convergence of Algorithm 1 scales out perfectly with respect to the number of involved workers, and strongly convex functions are a subclass of functions satisfying the P-L condition, we can conclude that the $O(\frac{1}{NT})$ convergence attained by Algorithm 1 is optimal for parallel stochastic optimization under the P-L condition. It is also worth noting that we consider the general stochastic optimization (1), so that acceleration techniques developed for finite-sum optimization, e.g., variance reduction, are excluded from consideration.
3. General Non-Convex Minimization
Let $f(x)$ be the (stochastic) objective function in problem (1). For any given fixed $y$, define a new function with respect to $x$ given by
$$h_\theta(x; y) \triangleq f(x) + \frac{\theta}{2}\|x - y\|^2 \qquad (11)$$
It is easy to verify that if $f(x)$ is smooth with modulus $L$ and $\theta > L$, then $h_\theta(x; y)$ is both smooth with modulus $\theta + L$ and strongly convex with modulus $\theta - L > 0$. Furthermore, if $\nabla F(x;\zeta)$ are unbiased i.i.d. stochastic gradients of $f(\cdot)$ with variance bounded by $\sigma^2$, then $\nabla F(x;\zeta) + \theta(x - y)$ are unbiased i.i.d. stochastic gradients of $h_\theta(x; y)$ with the same variance bound.

Now consider Algorithm 2, which wraps CR-PSGD with an outer loop that updates $h_\theta(x; y^{(k-1)})$ and applies CR-PSGD to minimize it.

Algorithm 2 CR-PSGD-Catalyst$(f, N, T, \theta, y_0, B_1, \rho, \gamma)$
Input: $N$, $T$, $\theta$, $y_0 \in \mathbb{R}^m$, $\gamma$, $B_1$ and $\rho > 1$. Initialize $y^{(0)} = y_0$ and $k = 1$.
while $k \le \lfloor\sqrt{NT}\rfloor$ do
  Define $h_\theta(x; y^{(k-1)})$ using (11).
  Update $y^{(k)}$ via $y^{(k)} = \text{CR-PSGD}\big(h_\theta(\cdot; y^{(k-1)}), N, \lfloor\sqrt{T/N}\rfloor, y^{(k-1)}, B_1, \rho, \gamma\big)$.
  Update $k \leftarrow k + 1$.
end while

Note that $h_\theta(x; y^{(k-1)})$ augments the objective function $f(x)$ with an iteratively updated proximal term $\frac{\theta}{2}\|x - y^{(k-1)}\|^2$. The introduction of the proximal terms $\frac{\theta}{2}\|x - y^{(k-1)}\|^2$ is inspired by earlier works (Güler, 1992; He & Yuan, 2012; Salzo & Villa, 2012; Lin et al., 2015; Yu & Neely, 2017; Davis & Grimmer, 2017; Paquette et al., 2018) on proximal point methods, which solve a minimization problem by solving a sequence of auxiliary problems involving a quadratic proximal term. By choosing $\theta > L$ in (11), we ensure $h_\theta(x; y^{(k-1)})$ is both smooth and strongly convex. For strongly convex $h_\theta(x; y^{(k-1)})$, Theorem 1 and Corollary 1 show that CR-PSGD$(h_\theta(\cdot; y^{(k-1)}), N, \lfloor\sqrt{T/N}\rfloor, y^{(k-1)}, B_1, \rho, \gamma)$ returns an $O(\frac{1}{\sqrt{NT}})$-approximate minimizer with only $O(\log(TN))$ communication rounds. The ultimate goal of the proximal-point-like outer loop introduced in Algorithm 2 is to lift the "communication reduction" property of CR-PSGD from non-convex minimization under the restrictive P-L condition to general non-convex minimization with reduced communication. Our method shares a similar philosophy with the "catalyst acceleration" in (Lin et al., 2015), which also uses a "proximal-point" outer loop to achieve improved convergence rates for convex minimization by lifting fast convergence from strongly convex minimization. In this perspective, we call Algorithm 2 "CR-PSGD-Catalyst", borrowing the word "catalyst" from (Lin et al., 2015). While both Algorithm 2 and "catalyst acceleration" use a proximal point outer loop to lift desired algorithmic properties from specific problems to generic problems, they differ in the following two aspects:

• The "catalyst acceleration" in (Lin et al., 2015; Paquette et al., 2018) is developed to accelerate a wide range of first-order deterministic minimization methods, e.g., gradient-based methods and their randomized variants such as SAG, SAGA, SDCA, and SVRG, for both convex and non-convex cases. In particular, it requires the existence of a subprocedure with linear convergence for strongly convex minimization. It is remarked in (Lin et al., 2015) that whether "catalyst" can accelerate stochastic gradient based methods for stochastic minimization in the sense of (Nemirovski et al., 2009) remains unclear. In contrast, our CR-PSGD-Catalyst can solve general stochastic minimization, which does not necessarily have a finite-sum form, with i.i.d. stochastic gradients; the CR-PSGD subprocedure it uses is different from the linearly converging subprocedures used in (Lin et al., 2015; Paquette et al., 2018). (For finite-sum minimization, it is possible to develop linearly converging solvers by using techniques such as variance reduction. However, for general strongly convex stochastic minimization, it is in general impossible to develop a linearly converging stochastic gradient based solver, and the fastest possible convergence is $O(1/T)$ (Rakhlin et al., 2012; Hazan & Kale, 2014; Lacoste-Julien et al., 2012). That is, stochastic minimization fundamentally fails to satisfy the prerequisite in (Lin et al., 2015; Paquette et al., 2018).)

• The "proximal point" outer loop used in "catalyst acceleration" is solely to accelerate convergence (Lin et al., 2015; Paquette et al., 2018).
In contrast, the "proximal point" outer loop used in our CR-PSGD-Catalyst provides convergence acceleration and communication reduction simultaneously. Our analysis is also significantly different from the analyses for conventional "catalyst acceleration".

Since each call of CR-PSGD in Algorithm 2 requires only $O(\log(TN))$ inter-worker communication rounds and there are $\sqrt{NT}$ calls of CR-PSGD, it is easy to see that CR-PSGD-Catalyst in total uses $O(\sqrt{NT}\log(TN))$ communication rounds. The $O(\sqrt{NT}\log(TN))$ communication complexity of CR-PSGD-Catalyst for general non-convex stochastic optimization is significantly less than the $O(T)$ communication complexity attained by PSGD (Dekel et al., 2012; Ghadimi & Lan, 2013; Lian et al., 2015) or the $O(N^{3/4}T^{3/4})$ communication complexity required by local SGD (Yu et al., 2018). (For non-convex optimization, local SGD is more widely known as periodic model averaging or parallel restarted SGD, since each worker periodically restarts its independent SGD procedure from a new initial point that is the average of all individual models (Yu et al., 2018; Wang & Joshi, 2018; Jiang & Agrawal, 2018).) The next theorem summarizes that our CR-PSGD-Catalyst achieves the fastest known $O(1/\sqrt{NT})$ convergence, previously attained by PSGD or local SGD.
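Before the theorem, here is a minimal sketch of Algorithm 2 that reuses the cr_psgd function sketched in Section 2; as before, grad_oracle is an assumed stand-in for the SFO of $f$, and the proximal term of (11) is handled by shifting the gradient:

```python
import numpy as np

def cr_psgd_catalyst(grad_oracle, num_workers, T, theta, y0, B1, rho, gamma):
    y = np.asarray(y0, dtype=float)
    K = int(np.floor(np.sqrt(num_workers * T)))         # number of outer iterations
    T_inner = int(np.floor(np.sqrt(T / num_workers)))   # per-worker SFO budget of each CR-PSGD call
    history = [y]
    for _ in range(K):
        y_prev = y.copy()
        # stochastic gradient of h_theta(x; y_prev) = f(x) + (theta/2)*||x - y_prev||^2
        def prox_oracle(x, b, y_prev=y_prev):
            return grad_oracle(x, b) + theta * (x - y_prev)
        y = cr_psgd(prox_oracle, num_workers, T_inner, y_prev, B1, rho, gamma)
        history.append(y)
    return history
```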
Theorem 2. Consider problem (1) under Assumption 1. If we choose $\theta > L$, $B_1 \ge 1$, $\gamma < \frac{1}{\theta+L}$, and $1 < \rho < \frac{1}{1-\nu}$, where $\nu \triangleq \frac{1}{2}\gamma(\theta-L)(1-(\theta+L)\gamma)$, in Algorithm 2, and if $T \ge \max\big\{N,\; N\big(\frac{4c_1(\theta+L)^2}{(\theta-L)^2}\big)^{\frac{2}{1+\delta}},\; N c_1^{\frac{2}{1+\delta}}\big\}$, with $\delta$ and $c_1$ being the constants of Theorem 1 for the auxiliary problems (i.e., with $L$ replaced by $\theta+L$ and $\mu$ replaced by $\theta-L$), then we have
$$\frac{1}{\sqrt{NT}}\sum_{k=1}^{\sqrt{NT}}\mathbb{E}[\|\nabla f(y^{(k)})\|^2] = O\Big(\frac{1}{\sqrt{NT}}\Big)$$
where $\{y^{(k)}, k \ge 1\}$ is the sequence of solutions returned by the CR-PSGD subprocedure.

Proof. For simplicity, we assume $\sqrt{NT}$ and $\sqrt{T/N}$ are integers and hence $\lfloor\sqrt{NT}\rfloor = \sqrt{NT}$ and $\lfloor\sqrt{T/N}\rfloor = \sqrt{T/N}$; this can be ensured, e.g., when $T = Nq^2$ for an integer $q$. In general, even if $\sqrt{NT}$ or $\sqrt{T/N}$ are non-integers, by using the fact that $z/2 \le \lfloor z\rfloor \le z$ for any $z \ge 1$, the same order of convergence extends easily to that case.

Fix $k \ge 1$ and consider the stochastic minimization $\min_{x\in\mathbb{R}^m} h_\theta(x; y^{(k-1)})$. Since $h_\theta(x; y^{(k-1)})$ is strongly convex with modulus $\theta - L > 0$, it also satisfies the P-L condition with modulus $\theta - L$ by Fact 1. At the same time, $h_\theta(x; y^{(k-1)})$ is smooth with modulus $\theta + L$. Note that our selections of $B_1$, $\gamma$, and $\rho$ satisfy the conditions of Theorem 1 for stochastic minimization under the P-L condition. Denote $y_*^{(k)} \triangleq \operatorname{argmin}_{x\in\mathbb{R}^m} h_\theta(x; y^{(k-1)})$. Recall that $y^{(k)}$ is the solution returned by CR-PSGD run with its $T$ parameter set to $\sqrt{T/N}$. By Theorem 1, we have
$$\mathbb{E}[h_\theta(y^{(k)}; y^{(k-1)}) - h_\theta(y_*^{(k)}; y^{(k-1)})] \le c_1\Big(\frac{N}{T}\Big)^{\frac{1+\delta}{2}}\mathbb{E}[h_\theta(y^{(k-1)}; y^{(k-1)}) - h_\theta(y_*^{(k)}; y^{(k-1)})] + \frac{c_2}{\sqrt{NT}} \qquad (12)$$
where $\delta \triangleq \log_\rho\big(\frac{1}{1-\nu}\big) - 1 > 0$, $c_1 = \frac{1}{1-\nu}\big(\frac{B_1}{\rho-1}\big)^{1+\delta}$, and $c_2 = \frac{\rho^2\gamma(2-(\theta+L)\gamma)\sigma^2}{(1-(1-\nu)\rho)(\rho-1)}$ are absolute constants independent of $T$.

Since $h_\theta(\cdot; y^{(k-1)})$ is smooth with modulus $\theta + L$ and $y_*^{(k)}$ minimizes it, by Fact 3 (in Supplement 6.1), we have
$$\frac{1}{2(\theta+L)}\|\nabla h_\theta(y^{(k)}; y^{(k-1)})\|^2 \le h_\theta(y^{(k)}; y^{(k-1)}) - h_\theta(y_*^{(k)}; y^{(k-1)}) \qquad (13)$$
On the other hand, we also have
$$\begin{aligned}
h_\theta(y^{(k-1)}; y^{(k-1)}) - h_\theta(y_*^{(k)}; y^{(k-1)}) &\overset{(a)}{\le} \frac{\theta+L}{2}\|y^{(k-1)} - y_*^{(k)}\|^2 \\
&\overset{(b)}{\le} (\theta+L)\|y^{(k)} - y_*^{(k)}\|^2 + (\theta+L)\|y^{(k)} - y^{(k-1)}\|^2 \\
&\overset{(c)}{\le} \frac{\theta+L}{(\theta-L)^2}\|\nabla h_\theta(y^{(k)}; y^{(k-1)})\|^2 + (\theta+L)\|y^{(k)} - y^{(k-1)}\|^2
\end{aligned} \qquad (14)$$
where (a) follows from Fact 2 (in Supplement 6.1), recalling again that $h_\theta(\cdot; y^{(k-1)})$ is smooth with modulus $\theta + L$ and $y_*^{(k)}$ minimizes it; (b) follows because $\|y^{(k-1)} - y_*^{(k)}\|^2 \le 2\|y^{(k)} - y_*^{(k)}\|^2 + 2\|y^{(k)} - y^{(k-1)}\|^2$, which in turn follows by applying the basic inequality $\|u - v\|^2 \le 2\|u\|^2 + 2\|v\|^2$ with $u = y^{(k)} - y_*^{(k)}$ and $v = y^{(k)} - y^{(k-1)}$; and (c) follows because $\|y^{(k)} - y_*^{(k)}\| \le \frac{1}{\theta-L}\|\nabla h_\theta(y^{(k)}; y^{(k-1)})\|$, which in turn follows from Fact 5 (in Supplement 6.1) by noting that $h_\theta(\cdot; y^{(k-1)})$ is strongly convex with modulus $\theta - L$ and $y_*^{(k)}$ minimizes it.

Substituting (13) and (14) into (12) and rearranging terms yields
$$\Big(\underbrace{\frac{1}{2(\theta+L)} - c_1\frac{\theta+L}{(\theta-L)^2}\Big(\frac{N}{T}\Big)^{\frac{1+\delta}{2}}}_{\triangleq\,\alpha}\Big)\mathbb{E}[\|\nabla h_\theta(y^{(k)}; y^{(k-1)})\|^2] \le c_1(\theta+L)\Big(\frac{N}{T}\Big)^{\frac{1+\delta}{2}}\mathbb{E}[\|y^{(k)} - y^{(k-1)}\|^2] + \frac{c_2}{\sqrt{NT}} \qquad (15)$$
Note that $T \ge N\big(\frac{4c_1(\theta+L)^2}{(\theta-L)^2}\big)^{\frac{2}{1+\delta}}$ ensures the term marked by the underbrace in (15) satisfies $\alpha \ge \frac{1}{4(\theta+L)}$.
Thus, (15) implies that
$$\frac{1}{4(\theta+L)}\mathbb{E}[\|\nabla h_\theta(y^{(k)}; y^{(k-1)})\|^2] \le c_1(\theta+L)\Big(\frac{N}{T}\Big)^{\frac{1+\delta}{2}}\mathbb{E}[\|y^{(k)} - y^{(k-1)}\|^2] + \frac{c_2}{\sqrt{NT}} \qquad (16)$$
By the definition of $h_\theta(\cdot; y^{(k-1)})$, we have $\nabla h_\theta(y^{(k)}; y^{(k-1)}) = \nabla f(y^{(k)}) + \theta(y^{(k)} - y^{(k-1)})$. This implies that
$$\|\nabla f(y^{(k)})\|^2 \le 2\|\nabla h_\theta(y^{(k)}; y^{(k-1)})\|^2 + 2\theta^2\|y^{(k)} - y^{(k-1)}\|^2 \qquad (17)$$
Combining (16) and (17) yields
$$\begin{aligned}
\mathbb{E}[\|\nabla f(y^{(k)})\|^2] &\le \Big(8c_1(\theta+L)^2\Big(\frac{N}{T}\Big)^{\frac{1+\delta}{2}} + 2\theta^2\Big)\mathbb{E}[\|y^{(k)} - y^{(k-1)}\|^2] + \frac{8c_2(\theta+L)}{\sqrt{NT}} \\
&\overset{(a)}{\le} \big(8c_1(\theta+L)^2 + 2\theta^2\big)\mathbb{E}[\|y^{(k)} - y^{(k-1)}\|^2] + \frac{8c_2(\theta+L)}{\sqrt{NT}}
\end{aligned} \qquad (18)$$
where (a) follows because $(\frac{N}{T})^{\frac{1+\delta}{2}} \le 1$ as long as $T \ge N$.

Since $T \ge N c_1^{\frac{2}{1+\delta}}$ ensures $c_1(\frac{N}{T})^{\frac{1+\delta}{2}} \le 1$, by (12), we have
$$\mathbb{E}[h_\theta(y^{(k)}; y^{(k-1)}) - h_\theta(y_*^{(k)}; y^{(k-1)})] \le \mathbb{E}[h_\theta(y^{(k-1)}; y^{(k-1)}) - h_\theta(y_*^{(k)}; y^{(k-1)})] + \frac{c_2}{\sqrt{NT}} \qquad (19)$$
Cancelling the common term on both sides and substituting the definition of $h_\theta(\cdot; y^{(k-1)})$ into (19) yields
$$\mathbb{E}\Big[f(y^{(k)}) + \frac{\theta}{2}\|y^{(k)} - y^{(k-1)}\|^2\Big] \le \mathbb{E}[f(y^{(k-1)})] + \frac{c_2}{\sqrt{NT}} \qquad (20)$$
Rewriting this inequality as $\mathbb{E}[\|y^{(k)} - y^{(k-1)}\|^2] \le \frac{2}{\theta}\mathbb{E}[f(y^{(k-1)}) - f(y^{(k)})] + \frac{2c_2}{\theta\sqrt{NT}}$ and substituting it into (18) yields
$$\mathbb{E}[\|\nabla f(y^{(k)})\|^2] \le \frac{2}{\theta}\big(8c_1(\theta+L)^2 + 2\theta^2\big)\mathbb{E}[f(y^{(k-1)}) - f(y^{(k)})] + \Big(\frac{16c_1(\theta+L)^2}{\theta} + 12\theta + 8L\Big)\frac{c_2}{\sqrt{NT}} \qquad (21)$$
Summing this inequality over $k \in \{1, \ldots, \sqrt{NT}\}$ and dividing both sides by $\sqrt{NT}$ yields
$$\begin{aligned}
\frac{1}{\sqrt{NT}}\sum_{k=1}^{\sqrt{NT}}\mathbb{E}[\|\nabla f(y^{(k)})\|^2] &\le \frac{2}{\theta}\big(8c_1(\theta+L)^2 + 2\theta^2\big)\frac{\mathbb{E}[f(y^{(0)}) - f(y^{(\sqrt{NT})})]}{\sqrt{NT}} + \Big(\frac{16c_1(\theta+L)^2}{\theta} + 12\theta + 8L\Big)\frac{c_2}{\sqrt{NT}} \\
&\overset{(a)}{\le} \frac{2}{\theta}\big(8c_1(\theta+L)^2 + 2\theta^2\big)\frac{f(y^{(0)}) - f^*}{\sqrt{NT}} + \Big(\frac{16c_1(\theta+L)^2}{\theta} + 12\theta + 8L\Big)\frac{c_2}{\sqrt{NT}} = O\Big(\frac{1}{\sqrt{NT}}\Big)
\end{aligned} \qquad (22)$$
where (a) follows because $f^*$ is the global minimum of problem (1).
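As a sanity check on the budgets used above, the per-worker stochastic gradient calls and the communication rounds of Algorithm 2 add up as follows, using the loop bounds $\lfloor\sqrt{NT}\rfloor$ and $\lfloor\sqrt{T/N}\rfloor$ and the per-call round count $O(\log(TN))$ stated earlier:
$$\underbrace{\sqrt{NT}}_{\text{CR-PSGD calls}}\times\underbrace{\sqrt{T/N}}_{\text{SFO calls per worker per call}} = T, \qquad \underbrace{\sqrt{NT}}_{\text{CR-PSGD calls}}\times\underbrace{O(\log(TN))}_{\text{rounds per call}} = O\big(\sqrt{NT}\,\log(TN)\big).$$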
4. Experiments
To validate the theory developed in this paper, we conduct two numerical experiments: (1) distributed logistic regression and (2) training deep neural networks.
Consider solving an $\ell_2$-regularized logistic regression problem using multiple parallel nodes. Let $(z_{ij}, b_{ij})$ be the training pairs at node $i$, where $z_{ij} \in \mathbb{R}^d$ are $d$-dimensional feature vectors and $b_{ij} \in \{-1, 1\}$ are labels. The problem can be cast as follows:
$$\min_{x\in\mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N}\frac{1}{M_i}\sum_{j=1}^{M_i}\log\big(1 + \exp(-b_{ij} z_{ij}^T x)\big) + \frac{\mu}{2}\|x\|^2 \qquad (23)$$
where $N$ is the number of parallel workers, $M_i$ is the number of training samples available at node $i$, and $\mu$ is the regularization coefficient.

Our experiment generates a problem instance with $d = 500$, $N = 10$, an equal number of training samples $M_i$ at every node $i \in \{1, 2, \ldots, N\}$, and a small regularization coefficient $\mu$. The synthetic training feature vectors $z_{ij}$ are generated from the normal distribution $\mathcal{N}(0, I_d)$. We assume the underlying classification problem has a true weight vector $x_{\text{true}} \in \mathbb{R}^d$ generated from a standard normal distribution, and we generate the noisy labels $b_{ij} = \operatorname{sign}(z_{ij}^T x_{\text{true}} + \xi_{ij})$, where $\xi_{ij}$ is zero-mean Gaussian noise. Note that the distributed logistic regression problem (23) is strongly convex and hence satisfies Assumption 2. We run Algorithm 1, the classical parallel SGD, and the "local SGD" with communication skipping proposed in (Stich, 2018) to solve problem (23). For strongly convex stochastic optimization, all three methods are proven to achieve the fast $O(\frac{1}{NT})$ convergence. The communication complexities of these three methods are $O(\log(T))$, $O(T)$, and $O(\sqrt{NT})$, respectively; our Algorithm 1 has the lowest communication complexity. In the experiment, we choose $N = 10$, $T = 10000$, $x_1 = 0$, $B_1 = 2$, a fixed learning rate $\gamma$, and a scaling factor $\rho$ slightly larger than $1$ in Algorithm 1; for the classical parallel SGD we use a fixed small batch size and a fixed learning rate; for local SGD we use a fixed small batch size, a fixed learning rate, and the largest communication-skipping interval for which the loss at convergence does not degrade. Figures 1 and 2 plot the objective values of problem (23) versus the number of SFO accesses and the number of communication rounds, respectively. Our numerical results verify that Algorithm 1 achieves convergence similar to the fastest existing parallel SGD variants with far fewer communication rounds.
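A minimal sketch of the synthetic instance and its stochastic gradient for (23); the per-node sample count M and the regularization coefficient mu below are placeholder values rather than the exact ones used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 500, 10
M, mu = 1000, 0.01                       # placeholders for per-node sample count and regularizer

x_true = rng.standard_normal(d)
Z = [rng.standard_normal((M, d)) for _ in range(N)]               # feature vectors z_ij at node i
B = [np.sign(Zi @ x_true + rng.standard_normal(M)) for Zi in Z]   # noisy labels b_ij

def logistic_stoch_grad(x, node, batch):
    # mini-batch stochastic gradient of the regularized logistic objective (23) at one node
    idx = rng.integers(0, M, size=batch)
    z, b = Z[node][idx], B[node][idx]
    p = 1.0 / (1.0 + np.exp(b * (z @ x)))    # equals sigmoid(-b * z^T x)
    return -(b * p) @ z / batch + mu * x
```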
Figure 1. Distributed logistic regression: loss vs. number of SFO accesses.
Figure 2. Distributed logistic regression: loss vs. number of communication rounds.
Consider using deep learning for image classification on CIFAR-10 (Krizhevsky & Hinton, 2009). The loss function of a deep neural network is non-convex and typically violates Assumption 2. We run Algorithm 2, the classical parallel SGD, and "local SGD" with communication skipping (Stich, 2018; Yu et al., 2018) to train ResNet20 (He et al., 2016) with multiple GPUs. It has been shown that "local SGD", also known as parallel restarted SGD or periodic model averaging, can linearly speed up the parallel training of deep neural networks with significantly less communication overhead than the classical parallel SGD (Yu et al., 2018; Lin et al., 2018; Wang & Joshi, 2018; Jiang & Agrawal, 2018). For both parallel SGD and local SGD, we use a fixed learning rate, momentum, weight decay, and per-GPU batch size; for local SGD, we use the largest communication-skipping interval for which the loss at convergence does not degrade. For Algorithm 2, we use $B_1 = 32$, a batch size scaling factor $\rho$ slightly larger than $1$, and a fixed learning rate $\gamma$. In our experiment, each iteration of Algorithm 2 executes CR-PSGD (Algorithm 1) over one epoch of training data at each GPU; that is, the $T$ parameter in each call of Algorithm 1 equals the number of training samples processed per GPU in one epoch. The $B_\tau$ parameter in Algorithm 1 stops growing once it exceeds a fixed cap.
Figure 3. Training deep neural networks: loss vs. number of SFO accesses.
Figure 4. Training deep neural networks: loss vs. number of communication rounds.
5. Conclusion
In this paper, we explore the idea of using dynamic batch sizes for distributed non-convex optimization. For non-convex optimization satisfying the Polyak-Lojasiewicz (P-L) condition, we show that using exponentially increasing batch sizes in parallel SGD, as in Algorithm 1, achieves $O(\frac{1}{NT})$ convergence using only $O(\log(T))$ communication rounds. For general stochastic non-convex optimization (without the P-L condition), we propose a Catalyst-like algorithm that achieves $O(\frac{1}{\sqrt{NT}})$ convergence with $O(\sqrt{NT}\log(TN))$ communication rounds.

References
Bertsekas, D. P. Nonlinear Programming. Athena Scientific, second edition, 1999.

Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

Davis, D. and Grimmer, B. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arXiv:1707.03505, 2017.

De, S., Yadav, A., Jacobs, D., and Goldstein, T. Automated inference with adaptive batches. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1504-1513, 2017.

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.

Devarakonda, A., Naumov, M., and Garland, M. AdaBatch: Adaptive batch sizes for training deep neural networks. arXiv:1712.02029, 2017.

Friedlander, M. P. and Schmidt, M. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):1380-1405, 2012.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.

Güler, O. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649-664, 1992.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research, 2014.

He, B. and Yuan, X. An accelerated inexact proximal point algorithm for convex minimization. Journal of Optimization Theory and Applications, 154(2):536-548, 2012.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Jiang, P. and Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Lojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2016.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.

Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 3384-3392, 2015.

Lin, T., Stich, S. U., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv:1808.07217, 2018.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004.

Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. Catalyst for gradient-based nonconvex optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1-10, 2018.

Polyak, B. T. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, pp. 643-653, 1963.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2012.

Salzo, S. and Villa, S. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4):1167-1192, 2012.

Smith, S. L. and Le, Q. V. Understanding generalization and stochastic gradient descent. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. Don't decay the learning rate, increase the batch size. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Stich, S. U. Local SGD converges fast and communicates little. arXiv:1805.09767, 2018.

Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv:1808.07576, 2018.

Yu, H. and Neely, M. J. A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization, 27(2):759-783, 2017.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. arXiv:1807.06629, 2018.

Zhang, L., Yang, T., Jin, R., and He, X. O(log T) projections for stochastic optimization of smooth and strongly convex functions. In International Conference on Machine Learning (ICML), pp. 1121-1129, 2013.
6. Supplement
6.1. Facts for Smooth and/or Strongly Convex Functions

This section summarizes several well-known facts for smooth and/or strongly convex functions. For the convenience of the reader, we also provide self-contained proofs of these facts. Recall that if $\phi(x)$ is a smooth function with modulus $L > 0$, then we have $\phi(y) \le \phi(x) + \langle\nabla\phi(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$ for any $x$ and $y$. This property is known as the descent lemma for smooth functions; see, e.g., Proposition A.24 in (Bertsekas, 1999). The next two useful facts follow directly from the descent lemma.

Fact 2.
Let $\phi: \mathbb{R}^m \mapsto \mathbb{R}$ be a smooth function with modulus $L$. If $x^*$ is a global minimizer of $\phi$ over $\mathbb{R}^m$, then
$$\phi(x) - \phi(x^*) \le \frac{L}{2}\|x - x^*\|^2, \quad \forall x \qquad (24)$$

Proof. By the descent lemma for smooth functions, for any $x$, we have
$$\phi(x) \le \phi(x^*) + \langle\nabla\phi(x^*), x - x^*\rangle + \frac{L}{2}\|x - x^*\|^2 \overset{(a)}{=} \phi(x^*) + \frac{L}{2}\|x - x^*\|^2$$
where (a) follows from $\nabla\phi(x^*) = 0$.

Fact 3. Let $\phi: \mathbb{R}^m \mapsto \mathbb{R}$ be a smooth function with modulus $L$. We have
$$\frac{1}{2L}\|\nabla\phi(x)\|^2 \le \phi(x) - \phi^*, \quad \forall x \qquad (25)$$
where $\phi^*$ is the global minimum of $\phi(x)$.

Proof. By the descent lemma for smooth functions, for any $x, y \in \mathbb{R}^m$, we have
$$\phi(y) \le \phi(x) + \langle\nabla\phi(x), y - x\rangle + \frac{L}{2}\|y - x\|^2 \overset{(a)}{=} \phi(x) + \frac{L}{2}\Big\|y - x + \frac{1}{L}\nabla\phi(x)\Big\|^2 - \frac{1}{2L}\|\nabla\phi(x)\|^2$$
where (a) can be verified by noting that $\|y - x + \frac{1}{L}\nabla\phi(x)\|^2 = \|y - x\|^2 + \frac{2}{L}\langle\nabla\phi(x), y - x\rangle + \frac{1}{L^2}\|\nabla\phi(x)\|^2$. Minimizing both sides over $y \in \mathbb{R}^m$ yields $\phi^* \le \phi(x) - \frac{1}{2L}\|\nabla\phi(x)\|^2$.

Recall that if a smooth function $\phi(x)$ is strongly convex with modulus $\mu > 0$, then we have $\phi(y) \ge \phi(x) + \langle\nabla\phi(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$ for any $x$ and $y$. The next two useful facts follow directly from this inequality.

Fact 4.
Let a smooth function $\phi: \mathbb{R}^m \mapsto \mathbb{R}$ be strongly convex with modulus $\mu > 0$. If $x^*$ is the (unique) global minimizer of $\phi$ over $\mathbb{R}^m$, then
$$\phi(x) - \phi(x^*) \ge \frac{\mu}{2}\|x - x^*\|^2, \quad \forall x \qquad (26)$$

Proof. By the strong convexity of $\phi(x)$, for any $x$, we have
$$\phi(x) \ge \phi(x^*) + \langle\nabla\phi(x^*), x - x^*\rangle + \frac{\mu}{2}\|x - x^*\|^2 \overset{(a)}{=} \phi(x^*) + \frac{\mu}{2}\|x - x^*\|^2$$
where (a) follows from $\nabla\phi(x^*) = 0$.

Fact 5. Let a smooth function $\phi: \mathbb{R}^m \mapsto \mathbb{R}$ be strongly convex with modulus $\mu > 0$. If $x^*$ is the (unique) global minimizer of $\phi(x)$ over $\mathbb{R}^m$, then
$$\|\nabla\phi(x)\| \ge \mu\|x - x^*\|, \quad \forall x \qquad (27)$$

Proof. By Fact 4, we have $\phi(x) - \phi(x^*) \ge \frac{\mu}{2}\|x - x^*\|^2$ for all $x$. By Fact 1 and the definition of the P-L condition, we have $\|\nabla\phi(x)\|^2 \ge 2\mu(\phi(x) - \phi(x^*))$ for all $x$. Combining these two inequalities yields the desired result.

Both Fact 4 and Fact 5 are restricted to strongly convex functions. They can possibly be extended to smooth functions without strong convexity. A generalization of (26) is known as the quadratic growth condition. Similarly, a generalization of (27) is known as the error bound condition. In general, both (26) and (27), where $x^*$ should be replaced by $\mathcal{P}_{\mathcal{X}^*}[x]$, i.e., the projection of $x$ onto the set of minimizers of $\phi(x)$ when $\phi(x)$ does not have a unique minimizer, can be proven to hold as long as the smooth function $\phi(x)$ satisfies the P-L condition with the same modulus $\mu$. See Supplement A in (Karimi et al., 2016) for detailed discussions.

6.2. Proof of Theorem 1
Fix $T > 0$. Let $x_t$ be the solution returned by Algorithm 1 when it terminates. According to the "while" condition in Algorithm 1, we must have $\sum_{\tau=0}^{t-1}\lfloor\rho^\tau B_1\rfloor \ge T$, which further implies $\sum_{\tau=0}^{t-1}\rho^\tau B_1 \ge T$. Simplifying the partial sum of the geometric series and rearranging terms yields
$$t \ge \log_\rho\Big(\frac{T(\rho-1)}{B_1} + 1\Big) \overset{(a)}{\ge} \log_\rho\Big(\frac{T(\rho-1)}{B_1}\Big) \qquad (28)$$
where (a) follows because $\rho > 1$.

By Lemma 1, for all $\tau \in \{1, 2, \ldots, t-1\}$, we have
$$\mathbb{E}[f(x_{\tau+1}) - f^*] \le (1-\nu)\,\mathbb{E}[f(x_\tau) - f^*] + \frac{\gamma(2-L\gamma)}{2NB_\tau}\sigma^2 \overset{(a)}{\le} (1-\nu)\,\mathbb{E}[f(x_\tau) - f^*] + \frac{\gamma(2-L\gamma)\sigma^2}{N B_1 \rho^{\tau-1}}$$
where (a) follows by recalling that $B_\tau = \lfloor\rho^{\tau-1}B_1\rfloor$ and noting that $\lfloor z\rfloor \ge z/2$ as long as $z \ge 1$.

Recursively applying the above inequality from $\tau = 1$ to $\tau = t-1$ yields
$$\begin{aligned}
\mathbb{E}[f(x_t) - f^*] &\le (1-\nu)^{t-1}(f(x_1) - f^*) + \frac{\gamma(2-L\gamma)\sigma^2}{N B_1}\sum_{\tau=0}^{t-2}(1-\nu)^\tau\Big(\frac{1}{\rho}\Big)^{t-2-\tau} \\
&= (1-\nu)^{t-1}(f(x_1) - f^*) + \frac{\gamma(2-L\gamma)\sigma^2}{N B_1}\Big(\frac{1}{\rho}\Big)^{t-2}\sum_{\tau=0}^{t-2}\big((1-\nu)\rho\big)^\tau \\
&\overset{(a)}{\le} (1-\nu)^{t-1}(f(x_1) - f^*) + \frac{\gamma(2-L\gamma)\sigma^2}{N B_1}\Big(\frac{1}{\rho}\Big)^{t-2}\frac{1}{1-(1-\nu)\rho} \\
&\overset{(b)}{\le} \frac{1}{1-\nu}(1-\nu)^{\log_\rho\big(\frac{T(\rho-1)}{B_1}\big)}(f(x_1) - f^*) + \frac{\rho^2\gamma(2-L\gamma)\sigma^2}{(1-(1-\nu)\rho)(\rho-1)}\frac{1}{NT} \\
&\overset{(c)}{=} \frac{1}{1-\nu}\Big(\frac{B_1}{T(\rho-1)}\Big)^{\log_\rho\big(\frac{1}{1-\nu}\big)}(f(x_1) - f^*) + \frac{\rho^2\gamma(2-L\gamma)\sigma^2}{(1-(1-\nu)\rho)(\rho-1)}\frac{1}{NT} \\
&\overset{(d)}{=} \frac{1}{1-\nu}(f(x_1) - f^*)\Big(\frac{B_1}{\rho-1}\Big)^{1+\delta}\frac{1}{T^{1+\delta}} + \frac{\rho^2\gamma(2-L\gamma)\sigma^2}{(1-(1-\nu)\rho)(\rho-1)}\frac{1}{NT}
\end{aligned} \qquad (29)$$
where (a) follows by bounding the partial sum of the geometric series and noting that $(1-\nu)\rho < 1$; (b) follows by substituting (28), noting that $0 < 1-\nu < 1$ and $1/\rho < 1$ so that $(1-\nu)^{t-1} = (1-\nu)^t/(1-\nu)$ and $(1/\rho)^{t-2} = \rho^2(1/\rho)^t$ are decreasing in $t$, and using $\rho^t \ge T(\rho-1)/B_1$; (c) follows from the change-of-base identity $(1-\nu)^{\log_\rho z} = z^{\log_\rho(1-\nu)} = z^{-\log_\rho(\frac{1}{1-\nu})}$ applied with $z = T(\rho-1)/B_1$; and (d) follows because $\log_\rho(\frac{1}{1-\nu}) = 1 + \delta$ by the definition of $\delta$. Recalling the definitions of $c_1$ and $c_2$ in Theorem 1, (29) is exactly (9), which completes the proof.