Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization
Tianyi Liu, Shiyang Li†, Jianping Shi‡, Enlu Zhou, Tuo Zhao§

Abstract
Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) have been widely used in distributed machine learning, e.g., training large collaborative filtering systems and deep neural networks. Due to current technical limitations, however, establishing convergence properties of Async-MSGD for these highly complicated nonconvex problems is generally infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem: streaming PCA. This allows us to make progress toward understanding Async-MSGD and gaining new insights for more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.
∗Work in progress. †S. Li is affiliated with Harbin Institute of Technology. ‡J. Shi is affiliated with SenseTime Group Limited. §T. Liu, E. Zhou, and T. Zhao are affiliated with the School of Industrial and Systems Engineering at Georgia Tech; Tuo Zhao is the corresponding author; Email: [email protected].

1 Introduction

Modern machine learning models trained on large data sets have revolutionized a wide variety of domains, from speech and image recognition (Hinton et al., 2012; Krizhevsky et al., 2012) to natural language processing (Rumelhart et al., 1986) to industry-focused applications such as recommendation systems (Salakhutdinov et al., 2007). Training these machine learning models requires solving large-scale nonconvex optimization problems. For example, to train a deep neural network given n observations denoted by {(x_i, y_i)}_{i=1}^n, where x_i is the i-th input feature and y_i is the response, we need to solve the following empirical risk minimization problem:

    min_θ F(θ) := (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i, θ)),    (1.1)

where ℓ is a loss function and f is a neural network function/operator associated with θ. Thanks to significant advances made in GPU hardware and training algorithms, we can easily train machine learning models on a GPU-equipped machine. For example, we can solve (1.1) using the popular momentum stochastic gradient descent (MSGD, Robbins and Monro (1951); Polyak (1964)) algorithm. Specifically, at the k-th iteration, we uniformly sample i (or a mini-batch) from {1, ..., n}, and then take

    θ^(k+1) = θ^(k) − η ∇ℓ(y_i, f(x_i, θ^(k))) + µ (θ^(k) − θ^(k−1)),    (1.2)

where η is the step size parameter and µ ∈ [0, 1) is the parameter controlling the momentum. Note that when µ = 0, (1.2) reduces to the vanilla stochastic gradient descent (VSGD) algorithm. Many recent empirical results have demonstrated the impressive computational performance of MSGD.
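As a minimal sketch of update (1.2), the following runs the recursion on a hypothetical toy quadratic loss (a stand-in for ℓ, purely to make the update concrete):

```python
import numpy as np

def msgd_step(theta, theta_prev, grad, eta, mu):
    """One MSGD update (1.2): a gradient step plus Polyak's momentum term."""
    theta_new = theta - eta * grad + mu * (theta - theta_prev)
    return theta_new, theta

# toy quadratic loss l(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = theta_prev = np.ones(3)
for _ in range(200):
    theta, theta_prev = msgd_step(theta, theta_prev, theta, eta=0.1, mu=0.5)
print(np.linalg.norm(theta))   # ~0: the iterates contract to the minimizer
```

On this quadratic, the pair (θ^(k+1), θ^(k)) follows the linear recursion θ^(k+1) = (1 + µ − η)θ^(k) − µθ^(k−1), whose spectral radius is below one for these step size and momentum choices.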
For example, finishing a 180-epoch training run with a moderate-scale deep neural network (ResNet, He et al. (2016)) for CIFAR-10 (50,000 training images at 32×32 resolution) takes only a few hours with an NVIDIA Titan XP GPU. For even larger models and datasets, however, solving (1.1) is much more computationally demanding and can take an impractically long time on a single machine. For example, finishing a 90-epoch ImageNet-1k (over 1.2 million training images at 224×224 resolution) training run with a large-scale ResNet on the same GPU takes over 10 days. Such a high computational demand for training deep neural networks necessitates training on distributed GPU clusters in order to keep the training time acceptable.

In this paper, we consider the "parameter server" approach (Li et al., 2014), which is one of the most popular distributed optimization frameworks. Specifically, it consists of two main ingredients: First, the model parameters are globally shared on multiple server nodes. This set of servers is called the parameter servers. Second, there can be multiple workers processing data in parallel and communicating with the parameter servers. The whole framework can be implemented in either a synchronous or an asynchronous manner. Synchronous implementations are mainly criticized for low parallel efficiency, since the servers always need to wait for the slowest worker in order to aggregate all updates within each iteration.

To circumvent this issue, practitioners have resorted to asynchronous implementations, which emphasize parallel efficiency by using potentially stale stochastic gradients for computation. Specifically, each worker in an asynchronous implementation can process a mini-batch of data independently of the others, as follows: (1) The worker fetches from the parameter servers the most up-to-date parameters of the model needed to process the current mini-batch; (2)
It then computes gradients of the loss with respect to these parameters; (3)
Finally, these gradients are sent back to the parameter servers, which then update the model accordingly. Since each worker communicates with the parameter servers independently of the others, this is called asynchronous MSGD (Async-MSGD).

As can be seen, Async-MSGD is different from Sync-MSGD, since parameter updates may have occurred while a worker is computing its stochastic gradient; hence, the resulting stochastic gradients are typically computed with respect to outdated parameters. We refer to these as stale stochastic gradients, and to their staleness as the number of updates that have occurred between the corresponding read and update operations. More precisely, at the k-th iteration, Async-MSGD takes

    θ^(k+1) = θ^(k) − η ∇ℓ(y_i, f(x_i, θ^(k−τ_k))) + µ (θ^(k) − θ^(k−1)),    (1.3)

where τ_k ∈ Z_+ denotes the delay in the system (usually proportional to the number of workers).

Understanding the theoretical impact of staleness is fundamental, but very difficult for distributed nonconvex stochastic optimization. Though there have been some recent papers on this topic, there are still significant gaps between theory and practice:

(A) They all focus on Async-VSGD (Lian et al., 2015; Zhang et al., 2015; Lian et al., 2016). Many machine learning models, however, are often trained using algorithms equipped with momentum, such as Async-MSGD and Async-ADAM (Kingma and Ba, 2014). Moreover, there have been some results reporting that Async-MSGD sometimes leads to computational and generalization performance loss compared with Sync-MSGD. For example, Mitliagkas et al. (2016) observe that Async-MSGD leads to generalization accuracy loss when training deep neural networks; Chen et al. (2016) observe similar results for Async-ADAM when training deep neural networks; Zhang and Mitliagkas (2018) suggest that the momentum for Async-MSGD needs to be adaptively tuned for better generalization performance.

(B)
They all focus on analyzing convergence to a first-order optimal solution (Lian et al., 2015; Zhang et al., 2015; Lian et al., 2016), which can be either a saddle point or a local optimum. To better understand these algorithms for nonconvex optimization, machine learning researchers are becoming more and more interested in second-order optimality guarantees. The theory requires a more refined characterization of how the delay affects escaping from saddle points and converging to local optima.

Unfortunately, closing these gaps of Async-MSGD for highly complicated nonconvex problems (e.g., training large recommendation systems and deep neural networks) is generally infeasible due to the current technical limit. Therefore, we will study the algorithm through a simpler and yet nontrivial nonconvex problem: streaming PCA. This helps us to better understand the algorithmic behavior of Async-MSGD even in more general problems. Specifically, the streaming PCA problem is formulated as

    max_v v^T E_{X∼D}[XX^T] v subject to v^T v = 1,    (1.4)

where D is an unknown zero-mean distribution, and the streaming data points {X_k}_{k=1}^∞ are drawn independently from D. This problem, though nonconvex, is well known as a strict saddle optimization problem over the sphere (Ge et al., 2015), and its optimization landscape enjoys two geometric properties: (1) no spurious local optima and (2) negative curvature around saddle points.

These nice geometric properties can also be found in several other popular nonconvex optimization problems, such as matrix regression/completion/sensing, independent component analysis, partial least square multiview learning, and phase retrieval (Ge et al., 2016; Li et al., 2016; Sun et al., 2016). However, little is known about the optimization landscape of general nonconvex problems.
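The two geometric properties can be checked directly for (1.4): at a stationary point v_j, the Riemannian Hessian of the Rayleigh quotient has eigenvalues 2(λ_i − λ_j) along the tangent directions v_i, so every non-optimal eigenvector admits an ascent direction. The sketch below (with a hypothetical diagonal spectrum for Σ) verifies this numerically:

```python
import numpy as np

# Rayleigh quotient f(v) = v^T Sigma v on the unit sphere. At a unit vector v,
# the projected (Riemannian) Hessian is 2 P (Sigma - (v^T Sigma v) I) P,
# where P projects onto the tangent space of the sphere at v.
Sigma = np.diag([4.0, 1.0, 0.25, 0.0625])   # hypothetical spectrum, lambda_1 = 4
d = Sigma.shape[0]

def riemannian_hessian(v):
    """Projected Hessian of f(v) = v^T Sigma v at a unit vector v."""
    P = np.eye(d) - np.outer(v, v)          # tangent-space projector
    return P @ (2.0 * (Sigma - (v @ Sigma @ v) * np.eye(d))) @ P

e2 = np.eye(d)[1]                           # stationary point v_2 (a saddle)
eigs = np.linalg.eigvalsh(riemannian_hessian(e2))
print(eigs.max())                           # 2*(lambda_1 - lambda_2) = 6 > 0:
                                            # an ascent direction exists, so v_2
                                            # is a strict saddle for (1.4)
```

At the optimum e_1 the same projected Hessian has no positive eigenvalue, which is the "no spurious local optima" property for this diagonal example.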
Therefore, as suggested by many theoreticians, a strict saddle optimization problem such as streaming PCA can serve as a first and yet significant step towards understanding these algorithms. The insights we gain on such simpler problems shed light on more general nonconvex optimization problems. Illustrating through the example of streaming PCA, we intend to answer a fundamental question, which also arises in Mitliagkas et al. (2016):

    Does there exist a tradeoff between asynchrony and momentum in distributed nonconvex stochastic optimization?
The answer is "Yes": we need to reduce the momentum to allow a larger delay. Roughly speaking, our analysis indicates that for streaming PCA, the delays τ_k are allowed to asymptotically scale as

    τ_k ≲ (1 − µ)/√η.

Moreover, our analysis also indicates that asynchrony behaves very differently from momentum. Specifically, as shown in Liu et al. (2018), momentum accelerates optimization when escaping from saddle points and in nonstationary regions, but cannot improve the convergence to optima. Asynchrony, however, enjoys a linear speed-up throughout all optimization stages.

The main technical challenge in analyzing Async-MSGD comes from the complicated dependency caused by momentum and asynchrony. Our analysis adopts diffusion approximations of stochastic optimization, a powerful applied probability tool based on weak convergence theory. Existing literature has shown that it has considerable advantages when analyzing complicated stochastic processes (Kushner and Yin, 2003). Specifically, we prove that the solution trajectory of Async-MSGD for streaming PCA converges weakly to the solution of an appropriately constructed ODE/SDE. This solution provides an intuitive characterization of the algorithmic behavior and yields the asymptotic rate of convergence of Async-MSGD. To the best of our knowledge, this is the first theoretical analysis of Async-MSGD for distributed nonconvex stochastic optimization.
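To make the stale-gradient recursion (1.3) and the delay scaling concrete, here is a toy sketch. The quadratic loss, the constant inside max_delay, and the use of a fixed delay are all hypothetical illustration choices, not the paper's setting:

```python
import numpy as np

def max_delay(mu, eta, c=1.0):
    """Delay suggested by the scaling tau <~ (1 - mu)/sqrt(eta).
    The constant c is hypothetical; the theory fixes only the scaling."""
    return max(1, int(c * (1.0 - mu) / np.sqrt(eta)))

def async_msgd(grad, theta0, eta, mu, tau, n_iter):
    """Async-MSGD (1.3) with a fixed delay tau: each gradient is evaluated
    at parameters that are tau updates stale."""
    hist = [theta0.copy()]                  # hist[k] stores theta^(k)
    theta_prev = theta = theta0.copy()
    for k in range(n_iter):
        stale = hist[max(0, k - tau)]       # read theta^(k - tau_k)
        theta_new = theta - eta * grad(stale) + mu * (theta - theta_prev)
        theta_prev, theta = theta, theta_new
        hist.append(theta.copy())
    return theta

# toy quadratic loss 0.5 * ||theta||^2, so grad(theta) = theta
eta, mu = 0.05, 0.3
tau = max_delay(mu, eta)                    # (1 - 0.3)/sqrt(0.05), i.e. tau = 3
out = async_msgd(lambda th: th, np.ones(4), eta, mu, tau, n_iter=500)
print(tau, np.linalg.norm(out))             # norm ~0: stale gradients of this
                                            # size do not destroy convergence
```

Increasing tau far beyond this scale (or raising µ at fixed tau) eventually destabilizes the recursion, which is the tradeoff discussed above.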
Notations: For 1 ≤ i ≤ d, let e_i = (0, ..., 0, 1, 0, ..., 0)^T (the i-th entry equals 1, the others 0) be the standard basis in R^d. Given a vector v = (v^(1), ..., v^(d))^T ∈ R^d, we define the vector norm ‖v‖² = Σ_j (v^(j))². The notation w.p.1 is short for with probability one, B_t is the standard Brownian motion in R^d, and S denotes the unit sphere in R^d, i.e., S = {v ∈ R^d : ‖v‖ = 1}. Ḟ denotes the derivative of the function F(t), and ≍ means asymptotically equal.

2 Async-MSGD and Optimization Landscape of Streaming PCA
Recall that we study Async-MSGD for the streaming PCA problem formulated in (1.4):

    max_v v^T E_{X∼D}[XX^T] v subject to v^T v = 1.

We apply the asynchronous stochastic generalized Hebbian algorithm with Polyak's momentum (Sanger, 1989; Polyak, 1964). Note that the serial/synchronous counterpart has been studied in Liu et al. (2018). Specifically, at the k-th iteration, given X_k ∈ R^d independently sampled from the underlying zero-mean distribution D, Async-MSGD takes

    v_{k+1} = v_k + µ (v_k − v_{k−1}) + η (I − v_{k−τ_k} v_{k−τ_k}^T) X_k X_k^T v_{k−τ_k},    (2.1)

where µ ∈ [0, 1) is the momentum parameter and τ_k is the delay. We remark that, from the perspective of manifold optimization, (2.1) is essentially the stochastic approximation of the manifold gradient with momentum in an asynchronous manner. Throughout the rest of this paper, if not clearly specified, we refer to (2.1) as Async-MSGD for notational simplicity.

The optimization landscape of (1.4) has been well studied in the existing literature. Specifically, we impose the following assumption on Σ = E[XX^T].

Assumption 1.
The covariance matrix Σ is positive definite with eigenvalues λ_1 > λ_2 ≥ ... ≥ λ_d > 0 and associated normalized eigenvectors v_1, v_2, ..., v_d.

Assumption 1 implies that the eigenvectors ±v_1, ±v_2, ..., ±v_d are all the stationary points of problem (1.4) on the unit sphere S. Moreover, the eigengap (λ_1 > λ_2) guarantees that the global optimum v_1 is identifiable up to sign change; furthermore, v_2, ..., v_{d−1} are d − 2 strict saddle points, and v_d is the global minimum (Chen et al., 2017).

We analyze the convergence of Async-MSGD by diffusion approximations. Our focus is to find the proper delay given the momentum parameter µ and the step size η. We first prove the global convergence of Async-MSGD using an ODE approximation. Then, through a more refined SDE analysis, we further establish the rate of convergence. Before we proceed, we impose the following mild assumption on the underlying data distribution:

Assumption 2.
The data points {X_k}_{k=1}^∞ are drawn independently from some unknown distribution D over R^d such that

    E[X] = 0, E[XX^T] = Σ, ‖X‖ ≤ C_d,

where C_d is a constant (possibly dependent on d).

We first show that the solution trajectory converges to the solution of an ODE. By studying the ODE, we establish the global convergence of Async-MSGD; the rate of convergence will be established later. Specifically, we consider a continuous-time interpolation V^{η,τ}(t) of the solution trajectory of the algorithm: for t ≥ 0, set V^{η,τ}(t) = v_k^{η,τ} on the time interval [kη, kη + η). Throughout our analysis, similar notations apply to other interpolations, e.g., H^{η,τ}(t), U^{η,τ}(t).

To prove the weak convergence, we need to show that the solution trajectory {V^{η,τ}(t)} is tight in the càdlàg function space. In other words, {V^{η,τ}(t)} is uniformly bounded in t, and the maximum discontinuity (distance between two iterations) converges to 0, as shown in the following lemma:

Lemma 3.1.
Given v_0 ∈ S, for any k ≤ O(1/η), we have

    ‖v_k‖² ≤ 1 + O( max_i τ_i η/(1 − µ) ) + O( η/(1 − µ) ).

Specifically, given τ_k ≲ (1 − µ)/η^{1−γ} for some γ ∈ (0, 1), we have ‖v_k‖² ≤ 1 + O(η^γ) and ‖v_{k+1} − v_k‖ ≤ C_d η/(1 − µ).

The proof is provided in Appendix A.1. Roughly speaking, the delay is required to satisfy

    τ_k ≲ (1 − µ)/η^{1−γ}, ∀k > 0, for some γ ∈ (0, 1),

such that the tightness of the trajectory sequence is kept. Then, by Prokhorov's theorem, the sequence {V^η(t)} converges weakly to a continuous function. For self-containedness, we provide the prerequisite knowledge on weak convergence theory in the appendix.

Next we derive the weak limit. Specifically, we rewrite Async-MSGD as follows:

    v_{k+1} = v_k + η (m_{k+1} + β_k + ε_k),    (3.1)

where, with Σ_k = X_k X_k^T,

    ε_k = (Σ_k − Σ) v_{k−τ_k} − v_{k−τ_k}^T (Σ_k − Σ) v_{k−τ_k} v_{k−τ_k},
    m_{k+1} = Σ_{i=0}^k µ^i [ Σ v_{k−i−τ_{k−i}} − v_{k−i−τ_{k−i}}^T Σ v_{k−i−τ_{k−i}} v_{k−i−τ_{k−i}} ], and
    β_k = Σ_{i=0}^{k−1} µ^{k−i} [ (Σ_i − Σ) v_{i−τ_i} − v_{i−τ_i}^T (Σ_i − Σ) v_{i−τ_i} v_{i−τ_i} ].
As can be seen in (3.1), the term m_{k+1} dominates the update, and β_k + ε_k is the noise. Note that when we have momentum in the algorithm, m_{k+1} is not a stochastic approximation of the gradient, which is different from VSGD. Actually, it is a biased approximation of M̃(v_k^η), where

    M̃(v) = (1/(1 − µ)) [Σv − v^T Σ v v].

We have the following lemma to bound the approximation error.
Lemma 3.2.
For any k > 0, we have

    ‖m^η_{k+1} − M̃(v^η_k)‖ ≤ O(η log(1/η)) + O( τ_k λ_1 η/(1 − µ) ), w.p.1.

Note that the first term in the above error bound comes from the momentum, while the second is introduced by the delay. To ensure that this bound does not blow up as η → 0, we have to impose a further requirement on the delay. Given Lemmas 3.1 and 3.2, we only need to prove that the continuous interpolation of the noise term β_k + ε_k converges to 0, which leads to the main theorem.

Theorem 3.3.
Suppose v_{−i} = v_0 = v ∈ S for any i > 0. When the delay in each step is chosen according to the condition

    τ_k ≍ (1 − µ)/(λ_1 η^{1−γ}), ∀k > 0, for some γ ∈ (0, 1),

then for each subsequence of {V^η(·), η > 0} there exist a further subsequence and a process V(·) such that V^η(·) ⇒ V(·) in the weak sense as η → 0 through the convergent subsequence, where V(·) satisfies the following ODE:

    V̇ = (1/(1 − µ)) [ΣV − V^T Σ V V], V(0) = v.    (3.2)

To solve ODE (3.2), we rotate the coordinates to decouple the dimensions. Specifically, there exists an eigenvalue decomposition

    Σ = Q Λ Q^T, where Λ = diag(λ_1, λ_2, ..., λ_d) and Q^T Q = I.

Note that, after the rotation, e_1 is the optimum corresponding to v_1. Let H^η(t) = Q^T V^η(t); then as η → 0, {H^η(·), η > 0} converges weakly to

    H^(i)(t) = ( Σ_{j=1}^d [ H^(j)(0) exp(λ_j t/(1 − µ)) ]² )^{−1/2} H^(i)(0) exp( λ_i t/(1 − µ) ), i = 1, ..., d.

Moreover, given H^(1)(0) ≠ 0, H(t) converges to H* = e_1 as t → ∞. This implies that the limiting solution trajectory of Async-MSGD converges to the global optimum, given a delay τ_k ≲ (1 − µ)/(λ_1 η^{1−γ}) in each step.

Such an ODE approach neglects the noise and only considers the effect of the gradient. Thus, it characterizes only the mean behavior and is reliable only when the gradient dominates the variance throughout all iterations. In practice, however, we care about one realization of the algorithm, and the noise plays a very important role and cannot be neglected (especially near the saddle points and local optima, where the gradient has a relatively small magnitude). Moreover, since the ODE analysis does not explicitly characterize the order of the step size η, no rate of convergence can be established. In this respect, the ODE analysis is insufficient.
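As a numerical sanity check (the spectrum, momentum, step size, and fixed delay below are hypothetical, and the Gaussian samples violate the bounded-norm part of Assumption 2, which is harmless for illustration), the sketch verifies that the closed-form H(t) solves ODE (3.2) and that a run of update (2.1) indeed approaches the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([4.0, 1.0, 0.25, 0.0625])    # hypothetical spectrum of Sigma
mu, d = 0.5, 4

def H(t, H0):
    """Closed-form solution of ODE (3.2) in the rotated coordinates."""
    w = H0 * np.exp(lam * t / (1.0 - mu))
    return w / np.linalg.norm(w)

def rhs(h):
    """Right-hand side of (3.2): (Lambda h - h^T Lambda h h)/(1 - mu)."""
    return (lam * h - (h @ (lam * h)) * h) / (1.0 - mu)

H0 = np.ones(d) / np.sqrt(d)
t, dt = 0.7, 1e-5
fd = (H(t + dt, H0) - H(t - dt, H0)) / (2 * dt)    # central finite difference
print(np.max(np.abs(fd - rhs(H(t, H0)))))           # ~0 (O(dt^2) error)

def async_msgd_pca(eta, mu, tau, n_iter):
    """Async-MSGD update (2.1) with a fixed delay tau and Gaussian data."""
    hist = [np.ones(d) / 2.0]                       # unit-norm start
    v_prev = v = hist[0]
    for k in range(n_iter):
        X = np.sqrt(lam) * rng.standard_normal(d)   # E[X X^T] = diag(lam)
        vs = hist[max(0, k - tau)]                  # stale read v_{k - tau_k}
        g = (X - (vs @ X) * vs) * (X @ vs)          # (I - vs vs^T) X X^T vs
        v_prev, v = v, v + mu * (v - v_prev) + eta * g
        hist.append(v)
    return v / np.linalg.norm(v)

v_hat = async_msgd_pca(eta=0.01, mu=mu, tau=4, n_iter=20000)
print(abs(v_hat[0]))    # close to 1: aligned with the top eigendirection e_1
```

Since Σ is diagonal here, Q = I and the iterates can be compared with the rotated ODE limit directly.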
Therefore, we resort to an SDE-based approach for a more precise characterization. The SDE approach recovers the effect of the noise by rescaling and provides a more precise characterization of the local behavior. The relationship between the SDE and ODE approaches is analogous to that between the central limit theorem and the law of large numbers.

• Phase III: Around Global Optima. We consider the normalized process {u^{η,τ}_n = (h^{η,τ}_n − e_1)/√η} around the optimal solution e_1, where h^{η,τ}_n = Q^T v^{η,τ}_n. The intuition behind this rescaling is similar to the √N scaling in the central limit theorem.

We first analyze the error introduced by the delay after the above normalization. Let

    D_n = H_{n+1} − H_n − η Σ_{i=0}^n µ^{n−i} { Λ_i H_i − H_i^T Λ_i H_i H_i }

be the error. Then we have

    u_{n+1} = u_n + √η Σ_{i=0}^n µ^{n−i} { Λ_i H_i − H_i^T Λ_i H_i H_i } + (1/√η) D_n.

Define the accumulated asynchronous error process as D(t) = (1/√η) Σ_{i=1}^{t/η} D_i. To ensure the weak convergence, we prove that the continuous stochastic process D(t) converges to zero, as shown in the following lemma.

Lemma 3.4.
Given delays τ_k satisfying

    τ_k ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}), ∀k > 0, for some γ ∈ (0, 0.5),

we have, for any fixed t, lim_{η→0} D(t) → 0, a.s.

Lemma 3.4 shows that after the normalization we have to use a delay smaller than that in Theorem 3.3 to control the noise. This indicates that the upper bound derived from the ODE approximation is inaccurate for a single sample path. We then have the following SDE approximation of the solution trajectory.

Theorem 3.5.
Suppose that for every k > 0 the delay satisfies the condition

    τ_k ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}), ∀k > 0, for some γ ∈ (0, 0.5).

Then, as η → 0, {U^{η,s,i}(·)} (i ≠ 1) converges weakly to a stationary solution of

    dU = ((λ_i − λ_1)/(1 − µ)) U dt + (α_{i,1}/(1 − µ)) dB_t,    (3.3)

where α_{i,j} = √( E[(Y^(i))² (Y^(j))²] ), Y = Q^T X, and U^{η,s,i}(·) is the i-th dimension of U^{η,s}(·).

Theorem 3.5 implies that about (1 − µ)/((λ_1 + C_d) η^{1/2−γ}) workers are allowed to work simultaneously. For notational simplicity, denote τ = max_k τ_k and φ = Σ_j α_{1,j}², which is bounded by the fourth-order moment of the data. The asymptotic rate of convergence is then given in the following proposition.

Proposition 3.6.
Given a sufficiently small ε > 0 and η ≍ (1 − µ) ε (λ_1 − λ_2)/φ, there exists some constant δ ≍ √η such that, after restarting the counter of time, if (H^{η,1}(0))² ≥ 1 − δ², we allow τ workers to work simultaneously, where for some γ ∈ (0, 0.5),

    τ ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}),

and we need

    T = ((1 − µ)/(2(λ_1 − λ_2))) log( (1 − µ)(λ_1 − λ_2)δ² / ((1 − µ)(λ_1 − λ_2)ε − ηφ) )

to ensure Σ_{i=2}^d (H^{η,i}(T))² ≤ ε with probability at least 3/4.

Proposition 3.6 implies that, asymptotically, the effective iteration complexity of Async-MSGD enjoys a linear acceleration, i.e.,

    N ≍ T/(τη) ≍ [ (λ_1 + C_d) φ^{1/2+γ} / ( (1 − µ)^{1/2+γ} (λ_1 − λ_2)^{3/2+γ} ε^{1/2+γ} ) ] log( (1 − µ)(λ_1 − λ_2)δ² / ((1 − µ)(λ_1 − λ_2)ε − ηφ) ).

Remark 3.7.
Mitliagkas et al. (2016) conjecture that the delay in Async-SGD is equivalent to the momentum in MSGD. Our result, however, shows that this is not true in general. Specifically, when µ = 0, Async-SGD yields an effective iteration complexity of

    N̂ ≍ [ (λ_1 + C_d) φ^{1/2+γ} / ( (λ_1 − λ_2)^{3/2+γ} ε^{1/2+γ} ) ] log( (λ_1 − λ_2)δ² / ((λ_1 − λ_2)ε − ηφ) ),

which is faster than that of MSGD (Liu et al., 2018):

    Ñ ≍ ( φ / (ε (λ_1 − λ_2)²) ) log( (λ_1 − λ_2)δ² / ((λ_1 − λ_2)ε − ηφ) ).

Thus, there exists a fundamental difference between these two algorithms.

• Phase II: Traverse between Stationary Points. For Phase II, we study the algorithmic behavior once Async-MSGD has escaped from the saddle points. During this period, since the noise is too small compared with the large magnitude of the gradient, the update is dominated by the gradient, and the influence of the noise is negligible. Accordingly, the algorithm behaves like an almost deterministic traverse between stationary points, which can be viewed as a two-step discretization of the ODE with a discretization error O(η) (Griffiths and Higham, 2010). Therefore, the ODE approximation is reliable before the algorithm enters the neighborhood of the optimum. The upper bound τ ≲ (1 − µ)/(λ_1 η^{1−γ}) found in Section 3.1 works in this phase. Then we have the following proposition:

Proposition 3.8.
After restarting the counter of time, given a sufficiently small η and δ ≍ √η, we can allow τ workers to work simultaneously, where for some γ ∈ (0, 1),

    τ ≍ (1 − µ)/(λ_1 η^{1−γ}),

and we need

    T = ((1 − µ)/(2(λ_1 − λ_2))) log( (1 − δ²)/δ² )

such that P( (H^{η,1}(T))² ≥ 1 − δ² ) ≥ 3/4.

When ε is small enough, we can choose η ≍ ε(λ_1 − λ_2)/φ, and Proposition 3.8 implies that, asymptotically, the effective iteration complexity of Async-MSGD enjoys a linear acceleration by a factor of τ, i.e.,

    N ≍ T/(τη) ≍ [ λ_1 φ^γ / ( (λ_1 − λ_2)^{1+γ} ε^γ ) ] log( (1 − δ²)/δ² ).

• Phase I: Escaping from Saddle Points. Finally, we study the algorithmic behavior around the saddle points e_j, j ≠ 1. Similarly to Phase III, the gradient has a relatively small magnitude, and the noise is the key factor that helps the algorithm escape from the saddles. Thus, an SDE approximation needs to be derived. Define {u^{s,η}_n = (h^{s,η}_n − e_j)/√η} for j ≠ 1. By the same SDE approximation technique used in Section 3.2, we obtain the following theorem.

Theorem 3.9.
Given i ≠ j, for any C > 0, if for every k the delay satisfies

    τ_k ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}), ∀k > 0, for some γ ∈ (0, 0.5),

then there exist some δ > 0 and η' > 0 such that

    sup_{η<η'} P( sup_t |U^{η,i}(t)| ≤ C ) ≤ 1 − δ.

Theorem 3.9 implies that for i < j, with a constant probability δ, Async-MSGD with a proper delay escapes from the saddle points at some time; it then enters Phase II and converges to the global optimum. Note that when the step size η is small, the local behavior of H^η around the saddle points can be characterized by an SDE. We then obtain the following proposition on the asymptotic escaping rate of Async-MSGD.

Proposition 3.10. Given a pre-specified ν ∈ (0, 1), η ≍ ε(λ_1 − λ_2)/φ, and δ ≍ √η, we allow τ workers to work simultaneously, where for some γ ∈ (0, 0.5),

    τ ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}),

and we need

    T = ((1 − µ)/(2(λ_1 − λ_2))) log( (1 − µ) η^{−1} δ² (λ_1 − λ_2) / ( (Φ^{−1}(ν/2))² α² ) + 1 )

such that (H^{η,j}(T))² ≤ 1 − δ² with probability at least 1 − ν, where Φ(x) is the CDF of the standard normal distribution.

Proposition 3.10 implies that, asymptotically, the effective iteration complexity of Async-MSGD enjoys a linear acceleration, i.e.,

    N ≍ T/(ητ) ≍ [ (λ_1 + C_d) φ^{1/2+γ} / ( (1 − µ)^{1/2+γ} (λ_1 − λ_2)^{3/2+γ} ε^{1/2+γ} ) ] log( (1 − µ) η^{−1} δ² (λ_1 − λ_2) / ( (Φ^{−1}(ν/2))² α² ) + 1 ).

Remark 3.11.
We briefly summarize here: (1) There is a tradeoff between momentum and asynchrony. Specifically, to guarantee convergence, the delay must be chosen according to

    τ ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}), for some γ ∈ (0, 0.5).

Async-MSGD then asymptotically achieves a linear speed-up over MSGD. (2) Momentum and asynchrony are fundamentally different: with proper delays, Async-MSGD achieves a linear speed-up even in the third phase, while momentum cannot improve the convergence there.
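The limit (3.3) is an Ornstein-Uhlenbeck process, whose stationary variance is α_{i,1}²/(2(λ_1 − λ_i)(1 − µ)). The Euler-Maruyama sketch below (all constants hypothetical) simulates it and checks the stationary variance numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lami, mu, alpha = 4.0, 1.0, 0.5, 2.0   # hypothetical constants
a = (lam1 - lami) / (1.0 - mu)               # mean-reversion rate in (3.3)
b = alpha / (1.0 - mu)                       # diffusion coefficient in (3.3)

dt, n = 2e-3, 500_000
U = np.empty(n)
U[0] = 0.0
noise = rng.standard_normal(n - 1)
for k in range(n - 1):                       # Euler-Maruyama discretization
    U[k + 1] = U[k] - a * U[k] * dt + b * np.sqrt(dt) * noise[k]

emp = U[n // 2 :].var()                      # discard burn-in, then estimate
theory = b * b / (2.0 * a)                   # stationary OU variance
print(emp, theory)                           # both ~ alpha^2/(2(lam1-lami)(1-mu))
```

The variance grows as µ → 1, which is one way to see why a large momentum leaves less room for the additional noise injected by asynchrony.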
The previous analysis focuses on the case where the delay is deterministic and bounded. Lemmas 3.2 and 3.4 show that when the delay satisfies certain conditions, the error introduced by asynchrony goes to 0 with probability 1 as η → 0. When proving weak convergence of the solution trajectory, however, we only need convergence in probability. Thus, it is possible to extend our results to unbounded random delays by using the Markov inequality.

Specifically, following Lemma 3.2, to guarantee ‖m^η_{k+1} − M̃(v^η_k)‖ → 0 in probability, we need

    τ_k λ_1 η/(1 − µ) → 0 in probability.

By the Markov inequality, for any ε > 0,

    P( τ_k λ_1 η/(1 − µ) ≥ ε ) ≤ E( τ_k λ_1 η/(1 − µ) )/ε → 0

when

    E(τ_k) ≍ (1 − µ)/(λ_1 η^{1−γ}), ∀k > 0, for some γ ∈ (0, 1).    (4.1)

Figure 1:
Comparison of Async-MSGD with different momentum parameters and delays. For the five values of µ considered, the optimal delays are τ = 120 and successively smaller values, respectively. This suggests a clear tradeoff between asynchrony and momentum.

Thus, Theorem 3.3 holds when the delay satisfies the above moment condition. A similar extension can be made to our SDE analysis (Theorems 3.5 and 3.9), and the corresponding moment condition is

    E(τ_k) ≍ (1 − µ)/((λ_1 + C_d) η^{1/2−γ}), ∀k > 0, for some γ ∈ (0, 0.5).

We present numerical experiments for both streaming PCA and training deep neural networks to demonstrate the tradeoff between momentum and asynchrony. The experiment on streaming PCA verifies our theory in Section 3, and the experiments on training deep neural networks verify that our theory, though trimmed for streaming PCA, gains new insights for more general problems.
We first provide a numerical experiment to show the tradeoff between momentum and asynchrony in streaming PCA. For simplicity, we choose d = 4 and a diagonal covariance matrix Σ. The optimum is (1, 0, 0, 0)^T. We compare the performance of Async-MSGD with different delays and momentum parameters. Specifically, we start the algorithm at the saddle point (0, 1, 0, 0)^T with a fixed step size η, and each configuration is run repeatedly to average out the randomness.

Figure 1 shows the average optimization error obtained by Async-MSGD with five momentum parameters and delays from 0 to 100; the shade is the error bound. We see that for a fixed µ, Async-MSGD can achieve an optimization error similar to that of MSGD when the delay is below some threshold, which we call the optimal delay. As can be seen in Figure 1, the optimal delay decreases as the momentum increases. This indicates a clear tradeoff between asynchrony and momentum, which is consistent with our theoretical analysis. We remark that the differences among Async-MSGD runs with different µ at τ = 0 are due to the fact that momentum hurts convergence, as shown in Liu et al. (2018).

We then provide numerical experiments comparing different numbers of workers and choices of momentum in training a 32-layer hyperspherical residual neural network (SphereResNet34) on the CIFAR-100 dataset for a 100-class image classification task. We use a computer workstation with 8 Titan XP GPUs and a fixed batch size. 50k images are used for training, and the remaining 10k are used for testing. We repeat each experiment several times and report the average. The step size is decreased by a fixed factor at several prescribed epochs, and the momentum parameter is tuned over five values. More details on the network architecture and experimental settings can be found in He et al. (2016) and Liu et al. (2017).
Figure 2 shows the validation accuracies of ResNet34 under different settings. We can see that for a single worker (τ = 1), the optimal momentum parameter is the largest value considered; as the number of workers increases, the optimal momentum decreases, and for 8 workers (τ = 8) the optimal momentum is much smaller, with the largest momentum value yielding the worst performance at τ = 8. This indicates a clear tradeoff between the delay and momentum, which is consistent with our theory.

We remark that though our theory helps explain some phenomena in training DNNs, there still exist some gaps:

(A)
The optimization landscapes of DNNs are much more challenging than that of our studied streaming PCA problem. For example, there may exist many bad local optima and high-order saddle points. How Async-MSGD behaves in these regions is still largely unknown;

(B)
Our analysis based on diffusion approximations requires η → 0. However, the experiments actually use relatively large step sizes at the early stage of training. Though we can expect large and small step sizes to share some similar behaviors, they may lead to very different results;

(C) Our analysis only explains how Async-MSGD minimizes the population objective. For DNNs, however, we are more interested in generalization accuracy.

(D)
Some algorithms, such as Adam (Kingma and Ba, 2014), use adaptive momentum. In these cases, the tradeoff between asynchrony and momentum is still unknown.

We leave these open questions for future investigation.

Figure 2: The average validation accuracies of ResNet34 versus the momentum parameter for different numbers of workers. We can see that the optimal momentum decreases as the number of workers increases.
References
Chen, J., Pan, X., Monga, R., Bengio, S. and Jozefowicz, R. (2016). Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981.

Chen, Z., Yang, F. L., Li, C. J. and Zhao, T. (2017). Online multiview representation learning: Dropping convexity for better efficiency. arXiv preprint arXiv:1702.08134.

Ge, R., Huang, F., Jin, C. and Yuan, Y. (2015). Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory.

Ge, R., Lee, J. D. and Ma, T. (2016). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems.

Griffiths, D. F. and Higham, D. J. (2010). Numerical Methods for Ordinary Differential Equations: Initial Value Problems. Springer Science & Business Media.

He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N. et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability, vol. 35. Springer.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J. and Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. In OSDI, vol. 14.

Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H. and Zhao, T. (2016). Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296.

Lian, X., Huang, Y., Li, Y. and Liu, J. (2015). Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems.

Lian, X., Zhang, H., Hsieh, C.-J., Huang, Y. and Liu, J. (2016). A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems.

Liu, T., Chen, Z., Zhou, E. and Zhao, T. (2018). Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. arXiv preprint arXiv:1802.05155.

Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T. and Song, L. (2017). Deep hyperspherical learning. In Advances in Neural Information Processing Systems.

Mitliagkas, I., Zhang, C., Hadjis, S. and Ré, C. (2016). Asynchrony begets momentum, with an application to deep learning. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics.

Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature.

Salakhutdinov, R., Mnih, A. and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning. ACM.

Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks.

In Information Theory (ISIT), 2016 IEEE International Symposium on. IEEE.

Zhang, J. and Mitliagkas, I. (2018). YellowFin: Adaptive optimization for (a)synchronous systems.

Zhang, W., Gupta, S., Lian, X. and Liu, J. (2015). Staleness-aware async-SGD for distributed deep learning. arXiv preprint arXiv:1511.05950.

A Proof of the Main Results
A.1 Proof of Lemma 3.1
Proof.
First, suppose $\{v_k\}$ is uniformly bounded, $\|v_k\| \le 2$. The update rule gives
$$v_{k+1} - v_k = \mu(v_k - v_{k-1}) + \eta\big[\Sigma_k v_{k-\tau_k} - (v_{k-\tau_k}^\top \Sigma_k v_{k-\tau_k})v_{k-\tau_k}\big],$$
and unrolling the recursion,
$$v_{k+1} - v_k = \sum_{i=0}^{k}\mu^{k-i}\eta\big[\Sigma_i v_{i-\tau_i} - (v_{i-\tau_i}^\top \Sigma_i v_{i-\tau_i})v_{i-\tau_i}\big] \;\Longrightarrow\; \|v_{k+1}-v_k\| \le \frac{C_\delta \eta}{1-\mu},$$
where $C_\delta = \sup_{\|v\|\le 2,\,\|X\|\le C_d}\|XX^\top v - (v^\top XX^\top v)v\|$ is finite. Thus, the jump $v_{k+1}-v_k$ is bounded.

Next, we show that the boundedness assumption on $v$ can be removed. In fact, with an initialization on $\mathbb{S}^{d-1}$ (the sphere of the unit ball), the algorithm stays in a ball of radius $1+O(\eta^\gamma)$. Recall $\delta_{k+1} = v_{k+1}-v_k$, and consider the difference between the squared norms of two consecutive iterates,
$$\Delta_k = \|v_{k+1}\|^2 - \|v_k\|^2 = \|\delta_{k+1}\|^2 + 2v_k^\top \delta_{k+1}.$$
Writing $g_{k+1} = \Sigma_{k+1}v_{k+1-\tau_{k+1}} - (v_{k+1-\tau_{k+1}}^\top \Sigma_{k+1} v_{k+1-\tau_{k+1}})v_{k+1-\tau_{k+1}}$, so that $\delta_{k+2} = \mu\delta_{k+1} + \eta g_{k+1}$, we have
$$\begin{aligned}
\Delta_{k+1} - \Delta_k &= \|\delta_{k+2}\|^2 + 2v_{k+1}^\top\delta_{k+2} - \|\delta_{k+1}\|^2 - 2v_k^\top\delta_{k+1}\\
&= \|\delta_{k+2}\|^2 - \|\delta_{k+1}\|^2 + 2\mu v_{k+1}^\top \delta_{k+1} + 2\eta v_{k+1}^\top g_{k+1} - 2v_k^\top\delta_{k+1}\\
&= \|\delta_{k+2}\|^2 - \|\delta_{k+1}\|^2 + 2\mu v_{k+1}^\top \delta_{k+1} + 2\eta\, v_{k+1-\tau_{k+1}}^\top g_{k+1} + 2\eta\,(v_{k+1}-v_{k+1-\tau_{k+1}})^\top g_{k+1} - 2v_k^\top\delta_{k+1}\\
&\le \|\delta_{k+2}\|^2 - \|\delta_{k+1}\|^2 + 2\mu v_k^\top \delta_{k+1} + 2\mu\|\delta_{k+1}\|^2 + 2\eta\, v_{k+1-\tau_{k+1}}^\top \Sigma_{k+1} v_{k+1-\tau_{k+1}}\big(1-\|v_{k+1-\tau_{k+1}}\|^2\big) - 2v_k^\top\delta_{k+1} + \frac{2C_\delta^2\tau_{k+1}\eta^2}{1-\mu}\\
&= \|\delta_{k+2}\|^2 + \mu\|\delta_{k+1}\|^2 - (1-\mu)\Delta_k + 2\eta\, v_{k+1-\tau_{k+1}}^\top \Sigma_{k+1} v_{k+1-\tau_{k+1}}\big(1-\|v_{k+1-\tau_{k+1}}\|^2\big) + \frac{2C_\delta^2\tau_{k+1}\eta^2}{1-\mu},
\end{aligned}$$
where we used $v_{k+1}^\top\delta_{k+1} = v_k^\top\delta_{k+1} + \|\delta_{k+1}\|^2$, together with $\|v_{k+1}-v_{k+1-\tau_{k+1}}\| \le \tau_{k+1}C_\delta\eta/(1-\mu)$ and $\|g_{k+1}\| \le C_\delta$. Hence, whenever $1 \le \|v_{k+1-\tau_{k+1}}\| \le 2$,
$$\Delta_{k+1} - \Delta_k \le \|\delta_{k+2}\|^2 + \mu\|\delta_{k+1}\|^2 - (1-\mu)\Delta_k + \frac{2C_\delta^2\tau_{k+1}\eta^2}{1-\mu}.$$
Let $\kappa = \inf\{i : \|v_{i+1}\| > 1\}$. Then
$$\Delta_{\kappa+1} \le (1+\mu)\frac{C_\delta^2\eta^2}{(1-\mu)^2} + \mu\Delta_\kappa + \frac{2C_\delta^2\tau_{\kappa+1}\eta^2}{1-\mu}.$$
If $1 < \|v_{\kappa+i-\tau_{\kappa+i}}\| \le 2$ holds for $i = 1,\dots,n < t/\eta$, we have
$$\Delta_{\kappa+i} \le (1+\mu)\frac{C_\delta^2\eta^2}{(1-\mu)^2} + \frac{2C_\delta^2(\max_k\tau_k)\eta^2}{1-\mu} + \mu\Delta_{\kappa+i-1} \le \frac{1+\mu}{1-\mu}\cdot\frac{C_\delta^2\eta^2}{(1-\mu)^2} + \frac{2C_\delta^2(\max_k\tau_k)\eta^2}{(1-\mu)^2} + \mu^i\Delta_\kappa.$$
Thus,
$$\|v_{\kappa+n+1}\|^2 = \|v_\kappa\|^2 + \sum_{i=0}^{n}\Delta_{\kappa+i} \le 1 + \frac{\Delta_\kappa}{1-\mu} + \frac{t}{\eta}\cdot\frac{1+\mu}{1-\mu}\cdot\frac{C_\delta^2\eta^2}{(1-\mu)^2} + \frac{t}{\eta}\cdot\frac{2C_\delta^2(\max_k\tau_k)\eta^2}{(1-\mu)^2} \le 1 + O\Big(\frac{(\max_k\tau_k)\,\eta}{(1-\mu)^3}\Big).$$
In other words, when $\eta$ is very small and $\tau_k \asymp (1-\mu)^3/\eta^{1-\gamma}$, the iterates cannot go far from $\mathbb{S}^{d-1}$, and the assumption $\|v_k\| \le 2$ can be removed.

A.2 Proof of Lemma 3.2
Proof.
To prove the inequality, we decompose the error on the left-hand side into two parts:
$$\|m^\eta_{k+1} - \widetilde{M}(v^\eta_k)\| \le \|m^\eta_{k+1} - \widetilde{M}(v^\eta_{k-\tau_k})\| + \|\widetilde{M}(v^\eta_{k-\tau_k}) - \widetilde{M}(v^\eta_k)\|,$$
where the first term on the right is the error caused by the noise, and the second is the error introduced by the asynchrony.

We first bound the second term, which follows easily from Lipschitz continuity. The Lipschitz constant of $\widetilde{M}$ on the ball $\{\|v\| \le 2\}$ is of order $\lambda_1/(1-\mu)$, and therefore
$$\|\widetilde{M}(v^\eta_{k-\tau_k}) - \widetilde{M}(v^\eta_k)\| \lesssim \frac{\lambda_1}{1-\mu}\|v^\eta_{k-\tau_k} - v^\eta_k\| \le \frac{\lambda_1}{1-\mu}\cdot\frac{\tau_k C\eta}{1-\mu} = O\Big(\frac{\tau_k\lambda_1\eta}{(1-\mu)^2}\Big).$$
Next, we bound the first term. Since it can now be viewed as a no-delay case, we can use the same method as in Appendix B.2 of Liu et al. (2018). Since $\frac{1}{1-\mu} = \sum_{i=0}^{\infty}\mu^i$, there exists $N(\eta) = \log_\mu\big((1-\mu)\eta\big)$ such that $\sum_{i=N(\eta)}^{\infty}\mu^i < \eta$. When $k > N(\eta)$, write $m_{k+1}$ and $\widetilde{M}(v_{k-\tau_k})$ as summations:
$$m_{k+1} = \sum_{i=0}^{k}\mu^i\big[\Sigma v_{k-i-\tau_{k-i}} - (v_{k-i-\tau_{k-i}}^\top\Sigma v_{k-i-\tau_{k-i}})v_{k-i-\tau_{k-i}}\big],$$
split into $\sum_{i=0}^{N(\eta)}$ and $\sum_{i=N(\eta)+1}^{k}$, and
$$\widetilde{M}(v_{k-\tau_k}) = \frac{1}{1-\mu}\big[\Sigma v_{k-\tau_k} - (v_{k-\tau_k}^\top\Sigma v_{k-\tau_k})v_{k-\tau_k}\big] = \Big(\sum_{i=0}^{N(\eta)} + \sum_{i=N(\eta)+1}^{\infty}\Big)\mu^i\big[\Sigma v_{k-\tau_k} - (v_{k-\tau_k}^\top\Sigma v_{k-\tau_k})v_{k-\tau_k}\big].$$
Note that $\|v_{k+1}-v_k\| \le C\eta$, where $C = C_\delta/(1-\mu)$ is a constant. Then we have
$$\max_{i=0,1,\dots,N(\eta)}\|v_{k-i-\tau_{k-i}} - v_{k-\tau_k}\| \le \frac{C_\delta}{1-\mu}N(\eta)\eta + \frac{2C_\delta}{1-\mu}\max_i\tau_i\,\eta.$$
Then by Lipschitz continuity, for $i = 0,1,\dots,N(\eta)$, we have
$$\big\|\big[\Sigma v_{k-\tau_k} - (v_{k-\tau_k}^\top\Sigma v_{k-\tau_k})v_{k-\tau_k}\big] - \big[\Sigma v_{k-i-\tau_{k-i}} - (v_{k-i-\tau_{k-i}}^\top\Sigma v_{k-i-\tau_{k-i}})v_{k-i-\tau_{k-i}}\big]\big\| \le \frac{\lambda_1 C_\delta}{1-\mu}N(\eta)\eta + \frac{2\lambda_1 C_\delta}{1-\mu}\max_i\tau_i\,\eta,$$
and therefore
$$\bigg\|\sum_{i=0}^{N(\eta)}\mu^i\Big\{\big[\Sigma v_{k-i-\tau_{k-i}} - (v_{k-i-\tau_{k-i}}^\top\Sigma v_{k-i-\tau_{k-i}})v_{k-i-\tau_{k-i}}\big] - \big[\Sigma v_{k-\tau_k} - (v_{k-\tau_k}^\top\Sigma v_{k-\tau_k})v_{k-\tau_k}\big]\Big\}\bigg\| \le \frac{\lambda_1 C_\delta}{(1-\mu)^2}N(\eta)\eta + \frac{2\lambda_1 C_\delta}{(1-\mu)^2}\max_i\tau_i\,\eta.$$
Since $\Sigma v_k - (v_k^\top\Sigma v_k)v_k$ is uniformly bounded by $C_\delta$ w.p.1, both tails $\sum_{i=N(\eta)+1}^{k}\mu^i[\cdots]$ and $\sum_{i=N(\eta)+1}^{\infty}\mu^i[\cdots]$ are bounded by $C_\delta\eta$.

Thus,
$$\|m_{k+1} - \widetilde{M}(v_{k-\tau_k})\| \le \frac{\lambda_1 C_\delta}{(1-\mu)^2}N(\eta)\eta + \frac{2\lambda_1 C_\delta}{(1-\mu)^2}\max_i\tau_i\,\eta + 2C_\delta\eta = O\Big(\eta\log\frac{1}{\eta}\Big) + O\Big(\frac{\tau_k\lambda_1\eta}{(1-\mu)^2}\Big) \quad \text{w.p.1}.$$
For $k < N(\eta)$, following the same approach, we can bound $\|m_{k+1} - \widetilde{M}(v_k)\|$ by the same bound.

A.3 Proof Sketch of Theorem 3.3
Proof Sketch.
The proof technique is the fixed-state-chain method introduced by Liu et al. (2018). The detailed proof follows that of Theorem 3.1 in Liu et al. (2018), which is very involved and beyond our major concern; thus, we only provide the general idea here. Recall that the update of $v$ is as follows:
$$v_{k+1} = v_k + \eta Z_k = v_k + \eta(m_{k+1} + \beta_k + \epsilon_k),$$
or equivalently,
$$\frac{v_{k+1}-v_k}{\eta} = m_{k+1} + \beta_k + \epsilon_k = \widetilde{M}(v_k) + \big(m_{k+1} - \widetilde{M}(v_k)\big) + \beta_k + \epsilon_k.$$
By Lemma 3.2, $m_{k+1} - \widetilde{M}(v_k)$ converges to $0$. Since the sample path of Async-MSGD stays close to the sphere of the unit ball, one can easily verify that $\beta_k + \epsilon_k$ diminishes as $\eta \to 0$. Thus, $V^\eta(t)$ converges weakly to a limiting process satisfying the following ODE:
$$\dot{V} = \frac{1}{1-\mu}\big[\Sigma V - (V^\top\Sigma V)V\big], \quad V(0) = v_0.$$

A.4 Proof of Lemma 3.4
Proof.
Define
$$G_j(h) = \Lambda_j h - (h^\top\Lambda_j h)h = \Lambda h - (h^\top\Lambda h)h + X_jX_j^\top h - (h^\top X_jX_j^\top h)h,$$
which is smooth and bounded, and thus Lipschitz; the Lipschitz constant $L_d$ is determined by $\Lambda$ and the data $X$. Since $X$ is bounded by Assumption 2, for any $j > 0$ we have
$$\|G_j(h') - G_j(h'')\| \le L_d\|h' - h''\|.$$
Then we have
$$\|D_k\| = \eta\bigg\|\sum_{j=0}^{k}\mu^{k-j}\big(G_j(H_j) - G_j(H_{j-\tau_j})\big)\bigg\| \le \eta\sum_{j=0}^{k}\mu^{k-j}L_d\|H_j - H_{j-\tau_j}\| \le \sum_{j=0}^{k}\mu^{k-j}L_d\tau_j\frac{C_\delta\eta^2}{1-\mu} \le \frac{C_\delta L_d(\max_j\tau_j)\eta^2}{(1-\mu)^2} = o(\eta^{3/2}).$$
Then, from the definition of $D(t)$, we know $D(t) \to 0$ a.s.

A.5 Proof Sketch of Theorem 3.5
Proof Sketch.
The detailed proof follows lines similar to those of the proof of Theorem 4.1 in Liu et al. (2018); here, we only provide the proof sketch. Compared with MSGD, Async-MSGD has an additional error introduced by the asynchrony. However, under our choice of delay, the continuous-time interpolation of this error, i.e., $D(t)$, diminishes to zero almost surely when $\eta$ is small enough. Thus, in the asymptotic sense, the normalized sample path of Async-MSGD shares the same limiting behavior with that of MSGD. To see this more clearly, we write the update of $u^{\eta,i}_n$ as follows:
$$u^\eta_{n+1} = u^\eta_n + \sqrt{\eta}\sum_{j=0}^{n}\mu^{n-j}\big\{\Lambda_j H^\eta_j - \big((H^\eta_j)^\top\Lambda_j H^\eta_j\big)H^\eta_j\big\} + \frac{1}{\sqrt{\eta}}D^\eta_n,$$
or equivalently,
$$\frac{u^{\eta,i}_{n+1} - u^{\eta,i}_n}{\eta} = \frac{\Big(\sum_{j=0}^{n}\mu^{n-j}\big\{\Lambda_j H^\eta_j - \big((H^\eta_j)^\top\Lambda_j H^\eta_j\big)H^\eta_j\big\}\Big)^{(i)}}{\sqrt{\eta}} + \frac{1}{\eta\sqrt{\eta}}D^{\eta,i}_n, \quad \forall\, i = 1,\dots,d.$$
Following Equation (4.3) in Liu et al. (2018), we can further rewrite the above equation as follows:
$$\frac{u^{\eta,i}_{n+1} - u^{\eta,i}_n}{\eta} = \frac{\lambda_i - \lambda_1}{1-\mu}u^{\eta,i}_n + \frac{W^{\eta,i}_n}{\sqrt{\eta}} + o\big(|u^{\eta,i}_n|\big) + \frac{1}{\eta\sqrt{\eta}}D_n, \quad \forall\, i \neq 1,$$
where $W^{\eta,i}_n/\sqrt{\eta}$ converges to a Wiener process, and $o(|u^{\eta,i}_n|) + \frac{1}{\eta\sqrt{\eta}}D_n$ converges to $0$ as $\eta \to 0$. Thus, $U^{\eta,i}$ converges weakly to the solution of the following SDE:
$$dU = \frac{\lambda_i - \lambda_1}{1-\mu}U\,dt + \frac{\alpha_{i,1}}{1-\mu}dB_t. \tag{A.1}$$

A.6 Proof of Proposition 3.6
Proof.
Since we restart our record time, we assume here that the algorithm is initialized around the global optimum $e_1$, so that $\sum_{i=2}^{d}\big(U^{\eta,i}(0)\big)^2 = \eta^{-1}\delta^2 < \infty$. $U^{\eta,i}(t)$ approximates $U^{(i)}(t)$ in this neighborhood, and the second moment of $U^{(i)}(t)$ is, for $i \neq 1$,
$$\mathbb{E}\big(U^{(i)}(t)\big)^2 = \frac{\alpha_i^2}{2(1-\mu)(\lambda_1-\lambda_i)} + \bigg[\big(U^{(i)}(0)\big)^2 - \frac{\alpha_i^2}{2(1-\mu)(\lambda_1-\lambda_i)}\bigg]\exp\bigg[-\frac{2(\lambda_1-\lambda_i)t}{1-\mu}\bigg].$$
By Markov's inequality, we have
$$\begin{aligned}
P\bigg(\sum_{i=2}^{d}\big(H^{(i)}_\eta(T_1)\big)^2 > \epsilon^2\bigg) &\le \frac{\mathbb{E}\big(\sum_{i=2}^{d}(H^{(i)}_\eta(T_1))^2\big)}{\epsilon^2} = \frac{\mathbb{E}\big(\sum_{i=2}^{d}(U^{\eta,i}(T_1))^2\big)}{\eta^{-1}\epsilon^2}\\
&\approx \frac{\eta}{\epsilon^2}\sum_{i=2}^{d}\bigg[\frac{\alpha_i^2}{2(1-\mu)(\lambda_1-\lambda_i)}\Big(1 - \exp\Big(-\frac{2(\lambda_1-\lambda_i)T_1}{1-\mu}\Big)\Big) + \big(U^{\eta,i}(0)\big)^2\exp\Big(-\frac{2(\lambda_1-\lambda_i)T_1}{1-\mu}\Big)\bigg]\\
&\le \frac{\eta}{\epsilon^2}\bigg[\frac{\phi^2}{2(1-\mu)(\lambda_1-\lambda_2)}\Big(1 - \exp\Big(-\frac{2(\lambda_1-\lambda_d)T_1}{1-\mu}\Big)\Big) + \eta^{-1}\delta^2\exp\Big(-\frac{2(\lambda_1-\lambda_2)T_1}{1-\mu}\Big)\bigg]\\
&\le \frac{\eta}{\epsilon^2}\bigg[\frac{\phi^2}{2(1-\mu)(\lambda_1-\lambda_2)} + \eta^{-1}\delta^2\exp\Big(-\frac{2(\lambda_1-\lambda_2)T_1}{1-\mu}\Big)\bigg].
\end{aligned}$$
To guarantee that this probability is sufficiently small, it suffices to take
$$T_1 = \frac{1-\mu}{2(\lambda_1-\lambda_2)}\log\bigg(\frac{2(1-\mu)(\lambda_1-\lambda_2)\delta^2}{2(1-\mu)(\lambda_1-\lambda_2)\epsilon^2 - \eta\phi^2}\bigg).$$

A.7 Proof of Proposition 3.8
Proof.
After Phase I, we restart our record time, i.e., $\big(H^{\eta,1}(0)\big)^2 = \delta^2$. Then we obtain
$$\big(H^{\eta,1}(T_2)\big)^2 \approx \big(H^{(1)}(T_2)\big)^2 = \bigg[\sum_{j=1}^{d}\big(H^{(j)}(0)\big)^2\exp\Big(\frac{2\lambda_j}{1-\mu}T_2\Big)\bigg]^{-1}\big(H^{(1)}(0)\big)^2\exp\Big(\frac{2\lambda_1}{1-\mu}T_2\Big) \ge \bigg[\delta^2\exp\Big(\frac{2\lambda_1}{1-\mu}T_2\Big) + (1-\delta^2)\exp\Big(\frac{2\lambda_2}{1-\mu}T_2\Big)\bigg]^{-1}\delta^2\exp\Big(\frac{2\lambda_1}{1-\mu}T_2\Big),$$
which requires
$$\bigg[\delta^2\exp\Big(\frac{2\lambda_1}{1-\mu}T_2\Big) + (1-\delta^2)\exp\Big(\frac{2\lambda_2}{1-\mu}T_2\Big)\bigg]^{-1}\delta^2\exp\Big(\frac{2\lambda_1}{1-\mu}T_2\Big) \ge 1-\delta^2.$$
Solving the above inequality, we get
$$T_2 = \frac{1-\mu}{\lambda_1-\lambda_2}\log\frac{1-\delta^2}{\delta^2}.$$

A.8 Proof Sketch of Theorem 3.9

Proof Sketch.
The detailed proof follows lines similar to those of the proof of Theorem 4.5 in Liu et al. (2018); here, we only provide the proof sketch. We prove the argument by contradiction. Assume the conclusion does not hold, that is, there exists a constant $C > 0$ such that for any $\eta' > 0$ we have
$$\sup_{\eta \le \eta'} P\Big(\sup_\tau |U^{\eta,i}(\tau)| \le C\Big) = 1.$$
Then we can show that there exists a subsequence $\{U^{\eta_n,i}\}_{n=1}^{\infty}$ of $\{U^{\eta,i}\}_\eta$ that is tight and weakly converges to a limiting process. By Lemma 3.4, we know that the normalized process of Async-MSGD shares the same limiting process with MSGD, which implies that $\{U^{\eta_n,i}\}_n$ weakly converges to a solution of the following SDE:
$$dU^i = \frac{\lambda_i - \lambda_j}{1-\mu}U^i\,dt + \frac{\alpha_{i,j}}{1-\mu}dB_t. \tag{A.2}$$
Then we can find a $\tau' > 0$ such that
$$P\big(|U^{\eta_n,i}(\tau')| \ge C\big) \ge \delta, \quad \forall\, n > N,$$
which leads to a contradiction.

A.9 Proof of Proposition 3.10
Proof.
According to the proof of Theorem 3.9, when $\big(H^{(2)}_\eta(t)\big)^2 \ge 1-\delta^2$ holds, we can use SDE (A.2) to characterize the local algorithmic behavior. We approximate $U^{\eta,1}(t)$ by the limiting process, which is normally distributed at time $t$. As $\eta \to 0$, by simple manipulation, we have
$$P\Big(\big(H^{\eta,2}(T_3)\big)^2 \le 1-\delta^2\Big) = P\Big(\big(U^{\eta,2}(T_3)\big)^2 \le \eta^{-1}(1-\delta^2)\Big).$$
We then prove $P\big(|U^{\eta,1}(T_3)| \ge \eta^{-1/2}\delta\big) \ge 1-\nu$. At time $t$, $U^{\eta,1}(t)$ is approximately normally distributed with mean $0$ and variance
$$\frac{\alpha_{1,2}^2}{2(1-\mu)(\lambda_1-\lambda_2)}\bigg[\exp\Big(\frac{2(\lambda_1-\lambda_2)t}{1-\mu}\Big) - 1\bigg].$$
Therefore, letting $\Phi(x)$ be the CDF of $N(0,1)$, we have
$$P\Bigg(\frac{|U^{\eta,1}(T_3)|}{\sqrt{\frac{\alpha_{1,2}^2}{2(1-\mu)(\lambda_1-\lambda_2)}\big[\exp\big(\frac{2(\lambda_1-\lambda_2)T_3}{1-\mu}\big) - 1\big]}} \ge \Phi^{-1}\Big(\frac{1+\nu}{2}\Big)\Bigg) \approx 1-\nu,$$
which requires
$$\eta^{-1/2}\delta \le \Phi^{-1}\Big(\frac{1+\nu}{2}\Big)\cdot\sqrt{\frac{\alpha_{1,2}^2}{2(1-\mu)(\lambda_1-\lambda_2)}\bigg[\exp\Big(\frac{2(\lambda_1-\lambda_2)T_3}{1-\mu}\Big) - 1\bigg]}.$$
Solving for $T_3$, we obtain
$$T_3 = \frac{1-\mu}{2(\lambda_1-\lambda_2)}\log\Bigg(\frac{2\eta^{-1}\delta^2(1-\mu)(\lambda_1-\lambda_2)}{\big[\Phi^{-1}\big(\frac{1+\nu}{2}\big)\big]^2\alpha_{1,2}^2} + 1\Bigg).$$
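The moment formulas used in Propositions 3.6 and 3.10 are standard Ornstein–Uhlenbeck facts for the limiting SDE (A.1). As a numerical sanity check, with assumed illustrative values $\theta = (\lambda_1-\lambda_i)/(1-\mu) = 2$ and $\sigma = \alpha_i/(1-\mu) = 1$, an Euler–Maruyama simulation reproduces $\mathbb{E}(U(t))^2 = \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t})$ for $U(0) = 0$:

```python
import numpy as np

def ou_second_moment(theta=2.0, sigma=1.0, t_end=1.0, dt=1e-3, n_paths=50000, seed=0):
    """Euler-Maruyama simulation of dU = -theta*U dt + sigma dB, U(0) = 0;
    returns the Monte Carlo estimate of E[U(t_end)^2]."""
    rng = np.random.default_rng(seed)
    U = np.zeros(n_paths)
    for _ in range(int(t_end / dt)):
        U += -theta * U * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return float(np.mean(U ** 2))

theta, sigma, t = 2.0, 1.0, 1.0
empirical = ou_second_moment(theta, sigma, t)
closed_form = sigma ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * t))
print(empirical, closed_form)  # agree to within Monte Carlo and discretization error
```

The same recipe with a positive drift coefficient gives the unstable variance $\frac{\sigma^2}{2\theta}(e^{2\theta t} - 1)$ used in the escaping-saddle bound of Proposition 3.10.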