A Model Parallel Proximal Stochastic Gradient Algorithm for Partially Asynchronous Systems
Rui Zhu, Di Niu
Department of Electrical and Computer Engineering, University of Alberta
Abstract
Large models are prevalent in modern machine learning scenarios, including deep learning and recommender systems, and can have millions or even billions of parameters. Parallel algorithms have become an essential solution technique for many large-scale machine learning jobs. In this paper, we propose a model parallel proximal stochastic gradient algorithm, AsyB-ProxSGD, to deal with large models using model parallel blockwise updates while in the meantime handling a large amount of training data using proximal stochastic gradient descent (ProxSGD). In our algorithm, worker nodes communicate with the parameter servers asynchronously, and each worker performs a proximal stochastic gradient step for only one block of model parameters during each iteration. Our proposed algorithm generalizes ProxSGD to the asynchronous and model parallel setting. We prove that AsyB-ProxSGD achieves a convergence rate of $O(1/\sqrt{K})$ to stationary points for nonconvex problems under constant minibatch sizes, where $K$ is the total number of block updates. This rate matches the best-known rates of convergence for a wide range of gradient-like algorithms. Furthermore, we show that when the number of workers is bounded by $O(K^{1/4})$, we can expect AsyB-ProxSGD to achieve linear speedup as the number of workers increases. We implement the proposed algorithm on MXNet and demonstrate its convergence behavior and near-linear speedup on a real-world dataset involving both a large model size and large amounts of data.
Many machine learning problems can be formulated as the following general minimization framework:
\[
\min_{x\in\mathbb{R}^d}\ \Psi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + h(x), \qquad (1)
\]
where $f_i(x)$ is typically a smooth yet possibly nonconvex loss function of the $i$-th sample, and $h(x)$ is a convex yet nonsmooth regularizer that promotes certain structures. Examples include deep learning with regularization (Dean et al., 2012; Chen et al., 2015; Zhang et al., 2015), LASSO (Tibshirani et al., 2005), sparse logistic regression (Liu et al., 2009), robust matrix completion (Xu et al., 2010; Sun and Luo, 2015), and sparse support vector machines (SVM) (Friedman et al., 2001).

Many classical deterministic (non-stochastic) algorithms are available to solve problem (1), including the proximal gradient (ProxGD) method (Parikh et al., 2014) and its accelerated variants (Li and Lin, 2015), as well as the alternating direction method of multipliers (ADMM) (Hong et al., 2016). These methods leverage the so-called proximal operators (Parikh et al., 2014) to handle the nonsmoothness in the problem. However, these algorithms require calculating the gradient over all $n$ samples in each iteration, which is expensive in modern machine learning problems. The trend for dealing with large volumes of data is to use stochastic algorithms: as the number of training samples $n$ increases, the cost of updating the model $x$ with all error gradients becomes prohibitive, and stochastic algorithms make it possible to update $x$ using only a small subset of all training samples at a time.

Stochastic gradient descent (SGD) is one of the first algorithms widely implemented in an asynchronous parallel fashion; its convergence rates and speedup properties have been analyzed for both convex (Agarwal and Duchi, 2011; Mania et al., 2017) and nonconvex (Lian et al., 2015) optimization problems. Nevertheless, SGD is mainly applicable to smooth optimization and is not suitable for problems with a nonsmooth term in the objective function, e.g., an $\ell_1$ norm regularizer. In fact, such nonsmooth regularizers are commonplace in many practical machine learning problems and constrained optimization problems. In these cases, SGD becomes ineffective, as it is hard to obtain gradients for a nonsmooth objective function.

With rapidly growing data volumes and model complexity, the need to scale up machine learning has sparked broad interest in developing efficient parallel optimization algorithms. A typical parallel optimization algorithm decomposes the original problem into multiple subproblems, each handled by a worker node. Each worker iteratively downloads the global model parameters and computes its local gradients, which are sent to the master node or servers for model updates. Recently, asynchronous parallel optimization algorithms (Niu et al., 2011; Li et al., 2014b; Lian et al., 2015), exemplified by the Parameter Server architecture (Li et al., 2014a), have been widely deployed in industry to solve practical large-scale machine learning problems. Asynchronous algorithms can largely reduce overhead and speed up training, since each worker may individually perform model updates without synchronization.

Existing parallel algorithms fall into two categories: data parallelism and model parallelism. In data parallelism, each worker takes a subset of training samples $i$ and calculates their loss functions $f_i$'s and/or gradients in parallel.
For example, a typical implementation of parallel SGD divides a minibatch of $N$ samples into several smaller minibatches (each with $N'$ samples), and each worker computes gradients on its $N'$ samples. This is preferred when the data size $n$ is large. In model parallelism, the model parameters $x$ are partitioned into $M$ blocks, where $x_j\in\mathbb{R}^{d_j}$ with $d_j\in\mathbb{N}_+$ and $\sum_{j=1}^{M}d_j = d$. Since the proximal operator on $x$ can be decomposed into proximal operators on individual blocks (Parikh et al., 2014), proximal gradient and its stochastic version (proximal stochastic gradient descent, or ProxSGD) are again natural candidates to solve (1). Zhou et al. (2016) show that block proximal gradient descent works in an asynchronous parallel mode and prove that block ProxSGD converges to critical points when the maximum staleness is bounded. However, a theoretical understanding of the behavior of asynchronous block ProxSGD, a more useful algorithm in practical Parameter Servers, is a gap yet to be filled.

In this paper, we propose AsyB-ProxSGD (Asynchronous Block Proximal Stochastic Gradient Descent), an extension of the proximal stochastic gradient (ProxSGD) algorithm to the model parallel paradigm and to the partially asynchronous protocol (PAP) setting. In AsyB-ProxSGD, workers asynchronously communicate with the parameter servers, which collectively store the model parameters in blocks. In an iteration, each worker pulls the latest yet possibly outdated model from the servers, calculates partial gradients for only one block based on stochastic samples, and pushes these gradients to the corresponding server. As workers can update different blocks in parallel, AsyB-ProxSGD differs from traditional data parallel ProxSGD: it can handle both a large model size $d$ and a large number $n$ of training samples, a case frequently observed in reality.

Our theoretical contribution is summarized as follows. We prove that AsyB-ProxSGD converges to stationary points of the nonconvex and nonsmooth problem (1) with an ergodic convergence rate of $O(1/\sqrt{K})$, where $K$ is the total number of times that any block in $x$ is updated. This rate matches the convergence rate known for asynchronous SGD, which, however, is suitable only for smooth problems. To the best of our knowledge, this is the first work that provides convergence rate guarantees for ProxSGD in a model parallel mode, especially in an asynchronous setting. We also provide a linear speedup guarantee as the number of workers increases, provided that the number of workers is bounded by $O(K^{1/4})$. This result lays a theoretical ground for the scalability and performance of AsyB-ProxSGD in practice. Evaluation on a real-world dataset involving both a large model and a large amount of data corroborates our theoretical findings on the convergence and speedup behavior of AsyB-ProxSGD under a Parameter Server implementation.

In this section, we first introduce some notation to be used throughout the paper. Then we introduce the stochastic optimization problem to be studied. Finally, we introduce proximal operators and enumerate the fundamental assumptions made in the model.

We use $\|x\|$ to denote the $\ell_2$ norm of the vector $x$, and $\langle x, y\rangle$ to denote the inner product of two vectors $x$ and $y$. We use $g(x)$ to denote the "true" gradient $\nabla f(x)$ and $G(x;\xi)$ to denote the stochastic gradient $\nabla F(x;\xi)$ for a function $f(x)$.
Let $\nabla_j$ denote the derivative w.r.t. $x_j$, the $j$-th coordinate block of $x$, and let $G_j$ and $g_j$ represent $\nabla_j F(x;\xi)$ and $\nabla_j f(x)$, respectively. For a random variable or vector $X$, let $\mathbb{E}[X\,|\,\mathcal{F}]$ be the conditional expectation of $X$ w.r.t. a sigma algebra $\mathcal{F}$.

In this paper, we consider the following stochastic optimization problem instead of the original deterministic version (1):
\[
\min_{x\in\mathbb{R}^d}\ \Psi(x) := \mathbb{E}_\xi[F(x;\xi)] + h(x), \qquad (2)
\]
where the stochastic nature comes from the random variable $\xi$, which in our setting represents a random index selected from the training set $\{1,\ldots,n\}$. Therefore, (2) attempts to minimize the expected loss of a training sample plus a regularizer $h(x)$. When it comes to large models, we decompose $x$ into $M$ blocks and rewrite (2) into the following block optimization form:
\[
\min_{x\in\mathbb{R}^d}\ \Psi(x) := \mathbb{E}_\xi[F(x_1,\ldots,x_M;\xi)] + \sum_{j=1}^{M} h_j(x_j), \qquad (3)
\]
where $x = (x_1,\ldots,x_M)$, $x_j\in\mathbb{R}^{d_j}$ with $d_j\in\mathbb{N}_+$ and $\sum_{j=1}^{M}d_j = d$, and $h(x) = \sum_{j=1}^{M}h_j(x_j)$.

In this work, we assume all $h_j$'s in problem (3) are proper, closed and convex, yet not necessarily smooth. To handle the potential nonsmoothness, we introduce the following generalized notion of derivatives to be used in the convergence analysis.

Definition 1 (Subdifferential, e.g., (Parikh et al., 2014)). We say a vector $p\in\mathbb{R}^d$ is a subgradient of the function $h:\mathbb{R}^d\to\mathbb{R}$ at $x\in\operatorname{dom} h$ if for all $z\in\operatorname{dom} h$,
\[
h(z) \ge h(x) + \langle p, z - x\rangle. \qquad (4)
\]
Moreover, the set of all such subgradients at $x$ is denoted by $\partial h(x)$ and is called the subdifferential of $h$ at $x$.
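As a concrete illustration (not taken from the paper itself), consider the common choice $h(x) = \|x\|_1$, whose subdifferential is separable across coordinates:

```latex
% Subdifferential of h(x) = \|x\|_1, computed coordinate-wise:
% \partial h(x) = \{\, p : p_i \in \partial|x_i| \ \text{for all } i \,\}, where
\partial |x_i| =
\begin{cases}
  \{\operatorname{sign}(x_i)\}, & x_i \neq 0,\\[2pt]
  [-1,\, 1],                    & x_i = 0.
\end{cases}
% Hence the criticality condition 0 \in \nabla f(x) + \partial h(x) reduces to
% |\nabla_i f(x)| \le 1 when x_i = 0, and \nabla_i f(x) = -\operatorname{sign}(x_i) when x_i \neq 0.
```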
For problems (2) and (3), we define critical points as follows:

Definition 2 (Critical point (Attouch et al., 2013)). A point $x\in\mathbb{R}^d$ is a critical point of $\Psi$ iff $0 \in \nabla f(x) + \partial h(x)$.

The proximal operator is fundamental to many algorithms for solving problem (1) as well as its stochastic variants (2) and (3).
Definition 3 (Proximal operator). The proximal operator $\operatorname{prox}$ of a point $x\in\mathbb{R}^d$ under a proper and closed function $h$ with parameter $\eta>0$ is defined as:
\[
\operatorname{prox}_{\eta h}(x) = \arg\min_{y\in\mathbb{R}^d}\Big\{ h(y) + \frac{1}{2\eta}\|y - x\|^2 \Big\}. \qquad (5)
\]

In its vanilla version, proximal gradient descent performs the following iterative updates:
\[
x^{k+1} \leftarrow \operatorname{prox}_{\eta_k h}\big(x^k - \eta_k\nabla f(x^k)\big),
\]
for $k = 1, 2, \ldots$, where $\eta_k > 0$ is the step size at iteration $k$. To solve the stochastic optimization problems (2) and (3), we need a variant called proximal stochastic gradient descent (ProxSGD), with its update rule at iteration $k$ given by
\[
x^{k+1} \leftarrow \operatorname{prox}_{\eta_k h}\Big(x^k - \frac{\eta_k}{|\Xi_k|}\sum_{\xi\in\Xi_k}\nabla F(x^k;\xi)\Big).
\]
In ProxSGD, the gradient $\nabla f$ is replaced by gradients from a random subset of training samples, denoted by $\Xi_k$ at iteration $k$. Since $\xi$ is a random variable indicating a random index in $\{1,\ldots,n\}$, $F(x;\xi)$ is a random loss function for the random sample $\xi$, such that $f(x) := \mathbb{E}_\xi[F(x;\xi)]$.

With these definitions, we now introduce the metric used in our ergodic convergence analysis:
\[
P(x, g, \eta) := \frac{1}{\eta}\big(x - \operatorname{prox}_{\eta h}(x - \eta g)\big),
\]
which is also called the gradient mapping in the literature, e.g., (Parikh et al., 2014). For nonconvex problems, it is a standard approach to measure convergence (to a stationary point) by the gradient mapping, according to the following lemma:

Lemma 1 (Nonconvex convergence (Attouch et al., 2013)). A point $x$ is a critical point of (2) iff $P(x, g, \eta) = 0$.
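To make these definitions concrete, the following minimal NumPy sketch (an illustration, not part of the original paper) implements the proximal operator for the assumed choice $h(x) = \lambda\|x\|_1$ (soft-thresholding), one ProxSGD step, and the gradient mapping $P(x,g,\eta)$; the gradient estimate used at the end is just a random placeholder.

```python
import numpy as np

def prox_l1(x, eta, lam):
    """Proximal operator of h(x) = lam * ||x||_1 with parameter eta (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def prox_sgd_step(x, stoch_grad, eta, lam):
    """One ProxSGD update: x_{k+1} = prox_{eta h}(x_k - eta * G(x_k; Xi_k))."""
    return prox_l1(x - eta * stoch_grad, eta, lam)

def gradient_mapping(x, g, eta, lam):
    """P(x, g, eta) = (x - prox_{eta h}(x - eta g)) / eta; it vanishes iff x is critical."""
    return (x - prox_l1(x - eta * g, eta, lam)) / eta

# Toy usage with a random stand-in for a minibatch gradient.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
G = rng.standard_normal(5)
x_next = prox_sgd_step(x, G, eta=0.1, lam=0.01)
print(gradient_mapping(x, G, eta=0.1, lam=0.01))
```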
Therefore, we can use the following definition as a convergence metric:

Definition 4 (Iteration complexity (Reddi et al., 2016)). A solution $x$ is called $\epsilon$-accurate if $\mathbb{E}[\|P(x, g, \eta)\|^2] \le \epsilon$ for some $\eta > 0$. If an algorithm needs at least $K$ iterations to find an $\epsilon$-accurate solution, its iteration complexity is $K$.

We make the following assumptions throughout the paper; other algorithm-specific assumptions will be introduced later in the corresponding sections. We assume that $f(\cdot)$ is a smooth function with the following properties:

Assumption 1 (Lipschitz gradient). For the function $f$ there are Lipschitz constants $L_j, L > 0$ such that
\[
\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d, \qquad (6)
\]
\[
\|\nabla_j f(x) - \nabla_j f(x + \alpha e_j)\| \le L_{\max}|\alpha|, \quad \forall x \in \mathbb{R}^d, \qquad (7)
\]
where $e_j$ is an indicator vector that equals 1 at block $j$ and vanishes at all other blocks. Clearly, $L_{\max} \le L$.

As discussed above, we assume that $h$ (or each $h_j$) is a proper, closed and convex function, which is not necessarily smooth. If the algorithm has been executed for $k$ iterations, we let $\mathcal{F}_k$ denote the set consisting of all the samples used up to iteration $k$. Since $\mathcal{F}_k \subseteq \mathcal{F}_{k'}$ for all $k \le k'$, the collection of all such $\mathcal{F}_k$ forms a filtration. Under this setting, we restrict our attention to stochastic gradients with an unbiased estimate and bounded variance, which are common in the analysis of stochastic gradient descent or stochastic proximal gradient algorithms, e.g., (Lian et al., 2015; Ghadimi et al., 2016).

Assumption 2 (Unbiased gradient). For any $k$, we have $\mathbb{E}[G^k\,|\,\mathcal{F}_k] = g^k$.

Assumption 3 (Bounded variance). The variance of the stochastic gradient is bounded by $\mathbb{E}[\|G(x;\xi) - \nabla f(x)\|^2] \le \sigma^2$.

Recent years have witnessed the rapid development of parallel and distributed computation frameworks for large-scale machine learning problems. One popular architecture is called parameter server (Dean et al., 2012; Li et al., 2014a), which consists of worker nodes and server nodes. In this architecture, one or multiple master machines play the role of parameter servers, which maintain the model $x$. All other machines are worker nodes that communicate with the servers for training machine learning models. In particular, each worker issues two types of requests: pull the current model $x$ from the servers, and push the computed gradients to the servers.

Before proposing our AsyB-ProxSGD algorithm in the next section, let us first introduce its synchronous version. Suppose we execute ProxSGD with a minibatch of 128 random samples on 8 workers, and our goal is to use these 8 workers to train a model with 8 blocks (for brevity, we assume that each server hosts only one block, so we say server $j$ and block $j$ interchangeably in this paper). We can let each worker randomly take 128 samples, compute a summed partial gradient w.r.t. one block (say, block $j$) on them, and push it to server $j$. In the synchronous case, the servers will finally receive 8 summed partial gradients on the 8 blocks (each partial gradient containing information from 128 samples) in each iteration. Each server then updates its block by performing a proximal gradient descent step. In general, if we divide the model into $M$ blocks, each worker is assigned to update only one block using a minibatch of $N$ samples, and these updates can be performed in parallel within an iteration. Note that in this scenario, every worker has to compute its partial gradient over the whole minibatch; a schematic of one such synchronous round is sketched below.
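The following Python sketch (illustrative only; worker/server communication is simulated with plain function calls, and the least-squares loss and $\ell_1$ regularizer are assumptions made for the example) mirrors one synchronous round of this blockwise ProxSGD: every worker computes the partial gradient of its assigned block over the full minibatch, and each server applies a proximal step to its block.

```python
import numpy as np

def soft_threshold(x, thresh):
    # Proximal operator of thresh * ||x||_1.
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def partial_grad(block_id, x_blocks, A_blocks, b, batch_idx):
    # Partial gradient of a least-squares loss w.r.t. one block (assumed loss).
    A_batch = [A[batch_idx] for A in A_blocks]
    resid = sum(Ab @ xb for Ab, xb in zip(A_batch, x_blocks)) - b[batch_idx]
    return A_batch[block_id].T @ resid / len(batch_idx)

def synchronous_round(x_blocks, A_blocks, b, batch_idx, eta, lam):
    # Each "worker" handles one block; each "server" applies the blockwise prox update.
    grads = [partial_grad(j, x_blocks, A_blocks, b, batch_idx)
             for j in range(len(x_blocks))]                 # workers (conceptually in parallel)
    return [soft_threshold(xj - eta * gj, eta * lam)        # servers, one prox step per block
            for xj, gj in zip(x_blocks, grads)]

# Toy setup: 8 blocks of size 4, 256 samples, minibatch of 128.
rng = np.random.default_rng(1)
A_blocks = [rng.standard_normal((256, 4)) for _ in range(8)]
b = rng.standard_normal(256)
x_blocks = [np.zeros(4) for _ in range(8)]
batch = rng.choice(256, size=128, replace=False)
x_blocks = synchronous_round(x_blocks, A_blocks, b, batch, eta=0.05, lam=0.01)
```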
Thanks to the decomposition property of the proximal operator (Parikh et al., 2014), block updates can be done in parallel by the servers, which corresponds to model parallelism in the literature (e.g., (Niu et al., 2011; Zhou et al., 2016; Pan et al., 2016)). Another type of parallelism is called data parallelism, in which each worker uses only part of the $N$ random samples in the minibatch to compute a full gradient on $x$ (e.g., (Agarwal and Duchi, 2011; Ho et al., 2013)). We handle the issue of large $n$ by using stochastic algorithms, and our main focus is to handle the large-model challenge by model parallelism; the decomposition property underlying it is illustrated below.
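As a quick sanity check of this decomposition property (a self-contained illustration under the assumption $h(x) = \lambda\|x\|_1$, which is separable across blocks), the proximal operator of the full vector agrees with the concatenation of blockwise proximal operators:

```python
import numpy as np

def prox_l1(x, eta, lam):
    # prox of lam * ||x||_1 with parameter eta (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

rng = np.random.default_rng(2)
x = rng.standard_normal(12)
blocks = np.split(x, 3)                 # model split into M = 3 blocks

full = prox_l1(x, eta=0.1, lam=0.5)     # prox applied to the whole model
blockwise = np.concatenate([prox_l1(xj, eta=0.1, lam=0.5) for xj in blocks])

assert np.allclose(full, blockwise)     # separable h => the prox decomposes blockwise
```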
We now present our main contribution in this paper, the Asynchronous Block Proximal Stochastic Gradient (AsyB-ProxSGD) algorithm. Recall that asynchronous algorithms try to alleviate random delays in computation and communication across iterations. When the model is big, it is hard to host the whole model on a single node (a single machine or device), and we have to split it into $M$ blocks. In this case, no single node maintains all of the parameters in memory, and the nodes can update in parallel. The idea of model parallelism has been used in many applications, including deep learning (Dean et al., 2012) and factorization machines (Li et al., 2016).

We now formally introduce how our proposed algorithm works. The main idea is to update the blocks $x_j$ in parallel by different workers. In Algorithm 1, the first step of the worker loop ensures that the staleness is upper bounded by $T$, which is essential for convergence. Here we use $\hat{x}$ to emphasize that the pulled model parameters may not be consistent with those stored on the parameter servers. Since blocks are scattered over multiple servers, different blocks may have received different numbers of updates and thus carry different delays. For example, suppose the servers store the model $x = (x_1, x_2)$, and two workers update $x_1$ and $x_2$ in parallel. Our expectation is that $x$ is updated by both of them and becomes $x' = (x_1', x_2')$. However, under the partially asynchronous protocol (PAP), where workers may skip synchronization, the following can happen. At time 1, worker 1 pushes $x_1'$ and pulls $x_2$; thus, worker 1 holds $(x_1', x_2)$. At time 2, worker 2 pushes $x_2'$ and pulls $x_1'$; thus, worker 2 holds $(x_1', x_2')$. We can see that the next update by worker 1 is based on $(x_1', x_2)$, which has different delays on the two blocks.

Let us discuss this in more implementation detail for distributed clusters. In a distributed cluster, we split a large model $x$ into $M$ blocks, and one server maintains only a single block $x_j$ to achieve model parallelism. Thus, different blocks may be updated at different iterations by different workers. The same phenomenon also exists in shared-memory systems (i.e., a single machine with multiple CPU cores or GPUs, etc.). In these systems, the model is stored in main memory, which can be regarded as a "logical" server. Since "reading" and "writing" can happen simultaneously, a block $x_j$ may be "pulled" while it is being updated. In summary, the model parameters $\hat{x}$ held by a worker may be inconsistent with any actual state on the server side.

In our algorithm, workers can update multiple blocks in parallel, and this is the spirit of model parallelism here. However, we note that on the server side, a push request is usually more time consuming than a pull request, since it requires the additional computation of the proximal operator. Therefore, we should let workers gather more stochastic gradients before pushing to the server, and that is why we let each worker compute gradients on all $N$ samples in a minibatch. That is, a worker at iteration $t$ computes
\[
\hat{G}^t_{j_t} := \frac{1}{N}\sum_{i=1}^{N}\hat{G}_{j_t}(\hat{x}^t;\xi_{i,t}),
\]
where $j_t$ is the index of the block to be updated at iteration $t$, and $\hat{G}_{j_t}(\hat{x}^t;\xi_{i,t})$ is the partial gradient w.r.t. block $j_t$ at the model $\hat{x}^t$ pulled at iteration $t$, evaluated on sample $\xi_{i,t}$.

To facilitate the analysis of Algorithm 1, we rewrite it in an equivalent global view (from the server's perspective), as described in Algorithm 2.
In this algorithm, we define one iteration as the time to update any single block of $x$ and to successfully store it at the corresponding server. We use a counter $k$ to record how many times the model $x$ has been updated; $k$ increments every time a push request (model update request) is completed for a single block. Note that such a counter $k$ is not required by workers to compute gradients and is different from the counter $t$ in Algorithm 1, which is maintained by each worker to count how many sample gradients have been computed locally.

In particular, every worker takes $N$ stochastic sample gradients and aggregates them by averaging:
\[
\hat{G}^k_{j_k} := \frac{1}{N}\sum_{i=1}^{N}\nabla_{j_k}F(x^{k-d_k};\xi_{k,i}), \qquad (8)
\]
where $j_k$ is the random block index chosen at iteration $k$.
AsyB-ProxSGD: Block PAP StochasticGradient
AsyB-ProxSGD: Block PAP Stochastic Gradient

Server $j$ executes:
  Initialize $x$.
  loop
    if a Pull Request from a worker is received then
      Send $x_j$ to the worker.
    end if
    if a Push Request (gradient $G_j$) from a worker is received then
      $x_j \leftarrow \operatorname{prox}_{\eta h_j}(x_j - \eta G_j)$.
    end if
  end loop

Worker asynchronously performs on block $j$:
  Pull $x$ to initialize.
  for $t = 0, 1, \ldots$ do
    Wait until all iterations before $t - T$ are finished at all workers.
    Randomly choose $N$ training samples indexed by $\xi_{t,1}, \ldots, \xi_{t,N}$.
    Calculate $G^t_j = \frac{1}{N}\sum_{i=1}^{N}\nabla_j F(\hat{x}^t;\xi_{t,i})$.
    Push $G^t_j$ to server $j$.
    Pull the current model $x$ from the servers: $\hat{x}^{t+1} \leftarrow x$.
  end for

In (8), $d_k = (d_{k,1}, \ldots, d_{k,M})$ denotes the delay vector, i.e., the delays of the different blocks in $\hat{x}^k$ when computing the gradient for sample $\xi_{k,i}$ at iteration $k$, and $d_{k,j}$ is the delay of the specific block $x_j$. In addition, we denote by $\hat{x}^k := x^{k-d_k} := (x^{k-d_{k,1}}_1, \ldots, x^{k-d_{k,M}}_M)$ the vector of model parameters pulled from the server side. The server then updates $x^k$ to $x^{k+1}$ using a proximal gradient descent step; a schematic implementation of this worker/server interaction is sketched below.
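The sketch below (an illustration, not the paper's actual implementation) emulates Algorithm 1 with Python threads: one lock-protected "server" holds all blocks and applies a proximal update on every push, while each worker repeatedly pulls a possibly stale copy, computes a minibatch partial gradient for its block, and pushes it back. The quadratic loss, $\ell_1$ regularizer, and thread-based transport are assumptions, and the bounded-staleness barrier is omitted for brevity.

```python
import threading
import numpy as np

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

class Server:
    """Holds all blocks; push applies a blockwise proximal step (server side of Algorithm 1)."""
    def __init__(self, blocks, eta, lam):
        self.blocks, self.eta, self.lam = blocks, eta, lam
        self.lock = threading.Lock()
    def pull(self):
        with self.lock:
            return [b.copy() for b in self.blocks]
    def push(self, j, grad):
        with self.lock:
            self.blocks[j] = prox_l1(self.blocks[j] - self.eta * grad, self.eta, self.lam)

def worker(server, j, A_blocks, b, n_iters, batch_size, rng):
    """Worker side of Algorithm 1: pull a (possibly stale) model, compute a partial gradient, push."""
    x_hat = server.pull()
    for _ in range(n_iters):
        idx = rng.choice(len(b), size=batch_size, replace=True)
        resid = sum(A[idx] @ xb for A, xb in zip(A_blocks, x_hat)) - b[idx]
        grad_j = A_blocks[j][idx].T @ resid / batch_size
        server.push(j, grad_j)
        x_hat = server.pull()      # staleness accumulates between this pull and the next push

rng = np.random.default_rng(3)
M, d_j, n = 4, 8, 512
A_blocks = [rng.standard_normal((n, d_j)) for _ in range(M)]
b = rng.standard_normal(n)
server = Server([np.zeros(d_j) for _ in range(M)], eta=0.01, lam=0.001)
threads = [threading.Thread(target=worker,
                            args=(server, j, A_blocks, b, 50, 32, np.random.default_rng(j)))
           for j in range(M)]
for t in threads: t.start()
for t in threads: t.join()
```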
To analyze Algorithm 2, we make the following common assumptions on delay and independence (Niu et al., 2011; Liu and Wright, 2015; Avron et al., 2015):

Assumption 4 (Bounded delay). There exists a constant $T$ such that for all $k$, all values in the delay vector $d_k$ are upper bounded by $T$: $0 \le d_{k,j} \le T$ for all $j$.

Assumption 5 (Independence). All random variables, including the selected indices $\{j_k\}$ and samples $\{\xi_{k,i}\}$ for all $k$ and $i$ in Algorithm 2, are mutually independent.

The bounded delay assumption guarantees that gradients from workers are not too old. Note that the maximum delay $T$ is roughly proportional to the number of workers in practice. We can force a worker to wait for the others if it runs too fast, as in the "wait" step of the worker loop in Algorithm 1.

Algorithm 2
AsyB-ProxSGD (from a Global Perspective)

  Initialize $x^1$.
  for $k = 1, \ldots, K$ do
    Randomly select $N$ training samples indexed by $\xi_{k,1}, \ldots, \xi_{k,N}$.
    Randomly select a coordinate index $j_k$ from $\{1, \ldots, M\}$.
    Calculate the averaged gradient $\hat{G}^k_{j_k}$ according to (8).
    for $j = 1, \ldots, M$ do
      if $j = j_k$ then
        $x^{k+1}_j \leftarrow \operatorname{prox}_{\eta_k h_j}(x^k_j - \eta_k\hat{G}^k_j)$.
      else
        $x^{k+1}_j \leftarrow x^k_j$.
      end if
    end for
  end for

This setting is also called partially synchronous parallel (Ho et al., 2013; Li et al., 2014b; Zhou et al., 2016) in the literature. The independence assumption can be met by selecting samples with replacement, which can be implemented using distributed file systems like HDFS (Borthakur et al., 2008). These two assumptions are common in convergence analyses of asynchronous parallel algorithms, e.g., (Lian et al., 2015; Davis et al., 2016).
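For intuition, the following single-process sketch (illustrative only; the simulated delays, loss, and regularizer are assumptions) mimics Algorithm 2's global view: at each counter value $k$, a random block is updated with a proximal step computed from a snapshot of the model that is up to $T$ iterations stale.

```python
import numpy as np

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def simulate_algorithm2(A_blocks, b, K=500, N=32, T=4, eta=0.01, lam=0.001, seed=0):
    rng = np.random.default_rng(seed)
    M = len(A_blocks)
    x = [np.zeros(A.shape[1]) for A in A_blocks]
    history = [[xb.copy() for xb in x]]            # stores x^1, x^2, ... for delayed reads
    for k in range(K):
        delays = rng.integers(0, min(T, k) + 1, size=M)   # per-block delays d_{k,j} <= T
        x_hat = [history[k - delays[j]][j] for j in range(M)]
        idx = rng.choice(len(b), size=N, replace=True)
        j_k = rng.integers(M)                      # block chosen at iteration k
        resid = sum(A[idx] @ xb for A, xb in zip(A_blocks, x_hat)) - b[idx]
        g_hat = A_blocks[j_k][idx].T @ resid / N   # averaged delayed partial gradient, cf. (8)
        x[j_k] = prox_l1(x[j_k] - eta * g_hat, eta, lam)
        history.append([xb.copy() for xb in x])
    return x

rng = np.random.default_rng(4)
A_blocks = [rng.standard_normal((256, 8)) for _ in range(4)]
b = rng.standard_normal(256)
x_final = simulate_algorithm2(A_blocks, b)
```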
We present our main convergence theorem as follows:

Theorem 1. If the step size sequence $\{\eta_k\}$ in Algorithm 1 satisfies
\[
\eta_k \le \frac{1}{2L_{\max}}, \qquad 24\,\eta_k L^2 T\sum_{l=1}^{T}\eta_{k+l} \le M, \qquad (9)
\]
for all $k = 1, 2, \ldots, K$, we have the following ergodic convergence rate for Algorithm 2:
\[
\frac{\sum_{k=1}^{K}(\eta_k - L_{\max}\eta_k^2)\,\mathbb{E}\big[\|P(x^k, g^k, \eta_k)\|^2\big]}{\sum_{k=1}^{K}(\eta_k - L_{\max}\eta_k^2)}
\le \frac{8M\big(\Psi(x^1) - \Psi(x^*)\big)}{\sum_{k=1}^{K}(\eta_k - L_{\max}\eta_k^2)}
+ \frac{8M\sum_{k=1}^{K}\Big(\dfrac{L\eta_k^2}{MN} + \dfrac{3\eta_k L^2 T\sum_{l=1}^{T}\eta_{k-l}^2}{M^2N}\Big)\sigma^2}{\sum_{k=1}^{K}(\eta_k - L_{\max}\eta_k^2)}, \qquad (10)
\]
where the expectation is taken with respect to all the random variables in Algorithm 2.
Taking a closer look at Theorem 1, we can choose the step size $\eta_k$ as a constant value and obtain the following result on the convergence rate:

Corollary 1. Let the step length be a constant, i.e.,
\[
\eta := \sqrt{\frac{\big(\Psi(x^1) - \Psi(x^*)\big)MN}{LK\sigma^2}}. \qquad (11)
\]
If the delay bound $T$ satisfies
\[
K \ge \frac{9\big(\Psi(x^1) - \Psi(x^*)\big)NL\,(T+1)^4}{M\sigma^2}, \qquad (12)
\]
then the output of Algorithm 2 satisfies the following ergodic convergence rate:
\[
\min_{k=1,\ldots,K}\mathbb{E}\big[\|P(x^k, g^k, \eta)\|^2\big] \le \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|P(x^k, g^k, \eta)\|^2\big] \le 48\sqrt{\frac{\big(\Psi(x^1) - \Psi(x^*)\big)LM\sigma^2}{KN}}. \qquad (13)
\]
Remark 1 (Linear speedup w.r.t. the staleness). When the maximum delay $T$ is bounded by $O(K^{1/4})$, the bound on the gradient mapping $\mathbb{E}[\|P(x, g, \eta)\|^2]$ decreases regardless of $T$, and thus linear speedup is achievable (if all other parameters are held constant). In other words, by (12) and (13), as long as $T$ is no more than $O(K^{1/4})$, the iteration complexity (from a global perspective) to achieve $\epsilon$-optimality is $O(1/\epsilon^2)$, independent of $T$.

Remark 2 (Linear speedup w.r.t. the number of workers). The delay bound $T$ is roughly proportional to the number of workers, so the total number of iterations as a function of $T$ indicates convergence behavior as a function of the number of workers. Since the iteration complexity to achieve $\epsilon$-optimality is $O(1/\epsilon^2)$ and is independent of $T$, the number of iterations performed by each worker shrinks to $1/T$ of the single-worker count when $\Theta(T)$ workers work in parallel. This shows that our algorithm nearly achieves linear speedup.

Remark 3 (Consistency with ProxSGD). When $T = 0$, our proposed AsyB-ProxSGD reduces to vanilla proximal stochastic gradient descent (ProxSGD) (e.g., (Ghadimi et al., 2016)). Thus, the iteration complexity is $O(1/\epsilon^2)$ according to (13), attaining the same result as in (Ghadimi et al., 2016) yet without assuming increasing minibatch sizes.
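To spell out the speedup arithmetic behind Remark 2 (a back-of-envelope restatement under the paper's conditions, not an additional result), suppose $K(\epsilon)$ block updates are needed for $\epsilon$-optimality and $p = \Theta(T)$ workers stay busy in parallel:

```latex
% Total server-side block updates needed (independent of T under (12)):
K(\epsilon) = O\!\left(\tfrac{1}{\epsilon^2}\right).
% With p workers pipelining updates, each worker contributes about
K_{\mathrm{worker}}(\epsilon) \approx \frac{K(\epsilon)}{p},
% so wall-clock time scales as 1/p as long as the induced staleness satisfies
% T = \Theta(p) \le O\!\big(K^{1/4}\big), i.e., condition (12) holds. Hence
\text{speedup} \;=\; \frac{K(\epsilon)}{K_{\mathrm{worker}}(\epsilon)} \;\approx\; p.
```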
We now present numerical results to confirm that the proposed algorithm can be used to solve challenging nonconvex, nonsmooth problems in machine learning.

Setup:
In our experiments, we consider the sparse logistic regression problem:
\[
\min_{x}\ \frac{1}{n}\sum_{i=1}^{n}\log\big(1 + \exp(-b_i\,a_i^{\top}x)\big) + \lambda_1\|x\|_1 + \lambda_2\|x\|_2^2. \qquad (14)
\]
The $\ell_1$-regularized logistic regression model is widely used for large-scale risk minimization. We consider the Avazu dataset, which was used in a click-through rate prediction competition jointly hosted by Avazu and Kaggle in 2014. Its training set (avazu-app) contains more than 14 million samples, 1 million features, and 4050 million nonzero entries. In other words, both $n$ and $d$ in (14) are large.

We use a cluster of 16 instances on Google Cloud; each server or worker process uses just one core. Up to 8 instances serve as server nodes, while the other 8 instances serve as worker nodes. To show the advantage of asynchronous parallelism, we set up four experiments adopting 1, 2, 4, and 8 worker nodes, respectively. For all experiments, the whole dataset is shuffled and every worker holds a copy of it. When computing a stochastic gradient, each worker takes one minibatch of random samples from its own copy. This way, each sample is used by a worker with equal probability, empirically mimicking the scenario of our analysis.

We consider the suboptimality gap as our performance metric, defined as the gap between $f(x)$ and $f(x^*)$. Here we estimate the optimum $\hat{x}$ by performing 5 times as many iterations as needed for convergence. The hyper-parameters are fixed across all experiments: the regularization coefficients $\lambda_1, \lambda_2$ and the constant in the step size schedule $\eta_k \propto 1/\sqrt{k}$, applied at every iteration $k$.

[Figure 1: Convergence of AsyB-ProxSGD on the sparse logistic regression problem under different numbers of workers; the number of servers is fixed to 8. (a) Iteration vs. objective; (b) time vs. objective.]
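For reference, a worker's minibatch partial gradient of the smooth part of (14) can be computed as in the sketch below (illustrative only; the dense data layout and the block slicing are assumptions made for the example).

```python
import numpy as np

def logistic_minibatch_grad(A, b, x, batch_idx, block_slice):
    """Partial gradient of (1/n) sum_i log(1 + exp(-b_i a_i^T x)) w.r.t. one block.

    A: (n, d) design matrix, b: (n,) labels in {-1, +1},
    block_slice: the coordinates of x owned by this worker's block.
    """
    A_batch, b_batch = A[batch_idx], b[batch_idx]
    margins = b_batch * (A_batch @ x)
    coeff = -b_batch / (1.0 + np.exp(margins))   # d/dm log(1 + exp(-m)) evaluated at m = b a^T x
    return A_batch[:, block_slice].T @ coeff / len(batch_idx)

rng = np.random.default_rng(5)
A = rng.standard_normal((1000, 50))
b = rng.choice([-1.0, 1.0], size=1000)
x = np.zeros(50)
grad_block = logistic_minibatch_grad(A, b, x, rng.choice(1000, 128, replace=False),
                                     block_slice=slice(0, 10))
```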
Implementation: We implemented our algorithm on MXNet (Chen et al., 2015), a flexible and efficient deep learning library with support for distributed machine learning. Due to the sparse nature of the dataset, the model $x$ is stored as a sparse ndarray. In each iteration, a worker only pulls those blocks of $x$ that are actively related to its sampled minibatch, calculates the gradient w.r.t. this minibatch of data, and then pushes the gradients for those activated blocks only.
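The sketch below illustrates this sparse pull/compute/push pattern against a generic key-value store interface; `kv.pull_rows` and `kv.push_rows` are hypothetical names standing in for a parameter-server client (such as MXNet's KVStore), and the data handling is likewise an assumption rather than the paper's actual code.

```python
import numpy as np

def active_feature_ids(A_batch):
    # Feature (block) indices that actually appear in this minibatch.
    return np.unique(A_batch.nonzero()[1])

def worker_step(kv, A, b, batch_idx):
    """One AsyB-ProxSGD worker step with sparse pulls/pushes.

    kv is assumed to expose pull_rows(ids) -> rows and push_rows(ids, grads);
    the proximal update itself happens on the server side.
    """
    A_batch, b_batch = A[batch_idx], b[batch_idx]
    active = active_feature_ids(A_batch)
    x_active = kv.pull_rows(active)                   # pull only the touched blocks
    margins = b_batch * (A_batch[:, active] @ x_active)
    coeff = -b_batch / (1.0 + np.exp(margins))
    grad_active = A_batch[:, active].T @ coeff / len(batch_idx)
    kv.push_rows(active, grad_active)                 # push gradients for active blocks only
```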
Results: Empirically, Assumption 4 (bounded delays) is observed to hold on this cluster: the maximum delay does not exceed 100 iterations unless some worker nodes fail. Fig. 1(a) and Fig. 1(b) show the convergence behavior of the AsyB-ProxSGD algorithm in terms of objective function values. We clearly observe the convergence of our proposed algorithm, confirming that asynchrony with tolerable delays can still lead to convergence. In addition, the running time tends to drop as the number of workers increases.

For AsyB-ProxSGD, we are particularly interested in two kinds of speedup, namely iteration speedup and running time speedup. If we need $T_1$ iterations (with $T_1$ sample gradients processed by servers) to achieve a certain suboptimality level using one worker, and $T_p$ iterations to achieve the same level using $p$ workers, then the iteration speedup is defined as $p \times T_1/T_p$ (Lian et al., 2015). Note that iterations are counted on the server side, i.e., as the number of minibatch gradients processed by the servers. The time speedup is simply defined as the ratio between the running time of using one worker and that of using $p$ workers to achieve the same suboptimality level; the computation is also spelled out in the short snippet below. We summarize iteration and running time speedups in Table 1.

Table 1: Iteration speedup and time speedup of AsyB-ProxSGD at a fixed optimality level.

Workers             1       2       4       8
Iteration Speedup   1.000   2.127   3.689   6.748
Time Speedup        1.000   1.973   4.103   8.937

We further evaluate the relationship between the number of servers and the convergence behavior. Since the model has millions of parameters to be trained, storing the whole model on a single machine can be ineffective. In fact, from Fig. 2 we can even see nearly linear speedup w.r.t. the number of servers. The reason is that more servers significantly reduce the length of the request queue on the server side. When we have only one server, the blue dashed curve in Fig. 2(b) looks like a tilted staircase, and further investigation shows that some push requests take too long to be processed. Therefore, we have to use more than one server to observe parallel speedup in Fig. 1, so that servers are not the bottleneck.

[Figure 2: Convergence of AsyB-ProxSGD on the sparse logistic regression problem under different numbers of servers, with 8 workers. (a) Iteration vs. objective; (b) time vs. objective.]
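The two speedup metrics can be computed as follows (a small illustration; the iteration counts and running times used here are made-up placeholders, not measurements from the paper):

```python
def iteration_speedup(T1, Tp, p):
    # p workers produce p minibatch gradients per round; T counts server-side iterations.
    return p * T1 / Tp

def time_speedup(time1, timep):
    return time1 / timep

# Hypothetical numbers purely for illustration.
print(iteration_speedup(T1=10_000, Tp=12_000, p=8))   # ~6.7x
print(time_speedup(time1=3_600.0, timep=420.0))       # ~8.6x
```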
Related Work

Robbins and Monro (1951) propose a classical stochastic approximation algorithm for solving a class of strongly convex problems, which is regarded as the seminal work on stochastic optimization. For nonconvex problems, Ghadimi and Lan (2013) prove that SGD has an ergodic convergence rate of $O(1/\sqrt{K})$, which is consistent with the convergence rate of SGD for convex problems. To deal with nonsmoothness, the proximal gradient algorithm is widely considered, and its stochastic variant has been heavily studied for convex problems. Duchi and Singer (2009) show that ProxSGD converges at the rate of $O(1/\mu K)$ for $\mu$-strongly convex objective functions when the step size $\eta_k$ diminishes over iterations. However, for nonconvex problems, rather limited studies on ProxSGD exist so far. To the best of our knowledge, the seminal work on ProxSGD for nonconvex problems was done by Ghadimi et al. (2016), in which the convergence analysis is based on the assumption of an increasing minibatch size.

Updating a single block in each iteration is also referred to as a block coordinate method in the literature. Block coordinate methods for smooth problems with separable, convex constraints (Tseng, 1991) and general nonsmooth regularizers (Razaviyayn et al., 2014; Davis, 2016; Zhou et al., 2016) have been proposed. However, the study of stochastic coordinate descent is limited, and existing work like (Liu and Wright, 2015) focuses on convex problems. Xu and Yin (2015) study block stochastic proximal methods for nonconvex problems; however, they only analyze convergence to stationary points assuming an increasing minibatch size, and a convergence rate is not provided. Davis et al. (2016) present a stochastic block coordinate method, which is the closest to our work in this paper. However, the algorithm studied in (Davis et al., 2016) depends on the use of a noise term with diminishing variance to guarantee convergence. Our convergence results for ProxSGD do not rely on increasing batch sizes, variance reduction, or the use of additional noise terms.

Concluding Remarks

In this paper, we propose AsyB-ProxSGD, an extension of the proximal stochastic gradient (ProxSGD) algorithm to the asynchronous model parallelism setting. Our proposed algorithm aims at solving nonconvex, nonsmooth optimization problems involving large data sizes and model dimensions. Theoretically, we prove that AsyB-ProxSGD converges to critical points with the same order of convergence rate as ProxSGD, as long as workers have bounded delay during iterations. Our convergence result does not rely on increasing minibatch sizes, which is required in all existing works on ProxSGD (and its variants). We further prove that AsyB-ProxSGD achieves linear speedup when the number of workers is bounded by $O(K^{1/4})$. We implement AsyB-ProxSGD on a Parameter Server, and experiments on a large-scale real-world dataset confirm its effectiveness.

References
A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.

H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. Journal of the ACM (JACM), 62(6):51, 2015.

D. Borthakur et al. HDFS architecture guide. Hadoop Apache Project, 53, 2008.

T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

D. Davis. The asynchronous PALM algorithm for nonsmooth nonconvex problems. arXiv preprint arXiv:1604.00526, 2016.

D. Davis, B. Edmunds, and M. Udell. The sound of APALM clapping: Faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. In Advances in Neural Information Processing Systems, pages 226–234, 2016.

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.

Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems, pages 1223–1231, 2013.

M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.

H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379–387, 2015.

M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014a.

M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014b.

M. Li, Z. Liu, A. J. Smola, and Y.-X. Wang. DiFacto: Distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 377–386. ACM, 2016.

X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.

J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556. ACM, 2009.

H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, 2017. doi: 10.1137/16m1057000. URL https://doi.org/10.1137%2F16m1057000.

F. Niu, B. Recht, C. Re, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

X. Pan, M. Lam, S. Tu, D. Papailiopoulos, C. Zhang, M. I. Jordan, K. Ramchandran, and C. Ré. Cyclades: Conflict-free asynchronous machine learning. In Advances in Neural Information Processing Systems, pages 2568–2576, 2016.

N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

M. Razaviyayn, M. Hong, Z.-Q. Luo, and J.-S. Pang. Parallel successive convex approximation for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, pages 1440–1448, 2014.

S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145–1153, 2016.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

R. Sun and Z.-Q. Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.

P. Tseng. On the rate of convergence of a partially asynchronous gradient projection algorithm. SIAM Journal on Optimization, 1(4):603–619, 1991.

H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.

Y. Xu and W. Yin. Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization, 25(3):1686–1716, 2015.

S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.

Y. Zhou, Y. Yu, W. Dai, Y. Liang, and E. Xing. On convergence of model parallel proximal gradient algorithm for stale synchronous parallel system. In Artificial Intelligence and Statistics, pages 713–722, 2016.
A Auxiliary Lemmas
Lemma 2 ((Ghadimi et al., 2016)). For all $y \leftarrow \operatorname{prox}_{\eta h}(x - \eta g)$, we have:
\[
\langle g, y - x\rangle + \big(h(y) - h(x)\big) \le -\frac{\|y - x\|^2}{\eta}. \qquad (15)
\]
Due to slightly different notations and definitions in (Ghadimi et al., 2016), we provide a proof here for completeness. We refer readers to (Ghadimi et al., 2016) for more details.

Proof. By the definition of the proximal operator, there exists a $p \in \partial h(y)$ such that:
\[
\Big\langle g + \frac{y - x}{\eta} + p,\; x - y\Big\rangle \ge 0,
\]
\[
\langle g, x - y\rangle \ge \frac{1}{\eta}\langle y - x, y - x\rangle + \langle p, y - x\rangle,
\]
\[
\langle g, x - y\rangle + \big(h(x) - h(y)\big) \ge \frac{1}{\eta}\|y - x\|^2,
\]
which proves the lemma. ⊓⊔

By applying the above lemma to each block, we have the following corollary, which is useful in the convergence analysis of Algorithm 2.

Corollary 2.
For all $y_j \leftarrow \operatorname{prox}_{\eta h_j}(x_j - \eta g_j)$, we have:
\[
\langle g_j, y_j - x_j\rangle + \big(h_j(y_j) - h_j(x_j)\big) \le -\frac{\|y_j - x_j\|^2}{\eta}. \qquad (16)
\]

Lemma 3 ((Ghadimi et al., 2016)). For all $x, g, G \in \mathbb{R}^d$, if $h:\mathbb{R}^d\to\mathbb{R}$ is a convex function, we have
\[
\|\operatorname{prox}_{\eta h}(x - \eta G) - \operatorname{prox}_{\eta h}(x - \eta g)\| \le \eta\|G - g\|. \qquad (17)
\]

Proof.
Let $y$ denote $\operatorname{prox}_{\eta h}(x - \eta G)$ and $z$ denote $\operatorname{prox}_{\eta h}(x - \eta g)$. By the definition of the proximal operator, for all $u \in \mathbb{R}^d$ we have
\[
\Big\langle G + \frac{y - x}{\eta} + p,\; u - y\Big\rangle \ge 0, \qquad
\Big\langle g + \frac{z - x}{\eta} + q,\; u - z\Big\rangle \ge 0,
\]
where $p \in \partial h(y)$ and $q \in \partial h(z)$. Substituting $z$ for $u$ in the first inequality and $y$ for $u$ in the second, we have
\[
\Big\langle G + \frac{y - x}{\eta} + p,\; z - y\Big\rangle \ge 0, \qquad
\Big\langle g + \frac{z - x}{\eta} + q,\; y - z\Big\rangle \ge 0.
\]
Then, we have
\[
\langle G, z - y\rangle \ge \Big\langle \frac{y - x}{\eta},\, y - z\Big\rangle + \langle p, y - z\rangle \qquad (18)
\]
\[
= \frac{1}{\eta}\langle y - z, y - z\rangle + \frac{1}{\eta}\langle z - x, y - z\rangle + \langle p, y - z\rangle \qquad (19)
\]
\[
\ge \frac{\|y - z\|^2}{\eta} + \frac{1}{\eta}\langle z - x, y - z\rangle + h(y) - h(z), \qquad (20)
\]
and
\[
\langle g, y - z\rangle \ge \Big\langle \frac{z - x}{\eta} + q,\, z - y\Big\rangle \qquad (21)
\]
\[
= \frac{1}{\eta}\langle z - x, z - y\rangle + \langle q, z - y\rangle \qquad (22)
\]
\[
\ge \frac{1}{\eta}\langle z - x, z - y\rangle + h(z) - h(y). \qquad (23)
\]
By adding (20) and (23), we obtain
\[
\|G - g\|\,\|z - y\| \ge \langle G - g, z - y\rangle \ge \frac{\|y - z\|^2}{\eta},
\]
which proves the lemma. ⊓⊔

Corollary 3.
For all $x_j, g_j, G_j \in \mathbb{R}^{d_j}$, we have
\[
\|\operatorname{prox}_{\eta h_j}(x_j - \eta G_j) - \operatorname{prox}_{\eta h_j}(x_j - \eta g_j)\| \le \eta\|G_j - g_j\|. \qquad (24)
\]

Lemma 4 ((Ghadimi et al., 2016)). For any $g_1$ and $g_2$, we have
\[
\|P(x, g_1, \eta) - P(x, g_2, \eta)\| \le \|g_1 - g_2\|. \qquad (25)
\]

Proof. It can be obtained by directly applying Lemma 3 and the definition of the gradient mapping. ⊓⊔
Corollary 4.
Let $P_j(x, g, \eta) := \frac{1}{\eta}\big(x_j - \operatorname{prox}_{\eta h_j}(x_j - \eta g_j)\big)$. Then, for any $G_j$ and $g_j$, we have
\[
\|P_j(x, G, \eta) - P_j(x, g, \eta)\| \le \|G - g\|. \qquad (26)
\]

Lemma 5 ((Reddi et al., 2016)). Suppose we define $y = \operatorname{prox}_{\eta h}(x - \eta g)$ for some $g$. Then for $y$, the following inequality holds for all $z$:
\[
\Psi(y) \le \Psi(z) + \langle y - z, \nabla f(x) - g\rangle + \Big(\frac{L}{2} - \frac{1}{2\eta}\Big)\|y - x\|^2 + \Big(\frac{L}{2} + \frac{1}{2\eta}\Big)\|z - x\|^2 - \frac{1}{2\eta}\|y - z\|^2. \qquad (27)
\]

Corollary 5. Suppose we define $y_j = \operatorname{prox}_{\eta h_j}(x_j - \eta g_j)$ for some $g_j$, and the index $j$ is chosen among the $M$ indices uniformly at random. For all other $j' \ne j$, we let $y_{j'} = x_{j'}$. Then the following inequality holds for all $z$:
\[
\Psi(y) \le \Psi(z) + \langle\nabla_j f(x) - g_j, y_j - z_j\rangle + \Big(\frac{L_j}{2} - \frac{1}{2\eta}\Big)\|y_j - x_j\|^2 + \Big(\frac{L_j}{2} + \frac{1}{2\eta}\Big)\|z_j - x_j\|^2 - \frac{1}{2\eta}\|y_j - z_j\|^2. \qquad (28)
\]

Proof. From the definition of the proximal operator, we have
\[
h_j(y_j) + \langle g_j, y_j - x_j\rangle + \frac{1}{2\eta}\|y_j - x_j\|^2 + \frac{\eta}{2}\|g_j\|^2 \le h_j(z_j) + \langle g_j, z_j - x_j\rangle + \frac{1}{2\eta}\|z_j - x_j\|^2 + \frac{\eta}{2}\|g_j\|^2 - \frac{1}{2\eta}\|y_j - z_j\|^2.
\]
By rearranging the above inequality, we have
\[
h_j(y_j) + \langle g_j, y_j - z_j\rangle \le h_j(z_j) + \frac{1}{2\eta}\big[\|z_j - x_j\|^2 - \|y_j - x_j\|^2 - \|y_j - z_j\|^2\big]. \qquad (29)
\]
Since $f$ has Lipschitz block gradients, we have
\[
f(y) \le f(x) + \langle\nabla_j f(x), y_j - x_j\rangle + \frac{L_j}{2}\|y_j - x_j\|^2,
\qquad
f(x) \le f(z) + \langle\nabla_j f(x), x_j - z_j\rangle + \frac{L_j}{2}\|x_j - z_j\|^2.
\]
Adding these two inequalities, we have
\[
f(y) \le f(z) + \langle\nabla_j f(x), y_j - z_j\rangle + \frac{L_j}{2}\big[\|y_j - x_j\|^2 + \|z_j - x_j\|^2\big], \qquad (30)
\]
and therefore
\[
\Psi(y) \le \Psi(z) + \langle\nabla_j f(x) - g_j, y_j - z_j\rangle + \Big(\frac{L_j}{2} - \frac{1}{2\eta}\Big)\|y_j - x_j\|^2 + \Big(\frac{L_j}{2} + \frac{1}{2\eta}\Big)\|z_j - x_j\|^2 - \frac{1}{2\eta}\|y_j - z_j\|^2. \qquad \text{⊓⊔}
\]

Lemma 6 (Young's inequality).
\[
\langle a, b\rangle \le \frac{\delta}{2}\|a\|^2 + \frac{1}{2\delta}\|b\|^2. \qquad (31)
\]

We recall and define some notation for the subsequent convergence analysis. We denote by $\hat{G}^k_j$ the average of the delayed stochastic gradients and by $\hat{g}^k_j$ the average of the delayed true gradients, respectively:
\[
\hat{G}^k_j := \frac{1}{N}\sum_{i=1}^{N}\nabla_j F(x^{k-d_k};\xi_{k,i}), \qquad
\hat{g}^k_j := \frac{1}{N}\sum_{i=1}^{N}\nabla_j f(x^{k-d_k}).
\]
Moreover, we denote by $\delta^k_j := \hat{g}^k_j - \hat{G}^k_j$ the difference between these two quantities.

B Convergence Analysis for B-PAPSG
To simplify notation, we use $j$ instead of $j_k$ in this section. Since we update only one block in each iteration, we define an auxiliary function as follows:
\[
P_j(x, g, \eta) := \frac{1}{\eta}\big(x_j - \operatorname{prox}_{\eta h_j}(x_j - \eta g_j)\big),
\]
where $x_j$ and $g_j$ denote the $j$-th coordinate block of $x$ and $g$.

B.1 Milestone Lemmas

Lemma 7 (Descent Lemma).
\[
\mathbb{E}_j[\Psi(x^{k+1})\,|\,\mathcal{F}_k] \le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{2M}\|P(x^k,\hat g^k,\eta_k)\|^2 + \frac{\eta_k}{M}\|g^k - \hat g^k\|^2 + \frac{L\eta_k^2}{MN}\sigma^2. \qquad (32)
\]

Lemma 8.
Suppose the sequence $\{x^k\}$ is generated by Algorithm 2. Then, we have
\[
\mathbb{E}\big[\|x^k - x^{k-\tau}\|^2\big] \le \frac{2T\sum_{l=1}^{T}\eta_{k-l}^2}{MN}\sigma^2 + 2\,\mathbb{E}\Bigg[\bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\bigg\|^2\Bigg]. \qquad (33)
\]

Lemma 9.
Suppose the sequence $\{x^k\}$ is generated by Algorithm 2. Then, we have
\[
\mathbb{E}\big[\|g^k_j - \hat g^k_j\|^2\big] \le \frac{2L^2T\sum_{l=1}^{T}\eta_{k-l}^2}{MN}\sigma^2 + \frac{2L^2T}{M}\sum_{l=1}^{T}\eta_{k-l}^2\,\mathbb{E}\big[\|P(x^{k-l},\hat g^{k-l},\eta_{k-l})\|^2\big]. \qquad (34)
\]

B.2 Proof of Theorem 1
Proof.
We have
\[
\|P_j(x^k,\hat g^k,\eta_k)\| \ge \|P_j(x^k, g^k,\eta_k)\| - \|g^k - \hat g^k\|,
\]
and hence $\|P(x^k,\hat g^k,\eta_k)\|^2 \ge \frac{1}{2}\|P(x^k,g^k,\eta_k)\|^2 - \|g^k - \hat g^k\|^2$. When $\eta_k \le \frac{1}{2L_{\max}}$, we can apply this to Lemma 7:
\[
\begin{aligned}
\mathbb{E}_j[\Psi(x^{k+1})\,|\,\mathcal{F}_k]
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{2M}\|P(x^k,\hat g^k,\eta_k)\|^2 + \frac{\eta_k}{M}\|g^k-\hat g^k\|^2 + \frac{L\eta_k^2}{MN}\sigma^2\\
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{4M}\|P(x^k,\hat g^k,\eta_k)\|^2 - \frac{\eta_k}{8M}\|P(x^k,\hat g^k,\eta_k)\|^2 + \frac{\eta_k}{M}\|g^k-\hat g^k\|^2 + \frac{L\eta_k^2}{MN}\sigma^2\\
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{8M}\|P(x^k, g^k,\eta_k)\|^2 + \frac{3\eta_k}{2M}\|g^k-\hat g^k\|^2 - \frac{\eta_k}{8M}\|P(x^k,\hat g^k,\eta_k)\|^2 + \frac{L\eta_k^2}{MN}\sigma^2.
\end{aligned}
\]
By Lemma 9, we have
\[
\begin{aligned}
\mathbb{E}_j[\Psi(x^{k+1})\,|\,\mathcal{F}_k]
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{8M}\|P(x^k, g^k,\eta_k)\|^2 - \frac{\eta_k}{8M}\|P(x^k,\hat g^k,\eta_k)\|^2 + \frac{L\eta_k^2}{MN}\sigma^2\\
&\quad + \frac{3\eta_k}{2M}\Bigg(\frac{2L^2T\sum_{l=1}^{T}\eta_{k-l}^2}{MN}\sigma^2 + \frac{2L^2T}{M}\sum_{l=1}^{T}\eta_{k-l}^2\|P(x^{k-l},\hat g^{k-l},\eta_{k-l})\|^2\Bigg)\\
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{8M}\|P(x^k, g^k,\eta_k)\|^2 + \Bigg(\frac{L\eta_k^2}{MN} + \frac{3\eta_kL^2T\sum_{l=1}^{T}\eta_{k-l}^2}{M^2N}\Bigg)\sigma^2\\
&\quad + \frac{3\eta_kL^2T}{M^2}\sum_{l=1}^{T}\eta_{k-l}^2\,\|P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\|^2 - \frac{\eta_k}{8M}\|P(x^k,\hat g^k,\eta_k)\|^2.
\end{aligned}
\]
By taking the telescoping sum over $k = 1,\ldots,K$, we have
\[
\begin{aligned}
\sum_{k=1}^{K}\frac{\eta_k - L_{\max}\eta_k^2}{8M}\,\mathbb{E}\big[\|P(x^k,g^k,\eta_k)\|^2\big]
&\le \Psi(x^1) - \Psi(x^*) + \sum_{k=1}^{K}\Bigg(\frac{L\eta_k^2}{MN} + \frac{3\eta_kL^2T\sum_{l=1}^{T}\eta_{k-l}^2}{M^2N}\Bigg)\sigma^2\\
&\quad - \sum_{k=1}^{K}\Bigg(\frac{\eta_k}{8M} - \frac{3\eta_k^2L^2T}{M^2}\sum_{l=1}^{T}\eta_{k+l}\Bigg)\mathbb{E}\big[\|P(x^k,\hat g^k,\eta_k)\|^2\big]\\
&\le \Psi(x^1) - \Psi(x^*) + \sum_{k=1}^{K}\Bigg(\frac{L\eta_k^2}{MN} + \frac{3\eta_kL^2T\sum_{l=1}^{T}\eta_{k-l}^2}{M^2N}\Bigg)\sigma^2,
\end{aligned}
\]
where the last inequality follows from the second condition in (9), namely $24\eta_kL^2T\sum_{l=1}^{T}\eta_{k+l}\le M$. This proves Theorem 1. ⊓⊔

B.3 Proof of Corollary 1
Proof.
Since the learning rate $\eta_k := \eta$ is a constant, applying it to Theorem 1 and using $\eta - L_{\max}\eta^2 \ge \eta/2$ yields
\[
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|P(x^k,g^k,\eta)\|^2\big] \le \frac{16M\big(\Psi(x^1)-\Psi(x^*)\big)}{K\eta} + 16M\Bigg(\frac{L\eta}{MN} + \frac{3L^2T^2\eta^2}{M^2N}\Bigg)\sigma^2. \qquad (35)
\]
Following the conditions of Corollary 1, i.e., (11) and (12), we have
\[
\eta \le \frac{M}{3L(T+1)^2},
\]
and thus
\[
\frac{3L^2T^2\eta^2}{M^2N} = \frac{L\eta}{MN}\cdot\frac{3LT^2\eta}{M} \le \frac{L\eta}{MN}\cdot\frac{T^2}{(T+1)^2} \le \frac{L\eta}{MN}.
\]
Then, we can bound (35) as follows:
\[
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|P(x^k,g^k,\eta)\|^2\big]
\le \frac{16M\big(\Psi(x^1)-\Psi(x^*)\big)}{K\eta} + \frac{32L\eta}{N}\sigma^2
= 48\sqrt{\frac{\big(\Psi(x^1)-\Psi(x^*)\big)LM\sigma^2}{KN}},
\]
which proves the corollary. ⊓⊔
B.4 Proof of Milestone Lemmas
Proof of Lemma 7.
Recall Corollary 5 with $y = x^{k+1}$, $x = x^k$ and $z = \bar x^{k+1}$, where $\bar x^{k+1}$ is the virtual iterate obtained from the averaged delayed true gradient, i.e., $\bar x^{k+1}_j = \operatorname{prox}_{\eta_k h_j}(x^k_j - \eta_k\hat g^k_j)$ and $\bar x^{k+1}_{j'} = x^k_{j'}$ for $j'\ne j$:
\[
\Psi(x^{k+1}) \le \Psi(\bar x^{k+1}) + \langle\nabla_j f(x^k) - \hat G^k_j,\, x^{k+1}_j - \bar x^{k+1}_j\rangle + \Big(\frac{L_j}{2} - \frac{1}{2\eta_k}\Big)\|x^{k+1}_j - x^k_j\|^2 + \Big(\frac{L_j}{2} + \frac{1}{2\eta_k}\Big)\|\bar x^{k+1}_j - x^k_j\|^2 - \frac{1}{2\eta_k}\|x^{k+1}_j - \bar x^{k+1}_j\|^2.
\]
Now we turn to bound $\Psi(\bar x^{k+1})$ as follows:
\[
\begin{aligned}
f(\bar x^{k+1}) &\le f(x^k) + \langle\nabla_j f(x^k),\, \bar x^{k+1}_j - x^k_j\rangle + \frac{L_j}{2}\|\bar x^{k+1}_j - x^k_j\|^2\\
&= f(x^k) + \langle g^k_j,\, \bar x^{k+1}_j - x^k_j\rangle + \frac{\eta_k^2L_j}{2}\|P_j(x^k,\hat g^k,\eta_k)\|^2\\
&= f(x^k) - \eta_k\langle\hat g^k_j,\, P_j(x^k,\hat g^k,\eta_k)\rangle + \langle g^k_j - \hat g^k_j,\, \bar x^{k+1}_j - x^k_j\rangle + \frac{\eta_k^2L_j}{2}\|P_j(x^k,\hat g^k,\eta_k)\|^2\\
&\le f(x^k) - \big[\eta_k\|P_j(x^k,\hat g^k,\eta_k)\|^2 + h_j(\bar x^{k+1}_j) - h_j(x^k_j)\big] + \langle g^k_j - \hat g^k_j,\, \bar x^{k+1}_j - x^k_j\rangle + \frac{\eta_k^2L_j}{2}\|P_j(x^k,\hat g^k,\eta_k)\|^2,
\end{aligned}
\]
where the last inequality follows from Corollary 2. By rearranging terms on both sides, we have
\[
\Psi(\bar x^{k+1}) \le \Psi(x^k) - \Big(\eta_k - \frac{\eta_k^2L_j}{2}\Big)\|P_j(x^k,\hat g^k,\eta_k)\|^2 + \langle g^k_j - \hat g^k_j,\, \bar x^{k+1}_j - x^k_j\rangle. \qquad (36)
\]
Summing the above instance of Corollary 5 and (36), we have
\[
\begin{aligned}
\Psi(x^{k+1}) &\le \Psi(x^k) + \langle\nabla_j f(x^k) - \hat G^k_j,\, x^{k+1}_j - \bar x^{k+1}_j\rangle + \Big(\frac{L_j}{2} - \frac{1}{2\eta_k}\Big)\|x^{k+1}_j - x^k_j\|^2 + \Big(\frac{L_j}{2} + \frac{1}{2\eta_k}\Big)\|\bar x^{k+1}_j - x^k_j\|^2\\
&\quad - \frac{1}{2\eta_k}\|x^{k+1}_j - \bar x^{k+1}_j\|^2 - \Big(\eta_k - \frac{\eta_k^2L_j}{2}\Big)\|P_j(x^k,\hat g^k,\eta_k)\|^2 + \langle g^k_j - \hat g^k_j,\, \bar x^{k+1}_j - x^k_j\rangle\\
&= \Psi(x^k) + \langle x^{k+1}_j - x^k_j,\, g^k_j - \hat g^k_j\rangle + \langle x^{k+1}_j - \bar x^{k+1}_j,\, \delta^k_j\rangle + \Big(\frac{L_j\eta_k^2}{2} - \frac{\eta_k}{2}\Big)\|P_j(x^k,\hat G^k,\eta_k)\|^2\\
&\quad + \Big(L_j\eta_k^2 - \frac{\eta_k}{2}\Big)\|P_j(x^k,\hat g^k,\eta_k)\|^2 - \frac{1}{2\eta_k}\|x^{k+1}_j - \bar x^{k+1}_j\|^2.
\end{aligned}
\]
By taking the expectation conditioned on the filtration $\mathcal{F}_k$ and on the random index $j$, and using Assumption 2, we have
\[
\begin{aligned}
\mathbb{E}_j[\Psi(x^{k+1})\,|\,\mathcal{F}_k]
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] + \frac{1}{M}\mathbb{E}\big[\langle x^{k+1} - x^k,\, g^k - \hat g^k\rangle\,\big|\,\mathcal{F}_k\big] + \frac{L_{\max}\eta_k^2 - \eta_k}{2M}\mathbb{E}\big[\|P(x^k,\hat G^k,\eta_k)\|^2\,\big|\,\mathcal{F}_k\big]\\
&\quad + \frac{2L_{\max}\eta_k^2 - \eta_k}{2M}\|P(x^k,\hat g^k,\eta_k)\|^2 - \frac{1}{2M\eta_k}\|x^{k+1} - \bar x^{k+1}\|^2. \qquad (37)
\end{aligned}
\]
Therefore, bounding the inner products by Young's inequality (Lemma 6) and using Corollary 4 together with Assumptions 2 and 3, we have
\[
\begin{aligned}
\mathbb{E}_j[\Psi(x^{k+1})\,|\,\mathcal{F}_k]
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] + \frac{\eta_k}{M}\|g^k - \hat g^k\|^2 + \frac{L_{\max}\eta_k^2}{M}\mathbb{E}\big[\|P(x^k,\hat G^k,\eta_k)\|^2\,\big|\,\mathcal{F}_k\big] + \frac{2L_{\max}\eta_k^2 - \eta_k}{2M}\|P(x^k,\hat g^k,\eta_k)\|^2\\
&\le \mathbb{E}_j[\Psi(x^k)\,|\,\mathcal{F}_k] - \frac{\eta_k - L_{\max}\eta_k^2}{2M}\|P(x^k,\hat g^k,\eta_k)\|^2 + \frac{\eta_k}{M}\|g^k - \hat g^k\|^2 + \frac{L\eta_k^2}{MN}\sigma^2. \qquad\text{⊓⊔}
\end{aligned}
\]

Proof of Lemma 8.
\[
\begin{aligned}
\|x^k - x^{k-\tau}\|^2
&= \Bigg\|\sum_{l\in\mathcal{K}(\tau(k))} \big(x^{k-l+1} - x^{k-l}\big)\Bigg\|^2
= \Bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat G^{k-l},\eta_{k-l})\Bigg\|^2\\
&\le 2\Bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}\Big(P_{j_{k-l}}(x^{k-l},\hat G^{k-l},\eta_{k-l}) - P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\Big)\Bigg\|^2
+ 2\Bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\Bigg\|^2\\
&\le 2T\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}^2\,\big\|P_{j_{k-l}}(x^{k-l},\hat G^{k-l},\eta_{k-l}) - P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\big\|^2
+ 2\Bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\Bigg\|^2\\
&\le 2T\sum_{l=1}^{T}\eta_{k-l}^2\,\|\hat G^{k-l} - \hat g^{k-l}\|^2 + 2\Bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\Bigg\|^2.
\end{aligned}
\]
Taking expectations and using Assumptions 3 and 5 to bound $\mathbb{E}\|\hat G^{k-l} - \hat g^{k-l}\|^2$, we obtain
\[
\mathbb{E}\big[\|x^k - x^{k-\tau}\|^2\big] \le \frac{2T\sum_{l=1}^{T}\eta_{k-l}^2}{MN}\sigma^2 + 2\,\mathbb{E}\Bigg[\bigg\|\sum_{l\in\mathcal{K}(\tau(k))}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\bigg\|^2\Bigg]. \qquad\text{⊓⊔}
\]

Proof of Lemma 9.
\[
\begin{aligned}
\mathbb{E}\big[\|g^k_j - \hat g^k_j\|^2\big]
&= \mathbb{E}\Bigg[\bigg\|\frac{1}{N}\sum_{i=1}^{N}\big(g^k_j - g^{k-\tau(k,i)}_j\big)\bigg\|^2\Bigg]
\le \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big[\|g^k_j - g^{k-\tau(k,i)}_j\|^2\big]
\le \frac{1}{MN}\sum_{i=1}^{N}\mathbb{E}\big[\|g^k - g^{k-\tau(k,i)}\|^2\big]\\
&\le \frac{L^2}{MN}\sum_{i=1}^{N}\mathbb{E}\big[\|x^k - \hat x^{k-\tau(k,i)}\|^2\big]
\le \frac{L^2}{MN}\sum_{i=1}^{N}\Bigg(\frac{2T\sum_{l=1}^{T}\eta_{k-l}^2}{MN}\sigma^2 + 2\,\mathbb{E}\Bigg[\bigg\|\sum_{l=1}^{T}\eta_{k-l}P_{j_{k-l}}(x^{k-l},\hat g^{k-l},\eta_{k-l})\bigg\|^2\Bigg]\Bigg)\\
&\le \frac{2L^2T\sum_{l=1}^{T}\eta_{k-l}^2}{MN}\sigma^2 + \frac{2L^2T}{M}\sum_{l=1}^{T}\eta_{k-l}^2\,\mathbb{E}\big[\|P(x^{k-l},\hat g^{k-l},\eta_{k-l})\|^2\big]. \qquad\text{⊓⊔}
\end{aligned}
\]