Make Workers Work Harder: Decoupled Asynchronous Proximal Stochastic Gradient Descent
Yitan Li, Linli Xu, Xiaowei Zhong, Qing Ling
University of Science and Technology of China
[email protected], [email protected], {xwzhong,qingling}@mail.ustc.edu.cn
May 24, 2016
Abstract
Asynchronous parallel optimization algorithms for solving large-scale machine learning problems have drawn significant attention from both academia and industry recently. This paper proposes a novel algorithm, decoupled asynchronous proximal stochastic gradient descent (DAP-SGD), to minimize an objective function that is the composite of the average of multiple empirical losses and a regularization term. Unlike the traditional asynchronous proximal stochastic gradient descent (TAP-SGD), in which the master carries much of the computation load, the proposed algorithm off-loads the majority of computation tasks from the master to workers, and leaves the master to conduct simple addition operations. This strategy yields an easy-to-parallelize algorithm, whose performance is justified by theoretical convergence analyses. To be specific, DAP-SGD achieves an O(log T/T) rate when the step-size is diminishing and an ergodic O(1/√T) rate when the step-size is constant, where T is the number of total iterations.

Introduction

A majority of classical machine learning tasks can be formulated as solving a general regularized optimization problem:

min_{x∈R^m} P(x) = f(x) + h(x), where f(x) := (1/n) Σ_{i=1}^n f_i(x). (1)

Given n samples, f_i(x) represents the empirical loss of the i-th sample with regard to the decision variable x, and h(x) corresponds to a (usually non-smooth) regularization term. Our goal is to find the optimal solution, defined as x*, which minimizes the sum of the averaged empirical loss and the regularization term over the whole dataset.

With the enormous growth of data size n and model complexity, asynchronous parallel algorithms [1, 2, 3, 4, 5, 6] have become an important tool and achieved significant success in solving large-scale machine learning problems of the form (1).
Asynchronous parallel algorithms distribute computation over multi-core systems (shared memory architecture) or multi-machine systems (parameter server architecture), whose computation power generally scales up with the number of cores or machines. As a consequence, effective design and implementation of asynchronous parallel algorithms is critical for large-scale machine learning.

Numerous efforts have been devoted to this topic. Among them, asynchronous stochastic gradient descent is proposed in [1, 2], and its performance is guaranteed by theoretical convergence analyses. An asynchronous proximal gradient descent algorithm is designed on the parameter server architecture in [3], with a distributed optimization software provided. The convergence rate of asynchronous stochastic gradient descent with a non-convex objective is analyzed in [4]. Apart from work on asynchronous gradient descent and its proximal variant, much attention has also been attracted to the asynchronous alternating direction method of multipliers (ADMM) [5], asynchronous stochastic coordinate descent [7, 8, 9, 10, 11, 12] and asynchronous dual stochastic coordinate ascent [13].

The traditional asynchronous proximal stochastic gradient method (TAP-SGD) that solves (1) works as follows. The workers (multiple cores or machines) access samples, compute the gradients of their corresponding empirical losses, and send them to the master. The master fuses the gradients and runs a proximal step on the regularization term (more details are given in Section 2). However, the performance of this paradigm is restricted when the proximal operator is not an element-wise operation. In this case, running proximal steps can be time-consuming, and the computation in the master becomes the bottleneck of the whole system. We note that this is common for many popular regularization terms, as shown in Section 2.
To avoid this difficulty, one has to design a customized parallel computation scheme for every single regularization term, which makes the framework inflexible. To speed up computation and simplify algorithm design, we aim for an alternative algorithm that is easier to parallelize.

In light of this issue, this paper develops decoupled asynchronous proximal stochastic gradient descent (DAP-SGD), which off-loads the majority of computation tasks (especially the proximal steps) from the master to workers, and leaves the master to conduct simple addition operations. This algorithmic framework is suitable for many master/worker architectures, including the single-machine multi-core system (shared memory architecture), where the master is the parameter updating thread and the workers correspond to other threads processing samples, and the multi-machine system (parameter server architecture), where the master is the central machine storing and updating parameters and the workers represent the machines storing and processing samples.

The main contributions of this paper are highlighted as follows:

• The proposed DAP-SGD algorithm off-loads the computation bottleneck from the master to workers. To be more specific, DAP-SGD allows workers to evaluate the proximal operators (work harder), while the master only needs to do element-wise addition operations, which are easy to parallelize.

• Convergence analysis is provided for DAP-SGD. DAP-SGD achieves an O(log T/T) rate when the step-size is diminishing and an ergodic O(1/√T) rate when the step-size is constant, where T is the number of total iterations.

We start from the synchronous proximal stochastic gradient descent (P-SGD) algorithm that solves (1). P-SGD only requires the gradient of one sample in a single iteration. Hence, in large-scale optimization problems, it is a preferred surrogate for proximal gradient descent [14, 15], which requires computing the gradients of all samples in a single iteration.
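For concreteness, a minimal serial sketch of this recursion with h(x) = λ‖x‖_1 (whose proximal operator is element-wise soft-thresholding) follows; the least-squares loss, random data, and step-size schedule are illustrative assumptions of ours, not the settings of the paper's experiments.

```python
import numpy as np

def prox_l1(x, eta, lam):
    """Proximal operator of h(x) = lam*||x||_1 with step-size eta:
    argmin_y ||y - x||^2/(2*eta) + lam*||y||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def p_sgd(S, y, lam, T=3000, seed=0):
    """Serial P-SGD for min_x (1/n) sum_i (s_i^T x - y_i)^2 / 2 + lam*||x||_1.
    Each iteration uses the gradient of a single random sample."""
    rng = np.random.default_rng(seed)
    n, m = S.shape
    x = np.zeros(m)
    for t in range(T):
        i = rng.integers(n)                      # uniformly sample i_t
        grad = (S[i] @ x - y[i]) * S[i]          # gradient of the i_t-th loss
        eta = 0.05 / (1.0 + 0.005 * t)           # diminishing step-size (illustrative)
        x = prox_l1(x - eta * grad, eta, lam)    # proximal stochastic gradient step
    return x
```

The proximal step after every stochastic gradient step is what distinguishes P-SGD from plain SGD on the smooth part alone.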
The recursion of P-SGD is

x_{t+1} = Prox_{η_t,h}(x_t − η_t ∇f_{i_t}(x_t)), (2)

where Prox_{η,h}(x) = argmin_y ‖y − x‖²/(2η) + h(y) denotes the proximal operator, η_t is the step-size, and i_t is the index of the sample selected in the t-th iteration.

The traditional asynchronous proximal stochastic gradient descent (TAP-SGD) algorithm is an asynchronous variant of P-SGD, as summarized in Algorithm 1. The master is the main updating processor, while the workers provide the gradients of the samples. Every worker receives the parameter (namely, the decision variable) x from the master, computes the gradient of one random sample ∇f_i(x), and sends it to the master. Obviously, while one worker is computing and sending its gradient, the master may update the parameter using gradients sent by the other workers in the meantime. As a consequence, the gradients received at the master are often delayed, which is the main difference between P-SGD and TAP-SGD. In the master, the delayed gradient received at the t-th iteration is denoted by ∇f_{i_t}(x_{d(t)}), where i_t indexes the selected sample, x_{d(t)} indicates that the parameter is the one from the d(t)-th iteration, and d(t) ∈ [t − τ, t], where τ stands for the maximum delay of the system. Therefore, we can write the recursion of TAP-SGD as

x_{t+1} = Prox_{η_t,h}(x_t − η_t ∇f_{i_t}(x_{d(t)})). (3)

Algorithm 1:
Traditional Asynchronous Proximal Stochastic Gradient Descent (TAP-SGD)
Input: initialization x_0, t = 0; a dataset with n samples, in which the loss function of the i-th sample is denoted by f_i(x); regularization term h(x); maximum number of iterations T; number of workers S; step-size η_t in the t-th iteration; maximum delay τ
Output: x_T
Procedure of each worker s ∈ [1, ..., S]:
repeat
    Uniformly sample i from [1, ..., n];
    Obtain the parameter x from the master (shared memory or parameter server);
    Evaluate the gradient of the i-th sample over the parameter x, denoted by ∇f_i(x);
    Send ∇f_i(x) to the master;
until the procedure of the master ends
Procedure of the master:
for t = 0 to T − 1 do
    Get a gradient ∇f_{i_t}(x_{d(t)}) (the delay t − d(t) is bounded by τ);
    Update the parameter with the proximal operator: x_{t+1} = Prox_{η_t,h}(x_t − η_t ∇f_{i_t}(x_{d(t)}));
    t = t + 1;

Observe that the updating procedure of the master is the computational bottleneck of the TAP-SGD algorithm. When the proximal step is time-consuming to calculate, the workers must wait a long time to receive updated parameters, which significantly degrades the performance of the system. To avoid this difficulty, one has to design a customized parallel computation scheme for every single regularization term, which makes the framework inflexible. In a multi-machine system with multiple masters, such parallelized proximal operators also cause complicated network communication between masters.

Coupled Proximal Operators
In practice, many widely used (usually non-smooth) regularization terms are associated with coupled proximal operators that lead to high computational complexity, including group lasso regularization [16], fused lasso regularization [17], and nuclear norm regularization [18, 19].
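To make these operators concrete, here is a sketch of ours (with a hypothetical 0-based group layout) of two of them: the block-wise shrinkage that solves the group lasso proximal problem and the singular value soft-thresholding that solves the nuclear norm one, both detailed in the paragraphs below.

```python
import numpy as np

def prox_group_lasso(x, eta, lam, bounds):
    """Group lasso prox: argmin_y ||y-x||^2/(2*eta) + lam*sum_i ||y_(group i)||.
    `bounds` lists 0-based group boundaries; each block is shrunk toward
    zero by its closed-form scaling factor."""
    y = x.copy()
    for a, b in zip(bounds[:-1], bounds[1:]):
        nrm = np.linalg.norm(x[a:b])
        scale = max(1.0 - eta * lam / nrm, 0.0) if nrm > 0 else 0.0
        y[a:b] = scale * x[a:b]
    return y

def prox_nuclear(X, eta, lam):
    """Nuclear norm prox: soft-threshold the singular values of X by eta*lam."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - eta * lam, 0.0)) @ Vt
```

Neither operation is element-wise: the group shrinkage couples all coordinates within a block, and the SVD couples every entry of X, which is why parallelizing them inside the master is hard.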
The proximal operator of the group lasso regularization h(x) = λ Σ_{i=1}^g ‖x_{k_i:(k_{i+1}−1)}‖ is

Prox_{η,h}(x) = argmin_y (1/(2η)) ‖y − x‖² + λ Σ_{i=1}^g ‖y_{k_i:(k_{i+1}−1)}‖. (4)

Here g is the number of groups and k_1 = 1 < ... < k_i < k_{i+1} < ... < k_{g+1} = m + 1. The closed-form solution of the proximal operator above is

[Prox_{η,h}(x)]_{k_i:(k_{i+1}−1)} = x_{k_i:(k_{i+1}−1)} (1 − ηλ/‖x_{k_i:(k_{i+1}−1)}‖)_+. (5)

For the group lasso regularization, the proximal operator (4) separates into g groups. When the partition of the groups is unbalanced, it is hard to speed up the computation with parallelization.

The proximal operator of the simplified fused lasso regularization h(x) = λ Σ_{i=1}^{m−1} |x_i − x_{i+1}| is

Prox_{η,h}(x) = argmin_y (1/(2η)) ‖y − x‖² + λ Σ_{i=1}^{m−1} |y_i − y_{i+1}| = x − R^T z*, (6)

where
R = [ 1 −1 0 ⋯ 0 ]
    [ 0 1 −1 ⋯ 0 ]
    [ ⋮         ⋮ ]
    [ 0 0 ⋯ 1 −1 ]  ∈ R^{(m−1)×m},  z* = argmin_{‖z‖_∞ ≤ ηλ} (1/2)‖R^T z‖² − ⟨R^T z, x⟩.

For the simplified fused lasso regularization, the proximal operator (6) has a closed-form solution in terms of z*. However, solving for z* involves a subproblem that is time-consuming.

The proximal operator of the nuclear norm regularization h(X) = λ‖X‖_∗ is

Prox_{η,h}(X) = argmin_Y (1/(2η)) ‖Y − X‖_F² + λ‖Y‖_∗ = U Σ̂ V^T, (7)

where X = UΣV^T is the singular value decomposition of X, σ_i is the i-th singular value of X, σ̂_i = max(σ_i − ηλ, 0) is the i-th element of σ̂, and Σ̂ = Diag(σ̂). For the nuclear norm regularization, the proximal operator (7) involves a singular value decomposition, which is challenging especially for large-scale problems.

As discussed above, evaluating the proximal operator can be a computational bottleneck and limit the performance of TAP-SGD. This motivates us to design a novel asynchronous parallel algorithm which decouples and distributes the calculation of the proximal operator to the workers.

Decoupled Asynchronous Proximal Stochastic Gradient Descent

The key idea of the decoupled asynchronous proximal stochastic gradient descent (DAP-SGD) algorithm is to off-load the computational bottleneck from the master to the workers. The master no longer takes care of the proximal operators; instead, it only needs to conduct element-wise addition operations. On the other hand, the workers must work harder: they evaluate the proximal operators independently, without caring about the parallel mechanism.

The procedure of DAP-SGD is summarized in Algorithm 2. Each worker evaluates the proximal operator and sends the update information (namely, the innovation) ∆ = x′ − x to the master. In the master, the delayed update information ∆_{d(t)} = x′_{d(t)} − x_{d(t)} is used to modify the parameter x.
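This decoupled split can be mimicked in a shared-memory sketch of ours with Python threads and an ℓ1 regularizer (the lock, data, and constant step-size are illustrative assumptions): each worker reads a possibly stale snapshot, performs the full proximal step locally, and the master update is nothing but a locked element-wise addition of the innovation ∆.

```python
import threading
import numpy as np

def prox_l1(x, eta, lam):
    # soft-thresholding, the proximal operator of lam*||x||_1
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def dap_sgd(S, y, lam, T=4000, n_workers=4, eta=0.02, seed=0):
    """Shared-memory sketch of DAP-SGD: workers evaluate the proximal
    operator; the master step is just an element-wise addition."""
    n, m = S.shape
    x = np.zeros(m)
    lock = threading.Lock()
    budget = [T]                                   # remaining master updates

    def worker(wid):
        rng = np.random.default_rng(seed + wid)
        while True:
            with lock:
                if budget[0] <= 0:
                    return
                budget[0] -= 1
                x_snap = x.copy()                  # possibly stale parameter x_{d(t)}
            i = rng.integers(n)
            g = (S[i] @ x_snap - y[i]) * S[i]      # worker: stochastic gradient
            x_prime = prox_l1(x_snap - eta * g, eta, lam)  # worker: proximal step
            delta = x_prime - x_snap               # innovation sent to the master
            with lock:
                x[:] = x + delta                   # master: element-wise addition

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return x
```

With a constant step-size the iterates only reach a neighborhood of x*, consistent with the ergodic O(1/√T) guarantee stated later.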
Obviously, parameter updating in the master is no longer the computational bottleneck of the system, since it only involves element-wise addition operations.

The recursion of DAP-SGD is

x′_{d(t)} = Prox_{η_{d(t)},h}(x_{d(t)} − η_{d(t)} ∇f_{i_{d(t)}}(x_{d(t)})),  x_{t+1} = x_t + x′_{d(t)} − x_{d(t)}. (8)

Comparing the recursions of TAP-SGD (3) and DAP-SGD (8), we can observe that the DAP-SGD recursion (8) splits the proximal operator and the parameter updating step. This is why we call the proposed algorithm "decoupled". The benefit of decoupling is that the computational bottleneck (for example, the unbalanced partitioned groups in (4), the subproblem in (6), and the singular value decomposition in (7)) no longer lies in the master. The workers conduct these operations, which improves the performance of the system. Below, we further analyze the convergence properties of DAP-SGD theoretically.

Convergence Analysis

This section gives theorems that establish the convergence properties of DAP-SGD. The detailed proofs are presented in the appendix. We start from some basic assumptions. The first two assumptions are about the properties of the averaged empirical cost f(x). Note that both TAP-SGD and DAP-SGD can support mini-batch updating.

Algorithm 2: Decoupled Asynchronous Proximal Stochastic Gradient Descent (DAP-SGD)
Input: initialization x_0, t = 0; a dataset with n samples, in which the loss function of the i-th sample is denoted by f_i(x); regularization term h(x); maximum number of iterations T; number of workers S; step-size η_t in the t-th iteration; maximum delay τ
Output: x_T
Procedure of each worker s ∈ [1, ..., S]:
repeat
    Uniformly sample i from [1, ..., n];
    Obtain the parameter x and the step-size η from the master (shared memory or parameter server);
    Evaluate the gradient of the i-th sample over the parameter x, denoted by ∇f_i(x);
    Evaluate the proximal operator x′ = Prox_{η,h}(x − η∇f_i(x));
    Send the update information ∆ = x′ − x to the master;
until the procedure of the master ends
Procedure of the master:
for t = 0 to T − 1 do
    Get ∆_{d(t)} = x′_{d(t)} − x_{d(t)} from one worker (the delay t − d(t) is bounded by τ);
    Update the parameter with x_{t+1} = x_t + ∆_{d(t)};
    t = t + 1;

Assumption 1
(Lipschitz continuous gradient of f(x)): The function f(x) is differentiable and its gradient ∇f(x) is Lipschitz continuous with constant L. Namely, the following two equivalent inequalities hold:

f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖², ∀x, y, (9)

and

(1/L)‖∇f(x) − ∇f(y)‖² ≤ ⟨∇f(x) − ∇f(y), x − y⟩ ≤ L‖x − y‖², ∀x, y. (10)

Assumption 2
(Strong convexity of f(x)): The function f(x) is strongly convex with constant μ. Namely, the following inequality holds:

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖², ∀x, y. (11)

The next assumption bounds the variance of sampling a random gradient ∇f_i(x) to replace the true gradient ∇f(x).

Assumption 3
(Bounded variance of gradient evaluation): The variance of a selected gradient is bounded by a constant C_f:

E‖∇f_i(x) − ∇f(x)‖² ≤ C_f, ∀x. (12)

The last two assumptions are about the properties of the regularization term h(x).

Assumption 4
(Convexity of h(x)): The function h(x) is convex. Namely, the following inequality holds:

h(x) ≥ h(y) + ⟨∂h(y), x − y⟩, ∀x, y, (13)

where ∂h(x) stands for any subgradient of h(x).

Assumption 5
(Bounded subgradient of h(x)): The squared subgradient of h(x) is bounded by a constant C_h:

‖∂h(x)‖² ≤ C_h, ∀x. (14)

An immediate consequence of Assumption 5 is that ∇f(x*) is also bounded, where x* is the optimal solution to (1), as given in the following corollary.

Corollary 1
(Bounded gradient of f(x) at the optimum): Let x* = argmin_x f(x) + h(x) be the optimal solution to (1). Then we have

‖∇f(x*)‖² = ‖∂h(x*)‖² ≤ C_h. (15)

Assumptions 1, 2, 3 and 4 are common in the convergence analysis of stochastic gradient descent algorithms [1, 3, 4, 20, 21]. Assumption 5 is due to the (usually non-smooth) regularization term h(x), and is reasonable for many non-smooth regularization terms such as the L1 norm, group lasso, fused lasso and nuclear norm. Next we provide constant upper bounds on the subgradients of these non-smooth regularization terms. In the following, ∂ denotes the set of subderivatives and, with a slight abuse of notation, also denotes any element (namely, subgradient) of the set.

Upper bound of the subgradient for the L1 regularization ‖x‖_1:

‖∂‖x‖_1‖² ≤ m. (16)

Upper bound of the subgradient for the group lasso regularization Σ_{i=1}^g ‖x_{k_i:(k_{i+1}−1)}‖:

‖∂ Σ_{i=1}^g ‖x_{k_i:(k_{i+1}−1)}‖‖² ≤ g, where ∂‖x_{k_i:(k_{i+1}−1)}‖ = x_{k_i:(k_{i+1}−1)}/‖x_{k_i:(k_{i+1}−1)}‖ if x_{k_i:(k_{i+1}−1)} ≠ 0, and {g̃ | ‖g̃‖ ≤ 1} if x_{k_i:(k_{i+1}−1)} = 0. (17)

Upper bound of the subgradient for the simplified fused lasso regularization Σ_{i=1}^{m−1} |x_i − x_{i+1}| = ‖Rx‖_1:

‖∂‖Rx‖_1‖² = ‖R^T SGN(Rx)‖² ≤ (Σ_{i=1}^{m−1} ‖R_{i,:}‖ |[SGN(Rx)]_i|)² ≤ 2(m−1)², (18)

where SGN [17] is a set-valued sign function whose elements lie in [−1, 1].
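These constants are easy to check numerically; the sketch below (ours, with a hypothetical 0-based group layout) builds subgradients for the ℓ1 bound (16) and the group lasso bound (17).

```python
import numpy as np

def l1_subgradient(x):
    """A subgradient of ||x||_1: the sign vector (any value in [-1,1] at zeros)."""
    return np.sign(x)

def group_lasso_subgradient(x, bounds):
    """A subgradient of sum_i ||x_(group i)||, following (17): the normalized
    block for nonzero blocks, and 0 (one valid choice) for zero blocks."""
    d = np.zeros_like(x)
    for a, b in zip(bounds[:-1], bounds[1:]):
        nrm = np.linalg.norm(x[a:b])
        if nrm > 0:
            d[a:b] = x[a:b] / nrm
    return d
```

The squared norm of the first is at most m (one unit per coordinate), while the second contributes at most one unit per group, hence the bound g.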
Upper bound of the subgradient for the nuclear norm regularization ‖X‖_∗, X ∈ R^{m×q}, d = min(m, q):

‖∂‖X‖_∗‖_F² = ‖UV^T‖_F² + ‖W‖_F² ≤ rank(X) + d ≤ 2d, (19)

where ∂‖X‖_∗ = {UV^T + W | W ∈ R^{m×q}, U^T W = 0, WV = 0, ‖W‖ ≤ 1, X = UΣV^T}, and the equality uses the orthogonality between UV^T and W.

Under the assumptions given above, we prove that DAP-SGD achieves an O(log T/T) rate when the step-size is diminishing (Theorem 1) and an ergodic O(1/√T) rate when the step-size is constant (Theorem 2), where T is the number of total iterations. The proofs of the theorems are given in the appendix.

Theorem 1
Suppose that the cost function of (1) satisfies the following conditions: f(x) is strongly convex with constant μ and h(x) is convex; f(x) is differentiable and ∇f(x) is Lipschitz continuous with constant L; E‖∇f_i(x) − ∇f(x)‖² ≤ C_f; ‖∂h(x)‖² ≤ C_h. Define the optimal solution of (1) as x*. At time t, set the step-size of the DAP-SGD recursion (8) as η_t = O(1/t). Then the iterate generated by (8) at time T, denoted by x_T, satisfies

E‖x_T − x*‖² ≤ O(log T / T). (20)

Theorem 2
Suppose that the cost function of (1) satisfies the following conditions: f(x) is strongly convex with constant μ and h(x) is convex; f(x) is differentiable and ∇f(x) is Lipschitz continuous with constant L; E‖∇f_i(x) − ∇f(x)‖² ≤ C_f; ‖∂h(x)‖² ≤ C_h. Define the optimal solution of (1) as x*. At time t, fix the step-size of the DAP-SGD recursion (8) as η_t = η = O(1/√T), where T is the maximum number of iterations. Define the iterate generated by (8) at time t as x_t. Then the running average iterate generated by (8) at time T, denoted by x̄_T = Σ_{t=0}^T x_t/(T + 1), satisfies

E‖x̄_T − x*‖² ≤ O(1/√T). (21)

Figure 1: Comparison of TAP-SGD and DAP-SGD in terms of time and number of iterations. The Y-axis shows the log distance between the solution generated by an algorithm and the optimal solution, denoted by log‖x − x*‖. Results of the L1, group lasso, simplified fused lasso and nuclear norm regularized objectives are shown in columns from left to right, respectively. The top and bottom rows correspond to the results with respect to time and number of iterations, respectively.

Experiments

We compare the proposed DAP-SGD algorithm with TAP-SGD in a consistent way, without assuming that the data is sparse. The implementation is based on a single-machine multi-core system (shared memory architecture). Both algorithms are implemented in C++ and run on a multi-core server. Singular value decomposition (SVD) is calculated by eigen3 (eigen.tuxfamily.org). The parameters are locked while they are being updated. The lock operation slows down the computation; however, it guarantees that the implementation conforms to the algorithm and its corresponding convergence analysis.

Without loss of generality, we choose the least-squares loss with a non-smooth regularization term as the optimization objective:

min_{x∈R^m} P(x) = f(x) + h(x) = (1/n) Σ_{i=1}^n [ (x^T s_i − y_i)² + λ‖x‖² ] + h(x). (22)
In the case of nuclear norm regularization, the loss function becomes the multi-target least-squares loss f(X) = (1/n) Σ_{i=1}^n [ ‖X^T s_i − y_i‖² + λ‖X‖_F² ] correspondingly.

In the implementation of TAP-SGD, the proximal operator of the L1 regularized objective can be parallelized easily, while the proximal operators of group lasso, simplified fused lasso and nuclear norm are not parallelized due to their coupled, non-element-wise operations. In contrast, the procedure of the master in the proposed DAP-SGD only involves simple element-wise operations.
Figure 2: Speedup of TAP-SGD and DAP-SGD with 4 different non-smooth regularization terms.
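For reference, the regularized least-squares objective (22) used throughout the experiments can be written as follows (our sketch; the callable h is a placeholder standing in for whichever non-smooth term is being tested):

```python
import numpy as np

def objective(x, S, y, lam, h):
    """Objective (22): averaged squared loss plus the l2 term lam*||x||^2,
    plus a non-smooth regularizer h (e.g. l1, group lasso, ...)."""
    return np.mean((S @ x - y) ** 2 + lam * np.dot(x, x)) + h(x)
```

Swapping h between the four regularizers changes only the proximal operator, not the smooth part of the objective.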
Experimental Setup. We conduct two experiments to evaluate the algorithms with 4 different non-smooth regularization terms (L1, group lasso, simplified fused lasso, nuclear norm) with respect to the running time and number of iterations, as well as the speedup. Data is generated randomly. In the first experiment, the number of samples n and the length of the parameter (which takes the form of a matrix for nuclear norm regularization) are set separately for each of the 4 objectives, as are the number of iterations T, the diminishing step-size η_t, and the hyper-parameter λ. In the second experiment, which evaluates the speedup, the settings are identical to the first experiment except that the number of iterations for the simplified fused lasso and nuclear norm regularized objectives and the number of parameters for the L1 and group lasso regularized objectives are adjusted. The total time cost of the system consists of two parts: evaluation of the update information in the workers and updating in the master. If we can speed up both by a factor of k, then we achieve a k-fold speedup in the ideal case. In our experiments, the number of updating threads running in parallel and the maximum delay τ in the master are fixed to the number of workers.

Results are summarized in Figures 1 and 2. Figure 1 shows the comparison between TAP-SGD and DAP-SGD with respect to the running time and number of iterations. As shown in the top row of Figure 1, the proposed DAP-SGD algorithm is slightly slower than TAP-SGD with the L1 regularized objective. The reason is that the proximal operator of the L1 norm is element-wise and can be parallelized, while the decoupled update of DAP-SGD (8) involves more operations in the workers than the update of TAP-SGD (3), whose workers only need to evaluate the gradients.
Nevertheless, DAP-SGD is much faster than TAP-SGD with the group lasso, simplified fused lasso and nuclear norm regularized objectives, because the proximal operators of these norms are not element-wise and are hard to parallelize. As a consequence, evaluation of the proximal operator in the master of TAP-SGD becomes the computational bottleneck of the whole system and the performance degrades significantly. In contrast, DAP-SGD lets each worker evaluate the proximal operator, which justifies our core idea of decoupling the computation. Meanwhile, according to the bottom row of Figure 1, TAP-SGD and DAP-SGD perform similarly with respect to the number of iterations. The experimental results shown in Figure 1 validate that the decoupled operation in DAP-SGD makes the algorithm more flexible and easier to parallelize without affecting the precision of the algorithm.

Figure 2 compares TAP-SGD and DAP-SGD in terms of the speedup with different regularization terms. Obviously, DAP-SGD achieves significant speedup as the number of workers increases, except for the L1 regularized objective, for the reason discussed above. With the group lasso, simplified fused lasso and nuclear norm regularized objectives, TAP-SGD essentially fails to speed up when the number of workers increases, which indicates the computational bottleneck at the master for evaluating the coupled proximal operator. Meanwhile, the decoupling operation of DAP-SGD effectively off-loads the computation to the workers and improves the parallelism of asynchronous proximal stochastic gradient descent.

Conclusion

This paper proposes a novel decoupled asynchronous proximal stochastic gradient descent (DAP-SGD) algorithm for optimizing a composite objective function. By off-loading computation from the master to workers, the proposed DAP-SGD algorithm becomes easy to parallelize. DAP-SGD is suitable for many master/worker architectures, including single-machine multi-core systems and multi-machine systems.
We further provide theoretical convergence analyses for DAP-SGD, with both diminishing and fixed step-sizes.

References

[1] F. Niu, B. Recht, C. Re, S. J. Wright, Hogwild: A lock-free approach to parallelizing stochastic gradient descent, in: Proceedings of Advances in Neural Information Processing Systems 24, December 12-14, 2011, Granada, Spain, 2011, pp. 693–701.
[2] A. Agarwal, J. C. Duchi, Distributed delayed stochastic optimization, in: Proceedings of Advances in Neural Information Processing Systems 24, December 12-14, 2011, Granada, Spain, 2011, pp. 873–881.
[3] M. Li, D. G. Andersen, A. J. Smola, K. Yu, Communication efficient distributed machine learning with the parameter server, in: Proceedings of Advances in Neural Information Processing Systems 27, December 8-13, 2014, Montreal, Quebec, Canada, 2014, pp. 19–27.
[4] X. Lian, Y. Huang, Y. Li, J. Liu, Asynchronous parallel stochastic gradient for nonconvex optimization, in: Proceedings of Advances in Neural Information Processing Systems 28, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2737–2745.
[5] R. Zhang, J. T. Kwok, Asynchronous distributed ADMM for consensus optimization, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 2014, pp. 1701–1709.
[6] H. R. Feyzmahdavian, A. Aytekin, M. Johansson, A delayed proximal gradient method with linear convergence rate, in: IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2014, Reims, France, September 21-24, 2014, pp. 1–6.
[7] J. Liu, S. J. Wright, C. Ré, V. Bittorf, S. Sridhar, An asynchronous parallel stochastic coordinate descent algorithm, Journal of Machine Learning Research 16 (2015) 285–322.
[8] J. Liu, S. J. Wright, Asynchronous stochastic coordinate descent: Parallelism and convergence properties, SIAM Journal on Optimization 25 (1) (2015) 351–376.
[9] O. Fercoq, P.
Richtárik, Accelerated, parallel, and proximal coordinate descent, SIAM Journal on Optimization 25 (4) (2015) 1997–2023.
[10] J. Mareček, P. Richtárik, M. Takáč, Distributed block coordinate descent for minimizing partially separable functions, in: Numerical Analysis and Optimization, Springer, 2015, pp. 261–288.
[11] M. Hong, A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach, arXiv preprint arXiv:1412.6058.
[12] Y. Zhou, Y. Yu, W. Dai, Y. Liang, E. Xing, On convergence of model parallel proximal gradient algorithm for stale synchronous parallel system, in: International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[13] C. Hsieh, H. Yu, I. S. Dhillon, PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent, in: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 2370–2379.
[14] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (1) (2009) 183–202.
[15] N. Parikh, S. P. Boyd, Proximal algorithms, Foundations and Trends in Optimization 1 (3) (2014) 127–239.
[16] J. Friedman, T. Hastie, R. Tibshirani, A note on the group lasso and a sparse group lasso, arXiv preprint arXiv:1001.0736.
[17] J. Liu, L. Yuan, J. Ye, An efficient algorithm for a class of fused lasso problems, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, 2010, pp. 323–332.
[18] S. Ji, J. Ye, An accelerated gradient method for trace norm minimization, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, 2009, pp. 457–464.
[19] J. Cai, E. J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization 20 (4) (2010) 1956–1982.
[20] Y.
Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer Science & Business Media, 2013.
[21] A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization 19 (4) (2009) 1574–1609.

Appendix for Make Workers Work Harder: Decoupled Asynchronous Proximal Stochastic Gradient Descent
Theorem 1
Suppose that the cost function of (1) satisfies the following conditions: f(x) is strongly convex with constant μ and h(x) is convex; f(x) is differentiable and ∇f(x) is Lipschitz continuous with constant L; E‖∇f_i(x) − ∇f(x)‖² ≤ C_f; ‖∂h(x)‖² ≤ C_h. Define the optimal solution of (1) as x*. At time t, set the step-size of the DAP-SGD recursion (8) as η_t = O(1/t). Then the iterate generated by (8) at time T, denoted by x_T, satisfies

E‖x_T − x*‖² ≤ O(log T / T). (23)

Proof of Theorem 1:
From the DAP-SGD update x_{t+1} = x_t + x′_{d(t)} − x_{d(t)}, we have

E‖x_{t+1} − x*‖² = E‖x_t − x* + x′_{d(t)} − x_{d(t)}‖²
= E‖x_t − x*‖² + E‖x′_{d(t)} − x_{d(t)}‖² + 2E⟨x′_{d(t)} − x_{d(t)}, x_t − x*⟩
= E‖x_t − x*‖² + E‖x′_{d(t)} − x_{d(t)}‖² + 2E⟨x′_{d(t)} − x_{d(t)}, x_{d(t)} − x*⟩ + 2E⟨x′_{d(t)} − x_{d(t)}, x_t − x_{d(t)}⟩. (24)

Denote by Q_1 ≜ E‖x′_{d(t)} − x_{d(t)}‖² + 2E⟨x′_{d(t)} − x_{d(t)}, x_{d(t)} − x*⟩ the sum of the second and third terms on the right-hand side. Below we bound Q_1 from above.

Recalling the update of x′_{d(t)} in (8) of the paper, which is

x′_{d(t)} = Prox_{η_{d(t)},h}(x_{d(t)} − η_{d(t)} ∇f_{i_{d(t)}}(x_{d(t)})) = argmin_y (1/(2η_{d(t)})) ‖y − (x_{d(t)} − η_{d(t)} ∇f_{i_{d(t)}}(x_{d(t)}))‖² + h(y), (25)

we have

(1/η_{d(t)})(x_{d(t)} − x′_{d(t)}) − ∇f_{i_{d(t)}}(x_{d(t)}) ∈ ∂h(x′_{d(t)}). (26)

Because f(x) is convex (for now we do not need its strong convexity) and h(x) is also convex, we have the following lower bound on the optimal value P(x*):

f(x*) + h(x*) ≥ f(x_{d(t)}) + ⟨∇f(x_{d(t)}), x* − x_{d(t)}⟩ + h(x′_{d(t)}) + ⟨∂h(x′_{d(t)}), x* − x′_{d(t)}⟩. (27)

With a slight abuse of notation, here and hereafter ∂h(x′_{d(t)}) stands for any subgradient. Hence we substitute the one given in (26) into (27) and obtain

P(x*) ≥ f(x_{d(t)}) + ⟨∇f(x_{d(t)}), x* − x_{d(t)}⟩ + h(x′_{d(t)}) + ⟨(1/η_{d(t)})(x_{d(t)} − x′_{d(t)}) − ∇f_{i_{d(t)}}(x_{d(t)}), x* − x′_{d(t)}⟩. (28)

On the other hand, ∇f(x) being Lipschitz continuous with constant L implies

f(x′_{d(t)}) ≤ f(x_{d(t)}) + ⟨∇f(x_{d(t)}), x′_{d(t)} − x_{d(t)}⟩ + (L/2)‖x′_{d(t)} − x_{d(t)}‖².
(29)

Substituting (29) into (28),

P(x*) ≥ f(x′_{d(t)}) − ⟨∇f(x_{d(t)}), x′_{d(t)} − x_{d(t)}⟩ − (L/2)‖x′_{d(t)} − x_{d(t)}‖² + ⟨∇f(x_{d(t)}), x* − x_{d(t)}⟩ + h(x′_{d(t)}) + ⟨(1/η_{d(t)})(x_{d(t)} − x′_{d(t)}) − ∇f_{i_{d(t)}}(x_{d(t)}), x* − x′_{d(t)}⟩. (30)

Noticing that by definition P(x′_{d(t)}) ≜ f(x′_{d(t)}) + h(x′_{d(t)}) and reorganizing the terms of (30), we obtain

−[P(x′_{d(t)}) − P(x*)] ≥ ⟨∇f(x_{d(t)}) − ∇f_{i_{d(t)}}(x_{d(t)}), x* − x′_{d(t)}⟩ + (1/η_{d(t)})⟨x_{d(t)} − x′_{d(t)}, x* − x_{d(t)}⟩ + (1/η_{d(t)})‖x_{d(t)} − x′_{d(t)}‖² − (L/2)‖x_{d(t)} − x′_{d(t)}‖². (31)

Assuming that η_t ≤ 1/L for all t (this holds under the step-size rule given later), (31) yields

−[P(x′_{d(t)}) − P(x*)] ≥ ⟨∇f(x_{d(t)}) − ∇f_{i_{d(t)}}(x_{d(t)}), x* − x′_{d(t)}⟩ + (1/η_{d(t)})⟨x_{d(t)} − x′_{d(t)}, x* − x_{d(t)}⟩ + (1/(2η_{d(t)}))‖x_{d(t)} − x′_{d(t)}‖². (32)

Taking expectations on both sides of (32) and reorganizing terms, we have

−E[P(x′_{d(t)}) − P(x*)] + Q_2 ≥ (1/η_{d(t)}) E⟨x_{d(t)} − x′_{d(t)}, x* − x_{d(t)}⟩ + (1/(2η_{d(t)})) E‖x_{d(t)} − x′_{d(t)}‖², (33)

where Q_2 ≜ E⟨∇f_{i_{d(t)}}(x_{d(t)}) − ∇f(x_{d(t)}), x* − x′_{d(t)}⟩.

Define x̂′_{d(t)} ≜ Prox_{η_{d(t)},h}(x_{d(t)} − η_{d(t)} ∇f(x_{d(t)})) as an approximation of x′_{d(t)} ≜ Prox_{η_{d(t)},h}(x_{d(t)} − η_{d(t)} ∇f_{i_{d(t)}}(x_{d(t)})). Because the random variable i_{d(t)} is independent of x* and x̂′_{d(t)}, while E[∇f_{i_{d(t)}}(x_{d(t)})] = ∇f(x_{d(t)}), it holds that E⟨∇f_{i_{d(t)}}(x_{d(t)}) − ∇f(x_{d(t)}), x* − x̂′_{d(t)}⟩ = 0.
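The next step of the proof relies on the non-expansiveness of proximal operators; for the ℓ1 prox this is easy to confirm numerically, as in the following sketch of ours:

```python
import numpy as np

def prox_l1(x, eta, lam):
    # soft-thresholding: the proximal operator of lam*||x||_1
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def nonexpansive_gap(a, b, eta=0.5, lam=1.0):
    """Returns ||prox(a) - prox(b)|| - ||a - b||, which is <= 0 for any
    proximal operator (checked here for the l1 prox)."""
    return (np.linalg.norm(prox_l1(a, eta, lam) - prox_l1(b, eta, lam))
            - np.linalg.norm(a - b))
```

Non-expansiveness holds for every proximal operator of a convex function, which is exactly what inequality (35) below uses.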
Hence, $Q_2$ can be upper bounded by

$$
\begin{aligned}
Q_2 &= \mathbb{E}\big\langle \nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)}),\, x^*-x'_{d(t)}\big\rangle \\
&= \mathbb{E}\big\langle \nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)}),\, x^*-\hat{x}'_{d(t)}\big\rangle + \mathbb{E}\big\langle \nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)}),\, \hat{x}'_{d(t)}-x'_{d(t)}\big\rangle \\
&= \mathbb{E}\big\langle \nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)}),\, \hat{x}'_{d(t)}-x'_{d(t)}\big\rangle \\
&\le \mathbb{E}\Big( \|\nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)})\| \, \|\hat{x}'_{d(t)}-x'_{d(t)}\| \Big), \qquad (34)
\end{aligned}
$$

where the last inequality comes from the Cauchy-Schwarz inequality. Further, the non-expansive property of proximal operators [8] implies

$$
\|\hat{x}'_{d(t)}-x'_{d(t)}\| = \big\|\mathrm{Prox}_{\eta,h}\big(x_{d(t)}-\eta_{d(t)}\nabla f(x_{d(t)})\big) - \mathrm{Prox}_{\eta,h}\big(x_{d(t)}-\eta_{d(t)}\nabla f_{i_{d(t)}}(x_{d(t)})\big)\big\| \le \eta_{d(t)}\|\nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)})\|. \qquad (35)
$$

Combining (34) and (35) yields an upper bound of $Q_2$ as

$$
Q_2 \le \eta_{d(t)}\mathbb{E}\|\nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)})\|^2 \le \eta_{d(t)} C_f, \qquad (36)
$$

where the last inequality is due to the assumption of bounded variance $\mathbb{E}\|\nabla f_i(x) - \nabla f(x)\|^2 \le C_f$. Substituting (36) into (33), we have

$$
-\mathbb{E}\big[P(x'_{d(t)}) - P(x^*)\big] + \eta_{d(t)} C_f \ge \frac{1}{\eta_{d(t)}}\mathbb{E}\big\langle x_{d(t)}-x'_{d(t)},\, x^*-x_{d(t)}\big\rangle + \frac{1}{2\eta_{d(t)}}\mathbb{E}\|x_{d(t)}-x'_{d(t)}\|^2. \qquad (37)
$$

Now we end up with an upper bound of $Q_1$ as

$$
Q_1 \triangleq \mathbb{E}\|x'_{d(t)}-x_{d(t)}\|^2 + 2\mathbb{E}\big\langle x'_{d(t)}-x_{d(t)},\, x_{d(t)}-x^*\big\rangle \le -2\eta_{d(t)}\mathbb{E}\big[P(x'_{d(t)}) - P(x^*)\big] + 2\eta_{d(t)}^2 C_f. \qquad (38)
$$

Therefore

$$
\begin{aligned}
Q_1 &\le -2\eta_{d(t)}\mathbb{E}\big[P(x_t) - P(x^*)\big] - 2\eta_{d(t)}\mathbb{E}\big[P(x'_{d(t)}) - P(x_t)\big] + 2\eta_{d(t)}^2 C_f \\
&\le -\mu\eta_{d(t)}\mathbb{E}\|x_t-x^*\|^2 - 2\eta_{d(t)}\mathbb{E}\big[P(x'_{d(t)}) - P(x_t)\big] + 2\eta_{d(t)}^2 C_f. \qquad (39)
\end{aligned}
$$
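The bound (35) uses nothing beyond the non-expansiveness of the proximal mapping, which can itself be sanity-checked numerically. The sketch below is not from the paper: it again assumes an ℓ1 regularizer, and `g_full`/`g_i` are stand-ins for the full and stochastic gradients with arbitrary values.

```python
import math
import random

def prox_l1(z, eta, lam):
    # soft-thresholding: proximal operator of h = lam*||.||_1 with step eta
    return [math.copysign(max(abs(zi) - eta * lam, 0.0), zi) for zi in z]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

random.seed(1)
eta, lam = 0.05, 1.0
x = [random.gauss(0, 1) for _ in range(8)]
g_full = [random.gauss(0, 1) for _ in range(8)]       # plays the role of the full gradient
g_i = [gi + random.gauss(0, 1) for gi in g_full]      # plays the role of a stochastic gradient

p_full = prox_l1([xi - eta * gi for xi, gi in zip(x, g_full)], eta, lam)
p_i = prox_l1([xi - eta * gi for xi, gi in zip(x, g_i)], eta, lam)

# Inequality (35): ||Prox(x - eta*g_full) - Prox(x - eta*g_i)|| <= eta*||g_i - g_full||
lhs = norm([a - b for a, b in zip(p_full, p_i)])
rhs = eta * norm([a - b for a, b in zip(g_i, g_full)])
assert lhs <= rhs + 1e-12
print(f"non-expansiveness: {lhs:.4f} <= {rhs:.4f}")
```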
The second line of (39) comes from the inequality

$$
P(x_t) - P(x^*) \ge \frac{\mu}{2}\|x_t-x^*\|^2, \qquad (40)
$$

which is due to the facts that $x^*$ is the optimal solution of $P(x) = f(x) + h(x)$, $f(x)$ is strongly convex with constant $\mu$, and $h(x)$ is convex. Substituting (39) into (24), we have

$$
\begin{aligned}
\mathbb{E}\|x_{t+1}-x^*\|^2 &\le (1-\mu\eta_{d(t)})\mathbb{E}\|x_t-x^*\|^2 + 2\eta_{d(t)}\underbrace{\mathbb{E}\big[P(x_{d(t)}) - P(x'_{d(t)})\big]}_{Q_3} + 2\eta_{d(t)}\sum_{p=1}^{t-d(t)}\underbrace{\mathbb{E}\big[P(x_{t-p+1}) - P(x_{t-p})\big]}_{Q_4} \\
&\quad + 2\eta_{d(t)}^2 C_f + 2\underbrace{\mathbb{E}\big\langle x'_{d(t)}-x_{d(t)},\, x_t-x_{d(t)}\big\rangle}_{Q_5}. \qquad (41)
\end{aligned}
$$

We proceed to bound the terms $Q_3$, $Q_4$, and $Q_5$. Because $f(x)$ and $h(x)$ are convex and the norm of $\partial h(x)$ is bounded, we have the following basic inequality:

$$
\begin{aligned}
P(x) - P(y) &= f(x) - f(y) + h(x) - h(y) \le \big\langle \nabla f(x),\, x-y\big\rangle + \big\langle \partial h(x),\, x-y\big\rangle \\
&\le \|\nabla f(x)\|\|x-y\| + \|\partial h(x)\|\|x-y\| \le \|\nabla f(x)\|\|x-y\| + \sqrt{C_h}\,\|x-y\| = \big(\|\nabla f(x)\| + \sqrt{C_h}\big)\|x-y\|. \qquad (42)
\end{aligned}
$$

In (42), the first inequality comes from the convexity of $f(x)$ and $h(x)$, the second from the Cauchy-Schwarz inequality, and the third from the assumption $\|\partial h(x)\|^2 \le C_h$. Replacing $x$ by $x_{d(t)}$ and $y$ by $x'_{d(t)}$ in (42), we have

$$
Q_3 = \mathbb{E}\big[P(x_{d(t)}) - P(x'_{d(t)})\big] \le \mathbb{E}\Big[\big(\|\nabla f(x_{d(t)})\| + \sqrt{C_h}\big)\|x_{d(t)}-x'_{d(t)}\|\Big]. \qquad (43)
$$

Applying the expression of $x_{d(t)} - x'_{d(t)}$ in (26), namely $x_{d(t)} - x'_{d(t)} = \eta_{d(t)}\big(\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\big)$, to (43) yields

$$
\begin{aligned}
Q_3 &\le \eta_{d(t)}\mathbb{E}\Big[\big(\|\nabla f(x_{d(t)})\| + \sqrt{C_h}\big)\|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\|\Big] \\
&\le \frac{\eta_{d(t)}}{2}\mathbb{E}\|\nabla f(x_{d(t)})\|^2 + \frac{\eta_{d(t)}}{2}C_h + \eta_{d(t)}\mathbb{E}\|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\|^2, \qquad (44)
\end{aligned}
$$

where the second inequality uses $ab \le \frac{1}{4}a^2 + b^2$ together with $\big(\|\nabla f(x_{d(t)})\| + \sqrt{C_h}\big)^2 \le 2\|\nabla f(x_{d(t)})\|^2 + 2C_h$.
Due to the inequalities

$$
\|\nabla f(x_{d(t)})\|^2 \le 2\|\nabla f(x_{d(t)}) - \nabla f(x^*)\|^2 + 2\|\nabla f(x^*)\|^2 \qquad (45)
$$

and

$$
\begin{aligned}
\|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\|^2 &\le 2\|\nabla f_{i_{d(t)}}(x_{d(t)})\|^2 + 2\|\partial h(x'_{d(t)})\|^2 \\
&\le 4\|\nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)})\|^2 + 4\|\nabla f(x_{d(t)})\|^2 + 2\|\partial h(x'_{d(t)})\|^2 \\
&\le 4\|\nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)})\|^2 + 8\|\nabla f(x_{d(t)}) - \nabla f(x^*)\|^2 + 8\|\nabla f(x^*)\|^2 + 2\|\partial h(x'_{d(t)})\|^2, \qquad (46)
\end{aligned}
$$

(44) turns to

$$
Q_3 \le 9\eta_{d(t)}\mathbb{E}\|\nabla f(x_{d(t)}) - \nabla f(x^*)\|^2 + 9\eta_{d(t)}\mathbb{E}\|\nabla f(x^*)\|^2 + 4\eta_{d(t)}\mathbb{E}\|\nabla f_{i_{d(t)}}(x_{d(t)}) - \nabla f(x_{d(t)})\|^2 + 2\eta_{d(t)}\mathbb{E}\|\partial h(x'_{d(t)})\|^2 + \frac{\eta_{d(t)}}{2}C_h. \qquad (47)
$$

Considering the Lipschitz continuity of $\nabla f(x)$, $\|\nabla f(x^*)\|^2 \le C_h$ from Corollary 1, $\mathbb{E}\|\nabla f_i(x) - \nabla f(x)\|^2 \le C_f$, as well as $\|\partial h(x)\|^2 \le C_h$, (47) further turns to

$$
Q_3 \le 9\eta_{d(t)}L^2\mathbb{E}\|x_{d(t)}-x^*\|^2 + 4\eta_{d(t)}C_f + \frac{23}{2}\eta_{d(t)}C_h. \qquad (48)
$$

Similar to the derivation of (48), we have

$$
\begin{aligned}
Q_4 &= \mathbb{E}\big[P(x_{t-p+1}) - P(x_{t-p})\big] \le \mathbb{E}\Big[\big(\|\nabla f(x_{t-p+1})\| + \sqrt{C_h}\big)\|x_{t-p+1}-x_{t-p}\|\Big] \\
&\le \eta_{d(t-p)}\mathbb{E}\Big[\big(\|\nabla f(x_{t-p+1})\| + \sqrt{C_h}\big)\|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) + \partial h(x'_{d(t-p)})\|\Big] \\
&\le \frac{\eta_{d(t-p)}}{2}\mathbb{E}\|\nabla f(x_{t-p+1})\|^2 + \frac{\eta_{d(t-p)}}{2}C_h + \eta_{d(t-p)}\mathbb{E}\|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) + \partial h(x'_{d(t-p)})\|^2. \qquad (49)
\end{aligned}
$$
Using the inequalities (see (45) and (46))

$$
\|\nabla f(x_{t-p+1})\|^2 \le 2\|\nabla f(x_{t-p+1}) - \nabla f(x^*)\|^2 + 2\|\nabla f(x^*)\|^2 \qquad (50)
$$

and

$$
\|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) + \partial h(x'_{d(t-p)})\|^2 \le 4\|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) - \nabla f(x_{d(t-p)})\|^2 + 8\|\nabla f(x_{d(t-p)}) - \nabla f(x^*)\|^2 + 8\|\nabla f(x^*)\|^2 + 2\|\partial h(x'_{d(t-p)})\|^2, \qquad (51)
$$

(49) yields

$$
\begin{aligned}
Q_4 &\le \eta_{d(t-p)}\mathbb{E}\|\nabla f(x_{t-p+1}) - \nabla f(x^*)\|^2 + 9\eta_{d(t-p)}\mathbb{E}\|\nabla f(x^*)\|^2 + 8\eta_{d(t-p)}\mathbb{E}\|\nabla f(x_{d(t-p)}) - \nabla f(x^*)\|^2 \\
&\quad + 4\eta_{d(t-p)}\mathbb{E}\|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) - \nabla f(x_{d(t-p)})\|^2 + 2\eta_{d(t-p)}\mathbb{E}\|\partial h(x'_{d(t-p)})\|^2 + \frac{\eta_{d(t-p)}}{2}C_h \\
&\le \eta_{d(t-p)}L^2\mathbb{E}\|x_{t-p+1}-x^*\|^2 + 8\eta_{d(t-p)}L^2\mathbb{E}\|x_{d(t-p)}-x^*\|^2 + 4\eta_{d(t-p)}C_f + \frac{23}{2}\eta_{d(t-p)}C_h. \qquad (52)
\end{aligned}
$$

Again, the last line of (52) utilizes the Lipschitz continuity of $\nabla f(x)$, $\|\nabla f(x^*)\|^2 \le C_h$ from Corollary 1, $\mathbb{E}\|\nabla f_i(x) - \nabla f(x)\|^2 \le C_f$, as well as $\|\partial h(x)\|^2 \le C_h$.

For the term $Q_5$, we use the Cauchy-Schwarz inequality followed by the substitution of (26) and get

$$
Q_5 = \mathbb{E}\big\langle x'_{d(t)}-x_{d(t)},\, x_t-x_{d(t)}\big\rangle \le \mathbb{E}\Big( \|x'_{d(t)}-x_{d(t)}\| \, \|x_t-x_{d(t)}\| \Big) \le \eta_{d(t)}\mathbb{E}\Big( \|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\| \, \|x_t-x_{d(t)}\| \Big). \qquad (53)
$$

Further relaxing (53) by the triangle inequality yields

$$
Q_5 \le \eta_{d(t)}\sum_{p=1}^{t-d(t)}\mathbb{E}\Big( \|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\| \, \|x_{t-p+1}-x_{t-p}\| \Big). \qquad (54)
$$

Since the maximum delay is $\tau$, we have

$$
Q_5 \le \eta_{d(t)}\sum_{p=1}^{\tau}\mathbb{E}\Big( \|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\| \, \|x_{t-p+1}-x_{t-p}\| \Big). \qquad (55)
$$
Noticing the relations $x_{t-p+1}-x_{t-p} = x'_{d(t-p)}-x_{d(t-p)}$ from the DAP-SGD recursion and $\|x'_{d(t-p)}-x_{d(t-p)}\| = \eta_{d(t-p)}\|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) + \partial h(x'_{d(t-p)})\|$ from (26), (55) leads to

$$
Q_5 \le \eta_{d(t)}\sum_{p=1}^{\tau}\eta_{d(t-p)}\mathbb{E}\Big( \|\nabla f_{i_{d(t)}}(x_{d(t)}) + \partial h(x'_{d(t)})\| \, \|\nabla f_{i_{d(t-p)}}(x_{d(t-p)}) + \partial h(x'_{d(t-p)})\| \Big). \qquad (56)
$$

Following routines similar to those in (48) and (52), eventually we reach

$$
\begin{aligned}
Q_5 &\le 4\eta_{d(t)}L^2\sum_{p=1}^{\tau}\eta_{d(t-p)}\mathbb{E}\|x_{d(t)}-x^*\|^2 + 4\eta_{d(t)}L^2\sum_{p=1}^{\tau}\eta_{d(t-p)}\mathbb{E}\|x_{d(t-p)}-x^*\|^2 \\
&\quad + 4\eta_{d(t)}\sum_{p=1}^{\tau}\eta_{d(t-p)}C_f + 10\eta_{d(t)}\sum_{p=1}^{\tau}\eta_{d(t-p)}C_h. \qquad (57)
\end{aligned}
$$

Substituting (48), (52) and (57) into (41), we have

$$
\begin{aligned}
\mathbb{E}\|x_{t+1}-x^*\|^2 &\le \big(1-\mu\eta_{d(t)}\big)\mathbb{E}\|x_t-x^*\|^2 + \Big(8\eta_{d(t)}L^2\sum_{p=1}^{\tau}\eta_{d(t-p)} + 18\eta_{d(t)}^2L^2\Big)\mathbb{E}\|x_{d(t)}-x^*\|^2 \\
&\quad + 2\eta_{d(t)}L^2\sum_{p=1}^{\tau}\eta_{d(t-p)}\mathbb{E}\|x_{t-p+1}-x^*\|^2 + 24\eta_{d(t)}L^2\sum_{p=1}^{\tau}\eta_{d(t-p)}\mathbb{E}\|x_{d(t-p)}-x^*\|^2 \\
&\quad + \Big(16\eta_{d(t)}\sum_{p=1}^{\tau}\eta_{d(t-p)} + 8\eta_{d(t)}^2\Big)C_f + \Big(43\eta_{d(t)}\sum_{p=1}^{\tau}\eta_{d(t-p)} + 23\eta_{d(t)}^2\Big)C_h. \qquad (58)
\end{aligned}
$$

Define the step-size rule

$$
\eta_t = \frac{1}{\mu(t+1)+u} = O\Big(\frac{1}{t}\Big), \qquad (59)
$$

where $u$ is a positive constant satisfying:

• $u > (2\tau-1)\mu$, such that $\eta_t \le \eta_{d(t)} \le 2\eta_t$;

• $u$ is large enough such that $\min\big(\mu/(4C_1\tau),\, 1/L\big) \ge \eta_t$, where $C_1$ is a constant we give below.

Define two constants

$$
C_1 = \bigg(L^2\frac{\mu+u}{\mu+u-2\mu\tau} + 48\tau L^2 + 8\tau L^2\frac{\mu+u}{\mu+u-2\mu\tau}\bigg)\frac{\mu+u}{\mu+u-2\mu\tau} + 18L^2
$$

and

$$
C_2 = \big[(16\tau+8)C_f + (43\tau+23)C_h\big]\frac{(\mu+u)^2}{(\mu+u-2\mu\tau)^2}.
$$

Though not straightforward, we can show that under the step-size rule given by (59), (58) yields

$$
\mathbb{E}\|x_{t+1}-x^*\|^2 \le (1-\mu\eta_t)\mathbb{E}\|x_t-x^*\|^2 + C_1\sum_{p=0}^{2\tau}\eta_{t-p}^2\,\mathbb{E}\|x_{t-p}-x^*\|^2 + C_2\eta_t^2. \qquad (60)
$$

For the ease of presentation, we define $a_t \triangleq \mathbb{E}\|x_t-x^*\|^2$ and will analyze its rate. Rewrite (60) as

$$
a_{t+1} \le (1-\mu\eta_t)a_t + C_1\sum_{p=0}^{2\tau}\eta_{t-p}^2\,a_{t-p} + C_2\eta_t^2. \qquad (61)
$$
Applying telescopic cancellation to (61) from $t=0$ to $t=T-1$ yields

$$
\begin{aligned}
a_T &\le a_0 - \sum_{t=0}^{T-1}\mu\eta_t a_t + C_1\sum_{t=0}^{T-1}\sum_{p=0}^{2\tau}\eta_{t-p}^2\,a_{t-p} + C_2\sum_{t=0}^{T-1}\eta_t^2 \\
&\le a_0 - \sum_{t=0}^{T-1}\big(\mu\eta_t - 2C_1\eta_t^2\tau\big)a_t + C_2\,O(1). \qquad (62)
\end{aligned}
$$

As we can verify, $\mu/(4C_1\tau) \ge \eta_t$, meaning that

$$
\sum_{t=0}^{T-1}\big(\mu\eta_t - 2C_1\eta_t^2\tau\big)a_t \ge \sum_{t=0}^{T-1}\frac{\mu}{2}\eta_t a_t. \qquad (63)
$$

Combining (62) and (63), we have

$$
\sum_{t=0}^{T-1}\frac{\mu}{2}\eta_t a_t \le a_0 - a_T + C_2\,O(1), \qquad (64)
$$

which, along with the step-size rule (59), implies that

$$
\sum_{t=0}^{T-1}\frac{1}{\mu(t+1)+u}\,a_t \le \frac{2}{\mu}\big(a_0 + C_2\,O(1)\big). \qquad (65)
$$

Further define $C_3 = u/(u-2\mu\tau)$ such that

$$
\frac{\mu(t+1)+u}{\mu(t-p+1)+u} \le C_3 \quad \text{for any } 0 \le p \le 2\tau.
$$

Substituting the step-size rule (59) into (61), we have

$$
a_{t+1} \le \Big(1 - \frac{\mu}{\mu(t+1)+u}\Big)a_t + C_1\sum_{p=0}^{2\tau}\frac{1}{(\mu(t-p+1)+u)^2}\,a_{t-p} + \frac{C_2}{(\mu(t+1)+u)^2}, \qquad (66)
$$

and consequently

$$
\begin{aligned}
\big(\mu(t+1)+u\big)a_{t+1} &\le (\mu t+u)a_t + C_1\sum_{p=0}^{2\tau}\frac{\mu(t+1)+u}{(\mu(t-p+1)+u)^2}\,a_{t-p} + \frac{C_2}{\mu(t+1)+u} \\
&\le (\mu t+u)a_t + C_1 C_3\sum_{p=0}^{2\tau}\frac{1}{\mu(t-p+1)+u}\,a_{t-p} + \frac{C_2}{\mu(t+1)+u}. \qquad (67)
\end{aligned}
$$

Applying telescopic cancellation again to (67) from $t=0$ to $t=T-1$, we have

$$
\begin{aligned}
(\mu T+u)a_T &\le u a_0 + C_1 C_3\sum_{t=0}^{T-1}\sum_{p=0}^{2\tau}\frac{1}{\mu(t-p+1)+u}\,a_{t-p} + \sum_{t=0}^{T-1}\frac{C_2}{\mu(t+1)+u} \\
&\le u a_0 + 2C_1 C_3\tau\sum_{t=0}^{T-1}\frac{1}{\mu(t+1)+u}\,a_t + \sum_{t=0}^{T-1}\frac{C_2}{\mu(t+1)+u}. \qquad (68)
\end{aligned}
$$

Substituting (65) into (68) yields

$$
(\mu T+u)a_T \le u a_0 + \frac{4}{\mu}C_1 C_3\tau\big(a_0 + C_2\,O(1)\big) + C_2\,O(\log T), \qquad (69)
$$

and consequently

$$
a_T \le \frac{u a_0 + \frac{4}{\mu}C_1 C_3\tau\big(a_0 + C_2\,O(1)\big) + C_2\,O(\log T)}{\mu T+u} = O\Big(\frac{\log T}{T}\Big), \qquad (70)
$$

which completes the proof.

Theorem 2
Suppose that the cost function of (1) satisfies the following conditions: $f(x)$ is strongly convex with constant $\mu$ and $h(x)$ is convex; $f(x)$ is differentiable and $\nabla f(x)$ is Lipschitz continuous with constant $L$; $\mathbb{E}\|\nabla f_i(x) - \nabla f(x)\|^2 \le C_f$; $\|\partial h(x)\|^2 \le C_h$. Define the optimal solution of (1) as $x^*$. At time $t$, fix the step-size of the DAP-SGD recursion (8) as $\eta_t = \eta = O(1/\sqrt{T})$, where $T$ is the maximum number of iterations. Define the iterate generated by (8) at time $t$ as $x_t$. Then the running average iterate generated by (8) at time $T$, denoted by

$$
\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^{T}x_t,
$$

satisfies

$$
\mathbb{E}\|\bar{x}_T-x^*\|^2 \le O\Big(\frac{1}{\sqrt{T}}\Big). \qquad (71)
$$

Proof of Theorem 2:
We start from (58) in the proof of Theorem 1. Define the step-size rule

$$
\eta_t = \eta = \frac{1}{v\sqrt{T}}, \qquad (72)
$$

where $v$ is a positive constant such that $\min\big(\mu/(4C_4\tau),\, 1/L\big) \ge \eta$. Defining the constants

$$
C_4 = (2+56\tau)L^2 \quad \text{and} \quad C_5 = (16\tau+8)C_f + (43\tau+23)C_h,
$$

followed by manipulating (58), we have (similar to the inequality (61)) the following result:

$$
a_{t+1} \le (1-\mu\eta)a_t + C_4\eta^2\sum_{p=0}^{2\tau}a_{t-p} + C_5\eta^2. \qquad (73)
$$

Applying telescopic cancellation to (73) from $t=0$ to $t=T$ yields

$$
\begin{aligned}
a_{T+1} &\le a_0 - \sum_{t=0}^{T}\mu\eta a_t + C_4\eta^2\sum_{t=0}^{T}\sum_{p=0}^{2\tau}a_{t-p} + C_5(T+1)\eta^2 \\
&\le a_0 - \sum_{t=0}^{T}\big(\mu\eta - 2C_4\eta^2\tau\big)a_t + C_5(T+1)\eta^2. \qquad (74)
\end{aligned}
$$

Since $\mu/(4C_4\tau) \ge \eta$ such that

$$
\sum_{t=0}^{T}\big(\mu\eta - 2C_4\eta^2\tau\big)a_t \ge \frac{\mu\eta}{2}\sum_{t=0}^{T}a_t, \qquad (75)
$$

(74) implies

$$
\frac{\mu\eta}{2}\sum_{t=0}^{T}a_t \le a_0 - a_{T+1} + C_5(T+1)\eta^2, \qquad (76)
$$

and consequently

$$
\frac{\mu\eta}{T+1}\sum_{t=0}^{T}a_t \le \frac{2a_0 + 2C_5(T+1)\eta^2}{T+1}. \qquad (77)
$$

According to Jensen's inequality, we have

$$
\frac{\mu\eta}{T+1}\sum_{t=0}^{T}a_t = \frac{\mu\eta}{T+1}\sum_{t=0}^{T}\mathbb{E}\|x_t-x^*\|^2 \ge \mu\eta\,\mathbb{E}\bigg\|\frac{1}{T+1}\sum_{t=0}^{T}x_t - x^*\bigg\|^2 = \mu\eta\,\mathbb{E}\|\bar{x}_T-x^*\|^2. \qquad (78)
$$

Substituting (78) and the step-size rule (72) into (77), we have

$$
\mathbb{E}\|\bar{x}_T-x^*\|^2 \le \frac{2a_0 v\sqrt{T}}{\mu(T+1)} + \frac{2C_5}{\mu v\sqrt{T}} = O\Big(\frac{1}{\sqrt{T}}\Big),
$$

which completes the proof.
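As a sanity check on the Theorem 1 analysis, one can simulate the scalar recursion (61) directly under the diminishing step-size rule (59). The sketch below is not from the paper: $\mu$, $\tau$, $C_1$, $C_2$, and $a_0$ are arbitrary illustrative values, and the recursion is run with equality in place of the inequality, so it traces the worst case the bound allows.

```python
import math

# Toy simulation of the recursion behind Theorem 1,
#   a_{t+1} = (1 - mu*eta_t)*a_t + C1 * sum_{p=0}^{2*tau} eta_{t-p}^2 * a_{t-p} + C2 * eta_t^2,
# under the diminishing rule eta_t = 1/(mu*(t+1)+u); all constants are ad hoc.
mu, tau, C1, C2 = 1.0, 4, 2.0, 5.0
u = 4 * C1 * tau / mu          # chosen so that eta_0 <= mu/(4*C1*tau)
T = 100000
a0 = 10.0

def eta(t):
    return 1.0 / (mu * (t + 1) + u)

a = [a0]                       # a_0 plays the role of E||x_0 - x*||^2
for t in range(T):
    lagged = sum(eta(t - p) ** 2 * a[t - p] for p in range(2 * tau + 1) if t - p >= 0)
    a.append((1 - mu * eta(t)) * a[t] + C1 * lagged + C2 * eta(t) ** 2)

# a_T should be comparable in order of magnitude to the shape of the bound (70),
# roughly (u*a0 + C2*log(T)) / (mu*T), i.e. an O(log T / T) decay.
print(a[T], (u * a0 + C2 * math.log(T)) / (mu * T))
```

Rerunning with the constant step-size (72) instead reproduces the behavior of Theorem 2: the iterate-average error flattens at a level proportional to $\eta$, which is why the ergodic rate is $O(1/\sqrt{T})$.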