Asynchronous Parallel Nonconvex Optimization Under the Polyak-Lojasiewicz Condition
Kasra Yazdani and Matthew Hale∗

Abstract — Communication delays and synchronization are major bottlenecks for parallel computing, and tolerating asynchrony is therefore crucial for accelerating distributed computation. Motivated by modern interest in optimization problems that do not satisfy convexity assumptions, we present an asynchronous block coordinate descent algorithm for nonconvex optimization problems whose objective functions satisfy the Polyak-Łojasiewicz condition. This condition is a generalization of strong convexity to nonconvex problems and requires neither convexity nor uniqueness of minimizers. Under only assumptions of mild smoothness of objective functions and bounded delays, we prove that a linear convergence rate is obtained. Numerical experiments for logistic regression problems are presented to illustrate the impact of asynchrony upon convergence in this setting.
I. INTRODUCTION
Asynchronous distributed optimization algorithms have gained attention in part due to increases in available data and use of distributed computation. These algorithms are a key tool in large-scale machine learning problems [1] and federated learning problems wherein statistical models are trained on remote devices [2]. Similar applications also arise in power control [3], signal processing [4], and robotics. Asynchronous algorithms are crucial in parallel computing because their convergence is not hindered by slow individual processors, they reduce synchronization waits, and they relax communication overhead compared to synchronized implementations.

This paper considers a class of optimization problems whose objective functions satisfy the Polyak-Łojasiewicz (PL) condition, which is a geometric condition characterizing the curvature of some nonconvex functions [5]–[7]. The PL condition can be regarded as a generalization of strong convexity in which neither convexity nor uniqueness of minimizers is required. Several important applications in machine learning and signal processing have objective functions that satisfy the PL condition, including least squares, rank-deficient least squares, logistic regression, and support vector machines; see [6] and references therein. Recent work in [7], [8] also studies this class of functions for a team of agents with local objective functions. That work uses an algorithmic model in which each agent updates all decision variables and then agents average their iterates. Our algorithmic model is

∗ Authors are with the Department of Mechanical and Aerospace Engineering at the University of Florida, Gainesville, FL USA. Emails: {kasra.yazdani,matthewhale}@ufl.edu. This work was supported in part by a Task Order contract with the Air Force Research Laboratory, Munitions Directorate, at Eglin AFB and by AFOSR under Grant FA9550-19-1-0169.
parallel, in that each decision variable is updated only by a single agent.

For nonconvex problems, one way to accelerate classical gradient descent algorithms is to use multiple processors to compute local gradients and update their iterates using averages of gradients received from other processors. For $T$ iterations and $n$ processors, this approach achieves $O(1/(nT))$ convergence for strongly convex functions and $O(1/\sqrt{nT})$ for smooth nonconvex stochastic optimization [9]. However, the proposed linear speedup can be difficult to attain in practice because of the communication overhead of aggregating local batch gradients [8], [9].

We consider an alternative algorithmic model that tolerates longer delays under weaker assumptions. The algorithm we consider is essentially asynchronous distributed block coordinate descent (BCD). Although this class of update law has been studied before [10], [11], this work is, to the best of our knowledge, the first to connect asynchronous block coordinate descent to the study of objective functions under the PL condition.

Contributions:
The main results of this paper are:
• We show that the asynchronous block coordinate descent algorithm converges to a global minimizer in linear time under the PL condition. Compared with recent work [8], [12], [13], we achieve the same convergence rate under more general assumptions on the cost function, network architecture, and communication requirements. To the best of our knowledge, this work is the first to establish a linear speedup with arbitrary (but bounded) delays in parallelized computations and communications when minimizing objectives that satisfy the PL condition.
• We expand our results to show that the PL condition is weaker than the so-called Regularity Condition (RC) that has seen wide use in the data science community, e.g., [14]. RC can be used to show that gradient descent converges to a minimizer at a linear rate [14], and functions that satisfy it have been studied in [15]. We leverage this result to show that our asynchronous block coordinate descent algorithm attains a linear convergence rate for this class of functions as well.

Our work is closest to [16], which presents a block coordinate descent algorithm for objective functions that satisfy a form of the "error-bound condition". We show similar convergence rate results, but under weaker assumptions and with a substantially simplified proof.

This paper is organized as follows. Section II provides background and a problem formulation. Section III provides the main contributions of the paper and shows the linear convergence of our algorithm. We provide simulation results for a regularized logistic regression problem in Section IV, and Section V concludes the paper.

II. BACKGROUND AND ASYNCHRONOUS ALGORITHM
This section presents the asynchronous parallel implementation of coordinate descent and the assumptions we use to derive convergence rates. We use the notation $[n] := \{1, \ldots, n\}$.

A. Optimization Problem
We consider $n$ processors jointly solving $\min_{x \in \mathbb{R}^m} f(x)$, where $f : \mathbb{R}^m \to \mathbb{R}$ is a continuously differentiable function that satisfies the Polyak-Łojasiewicz inequality:

Definition 1. (Polyak-Łojasiewicz (PL) Inequality)
A function satisfies the PL inequality if, for some $\mu > 0$,
$$\frac{1}{2}\|\nabla f(z)\|^2 \geq \mu\big(f(z) - f^*\big) \quad \text{for all } z \in \mathbb{R}^m,$$
where $f^* = \min_{x \in \mathbb{R}^m} f(x)$. We say such an $f$ is $\mu$-PL or has the $\mu$-PL property. △

A $\mu$-PL function has a unique global minimum value, which we denote by $f^*$, and the PL condition implies that every stationary point is a global minimizer. The $\mu$-PL property is implied, e.g., by $\mu$-strong convexity, but it allows for multiple minima and does not require convexity of any kind. For example, $f(x) = x^2 + 3\sin^2(x)$ is non-convex and satisfies the PL inequality with $\mu = 1/32$. The PL inequality has also been shown to be satisfied by problems in signal processing and machine learning, including phase retrieval [14], deep linear neural networks and shallow nonlinear neural networks [17], matrix sensing, and matrix completion [18]. We assume the following about $f$.

Assumption 1.
1) $f$ is $\mu$-PL for some $\mu > 0$
2) The set $X^* = \{x^* \in \mathbb{R}^m \mid \nabla f(x^*) = 0\}$ is nonempty
3) $\nabla f(x)$ is $L$-Lipschitz continuous. In particular,
$$f(y) \leq f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|^2.$$

B. Distributed Asynchronous Block Coordinate Descent
We decompose $x$ via $x = (x_1, \ldots, x_n)^T$, where $x_i \in \mathbb{R}^{m_i}$ and $m = \sum_{i=1}^n m_i$. Below, processor $i$ computes updates only for $x_i$. Define $\nabla_i f = \frac{\partial f}{\partial x_i}$.

Each processor stores a local copy of the decision variable $x$. Due to asynchrony, these can disagree. We denote processor $i$'s decision variable at time $t$ by $x^i(t)$. Processor $i$ computes updates to $x^i_i$ but not $x^i_j$ for $j \neq i$. Instead, processor $j$ updates $x^j_j$ locally and transmits updated values to processor $i$. Due to asynchrony these values are delayed, and, in particular, $x^i_j(t)$ can contain an old value of $x^j_j$. We define $\tau^i_j(t)$ to be the time at which processor $j$ originally computed the value that processor $i$ has stored as $x^i_j(t)$. That is, $\tau^i_j$ satisfies $x^i_j(t) = x^j_j\big(\tau^i_j(t)\big)$. Clearly $\tau^i_j(t) \leq t$, and we have
$$x^i(t) = \Big(x^1_1\big(\tau^i_1(t)\big), \ldots, x^i_i(t), \ldots, x^n_n\big(\tau^i_n(t)\big)\Big).$$
Below, we will also analyze the "true" state of the network, which we define as
$$x(t) = \big(x^1_1(t), x^2_2(t), \ldots, x^n_n(t)\big). \qquad (1)$$
We define $T_i \subseteq \mathbb{N}$ as the set of times at which processor $i$ updates $x^i_i$; agent $i$ does not actually know (or need to know) $T_i$ because it is merely a tool used for analysis. For all $i \in [n]$ and stepsize $\gamma > 0$, processor $i$ executes
$$x^i_i(t+1) = \begin{cases} x^i_i(t) - \gamma \nabla_i f\big(x^i(t)\big) & t \in T_i \\ x^i_i(t) & \text{otherwise.} \end{cases}$$
We assume that communication and computation delays are bounded, which has been called partial asynchrony in the literature [11]. Formally, we have:
Assumption 2.
There exists a positive integer $B$ such that
1) For every $i \in [n]$ and $t \in \mathbb{N}$, at least one of the elements of the set $\{t, t+1, \ldots, t+B-1\}$ is in $T_i$.
2) There holds $t - B < \tau^i_j(t) \leq t$ for all $i, j \in [n]$, $j \neq i$, and all $t \in T_i$.

We summarize the algorithm as follows.

Algorithm 1:
Asynchronous BCD
Input: a stepsize $\gamma > 0$
Initialize: $\{x^i\}_{i=1}^n$ for the $n$ processors
for $t = 0, 1, \ldots, T$ do
    for $i \in [n]$ do
        if $t \in T_i$ then
            Update: $x^i_i(t+1) = x^i_i(t) - \gamma \nabla_i f\big(x^i(t)\big)$
        else
            Do not update: $x^i_i(t+1) = x^i_i(t)$
        end
        for $j \in [n] \setminus \{i\}$ do
            if processor $i$ receives $x^j_j$ at time $t+1$ then
                $x^i_j(t+1) = x^j_j\big(\tau^i_j(t+1)\big)$
            else
                $x^i_j(t+1) = x^i_j(t)$
            end
        end
    end
end

III. CONVERGENCE ANALYSIS
We first prove linear convergence of Algorithm 1 under the PL condition. Then we show that the Regularity Condition (RC) [14] implies the PL inequality and provide convergence guarantees for RC functions as well.

A. Convergence Under the PL Inequality
Theorem 1.
Let $f$ satisfy Assumption 1 and let Assumption 2 hold. Let $\eta > 0$ be large enough that
$$f(x(t)) - f^* \leq \eta, \qquad \gamma^2 \sum_{\tau = t-B}^{t-1} \|s(\tau)\|^2 \leq \eta$$
for all $t \leq B$, where $s(\tau)$ is the concatenated update direction defined in the Appendix (with $s(\tau) := 0$ for $\tau < 0$). Then there exists $\bar{\gamma} \in (0, 1]$ such that for all $\gamma \in (0, \bar{\gamma})$, and with $x(t)$ as defined in (1), the sequence $\{x(t)\}_{t \in \mathbb{N}}$ generated by Algorithm 1 satisfies
$$f(x(kB)) - f^* \leq (1 - \gamma\mu)^{k-1}\eta, \qquad (2)$$
$$\gamma^2 \sum_{\tau=(k-1)B}^{kB-1} \|s(\tau)\|^2 \leq (1 - \gamma\mu)^{k-1}\eta \qquad (3)$$
for all $k = 0, 1, 2, \ldots$

Proof: See Appendix. ∎
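To make Theorem 1 concrete, the following sketch simulates Algorithm 1 on a small $\ell_2$-regularized logistic regression problem, whose objective is strongly convex and hence $\mu$-PL with $\mu$ at least the ridge weight. All problem sizes, the delay model, and the constants below are illustrative assumptions, not the settings used in Section IV.

```python
import numpy as np

# Illustrative sketch of Algorithm 1 (asynchronous BCD with bounded delays)
# on a small l2-regularized logistic regression problem. Sizes and constants
# are assumptions for illustration only.
rng = np.random.default_rng(0)
N, m, n_proc, B = 200, 8, 4, 5          # samples, features, processors, delay bound
Z = rng.standard_normal((N, m))
y = (rng.random(N) < 0.5).astype(float)
lam = 0.1                               # ridge weight, so mu >= lam

def loss(x):
    p = 1.0 / (1.0 + np.exp(-Z @ x))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)) \
           + 0.5 * lam * (x @ x)

def grad(x):
    p = 1.0 / (1.0 + np.exp(-Z @ x))
    return Z.T @ (p - y) / N + lam * x

blocks = np.array_split(np.arange(m), n_proc)   # processor i owns block i
X = [np.zeros(m) for _ in range(n_proc)]        # local, possibly stale, copies
gamma, T = 0.1, 1500
history = []

for t in range(T):
    for i, blk in enumerate(blocks):
        # Processor i evaluates its partial gradient at its own (stale) copy
        # and updates only the coordinates it owns.
        g = grad(X[i])
        X[i][blk] -= gamma * g[blk]
    if t % B == 0:
        # Communication: every processor receives all current blocks, so no
        # read is ever more than B iterations stale (partial asynchrony).
        x_true = np.concatenate([X[i][blk] for i, blk in enumerate(blocks)])
        for i in range(n_proc):
            X[i] = x_true.copy()
    x_true = np.concatenate([X[i][blk] for i, blk in enumerate(blocks)])
    history.append(loss(x_true))

print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.4f}")
```

Despite the stale reads, the loss decreases toward the minimum, mirroring the delayed geometric contraction in (2).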
This result is a direct generalization of standard results for strongly convex functions to the case of $\mu$-PL functions minimized under asynchrony. Specifically, if we consider centralized gradient descent for a $\tau$-strongly convex function, then $B = 1$ and $\mu = \tau$, and we recover the classic linear rate for strongly convex functions.

B. Convergence Under RC
We next extend Theorem 1 to objective functions that satisfy the Regularity Condition introduced in [14]:
Definition 2. (Regularity Condition)
A function $f$ satisfies the Regularity Condition $RC(\alpha, \beta)$ with $\alpha, \beta > 0$ if
$$\langle \nabla f(z), z - x^* \rangle \geq \alpha \|\nabla f(z)\|^2 + \frac{1}{\beta}\|z - x^*\|^2 \quad \text{for all } z,$$
where $x^*$ is a minimizer of $f$. We say such an $f$ is $RC(\alpha, \beta)$.

This condition has appeared in machine learning and signal processing applications including phase retrieval and matrix sensing [14]. While it is simple to show that centralized gradient descent converges linearly for such functions, the convergence of distributed optimization algorithms under RC has received less attention, and we therefore extend our results to this case here. Though elementary, we were unable to find the following lemma in the literature.

Lemma 1.
Let $f$ have a Lipschitz continuous gradient with Lipschitz constant $L$. If $f$ is $RC(\alpha, \beta)$, then it is $\frac{1}{\beta^2 L}$-PL.

Proof:
Applying the Cauchy-Schwarz inequality to the RC definition, we write
$$\|\nabla f(z)\| \, \|z - x^*\| \geq \alpha \|\nabla f(z)\|^2 + \frac{1}{\beta}\|z - x^*\|^2 \geq \frac{1}{\beta}\|z - x^*\|^2.$$
This gives $\|\nabla f(z)\| \geq \frac{1}{\beta}\|z - x^*\|$. Using Lipschitz continuity of the gradient (cf. Assumption 1.3), for any $x^* \in X^*$ and $z \in \mathbb{R}^m$ we obtain
$$f(z) - f(x^*) \leq \frac{L}{2}\|z - x^*\|^2 \leq \frac{\beta^2 L}{2}\|\nabla f(z)\|^2,$$
and $f$ satisfies the PL inequality with $\mu = \frac{1}{\beta^2 L}$. ∎

This connection lets us establish the convergence of Algorithm 1 for RC functions.
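As a quick sanity check on Lemma 1, the sketch below uses a diagonal quadratic whose curvatures (illustrative choices, not from the paper) are picked so that $RC(\alpha, \beta)$ holds, and verifies both the RC inequality and the implied PL inequality with $\mu = \frac{1}{\beta^2 L}$ on random samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Diagonal quadratic f(z) = 0.5 * z^T A z with minimizer x* = 0 and
# gradient Az. The eigenvalues a_i are chosen (an illustrative assumption)
# so that a_i >= alpha*a_i^2 + 1/beta, which makes RC(alpha, beta) hold
# coordinate-wise.
a = np.array([0.5, 1.0, 2.0])
L = a.max()                      # Lipschitz constant of the gradient
alpha, beta = 0.25, 4.0
mu = 1.0 / (beta**2 * L)         # PL constant given by Lemma 1

def f(z):
    return 0.5 * z @ (a * z)     # f* = 0

def grad(z):
    return a * z

for _ in range(1000):
    z = rng.standard_normal(3) * rng.uniform(0.1, 10)
    g = grad(z)
    # Regularity Condition: <grad f(z), z - x*> >= alpha||grad||^2 + (1/beta)||z - x*||^2
    assert g @ z >= alpha * (g @ g) + (1 / beta) * (z @ z) - 1e-9
    # Implied PL inequality: 0.5||grad f(z)||^2 >= mu (f(z) - f*)
    assert 0.5 * (g @ g) >= mu * f(z) - 1e-9
print("RC and implied PL inequalities verified on random samples")
```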
Theorem 2.
Let $f$ be $RC(\alpha, \beta)$ and have a Lipschitz gradient with constant $L$. Let $x^*$ be a global minimizer of $f$. For $\gamma \in (0, \bar{\gamma})$ as in Theorem 1, the sequence $\{x(t)\}_{t \in \mathbb{N}}$ generated by Algorithm 1 converges linearly to the minimum value $f^* = f(x^*)$ at the rate given in Theorem 1 with $\mu = \frac{1}{\beta^2 L}$.

Proof: Follows immediately from Lemma 1 and Theorem 1. ∎
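A minimal illustration of Theorem 2 in the synchronous special case ($n = 1$, $B = 1$) is plain gradient descent on a diagonal quadratic satisfying $RC(\alpha, \beta)$; all constants here are illustrative assumptions. The sketch checks that the objective contracts by at least a factor $1 - \gamma\mu$ per step with $\mu = \frac{1}{\beta^2 L}$:

```python
import numpy as np

# Synchronous special case of Algorithm 1 (one processor, no delay) on a
# diagonal quadratic f(z) = 0.5 * z^T A z that satisfies RC(alpha, beta)
# for the hypothetical constants below.
a = np.array([0.5, 1.0, 2.0])    # curvatures; f* = 0 at x* = 0
L, beta = a.max(), 4.0
mu = 1.0 / (beta**2 * L)         # PL constant from Lemma 1
gamma = 0.1

def f(z):
    return 0.5 * z @ (a * z)

x = np.array([3.0, -2.0, 1.0])
prev = f(x)
for _ in range(50):
    x = x - gamma * a * x        # gradient step
    cur = f(x)
    # Theorem 2 predicts at least a (1 - gamma*mu) contraction per step.
    assert cur <= (1 - gamma * mu) * prev + 1e-12
    prev = cur
print("observed contraction consistent with the (1 - gamma*mu) rate")
```

In practice the observed contraction is much faster than $1 - \gamma\mu$; the PL constant delivered by Lemma 1 is conservative for well-conditioned problems.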
IV. CASE STUDY
We solve an $\ell_2$-regularized logistic regression problem using Algorithm 1. We denote training feature vectors by $z^{(i)} \in \mathbb{R}^m$, and we use $y^{(i)} \in \{0, 1\}$ to denote their corresponding labels. The logistic regression objective function for $N$ observations is given by
$$E(x) = -\frac{1}{N}\left[\sum_{i=1}^N y^{(i)} \log\big(h(z^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h(z^{(i)})\big)\right] + \frac{\lambda}{2N}\|x\|^2,$$
where $h(z^{(i)}) = \frac{1}{1 + e^{-x^T z^{(i)}}}$ is a sigmoid hypothesis function.

We conduct experiments on the Epsilon dataset using the above logistic regression model. The Epsilon dataset is a popular benchmark for large-scale binary classification [8], and it consists of 400,000 training samples and 100,000 test samples. Each sample has a feature dimension of $m = 2000$. Each instance is preprocessed to mean zero and unit variance, and normalized to a unit vector, for both the training and test sets.

We ran Algorithm 1 in parallel across multiple processors. The stepsize $\gamma$ and the regularization parameter $\lambda$ were set to small fixed values. In three separate experiments, the communication delay for each processor is randomly generated and bounded by $B = 10$, $B = 100$, and $B = 1000$, respectively. The results of the experiment are given in Figure 1.

These numerical experiments indeed show that Algorithm 1 converges linearly. We can also observe that it converges at a slower rate as the communication delays increase, which reflects the "delayed linear" nature of Theorem 1, which contracts toward a minimizer by a factor of $1 - \gamma\mu$ every $B$ timesteps.

V. CONCLUSIONS
We derived convergence rates of asynchronous coordinate descent parallelized among $n$ processors for functions satisfying the Polyak-Łojasiewicz (PL) condition. By an elementary extension of our main result, we also showed linear convergence to minimizers of functions that satisfy the Regularity Condition (RC). Directions for future extensions include deriving similar convergence rates for stochastic settings and accelerated methods.

Fig. 1. Comparison of the convergence rate of BCD in Algorithm 1 with maximum delay $B \in \{10, 100, 1000\}$.

REFERENCES
[1] X. Lian, Y. Huang, Y. Li, and J. Liu, "Asynchronous parallel stochastic gradient for nonconvex optimization," in
Advances in Neural Information Processing Systems, 2015, pp. 2737–2745.
[2] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," arXiv preprint arXiv:1902.01046, 2019.
[3] T.-H. Chang, A. Nedić, and A. Scaglione, "Distributed constrained optimization by consensus-based primal-dual perturbation method,"
IEEE Transactions on Automatic Control, vol. 59, no. 6, pp. 1524–1538, 2014.
[4] A. Olshevsky, "Efficient information aggregation strategies for distributed control and signal processing," arXiv preprint arXiv:1009.6036, 2010.
[5] B. T. Polyak, "Gradient methods for minimizing functionals,"
Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643–653, 1963.
[6] H. Karimi, J. Nutini, and M. Schmidt, "Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition," in
Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.
[7] X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson, "A primal-dual SGD algorithm for distributed nonconvex optimization," arXiv preprint arXiv:2006.03474, 2020.
[8] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, "Local SGD with periodic averaging: Tighter analysis and adaptive synchronization," in
Advances in Neural Information Processing Systems, 2019, pp. 11082–11094.
[9] S. U. Stich, "Local SGD converges fast and communicates little," arXiv preprint arXiv:1805.09767, 2018.
[10] Z. Peng, Y. Xu, M. Yan, and W. Yin, "ARock: an algorithmic framework for asynchronous parallel coordinate updates,"
SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. A2851–A2879, 2016.
[11] D. P. Bertsekas and J. N. Tsitsiklis,
Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.
[12] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D²: Decentralized training over decentralized data," arXiv preprint arXiv:1803.07068, 2018.
[13] H. Yu and R. Jin, "On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization," arXiv preprint arXiv:1905.04346, 2019.
[14] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1985–2007, 2015.
[15] Y. Chi, Y. M. Lu, and Y. Chen, "Nonconvex optimization meets low-rank matrix factorization: An overview,"
IEEE Transactions on Signal Processing, vol. 67, no. 20, pp. 5239–5269, 2019.
[16] P. Tseng, "On the rate of convergence of a partially asynchronous gradient projection algorithm,"
SIAM Journal on Optimization, vol. 1, no. 4, pp. 603–619, 1991.
[17] Y. Zhou and Y. Liang, "Characterization of gradient dominance and regularity conditions for neural networks," arXiv preprint arXiv:1710.06910, 2017.
[18] S. Bhojanapalli, B. Neyshabur, and N. Srebro, "Global optimality of local search for low rank matrix recovery," in
Advances in Neural Information Processing Systems, 2016, pp. 3873–3881.
VI. APPENDIX
Define
$$s_i(t) := \begin{cases} -\nabla_i f\big(x^i(t)\big) & t \in T_i \\ 0 & \text{otherwise} \end{cases}$$
and concatenate the terms in $s(t) := [s_1(t)^T, \ldots, s_n(t)^T]^T$. We begin with the following basic lemmas.

Lemma 2.
For all $t \geq 0$ and all $i$, we have $\big\|x^i(t) - x(t)\big\| \leq \gamma \sum_{\tau=t-B}^{t-1} \|s(\tau)\|$.

Proof:
See Equation (5.9) in [11, Section 7.5]. ∎
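Lemma 2 can be sanity-checked in a toy two-processor simulation with a fixed read delay $d < B$; all constants below are illustrative assumptions:

```python
import numpy as np

# Two processors, one coordinate each, f(x) = 0.5*||x||^2 (so grad f(x) = x).
# Processor i sees the other processor's coordinate with a fixed delay d < B.
gamma, B, d, T = 0.1, 3, 2, 40
hist = [np.array([4.0, -3.0])]   # hist[t] is the "true" state x(t)
s_hist = []                      # s_hist[t] is the concatenated direction s(t)

for t in range(T):
    x = hist[-1]
    old = hist[max(t - d, 0)]
    views = [np.array([x[0], old[1]]),    # x^0(t): own coord current, other stale
             np.array([old[0], x[1]])]    # x^1(t)
    # Lemma 2: ||x^i(t) - x(t)|| <= gamma * sum_{tau=t-B}^{t-1} ||s(tau)||
    bound = gamma * sum(np.linalg.norm(s) for s in s_hist[max(t - B, 0):])
    for v in views:
        assert np.linalg.norm(v - x) <= bound + 1e-12
    # s_i(t) = -grad_i f(x^i(t)); for this f, grad_i is just coordinate i
    s = np.array([-views[0][0], -views[1][1]])
    s_hist.append(s)
    hist.append(x + gamma * s)
print("Lemma 2 bound held at every step")
```

The bound holds because each processor's local copy differs from the true state only by updates applied during the last at most $B$ steps, each of size $\gamma\|s(\tau)\|$.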
Now, by using the partial asynchrony assumption and the Lipschitz property of the gradient, we quantify the $B$-step decrease of the function value at the true state $x(t)$.

Lemma 3.
For all $t \geq 0$, we have
$$f(x(t+B)) - f(x(t)) \leq \frac{L\gamma^2 nB}{2}\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2 + \Big(\frac{\gamma^2 L}{2}\big(B(n+1)+1\big) - \gamma\Big)\sum_{\tau=t}^{t+B-1}\|s(\tau)\|^2. \qquad (4)$$

Proof:
From the Descent Lemma [11, Section 3.2], we have
$$f(x(t+1)) = f\big(x(t) + \gamma s(t)\big) \leq f(x(t)) + \gamma \sum_{i=1}^n s_i(t)^T \nabla_i f(x(t)) + \frac{L\gamma^2}{2}\|s(t)\|^2.$$
Adding and subtracting $\nabla_i f\big(x^i(t)\big)$ inside the inner product and applying the Lipschitz property of the gradient gives
$$f(x(t+1)) - f(x(t)) \leq L\gamma \sum_{i=1}^n \big\|\nabla_i f(x^i(t))\big\| \big\|x(t) - x^i(t)\big\| + \Big(\frac{L\gamma^2}{2} - \gamma\Big)\|s(t)\|^2.$$
Employing Lemma 2, and applying the inequality $ab \leq \frac{1}{2}(a^2+b^2)$ inside the sum, we find
$$f(x(t+1)) - f(x(t)) \leq \frac{L\gamma^2}{2}\left[B\|s(t)\|^2 + n \sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2\right] + \Big(\frac{L\gamma^2}{2} - \gamma\Big)\|s(t)\|^2.$$
The proof is completed by applying this inequality successively at $t, t+1, \ldots, t+B-1$ and summing. ∎

The proof scheme up to this point closely follows [11], [16], which were not focused on PL functions. From this point on, we leverage the $\mu$-PL property of the objective function, and the following lemma and the rest of the results are new. The next lemma bounds the one-step change in $x(t)$.

Lemma 4. For all $t \geq 0$, there holds
$$\|x(t+1) - x(t)\|^2 \leq \big(nB\gamma^4 L^2 + \gamma^2 Ln\big)\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2 + \big(\gamma^2 + \gamma^4 LnB\big)\|\nabla f(x(t))\|^2.$$

Proof:
With $x(t+1) = x(t) + \gamma s(t)$, we add and subtract $\gamma\nabla f\big(x(t)\big)$ and apply the triangle inequality to find
$$\|x(t+1) - x(t)\| \leq \|\gamma s(t) + \gamma\nabla f(x(t))\| + \|\gamma\nabla f(x(t))\|.$$
Squaring both sides, we expand to find
$$\|x(t+1) - x(t)\|^2 \leq \|\gamma\nabla f(x(t))\|^2 + 2\gamma\sum_{i=1}^n \|\gamma\nabla f(x(t))\|\,\|s_i(t) + \nabla_i f(x(t))\| + \gamma^2\sum_{i=1}^n \|\nabla_i f(x(t)) + s_i(t)\|^2.$$
Using the Lipschitz property of the gradient gives
$$\|x(t+1) - x(t)\|^2 \leq \|\gamma\nabla f(x(t))\|^2 + \gamma^2 L^2 \sum_{i=1}^n \big\|x^i(t) - x(t)\big\|^2 + 2\gamma L \sum_{i=1}^n \|\gamma\nabla f(x(t))\|\,\big\|x^i(t) - x(t)\big\|.$$
Applying Lemma 2, the inequality $\big(\sum_{i=1}^K y_i\big)^2 \leq K\sum_{i=1}^K y_i^2$, and $ab \leq \frac{1}{2}(a^2+b^2)$, we expand to find
$$\|x(t+1) - x(t)\|^2 \leq nB\gamma^4 L^2 \sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2 + \|\gamma\nabla f(x(t))\|^2 + \gamma^2 Ln \sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2 + \gamma^2 LnB\,\|\gamma\nabla f(x(t))\|^2.$$
Rearranging completes the proof. ∎

The next lemma quantifies the distance to minima with asynchronous updates.
Lemma 5.
Take
$$\gamma < \min\bigg\{\frac{2}{L(LnB + B + 1)},\ \frac{2}{L\big(B(2n+1)+1\big)}\bigg\}.$$
For all $t \geq 0$, we have
$$f(x(t+B)) - f^* \leq (1 - C_1)\big(f(x(t)) - f^*\big) + (C_2 + C_3)\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2,$$
where $C_1$, $C_2$, and $C_3$ are positive constants defined as
$$C_1 := -2\mu\Big(\frac{\gamma^4 L^2 nB}{2} + \frac{\gamma^2(LB+L)}{2} - \gamma\Big), \qquad C_2 := \frac{nB\gamma^4 L^3}{2} + \frac{\gamma^2 L(L+1)n}{2}, \qquad C_3 := \frac{L\gamma^2 nB}{2}.$$

Proof:
By Lipschitz continuity (cf. Assumption 1), we write
$$f(x(t+1)) - f(x(t)) \leq \langle \nabla f(x(t)), \gamma s(t)\rangle + \frac{L}{2}\|x(t+1) - x(t)\|^2$$
$$\leq \gamma L \sum_{i=1}^n \big\|x(t) - x^i(t)\big\|\,\|\nabla_i f(x(t))\| - \gamma\|\nabla f(x(t))\|^2 + \frac{L}{2}\|x(t+1) - x(t)\|^2,$$
where we added and subtracted $\nabla f(x(t))$ in $\gamma s(t)$ and used Cauchy-Schwarz and the Lipschitz continuity of the gradient. Next, applying Lemma 2 and then using $ab \leq \frac{1}{2}(a^2+b^2)$ on the first term of the right-hand side and simplifying, we get
$$f(x(t+1)) - f(x(t)) \leq \frac{L}{2}\|x(t+1) - x(t)\|^2 + \Big(\frac{\gamma^2 LB}{2} - \gamma\Big)\|\nabla f(x(t))\|^2 + \frac{\gamma^2 Ln}{2}\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2.$$
Applying Lemma 4, we obtain
$$f(x(t+1)) - f(x(t)) \leq \Big(\frac{nB\gamma^4 L^3}{2} + \frac{\gamma^2 L(L+1)n}{2}\Big)\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2 + \Big(\frac{\gamma^4 L^2 nB}{2} + \frac{\gamma^2(LB+L)}{2} - \gamma\Big)\|\nabla f(x(t))\|^2.$$
The fact that $\gamma < 1$ and the first term in the definition of $\bar{\gamma}$ ensure that $\frac{\gamma^4 L^2 nB}{2} + \frac{\gamma^2(LB+L)}{2} - \gamma \leq 0$. Using this and the PL inequality, we have
$$f(x(t+1)) - f^* \leq (1 - C_1)\big(f(x(t)) - f^*\big) + C_2\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2, \qquad (5)$$
where we have also added $-f^*$ to both sides. For a small enough $\gamma$, repeating the argument in Lemma 3 from $t+1$ to $t+B$, we find
$$-f(x(t+1)) \leq -f(x(t+B)) + \frac{L\gamma^2 nB}{2}\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2. \qquad (6)$$
Using (6) in (5) completes the proof. ∎

To streamline the remaining analysis, we define
$$\alpha(t) := f(x(t)) - f^*, \qquad \beta(t) := \gamma^2\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2.$$
We next find upper bounds for $\beta(t)$ and $\beta(t-B)$.
We substitute $t-B$ for $t$ and $t-2B$ for $t$ consecutively in (4), and by rearranging the terms we obtain
$$\beta(t) \leq C_4^{-1}\Big(\alpha(t-B) + \frac{LnB}{2}\beta(t-B) - \alpha(t)\Big), \qquad (7)$$
$$\beta(t-B) \leq C_4^{-1}\Big(\alpha(t-2B) + \frac{LnB}{2}\beta(t-2B) - \alpha(t-B)\Big), \qquad (8)$$
where $C_4 := \gamma^{-1} - \frac{L}{2}\big(B(n+1)+1\big) \in \mathbb{R}_+$. Next, we employ (7) to bound $\beta(t)$ in Lemma 5, which gives
$$\alpha(t+B) \leq \Big(1 - C_1 - \frac{C_2+C_3}{\gamma^2 C_4}\Big)\alpha(t) + \frac{C_2+C_3}{\gamma^2 C_4}\Big(\alpha(t-B) + \frac{LnB}{2}\beta(t-B)\Big). \qquad (9)$$
Using (8), we bound the term $\beta(t-B)$ in (9), and by simplifying the terms we obtain
$$\alpha(t+B) \leq \Big(1 - C_1 - \frac{C_2+C_3}{\gamma^2 C_4}\Big)\alpha(t) + \frac{C_2+C_3}{\gamma^2 C_4}\bigg[\Big(1 - \frac{LnB}{2C_4}\Big)\alpha(t-B) + \frac{LnB}{2C_4}\alpha(t-2B) + \frac{LnB}{2}\cdot\frac{LnB}{2C_4}\beta(t-2B)\bigg]. \qquad (10)$$
Next, by repeatedly applying Lemma 3 at $t, t+B, t+2B, \ldots$ and summing the resulting inequalities, we obtain
$$\alpha(t+kB) \leq \alpha(t) + \Big(\frac{\gamma^2 LnB}{2} + \frac{\gamma^2 L}{2}\big(B(n+1)+1\big) - \gamma\Big)\sum_{l=1}^{k-1}\,\sum_{\tau=t+(l-1)B}^{t+lB-1}\|s(\tau)\|^2 + \Big(\frac{\gamma^2 L}{2}\big(B(n+1)+1\big) - \gamma\Big)\sum_{\tau=t+(k-1)B}^{t+kB-1}\|s(\tau)\|^2 + C_3\sum_{\tau=t-B}^{t-1}\|s(\tau)\|^2.$$
Using $\gamma < \frac{2}{L(B(2n+1)+1)}$ makes these summations' coefficients negative, and therefore we get
$$\alpha(t+kB) \leq \alpha(t) + \frac{LnB}{2}\beta(t) + \Big(\frac{L}{2}\big(B(n+1)+1\big) + \frac{LnB}{2} - \gamma^{-1}\Big)\beta(t+B). \qquad (11)$$
By the PL inequality, $f(x(t)) \geq f^*$ for all $x(t)$, and therefore $\liminf_{t\to\infty} f(x(t)) \geq f^*$, which gives $\alpha(t+kB) \geq 0$ for all $k$. From this fact and (11) we obtain
$$\beta(t+B) \leq \frac{\alpha(t) + \frac{LnB}{2}\beta(t)}{\gamma^{-1} - \frac{L}{2}\big(B(n+1)+1\big) - \frac{LnB}{2}}. \qquad (12)$$
Finally, using (10) and (12), we are able to show the main result on asynchronous linear convergence of BCD.
We proceed by induction on $k$. Take the scalar $\eta$ large enough that $\alpha(t), \alpha(t+B), \beta(t), \beta(t+B) \leq \eta$. Then (2) and (3) hold for $k = 0$ and $k = 1$. We show that if $\alpha(t+kB) \leq (1-\gamma\mu)^{k-1}\eta$ and $\beta(t+kB) \leq (1-\gamma\mu)^{k-1}\eta$, then $\alpha(t+(k+1)B) \leq (1-\gamma\mu)^{k}\eta$ and $\beta(t+(k+1)B) \leq (1-\gamma\mu)^{k}\eta$ for $k \geq 1$. Under this induction hypothesis, (10) becomes
$$\alpha(t+(k+1)B) \leq \Big(1 - C_1 - \frac{C_2+C_3}{\gamma^2 C_4}\Big)\eta(1-\gamma\mu)^{k-1} + \frac{C_2+C_3}{\gamma^2 C_4}\bigg[\Big(1 - \frac{LnB}{2C_4}\Big)\eta(1-\gamma\mu)^{k-2} + \frac{LnB}{2C_4}\eta(1-\gamma\mu)^{k-3} + \frac{LnB}{2}\cdot\frac{LnB}{2C_4}\eta(1-\gamma\mu)^{k-3}\bigg].$$
Factoring out the term $(1-\gamma\mu)^{k-1}\eta$, using the inequality $(1-\gamma\mu)^{-1} < 1 + 2\gamma\mu$ and its square, and simplifying, we get
$$\alpha(t+(k+1)B) \leq \bigg[1 - C_1 + \frac{C_2+C_3}{\gamma^2 C_4}\bigg(2\gamma\mu - \frac{LnB}{2C_4}(1+2\gamma\mu) + \frac{LnB}{2C_4}\Big(1 + \frac{LnB}{2}\Big)\big(1 + 4\gamma\mu + 4\gamma^2\mu^2\big)\bigg)\bigg]\eta(1-\gamma\mu)^{k-1}.$$
Next, take $\gamma < \frac{1}{4\mu}$, which gives $\gamma\mu(1+\gamma\mu) < 2\gamma\mu$. We use this and the inequality $\big(1 - \frac{\gamma L}{2}(B(n+1)+1)\big)^{-1} < 1 + \gamma L\big(B(n+1)+1\big)$ and expand to get
$$\alpha(t+(k+1)B) \leq \bigg[1 + 2\mu\Big(\frac{\gamma^4 L^2 nB}{2} + \frac{\gamma^2(LB+L)}{2} - \gamma\Big) + \big(1 + 2\gamma L(B(n+1)+1)\big)\Big(\frac{L\gamma^2 nB}{2} + \frac{nB\gamma^4 L^3}{2} + \frac{\gamma^2 L(L+1)n}{2}\Big)\Big(2\gamma\mu + \frac{LnB}{2C_4}\Big(\frac{LnB}{2} + 8\gamma\mu + \frac{LnB}{2}\gamma\mu\Big)\Big)\bigg]\eta(1-\gamma\mu)^{k-1}.$$
Define the constants $A_1, A_2 \in \mathbb{R}_+$ by $A_1 := \frac{LnB}{2}\big(1 + 2L(B(n+1)+1)\big)$ and $A_2 := 2\mu + A_1\big(\frac{LnB}{2} + 4LnB + 8\big)$. Note that from $\gamma < 1$ we have $\frac{LnB}{2C_4} < \gamma A_1$. Simplifying the above then gives
$$\alpha(t+(k+1)B) \leq \bigg[1 - 2\gamma\mu + \mu\gamma^2\big(L^2 nB + LB + L\big) + \gamma^2\Big(A_2 + 2A_2 L\big(B(n+1)+1\big)\Big)\Big(\frac{LnB}{2} + \frac{nBL^3}{2} + \frac{L(L+1)n}{2}\Big)\bigg]\eta(1-\gamma\mu)^{k-1}.$$
Taking $\bar{\gamma}$ according to
$$\bar{\gamma} \leq \min\bigg\{\frac{2}{L(LnB+B+1)},\ \frac{\mu}{A_2 Ln\big(L(B(n+1)+2)\big)\big(nBL+L+B+1\big)}\bigg\}$$
therefore allows us to upper-bound all of the above terms to get
$$\alpha(t+(k+1)B) \leq (1-\gamma\mu)^{k}\eta.$$
This completes the first part of the proof.
From the induction hypothesis, (12) gives
$$\beta(t+(k+1)B) \leq \frac{\eta(1-\gamma\mu)^{k-1} + \frac{LnB}{2}\eta(1-\gamma\mu)^{k-1}}{\gamma^{-1} - \frac{L}{2}\big(B(n+1)+1\big) - \frac{LnB}{2}}.$$
Taking $\gamma < \Big(\frac{L}{2}B(3n+1) + \frac{L}{2} + \mu + 1\Big)^{-1}$ therefore gives
$$\beta(t+(k+1)B) \leq (1-\gamma\mu)^{k}\eta.$$
This holds for all $t$, and the proof is complete. ∎