Multi-consensus Decentralized Accelerated Gradient Descent
Haishan Ye ∗ Luo Luo ∗ Ziang Zhou ∗ Tong Zhang † May 5, 2020
Abstract
This paper considers the decentralized optimization problem, which has applications in large-scale machine learning, sensor networks, and control theory. We propose a novel algorithm that can achieve near-optimal communication complexity, matching the known lower bound up to a logarithmic factor of the condition number of the problem. Our theoretical results give affirmative answers to the open problem of whether there exists an algorithm that can achieve a communication complexity (nearly) matching the lower bound depending on the global condition number instead of the local one. Moreover, the proposed algorithm achieves the optimal computation complexity, matching the lower bound up to universal constants. Furthermore, to achieve a linear convergence rate, our algorithm does not require the individual functions to be (strongly) convex. Our method relies on a novel combination of known techniques, including Nesterov's accelerated gradient descent, multi-consensus, and gradient tracking. The analysis is new and may be applied to other related problems. Empirical studies demonstrate the effectiveness of our method for machine learning applications.
In this paper, we consider the decentralized optimization problem, where the objective function is composed of $m$ local functions $f_i(x)$ that are stored on $m$ different agents (or computational nodes). These agents form a connected and undirected network. Moreover, each agent $i$ can only access its private $f_i(x)$ and communicate with its neighboring agents in $N(i)$. The agents try to cooperatively solve the following convex optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x) \triangleq \frac{1}{m}\sum_{i=1}^m f_i(x), \qquad (1.1)$$
where $f(x)$ is a strongly convex function. Decentralized optimization has been widely studied and applied in many applications such as large-scale machine learning (Tsianos et al., 2012; Kairouz et al., 2019; He et al., 2019), automatic control (Bullo et al., 2009; Lopes & Sayed, 2008), wireless communication (Ribeiro, 2010), and sensor networks (Rabbat & Nowak, 2004; Khan et al., 2009).

∗ Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong, Shenzhen; email: hsye [email protected]; [email protected]; [email protected]
† Hong Kong University of Science and Technology; email: [email protected]

Because of their wide applications, many decentralized algorithms have been proposed. Decentralized gradient methods (Nedic & Ozdaglar, 2009; Yuan et al., 2016), decentralized accelerated gradient methods (Jakovetić et al., 2014; Qu & Li, 2019), and EXTRA (Shi et al., 2015; Li et al., 2019; Mokhtari & Ribeiro, 2016) are primal-only methods. These algorithms only require the computation of the gradient of $f_i(x)$, and they are usually computationally efficient. Another class of algorithms are the dual-based decentralized algorithms.
Typical examples include dual subgradient ascent (Terelius et al., 2011), dual gradient ascent and its accelerated version (Scaman et al., 2017; Uribe et al., 2018), primal-dual methods (Lan et al., 2018; Scaman et al., 2018; Hong et al., 2017), and ADMM (Erseghe et al., 2011; Shi et al., 2014).

In spite of many studies of the decentralized optimization problem, there are several important open problems. The first is whether there exists an algorithm that achieves both the optimal computation and communication complexities matching the known lower bounds. Recently, Scaman et al. (2017) proposed a dual-based algorithm that achieves the optimal communication complexity. However, dual-based methods typically require the evaluation of the Fenchel conjugate of the local objective function $f_i(x)$, and for many problems this requires more computation per iteration than primal-only algorithms. Therefore, the methods proposed by Scaman et al. (2017) do not achieve the optimal computation complexity for general functions. In contrast, Li et al. (2018) proposed a primal-only method called APM-C which can achieve the optimal computation complexity. However, its communication complexity only matches the lower bound up to a $\log(1/\epsilon)$ factor. The second open problem is whether there exists an algorithm whose communication complexity depends on the global condition number $\kappa_g$ instead of the local condition number $\kappa_\ell$ (both defined in Eqn. (3.3)) (Scaman et al., 2017). We note that $\kappa_\ell$ is always at least as large as $\kappa_g$, and the gap between them is often very large in real applications. Therefore, it is important to measure complexity using $\kappa_g$ instead of $\kappa_\ell$. Third, it is unknown whether the convexity of each individual $f_i(x)$ is essential for communication-efficient decentralized algorithms.
Dual-based algorithms require each $f_i(x)$ to be convex, because they require the dual function of each individual $f_i(x)$ to be well defined. Existing primal-only algorithms also assume each $f_i(x)$ to be (strongly) convex to achieve linear convergence rates. Finally, it is not clear how optimal centralized and decentralized optimization methods are related. It was shown in Scaman et al. (2017) that the lower bounds on communication complexity for centralized and decentralized problems are related: they differ only in the communication cost, where averaging is required in centralized algorithms and consensus is needed in decentralized algorithms. Several decentralized algorithms try to 'imitate' their centralized counterparts and achieve low computation and communication complexities (Qu & Li, 2019; Jakovetić et al., 2014; Li et al., 2018). However, these algorithms achieve convergence rates inferior to their centralized counterparts. For example, the computation complexity of APM-C depends on the local condition number $\kappa_\ell$, while the computation complexity of centralized accelerated gradient descent depends on the global condition number $\kappa_g$.

This paper addresses the theoretical issues discussed above. We summarize our contributions as follows:

1. We propose a novel algorithm that achieves the optimal computation complexity $O\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ and a near-optimal communication complexity $O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log(ML\kappa_g)\log\frac{1}{\epsilon}\big)$, which matches the lower bound up to a $\log(ML\kappa_g)$ factor, with $M$ and $L$ being the smoothness parameters of $f_i(x)$ and $f(x)$, respectively. To the best of our knowledge, this is the best communication complexity that primal-only algorithms can achieve.

2. The complexities of our algorithm depend on the global condition number $\kappa_g$ instead of the local condition number $\kappa_\ell$, where it holds that $\kappa_g \le \kappa_\ell$.
It is an open problem whether there exists an algorithm that can achieve a communication complexity $O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$, or even come close to it. Our algorithm provides an affirmative answer to this open problem.

3. Our algorithm does not require each individual function to be (strongly) convex. In contrast, this condition is required by previous algorithms to achieve linear convergence rates. Thus, our algorithm has a much wider range of applications, since $f_i(x)$ may not be convex in many machine learning problems.

4. Our analysis reveals an important connection between centralized and decentralized algorithms. We show that a decentralized algorithm with multi-consensus and gradient tracking can approximate its centralized counterpart. Using this observation, we show that decentralized algorithms and their centralized counterparts have the same computation complexity, but different communication complexities due to the difference in the communication cost of achieving averaging or consensus.

We now review prior works that are closely related to the proposed algorithm. First, we review the penalty-based algorithms. Nedic & Ozdaglar (2009) proposed the well-known decentralized gradient descent method, in which each agent performs a consensus step and a gradient descent step with a fixed step size that is related to the penalty parameter. Yuan et al. (2016) proved the convergence rate of decentralized gradient descent and showed how the penalty parameter affects the computation complexity. To achieve a faster convergence rate and a smaller communication complexity, a Network Newton method was proposed in (Mokhtari et al., 2016). In Jakovetić et al. (2014), Nesterov's acceleration was used to improve the convergence speed, and the method relied on multi-consensus to reduce the impact of the penalty parameter.
However, this kind of algorithm with a fixed penalty parameter can only achieve a sublinear convergence rate even when the objective function is strongly convex (Yuan et al., 2016; Lan & Monteiro, 2013). Recently, Li et al. (2018) proposed
APM-C, which employs multi-consensus and properly increases the penalty parameter at each iteration. Due to Nesterov's acceleration,
APM-C can achieve a linear convergence rate and a low communication complexity.

To achieve a fast convergence rate, gradient-tracking methods were proposed in (Qu & Li, 2017; Xu et al., 2015; Qu & Li, 2019). There are two different techniques for gradient tracking. The first technique keeps a variable that estimates the average gradient and uses this estimate in the gradient descent step.
EXTRA is another gradient-tracking method, which introduces two different weight matrices to track the difference of gradients (Shi et al., 2015; Li et al., 2019).
EXTRA is quite different from standard decentralized gradient methods, which use only a single weight matrix. Because of these differences, the convergence analyses of the different gradient-tracking-based algorithms also differ. Due to the tracking of historical information, gradient-tracking-based algorithms can achieve linear convergence for strongly convex objective functions (Qu & Li, 2017; Shi et al., 2015). However, the previously obtained convergence rates and communication complexities are much worse than those of the method proposed in this paper.

Finally, dual-based methods introduce a Lagrangian function and work in the dual space. There are different ways to solve the reformulated problem, such as the gradient descent method (Terelius et al., 2011), the accelerated gradient method (Scaman et al., 2017; Uribe et al., 2018), primal-dual methods (Lan et al., 2018; Scaman et al., 2018), and ADMM (Shi et al., 2014; Erseghe et al., 2011). However, generally speaking, dual methods are computationally inefficient. For example, using the accelerated gradient method to solve the dual counterpart of the decentralized optimization problem achieves the optimal communication complexity (Scaman et al., 2017; Uribe et al., 2018). However, the computation complexity then has an extra dependency on the eigenvalue gap of the weight matrix describing the network (Uribe et al., 2018), and this can be much worse than the optimal computation complexity.
Let $x_i \in \mathbb{R}^d$ be the local copy of the variable $x$ for agent $i$. We introduce the aggregate variable $\mathbf{x}$ and the aggregate objective function $F(\mathbf{x})$ as
$$\mathbf{x} = \big[x_1^\top;\ \dots;\ x_m^\top\big], \qquad F(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^m f_i(x_i), \qquad (3.1)$$
where $\mathbf{x} \in \mathbb{R}^{m\times d}$, and the aggregate gradient $\nabla F(\mathbf{x}) \in \mathbb{R}^{m\times d}$ is defined as
$$\nabla F(\mathbf{x}) = \big[\nabla^\top f_1(x_1);\ \dots;\ \nabla^\top f_m(x_m)\big].$$
We denote
$$\bar{x}_t = \frac{1}{m}\sum_{i=1}^m \mathbf{x}_t(i,:), \qquad \bar{y}_t = \frac{1}{m}\sum_{i=1}^m \mathbf{y}_t(i,:), \qquad \bar{g}_t = \frac{1}{m}\sum_{i=1}^m \nabla f_i(\mathbf{y}_t(i,:)), \qquad (3.2)$$
where $\mathbf{x}(i,:)$ denotes the $i$-th row of the matrix $\mathbf{x}$. Moreover, in this paper we use $\|\cdot\|$ to denote the Frobenius norm of a vector or matrix, and $\langle x, y\rangle$ to denote the inner product of vectors $x$ and $y$. Furthermore, we introduce the following definitions, which will be used throughout the paper:

Global $L$-Smoothness. We say $f(x)$ is $L$-smooth if for all $y, x \in \mathbb{R}^d$,
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2.$$

Global $\mu$-Strong Convexity. We say $f(x)$ is $\mu$-strongly convex if for all $y, x \in \mathbb{R}^d$,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2.$$

Local $M$-Smoothness. We say the problem is locally $M$-smooth if for all $i$ and $y, x \in \mathbb{R}^d$, $f_i(x)$ in Eqn. (1.1) satisfies
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x\rangle + \frac{M}{2}\|y - x\|^2.$$

Local $\nu$-Strong Convexity. We say the problem is locally $\nu$-strongly convex if for all $i$ and $y, x \in \mathbb{R}^d$, $f_i(x)$ in Eqn. (1.1) satisfies
$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x\rangle + \frac{\nu}{2}\|y - x\|^2.$$

Based on the smoothness and strong convexity parameters, we define the global and local condition numbers of the objective function as
$$\kappa_g = \frac{L}{\mu} \quad\text{and}\quad \kappa_\ell = \frac{M}{\nu}. \qquad (3.3)$$
It is well known that
$$L \le M \quad\text{and}\quad \kappa_g \le \kappa_\ell. \qquad (3.4)$$
For many applications, $\kappa_\ell$ can be significantly larger than $\kappa_g$. Therefore it is desirable to investigate methods that depend on $\kappa_g$ instead of $\kappa_\ell$. In fact, some applications contain nonconvex individual functions $f_i(x)$, although the global $f(x)$ is convex. For such applications, $\kappa_\ell$ is not well-defined.
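The relation $\kappa_g \le \kappa_\ell$ in Eqn. (3.4), and the possibility that $\kappa_\ell$ is undefined while $\kappa_g$ is not, can be made concrete with a small numeric sketch. Everything below (the one-dimensional quadratics and their coefficients) is our own toy construction, not an example from the paper:

```python
import numpy as np

# Toy local functions f_i(x) = a_i * x**2 / 2 on R, so f_i'' = a_i and
# the global objective f = (1/m) * sum_i f_i has curvature mean(a).
a = np.array([10.0, 0.2])        # both f_i convex
M, nu = a.max(), a.min()         # local smoothness / strong convexity
L = mu = a.mean()                # global smoothness = strong convexity here
kappa_l = M / nu                 # local condition number  = 50
kappa_g = L / mu                 # global condition number = 1
print(kappa_g, kappa_l)

# With a = [10, -8], f_2 is nonconvex, so nu < 0 and kappa_l is not
# well-defined; yet mean(a) = 1 > 0, so f is still strongly convex and
# kappa_g = 1 -- the regime the global assumptions above still cover.
b = np.array([10.0, -8.0])
print(b.mean() > 0 and b.min() < 0)
```

Algorithms whose rates depend on $\kappa_\ell$ pay for the worst local curvature; the point of a $\kappa_g$-dependent rate is that only the averaged curvature matters.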
Let $W$ be the weight matrix associated with the network, indicating how agents are connected to each other. We assume that the weight matrix $W$ has the following properties:

1. $W$ is symmetric, with $W_{i,j} \neq 0$ if and only if agents $i$ and $j$ are connected or $i = j$.

2. $0 \preceq W \preceq I$, $W\mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I - W) = \mathrm{span}(\mathbf{1})$.

We use $I$ to denote the $m\times m$ identity matrix, and $\mathbf{1} = [1, \dots, 1]^\top \in \mathbb{R}^m$ denotes the vector of all ones. Many choices of the weight matrix $W$ satisfy the above properties, such as $W = I - \frac{\mathcal{L}}{\lambda_1(\mathcal{L})}$, where $\mathcal{L}$ is the Laplacian matrix associated with a weighted graph and $\lambda_1(\mathcal{L})$ is its largest eigenvalue.

The weight matrix has the important property that $W^\infty = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Xiao & Boyd, 2004). Thus, one can achieve the effect of averaging the local $x_i$ on the different agents by multiple rounds of local communication, where each round of local communication starts with a matrix $\mathbf{x}$ and results in the matrix $W\mathbf{x}$.

Algorithm 1 Mudag
Input: $\mathbf{y}_0 = \mathbf{x}_0 = \mathbf{y}_{-1} = \mathbf{1}\bar{x}_0^\top$, $\nabla F(\mathbf{y}_{-1}) = \mathbf{0}$, $\eta = \frac{1}{L}$, $\alpha = \sqrt{\frac{\mu}{L}}$, and $K = O\Big(\frac{1}{\sqrt{1-\lambda_2(W)}}\log(ML\kappa_g)\Big)$.
for $t = 0, \dots, T$ do
  $\mathbf{x}_{t+1} = \mathrm{FastMix}\big(\mathbf{y}_t + (\mathbf{x}_t - \mathbf{y}_{t-1}) - \eta(\nabla F(\mathbf{y}_t) - \nabla F(\mathbf{y}_{t-1})),\ K\big)$
  $\mathbf{y}_{t+1} = \mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1} - \mathbf{x}_t)$
end for
Output: $\bar{x}_T$.

Algorithm 2
FastMix
Input: $\mathbf{x}_0 = \mathbf{x}_{-1}$, $K$, $W$, step size $\eta_w = \frac{1-\sqrt{1-\lambda_2^2(W)}}{1+\sqrt{1-\lambda_2^2(W)}}$.
for $k = 0, \dots, K$ do
  $\mathbf{x}_{k+1} = (1+\eta_w)W\mathbf{x}_k - \eta_w\mathbf{x}_{k-1}$
end for
Output: $\mathbf{x}_K$.

Instead of directly multiplying by $W$ several times, Liu & Morse (2011) proposed the more efficient way to achieve averaging described in Algorithm 2, and this method is one pillar of our decentralized algorithm. Algorithm 2 has the following important property.

Proposition 1 (Proposition 3 of (Liu & Morse, 2011)). Let $\mathbf{x}_K$ be the output of Algorithm 2 and $\bar{x} = \frac{1}{m}\mathbf{1}^\top\mathbf{x}_0$. Then it holds that $\bar{x} = \frac{1}{m}\mathbf{1}^\top\mathbf{x}_K$ and
$$\big\|\mathbf{x}_K - \mathbf{1}\bar{x}\big\| \le \Big(1 - \sqrt{1-\lambda_2(W)}\Big)^K \big\|\mathbf{x}_0 - \mathbf{1}\bar{x}\big\|,$$
where $\lambda_2(W)$ is the second largest eigenvalue of $W$.

In this section, we propose a novel decentralized gradient descent algorithm achieving the optimal computation complexity and a near-optimal communication complexity. First, we provide the main idea behind our algorithm.
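A runnable sketch of FastMix (Algorithm 2) follows. The ring-network weight matrix $W = I - \mathcal{L}/\lambda_1(\mathcal{L})$ is our own toy instance of a matrix satisfying the conditions of Section 3.2 (an assumption, not the paper's setup), and the two checks mirror Proposition 1: the average is preserved while the consensus error contracts.

```python
import numpy as np

def fast_mix(x, W, K):
    # Sketch of Algorithm 2 (accelerated gossip averaging).
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]      # second largest eigenvalue
    s = np.sqrt(1.0 - lam2 ** 2)
    eta_w = (1.0 - s) / (1.0 + s)                  # step size from the text
    x_prev = x.copy()
    for _ in range(K):
        x, x_prev = (1.0 + eta_w) * (W @ x) - eta_w * x_prev, x
    return x

# Toy weight matrix for a ring of m agents: W = I - Lap / lambda_1(Lap).
m, d = 8, 3
A = np.zeros((m, m))
for i in range(m):
    A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
Lap = np.diag(A.sum(axis=1)) - A
W = np.eye(m) - Lap / np.linalg.eigvalsh(Lap).max()

x0 = np.random.default_rng(0).normal(size=(m, d))
xK = fast_mix(x0, W, K=30)
x_bar = x0.mean(axis=0)
print(np.allclose(xK.mean(axis=0), x_bar))                              # average preserved
print(np.linalg.norm(xK - x_bar) < 1e-3 * np.linalg.norm(x0 - x_bar))   # error contracted
```

Plain gossip ($\mathbf{x} \leftarrow W\mathbf{x}$) would need on the order of $1/(1-\lambda_2(W))$ rounds for the same accuracy; the momentum term cuts this to roughly $1/\sqrt{1-\lambda_2(W)}$, which is where the $\sqrt{1-\lambda_2(W)}$ factors in the complexity bounds come from.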
Our algorithm is based on multi-consensus, gradient tracking, and Nesterov's accelerated gradient descent. For the convenience of introducing the main intuition, we reformulate the algorithmic procedure of
Mudag (Algorithm 1) as follows:
$$\mathbf{x}_{t+1} = \mathrm{FastMix}(\mathbf{y}_t - \mathbf{s}_t, K) \qquad (4.1)$$
$$\mathbf{y}_{t+1} = \mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1} - \mathbf{x}_t) \qquad (4.2)$$
$$\mathbf{s}_{t+1} = \mathrm{FastMix}(\mathbf{s}_t, K) + \eta\big(\nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t)\big) - \big(\mathrm{FastMix}(\mathbf{y}_t, K) - \mathbf{y}_t\big), \qquad (4.3)$$
where $\eta$ is the step size and $K$ is the number of steps in multi-consensus. We will prove this reformulation in Lemma 1. We can observe that Eqn. (4.1) and (4.2) follow the algorithmic framework of Nesterov's accelerated gradient descent if $\mathbf{s}_t/\eta$ approximates the average gradient. In Eqn. (4.3), we track the gradient using history information and the gradient difference. Such an $\mathbf{s}_t/\eta$ can approximate the average gradient $\bar{g}_t$ (defined in Eqn. (3.2)) well. Furthermore, $\mathbf{y}_t$ can also approximate $\bar{y}_t$ well. Since $\bar{y}_t$ and $\bar{g}_t$ are well approximated, we obtain $\bar{g}_t \approx \nabla f(\bar{y}_t)$. Thus, the convergence properties of our algorithm are similar to those of the centralized Nesterov's accelerated gradient descent. This is the main idea behind our approach to decentralized optimization: we combine multi-consensus with gradient tracking to approximate the centralized Nesterov's accelerated gradient descent. As we will show, this seemingly simple idea leads to a near-optimal algorithm for decentralized optimization. Next, we describe in detail the three components of our approach: 'multi-consensus', 'gradient tracking', and 'approximation to the centralized algorithm'.

It is well known that the centralized Nesterov's accelerated gradient descent method achieves the optimal computation complexity $O(\sqrt{\kappa_g}\log\frac{1}{\epsilon})$. Our decentralized algorithm can also achieve this optimal convergence rate once it approximates the centralized Nesterov's accelerated gradient descent. To achieve such an approximation, we resort to multi-consensus and gradient tracking. Several works have tried to approximate the average gradient $\bar{g}_t$ and average variable $\bar{y}_t$ only by multi-consensus, and they require significant communication cost (Li et al., 2018; Jakovetić et al., 2014). This is the reason why existing multi-consensus algorithms cannot achieve the optimal communication complexity. Consequently, multi-consensus is regarded as communication-unfriendly (Qu & Li, 2019).

By combining multi-consensus with gradient tracking, we can obtain an accurate approximation of the average gradient and average variable with a constant number of multi-consensus steps. This means the proposed approach can approximate the centralized Nesterov's accelerated gradient descent well, and this critical observation leads to the optimal computation complexity and near-optimal communication complexity established in this paper. Furthermore, the idea of approximating the centralized algorithm brings several extra important benefits. First, our algorithm does not require each $f_i(x)$ to be strongly convex. Second, the computation and communication complexities of our algorithm depend on the global condition number $\kappa_g$ instead of $\kappa_\ell$.

In this work, we consider synchronized computation, where the computation complexity depends on the number of times that the gradient of $f(x)$ is computed. The communication complexity depends on the number of local communication rounds, each represented as $W\mathbf{x}$ in our algorithm. We now give the detailed computation and communication complexities of our algorithm in the following theorem.

Theorem 1.
Let $f(x)$ be $L$-smooth and $\mu$-strongly convex, and assume each $f_i(x)$ is $M$-smooth. Let $K$ satisfy
$$K = \sqrt{\frac{1}{1-\lambda_2(W)}}\,\log\frac{1}{\rho}, \quad\text{with}\quad \sqrt{\rho} \le \frac{\mu\alpha}{L}\cdot\min\left\{\frac{L}{M\Theta},\ \frac{L^2}{M^2\Theta}\right\},$$
where
$$\Theta = 1 + \frac{1}{\mu m}\cdot\frac{\|\nabla f(\bar{x}_0)-\nabla f(x^*)\|^2}{f(\bar{x}_0)-f(x^*)+\frac{\mu}{2}\|\bar{x}_0-x^*\|^2}.$$
Then it holds that
$$f(\bar{x}_T)-f(x^*) \le \Big(1-\frac{\alpha}{2}\Big)^T\Big(f(\bar{x}_0)-f(x^*)+\frac{\mu}{2}\|\bar{x}_0-x^*\|^2\Big).$$
To achieve $f(\bar{x}_T)-f(x^*) \le \epsilon$ and $\|\mathbf{x}_T-\mathbf{1}(x^*)^\top\|^2 = O(m\epsilon/\mu)$, the computation and communication complexities of Algorithm 1 are
$$T = O\Big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\Big) \quad\text{and}\quad Q = O\bigg(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\,\log(ML\kappa_g)\log\frac{1}{\epsilon}\bigg),$$
where each $O(\cdot)$ contains a universal constant.

Remark 1.
Theorem 1 shows that our algorithm achieves the same computation complexity as the centralized Nesterov's accelerated gradient descent. At the same time, the communication complexity matches the known lower bound for the decentralized optimization problem up to a factor of $\log(ML\kappa_g)$. We conjecture that it may be possible to remove the $\log(\kappa_g)$ factor, because this term comes only from the inequality $\|\bar{y}_t - x^*\| \le \sqrt{\frac{2}{\mu}V_t}$ ($V_t$ is defined in Eqn. (5.1)) in the proof, which may be loose.

Remark 2.
Theorem 1 only assumes that $f(x)$ is $\mu$-strongly convex and $L$-smooth, and that each $f_i(x)$ is $M$-smooth (note that, unlike in many previous works, our dependency on $M$ is only logarithmic). Thus, our algorithm can be used in cases where the $f_i(x)$ are non-convex. This kind of problem has been widely studied in recent years (Allen-Zhu, 2018; Garber et al., 2016), and one important example is fast PCA by the shift-and-invert method (Garber et al., 2016). In previous works such as (Scaman et al., 2017; Li et al., 2018, 2019; Qu & Li, 2019), each $f_i(x)$ is assumed to be convex or even strongly convex in order to prove linear convergence. Therefore, our algorithm has a wider range of applications.

Remark 3.
The computation and communication complexities of our algorithm depend linearly on $\sqrt{\kappa_g}$ rather than $\sqrt{\kappa_\ell}$. This result is new. In fact, before this work it was unknown whether there exists a decentralized algorithm that can achieve a communication complexity close to $O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ (Scaman et al., 2017).

| Methods | Computation complexity | Communication complexity | $f_i(x)$ convex? |
|---|---|---|---|
| Acc-DNGD (Qu & Li, 2019) | $O\big(\frac{\kappa_\ell^{5/7}}{(1-\lambda_2(W))^{1.5}}\log\frac{1}{\epsilon}\big)$ | $O\big(\frac{\kappa_\ell^{5/7}}{(1-\lambda_2(W))^{1.5}}\log\frac{1}{\epsilon}\big)$ | Yes |
| NIDS (Li et al., 2019) | $O\big(\max\{\kappa_\ell, \frac{1}{1-\lambda_2(W)}\}\log\frac{1}{\epsilon}\big)$ | $O\big(\max\{\kappa_\ell, \frac{1}{1-\lambda_2(W)}\}\log\frac{1}{\epsilon}\big)$ | Yes |
| ADA (Uribe et al., 2018) | $O\big(\frac{\kappa_\ell}{\sqrt{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | Yes |
| APM-C (Li et al., 2018) | $O\big(\sqrt{\kappa_\ell}\log\frac{1}{\epsilon}\big)$ | $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log^2\frac{1}{\epsilon}\big)$ | Yes |
| Our method | $O\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ | $O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log(ML\kappa_g)\log\frac{1}{\epsilon}\big)$ | No |
| Lower bound (Scaman et al., 2017) | $\Omega\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ | $\Omega\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | No |

Table 1: Complexity comparison between our algorithm and existing work for smooth and strongly convex problems. Each $f_i(x)$ is $M$-smooth; $f(x)$ is $L$-smooth and $\mu$-strongly convex; $W$ is the matrix describing the topology of the network.

Theorem 1 shows that Algorithm 1 achieves the optimal computation complexity and a near-optimal communication complexity. Before our work,
APM-C established a computation complexity of $O\big(\sqrt{\kappa_\ell}\log\frac{1}{\epsilon}\big)$ but with a communication complexity of $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log^2\frac{1}{\epsilon}\big)$, which does not match the lower bound $\Omega\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$. APM-C also resorts to Nesterov's acceleration and multi-consensus. Due to the lack of tracking of historic gradient information, more and more multi-consensus steps are required to obtain accurate estimates of the average gradients in
APM-C. This strategy is communication-inefficient and brings a large communication burden. Furthermore, Qu & Li (2017) proved that increasing approximation precision is needed when directly approximating the average gradient only by multi-consensus. This means that $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log^2\frac{1}{\epsilon}\big)$ is almost the best communication complexity that multi-consensus-based gradient approximation can achieve without gradient tracking.

The work most closely related to our algorithm is Acc-DNGD, proposed in (Qu & Li, 2019), which also utilizes Nesterov's acceleration and gradient tracking. However, there are several important differences between our algorithm and
Acc-DNGD. First, and importantly, the idea behind their algorithm is different from ours. Our algorithm tries to approximate the centralized Nesterov's accelerated gradient descent, so the convergence analysis of our algorithm is almost the same as the standard analysis of centralized Nesterov's accelerated gradient descent (Nesterov, 2018). Instead,
Acc-DNGD tries to 'imitate' the centralized Nesterov's accelerated gradient descent and uses the inexact Nesterov's gradient descent framework (Devolder et al., 2013). Second, in its implementation,
Acc-DNGD requires three single-consensus steps, compared to the one multi-consensus step of
Mudag. Third,
Mudag can achieve the optimal computation complexity and a near-optimal communication complexity. However,
Acc-DNGD achieves neither near-optimal computation complexity nor near-optimal communication complexity. Fourth, our algorithm does not require each individual function $f_i(x)$ to be convex, but only requires $f(x)$ to be strongly convex. On the other hand, the condition that each individual function $f_i(x)$ is convex is required by Acc-DNGD. In fact, we observe in the experiments (Section 6) that if some of the individual functions $f_i(x)$ are non-convex, then the performance of Acc-DNGD deteriorates, while the performance of our algorithm is not affected. Finally, the convergence rate of our algorithm depends on the global condition number $\kappa_g$, while that of Acc-DNGD depends on the local condition number $\kappa_\ell$. Since $\kappa_\ell$ is no smaller than $\kappa_g$, our algorithm has a better convergence guarantee.

Another closely related work is NIDS, which utilizes a combination of gradient tracking and single-consensus (Li et al., 2019). Comparing the algorithmic procedures of Algorithm 1,
NIDS (Li et al., 2019), and
EXTRA (Shi et al., 2015), we can regard
Mudag as the accelerated version of
NIDS or EXTRA, combined with multi-consensus. This can be clearly observed when we choose $\tilde{W} = \frac{I+W}{2}$ in NIDS and
EXTRA. Because of the lack of acceleration, the computation and communication complexities of
NIDS and
EXTRA are both suboptimal. In fact,
Acc-DNGD has a weaker dependency on the condition number $\kappa_\ell$ than NIDS. Moreover, unlike
NIDS, Algorithm 1 does not need to construct an extra consensus matrix $\tilde{W}$ to achieve its best performance: any matrix that satisfies the conditions described in Section 3.2 is suitable for Algorithm 1. Furthermore, the convergence analysis of Algorithm 1 is different from that of NIDS. NIDS is analyzed in the primal-dual framework, while our convergence analysis is in the (primal) Nesterov's accelerated gradient descent framework.

In Scaman et al. (2017), a lower bound on the communication complexity of the decentralized optimization problem was obtained, which is $\Omega\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ for strongly convex problems. A dual-based algorithm was proposed to match this lower bound. However, that method is only suitable for cases where the dual functions of the local agents are easy to compute. Hence, the computation complexity of the method in Scaman et al. (2017) severely deteriorates once the dual functions are computationally expensive to work with. Recently, Uribe et al. (2018) proposed an accelerated dual ascent algorithm which achieves the same communication complexity as Scaman et al. (2017), but with a computation complexity of $O\big(\frac{\kappa_\ell}{\sqrt{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$.

We can observe that our algorithm achieves the optimal computation complexity $O\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ with a near-optimal communication complexity $O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log(ML\kappa_g)\log\frac{1}{\epsilon}\big)$. Our complexity results depend on $\kappa_g$ instead of $\kappa_\ell$. Since $\kappa_\ell$ may be much larger than $\kappa_g$, an algorithm whose communication complexity depends on $\kappa_g$ is desirable. Therefore, our algorithm is preferable to previous works whose computation and communication complexities depend on the local condition number $\kappa_\ell$.

Table 1 presents a detailed comparison of our method with state-of-the-art decentralized optimization algorithms.
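To complement the comparisons above, here is a self-contained toy run of the reformulated update (4.1)-(4.3). The quadratic local functions, the ring network, and all constants below are our own assumptions rather than the paper's experiments; one local function is deliberately nonconvex, which the analysis permits as long as the average objective is strongly convex.

```python
import numpy as np

def fast_mix(x, W, K):
    # Accelerated gossip (sketch of Algorithm 2).
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]
    s = np.sqrt(1.0 - lam2 ** 2)
    eta_w = (1.0 - s) / (1.0 + s)
    x_prev = x.copy()
    for _ in range(K):
        x, x_prev = (1.0 + eta_w) * (W @ x) - eta_w * x_prev, x
    return x

# Ring network of m agents: W = I - Lap / lambda_1(Lap).
m, d = 8, 4
Adj = np.zeros((m, m))
for i in range(m):
    Adj[i, (i + 1) % m] = Adj[(i + 1) % m, i] = 1.0
Lap = np.diag(Adj.sum(axis=1)) - Adj
W = np.eye(m) - Lap / np.linalg.eigvalsh(Lap).max()

# Local quadratics f_i(x) = 0.5 * a_i * x^T D x - b_i^T x.  Since
# a[1] < 0, f_2 is nonconvex, but mean(a) = 1, so f = (1/m) sum_i f_i
# is strongly convex with Hessian D.
rng = np.random.default_rng(1)
D = np.diag(np.linspace(1.0, 10.0, d))
a = np.array([2.0, -0.4, 1.6, 0.8, 1.2, 0.6, 1.4, 0.8])
b = rng.normal(size=(m, d))
grad = lambda y: a[:, None] * (y @ D) - b   # row i is the gradient of f_i

mu, L = 1.0, 10.0                           # global strong convexity / smoothness
eta, alpha, K = 1.0 / L, np.sqrt(mu / L), 30
x = np.zeros((m, d))
y = np.zeros((m, d))
s = eta * grad(y)                           # s_0 = eta * grad F(y_0)
for t in range(400):
    x_new = fast_mix(y - s, W, K)                                     # (4.1)
    y_new = x_new + (1 - alpha) / (1 + alpha) * (x_new - x)           # (4.2)
    s = fast_mix(s, W, K) + eta * (grad(y_new) - grad(y)) \
        - (fast_mix(y, W, K) - y)                                     # (4.3)
    x, y = x_new, y_new

x_star = np.linalg.solve(D, b.mean(axis=0))  # minimizer of f
print(np.linalg.norm(x.mean(axis=0) - x_star) < 1e-5)   # agents reach x*
print(np.linalg.norm(x - x.mean(axis=0)) < 1e-4)        # and reach consensus
```

Despite the nonconvex local function, the averaged iterates follow the centralized accelerated recursion of Lemma 2 and converge linearly, which is the behavior Theorem 1 predicts.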
In Section 4.1, we introduced the main idea behind our algorithm, which is to approximate the centralized Nesterov's accelerated gradient descent (AGD) in a decentralized manner. In this section, we show how Algorithm 1 approximates AGD. (It holds that $\kappa_g = \Omega(\kappa_\ell)$ for the case used to prove the lower bound on communication complexity (Scaman et al., 2017).) Similar to the convergence analysis of centralized Nesterov's accelerated gradient descent, we first define the Lyapunov function
$$V_t = f(\bar{x}_t) - f(x^*) + \frac{\mu}{2}\|\bar{v}_t - x^*\|^2, \qquad (5.1)$$
where $\bar{v}_t$ is defined as
$$\bar{v}_t = \bar{x}_{t-1} + \frac{1}{\alpha}(\bar{x}_t - \bar{x}_{t-1}), \quad\text{with}\quad \alpha = \sqrt{\frac{\mu}{L}}. \qquad (5.2)$$
In the rest of this section, we show how the Lyapunov function $V_t$ converges and how multi-consensus and gradient tracking help us approximate AGD in the proposed algorithm. First, to obtain a clean convergence analysis of Algorithm 1, we rewrite its update rule in the following form.
Lemma 1.
The update procedure of Algorithm 1 can be represented as
$$\mathbf{x}_{t+1} = \mathrm{FastMix}(\mathbf{y}_t - \mathbf{s}_t, K) \qquad (5.3)$$
$$\mathbf{y}_{t+1} = \mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1} - \mathbf{x}_t) \qquad (5.4)$$
$$\mathbf{s}_{t+1} = \mathrm{FastMix}(\mathbf{s}_t, K) + \eta\big(\nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t)\big) - \big(\mathrm{FastMix}(\mathbf{y}_t, K) - \mathbf{y}_t\big), \qquad (5.5)$$
with $\mathbf{s}_0 = \eta\nabla F(\mathbf{y}_0)$.

Proof. For notational convenience, we use $\mathcal{T}(\mathbf{x})$ to denote the 'FastMix' operation on the matrix $\mathbf{x}$ used in Algorithm 1, that is, $\mathcal{T}(\mathbf{x}) \triangleq \mathrm{FastMix}(\mathbf{x}, K)$. Note that the operation $\mathcal{T}(\cdot)$ is linear.

Proving the reformulation amounts to showing that, given the reformulations of $\mathbf{x}_t$, $\mathbf{y}_t$, and $\mathbf{s}_t$ at iteration $t$, the reformulation of $\mathbf{x}_{t+1}$ holds at iteration $t+1$; our induction therefore focuses on $\mathbf{x}_{t+1}$. First, for $t = 0$ we have (using $\mathbf{x}_0 = \mathbf{y}_{-1}$ and $\nabla F(\mathbf{y}_{-1}) = \mathbf{0}$)
$$\mathbf{x}_1 = \mathcal{T}(\mathbf{y}_0 - \eta\nabla F(\mathbf{y}_0)) = \mathcal{T}(\mathbf{y}_0 - \mathbf{s}_0), \qquad (5.6)$$
which implies that $\mathbf{x}_1 - \mathbf{y}_0 = -\mathcal{T}(\mathbf{s}_0) + \mathcal{T}(\mathbf{y}_0) - \mathbf{y}_0$. Furthermore, we have
$$\mathbf{s}_1 = \mathcal{T}(\mathbf{s}_0) + \eta\big(\nabla F(\mathbf{y}_1) - \nabla F(\mathbf{y}_0)\big) - \big(\mathcal{T}(\mathbf{y}_0) - \mathbf{y}_0\big).$$
Thus, we obtain
$$\mathbf{x}_2 = \mathcal{T}\big(\mathbf{y}_1 + (\mathbf{x}_1 - \mathbf{y}_0) - \eta(\nabla F(\mathbf{y}_1) - \nabla F(\mathbf{y}_0))\big) = \mathcal{T}\big(\mathbf{y}_1 - \mathcal{T}(\mathbf{s}_0) - \eta(\nabla F(\mathbf{y}_1) - \nabla F(\mathbf{y}_0)) + \mathcal{T}(\mathbf{y}_0) - \mathbf{y}_0\big) = \mathcal{T}(\mathbf{y}_1 - \mathbf{s}_1),$$
which proves the claim for $t = 0$. Next, we show that if the result holds at the $t$-th iteration, it also holds at the $(t+1)$-th iteration. Assume $\mathbf{x}_{t+1} = \mathcal{T}(\mathbf{y}_t - \mathbf{s}_t)$, which implies $\mathbf{x}_{t+1} - \mathcal{T}(\mathbf{y}_t) = -\mathcal{T}(\mathbf{s}_t)$. Therefore, we obtain
$$\mathbf{x}_{t+2} = \mathcal{T}\big(\mathbf{y}_{t+1} + \mathbf{x}_{t+1} - \mathbf{y}_t - \eta(\nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t))\big) = \mathcal{T}\big(\mathbf{y}_{t+1} - \mathcal{T}(\mathbf{s}_t) - \eta(\nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t)) + \mathcal{T}(\mathbf{y}_t) - \mathbf{y}_t\big) = \mathcal{T}(\mathbf{y}_{t+1} - \mathbf{s}_{t+1}).$$
This proves the desired result.

We now show that $\bar{x}_t$, $\bar{y}_t$, $\bar{g}_t$ (defined in Eqn. (3.2) and generated by Algorithm 1) and $\bar{v}_t$ (defined in Eqn. (5.2)) fit into the framework of the centralized Nesterov's accelerated gradient descent.

Lemma 2.
Let $\bar{x}_t$, $\bar{y}_t$, $\bar{g}_t$ (defined in Eqn. (3.2)) be generated by Algorithm 1. Then they satisfy the following equalities:
$$\bar{x}_{t+1} = \bar{y}_t - \eta\bar{g}_t \qquad (5.7)$$
$$\bar{y}_{t+1} = \bar{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\bar{x}_{t+1} - \bar{x}_t) \qquad (5.8)$$
$$\bar{s}_{t+1} = \bar{s}_t + \eta\bar{g}_{t+1} - \eta\bar{g}_t = \eta\bar{g}_{t+1} \qquad (5.9)$$

Proof.
We first prove the last equality. By Proposition 1, we have $\frac{1}{m}\mathbf{1}^\top(\mathcal{T}(\mathbf{y}_t) - \mathbf{y}_t) = \bar{y}_t - \bar{y}_t = 0$. Thus, we obtain
$$\bar{s}_{t+1} = \bar{s}_t + \eta\bar{g}_{t+1} - \eta\bar{g}_t.$$
Furthermore, we prove $\bar{s}_t = \eta\bar{g}_t$ by induction. For $t = 0$, we use the fact that $\mathbf{s}_0 = \eta\nabla F(\mathbf{y}_0)$, so $\bar{s}_0 = \eta\bar{g}_0$. Assume $\bar{s}_t = \eta\bar{g}_t$ at time $t$. By the update equation, we have
$$\bar{s}_{t+1} = \bar{s}_t + \eta(\bar{g}_{t+1} - \bar{g}_t) = \eta\bar{g}_{t+1},$$
which gives the result at time $t+1$. The first two equations can be proved using Eqn. (5.9) and Proposition 1.

From Lemma 2, we observe that Eqns. (5.7)-(5.9) are almost the same as Nesterov's accelerated gradient descent (Nesterov, 2018). Thus, if $\bar{s}_t/\eta$ is an accurate estimate of $\nabla f(\bar{y}_t)$, then Algorithm 1 has convergence properties similar to those of AGD. Next, we show that $\mathbf{y}_t(i,:) \approx \bar{y}_t$ and $\mathbf{s}_t(i,:) \approx \bar{s}_t$; we have the following lemma.

Lemma 3. Let $z_t = \big[\|\mathbf{y}_t - \mathbf{1}\bar{y}_t^\top\|,\ \rho\|\mathbf{x}_t - \mathbf{1}\bar{x}_t^\top\|,\ M\eta\|\mathbf{s}_t - \mathbf{1}\bar{s}_t^\top\|\big]^\top$. Then it holds that
$$z_{t+1} \le A z_t + 4\sqrt{m}\,\big[0,\ 0,\ \sqrt{V_t/\mu}\big]^\top,$$
where $\rho$ and $A$ are defined as
$$\rho = \Big(1 - \sqrt{1-\lambda_2(W)}\Big)^K, \qquad A \triangleq \begin{pmatrix} \rho & \rho & \rho M\eta \\ \rho M\eta & \rho M\eta & \rho M\eta \\ \rho & \rho & \rho(1 + 2M\eta) \end{pmatrix}.$$
Furthermore, we have
$$z_{t+1} \le A^{t+1} z_0 + 4\sqrt{\frac{m}{\mu}}\cdot\sum_{i=0}^{t} A^{t-i}\,\big[0,\ 0,\ \sqrt{V_i}\big]^\top. \qquad (5.10)$$
If the spectral radius of $A$ is less than 1 and $V_t$ converges to zero, then $\|z_t\|$ converges to zero. Note that $\|\mathbf{y}_t - \mathbf{1}\bar{y}_t^\top\|$ and $M\eta\|\mathbf{s}_t - \mathbf{1}\bar{s}_t^\top\|$ are no larger than $\|z_t\|$, so they also converge to zero; that is, Algorithm 1 can approximate AGD well.

Next, we prove the two conditions that guarantee the convergence of $\|z_t\|$. In the following lemma, we give the properties of $A$ and show that its spectral radius is less than 1 if $\rho$ is small enough.

Lemma 4.
Matrix $A$ defined in Lemma 3 satisfies
$$0 < \lambda_1(A), \qquad |\lambda_3(A)| \le |\lambda_2(A)| < \lambda_1(A),$$
with $\lambda_i(A)$ being the $i$-th largest (in modulus) eigenvalue of $A$. Let $\eta = \frac{1}{L}$ and let $\rho$ satisfy the condition
$$\rho \le \frac{1}{(M^2\eta^2 + 6M\eta + 1)(3 + 2M\eta)};$$
then it holds that
$$\sqrt{\rho} < \lambda_1(A) \le \frac{1}{2},$$
and the eigenvector $v$ associated with $\lambda_1(A)$ is positive and its entries satisfy
$$v(1) \le \frac{v(3)}{2(7+2M\eta)}, \qquad v(2) \le \left(\frac{1}{2\sqrt{\rho}\,(7+2M\eta)} + \frac{M\eta}{\sqrt{\rho}}\right)v(3), \qquad 0 < v(3),$$
with $v(i)$ being the $i$-th entry of $v$.

Using Lemma 4, we are going to show that $V_t$ converges to zero; we have the following lemma.

Lemma 5. Assume that $\rho$ satisfies the conditions in Lemma 4 and that, with $\Theta = 1 + \frac{\mu}{m}\cdot\frac{\|z_0\|^2}{V_0}$,
$$\sqrt{\rho} \le \frac{\mu\alpha}{L}\cdot\min\left\{\frac{L}{M\Theta},\ \frac{L^2}{M^2\Theta}\right\}.$$
Assuming also that $\alpha \le \frac{1}{2}$, Algorithm 1 has the following convergence rate:
$$V_{t+1} \le \left(1 - \frac{\alpha}{2}\right)^{t+1} V_0. \tag{5.11}$$

Lemma 5 shows that $V_t$ will converge to 0 at the rate $1 - \frac{\alpha}{2}$. Thus, we can directly obtain the computation complexity. Furthermore, to achieve the conditions on $\rho$, we upper bound the number of multi-consensus steps $K$ by Proposition 1. Combining this with the computation complexity, we can obtain the total communication complexity. Now, we give the detailed proof of Theorem 1.

Proof of Theorem 1.
It is easy to check that $\rho$ satisfies the conditions required in Lemmas 4 and 5. By Eqn. (5.11), we have
$$f(\bar{x}_T) - f(x^*) \le \left(1 - \frac{1}{2}\sqrt{\frac{\mu}{L}}\right)^T\left(f(\bar{x}_0) - f(x^*) + \frac{\mu}{2}\|\bar{x}_0 - x^*\|^2\right) \le \exp\left(-\frac{T}{2}\sqrt{\frac{\mu}{L}}\right)\left(f(\bar{x}_0) - f(x^*) + \frac{\mu}{2}\|\bar{x}_0 - x^*\|^2\right).$$
Thus, in order to achieve $f(\bar{x}_T) - f(x^*) \le \epsilon$, $T$ only needs to be
$$T = 2\sqrt{\kappa_g}\,\log\frac{f(\bar{x}_0) - f(x^*) + \frac{\mu}{2}\|\bar{x}_0 - x^*\|^2}{\epsilon} = \mathcal{O}\left(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\right).$$
Furthermore, by Lemma 3, we have
$$\|x_T - \bar{x}_T\| \le \rho\, z_T(2) \le \rho\, v(2)\cdot\left(\sqrt{1-\frac{\alpha}{2}}\right)^{T}\left(4\sqrt{\frac{2m}{\mu}}\sqrt{V_0} + \|z_0\|\right) \le \sqrt{\frac{m\epsilon}{\mu}},$$
where the last inequality is because of the condition on $\rho$ in Lemma 4. Therefore, we have
$$\|x_T - \mathbf{1}(x^*)^\top\| \le \|x_T - \bar{x}_T\| + \|\mathbf{1}(\bar{x}_T - x^*)^\top\| \le \sqrt{\frac{m\epsilon}{\mu}} + 4\sqrt{\frac{m\epsilon}{\mu}} = \mathcal{O}\left(\sqrt{\frac{m\epsilon}{\mu}}\right),$$
where the second inequality is due to the $\mu$-strong convexity of $f(x)$.

The bound on $K$ can be obtained by Proposition 1. Combining with the computation complexity,
we can obtain the total communication complexity as
$$Q = \mathcal{O}\left(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\,\log\left(\frac{M}{L}\kappa_g\right)\log\frac{1}{\epsilon}\right). \qquad\Box$$

Figure 1: Comparisons with logistic regression and random networks. Each $f_i(x)$ is strongly convex ($\sigma_i = 0.001$ in the top row, and a smaller $\sigma_i$ in the bottom row), with $1-\lambda_2(W) = 0.81$ in the left two columns and $1-\lambda_2(W) = 0.05$ in the right two columns. (Each panel plots the objective gap, on a log scale, against the number of gradient computations or the number of communications for AGD, APM-C, EXTRA, NIDS, Acc-DNGD, and Mudag.)

6 Experiments

In the previous section, we presented a theoretical analysis of our algorithm. In this section, we provide empirical studies. We evaluate the performance of our algorithm using logistic regression with different settings, including the situation that each $f_i(x)$ is strongly convex and the situation that some $f_i(x)$ may be non-convex.

6.1 Experimental Settings

In our experiments, we consider random networks where each pair of agents has a connection with probability $p$, and we set $W = I - \frac{L}{\lambda_1(L)}$, where $L$ is the Laplacian matrix associated with the graph and $\lambda_1(L)$ is the largest eigenvalue of $L$. We set $m = 100$, that is, there are 100 agents in this network. By the properties of the well-known Erdős–Rényi random graph, when $p$ is of order $\log m / m$, the random graph is connected with high probability and $1-\lambda_2(W) = \mathcal{O}(1)$. In our experiments, we test the performance with two choices of $p$, leading to $1-\lambda_2(W) = 0.05$ and $1-\lambda_2(W) = 0.81$, respectively.
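To make the network construction concrete, the following sketch (not the authors' code; the edge probability `p` and the random seed are illustrative assumptions) builds an Erdős–Rényi graph, forms $W = I - L/\lambda_1(L)$, and reports the spectral gap $1-\lambda_2(W)$:

```python
import numpy as np

def mixing_matrix(m=100, p=0.1, seed=0):
    """Build W = I - L / lambda_1(L) for an Erdos-Renyi graph G(m, p)."""
    rng = np.random.default_rng(seed)
    # Symmetric adjacency: each pair of agents connected with probability p.
    upper = np.triu(rng.random((m, m)) < p, k=1)
    adj = (upper | upper.T).astype(float)
    lap = np.diag(adj.sum(axis=1)) - adj      # graph Laplacian L
    lam_max = np.linalg.eigvalsh(lap)[-1]     # largest eigenvalue of L
    return np.eye(m) - lap / lam_max          # symmetric, rows sum to 1

W = mixing_matrix()
gap = 1.0 - np.sort(np.linalg.eigvalsh(W))[-2]   # spectral gap 1 - lambda_2(W)
```

Since $L\mathbf{1} = 0$, the resulting $W$ is symmetric and doubly stochastic; denser graphs (larger `p`) give a larger spectral gap and hence cheaper consensus.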
6.2 Experiments on Logistic Regression
The individual objective function of logistic regression is defined as
$$f_i(x) = \frac{1}{n}\sum_{j=1}^{n}\log\left[1 + \exp(-b_j\langle a_j, x\rangle)\right] + \frac{\sigma_i}{2}\|x\|^2,$$
where $a_j \in \mathbb{R}^d$ is the $j$-th input vector and $b_j \in \{-1, 1\}$ is the corresponding label. We conduct our experiments on the real-world dataset 'a9a', which can be downloaded from the LIBSVM datasets. We set $n = 325$ and $d = 123$.

To test the performance of our algorithm on strongly convex objectives, we set $\sigma_i = 10^{-3}$, and also a smaller value of $\sigma_i$, for $i = 1,\dots,m$ to control the condition number of the objective function $f(x)$. Furthermore, we also test the performance of our algorithm in the case that some $f_i(x)$ may be non-convex while $f(x)$ is strongly convex. Here, we set $\sigma_i < 0$ for agents $i = 1,\dots,m-1$ and $\sigma_m = 1$ for agent $m$, with the negative values chosen so that the condition number of $f(x)$ is the same as in the setting with $\sigma_i = 10^{-3}$ for all $i = 1,\dots,m$. We also consider the analogous setting with $\sigma_i < 0$ for agents $i = 1,\dots,m-1$ and $\sigma_m = 10$ for agent $m$, in which the condition number of $f(x)$ matches the one obtained with the smaller uniform $\sigma_i$.

We compare our algorithm (Mudag) to centralized accelerated gradient descent (AGD) (Nesterov, 2018), EXTRA (Shi et al., 2015), NIDS (Li et al., 2019), Acc-DNGD (Qu & Li, 2019), and APM-C (Li et al., 2018). We do not compare our algorithm to dual-based algorithms such as the accelerated dual ascent algorithm (Uribe et al., 2018; Scaman et al., 2017), because these algorithms cannot be applied to the case where some functions $f_i(x)$ are non-convex. The step sizes of all algorithms are well tuned to achieve their best performance. Furthermore, we set the momentum coefficient to $\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$ for Mudag, AGD, and APM-C. We initialize $x_0$ at $\mathbf{0}$ for all the compared methods.

For the setting in which each $f_i(x)$ is strongly convex, we report the experimental results in Figure 1. First, our algorithm has almost the same computation cost as AGD, which matches our theoretical analysis. Assuming that AGD communicates once per iteration, we can also see that the communication cost of Mudag is almost the same as that of AGD when $1-\lambda_2(W) = 0.81$, and about six times that of AGD when $1-\lambda_2(W) = 0.05$. This matches the communication complexity of our algorithm. Furthermore, our algorithm achieves both lower computation cost and lower communication cost than the other decentralized algorithms in all settings. The advantages are more obvious when the regularization parameters $\sigma_i$ are small, which validates the theoretical comparison in Section 4.3.

We report the results of the experiments in which the individual functions $f_i(x)$ can be non-convex while $f(x)$ is strongly convex in Figure 2. Note that the settings of the experiments reported in Figure 1 and Figure 2 are the same except for the non-convexity of some of the $f_i(x)$'s in Figure 2. Comparing the curves in these two figures, we can observe that the computation costs of AGD and our algorithm are not affected by the non-convexity of the $f_i(x)$, because their convergence rates only depend on $\sqrt{\kappa_g}$. On the other hand, the communication cost of our algorithm increases slightly compared to the setting where each $f_i(x)$ is convex. This is because the ratio $M/L$ increases when we set $\sigma_i < 0$ for agents $i = 1,\dots,m-1$. Our communication complexity theory
shows that $M/L$ affects the communication cost through a $\log(M/L)$ factor. Compared to our algorithm, the performance of the other decentralized algorithms deteriorates greatly, which can be clearly observed by comparing the top-right panels of Figure 1 and Figure 2. When some of the $f_i(x)$ are non-convex, EXTRA, NIDS, and Acc-DNGD perform rather poorly.

Figure 2: Comparisons with logistic regression and random networks. Each local objective $f_i(x)$ may be non-convex. In the top row, $\sigma_i < 0$ for agents $i = 1,\dots,m-1$ and $\sigma_m = 1$ for agent $m$; in the bottom row, $\sigma_i < 0$ for agents $i = 1,\dots,m-1$ and $\sigma_m = 10$ for agent $m$. Random networks have $1-\lambda_2(W) = 0.81$ in the left two columns and $1-\lambda_2(W) = 0.05$ in the right two columns. (Each panel plots the objective gap, on a log scale, against the number of gradient computations or the number of communications for AGD, APM-C, EXTRA, NIDS, Acc-DNGD, and Mudag.)
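To make the experimental objective above concrete, here is a minimal sketch of one agent's logistic-regression loss and gradient (illustrative only: tiny synthetic data stands in for 'a9a', and the $\frac{\sigma_i}{2}\|x\|^2$ scaling of the regularizer is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma_i = 32, 5, 1e-3            # synthetic stand-in for one agent's shard
A_i = rng.standard_normal((n, d))      # rows are the input vectors a_j
b_i = np.where(rng.random(n) < 0.5, -1.0, 1.0)   # labels b_j in {-1, +1}

def local_loss(x):
    """f_i(x) = (1/n) sum_j log(1 + exp(-b_j <a_j, x>)) + (sigma_i/2) ||x||^2."""
    z = -b_i * (A_i @ x)
    return np.mean(np.log1p(np.exp(z))) + 0.5 * sigma_i * x @ x

def local_grad(x):
    """Gradient of local_loss with respect to x."""
    z = -b_i * (A_i @ x)
    coeff = -b_i / (1.0 + np.exp(-z))  # -b_j * sigmoid(-b_j <a_j, x>)
    return A_i.T @ coeff / n + sigma_i * x

x0 = rng.standard_normal(d)
g = local_grad(x0)
```

With $\sigma_i < 0$ this local objective becomes non-convex, while the network average $f(x)$ can remain strongly convex, which is exactly the regime of Figure 2.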
7 Conclusion

In this paper, we proposed a novel decentralized algorithm called Mudag. Our method can achieve the optimal computation complexity together with a near optimal communication complexity, matching the lower bound up to a $\log\left(\frac{M}{L}\kappa_g\right)$ factor. This is the best communication complexity that primal-based decentralized algorithms have achieved.

Our results provide an affirmative answer to the open problem of whether there is a decentralized algorithm that can achieve the communication complexity $\mathcal{O}\left(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\right)$, or one close to this lower bound, for a strongly convex objective function. Furthermore, our algorithm does not require each individual function $f_i(x)$ to be convex, whereas this requirement is necessary for existing decentralized algorithms. In fact, our experiments showed that the non-convexity of the individual functions $f_i(x)$ can degrade their performance. Our new algorithm therefore has a wider range of applications in machine learning, because the individual functions $f_i(x)$ may not be convex in many machine learning problems.

Our analysis also implies that, for decentralized optimization, multi-consensus and gradient tracking can be combined to closely approximate the corresponding centralized method. The resulting methods are simple and effective, with near optimal complexities. This point of view may also provide useful insights for developing new decentralized algorithms in other situations.

References
Allen-Zhu, Z. (2018). Katyusha X: Simple momentum method for stochastic sum-of-nonconvex optimization. In International Conference on Machine Learning (pp. 179–185).

Bullo, F., Cortes, J., & Martinez, S. (2009). Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, volume 27. Princeton University Press.

Devolder, O., Glineur, F., Nesterov, Y., et al. (2013). First-order methods with inexact oracle: the strongly convex case. CORE Discussion Papers, 2013016, 47.

Erseghe, T., Zennaro, D., Dall'Anese, E., & Vangelista, L. (2011). Fast consensus by the alternating direction multipliers method. IEEE Transactions on Signal Processing, 59(11), 5523–5537.

Garber, D., Hazan, E., Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., & Sidford, A. (2016). Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML.

He, C., Tan, C., Tang, H., Qiu, S., & Liu, J. (2019). Central server free federated learning over single-sided trust social networks. arXiv preprint arXiv:1910.04956.

Hong, M., Hajinezhad, D., & Zhao, M.-M. (2017). Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In Proceedings of the 34th International Conference on Machine Learning (pp. 1529–1538). JMLR.org.

Horn, R. A. & Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Jakovetić, D., Xavier, J., & Moura, J. M. (2014). Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5), 1131–1146.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

Khan, U. A., Kar, S., & Moura, J. M. (2009). DILAND: An algorithm for distributed sensor localization with noisy distance measurements. IEEE Transactions on Signal Processing, 58(3), 1940–1947.

Lan, G., Lee, S., & Zhou, Y. (2018). Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, (pp. 1–48).

Lan, G. & Monteiro, R. D. (2013). Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming, 138(1-2), 115–139.

Li, H., Fang, C., Yin, W., & Lin, Z. (2018). A sharp convergence rate analysis for distributed accelerated gradient methods. arXiv preprint arXiv:1810.01053.

Li, Z., Shi, W., & Yan, M. (2019). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17), 4494–4506.

Liu, J. & Morse, A. S. (2011). Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2), 160–165.

Lopes, C. G. & Sayed, A. H. (2008). Diffusion least-mean squares over adaptive networks: Formulation and performance analysis. IEEE Transactions on Signal Processing, 56(7), 3122–3136.

Mokhtari, A., Ling, Q., & Ribeiro, A. (2016). Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1), 146–161.

Mokhtari, A. & Ribeiro, A. (2016). DSA: Decentralized double stochastic averaging gradient algorithm. The Journal of Machine Learning Research, 17(1), 2165–2199.

Nedic, A. & Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48–61.

Nesterov, Y. (2018). Lectures on Convex Optimization, volume 137. Springer.

Qu, G. & Li, N. (2017). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245–1260.

Qu, G. & Li, N. (2019). Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control.

Rabbat, M. & Nowak, R. (2004). Distributed optimization in sensor networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks (pp. 20–27). ACM.

Ribeiro, A. (2010). Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12), 6369–6386.

Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., & Massoulié, L. (2017). Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning (pp. 3027–3036). JMLR.org.

Scaman, K., Bach, F., Bubeck, S., Massoulié, L., & Lee, Y. T. (2018). Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems (pp. 2740–2749).

Shi, W., Ling, Q., Wu, G., & Yin, W. (2015). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944–966.

Shi, W., Ling, Q., Yuan, K., Wu, G., & Yin, W. (2014). On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7), 1750–1761.

Terelius, H., Topcu, U., & Murray, R. M. (2011). Decentralized multi-agent optimization via dual decomposition. IFAC Proceedings Volumes, 44(1), 11245–11251.

Tsianos, K. I., Lawlor, S., & Rabbat, M. G. (2012). Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning. In (pp. 1543–1550). IEEE.

Uribe, C. A., Lee, S., Gasnikov, A., & Nedić, A. (2018). A dual approach for optimal algorithms in distributed optimization over networks. arXiv preprint arXiv:1809.00710.

Xiao, L. & Boyd, S. (2004). Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1), 65–78.

Xu, J., Zhu, S., Soh, Y. C., & Xie, L. (2015). Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In (pp. 2055–2060). IEEE.

Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835–1854.
A Proof of Proposition 1
Proof of Proposition 1.
By the update rule of Algorithm 2 and the fact that $W^\infty = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Xiao & Boyd, 2004), we have
$$W^\infty x^{K} = W^\infty x^{K-1} + \eta_w\left(W^\infty x^{K-1} - W^\infty x^{K-2}\right).$$
We can obtain that
$$W^\infty\left(x^{K} - x^{K-1}\right) = \eta_w\left(W^\infty x^{K-1} - W^\infty x^{K-2}\right).$$
Noting that $x^{0} = x^{-1}$ in Algorithm 2, we can obtain by induction that for any $k = 0, \dots, K$,
$$W^\infty\left(x^{k} - x^{k-1}\right) = 0.$$
Therefore, we can obtain the identity $W^\infty x^{K} = W^\infty x^{0}$, which implies the result. The convergence rate of Algorithm 2 can be found in (Liu & Morse, 2011). $\Box$

B Several Important Lemmas
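Before turning to the lemmas, the multi-consensus recursion analyzed in Proposition 1, $x^{k+1} = (1+\eta_w)Wx^k - \eta_w x^{k-1}$ with $x^{0} = x^{-1}$, can be sketched as follows. This is illustrative code, not the authors': the choice of $\eta_w$ follows the accelerated-averaging literature (Liu & Morse, 2011), and the ring network is an arbitrary example.

```python
import numpy as np

def fast_mix(x, W, K):
    """Multi-consensus: x_{k+1} = (1 + eta_w) W x_k - eta_w x_{k-1}, x_0 = x_{-1}.

    eta_w below is one common choice from accelerated averaging; the exact
    constant used by Algorithm 2 may differ.
    """
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]          # second-largest eigenvalue
    s = np.sqrt(1.0 - lam2 ** 2)
    eta_w = (1.0 - s) / (1.0 + s)
    x_prev = x.copy()
    for _ in range(K):
        x, x_prev = (1.0 + eta_w) * (W @ x) - eta_w * x_prev, x
    return x

# Illustrative ring of m agents with simple symmetric weights.
m = 20
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = 0.5
    W[i, (i - 1) % m] = 0.25
    W[i, (i + 1) % m] = 0.25

x0 = np.random.default_rng(1).standard_normal(m)
xK = fast_mix(x0, W, K=50)
```

The average is preserved exactly at every step, which is the identity $W^\infty x^{K} = W^\infty x^{0}$ proved above, while the deviation from the average shrinks at an accelerated rate.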
We also have the following important properties of the update rule.

Lemma 6.
We have the following inequalities:
$$\|\nabla F(y) - \nabla F(x)\| \le M\|y - x\|, \tag{B.1}$$
$$\|\bar{g}_t - \nabla f(\bar{y}_t)\| \le \frac{M}{\sqrt{m}}\|y_t - \bar{y}_t\|. \tag{B.2}$$

Proof. The first inequality is because $F(x)$ is $M$-smooth:
$$\|\nabla F(y) - \nabla F(x)\| = \sqrt{\sum_{i=1}^{m}\|\nabla f_i(y(i,:)) - \nabla f_i(x(i,:))\|^2} \le \sqrt{M^2\sum_{i=1}^{m}\|y(i,:) - x(i,:)\|^2} = M\|y - x\|.$$
The second inequality follows from
$$\|\bar{g}_t - \nabla f(\bar{y}_t)\| = \left\|\frac{1}{m}\sum_{i=1}^{m}\left[\nabla f_i(y_t(i,:)) - \nabla f_i(\bar{y}_t)\right]\right\| \le \frac{M}{m}\sum_{i=1}^{m}\|y_t(i,:) - \bar{y}_t\| \le M\sqrt{\frac{1}{m}\sum_{i=1}^{m}\|y_t(i,:) - \bar{y}_t\|^2} = \frac{M}{\sqrt{m}}\|y_t - \bar{y}_t\|. \qquad\Box$$

By the convergence rate of accelerated gradient descent, we have the following property.
Lemma 7.
Let $\bar{x}_t$, $\bar{y}_t$, $\bar{g}_t$ (defined in Eqn. (3.2)) and $\bar{v}_t$ (defined in Eqn. (5.2)) be generated by Algorithm 1. Then they satisfy the following equalities:
$$\bar{v}_{t+1} = (1-\alpha)\bar{v}_t + \alpha\bar{y}_t - \frac{\eta}{\alpha}\bar{g}_t, \tag{B.3}$$
$$\bar{y}_{t+1} = \frac{\bar{x}_{t+1} + \alpha\bar{v}_{t+1}}{1+\alpha}. \tag{B.4}$$

Proof. We first prove the second equality. Substituting Eqn. (5.2) into (B.4), we can obtain
$$\frac{\bar{x}_{t+1} + \alpha\bar{v}_{t+1}}{1+\alpha} = \frac{\bar{x}_{t+1} + \alpha\left(\bar{x}_t + \frac{1}{\alpha}(\bar{x}_{t+1} - \bar{x}_t)\right)}{1+\alpha} = \bar{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\bar{x}_{t+1} - \bar{x}_t) = \bar{y}_{t+1}.$$

We now prove the first equality. First, by Eqn. (B.4), we can obtain $\bar{y}_t - \bar{x}_t = \alpha(\bar{v}_t - \bar{y}_t)$. Then, using $\bar{s}_t = \eta\bar{g}_t$ from Lemma 2, we have
$$(1-\alpha)\bar{v}_t + \alpha\bar{y}_t - \frac{\eta}{\alpha}\bar{g}_t = \bar{v}_t - \alpha(\bar{v}_t - \bar{y}_t) - \frac{1}{\alpha}\bar{s}_t = \bar{x}_t + \bar{v}_t - \bar{y}_t - \frac{1}{\alpha}\bar{s}_t = \bar{x}_t + \frac{1}{\alpha}\left(\bar{y}_t - \bar{x}_t - \bar{s}_t\right) = \bar{x}_t + \frac{1}{\alpha}(\bar{x}_{t+1} - \bar{x}_t) = \bar{v}_{t+1},$$
where the last step uses Eqn. (5.7) together with $\bar{s}_t = \eta\bar{g}_t$. $\Box$

Lemma 8.
Consider $y_t$ in Algorithm 1; then $\bar{y}_t$ satisfies
$$\|\bar{y}_t - x^*\| \le \sqrt{\frac{2V_t}{\mu}}. \tag{B.5}$$

Proof. By Eqn. (B.4), we have
$$\|\bar{y}_t - x^*\| = \frac{1}{1+\alpha}\left\|\bar{x}_t + \alpha\bar{v}_t - (1+\alpha)x^*\right\| \le \frac{1}{1+\alpha}\left(\|\bar{x}_t - x^*\| + \alpha\|\bar{v}_t - x^*\|\right) \le \frac{1}{1+\alpha}\left(\sqrt{\frac{2V_t}{\mu}} + \alpha\sqrt{\frac{2V_t}{\mu}}\right) = \sqrt{\frac{2V_t}{\mu}}.$$
The last inequality is because of the $\mu$-strong convexity of $f(x)$ and the definition of $V_t$. $\Box$

C Proof of Lemma 3
Proof of Lemma 3.
By the update step of $y_{t+1}$ in Algorithm 1, we have
$$\|y_{t+1} - \bar{y}_{t+1}\| \le \frac{2}{1+\alpha}\|x_{t+1} - \bar{x}_{t+1}\| + \frac{1-\alpha}{1+\alpha}\|x_t - \bar{x}_t\|.$$
Furthermore, by Eqn. (5.3), we have
$$\frac{1}{\rho}\|x_{t+1} - \bar{x}_{t+1}\| \le \|y_t - \bar{y}_t\| + M\eta\cdot\frac{1}{M\eta}\|s_t - \bar{s}_t\|.$$
Therefore, we can obtain
$$\|y_{t+1} - \bar{y}_{t+1}\| \le \frac{2\rho}{1+\alpha}\|y_t - \bar{y}_t\| + \frac{1-\alpha}{1+\alpha}\|x_t - \bar{x}_t\| + \frac{2\rho M\eta}{1+\alpha}\cdot\frac{1}{M\eta}\|s_t - \bar{s}_t\| \le 2\rho\|y_t - \bar{y}_t\| + \rho\cdot\frac{1}{\rho}\|x_t - \bar{x}_t\| + 2\rho M\eta\cdot\frac{1}{M\eta}\|s_t - \bar{s}_t\|.$$
Furthermore, by Eqn. (5.5), we have
$$\|s_{t+1} - \bar{s}_{t+1}\| \le \rho\|s_t - \bar{s}_t\| + \rho\eta\left\|\nabla F(y_{t+1}) - \nabla F(y_t)\right\| \le \rho\|s_t - \bar{s}_t\| + \rho M\eta\|y_{t+1} - y_t\|,$$
where the first inequality is because, for any matrix $x \in \mathbb{R}^{m\times d}$,
$$\|x - \bar{x}\|^2 = \sum_{j=1}^{m}\left\|x(j,:) - \frac{1}{m}\sum_{i=1}^{m} x(i,:)\right\|^2 = \|x\|^2 - \frac{2}{m}\sum_{i=1}^{m}\sum_{j=1}^{m}\langle x(j,:), x(i,:)\rangle + \frac{1}{m}\left\|\sum_{i=1}^{m} x(i,:)\right\|^2 = \|x\|^2 - \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{m}\langle x(j,:), x(i,:)\rangle \le \|x\|^2,$$
and the last inequality is because each $f_i(x)$ is $M$-smooth.

By the update rule of $y_{t+1}$ and the linearity of $\mathcal{T}(\cdot)$, we have
$$\|y_{t+1} - y_t\| = \left\|\frac{2}{1+\alpha}\mathcal{T}(y_t - s_t) - \frac{1-\alpha}{1+\alpha}x_t - y_t\right\| \le \frac{2}{1+\alpha}\|\mathcal{T}(y_t) - y_t\| + \frac{1-\alpha}{1+\alpha}\|y_t - x_t\| + \frac{2}{1+\alpha}\|\mathcal{T}(s_t)\|$$
$$\le \frac{4}{1+\alpha}\|y_t - \bar{y}_t\| + \frac{1-\alpha}{1+\alpha}\left(\|x_t - \bar{x}_t\| + \|y_t - \bar{y}_t\| + \sqrt{m}\,\|\bar{y}_t - x^*\| + \sqrt{m}\,\|\bar{x}_t - x^*\|\right) + \frac{2}{1+\alpha}\left(\|\mathcal{T}(s_t) - \bar{s}_t\| + \sqrt{m}\,\|\bar{s}_t\|\right)$$
$$\overset{(5.9)}{\le} \frac{5}{1+\alpha}\|y_t - \bar{y}_t\| + \frac{2\rho}{1+\alpha}\|s_t - \bar{s}_t\| + \frac{1-\alpha}{1+\alpha}\|x_t - \bar{x}_t\| + \frac{1-\alpha}{1+\alpha}\sqrt{m}\left(\|\bar{y}_t - x^*\| + \|\bar{x}_t - x^*\|\right) + \frac{2\eta\sqrt{m}}{1+\alpha}\|\bar{g}_t\|,$$
where the second inequality uses
$$\|\mathcal{T}(y_t) - y_t\| = \|\mathcal{T}(y_t) - \bar{y}_t + \bar{y}_t - y_t\| \le (1+\rho)\|y_t - \bar{y}_t\| \le 2\|y_t - \bar{y}_t\|.$$
Furthermore, by Eqn. (B.2), we have
$$\|\bar{g}_t\| \le \|\bar{g}_t - \nabla f(\bar{y}_t)\| + \|\nabla f(\bar{y}_t)\| \le \frac{M}{\sqrt{m}}\|y_t - \bar{y}_t\| + \|\nabla f(\bar{y}_t)\|.$$
Combining the above bounds and using $0 < \alpha \le 1$, $\eta = \frac{1}{L}$, $L \le M$, and the condition on $\rho$, we can obtain
$$\frac{1}{M\eta}\|s_{t+1} - \bar{s}_{t+1}\| \le \rho(1 + 2\rho M\eta)\cdot\frac{1}{M\eta}\|s_t - \bar{s}_t\| + (7 + 2M\eta)\|y_t - \bar{y}_t\| + \rho\cdot\frac{1}{\rho}\|x_t - \bar{x}_t\| + \sqrt{m}\left(\|\bar{y}_t - x^*\| + \|\bar{x}_t - x^*\|\right) + 2\eta\sqrt{m}\,\|\nabla f(\bar{y}_t)\|.$$
Furthermore, we have
$$\sqrt{m}\left(\|\bar{y}_t - x^*\| + \|\bar{x}_t - x^*\|\right) + 2\eta\sqrt{m}\,\|\nabla f(\bar{y}_t)\| \le \sqrt{m}\,\|\bar{y}_t - x^*\| + \sqrt{m}\,\|\bar{x}_t - x^*\| + 2L\eta\sqrt{m}\,\|\bar{y}_t - x^*\| = 3\sqrt{m}\,\|\bar{y}_t - x^*\| + \sqrt{m}\,\|\bar{x}_t - x^*\| \overset{(B.5)}{\le} 3\sqrt{m}\sqrt{\frac{2V_t}{\mu}} + \sqrt{m}\,\|\bar{x}_t - x^*\| \le 4\sqrt{\frac{2mV_t}{\mu}}.$$
The first inequality is because of the $L$-smoothness of $f(x)$; the equality uses the step size $\eta = \frac{1}{L}$; and the last inequality is due to the $\mu$-strong convexity of $f(x)$. Thus, we can obtain
$$\frac{1}{M\eta}\|s_{t+1} - \bar{s}_{t+1}\| \le \rho(1 + 2\rho M\eta)\cdot\frac{1}{M\eta}\|s_t - \bar{s}_t\| + (7 + 2M\eta)\|y_t - \bar{y}_t\| + \rho\cdot\frac{1}{\rho}\|x_t - \bar{x}_t\| + 4\sqrt{\frac{2mV_t}{\mu}}.$$
Letting
$$z_{t+1} = \left[\|y_{t+1} - \bar{y}_{t+1}\|,\ \frac{1}{\rho}\|x_{t+1} - \bar{x}_{t+1}\|,\ \frac{1}{M\eta}\|s_{t+1} - \bar{s}_{t+1}\|\right]^\top,$$
we obtain $z_{t+1} \le A z_t + 4\left[0,\ 0,\ \sqrt{2mV_t/\mu}\right]^\top$. $\Box$

D Proof of Lemma 4
Proof of Lemma 4.
It is easy to check that $A$ is non-negative and irreducible. Thus, by the Perron–Frobenius theorem and Corollary 8.4.7 of (Horn & Johnson, 2012), $A$ has a real positive eigenvalue $\lambda_1(A)$, equal to its spectral radius, which is algebraically simple and associated with a strictly positive eigenvector $v$; moreover, $\lambda_1(A) > |\lambda_i(A)|$ for $i = 2, 3$.

A direct computation gives the characteristic polynomial $p(\zeta)$ of $A$:
$$p(\zeta) = \zeta\, p_2(\zeta) - M\eta(7 + 2M\eta)\rho + (1 + 2\rho M\eta)\rho^2,$$
with
$$p_2(\zeta) = \zeta^2 - \rho(3 + 2\rho M\eta)\zeta - \rho\left(4M^2\eta^2 + 15M\eta + 1 - 2(1 + 2\rho M\eta)\rho\right).$$
Define the discriminant of $p_2$:
$$\Delta = \rho^2(3 + 2\rho M\eta)^2 + 4\rho\left(4M^2\eta^2 + 15M\eta + 1 - 2(1 + 2\rho M\eta)\rho\right). \tag{D.1}$$
If $\rho$ satisfies $2\rho(1 + 2\rho M\eta) \le 4M^2\eta^2 + 15M\eta + 1$, then $\Delta \ge 0$, and the two roots $\zeta_1, \zeta_2$ of $p_2(\zeta)$ are
$$\zeta_1, \zeta_2 = \frac{\rho(3 + 2\rho M\eta) \pm \sqrt{\Delta}}{2}.$$
Let
$$\hat{\zeta} = \frac{\rho\left(3M\eta(7+2M\eta)+1\right)(3+2\rho M\eta) + \sqrt{\max\{\Delta, 0\}}}{2}.$$
A direct computation shows that $p(\hat{\zeta}) > 0$ and that $p(\zeta)$ is monotonically increasing on $[\hat{\zeta}, \infty)$; hence $p(\zeta)$ has no real root in this range, which implies
$$\lambda_1(A) \le \hat{\zeta}.$$
By Eqn. (D.1), if $\rho$ satisfies $\rho \le \frac{1}{16(4M^2\eta^2 + 15M\eta + 1)}$, then $\Delta \le \frac{1}{4}$; combining this with the condition $\rho \le \frac{1}{(M^2\eta^2 + 6M\eta + 1)(3 + 2M\eta)}$ yields $\lambda_1(A) \le \frac{1}{2}$. Combining the above conditions on $\rho$, we only need
$$\rho \le \frac{1}{(M^2\eta^2 + 6M\eta + 1)(3 + 2M\eta)}.$$

Next, we show $\sqrt{\rho} < \lambda_1(A)$. We can conclude this result once it holds that $p(\sqrt{\rho}) < 0$, because then $p(\zeta)$ has a real root larger than $\sqrt{\rho}$, and $\lambda_1(A)$ must be no less than this root. We have
$$p(\sqrt{\rho}) = \sqrt{\rho}\,p_2(\sqrt{\rho}) - M\eta(7+2M\eta)\rho + (1+2\rho M\eta)\rho^2 \le \rho\sqrt{\rho}\left(2\rho(1+2\rho M\eta) - 15M\eta - 4M^2\eta^2\right) + \rho^2(1+2\rho M\eta) - \rho M\eta(7+2M\eta) < 0,$$
where the last inequality holds because $\rho \le 1$ and $M\eta \ge 1$ (recall $\eta = \frac{1}{L}$ and $L \le M$), so that $2\rho(1+2\rho M\eta) \le 2(1+2M\eta) < 15M\eta$ and $\rho^2(1+2\rho M\eta) \le \rho(1+2M\eta) < \rho M\eta(7+2M\eta)$.

Since $v$ is the eigenvector associated with $\lambda_1(A)$, we have $Av = \lambda_1(A)v$, that is, the equations
$$2\rho\,v(1) + \rho\,v(2) + 2\rho M\eta\,v(3) = \lambda_1(A)\,v(1),$$
$$v(1) + M\eta\,v(3) = \lambda_1(A)\,v(2),$$
$$(7+2M\eta)\,v(1) + \rho\,v(2) + \rho(1+2\rho M\eta)\,v(3) = \lambda_1(A)\,v(3).$$
Thus, we can obtain
$$v(1) = \frac{1}{7+2M\eta}\left(\lambda_1(A)v(3) - \rho\,v(2) - \rho(1+2\rho M\eta)v(3)\right) \le \frac{\lambda_1(A)\,v(3)}{7+2M\eta} \le \frac{v(3)}{2(7+2M\eta)},$$
and
$$v(2) = \frac{v(1) + M\eta\,v(3)}{\lambda_1(A)} \le \left(\frac{1}{2\sqrt{\rho}\,(7+2M\eta)} + \frac{M\eta}{\sqrt{\rho}}\right)v(3),$$
where the last step uses $\lambda_1(A) \ge \sqrt{\rho}$. $\Box$

E Proof of Lemma 5
We first present an important lemma which is a part of the proof of Lemma 5.
Lemma 9.
Letting $V_t$ be the Lyapunov function associated with Algorithm 1, it satisfies the following property:
$$V_{t+1} \le (1-\alpha)V_t + \frac{1}{L}\|\bar{g}_t - \nabla f(\bar{y}_t)\|^2 + 3\sqrt{\frac{2V_t}{\mu}}\,\|\bar{g}_t - \nabla f(\bar{y}_t)\|. \tag{E.1}$$

Proof. By the update procedure of Algorithm 1 and the $L$-smoothness of $f(x)$, we have
$$f(\bar{x}_{t+1}) \le f(\bar{y}_t) - \eta\langle\nabla f(\bar{y}_t), \bar{g}_t\rangle + \frac{L\eta^2}{2}\|\bar{g}_t\|^2 = f(\bar{y}_t) - \eta\langle\bar{g}_t, \bar{g}_t\rangle + \eta\langle\bar{g}_t, \bar{g}_t - \nabla f(\bar{y}_t)\rangle + \frac{L\eta^2}{2}\|\bar{g}_t\|^2 = f(\bar{y}_t) - \frac{1}{2L}\|\bar{g}_t\|^2 + \frac{1}{L}\langle\bar{g}_t, \bar{g}_t - \nabla f(\bar{y}_t)\rangle, \tag{E.2}$$
where the last equality is because $\eta = \frac{1}{L}$. Furthermore, by the definition of $V_t$ and $\alpha^2 = \frac{\mu}{L}$, we have
$$V_{t+1} = \frac{\mu}{2}\|\bar{v}_{t+1} - x^*\|^2 + f(\bar{x}_{t+1}) - f(x^*) \overset{(B.3)}{=} \frac{\mu}{2}\|(1-\alpha)\bar{v}_t + \alpha\bar{y}_t - x^*\|^2 - \alpha\langle\bar{g}_t, (1-\alpha)\bar{v}_t + \alpha\bar{y}_t - x^*\rangle + \frac{1}{2L}\|\bar{g}_t\|^2 + f(\bar{x}_{t+1}) - f(x^*)$$
$$\overset{(E.2)}{\le} \frac{\mu}{2}\|(1-\alpha)\bar{v}_t + \alpha\bar{y}_t - x^*\|^2 - \alpha\langle\bar{g}_t, (1-\alpha)\bar{v}_t + \alpha\bar{y}_t - x^*\rangle + f(\bar{y}_t) - f(x^*) + \frac{1}{L}\langle\bar{g}_t, \bar{g}_t - \nabla f(\bar{y}_t)\rangle.$$
Furthermore, by Eqn. (B.4), we can obtain $\bar{v}_t = \bar{y}_t + \frac{1}{\alpha}(\bar{y}_t - \bar{x}_t)$, and hence
$$(1-\alpha)\bar{v}_t + \alpha\bar{y}_t = \bar{y}_t + \frac{1-\alpha}{\alpha}(\bar{y}_t - \bar{x}_t).$$
Hence, we have
$$f(\bar{y}_t) - \alpha\langle\bar{g}_t, (1-\alpha)\bar{v}_t + \alpha\bar{y}_t - x^*\rangle - f(x^*) = f(\bar{y}_t) + \langle\bar{g}_t, \alpha x^* + (1-\alpha)\bar{x}_t - \bar{y}_t\rangle - f(x^*)$$
$$= (\alpha + 1 - \alpha)f(\bar{y}_t) + \langle\nabla f(\bar{y}_t), \alpha(x^* - \bar{y}_t) + (1-\alpha)(\bar{x}_t - \bar{y}_t)\rangle - f(x^*) + \langle\bar{g}_t - \nabla f(\bar{y}_t), \alpha x^* + (1-\alpha)\bar{x}_t - \bar{y}_t\rangle$$
$$\le (1-\alpha)\left(f(\bar{x}_t) - f(x^*)\right) - \frac{\alpha\mu}{2}\|x^* - \bar{y}_t\|^2 + \langle\bar{g}_t - \nabla f(\bar{y}_t), \alpha x^* + (1-\alpha)\bar{x}_t - \bar{y}_t\rangle,$$
where the last inequality is because $f(x)$ is $\mu$-strongly convex.

Therefore, we can obtain
$$V_{t+1} \le \frac{\mu(1-\alpha)}{2}\|\bar{v}_t - x^*\|^2 + \frac{\mu\alpha}{2}\|\bar{y}_t - x^*\|^2 + (1-\alpha)\left(f(\bar{x}_t) - f(x^*)\right) - \frac{\alpha\mu}{2}\|x^* - \bar{y}_t\|^2 + \left\langle\bar{g}_t - \nabla f(\bar{y}_t),\ \alpha x^* + (1-\alpha)\bar{x}_t - \bar{y}_t + \frac{1}{L}\bar{g}_t\right\rangle$$
$$\le (1-\alpha)V_t + \langle\bar{g}_t - \nabla f(\bar{y}_t), \alpha x^* + (1-\alpha)\bar{x}_t - \bar{y}_t\rangle + \frac{1}{L}\|\bar{g}_t - \nabla f(\bar{y}_t)\|\,\|\bar{g}_t\|,$$
where the first inequality uses the convexity of $\|\cdot\|^2$:
$$\|(1-\alpha)\bar{v}_t + \alpha\bar{y}_t - x^*\|^2 \le \left((1-\alpha)\|\bar{v}_t - x^*\| + \alpha\|\bar{y}_t - x^*\|\right)^2 \le (1-\alpha)\|\bar{v}_t - x^*\|^2 + \alpha\|\bar{y}_t - x^*\|^2.$$
Moreover,
$$\|\alpha x^* + (1-\alpha)\bar{x}_t - \bar{y}_t\| \le (1-\alpha)\|\bar{x}_t - x^*\| + \|\bar{y}_t - x^*\| \overset{(B.5)}{\le} 2\sqrt{\frac{2V_t}{\mu}}.$$
Therefore, using $\|\nabla f(\bar{y}_t)\| \le L\|\bar{y}_t - x^*\|$, we have
$$V_{t+1} \le (1-\alpha)V_t + \frac{1}{L}\|\bar{g}_t - \nabla f(\bar{y}_t)\|^2 + \frac{1}{L}\|\bar{g}_t - \nabla f(\bar{y}_t)\|\,\|\nabla f(\bar{y}_t)\| + 2\sqrt{\frac{2V_t}{\mu}}\|\bar{g}_t - \nabla f(\bar{y}_t)\| \overset{(B.5)}{\le} (1-\alpha)V_t + \frac{1}{L}\|\bar{g}_t - \nabla f(\bar{y}_t)\|^2 + 3\sqrt{\frac{2V_t}{\mu}}\,\|\bar{g}_t - \nabla f(\bar{y}_t)\|. \qquad\Box$$

Proof of Lemma 5.
Let $v$ be the eigenvector defined in Lemma 4, normalized so that $v(3) = 1$. Combining this with the fact that the first two entries of $z_0$ are zero, we can obtain
$$z_0 \le \|z_0\|\,v \qquad\text{and}\qquad [0,\ 0,\ 1]^\top \le v.$$
By Eqn. (5.10), we can obtain
$$z_{t+1} \le \|z_0\|\,A^{t+1}v + 4\sqrt{\frac{2m}{\mu}}\sum_{i=0}^{t}\sqrt{V_i}\,A^{t-i}v = \|z_0\|\,\lambda_1(A)^{t+1}v + 4\sqrt{\frac{2m}{\mu}}\sum_{i=0}^{t}\sqrt{V_i}\,\lambda_1(A)^{t-i}v \le \|z_0\|\left(\frac{1}{2}\right)^{t+1}v + 4\sqrt{\frac{2m}{\mu}}\sum_{i=0}^{t}\left(\frac{1}{2}\right)^{t-i}\sqrt{V_i}\,v, \tag{E.3}$$
where the equality is because $v$ is the eigenvector associated with $\lambda_1(A)$, and the last inequality is because of Lemma 4.

Next, we prove the result by induction. When $t = 0$, we have $\|\bar{g}_0 - \nabla f(\bar{y}_0)\| = 0$, because we assume the initial values $x_0(i,:)$ are equal to each other. Then, by Eqn. (E.1), we have
$$V_1 \le (1-\alpha)V_0 \le \left(1 - \frac{\alpha}{2}\right)V_0.$$
Next, we assume that for $i = 1, \dots, t$ it holds that
$$V_i \le \left(1 - \frac{\alpha}{2}\right)^i V_0. \tag{E.4}$$
Combining with Eqn. (E.3), we can obtain
$$z_{t-1} \le v\cdot\left(4\sqrt{\frac{2m}{\mu}}\sum_{j=0}^{t-2}\left(\frac{1}{2}\right)^{t-2-j}\left(\sqrt{1-\frac{\alpha}{2}}\right)^{j}\sqrt{V_0} + 2^{-(t-1)}\|z_0\|\right) \le v\cdot\left(\sqrt{1-\frac{\alpha}{2}}\right)^{t-1}\underbrace{\left(16\sqrt{\frac{2m}{\mu}}\sqrt{V_0} + 2\|z_0\|\right)}_{=:C_0}, \tag{E.5}$$
where the last inequality is because of the assumption $\alpha \le \frac{1}{2}$, which gives $\sqrt{1-\frac{\alpha}{2}} \ge \frac{\sqrt{3}}{2} > \frac{1}{2}$ and hence a bounded geometric sum.

Now, we upper bound $\|\bar{g}_t - \nabla f(\bar{y}_t)\|$. First, by the first row of the recursion in Lemma 3, we can obtain
$$\|y_t - \bar{y}_t\| \le \left[2\rho,\ \rho,\ 2\rho M\eta\right]z_{t-1} \le \rho\left(2v(1) + v(2) + 2M\eta\right)\left(\sqrt{1-\frac{\alpha}{2}}\right)^{t-1}C_0 \le 5M\eta\sqrt{\rho}\left(\sqrt{1-\frac{\alpha}{2}}\right)^{t-1}C_0,$$
using the bounds on $v(1)$ and $v(2)$ from Lemma 4 together with $M\eta \ge 1$ and $\rho \le 1$. Thus, by Eqn. (B.2), we can obtain
$$\|\bar{g}_t - \nabla f(\bar{y}_t)\| \le \frac{M}{\sqrt{m}}\|y_t - \bar{y}_t\| \le \frac{5M^2\eta\sqrt{\rho}}{\sqrt{m}}\left(\sqrt{1-\frac{\alpha}{2}}\right)^{t-1}C_0.$$
For the induction step, combining Eqn. (E.4) with Eqn. (E.1), we have
$$V_{t+1} \le (1-\alpha)\left(1-\frac{\alpha}{2}\right)^t V_0 + \frac{1}{L}\|\bar{g}_t - \nabla f(\bar{y}_t)\|^2 + 3\sqrt{\frac{2V_t}{\mu}}\,\|\bar{g}_t - \nabla f(\bar{y}_t)\|.$$
By the above bound on $\|\bar{g}_t - \nabla f(\bar{y}_t)\|$ and the definition $\Theta = 1 + \frac{\mu}{m}\cdot\frac{\|z_0\|^2}{V_0}$, which gives $C_0^2 \le \frac{1024\,m}{\mu}\,\Theta V_0$, the two error terms are together bounded, up to an absolute constant, by $\sqrt{\rho}\,\frac{M^2}{L\mu}\,\Theta\left(1-\frac{\alpha}{2}\right)^{t}V_0$. The condition
$$\sqrt{\rho} \le \frac{\mu\alpha}{L}\cdot\min\left\{\frac{L}{M\Theta},\ \frac{L^2}{M^2\Theta}\right\}$$
ensures that this quantity is at most $\frac{\alpha}{2}\left(1-\frac{\alpha}{2}\right)^t V_0$. Therefore,
$$V_{t+1} \le (1-\alpha)\left(1-\frac{\alpha}{2}\right)^t V_0 + \frac{\alpha}{2}\left(1-\frac{\alpha}{2}\right)^t V_0 \le \left(1-\frac{\alpha}{2}\right)^{t+1}V_0,$$
so the claim also holds at iteration $t+1$, which completes the induction. $\Box$