A Distributed Optimization Algorithm over Time-Varying Graphs with Efficient Gradient Evaluations
Bryan Van Scoy and Laurent Lessard
Wisconsin Institute for Discovery, Department of Electrical Engineering, University of Wisconsin–Madison, Madison, WI 53706, USA
{vanscoy,laurent.lessard}@wisc.edu
Abstract:
We propose an algorithm for distributed optimization over time-varying communication networks. Our algorithm uses an optimized ratio between the number of rounds of communication and gradient evaluations to achieve fast convergence. The iterates converge to the global optimizer at the same rate as centralized gradient descent when measured in terms of the number of gradient evaluations, while using the minimum number of communications to do so. Furthermore, the iterates converge at a near-optimal rate when measured in terms of the number of communication rounds. We compare our algorithm with several other known algorithms on a distributed target localization problem.

1. INTRODUCTION

We consider the distributed optimization problem

  minimize_{x ∈ R^d} f(x), where f(x) := (1/n) Σ_{i=1}^n f_i(x).   (1)

Associated with each agent i ∈ {1, ..., n} is the local objective function f_i : R^d → R, where n is the number of agents and d is the dimension of the problem. The goal is for all the agents to calculate the global optimizer using only local communications and gradient evaluations.

Many algorithms have been proposed recently to solve the distributed optimization problem. Some examples include distributed gradient descent by Nedić and Ozdaglar (2009), EXTRA by Shi et al. (2015), AugDGM by Xu et al. (2015), NIDS by Li et al. (2017), DIGing by Nedić et al. (2017) and Qu and Li (2018), Exact Diffusion by Yuan et al. (2019), and SVL by Sundararajan et al. (2019), among others. In each algorithm, agents do the following at each step:
• communicate state variables with local neighbors,
• evaluate the local gradient ∇f_i, and
• update local state variables.
Each algorithm alternates between these three steps and therefore uses the same number of communications and local gradient evaluations. In this paper, however, we allow this ratio to depend on the properties of the objective function and the communication network.
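As a concrete sketch (our example, not from the paper), the basic communication primitive shared by all of these methods replaces each agent's state with a weighted average of its neighbors' states, x ← Wx; repeated rounds drive the states toward the network-wide average:

```python
import numpy as np

# One gossip round on a ring of n agents: each agent averages its own
# state with its two neighbors (weights 1/3). The weight matrix W is
# doubly stochastic, so repeated rounds converge to the average.
n = 5
W = np.zeros((n, n))
for i in range(n):
    for j in (i, (i - 1) % n, (i + 1) % n):
        W[i, j] = 1 / 3

x = np.arange(n, dtype=float)             # initial local states
avg = x.mean()                            # the consensus value
for _ in range(50):
    x = W @ x                             # one communication round
assert np.max(np.abs(x - avg)) < 1e-6     # all agents near the average
```

How quickly the states contract toward the average is governed by the spectral gap of W, which is the quantity the algorithm below uses to set its communication/computation ratio.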
To characterize the convergence properties of our algorithm, we use the following notions of time.
• We define a step as one round of communication and at most one gradient evaluation.
• We define an iteration as m rounds of communication and one gradient evaluation.

⋆ This material is based upon work supported by the National Science Foundation under Grant Nos. 1656951 and 1750162.
In other words, an iteration consists of m steps, where each step is at least as simple as that of the algorithms previously mentioned. We assume that local state updates have negligible cost and can therefore be performed any number of times per step or iteration. For example, consider an algorithm that updates as follows:

  iteration           |    1    |    2    |    3    |
  step                | 1  2  3 | 4  5  6 | 7  8  9 |
  communication       | ✓  ✓  ✓ | ✓  ✓  ✓ | ✓  ✓  ✓ |
  gradient evaluation |       ✓ |       ✓ |       ✓ |

This algorithm performs three rounds of communication per gradient evaluation, so m = 3.

Main contributions.
In this work, we propose a novel decentralized algorithm for solving (1). Instead of using the same number of communication rounds as gradient evaluations, our algorithm sets the ratio between these using global problem parameters. We show the following:
(1) The iterates of our algorithm converge to the optimizer at the same rate as centralized gradient descent in terms of the number of iterations. Furthermore, our algorithm achieves this using the minimum number m of communications per gradient evaluation.
(2) The iterates of our algorithm converge to the optimizer at a near-optimal rate in terms of the number of steps, despite not evaluating the gradient at each step.

A decentralized algorithm can trivially obtain the same rate as centralized gradient descent if we use an infinite number of communication rounds per iteration (i.e., m → ∞), since then every agent can compute an exact average at each iteration (and therefore can evaluate the global gradient). We show, however, that our algorithm achieves the same rate with a finite number of communication rounds per iteration, and we characterize precisely how many communication rounds are required.

To prove convergence of our algorithm, we make the following assumptions.
• The local objective functions satisfy a contraction property that is weaker than assuming smoothness and strong convexity.
• The communication network may be time-varying and either directed or undirected, as long as it is sufficiently connected and the associated weight matrix is doubly stochastic at each step.

Perhaps the algorithm most similar to ours is the multi-step dual accelerated (MSDA) algorithm by Scaman et al. (2017). This algorithm also adjusts the ratio between the number of communication rounds and gradient evaluations to achieve fast convergence.
The MSDA algorithm is provably optimal in terms of both the number of communications and gradient evaluations when the objective function is smooth and strongly convex and the communication network is fixed. Compared to our algorithm, the MSDA algorithm achieves an accelerated rate of convergence by making stronger assumptions on both the objective function and the communication network, while we prove a non-accelerated rate using weaker assumptions.

The remainder of the paper is organized as follows. We first set up the distributed optimization problem along with our assumptions in Section 2, and then present our algorithm along with its main convergence result in Section 3. We then compare our algorithm with several others on a distributed target localization problem in Section 4, and conclude in Section 5. To simplify the presentation, we defer the main convergence proof to Appendix A.
Notation.
We use subscript i to denote the agent and superscript k to denote the iteration. We denote the all-ones vector by 1 ∈ R^n and the identity matrix by I_n ∈ R^{n×n}. We use ‖·‖ to denote the 2-norm of a vector as well as the induced 2-norm of a matrix.

2. PROBLEM SETUP

We now discuss the assumptions on the objective function and the communication network that we make in order to solve the distributed optimization problem (1).

Assumption 1.
The distributed optimization problem (1) has an optimizer x⋆ ∈ R^d. Furthermore, there exist a stepsize α > 0 and a contraction factor ρ ∈ (0, 1) such that

  ‖x − x⋆ − α(∇f_i(x) − ∇f_i(x⋆))‖ ≤ ρ ‖x − x⋆‖   (2)

for all x ∈ R^d and all i ∈ {1, ..., n}.

Each ∇f_i(x⋆) is in general nonzero, although we have

  Σ_{i=1}^n ∇f_i(x⋆) = 0.   (3)

Assumption 1 also implies that

  ‖x − x⋆ − α ∇f(x)‖ ≤ ρ ‖x − x⋆‖

for all x ∈ R^d, so the global objective function satisfies the same property as the local functions. Assumption 1 holds if the local functions satisfy a one-point smooth and strong convexity property as described in the following proposition.

Proposition 1.
Let 0 < µ ≤ L, and suppose each local function f_i is one-point L-smooth and µ-strongly convex with respect to the global optimizer; in other words,

  µ ‖x − x⋆‖² ≤ (∇f_i(x) − ∇f_i(x⋆))ᵀ (x − x⋆) ≤ L ‖x − x⋆‖²

for all x ∈ R^d and all i ∈ {1, ..., n}. Then (2) holds with stepsize α = 2/(L + µ) and contraction factor ρ = (L − µ)/(L + µ).

Assumption 1 also holds under the stronger assumption that each f_i is L-smooth and µ-strongly convex, meaning that

  µ ‖x − y‖² ≤ (∇f_i(x) − ∇f_i(y))ᵀ (x − y) ≤ L ‖x − y‖²

for all x, y ∈ R^d and all i ∈ {1, ..., n}.

To characterize the communication among agents, we use a gossip matrix defined as follows.
Definition 2. (Gossip matrix). We say that the matrix W = {w_ij} ∈ R^{n×n} is a gossip matrix if w_ij = 0 whenever agent i does not receive information from agent j. We define the spectral gap σ ∈ R of a gossip matrix W as

  σ := ‖W − (1/n) 1 1ᵀ‖.   (4)

Furthermore, we say that W is doubly stochastic if both W1 = 1 and 1ᵀW = 1ᵀ.

The spectral gap characterizes the connectivity of the communication network. In particular, a small spectral gap corresponds to a well-connected network and vice versa. One way to obtain a gossip matrix is to set W = I − L where L is the (possibly weighted) graph Laplacian. We make the following assumption about the gossip matrix.

Assumption 2. (Communication network). There exists a scalar σ ∈ (0,
1) such that each agent i ∈ {1, ..., n} has access to the i-th row of a doubly-stochastic gossip matrix W with spectral gap at most σ at each step of the algorithm.

Time-varying communication networks that are either directed or undirected can satisfy Assumption 2 as long as the associated gossip matrix is doubly stochastic with a known upper bound on its spectral gap. See Xiao et al. (2007) for how to optimize the weights of the gossip matrix to minimize the spectral gap, and see Nedić and Olshevsky (2015) for distributed optimization over non-doubly-stochastic networks using the push-sum protocol.

The (centralized) gradient descent iterations are given by

  x^{k+1} = x^k − α ∇f(x^k)   (5)

where α > 0 is the stepsize. Under Assumption 1, the iterates converge to the optimizer linearly with rate ρ; in other words, ‖x^k − x⋆‖ = O(ρ^k). While this method could be approximated in a decentralized manner using a large number of steps per iteration (so that every agent can compute the average gradient at each iteration), we show that our algorithm achieves the same convergence rate using the minimal number m of necessary rounds of communication per gradient evaluation.

3. MAIN RESULTS

To solve the distributed optimization problem, we now introduce our algorithm, which depends on the stepsize α, contraction factor ρ, and spectral gap σ.

Algorithm
Parameters: stepsize α > 0, contraction factor ρ ∈ (0, 1), spectral gap σ ∈ (0, 1).
Inputs: local functions f_i : R^d → R on agent i ∈ {1, ..., n}; gossip matrices {w_ij^{kℓ}} at iteration k and communication round ℓ.
Initialization:
• Each agent i ∈ {1, ..., n} chooses x_i^0, y_i^0 ∈ R^d such that Σ_{i=1}^n y_i^0 = 0 (for example, y_i^0 = 0).
• Define the number of communications per iteration

    m := minimize_{r ≥ ρ, s ≥ σ} ⌈ log_s(√r − √(1−r)) ⌉.

for iteration k = 0, 1, 2, ... do
    for agent i ∈ {1, ..., n} do
        v_{i,0}^k = x_i^k
        for step ℓ = 1, ..., m do
            v_{i,ℓ}^k = Σ_{j=1}^n w_ij^{kℓ} v_{j,ℓ−1}^k    (local communication)
        end for
        u_i^k = v_{i,m}^k − α ∇f_i(v_{i,m}^k)    (local gradient evaluation)
        y_i^{k+1} = y_i^k + x_i^k − v_{i,m}^k    (local state update)
        x_i^{k+1} = u_i^k − √(1−ρ) y_i^{k+1}    (local state update)
    end for
end for
return x_i^k ∈ R^d, the estimate of x⋆ on agent i at iteration k

At iteration k of the algorithm, agent i first communicates with its local neighbors m times using the gossip matrices {W^{k,ℓ}}_{ℓ=1}^m, then evaluates its local gradient ∇f_i at the point resulting from the communication, and finally updates its local state variables x_i^k and y_i^k. The output of the algorithm is x_i^k, the estimate of the optimizer x⋆ of the global objective function f. Note that agents are required to know the global parameters ρ and σ so that they can calculate the number of communication rounds m.

For a given contraction factor ρ and spectral gap σ, agents perform m consecutive rounds of communication at each iteration, where

  m := minimize_{r ≥ ρ, s ≥ σ} ⌈ log_s(√r − √(1−r)) ⌉.   (6)

This is the minimum integer number of communication rounds such that the spectral gap of the m-step gossip matrix ∏_{ℓ=1}^m W^{k,ℓ} at iteration k is no greater than √ρ − √(1−ρ). Since only one gradient evaluation is performed per iteration, this adjusts the ratio between the number of communications and gradient evaluations as shown in Figure 1.
In particular, the algorithm uses a single communication per gradient evaluation when the network is sufficiently connected (σ small) and the objective function is ill-conditioned (ρ large). As the network becomes more disconnected and/or the objective function becomes more well-conditioned, the algorithm uses more communications per gradient evaluation in order to keep the ratio at the optimal operating point.

Fig. 1. Ratio between the number of communications and gradient evaluations as a function of the spectral gap σ and the contraction factor ρ. The color indicates the ratio from light (small ratio) to dark (large ratio).

We now present our main result, which states that the iterates of each agent converge to the global optimizer linearly with a rate equal to the contraction factor ρ. We prove the result in Appendix A.

Theorem 1. (Main result). Suppose Assumptions 1 and 2 hold for some point x⋆ ∈ R^d, stepsize α > 0, contraction factor ρ ∈ (0, 1), and spectral gap σ ∈ (0, 1). Then the sequence {x_i^k}_{k≥0} of each agent i ∈ {1, ..., n} in our algorithm converges to the optimizer x⋆ linearly with rate ρ. In other words,

  ‖x_i^k − x⋆‖ = O(ρ^k) for all i ∈ {1, ..., n}.   (7)

Theorem 1 states that the iterates of our algorithm converge to the optimal solution of (1) in a decentralized manner at the same rate as centralized gradient descent (5) in terms of the number of iterations. In other words, the algorithm converges just as fast (in the worst case) as if each agent had access to the information of all other agents at every iteration. Instead of communicating all this information, however, it is sufficient to perform only m rounds of communication, where m is defined in (6).

The convergence rate in Theorem 1 is in terms of the number of iterations. To compare the performance of our algorithm in terms of the number of steps, we plot the convergence rate per step in Figure 2. For comparison, we also plot the rate of the algorithm SVL by Sundararajan et al. (2019). This algorithm is designed to optimize the convergence rate per step and requires agents to compute their local gradient at each step of the algorithm. In contrast, our algorithm is slightly slower than the optimal algorithm but uses far fewer computations since local gradients are only evaluated once every m steps.
Fig. 2. Convergence rate in terms of the number of steps as a function of the contraction factor ρ and spectral gap σ. Solid lines indicate our algorithm while dashed lines indicate the optimal algorithm SVL by Sundararajan et al. (2019). Since our algorithm converges with rate ρ with respect to the number of iterations and performs m steps per iteration, the convergence rate with respect to the number of steps is ρ^{1/m}, where m is defined in (6).

4. APPLICATION: TARGET LOCALIZATION

To illustrate our results, we use our algorithm to solve the distributed target localization problem illustrated in Figure 3, which is inspired by the example in Section 18.3 of the book by Boyd and Vandenberghe (2018). We assume each agent (blue dot) can measure its distance (but not angle) to the target (red dot) and can communicate with local neighbors.

Suppose the agents are located in a two-dimensional plane where the location of agent i ∈ {1, ..., n} is given by (p_i, q_i) ∈ R². Each agent knows its own position but not the location of the target, denoted by x⋆ = (p⋆, q⋆) ∈ R². Agent i is capable of measuring its distance to the target,

  r_i = √((p_i − p⋆)² + (q_i − q⋆)²).

The objective function f_i : R² → R associated with agent i is

  f_i(p, q) = (1/2) (√((p_i − p)² + (q_i − q)²) − r_i)².

Then, in order to locate the target, the agents cooperate to solve the distributed nonlinear least-squares problem

  minimize_{(p,q) ∈ R²} (1/n) Σ_{i=1}^n f_i(p, q).   (8)

Agents can communicate with local neighbors as shown in Figure 3. To simulate randomly dropped packets from agent 4 to agent 1, the gossip matrix at each iteration is randomly chosen from the set W ∈
{W₁, W₂}, where W₁ and W₂ are doubly-stochastic gossip matrices on the five agents that differ only in the weights on the time-varying link from agent 4 to agent 1. Both gossip matrices satisfy Assumption 2.

Fig. 3. Setup of the target localization problem. The position (p_i, q_i) ∈ R² of agent i ∈ {1, ..., 5} is denoted by a blue dot, with the position of the target in red at (p⋆, q⋆). An arrow from i to j indicates that agent j receives information from agent i. The dashed arrow indicates the link that varies in time. The smooth curves are the contour lines of the objective function for the distributed nonlinear least-squares problem in (8). Note that the problem is nonconvex since the level sets are nonconvex.
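For concreteness, the local objective and its gradient can be implemented directly. The sketch below is ours (example positions, not those of Figure 3); it includes the factor of 1/2 in f_i, the normalization consistent with the Hessian trace formula used in the stepsize computation:

```python
import numpy as np

# Local objective f_i(p,q) = 0.5*(dist((p,q),(p_i,q_i)) - r_i)^2 and its
# gradient for the localization problem (8). Example values are ours.
def make_local(pi, qi, target):
    ri = np.hypot(pi - target[0], qi - target[1])   # measured range

    def f(x):
        dist = np.hypot(pi - x[0], qi - x[1])
        return 0.5 * (dist - ri) ** 2

    def grad(x):
        dist = np.hypot(pi - x[0], qi - x[1])
        # chain rule: grad of dist is (x - (p_i, q_i)) / dist
        return (dist - ri) * np.array([x[0] - pi, x[1] - qi]) / dist

    return f, grad

f1, g1 = make_local(0.0, 0.0, target=(1.0, 1.0))
assert abs(f1(np.array([1.0, 1.0]))) < 1e-12        # zero residual at target
```

Each agent would hand its grad to the algorithm of Section 3. Note that with exact distance measurements the residual vanishes at the target, so here every ∇f_i(x⋆) = 0, although in general only the sum (3) is zero.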
Fig. 4. Plot of the error for the target localization problem. The blue lines indicate the error ‖x_i^k − x⋆‖ for each of the five agents computed using our proposed decentralized algorithm, as a function of either the iteration (dark blue) or the step (light blue); iterations and steps are equivalent for each of the other algorithms. The red line indicates the error using centralized gradient descent (5). Our algorithm performs one gradient evaluation and m = 6 communications per iteration.

We choose the stepsize to optimize the asymptotic rate of convergence. In particular, the estimate of each agent becomes arbitrarily close to the target as k → ∞, so the optimal stepsize is α = 2/(λ₁ + λ₂), where λ₁ and λ₂ are the smallest and largest eigenvalues of the Hessian matrix evaluated at the target, in other words, ∇²f(p⋆, q⋆). Since the objective function is two-dimensional, the sum of its smallest and largest eigenvalues is equal to its trace, so

  λ₁ + λ₂ = trace(∇²f) = (1/n) Σ_{i=1}^n trace(∇²f_i),

where the trace of the local Hessian is

  trace(∇²f_i) = 2 − r_i / √((p_i − p)² + (q_i − q)²).

The trace is equal to one at the target, so the optimal stepsize is α = 2. Since NIDS and EXTRA are unstable with this stepsize, we instead use smaller stepsizes for those algorithms. We choose the contraction factor as the convergence rate of centralized gradient descent, which is ρ ≈ 0.
75. Then our algorithm performs m = 6 communication rounds per iteration. We have each agent initialize its states with its position x_i^0 = (p_i, q_i) ∈ R² and y_i^0 = (0, 0) ∈ R².

In Figure 4, we plot the error of each agent as a function of either the iteration or the step. As expected from Theorem 1, the error of our algorithm converges to zero at the same rate as centralized gradient descent (5) in terms of iterations. Our algorithm uses m = 6 communications per iteration while NIDS, EXTRA, and SVL use only one; our algorithm is more efficient in terms of gradient evaluations, but also uses more communications than the other algorithms to obtain a solution with a given precision.

5. CONCLUSION

We developed an algorithm for distributed optimization that uses the minimal amount of communication necessary such that the iterates converge to the optimizer at the same rate as centralized gradient descent in terms of the number of gradient evaluations. Furthermore, the convergence rate of our algorithm is near-optimal (in the worst case) in terms of the number of communication rounds even though the gradient is not evaluated at each step. Such an algorithm is particularly useful when gradient evaluations are expensive relative to the cost of communication.

REFERENCES

Boyd, S. and Vandenberghe, L. (2018). Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Cambridge University Press, New York, NY, USA.
Li, Z., Shi, W., and Yan, M. (2017). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv:1704.07807.
Nedić, A. and Olshevsky, A. (2015). Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3), 601–615.
Nedić, A., Olshevsky, A., and Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4), 2597–2633.
Nedić, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48–61.
Qu, G. and Li, N. (2018). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245–1260.
Scaman, K., Bach, F., Bubeck, S., Lee, Y.T., and Massoulié, L. (2017). Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3027–3036.
Shi, W., Ling, Q., Wu, G., and Yin, W. (2015). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944–966.
Sundararajan, A., Van Scoy, B., and Lessard, L. (2019). Analysis and design of first-order distributed optimization algorithms over time-varying graphs. arXiv:1907.05448.
Xiao, L., Boyd, S., and Kim, S.J. (2007). Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1), 33–46.
Xu, J., Zhu, S., Soh, Y.C., and Xie, L. (2015). Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In IEEE Conference on Decision and Control, 2055–2060.
Yuan, K., Ying, B., Zhao, X., and Sayed, A.H. (2019). Exact diffusion for distributed optimization and learning—Part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3), 708–723.

Appendix A. PROOF OF THEOREM 1

We now prove linear convergence of the iterates of our algorithm to the optimizer of the global objective function.
Average and disagreement operators.
To simplify the notation, we define the average operator avg : R^{nd} → R^{nd} as

  avg(x) := ((1/n) 1 1ᵀ ⊗ I_d) x

along with the disagreement operator dis : R^{nd} → R^{nd} as

  dis(x) := ((I_n − (1/n) 1 1ᵀ) ⊗ I_d) x,

where ⊗ denotes the Kronecker product. Note that any point can be decomposed into its average and disagreement components since avg + dis = I. Also, the operators are orthogonal in that avg(x)ᵀ dis(y) = 0 for all x, y ∈ R^{nd}.

Vectorized form.
Defining the parameter λ := √(1−ρ), we can write our algorithm in vectorized form as

  v^k = W^k(x^k)   (A.1a)
  u^k = v^k − α ∇̄f(v^k)   (A.1b)
  y^{k+1} = y^k + x^k − v^k   (A.1c)
  x^{k+1} = u^k − λ y^{k+1}   (A.1d)

with avg(y^0) = 0, where the concatenated vectors are

  u^k := (u_1^k; ...; u_n^k), v^k := (v_1^k; ...; v_n^k), x^k := (x_1^k; ...; x_n^k), y^k := (y_1^k; ...; y_n^k),

and the m-step consensus operator W^k : R^{nd} → R^{nd} and global gradient operator ∇̄f : R^{nd} → R^{nd} are defined as

  W^k := ∏_{ℓ=1}^m (W^{k,ℓ} ⊗ I_d) and ∇̄f(v) := (∇f_1(v_1); ...; ∇f_n(v_n)).

Fixed-point.
Define the points u⋆, v⋆, x⋆, y⋆ ∈ R^{nd} as

  v⋆ = x⋆ = 1 ⊗ x⋆,  u⋆ = v⋆ − α ∇̄f(v⋆),  y⋆ = (1/λ)(u⋆ − x⋆).

Then (u⋆, v⋆, x⋆, y⋆) is a fixed point of the concatenated system (A.1) since the gossip matrix is doubly stochastic at each step. Also, avg(y⋆) = 0 since x⋆ satisfies (3).

Error system.
To analyze the algorithm, we use a change of variables to put it in error coordinates. The error vectors

  ū^k := u^k − u⋆,  v̄^k := v^k − v⋆,  x̄^k := x^k − x⋆,  ȳ^k := y^k − y⋆

satisfy the iterations

  ȳ^{k+1} = ȳ^k + x̄^k − v̄^k   (A.2a)
  x̄^{k+1} = ū^k − λ ȳ^{k+1}   (A.2b)

for k ≥ 0.

Fixed-point operator.
From Assumption 1, the global gradient operator ∇̄f satisfies avg(∇̄f(x⋆)) = 0 and

  ‖x − x⋆ − α(∇̄f(x) − ∇̄f(x⋆))‖ ≤ ρ ‖x − x⋆‖   (A.3)

for all x ∈ R^{nd}. In other words, I − α∇̄f is a contraction with respect to the point x⋆ with contraction factor ρ. We use the over-bar in ∇̄f to distinguish it from the gradient of the global objective function f in (1); the operators are related by ((1/n) 1ᵀ ⊗ I_d) ∇̄f(1 ⊗ x) = ∇f(x).

Consensus operator.
From Assumption 2 along with the definition of m, the consensus operator W^k satisfies

  ‖dis(W^k(x))‖ ≤ σ^m ‖dis(x)‖ ≤ σ̄ ‖dis(x)‖   (A.4)

for all x ∈ R^{nd} and all k ≥ 0, where σ̄ := √ρ − √(1−ρ).

Consensus direction.
We now derive some properties of the average error vectors. Using the assumption that the gossip matrix is doubly stochastic, we have

  avg(x̄^k) = avg(v̄^k) for all k ≥ 0.   (A.5)

The iterates are initialized such that avg(ȳ^0) = 0 (recall that avg(y⋆) = 0). Taking the average of (A.1c), we have that the average is preserved; in other words, avg(ȳ^{k+1}) = avg(ȳ^k) for all k ≥
0. Then by induction,

  avg(ȳ^k) = 0 for all k ≥ 0.   (A.6)

Lyapunov function.
To prove convergence, we will show that the function V : R^{nd} × R^{nd} → R defined by

  V(x̄, ȳ) := ‖avg(x̄)‖² + [dis(x̄); dis(ȳ)]ᵀ ([1 λ; λ λ] ⊗ I_{nd}) [dis(x̄); dis(ȳ)]   (A.7)

is a Lyapunov function for the algorithm; that is, it is both positive definite and decreasing along system trajectories. Note that λ ∈ (0, 1) since ρ ∈ (0, 1), so the 2 × 2 matrix in (A.7) is positive definite and V is also positive definite, meaning that V(x̄, ȳ) ≥ 0 for all x̄ and ȳ, and V(x̄, ȳ) = 0 if and only if x̄ = 0 and dis(ȳ) = 0 (recall that avg(ȳ^k) = 0).

Next, we show that the Lyapunov function decreases by a factor of at least ρ² at each iteration. Define the weighted difference in the Lyapunov function between iterations as

  ΔV^k := V(x̄^{k+1}, ȳ^{k+1}) − ρ² V(x̄^k, ȳ^k).

Substituting the expressions for the iterates in (A.2) and using the properties of the average iterates in (A.5) and (A.6), we have

  ΔV^k = −(ρ² ‖v̄^k‖² − ‖ū^k‖²) − ρ² (σ̄² ‖dis(x̄^k)‖² − ‖dis(v̄^k)‖²) − σ̄ ‖dis(v̄^k + λ(x̄^k + ȳ^k))‖².

The first term is nonpositive since ∇̄f satisfies (A.3), the second since W^k satisfies (A.4), and the third since it is a squared norm. Therefore, ΔV^k ≤ 0 for all k ≥
0. Applying this inequality at each iteration and summing, we obtain the bound

  V(x̄^k, ȳ^k) ≤ ρ^{2k} V(x̄^0, ȳ^0) for all k ≥ 0.

Bound.
Finally, we use the Lyapunov function to show that ‖x_i^k − x⋆‖ converges to zero linearly with rate ρ for each agent i ∈ {1, ..., n}. The squared norm is upper bounded by

  ‖x_i^k − x⋆‖² ≤ cond(P) V(x̄^k, ȳ^k) ≤ c² ρ^{2k},

where P denotes the 2 × 2 matrix in (A.7), cond(·) denotes the condition number, and the nonnegative constant c ∈ R is defined as

  c := √(cond(P) V(x̄^0, ȳ^0)).

Taking the square root, we obtain the bound ‖x_i^k − x⋆‖ ≤ c ρ^k for each agent i ∈ {1, ..., n} and iteration k ≥ 0, which completes the proof.
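As a closing numerical sanity check (our code, not part of the proof), the average and disagreement operators used above can be formed explicitly as matrices via Kronecker products and verified to satisfy avg + dis = I, orthogonality, and idempotence:

```python
import numpy as np

# Matrix forms of the operators from Appendix A:
#   avg = (1/n) 1 1^T  (x)  I_d,   dis = (I_n - (1/n) 1 1^T)  (x)  I_d
n, d = 4, 3
ones = np.ones((n, 1))
P_avg = np.kron(ones @ ones.T / n, np.eye(d))
P_dis = np.kron(np.eye(n) - ones @ ones.T / n, np.eye(d))

rng = np.random.default_rng(1)
x, y = rng.normal(size=n * d), rng.normal(size=n * d)
assert np.allclose(P_avg + P_dis, np.eye(n * d))    # avg + dis = I
assert abs((P_avg @ x) @ (P_dis @ y)) < 1e-12       # orthogonality
assert np.allclose(P_avg @ P_avg, P_avg)            # avg is a projection
```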