On the Distributed Optimization over Directed Networks
Chenguang Xi, Qiong Wu, Usman A. Khan∗
Department of Electrical and Computer Engineering, Tufts University, 161 College Ave., Medford, MA 02155, USA
Department of Mathematics, Tufts University, 503 Boston Ave., Medford, MA 02155, USA
Abstract
In this paper, we propose a distributed algorithm, called Directed-Distributed Gradient Descent (D-DGD), to solve multi-agent optimization problems over directed graphs. Existing algorithms mostly deal with similar problems under the assumption of undirected networks, i.e., requiring the weight matrices to be doubly-stochastic. The row-stochasticity of the weight matrix guarantees that all agents reach consensus, while the column-stochasticity ensures that each agent's local gradient contributes equally to the global objective. In a directed graph, however, it may not be possible to construct a doubly-stochastic weight matrix in a distributed manner. We overcome this difficulty by augmenting an additional variable for each agent to record the change in the state evolution. In each iteration, the algorithm simultaneously constructs a row-stochastic matrix and a column-stochastic matrix instead of only a doubly-stochastic matrix. The convergence of the new weight matrix, depending on the row-stochastic and column-stochastic matrices, ensures that agents reach both consensus and optimality. The analysis shows that the proposed algorithm converges at a rate of O(ln k / √k), where k is the number of iterations.

Keywords:
Distributed optimization; multi-agent networks; directed graphs; distributed gradient descent.
1. Introduction
Distributed computation and optimization [28, 29] has received significant recent interest in many areas, e.g., multi-agent networks [2], model predictive control [13, 12], cognitive networks [10], source localization [22], resource scheduling [4], and message routing [18]. The related problems can be posed as the minimization of a sum of objectives, Σ_{i=1}^{n} f_i(x), where f_i : R^p → R is a private objective function at the ith agent. There are two general types of distributed algorithms to solve this problem. The first type is gradient based, where at each iteration a gradient-related step is calculated, followed by averaging with neighbors in the network, e.g., Distributed Gradient Descent (DGD) [16] and the distributed dual-averaging method [5]. The main advantage of these methods is computational simplicity. The second type is based on augmented Lagrangians, where at each iteration the primal variables are solved to minimize a Lagrangian-related function, followed by updating the dual variables accordingly, e.g., the Distributed Alternating Direction Method of Multipliers (D-ADMM) [11, 30, 24]. The latter type is preferred when agents can solve the local optimization problem efficiently. All of the distributed algorithms in [16, 5, 11, 30, 24] assume undirected graphs. The primary reason behind assuming undirected graphs is to obtain a doubly-stochastic weight matrix: the row-stochasticity of the weight matrix guarantees that all agents reach consensus, while the column-stochasticity ensures optimality, i.e., that each agent's local gradient contributes equally to the global objective.

This work has been partially supported by an NSF Career Award.
∗ Corresponding author.
Email addresses: [email protected] (Chenguang Xi), [email protected] (Qiong Wu), [email protected] (Usman A. Khan)

In this paper, we propose a gradient based method for solving the distributed optimization problem over directed graphs, which we refer to as Directed-Distributed Gradient Descent (D-DGD). Clearly, a directed topology has broader applications in contrast to undirected graphs, and may further result in reduced communication cost and simplified topology design. We start by explaining why existing gradient based methods, e.g., DGD, require doubly-stochastic weight matrices. In the iterations of DGD, agents do not reach consensus if the rows of the weight matrix do not sum to one; on the other hand, if the columns of the weight matrix do not sum to one, the agents contribute unequally to the network objective. Since doubly-stochastic matrices may not be achievable in a directed graph, the original methods, e.g., DGD, no longer work. We overcome this difficulty in a directed graph by augmenting an additional variable for each agent to record the state updates. In each iteration of the D-DGD
Preprint submitted to Elsevier, February 2, 2016.

algorithm, we simultaneously construct a row-stochastic matrix and a column-stochastic matrix instead of only a doubly-stochastic matrix. We give an intuitive explanation of our proposed algorithm and further provide convergence and convergence rate analysis.

In the context of directed graphs, related work has considered distributed gradient based algorithms [15, 14, 27, 25, 26] that combine gradient descent with push-sum consensus. The push-sum algorithm [7, 1] was first proposed for consensus problems, to achieve average-consensus given a column-stochastic matrix. The idea is based on computing the stationary distribution (the left eigenvector of the weight matrix corresponding to eigenvalue 1) of the Markov chain characterized by the multi-agent network, and canceling the imbalance by dividing by the left eigenvector. The algorithms in [15, 14, 27, 25, 26] follow a similar spirit to push-sum consensus and are nonlinear (because of the division). In contrast, our algorithm uses linear iterations and does not involve any division.

The remainder of the paper is organized as follows. In Section 2, we provide the problem formulation and show why DGD fails to converge to the optimal solution over directed graphs. We then present the D-DGD algorithm and the necessary assumptions. The convergence analysis of D-DGD is studied in Section 3, consisting of the agents' consensus analysis and optimality analysis. The convergence rate analysis and numerical experiments are presented in Sections 4 and 5, respectively. Section 6 contains concluding remarks.

Notation:
We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. We denote by [x]_i the ith component of a vector x, and by [A]_ij the (i, j)th element of a matrix, A. An n-dimensional vector with all elements equal to one (zero) is represented by 1_n (0_n). The notation 0_{n×n} represents an n × n matrix with all elements equal to zero. The inner product of two vectors x and y is ⟨x, y⟩. We use ‖x‖ to denote the standard Euclidean norm.
2. Problem Formulation
Consider a strongly-connected network of n agents communicating over a directed graph, G = (V, E), where V is the set of agents and E is the collection of ordered pairs, (i, j), i, j ∈ V, such that agent j can send information to agent i. Define N_i^in to be the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, N_i^out is defined as the out-neighborhood of agent i, i.e., the set of agents that can receive information from agent i. We allow both N_i^in and N_i^out to include the node i itself. Note that, in a directed graph, N_i^in ≠ N_i^out in general. We focus on solving a convex optimization problem that is distributed over the above network. (See [6, 23, 21, 20, 19, 31] for additional information on average consensus problems.) In particular, the network of agents cooperatively solves the following optimization problem:

P1:   min f(x) = Σ_{i=1}^{n} f_i(x),

where each f_i : R^p → R is convex, not necessarily differentiable, representing the local objective function at agent i.

Assumption 1.
In order to solve the above problem, we make the following assumptions:

(a) The agent graph, G, is strongly-connected.
(b) Each local function, f_i : R^p → R, is convex, ∀i ∈ V.
(c) The solution set of Problem P1 and the corresponding optimal value exist. Formally, we have x* ∈ X* = {x | f(x) = min_{y ∈ R^p} f(y)} and f* = min f(x).
(d) The sub-gradient, ∇f_i(x), is bounded: ‖∇f_i(x)‖ ≤ D for all x ∈ R^p, i ∈ V.

These assumptions are standard in distributed optimization; see the related literature [17] and references therein. Before describing our algorithm, we first recap the DGD algorithm [16], which solves P1 over an undirected graph and requires doubly-stochastic weight matrices. We then analyze how the result of DGD changes when the weight matrices are not doubly-stochastic.
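As a concrete illustration (not taken from the paper's own experiments), the following hedged Python sketch builds a toy instance of Problem P1 with quadratic local costs; the names R, s, grad_f, and the random data are our assumptions. Assumptions 1(a)-(c) are satisfied by such an instance, while the subgradient bound in 1(d) holds on any bounded region containing the iterates.

```python
import numpy as np

# Toy instance of Problem P1: n agents, each holding a private convex cost
# f_i(x) = 0.5 * ||R_i x - s_i||^2; the network objective is f = sum_i f_i.
# R and s are randomly generated, purely for illustration.
rng = np.random.default_rng(0)
n, p = 4, 3
R = [rng.standard_normal((5, p)) for _ in range(n)]
s = [rng.standard_normal(5) for _ in range(n)]

def grad_f(i, x):
    """Gradient of the i-th private cost at x."""
    return R[i].T @ (R[i] @ x - s[i])

def f(x):
    """Global objective: the sum of the n private costs."""
    return sum(0.5 * np.linalg.norm(R[i] @ x - s[i]) ** 2 for i in range(n))

# Centralized benchmark x*: minimizer of f, via the stacked least-squares system.
x_star, *_ = np.linalg.lstsq(np.vstack(R), np.concatenate(s), rcond=None)
```

At x*, the private gradients cancel in the aggregate, Σ_i ∇f_i(x*) = 0; this is the condition that the distributed algorithms discussed below must reach without ever forming the stacked system.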
Consider the Distributed Gradient Descent (DGD) algorithm [16] to solve P1. Agent i updates its estimate as follows:

x_i^{k+1} = Σ_{j=1}^{n} w_ij x_j^k − α_k ∇f_i^k,   (1)

where w_ij is a non-negative weight such that W = {w_ij} is doubly-stochastic, the scalar α_k is a diminishing but non-negative step-size satisfying the persistence conditions [8, 9], Σ_{k=0}^{∞} α_k = ∞ and Σ_{k=0}^{∞} α_k² < ∞, and the vector ∇f_i^k is a sub-gradient of f_i at x_i^k. For the sake of argument, consider W to be row-stochastic but not column-stochastic. Clearly, 1_n is a right eigenvector of W; let π = {π_i} be its left eigenvector corresponding to eigenvalue 1. Taking the π-weighted sum over i in Eq. (1), we get

x̂^{k+1} ≜ Σ_{i=1}^{n} π_i x_i^{k+1} = Σ_{j=1}^{n} (Σ_{i=1}^{n} π_i w_ij) x_j^k − α_k Σ_{i=1}^{n} π_i ∇f_i(x_i^k) = x̂^k − α_k Σ_{i=1}^{n} π_i ∇f_i^k,   (2)

where we used π_j = Σ_{i=1}^{n} π_i w_ij, ∀j. If we assume that the agents reach an agreement, then Eq. (2) can be viewed as an inexact (central) gradient descent (with Σ_{i=1}^{n} π_i ∇f_i(x_i^k) instead of Σ_{i=1}^{n} π_i ∇f_i(x̂^k)) minimizing a new objective, f̂(x) ≜ Σ_{i=1}^{n} π_i f_i(x). As a result, the agents reach consensus and converge to the minimizer of f̂(x).

Now consider the weight matrix, W, to be column-stochastic but not row-stochastic. Let x̄^k be the average of the agents' estimates at time k; then Eq. (1) leads to

x̄^{k+1} ≜ (1/n) Σ_{i=1}^{n} x_i^{k+1} = (1/n) Σ_{j=1}^{n} (Σ_{i=1}^{n} w_ij) x_j^k − (α_k/n) Σ_{i=1}^{n} ∇f_i(x_i^k) = x̄^k − (α_k/n) Σ_{i=1}^{n} ∇f_i^k.   (3)

Eq. (3) reveals that the average, x̄^k, of the agents' estimates follows an inexact (central) gradient descent (with Σ_{i=1}^{n} ∇f_i(x_i^k) instead of Σ_{i=1}^{n} ∇f_i(x̄^k)) and step-size α_k/n, thus reaching the minimizer of f(x).
Despite the fact that the average, x̄^k, reaches the optimum, x*, of f(x), the optimum is not achieved by each individual agent, because consensus cannot be reached with a matrix that is not row-stochastic.

Eqs. (2) and (3) explain the importance of doubly-stochastic matrices in consensus-based optimization: the row-stochasticity guarantees that all of the agents reach a consensus, while the column-stochasticity ensures that each local gradient contributes equally to the global objective. From the above discussion, we note that reaching a consensus requires the right eigenvector of the weight matrix (corresponding to eigenvalue 1) to lie in span{1_n}, while minimizing the global objective requires the corresponding left eigenvector to lie in span{1_n}. Both the left and right eigenvectors of a doubly-stochastic matrix are 1_n, which, in general, is not achievable in directed graphs. In this paper, we introduce Directed-Distributed Gradient Descent (D-DGD), which overcomes the above issues by augmenting an additional variable at each agent, thus constructing a new weight matrix, M ∈ R^{2n×2n}, whose left and right eigenvectors (corresponding to eigenvalue 1) are of the form [1_n^⊤, v^⊤] and [1_n^⊤, u^⊤]^⊤, respectively. Formally, we describe D-DGD as follows.

At the kth iteration, each agent, j ∈ V, maintains two vectors: x_j^k and y_j^k, both in R^p. Agent j sends its state estimate, x_j^k, as well as a weighted auxiliary variable, b_ij y_j^k, to each out-neighbor, i ∈ N_j^out, where the weights b_ij satisfy

b_ij > 0 for i ∈ N_j^out, b_ij = 0 otherwise, with Σ_{i=1}^{n} b_ij = 1, ∀j.

Agent i updates the variables, x_i^{k+1} and y_i^{k+1}, with the information received from its in-neighbors, j ∈ N_i^in, as follows:

x_i^{k+1} = Σ_{j=1}^{n} a_ij x_j^k + ε y_i^k − α_k ∇f_i(x_i^k),   (4a)
y_i^{k+1} = x_i^k − Σ_{j=1}^{n} a_ij x_j^k + Σ_{j=1}^{n} b_ij y_j^k − ε y_i^k,   (4b)

where

a_ij > 0 for j ∈ N_i^in, a_ij = 0 otherwise, with Σ_{j=1}^{n} a_ij = 1, ∀i.
The diminishing step-size, α_k ≥ 0, satisfies the persistence
conditions [8, 9]: Σ_{k=0}^{∞} α_k = ∞ and Σ_{k=0}^{∞} α_k² < ∞. The scalar ε is a small positive number, which plays a key role in the convergence of the algorithm. (Note that in the implementation of Eq. (4), each agent needs the knowledge of its out-neighbors. In a more restricted setting, e.g., a broadcast application where it may not be possible to know the out-neighbors, we may use b_ij = |N_j^out|^{−1}; the implementation then only requires knowing the out-degrees; see, e.g., [15, 14] for similar assumptions.) For an illustration of the message passing in the implementation of Eq. (4), see Fig. 1, which shows how agent i sends information to its out-neighbors and how agent l receives information from its in-neighbors. In Fig. 1, the weights b_{j1 i} and b_{j2 i} are designed by agent i and satisfy b_ii + b_{j1 i} + b_{j2 i} = 1.

Figure 1: Illustration of the message passing between agents in Eq. (4).

To analyze the algorithm, we define z_i^k ∈ R^p, g_i^k ∈ R^p, and M ∈ R^{2n×2n} as follows:

z_i^k = x_i^k and g_i^k = ∇f_i(x_i^k), for i ∈ {1, ..., n},
z_i^k = y_{i−n}^k and g_i^k = 0_p, for i ∈ {n+1, ..., 2n},

M = [ A      εI
      I − A  B − εI ],   (5)

where A = {a_ij} is row-stochastic and B = {b_ij} is column-stochastic. Consequently, Eq. (4) can be represented compactly as follows: for any i ∈ {1, ..., 2n}, at the (k+1)th iteration,

z_i^{k+1} = Σ_{j=1}^{2n} [M]_ij z_j^k − α_k g_i^k.   (6)

We refer to the iterative relation in Eq. (6) as the Directed-Distributed Gradient Descent (D-DGD) method, since it has the same form as DGD except that the dimension doubles due to the new weight matrix, M ∈ R^{2n×2n}, defined in Eq. (5). It is worth mentioning that even though Eq. (6) looks similar to DGD [16], the convergence analysis of D-DGD does not exactly follow that of DGD. This is because the weight matrix, M, has negative entries.
Besides, M is not a doubly-stochastic matrix, i.e., its rows do not sum to 1. Hence, the tools used in the analysis of DGD are not applicable; e.g., ‖Σ_j [M]_ij z_j − x*‖ ≤ Σ_j [M]_ij ‖z_j − x*‖ does not necessarily hold, because the [M]_ij are not all non-negative. In the next section, we prove the convergence of D-DGD.
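The algebra above can be sanity-checked numerically. The sketch below is an illustration under our own assumptions (a 3-agent directed ring with self-loops, uniform weights, and ε = 0.1, all chosen for convenience rather than taken from the paper): it builds A, B, and M as in Eq. (5), confirms that the columns of M sum to one even though M is neither non-negative nor row-stochastic, and shows that the powers M^k approach the rank-one limit established in Section 3 (Lemma 2).

```python
import numpy as np

# Toy directed graph on n = 3 agents: a directed ring with self-loops.
# adj[i, j] = 1 iff agent j can send information to agent i.
n = 3
adj = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 1]])
A = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic (in-neighbor weights a_ij)
B = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic (out-neighbor weights b_ij)

eps = 0.1   # small positive epsilon; admissible for this toy graph (checked numerically)
I = np.eye(n)
M = np.block([[A,     eps * I],
              [I - A, B - eps * I]])       # the 2n x 2n weight matrix of Eq. (5)

# Columns sum to one, but M has negative entries and its rows do not sum to one.
assert np.allclose(M.sum(axis=0), 1.0)
assert (M < 0).any() and not np.allclose(M.sum(axis=1), 1.0)

# Powers of M approach the rank-one limit [[ones/n, ones/n], [0, 0]] geometrically.
limit = np.block([[np.ones((n, n)) / n, np.ones((n, n)) / n],
                  [np.zeros((n, n)),    np.zeros((n, n))]])
gap = np.abs(np.linalg.matrix_power(M, 200) - limit).max()
print(f"max entrywise gap after 200 powers: {gap:.2e}")
```

For this particular graph the uniform weights make A and B circulant, so the spectrum of M is easy to inspect; in general, ε must satisfy the bound of Lemma 1 below.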
3. Convergence Analysis
The convergence analysis of D-DGD can be divided into two parts. In the first part, we discuss the consensus property of D-DGD, i.e., we capture the decrease in ‖z_i^k − z̄^k‖, for i ∈ {1, ..., n}, as D-DGD progresses, where we define z̄^k as the accumulation point:

z̄^k ≜ (1/n) Σ_{j=1}^{2n} z_j^k = (1/n) Σ_{j=1}^{n} x_j^k + (1/n) Σ_{j=1}^{n} y_j^k.   (7)

The decrease in ‖z_i^k − z̄^k‖ reveals that all agents approach a common accumulation point. In the second part, we show the optimality property, i.e., the decrease in the difference between the function evaluated at the accumulation point and the optimal value, f(z̄^k) − f(x*). We combine the two parts to establish convergence.

3.1. Consensus Property

To show the consensus property, we study the convergence behavior of the weight matrices, M^k, in Eq. (5) as k goes to infinity. We use an existing result on such matrices, based on which we show the convergence behavior as well as the convergence rate. We borrow the following from [3].

Lemma 1. (Cai et al. [3]) Assume the graph is strongly-connected. Let M be the weight matrix defined in Eq. (5), and let the constant ε in M satisfy ε ∈ (0, Υ), where Υ := c(n)ⁿ (1 − |λ|)ⁿ, c(n) is a constant depending only on n (see [3] for its explicit form), and λ is the third largest eigenvalue of M in Eq. (5) obtained by setting ε = 0. Then the weight matrix, M, defined in Eq. (5), has a simple eigenvalue 1, and all its other eigenvalues have magnitude smaller than one.

Based on Lemma 1, we now provide the convergence behavior as well as the convergence rate of the weight matrix, M.

Lemma 2.
Assume that the network is strongly-connected, and let M be the weight matrix defined in Eq. (5). Then:

(a) The sequence {M^k} converges, as k goes to infinity, to the following limit:

lim_{k→∞} M^k = [ (1/n) 1_n 1_n^⊤   (1/n) 1_n 1_n^⊤
                  0_{n×n}            0_{n×n} ];

(b) For all i, j, the entries [M^k]_ij converge to their limits as k → ∞ at a geometric rate, i.e., there exist bounded constants, Γ ∈ R and 0 < γ < 1, such that

‖ M^k − lim_{k→∞} M^k ‖_∞ ≤ Γ γ^k.

Proof 1.
Note that the sum of each column of M equals one, so 1 is an eigenvalue of M with a corresponding left (row) eigenvector [1_n^⊤ 1_n^⊤]. We further have M [1_n^⊤ 0_n^⊤]^⊤ = [1_n^⊤ 0_n^⊤]^⊤, so [1_n^⊤ 0_n^⊤]^⊤ is a right (column) eigenvector corresponding to the eigenvalue 1. According to Lemma 1, 1 is a simple eigenvalue of M, and all other eigenvalues have magnitude smaller than one. We represent M^k in the Jordan canonical form, for some P_i and Q_i:

M^k = (1/n) [1_n^⊤ 0_n^⊤]^⊤ [1_n^⊤ 1_n^⊤] + Σ_{i=2}^{2n} P_i J_i^k Q_i,   (8)

where the diagonal entries of each J_i are smaller than one in magnitude. Statement (a) follows by noting that lim_{k→∞} J_i^k = 0 for all i.

From Eq. (8), and the fact that all eigenvalues of M except 1 have magnitude smaller than one, there exist bounded constants, Γ and γ ∈ (0, 1), such that

‖ M^k − lim_{k→∞} M^k ‖_∞ = ‖ Σ_{i=2}^{2n} P_i J_i^k Q_i ‖_∞ ≤ Σ_{i=2}^{2n} ‖P_i‖_∞ ‖Q_i‖_∞ ‖J_i^k‖_∞ ≤ Γ γ^k,

from which we get the desired result. □

Using the result of Lemma 1, Lemma 2 characterizes the convergence behavior of the powers of the weight matrix and shows that this convergence is geometric. Lemma 2 plays a key role in proving the consensus properties of D-DGD. Based on Lemma 2, we bound the difference between agent estimates in the following lemma. More specifically, we show that the agent estimates, x_i^k, approach the accumulation point, z̄^k, and the auxiliary variables, y_i^k, go to 0_p, where z̄^k is defined in Eq. (7).

Lemma 3. Let Assumption 1 hold, and let {z_i^k} be the sequence over k generated by the D-DGD algorithm, Eq. (6).
Then, there exist bounded constants, Γ and 0 < γ < 1, such that:

(a) for 1 ≤ i ≤ n and k ≥ 1,

‖z_i^k − z̄^k‖ ≤ Γ γ^k Σ_{j=1}^{2n} ‖z_j^0‖ + 2nΓD Σ_{r=1}^{k−1} γ^{k−r} α_{r−1} + 2D α_{k−1};

(b) for n+1 ≤ i ≤ 2n and k ≥ 1,

‖z_i^k‖ ≤ Γ γ^k Σ_{j=1}^{2n} ‖z_j^0‖ + 2nΓD Σ_{r=1}^{k−1} γ^{k−r} α_{r−1}.

Proof 2.
For any k ≥ 1, we write Eq. (6) recursively:

z_i^k = Σ_{j=1}^{2n} [M^k]_ij z_j^0 − Σ_{r=1}^{k−1} Σ_{j=1}^{2n} [M^{k−r}]_ij α_{r−1} g_j^{r−1} − α_{k−1} g_i^{k−1}.   (9)

Since every column of M sums to one, we have Σ_{i=1}^{2n} [M^r]_ij = 1 for any r. Considering the recursive relation of z_i^k in Eq. (9), we obtain that z̄^k can be represented as

z̄^k = (1/n) Σ_{j=1}^{2n} z_j^0 − Σ_{r=1}^{k−1} (1/n) Σ_{j=1}^{2n} α_{r−1} g_j^{r−1} − (1/n) Σ_{j=1}^{2n} α_{k−1} g_j^{k−1}.   (10)

Subtracting Eq. (10) from Eq. (9) and taking the norm, we obtain, for 1 ≤ i ≤ n,

‖z_i^k − z̄^k‖ ≤ Σ_{j=1}^{2n} |[M^k]_ij − 1/n| ‖z_j^0‖ + Σ_{r=1}^{k−1} Σ_{j=1}^{2n} |[M^{k−r}]_ij − 1/n| α_{r−1} ‖g_j^{r−1}‖ + α_{k−1} ‖g_i^{k−1}‖ + (1/n) Σ_{j=1}^{2n} α_{k−1} ‖g_j^{k−1}‖.   (11)

The proof of part (a) follows by applying the result of Lemma 2 to Eq. (11) and noting that the gradients are bounded by the constant D. Similarly, by taking the norm of Eq. (9), we obtain, for n+1 ≤ i ≤ 2n,

‖z_i^k‖ ≤ Σ_{j=1}^{2n} |[M^k]_ij| ‖z_j^0‖ + Σ_{r=1}^{k−1} Σ_{j=1}^{2n} |[M^{k−r}]_ij| α_{r−1} ‖g_j^{r−1}‖.

The proof of part (b) follows by applying the result of Lemma 2 to the preceding relation and using the boundedness of the gradients in Assumption 1(d). □
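The consensus mechanism captured by Lemma 3 can be isolated by switching the gradient term off (α_k = 0), in which case Eq. (6) reduces to the linear iteration z^{k+1} = M z^k. In the hedged sketch below (the same illustrative 3-agent ring and ε as earlier, our own choices rather than the paper's), the x-part of z converges to the exact average of the initial values while the auxiliary y-part vanishes, i.e., the agents reach average-consensus over a digraph without any doubly-stochastic matrix:

```python
import numpy as np

# Illustrative 2n x 2n weight matrix M of Eq. (5): 3-agent directed ring with
# self-loops, uniform weights, eps = 0.1 (admissible here, checked numerically).
n = 3
adj = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 1]])
A = adj / adj.sum(axis=1, keepdims=True)
B = adj / adj.sum(axis=0, keepdims=True)
eps, I = 0.1, np.eye(n)
M = np.block([[A,     eps * I],
              [I - A, B - eps * I]])

# Gradient term off: z^{k+1} = M z^k.  Column sums of one conserve sum(z),
# so the agents' states settle on the average of the initial values.
x0 = np.array([3.0, -1.0, 5.0])
z = np.concatenate([x0, np.zeros(n)])   # y initialized at zero
for _ in range(300):
    z = M @ z

print(z[:n])   # x-part: every agent near x0.mean()
print(z[n:])   # y-part: near zero
```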
Using the above lemma, we now draw our first conclusion on the consensus property of the agents: Proposition 1 reveals that all agents asymptotically reach consensus.
Proposition 1.
Let Assumption 1 hold, and let {z_i^k} be the sequence over k generated by the D-DGD algorithm, Eq. (6). Then z_i^k satisfies:

(a) for 1 ≤ i ≤ n,   Σ_{k=1}^{∞} α_k ‖z_i^k − z̄^k‖ < ∞;

(b) for n+1 ≤ i ≤ 2n,   Σ_{k=1}^{∞} α_k ‖z_i^k‖ < ∞.

Proof 3.
Based on the result of Lemma 3(a), we obtain, for 1 ≤ i ≤ n,

Σ_{k=1}^{K} α_k ‖z_i^k − z̄^k‖ ≤ Γ Σ_{j=1}^{2n} ‖z_j^0‖ Σ_{k=1}^{K} α_k γ^k + 2nΓD Σ_{k=1}^{K} Σ_{r=1}^{k−1} γ^{k−r} α_k α_{r−1} + 2D Σ_{k=0}^{K−1} α_k².   (12)

With the basic inequality ab ≤ (a² + b²)/2, a, b ∈ R, we have

Σ_{k=1}^{K} α_k γ^k ≤ (1/2) Σ_{k=1}^{K} α_k² + 1/(2(1 − γ²)),

and

Σ_{k=1}^{K} Σ_{r=1}^{k−1} γ^{k−r} α_k α_{r−1} ≤ (1/2) Σ_{k=1}^{K} α_k² Σ_{r=1}^{k−1} γ^{k−r} + (1/2) Σ_{r=1}^{K−1} α_{r−1}² Σ_{k=r+1}^{K} γ^{k−r} ≤ (1/(1 − γ)) Σ_{k=0}^{K} α_k².

The proof of part (a) follows by applying the preceding relations to Eq. (12), along with Σ_{k=0}^{K} α_k² < ∞ as K → ∞. Part (b) follows in the same spirit. □

Since Σ_{k=1}^{∞} α_k = ∞, Proposition 1 shows that all agents reach consensus at the accumulation point, z̄^k, asymptotically, i.e., for all 1 ≤ i, j ≤ n,

lim_{k→∞} z_i^k = lim_{k→∞} z̄^k = lim_{k→∞} z_j^k,   (13)

and, for n+1 ≤ i ≤ 2n, the states z_i^k asymptotically converge to zero, i.e.,

lim_{k→∞} z_i^k = 0.   (14)

We next show how the accumulation point, z̄^k, approaches the optimum, x*, as D-DGD progresses.

3.2. Optimality Property

The following lemma gives an upper bound on the difference between the objective evaluated at the accumulation point, f(z̄^k), and the optimal objective value, f*.

Lemma 4.
Let Assumption 1 hold, and let {z_i^k} be the sequence over k generated by the D-DGD algorithm, Eq. (6). Then

Σ_{k=0}^{∞} α_k (f(z̄^k) − f*) ≤ (n/2) ‖z̄^0 − x*‖² + (nD²/2) Σ_{k=0}^{∞} α_k² + 2D Σ_{i=1}^{n} Σ_{k=0}^{∞} α_k ‖z_i^k − z̄^k‖.   (15)

Proof 4.
Consider Eq. (6) and the fact that each column of M sums to one; we have

z̄^{k+1} = (1/n) Σ_{j=1}^{2n} (Σ_{i=1}^{2n} [M]_ij) z_j^k − (α_k/n) Σ_{i=1}^{2n} g_i^k = z̄^k − (α_k/n) Σ_{i=1}^{n} ∇f_i(z_i^k).

Therefore, we obtain

‖z̄^{k+1} − x*‖² = ‖z̄^k − x*‖² + ‖(α_k/n) Σ_{i=1}^{n} ∇f_i(z_i^k)‖² − (2α_k/n) Σ_{i=1}^{n} ⟨z̄^k − x*, ∇f_i(z_i^k)⟩.   (16)

Denote ∇f_i^k = ∇f_i(z_i^k). Since ‖∇f_i^k‖ ≤ D, we have

⟨z̄^k − x*, ∇f_i^k⟩ = ⟨z̄^k − z_i^k, ∇f_i^k⟩ + ⟨z_i^k − x*, ∇f_i^k⟩
  ≥ ⟨z̄^k − z_i^k, ∇f_i^k⟩ + f_i(z_i^k) − f_i(x*)
  ≥ −D ‖z_i^k − z̄^k‖ + f_i(z_i^k) − f_i(z̄^k) + f_i(z̄^k) − f_i(x*)
  ≥ −2D ‖z_i^k − z̄^k‖ + f_i(z̄^k) − f_i(x*).   (17)

By substituting Eq. (17) in Eq. (16) and rearranging the terms, we obtain

α_k (f(z̄^k) − f*) ≤ (n/2) ‖z̄^k − x*‖² − (n/2) ‖z̄^{k+1} − x*‖² + (nD²/2) α_k² + 2D Σ_{i=1}^{n} α_k ‖z_i^k − z̄^k‖.   (18)

The desired result is achieved by summing Eq. (18) over k from 0 to ∞. □

We are now ready to present the main result of this paper, obtained by combining all the preceding results.
Theorem 1.
Let Assumption 1 hold, and let {z_i^k} be the sequence over k generated by the D-DGD algorithm, Eq. (6). Then, for any agent i, we have

lim_{k→∞} f(z_i^k) = f*.

Proof 5.
Since the step-size satisfies Σ_{k=0}^{∞} α_k² < ∞, and since Σ_{k=0}^{∞} α_k ‖z_i^k − z̄^k‖ < ∞ by Proposition 1, we obtain from Eq. (15) that

Σ_{k=0}^{∞} α_k (f(z̄^k) − f*) < ∞,   (19)

which, together with Σ_{k=0}^{∞} α_k = ∞, reveals that lim_{k→∞} f(z̄^k) = f*. In Eq. (13), we have already shown that lim_{k→∞} z_i^k = lim_{k→∞} z̄^k. Therefore, we obtain the desired result. □
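Theorem 1 can be exercised end-to-end in a short script. The sketch below is an assumption-laden illustration (3-agent directed ring, scalar quadratic costs f_i(x) = ½(x − c_i)², step-size constant 0.5, and ε = 0.3 are all our choices, not the paper's experiment): each agent's estimate approaches the global minimizer x* = mean(c), while the auxiliary variables vanish, matching Eqs. (13)-(14) and Theorem 1.

```python
import numpy as np

# End-to-end D-DGD, Eq. (6), on a toy scalar problem.  Three agents hold
# f_i(x) = 0.5 * (x - c_i)^2 with c = (0, 3, 9); the minimizer of
# f = sum_i f_i is x* = mean(c) = 4.
n = 3
c = np.array([0.0, 3.0, 9.0])
adj = np.array([[1, 0, 1],       # adj[i, j] = 1 iff j can send to i
                [1, 1, 0],
                [0, 1, 1]])
A = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic
B = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic
eps, I = 0.3, np.eye(n)                    # eps admissible for this graph (checked numerically)
M = np.block([[A,     eps * I],
              [I - A, B - eps * I]])

z = np.zeros(2 * n)                        # stacked [x; y], everything starts at 0
for k in range(20000):
    alpha = 0.5 / np.sqrt(k + 1)           # diminishing step-size, as in Section 4
    g = np.concatenate([z[:n] - c, np.zeros(n)])   # stacked gradients; zero on the y-part
    z = M @ z - alpha * g                  # one D-DGD iteration, Eq. (6)

print(z[:n])   # agents' estimates, all close to x* = 4
print(z[n:])   # auxiliary variables, close to 0
```

A smaller ε slows the geometric mixing of M (the factor γ in Lemma 2), which shows up here as a larger residual consensus error for a fixed iteration budget.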
4. Convergence Rate
In this section, we derive the convergence rate of D-DGD. Let f_m := min_{0≤k≤K} f(z̄^k). We have

(f_m − f*) Σ_{k=0}^{K} α_k ≤ Σ_{k=0}^{K} α_k (f(z̄^k) − f*).   (20)

By combining Eqs. (12), (15), and (20), it can be verified that Eq. (15) can be represented in the following form:

(f_m − f*) Σ_{k=0}^{K} α_k ≤ C₁ + C₂ Σ_{k=0}^{K} α_k²,

or, equivalently,

(f_m − f*) ≤ C₁ / (Σ_{k=0}^{K} α_k) + C₂ (Σ_{k=0}^{K} α_k²) / (Σ_{k=0}^{K} α_k),   (21)

where, collecting the corresponding terms from Eqs. (12) and (15), the constants can be taken as

C₁ = (n/2) ‖z̄^0 − x*‖² − (n/2) ‖z̄^{K+1} − x*‖² + nDΓ (Σ_{j=1}^{2n} ‖z_j^0‖) / (1 − γ²),
C₂ = nD²/2 + nDΓ Σ_{j=1}^{2n} ‖z_j^0‖ + 4n²D²Γ / (1 − γ) + 4nD².

Eq. (21) has the same form as the equations arising in the convergence rate analysis of DGD (see, e.g., [16]). In particular, when α_k = k^{−1/2}, the first term in Eq. (21) satisfies

C₁ / Σ_{k=0}^{K} α_k = O(1/√K),

while the second term in Eq. (21) satisfies

C₂ (Σ_{k=0}^{K} α_k²) / (Σ_{k=0}^{K} α_k) ≤ C₂ (1 + ln K) / (2(√(K+1) −
1)) = O(ln K / √K).

It can be observed that the second term dominates, so the overall convergence rate is O(ln k / √k). As a result, D-DGD has the same convergence rate as DGD: the restriction to directed graphs does not affect the speed.

5. Numerical Experiments

We consider a distributed least squares problem over a directed graph: each agent owns a private objective function, s_i = R_i x + n_i, where s_i ∈ R^{m_i} and R_i ∈ R^{m_i×p} are measured data, x ∈ R^p is the unknown state, and n_i ∈ R^{m_i} is random unknown noise. The goal is to estimate x. This problem can be formulated as the distributed optimization problem

min f(x) = (1/n) Σ_{i=1}^{n} ‖R_i x − s_i‖².

We consider the network topologies given by the digraphs shown in Fig. 2 and employ the same settings and graphs as [3]; following [3], the value ε = 0.7 is admissible for G_a, G_b, and G_c.
Figure 2: Three examples of strongly-connected but non-balanced digraphs, G_a, G_b, and G_c.

Fig. 3 shows the convergence of the D-DGD algorithm for the three digraphs displayed in Fig. 2. Once the weight matrix, M, defined in Eq. (5), converges, D-DGD ensures convergence. Moreover, it can be observed that the residuals decrease faster as the number of edges increases, from G_a to G_c. This indicates faster convergence when more communication channels are available for information exchange.

Figure 3: Plot of residuals ‖x^k − x*‖_F / ‖x^0 − x*‖_F for digraphs G_a, G_b, and G_c as D-DGD progresses.

In Fig. 4, we display the trajectories of both states, x and y, when D-DGD, Eq. (6), is applied to digraph G_a with parameter ε = 0.7. Recall that in Eqs. (13) and (14) we showed that, as time k goes to infinity, the states x_i^k of all agents converge to the same accumulation point, z̄^k, which is the optimal solution of the problem, while the auxiliary variables y_i^k of all agents converge to zero; both behaviors are visible in Fig. 4.

Figure 4: Sample paths of the states, x_i^k and y_i^k, for all agents on digraph G_a with ε = 0.7.

In Fig. 5, we also show the convergence behavior of two other algorithms on the same digraph, G_a. The blue line plots the residuals of a DGD algorithm using a row-stochastic weight matrix. As discussed in Section 2, when the weight matrix is restricted to be row-stochastic, DGD actually minimizes the new objective function f̂(x) = Σ_{i=1}^{n} π_i f_i(x), where π = {π_i} is the left eigenvector of the weight matrix corresponding to eigenvalue 1; it therefore does not converge to the true x*. The black curve shows the convergence behavior of the gradient-push algorithm proposed in [15, 14]. Our algorithm has the same convergence rate as gradient-push, which is O(ln k / √k).
Figure 5: Plot of residuals ‖x^k − x*‖_F / ‖x^0 − x*‖_F as (D-)DGD progresses.
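The bias of the row-stochastic-only DGD baseline in Fig. 5 is easy to reproduce. The hedged sketch below uses our own toy digraph and scalar costs (not the paper's ten-node graphs): DGD with a row-stochastic W still reaches consensus, but on the minimizer of f̂(x) = Σ_i π_i f_i(x) rather than of f, exactly as Eq. (2) predicts.

```python
import numpy as np

# DGD, Eq. (1), with a row-stochastic (but not column-stochastic) W.
# f_i(x) = 0.5 * (x - c_i)^2, so f = sum_i f_i is minimized at mean(c) = 4,
# whereas DGD converges to the pi-weighted point pi . c (here 10/3).
n = 3
c = np.array([0.0, 3.0, 9.0])
W = np.array([[1/3, 1/3, 1/3],
              [1/2, 1/2, 0.0],
              [0.0, 1/2, 1/2]])   # rows sum to 1; columns do not

# Left eigenvector of W for eigenvalue 1, normalized: pi = (3, 4, 2)/9 here.
vals, vecs = np.linalg.eig(W.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

x = np.zeros(n)
for k in range(20000):
    alpha = 0.5 / np.sqrt(k + 1)
    x = W @ x - alpha * (x - c)   # DGD iteration, Eq. (1)

print(x)        # agents agree...
print(pi @ c)   # ...near the biased point 10/3, not near x* = 4
```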
6. Conclusion
In this paper, we describe a distributed algorithm, called Directed-Distributed Gradient Descent (D-DGD), to solve the problem of minimizing a sum of convex objective functions over a directed graph. Existing distributed algorithms, e.g., Distributed Gradient Descent (DGD), deal with the same problem under the assumption of undirected networks. The primary reason behind assuming undirected graphs is to obtain a doubly-stochastic weight matrix. The row-stochasticity of the weight matrix guarantees that all agents reach consensus, while the column-stochasticity ensures optimality, i.e., that each agent's local gradient contributes equally to the global objective. In a directed graph, however, it may not be possible to construct a doubly-stochastic weight matrix in a distributed manner. In each iteration of D-DGD, we simultaneously construct a row-stochastic matrix and a column-stochastic matrix instead of only a doubly-stochastic matrix. The convergence of the new weight matrix, depending on the row-stochastic and column-stochastic matrices, ensures that agents reach both consensus and optimality. The analysis shows that D-DGD converges at a rate of O(ln k / √k), where k is the number of iterations.

References

[1] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli. Weighted gossip: Distributed averaging using non-doubly stochastic matrices. In
IEEE International Symposium on Information Theory, pages 1753–1757, Jun. 2010.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, Jan. 2011.
[3] K. Cai and H. Ishii. Average consensus on general strongly connected digraphs. Automatica, 48(11):2750–2761, 2012.
[4] L. Chunlin and L. Layuan. A distributed multiple dimensional QoS constrained resource scheduling optimization policy in computational grid. Journal of Computer and System Sciences, 72(4):706–726, 2006.
[5] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, Mar. 2012.
[6] A. Jadbabaie, J. Lin, and A. S. Morse. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control, 48(6):988–1001, Jun. 2003.
[7] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In , pages 482–491, Oct. 2003.
[8] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.
[9] I. Lobel, A. Ozdaglar, and D. Feijer. Distributed multi-agent optimization with state-dependent communication. Mathematical Programming, 129(2):255–284, 2011.
[10] G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, Oct. 2010.
[11] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Puschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, May 2013.
[12] I. Necoara. Random coordinate descent algorithms for multi-agent convex optimization over networks. IEEE Transactions on Automatic Control, 58(8):2001–2012, Aug. 2013.
[13] I. Necoara and J. A. K. Suykens. Application of a smoothing technique to decomposition in convex optimization. IEEE Transactions on Automatic Control, 53(11):2674–2679, Dec. 2008.
[14] A. Nedic and A. Olshevsky. Distributed optimization over time-varying directed graphs. In , pages 6855–6860, Florence, Italy, Dec. 2013.
[15] A. Nedic and A. Olshevsky. Distributed optimization over time-varying directed graphs.
IEEE Transactions on AutomaticControl , PP(99):1–1, 2014.[16] A. Nedic and A. Ozdaglar. Distributed subgradient methodsfor multi-agent optimization.
IEEE Transactions on AutomaticControl , 54(1):48–61, Jan. 2009.[17] A. Nedic, A. Ozdaglar, and P. A. Parrilo. Constrained consensusand optimization in multi-agent networks.
IEEE Transactionson Automatic Control , 55(4):922–938, Apr. 2010.[18] G. Neglia, G. Reina, and S. Alouf. Distributed gradient op-timization for epidemic routing: A preliminary evaluation. In , pages 1–6, Paris, Dec. 2009.[19] R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus andcooperation in networked multi-agent systems.
Proceedings ofthe IEEE , 95(1):215–233, Jan. 2007.[20] R. Olfati-Saber and R. M. Murray. Consensus protocols fornetworks of dynamic agents. In
IEEE American Control Con-ference , volume 2, pages 951–956, Denver, Colorado, Jun. 2003.[21] R. Olfati-Saber and R. M. Murray. Consensus problems innetworks of agents with switching topology and time-delays.
IEEE Transactions on Automatic Control , 49(9):1520–1533,Sep. 2004.[22] M. Rabbat and R. Nowak. Distributed optimization in sensornetworks. In , pages 20–27, Berkeley, CA, Apr.2004.[23] C. W. Reynolds. Flocks, herds and schools: A distributedbehavioral model. In , pages 25–34, New York,NY, USA, 1987. ACM.[24] W. Shi, Q. Ling, K Yuan, G Wu, and W Yin. On the lin-ear convergence of the admm in decentralized consensus opti-mization.
IEEE Transactions on Signal Processing , 62(7):1750–1761, April 2014.[25] K. I. Tsianos.
The role of the Network in Distributed Optimiza-tion Algorithms: Convergence Rates, Scalability, Communica-tion/Computation Tradeoffs and Communication Delays . PhDthesis, Dept. Elect. Comp. Eng. McGill University, 2013.[26] K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Consensus-baseddistributed optimization: Practical issues and applications inlarge-scale machine learning. In , pages 1543–1550, Monticello, IL, USA, Oct. 2012.[27] K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Push-sum dis-tributed dual averaging for convex optimization. In , pages 5453–5458,Maui, Hawaii, Dec. 2012.[28] J. N. Tsitsiklis.
Problems in Decentralized Decision Makingand Computation . PhD thesis, Dept. Elect. Eng. Comp. Sci.,Massachusetts Institute of Technology, Cambridge, 1984.[29] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributedasynchronous deterministic and stochastic gradient optimiza-tion algorithms.
IEEE Transactions on Automatic Control ,31(9):803–812, Sep. 1986.[30] E. Wei and A. Ozdaglar. Distributed alternating directionmethod of multipliers. In , pages 5445–5450, Dec. 2012.[31] L. Xiao, S. Boyd, and S. J. Kim. Distributed average consen-sus with least-mean-square deviation.
Journal of Parallel andDistributed Computing , 67(1):33 – 46, 2007., 67(1):33 – 46, 2007.