A linear algorithm for optimization over directed graphs with geometric convergence
Ran Xin, Student Member, IEEE, and Usman A. Khan, Senior Member, IEEE
Abstract
In this letter, we study distributed optimization, where a network of agents, abstracted as a directed graph, collaborates to minimize the average of locally-known convex functions. Most of the existing approaches over directed graphs are based on push-sum (type) techniques, which use an independent algorithm to asymptotically learn either the left or the right eigenvector of the underlying weight matrices. This strategy causes additional computation, communication, and nonlinearity in the algorithm. In contrast, we propose a linear algorithm based on an inexact gradient method and a gradient estimation technique. Under the assumption that each local function is strongly-convex with Lipschitz-continuous gradients, we show that the proposed algorithm geometrically converges to the global minimizer with a sufficiently small step-size. We present simulations to illustrate the theoretical findings.
Index Terms
Distributed optimization, directed graphs
I. INTRODUCTION
In this letter, we consider distributed optimization over multi-agent networks. Formally, each agent $i$ has access only to a private function, $f_i : \mathbb{R}^p \rightarrow \mathbb{R}$. The goal is to minimize the average of these functions, $\frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x})$, via information exchange among the agents. We focus on the case where the communication network is described by an arbitrary directed graph. Early work on distributed optimization includes distributed sub-gradient descent (DGD) [1], which converges to the optimal solution at a sublinear rate, i.e., $O(\frac{\ln k}{\sqrt{k}})$ for arbitrary (possibly non-differentiable) convex functions and $O(\frac{\ln k}{k})$ for strongly-convex functions, where $k$ is the number of iterations.

R. Xin and U. A. Khan are with the Department of Electrical and Computer Engineering, Tufts University, 161 College Ave, Medford, MA 02155; [email protected], [email protected]. This work has been partially supported by an NSF Career Award.

These methods are slow due to the diminishing step-sizes. With the help of strong-convexity and Lipschitz-continuous gradients, algorithms with faster convergence rates have been developed. In particular, DGD with a constant step-size [2] converges geometrically to an error ball around the optimal solution. Another method, EXTRA [3], achieves geometric convergence to the global optimal solution with the requirement of symmetric weights. Of relevance are Refs. [4]–[7], which combine inexact gradient methods and a gradient estimation technique based on dynamic average consensus [8]. Additional related work and applications can be found in [9]–[14].

All of the aforementioned methods require the underlying graphs to be undirected or weight-balanced. This requirement, however, may not be practical, for example, when the agents broadcast at different power levels, leading to communication capability in one direction but not in the other.
It is thus natural to develop optimization and learning algorithms that are applicable to directed graphs. The primary challenge in dealing with directed graphs is that it may not be possible to construct doubly-stochastic weight matrices for information fusion. The weighted adjacency matrix of a directed graph, in general, may only be either row-stochastic or column-stochastic, but not both. See [15] for work on balancing the weights in strongly-connected directed graphs.

The existing approaches for optimization over directed graphs are motivated by combining average-consensus methods developed for directed graphs with optimization algorithms designed for undirected graphs. For instance, subgradient-push, introduced in [16] and further studied in [17], combines push-sum consensus [18] and DGD; a linear algorithm over directed graphs, called Directed-Distributed Gradient Descent (D-DGD), was introduced in [19], [20] and is based on surplus consensus [21] and DGD. Such DGD-based methods, however, restricted by the diminishing step-size, converge relatively slowly, at $O(\frac{\ln k}{\sqrt{k}})$ for general convex functions and $O(\frac{\ln k}{k})$ for strongly-convex functions. The convergence rate was recently improved in DEXTRA [22], which converges geometrically to the global optimal given that its step-size lies in an interval and the objective functions are strongly-convex with Lipschitz-continuous gradients. DEXTRA was subsequently improved in ADD-OPT/Push-DIGing [23], [24], which converge geometrically with a sufficiently small step-size. The implementation of DEXTRA and ADD-OPT/Push-DIGing requires each agent to know its out-degree in order to construct a column-stochastic weight matrix. This requirement is later removed in [25] and FROST [26], which use row-stochastic weights and thus require no knowledge of out-degrees, as each agent locally decides the weights assigned to the incoming information.
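Either stochasticity pattern is easy to realize locally, in contrast with doubly-stochastic weights, which generally require the balancing procedures of [15]. As a minimal sketch (Python with NumPy; the 4-node digraph is a hypothetical example), the two constructions, and the fact that the row-stochastic one is generally not column-stochastic, can be checked directly:

```python
import numpy as np

# Hypothetical 4-node strongly connected digraph; adj[i, j] = 1 iff agent j
# sends to agent i (self-loops included, as required later in the letter).
adj = np.array([[1, 0, 0, 1],
                [1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 1, 1, 1]])

# Row-stochastic A: each receiver i averages over its in-neighbors.
A = adj / adj.sum(axis=1, keepdims=True)
# Column-stochastic B: each sender j splits its message by its out-degree.
B = adj / adj.sum(axis=0, keepdims=True)

print(np.allclose(A.sum(axis=1), 1.0))   # rows of A sum to 1
print(np.allclose(B.sum(axis=0), 1.0))   # columns of B sum to 1
print(np.allclose(A.sum(axis=0), 1.0))   # A is NOT column-stochastic here
```

Note that both constructions use only information available at a single agent: its in-neighbor list for $A$, its out-degree for $B$.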
What is common among these fast methods over directed graphs is that they are all based on push-sum (type) techniques, which make the resulting algorithm nonlinear because an independent algorithm is used to asymptotically learn either the right or the left eigenvector, corresponding to the eigenvalue of $1$, of the weight matrix. This strategy causes additional computation and communication on the agents.

In this paper, we provide a linear distributed optimization algorithm that converges geometrically to the global optimal with a sufficiently small step-size when the objective functions are strongly-convex with Lipschitz-continuous gradients. In the rest of the paper, Section II provides the algorithm development and its relationship with existing approaches, while Section III details the convergence analysis. Section IV presents numerical experiments and Section V concludes the paper.

Basic Notation:
We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. The matrix, $I_n$, represents the $n \times n$ identity, whereas $\mathbf{1}_n$ is the $n$-dimensional column vector of all $1$'s. For an arbitrary vector, $\mathbf{x}$, we denote its $i$th element by $[\mathbf{x}]_i$. We denote by $X \otimes Y$ the Kronecker product of two matrices, $X$ and $Y$. For a matrix, $X$, we denote by $\rho(X)$ its spectral radius and by $X_\infty$ its infinite power (if it exists), i.e., $X_\infty = \lim_{k\rightarrow\infty} X^k$. For a primitive, row-stochastic matrix, $A$, we denote its left and right eigenvectors corresponding to the eigenvalue of $1$ by $\pi_r$ and $\mathbf{1}_n$, respectively, such that $\pi_r^\top \mathbf{1}_n = 1$. Similarly, for a primitive, column-stochastic matrix, $B$, we denote its left and right eigenvectors corresponding to the eigenvalue of $1$ by $\mathbf{1}_n$ and $\pi_c$, respectively, such that $\mathbf{1}_n^\top \pi_c = 1$. The notation $\|\cdot\|$ denotes the Euclidean norm of vectors and $|||\cdot|||$ denotes the spectral norm of matrices.

II. ALGORITHM DEVELOPMENT
In this section, we mathematically formulate the optimization problem and describe the proposed algorithm and its relationship with the existing methods. Consider a network of $n$ agents whose communication links are described by a strongly-connected directed graph, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the index set of agents and $\mathcal{E}$ is the collection of ordered pairs, $(i, j)$, $i, j \in \mathcal{V}$, such that agent $j$ can send information to agent $i$, i.e., $j \rightarrow i$. We define $\mathcal{N}_i^{\text{in}}$ as the collection of in-neighbors, i.e., the set of agents that can send information to agent $i$. Similarly, $\mathcal{N}_i^{\text{out}}$ is the set of out-neighbors of agent $i$. Note that both $\mathcal{N}_i^{\text{in}}$ and $\mathcal{N}_i^{\text{out}}$ include node $i$. We assume that each agent $i$ knows its out-degree (the number of out-neighbors), denoted by $|\mathcal{N}_i^{\text{out}}|$; see [27] for details.

We focus on solving a convex optimization problem distributed over the above multi-agent network. In particular, the network of agents cooperatively solves the following:

P1: $\min_{\mathbf{x}} \; F(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x})$,

where each $f_i : \mathbb{R}^p \rightarrow \mathbb{R}$ is known only to agent $i$. We assume that each local function, $f_i$, is strongly-convex and has Lipschitz-continuous gradients. Our goal is to design a distributed algorithm such that the iterates at each agent converge to the global optimal solution of Problem P1 via information exchange with nearby agents over the directed graph, $\mathcal{G}$. We formalize the set of assumptions as follows.

Assumption 1:
The graph, $\mathcal{G}$, is strongly-connected and each agent in the network knows its out-degree.

Assumption 2:
Each local function, $f_i$, is strongly-convex and has globally Lipschitz-continuous gradients, i.e., for any $i$ and $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^p$:

(i) there exists a positive constant $\beta$ such that $\|\nabla f_i(\mathbf{x}_1) - \nabla f_i(\mathbf{x}_2)\| \leq \beta\,\|\mathbf{x}_1 - \mathbf{x}_2\|$;

(ii) there exists a positive constant $\alpha$ such that $f_i(\mathbf{x}_1) - f_i(\mathbf{x}_2) \leq \nabla f_i(\mathbf{x}_1)^\top(\mathbf{x}_1 - \mathbf{x}_2) - \frac{\alpha}{2}\,\|\mathbf{x}_1 - \mathbf{x}_2\|^2$.

Clearly, the Lipschitz-continuity and strong-convexity constants for the global objective function, $F$, are $\beta$ and $\alpha$, respectively. Assumption 2 ensures that the optimal solution, denoted as $\mathbf{x}^*$, of P1 exists and is unique.

Algorithm description:
To solve Problem P1, we propose the following algorithm. Each agent, $i \in \mathcal{V}$, maintains two variables, $\mathbf{x}_i(k), \mathbf{y}_i(k) \in \mathbb{R}^p$, where $k$ is the discrete-time index. The algorithm, initialized with $\mathbf{y}_i(0) = \nabla f_i(\mathbf{x}_i(0))$ and arbitrary $\mathbf{x}_i(0)$, $\forall i$, performs the following iterations:

$\mathbf{x}_i(k+1) = \sum_{j=1}^{n} a_{ij}\,\mathbf{x}_j(k) - \eta\,\mathbf{y}_i(k)$, (1a)

$\mathbf{y}_i(k+1) = \sum_{j=1}^{n} b_{ij}\left(\mathbf{y}_j(k) + \nabla f_j(\mathbf{x}_j(k+1)) - \nabla f_j(\mathbf{x}_j(k))\right)$, (1b)

where the step-size, $\eta$, is some positive constant. The weights, $a_{ij}$'s and $b_{ij}$'s, satisfy the following conditions:

$a_{ij} = \begin{cases} > 0, & j \in \mathcal{N}_i^{\text{in}}, \\ 0, & \text{otherwise}, \end{cases} \qquad \sum_{j=1}^{n} a_{ij} = 1, \ \forall i$, (2)

$b_{ij} = \begin{cases} > 0, & i \in \mathcal{N}_j^{\text{out}}, \\ 0, & \text{otherwise}, \end{cases} \qquad \sum_{i=1}^{n} b_{ij} = 1, \ \forall j$. (3)

Eq. (2) leads to a row-stochastic matrix, $A = \{a_{ij}\}$, which is easy to implement as each agent locally decides the weights. Eq. (3), on the other hand, results in a column-stochastic matrix, $B = \{b_{ij}\}$, whose distributed implementation only requires each agent to know its out-degree. (Such an assumption is standard in the related literature; see, e.g., [16], [17], [19]–[23].) In particular, we can construct such weights as $b_{ij} = 1/|\mathcal{N}_j^{\text{out}}|$, $\forall i, j$.

The algorithm in Eqs. (1) can be explained as follows. To implement Eq. (1a), the receiving agent $i$ decides on the weights, $a_{ij}$, assigned to the incoming $\mathbf{x}_j(k)$'s such that the $a_{ij}$'s sum to $1$. Implementation of Eq. (1b) requires the sending agent $j$ to scale the transmission, $\mathbf{y}_j(k) + \nabla f_j(\mathbf{x}_j(k+1)) - \nabla f_j(\mathbf{x}_j(k))$, by an appropriate choice of $b_{ij}$'s (to ensure the column-stochasticity of $B$), as the out-degree of agent $j$ may not be known to agent $i$. Agent $i$ subsequently adds the received messages to implement Eq. (1b). Intuitively, Eq. (1b) asymptotically learns the average, $\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\mathbf{x}_i(k))$, of the local gradients, [4]–[8]; thus Eq. (1a) approaches a centralized gradient descent, as the descent direction, $\mathbf{y}_i(k)$, becomes the gradient of the global objective function over time.

Relation with existing work:
We now briefly compare the proposed algorithm with existing techniques. The algorithms in Refs. [4]–[6] can be summarized as a single class of algorithms over undirected graphs with the following form:

$\mathbf{x}_i(k+1) = \sum_{j=1}^{n} w_{ij}\,\mathbf{x}_j(k) - \eta\,\mathbf{y}_i(k)$, (4a)

$\mathbf{y}_i(k+1) = \sum_{j=1}^{n} w_{ij}\,\mathbf{y}_j(k) + \nabla f_i(\mathbf{x}_i(k+1)) - \nabla f_i(\mathbf{x}_i(k))$, (4b)

where $W = \{w_{ij}\}$ is doubly-stochastic. It is shown in Refs. [5], [6] that Eqs. (4) converge geometrically to the optimal solution of Problem P1 as long as the step-size, $\eta$, is sufficiently small. This algorithm, however, is not applicable to directed graphs, as it may not be possible to construct doubly-stochastic weights.

To overcome this issue, Refs. [23]–[26] apply push-sum (type) techniques, with either row- or column-stochastic weights, to the algorithm in Eqs. (4). Refs. [25], [26], e.g., propose the following algorithm:

$\mathbf{y}_i(k+1) = \sum_{j=1}^{n} a_{ij}\,\mathbf{y}_j(k)$,

$\mathbf{x}_i(k+1) = \sum_{j=1}^{n} a_{ij}\,\mathbf{x}_j(k) - \eta_i\,\mathbf{z}_i(k)$,

$\mathbf{z}_i(k+1) = \sum_{j=1}^{n} a_{ij}\,\mathbf{z}_j(k) + \frac{\nabla f_i(\mathbf{x}_i(k+1))}{[\mathbf{y}_i(k+1)]_i} - \frac{\nabla f_i(\mathbf{x}_i(k))}{[\mathbf{y}_i(k)]_i}$,

where $A = \{a_{ij}\}$ is row-stochastic. Note that the first equation is an independent algorithm, which asymptotically learns the left eigenvector, corresponding to the eigenvalue of $1$, of $A$. However, it adds nonlinearity to the overall algorithm, along with additional computation and communication costs, in contrast to the proposed algorithm in Eqs. (1).

Remarks:
The algorithm, Eqs. (1), proposed in this letter can be viewed as related to Eqs. (4) but without doubly-stochastic weights, due to which we lose the nice eigenstructure within the weight matrices. It is rather straightforward to notice that a linear extension of Eqs. (4) to directed graphs is non-trivial, as all earlier attempts were made by adding nonlinearity to the original set of equations. One of the major challenges lies in the fact that, even though the contraction of a doubly-stochastic $W$ is well-established in the subspace orthogonal to $\mathbf{1}_n$, it is not straightforward to establish simultaneous contractions for a row-stochastic matrix, $A$, and a column-stochastic matrix, $B$. The latter requires working with arbitrary norms (as opposed to the $2$-norm applicable to doubly-stochastic matrices) and norm-equivalence constants, as we show in Lemma 1 and onwards.

III. CONVERGENCE ANALYSIS
For the sake of analysis, we now write Eqs. (1) in matrix form. The variables $\mathbf{x}(k)$ and $\mathbf{y}(k)$ collect the local variables, $\mathbf{x}_i(k)$'s and $\mathbf{y}_i(k)$'s, in vectors, respectively, and

$\nabla\mathbf{f}(k) = \left[\nabla f_1(\mathbf{x}_1(k))^\top, \ldots, \nabla f_n(\mathbf{x}_n(k))^\top\right]^\top \in \mathbb{R}^{np}$. (5)

Let $\mathcal{A} = A \otimes I_p$ and $\mathcal{B} = B \otimes I_p$, where $\otimes$ is the Kronecker product. We denote $\mathbf{x}^*$ as the optimal solution of Problem P1. We now rewrite Eqs. (1) in a compact matrix form as follows:

$\mathbf{x}(k+1) = \mathcal{A}\mathbf{x}(k) - \eta\,\mathbf{y}(k)$, (6a)

$\mathbf{y}(k+1) = \mathcal{B}\left(\mathbf{y}(k) + \nabla\mathbf{f}(k+1) - \nabla\mathbf{f}(k)\right)$, (6b)

where $\mathbf{y}(0) = \nabla\mathbf{f}(0)$ and $\mathbf{x}(0)$ is arbitrary.

A. Auxiliary relations
We begin the convergence analysis with a key lemma regarding the contraction of the consensus processes associated with the row- and column-stochastic weight matrices, respectively.
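The eigenstructure driving this contraction can be previewed numerically. The sketch below (a hypothetical 3-agent row-stochastic $A$, with $p = 1$ so that $\mathcal{A} = A$) checks that $A^k \rightarrow A_\infty = \mathbf{1}_n\pi_r^\top$, that $\rho(A - A_\infty) < 1$, and the identity $A\mathbf{a} - A_\infty\mathbf{a} = (A - A_\infty)(\mathbf{a} - A_\infty\mathbf{a})$ used in the proof that follows:

```python
import numpy as np

np.random.seed(0)
# Hypothetical 3-agent digraph: row-stochastic A with positive diagonal.
A = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [1/3, 1/3, 1/3]])

# Left Perron eigenvector pi_r (eigenvalue 1), normalized so pi_r^T 1 = 1.
w, V = np.linalg.eig(A.T)
pi_r = np.real(V[:, np.argmax(np.real(w))])
pi_r = pi_r / pi_r.sum()
A_inf = np.outer(np.ones(3), pi_r)            # A_inf = 1_n pi_r^T

print(np.allclose(np.linalg.matrix_power(A, 60), A_inf))  # A^k -> A_inf
print(max(abs(np.linalg.eigvals(A - A_inf))) < 1)         # rho(A - A_inf) < 1

# Identity behind the contraction: A a - A_inf a = (A - A_inf)(a - A_inf a).
a = np.random.randn(3)
print(np.allclose(A @ a - A_inf @ a, (A - A_inf) @ (a - A_inf @ a)))
```

The same computation with the transposes swapped verifies the analogous facts for a column-stochastic $B$.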
Lemma 1:
Consider the weight matrices $\mathcal{A} = A \otimes I_p$ and $\mathcal{B} = B \otimes I_p$. Then there exist vector norms, $\|\cdot\|_A$ and $\|\cdot\|_B$, such that for all $\mathbf{a} \in \mathbb{R}^{np}$,

$\|\mathcal{A}\mathbf{a} - \mathcal{A}_\infty\mathbf{a}\|_A \leq \sigma_A\,\|\mathbf{a} - \mathcal{A}_\infty\mathbf{a}\|_A$, (7)

$\|\mathcal{B}\mathbf{a} - \mathcal{B}_\infty\mathbf{a}\|_B \leq \sigma_B\,\|\mathbf{a} - \mathcal{B}_\infty\mathbf{a}\|_B$, (8)

where $0 < \sigma_A < 1$ and $0 < \sigma_B < 1$ are some constants.

Proof:
Since $A$ is irreducible and row-stochastic with positive diagonals, from the Perron–Frobenius theorem we have that $\rho(A) = 1$, every eigenvalue of $A$ other than $1$ is strictly less than $\rho(A)$ in magnitude, and $\pi_r^\top$ is a strictly positive left eigenvector corresponding to the eigenvalue of $1$ with $\mathbf{1}_n^\top \pi_r = 1$; thus $\lim_{k\rightarrow\infty} A^k = \mathbf{1}_n\pi_r^\top$. We further have

$\mathcal{A}_\infty = \lim_{k\rightarrow\infty} \mathcal{A}^k = \left(\lim_{k\rightarrow\infty} A^k\right) \otimes I_p = \left(\mathbf{1}_n\pi_r^\top\right) \otimes I_p$.

It follows that

$\mathcal{A}\mathcal{A}_\infty = (A \otimes I_p)\left((\mathbf{1}_n\pi_r^\top) \otimes I_p\right) = \mathcal{A}_\infty, \qquad \mathcal{A}_\infty\mathcal{A}_\infty = \left((\mathbf{1}_n\pi_r^\top) \otimes I_p\right)\left((\mathbf{1}_n\pi_r^\top) \otimes I_p\right) = \mathcal{A}_\infty$.

Thus $\mathcal{A}\mathcal{A}_\infty - \mathcal{A}_\infty\mathcal{A}_\infty$ is a zero matrix, which leads to the following relation:

$\mathcal{A}\mathbf{a} - \mathcal{A}_\infty\mathbf{a} = (\mathcal{A} - \mathcal{A}_\infty)(\mathbf{a} - \mathcal{A}_\infty\mathbf{a})$. (9)

Since $\rho(\mathcal{A} - \mathcal{A}_\infty) = \rho\left((A - \mathbf{1}_n\pi_r^\top) \otimes I_p\right) < 1$, we have from Lemma 5.6.10 in [28] that there exists a matrix norm, say $|||\cdot|||_A$, such that

$\sigma_A \triangleq |||\mathcal{A} - \mathcal{A}_\infty|||_A < 1$. (10)

Moreover, from Theorem 5.7.13 in [28], we know that for any matrix norm, $|||\cdot|||_A$, there exists a compatible vector norm, say $\|\cdot\|_A$, such that $\|X\mathbf{x}\|_A \leq |||X|||_A\,\|\mathbf{x}\|_A$ for all matrices, $X$, and all vectors, $\mathbf{x}$; hence, Eq. (9) leads to

$\|\mathcal{A}\mathbf{a} - \mathcal{A}_\infty\mathbf{a}\|_A = \|(\mathcal{A} - \mathcal{A}_\infty)(\mathbf{a} - \mathcal{A}_\infty\mathbf{a})\|_A \leq |||\mathcal{A} - \mathcal{A}_\infty|||_A\,\|\mathbf{a} - \mathcal{A}_\infty\mathbf{a}\|_A = \sigma_A\,\|\mathbf{a} - \mathcal{A}_\infty\mathbf{a}\|_A$,

and Eq. (7) follows. Similarly, Eq. (8) follows for some matrix norm, $|||\cdot|||_B$, with $\sigma_B \triangleq |||\mathcal{B} - \mathcal{B}_\infty|||_B < 1$.

The following lemma is a direct consequence of the column-stochasticity of $B$ and the initial condition that $\mathbf{y}(0) = \nabla\mathbf{f}(0)$.

Lemma 2:
We have $(\mathbf{1}_n^\top \otimes I_p)\,\mathbf{y}(k) = (\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k)$, $\forall k$.

Proof:
Recall Eq. (6b) and multiply both sides with $\mathbf{1}_n^\top \otimes I_p$. We get

$(\mathbf{1}_n^\top \otimes I_p)\,\mathbf{y}(k+1) = (\mathbf{1}_n^\top \otimes I_p)(B \otimes I_p)\left(\mathbf{y}(k) + \nabla\mathbf{f}(k+1) - \nabla\mathbf{f}(k)\right) = (\mathbf{1}_n^\top \otimes I_p)\,\mathbf{y}(k) + (\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k+1) - (\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k) = (\mathbf{1}_n^\top \otimes I_p)\left(\mathbf{y}(0) - \nabla\mathbf{f}(0)\right) + (\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k+1) = (\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k+1)$,

where the second equality uses $\mathbf{1}_n^\top B = \mathbf{1}_n^\top$ and the third follows by unrolling the recursion down to $k = 0$. This completes the proof.

Lemma 2 shows that the sum of the $\mathbf{y}_i(k)$'s preserves the sum of the local gradients. The next lemma, a standard result in convex optimization theory from [5], [29], states that the distance to the minimizer shrinks by at least a fixed ratio if we perform a gradient descent step.

Lemma 3:
Suppose that $g : \mathbb{R}^p \rightarrow \mathbb{R}$ is strongly-convex with Lipschitz-continuous gradients, and let $\alpha$ and $\beta$ be its strong-convexity and Lipschitz-continuity constants, respectively. For any $\mathbf{x} \in \mathbb{R}^p$ and $0 < \theta < \frac{2}{\beta}$, we have

$\|\mathbf{x} - \theta\nabla g(\mathbf{x}) - \mathbf{x}^*\| \leq \tau\,\|\mathbf{x} - \mathbf{x}^*\|$,

where $\mathbf{x}^*$ is the minimizer of $g$ and $\tau = \max\left(|1 - \alpha\theta|, |1 - \beta\theta|\right)$.

The subsequent convergence analysis is based on deriving a contraction relationship for the proposed algorithm, i.e., showing that $\|\mathbf{x}(k+1) - \mathcal{A}_\infty\mathbf{x}(k+1)\|_A$, $\|\mathcal{A}_\infty\mathbf{x}(k+1) - \mathbf{1}_n \otimes \mathbf{x}^*\|$, and $\|\mathbf{y}(k+1) - \mathcal{B}_\infty\mathbf{y}(k+1)\|_B$ are bounded linearly in terms of their values at the previous iteration. We capture the relationships among these quantities in the next lemmas. Before we proceed, note that all vector norms on a finite-dimensional vector space are equivalent, i.e., there exist finite and positive constants, $c, d, h, l, g, m$, such that $\|\cdot\|_A \leq c\,\|\cdot\|_B$, $\|\cdot\| \leq h\,\|\cdot\|_B$, $\|\cdot\| \leq g\,\|\cdot\|_A$, $\|\cdot\|_B \leq d\,\|\cdot\|_A$, $\|\cdot\|_B \leq l\,\|\cdot\|$, and $\|\cdot\|_A \leq m\,\|\cdot\|$.

Lemma 4:
The following inequality holds, $\forall k$:

$\|\mathbf{x}(k+1) - \mathcal{A}_\infty\mathbf{x}(k+1)\|_A \leq \sigma_A\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \eta m\,|||I_{np} - \mathcal{A}_\infty|||\;\|\mathbf{y}(k)\|$.

Proof:
Using Eq. (6a) and Lemma 1, we have

$\|\mathbf{x}(k+1) - \mathcal{A}_\infty\mathbf{x}(k+1)\|_A = \left\|\mathcal{A}\mathbf{x}(k) - \eta\mathbf{y}(k) - \mathcal{A}_\infty\left(\mathcal{A}\mathbf{x}(k) - \eta\mathbf{y}(k)\right)\right\|_A \leq \sigma_A\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \eta m\,\|\mathbf{y}(k) - \mathcal{A}_\infty\mathbf{y}(k)\| \leq \sigma_A\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \eta m\,|||I_{np} - \mathcal{A}_\infty|||\;\|\mathbf{y}(k)\|$,

and the lemma follows.

Next, we develop a relation for $\|\mathcal{A}_\infty\mathbf{x}(k+1) - \mathbf{1}_n \otimes \mathbf{x}^*\|$.

Lemma 5:
The following holds, $\forall k$, when $0 < \eta < \frac{2}{n\beta(\pi_r^\top\pi_c)}$:

$\|\mathcal{A}_\infty\mathbf{x}(k+1) - \mathbf{1}_n \otimes \mathbf{x}^*\| \leq \eta n\beta g\,(\pi_r^\top\pi_c)\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \lambda\,\|\mathcal{A}_\infty\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\| + \eta h\,|||\mathcal{A}_\infty|||\;\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B$, (11)

where $\lambda = \max\left(\left|1 - \alpha n\eta(\pi_r^\top\pi_c)\right|, \left|1 - \beta n\eta(\pi_r^\top\pi_c)\right|\right)$.

Proof:
With $\mathcal{A}_\infty = (\mathbf{1}_n\pi_r^\top) \otimes I_p = (\mathbf{1}_n \otimes I_p)(\pi_r^\top \otimes I_p)$ and Eq. (6a), we have

$\|\mathcal{A}_\infty\mathbf{x}(k+1) - \mathbf{1}_n \otimes \mathbf{x}^*\| = \left\|\mathcal{A}_\infty\left(\mathcal{A}\mathbf{x}(k) - \eta\mathbf{y}(k) + \eta\mathcal{B}_\infty\mathbf{y}(k) - \eta\mathcal{B}_\infty\mathbf{y}(k)\right) - \mathbf{1}_n \otimes \mathbf{x}^*\right\| \leq \left\|\left((\mathbf{1}_n\pi_r^\top) \otimes I_p\right)\mathbf{x}(k) - (\mathbf{1}_n \otimes I_p)\,\mathbf{x}^* - \eta\,\mathcal{A}_\infty\mathcal{B}_\infty\mathbf{y}(k)\right\| + \eta h\,|||\mathcal{A}_\infty|||\;\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B$. (12)

Since the last term above matches the last term in Eq. (11), what is left is to bound the first term. Before we proceed, define $\nabla F(k) = \nabla F\left((\pi_r^\top \otimes I_p)\,\mathbf{x}(k)\right)$, which is the global gradient evaluated at $(\pi_r^\top \otimes I_p)\,\mathbf{x}(k)$. Note that

$\mathcal{A}_\infty\mathcal{B}_\infty = \left((\mathbf{1}_n\pi_r^\top) \otimes I_p\right)\left((\pi_c\mathbf{1}_n^\top) \otimes I_p\right) = (\pi_r^\top\pi_c)\left((\mathbf{1}_n\mathbf{1}_n^\top) \otimes I_p\right)$.

We have the following:

$\left\|\left((\mathbf{1}_n\pi_r^\top) \otimes I_p\right)\mathbf{x}(k) - (\mathbf{1}_n \otimes I_p)\,\mathbf{x}^* - \eta\,\mathcal{A}_\infty\mathcal{B}_\infty\mathbf{y}(k)\right\| \leq \left\|(\mathbf{1}_n \otimes I_p)\left((\pi_r^\top \otimes I_p)\,\mathbf{x}(k) - \mathbf{x}^* - n\eta(\pi_r^\top\pi_c)\,\nabla F(k)\right)\right\| + \eta(\pi_r^\top\pi_c)\left\|n\,(\mathbf{1}_n \otimes I_p)\,\nabla F(k) - (\mathbf{1}_n \otimes I_p)(\mathbf{1}_n^\top \otimes I_p)\,\mathbf{y}(k)\right\| := s_1 + \eta s_2$.

From Lemma 3, we have that if $0 < \eta < \frac{2}{n\beta(\pi_r^\top\pi_c)}$, then $s_1 \leq \lambda\,\|\mathcal{A}_\infty\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\|$. Recalling from Lemma 2 that $(\mathbf{1}_n^\top \otimes I_p)\,\mathbf{y}(k) = (\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k)$, $\forall k$, we have $s_2 \leq n\beta g\,(\pi_r^\top\pi_c)\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A$. The lemma follows by using the above bounds in Eq. (12).

Next, we develop a relation for $\|\mathbf{y}(k+1) - \mathcal{B}_\infty\mathbf{y}(k+1)\|_B$.

Lemma 6:
The following inequality holds, $\forall k$:

$\|\mathbf{y}(k+1) - \mathcal{B}_\infty\mathbf{y}(k+1)\|_B \leq \sigma_B\beta lg\,|||\mathcal{A} - I_{np}|||\;\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \sigma_B\,\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B + \eta\sigma_B\beta l\,\|\mathbf{y}(k)\|$. (13)

Proof:
We note that

$\|\mathbf{y}(k+1) - \mathcal{B}_\infty\mathbf{y}(k+1)\|_B = \left\|\mathcal{B}\left(\mathbf{y}(k) + \nabla\mathbf{f}(k+1) - \nabla\mathbf{f}(k)\right) - \mathcal{B}_\infty\mathcal{B}\left(\mathbf{y}(k) + \nabla\mathbf{f}(k+1) - \nabla\mathbf{f}(k)\right)\right\|_B \leq \sigma_B\,\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B + \sigma_B\beta l\,\|\mathbf{x}(k+1) - \mathbf{x}(k)\|$, (14)

because of Lemma 1 and the Lipschitz-continuity of the gradients. We now analyze $\|\mathbf{x}(k+1) - \mathbf{x}(k)\|$:

$\|\mathbf{x}(k+1) - \mathbf{x}(k)\| = \|\mathcal{A}\mathbf{x}(k) - \eta\mathbf{y}(k) - \mathbf{x}(k)\| = \left\|(\mathcal{A} - I_{np})\left(\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\right) - \eta\mathbf{y}(k)\right\| \leq |||\mathcal{A} - I_{np}|||\,g\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \eta\,\|\mathbf{y}(k)\|$, (15)

where the second equality uses $(\mathcal{A} - I_{np})\mathcal{A}_\infty = 0$. The lemma follows by plugging Eq. (15) into Eq. (14).

The last step is to bound $\|\mathbf{y}(k)\|$ in terms of $\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A$, $\|\mathcal{A}_\infty\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\|$, and $\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B$. We can then replace $\|\mathbf{y}(k)\|$ in Lemmas 4–6 by this bound to complete the contraction relationship.

Lemma 7:
The following inequality holds, $\forall k$:

$\|\mathbf{y}(k)\| \leq g\beta\,|||\mathcal{B}_\infty|||\;\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + \beta\,|||\mathcal{B}_\infty|||\;\|\mathcal{A}_\infty\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\| + h\,\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B$.

Proof:
Recall that $\mathcal{B}_\infty = (\pi_c \otimes I_p)(\mathbf{1}_n^\top \otimes I_p)$. We have

$\|\mathbf{y}(k)\| \leq h\,\|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B + \|\mathcal{B}_\infty\mathbf{y}(k)\|$. (16)

We next bound $\|\mathcal{B}_\infty\mathbf{y}(k)\|$:

$\|\mathcal{B}_\infty\mathbf{y}(k)\| = \left\|(\pi_c \otimes I_p)(\mathbf{1}_n^\top \otimes I_p)\,\mathbf{y}(k)\right\| = \|\pi_c\|\left\|(\mathbf{1}_n^\top \otimes I_p)\,\nabla\mathbf{f}(k)\right\| = \|\pi_c\|\left\|\sum_{i=1}^{n}\nabla f_i(\mathbf{x}_i(k)) - \sum_{i=1}^{n}\nabla f_i(\mathbf{x}^*)\right\| \leq \|\pi_c\|\,\beta\sum_{i=1}^{n}\|\mathbf{x}_i(k) - \mathbf{x}^*\| \leq \|\pi_c\|\,\beta\sqrt{n}\,\|\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\| \leq |||\mathcal{B}_\infty|||\,\beta g\,\|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A + |||\mathcal{B}_\infty|||\,\beta\,\|\mathcal{A}_\infty\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\|$, (17)

where the second equality uses Lemma 2, the third uses $\sum_{i=1}^{n}\nabla f_i(\mathbf{x}^*) = \mathbf{0}$, the second-to-last inequality uses Jensen's inequality, and the last inequality uses the triangle inequality together with the fact that $|||\mathcal{B}_\infty||| = \sqrt{n}\,\|\pi_c\|$. The lemma follows by plugging Eq. (17) into Eq. (16).

Before the main result, we present an additional lemma from nonnegative matrix theory.

Lemma 8: (Theorem 8.1.29 in [28]) Let $X \in \mathbb{R}^{n \times n}$ be a nonnegative matrix and $\mathbf{x} \in \mathbb{R}^n$ be a positive vector. If $X\mathbf{x} < \omega\mathbf{x}$, then $\rho(X) < \omega$.

B. Main results
With the help of the auxiliary relations developed in the previous subsection, we now present the main result, which establishes the geometric convergence of the proposed algorithm.
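The main result is obtained by certifying $\rho(J(\eta)) < 1$ with Lemma 8: it suffices to exhibit one positive vector $\boldsymbol{\epsilon}$ such that $J(\eta)\boldsymbol{\epsilon} < \boldsymbol{\epsilon}$ entry-wise. A small numerical illustration of this certificate (the matrix entries below are hypothetical stand-ins for $J(\eta)$, not the constants of the theorem):

```python
import numpy as np

# A nonnegative matrix standing in for J(eta) (hypothetical numbers), and a
# positive vector eps satisfying X @ eps < eps entry-wise (i.e., omega = 1).
X = np.array([[0.90, 0.02, 0.03],
              [0.05, 0.95, 0.01],
              [0.10, 0.02, 0.85]])
eps = np.array([1.0, 2.0, 1.0])

print(np.all(X @ eps < eps))             # hypothesis of Lemma 8 holds
rho = max(abs(np.linalg.eigvals(X)))
print(rho < 1)                           # conclusion: rho(X) < 1
```

Finding the admissible step-size range in Eq. (21) amounts to solving the same entry-wise inequality symbolically in $\eta$ and $\boldsymbol{\epsilon}$.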
Theorem 1: If $0 < \eta < \frac{1}{n\beta(\pi_r^\top\pi_c)}$, we have the following linear matrix inequality (entry-wise):

$\mathbf{t}(k+1) \leq J(\eta)\,\mathbf{t}(k), \quad \forall k$, (18)

where $\mathbf{t}(k) \in \mathbb{R}^3$ and $J(\eta) \in \mathbb{R}^{3 \times 3}$ are defined as follows:

$\mathbf{t}(k) = \begin{bmatrix} \|\mathbf{x}(k) - \mathcal{A}_\infty\mathbf{x}(k)\|_A \\ \|\mathcal{A}_\infty\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\| \\ \|\mathbf{y}(k) - \mathcal{B}_\infty\mathbf{y}(k)\|_B \end{bmatrix}$, (19)

$J(\eta) = \begin{bmatrix} \sigma_A + a_1\eta & a_2\eta & a_3\eta \\ a_4\eta & \lambda & a_5\eta \\ a_6 + a_7\eta & a_8\eta & \sigma_B + a_9\eta \end{bmatrix}$, (20)

with the positive constants $a_i$'s being

$a_1 = mg\beta\,|||I_{np} - \mathcal{A}_\infty|||\,|||\mathcal{B}_\infty|||$, $a_2 = m\beta\,|||I_{np} - \mathcal{A}_\infty|||\,|||\mathcal{B}_\infty|||$, $a_3 = mh\,|||I_{np} - \mathcal{A}_\infty|||$, $a_4 = n\beta g\,(\pi_r^\top\pi_c)$, $a_5 = h\,|||\mathcal{A}_\infty|||$, $a_6 = g\sigma_B l\beta\,|||\mathcal{A} - I_{np}|||$, $a_7 = g\sigma_B l\beta^2\,|||\mathcal{B}_\infty|||$, $a_8 = \sigma_B l\beta^2\,|||\mathcal{B}_\infty|||$, $a_9 = h\sigma_B l\beta$.

When the step-size, $\eta$, satisfies

$\eta < \min\left\{\frac{\epsilon_1(1 - \sigma_A)}{a_1\epsilon_1 + a_2\epsilon_2 + a_3\epsilon_3},\; \frac{(1 - \sigma_B)\,\epsilon_3 - a_6\epsilon_1}{a_7\epsilon_1 + a_8\epsilon_2 + a_9\epsilon_3},\; \frac{1}{n\beta(\pi_r^\top\pi_c)}\right\}$, (21)

where $\epsilon_1, \epsilon_2, \epsilon_3$ are positive constants such that

$\epsilon_3 > 0, \qquad \epsilon_1 < \frac{(1 - \sigma_B)\,\epsilon_3}{a_6}, \qquad \epsilon_2 > \frac{a_4\epsilon_1 + a_5\epsilon_3}{\alpha n(\pi_r^\top\pi_c)}$, (22)

the spectral radius of $J(\eta)$, $\rho(J(\eta))$, is strictly less than $1$, and therefore $\|\mathbf{x}(k) - \mathbf{1}_n \otimes \mathbf{x}^*\|$ converges to zero geometrically at the rate of $O\left(\rho(J(\eta))^k\right)$.

Proof:
Combining the results of Lemmas 4–7, one can verify that Eq. (18) holds if $0 < \eta < \frac{1}{n\beta(\pi_r^\top\pi_c)}$. Recall that $\lambda = \max\left(|1 - \alpha n\eta(\pi_r^\top\pi_c)|, |1 - \beta n\eta(\pi_r^\top\pi_c)|\right)$. When $0 < \eta < \frac{1}{n\beta(\pi_r^\top\pi_c)}$, we have $\lambda = 1 - \alpha n\eta(\pi_r^\top\pi_c)$, since $\alpha \leq \beta$; see, e.g., [29] for details. The goal is to find an upper bound, $\widetilde{\eta}$, on the step-size such that $\rho(J(\eta)) < 1$ whenever $\eta < \widetilde{\eta}$. In light of Lemma 8, we solve for the range of the step-size, $\eta$, and a positive vector $\boldsymbol{\epsilon} = [\epsilon_1, \epsilon_2, \epsilon_3]^\top$ from the following (entry-wise) linear matrix inequality:

$\begin{bmatrix} \sigma_A + a_1\eta & a_2\eta & a_3\eta \\ a_4\eta & 1 - \alpha n\eta(\pi_r^\top\pi_c) & a_5\eta \\ a_6 + a_7\eta & a_8\eta & \sigma_B + a_9\eta \end{bmatrix}\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix} < \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}$, (23)

which is equivalent to the following set of inequalities:

$(a_1\epsilon_1 + a_2\epsilon_2 + a_3\epsilon_3)\,\eta < \epsilon_1(1 - \sigma_A)$,
$(a_4\epsilon_1 - \alpha n(\pi_r^\top\pi_c)\,\epsilon_2 + a_5\epsilon_3)\,\eta < 0$,
$(a_7\epsilon_1 + a_8\epsilon_2 + a_9\epsilon_3)\,\eta < (1 - \sigma_B)\,\epsilon_3 - a_6\epsilon_1$.

Solving the inequalities above, we have that when

$\epsilon_1 < \frac{(1 - \sigma_B)\,\epsilon_3}{a_6}, \qquad \epsilon_2 > \frac{a_4\epsilon_1 + a_5\epsilon_3}{\alpha n(\pi_r^\top\pi_c)}, \qquad \epsilon_3 > 0$,

$\eta < \min\left\{\frac{\epsilon_1(1 - \sigma_A)}{a_1\epsilon_1 + a_2\epsilon_2 + a_3\epsilon_3},\; \frac{(1 - \sigma_B)\,\epsilon_3 - a_6\epsilon_1}{a_7\epsilon_1 + a_8\epsilon_2 + a_9\epsilon_3}\right\}$,

the inequality in Eq. (23) holds and the theorem follows.

IV. NUMERICAL EXPERIMENTS
We consider a binary classification problem in the distributed setting, where we use the logistic loss to train a linear classifier. Each agent $i$ has access to $m_i$ training samples, $(\mathbf{c}_{ij}, y_{ij}) \in \mathbb{R}^p \times \{-1, +1\}$, where $\mathbf{c}_{ij}$ contains the $p$ features of the $j$th training sample at agent $i$ and $y_{ij}$ is the corresponding binary label. For privacy reasons, agents do not share their training data with each other. In order to use the entire data set for training, the network of agents cooperatively solves the following distributed logistic regression problem:

$\min_{\mathbf{w} \in \mathbb{R}^p,\, b \in \mathbb{R}} \; F(\mathbf{w}, b) = \sum_{i=1}^{n}\sum_{j=1}^{m_i} \ln\left[1 + \exp\left(-\left(\mathbf{w}^\top\mathbf{c}_{ij} + b\right)y_{ij}\right)\right] + \frac{\xi}{2}\,\|\mathbf{w}\|^2$,

where the private function at each agent $i$ is given by

$f_i(\mathbf{w}, b) = \sum_{j=1}^{m_i} \ln\left[1 + \exp\left(-\left(\mathbf{w}^\top\mathbf{c}_{ij} + b\right)y_{ij}\right)\right] + \frac{\xi}{2n}\,\|\mathbf{w}\|^2$.

In our setting, $n = 8$ and $p = 5$. The feature vectors, $\mathbf{c}_{ij}$'s, are zero-mean Gaussian, and the binary labels are randomly generated from a Bernoulli distribution. We first compare the performance of the proposed algorithm with ADD-OPT/Push-DIGing [23], [24], FROST [26], and subgradient-push [16], [17] over the leftmost directed graph, $\mathcal{G}_1$, shown in Fig. 1. The simulation results are shown in the left plot of Fig. 2. Next, we evaluate the proposed algorithm over the three directed graphs, $\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3$, shown in Fig. 1, where each graph to the right has a few more edges compared with the one on its left. The simulation results are shown in the right plot of Fig. 2. In both cases, we plot the average of the residuals at each agent, $\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i(k) - \mathbf{x}^*\|$. We note that the proposed linear algorithm achieves a geometric (linear on the log-scale) convergence speed comparable to other fast algorithms over directed graphs, but with less computation and communication.
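For reference, one agent's private objective and its analytic gradient can be sketched as follows (a sketch under stated assumptions: the $\xi/2$ scaling of the ridge term, the bias $b$ left unregularized, and randomly generated data; `f_i` and `grad_f_i` are illustrative names), together with a finite-difference check of the gradient:

```python
import numpy as np

np.random.seed(3)
n, p, m_i = 8, 5, 10            # agents, features, samples at this agent
xi = 1.0                        # ridge parameter
c = np.random.randn(m_i, p)     # feature vectors c_ij of one agent
yl = np.random.choice([-1.0, 1.0], size=m_i)   # labels y_ij

def f_i(w, b):
    # f_i(w, b) = sum_j ln[1 + exp(-(w^T c_ij + b) y_ij)] + (xi / (2n)) ||w||^2
    margins = (c @ w + b) * yl
    return np.log1p(np.exp(-margins)).sum() + (xi / (2 * n)) * w @ w

def grad_f_i(w, b):
    margins = (c @ w + b) * yl
    s = 1.0 / (1.0 + np.exp(margins))          # = sigmoid(-margin)
    return -(c.T @ (yl * s)) + (xi / n) * w, -(yl * s).sum()

# Central finite-difference sanity check of the analytic gradient.
w0, b0 = np.random.randn(p), 0.3
gw, gb = grad_f_i(w0, b0)
e = 1e-6
num_w0 = (f_i(w0 + e * np.eye(p)[0], b0) - f_i(w0 - e * np.eye(p)[0], b0)) / (2 * e)
num_b = (f_i(w0, b0 + e) - f_i(w0, b0 - e)) / (2 * e)
print(abs(num_w0 - gw[0]) < 1e-5, abs(num_b - gb) < 1e-5)
```

These per-agent gradients are exactly what enters Eq. (1b) in the experiments described above.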
These simulations confirm the theoretical findings in this letter.
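A miniature, self-contained analogue of these experiments is sketched below (our assumptions: quadratic losses in place of the logistic loss, a hypothetical 4-node unbalanced digraph, and a hand-picked step-size). It simulates the matrix form in Eqs. (6); the residual decays geometrically, and the conservation property of Lemma 2 holds along the trajectory:

```python
import numpy as np

np.random.seed(0)
n, p = 4, 2
# Hypothetical strongly connected, unbalanced digraph (self-loops included).
adj = np.array([[1, 0, 0, 1],
                [1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 1, 1, 1]])
A = adj / adj.sum(axis=1, keepdims=True)      # row-stochastic, Eq. (2)
B = adj / adj.sum(axis=0, keepdims=True)      # column-stochastic, Eq. (3)
A_bar = np.kron(A, np.eye(p))                 # A (x) I_p
B_bar = np.kron(B, np.eye(p))                 # B (x) I_p

# Quadratic stand-ins for the logistic losses: f_i(x) = 0.5 ||x - c_i||^2,
# so the global minimizer x* is the mean of the c_i's.
c = np.random.randn(n, p)
x_star = c.mean(axis=0)
grad = lambda x: x - c.reshape(-1)            # stacked gradient, Eq. (5)

eta = 0.05                                    # sufficiently small step-size
x = np.zeros(n * p)                           # arbitrary x(0)
y = grad(x)                                   # y(0) = grad f(0)
res = []
for k in range(400):
    x_new = A_bar @ x - eta * y               # Eq. (6a)
    y = B_bar @ (y + grad(x_new) - grad(x))   # Eq. (6b)
    x = x_new
    # Lemma 2: the sum of the y_i's equals the sum of the local gradients.
    assert np.isclose(y.reshape(n, p).sum(0), grad(x).reshape(n, p).sum(0)).all()
    res.append(np.linalg.norm(x - np.tile(x_star, n)))

print(res[0] > res[100] > res[-1])            # residual keeps shrinking
```

Plotting `res` on a log scale reproduces, in miniature, the straight-line (geometric) decay reported in Fig. 2.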
Fig. 1. Strongly-connected but unbalanced directed graphs.

Fig. 2. (Left) Comparison across different algorithms: the proposed algorithm, FROST, ADD-OPT/Push-DIGing, and subgradient-push. (Right) The proposed algorithm over the different graphs of Fig. 1. Both plots show the average residual at each agent, $\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i(k) - \mathbf{x}^*\|$, for the distributed logistic regression problem.

V. CONCLUSIONS
In this letter, we describe a linear distributed algorithm for optimization over directed graphs that can be seen as a generalization of earlier work over undirected graphs. Under the assumptions that the objective functions are strongly-convex and have Lipschitz-continuous gradients, the proposed algorithm achieves geometric convergence to the global optimal. Our analysis is based on a novel approach where we establish simultaneous contractions of both row- and column-stochastic matrices under arbitrary norms. We then use an elegant result from nonnegative matrix theory to develop the conditions for convergence.

REFERENCES

[1] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[2] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, Sep. 2016.
[3] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[4] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in IEEE 54th Annual Conference on Decision and Control, 2015, pp. 2055–2060.
[5] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Trans. on Control of Network Systems, Apr. 2017.
[6] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Convergence of asynchronous distributed gradient methods over stochastic networks," IEEE Trans. on Automatic Control, vol. 63, no. 2, pp. 434–448, 2018.
[7] G. Qu and N. Li, "Accelerated distributed Nesterov gradient descent," arXiv preprint arXiv:1705.07176, May 2017.
[8] M. Zhu and S. Martínez, "Discrete-time dynamic average consensus," Automatica, vol. 46, no. 2, pp. 322–329, 2010.
[9] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Püschel, "Distributed basis pursuit," IEEE Trans. on Signal Processing, vol. 60, no. 4, pp. 1942–1956, Apr. 2012.
[10] D. Jakovetić, "A unification, generalization, and acceleration of exact distributed first order methods," arXiv preprint arXiv:1709.01317, 2017.
[11] H. Raja and W. U. Bajwa, "Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data," IEEE Trans. on Signal Processing, vol. 64, no. 1, pp. 173–188, Jan. 2016.
[12] S. Lee and M. M. Zavlanos, "Approximate projection methods for decentralized optimization with functional constraints," IEEE Trans. on Automatic Control, 2017.
[13] F. Mansoori and E. Wei, "Superlinearly convergent asynchronous distributed network Newton method," in IEEE 56th Annual Conference on Decision and Control, Dec. 2017, pp. 2874–2879.
[14] B. Ying and A. H. Sayed, "Performance limits of stochastic sub-gradient learning, part II: Multi-agent case," Signal Processing, vol. 144, pp. 253–264, Mar. 2018.
[15] B. Gharesifard and J. Cortés, "Distributed strategies for generating weight-balanced and doubly stochastic digraphs," European Journal of Control, vol. 18, no. 6, pp. 539–557, 2012.
[16] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Push-sum distributed dual averaging for convex optimization," in IEEE 51st Annual Conference on Decision and Control, Maui, Hawaii, Dec. 2012, pp. 5453–5458.
[17] A. Nedić and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. on Automatic Control, vol. 60, no. 3, pp. 601–615, Mar. 2015.
[18] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in 44th Annual IEEE Symposium on Foundations of Computer Science, Oct. 2003, pp. 482–491.
[19] C. Xi, Q. Wu, and U. A. Khan, "On the distributed optimization over directed networks," Neurocomputing, vol. 267, pp. 508–515, Dec. 2017.
[20] C. Xi and U. A. Khan, "Distributed subgradient projection algorithm over directed graphs," IEEE Trans. on Automatic Control, vol. 62, no. 8, pp. 3986–3992, Oct. 2016.
[21] K. Cai and H. Ishii, "Average consensus on general strongly connected digraphs," Automatica, vol. 48, no. 11, pp. 2750–2761, 2012.
[22] C. Xi and U. A. Khan, "DEXTRA: A fast algorithm for optimization over directed graphs," IEEE Trans. on Automatic Control, vol. 62, no. 10, pp. 4980–4993, Oct. 2017.
[23] C. Xi, R. Xin, and U. A. Khan, "ADD-OPT: Accelerated distributed directed optimization," IEEE Trans. on Automatic Control, Aug. 2017, in press.
[24] A. Nedić, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, Dec. 2017.
[25] C. Xi, V. S. Mai, R. Xin, E. Abed, and U. A. Khan, "Linear convergence in optimization over directed graphs with row-stochastic matrices," IEEE Trans. on Automatic Control, Jan. 2018, in press.
[26] R. Xin, C. Xi, and U. A. Khan, "FROST – Fast row-stochastic optimization with uncoordinated step-sizes," arXiv preprint arXiv:1803.09169, Mar. 2018.
[27] F. Bullo, J. Cortés, and S. Martínez, Distributed Control of Robotic Networks, Princeton University Press, Applied Mathematics Series, 2009.
[28] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed., Cambridge University Press, New York, NY, 2013.
[29] S. Bubeck, "Convex optimization: Algorithms and complexity," arXiv preprint arXiv:1405.4980, 2014.