Decentralized conjugate gradients with finite-step convergence
ALEXANDER ENGELMANN AND TIMM FAULWASSER
Abstract.
The decentralized solution of linear systems of equations arises as a subproblem in optimization over networks. Typical examples include the KKT system corresponding to equality-constrained quadratic programs in distributed optimization algorithms or in active set methods. This note presents a tailored structure-exploiting decentralized variant of the conjugate gradient method. We show that the decentralized conjugate gradient method exhibits super-linear convergence in a finite number of steps. Finally, we illustrate the algorithm's performance in comparison to the Alternating Direction Method of Multipliers, drawing upon examples from sensor fusion.

1. Introduction
Sparse and structured linear systems of equations occur frequently as subproblems in different contexts of systems and control algorithms. In many cases, the coefficient matrix can be written as a sum of coefficient matrices of individual subsystems, where each of these matrices carries a sparsity pattern related to the underlying graph/network structure. Examples of such linear systems can be found in sensor networks [2, 3]. Moreover, they occur as the inner coordination problem in distributed optimization algorithms [4, 5], specifically in control and robotics [6], or in power systems [1]. Many state-of-the-art algorithms for such problems, such as the Alternating Direction Method of Multipliers (ADMM) [7, 8] or the Jacobi iteration based on Schwarz decomposition [9], deliver a linear convergence rate provided the coefficient matrix is strictly positive definite. In optimization over networks, however, these matrices are often positive semidefinite, rendering many classical algorithms slow. The present paper presents a decentralized Conjugate Gradient algorithm (d-CG) for such problems. We prove convergence conditions and demonstrate a fast, practically superlinear convergence rate, which is, to the best of our knowledge, the fastest convergence rate currently known for decentralized algorithms.
(AE and TF are both with the Institute for Energy Systems, Energy Efficiency and Energy Economics, TU Dortmund University, Germany. Parts of this work were conducted while AE was with the Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology, Karlsruhe, Germany, in the PhD thesis [1]. [email protected], [email protected].)

(The notions of distributed and decentralized computation are not unified in the literature. Here we use the terminology from [10, 11], where distributed computation allows a small amount of central coordination activity, whereas decentralized computation avoids central coordination and relies on neighbor-to-neighbor communication only.)

Moreover, we prove that d-CG converges to the exact solution after a finite number of iterations.

The idea of decentralized conjugate gradient algorithms has been previously considered in the context of bi-level decentralized algorithms for non-convex problems in [6]. In contrast to these previous results, the present paper focuses on sparsity exploitation and avoids any pre-computation steps. Moreover, the convergence rate estimate of Theorem 1 has not been given in [6].

The remainder of this paper is structured as follows: In Section 2, we formulate a linear system of equations with an underlying sparsity pattern corresponding to a graph structure. Moreover, we recall the centralized conjugate gradient algorithm and analyze its convergence properties for positive-semidefinite coefficient matrices. Based on that, we develop a decentralized conjugate gradient method in Section 3, which inherits the fast convergence properties of centralized CG. Section 4 demonstrates the effectiveness of our approach by means of numerical examples, where we compare the performance of our method to ADMM.
Notation:
We denote a weighted semi-norm by $\|x\|_W := \sqrt{x^\top W x}$ with $W \succeq 0$. Moreover, $\operatorname{blkdiag}(\{W_i\}_{i \in \mathcal S})$ denotes block-diagonal concatenation and $(\{W_i\}_{i \in \mathcal S})$ denotes vertical concatenation of a sequence of matrices $\{W_i\}_{i \in \mathcal S}$. We denote by $E_S \in \mathbb{R}^{m \times m}$ a zero matrix with ones on the main diagonal for all entries indexed by the set $S$.

2. Problem Formulation and Preliminaries
We consider a network optimization problem with subsystems $\mathcal S := \{1, \dots, S\}$, which aim at cooperatively solving

$$\Big(\underbrace{\sum_{i \in \mathcal S} S_i}_{=: S}\Big)\,\lambda = \underbrace{\sum_{i \in \mathcal S} s_i}_{=: s}, \tag{1}$$

where each coefficient matrix $S_i = S_i^\top \in \mathbb{R}^{m \times m}$, $S_i \succeq 0$, $i \in \mathcal S$, is symmetric positive semidefinite and $s_i \in \mathbb{R}^m$. We assume that the matrices $S_i$ exhibit a sparsity structure, i.e., specific rows/columns of $S_i$ are zero. We collect all indices for which $S_i$ has non-zero rows/columns in the set

$$C(i) := \{c \in \{1, \dots, m\} \mid (S_i)_{\cdot c} \neq 0\}. \tag{2}$$

Moreover, two subsystems $i, j \in \mathcal S$ are said to be adjacent if $C(i) \cap C(j) \neq \emptyset$.
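To make the sparsity model (1)-(2) concrete, the following minimal Python sketch builds a three-subsystem instance on a path graph. The dimensions, index sets $C(i)$, and random values are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Minimal illustration of (1)-(2): three subsystems on a path graph,
# global dimension m = 4. Only the zero pattern of each S_i matters.
m = 4
rng = np.random.default_rng(0)
C = {1: [0, 1], 2: [1, 2], 3: [2, 3]}   # assumed index sets C(i)
S_i, s_i = {}, {}
for i, idx in C.items():
    M = rng.standard_normal((len(idx), len(idx)))
    Si = np.zeros((m, m))
    Si[np.ix_(idx, idx)] = M @ M.T      # PSD block on C(i), zero elsewhere
    S_i[i] = Si
    s_i[i] = np.zeros(m)
    s_i[i][idx] = rng.standard_normal(len(idx))

S = sum(S_i.values())                   # S = sum_i S_i
s = sum(s_i.values())

# C(i) recovered from the nonzero rows/columns of S_i, cf. (2):
for i in S_i:
    print(i, np.flatnonzero(S_i[i].any(axis=0)))
# Subsystems 1 and 2 are adjacent since C(1) ∩ C(2) = {1} ≠ ∅.
```

Since each $S_i$ is a positive-semidefinite block supported on $C(i)$, the sum $S$ inherits exactly the overlap structure that d-CG exploits below.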
2.1. The Centralized Conjugate Gradient Method. Next, we recall the standard CG algorithm, which solves (1), written as $S\lambda = s$, in a centralized fashion and in a finite number of iterations. Centralized CG, as summarized in Algorithm 1, dates back to [12] and is discussed in many standard sources such as [13, Ch. 5.1].

The idea of centralized CG is as follows: problem (1) is interpreted as the first-order optimality condition of a quadratic objective with shape matrix $S \succeq 0$. The CG algorithm generates search directions $p^n$ (3e), which are conjugate to all previous search directions, and then computes an optimal step size $\alpha^n$ (3a) minimizing the quadratic objective along this direction.

Algorithm 1 Conjugate Gradients for problem (1)
Initialization: $r^0 = p^0 = s - S\bar\lambda^0$.
Repeat until $r^n = 0$:

$$\alpha^n = \frac{r^{n\top} r^n}{p^{n\top} S p^n} \tag{3a}$$
$$\bar\lambda^{n+1} = \bar\lambda^n + \alpha^n p^n \tag{3b}$$
$$r^{n+1} = r^n - \alpha^n S p^n \tag{3c}$$
$$\beta^n = \frac{r^{(n+1)\top} r^{n+1}}{r^{n\top} r^n} \tag{3d}$$
$$p^{n+1} = r^{n+1} + \beta^n p^n \tag{3e}$$
$$n \leftarrow n + 1.$$

The conjugacy of the $p^n$ immediately leads to finite-step convergence. In order to avoid storing all previous search directions, the extra steps (3b)-(3d) reduce the storage requirements to storing the previous iterate only.

Next, we present a convergence result of CG for positive-semidefinite matrices $S$ and $s \in \operatorname{range}(S)$, which extends the standard convergence analysis. Semidefinite $S$ arise frequently in network settings, as redundant communication constraints induced by the graph topology lead to zero eigenvalues of $S$, cf. Section 4.

Theorem 1 (Convergence of CG). If $S \succeq 0$, $\operatorname{range}(S) \neq \{0\}$ and $s \in \operatorname{range}(S)$, Algorithm 1 converges to a solution $\bar\lambda^\star$ of (1). Moreover, the convergence-rate estimate

$$\|\bar\lambda^{n+1} - \bar\lambda^\star\|_S \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^n \|\bar\lambda^0 - \bar\lambda^\star\|_S \tag{4}$$

holds, where $\kappa = \omega_{\max}/\omega_{\min}$ is the ratio of the largest and smallest non-zero eigenvalues of $S$. Moreover, Algorithm 1 terminates after at most $m - n_0$ steps, where $n_0$ is the number of zero eigenvalues of $S$. □

Proof.
We analyze the CG iterates (3a)-(3e) in the kernel and in the range of $S$ separately. Any vector $x \in \mathbb{R}^m$ can be uniquely decomposed as $x = x_N + x_R$, where $x_N$ lies in a subspace $W$ of $\mathbb{R}^m$ and $x_R$ lies in the orthogonal complement of $W$. Henceforth, we denote vectors in the kernel of $S$ by the subscript $(\cdot)_N$ and vectors in the range space of $S$ by the subscript $(\cdot)_R$. Since $\operatorname{null}(S)$ is the orthogonal complement of $\operatorname{range}(S^\top)$, and $\operatorname{range}(S^\top) = \operatorname{range}(S)$ since $S$ is symmetric, we split the iterates of the CG algorithm into these two components.

Iterates in $\operatorname{null}(S)$: Consider the initialization condition in Algorithm 1. By $s \in \operatorname{range}(S)$, i.e., $s_N = 0$, and since $S\bar\lambda_N^0 = 0$, it is clear that $r_N^0 = p_N^0 = 0$. As $\operatorname{range}(S) \neq \{0\}$, $S \succeq 0$, and $r^n \neq 0$ before termination, the scalars $\alpha^n, \beta^n$ are well-defined positive numbers for all iterates $n \in \mathbb{N}$ and depend only on $p_R$ and $r_R$. Hence, by (3b), (3c), and (3e), $r_N^n = p_N^n = 0$ and $\bar\lambda_N^n = \bar\lambda_N^0$ for all $n$. (Note that $\|x\|_S$ is not a norm, since $\|x\|_S = 0$ for $x \in \operatorname{null}(S)$. However, the estimates are nonetheless useful, since the inequalities characterize convergence in the range space of $S$.)

Iterates in $\operatorname{range}(S)$: Since $r_N^n = p_N^n = 0$, the iterates (3a)-(3e) are not affected by their kernel components (except for (3b)), since only zeros are added. Let $Q \in \mathbb{R}^{m \times (m - n_0)}$ be a matrix whose columns form an orthonormal basis of $\operatorname{range}(S)$, i.e., $\operatorname{range}(S) = \{y \in \mathbb{R}^m \mid y = Qx,\ x \in \mathbb{R}^{m - n_0}\}$, with $Q^\top Q = I$. Thus, we can uniquely represent all components of $\bar\lambda, r$ and $p$ in the range of $S$ in terms of $Q$, i.e., $(\bar\lambda_R, r_R, p_R) = (Q\underline{\bar\lambda}_R, Q\underline{r}_R, Q\underline{p}_R)$, where the underlined quantities $\underline{(\cdot)} \in \mathbb{R}^{m - n_0}$ denote the representations of $\bar\lambda, r$ and $p$ in terms of $Q$. Expressing the CG iterations (3) in these variables, left-multiplying (3b), (3c) and (3e) with $Q^\top$, and using the orthonormality of $Q$ and $r_N^n = p_N^n = 0$ yields

$$\alpha^n = \frac{\underline{r}_R^{n\top} \underline{r}_R^n}{\underline{p}_R^{n\top} \underline{S}\,\underline{p}_R^n}, \qquad \underline{\bar\lambda}_R^{n+1} = \underline{\bar\lambda}_R^n + \alpha^n \underline{p}_R^n, \qquad \underline{r}_R^{n+1} = \underline{r}_R^n - \alpha^n \underline{S}\,\underline{p}_R^n,$$
$$\beta^n = \frac{\underline{r}_R^{(n+1)\top} \underline{r}_R^{n+1}}{\underline{r}_R^{n\top} \underline{r}_R^n}, \qquad \underline{p}_R^{n+1} = \underline{r}_R^{n+1} + \beta^n \underline{p}_R^n,$$

where $\underline{S} := Q^\top S Q$. The same applies to the initial condition in Algorithm 1 and gives $\underline{r}_R^0 = \underline{p}_R^0 = \underline{s} - \underline{S}\,\underline{\bar\lambda}^0$ with $\underline{s} := Q^\top s$. By construction, the matrices $S$ and $\underline{S}$ have the same non-zero eigenvalues. Hence, $\underline{S} \succ 0$, and standard CG theory guarantees termination after at most $m - n_0$ iterations for solving $\underline{S}\,\underline{\bar\lambda}_R = \underline{s}$. Thus, $\underline{\bar\lambda}_R$ converges to the solution of $\underline{S}\,\underline{\bar\lambda}_R^\star = \underline{s}$. Hence, the construction of $\underline{S}$ and $\underline{s}$ implies $Q^\top S \bar\lambda_R = Q^\top s$. Since $\bar\lambda_N^n = \bar\lambda_N^0$, CG converges to $\bar\lambda^\star = Q\underline{\bar\lambda}_R^\star + \bar\lambda_N^0$, which satisfies (1). □

A similar proof idea can be found in [15].
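To illustrate Theorem 1, here is a minimal Python sketch of Algorithm 1, with a floating-point tolerance standing in for the exact termination test $r^n = 0$. The singular test problem at the bottom is an assumption, chosen so that $S \succeq 0$ and $s \in \operatorname{range}(S)$ hold.

```python
import numpy as np

def cg(S, s, lam0=None, tol=1e-12, max_iter=None):
    """Conjugate gradients (Algorithm 1) for S λ = s with S ⪰ 0, s ∈ range(S).

    Direct transcription of (3a)-(3e); `tol` replaces the exact test r^n = 0
    in floating-point arithmetic.
    """
    m = S.shape[0]
    lam = np.zeros(m) if lam0 is None else lam0.copy()
    r = s - S @ lam                  # initialization: r^0 = p^0 = s - S λ^0
    p = r.copy()
    for _ in range(max_iter or m):
        rr = r @ r
        if np.sqrt(rr) < tol:        # termination criterion
            break
        Sp = S @ p
        alpha = rr / (p @ Sp)        # (3a)
        lam = lam + alpha * p        # (3b)
        r = r - alpha * Sp           # (3c)
        beta = (r @ r) / rr          # (3d)
        p = r + beta * p             # (3e)
    return lam

# Singular PSD test case: S has two zero eigenvalues, s ∈ range(S).
rng = np.random.default_rng(1)
Q = rng.standard_normal((5, 3))
S = Q @ Q.T                          # rank 3, so n0 = 2 zero eigenvalues
s = S @ rng.standard_normal(5)       # guarantees s ∈ range(S)
lam = cg(S, s)
print(np.linalg.norm(S @ lam - s))   # ~1e-14
```

On this example CG reaches machine precision within at most $m - n_0 = 3$ iterations, in line with the finite-step claim of Theorem 1.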
3. Decentralized Conjugate Gradients
Next, we present the main contribution of this note: a decentralized conjugate gradient algorithm which inherits the fast and exact convergence properties of centralized CG.
3.1. Sparsity Exploitation.
To the end of eliminating the zero rows/columns in $S_i$, we introduce matrices $I_C \in \mathbb{R}^{|C| \times m}$, which project onto the non-zero components of $S_i$. These matrices are defined as the vertical concatenation of the unit vectors

$$e_i^\top := (0, \dots, 0, \underset{i\text{-th element}}{1}, 0, \dots, 0) \in \mathbb{R}^{1 \times m}$$

indexed by the set $C(i)$ from (2), i.e.,

$$I_{C(i)} := \big(e_l^\top\big)_{l \in C(i)}, \quad C(i) \subseteq \{1, \dots, m\}. \tag{5}$$

Moreover, we define

$$\Lambda_i := I_{C(i)} \underbrace{\sum_{j \in \mathcal S} I_{C(j)}^\top I_{C(j)}}_{=: \Lambda} I_{C(i)}^\top, \tag{6}$$

which is a diagonal matrix with positive integers on the main diagonal. We formalize properties of $\Lambda_i$ and $S_i$ with respect to multiplication with $I_{C(i)}$ in the next lemma.

Lemma 1 (Properties of $S_i$, $I_{C(i)}$ and $\Lambda_i$). Consider $S_i$ from (1), and $I_{C(i)}$, $\Lambda_i$ from (5) and (6) with $C(i)$ from (2). Then, the following properties hold:

(P1) $I_{C(i)}^\top I_{C(i)} = E_{C(i)}$,
(P2) $I_{C(i)}^\top I_{C(i)} S_i = S_i$,
(P3) $I_{C(i)}^\top I_{C(i)} s_i = s_i$,
(P4) $I_{C(i)}^\top I_{C(i)} I_{C(i)}^\top = I_{C(i)}^\top$,
(P5) $\sum_{i \in \mathcal S} I_{C(i)}^\top \Lambda_i^{-1} I_{C(i)} = I$. □

Proof.
(P1) follows immediately from $e_i^\top e_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. (P2): By (P1) and the definition of $C(i)$, we have $I_{C(i)}^\top I_{C(i)} S_i = E_{C(i)} S_i = S_i$; (P3) and (P4) follow similarly. (P5) is slightly more complicated. Since $\Lambda$ is diagonal, and by the definition of $I_{C(i)}$ as a concatenation of unit vectors, we have $\Lambda_i^{-1} = (I_{C(i)} \Lambda I_{C(i)}^\top)^{-1} = I_{C(i)} \Lambda^{-1} I_{C(i)}^\top$. Combination with (P1)-(P4) and $AB = BA$ for diagonal matrices yields

$$\sum_{i \in \mathcal S} I_{C(i)}^\top \Lambda_i^{-1} I_{C(i)} = \sum_{i \in \mathcal S} I_{C(i)}^\top I_{C(i)} \Lambda^{-1} I_{C(i)}^\top I_{C(i)} = \sum_{i \in \mathcal S} I_{C(i)}^\top I_{C(i)} I_{C(i)}^\top I_{C(i)} \Lambda^{-1} = \sum_{i \in \mathcal S} I_{C(i)}^\top I_{C(i)} \Lambda^{-1} = \Lambda \Lambda^{-1} = I. \qquad \square$$

Next, we derive the d-CG algorithm based on the properties of Lemma 1. The main idea of d-CG is to exploit the sparsity in (3a)-(3e) via the mappings $I_{C(i)}$. We start by defining local counterparts of the iterates $\bar\lambda$, $r$ and $p$,

$$\lambda_i := I_{C(i)} \bar\lambda, \quad r_i := I_{C(i)} r, \quad\text{and}\quad p_i := I_{C(i)} p \tag{7}$$

for all subsystems $i \in \mathcal S$. With these definitions, we decompose (3a). Define $\alpha^n := \eta^n/\sigma^n$ and $\eta^n := r^{n\top} r^n$. From Lemma 1 we have that

$$\eta^n = r^{n\top} r^n = r^{n\top} I\, r^n = r^{n\top} \Big(\sum_{i \in \mathcal S} I_{C(i)}^\top \Lambda_i^{-1} I_{C(i)}\Big) r^n.$$

Consider $\eta_i^n := r_i^{n\top} \Lambda_i^{-1} r_i^n$; then we obtain via (7)

$$\eta^n = r^{n\top} r^n = \sum_{i \in \mathcal S} r_i^{n\top} \Lambda_i^{-1} r_i^n = \sum_{i \in \mathcal S} \eta_i^n,$$

where $\eta_i^n$ can be computed locally in each subsystem. This yields (8c)-1 and (8d) in Algorithm 2. Note that the sums $\sum_{i \in \mathcal S} \eta_i^n$ and $\sum_{i \in \mathcal S} \sigma_i^n$ require global communication of two scalars $(\eta_i, \sigma_i)$ per subsystem $i \in \mathcal S$. (This can be achieved via hopping, i.e., a round-robin protocol.)
By the definition of $S$ and $p_i$, and by Lemma 1, the denominator of (3a) can be written as

$$\sigma^n = p^{n\top} \sum_{i \in \mathcal S} S_i\, p^n = p^{n\top} \sum_{i \in \mathcal S} I_{C(i)}^\top I_{C(i)} S_i I_{C(i)}^\top I_{C(i)}\, p^n = \sum_{i \in \mathcal S} p_i^{n\top} \hat S_i\, p_i^n = \sum_{i \in \mathcal S} \sigma_i^n,$$

where $\hat S_i := I_{C(i)} S_i I_{C(i)}^\top$. This yields (8a) and (8f) of Algorithm 2. Next, consider (3b) and (3e). These are entirely local steps, since left-multiplying both equations by $I_{C(i)}$ yields

$$\bar\lambda_i^{n+1} = \bar\lambda_i^n + \frac{\eta^n}{\sigma^n} p_i^n \quad\text{and}\quad p_i^{n+1} = r_i^{n+1} + \frac{\eta^{n+1}}{\eta^n} p_i^n,$$

comprising (8c)-2 and (8e)-1 in Algorithm 2. Finally, we decompose (3c), which requires neighbor-to-neighbor communication due to the product $S p^n$. Again, by Lemma 1 and by the definitions of $p_i$ and $\hat S_j$, we have

$$r^{n+1} = r^n - \frac{\eta^n}{\sigma^n} S p^n = r^n - \frac{\eta^n}{\sigma^n} \sum_{j \in \mathcal S} S_j p^n = r^n - \frac{\eta^n}{\sigma^n} \sum_{j \in \mathcal S} I_{C(j)}^\top \hat S_j p_j^n.$$

Left-multiplying by $I_{C(i)}$ gives

$$r_i^{n+1} = r_i^n - \frac{\eta^n}{\sigma^n} \sum_{j \in N(i)} I_{ij} \hat S_j p_j^n = r_i^n - \frac{\eta^n}{\sigma^n} \sum_{j \in N(i)} I_{ij} u_j^n,$$

with $u_j^n := \hat S_j p_j^n$ and $I_{ij} := I_{C(i)} I_{C(j)}^\top$. Note that the summation is over the neighbors $N(i)$ of subsystem $i$, as $I_{ij} = 0$ if $j \notin N(i)$. Moreover, $u_j^n = \hat S_j p_j^n$ can again be computed locally by each subsystem. This yields (8b) and (8e)-2 in Algorithm 2. Left-multiplying the initialization condition in Algorithm 1 by $I_{C(i)}$ and using (P2) yields

$$r_i^0 = p_i^0 = \sum_{j \in N(i)} I_{ij} \hat s_j - \sum_{j \in N(i)} I_{ij} \hat S_j \lambda_j^0.$$

If $\lambda_j^0 \neq 0$, one needs one extra neighbor-to-neighbor communication with $\hat s_i := I_{C(i)} s_i$ for initialization.

Observe that, by construction, the iterates of d-CG and standard CG are equivalent; thus the fast convergence properties of the centralized CG method are inherited by d-CG. We summarize this observation in the following proposition; its proof follows from the derivation above.

Proposition 1 (Convergence of d-CG). The decentralized conjugate gradient algorithm (Algorithm 2) exhibits the convergence properties of Theorem 1. □
Algorithm 2 Decentralized CG for problem (1)
Initialization: $\lambda_i^0$ and $r_i^0 = p_i^0 = \sum_{j \in N(i)} I_{ij} \hat s_j - \sum_{j \in N(i)} I_{ij} \hat S_j \lambda_j^0$.
Repeat until $r_i^n = 0$ for all $i \in \mathcal S$:

$$\sigma^n = \textstyle\sum_{i \in \mathcal S} \sigma_i^n \quad\text{(scalar global sum)} \tag{8a}$$
$$r_i^{n+1} = r_i^n - \tfrac{\eta^n}{\sigma^n} \textstyle\sum_{j \in N(i)} I_{ij} u_j^n \quad\text{(neighbor-to-neighbor)} \tag{8b}$$
$$\eta_i^{n+1} = r_i^{(n+1)\top} \Lambda_i^{-1} r_i^{n+1}; \quad \lambda_i^{n+1} = \lambda_i^n + \tfrac{\eta^n}{\sigma^n} p_i^n \quad\text{(local)} \tag{8c}$$
$$\eta^{n+1} = \textstyle\sum_{i \in \mathcal S} \eta_i^{n+1} \quad\text{(scalar global sum)} \tag{8d}$$
$$p_i^{n+1} = r_i^{n+1} + \tfrac{\eta^{n+1}}{\eta^n} p_i^n; \quad u_i^{n+1} = \hat S_i p_i^{n+1} \quad\text{(local)} \tag{8e}$$
$$\sigma_i^{n+1} = p_i^{(n+1)\top} \hat S_i p_i^{n+1} \quad\text{(local)} \tag{8f}$$
$$n \leftarrow n + 1$$
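The following single-process Python simulation of Algorithm 2 is a sketch, not the authors' implementation: the global sums (8a) and (8d) are evaluated directly instead of via an actual reduction protocol, the overlap maps $I_{ij}$ are realized by index matching, and the inputs are the dictionaries `S_i`, `s_i` from the sketch after (2).

```python
import numpy as np

def d_cg(S_i, s_i, m, tol=1e-12, max_iter=None):
    """Single-process simulation of d-CG (Algorithm 2) with λ_i^0 = 0."""
    subsys = list(S_i)
    C = {i: np.flatnonzero(S_i[i].any(axis=0)) for i in subsys}          # (2)
    N = {i: [j for j in subsys if np.intersect1d(C[i], C[j]).size]
         for i in subsys}                                                 # neighbors (incl. i)
    counts = np.zeros(m)                                                  # diagonal of Λ
    for i in subsys:
        counts[C[i]] += 1
    Lam_inv = {i: 1.0 / counts[C[i]] for i in subsys}                     # Λ_i^{-1}, cf. (6)
    Shat = {i: S_i[i][np.ix_(C[i], C[i])] for i in subsys}                # Ŝ_i
    shat = {i: s_i[i][C[i]] for i in subsys}                              # ŝ_i

    def I_ij(i, j, v_j):   # apply I_ij = I_C(i) I_C(j)^T to neighbor data
        out = np.zeros(len(C[i]))
        _, ii, jj = np.intersect1d(C[i], C[j], return_indices=True)
        out[ii] = v_j[jj]
        return out

    lam = {i: np.zeros(len(C[i])) for i in subsys}
    r = {i: sum(I_ij(i, j, shat[j]) for j in N[i]) for i in subsys}       # initialization
    p = {i: r[i].copy() for i in subsys}
    u = {i: Shat[i] @ p[i] for i in subsys}
    eta = sum(r[i] @ (Lam_inv[i] * r[i]) for i in subsys)
    for _ in range(max_iter or m):
        sigma = sum(p[i] @ (Shat[i] @ p[i]) for i in subsys)              # (8a)
        if eta < tol**2 or sigma == 0:
            break
        for i in subsys:                                                   # (8b), (8c)-2
            lam[i] = lam[i] + eta / sigma * p[i]
            r[i] = r[i] - eta / sigma * sum(I_ij(i, j, u[j]) for j in N[i])
        eta_new = sum(r[i] @ (Lam_inv[i] * r[i]) for i in subsys)         # (8c)-1, (8d)
        for i in subsys:                                                   # (8e), (8f)
            p[i] = r[i] + eta_new / eta * p[i]
            u[i] = Shat[i] @ p[i]
        eta = eta_new
    # stitch the global iterate back together via (P5): λ̄ = Σ_i I^T Λ_i^{-1} λ_i
    lam_full = np.zeros(m)
    for i in subsys:
        lam_full[C[i]] += Lam_inv[i] * lam[i]
    return lam_full
```

By construction the iterates coincide with those of centralized CG, so `np.linalg.norm(S @ d_cg(S_i, s_i, m) - s)` on the earlier toy instance provides a direct consistency check.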
Table 1. Properties of d-CG and d-ADMM for (1).

                               d-CG          d-ADMM
theoretical convergence rate   m-step        linear/sublinear
practical convergence rate     superlinear   linear/sublinear
tuning required                no            yes
local variables                four          two

Discussion.
For the sake of comparison, the appendix details a variant of decentralized ADMM (d-ADMM) based on the sparsity model used above. Table 1 compares properties of d-ADMM and d-CG. An advantage of d-ADMM is that it maintains only two local variables, whereas d-CG maintains four local variables. On the other hand, d-CG achieves finite-step convergence to the exact solution. Moreover, the practically observed convergence rate is superlinear, whereas d-ADMM and comparable algorithms achieve an at most linear or even sublinear rate [11]. A further advantage of d-CG is that it does not require tuning, while ADMM does. We remark that finite-step convergence and the absence of tuning parameters are unique features of d-CG; to the best of our knowledge, at present these properties are not achieved by other decentralized algorithms.
Remark 1 (Difference to d-ADMM and d-CG in [6]). In [6] we presented different versions of d-ADMM and d-CG. The advantage of the methods presented here, based on the sparsity results of Lemma 1, is that for d-CG no precomputation step is needed, and for d-ADMM we obtain strictly convex local objective functions $f_i$ in (16) for $S \succ 0$ and thus a linear convergence rate, cf. [16-18].

4. Applications
In the introduction we claimed that problem (1) occurs in many applications. One particular example are algorithms for distributed non-convex optimization, such as distributed interior-point methods [4, 5, 19], distributed SQP methods [20], and other distributed second-order algorithms such as ALADIN [21]. These methods have in common that a KKT optimality system is solved centrally in each iteration, which is a substantial barrier to direct decentralization. Under standard regularity and second-order conditions, these KKT systems can be written in the form of (1). Thus they can be solved via d-ADMM or d-CG, this way obtaining decentralized variants of these algorithms. A blueprint for doing so is as follows:

(1) Reformulate the optimization problem in affinely-coupled separable form (this is always possible via introducing auxiliary variables, cf. [6]):

$$\min_{x_1, \dots, x_S} \sum_{i \in \mathcal S} f_i(x_i) \tag{9a}$$
$$\text{subject to}\quad g_i(x_i) = 0 \mid \kappa_i, \quad \forall i \in \mathcal S, \tag{9b}$$
$$\sum_{i \in \mathcal S} A_i x_i = b \mid \lambda. \tag{9c}$$

(2) Compute a quadratic approximation $(B_i, G_i, g_i)$ of (9) at the current iterate $(x^k, \kappa^k, \lambda^k)$, where $B_i^k$ and $G_i^k$ are Hessian and Jacobian approximations. This yields a structured KKT system

$$\begin{pmatrix} H & A^\top \\ A & D \end{pmatrix} \begin{pmatrix} p \\ \lambda \end{pmatrix} = \begin{pmatrix} -h \\ d \end{pmatrix}, \tag{10}$$

where $p^\top = (x^\top, \kappa^\top)$, $A = (A_1, \dots, A_S)$,

$$H = \operatorname{blkdiag}\Big(\Big\{\begin{pmatrix} B_i & G_i^\top \\ G_i & 0 \end{pmatrix}\Big\}_{i \in \mathcal S}\Big),$$

and $h$ stacks the corresponding local gradient and constraint terms $g_i$, $i \in \mathcal S$.

(3) Compute the Schur complement of (10), i.e., solve the first row in (10) for $p$ and insert it into the second one, obtaining (1); a sketch of this step is given after Remark 3.

This reformulation procedure has the advantage that, in contrast to applying decentralized algorithms directly to (9) as for example done in [9], one obtains a linear system of equations with strictly positive-definite coefficient matrix. Moreover, this reformulation substantially reduces the dimension of the KKT system (10), and thus it often improves run-time.

Remark 2 (Requirements on (10)). In order to apply the above procedure, it is necessary that all $B_i$ are invertible. This can be established by imposing a slightly stronger version of the Second-Order Sufficient Condition (SOSC) on (9), cf. [6, Ass. 1]. □

Remark 3 (The sparsity of $A$). Note that when applying the above procedure, the sparsity of $S_i$ is induced by the sparsity of $A_i$. More specifically, the zero rows/columns in $S_i$ correspond to the zero columns in $A_i$, cf. [6]. □
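As a hedged illustration of step 3), the sketch below forms the local Schur-complement contributions for the simplest case without local equality constraints (9b), so that $H = \operatorname{blkdiag}(\{B_i\})$ with $B_i \succ 0$. The function and variable names are assumptions, and the sign convention follows the sensor-fusion example below.

```python
import numpy as np

def schur_contributions(B, grad, A):
    """Sketch of blueprint step 3) for the case without local equality
    constraints, i.e., H = blkdiag({B_i}) with B_i invertible. Each
    subsystem then contributes
        S_i = A_i B_i^{-1} A_i^T,   s_i = -A_i B_i^{-1} grad_i
    to an instance of (1). All names are illustrative assumptions.
    """
    S_i, s_i = {}, {}
    for i in B:
        S_i[i] = A[i] @ np.linalg.solve(B[i], A[i].T)
        s_i[i] = -A[i] @ np.linalg.solve(B[i], grad[i])
    return S_i, s_i
```

The sensor-fusion example below instantiates exactly this pattern with $B_i = M_i^\top M_i$ and $\operatorname{grad}_i = M_i^\top y_i$.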
Example: decentralized sensor fusion. Next, we apply the above procedure to a problem from sensor fusion, where all sensors $i \in \mathcal S$ are identical with linear measurement equations [2, 3, 22]. The goal is to estimate the parameters $\theta \in \mathbb{R}^{n_\theta}$ collaboratively and to minimize the variance of the estimate. The measurement equation for each sensor $i \in \mathcal S$ is $y_i = M_i \theta + v_i$, with zero-mean Gaussian noise $v_i$, $y_i \in \mathbb{R}^{n_y}$ and $\theta \in \mathbb{R}^{n_\theta}$. We assume that all $M_i$ have full column rank. The maximum-likelihood estimate of $\theta$ is

$$\theta^\star = \operatorname*{argmin}_\theta \sum_{i \in \mathcal S} \|y_i - M_i \theta\|^2. \tag{11}$$

The communication topology allows communication for solving (11) only between adjacent sensors. We encode this topology via a connected undirected graph $G = (\mathcal S, \mathcal E)$, where the edges $\mathcal E \subset \mathcal S \times \mathcal S$ encode the communication links and the nodes $\mathcal S$ refer to the sensors.

As suggested in the above recipe, we reformulate (11) as

$$\theta^\star = \operatorname*{argmin}_{\theta_1, \dots, \theta_S} \sum_{i \in \mathcal S} \|y_i - M_i \theta_i\|^2 \tag{12a}$$
$$\text{s.t.}\quad \theta_i = \theta_j \quad \text{for all } (i, j) \in \mathcal E. \tag{12b}$$

Observe that (12b) is equivalent to $\sum_{i \in \mathcal S} A_i \theta_i = 0$ with $A = I_G \otimes I$, where $I_G$ is the incidence matrix of $G$. The Lagrangian of (12) reads

$$L(\theta, \lambda) = \sum_{i \in \mathcal S} \|y_i - M_i \theta_i\|^2 + \lambda^\top \sum_{i \in \mathcal S} A_i \theta_i.$$

Thus, the first-order optimality conditions of (12) are

$$M_i^\top (y_i - M_i \theta_i) + A_i^\top \lambda = 0 \quad\text{and}\quad \sum_{i \in \mathcal S} A_i \theta_i = 0. \tag{13}$$

(Note that for (12) LICQ does not hold in general; however, Slater's condition is always satisfied here and thus we can use the KKT conditions.) Solving (13) for $\theta_i$ yields $\theta_i = (M_i^\top M_i)^{-1}(M_i^\top y_i + A_i^\top \lambda)$. Applying step 3) (Schur complement) yields

$$\Big(\sum_{i \in \mathcal S} A_i (M_i^\top M_i)^{-1} A_i^\top\Big) \lambda = s \tag{14}$$

with $s := -\sum_{i \in \mathcal S} A_i (M_i^\top M_i)^{-1} M_i^\top y_i$, which is an instance of (1).

Lemma 2 (Verification of assumptions). If all $M_i$ have full column rank, we have $s \in \operatorname{range}(S)$ in (14). □

Proof.
The definition of $s$ implies that $s \in \operatorname{range}(A)$. Since $S$ on the left-hand side of (14) involves multiplication with $A$ from both sides, we have $\operatorname{range}(S) = \operatorname{range}(A)$ if $M_i^\top M_i$ has full rank. Thus the assertion follows. □
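A minimal end-to-end sketch of the sensor-fusion example follows, assuming three sensors on a path graph and reusing `cg()` from the sketch after Theorem 1; all dimensions and random data are illustrative assumptions.

```python
import numpy as np

# Minimal instance of (14): three sensors on a path graph.
rng = np.random.default_rng(2)
n_theta, n_y, n_sensors = 2, 3, 3
edges = [(0, 1), (1, 2)]                 # path graph
# Consensus coupling from the incidence structure: one block row per edge.
A = {i: np.zeros((len(edges) * n_theta, n_theta)) for i in range(n_sensors)}
for e, (i, j) in enumerate(edges):
    A[i][e * n_theta:(e + 1) * n_theta] = np.eye(n_theta)
    A[j][e * n_theta:(e + 1) * n_theta] = -np.eye(n_theta)

M = {i: rng.standard_normal((n_y, n_theta)) for i in range(n_sensors)}
theta_true = rng.standard_normal(n_theta)
y = {i: M[i] @ theta_true + 1e-3 * rng.standard_normal(n_y)
     for i in range(n_sensors)}

# Assemble (14): S = Σ A_i (M_i^T M_i)^{-1} A_i^T and the right-hand side s.
S = sum(A[i] @ np.linalg.solve(M[i].T @ M[i], A[i].T) for i in A)
s = -sum(A[i] @ np.linalg.solve(M[i].T @ M[i], M[i].T @ y[i]) for i in A)

lam = cg(S, s)                           # solve (14) with Algorithm 1
# Recover local estimates θ_i = (M_i^T M_i)^{-1}(M_i^T y_i + A_i^T λ).
theta = {i: np.linalg.solve(M[i].T @ M[i], M[i].T @ y[i] + A[i].T @ lam)
         for i in A}
print(max(np.linalg.norm(theta[i] - theta_true) for i in A))  # small error
```

Since the path graph is a tree, the consensus constraints are non-redundant and $S$ is even positive definite here; cycles in the communication graph introduce the redundant constraints, and hence the zero eigenvalues, covered by Theorem 1.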
Numerical results. We consider a network of sensors $\mathcal S$ with different communication topologies: a path graph, a star-like graph, and two meshed graphs, shown in Figure 1. The measurement matrices $M_i$ are randomly generated with all entries drawn from a uniform distribution, and the measurement noise $v_i$ is zero-mean Gaussian with small variance.

[Figure 1. Investigated communication topologies.]

The numerical performance of d-CG and d-ADMM is shown in Figure 2. Figure 2a depicts the number of iterations required for reaching a given accuracy in $\|S\bar\lambda - s\|$. As one can see, for all considered network topologies, d-CG is about 2-5 times faster than ADMM, even in the best-tuned case. This also holds true for a larger, randomly generated meshed graph with $|\mathcal S| = 1000$ sensors and $|\mathcal E| = 1500$ edges, where d-CG requires 29 iterations to converge and ADMM requires 86 iterations in the best-tuned case. Moreover, one can observe that tuning d-ADMM is quite difficult: the optimal tuning parameter $\rho$ varies with the network topology, and badly-tuned parameters can lead to a significant decrease in convergence speed.

Additionally, Figure 2b shows the distance to a minimizer measured in the weighted semi-norm $\|\lambda - \lambda^\star\|_S$ for the strongly meshed graph. Observe that the convergence rate bound (4) is met. Moreover, d-CG converges to a (numerically) exact solution after 69 iterations, which is strictly less than the $m - n_0 = 90$ non-zero eigenvalues predicted by Theorem 1 and Proposition 1.

[Figure 2. Numerical performance of d-CG and d-ADMM. (a) Iterations required by d-CG and d-ADMM for different communication topologies, depending on the tuning parameter ρ. (b) Convergence of d-CG and d-ADMM for the strongly meshed graph with the convergence bound (4) for d-CG.]

5. Summary
This note presented a decentralized version of the conjugate gradient algorithm for solving structured linear systems of equations over networks. We have proven theoretical convergence in a finite number of steps and shown practically superlinear convergence. Moreover, we characterized the convergence rate in terms of the eigenvalues of the underlying linear system. Future work will investigate the design of communication topologies for optimizing convergence bounds. Moreover, tightening these bounds in terms of the spectrum of the coefficient matrix seems possible.
Appendix A. Decentralized ADMM
For the sake of comparison, we derive a decentralized ADMM variant for (1) based on the sparsity framework from Section 3. The general idea is to reformulate problem (1) as the optimality conditions of a consensus problem and then apply ADMM. But first, we eliminate the zero rows/columns in $S_i$ to obtain a partially separable, strongly convex optimization problem, which renders ADMM applicable with fast (linear) convergence guarantees; a lack of strong convexity might lead to sublinear convergence instead [7, 23]. By (P2), (P3) and (P5) from Lemma 1, and $AB = BA$ for diagonal matrices, we write the left-hand side of (1) as

$$\Big(\sum_{i \in \mathcal S} S_i\Big) \lambda = \Big(\sum_{i \in \mathcal S} I_{C(i)}^\top I_{C(i)} S_i I_{C(i)}^\top I_{C(i)}\Big) \lambda = \sum_{i \in \mathcal S} I_{C(i)}^\top \hat S_i I_{C(i)} \lambda.$$

Similarly, the right-hand side of (1) gives

$$\sum_{i \in \mathcal S} s_i = \sum_{i \in \mathcal S} I_{C(i)}^\top I_{C(i)} s_i = \sum_{i \in \mathcal S} I_{C(i)}^\top \hat s_i.$$

Summing up, we write (1) as

$$\sum_{i \in \mathcal S} I_{C(i)}^\top \hat S_i I_{C(i)} \lambda = \sum_{i \in \mathcal S} I_{C(i)}^\top \hat s_i. \tag{15}$$

Since $\hat S_i = \hat S_i^\top \succeq 0$,
(15) corresponds to the optimality conditions of

$$\min_\lambda\; \sum_{i \in \mathcal S} \tfrac{1}{2}\lambda^\top I_{C(i)}^\top \hat S_i I_{C(i)} \lambda - \hat s_i^\top I_{C(i)} \lambda. \tag{16}$$

Equivalently, we can reformulate (16) as

$$\min_{\lambda_1, \dots, \lambda_S, \bar\lambda}\; \sum_{i \in \mathcal S} \tfrac{1}{2}\lambda_i^\top \hat S_i \lambda_i - \hat s_i^\top \lambda_i \quad\text{subject to}\quad \lambda_i = I_{C(i)} \bar\lambda \mid \gamma_i, \quad \text{for all } i \in \mathcal S, \tag{17}$$

with new local variables $\lambda_1, \dots, \lambda_S$ and multipliers $\gamma_i$. Now, a decentralized version of ADMM follows standard patterns [10, 24] and is shown in Algorithm 3.

Algorithm 3 Decentralized ADMM for problem (1)
Initialization: $\bar\lambda_i^0$, $\gamma_i^0$ for all $i \in \mathcal S$, $\rho > 0$.
For all $i \in \mathcal S$, repeat until $\|\lambda_i^n - \bar\lambda_i^n\| < \epsilon_p$ and $\|\bar\lambda_i^{n+1} - \bar\lambda_i^n\| < \epsilon_d$:

$$\lambda_i^{n+1} = \big(\hat S_i + \rho I\big)^{-1}\big(\hat s_i - \gamma_i^n + \rho \bar\lambda_i^n\big) \quad\text{(local)} \tag{18a}$$
$$\bar\lambda_i^{n+1} = \Lambda_i^{-1} \textstyle\sum_{j \in N(i)} I_{ij} \lambda_j^{n+1} \quad\text{(neighbor-to-neighbor)} \tag{18b}$$
$$\gamma_i^{n+1} = \gamma_i^n + \rho\big(\lambda_i^{n+1} - \bar\lambda_i^{n+1}\big) \quad\text{(local)} \tag{18c}$$
$$n \leftarrow n + 1$$
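For completeness, here is a single-process Python sketch of Algorithm 3 in the same style as the d-CG simulation above. The averaging step (18b) is realized by accumulating the shared components globally, which is equivalent to $\Lambda_i^{-1}\sum_{j \in N(i)} I_{ij}\lambda_j$ on a real network; `rho`, tolerances, and iteration limits are assumptions.

```python
import numpy as np

def d_admm(S_i, s_i, m, rho=1.0, eps=1e-8, max_iter=5000):
    """Single-process simulation of d-ADMM (Algorithm 3, steps (18a)-(18c))."""
    subsys = list(S_i)
    C = {i: np.flatnonzero(S_i[i].any(axis=0)) for i in subsys}
    counts = np.zeros(m)                 # diagonal of Λ
    for i in subsys:
        counts[C[i]] += 1
    Shat = {i: S_i[i][np.ix_(C[i], C[i])] for i in subsys}
    shat = {i: s_i[i][C[i]] for i in subsys}
    lam_bar = {i: np.zeros(len(C[i])) for i in subsys}   # consensus copies λ̄_i
    gam = {i: np.zeros(len(C[i])) for i in subsys}       # multipliers γ_i
    for _ in range(max_iter):
        # (18a) local solve
        lam = {i: np.linalg.solve(Shat[i] + rho * np.eye(len(C[i])),
                                  shat[i] - gam[i] + rho * lam_bar[i])
               for i in subsys}
        # (18b) averaging of shared components, i.e., Λ_i^{-1} Σ_j I_ij λ_j
        full = np.zeros(m)
        for i in subsys:
            full[C[i]] += lam[i]
        lam_bar_new = {i: full[C[i]] / counts[C[i]] for i in subsys}
        # (18c) local dual update
        gam = {i: gam[i] + rho * (lam[i] - lam_bar_new[i]) for i in subsys}
        done = (max(np.linalg.norm(lam[i] - lam_bar_new[i]) for i in subsys) < eps
                and max(np.linalg.norm(lam_bar_new[i] - lam_bar[i]) for i in subsys) < eps)
        lam_bar = lam_bar_new
        if done:
            break
    return lam_bar
```

Running `d_admm` on the toy instance from Section 2 and comparing against `d_cg` illustrates the qualitative difference summarized in Table 1: d-ADMM converges linearly at best and its speed depends noticeably on `rho`, while d-CG terminates exactly.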
References

[1] A. Engelmann, "Distributed Optimization with Application to Power Systems and Control," Ph.D. dissertation, Karlsruhe Institute of Technology (KIT), Karlsruhe, 2020.
[2] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Fourth International Symposium on Information Processing in Sensor Networks (IPSN 2005), 2005, pp. 63–70.
[3] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in Ad Hoc WSNs With Noisy Links—Part I: Distributed Estimation of Deterministic Signals," IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 350–364, 2008.
[4] V. M. Zavala, C. D. Laird, and L. T. Biegler, "Interior-point decomposition approaches for parallel solution of large-scale nonlinear parameter estimation problems," Chemical Engineering Science, vol. 63, no. 19, pp. 4834–4845, 2008.
[5] J. Kardoš, D. Kourounis, and O. Schenk, "Two-Level Parallel Augmented Schur Complement Interior-Point Algorithms for the Solution of Security Constrained Optimal Power Flow Problems," IEEE Transactions on Power Systems, vol. 35, no. 2, pp. 1340–1350, 2020.
[6] A. Engelmann, Y. Jiang, B. Houska, and T. Faulwasser, "Decomposition of Nonconvex Optimization via Bi-Level Distributed ALADIN," IEEE Transactions on Control of Network Systems, vol. 7, no. 4, pp. 1848–1858, 2020.
[7] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the Linear Convergence of the ADMM in Decentralized Consensus Optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
[8] A. Makhdoumi and A. Ozdaglar, "Convergence Rate of Distributed ADMM Over Networks," IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5082–5095, 2017.
[9] S. Shin, V. M. Zavala, and M. Anitescu, "Decentralized Schemes With Overlap for Solving Graph-Structured Optimization Problems," IEEE Transactions on Control of Network Systems, vol. 7, no. 3, pp. 1225–1236, 2020.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.
[11] A. Nedić, A. Olshevsky, and S. Wei, "Decentralized Consensus Optimization and Resource Allocation," in Large-Scale and Distributed Optimization, Springer, 2018, pp. 247–287.
[12] M. R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems," Journal of Research of the National Bureau of Standards, vol. 49, no. 6, pp. 409–436, 1952.
[13] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, New York, 2006.
[14] A. Greenbaum, "Comparison of splittings used with the conjugate gradient algorithm," Numerische Mathematik, vol. 33, no. 2, pp. 181–193, 1979.
[15] K. Hayami, "Convergence of the Conjugate Gradient Method on Singular Systems," arXiv:1809.00793, 2020.
[16] W. H. Yang and D. Han, "Linear Convergence of the Alternating Direction Method of Multipliers for a Class of Convex Optimization Problems," SIAM Journal on Numerical Analysis, vol. 54, no. 2, pp. 625–640, 2016.
[17] E. Wei and A. Ozdaglar, "On the O(1/k) convergence of asynchronous distributed alternating Direction Method of Multipliers," in 2013 IEEE Global Conference on Signal and Information Processing, 2013, pp. 551–554.
[18] A. Nedić, A. Olshevsky, and M. G. Rabbat, "Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization," Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
[19] S. K. Pakazad, A. Hansson, M. S. Andersen, and I. Nielsen, "Distributed primal–dual interior-point methods for solving tree-structured coupled convex problems using message-passing," Optimization Methods and Software, vol. 32, no. 3, pp. 401–435, 2017.
[20] A. Kozma, "Distributed Optimization Methods for Large Scale Optimal Control," Ph.D. dissertation, KU Leuven, 2014.
[21] B. Houska, J. Frasch, and M. Diehl, "An Augmented Lagrangian Based Algorithm for Distributed NonConvex Optimization," SIAM Journal on Optimization, vol. 26, no. 2, pp. 1101–1127, 2016.
[22] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in Third International Symposium on Information Processing in Sensor Networks (IPSN 2004), 2004, pp. 20–27.
[23] W. Deng and W. Yin, "On the global and linear convergence of the generalized alternating direction method of multipliers," Journal of Scientific Computing, vol. 66, no. 3, pp. 889–916, 2016.
[24] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.