Blended Dynamics Approach to Distributed Optimization: Sum Convexity and Convergence Rate

Seungjoon Lee and Hyungbo Shim
ASRI, Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
Abstract
This paper studies the application of the blended dynamics approach to the distributed optimization problem in which the global cost function is given by a sum of local cost functions. The benefits include (i) each individual cost function need not be convex as long as the global cost function is strongly convex, and (ii) the convergence rate of the distributed algorithm can be made arbitrarily close to the convergence rate of the centralized one. Two particular continuous-time algorithms are presented using proportional-integral-type couplings. One has the benefit of being 'initialization-free,' so that agents can join or leave the network during operation. The other requires a minimal amount of communicated information. After presenting a general theorem that can be used for designing distributed algorithms, we present a distributed heavy-ball method and discuss its strengths over other methods.
Key words: multi-agent systems; distributed optimization; distributed heavy-ball method.
We consider the problem of finding w ∈ R^n that minimizes the cost

    F(w) := (1/N) Σ_{i=1}^N f_i(w),    (1)

in a distributed way. Here, F(·) is assumed to be strongly convex, but each f_i : R^n → R need not be convex. Each f_i(·) is assumed to be continuously differentiable, and its gradient ∇f_i(·) is globally Lipschitz. We consider a network of N agents, and each agent i (i = 1, ..., N) knows the local cost function f_i only, but not the others.

(Footnote: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. NRF2017R1E1A1A03070342). A preliminary result of this article was presented at the 20th International Conference on Control, Automation and Systems, Busan, South Korea, October 13–16, 2020 (Lee and Shim, 2020b).)
(Email addresses: [email protected] (Seungjoon Lee), [email protected] (Hyungbo Shim).)

(Footnote: The scaling factor 1/N is simply for convenience in this paper, because we are interested in the minimizer w* but not in the optimal cost.)

(Preprint submitted to Automatica, 26 February 2021.)

The main objective in this paper is the distributed minimization of (1); that is, to design the agents in the network and their communication policy such that each agent finds the global minimizer of F(w).

The distributed optimization problem has received much attention due to various applications such as the resource allocation problem, distributed state estimation, and distributed machine learning. It has been studied mostly in the discrete-time domain. For example, early solutions to the problem can be found in (Nedić and Ozdaglar, 2009), which proposes a consensus-based subgradient method. Extensions were then made by numerous works, e.g., for fixed step size (Yuan et al., 2016), for asymptotic convergence under fixed step size (Shi et al., 2015), and for the gradient tracking method (Qu and Li, 2018), to name a few. A common approach in these works is to combine a consensus algorithm with the classical gradient descent method to obtain a distributed algorithm. A common framework to analyze different variations of the distributed algorithms has also been studied (Alghunaim and Sayed, 2020; Jakovetić, 2019), which unifies various algorithms proposed in the literature. See also (Yang et al., 2019) for a survey of distributed algorithms.

Recently, further improvements have been made to the performance of distributed algorithms. From optimization theory, it is well known that accelerated methods such as Nesterov gradient descent (Nesterov, 2004) improve the convergence rate. Consequently, accelerated optimization methods have been combined with consensus algorithms to obtain distributed algorithms with improved performance.
For instance, Qu and Li (2020) proposed a distributed Nesterov gradient descent, and Xin and Khan (2019) proposed a distributed heavy-ball method.

In parallel to the studies in the discrete-time domain, continuous-time optimization algorithms have also attracted attention due to the insights they provide based on the rich knowledge from classical stability analysis. For instance, a continuous-time Nesterov gradient descent is analyzed by Su et al. (2016), and Lyapunov analysis for momentum methods is done in works such as (Shi et al., 2019; Wilson et al., 2016). Accordingly, continuous-time distributed algorithms have also been developed. Early work was done by Wang and Elia (2010), which proposed a proportional-integral-type algorithm. This has been extended in various manners, e.g., for discrete communications using event-triggered controls (Kia et al., 2015), for strongly connected and weight-balanced graphs (Gharesifard and Cortés, 2014), for communication delays using passivity (Hatanaka et al., 2018), and for a constrained problem (Yang et al., 2017). However, these works employ the gradient descent algorithm, and accelerated methods are not adopted. Most importantly, all of the distributed algorithms mentioned so far (both discrete and continuous cases) assume convexity of each local function f_i.

In this paper, we introduce a continuous-time distributed optimizer and present a distributed heavy-ball method. The advantages are listed as follows.

• The individual cost functions f_i(w) need not be convex as long as their sum, i.e., the global cost function F(w), is strongly convex.
• The convergence rate of the distributed algorithm can be made arbitrarily close to the convergence rate of the centralized one.
• Two algorithms are presented using proportional-integral (PI)-type coupling.
The first one communicates 2n-dimensional information with neighboring agents and has the benefit of being 'initialization-free,' so that agents can join or leave the network during operation. The second one communicates just n-dimensional information, but a specific initialization is needed.
• We present a general theorem that can also be used for designing other distributed algorithms, and demonstrate its use for designing a distributed gradient descent algorithm.

In Section 2, we state the general theorem regarding the behavior of a multi-agent system in which each agent exchanges its output information only through PI-type couplings. This result is utilized in Section 3 for distributed optimization algorithms. Numerical simulations and conclusions appear in Sections 4 and 5, respectively.

We use the following notation in this paper. For given matrices X_1, ..., X_N, we denote [X_1^T ··· X_N^T]^T by [X_1; ···; X_N], and we let 1_N := [1; ···; 1] ∈ R^N. For a vector x and a matrix X, |·| represents their Euclidean 2-norm and induced 2-norm, respectively. The Kronecker product is denoted by ⊗. The minimum and maximum singular values of a matrix A are denoted by σ_m(A) and σ_M(A), respectively. An (undirected) graph is defined as a tuple (N, E), where N := {1, ..., N} is the node set and E ⊆ N × N is the edge set. The set of neighbors of agent i is defined as N_i := {j ∈ N | (j, i) ∈ E}. The graph is connected if any two nodes can be joined by a path (Godsil and Royle, 2001). The Laplacian matrix L = [l_ij] ∈ R^{N×N} is defined as l_ij := −1 if (j, i) ∈ E and l_ij := 0 otherwise for i ≠ j, and l_ii := −Σ_{j≠i} l_ij.

In this section, we present a result on the behavior of a multi-agent system under proportional-integral (PI)-type couplings, which will lead to a few applications to distributed optimization in the next section.

Let us consider the multi-agent system consisting of N agents, with the node dynamics of agent i ∈ N := {1, ..., N} being given by
    ˙x_i = h_i(x_i) + u_i,    y_i = E x_i,    (2)

where x_i ∈ R^n is the state of agent i, h_i : R^n → R^n is called the vector field of agent i, which is assumed to be globally Lipschitz, and y_i ∈ R^q is called the output of agent i, which is communicated to other agents. (Moreover, in the next section, y_i will be the variable that converges to the optimal value w* := argmin_w F(w) for the minimization problem (1).) Typically, we have q ≤ n, and we assume that E ∈ R^{q×n} has full row rank. The input term u_i ∈ R^n is the coupling term whose value is determined from the information received from the neighboring agents. We assume that the communication graph is undirected and connected, and consider the following PI-type coupling:

    ˙ξ_i = −κK Σ_{j∈N_i} (y_j − y_i)    (3a)
    (A): u_i = k_P E^T Σ_{j∈N_i} (y_j − y_i) + k_I E^T Σ_{j∈N_i} (ξ_j − ξ_i)    (3b)

where ξ_i ∈ R^q is the state, N_i is the index set of agents that send information to agent i, the coupling gains κ, k_P, and k_I are positive numbers to be designed, and K is a positive definite matrix to be designed as well. On the other hand, we also consider another coupling input u_i given by (3a) and

    (B): u_i = k_P E^T Σ_{j∈N_i} (y_j − y_i) − k_I E^T ξ_i,  with  Σ_{i=1}^N ξ_i(0) = 0.    (3c)

An immediate observation is that the input (A) requires the communication of both y_j and ξ_j among the agents, while only y_j is communicated for the input (B). However, when the input (B) is used, the initial conditions should satisfy Σ_{i=1}^N ξ_i(0) = 0, while this is not needed when the input (A) is used. In this sense, the algorithm (A) (i.e., (3a) and (3b)) is said to be initialization-free, which is desired if some agents leave and/or new agents join the network during operation.

Define x ∈ R^{Nn} and ξ ∈ R^{Nq} as the column stacks of x_i and ξ_i, respectively.
Then the dynamics (2), together with (3a) and (3b) or (3c), can be written compactly as

    ˙x = h(x) − k_P (L_P ⊗ E^T E) x − k_I (D_I ⊗ E^T) ξ    (4a)
    ˙ξ = κ (L_P ⊗ K E) x    (4b)

where h(x) := [h_1(x_1); ···; h_N(x_N)] ∈ R^{Nn}, L_P is the Laplacian matrix of the communication graph, and

    D_I = L_P if input (A) is used, and D_I = I_N if input (B) is used.

In order to analyze the behavior of (4), two steps of state transformations, inspired by the work (Lee and Shim, 2020a), are introduced. First, from the given matrix E ∈ R^{q×n}, find two matrices Z ∈ R^{n×(n−q)} and W ∈ R^{n×q} such that the columns of Z and W are orthonormal bases of ker(E) and ker(E)^⊥, respectively. Then, T := [Z W] ∈ R^{n×n} is an orthogonal matrix, EZ = 0, and EW ∈ R^{q×q} is an invertible matrix. As a result, with Z and W, the individual state x_i for i ∈ N can be transformed into z_i and w_i as

    z_i = Z^T x_i ∈ R^{n−q},    w_i = W^T x_i ∈ R^q,    (5)

so that we have

    x_i = Z z_i + W w_i,    ∀i ∈ N.    (6)

Next, define R ∈ R^{N×(N−1)} such that the columns of R form an orthonormal basis of ker(1_N^T). Then the matrix

    T := [(1/N) 1_N^T ; R^T] ∈ R^{N×N}  satisfies  T^{−1} = [1_N  R].

Let w ∈ R^{Nq} be the column stack of w_i. Then, by the matrix (T ⊗ I_q), the states w and ξ can be converted into [w̄; w̃] and [ξ̄; ξ̃] as

    w̄ = (1/N) Σ_{i=1}^N w_i,   ξ̄ = (1/N) Σ_{i=1}^N ξ_i,
    w̃ = (R^T ⊗ I_q) w,   ξ̃ = (R^T ⊗ I_q) ξ    (7)

where w̄, ξ̄ ∈ R^q and w̃, ξ̃ ∈ R^{q(N−1)}.
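The two properties just used, that the columns of R are orthonormal and orthogonal to 1_N, and that T^{−1} = [1_N R], are easy to check numerically. The sketch below is an illustration only; the size N = 5 and the QR-based construction of R are our own choices, not prescribed by the paper.

```python
import numpy as np

# Build an R whose columns form an orthonormal basis of ker(1_N^T), form
# T = [(1/N) 1_N^T ; R^T], and verify the claimed inverse T^{-1} = [1_N  R].
N = 5
ones = np.ones((N, 1))

# A complete QR factorization of 1_N: the last N-1 columns of Q span ker(1_N^T).
Q, _ = np.linalg.qr(ones, mode='complete')
R = Q[:, 1:]                               # N x (N-1), with R^T 1_N = 0

T = np.vstack([ones.T / N, R.T])           # the matrix T used for (7)
T_inv = np.hstack([ones, R])               # claimed inverse [1_N  R]

print(np.allclose(R.T @ ones, 0))          # columns of R orthogonal to 1_N
print(np.allclose(R.T @ R, np.eye(N - 1))) # columns of R orthonormal
print(np.allclose(T @ T_inv, np.eye(N)))   # T^{-1} = [1_N  R]
```

All three checks print True, which is exactly the block computation (1/N)1_N^T 1_N = 1, 1_N^T R = 0, R^T R = I used in the text.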
Thus, we have

    w = (1_N ⊗ I_q) w̄ + (R ⊗ I_q) w̃    (8a)
    ξ = (1_N ⊗ I_q) ξ̄ + (R ⊗ I_q) ξ̃    (8b)
    x = (I_N ⊗ Z) z + (1_N ⊗ W) w̄ + (R ⊗ W) w̃    (8c)

or,

    x_i = Z z_i + W w̄ + (R_i ⊗ W) w̃,    ∀i ∈ N    (8d)

where z ∈ R^{N(n−q)} is the column stack of z_i and R_i is the i-th row of R.

Now, through the two consecutive linear coordinate changes (5) and (7), the system (4) is converted into

    ˙z_i = Z^T h_i(x_i),    i ∈ N,    (9a)
    ˙w̄ = (1/N) Σ_{i=1}^N W^T h_i(x_i)    (9b)
    ˙w̃ = (R^T ⊗ W^T) h(x) − k_P (Λ_P ⊗ W^T E^T E W) w̃ − k_I (Λ_I ⊗ W^T E^T) ξ̃    (9c)
    ˙ξ̄ = 0    (9d)
    ˙ξ̃ = κ (Λ_P ⊗ K E W) w̃    (9e)

where Λ_P = R^T L_P R ∈ R^{(N−1)×(N−1)} is positive definite because the graph is connected, and Λ_I is either Λ_P or I_{N−1}, depending on the cases (A) and (B), respectively.

No coupling term appears in (9a) because EZ = 0. To see how (9b) is derived, note that w̄ = (1/N)(1_N^T ⊗ W^T) x. Then, it is seen that no coupling term appears in (9b) in the case of (A) because 1_N^T L_P = 0. For the case of (B), we have ˙w̄ = (1/N) Σ_{i=1}^N W^T h_i(x_i) − k_I W^T E^T ξ̄, but by (9d) and by ξ̄(0) = (1/N) Σ_{i=1}^N ξ_i(0) = 0 from (3c), we have that ξ̄(t) ≡
0. Equation (9c) is obtained with w̃ = (R^T ⊗ W^T) x, by (4a), (8b), and (8c), and the fact that ξ̄ = (1/N)(1_N^T ⊗ I_q) ξ. It is worthwhile to emphasize again that the average of the ξ_i (i.e., ξ̄) is constant in both cases (A) and (B), and hence ξ̄(t) is completely determined by the initial conditions ξ_i(0).

Assumption 1
For a given multi-agent system (2), define its blended dynamics as

    ˙z_i = Z^T h_i(Z z_i + W w̄),    i ∈ N,
    ˙w̄ = (1/N) Σ_{i=1}^N W^T h_i(Z z_i + W w̄).    (10)

Assume that the blended dynamics (10) has a unique equilibrium (z_1*, ..., z_N*, w̄*) that is globally exponentially stable with a rate μ > 0; that is,

    |e_δ(t)| ≤ c e^{−μt} |e_δ(0)|    (11)

where e_δ := [z − z*; w̄ − w̄*] (with z = [z_1; ···; z_N] and z* = [z_1*; ···; z_N*]) and c is a constant.

Note that the blended dynamics (10) is nothing but the subsystem (9a) and (9b) when w̃ ≡ 0 (use (8d)), and has dimension N(n − q) + q. Since the blended dynamics (10) has the equilibrium (z*, w̄*) by Assumption 1, the whole system (9) has an equilibrium (z*, w̄*, w̃*, ξ̄*, ξ̃*) where w̃* = 0, ξ̄* = (1/N) Σ_{i=1}^N ξ_i(0), and ξ̃* = (1/k_I)(Λ_I ⊗ W^T E^T)^{−1} (R^T ⊗ W^T) h(x*), in which x* = (I_N ⊗ Z) z* + (1_N ⊗ W) w̄*. This equilibrium is unique with respect to the initial condition ξ(0).

Remark 1
The blended dynamics (10) can be seen as a residual dynamics of the multi-agent system (2) and (3) (or, (9)), which is left over when the outputs y_i achieve consensus. Indeed, if y_i(t) ≡ y_j(t), then w_i(t) ≡ w_j(t), because y_i = E x_i = E W w_i by (6) and EW is invertible. This yields that w̃(t) = (R^T ⊗ I_q) w(t) = 0 because R^T 1_N = 0. Then, (9a) and (9b) with w̃ ≡ 0 become the blended dynamics. Note that the blended dynamics (10) contains the average of the vector fields h_i (for its w̄-dynamics), which differs from the dynamics of any individual agent as well as from the overall dynamics. Note also that Assumption 1 asks for stability of the blended dynamics (which consists of the averaged vector field), but not of the individual agents; this is the main ingredient that makes convexity of the individual cost functions unnecessary.

(Footnote: In fact, in (Lee and Shim, 2020a), the blended dynamics is defined as the quasi-steady-state subsystem when (9) is viewed as a singularly perturbed system. The difference is whether (9d) belongs to the blended dynamics; here, we do not include it, considering that the behavior of (9d) is trivial.)

For convenience, let us translate the equilibrium of (9) to the origin through z_δ := z − z*, w̄_δ := w̄ − w̄*, and ξ̃_δ := ξ̃ − ξ̃*. Then the state x can be written as

    x = (I_N ⊗ Z)(z_δ + z*) + (1_N ⊗ W)(w̄_δ + w̄*) + (R ⊗ W) w̃    (12)

so that

    x* = (I_N ⊗ Z) z* + (1_N ⊗ W) w̄*.    (13)

Then it can be verified that (9) becomes

    ˙z_δ = (I_N ⊗ Z^T) h(x)    (14a)
    ˙w̄_δ = (1/N)(1_N^T ⊗ W^T) h(x)    (14b)
    ˙w̃ = (R^T ⊗ W^T)(h(x) − h(x*)) − k_P (Λ_P ⊗ W^T E^T E W) w̃ − k_I (Λ_I ⊗ W^T E^T) ξ̃_δ    (14c)
    ˙ξ̄ = 0    (14d)
    ˙ξ̃_δ = κ (Λ_P ⊗ K E W) w̃    (14e)

where Λ_I is Λ_P for (A) and I_{N−1} for (B).

Now, we present the main result.

Theorem 1
Consider the multi-agent system (2) and the PI-type coupling (3a) and (3b), or (3a) and (3c), with K = E W W^T E^T. Then, under Assumption 1, the following results hold.

(a) For any positive κ and k_I such that 2κ > k_I, there exists k_P*(κ, k_I) such that, for each k_P > k_P*, the origin of (14a), (14b), (14c), and (14e) is globally exponentially stable. Therefore, x_i(t) of (2) converges to x_i* = Z z_i* + W w̄* exponentially fast.

(b) Set κ = k_I = φ* k_P, where φ* > 0 is sufficiently small. For any positive υ < μ, there exists k_P**(υ, φ*) > 0 such that, for each k_P > k_P**, we have

    |[e_δ(t); w̃(t); ξ̃_δ(t)]| ≤ c̄ e^{−(μ−υ)t} |[e_δ(0); w̃(0); ξ̃_δ(0)]|

where e_δ = [z_δ; w̄_δ] and c̄ is a constant. Therefore, x_i(t) converges to x_i* exponentially fast, at least with the rate μ − υ.

(Footnote: See (20) in the proof for an explicit expression of k_P*.)

(Footnote: In fact, an explicit bound on φ*, expressed through the singular values σ_m(Λ_P), σ_M(Λ_I), σ_m(E), and σ_M(E), is given in (22) in the proof.)

(Footnote: See (25) in the proof for an explicit expression of k_P**.)

Theorem 1(a) states that the proposed dynamics (2) with (3) is exponentially stable for any (small) κ and k_I if 2κ > k_I and k_P is sufficiently large. If k_P and k_I = κ are chosen as proposed in Theorem 1(b), then the convergence rate to the equilibrium can be made arbitrarily close to that of the blended dynamics. For this result, the existence of an exponentially stable equilibrium for the blended dynamics (10) is enough, and each agent may even be unstable as long as the blended dynamics is stable.

Remark 2
In order to recover the convergence rate μ by choosing υ arbitrarily close to 0, one has to choose κ = k_I = φ* k_P with a suitable choice of φ* and k_P, which may be tedious. This can be simplified by choosing

    κ = k_I = β,    k_P = β^{3/2},    β ≫ 1.

Indeed, with this choice, φ* = 1/√β, so that a large β makes (22) hold. Moreover, it is seen from (25) that k_P** = ϑ₂/(φ*)² = ϑ₂ β (with some constant ϑ₂) when φ* is sufficiently small, i.e., β is sufficiently large. Therefore, it holds that k_P = β^{3/2} > k_P** if β ≫ 1.

PROOF.
The first step is to obtain a Lyapunov function for the blended dynamics (10) that can characterize the convergence rate μ. For this purpose, we employ the converse Lyapunov theorem of Corless and Glielmo (1998), among others, which states, under Assumption 1, that for any υ with 0 < υ < μ, there exists a Lyapunov function V̄(e_δ), with e_δ = [z_δ; w̄_δ], such that

    c₁|e_δ|² ≤ V̄(e_δ) ≤ c₂|e_δ|²,
    |(∂V̄/∂e_δ)(e_δ)| ≤ c₃|e_δ|,
    ˙V̄(e_δ)|_{along (10)} ≤ −(2μ − υ) V̄(e_δ),

where c₁, c₂, c₃ > 0. Since c₁|e_δ(t)|² ≤ V̄(e_δ(t)) ≤ e^{−(2μ−υ)t} c₂|e_δ(0)|², these properties by themselves only guarantee the convergence rate μ − υ/2. Because our goal is to recover the convergence rate μ arbitrarily closely, as stated in Theorem 1, the above statement is enough for our purpose.

With the function V̄, let us consider

    V(e_δ, w̃, ξ̃_δ) = V̄(e_δ) + (1/2) [w̃; ξ̃_δ]^T [X, φY; φY^T, 2U] [w̃; ξ̃_δ]

where

    X = Λ_P ⊗ W^T E^T E W,    U = Λ_I ⊗ I_q,    Y = Λ_I ⊗ W^T E^T,
    φ = φ(κ, k_I, k_P) = (2κ − k_I)/k_P.

In order for the function V to be a candidate Lyapunov function for (14), it should be positive definite, which asks

    φ < sqrt( 2σ_m(X) / σ_M(Y U^{−1} Y^T) ) =: θ₁.    (15)

On the other hand, we will need an upper bound of V that is independent of φ. For this, noting that w̃^T φY ξ̃_δ ≤ (φ|Y|/2)(|w̃|² + |ξ̃_δ|²), it can be seen that, if φ|Y|/2 ≤ max(|X|/2, |U|), or

    φ ≤ (1/|Y|) max(|X|, 2|U|) =: θ₂,    (16)

then

    (1/2) w̃^T X w̃ + φ w̃^T Y ξ̃_δ + ξ̃_δ^T U ξ̃_δ ≤ max(|X|, 2|U|)(|w̃|² + |ξ̃_δ|²) =: η (|w̃|² + |ξ̃_δ|²),

in which η is independent of φ.

With these in mind, let us take the time derivative of V, term by term. For the first term, we have

    ˙V̄|_{along (14a),(14b)} = ˙V̄|_{along (10)} + (∂V̄/∂e_δ) · [ I_N ⊗ Z^T ; (1/N)(1_N^T ⊗ W^T) ] (h(x) − h(x − (R ⊗ W)w̃))
        ≤ −(2μ − υ) V̄(e_δ) + c₃ L |e_δ||w̃|
        ≤ −(2μ − 2υ) V̄(e_δ) − υc₁|e_δ|² + (υc₁/3)|e_δ|² + (3c₃²L²/(4υc₁))|w̃|²

where L is a Lipschitz constant of h. Here we note that

    |h(x) − h(x*)| ≤ L |(I_N ⊗ Z) z_δ + (1_N ⊗ W) w̄_δ + (R ⊗ W) w̃| ≤ L√N |e_δ| + L |w̃|.
For the remaining terms of V, we compute

    (1/2) d(w̃^T X w̃)/dt = w̃^T X (R^T ⊗ W^T)(h(x) − h(x*)) − k_P w̃^T X² w̃ − k_I w̃^T X Y ξ̃_δ
        ≤ |X| L √N |e_δ||w̃| + |X| L |w̃|² − k_P σ_m(X)² |w̃|² − k_I ξ̃_δ^T Y^T X w̃
        ≤ (υc₁/3)|e_δ|² + (3|X|²L²N/(4υc₁))|w̃|² + |X| L |w̃|² − k_P σ_m(X)² |w̃|² − k_I ξ̃_δ^T Y^T X w̃,

    d(ξ̃_δ^T U ξ̃_δ)/dt = 2κ ξ̃_δ^T U (Λ_P ⊗ K E W) w̃ = 2κ ξ̃_δ^T Y^T X w̃,

and

    φ d(w̃^T Y ξ̃_δ)/dt = φκ w̃^T Y (Λ_P ⊗ K E W) w̃ + φ ξ̃_δ^T Y^T (R^T ⊗ W^T)(h(x) − h(x*))
            − φ k_P ξ̃_δ^T Y^T (Λ_P ⊗ W^T E^T E W) w̃ − φ k_I ξ̃_δ^T Y^T (Λ_I ⊗ W^T E^T) ξ̃_δ
        ≤ φκ |Y||E|²|X| |w̃|² + 2φ|Y| L √N |e_δ||ξ̃_δ| + φ|Y| L |w̃||ξ̃_δ| − φ k_P ξ̃_δ^T Y^T X w̃ − φ k_I σ_m(Y)² |ξ̃_δ|²
        ≤ φκ |Y||E|²|X| |w̃|² + (υc₁/3)|e_δ|² + (3φ²|Y|²L²N/(υc₁))|ξ̃_δ|² + (3φ k_I σ_m(Y)²/4)|ξ̃_δ|²
            + (φ|Y|²L²/(3k_I σ_m(Y)²))|w̃|² − φ k_P ξ̃_δ^T Y^T X w̃ − φ k_I σ_m(Y)² |ξ̃_δ|².

Since φ k_P = 2κ − k_I, the indefinite cross terms ξ̃_δ^T Y^T X w̃ cancel, and it follows that

    ˙V ≤ −(2μ − 2υ) V̄ − φ( k_I σ_m(Y)²/4 − 3φ|Y|²L²N/(υc₁) )|ξ̃_δ|²
        − ( k_P σ_m(X)² − φκ|Y||E|²|X| − φ|Y|²L²/(3k_I σ_m(Y)²) − θ₆ )|w̃|²    (17)

where

    θ₆ := 3c₃²L²/(4υc₁) + 3|X|²L²N/(4υc₁) + |X| L.

Therefore, it is seen that, under the assumption that κ and k_I are chosen such that 2κ − k_I > 0, the choice φ = (2κ − k_I)/k_P > 0 makes ˙V negative definite by letting k_P be sufficiently large. Indeed, the first big parenthesis in (17) becomes positive when

    k_P > ((2κ − k_I)/k_I) · 12|Y|²L²N/(σ_m(Y)² υc₁) =: ((2κ − k_I)/k_I) θ₃.    (18)

In addition, the second big parenthesis in (17) becomes positive when

    σ_m(X)² k_P² − θ₆ k_P − (2κ − k_I)(κθ₄ + θ₅/k_I) > 0,

with θ₄ := |Y||E|²|X| and θ₅ := |Y|²L²/(3σ_m(Y)²), which is the case when

    k_P > ( θ₆ + sqrt(θ₆² + 4(2κ − k_I)(κθ₄ + θ₅/k_I) σ_m(X)²) ) / (2σ_m(X)²) =: ϕ(κ, k_I).    (19)

Overall, the function ˙V of (17) becomes upper-bounded by a negative definite quadratic function if

    k_P > max( (2κ − k_I)/θ₁, (2κ − k_I)/θ₂, ((2κ − k_I)/k_I) θ₃, ϕ(κ, k_I) )
        =: k_P*(κ, k_I, Λ_P, Λ_I, E, L, N, υ, c₁, c₃),    (20)

where the constants θ₁, ..., θ₆ introduced above are expressed through the problem data by the facts that |X| = σ_M(Λ_P) σ_M(E)², |Y| = σ_M(Λ_I) σ_M(E), σ_m(X) = σ_m(Λ_P) σ_m(E)², and σ_m(Y) = σ_m(Λ_I) σ_m(E). This completes the proof of Theorem 1(a).

To prove Theorem 1(b), we will show that

    ˙V ≤ −(2μ − 2υ) V̄ − (2μ − 2υ) η (|w̃|² + |ξ̃_δ|²) ≤ −(2μ − 2υ) V    (21)

by the choice κ = k_I = φ* k_P, where k_P will be determined shortly and φ* is a positive constant such that

    φ* < min( θ₁, θ₂, σ_m(X)/√θ₄ ).    (22)

It follows from this choice that φ = (2κ − k_I)/k_P = φ*, and so the conditions (15) and (16) are satisfied. Now, we want both the coefficients of |ξ̃_δ|² and |w̃|² in (17) to be less than −(2μ − 2υ)η for (21). For the first coefficient, we ask

    (φ*)² ( k_P σ_m(Y)²/4 − 3|Y|²L²N/(υc₁) ) > 2(μ − υ) η.    (23)

For the second, we ask

    k_P ( σ_m(X)² − (φ*)² |Y||E|²|X| ) − θ₅/k_P − θ₆ = k_P ( σ_m(X)² − (φ*)² θ₄ ) − θ₅/k_P − θ₆ > 2(μ − υ) η.    (24)

Under (22), both inequalities (23) and (24) hold if k_P > k_P**, where

    k_P**(φ*, Λ_P, Λ_I, E, L, N, υ, c₁, c₃, μ) := max( ϑ₁, ϑ₂/(φ*)², ( (θ₆ + 2(μ − υ)η) + sqrt((θ₆ + 2(μ − υ)η)² + 4ϑ₃θ₅) ) / (2ϑ₃) )    (25)

in which

    ϑ₁ = 24|Y|²L²N/(σ_m(Y)² υc₁),    ϑ₂ = 16(μ − υ)η/σ_m(Y)²,
    ϑ₃ = σ_m(X)² − (φ*)² θ₄,    η = max(σ_M(Λ_P) σ_M(E)², 2σ_M(Λ_I)).

This completes the proof.
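As a numerical sanity check of Theorem 1, the sketch below simulates (2) with the coupling (A) for the simplest choice h_i(x_i) = −∇f_i(x_i), E = I (scalar states, so K = I), and κ = k_I, which anticipates the distributed gradient descent of the next section. The problem data, the ring graph, and the gains are illustrative choices, not from the paper; note that two of the local costs are nonconvex, while their sum is strongly convex.

```python
import numpy as np

# Forward-Euler simulation of (2) with PI coupling (A), h_i = -grad f_i, E = I.
# Local costs f_i(w) = (a_i/2) w^2 + b_i w; a_2 and a_4 are NEGATIVE, so those
# f_i are nonconvex, but sum(a) > 0 makes F strongly convex (Assumption 1 holds
# for the blended dynamics even though two agents are individually unstable).
a = np.array([4.0, -1.0, 3.0, -0.5])
b = np.array([1.0, -2.0, 0.5, 1.5])
w_star = -b.sum() / a.sum()                # unique minimizer of F

# Laplacian of a 4-node ring graph
L = np.array([[ 2., -1.,  0., -1.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [-1.,  0., -1.,  2.]])

kP, kI = 20.0, 2.0                         # kP large; kappa = kI, so 2*kappa > kI
dt, steps = 1e-3, 100_000
w  = np.array([2.0, -1.0, 0.0, 1.0])       # arbitrary initial states
xi = np.zeros(4)                           # case (A): no special initialization

for _ in range(steps):
    grad = a * w + b                       # local gradients of f_i
    cons_w, cons_xi = -L @ w, -L @ xi      # the sum_j (v_j - v_i) terms
    w, xi = (w + dt * (-grad + kP * cons_w + kI * cons_xi),
             xi + dt * (-kI * cons_w))

print(w, w_star)                           # every agent approaches w_star
```

Despite the two locally unstable agents, all states converge to the common minimizer, as Theorem 1(a) predicts for sufficiently large k_P.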
Remark 3
By investigating the proof of Theorem 1 in detail, the relationship between the convergence rate and the gains k_P and k_I can be inspected. In particular, from the coefficients of the three terms in (17), it may be stated that, when κ = k_I, increasing k_P with k_I kept fixed may degrade the convergence rate. One can find that those three coefficients are proportional to μ − υ, k_I²/k_P, and k_P, respectively, and so, if k_P is too large compared to κ = k_I, then the second coefficient gets smaller.

In this section, we use Theorem 1 to obtain distributed algorithms for solving the minimization problem

    min_{w ∈ R^n} F(w) = (1/N) Σ_{i=1}^N f_i(w)    (26)

under the following assumption.

Assumption 2
The cost function F(w) is continuously differentiable and strongly convex with parameter α, and the gradients ∇f_i, i ∈ N, are globally Lipschitz.

(Footnote: A function F : R^n → R is strongly convex with parameter α if (∇F(w_1) − ∇F(w_2))^T (w_1 − w_2) ≥ α|w_1 − w_2|² for all w_1 and w_2.)

It should be noted that convexity of the f_i is not assumed, while strong convexity of F is required.

The proposed distributed algorithms are based on the PI-type coupling (3). We present the case (A) in (3) only, because the case (B) yields the same convergence result as the case (A).

We first illustrate the utility of Theorem 1 by analyzing the classical distributed PI algorithm, which is given by

    ˙w_i = −∇f_i(w_i) + k_P Σ_{j∈N_i} (w_j − w_i) + k_I Σ_{j∈N_i} (ξ_j − ξ_i)
    ˙ξ_i = −k_I Σ_{j∈N_i} (w_j − w_i).    (27)

This corresponds to the case (A) in (3) with κ = k_I. In particular, the output is y_i = w_i, or E = I_n. This yields that W = I_n and Z is null, and so K = I_n, and the blended dynamics (10) consists of w̄ only and becomes

    ˙w̄ = −(1/N) Σ_{i=1}^N ∇f_i(w̄) = −∇F(w̄).    (28)

Notice that (28) is exactly the centralized gradient descent method for minimizing (26). Under Assumption 2, the blended dynamics (28) has an exponentially stable equilibrium at the minimizer w* of F with rate α. Indeed, it follows with V(w̄) = (1/2)|w̄ − w*|² that ˙V = −(w̄ − w*)^T (∇F(w̄) − ∇F(w*)) ≤ −2αV (since ∇F(w*) = 0). Thus, Assumption 1 holds, and Theorem 1 with h_i = −∇f_i and Remark 2 yield the following.

Theorem 2
Under Assumption 2, the distributed algorithm (27) solves the problem (26) in the sense that (a) w_i(t) → w*, ∀i ∈ N, when k_P is sufficiently large and k_I > 0, and (b) its convergence rate can be made arbitrarily close to the convergence rate of the centralized gradient descent method with k_P = k_I^{3/2} and sufficiently large k_I.

3.2 Distributed Heavy-ball Method with State Coupling

The argument used to obtain Theorem 2 may inspire a design paradigm for multi-agent systems. That is, the node dynamics of the individual agents (e.g., (27)) are designed such that their blended dynamics (e.g., (28)) is the system that performs the desired task.

In this sense, let us suppose that we want to solve (26) by the heavy-ball method (Qian, 1999):

    ˙w̄ = z̄
    ˙z̄ = −2√α z̄ − (1/N) Σ_{i=1}^N ∇f_i(w̄)    (29)

and raise the question of what is a suitable node dynamics that solves (26) in a distributed way.

An immediate answer to the question is

    [˙w_i; ˙z_i] = [z_i; −2√α z_i − ∇f_i(w_i)] + k_P Σ_{j∈N_i} ([w_j; z_j] − [w_i; z_i]) + k_I Σ_{j∈N_i} (ξ_j − ξ_i)
    ˙ξ_i = −k_I Σ_{j∈N_i} ([w_j; z_j] − [w_i; z_i])    (30)

where w_i ∈ R^n, z_i ∈ R^n, and ξ_i ∈ R^{2n}. Since the vectors [w_i; z_i] are communicated, we have E = I_{2n}. Therefore, W = I_{2n} and Z is null, and so the blended dynamics of (30) is nothing but (29).

The convergence property of (29) is well studied by Siegel (2019). That is, the equilibrium point of (29) is given by [w*; 0], where w* is the minimizer of (26), and the solution of (29) converges to the equilibrium exponentially fast with the rate √α/2. Hence, Assumption 1 holds, and Theorem 1 yields the following.
Theorem 3
Under Assumption 2, the states w_i and z_i of (30) converge to the equilibrium of (29), i.e., w_i(t) → w* and z_i(t) → 0, when k_P is sufficiently large and k_I > 0. Moreover, the convergence rate of the distributed algorithm (30) can be made arbitrarily close to √α/2 with k_P = k_I^{3/2} and sufficiently large k_I.

The heavy-ball method is known to outperform the gradient descent method in terms of convergence rate for a class of problems (e.g., when 0 < α ≪ 1).
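The rate gap between α (gradient descent) and √α/2 (heavy ball) is easy to observe numerically. The sketch below forward-Euler integrates the centralized gradient descent (28) and the centralized heavy-ball dynamics (29) on an ill-conditioned scalar quadratic; the cost, step size, and horizon are illustrative choices of ours, not from the paper.

```python
import numpy as np

# Compare the continuous-time gradient descent (28) and heavy ball (29) on the
# scalar cost F(w) = (alpha/2)(w - 3)^2 with a deliberately small alpha, so
# the gradient-descent rate alpha is much slower than the heavy-ball rate.
alpha, w_star = 0.01, 3.0
grad_F = lambda w: alpha * (w - w_star)

dt, T = 0.01, 200.0
steps = int(T / dt)

w_gd = 0.0                     # gradient descent:  w' = -grad F(w)
w_hb, z_hb = 0.0, 0.0          # heavy ball: w' = z, z' = -2 sqrt(alpha) z - grad F(w)
c = 2.0 * np.sqrt(alpha)

for _ in range(steps):
    w_gd += dt * (-grad_F(w_gd))
    w_hb, z_hb = w_hb + dt * z_hb, z_hb + dt * (-c * z_hb - grad_F(w_hb))

err_gd, err_hb = abs(w_gd - w_star), abs(w_hb - w_star)
print(err_gd, err_hb)          # heavy ball ends up orders of magnitude closer
```

After T = 200 time units the gradient-descent error is roughly e^{−αT} = e^{−2} of the initial error, while the heavy-ball error decays like e^{−√α T} here, which is why the second printed number is dramatically smaller.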
(Footnote: In Lemma 4 of the next subsection, we extend the proof of (Siegel, 2019), which can also be used for justifying this claim.)

If we let x_i = [w_i; z_i] in view of (2), it is seen that the algorithm (30) communicates the full state x_i of size 2n, as well as ξ_i of size 2n, because E = I_{2n}. However, we can reduce the amount of communication if we do not communicate z_i, i.e., if we let E = [I_n, 0_{n×n}]. In fact, we propose the following node dynamics in this subsection:

    [˙w_i; ˙z_i] = [z_i; −2√α z_i − ∇f_i(w_i)] + k_P [I_n; 0_{n×n}] Σ_{j∈N_i} (w_j − w_i) + k_I [I_n; 0_{n×n}] Σ_{j∈N_i} (ξ_j − ξ_i)
    ˙ξ_i = −k_I Σ_{j∈N_i} (w_j − w_i)    (31)

where ξ_i ∈ R^n.

The algorithm (31) communicates only w_i and ξ_i, whose sizes are both n. (If the case (B) of (3) is used, then we can communicate w_i only. However, in this case, the initialization Σ_{i=1}^N ξ_i(0) = 0 is necessary.) It follows from E = [I_n, 0_{n×n}] that W = [I_n; 0_{n×n}] and Z = [0_{n×n}; I_n]. This leads to the blended dynamics of (31):

    ˙w̄ = (1/N) Σ_{i=1}^N z_i
    ˙z_i = −2√α z_i − ∇f_i(w̄),    i ∈ N.    (32)

The following lemma asserts that Assumption 1 holds.

Lemma 4
Under Assumption 2, the blended dynamics (32) has the unique equilibrium point (w*, z_1*, ..., z_N*), where

    w* = argmin_w F(w)  and  z_i* = −(1/(2√α)) ∇f_i(w*),    (33)

and the equilibrium is exponentially stable with convergence rate √α/2.

PROOF.
Let z̄ := (1/N) Σ_{i=1}^N z_i, so that

    ˙w̄ = z̄
    ˙z̄ = −2√α z̄ − ∇F(w̄).

Following the derivation of (Siegel, 2019), let the function V(w̄, z̄) be defined as

    V := F(w̄) − F(w*) + (1/2)|√α (w̄ − w*) + z̄|².

Then,

    ˙V = ∇F(w̄)^T z̄ + (√α (w̄ − w*) + z̄)^T (−√α z̄ − ∇F(w̄))
       = −√α ∇F(w̄)^T (w̄ − w*) − α z̄^T (w̄ − w*) − √α |z̄|²
       ≤ −√α ( F(w̄) − F(w*) + (α/2)|w̄ − w*|² ) − α z̄^T (w̄ − w*) − √α |z̄|²,

in which strong convexity of F is used. (Footnote: Strong convexity of a function F : R^n → R with parameter α is equivalently characterized as F(x_1) ≥ F(x_2) + ∇F(x_2)^T (x_1 − x_2) + (α/2)|x_1 − x_2|² for all x_1, x_2. Here, we put x_1 = w* and x_2 = w̄.)

Now, pick an arbitrary positive number υ < √α. Since F(w) ≥ F(w*) for all w, we have

    ˙V ≤ −(√α − υ)( F(w̄) − F(w*) + (α/2)|w̄ − w*|² ) − (υα/2)|w̄ − w*|² − α z̄^T (w̄ − w*) − √α |z̄|²
       ≤ −(√α − υ)( F(w̄) − F(w*) + (1/2)|√α (w̄ − w*) + z̄|² ) − (υα/2)|w̄ − w*|² − υ√α (w̄ − w*)^T z̄ + ((√α − υ)/2 − √α)|z̄|²

and, by Young's inequality applied to υ√α (w̄ − w*)^T z̄,

    ˙V ≤ −(√α − υ) V − (υα/2)|w̄ − w*|² + (υα/4)|w̄ − w*|² + (υ + (√α − υ)/2 − √α)|z̄|²
       ≤ −(√α − υ) V − (υα/4)|w̄ − w*|².

Now, let z̃ := (R^T ⊗ I_n)([z_1; ···; z_N] − [z_1*; ···; z_N*]). Then,

    ˙z̃ = −2√α (R^T ⊗ I_n) z − (R^T ⊗ I_n)[∇f_1(w̄); ···; ∇f_N(w̄)]
       = −2√α z̃ + (R^T ⊗ I_n)[∇f_1(w*) − ∇f_1(w̄); ···; ∇f_N(w*) − ∇f_N(w̄)],

in which (33) is used. Now, let L be the maximum of the Lipschitz constants of the ∇f_i's. With

    W := V + (γ/2) z̃^T z̃    (34)

we have (recalling |R| = 1)

    ˙W ≤ −(√α − υ) V − (υα/4)|w̄ − w*|² − 2√α γ |z̃|² + γ L |w̄ − w*||z̃|
       ≤ −(√α − υ) V − 2√α γ |z̃|² + (γ²L²/(υα)) |z̃|².

With a sufficiently small γ such that (3√α + υ)/2 − γL²/(υα) ≥ 0, we obtain

    ˙W ≤ −(√α − υ)( V + (γ/2)|z̃|² ) = −(√α − υ) W.
It can be checked that W is a positive definite function in terms of w̄ − w* and z̃. Hence, we obtain that the unique equilibrium point is exponentially stable with a rate (√α − υ)/2. Since the choice of υ is arbitrary, we conclude that the convergence rate is √α/2, which completes the proof.
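The output-coupling dynamics (31) and its equilibrium (33) can be checked numerically. The sketch below simulates (31) on illustrative toy data (scalar states, a ring graph, hand-picked gains, and two nonconvex local costs), and compares the limit against (33); the numbers are our own choices, not from the paper.

```python
import numpy as np

# Forward-Euler simulation of the distributed heavy-ball method (31): each
# agent runs a local heavy-ball dynamics, and only w_i and xi_i (each of
# size n = 1) are exchanged over a 4-node ring graph.
a = np.array([4.0, -1.0, 3.0, -0.5])       # some f_i nonconvex, F strongly convex
b = np.array([1.0, -2.0, 0.5, 1.5])
N = 4
alpha = a.sum() / N                         # strong-convexity parameter of F
w_star = -b.sum() / a.sum()
z_star = -(a * w_star + b) / (2 * np.sqrt(alpha))   # equilibrium (33)

L = np.array([[ 2., -1.,  0., -1.],         # ring-graph Laplacian
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [-1.,  0., -1.,  2.]])
kP, kI = 20.0, 2.0
dt, steps = 1e-3, 100_000

w = np.array([2.0, -1.0, 0.0, 1.0])
z = np.zeros(N)
xi = np.zeros(N)                            # case (A): no initialization needed

for _ in range(steps):
    grad = a * w + b
    cons_w, cons_xi = -L @ w, -L @ xi
    w_new  = w  + dt * (z + kP * cons_w + kI * cons_xi)
    z_new  = z  + dt * (-2 * np.sqrt(alpha) * z - grad)
    xi_new = xi + dt * (-kI * cons_w)
    w, z, xi = w_new, z_new, xi_new

print(w, z)    # w_i near w_star, z_i near the corresponding z_star entry
```

The w_i reach consensus at the minimizer, while each z_i settles at its own value −∇f_i(w*)/(2√α), so the momentum states do not reach consensus, consistent with (33).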
Theorem 5
Under Assumption 2, the states w_i and z_i of (31) converge to the equilibrium in (33), i.e., w_i(t) → w* and z_i(t) → −∇f_i(w*)/(2√α), when k_P is sufficiently large and k_I > 0. Moreover, the convergence rate of the distributed algorithm (31) can be made arbitrarily close to √α/2 with k_P = k_I^{3/2} and sufficiently large k_I.

Compared with the existing algorithms, the classical PI algorithms (e.g., Kia et al. (2015); Yang et al. (2017)) communicate n-dimensional information to the neighbors while requiring a specific initial condition (like the case (B) of (3)). Algorithms have been proposed that do not require specific initializations (e.g., Hatanaka et al. (2018); Wang and Elia (2010)), but these communicate 2n-dimensional information. Additionally, most works only prove asymptotic convergence, while exponential convergence is a preferred property. An exception is (Kia et al., 2015), which proves exponential convergence, but accelerated methods such as the heavy-ball method are not studied. Among discrete-time algorithms, the distributed Nesterov method studied by Qu and Li (2020) communicates 3n-dimensional information and requires initialization. The distributed heavy-ball method proposed by Xin and Khan (2019) communicates 2n-dimensional information but still requires a specific initial condition. Additionally, the convergence rates of these discrete-time algorithms do not match the rates of the corresponding centralized algorithms. The proposed algorithm (31) implements the distributed heavy-ball method while communicating only 2n-dimensional information and being initialization-free. In addition, if the proposed algorithm is implemented with the case (B) of (3), then the agents communicate only n-dimensional information, while it is no longer initialization-free. In both cases (A) and (B), we recover the convergence rate of the centralized algorithm arbitrarily closely. These discussions are summarized in Table 1.
For a numerical simulation, let us consider a distributed quadratic problem with N = 12 agents.

Table 1
Comparison with existing algorithms.

Study | Requires convexity of f_i | Initialization-free | Required communication | Approach | Convergence rate (if strongly convex)
Wang and Elia (2010) | Yes | Yes | 2n | Not available | Asymptotic
Kia et al. (2015) | Yes | No | n | Lyapunov | Exponential
Yang et al. (2017) | Yes | No | n | LaSalle | Asymptotic
Hatanaka et al. (2018) | Yes | Yes | 2n | LaSalle | Asymptotic
Xin and Khan (2019) | Yes | No | 2n | Lyapunov | Exponential
Qu and Li (2020) | Yes | No | 3n | Lyapunov | Exponential
Proposed algorithm (A), or (31) | No | Yes | 2n | Lyapunov | Exponential
Proposed algorithm (B) | No | No | n | Lyapunov | Exponential

The cost function of each agent is given by f_i(w) = w^T A_i w + b_i^T w, where A_i is a symmetric matrix and b_i is a vector. It is supposed that Σ_{i=1}^N A_i > 0, while each A_i may be indefinite. The condition number C of (1/N)Σ_{i=1}^N f_i(w) is defined as the ratio of the maximum to the minimum eigenvalue of (1/N)Σ_{i=1}^N A_i. The matrices A_i and vectors b_i are generated randomly such that the maximum eigenvalue of (1/N)Σ_{i=1}^N A_i is approximately 1.0, while the condition number is large (i.e., α is small) in order to see the effect of heavy-ball methods. The communication graph is generated randomly using the Erdős–Rényi model with a fixed edge probability.

Simulation results are shown in Fig. 1, where the average error to the optimal value is plotted on the vertical axis (i.e., (1/N)Σ_{i=1}^N |w_i(t) − w*| for distributed algorithms and |w(t) − w*| for centralized algorithms) and the horizontal axis is the time t. Two heavy-ball methods, by the state coupling in (30) (denoted by 'HB-State') and by the output coupling in (31) (denoted by 'HB-Output'), are implemented, both corresponding to the case (A). Thus, (30) communicates 4n-dimensional information whereas (31) communicates 2n-dimensional information. The results are compared with the PI algorithm (denoted by 'PI') of Hatanaka et al. (2018), the centralized gradient descent ('CGD'), and the centralized heavy-ball method (29) (denoted by 'CHB'). One can obtain and run the simulation code at https://doi.org/10.24433/CO.9155541.v1.

The first graph of Fig. 1 shows the result with low gains (k_I = 0.15). It can be seen that the heavy-ball methods outperform the slower gradient descent methods. We can also see that the proposed distributed heavy-ball algorithms converge to the optimal value but do not recover the convergence rate of the centralized heavy-ball method. On the other hand, the second graph of Fig. 1 shows the result with high gains (k_I = 0.5). Since sufficiently high gains are used, the proposed algorithms recover the convergence rate of the centralized heavy-ball method.
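The two ingredients of this simulation can be reproduced in a few lines: (i) indefinite local Hessians A_i whose average is positive definite with a large condition number, and (ii) the speed-up of the heavy-ball dynamics over plain gradient flow on the resulting ill-conditioned quadratic. The sketch below uses the textbook continuous-time heavy-ball ODE ẅ + 2√α ẇ + ∇F(w) = 0 as a stand-in for the paper's (29); the specific matrices, step size, and horizon are illustrative choices, not the paper's.

```python
# (i) Indefinite local Hessians with a positive-definite average, and
# (ii) gradient flow vs. heavy-ball ODE on F(w) = 0.5 * w^T H w (w* = 0).
import numpy as np

# Each A_i is indefinite, but the average Hessian H is positive definite
# with a large condition number C (hence small alpha), as in the setup.
A1 = np.array([[ 3.0, 0.0], [0.0, -0.9]])
A2 = np.array([[-1.0, 0.0], [0.0,  1.0]])
H = (A1 + A2) / 2.0                      # average Hessian = diag(1.0, 0.05)
eigs = np.linalg.eigvalsh(H)             # ascending order
alpha, C = eigs[0], eigs[-1] / eigs[0]   # alpha = 0.05, C = 20

# Forward-Euler integration of gradient flow  w' = -H w  versus the
# heavy-ball ODE  w'' + 2*sqrt(alpha)*w' + H w = 0.
dt, T = 0.01, 100.0
w_gf = np.array([1.0, 1.0])              # gradient-flow state
w_hb = np.array([1.0, 1.0])              # heavy-ball position
v_hb = np.zeros(2)                       # heavy-ball velocity
for _ in range(int(T / dt)):
    w_gf = w_gf - dt * (H @ w_gf)
    w_hb, v_hb = (w_hb + dt * v_hb,
                  v_hb + dt * (-2.0 * np.sqrt(alpha) * v_hb - H @ w_hb))

# Heavy-ball error decays like exp(-sqrt(alpha)*t), gradient flow only
# like exp(-alpha*t), so the heavy-ball error is orders of magnitude smaller.
print(np.linalg.norm(w_gf), np.linalg.norm(w_hb))
```

Along the slow eigendirection (eigenvalue α = 0.05), the gradient flow decays only like e^{−αt}, while every heavy-ball mode has real part −√α; this gap is exactly what the low-gain/high-gain comparison in Fig. 1 illustrates at the network level.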
We have studied distributed continuous-time algorithms, based on PI-type couplings, to solve distributed optimization problems in which the global cost function (1/N)Σ_{i=1}^N f_i(w) is α-strongly convex while the individual f_i's are not necessarily convex. A concept of blended dynamics is proposed to analyze the system, and it is shown that the properties of the blended dynamics are crucial in characterizing the behavior of the overall system. It is also shown that the distributed algorithm recovers the convergence rate of the blended dynamics with suitably chosen coupling gains. Using these results, distributed algorithms are constructed from centralized ones; in particular, distributed heavy-ball methods are proposed which achieve the convergence rate of √α/2.

References
Alghunaim, S. A., & Sayed, A. H. (2020). Linear convergence of primal–dual gradient methods and their performance in distributed optimization. Automatica, 117, 109003.
Corless, M., & Glielmo, L. (1998). New converse Lyapunov theorems and related results on exponential stability. Mathematics of Control, Signals, and Systems, 11, 79–100.

[Fig. 1. Simulation results with low gains (top) and high gains (bottom).]

Gharesifard, B., & Cortés, J. (2014). Distributed continuous-time convex optimization on weight-balanced digraphs. IEEE Transactions on Automatic Control, 59(3), 781–786.
Godsil, C., & Royle, G. (2001). Algebraic Graph Theory. Springer.
Hatanaka, T., Chopra, N., Ishizaki, T., & Li, N. (2018). Passivity-based distributed optimization with communication delays using PI consensus algorithm. IEEE Transactions on Automatic Control, 63(12), 4421–4428.
Jakovetić, D. (2019). A unification and generalization of exact distributed first-order methods. IEEE Transactions on Signal and Information Processing over Networks, 5(1), 31–46.
Kia, S. S., Cortés, J., & Martínez, S. (2015). Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication. Automatica, 55, 254–264.
Lee, J. G., & Shim, H. (2020). A tool for analysis and synthesis of heterogeneous multi-agent systems under rank-deficient coupling. Automatica, 117, 108952.
Lee, S., & Shim, H. (2020). Blended dynamics approach for analysis and construction of distributed optimization algorithms. International Conference on Control, Automation and Systems (pp. 536–541).
Nedić, A., & Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48–61.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston.
Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151.
Qu, G., & Li, N. (2018). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245–1260.
Qu, G., & Li, N. (2020). Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 65(6), 2566–2581.
Shi, B., Du, S. S., Su, W., & Jordan, M. I. (2019). Acceleration via symplectic discretization of high-resolution differential equations. Advances in Neural Information Processing Systems (pp. 5744–5752).
Shi, W., Ling, Q., Wu, G., & Yin, W. (2015). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944–966.
Siegel, J. W. (2019). Accelerated first-order methods: differential equations and Lyapunov functions. arXiv:1903.05671.
Su, W., Boyd, S., & Candès, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17, 1–43.
Wang, J., & Elia, N. (2010). Control approach to distributed optimization. In Proceedings of 48th Allerton Conference on Communication, Control, and Computing (pp. 557–561).
Wilson, A. C., Recht, B., & Jordan, M. I. (2016). A Lyapunov analysis of momentum methods in optimization. arXiv:1611.02635v4.
Xin, R., & Khan, U. A. (2019). Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Transactions on Automatic Control, 65(6), 2627–2633.
Yang, S., Liu, Q., & Wang, J. (2017). A multi-agent system with a proportional-integral protocol for distributed constrained optimization. IEEE Transactions on Automatic Control, 62(7), 3461–3467.
Yang, T., Yi, X., Wu, J., Yuan, Y., Wu, D., Meng, Z., et al. (2019). A survey of distributed optimization. Annual Reviews in Control, 47, 278–305.
Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent.