Distributed Newton Optimization with Maximized Convergence Rate
Damián Marelli, Yong Xu†, Minyue Fu, Fellow, IEEE, and Zenghong Huang
Abstract—The distributed optimization problem is set up in a collection of nodes interconnected via a communication network. The goal is to find the minimizer of a global function formed by the addition of partial functions locally known at each node. A number of methods are available for addressing this problem, having different advantages. The goal of this work is to achieve the maximum possible convergence rate. As a first step towards this end, we propose a new method which we show converges faster than other available options. We then carry out a theoretical analysis which yields guarantees for convergence in a neighborhood of a local optimum and quantifies its asymptotic convergence rate. As with most distributed optimization methods, this rate depends on a step size parameter. Our second step toward our goal consists in choosing the optimal step size in the sense of maximizing the convergence rate. Since this optimal value depends on the unknown global function, we tackle the problem by proposing a fully distributed method for estimating it. We present numerical experiments showing that, for the same step size, our method converges significantly faster than its rivals. Experiments also show that the distributed step size estimation method achieves the theoretically maximum asymptotic convergence rate.
I. INTRODUCTION
A networked system is a web of intelligent sensing and computing devices connected via a communication network. Its main goal is to carry out a computational task in a distributed manner. The goal of a distributed optimization method is to minimize a cost function formed by a sum of local functions which are only known by each node [1], [2], [3], [4], [5]. It finds applications in power systems, sensor networks, smart buildings, smart manufacturing, etc. The available distributed optimization methods can be classified according to different criteria. We describe below those criteria used in this work.

One classification criterion is between sequential and simultaneous methods. In a sequential method, nodes take turns to tune their local variables using their local function as well as information received from other nodes [6]. The main drawback
Damián Marelli is with the School of Automation, Guangdong University of Technology, Guangzhou, China, and with the French Argentine International Center for Information and Systems Sciences, National Scientific and Technical Research Council, Argentina. Email: [email protected].
Yong Xu is with the School of Automation, Guangdong University of Technology, China. Email: [email protected].
Minyue Fu is with the School of Electrical Engineering and Computer Science, University of Newcastle, Callaghan, NSW 2308, Australia.
Zenghong Huang is with the School of Automation, Guangdong University of Technology, China. Email: [email protected].
†Corresponding author.
This work was supported by the Argentinean Agency for Scientific and Technological Promotion (PICT-201-0985) and by the National Natural Science Foundation of China (Grant Nos. 61633014, 61803101 and U1701264).

of these methods is that they do not scale well for large networks, since many turns are needed to guarantee that all nodes are visited. Also, for a fully distributed implementation, a distributed mechanism is needed to guarantee that each node is regularly visited in the sequence. A popular approach within this line is based on the alternating direction method of multipliers [7], [8]. In contrast to sequential methods, a simultaneous method iterates over a computation step, in which all nodes carry out local computations, and a communication step, in which nodes communicate information with their neighbors. These two steps typically depend on the number of local neighbors and not on the network size. In this way, simultaneous methods avoid the scalability problems of sequential ones.

Another classification criterion is between methods using first order derivatives and those using derivatives of second order. There is a vast literature on first order methods [9], [10], [11], [12], [13], [14], as well as the survey [2]. As with centralized optimization methods, the advantage of first order distributed methods is that they are simpler to implement and analyze.
However, second order distributed methods converge much faster, leading to lower computational and communication requirements.

Upon convergence, a distributed optimization method needs to guarantee that all nodes obtain the same optimal value. Another classification criterion is based on how the distributed method guarantees this inter-node matching property. One approach consists in adding a constraint to the optimization program forcing this property. In [15], [16] the resulting constrained optimization problem is solved by adding a penalization term. This gives an approximate solution which becomes exact as the step size used in the optimization recursions decreases to zero. This has the disadvantage of slowing down convergence. This is avoided in [17] by solving the constrained optimization program via its dual. However, at each iteration, each node needs to solve a local optimization problem to evaluate the Lagrange dual function. These two approaches were also considered in [18]. All these methods require the local functions at each node to be strongly convex, which is a somewhat strong requirement, since these local functions typically represent partial information about the variables to be optimized. They also require, at each iteration, inverting a matrix using a recursive inversion formula. This requires running sub-iterations, each involving a computation/communication step, between every two main iterations. These drawbacks are avoided by forcing the inter-node matching using average consensus. By doing so, the resulting optimization program is unconstrained. As pointed out in [19], an additional advantage of this approach is that the large literature available for average consensus permits guaranteeing its robustness to asynchronous communications, packet losses, time-varying network topology, undirected communication links, etc.

In view of the above classification, we are interested in simultaneous second order distributed methods based on average consensus.
A seminal method within this line was introduced in [19], [20]. It was then extended in [21] to deal with asynchronous communications, in [22] to deal with lossy communications, and in [23] to deal with event-triggered communications in the context of continuous-time optimization methods. A fast convergent variant of this method was proposed in [24] using finite-time average consensus. However, this kind of consensus requires global knowledge of the network structure, hence the method is not fully distributed. In this work we build upon this line by focusing on convergence speed. Our contributions are the following:

(1) We propose a variant of the aforementioned distributed method. The proposed variant converges much faster than the other methods, for the same optimization step size. Moreover, it is also more robust to the choice of the step size, in the sense that it converges with step sizes which are large enough to cause the divergence of other methods. This permits choosing larger step sizes to speed up convergence. (2) With the aim of maximizing the convergence speed of the proposed algorithm, we carry out a theoretical convergence analysis. This result yields both a characterization of the dynamics of the local parameter vectors around a local optimum of the target function, and a condition guaranteeing the convergence of these parameters in a neighborhood of this local optimum. These results permit readily obtaining an expression for the asymptotic convergence rate. (3) Using this convergence rate measure, we propose a fully distributed method for determining, at each node, the optimal step size in the sense of maximizing the convergence speed.

The rest of the paper is organized as follows. In Section II we state the research problem. In Section III we derive the proposed distributed optimization method and compare it with other available methods.
In Section IV we provide theoretical guarantees for stability, derive an expression for the asymptotic convergence rate, and propose a distributed method for estimating the optimal step size. In Section V we apply our results to a practical problem, namely, target localization, and present numerical experiments confirming our claims. Concluding remarks are given in Section VI.

II. PROBLEM STATEMENT
Notation 1.
The sets of natural and real numbers are denoted by N and R, respectively. The set of symmetric N × N matrices is denoted by S^N, and the set of positive definite matrices of the same dimension by P^N. For a vector x ∈ R^N, we use ‖x‖ to denote its 2-norm. For a matrix A ∈ R^{N×N}, we use ‖A‖_F to denote its Frobenius norm, ‖A‖ its operator norm (induced by the vector 2-norm), ρ(A) its spectral radius and κ(A) its condition number. Also, I_N denotes the N × N identity matrix and 1_N the N-dimensional column vector filled with ones. To simplify the notation, we often omit the subscript in I_N and 1_N when the dimension can be clearly inferred from the context. We use col(x^1, ..., x^I) to denote the column vector formed by stacking the elements x^1, ..., x^I, and ⊗ to denote the Kronecker product. Finally, O(x) denotes Bachmann–Landau's big O notation as x → ∞.

We have a network of I nodes. Node i can evaluate the function f_i : R^N → R and send messages to its out-neighbors N_i ⊆ {1, ..., I} using a consensus network. The communication link from node i to node j ∈ N_i has time-invariant gain w_{j,i}. We assume that w_{i,j} = 0 if j ∉ N_i. We also assume that the graph induced by the communication network is balanced (i.e., possibly directed) and strongly connected. This implies that the matrix W = [w_{i,j}]_{i,j=1}^I is doubly stochastic and primitive. As explained in [26], a consequence of this is that 1_I^⊤ W = 1_I^⊤ and W 1_I = 1_I. Also, for any x_0 = [x^1_0, ..., x^I_0]^⊤ ∈ R^I, the sequence generated by x_{k+1} = W x_k satisfies

lim_{k→∞} x_k = 1_I (1/I) Σ_{i=1}^I x^i_0.

The goal of distributed optimization is to design a distributed method for solving the following minimization problem

x⋆ ∈ arg min_{x ∈ R^N} f(x), with f(x) = (1/I) Σ_{i=1}^I f_i(x)
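The averaging recursion x_{k+1} = W x_k above can be checked numerically; the following sketch uses hypothetical doubly stochastic ring weights (not part of the problem statement):

```python
import numpy as np

def ring_weights(I):
    # Hypothetical doubly stochastic weights for a ring network:
    # each node averages itself with its two neighbors.
    W = np.zeros((I, I))
    for i in range(I):
        W[i, i] = 0.5
        W[i, (i - 1) % I] = 0.25
        W[i, (i + 1) % I] = 0.25
    return W

I = 8
W = ring_weights(I)
rng = np.random.default_rng(0)
x = rng.normal(size=I)   # initial local values x_0^i
avg = x.mean()           # the target of average consensus

for _ in range(200):     # x_{k+1} = W x_k
    x = W @ x

assert np.allclose(x, avg)   # every node holds the network average
```

Since W is doubly stochastic, the mean of the entries of x_k is preserved at every step, which is why the limit is exactly the initial average.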
(1)

Solving (1) using a centralized or distributed method requires certain assumptions on the objective function f, e.g., convexity, quasi-convexity, an everywhere positive definite Hessian matrix, etc. These assumptions may be too strong in certain applications. When none of these assumptions can be made, it is often enough to solve

x⋆ ∈ loc min_{x ∈ R^N} f(x), (2)

where loc min denotes the set of local minimizers of f. As mentioned in the introduction, a number of methods are available for solving either (1) or (2), and our preferred choice is the family of methods in [19], [20], [21], [22], [23]. In the next section we derive a new method that can be regarded as a variant of the aforementioned ones. As we explain, the design of the proposed variant aims at maximizing convergence speed. We then complement the proposed design by quantifying its asymptotic convergence rate and maximizing it. In contrast with most results in the field, our results are valid for the general case (2). They only require that the Hessian matrix is positive definite at the local optimum x⋆.

III. PROPOSED DISTRIBUTED NEWTON METHOD
In this section we introduce the proposed distributed optimization method. In Section III-A we introduce some background on static and dynamic average consensus. Using this, in Section III-B we derive the proposed method, and in Section III-C we describe its differences with respect to the variants that have been used in the literature.

In this work we assume that the weights w_{i,j} are given. However, they can be chosen to optimize different criteria, as done in [25] in the context of distributed optimization.

A. Static and dynamic average consensus
As described in Section I, we are interested in methods achieving inter-node matching of minimization parameters via average consensus. In this section we briefly describe the options available for doing so.

Suppose that, in the network described in Section II, each node i knows a variable v^i ∈ V, i ∈ {1, ..., I}, where V is a vector space. In order to make the presentation valid in the general case, we assume that V is an arbitrary vector space, i.e., each v^i can be either a scalar, a vector, a matrix, etc. The goal of (static) average consensus is to compute the average u = (1/I) Σ_{i=1}^I v^i in a distributed manner. This is done using the following iterations

v^i_{k+1} = Σ_{j=1}^I w_{i,j} v^j_k,

initialized by v^i_0 = v^i [26]. Letting v_k = col(v^1_k, ..., v^I_k) ∈ V^I, we can write the above compactly as follows

v_{k+1} = W v_k. (3)

As is known, the above recursions converge as O(λ^k), where λ is the second largest eigenvalue of the matrix (W + W^⊤)/2.

Suppose now that each node i knows a time-varying sequence of variables v^i_k ∈ V, k ∈ N. The goal of the dynamic average consensus technique [27] is to obtain, at each k, an estimate of the average u_k = (1/I) Σ_{i=1}^I v^i_k. This is done as follows: Suppose that at time k node i knows an estimate u^i_{k−1} of u_{k−1}. It then transmits the following message

s^i_k = u^i_{k−1} + v^i_k − v^i_{k−1}, (4)

to its out-neighbors. On reception, node i obtains

u^i_k = Σ_{j=1}^I w_{i,j} s^j_k. (5)

The above iterations are initialized by s^i_0 = v^i_0. We can combine (4)-(5) in two ways, namely, in message form

s^i_{k+1} = Σ_{j=1}^I w_{i,j} s^j_k + v^i_{k+1} − v^i_k,

or in estimate form

u^i_{k+1} = Σ_{j=1}^I w_{i,j} (u^j_k + v^j_{k+1} − v^j_k).

B. The proposed method
The essential idea consists in distributing the Newton iterations

x_{k+1} = x_k − α [∇²f(x_k)]^{−1} ∇f(x_k), (6)

where α is called the step size. To this end we make use of the dynamic average consensus technique [27]. Let

f̄(x^1_k, ..., x^I_k) = (1/I) Σ_{i=1}^I f_i(x^i_k).

We can obtain an estimate g^i_k of ∇f̄(x^1_k, ..., x^I_k) = (1/I) Σ_{i=1}^I ∇f_i(x^i_k), at each node i, by applying dynamic average consensus on the inputs ∇f_i(x^i_k). This yields the following recursions written in estimate form

g^i_k = Σ_{j=1}^I w_{i,j} [g^j_{k−1} + ∇f_j(x^j_k) − ∇f_j(x^j_{k−1})].

We can do the same to obtain an estimate H^i_k of the Hessian ∇²f̄(x^1_k, ..., x^I_k) = (1/I) Σ_{i=1}^I ∇²f_i(x^i_k), using the inputs ∇²f_i(x^i_k). This gives

H^i_k = Σ_{j=1}^I w_{i,j} [H^j_{k−1} + ∇²f_j(x^j_k) − ∇²f_j(x^j_{k−1})].

It is easy to see that f̄(x, ..., x) = f(x). Hence, if

x^i_k ≃ x_k for all i ∈ {1, ..., I}, (7)

and some x_k, then g^i_k and H^i_k are estimates of ∇f(x_k) and ∇²f(x_k), respectively. Thus, in principle, each node i could use g^i_k and H^i_k, in place of ∇f(x_k) and ∇²f(x_k), to locally carry out the iterations (6). This would yield, at node i, the following sequence of estimates of x⋆

x̆^i_{k+1} = x̆^i_k − α [H^i_k]^{−1} g^i_k.

But the above requires (7) to hold, with x^i_k replaced by x̆^i_k. In order to enforce that, once again we apply dynamic average consensus on the inputs x̆^i_k. Writing the result in message form we obtain

x^i_{k+1} = Σ_{j=1}^I w_{i,j} x^j_k + x̆^i_{k+1} − x̆^i_k = Σ_{j=1}^I w_{i,j} x^j_k − α [H^i_k]^{−1} g^i_k.
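The gradient and Hessian trackers above are instances of dynamic average consensus. As a standalone sketch of the technique, the nodes below track the average of slowly varying local signals (the ring weights and the sinusoidal inputs are hypothetical choices):

```python
import numpy as np

def ring_weights(I):
    # Hypothetical doubly stochastic ring weights.
    W = np.zeros((I, I))
    for i in range(I):
        W[i, i] = 0.5
        W[i, (i - 1) % I] = 0.25
        W[i, (i + 1) % I] = 0.25
    return W

I, K = 8, 400
W = ring_weights(I)
rng = np.random.default_rng(1)
phase = rng.uniform(0.0, 2.0 * np.pi, size=I)

def v(k):
    # Slowly varying local inputs v_k^i (hypothetical signals).
    return np.sin(2.0 * np.pi * k / 1000.0 + phase)

u = v(0)                 # estimate form, initialized with u_0^i = v_0^i
for k in range(K):
    # u_{k+1}^i = sum_j w_{i,j} (u_k^j + v_{k+1}^j - v_k^j)
    u = W @ (u + v(k + 1) - v(k))

# The estimates track the time-varying average with a small error, and
# their mean matches the true average exactly (W is doubly stochastic).
true_avg = v(K).mean()
assert np.isclose(u.mean(), true_avg)
assert np.abs(u - true_avg).max() < 0.25
```

The residual tracking error shrinks as the inputs vary more slowly or the network mixes faster, which is the behavior the convergence analysis below quantifies for the gradient and Hessian trackers.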
Finally, since the approximations H^i_k to the Hessian are obtained via dynamic average consensus on the local Hessian matrices ∇²f_j(x^j_k), and the latter may fail to be positive definite, some mechanism is required to guarantee that H^i_k is positive definite. To do so, we let β > 0 and use a map B : S^N → P^N to guarantee that B(H^i_k) ≥ β^{−1} I. The map B is defined as follows: Let H ∈ S^N and H = U Λ U^⊤ be its spectral decomposition, with Λ = diag(λ_1, ..., λ_N). Let Λ̃ = diag(λ̃_1, ..., λ̃_N), with λ̃_i = λ_i if λ_i ≥ β^{−1} and λ̃_i = β^{−1} otherwise. Then B(H) = U Λ̃ U^⊤. The question then arises as to how to choose the parameter β. This is addressed in Assumption 1 of our main result, Theorem 1.

To summarize the above, the proposed algorithm is given by the following recursions,

x^i_{k+1} = Σ_{j=1}^I w_{i,j} x^j_k − α B(H^i_k)^{−1} g^i_k, (8)
g^i_{k+1} = Σ_{j=1}^I w_{i,j} [g^j_k + ∇f_j(x^j_{k+1}) − ∇f_j(x^j_k)], (9)
H^i_{k+1} = Σ_{j=1}^I w_{i,j} [H^j_k + ∇²f_j(x^j_{k+1}) − ∇²f_j(x^j_k)], (10)

which are initialized by x^i_0 = x^i_init, g^i_0 = ∇f_i(x^i_init) and H^i_0 = ∇²f_i(x^i_init).

Remark. The reader may wonder why we choose to write (8) in message form while (9) and (10) are in estimate form. This is done to put the algorithm in a form compatible with other algorithms in the literature.
Remark. Notice that the information exchanged by each node at each time step does not grow with the network size, as it depends on the number of out-neighbors of each node. This property is common to all simultaneous, second order methods based on average consensus [19], [20], [21], [22], [23].
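A minimal sketch of recursions (8)-(10), including the eigenvalue-clipping map B, follows. The quadratic local functions, ring weights, and parameter values are hypothetical stand-ins chosen only to illustrate the mechanics, not the paper's experimental setup:

```python
import numpy as np

def clip_pd(H, beta):
    # The map B: clip the eigenvalues of a symmetric H from below at 1/beta.
    lam, U = np.linalg.eigh(H)
    return (U * np.maximum(lam, 1.0 / beta)) @ U.T

def ring_weights(I):
    W = np.zeros((I, I))
    for i in range(I):
        W[i, i] = 0.5
        W[i, (i - 1) % I] = 0.25
        W[i, (i + 1) % I] = 0.25
    return W

# Hypothetical local functions f_i(x) = 0.5 (x - b_i)' Q_i (x - b_i).
rng = np.random.default_rng(2)
I, N = 8, 2
Q = [np.diag(rng.uniform(0.8, 1.2, size=N)) for _ in range(I)]
b = [rng.normal(size=N) for _ in range(I)]
grad = lambda j, y: Q[j] @ (y - b[j])
hess = lambda j, y: Q[j]

W = ring_weights(I)
alpha, beta = 0.1, 10.0
x = [rng.normal(size=N) for _ in range(I)]
g = [grad(i, x[i]) for i in range(I)]   # g_0^i  = grad f_i(x_init^i)
H = [hess(i, x[i]) for i in range(I)]   # H_0^i  = hess f_i(x_init^i)

for _ in range(1200):
    # (8): consensus on the parameters plus a local Newton step
    x_new = [sum(W[i, j] * x[j] for j in range(I))
             - alpha * np.linalg.solve(clip_pd(H[i], beta), g[i])
             for i in range(I)]
    # (9)-(10): dynamic average consensus on gradients and Hessians
    g = [sum(W[i, j] * (g[j] + grad(j, x_new[j]) - grad(j, x[j]))
             for j in range(I)) for i in range(I)]
    H = [sum(W[i, j] * (H[j] + hess(j, x_new[j]) - hess(j, x[j]))
             for j in range(I)) for i in range(I)]
    x = x_new

# All nodes approach the centralized minimizer of (1/I) sum_i f_i.
x_star = np.linalg.solve(sum(Q), sum(Qi @ bi for Qi, bi in zip(Q, b)))
assert all(np.linalg.norm(xi - x_star) < 1e-5 for xi in x)
```

For these well-conditioned quadratics the clipping in B is never active; it becomes essential when some local Hessians are indefinite.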
C. Comparison with similar algorithms
As mentioned above, the proposed algorithm (8)-(10) is a variant of the algorithm used in [19], [20], [21], [22], [23]. The latter differs from (8)-(10) in essentially two aspects. The first one is that consensus is not done on the parameters x^i_k. This means that (8) is replaced by

x^i_{k+1} = x^i_k − α B(H^i_k)^{−1} g^i_k. (11)

Together with (9)-(10), equation (11) forms an algorithm that, for later reference, we refer to as Algorithm A.

The second difference consists in using the following transformation of (6)

x_{k+1} = (1 − α) x_k + α [∇²f(x_k)]^{−1} ℓ(x_k), (12)

where ℓ(x) = ∇²f(x) x − ∇f(x). Using dynamic average consensus, we can estimate ℓ(x_k) at each node using

l^i_{k+1} = Σ_{j=1}^I w_{i,j} [l^j_k + ℓ_j(x^j_{k+1}) − ℓ_j(x^j_k)], (13)

where ℓ_i(x) = ∇²f_i(x) x − ∇f_i(x). We can then distribute iterations (12) as follows

x^i_{k+1} = (1 − α) Σ_{j=1}^I w_{i,j} x^j_k + α B(H^i_k)^{−1} l^i_k. (14)

We refer to the algorithm resulting from (14), (13) and (10) as Algorithm B.

Finally, the algorithm used in [19], [20], [21], [22], [23], apart from other minor differences, essentially consists in combining the modifications introduced by Algorithms A and B. This leads to the recursions formed by

x^i_{k+1} = (1 − α) x^i_k + α B(H^i_k)^{−1} l^i_k, (15)

together with (13) and (10). We refer to it as Algorithm VZCPS, standing for the initials of the authors who proposed it.

As we show with experiments in Section V, the modifications (11) and (14)-(13), introduced by Algorithms A and B, respectively, drastically slow down convergence and can cause instability. More precisely, (11) does not guarantee the convergence of each x^i_k to x⋆, due to the lack of consensus on these parameters. Convergence only occurs if modification (14)-(13) is also considered, i.e., in the VZCPS algorithm, although at a much slower rate.
However, a feature of the latter modification is that the first term in (14) pushes the local variables x^i_k towards zero at each iteration. This pushing is compensated by the second term, but only after consensus on the parameters x^i_k is reached. Before this happens, this zero-pushing effect has a negative influence if the minimizing parameters x⋆ are far from zero. As we show in Section V, this can slow down convergence and even cause instability.

IV. CONVERGENCE RATE ANALYSIS
In this section we carry out a theoretical convergence rate analysis of the proposed method. We derive our main result in Section IV-A, which characterizes the asymptotic dynamics of the estimates x^i_k, k ∈ N, i ∈ {1, ..., I}, at a local optimum x⋆. We use this result in Section IV-B to provide sufficient conditions to guarantee the convergence of the method in a neighborhood of the local optimum x⋆. Finally, in Section IV-C we introduce a distributed method for estimating, at each node, the step size that maximizes the asymptotic convergence rate.

A. Dynamics at a local optimum
We introduce the following required notation.
Notation 2.
Let A = ((1/I) 1_I 1_I^⊤) ⊗ I_N and Ã = I − A. Let also x_⋆ = 1_I ⊗ x⋆, x_k = col(x^1_k, ..., x^I_k), x̄_k = A x_k and x̃_k = x_k − x̄_k. We similarly define g_k = col(g^1_k, ..., g^I_k), ḡ_k = A g_k and g̃_k = g_k − ḡ_k, as well as h_k = col(H^1_k, ..., H^I_k), h̄_k = A h_k and h̃_k = h_k − h̄_k.

In the above notation, x̄_k is a block vector with all its sub-vectors equal to the average

x̄_k = (1/I) Σ_{i=1}^I x^i_k. (16)

Also, x̃_k is a block vector whose i-th sub-vector is given by x̃^i_k = x^i_k − x̄_k. Finally, notice that h_k is a (column) vector of matrices, i.e., h_k ∈ R^{IN×N}.

Notation 3.
For a block vector x = col(x^1, ..., x^I), let

g(x) = col(∇f_1(x^1), ..., ∇f_I(x^I)),
h(x) = col(∇²f_1(x^1), ..., ∇²f_I(x^I)),
H(x) = diag(h(x)),

and for a block diagonal matrix H = diag(H^1, ..., H^I), let B(H) = diag(B(H^1), ..., B(H^I)). We also define H_k = diag(h_k) and B_k = B(H_k). Recall that W = [w_{i,j}]_{i,j=1}^I; with a slight abuse of notation, we also write W for W ⊗ I_N when operating on block vectors.

Using the above notation, we can write (8)-(10) in the following block state-space form

x_{k+1} = W x_k − α B_k^{−1} g_k, (17)
g_{k+1} = W [g_k + g(x_{k+1}) − g(x_k)], (18)
h_{k+1} = W [h_k + h(x_{k+1}) − h(x_k)]. (19)

Our next step is to transform the model (17)-(19) into an equivalent model which is more convenient for our stability analysis.

Notation 4.
Let ḡ(x) = A g(x) and h̄(x) = A h(x). Let also g̃(x) = g(x) − ḡ(x), h̃(x) = h(x) − h̄(x), H̄_k = diag(h̄_k), H̄(x) = diag(h̄(x)), H̃_k = diag(h̃_k) and H̃(x) = diag(h̃(x)).

Lemma 1.
The following equivalences hold

ḡ_k = ḡ(x_k), h̄_k = h̄(x_k), H̄_k = H̄(x_k). (20)

Proof:
Since matrix W is doubly stochastic, it follows that AW = A. Then, from (18),

ḡ_{k+1} = A g_{k+1} = ḡ_k + ḡ(x_{k+1}) − ḡ(x_k). (21)

The first equation in (20) then follows from (21), since ḡ_0 = A g_0 = A g(x_0) = ḡ(x_0). The other two equations follow using the same argument.

Using Lemma 1 we can write model (17)-(19) as follows

x̄_{k+1} = x̄_k − α A B_k^{−1} g_k, (22)
x̃_{k+1} = W x̃_k − α Ã B_k^{−1} g_k, (23)
g̃_{k+1} = W [g̃_k + g̃(x̄_{k+1} + x̃_{k+1}) − g̃(x̄_k + x̃_k)], (24)
h̃_{k+1} = W [h̃_k + h̃(x̄_{k+1} + x̃_{k+1}) − h̃(x̄_k + x̃_k)], (25)

where g_k = ḡ(x̄_k + x̃_k) + g̃_k and B_k = B(H̄(x̄_k + x̃_k) + H̃_k).

We are now ready to introduce our main result. It states the asymptotic dynamics, as k tends to infinity, of the non-linear state-space model (22)-(25). Its proof essentially consists in linearizing the model around the equilibrium point given by x̄_k = x_⋆, x̃_k = 0, g̃_k = 0 and H̃_k = 0. The difficulty in doing so lies in the cumbersome task of having to differentiate the matrix valued function on the right-hand side of (25) with respect to the vectors x̄_k, x̃_k and g̃_k and the matrix h̃_k. To handle this in a compact and elegant manner we use Fréchet derivatives [28], [29], which are denoted by D.

Notation 5.
Let M = R^{IN} × R^{IN×N} × R^{IN} × R^{IN}. For ζ = (x̄, h̃, x̃, g̃) ∈ M, we define

‖ζ‖² = ‖x̄‖² + ‖h̃‖_F² + ‖x̃‖² + ‖g̃‖².

We also define the linear maps Φ, Ψ : M → M as those having the matrix representations Φ = [Φ_{1,1} Φ_{1,2}; 0 Φ_{2,2}] and Ψ = [Ψ_{1,1} Ψ_{1,2}; 0 Ψ_{2,2}], where the blocks act on the components (x̄, h̃) and (x̃, g̃) of ζ, and

Φ_{1,1} = [I 0; 0 W̃],
Φ_{1,2} = [0 0; W̃ D h(x⋆)((W − I) ·) 0],
Φ_{2,2} = [W̃ 0; W̃ H(x_⋆)(W − I) W̃],

Ψ_{1,1} = [I 0; 0 0],
Ψ_{1,2} = [H̄^{−1}(x_⋆) A H(x_⋆) 0; 0 W̃ D h(x_⋆)(H̄^{−1}(x_⋆) ·)],
Ψ_{2,2} = [0 H̄^{−1}(x_⋆); 0 W̃ H(x_⋆) H̄^{−1}(x_⋆)].

Also, W̃ = W − A, and D h(x_⋆)(M ·) denotes the linear operator y ↦ D h(x_⋆)(My), where D h(x_⋆) denotes the Fréchet derivative of h at x_⋆.

Assumption 1.
Vector x⋆ satisfies (2) (i.e., it is a local minimizer), f is twice continuously differentiable at x⋆, the Hessian matrix ∇²f(x⋆) is positive definite, and β ≥ ‖[∇²f(x⋆)]^{−1}‖.

Theorem 1.
Let ζ_k = (x̄_k − x_⋆, h̃_k, x̃_k, g̃_k). If Assumption 1 holds, then

ζ_{k+1} = (Φ − α Ψ) ζ_k + O(‖ζ_k‖²). (26)

Proof:
With some abuse of notation, we use ζ_{k+1}(ζ_k) to denote the map ζ_k ↦ ζ_{k+1} induced by (22)-(25). The result follows by doing a Taylor expansion (using Fréchet derivatives) of ζ_{k+1}(·) around 0. We have

ζ_{k+1}(ζ_k) = ζ_{k+1}(0) + D ζ_{k+1}(0)(ζ_k) + O(‖ζ_k‖²),

where ζ_{k+1}(0) = 0 and

D ζ_{k+1}(0)(ζ_k) = (D x̄_{k+1}(0)(ζ_k), D h̃_{k+1}(0)(ζ_k), D x̃_{k+1}(0)(ζ_k), D g̃_{k+1}(0)(ζ_k)).

In order to compute the above, we notice that

D g(x)(y) = J g(x) y = H(x) y,

where J g denotes the Jacobian of g. Also, it follows from Assumption 1 that

H̄(x_⋆) = I_I ⊗ (1/I) Σ_{i=1}^I ∇²f_i(x⋆) = I_I ⊗ ∇²f(x⋆) ≥ β^{−1} I_{IN},

and therefore B(H̄(x_⋆)) = H̄(x_⋆). It is clear that the partial Fréchet derivatives D_h̃ x̄_{k+1}(0) = D_h̃ x̃_{k+1}(0) = D_h̃ g̃_{k+1}(0) = 0. Hence, since x̄, x̃ and g̃ are vectors, all other partial Fréchet derivatives of x̄_{k+1}, x̃_{k+1} and g̃_{k+1} can be computed using their Jacobians. This gives

D_x̄ x̄_{k+1}(0)(x̄_k) = (1 − α) x̄_k,
D_x̃ x̄_{k+1}(0)(x̃_k) = −α H̄^{−1}(x_⋆) A H(x_⋆) x̃_k,
D_g̃ x̄_{k+1}(0)(g̃_k) = 0,
D_x̄ x̃_{k+1}(0)(x̄_k) = 0,
D_x̃ x̃_{k+1}(0)(x̃_k) = W̃ x̃_k,
D_g̃ x̃_{k+1}(0)(g̃_k) = −α H̄^{−1}(x_⋆) g̃_k,
D_x̄ g̃_{k+1}(0)(x̄_k) = 0,
D_x̃ g̃_{k+1}(0)(x̃_k) = W̃ H(x_⋆)(W − I) x̃_k,
D_g̃ g̃_{k+1}(0)(g̃_k) = [W̃ − α W̃ H(x_⋆) H̄^{−1}(x_⋆)] g̃_k.

We now compute the partial Fréchet derivatives of h̃_{k+1}(·):

D_x̄ h̃_{k+1}(0)(x̄_k) = 0,
D_x̃ h̃_{k+1}(0)(x̃_k) = W̃ D h(x_⋆)((W − I) x̃_k),
D_g̃ h̃_{k+1}(0)(g̃_k) = −α W̃ D h(x_⋆)(H̄^{−1}(x_⋆) g̃_k),
D_h̃ h̃_{k+1}(0)(h̃_k) = W̃ h̃_k.

The result then follows by adding the partial derivatives to form the matrix representation of the operator D ζ_{k+1}(0).

Remark.
In view of Theorem 1, the asymptotic convergence rate of the proposed method in a neighborhood of the local optimum x⋆ is linear. This property is shared with most available methods. To the best of our knowledge, the only method which enjoys superlinear asymptotic convergence is the one in [24]. However, as we mentioned in Section I, this is achieved by using finite-time average consensus to attain inter-node matching. Since finite-time average consensus requires global knowledge, this method is therefore not fully distributed.

Remark. In Section IV-C we propose a distributed algorithm to tune the step size α at each node. This results in step sizes that vary across nodes and time steps. The result of Theorem 1 can be straightforwardly generalized to cover this case, by simply replacing the scalar step size α with a time-varying diagonal matrix.

B. Sufficient condition for stability
Let

ζ_k = [ ‖x̄_k − x_⋆‖, ‖h̃_k‖_F, ‖x̃_k‖, ‖g̃_k‖ ]^⊤

and

υ = ‖H(x_⋆)‖, ϑ = ‖H̄^{−1}(x_⋆)‖, η = ‖D h(x_⋆)‖.

Recall the definition of λ given in Section III-A. It follows from [26, Th. 3] that ‖W − A‖ = λ, and therefore ‖W̃‖ = λ. Then, using Theorem 1, we readily obtain

ζ_{k+1} ≤ (Φ + α Ψ) ζ_k + O(‖ζ_k‖²), (27)

where Φ = [Φ_{1,1} Φ_{1,2}; 0 Φ_{2,2}] and Ψ = [Ψ_{1,1} Ψ_{1,2}; 0 Ψ_{2,2}], with

Φ_{1,1} = [1 0; 0 λ], Φ_{1,2} = [0 0; λη 0], Φ_{2,2} = [λ 0; λυ λ],
Ψ_{1,1} = [−1 0; 0 0], Ψ_{1,2} = [ϑυ 0; 0 ληϑ], Ψ_{2,2} = [0 ϑ; 0 λυϑ].

Remark. Notice that, since the operator D h(x_⋆) maps a vector y ∈ R^{IN} into a matrix in R^{IN×N}, and in the definition of ζ_k we use the Frobenius norm on these matrices, the induced norm of D h(x_⋆) is defined by

‖D h(x_⋆)‖ = sup_{‖y‖=1} ‖D h(x_⋆)(y)‖_F.

We then have the following corollary of Theorem 1.
Corollary 1.
The recursions (26) are stable in a neighborhood of x⋆ if

0 < α < min{ (1 − λ)^p / (p κ(M) ‖Ψ_{2,2}‖), 2 }, (28)

where Φ_{2,2} = M J M^{−1} is the Jordan decomposition of Φ_{2,2} and p is the maximum dimension among the Jordan blocks in J.

Proof: It follows from [30] that

ρ(Φ_{2,2} + α Ψ_{2,2}) ≤ (α p κ(M) ‖Ψ_{2,2}‖)^{1/p} + ρ(Φ_{2,2}) = (α p κ(M) ‖Ψ_{2,2}‖)^{1/p} + λ.

We then have

ρ(Φ + α Ψ) ≤ max{ |1 − α|, (α p κ(M) ‖Ψ_{2,2}‖)^{1/p} + λ }.

Then, condition (28) guarantees that ρ(Φ + α Ψ) < 1. In view of (27), this in turn implies that lim_{k→∞} ζ_k = 0, and the result follows.

An immediate consequence of Corollary 1 is that, for each α satisfying (28), there exists a neighborhood E_α of x⋆ such that, if x^i_0 ∈ E_α for all i ∈ {1, ..., I}, then lim_{k→∞} x^i_k = x⋆, for all i ∈ {1, ..., I}.

C. Adaptive step size selection
Equation (26) asserts that, in a neighborhood of the optimal solution, the speed of convergence is determined by the spectral radius ρ(Φ − α Ψ). Hence, a criterion to speed up convergence consists in choosing the step size α to minimize that spectral radius, i.e.,

α⋆ ∈ arg min_α ρ(Φ − α Ψ). (29)

The problem with doing so is that computing α⋆ requires knowledge of the global function f(x) at the optimum x⋆, which is obviously not available. To get around this, in this section we propose a distributed method for obtaining, at each k ∈ N and node i, an estimate α^i_k of α⋆.

Since the matrices Φ and Ψ are block triangular, it follows that

ρ(Φ − α Ψ) = max{ |1 − α|, λ, ρ(Φ_{2,2} − α Ψ_{2,2}) }.

Then, since ρ(Φ_{2,2}) = λ, α⋆ can be found by solving

1 − α⋆ = ρ(Φ_{2,2} − α⋆ Ψ_{2,2}). (30)

Suppose that we could obtain, at each k and i ∈ {1, ..., I}, an estimate ϱ^i_k of ρ(Φ_{2,2} − α Ψ_{2,2}). We could then obtain an estimate α^i_k of α⋆ by solving (30) using some iterative procedure. One such procedure is

α^i_{k+1} = α^i_k + δ (1 − α^i_k − ϱ^i_k), (31)

for some δ > 0. Hence, we need a distributed method for estimating ϱ^i_k.

In view of the power method [31], we know that, for each i ∈ {1, ..., I},

ρ(Φ_{2,2} − α Ψ_{2,2}) = lim_{k→∞} [ (‖x̃^i_k‖² + ‖g̃^i_k‖²) / (‖x̃^i_{k−1}‖² + ‖g̃^i_{k−1}‖²) ]^{1/2}, (32)

where x̃^i_k = x^i_k − x̄_k and g̃^i_k = g^i_k − ḡ_k, with x̄_k given by (16) and ḡ_k = (1/I) Σ_{i=1}^I g^i_k. In order to compute (32) we need a distributed method to obtain, at each node i, estimates x̂^i_k and ĝ^i_k of x̄_k and ḡ_k, respectively. This can be achieved by running L consensus steps on the variables x^i_k and g^i_k. More precisely, if we define x̂_k = col(x̂^1_k, ..., x̂^I_k) and ĝ_k = col(ĝ^1_k, ..., ĝ^I_k), we have x̂_k = W^L x_k and ĝ_k = W^L g_k.
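The damped fixed-point update (31) can be sketched with a scalar surrogate for ρ(Φ_{2,2} − α Ψ_{2,2}) (a hypothetical affine model; in the actual method this quantity is estimated online via (32)):

```python
# Hypothetical affine surrogate for rho(Phi_22 - alpha * Psi_22).
lam, c = 0.8, 0.5
rho = lambda alpha: lam + c * alpha

delta = 0.5   # damping factor delta > 0
alpha = 0.0
for _ in range(200):
    # (31): alpha_{k+1} = alpha_k + delta * (1 - alpha_k - rho_k)
    alpha += delta * (1.0 - alpha - rho(alpha))

# The fixed point solves 1 - alpha = rho(alpha),
# i.e. alpha_star = (1 - lam) / (1 + c) for this model.
alpha_star = (1.0 - lam) / (1.0 + c)
assert abs(alpha - alpha_star) < 1e-10
```

For this model the iteration map has slope 1 − δ(1 + c), so any δ small enough to make that slope less than one in magnitude yields convergence to the solution of (30).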
Notice that, while we use the notation x̂_k (resp. ĝ_k) to indicate that these variables represent averages of the entries of x_k (resp. g_k), these averages are only available with a delay of L time steps, i.e., at time step k + L.

Then, in view of (32), we obtain ϱ^i_k as follows

ϱ^i_k = [ (‖x^i_k − x̂^i_k‖² + ‖g^i_k − ĝ^i_k‖²) / (‖x^i_{k−1} − x̂^i_{k−1}‖² + ‖g^i_{k−1} − ĝ^i_{k−1}‖²) ]^{1/2}.

V. A NUMERICAL EXAMPLE
A. Case study
We consider a target localization problem. There are I = 30 nodes, each measuring its distance to a target located at x_true ∈ ℝ². Node i is located at a_i ∈ ℝ², with a_i ∼ N(x_true, × I), and its local estimate is initialized as x_0^i ∼ N(x_true, I). It measures

z_i = ‖x_true − a_i‖² + n_i,  n_i ∼ N(0, σ²),

with σ = 0. . Nodes are connected via a network with ring topology, whose gains are given by

w_{i,j} = { 0. , i = j;  0. , mod(i − j, I) = 1;  0. , mod(i − j, I) = I − 1;  0, otherwise },

where mod(a, b) denotes the a modulo b operation. This results in λ₂ = 0. .

Doing maximum likelihood estimation of x we obtain

x⋆ = arg max_x p(z_1, · · · , z_I | x) = arg min_x Σ_{i=1}^I f_i(x),

with f_i(x) = (‖x − a_i‖² − z_i)². It is straightforward to obtain

∇f_i(x) = 4 (‖x − a_i‖² − z_i)(x − a_i),
∇²f_i(x) = 8 (x − a_i)(x − a_i)⊤ + 4 (‖x − a_i‖² − z_i) I.

Using the above, the optimal step size, in the sense of minimizing (29), is given by α⋆ = 6. × 10⁻ .

B. Numerical experiments
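Before turning to the experiments, the following sketch makes the Section V-A setup concrete: it builds a doubly stochastic ring weight matrix, reads off its second-largest eigenvalue magnitude λ₂, and checks the gradient formula above by finite differences. The weight values (0.5 for a node itself, 0.25 per ring neighbor) and the data a_i, z_i, x are hypothetical stand-ins, not the paper's actual constants.

```python
import numpy as np

I_NODES = 30  # number of nodes, as in the case study

def ring_weights(n, self_w=0.5, nbr_w=0.25):
    """Doubly stochastic consensus weights for a ring topology.
    The values 0.5 / 0.25 are hypothetical stand-ins."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = self_w
        W[i, (i + 1) % n] = nbr_w
        W[i, (i - 1) % n] = nbr_w
    return W

def f_i(x, a_i, z_i):
    # f_i(x) = (||x - a_i||^2 - z_i)^2
    r = np.dot(x - a_i, x - a_i) - z_i
    return r * r

def grad_f_i(x, a_i, z_i):
    # grad f_i(x) = 4 (||x - a_i||^2 - z_i) (x - a_i)
    r = np.dot(x - a_i, x - a_i) - z_i
    return 4.0 * r * (x - a_i)

def hess_f_i(x, a_i, z_i):
    # hess f_i(x) = 8 (x - a_i)(x - a_i)^T + 4 (||x - a_i||^2 - z_i) I
    d = x - a_i
    return 8.0 * np.outer(d, d) + 4.0 * (np.dot(d, d) - z_i) * np.eye(len(x))

W = ring_weights(I_NODES)
lam = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
lam2 = lam[1]  # second-largest eigenvalue magnitude of W

# Finite-difference check of the gradient at hypothetical data.
a_i, z_i, x = np.array([1.0, -2.0]), 3.0, np.array([0.5, 0.5])
eps = 1e-6
fd = np.array([(f_i(x + eps * e, a_i, z_i) - f_i(x - eps * e, a_i, z_i)) / (2 * eps)
               for e in np.eye(2)])
err = np.linalg.norm(fd - grad_f_i(x, a_i, z_i))
```

For a symmetric circulant ring of this kind, λ₂ = 0.5 + 0.5 cos(2π/I), so a denser network (larger neighbor weights or more links) would shrink λ₂ and speed up the consensus steps.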
In the first experiment we evaluate the effect of the modification (11) introduced by Algorithm A, as described in Section III-C. We use x_true = [0, ]⊤, β = 0. and the optimal step size α⋆ = 6. × 10⁻ . We also use e_k^i ≜ ‖x_k^i − x⋆‖ as the performance metric for each node. In Figure 1-(a) we compare the performance of the proposed algorithm with that of Algorithm A. We see how the lack of a consensus stage prevents the local variables x_k^i from converging to a common value. In Figure 1-(b) we see the effect of also considering the modification (14)-(13), i.e., Algorithm VZCPS. We see that it converges, although at a much slower rate than the proposed algorithm. We then conclude that it is this second modification which causes the convergence of Algorithm VZCPS.
Fig. 1. Effect of modification (11), i.e., removing consensus on variables. (a) Algorithm A does not converge. (b) Algorithm VZCPS, which also introduces modification (14)-(13), converges slower than the proposed one.
In the second experiment we remove modification (11), i.e., add consensus on variables, and study the effect of modification (14)-(13) introduced by Algorithm B. In Figure 2-(a) we see that Algorithm B converges at a rate similar to that of the proposed algorithm. However, as explained in Section III, modification (14)-(13) has a negative effect when the minimizing parameters x⋆ are far from zero. We show this in Figure 2-(b), where we repeat the previous experiment with x_true = [300, ]⊤. We see how the local estimates of Algorithm B are pulled away from x⋆ during the initial iterations, until consensus is reached. In Figure 2-(c) we repeat the experiment with x_true = [1000, ]⊤. We see that Algorithm B is not able to reach consensus in time, which causes its divergence.
Fig. 2. Effect of modification (14)-(13). (a) With x_true = [0, ]⊤ Algorithm B performs similarly to the proposed one. (b) With x_true = [300, ]⊤ the parameters are pulled away from x⋆ until consensus is reached. (c) With x_true = [1000, ]⊤ the pulling effect is intensified, causing instability.

In the third experiment we evaluate the use of the distributed algorithm for estimating the optimal step size α⋆. To this end we use (31) with δ = 10⁻ and a delay of L = 5 time steps. In Figure 3-(a) we compare the convergence of ‖x_k^i − x⋆‖ using both the optimal step size α⋆ and the distributedly estimated one. We see that both methods converge at a very similar rate. We also show in the figure the theoretically optimal rate ρ^k(Φ − α⋆Ψ). We see that the asymptotic convergence rate of both methods closely resembles the theoretical one. We also show in Figure 3-(b) the evolution of the estimates of the spectral radius ϱ_k^i and step size α_k^i at each node.

Fig. 3. Distributed estimation of the optimal step size α⋆. (a) The asymptotic convergence rate matches the theoretically optimal one. (b) Evolution of the distributed spectral radius and step size estimates.

VI. CONCLUSION
We aimed at achieving the fastest possible convergence rate for distributed optimization, and took two steps towards this goal. In the first step we proposed a new distributed optimization method which converges faster than other available options. For this method we provided sufficient conditions for convergence in a neighborhood of a local optimum, and quantified its asymptotic convergence rate. Since the latter depends on a step size parameter, in the second step we aimed at choosing the step size that maximizes this rate. Since this value depends on global information, we proposed a fully distributed method for estimating it. We presented numerical experiments confirming our claims.

REFERENCES

[1] Ali H. Sayed. Adaptation, learning, and optimization over networks.
Foundations and Trends in Machine Learning, 7:311–801, 2014.
[2] Tao Yang, Xinlei Yi, Junfeng Wu, Ye Yuan, Di Wu, Ziyang Meng, Yiguang Hong, Hong Wang, Zongli Lin, and Karl H. Johansson. A survey of distributed optimization. Annual Reviews in Control, 2019.
[3] Angelia Nedić and Ji Liu. Distributed optimization for control. Annual Review of Control, Robotics, and Autonomous Systems, 1:77–103, 2018.
[4] Angelia Nedich. Convergence rate of distributed averaging dynamics and optimization in networks. Foundations and Trends in Systems and Control, 2(1):1–100, 2015.
[5] Bo Yang and Mikael Johansson. Distributed optimization and games: A tutorial overview. In Networked Control Systems, pages 109–148. Springer, 2010.
[6] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.
[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[8] Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In IEEE Conference on Decision and Control, pages 5445–5450. IEEE, 2012.
[9] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control, 54(1):48–61, 2009.
[10] Dušan Jakovetić, Joao Xavier, and José M. F. Moura. Fast distributed gradient methods. IEEE Trans. Autom. Control, 59(5):1131–1146, 2014.
[11] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM J. Optim., 26(3):1835–1854, 2016.
[12] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[13] Tao Yang, Yan Wan, Hong Wang, and Zongli Lin. Global optimal consensus for discrete-time multi-agent systems with bounded controls. Automatica, 97:182–185, 2018.
[14] S. Pu, W. Shi, J. Xu, and A. Nedic. Push-pull gradient methods for distributed optimization in networks. IEEE Trans. Autom. Control, 2020.
[15] Aryan Mokhtari, Qing Ling, and Alejandro Ribeiro. Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1):146–161, 2016.
[16] Fatemeh Mansoori and Ermin Wei. A fast distributed asynchronous Newton-based optimization algorithm. IEEE Transactions on Automatic Control, 2019.
[17] Rasul Tutunov, Haitham Bou-Ammar, and Ali Jadbabaie. Distributed Newton method for large-scale consensus optimization. IEEE Transactions on Automatic Control, 64(10):3983–3994, 2019.
[18] Ali Jadbabaie, Asuman Ozdaglar, and Michael Zargham. A distributed Newton method for network optimization. In IEEE Conference on Decision and Control, pages 2736–2741. IEEE, 2009.
[19] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato. Newton-Raphson consensus for distributed convex optimization. IEEE Transactions on Automatic Control, 61(4):994–1009, 2016.
[20] Filippo Zanella, Damiano Varagnolo, Angelo Cenedese, Gianluigi Pillonetto, and Luca Schenato. Newton-Raphson consensus for distributed convex optimization. In IEEE CDC, pages 5917–5922. IEEE, 2011.
[21] F. Zanella, D. Varagnolo, A. Cenedese, G. Pillonetto, and L. Schenato. Asynchronous Newton-Raphson consensus for distributed convex optimization. IFAC Proceedings Volumes, 45(26):133–138, 2012.
[22] Nicoletta Bof, Ruggero Carli, Giuseppe Notarstefano, Luca Schenato, and Damiano Varagnolo. Multiagent Newton–Raphson optimization over lossy networks. IEEE Trans. Autom. Control, 64(7):2983–2990, 2018.
[23] Y. Li, H. Zhang, B. Huang, and J. Han. A distributed Newton–Raphson-based coordination algorithm for multi-agent optimization with discrete-time communication. Neural Comput. Appl., pages 1–15, 2018.
[24] Jiaqi Zhang, Keyou You, and Tamer Başar. Distributed adaptive Newton methods with globally superlinear convergence. arXiv preprint arXiv:2002.07378, 2020.
[25] Tor Anderson, Chin-Yao Chang, and Sonia Martínez. Distributed approximate Newton algorithms and weight design for constrained optimization. Automatica, 109:108538, 2019.
[26] R. Olfati-Saber, J. Fax, and R. Murray. Consensus and cooperation in networked multi-agent systems. Proc. IEEE, 95(1):215–233, 2007.
[27] Minghui Zhu and Sonia Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.
[28] Richard S. Hamilton et al. The inverse function theorem of Nash and Moser. Bull. Am. Math. Soc., 7(1):65–222, 1982.
[29] Rodney Coleman. Calculus on normed vector spaces. Springer Science & Business Media, 2012.
[30] King-wah Eric Chu. Generalization of the Bauer-Fike theorem. Numerische Mathematik, 49(6):685–691, 1986.
[31] Dimitri P. Bertsekas and John N. Tsitsiklis.