Distributed Optimization Over Markovian Switching Random Network
Peng Yi$^{a,b}$, Li Li$^{a,b}$
Abstract — In this paper, we investigate the distributed convex optimization problem over a multi-agent system with Markovian switching communication networks. The objective function is the sum of each agent's local objective function, which cannot be known by other agents. The communication network is assumed to switch over a set of weight-balanced directed graphs with a Markovian property. We propose a consensus subgradient algorithm with two time-scale step-sizes to handle the uncertainty due to the Markovian switching topologies and the absence of global gradient information. With a proper selection of step-sizes, we prove the almost sure convergence of all agents' local estimates to the same optimal solution when the union graph of the Markovian network's states is strongly connected and the Markov chain is irreducible. Simulations are given for illustration of the results.
I. INTRODUCTION
There is an increasing research interest in distributed optimization over multi-agent systems due to its broad applications in engineering networks, such as distributed parameter estimation in sensor networks [1], [2], resource allocation in communication networks [3], [4], and optimal power flow in power grids [5], [6]. Due to the privacy of each agent's local data and the burden of data centralization, in distributed optimization problems each agent can only manipulate its local objective function without knowing other agents' objective functions, while the global objective function to be optimized is usually taken as the sum of the agents' local objective functions. Many significant distributed optimization algorithms have been proposed and analyzed, including (sub)gradient algorithms [4], [7], [8], dual averaging algorithms [9], primal-dual methods [2], [6], [10], and gradient tracking methods [11], [12]. Please refer to [13]–[17] for surveys of recent developments in distributed optimization.

a The Department of Control Science and Engineering, Tongji University, Shanghai, 201804, China; b Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China. Emails: [email protected], [email protected]. This work was partially supported by the Key Research and Development Project of the National Ministry of Science and Technology under grant No. 2018YFB1305304 and the National Natural Science Foundation of China under Grant No. 51475334.

In distributed optimization, the agents must cooperatively find a consensual optimal solution by sharing information locally with network neighbors; hence, communication plays a vital role in the design and analysis of distributed optimization algorithms. Different communication models and graph connectivity assumptions, either deterministic or stochastic, have been discussed for different algorithms, including uniformly jointly strongly connected graphs [7], [9], quantized communication [18], random graphs [10], [19], broadcasting [20], and gossip communication [21]. In fact, practical communication networks are essentially random and stochastic due to link failure, uncertain quantization, packet dropout, or node recreation. Random communication networks with temporal independence assumptions have been investigated in distributed optimization. [20] established the almost sure convergence of the consensus subgradient algorithm to an optimal point when the agents share information through independent broadcast communications. [8] provided almost sure convergence results for the distributed subgradient algorithm when the communication link failures are independent and identically distributed over time. [19] investigated an asynchronous distributed gradient method with a linear convergence rate for strongly convex functions when the graph weights are independently and identically drawn from the same probability space. [22] proved the optimal convergence rate of distributed stochastic gradient methods for strongly convex functions over temporally independent identically distributed random networks. [10] investigated the asymptotic normality and efficiency of a distributed primal-dual gradient algorithm for independent and identically distributed random communication networks. [23] gave a primal-dual algorithm for distributed resource allocation, also with independent and identically distributed random communication networks.

Nevertheless, practical communications over multi-agent systems are usually random but with temporal correlation. Markovian switching graphs have been adopted for modelling random communication with one-step temporal dependence.
For example, [24]–[26] have investigated the performance of averaging consensus algorithms with Markovian switching communication networks, [1] considered the distributed parameter estimation problem over Markovian switching topologies, and [27] investigated the Kalman filter with Markovian packet losses when transmitting the measurements to the filter. However, to the best of our knowledge, how to achieve distributed optimization with Markovian switching graphs has not been fully investigated, because distributed optimization is a fundamentally different task from consensus or parameter estimation; an exception is [28], which studied distributed optimization over switching state-dependent graphs. We also note that [29] investigated distributed optimization through the fixed-point iteration of random operators derived from a general class of random graphs.

Motivated by the above, we investigate the consensus subgradient algorithm to achieve optimal consensus with Markovian switching topologies. The communication graph among the agents switches within a finite graph set following a Markov chain. Note that [28] assumed that the random link failure depends on the node state rather than the previous step's communication; hence, it considered a different Markovian model from the Markovian random graph considered here. We propose to select two different step-sizes for the consensus term and the gradient term to balance the speed of consensus and innovation. We find a sufficient choice of step-sizes to ensure that the consensus term is slightly "faster" than the innovation gradient term, and then we can give mean consensus error bounds under the Markovian assumption. With these error bounds, we prove that all the agents converge to the same optimal solution with probability 1.

The paper is organized as follows. We give the formulation of the distributed optimization problem and the Markovian switching communication model in Section II.
We give the algorithm and sketch the main results in Section III. We give the proofs of the main theorems with an illustrative numerical example in Section IV, and present the conclusions in Section V. The proof of a key lemma is given in the Appendix.

Notations: Denote $\mathbf{1}_m = (1,\dots,1)^T \in \mathbb{R}^m$ and $\mathbf{0}_m = (0,\dots,0)^T \in \mathbb{R}^m$. For a column vector $x \in \mathbb{R}^m$, $x^T$ denotes its transpose. $I_n$ denotes the identity matrix in $\mathbb{R}^{n \times n}$. For a matrix $A = [a_{ij}] \in \mathbb{R}^{N \times N}$, $a_{ij}$ stands for the $(i,j)$th entry of $A$. A matrix $A$ is nonnegative if $a_{ij} \ge 0$, $\forall i, j = 1,\dots,N$. A nonnegative matrix $A$ is called row stochastic iff $A\mathbf{1}_N = \mathbf{1}_N$, column stochastic iff $\mathbf{1}_N^T A = \mathbf{1}_N^T$, and doubly stochastic iff $A$ is both row and column stochastic. $\otimes$ stands for the Kronecker product of two matrices. For a probability space $(\Xi, \mathcal{F}, P)$, $\Xi$ is the sample space, $\mathcal{F}$ is the $\sigma$-algebra, and $P$ is the probability measure. For $k = 0, 1, 2, \dots$, $(v_k, \mathcal{F}_k)$ is an adapted sequence if $\sigma(v_k) \subset \mathcal{F}_k$ for all $k$. The expectation of a random variable is denoted by $E[\cdot]$.

A directed graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}_{\mathcal{G}}, A_{\mathcal{G}}\}$ is defined with node set $\mathcal{V} = \{1,\dots,N\}$, edge set $\mathcal{E}_{\mathcal{G}} \subset \mathcal{V} \times \mathcal{V}$, and adjacency matrix $A_{\mathcal{G}} = [a_{ij}] \in \mathbb{R}^{N \times N}$. $(j,i) \in \mathcal{E}_{\mathcal{G}}$ if and only if agent $i$ can get information from agent $j$. $A_{\mathcal{G}} = [a_{ij}]$ is nonnegative and row stochastic, with $0 < a_{ij} \le 1$ if $(j,i) \in \mathcal{E}_{\mathcal{G}}$ and $a_{ij} = 0$ otherwise. Denote by $\mathcal{N}_i = \{j \mid (j,i) \in \mathcal{E}_{\mathcal{G}}\}$ the neighbor set of agent $i$. A path of graph $\mathcal{G}$ is a sequence of distinct agents in $\mathcal{V}$ such that any consecutive agents in the sequence correspond to an edge of the graph $\mathcal{G}$. Agent $j$ is said to be connected to agent $i$ if there is a path from $j$ to $i$. Graph $\mathcal{G}$ is strongly connected if any two agents are connected. Graph $\mathcal{G}$ is called weight-balanced if its adjacency matrix $A_{\mathcal{G}}$ is doubly stochastic. Denote by $D_{\mathcal{G}} = \mathrm{diag}\{\sum_{j=1}^{N} a_{1j}, \dots, \sum_{j=1}^{N} a_{Nj}\}$ the in-degree matrix of $\mathcal{G}$.
Then, the (weighted) Laplacian matrix of $\mathcal{G}$ is $L_{\mathcal{G}} := D_{\mathcal{G}} - A_{\mathcal{G}}$. When graph $\mathcal{G}$ is strongly connected, $0$ is a simple eigenvalue of the Laplacian $L_{\mathcal{G}}$ with eigenspace $\{\alpha \mathbf{1}_N \mid \alpha \in \mathbb{R}\}$.

II. PROBLEM FORMULATION
In this section, we formulate the distributed optimization problem.

Consider a multi-agent network with agent (node) set $\mathcal{V} = \{1,\dots,N\}$, where agent $i$ has its own objective function $f_i(x)$ unknown to any other agent. The task is to find an optimal solution of the sum of all the local objective functions, that is,

$$\min_{x \in \mathbb{R}^n} f(x), \quad f(x) = \sum_{i=1}^{N} f_i(x), \qquad (1)$$

where $f_i(\cdot): \mathbb{R}^n \to \mathbb{R}$, a lower semicontinuous (possibly nonsmooth) convex function, is the local objective function of agent $i$, and $f(\cdot)$ is the global objective function. We give the following assumption on the objective functions.

Assumption 1:
1) The optimization problem in (1) is solvable, i.e., there exists a finite $x^* \in \mathbb{R}^n$ such that

$$x^* \in X^* \triangleq \arg\min f(x), \quad f(x) = \sum_{i=1}^{N} f_i(x).$$
2) The subgradient sets of $f_i(x)$ are uniformly bounded for all $i \in \mathcal{V}$, i.e., there exists a constant $l$ such that $\forall g(x) \in \partial f_i(x)$, $\|g(x)\| \le l$, $\forall x \in \mathrm{dom}(f_i)$, $\forall i \in \mathcal{V}$.

We assume the agents exchange information locally through a Markovian switching random communication network. All the possible communication topologies form a finite set of graphs $\{\mathcal{G}_1, \dots, \mathcal{G}_m\}$, with each graph endowed with an adjacency matrix $A_{\mathcal{G}_i}$. The time is slotted as $k = 1, 2, \dots$. Then, we use a random process $\theta(k)$, a Markov chain on a finite index set $\mathcal{I} = \{1,\dots,m\}$ with a stationary transition matrix $P = [p_{ij}] \in \mathbb{R}^{m \times m}$, to indicate the communication graph at time $k$, i.e., $\mathcal{G}(k) = \mathcal{G}_i$ when $\theta(k) = i$. The Markovian property of $\theta(k)$ implies that, given the graph at time $k$ being $\mathcal{G}_i$, the probability of the communication graph at time $k+1$ being $\mathcal{G}_j$ is $p_{ij}$. The works on average consensus in [1], [24], [25] have provided detailed descriptions and motivations for using Markovian switching communication networks in distributed computation over multi-agent systems, including wireless sensor networks and UAV swarms.

Here is the assumption on the Markovian communication graphs:

Assumption 2:
1) The adjacency matrix $A_{\mathcal{G}_i}$ of each graph in the set $\{\mathcal{G}_1,\dots,\mathcal{G}_m\}$ is a doubly stochastic matrix, and the union graph

$$\mathcal{G}_c \triangleq \bigcup_{i=1}^{m} \mathcal{G}_i = \Big\{\mathcal{V},\ \bigcup_{i=1}^{m} \mathcal{E}_{\mathcal{G}_i},\ \frac{1}{m}\sum_{i=1}^{m} A_{\mathcal{G}_i}\Big\}$$

is strongly connected.

2) The Markov chain $\theta(k)$ is irreducible.

III. DISTRIBUTED ALGORITHM AND MAIN RESULTS
In this section, we provide the algorithm and the main results.

Denote by $x_i(k) \in \mathbb{R}^n$ the estimate of agent $i$ of the optimal solution $x^*$ at time $k$. The random variable $\theta(k)$ evolves as a Markov chain. The communication graph at time $k$ is $\mathcal{G}(k) \triangleq \mathcal{G}_{\theta(k)} = (\mathcal{V}, \mathcal{E}_{\mathcal{G}_{\theta(k)}}, A_{\mathcal{G}_{\theta(k)}})$. Agent $i$ can get the estimates of its neighboring agents $\mathcal{N}_i(k) = \{j \mid (j,i) \in \mathcal{E}_{\mathcal{G}_{\theta(k)}}\}$ over $\mathcal{G}_{\theta(k)}$. Then, each agent updates its estimate with the following algorithm.

Algorithm 1
Consensus subgradient algorithm
Initialize:
Agent $i \in \mathcal{V}$ picks an initial state $x_i(0) \in \mathbb{R}^n$.

Iterate until convergence:
At time $k$, each agent $i \in \mathcal{V}$ gets its neighbor states $\{x_j(k)\}_{j \in \mathcal{N}_i(k)}$ through the random graph $\mathcal{G}(k)$, and updates its local state as follows:

$$x_i(k+1) = x_i(k) + \alpha_k \sum_{j=1}^{N} a_{ij}(k)(x_j(k) - x_i(k)) - \beta_k d_i(k), \qquad (2)$$

where $\alpha_k > 0$ and $\beta_k > 0$ are the step-sizes, $a_{ij}(k)$ is the $(i,j)$th entry of $A_{\mathcal{G}_{\theta(k)}}$, and $d_i(k) \in \partial f_i(x_i(k))$ is a (sub)gradient vector of $f_i(x)$ at $x_i(k)$.

Algorithm 1 is an extension of the distributed subgradient algorithms in [7], [8], obtained by adding an additional step-size. In equation (2), the first (consensus) term drives each agent's state towards the average of all agents' states, while the second term provides the innovative gradient information to search for the optimal solution $x^*$.

To guarantee convergence even with a randomly switching network, we use two different step-sizes $\alpha_k$ and $\beta_k$ to control the speeds of consensus and innovation. In fact, we require the "consensus" speed to be a bit faster than the "innovation" term, as specified by the following assumption.

Assumption 3:
We take the step-sizes in (2) as

$$\alpha_k = \frac{a_1}{(k+1)^{\delta_1}}, \quad \beta_k = \frac{a_2}{(k+1)^{\delta_2}}, \qquad (3)$$

where $a_1 > 0$, $a_2 > 0$, $0 < \delta_1 < \delta_2 \le 1$, and $\delta_2 - \delta_1 \ge$ ….

Now we are ready to present the main analysis results for Algorithm 1.
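To make the two time-scale scheme concrete, here is a minimal runnable sketch of update (2) with step-sizes of the form (3) on a toy problem. The local costs $f_i(x) = |x - c_i|$, the two ring graphs, the transition matrix, and the exponents 0.6/0.9 are all illustrative choices, not data from the paper.

```python
import numpy as np

# Sketch of Algorithm 1: update (2) with two time-scale step-sizes (3),
# over a Markovian switching pair of doubly stochastic ring graphs.
# Toy problem (hypothetical): f_i(x) = |x - c_i| on R, so the global
# objective sum_i |x - c_i| is minimized at the median of c.

rng = np.random.default_rng(0)
N = 5
c = np.arange(N, dtype=float)            # local data; the median 2 is optimal
x = c.copy()                             # initial states x_i(0)

# two doubly stochastic adjacency matrices: directed rings in opposite directions
A1 = 0.5 * np.eye(N) + 0.5 * np.roll(np.eye(N), 1, axis=1)
A2 = 0.5 * np.eye(N) + 0.5 * np.roll(np.eye(N), -1, axis=1)
graphs = [A1, A2]
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])               # irreducible transition matrix of theta(k)

theta = 0
a1, delta1 = 1.0, 0.6                    # alpha_k = a1/(k+1)^delta1  (consensus)
a2, delta2 = 1.0, 0.9                    # beta_k  = a2/(k+1)^delta2  (innovation)

for k in range(20000):
    A = graphs[theta]
    alpha_k = a1 / (k + 1) ** delta1
    beta_k = a2 / (k + 1) ** delta2
    d = np.sign(x - c)                   # a subgradient of |x_i - c_i|
    x = x + alpha_k * (A @ x - x) - beta_k * d   # update (2), stacked over agents
    theta = rng.choice(2, p=P[theta])    # Markovian switch of the topology

print(x)  # the five estimates cluster near the median 2
```

Note the design choice mirrored from the paper: the consensus step-size decays more slowly than the gradient step-size, so disagreement is averaged out faster than the innovation can re-create it.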
Theorem 1 (Almost sure consensus): Suppose Assumptions 1, 2 and 3 hold. Let $x_i(k)$, $i \in \mathcal{V}$, be generated by (2), and $y(k) = \frac{1}{N}\sum_{i=1}^{N} x_i(k)$. Then the following statements hold.

1) The agents' states reach consensus and track the average of all the agents' states asymptotically with probability 1, i.e.,

$$\lim_{k \to \infty} \|x_i(k) - y(k)\| = 0, \quad \forall i \in \mathcal{V}, \ a.s. \qquad (4)$$

2) The accumulation of the norm of the tracking error $y(k) - x_i(k)$ weighted by the step-sizes $\beta_k$ is bounded for each agent, i.e.,

$$\sum_{k=1}^{\infty} \beta_k \|y(k) - x_i(k)\| < \infty, \quad \forall i \in \mathcal{V}, \ a.s. \qquad (5)$$

Remark 1:
Theorem 1 shows that all the agents almost surely reach consensus asymptotically. In fact, we can also show that the convergence rate for reaching consensus is dominated by the difference between $\delta_1$ and $\delta_2$. Specifically, we can find a $\tau < \delta_2 - \delta_1$ such that

$$\lim_{k \to \infty} (k+1)^{\tau} \|y(k) - x_i(k)\| = 0 \quad a.s., \ \forall i \in \mathcal{V}.$$

Theorem 2 (Almost sure convergence to a consensual solution):
Suppose Assumptions 1,2 and 3 hold. Then with Algorithm1, all the agents’ states almost surely converge to the sameoptimal solution of (1), i.e., lim k →∞ x i ( k ) = x ∗ , ∀ i ∈ V , a.s. IV. T
HE PROOF OF MAIN RESULTS
In this section, we give the proofs of the main results.

Denote $X(k) = (x_1^T(k), \dots, x_N^T(k))^T \in \mathbb{R}^{nN}$ and $d(k) = (d_1^T(k), \dots, d_N^T(k))^T \in \mathbb{R}^{nN}$; then we can rewrite the overall update of Algorithm 1 in the compact form

$$X(k+1) = X(k) - \alpha_k (L_{\mathcal{G}_{\theta(k)}} \otimes I_n) X(k) - \beta_k d(k), \qquad (6)$$

where $L_{\mathcal{G}_{\theta(k)}}$ is the Laplacian of $\mathcal{G}_{\theta(k)}$. With abuse of notation, we also use $A(k)$ to denote the random matrix $A_{\mathcal{G}_{\theta(k)}}$ for simplicity. Since $A(k)$ is doubly stochastic, (6) can also be written as

$$X(k+1) = X(k) + \alpha_k ((A(k) - I_N) \otimes I_n) X(k) - \beta_k d(k). \qquad (7)$$

To investigate the consensus of all agents' states, we define

$$Q = \begin{pmatrix} Q_1 \\ \mathbf{1}_N^T/\sqrt{N} \end{pmatrix}, \quad \text{with } Q_1 \mathbf{1}_N = \mathbf{0} \text{ and } Q_1 Q_1^T = I_{N-1}.$$

Then $QQ^T = I_N$, i.e., $Q$ is an orthogonal matrix. Denote by $\Gamma = I_N - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^T$ the disagreement matrix. Since $A(k)$ is doubly stochastic,

$$QA(k)Q^T = \begin{pmatrix} Q_1 A(k) Q_1^T & \mathbf{0}_{N-1} \\ \mathbf{0}_{N-1}^T & 1 \end{pmatrix}, \quad Q\Gamma = \begin{pmatrix} Q_1 \\ \mathbf{0}^T \end{pmatrix},$$

and therefore

$$Q\Gamma A(k) = (Q\Gamma Q^T)(QA(k)Q^T)Q = \begin{pmatrix} Q_1 A(k) Q_1^T Q_1 \\ \mathbf{0}^T \end{pmatrix}.$$

Therefore, multiplying both sides of (7) by $Q\Gamma \otimes I_n$,

$$\Big(\begin{pmatrix} Q_1 \\ \mathbf{0}^T \end{pmatrix} \otimes I_n\Big) X(k+1) = \Big(\begin{pmatrix} Q_1 \\ \mathbf{0}^T \end{pmatrix} \otimes I_n\Big) X(k) + \alpha_k \Big(\begin{pmatrix} (Q_1 A(k) Q_1^T - I_{N-1}) Q_1 \\ \mathbf{0}^T \end{pmatrix} \otimes I_n\Big) X(k) - \beta_k \Big(\begin{pmatrix} Q_1 \\ \mathbf{0}^T \end{pmatrix} \otimes I_n\Big) d(k). \qquad (8)$$

Denote $\xi(k) = (Q_1 \otimes I_n) X(k)$ and $H(k) = Q_1 A(k) Q_1^T - I_{N-1}$; then we have the reduced recursion of (8):

$$\xi(k+1) = \xi(k) + \alpha_k (H(k) \otimes I_n) \xi(k) - \beta_k (Q_1 \otimes I_n) d(k). \qquad (9)$$

Define a state transition matrix $\Phi(k,s)$ for $k \ge s$ as

$$\Phi(k,s) = (I_{N-1} + \alpha_k H(k)) \cdots (I_{N-1} + \alpha_s H(s)), \qquad (10)$$

and $\Phi(k, k+1) = I_{N-1}$.

The state transition matrix $\Phi(k,s)$ is a random matrix that plays a key role in the convergence analysis, and we have the following Lemma 1.

Lemma 1:
Suppose Assumptions 1, 2 and 3 hold. For the state transition matrix $\Phi(k,s)$ defined in (10), we have:

(i) There exist positive constants $c_1$, $c_2$ such that

$$E[\|\Phi(k,s)\|] \le c_1 \exp\Big[-c_2 \sum_{i=s}^{k+1} \alpha_i\Big], \quad \forall k \ge s.$$

(ii) $\sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] \to 0$ as $k \to \infty$.

(iii) $\sum_{k=0}^{\infty} \beta_{k+1} E[\|\Phi(k, 0)\|] < \infty$.

(iv) $\sum_{k=0}^{\infty} \beta_{k+1} \sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] < \infty$.

The proof of Lemma 1 is given in the Appendix; it could be of independent interest since it is not related to the gradient part of the algorithm. Next, we give a martingale convergence result from [30] and a lemma for the convergence analysis.

Proposition 1:
Let $(v_k, \mathcal{F}_k)$ and $(\alpha_k, \mathcal{F}_k)$ be two nonnegative adapted sequences.

(i) If $E[v_{k+1} \mid \mathcal{F}_k] \le v_k + \alpha_k$ and $E[\sum_{i=1}^{\infty} \alpha_i] < \infty$, then $v_k$ converges a.s. to a finite limit.

(ii) If $E[v_{k+1} \mid \mathcal{F}_k] \le v_k - \alpha_k$, then $\sum_{i=1}^{\infty} \alpha_i < \infty$ a.s.

Lemma 2:
Let $(v_k, \mathcal{F}_k)$, $(d_k, \mathcal{F}_k)$, and $(\alpha_k, \mathcal{F}_k)$ be three nonnegative adapted sequences. If $E[v_{k+1} \mid \mathcal{F}_k] \le v_k + \alpha_k - d_k$ and $E[\sum_{i=1}^{\infty} \alpha_i] < \infty$, then $\sum_{i=1}^{\infty} d_i < \infty$ a.s., and $v_k$ converges a.s. to a finite limit.

Proof:
Since $E[v_{k+1} \mid \mathcal{F}_k] \le v_k + \alpha_k$ and $E[\sum_{i=1}^{\infty} \alpha_i] < \infty$, from (i) of Proposition 1 we know that $v_k$ converges a.s. to a finite limit. Set $u_{k+1} = v_{k+1} + E[\sum_{i=k+1}^{\infty} \alpha_i \mid \mathcal{F}_{k+1}]$. Then

$$E[u_{k+1} \mid \mathcal{F}_k] \le v_k + \alpha_k - d_k + E\Big[\sum_{i=k+1}^{\infty} \alpha_i \Big| \mathcal{F}_k\Big] = u_k - d_k,$$

and hence we have $\sum_{i=1}^{\infty} d_i < \infty$ a.s. from (ii) of Proposition 1. □

Proof of Theorem 1:

1) It follows from (9) that

$$\xi(k+1) = (\Phi(k,0) \otimes I_n)\xi(0) - \sum_{s=0}^{k} \beta_s (\Phi(k, s+1) \otimes I_n)(Q_1 \otimes I_n) d(s). \qquad (11)$$

Thus,

$$\sum_{k=0}^{\infty} \beta_{k+1} E[\|\xi(k+1)\|] \le \|\xi(0)\| \sum_{k=0}^{\infty} \beta_{k+1} E[\|\Phi(k,0)\|] + \sqrt{N}\, l\, \|Q_1\| \sum_{k=0}^{\infty} \beta_{k+1} \sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|]. \qquad (12)$$

Hence, with the results in Lemma 1 and $\|Q_1\| = 1$,

$$\sum_{k=0}^{\infty} \beta_{k+1} E[\|\xi(k+1)\|] < \infty.$$

Therefore, by the monotone convergence theorem [31],

$$E\Big[\sum_{k=0}^{\infty} \beta_{k+1} \|\xi(k+1)\|\Big] < \infty. \qquad (13)$$

Since $(Q\Gamma \otimes I_n) X(k) = (\xi(k)^T, \mathbf{0}^T)^T$ and $Q$ is an orthogonal matrix, we get $\|(\Gamma \otimes I_n) X(k)\| = \|\xi(k)\|$. With (13),

$$\sum_{k=1}^{\infty} \beta_k \|(\Gamma \otimes I_n) X(k)\| < \infty \quad a.s. \qquad (14)$$

Note that

$$(\Gamma \otimes I_n) X(k) = \begin{pmatrix} y(k) - x_1(k) \\ \vdots \\ y(k) - x_N(k) \end{pmatrix},$$

and with (14) we have

$$\sum_{k=1}^{\infty} \beta_k \|y(k) - x_i(k)\| < \infty, \quad \forall i \in \mathcal{V}, \ a.s. \qquad (15)$$

2) By (2) and $y(k) = \frac{1}{N}\sum_{i=1}^{N} x_i(k)$,

$$y(k+1) = y(k) - \frac{\beta_k}{N} \sum_{j=1}^{N} d_j(k). \qquad (16)$$
Therefore, for any $i \in \mathcal{V}$,

$$y(k+1) - x_i(k+1) = y(k) - x_i(k) - \alpha_k \sum_{j=1}^{N} a_{ij}(k)(x_j(k) - x_i(k)) - \beta_k \Big(\frac{1}{N}\sum_{j=1}^{N} d_j(k) - d_i(k)\Big) = y(k) - \sum_{j=1}^{N} \tilde{a}_{ij}(k) x_j(k) - \beta_k \Big(\frac{1}{N}\sum_{j=1}^{N} d_j(k) - d_i(k)\Big),$$

where $\tilde{a}_{ij}(k) = \alpha_k a_{ij}(k)$ for $i \ne j$ and $\tilde{a}_{ii}(k) = 1 - \alpha_k$. Note that $\sum_{j=1}^{N} \tilde{a}_{ij}(k) = 1$, and therefore, with Jensen's inequality,

$$\|y(k+1) - x_i(k+1)\| \le \sum_{j=1}^{N} \tilde{a}_{ij}(k) \|y(k) - x_j(k)\| + 2 \beta_k l.$$

Denote $e_i(k) = \|y(k) - x_i(k)\|$ and $e(k) = \sum_{i=1}^{N} e_i(k)$. Taking the square of the above inequality and using the convexity of $\|\cdot\|^2$,

$$e_i(k+1)^2 \le \sum_{j=1}^{N} \tilde{a}_{ij}(k) e_j(k)^2 + 4 l^2 \beta_k^2 + 4 l \beta_k \sum_{j=1}^{N} \tilde{a}_{ij}(k) e_j(k).$$

Hence,

$$\sum_{i=1}^{N} e_i(k+1)^2 \le \sum_{j=1}^{N} e_j(k)^2 + 4 N l^2 \beta_k^2 + 4 l \beta_k \sum_{i=1}^{N} e_i(k),$$

and, for any $k$,

$$\sum_{i=1}^{N} e_i(k)^2 \le \sum_{i=1}^{N} e_i(0)^2 + 4 N l^2 \sum_{s=0}^{k-1} \beta_s^2 + 4 l \sum_{s=0}^{k-1} \beta_s \sum_{i=1}^{N} e_i(s). \qquad (17)$$

Because $\sum_{k=1}^{\infty} \beta_k^2 < \infty$, $\sum_{i=1}^{N} e_i(k)^2$ converges with probability 1 as $k \to \infty$, following from (15) and (17). Meanwhile, with (15) and $\sum_{k=1}^{\infty} \beta_k = \infty$, we have $\liminf_{k \to \infty} e(k) = 0$. Since $e(k)$ is a.s. bounded, there exists a subsequence $n_k$ such that $e(n_k) \to 0$ as $k \to \infty$. Take any $\epsilon > 0$; then there exists an $n_r$ such that $e(n_r) < \epsilon$, $\sum_{k=n_r}^{\infty} \beta_k e(k) < \epsilon$, and $\sum_{k=n_r}^{\infty} \beta_k^2 < \epsilon$. Therefore,

$$\sum_{i=1}^{N} e_i(n_m)^2 \le \sum_{i=1}^{N} e_i(n_r)^2 + 4 N l^2 \sum_{k=n_r}^{\infty} \beta_k^2 + 4 l \sum_{k=n_r}^{\infty} \beta_k e(k) \le \epsilon^2 + 4 l (N l + 1) \epsilon, \quad n_m = n_r + 1, n_r + 2, \dots \qquad (18)$$

Therefore, we conclude

$$\lim_{k \to \infty} e_i(k) = 0, \quad a.s., \ \forall i \in \mathcal{V}.$$
(19) □

Next, we prove that all the agents almost surely converge to the same optimal solution.
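The proof below leans on the averaged recursion (16), which holds because a doubly stochastic $A(k)$ leaves the network average unchanged, so the consensus term drops out of $y(k)$. A quick numerical sanity check of this fact, using a hypothetical doubly stochastic matrix built from random permutations:

```python
import numpy as np

# Check of the fact behind (16): if A is doubly stochastic, the consensus step
# x <- x + alpha*(A x - x) does not move the average y = mean(x); y evolves only
# through the subgradient term. A below is an illustrative doubly stochastic
# matrix (a convex combination of permutation matrices).

rng = np.random.default_rng(1)
N = 6
perms = [np.eye(N)[rng.permutation(N)] for _ in range(4)]
weights = rng.dirichlet(np.ones(4))
A = sum(w * Pm for w, Pm in zip(weights, perms))   # doubly stochastic

x = rng.normal(size=N)
alpha_k = 0.3
x_next = x + alpha_k * (A @ x - x)                 # consensus term of (2) only

drift = abs(x_next.mean() - x.mean())
print(drift)  # essentially zero (round-off only)
```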
Proof of Theorem 2:
From (16) it follows that, for any $x \in X^*$,

$$\|y(k+1) - x\|^2 = \Big\|y(k) - x - \frac{\beta_k}{N}\sum_{j=1}^{N} d_j(k)\Big\|^2 = \|y(k) - x\|^2 - \frac{2\beta_k}{N} \sum_{j=1}^{N} d_j(k)^T (y(k) - x) + \frac{\beta_k^2}{N^2} \Big\|\sum_{j=1}^{N} d_j(k)\Big\|^2. \qquad (20)$$

Note that

$$d_j(k)^T (y(k) - x) \ge f_j(x_j(k)) - f_j(x) + d_j(k)^T (y(k) - x_j(k)) = f_j(x_j(k)) - f_j(y(k)) + f_j(y(k)) - f_j(x) + d_j(k)^T (y(k) - x_j(k)) \ge -2 l \|y(k) - x_j(k)\| + f_j(y(k)) - f_j(x).$$

From (20), we have

$$\|y(k+1) - x\|^2 \le \|y(k) - x\|^2 - \frac{2\beta_k}{N} [f(y(k)) - f(x)] + \frac{4 \beta_k l}{N} \sum_{j=1}^{N} \|y(k) - x_j(k)\| + \beta_k^2 l^2.$$

By the proof of Theorem 1, $E[\sum_{k=1}^{\infty} \beta_k \|y(k) - x_i(k)\|] < \infty$. Since $\sum_{k=0}^{\infty} \beta_k^2 < \infty$ and $f(y(k)) \ge f^*$, the conditions of Lemma 2 are satisfied. Thus, $\|y(k) - x\|$ converges a.s., and

$$\sum_{k=0}^{\infty} \beta_k [f(y(k)) - f(x)] < \infty \quad a.s. \qquad (21)$$

It follows from $\sum_{k=0}^{\infty} \beta_k = \infty$ and (21) that $\liminf_{k \to \infty} f(y(k)) - f(x) = 0$ a.s. With a similar argument to (18), we obtain $\lim_{k \to \infty} y(k) = x^*$, $x^* \in X^*$, a.s. Therefore, all the agents almost surely converge to the same optimal solution of problem (1). □

A. Simulation

Example 1:
We give an example to illustrate the algorithm. Consider five agents whose local objective functions $f_1(x), \dots, f_5(x)$ are nonsmooth convex functions of the decision variable $x = (x_1, x_2)^T \in \mathbb{R}^2$, built from log-sum-exp terms, weighted quadratics, absolute values, pointwise maxima, and a penalty $5 \min_{z \in \Omega} \|x - z\|$ on the distance to a half-space $\Omega = \{x \in \mathbb{R}^2 \mid x_1 + x_2 \le c\}$ for a constant $c$.

The five agents share information over three graphs $\{\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3\}$, whose weighted adjacency matrices are $A_1, A_2, A_3 \in \mathbb{R}^{5 \times 5}$, respectively. The transition matrix of the stationary Markov chain $\theta(k)$ is $P \in \mathbb{R}^{3 \times 3}$; $P, A_1, A_2, A_3$ are fixed stochastic matrices chosen to satisfy Assumption 2. We choose step-sizes $\alpha_k$ and $\beta_k$ of the form (3), and the (sub)gradients are normalized to 1. We perform the simulation 100 times, and the estimates of the five agents always reach the same optimal solution. (The optimal solution is unique in this case.) The simulation results are shown in Figures 1 and 2.

V. CONCLUSIONS
In this paper, we proposed a consensus subgradient algorithm to solve a distributed optimization problem with Markovian switching random communication networks. The algorithm was given with two time-scale step-sizes, different from most existing ones. We showed almost sure convergence under a proper connectivity assumption and step-size choices. In the future, we will work on the mean-square convergence rate analysis.

APPENDIX: PROOF OF KEY LEMMAS
Proof of Lemma 1 (i): Denote $\mathcal{F}_k = \sigma\{\theta(t), 0 \le t \le k\}$. Take $h = (m-1)^2 + 1$, where $m$ is the number of states in $\mathcal{I}$. We first prove that, $\forall t$, $\theta(k)$ visits all the states in $\mathcal{I}$ with a positive probability during $[t, t+h-1]$.

Fig. 1: The trajectories of the five agents' states.

Fig. 2: The trajectories of three performance indices: the consensus error $\sum_{i=1}^{5} \|x_i(k) - \frac{1}{5}\sum_{j=1}^{5} x_j(k)\|$, the distance to the optimal solution $\|\frac{1}{5}\sum_{i=1}^{5} x_i(k) - x^*\|$, and the optimal value gap $|f(\frac{1}{5}\sum_{i=1}^{5} x_i(k)) - f^*|$.

With the transition matrix $P \in \mathbb{R}^{m \times m}$ being a stochastic matrix, there exists a $\zeta > 0$ such that $p_{ij} \ge \zeta$ whenever $p_{ij} > 0$. Assume $\theta(t) = i_0$; then, $\forall i \in \mathcal{I} \setminus i_0$, we conclude that

$$P(\theta(k) \text{ visits } i \text{ during } [t, t+m-1]) \ge \zeta^{m-1}$$

because $\theta(k)$ is irreducible. Hence,

$$P(\theta(k) \text{ visits all states in } \mathcal{I} \text{ during } [t, t+h-1]) \ge \zeta^{h}.$$

Secondly, with the union graph of $\{\mathcal{G}_1, \dots, \mathcal{G}_m\}$ being strongly connected, we prove that there exists a constant $\gamma > 0$ such that

$$E\Big[\sum_{k=t}^{t+h-1} \big(H(k)^T + H(k)\big) \Big| \mathcal{F}_t\Big] \le -\gamma I_{N-1}, \quad \forall t. \qquad (22)$$

Since $A(k)$ is doubly stochastic, we get

$$2 I_N - A(k) - A(k)^T = L_{\mathcal{G}_{\theta(k)}} + L_{\mathcal{G}_{\theta(k)}}^T = 2 L_{\hat{\mathcal{G}}_{\theta(k)}},$$

where $\hat{\mathcal{G}}_{\theta(k)}$ is the undirected mirror graph of $\mathcal{G}_{\theta(k)}$. As the Laplacian matrix of an undirected graph, $L_{\hat{\mathcal{G}}_{\theta(k)}}$ is positive semi-definite, i.e., $x^T L_{\hat{\mathcal{G}}_{\theta(k)}} x \ge 0$, $\forall x \in \mathbb{R}^N$. Taking $x = Q_1^T u$, $u \ne 0$, we have

$$u^T Q_1 (2 I_N - A(k) - A(k)^T) Q_1^T u = -u^T (H(k) + H(k)^T) u \ge 0, \qquad (23)$$

and thereby $H(k) + H(k)^T$ is negative semi-definite. Similarly,

$$E\Big[u^T Q_1 \Big(2 h I_N - \sum_{k=t}^{t+h-1} [A(k) + A(k)^T]\Big) Q_1^T u \Big| \mathcal{F}_t\Big] = -E\Big[u^T \Big(\sum_{k=t}^{t+h-1} [H(k) + H(k)^T]\Big) u \Big| \mathcal{F}_t\Big] \ge 0.$$

Therefore, $E[\sum_{k=t}^{t+h-1} [H(k) + H(k)^T] \mid \mathcal{F}_t]$ is also negative semi-definite.
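The negative semi-definiteness of $H(k) + H(k)^T$ asserted via (23) can be checked numerically; the doubly stochastic matrix and the orthogonal complement basis $Q_1$ below are hypothetical examples constructed on the fly:

```python
import numpy as np

# Numerical check of (23): for a doubly stochastic A and an orthonormal basis Q1
# of the complement of span{1_N}, the matrix H = Q1 A Q1^T - I_{N-1} satisfies
# H + H^T <= 0 (negative semi-definite). A and Q1 are illustrative.

rng = np.random.default_rng(2)
N = 5
perms = [np.eye(N)[rng.permutation(N)] for _ in range(3)]
weights = rng.dirichlet(np.ones(3))
A = sum(w * Pm for w, Pm in zip(weights, perms))    # doubly stochastic

# QR factorization whose first column spans 1_N; the rest spans its complement
M = np.column_stack([np.ones(N) / np.sqrt(N), rng.normal(size=(N, N - 1))])
Qfull, _ = np.linalg.qr(M)
Q1 = Qfull[:, 1:].T                                  # Q1 1_N = 0, Q1 Q1^T = I

H = Q1 @ A @ Q1.T - np.eye(N - 1)
max_eig = np.linalg.eigvalsh(H + H.T).max()
print(max_eig)  # non-positive up to round-off
```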
With Assumption 2, we have $E[x^T (2 h I_N - \sum_{k=t}^{t+h-1} [A(k) + A(k)^T]) x \mid \mathcal{F}_t] = 0$ if and only if $x = c \mathbf{1}_N$. We conclude $Q_1^T u \ne c \mathbf{1}_N$, $\forall c \ne 0$, since otherwise $c = 0$ would follow from $\mathbf{1}_N^T Q_1^T u = c \mathbf{1}_N^T \mathbf{1}_N = 0$. Therefore,

$$E\Big[u^T Q_1 \Big(2 h I_N - \sum_{k=t}^{t+h-1} [A(k) + A(k)^T]\Big) Q_1^T u \Big| \mathcal{F}_t\Big] > 0, \qquad (24)$$

i.e., $E[\sum_{k=t}^{t+h-1} [H(k)^T + H(k)] \mid \mathcal{F}_t]$ is negative definite. Hence, there exists a constant $\gamma > 0$ making (22) hold.

As a result,

$$\Phi(h+t-1, t)^T \Phi(h+t-1, t) = (I_{N-1} + \alpha_t H(t)^T) \cdots (I_{N-1} + \alpha_{t+h-1} H(t+h-1)^T)(I_{N-1} + \alpha_{t+h-1} H(t+h-1)) \cdots (I_{N-1} + \alpha_t H(t)) \le I_{N-1} + \sum_{k=t}^{t+h-1} \alpha_k (H(k)^T + H(k)) + c_3 \alpha_t^2 I_{N-1},$$

with a constant $c_3 > 0$. Combined with (22),

$$E[\Phi(h+t-1, t)^T \Phi(h+t-1, t) \mid \mathcal{F}_t] \le (1 - \gamma_1 h \alpha_{t+h-1} + c_4 h \alpha_t^2) I_{N-1}, \qquad (25)$$

where $\gamma_1 = \gamma / h > 0$ and $c_4 = c_3 / h > 0$.

In fact, $\forall k: t \le k \le t+h-1$,

$$\Big(\frac{k+1}{t+h}\Big)^{\delta_1} = \Big(1 + \frac{k+1-t-h}{t+h}\Big)^{\delta_1} = 1 + \delta_1 \frac{k+1-t-h}{t+h} + O\Big(\Big(\frac{h}{t+h}\Big)^2\Big). \qquad (26)$$

Therefore, when $t$ is large enough,

$$\alpha_k - \alpha_{t+h-1} = \frac{a_1}{(k+1)^{\delta_1}} - \frac{a_1}{(t+h)^{\delta_1}} = \frac{a_1}{(k+1)^{\delta_1}} \Big(1 - \Big(\frac{k+1}{t+h}\Big)^{\delta_1}\Big) \le \frac{a_1}{(k+1)^{\delta_1}} \Big(\delta_1 \frac{h}{t+h} + O\Big(\Big(\frac{h}{t+h}\Big)^2\Big)\Big) \le \frac{a_1}{4 (k+1)^{\delta_1}} = \frac{\alpha_k}{4}. \qquad (27)$$

We also have

$$\frac{\alpha_t^2}{\alpha_k} = a_1 \frac{(k+1)^{\delta_1}}{(t+1)^{2\delta_1}} \xrightarrow{t \to \infty} 0. \qquad (28)$$

Thereby, with (27) and (28), we conclude that there exists a $k_0$ such that, when $t \ge k_0$,

$$1 - \gamma_1 h \alpha_{t+h-1} + c_4 h \alpha_t^2 = 1 - \gamma_1 \sum_{k=t}^{t+h-1} \alpha_k + \gamma_1 \sum_{k=t}^{t+h-1} (\alpha_k - \alpha_{t+h-1}) + c_4 \sum_{k=t}^{t+h-1} \alpha_k \frac{\alpha_t^2}{\alpha_k} \le 1 - \gamma_1 \sum_{k=t}^{t+h-1} \alpha_k + \frac{\gamma_1}{4} \sum_{k=t}^{t+h-1} \alpha_k + \frac{\gamma_1}{4} \sum_{k=t}^{t+h-1} \alpha_k = 1 - c_5 \sum_{k=t}^{t+h-1} \alpha_k,$$

with a constant $c_5 = \gamma_1 / 2 > 0$. By (25), when $t \ge k_0$,

$$E[\Phi(h+t-1, t)^T \Phi(h+t-1, t) \mid \mathcal{F}_t] \le \Big(1 - c_5 \sum_{k=t}^{t+h-1} \alpha_k\Big) I_{N-1}.$$
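A minimal numeric check of the step-size estimates (27) and (28), with illustrative values $a_1 = 1$, $\delta_1 = 0.6$, $h = 10$, $t = 1000$ (not the paper's):

```python
# Numeric check of the step-size estimates (27) and (28) for
# alpha_k = a1/(k+1)**delta1. a1, delta1, h, t are illustrative values.

a1, delta1, h = 1.0, 0.6, 10

def alpha(k):
    return a1 / (k + 1) ** delta1

t = 1000
# (27): for large t, alpha_k - alpha_{t+h-1} <= alpha_k / 4 on k in [t, t+h-1]
ok27 = all(alpha(k) - alpha(t + h - 1) <= alpha(k) / 4 for k in range(t, t + h))
# (28): alpha_t^2 / alpha_k is already small (here at k = t+h-1) and -> 0 in t
ratio28 = alpha(t) ** 2 / alpha(t + h - 1)
print(ok27, ratio28)
```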
Now, given another integer $s \ge 1$, we have the following estimate by recursion:

$$E[\Phi(sh+t-1, t)^T \Phi(sh+t-1, t)] = E\big[\Phi((s-1)h+t-1, t)^T E[\Phi(sh+t-1, (s-1)h+t)^T \Phi(sh+t-1, (s-1)h+t) \mid \mathcal{F}_{(s-1)h+t}] \Phi((s-1)h+t-1, t)\big] \le \Big(1 - c_5 \sum_{k=(s-1)h+t}^{sh+t-1} \alpha_k\Big) E[\Phi((s-1)h+t-1, t)^T \Phi((s-1)h+t-1, t)] \le \exp\Big[-c_5 \sum_{k=t}^{sh+t-1} \alpha_k\Big] I_{N-1}, \quad t \ge k_0, \qquad (29)$$

based on the inequality $1 - x \le e^{-x}$, $\forall x \ge 0$. Therefore,

$$E[\|\Phi(sh+t-1, t)\|] \le c_6 \exp\Big[-c_7 \sum_{k=t}^{sh+t-1} \alpha_k\Big], \qquad (30)$$

with $c_6 = 1$ and $c_7 = c_5 / 2$ as positive constants.

Since $H(k) = -Q_1 L_{\mathcal{G}_{\theta(k)}} Q_1^T$ and $\mathcal{G}_{\theta(k)}$ switches among a finite set of graphs, there exists a constant $C_{\max} > 0$ such that $\|H(k)\| \le C_{\max}$, $\forall k \ge 0$. Therefore, $\forall k, s \ge k_0$, there exist $\varsigma \ge 0$ and $0 \le r \le h-1$ such that $k - s = \varsigma h + r$. Then

$$E[\|\Phi(k, s)\|] \le E[\|\Phi(k, s+r+1)\|] \, \|\Phi(r+s, s)\| \le \prod_{i=s}^{r+s} (1 + \alpha_i C_{\max}) \, c_6 \exp\Big[-c_7 \sum_{i=s+r+1}^{k} \alpha_i\Big] \le \prod_{i=s}^{r+s} (1 + \alpha_i C_{\max}) \, c_6 \exp\Big[c_7 \sum_{i=s}^{r+s} \alpha_i\Big] \exp\Big[-c_7 \sum_{i=s}^{k} \alpha_i\Big] \le \tilde{c}_1 \exp\Big[-c_2 \sum_{i=s}^{k+1} \alpha_i\Big], \qquad (31)$$

with $\tilde{c}_1 = c_6 \prod_{i=k_0}^{k_0+h} (1 + \alpha_i C_{\max}) \exp[c_7 \sum_{i=k_0}^{k_0+h} \alpha_i]$ and $c_2 = c_7$ as positive constants.

When $s \le k_0$, without loss of generality we assume $k \ge k_0$, and then

$$E[\|\Phi(k, s)\|] \le E[\|\Phi(k, k_0)\|] \, \|\Phi(k_0, s)\| \le \prod_{i=s}^{k_0} (1 + \alpha_i C_{\max}) \, \tilde{c}_1 \exp\Big[-c_2 \sum_{i=k_0}^{k} \alpha_i\Big] \le \prod_{i=s}^{k_0} (1 + \alpha_i C_{\max}) \, \tilde{c}_1 \exp\Big[c_2 \sum_{i=s}^{k_0} \alpha_i\Big] \exp\Big[-c_2 \sum_{i=s}^{k} \alpha_i\Big] \le \hat{c}_1 \exp\Big[-c_2 \sum_{i=s}^{k+1} \alpha_i\Big], \qquad (32)$$

with $\hat{c}_1 = \tilde{c}_1 \prod_{i=0}^{k_0} (1 + \alpha_i C_{\max}) \exp[c_2 \sum_{i=0}^{k_0} \alpha_i]$ as a constant. Taking $c_1 > \max\{\tilde{c}_1, \hat{c}_1\}$, Lemma 1 (i) holds from (31) and (32).
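Lemma 1 (i) can be illustrated by Monte Carlo: the averaged norm of the random product $\Phi(k, 0)$ shrinks as the partial sums of $\alpha_k$ grow. The two-ring Markovian network, transition matrix, and exponent $\delta_1 = 0.6$ below are hypothetical:

```python
import numpy as np

# Monte Carlo illustration of Lemma 1 (i): the averaged norm of the random
# product Phi(k,0) = (I + alpha_k H(k)) ... (I + alpha_0 H(0)) decays as the
# partial sums of alpha_k grow. Network and step-size data are illustrative.

rng = np.random.default_rng(3)
N = 5
A1 = 0.5 * np.eye(N) + 0.5 * np.roll(np.eye(N), 1, axis=1)
A2 = 0.5 * np.eye(N) + 0.5 * np.roll(np.eye(N), -1, axis=1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])

M = np.column_stack([np.ones(N) / np.sqrt(N), rng.normal(size=(N, N - 1))])
Qfull, _ = np.linalg.qr(M)
Q1 = Qfull[:, 1:].T
H = [Q1 @ A @ Q1.T - np.eye(N - 1) for A in (A1, A2)]

def phi_norm(K):
    """Spectral norm of Phi(K-1, 0) along one Markovian sample path."""
    theta, Phi = 0, np.eye(N - 1)
    for k in range(K):
        alpha_k = 1.0 / (k + 1) ** 0.6
        Phi = (np.eye(N - 1) + alpha_k * H[theta]) @ Phi
        theta = rng.choice(2, p=P[theta])
    return np.linalg.norm(Phi, 2)

n_short = np.mean([phi_norm(500) for _ in range(20)])
n_long = np.mean([phi_norm(2000) for _ in range(20)])
print(n_short, n_long)  # the longer product has a (much) smaller averaged norm
```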
(ii) Because

$$\sum_{i=s}^{k+1} \alpha_i = \sum_{i=s}^{k+1} \frac{a_1}{(i+1)^{\delta_1}} \ge \int_{s}^{k+1} \frac{a_1}{(x+1)^{\delta_1}} dx = \frac{a_1}{1-\delta_1} (x+1)^{1-\delta_1} \Big|_{s}^{k+1} = \frac{a_1}{1-\delta_1} \big[(k+2)^{1-\delta_1} - (s+1)^{1-\delta_1}\big], \qquad (33)$$

we have

$$\sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] \le \sum_{s=0}^{k} \frac{a_2}{(s+1)^{\delta_2}} c_1 \exp\Big[\frac{a_1 c_2}{1-\delta_1} (s+2)^{1-\delta_1} - \frac{a_1 c_2}{1-\delta_1} (k+2)^{1-\delta_1}\Big] \le \frac{a_2 c_1}{q(k)} \sum_{s=0}^{k} \frac{1}{(s+1)^{\delta_2}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} (s+2)^{1-\delta_1}\Big], \qquad (34)$$

with $q(k) = \exp[\frac{a_1 c_2}{1-\delta_1} (k+2)^{1-\delta_1}]$.

It is easy to verify that there exists a $k_1 > 0$ such that, $\forall x \ge k_1$, $\frac{1}{x^{\delta_2}} \exp[\frac{a_1 c_2}{1-\delta_1} x^{1-\delta_1}]$ is a monotonically increasing function. Then we obtain

$$\sum_{s=0}^{k} \frac{1}{(s+1)^{\delta_2}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} (s+2)^{1-\delta_1}\Big] \le \tau_1 + \sum_{s=k_1}^{k} \frac{1}{(s/2)^{\delta_2}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} (s+2)^{1-\delta_1}\Big] = \tau_1 + 2^{\delta_2} \sum_{s=k_1+2}^{k+2} \frac{1}{s^{\delta_2}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} s^{1-\delta_1}\Big] \le \tau_1 + 2^{\delta_2} \int_{k_1+2}^{k+3} \frac{1}{x^{\delta_2}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} x^{1-\delta_1}\Big] dx, \qquad (35)$$

with $\tau_1 = \sum_{s=0}^{k_1-1} \frac{1}{(s+1)^{\delta_2}} \exp[\frac{a_1 c_2}{1-\delta_1} (s+2)^{1-\delta_1}]$. It follows from (34) that

$$\sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] \le a_2 c_1 \Big(2^{\delta_2} \frac{p(k)}{q(k)} + \frac{\tau_1}{q(k)}\Big), \qquad (36)$$

with $p(k) = \int_{k_1+2}^{k+3} \frac{1}{x^{\delta_2}} \exp[\frac{a_1 c_2}{1-\delta_1} x^{1-\delta_1}] dx$.

Even though both $p(k)$ and $q(k)$ tend to infinity as $k \to \infty$, they do not increase at the same order. In fact, their derivatives are

$$p(x)' = \frac{1}{(x+3)^{\delta_2}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} (x+3)^{1-\delta_1}\Big], \quad q(x)' = \frac{a_1 c_2}{(x+2)^{\delta_1}} \exp\Big[\frac{a_1 c_2}{1-\delta_1} (x+2)^{1-\delta_1}\Big].$$

Since $\delta_2 > \delta_1$, $\lim_{x \to \infty} \frac{p(x)'}{q(x)'} = 0$, and hence, by the well-known L'Hôpital's rule, $\lim_{x \to \infty} \frac{p(x)}{q(x)} = 0$. Therefore, it follows from (36) that

$$\sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] \to 0, \quad k \to \infty.$$
(iii) Note that $\frac{1}{x^{\delta_2}} \exp[-c x^{1-\delta_1}] \le \frac{1}{x^{\delta_1}} \exp[-c x^{1-\delta_1}]$ for $x \ge 1$ and positive $c$, and that the latter is monotonically decreasing. Then, with (30) and (33), we have

$$\sum_{k=0}^{\infty} \beta_{k+1} E[\|\Phi(k, 0)\|] \le \sum_{k=0}^{\infty} \frac{a_2}{(k+2)^{\delta_2}} c_1 \exp\Big[\frac{a_1 c_2}{1-\delta_1} - \frac{a_1 c_2}{1-\delta_1} (k+2)^{1-\delta_1}\Big] \le c_8 \sum_{k=1}^{\infty} \frac{1}{(k+1)^{\delta_1}} \exp[-c_9 (k+1)^{1-\delta_1}] \le c_8 \int_{1}^{\infty} \frac{1}{x^{\delta_1}} \exp[-c_9 x^{1-\delta_1}] dx = \frac{-c_8}{c_9 (1-\delta_1)} \exp[-c_9 x^{1-\delta_1}] \Big|_{1}^{\infty} = \frac{c_8}{c_9 (1-\delta_1)} \exp[-c_9] < \infty, \qquad (37)$$

with $c_8 = a_2 c_1 \exp[\frac{a_1 c_2}{1-\delta_1}]$ and $c_9 = \frac{a_1 c_2}{1-\delta_1}$ as positive constants.

(iv) Since $(x+2)^{-\eta} q(x)$ tends to infinity as $x \to \infty$, with L'Hôpital's rule,

$$\lim_{x \to \infty} \frac{p(x)}{(x+2)^{-\eta} q(x)} = \lim_{x \to \infty} \frac{p(x)'}{(x+2)^{-\eta} q(x)' - \eta (x+2)^{-\eta-1} q(x)} = \lim_{x \to \infty} \frac{(x+2)^{-\delta_2}}{a_1 c_2 (x+2)^{-\eta-\delta_1} - \eta (x+2)^{-1-\eta}} = \frac{1}{a_1 c_2}.$$

Hence, $\frac{p(x)}{q(x)} = O((x+2)^{-\eta})$ with $\eta = \delta_2 - \delta_1$. Therefore, by (36),

$$\sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] \le a_2 c_1 \Big(2^{\delta_2} O((k+2)^{-\eta}) + \frac{\tau_1}{q(k)}\Big).$$

By a similar argument to (37), we get

$$\sum_{k=0}^{\infty} \frac{1}{(k+2)^{\delta_2}} \exp[-c_9 (k+2)^{1-\delta_1}] < \infty.$$

Finally, we conclude that

$$\sum_{k=0}^{\infty} \beta_{k+1} \sum_{s=0}^{k} \beta_s E[\|\Phi(k, s+1)\|] \le a_2^2 c_1 2^{\delta_2} \sum_{k=0}^{\infty} \frac{O((k+2)^{-\eta})}{(k+2)^{\delta_2}} + a_2^2 c_1 \tau_1 \sum_{k=0}^{\infty} \frac{1}{(k+2)^{\delta_2} q(k)} \le c_{10} \sum_{k=0}^{\infty} \frac{1}{(k+2)^{\delta_2 + \eta}} + c_{11} \sum_{k=0}^{\infty} \frac{1}{(k+2)^{\delta_2}} \exp[-c_9 (k+2)^{1-\delta_1}] < \infty,$$

with $c_{10}$, $c_{11}$ as positive constants. □

REFERENCES

[1] Q. Zhang and J.-F. Zhang, "Distributed parameter estimation over unreliable networks with Markovian switching topologies,"
IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2545–2560, 2012.
[2] J. Lei, H.-F. Chen, and H.-T. Fang, “Primal–dual algorithm for distributed constrained optimization,” Systems & Control Letters, vol. 96, pp. 110–117, 2016.
[3] E. Wei, A. Ozdaglar, and A. Jadbabaie, “A distributed Newton method for network utility maximization–I: Algorithm,” IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2162–2175, 2013.
[4] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[5] E. Dall’Anese, H. Zhu, and G. B. Giannakis, “Distributed optimal power flow for smart microgrids,” IEEE Transactions on Smart Grid, vol. 4, no. 3, pp. 1464–1475, 2013.
[6] P. Yi, Y. Hong, and F. Liu, “Initialization-free distributed algorithms for optimal resource allocation with feasibility constraints and application to economic dispatch of power systems,” Automatica, vol. 74, pp. 259–269, 2016.
[7] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[8] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1291–1306, 2010.
[9] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2011.
[10] J. Lei, H.-F. Chen, and H.-T. Fang, “Asymptotic properties of primal-dual algorithm for distributed stochastic optimization over random networks with imperfect communications,” SIAM Journal on Control and Optimization, vol. 56, no. 3, pp. 2159–2188, 2018.
[11] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in Proc. IEEE Conference on Decision and Control (CDC), 2015, pp. 2055–2060.
[12] S. Pu, W. Shi, J. Xu, and A. Nedić, “A push-pull gradient method for distributed optimization in networks,” in Proc. IEEE Conference on Decision and Control (CDC), 2018, pp. 3385–3390.
[13] A. H. Sayed et al., “Adaptation, learning, and optimization over networks,” Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
[14] A. Nedich et al., “Convergence rate of distributed averaging dynamics and optimization in networks,” Foundations and Trends in Systems and Control, vol. 2, no. 1, pp. 1–100, 2015.
[15] A. Nedić and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.
[16] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, 2019.
[17] G. Notarstefano, I. Notarnicola, A. Camisa et al., “Distributed optimization for smart cyber-physical networks,” Foundations and Trends in Systems and Control, vol. 7, no. 3, pp. 253–383, 2019.
[18] P. Yi and Y. Hong, “Quantized subgradient algorithm and data-rate analysis for distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 1, no. 4, pp. 380–392, 2014.
[19] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Convergence of asynchronous distributed gradient methods over stochastic networks,” IEEE Transactions on Automatic Control, vol. 63, no. 2, pp. 434–448, 2017.
[20] A. Nedic, “Asynchronous broadcast-based convex optimization over a network,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1337–1351, 2010.
[21] J. Lu, C. Y. Tang, P. R. Regier, and T. D. Bow, “Gossip algorithms for convex consensus optimization over networks,” IEEE Transactions on Automatic Control, vol. 56, no. 12, pp. 2917–2923, 2011.
[22] D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in Proc. IEEE Conference on Decision and Control (CDC), 2018, pp. 4238–4245.
[23] P. Yi, J. Lei, and Y. Hong, “Distributed resource allocation over random networks based on stochastic approximation,” Systems & Control Letters, vol. 114, pp. 44–51, 2018.
[24] M. Huang, S. Dey, G. N. Nair, and J. H. Manton, “Stochastic consensus over noisy networks with Markovian and arbitrary switches,” Automatica, vol. 46, no. 10, pp. 1571–1583, 2010.
[25] I. Matei, J. S. Baras, and C. Somarakis, “Convergence results for the linear consensus problem under Markovian random graphs,” SIAM Journal on Control and Optimization, vol. 51, no. 2, pp. 1574–1591, 2013.
[26] T. Li and J. Wang, “Distributed averaging with random network graphs and noises,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7063–7080, 2018.
[27] N. Xiao, L. Xie, and M. Fu, “Kalman filtering over unreliable communication networks with bounded Markovian packet dropouts,” International Journal of Robust and Nonlinear Control, vol. 19, no. 16, pp. 1770–1786, 2009.
[28] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multi-agent optimization with state-dependent communication,” Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
[29] S. S. Alaviani and N. Elia, “Distributed multi-agent convex optimization over random digraphs,” IEEE Transactions on Automatic Control, 2019.
[30] H.-F. Chen, Stochastic Approximation and Its Applications. Springer Science & Business Media, 2006, vol. 64.
[31] R. B. Ash,