On the Convergence of Consensus Algorithms with Markovian Noise and Gradient Bias
Hoi-To Wai∗

November 6, 2020
Abstract
This paper presents a finite-time convergence analysis for a decentralized stochastic approximation (SA) scheme. The scheme generalizes several algorithms for decentralized machine learning and multi-agent reinforcement learning. Our proof technique involves separating the iterates into their respective consensual parts and consensus error. The consensus error is bounded in terms of the stationarity of the consensual part, while the updates of the consensual part can be analyzed as a perturbed SA scheme. Under the Markovian noise and time-varying communication graph assumptions, the decentralized SA scheme has an expected convergence rate of $\mathcal{O}(\log T/\sqrt{T})$, where $T$ is the iteration number, in terms of the squared norm of the gradient for nonlinear SA with a smooth but non-convex cost function. This rate is comparable to the best known performance of SA in a centralized setting with a non-convex potential function.

Decentralized algorithms have become a core tool for control, optimization and machine learning in the increasingly connected world. Among others, a common setting is multi-agent optimization, where a group of agents on a connected graph seek to minimize a sum of local functions using a common parameter. For convex optimization problems, consensus-based algorithms have been developed in [1] for deterministic optimization, and recent results have extended the latter to stochastic/non-convex optimization [2–5].

With new machine-learning-inspired models such as the training of neural networks and reinforcement learning, researchers have focused on developing decentralized methods for non-convex stochastic optimization [6]. However, methods developed so far for this scenario are limited to using unbiased gradient estimates [7–9], requiring i.i.d. data samples and easy-to-compute gradients.
The latter requirements may be an obstacle in deploying the methods in sophisticated problems. The focus of this paper is to relax the unbiased gradient estimate assumption of decentralized stochastic optimization methods. We consider a decentralized stochastic approximation (DSA) scheme whose local update directions are computed from samples of a Markov chain, and the update directions converge asymptotically to a non-gradient mean field, i.e., it is a biased DSA scheme. Our contributions are summarized as follows.

∗The author is with the Department of SEEM, The Chinese University of Hong Kong, Shatin, Hong Kong. This work is supported by a CUHK Direct Grant. Email: [email protected]

• For non-convex stochastic optimization, we prove that the DSA scheme with updates from Markov samples converges to a consensual and stationary point at a rate of $\mathcal{O}(\log T/\sqrt{T})$, where $T$ is the iteration number.

• To analyze the convergence of biased DSA, we develop a decoupling procedure to split the DSA iterates into their consensual part and consensus error, which has a similar flavor to the analysis in [2,10,11]. Such separation allows us to (i) bound the consensus error explicitly; and subsequently (ii) analyze the recursion of the consensual part as a perturbed biased SA scheme by adopting prior analysis, e.g., [12].

• We show that the biased DSA scheme also converges for a time-varying topology under a standard bounded connectivity assumption.

For the biased DSA scheme with Markov samples, the closest works related to ours include [2], which considered a convex optimization related setting, and [13], which proposed an algorithm that may not achieve exact consensus. To the best of our knowledge, this paper provides the first finite-time convergence analysis for a DSA scheme with biased updates relying on Markov samples. In addition, this work is related to recent works on the non-asymptotic analysis of SA schemes [14–16].
Consider a network of $n$ agents whose goal is to obtain a common (a.k.a. consensual) solution $\theta \in \mathbb{R}^d$ which is a stationary point of the optimization problem:
$$ V^\star = \min_{\theta \in \mathbb{R}^d}\ V(\theta) := \frac{1}{n}\sum_{i=1}^n V_i(\theta), \tag{1} $$
where $V_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth (but possibly non-convex) function which is lower bounded. Conceptually, the $i$th function $V_i(\theta)$ can be interpreted as the local potential/cost function held by the $i$th agent. The agents communicate through an undirected, connected simple graph $G = (V, E)$, where $V = \{1, \dots, n\}$ is the node set and $E \subseteq V \times V$ is the edge set, which includes self-loops. The graph is endowed with a symmetric, weighted adjacency matrix $A$. We assume:
H1. The matrix $A \in \mathbb{R}^{n\times n}_+$ satisfies:
1. $A_{ij} = 0$ whenever $(i,j) \notin E$.
2. $A = A^\top$ and $A\mathbf{1} = \mathbf{1}$.
3. $\|U^\top A U\| \le 1 - \bar\rho$, where $\bar\rho \in (0,1]$ and $U$ is a projection matrix such that $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top = UU^\top$.

Note that condition 3) is equivalent to requiring that $\max\{|\lambda_2(A)|, |\lambda_n(A)|\} \le 1 - \bar\rho$, where the eigenvalues of $A$ are ordered as $\lambda_1(A) \ge \cdots \ge \lambda_n(A)$. Such a weighted adjacency matrix $A$ exists if $G$ is connected, e.g., [1].

At iteration $t$, each agent holds a local solution $\theta_i^{(t)}$ and, for simplicity, we assume the initialization $\theta_i^{(0)} = \theta_j^{(0)}$ for any $i,j$. Let $X_{t+1} \in \mathsf{X}$ be a random sample, where $\mathsf{X}$ is a (discrete or continuous) state space. The decentralized SA (DSA) scheme performs the following recursion at all agents $i = 1,\dots,n$:
$$ \theta_i^{(t+1)} = \sum_{j=1}^n A_{ij}\,\theta_j^{(t)} - \gamma_{t+1} H_i(\theta_i^{(t)}; X_{t+1}), \tag{2} $$
where the first term corresponds to an average consensus step among the neighbors of agent $i$, and $H_i(\theta_i^{(t)}; X_{t+1})$ is a local stochastic update computed from $\theta_i^{(t)}, X_{t+1}$. Furthermore, we denote
$$ \mathcal{F}_t = \sigma\big\{ \{\theta_i^{(0)}\}_{i=1}^n,\ X_s,\ s = 0, 1, \dots, t \big\} \tag{3} $$
as the filtration of random elements up to iteration $t$. Note that $\theta_i^{(t)}$ is measurable w.r.t. $\mathcal{F}_t$. In the special case of $n = 1$, (2) reduces to the classical SA scheme [17]; for general $n > 1$, (2) is related to a matrix momentum SA scheme studied in [18].

We consider a biased DSA scheme in this paper. To describe the setup, the random samples $\{X_t\}_{t\ge 0}$ form a Markov chain (MC) with a kernel $P$ satisfying:

H2.
The Markov kernel $P : \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$ generating $\{X_t\}_{t\ge 0}$ has a unique stationary distribution $\mu : \mathsf{X} \to \mathbb{R}_+$, and it is irreducible and aperiodic.

For any measurable function $f$ on $\mathsf{X}$, with a slight abuse of notation we define $Pf(X_t) = \mathbb{E}[f(X_{t+1}) \mid X_t] = \int_{\mathsf{X}} f(x)\, P(X_t, dx)$. The mean field of $H_i(\theta; X)$ is
$$ h_i(\theta) := \int_{\mathsf{X}} H_i(\theta; x)\, \mu(dx). \tag{4} $$
Importantly, the averaged mean field
$$ h(\theta) := \frac{1}{n}\sum_{i=1}^n h_i(\theta) \tag{5} $$
is related to Problem (1) through:

H3.
For any $\theta \in \mathbb{R}^d$, there exist $d_0, c_0 > 0$ such that
$$ \langle h(\theta) \mid \nabla V(\theta) \rangle \ge c_0 \| h(\theta) \|^2, \qquad d_0 \| h(\theta) \| \ge \| \nabla V(\theta) \|, $$
where $\langle x \mid y \rangle = x^\top y$ denotes the Euclidean inner product.

The constants $c_0, d_0$ characterize the multiplicative bias of the mean field in view of a stationary solution to (1). They allow the local stochastic update to be a quasi-gradient, which is relevant when the gradient of $V_i(\theta)$ is hard to obtain. Besides, the transient of the stochastic update is biased under H2, with $\mathbb{E}[ H_i(\theta_i^{(t)}; X_{t+1}) \mid \mathcal{F}_t ] \ne h_i(\theta_i^{(t)})$ for finite $t$. To this end, we assume:

H4.
For any $i = 1, \dots, n$, $\theta \in \mathbb{R}^d$, $x \in \mathsf{X}$, there exists a measurable function $\hat{H}_i : \mathbb{R}^d \times \mathsf{X} \to \mathbb{R}^d$ such that
$$ \hat{H}_i(\theta; x) - P\hat{H}_i(\theta; x) = H_i(\theta; x) - h_i(\theta). \tag{6} $$
The measurable function in H4 is a solution to the Poisson equation. Such a function exists under H2 and additional conditions on the MC. For example, we may assume that $P$ is uniformly ergodic, i.e., for a constant $K$,
$$ \sup_{x \in \mathsf{X}} \| P^t(x, \cdot) - \mu(\cdot) \|_{\rm TV} \le K \lambda^{t}, \quad \forall\, t \ge 0, \tag{7} $$
such that $\lambda \in [0,1)$ characterizes the mixing time of $P$. H4 is also satisfied under more relaxed conditions, e.g., geometric ergodicity; see [19, Ch. 21.2], [20].

We assume that both the local stochastic updates and potential functions are smooth w.r.t. $\theta$:

H5. For any $i$, the local stochastic update $H_i(\theta; x)$ is $L_h$-Lipschitz w.r.t. $\theta$, i.e., for any $\theta, \theta' \in \mathbb{R}^d$,
$$ \sup_{x \in \mathsf{X}} \| H_i(\theta; x) - H_i(\theta'; x) \| \le L_h \| \theta - \theta' \|. \tag{8} $$
Consequently, the mean field map $h_i(\theta)$ is $L_h$-Lipschitz such that $\| h_i(\theta) - h_i(\theta') \| \le L_h \| \theta - \theta' \|$ for all $\theta, \theta' \in \mathbb{R}^d$.

H6.
The potential function $V(\theta)$ is $L_V$-smooth such that $\| \nabla V(\theta) - \nabla V(\theta') \| \le L_V \| \theta - \theta' \|$ for any $\theta, \theta' \in \mathbb{R}^d$.

Lastly, we assume the following on $H_i(\theta; x)$:

H7.
For any $\theta = (\theta_1, \dots, \theta_n)$, there exists $\sigma_o$ such that
$$ \sup_{x \in \mathsf{X}} \Big\| H_i(\theta_i; x) - \frac{1}{n}\sum_{j=1}^n H_j(\theta_j; x) \Big\| \le \sigma_o \Big\{ \frac{1}{n} + \frac{2}{n} \| h(\tilde\theta_c) \| + \| \theta_i - \tilde\theta_c \| \Big\}, \tag{9} $$
for any $i = 1, \dots, n$, where we have $\tilde\theta_c = \frac{1}{n}\sum_{j=1}^n \theta_j$.

H8.
For any $\tilde\theta_c \in \mathbb{R}^d$, there exists $\sigma_h$ such that
$$ \sup_{x \in \mathsf{X}} \Big\| \frac{1}{n}\sum_{i=1}^n H_i(\tilde\theta_c; x) - h(\tilde\theta_c) \Big\| \le \sigma_h. \tag{10} $$
In particular, $\sigma_o$ quantifies the heterogeneity of the stochastic updates, while $\sigma_h$ plays a similar role to the variance of $(1/n)\sum_{i=1}^n \{ H_i(\theta; x) - h_i(\theta) \}$. Under H5, we observe that H7 can be satisfied if the l.h.s. of (9) is upper bounded by $\mathcal{O}(1 + \sum_{i=1}^n \| h_i(\theta_i) \|)$. Our condition H7 is considerably weaker than the heterogeneity assumption required by [7], as we allow the heterogeneity between the local updates to grow with the norm of the mean field.

Notice that H7, H8 are uniform bounds on the norms of the errors for all $x \in \mathsf{X}$. They are considerably stronger than those for decentralized stochastic algorithms with i.i.d. data, e.g., [7,21]. However, we remark that this is a caveat for the prior works on SA with Markov noise as well, e.g., [12,13,22,23].

We discuss two applications. In the first application, the $i$th potential function is taken as the following stochastic objective function:
$$ V_i(\theta) := \mathbb{E}_{X_i \sim \mu_i}\big[ V_i(\theta; X_i) \big]. \tag{11} $$
The local stochastic update is given by
$$ H_i(\theta_i^{(t)}; X_i^{t+1}) = \nabla V_i(\theta_i^{(t)}; X_i^{t+1}), \tag{12} $$
such that $\{X_i^t\}_{t\ge 0}$ is a Markov chain with kernel $P_i : \mathsf{X}_i \times \mathsf{X}_i \to \mathbb{R}_+$ and a unique stationary distribution $\mu_i : \mathsf{X}_i \to \mathbb{R}_+$. Consequently, the mean field of $H_i(\theta_i^{(t)}; X_i^{t+1})$ is the gradient $h_i(\theta_i^{(t)}) = \nabla V_i(\theta_i^{(t)})$.

Eq. (11), (12) generalize the vanilla decentralized SGD method [7] to scenarios with non-i.i.d. (a.k.a. ergodic) data. As discussed in [22, 24], the latter is important in applications where data samples are not obtained independently, for example, when the data samples are generated using a Markov chain Monte Carlo method. To see that (11), (12) fit the assumptions of this paper, we take $\mathsf{X} = \mathsf{X}_1 \times \cdots \times \mathsf{X}_n$ and form $P$ by concatenating the local Markov kernels.
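As a concrete illustration of (11), (12), the sketch below runs a single agent's stochastic gradient recursion where the samples are generated by a two-state Markov chain instead of i.i.d. draws. The chain, the quadratic local cost, and all parameters are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative two-state Markov chain (an assumption for this sketch,
# not from the paper); row x of P gives the law of the next sample.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([2.0 / 3.0, 1.0 / 3.0])   # stationary distribution: mu P = mu

# Hypothetical local cost V_i(theta; x) = 0.5 * ||theta - m[x]||^2, so the
# stochastic gradient is H_i(theta; x) = theta - m[x] and the mean field
# is h_i(theta) = theta - sum_x mu[x] m[x].
m = np.array([[1.0, 0.0],
              [0.0, 2.0]])
target = mu @ m                          # minimizer of the mu-averaged cost

rng = np.random.default_rng(0)
theta, x = np.zeros(2), 0
for t in range(50000):
    x = rng.choice(2, p=P[x])            # Markovian, non-i.i.d. sample
    gamma = 0.5 / np.sqrt(t + 10.0)      # gamma_t = a0 / sqrt(t + a1)
    theta -= gamma * (theta - m[x])      # SA step with the sampled gradient

# The iterate approaches the minimizer despite the correlated samples.
assert np.linalg.norm(theta - target) < 0.3
```

Even though consecutive samples are correlated, the iterate approaches the minimizer of the $\mu$-averaged cost, which is exactly the behavior that the mean-field analysis of this paper formalizes.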
Clearly, H3 is satisfied with $c_0 = d_0 = 1$; H2, H4 depend on the Markov chain; H5, H6 are related to the smoothness of $V_i(\theta; x)$ w.r.t. $\theta$; H7 can be satisfied with more homogeneous objective functions; H8 bounds the noise in estimating the mean field by averaging the local stochastic updates.

In the second application, we consider the policy evaluation problem in a multi-agent reinforcement learning (MARL) setting. Our aim is to compute the value function under a policy $\pi$ for an unknown Markov decision process (MDP) using linear function approximation [25–27].

Consider the MDP at state $x$ with reward $\mathrm{R}(x)$. The agents only observe a local reward $\mathrm{R}_i(x)$ satisfying $\mathrm{R}(x) = \frac{1}{n}\sum_{i=1}^n \mathrm{R}_i(x)$. We aim at approximating the value function as $V^\pi(x) = \mathbb{E}[ \sum_{s=0}^\infty \gamma^s \mathrm{R}(x_s) \mid x_0 = x ] \approx \theta^\top \Phi(x)$, where $\theta \in \mathbb{R}^d$ is the function parameter and $\Phi(x)$ is a 'feature' vector. To find $\theta$, the decentralized TD(0) learning algorithm [26,27] deploys (2) with the following local stochastic update:
$$ H_i(\theta_i^{(t)}; x) = -\Phi(x)\big\{ \mathrm{R}_i(x) + ( \gamma \Phi(x') - \Phi(x) )^\top \theta_i^{(t)} \big\}, $$
where $x' \in \mathsf{X}$ denotes the next state drawn from the MDP when the current state is $x$. The term inside the curly brackets is the temporal difference error. We observe that the resultant algorithm is a linear DSA scheme.

To discuss the performance of TD(0), we take $V_i(\theta) = \| \theta - \theta^\star \|^2$, where $\theta^\star$ solves the Bellman equation:
$$ \mathbb{E}_\mu\big[ \Phi(x)\,( \Phi(x) - \gamma \Phi(x') )^\top \big]\, \theta^\star = \mathbb{E}_\mu\big[ \Phi(x)\,\mathrm{R}(x) \big]. $$
Most of our assumptions can be satisfied by the linear DSA. Using [28, Lemma 3 & 4], H3 is satisfied with $c_0 = 1 - \gamma$, and we can show that $d_0 = \mathbb{E}_\mu[ \| \Phi(x)( \gamma \Phi(x') - \Phi(x) )^\top \| ]$. H2, H4 are conditions on the Markov chain; H5–H7 can be satisfied with bounded $\Phi(x)$, $\mathrm{R}(x)$. Lastly, H8 can be relaxed in the analysis as the resultant DSA scheme is linear. In the interest of space, we leave the development of the latter case to a future work.
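A minimal sketch of the decentralized TD(0) scheme described above, on a small synthetic MDP with tabular features. The kernel, rewards, mixing matrix, and step sizes are our own illustrative choices; the update is written in the usual TD form $\theta_i \leftarrow \sum_j A_{ij}\theta_j + \gamma_t \Phi(x)\delta_i$, which matches (2) up to the sign convention for $H_i$:

```python
import numpy as np

# Synthetic 3-state MDP under a fixed policy (illustrative assumption):
# transition kernel P, discount g, and local rewards averaging to the
# global reward R.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
g = 0.9
R = np.array([1.0, 0.0, 2.0])          # global reward, R = (R_1 + R_2)/2
R_loc = np.stack([2.0 * R, 0.0 * R])   # the two agents' local rewards
Phi = np.eye(3)                        # tabular features => exact values
A = np.array([[0.5, 0.5],              # doubly stochastic mixing matrix
              [0.5, 0.5]])

V_true = np.linalg.solve(np.eye(3) - g * P, R)   # true value function

rng = np.random.default_rng(1)
theta = np.zeros((2, 3))               # one parameter vector per agent
x = 0
for t in range(100000):
    x_next = rng.choice(3, p=P[x])
    gamma = 5.0 / np.sqrt(t + 500.0)
    mix = A @ theta                    # average consensus step of (2)
    for i in range(2):
        # local temporal difference error using the local reward only
        delta = R_loc[i, x] + (g * Phi[x_next] - Phi[x]) @ theta[i]
        mix[i] += gamma * Phi[x] * delta
    theta, x = mix, x_next

assert np.linalg.norm(theta[0] - theta[1]) < 0.2            # near-consensus
assert np.linalg.norm(theta.mean(axis=0) - V_true) < 2.0    # near true values
```

Both agents approximately agree and their average approximately solves the Bellman equation for the global reward, even though each agent only ever observes its own local reward.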
For a general smooth cost function $V(\theta)$, we consider a random terminating time $\tau(T)$ such that
$$ \Pr( \tau(T) = t ) = \gamma_{t+1} \Big/ \sum_{s=0}^T \gamma_{s+1}, \quad t = 0, \dots, T, $$
where $\tau(T) \in \{0, \dots, T\}$ is selected independently and $T$ is the maximum number of iterations. This is a common stopping criterion proposed in [21]. For DSA, it can be decided by the agents with a simple consensus protocol before the iterations. Our main result is summarized as:

Theorem 1. Under H1–H8, suppose the step size satisfies
$$ \sup_{t\ge 0} \gamma_t \le \min\Big\{ 1,\ \frac{\bar\rho}{2\sigma_o},\ \frac{c_0}{2\tilde{C}_{\rm mk}} \Big\} \tag{13} $$
and there exists $\hat{a}$ such that $0 \le \gamma_t - \gamma_{t+1} \le \hat{a}\gamma_t^2$. For any $T \ge 0$, it holds that
$$ \mathbb{E}\big[ \| h(\tilde\theta_c^{(\tau(T))}) \|^2 \big] \le \frac{ C_{\rm tot} }{ (c_0/2)\sum_{t=0}^T \gamma_{t+1} }, \qquad \max_{i=1,\dots,n} \mathbb{E}\big[ \| \theta_i^{(\tau(T))} - \tilde\theta_c^{(\tau(T))} \| \big] \le \frac{ (2/c_0)\, C_{\rm tot} + (3\sigma_o/\bar\rho) \sum_{t=0}^T \gamma_{t+1}^2 }{ \sum_{t=0}^T \gamma_{t+1} }, \tag{14} $$
where we have defined $\tilde\theta_c^{(t)} := \frac{1}{n}\sum_{i=1}^n \theta_i^{(t)}$,
$$ C_{\rm tot} := V\Big( \frac{1}{n}\sum_{i=1}^n \theta_i^{(0)} \Big) - V^\star + C_{{\rm mk},1} + \bar{C}_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2, $$
and the constants $\tilde{C}_{\rm mk}, C_{{\rm mk},1}, \bar{C}_{{\rm mk},2}$ will be specified in Section 3.1. The expectation is taken w.r.t. the terminating iteration $\tau(T)$ and the Markovian randomness.

For the best convergence rate, we may take $\gamma_t = a_0/\sqrt{t + a_1}$ for some $a_0, a_1 > 0$. In this case, the theorem shows that the squared norm of the mean field and the consensus error converge at the rate of $\mathcal{O}(\log T/\sqrt{T})$. By H3 and (14), we have that $\mathbb{E}[ \| \nabla V(\tilde\theta_c^{(\tau(T))}) \|^2 ]$ converges at $\mathcal{O}(\log T/\sqrt{T})$, i.e., the DSA scheme finds a stationary point of problem (1). Note that this is a standard rate for non-convex stochastic optimization [6]. Compared to existing works, our convergence rate is similar to that of a centralized SA scheme, e.g., [12], and it strengthens that of [13] for DSA with Markov noise, as we provide a convergence rate for exact consensus.

As will be derived later, the constant $C_{\rm tot}$ is proportional to $\mathcal{O}(\bar\rho^{-1})$, i.e., related to the spectral gap of the weighted adjacency matrix [cf. H1], and to the magnitude $\sup_{x,\theta} \| \hat{H}_i(\theta; x) \|$ in H4. In the case of a uniformly ergodic MC, the latter is in the order of $\mathcal{O}(\frac{1}{1-\lambda})$, so that it is related to the mixing time of the Markov chain. Our bound also highlights the dependence on the initialization $V( \sum_{i=1}^n \theta_i^{(0)}/n )$.

Instead of analyzing the convergence of the DSA scheme with a single potential function, in the analysis that follows, we adopt a divide-and-conquer approach similar to [2,10,11], where we first decompose the DSA iterates $\theta_1^{(t)}, \dots, \theta_n^{(t)}$ into their consensual part and consensus error. By observing the individual update formulas, we bound the consensus error separately, as the latter depends on the stationarity of the averaged iterate. Subsequently, the consensual part can be analyzed using a similar technique as for a centralized SA scheme. Due to space limitations, we only provide the analysis for general nonlinear DSA under H1–H8.

Define the following $nd$-dimensional vectors
$$ \boldsymbol\theta^{(t)} := \begin{pmatrix} \theta_1^{(t)} \\ \vdots \\ \theta_n^{(t)} \end{pmatrix}, \qquad \boldsymbol{H}(\boldsymbol\theta^{(t)}; x) := \begin{pmatrix} H_1(\theta_1^{(t)}; x) \\ \vdots \\ H_n(\theta_n^{(t)}; x) \end{pmatrix} $$
as the collection of local solutions and stochastic updates, respectively.
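Before analyzing the stacked recursion, it is instructive to check H1 and the consensus decomposition numerically. The sketch below uses hypothetical Metropolis-style weights on a 4-agent ring (an illustrative choice, not from the paper) and verifies the factorization $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top = UU^\top$, the exactness of the decomposition in (16)–(17), and the contraction of the consensus error under one mixing step:

```python
import numpy as np

n, d = 4, 3
# Metropolis-style weights on a 4-agent ring with self-loops: symmetric
# and doubly stochastic, an illustrative choice satisfying H1.
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
assert np.allclose(A, A.T) and np.allclose(A @ np.ones(n), np.ones(n))

# Factorize the projection I - (1/n) 1 1^T = U U^T with U^T U = I_{n-1}.
Proj = np.eye(n) - np.ones((n, n)) / n
eigval, eigvec = np.linalg.eigh(Proj)
U = eigvec[:, eigval > 0.5]        # the n-1 eigenvectors for eigenvalue 1
assert np.allclose(U @ U.T, Proj) and np.allclose(U.T @ U, np.eye(n - 1))

# Condition 3 of H1: spectral norm of U^T A U is bounded away from 1.
rho_bar = 1.0 - np.linalg.norm(U.T @ A @ U, 2)
assert 0 < rho_bar <= 1

# The decomposition of a stacked iterate theta in R^{nd} is exact.
rng = np.random.default_rng(0)
theta = rng.standard_normal(n * d)
Id = np.eye(d)
Ec = np.kron(np.ones((1, n)) / n, Id)   # consensual component map
Uo = np.kron(U.T, Id)                   # consensus error map
theta_c, theta_o = Ec @ theta, Uo @ theta
recon = np.kron(np.ones((n, 1)), Id) @ theta_c + np.kron(U, Id) @ theta_o
assert np.allclose(recon, theta)

# One mixing step (A kron I_d) preserves the average and contracts the
# consensus error by at least a factor (1 - rho_bar).
theta_next = np.kron(A, Id) @ theta
assert np.allclose(Ec @ theta_next, theta_c)
assert np.linalg.norm(Uo @ theta_next) <= (1 - rho_bar) * np.linalg.norm(theta_o) + 1e-9
```

This is exactly the splitting that lets the consensual part and the consensus error be analyzed as two coupled SA schemes in the sequel.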
We rewrite the DSA recursion (2) as:
$$ \boldsymbol\theta^{(t+1)} = \big( A \otimes I_d \big)\, \boldsymbol\theta^{(t)} - \gamma_{t+1}\, \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}), \tag{15} $$
where $\otimes$ denotes the Kronecker product.

Consider the projection matrix $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ onto the subspace orthogonal to ${\rm span}\{\mathbf{1}_n\}$. As ${\rm rank}( I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top ) = n - 1$, it admits the factorization $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top = UU^\top$, where $U \in \mathbb{R}^{n \times (n-1)}$ satisfies $U^\top U = I_{n-1}$. We let
$$ \tilde\theta_c^{(t)} := \big( \tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big)\, \boldsymbol\theta^{(t)}, \qquad \tilde\theta_o^{(t)} := \big( U^\top \otimes I_d \big)\, \boldsymbol\theta^{(t)}, \tag{16} $$
such that $\tilde\theta_c^{(t)} \in \mathbb{R}^d$ and $\tilde\theta_o^{(t)} \in \mathbb{R}^{(n-1)d}$ denote the consensual component and the consensus error of $\boldsymbol\theta^{(t)}$, respectively. Moreover,
$$ \boldsymbol\theta^{(t)} = ( \mathbf{1} \otimes I_d )\, \tilde\theta_c^{(t)} + ( U \otimes I_d )\, \tilde\theta_o^{(t)}. \tag{17} $$
Using (15), (16), the recursions of the two components in (17) can be described as
$$ \tilde\theta_c^{(t+1)} \overset{(a)}{=} \tilde\theta_c^{(t)} - \gamma_{t+1} \big( \tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}), \qquad \tilde\theta_o^{(t+1)} \overset{(b)}{=} \big( U^\top A U \otimes I_d \big)\, \tilde\theta_o^{(t)} - \gamma_{t+1} \big( U^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}), \tag{18} $$
where (a) used $\mathbf{1}^\top A = \mathbf{1}^\top$; (b) used $U^\top A \mathbf{1} = U^\top \mathbf{1} = \mathbf{0}$. The recursions in (18) are coupled through the local solutions $\boldsymbol\theta^{(t)}$ in the stacked stochastic update. In particular, they allow us to handle the convergence of the respective components as two SA schemes.

Our next step is to derive an intermediate bound on the consensus error $\tilde\theta_o^{(t+1)}$. A key observation is as follows:

Lemma 1.
Assume H1, H7 and that the step size satisfies $\gamma_t \le \frac{\bar\rho}{2\sigma_o}$. If $\theta_i^{(0)} = \theta_j^{(0)}$ for all $i,j$, then it holds for any $t \ge 0$ that
$$ \| \tilde\theta_o^{(t+1)} \| \le \sigma_o \sum_{s=0}^t \gamma_{s+1} \Big( 1 - \frac{\bar\rho}{2} \Big)^{t-s} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\}. $$

The above lemma shows that the consensus error can be upper bounded by the convolution between an exponential term $(1 - \bar\rho/2)^{t-s}$ and the norm of the mean field weighted by the step size, $\gamma_{s+1}\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \}$. Importantly, $\| \tilde\theta_o^{(t)} \|$ decays to zero at the rate of $\mathcal{O}(\gamma_{t+1})$ provided that $\| h(\tilde\theta_c^{(t)}) \| \to 0$.

Next, we consider the recursion of the consensual component $\tilde\theta_c^{(t)}$. We observe that
$$ \big( \tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) = h(\tilde\theta_c^{(t)}) + \Big[ \frac{1}{n}\sum_{i=1}^n H_i(\tilde\theta_c^{(t)}; X_{t+1}) - h(\tilde\theta_c^{(t)}) \Big] + \frac{1}{n}\sum_{i=1}^n \big\{ H_i(\theta_i^{(t)}; X_{t+1}) - H_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\}. \tag{19} $$
Denote
$$ e_0^{(t)} := \frac{1}{n}\sum_{i=1}^n H_i(\tilde\theta_c^{(t)}; X_{t+1}) - h(\tilde\theta_c^{(t)}), \qquad e_1^{(t)} := \frac{1}{n}\sum_{i=1}^n \big\{ H_i(\theta_i^{(t)}; X_{t+1}) - H_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\}. $$
The recursion of $\tilde\theta_c^{(t)}$ then follows that of a perturbed SA scheme:
$$ \tilde\theta_c^{(t+1)} = \tilde\theta_c^{(t)} - \gamma_{t+1} \big\{ h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \big\}, \tag{20} $$
where $e_0^{(t)}$ is a perturbation due to the random sample $X_{t+1}$ in estimating the mean field, and $e_1^{(t)}$ is bounded by the consensus error.

Our idea is to proceed in a similar fashion as in [12]. Observe the following lemma:

Lemma 2.
Assume H1, H3, H5, H6, H7, H8 and that $\gamma_t \le \min\{ \frac{\bar\rho}{2\sigma_o}, 1 \}$. For any $T \ge 0$, and with $E := 12\sigma_o L_h^2/(\bar\rho n)$, it holds that
$$ \sum_{t=0}^T \gamma_{t+1} \Big( c_0 - \gamma_{t+1} \Big\{ E + \frac{d_0^2}{2} + L_V \Big\} \Big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + \big\{ 2\sigma_h^2 L_V + E \big\} \sum_{t=0}^T \gamma_{t+1}^2 - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle. $$

From the above, we observe that by setting a sufficiently small $\gamma_{t+1}$, it is possible to lower bound the l.h.s. by $(c_0/2) \sum_{t=0}^T \gamma_{t+1} \| h(\tilde\theta_c^{(t)}) \|^2$. Now, if the r.h.s. is finite, the convergence of $\mathbb{E}[ \| h(\tilde\theta_c^{(\tau(T))}) \|^2 ]$ can be guaranteed.

Our remaining task is to upper bound the inner product $| \mathbb{E}[ \sum_{t=0}^T \gamma_{t+1} \langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \rangle ] |$. Notice that in the case when the $X_{t+1}$ are drawn i.i.d. from the distribution $\mu$, this inner product is zero. Our result below shows that even though the $X_{t+1}$ are not i.i.d., the inner product can still be controlled with an appropriate step size.

Lemma 3.
Assume H3–H8. Let $| \gamma_t - \gamma_{t+1} | \le \hat{a}\gamma_t^2$ for some constant $\hat{a}$, and let the step size satisfy $\gamma_t \le \min\{ 1, \frac{\bar\rho}{2\sigma_o} \}$. For any $T \ge 0$, it holds that
$$ \Big| \mathbb{E}\Big[ \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle \Big] \Big| \le C_{{\rm mk},1} + C_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2 + C_{{\rm mk},3} \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2. $$
Here, $C_{{\rm mk},1}, C_{{\rm mk},2}, C_{{\rm mk},3}$ are technical constants which will be defined in (35).

The above lemma gives a compatible bound on the desired inner product under the scenario of Markovian noise. Substituting Lemma 3 into the conclusion of Lemma 2 and rearranging terms yield
$$ \sum_{t=0}^T \gamma_{t+1} \big( c_0 - \gamma_{t+1} \tilde{C}_{\rm mk} \big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + C_{{\rm mk},1} + \bar{C}_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2 =: C_{\rm tot}, $$
where $\tilde{C}_{\rm mk} := C_{{\rm mk},3} + E + \frac{d_0^2}{2} + L_V$ and $\bar{C}_{{\rm mk},2} := C_{{\rm mk},2} + 2\sigma_h^2 L_V + E$. We also denote the quantity on the r.h.s. by $C_{\rm tot}$. If we select the step size according to (13), then $c_0 - \gamma_{t+1}\tilde{C}_{\rm mk} \ge c_0/2$ and
$$ \mathbb{E}\big[ \| h(\tilde\theta_c^{(\tau(T))}) \|^2 \big] = \frac{ \mathbb{E}\big[ \sum_{t=0}^T \gamma_{t+1} \| h(\tilde\theta_c^{(t)}) \|^2 \big] }{ \sum_{t=0}^T \gamma_{t+1} } \le \frac{ C_{\rm tot} }{ (c_0/2) \sum_{t=0}^T \gamma_{t+1} }. $$
This concludes our result for the convergence of the mean field. Furthermore, note that this implies that the norm of the gradient of $V(\theta)$ converges [cf. H3].

Finally, we bound the consensus error. Again, we invoke Lemma 1 and observe that
$$ \sum_{t=0}^T \gamma_{t+1} \| \tilde\theta_o^{(t)} \| \le \sigma_o \sum_{t=0}^T \gamma_{t+1} \sum_{s=0}^{t-1} \gamma_{s+1} \Big( 1 - \frac{\bar\rho}{2} \Big)^{t-1-s} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\} \overset{(a)}{\le} \sigma_o \sum_{s=0}^{T-1} \gamma_{s+1}^2 \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\} \sum_{t=s+1}^T \Big( 1 - \frac{\bar\rho}{2} \Big)^{t-1-s} \overset{(b)}{\le} \frac{3\sigma_o}{\bar\rho} \sum_{s=0}^{T-1} \gamma_{s+1}^2 + \sum_{s=0}^{T-1} \gamma_{s+1} \| h(\tilde\theta_c^{(s)}) \|^2, $$
where (a) involved a change of the order of summation and $\gamma_{t+1} \le \gamma_{s+1}$ as the step size is nonincreasing; (b) involved $\sum_{t \ge s+1} (1 - \bar\rho/2)^{t-1-s} \le 2/\bar\rho$ and the condition $\gamma_{s+1} \le \bar\rho/(2\sigma_o)$. Finally, evaluating the expectation and applying the bound on $\mathbb{E}[ \sum_t \gamma_{t+1} \| h(\tilde\theta_c^{(t)}) \|^2 ]$ show that
$$ \mathbb{E}\big[ \| \tilde\theta_o^{(\tau(T))} \| \big] \le \frac{ (3\sigma_o/\bar\rho) \sum_{s=0}^{T-1} \gamma_{s+1}^2 + (2/c_0)\, C_{\rm tot} }{ \sum_{t=0}^T \gamma_{t+1} }. \tag{21} $$
The above concludes the proof of Theorem 1.

Extension to Time-varying Graph
Our analysis can be extended to scenarios where the communication graph is time-varying. Let $G^{(t)} = (V, E^{(t)})$ be a simple, undirected graph which is potentially not connected, where $E^{(t)} \subseteq E$, and the graph is associated with a weighted adjacency matrix $A^{(t)}$. We replace $A$ by $A^{(t)}$ in the DSA scheme (2) at iteration $t$, and H1 is replaced with the following assumption:

H9. For any $t \ge 0$, the matrix $A^{(t)} \in \mathbb{R}^{n\times n}_+$ satisfies:
1. $A^{(t)}_{ij} = 0$ whenever $(i,j) \notin E^{(t)}$.
2. $A^{(t)} = (A^{(t)})^\top$ and $A^{(t)}\mathbf{1} = \mathbf{1}$.
3. $\exists\, B \ge 1$ such that $\| U^\top A^{(t+B-1)} \cdots A^{(t)} U \| \le 1 - \bar\rho$, where $\bar\rho \in (0,1]$.

The last condition can be guaranteed under the 'bounded communication' setting [1], i.e., when the combined graph (
$V, E^{(t)} \cup \cdots \cup E^{(t+B-1)}$) is connected for any $t \ge 0$. As $A^{(t)}$ remains doubly stochastic, the decomposition in (18) is still valid. We can then extend Lemma 1 to bound the consensus error using a blocking argument; see the discussion in Appendix A. The proof of Theorem 1 can be modified accordingly, and we obtain the same convergence rate for the time-varying graph setting.

In this paper, we have studied the convergence of a biased decentralized stochastic approximation (DSA) scheme. The scheme is a multi-agent optimization algorithm relying on biased, stochastic updates that approximate the gradient of a smooth cost function. Here, the biasedness stems from taking Markov samples and quasi-gradients in the updates. We prove that DSA finds a consensual and stationary point of the cost function at a rate of $\mathcal{O}(\log T/\sqrt{T})$, where $T$ is the maximum iteration number. Future works include extending to asynchronous and gradient tracking DSA, state-controlled Markov chains, etc.

A Proof of Lemma 1 & Its Extension
From the recursion (18), we observe that
$$ \| \tilde\theta_o^{(t+1)} \| \le \| ( U^\top A U \otimes I )\, \tilde\theta_o^{(t)} \| + \gamma_{t+1} \| ( U^\top \otimes I )\, \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) \|. \tag{22} $$
Using H1, we observe the contraction
$$ \| ( U^\top A U \otimes I )\, \tilde\theta_o^{(t)} \| \le \| U^\top A U \otimes I \|\, \| \tilde\theta_o^{(t)} \| \le ( 1 - \bar\rho ) \| \tilde\theta_o^{(t)} \|. \tag{23} $$
Using H7, we bound the second term in (22) as:
$$ \| ( U^\top \otimes I )\, \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) \| \le \Big\| \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) - \big( \mathbf{1}\tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) \Big\| \le \Big( \sum_{i=1}^n \sigma_o^2 \Big\{ \frac{1}{n} + \frac{2}{n} \| h(\tilde\theta_c^{(t)}) \| + \| \theta_i^{(t)} - \tilde\theta_c^{(t)} \| \Big\}^2 \Big)^{1/2}, $$
which can be further simplified as $\sigma_o \big\{ 1 + 2\| h(\tilde\theta_c^{(t)}) \| + \| \tilde\theta_o^{(t)} \| \big\}$. Substituting into (22) yields
$$ \| \tilde\theta_o^{(t+1)} \| \le ( 1 - \bar\rho + \gamma_{t+1}\sigma_o ) \| \tilde\theta_o^{(t)} \| + \gamma_{t+1}\sigma_o \big\{ 1 + 2\| h(\tilde\theta_c^{(t)}) \| \big\}. \tag{24} $$
Setting $\gamma_{t+1} \le \frac{\bar\rho}{2\sigma_o}$ yields $1 - \bar\rho + \gamma_{t+1}\sigma_o \le 1 - \frac{\bar\rho}{2}$. Solving the recursion and noticing that $\tilde\theta_o^{(0)} = \mathbf{0}$ yield the desired bound.

Extension to Time-varying Topology
Under the relaxed condition H9, we apply a blocking argument to derive the result in Lemma 1. In particular, denoting $\Theta(m, n) := \| \tilde\theta_o^{(m)} \| + \cdots + \| \tilde\theta_o^{(n)} \|$, we can show:
$$ \Theta( t+1, t+B ) \le ( 1 - \bar\rho )\, \Theta( t-B+1, t ) + \sigma_o \gamma_{t-B+2} \big\{ \Theta( t-B+1, t ) + \cdots + \Theta( t, t+B-1 ) \big\} + \sigma_o B \sum_{s=t-B+1}^{t+B-1} \gamma_{s+1} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\}, $$
which implies that
$$ \Theta( t+1, t+B ) \le \frac{ 1 - \bar\rho + \sigma_o B \gamma_{t-B+2} }{ 1 - \sigma_o B \gamma_{t-B+2} }\, \Theta( t-B+1, t ) + \frac{ \sigma_o B }{ 1 - \sigma_o B \gamma_{t-B+2} } \sum_{s=t-B+1}^{t+B-1} \gamma_{s+1} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\}. $$
Setting a sufficiently small step size $\gamma_t$ allows us to derive a similar recursion as (24) for $\Theta( t+1, t+B )$. Solving it yields a convolution bound as in Lemma 1.

B Proof of Lemma 2
Using the $L_V$-smoothness of $V(\cdot)$ [cf. H6], we observe that
$$ V(\tilde\theta_c^{(t+1)}) \le V(\tilde\theta_c^{(t)}) - \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \big\rangle + \frac{\gamma_{t+1}^2 L_V}{2} \| h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \|^2 $$
$$ \le V(\tilde\theta_c^{(t)}) - \gamma_{t+1} \big( c_0 - \gamma_{t+1} L_V \big) \| h(\tilde\theta_c^{(t)}) \|^2 + \gamma_{t+1}^2 L_V \| e_0^{(t)} + e_1^{(t)} \|^2 - \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} + e_1^{(t)} \big\rangle, $$
where the second inequality used H3 and $\| a + b \|^2 \le 2\| a \|^2 + 2\| b \|^2$. Summing from $t = 0$ to $t = T$ yields
$$ \sum_{t=0}^T \gamma_{t+1} \big( c_0 - \gamma_{t+1} L_V \big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + 2 L_V \sum_{t=0}^T \gamma_{t+1}^2 \big\{ \| e_0^{(t)} \|^2 + \| e_1^{(t)} \|^2 \big\} - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} + e_1^{(t)} \big\rangle. $$
By H5, we observe
$$ \| e_1^{(t)} \| \le \frac{L_h}{n} \sum_{i=1}^n \| \theta_i^{(t)} - \tilde\theta_c^{(t)} \| \le \frac{L_h}{\sqrt{n}} \| \tilde\theta_o^{(t)} \|, \tag{25} $$
where we used $\| \boldsymbol\theta^{(t)} - ( \mathbf{1} \otimes I_d )\tilde\theta_c^{(t)} \| = \| ( U \otimes I_d )\tilde\theta_o^{(t)} \| = \| \tilde\theta_o^{(t)} \|$. Also, applying H8 (so that $\| e_0^{(t)} \| \le \sigma_h$) and rearranging terms show that
$$ \sum_{t=0}^T \gamma_{t+1} \big( c_0 - \gamma_{t+1} L_V \big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} + e_1^{(t)} \big\rangle + 2 L_V \sum_{t=0}^T \gamma_{t+1}^2 \Big\{ \sigma_h^2 + \frac{L_h^2}{n} \| \tilde\theta_o^{(t)} \|^2 \Big\}. $$
Moreover, using H3 and Young's inequality, we observe
$$ -\gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_1^{(t)} \big\rangle \le \frac{\gamma_{t+1}^2 d_0^2}{2} \| h(\tilde\theta_c^{(t)}) \|^2 + \frac{1}{2} \| e_1^{(t)} \|^2. $$
Rearranging terms again and using $\gamma_t \le 1$ yield
$$ \sum_{t=0}^T \gamma_{t+1} \Big( c_0 - \gamma_{t+1} \Big\{ \frac{d_0^2}{2} + L_V \Big\} \Big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + 2\sigma_h^2 L_V \sum_{t=0}^T \gamma_{t+1}^2 + \sum_{t=0}^T \frac{3 L_h^2}{n} \| \tilde\theta_o^{(t)} \|^2 - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle. $$
Next, we need to upper bound $\sum_{t=0}^T \| \tilde\theta_o^{(t)} \|^2$. With Lemma 1 and Lemma 4 (applied with $\rho = \bar\rho/2$), we obtain
$$ \sum_{t=0}^T \| \tilde\theta_o^{(t)} \|^2 \le \frac{8\sigma_o^2}{\bar\rho^2} \sum_{t=0}^T \gamma_{t+1}^2 \big\{ 1 + 2\| h(\tilde\theta_c^{(t)}) \| \big\}^2. \tag{26} $$
Define the constant $E := 12\sigma_o L_h^2/(\bar\rho n)$. Substituting (26) into the previous inequality and using the step size conditions $\gamma_{t+1}\sigma_o \le \bar\rho/2$ and $\gamma_t \le 1$, we obtain
$$ \sum_{t=0}^T \gamma_{t+1} \Big( c_0 - \gamma_{t+1} \Big\{ E + \frac{d_0^2}{2} + L_V \Big\} \Big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + \big\{ 2\sigma_h^2 L_V + E \big\} \sum_{t=0}^T \gamma_{t+1}^2 - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle. $$
This is the desired bound for the lemma.
C Proof of Lemma 3
We begin the proof by using the solution to the Poisson equation defined in H4. We have
$$ e_0^{(t)} = \frac{1}{n}\sum_{i=1}^n \big\{ H_i(\tilde\theta_c^{(t)}; X_{t+1}) - h_i(\tilde\theta_c^{(t)}) \big\} = \frac{1}{n}\sum_{i=1}^n \big\{ \hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) - P\hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\}. $$
Consequently,
$$ \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle \equiv \frac{1}{n}\sum_{i=1}^n \big\{ A_1^i + A_2^i + A_3^i + A_4^i \big\}, $$
where
$$ A_1^i := \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid \hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) - P\hat{H}_i(\tilde\theta_c^{(t)}; X_t) \big\rangle, $$
$$ A_2^i := \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid P\hat{H}_i(\tilde\theta_c^{(t)}; X_t) - P\hat{H}_i(\tilde\theta_c^{(t-1)}; X_t) \big\rangle, $$
$$ A_3^i := \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid P\hat{H}_i(\tilde\theta_c^{(t-1)}; X_t) \big\rangle - \sum_{t=0}^{T-1} \gamma_{t+2} \big\langle \nabla V(\tilde\theta_c^{(t+1)}) \mid P\hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\rangle, $$
$$ A_4^i := \sum_{t=0}^{T-1} \big\langle \gamma_{t+2} \nabla V(\tilde\theta_c^{(t+1)}) - \gamma_{t+1} \nabla V(\tilde\theta_c^{(t)}) \mid P\hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\rangle - \gamma_{T+1} \big\langle \nabla V(\tilde\theta_c^{(T)}) \mid P\hat{H}_i(\tilde\theta_c^{(T)}; X_{T+1}) \big\rangle. $$
We have set $\tilde\theta_c^{(-1)} = \tilde\theta_c^{(0)}$ as a convention in the above. Next, we upper bound the above terms as follows.

Firstly, due to the martingale property $\mathbb{E}[ \langle \nabla V(\tilde\theta_c^{(t)}) \mid \hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) - P\hat{H}_i(\tilde\theta_c^{(t)}; X_t) \rangle \mid \mathcal{F}_t ] = 0$, we have
$$ \mathbb{E}[ A_1^i ] = 0, \quad \forall\, i. \tag{27} $$
Secondly, note that H5 implies that $P\hat{H}_i(\theta; x)$ is $\bar{L}_h$-Lipschitz w.r.t. $\theta$, for some constant $\bar{L}_h$ [29]. As such,
$$ A_2^i \le \sum_{t=0}^T \gamma_{t+1} \bar{L}_h \| \nabla V(\tilde\theta_c^{(t)}) \| \| \tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} \| \le \sum_{t=0}^T \gamma_{t+1} d_0 \bar{L}_h \| h(\tilde\theta_c^{(t)}) \| \| \tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} \|. \tag{28} $$
Taking the summation over $i$ and dividing by $n$ yield
$$ \frac{1}{n}\sum_{i=1}^n A_2^i \le \sum_{t=0}^T \gamma_{t+1} d_0 \bar{L}_h \| h(\tilde\theta_c^{(t)}) \| \| \tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} \|. $$
Notice that $\tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} = -\gamma_t \{ h(\tilde\theta_c^{(t-1)}) + e_0^{(t-1)} + e_1^{(t-1)} \}$. We observe that
$$ \| e_0^{(t-1)} \| \le \sigma_h, \qquad \| e_1^{(t-1)} \| \le \frac{L_h}{\sqrt{n}} \| \tilde\theta_o^{(t-1)} \|. \tag{29} $$
As such,
$$ \frac{1}{d_0 \bar{L}_h} \cdot \frac{1}{n}\sum_{i=1}^n A_2^i \le \frac{L_h}{\sqrt{n}} \sum_{t=0}^T \gamma_{t+1}\gamma_t \| h(\tilde\theta_c^{(t)}) \| \| \tilde\theta_o^{(t-1)} \| + \sum_{t=0}^T \gamma_{t+1}\gamma_t \| h(\tilde\theta_c^{(t)}) \| \big\{ \sigma_h + \| h(\tilde\theta_c^{(t-1)}) \| \big\} $$
$$ \le \Big( \frac{L_h}{\sqrt{n}} + 2 \Big) \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2 + \frac{L_h}{\sqrt{n}} \sum_{t=0}^T \| \tilde\theta_o^{(t-1)} \|^2 + \sigma_h^2 \sum_{t=0}^T \gamma_t^2 \le \Big( \frac{L_h}{\sqrt{n}} + 2 + \frac{4\sigma_o}{\bar\rho} \Big) \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2 + \Big( \sigma_h^2 + \frac{4\sigma_o}{\bar\rho} \Big) \sum_{t=0}^T \gamma_{t+1}^2. \tag{30} $$
To analyze the last two terms $A_3^i, A_4^i$, we denote
$$ \mathsf{E} := \frac{1}{n} \big( \mathbf{1}^\top \otimes I_d \big), \qquad \boldsymbol\theta_c^{(t)} := ( \mathbf{1} \otimes I_d )\, \tilde\theta_c^{(t)}, \tag{31} $$
such that $P\hat{\boldsymbol H}(\boldsymbol\theta_c; x) = ( P\hat{H}_1(\tilde\theta_c; x); \cdots; P\hat{H}_n(\tilde\theta_c; x) )$.

Thirdly, we observe from [29, Lemma 4.2] that, under H2, H4, H8, it can be shown for any $\boldsymbol\theta_c = ( \mathbf{1} \otimes I_d )\tilde\theta_c$ with $\tilde\theta_c \in \mathbb{R}^d$, and $x \in \mathsf{X}$, that:
$$ \| P\hat{\boldsymbol H}(\boldsymbol\theta_c; x) \| \le K_P. \tag{32} $$
Here, $K_P$ depends on the mixing time of the Markov chain, e.g., it is proportional to $\frac{1}{1-\lambda}$ under the uniform ergodicity condition (7). Therefore,
$$ \frac{1}{n}\sum_{i=1}^n A_3^i = \gamma_1 \big\langle \nabla V(\tilde\theta_c^{(0)}) \mid \mathsf{E}\, P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(0)}; X_0) \big\rangle \le \gamma_1 K_P \| \nabla V(\tilde\theta_c^{(0)}) \|. \tag{33} $$
Fourthly, using $| \gamma_{t+2} - \gamma_{t+1} | \le \hat{a}\gamma_{t+1}^2$, we have
$$ \frac{1}{n}\sum_{i=1}^n A_4^i \le \hat{a} d_0 \sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \| \| \mathsf{E} P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(t)}; X_{t+1}) \| + L_V \sum_{t=0}^{T-1} \gamma_{t+1} \| \tilde\theta_c^{(t+1)} - \tilde\theta_c^{(t)} \| \| \mathsf{E} P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(t)}; X_{t+1}) \| + \gamma_{T+1} \| \nabla V(\tilde\theta_c^{(T)}) \| \| \mathsf{E} P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(T)}; X_{T+1}) \| $$
$$ \le K_P \sum_{t=0}^{T-1} \big\{ \hat{a} d_0 \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \| + L_V \gamma_{t+1} \| \tilde\theta_c^{(t+1)} - \tilde\theta_c^{(t)} \| \big\} + \gamma_{T+1} K_P \| \nabla V(\tilde\theta_c^{(T)}) \|. $$
To bound $\frac{1}{n}\sum_i A_4^i$, we observe that $\sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \| \le \frac{1}{2} \sum_{t=0}^{T-1} \gamma_{t+1}^2 \{ 1 + \| h(\tilde\theta_c^{(t+1)}) \|^2 \}$; the latter can be further bounded as $\sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \|^2 \le a_1 \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2$, where $a_1$ is a constant such that $\gamma_t \le a_1 \gamma_{t+1}$ (which holds under the step size condition). Moreover,
$$ \sum_{t=0}^{T-1} \gamma_{t+1} \| \tilde\theta_c^{(t+1)} - \tilde\theta_c^{(t)} \| = \sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \| \le \sum_{t=0}^{T-1} \gamma_{t+1}^2 \Big\{ \sigma_h + \| h(\tilde\theta_c^{(t)}) \| + \frac{L_h}{\sqrt{n}} \| \tilde\theta_o^{(t)} \| \Big\} $$
$$ \le \sum_{t=0}^{T-1} \gamma_{t+1}^2 \Big\{ \frac{ \bar\rho n ( 1 + 2\sigma_h ) + L_h + 4 L_h \sigma_o }{ 2\bar\rho n } + \frac{ \bar\rho n + 4 L_h \sigma_o }{ 2\bar\rho n } \| h(\tilde\theta_c^{(t)}) \|^2 \Big\}. $$
We also observe the crude upper bound
$$ \gamma_{T+1} \| \nabla V(\tilde\theta_c^{(T)}) \| \le d_0 \gamma_{T+1} \| h(\tilde\theta_c^{(T)}) \| \le \frac{d_0}{2} \big( 1 + \gamma_{T+1}^2 \| h(\tilde\theta_c^{(T)}) \|^2 \big) \le \frac{d_0}{2} \Big( 1 + \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2 \Big). \tag{34} $$
Define the constants
$$ C_{{\rm mk},1} := K_P \Big\{ \frac{d_0}{2} + \gamma_1 \| \nabla V(\tilde\theta_c^{(0)}) \| \Big\}, \qquad C_{{\rm mk},2} := K_P L_V\, \frac{ \bar\rho n ( 1 + 2\sigma_h ) + L_h ( 1 + 4\sigma_o ) }{ 2\bar\rho n } + d_0 \bar{L}_h \Big( \sigma_h^2 + \frac{4\sigma_o}{\bar\rho} \Big) + \frac{ K_P \hat{a} d_0 }{2}, $$
$$ C_{{\rm mk},3} := d_0 \bar{L}_h \Big( \frac{L_h}{\sqrt{n}} + 2 + \frac{4\sigma_o}{\bar\rho} \Big) + K_P \Big( \hat{a} a_1 d_0 + L_V\, \frac{ \bar\rho n + 4 L_h \sigma_o }{ 2\bar\rho n } \Big) + \frac{ K_P d_0 }{2}. \tag{35} $$
Combining the terms yields
$$ \Big| \mathbb{E}\Big[ \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle \Big] \Big| \le C_{{\rm mk},1} + C_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2 + C_{{\rm mk},3} \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2. $$
This is the desired result for the lemma.
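For a finite state space, the Poisson equation (6) that underpins this appendix can be solved in closed form via the fundamental matrix $Z = ( I - P + \mathbf{1}\mu^\top )^{-1}$; the 3-state kernel below is an illustrative assumption, not from the paper. The sketch verifies identity (6) coordinate-wise:

```python
import numpy as np

# Illustrative 3-state Markov kernel (rows sum to 1); it is irreducible
# and aperiodic, so a unique stationary distribution exists (H2).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Stationary distribution: the left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu /= mu.sum()

# A scalar "update" H evaluated at each state, and its mean field h.
H = np.array([1.0, -2.0, 0.5])
h = mu @ H

# Poisson solution via the fundamental matrix Z = (I - P + 1 mu^T)^{-1}:
# H_hat = Z (H - h 1) satisfies H_hat - P H_hat = H - h 1, i.e., (6).
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), mu))
H_hat = Z @ (H - h)

assert np.allclose(H_hat - P @ H_hat, H - h)   # identity (6) holds
assert abs(mu @ H_hat) < 1e-10                 # normalization mu^T H_hat = 0
```

The identity $\hat{H} - P\hat{H} = H - h$ is what allows the Markovian error $e_0^{(t)}$ to be split into the martingale term $A_1^i$ and the remainder terms $A_2^i$–$A_4^i$ above.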
D Auxiliary Lemma
Lemma 4.
Let $\{a_s\}_{s \ge 0}$ be an arbitrary sequence of non-negative numbers and let $\rho \in (0,1]$ be a constant. For any $T \ge 0$, we have
$$ \sum_{t=0}^T \Big( \sum_{s=0}^t a_s ( 1 - \rho )^{t-s} \Big)^2 \le \frac{2}{\rho^2} \sum_{t=0}^T a_t^2. \tag{36} $$

Proof.
We begin by expanding the square on the l.h.s. of (36) and observing the following upper bound:
$$ \sum_{t=0}^T \sum_{s=0}^t \sum_{q=0}^t a_s a_q ( 1 - \rho )^{2t-q-s} \le \sum_{s=0}^T \sum_{q=0}^s a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=s}^T ( 1 - \rho )^{2t} + \sum_{q=0}^T \sum_{s=0}^q a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=q}^T ( 1 - \rho )^{2t}. \tag{37} $$
As $\sum_{t=s}^T ( 1 - \rho )^{2t} \le ( 1 - \rho )^{2s}/\rho$, we have
$$ \sum_{s=0}^T \sum_{q=0}^s a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=s}^T ( 1 - \rho )^{2t} \le \frac{1}{\rho} \sum_{s=0}^T \sum_{q=0}^s \frac{ a_s^2 + a_q^2 }{2} ( 1 - \rho )^{s-q}. \tag{38} $$
Observe that
$$ \sum_{s=0}^T a_s^2 \sum_{q=0}^s ( 1 - \rho )^{s-q} = \sum_{s=0}^T a_s^2 \sum_{q=0}^s ( 1 - \rho )^{q} \le \frac{1}{\rho} \sum_{s=0}^T a_s^2, \qquad \sum_{s=0}^T \sum_{q=0}^s a_q^2 ( 1 - \rho )^{s-q} = \sum_{q=0}^T a_q^2 \sum_{s=q}^T ( 1 - \rho )^{s-q} \le \frac{1}{\rho} \sum_{s=0}^T a_s^2. $$
Hence,
$$ \sum_{s=0}^T \sum_{q=0}^s a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=s}^T ( 1 - \rho )^{2t} \le \frac{1}{\rho^2} \sum_{s=0}^T a_s^2. $$
By symmetry, we have $\sum_{q=0}^T \sum_{s=0}^q a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=q}^T ( 1 - \rho )^{2t} \le \frac{1}{\rho^2} \sum_{s=0}^T a_s^2$. Adding these two bounds yields the desired result in (36).

References

[1] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization,"
IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[2] P. Bianchi, G. Fort, and W. Hachem, "Performance of a distributed stochastic approximation algorithm," IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7405–7418, 2013.
[3] P. Di Lorenzo and G. Scutari, "Next: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
[4] S. Pu and A. Nedić, "A distributed stochastic gradient tracking method," Mathematical Programming, 2020.
[5] R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, "Distributed stochastic optimization with gradient tracking over strongly-connected networks," arXiv:1903.07266, 2019.
[6] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, "Distributed learning in the non-convex world: From batch to streaming data, and beyond," IEEE Signal Processing Magazine, 2020.
[7] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," in NeurIPS, pp. 5330–5340, 2017.
[8] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D²: Decentralized training over decentralized data," in ICML, pp. 4848–4856, 2018.
[9] S. Lu, X. Zhang, H. Sun, and M. Hong, "GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization," in IEEE DSW, pp. 315–321, 2019.
[10] A. S. Mathkar and V. S. Borkar, "Nonlinear gossip," SIAM Journal on Control and Optimization, vol. 54, no. 3, pp. 1535–1557, 2016.
[11] S. Vlaski and A. H. Sayed, "Distributed learning in non-convex environments – Part I: Agreement at a linear rate," arXiv preprint arXiv:1907.01848, 2019.
[12] B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai, "Non-asymptotic analysis of biased stochastic approximation scheme," in COLT, 2019.
[13] T. Sun, T. Chen, Y. Sun, Q. Liao, and D. Li, "Decentralized Markov chain gradient descent," arXiv:1909.10238, 2019.
[14] B. Kumar, V. Borkar, and A. Shetty, "Non-asymptotic error bounds for constant stepsize stochastic approximation for tracking mobile agents," Math. of Control, Signals, and Systems, vol. 31, no. 4, pp. 589–614, 2019.
[15] S. Chen, A. M. Devraj, A. Bušić, and S. Meyn, "Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation," arXiv preprint arXiv:2002.02584, 2020.
[16] T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg, "Finite-time analysis of stochastic gradient descent under Markov randomness," arXiv preprint arXiv:2003.10973, 2020.
[17] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
[18] A. M. Devraj, A. Bušić, and S. Meyn, "Optimal matrix momentum stochastic approximation and applications to Q-learning," arXiv preprint arXiv:1809.06277, 2018.
[19] R. Douc, E. Moulines, P. Priouret, and P. Soulier, Markov Chains. Springer, 2018.
[20] P. W. Glynn and S. P. Meyn, "A Liapounov bound for solutions of the Poisson equation," The Annals of Probability, pp. 916–931, 1996.
[21] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[22] T. Sun, Y. Sun, and W. Yin, "On Markov chain gradient descent," in NeurIPS, pp. 9896–9905, 2018.
[23] R. Srikant and L. Ying, "Finite-time error bounds for linear stochastic approximation and TD learning," in COLT, pp. 2803–2830, 2019.
[24] J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan, "Ergodic mirror descent," SIAM Journal on Optimization, vol. 22, no. 4, pp. 1549–1578, 2012.
[25] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, "Multi-agent reinforcement learning via double averaging primal-dual optimization," in NeurIPS, pp. 9649–9660, 2018.
[26] J. Sun, G. Wang, G. B. Giannakis, Q. Yang, and Z. Yang, "Finite-sample analysis of decentralized temporal-difference learning with linear function approximation," arXiv:1911.00934, 2019.
[27] T. T. Doan, S. T. Maguluri, and J. Romberg, "Finite-time performance of distributed temporal difference learning with linear function approximation," arXiv preprint arXiv:1907.12530, 2019.
[28] J. Bhandari, D. Russo, and R. Singal, "A finite time analysis of temporal difference learning with linear function approximation," arXiv preprint arXiv:1806.02450, 2018.
[29] G. Fort, E. Moulines, P. Priouret, et al., "Convergence of adaptive and interacting Markov chain Monte Carlo algorithms,"