On the Convergence of Consensus Algorithms with Markovian Noise and Gradient Bias
Hoi-To Wai∗

November 6, 2020
Abstract
This paper presents a finite-time convergence analysis for a decentralized stochastic approximation (SA) scheme. The scheme generalizes several algorithms for decentralized machine learning and multi-agent reinforcement learning. Our proof technique involves separating the iterates into their respective consensual parts and consensus error. The consensus error is bounded in terms of the stationarity of the consensual part, while the updates of the consensual part can be analyzed as a perturbed SA scheme. Under the Markovian noise and time-varying communication graph assumptions, the decentralized SA scheme has an expected convergence rate of $\mathcal{O}(\log T/\sqrt{T})$, where $T$ is the iteration number, in terms of the squared norm of the gradient for nonlinear SA with a smooth but non-convex cost function. This rate is comparable to the best known performance of SA in a centralized setting with a non-convex potential function.

Decentralized algorithms have become a core tool for control, optimization and machine learning in the increasingly connected world. Among others, a common setting is multi-agent optimization, where a group of agents on a connected graph seek to minimize a sum of local functions using a common parameter. For convex optimization problems, consensus-based algorithms have been developed in [1] for deterministic optimization, and recent results have extended the latter to stochastic/non-convex optimization [2–5].

With new machine-learning-inspired models such as the training of neural networks and reinforcement learning, researchers have focused on developing decentralized methods for non-convex stochastic optimization [6]. However, methods developed so far for this scenario are limited to using unbiased gradient estimates [7–9], requiring i.i.d. data samples and easy-to-compute gradients.
The latter requirements may be an obstacle in deploying the methods in sophisticated problems. The focus of this paper is to relax the unbiased gradient estimate assumption of decentralized stochastic optimization methods. We consider a decentralized stochastic approximation (DSA) scheme whose local update directions are computed from samples of a Markov chain, and the update directions converge asymptotically to a non-gradient mean field, i.e., it is a biased DSA scheme. Our contributions are summarized as follows.

∗The author is with the Department of SEEM, The Chinese University of Hong Kong, Shatin, Hong Kong. This work is supported by a CUHK Direct Grant. Email: [email protected]

• For non-convex stochastic optimization, we prove that the DSA scheme with updates from Markov samples converges to a consensual and stationary point at a rate of $\mathcal{O}(\log T/\sqrt{T})$, where $T$ is the iteration number.

• To analyze the convergence of biased DSA, we develop a decoupling procedure to split the DSA iterates into their consensual part and consensus error, which has a similar flavor to the analysis in [2,10,11]. Such separation allows us to (i) bound the consensus error explicitly; and subsequently (ii) analyze the recursion of the consensual part as a perturbed biased SA scheme by adopting prior analysis, e.g., [12].

• We show that the biased DSA scheme also converges for a time-varying topology under a standard bounded connectivity assumption.

For the biased DSA scheme with Markov samples, the closest works related to ours include [2], which considered a convex optimization related setting, and [13], which proposed an algorithm that may not achieve exact consensus. To the best of our knowledge, this paper provides the first finite-time convergence analysis for a DSA scheme with biased updates relying on Markov samples. In addition, this work is related to recent works on the non-asymptotic analysis of SA schemes [14–16].
Consider a network of $n$ agents whose goal is to obtain a common (a.k.a. consensual) solution $\theta \in \mathbb{R}^d$ which is a stationary point of the optimization problem:
$$ V^\star = \min_{\theta \in \mathbb{R}^d}\ V(\theta) := \frac{1}{n}\sum_{i=1}^n V_i(\theta), \tag{1} $$
where $V_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth (but possibly non-convex) function which is lower bounded. Conceptually, the $i$th function $V_i(\theta)$ can be interpreted as the local potential/cost function held by the $i$th agent. The agents communicate through an undirected, connected simple graph $G = (V, E)$, where $V = \{1, \dots, n\}$ is the node set and $E \subseteq V \times V$ is the edge set, which includes self-loops. The graph is endowed with a symmetric, weighted adjacency matrix $A$. We assume:
H1. The matrix $A \in \mathbb{R}^{n\times n}_+$ satisfies:
1. $A_{ij} = 0$ whenever $(i,j) \notin E$.
2. $A = A^\top$ and $A\mathbf{1} = \mathbf{1}$.
3. $\|U^\top A U\| \le 1 - \bar\rho$, where $\bar\rho \in (0,1]$ and $U$ is a projection matrix such that $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top = UU^\top$.

Note that condition 3) is equivalent to requiring that $\max\{|\lambda_2(A)|, |\lambda_n(A)|\} \le 1 - \bar\rho$, where the eigenvalues of $A$ are ordered as $\lambda_1(A) \ge \cdots \ge \lambda_n(A)$. Such a weighted adjacency matrix $A$ exists if $G$ is connected, e.g., [1].

At iteration $t$, each agent holds a local solution $\theta_i^{(t)}$ and, for simplicity, we assume the initialization $\theta_i^{(0)} = \theta_j^{(0)}$ for any $i,j$. Let $X_{t+1} \in \mathsf{X}$ be a random sample, where $\mathsf{X}$ is a (discrete or continuous) state space. The decentralized SA (DSA) scheme performs the following recursion at all agents $i = 1,\dots,n$:
$$ \theta_i^{(t+1)} = \sum_{j=1}^n A_{ij}\,\theta_j^{(t)} - \gamma_{t+1} H_i(\theta_i^{(t)}; X_{t+1}), \tag{2} $$
where the first term corresponds to an average consensus step among the neighbors of agent $i$, and $H_i(\theta_i^{(t)}; X_{t+1})$ is a local stochastic update computed from $\theta_i^{(t)}, X_{t+1}$. Furthermore, we denote
$$ \mathcal{F}_t = \sigma\big\{ \{\theta_i^{(0)}\}_{i=1}^n,\ X_s,\ s = 0, 1, \dots, t \big\} \tag{3} $$
as the filtration of random elements up to iteration $t$. Note that $\theta_i^{(t)}$ is measurable w.r.t. $\mathcal{F}_t$. In the special case of $n = 1$, (2) reduces to the classical SA scheme [17]; for general $n > 1$, (2) is related to a matrix momentum SA scheme studied in [18].

We consider a biased DSA scheme in this paper. To describe the setup, the random samples $\{X_t\}_{t\ge 0}$ form a Markov chain (MC) with a kernel $P$ satisfying:

H2.
The Markov kernel $P : \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$ generating $\{X_t\}_{t\ge 0}$ has a unique stationary distribution $\mu : \mathsf{X} \to \mathbb{R}_+$, and it is irreducible and aperiodic.

For any measurable function $f$ on $\mathsf{X}$, with a slight abuse of notation we define $Pf(X_t) = \mathbb{E}[f(X_{t+1}) \mid X_t] = \int_{\mathsf{X}} f(x)\, P(X_t, dx)$. The mean field of $H_i(\theta; X)$ is
$$ h_i(\theta) := \int_{\mathsf{X}} H_i(\theta; x)\, \mu(dx). \tag{4} $$
Importantly, the averaged mean field
$$ h(\theta) := \frac{1}{n}\sum_{i=1}^n h_i(\theta) \tag{5} $$
is related to Problem (1) through:

H3.
For any $\theta \in \mathbb{R}^d$, there exist $d_0, c_0 > 0$ such that
$$ \langle h(\theta) \mid \nabla V(\theta) \rangle \ge c_0 \| h(\theta) \|^2, \qquad d_0 \| h(\theta) \| \ge \| \nabla V(\theta) \|, $$
where $\langle x \mid y \rangle = x^\top y$ denotes the Euclidean inner product.

The constants $c_0, d_0$ characterize the multiplicative bias of the mean field in view of a stationary solution to (1). They allow the local stochastic update to be a quasi-gradient, which is relevant when the gradient of $V_i(\theta)$ is hard to obtain. Besides, the transient of the stochastic update is biased under H2, with $\mathbb{E}[ H_i(\theta_i^{(t)}; X_{t+1}) \mid \mathcal{F}_t ] \ne h_i(\theta_i^{(t)})$ for finite $t$. To this end, we assume:

H4.
For any $i = 1, \dots, n$, $\theta \in \mathbb{R}^d$, $x \in \mathsf{X}$, there exists a measurable function $\hat{H}_i : \mathbb{R}^d \times \mathsf{X} \to \mathbb{R}^d$ such that
$$ \hat{H}_i(\theta; x) - P\hat{H}_i(\theta; x) = H_i(\theta; x) - h_i(\theta). \tag{6} $$
The measurable function in H4 is a solution to the Poisson equation. Such a function exists under H2 and additional conditions on the MC. For example, we may assume that $P$ is uniformly ergodic, i.e., for a constant $K$,
$$ \sup_{x \in \mathsf{X}} \| P^t(x, \cdot) - \mu(\cdot) \|_{\rm TV} \le K \lambda^{t}, \quad \forall\, t \ge 0, \tag{7} $$
such that $\lambda \in [0,1)$ characterizes the mixing time of $P$. H4 is also satisfied under more relaxed conditions, e.g., geometric ergodicity; see [19, Ch. 21.2], [20].

We assume that both the local stochastic updates and potential functions are smooth w.r.t. $\theta$:

H5. For any $i$, the local stochastic update $H_i(\theta; x)$ is $L_h$-Lipschitz w.r.t. $\theta$, i.e., for any $\theta, \theta' \in \mathbb{R}^d$,
$$ \sup_{x \in \mathsf{X}} \| H_i(\theta; x) - H_i(\theta'; x) \| \le L_h \| \theta - \theta' \|. \tag{8} $$
Consequently, the mean field map $h_i(\theta)$ is $L_h$-Lipschitz such that $\| h_i(\theta) - h_i(\theta') \| \le L_h \| \theta - \theta' \|$ for all $\theta, \theta' \in \mathbb{R}^d$.

H6.
The potential function $V(\theta)$ is $L_V$-smooth such that $\| \nabla V(\theta) - \nabla V(\theta') \| \le L_V \| \theta - \theta' \|$ for any $\theta, \theta' \in \mathbb{R}^d$.

Lastly, we assume the following on $H_i(\theta; x)$:

H7.
For any $\theta = (\theta_1, \dots, \theta_n)$, there exists $\sigma_o$ such that
$$ \sup_{x \in \mathsf{X}} \Big\| H_i(\theta_i; x) - \frac{1}{n}\sum_{j=1}^n H_j(\theta_j; x) \Big\| \le \sigma_o \Big\{ \frac{1}{n} + \frac{2}{n} \| h(\tilde\theta_c) \| + \| \theta_i - \tilde\theta_c \| \Big\}, \tag{9} $$
for any $i = 1, \dots, n$, where we have $\tilde\theta_c = \frac{1}{n}\sum_{j=1}^n \theta_j$.

H8.
For any $\tilde\theta_c \in \mathbb{R}^d$, there exists $\sigma_h$ such that
$$ \sup_{x \in \mathsf{X}} \Big\| \frac{1}{n}\sum_{i=1}^n H_i(\tilde\theta_c; x) - h(\tilde\theta_c) \Big\| \le \sigma_h. \tag{10} $$
In particular, $\sigma_o$ quantifies the heterogeneity of the stochastic updates, while $\sigma_h$ plays a similar role to the variance of $(1/n)\sum_{i=1}^n \{ H_i(\theta; x) - h_i(\theta) \}$. Under H5, we observe that H7 can be satisfied if the l.h.s. of (9) is upper bounded by $\mathcal{O}(1 + \sum_{i=1}^n \| h_i(\theta_i) \|)$. Our condition H7 is considerably weaker than the heterogeneity assumption required by [7], as we allow the heterogeneity between the local updates to grow with the norm of the mean field.

Notice that H7, H8 are uniform bounds on the norms of the errors for all $x \in \mathsf{X}$. They are considerably stronger than those for decentralized stochastic algorithms with i.i.d. data, e.g., [7,21]. However, we remark that this is a caveat for the prior works on SA with Markov noise as well, e.g., [12,13,22,23].

We discuss two applications. In the first application, the $i$th potential function is taken as the following stochastic objective function:
$$ V_i(\theta) := \mathbb{E}_{X_i \sim \mu_i}\big[ V_i(\theta; X_i) \big]. \tag{11} $$
The local stochastic update is given by
$$ H_i(\theta_i^{(t)}; X_i^{t+1}) = \nabla V_i(\theta_i^{(t)}; X_i^{t+1}), \tag{12} $$
such that $\{X_i^t\}_{t\ge 0}$ is a Markov chain with kernel $P_i : \mathsf{X}_i \times \mathsf{X}_i \to \mathbb{R}_+$ and a unique stationary distribution $\mu_i : \mathsf{X}_i \to \mathbb{R}_+$. Consequently, the mean field of $H_i(\theta_i^{(t)}; X_i^{t+1})$ is the gradient $h_i(\theta_i^{(t)}) = \nabla V_i(\theta_i^{(t)})$.

Eq. (11), (12) generalize the vanilla decentralized SGD method [7] to scenarios with non-i.i.d. (a.k.a. ergodic) data. As discussed in [22, 24], the latter is important in applications where data samples are not obtained independently, for example, when the data samples are generated using a Markov chain Monte Carlo method. To see that (11), (12) fit the assumptions of this paper, we take $\mathsf{X} = \mathsf{X}_1 \times \cdots \times \mathsf{X}_n$ and form $P$ by concatenating the local Markov kernels.
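As a concrete illustration of (11), (12), the sketch below runs a single agent's stochastic gradient recursion where the samples are generated by a two-state Markov chain instead of i.i.d. draws. The chain, the quadratic local cost, and all parameters are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative two-state Markov chain (an assumption for this sketch,
# not from the paper); row x of P gives the law of the next sample.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([2.0 / 3.0, 1.0 / 3.0])   # stationary distribution: mu P = mu

# Hypothetical local cost V_i(theta; x) = 0.5 * ||theta - m[x]||^2, so the
# stochastic gradient is H_i(theta; x) = theta - m[x] and the mean field
# is h_i(theta) = theta - sum_x mu[x] m[x].
m = np.array([[1.0, 0.0],
              [0.0, 2.0]])
target = mu @ m                          # minimizer of the mu-averaged cost

rng = np.random.default_rng(0)
theta, x = np.zeros(2), 0
for t in range(50000):
    x = rng.choice(2, p=P[x])            # Markovian, non-i.i.d. sample
    gamma = 0.5 / np.sqrt(t + 10.0)      # gamma_t = a0 / sqrt(t + a1)
    theta -= gamma * (theta - m[x])      # SA step with the sampled gradient

# The iterate approaches the minimizer despite the correlated samples.
assert np.linalg.norm(theta - target) < 0.3
```

Even though consecutive samples are correlated, the iterate approaches the minimizer of the $\mu$-averaged cost, which is exactly the behavior that the mean-field analysis of this paper formalizes.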
Clearly, H3 is satisfied with $c_0 = d_0 = 1$; H2, H4 depend on the Markov chain; H5, H6 are related to the smoothness of $V_i(\theta; x)$ w.r.t. $\theta$; H7 can be satisfied with more homogeneous objective functions; H8 bounds the noise in estimating the mean field by averaging the local stochastic updates.

In the second application, we consider the policy evaluation problem in a multi-agent reinforcement learning (MARL) setting. Our aim is to compute the value function under a policy $\pi$ for an unknown Markov decision process (MDP) using linear function approximation [25–27].

Consider the MDP at state $x$ with reward $\mathrm{R}(x)$. The agents only observe a local reward $\mathrm{R}_i(x)$ satisfying $\mathrm{R}(x) = \frac{1}{n}\sum_{i=1}^n \mathrm{R}_i(x)$. We aim at approximating the value function as $V^\pi(x) = \mathbb{E}[ \sum_{s=0}^\infty \gamma^s \mathrm{R}(x_s) \mid x_0 = x ] \approx \theta^\top \Phi(x)$, where $\theta \in \mathbb{R}^d$ is the function parameter and $\Phi(x)$ is a 'feature' vector. To find $\theta$, the decentralized TD(0) learning algorithm [26,27] deploys (2) with the following local stochastic update:
$$ H_i(\theta_i^{(t)}; x) = -\Phi(x)\big\{ \mathrm{R}_i(x) + ( \gamma \Phi(x') - \Phi(x) )^\top \theta_i^{(t)} \big\}, $$
where $x' \in \mathsf{X}$ denotes the next state drawn from the MDP when the current state is $x$. The term inside the curly brackets is the temporal difference error. We observe that the resultant algorithm is a linear DSA scheme.

To discuss the performance of TD(0), we take $V_i(\theta) = \| \theta - \theta^\star \|^2$, where $\theta^\star$ solves the Bellman equation:
$$ \mathbb{E}_\mu\big[ \Phi(x)\,( \Phi(x) - \gamma \Phi(x') )^\top \big]\, \theta^\star = \mathbb{E}_\mu\big[ \Phi(x)\,\mathrm{R}(x) \big]. $$
Most of our assumptions can be satisfied by the linear DSA. Using [28, Lemma 3 & 4], H3 is satisfied with $c_0 = 1 - \gamma$, and we can show that $d_0 = \mathbb{E}_\mu[ \| \Phi(x)( \gamma \Phi(x') - \Phi(x) )^\top \| ]$. H2, H4 are conditions on the Markov chain; H5–H7 can be satisfied with bounded $\Phi(x)$, $\mathrm{R}(x)$. Lastly, H8 can be relaxed in the analysis as the resultant DSA scheme is linear. In the interest of space, we leave the development of the latter case to a future work.
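A minimal sketch of the decentralized TD(0) scheme described above, on a small synthetic MDP with tabular features. The kernel, rewards, mixing matrix, and step sizes are our own illustrative choices; the update is written in the usual TD form $\theta_i \leftarrow \sum_j A_{ij}\theta_j + \gamma_t \Phi(x)\delta_i$, which matches (2) up to the sign convention for $H_i$:

```python
import numpy as np

# Synthetic 3-state MDP under a fixed policy (illustrative assumption):
# transition kernel P, discount g, and local rewards averaging to the
# global reward R.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
g = 0.9
R = np.array([1.0, 0.0, 2.0])          # global reward, R = (R_1 + R_2)/2
R_loc = np.stack([2.0 * R, 0.0 * R])   # the two agents' local rewards
Phi = np.eye(3)                        # tabular features => exact values
A = np.array([[0.5, 0.5],              # doubly stochastic mixing matrix
              [0.5, 0.5]])

V_true = np.linalg.solve(np.eye(3) - g * P, R)   # true value function

rng = np.random.default_rng(1)
theta = np.zeros((2, 3))               # one parameter vector per agent
x = 0
for t in range(100000):
    x_next = rng.choice(3, p=P[x])
    gamma = 5.0 / np.sqrt(t + 500.0)
    mix = A @ theta                    # average consensus step of (2)
    for i in range(2):
        # local temporal difference error using the local reward only
        delta = R_loc[i, x] + (g * Phi[x_next] - Phi[x]) @ theta[i]
        mix[i] += gamma * Phi[x] * delta
    theta, x = mix, x_next

assert np.linalg.norm(theta[0] - theta[1]) < 0.2            # near-consensus
assert np.linalg.norm(theta.mean(axis=0) - V_true) < 2.0    # near true values
```

Both agents approximately agree and their average approximately solves the Bellman equation for the global reward, even though each agent only ever observes its own local reward.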
For a general smooth cost function $V(\theta)$, we consider a random terminating time $\tau(T)$ such that
$$ \Pr( \tau(T) = t ) = \gamma_{t+1} \Big/ \sum_{s=0}^T \gamma_{s+1}, \quad t = 0, \dots, T, $$
where $\tau(T) \in \{0, \dots, T\}$ is selected independently and $T$ is the maximum number of iterations. This is a common stopping criterion proposed in [21]. For DSA, it can be decided by the agents with a simple consensus protocol before the iterations. Our main result is summarized as:

Theorem 1. Under H1–H8, suppose the step size satisfies
$$ \sup_{t\ge 0} \gamma_t \le \min\Big\{ 1,\ \frac{\bar\rho}{2\sigma_o},\ \frac{c_0}{2\tilde{C}_{\rm mk}} \Big\} \tag{13} $$
and there exists $\hat{a}$ such that $0 \le \gamma_t - \gamma_{t+1} \le \hat{a}\gamma_t^2$. For any $T \ge 0$, it holds that
$$ \mathbb{E}\big[ \| h(\tilde\theta_c^{(\tau(T))}) \|^2 \big] \le \frac{ C_{\rm tot} }{ (c_0/2)\sum_{t=0}^T \gamma_{t+1} }, \qquad \max_{i=1,\dots,n} \mathbb{E}\big[ \| \theta_i^{(\tau(T))} - \tilde\theta_c^{(\tau(T))} \| \big] \le \frac{ (2/c_0)\, C_{\rm tot} + (3\sigma_o/\bar\rho) \sum_{t=0}^T \gamma_{t+1}^2 }{ \sum_{t=0}^T \gamma_{t+1} }, \tag{14} $$
where we have defined $\tilde\theta_c^{(t)} := \frac{1}{n}\sum_{i=1}^n \theta_i^{(t)}$,
$$ C_{\rm tot} := V\Big( \frac{1}{n}\sum_{i=1}^n \theta_i^{(0)} \Big) - V^\star + C_{{\rm mk},1} + \bar{C}_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2, $$
and the constants $\tilde{C}_{\rm mk}, C_{{\rm mk},1}, \bar{C}_{{\rm mk},2}$ will be specified in Section 3.1. The expectation is taken w.r.t. the terminating iteration $\tau(T)$ and the Markovian randomness.

For the best convergence rate, we may take $\gamma_t = a_0/\sqrt{t + a_1}$ for some $a_0, a_1 > 0$. In this case, the theorem shows that the squared norm of the mean field and the consensus error converge at the rate of $\mathcal{O}(\log T/\sqrt{T})$. By H3 and (14), we have that $\mathbb{E}[ \| \nabla V(\tilde\theta_c^{(\tau(T))}) \|^2 ]$ converges at $\mathcal{O}(\log T/\sqrt{T})$, i.e., the DSA scheme finds a stationary point of problem (1). Note that this is a standard rate for non-convex stochastic optimization [6]. Compared to existing works, our convergence rate is similar to that of a centralized SA scheme, e.g., [12], and it strengthens that of [13] for DSA with Markov noise, as we provide a convergence rate for exact consensus.

As will be derived later, the constant $C_{\rm tot}$ is proportional to $\mathcal{O}(\bar\rho^{-1})$, i.e., related to the spectral gap of the weighted adjacency matrix [cf. H1], and to the magnitude $\sup_{x,\theta} \| \hat{H}_i(\theta; x) \|$ in H4. In the case of a uniformly ergodic MC, the latter is in the order of $\mathcal{O}(\frac{1}{1-\lambda})$, so that it is related to the mixing time of the Markov chain. Our bound also highlights the dependence on the initialization $V( \sum_{i=1}^n \theta_i^{(0)}/n )$.

Instead of analyzing the convergence of the DSA scheme with a single potential function, in the analysis that follows, we adopt a divide-and-conquer approach similar to [2,10,11], where we first decompose the DSA iterates $\theta_1^{(t)}, \dots, \theta_n^{(t)}$ into their consensual part and consensus error. By observing the individual update formulas, we bound the consensus error separately, as the latter depends on the stationarity of the averaged iterate. Subsequently, the consensual part can be analyzed using a similar technique as for a centralized SA scheme. Due to space limitations, we only provide the analysis for general nonlinear DSA under H1–H8.

Define the following $nd$-dimensional vectors
$$ \boldsymbol\theta^{(t)} := \begin{pmatrix} \theta_1^{(t)} \\ \vdots \\ \theta_n^{(t)} \end{pmatrix}, \qquad \boldsymbol{H}(\boldsymbol\theta^{(t)}; x) := \begin{pmatrix} H_1(\theta_1^{(t)}; x) \\ \vdots \\ H_n(\theta_n^{(t)}; x) \end{pmatrix} $$
as the collection of local solutions and stochastic updates, respectively.
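Before analyzing the stacked recursion, it is instructive to check H1 and the consensus decomposition numerically. The sketch below uses hypothetical Metropolis-style weights on a 4-agent ring (an illustrative choice, not from the paper) and verifies the factorization $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top = UU^\top$, the exactness of the decomposition in (16)–(17), and the contraction of the consensus error under one mixing step:

```python
import numpy as np

n, d = 4, 3
# Metropolis-style weights on a 4-agent ring with self-loops: symmetric
# and doubly stochastic, an illustrative choice satisfying H1.
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
assert np.allclose(A, A.T) and np.allclose(A @ np.ones(n), np.ones(n))

# Factorize the projection I - (1/n) 1 1^T = U U^T with U^T U = I_{n-1}.
Proj = np.eye(n) - np.ones((n, n)) / n
eigval, eigvec = np.linalg.eigh(Proj)
U = eigvec[:, eigval > 0.5]        # the n-1 eigenvectors for eigenvalue 1
assert np.allclose(U @ U.T, Proj) and np.allclose(U.T @ U, np.eye(n - 1))

# Condition 3 of H1: spectral norm of U^T A U is bounded away from 1.
rho_bar = 1.0 - np.linalg.norm(U.T @ A @ U, 2)
assert 0 < rho_bar <= 1

# The decomposition of a stacked iterate theta in R^{nd} is exact.
rng = np.random.default_rng(0)
theta = rng.standard_normal(n * d)
Id = np.eye(d)
Ec = np.kron(np.ones((1, n)) / n, Id)   # consensual component map
Uo = np.kron(U.T, Id)                   # consensus error map
theta_c, theta_o = Ec @ theta, Uo @ theta
recon = np.kron(np.ones((n, 1)), Id) @ theta_c + np.kron(U, Id) @ theta_o
assert np.allclose(recon, theta)

# One mixing step (A kron I_d) preserves the average and contracts the
# consensus error by at least a factor (1 - rho_bar).
theta_next = np.kron(A, Id) @ theta
assert np.allclose(Ec @ theta_next, theta_c)
assert np.linalg.norm(Uo @ theta_next) <= (1 - rho_bar) * np.linalg.norm(theta_o) + 1e-9
```

This is exactly the splitting that lets the consensual part and the consensus error be analyzed as two coupled SA schemes in the sequel.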
We rewrite the DSA recursion (2) as:
$$ \boldsymbol\theta^{(t+1)} = \big( A \otimes I_d \big)\, \boldsymbol\theta^{(t)} - \gamma_{t+1}\, \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}), \tag{15} $$
where $\otimes$ denotes the Kronecker product.

Consider the projection matrix $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ onto the subspace orthogonal to ${\rm span}\{\mathbf{1}_n\}$. As ${\rm rank}( I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top ) = n - 1$, it admits the factorization $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top = UU^\top$, where $U \in \mathbb{R}^{n \times (n-1)}$ satisfies $U^\top U = I_{n-1}$. We let
$$ \tilde\theta_c^{(t)} := \big( \tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big)\, \boldsymbol\theta^{(t)}, \qquad \tilde\theta_o^{(t)} := \big( U^\top \otimes I_d \big)\, \boldsymbol\theta^{(t)}, \tag{16} $$
such that $\tilde\theta_c^{(t)} \in \mathbb{R}^d$ and $\tilde\theta_o^{(t)} \in \mathbb{R}^{(n-1)d}$ denote the consensual component and the consensus error of $\boldsymbol\theta^{(t)}$, respectively. Moreover,
$$ \boldsymbol\theta^{(t)} = ( \mathbf{1} \otimes I_d )\, \tilde\theta_c^{(t)} + ( U \otimes I_d )\, \tilde\theta_o^{(t)}. \tag{17} $$
Using (15), (16), the recursions of the two components in (17) can be described as
$$ \tilde\theta_c^{(t+1)} \overset{(a)}{=} \tilde\theta_c^{(t)} - \gamma_{t+1} \big( \tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}), \qquad \tilde\theta_o^{(t+1)} \overset{(b)}{=} \big( U^\top A U \otimes I_d \big)\, \tilde\theta_o^{(t)} - \gamma_{t+1} \big( U^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}), \tag{18} $$
where (a) used $\mathbf{1}^\top A = \mathbf{1}^\top$; (b) used $U^\top A \mathbf{1} = U^\top \mathbf{1} = \mathbf{0}$. The recursions in (18) are coupled through the local solutions $\boldsymbol\theta^{(t)}$ in the stacked stochastic update. In particular, they allow us to handle the convergence of the respective components as two SA schemes.

Our next step is to derive an intermediate bound on the consensus error $\tilde\theta_o^{(t+1)}$. A key observation is as follows:

Lemma 1.
Assume H1, H7 and that the step size satisfies $\gamma_t \le \frac{\bar\rho}{2\sigma_o}$. If $\theta_i^{(0)} = \theta_j^{(0)}$ for all $i,j$, then it holds for any $t \ge 0$ that
$$ \| \tilde\theta_o^{(t+1)} \| \le \sigma_o \sum_{s=0}^t \gamma_{s+1} \Big( 1 - \frac{\bar\rho}{2} \Big)^{t-s} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\}. $$

The above lemma shows that the consensus error can be upper bounded by the convolution between an exponential term $(1 - \bar\rho/2)^{t-s}$ and the norm of the mean field weighted by the step size, $\gamma_{s+1}\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \}$. Importantly, $\| \tilde\theta_o^{(t)} \|$ decays to zero at the rate of $\mathcal{O}(\gamma_{t+1})$ provided that $\| h(\tilde\theta_c^{(t)}) \| \to 0$.

Next, we consider the recursion of the consensual component $\tilde\theta_c^{(t)}$. We observe that
$$ \big( \tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) = h(\tilde\theta_c^{(t)}) + \Big[ \frac{1}{n}\sum_{i=1}^n H_i(\tilde\theta_c^{(t)}; X_{t+1}) - h(\tilde\theta_c^{(t)}) \Big] + \frac{1}{n}\sum_{i=1}^n \big\{ H_i(\theta_i^{(t)}; X_{t+1}) - H_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\}. \tag{19} $$
Denote
$$ e_0^{(t)} := \frac{1}{n}\sum_{i=1}^n H_i(\tilde\theta_c^{(t)}; X_{t+1}) - h(\tilde\theta_c^{(t)}), \qquad e_1^{(t)} := \frac{1}{n}\sum_{i=1}^n \big\{ H_i(\theta_i^{(t)}; X_{t+1}) - H_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\}. $$
The recursion of $\tilde\theta_c^{(t)}$ then follows that of a perturbed SA scheme:
$$ \tilde\theta_c^{(t+1)} = \tilde\theta_c^{(t)} - \gamma_{t+1} \big\{ h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \big\}, \tag{20} $$
where $e_0^{(t)}$ is a perturbation due to the random sample $X_{t+1}$ in estimating the mean field, and $e_1^{(t)}$ is bounded by the consensus error.

Our idea is to proceed in a similar fashion as in [12]. Observe the following lemma:

Lemma 2.
Assume H1, H3, H5, H6, H7, H8 and that $\gamma_t \le \min\{ \frac{\bar\rho}{2\sigma_o}, 1 \}$. For any $T \ge 0$, and with $E := 12\sigma_o L_h^2/(\bar\rho n)$, it holds that
$$ \sum_{t=0}^T \gamma_{t+1} \Big( c_0 - \gamma_{t+1} \Big\{ E + \frac{d_0^2}{2} + L_V \Big\} \Big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + \big\{ 2\sigma_h^2 L_V + E \big\} \sum_{t=0}^T \gamma_{t+1}^2 - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle. $$

From the above, we observe that by setting a sufficiently small $\gamma_{t+1}$, it is possible to lower bound the l.h.s. by $(c_0/2) \sum_{t=0}^T \gamma_{t+1} \| h(\tilde\theta_c^{(t)}) \|^2$. Now, if the r.h.s. is finite, the convergence of $\mathbb{E}[ \| h(\tilde\theta_c^{(\tau(T))}) \|^2 ]$ can be guaranteed.

Our remaining task is to upper bound the inner product $| \mathbb{E}[ \sum_{t=0}^T \gamma_{t+1} \langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \rangle ] |$. Notice that in the case when the $X_{t+1}$ are drawn i.i.d. from the distribution $\mu$, this inner product is zero. Our result below shows that even though the $X_{t+1}$ are not i.i.d., the inner product can still be controlled with an appropriate step size.

Lemma 3.
Assume H3–H8. Let $| \gamma_t - \gamma_{t+1} | \le \hat{a}\gamma_t^2$ for some constant $\hat{a}$, and let the step size satisfy $\gamma_t \le \min\{ 1, \frac{\bar\rho}{2\sigma_o} \}$. For any $T \ge 0$, it holds that
$$ \Big| \mathbb{E}\Big[ \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle \Big] \Big| \le C_{{\rm mk},1} + C_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2 + C_{{\rm mk},3} \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2. $$
Here, $C_{{\rm mk},1}, C_{{\rm mk},2}, C_{{\rm mk},3}$ are technical constants which will be defined in (35).

The above lemma gives a compatible bound on the desired inner product under the scenario of Markovian noise. Substituting Lemma 3 into the conclusion of Lemma 2 and rearranging terms yield
$$ \sum_{t=0}^T \gamma_{t+1} \big( c_0 - \gamma_{t+1} \tilde{C}_{\rm mk} \big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + C_{{\rm mk},1} + \bar{C}_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2 =: C_{\rm tot}, $$
where $\tilde{C}_{\rm mk} := C_{{\rm mk},3} + E + \frac{d_0^2}{2} + L_V$ and $\bar{C}_{{\rm mk},2} := C_{{\rm mk},2} + 2\sigma_h^2 L_V + E$. We also denote the quantity on the r.h.s. by $C_{\rm tot}$. If we select the step size according to (13), then $c_0 - \gamma_{t+1}\tilde{C}_{\rm mk} \ge c_0/2$ and
$$ \mathbb{E}\big[ \| h(\tilde\theta_c^{(\tau(T))}) \|^2 \big] = \frac{ \mathbb{E}\big[ \sum_{t=0}^T \gamma_{t+1} \| h(\tilde\theta_c^{(t)}) \|^2 \big] }{ \sum_{t=0}^T \gamma_{t+1} } \le \frac{ C_{\rm tot} }{ (c_0/2) \sum_{t=0}^T \gamma_{t+1} }. $$
This concludes our result for the convergence of the mean field. Furthermore, note that this implies that the norm of the gradient of $V(\theta)$ converges [cf. H3].

Finally, we bound the consensus error. Again, we invoke Lemma 1 and observe that
$$ \sum_{t=0}^T \gamma_{t+1} \| \tilde\theta_o^{(t)} \| \le \sigma_o \sum_{t=0}^T \gamma_{t+1} \sum_{s=0}^{t-1} \gamma_{s+1} \Big( 1 - \frac{\bar\rho}{2} \Big)^{t-1-s} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\} \overset{(a)}{\le} \sigma_o \sum_{s=0}^{T-1} \gamma_{s+1}^2 \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\} \sum_{t=s+1}^T \Big( 1 - \frac{\bar\rho}{2} \Big)^{t-1-s} \overset{(b)}{\le} \frac{3\sigma_o}{\bar\rho} \sum_{s=0}^{T-1} \gamma_{s+1}^2 + \sum_{s=0}^{T-1} \gamma_{s+1} \| h(\tilde\theta_c^{(s)}) \|^2, $$
where (a) involved a change of the order of summation and $\gamma_{t+1} \le \gamma_{s+1}$ as the step size is nonincreasing; (b) involved $\sum_{t \ge s+1} (1 - \bar\rho/2)^{t-1-s} \le 2/\bar\rho$ and the condition $\gamma_{s+1} \le \bar\rho/(2\sigma_o)$. Finally, evaluating the expectation and applying the bound on $\mathbb{E}[ \sum_t \gamma_{t+1} \| h(\tilde\theta_c^{(t)}) \|^2 ]$ show that
$$ \mathbb{E}\big[ \| \tilde\theta_o^{(\tau(T))} \| \big] \le \frac{ (3\sigma_o/\bar\rho) \sum_{s=0}^{T-1} \gamma_{s+1}^2 + (2/c_0)\, C_{\rm tot} }{ \sum_{t=0}^T \gamma_{t+1} }. \tag{21} $$
The above concludes the proof of Theorem 1.

Extension to Time-varying Graph
Our analysis can be extended to scenarios where the communication graph is time-varying. Let $G^{(t)} = (V, E^{(t)})$ be a simple, undirected graph which is potentially not connected, where $E^{(t)} \subseteq E$, and the graph is associated with a weighted adjacency matrix $A^{(t)}$. We replace $A$ by $A^{(t)}$ in the DSA scheme (2) at iteration $t$, and H1 is replaced with the following assumption:

H9. For any $t \ge 0$, the matrix $A^{(t)} \in \mathbb{R}^{n\times n}_+$ satisfies:
1. $A^{(t)}_{ij} = 0$ whenever $(i,j) \notin E^{(t)}$.
2. $A^{(t)} = (A^{(t)})^\top$ and $A^{(t)}\mathbf{1} = \mathbf{1}$.
3. $\exists\, B \ge 1$ such that $\| U^\top A^{(t+B-1)} \cdots A^{(t)} U \| \le 1 - \bar\rho$, where $\bar\rho \in (0,1]$.

The last condition can be guaranteed under the 'bounded communication' setting [1], i.e., when the combined graph (
$V, E^{(t)} \cup \cdots \cup E^{(t+B-1)}$) is connected for any $t \ge 0$. As $A^{(t)}$ remains doubly stochastic, the decomposition in (18) is still valid. We can then extend Lemma 1 to bound the consensus error using a blocking argument; see the discussion in Appendix A. The proof of Theorem 1 can be modified accordingly, and we obtain the same convergence rate for the time-varying graph setting.

In this paper, we have studied the convergence of a biased decentralized stochastic approximation (DSA) scheme. The scheme is a multi-agent optimization algorithm relying on biased, stochastic updates that approximate the gradient of a smooth cost function. Here, the biasedness stems from taking Markov samples and quasi-gradients in the updates. We prove that DSA finds a consensual and stationary point of the cost function at a rate of $\mathcal{O}(\log T/\sqrt{T})$, where $T$ is the maximum iteration number. Future works include extending to asynchronous and gradient tracking DSA, state-controlled Markov chains, etc.

A Proof of Lemma 1 & Its Extension
From the recursion (18), we observe that
$$ \| \tilde\theta_o^{(t+1)} \| \le \| ( U^\top A U \otimes I )\, \tilde\theta_o^{(t)} \| + \gamma_{t+1} \| ( U^\top \otimes I )\, \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) \|. \tag{22} $$
Using H1, we observe the contraction
$$ \| ( U^\top A U \otimes I )\, \tilde\theta_o^{(t)} \| \le \| U^\top A U \otimes I \|\, \| \tilde\theta_o^{(t)} \| \le ( 1 - \bar\rho ) \| \tilde\theta_o^{(t)} \|. \tag{23} $$
Using H7, we bound the second term in (22) as:
$$ \| ( U^\top \otimes I )\, \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) \| \le \Big\| \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) - \big( \mathbf{1}\tfrac{1}{n}\mathbf{1}^\top \otimes I_d \big) \boldsymbol{H}(\boldsymbol\theta^{(t)}; X_{t+1}) \Big\| \le \Big( \sum_{i=1}^n \sigma_o^2 \Big\{ \frac{1}{n} + \frac{2}{n} \| h(\tilde\theta_c^{(t)}) \| + \| \theta_i^{(t)} - \tilde\theta_c^{(t)} \| \Big\}^2 \Big)^{1/2}, $$
which can be further simplified as $\sigma_o \big\{ 1 + 2\| h(\tilde\theta_c^{(t)}) \| + \| \tilde\theta_o^{(t)} \| \big\}$. Substituting into (22) yields
$$ \| \tilde\theta_o^{(t+1)} \| \le ( 1 - \bar\rho + \gamma_{t+1}\sigma_o ) \| \tilde\theta_o^{(t)} \| + \gamma_{t+1}\sigma_o \big\{ 1 + 2\| h(\tilde\theta_c^{(t)}) \| \big\}. \tag{24} $$
Setting $\gamma_{t+1} \le \frac{\bar\rho}{2\sigma_o}$ yields $1 - \bar\rho + \gamma_{t+1}\sigma_o \le 1 - \frac{\bar\rho}{2}$. Solving the recursion and noticing that $\tilde\theta_o^{(0)} = \mathbf{0}$ yield the desired bound.

Extension to Time-varying Topology
Under the relaxed condition H9, we apply a blocking argument to derive the result in Lemma 1. In particular, denoting $\Theta(m, n) := \| \tilde\theta_o^{(m)} \| + \cdots + \| \tilde\theta_o^{(n)} \|$, we can show:
$$ \Theta( t+1, t+B ) \le ( 1 - \bar\rho )\, \Theta( t-B+1, t ) + \sigma_o \gamma_{t-B+2} \big\{ \Theta( t-B+1, t ) + \cdots + \Theta( t, t+B-1 ) \big\} + \sigma_o B \sum_{s=t-B+1}^{t+B-1} \gamma_{s+1} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\}, $$
which implies that
$$ \Theta( t+1, t+B ) \le \frac{ 1 - \bar\rho + \sigma_o B \gamma_{t-B+2} }{ 1 - \sigma_o B \gamma_{t-B+2} }\, \Theta( t-B+1, t ) + \frac{ \sigma_o B }{ 1 - \sigma_o B \gamma_{t-B+2} } \sum_{s=t-B+1}^{t+B-1} \gamma_{s+1} \big\{ 1 + 2\| h(\tilde\theta_c^{(s)}) \| \big\}. $$
Setting a sufficiently small step size $\gamma_t$ allows us to derive a similar recursion as (24) for $\Theta( t+1, t+B )$. Solving it yields a convolution bound as in Lemma 1.

B Proof of Lemma 2
Using the $L_V$-smoothness of $V(\cdot)$ [cf. H6], we observe that
$$ V(\tilde\theta_c^{(t+1)}) \le V(\tilde\theta_c^{(t)}) - \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \big\rangle + \frac{\gamma_{t+1}^2 L_V}{2} \| h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \|^2 $$
$$ \le V(\tilde\theta_c^{(t)}) - \gamma_{t+1} \big( c_0 - \gamma_{t+1} L_V \big) \| h(\tilde\theta_c^{(t)}) \|^2 + \gamma_{t+1}^2 L_V \| e_0^{(t)} + e_1^{(t)} \|^2 - \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} + e_1^{(t)} \big\rangle, $$
where the second inequality used H3 and $\| a + b \|^2 \le 2\| a \|^2 + 2\| b \|^2$. Summing from $t = 0$ to $t = T$ yields
$$ \sum_{t=0}^T \gamma_{t+1} \big( c_0 - \gamma_{t+1} L_V \big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + 2 L_V \sum_{t=0}^T \gamma_{t+1}^2 \big\{ \| e_0^{(t)} \|^2 + \| e_1^{(t)} \|^2 \big\} - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} + e_1^{(t)} \big\rangle. $$
By H5, we observe
$$ \| e_1^{(t)} \| \le \frac{L_h}{n} \sum_{i=1}^n \| \theta_i^{(t)} - \tilde\theta_c^{(t)} \| \le \frac{L_h}{\sqrt{n}} \| \tilde\theta_o^{(t)} \|, \tag{25} $$
where we used $\| \boldsymbol\theta^{(t)} - ( \mathbf{1} \otimes I_d )\tilde\theta_c^{(t)} \| = \| ( U \otimes I_d )\tilde\theta_o^{(t)} \| = \| \tilde\theta_o^{(t)} \|$. Also, applying H8 (so that $\| e_0^{(t)} \| \le \sigma_h$) and rearranging terms show that
$$ \sum_{t=0}^T \gamma_{t+1} \big( c_0 - \gamma_{t+1} L_V \big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} + e_1^{(t)} \big\rangle + 2 L_V \sum_{t=0}^T \gamma_{t+1}^2 \Big\{ \sigma_h^2 + \frac{L_h^2}{n} \| \tilde\theta_o^{(t)} \|^2 \Big\}. $$
Moreover, using H3 and Young's inequality, we observe
$$ -\gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_1^{(t)} \big\rangle \le \frac{\gamma_{t+1}^2 d_0^2}{2} \| h(\tilde\theta_c^{(t)}) \|^2 + \frac{1}{2} \| e_1^{(t)} \|^2. $$
Rearranging terms again and using $\gamma_t \le 1$ yield
$$ \sum_{t=0}^T \gamma_{t+1} \Big( c_0 - \gamma_{t+1} \Big\{ \frac{d_0^2}{2} + L_V \Big\} \Big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + 2\sigma_h^2 L_V \sum_{t=0}^T \gamma_{t+1}^2 + \sum_{t=0}^T \frac{3 L_h^2}{n} \| \tilde\theta_o^{(t)} \|^2 - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle. $$
Next, we need to upper bound $\sum_{t=0}^T \| \tilde\theta_o^{(t)} \|^2$. With Lemma 1 and Lemma 4 (applied with $\rho = \bar\rho/2$), we obtain
$$ \sum_{t=0}^T \| \tilde\theta_o^{(t)} \|^2 \le \frac{8\sigma_o^2}{\bar\rho^2} \sum_{t=0}^T \gamma_{t+1}^2 \big\{ 1 + 2\| h(\tilde\theta_c^{(t)}) \| \big\}^2. \tag{26} $$
Define the constant $E := 12\sigma_o L_h^2/(\bar\rho n)$. Substituting (26) into the previous inequality and using the step size conditions $\gamma_{t+1}\sigma_o \le \bar\rho/2$ and $\gamma_t \le 1$, we obtain
$$ \sum_{t=0}^T \gamma_{t+1} \Big( c_0 - \gamma_{t+1} \Big\{ E + \frac{d_0^2}{2} + L_V \Big\} \Big) \| h(\tilde\theta_c^{(t)}) \|^2 \le V(\tilde\theta_c^{(0)}) - V^\star + \big\{ 2\sigma_h^2 L_V + E \big\} \sum_{t=0}^T \gamma_{t+1}^2 - \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle. $$
This is the desired bound for the lemma.
C Proof of Lemma 3
We begin the proof by using the solution to the Poisson equation defined in H4. We have
$$ e_0^{(t)} = \frac{1}{n}\sum_{i=1}^n \big\{ H_i(\tilde\theta_c^{(t)}; X_{t+1}) - h_i(\tilde\theta_c^{(t)}) \big\} = \frac{1}{n}\sum_{i=1}^n \big\{ \hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) - P\hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\}. $$
Consequently,
$$ \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle \equiv \frac{1}{n}\sum_{i=1}^n \big\{ A_1^i + A_2^i + A_3^i + A_4^i \big\}, $$
where
$$ A_1^i := \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid \hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) - P\hat{H}_i(\tilde\theta_c^{(t)}; X_t) \big\rangle, $$
$$ A_2^i := \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid P\hat{H}_i(\tilde\theta_c^{(t)}; X_t) - P\hat{H}_i(\tilde\theta_c^{(t-1)}; X_t) \big\rangle, $$
$$ A_3^i := \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid P\hat{H}_i(\tilde\theta_c^{(t-1)}; X_t) \big\rangle - \sum_{t=0}^{T-1} \gamma_{t+2} \big\langle \nabla V(\tilde\theta_c^{(t+1)}) \mid P\hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\rangle, $$
$$ A_4^i := \sum_{t=0}^{T-1} \big\langle \gamma_{t+2} \nabla V(\tilde\theta_c^{(t+1)}) - \gamma_{t+1} \nabla V(\tilde\theta_c^{(t)}) \mid P\hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) \big\rangle - \gamma_{T+1} \big\langle \nabla V(\tilde\theta_c^{(T)}) \mid P\hat{H}_i(\tilde\theta_c^{(T)}; X_{T+1}) \big\rangle. $$
We have set $\tilde\theta_c^{(-1)} = \tilde\theta_c^{(0)}$ as a convention in the above. Next, we upper bound the above terms as follows.

Firstly, due to the martingale property $\mathbb{E}[ \langle \nabla V(\tilde\theta_c^{(t)}) \mid \hat{H}_i(\tilde\theta_c^{(t)}; X_{t+1}) - P\hat{H}_i(\tilde\theta_c^{(t)}; X_t) \rangle \mid \mathcal{F}_t ] = 0$, we have
$$ \mathbb{E}[ A_1^i ] = 0, \quad \forall\, i. \tag{27} $$
Secondly, note that H5 implies that $P\hat{H}_i(\theta; x)$ is $\bar{L}_h$-Lipschitz w.r.t. $\theta$, for some constant $\bar{L}_h$ [29]. As such,
$$ A_2^i \le \sum_{t=0}^T \gamma_{t+1} \bar{L}_h \| \nabla V(\tilde\theta_c^{(t)}) \| \| \tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} \| \le \sum_{t=0}^T \gamma_{t+1} d_0 \bar{L}_h \| h(\tilde\theta_c^{(t)}) \| \| \tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} \|. \tag{28} $$
Taking the summation over $i$ and dividing by $n$ yield
$$ \frac{1}{n}\sum_{i=1}^n A_2^i \le \sum_{t=0}^T \gamma_{t+1} d_0 \bar{L}_h \| h(\tilde\theta_c^{(t)}) \| \| \tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} \|. $$
Notice that $\tilde\theta_c^{(t)} - \tilde\theta_c^{(t-1)} = -\gamma_t \{ h(\tilde\theta_c^{(t-1)}) + e_0^{(t-1)} + e_1^{(t-1)} \}$. We observe that
$$ \| e_0^{(t-1)} \| \le \sigma_h, \qquad \| e_1^{(t-1)} \| \le \frac{L_h}{\sqrt{n}} \| \tilde\theta_o^{(t-1)} \|. \tag{29} $$
As such,
$$ \frac{1}{d_0 \bar{L}_h} \cdot \frac{1}{n}\sum_{i=1}^n A_2^i \le \frac{L_h}{\sqrt{n}} \sum_{t=0}^T \gamma_{t+1}\gamma_t \| h(\tilde\theta_c^{(t)}) \| \| \tilde\theta_o^{(t-1)} \| + \sum_{t=0}^T \gamma_{t+1}\gamma_t \| h(\tilde\theta_c^{(t)}) \| \big\{ \sigma_h + \| h(\tilde\theta_c^{(t-1)}) \| \big\} $$
$$ \le \Big( \frac{L_h}{\sqrt{n}} + 2 \Big) \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2 + \frac{L_h}{\sqrt{n}} \sum_{t=0}^T \| \tilde\theta_o^{(t-1)} \|^2 + \sigma_h^2 \sum_{t=0}^T \gamma_t^2 \le \Big( \frac{L_h}{\sqrt{n}} + 2 + \frac{4\sigma_o}{\bar\rho} \Big) \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2 + \Big( \sigma_h^2 + \frac{4\sigma_o}{\bar\rho} \Big) \sum_{t=0}^T \gamma_{t+1}^2. \tag{30} $$
To analyze the last two terms $A_3^i, A_4^i$, we denote
$$ \mathsf{E} := \frac{1}{n} \big( \mathbf{1}^\top \otimes I_d \big), \qquad \boldsymbol\theta_c^{(t)} := ( \mathbf{1} \otimes I_d )\, \tilde\theta_c^{(t)}, \tag{31} $$
such that $P\hat{\boldsymbol H}(\boldsymbol\theta_c; x) = ( P\hat{H}_1(\tilde\theta_c; x); \cdots; P\hat{H}_n(\tilde\theta_c; x) )$.

Thirdly, we observe from [29, Lemma 4.2] that, under H2, H4, H8, it can be shown for any $\boldsymbol\theta_c = ( \mathbf{1} \otimes I_d )\tilde\theta_c$ with $\tilde\theta_c \in \mathbb{R}^d$, and $x \in \mathsf{X}$, that:
$$ \| P\hat{\boldsymbol H}(\boldsymbol\theta_c; x) \| \le K_P. \tag{32} $$
Here, $K_P$ depends on the mixing time of the Markov chain, e.g., it is proportional to $\frac{1}{1-\lambda}$ under the uniform ergodicity condition (7). Therefore,
$$ \frac{1}{n}\sum_{i=1}^n A_3^i = \gamma_1 \big\langle \nabla V(\tilde\theta_c^{(0)}) \mid \mathsf{E}\, P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(0)}; X_0) \big\rangle \le \gamma_1 K_P \| \nabla V(\tilde\theta_c^{(0)}) \|. \tag{33} $$
Fourthly, using $| \gamma_{t+2} - \gamma_{t+1} | \le \hat{a}\gamma_{t+1}^2$, we have
$$ \frac{1}{n}\sum_{i=1}^n A_4^i \le \hat{a} d_0 \sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \| \| \mathsf{E} P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(t)}; X_{t+1}) \| + L_V \sum_{t=0}^{T-1} \gamma_{t+1} \| \tilde\theta_c^{(t+1)} - \tilde\theta_c^{(t)} \| \| \mathsf{E} P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(t)}; X_{t+1}) \| + \gamma_{T+1} \| \nabla V(\tilde\theta_c^{(T)}) \| \| \mathsf{E} P\hat{\boldsymbol H}(\boldsymbol\theta_c^{(T)}; X_{T+1}) \| $$
$$ \le K_P \sum_{t=0}^{T-1} \big\{ \hat{a} d_0 \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \| + L_V \gamma_{t+1} \| \tilde\theta_c^{(t+1)} - \tilde\theta_c^{(t)} \| \big\} + \gamma_{T+1} K_P \| \nabla V(\tilde\theta_c^{(T)}) \|. $$
To bound $\frac{1}{n}\sum_i A_4^i$, we observe that $\sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \| \le \frac{1}{2} \sum_{t=0}^{T-1} \gamma_{t+1}^2 \{ 1 + \| h(\tilde\theta_c^{(t+1)}) \|^2 \}$; the latter can be further bounded as $\sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t+1)}) \|^2 \le a_1 \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2$, where $a_1$ is a constant such that $\gamma_t \le a_1 \gamma_{t+1}$ (which holds under the step size condition). Moreover,
$$ \sum_{t=0}^{T-1} \gamma_{t+1} \| \tilde\theta_c^{(t+1)} - \tilde\theta_c^{(t)} \| = \sum_{t=0}^{T-1} \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) + e_0^{(t)} + e_1^{(t)} \| \le \sum_{t=0}^{T-1} \gamma_{t+1}^2 \Big\{ \sigma_h + \| h(\tilde\theta_c^{(t)}) \| + \frac{L_h}{\sqrt{n}} \| \tilde\theta_o^{(t)} \| \Big\} $$
$$ \le \sum_{t=0}^{T-1} \gamma_{t+1}^2 \Big\{ \frac{ \bar\rho n ( 1 + 2\sigma_h ) + L_h + 4 L_h \sigma_o }{ 2\bar\rho n } + \frac{ \bar\rho n + 4 L_h \sigma_o }{ 2\bar\rho n } \| h(\tilde\theta_c^{(t)}) \|^2 \Big\}. $$
We also observe the crude upper bound
$$ \gamma_{T+1} \| \nabla V(\tilde\theta_c^{(T)}) \| \le d_0 \gamma_{T+1} \| h(\tilde\theta_c^{(T)}) \| \le \frac{d_0}{2} \big( 1 + \gamma_{T+1}^2 \| h(\tilde\theta_c^{(T)}) \|^2 \big) \le \frac{d_0}{2} \Big( 1 + \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2 \Big). \tag{34} $$
Define the constants
$$ C_{{\rm mk},1} := K_P \Big\{ \frac{d_0}{2} + \gamma_1 \| \nabla V(\tilde\theta_c^{(0)}) \| \Big\}, \qquad C_{{\rm mk},2} := K_P L_V\, \frac{ \bar\rho n ( 1 + 2\sigma_h ) + L_h ( 1 + 4\sigma_o ) }{ 2\bar\rho n } + d_0 \bar{L}_h \Big( \sigma_h^2 + \frac{4\sigma_o}{\bar\rho} \Big) + \frac{ K_P \hat{a} d_0 }{2}, $$
$$ C_{{\rm mk},3} := d_0 \bar{L}_h \Big( \frac{L_h}{\sqrt{n}} + 2 + \frac{4\sigma_o}{\bar\rho} \Big) + K_P \Big( \hat{a} a_1 d_0 + L_V\, \frac{ \bar\rho n + 4 L_h \sigma_o }{ 2\bar\rho n } \Big) + \frac{ K_P d_0 }{2}. \tag{35} $$
Combining the terms yields
$$ \Big| \mathbb{E}\Big[ \sum_{t=0}^T \gamma_{t+1} \big\langle \nabla V(\tilde\theta_c^{(t)}) \mid e_0^{(t)} \big\rangle \Big] \Big| \le C_{{\rm mk},1} + C_{{\rm mk},2} \sum_{t=0}^T \gamma_{t+1}^2 + C_{{\rm mk},3} \sum_{t=0}^T \gamma_{t+1}^2 \| h(\tilde\theta_c^{(t)}) \|^2. $$
This is the desired result for the lemma.
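For a finite state space, the Poisson equation (6) that underpins this appendix can be solved in closed form via the fundamental matrix $Z = ( I - P + \mathbf{1}\mu^\top )^{-1}$; the 3-state kernel below is an illustrative assumption, not from the paper. The sketch verifies identity (6) coordinate-wise:

```python
import numpy as np

# Illustrative 3-state Markov kernel (rows sum to 1); it is irreducible
# and aperiodic, so a unique stationary distribution exists (H2).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Stationary distribution: the left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu /= mu.sum()

# A scalar "update" H evaluated at each state, and its mean field h.
H = np.array([1.0, -2.0, 0.5])
h = mu @ H

# Poisson solution via the fundamental matrix Z = (I - P + 1 mu^T)^{-1}:
# H_hat = Z (H - h 1) satisfies H_hat - P H_hat = H - h 1, i.e., (6).
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), mu))
H_hat = Z @ (H - h)

assert np.allclose(H_hat - P @ H_hat, H - h)   # identity (6) holds
assert abs(mu @ H_hat) < 1e-10                 # normalization mu^T H_hat = 0
```

The identity $\hat{H} - P\hat{H} = H - h$ is what allows the Markovian error $e_0^{(t)}$ to be split into the martingale term $A_1^i$ and the remainder terms $A_2^i$–$A_4^i$ above.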
D Auxiliary Lemma
Lemma 4.
Let $\{a_s\}_{s \ge 0}$ be an arbitrary sequence of non-negative numbers and let $\rho \in (0,1]$ be a constant. For any $T \ge 0$, we have
$$ \sum_{t=0}^T \Big( \sum_{s=0}^t a_s ( 1 - \rho )^{t-s} \Big)^2 \le \frac{2}{\rho^2} \sum_{t=0}^T a_t^2. \tag{36} $$

Proof.
We begin by expanding the square on the l.h.s. of (36) and observing the following upper bound:
$$ \sum_{t=0}^T \sum_{s=0}^t \sum_{q=0}^t a_s a_q ( 1 - \rho )^{2t-q-s} \le \sum_{s=0}^T \sum_{q=0}^s a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=s}^T ( 1 - \rho )^{2t} + \sum_{q=0}^T \sum_{s=0}^q a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=q}^T ( 1 - \rho )^{2t}. \tag{37} $$
As $\sum_{t=s}^T ( 1 - \rho )^{2t} \le ( 1 - \rho )^{2s}/\rho$, we have
$$ \sum_{s=0}^T \sum_{q=0}^s a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=s}^T ( 1 - \rho )^{2t} \le \frac{1}{\rho} \sum_{s=0}^T \sum_{q=0}^s \frac{ a_s^2 + a_q^2 }{2} ( 1 - \rho )^{s-q}. \tag{38} $$
Observe that
$$ \sum_{s=0}^T a_s^2 \sum_{q=0}^s ( 1 - \rho )^{s-q} = \sum_{s=0}^T a_s^2 \sum_{q=0}^s ( 1 - \rho )^{q} \le \frac{1}{\rho} \sum_{s=0}^T a_s^2, \qquad \sum_{s=0}^T \sum_{q=0}^s a_q^2 ( 1 - \rho )^{s-q} = \sum_{q=0}^T a_q^2 \sum_{s=q}^T ( 1 - \rho )^{s-q} \le \frac{1}{\rho} \sum_{s=0}^T a_s^2. $$
Hence,
$$ \sum_{s=0}^T \sum_{q=0}^s a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=s}^T ( 1 - \rho )^{2t} \le \frac{1}{\rho^2} \sum_{s=0}^T a_s^2. $$
By symmetry, we have $\sum_{q=0}^T \sum_{s=0}^q a_s a_q ( 1 - \rho )^{-q-s} \sum_{t=q}^T ( 1 - \rho )^{2t} \le \frac{1}{\rho^2} \sum_{s=0}^T a_s^2$. Adding these two bounds yields the desired result in (36).

References

[1] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization,"
IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[2] P. Bianchi, G. Fort, and W. Hachem, "Performance of a distributed stochastic approximation algorithm," IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7405–7418, 2013.
[3] P. Di Lorenzo and G. Scutari, "Next: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
[4] S. Pu and A. Nedić, "A distributed stochastic gradient tracking method," Mathematical Programming, 2020.
[5] R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, "Distributed stochastic optimization with gradient tracking over strongly-connected networks," arXiv:1903.07266, 2019.
[6] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, "Distributed learning in the non-convex world: From batch to streaming data, and beyond," IEEE Signal Processing Magazine, 2020.
[7] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," in NeurIPS, pp. 5330–5340, 2017.
[8] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D²: Decentralized training over decentralized data," in ICML, pp. 4848–4856, 2018.
[9] S. Lu, X. Zhang, H. Sun, and M. Hong, "GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization," in IEEE DSW, pp. 315–321, 2019.
[10] A. S. Mathkar and V. S. Borkar, "Nonlinear gossip," SIAM Journal on Control and Optimization, vol. 54, no. 3, pp. 1535–1557, 2016.
[11] S. Vlaski and A. H. Sayed, "Distributed learning in non-convex environments – Part I: Agreement at a linear rate," arXiv preprint arXiv:1907.01848, 2019.
[12] B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai, "Non-asymptotic analysis of biased stochastic approximation scheme," in COLT, 2019.
[13] T. Sun, T. Chen, Y. Sun, Q. Liao, and D. Li, "Decentralized Markov chain gradient descent," arXiv:1909.10238, 2019.
[14] B. Kumar, V. Borkar, and A. Shetty, "Non-asymptotic error bounds for constant stepsize stochastic approximation for tracking mobile agents," Math. of Control, Signals, and Systems, vol. 31, no. 4, pp. 589–614, 2019.
[15] S. Chen, A. M. Devraj, A. Bušić, and S. Meyn, "Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation," arXiv preprint arXiv:2002.02584, 2020.
[16] T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg, "Finite-time analysis of stochastic gradient descent under Markov randomness," arXiv preprint arXiv:2003.10973, 2020.
[17] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
[18] A. M. Devraj, A. Bušić, and S. Meyn, "Optimal matrix momentum stochastic approximation and applications to Q-learning," arXiv preprint arXiv:1809.06277, 2018.
[19] R. Douc, E. Moulines, P. Priouret, and P. Soulier, Markov Chains. Springer, 2018.
[20] P. W. Glynn and S. P. Meyn, "A Liapounov bound for solutions of the Poisson equation," The Annals of Probability, pp. 916–931, 1996.
[21] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[22] T. Sun, Y. Sun, and W. Yin, "On Markov chain gradient descent," in NeurIPS, pp. 9896–9905, 2018.
[23] R. Srikant and L. Ying, "Finite-time error bounds for linear stochastic approximation and TD learning," in COLT, pp. 2803–2830, 2019.
[24] J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan, "Ergodic mirror descent," SIAM Journal on Optimization, vol. 22, no. 4, pp. 1549–1578, 2012.
[25] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, "Multi-agent reinforcement learning via double averaging primal-dual optimization," in NeurIPS, pp. 9649–9660, 2018.
[26] J. Sun, G. Wang, G. B. Giannakis, Q. Yang, and Z. Yang, "Finite-sample analysis of decentralized temporal-difference learning with linear function approximation," arXiv:1911.00934, 2019.
[27] T. T. Doan, S. T. Maguluri, and J. Romberg, "Finite-time performance of distributed temporal difference learning with linear function approximation," arXiv preprint arXiv:1907.12530, 2019.
[28] J. Bhandari, D. Russo, and R. Singal, "A finite time analysis of temporal difference learning with linear function approximation," arXiv preprint arXiv:1806.02450, 2018.
[29] G. Fort, E. Moulines, P. Priouret, et al., "Convergence of adaptive and interacting Markov chain Monte Carlo algorithms,"