Adaptation and learning over networks under subspace constraints -- Part II: Performance Analysis
Roula Nassif†, Member, IEEE, Stefan Vlaski†,‡, Student Member, IEEE, Ali H. Sayed†, Fellow, IEEE
† Institute of Electrical Engineering, EPFL, Switzerland
‡ Electrical Engineering Department, UCLA, USA
roula.nassif@epfl.ch, stefan.vlaski@epfl.ch, ali.sayed@epfl.ch
Abstract
Part I of this paper considered optimization problems over networks where agents have individual objectives to meet, or individual parameter vectors to estimate, subject to subspace constraints that require the objectives across the network to lie in low-dimensional subspaces. Starting from the centralized projected gradient descent, an iterative and distributed solution was proposed that responds to streaming data and employs stochastic approximations in place of actual gradient vectors, which are generally unavailable. We examined the second-order stability of the learning algorithm and we showed that, for small step-sizes $\mu$, the proposed strategy leads to small estimation errors on the order of $\mu$. This Part II examines steady-state performance. The results reveal explicitly the influence of the gradient noise, data characteristics, and subspace constraints on the network performance. The results also show that, in the small step-size regime, the iterates generated by the distributed algorithm achieve the centralized steady-state performance.

Index Terms
Distributed optimization, subspace projection, gradient noise, steady-state performance.
This work was supported in part by NSF grant CCF-1524250. A short conference version of this work appears in [1].

I. INTRODUCTION
As pointed out in Part I [2] of this work, most prior literature on distributed inference over networks focuses on consensus problems, where agents with separate objective functions need to agree on a common parameter vector corresponding to the minimizer of the aggregate sum of individual costs [3]–[12]. In this paper, and its accompanying Part I [2], we focus instead on multitask networks where the agents need to estimate and track multiple objectives simultaneously [13]–[20]. Based on the type of prior information that may be available about how the tasks are related to each other, multitask algorithms can be derived by translating the prior information into constraints on the parameter vectors to be inferred.

In this paper, and the accompanying Part I [2], we consider multitask inference problems where each agent seeks to minimize an individual cost, and where the collection of parameter vectors to be estimated across the network is required to lie in a low-dimensional subspace. That is, we let $w_k \in \mathbb{C}^{M_k}$ denote some parameter vector at node $k$ and let $\mathcal{W} = \mathrm{col}\{w_1, \ldots, w_N\}$ denote the collection of parameter vectors from across the network ($N$ is the number of agents in the network). We associate with each agent $k$ a differentiable convex cost $J_k(w_k): \mathbb{C}^{M_k} \rightarrow \mathbb{R}$, which is expressed as the expectation of some loss function $Q_k(\cdot)$ and written as $J_k(w_k) = \mathbb{E}\, Q_k(w_k; x_k)$, where $x_k$ denotes the random data. The expectation is computed over the distribution of the data. Let $M = \sum_{k=1}^{N} M_k$. We consider constrained problems of the form:
$$\mathcal{W}^o = \arg\min_{\mathcal{W}} \; J^{\mathrm{glob}}(\mathcal{W}) \triangleq \sum_{k=1}^{N} J_k(w_k), \quad \text{subject to } \mathcal{W} \in \mathcal{R}(\mathcal{U}), \qquad (1)$$
where $\mathcal{R}(\cdot)$ denotes the range space operator, and $\mathcal{U}$ is an $M \times P$ full-column rank matrix with $P \ll M$. Each agent $k$ is interested in estimating the $k$-th $M_k \times 1$ subvector $w_k^o$ of $\mathcal{W}^o = \mathrm{col}\{w_1^o, \ldots, w_N^o\}$.

In order to solve problem (1), we proposed in Part I [2] the following adaptive and distributed strategy:
$$\psi_{k,i} = w_{k,i-1} - \mu\, \widehat{\nabla_{w_k^*} J_k}(w_{k,i-1}), \qquad w_{k,i} = \sum_{\ell \in \mathcal{N}_k} A_{k\ell}\, \psi_{\ell,i}, \qquad (2)$$
where $\mu > 0$ is a small step-size parameter, $\psi_{k,i}$ is an intermediate estimate, $w_{k,i}$ is the estimate of $w_k^o$ at agent $k$ and iteration $i$, $\mathcal{N}_k$ denotes the neighborhood of agent $k$, and $\nabla_{w_k^*} J_k(\cdot)$ is the (Wirtinger) complex gradient [4, Appendix A] of $J_k(\cdot)$ relative to $w_k^*$ (complex conjugate of $w_k$). Notice that approximate gradient vectors $\widehat{\nabla_{w_k^*} J_k}(\cdot)$ are employed in (2) instead of true gradient vectors $\nabla_{w_k^*} J_k(\cdot)$ since we are interested in solving (1) in the stochastic setting when the distribution of the data $x_k$ is unknown. A common construction in stochastic approximation theory is to employ the following approximation at iteration $i$:
$$\widehat{\nabla_{w_k^*} J_k}(w_k) = \nabla_{w_k^*} Q_k(w_k; x_{k,i}), \qquad (3)$$
where $x_{k,i}$ represents the data observed at iteration $i$. The difference between the true gradient and its approximation is called the gradient noise $s_{k,i}(\cdot)$:
$$s_{k,i}(w) \triangleq \nabla_{w_k^*} J_k(w) - \widehat{\nabla_{w_k^*} J_k}(w). \qquad (4)$$
This noise will seep into the operation of the algorithm and one main challenge is to show that, despite its presence, agent $k$ is still able to approach $w_k^o$ asymptotically. The matrix $A_{k\ell}$ appearing in (2) is of size $M_k \times M_\ell$. It multiplies the intermediate estimate $\psi_{\ell,i}$ arriving from neighboring agent $\ell$ to agent $k$. Let $\mathcal{A} \triangleq [A_{k\ell}] \in \mathbb{C}^{M \times M}$ denote the matrix that collects all these blocks.
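As an illustration of the recursion in (2), the following minimal NumPy sketch runs the adaptation and combination steps for mean-square-error losses $Q_k(w;(u,d)) = \frac{1}{2}(d - u^{\top}w)^2$ in the real-data case. The names `data_stream`, `A_blocks`, and `neighbors` are illustrative placeholders for the streaming data source, the designer-supplied blocks $\{A_{k\ell}\}$, and the neighborhoods $\{\mathcal{N}_k\}$; this is a sketch of the strategy under these assumptions, not a reference implementation.

```python
import numpy as np

def run_distributed_strategy(A_blocks, neighbors, data_stream, w_init, mu, num_iter):
    """Sketch of strategy (2): a local stochastic-gradient (adaptation) step followed
    by a combination step with the blocks {A_kl} over each neighborhood N_k.
    data_stream(k, i) is assumed to return the regression pair (u, d) observed by
    agent k at time i; the instantaneous gradient (3) of the MSE loss is -u*(d - u@w)."""
    N = len(w_init)
    w = [wk.copy() for wk in w_init]
    for i in range(num_iter):
        psi = []
        for k in range(N):                        # adaptation step of (2)
            u, d = data_stream(k, i)
            grad_hat = -u * (d - u @ w[k])        # instantaneous gradient approximation (3)
            psi.append(w[k] - mu * grad_hat)
        # combination step of (2): w_k = sum over l in N_k of A_kl @ psi_l
        w = [sum(A_blocks[(k, l)] @ psi[l] for l in neighbors[k]) for k in range(N)]
    return w
```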
This $N \times N$ block matrix is chosen by the designer to satisfy the following two conditions:
$$\lim_{i \rightarrow \infty} \mathcal{A}^i = P_{\mathcal{U}}, \qquad (5)$$
$$A_{k\ell} = [\mathcal{A}]_{k\ell} = 0, \quad \text{if } \ell \notin \mathcal{N}_k \text{ and } k \neq \ell, \qquad (6)$$
where $[\mathcal{A}]_{k\ell}$ denotes the $(k,\ell)$-th block of $\mathcal{A}$ and $P_{\mathcal{U}}$ is the projector onto the $P$-dimensional subspace of $\mathbb{C}^M$ spanned by the columns of $\mathcal{U}$:
$$P_{\mathcal{U}} = \mathcal{U}(\mathcal{U}^*\mathcal{U})^{-1}\mathcal{U}^*. \qquad (7)$$
The sparsity condition (6) characterizes the network topology and ensures local exchange of information at each time instant $i$. It is shown in Part I [2] that the matrix equation (5) holds if, and only if, the following conditions on the projector $P_{\mathcal{U}}$ and the matrix $\mathcal{A}$ are satisfied:
$$\mathcal{A}\,\mathcal{U} = \mathcal{U}, \qquad (8)$$
$$\mathcal{U}^*\mathcal{A} = \mathcal{U}^*, \qquad (9)$$
$$\rho(\mathcal{A} - P_{\mathcal{U}}) < 1, \qquad (10)$$
where $\rho(\cdot)$ denotes the spectral radius of its matrix argument. Conditions (8) and (9) state that the $P$ columns of $\mathcal{U}$ are right and left eigenvectors of $\mathcal{A}$ associated with the eigenvalue $1$. Together with these two conditions, condition (10) means that $\mathcal{A}$ has $P$ eigenvalues at one, and that all other eigenvalues are strictly less than one in magnitude. Combining conditions (8)-(10) with the sparsity condition (6), the design of a matrix $\mathcal{A}$ to run (2) can be written as the following feasibility problem:
$$\text{find } \mathcal{A} \quad \text{such that} \quad \mathcal{A}\,\mathcal{U} = \mathcal{U}, \quad \mathcal{U}^*\mathcal{A} = \mathcal{U}^*, \quad \rho(\mathcal{A} - P_{\mathcal{U}}) < 1, \quad [\mathcal{A}]_{k\ell} = 0 \text{ if } \ell \notin \mathcal{N}_k \text{ and } \ell \neq k. \qquad (11)$$
Not all network topologies satisfying (6) guarantee the existence of an $\mathcal{A}$ satisfying condition (5). The higher the dimension of the signal subspace, the greater the graph connectivity has to be. In the works [1], [20], it is assumed that the sparsity constraints (6) and the signal subspace lead to a feasible problem; that is, it is assumed that problem (11) admits at least one solution. When this assumption is violated, one remedy is to increase the network connectivity by increasing the transmit power of each node, i.e., by adding more links [20]. In Section III of this part, we shall relax the feasibility assumption by considering the problem of finding an $\mathcal{A}$ that minimizes the number of edges to be added to the original topology while satisfying the constraints (8), (9), and (10). In this case, if the original topology leads to a feasible solution, then no links will be added. Otherwise, we assume that the designer is able to add some links to make the problem feasible. When studying the performance of algorithm (2) relative to $\mathcal{W}^o$, we assume that a feasible $\mathcal{A}$ (topology) is computed by the designer and that its blocks $\{A_{k\ell}\}_{\ell\in\mathcal{N}_k}$ are provided to agent $k$ in order to run (2).
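Conditions (8)-(10) are straightforward to verify numerically for a candidate combination matrix. The following small NumPy helper is an illustrative sketch of such a check (the function name and tolerance are placeholders and not part of the paper):

```python
import numpy as np

def check_combination_matrix(A, U, tol=1e-8):
    """Numerically checks conditions (8)-(10) for a candidate combination matrix A:
    A U = U, U^* A = U^*, and rho(A - P_U) < 1, with P_U the projector in (7)."""
    P_U = U @ np.linalg.inv(U.conj().T @ U) @ U.conj().T               # projector (7)
    checks = {
        "A U = U": np.allclose(A @ U, U, atol=tol),                     # condition (8)
        "U* A = U*": np.allclose(U.conj().T @ A, U.conj().T, atol=tol), # condition (9)
        "rho(A - P_U) < 1": np.max(np.abs(np.linalg.eigvals(A - P_U))) < 1.0,  # condition (10)
    }
    return P_U, checks
```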
We carried out in Part I [2] a detailed stability analysis of the proposed strategy (2). We showed that, despite the gradient noise, the distributed strategy (2) is able to converge in the mean-square-error sense to within $O(\mu)$ of the solution of the constrained problem (1), for sufficiently small step-sizes $\mu$. We particularly established that, for each agent $k$, the error variance relative to $w_k^o$ enters a bounded region whose size is on the order of $\mu$, namely, $\limsup_{i\rightarrow\infty} \mathbb{E}\|w_k^o - w_{k,i}\|^2 = O(\mu)$. In Section II of this Part II, we will assess the size of this mean-square error by deriving a closed-form expression for the network mean-square deviation (MSD) defined by [4]:
$$\mathrm{MSD} \triangleq \mu \lim_{\mu \rightarrow 0}\left( \limsup_{i \rightarrow \infty} \frac{1}{\mu}\, \mathbb{E}\left( \frac{1}{N} \|\mathcal{W}^o - \mathcal{W}_i\|^2 \right) \right), \qquad (12)$$
where $\mathcal{W}_i \triangleq \mathrm{col}\{w_{k,i}\}_{k=1}^{N}$. In other words, we will assess the size of the constant multiplying $\mu$ in the $O(\mu)$-term. This closed-form expression will reveal explicitly the influence of the data characteristics (captured by the second-order properties of the costs and the second-order moments of the gradient noises) and of the subspace constraints (captured by $\mathcal{U}$) on the network performance. In this way, we will be able to conclude that distributed strategies of the form (2) with small step-sizes are able to deliver reliable performance even in the presence of gradient noise. We will also be able to conclude that the iterates generated by the distributed implementation achieve the centralized steady-state performance. In particular, we compare the performance of strategy (2) to the following centralized stochastic gradient projection algorithm [21]:
$$\mathcal{W}^c_i = P_{\mathcal{U}}\left( \mathcal{W}^c_{i-1} - \mu\, \mathrm{col}\left\{ \widehat{\nabla_{w_k^*} J_k}(w^c_{k,i-1}) \right\}_{k=1}^{N} \right), \quad i \ge 0, \qquad (13)$$
where $\mathcal{W}^c_i = \mathrm{col}\{w^c_{1,i}, \ldots, w^c_{N,i}\}$ is the estimate of $\mathcal{W}^o$ at iteration $i$. Observe that each agent at each iteration needs to send its data to a fusion center, which performs the projection in (13), and then sends the resulting estimates $w^c_{k,i}$ back to the agents. Finally, simulations are provided in Section IV to verify the theoretical findings.
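For reference, a minimal sketch of the centralized recursion (13) in the real-data case, with the same illustrative MSE losses and `data_stream` placeholder used earlier, is given below; again, this is only an illustration of the update, not the authors' implementation.

```python
import numpy as np

def run_centralized_strategy(P_U, data_stream, w_init, mu, num_iter, sizes):
    """Sketch of the centralized recursion (13): stack the agents' iterates, take one
    stochastic-gradient step using all instantaneous gradients, then project onto
    R(U) through P_U.  sizes[k] is the length M_k of agent k's block."""
    offsets = np.cumsum([0] + list(sizes))
    w = np.concatenate(w_init).astype(float)
    for i in range(num_iter):
        grad = np.zeros_like(w)
        for k in range(len(sizes)):
            u, d = data_stream(k, i)
            wk = w[offsets[k]:offsets[k + 1]]
            grad[offsets[k]:offsets[k + 1]] = -u * (d - u @ wk)
        w = P_U @ (w - mu * grad)                 # projection step in (13)
    return w
```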
II. STOCHASTIC PERFORMANCE ANALYSIS

In Part I [2], we carried out a detailed stability analysis of the proposed strategy (2). We showed, under some assumptions on the risks $\{J_k(\cdot)\}$ and on the gradient noise processes $\{s_{k,i}(\cdot)\}$ defined by (4), that a network running strategy (2) with a matrix $\mathcal{A}$ satisfying conditions (6), (8), (9), and (10) is mean-square-error stable for sufficiently small step-sizes, namely, it holds that:
$$\limsup_{i\rightarrow\infty} \mathbb{E}\|w_k^o - w_{k,i}\|^2 = O(\mu), \quad k = 1, \ldots, N, \qquad (14)$$
for small enough $\mu$ (see [2, Theorem 1]). Expression (14) indicates that the mean-square error $\mathbb{E}\|\mathcal{W}^o - \mathcal{W}_i\|^2$ is on the order of $\mu$. In this section, we are interested in characterizing how close $\mathcal{W}_i$ gets to the network limit point $\mathcal{W}^o$. In particular, we will characterize the network mean-square deviation (MSD), defined by (12), in terms of the step-size $\mu$, the data-type variable $h$ defined in Table I, the second-order properties of the costs (captured by $\mathcal{H}^o$ defined below in (45)), the second-order moments of the gradient noises (captured by $\mathcal{S}$ defined below in (64)), and the subspace constraints (captured by $\mathcal{U}^e$ defined in Table I) as follows:
$$\mathrm{MSD} = \frac{\mu}{2hN}\, \mathrm{Tr}\left( \left((\mathcal{U}^e)^* \mathcal{H}^o\, \mathcal{U}^e\right)^{-1} \left((\mathcal{U}^e)^* \mathcal{S}\,\mathcal{U}^e\right) \right). \qquad (15)$$
As explained in Part I [2], in the general complex data case, extended vectors and matrices need to be introduced in order to analyze the network evolution. The arguments and results presented in this section are applicable to both cases of real and complex data through the use of the data-type variable $h$. Table I lists the variables and symbols that will be used in the sequel for both the real and complex data cases. The matrix $\mathcal{I}$ in Table I is a permutation matrix of $2N \times 2N$ blocks with $(m,n)$-th block given by:
$$[\mathcal{I}]_{mn} \triangleq \begin{cases} I_{M_k}, & \text{if } m = k, \; n = 2(k-1)+1, \\ I_{M_k}, & \text{if } m = k+N, \; n = 2k, \\ 0, & \text{otherwise}, \end{cases} \qquad (16)$$
for $m,n = 1, \ldots, 2N$ and $k = 1, \ldots, N$.

TABLE I
DEFINITION OF SOME VARIABLES USED THROUGHOUT THE ANALYSIS. $\mathcal{I}$ IS A PERMUTATION MATRIX DEFINED BY (16).

Variable | Real data case | Complex data case
Data-type variable $h$ | $1$ | $2$
Gradient vector | $\nabla_{w_k^{\top}} J_k(w_k)$ | $\nabla_{w_k^*} J_k(w_k)$
Error vector $\widetilde{w}^e_{k,i}$ | $\widetilde{w}_{k,i}$ from (39) | $\mathrm{col}\{\widetilde{w}_{k,i}, (\widetilde{w}^*_{k,i})^{\top}\}$
Gradient noise $s^e_{k,i}(w)$ | $s_{k,i}(w)$ from (4) | $\mathrm{col}\{s_{k,i}(w), (s^*_{k,i}(w))^{\top}\}$
Bias vector $b^e_k$ | $b_k$ from (40) | $\mathrm{col}\{b_k, (b^*_k)^{\top}\}$
$(k,\ell)$-th block of $\mathcal{A}^e$ | $A_{k\ell}$ | $\mathrm{diag}\{A_{k\ell}, (A^*_{k\ell})^{\top}\}$
Matrix $\mathcal{U}^e$ | $\mathcal{U}$ | $\mathcal{I}^{\top}\,\mathrm{diag}\{\mathcal{U}, (\mathcal{U}^*)^{\top}\}$
Matrix $\mathcal{J}^e_{\epsilon}$ | $\mathcal{J}_{\epsilon}$ from (28) | $\mathrm{diag}\{\mathcal{J}_{\epsilon}, (\mathcal{J}^*_{\epsilon})^{\top}\}$
Matrix $\mathcal{V}^e_{R,\epsilon}$ | $\mathcal{V}_{R,\epsilon}$ from (28) | $\mathcal{I}^{\top}\,\mathrm{diag}\{\mathcal{V}_{R,\epsilon}, (\mathcal{V}^*_{R,\epsilon})^{\top}\}$
Matrix $(\mathcal{V}^e_{L,\epsilon})^*$ | $\mathcal{V}^*_{L,\epsilon}$ from (28) | $\mathrm{diag}\{\mathcal{V}^*_{L,\epsilon}, \mathcal{V}^{\top}_{L,\epsilon}\}\,\mathcal{I}$
Noise covariance $R^o_k$ | $R_{s,k} = R_{q,k}$ from (51), (52) | $\begin{bmatrix} R_{s,k} & R_{q,k} \\ R^*_{q,k} & R^{\top}_{s,k} \end{bmatrix}$

A. Modeling assumptions from Part I [2]
In this section, we recall the assumptions used in Part I [2] to establish the network mean-square-error stability (14). We first introduce the Hermitian Hessian matrix functions (see [2, Sec. II-A]):
$$H_k(w_k) \triangleq \nabla^2_{w_k} J_k(w_k), \quad (hM_k \times hM_k), \qquad (17)$$
$$\mathcal{H}(\mathcal{W}) \triangleq \mathrm{diag}\{H_k(w_k)\}_{k=1}^{N}, \quad (hM \times hM). \qquad (18)$$

Assumption 1. (Conditions on aggregate and individual costs). The individual costs $J_k(w_k) \in \mathbb{R}$ are assumed to be twice differentiable and convex such that:
$$\frac{\nu_k}{h}\, I_{hM_k} \le H_k(w_k) \le \frac{\delta_k}{h}\, I_{hM_k}, \qquad (19)$$
where $\nu_k \ge 0$ for $k = 1, \ldots, N$. It is further assumed that, for any $\mathcal{W}$, $\mathcal{H}(\mathcal{W})$ satisfies:
$$0 < \frac{\nu}{h}\, I_{hP} \le (\mathcal{U}^e)^* \mathcal{H}(\mathcal{W})\, \mathcal{U}^e \le \frac{\delta}{h}\, I_{hP}, \qquad (20)$$
for some positive parameters $\nu \le \delta$. The data-type variable $h$ and the matrix $\mathcal{U}^e$ are defined in Table I.

As explained in [2], condition (20) ensures that problem (1) has a unique minimizer $\mathcal{W}^o$.

Assumption 2. (Conditions on gradient noise).
The gradient noise process defined in (4) satisfies, for any $w \in \mathcal{F}_{i-1}$ and for all $k, \ell = 1, \ldots, N$:
$$\mathbb{E}[\,s_{k,i}(w) \,|\, \mathcal{F}_{i-1}] = 0, \qquad (21)$$
$$\mathbb{E}[\,s_{k,i}(w)\, s^*_{\ell,i}(w) \,|\, \mathcal{F}_{i-1}] = 0, \quad k \neq \ell, \qquad (22)$$
$$\mathbb{E}[\,s_{k,i}(w)\, s^{\top}_{\ell,i}(w) \,|\, \mathcal{F}_{i-1}] = 0, \quad k \neq \ell, \qquad (23)$$
$$\mathbb{E}[\,\|s_{k,i}(w)\|^2 \,|\, \mathcal{F}_{i-1}] \le \frac{\beta_k^2}{h^2}\, \|w\|^2 + \sigma^2_{s,k}, \qquad (24)$$
for some $\beta_k^2 \ge 0$, $\sigma^2_{s,k} \ge 0$, and where $\mathcal{F}_{i-1}$ denotes the filtration generated by the random processes $\{w_{\ell,j}\}$ for all $\ell = 1, \ldots, N$ and $j \le i-1$.

Assumption 3. (Condition on $\mathcal{U}$). The full-column rank matrix $\mathcal{U}$ is assumed to be semi-unitary, i.e., its column vectors are orthonormal and $\mathcal{U}^*\mathcal{U} = I_P$.

Consider the $N \times N$ block matrix $\mathcal{A}^e$ whose $(k,\ell)$-th block is defined in Table I. This matrix will appear in our subsequent study. In [2, Lemma 2], we showed that this $hM \times hM$ matrix $\mathcal{A}^e$ admits a Jordan decomposition of the form:
$$\mathcal{A}^e \triangleq \mathcal{V}^e_{\epsilon}\, \Lambda^e_{\epsilon}\, (\mathcal{V}^e_{\epsilon})^{-1}, \qquad (25)$$
with
$$\Lambda^e_{\epsilon} = \mathrm{diag}\{I_{hP}, \mathcal{J}^e_{\epsilon}\}, \quad \mathcal{V}^e_{\epsilon} = \left[\, \mathcal{U}^e \;\; \mathcal{V}^e_{R,\epsilon} \,\right], \quad (\mathcal{V}^e_{\epsilon})^{-1} = \begin{bmatrix} (\mathcal{U}^e)^* \\ (\mathcal{V}^e_{L,\epsilon})^* \end{bmatrix}, \qquad (26)$$
where $\mathcal{U}^e$, $\mathcal{J}^e_{\epsilon}$, $\mathcal{V}^e_{R,\epsilon}$, and $(\mathcal{V}^e_{L,\epsilon})^*$ are defined in Table I, with the matrices $\mathcal{J}_{\epsilon}$, $\mathcal{V}_{R,\epsilon}$, and $\mathcal{V}^*_{L,\epsilon}$ originating from the eigen-structure of $\mathcal{A}$. Under Assumption 3, the $M \times M$ combination matrix $\mathcal{A}$ satisfying conditions (8), (9), and (10) admits a Jordan canonical decomposition of the form:
$$\mathcal{A} \triangleq \mathcal{V}_{\epsilon}\, \Lambda_{\epsilon}\, \mathcal{V}_{\epsilon}^{-1}, \qquad (27)$$
with:
$$\Lambda_{\epsilon} = \mathrm{diag}\{I_P, \mathcal{J}_{\epsilon}\}, \quad \mathcal{V}_{\epsilon} = \left[\, \mathcal{U} \;\; \mathcal{V}_{R,\epsilon} \,\right], \quad \mathcal{V}_{\epsilon}^{-1} = \begin{bmatrix} \mathcal{U}^* \\ \mathcal{V}^*_{L,\epsilon} \end{bmatrix}, \qquad (28)$$
where $\mathcal{J}_{\epsilon}$ is a Jordan matrix with the eigenvalues (which may be complex but have magnitude less than one) on the diagonal and $\epsilon > 0$ on the super-diagonal. The eigen-decomposition (25) will be useful for establishing the mean-square performance.

The results in Part I [2] established that the iterates $w_{k,i}$ converge in the mean-square-error sense to a small $O(\mu)$-neighborhood around the solution $w_k^o$. In this part, we will be more precise and determine the size of this neighborhood, i.e., assess the size of the constant multiplying $\mu$ in the $O(\mu)$-term. To do so, we shall derive an accurate first-order expression for the mean-square error (14); the expression will be accurate to first order in $\mu$. To arrive at the desired expression, we start by motivating a long-term model for the evolution of the network error vector after sufficient iterations have passed, i.e., for $i \gg 1$. It turns out that the performance expressions obtained from analyzing the long-term model provide accurate expressions for the performance of the original network model to first order in $\mu$. To derive the long-term model, we follow the approach developed in [4]. The first step is to establish the asymptotic stability of the fourth-order moment of the error vector, $\mathbb{E}\|w_k^o - w_{k,i}\|^4$. Under the same settings of Theorem 1 in [2], with the second-order moment condition (24) replaced by the fourth-order moment condition:
$$\mathbb{E}\left[\,\|s_{k,i}(w)\|^4 \,|\, \mathcal{F}_{i-1}\right] \le \frac{\beta_{4,k}^4}{h^4}\, \|w\|^4 + \sigma^4_{s4,k}, \qquad (29)$$
with $\beta_{4,k}^4 \ge 0$, $\sigma^4_{s4,k} \ge 0$, and using similar arguments as in [4, Theorem 9.2], we can show that the fourth-order moments of the network error vectors are stable for sufficiently small $\mu$, namely, it holds that (see Appendix F):
$$\limsup_{i\rightarrow\infty} \mathbb{E}\|w_k^o - w_{k,i}\|^4 = O(\mu^2), \quad k = 1, \ldots, N. \qquad (30)$$
As explained in [4], condition (29) implies (24). We analyze the long-term model under the same settings of Theorem 1 in [2] and the following smoothness assumption on the individual costs.

Assumption 4. (Smoothness condition on individual costs).
It is assumed that each J k ( w k ) satisfies the followingsmoothness condition close to the limit point w ok : (cid:107)∇ w k J k ( w ok + ∆ w k ) − ∇ w k J k ( w ok ) (cid:107) ≤ κ d (cid:107) ∆ w k (cid:107) , (31) for small perturbations (cid:107) ∆ w k (cid:107) and κ d ≥ . B. Long-term-error model
To introduce the long-term model, we reconsider the network error recursion from Part I [2], namely, (cid:101) W ei = B i − (cid:101) W ei − − µ A e s ei + µ A e b e (32)where: (cid:101) W ei (cid:44) col (cid:8) (cid:101) w e ,i , . . . , (cid:101) w eN,i (cid:9) , (33) H i − (cid:44) diag { H ,i − , . . . , H N,i − } , (34) B i − (cid:44) A e ( I hM − µ H i − ) , (35) s ei (cid:44) col (cid:8) s e ,i ( w ,i − ) , . . . , s eN,i ( w N,i − ) (cid:9) , (36) b e (cid:44) col { b e , . . . , b eN } , (37)where: H k,i − (cid:44) (cid:90) ∇ w k J k ( w ok − t (cid:101) w k,i − ) dt, (38)and (cid:101) w ek,i , s ek,i ( w k,i − ) , and b ek are defined in Table I with: (cid:101) w k,i (cid:44) w ok − w k,i , (39) b k (cid:44) ∇ w ∗ k J k ( w ok ) . (40)We rewrite (32) as: (cid:101) W ei = B (cid:101) W ei − − µ A e s ei + µ A e b e + µ A e c i − , (41)in terms of the constant matrix B and the random perturbation sequence c i − : B (cid:44) A e ( I hM − µ H o ) , (42) c i − (cid:44) (cid:101) H i − (cid:101) W ei − , (43)where H o and (cid:101) H i − are given by: (cid:101) H i − (cid:44) H o − H i − , (44) H o (cid:44) diag { H o , . . . , H oN } , (45)with each H ok given by the value of the Hessian matrix at the limit point, namely, H ok (cid:44) ∇ w k J k ( w ok ) . (46)By exploiting the smoothness condition (31), and following an argument similar to [4, pp. 554], we can show fromTheorem 1 in [2] that, for i (cid:29) , (cid:107) c i − (cid:107) = O ( µ ) with high probability. Motivated by this observation, we introducethe following approximate model, where the last term µ A e c i − that appears in (41), which is O ( µ ) , is removed: (cid:101) W e (cid:48) i = B (cid:101) W e (cid:48) i − − µ A e s ei ( W i − ) + µ A e b e , (47) for i (cid:29) . Obviously, the iterates { (cid:101) W e (cid:48) i } generated by (47) are generally different from the iterates generated by theoriginal recursion (32). To highlight this fact, we are using the prime notation for the state of the long-term model.Note that the driving process s ei ( W i − ) in (47) is the same gradient noise process from the original recursion (32).We start by showing that the mean-square difference between { (cid:101) W e (cid:48) i , (cid:101) W ei } is asymptotically bounded by O ( µ ) andthat the mean-square-error of the long term model (47) is within O ( µ ) from the one of the original recursion (32).Working with (47) is much more tractable for performance analysis because its dynamics is driven by the constantmatrix B as opposed to the random matrix B i − in (32). Therefore, we shall work with model (47) and evaluateits performance, which will provide an accurate representation for the performance of (2) to first order in µ . Theorem 1. (Size of approximation error).
Consider a network of N agents running the distributed strategy (2) with a matrix A satisfying conditions (8) , (9) , and (10) and U satisfying Assumption 3. Assume the individual costs, J k ( w k ) , satisfy the conditions in Assumptions 1 and 4. Assume further that the gradient noise processes satisfythe conditions in Assumption 2 with the second-order moment condition (24) replaced by the fourth-order momentcondition (29) . Then, it holds that, for sufficiently small step-sizes: lim sup i →∞ E (cid:107) (cid:101) W ei − (cid:101) W e (cid:48) i (cid:107) = O ( µ ) , (48) lim sup i →∞ E (cid:107) (cid:101) W ei (cid:107) = lim sup i →∞ E (cid:107) (cid:101) W e (cid:48) i (cid:107) + O ( µ / ) . (49) Proof.
See Appendix A.Using similar eigenvalue perturbation arguments as in [4, Theorem 9.3], we can show that, under the same settingsof Theorem 1, the constant matrix B defined by (42) is stable for sufficiently small step-sizes (see Appendix G). C. Mean-square-error performance
We showed in Theorem 1 in Part I [2] that a network running the distributed strategy (2) is mean-square stablefor sufficiently small µ . Particularly, we showed that lim sup i →∞ E (cid:107) w ok − w k,i (cid:107) = O ( µ ) . In this section, we assessthe size of the mean-square error by measuring the network MSD defined by (12).We refer to the individual gradient noise process in (4) and denote its conditional covariance matrix by: R es,k,i ( w ) (cid:44) E (cid:2) s ek,i ( w ) s e ∗ k,i ( w ) | F i − (cid:3) . (50)We assume that, in the limit, the following moment matrices tend to constant values when evaluated at w ok : R s,k (cid:44) lim i →∞ E (cid:2) s k,i ( w ok ) s ∗ k,i ( w ok ) | F i − (cid:3) , (51) R q,k (cid:44) lim i →∞ E (cid:104) s k,i ( w ok ) s (cid:62) k,i ( w ok ) | F i − (cid:105) . (52) Assumption 5. (Smoothness condition on noise covariance).
It is assumed that the conditional second-ordermoments of the individual noise processes satisfy the following smoothness condition, (cid:107) R es,k,i ( w ok + ∆ w k ) − R es,k,i ( w ok ) (cid:107) ≤ κ d (cid:107) ∆ w k (cid:107) γ , (53) for small perturbations (cid:107) ∆ w k (cid:107) , and for some constant κ d ≥ and exponent < γ ≤ . One useful conclusion that follows from (53) is that, for i (cid:29) and for sufficiently small step-size, we can express thecovariance matrix of s ek,i ( w ) in terms of the limiting matrix R ok defined in Table I as follows (see [4, Lemma 11.1]): E s ek,i ( w k,i − ) s e ∗ k,i ( w k,i − ) = R ok + O ( µ min { , γ } ) . (54)Before proceeding, we introduce the ( hM ) × ( hM ) matrix F that will play a critical role in characterizingthe performance: F = B (cid:62) ⊗ b B ∗ . (55)This matrix is defined in in terms of the block Kronecker operation. In the derivation that follows, we shall usethe block Kronecker product ⊗ b operator [22] and the block vectorization operator bvec ( · ) . As explained in [4],these operations preserve the locality of the blocks in the original matrix arguments. The matrix F will sometimesappear transformed under the similarity transformation: F (cid:44) (cid:16) ( V e(cid:15) ) (cid:62) ⊗ b ( V e(cid:15) ) ∗ (cid:17) F (cid:16) ( V e(cid:15) ) (cid:62) ⊗ b ( V e(cid:15) ) ∗ (cid:17) − . (56) Lemma 1. (Low-rank approximation).
Assume the matrix A satisfies conditions (8) , (9) , and (10) with U satisfyingAssumption 3. For sufficiently small step-sizes, it holds that F is stable and that: ( I − F ) − = O (1 /µ ) , (57) ( I − F ) − = O (1 /µ ) O (1) O (1) O (1) , (58) where the leading ( hP ) × ( hP ) block in ( I − F ) − is O (1 /µ ) . Moreover, we can also write: ( I − F ) − = (cid:16) [( U e ) ∗ ] (cid:62) ⊗ b U e (cid:17) Z − (cid:16) ( U e ) (cid:62) ⊗ b ( U e ) ∗ (cid:17) + O (1) , (59) in terms of the block Kronecker operation, where the matrix Z has dimension ( hP ) × ( hP ) : Z = ( I hP ⊗ D ∗ ) + ( D (cid:62) ⊗ I hP ) = O ( µ ) , (60) with D = µ ( U e ) ∗ H o U e which is positive definite under Assumption 1.Proof. See Appendix B. In our derivations, the block Kronecker product and the block vectorization operations are applied to × block matrices C = [ C k(cid:96) ] and D = [ D k(cid:96) ] with blocks { C , D } of size hP × hP , blocks { C , D } of size hP × h ( M − P ) , blocks { C , D } of size h ( M − P ) × hP ,and blocks { C , D } of size h ( M − P ) × h ( M − P ) . Theorem 2. (Mean-square-error performance).
Consider the same settings of Theorem 1. Assume further thatAssumption 5 holds. Let γ m (cid:44) min { , γ } > with γ ∈ (0 , from (53) . Then, it holds that: lim sup i →∞ hN E (cid:107) (cid:101) W ei (cid:107) = 1 hN ( bvec ( Y (cid:62) )) (cid:62) ( I − F ) − bvec ( I hM ) + O ( µ γ m ) , (61) = 1 hN Tr (cid:32) ∞ (cid:88) n =0 B n Y ( B ∗ ) n (cid:33) + O ( µ γ m ) , (62) where: Y = µ A e S ( A e ) ∗ , (63) S = diag { R o , R o , . . . , R oN } . (64) Furthermore, it holds that:
$$\mathrm{MSD} = \frac{\mu}{2hN}\, \mathrm{Tr}\left( \left((\mathcal{U}^e)^* \mathcal{H}^o\, \mathcal{U}^e\right)^{-1} \left((\mathcal{U}^e)^* \mathcal{S}\,\mathcal{U}^e\right) \right). \qquad (65)$$

Proof.
See Appendix C.

Since $(I - \mathcal{F})$ is of size $(hM)^2 \times (hM)^2$, the first term on the R.H.S. of expression (61) may be hard to evaluate for numerical reasons. In comparison, the first term in expression (62) only requires manipulations of matrices of size $hM \times hM$. In practice, a reasonable number of terms can be used instead of $n \rightarrow \infty$ to obtain an accurate evaluation.

Note that the MSD of the centralized solution (13) is also equal to (65), since the centralized implementation can be obtained from (2) by replacing $\mathcal{A}$ by $P_{\mathcal{U}}$ and by assuming a fully-connected network. We therefore conclude, for sufficiently small step-sizes (i.e., in the slow adaptation regime), that the distributed strategy (2) is able to attain the same MSD performance as the centralized solution.
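For the real-data case ($h = 1$, so $\mathcal{U}^e$ reduces to $\mathcal{U}$), expression (65) can be evaluated numerically once the limiting Hessians and noise covariances are available. The following NumPy sketch assumes $\mathcal{H}^o$ and $\mathcal{S}$ have already been assembled as dense block-diagonal matrices according to (45) and (64); the function name and argument layout are illustrative.

```python
import numpy as np

def theoretical_msd(U, H_o, S, mu, N, h=1):
    """Evaluates the steady-state MSD expression (65) in the real-data case
    (h = 1, where U^e reduces to U):
        MSD = (mu / (2 h N)) * Tr( (U^* H^o U)^{-1} (U^* S U) ).
    H_o = diag{H_1^o, ..., H_N^o} and S = diag{R_1^o, ..., R_N^o}, per (45) and (64)."""
    D = U.conj().T @ H_o @ U               # (U^e)^* H^o U^e
    G = U.conj().T @ S @ U                 # (U^e)^* S U^e
    return (mu / (2.0 * h * N)) * np.trace(np.linalg.solve(D, G)).real
```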
III. FINDING A COMBINATION MATRIX $\mathcal{A}$

In the following, we consider the problem of finding an $\mathcal{A}$ that minimizes the number of edges to be added to the original topology while satisfying the constraints (8), (9), and (10). That is, we consider the following optimization problem:
$$\begin{aligned} \underset{\mathcal{A}}{\text{minimize}} \quad & f(\mathcal{A}) = \sum_{k=1}^{N} \sum_{\ell \notin \mathcal{N}_k} |||A_{k\ell}|||_1 + \gamma \|\mathcal{A}\|_F^2, \\ \text{subject to} \quad & \mathcal{A}\,\mathcal{U} = \mathcal{U}, \quad \mathcal{A} = \mathcal{A}^*, \quad \rho(\mathcal{A} - P_{\mathcal{U}}) \le 1 - \epsilon, \end{aligned} \qquad (66)$$
where $|||A_{k\ell}|||_1 \triangleq \sum_{m=1}^{M_k}\sum_{n=1}^{M_\ell} |[A_{k\ell}]_{mn}| \in \mathbb{R}$, $\|\mathcal{A}\|_F = \sqrt{\mathrm{Tr}(\mathcal{A}^*\mathcal{A})} \in \mathbb{R}$ is the Frobenius norm of $\mathcal{A}$, $\gamma \ge 0$ is a regularization parameter, and $\epsilon \in (0,1)$ is a small positive number. In general, the spectral radius of a matrix is not a convex function over the matrix space. We therefore restrict our search to the class of Hermitian matrices, since their spectral radius coincides with their spectral norm (maximum singular value), which is a convex function. Problem (66) is convex since the objective is convex, the equality constraints are linear, and the inequality constraint function is convex [23]. The parameter $\epsilon$ controls the convergence rate of $\mathcal{A}^i$ towards the projector $P_{\mathcal{U}}$: small $\epsilon$ leads to slow convergence and larger $\epsilon$ gives faster convergence. The convex $\ell_1$-norm based function $\sum_{k=1}^{N}\sum_{\ell\notin\mathcal{N}_k} |||A_{k\ell}|||_1$ is used as a relaxation of the pseudo $\ell_0$-norm $h(\mathcal{A}) = \sum_{k=1}^{N} \mathrm{card}\{\ell \,|\, A_{k\ell} \neq 0, \, \ell \notin \mathcal{N}_k\}$, which is a non-convex function that leads to computational challenges. Among the potentially multiple feasible solutions, this term in the objective of (66) favors the solution that minimizes the number of edges to be added to the network topology in order to satisfy constraint (5). The quadratic term $\|\mathcal{A}\|_F^2 = \sum_{m=1}^{M}\sum_{n=1}^{M} |a_{mn}|^2$ in (66) makes the objective strictly convex, and therefore problem (66) has a unique minimum. Problem (66) can be solved using general-purpose convex optimization solvers such as CVX [24]. These solvers generally implement second-order methods that require the calculation of Hessian matrices; therefore, problems with more than a few thousand entries are probably beyond their capabilities. The Douglas-Rachford algorithm can also be employed to solve problem (66). As we shall see in the following, the required proximal operators for implementing this algorithm can be computed efficiently using closed-form expressions.

In the following, we shall assume that $\mathcal{U} \in \mathbb{R}^{M\times P}$ and, therefore, we shall solve (66) over real-valued matrices $\mathcal{A} \in \mathbb{R}^{M\times M}$. In order to solve the constrained problem (66), we apply the Douglas-Rachford splitting algorithm [25], which solves problems of the form:
$$\underset{x \in \mathbb{R}^N}{\text{minimize}} \quad g_1(x) + g_2(x), \qquad (67)$$
where $g_1(\cdot)$ and $g_2(\cdot)$ are functions in $\Gamma_0(\mathbb{R}^N)$ such that $(\mathrm{ri\,dom}\, g_1) \cap (\mathrm{ri\,dom}\, g_2) \neq \emptyset$ and $g_1(x) + g_2(x) \rightarrow +\infty$ as $\|x\| \rightarrow +\infty$. By selecting $g_1(\cdot)$ as $f(\cdot)$ in (66) and $g_2(\cdot)$ as the indicator function $\mathcal{I}_{\Omega}(\cdot)$ of the closed nonempty convex set:
$$\Omega = \{\mathcal{A} \,|\, \mathcal{A}\,\mathcal{U} = \mathcal{U}, \; \mathcal{A} = \mathcal{A}^{\top}, \; \|\mathcal{A} - P_{\mathcal{U}}\|_2 \le 1 - \epsilon\}, \qquad (68)$$
defined as:
$$\mathcal{I}_{\Omega}(\mathcal{A}) \triangleq \begin{cases} 0, & \text{if } \mathcal{A} \in \Omega, \\ +\infty, & \text{if } \mathcal{A} \notin \Omega, \end{cases} \qquad (69)$$
the Douglas-Rachford algorithm to solve (66) takes the following form:
$$\mathcal{A}_i = \mathrm{prox}_{\eta f}(\mathcal{C}_i), \qquad \mathcal{C}_{i+1} = \mathcal{C}_i + \mathrm{prox}_{\eta \mathcal{I}_{\Omega}}(2\mathcal{A}_i - \mathcal{C}_i) - \mathcal{A}_i, \qquad (70)$$
where $\eta > 0$ and $\mathrm{prox}_{\eta g}: \mathbb{R}^{M\times M} \rightarrow \mathbb{R}^{M\times M}$ is the proximal operator of $\eta g(\cdot)$ (with $g: \mathbb{R}^{M\times M} \rightarrow \mathbb{R}\cup\{+\infty\}$) defined as [25], [26]:
$$\mathrm{prox}_{\eta g}(\mathcal{C}) = \arg\min_{\mathcal{A}}\; g(\mathcal{A}) + \frac{1}{2\eta}\,\|\mathcal{A} - \mathcal{C}\|_F^2. \qquad (71)$$
Every sequence $(\mathcal{A}_i)_{i\in\mathbb{N}}$ generated by algorithm (70) converges to a solution of problem (66) [25, Proposition 4.3]. The Douglas-Rachford algorithm operates by splitting since it employs the functions $f(\cdot)$ and $\mathcal{I}_{\Omega}(\cdot)$ separately. It requires the implementation of two proximal steps at each iteration, which can be computed efficiently as explained in the following. The function $f(\mathcal{A})$ is an entrywise matrix function that treats the matrix $\mathcal{A} \in \mathbb{R}^{M\times M}$ as a vector in $\mathbb{R}^{M^2}$ and then applies a corresponding vector function; its proximal operator is then the same as that of the vector function. Let $C_{k\ell}$ denote the $(k,\ell)$-th block of an $N\times N$ block matrix $\mathcal{C}$ and let $[C_{k\ell}]_{mn}$ denote the $(m,n)$-th entry of $C_{k\ell}$. The $(k,\ell)$-th block of the proximal operator of $\eta f(\cdot)$ is given by:
$$[\mathrm{prox}_{\eta f}(\mathcal{C})]_{k\ell} = \begin{cases} \left(\dfrac{1}{1+\eta\gamma}\right) C_{k\ell}, & \text{if } \ell \in \mathcal{N}_k \text{ or } k = \ell, \\ C^s_{k\ell}, & \text{if } \ell \notin \mathcal{N}_k, \end{cases} \qquad (72)$$
where the matrix $C^s_{k\ell}$ is of size $M_k \times M_\ell$ with $(m,n)$-th entry given by:
$$[C^s_{k\ell}]_{mn} = \begin{cases} [C_{k\ell}]_{mn} - \eta, & \text{if } [C_{k\ell}]_{mn} \ge \eta, \\ 0, & \text{if } |[C_{k\ell}]_{mn}| \le \eta, \\ [C_{k\ell}]_{mn} + \eta, & \text{if } [C_{k\ell}]_{mn} \le -\eta. \end{cases} \qquad (73)$$
Since $\mathcal{I}_{\Omega}$ is the indicator function of the closed convex set $\Omega$, its proximal operator reduces to the projection onto $\Omega$, defined as:
$$\mathrm{prox}_{\eta\mathcal{I}_{\Omega}}(\mathcal{D}) = \Pi_{\Omega}(\mathcal{D}) = \arg\min_{\mathcal{A}}\; \|\mathcal{A} - \mathcal{D}\|_F \quad \text{subject to } \mathcal{A} \in \Omega, \qquad (74)$$
where the parameter $\eta$ does not appear since the proximal operator is a projection. The set $\Omega$ in (68) can be written alternatively as $\Omega = \Omega_1 \cap \Omega_2$, where $\Omega_1$ and $\Omega_2$ are two closed convex sets defined as:
$$\Omega_1 = \{\mathcal{A} \,|\, \mathcal{A}\,\mathcal{U} = \mathcal{U}, \; \mathcal{A} = \mathcal{A}^{\top}\}, \qquad (75)$$
$$\Omega_2 = \{\mathcal{A} \,|\, \|\mathcal{A} - P_{\mathcal{U}}\|_2 \le 1 - \epsilon\}. \qquad (76)$$
As we shall explain in the following, the projection onto the intersection $\Omega$ can be obtained by properly projecting onto the individual sets $\Omega_1$ and $\Omega_2$ according to:
$$\Pi_{\Omega}(\mathcal{D}) = \Pi_{\Omega_2}(\Pi_{\Omega_1}(\mathcal{D})). \qquad (77)$$
The projection onto $\Omega_1$ is given by (see Appendix D):
$$\Pi_{\Omega_1}(\mathcal{D}) = (I - P_{\mathcal{U}})\left(\frac{\mathcal{D} + \mathcal{D}^{\top}}{2}\right)(I - P_{\mathcal{U}}) + P_{\mathcal{U}}, \qquad (78)$$
and the projection of the symmetric matrix $\Pi_{\Omega_1}(\mathcal{D})$ onto $\Omega_2$ is given by (see Appendix D):
$$\Pi_{\Omega_2}(\Pi_{\Omega_1}(\mathcal{D})) = P_{\mathcal{U}} + \sum_{m=1}^{M}\beta_m v_m v_m^{\top}, \qquad (79)$$
where:
$$\beta_m = \begin{cases} -(1-\epsilon), & \text{if } \lambda_m < -(1-\epsilon), \\ \lambda_m, & \text{if } |\lambda_m| \le 1-\epsilon, \\ 1-\epsilon, & \text{if } \lambda_m > 1-\epsilon, \end{cases} \qquad (80)$$
and where $\{\lambda_m, v_m\}_{m=1}^{M}$ are the eigenvalues and eigenvectors of the symmetric matrix $(I - P_{\mathcal{U}})\left(\frac{\mathcal{D}+\mathcal{D}^{\top}}{2}\right)(I - P_{\mathcal{U}})$.
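The two proximal maps required by (70) admit the closed forms (72)-(73) and (77)-(80). The following NumPy sketch illustrates both in the real-valued case; `C_blocks` and `neighbors` are illustrative containers for the blocks $\{C_{k\ell}\}$ and the neighborhoods, and the code is a sketch of these formulas rather than the authors' implementation.

```python
import numpy as np

def prox_f(C_blocks, neighbors, eta, gamma):
    """Proximal operator (72)-(73) of eta*f: blocks over existing edges (or k = l)
    are scaled by 1/(1 + eta*gamma); blocks over missing edges are soft-thresholded
    entrywise with threshold eta."""
    out = {}
    for (k, l), C_kl in C_blocks.items():
        if l in neighbors[k] or k == l:
            out[(k, l)] = C_kl / (1.0 + eta * gamma)
        else:
            out[(k, l)] = np.sign(C_kl) * np.maximum(np.abs(C_kl) - eta, 0.0)
    return out

def project_Omega(D, P_U, eps):
    """Projection (77)-(80) onto Omega (the intersection of Omega_1 and Omega_2):
    symmetrize and project onto Omega_1 via (78), then clip the eigenvalues of the
    result (minus P_U) to the interval [-(1 - eps), 1 - eps] as in (79)-(80)."""
    Q = np.eye(D.shape[0]) - P_U
    B = Q @ (0.5 * (D + D.T)) @ Q                # Pi_{Omega_1}(D) - P_U, see (78)
    lam, V = np.linalg.eigh(B)                   # eigen-decomposition of the symmetric matrix
    lam = np.clip(lam, -(1.0 - eps), 1.0 - eps)  # eigenvalue thresholding (80)
    return P_U + V @ np.diag(lam) @ V.T          # reconstruction (79)
```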
In order to establish (77), we introduce the following lemma.

Lemma 2. (Characterization of the projection). If $\Omega_1$ is an affine set, $\Omega_2$ is a closed convex set, and $\Pi_{\Omega_2}(\Pi_{\Omega_1}(\mathcal{C})) \in \Omega_1$, then $\Pi_{\Omega_1\cap\Omega_2}(\mathcal{C}) = \Pi_{\Omega_2}(\Pi_{\Omega_1}(\mathcal{C}))$.

Proof. See Appendix E.

Since the projection onto $\Omega_2$ (given by (79)) changes only the eigenvalues of a matrix without affecting the eigenvectors, we have $\Pi_{\Omega_2}(\Pi_{\Omega_1}(\mathcal{C})) \in \Omega_1$. We then conclude from Lemma 2 that (77) holds.

Fig. 1. Experimental setup. (Left) Network topology. (Right) Graph spectral content of $\mathcal{W}^{\star}$ with $w^{\star}_m = (v_m^{\top}\otimes I_L)\,\mathcal{W}^{\star}$.
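Combining the two proximal maps sketched above, the Douglas-Rachford iteration (70) can be sketched as follows. The helpers `to_dense` and `to_blocks`, which convert between the block dictionary used by `prox_f` and the dense $M \times M$ matrix used by `project_Omega`, are assumed to be available and are not part of the paper.

```python
def douglas_rachford(C_blocks, neighbors, P_U, eta, gamma, eps, num_iter,
                     to_dense, to_blocks):
    """Sketch of iteration (70): A_i = prox_{eta f}(C_i) followed by
    C_{i+1} = C_i + prox_{eta I_Omega}(2 A_i - C_i) - A_i."""
    C = C_blocks
    for _ in range(num_iter):
        A = prox_f(C, neighbors, eta, gamma)
        R = {key: 2.0 * A[key] - C[key] for key in C}          # reflection 2 A_i - C_i
        P = to_blocks(project_Omega(to_dense(R), P_U, eps))    # prox of the indicator I_Omega
        C = {key: C[key] + P[key] - A[key] for key in C}       # C-update in (70)
    return prox_f(C, neighbors, eta, gamma)                    # final combination matrix estimate
```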
IV. SIMULATION RESULTS

We apply strategy (2) to solve distributed inference under smoothness (described in Remark 4 of Section II in [2]). We consider a connected mean-square-error (MSE) network of $N = 50$ nodes with $M_k = L = 5$. The $N$ nodes are placed randomly in the $[0,1]\times[0,1]$ square, and the weighted graph is then constructed according to a thresholded Gaussian kernel weighting function based on the distance between nodes. Particularly, the weight $c_{k\ell}$ of edge $(k,\ell)$ connecting nodes $k$ and $\ell$ that are $d_{k\ell}$ apart is:
$$c_{k\ell} = \begin{cases} \exp\left(-d_{k\ell}^2/(2\sigma^2)\right), & \text{if } d_{k\ell} \le \kappa, \\ 0, & \text{otherwise}, \end{cases} \qquad (81)$$
with $\sigma = 0.$ and $\kappa = 0.$ We consider the real-data case. Each agent is subjected to streaming data $\{d_k(i), u_{k,i}\}$ assumed to satisfy the linear regression model [4]:
$$d_k(i) = u_{k,i}^{\top} w_k^{\star} + v_k(i), \quad k = 1, \ldots, N, \qquad (82)$$
for some unknown $L\times 1$ vector $w_k^{\star}$ to be estimated, with $v_k(i)$ denoting a zero-mean measurement noise. For these networks, the risk functions take the form of mean-square-error costs:
$$J_k(w_k) = \frac{1}{2}\,\mathbb{E}\,|d_k(i) - u_{k,i}^{\top} w_k|^2, \quad k = 1, \ldots, N. \qquad (83)$$

Fig. 2. Inference under smoothness. Performance of algorithm (2) for different choices of the matrix $\mathcal{U}$ in (1) with $\mathcal{U} = U \otimes I_L$, and non-cooperative strategy. (Left) Performance w.r.t. $\mathcal{W}^{\star}$. (Right) Performance w.r.t. $\mathcal{W}^o$.

The processes $\{u_{k,i}, v_k(i)\}$ are assumed to be zero-mean Gaussian with: i) $\mathbb{E}\, u_{k,i} u_{\ell,i}^{\top} = R_{u,k} = \sigma^2_{u,k} I_L$ if $k = \ell$ and zero otherwise; ii) $\mathbb{E}\, v_k(i) v_\ell(i) = \sigma^2_{v,k}$ if $k = \ell$ and zero otherwise; and iii) $u_{k,i}$ and $v_k(i)$ are independent of each other. The variances $\sigma^2_{u,k}$ and $\sigma^2_{v,k}$ are generated from the uniform distributions unif(0. , ) and unif(0. , 0. ), respectively. Let $\mathcal{W}^{\star} = \mathrm{col}\{w_1^{\star}, \ldots, w_N^{\star}\}$. The signal $\mathcal{W}^{\star}$ is generated by smoothing a signal $\mathcal{W}_0$ by a diffusion kernel. Particularly, we generate $\mathcal{W}^{\star}$ according to $\mathcal{W}^{\star} = [(V e^{-\tau\Lambda} V^{\top})\otimes I_L]\,\mathcal{W}_0$ with $\tau = 30$, $\mathcal{W}_0$ a randomly generated vector from the Gaussian distribution $\mathcal{N}(0.\times \mathbb{1}_{NL}, I_{NL})$, and $\{V = [v_1, \ldots, v_N], \Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_N\}\}$ the matrices of eigenvectors and eigenvalues of $L_c = \mathrm{diag}\{C\mathbb{1}_N\} - C$ with $[C]_{k\ell} = c_{k\ell}$ given by (81). Figure 1 (right) illustrates the normalized squared $\ell_2$-norm of the spectral component $w_m^{\star} = (v_m^{\top}\otimes I_L)\,\mathcal{W}^{\star}$. It can be observed that the signal is mainly localized in $[0, \, .\,]$. Note that, for MSE networks, it holds that $H_k(w_k) = R_{u,k}$ for all $w_k$. Furthermore, the gradient noise process (4) is given by:
$$s_{k,i}(w_k) = (u_{k,i} u_{k,i}^{\top} - R_{u,k})(w_k^o - w_k) + u_{k,i}\, v_k(i), \qquad (84)$$
with covariance $R_k^o$ given by:
$$R_k^o = \mathbb{E}\left[(u_{k,i}u_{k,i}^{\top} - R_{u,k})\, W_k\, (u_{k,i}u_{k,i}^{\top} - R_{u,k})\right] + \sigma^2_{v,k} R_{u,k} = R_{u,k} W_k R_{u,k} + R_{u,k}\,\mathrm{Tr}(R_{u,k} W_k) + \sigma^2_{v,k} R_{u,k}, \qquad (85)$$
where $W_k = (w_k^{\star} - w_k^o)(w_k^{\star} - w_k^o)^{\top}$, and where we used the fact that $\mathbb{E}[u_{k,i}u_{k,i}^{\top} W_k u_{k,i}u_{k,i}^{\top}] = 2 R_{u,k} W_k R_{u,k} + R_{u,k}\,\mathrm{Tr}(R_{u,k}W_k)$ since the regressors are zero-mean real Gaussian [27]. We run algorithm (2) for 5 different choices of the matrix $\mathcal{U}$ in (1) with $\mathcal{U} = U\otimes I_L$ (a sketch of the data-generation steps follows this list): i) $U$ chosen as the first eigenvector of the Laplacian, $U = [v_1] = \frac{1}{\sqrt{N}}\mathbb{1}_N$; ii) $U$ chosen as the first two eigenvectors of the Laplacian, $U = [v_1\; v_2]$; iii) $U = [v_1\; v_2\; v_3]$; iv) $U = [\,v_1 \;\ldots\; v\,]$; v) $U = [\,v_1 \;\ldots\; v\,]$.
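The experimental setup above, namely the thresholded Gaussian-kernel weights (81) and the streaming regression data (82), can be reproduced along the following lines. The sketch uses illustrative parameter names and white Gaussian regressors and noise; it is not the exact script used to generate the figures.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed

def make_graph_weights(coords, sigma, kappa):
    """Thresholded Gaussian-kernel weights (81): c_kl = exp(-d_kl^2 / (2 sigma^2))
    whenever the node distance d_kl <= kappa, and c_kl = 0 otherwise."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = np.exp(-d**2 / (2.0 * sigma**2)) * (d <= kappa)
    np.fill_diagonal(C, 0.0)     # no self-loops in the weight matrix
    return C

def stream_mse_data(w_star, sigma2_u, sigma2_v):
    """One streaming sample per agent for the regression model (82):
    d_k(i) = u_{k,i}^T w_k_star + v_k(i), with per-agent regressor and noise
    variances sigma2_u[k] and sigma2_v[k]."""
    N, L = w_star.shape
    U_reg = rng.normal(scale=np.sqrt(sigma2_u)[:, None], size=(N, L))
    d = np.einsum('kl,kl->k', U_reg, w_star) + rng.normal(scale=np.sqrt(sigma2_v), size=N)
    return U_reg, d
```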
Since U = U ⊗ I L , the matrix A is ofthe form A ⊗ I L with A = [ a k(cid:96) ] an N × N matrix. In each case, the combination matrix A is set as the solutionof the optimization problem (66) ( (cid:15) = 0 . , γ = 0 , ||| A k(cid:96) ||| = | a k(cid:96) | ), which is solved by the Douglas-Rachfordalgorithm (70) with η = 0 . . Note that, for the different choices of U , the distributed implementation is feasible TABLE IIP
ERFORMANCE OF STRATEGY (2) W . R . T . W o IN (1) FOR DIFFERENT CHOICES OF µ .MSDStep-size µ Solution Exp. (65) Exp. (62) SimulationCentralized − . dB − . dB − . dB − Distributed − . dB − . dB − . dBCentralized − . dB − . dB − . dB − Distributed − . dB − . dB − . dBCentralized − . dB − . dB − . dB − Distributed − . dB − . dB − . dB and the steady-state value of the cost in (66) is zero. We set µ = 0 . . We report the network MSD (cid:63) learningcurves N E (cid:107) W (cid:63) − W i (cid:107) in Fig. 2 (left). The results are averaged over Monte-Carlo runs. The learning curveof the non-cooperative solution, obtained from (2) by setting A = I LN , is also reported. The results show that thebest performance is obtained when U = [ v v v v ] ⊗ I L . This is due to the fact that the columns of U constitutea basis spanning the useful signal subspace (see Fig. 1 (right)). As a consequence, a strong noise reduction may beobtained by projecting onto this subspace compared with the non-cooperative strategy where each agent estimates w (cid:63)k without any cooperation. By forcing consensus (i.e., by choosing U = [ v ] ), the resulting estimate w k,i will bebiased with respect to w (cid:63)k , which is not common across agents. The performance obtained when U = [ v . . . v ] is worse than the case where U = [ v . . . v ] due to a smaller noise reduction.Finally, we illustrate Theorem 2 in Table II by reporting the steady-state MSD = lim sup i →∞ N E (cid:107) W o − W i (cid:107) when U = [ v . . . v ] ⊗ I L for different values of the step-size µ = { − , − , − } . A closed form solutionfor W o in (1) exists and is given by: W o = U ( U (cid:62) H U ) − U (cid:62) H W (cid:63) , (86)where H = diag { R u,k } Nk =1 . We observe that, in the small adaptation regime, i.e., when µ → , the network MSDincreases approximately dB per decade (when µ goes from µ to µ ). This means that the steady-state MSD ison the order of µ . We also observe that, in the small adaptation regime, the distributed solution is able to attain thesame performance as the centralized one. Finally, note that, for relatively large step-size ( µ = 10 − ), expression (62)provides better results than (65) in the distributed case. This is due to neglecting the O (1) term in (58) which ismultiplied by O ( µ ) (since Y = O ( µ ) ) when replaced in (62).V. C ONCLUSION
In this paper, and its accompanying Part I [2], we considered inference problems over networks where agentshave individual parameter vectors to estimate subject to subspace constraints that require the parameters acrossthe network to lie in low-dimensional subspaces. Based on the gradient projection algorithm, we proposed aniterative and distributed implementation of the projection step, which runs in parallel with the stochastic gradient descent update. We showed that, for small step-size parameter, the network is able to approach the minimizer ofthe constrained problem to arbitrarily good accuracy levels. Furthermore, we derived a closed-form expressions forthe steady-state mean-square-error (MSE) performance. These expressions revealed explicitly the influence of thegradient noise, data characteristics, and subspace constraints, on the network performance. Finally, among manypossible convex formulations, we considered the design of feasible distributed solution that minimizes the numberof edges to be added to the original graph. A PPENDIX AP ROOF OF T HEOREM z i (cid:44) (cid:101) W ei − (cid:101) W e (cid:48) i . Using (44) in the expression for B i − in (35), we can write: B i − = B + µ A e (cid:101) H i − , (87)in terms of the constant coefficient matrix B in (42). Using (87) and (43), and subtracting (32) and (47), we thenget: z i = B z i − + µ A e c i − . (88)If we multiply both sides of (88) from the left by ( V e(cid:15) ) − we obtain: z i z (cid:86) i = B z i − z (cid:86) i − + c i − c (cid:86) i − (89)where we partitioned the vectors ( V e(cid:15) ) − z i and µ Λ e(cid:15) ( V e(cid:15) ) − c i − into: ( V e(cid:15) ) − z i (cid:44) z i z (cid:86) i , µ Λ e(cid:15) ( V e(cid:15) ) − c i − (cid:44) c i − c (cid:86) i − (90)with the leading vectors, { z i , c i − } , having dimensions hP × each. The matrix B is given by: B (cid:44) ( V e(cid:15) ) − BV e(cid:15) (42) , (25) = Λ e(cid:15) − µ Λ e(cid:15) ( V e(cid:15) ) − H o V e(cid:15) = I hP − D −D −D J e(cid:15) − D (91)with the blocks {D mn } given by: D (cid:44) µ ( U e ) ∗ H o U e , D (cid:44) µ ( U e ) ∗ H o V eR,(cid:15) , (92) D (cid:44) µ J e(cid:15) ( V eL,(cid:15) ) ∗ H o U e , D (cid:44) µ J e(cid:15) ( V eL,(cid:15) ) ∗ H o V eR,(cid:15) . (93) Recursion (89) has a form similar to the earlier recursion (66) in Part I [2] with three differences. First, thematrices {D mn } in (90) are constant matrices; nevertheless, they satisfy the same bounds as the matrices { D mn,i − } in eq. (66) in Part I [2]. In particular, from (115), (116), and (122) in Part I [2], it continues to hold that: (cid:107) I hP − D (cid:107) ≤ − µσ , (cid:107)D (cid:107) ≤ µσ , (94) (cid:107)D (cid:107) ≤ µσ , (cid:107)D (cid:107) ≤ µσ , (95)for some positive constants σ , σ , σ , σ that are independent of µ . Second, Third, the bias term b (cid:86) e in (66) inPart I [2] is absent from (90). Third, the gradient noise terms that appeared in recursion (66) in Part I [2] are nowreplaced by the perturbation sequences { c i − , c (cid:86) i − } . However, these sequences can be bounded as follows: (cid:107) c i − (cid:107) ≤ µ r (cid:107) (cid:101) W ei − (cid:107) , (cid:107) c (cid:86) i − (cid:107) ≤ µ r (cid:107) (cid:101) W ei − (cid:107) , (96)for some constant r that is independent of µ since: (cid:107) µ Λ e(cid:15) ( V e(cid:15) ) − c i − (cid:107) ≤ µ r (cid:107) (cid:101) W ei − (cid:107) . 
(97)To establish the above inequality, we start by noting that any cost J k ( · ) satisfying (19) and (31) will also satisfy [4,Lemmas E.4, E.8]: (cid:107)∇ w k J k ( w ok + ∆ w k ) − ∇ w k J k ( w ok ) (cid:107) ≤ κ (cid:48) d (cid:107) ∆ w k (cid:107) , (98)for any ∆ w k and where κ (cid:48) d (cid:44) max { κ d , ( δ k − ν k ) / ( h(cid:15) ) } . Then, for each agent k we have: (cid:107) H ok − H k,i − (cid:107) (38) , (46) ≤ (cid:90) (cid:107)∇ w k J k ( w ok ) − ∇ w k J k ( w ok − t (cid:101) w k,i − ) (cid:107) dt (98) ≤ (cid:90) κ (cid:48) d (cid:107) t (cid:101) w k,i − (cid:107) dt = 12 κ (cid:48) d (cid:107) (cid:101) w k,i − (cid:107) (99)Therefore, (cid:107) (cid:101) H i − (cid:107) (44) = max ≤ k ≤ N (cid:107) H ok − H k,i − (cid:107)≤ κ (cid:48) d (cid:18) max ≤ k ≤ N (cid:107) (cid:101) w k,i − (cid:107) (cid:19) ≤ κ (cid:48) d (cid:107) (cid:101) W ei − (cid:107) . (100)Now, replacing c i − in (97) by (43) and using (100) we conclude (97).Repeating the argument that led to inequalities (129) and (130) in Part I [2] we obtain: E (cid:107) z i (cid:107) ≤ (1 − µσ ) E (cid:107) z i − (cid:107) + 2 µσ σ E (cid:107) z (cid:86) i − (cid:107) + 2 µr σ E (cid:107) (cid:101) W ei − (cid:107) (101)and E (cid:107) z (cid:86) i (cid:107) ≤ (cid:18) ρ ( J (cid:15) ) + (cid:15) + 3 µ σ − ρ ( J (cid:15) ) − (cid:15) (cid:19) E (cid:107) z (cid:86) i − (cid:107) + 3 µ σ − ρ ( J (cid:15) ) − (cid:15) E (cid:107) z i − (cid:107) + 3 µ r − ρ ( J (cid:15) ) − (cid:15) E (cid:107) (cid:101) W ei − (cid:107) . (102)We can combine (101) and (102) into a single inequality recursion as follows: E (cid:107) z i (cid:107) E (cid:107) z (cid:86) i (cid:107) (cid:22) a bc d (cid:124) (cid:123)(cid:122) (cid:125) Γ E (cid:107) z i − (cid:107) E (cid:107) z (cid:86) i − (cid:107) + ef E (cid:107) (cid:101) W ei − (cid:107) . (103) where a = 1 − O ( µ ) , b = O ( µ ) , c = O ( µ ) , d = ρ ( J (cid:15) ) + (cid:15) + O ( µ ) , e = O ( µ ) , and f = O ( µ ) . Using (30) andeq. (134) in Part I [2] we conclude that: lim sup i →∞ E (cid:107) z i (cid:107) = O ( µ ) , lim sup i →∞ E (cid:107) z (cid:86) i (cid:107) = O ( µ ) , (104)and, hence, lim sup i →∞ E (cid:107) z i (cid:107) = O ( µ ) . It follows that lim sup i →∞ E (cid:107) (cid:101) W ei − (cid:101) W e (cid:48) i (cid:107) = O ( µ ) , which estab-lishes (48). Finally, note that: E (cid:107) (cid:101) W e (cid:48) i (cid:107) = E (cid:107) (cid:101) W e (cid:48) i − (cid:101) W ei + (cid:101) W ei (cid:107) ≤ E (cid:107) (cid:101) W e (cid:48) i − (cid:101) W ei (cid:107) + E (cid:107) (cid:101) W ei (cid:107) + 2 | E ( (cid:101) W e (cid:48) i − (cid:101) W ei ) ∗ (cid:101) W ei |≤ E (cid:107) (cid:101) W e (cid:48) i − (cid:101) W ei (cid:107) + E (cid:107) (cid:101) W ei (cid:107) + 2 (cid:113) E (cid:107) (cid:101) W e (cid:48) i − (cid:101) W ei (cid:107) E (cid:107) (cid:101) W ei (cid:107) (105)where we used | E x | ≤ E | x | from Jensen’s inequality and where we applied Holder’s inequality: E | x ∗ y | ≤ ( E | x | p ) p ( E | y | q ) q , when /p + 1 /q = 1 . (106)Hence, from (14) and (48) we get: lim sup i →∞ ( E (cid:107) (cid:101) W e (cid:48) i (cid:107) − E (cid:107) (cid:101) W ei (cid:107) ) ≤ O ( µ ) + O ( µ / ) = O ( µ / ) (107)since µ < µ / for small µ (cid:28) , which establishes (49).From (14) and (107), it follows that: lim sup i →∞ E (cid:107) (cid:101) W e (cid:48) i (cid:107) = O ( µ ) , (108)and, therefore, the long-term approximate model (47) is also mean-square stable.A PPENDIX BL OW - RANK APPROXIMATION
From (91), we obtain: B (cid:62) = (cid:16) ( V e(cid:15) ) (cid:62) (cid:17) − I hP − D (cid:62) −D (cid:62) −D (cid:62) ( J e(cid:15) ) (cid:62) − D (cid:62) ( V e(cid:15) ) (cid:62) (109) B ∗ = (( V e(cid:15) ) ∗ ) − I hP − D ∗ −D ∗ −D ∗ ( J e(cid:15) ) ∗ − D ∗ ( V e(cid:15) ) ∗ (110)where the block matrices {D (cid:62) mn , D ∗ mn } are all on the order of µ with: D (cid:62) = µ ( U e ) (cid:62) ( H o ) (cid:62) [( U e ) ∗ ] (cid:62) = O ( µ ) , (111) D ∗ = D = µ ( U e ) ∗ H o U e = O ( µ ) , (112)of dimensions hP × hP . Substituting (109) and (110) into (55) and using property ( A ⊗ b B )( C ⊗ b D ) = AC ⊗ b BD, for block Kronecker products, we obtain: F = (cid:16) ( V e(cid:15) ) (cid:62) ⊗ b ( V e(cid:15) ) ∗ (cid:17) − X (cid:16) ( V e(cid:15) ) (cid:62) ⊗ b ( V e(cid:15) ) ∗ (cid:17) , (113) where we introduced: X (cid:44) I hP − D (cid:62) −D (cid:62) −D (cid:62) ( J e(cid:15) ) (cid:62) − D (cid:62) ⊗ b I hP − D ∗ −D ∗ −D ∗ ( J e(cid:15) ) ∗ − D ∗ . (114)We partition X into the following block structure: X = X X X X (115)where, for example, X is ( hP ) × ( hP ) and is given by: X (cid:44) (cid:16) I hP − D (cid:62) (cid:17) ⊗ ( I hP − D ∗ ) . (116)Since ( I − F ) − = (cid:16) ( V e(cid:15) ) (cid:62) ⊗ b ( V e(cid:15) ) ∗ (cid:17) − ( I − X ) − (cid:16) ( V e(cid:15) ) (cid:62) ⊗ b ( V e(cid:15) ) ∗ (cid:17) (117)we proceed to evaluate I − F . It follows that: I − X = I ( hP ) − X −X −X I − X (118)and, in a manner similar to the way we assessed the size of the block matrices { D mn,i − } in the proof of Theorem 1in Part I [2], we can verify that: I − X = I ( hP ) − (cid:16) I hP − D (cid:62) (cid:17) ⊗ ( I hP − D ∗ ) = O ( µ ) , X = O ( µ ) , X = O ( µ ) , I − X = O (1) . (119)The matrix I − X is invertible since I − F is invertible; this is because ρ ( F ) = [ ρ ( B )] < . Therefore, applyingthe block matrix inversion formula to I − X we get: ( I − X ) − = ( I ( hP ) − X ) −
00 0 + Y Y Y ∆ − (120)where Y = ( I − X ) − X ∆ − X ( I − X ) − , Y = ( I − X ) − X ∆ − , Y = ∆ − X ( I − X ) − , and ∆ = ( I − X ) − X ( I − X ) − X . The entries of ( I ( hP ) − X ) − are O (1 /µ ) , while the entries in the secondmatrix on the right-hand side of the above equation are O (1) when the step-size is small. That is, we can write: ( I − X ) − = O (1 /µ ) O (1) O (1) O (1) . (121)Moreover, since O (1 /µ ) dominates O (1) for sufficiently small µ , we can also write: ( I − X ) − = ( I ( hP ) − X ) −
00 0 + O (1)= (cid:0) ( I ( hP ) ⊗ D ∗ ) + ( D (cid:62) ⊗ I ( hP ) ) (cid:1) −
00 0 + O (1)= I ( hP ) Z − (cid:104) I ( hP ) (cid:105) + O (1) . (122) Substituting (122) into (117) we arrive at (59). Since Z = O ( µ ) , we conclude that (57) holds. We also concludethat (58) holds since: ( I − F ) − = ( I − X ) − . (123)A PPENDIX CP ROOF OF T HEOREM F i − , invoking the conditions on thegradient noise process from Assumption 2, and computing the conditional expectation, we obtain: E [ (cid:101) W e (cid:48) i | F i − ] = B (cid:101) W e (cid:48) i − + µ A e b e , (124)where the term involving s ei ( W ei − ) is eliminated because E [ s ei | F i − ] = 0 . Taking expectations again we arrive at: E (cid:101) W e (cid:48) i = B (cid:16) E (cid:101) W e (cid:48) i − (cid:17) + µ A e b e . (125)Since recursion (47) includes a constant driving term µ A e b e , we introduce the centered variable y i (cid:44) (cid:101) W e (cid:48) i − E (cid:101) W e (cid:48) i . Subtracting (125) from (47), we find that y i satisfies the following recursion: y i = B y i − − µ A e s ei ( W i − ) . (126)Although we are interested in evaluating lim sup i →∞ E (cid:107) (cid:101) W e (cid:48) i (cid:107) , we can still rely on y i since it holds for i (cid:29) : E (cid:107) z i (cid:107) = E (cid:107) (cid:101) W e (cid:48) i (cid:107) − (cid:107) E (cid:101) W e (cid:48) i (cid:107) = E (cid:107) (cid:101) W e (cid:48) i (cid:107) + O ( µ ) , (127)where we used the fact that lim sup i →∞ (cid:107) E (cid:101) W e (cid:48) i (cid:107) = O ( µ ) (see Appendix H). Therefore, from (49) and (127), weobtain: lim sup i →∞ hN E (cid:107) (cid:101) W ei (cid:107) = lim sup i →∞ hN E (cid:107) y i (cid:107) + O ( µ / ) . (128)Let Σ denote an arbitrary Hermitian positive semi-definite matrix that we are free to choose. Equating the squaredweighted values of both sides of (126) and taking expectations conditioned on the past history gives: E [ (cid:107) y i (cid:107) | F i − ] = (cid:107) y i − (cid:107) B ∗ Σ B + µ E [ (cid:107) s ei (cid:107) A e ) ∗ Σ A e | F i − ] . (129)Taking expectations again, we get: E (cid:107) y i (cid:107) = E (cid:107) y i − (cid:107) B ∗ Σ B + µ E (cid:107) s ei (cid:107) A e ) ∗ Σ A e . (130)From (54) and using same arguments as in [4, pp. 586], we can rewrite the second term on the R.H.S. of (130) as: µ E (cid:107) s ei (cid:107) A e ) ∗ Σ A e = µ Tr (Σ A e E [ s ei s e ∗ i ]( A e ) ∗ )= Tr (Σ Y ) + Tr (Σ) · O ( µ γ m ) . (131)for i (cid:29) . Therefore, we obtain: lim sup i →∞ E (cid:107) y i (cid:107) −B ∗ Σ B = Tr (Σ Y ) + Tr (Σ) · O ( µ γ m ) . (132) In order to reduce the weighting matrix on the mean-square value of z i in (132) to the identity, we need to select Σ as the solution to the following Lyapunov equation: Σ − B ∗ Σ B = I hM . (133)This equation has a unique Hermitian non-negative definite solution Σ [4, pp. 772] since the matrix B is stable forsufficiently small step-size. Now, by applying the block vectorization operation to both sides of (133) and by usingthe property that: bvec ( ACB ) = ( B (cid:62) ⊗ b A ) bvec ( C ) , (134)we find that: bvec (Σ) = ( I − F ) − bvec ( I hM ) (135)in terms of the matrix F defined in (55).Now, substituting Σ in (135) into (132), we obtain E (cid:107) y i (cid:107) on the left-hand side while the term Tr (Σ Y ) on theright-hand side becomes: Tr (Σ Y ) = ( bvec ( Y (cid:62) )) (cid:62) ( I − F ) − bvec ( I hM ) . (136)where we used the property that: Tr ( AB ) = [ bvec ( B (cid:62) )] (cid:62) bvec ( A ) , (137)Using the fact that ( I − F ) − = O (1 /µ ) and following similar arguments as in [4, pp. 
590], we can show that:Tr (Σ) · O ( µ γ m ) = O ( µ γ m ) . (138)Replacing (136) and (138) into (132) gives (61). Observe that the first term on the R.H.S. of (61) is O ( µ ) since (cid:107)Y(cid:107) = O ( µ ) and (cid:107) ( I − F ) − (cid:107) = O (1 /µ ) . Therefore, this term dominates the factor O ( µ γ m ) .Since F is a stable matrix for sufficiently small step-sizes, we can employ the expansion ( I − F ) − = I + F + F + F + . . . , replace F by (55), and use properties (137) and (134) to write the first term on the R.H.S. of (61)as (cid:80) ∞ n =0 Tr ( B n Y ( B ∗ ) n ) .Now, in order to establish (65), we shall use the low-rank approximation (59). Using definition (12) and (61),we obtain: MSD = µhN lim µ → lim sup i →∞ µ ( bvec ( Y (cid:62) )) (cid:62) ( I − F ) − bvec ( I hM ) (139)From (59) we get: ( bvec ( Y (cid:62) )) (cid:62) ( I − F ) − bvec ( I hM ) = ( bvec ( Y (cid:62) )) (cid:62) (cid:16) [( U e ) ∗ ] (cid:62) ⊗ b U e (cid:17) Z − (cid:16) ( U e ) (cid:62) ⊗ b ( U e ) ∗ (cid:17) bvec ( I hM ) + O ( µ ) . (140)Using property (134), it is straightforward to verify that the last three terms combine into the following result: (cid:16) ( U e ) (cid:62) ⊗ b ( U e ) ∗ (cid:17) bvec ( I hM ) = bvec ( I hP ) = vec ( I hP ) , (141) Let us therefore evaluate the matrix vector product x (cid:44) Z − vec ( I hP ) . Using the definition (60) for Z , the vector x is therefore the unique solution to the linear system of equations: ( I hP ⊗ D ∗ ) x + ( D (cid:62) ⊗ I hP ) x = vec ( I hP ) . (142)Let X = unvec ( x ) denote the hP × hP matrix whose vector representation is x . Applying to each of the termsappearing on the left-hand side of the above expression the Kronecker product property (134), albeit using vecinstead of bvec operation, we find that ( I hP ⊗ D ∗ ) x = vec ( D ∗ X ) , and ( D (cid:62) ⊗ I hP ) x = vec ( X D ) . We concludefrom these equalities and from (142) that X is the unique solution to the (continuous-time) Lyapunov equation D ∗ X + X D = I hP . Since D in (92) is Hermitian, we obtain: X = 12 D − = 12 µ (( U e ) ∗ H o U e ) − . (143)Therefore, substituting into (140) gives: ( bvec ( Y (cid:62) )) (cid:62) ( I − F ) − bvec ( I hM ) = ( bvec ( Y (cid:62) )) (cid:62) (cid:16) [( U e ) ∗ ] (cid:62) ⊗ b U e (cid:17) vec ( X ) + O ( µ ) (134) = ( bvec ( Y (cid:62) )) (cid:62) bvec ( U e X ( U e ) ∗ ) + O ( µ ) (137) = Tr ( U e X ( U e ) ∗ Y ) + O ( µ ) = Tr (( U e ) ∗ YU e X ) + O ( µ ) (63) = µ Tr (( U e ) ∗ A e S ( A e ) ∗ U e X ) + O ( µ )= µ Tr (( U e ) ∗ SU e X ) + O ( µ ) (143) = µ Tr (cid:16) (( U e ) ∗ H o U e ) − (( U e ) ∗ SU e ) (cid:17) + O ( µ ) . (144)where we used the fact that ( U e ) ∗ A e = ( U e ) ∗ . Now substituting the above expression into the right-hand sideof (139) and computing the limit as µ → , we arrive at expression (65).A PPENDIX DP ROJECTION ONTO Ω IN (75) AND Ω IN (76)The closed convex set Ω in (75) can be rewritten alternatively as: Ω = {A|A U = U , U (cid:62) A = U (cid:62) , A = A (cid:62) } , (145)and the projection onto it is given by: Π Ω ( D ) = arg min A (cid:107)A − D(cid:107) F subject to A U = U , U (cid:62) A = U (cid:62) , A = A (cid:62) . (146)The Lagrangian of the convex optimization problem in (146) is defined as: L ( A ; X, Y, W ) = 12 (cid:107)A − D(cid:107) F + Tr ( X (cid:62) ( AU − U ))+ Tr ( Y (cid:62) ( U (cid:62) A − U (cid:62) )) + Tr ( Z (cid:62) ( A − A (cid:62) )) , (147) where X ∈ R M × P , Y ∈ R P × M , and Z ∈ R M × M are the matrices of Lagrange multipliers. 
From the Karush-Kuhn-Tucker (KKT) conditions, we obtain at the optimum ( A o ; X o , Y o , Z o ) : A o U = U , (148) U (cid:62) A o = U (cid:62) , (149) A o = ( A o ) (cid:62) , (150) ∇ A L = A o − D + X o U (cid:62) + U Y o + Z o − ( Z o ) (cid:62) = 0 . (151)From (151), we obtain: A o = D − X o U (cid:62) − U Y o − Z o + ( Z o ) (cid:62) . (152)Multiplying both sides of (152) by U and using the fact that U (cid:62) U = I from Assumption 3, we obtain: A o U = DU − X o − U Y o U − Z o U + ( Z o ) (cid:62) U . (153)Combining the previous expression with (148), we get: X o = DU − U Y o U − Z o U + ( Z o ) (cid:62) U − U . (154)Replacing (154) into (152) and using the fact that P U = U U (cid:62) , we arrive at: A o = D − DP U + U Y o P U + Z o P U − ( Z o ) (cid:62) P U + P U − U Y o − Z o + ( Z o ) (cid:62) . (155)Pre-multiplying both sides of the previous equation by U (cid:62) and using the fact that U (cid:62) U = I , we obtain: U (cid:62) A o = U (cid:62) D − U (cid:62) DP U + Y o P U + U (cid:62) Z o P U − U (cid:62) ( Z o ) (cid:62) P U + U (cid:62) − Y o − U (cid:62) Z o + U (cid:62) ( Z o ) (cid:62) . (156)Combining the previous expression with (149), we arrive at: U (cid:62) D − U (cid:62) DP U + Y o P U + U (cid:62) Z o P U − U (cid:62) ( Z o ) (cid:62) P U − Y o − U (cid:62) Z o + U (cid:62) ( Z o ) (cid:62) = 0 . (157)Pre-multiplying both sides of the previous equation by U and using the fact that P U = U U (cid:62) , we obtain: U Y o P U − U Y o = − P U D + P U DP U − P U Z o P U + P U ( Z o ) (cid:62) P U + P U Z o − P U ( Z o ) (cid:62) . (158)Replacing (158) into (155), we arrive at: A o =( I − P U ) D ( I − P U ) − ( I − P U )( Z o − ( Z o ) (cid:62) )( I − P U ) + P U , (159)and thus, A o (cid:62) = ( I − P U ) D (cid:62) ( I − P U ) + ( I − P U )( Z o − ( Z o ) (cid:62) )( I − P U ) + P U (160)Combining (150) and the previous two equations, we obtain: ( I − P U ) (cid:18) D − D (cid:62) (cid:19) ( I − P U )=( I − P U )( Z o − ( Z o ) (cid:62) )( I − P U ) (161) Replacing the previous equation into (159), we arrive at: Π Ω ( D ) = ( I − P U ) (cid:18) D − D − D (cid:62) (cid:19) ( I − P U ) + P U = ( I − P U ) (cid:18) D + D (cid:62) (cid:19) ( I − P U ) + P U . (162)Now, projecting a symmetric matrix C onto Ω in (76) is given by: Π Ω ( C ) = arg min A (cid:107)A − C(cid:107) F subject to (cid:107)A − P U (cid:107) ≤ − (cid:15) = P U + arg min Y (cid:107)Y − ( C − P U ) (cid:107) F subject to (cid:107)Y(cid:107) ≤ − (cid:15) = P U + Π Ω ( C − P U ) (163)where Ω (cid:44) {A|(cid:107)A(cid:107) ≤ − (cid:15) } . In order to project the symmetric matrix C − P U onto Ω , we need to compute itseigenvalue decomposition C − P U = (cid:80) Mm =1 λ m v m v (cid:62) m and then threshold the eigenvalues { λ m } Mm =1 to have absolutemagnitude at most − (cid:15) [26, pp. 191–194]. Thus we obtain: Π Ω ( C ) = P U + M (cid:88) m =1 β m v m v (cid:62) m , (164)where: β m = − (cid:15), if λ m < − (cid:15),λ m , if | λ m | < − (cid:15), − (cid:15), if λ m > − (cid:15). (165)Now, replacing the matrix C by (162), we obtain (79).A PPENDIX EP ROOF OF L EMMA
APPENDIX E
PROOF OF LEMMA 2

To establish Lemma 2, we first state and prove two auxiliary results.

Lemma 3.
Let Ω denote a closed convex set. For any C ∉ Ω, A^o = Π_Ω(C) if and only if ⟨C − A^o, A − A^o⟩ ≤ 0, ∀A ∈ Ω, where ⟨X, Y⟩ = Tr(X^⊤Y).

Proof. (⇒) Let A^o = Π_Ω(C) for any given C ∉ Ω, that is, suppose that A^o is the unique solution to the optimization problem. Let A ∈ Ω be such that A ≠ A^o, and let α ∈ (0, 1]. Since Ω is convex, (1 − α)A^o + αA = A^o + α(A − A^o) ∈ Ω. By the assumed optimality of A^o, we must have:

‖C − A^o‖²_F ≤ ‖C − [A^o + α(A − A^o)]‖²_F = ‖C − A^o‖²_F + α²‖A − A^o‖²_F − 2α⟨C − A^o, A − A^o⟩, (166)

and we obtain:

⟨C − A^o, A − A^o⟩ ≤ (α/2)‖A − A^o‖²_F. (167)

Now, note that (167) holds for any α ∈ (0, 1]. Since the RHS of (167) can be made arbitrarily small for a given A, the LHS cannot be strictly positive. Thus, we conclude, as desired:

⟨C − A^o, A − A^o⟩ ≤ 0, ∀A ∈ Ω. (168)

(⇐) Let A^o ∈ Ω be such that ⟨C − A^o, A − A^o⟩ ≤ 0, ∀A ∈ Ω. We shall show that it must be the optimal solution. Let A ∈ Ω with A ≠ A^o. We have:

‖C − A‖²_F − ‖C − A^o‖²_F = ‖C − A^o + A^o − A‖²_F − ‖C − A^o‖²_F = ‖C − A^o‖²_F + ‖A^o − A‖²_F − 2⟨C − A^o, A − A^o⟩ − ‖C − A^o‖²_F > 0. (169)

Hence, A^o is the optimal solution to the optimization problem, and thus A^o = Π_Ω(C) by definition.

Lemma 4. If Ω is further affine, then, for any C ∉ Ω, A^o = Π_Ω(C) if and only if ⟨C − A^o, A − A^o⟩ = 0, ∀A ∈ Ω.

Proof. (⇒) Let A^o = Π_Ω(C) for any given C ∉ Ω, that is, suppose that A^o is the unique solution to the optimization problem. Let A ∈ Ω be such that A ≠ A^o, and let α ∈ R. Since Ω is affine, (1 − α)A^o + αA = A^o + α(A − A^o) ∈ Ω. By the assumed optimality of A^o, we must have:

‖C − A^o‖²_F ≤ ‖C − [A^o + α(A − A^o)]‖²_F = ‖C − A^o‖²_F + α²‖A − A^o‖²_F − 2α⟨C − A^o, A − A^o⟩, (170)

and we obtain:

2α⟨C − A^o, A − A^o⟩ ≤ α²‖A − A^o‖²_F. (171)

If α > 0, we obtain:

⟨C − A^o, A − A^o⟩ ≤ (α/2)‖A − A^o‖²_F. (172)

Now, note that (172) holds for any α > 0. Since the RHS of (172) can be made arbitrarily small for a given A, the LHS cannot be strictly positive. Thus, we conclude:

⟨C − A^o, A − A^o⟩ ≤ 0, ∀A ∈ Ω. (173)

If α < 0, we obtain:

⟨C − A^o, A − A^o⟩ ≥ (α/2)‖A − A^o‖²_F. (174)

Now, note that (174) holds for any α < 0. Since the RHS of (174) can be made arbitrarily close to zero (from below) for a given A, the LHS cannot be strictly negative. Thus, we conclude:

⟨C − A^o, A − A^o⟩ ≥ 0, ∀A ∈ Ω. (175)

Combining (173) and (175), we conclude, as desired:

⟨C − A^o, A − A^o⟩ = 0, ∀A ∈ Ω. (176)

(⇐) Let A^o ∈ Ω be such that ⟨C − A^o, A − A^o⟩ = 0, ∀A ∈ Ω. We shall show that it must be the optimal solution. Let A ∈ Ω with A ≠ A^o. We have:

‖C − A‖²_F − ‖C − A^o‖²_F = ‖C − A^o + A^o − A‖²_F − ‖C − A^o‖²_F = ‖C − A^o‖²_F + ‖A^o − A‖²_F − 2⟨C − A^o, A − A^o⟩ − ‖C − A^o‖²_F > 0.
(177)

Hence, A^o is the optimal solution to the optimization problem, and thus A^o = Π_Ω(C) by definition.

Now we prove Lemma 2. Let Y = Π_{Ω_1}(C). From Lemma 4, we have:

⟨C − Y, A − Y⟩ = 0, ∀A ∈ Ω_1. (178)

Let Z = Π_{Ω_2}(Y). From Lemma 3, we have:

⟨Y − Z, A − Z⟩ ≤ 0, ∀A ∈ Ω_2. (179)

For Z = Π_{Ω_2}(Π_{Ω_1}(C)) to be the projection of C onto Ω_1 ∩ Ω_2, we need to show from Lemma 3 that:

⟨C − Z, A − Z⟩ ≤ 0, ∀A ∈ Ω_1 ∩ Ω_2, (180)

under the conditions in Lemma 2. For any A ∈ Ω_1 ∩ Ω_2, we have:

⟨C − Z, A − Z⟩ = ⟨C − Y + Y − Z, A − Z⟩
= ⟨C − Y, A − Z⟩ + ⟨Y − Z, A − Z⟩
= ⟨C − Y, A − Y + Y − Z⟩ + ⟨Y − Z, A − Z⟩
= ⟨C − Y, A − Y⟩ (= 0 from (178)) − ⟨C − Y, Z − Y⟩ (= 0 from (178) and Z ∈ Ω_1) + ⟨Y − Z, A − Z⟩ (≤ 0 from (179))
≤ 0, (181)

which concludes the proof.
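The variational characterization (180) can also be verified numerically: generate points of Ω_1 ∩ Ω_2, compute Z = Π_{Ω_2}(Π_{Ω_1}(C)) with the closed forms of Appendix D, and check that ⟨C − Z, A − Z⟩ ≤ 0. The sketch below is an illustration only; the helpers proj_omega1 and proj_omega2 are repeated from the previous listing so the snippet is self-contained, and all sizes and random inputs are synthetic assumptions rather than quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
M, P, eps = 6, 2, 0.1
U, _ = np.linalg.qr(rng.standard_normal((M, P)))
P_U = U @ U.T
I = np.eye(M)

def proj_omega1(D):
    return (I - P_U) @ (0.5 * (D + D.T)) @ (I - P_U) + P_U

def proj_omega2(C):
    lam, V = np.linalg.eigh(C - P_U)
    return P_U + V @ np.diag(np.clip(lam, -(1 - eps), 1 - eps)) @ V.T

def random_feasible():
    """A random element of Omega_1 ∩ Omega_2: P_U plus a symmetric term supported on range(I - P_U)."""
    S = rng.standard_normal((M, M)); S = 0.5 * (S + S.T)
    S = (I - P_U) @ S @ (I - P_U)
    S *= rng.uniform(0.0, 1 - eps) / max(np.linalg.norm(S, 2), 1e-12)  # keep ||A - P_U|| <= 1 - eps
    return P_U + S

C = rng.standard_normal((M, M)); C = 0.5 * (C + C.T)  # arbitrary symmetric matrix to be projected
Z = proj_omega2(proj_omega1(C))
vals = [np.trace((C - Z).T @ (A - Z)) for A in (random_feasible() for _ in range(1000))]
print(max(vals) <= 1e-10)  # all inner products non-positive, consistent with (180)
```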
APPENDIX F
STABILITY OF FOURTH-ORDER ERROR MOMENT

In this Appendix, we show that, under the same settings of Theorem 1 in [2] with the second-order moment condition (24) replaced by the fourth-order moment condition (29), the fourth-order moment of the network error vector is stable for sufficiently small µ, namely, (30) holds for small enough µ. We start by recalling that, for any two complex column vectors x and y, it holds that [4, pp. 523]:

‖x + y‖⁴ ≤ ‖x‖⁴ + 3‖y‖⁴ + 8‖x‖²‖y‖² + 4‖x‖² Re(x^*y).

Applying this inequality to eq. (60) in [2], conditioning on F_{i−1}, computing the conditional expectations of both sides (the term involving Re(x^*y) vanishes since the gradient noise is zero-mean conditioned on F_{i−1} by Assumption 2), taking expectations again, and exploiting the convexity of ‖x‖⁴, we conclude that:

E‖W̄^e_i‖⁴ ≤ E‖(I_{hP} − D_{11,i−1})W̄^e_{i−1} − D_{12,i−1}W̌^e_{i−1}‖⁴ + 3E‖s̄^e_i‖⁴ + 8(E‖(I_{hP} − D_{11,i−1})W̄^e_{i−1} − D_{12,i−1}W̌^e_{i−1}‖²)(E‖s̄^e_i‖²)

= E‖(1 − t)·(1/(1 − t))(I_{hP} − D_{11,i−1})W̄^e_{i−1} − t·(1/t)D_{12,i−1}W̌^e_{i−1}‖⁴ + 3E‖s̄^e_i‖⁴ + 8(E‖(1 − t)·(1/(1 − t))(I_{hP} − D_{11,i−1})W̄^e_{i−1} − t·(1/t)D_{12,i−1}W̌^e_{i−1}‖²)(E‖s̄^e_i‖²)

≤ (1/(1 − t)³)E[‖I_{hP} − D_{11,i−1}‖⁴‖W̄^e_{i−1}‖⁴] + (1/t³)E[‖D_{12,i−1}‖⁴‖W̌^e_{i−1}‖⁴] + 3E‖s̄^e_i‖⁴ + 8(E‖s̄^e_i‖²)((1/(1 − t))E[‖I_{hP} − D_{11,i−1}‖²‖W̄^e_{i−1}‖²] + (1/t)E[‖D_{12,i−1}‖²‖W̌^e_{i−1}‖²])

≤ ((1 − µσ_{11})⁴/(1 − t)³)E‖W̄^e_{i−1}‖⁴ + (µ⁴σ_{12}⁴/t³)E‖W̌^e_{i−1}‖⁴ + 3E‖s̄^e_i‖⁴ + 8(E‖s̄^e_i‖²)(((1 − µσ_{11})²/(1 − t))E‖W̄^e_{i−1}‖² + (µ²σ_{12}²/t)E‖W̌^e_{i−1}‖²), (182)

for any arbitrary positive number t ∈ (0, 1). In the last inequality we used the bounds (115) and (116) in [2]. By selecting t = µσ_{11}, we arrive at:

E‖W̄^e_i‖⁴ ≤ (1 − µσ_{11})E‖W̄^e_{i−1}‖⁴ + (µσ_{12}⁴/σ_{11}³)E‖W̌^e_{i−1}‖⁴ + 3E‖s̄^e_i‖⁴ + 8(E‖s̄^e_i‖²)((1 − µσ_{11})E‖W̄^e_{i−1}‖² + (µσ_{12}²/σ_{11})E‖W̌^e_{i−1}‖²).
(183)

Applying similar arguments to relation (61) in [2] and using the inequality ‖a + b + c‖⁴ ≤ 27‖a‖⁴ + 27‖b‖⁴ + 27‖c‖⁴, we obtain:

E‖W̌^e_i‖⁴ ≤ E‖(J^e_ε − D_{22,i−1})W̌^e_{i−1} − D_{21,i−1}W̄^e_{i−1} + b̌^e‖⁴ + 3E‖š^e_i‖⁴ + 8(E‖(J^e_ε − D_{22,i−1})W̌^e_{i−1} − D_{21,i−1}W̄^e_{i−1} + b̌^e‖²)(E‖š^e_i‖²)

= E‖t·(1/t)J^e_ε W̌^e_{i−1} − (1 − t)·(1/(1 − t))(D_{22,i−1}W̌^e_{i−1} + D_{21,i−1}W̄^e_{i−1} − b̌^e)‖⁴ + 3E‖š^e_i‖⁴ + 8(E‖t·(1/t)J^e_ε W̌^e_{i−1} − (1 − t)·(1/(1 − t))(D_{22,i−1}W̌^e_{i−1} + D_{21,i−1}W̄^e_{i−1} − b̌^e)‖²)(E‖š^e_i‖²)

≤ (1/t³)‖J^e_ε‖⁴E‖W̌^e_{i−1}‖⁴ + (1/(1 − t)³)E‖D_{22,i−1}W̌^e_{i−1} + D_{21,i−1}W̄^e_{i−1} − b̌^e‖⁴ + 3E‖š^e_i‖⁴ + 8(E‖š^e_i‖²)((1/t)‖J^e_ε‖²E‖W̌^e_{i−1}‖² + (1/(1 − t))E‖D_{22,i−1}W̌^e_{i−1} + D_{21,i−1}W̄^e_{i−1} − b̌^e‖²)

≤ (1/t³)‖J^e_ε‖⁴E‖W̌^e_{i−1}‖⁴ + (27/(1 − t)³)(E[‖D_{22,i−1}‖⁴‖W̌^e_{i−1}‖⁴] + E[‖D_{21,i−1}‖⁴‖W̄^e_{i−1}‖⁴] + ‖b̌^e‖⁴) + 3E‖š^e_i‖⁴ + 8(E‖š^e_i‖²)((1/t)‖J^e_ε‖²E‖W̌^e_{i−1}‖² + (3/(1 − t))(E[‖D_{22,i−1}‖²‖W̌^e_{i−1}‖²] + E[‖D_{21,i−1}‖²‖W̄^e_{i−1}‖²] + ‖b̌^e‖²))

≤ ((ρ(J_ε) + ε)⁴/t³)E‖W̌^e_{i−1}‖⁴ + (27µ⁴σ_{22}⁴/(1 − t)³)E‖W̌^e_{i−1}‖⁴ + (27µ⁴σ_{21}⁴/(1 − t)³)E‖W̄^e_{i−1}‖⁴ + (27/(1 − t)³)‖b̌^e‖⁴ + 3E‖š^e_i‖⁴ + 8(E‖š^e_i‖²)(((ρ(J_ε) + ε)²/t)E‖W̌^e_{i−1}‖² + (3µ²σ_{22}²/(1 − t))E‖W̌^e_{i−1}‖² + (3µ²σ_{21}²/(1 − t))E‖W̄^e_{i−1}‖² + (3/(1 − t))‖b̌^e‖²), (184)

for any arbitrary positive number t ∈ (0, 1). In the last inequality we used relation (122) in [2]. Selecting t = ρ(J_ε) + ε < 1, we arrive at:

E‖W̌^e_i‖⁴ ≤ (ρ(J_ε) + ε)E‖W̌^e_{i−1}‖⁴ + (27µ⁴σ_{22}⁴/(1 − t)³)E‖W̌^e_{i−1}‖⁴ + (27µ⁴σ_{21}⁴/(1 − t)³)E‖W̄^e_{i−1}‖⁴ + (27/(1 − t)³)‖b̌^e‖⁴ + 3E‖š^e_i‖⁴ + 8(E‖š^e_i‖²)((ρ(J_ε) + ε)E‖W̌^e_{i−1}‖² + (3µ²σ_{22}²/(1 − t))E‖W̌^e_{i−1}‖² + (3µ²σ_{21}²/(1 − t))E‖W̄^e_{i−1}‖² + (3/(1 − t))‖b̌^e‖²), (185)

where 1 − t = 1 − ρ(J_ε) − ε.

In order to bound the fourth-order noise terms E‖s̄^e_i‖⁴ and E‖š^e_i‖⁴ appearing in (183) and (185), we first note from eq.
(58) in [2] that:

E‖s̄^e_i‖⁴ + E‖š^e_i‖⁴ ≤ E(‖s̄^e_i‖² + ‖š^e_i‖²)² = E‖µ(V^e_ε)^{-1}A^e s^e_i‖⁴ ≤ µ⁴v_1⁴E‖s^e_i‖⁴, (186)

where v_1 ≜ ‖(V^e_ε)^{-1}A^e‖. Now, applying Jensen's inequality to the convex function f(x) = x², we can write:

E‖s^e_i‖⁴ = E(‖s^e_i‖²)² = 4E(Σ_{k=1}^{N}‖s_{k,i}‖²)² = 4N²E((1/N)Σ_{k=1}^{N}‖s_{k,i}‖²)² ≤ 4N²E((1/N)Σ_{k=1}^{N}‖s_{k,i}‖⁴) = 4NΣ_{k=1}^{N}E‖s_{k,i}‖⁴, (187)

in terms of the individual gradient noise fourth moments E‖s_{k,i}‖⁴. For each term s_{k,i}, we have from (29) and from Jensen's inequality applied to the convex function ‖x‖⁴:

E‖s_{k,i}(w_{k,i−1})‖⁴ ≤ (β_{4,k}⁴/h)E‖w_{k,i−1}‖⁴ + σ_{s4,k}⁴ = (β_{4,k}⁴/h)E‖w_{k,i−1} − w_k^o + w_k^o‖⁴ + σ_{s4,k}⁴ ≤ 8(β_{4,k}⁴/h)E‖w̃_{k,i−1}‖⁴ + 8(β_{4,k}⁴/h)‖w_k^o‖⁴ + σ_{s4,k}⁴ ≤ β̄_{4,k}⁴E‖w̃_{k,i−1}‖⁴ + σ̄_{s4,k}⁴, (188)

where β̄_{4,k}⁴ = 8(β_{4,k}⁴/h) and σ̄_{s4,k}⁴ = 8(β_{4,k}⁴/h)‖w_k^o‖⁴ + σ_{s4,k}⁴. Using the relations:

Σ_{k=1}^{N}‖w̃_{k,i−1}‖⁴ ≤ (‖w̃_{1,i−1}‖² + ‖w̃_{2,i−1}‖² + ⋯ + ‖w̃_{N,i−1}‖²)² = ‖W̃_{i−1}‖⁴ = ((1/2)‖W̃^e_{i−1}‖²)² = (1/4)‖W̃^e_{i−1}‖⁴, (189)

‖(V^e_ε)^{-1}W̃^e_{i−1}‖⁴ = (‖W̄^e_{i−1}‖² + ‖W̌^e_{i−1}‖²)² ≤ 2‖W̄^e_{i−1}‖⁴ + 2‖W̌^e_{i−1}‖⁴, (190)

the term E‖s^e_i‖⁴ in (187) can be bounded as follows:

E‖s^e_i‖⁴ ≤ 4NΣ_{k=1}^{N}β̄_{4,k}⁴E‖w̃_{k,i−1}‖⁴ + 4NΣ_{k=1}^{N}σ̄_{s4,k}⁴ ≤ β_{4,max}⁴Σ_{k=1}^{N}E‖w̃_{k,i−1}‖⁴ + σ_{s4}⁴
(189)≤ β_{4,max}⁴E‖V^e_ε(V^e_ε)^{-1}W̃^e_{i−1}‖⁴ + σ_{s4}⁴ ≤ β_{4,max}⁴‖V^e_ε‖⁴E‖(V^e_ε)^{-1}W̃^e_{i−1}‖⁴ + σ_{s4}⁴
(190)≤ 2β_{4,max}⁴v_2⁴[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + σ_{s4}⁴, (191)

where β_{4,max}⁴ ≜ 4N max_{1≤k≤N} β̄_{4,k}⁴, σ_{s4}⁴ ≜ 4NΣ_{k=1}^{N}σ̄_{s4,k}⁴, and v_2 ≜ ‖V^e_ε‖. Substituting into (186), we get:

E‖s̄^e_i‖⁴ + E‖š^e_i‖⁴ ≤ 2µ⁴β_{4,max}⁴v_1⁴v_2⁴[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + µ⁴v_1⁴σ_{s4}⁴. (192)

Returning to (183), and using the bounds (128) in [2] and (192), we find that:

E‖W̄^e_i‖⁴ ≤ (1 − µσ_{11})E‖W̄^e_{i−1}‖⁴ + (µσ_{12}⁴/σ_{11}³)E‖W̌^e_{i−1}‖⁴ + 6µ⁴β_{4,max}⁴v_1⁴v_2⁴[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + 3µ⁴v_1⁴σ_{s4}⁴ + 8µ²v_1²β²v_2²(1 − µσ_{11})(E‖W̄^e_{i−1}‖²)² + 8µ²v_1²β²v_2²(1 − µσ_{11})E‖W̌^e_{i−1}‖²E‖W̄^e_{i−1}‖² + 8µ²v_1²σ_s²(1 − µσ_{11})E‖W̄^e_{i−1}‖² + 8µ³(σ_{12}²/σ_{11})v_1²β²v_2²E‖W̄^e_{i−1}‖²E‖W̌^e_{i−1}‖² + 8µ³(σ_{12}²/σ_{11})v_1²β²v_2²(E‖W̌^e_{i−1}‖²)² + 8µ³(σ_{12}²/σ_{11})v_1²σ_s²E‖W̌^e_{i−1}‖². (193)
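The Jensen step used in (187), namely (Σ_k ‖s_k‖²)² ≤ N Σ_k ‖s_k‖⁴, is a purely deterministic inequality and can be sanity-checked with random data. The toy NumPy sketch below uses illustrative sizes only; the per-agent vectors are synthetic stand-ins for the gradient noise terms.

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim = 10, 5
for _ in range(1000):
    s = rng.standard_normal((N, dim)) + 1j * rng.standard_normal((N, dim))  # per-agent noise vectors
    sq = np.sum(np.abs(s) ** 2, axis=1)  # ||s_k||^2 for each agent
    assert (sq.sum()) ** 2 <= N * np.sum(sq ** 2) + 1e-9
print("Jensen step in (187) holds on all trials")
```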
Using the properties that, for any two random variables a and c, it holds that [4, pp. 528]:

(Ea)² ≤ Ea²,  (Ea)(Ec) ≤ (1/2)Ea² + (1/2)Ec²,

we can write:

(E‖W̄^e_{i−1}‖²)(E‖W̌^e_{i−1}‖²) ≤ (1/2)E‖W̄^e_{i−1}‖⁴ + (1/2)E‖W̌^e_{i−1}‖⁴, (194)
(E‖W̄^e_{i−1}‖²)² ≤ E‖W̄^e_{i−1}‖⁴, (195)
(E‖W̌^e_{i−1}‖²)² ≤ E‖W̌^e_{i−1}‖⁴, (196)

so that:

E‖W̄^e_i‖⁴ ≤ aE‖W̄^e_{i−1}‖⁴ + bE‖W̌^e_{i−1}‖⁴ + a′E‖W̄^e_{i−1}‖² + b′E‖W̌^e_{i−1}‖² + e, (197)

where:

a = 1 − µσ_{11} + O(µ²),  b = O(µ),  a′ = O(µ²),  b′ = O(µ³),  e = O(µ⁴). (198)

Returning to (185) and using similar arguments, we can verify that:

E‖W̌^e_i‖⁴ ≤ (ρ(J_ε) + ε)E‖W̌^e_{i−1}‖⁴ + (27µ⁴σ_{22}⁴/(1 − ρ(J_ε) − ε)³)E‖W̌^e_{i−1}‖⁴ + (27µ⁴σ_{21}⁴/(1 − ρ(J_ε) − ε)³)E‖W̄^e_{i−1}‖⁴ + (27/(1 − ρ(J_ε) − ε)³)‖b̌^e‖⁴ + 6µ⁴β_{4,max}⁴v_1⁴v_2⁴[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + 3µ⁴v_1⁴σ_{s4}⁴ + 4(ρ(J_ε) + ε)µ²v_1²β²v_2²[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + 8(ρ(J_ε) + ε)µ²v_1²β²v_2²E‖W̌^e_{i−1}‖⁴ + 8(ρ(J_ε) + ε)µ²v_1²σ_s²E‖W̌^e_{i−1}‖² + (12µ⁴v_1²β²v_2²σ_{22}²/(1 − ρ(J_ε) − ε))[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + (24µ⁴v_1²β²v_2²σ_{22}²/(1 − ρ(J_ε) − ε))E‖W̌^e_{i−1}‖⁴ + (24µ⁴σ_{22}²v_1²σ_s²/(1 − ρ(J_ε) − ε))E‖W̌^e_{i−1}‖² + (24µ⁴σ_{21}²v_1²β²v_2²/(1 − ρ(J_ε) − ε))E‖W̄^e_{i−1}‖⁴ + (12µ⁴σ_{21}²v_1²β²v_2²/(1 − ρ(J_ε) − ε))[E‖W̄^e_{i−1}‖⁴ + E‖W̌^e_{i−1}‖⁴] + (24µ⁴σ_{21}²v_1²σ_s²/(1 − ρ(J_ε) − ε))E‖W̄^e_{i−1}‖² + (24µ²v_1²β²v_2²/(1 − ρ(J_ε) − ε))‖b̌^e‖²[E‖W̄^e_{i−1}‖² + E‖W̌^e_{i−1}‖²] + (24µ²v_1²σ_s²/(1 − ρ(J_ε) − ε))‖b̌^e‖², (199)

so that:

E‖W̌^e_i‖⁴ ≤ cE‖W̄^e_{i−1}‖⁴ + dE‖W̌^e_{i−1}‖⁴ + c′E‖W̄^e_{i−1}‖² + d′E‖W̌^e_{i−1}‖² + f, (200)

where the coefficients {c, d, c′, d′, f} have the following form:

c = O(µ²),  d = ρ(J_ε) + ε + O(µ²),  c′ = O(µ⁴),  d′ = O(µ²),  f = O(µ⁴). (201)

Therefore, we can write:

col{E‖W̄^e_i‖⁴, E‖W̌^e_i‖⁴} ⪯ [a b; c d] col{E‖W̄^e_{i−1}‖⁴, E‖W̌^e_{i−1}‖⁴} + [a′ b′; c′ d′] col{E‖W̄^e_{i−1}‖², E‖W̌^e_{i−1}‖²} + col{e, f}, (202)

in terms of the 2 × 2 coefficient matrix Γ ≜ [a b; c d], which has the form (132) in [2] and is a stable matrix for sufficiently small µ and ε.
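To see how the coupled recursion (197), (200), (202) produces steady-state fourth-order moments on the order of µ² and µ⁴ (cf. (204) below), one can iterate a scalarized version of (202). The sketch below uses entirely synthetic constants: the coefficients and the helper name steady_state are placeholders chosen only to respect the orders in (198) and (201), not values from the analysis.

```python
import numpy as np

def steady_state(mu, rho=0.5, n_iter=200000):
    # Placeholder coefficients respecting the orders in (198) and (201)
    a, b   = 1 - mu,  mu
    c, d   = mu**2,   rho + mu**2
    ap, bp = mu**2,   mu**3
    cp, dp = mu**4,   mu**2
    e, f   = mu**4,   mu**4
    # Stand-ins for the steady-state second-order moments, taken to be O(mu) and O(mu^2)
    m2_bar, m2_check = mu, mu**2
    x = np.zeros(2)  # [fourth moment of the bar component, fourth moment of the check component]
    for _ in range(n_iter):
        x = np.array([
            a * x[0] + b * x[1] + ap * m2_bar + bp * m2_check + e,
            c * x[0] + d * x[1] + cp * m2_bar + dp * m2_check + f,
        ])
    return x

for mu in (1e-2, 1e-3):
    x = steady_state(mu)
    print(mu, x[0] / mu**2, x[1] / mu**4)  # both ratios remain O(1) as mu decreases
```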
Moreover, using relation (135) in [2], we have:

lim sup_{i→∞} [a′ b′; c′ d′] col{E‖W̄^e_{i−1}‖², E‖W̌^e_{i−1}‖²} = col{O(µ³), O(µ⁴)}. (203)

In this case, we can iterate (202) and use relation (134) in [2] to conclude that:

lim sup_{i→∞} E‖W̄^e_i‖⁴ = O(µ²),  lim sup_{i→∞} E‖W̌^e_i‖⁴ = O(µ⁴), (204)

and, therefore,

lim sup_{i→∞} E‖W̃^e_i‖⁴ = lim sup_{i→∞} E‖V^e_ε col{W̄^e_i, W̌^e_i}‖⁴ ≤ v_2⁴ lim sup_{i→∞} E(‖W̄^e_i‖² + ‖W̌^e_i‖²)² ≤ 2v_2⁴ lim sup_{i→∞} (E‖W̄^e_i‖⁴ + E‖W̌^e_i‖⁴) = O(µ²). (205)

APPENDIX G
STABILITY OF THE COEFFICIENT MATRIX B

In this Appendix, we show that, under the same settings of Theorem 1, the constant matrix B defined by (42) is stable for sufficiently small step-sizes. To establish this, we use similar arguments as in [4], [6]. We first note that the matrix B in (42) is similar to the matrix in (91), and therefore has the same eigenvalues as the block matrix:

B ∼ [I_{hP} − D_{11}, −D_{12}; −D_{21}, J^e_ε − D_{22}], (206)

where the block entries {D_{mn}} are given by (92)–(93). In a manner similar to the arguments used in the proof of Theorem 1 in [2], we can verify that:

D_{11} = O(µ),  D_{12} = O(µ), (207)
D_{21} = O(µ),  D_{22} = O(µ), (208)
ρ(I_{hP} − D_{11}) = 1 − σ_{11}µ = 1 − O(µ), (209)

where σ_{11} is a positive scalar independent of µ. Thus, we obtain:

B ∼ [I_{hP} − O(µ), O(µ); O(µ), J^e_ε + O(µ)]. (210)

Now recall that the matrix J^e_ε defined in Table I is h(M − P) × h(M − P) and has a Jordan structure. We consider here the complex data case since the real data case can be easily deduced from the complex case by removing the block (J^*_ε)^⊤. The matrix J^e_ε can be expressed in the following upper-triangular form:

J^e_ε = blkdiag{ [λ_{a,1}  K; ⋱; λ_{a,L}], [λ^*_{a,1}  K; ⋱; λ^*_{a,L}] }, (211)

with scalars {λ_{a,ℓ}, λ^*_{a,ℓ}} on the diagonal, all of which have magnitude strictly less than one, and where the entries of the strictly upper-triangular matrix K are either ε or zero. It follows that:

J^e_ε + O(µ) = blkdiag{ [λ_{a,1} + O(µ), K + O(µ); ⋱; O(µ), λ_{a,L} + O(µ)], [λ^*_{a,1} + O(µ), K + O(µ); ⋱; O(µ), λ^*_{a,L} + O(µ)] }. (212)

We introduce the eigen-decomposition of the Hermitian positive-definite matrix D_{11} and denote it by [4], [6]:

D_{11} ≜ U_d Λ_d U_d^*, (213)

where U_d is unitary and Λ_d has positive diagonal entries {λ_k}; the matrices U_d and Λ_d are hP × hP. Using U_d, we further introduce the following block-diagonal similarity transformation:

T ≜ diag{µ^{P/M} U_d, µ^{(hP+1)/hM}, …, µ^{(hM−1)/hM}, µ}. (214)

We now use (91) to get:

T^{-1} B T =
[ B_{11}               O(µ^{(hM+1)/hM})   ⋯   O(µ^{(hM+1)/hM}) ]
[ O(µ^{P/M})           λ_{a,1} + O(µ)     ⋯   O(µ^{1/hM})      ]
[ ⋮                                        ⋱   ⋮                ]
[ O(µ^{(hM−1)/hM})     ⋯                       λ^*_{a,L} + O(µ) ], (215)

where we introduced the hP × hP diagonal matrix:

B_{11} ≜ I_{hP} − Λ_d. (216)

It follows that all off-diagonal entries of the above transformed matrix are at most O(µ^{1/hM}). Although the factor µ^{1/hM} decays more slowly than µ, it nevertheless becomes small for sufficiently small µ.
Calling upon Gershgorin's theorem¹, we conclude that the eigenvalues of B are either located in the Gershgorin circles that are centered at the eigenvalues of B_{11} with radii O(µ^{(hM+1)/hM}), or in the Gershgorin circles that are centered at the {λ_{a,ℓ}, λ^*_{a,ℓ}} (shifted by O(µ)) with radii O(µ^{1/hM}), namely,

|λ(B) − λ(B_{11})| ≤ O(µ^{(hM+1)/hM}),  or  |λ(B) − λ_{a,ℓ} + O(µ)| ≤ O(µ^{1/hM}),  or  |λ(B) − λ^*_{a,ℓ} + O(µ)| ≤ O(µ^{1/hM}), (217)

where λ(B) and λ(B_{11}) denote any of the eigenvalues of B and B_{11}, respectively, and ℓ = 1, …, L. It follows that:

ρ(B) ≤ ρ(B_{11}) + O(µ^{(hM+1)/hM}),  or  ρ(B) ≤ ρ(J_ε) + O(µ) + O(µ^{1/hM}). (218)

Now, since J_ε is a stable matrix, we know that ρ(J_ε) < 1. We express this spectral radius as:

ρ(J_ε) = 1 − δ_J, (219)

where δ_J is positive and independent of µ. We also know from (209) that:

ρ(B_{11}) = 1 − σ_{11}µ < 1, (220)

since B_{11} = U_d^*(I_{hP} − D_{11})U_d. We conclude from (218) that:

ρ(B) ≤ 1 − σ_{11}µ + O(µ^{(hM+1)/hM}),  or  ρ(B) ≤ 1 − δ_J + O(µ) + O(µ^{1/hM}). (221)

If we now select µ ≪ 1 small enough such that:

O(µ^{(hM+1)/hM}) < σ_{11}µ,  and  O(µ^{1/hM}) + O(µ) < δ_J, (222)

then we can conclude that ρ(B) < 1, so that B is stable for sufficiently small step-sizes, as claimed.

If we exploit the structure of B in (91), we can further show, for sufficiently small step-sizes, that:

‖(I − B)^{-1}‖ = O(1/µ), (223)

(I − B)^{-1} = [O(1/µ), O(1); O(1), O(1)], (224)

where the leading (1,1) block in (I − B)^{-1} has dimensions hP × hP. To establish this, we first note that, by similarity, the block matrix in (206) is stable. Let

X = I − [I_{hP} − D_{11}, −D_{12}; −D_{21}, J^e_ε − D_{22}] = [D_{11}, D_{12}; D_{21}, I − J^e_ε + D_{22}] ≜ [X_{11}, X_{12}; X_{21}, X_{22}], (225)

where from (207)–(208), we have:

X_{11} = O(µ),  X_{12} = O(µ), (226)
X_{21} = O(µ),  X_{22} = O(1). (227)

The matrix X is invertible since I − B is invertible. Moreover, X_{11} is invertible since D_{11} is Hermitian positive-definite. Using the block matrix inversion formula, we can write:

X^{-1} = [X_{11}^{-1} + X_{11}^{-1}X_{12}Δ^{-1}X_{21}X_{11}^{-1}, −X_{11}^{-1}X_{12}Δ^{-1}; −Δ^{-1}X_{21}X_{11}^{-1}, Δ^{-1}], (228)

where Δ denotes the Schur complement of X relative to X_{11}:

Δ = X_{22} − X_{21}X_{11}^{-1}X_{12} = O(1). (229)

We then use (226)–(227) to conclude (224).

¹ Consider an N × N matrix A with scalar entries {a_{kℓ}}. With each diagonal entry a_{kk} we associate a disc in the complex plane centered at a_{kk} and with radius r_k = Σ_{ℓ=1, ℓ≠k}^{N} |a_{kℓ}|. That is, r_k is equal to the sum of the magnitudes of the non-diagonal entries on the same row as a_{kk}. We denote the disc by D_k; it consists of all points that satisfy D_k = {z ∈ C : |z − a_{kk}| ≤ r_k}. Gershgorin's theorem states that the spectrum of A (i.e., the set of all its eigenvalues, denoted by λ(A)) is contained in the union of all N Gershgorin discs: λ(A) ⊂ ∪_{k=1}^{N} D_k. A stronger statement of the theorem covers the situation in which some of the Gershgorin discs happen to be disjoint. Specifically, if the union of L discs is disjoint from the union of the remaining N − L discs, then the theorem further asserts that L eigenvalues of A will lie in the first union of L discs and the remaining N − L eigenvalues of A will lie in the second union of N − L discs.
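The conclusion of this appendix can be illustrated with a quick numerical experiment. In the sketch below, every matrix is a random stand-in generated only for illustration (the sizes hP, hQ, the matrix H, and the diagonal matrix J are assumptions and are not the quantities defined in the paper): the spectral radius of the block matrix with the structure of (206) stays below one for small µ, and the (1,1) block of (I − B)^{-1} grows like 1/µ, as in (224).

```python
import numpy as np

rng = np.random.default_rng(4)
hP, hQ = 3, 4  # sizes of the two diagonal blocks (illustrative)

# Synthetic ingredients: Hermitian positive-definite H (so that D11 = mu*H) and a stable diagonal J
H = rng.standard_normal((hP, hP)); H = H @ H.T + hP * np.eye(hP)
J = np.diag(rng.uniform(0.3, 0.7, size=hQ))

def build_B(mu):
    D11 = mu * H
    D12 = mu * rng.standard_normal((hP, hQ))
    D21 = mu * rng.standard_normal((hQ, hP))
    D22 = mu * rng.standard_normal((hQ, hQ))
    return np.block([[np.eye(hP) - D11, -D12], [-D21, J - D22]])

for mu in (1e-2, 1e-3, 1e-4):
    B = build_B(mu)
    rho = np.max(np.abs(np.linalg.eigvals(B)))
    inv_block = np.linalg.inv(np.eye(hP + hQ) - B)[:hP, :hP]
    # Stable for small mu, and mu * ||(I - B)^{-1}_{11}|| stays O(1), i.e., the block is O(1/mu)
    print(mu, rho < 1, np.linalg.norm(inv_block, 2) * mu)
```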
APPENDIX H
STABILITY OF FIRST-ORDER ERROR MOMENT OF (47)

In this Appendix, we show that, under the same settings of Theorem 1, the first-order moment of the error of the long-term model (47) is stable for sufficiently small step-sizes, namely, it holds that:

lim sup_{i→∞} ‖E W̃^{e′}_i‖ = O(µ). (230)

We first multiply both sides of recursion (125) from the left by (V^e_ε)^{-1} and use relation (59) in [2] to get:

col{E W̄^{e′}_i, E W̌^{e′}_i} (≜ y_i) = [I_{hP} − D_{11}, −D_{12}; −D_{21}, J^e_ε − D_{22}] (≜ B) col{E W̄^{e′}_{i−1}, E W̌^{e′}_{i−1}} (≜ y_{i−1}) + col{0, b̌^e}, (231)

where the matrix B in (91) is stable, as shown in Appendix G. Recursion (231) can be written more compactly as:

y_i = B y_{i−1} + col{0, b̌^e}. (232)

Since B is stable and b̌^e = O(µ), we conclude from (232) and (224) that:

lim_{i→∞} y_i = (I − B)^{-1} col{0, b̌^e} (224)= [O(1/µ), O(1); O(1), O(1)] col{0, O(µ)} = O(µ). (233)

It follows that:

lim sup_{i→∞} ‖col{E W̄^{e′}_i, E W̌^{e′}_i}‖ = O(µ), (234)

and, hence,

lim sup_{i→∞} ‖E W̃^{e′}_i‖ = lim sup_{i→∞} ‖V^e_ε col{E W̄^{e′}_i, E W̌^{e′}_i}‖ ≤ ‖V^e_ε‖ lim sup_{i→∞} ‖col{E W̄^{e′}_i, E W̌^{e′}_i}‖ = O(µ). (235)
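The limit (233) can likewise be illustrated by iterating (232) with synthetic quantities. As in the previous sketch, the block matrix and the driving term below are stand-ins chosen only to respect the orders used in the argument (a stable B and a vector col{0, b̌} with b̌ = O(µ)); they are not the matrices defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
hP, hQ = 3, 4
H = rng.standard_normal((hP, hP)); H = H @ H.T + hP * np.eye(hP)
J = np.diag(rng.uniform(0.3, 0.7, size=hQ))

for mu in (1e-2, 1e-3):
    B = np.block([[np.eye(hP) - mu * H, -mu * rng.standard_normal((hP, hQ))],
                  [-mu * rng.standard_normal((hQ, hP)), J - mu * rng.standard_normal((hQ, hQ))]])
    b = np.concatenate([np.zeros(hP), mu * rng.standard_normal(hQ)])  # col{0, b_check}, with b_check = O(mu)
    y = np.zeros(hP + hQ)
    for _ in range(int(50 / mu)):       # enough iterations for y_i = B y_{i-1} + b to settle
        y = B @ y + b
    y_inf = np.linalg.solve(np.eye(hP + hQ) - B, b)
    # The iterates converge to (I - B)^{-1} b, whose norm is O(mu), consistent with (233)
    print(mu, np.linalg.norm(y - y_inf) < 1e-8, np.linalg.norm(y_inf) / mu)
```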
REFERENCES

[1] R. Nassif, S. Vlaski, and A. H. Sayed, "Distributed inference over networks under subspace constraints," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., Brighton, UK, May 2019, pp. 1–5.
[2] R. Nassif, S. Vlaski, and A. H. Sayed, "Adaptation and learning over networks under subspace constraints – Part I: Stability analysis," Submitted for publication, May 2019.
[3] A. H. Sayed, S. Y. Tu, J. Chen, X. Zhao, and Z. J. Towfic, "Diffusion strategies for adaptation and learning over networks," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 155–171, 2013.
[4] A. H. Sayed, "Adaptation, learning, and optimization over networks," Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
[5] J. Chen and A. H. Sayed, "Distributed Pareto optimization via diffusion strategies," IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 205–220, Apr. 2013.
[6] X. Zhao and A. H. Sayed, "Asynchronous adaptation and learning over networks – Part I: Modeling and stability analysis," IEEE Trans. Signal Process., vol. 63, no. 4, pp. 811–826, Feb. 2015.
[7] S. Vlaski and A. H. Sayed, "Diffusion learning in non-convex environments," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., Brighton, UK, May 2019.
[8] M. G. Rabbat and R. D. Nowak, "Quantized incremental algorithms for distributed optimization," IEEE J. Sel. Areas Commun., vol. 23, no. 4, pp. 798–808, Apr. 2005.
[9] S. Kar and J. M. F. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, Jan. 2009.
[10] K. Srivastava and A. Nedic, "Distributed asynchronous constrained stochastic optimization," IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 772–790, Aug. 2011.
[11] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proc. IEEE, vol. 95, no. 1, pp. 215–233, 2007.
[12] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
[13] J. Plata-Chaves, A. Bertrand, M. Moonen, S. Theodoridis, and A. M. Zoubir, "Heterogeneous and multitask wireless sensor networks – Algorithms, applications, and challenges," IEEE J. Sel. Topics Signal Process., vol. 11, no. 3, pp. 450–465, 2017.
[14] J. Chen, C. Richard, and A. H. Sayed, "Diffusion LMS over multitask networks," IEEE Trans. Signal Process., vol. 63, no. 11, pp. 2733–2748, Jun. 2015.
[15] R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, "Proximal multitask learning over networks with sparsity-inducing coregularization," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6329–6344, Dec. 2016.
[16] C. Eksin and A. Ribeiro, "Distributed network optimization with heuristic rational agents," IEEE Trans. Signal Process., vol. 60, no. 10, pp. 5396–5411, Oct. 2012.
[17] A. Hassani, J. Plata-Chaves, M. H. Bahari, M. Moonen, and A. Bertrand, "Multi-task wireless sensor network for joint distributed node-specific signal enhancement, LCMV beamforming and DOA estimation," IEEE J. Sel. Topics Signal Process., vol. 11, no. 3, pp. 518–533, 2017.
[18] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Püschel, "Distributed optimization with local domains: Applications in MPC and network flows," IEEE Trans. Autom. Control, vol. 60, no. 7, pp. 2004–2009, Jul. 2015.
[19] X. Zhao and A. H. Sayed, "Distributed clustering and learning over networks," IEEE Trans. Signal Process., vol. 63, no. 13, pp. 3285–3300, Jul. 2015.
[20] S. Barbarossa, G. Scutari, and T. Battisti, "Distributed signal subspace projection algorithms with maximum convergence rate for sensor networks with topological constraints," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., Taipei, Taiwan, Apr. 2009, pp. 2893–2896.
[21] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
[22] R. H. Koning, H. Neudecker, and T. Wansbeek, "Block Kronecker products and the vecb operator," Linear Algebra Appl., vol. 149, pp. 165–184, Apr. 1991.
[23] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.
[24] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[25] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke et al., Ed., pp. 185–212, Springer, New York, NY, USA, 2011.
[26] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, Jan. 2014.
[27] L. Isserlis, "On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables,"