Diffusion Adaptation Strategies for Distributed Optimization and Learning over Networks
Jianshu Chen, Student Member, IEEE, and Ali H. Sayed, Fellow, IEEE
Abstract
We propose an adaptive diffusion mechanism to optimize a global cost function in a distributed manner over a network of nodes. The cost function is assumed to consist of a collection of individual components. Diffusion adaptation allows the nodes to cooperate and diffuse information in real-time; it also helps alleviate the effects of stochastic gradient noise and measurement noise through a continuous learning process. We analyze the mean-square-error performance of the algorithm in some detail, including its transient and steady-state behavior. We also apply the diffusion algorithm to two problems: distributed estimation with sparse parameters and distributed localization. Compared to well-studied incremental methods, diffusion methods do not require the use of a cyclic path over the nodes and are robust to node and link failure. Diffusion methods also endow networks with adaptation abilities that enable the individual nodes to continue learning even when the cost function changes with time. Examples involving such dynamic cost functions with moving targets are common in the context of biological networks.
Index Terms
Distributed optimization, diffusion adaptation, incremental techniques, learning, energy conservation, biological networks, mean-square performance, convergence, stability.
I. INTRODUCTION
Manuscript received October 30, 2011; revised March 15, 2012. This work was supported in part by NSF grants CCF-1011918 and CCF-0942936. Preliminary results related to this work are reported in the conference presentations [1] and [2]. The authors are with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095. Email: {jshchen, sayed}@ee.ucla.edu. May 15, 2012 DRAFT

We consider the problem of optimizing a global cost function in a distributed manner. The cost function is assumed to consist of the sum of individual components, and spatially distributed nodes are used to seek the common minimizer (or maximizer) through local interactions. Such problems abound in the context of biological networks, where agents collaborate with each other via local interactions for a common objective, such as locating food sources or evading predators [3]. Similar problems are common in distributed resource allocation applications and in online machine learning procedures. In the latter case, data that are generated by the same underlying distribution are processed in a distributed manner over a network of learners in order to recover the model parameters (e.g., [4], [5]).

There are already a few useful techniques for the solution of optimization problems in a distributed manner [6]–[24]. Most notable among these methods are the incremental approach [6]–[9] and the consensus approach [10]–[23]. In the incremental approach, a cyclic path is defined over the nodes and data are processed in a cyclic manner through the network until optimization is achieved. However, determining a cyclic path that covers all nodes is known to be an NP-hard problem [25] and, in addition, cyclic trajectories are prone to link and node failures. When any of the edges along the path fails, the sharing of data through the cyclic trajectory is interrupted and the algorithm stops performing. In the consensus approach, vanishing step-sizes are used to ensure that nodes reach consensus and converge to the same optimizer in steady-state.
However, in time-varying environments, diminishing step-sizes prevent the network from continuous learning and optimization; when the step-sizes die out, the network stops learning. In earlier publications [26]–[35], and motivated by our work on adaptation and learning over networks, we introduced the concept of diffusion adaptation and showed how this technique can be used to solve global minimum mean-square-error estimation problems efficiently, both in real-time and in a distributed manner. In the diffusion approach, information is processed locally and simultaneously at all nodes, and the processed data are diffused through a real-time sharing mechanism that ripples through the network continuously. Diffusion adaptation was applied to model complex patterns of behavior encountered in biological networks, such as bird flight formations [36] and fish schooling [3]. Diffusion adaptation was also applied to solve dynamic resource allocation problems in cognitive radios [37], to perform robust system identification [38], and to implement distributed online learning in pattern recognition applications [5].

This paper generalizes the diffusive learning process and applies it to the distributed optimization of a wide class of cost functions. The diffusion approach will be shown to alleviate the effect of gradient noise on convergence. Most other studies on distributed optimization tend to focus on the almost-sure convergence of the algorithms under diminishing step-size conditions [6], [7], [39]–[43], or on convergence under deterministic conditions on the data [6]–[8], [15]. In this article we instead examine the distributed algorithms from a mean-square-error perspective at constant step-sizes.
This is because constant step-sizes are necessary for continuous adaptation, learning, and tracking, which in turn enable the resulting algorithms to perform well even under data that exhibit statistical variations, measurement noise, and gradient noise.

This paper is organized as follows. In Sec. II, we introduce the global cost function and approximate it by a distributed optimization problem through the use of a second-order Taylor series expansion. In Sec. III, we show that optimizing the localized alternative cost at each node $k$ leads naturally to diffusion adaptation strategies. In Sec. IV, we analyze the mean-square performance of the diffusion algorithms under statistical perturbations when stochastic gradients are used. In Sec. V, we apply the diffusion algorithms to two application problems: sparse distributed estimation and distributed localization. Finally, in Sec. VI, we conclude the paper.

Notation. Throughout the paper, all vectors are column vectors except for the regressors $\{u_{k,i}\}$, which are taken to be row vectors for simplicity of notation. We use boldface letters to denote random quantities (such as $\boldsymbol{u}_{k,i}$) and regular font letters to denote their realizations or deterministic variables (such as $u_{k,i}$). We write $\mathbb{E}$ to denote the expectation operator. We use $\mathrm{diag}\{x_1,\ldots,x_N\}$ to denote a diagonal matrix consisting of diagonal entries $x_1,\ldots,x_N$, and use $\mathrm{col}\{x_1,\ldots,x_N\}$ to denote a column vector formed by stacking $x_1,\ldots,x_N$ on top of each other. For symmetric matrices $X$ and $Y$, the notation $X \le Y$ denotes $Y - X \ge 0$, namely, that the matrix difference $Y - X$ is positive semi-definite.

II. PROBLEM FORMULATION
The objective is to determine, in a collaborative and distributed manner, the $M \times 1$ column vector $w^o$ that minimizes a global cost of the form:
$$J^{\mathrm{glob}}(w) = \sum_{l=1}^{N} J_l(w) \qquad (1)$$
where $J_l(w)$, $l = 1, 2, \ldots, N$, are individual real-valued functions, defined over $w \in \mathbb{R}^M$ and assumed to be differentiable and strictly convex. Then, $J^{\mathrm{glob}}(w)$ in (1) is also strictly convex so that the minimizer $w^o$ is unique [44]. In this article we study the important case where the component functions $\{J_l(w)\}$ are minimized at the same $w^o$. This case is common in practice; situations abound where nodes in a network need to work cooperatively to attain a common objective (such as tracking a target, locating the source of a chemical leak, estimating a physical model, or identifying a statistical distribution). This scenario is also frequent in the context of biological networks. For example, during the foraging behavior of an animal group, each agent in the group is interested in determining the same vector $w^o$ that corresponds to the location of the food source or the location of the predator [3]. This scenario is equally common in online distributed machine learning problems, where data samples are often generated from the same underlying distribution and they are processed in a distributed manner by different nodes (e.g., [4], [5]). The case where the $\{J_l(w)\}$ have different individual minimizers is studied in [45]; this situation is more challenging to study. Nevertheless, it is shown in [45] that the same diffusion strategies (18)–(19) of this paper are still applicable and nodes would converge instead to a Pareto-optimal solution.

Our strategy to optimize the global cost $J^{\mathrm{glob}}(w)$ in a distributed manner is based on three steps. First, using a second-order Taylor series expansion, we argue that $J^{\mathrm{glob}}(w)$ can be approximated by an alternative localized cost that is amenable to distributed optimization (see (11)).
Second, each individual node optimizes this alternative cost via a steepest-descent procedure that relies solely on interactions within the neighborhood of the node. Finally, the local estimates for $w^o$ are spatially combined by each node and the procedure repeats itself in real-time.

To motivate the approach, we start by introducing a set of nonnegative coefficients $\{c_{l,k}\}$ that satisfy:
$$\sum_{k=1}^{N} c_{l,k} = 1, \qquad c_{l,k} = 0 \text{ if } l \notin \mathcal{N}_k, \qquad l = 1, 2, \ldots, N \qquad (2)$$
where $\mathcal{N}_k$ denotes the neighborhood of node $k$ (including node $k$ itself); the neighbors of node $k$ consist of all nodes with which node $k$ can share information. Each $c_{l,k}$ represents a weight value that node $k$ assigns to information arriving from its neighbor $l$. Condition (2) states that the sum of all weights leaving each node $l$ should be one. Using the coefficients $\{c_{l,k}\}$, we can express $J^{\mathrm{glob}}(w)$ from (1) as
$$J^{\mathrm{glob}}(w) = J_k^{\mathrm{loc}}(w) + \sum_{l \neq k} J_l^{\mathrm{loc}}(w) \qquad (3)$$
where
$$J_k^{\mathrm{loc}}(w) \triangleq \sum_{l \in \mathcal{N}_k} c_{l,k} J_l(w) \qquad (4)$$
In other words, for each node $k$, we are introducing a new local cost function, $J_k^{\mathrm{loc}}(w)$, which corresponds to a weighted combination of the costs of its neighbors. Since the $\{c_{l,k}\}$ are all nonnegative and each $J_l(w)$ is convex, $J_k^{\mathrm{loc}}(w)$ is also a convex function (actually, the $J_k^{\mathrm{loc}}(w)$ will be guaranteed to be strongly convex in our treatment in view of Assumption 1 further ahead).

Now, each $J_l^{\mathrm{loc}}(w)$ in the second term of (3) can be approximated via a second-order Taylor series expansion as:
$$J_l^{\mathrm{loc}}(w) \approx J_l^{\mathrm{loc}}(w^o) + \|w - w^o\|_{\Gamma_l}^2 \qquad (5)$$
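As an aside, condition (2) is straightforward to enforce in practice. The following sketch (an illustrative construction, not taken from the paper) builds uniform combination weights over a small undirected network with self-loops and confirms that each row of $C = [c_{l,k}]$ sums to one while respecting the neighborhood structure:

```python
import numpy as np

def uniform_combination_weights(adj):
    """Uniform weights satisfying condition (2): node l splits its unit
    weight evenly among the nodes k whose neighborhoods contain l.
    adj is a symmetric 0/1 adjacency matrix with ones on the diagonal
    (each node is its own neighbor)."""
    N = adj.shape[0]
    C = np.zeros((N, N))
    for l in range(N):
        # for an undirected network, l is in N_k exactly when adj[l, k] == 1
        neighbors = np.flatnonzero(adj[l])
        C[l, neighbors] = 1.0 / len(neighbors)
    return C

# a small 3-node line network: 1 -- 2 -- 3 (with self-loops)
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])
C = uniform_combination_weights(adj)
print(C.sum(axis=1))   # each row sums to one, as required by (2)
```

Any other nonnegative choice works as well, as long as the rows of $C$ sum to one and weights vanish outside the neighborhoods.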
Fig. 1. A network with $N$ nodes; a cost function $J_k(w)$ is associated with each node $k$. The set of neighbors of node $k$ is denoted by $\mathcal{N}_k$; this set consists of all nodes with which node $k$ can share information.

where $\Gamma_l = \frac{1}{2}\nabla_w^2 J_l^{\mathrm{loc}}(w^o)$ is the (scaled) Hessian matrix relative to $w$ and evaluated at $w = w^o$, and the notation $\|a\|_{\Sigma}^2$ denotes $a^T \Sigma a$ for any weighting matrix $\Sigma$. The analysis in the subsequent sections will show that the second-order approximation (5) is sufficient to ensure mean-square convergence of the resulting diffusion algorithm. Now, substituting (5) into the right-hand side of (3) gives:
$$J^{\mathrm{glob}}(w) \approx J_k^{\mathrm{loc}}(w) + \sum_{l \neq k} \|w - w^o\|_{\Gamma_l}^2 + \sum_{l \neq k} J_l^{\mathrm{loc}}(w^o) \qquad (6)$$
The last term in the above expression does not depend on the unknown $w$. Therefore, we can ignore it so that optimizing $J^{\mathrm{glob}}(w)$ is approximately equivalent to optimizing the following alternative cost:
$$J^{\mathrm{glob}'}(w) \triangleq J_k^{\mathrm{loc}}(w) + \sum_{l \neq k} \|w - w^o\|_{\Gamma_l}^2 \qquad (7)$$

III. ITERATIVE DIFFUSION SOLUTION
Expression (7) relates the original global cost (1) to the newly-defined local cost function $J_k^{\mathrm{loc}}(w)$. The relation is through the second term on the right-hand side of (7), which corresponds to a sum of quadratic terms involving the minimizer $w^o$. Obviously, $w^o$ is not available at node $k$ since the nodes wish to estimate $w^o$. Likewise, not all Hessian matrices $\Gamma_l$ are available to node $k$. Nevertheless, expression (7) suggests a useful approximation that leads to a powerful distributed solution, as we proceed to explain.

Our first step is to replace the global cost $J^{\mathrm{glob}'}(w)$ by a reasonable localized approximation for it at every node $k$. Thus, initially we limit the summation on the right-hand side of (7) to the neighbors of node $k$ and introduce the cost function:
$$J_k^{\mathrm{glob}'}(w) \triangleq J_k^{\mathrm{loc}}(w) + \sum_{l \in \mathcal{N}_k \setminus \{k\}} \|w - w^o\|_{\Gamma_l}^2 \qquad (8)$$
Compared with (7), the last term in (8) involves only quantities that are available in the neighborhood of node $k$. The argument involving steps (5)–(8) therefore shows us one way by which we can adjust the earlier local cost function $J_k^{\mathrm{loc}}(w)$ defined in (4) by adding to it the last term that appears in (8). Doing so, we end up replacing $J_k^{\mathrm{loc}}(w)$ by $J_k^{\mathrm{glob}'}(w)$, and this new localized cost function preserves the second term in (3) up to a second-order approximation. This correction will help lead to a diffusion step (see (14)–(15)).

Now, observe that the cost in (8) includes the quantities $\{\Gamma_l\}$, which belong to the neighbors of node $k$. These quantities may or may not be available. If they are known, then we can proceed with (8) and rely on the use of the Hessian matrices $\Gamma_l$ in the subsequent development. Nevertheless, the more interesting situation in practice is when these Hessian matrices are not known beforehand (especially since they depend on the unknown $w^o$). For this reason, in this article, we approximate each $\Gamma_l$ in (8) by a multiple of the identity matrix, say,
$$\Gamma_l \approx b_{l,k} I_M \qquad (9)$$
for some nonnegative coefficients $\{b_{l,k}\}$; observe that we are allowing the coefficient $b_{l,k}$ to vary with the node index $k$. Such approximations are common in stochastic approximation theory and help reduce the complexity of the resulting algorithms (see [44, pp. 20–28] and [46, pp. 142–147]).
Approximation (9) is reasonable since, in view of the Rayleigh–Ritz characterization of eigenvalues [47], we can always bound the weighted squared norm $\|w - w^o\|_{\Gamma_l}^2$ by the unweighted squared norm as follows:
$$\lambda_{\min}(\Gamma_l) \cdot \|w - w^o\|^2 \;\le\; \|w - w^o\|_{\Gamma_l}^2 \;\le\; \lambda_{\max}(\Gamma_l) \cdot \|w - w^o\|^2$$
Thus, we replace (8) by
$$J_k^{\mathrm{glob}''}(w) \triangleq J_k^{\mathrm{loc}}(w) + \sum_{l \in \mathcal{N}_k \setminus \{k\}} b_{l,k} \|w - w^o\|^2 \qquad (10)$$
As the derivation will show, we do not need to worry at this stage about how the scalars $\{b_{l,k}\}$ are selected; they will be embedded into other combination weights that the designer selects. If we replace $J_k^{\mathrm{loc}}(w)$ by its definition (4), we can rewrite (10) as
$$J_k^{\mathrm{glob}''}(w) = \sum_{l \in \mathcal{N}_k} c_{l,k} J_l(w) + \sum_{l \in \mathcal{N}_k \setminus \{k\}} b_{l,k} \|w - w^o\|^2 \qquad (11)$$
Observe that cost (11) is different for different nodes; this is because the choices of the weighting scalars $\{c_{l,k}, b_{l,k}\}$ vary across nodes $k$; moreover, the neighborhoods vary with $k$. Nevertheless, these localized cost functions now constitute the important starting point for the development of diffusion strategies for the online and distributed optimization of (1).
Each node $k$ can apply a steepest-descent iteration to minimize $J_k^{\mathrm{glob}''}(w)$ by moving along the negative direction of the gradient (column) vector of the cost function, namely,
$$w_{k,i} = w_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{l,k} \nabla_w J_l(w_{k,i-1}) - \mu_k \sum_{l \in \mathcal{N}_k \setminus \{k\}} b_{l,k} (w_{k,i-1} - w^o), \qquad i \ge 0 \qquad (12)$$
where $w_{k,i}$ denotes the estimate for $w^o$ at node $k$ at time $i$, and $\mu_k$ denotes a small constant positive step-size parameter. While vanishing step-sizes, such as $\mu_k(i) = 1/i$, can be used in (12), we consider in this paper the case of constant step-sizes. This is because we are interested in distributed strategies that are able to continue adapting and learning. An important question to address therefore is how close each of the $w_{k,i}$ gets to the optimal solution $w^o$; we answer this question later in the paper by means of a mean-square-error convergence analysis (see expression (88)). It will be seen then that the mean-square error (MSE) of the algorithm will be of the order of the step-size; hence, sufficiently small step-sizes will lead to sufficiently small MSEs.

Expression (12) adds two correction terms to the previous estimate, $w_{k,i-1}$, in order to update it to $w_{k,i}$. The correction terms can be added one at a time in a succession of two steps, for example, as:
$$\psi_{k,i} = w_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{l,k} \nabla_w J_l(w_{k,i-1}) \qquad (13)$$
$$w_{k,i} = \psi_{k,i} - \mu_k \sum_{l \in \mathcal{N}_k \setminus \{k\}} b_{l,k} (w_{k,i-1} - w^o) \qquad (14)$$
Step (13) updates $w_{k,i-1}$ to an intermediate value $\psi_{k,i}$ by using a combination of local gradient vectors. Step (14) further updates $\psi_{k,i}$ to $w_{k,i}$ by using a combination of local estimates. However, two issues arise while examining (14):

(a) First, iteration (14) requires knowledge of the optimizer $w^o$. However, all nodes are running similar updates to estimate $w^o$. By the time node $k$ wishes to apply (14), each of its neighbors would have performed its own update similar to (13) and would have available their intermediate estimates, $\{\psi_{l,i}\}$.
Therefore, we replace $w^o$ in (14) by $\psi_{l,i}$. This step helps diffuse information over the network and brings into node $k$ information that exists beyond its immediate neighborhood; this is because each $\psi_{l,i}$ is influenced by data from the neighbors of node $l$. We observe that this diffusive term arises from the quadratic approximation (5) we have made to the second term in (3).

(b) Second, the intermediate value $\psi_{k,i}$ in (13) is generally a better estimate for $w^o$ than $w_{k,i-1}$ since it is obtained by incorporating information from the neighbors through (13). Therefore, we further replace $w_{k,i-1}$ in (14) by $\psi_{k,i}$. This step is reminiscent of incremental-type approaches to optimization, which have been widely studied in the literature [6]–[9].

Performing the substitutions described in items (a) and (b) into (14), we obtain:
$$w_{k,i} = \psi_{k,i} - \mu_k \sum_{l \in \mathcal{N}_k \setminus \{k\}} b_{l,k} (\psi_{k,i} - \psi_{l,i}) \qquad (15)$$
Now introduce the coefficients
$$a_{l,k} \triangleq \mu_k b_{l,k} \ (l \neq k), \qquad a_{k,k} \triangleq 1 - \mu_k \sum_{l \in \mathcal{N}_k \setminus \{k\}} b_{l,k} \qquad (16)$$
Note that the $\{a_{l,k}\}$ are nonnegative for $l \neq k$ and $a_{k,k} \ge 0$ for sufficiently small step-sizes. Moreover, the coefficients $\{a_{l,k}\}$ satisfy
$$\sum_{l=1}^{N} a_{l,k} = 1, \qquad a_{l,k} = 0 \text{ if } l \notin \mathcal{N}_k \qquad (17)$$
Using (16) in (15), we arrive at the following Adapt-then-Combine (ATC) diffusion strategy (whose structure is the same as the ATC algorithm originally proposed in [29]–[31] for mean-square-error estimation):
$$\psi_{k,i} = w_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{l,k} \nabla_w J_l(w_{k,i-1}), \qquad w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{l,k} \psi_{l,i} \qquad (18)$$
To run algorithm (18), we only need to select combination coefficients $\{a_{l,k}, c_{l,k}\}$ satisfying (2) and (17), respectively; there is no need to worry about the intermediate coefficients $\{b_{l,k}\}$ any more, since they have been blended into the $\{a_{l,k}\}$. The ATC algorithm (18) involves two steps.
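To make the two steps of (18) concrete, here is a minimal noise-free sketch of the ATC recursion on a toy network of quadratic costs $J_l(w) = \tfrac{1}{2} h_l \|w - w^o\|^2$ sharing the common minimizer $w^o$; the network size, curvatures $h_l$, and uniform weights are illustrative choices, not taken from the paper:

```python
import numpy as np

# Sketch of the ATC strategy (18).  All nodes share the minimizer w_o of
# quadratic costs J_l(w) = 0.5*h_l*||w - w_o||^2, so that
# grad J_l(w) = h_l*(w - w_o).  Network and constants are illustrative.
np.random.seed(0)
N, M = 4, 2
w_o = np.array([1.0, -2.0])
h = 1.0 + np.random.rand(N)                  # per-node curvatures
adj = np.ones((N, N))                        # fully connected, self-loops

A = adj / adj.sum(axis=0)                    # columns sum to one: (17)
C = adj / adj.sum(axis=1, keepdims=True)     # rows sum to one: (2)
mu = 0.1

w = np.zeros((N, M))                         # initial estimates w_{k,-1}
for i in range(300):
    psi = np.zeros_like(w)
    for k in range(N):                       # adaptation step of (18)
        grad = sum(C[l, k] * h[l] * (w[k] - w_o) for l in range(N))
        psi[k] = w[k] - mu * grad
    for k in range(N):                       # combination step of (18)
        w[k] = sum(A[l, k] * psi[l] for l in range(N))

print(np.abs(w - w_o).max())                 # every node approaches w_o
```

With exact gradients and no noise, all nodes converge to $w^o$; the mean-square analysis of Sec. IV quantifies what happens when gradient noise is present.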
In the first step, node $k$ receives gradient vector information from its neighbors and uses it to update its estimate $w_{k,i-1}$ to an intermediate value $\psi_{k,i}$. All other nodes in the network are performing a similar step and generating their intermediate estimates $\psi_{l,i}$. In the second step, node $k$ aggregates the estimates $\{\psi_{l,i}\}$ of its neighbors and generates $w_{k,i}$. Again, all other nodes are performing a similar step. Similarly, if we reverse the order of steps (13) and (14) to implement (12), we can motivate the following alternative Combine-then-Adapt (CTA) diffusion strategy (whose structure is similar to the CTA algorithm originally proposed in [26]–[32] for mean-square-error estimation):
$$\psi_{k,i-1} = \sum_{l \in \mathcal{N}_k} a_{l,k} w_{l,i-1}, \qquad w_{k,i} = \psi_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{l,k} \nabla_w J_l(\psi_{k,i-1}) \qquad (19)$$
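A companion sketch for the CTA form (19), on the same kind of toy quadratic network (again with illustrative constants), shows the reversed order: combine first, then take the gradient step at the combined iterate:

```python
import numpy as np

# Sketch of the CTA strategy (19): combine first, then adapt at the
# combined iterate.  Quadratic toy costs with a common minimizer;
# curvatures and uniform weights are illustrative assumptions.
np.random.seed(1)
N, M = 4, 2
w_o = np.array([0.5, 1.5])
h = 1.0 + np.random.rand(N)              # per-node curvatures
A = np.full((N, N), 1.0 / N)             # columns sum to one
C = np.full((N, N), 1.0 / N)             # rows sum to one
mu = 0.1

w = np.zeros((N, M))
for i in range(300):
    psi = A.T @ w                        # combine: psi_k = sum_l a_{l,k} w_l
    for k in range(N):                   # adapt at psi_k
        grad = sum(C[l, k] * h[l] * (psi[k] - w_o) for l in range(N))
        w[k] = psi[k] - mu * grad

print(np.abs(w - w_o).max())
```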
Adaptive diffusion strategies of the above ATC and CTA types were first proposed and extended in [26]–[34] for the solution of distributed mean-square-error, least-squares, and state-space estimation problems over networks. The special form of the ATC strategy (18) for minimum-mean-square-error estimation is listed further ahead as Eq. (57) in Example 3; the same strategy as (57) also appeared in [48], albeit with a vanishing step-size sequence to ensure convergence towards consensus. A special case of the diffusion strategy (19) (corresponding to choosing $c_{l,k} = 0$ for $l \neq k$ and $c_{k,k} = 1$, i.e., without sharing gradient information) was used in the works [39], [40], [43] to solve distributed optimization problems that require all nodes to reach agreement about $w^o$ by relying on step-sizes that decay to zero with time. Diffusion recursions of the forms (18) and (19) are more general than these earlier investigations in a couple of respects. First, they do not only diffuse the local estimates, but they can also diffuse the local gradient vectors. In other words, two sets of combination coefficients $\{a_{l,k}, c_{l,k}\}$ are used. Second, the combination weights $\{a_{l,k}\}$ are not required to be doubly stochastic (which would require both the rows and columns of the weighting matrix $A = [a_{l,k}]$ to add up to one; as seen from (17), we only require the entries on the columns of $A$ to add up to one). Finally, and most importantly, the step-size parameters $\{\mu_k\}$ in (18) and (19) are not required to depend on the time index $i$ and are not required to vanish as $i \to \infty$. Instead, they can assume constant values, which is critical to endow the network with continuous adaptation and learning abilities (otherwise, when step-sizes die out, the network stops learning).
Constant step-sizes also endow networks with tracking abilities, in which case the algorithms can track time changes in the optimal $w^o$.

Constant step-sizes will be shown further ahead to be sufficient to guarantee agreement among the nodes when there is no noise in the data. However, when measurement noise and gradient noise are present, using constant step-sizes does not force the nodes to attain agreement about $w^o$ (i.e., to converge to the same $w^o$). Instead, the nodes will be shown to tend to individual estimates for $w^o$ that are within a small mean-square-error (MSE) bound from the optimal solution; the bound will be proportional to the step-size so that sufficiently small step-sizes lead to small MSE values. Multi-agent systems in nature behave in this manner; they do not require exact agreement among their agents but allow for fluctuations due to individual noise levels (see [3], [36]). Giving individual nodes this flexibility, rather than forcing them to operate in agreement with the remaining nodes, ends up leading to nodes with enhanced learning abilities.

Before proceeding to a detailed analysis of the performance of the diffusion algorithms (18)–(19), we note that these strategies differ in important ways from traditional consensus-based distributed solutions, which are of the following form [10], [14], [15], [18]:
$$w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{l,k} w_{l,i-1} - \mu_k(i) \cdot \nabla_w J_k(w_{k,i-1}) \qquad (20)$$
usually with a time-variant step-size sequence, $\mu_k(i)$, that decays to zero. For example, if we set $C \triangleq [c_{l,k}] = I$ in the CTA algorithm (19) and substitute the combination step into the adaptation step, we obtain:
$$w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{l,k} w_{l,i-1} - \mu_k \nabla_w J_k\left(\sum_{l \in \mathcal{N}_k} a_{l,k} w_{l,i-1}\right) \qquad (21)$$
Thus, note that the gradient vector in (21) is evaluated at $\psi_{k,i-1}$, while in (20) it is evaluated at $w_{k,i-1}$. Since $\psi_{k,i-1}$ already incorporates information from the neighbors, we would expect the diffusion algorithm to perform better.
Actually, it is shown in [49] that, for mean-square-error estimation problems, diffusion strategies achieve a higher convergence rate and a lower mean-square error than consensus strategies due to these differences in the dynamics of the algorithms.

IV. MEAN-SQUARE PERFORMANCE ANALYSIS
The diffusion algorithms (18) and (19) depend on sharing local gradient vectors $\nabla_w J_l(\cdot)$. In many cases of practical relevance, the exact gradient vectors are not available and approximations are instead used. We model the inaccuracy in the gradient vectors as some random additive noise component, say, of the form:
$$\widehat{\nabla_w J_l}(w) = \nabla_w J_l(w) + \boldsymbol{v}_{l,i}(w) \qquad (22)$$
where $\boldsymbol{v}_{l,i}(\cdot)$ denotes the perturbation and is often referred to as gradient noise. Note that we are using a boldface symbol $\boldsymbol{v}$ to refer to the gradient noise since it is generally stochastic in nature.

Example 1.
Assume the individual cost $J_l(w)$ at node $l$ can be expressed as the expected value of a certain loss function $Q_l(\cdot,\cdot)$, i.e., $J_l(w) = \mathbb{E}\{Q_l(w, \boldsymbol{x}_{l,i})\}$, where the expectation is with respect to the randomness in the data samples $\{\boldsymbol{x}_{l,i}\}$ that are collected at node $l$ at time $i$. Then, if we replace the true gradient $\nabla_w J_l(w)$ with its stochastic gradient approximation $\widehat{\nabla_w J_l}(w) = \nabla_w Q_l(w, \boldsymbol{x}_{l,i})$, we find that the gradient noise in this case can be expressed as
$$\boldsymbol{v}_{l,i}(w) = \nabla_w Q_l(w, \boldsymbol{x}_{l,i}) - \nabla_w \mathbb{E}\{Q_l(w, \boldsymbol{x}_{l,i})\} \qquad (23)$$
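For a quadratic loss $Q(w, \{u, d\}) = |d - uw|^2$ (a hypothetical Gaussian data model chosen purely for illustration, anticipating Example 2), the noise (23) can be sampled directly and its sample mean is close to zero, consistent with the zero-mean property assumed later in the analysis:

```python
import numpy as np

# Sampling the gradient noise (23) for a quadratic loss Q(w, {u, d}) =
# |d - u*w|^2 under a hypothetical data model with E[u^T u] = I.
# Then grad J(w) = 2*(w - w_o), while the per-sample stochastic gradient
# is -2*u^T*(d - u*w); their difference is a realization of v_{l,i}(w).
rng = np.random.default_rng(0)
M, samples = 3, 200000
w_o = rng.standard_normal(M)
w = np.zeros(M)                               # point at which to sample noise

u = rng.standard_normal((samples, M))         # 1 x M regressors (as rows)
d = u @ w_o + 0.1 * rng.standard_normal(samples)

true_grad = 2 * (w - w_o)                     # gradient of E|d - u*w|^2
stoch_grads = -2 * u * (d - u @ w)[:, None]   # per-sample gradients of Q
noise = stoch_grads - true_grad               # realizations of (23)
print(np.abs(noise.mean(axis=0)).max())       # close to zero: noise is zero-mean
```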
Using the perturbed gradient vectors (22), the diffusion algorithms (18)–(19) become the following:
$$\text{(ATC)} \qquad \boldsymbol{\psi}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{l,k} \widehat{\nabla_w J_l}(\boldsymbol{w}_{k,i-1}), \qquad \boldsymbol{w}_{k,i} = \sum_{l \in \mathcal{N}_k} a_{l,k} \boldsymbol{\psi}_{l,i} \qquad (24)$$
$$\text{(CTA)} \qquad \boldsymbol{\psi}_{k,i-1} = \sum_{l \in \mathcal{N}_k} a_{l,k} \boldsymbol{w}_{l,i-1}, \qquad \boldsymbol{w}_{k,i} = \boldsymbol{\psi}_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{l,k} \widehat{\nabla_w J_l}(\boldsymbol{\psi}_{k,i-1}) \qquad (25)$$
Observe that, starting with (24)–(25), we will be using boldface letters to refer to the various estimate quantities in order to highlight the fact that they are also stochastic in nature due to the presence of the gradient noise.

Given the above algorithms, it is necessary to examine their performance in light of the approximation steps (6)–(15) that were employed to arrive at them, and in light of the gradient noise (22) that seeps into the recursions. A convenient framework to carry out this analysis is mean-square analysis. In this framework, we assess how close the individual estimates $\boldsymbol{w}_{k,i}$ get to the minimizer $w^o$ in the mean-square-error (MSE) sense. In practice, it is not necessary to force the individual agents to reach agreement and to converge to the same $w^o$ using diminishing step-sizes. It is sufficient for the nodes to converge within acceptable MSE bounds from $w^o$. This flexibility is beneficial and is common in biological networks; it allows nodes to learn and adapt in time-varying environments without the forced requirement of having to agree with neighbors.

The main results that we derive in this section are summarized as follows. First, we derive conditions on the constant step-sizes to ensure boundedness and convergence of the mean-square error for sufficiently small step-sizes (see (80) and (106) further ahead). Second, despite the fact that nodes influence each other's behavior, we are able to quantify the performance of every node in the network and to derive closed-form expressions for the mean-square performance at small step-sizes (see (106)–(108)).
Finally, as a special case, we are able to show that constant step-sizes can still ensure that the estimates across all nodes converge to the optimal $w^o$ and reach agreement in the absence of noise (see Theorem 2).

Motivated by [31], we address the mean-square-error performance of the adaptive ATC and CTA diffusion strategies (24)–(25) by treating them as special cases of a general diffusion structure of the following form:
$$\boldsymbol{\phi}_{k,i-1} = \sum_{l=1}^{N} p_{1,l,k} \boldsymbol{w}_{l,i-1} \qquad (26)$$
$$\boldsymbol{\psi}_{k,i} = \boldsymbol{\phi}_{k,i-1} - \mu_k \sum_{l=1}^{N} s_{l,k} \left[\nabla_w J_l(\boldsymbol{\phi}_{k,i-1}) + \boldsymbol{v}_{l,i}(\boldsymbol{\phi}_{k,i-1})\right] \qquad (27)$$
$$\boldsymbol{w}_{k,i} = \sum_{l=1}^{N} p_{2,l,k} \boldsymbol{\psi}_{l,i} \qquad (28)$$
The coefficients $\{p_{1,l,k}\}$, $\{s_{l,k}\}$, and $\{p_{2,l,k}\}$ are nonnegative real coefficients corresponding to the $\{l,k\}$-th entries of three matrices $P_1$, $S$, and $P_2$, respectively. Different choices for $\{P_1, P_2, S\}$ correspond to different cooperation modes. For example, the choice $P_1 = I$, $P_2 = I$ and $S = I$ corresponds to the non-cooperative case where nodes do not interact. On the other hand, the choice $P_1 = I$, $P_2 = A = [a_{l,k}]$ and $S = C = [c_{l,k}]$ corresponds to ATC [29]–[31], while the choice $P_1 = A$, $P_2 = I$ and $S = C$ corresponds to CTA [26]–[31]. We can also set $S = I$ in ATC and CTA to derive simplified versions that have no gradient exchange [29]. Furthermore, if in CTA ($P_2 = I$) we enforce $P_1 = A$ to be doubly stochastic, set $S = I$, and use a time-decaying step-size parameter ($\mu_k(i) \to 0$), then we obtain the unconstrained version used by [39], [43]. The matrices $\{P_1, P_2, S\}$ are required to satisfy:
$$P_1^T \mathbb{1} = \mathbb{1}, \qquad P_2^T \mathbb{1} = \mathbb{1}, \qquad S \mathbb{1} = \mathbb{1} \qquad (29)$$
where the notation $\mathbb{1}$ denotes a vector whose entries are all equal to one.

A. Error Recursions
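The unifying role of the matrices $\{P_1, S, P_2\}$ can be checked numerically. The sketch below (toy quadratic costs with a common minimizer, noise-free, with illustrative constants) runs the general recursion (26)-(28) and recovers both ATC and CTA by permuting the identity and combination matrices:

```python
import numpy as np

# General recursion (26)-(28): choosing P1 = I, P2 = A gives ATC (24),
# while P1 = A, P2 = I gives CTA (25).  Toy quadratic costs
# J_l(w) = 0.5*h_l*||w - w_o||^2 with a common minimizer; noise-free run.

def general_diffusion(P1, S, P2, h, w_o, mu=0.1, iters=300):
    N, M = P1.shape[0], w_o.size
    w = np.zeros((N, M))
    for _ in range(iters):
        phi = P1.T @ w                                      # (26)
        psi = np.zeros_like(w)
        for k in range(N):                                  # (27), no noise
            grad = sum(S[l, k] * h[l] * (phi[k] - w_o) for l in range(N))
            psi[k] = phi[k] - mu * grad
        w = P2.T @ psi                                      # (28)
    return w

N = 4
w_o = np.array([2.0, -1.0])
h = np.linspace(1.0, 2.0, N)                 # illustrative curvatures
A = np.full((N, N), 1.0 / N)                 # columns sum to one
C = np.full((N, N), 1.0 / N)                 # rows sum to one
I = np.eye(N)
w_atc = general_diffusion(I, C, A, h, w_o)   # ATC: P1 = I, S = C, P2 = A
w_cta = general_diffusion(A, C, I, h, w_o)   # CTA: P1 = A, S = C, P2 = I
print(np.abs(w_atc - w_o).max(), np.abs(w_cta - w_o).max())
```

Both special cases drive every node to $w^o$ in this noise-free setting, which is what the unified analysis below captures for the general $\{P_1, S, P_2\}$ at once.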
We first derive the error recursions corresponding to the general diffusion formulation in (26)–(28). Introduce the error vectors:
$$\tilde{\boldsymbol{\phi}}_{k,i} \triangleq w^o - \boldsymbol{\phi}_{k,i}, \qquad \tilde{\boldsymbol{\psi}}_{k,i} \triangleq w^o - \boldsymbol{\psi}_{k,i}, \qquad \tilde{\boldsymbol{w}}_{k,i} \triangleq w^o - \boldsymbol{w}_{k,i} \qquad (30)$$
Then, subtracting both sides of (26)–(28) from $w^o$ gives:
$$\tilde{\boldsymbol{\phi}}_{k,i-1} = \sum_{l=1}^{N} p_{1,l,k} \tilde{\boldsymbol{w}}_{l,i-1} \qquad (31)$$
$$\tilde{\boldsymbol{\psi}}_{k,i} = \tilde{\boldsymbol{\phi}}_{k,i-1} + \mu_k \sum_{l=1}^{N} s_{l,k} \left[\nabla_w J_l(\boldsymbol{\phi}_{k,i-1}) + \boldsymbol{v}_{l,i}(\boldsymbol{\phi}_{k,i-1})\right] \qquad (32)$$
$$\tilde{\boldsymbol{w}}_{k,i} = \sum_{l=1}^{N} p_{2,l,k} \tilde{\boldsymbol{\psi}}_{l,i} \qquad (33)$$
Expression (32) still includes terms that depend on $\boldsymbol{\phi}_{k,i-1}$ and not on the error quantity $\tilde{\boldsymbol{\phi}}_{k,i-1}$. We can find a relation in terms of $\tilde{\boldsymbol{\phi}}_{k,i-1}$ by calling upon the following result from [44, p. 24] for any twice-differentiable function $f(\cdot)$:
$$\nabla f(y) = \nabla f(x) + \left[\int_0^1 \nabla^2 f\big(x + t(y - x)\big)\, dt\right](y - x) \qquad (34)$$
where $\nabla^2 f(\cdot)$ denotes the Hessian matrix of $f(\cdot)$ and is symmetric. Now, since each component function $J_l(w)$ has a minimizer at $w^o$, we have $\nabla_w J_l(w^o) = 0$ for $l = 1, 2, \ldots, N$. Applying (34) to $J_l(w)$ using $x = w^o$ and $y = \boldsymbol{\phi}_{k,i-1}$, we get
$$\nabla_w J_l(\boldsymbol{\phi}_{k,i-1}) = \nabla_w J_l(w^o) - \left[\int_0^1 \nabla_w^2 J_l\big(w^o - t\tilde{\boldsymbol{\phi}}_{k,i-1}\big)\, dt\right]\tilde{\boldsymbol{\phi}}_{k,i-1} \triangleq -\boldsymbol{H}_{l,k,i-1}\,\tilde{\boldsymbol{\phi}}_{k,i-1} \qquad (35)$$
where we are introducing the symmetric random matrix
$$\boldsymbol{H}_{l,k,i-1} \triangleq \int_0^1 \nabla_w^2 J_l\big(w^o - t\tilde{\boldsymbol{\phi}}_{k,i-1}\big)\, dt \qquad (36)$$
Observe that one such matrix is associated with every edge linking two nodes $(l,k)$; observe further that this matrix changes with time since it depends on the estimate at node $k$. Substituting (35)–(36) into (32) leads to:
$$\tilde{\boldsymbol{\psi}}_{k,i} = \left[I_M - \mu_k \sum_{l=1}^{N} s_{l,k} \boldsymbol{H}_{l,k,i-1}\right]\tilde{\boldsymbol{\phi}}_{k,i-1} + \mu_k \sum_{l=1}^{N} s_{l,k} \boldsymbol{v}_{l,i}(\boldsymbol{\phi}_{k,i-1}) \qquad (37)$$
We introduce the network error vectors, which collect the error quantities across all nodes:
$$\tilde{\boldsymbol{\phi}}_i \triangleq \mathrm{col}\{\tilde{\boldsymbol{\phi}}_{1,i}, \ldots, \tilde{\boldsymbol{\phi}}_{N,i}\}, \qquad \tilde{\boldsymbol{\psi}}_i \triangleq \mathrm{col}\{\tilde{\boldsymbol{\psi}}_{1,i}, \ldots, \tilde{\boldsymbol{\psi}}_{N,i}\}, \qquad \tilde{\boldsymbol{w}}_i \triangleq \mathrm{col}\{\tilde{\boldsymbol{w}}_{1,i}, \ldots, \tilde{\boldsymbol{w}}_{N,i}\} \qquad (38)$$
and the following block matrices:
$$\mathcal{P}_1 = P_1 \otimes I_M, \qquad \mathcal{P}_2 = P_2 \otimes I_M \qquad (39)$$
$$\mathcal{S} = S \otimes I_M, \qquad \mathcal{M} = \Omega \otimes I_M \qquad (40)$$
$$\Omega = \mathrm{diag}\{\mu_1, \ldots, \mu_N\} \qquad (41)$$
$$\boldsymbol{D}_{i-1} = \sum_{l=1}^{N} \mathrm{diag}\left\{s_{l,1}\boldsymbol{H}_{l,1,i-1}, \cdots, s_{l,N}\boldsymbol{H}_{l,N,i-1}\right\} \qquad (42)$$
$$\boldsymbol{g}_i = \sum_{l=1}^{N} \mathrm{col}\left\{s_{l,1}\boldsymbol{v}_{l,i}(\boldsymbol{\phi}_{1,i-1}), \cdots, s_{l,N}\boldsymbol{v}_{l,i}(\boldsymbol{\phi}_{N,i-1})\right\} \qquad (43)$$
where the symbol $\otimes$ denotes the Kronecker product [50]. Then, recursions (31), (37) and (33) lead to:
$$\tilde{\boldsymbol{w}}_i = \mathcal{P}_2^T\left[I_{MN} - \mathcal{M}\boldsymbol{D}_{i-1}\right]\mathcal{P}_1^T \tilde{\boldsymbol{w}}_{i-1} + \mathcal{P}_2^T \mathcal{M} \boldsymbol{g}_i \qquad (44)$$
To proceed with the analysis, we introduce the following assumption on the cost functions and gradient noise, followed by a lemma on $\boldsymbol{H}_{l,k,i-1}$.

Assumption 1 (Bounded Hessian). Each component cost function $J_l(w)$ has a bounded Hessian matrix; i.e., there exist nonnegative real numbers $\lambda_{l,\min}$ and $\lambda_{l,\max}$, with $\lambda_{l,\min} \le \lambda_{l,\max}$, such that for all $w$:
$$\lambda_{l,\min} I_M \le \nabla_w^2 J_l(w) \le \lambda_{l,\max} I_M \qquad (45)$$
Furthermore, the $\{\lambda_{l,\min}\}_{l=1}^{N}$ satisfy
$$\sum_{l=1}^{N} s_{l,k} \lambda_{l,\min} > 0, \qquad k = 1, 2, \ldots, N \qquad (46)$$
Condition (46) ensures that the local cost functions $\{J_k^{\mathrm{loc}}(w)\}$ defined earlier in (4) are strongly convex and, hence, have a unique minimizer at $w^o$.

Assumption 2 (Gradient noise). There exist $\alpha \ge 0$ and $\sigma_v^2 \ge 0$ such that, for all $w \in \mathcal{F}_{i-1}$ and for all $i$, $l$:
$$\mathbb{E}\{\boldsymbol{v}_{l,i}(w) \,|\, \mathcal{F}_{i-1}\} = 0 \qquad (47)$$
$$\mathbb{E}\{\|\boldsymbol{v}_{l,i}(w)\|^2\} \le \alpha\, \mathbb{E}\|w^o - w\|^2 + \sigma_v^2 \qquad (48)$$
where $\mathcal{F}_{i-1}$ denotes the past history ($\sigma$-field) of estimates $\{\boldsymbol{w}_{k,j}\}$ for $j \le i-1$ and all $k$.
Lemma 1 (Bound on $H_{l,k,i-1}$). Under Assumption 1, the matrix $H_{l,k,i-1}$ defined in (36) is a nonnegative-definite matrix that satisfies:

$\lambda_{l,\min} I_M \le H_{l,k,i-1} \le \lambda_{l,\max} I_M$  (49)

Proof:
It suffices to prove that $\lambda_{l,\min} \le x^T H_{l,k,i-1}\, x \le \lambda_{l,\max}$ for arbitrary $M \times 1$ vectors $x$ of unit Euclidean norm. By (36) and (45), we have

$x^T H_{l,k,i-1}\, x = \int_0^1 x^T \nabla_w^2 J_l\big(w^o - t\tilde\phi_{k,i-1}\big)\, x\, dt \le \int_0^1 \lambda_{l,\max}\, dt = \lambda_{l,\max}$

In a similar way, we can verify that $x^T H_{l,k,i-1}\, x \ge \lambda_{l,\min}$.

In distributed subgradient methods (e.g., [15], [39], [43]), the norms of the subgradients are usually required to be uniformly bounded. Such an assumption is restrictive in the unconstrained optimization of differentiable functions. Assumption 1 is more relaxed in that it allows the gradient vector $\nabla_w J_l(w)$ to have unbounded norm (e.g., quadratic costs). Furthermore, condition (48) allows the variance of the gradient noise to grow no faster than $E\|w^o - w\|^2$. This condition is also more general than the uniform-boundedness assumption used in [39] (Assumptions 5.1 and 6.1), which requires instead:

$E\|v_{l,i}(w)\| \le \sigma_v, \qquad E\{\|v_{l,i}(w)\|^2\,|\,\mathcal{F}_{i-1}\} \le \sigma_v^2$  (50)

Furthermore, condition (48) is similar to condition (4.3) in [51, p.635]:

$E\{\|v_{l,i}(w)\|^2\,|\,\mathcal{F}_{i-1}\} \le \alpha\big[\|\nabla_w J_l(w)\|^2 + 1\big]$  (51)

which is a combination of the "relative random noise" and "absolute random noise" conditions defined in [44, pp.100–102]. Indeed, we can derive (48) by substituting (35) into (51), taking expectations with respect to $\mathcal{F}_{i-1}$, and then using (49). Example 2.
Such a mix of "relative random noise" and "absolute random noise" is of practical importance. For instance, consider an example in which the loss function at node $l$ is chosen to be of the following quadratic form:

$Q_l\big(w, \{u_{l,i}, d_l(i)\}\big) = |d_l(i) - u_{l,i} w|^2$

for some scalars $\{d_l(i)\}$ and $1 \times M$ regression vectors $\{u_{l,i}\}$. The corresponding cost function is then:

$J_l(w) = E\,|d_l(i) - u_{l,i} w|^2$  (52)
Assume further that the data $\{u_{l,i}, d_l(i)\}$ satisfy the linear regression model

$d_l(i) = u_{l,i} w^o + z_l(i)$  (53)

where the regressors $\{u_{l,i}\}$ are zero mean and independent over time with covariance matrix $R_{u,l} = E\{u_{l,i}^T u_{l,i}\}$, and the noise sequence $\{z_k(j)\}$ is also zero mean, white, with variance $\sigma_{z,k}^2$, and independent of the regressors $\{u_{l,i}\}$ for all $l, k, i, j$. Then, using (53) and (23), the gradient noise in this case can be expressed as:

$v_{l,i}(w) = 2\big(R_{u,l} - u_{l,i}^T u_{l,i}\big)(w^o - w) - 2\, u_{l,i}^T z_l(i)$  (54)

It can easily be verified that this noise satisfies both conditions stated in Assumption 2, namely, (47) and also:

$E\|v_{l,i}(w)\|^2 \le 4\, E\big\|R_{u,l} - u_{l,i}^T u_{l,i}\big\|^2 \cdot E\|w^o - w\|^2 + 4\,\sigma_{z,l}^2\,\mathrm{Tr}(R_{u,l})$  (55)

for all $w \in \mathcal{F}_{i-1}$. Note that both relative random noise and absolute random noise components appear in (55) and are necessary to model the statistical gradient perturbation even for quadratic costs. Such costs, and linear regression models of the form (53), arise frequently in the context of adaptive filters — see, e.g., [9], [26]–[33], [36], [46], [52]–[55]. Example 3.
Quadratic costs of the form (52) are common in mean-square-error estimation for linear regression models of the type (53). If we use instantaneous approximations, as is common in the context of stochastic approximation and adaptive filtering [44], [46], [52], then the actual gradient $\nabla_w J_l(w)$ can be approximated by

$\widehat\nabla_w J_l(w) = \nabla_w Q_l\big(w, \{u_{l,i}, d_l(i)\}\big) = -2\, u_{l,i}^T \big[d_l(i) - u_{l,i} w\big]$  (56)

Substituting into (24)–(25), and assuming $C = I$ for illustration purposes only, we arrive at the following ATC and CTA diffusion strategies, originally proposed and extended in [26]–[31] for the solution of distributed mean-square-error estimation problems:

(ATC)
$\psi_{k,i} = w_{k,i-1} + 2\mu_k\, u_{k,i}^T \big[d_k(i) - u_{k,i} w_{k,i-1}\big]$
$w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{l,k}\, \psi_{l,i}$  (57)

(CTA)
$\psi_{k,i-1} = \sum_{l \in \mathcal{N}_k} a_{l,k}\, w_{l,i-1}$
$w_{k,i} = \psi_{k,i-1} + 2\mu_k\, u_{k,i}^T \big[d_k(i) - u_{k,i} \psi_{k,i-1}\big]$  (58)

B. Variance Relations
The purpose of the mean-square analysis in the sequel is to answer two questions in the presence of gradient perturbations. First, how small does the mean-square error $E\|\tilde w_{k,i}\|^2$ get as $i \to \infty$ at any of the nodes $k$? Second, how fast does this error variance tend towards its steady-state value? The first question pertains to steady-state performance and the second to transient/convergence-rate performance. Answering such questions for a distributed algorithm over a network is a challenging task, largely because the nodes influence each other's behavior: performance at one node diffuses through the network to the other nodes as a result of the topological constraints linking the nodes. The approach we take to examine the mean-square performance of the diffusion algorithms is to study how the variance $E\|\tilde w_{k,i}\|^2$, or a weighted version of it, evolves over time. As the derivation will show, the evolution of this variance satisfies a nonlinear relation. Under some reasonable assumptions on the noise profile and the local cost functions, we will be able to bound these error variances, as well as estimate their steady-state values, for sufficiently small step-sizes. We will also derive closed-form expressions that characterize the network performance. The details are as follows.

Evaluating the weighted squared Euclidean norm of both sides of (44), applying the expectation operator, and using (47), we can show that the following variance relation holds:

$E\|\tilde w_i\|^2_\Sigma = E\big\{\|\tilde w_{i-1}\|^2_{\Sigma'}\big\} + E\big\|\mathcal{P}_2^T\mathcal{M}\, g_i\big\|^2_\Sigma$
$\Sigma' = \mathcal{P}_1\big[I_{MN} - \mathcal{M}\mathcal{D}_{i-1}\big]\mathcal{P}_2\,\Sigma\,\mathcal{P}_2^T\big[I_{MN} - \mathcal{M}\mathcal{D}_{i-1}\big]\mathcal{P}_1^T$  (59)

where $\Sigma$ is a positive semi-definite weighting matrix that we are free to choose. The variance expression (59) shows how the quantity $E\|\tilde w_i\|^2$ evolves with time.
Observe, however, that the weighting matrix on $\tilde w_{i-1}$ on the right-hand side of (59) is a different matrix, denoted by $\Sigma'$, and this matrix is actually random in nature (while $\Sigma$ is deterministic). As such, result (59) is not truly a recursion. Nevertheless, it is possible, under a small step-size approximation, to rework variance relations such as (59) into a recursion by following certain steps that are characteristic of the energy conservation approach to mean-square analysis [46].

The first step in this regard would be to replace $\Sigma'$ by its mean $E\Sigma'$. However, the matrix $\Sigma'$ depends on the $\{H_{l,k,i-1}\}$ via $\mathcal{D}_{i-1}$ (see (42)). It follows from the definition of $H_{l,k,i-1}$ in (36) that $\Sigma'$ depends on $\tilde\phi_{k,i-1}$ as well, which in turn is a linear combination of the $\{\tilde w_{l,i-1}\}$. Therefore, the main challenge in continuing from (59) is that $\Sigma'$ depends on $\tilde w_{i-1}$. For this reason, we cannot directly apply the traditional step of replacing $\Sigma'$ in the first equation of (59) by $E\Sigma'$, as is typically done in the study of stand-alone adaptive filters to analyze their transient behavior [46, p.345]; in the case of conventional adaptive filters, the matrix $\Sigma'$ is independent of $\tilde w_{i-1}$. To address this difficulty, we shall adjust the argument to rely on a set of inequality recursions that will enable us to bound the steady-state mean-square error at each node — see Theorem 1 further ahead.

The procedure is as follows. First, we note that $\|x\|^2$ is a convex function of $x$, and that expressions (31) and (33) are convex combinations of the $\{\tilde w_{l,i-1}\}$ and $\{\tilde\psi_{l,i}\}$, respectively. Then, by Jensen's inequality [56, p.77] and taking expectations, we obtain

$E\|\tilde\phi_{k,i-1}\|^2 \le \sum_{l=1}^N p_{1,l,k}\, E\|\tilde w_{l,i-1}\|^2$  (60)
$E\|\tilde w_{k,i}\|^2 \le \sum_{l=1}^N p_{2,l,k}\, E\|\tilde\psi_{l,i}\|^2$  (61)

for $k = 1, \ldots, N$.
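The convexity step behind (60)–(61) is easy to check numerically. The sketch below is an illustration only (random weights and vectors, not the paper's quantities): it verifies that the squared norm of a convex combination never exceeds the same convex combination of the squared norms.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 4
p = rng.random(N)
p /= p.sum()                       # convex-combination weights: nonnegative, sum to one
xs = rng.standard_normal((N, M))   # one random M-dimensional vector per node

lhs = np.linalg.norm(p @ xs) ** 2            # || sum_l p_l x_l ||^2
rhs = p @ (np.linalg.norm(xs, axis=1) ** 2)  # sum_l p_l ||x_l||^2
assert lhs <= rhs + 1e-12                    # Jensen's inequality for ||.||^2
```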
Next, we derive a variance relation for (37). Evaluating the squared Euclidean norms of both sides of (37), applying the expectation operator, and using (47) from Assumption 2, we get

$E\|\tilde\psi_{k,i}\|^2 = E\big\{\|\tilde\phi_{k,i-1}\|^2_{\Sigma_{k,i-1}}\big\} + \mu_k^2\, E\Big\|\sum_{l=1}^N s_{l,k}\, v_{l,i}(\phi_{k,i-1})\Big\|^2$  (62)

where

$\Sigma_{k,i-1} = \Big[I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}\Big]^2$  (63)

We call upon the following two lemmas to bound (62).

Lemma 2 (Bound on $\Sigma_{k,i-1}$). The weighting matrix $\Sigma_{k,i-1}$ defined in (63) is a symmetric, positive semi-definite matrix, and satisfies:

$0 \le \Sigma_{k,i-1} \le \gamma_k^2\, I_M$  (64)

where

$\gamma_k \triangleq \max\Big\{\Big|1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\max}\Big|,\ \Big|1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\min}\Big|\Big\}$  (65)
Proof:
By definition (63) and the fact that $H_{l,k,i-1}$ is symmetric — see definition (36) — the matrix $I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}$ is also symmetric. Hence, its square, $\Sigma_{k,i-1}$, is symmetric and also nonnegative-definite. To establish (64), we first use (49) to note that:

$I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1} \ge \Big(1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\max}\Big) I_M$  (66)
$I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1} \le \Big(1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\min}\Big) I_M$  (67)

The matrix $I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}$ may not be positive semi-definite because we have not yet specified a range for $\mu_k$; the factors on the right-hand sides of (66)–(67) may still be negative. However, inequalities (66)–(67) imply that the eigenvalues of $I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}$ are bounded as:

$\lambda\Big(I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}\Big) \ge 1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\max}$  (68)
$\lambda\Big(I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}\Big) \le 1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\min}$  (69)

By definition (63), $\Sigma_{k,i-1}$ is the square of the symmetric matrix $I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}$, meaning that

$\lambda(\Sigma_{k,i-1}) = \Big[\lambda\Big(I_M - \mu_k \sum_{l=1}^N s_{l,k} H_{l,k,i-1}\Big)\Big]^2 \ge 0$  (70)

Substituting (68)–(69) into (70) leads to

$\lambda(\Sigma_{k,i-1}) \le \Big(\max\Big\{\Big|1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\max}\Big|,\ \Big|1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\min}\Big|\Big\}\Big)^2 = \gamma_k^2$  (71)

which is equivalent to (64).

Lemma 3 (Bound on noise combination).
The second term on the right-hand side of (62) satisfies:

$E\Big\|\sum_{l=1}^N s_{l,k}\, v_{l,i}(\phi_{k,i-1})\Big\|^2 \le \|S\|_1^2 \cdot \big[\alpha\, E\|\tilde\phi_{k,i-1}\|^2 + \sigma_v^2\big]$  (72)

where $\|S\|_1$ denotes the $1$-norm of the matrix $S$ (i.e., the maximum absolute column sum).
Proof:
Applying Jensen's inequality [56, p.77], it holds that

$E\Big\|\sum_{l=1}^N s_{l,k}\, v_{l,i}(\phi_{k,i-1})\Big\|^2 = \Big(\sum_{l=1}^N s_{l,k}\Big)^2 \cdot E\Big\|\sum_{l=1}^N \frac{s_{l,k}}{\sum_{l=1}^N s_{l,k}}\, v_{l,i}(\phi_{k,i-1})\Big\|^2$
$\le \Big(\sum_{l=1}^N s_{l,k}\Big)^2 \cdot \sum_{l=1}^N \frac{s_{l,k}}{\sum_{l=1}^N s_{l,k}}\, E\|v_{l,i}(\phi_{k,i-1})\|^2$
$= \Big(\sum_{l=1}^N s_{l,k}\Big) \cdot \sum_{l=1}^N s_{l,k}\, E\|v_{l,i}(\phi_{k,i-1})\|^2$
$\le \Big(\sum_{l=1}^N s_{l,k}\Big)^2 \cdot \big[\alpha\, E\|\tilde\phi_{k,i-1}\|^2 + \sigma_v^2\big]$  (73)
$\le \|S\|_1^2 \cdot \big[\alpha\, E\|\tilde\phi_{k,i-1}\|^2 + \sigma_v^2\big]$  (74)

where inequality (73) follows by substituting (48), and (74) is obtained using the fact that $\|S\|_1$ is the maximum absolute column sum and that the entries $\{s_{l,k}\}$ are nonnegative.

Substituting (64) and (72) into (62), we obtain:

$E\|\tilde\psi_{k,i}\|^2 \le \big(\gamma_k^2 + \mu_k^2\,\alpha\,\|S\|_1^2\big)\cdot E\|\tilde\phi_{k,i-1}\|^2 + \mu_k^2\,\|S\|_1^2\,\sigma_v^2$  (75)

for $k = 1, \ldots, N$. Finally, introduce the following network mean-square-error vectors (compare with (38)):

$\mathcal{X}_i = \mathrm{col}\big\{E\|\tilde\phi_{1,i}\|^2, \ldots, E\|\tilde\phi_{N,i}\|^2\big\}, \quad \mathcal{Y}_i = \mathrm{col}\big\{E\|\tilde\psi_{1,i}\|^2, \ldots, E\|\tilde\psi_{N,i}\|^2\big\}, \quad \mathcal{W}_i = \mathrm{col}\big\{E\|\tilde w_{1,i}\|^2, \ldots, E\|\tilde w_{N,i}\|^2\big\}$

and the matrix

$\Gamma = \mathrm{diag}\big\{\gamma_1^2 + \mu_1^2\alpha\|S\|_1^2,\ \ldots,\ \gamma_N^2 + \mu_N^2\alpha\|S\|_1^2\big\}$  (76)

Then, (60)–(61) and (75) can be written as

$\mathcal{X}_{i-1} \preceq P_1^T\, \mathcal{W}_{i-1}$
$\mathcal{Y}_i \preceq \Gamma\, \mathcal{X}_{i-1} + \sigma_v^2\,\|S\|_1^2\, \Omega^2\mathbb{1}$
$\mathcal{W}_i \preceq P_2^T\, \mathcal{Y}_i$  (77)

where $\mathbb{1}$ denotes the $N \times 1$ vector of ones, and the notation $x \preceq y$ means that the components of vector $x$ are less than or equal to the corresponding components of vector $y$.
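To get a feel for the entrywise recursion (77), the toy sketch below (all parameters are illustrative choices, with $S = I$ so that $s_{l,k} = \delta_{l,k}$, and uniform Hessian bounds at every node) iterates the three inequalities as equalities and checks that the resulting sequence converges to a small, bounded fixed point for a small step-size.

```python
import numpy as np

N = 4
# Doubly-stochastic combination matrices; here P1 = P2 = P, simple averaging
P = np.full((N, N), 1.0 / N)
S = np.eye(N)                        # no gradient exchange: s_{l,k} = delta_{l,k}
lam_min, lam_max = 1.0, 3.0          # assumed Hessian bounds (same at every node)
alpha, sigma_v2 = 0.5, 1.0           # gradient-noise parameters of Assumption 2
mu = 0.05 * np.ones(N)               # small step-sizes

S1 = np.abs(S).sum(axis=0).max()             # ||S||_1: max absolute column sum
sig_min = S.T @ (lam_min * np.ones(N))       # sigma_{k,min} = sum_l s_{l,k} lam_min
sig_max = S.T @ (lam_max * np.ones(N))
gamma = np.maximum(np.abs(1 - mu * sig_max), np.abs(1 - mu * sig_min))  # (65)
Gamma = np.diag(gamma**2 + mu**2 * alpha * S1**2)                       # (76)

W = np.ones(N)                       # initial mean-square errors
for _ in range(2000):
    X = P.T @ W                                     # as in (60)
    Y = Gamma @ X + sigma_v2 * S1**2 * mu**2        # as in (75)
    W = P.T @ Y                                     # as in (61)

W_next = P.T @ (Gamma @ (P.T @ W) + sigma_v2 * S1**2 * mu**2)
assert np.allclose(W, W_next, atol=1e-10)   # reached a fixed point
assert W.max() < 1.0                        # bounded well below the initial error
```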
We now recall the useful fact that, for any matrix $F$ with nonnegative entries,

$x \preceq y \ \Longrightarrow\ Fx \preceq Fy$  (78)

This is because each entry of the vector $Fy - Fx = F(y - x)$ is nonnegative. Then, combining all three inequalities in (77) leads to:

$\mathcal{W}_i \preceq P_2^T\, \Gamma\, P_1^T\, \mathcal{W}_{i-1} + \sigma_v^2\,\|S\|_1^2 \cdot P_2^T\, \Omega^2\mathbb{1}$  (79)

C. Mean-Square Stability
Based on (79), we can now prove that, under certain conditions on the step-size parameters $\{\mu_k\}$, the mean-square-error vector $\mathcal{W}_i$ remains bounded as $i \to \infty$; we use this result in the next subsection to evaluate the steady-state MSE for sufficiently small step-sizes.

Theorem 1 (Mean-Square Stability). If the step-sizes $\{\mu_k\}$ satisfy the condition

$0 < \mu_k < \min\Big\{\frac{2\,\sigma_{k,\max}}{\sigma_{k,\max}^2 + \alpha\|S\|_1^2},\ \frac{2\,\sigma_{k,\min}}{\sigma_{k,\min}^2 + \alpha\|S\|_1^2}\Big\}$  (80)

for $k = 1, \ldots, N$, where $\sigma_{k,\max}$ and $\sigma_{k,\min}$ are defined as

$\sigma_{k,\max} \triangleq \sum_{l=1}^N s_{l,k}\lambda_{l,\max}, \qquad \sigma_{k,\min} \triangleq \sum_{l=1}^N s_{l,k}\lambda_{l,\min}$  (81)

then, as $i \to \infty$,

$\limsup_{i\to\infty} \|\mathcal{W}_i\|_\infty \le \frac{\big(\max_{1\le k\le N}\mu_k^2\big)\cdot \|S\|_1^2\,\sigma_v^2}{1 - \max_{1\le k\le N}\big(\gamma_k^2 + \mu_k^2\,\alpha\,\|S\|_1^2\big)}$  (82)

where $\|x\|_\infty$ denotes the maximum absolute entry of vector $x$.

Proof:
See Appendix A.

If we let $\alpha = 0$ and $\sigma_v^2 = 0$ in Theorem 1, and examine the arguments leading to it, we conclude the validity of the following result, which establishes the convergence of the diffusion strategies (24)–(25) in the absence of gradient noise (i.e., using the true gradient rather than the stochastic gradient — see (18) and (19)).
Theorem 2 (Convergence in the Noise-Free Case). If there is no gradient noise, i.e., $\alpha = 0$ and $\sigma_v^2 = 0$, then the mean-square-error vector becomes the deterministic vector $\mathcal{W}_i = \mathrm{col}\big\{\|\tilde w_{1,i}\|^2, \cdots, \|\tilde w_{N,i}\|^2\big\}$, and its entries converge to zero if the step-sizes $\{\mu_k\}$ satisfy the condition

$0 < \mu_k < \frac{2}{\sigma_{k,\max}}$  (83)

for $k = 1, \ldots, N$, where $\sigma_{k,\max}$ was defined in (81).

We observe that, in the absence of noise, the deterministic error vectors $\tilde w_{k,i}$ tend to zero as $i \to \infty$ even with constant (i.e., non-vanishing) step-sizes. This result implies the interesting fact that, in the noise-free case, the nodes can reach agreement without the need to impose diminishing step-sizes.

D. Steady-State Performance
Expression (80) provides a condition on the step-size parameters $\{\mu_k\}$ that ensures the mean-square stability of the diffusion strategies (24)–(25). At the same time, expression (82) gives an upper bound on how large $\mathcal{W}_i$ can be at steady-state. Since the $\infty$-norm of a vector is the largest absolute value of its entries, (82) bounds the MSE of the worst-performing node in the network. We can derive closed-form expressions for the MSEs when the step-sizes are sufficiently small. Indeed, we first conclude from (82) that, for sufficiently small step-sizes, each $w_{k,i}$ gets close to $w^o$ at steady-state. To verify this fact, assume the step-sizes are small enough so that the nonnegative factor $\gamma_k$ that was defined earlier in (65) becomes

$\gamma_k = 1 - \mu_k \sum_{l=1}^N s_{l,k}\lambda_{l,\min} = 1 - \mu_k\,\sigma_{k,\min}$  (84)

where $\sigma_{k,\min}$ was given by (81). Substituting (84) into (82), we obtain:

$\limsup_{i\to\infty}\|\mathcal{W}_i\|_\infty \le \frac{\big(\max_{1\le k\le N}\mu_k^2\big)\cdot\|S\|_1^2\,\sigma_v^2}{1 - \max_{1\le k\le N}\big\{(1-\mu_k\sigma_{k,\min})^2 + \mu_k^2\,\alpha\,\|S\|_1^2\big\}}$
$= \frac{\big(\max_{1\le k\le N}\mu_k^2\big)\cdot\|S\|_1^2\,\sigma_v^2}{\min_{1\le k\le N}\big\{\mu_k\big[2\sigma_{k,\min} - \mu_k\big(\sigma_{k,\min}^2 + \alpha\|S\|_1^2\big)\big]\big\}}$
$\le \frac{\|S\|_1^2\,\sigma_v^2}{\min_{1\le k\le N}\big\{2\sigma_{k,\min} - \mu_k\big(\sigma_{k,\min}^2 + \alpha\|S\|_1^2\big)\big\}}\cdot\frac{\mu_{\max}^2}{\mu_{\min}}$  (85)

where

$\mu_{\max} \triangleq \max_{1\le k\le N}\mu_k, \qquad \mu_{\min} \triangleq \min_{1\le k\le N}\mu_k$  (86)

For sufficiently small step-sizes, the denominator in (85) can be approximated as

$2\sigma_{k,\min} - \mu_k\big(\sigma_{k,\min}^2 + \alpha\|S\|_1^2\big) \approx 2\sigma_{k,\min}$  (87)

Substituting into (85), we get

$\limsup_{i\to\infty}\|\mathcal{W}_i\|_\infty \le \frac{\|S\|_1^2\,\sigma_v^2}{2\min_{1\le k\le N}\sigma_{k,\min}}\cdot\frac{\mu_{\max}^2}{\mu_{\min}}$  (88)

Therefore, if the step-sizes are sufficiently small, the MSE of each node becomes small as well. This result is clearest when all nodes use the same step-size, so that $\mu_{\max} = \mu_{\min} = \mu$.
Then, the right-hand side of (88) is on the order of $O(\mu)$, as indicated. It follows that the $\{\tilde w_{k,i}\}$ are small in the mean-square sense for small step-sizes, which also means that the mean-square value of $\tilde\phi_{k,i-1}$ is small, because it is a convex combination of the $\{\tilde w_{k,i}\}$ (recall (31)). Then, by definition (36), in steady-state (for large enough $i$), the matrix $H_{l,k,i-1}$ can be approximated by:

$H_{l,k,i-1} \approx \int_0^1 \nabla_w^2 J_l(w^o)\,dt = \nabla_w^2 J_l(w^o)$  (89)

In this case, the matrix $H_{l,k,i-1}$ is no longer random and no longer depends on the error vector $\tilde\phi_{k,i-1}$. Accordingly, in steady-state, the matrix $\mathcal{D}_{i-1}$ that was defined in (42) is no longer random either, and it becomes

$\mathcal{D}_{i-1} \approx \mathcal{D}_\infty \triangleq \sum_{l=1}^N \mathrm{diag}\big\{s_{l,1}\nabla_w^2 J_l(w^o), \cdots, s_{l,N}\nabla_w^2 J_l(w^o)\big\}$  (90)

As a result, in steady-state, the original error recursion (44) can be approximated by

$\tilde w_i = \mathcal{P}_2^T\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_1^T\,\tilde w_{i-1} + \mathcal{P}_2^T\mathcal{M}\, g_i$  (91)

Taking expectations of both sides of (91), we obtain the following mean-error recursion:

$E\tilde w_i = \mathcal{P}_2^T\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_1^T \cdot E\tilde w_{i-1}, \quad i \to \infty$  (92)

which converges to zero if the matrix

$\mathcal{B} \triangleq \mathcal{P}_2^T\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_1^T$  (93)

is stable. The stability of $\mathcal{B}$ can be guaranteed when the step-sizes are sufficiently small (or chosen according to (80)) — see the proof in Appendix C. Therefore, in steady-state, we have

$\lim_{i\to\infty} E\tilde w_i = 0$  (94)

Next, we determine an expression (rather than a bound) for the MSE. To do this, we need to evaluate the covariance matrix of the gradient noise vector $g_i$. Recall from (43) that $g_i$ depends on the $\{\phi_{k,i-1}\}$, which are close to $w^o$ at steady-state for small step-sizes. Therefore, it is sufficient to determine the covariance matrix of $g_i$ at $w^o$.
We denote this covariance matrix by:

$R_v \triangleq E\{g_i g_i^T\}\big|_{\phi_{k,i-1}=w^o} = E\Big[\sum_{l=1}^N \mathrm{col}\big\{s_{l,1}v_{l,i}(w^o), \cdots, s_{l,N}v_{l,i}(w^o)\big\}\Big]\Big[\sum_{l=1}^N \mathrm{col}\big\{s_{l,1}v_{l,i}(w^o), \cdots, s_{l,N}v_{l,i}(w^o)\big\}\Big]^T$  (95)

In practice, we can evaluate $R_v$ from the expressions for the $\{v_{l,i}(w^o)\}$. For example, for the case of the quadratic cost (52), we can substitute (54) into (95) to evaluate $R_v$. Returning to the last term in the first equation of (59), we can evaluate it as follows:

$E\big\|\mathcal{P}_2^T\mathcal{M} g_i\big\|^2_\Sigma = E\, g_i^T\mathcal{M}\mathcal{P}_2\,\Sigma\,\mathcal{P}_2^T\mathcal{M} g_i = \mathrm{Tr}\big(\Sigma\,\mathcal{P}_2^T\mathcal{M}\, E\{g_i g_i^T\}\,\mathcal{M}\mathcal{P}_2\big) = \mathrm{Tr}\big(\Sigma\,\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)$  (96)

Using (90), the matrix $\Sigma'$ in (59) becomes a deterministic quantity as well, and is given by:

$\Sigma' \approx \mathcal{P}_1\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_2\,\Sigma\,\mathcal{P}_2^T\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_1^T$  (97)

Substituting (96) and (97) into (59), an approximate variance relation is obtained for small step-sizes:

$E\|\tilde w_i\|^2_\Sigma \approx E\|\tilde w_{i-1}\|^2_{\Sigma'} + \mathrm{Tr}\big(\Sigma\,\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)$  (98)
$\Sigma' \approx \mathcal{P}_1\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_2\,\Sigma\,\mathcal{P}_2^T\big[I_{MN} - \mathcal{M}\mathcal{D}_\infty\big]\mathcal{P}_1^T$  (99)

Let $\sigma = \mathrm{vec}(\Sigma)$ denote the vectorization operation that stacks the columns of a matrix $\Sigma$ on top of each other. We shall use the notations $\|x\|^2_\sigma$ and $\|x\|^2_\Sigma$ interchangeably to denote the weighted squared Euclidean norm of a vector. Using the Kronecker product property [57, p.147] $\mathrm{vec}(U\Sigma V) = (V^T \otimes U)\,\mathrm{vec}(\Sigma)$, we can vectorize $\Sigma'$ in (99) and find that its vector form is related to $\Sigma$ via the linear relation $\sigma' \triangleq \mathrm{vec}(\Sigma') \approx F\sigma$, where, for sufficiently small step-sizes (so that higher powers of the step-sizes can be ignored), the matrix $F$ is given by

$F \triangleq \big(\mathcal{P}_1[I_{MN} - \mathcal{M}\mathcal{D}_\infty]\mathcal{P}_2\big) \otimes \big(\mathcal{P}_1[I_{MN} - \mathcal{M}\mathcal{D}_\infty]\mathcal{P}_2\big)$  (100)

Here, we used the fact that $\mathcal{M}$ and $\mathcal{D}_\infty$ are block diagonal and symmetric.
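For a small quadratic toy problem, the matrices $\mathcal{B}$ in (93) and $F$ in (100) can be formed explicitly, and both the vectorization identity $\mathrm{vec}(\Sigma') = F\,\mathrm{vec}(\Sigma)$ and their stability at small step-sizes can be confirmed. All parameters below (network size, step-size, Hessians) are illustrative choices, not the paper's settings; we take an ATC-style configuration with $P_1 = I$, $P_2 = A$, and $S = I$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, mu = 3, 2, 0.02

A = np.full((N, N), 1.0 / N)            # averaging combination matrix (columns sum to one)
P1c = np.eye(N * M)                     # curly P_1 = P1 (x) I_M with P1 = I (ATC)
P2c = np.kron(A, np.eye(M))             # curly P_2 = A (x) I_M
Mc = mu * np.eye(N * M)                 # curly M = Omega (x) I_M, uniform step-size

# With S = I and quadratic costs, D_inf is block diagonal with one Hessian per node
D_inf = np.zeros((N * M, N * M))
for k in range(N):
    X = rng.standard_normal((M, M))
    D_inf[k*M:(k+1)*M, k*M:(k+1)*M] = X @ X.T + np.eye(M)   # positive-definite Hessian

core = P1c @ (np.eye(N * M) - Mc @ D_inf) @ P2c
B = P2c.T @ (np.eye(N * M) - Mc @ D_inf) @ P1c.T   # matrix in (93)
F = np.kron(core, core)                            # matrix in (100)

# vec(U Sigma U^T) = (U (x) U) vec(Sigma), with column-major vec
Sigma = np.eye(N * M)
lhs = (core @ Sigma @ core.T).flatten(order='F')
rhs = F @ Sigma.flatten(order='F')
assert np.allclose(lhs, rhs)

rho = lambda X: np.abs(np.linalg.eigvals(X)).max()
assert rho(B) < 1 and rho(F) < 1       # stable for this small step-size
```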
Furthermore, using the property $\mathrm{Tr}(\Sigma X) = [\mathrm{vec}(X^T)]^T\sigma$, we can rewrite (98) as

$E\|\tilde w_i\|^2_\sigma \approx E\|\tilde w_{i-1}\|^2_{F\sigma} + \big[\mathrm{vec}\big(\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)\big]^T\sigma$  (101)

It is shown in [46, pp.344–346] that recursion (101) converges to a steady-state value if the matrix $F$ is stable. This condition is guaranteed when the step-sizes are sufficiently small (or chosen according to (80)) — see Appendix C. Finally, denoting

$E\|\tilde w_\infty\|^2_\sigma \triangleq \lim_{i\to\infty} E\|\tilde w_i\|^2_\sigma$  (102)

and letting $i \to \infty$, expression (101) becomes

$E\|\tilde w_\infty\|^2_\sigma \approx E\|\tilde w_\infty\|^2_{F\sigma} + \big[\mathrm{vec}\big(\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)\big]^T\sigma$

so that

$E\|\tilde w_\infty\|^2_{(I-F)\sigma} \approx \big[\mathrm{vec}\big(\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)\big]^T\sigma$  (103)

Expression (103) is a useful result: it allows us to derive several performance metrics through the proper selection of the free weighting parameter $\sigma$ (or $\Sigma$). First, to be able to evaluate steady-state performance metrics from (103), we need $(I - F)$ to be invertible, which is guaranteed by the stability of the matrix $F$ — see Appendix C. Given that $(I - F)$ is invertible, we can now resort to (103) and use it to evaluate various performance metrics by choosing proper weighting matrices $\Sigma$ (or $\sigma$), as was done in [31] for the mean-square-error estimation problem. For example, the MSE of any node $k$ can be obtained by computing $E\|\tilde w_\infty\|^2_{T_k}$ with a block weighting matrix $T_k$ that has an identity matrix at block $(k,k)$ and zeros elsewhere:

$E\|\tilde w_{k,\infty}\|^2 = E\|\tilde w_\infty\|^2_{T_k}$  (104)

Denote the vectorized version of this matrix by $t_k$, i.e.,

$t_k \triangleq \mathrm{vec}\big(\mathrm{diag}(e_k) \otimes I_M\big)$  (105)

where $e_k$ is the vector whose $k$th entry is one and zeros elsewhere.
Then, if we select $\sigma$ in (103) as $\sigma = (I-F)^{-1}t_k$, the term on the left-hand side becomes the desired $E\|\tilde w_{k,\infty}\|^2$, and the MSE for node $k$ is therefore given by:

$\mathrm{MSE}_k \approx \big[\mathrm{vec}\big(\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)\big]^T (I-F)^{-1}\, t_k$  (106)

This value for $\mathrm{MSE}_k$ is actually the $k$th entry of $\mathcal{W}_\infty$, defined as

$\mathcal{W}_\infty \triangleq \lim_{i\to\infty} \mathcal{W}_i$  (107)

Then, we arrive at an expression for $\mathcal{W}_\infty$ (as opposed to the bound for it in (82), as was explained earlier; expression (108) is derived under the assumption of sufficiently small step-sizes):

$\mathcal{W}_\infty \approx \Big\{I_N \otimes \Big(\big[\mathrm{vec}\big(\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)\big]^T(I-F)^{-1}\Big)\Big\}\, t$  (108)

where $t = \mathrm{col}\{t_1, \ldots, t_N\}$. If we are interested in the network MSE, then the weighting matrix of $E\|\tilde w_\infty\|^2_T$ should be chosen as $T = I_{MN}/N$. Let $q$ denote the vectorized version of $I_{MN}$, i.e.,

$q \triangleq \mathrm{vec}(I_{MN})$  (109)

and select $\sigma$ in (103) as $\sigma = (I-F)^{-1}q/N$. The network MSE is then given by

$\mathrm{MSE} \triangleq \frac{1}{N}\sum_{k=1}^N \mathrm{MSE}_k \approx \frac{1}{N}\big[\mathrm{vec}\big(\mathcal{P}_2^T\mathcal{M} R_v\mathcal{M}\mathcal{P}_2\big)\big]^T(I-F)^{-1}\, q$  (110)

The approximate expressions (108) and (110) hold when the step-sizes are small enough for (90) to hold. In the next section, we will see that they are consistent with the simulation results.

V. SIMULATION RESULTS
In this section, we illustrate the performance of the diffusion strategies (24)–(25) by considering two applications: a regularized least-mean-squares estimation problem with sparse parameters, and a collaborative localization problem. We consider a randomly generated connected network topology with a cyclic path. There are a total of $N = 10$ nodes in the network, and nodes are assumed connected when they are close enough geographically.
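A topology of this kind can be generated in a few lines. The sketch below is only illustrative (the connection radius and coordinates are arbitrary choices; the paper does not specify its exact topology): it scatters $N = 10$ nodes in the unit square, connects nodes that are geographically close, and tests connectivity through powers of the adjacency matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
N, radius = 10, 0.45
pos = rng.random((N, 2))                    # node coordinates in the unit square

# Symmetric adjacency: nodes are neighbors when close enough (self-loops included)
dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
adj = (dist <= radius).astype(float)

# (I + adj)^(N-1) has no zero entry if and only if the graph is connected
reach = np.linalg.matrix_power(np.eye(N) + adj, N - 1)
connected = bool(np.all(reach > 0))
```

If the draw happens to be disconnected, one simply redraws the coordinates (or adds a cyclic path, as done in the simulations here).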
A. Distributed Estimation with Sparse Data
Assume each node $k$ has access to data $\{U_{k,i}, d_{k,i}\}$ generated according to the model

$d_{k,i} = U_{k,i}\, w^o + z_{k,i}$  (111)

where $\{U_{k,i}\}$ is a sequence of $K \times M$ i.i.d. Gaussian random matrices. The entries of each $U_{k,i}$ have zero mean and unit variance, and $z_{k,i} \sim \mathcal{N}(0, \sigma_z^2 I_K)$ is measurement noise that is temporally and spatially white and independent of $U_{l,j}$ for all $k, l, i, j$. Our objective is to estimate $w^o$ from the data $\{U_{k,i}, d_{k,i}\}$ in a distributed manner. In many applications, the vector $w^o$ is sparse, such as $w^o = [1\ 0\ \ldots]^T \in \mathbb{R}^M$. One way to search for sparse solutions is to consider a global cost function of the following form [58]:

$J^{glob}(w) = \sum_{l=1}^N E\|d_{l,i} - U_{l,i} w\|^2 + \rho\, R(w)$  (112)

where $R(w)$ and $\rho$ are the regularization function and regularization factor, respectively. A popular choice is $R(w) = \|w\|_1$, which helps enforce sparsity and is convex [58]–[63]. However, this choice is non-differentiable, and we would need to apply subgradient methods [44, pp.138–144] for a proper implementation. Instead, we use the following twice-differentiable approximation of $\|w\|_1$:

$R(w) = \sum_{m=1}^M \sqrt{[w]_m^2 + \epsilon^2}$  (113)

where $[w]_m$ denotes the $m$th entry of $w$, and $\epsilon$ is a small number. As $\epsilon \to 0$, we have $R(w) \approx \|w\|_1$. Obviously, $R(w)$ is convex, and we can apply the diffusion algorithms to minimize (112) in a distributed manner. To do so, we decompose the global cost into a sum of $N$ individual costs:

$J_l(w) = E\|d_{l,i} - U_{l,i} w\|^2 + \frac{\rho}{N}\, R(w)$  (114)

for $l = 1, \ldots, N$.
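The smooth approximation (113) and its entrywise gradient (given ahead in (117)) are straightforward to code. The sketch below is illustrative only, and checks that the approximation approaches $\|w\|_1$ as $\epsilon$ shrinks:

```python
import numpy as np

def R(w, eps):
    # Twice-differentiable approximation of the l1 norm, as in (113)
    return np.sum(np.sqrt(w**2 + eps**2))

def grad_R(w, eps):
    # Entrywise gradient [w]_m / sqrt([w]_m^2 + eps^2), as in (117)
    return w / np.sqrt(w**2 + eps**2)

w = np.array([1.0, 0.0, -0.5, 0.0])
assert abs(R(w, 1e-6) - np.linalg.norm(w, 1)) < 1e-3   # close to ||w||_1
g = grad_R(w, 1e-6)
assert np.all(np.abs(g) <= 1.0)                        # behaves like a smoothed sign(w)
```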
Then, using algorithms (18) and (19), each node $k$ would update its estimate of $w^o$ by using the gradient vectors of $\{J_l(w)\}_{l\in\mathcal{N}_k}$, which are given by:

$\nabla_w J_l(w) = 2\,E\big(U_{l,i}^T U_{l,i}\big)w - 2\,E\big(U_{l,i}^T d_{l,i}\big) + \frac{\rho}{N}\nabla_w R(w)$  (115)

However, the nodes are assumed to have access to the measurements $\{U_{l,i}, d_{l,i}\}$ and not to the second-order moments $E\big(U_{l,i}^T U_{l,i}\big)$ and $E\big(U_{l,i}^T d_{l,i}\big)$. In this case, nodes can use the available measurements to approximate the gradient vectors in (24) and (25) as:

$\widehat\nabla_w J_l(w) = 2\,U_{l,i}^T\big[U_{l,i} w - d_{l,i}\big] + \frac{\rho}{N}\nabla_w R(w)$  (116)

where

$\nabla_w R(w) = \Big[\frac{[w]_1}{\sqrt{[w]_1^2+\epsilon^2}}\ \cdots\ \frac{[w]_M}{\sqrt{[w]_M^2+\epsilon^2}}\Big]^T$  (117)

In the simulation, we set $M = 50$, $K = 5$, and $\sigma_z^2 = 1$, with $w^o = [1\ 0\ \ldots]^T$. We apply both diffusion and incremental methods to solve the distributed learning problem, where the incremental approach [6]–[9] uses the following construction to determine $w_i$:

• Start with $\psi_{0,i} = w_{i-1}$ at the node at the beginning of the incremental cycle.
• Cycle through the nodes $k = 1, \ldots, N$:
  $\psi_{k,i} = \psi_{k-1,i} - \mu\,\widehat\nabla_w J_k(\psi_{k-1,i})$  (118)
• Set $w_i \leftarrow \psi_{N,i}$.
• Repeat.

The results are averaged over multiple trials. The step-sizes for the ATC, CTA, and non-cooperative algorithms are set to a common value $\mu$, and the step-size for the incremental algorithm is set to $\mu/N$, since the incremental algorithm cycles through all $N$ nodes in each iteration; we therefore need to equalize the convergence rates of the algorithms for a fair comparison [35]. For the ATC and CTA strategies, we use simple averaging weights for the combination step, and for ATC and CTA with gradient exchange, we use Metropolis weights for the $\{c_{l,k}\}$ to combine the gradients (see Table III in [31] for the definitions of averaging and Metropolis weights). We use expression (110) to evaluate the theoretical performance of the diffusion strategies.
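For concreteness, the following self-contained sketch runs an ATC diffusion strategy with the approximate gradients (116) on a small instance of model (111). The network size, neighborhoods, step-size, and sparsity pattern below are illustrative choices, not the paper's simulation settings; the check at the end confirms that cooperative adaptation drives the network estimates toward $w^o$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 10, 8, 5              # nodes, parameter size, measurements per instant
mu, rho, eps = 0.002, 0.1, 1e-3
w_true = np.zeros(M)
w_true[0] = 1.0                 # sparse parameter vector

# Ring-plus-self neighborhoods with simple averaging combination weights a_{l,k}
A = np.zeros((N, N))
for k in range(N):
    for l in (k - 1, k, (k + 1) % N):
        A[l % N, k] = 1.0
A /= A.sum(axis=0)              # columns sum to one

def grad_hat(w, U, d):
    # Instantaneous gradient (116) with the smooth l1 gradient (117)
    return 2 * U.T @ (U @ w - d) + (rho / N) * w / np.sqrt(w**2 + eps**2)

W = np.zeros((N, M))            # one estimate per node
for i in range(4000):
    U = rng.standard_normal((N, K, M))
    d = U @ w_true + rng.standard_normal((N, K))
    psi = np.array([W[k] - mu * grad_hat(W[k], U[k], d[k]) for k in range(N)])
    W = np.array([A[:, k] @ psi for k in range(N)])   # ATC combination step

msd = np.mean(np.sum((W - w_true)**2, axis=1))
assert msd < 0.03               # network MSD is small after adaptation
```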
As a remark, expression (110) gives the MSE with respect to the minimizer of the cost $J^{glob}(w)$ in (112). In this example, the minimizer of (112), denoted by $\hat w^o$, is biased away from the model parameter $w^o$ in (111) when the regularization factor $\rho \ne 0$. To evaluate the theoretical MSE with respect to $w^o$, we use

$\mathrm{MSD} = \lim_{i\to\infty}\frac{1}{N}\sum_{k=1}^N E\|w^o - w_{k,i}\|^2 = \|w^o - \hat w^o\|^2 + \lim_{i\to\infty}\frac{1}{N}\sum_{k=1}^N E\|\hat w^o - w_{k,i}\|^2$  (119)

where the second term in (119) can be evaluated by expression (110) with $w^o$ replaced by $\hat w^o$. Moreover, in the derivation of (119), we used the fact that $\lim_{i\to\infty} E(\hat w^o - w_{k,i}) = 0$ to eliminate the cross term; this result follows from (94), with $w^o$ there replaced by $\hat w^o$. Fig. 2(a) shows the learning curves of the different algorithms. We see that the diffusion and incremental schemes have similar performance, and both show a clear gain over the non-cooperative case.

Fig. 2. Transient and steady-state performance of distributed estimation with sparse parameters. (a) Learning curves. (b) Steady-state MSD.

To examine the impact of the parameter $\epsilon$ and the regularization factor $\rho$, we show the steady-state MSE for different values of $\rho$ and $\epsilon$ in Fig. 2(b). When $\epsilon$ is small, adding a reasonable amount of regularization decreases the steady-state MSE. However, when $\epsilon$ is large ($\epsilon = 1$), expression (113) is no longer a good approximation of $\|w\|_1$, and regularization does not improve the MSE.

B. Distributed Collaborative Localization
The previous example dealt with the convex cost (112). Now, we consider a localization problem that has a non-convex cost function and apply the same diffusion strategies to its solution. Assume each node is interested in locating a common target located at $w^o = [0\ 0]^T$. Each node $k$ knows its position $x_k$ and has a noisy measurement of the squared distance to the target:

$d_k(i) = \|w^o - x_k\|^2 + z_k(i), \quad k = 1, 2, \ldots, N$

where $z_k(i) \sim \mathcal{N}(0, \sigma_{z,k}^2)$ is the measurement noise of node $k$ at time $i$. The component cost function $J_k(w)$ at node $k$ is chosen as

$J_k(w) = \frac{1}{4}\, E\,\big|d_k(i) - \|w - x_k\|^2\big|^2$  (120)

where we multiply by $1/4$ to eliminate a factor of $4$ that would otherwise appear in the gradient. If each node $k$ minimizes $J_k(w)$ individually, it is not possible to solve for $w^o$. Therefore, we use information from other nodes and instead seek to minimize the following global cost:

$J^{glob}(w) = \frac{1}{4}\sum_{k=1}^N E\,\big|d_k(i) - \|w - x_k\|^2\big|^2$  (121)
This problem arises, for example, in cellular communication systems, where multiple base-stations are interested in locating users from the measured distances between themselves and the user. Diffusion algorithms (18) and (19) can be applied to solve the problem in a distributed manner. Each node $k$ would update its estimate of $w^o$ by using the gradient vectors of $\{J_l(w)\}_{l\in\mathcal{N}_k}$, which are given by:

$\nabla_w J_l(w) = -\big(E\,d_l(i)\big)(w - x_l) + \|w - x_l\|^2\,(w - x_l)$  (122)

However, the nodes are assumed to have access to the measurements $\{d_l(i), x_l\}$ and not to $E\,d_l(i)$. In this case, nodes can use the available measurements to approximate the gradient vectors in (24) and (25) as:

$\widehat\nabla_w J_l(w) = -d_l(i)(w - x_l) + \|w - x_l\|^2\,(w - x_l)$  (123)

If we do not exchange the local gradients with neighbors, i.e., if we set $S = I$, then the base-stations only share the local estimates of the target position $w^o$ with their neighbors (there is no exchange of the $\{x_l\}_{l\in\mathcal{N}_k}$). We first simulate the stationary case, where the target stays at $w^o$. In Fig. 3(a), we show the MSE curves for the non-cooperative, ATC, CTA, and incremental algorithms. The noise variance is set to $\sigma_{z,k}^2 = 1$. We set the step-size to $\mu/N$ for the incremental algorithm and to $\mu$ for the other algorithms. For the ATC and CTA strategies, we use simple averaging for the combination step $\{a_{l,k}\}$, and for ATC and CTA with gradient exchange, we use Metropolis weights for the $\{c_{l,k}\}$ to combine the gradients. The performance of the CTA and ATC algorithms is close, and both are close to the incremental scheme. In Fig. 3(b), we show the steady-state MSE for different values of $\mu$. We also use expression (110) to evaluate the theoretical performance of the diffusion strategies. As the step-size becomes small, the performance of the diffusion and incremental algorithms is close, and the MSE decreases as $\mu$ decreases.
Furthermore, we see that exchanging only the local estimates ($S = I$) is sufficient for localization, compared to the case of exchanging both local estimates and gradients ($S = C$). Next, we apply the algorithms to a non-stationary scenario, where the target moves along the trajectory shown in Fig. 3(c). The diffusion algorithms use a constant step-size $\mu$, and the incremental approach uses $\mu/N$. To illustrate the advantage of a constant step-size for continuous tracking, we also simulate the vanishing step-size version of the algorithm from [39], [43], with $\mu_{k,i} \propto 1/i$. The diffusion algorithms track the target well, whereas the non-cooperative algorithm and the algorithm from [39], [43] do not, because a decaying step-size is not helpful for tracking. The tracking performance is shown in Fig. 3(d).
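The effect of the step-size schedule on tracking can be seen in a scalar toy model (illustrative, not the paper's localization simulation): one agent follows a drifting target with a constant step-size, another with a vanishing step-size $\mu_0/i$.

```python
import numpy as np

# Scalar toy model: one agent follows a drifting target w_o(i) by
# stochastic-gradient steps on the instantaneous cost (w - w_o(i))^2 / 2.
# A constant step-size keeps adapting; a vanishing step-size mu0/i
# effectively stops adapting. All constants are illustrative.
rng = np.random.default_rng(2)
T, mu0, drift = 5000, 0.05, 1e-3
w_const = w_decay = 0.0
for i in range(1, T + 1):
    w_o = drift * i                          # slowly moving target
    noise = rng.normal(0.0, 0.1)             # gradient noise
    w_const -= mu0 * ((w_const - w_o) + noise)
    w_decay -= (mu0 / i) * ((w_decay - w_o) + noise)

err_const = abs(w_const - drift * T)         # stays small: tracks the drift
err_decay = abs(w_decay - drift * T)         # large: the agent lags behind
```

The constant step-size settles into a small steady tracking lag of about $\text{drift}/\mu_0$, while the vanishing step-size freezes the estimate long before the target stops moving.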
[Fig. 3. Performance of distributed localization for stationary and moving targets: (a) learning curves (average network MSE versus number of iterations) for a stationary target; (b) steady-state network MSE versus the step-size $\mu$ for a stationary target, including the theoretical curves; (c) trajectory of the moving target and the tracking estimates; (d) learning curves for the moving target. The curves compare the incremental, ATC ($S=I$ and $S=C$), CTA ($S=I$ and $S=C$), and non-cooperative algorithms, as well as the approach from [39], [43] in the moving-target case. Diffusion strategies employ constant step-sizes, which enable continuous adaptation and learning even when the target moves (which corresponds to a changing cost function).]

VI. CONCLUSION
This paper proposed diffusion adaptation strategies to optimize global cost functions over a network of nodes, where the cost consists of several components. Diffusion adaptation allows the nodes to solve the distributed optimization problem via local interaction and online learning. We used gradient approximations and constant step-sizes to endow the networks with continuous learning and tracking abilities. We analyzed the mean-square-error performance of the algorithms in some detail, including their transient and steady-state behavior. Finally, we applied the scheme to two examples: distributed sparse parameter estimation and distributed localization. Compared to incremental methods, diffusion strategies do not require a cyclic path over the nodes, which makes them more robust to node and link failure.

APPENDIX A
PROOF OF MEAN-SQUARE STABILITY
Taking the $\infty$-norm of both sides of (79), we obtain
$$\|\mathcal{W}_i\|_\infty \le \|P_2^T \Gamma P_1^T\|_\infty \cdot \|\mathcal{W}_{i-1}\|_\infty + \sigma_v^2 \|S\|_1^2 \cdot \|P_2^T\|_\infty \cdot \|\Omega\|_\infty$$
$$\le \|P_2^T\|_\infty \cdot \|\Gamma\|_\infty \cdot \|P_1^T\|_\infty \cdot \|\mathcal{W}_{i-1}\|_\infty + \sigma_v^2 \|S\|_1^2 \cdot \|P_2^T\|_\infty \cdot \|\Omega\|_\infty$$
$$= \|\Gamma\|_\infty \cdot \|\mathcal{W}_{i-1}\|_\infty + \Big(\max_{1\le k\le N} \mu_k^2\Big) \cdot \sigma_v^2 \|S\|_1^2 \qquad (124)$$
where we used the fact that $\|P_1^T\|_\infty = \|P_2^T\|_\infty = 1$ because each row of $P_1^T$ and $P_2^T$ sums up to one. Moreover, from (76), we have
$$\|\Gamma\|_\infty = \max_{1 \le k \le N} \left( \gamma_k^2 + \mu_k^2\, \alpha \|S\|_1^2 \right) \qquad (125)$$
Iterating (124), we obtain
$$\|\mathcal{W}_i\|_\infty \le \|\Gamma\|_\infty^i \cdot \|\mathcal{W}_0\|_\infty + \Big(\max_{1\le k\le N} \mu_k^2\Big) \cdot \sigma_v^2 \|S\|_1^2 \sum_{j=0}^{i-1} \|\Gamma\|_\infty^j \qquad (126)$$
We are going to show further ahead that condition (80) guarantees $\|\Gamma\|_\infty < 1$. Now, given that $\|\Gamma\|_\infty < 1$, the first term on the right-hand side of (126) converges to zero as $i \to \infty$, and the second term converges to:
$$\lim_{i\to\infty} \sigma_v^2 \|S\|_1^2 \sum_{j=0}^{i-1} \|\Gamma\|_\infty^j = \frac{\sigma_v^2 \|S\|_1^2}{1 - \|\Gamma\|_\infty} \qquad (127)$$
Therefore, we establish (82) as follows:
$$\limsup_{i\to\infty} \|\mathcal{W}_i\|_\infty \le \Big(\max_{1\le k\le N} \mu_k^2\Big) \cdot \frac{\sigma_v^2 \|S\|_1^2}{1 - \|\Gamma\|_\infty} = \Big(\max_{1\le k\le N} \mu_k^2\Big) \cdot \frac{\|S\|_1^2\, \sigma_v^2}{1 - \max_{1\le k\le N}\left(\gamma_k^2 + \mu_k^2\, \alpha \|S\|_1^2\right)} \qquad (128)$$
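The limit in (127) is the standard geometric series; a quick numerical check with illustrative constants confirms that iterating the scalar recursion drives it to the fixed point regardless of the initial condition.

```python
# Numerical check of the limiting argument in (126)-(127): iterating
# x_i = gamma * x_{i-1} + c with 0 <= gamma < 1 drives x_i to
# c / (1 - gamma) regardless of the initial condition. Constants are
# illustrative stand-ins for ||Gamma||_inf and the noise term.
gamma, c, x = 0.9, 0.5, 100.0
for _ in range(1000):
    x = gamma * x + c
limit = c / (1 - gamma)      # geometric-series limit, here 5.0
```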
The only remaining step is to show that (80) ensures $\|\Gamma\|_\infty < 1$. From (125), we see that the condition $\|\Gamma\|_\infty < 1$ is equivalent to requiring:
$$\gamma_k^2 + \mu_k^2\, \alpha \|S\|_1^2 < 1, \qquad k = 1, \ldots, N. \qquad (129)$$
Then, using (65), this is equivalent to:
$$\Big(1 - \mu_k \sum_{l=1}^{N} s_{l,k} \lambda_{l,\max}\Big)^2 + \mu_k^2\, \alpha \|S\|_1^2 < 1 \qquad (130)$$
$$\Big(1 - \mu_k \sum_{l=1}^{N} s_{l,k} \lambda_{l,\min}\Big)^2 + \mu_k^2\, \alpha \|S\|_1^2 < 1 \qquad (131)$$
for $k = 1, \ldots, N$. Recalling the definitions of $\sigma_{k,\max}$ and $\sigma_{k,\min}$ in (81) and solving these two quadratic inequalities with respect to $\mu_k$, we arrive at:
$$0 < \mu_k < \frac{2\sigma_{k,\max}}{\sigma_{k,\max}^2 + \alpha \|S\|_1^2}, \qquad 0 < \mu_k < \frac{2\sigma_{k,\min}}{\sigma_{k,\min}^2 + \alpha \|S\|_1^2}$$
and we are led to (80).

APPENDIX B
BLOCK MAXIMUM NORM OF A MATRIX
Consider a block matrix $X$ with blocks of size $M \times M$ each. Its block maximum norm is defined as [35]:
$$\|X\|_{b,\infty} \triangleq \max_{x \ne 0} \frac{\|Xx\|_{b,\infty}}{\|x\|_{b,\infty}} \qquad (132)$$
where the block maximum norm of a vector $x \triangleq \mathrm{col}\{x_1, \ldots, x_N\}$, formed by stacking $N$ vectors of size $M$ each on top of each other, is defined as [35]:
$$\|x\|_{b,\infty} \triangleq \max_{1 \le k \le N} \|x_k\| \qquad (133)$$
where $\|\cdot\|$ denotes the Euclidean norm of its vector argument.

Lemma 4 (Block maximum norm). If a block diagonal matrix $X \triangleq \mathrm{diag}\{X_1, \ldots, X_N\} \in \mathbb{R}^{NM \times NM}$ consists of $N$ blocks along the diagonal of dimension $M \times M$ each, then the block maximum norm of $X$ is bounded as
$$\|X\|_{b,\infty} \le \max_{1 \le k \le N} \|X_k\| \qquad (134)$$
in terms of the $2$-induced norms of $\{X_k\}$ (their largest singular values). Moreover, if $X$ is symmetric, then equality holds in (134).
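Before turning to the proof, the lemma can be checked numerically with random symmetric blocks (illustrative data only): random vectors never exceed the bound in (134), and the eigenvector construction used in the proof attains it.

```python
import numpy as np

# Numerical check of Lemma 4 with random symmetric blocks (illustrative
# data): the block maximum norm of a block-diagonal symmetric X equals
# the largest 2-induced norm among its blocks.
rng = np.random.default_rng(3)
N, M = 4, 3
blocks = [rng.normal(size=(M, M)) for _ in range(N)]
blocks = [0.5 * (B + B.T) for B in blocks]           # make blocks symmetric
X = np.zeros((N * M, N * M))
for k, B in enumerate(blocks):
    X[k * M:(k + 1) * M, k * M:(k + 1) * M] = B

def b_inf(v):
    """Block maximum norm (133): largest Euclidean norm among sub-blocks."""
    return max(np.linalg.norm(v[k * M:(k + 1) * M]) for k in range(N))

bound = max(np.linalg.norm(B, 2) for B in blocks)    # right-hand side of (134)

# The ratio in (132) never exceeds the bound for random vectors ...
for _ in range(1000):
    v = rng.normal(size=N * M)
    assert b_inf(X @ v) <= bound * b_inf(v) + 1e-9

# ... and the proof's choice (dominant eigenvector of the block with the
# largest norm, padded with zeros) attains it.
k_star = int(np.argmax([np.linalg.norm(B, 2) for B in blocks]))
evals, evecs = np.linalg.eigh(blocks[k_star])
v_star = np.zeros(N * M)
v_star[k_star * M:(k_star + 1) * M] = evecs[:, np.argmax(np.abs(evals))]
ratio = b_inf(X @ v_star) / b_inf(v_star)            # equals the bound
```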
Proof:
Note that $Xx = \mathrm{col}\{X_1 x_1, \ldots, X_N x_N\}$. Evaluating the block maximum norm of the vector $Xx$ leads to
$$\|Xx\|_{b,\infty} = \max_{1\le k\le N} \|X_k x_k\| \le \max_{1\le k\le N} \|X_k\| \cdot \|x_k\| \le \Big(\max_{1\le k\le N} \|X_k\|\Big) \cdot \Big(\max_{1\le k\le N} \|x_k\|\Big) \qquad (135)$$
Substituting (135) and (133) into (132), we establish (134) as
$$\|X\|_{b,\infty} \triangleq \max_{x\ne 0} \frac{\|Xx\|_{b,\infty}}{\|x\|_{b,\infty}} \le \max_{x\ne 0} \frac{\max_{1\le k\le N}\|X_k\| \cdot \max_{1\le k\le N}\|x_k\|}{\max_{1\le k\le N}\|x_k\|} = \max_{1\le k\le N} \|X_k\| \qquad (136)$$
Next, we prove that, if all the diagonal blocks of $X$ are symmetric, then equality holds in (136). To do this, we only need to exhibit an $x^o \ne 0$ such that
$$\frac{\|Xx^o\|_{b,\infty}}{\|x^o\|_{b,\infty}} = \max_{1\le k\le N} \|X_k\| \qquad (137)$$
which would mean that
$$\|X\|_{b,\infty} \triangleq \max_{x\ne 0} \frac{\|Xx\|_{b,\infty}}{\|x\|_{b,\infty}} \ge \frac{\|Xx^o\|_{b,\infty}}{\|x^o\|_{b,\infty}} = \max_{1\le k\le N} \|X_k\| \qquad (138)$$
Then, combining inequalities (136) and (138), we would obtain the desired equality
$$\|X\|_{b,\infty} = \max_{1\le k\le N} \|X_k\| \qquad (139)$$
when $X$ is block diagonal and symmetric. Thus, without loss of generality, assume the maximum in (137) is achieved by $X_1$, i.e.,
$$\max_{1\le k\le N} \|X_k\| = \|X_1\|$$
For a symmetric $X_1$, its $2$-induced norm $\|X_1\|$ (defined as the largest singular value of $X_1$) coincides with the spectral radius of $X_1$. Let $\lambda_1$ denote the eigenvalue of $X_1$ of largest magnitude, with the corresponding right eigenvector given by $z_1$. Then,
$$\max_{1\le k\le N} \|X_k\| = |\lambda_1|, \qquad X_1 z_1 = \lambda_1 z_1$$
We select $x^o = \mathrm{col}\{z_1, 0, \ldots, 0\}$. Then, we establish (137) by:
$$\frac{\|Xx^o\|_{b,\infty}}{\|x^o\|_{b,\infty}} = \frac{\|\mathrm{col}\{X_1 z_1, 0, \ldots, 0\}\|_{b,\infty}}{\|\mathrm{col}\{z_1, 0, \ldots, 0\}\|_{b,\infty}} = \frac{\|X_1 z_1\|}{\|z_1\|} = \frac{\|\lambda_1 z_1\|}{\|z_1\|} = |\lambda_1| = \max_{1\le k\le N} \|X_k\|$$

APPENDIX C
STABILITY OF $\mathcal{B}$ AND $\mathcal{F}$

Recall the definitions of the matrices $\mathcal{B}$ and $\mathcal{F}$ from (93) and (100):
$$\mathcal{B} = P_2^T \left[ I_{MN} - \mathcal{M} \mathcal{D}_\infty \right] P_1^T \qquad (140)$$
$$\mathcal{F} = \left( P_1 \left[ I_{MN} - \mathcal{M} \mathcal{D}_\infty \right] P_2 \right) \otimes \left( P_1 \left[ I_{MN} - \mathcal{M} \mathcal{D}_\infty \right] P_2 \right) = \mathcal{B}^T \otimes \mathcal{B}^T \qquad (141)$$
From (140)-(141), we obtain (see Theorem 13.12 from [57, p. 141]):
$$\rho(\mathcal{F}) = \rho(\mathcal{B}^T \otimes \mathcal{B}^T) = \left[\rho(\mathcal{B}^T)\right]^2 = \left[\rho(\mathcal{B})\right]^2 \qquad (142)$$
where $\rho(\cdot)$ denotes the spectral radius of its matrix argument. Therefore, the stability of the matrix $\mathcal{F}$ is equivalent to the stability of the matrix $\mathcal{B}$, and we only need to examine the stability of $\mathcal{B}$. Now note that the block maximum norm (see the definition in Appendix B) of the matrix $\mathcal{B}$ satisfies
$$\|\mathcal{B}\|_{b,\infty} \le \|I_{MN} - \mathcal{M} \mathcal{D}_\infty\|_{b,\infty} \qquad (143)$$
since the block maximum norms of $P_1$ and $P_2$ are one (see [35, p. 4801]):
$$\|P_1^T\|_{b,\infty} = 1, \qquad \|P_2^T\|_{b,\infty} = 1 \qquad (144)$$
Moreover, by noting that the spectral radius of a matrix is upper bounded by any matrix norm (Theorem 5.6.9, [50, p. 297]) and that $I_{MN} - \mathcal{M} \mathcal{D}_\infty$ is symmetric and block diagonal, we have
$$\rho(\mathcal{B}) \le \|I_{MN} - \mathcal{M} \mathcal{D}_\infty\|_{b,\infty} = \rho(I_{MN} - \mathcal{M} \mathcal{D}_\infty) \qquad (145)$$
Therefore, the stability of $\mathcal{B}$ is guaranteed by the stability of $I_{MN} - \mathcal{M} \mathcal{D}_\infty$. Next, we call upon the following lemma to bound $\|I_{MN} - \mathcal{M} \mathcal{D}_\infty\|_{b,\infty}$.

Lemma 5 (Norm of $I_{MN} - \mathcal{M} \mathcal{D}_\infty$). The matrix $\mathcal{D}_\infty$ defined in (90) satisfies
$$\|I_{MN} - \mathcal{M} \mathcal{D}_\infty\|_{b,\infty} \le \max_{1 \le k \le N} \gamma_k \qquad (146)$$
where $\gamma_k$ is defined in (65).

Proof:
Since $\mathcal{D}_\infty$ is block diagonal and symmetric, $I_{MN} - \mathcal{M} \mathcal{D}_\infty$ is also block diagonal and symmetric, with blocks $\{I_M - \mu_k D_{k,\infty}\}$, where $D_{k,\infty}$ denotes the $k$th diagonal block of $\mathcal{D}_\infty$. Then, from (134) in Lemma 4 in Appendix B, it holds that
$$\|I_{MN} - \mathcal{M} \mathcal{D}_\infty\|_{b,\infty} = \max_{1\le k\le N} \|I_M - \mu_k D_{k,\infty}\| \qquad (147)$$
By the definition of $\mathcal{D}_\infty$ in (90), and using condition (45) from Assumption 1, we have
$$\Big(\sum_{l=1}^{N} s_{l,k} \lambda_{l,\min}\Big) \cdot I_M \le D_{k,\infty} \le \Big(\sum_{l=1}^{N} s_{l,k} \lambda_{l,\max}\Big) \cdot I_M$$
which implies that
$$I_M - \mu_k D_{k,\infty} \ge \Big(1 - \mu_k \sum_{l=1}^{N} s_{l,k} \lambda_{l,\max}\Big) \cdot I_M \qquad (148)$$
$$I_M - \mu_k D_{k,\infty} \le \Big(1 - \mu_k \sum_{l=1}^{N} s_{l,k} \lambda_{l,\min}\Big) \cdot I_M \qquad (149)$$
Thus, $\|I_M - \mu_k D_{k,\infty}\| \le \gamma_k$. Substituting into (147), we get (146).

Substituting (146) into (145), we get:
$$\rho(\mathcal{B}) \le \max_{1\le k\le N} \gamma_k \qquad (150)$$
As long as $\max_{1\le k\le N} \gamma_k < 1$, all the eigenvalues of $\mathcal{B}$ will lie within the unit circle. By the definition of $\gamma_k$ in (65), this is equivalent to requiring
$$|1 - \mu_k \sigma_{k,\max}| < 1, \qquad |1 - \mu_k \sigma_{k,\min}| < 1$$
for $k = 1, \ldots, N$, where $\sigma_{k,\max}$ and $\sigma_{k,\min}$ are defined in (81). These conditions are satisfied if we choose $\mu_k$ such that
$$0 < \mu_k < 2/\sigma_{k,\max}, \qquad k = 1, \ldots, N \qquad (151)$$
which is obviously guaranteed for sufficiently small step-sizes (and also by condition (80)).

REFERENCES

[1] J. Chen, S.-Y. Tu, and A. H. Sayed, "Distributed optimization via diffusion adaptation," in
Proc. IEEE International Workshop on Comput. Advances Multi-Sensor Adaptive Process. (CAMSAP), Puerto Rico, Dec. 2011, pp. 281–284.
[2] J. Chen and A. H. Sayed, "Performance of diffusion adaptation for collaborative optimization," in Proc. IEEE International Conf. Acoustics, Speech and Signal Process. (ICASSP), Kyoto, Japan, March 2012, pp. 1–4.
[3] S.-Y. Tu and A. H. Sayed, "Mobile adaptive networks," IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 649–664, Aug. 2011.
[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction," in Proc. International Conf. Machine Learning (ICML), Bellevue, USA, June 2011, pp. 713–720.
[5] Z. J. Towfic, J. Chen, and A. H. Sayed, "Collaborative learning of mixture models using diffusion adaptation," in Proc. IEEE Workshop on Mach. Learning Signal Process. (MLSP), Beijing, China, Sep. 2011, pp. 1–6.
[6] D. P. Bertsekas, "A new class of incremental gradient methods for least squares problems," SIAM J. Optim., vol. 7, no. 4, pp. 913–926, 1997.
[7] A. Nedic and D. P. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM J. Optim., vol. 12, no. 1, pp. 109–138, 2001.
[8] M. G. Rabbat and R. D. Nowak, "Quantized incremental algorithms for distributed optimization," IEEE J. Sel. Areas Commun., vol. 23, no. 4, pp. 798–808, Apr. 2005.
[9] C. G. Lopes and A. H. Sayed, "Incremental adaptive strategies over distributed networks," IEEE Trans. Signal Process., vol. 55, no. 8, pp. 4064–4077, Aug. 2007.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 1st edition. Athena Scientific, Singapore, 1997.
[11] J. N. Tsitsiklis and M. Athans, "Convergence and asymptotic agreement in distributed decision problems," IEEE Trans. Autom. Control, vol. 29, no. 1, pp. 42–50, Jan. 1984.
[12] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Trans. Autom. Control, vol. 31, no. 9, pp. 803–812, Sep. 1986.
[13] S. Barbarossa and G. Scutari, "Bio-inspired sensor network design," IEEE Signal Process. Mag., vol. 24, no. 3, pp. 26–35, May 2007.
[14] A. Nedic and A. Ozdaglar, "Cooperative distributed multi-agent optimization," in Convex Optimization in Signal Processing and Communications, Y. Eldar and D. Palomar, Eds., pp. 340–386, 2009.
[15] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[16] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals," IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350–364, 2008.
[17] S. Kar and J. M. F. Moura, "Sensor networks with random links: Topology design for distributed consensus," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3315–3326, July 2008.
[18] S. Kar and J. M. F. Moura, "Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs," IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 674–690, Aug. 2011.
[19] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione, "Gossip algorithms for distributed signal processing," Proc. of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
[20] R. Olfati-Saber and R. M. Murray, "Consensus problems in networks of agents with switching topology and time-delays," IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1520–1533, Sep. 2004.
[21] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione, "Broadcast gossip algorithms for consensus," IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2748–2761, 2009.
[22] S. Sardellitti, M. Giona, and S. Barbarossa, "Fast distributed average consensus algorithms based on advection-diffusion processes," IEEE Trans. Signal Process., vol. 58, no. 2, pp. 826–842, Feb. 2010.
[23] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Proc. Int. Symp. Information Processing in Sensor Networks (IPSN), Los Angeles, CA, Apr. 2005, pp. 63–70.
[24] C. Eksin and A. Ribeiro, "Network optimization with heuristic rational agents," in Proc. Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, CA, Nov. 2011, pp. 1–5.
[25] R. M. Karp, "Reducibility among combinatorial problems," in Complexity of Computer Computations, R. E. Miller and J. W. Thatcher, Eds., pp. 85–104, 1972.
[26] C. G. Lopes and A. H. Sayed, "Distributed processing over adaptive networks," in Proc. Adaptive Sensor Array Processing Workshop, MIT Lincoln Laboratory, MA, June 2006, pp. 1–5.
[27] C. G. Lopes and A. H. Sayed, "Diffusion least-mean squares over adaptive networks," in Proc. IEEE ICASSP, vol. 3, Honolulu, HI, Apr. 2007, pp. 917–920.
[28] A. H. Sayed and C. G. Lopes, "Adaptive processing over distributed networks," IEICE Trans. Fund. Electron., Commun. Comput. Sci., vol. E90-A, no. 8, pp. 1504–1510, Aug. 2007.
[29] C. G. Lopes and A. H. Sayed, "Diffusion least-mean squares over adaptive networks: Formulation and performance analysis," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3122–3136, July 2008.
[30] F. S. Cattivelli and A. H. Sayed, "Diffusion LMS algorithms with information exchange," in Proc. Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, Nov. 2008, pp. 251–255.
[31] F. S. Cattivelli and A. H. Sayed, "Diffusion LMS strategies for distributed estimation," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1035–1048, March 2010.
[32] F. S. Cattivelli, C. G. Lopes, and A. H. Sayed, "A diffusion RLS scheme for distributed estimation over adaptive networks," in Proc. IEEE Workshop on Signal Process. Advances Wireless Comm. (SPAWC), Helsinki, Finland, June 2007, pp. 1–5.
[33] F. S. Cattivelli, C. G. Lopes, and A. H. Sayed, "Diffusion recursive least-squares for distributed estimation over adaptive networks," IEEE Trans. Signal Process., vol. 56, no. 5, pp. 1865–1877, May 2008.
[34] F. S. Cattivelli and A. H. Sayed, "Diffusion strategies for distributed Kalman filtering and smoothing," IEEE Trans. Autom. Control, vol. 55, no. 9, pp. 2069–2084, Sep. 2010.
[35] N. Takahashi, I. Yamada, and A. H. Sayed, "Diffusion least-mean squares with adaptive combiners: Formulation and performance analysis," IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4795–4810, Sep. 2010.
[36] F. S. Cattivelli and A. H. Sayed, "Modeling bird flight formations using diffusion adaptation," IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2038–2051, May 2011.
[37] P. Di Lorenzo, S. Barbarossa, and A. H. Sayed, "Bio-inspired swarming for dynamic radio access based on diffusion adaptation," in Proc. European Signal Process. Conf. (EUSIPCO), Aug. 2011, pp. 1–6.
[38] S. Chouvardas, K. Slavakis, and S. Theodoridis, "Adaptive robust distributed learning in diffusion sensor networks," IEEE Trans. Signal Process., vol. 59, no. 10, pp. 4692–4707, 2011.
[39] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
[40] P. Bianchi, G. Fort, W. Hachem, and J. Jakubowicz, "Convergence of a distributed parameter estimator for sensor networks with local averaging of the estimates," in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011, pp. 3764–3767.
[41] D. P. Bertsekas, "Incremental gradient, subgradient, and proximal methods for convex optimization: A survey," LIDS Technical Report, MIT, no. 2848, 2010.
[42] V. S. Borkar and S. P. Meyn, "The ODE method for convergence of stochastic approximation and reinforcement learning," SIAM J. Control Optim., vol. 38, no. 2, pp. 447–469, 2000.
[43] K. Srivastava and A. Nedic, "Distributed asynchronous constrained stochastic optimization," IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 772–790, Aug. 2011.
[44] B. Polyak, Introduction to Optimization. Optimization Software, NY, 1987.
[45] J. Chen and A. H. Sayed, "Distributed Pareto-optimal solutions via diffusion adaptation," in Proc. IEEE Statistical Signal Process. Workshop (SSP), Ann Arbor, MI, Aug. 2012.
[46] A. H. Sayed, Adaptive Filters. Wiley, NJ, 2008.
[47] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd edition. Johns Hopkins University Press, 1996.
[48] S. S. Stankovic, M. S. Stankovic, and D. M. Stipanovic, "Decentralized parameter estimation by consensus based stochastic approximation," IEEE Trans. Autom. Control, vol. 56, no. 3, pp. 531–543, Mar. 2011.
[49] S.-Y. Tu and A. H. Sayed, "Diffusion networks outperform consensus networks," in Proc. IEEE Statistical Signal Processing Workshop (SSP), Ann Arbor, MI, Aug. 2012.
[50] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 1990.
[51] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM J. Optim., vol. 10, no. 3, pp. 627–642, 2000.
[52] S. Haykin, Adaptive Filter Theory, 2nd edition. Prentice Hall, 2002.
[53] J. Arenas-Garcia, M. Martinez-Ramon, A. Navia-Vazquez, and A. R. Figueiras-Vidal, "Plant identification via adaptive combination of transversal filters," Signal Processing, vol. 86, no. 9, pp. 2430–2438, 2006.
[54] M. Silva and V. Nascimento, "Improving the tracking capability of adaptive filters via convex combination," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3137–3149, 2008.
[55] S. Theodoridis, K. Slavakis, and I. Yamada, "Adaptive learning in a world of projections," IEEE Signal Process. Mag., vol. 28, no. 1, pp. 97–123, Jan. 2011.
[56] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[57] A. J. Laub, Matrix Analysis for Scientists and Engineers. SIAM, PA, 2005.
[58] P. Di Lorenzo, S. Barbarossa, and A. H. Sayed, "Sparse diffusion LMS for distributed adaptive estimation," in Proc. IEEE ICASSP, Kyoto, Japan, March 2012, pp. 1–4.
[59] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Royal Statist. Soc. B, pp. 267–288, 1996.
[60] R. G. Baraniuk, "Compressive sensing," IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–121, July 2007.
[61] E. Candes, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted $\ell_1$ minimization," J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2008.
[62] G. Mateos, J. A. Bazerque, and G. B. Giannakis, "Distributed sparse linear regression," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5262–5276, 2010.
[63] Y. Kopsinis, K. Slavakis, and S. Theodoridis, "Online sparse system identification and signal reconstruction using projections onto weighted $\ell_1$ balls," IEEE Trans. Signal Process., vol. 59, no. 3, pp. 936–952, Mar. 2011.
IEEE Trans. Signal Process. ,vol. 58, no. 10, pp. 5262–5276, 2010.[63] Y. Kopsinis, K. Slavakis, and S. Theodoridis, “Online sparse system identification and signal reconstruction using projectionsonto weighted (cid:96) balls,” IEEE Trans. Signal Process. , vol. 59, no. 3, pp. 936–952, Mar. 2011., vol. 59, no. 3, pp. 936–952, Mar. 2011.