Multitask diffusion adaptation over asynchronous networks
Roula Nassif, Student Member, IEEE, Cédric Richard, Senior Member, IEEE, André Ferrari, Member, IEEE, and Ali H. Sayed, Fellow, IEEE
Abstract—The multitask diffusion LMS is an efficient strategy to simultaneously infer, in a collaborative manner, multiple parameter vectors. Existing works on multitask problems assume that all agents respond to data synchronously. In several applications, agents may not be able to act synchronously because networks can be subject to several sources of uncertainty, such as changing topology, random link failures, or agents turning on and off for energy conservation. In this work, we describe a model for the solution of multitask problems over asynchronous networks and carry out a detailed mean and mean-square error analysis. Results show that sufficiently small step-sizes can still ensure both stability and performance. Simulations and illustrative examples are provided to verify the theoretical findings.
Index Terms—Distributed optimization, asynchronous networks, diffusion adaptation, multitask learning, mean-square performance analysis.
I. INTRODUCTION
Distributed adaptive learning enables agents to learn a concept via local information exchange, and to continuously adapt in order to track possible concept drifts. Distributed implementations offer an attractive alternative to centralized solutions, with advantages related to scalability, robustness, and decentralization (see, e.g., [2], [3] and the many examples therein). Several strategies for distributed online parameter estimation have been proposed in the literature, including consensus strategies [4]–[9], incremental strategies [10]–[14], and diffusion strategies [15]–[20]. Incremental techniques operate on a cyclic path that runs across all nodes, which makes them sensitive to link failures and problematic for adaptive implementations. On the other hand, diffusion strategies are particularly attractive due to their enhanced adaptation performance and wider stability ranges compared with consensus-based implementations. Accessible overviews of results on diffusion adaptation can be found in [2], [15], [16].

Most prior literature focuses primarily on the case where nodes estimate a single parameter vector collaboratively. We refer to problems of this type as single-task problems.
The work of C. Richard and A. Ferrari was partly supported by ANR and DGA grant ANR-13-ASTR-0030 (ODISSEE project). The work of A. H. Sayed was supported in part by NSF grants CIF-1524250 and ECCS-1407712. A short version of this work appears in the conference publication [1]. R. Nassif, C. Richard, and A. Ferrari are with the Université de Nice Sophia-Antipolis, France (email: [email protected]; [email protected]; [email protected]). A. H. Sayed is with the Department of Electrical Engineering, University of California, Los Angeles, USA (email: [email protected]).

Some applications require more complex models and flexible algorithms than single-task implementations since their agents may involve the need to track multiple targets simultaneously. For instance, sensor networks deployed to estimate a spatially-varying temperature profile need to exploit more directly the spatio-temporal correlations that exist between measurements at neighboring nodes [21]. Likewise, monitoring applications where agents need to track the movement of multiple correlated targets need to exploit the correlation profile in the data for enhanced accuracy. Problems of this kind, where nodes need to infer multiple parameter vectors, are referred to as multitask problems.

Existing strategies to address multitask problems mostly depend on how the tasks relate to each other and on exploiting some prior information. There have been some useful works dealing with such problems over distributed networks. For example, in [22] a diffusion strategy of the LMS type is developed to solve distributed optimization problems where nodes are interested in estimating parameters of local interest as well as parameters of global interest to the whole network. In [23], an extension of the diffusion algorithm developed in [22] allows nodes to estimate parameters of common interest to a subset of nodes simultaneously with parameters of local and global interest. In comparison, the parameter space is decomposed into two orthogonal subspaces in [24], with one of the subspaces being common to all nodes. Multitask estimation algorithms over fully connected networks and tree networks are also considered in [25], [26]. These works assume that the node-specific parameter vectors lie in a common latent signal subspace and exploit this property to compress information and reduce communication costs. An alternative way to exploit and model relationships among tasks is to formulate optimization problems with appropriate co-regularizers between nodes [27], [28]. The multitask diffusion LMS algorithm derived in [27] relies on this principle, and we build on this construction in this article. In this context, the network is not assumed to be fully connected and agents need not be interested in some common parameters. It is sufficient to assume that different clusters within the network are interested in their own models, and that there are some correlations among the models of adjacent clusters. These correlations are captured by means of regularization parameters. Multitask estimation problems have also been addressed over diffusion networks where no prior information on possible relationships between tasks is assumed and nodes do not know which other nodes share the same task [29]–[32]. In this case, it was argued in [29] that the diffusion iterates converge to a Pareto optimal solution when confronted with a multi-objective optimization problem. To avoid cooperation between neighbors seeking different objectives, automatic clustering techniques using diffusion strategies have been proposed.
The clustering techniques developed in [30], [31] are based on setting the combination coefficients in an online manner. The technique proposed in [33] is based on solving a hypothesis test problem for setting the neighborhood in an online manner.

The aforementioned works on multitask problems assume that all agents respond to data synchronously. In several applications, agents may not be able to act synchronously because networks can be subject to several sources of uncertainty, such as changing topology, random link failures, or agents turning on and off. There exist several useful studies in the literature on the performance of consensus and gossip strategies in the presence of asynchronous events [8], [9], [34], [35] or changing topologies [8], [9], [35]–[41]. For the most part, these works investigate pure averaging algorithms that cannot process streaming data, assume noise-free data, or make use of decreasing step-size sequences. There are also studies in the context of diffusion strategies. In particular, the works [42]–[44] advanced a rather general framework for asynchronous networks that includes many prior models as special cases. These works examined how asynchronous events interfere with the behavior of adaptive networks in the presence of streaming noisy data and under constant step-size adaptation. Several interesting conclusions are reported in [44], where comparisons are carried out between synchronous and asynchronous behavior, as well as with centralized solutions. In the current work, we would like to examine similar effects to [42], [43], albeit in the context of multitask networks as opposed to single-task networks. In this case, a new dimension arises in that asynchronous events can interfere with the exchange of information among clusters. We examine in some detail the mean and mean-square stability of the multitask network and show that sufficiently small step-sizes can still ensure convergence and performance. Various simulation results illustrate the theoretical findings.

This paper is organized as follows. In Section II, we briefly recall the multitask diffusion LMS strategy and introduce a fairly general model for asynchronous behavior. Under this model, agents in the network may stop updating their solutions, or may stop sending or receiving information, in a random manner. Section III analyzes the theoretical performance of the algorithm in the mean and mean-square error sense. In Section IV, experiments are presented to illustrate the performance of the diffusion multitask approach over asynchronous networks.

II. MULTITASK DIFFUSION LMS OVER ASYNCHRONOUS NETWORKS
Before starting our presentation, we provide a summary of some of the main symbols used in the article. Other symbols will be defined in the context where they are used:

$x$ : Normal font letters denote scalars.
$\boldsymbol{x}$ : Boldface lowercase letters denote column vectors.
$\boldsymbol{R}$ : Boldface uppercase letters denote matrices.
$(\cdot)^\top$ : Matrix transpose.
$(\cdot)^{-1}$ : Matrix inverse.
$I_N$ : Identity matrix of size $N \times N$.
$\mathcal{N}_k$ : The set of nodes in the neighborhood of node $k$, including $k$.
$\mathcal{N}_k^-$ : The set of nodes in the neighborhood of node $k$, excluding $k$.
$\mathcal{C}_j$ : Cluster $j$, i.e., the index set of nodes in the $j$-th cluster.
$\mathcal{C}(k)$ : The cluster of nodes to which node $k$ belongs, including $k$.
$\mathcal{C}(k)^-$ : The cluster of nodes to which node $k$ belongs, excluding $k$.

We now briefly recall the synchronous diffusion adaptation strategy developed in [27] for solving distributed optimization problems over multitask networks.

A. Multitask diffusion adaptation
We consider a connected network consisting of $N$ nodes grouped into $Q$ clusters, as illustrated in Figure 1. The problem is to estimate an $L \times 1$ unknown vector $w_k^\star$ at each node $k$ from collected data. Node $k$ has access to temporal measurement sequences $\{d_k(i), x_k(i)\}$, where $d_k(i)$ is a scalar zero-mean reference signal, and $x_k(i)$ is an $L \times 1$ regression vector with a positive-definite covariance matrix $R_{x,k} = E\{x_k(i)\, x_k^\top(i)\} > 0$. The data at node $k$ are assumed to be related via the linear regression model
$$d_k(i) = x_k^\top(i)\, w_k^\star + z_k(i), \qquad (1)$$
where $z_k(i)$ is a zero-mean i.i.d. noise with variance $\sigma_{z,k}^2$ that is independent of any other signal. We assume that nodes belonging to the same cluster have the same parameter vector to estimate, namely,
$$w_k^\star = w_{\mathcal{C}_q}^\star, \quad \text{whenever } k \in \mathcal{C}_q. \qquad (2)$$
We say that two clusters are connected if there exists at least one edge linking a node from one cluster to a node in the other cluster. We also assume that relationships between connected clusters exist, so that cooperation among adjacent clusters is beneficial. In particular, we suppose that the parameter vectors corresponding to two connected clusters $\mathcal{C}_p$ and $\mathcal{C}_q$ satisfy certain properties, such as being close to each other [27]. Cooperation across these clusters can therefore be beneficial to infer $w_{\mathcal{C}_p}^\star$ and $w_{\mathcal{C}_q}^\star$.

Consider the cluster $\mathcal{C}(k)$ to which node $k$ belongs. A local cost function, $J_k(w_{\mathcal{C}(k)})$, is associated with node $k$. It is assumed to be strongly convex and second-order differentiable; an example is the mean-square error criterion considered throughout this paper and defined by
$$J_k(w_{\mathcal{C}(k)}) = E\{|d_k(i) - x_k^\top(i)\, w_{\mathcal{C}(k)}|^2\}. \qquad (3)$$
Depending on the application, there may be certain properties among the optimal vectors $\{w_{\mathcal{C}_1}^\star, \ldots, w_{\mathcal{C}_Q}^\star\}$ that deserve to be promoted in order to enhance estimation accuracy.
Fig. 1. Clustered network consisting of clusters. Two clusters are connected if they share at least one edge.

Among other possible options, a smoothness condition was enforced in [27]. Specifically, the local variation of the graph signal at node $k$ was defined as the squared $\ell_2$-norm of the graph gradient at this node [45], namely,
$$\|\nabla_k \mathcal{W}\|^2 = \sum_{\ell \in \mathcal{N}_k} \rho_{k\ell}\, \|w_k - w_\ell\|^2 \qquad (4)$$
where $\rho_{k\ell}$ is a nonnegative weight assigned to the edge between nodes $k$ and $\ell$. As an alternative to (4), and in order to promote piecewise-constant transitions in the entries of the parameter vectors, the use of the $\ell_1$-norm of the graph gradient at each node was also proposed and studied in [28]. In this paper, we will focus on (4).

To estimate the unknown parameter vectors $w^\star_{\mathcal{C}_1}, \ldots, w^\star_{\mathcal{C}_Q}$, it was shown in [27] that the local cost (3) and the regularizer (4) can be combined at the level of each cluster. This formulation led to the following estimation problem, defined in terms of $Q$ Nash equilibrium problems [46], where each cluster $\mathcal{C}_j$ estimates $w^\star_{\mathcal{C}_j}$ by minimizing the regularized cost function $J_{\mathcal{C}_j}(w_{\mathcal{C}_j}, w_{-\mathcal{C}_j})$:
$$(\mathcal{P}_j)\ \min_{w_{\mathcal{C}_j}}\ J_{\mathcal{C}_j}(w_{\mathcal{C}_j}, w_{-\mathcal{C}_j}) \ \text{with}\ J_{\mathcal{C}_j}(w_{\mathcal{C}_j}, w_{-\mathcal{C}_j}) = \sum_{k \in \mathcal{C}_j} E\{|d_k(i) - x_k^\top(i)\, w_{\mathcal{C}(k)}|^2\} + \eta \sum_{k \in \mathcal{C}_j} \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}_j} \rho_{k\ell}\, \|w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)}\|^2 \qquad (5)$$
for $j = 1, \ldots, Q$. Note that we have kept the notation $w_{\mathcal{C}(k)}$ in (5) to make the role of the regularization term clearer, even though $w_{\mathcal{C}(k)} = w_{\mathcal{C}_j}$ for all $k$ in $\mathcal{C}_j$. The notation $w_{-\mathcal{C}_j}$ denotes the collection of weight vectors estimated by the other clusters, that is, $w_{-\mathcal{C}_j} = \{w_{\mathcal{C}_q} : q = 1, \ldots, Q\} \setminus \{w_{\mathcal{C}_j}\}$. The second term on the RHS of expression (5) enforces smoothness of the resulting graph parameter vectors $\{w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}\}$, with strength parameter $\eta \ge 0$. In [27], the coefficients $\{\rho_{k\ell}\}$ were chosen to satisfy the conditions:
$$\sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)^-} \rho_{k\ell} = 1, \quad \text{and} \quad \begin{cases} \rho_{k\ell} > 0, & \text{if } \ell \in \mathcal{N}_k \setminus \mathcal{C}(k), \\ \rho_{kk} \ge 0, & \\ \rho_{k\ell} = 0, & \text{otherwise.} \end{cases} \qquad (6)$$
We impose $\rho_{k\ell} = 0$ for all $\ell \notin \mathcal{N}_k \setminus \mathcal{C}(k)$ since nodes belonging to the same cluster estimate the same parameter vector.

Following the same line of reasoning from [16], [18] in the single-task case, and extending the argument to problem (5) by using Nash-equilibrium properties [46], [47], the following diffusion strategy of the adapt-then-combine (ATC) form was derived in [27] for solving the multitask learning problem (5) in a distributed manner:
$$\begin{aligned} \psi_k(i+1) &= w_k(i) + \mu_k\, x_k(i)\big(d_k(i) - x_k^\top(i)\, w_k(i)\big) + \eta\,\mu_k \Big(\sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)^-} \rho_{k\ell}\,\big(w_\ell(i) - w_k(i)\big)\Big), \\ w_k(i+1) &= \sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} a_{\ell k}\, \psi_\ell(i+1), \end{aligned} \qquad (7)$$
where $w_k(i)$ denotes the estimate of the unknown parameter vector $w_k^\star$ at node $k$ and iteration $i$, and $\mu_k$ is a positive step-size parameter. The combination coefficients $\{a_{\ell k}\}$ are nonnegative scalars chosen to satisfy the conditions:
$$\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} a_{\ell k} = 1, \quad \text{and} \quad \begin{cases} a_{\ell k} > 0, & \text{if } \ell \in \mathcal{N}_k \cap \mathcal{C}(k), \\ a_{\ell k} = 0, & \text{otherwise.} \end{cases} \qquad (8)$$
There are several ways to select these coefficients, such as using the averaging rule or the Metropolis rule (see [16] for a listing of these and other choices).
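To make the update (7) concrete, the following NumPy sketch runs the synchronous ATC multitask diffusion LMS on synthetic data generated according to model (1). The network, cluster assignment, step-size, and noise level are illustrative assumptions for the sketch, not the setup used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed): N nodes, L-dim tasks, two clusters.
N, L, eta, mu = 6, 2, 1.0, 0.01
cluster = np.array([0, 0, 0, 1, 1, 1])
adj = np.eye(N, dtype=bool)
for k, l in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:  # (2,3) is inter-cluster
    adj[k, l] = adj[l, k] = True

base = rng.standard_normal(L)
w_star = np.where(cluster[:, None] == 0, 1.0, 1.1) * base     # similar cluster tasks

# Combination matrix (8): averaging rule over N_k ∩ C(k); columns sum to one.
A = np.zeros((N, N))
for k in range(N):
    nbrs = [l for l in range(N) if adj[l, k] and cluster[l] == cluster[k]]
    A[nbrs, k] = 1.0 / len(nbrs)

# Regularization factors (6): uniform over N_k \ C(k); rho_kk absorbs the rest,
# so every row of rho sums to one.
rho = np.zeros((N, N))
for k in range(N):
    out = [l for l in range(N) if adj[k, l] and cluster[l] != cluster[k]]
    if out:
        rho[k, out] = 1.0 / len(out)
    else:
        rho[k, k] = 1.0

w = np.zeros((N, L))
for i in range(2000):
    x = rng.standard_normal((N, L))                           # R_{x,k} = I_L
    d = np.einsum('kl,kl->k', x, w_star) + 0.1 * rng.standard_normal(N)  # model (1)
    err = d - np.einsum('kl,kl->k', x, w)
    # Adaptation step of (7): LMS correction plus inter-cluster smoothness push.
    psi = w + mu * err[:, None] * x + eta * mu * (rho @ w - w)
    # Combination step of (7): convex mixing of intermediate estimates within clusters.
    w = A.T @ psi

msd = np.mean(np.sum((w - w_star) ** 2, axis=1))
print(f"network MSD after adaptation: {10 * np.log10(msd):.1f} dB")
```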
B. Asynchronous multitask diffusion adaptation
To model the asynchronous behavior over networks, we follow the same procedure developed in [42], since the model presented in that work allows us to cover many situations of practical interest. Specifically, we replace each deterministic step-size $\mu_k$ by a random process $\mu_k(i)$, and model uncertainties in the links by using random combination coefficients $\{a_{\ell k}(i)\}$ and random regularization factors $\{\rho_{k\ell}(i)\}$. In other words, we modify the multitask diffusion strategy (7) to the following form:
$$\begin{aligned} \psi_k(i+1) &= w_k(i) + \mu_k(i)\, x_k(i)\big(d_k(i) - x_k^\top(i)\, w_k(i)\big) + \eta\,\mu_k(i) \Big(\sum_{\ell \in \mathcal{N}_k(i) \setminus \mathcal{C}(k)^-} \rho_{k\ell}(i)\,\big(w_\ell(i) - w_k(i)\big)\Big), \\ w_k(i+1) &= \sum_{\ell \in \mathcal{N}_k(i) \cap \mathcal{C}(k)} a_{\ell k}(i)\, \psi_\ell(i+1), \end{aligned} \qquad (9)$$
where $\mathcal{N}_k(i)$ is also now random and denotes the random neighborhood of agent $k$ at time instant $i$. The composition of each cluster is assumed to be known a priori and does not change over time. When dealing with multitask networks, compared to single-task networks [42], a second source of uncertainty comes from links transmitting data between clusters. Indeed, data transmitted over intra-cluster links are used to reach a consensus, while data transmitted over inter-cluster links are used to promote relationships between tasks. In a manner similar to [42], the asynchronous network model is assumed to satisfy the following conditions:

• Conditions on the step-size parameters:
At each time instant $i$, the step-size at node $k$ is a bounded nonnegative random variable $\mu_k(i) \in [0, \mu_{\max,k}]$. These step-sizes are collected into the random matrix $M(i) \triangleq \mathrm{diag}\{\mu_1(i), \ldots, \mu_N(i)\}$. We assume that $\{M(i),\ i \ge 0\}$ is a weakly stationary random process with mean $\bar{M}$ and Kronecker-covariance matrix $C_M$ of size $N^2 \times N^2$ defined as
$$C_M \triangleq E\{(M(i) - \bar{M}) \otimes (M(i) - \bar{M})\} \qquad (10)$$
with $\otimes$ denoting the Kronecker product.

• Conditions on the combination coefficients:
The random coefficients $\{a_{\ell k}(i)\}$ used to scale the estimates $\{\psi_\ell(i+1)\}$ that are being received by node $k$ from its cluster neighbors $\ell \in \mathcal{N}_k(i) \cap \mathcal{C}(k)$ satisfy the following constraints at each iteration $i$:
$$\sum_{\ell \in \mathcal{N}_k(i) \cap \mathcal{C}(k)} a_{\ell k}(i) = 1, \quad \text{and} \quad \begin{cases} a_{\ell k}(i) > 0, & \text{if } \ell \in \mathcal{N}_k(i) \cap \mathcal{C}(k), \\ a_{\ell k}(i) = 0, & \text{otherwise.} \end{cases} \qquad (11)$$
We collect these coefficients into the random $N \times N$ left-stochastic matrix $A(i)$. We again assume that $\{A(i),\ i \ge 0\}$ is a weakly stationary random process. Let $\bar{A}$ be its mean and $C_A$ its Kronecker-covariance matrix of size $N^2 \times N^2$ defined as
$$C_A \triangleq E\{(A(i) - \bar{A}) \otimes (A(i) - \bar{A})\}. \qquad (12)$$

• Conditions on the regularization factors:
The random factors $\{\rho_{k\ell}(i)\}$, which adjust the regularization strength between the parameter vectors at neighboring nodes of distinct clusters, satisfy the following constraints at each iteration $i$:
$$\sum_{\ell \in \mathcal{N}_k(i) \setminus \mathcal{C}(k)^-} \rho_{k\ell}(i) = 1, \quad \text{and} \quad \begin{cases} \rho_{k\ell}(i) > 0, & \text{if } \ell \in \mathcal{N}_k(i) \setminus \mathcal{C}(k), \\ \rho_{kk}(i) \ge 0, & \\ \rho_{k\ell}(i) = 0, & \text{otherwise.} \end{cases} \qquad (13)$$
We collect these coefficients into the random $N \times N$ right-stochastic matrix $P(i)$. We assume that $\{P(i),\ i \ge 0\}$ is a weakly stationary random process with mean $\bar{P}$ and Kronecker-covariance matrix $C_P$ of size $N^2 \times N^2$ defined as
$$C_P \triangleq E\{(P(i) - \bar{P}) \otimes (P(i) - \bar{P})\}. \qquad (14)$$

• Independence assumptions:
To enable tractable analysis, we shall assume that the random matrices $M(i)$, $A(i)$, and $P(i)$ at iteration $i$ are mutually independent and independent of any other random variables. These matrices are related to node, intra-cluster, and inter-cluster link failures, respectively.

• Mean graph:
The mean matrices $\bar{A}$ and $\bar{P}$ define the intra-cluster and inter-cluster neighborhoods, namely, $\mathcal{N}_k \cap \mathcal{C}(k)$ and $\mathcal{N}_k \setminus \mathcal{C}(k)$ for all $k$, respectively. We refer to the neighborhoods $\mathcal{N}_k = \big(\mathcal{N}_k \cap \mathcal{C}(k)\big) \cup \big(\mathcal{N}_k \setminus \mathcal{C}(k)\big)$ for all $k$, defined by $\bar{A}$ and $\bar{P}$, as the mean graph. In view of the above conditions, the mean combination coefficients $\bar{a}_{\ell k} \triangleq E\{a_{\ell k}(i)\}$ and regularization factors $\bar{\rho}_{k\ell} \triangleq E\{\rho_{k\ell}(i)\}$ are nonnegative and satisfy the following constraints:
$$\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} \bar{a}_{\ell k} = 1, \quad \text{and} \quad \begin{cases} \bar{a}_{\ell k} > 0, & \text{if } \ell \in \mathcal{N}_k \cap \mathcal{C}(k), \\ \bar{a}_{\ell k} = 0, & \text{otherwise,} \end{cases} \qquad (15)$$
$$\sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)^-} \bar{\rho}_{k\ell} = 1, \quad \text{and} \quad \begin{cases} \bar{\rho}_{k\ell} > 0, & \text{if } \ell \in \mathcal{N}_k \setminus \mathcal{C}(k), \\ \bar{\rho}_{kk} \ge 0, & \\ \bar{\rho}_{k\ell} = 0, & \text{otherwise.} \end{cases} \qquad (16)$$
Using the same arguments as Lemmas 2 and 3 in [42], we can state the following properties for the asynchronous model (9).

Property 1.
The $N \times N$ matrix $\bar{A}$ and the $N^2 \times N^2$ matrix $\bar{A} \otimes \bar{A} + C_A$ are left-stochastic matrices.

Property 2.
The $N \times N$ matrix $\bar{P}$ and the $N^2 \times N^2$ matrix $\bar{P} \otimes \bar{P} + C_P$ are right-stochastic matrices.

Property 3.
For every node $k$, the neighborhood $\mathcal{N}_k$ that is defined by the mean graph of the asynchronous model (9) is equal to the union of all possible realizations of the random neighborhood $\mathcal{N}_k(i) = \big(\mathcal{N}_k(i) \cap \mathcal{C}(k)\big) \cup \big(\mathcal{N}_k(i) \setminus \mathcal{C}(k)\big)$.

We provide in Appendix A an example of a common asynchronous network referred to as the Bernoulli network. The Bernoulli model proposed in [42] is more general than the one used for modeling random link failures in consensus networks [8], [37], since it also allows random "on-off" behavior for the agents. When dealing with multitask problems over asynchronous networks, additional sources of uncertainty must be considered. The network provided in Appendix A allows us to jointly model intra-cluster link failures, inter-cluster link failures, and random "on-off" behavior for the agents.
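As a sketch of how such random realizations can be drawn in practice, the code below samples one triplet $(M(i), A(i), P(i))$ under the Bernoulli model of Appendix A, with illustrative (assumed) values for the nominal step-size and the success probabilities, and then verifies empirically that the sample means behave as in Properties 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
cluster = np.array([0, 0, 0, 1, 1])
adj = np.eye(N, dtype=bool)
for k, l in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[k, l] = adj[l, k] = True

mu0, q, p, r = 0.01, 0.8, 0.9, 0.7   # nominal step-size and success probabilities (assumed)

def sample_MAP(rng):
    """One realization of (M(i), A(i), P(i)) under the Bernoulli model of Appendix A."""
    M = np.diag(np.where(rng.random(N) < q, mu0, 0.0))                # step-sizes, (88)
    A = np.zeros((N, N))
    P = np.zeros((N, N))
    for k in range(N):
        intra = [l for l in range(N) if l != k and adj[l, k] and cluster[l] == cluster[k]]
        inter = [l for l in range(N) if adj[k, l] and cluster[l] != cluster[k]]
        for l in intra:                                               # intra-cluster links, (90)
            A[l, k] = (rng.random() < p) / (len(intra) + 1)
        A[k, k] = 1.0 - A[:, k].sum()                                 # (91): column sums to one
        for l in inter:                                               # inter-cluster links, (94)
            P[k, l] = (rng.random() < r) / len(inter)
        P[k, k] = 1.0 - P[k, :].sum()                                 # (95): row sums to one
    return M, A, P

# Properties 1-2: the empirical means of A(i) and P(i) are left-/right-stochastic.
As, Ps = zip(*[(A, P) for _, A, P in (sample_MAP(rng) for _ in range(5000))])
A_bar, P_bar = np.mean(As, axis=0), np.mean(Ps, axis=0)
print(np.allclose(A_bar.sum(axis=0), 1.0), np.allclose(P_bar.sum(axis=1), 1.0))
```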
III. PERFORMANCE OF MULTITASK DIFFUSION OVER ASYNCHRONOUS NETWORKS

The performance of the multitask diffusion algorithm (9) is affected by various random perturbations due to the asynchronous events. We now examine the stochastic behavior of this strategy in the mean and mean-square error sense.
A. Mean error behavior analysis
For each agent $k$, we introduce the weight error vectors:
$$\widetilde{w}_k(i) \triangleq w_k^\star - w_k(i), \qquad \widetilde{\psi}_k(i) \triangleq w_k^\star - \psi_k(i), \qquad (17)$$
where $w_k^\star$ is the optimum parameter vector at node $k$. We denote by $\widetilde{w}(i)$, $\widetilde{\psi}(i)$, and $w^\star$ the block weight error vector, the block intermediate weight error vector, and the block optimum weight vector, all of size $NL \times 1$ with blocks of size $L \times 1$, namely,
$$\widetilde{w}(i) \triangleq \mathrm{col}\{\widetilde{w}_1(i), \ldots, \widetilde{w}_N(i)\} \qquad (18)$$
$$\widetilde{\psi}(i) \triangleq \mathrm{col}\{\widetilde{\psi}_1(i), \ldots, \widetilde{\psi}_N(i)\} \qquad (19)$$
$$w^\star \triangleq \mathrm{col}\{w_1^\star, \ldots, w_N^\star\}. \qquad (20)$$
We also introduce the following $N \times N$ block matrices with individual entries of size $L \times L$:
$$\mathcal{M}(i) \triangleq M(i) \otimes I_L \qquad (21)$$
$$\mathcal{A}(i) \triangleq A(i) \otimes I_L \qquad (22)$$
$$\mathcal{P}(i) \triangleq P(i) \otimes I_L. \qquad (23)$$
To perform the theoretical analysis, we introduce the following independence assumption.

Assumption 1. (Independent regressors) The regression vectors $x_k(i)$ arise from a random process that is temporally stationary, temporally white, and independent over space with $R_{x,k} = E\{x_k(i)\, x_k^\top(i)\} > 0$.

A direct consequence is that $x_k(i)$ is independent of $\widetilde{w}_\ell(j)$ for all $\ell$ and $j \le i$. Although not true in general, this assumption is commonly used to analyze adaptive constructions since it simplifies the derivations without constraining the conclusions. There are several results in the adaptation literature showing that performance results obtained under this independence assumption match well the actual performance of the algorithms when the step-sizes are sufficiently small (see, e.g., [48, App. 24.A] and the many references therein).

The estimation error in the first step of the asynchronous strategy (9) can be rewritten as:
$$d_k(i) - x_k^\top(i)\, w_k(i) = x_k^\top(i)\, \widetilde{w}_k(i) + z_k(i). \qquad (24)$$
Subtracting $w_k^\star$ from both sides of the adaptation step in (9) and using the above relation, we can express the update equation for $\widetilde{\psi}(i+1)$ as:
$$\widetilde{\psi}(i+1) = \big[I_{NL} - \mathcal{M}(i)\big(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)\big)\big]\, \widetilde{w}(i) - \mathcal{M}(i)\, p_{xz}(i) + \eta\, \mathcal{M}(i)\, \mathcal{Q}(i)\, w^\star \qquad (25)$$
where
$$\mathcal{Q}(i) \triangleq I_{NL} - \mathcal{P}(i), \qquad (26)$$
while $\mathcal{R}_x(i)$ is an $N \times N$ block matrix with individual entries of size $L \times L$ given by
$$\mathcal{R}_x(i) \triangleq \mathrm{diag}\big\{x_1(i)\, x_1^\top(i), \ldots, x_N(i)\, x_N^\top(i)\big\}, \qquad (27)$$
and $p_{xz}(i)$ is the $NL \times 1$ block column vector with blocks of size $L \times 1$ defined as
$$p_{xz}(i) \triangleq \mathrm{col}\{x_1(i)\, z_1(i), \ldots, x_N(i)\, z_N(i)\}. \qquad (28)$$
Subtracting $w_k^\star$ from both sides of the combination step in (9), we get the block weight error vector:
$$\widetilde{w}(i+1) = \mathcal{A}^\top(i)\, \widetilde{\psi}(i+1). \qquad (29)$$
Substituting (25) into (29), we find that the error dynamics of the asynchronous multitask diffusion strategy (9) evolves according to the following recursion:
$$\widetilde{w}(i+1) = \mathcal{A}^\top(i)\big[I_{NL} - \mathcal{M}(i)\big(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)\big)\big]\, \widetilde{w}(i) - \mathcal{A}^\top(i)\, \mathcal{M}(i)\, p_{xz}(i) + \eta\, \mathcal{A}^\top(i)\, \mathcal{M}(i)\, \mathcal{Q}(i)\, w^\star. \qquad (30)$$
For compactness of notation, we introduce the symbols:
$$\mathcal{B}(i) \triangleq \mathcal{A}^\top(i)\big[I_{NL} - \mathcal{M}(i)\big(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)\big)\big] \qquad (31)$$
$$g(i) \triangleq \mathcal{A}^\top(i)\, \mathcal{M}(i)\, p_{xz}(i) \qquad (32)$$
$$r(i) \triangleq \mathcal{A}^\top(i)\, \mathcal{M}(i)\, \mathcal{Q}(i)\, w^\star, \qquad (33)$$
so that (30) can be written as
$$\widetilde{w}(i+1) = \mathcal{B}(i)\, \widetilde{w}(i) - g(i) + \eta\, r(i). \qquad (34)$$
Taking the expectation of both sides, using Assumption 1 and the independence of $A(i)$, $M(i)$, and $P(i)$, the network mean error vector ends up evolving according to the following dynamics:
$$E\{\widetilde{w}(i+1)\} = \mathcal{B}\, E\{\widetilde{w}(i)\} + \eta\, r \qquad (35)$$
where
$$\mathcal{B} \triangleq E\{\mathcal{B}(i)\} = \mathcal{A}^\top\big[I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q})\big] \qquad (36)$$
$$r \triangleq E\{r(i)\} = \mathcal{A}^\top \mathcal{M}\, \mathcal{Q}\, w^\star, \qquad (37)$$
where $\mathcal{A}$, $\mathcal{M}$, $\mathcal{R}_x$, and $\mathcal{Q}$ denote the expectations of $\mathcal{A}(i)$, $\mathcal{M}(i)$, $\mathcal{R}_x(i)$, and $\mathcal{Q}(i)$, respectively, and are given by:
$$\mathcal{A} \triangleq E\{\mathcal{A}(i)\} = \bar{A} \otimes I_L \qquad (38)$$
$$\mathcal{M} \triangleq E\{\mathcal{M}(i)\} = \bar{M} \otimes I_L \qquad (39)$$
$$\mathcal{P} \triangleq E\{\mathcal{P}(i)\} = \bar{P} \otimes I_L \qquad (40)$$
$$\mathcal{R}_x \triangleq E\{\mathcal{R}_x(i)\} = \mathrm{diag}\{R_{x,1}, \ldots, R_{x,N}\} \qquad (41)$$
$$\mathcal{Q} \triangleq E\{\mathcal{Q}(i)\} = I_{NL} - E\{\mathcal{P}(i)\} = I_{NL} - \mathcal{P}. \qquad (42)$$
Note that $E\{g(i)\} = 0$ since $z_k(i)$ is zero-mean and independent of any other signal.

Theorem 1. (Stability in the mean)
Assume data model (1) and Assumption 1 hold. Then, for any initial condition, the multitask diffusion LMS strategy (9) applied to asynchronous networks converges asymptotically in the mean if, and only if, the step-sizes in $\bar{M}$ are chosen to satisfy
$$\rho\big(\mathcal{A}^\top\big[I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q})\big]\big) < 1, \qquad (43)$$
where $\rho(\cdot)$ denotes the spectral radius of its matrix argument. In that case, the asymptotic mean bias is given by
$$\lim_{i \to \infty} E\{\widetilde{w}(i)\} = \eta\,\big(I_{NL} - \mathcal{B}\big)^{-1} r. \qquad (44)$$
Assume that the expected values of all step-sizes are uniform, namely, $E\{\mu_k(i)\} = \bar{\mu}$ for all $k$. A sufficient condition for (43) to hold is to ensure that
$$0 < \bar{\mu} < \frac{2}{\max_{1 \le k \le N} \rho(R_{x,k}) + 2\eta}. \qquad (45)$$

Proof:
Convergence in the mean requires the matrix $\mathcal{B}$ in (35) to be stable. Since any induced matrix norm is lower bounded by the spectral radius, we can write in terms of the block maximum norm [16]:
$$\rho\big(\mathcal{A}^\top[I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q})]\big) \le \big\|\mathcal{A}^\top[I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q})]\big\|_{b,\infty} \le \|\mathcal{A}^\top\|_{b,\infty} \cdot \|I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q})\|_{b,\infty}. \qquad (46)$$
We have $\|\mathcal{A}^\top\|_{b,\infty} = 1$ because $\mathcal{A}$ is a block left-stochastic matrix. This yields:
$$\rho\big(\mathcal{A}^\top[I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q})]\big) \le \|I_{NL} - \mathcal{M}(\mathcal{R}_x + \eta\,(I_{NL} - \mathcal{P}))\|_{b,\infty} \le \|I_{NL} - \mathcal{M}\mathcal{R}_x - \eta\,\mathcal{M}\|_{b,\infty} + \eta\,\|\mathcal{M}\mathcal{P}\|_{b,\infty}. \qquad (47)$$
Consider the first term on the RHS of (47). Since the matrices $\mathcal{M}$ and $\mathcal{R}_x$ are block diagonal, it holds from the properties of the block maximum norm [16] that:
$$\|I_{NL} - \mathcal{M}\mathcal{R}_x - \eta\,\mathcal{M}\|_{b,\infty} = \max_{1 \le k \le N} \rho\big((1 - \eta\,\bar{\mu}_k) I_L - \bar{\mu}_k R_{x,k}\big) = \max_{1 \le k \le N} \max_{1 \le \ell \le L} \big|(1 - \eta\,\bar{\mu}_k) - \bar{\mu}_k\, \lambda_\ell(R_{x,k})\big| \qquad (48)$$
where $\bar{\mu}_k \triangleq E\{\mu_k(i)\}$, and $\lambda_\ell(\cdot)$ denotes the $\ell$-th eigenvalue of its matrix argument. Consider now the second term on the RHS of (47). Using the submultiplicative property of the block maximum norm, and the fact that $\mathcal{P}$ is a block right-stochastic matrix, we get
$$\eta\,\|\mathcal{M}\mathcal{P}\|_{b,\infty} \le \eta\,\|\mathcal{M}\|_{b,\infty}. \qquad (49)$$
Because $\mathcal{M}$ is a block diagonal matrix, we further have that
$$\|\mathcal{M}\|_{b,\infty} = \max_{1 \le k \le N} \bar{\mu}_k. \qquad (50)$$
Combining (48) and (50), we conclude that the algorithm is stable in the mean if
$$\max_{1 \le k \le N} \max_{1 \le \ell \le L} \big|1 - \eta\,\bar{\mu}_k - \bar{\mu}_k\, \lambda_\ell(R_{x,k})\big| + \eta \max_{1 \le k \le N} \bar{\mu}_k < 1. \qquad (51)$$
In order to simplify this condition, assume that $\bar{\mu}_k = \bar{\mu}$ for all $k$. Condition (51) then reduces to (45). Note that the randomness in the topology does not affect the condition for mean stability of the algorithm.
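Condition (43) is straightforward to test numerically once the first-order moments $\bar{A}$, $\bar{M}$, $\bar{P}$ are available. The sketch below builds $\mathcal{B}$ from (36) for a small network and compares its spectral radius with 1, together with the sufficient bound (45); all matrices here are illustrative placeholders, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, eta = 4, 2, 1.0

# Illustrative first-order moments (assumed): A_bar left-stochastic (columns sum
# to 1, intra-cluster support), P_bar right-stochastic (rows sum to 1).
A_bar = np.array([[0.6, 0.3, 0.0, 0.0],
                  [0.4, 0.7, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])
P_bar = np.array([[0.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0, 0.5]])
mu_bar = 0.05 * np.ones(N)
Rx = [np.diag(1.0 + rng.random(L)) for _ in range(N)]       # R_{x,k} > 0

I_L = np.eye(L)
cal_A = np.kron(A_bar, I_L)                                  # (38)
cal_M = np.kron(np.diag(mu_bar), I_L)                        # (39)
cal_Q = np.eye(N * L) - np.kron(P_bar, I_L)                  # (42)
cal_Rx = np.zeros((N * L, N * L))                            # (41): blkdiag of R_{x,k}
for k in range(N):
    cal_Rx[k*L:(k+1)*L, k*L:(k+1)*L] = Rx[k]

B_bar = cal_A.T @ (np.eye(N * L) - cal_M @ (cal_Rx + eta * cal_Q))   # (36)
rho_B = max(abs(np.linalg.eigvals(B_bar)))                   # condition (43): rho < 1
bound = 2.0 / (max(R.diagonal().max() for R in Rx) + 2 * eta)        # bound (45)
print(f"rho(B_bar) = {rho_B:.4f};  sufficient bound on mu_bar from (45): {bound:.4f}")
```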
B. Mean-square error behavior analysis

To perform the mean-square error analysis over asynchronous networks, compared to synchronous networks [27], new operators with additional properties must be introduced. We shall use the block Kronecker product operator $\otimes_b$ instead of the Kronecker product $\otimes$, and the block vectorization operator $\mathrm{bvec}(\cdot)$ instead of the vectorization operator $\mathrm{vec}(\cdot)$. This is because, as explained in [3], [43], these block operators preserve the locality of the blocks in the original matrix arguments. Recall that if $\mathcal{X}$ is an $N \times N$ block matrix with blocks of size $L \times L$, $\mathrm{bvec}(\mathcal{X})$ vectorizes each block of $\mathcal{X}$ and stacks the vectors on top of each other. Before proceeding, we recall some properties of these block operators [3], [49].

For any two $NL \times 1$ block vectors $\{x, y\}$ with blocks of size $L \times 1$, we have:
$$\mathrm{bvec}(x\, y^\top) = y \otimes_b x. \qquad (52)$$
For any $N \times N$ block matrices $\{\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D}\}$ with blocks of size $L \times L$, we have:
$$(\mathcal{A} + \mathcal{B}) \otimes_b (\mathcal{C} + \mathcal{D}) = \mathcal{A} \otimes_b \mathcal{C} + \mathcal{A} \otimes_b \mathcal{D} + \mathcal{B} \otimes_b \mathcal{C} + \mathcal{B} \otimes_b \mathcal{D} \qquad (53)$$
$$(\mathcal{A}\mathcal{C}) \otimes_b (\mathcal{B}\mathcal{D}) = (\mathcal{A} \otimes_b \mathcal{B})(\mathcal{C} \otimes_b \mathcal{D}) \qquad (54)$$
$$(A \otimes B) \otimes_b (C \otimes D) = (A \otimes C) \otimes (B \otimes D) \qquad (55)$$
$$\mathrm{trace}(\mathcal{A}\mathcal{B}) = [\mathrm{bvec}(\mathcal{B}^\top)]^\top \mathrm{bvec}(\mathcal{A}) \qquad (56)$$
$$\mathrm{bvec}(\mathcal{A}\mathcal{B}\mathcal{C}) = (\mathcal{C}^\top \otimes_b \mathcal{A})\, \mathrm{bvec}(\mathcal{B}) \qquad (57)$$
$$(\mathcal{A} \otimes_b \mathcal{B})^\top = \mathcal{A}^\top \otimes_b \mathcal{B}^\top. \qquad (58)$$
We now use these properties to evaluate the expectation of some block Kronecker matrix products that will be useful in the sequel:
$$\mathcal{M}_I \triangleq E\{\mathcal{M}(i) \otimes_b \mathcal{M}(i)\} = E\{(M(i) \otimes I_L) \otimes_b (M(i) \otimes I_L)\} \stackrel{(55)}{=} E\{(M(i) \otimes M(i)) \otimes (I_L \otimes I_L)\} \stackrel{(10)}{=} (\bar{M} \otimes \bar{M} + C_M) \otimes I_{L^2}. \qquad (59)$$
In the same way, we get the following expectations:
$$\mathcal{A}_I \triangleq E\{\mathcal{A}(i) \otimes_b \mathcal{A}(i)\} = (\bar{A} \otimes \bar{A} + C_A) \otimes I_{L^2}, \qquad (60)$$
$$\mathcal{P}_I \triangleq E\{\mathcal{P}(i) \otimes_b \mathcal{P}(i)\} = (\bar{P} \otimes \bar{P} + C_P) \otimes I_{L^2}. \qquad (61)$$
Since $\mathcal{Q}(i) = I_{NL} - \mathcal{P}(i)$, we also obtain:
$$\mathcal{Q}_I \triangleq E\{\mathcal{Q}(i) \otimes_b \mathcal{Q}(i)\} = \big(I_{N^2} - I_N \otimes \bar{P} - \bar{P} \otimes I_N + \bar{P} \otimes \bar{P} + C_P\big) \otimes I_{L^2}. \qquad (62)$$
Before concluding these preliminary calculations, let us make some remarks on the stochasticity of the matrices considered in the sequel. At each time instant $i$, the matrix $P(i) \otimes P(i)$ has nonnegative entries since $P(i)$ has nonnegative entries. It follows that $E\{P(i) \otimes P(i)\} = \bar{P} \otimes \bar{P} + C_P$ also has nonnegative entries, and is right-stochastic since
$$(\bar{P} \otimes \bar{P} + C_P)\,\mathbb{1}_{N^2} = E\{(P(i) \otimes P(i))(\mathbb{1}_N \otimes \mathbb{1}_N)\} = E\{(P(i)\,\mathbb{1}_N) \otimes (P(i)\,\mathbb{1}_N)\} = \mathbb{1}_{N^2}. \qquad (63)$$
By the same token, the matrix $\bar{A} \otimes \bar{A} + C_A$ is left-stochastic.

To analyze the convergence in the mean-square-error sense of the multitask diffusion LMS algorithm (9) over asynchronous networks, we consider the variance of the weight error vector $\widetilde{w}(i)$ weighted by any positive semi-definite matrix $\Sigma$, that is, $E\{\|\widetilde{w}(i)\|_\Sigma^2\}$, where $\|\widetilde{w}(i)\|_\Sigma^2 \triangleq \widetilde{w}^\top(i)\, \Sigma\, \widetilde{w}(i)$. The freedom in selecting $\Sigma$ will allow us to extract various types of information about the network and the nodes. By Assumption 1 and using (34), we get:
$$E\{\|\widetilde{w}(i+1)\|_\Sigma^2\} = E\{\|\widetilde{w}(i)\|_{\Sigma'}^2\} + E\{\|g(i)\|_\Sigma^2\} + \eta^2 E\{\|r(i)\|_\Sigma^2\} + 2\eta\, E\{r^\top(i)\, \Sigma\, \mathcal{B}(i)\, \widetilde{w}(i)\} \qquad (64)$$
where $\Sigma' = E\{\mathcal{B}^\top(i)\, \Sigma\, \mathcal{B}(i)\}$. Let $\sigma$ denote the $(NL)^2 \times 1$ vector representation of $\Sigma$ obtained by the block vectorization operator, namely, $\sigma \triangleq \mathrm{bvec}(\Sigma)$. In the sequel, it will be more convenient to work with $\sigma$ than with $\Sigma$ itself.
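Since $\otimes_b$ and $\mathrm{bvec}(\cdot)$ are less standard than their non-block counterparts, here is a small reference implementation (a sketch for square $N \times N$ block matrices with $L \times L$ blocks, using a block-column-major stacking convention consistent with [3], [49]); it also verifies property (57) numerically on random matrices.

```python
import numpy as np

def bvec(X, N, L):
    """Block vectorization: stack vec(X_ij) over block columns j, then block rows i."""
    return np.concatenate([X[i*L:(i+1)*L, j*L:(j+1)*L].reshape(-1, order='F')
                           for j in range(N) for i in range(N)])

def block_kron(X, Y, N, L):
    """Block Kronecker (Tracy-Singh) product of two N x N block matrices with
    L x L blocks: sub-block ((m,k),(n,l)) of the result equals X_mn (x) Y_kl."""
    L2 = L * L
    Z = np.zeros(((N * L) ** 2, (N * L) ** 2))
    for m in range(N):
        for n in range(N):
            Xmn = X[m*L:(m+1)*L, n*L:(n+1)*L]
            for k in range(N):
                for l in range(N):
                    Ykl = Y[k*L:(k+1)*L, l*L:(l+1)*L]
                    Z[(m*N+k)*L2:(m*N+k+1)*L2,
                      (n*N+l)*L2:(n*N+l+1)*L2] = np.kron(Xmn, Ykl)
    return Z

# Numerical check of property (57): bvec(A B C) = (C^T (x)_b A) bvec(B).
rng = np.random.default_rng(3)
N, L = 3, 2
A, B, C = (rng.standard_normal((N * L, N * L)) for _ in range(3))
lhs = bvec(A @ B @ C, N, L)
rhs = block_kron(C.T, A, N, L) @ bvec(B, N, L)
print(np.allclose(lhs, rhs))   # True
```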
Let $\sigma' \triangleq \mathrm{bvec}(\Sigma')$. Using property (57), we can verify that
$$\sigma' = \mathcal{F}^\top \sigma \qquad (65)$$
where $\mathcal{F}$ is the $(NL)^2 \times (NL)^2$ matrix given by:
$$\mathcal{F} \triangleq E\{\mathcal{B}(i) \otimes_b \mathcal{B}(i)\} \stackrel{(54)}{=} E\{\mathcal{A}^\top(i) \otimes_b \mathcal{A}^\top(i)\}\, E\big\{\big[I_{NL} - \mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\big] \otimes_b \big[I_{NL} - \mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\big]\big\} \stackrel{(60),(53)}{=} \mathcal{A}_I^\top\big[I_{(NL)^2} - I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) \otimes_b I_{NL} + E\{\mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)) \otimes_b \mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\}\big] \qquad (66)$$
where, using property (54) and the definition of $\mathcal{M}_I$ in (59), we have
$$E\{\mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)) \otimes_b \mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\} = \mathcal{M}_I\, E\{(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)) \otimes_b (\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\}. \qquad (67)$$
The term on the RHS of equation (67) is proportional to $\mathcal{M}_I = E\{M(i) \otimes M(i)\} \otimes I_{L^2}$, where $E\{M(i) \otimes M(i)\}$ is an $N \times N$ block diagonal matrix whose $k$-th block is an $N \times N$ diagonal matrix with $\ell$-th entry given by $E\{\mu_k(i)\,\mu_\ell(i)\}$. It is sufficient for the exposition in this work to focus on the case of sufficiently small step-sizes, where terms involving higher-order moments of the step-sizes can be ignored. Such approximations are common when analyzing diffusion strategies in the mean-square-error sense (see [16, Section 6.5]). Accordingly, the last term in (66) can be neglected and we continue our discussion by letting
$$\mathcal{F} \approx \mathcal{A}_I^\top\big[I_{(NL)^2} - I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) \otimes_b I_{NL}\big]. \qquad (68)$$
Consider next the second term on the RHS of (64). We can write:
$$E\{\|g(i)\|_\Sigma^2\} = \mathrm{trace}\big\{\Sigma\, E\{g(i)\, g^\top(i)\}\big\} \stackrel{(56)}{=} g_b^\top \sigma \qquad (69)$$
where $g_b = \mathrm{bvec}(E\{g(i)\, g^\top(i)\})$. Using expression (32) and the definitions of $\mathcal{M}_I$ and $\mathcal{A}_I$ in (59) and (60), we have
$$g_b = \mathrm{bvec}\big(E\{\mathcal{A}^\top(i)\, \mathcal{M}(i)\, p_{xz}(i)\, p_{xz}^\top(i)\, \mathcal{M}(i)\, \mathcal{A}(i)\}\big) \stackrel{(57)}{=} E\{(\mathcal{A}^\top(i) \otimes_b \mathcal{A}^\top(i))\, \mathrm{bvec}(\mathcal{M}(i)\, p_{xz}(i)\, p_{xz}^\top(i)\, \mathcal{M}(i))\} \stackrel{(57),(58)}{=} \mathcal{A}_I^\top\, E\{(\mathcal{M}(i) \otimes_b \mathcal{M}(i))\, \mathrm{bvec}(p_{xz}(i)\, p_{xz}^\top(i))\} = \mathcal{A}_I^\top\, \mathcal{M}_I\, \mathrm{bvec}(\mathcal{S}), \qquad (70)$$
where $\mathcal{S} \triangleq E\{p_{xz}(i)\, p_{xz}^\top(i)\} = \mathrm{diag}\{\sigma_{z,k}^2 R_{x,k}\}_{k=1}^N$. Let us now examine the third term on the RHS of (64):
$$E\{\|r(i)\|_\Sigma^2\} = \mathrm{trace}\big\{\Sigma\, E\{r(i)\, r^\top(i)\}\big\} \stackrel{(56)}{=} r_b^\top \sigma \qquad (71)$$
where $r_b = \mathrm{bvec}(E\{r(i)\, r^\top(i)\})$. Using expression (33), property (57), and the definitions of $\mathcal{M}_I$, $\mathcal{A}_I$, and $\mathcal{Q}_I$ in (59), (60), and (62), and proceeding as in (70), we obtain the following expression:
$$r_b = \mathcal{A}_I^\top\, \mathcal{M}_I\, \mathcal{Q}_I\, \mathrm{bvec}\big(w^\star (w^\star)^\top\big). \qquad (72)$$
Consider now the fourth term $E\{r^\top(i)\, \Sigma\, \mathcal{B}(i)\, \widetilde{w}(i)\}$.
We have:
$$E\{r^\top(i)\, \Sigma\, \mathcal{B}(i)\, \widetilde{w}(i)\} = E\{\mathrm{bvec}(r^\top(i)\, \Sigma\, \mathcal{B}(i)\, \widetilde{w}(i))\} \stackrel{(57)}{=} E\{(\mathcal{B}(i)\, \widetilde{w}(i))^\top \otimes_b r^\top(i)\}\, \sigma \stackrel{(58)}{=} E\{\mathcal{B}(i)\, \widetilde{w}(i) \otimes_b r(i)\}^\top \sigma \stackrel{(54)}{=} E\{\widetilde{w}(i) \otimes_b 1\}^\top E\{\mathcal{B}(i) \otimes_b r(i)\}^\top \sigma = E\{\widetilde{w}(i)\}^\top E\{\mathcal{B}(i) \otimes_b r(i)\}^\top \sigma \qquad (73)$$
with
$$E\{\mathcal{B}(i) \otimes_b r(i)\} = E\big\{\mathcal{A}^\top(i)\big[I_{NL} - \mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\big] \otimes_b \mathcal{A}^\top(i)\, \mathcal{M}(i)\, \mathcal{Q}(i)\, w^\star\big\} \stackrel{(54)}{=} \mathcal{A}_I^\top\, E\big\{\big[I_{NL} - \mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i))\big] \otimes_b \mathcal{M}(i)\, \mathcal{Q}(i)\, w^\star\big\} \stackrel{(53)}{=} \mathcal{A}_I^\top\big[\big(I_{NL} \otimes_b \mathcal{M}\mathcal{Q}\, w^\star\big) - E\{\mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)) \otimes_b \mathcal{M}(i)\,\mathcal{Q}(i)\, w^\star\}\big], \qquad (74)$$
where
$$E\{\mathcal{M}(i)(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)) \otimes_b \mathcal{M}(i)\,\mathcal{Q}(i)\, w^\star\} \stackrel{(54)}{=} \mathcal{M}_I\, E\{(\mathcal{R}_x(i) + \eta\,\mathcal{Q}(i)) \otimes_b \mathcal{Q}(i)\, w^\star\} \stackrel{(53)}{=} \mathcal{M}_I\big((\mathcal{R}_x \otimes_b \mathcal{Q}\, w^\star) + \eta\, E\{\mathcal{Q}(i) \otimes_b \mathcal{Q}(i)\, w^\star\}\big) \stackrel{(54)}{=} \mathcal{M}_I\big((\mathcal{R}_x \otimes_b \mathcal{Q}\, w^\star) + \eta\, \mathcal{Q}_I(I_{NL} \otimes_b w^\star)\big). \qquad (75)$$
Finally, combining (74) and (75) and introducing the notation $\mathcal{K}$, we get
$$\mathcal{K} \triangleq E\{\mathcal{B}(i) \otimes_b r(i)\} = \mathcal{A}_I^\top\big[\big(I_{NL} \otimes_b \mathcal{M}\mathcal{Q}\, w^\star\big) - \mathcal{M}_I\big((\mathcal{R}_x \otimes_b \mathcal{Q}\, w^\star) + \eta\, \mathcal{Q}_I(I_{NL} \otimes_b w^\star)\big)\big]. \qquad (76)$$
Relation (64) can be written in a more compact form as
$$E\{\|\widetilde{w}(i+1)\|_\sigma^2\} = E\{\|\widetilde{w}(i)\|_{\mathcal{F}^\top\sigma}^2\} + y^\top(i)\,\sigma, \qquad (77)$$
where $y(i)$ is the $(NL)^2 \times 1$ vector given by:
$$y(i) \triangleq g_b + \eta^2\, r_b + 2\eta\, \mathcal{K}\, E\{\widetilde{w}(i)\}. \qquad (78)$$
In the sequel, we shall use the notations $\|\cdot\|_\Sigma^2$ and $\|\cdot\|_\sigma^2$ interchangeably.

Theorem 2. (Mean-square stability)
Assume data model (1) and Assumption 1 hold. Assume further that the upper bounds on the step-sizes, $\{\mu_{\max,k}\}$, are sufficiently small such that approximation (68) is justified by ignoring higher-order powers of the step-sizes, and (77) can be used as a reasonable representation for the dynamics of the weighted mean-square error. Then, the asynchronous diffusion multitask algorithm (9) is mean-square stable if the matrix $\mathcal{F}$ defined by (68) is stable.

Proof: Provided that $\mathcal{F}$ is stable, recursion (77) is stable if $y^\top(i)\,\sigma$ is bounded. Since $\eta$, $g_b$, $r_b$, $\mathcal{K}$, and $\sigma$ are finite and constant terms, the boundedness of $y^\top(i)\,\sigma$ depends on $E\{\widetilde{w}(i)\}$ being bounded. We know from (35) that $E\{\widetilde{w}(i)\}$ is uniformly bounded because (35) is a bounded-input bounded-output (BIBO) stable recursion with a bounded driving term $\eta\,\mathcal{A}^\top\mathcal{M}\mathcal{Q}\, w^\star$. It follows that $y^\top(i)\,\sigma$ is uniformly bounded. As a result, $E\{\|\widetilde{w}(i+1)\|_\sigma^2\}$ converges to a bounded value as $i \to \infty$, and the algorithm is mean-square stable.

The stability of $\mathcal{F}$ is studied in Appendix B. It is worth noting that, due to the Kronecker covariance matrix $C_A$, the matrix $\mathcal{F}$ cannot be approximated by $\mathcal{B} \otimes_b \mathcal{B}$ as in the synchronous case [16], [27]. Moreover, deriving a condition that ensures the stability of $\mathcal{F}$ in a multitask setting is more challenging than in the single-task setting [43] due to the presence of the non-block-diagonal matrix $\mathcal{Q}$ in the second term on the RHS of (68).

Theorem 3. (Transient network performance)
Consider sufficiently small step-sizes that ensure mean and mean-square stability. The variance curve defined by $\zeta(i) \triangleq E\{\|\widetilde{w}(i)\|_\sigma^2\}$ evolves according to the following recursion for $i \ge 0$:
$$\zeta(i+1) = \zeta(i) + \|\widetilde{w}(0)\|^2_{(\mathcal{F}^\top - I_{(NL)^2})(\mathcal{F}^\top)^i \sigma} + \big(y^\top(i) + \Gamma(i)\big)\,\sigma \qquad (79)$$
where $\Gamma(i+1)$ is updated as follows:
$$\Gamma(i+1) = \Gamma(i)\,\mathcal{F}^\top + y^\top(i)\big(\mathcal{F}^\top - I_{(NL)^2}\big), \qquad (80)$$
with the initial conditions $\zeta(0) = \|\widetilde{w}(0)\|_\sigma^2$ and $\Gamma(0) = \mathbf{0}^\top_{(NL)^2}$. The network mean-square deviation (MSD) is obtained by setting $\sigma = \mathrm{bvec}(\Sigma)$ with $\Sigma = \frac{1}{N} I_{NL}$.

Proof:
The argument is similar to the proof of the corresponding transient-performance theorem in [27].

Theorem 4. (Steady-state network performance)
Assume sufficiently small step-sizes to ensure mean and mean-square convergence. Then, the steady-state performance of the multitask diffusion LMS (9) applied to asynchronous networks is given by:
$$\zeta^\star = \big(g_b + \eta^2 r_b + 2\eta\,\mathcal{K}\, E\{\widetilde{w}(\infty)\}\big)^\top \big(I_{(NL)^2} - \mathcal{F}^\top\big)^{-1}\sigma, \qquad (81)$$
where $E\{\widetilde{w}(\infty)\}$ is given by (44). The network mean-square deviation (MSD) is obtained by setting $\sigma = \mathrm{bvec}(\Sigma)$ with $\Sigma = \frac{1}{N} I_{NL}$.

Proof: The steady-state network performance with metric $\sigma$ is defined as:
$$\zeta^\star = \lim_{i \to \infty} E\{\|\widetilde{w}(i)\|_\sigma^2\}. \qquad (82)$$
From the recursive expression (77), we obtain as $i \to \infty$:
$$\lim_{i \to \infty} E\{\|\widetilde{w}(i)\|^2_{(I_{(NL)^2} - \mathcal{F}^\top)\sigma}\} = \big(g_b + \eta^2 r_b + 2\eta\,\mathcal{K}\, E\{\widetilde{w}(\infty)\}\big)^\top \sigma. \qquad (83)$$
To obtain (81), we replace $\sigma$ in (83) by $(I_{(NL)^2} - \mathcal{F}^\top)^{-1}\sigma$.

Before moving on to the presentation of experimental results, note that the performance of the synchronous multitask algorithm over the mean-graph topology can be obtained by setting $C_A$, $C_M$, and $C_P$ to zero in (59)–(61).
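Theorems 3 and 4 reduce to a few lines of linear algebra once the moment quantities are available. The sketch below assumes that $\mathcal{F}$ (68), $g_b$ (70), $r_b$ (72), $\mathcal{K}$ (76), and the mean-recursion quantities (36)–(37) have been precomputed; it evaluates the transient curve via (79)–(80) and the steady-state value via (81).

```python
import numpy as np

def bvec(X, N, L):
    """Block vectorization: stack vec(X_ij) over block columns j, then block rows i."""
    return np.concatenate([X[i*L:(i+1)*L, j*L:(j+1)*L].reshape(-1, order='F')
                           for j in range(N) for i in range(N)])

def theoretical_msd(F, g_b, r_b, K, B_bar, r_mean, eta, w_err0, N, L, n_iter):
    """Transient network MSD via (79)-(80) and steady-state MSD via (81).

    F, g_b, r_b, K are the second-order moment quantities from (68), (70), (72),
    (76); B_bar and r_mean are the mean-recursion quantities (36)-(37); w_err0 is
    the initial error w* - w(0) stacked into a length-NL vector.
    """
    NL2 = (N * L) ** 2
    sigma = bvec(np.eye(N * L) / N, N, L)            # Sigma = (1/N) I_{NL} -> network MSD
    q0 = bvec(np.outer(w_err0, w_err0), N, L)        # ||w~(0)||^2_s = q0 @ s, property (56)
    Ew = w_err0.copy()                               # E{w~(i)}, recursion (35)
    Gamma = np.zeros(NL2)                            # Gamma(0) = 0 (stored as a column)
    Fts = sigma.copy()                               # (F^T)^i sigma
    zeta = np.empty(n_iter + 1)
    zeta[0] = q0 @ sigma                             # zeta(0) = ||w~(0)||^2_sigma
    for i in range(n_iter):
        y = g_b + eta**2 * r_b + 2 * eta * (K @ Ew)  # y(i) from (78)
        zeta[i+1] = zeta[i] + q0 @ (F.T @ Fts - Fts) + (y + Gamma) @ sigma   # (79)
        Gamma = F @ Gamma + F @ y - y                # (80), transposed to column form
        Fts = F.T @ Fts
        Ew = B_bar @ Ew + eta * r_mean               # (35)
    Ew_inf = eta * np.linalg.solve(np.eye(N * L) - B_bar, r_mean)            # (44)
    y_inf = g_b + eta**2 * r_b + 2 * eta * (K @ Ew_inf)
    zeta_ss = y_inf @ np.linalg.solve(np.eye(NL2) - F.T, sigma)              # (81)
    return zeta, zeta_ss
```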
IV. SIMULATION RESULTS

A. Illustrative example
We adopt the same clustered multitask network as [27] in our simulations. As shown in Figure 2, the network consists of nodes divided into four clusters $\mathcal{C}_1, \ldots, \mathcal{C}_4$. The unknown parameter vector $w^\star_{\mathcal{C}_i}$ of each cluster is of size $2 \times 1$ and has the form $w^\star_{\mathcal{C}_i} = w_0 + \delta w_{\mathcal{C}_i}$, where $w_0$ is a fixed vector common to all clusters and the $\delta w_{\mathcal{C}_i}$ are small cluster-dependent perturbations. The input and output data at each node $k$ are related via the linear regression model $d_k(i) = x_k^\top(i)\, w_k^\star + z_k(i)$, where $w_k^\star = w^\star_{\mathcal{C}(k)}$. The regressors are zero-mean $2 \times 1$ random vectors governed by a Gaussian distribution with covariance matrices $R_{x,k} = \sigma_{x,k}^2 I_L$. The variances $\sigma_{x,k}^2$ are shown in Figure 2. The background noises $z_k(i)$ are independent and identically distributed zero-mean Gaussian random variables, independent of any other signals. The corresponding variances $\sigma_{z,k}^2$ are given in Figure 2.

Fig. 2. Experimental setup. Left: Network topology. Right: Regression and noise variances.
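For illustration, the following sketch runs the asynchronous strategy (9) once under the Bernoulli model with the averaging-rule coefficient choices used in this section; the small network and all probability values are assumptions for the sketch, not the exact setup of Figure 2.

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, eta = 6, 2, 1.0
cluster = np.array([0, 0, 0, 1, 1, 1])
adj = np.eye(N, dtype=bool)
for k, l in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[k, l] = adj[l, k] = True
w_star = np.where(cluster[:, None] == 0, 1.0, 1.1) * rng.standard_normal(L)

mu0, q, p, r = 0.05, 0.8, 0.9, 0.7   # assumed Bernoulli parameters
intra = [[l for l in range(N) if adj[l, k] and cluster[l] == cluster[k]] for k in range(N)]
inter = [[l for l in range(N) if adj[k, l] and cluster[l] != cluster[k]] for k in range(N)]

w = np.zeros((N, L))
msd = []
for i in range(1500):
    x = rng.standard_normal((N, L))
    d = np.einsum('kl,kl->k', x, w_star) + 0.1 * rng.standard_normal(N)
    mu = mu0 * (rng.random(N) < q)                     # (88): agent k idle w.p. 1 - q
    psi = np.empty_like(w)
    for k in range(N):                                 # adaptation step of (9)
        reg = np.zeros(L)
        for l in inter[k]:
            if rng.random() < r:                       # (94): inter-cluster link active
                reg += (w[l] - w[k]) / len(inter[k])
        psi[k] = w[k] + mu[k] * (d[k] - x[k] @ w[k]) * x[k] + eta * mu[k] * reg
    w_new = np.empty_like(w)
    for k in range(N):                                 # combination step of (9)
        a = np.zeros(N)
        for l in intra[k]:
            if l != k and rng.random() < p:            # (90): intra-cluster link active
                a[l] = 1.0 / len(intra[k])
        a[k] = 1.0 - a.sum()                           # (91): weights sum to one
        w_new[k] = a @ psi
    w = w_new
    msd.append(np.mean(np.sum((w - w_star) ** 2, axis=1)))

print(f"steady-state MSD ~ {10 * np.log10(np.mean(msd[-200:])):.1f} dB")
```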
We considered the Bernoulli asynchronous model described in Appendix A. We set the coefficients $a_{\ell k}$ in (90) such that $a_{\ell k} = |\mathcal{N}_k \cap \mathcal{C}(k)|^{-1}$ for all $\ell \in \mathcal{N}_k \cap \mathcal{C}(k)$, where $|\mathcal{N}_k \cap \mathcal{C}(k)|$ denotes the cardinality of the set $\mathcal{N}_k \cap \mathcal{C}(k)$. Then we set the regularization factors $\rho_{k\ell}$ in (94) as follows. If $\mathcal{N}_k \setminus \mathcal{C}(k) \ne \emptyset$, $\rho_{k\ell}$ was set to $\rho_{k\ell} = |\mathcal{N}_k \setminus \mathcal{C}(k)|^{-1}$ for all $\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)$, and to $\rho_{k\ell} = 0$ for any other $\ell$. If $\mathcal{N}_k \setminus \mathcal{C}(k) = \emptyset$, these factors were set to $\rho_{kk} = 1$ and $\rho_{k\ell} = 0$ for all $\ell \ne k$. This usually leads to asymmetrical regularization factors. The parameters of the Bernoulli distribution governing the step-sizes $\mu_k(i)$ were the same over the network, that is, the value $\mu_k$ in (88) was identical for all $k$. The regularization strength $\eta$ was fixed. The MSD learning curves were averaged over independent Monte-Carlo runs.

The transient MSD curves were obtained with Theorem 3, and the steady-state MSD was estimated with Theorem 4. In Figure 3 (left), we report the network MSD learning curves for three cases:

Case 1: 30% idle: $q_k = p_{\ell k} = r_{k\ell} = 0.7$;
Case 2: 50% idle: $q_k = p_{\ell k} = r_{k\ell} = 0.5$;
Case 3: no idle nodes: $q_k = p_{\ell k} = r_{k\ell} = 1$.

We observe that the simulation results match the theoretical results well. Furthermore, the performance of the network is influenced by the probability of occurrence of random events.

Fig. 3. Left: Comparison of the asynchronous network MSD under 0% idle, 30% idle, and 50% idle. Right: Network MSD comparison of the asynchronous network under 50% idle and the corresponding synchronous network.

In Figure 3 (right), the asynchronous algorithm in Case 2 is compared with its synchronous version obtained from (7) by setting $\mu_k$, $a_{\ell k}$, and $\rho_{k\ell}$ to the expected values $\bar{\mu}_k = E\{\mu_k(i)\}$, $\bar{a}_{\ell k} = E\{a_{\ell k}(i)\}$, and $\bar{\rho}_{k\ell} = E\{\rho_{k\ell}(i)\}$, respectively. Although both algorithms show the same convergence rate, the asynchronous algorithm suffers from a degradation in its MSD performance caused by the additional randomness throughout the adaptation process.

B. Multitask learning benefit
In this section, we provide an example to show the benefit of multitask learning. We consider a network consisting of $N = 100$ nodes grouped into $Q = 3$ clusters, with the nodes assigned to $\mathcal{C}_1$, $\mathcal{C}_2$, and $\mathcal{C}_3$ in consecutive index ranges. The physical connections are defined by the connectivity matrix represented in Figure 4. The inputs $x_k(i)$ were zero-mean random vectors governed by a Gaussian distribution with covariance matrix $R_{x,k} = \sigma_{x,k}^2 I$, where the variances $\sigma_{x,k}^2$ were randomly chosen in a fixed interval. The noises $z_k(i)$ were i.i.d. zero-mean Gaussian random variables, independent of any other signal, with variances $\sigma_{z,k}^2$ randomly chosen in a fixed interval. The unknown parameter vectors were chosen as $w^\star_{\mathcal{C}_1} = w_0$, $w^\star_{\mathcal{C}_2} = w_0 + \delta w$, and $w^\star_{\mathcal{C}_3} = w_0 - \delta w$, where $w_0$ is a fixed vector and $\delta w$ was randomly generated with a small prescribed value of $\|\delta w\|_\infty = \max_i |[\delta w]_i|$.

We considered the Bernoulli asynchronous model. The coefficients $\{a_{\ell k}\}$ and $\{\rho_{k\ell}\}$ in (90) and (94), respectively, were generated in the same manner as in Section IV-A. The parameters $\mu_k$ and $q_k$ in (88) and the probabilities $\{p_{\ell k}\}$ in (90) were set cluster by cluster, with the third cluster experiencing random events most frequently. The probability that a link connecting two nodes belonging to neighboring clusters drops was $1 - r_{k\ell}$. The simulated curves were obtained by averaging over independent Monte-Carlo runs.

In Figure 5 (left), we compare two algorithms: the asynchronous diffusion strategy without regularization (obtained from (9) by setting $\eta = 0$) and its synchronous counterpart (obtained from (9) by setting $\eta = 0$ and replacing $\mu_k(i)$, $a_{\ell k}(i)$ by $\bar{\mu}_k$, $\bar{a}_{\ell k}$).
Fig. 4. Connectivity matrix of the network. The orange, blue, and red elements correspond to links within $\mathcal{C}_1$, $\mathcal{C}_2$, and $\mathcal{C}_3$, respectively. The cyan elements correspond to links between $\mathcal{C}_1$ and $\mathcal{C}_2$, and the magenta elements correspond to links between $\mathcal{C}_2$ and $\mathcal{C}_3$. There are no links between $\mathcal{C}_1$ and $\mathcal{C}_3$.
As shown in this figure, the performance is highly deteriorated in the third cluster and slightly deteriorated in the first cluster, because the third cluster is more susceptible to random events. In Figure 5 (right), we compare two algorithms: the asynchronous diffusion strategy with regularization (obtained from (9) by setting $\eta = 2$) and the same synchronous algorithm as in the left plot. As shown in this figure, the cooperation between clusters improves the performance of each cluster, so that the gaps appearing in the left plot are reduced. In other words, the weaker clusters benefit from the high performance level achieved by the strongest cluster. This can be justified by two arguments: a large number of nodes is employed to collectively estimate its parameter vector, and the probabilities associated with random events in that cluster are small. In conclusion, when the tasks of neighboring clusters are similar, cooperation among clusters improves the learning, especially for clusters where asynchronous events occur frequently.

Fig. 5. Cluster learning curves. Left: Comparison of the asynchronous multitask diffusion LMS (9) without inter-cluster cooperation ($\eta = 0$) and its synchronous counterpart. Right: Comparison of the asynchronous multitask diffusion LMS (9) with inter-cluster cooperation ($\eta \ne 0$) and the multitask diffusion LMS (9) without inter-cluster cooperation ($\eta = 0$).

C. Circular arcs localization
In this section, we consider the problem of adaptive surface localization over asynchronous networks. When dealing with a smooth target surface, we can expect that promoting the smoothness of the graph signal will improve the performance of the network [27]. In the following, we consider an arc localization application where the radius of the arc changes over time, and we illustrate the influence of the random events on the learning behavior and tracking ability of the network.

Let us denote by $\mathcal{L} = [\theta_1, \theta_2]$ an arc of circle with radius $R$, subtending an angle $\theta = \theta_2 - \theta_1$ at the circle center $w_o$. Let us decompose $\mathcal{L}$ into $Q$ sub-arcs $\mathcal{L}_q$ with radius $R$, each subtending an angle $\delta \ll \theta$ at $w_o$. In order to estimate the location of $\mathcal{L}$, and for sufficiently small $\delta$, it is sufficient to estimate the location of each of these $Q$ sub-arcs by solving a point target localization problem. This can be done by employing a network of $N$ nodes, composed of $Q$ clusters, where the nodes of each cluster $\mathcal{C}_q$ are interested in locating $\mathcal{L}_q$ by estimating a parameter vector $w^\star_{\mathcal{C}_q}$. Let us consider node $k$ belonging to cluster $\mathcal{C}_q$. At each time instant $i$, node $k$ gets noisy measurements $\{d_k(i), u_k(i)\}$ that are related via the linear data model [16]:
$$d_k(i) = u_k^\top(i)\, w^\star_{\mathcal{C}_q} + v_k(i), \qquad (84)$$
where $v_k(i)$ is a zero-mean temporally and spatially independent Gaussian noise with variance $\sigma_{v,k}^2$, and $u_k(i)$ is a noisy measurement of the unit-norm direction vector $u_k$ pointing from agent $k$ to the target $w^\star_{\mathcal{C}_q}$, given by:
$$u_k(i) = u_k + \alpha_k(i)\, u_k^\perp + \beta_k(i)\, u_k, \qquad (85)$$
with $u_k = (w^\star_{\mathcal{C}_q} - n_k)/\|w^\star_{\mathcal{C}_q} - n_k\|$, where $n_k$ is the location vector of node $k$, and $u_k^\perp$ is a unit-norm vector that lies in the same space as $u_k$ and whose direction is perpendicular to $u_k$. The variables $\alpha_k(i)$ and $\beta_k(i)$ are zero-mean independent Gaussian random variables with variances $\sigma_{\alpha,k}^2$ and $\sigma_{\beta,k}^2$, respectively. The amount of perturbation along the parallel direction is assumed to be small compared to the amount of perturbation along the perpendicular direction, that is, $\sigma_{\beta,k}^2 \ll \sigma_{\alpha,k}^2$.

To show the effects of randomness at the level of nodes and links, we considered a network grouped into $Q = 10$ clusters, located over arcs with radii uniformly distributed in a fixed interval around a nominal radius $R_0$. The angular parameters $\theta_1$ and $\theta_2$ were set to fixed fractions of $\pi$. The network topology is shown in Figure 6. The noise variances $\sigma_{v,k}^2$, $\sigma_{\alpha,k}^2$, and $\sigma_{\beta,k}^2$ were set to fixed values, identical for all $k$.
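To make the observation model concrete, here is a small generator for the measurement pair in (84)–(85) in the two-dimensional case; the noise levels in the usage example are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

def localization_data(w_target, n_k, sigma_v, sigma_alpha, sigma_beta, rng):
    """One noisy measurement pair {d_k(i), u_k(i)} from model (84)-(85), 2-D case."""
    g = w_target - n_k
    u = g / np.linalg.norm(g)                 # unit-norm direction toward the target
    u_perp = np.array([-u[1], u[0]])          # in-plane perpendicular direction
    u_noisy = u + rng.normal(0, sigma_alpha) * u_perp + rng.normal(0, sigma_beta) * u  # (85)
    d = u_noisy @ w_target + rng.normal(0, sigma_v)                                    # (84)
    return d, u_noisy

# Example: node at the origin observing a target; perpendicular noise dominates.
d, u = localization_data(np.array([3.0, 4.0]), np.zeros(2), 0.1, 0.3, 0.01, rng)
print(d, u)
```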
We considered a Bernoulli asynchronous model. The coefficients $a_{\ell k}$ in (90) were set to $|\mathcal{N}_k \cap \mathcal{C}(k)|^{-1}$ for intra-cluster links, and to zero for inter-cluster links. The regularization factors $\rho_{k\ell}$ in (94) were set to $|\mathcal{N}_k \setminus \mathcal{C}(k)|^{-1}$. The probabilities of success $q_k$, $p_{\ell k}$, and $r_{k\ell}$ were all set to the same value.

Fig. 6. Network topology consisting of 10 clusters: circles for nodes, solid lines for links, and dashed lines for cluster boundaries.
The MSD learning curves were averaged over independent Monte-Carlo runs. We ran the synchronous and asynchronous multitask algorithms in two different situations. In the first one, we set the regularization strength $\eta$ to zero, that is, we did not allow any cooperation between neighboring clusters. In the second one, we set the regularization strength $\eta$ to a positive value. For comparison purposes, we also ran the noncooperative LMS, obtained by setting $\mathcal{A}(i) = \mathcal{P}(i) = I_N$ for all $i$, and the standard diffusion LMS [17]. In both cases, synchronous and asynchronous algorithms were considered. Each synchronous algorithm was derived from its asynchronous counterpart by making $\mu_k(i)$, $a_{\ell k}(i)$, and $\rho_{k\ell}(i)$ deterministic quantities equal to $\bar{\mu}_k$, $\bar{a}_{\ell k}$, and $\bar{\rho}_{k\ell}$, respectively. In order to illustrate the tracking ability of the algorithms, we modified the radius $R$ of $\mathcal{L}$ every 500 iterations such that: for $i \in [0, 500]$, $R = 0.5\,R_0$; for $i \in\, ]500, 1000]$, $R = R_0$; for $i \in\, ]1000, 1500]$, $R = 1.5\,R_0$; and for $i \in\, ]1500, 2000]$, $R = 2\,R_0$. Note that varying $R$ affects the level of similarity between neighboring tasks when characterized by $\|w^\star_{\mathcal{C}_i} - w^\star_{\mathcal{C}_j}\|$, where $\mathcal{C}_i$ and $\mathcal{C}_j$ denote two neighboring clusters. Indeed, $w^\star_{\mathcal{C}_j}$ can be expressed as:
$$w^\star_{\mathcal{C}_j} = w_o + R \begin{bmatrix} \cos\big(\theta_1 + \frac{\theta}{Q}\,(j - \frac{1}{2})\big) \\ \sin\big(\theta_1 + \frac{\theta}{Q}\,(j - \frac{1}{2})\big) \end{bmatrix}, \quad \forall j = 1, \ldots, Q, \qquad (86)$$
where $\theta = \theta_2 - \theta_1$. With the topology shown in Fig. 6, we obtain:
$$\|w^\star_{\mathcal{C}_i} - w^\star_{\mathcal{C}_j}\|^2 = R^2\big(2 - 2\cos(\theta/Q)\big). \qquad (87)$$
Figure 7 shows that cooperation among clusters improved the network MSD performance and endowed the network with robustness towards asynchronous events. We also observe that the performance of the standard diffusion LMS algorithm deteriorates when the level of similarity between tasks decreases.

Figure 8 depicts the estimated arc when $R = 2\,R_0$ for the following algorithms in an asynchronous setting: the noncooperative LMS obtained by setting $\mathcal{A}(i) = \mathcal{P}(i) = I_N$ for all $i$, the standard diffusion LMS [17], and the multitask diffusion LMS (9). In each case, the results were averaged over 150 Monte-Carlo runs and over 50 samples after convergence. The multitask diffusion algorithm outperformed the noncooperative LMS and the standard diffusion. The standard diffusion was not able to estimate the location of the target since it is a single-task algorithm; it is shown in [29] that the standard diffusion LMS converges to a Pareto optimal solution when it is applied to multitask problems.

Fig. 8. Target estimation results ($R = 2\,R_0$) over the asynchronous network: black cross signs for multitask diffusion (9), red asterisk signs for noncooperative LMS, and blue circle signs for standard diffusion [17].

Finally, in order to show the effect of the number of clusters (or tasks) on the performance of the network, we considered two additional experimental setups. In the first one, represented in Figure 9 (left), the number of tasks was set to 5, that is, the arc $\mathcal{L}$ was decomposed into 5 sub-arcs. In the second one, depicted in Figure 9 (right), the number of clusters was set to 15. Except for these changes, we considered the same experimental setup as before. Every 500 time steps, the radius $R$ of the arc was modified as before in order to decrease the similarity level between tasks. The learning curves of the algorithms considered in Figure 7 are reported in Figure 10.
As expected, it can be observed that the larger the number of clusters is, the more efficient the collaboration between clusters becomes. The benefits of inter-cluster cooperation decrease when the number of clusters becomes small.
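The task-similarity relation (86)–(87) used above can be checked numerically. The sketch below places the sub-arc centers according to (86) (with the mid-arc convention $j - \frac{1}{2}$ assumed here) and compares neighboring distances with (87); the arc parameters are illustrative.

```python
import numpy as np

Q, R, theta1, theta2 = 10, 1.0, np.pi / 6, np.pi / 2   # illustrative arc parameters
theta = theta2 - theta1
w_o = np.array([0.0, 0.0])

# Sub-arc centers according to (86).
phi = theta1 + (theta / Q) * (np.arange(1, Q + 1) - 0.5)
w_star = w_o + R * np.stack([np.cos(phi), np.sin(phi)], axis=1)

# Squared distance between neighboring tasks, compared with (87).
d2 = np.sum((w_star[1:] - w_star[:-1]) ** 2, axis=1)
print(np.allclose(d2, R**2 * (2 - 2 * np.cos(theta / Q))))   # True
```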
V. CONCLUSION AND PERSPECTIVES

In this paper, we considered multitask problems where networks are able to handle situations beyond the case where the nodes estimate a unique parameter vector over the network. We introduced a general model for asynchronous behavior with random step-sizes, combination coefficients, and co-regularization factors. We then carried out a convergence analysis of the asynchronous multitask algorithm in the mean and mean-square-error sense, and we derived conditions for convergence. Several open problems still have to be solved for specific applications. For instance, it would be interesting to investigate how nodes can autonomously adjust the co-regularization factors between neighboring clusters in order to optimize the learning performance. It would also be advantageous to consider alternative co-regularizers in order to promote properties such as sparsity or block sparsity, and to analyze the convergence behavior of the resulting algorithms.
APPENDIX A
THE BERNOULLI MODEL
In this model, the step-sizes $\{\mu_k(i)\}$ are distributed as follows:
$$\mu_k(i) = \begin{cases} \mu_k, & \text{with probability } q_k, \\ 0, & \text{with probability } 1 - q_k, \end{cases} \qquad (88)$$
where $\mu_k$ is a fixed value. This probability distribution allows us to model random "on-off" behavior by each agent $k$ due to power saving strategies or random agent failures. We assume that the step-sizes $\mu_k(i)$ are spatially uncorrelated for different $k$. At each iteration $i$, the mean of the step-size $\mu_k(i)$ is $\bar{\mu}_k = \mu_k q_k$, and the covariance between $\mu_k(i)$ and $\mu_\ell(i)$ is:
$$c_{\mu,k,\ell} \triangleq E\{(\mu_k(i) - \bar{\mu}_k)(\mu_\ell(i) - \bar{\mu}_\ell)\} = \begin{cases} \mu_k^2\, q_k(1 - q_k), & \text{if } \ell = k, \\ 0, & \text{otherwise.} \end{cases} \qquad (89)$$
Furthermore, the combination weights $\{a_{\ell k}(i)\}$ are distributed as follows:
$$a_{\ell k}(i) = \begin{cases} a_{\ell k}, & \text{with probability } p_{\ell k}, \\ 0, & \text{with probability } 1 - p_{\ell k}, \end{cases} \qquad (90)$$
for any $\ell \in \mathcal{N}_k^-(i) \cap \mathcal{C}(k)$, where $0 < a_{\ell k} < 1$ is a fixed coefficient. The coefficients $\{a_{\ell k}(i)\}$ are spatially uncorrelated for different $\ell$ and $k$. Node $k$ adjusts its own combination coefficient to ensure that the sum of its neighboring coefficients is equal to one, as follows:
$$a_{kk}(i) = 1 - \sum_{\ell \in \mathcal{N}_k^-(i) \cap \mathcal{C}(k)} a_{\ell k}(i) \ge 0. \qquad (91)$$
The probability distribution (90) allows us to model a random "on-off" status for links within clusters at time $i$ due to communication cost saving strategies or random link failures. With this model, we are giving each agent $k$ the opportunity to randomly choose a subset of neighbors that belong to its cluster to perform the combination step. At each iteration $i$, the mean of the coefficient $a_{\ell k}(i)$ is given by:
$$\bar{a}_{\ell k} = \begin{cases} a_{\ell k}\, p_{\ell k}, & \text{if } \ell \in \mathcal{N}_k^- \cap \mathcal{C}(k), \\ 1 - \sum_{\ell \in \mathcal{N}_k^- \cap \mathcal{C}(k)} a_{\ell k}\, p_{\ell k}, & \text{if } \ell = k, \\ 0, & \text{otherwise,} \end{cases} \qquad (92)$$
and the covariance between $a_{\ell k}(i)$ and $a_{nm}(i)$ equals [42]:
$$c_{a,\ell k,nm} = E\{(a_{\ell k}(i) - \bar{a}_{\ell k})(a_{nm}(i) - \bar{a}_{nm})\} = \begin{cases} c_{a,\ell k,\ell k}, & \text{if } k = m,\ \ell = n,\ \ell \in \mathcal{N}_k^- \cap \mathcal{C}(k), \\ -c_{a,\ell k,\ell k}, & \text{if } k = m = n,\ \ell \in \mathcal{N}_k^- \cap \mathcal{C}(k), \\ -c_{a,nk,nk}, & \text{if } k = m = \ell,\ n \in \mathcal{N}_k^- \cap \mathcal{C}(k), \\ \sum_{j \in \mathcal{N}_k^- \cap \mathcal{C}(k)} c_{a,jk,jk}, & \text{if } k = m = \ell = n, \\ 0, & \text{otherwise,} \end{cases} \qquad (93)$$
where $c_{a,\ell k,\ell k} = a_{\ell k}^2\, p_{\ell k}(1 - p_{\ell k})$.

Finally, the regularization factors $\{\rho_{k\ell}(i)\}$ are distributed as follows:
$$\rho_{k\ell}(i) = \begin{cases} \rho_{k\ell}, & \text{with probability } r_{k\ell}, \\ 0, & \text{with probability } 1 - r_{k\ell}, \end{cases} \qquad (94)$$
for any $\ell \in \mathcal{N}_k(i) \setminus \mathcal{C}(k)$, where $0 < \rho_{k\ell} < 1$ is a fixed regularization factor. The factors $\{\rho_{k\ell}(i)\}$ are spatially uncorrelated for $k \ne \ell$. At each iteration $i$, in order to get a right-stochastic matrix $P(i)$, node $k$ adjusts its regularization factor as follows:
$$\rho_{kk}(i) = 1 - \sum_{\ell \in \mathcal{N}_k(i) \setminus \mathcal{C}(k)} \rho_{k\ell}(i) \ge 0. \qquad (95)$$
The probability distribution (94) allows each agent $k$ to randomly select a subset of neighbors that do not belong to its cluster and introduce co-regularization in the estimation process. This behavior can also be interpreted as resulting from random link failures between neighboring clusters: at every time instant $i$, the communication link from agent $\ell$ to agent $k$ drops with probability $1 - r_{k\ell}$.
The mean of $\rho_{k\ell}(i)$ is given by:
$$\bar{\rho}_{k\ell} = \begin{cases} \rho_{k\ell}\, r_{k\ell}, & \text{if } \ell \in \mathcal{N}_k \setminus \mathcal{C}(k) \\ 1 - \sum_{j \in \mathcal{N}_k \setminus \mathcal{C}(k)} \rho_{kj}\, r_{kj}, & \text{if } \ell = k \\ 0, & \text{otherwise,} \end{cases} \tag{96}$$
and the covariance between $\rho_{k\ell}(i)$ and $\rho_{mn}(i)$ is:
$$c_{\rho,k\ell,mn} = \mathbb{E}\{(\rho_{k\ell}(i)-\bar{\rho}_{k\ell})(\rho_{mn}(i)-\bar{\rho}_{mn})\} = \begin{cases} c_{\rho,k\ell,k\ell}, & \text{if } k = m,\ \ell = n,\ \ell \in \mathcal{N}_k \setminus \mathcal{C}(k) \\ -c_{\rho,k\ell,k\ell}, & \text{if } k = m = n,\ \ell \in \mathcal{N}_k \setminus \mathcal{C}(k) \\ -c_{\rho,kn,kn}, & \text{if } k = m = \ell,\ n \in \mathcal{N}_k \setminus \mathcal{C}(k) \\ \sum_{j \in \mathcal{N}_k \setminus \mathcal{C}(k)} c_{\rho,kj,kj}, & \text{if } k = m = \ell = n \\ 0, & \text{otherwise,} \end{cases} \tag{97}$$
where $c_{\rho,k\ell,k\ell} = \rho_{k\ell}^2\, r_{k\ell}(1-r_{k\ell})$.

APPENDIX B
STABILITY OF $\mathcal{F}$

Recall from (68) that
$$\mathcal{F} \approx \mathcal{A}_I^{\top}\left[ I_{(NL)^2} - I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) \otimes_b I_{NL} \right]. \tag{98}$$
We now upper-bound the spectral radius of $\mathcal{F}$ in order to derive a sufficient condition for mean-square stability of the algorithm. We can write:
$$\rho(\mathcal{F}) \leq \|\mathcal{A}_I^{\top}\|_{b,\infty} \cdot \left\| I_{(NL)^2} - I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) - \mathcal{M}(\mathcal{R}_x + \eta\,\mathcal{Q}) \otimes_b I_{NL} \right\|_{b,\infty}. \tag{99}$$
Since the matrix $\mathcal{A}_I$ is a block left-stochastic matrix, we know that $\|\mathcal{A}_I^{\top}\|_{b,\infty} = 1$. Using (42) and the triangle inequality, we have:
$$\rho(\mathcal{F}) \leq \left\| I_{(NL)^2} - I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta I_{NL}) - \mathcal{M}(\mathcal{R}_x + \eta I_{NL}) \otimes_b I_{NL} \right\|_{b,\infty} + \eta\,\| I_{NL} \otimes_b \mathcal{M}\mathcal{P} \|_{b,\infty} + \eta\,\| \mathcal{M}\mathcal{P} \otimes_b I_{NL} \|_{b,\infty}. \tag{100}$$
Consider the second term on the RHS of (100). We know that
$$I_{NL} \otimes_b \mathcal{M}\mathcal{P} \overset{(54)}{=} (I_{NL} \otimes_b \mathcal{M})(I_{NL} \otimes_b \mathcal{P}) \overset{(55)}{=} \big((I_N \otimes M) \otimes I_{L^2}\big)\big((I_N \otimes P) \otimes I_{L^2}\big). \tag{101}$$
Since $\big((I_N \otimes P) \otimes I_{L^2}\big)$ is a block right-stochastic matrix and $\big((I_N \otimes M) \otimes I_{L^2}\big)$ is a block diagonal matrix with each block of the form $\bar{\mu}_k I_{L^2}$ ($k = 1, \ldots, N$), we obtain:
$$\| I_{NL} \otimes_b \mathcal{M}\mathcal{P} \|_{b,\infty} \leq \| (I_N \otimes M) \otimes I_{L^2} \|_{b,\infty} \cdot \| (I_N \otimes P) \otimes I_{L^2} \|_{b,\infty} = \max_{1 \leq k \leq N} \bar{\mu}_k. \tag{102}$$
Following the same steps for the third term on the RHS of (100), we have:
$$\| \mathcal{M}\mathcal{P} \otimes_b I_{NL} \|_{b,\infty} \leq \max_{1 \leq k \leq N} \bar{\mu}_k. \tag{103}$$
The matrix $\left[ I_{(NL)^2} - I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta I_{NL}) - \mathcal{M}(\mathcal{R}_x + \eta I_{NL}) \otimes_b I_{NL} \right]$ in the first term on the RHS of (100) is an $N^2 \times N^2$ block diagonal matrix. The $m$-th block on the diagonal (where $m = (\ell-1)N + k$ for $k, \ell = 1, \ldots, N$) is of size $L^2 \times L^2$, symmetric, and has the following form:
$$I_{L^2} - I_L \otimes \bar{\mu}_k (R_{x,k} + \eta I_L) - \bar{\mu}_\ell (R_{x,\ell} + \eta I_L) \otimes I_L = (-\bar{\mu}_\ell R_{x,\ell} - \eta \bar{\mu}_\ell I_L) \otimes I_L + I_L \otimes (I_L - \bar{\mu}_k R_{x,k} - \eta \bar{\mu}_k I_L). \tag{104}$$
Before proceeding, let us recall the Kronecker sum operator, denoted by $\oplus$. If $A$ and $B$ are two matrices of dimension $L \times L$ each, then
$$A \oplus B \triangleq A \otimes I_L + I_L \otimes B. \tag{105}$$
Let $\lambda_k\{\cdot\}$ denote the $k$-th eigenvalue of its matrix argument. Then, the eigenvalues of $A \oplus B$ are of the form $\lambda_i\{A\} + \lambda_j\{B\}$ for $i, j = 1, \ldots, L$ [50]. Note that the RHS of (104) can be written as
$$(-\bar{\mu}_\ell R_{x,\ell} - \eta \bar{\mu}_\ell I_L) \oplus (I_L - \bar{\mu}_k R_{x,k} - \eta \bar{\mu}_k I_L), \tag{106}$$
and its eigenvalues are therefore of the form:
$$1 - \eta \bar{\mu}_k - \bar{\mu}_k \lambda_j\{R_{x,k}\} - \eta \bar{\mu}_\ell - \bar{\mu}_\ell \lambda_i\{R_{x,\ell}\} \tag{107}$$
for $i, j = 1, \ldots, L$ and $k, \ell = 1, \ldots, N$. In order to simplify the mean-square stability condition, we assume that the first-order moment of the step-sizes is the same for all nodes, i.e., $\bar{\mu}_k = \bar{\mu}$ for all $k$.
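As an aside, the eigenvalue property of the Kronecker sum invoked in (105)-(107) is easy to confirm numerically. A minimal sketch with arbitrary symmetric test matrices (not quantities from the analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 4

# Two arbitrary symmetric L x L test matrices standing in for the terms in (106).
A = rng.standard_normal((L, L)); A = (A + A.T) / 2
B = rng.standard_normal((L, L)); B = (B + B.T) / 2

# Kronecker sum (105): A (+) B = A kron I_L + I_L kron B.
ksum = np.kron(A, np.eye(L)) + np.kron(np.eye(L), B)

# Its spectrum should consist of all pairwise sums lambda_i{A} + lambda_j{B}.
lam = np.sort(np.linalg.eigvalsh(ksum))
pairs = np.sort(np.add.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
print(np.allclose(lam, pairs))  # expected: True
```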
Using the fact that the block maximum norm of a block diagonal Hermitian matrix is equal to the largest spectral radius of its block entries [16], we get:
$$\begin{aligned} \big\| I_{(NL)^2} &- I_{NL} \otimes_b \mathcal{M}(\mathcal{R}_x + \eta I_{NL}) - \mathcal{M}(\mathcal{R}_x + \eta I_{NL}) \otimes_b I_{NL} \big\|_{b,\infty} \\ &= \max_{1 \leq k,\ell \leq N} \Big( \max_{1 \leq i,j \leq L} \big| 1 - 2\eta\bar{\mu} - \bar{\mu}\,(\lambda_j\{R_{x,k}\} + \lambda_i\{R_{x,\ell}\}) \big| \Big) \\ &= \max_{1 \leq k,\ell \leq N} \Big( \max_{1 \leq i,j \leq L} \big\{ 1 - 2\eta\bar{\mu} - \bar{\mu}\,(\lambda_j\{R_{x,k}\} + \lambda_i\{R_{x,\ell}\}),\; -1 + 2\eta\bar{\mu} + \bar{\mu}\,(\lambda_j\{R_{x,k}\} + \lambda_i\{R_{x,\ell}\}) \big\} \Big) \\ &= \max\big\{ 1 - 2\eta\bar{\mu} - \bar{\mu} \min_{k,\ell} (\lambda_{\min}\{R_{x,k}\} + \lambda_{\min}\{R_{x,\ell}\}),\; -1 + 2\eta\bar{\mu} + \bar{\mu} \max_{k,\ell} (\lambda_{\max}\{R_{x,k}\} + \lambda_{\max}\{R_{x,\ell}\}) \big\}. \end{aligned} \tag{108}$$
The minimum (and, likewise, the maximum) over $k$ and $\ell$ that appears in the last equality of (108) is attained at $k = \ell$. Thus, a sufficient condition for mean-square stability is given by:
$$\max_{1 \leq k \leq N} \Big( \max_{1 \leq i \leq L} \big| 1 - 2\eta\bar{\mu} - 2\bar{\mu}\,\lambda_i(R_{x,k}) \big| + 2\eta\bar{\mu} \Big) < 1, \tag{109}$$
which is verified if the first-order moment of the step-sizes satisfies:
$$0 < \bar{\mu} < \frac{1}{2\eta + \max_{1 \leq k \leq N} \rho(R_{x,k})}. \tag{110}$$
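For concreteness, once the regressor covariances are known (or estimated), the sufficient bound (110) is a one-line computation. A minimal sketch, using randomly generated positive-definite matrices as stand-ins for the true $R_{x,k}$ (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, eta = 10, 5, 0.1   # network size, filter length, regularization strength

# Hypothetical SPD covariance matrices R_{x,k}, one per node.
Rx = [(lambda G: G @ G.T / L + 0.1 * np.eye(L))(rng.standard_normal((L, L)))
      for _ in range(N)]

# Bound (110): 0 < bar{mu} < 1 / (2*eta + max_k rho(R_{x,k})), where rho(.)
# is the spectral radius, i.e., the largest eigenvalue of the SPD matrix.
rho_max = max(np.linalg.eigvalsh(R)[-1] for R in Rx)
mu_bar_max = 1.0 / (2 * eta + rho_max)
print(f"mean step-sizes must satisfy 0 < mu_bar < {mu_bar_max:.4f}")
```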
REFERENCES

[1] R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, "Performance analysis of multitask diffusion adaptation over asynchronous networks," in Proc. Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2014.
[2] A. H. Sayed, "Adaptive networks," Proc. of the IEEE, vol. 102, no. 4, pp. 460–497, April 2014.
[3] A. H. Sayed, "Adaptation, learning, and optimization over networks," Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, July 2014.
[4] J. Tsitsiklis and M. Athans, "Convergence and asymptotic agreement in distributed decision problems," IEEE Transactions on Automatic Control, vol. 29, no. 1, pp. 42–50, January 1984.
[5] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Systems & Control Letters, vol. 53, no. 9, pp. 65–78, September 2004.
[6] P. Braca, S. Marano, and V. Matta, "Running consensus in wireless sensor networks," in Proc. International Conference on Information Fusion (FUSION), Cologne, Germany, June-July 2008, pp. 1–6.
[7] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, January 2009.
[8] S. Kar and J. M. F. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Transactions on Signal Processing, vol. 57, no. 1, pp. 355–369, 2009.
[9] K. Srivastava and A. Nedic, "Distributed asynchronous constrained stochastic optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 772–790, 2011.
[10] D. P. Bertsekas, "A new class of incremental gradient methods for least squares problems," SIAM Journal on Optimization, vol. 7, no. 4, pp. 913–926, November 1997.
[11] A. Nedic and D. P. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM Journal on Optimization, vol. 12, no. 1, pp. 109–138, July 2001.
[12] M. G. Rabbat and R. D. Nowak, "Quantized incremental algorithms for distributed optimization," IEEE Journal on Selected Areas in Communications, vol. 23, no. 4, pp. 798–808, April 2005.
[13] D. Blatt, A. O. Hero, and H. Gauchman, "A convergent incremental gradient method with constant step size," SIAM Journal on Optimization, vol. 18, no. 1, pp. 29–51, February 2007.
[14] C. G. Lopes and A. H. Sayed, "Incremental adaptive strategies over distributed networks," IEEE Transactions on Signal Processing, vol. 55, no. 8, pp. 4064–4077, August 2007.
[15] A. H. Sayed, S.-Y. Tu, J. Chen, X. Zhao, and Z. Towfic, "Diffusion strategies for adaptation and learning over networks," IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 155–171, May 2013.
[16] A. H. Sayed, "Diffusion adaptation over networks," in Academic Press Library in Signal Processing, R. Chellapa and S. Theodoridis, Eds., vol. 3, pp. 322–454, Elsevier, 2014.
[17] C. G. Lopes and A. H. Sayed, "Diffusion least-mean squares over adaptive networks: Formulation and performance analysis," IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3122–3136, July 2008.
[18] F. S. Cattivelli and A. H. Sayed, "Diffusion LMS strategies for distributed estimation," IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1035–1048, March 2010.
[19] J. Chen and A. H. Sayed, "Diffusion adaptation strategies for distributed optimization and learning over networks," IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, August 2012.
[20] J. Chen and A. H. Sayed, "Distributed Pareto optimization via diffusion strategies," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, April 2013.
[21] R. Abdolee, B. Champagne, and A. H. Sayed, "Estimation of space-time varying parameters using a diffusion LMS algorithm," IEEE Transactions on Signal Processing, vol. 62, no. 2, pp. 403–418, January 2014.
[22] N. Bogdanović, J. Plata-Chaves, and K. Berberidis, "Distributed diffusion-based LMS for node-specific parameter estimation over adaptive networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014, pp. 7223–7227.
[23] J. Plata-Chaves, N. Bogdanović, and K. Berberidis, "Distributed diffusion-based LMS for node-specific adaptive parameter estimation," available as arXiv:1408.3354, 2014.
[24] J. Chen, C. Richard, A. O. Hero, and A. H. Sayed, "Diffusion LMS for multitask problems with overlapping hypothesis subspaces," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Reims, France, September 2014, pp. 1–6.
[25] A. Bertrand and M. Moonen, "Distributed adaptive node-specific signal estimation in fully connected sensor networks – Part I: Sequential node updating," IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5277–5291, October 2010.
[26] A. Bertrand and M. Moonen, "Distributed adaptive estimation of node-specific signals in wireless sensor networks with a tree topology," IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2196–2210, May 2011.
[27] J. Chen, C. Richard, and A. H. Sayed, "Multitask diffusion adaptation over networks," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4129–4144, August 2014.
[28] R. Nassif, C. Richard, A. Ferrari, and A. H. Sayed, "Multitask diffusion LMS with sparsity-based regularization," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, April 2015.
[29] J. Chen and A. H. Sayed, "Distributed Pareto optimization via diffusion strategies," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, April 2013.
[30] X. Zhao and A. H. Sayed, "Clustering via diffusion adaptation over networks," in Proc. 3rd International Workshop on Cognitive Information Processing (CIP), Baiona, Spain, May 2012, pp. 1–6.
[31] J. Chen, C. Richard, and A. H. Sayed, "Diffusion LMS over multitask networks," IEEE Transactions on Signal Processing, vol. 63, no. 11, pp. 2733–2748, June 2015.
[32] R. Nassif, C. Richard, J. Chen, A. Ferrari, and A. H. Sayed, "Diffusion LMS over multitask networks with noisy links," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, March 2016.
[33] X. Zhao and A. H. Sayed, "Distributed clustering and learning over networks," IEEE Transactions on Signal Processing, vol. 63, no. 13, pp. 3285–3300, July 2015.
[34] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
[35] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Randomized gossip algorithms," IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
[36] S. Kar and J. M. F. Moura, "Convergence rate analysis of distributed gossip (linear parameter) estimation: Fundamental limits and tradeoffs," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 674–690, 2011.
[37] S. Kar and J. M. F. Moura, "Sensor networks with random links: Topology design for distributed consensus," IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3315–3326, 2008.
[38] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione, "Broadcast gossip algorithms for consensus," IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2748–2761, 2009.
[39] S. Kar and J. M. F. Moura, "Distributed consensus algorithms in sensor networks: Quantized data and random link failures," IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1383–1400, 2010.
[40] D. Jakovetic, J. Xavier, and J. M. F. Moura, "Weight optimization for consensus algorithms with correlated switching topology," IEEE Transactions on Signal Processing, vol. 58, no. 7, pp. 3788–3801, 2010.
[41] D. Jakovetic, J. Xavier, and J. M. F. Moura, "Cooperative convex optimization in networked systems: Augmented Lagrangian algorithms with directed gossip communication," IEEE Transactions on Signal Processing, vol. 59, no. 8, pp. 3889–3902, 2011.
[42] X. Zhao and A. H. Sayed, "Asynchronous adaptation and learning over networks – Part I: Modeling and stability analysis," IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 811–826, February 2015.
[43] X. Zhao and A. H. Sayed, "Asynchronous adaptation and learning over networks – Part II: Performance analysis," IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 827–842, February 2015.
[44] X. Zhao and A. H. Sayed, "Asynchronous adaptation and learning over networks – Part III: Comparison analysis," IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 843–858, February 2015.
[45] L. Grady and J. R. Polimeni, Discrete Calculus: Applied Analysis on Graphs for Computational Science, Springer, 2010.
[46] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory, Academic Press, London, 2nd edition, 1995.
[47] J. B. Rosen, "Existence and uniqueness of equilibrium points for concave n-person games," Econometrica: Journal of the Econometric Society, vol. 33, no. 3, pp. 520–534, 1965.
[48] A. H. Sayed, Adaptive Filters, John Wiley & Sons, NJ, 2008.
[49] R. H. Koning, H. Neudecker, and T. Wansbeek, "Block Kronecker products and the vecb operator," Linear Algebra and its Applications, vol. 149, pp. 165–184, April 1991.
[50] D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas with Application to Linear Systems Theory, Princeton University Press, 2005.
Roula Nassif was born in Beirut, Lebanon, in February 1991. She received the bachelor's degree in Electrical Engineering from the Lebanese University, Lebanon, in 2013. She received the M.S. degrees in Industrial Control and Intelligent Systems for Transport from the Lebanese University, Lebanon, and from Compiègne University of Technology, France, in 2013. Since October 2013 she has been a Ph.D. student at the Lagrange Laboratory (University of Nice Sophia Antipolis, CNRS, Observatoire de la Côte d'Azur). Her research activity is focused on distributed optimization over multitask networks.

Cédric Richard (S'98–M'01–SM'07) received the Dipl.-Ing. and M.S. degrees in 1994, and the Ph.D. degree in 1998, from Compiègne University of Technology, France, all in electrical and computer engineering. From 1999 to 2003, he was an Associate Professor at Troyes University of Technology, France, and a Full Professor there from 2003 to 2009. Since 2009, he has been a Full Professor at the University of Nice Sophia Antipolis, France. He was a junior member of the Institut Universitaire de France from 2010 to 2015. His current research interests include statistical signal processing and machine learning. Cédric Richard is the author of over 230 papers. He was the General Co-Chair of the IEEE SSP Workshop held in Nice, France, in 2011. He was the Technical Co-Chair of EUSIPCO 2015 held in Nice, France, and of the IEEE CAMSAP Workshop 2015 held in Cancun, Mexico. He has served as a Senior Area Editor of the IEEE Transactions on Signal Processing and as an Associate Editor of the IEEE Transactions on Signal and Information Processing over Networks since 2015, and as an Associate Editor of Signal Processing (Elsevier) since 2009. Cédric Richard is a member of the Machine Learning for Signal Processing (MLSP) Technical Committee, and served as a member of the Signal Processing Theory and Methods (SPTM) Technical Committee from 2009 to 2014.
André Ferrari (S'91–M'93) received the Ingénieur degree from École Centrale de Lyon, Lyon, France, in 1988, and the M.Sc. and Ph.D. degrees from the University of Nice Sophia Antipolis (UNS), France, in 1989 and 1992, respectively, all in electrical and computer engineering. He is currently a Professor at UNS. He is a member of the Joseph-Louis Lagrange Laboratory (CNRS, OCA), where his research activity is centered around statistical signal processing and modeling, with a particular interest in applications to astrophysics.
Ali H. Sayed (S'90–M'92–SM'99–F'01) is professor and former chairman of electrical engineering at the University of California, Los Angeles, USA, where he directs the UCLA Adaptive Systems Laboratory. An author of more than 460 scholarly publications and six books, his research involves several areas including adaptation and learning, statistical signal processing, distributed processing, network science, and biologically inspired designs. Dr. Sayed has received several awards, including the 2015 Education Award from the IEEE Signal Processing Society, the 2014 Athanasios Papoulis Award from the European Association for Signal Processing, the 2013 Meritorious Service Award and the 2012 Technical Achievement Award from the IEEE Signal Processing Society, the 2005 Terman Award from the American Society for Engineering Education, the 2003 Kuwait Prize, and the 1996 IEEE Donald G. Fink Prize. He served as Distinguished Lecturer for the IEEE Signal Processing Society in 2005 and as Editor-in-Chief of the IEEE TRANSACTIONS ON SIGNAL PROCESSING (2003-2005). His articles received several Best Paper Awards from the IEEE Signal Processing Society (2002, 2005, 2012, 2014). He is a Fellow of the American Association for the Advancement of Science (AAAS). He is recognized as a Highly Cited Researcher by Thomson Reuters.

Fig. 7. Network topology consisting of 10 clusters. Network MSD learning curves in a non-stationary environment: comparison of the multitask diffusion LMS with (namely, η > 0) and without (namely, η = 0) inter-cluster cooperation, the standard diffusion LMS [17], and the non-cooperative LMS. The dotted lines are for synchronous networks and the solid lines are for asynchronous networks.

Fig. 9. Network topology: circles for nodes, solid lines for links, and dashed lines for cluster boundaries. Left: network consisting of 5 clusters. Right: network consisting of 15 clusters.

Fig. 10. Network MSD learning curves in a non-stationary environment: comparison of the same algorithms considered in Figure 7. The dotted lines are for synchronous networks and the solid lines are for asynchronous networks. Top: network consisting of 5 clusters. Bottom: network consisting of 15 clusters.