Wide and Deep Graph Neural Networks with Distributed Online Learning
Zhan Gao, Fernando Gama, and Alejandro Ribeiro
Department of Electrical and Systems Engineering
University of Pennsylvania
Philadelphia, PA 19104
{gaozhan, fgama, aribeiro}@seas.upenn.edu
Abstract
Graph neural networks (GNNs) learn representations from network data with naturally distributed architectures. This renders them well-suited candidates for decentralized learning, since their operations respect the structure imposed by the underlying graph. Oftentimes, this graph support changes with time, whether due to link failures or topology changes caused by mobile components. Modifications to the underlying structure create a mismatch between the graphs on which GNNs were trained and the ones on which they are tested. Online learning can be used to retrain GNNs at test time, overcoming this issue. However, most online learning algorithms are centralized and work on convex objective functions (which GNNs rarely lead to). This paper puts forth the Wide and Deep GNN (WD-GNN), a novel architecture that can be easily updated with distributed online learning mechanisms. The WD-GNN consists of two components: the wide part is a bank of linear graph filters and the deep part is a convolutional GNN. At training time, the joint architecture learns a relevant nonlinear representation from data. At test time, the deep part is left unchanged, while the wide part is retrained online. Since the wide part is linear, the problem becomes convex, and online optimization algorithms can be used. Furthermore, to exploit the distributed nature of the architecture, we propose a distributed online optimization algorithm that updates the wide part at test time without violating its decentralized nature. We also analyze the stability of the WD-GNN to changes in the underlying topology and derive convergence guarantees for the online retraining procedure. These results indicate the transferability, scalability, and efficiency of the WD-GNN in adapting online to new testing scenarios in a distributed manner. Experiments on the control of a robot swarm for flocking corroborate the theory and show the potential of the proposed architecture for distributed online learning.
1 Introduction

Graph neural networks (GNNs) [1–6] are nonlinear representation maps that have been shown to perform successfully on graph data in a wide array of tasks involving citation networks [5], recommendation systems [7], source localization [8], power grids [9] and robot swarms [10]. GNNs consist of a cascade of layers, each of which applies a graph convolution (a graph filter) [11], followed by a pointwise nonlinearity [1–6]. One of the key aspects of GNNs is that they are local and distributed. They are local since they require information only from neighboring nodes, and distributed since each node can compute its own output without need for a centralized unit. Oftentimes, however, problems of interest exhibit (slight) changes in data structure between the training and testing sets, or involve dynamic systems [12, 13]. For example, in the case of the robot swarm, the
Preprint. Under review.

graph is determined by the communication network between robots which is, in turn, determined by their physical proximity. Thus, if robots move, the communication links will change, and the graph support will change as well. Therefore, we oftentimes need to adapt to (slightly) new data structures. GNNs have been shown to be resilient to such changes, as proven by the properties of permutation equivariance and stability [14, 15]. While these properties guarantee transference, we can further improve performance by leveraging online learning approaches.

Online learning is a well-established paradigm that tracks the optimizer of time-varying optimization problems and has been successful as an enabler in the fields of machine learning and signal processing [16]. In a nutshell, online algorithms tackle each modified time instance of the optimization problem by performing a series of updates on the previously obtained solutions. In order to leverage online learning in GNNs we face two major roadblocks. First, optimality bounds and convergence guarantees are given only for convex problems [17]. Second, online optimization algorithms assume a centralized setting. The latter is particularly problematic since it violates the local and distributed nature of GNNs, upon which much of their success has been built [10].

This paper puts forward the Wide and Deep Graph Neural Network (WD-GNN) architecture, which is amenable to distributed online learning while keeping convergence guarantees. We define the WD-GNN as consisting of two components: a deep part, which is a nonlinear GNN, and a wide part, which is a bank of linear graph filters (Section 3). We propose to have an offline phase of training, which need not be distributed, and then an online retraining phase, where only the wide part is adapted to the new problem settings.
In this way, we learn a nonlinear representation that can still be adapted online without sacrificing the convex nature of the problem (Section 4). We further develop an algorithm for distributed online learning. We prove that the WD-GNN is stable to changes in the underlying graph support, indicating a certain level of robustness to structural changes, and we prove convergence guarantees for the proposed online learning procedure (Section 5). Finally, we perform simulated experiments on robot swarm control (Section 6). Note that proofs, implementation details and another experiment involving movie recommendation systems can be found in the supplementary material.

2 Related work

GNNs have been developed as nonlinear representation maps that are capable of leveraging the graph structure present in data. The most popular model for GNNs is the one involving graph convolutions (formally known as graph FIR filters). Several implementations of this model have been proposed, including [1], which computes the graph convolution in the spectral domain, [2], which uses a Chebyshev polynomial implementation, [3, 4], which use a summation polynomial, and [5, 6], which reduce the polynomial to just the first order. All of these are different implementations of the same representation space, given by the use of graph convolutional filters to regularize the linear transform of a neural network model. Other popular GNN models include graph attention networks [18] and CayleyNets [19]; see [20] for a general framework.

Online learning has been investigated in designing neural networks (NNs) for dynamically varying problems. Specifically, [21, 22] develop online algorithms for feedforward neural networks with applications in dynamical condition monitoring and aircraft control.
More recently, online learning has been used in convolutional neural networks for visual tracking, detection and classification [23, 24]. While these works develop online algorithms for NNs, an analysis of the convergence of these algorithms is not presented, except for [25], which proves the convergence of certain online algorithms for radial neural networks only.
3 The wide and deep graph neural network (WD-GNN)

Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathcal{W}\}$ describe a graph, where $\mathcal{V} = \{n_1, \ldots, n_N\}$ is the set of $N$ nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges, and $\mathcal{W}: \mathcal{E} \to \mathbb{R}$ is the edge weight function. In the case of the robot swarm, each node $n_i \in \mathcal{V}$ represents a robot, each edge $(n_j, n_i) \in \mathcal{E}$ represents the communication link between robots $n_j$ and $n_i$, and the weight $\mathcal{W}(n_j, n_i) = w_{ji} \geq 0$ models the communication channel.

The graph $\mathcal{G}$ is used to describe the data structure of interest. The data itself is defined on top of the graph and is described by means of a graph signal $X: \mathcal{V} \to \mathbb{R}^F$, which assigns an $F$-dimensional feature vector to each node. For example, in the robot swarm, the signal $X(n_i) \in \mathbb{R}^F$ represents the state of robot $n_i$, typically described by its relative position, velocity or acceleration. The collection of features across all nodes in the graph can be conveniently denoted with a matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$, which we call a graph signal as well. Note that each row of $\mathbf{X}$ corresponds to the feature vector at each node, whereas each column corresponds to the collection of the $f$th feature across all nodes.

Describing the data as a graph signal lets us leverage the framework of graph signal processing (GSP) as the mathematical foundation on which to develop algorithms, derive properties and gain insights [11]. In particular, note that $\mathbf{X}$ is an $N \times F$ matrix that bears no information about the underlying graph (beyond the fact that it has $N$ nodes). To be able to relate the graph signal to the specific graph it is supported on, we need a matrix description of the graph. Let $\mathbf{S} \in \mathbb{R}^{N \times N}$ be a support matrix that satisfies $[\mathbf{S}]_{ij} = 0$ if $i \neq j$ and $(n_j, n_i) \notin \mathcal{E}$. Examples of support matrices commonly used in the literature include the adjacency matrix, the Laplacian matrix, and their normalized versions. The key aspect of the support matrix is that it respects the sparsity of the graph. Thus, when using it as a linear operator on the data, $\mathbf{S}\mathbf{X}$, we observe that the output at node $n_i$ for feature $f$ becomes

$$[\mathbf{S}\mathbf{X}]_{if} = \sum_{j=1}^{N} [\mathbf{S}]_{ij} [\mathbf{X}]_{jf} = \sum_{j: n_j \in \mathcal{N}_i} [\mathbf{S}]_{ij} [\mathbf{X}]_{jf} \qquad (1)$$

where the second equality emphasizes the sparse nature of the support, in the sense that only the values of the $f$th feature at the neighboring nodes $n_j \in \mathcal{N}_i$, for $\mathcal{N}_i = \{n_j \in \mathcal{V}: (n_j, n_i) \in \mathcal{E}\}$, are required to compute the output of the linear operation $\mathbf{S}\mathbf{X}$ at each node. This renders $\mathbf{S}\mathbf{X}$ a linear operation that only needs information from neighboring nodes (local) and that can be computed separately at each node (distributed). The operation $\mathbf{S}\mathbf{X}$ is at the core of GSP, since it effectively relates the graph signal with the graph support, and it usually receives the name of graph shift [11].

While, in general, we can think of graph data as given by a pair $(\mathbf{X}, \mathbf{S})$ consisting of the graph signal $\mathbf{X}$ and its support $\mathbf{S}$, we would like to remark that we only regard $\mathbf{X}$ as actionable. The support $\mathbf{S}$ is determined by the physical constraints of the problem. For example, in the robot swarm, $\mathbf{S}$ represents some specific model of communications among robots.

In what follows, we propose the Wide and Deep Graph Neural Network (WD-GNN) architecture. It is a nonlinear map $\Psi: \mathbb{R}^{N \times F} \to \mathbb{R}^{N \times G}$ that consists of two components

$$\Psi(\mathbf{X}; \mathbf{S}, \mathcal{A}, \mathcal{B}) = \alpha_D \Phi(\mathbf{X}; \mathbf{S}, \mathcal{A}) + \alpha_W B(\mathbf{X}; \mathbf{S}, \mathcal{B}) + \beta \qquad (2)$$

where $\Phi(\mathbf{X}; \mathbf{S}, \mathcal{A})$ is called the deep part and is a graph neural network (GNN), and $B(\mathbf{X}; \mathbf{S}, \mathcal{B})$ is called the wide part and is a bank of graph filters. The scalars $\alpha_D$, $\alpha_W$, $\beta$ are preset weights; they can also be considered architecture parameters and be trained, if necessary.

The wide component is a bank of graph filters [11]. These are defined as a linear mapping between graph signals $B: \mathbb{R}^{N \times F} \to \mathbb{R}^{N \times G}$, characterized by a set of weights or filter taps $\mathcal{B} = \{\mathbf{B}_k \in \mathbb{R}^{F \times G}, k = 0, \ldots, K\}$, as follows

$$B(\mathbf{X}; \mathbf{S}, \mathcal{B}) = \sum_{k=0}^{K} \mathbf{S}^k \mathbf{X} \mathbf{B}_k. \qquad (3)$$

The operation to compute the output (3) is often called a graph convolution [1, 3]. The output of (3) is another graph signal of dimensions $N \times G$. Note that the graph filter has the ability to change the dimension of the feature vector (i.e., from $F$ to $G$). In fact, to accommodate the multi-dimensional nature of these features, the graph filter (3) actually acts analogously to the application of a bank of $FG$ filters, hence the name. Oftentimes, though, we refer to (3) simply as a graph filter or a graph convolution.

The filtering operation in (3) is distributed and local, as it can be computed at each node with information relayed by neighboring nodes only. To see this, note that the multiplication on the left by $\mathbf{S}^k$ is the one that mixes information from different nodes (in the columns of $\mathbf{X}$), but since $\mathbf{S}$ respects the sparsity of the graph and $\mathbf{S}^k \mathbf{X} = \mathbf{S}(\mathbf{S}^{k-1}\mathbf{X})$ can be seen as repeated applications of $\mathbf{S}$, only the information from neighboring nodes is collected [cf. (1)]. Multiplication on the right by $\mathbf{B}_k$ carries out a linear combination of the entries in each row of $\mathbf{X}$, but since each row corresponds to a single node, this combination is computed locally. Notably, nodes do not require knowledge of $\mathbf{S}$ at implementation time. They only need to have communication capabilities, so that they can receive the information from neighboring nodes, and computational capabilities, to compute a linear combination of the information received from the neighbors. They do not require full knowledge of the graph, but only of their immediate neighbors. Thus, in terms of distributed implementation, the graph filtering operation scales seamlessly [15]. We fundamentally use (3) as a mathematical framework that offers a condensed description of the communication exchanges that happen in a network.

The deep component is a convolutional graph neural network (GNN).
These are defined as a nonlinear mapping between graph signals $\Phi: \mathbb{R}^{N \times F} \to \mathbb{R}^{N \times G}$, built as a cascade of graph filters [cf. (3)] and pointwise nonlinearities

$$\Phi(\mathbf{X}; \mathbf{S}, \mathcal{A}) = \mathbf{X}_L \quad \text{with} \quad \mathbf{X}_\ell = \sigma\bigg(\sum_{k=0}^{K_\ell} \mathbf{S}^k \mathbf{X}_{\ell-1} \mathbf{A}_{\ell k}\bigg) \qquad (4)$$

for $\ell = 1, \ldots, L$, with $\sigma: \mathbb{R} \to \mathbb{R}$ a pointwise nonlinear function which, in an abuse of notation, denotes its entrywise application in (4); and characterized by the set of filter taps $\mathcal{A} = \{\mathbf{A}_{\ell k} \in \mathbb{R}^{F_{\ell-1} \times F_\ell}, k = 0, \ldots, K_\ell, \ell = 1, \ldots, L\}$. The graph signal $\mathbf{X}_\ell$ at each layer has $F_\ell$ features, the input is $\mathbf{X}_0 = \mathbf{X}$ so that $F_0 = F$, and the output has $F_L = G$ features.

There is a vast literature on GNNs. We note that [1–4] all describe the same representation space as (4); they just differ in how the graph convolution (3) is implemented. On the other hand, [5, 6] are restricted to $K_\ell = 1$ for all $\ell$, and thus their representation space is just a subspace of that of (4). Since the results presented here are characterizations of the representation space of (4), they hold for all implementations; while in the numerical section we implement the graph convolution as a direct polynomial, any of the other implementations could have been used.

We train the WD-GNN (2) by solving the empirical risk minimization (ERM) problem for some cost function $J: \mathbb{R}^{N \times G} \to \mathbb{R}$ on a given training set $\mathcal{T} = \{\mathbf{X}_1, \ldots, \mathbf{X}_{|\mathcal{T}|}\}$

$$\min_{\mathcal{A}, \mathcal{B}} \frac{1}{|\mathcal{T}|} \sum_{\mathbf{X} \in \mathcal{T}} J\big(\Psi(\mathbf{X}; \mathbf{S}, \mathcal{A}, \mathcal{B})\big). \qquad (5)$$

The ERM problem on a nonlinear neural network model is typically nonconvex, and is usually approximately solved by means of some SGD-based optimization algorithm, exploiting the backpropagation method for efficient computation of the derivatives.
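To fix ideas, the forward pass of (2)-(4) can be sketched in a few lines of numpy. This is our own minimal illustration, not the implementation used in the experiments; the function names, the choice of tanh for $\sigma$, and the defaults $\alpha_D = \alpha_W = 1$, $\beta = 0$ are assumptions made for the example:

```python
import numpy as np

def gnn_layer(X, S, taps, sigma=np.tanh):
    """One layer of (4): X_l = sigma(sum_k S^k X_{l-1} A_{lk})."""
    Z, out = X, 0.0
    for Ak in taps:
        out = out + Z @ Ak      # accumulate S^k X_{l-1} A_{lk}
        Z = S @ Z               # one more neighborhood exchange
    return sigma(out)

def wd_gnn(X, S, A, B, alpha_D=1.0, alpha_W=1.0, beta=0.0):
    """WD-GNN output (2): alpha_D Phi(X;S,A) + alpha_W B(X;S,B) + beta.

    A: per-layer lists of taps A_lk, each of shape (F_{l-1}, F_l);
    B: wide-part taps B_k, each of shape (F, G).
    """
    Phi = X                     # deep part: cascade of layers (4)
    for layer_taps in A:
        Phi = gnn_layer(Phi, S, layer_taps)
    Z, wide = X, 0.0            # wide part: graph filter bank (3)
    for Bk in B:
        wide = wide + Z @ Bk
        Z = S @ Z
    return alpha_D * Phi + alpha_W * wide + beta
```

Note that, for fixed $\mathcal{A}$, the output is linear in the wide taps $\mathcal{B}$; this is the property exploited by the online retraining phase.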
Note that the number of parameters in $\mathcal{A}$ and $\mathcal{B}$ is determined by the hyperparameters $L$ (number of layers), $K_\ell$ (number of filter taps per layer) and $F_\ell$ (number of features per layer) for the deep part, and $K$ (number of filter taps) for the wide part. Training is the procedure of applying some SGD-based algorithm for some number of iterations or training steps, arriving at some set of parameters $\mathcal{A}^\dagger$ and $\mathcal{B}^\dagger$.

We note that no single implementation of the graph convolution has consistently outperformed the others in a wide range of diverse problems; this is why we choose to focus on characterizing the representation space, and not on the specifics of implementation. We also take the license to define the ERM problem as in (5) so as to include supervised and unsupervised problems in a single framework; to use (5) for a supervised problem, we just extend $J$ to operate on an extra input representing the label given in the training set.

4 Online learning

In many problems of interest, the data structures may change (slightly) from the training phase to the testing phase, or we may consider dynamic systems, where the scenario changes naturally with time. The problem of controlling a robot swarm, for instance, exhibits both, since the different initializations
of positions and velocities of the swarm lead to different structures at training time and testing time, and since inevitable movements of the robots cause the communication links between them to change.
Online learning addresses this problem by proposing optimization algorithms that adapt to a continuously changing problem [16]. It operates by adjusting parameters repeatedly for each time instance of the problem. Online learning algorithms require convexity of the ERM problem to be able to provide optimality bounds as well as convergence guarantees. However, the ERM problem (5) using the WD-GNN is rarely convex.

To tackle this issue we propose to retrain only the wide component of the WD-GNN. By keeping the deep part fixed at $\mathcal{A} = \mathcal{A}^\dagger$, as obtained from solving (5) over the training set, we can focus on the wide part, which is linear. Thus, we obtain a new ERM problem, now convex, that can be solved online to find a new set of parameters $\mathcal{B}$. In essence, we are leveraging the deep part to learn a nonlinear representation from the training set in an offline phase, and then adapting it online to the testing set, but only up to the extent of linear transforms.

Let $\mathcal{A}^\dagger$ and $\mathcal{B}^\dagger$ be the parameters learned in the offline phase. At testing time, the implementation scenario may differ from the one used for training, leading to a time-varying optimization problem of the form

$$\min_{\mathcal{A}, \mathcal{B}} J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big) \qquad (6)$$

where $J_t(\cdot)$, $\mathbf{X}_t$ and $\mathbf{S}_t$ are the loss function, the observed signal and the graph structure at time $t$, respectively. In the online phase we fix the deep part $\mathcal{A} = \mathcal{A}^\dagger$, converting the WD-GNN into a convex model. Then, we retrain the wide part online based on the changing scenario [cf. (6)]. More specifically, we let $\mathcal{B}_0 = \mathcal{B}^\dagger$ initially and, at time $t$, we have parameters $\mathcal{A}^\dagger$ and $\mathcal{B}_t$, input signal $\mathbf{X}_t$, output $\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_t)$, and loss $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_t)\big)$. At time $t$, we can then perform a few (possibly one) gradient descent steps with step size $\gamma_t$ to update $\mathcal{B}_t$

$$\mathcal{B}_{t+1} = \mathcal{B}_t - \gamma_t \nabla_{\mathcal{B}} J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_t)\big). \qquad (7)$$
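To make the update (7) concrete, the sketch below (ours, not the authors' code) instantiates it for an assumed quadratic tracking loss $J_t(\mathcal{B}) = \|B(\mathbf{X}_t; \mathbf{S}_t, \mathcal{B}) - \mathbf{Y}_t\|_F^2$, a toy stand-in for the true loss in which the fixed deep part's output can be absorbed into the target $\mathbf{Y}_t$. The second function previews a per-node consensus-plus-gradient recursion of the kind used by the distributed variant developed next; all names here are our own:

```python
import numpy as np

def wide_grads(B, X_t, S_t, Y_t):
    """Gradients of the assumed loss ||sum_k S_t^k X_t B_k - Y_t||_F^2
    with respect to each wide-part tap B_k."""
    Zs = [X_t]
    for _ in range(len(B) - 1):
        Zs.append(S_t @ Zs[-1])              # Zs[k] = S_t^k X_t
    resid = sum(Z @ Bk for Z, Bk in zip(Zs, B)) - Y_t
    return Zs, [2.0 * Z.T @ resid for Z in Zs]

def centralized_step(B, X_t, S_t, Y_t, gamma):
    """One online gradient step, as in (7)."""
    _, grads = wide_grads(B, X_t, S_t, Y_t)
    return [Bk - gamma * g for Bk, g in zip(B, grads)]

def distributed_step(B_nodes, X_t, S_t, Y_t, gamma):
    """Neighborhood averaging plus a local gradient step;
    B_nodes[i] is node i's copy of the wide taps."""
    N, K1 = S_t.shape[0], len(B_nodes[0])
    Zs = [X_t]
    for _ in range(K1 - 1):
        Zs.append(S_t @ Zs[-1])
    new = []
    for i in range(N):
        nbrs = [j for j in range(N) if j != i and S_t[i, j] != 0]
        avg = [(B_nodes[i][k] + sum(B_nodes[j][k] for j in nbrs)) / (len(nbrs) + 1)
               for k in range(K1)]           # average over the 1-hop neighborhood
        # local loss at node i: squared error of node i's own output row
        resid_i = sum(Z[i] @ Bk for Z, Bk in zip(Zs, B_nodes[i])) - Y_t[i]
        grads_i = [2.0 * np.outer(Z[i], resid_i) for Z in Zs]
        new.append([a - gamma * g for a, g in zip(avg, grads_i)])
    return new
```

With a sufficiently small step size relative to the smoothness constant of $J_t$, repeated centralized steps drive this convex quadratic loss down; the distributed recursion additionally drives the node copies toward consensus.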
Algorithm 1 in the supplementary material summarizes the above online learning procedure. One major drawback is that this online algorithm is centralized, violating the distributed nature of the WD-GNN. To overcome this issue, we introduce the following method. In decentralized problems, each node $n_i$ has access to a local loss $J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}_i, \mathcal{B}_i)\big)$ with local parameters $\mathcal{A}_i$ and $\mathcal{B}_i$. The goal is to coordinate the nodes to minimize the sum-cost $\sum_{i=1}^{N} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}_i, \mathcal{B}_i)\big)$ while keeping the local parameters equal to each other, i.e., $\mathcal{A}_i = \mathcal{A}$ and $\mathcal{B}_i = \mathcal{B}$ for all $i = 1, \ldots, N$. We can then recast problem (6) as a constrained optimization one

$$\min_{\{\mathcal{B}_i\}_{i=1}^{N}} \sum_{i=1}^{N} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_i)\big), \quad \text{s.t.} \quad \mathcal{B}_i = \mathcal{B}_j \;\; \forall\, i, j: n_j \in \mathcal{N}_i \qquad (8)$$

where $\mathcal{A}_i = \mathcal{A}^\dagger$ for all $i = 1, \ldots, N$ since the deep part is fixed. The constraint $\mathcal{B}_i = \mathcal{B}_j$ for all $i, j: n_j \in \mathcal{N}_i$ implies that $\mathcal{B}_i = \mathcal{B}$ for all $i = 1, \ldots, N$ under the assumed connectivity of the graph. To solve (8), at time $t$, each node $n_i$ updates its local parameters $\mathcal{B}_i$ by the recursion

$$\mathcal{B}_{i,t+1} = \frac{1}{|\mathcal{N}_i| + 1} \bigg( \sum_{j: n_j \in \mathcal{N}_i} \mathcal{B}_{j,t} + \mathcal{B}_{i,t} \bigg) - \gamma_t \nabla_{\mathcal{B}} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_{i,t})\big) \qquad (9)$$

with $|\mathcal{N}_i|$ the number of neighbors of node $n_i$. Put simply, each node $n_i$ descends along its local gradient $\nabla_{\mathcal{B}} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_{i,t})\big)$ while averaging over the one-hop neighborhood, thus approaching the optimal parameters of (8) and simultaneously driving the local parameters to consensus.

In a nutshell, the proposed online procedure is of low complexity due to the linearity of the wide part, and it converges efficiently due to convexity, as proved next. Furthermore, it can be implemented in a distributed manner requiring only neighborhood information.

5 Stability and convergence
In this section, we establish the stability of the WD-GNN to changes in the underlying graph $\mathbf{S}$; that is, we prove that the change in the WD-GNN output caused by a perturbation of $\mathbf{S}$ is bounded by the size of the perturbation. Then, we prove that the online learning procedure converges to the optimizer set of a time-varying problem, up to an error neighborhood that depends on the problem variations.

The WD-GNN consists of a GNN and a graph filter. Both of these components are permutation equivariant [6, 15], which means that their outputs are unaffected by node reorderings. Since the sum in (2) does not affect this property, WD-GNNs are permutation equivariant as well.

We then consider arbitrary changes to the graph matrix. Let $\mathbf{S}$ be the given graph and $\hat{\mathbf{S}}$ a perturbation of it. We measure the size of the perturbation in relative terms as follows. Define the relative error set

$$\mathcal{R} = \{\mathbf{E} \in \mathbb{R}^{N \times N}: \mathbf{P}^\mathsf{T} \hat{\mathbf{S}} \mathbf{P} = \mathbf{S} + \mathbf{E}\mathbf{S} + \mathbf{S}\mathbf{E}, \; \mathbf{E} = \mathbf{E}^\mathsf{T}, \; \mathbf{P} \in \mathcal{P}\} \qquad (10)$$

for $\mathcal{P}$ the permutation set $\mathcal{P} = \{\mathbf{P} \in \{0, 1\}^{N \times N}: \mathbf{P}\mathbf{1} = \mathbf{1}, \mathbf{P}^\mathsf{T}\mathbf{1} = \mathbf{1}\}$. The relative error set is the set of all symmetric matrices $\mathbf{E}$ such that, when multiplied with the support matrix $\mathbf{S}$ and added back, they yield a permutation of the perturbed support $\hat{\mathbf{S}}$. Then, we can define the relative perturbation measure between $\mathbf{S}$ and $\hat{\mathbf{S}}$ as

$$d(\mathbf{S}, \hat{\mathbf{S}}) = \min_{\mathbf{E} \in \mathcal{R}} \|\mathbf{E}\|. \qquad (11)$$

The relative perturbation measure computes how different the perturbation $\hat{\mathbf{S}}$ is in terms of the original support $\mathbf{S}$, irrespective of the specific ordering of the nodes (given the permutation equivariance property of the WD-GNN). We note that the relative perturbation model ties the size of the perturbation to the topology of the graph through the multiplication of $\mathbf{E}$ with $\mathbf{S}$, thus avoiding the failure to capture structural transformations that typically accompanies the choice of an absolute perturbation $\mathbf{P}^\mathsf{T} \hat{\mathbf{S}} \mathbf{P} = \mathbf{S} + \mathbf{E}$ [15]. The WD-GNN can be proved stable when built upon integral Lipschitz filters.

Definition 1 (Integral Lipschitz filters).
Let $B(\mathbf{X}; \mathbf{S}, \mathcal{B})$ be a graph filter (3). Denote by $b^{fg}_k = [\mathbf{B}_k]_{fg}$ and build the graph frequency response $b^{fg}(\lambda) = \sum_{k=0}^{K} b^{fg}_k \lambda^k$, satisfying $|b^{fg}(\lambda)| \leq 1$ [11]. If there exists a constant $C_L > 0$ such that, for all $\lambda_1, \lambda_2 \in \mathbb{R}$ and for all $f = 1, \ldots, F$ and $g = 1, \ldots, G$,

$$\big| b^{fg}(\lambda_2) - b^{fg}(\lambda_1) \big| \leq C_L \, \frac{|\lambda_2 - \lambda_1|}{|\lambda_1 + \lambda_2|/2}, \qquad (12)$$

we say that the filter $B(\mathbf{X}; \mathbf{S}, \mathcal{B})$ is integral Lipschitz.

Integral Lipschitz filters are those for which the integral of the graph frequency response is Lipschitz continuous. Integral Lipschitz filters satisfy $|\lambda (b^{fg}(\lambda))'| \leq C_L$ for all $\lambda$ and all $f, g$, where $(b^{fg}(\lambda))' = db^{fg}(\lambda)/d\lambda$. Such a condition is reminiscent of the scale invariance of wavelet transforms [26]. In any case, examples of integral Lipschitz filters include graph wavelets [27, 28], and the condition can also be enforced by means of penalties during training [29]. We can now establish the stability of the WD-GNN to relative perturbations. Without loss of generality, we assume a single input and a single output, i.e., $F = G = 1$, for the theoretical analysis.

Theorem 1.
Consider the underlying graph $\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\mathsf{T}$ and the perturbed graph $\hat{\mathbf{S}}$ with $N$ nodes. The relative perturbation $\mathbf{E} = \mathbf{U}\boldsymbol{\Theta}\mathbf{U}^\mathsf{T} \in \mathcal{R}$ satisfies $d(\mathbf{S}, \hat{\mathbf{S}}) \leq \|\mathbf{E}\| \leq \varepsilon$. Consider the WD-GNN (2) with integral Lipschitz filters with constant $C_L$. Consider also that the nonlinearity $\sigma(\cdot)$ is normalized Lipschitz, $|\sigma(x_2) - \sigma(x_1)| \leq |x_2 - x_1|$ for all $x_1, x_2 \in \mathbb{R}$, with $\sigma(0) = 0$. Then it holds that

$$\|\Psi(\mathbf{X}; \mathbf{S}, \mathcal{A}, \mathcal{B}) - \Psi(\mathbf{X}; \hat{\mathbf{S}}, \mathcal{A}, \mathcal{B})\| \leq C_L (1 + \delta\sqrt{N}) \Big( |\alpha_D|\, L \prod_{\ell=1}^{L-1} F_\ell + |\alpha_W| \Big) \|\mathbf{X}\| \varepsilon + O(\varepsilon^2) \qquad (13)$$

where $\delta = (\|\mathbf{U} - \mathbf{V}\| + 1)^2 - 1$ measures the eigenvector misalignment between $\mathbf{S}$ and $\mathbf{E}$.

Theorem 1 shows that the WD-GNN output is Lipschitz stable to relative graph perturbations, up to a stability constant. The constant comprises three terms, $C_L$, $\delta\sqrt{N}$ and $|\alpha_D| L \prod_{\ell=1}^{L-1} F_\ell + |\alpha_W|$, indicating the effects of the filter properties, the graph perturbation and the network architecture, respectively.
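The first-order behavior predicted by (13) can be probed numerically. The following sketch (our illustration, for a single-feature filter with arbitrary fixed taps; not part of the paper's experiments) draws a random symmetric $\mathbf{E}$ with $\|\mathbf{E}\| = \varepsilon$, forms $\hat{\mathbf{S}} = \mathbf{S} + \mathbf{E}\mathbf{S} + \mathbf{S}\mathbf{E}$, and measures the output gap:

```python
import numpy as np

def perturbation_gap(eps, N=20, seed=1):
    """Output gap of a graph filter under a relative perturbation
    S_hat = S + ES + SE with ||E|| = eps  [cf. (10)]."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((N, N)); S = (S + S.T) / 2
    S /= np.linalg.norm(S, 2)              # unit spectral norm
    X = rng.standard_normal((N, 1))
    E = rng.standard_normal((N, N)); E = (E + E.T) / 2
    E *= eps / np.linalg.norm(E, 2)        # symmetric, ||E|| = eps
    S_hat = S + E @ S + S @ E

    def graph_filter(S):                   # scalar taps, F = G = 1
        Z, Y = X, 0.0
        for b in [1.0, 0.5, 0.25]:
            Y = Y + b * Z
            Z = S @ Z
        return Y

    gap = np.linalg.norm(graph_filter(S) - graph_filter(S_hat))
    return gap, np.linalg.norm(X)
```

Shrinking $\varepsilon$ shrinks the gap roughly proportionally, consistent with the linear term of (13).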
Figure 1. Total velocity variation relative to the optimal controller for the WD-GNN and the GNN. (a) Comparison under different communication radii. (b) Comparison under different initial velocities. (c) Comparison under different numbers of agents.
Though the WD-GNN exhibits a certain stability to underlying graph changes, performance degradation is inevitable if the perturbation becomes severe. Furthermore, graph perturbations are only one example of time-varying scenarios, which may also involve changes in the loss function, the system observations, etc. Online learning mitigates these effects and adjusts the WD-GNN to changing problems. We show the efficiency of the proposed online learning procedure by analyzing its convergence properties. Before claiming the main result, we need the following standard assumptions.
Assumption 1.
Consider the time-varying loss function $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ with fixed parameters $\mathcal{A}$. Let $\mathcal{B}^*_t$ be an optimal solution of $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ at time $t$. There exists a sequence $\{\mathcal{B}^*_t\}_t$ and a constant $C_B$ such that, for all $t$, it holds that

$$\|\mathcal{B}^*_{t+1} - \mathcal{B}^*_t\| \leq C_B. \qquad (14)$$
Consider the time-varying loss function $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ with fixed parameters $\mathcal{A}$. When $\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})$ is a linear function of $\mathcal{B}$, $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ is differentiable, strongly smooth with constant $C_{t,s}$ and strongly convex with constant $C_{t,\ell}$.

Assumption 1 establishes the correlation between successive time-varying problems and bounds the time variations of the changing optimal solutions. Assumption 2 is commonly used in optimization theory and is satisfied in practice. Both assumptions are mild; based on them, we present the convergence result.
Theorem 2.
Consider the WD-GNN (2) optimized with the proposed online learning algorithm. Let $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ be the time-varying loss function satisfying Assumptions 1-2 with constants $C_B$, $C_{t,s}$ and $C_{t,\ell}$. Let also $\mathcal{B}^*_t$ be the optimal solution of $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ and $\gamma_t = \gamma$ be the step size of gradient descent. Then, the sequence $\{\mathcal{B}_t\}_t$ generated by the online learning algorithm in (7) satisfies

$$\|\mathcal{B}_t - \mathcal{B}^*_t\| \leq \bigg( \prod_{\tau=1}^{t} m_\tau \bigg) \|\mathcal{B}_0 - \mathcal{B}^*_0\| + \frac{1 - \hat{m}^t}{1 - \hat{m}}\, C_B \qquad (15)$$

where $m_t = \max\{|1 - \gamma C_{t,s}|, |1 - \gamma C_{t,\ell}|\}$ is the convergence rate and $\hat{m} = \max_{1 \leq \tau \leq t} m_\tau$.

Theorem 2 shows that the online learning of the WD-GNN converges to the optimal solution of the time-varying problem up to a limiting error neighborhood. The latter depends on the time variations of the optimization problem. If we particularize $C_B = 0$ with $m_t = m$ for all $t$, we recover the result for the time-invariant optimization problem, that is, the exact convergence of gradient descent.

6 Experiments

The goal of the experiment is to learn a decentralized controller that coordinates robots, initially flying at random velocities, to move together at the same velocity while avoiding collisions. Here, we focus on presenting the main results. The implementation details, as well as another experiment involving movie recommendation systems, are shown in the supplementary material.

Consider a network of $N$ robots initially moving at random velocities sampled in the interval $[-v, v]$. At time $t$, each robot $n_i$ is described by its position $\mathbf{p}_{i,t}$, velocity $\mathbf{v}_{i,t}$ and acceleration $\mathbf{u}_{i,t}$, the latter being the controllable variable. This problem has an optimal centralized controller $\mathbf{u}^*_{i,t}$ that can be readily computed [10]. Such a solution requires knowledge of the positions and velocities of all robots and thus demands a centralized computation unit. In the decentralized setting, we assume robot $n_i$ can only communicate with robot $n_j$ if they are within the communication radius $r$, i.e., $\|\mathbf{p}_{i,t} - \mathbf{p}_{j,t}\| \leq r$. We establish the communication graph $\mathcal{G}_t$ with the node set $\mathcal{V}$ for the robots and the edge set $\mathcal{E}_t$ for the available communication links, and the graph matrix $\mathbf{S}_t$ is the associated adjacency matrix.

Table 1: Average (std. deviation) of total and final velocity variation, comparing the optimal controller, the WD-GNN, the WD-GNN with centralized online learning, the WD-GNN with decentralized online learning, the GNN, and the graph filter.

We use imitation learning [30] to train the WD-GNN as a decentralized controller on the accelerations $\mathbf{U}_t = [\mathbf{u}_{1,t}, \ldots, \mathbf{u}_{N,t}]^\mathsf{T} = \Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})$, where $\mathbf{X}_t$ is the graph signal that collects the positions and velocities of neighbors at each robot [10]. At testing time, we measure the variation in velocity among the robots over the whole trajectory and also at the final time instant [10]. In a way, the total velocity variation reflects how long it takes for the team to become coordinated, while the final velocity variation tells how well the task was finally accomplished.

We first compare the optimal controller, the GNN and the graph filter with the WD-GNN without retraining, in the initial condition $r = 2$ m, $v = 3$ m/s and $N = 50$ (Table 1). We see that the WD-GNN exhibits the best performance in both performance measures. We attribute this behavior to the increased representation power of the WD-GNN as a combined architecture. The GNN takes second place, while the graph filter performs much worse than the other two architectures. This is because the optimal distributed controller is known to be nonlinear [31].

To assess the adaptability of the proposed model to different initial conditions, in comparison with the GNN, we run simulations for changing communication radius (Fig.
1a), changing initial velocities $v$ (Fig. 1b) and changing numbers of agents $N$ (Fig. 1c). These experiments show an improved robustness of the WD-GNN. We display the results as the change in total velocity variation relative to the optimal controller. Fig. 1a shows that the performance improves as the communication radius $r$ increases, which is expected since the robots then have access to information from farther away. In Fig. 1b, we observe that the flocking problem becomes harder as the initial velocity $v$ increases, since it is reasonably harder to control robots that move very fast in random directions. As for the number of agents $N$, the relative total velocity variation decreases as $N$ increases, which is explained by the fact that the optimization problem becomes easier to solve with the increased node exchanges in larger graphs.

Finally, we test the improvement of the WD-GNN with online learning. We consider the centralized online optimization algorithm as the baseline, with the velocity variation as the instantaneous loss function in (6). We then test the proposed distributed optimization algorithm (9), where the instantaneous loss function is the velocity variance of neighboring robots. Results are shown in Table 1. We see that the total and final velocity variations are reduced for both the centralized and decentralized online algorithms, indicating that the WD-GNN successfully adapts to the changing communication network. The improvement in final velocity variation is more noticeable, since the effects of single time-updates get compounded.

7 Conclusions

We propose the Wide and Deep Graph Neural Network architecture and consider a distributed online learning scenario. The proposed architecture consists of a bank of graph filters (wide part) and a GNN (deep part), leading to a local and distributed architecture that we proved is stable to small changes in the underlying graph support.
To address more general time-varying scenarios without compromising the distributed nature of the architecture, we proposed a distributed online optimization algorithm. By fixing the deep part and retraining only the wide part online, we obtain a convex optimization problem, and thus prove convergence guarantees for online learning algorithms. Numerical experiments on learning decentralized controllers for flocking a robot swarm show the success of WD-GNNs in adapting to time-varying scenarios. Future research involves the online retraining of the deep part, as well as obtaining convergence guarantees for the distributed online learning algorithm.
Broad Impacts
Graph neural networks are nonlinear representation maps that are computationally inexpensive and descriptive enough to capture a wide range of behaviors. They are also local and distributed, which makes them ideal for deployment over physical networks. In the particular case of dynamic multi-agent systems, the issue arises that the graph support changes with time, requiring GNNs to adapt to unseen scenarios in an online manner. In this paper, we have developed one such model that can learn online and, most importantly, do so without violating the decentralized nature of the architecture. The current limitations of the proposed method are that only the linear part is adapted online, and that there are still no convergence guarantees for the distributed online algorithm (only for the centralized one). In any case, distributed online learning is a key aspect of many real-world applications, such as search and rescue, map exploration, and path planning, of which flocking is a proof-of-concept that can lay the groundwork for more involved scenarios.
Acknowledgments
Supported by NSF CCF 1717120, ARO W911NF1710438, ARL DCIST CRA W911NF-17-2-0181, ISTC-WAS, and Intel DevCloud.
References

[1] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in Int. Conf. Learning Representations, Banff, AB, 14-16 Apr. 2014, pp. 1–14.
[2] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Conf. Neural Inform. Process. Syst., Barcelona, Spain, 5-10 Dec. 2016, pp. 3844–3858.
[3] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, Feb. 2019.
[4] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in Int. Conf. Machine Learning, vol. 97, Long Beach, CA, 9-15 June 2019, pp. 6861–6871.
[5] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Int. Conf. Learning Representations, Toulon, France, 24-26 Apr. 2017, pp. 1–14.
[6] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in Int. Conf. Learning Representations, New Orleans, LA, 6-9 May 2019, pp. 1–17.
[7] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in ACM Int. Conf. Knowledge Discovery and Data Mining, 2018.
[8] Z. Gao, E. Isufi, and A. Ribeiro, “Stochastic graph neural networks,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, 4-8 May 2020.
[9] D. Owerko, F. Gama, and A. Ribeiro, “Optimal power flow using graph neural networks,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, 4-8 May 2020.
[10] E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, “Learning decentralized controllers for robot swarms with graph neural networks,” in Conf. Robot Learning, Osaka, Japan, 30 Oct.-1 Nov. 2019.
[11] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges and applications,” Proc. IEEE, vol. 106, no. 5, pp. 808–828, May 2018.
[12] G. Frahling, P. Indyk, and C. Sohler, “Sampling in dynamic data streams and applications,” Int. J. Computational Geometry & Applications, vol. 18, no. 1-2, pp. 3–28, 2008.
[13] M. K. Helwa and A. P. Schoellig, “Multi-robot transfer learning: A dynamical system perspective,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2017, pp. 4702–4708.
[14] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Appl. Comput. Harmonic Anal., 13 June 2019, in press. [Online]. Available: http://doi.org/10.1016/j.acha.2019.06.003
[15] F. Gama, J. Bruna, and A. Ribeiro, “Stability properties of graph neural networks,” arXiv:1905.04497v3 [cs.LG], 22 Apr. 2020. [Online]. Available: http://arxiv.org/abs/1905.04497
[16] N. Dabbagh and B. Bannan-Ritland, Online Learning: Concepts, Strategies, and Application. Upper Saddle River, NJ: Pearson/Merrill/Prentice Hall, 2005.
[17] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
[18] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Int. Conf. Learning Representations, Vancouver, BC, 30 Apr.-3 May 2018, pp. 1–12.
[19] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “CayleyNets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Trans. Signal Process., vol. 67, no. 1, pp. 97–109, Jan. 2019.
[20] E. Isufi, F. Gama, and A. Ribeiro, “EdgeNets: Edge varying graph neural networks,” arXiv:2001.07620v2 [cs.LG], 12 Mar. 2020. [Online]. Available: http://arxiv.org/abs/2001.07620
[21] J. Sanz, R. Perera, and C. Huerta, “Gear dynamics monitoring using discrete wavelet transformation and multi-layer perceptron neural networks,” Applied Soft Computing, vol. 12, no. 9, pp. 2867–2878, 2012.
[22] Y. Li, N. Sundararajan, P. Saratchandran, and Z. Wang, “Robust neuro-H∞ controller design for aircraft auto-landing,” IEEE Trans. Aerospace and Electronic Systems, vol. 40, no. 1, pp. 158–167, 2004.
[23] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” in Int. Conf. Machine Learning, 2015.
[24] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks,” in IEEE Conf. Computer Vision and Pattern Recognition, 2016.
[25] K. J. Ho, C. S. Leung, and J. Sum, “Convergence and objective functions of some fault/noise-injection-based online learning algorithms for RBF networks,” IEEE Trans. Neural Networks, vol. 21, no. 6, pp. 938–947, 2010.
[26] I. Daubechies, Ten Lectures on Wavelets, ser. CBMS-NSF Regional Conf. Series Appl. Math. Philadelphia, PA: SIAM, 1992, vol. 61.
[27] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Appl. Comput. Harmonic Anal., vol. 30, no. 2, pp. 129–150, Mar. 2011.
[28] D. I. Shuman, C. Wiesmeyr, N. Holighaus, and P. Vandergheynst, “Spectrum-adapted tight graph wavelet and vertex-frequency frames,” IEEE Trans. Signal Process., vol. 63, no. 16, pp. 4223–4235, Aug. 2015.
[29] F. Gama, J. Bruna, and A. Ribeiro, “Stability of graph neural networks to relative perturbations,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, 4-8 May 2020.
[30] S. Ross and J. A. Bagnell, “Efficient reductions for imitation learning,” in Int. Conf. Artificial Intelligence and Statistics, Sardinia, Italy, 13-15 May 2010, pp. 661–668.
[31] H. S. Witsenhausen, “A counterexample in stochastic optimum control,” SIAM J. Control, vol. 6, no. 1, pp. 131–147, 1968.
[32] A. Simonetto, “Time-varying convex optimization via time-varying averaged operators,” arXiv:1704.07338v3 [math.OC], 24 Apr. 2017. [Online]. Available: https://arxiv.org/abs/1704.07338
[33] W. Huang, A. G. Marques, and A. Ribeiro, “Rating prediction via graph signal processing,” IEEE Trans. Signal Process., vol. 66, no. 19, pp. 5066–5081, 2018.
[34] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, 2016.

Supplementary Materials for ‘Wide and Deep Graph Neural Networks with Distributed Online Learning’
A Proofs
A.1 Proof of Theorem 1
We need the following lemma in the proof.
Lemma 1.
Consider the underlying graph S = VΛV^⊤ and the perturbed graph Ŝ, both with N nodes. The relative perturbation E = UΘU^⊤ ∈ R^{N×N} satisfies d(S, Ŝ) ≤ ‖E‖ ≤ ε. Consider an integral Lipschitz filter [cf. Definition 1 in full paper] with F = 1 input feature, G = 1 output feature and integral Lipschitz constant C_L. Then, the output difference between the filters B(X; S, B) and B(X; Ŝ, B) satisfies

  ‖B(X; S, B) − B(X; Ŝ, B)‖ ≤ 2 C_L (1 + δ√N) ‖X‖ ε + O(ε²)    (16)

where δ = (‖U − V‖ + 1)² − 1 measures the eigenvector misalignment between S and E.

Proof. With F = 1 input feature X ∈ R^{N×1} and G = 1 output feature, B(X; S, B), B(X; Ŝ, B) ∈ R^{N×1}, the output difference can be represented as

  ‖B(X; S, B) − B(X; Ŝ, B)‖ = ‖ Σ_{k=0}^K b_k S^k X − Σ_{k=0}^K b_k Ŝ^k X ‖    (17)

with filter parameters B = {b_0, …, b_K}. We then refer to Theorem 3 in [15] to complete the proof. ∎

Proof of Theorem 1.
The output difference between Ψ(X; S, A, B) and Ψ(X; Ŝ, A, B) can be divided into the wide part difference and the deep part difference

  ‖Ψ(X; S, A, B) − Ψ(X; Ŝ, A, B)‖ ≤ |α_D| ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ + |α_W| ‖B(X; S, B) − B(X; Ŝ, B)‖    (18)

where the triangle inequality is used. Let us consider these two terms separately.

The wide term.
By using Lemma 1, we bound the wide part difference as

  ‖B(X; S, B) − B(X; Ŝ, B)‖ ≤ 2 C_L (1 + δ√N) ‖X‖ ε + O(ε²).    (19)

The deep term.
At layer ℓ, the graph convolution can be considered as the application of F_ℓ F_{ℓ−1} filters, i.e.,

  [ Σ_{k=0}^K S^k X_{ℓ−1} A_{ℓk} ]_f = Σ_{g=1}^{F_{ℓ−1}} Σ_{k=0}^K a^{fg}_{ℓk} S^k [X_{ℓ−1}]_g    (20)

for f = 1, …, F_ℓ, where a^{fg}_{ℓk} = [A_{ℓk}]_{fg} is the (f, g)th entry of the matrix A_{ℓk} and [·]_f represents the fth column. We denote G^{fg}_ℓ(S)[X_{ℓ−1}]_g = Σ_{k=0}^K a^{fg}_{ℓk} S^k [X_{ℓ−1}]_g and x^g_{ℓ−1} = [X_{ℓ−1}]_g for convenience in the following derivation. By using (20) and substituting the GNN architecture [(4) in full paper] into the deep part difference, we get

  ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ = ‖ σ( Σ_{f=1}^{F_{L−1}} G^f_L(S) x^f_{L−1} ) − σ( Σ_{f=1}^{F_{L−1}} G^f_L(Ŝ) x̂^f_{L−1} ) ‖
    ≤ Σ_{f=1}^{F_{L−1}} ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x̂^f_{L−1} ‖    (21)

where the inequality follows because the nonlinearity σ is normalized Lipschitz, together with the triangle inequality. By adding and subtracting G^f_L(Ŝ) x^f_{L−1} inside the norm, we have

  ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x̂^f_{L−1} ‖ ≤ ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ + ‖ G^f_L(Ŝ) x^f_{L−1} − G^f_L(Ŝ) x̂^f_{L−1} ‖
    ≤ ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ + ‖ G^f_L(Ŝ) ‖ ‖ x^f_{L−1} − x̂^f_{L−1} ‖.    (22)

For the first term in (22), by using Lemma 1, we get

  ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ ≤ 2 C_L (1 + δ√N) ‖x^f_{L−1}‖ ε + O(ε²).
(23)

In terms of the term ‖x^f_{L−1}‖, we observe that

  ‖x^f_{L−1}‖ = ‖ σ( Σ_{g=1}^{F_{L−2}} G^{fg}_{L−1}(S) x^g_{L−2} ) ‖ ≤ Σ_{g=1}^{F_{L−2}} ‖ G^{fg}_{L−1}(S) x^g_{L−2} ‖ ≤ Σ_{g=1}^{F_{L−2}} ‖ x^g_{L−2} ‖    (24)

where we use the triangle inequality, followed by the bound on the filters [11], i.e., the filter frequency response satisfies |a^{fg}_{L−1}(λ)| = |Σ_{k=0}^K a^{fg}_{(L−1)k} λ^k| ≤ 1. Following this recursion, we obtain

  ‖x^f_{L−1}‖ ≤ ∏_{ℓ=1}^{L−1} F_ℓ ‖x_0‖    (25)

where ‖x_0‖ = ‖X‖ by definition, since the number of input features is F_0 = 1. By substituting (25) into (23), we have

  ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ ≤ 2 C_L (1 + δ√N) ∏_{ℓ=1}^{L−1} F_ℓ ‖X‖ ε + O(ε²).    (26)

For the second term in (22), by again using the filter bound [11], we get

  ‖ G^f_L(Ŝ) ‖ ‖ x^f_{L−1} − x̂^f_{L−1} ‖ ≤ ‖ x^f_{L−1} − x̂^f_{L−1} ‖.    (27)

By substituting (26) and (27) into (22), and the latter into (21), we have

  ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ ≤ Σ_{f=1}^{F_{L−1}} ‖ x^f_{L−1} − x̂^f_{L−1} ‖ + 2 C_L (1 + δ√N) ∏_{ℓ=1}^{L−1} F_ℓ ‖X‖ ε + O(ε²).
(28)

From (28), we observe that the output difference at the Lth layer depends on that at the (L−1)th layer. Repeating this recursion until the input layer, we have

  ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ ≤ 2 C_L (1 + δ√N) L ∏_{ℓ=1}^{L−1} F_ℓ ‖X‖ ε + O(ε²)    (29)

where we use the initial condition ‖x_0 − x̂_0‖ = ‖X − X‖ = 0. Finally, by substituting (19) and (29) into (18), we complete the proof:

  ‖Ψ(X; S, A, B) − Ψ(X; Ŝ, A, B)‖ ≤ 2 C_L (1 + δ√N) ( |α_W| + |α_D| L ∏_{ℓ=1}^{L−1} F_ℓ ) ‖X‖ ε + O(ε²).    (30)

A.2 Proof of Theorem 2
Proof.
Let A† and B† be the parameters learned in the offline phase. The proposed online learning fixes the deep part, i.e., it freezes the parameters A = A†, and retrains the wide part online. The model Ψ(X; S, A, B) can then be represented as a function of B, with A = A† held constant:

  Ψ(X; S, A†, B) = α_D Φ(X; S, A†) + α_W B(X; S, B) + β = Ψ̂(X; S, B).    (31)

Given the graph signal X and the graph matrix S, Ψ̂(X; S, B) is a linear function of B, since both the graph filter B(X; S, B) and the combination of the two components are linear.

At testing time t, the sampled optimization problem [(6) in full paper] translates to

  min_B J_t( Ψ̂(X_t; S_t, B) )    (32)

where J_t(·), X_t and S_t are the instantaneous loss function, observed signal and graph matrix at time t. Since Ψ̂(X_t; S_t, B) is a linear function of B, the loss J_t(Ψ̂(X_t; S_t, B)) is differentiable, strongly smooth with constant C_{t,s} and strongly convex with constant C_{t,ℓ} by Assumption 2. We then refer to Corollary 7.1 in [32] to complete the proof. ∎

B Implementation details for robot swarm flocking
We consider a network of N robots initially moving at random velocities. At time t, each robot n_i is described by its position p_{i,t} ∈ R² and velocity v_{i,t} ∈ R², and controls its acceleration u_{i,t} ∈ R² for the next state

  p_{i,t+1} = p_{i,t} + v_{i,t} T_s + (1/2) u_{i,t} T_s²,   v_{i,t+1} = v_{i,t} + u_{i,t} T_s    (33)

where T_s is the sampling time and u_{i,t} is held constant during the sampling interval [T_s t, T_s(t+1)]. Our goal is to control the accelerations U_t = [u_{1,t}, …, u_{N,t}]^⊤ ∈ R^{N×2} so that the robots move at the same velocity without colliding. There is an optimal solution for the accelerations [31]

  u*_{i,t} = − Σ_{j=1}^N (v_{i,t} − v_{j,t}) − Σ_{j=1}^N ∇_{p_{i,t}} V(p_{i,t}, p_{j,t}),   for all i = 1, …, N    (34)

with the collision avoidance potential

  V(p_{i,t}, p_{j,t}) = 1/‖p_{i,t} − p_{j,t}‖² − log(‖p_{i,t} − p_{j,t}‖²)   if ‖p_{i,t} − p_{j,t}‖ ≤ ρ,
  V(p_{i,t}, p_{j,t}) = 1/ρ² − log(ρ²)   otherwise.    (35)

The computation of u*_{i,t} requires the instantaneous positions and velocities of all robots over the network. As such, it is a centralized controller that cannot be implemented in practice, where each robot only has access to local neighborhood information.

In the decentralized setting, robot n_i can communicate with robot n_j if and only if they are within the communication radius r, i.e., there is a communication link (n_i, n_j) if ‖p_{i,t} − p_{j,t}‖ ≤ r. We establish the communication graph G_t with the node set V = {n_1, …, n_N} and the edge set E_t containing the available links. The graph matrix S_t is the adjacency matrix with entries [S_t]_{ij} = 1 if (n_i, n_j) ∈ E_t and [S_t]_{ij} = 0 otherwise.
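As a concrete reference for the centralized controller (34)–(35), a minimal sketch is given below, assuming the 2-D setting above; the function names are ours and the code is illustrative rather than the implementation used in the experiments:

```python
import numpy as np

def collision_potential_grad(pi, pj, rho=1.0):
    """Gradient with respect to p_i of the potential (35),
    V = 1/||p_i - p_j||^2 - log(||p_i - p_j||^2),
    which is constant (zero gradient) beyond the radius rho."""
    d = pi - pj
    r2 = float(d @ d)
    if r2 > rho ** 2:
        return np.zeros_like(d)
    # d/dp_i [r2^{-1} - log(r2)] = -2 d / r2^2 - 2 d / r2
    return -2.0 * d / r2 ** 2 - 2.0 * d / r2

def optimal_controller(P, V, rho=1.0):
    """Centralized flocking controller u*_{i,t} of (34). It needs the positions
    P and velocities V (N x 2 arrays) of all robots, which is exactly why it
    cannot be implemented in a distributed fashion."""
    U = np.zeros_like(V)
    for i in range(P.shape[0]):
        for j in range(P.shape[0]):
            if j != i:
                U[i] -= (V[i] - V[j]) + collision_potential_grad(P[i], P[j], rho)
    return U
```

Note that when all robots already share a common velocity and are pairwise farther apart than ρ, both sums in (34) vanish and the controller returns zero accelerations, which is the flocking equilibrium.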
Additionally, we assume robot communications occur within the sampling time interval, so that the robot action clock and the communication clock coincide.

We use the WD-GNN to learn a decentralized controller U_t = Ψ(X_t; S_t, A, B), where the graph matrix S_t is the adjacency matrix of the communication graph G_t and the graph signal X_t = [x_{1,t}, …, x_{N,t}]^⊤ ∈ R^{N×6} is given by

  x_{i,t} = [ Σ_{j: n_j∈N_{i,t}} (v_{i,t} − v_{j,t}),  Σ_{j: n_j∈N_{i,t}} (p_{i,t} − p_{j,t})/‖p_{i,t} − p_{j,t}‖⁴,  Σ_{j: n_j∈N_{i,t}} (p_{i,t} − p_{j,t})/‖p_{i,t} − p_{j,t}‖² ],  for all i = 1, …, N,    (36)

which collects position and velocity information of neighboring robots. The graph filters are adapted to the delayed information structure as

  B(X_t; S_t, B) = Σ_{k=0}^K S_t S_{t−1} ⋯ S_{t−k+1} X_{t−k} B_k.    (37)

We leverage imitation learning to parametrize the optimal controller (34) with the WD-GNN.

Dataset.
The dataset contains trajectories for training, validation and testing. In each trajectory, N = 50 robots are distributed randomly in a circle. A minimal initial distance between robots is enforced, and initial velocities are sampled uniformly at random from [−v, +v] with v = 3 m/s by default. The duration of each trajectory is T = 2 s with sampling time T_s; the acceleration is saturated at a maximal value and the communication radius is r = 2 m.

Parametrizations.
For the WD-GNN, we consider the wide component to be a graph filter and the deep component to be a single-layer GNN, where both have G = 32 output features. All filters are of order K = 3 and the nonlinearity is the hyperbolic tangent. The output features are fed into a local readout layer at each node to generate the two-dimensional acceleration u_{i,t}. We train the WD-GNN with the ADAM optimizer (decaying factors β₁, β₂ and learning rate γ) over mini-batches of trajectories, and average experiment results over several dataset realizations.

Performance measure.
The flocking condition can be quantified by the variance of the robot velocities over the network, referred to as the velocity variation. At testing time, we measure the performance of the learned controller from two aspects, the total velocity variation over the whole trajectory and the final velocity variation:

  J = (1/N) Σ_{t=1}^D Σ_{i=1}^N ‖ v_{i,t} − (1/N) Σ_{i=1}^N v_{i,t} ‖²,   J(D) = (1/N) Σ_{i=1}^N ‖ v_{i,D} − (1/N) Σ_{i=1}^N v_{i,D} ‖²    (38)

where D = T/T_s is the total number of time instants. The former reflects the whole control process, decreasing if the robots approach consensus more quickly, while the latter indicates the final flocking condition.

Main results are shown in the full paper. To further help understand and visualize this experiment, we show video snapshots of the robot swarm flocking process with the learned WD-GNN controller in Fig. 2. The robots move at random velocities initially (Fig. 2a), tend to move together (Fig. 2b), and are well coordinated at the end (Fig. 2c). The cost shown in the figure is the instantaneous velocity variation of the robots over the network.

Figure 2. Video snapshots of the robot swarm flocking process with the learned WD-GNN controller: (a) initial time, (b) intermediate time, (c) final time.

C Experiment on movie recommendation systems
We consider another experiment on movie recommendation systems to corroborate our model. The goal is to predict the rating a user would give to a specific movie [33]. We build the underlying graph as a movie similarity network, where nodes are movies and edge weights are the similarity strength between movies. The graph signal contains the ratings of movies given by a user, with missing values for the movies that are not rated. We train the WD-GNN to predict the rating of a movie of our choice, based on the ratings given to other movies.
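As a sketch of how such a support can be built (the Pearson-correlation construction is detailed in Section C.1 and follows [33]; the function name and the zeros-for-missing simplification are ours):

```python
import numpy as np

def movie_similarity_graph(R, k=10):
    """Build a movie-similarity support from a ratings matrix R (users x movies,
    with zeros for missing ratings): Pearson correlation between the movies'
    rating vectors, keeping only the k strongest edges per node."""
    M = R.shape[1]
    C = np.corrcoef(R, rowvar=False)        # movies x movies correlation matrix
    np.fill_diagonal(C, 0.0)                # no self-loops
    S = np.zeros((M, M))
    for m in range(M):
        top = np.argsort(C[m])[-k:]         # indices of the k most similar movies
        S[m, top] = C[m, top]
    return np.maximum(S, S.T)               # symmetrize the support
```

The returned matrix serves as the graph matrix S on which the graph filters and the GNN operate; a user's rating vector (with the target movie zeroed out) is the input graph signal.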
C.1 Implementation details
We use a subset of the MovieLens 100k dataset, keeping the users and movies with the largest numbers of ratings [34]. We compute the movie similarity as the Pearson correlation between rating vectors and keep the ten edges with the highest similarity for each node (movie) [33]. Each user is a graph signal, where the signal value on each node is the rating given by the user to the associated movie, with a zero value if that movie is not rated. The dataset is split into training and testing sets. The rating of the movie of our choice is extracted as a label and zeroed out in the graph signal.

We consider a WD-GNN comprising a graph filter and a single-layer GNN. Both components have G = 64 output features. All filters are of order K = 5 and the nonlinearity is the ReLU. A local readout layer follows to map the output features to a single scalar predicted rating. We train the WD-GNN with the ADAM optimizer (decaying factors β₁, β₂ and learning rate γ). The performance is measured with the root mean squared error (RMSE), and the results are averaged over random dataset splits.

C.2 Results
We first consider the WD-GNN without online retraining and compare it with the GNN and the graph filter (Table 2). We predict ratings for the movie Star Wars, which has the largest number of ratings.

Table 2: Average (std. deviation) of the root mean squared error for the WD-GNN, the online WD-GNN with increasing numbers of online updates, the GNN, and the graph filter, when training and testing on the same movie and when training on one movie and testing on another.

We see that the three architectures exhibit similar performance, with the WD-GNN performing best with the lowest RMSE. We again attribute this behavior to the increased learning ability of the WD-GNN obtained from its combined architecture. The second experiment considers transferability, where we train the architectures on one movie (Star Wars) and use the learned models to predict ratings for another movie (Contact). In this case, the problem scenario changes, which creates a mismatch between training and testing, and all architectures suffer performance degradations. The GNN is slightly better, followed closely by the WD-GNN and the graph filter.

We then run the WD-GNN with online learning. At testing time, we consider that the system gets feedback from the user after predicting the rating, which is used as the instantaneous label for online learning. While centralized online learning is available for recommendation systems, we keep in mind that the proposed WD-GNN can also be retrained online in a distributed manner. We consider an online procedure with multiple testing users where, for each user, the system updates the parameters based on the instantaneous signal and feedback label. Results are shown in Table 2. We observe significant performance improvements when training and testing on different movies. In this case, the testing scenario differs from the one used for training, so online learning adapts the WD-GNN to the new scenario and improves transferability. We remark that these improvements become more pronounced as the testing phase goes on. On the other hand, when training and testing on the same movie, online learning has little effect (a slight improvement) on the performance. This is because the problem scenario does not change much and the offline phase has already trained the WD-GNN well.

D Online Wide and Deep GNN evaluation
Fig. 3 details the graph filter (graph convolution) [(3) in full paper] of order K = 3 with F = 1 input feature and G = 1 output feature,

  B(X; S, B) = Σ_{k=0}^K b_k S^k X    (39)

with filter parameters B = {b_0, …, b_K}. In particular, the linear operator SX, also referred to as the graph shift operator, leverages the graph structure to process the graph signal: it assigns to each node the aggregated signal from its immediate neighbors and thus collects graph neighborhood information. Shifting X k times aggregates information from the k-hop neighborhood, yielding the k-shifted signal S^k X. With a set of parameters [b_0, …, b_K]^⊤ ∈ R^{K+1}, the graph filter generates a higher-level feature that accounts for shifted signals up to a neighborhood of radius K, and thus reflects a more complete picture of the network. As a shift-and-sum operation of the graph signal X over the graph structure S, the graph filter can be regarded as a convolution in the graph domain. If we further particularize S to a line graph and x to a signal sampled at time instants, the graph filter (39) reduces to the conventional convolution.

Algorithm 1 summarizes the proposed online learning algorithm for the WD-GNN.
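The reduction of (39) to the conventional convolution can be checked numerically: on a directed cycle (the periodic version of the line graph), the shift S delays the signal by one sample, so (39) coincides with a circular convolution. A minimal sketch, with arbitrary illustrative taps:

```python
import numpy as np

N, K = 8, 3
S = np.roll(np.eye(N), 1, axis=0)        # directed cycle: (S x)[n] = x[n-1 mod N]
b = np.array([0.5, 0.25, 0.15, 0.1])     # illustrative taps b_0, ..., b_K
x = np.arange(N, dtype=float)            # test signal

# Graph filter (39): shift-and-sum of x over the cycle graph.
graph_out = sum(bk * np.linalg.matrix_power(S, k) @ x for k, bk in enumerate(b))

# Ordinary circular convolution of x with the taps b.
conv_out = np.array([sum(b[k] * x[(n - k) % N] for k in range(K + 1))
                     for n in range(N)])

assert np.allclose(graph_out, conv_out)  # (39) matches the conventional convolution
```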
Figure 3. Graph filters perform successive local node exchanges with neighbors: the k-shifted signal S^k X collects information from the k-hop neighborhood (shown by increasing disks), and the shifted signals X, SX, …, S^K X are aggregated with the parameters [b_0, …, b_K]^⊤ to generate a higher-level feature that accounts for the graph structure up to a neighborhood of radius K.

Algorithm 1
Online Learning Algorithm for the WD-GNN

Input: parameters A†, B† learned offline by minimizing the ERM problem [(5) in full paper] over the training dataset, and online step sizes γ_t
Fix the deep part parameters A = A† and set the initial wide part parameters B_0 = B†
for t = 0, 1, … do
  Observe the instantaneous graph signal X_t, graph matrix S_t and loss function J_t(·)
  Compute the instantaneous loss J_t(Ψ(X_t; S_t, A†, B_t))
  if a decentralized implementation is required then
    Update the wide part parameters in a distributed manner:
    for i = 1, …, N do
      B_{i,t+1} = (1/(|N_i| + 1)) ( Σ_{j: n_j∈N_i} B_{j,t} + B_{i,t} ) − γ_t ∇_B J_{i,t}(Ψ(X_t; S_t, A†, B_{i,t}))
    end for
  else
    Update the wide part parameters in a centralized manner:
    B_{t+1} = B_t − γ_t ∇_B J_t(Ψ(X_t; S_t, A†, B_t))
  end if
end for
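The distributed branch of Algorithm 1 amounts to one neighborhood averaging step followed by a local gradient step. A minimal sketch of that update (the function name and data layout are ours; the local gradients are taken as given):

```python
import numpy as np

def distributed_wide_update(B_local, neighbors, grads, step):
    """One distributed wide-part update of Algorithm 1:
    B_{i,t+1} = (sum_{j in N_i} B_{j,t} + B_{i,t}) / (|N_i| + 1)
                - step * grad_B J_{i,t}.
    B_local: per-robot parameter arrays B_{i,t}; neighbors: neighbor index
    lists N_i; grads: per-robot local gradients of the instantaneous loss."""
    B_next = []
    for i, Bi in enumerate(B_local):
        avg = (sum(B_local[j] for j in neighbors[i]) + Bi) / (len(neighbors[i]) + 1)
        B_next.append(avg - step * grads[i])
    return B_next
```

With zero gradients, the update is pure consensus averaging: on a connected communication graph, repeated application drives the local copies B_{i,t} toward a common value, which is what keeps the online retraining consistent across robots.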