Wide and Deep Graph Neural Networks with Distributed Online Learning
Zhan Gao, Fernando Gama, and Alejandro Ribeiro
Department of Electrical and Systems Engineering
University of Pennsylvania
Philadelphia, PA 19104
{gaozhan, fgama, aribeiro}@seas.upenn.edu
Abstract
Graph neural networks (GNNs) learn representations from network data with naturally distributed architectures. This renders them well-suited candidates for decentralized learning, since their operations respect the structure imposed by the underlying graph. Oftentimes, this graph support changes with time, whether due to link failures or topology changes caused by mobile components. Modifications to the underlying structure create a mismatch between the graphs on which GNNs were trained and the ones on which they are tested. Online learning can be used to retrain GNNs at test time, overcoming this issue. However, most online learning algorithms are centralized and work on convex objective functions (which GNNs rarely lead to). This paper puts forth the Wide and Deep GNN (WD-GNN), a novel architecture that can be easily updated with distributed online learning mechanisms. The WD-GNN consists of two components: the wide part is a bank of linear graph filters and the deep part is a convolutional GNN. At training time, the joint architecture learns a relevant nonlinear representation from data. At test time, the deep part is left unchanged, while the wide part is retrained online. Since the wide part is linear, the problem becomes convex, and online optimization algorithms can be used. Furthermore, to exploit the distributed nature of the architecture, we propose a distributed online optimization algorithm that updates the wide part at test time without violating its decentralized nature. We also analyze the stability of the WD-GNN to changes in the underlying topology and derive convergence guarantees for the online retraining procedure. These results indicate the transferability, scalability, and efficiency of the WD-GNN in adapting online to new testing scenarios in a distributed manner. Experiments on the control of a robot swarm for flocking corroborate the theory and show the potential of the proposed architecture for distributed online learning.
1 Introduction

Graph neural networks (GNNs) [1–6] are nonlinear representation maps that have been shown to perform successfully on graph data in a wide array of tasks involving citation networks [5], recommendation systems [7], source localization [8], power grids [9] and robot swarms [10]. GNNs consist of a cascade of layers, each of which applies a graph convolution (a graph filter) [11], followed by a pointwise nonlinearity [1–6]. One of the key aspects of GNNs is that they are local and distributed. They are local since they require information only from neighboring nodes, and distributed since each node can compute its own output without need for a centralized unit. Oftentimes, however, problems of interest exhibit (slight) changes in data structure between the training and testing sets, or involve dynamic systems [12, 13]. For example, in the case of the robot swarm, the
Preprint. Under review.

graph is determined by the communication network between robots which is, in turn, determined by their physical proximity. Thus, if robots move, the communication links will change, and the graph support will change as well. Therefore, we oftentimes need to adapt to (slightly) new data structures. GNNs have been shown to be resilient to such changes, as proven by the properties of permutation equivariance and stability [14, 15]. While these properties guarantee transference, we can further improve performance by leveraging online learning approaches.

Online learning is a well-established paradigm that tracks the optimizer of time-varying optimization problems and has been successful as an enabler in the fields of machine learning and signal processing [16]. In a nutshell, online algorithms tackle each modified time instance of the optimization problem by performing a series of updates on the previously obtained solutions. In order to leverage online learning in GNNs we face two major roadblocks. First, optimality bounds and convergence guarantees are given only for convex problems [17]. Second, online optimization algorithms assume a centralized setting. The latter is particularly problematic since it violates the local and distributed nature of GNNs, upon which much of their success has been built [10].

This paper puts forward the Wide and Deep Graph Neural Network (WD-GNN) architecture, which is amenable to distributed online learning while keeping convergence guarantees. We define the WD-GNN as consisting of two components: a deep part, which is a nonlinear GNN, and a wide part, which is a bank of linear graph filters (Section 3). We propose to have an offline phase of training, which need not be distributed, and then an online retraining phase, where only the wide part is adapted to the new problem settings.
In this way, we learn a nonlinear representation that can still be adapted online without sacrificing the convex nature of the problem (Section 4). We further develop an algorithm for distributed online learning. We prove that the WD-GNN is stable to changes in the underlying graph support, indicating a certain level of robustness to structural changes, and we prove convergence guarantees for the proposed online learning procedure (Section 5). Finally, we perform simulated experiments on robot swarm control (Section 6). Note that proofs, implementation details and another experiment involving movie recommendation systems can be found in the supplementary material.

2 Related work

GNNs have been developed as nonlinear representation maps that are capable of leveraging the graph structure present in data. The most popular model for GNNs is the one involving graph convolutions (formally known as graph FIR filters). Several implementations of this model have been proposed, including [1], which computes the graph convolution in the spectral domain, [2], which uses a Chebyshev polynomial implementation, [3, 4], which use a summation polynomial, and [5, 6], which reduce the polynomial to just the first order. All of these are different implementations of the same representation space, given by the use of graph convolutional filters to regularize the linear transform of a neural network model. Other popular GNN models include graph attention networks [18] and CayleyNets [19]; see [20] for a general framework.

Online learning has been investigated in designing neural networks (NNs) for dynamically varying problems. Specifically, [21, 22] develop online algorithms for feedforward neural networks with applications in dynamical condition monitoring and aircraft control.
More recently, online learning has been used in convolutional neural networks for visual tracking, detection and classification [23, 24]. While these works develop online algorithms for NNs, an analysis of the convergence of these algorithms is not presented, except for [25], which proves the convergence of certain online algorithms for radial neural networks only.
3 The wide and deep graph neural network (WD-GNN)

Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathcal{W}\}$ describe a graph, where $\mathcal{V} = \{n_1, \ldots, n_N\}$ is the set of $N$ nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges, and $\mathcal{W}: \mathcal{E} \to \mathbb{R}$ is the edge weight function. In the case of the robot swarm, each node $n_i \in \mathcal{V}$ represents a robot, each edge $(n_j, n_i) \in \mathcal{E}$ represents the communication link between robots $n_j$ and $n_i$, and the weight $\mathcal{W}(n_j, n_i) = w_{ji} \geq 0$ models the communication channel.

The graph $\mathcal{G}$ is used to describe the data structure of interest. The data itself is defined on top of the graph and is described by means of a graph signal $X: \mathcal{V} \to \mathbb{R}^F$, which assigns an $F$-dimensional feature vector to each node. For example, in the robot swarm, the signal $X(n_i) \in \mathbb{R}^F$ represents the state of robot $n_i$, typically described by its relative position, velocity or acceleration. The collection of features across all nodes in the graph can be conveniently denoted with a matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$, which we call a graph signal as well. Note that each row of $\mathbf{X}$ corresponds to the feature vector at each node, whereas each column corresponds to the collection of the $f$th feature across all nodes.

Describing the data as a graph signal lets us leverage the framework of graph signal processing (GSP) as the mathematical foundation on which to develop algorithms, derive properties and gain insights [11]. In particular, note that $\mathbf{X}$ is an $N \times F$ matrix that bears no information about the underlying graph (beyond the fact that it has $N$ nodes). To be able to relate the graph signal to the specific graph it is supported on, we need a matrix description of the graph. Let $\mathbf{S} \in \mathbb{R}^{N \times N}$ be a support matrix that satisfies $[\mathbf{S}]_{ij} = 0$ if $i \neq j$ and $(n_j, n_i) \notin \mathcal{E}$. Examples of support matrices commonly used in the literature include the adjacency matrix, the Laplacian matrix, and their normalized versions. The key aspect of the support matrix is that it respects the sparsity of the graph. Thus, when using it as a linear operator on the data, $\mathbf{S}\mathbf{X}$, we observe that the output at node $n_i$ for feature $f$ becomes

$$[\mathbf{S}\mathbf{X}]_{if} = \sum_{j=1}^{N} [\mathbf{S}]_{ij} [\mathbf{X}]_{jf} = \sum_{j: n_j \in \mathcal{N}_i} [\mathbf{S}]_{ij} [\mathbf{X}]_{jf} \qquad (1)$$

where the second equality emphasizes the sparse nature of the support, in the sense that only the values of the $f$th feature at the neighboring nodes $n_j \in \mathcal{N}_i$, for $\mathcal{N}_i = \{n_j \in \mathcal{V}: (n_j, n_i) \in \mathcal{E}\}$, are required to compute the output of the linear operation $\mathbf{S}\mathbf{X}$ at each node. This renders $\mathbf{S}\mathbf{X}$ a linear operation that only needs information from neighboring nodes (local) and that can be computed separately at each node (distributed). The operation $\mathbf{S}\mathbf{X}$ is at the core of GSP, since it effectively relates the graph signal with the graph support, and it usually receives the name of graph shift [11].

While, in general, we can think of graph data as given by a pair $(\mathbf{X}, \mathbf{S})$ consisting of the graph signal $\mathbf{X}$ and its support $\mathbf{S}$, we would like to remark that we only regard $\mathbf{X}$ as actionable. The support $\mathbf{S}$ is determined by the physical constraints of the problem. For example, in the robot swarm, $\mathbf{S}$ represents some specific model of communications among robots.

In what follows, we propose the Wide and Deep Graph Neural Network (WD-GNN) architecture. It is a nonlinear map $\Psi: \mathbb{R}^{N \times F} \to \mathbb{R}^{N \times G}$ that consists of two components

$$\Psi(\mathbf{X}; \mathbf{S}, \mathcal{A}, \mathcal{B}) = \alpha_D \Phi(\mathbf{X}; \mathbf{S}, \mathcal{A}) + \alpha_W B(\mathbf{X}; \mathbf{S}, \mathcal{B}) + \beta \qquad (2)$$

where $\Phi(\mathbf{X}; \mathbf{S}, \mathcal{A})$ is called the deep part and is a graph neural network (GNN), and $B(\mathbf{X}; \mathbf{S}, \mathcal{B})$ is called the wide part and is a bank of graph filters. The scalars $\alpha_D$, $\alpha_W$, $\beta$ are preset weights; they can also be considered architecture parameters and be trained, if necessary.

The wide component is a bank of graph filters [11]. These are defined as a linear mapping between graph signals $B: \mathbb{R}^{N \times F} \to \mathbb{R}^{N \times G}$, characterized by a set of weights or filter taps $\mathcal{B} = \{\mathbf{B}_k \in \mathbb{R}^{F \times G}, k = 0, \ldots, K\}$, as follows

$$B(\mathbf{X}; \mathbf{S}, \mathcal{B}) = \sum_{k=0}^{K} \mathbf{S}^k \mathbf{X} \mathbf{B}_k. \qquad (3)$$

The operation to compute the output (3) is often called a graph convolution [1, 3]. The output of (3) is another graph signal of dimensions $N \times G$. Note that the graph filter has the ability to change the dimension of the feature vector (i.e., from $F$ to $G$). In fact, to accommodate the multi-dimensional nature of these features, the graph filter (3) actually acts analogously to the application of a bank of $FG$ filters, hence the name. Oftentimes, though, we refer to (3) simply as a graph filter or a graph convolution.

The filtering operation in (3) is distributed and local, as it can be computed at each node with information relayed by neighboring nodes only. To see this, note that the multiplication on the left by $\mathbf{S}^k$ is the one that mixes information from different nodes (in the columns of $\mathbf{X}$), but since $\mathbf{S}$ respects the sparsity of the graph and $\mathbf{S}^k \mathbf{X} = \mathbf{S}(\mathbf{S}^{k-1}\mathbf{X})$ can be seen as repeated applications of $\mathbf{S}$, only the information from neighboring nodes is collected [cf. (1)]. Multiplication on the right by $\mathbf{B}_k$ carries out a linear combination of the entries in each row of $\mathbf{X}$, but since each row corresponds to a single node, this combination is computed locally. Notably, nodes do not require knowledge of $\mathbf{S}$ at implementation time. They only need to have communication capabilities, so that they can receive the information from neighboring nodes, and computational capabilities, to compute a linear combination of the information received from the neighbors. They do not require full knowledge of the graph, but only of their immediate neighbors. Thus, in terms of distributed implementation, the graph filtering operation scales seamlessly [15]. We fundamentally use (3) as a mathematical framework that offers a condensed description of the communication exchanges that happen in a network.

The deep component is a convolutional graph neural network (GNN).
These are defined as a nonlinear mapping between graph signals $\Phi: \mathbb{R}^{N \times F} \to \mathbb{R}^{N \times G}$, built as a cascade of graph filters [cf. (3)] and pointwise nonlinearities

$$\Phi(\mathbf{X}; \mathbf{S}, \mathcal{A}) = \mathbf{X}_L \quad \text{with} \quad \mathbf{X}_\ell = \sigma\bigg(\sum_{k=0}^{K_\ell} \mathbf{S}^k \mathbf{X}_{\ell-1} \mathbf{A}_{\ell k}\bigg) \qquad (4)$$

for $\ell = 1, \ldots, L$, with $\sigma: \mathbb{R} \to \mathbb{R}$ a pointwise nonlinear function which, in an abuse of notation, denotes its entrywise application in (4); and characterized by the set of filter taps $\mathcal{A} = \{\mathbf{A}_{\ell k} \in \mathbb{R}^{F_{\ell-1} \times F_\ell}, k = 0, \ldots, K_\ell, \ell = 1, \ldots, L\}$. The graph signal $\mathbf{X}_\ell$ at each layer has $F_\ell$ features, the input is $\mathbf{X}_0 = \mathbf{X}$ so that $F_0 = F$, and the output has $F_L = G$ features.

There is a vast literature on GNNs. We note that [1–4] all describe the same representation space as (4); they just differ in how the graph convolution (3) is implemented. On the other hand, [5, 6] are restricted to $K_\ell = 1$ for all $\ell$, and thus their representation space is just a subspace of that of (4). Since the results presented here are characterizations of the representation space of (4), they hold for all implementations; while in the numerical section we implement the graph convolution as a direct polynomial, any of the other implementations could have been used.

We train the WD-GNN (2) by solving the empirical risk minimization (ERM) problem for some cost function $J: \mathbb{R}^{N \times G} \to \mathbb{R}$ on a given training set $\mathcal{T} = \{\mathbf{X}_1, \ldots, \mathbf{X}_{|\mathcal{T}|}\}$

$$\min_{\mathcal{A}, \mathcal{B}} \frac{1}{|\mathcal{T}|} \sum_{\mathbf{X} \in \mathcal{T}} J\big(\Psi(\mathbf{X}; \mathbf{S}, \mathcal{A}, \mathcal{B})\big). \qquad (5)$$

The ERM problem on a nonlinear neural network model is typically nonconvex, and is usually approximately solved by means of some SGD-based optimization algorithm, exploiting the backpropagation method for efficient computation of the derivatives.
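To fix ideas, the forward pass of (2)-(4) can be sketched in a few lines of numpy. This is our own minimal illustration, not the implementation used in the experiments; the function names, the choice of tanh for $\sigma$, and the defaults $\alpha_D = \alpha_W = 1$, $\beta = 0$ are assumptions made for the example:

```python
import numpy as np

def gnn_layer(X, S, taps, sigma=np.tanh):
    """One layer of (4): X_l = sigma(sum_k S^k X_{l-1} A_{lk})."""
    Z, out = X, 0.0
    for Ak in taps:
        out = out + Z @ Ak      # accumulate S^k X_{l-1} A_{lk}
        Z = S @ Z               # one more neighborhood exchange
    return sigma(out)

def wd_gnn(X, S, A, B, alpha_D=1.0, alpha_W=1.0, beta=0.0):
    """WD-GNN output (2): alpha_D Phi(X;S,A) + alpha_W B(X;S,B) + beta.

    A: per-layer lists of taps A_lk, each of shape (F_{l-1}, F_l);
    B: wide-part taps B_k, each of shape (F, G).
    """
    Phi = X                     # deep part: cascade of layers (4)
    for layer_taps in A:
        Phi = gnn_layer(Phi, S, layer_taps)
    Z, wide = X, 0.0            # wide part: graph filter bank (3)
    for Bk in B:
        wide = wide + Z @ Bk
        Z = S @ Z
    return alpha_D * Phi + alpha_W * wide + beta
```

Note that, for fixed $\mathcal{A}$, the output is linear in the wide taps $\mathcal{B}$; this is the property exploited by the online retraining phase.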
Note that the number of parameters in $\mathcal{A}$ and $\mathcal{B}$ is determined by the hyperparameters $L$ (number of layers), $K_\ell$ (number of filter taps per layer) and $F_\ell$ (number of features per layer) for the deep part, and $K$ (number of filter taps) for the wide part. Training is the procedure of applying some SGD-based algorithm for some number of iterations or training steps, arriving at some set of parameters $\mathcal{A}^\dagger$ and $\mathcal{B}^\dagger$.

We note that no single implementation of the graph convolution has consistently outperformed the others in a wide range of diverse problems; this is why we choose to focus on characterizing the representation space, and not on the specifics of implementation. We also take the license to define the ERM problem as in (5) so as to include supervised and unsupervised problems in a single framework; to use (5) for a supervised problem, we just extend $J$ to operate on an extra input representing the label given in the training set.

4 Online learning

In many problems of interest, the data structures may change (slightly) from the training phase to the testing phase, or we may consider dynamic systems, where the scenario changes naturally with time. The problem of controlling a robot swarm, for instance, exhibits both, since the different initializations
of positions and velocities of the swarm lead to different structures at training time and testing time, and since inevitable movements of the robots cause the communication links between them to change.
Online learning addresses this problem by proposing optimization algorithms that adapt to a continuously changing problem [16]. It operates by adjusting parameters repeatedly for each time instance of the problem. Online learning algorithms require convexity of the ERM problem to be able to provide optimality bounds as well as convergence guarantees. However, the ERM problem (5) using the WD-GNN is rarely convex.

To tackle this issue we propose to retrain only the wide component of the WD-GNN. By keeping the deep part fixed at $\mathcal{A} = \mathcal{A}^\dagger$, as obtained from solving (5) over the training set, we can focus on the wide part, which is linear. Thus, we obtain a new ERM problem, now convex, that can be solved online to find a new set of parameters $\mathcal{B}$. In essence, we are leveraging the deep part to learn a nonlinear representation from the training set in an offline phase, and then adapting it online to the testing set, but only up to the extent of linear transforms.

Let $\mathcal{A}^\dagger$ and $\mathcal{B}^\dagger$ be the parameters learned in the offline phase. At testing time, the implementation scenario may differ from the one used for training, leading to a time-varying optimization problem of the form

$$\min_{\mathcal{A}, \mathcal{B}} J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big) \qquad (6)$$

where $J_t(\cdot)$, $\mathbf{X}_t$ and $\mathbf{S}_t$ are the loss function, the observed signal and the graph structure at time $t$, respectively. In the online phase we fix the deep part $\mathcal{A} = \mathcal{A}^\dagger$, converting the WD-GNN into a convex model. Then, we retrain the wide part online based on the changing scenario [cf. (6)]. More specifically, we let $\mathcal{B}_0 = \mathcal{B}^\dagger$ initially and, at time $t$, we have parameters $\mathcal{A}^\dagger$ and $\mathcal{B}_t$, input signal $\mathbf{X}_t$, output $\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_t)$, and loss $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_t)\big)$. At time $t$, we can then perform a few (possibly one) gradient descent steps with step size $\gamma_t$ to update $\mathcal{B}_t$

$$\mathcal{B}_{t+1} = \mathcal{B}_t - \gamma_t \nabla_{\mathcal{B}} J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_t)\big). \qquad (7)$$
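To make the update (7) concrete, the sketch below (ours, not the authors' code) instantiates it for an assumed quadratic tracking loss $J_t(\mathcal{B}) = \|B(\mathbf{X}_t; \mathbf{S}_t, \mathcal{B}) - \mathbf{Y}_t\|_F^2$, a toy stand-in for the true loss in which the fixed deep part's output can be absorbed into the target $\mathbf{Y}_t$. The second function previews a per-node consensus-plus-gradient recursion of the kind used by the distributed variant developed next; all names here are our own:

```python
import numpy as np

def wide_grads(B, X_t, S_t, Y_t):
    """Gradients of the assumed loss ||sum_k S_t^k X_t B_k - Y_t||_F^2
    with respect to each wide-part tap B_k."""
    Zs = [X_t]
    for _ in range(len(B) - 1):
        Zs.append(S_t @ Zs[-1])              # Zs[k] = S_t^k X_t
    resid = sum(Z @ Bk for Z, Bk in zip(Zs, B)) - Y_t
    return Zs, [2.0 * Z.T @ resid for Z in Zs]

def centralized_step(B, X_t, S_t, Y_t, gamma):
    """One online gradient step, as in (7)."""
    _, grads = wide_grads(B, X_t, S_t, Y_t)
    return [Bk - gamma * g for Bk, g in zip(B, grads)]

def distributed_step(B_nodes, X_t, S_t, Y_t, gamma):
    """Neighborhood averaging plus a local gradient step;
    B_nodes[i] is node i's copy of the wide taps."""
    N, K1 = S_t.shape[0], len(B_nodes[0])
    Zs = [X_t]
    for _ in range(K1 - 1):
        Zs.append(S_t @ Zs[-1])
    new = []
    for i in range(N):
        nbrs = [j for j in range(N) if j != i and S_t[i, j] != 0]
        avg = [(B_nodes[i][k] + sum(B_nodes[j][k] for j in nbrs)) / (len(nbrs) + 1)
               for k in range(K1)]           # average over the 1-hop neighborhood
        # local loss at node i: squared error of node i's own output row
        resid_i = sum(Z[i] @ Bk for Z, Bk in zip(Zs, B_nodes[i])) - Y_t[i]
        grads_i = [2.0 * np.outer(Z[i], resid_i) for Z in Zs]
        new.append([a - gamma * g for a, g in zip(avg, grads_i)])
    return new
```

With a sufficiently small step size relative to the smoothness constant of $J_t$, repeated centralized steps drive this convex quadratic loss down; the distributed recursion additionally drives the node copies toward consensus.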
Algorithm 1 in the supplementary material summarizes the above online learning procedure. One major drawback is that this online algorithm is centralized, violating the distributed nature of the WD-GNN. To overcome this issue, we introduce the following method. In decentralized problems, each node $n_i$ has access to a local loss $J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}_i, \mathcal{B}_i)\big)$ with local parameters $\mathcal{A}_i$ and $\mathcal{B}_i$. The goal is to coordinate the nodes to minimize the sum-cost $\sum_{i=1}^{N} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}_i, \mathcal{B}_i)\big)$ while keeping the local parameters equal to each other, i.e., $\mathcal{A}_i = \mathcal{A}$ and $\mathcal{B}_i = \mathcal{B}$ for all $i = 1, \ldots, N$. We can then recast problem (6) as a constrained optimization one

$$\min_{\{\mathcal{B}_i\}_{i=1}^{N}} \sum_{i=1}^{N} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_i)\big), \quad \text{s.t.} \quad \mathcal{B}_i = \mathcal{B}_j \;\; \forall\, i, j: n_j \in \mathcal{N}_i \qquad (8)$$

where $\mathcal{A}_i = \mathcal{A}^\dagger$ for all $i = 1, \ldots, N$ since the deep part is fixed. The constraint $\mathcal{B}_i = \mathcal{B}_j$ for all $i, j: n_j \in \mathcal{N}_i$ implies that $\mathcal{B}_i = \mathcal{B}$ for all $i = 1, \ldots, N$ under the assumed connectivity of the graph. To solve (8), at time $t$, each node $n_i$ updates its local parameters $\mathcal{B}_i$ by the recursion

$$\mathcal{B}_{i,t+1} = \frac{1}{|\mathcal{N}_i| + 1} \bigg( \sum_{j: n_j \in \mathcal{N}_i} \mathcal{B}_{j,t} + \mathcal{B}_{i,t} \bigg) - \gamma_t \nabla_{\mathcal{B}} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_{i,t})\big) \qquad (9)$$

with $|\mathcal{N}_i|$ the number of neighbors of node $n_i$. Put simply, each node $n_i$ descends along its local gradient $\nabla_{\mathcal{B}} J_{i,t}\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}^\dagger, \mathcal{B}_{i,t})\big)$ while averaging over the one-hop neighborhood, thus approaching the optimal parameters of (8) and simultaneously driving the local parameters to consensus.

In a nutshell, the proposed online procedure is of low complexity due to the linearity of the wide part, and it converges efficiently due to convexity, as proved next. Furthermore, it can be implemented in a distributed manner requiring only neighborhood information.

5 Stability and convergence
In this section, we establish the stability of the WD-GNN to changes in the underlying graph $\mathbf{S}$; that is, we prove that the change in the WD-GNN output caused by a perturbation of $\mathbf{S}$ is bounded by the size of the perturbation. Then, we prove that the online learning procedure converges to the optimizer set of a time-varying problem, up to an error neighborhood that depends on the problem variations.

The WD-GNN consists of a GNN and a graph filter. Both of these components are permutation equivariant [6, 15], which means that their outputs are unaffected by node reorderings. Since the sum in (2) does not affect this property, WD-GNNs are permutation equivariant as well.

We then consider arbitrary changes to the graph matrix. Let $\mathbf{S}$ be the given graph and $\hat{\mathbf{S}}$ a perturbation of it. We measure the size of the perturbation in relative terms as follows. Define the relative error set

$$\mathcal{R} = \{\mathbf{E} \in \mathbb{R}^{N \times N}: \mathbf{P}^\mathsf{T} \hat{\mathbf{S}} \mathbf{P} = \mathbf{S} + \mathbf{E}\mathbf{S} + \mathbf{S}\mathbf{E}, \; \mathbf{E} = \mathbf{E}^\mathsf{T}, \; \mathbf{P} \in \mathcal{P}\} \qquad (10)$$

for $\mathcal{P}$ the permutation set $\mathcal{P} = \{\mathbf{P} \in \{0, 1\}^{N \times N}: \mathbf{P}\mathbf{1} = \mathbf{1}, \mathbf{P}^\mathsf{T}\mathbf{1} = \mathbf{1}\}$. The relative error set is the set of all symmetric matrices $\mathbf{E}$ such that, when multiplied with the support matrix $\mathbf{S}$ and added back, they yield a permutation of the perturbed support $\hat{\mathbf{S}}$. Then, we can define the relative perturbation measure between $\mathbf{S}$ and $\hat{\mathbf{S}}$ as

$$d(\mathbf{S}, \hat{\mathbf{S}}) = \min_{\mathbf{E} \in \mathcal{R}} \|\mathbf{E}\|. \qquad (11)$$

The relative perturbation measure computes how different the perturbation $\hat{\mathbf{S}}$ is in terms of the original support $\mathbf{S}$, irrespective of the specific ordering of the nodes (given the permutation equivariance property of the WD-GNN). We note that the relative perturbation model ties the size of the perturbation to the topology of the graph through the multiplication of $\mathbf{E}$ with $\mathbf{S}$, thus avoiding the failure to capture structural transformations that typically accompanies the choice of an absolute perturbation $\mathbf{P}^\mathsf{T} \hat{\mathbf{S}} \mathbf{P} = \mathbf{S} + \mathbf{E}$ [15]. The WD-GNN can be proved stable when built upon integral Lipschitz filters.

Definition 1 (Integral Lipschitz filters).
Let $B(\mathbf{X}; \mathbf{S}, \mathcal{B})$ be a graph filter (3). Denote by $b^{fg}_k = [\mathbf{B}_k]_{fg}$ and build the graph frequency response $b^{fg}(\lambda) = \sum_{k=0}^{K} b^{fg}_k \lambda^k$, satisfying $|b^{fg}(\lambda)| \leq 1$ [11]. If there exists a constant $C_L > 0$ such that, for all $\lambda_1, \lambda_2 \in \mathbb{R}$ and for all $f = 1, \ldots, F$ and $g = 1, \ldots, G$,

$$\big| b^{fg}(\lambda_2) - b^{fg}(\lambda_1) \big| \leq C_L \, \frac{|\lambda_2 - \lambda_1|}{|\lambda_1 + \lambda_2|/2}, \qquad (12)$$

we say that the filter $B(\mathbf{X}; \mathbf{S}, \mathcal{B})$ is integral Lipschitz.

Integral Lipschitz filters are those for which the integral of the graph frequency response is Lipschitz continuous. Integral Lipschitz filters satisfy $|\lambda (b^{fg}(\lambda))'| \leq C_L$ for all $\lambda$ and all $f, g$, where $(b^{fg}(\lambda))' = db^{fg}(\lambda)/d\lambda$. Such a condition is reminiscent of the scale invariance of wavelet transforms [26]. In any case, examples of integral Lipschitz filters include graph wavelets [27, 28], and the condition can also be enforced by means of penalties during training [29]. We can now establish the stability of the WD-GNN to relative perturbations. Without loss of generality, we assume a single input and a single output, i.e., $F = G = 1$, for the theoretical analysis.

Theorem 1.
Consider the underlying graph $\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\mathsf{T}$ and the perturbed graph $\hat{\mathbf{S}}$ with $N$ nodes. The relative perturbation $\mathbf{E} = \mathbf{U}\boldsymbol{\Theta}\mathbf{U}^\mathsf{T} \in \mathcal{R}$ satisfies $d(\mathbf{S}, \hat{\mathbf{S}}) \leq \|\mathbf{E}\| \leq \varepsilon$. Consider the WD-GNN (2) with integral Lipschitz filters with constant $C_L$. Consider also that the nonlinearity $\sigma(\cdot)$ is normalized Lipschitz, $|\sigma(x_2) - \sigma(x_1)| \leq |x_2 - x_1|$ for all $x_1, x_2 \in \mathbb{R}$, with $\sigma(0) = 0$. Then it holds that

$$\|\Psi(\mathbf{X}; \mathbf{S}, \mathcal{A}, \mathcal{B}) - \Psi(\mathbf{X}; \hat{\mathbf{S}}, \mathcal{A}, \mathcal{B})\| \leq C_L (1 + \delta\sqrt{N}) \Big( |\alpha_D|\, L \prod_{\ell=1}^{L-1} F_\ell + |\alpha_W| \Big) \|\mathbf{X}\| \varepsilon + O(\varepsilon^2) \qquad (13)$$

where $\delta = (\|\mathbf{U} - \mathbf{V}\| + 1)^2 - 1$ measures the eigenvector misalignment between $\mathbf{S}$ and $\mathbf{E}$.

Theorem 1 shows that the WD-GNN output is Lipschitz stable to relative graph perturbations, up to a stability constant. The constant comprises three terms, $C_L$, $\delta\sqrt{N}$ and $|\alpha_D| L \prod_{\ell=1}^{L-1} F_\ell + |\alpha_W|$, indicating the effects of the filter properties, the graph perturbation and the network architecture, respectively.
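The first-order behavior predicted by (13) can be probed numerically. The following sketch (our illustration, for a single-feature filter with arbitrary fixed taps; not part of the paper's experiments) draws a random symmetric $\mathbf{E}$ with $\|\mathbf{E}\| = \varepsilon$, forms $\hat{\mathbf{S}} = \mathbf{S} + \mathbf{E}\mathbf{S} + \mathbf{S}\mathbf{E}$, and measures the output gap:

```python
import numpy as np

def perturbation_gap(eps, N=20, seed=1):
    """Output gap of a graph filter under a relative perturbation
    S_hat = S + ES + SE with ||E|| = eps  [cf. (10)]."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((N, N)); S = (S + S.T) / 2
    S /= np.linalg.norm(S, 2)              # unit spectral norm
    X = rng.standard_normal((N, 1))
    E = rng.standard_normal((N, N)); E = (E + E.T) / 2
    E *= eps / np.linalg.norm(E, 2)        # symmetric, ||E|| = eps
    S_hat = S + E @ S + S @ E

    def graph_filter(S):                   # scalar taps, F = G = 1
        Z, Y = X, 0.0
        for b in [1.0, 0.5, 0.25]:
            Y = Y + b * Z
            Z = S @ Z
        return Y

    gap = np.linalg.norm(graph_filter(S) - graph_filter(S_hat))
    return gap, np.linalg.norm(X)
```

Shrinking $\varepsilon$ shrinks the gap roughly proportionally, consistent with the linear term of (13).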
Figure 1. Total velocity variation relative to the optimal controller for the WD-GNN and the GNN. (a) Comparison under different communication radii. (b) Comparison under different initial velocities. (c) Comparison under different numbers of agents.
Though the WD-GNN exhibits a certain stability to underlying graph changes, performance degradation is inevitable if the perturbation becomes severe. Furthermore, graph perturbations are only one example of time-varying scenarios, which may also involve changes in the loss function, the system observations, etc. Online learning mitigates these effects and adjusts the WD-GNN to changing problems. We show the efficiency of the proposed online learning procedure by analyzing its convergence properties. Before claiming the main result, we need the following standard assumptions.
Assumption 1.
Consider the time-varying loss function $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ with fixed parameters $\mathcal{A}$. Let $\mathcal{B}^*_t$ be an optimal solution of $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ at time $t$. There exists a sequence $\{\mathcal{B}^*_t\}_t$ and a constant $C_B$ such that, for all $t$, it holds that

$$\|\mathcal{B}^*_{t+1} - \mathcal{B}^*_t\| \leq C_B. \qquad (14)$$
Consider the time-varying loss function $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ with fixed parameters $\mathcal{A}$. When $\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})$ is a linear function of $\mathcal{B}$, $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ is differentiable, strongly smooth with constant $C_{t,s}$ and strongly convex with constant $C_{t,\ell}$.

Assumption 1 establishes the correlation between successive time-varying problems and bounds the time variations of the changing optimal solutions. Assumption 2 is commonly used in optimization theory and is satisfied in practice. Both assumptions are mild; based on them, we present the convergence result.
Theorem 2.
Consider the WD-GNN (2) optimized with the proposed online learning algorithm. Let $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ be the time-varying loss function satisfying Assumptions 1-2 with constants $C_B$, $C_{t,s}$ and $C_{t,\ell}$. Let also $\mathcal{B}^*_t$ be the optimal solution of $J_t\big(\Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})\big)$ and $\gamma_t = \gamma$ be the step size of gradient descent. Then, the sequence $\{\mathcal{B}_t\}_t$ generated by the online learning algorithm in (7) satisfies

$$\|\mathcal{B}_t - \mathcal{B}^*_t\| \leq \bigg( \prod_{\tau=1}^{t} m_\tau \bigg) \|\mathcal{B}_0 - \mathcal{B}^*_0\| + \frac{1 - \hat{m}^t}{1 - \hat{m}}\, C_B \qquad (15)$$

where $m_t = \max\{|1 - \gamma C_{t,s}|, |1 - \gamma C_{t,\ell}|\}$ is the convergence rate and $\hat{m} = \max_{1 \leq \tau \leq t} m_\tau$.

Theorem 2 shows that the online learning of the WD-GNN converges to the optimal solution of the time-varying problem up to a limiting error neighborhood. The latter depends on the time variations of the optimization problem. If we particularize $C_B = 0$ with $m_t = m$ for all $t$, we recover the result for the time-invariant optimization problem, that is, the exact convergence of gradient descent.

6 Experiments

The goal of the experiment is to learn a decentralized controller that coordinates robots, initially flying at random velocities, to move together at the same velocity while avoiding collisions. Here, we focus on presenting the main results. The implementation details, as well as another experiment involving movie recommendation systems, are shown in the supplementary material.

Consider a network of $N$ robots initially moving at random velocities sampled in the interval $[-v, v]$. At time $t$, each robot $n_i$ is described by its position $\mathbf{p}_{i,t}$, velocity $\mathbf{v}_{i,t}$ and acceleration $\mathbf{u}_{i,t}$, the latter being the controllable variable. This problem has an optimal centralized controller $\mathbf{u}^*_{i,t}$ that can be readily computed [10]. Such a solution requires knowledge of the positions and velocities of all robots and thus demands a centralized computation unit. In the decentralized setting, we assume robot $n_i$ can only communicate with robot $n_j$ if they are within the communication radius $r$, i.e., $\|\mathbf{p}_{i,t} - \mathbf{p}_{j,t}\| \leq r$. We establish the communication graph $\mathcal{G}_t$ with the node set $\mathcal{V}$ for the robots and the edge set $\mathcal{E}_t$ for the available communication links, and the graph matrix $\mathbf{S}_t$ is the associated adjacency matrix.

Table 1: Average (std. deviation) of total and final velocity variation, comparing the optimal controller, the WD-GNN, the WD-GNN with centralized online learning, the WD-GNN with decentralized online learning, the GNN, and the graph filter.

We use imitation learning [30] to train the WD-GNN as a decentralized controller on the accelerations $\mathbf{U}_t = [\mathbf{u}_{1,t}, \ldots, \mathbf{u}_{N,t}]^\mathsf{T} = \Psi(\mathbf{X}_t; \mathbf{S}_t, \mathcal{A}, \mathcal{B})$, where $\mathbf{X}_t$ is the graph signal that collects the positions and velocities of neighbors at each robot [10]. At testing time, we measure the variation in velocity among the robots over the whole trajectory and also at the final time instant [10]. In a way, the total velocity variation reflects how long it takes for the team to become coordinated, while the final velocity variation tells how well the task was finally accomplished.

We first compare the optimal controller, the GNN and the graph filter with the WD-GNN without retraining, in the initial condition $r = 2$ m, $v = 3$ m/s and $N = 50$ (Table 1). We see that the WD-GNN exhibits the best performance in both performance measures. We attribute this behavior to the increased representation power of the WD-GNN as a combined architecture. The GNN takes second place, while the graph filter performs much worse than the other two architectures. This is because the optimal distributed controller is known to be nonlinear [31].

To assess the adaptability of the proposed model to different initial conditions, in comparison with the GNN, we run simulations for changing communication radius (Fig.
1a), changing initial velocities $v$ (Fig. 1b) and changing numbers of agents $N$ (Fig. 1c). These experiments show an improved robustness of the WD-GNN. We display the results as the change in total velocity variation relative to the optimal controller. Fig. 1a shows that the performance improves as the communication radius $r$ increases, which is expected since the robots then have access to information from farther away. In Fig. 1b, we observe that the flocking problem becomes harder as the initial velocity $v$ increases, since it is reasonably harder to control robots that move very fast in random directions. As for the number of agents $N$, the relative total velocity variation decreases as $N$ increases, which is explained by the fact that the optimization problem becomes easier to solve with the increased node exchanges in larger graphs.

Finally, we test the improvement of the WD-GNN with online learning. We consider the centralized online optimization algorithm as the baseline, with the velocity variation as the instantaneous loss function in (6). We then test the proposed distributed optimization algorithm (9), where the instantaneous loss function is the velocity variance of neighboring robots. Results are shown in Table 1. We see that the total and final velocity variations are reduced for both the centralized and decentralized online algorithms, indicating that the WD-GNN successfully adapts to the changing communication network. The improvement in final velocity variation is more noticeable, since the effects of single time-updates get compounded.

7 Conclusions

We propose the Wide and Deep Graph Neural Network architecture and consider a distributed online learning scenario. The proposed architecture consists of a bank of graph filters (wide part) and a GNN (deep part), leading to a local and distributed architecture that we proved is stable to small changes in the underlying graph support.
To address more general time-varying scenarios without compromising the distributed nature of the architecture, we proposed a distributed online optimization algorithm. By fixing the deep part and retraining only the wide part online, we obtain a convex optimization problem, and thus prove convergence guarantees for online learning algorithms. Numerical experiments on learning decentralized controllers for flocking a robot swarm show the success of WD-GNNs in adapting to time-varying scenarios. Future research involves the online retraining of the deep part, as well as obtaining convergence guarantees for the distributed online learning algorithm.
Broad Impacts
Graph neural networks are nonlinear representation maps that are computationally inexpensive and descriptive enough to capture a wide range of behaviors. They are also local and distributed, which makes them ideal for deployment over physical networks. In the particular case of dynamic multi-agent systems, the issue arises that the graph support changes with time, requiring GNNs to adapt to unseen scenarios in an online manner. In this paper, we have developed one such model that can learn online and, most importantly, do so without violating the decentralized nature of the architecture. The current limitations of the proposed method are that only the linear part is adapted online, and that there are still no convergence guarantees for the distributed online algorithm (only for the centralized one). In any case, distributed online learning is a key aspect of many real-world applications, such as search and rescue, map exploration, and path planning, of which flocking is a proof-of-concept that can lay the groundwork for more involved scenarios.
Acknowledgments
Supported by NSF CCF 1717120, ARO W911NF1710438, ARL DCIST CRA W911NF-17-2-0181, ISTC-WAS, and Intel DevCloud.
References

[1] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in Int. Conf. Learning Representations, Banff, AB, 14-16 Apr. 2014, pp. 1–14.
[2] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Conf. Neural Inform. Process. Syst., Barcelona, Spain, 5-10 Dec. 2016, pp. 3844–3858.
[3] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, Feb. 2019.
[4] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in Int. Conf. Machine Learning, vol. 97, Long Beach, CA, 9-15 June 2019, pp. 6861–6871.
[5] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Int. Conf. Learning Representations, Toulon, France, 24-26 Apr. 2017, pp. 1–14.
[6] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in Int. Conf. Learning Representations, New Orleans, LA, 6-9 May 2019, pp. 1–17.
[7] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in ACM Int. Conf. Knowledge Discovery and Data Mining, 2018.
[8] Z. Gao, E. Isufi, and A. Ribeiro, “Stochastic graph neural networks,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, 4-8 May 2020.
[9] D. Owerko, F. Gama, and A. Ribeiro, “Optimal power flow using graph neural networks,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, 4-8 May 2020.
[10] E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, “Learning decentralized controllers for robot swarms with graph neural networks,” in Conf. Robot Learning, Osaka, Japan, 30 Oct.-1 Nov. 2019.
[11] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges and applications,” Proc. IEEE, vol. 106, no. 5, pp. 808–828, May 2018.
[12] G. Frahling, P. Indyk, and C. Sohler, “Sampling in dynamic data streams and applications,” Int. J. Computational Geometry & Applications, vol. 18, no. 1-2, pp. 3–28, 2008.
[13] M. K. Helwa and A. P. Schoellig, “Multi-robot transfer learning: A dynamical system perspective,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2017, pp. 4702–4708.
[14] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Appl. Comput. Harmonic Anal., 13 June 2019, in press. [Online]. Available: http://doi.org/10.1016/j.acha.2019.06.003
[15] F. Gama, J. Bruna, and A. Ribeiro, “Stability properties of graph neural networks,” arXiv:1905.04497v3 [cs.LG], 22 Apr. 2020. [Online]. Available: http://arxiv.org/abs/1905.04497
[16] N. Dabbagh and B. Bannan-Ritland, Online Learning: Concepts, Strategies, and Application. Upper Saddle River, NJ: Pearson/Merrill/Prentice Hall, 2005.
[17] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
[18] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Int. Conf. Learning Representations, Vancouver, BC, 30 Apr.-3 May 2018, pp. 1–12.
[19] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “CayleyNets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Trans. Signal Process., vol. 67, no. 1, pp. 97–109, Jan. 2019.
[20] E. Isufi, F. Gama, and A. Ribeiro, “EdgeNets: Edge varying graph neural networks,” arXiv:2001.07620v2 [cs.LG], 12 Mar. 2020. [Online]. Available: http://arxiv.org/abs/2001.07620
[21] J. Sanz, R. Perera, and C. Huerta, “Gear dynamics monitoring using discrete wavelet transformation and multi-layer perceptron neural networks,” Applied Soft Computing, vol. 12, no. 9, pp. 2867–2878, 2012.
[22] Y. Li, N. Sundararajan, P. Saratchandran, and Z. Wang, “Robust neuro-H∞ controller design for aircraft auto-landing,” IEEE Trans. Aerospace and Electronic Systems, vol. 40, no. 1, pp. 158–167, 2004.
[23] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” in Int. Conf. Machine Learning, 2015.
[24] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks,” in IEEE Conf. Computer Vision and Pattern Recognition, 2016.
[25] K. J. Ho, C. S. Leung, and J. Sum, “Convergence and objective functions of some fault/noise-injection-based online learning algorithms for RBF networks,” IEEE Trans. Neural Networks, vol. 21, no. 6, pp. 938–947, 2010.
[26] I. Daubechies, Ten Lectures on Wavelets, ser. CBMS-NSF Regional Conf. Series Appl. Math. Philadelphia, PA: SIAM, 1992, vol. 61.
[27] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Appl. Comput. Harmonic Anal., vol. 30, no. 2, pp. 129–150, Mar. 2011.
[28] D. I. Shuman, C. Wiesmeyr, N. Holighaus, and P. Vandergheynst, “Spectrum-adapted tight graph wavelet and vertex-frequency frames,” IEEE Trans. Signal Process., vol. 63, no. 16, pp. 4223–4235, Aug. 2015.
[29] F. Gama, J. Bruna, and A. Ribeiro, “Stability of graph neural networks to relative perturbations,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., Barcelona, Spain, 4-8 May 2020.
[30] S. Ross and J. A. Bagnell, “Efficient reductions for imitation learning,” in Int. Conf. Artificial Intelligence and Statistics, Sardinia, Italy, 13-15 May 2010, pp. 661–668.
[31] H. S. Witsenhausen, “A counterexample in stochastic optimum control,” SIAM J. Control, vol. 6, no. 1, pp. 131–147, 1968.
[32] A. Simonetto, “Time-varying convex optimization via time-varying averaged operators,” arXiv:1704.07338v3 [math.OC], 24 Apr. 2017. [Online]. Available: https://arxiv.org/abs/1704.07338
[33] W. Huang, A. G. Marques, and A. Ribeiro, “Rating prediction via graph signal processing,” IEEE Trans. Signal Process., vol. 66, no. 19, pp. 5066–5081, 2018.
[34] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, 2016.

Supplementary Materials for ‘Wide and Deep Graph Neural Networks with Distributed Online Learning’
A Proofs
A.1 Proof of Theorem 1
We need the following lemma in the proof.
Lemma 1.
Consider the underlying graph S = VΛV^⊤ and the perturbed graph Ŝ, both with N nodes. The relative perturbation E = UΘU^⊤ ∈ R^{N×N} satisfies d(S, Ŝ) ≤ ‖E‖ ≤ ε. Consider an integral Lipschitz filter [cf. Definition 1 in full paper] with F = 1 input feature, G = 1 output feature and integral Lipschitz constant C_L. Then, the output difference between the filters B(X; S, B) and B(X; Ŝ, B) satisfies

  ‖B(X; S, B) − B(X; Ŝ, B)‖ ≤ 2 C_L (1 + δ√N) ‖X‖ ε + O(ε²)    (16)

where δ = (‖U − V‖ + 1)² − 1 measures the eigenvector misalignment between S and E.

Proof. With F = 1 input feature X ∈ R^{N×1} and G = 1 output feature, B(X; S, B), B(X; Ŝ, B) ∈ R^{N×1}, the output difference can be represented as

  ‖B(X; S, B) − B(X; Ŝ, B)‖ = ‖ Σ_{k=0}^K b_k S^k X − Σ_{k=0}^K b_k Ŝ^k X ‖    (17)

with filter parameters B = {b_0, …, b_K}. We then refer to Theorem 3 in [15] to complete the proof. ∎

Proof of Theorem 1.
The output difference between Ψ(X; S, A, B) and Ψ(X; Ŝ, A, B) can be divided into the wide part difference and the deep part difference

  ‖Ψ(X; S, A, B) − Ψ(X; Ŝ, A, B)‖ ≤ |α_D| ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ + |α_W| ‖B(X; S, B) − B(X; Ŝ, B)‖    (18)

where the triangle inequality is used. Let us consider these two terms separately.

The wide term.
By using Lemma 1, we bound the wide part difference as

  ‖B(X; S, B) − B(X; Ŝ, B)‖ ≤ 2 C_L (1 + δ√N) ‖X‖ ε + O(ε²).    (19)

The deep term.
At layer ℓ, the graph convolution can be considered as the application of F_ℓ F_{ℓ−1} filters, i.e.,

  [ Σ_{k=0}^K S^k X_{ℓ−1} A_{ℓk} ]_f = Σ_{g=1}^{F_{ℓ−1}} Σ_{k=0}^K a^{fg}_{ℓk} S^k [X_{ℓ−1}]_g    (20)

for f = 1, …, F_ℓ, where a^{fg}_{ℓk} = [A_{ℓk}]_{fg} is the (f, g)th entry of the matrix A_{ℓk} and [·]_f represents the fth column. We denote G^{fg}_ℓ(S)[X_{ℓ−1}]_g = Σ_{k=0}^K a^{fg}_{ℓk} S^k [X_{ℓ−1}]_g and x^g_{ℓ−1} = [X_{ℓ−1}]_g for convenience in the following derivation. By using (20) and substituting the GNN architecture [(4) in full paper] into the deep part difference, we get

  ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ = ‖ σ( Σ_{f=1}^{F_{L−1}} G^f_L(S) x^f_{L−1} ) − σ( Σ_{f=1}^{F_{L−1}} G^f_L(Ŝ) x̂^f_{L−1} ) ‖
    ≤ Σ_{f=1}^{F_{L−1}} ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x̂^f_{L−1} ‖    (21)

where the inequality follows because the nonlinearity σ is normalized Lipschitz, together with the triangle inequality. By adding and subtracting G^f_L(Ŝ) x^f_{L−1} inside the norm, we have

  ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x̂^f_{L−1} ‖ ≤ ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ + ‖ G^f_L(Ŝ) x^f_{L−1} − G^f_L(Ŝ) x̂^f_{L−1} ‖
    ≤ ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ + ‖ G^f_L(Ŝ) ‖ ‖ x^f_{L−1} − x̂^f_{L−1} ‖.    (22)

For the first term in (22), by using Lemma 1, we get

  ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ ≤ 2 C_L (1 + δ√N) ‖x^f_{L−1}‖ ε + O(ε²).
(23)

In terms of the term ‖x^f_{L−1}‖, we observe that

  ‖x^f_{L−1}‖ = ‖ σ( Σ_{g=1}^{F_{L−2}} G^{fg}_{L−1}(S) x^g_{L−2} ) ‖ ≤ Σ_{g=1}^{F_{L−2}} ‖ G^{fg}_{L−1}(S) x^g_{L−2} ‖ ≤ Σ_{g=1}^{F_{L−2}} ‖ x^g_{L−2} ‖    (24)

where we use the triangle inequality, followed by the bound on the filters [11], i.e., the filter frequency response satisfies |a^{fg}_{L−1}(λ)| = |Σ_{k=0}^K a^{fg}_{(L−1)k} λ^k| ≤ 1. Following this recursion, we obtain

  ‖x^f_{L−1}‖ ≤ ∏_{ℓ=1}^{L−1} F_ℓ ‖x_0‖    (25)

where ‖x_0‖ = ‖X‖ by definition, since the number of input features is F_0 = 1. By substituting (25) into (23), we have

  ‖ G^f_L(S) x^f_{L−1} − G^f_L(Ŝ) x^f_{L−1} ‖ ≤ 2 C_L (1 + δ√N) ∏_{ℓ=1}^{L−1} F_ℓ ‖X‖ ε + O(ε²).    (26)

For the second term in (22), by again using the filter bound [11], we get

  ‖ G^f_L(Ŝ) ‖ ‖ x^f_{L−1} − x̂^f_{L−1} ‖ ≤ ‖ x^f_{L−1} − x̂^f_{L−1} ‖.    (27)

By substituting (26) and (27) into (22), and the latter into (21), we have

  ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ ≤ Σ_{f=1}^{F_{L−1}} ‖ x^f_{L−1} − x̂^f_{L−1} ‖ + 2 C_L (1 + δ√N) ∏_{ℓ=1}^{L−1} F_ℓ ‖X‖ ε + O(ε²).
(28)

From (28), we observe that the output difference at the Lth layer depends on that at the (L−1)th layer. Repeating this recursion until the input layer, we have

  ‖Φ(X; S, A) − Φ(X; Ŝ, A)‖ ≤ 2 C_L (1 + δ√N) L ∏_{ℓ=1}^{L−1} F_ℓ ‖X‖ ε + O(ε²)    (29)

where we use the initial condition ‖x_0 − x̂_0‖ = ‖X − X‖ = 0. Finally, by substituting (19) and (29) into (18), we complete the proof:

  ‖Ψ(X; S, A, B) − Ψ(X; Ŝ, A, B)‖ ≤ 2 C_L (1 + δ√N) ( |α_W| + |α_D| L ∏_{ℓ=1}^{L−1} F_ℓ ) ‖X‖ ε + O(ε²).    (30)

A.2 Proof of Theorem 2
Proof.
Let A† and B† be the parameters learned in the offline phase. The proposed online learning fixes the deep part, i.e., it freezes the parameters A = A†, and retrains the wide part online. The model Ψ(X; S, A, B) can then be represented as a function of B, with A = A† held constant:

  Ψ(X; S, A†, B) = α_D Φ(X; S, A†) + α_W B(X; S, B) + β = Ψ̂(X; S, B).    (31)

Given the graph signal X and the graph matrix S, Ψ̂(X; S, B) is a linear function of B, since both the graph filter B(X; S, B) and the combination of the two components are linear.

At testing time t, the sampled optimization problem [(6) in full paper] translates to

  min_B J_t( Ψ̂(X_t; S_t, B) )    (32)

where J_t(·), X_t and S_t are the instantaneous loss function, observed signal and graph matrix at time t. Since Ψ̂(X_t; S_t, B) is a linear function of B, the loss J_t(Ψ̂(X_t; S_t, B)) is differentiable, strongly smooth with constant C_{t,s} and strongly convex with constant C_{t,ℓ} by Assumption 2. We then refer to Corollary 7.1 in [32] to complete the proof. ∎

B Implementation details for robot swarm flocking
We consider a network of N robots initially moving at random velocities. At time t, each robot n_i is described by its position p_{i,t} ∈ R² and velocity v_{i,t} ∈ R², and controls its acceleration u_{i,t} ∈ R² for the next state

  p_{i,t+1} = p_{i,t} + v_{i,t} T_s + (1/2) u_{i,t} T_s²,   v_{i,t+1} = v_{i,t} + u_{i,t} T_s    (33)

where T_s is the sampling time and u_{i,t} is held constant during the sampling interval [T_s t, T_s(t+1)]. Our goal is to control the accelerations U_t = [u_{1,t}, …, u_{N,t}]^⊤ ∈ R^{N×2} so that the robots move at the same velocity without colliding. There is an optimal solution for the accelerations [31]

  u*_{i,t} = − Σ_{j=1}^N (v_{i,t} − v_{j,t}) − Σ_{j=1}^N ∇_{p_{i,t}} V(p_{i,t}, p_{j,t}),   for all i = 1, …, N    (34)

with the collision avoidance potential

  V(p_{i,t}, p_{j,t}) = 1/‖p_{i,t} − p_{j,t}‖² − log(‖p_{i,t} − p_{j,t}‖²)   if ‖p_{i,t} − p_{j,t}‖ ≤ ρ,
  V(p_{i,t}, p_{j,t}) = 1/ρ² − log(ρ²)   otherwise.    (35)

The computation of u*_{i,t} requires the instantaneous positions and velocities of all robots over the network. As such, it is a centralized controller that cannot be implemented in practice, where each robot only has access to local neighborhood information.

In the decentralized setting, robot n_i can communicate with robot n_j if and only if they are within the communication radius r, i.e., there is a communication link (n_i, n_j) if ‖p_{i,t} − p_{j,t}‖ ≤ r. We establish the communication graph G_t with the node set V = {n_1, …, n_N} and the edge set E_t containing the available links. The graph matrix S_t is the adjacency matrix with entries [S_t]_{ij} = 1 if (n_i, n_j) ∈ E_t and [S_t]_{ij} = 0 otherwise.
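As a concrete reference for the centralized controller (34)–(35), a minimal sketch is given below, assuming the 2-D setting above; the function names are ours and the code is illustrative rather than the implementation used in the experiments:

```python
import numpy as np

def collision_potential_grad(pi, pj, rho=1.0):
    """Gradient with respect to p_i of the potential (35),
    V = 1/||p_i - p_j||^2 - log(||p_i - p_j||^2),
    which is constant (zero gradient) beyond the radius rho."""
    d = pi - pj
    r2 = float(d @ d)
    if r2 > rho ** 2:
        return np.zeros_like(d)
    # d/dp_i [r2^{-1} - log(r2)] = -2 d / r2^2 - 2 d / r2
    return -2.0 * d / r2 ** 2 - 2.0 * d / r2

def optimal_controller(P, V, rho=1.0):
    """Centralized flocking controller u*_{i,t} of (34). It needs the positions
    P and velocities V (N x 2 arrays) of all robots, which is exactly why it
    cannot be implemented in a distributed fashion."""
    U = np.zeros_like(V)
    for i in range(P.shape[0]):
        for j in range(P.shape[0]):
            if j != i:
                U[i] -= (V[i] - V[j]) + collision_potential_grad(P[i], P[j], rho)
    return U
```

Note that when all robots already share a common velocity and are pairwise farther apart than ρ, both sums in (34) vanish and the controller returns zero accelerations, which is the flocking equilibrium.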
Additionally, we assume robot communications occur within the sampling time interval, so that the robot action clock and the communication clock coincide.

We use the WD-GNN to learn a decentralized controller U_t = Ψ(X_t; S_t, A, B), where the graph matrix S_t is the adjacency matrix of the communication graph G_t and the graph signal X_t = [x_{1,t}, …, x_{N,t}]^⊤ ∈ R^{N×6} is given by

  x_{i,t} = [ Σ_{j: n_j∈N_{i,t}} (v_{i,t} − v_{j,t}),  Σ_{j: n_j∈N_{i,t}} (p_{i,t} − p_{j,t})/‖p_{i,t} − p_{j,t}‖⁴,  Σ_{j: n_j∈N_{i,t}} (p_{i,t} − p_{j,t})/‖p_{i,t} − p_{j,t}‖² ],  for all i = 1, …, N,    (36)

which collects position and velocity information of neighboring robots. The graph filters are adapted to the delayed information structure as

  B(X_t; S_t, B) = Σ_{k=0}^K S_t S_{t−1} ⋯ S_{t−k+1} X_{t−k} B_k.    (37)

We leverage imitation learning to parametrize the optimal controller (34) with the WD-GNN.

Dataset.
The dataset contains trajectories for training, validation and testing. In each trajectory, N = 50 robots are distributed randomly in a circle. A minimal initial distance between robots is enforced, and initial velocities are sampled uniformly at random from [−v, +v] with v = 3 m/s by default. The duration of each trajectory is T = 2 s with sampling time T_s; the acceleration is saturated at a maximal value and the communication radius is r = 2 m.

Parametrizations.
For the WD-GNN, we consider the wide component to be a graph filter and the deep component to be a single-layer GNN, where both have G = 32 output features. All filters are of order K = 3 and the nonlinearity is the hyperbolic tangent. The output features are fed into a local readout layer at each node to generate the two-dimensional acceleration u_{i,t}. We train the WD-GNN with the ADAM optimizer (decaying factors β₁, β₂ and learning rate γ) over mini-batches of trajectories, and average experiment results over several dataset realizations.

Performance measure.
The flocking condition can be quantified by the variance of the robot velocities over the network, referred to as the velocity variation. At testing time, we measure the performance of the learned controller from two aspects, the total velocity variation over the whole trajectory and the final velocity variation:

  J = (1/N) Σ_{t=1}^D Σ_{i=1}^N ‖ v_{i,t} − (1/N) Σ_{i=1}^N v_{i,t} ‖²,   J(D) = (1/N) Σ_{i=1}^N ‖ v_{i,D} − (1/N) Σ_{i=1}^N v_{i,D} ‖²    (38)

where D = T/T_s is the total number of time instants. The former reflects the whole control process, decreasing if the robots approach consensus more quickly, while the latter indicates the final flocking condition.

Main results are shown in the full paper. To further help understand and visualize this experiment, we show video snapshots of the robot swarm flocking process with the learned WD-GNN controller in Fig. 2. The robots move at random velocities initially (Fig. 2a), tend to move together (Fig. 2b), and are well coordinated at the end (Fig. 2c). The cost shown in the figure is the instantaneous velocity variation of the robots over the network.

Figure 2. Video snapshots of the robot swarm flocking process with the learned WD-GNN controller: (a) initial time, (b) intermediate time, (c) final time.

C Experiment on movie recommendation systems
We consider another experiment on movie recommendation systems to corroborate our model. The goal is to predict the rating a user would give to a specific movie [33]. We build the underlying graph as a movie similarity network, where nodes are movies and edge weights are the similarity strength between movies. The graph signal contains the ratings of movies given by a user, with missing values for the movies that are not rated. We train the WD-GNN to predict the rating of a movie of our choice, based on the ratings given to other movies.
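As a sketch of how such a support can be built (the Pearson-correlation construction is detailed in Section C.1 and follows [33]; the function name and the zeros-for-missing simplification are ours):

```python
import numpy as np

def movie_similarity_graph(R, k=10):
    """Build a movie-similarity support from a ratings matrix R (users x movies,
    with zeros for missing ratings): Pearson correlation between the movies'
    rating vectors, keeping only the k strongest edges per node."""
    M = R.shape[1]
    C = np.corrcoef(R, rowvar=False)        # movies x movies correlation matrix
    np.fill_diagonal(C, 0.0)                # no self-loops
    S = np.zeros((M, M))
    for m in range(M):
        top = np.argsort(C[m])[-k:]         # indices of the k most similar movies
        S[m, top] = C[m, top]
    return np.maximum(S, S.T)               # symmetrize the support
```

The returned matrix serves as the graph matrix S on which the graph filters and the GNN operate; a user's rating vector (with the target movie zeroed out) is the input graph signal.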
C.1 Implementation details
We use a subset of the MovieLens 100k dataset, keeping the users and movies with the largest numbers of ratings [34]. We compute the movie similarity as the Pearson correlation between rating vectors and keep the ten edges with the highest similarity for each node (movie) [33]. Each user is a graph signal, where the signal value on each node is the rating given by the user to the associated movie, with a zero value if that movie is not rated. The dataset is split into training and testing sets. The rating of the movie of our choice is extracted as a label and zeroed out in the graph signal.

We consider a WD-GNN comprising a graph filter and a single-layer GNN. Both components have G = 64 output features. All filters are of order K = 5 and the nonlinearity is the ReLU. A local readout layer follows to map the output features to a single scalar predicted rating. We train the WD-GNN with the ADAM optimizer (decaying factors β₁, β₂ and learning rate γ). The performance is measured with the root mean squared error (RMSE), and the results are averaged over random dataset splits.

C.2 Results
We first consider the WD-GNN without online retraining and compare it with the GNN and the graph filter (Table 2). We predict ratings for the movie Star Wars, which has the largest number of ratings.

Table 2: Average (std. deviation) of the root mean squared error for the WD-GNN, the online WD-GNN with increasing numbers of online updates, the GNN, and the graph filter, when training and testing on the same movie and when training on one movie and testing on another.

We see that the three architectures exhibit similar performance, with the WD-GNN performing best with the lowest RMSE. We again attribute this behavior to the increased learning ability of the WD-GNN obtained from its combined architecture. The second experiment considers transferability, where we train the architectures on one movie (Star Wars) and use the learned models to predict ratings for another movie (Contact). In this case, the problem scenario changes, which creates a mismatch between training and testing, and all architectures suffer performance degradations. The GNN is slightly better, followed closely by the WD-GNN and the graph filter.

We then run the WD-GNN with online learning. At testing time, we consider that the system gets feedback from the user after predicting the rating, which is used as the instantaneous label for online learning. While centralized online learning is available for recommendation systems, we keep in mind that the proposed WD-GNN can also be retrained online in a distributed manner. We consider an online procedure with multiple testing users where, for each user, the system updates the parameters based on the instantaneous signal and feedback label. Results are shown in Table 2. We observe significant performance improvements when training and testing on different movies. In this case, the testing scenario differs from the one used for training, so online learning adapts the WD-GNN to the new scenario and improves transferability. We remark that these improvements become more pronounced as the testing phase goes on. On the other hand, when training and testing on the same movie, online learning has little effect (a slight improvement) on the performance. This is because the problem scenario does not change much and the offline phase has already trained the WD-GNN well.

D Online Wide and Deep GNN evaluation
Fig. 3 details the graph filter (graph convolution) [(3) in full paper] of order K = 3 with F = 1 input feature and G = 1 output feature,

  B(X; S, B) = Σ_{k=0}^K b_k S^k X    (39)

with filter parameters B = {b_0, …, b_K}. In particular, the linear operator SX, also referred to as the graph shift operator, leverages the graph structure to process the graph signal: it assigns to each node the aggregated signal from its immediate neighbors and thus collects graph neighborhood information. Shifting X k times aggregates information from the k-hop neighborhood, yielding the k-shifted signal S^k X. With a set of parameters [b_0, …, b_K]^⊤ ∈ R^{K+1}, the graph filter generates a higher-level feature that accounts for shifted signals up to a neighborhood of radius K, and thus reflects a more complete picture of the network. As a shift-and-sum operation of the graph signal X over the graph structure S, the graph filter can be regarded as a convolution in the graph domain. If we further particularize S to a line graph and x to a signal sampled at time instants, the graph filter (39) reduces to the conventional convolution.

Algorithm 1 summarizes the proposed online learning algorithm for the WD-GNN.
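The reduction of (39) to the conventional convolution can be checked numerically: on a directed cycle (the periodic version of the line graph), the shift S delays the signal by one sample, so (39) coincides with a circular convolution. A minimal sketch, with arbitrary illustrative taps:

```python
import numpy as np

N, K = 8, 3
S = np.roll(np.eye(N), 1, axis=0)        # directed cycle: (S x)[n] = x[n-1 mod N]
b = np.array([0.5, 0.25, 0.15, 0.1])     # illustrative taps b_0, ..., b_K
x = np.arange(N, dtype=float)            # test signal

# Graph filter (39): shift-and-sum of x over the cycle graph.
graph_out = sum(bk * np.linalg.matrix_power(S, k) @ x for k, bk in enumerate(b))

# Ordinary circular convolution of x with the taps b.
conv_out = np.array([sum(b[k] * x[(n - k) % N] for k in range(K + 1))
                     for n in range(N)])

assert np.allclose(graph_out, conv_out)  # (39) matches the conventional convolution
```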
Figure 3. Graph filters perform successive local node exchanges with neighbors: the k-shifted signal S^k X collects information from the k-hop neighborhood (shown by increasing disks), and the shifted signals X, SX, …, S^K X are aggregated with the parameters [b_0, …, b_K]^⊤ to generate a higher-level feature that accounts for the graph structure up to a neighborhood of radius K.

Algorithm 1
Online Learning Algorithm for the WD-GNN

Input: parameters A†, B† learned offline by minimizing the ERM problem [(5) in full paper] over the training dataset, and online step sizes γ_t
Fix the deep part parameters A = A† and set the initial wide part parameters B_0 = B†
for t = 0, 1, … do
  Observe the instantaneous graph signal X_t, graph matrix S_t and loss function J_t(·)
  Compute the instantaneous loss J_t(Ψ(X_t; S_t, A†, B_t))
  if a decentralized implementation is required then
    Update the wide part parameters in a distributed manner:
    for i = 1, …, N do
      B_{i,t+1} = (1/(|N_i| + 1)) ( Σ_{j: n_j∈N_i} B_{j,t} + B_{i,t} ) − γ_t ∇_B J_{i,t}(Ψ(X_t; S_t, A†, B_{i,t}))
    end for
  else
    Update the wide part parameters in a centralized manner:
    B_{t+1} = B_t − γ_t ∇_B J_t(Ψ(X_t; S_t, A†, B_t))
  end if
end for
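The distributed branch of Algorithm 1 amounts to one neighborhood averaging step followed by a local gradient step. A minimal sketch of that update (the function name and data layout are ours; the local gradients are taken as given):

```python
import numpy as np

def distributed_wide_update(B_local, neighbors, grads, step):
    """One distributed wide-part update of Algorithm 1:
    B_{i,t+1} = (sum_{j in N_i} B_{j,t} + B_{i,t}) / (|N_i| + 1)
                - step * grad_B J_{i,t}.
    B_local: per-robot parameter arrays B_{i,t}; neighbors: neighbor index
    lists N_i; grads: per-robot local gradients of the instantaneous loss."""
    B_next = []
    for i, Bi in enumerate(B_local):
        avg = (sum(B_local[j] for j in neighbors[i]) + Bi) / (len(neighbors[i]) + 1)
        B_next.append(avg - step * grads[i])
    return B_next
```

With zero gradients, the update is pure consensus averaging: on a connected communication graph, repeated application drives the local copies B_{i,t} toward a common value, which is what keeps the online retraining consistent across robots.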