Graph-Adaptive Activation Functions for Graph Neural Networks
Bianca Iancu∗, Luana Ruiz†, Alejandro Ribeiro† and Elvin Isufi∗
∗Intelligent Systems Department, Delft University of Technology, Delft, The Netherlands
†Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA
e-mails: [email protected], {rubruiz, aribeiro}@seas.upenn.edu, E.Isufi[email protected]
ABSTRACT
Activation functions are crucial in graph neural networks (GNNs) as they allow defining a nonlinear family of functions to capture the relationship between the input graph data and their representations. This paper proposes activation functions for GNNs that not only adapt the nonlinearity to the graph, but are also distributable. To incorporate the feature-topology coupling into all GNN components, nodal features are nonlinearized and combined with a set of trainable parameters in a form akin to graph convolutions. The latter leads to a graph-adaptive trainable nonlinear component of the GNN that can be implemented directly or via kernel transformations, therefore enriching the class of functions to represent the network data. Whether in the direct or kernel form, we show permutation equivariance is always preserved. We also prove the subclass of graph-adaptive max activation functions is Lipschitz stable to input perturbations. Numerical experiments with distributed source localization, finite-time consensus, distributed regression, and recommender systems corroborate our findings and show improved performance compared with pointwise as well as state-of-the-art localized nonlinearities.
Index Terms — Activation functions; graph neural networks; graph signal processing; Lipschitz stability; permutation equivariance.
1. INTRODUCTION
Graph neural networks (GNNs) are parametric architectures suitable for learning a nonlinear mapping for data defined over graphs, such as social, sensor, and biological network data [1, 2]. By interweaving graph filters with pointwise nonlinearities, GNNs express the function map in a layered form and learn compositions of features that account for the data-topology coupling [3, 4]. Another property GNNs inherit from graph filters is the distributed implementation [5–7]. Distributed computation facilitates scalability and endows the system with robustness to failures of the processing unit. The latter is fundamental in applications involving consensus, optimization, and control [8–10].

Building on spectral graph theory, [11] defined graph convolutional neural networks by multiplying feature representations in the Laplacian eigenspace with trainable kernels. Subsequently, [3] used finite impulse response (FIR) graph filters to combine features in the vertex domain by means of a polynomial in the Laplacian matrix. The work in [4] follows the same idea but builds a polynomial filter in any graph representation matrix (e.g., adjacency, Laplacian). Differently from [11], [3, 4] are also readily distributable architectures with appropriate choices of graph pooling (i.e., not altering the graph structure; e.g., zero-padding) and with pointwise activation functions. On the other hand, [12] builds a GNN with distributable autoregressive moving average graph filters [6], which capture a broader family of functions at the expense of computation cost. Parallel to these efforts, [13] proposes attention-like mechanisms to adapt the edge weights to the task at hand. More recently, the work in [14] showed that all the above architectures are equivalent and fall under the framework of the edge varying GNN (EdgeNet). Altogether, these works capture the data-graph coupling only linearly through graph filters, while they ignore the coupling in the nonlinear pointwise component (e.g., ReLU). To improve the representation power of GNNs, [15] proposed localized activation functions that account for the graph topology by operating on node neighborhoods of different resolutions. However, the latter accounts only for the graph and not the data-topology coupling, since it ignores the edge weights and the data propagation between neighbors. Localized activation functions are also not distributable beyond the one-hop neighborhood, hence missing multi-hop information between nodes.

To address these limitations, we put forward a new family of activation functions that adapt to the data-topology coupling in the surrounding of a node. The nodal features obtained from graph filtering are shifted prior to local nonlinearization in a form akin to graph convolutions. These nonlinear features are subsequently combined with a set of trainable parameters to accordingly weigh the information at different neighborhood resolutions. The resolution radius is a design parameter and allows adapting the GNN nonlinear component to the task at hand. Besides being graph-adaptive and distributable, these activation functions preserve two properties of theoretical interest for GNNs, namely permutation equivariance and Lipschitz stability to perturbations [16]. Concretely, our contribution is threefold.
1. We develop a new family of nonlinearities for GNNs that are graph-adaptive to the surrounding of a node and distributable. The first class [Def. 3] nonlinearizes shifted features in the surrounding of a node in their direct form. The second class [Def. 6] transforms the shifted features with graph-adaptive kernels prior to nonlinearization.
2. We prove that: (a) the proposed nonlinearities are permutation equivariant [Prop. 1], i.e., the output of the respective GNN architecture is agnostic to node labeling; (b) the max graph-adaptive nonlinearity is Lipschitz stable to input perturbations [Prop. 2].
3. We propose distributed GNN tasks with graph-adaptive nonlinearities for source localization, finite-time consensus, signal denoising, and rating prediction in recommender systems.
2. GRAPH NEURAL NETWORKS
Consider a graph G = (V, E) with vertex set V of cardinality |V| = N and edge set E ⊆ V × V of cardinality |E| = M. An edge is a tuple e_ij = (i, j) connecting nodes i and j. The neighborhood of node i is the set of nodes N_i = {j | (i, j) ∈ E} connected to i. Associated to G is the graph shift operator (GSO) matrix S ∈ R^{N×N}, whose sparsity pattern matches the graph structure. That is, entry (i, j) satisfies [S]_{i,j} = s_{i,j} ≠ 0 only if i = j or (i, j) ∈ E. Commonly used GSOs include the adjacency matrix, the graph Laplacian, and their normalized and translated forms. On the vertices of G, we define a graph signal x ∈ R^N whose ith component is the value at node i. We consider applications where graph signals are processed in a distributed fashion. A typical example is in sensor networks without access to a centralized processing unit and where each sensor communicates only with its neighbor sensors.
Graph convolution. A graph convolution is defined as a graph filter H(S) that can be written as a polynomial of the GSO S [7]. For an input signal x and filter coefficients h = [h_0, ..., h_K]^⊤, the output y ∈ R^N of the graph convolutional filter is computed as

y = H(S) x = Σ_{k=0}^{K} h_k S^k x.    (1)

Due to the locality of S, graph convolutions can be run distributively. When building the output y, we need to compute the terms Sx, ..., S^K x. Since S is local, the operation Sx requires one-hop node exchanges and so, by writing S^k x = S(S^{k-1} x) = S x^{(k-1)}, node i can compute signal x^{(k)} through exchange of the previous shifted information x^{(k-1)} with its neighbors. This recursion allows for distributed communications and a computational cost of order O(MK), while the trainable parameters defining (1) are of order O(K) [7].
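To make the shift recursion concrete, the following is a minimal sketch (not the authors' code; variable names are mine) that computes (1) with one sparse shift per tap, which is exactly the operation each node would run distributively by exchanging values with its one-hop neighbors.

```python
# Minimal sketch of the FIR graph convolution (1) via the shift recursion.
import numpy as np
import scipy.sparse as sp

def graph_convolution(S: sp.spmatrix, x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Apply H(S)x = sum_{k=0}^{K} h[k] S^k x using repeated one-hop shifts."""
    y = h[0] * x           # k = 0 term: no communication needed
    x_k = x
    for k in range(1, len(h)):
        x_k = S @ x_k      # one round of neighbor exchanges: x^(k) = S x^(k-1)
        y = y + h[k] * x_k
    return y

# Usage with an illustrative random sparse GSO (a normalized adjacency in practice).
N, K = 40, 5
S = sp.random(N, N, density=0.1, format="csr")
x = np.random.randn(N)
h = np.random.randn(K + 1)
y = graph_convolution(S, x, h)
```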
Graph convolutional neural networks (GCNNs). We consider a GCNN of L graph convolutional layers followed by a shared fully connected layer per node. Each convolutional layer comprises a bank of graph filters [cf. (1)] and a nonlinearity. At layer l, the GCNN takes as input F_{l-1} features {x_{l-1}^g}_{g=1}^{F_{l-1}} from layer (l-1) and produces F_l output features {x_l^f}_{f=1}^{F_l}. Each input feature x_{l-1}^g is processed by a parallel bank of F_l graph filters {H_l^{fg}(S)}_f. The filter outputs are aggregated over the input index g to yield the fth convolved feature

z_l^f = Σ_{g=1}^{F_{l-1}} H_l^{fg}(S) x_{l-1}^g = Σ_{g=1}^{F_{l-1}} Σ_{k=0}^{K} h_{lk}^{fg} S^k x_{l-1}^g,  for f = 1, ..., F_l.    (2)

The convolved feature z_l^f is subsequently passed through an activation function σ(·) to obtain the fth convolutional layer output

x_l^f = σ(z_l^f),  for f = 1, ..., F_l.    (3)

The output features of the last convolutional layer L, x_L^1, ..., x_L^{F_L}, represent the final convolutional features. These features are interpreted as a collection of F_L graph signals, where on node i we have the F_L × 1 feature vector χ_{Li} = [x_{Li}^1, ..., x_{Li}^{F_L}]^⊤. Each node locally combines the features χ_{Li} with a one-layer perceptron to obtain the output

ỹ = H_FC χ_{Li}    (4)

where matrix H_FC ∈ R^{F_o × F_L} maps the F_L convolutional features to the F_o output features (e.g., the number of classes). The parameters in H_FC are shared among nodes to keep the number of trainable parameters independent of the graph dimensions (i.e., N and M), and only dependent on the filter order and the number of features and layers.

By grouping all learnable parameters into the set H = {h_l^{fg}; H_FC}_{lfg}, we can consider the GCNN as a map Φ(·) that takes as input a graph signal x, a GSO S, and a set of parameters H to produce the output

Φ(x; S; H) := ỹ.    (5)

The output (5) is computed for a training set T = {(x, y)} of |T| pairs, where y are the target representations.
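To illustrate (2)-(3), here is a minimal NumPy sketch of a single convolutional layer: a bank of FIR graph filters aggregated over the input features, followed by a pointwise nonlinearity. Shapes and variable names are illustrative and not taken from the paper's implementation.

```python
# One GCNN layer: filter bank (2) followed by a pointwise activation (3).
import numpy as np

def gcnn_layer(S, X_in, H, sigma=lambda z: np.maximum(z, 0.0)):
    """
    S:    (N, N) graph shift operator
    X_in: (N, F_in) input features, one graph signal per column
    H:    (K+1, F_in, F_out) filter taps h_{lk}^{fg}
    """
    K = H.shape[0] - 1
    N, _ = X_in.shape
    F_out = H.shape[2]
    Z = np.zeros((N, F_out))
    X_k = X_in                      # S^0 X_in
    for k in range(K + 1):
        Z += X_k @ H[k]             # sum over input features g for tap k
        X_k = S @ X_k               # next shift, one more hop of exchanges
    return sigma(Z)                 # pointwise nonlinearity
```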
Activation functions. The activation function σ(·) in (3) can be any of the conventional pointwise activation functions, such as the ReLU σ(x) = max(0, x), or a localized activation function [15]. Differently from pointwise ones, localized activation functions consider the features at the neighborhood of each node i in the nonlinear GCNN component [15]. For a graph signal feature x, the localized activation function is based on two local operators, namely:
• the local max operator, max(S, x), whose output is a graph signal z with ith entry being the maximum value of the signal in the neighborhood, i.e., z_i = [max(S, x)]_i = max({x_j : v_j ∈ N_i});
• the local median operator, med(S, x), whose output is a graph signal z with ith entry being the median value of the signal in the neighborhood, i.e., z_i = [med(S, x)]_i = med({x_j : v_j ∈ N_i}).
For simplicity, we denote both local operators with the generic local function f(S, x). Then, the localized activation function is defined as

σ(x) = β ReLU(x) + Σ_{k=1}^{K} h_k^σ f(S^k, x),    (6)

where f(S^k, x) applies the local activation function to the signal values of the k-hop neighbors, and the parameters β and h^σ = [h_1^σ, ..., h_K^σ]^⊤ are learned [15]. A GCNN with localized activation functions can thus be written as the map Φ(x; S; H) with parameters H = {h_l^{fg}; h_l^{fσ}; H_FC}_{lfg}. As it follows from (6), localized activation functions ignore the edge weights and require information from the non-immediate k-hop neighbors, which makes them not distributable. Hence, in distributed settings, the order K in (6) is limited to one. To address this limitation, we propose two new activation functions based on local operators and kernel functions that account for the graph structure and are distributable.
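For reference, a minimal sketch (my own variable names, not the code of [15]) of the localized activation (6) with the local max operator; note that computing the k-hop maximum requires reading values of non-immediate neighbors, which is what prevents a distributed implementation for K > 1.

```python
# Localized activation (6) of [15] with the local max operator.
import numpy as np

def khop_max(S, x, k):
    """[max(S^k, x)]_i = max of x over the k-hop neighborhood of node i.
    The k-hop neighborhoods are read off the sparsity pattern of S^k
    (which may include the node itself)."""
    Sk = np.linalg.matrix_power(S, k)
    return np.array([x[Sk[i] != 0].max() if np.any(Sk[i] != 0) else x[i]
                     for i in range(len(x))])

def localized_activation(S, x, beta, h_sigma):
    """sigma(x) = beta * ReLU(x) + sum_k h_sigma[k-1] * max(S^k, x)."""
    out = beta * np.maximum(x, 0.0)
    for k, hk in enumerate(h_sigma, start=1):
        out = out + hk * khop_max(S, x, k)
    return out
```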
3. GRAPH-ADAPTIVE ACTIVATION FUNCTIONS
In this section, we first define the graph-adaptive localized activation functions, which are based on arbitrary nonlinear operators acting on the one-hop neighborhood of a node (Section 3.1). Then, we define the graph-adaptive kernel activation functions (Section 3.2). Finally, we prove the proposed nonlinearities are permutation equivariant and stable to input perturbations (Section 3.3).
3.1. Graph-Adaptive Localized Activation Functions
To start, let us first define the basic building block for graph-adaptive activation functions: the shifted localized operator (SLO).
Definition 1 (Shifted Localized Operator). Let G be an N-node graph with shift operator S, x a signal, and S^k x the kth shifted signal. Consider an arbitrary nonlinear localized function f(·, N_i) : R^N → R^N, which at node i computes the local nonlinear operation [f(x, N_i)]_i = f({x_j}_{j∈N_i}). For this choice of f(·, N_i), the k-hop shifted localized operator maps input x to output z ∈ R^N as

z_i = [f(S^k x, N_i)]_i = f({[S^k x]_j : j ∈ N_i}).    (7)

That is, the SLO shifts the signal k times to obtain S^k x, and then replaces the value of this signal at each node i by a nonlinear aggregation f(·, N_i) of the signal values within the one-hop neighborhood of i. The SLO utilizes information locally available at each node to account for the signal-topology coupling for nodes that are k hops away. We can now define graph-adaptive nonlinear graph filters as follows.

Definition 2 (Shifted Localized Graph Filter). Consider the shifted localized operator induced by an arbitrary nonlinear localized function f(·, N_i) [cf. Def. 1], and let h^σ = [h_1^σ, ..., h_K^σ]^⊤ be a vector of parameters. The output of the shifted localized graph filter applied to signal x, w.r.t. the shift operator S, is the signal z ∈ R^N with ith entry

z_i = Σ_{k=1}^{K} h_k^σ [f(S^k x, N_i)]_i.    (8)

Definition 2 implies the output of a shifted localized graph filter is a linear combination of the SLOs f(S^k x, N_i) at different resolutions. Hence, shifted localized graph filters inherit the localization property of SLOs, as they incorporate the graph structure up to K hops away while accessing only neighboring information. These nonlinear filters can be employed to define graph-adaptive localized activation functions.

Definition 3 (Graph-Adaptive Localized Activation Function). Consider a scalar β and a vector h^σ = [h_{l1}^{fσ}, ..., h_{lK}^{fσ}]^⊤ of learnable parameters. At layer l, the graph-adaptive localized activation function maps the linear features z_l^f [cf. (2)] to the output features x_l^f following the recursion

[x_l^f]_i = β ReLU([z_l^f]_i) + Σ_{k=1}^{K} h_{lk}^{fσ} [f(S^k z_l^f, N_i)]_i.    (9)

Definition 3 combines the pointwise ReLU nonlinearity and the shifted localized graph filters [cf. Def. 2] into a single graph-adaptive localized nonlinearity for GNNs. The latter is distributable and localized because, even though the resolution (given by the shift order) can be arbitrarily large, the SLO f(·, N_i) [cf. Def. 1] operates only in the one-hop neighborhood. In Section 4, we evaluate this activation function for f(·, N_i) being the max and the median, leading to the graph-adaptive max and median activation functions, respectively.
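A minimal sketch of the graph-adaptive max activation (9), assuming a dense NumPy GSO and a precomputed neighbor list (both my own choices, not the released implementation): the feature is first shifted k times, which is distributable at one hop per exchange, and only then max-aggregated over the one-hop neighborhood.

```python
# Graph-adaptive localized max activation (Def. 3, eq. (9)).
import numpy as np

def slo_max(S, z, k, neighbors):
    """Shifted localized operator (Def. 1) with f = max.
    Assumes every node has at least one neighbor."""
    zk = np.linalg.matrix_power(S, k) @ z              # k-th shifted signal S^k z
    return np.array([zk[neighbors[i]].max() for i in range(len(z))])

def graph_adaptive_max(S, z, beta, h_sigma, neighbors):
    """[x]_i = beta ReLU([z]_i) + sum_k h_sigma[k-1] [max(S^k z, N_i)]_i."""
    out = beta * np.maximum(z, 0.0)
    for k, hk in enumerate(h_sigma, start=1):
        out = out + hk * slo_max(S, z, k, neighbors)
    return out
```

In a distributed deployment, the shifts S^k z would be obtained with the same neighbor-exchange recursion used for the graph convolution, so each additional resolution costs one extra round of exchanges.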
3.2. Graph-Adaptive Kernel Activation Functions
The graph-adaptive kernel activation functions replace the localized nonlinear function f(·, N_i) by a localized kernel to enrich the representation power. Let x_i^{(k)} ∈ R^{|N_i|} denote the vector containing |N_i| copies of the kth shifted signal at node i, [S^k x]_i, i.e., x_i^{(k)} = 1_{|N_i|} ⊗ [S^k x]_i, where 1_{|N_i|} is the vector of ones of dimension |N_i| and ⊗ is the Kronecker product. Additionally, consider the vector containing the values at the neighbors j ∈ N_i of the kth shifted signal S^k x, i.e., x_{j∈N_i}^{(k)} = [S^k x]_{j∈N_i}. With this notation in place, we define a graph kernel operator as follows.

Definition 4 (Kernel Operator). Let G be an N-node graph with shift operator S, x a signal, and S^k x the kth shifted signal. Consider an arbitrary kernel function g(·, N_i) : R^{|N_i|} → R^{|N_i|}, which at node i computes the nonlinear local operation [g(x, N_i)]_i = g(x̃_i, x_{j∈N_i}), where x̃_i = 1_{|N_i|} ⊗ [x]_i is a vector of dimensionality |N_i| containing copies of signal x at node i. The k-hop shifted kernel operator mapping from x to z ∈ R^N has the entries

z_i = [g(S^k x, N_i)]_i := g(x_i^{(k)}, x_{j∈N_i}^{(k)}).    (10)

Definition 4 shows the kernel operator first shifts the input signal x as S^k x and then replaces the signal value at each node i by the kernel value g(·, N_i) in the one-hop neighborhood of i. Thus, the kernel operator employs only local information at each node to account for the signal-topology coupling up to k hops away from a node. For the kernel function g(·, N_i) we will employ the Gaussian kernel

g(x, y) = exp(−‖x − y‖/γ),    (11)

where the scalar γ is tunable. We can now define kernel graph filters.

Definition 5 (Kernel Graph Filter). Consider a kernel operator [cf. Def. 4] with kernel function g(·, N_i) and let h^σ = [h_1^σ, ..., h_K^σ]^⊤ be a vector of parameters. The output of the kernel graph filter applied to signal x, w.r.t. the shift operator S, is the signal z ∈ R^N with ith entry

z_i = Σ_{k=1}^{K} h_k^σ [g(S^k x, N_i)]_i.    (12)

Definition 5 implies the output of the kernel graph filter is a linear combination of the kernel operator applied to each k-shifted signal S^k x for 1 ≤ k ≤ K. Kernel graph filters thus preserve the localization properties of kernel operators, i.e., they account for the topology of the graph up to K hops away while accessing only information in the one-hop neighborhood. These kernel graph filters can be further employed to define the graph-adaptive kernel activation function as follows.

Definition 6 (Graph-Adaptive Kernel Activation Function). Consider a scalar β and a vector h^σ = [h_{l1}^{fσ}, ..., h_{lK}^{fσ}]^⊤ of learnable parameters. At layer l, the graph-adaptive kernel activation function maps the linear features z_l^f [cf. (2)] to the output features x_l^f following the recursion

[x_l^f]_i = β ReLU([z_l^f]_i) + Σ_{k=1}^{K} h_{lk}^{fσ} [g(S^k z_l^f, N_i)]_i.    (13)

Definition 6 combines the pointwise ReLU and the kernel graph filters [cf. Def. 5] into a single graph-adaptive kernel activation function. This activation function is distributable and localized because, even though the resolution (given by the shift order) can be arbitrarily large, the kernel g(·, N_i) operates only in the one-hop neighborhood.
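A minimal sketch of the graph-adaptive kernel activation (13) under the same assumptions as the previous snippet (dense GSO, precomputed neighbor lists, my own names); as I read (10)-(11), the Gaussian kernel compares the k-shifted value at a node against the k-shifted values of its one-hop neighbors.

```python
# Graph-adaptive kernel activation (Def. 6, eq. (13)) with the Gaussian kernel (11).
import numpy as np

def kernel_op(S, z, k, neighbors, gamma):
    """Kernel operator (10): compare [S^k z]_i with [S^k z]_j for j in N_i."""
    zk = np.linalg.matrix_power(S, k) @ z          # k-th shifted signal S^k z
    out = np.empty(len(z))
    for i, nbrs in enumerate(neighbors):
        diff = zk[i] - zk[nbrs]                    # node value vs. neighbor values
        out[i] = np.exp(-np.linalg.norm(diff) / gamma)
    return out

def graph_adaptive_kernel(S, z, beta, h_sigma, neighbors, gamma):
    """[x]_i = beta ReLU([z]_i) + sum_k h_sigma[k-1] [g(S^k z, N_i)]_i."""
    out = beta * np.maximum(z, 0.0)                # pointwise ReLU term
    for k, hk in enumerate(h_sigma, start=1):
        out = out + hk * kernel_op(S, z, k, neighbors, gamma)
    return out
```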
In both proposed activation functions, the coefficients {β, h^σ} are trainable, meaning these nonlinearities adapt the multi-hop resolution weights to the task at hand. Because these coefficients are shared among nodes, the number of parameters to learn for a graph-adaptive activation function is independent of the graph size. This allows GCNNs to scale. Note that even though the nonlinear functions f(·, N_i) or the kernel functions g(·, N_i) act only on the one-hop neighborhood, they are applied to the shifted signals S^k x; therefore, they account for the feature-graph coupling (up to K hops away) in a nonlinear fashion. This is an advantage over traditional GCNNs with pointwise nonlinearities, in which the graph topology is only incorporated through the linear encodings generated by graph convolutions.

Definitions 3 and 6 implement fully graph-adaptive GCNNs that, at each layer, apply a graph convolution followed by a graph-adaptive activation function. The distributed GCNN is given by the map

Φ(x; S, H, W) := ỹ.    (14)

The GCNN output now depends on both the coefficients H [cf. (5)] and the nonlinear activation function coefficients W = {h_l^{fσ}}_{lf} ∪ {β}.

A key property GCNNs with pointwise activation functions inherit from graph convolutions is permutation equivariance [15]. The output of a GCNN is equivariant to node relabeling and, more importantly, GCNNs exploit graph symmetries to generalize learned representations to different graph signals that share some of these symmetries. Herein, we show that permutation equivariance also holds for graph-adaptive nonlinearities. We will also discuss a property that is specific to the graph-adaptive localized max activation: Lipschitz stability to input perturbations.
Permutation equivariance.
Consider the graph convolutional filter H(S) [cf. (1)] and let P be an N × N permutation matrix satisfying P^⊤ P = P P^⊤ = I. If we permute the GSO S and the input x respectively as S′ = P^⊤ S P and x′ = P^⊤ x, we get the corresponding graph convolution output

y′ = H(S′) x′ = H(P^⊤ S P) P^⊤ x = P^⊤ H(S) P P^⊤ x = P^⊤ y.    (15)

Because pointwise activation functions are scalar and by definition permutation equivariant, (15) implies GCNNs with pointwise nonlinearities are equivariant to node relabelings. For GCNNs with graph-adaptive activation functions, it is then desirable to retain this property. This is guaranteed by the following proposition.

Proposition 1 (Permutation equivariance). Consider a graph signal x defined on an N-node graph G with GSO S. Let Φ(x; S, H, W) be the output of a GCNN with graph-adaptive activation functions [cf. (14)] and let P be an N × N permutation matrix. The GNN Φ(x; S, H, W) satisfies

Φ(P^⊤ x; P^⊤ S P, H, W) = P^⊤ Φ(x; S, H, W),    (16)

i.e., GNNs with graph-adaptive activation functions are permutation equivariant.

Proof.
For the proof, we refer the reader to the Appendix.
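As a quick sanity check (not part of the paper), the following snippet verifies (16) numerically for the graph-adaptive max activation on a small ring graph; names and parameter values are illustrative.

```python
# Numerical check of permutation equivariance (16) for the graph-adaptive max activation.
import numpy as np

def ga_max(S, x, beta, h_sigma, nbrs):
    # compact version of the graph-adaptive max activation (9)
    out = beta * np.maximum(x, 0.0)
    for k, hk in enumerate(h_sigma, start=1):
        xk = np.linalg.matrix_power(S, k) @ x
        out += hk * np.array([xk[n].max() for n in nbrs])
    return out

rng = np.random.default_rng(0)
N = 8
S = np.zeros((N, N))
for i in range(N):                       # ring graph: every node has two neighbors
    S[i, (i + 1) % N] = S[(i + 1) % N, i] = 1.0
nbrs = [np.nonzero(S[i])[0] for i in range(N)]
x = rng.standard_normal(N)
beta, h_sigma = 0.5, np.array([0.3, -0.2])

P = np.eye(N)[rng.permutation(N)]        # permutation matrix
S_p = P.T @ S @ P
nbrs_p = [np.nonzero(S_p[i])[0] for i in range(N)]

z = ga_max(S, x, beta, h_sigma, nbrs)
z_p = ga_max(S_p, P.T @ x, beta, h_sigma, nbrs_p)
print(np.allclose(z_p, P.T @ z))         # expected: True
```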
Lipschitz stability.
In addition to permutation equivariance, the graph-adaptive max nonlinearity is Lipschitz stable to input perturbations with respect to the infinity norm ‖·‖_∞, as stated in the following proposition.

Proposition 2 (Lipschitz stability). Let G be a graph with GSO S. Assume that S is normalized by its largest eigenvalue so that its spectral norm ρ(S) equals one. Let x be a graph signal and let x̃ be a perturbation of x. The output of the graph-adaptive max activation function

[z]_i = β ReLU([x]_i) + Σ_{k=1}^{K} h_k^σ [max(S^k x, N_i)]_i    (17)

with coefficients |h_k^σ| ≤ C is Lipschitz stable to input perturbations in the infinity norm ‖·‖_∞. That is, there exists a constant L_σ such that

‖z̃ − z‖_∞ ≤ L_σ ‖x̃ − x‖_∞    (18)

where L_σ = |β| + KC max_k ‖S^k‖_∞.

Proof.
For the proof, we refer the reader to the Appendix.

Proposition 2 implies the graph-adaptive max activation is Lipschitz stable at each node. Lipschitz stability is crucial to make learning more robust. For instance, in classification problems, a GNN with graph-adaptive max nonlinearities is more likely to classify correctly a perturbed signal x̃ than a GNN with non-Lipschitz activation functions. The Lipschitz constant depends on the coefficient β, the number of filter taps K, the weights h_k^σ (through C), and the graph (through max_k ‖S^k‖_∞). While we may not have full control over max_k ‖S^k‖_∞, β and K are design parameters, and so is the maximum value of the coefficients h_k^σ. The Lipschitz constant of graph-adaptive max nonlinearities is thus tunable. This represents an advantage compared to conventional pointwise activation functions, which are stable but have fixed Lipschitz constants.
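To make the bound concrete, a short sketch (illustrative, not from the paper) that evaluates L_σ = |β| + KC max_k ‖S^k‖_∞ for a GSO normalized by its largest eigenvalue.

```python
# Evaluate the Lipschitz constant of Proposition 2 for a given GSO and coefficients.
import numpy as np

def lipschitz_bound(S, beta, h_sigma):
    K = len(h_sigma)
    C = np.max(np.abs(h_sigma))                       # |h_k^sigma| <= C
    norms = [np.linalg.norm(np.linalg.matrix_power(S, k), ord=np.inf)
             for k in range(1, K + 1)]                # max row sums of S^k
    return np.abs(beta) + K * C * max(norms)

# Usage: complete-graph adjacency normalized by its largest eigenvalue (rho(S) = 1).
A = np.ones((4, 4)) - np.eye(4)
S = A / np.max(np.abs(np.linalg.eigvals(A)))
print(lipschitz_bound(S, beta=0.5, h_sigma=np.array([0.3, -0.2])))
```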
4. NUMERICAL EXPERIMENTS
We evaluate the performance of six activation functions: ReLU, the localized activation functions (max and median) [15], and our proposed graph-adaptive localized (max and median) and kernel activation functions. Our goal is to highlight the benefits and limitations of the different nonlinearities in applications requiring distributed computations, with both synthetic and real data. To train the GCNNs we used the ADAM optimizer. As the GSO, we employ the adjacency matrix normalized by its maximum eigenvalue. For the graph-adaptive kernel nonlinearity, the parameter γ in (11) is set to a fixed value.

4.1. Source localization
We consider a diffusion process over a graph of N = 40 nodes divided into C = 4 communities. The goal is to determine the source community of a given diffused signal locally at a selected node. The graph is an undirected stochastic block model (SBM) with intra- and inter-community probabilities p and q, respectively. The graph signals are defined as Kronecker deltas δ_c ∈ R^N centered at a source node c and diffused at a timestamp t, i.e., x_t = S^t δ_c. We choose as source node c each of the nodes, thus generating a data set of graph signals, which we split into training, validation, and test sets. We simulate 10 different graphs and generate 10 different splits per graph. Training and testing are performed for the highest connected node of each community, resulting in four nodes.

Table 1 shows the classification accuracy for a two-layer GCNN with different numbers of features F. For the graph-adaptive nonlinearities, we carried out the experiments with resolutions K = 1 and K = 2. We only report the results for the better performing filter order, as the rest were comparable to the localized nonlinearities from [15]. We observe that both the localized nonlinearities and the proposed graph-adaptive nonlinearities significantly outperform ReLU. This result highlights the benefits of accounting for the graph topology during classification. Moreover, the graph-adaptive max and median activation functions outperform their localized versions, confirming the advantage of accounting for further away data-graph coupling. The max nonlinearities achieve a higher accuracy than the medians in both the localized and graph-adaptive localized nonlinearities. This result could be caused by the fact that the median overall smooths the signal, hence undermining some local variations important for classification. Additionally, this could also explain the lower performance of the graph-adaptive kernel nonlinearities compared to the localized nonlinearities, which might be affected by possible redundancies in the extra information coming from neighbors.

4.2. Finite-time consensus
Distributed finite-time consensus aims to achieve consensus among all nodes in finite time by accessing only local information at each node. We consider learning the distributed consensus function in a data-driven fashion over an undirected SBM graph with N = 100 nodes divided into C = 5 communities with intra- and inter-community probabilities p and q, respectively. The graph signals are generated from a normal distribution N(0, I). We generate the samples and split them into training, validation, and test sets.
The code can be found at https://github.com/bianca26/graph-adaptive-activation-functions-gnns.

Table 1: Source localization test accuracy for different numbers of features F per layer. L.: localized nonlinearities [cf. (6)]; G.A.: graph-adaptive nonlinearities [cf. (9) and (13)]; the filter order K is given in brackets. Compared nonlinearities: ReLU, Max L., Max G.A. (2), Median L., Median G.A. (2), Kernel G.A. (1).

We average the performance across the different graph realizations and the different data splits for each graph. We consider a two-layer GCNN with F = 32 features per layer followed by a per-node fully connected layer. We employ various filter orders K. The evaluation metric is the RMSE.

Figure 1a shows the RMSE as a function of the filter order for the different nonlinearities. All GCNNs achieve a lower RMSE compared with the FIR graph filter. For the lowest order K = 20, ReLU yields a worse RMSE than the localized and graph-adaptive nonlinearities. Once the filter order increases, and thus the degrees of freedom, adding a parametric nonlinearity seems to be less beneficial because the network has enough degrees of freedom in the filter to model the consensus function. We also experiment with the robustness of the different models to link losses by removing graph edges with different probabilities, following the random edge sampling model of [17]. For each method, we considered the best performing setup. From the trained graph G, we randomly removed edges with increasing probabilities. The results are shown in Figure 1b, averaged across realizations. Although all models deteriorate when the link losses increase, graph-adaptive nonlinearities handle the stochasticity better. The kernel nonlinearity seems to be the most sensitive, as its performance reaches that of the other graph-adaptive alternatives.

4.3. Distributed regression
We perform distributed regression using the Molene dataset, which contains hourly temperature measurements of N = 32 stations over T = 744 hours recorded in January 2014 in the area of Brest (France). Using the node (station) coordinates, we generate a weighted geometric graph with the ten-nearest-neighbor approach proposed in [18]. We consider as graph signals the measurements taken at the different timestamps, one graph signal per hour. On top of the original signals we add zero-mean noise with a signal-to-noise ratio (SNR) of 3 dB. These noisy signals are split into training, validation, and test sets. Our goal is to train a GCNN to remove the noise distributively. We employ a GCNN with one layer, a varying number of features F, and varying filter orders K. We perform the training for 500 epochs with a batch size of 100 samples. We employ the RMSE as the evaluation metric. The final results are averaged across 20 different splits of the data set.

Figure 1c shows the RMSE as a function of the filter order for the different nonlinearities. Across all GCNNs, the best performance was achieved for the highest number of features, four, so we only report the results for this setup; in the other setups, the performances were comparable. All GCNNs perform better than the FIR filter, but the difference is more significant for the lowest filter order K = 1, especially in the case of the graph-adaptive localized nonlinearities.
This finding suggests their applicability in situations where communication resources are limited. To further test this hypothesis, we experimented with different levels of noise added to the data. For each method, we considered the setup with the lowest filter order K = 1. The results in Figure 1d show that the graph-adaptive and localized nonlinearities outperform or achieve comparable results to ReLU. The general trend shows an increase in performance when the SNR becomes larger, with a more significant increase for the graph-adaptive localized nonlinearities. The performance of the graph-adaptive kernel nonlinearity suffers in this scenario, as it requires higher filter orders compared to the rest. We suggest using higher orders in the latter case to fully exploit the kernel power.
Fig. 1. (a) Root mean square error (RMSE) of the GCNNs and FIR graph filters for distributed finite-time consensus as a function of the filter order. (b) Robustness of the GCNNs and FIR graph filters for distributed finite-time consensus as a function of the link-loss probability. (c) RMSE of the GCNNs and FIR graph filters for distributed regression as a function of the filter order. (d) Robustness of the GCNNs and FIR graph filters for distributed regression as a function of the SNR. L.: localized nonlinearities [cf. (6)]; G.A.: graph-adaptive nonlinearities [cf. (9) and (13)]. K denotes the filter order.

Table 2: Average test RMSE over five train-test splits for the movies Toy Story, Contact, and Return of the Jedi. L.: localized nonlinearities [cf. (6)]; G.A.: graph-adaptive nonlinearities [cf. (9) and (13)].
Compared nonlinearities in Table 2: ReLU, Max L., Max G.A., Median L., Median G.A., Kernel G.A.

4.4. Recommender systems
We implement a GNN-based recommender system by considering a U × M rating matrix R containing 100,000 ratings given by U = 943 users to M = 1682 movies in the MovieLens 100k dataset [19]. The entries [R]_{um} are the ratings between 1 and 5 if user u has rated movie m, and 0 otherwise. We interpret the rows of R, i.e., the user rating vectors r_u, as graph signals on an M-node movie similarity network. The graph signals are split into 90% training and 10% test sets, and the movie similarity network is built by computing pairwise correlations between movie rating vectors (i.e., columns of R) containing only ratings from users in the training set. The GNN is trained to predict user ratings to a movie m. This is achieved by zeroing out the ratings to movie m in the input graph signals r_u, feeding them to the GNN to generate the rating prediction [r̂_u]_m, and minimizing the smooth ℓ1 loss |[r̂_u]_m − [r_u]_m|. We consider three graph-adaptive GNNs employing the one-hop max, one-hop median, and one-hop kernel graph-adaptive nonlinearities to highlight the impact of immediate neighboring information, hence making the recommendation more localized over items. They are compared with GNNs containing ReLU activations and the one-hop max and median activations from [15]. All GNNs consist of L = 1 layer and F = 32 features, using graph convolutional filter banks with K = 5 filter taps each. We train all GNNs for the movies Toy Story, Contact, and Return of the Jedi. The average test RMSEs over five random train-test splits for each movie are reported in Table 2.

We observe that the graph-adaptive max activation function outperforms the other nonlinearities for all three movies. In particular, the graph-adaptive max fares better than both the ReLU and its localized counterpart. The graph-adaptive median also outperforms the localized median for the movie Contact, and achieves comparable performance for the other movies. As for the graph-adaptive kernel activation, it performs similarly to the ReLU and does not provide much of an improvement.
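A minimal sketch of the data preparation described above (my own reading of the setup, not the released code): a movie-similarity GSO obtained from pairwise correlations of the training rating columns, and an input signal with the target movie's rating zeroed out. The correlation over zero-filled columns is a simplification of the paper's construction from training-set ratings.

```python
# Build a movie-similarity GSO and a zeroed input signal for rating prediction.
import numpy as np

def movie_similarity_gso(R_train: np.ndarray) -> np.ndarray:
    """R_train: (U, M) rating matrix with 0 for missing ratings."""
    C = np.corrcoef(R_train, rowvar=False)            # (M, M) pairwise movie correlations
    np.fill_diagonal(C, 0.0)                          # no self-loops
    C = np.nan_to_num(C)                              # guard against constant columns
    return C / np.abs(np.linalg.eigvals(C)).max()     # normalize by largest eigenvalue

def make_input(r_u: np.ndarray, m: int):
    """Zero out the rating of movie m; the true rating is the regression target."""
    x = r_u.copy()
    target = x[m]
    x[m] = 0.0
    return x, target
```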
5. CONCLUSIONS
We proposed a new family of graph-adaptive activation functions for GNNs that capture the graph topology while also being distributable. These activation functions incorporate the data-topology coupling into all the GNN components by combining nonlinearized features from neighboring nodes with a set of trainable parameters. These parameters adapt the information coming from neighborhoods of different resolutions to the task at hand, hence aiding learning. The proposed graph-adaptive activation functions preserve permutation equivariance, and the graph-adaptive max activation function is Lipschitz stable to input perturbations. Graph-adaptive nonlinearities were compared with GCNNs employing localized and pointwise nonlinearities in four different problems based on both synthetic and real-world data, showing improved performance compared to pointwise and other state-of-the-art localized nonlinearities. Future work will be on two fronts: characterizing the stability of the proposed activation functions to perturbations in the topology, and performing the learning distributively.
APPENDIX
Proof of Prop. 1. Let S′ = P^⊤ S P be the permuted graph and x′ = P^⊤ x the permuted signal. From (15), the output of the graph convolution is equivariant to the action of P. Hence, we only need to prove permutation equivariance of the graph-adaptive activation functions in (9) and (13). We write their output as the signal z with entries

[z]_i = β ReLU([x]_i) + Σ_{k=1}^{K} h_k^σ [g(S^k x, N_i)]_i    (19)

where g(·, N_i) denotes either a shifted localized operator [cf. Def. 1] or a kernel operator [cf. Def. 4]. Applying the activation function in (19) to the permuted signal x′, we obtain

[z′]_i = β ReLU([P^⊤ x]_i) + Σ_{k=1}^{K} h_k^σ [g((P^⊤ S P)^k P^⊤ x, N_i)]_i.    (20)

Since the ReLU activation function is pointwise, it is permutation equivariant, i.e., ReLU(P^⊤ x) = P^⊤ ReLU(x). We then focus on the second term of the sum, where we observe that (P^⊤ S P)^k = P^⊤ S P P^⊤ S P ··· P^⊤ S P = P^⊤ S^k P, which implies (P^⊤ S P)^k P^⊤ x = P^⊤ S^k x. We can rewrite z′ as

[z′]_i = β [P^⊤ ReLU(x)]_i + Σ_{k=1}^{K} h_k^σ [g(P^⊤ S^k x, N_i)]_i.    (21)

Because the function g(·, N_i) is localized, it acts on the one-hop neighborhood of each node, which is preserved under node relabelings. Therefore, g(·, N_i) is permutation equivariant and (21) becomes

[z′]_i = β [P^⊤ ReLU(x)]_i + Σ_{k=1}^{K} h_k^σ [P^⊤ g(S^k x, N_i)]_i = [P^⊤ β ReLU(x)]_i + [P^⊤ Σ_{k=1}^{K} h_k^σ g(S^k x, N_i)]_i.

Therefore z′ = P^⊤ z and, hence, GNNs with graph-adaptive activation functions are permutation equivariant.

Proof of Prop. 2. Let x̃ be a perturbed input with ith entry [x̃]_i = [x]_i + ε_i. Denoting by z̃ the output obtained by applying the graph-adaptive max activation function to x̃, we can write

‖[z̃]_i − [z]_i‖ ≤ ‖β(ReLU([x̃]_i) − ReLU([x]_i))‖ + ‖Σ_{k=1}^{K} h_k^σ ([max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i)‖    (22)

which is obtained by grouping terms and applying the triangle inequality. The ReLU activation is Lipschitz stable with constant one [20], and so

‖β(ReLU([x̃]_i) − ReLU([x]_i))‖ ≤ |β| ‖[x̃]_i − [x]_i‖ = |β| ‖[x̃ − x]_i‖.    (23)

For the second part of the sum in (22), we have

‖Σ_{k=1}^{K} h_k^σ ([max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i)‖ ≤ Σ_{k=1}^{K} |h_k^σ| ‖[max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i‖

which follows from the Cauchy-Schwarz inequality.
Observe that, for any two functions f(·) and g(·), we can write the inequality max(f) = max(f − g + g) ≤ max(f − g) + max(g), and so

Σ_{k=1}^{K} |h_k^σ| ‖[max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i‖ ≤ Σ_{k=1}^{K} |h_k^σ| ‖[max(S^k (x̃ − x), N_i)]_i‖.

We proceed by noting that

‖[max(S^k (x̃ − x), N_i)]_i‖ ≤ ‖max_i [S^k (x̃ − x)]_i‖ ≤ max_i ‖[S^k (x̃ − x)]_i‖ = ‖S^k (x̃ − x)‖_∞

which allows us to write

‖Σ_{k=1}^{K} h_k^σ ([max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i)‖ ≤ Σ_{k=1}^{K} |h_k^σ| ‖S^k (x̃ − x)‖_∞ ≤ Σ_{k=1}^{K} |h_k^σ| ‖S^k‖_∞ ‖x̃ − x‖_∞ ≤ KC max_k ‖S^k‖_∞ ‖x̃ − x‖_∞.    (24)

Putting (23) and (24) together, we can write

‖[z̃ − z]_i‖ = ‖[z̃]_i − [z]_i‖ ≤ |β| ‖[x̃ − x]_i‖ + KC max_k ‖S^k‖_∞ ‖x̃ − x‖_∞.

Since this is true for all i, from the definition of ‖·‖_∞ we conclude

‖z̃ − z‖_∞ ≤ (|β| + KC max_k ‖S^k‖_∞) ‖x̃ − x‖_∞

which completes the proof. Note that ‖S^k‖_∞ ≥ ρ(S)^k = 1 for all k, with lim_{k→∞} ‖S^k‖_∞^{1/k} = ρ(S) = 1, so there exists K₀ such that, for all k > K₀, ‖S^k‖_∞ ≤ max_k ‖S^k‖_∞ with max_k ‖S^k‖_∞ = ‖S^{K₀}‖_∞.
6. REFERENCES

[1] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[2] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, "Graphs, convolutions, and neural networks," arXiv:2003.03777, 2020.
[3] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, Barcelona, Spain, 5-10 Dec. 2016, pp. 3844–3858.
[4] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Convolutional neural network architectures for signals supported on graphs," IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, Feb. 2019.
[5] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, "Distributed signal processing via Chebyshev polynomial approximation," IEEE Transactions on Signal and Information Processing over Networks, vol. 4, no. 4, pp. 736–751, 2018.
[6] E. Isufi, A. Loukas, A. Simonetto, and G. Leus, "Autoregressive moving average graph filtering," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 274–288, 2016.
[7] S. Segarra, A. G. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, Aug. 2017.
[8] A. Sandryhaila, S. Kar, and J. M. F. Moura, "Finite-time distributed consensus through graph filters," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1080–1084.
[9] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
[10] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[11] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and deep locally connected networks on graphs," in Int. Conf. Learning Representations, Banff, AB, 14-16 Apr. 2014, pp. 1–14.
[12] F. M. Bianchi, D. Grattarola, C. Alippi, and L. Livi, "Graph neural networks with convolutional ARMA filters," arXiv:1901.01343 [cs.LG], 2019.
[13] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Int. Conf. Learning Representations, Vancouver, BC, 30 Apr.-3 May 2018, pp. 1–12.
[14] E. Isufi, F. Gama, and A. Ribeiro, "EdgeNets: Edge varying graph neural networks," arXiv:2001.07620v1 [cs.LG], 21 Jan. 2020.
[15] L. Ruiz, F. Gama, A. G. Marques, and A. Ribeiro, "Invariance-preserving localized activation functions for graph neural networks," IEEE Trans. Signal Process., vol. 68, no. 1, pp. 127–141, Jan. 2020.
[16] F. Gama, J. Bruna, and A. Ribeiro, "Stability properties of graph neural networks," arXiv:1905.04497v2 [cs.LG], 4 Sep. 2019.
[17] E. Isufi, A. Loukas, A. Simonetto, and G. Leus, "Filtering random graph processes over random time-varying graphs," IEEE Trans. Signal Process., vol. 65, no. 16, pp. 4406–4421, 2017.
[18] E. Isufi, A. Loukas, N. Perraudin, and G. Leus, "Forecasting time series with VARMA recursions on graphs," IEEE Trans. Signal Process., vol. 67, no. 18, pp. 4870–4885, 2019.
[19] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Trans. Interactive Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, Jan. 2016.
[20] T. Wiatowski and H. Bölcskei, "A mathematical theory of deep convolutional neural networks for feature extraction," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1845–1866, 2018.