Graph-Adaptive Activation Functions for Graph Neural Networks
Bianca Iancu∗, Luana Ruiz†, Alejandro Ribeiro† and Elvin Isufi∗
∗Intelligent Systems Department, Delft University of Technology, Delft, The Netherlands
†Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA
e-mails: [email protected], {rubruiz, aribeiro}@seas.upenn.edu, E.Isufi[email protected]
ABSTRACT
Activation functions are crucial in graph neural networks (GNNs) as they allow defining a nonlinear family of functions to capture the relationship between the input graph data and their representations. This paper proposes activation functions for GNNs that not only adapt the nonlinearity to the graph, but are also distributable. To incorporate the feature-topology coupling into all GNN components, nodal features are nonlinearized and combined with a set of trainable parameters in a form akin to graph convolutions. The latter leads to a graph-adaptive trainable nonlinear component of the GNN that can be implemented directly or via kernel transformations, therefore enriching the class of functions to represent the network data. Whether in the direct or kernel form, we show permutation equivariance is always preserved. We also prove the subclass of graph-adaptive max activation functions is Lipschitz stable to input perturbations. Numerical experiments with distributed source localization, finite-time consensus, distributed regression, and recommender systems corroborate our findings and show improved performance compared with pointwise as well as state-of-the-art localized nonlinearities.
Index Terms — Activation functions; graph neural networks; graph signal processing; Lipschitz stability; permutation equivariance.
1. INTRODUCTION
Graph neural networks (GNNs) are parametric architectures suitable for learning a nonlinear mapping for data defined over graphs, such as social, sensor, and biological network data [1, 2]. By interweaving graph filters with pointwise nonlinearities, GNNs express the function map in a layered form and learn compositions of features that account for the data-topology coupling [3, 4]. Another property GNNs inherit from graph filters is the distributed implementation [5–7]. Distributed computation facilitates scalability and endows the system with robustness to failures of the processing unit. The latter is fundamental in applications involving consensus, optimization, and control [8–10].

Building on spectral graph theory, [11] defined graph convolutional neural networks by multiplying feature representations in the Laplacian eigenspace with trainable kernels. Subsequently, [3] used finite impulse response (FIR) graph filters to combine features in the vertex domain by means of a polynomial in the Laplacian matrix. The work in [4] follows the same idea but builds a polynomial filter in any graph representation matrix (e.g., adjacency, Laplacian). Differently from [11], [3, 4] are also readily distributable architectures with appropriate choices of graph pooling (i.e., not altering the graph structure; e.g., zero-padding) and with pointwise activation functions. On the other hand, [12] builds a GNN with distributable autoregressive moving average graph filters [6], which capture a broader family of functions at the expense of computation cost. Parallel to these efforts, [13] proposes attention-like mechanisms to adapt the edge weights to the task at hand. More recently, the work in [14] showed that all the above architectures are equivalent and fall under the framework of the edge varying GNN (EdgeNet). Altogether, these works capture the data-graph coupling only linearly through graph filters, while they ignore the coupling in the nonlinear pointwise component (e.g., ReLU). To improve the representation power of GNNs, [15] proposed localized activation functions that account for the graph topology by operating on node neighborhoods of different resolutions. However, the latter accounts only for the graph and not the data-topology coupling, since it ignores the edge weights and the data propagation between neighbors. Localized activation functions are also not distributable beyond the one-hop neighborhood, hence missing multi-hop information between nodes.

To address these limitations, we put forward a new family of activation functions that adapt to the data-topology coupling in the surrounding of a node. The nodal features obtained from graph filtering are shifted prior to local nonlinearization in a form akin to graph convolutions. These nonlinear features are subsequently combined with a set of trainable parameters to accordingly weigh the information at different neighborhood resolutions. The resolution radius is a design parameter and allows adapting the GNN nonlinear component to the task at hand. Besides being graph-adaptive and distributable, these activation functions preserve two properties of theoretical interest for GNNs, namely permutation equivariance and Lipschitz stability to perturbations [16]. Concretely, our contribution is threefold.
1. We develop a new family of nonlinearities for GNNs that are graph-adaptive to the surrounding of a node and distributable. The first class [Def. 3] nonlinearizes shifted features in the surrounding of a node in their direct form. The second class [Def. 6] transforms the shifted features with graph-adaptive kernels prior to nonlinearization.
2. We prove that: (a) the proposed nonlinearities are permutation equivariant [Prop. 1], i.e., the output of the respective GNN architecture is agnostic to node labeling; (b) the max graph-adaptive nonlinearity is Lipschitz stable to input perturbations [Prop. 2].
3. We propose distributed GNN tasks with graph-adaptive nonlinearities for source localization, finite-time consensus, signal denoising, and rating prediction in recommender systems.
2. GRAPH NEURAL NETWORKS
Consider a graph G = (V, E) with vertex set V of cardinality |V| = N and edge set E ⊆ V × V of cardinality |E| = M. An edge is a tuple e_ij = (i, j) connecting nodes i and j. The neighborhood of node i is the set of nodes N_i = {j | (i, j) ∈ E} connected to i. Associated to G is the graph shift operator (GSO) matrix S ∈ R^{N×N}, whose sparsity pattern matches the graph structure. That is, entry (i, j) satisfies [S]_{i,j} = s_{i,j} ≠ 0 only if i = j or (i, j) ∈ E. Commonly used GSOs include the adjacency matrix, the graph Laplacian, and their normalized and translated forms. On the vertices of G, we define a graph signal x ∈ R^N whose ith component is the value at node i. We consider applications where graph signals are processed in a distributed fashion. A typical example is in sensor networks without access to a centralized processing unit and where each sensor communicates only with its neighbor sensors.
Graph convolution. A graph convolution is defined as a graph filter H(S) that can be written as a polynomial of the GSO S [7]. For an input signal x and filter coefficients h = [h_0, ..., h_K]^⊤, the output y ∈ R^N of the graph convolutional filter is computed as

y = H(S) x = Σ_{k=0}^{K} h_k S^k x.    (1)

Due to the locality of S, graph convolutions can be run distributively. When building the output y, we need to compute the terms Sx, ..., S^K x. Since S is local, the operation Sx requires one-hop node exchanges and so, by writing S^k x = S(S^{k-1} x) = S x^{(k-1)}, node i can compute signal x^{(k)} through exchange of the previous shifted information x^{(k-1)} with its neighbors. This recursion allows for distributed communications and a computational cost of order O(MK), while the trainable parameters defining (1) are of order O(K) [7].
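To make the shift recursion concrete, the following is a minimal sketch (not the authors' code; variable names are mine) that computes (1) with one sparse shift per tap, which is exactly the operation each node would run distributively by exchanging values with its one-hop neighbors.

```python
# Minimal sketch of the FIR graph convolution (1) via the shift recursion.
import numpy as np
import scipy.sparse as sp

def graph_convolution(S: sp.spmatrix, x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Apply H(S)x = sum_{k=0}^{K} h[k] S^k x using repeated one-hop shifts."""
    y = h[0] * x           # k = 0 term: no communication needed
    x_k = x
    for k in range(1, len(h)):
        x_k = S @ x_k      # one round of neighbor exchanges: x^(k) = S x^(k-1)
        y = y + h[k] * x_k
    return y

# Usage with an illustrative random sparse GSO (a normalized adjacency in practice).
N, K = 40, 5
S = sp.random(N, N, density=0.1, format="csr")
x = np.random.randn(N)
h = np.random.randn(K + 1)
y = graph_convolution(S, x, h)
```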
Graph convolutional neural networks (GCNNs). We consider a GCNN of L graph convolutional layers followed by a shared fully connected layer per node. Each convolutional layer comprises a bank of graph filters [cf. (1)] and a nonlinearity. At layer l, the GCNN takes as input F_{l-1} features {x_{l-1}^g}_{g=1}^{F_{l-1}} from layer (l-1) and produces F_l output features {x_l^f}_{f=1}^{F_l}. Each input feature x_{l-1}^g is processed by a parallel bank of F_l graph filters {H_l^{fg}(S)}_f. The filter outputs are aggregated over the input index g to yield the fth convolved feature

z_l^f = Σ_{g=1}^{F_{l-1}} H_l^{fg}(S) x_{l-1}^g = Σ_{g=1}^{F_{l-1}} Σ_{k=0}^{K} h_{lk}^{fg} S^k x_{l-1}^g,  for f = 1, ..., F_l.    (2)

The convolved feature z_l^f is subsequently passed through an activation function σ(·) to obtain the fth convolutional layer output

x_l^f = σ(z_l^f),  for f = 1, ..., F_l.    (3)

The output features of the last convolutional layer L, x_L^1, ..., x_L^{F_L}, represent the final convolutional features. These features are interpreted as a collection of F_L graph signals, where on node i we have the F_L × 1 feature vector χ_{Li} = [x_{Li}^1, ..., x_{Li}^{F_L}]^⊤. Each node locally combines the features χ_{Li} with a one-layer perceptron to obtain the output

ỹ = H_FC χ_{Li}    (4)

where matrix H_FC ∈ R^{F_o × F_L} maps the F_L convolutional features to the F_o output features (e.g., the number of classes). The parameters in H_FC are shared among nodes to keep the number of trainable parameters independent of the graph dimensions (i.e., N and M), and only dependent on the filter order and the number of features and layers.

By grouping all learnable parameters into the set H = {h_l^{fg}; H_FC}_{lfg}, we can consider the GCNN as a map Φ(·) that takes as input a graph signal x, a GSO S, and a set of parameters H to produce the output

Φ(x; S; H) := ỹ.    (5)

The output (5) is computed for a training set T = {(x, y)} of |T| pairs, where y are the target representations.
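To illustrate (2)-(3), here is a minimal NumPy sketch of a single convolutional layer: a bank of FIR graph filters aggregated over the input features, followed by a pointwise nonlinearity. Shapes and variable names are illustrative and not taken from the paper's implementation.

```python
# One GCNN layer: filter bank (2) followed by a pointwise activation (3).
import numpy as np

def gcnn_layer(S, X_in, H, sigma=lambda z: np.maximum(z, 0.0)):
    """
    S:    (N, N) graph shift operator
    X_in: (N, F_in) input features, one graph signal per column
    H:    (K+1, F_in, F_out) filter taps h_{lk}^{fg}
    """
    K = H.shape[0] - 1
    N, _ = X_in.shape
    F_out = H.shape[2]
    Z = np.zeros((N, F_out))
    X_k = X_in                      # S^0 X_in
    for k in range(K + 1):
        Z += X_k @ H[k]             # sum over input features g for tap k
        X_k = S @ X_k               # next shift, one more hop of exchanges
    return sigma(Z)                 # pointwise nonlinearity
```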
Activation functions. The activation function σ(·) in (3) can be any of the conventional pointwise activation functions, such as the ReLU σ(x) = max(0, x), or a localized activation function [15]. Differently from pointwise ones, localized activation functions consider the features at the neighborhood of each node i in the nonlinear GCNN component [15]. For a graph signal feature x, the localized activation function is based on two local operators, namely:
• the local max operator, max(S, x), whose output is a graph signal z with ith entry being the maximum value of the signal in the neighborhood, i.e., z_i = [max(S, x)]_i = max({x_j : v_j ∈ N_i});
• the local median operator, med(S, x), whose output is a graph signal z with ith entry being the median value of the signal in the neighborhood, i.e., z_i = [med(S, x)]_i = med({x_j : v_j ∈ N_i}).
For simplicity, we denote both local operators with the generic local function f(S, x). Then, the localized activation function is defined as

σ(x) = β ReLU(x) + Σ_{k=1}^{K} h_k^σ f(S^k, x),    (6)

where f(S^k, x) applies the local activation function to the signal values of the k-hop neighbors, and the parameters β and h^σ = [h_1^σ, ..., h_K^σ]^⊤ are learned [15]. A GCNN with localized activation functions can thus be written as the map Φ(x; S; H) with parameters H = {h_l^{fg}; h_l^{fσ}; H_FC}_{lfg}. As it follows from (6), localized activation functions ignore the edge weights and require information from the non-immediate k-hop neighbors, which makes them not distributable. Hence, in distributed settings, the order K in (6) is limited to one. To address this limitation, we propose two new activation functions based on local operators and kernel functions that account for the graph structure and are distributable.
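For reference, a minimal sketch (my own variable names, not the code of [15]) of the localized activation (6) with the local max operator; note that computing the k-hop maximum requires reading values of non-immediate neighbors, which is what prevents a distributed implementation for K > 1.

```python
# Localized activation (6) of [15] with the local max operator.
import numpy as np

def khop_max(S, x, k):
    """[max(S^k, x)]_i = max of x over the k-hop neighborhood of node i.
    The k-hop neighborhoods are read off the sparsity pattern of S^k
    (which may include the node itself)."""
    Sk = np.linalg.matrix_power(S, k)
    return np.array([x[Sk[i] != 0].max() if np.any(Sk[i] != 0) else x[i]
                     for i in range(len(x))])

def localized_activation(S, x, beta, h_sigma):
    """sigma(x) = beta * ReLU(x) + sum_k h_sigma[k-1] * max(S^k, x)."""
    out = beta * np.maximum(x, 0.0)
    for k, hk in enumerate(h_sigma, start=1):
        out = out + hk * khop_max(S, x, k)
    return out
```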
3. GRAPH-ADAPTIVE ACTIVATION FUNCTIONS
In this section, we first define the graph-adaptive localized activation functions, which are based on arbitrary nonlinear operators acting on the one-hop neighborhood of a node (Section 3.1). Then, we define the graph-adaptive kernel activation functions (Section 3.2). Finally, we prove the proposed nonlinearities are permutation equivariant and stable to input perturbations (Section 3.3).
3.1. Graph-Adaptive Localized Activation Functions
To start, let us first define the basic building block for graph-adaptive activation functions: the shifted localized operator (SLO).
Definition 1 (Shifted Localized Operator). Let G be an N-node graph with shift operator S, x a signal, and S^k x the kth shifted signal. Consider an arbitrary nonlinear localized function f(·, N_i) : R^N → R^N, which at node i computes the local nonlinear operation [f(x, N_i)]_i = f({x_j}_{j∈N_i}). For this choice of f(·, N_i), the k-hop shifted localized operator maps input x to output z ∈ R^N as

z_i = [f(S^k x, N_i)]_i = f({[S^k x]_j : j ∈ N_i}).    (7)

That is, the SLO shifts the signal k times to obtain S^k x, and then replaces the value of this signal at each node i by a nonlinear aggregation f(·, N_i) of the signal values within the one-hop neighborhood of i. The SLO utilizes information locally available at each node to account for the signal-topology coupling for nodes that are k hops away. We can now define graph-adaptive nonlinear graph filters as follows.

Definition 2 (Shifted Localized Graph Filter). Consider the shifted localized operator induced by an arbitrary nonlinear localized function f(·, N_i) [cf. Def. 1], and let h^σ = [h_1^σ, ..., h_K^σ]^⊤ be a vector of parameters. The output of the shifted localized graph filter applied to signal x, w.r.t. the shift operator S, is the signal z ∈ R^N with ith entry

z_i = Σ_{k=1}^{K} h_k^σ [f(S^k x, N_i)]_i.    (8)

Definition 2 implies the output of a shifted localized graph filter is a linear combination of the SLOs f(S^k x, N_i) at different resolutions. Hence, shifted localized graph filters inherit the localization property of SLOs, as they incorporate the graph structure up to K hops away while accessing only neighboring information. These nonlinear filters can be employed to define graph-adaptive localized activation functions.

Definition 3 (Graph-Adaptive Localized Activation Function). Consider a scalar β and a vector h^σ = [h_{l1}^{fσ}, ..., h_{lK}^{fσ}]^⊤ of learnable parameters. At layer l, the graph-adaptive localized activation function maps the linear features z_l^f [cf. (2)] to the output features x_l^f following the recursion

[x_l^f]_i = β ReLU([z_l^f]_i) + Σ_{k=1}^{K} h_{lk}^{fσ} [f(S^k z_l^f, N_i)]_i.    (9)

Definition 3 combines the pointwise ReLU nonlinearity and the shifted localized graph filters [cf. Def. 2] into a single graph-adaptive localized nonlinearity for GNNs. The latter is distributable and localized because, even though the resolution (given by the shift order) can be arbitrarily large, the SLO f(·, N_i) [cf. Def. 1] operates only in the one-hop neighborhood. In Section 4, we evaluate this activation function for f(·, N_i) being the max and the median, leading to the graph-adaptive max and median activation functions, respectively.
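A minimal sketch of the graph-adaptive max activation (9), assuming a dense NumPy GSO and a precomputed neighbor list (both my own choices, not the released implementation): the feature is first shifted k times, which is distributable at one hop per exchange, and only then max-aggregated over the one-hop neighborhood.

```python
# Graph-adaptive localized max activation (Def. 3, eq. (9)).
import numpy as np

def slo_max(S, z, k, neighbors):
    """Shifted localized operator (Def. 1) with f = max.
    Assumes every node has at least one neighbor."""
    zk = np.linalg.matrix_power(S, k) @ z              # k-th shifted signal S^k z
    return np.array([zk[neighbors[i]].max() for i in range(len(z))])

def graph_adaptive_max(S, z, beta, h_sigma, neighbors):
    """[x]_i = beta ReLU([z]_i) + sum_k h_sigma[k-1] [max(S^k z, N_i)]_i."""
    out = beta * np.maximum(z, 0.0)
    for k, hk in enumerate(h_sigma, start=1):
        out = out + hk * slo_max(S, z, k, neighbors)
    return out
```

In a distributed deployment, the shifts S^k z would be obtained with the same neighbor-exchange recursion used for the graph convolution, so each additional resolution costs one extra round of exchanges.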
3.2. Graph-Adaptive Kernel Activation Functions
The graph-adaptive kernel activation functions replace the localized nonlinear function f(·, N_i) by a localized kernel to enrich the representation power. Let x_i^{(k)} ∈ R^{|N_i|} denote the vector containing |N_i| copies of the kth shifted signal at node i, [S^k x]_i, i.e., x_i^{(k)} = 1_{|N_i|} ⊗ [S^k x]_i, where 1_{|N_i|} is the vector of ones of dimension |N_i| and ⊗ is the Kronecker product. Additionally, consider the vector containing the values at the neighbors j ∈ N_i of the kth shifted signal S^k x, i.e., x_{j∈N_i}^{(k)} = [S^k x]_{j∈N_i}. With this notation in place, we define a graph kernel operator as follows.

Definition 4 (Kernel Operator). Let G be an N-node graph with shift operator S, x a signal, and S^k x the kth shifted signal. Consider an arbitrary kernel function g(·, N_i) : R^{|N_i|} → R^{|N_i|}, which at node i computes the nonlinear local operation [g(x, N_i)]_i = g(x̃_i, x_{j∈N_i}), where x̃_i = 1_{|N_i|} ⊗ [x]_i is a vector of dimensionality |N_i| containing copies of signal x at node i. The k-hop shifted kernel operator mapping from x to z ∈ R^N has the entries

z_i = [g(S^k x, N_i)]_i := g(x_i^{(k)}, x_{j∈N_i}^{(k)}).    (10)

Definition 4 shows the kernel operator first shifts the input signal x as S^k x and then replaces the signal value at each node i by the kernel value g(·, N_i) in the one-hop neighborhood of i. Thus, the kernel operator employs only local information at each node to account for the signal-topology coupling up to k hops away from a node. For the kernel function g(·, N_i) we will employ the Gaussian kernel

g(x, y) = exp(−‖x − y‖/γ),    (11)

where the scalar γ is tunable. We can now define kernel graph filters.

Definition 5 (Kernel Graph Filter). Consider a kernel operator [cf. Def. 4] with kernel function g(·, N_i) and let h^σ = [h_1^σ, ..., h_K^σ]^⊤ be a vector of parameters. The output of the kernel graph filter applied to signal x, w.r.t. the shift operator S, is the signal z ∈ R^N with ith entry

z_i = Σ_{k=1}^{K} h_k^σ [g(S^k x, N_i)]_i.    (12)

Definition 5 implies the output of the kernel graph filter is a linear combination of the kernel operator applied to each k-shifted signal S^k x for 1 ≤ k ≤ K. Kernel graph filters thus preserve the localization properties of kernel operators, i.e., they account for the topology of the graph up to K hops away while accessing only information in the one-hop neighborhood. These kernel graph filters can be further employed to define the graph-adaptive kernel activation function as follows.

Definition 6 (Graph-Adaptive Kernel Activation Function). Consider a scalar β and a vector h^σ = [h_{l1}^{fσ}, ..., h_{lK}^{fσ}]^⊤ of learnable parameters. At layer l, the graph-adaptive kernel activation function maps the linear features z_l^f [cf. (2)] to the output features x_l^f following the recursion

[x_l^f]_i = β ReLU([z_l^f]_i) + Σ_{k=1}^{K} h_{lk}^{fσ} [g(S^k z_l^f, N_i)]_i.    (13)

Definition 6 combines the pointwise ReLU and the kernel graph filters [cf. Def. 5] into a single graph-adaptive kernel activation function. This activation function is distributable and localized because, even though the resolution (given by the shift order) can be arbitrarily large, the kernel g(·, N_i) operates only in the one-hop neighborhood.
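A minimal sketch of the graph-adaptive kernel activation (13) under the same assumptions as the previous snippet (dense GSO, precomputed neighbor lists, my own names); as I read (10)-(11), the Gaussian kernel compares the k-shifted value at a node against the k-shifted values of its one-hop neighbors.

```python
# Graph-adaptive kernel activation (Def. 6, eq. (13)) with the Gaussian kernel (11).
import numpy as np

def kernel_op(S, z, k, neighbors, gamma):
    """Kernel operator (10): compare [S^k z]_i with [S^k z]_j for j in N_i."""
    zk = np.linalg.matrix_power(S, k) @ z          # k-th shifted signal S^k z
    out = np.empty(len(z))
    for i, nbrs in enumerate(neighbors):
        diff = zk[i] - zk[nbrs]                    # node value vs. neighbor values
        out[i] = np.exp(-np.linalg.norm(diff) / gamma)
    return out

def graph_adaptive_kernel(S, z, beta, h_sigma, neighbors, gamma):
    """[x]_i = beta ReLU([z]_i) + sum_k h_sigma[k-1] [g(S^k z, N_i)]_i."""
    out = beta * np.maximum(z, 0.0)                # pointwise ReLU term
    for k, hk in enumerate(h_sigma, start=1):
        out = out + hk * kernel_op(S, z, k, neighbors, gamma)
    return out
```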
In both proposed activation functions, the coefficients {β, h^σ} are trainable, meaning these nonlinearities adapt the multi-hop resolution weights to the task at hand. Because these coefficients are shared among nodes, the number of parameters to learn for a graph-adaptive activation function is independent of the graph size. This allows GCNNs to scale. Note that even though the nonlinear functions f(·, N_i) or the kernel functions g(·, N_i) act only on the one-hop neighborhood, they are applied to the shifted signals S^k x; therefore, they account for the feature-graph coupling (up to K hops away) in a nonlinear fashion. This is an advantage over traditional GCNNs with pointwise nonlinearities, in which the graph topology is only incorporated through the linear encodings generated by graph convolutions.

Definitions 3 and 6 implement fully graph-adaptive GCNNs that, at each layer, apply a graph convolution followed by a graph-adaptive activation function. The distributed GCNN is given by the map

Φ(x; S, H, W) := ỹ.    (14)

The GCNN output now depends on both the coefficients H [cf. (5)] and the nonlinear activation function coefficients W = {h_l^{fσ}}_{lf} ∪ {β}.

A key property GCNNs with pointwise activation functions inherit from graph convolutions is permutation equivariance [15]. The output of a GCNN is equivariant to node relabeling and, more importantly, GCNNs exploit graph symmetries to generalize learned representations to different graph signals that share some of these symmetries. Herein, we show that permutation equivariance also holds for graph-adaptive nonlinearities. We will also discuss a property that is specific to the graph-adaptive localized max activation: Lipschitz stability to input perturbations.
Permutation equivariance.
Consider the graph convolutional filter H(S) [cf. (1)] and let P be an N × N permutation matrix satisfying P^⊤ P = P P^⊤ = I. If we permute the GSO S and the input x respectively as S′ = P^⊤ S P and x′ = P^⊤ x, we get the corresponding graph convolution output

y′ = H(S′) x′ = H(P^⊤ S P) P^⊤ x = P^⊤ H(S) P P^⊤ x = P^⊤ y.    (15)

Because pointwise activation functions are scalar and by definition permutation equivariant, (15) implies GCNNs with pointwise nonlinearities are equivariant to node relabelings. For GCNNs with graph-adaptive activation functions, it is then desirable to retain this property. This is guaranteed by the following proposition.

Proposition 1 (Permutation equivariance). Consider a graph signal x defined on an N-node graph G with GSO S. Let Φ(x; S, H, W) be the output of a GCNN with graph-adaptive activation functions [cf. (14)] and let P be an N × N permutation matrix. The GNN Φ(x; S, H, W) satisfies

Φ(P^⊤ x; P^⊤ S P, H, W) = P^⊤ Φ(x; S, H, W),    (16)

i.e., GNNs with graph-adaptive activation functions are permutation equivariant.

Proof.
For the proof, we refer the reader to the Appendix.
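As a quick sanity check (not part of the paper), the following snippet verifies (16) numerically for the graph-adaptive max activation on a small ring graph; names and parameter values are illustrative.

```python
# Numerical check of permutation equivariance (16) for the graph-adaptive max activation.
import numpy as np

def ga_max(S, x, beta, h_sigma, nbrs):
    # compact version of the graph-adaptive max activation (9)
    out = beta * np.maximum(x, 0.0)
    for k, hk in enumerate(h_sigma, start=1):
        xk = np.linalg.matrix_power(S, k) @ x
        out += hk * np.array([xk[n].max() for n in nbrs])
    return out

rng = np.random.default_rng(0)
N = 8
S = np.zeros((N, N))
for i in range(N):                       # ring graph: every node has two neighbors
    S[i, (i + 1) % N] = S[(i + 1) % N, i] = 1.0
nbrs = [np.nonzero(S[i])[0] for i in range(N)]
x = rng.standard_normal(N)
beta, h_sigma = 0.5, np.array([0.3, -0.2])

P = np.eye(N)[rng.permutation(N)]        # permutation matrix
S_p = P.T @ S @ P
nbrs_p = [np.nonzero(S_p[i])[0] for i in range(N)]

z = ga_max(S, x, beta, h_sigma, nbrs)
z_p = ga_max(S_p, P.T @ x, beta, h_sigma, nbrs_p)
print(np.allclose(z_p, P.T @ z))         # expected: True
```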
Lipschitz stability.
In addition to permutation equivariance, the graph-adaptive max nonlinearity is Lipschitz stable to input perturbations with respect to the infinity norm ‖·‖_∞, as stated in the following proposition.

Proposition 2 (Lipschitz stability). Let G be a graph with GSO S. Assume that S is normalized by its largest eigenvalue so that its spectral norm ρ(S) equals one. Let x be a graph signal and let x̃ be a perturbation of x. The output of the graph-adaptive max activation function

[z]_i = β ReLU([x]_i) + Σ_{k=1}^{K} h_k^σ [max(S^k x, N_i)]_i    (17)

with coefficients |h_k^σ| ≤ C is Lipschitz stable to input perturbations in the infinity norm ‖·‖_∞. That is, there exists a constant L_σ such that

‖z̃ − z‖_∞ ≤ L_σ ‖x̃ − x‖_∞    (18)

where L_σ = |β| + KC max_k ‖S^k‖_∞.

Proof.
For the proof, we refer the reader to the Appendix.

Proposition 2 implies the graph-adaptive max activation is Lipschitz stable at each node. Lipschitz stability is crucial to make learning more robust. For instance, in classification problems, a GNN with graph-adaptive max nonlinearities is more likely to classify correctly a perturbed signal x̃ than a GNN with non-Lipschitz activation functions. The Lipschitz constant depends on the coefficient β, the number of filter taps K, the weights h_k^σ (through C), and the graph (through max_k ‖S^k‖_∞). While we may not have full control over max_k ‖S^k‖_∞, β and K are design parameters, and so is the maximum value of the coefficients h_k^σ. The Lipschitz constant of graph-adaptive max nonlinearities is thus tunable. This represents an advantage compared to conventional pointwise activation functions, which are stable but have fixed Lipschitz constants.
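To make the bound concrete, a short sketch (illustrative, not from the paper) that evaluates L_σ = |β| + KC max_k ‖S^k‖_∞ for a GSO normalized by its largest eigenvalue.

```python
# Evaluate the Lipschitz constant of Proposition 2 for a given GSO and coefficients.
import numpy as np

def lipschitz_bound(S, beta, h_sigma):
    K = len(h_sigma)
    C = np.max(np.abs(h_sigma))                       # |h_k^sigma| <= C
    norms = [np.linalg.norm(np.linalg.matrix_power(S, k), ord=np.inf)
             for k in range(1, K + 1)]                # max row sums of S^k
    return np.abs(beta) + K * C * max(norms)

# Usage: complete-graph adjacency normalized by its largest eigenvalue (rho(S) = 1).
A = np.ones((4, 4)) - np.eye(4)
S = A / np.max(np.abs(np.linalg.eigvals(A)))
print(lipschitz_bound(S, beta=0.5, h_sigma=np.array([0.3, -0.2])))
```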
4. NUMERICAL EXPERIMENTS
We evaluate the performance of six activation functions: ReLU, the localized activation functions (max and median) [15], and our proposed graph-adaptive localized (max and median) and kernel activation functions. Our goal is to highlight the benefits and limitations of the different nonlinearities in applications requiring distributed computations, with both synthetic and real data. To train the GCNNs we used the ADAM optimizer. As the GSO, we employ the adjacency matrix normalized by its maximum eigenvalue. For the graph-adaptive kernel nonlinearity, the parameter γ in (11) is set to a fixed value.

4.1. Source localization
We consider a diffusion process over a graph of N = 40 nodes divided into C = 4 communities. The goal is to determine the source community of a given diffused signal locally at a selected node. The graph is an undirected stochastic block model (SBM) with intra- and inter-community probabilities p and q, respectively. The graph signals are defined as Kronecker deltas δ_c ∈ R^N centered at a source node c and diffused at a timestamp t, i.e., x_t = S^t δ_c. We choose as source node c each of the nodes, thus generating a data set of graph signals, which we split into training, validation, and test sets. We simulate 10 different graphs and generate 10 different splits per graph. Training and testing are performed for the highest connected node of each community, resulting in four nodes.

Table 1 shows the classification accuracy for a two-layer GCNN with different numbers of features F. For the graph-adaptive nonlinearities, we carried out the experiments with resolutions K = 1 and K = 2. We only report the results for the better performing filter order, as the rest were comparable to the localized nonlinearities from [15]. We observe that both the localized nonlinearities and the proposed graph-adaptive nonlinearities significantly outperform ReLU. This result highlights the benefits of accounting for the graph topology during classification. Moreover, the graph-adaptive max and median activation functions outperform their localized versions, confirming the advantage of accounting for further away data-graph coupling. The max nonlinearities achieve a higher accuracy than the medians in both the localized and graph-adaptive localized nonlinearities. This result could be caused by the fact that the median overall smooths the signal, hence undermining some local variations important for classification. Additionally, this could also explain the lower performance of the graph-adaptive kernel nonlinearities compared to the localized nonlinearities, which might be affected by possible redundancies in the extra information coming from neighbors.

4.2. Finite-time consensus
Distributed finite-time consensus aims to achieve consensus among all nodes in finite time by accessing only local information at each node. We consider learning the distributed consensus function in a data-driven fashion over an undirected SBM graph with N = 100 nodes divided into C = 5 communities with intra- and inter-community probabilities p and q, respectively. The graph signals are generated from a normal distribution N(0, I). We generate the samples and split them into training, validation, and test sets.
The code can be found at https://github.com/bianca26/graph-adaptive-activation-functions-gnns.

Table 1: Source localization test accuracy for different numbers of features F per layer. L.: localized nonlinearities [cf. (6)]; G.A.: graph-adaptive nonlinearities [cf. (9) and (13)]; the filter order K is given in brackets. Compared nonlinearities: ReLU, Max L., Max G.A. (2), Median L., Median G.A. (2), Kernel G.A. (1).

We average the performance across the different graph realizations and the different data splits for each graph. We consider a two-layer GCNN with F = 32 features per layer followed by a per-node fully connected layer. We employ various filter orders K. The evaluation metric is the RMSE.

Figure 1a shows the RMSE as a function of the filter order for the different nonlinearities. All GCNNs achieve a lower RMSE compared with the FIR graph filter. For the lowest order K = 20, ReLU yields a worse RMSE than the localized and graph-adaptive nonlinearities. Once the filter order increases, and thus the degrees of freedom, adding a parametric nonlinearity seems to be less beneficial because the network has enough degrees of freedom in the filter to model the consensus function. We also experiment with the robustness of the different models to link losses by removing graph edges with different probabilities, following the random edge sampling model of [17]. For each method, we considered the best performing setup. From the trained graph G, we randomly removed edges with increasing probabilities. The results are shown in Figure 1b, averaged across realizations. Although all models deteriorate when the link losses increase, graph-adaptive nonlinearities handle the stochasticity better. The kernel nonlinearity seems to be the most sensitive, as its performance reaches that of the other graph-adaptive alternatives.

4.3. Distributed regression
We perform distributed regression using the Molene dataset, which contains hourly temperature measurements of N = 32 stations over T = 744 hours recorded in January 2014 in the area of Brest (France). Using the node (station) coordinates, we generate a weighted geometric graph with the ten-nearest-neighbor approach proposed in [18]. We consider as graph signals the measurements taken at the different timestamps, one graph signal per hour. On top of the original signals we add zero-mean noise with a signal-to-noise ratio (SNR) of 3 dB. These noisy signals are split into training, validation, and test sets. Our goal is to train a GCNN to remove the noise distributively. We employ a GCNN with one layer, a varying number of features F, and varying filter orders K. We perform the training for 500 epochs with a batch size of 100 samples. We employ the RMSE as the evaluation metric. The final results are averaged across 20 different splits of the data set.

Figure 1c shows the RMSE as a function of the filter order for the different nonlinearities. Across all GCNNs, the best performance was achieved for the highest number of features, four, so we only report the results for this setup; in the other setups, the performances were comparable. All GCNNs perform better than the FIR filter, but the difference is more significant for the lowest filter order K = 1, especially in the case of the graph-adaptive localized nonlinearities.
This finding suggests their applicability in situations where communication resources are limited. To further test this hypothesis, we experimented with different levels of noise added to the data. For each method, we considered the setup with the lowest filter order K = 1. The results in Figure 1d show that the graph-adaptive and localized nonlinearities outperform or achieve comparable results to ReLU. The general trend shows an increase in performance when the SNR becomes larger, with a more significant increase for the graph-adaptive localized nonlinearities. The performance of the graph-adaptive kernel nonlinearity suffers in this scenario, as it requires higher filter orders compared to the rest. We suggest using higher orders in the latter case to fully exploit the kernel power.
Fig. 1. (a) Root mean square error (RMSE) of the GCNNs and FIR graph filters for distributed finite-time consensus as a function of the filter order. (b) Robustness of the GCNNs and FIR graph filters for distributed finite-time consensus as a function of the link-loss probability. (c) RMSE of the GCNNs and FIR graph filters for distributed regression as a function of the filter order. (d) Robustness of the GCNNs and FIR graph filters for distributed regression as a function of the SNR. L.: localized nonlinearities [cf. (6)]; G.A.: graph-adaptive nonlinearities [cf. (9) and (13)]. K denotes the filter order.

Table 2: Average test RMSE over five train-test splits for the movies Toy Story, Contact, and Return of the Jedi. L.: localized nonlinearities [cf. (6)]; G.A.: graph-adaptive nonlinearities [cf. (9) and (13)].
Compared nonlinearities in Table 2: ReLU, Max L., Max G.A., Median L., Median G.A., Kernel G.A.

4.4. Recommender systems
We implement a GNN-based recommender system by considering a U × M rating matrix R containing 100,000 ratings given by U = 943 users to M = 1682 movies in the MovieLens 100k dataset [19]. The entries [R]_{um} are the ratings between 1 and 5 if user u has rated movie m, and 0 otherwise. We interpret the rows of R, i.e., the user rating vectors r_u, as graph signals on an M-node movie similarity network. The graph signals are split into 90% training and 10% test sets, and the movie similarity network is built by computing pairwise correlations between movie rating vectors (i.e., columns of R) containing only ratings from users in the training set. The GNN is trained to predict user ratings to a movie m. This is achieved by zeroing out the ratings to movie m in the input graph signals r_u, feeding them to the GNN to generate the rating prediction [r̂_u]_m, and minimizing the smooth ℓ1 loss |[r̂_u]_m − [r_u]_m|. We consider three graph-adaptive GNNs employing the one-hop max, one-hop median, and one-hop kernel graph-adaptive nonlinearities to highlight the impact of immediate neighboring information, hence making the recommendation more localized over items. They are compared with GNNs containing ReLU activations and the one-hop max and median activations from [15]. All GNNs consist of L = 1 layer and F = 32 features, using graph convolutional filter banks with K = 5 filter taps each. We train all GNNs for the movies Toy Story, Contact, and Return of the Jedi. The average test RMSEs over five random train-test splits for each movie are reported in Table 2.

We observe that the graph-adaptive max activation function outperforms the other nonlinearities for all three movies. In particular, the graph-adaptive max fares better than both the ReLU and its localized counterpart. The graph-adaptive median also outperforms the localized median for the movie Contact, and achieves comparable performance for the other movies. As for the graph-adaptive kernel activation, it performs similarly to the ReLU and does not provide much of an improvement.
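A minimal sketch of the data preparation described above (my own reading of the setup, not the released code): a movie-similarity GSO obtained from pairwise correlations of the training rating columns, and an input signal with the target movie's rating zeroed out. The correlation over zero-filled columns is a simplification of the paper's construction from training-set ratings.

```python
# Build a movie-similarity GSO and a zeroed input signal for rating prediction.
import numpy as np

def movie_similarity_gso(R_train: np.ndarray) -> np.ndarray:
    """R_train: (U, M) rating matrix with 0 for missing ratings."""
    C = np.corrcoef(R_train, rowvar=False)            # (M, M) pairwise movie correlations
    np.fill_diagonal(C, 0.0)                          # no self-loops
    C = np.nan_to_num(C)                              # guard against constant columns
    return C / np.abs(np.linalg.eigvals(C)).max()     # normalize by largest eigenvalue

def make_input(r_u: np.ndarray, m: int):
    """Zero out the rating of movie m; the true rating is the regression target."""
    x = r_u.copy()
    target = x[m]
    x[m] = 0.0
    return x, target
```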
5. CONCLUSIONS
We proposed a new family of graph-adaptive activation functions for GNNs that capture the graph topology while also being distributable. These activation functions incorporate the data-topology coupling into all the GNN components by combining nonlinearized features from neighboring nodes with a set of trainable parameters. These parameters adapt the information coming from neighborhoods of different resolutions to the task at hand, hence aiding learning. The proposed graph-adaptive activation functions preserve permutation equivariance, and the graph-adaptive max activation function is Lipschitz stable to input perturbations. Graph-adaptive nonlinearities were compared with GCNNs employing localized and pointwise nonlinearities in four different problems based on both synthetic and real-world data, showing improved performance compared to pointwise and other state-of-the-art localized nonlinearities. Future work will be on two fronts: characterizing the stability of the proposed activation functions to perturbations in the topology, and performing the learning distributively.
APPENDIX
Proof of Prop. 1. Let S′ = P^⊤ S P be the permuted graph and x′ = P^⊤ x the permuted signal. From (15), the output of the graph convolution is equivariant to the action of P. Hence, we only need to prove permutation equivariance of the graph-adaptive activation functions in (9) and (13). We write their output as the signal z with entries

[z]_i = β ReLU([x]_i) + Σ_{k=1}^{K} h_k^σ [g(S^k x, N_i)]_i    (19)

where g(·, N_i) denotes either a shifted localized operator [cf. Def. 1] or a kernel operator [cf. Def. 4]. Applying the activation function in (19) to the permuted signal x′, we obtain

[z′]_i = β ReLU([P^⊤ x]_i) + Σ_{k=1}^{K} h_k^σ [g((P^⊤ S P)^k P^⊤ x, N_i)]_i.    (20)

Since the ReLU activation function is pointwise, it is permutation equivariant, i.e., ReLU(P^⊤ x) = P^⊤ ReLU(x). We then focus on the second term of the sum, where we observe that (P^⊤ S P)^k = P^⊤ S P P^⊤ S P ··· P^⊤ S P = P^⊤ S^k P, which implies (P^⊤ S P)^k P^⊤ x = P^⊤ S^k x. We can rewrite z′ as

[z′]_i = β [P^⊤ ReLU(x)]_i + Σ_{k=1}^{K} h_k^σ [g(P^⊤ S^k x, N_i)]_i.    (21)

Because the function g(·, N_i) is localized, it acts on the one-hop neighborhood of each node, which is preserved under node relabelings. Therefore, g(·, N_i) is permutation equivariant and (21) becomes

[z′]_i = β [P^⊤ ReLU(x)]_i + Σ_{k=1}^{K} h_k^σ [P^⊤ g(S^k x, N_i)]_i = [P^⊤ β ReLU(x)]_i + [P^⊤ Σ_{k=1}^{K} h_k^σ g(S^k x, N_i)]_i.

Therefore z′ = P^⊤ z and, hence, GNNs with graph-adaptive activation functions are permutation equivariant.

Proof of Prop. 2. Let x̃ be a perturbed input with ith entry [x̃]_i = [x]_i + ε_i. Denoting by z̃ the output obtained by applying the graph-adaptive max activation function to x̃, we can write

‖[z̃]_i − [z]_i‖ ≤ ‖β(ReLU([x̃]_i) − ReLU([x]_i))‖ + ‖Σ_{k=1}^{K} h_k^σ ([max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i)‖    (22)

which is obtained by grouping terms and applying the triangle inequality. The ReLU activation is Lipschitz stable with constant one [20], and so

‖β(ReLU([x̃]_i) − ReLU([x]_i))‖ ≤ |β| ‖[x̃]_i − [x]_i‖ = |β| ‖[x̃ − x]_i‖.    (23)

For the second part of the sum in (22), we have

‖Σ_{k=1}^{K} h_k^σ ([max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i)‖ ≤ Σ_{k=1}^{K} |h_k^σ| ‖[max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i‖

which follows from the Cauchy-Schwarz inequality.
Observe that, for any two functions f(·) and g(·), we can write the inequality max(f) = max(f − g + g) ≤ max(f − g) + max(g), and so

Σ_{k=1}^{K} |h_k^σ| ‖[max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i‖ ≤ Σ_{k=1}^{K} |h_k^σ| ‖[max(S^k (x̃ − x), N_i)]_i‖.

We proceed by noting that

‖[max(S^k (x̃ − x), N_i)]_i‖ ≤ ‖max_i [S^k (x̃ − x)]_i‖ ≤ max_i ‖[S^k (x̃ − x)]_i‖ = ‖S^k (x̃ − x)‖_∞

which allows us to write

‖Σ_{k=1}^{K} h_k^σ ([max(S^k x̃, N_i)]_i − [max(S^k x, N_i)]_i)‖ ≤ Σ_{k=1}^{K} |h_k^σ| ‖S^k (x̃ − x)‖_∞ ≤ Σ_{k=1}^{K} |h_k^σ| ‖S^k‖_∞ ‖x̃ − x‖_∞ ≤ KC max_k ‖S^k‖_∞ ‖x̃ − x‖_∞.    (24)

Putting (23) and (24) together, we can write

‖[z̃ − z]_i‖ = ‖[z̃]_i − [z]_i‖ ≤ |β| ‖[x̃ − x]_i‖ + KC max_k ‖S^k‖_∞ ‖x̃ − x‖_∞.

Since this is true for all i, from the definition of ‖·‖_∞ we conclude

‖z̃ − z‖_∞ ≤ (|β| + KC max_k ‖S^k‖_∞) ‖x̃ − x‖_∞

which completes the proof. Note that ‖S^k‖_∞ ≥ ρ(S)^k = 1 for all k, with lim_{k→∞} ‖S^k‖_∞^{1/k} = ρ(S) = 1, so there exists K₀ such that, for all k > K₀, ‖S^k‖_∞ ≤ max_k ‖S^k‖_∞ with max_k ‖S^k‖_∞ = ‖S^{K₀}‖_∞.
6. REFERENCES

[1] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[2] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, "Graphs, convolutions, and neural networks," arXiv:2003.03777, 2020.
[3] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, Barcelona, Spain, 5-10 Dec. 2016, pp. 3844–3858.
[4] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, "Convolutional neural network architectures for signals supported on graphs," IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, Feb. 2019.
[5] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, "Distributed signal processing via Chebyshev polynomial approximation," IEEE Transactions on Signal and Information Processing over Networks, vol. 4, no. 4, pp. 736–751, 2018.
[6] E. Isufi, A. Loukas, A. Simonetto, and G. Leus, "Autoregressive moving average graph filtering," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 274–288, 2016.
[7] S. Segarra, A. G. Marques, and A. Ribeiro, "Optimal graph-filter design and applications to distributed linear network operators," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, Aug. 2017.
[8] A. Sandryhaila, S. Kar, and J. M. F. Moura, "Finite-time distributed consensus through graph filters," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1080–1084.
[9] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
[10] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[11] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and deep locally connected networks on graphs," in Int. Conf. Learning Representations, Banff, AB, 14-16 Apr. 2014, pp. 1–14.
[12] F. M. Bianchi, D. Grattarola, C. Alippi, and L. Livi, "Graph neural networks with convolutional ARMA filters," arXiv:1901.01343 [cs.LG], 2019.
[13] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Int. Conf. Learning Representations, Vancouver, BC, 30 Apr.-3 May 2018, pp. 1–12.
[14] E. Isufi, F. Gama, and A. Ribeiro, "EdgeNets: Edge varying graph neural networks," arXiv:2001.07620v1 [cs.LG], 21 Jan. 2020.
[15] L. Ruiz, F. Gama, A. G. Marques, and A. Ribeiro, "Invariance-preserving localized activation functions for graph neural networks," IEEE Trans. Signal Process., vol. 68, no. 1, pp. 127–141, Jan. 2020.
[16] F. Gama, J. Bruna, and A. Ribeiro, "Stability properties of graph neural networks," arXiv:1905.04497v2 [cs.LG], 4 Sep. 2019.
[17] E. Isufi, A. Loukas, A. Simonetto, and G. Leus, "Filtering random graph processes over random time-varying graphs," IEEE Trans. Signal Process., vol. 65, no. 16, pp. 4406–4421, 2017.
[18] E. Isufi, A. Loukas, N. Perraudin, and G. Leus, "Forecasting time series with VARMA recursions on graphs," IEEE Trans. Signal Process., vol. 67, no. 18, pp. 4870–4885, 2019.
[19] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Trans. Interactive Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, Jan. 2016.
[20] T. Wiatowski and H. Bölcskei, "A mathematical theory of deep convolutional neural networks for feature extraction," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1845–1866, 2018.