Improving Attention Mechanism in Graph Neural Networks via Cardinality Preservation
Shuo Zhang, Lei Xie
Ph.D. Program in Computer Science, The Graduate Center, The City University of New York
Department of Computer Science, Hunter College, The City University of New York
Helen and Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, Cornell University
[email protected], [email protected]
Abstract
Graph Neural Networks (GNNs) are powerful tools for learning representations of graph-structured data. Most GNNs use the message-passing scheme, where the embedding of a node is iteratively updated by aggregating the information of its neighbors. To better express the influence of individual nodes, attention mechanisms have become popular for assigning trainable weights to the nodes in aggregation. Although attention-based GNNs have achieved remarkable results on various tasks, a clear understanding of their discriminative capacity is missing. In this work, we present a theoretical analysis of the representational properties of GNNs that adopt the attention mechanism as an aggregator. Our analysis determines all cases in which those attention-based GNNs always fail to distinguish certain distinct structures. Those cases appear because attention-based aggregation ignores cardinality information. To improve the performance of attention-based GNNs, we propose Cardinality Preserved Attention (CPA) models that can be applied to any kind of attention mechanism. Our experiments on node and graph classification confirm our theoretical analysis and show the competitive performance of our CPA models.
Introduction
Graphs, as a powerful data structure in the non-Euclidean domain, can represent a set of instances (nodes) and the relationships (edges) between them, and thus have broad applications in various fields (Zhou et al. 2018b). Unlike regular Euclidean data such as text, images, and video, which have clear grid structures to which fundamental mathematical operations are relatively easy to generalize (Shuman et al. 2013), graph-structured data are irregular, so important deep learning operations (e.g., convolutions) are not straightforward to apply. Consequently, the analysis of graph-structured data remains a challenging and ubiquitous question.

In recent years, Graph Neural Networks (GNNs) have been proposed to learn representations of graph-structured data and have attracted growing interest (Scarselli et al. 2009; Li et al. 2016; Duvenaud et al. 2015; Niepert, Ahmed, and Kutzkov 2016; Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Zhang et al. 2018; Ying et al. 2018; Morris et al. 2019a; Xu et al. 2019). GNNs iteratively update node embeddings by aggregating/passing node features and structural information in the graph. The generated node embeddings can be fed into an extra classification/prediction layer, and the whole model is trained end-to-end for different tasks.

Although many GNNs have been proposed, we note that when updating the embedding of a node $v_i$ by aggregating the embeddings of its neighbor nodes $v_j$, most GNN variants assign non-parametric weights between $v_i$ and $v_j$ in their aggregators (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Xu et al. 2019). However, such aggregators (e.g., sum or mean) cannot learn to distinguish the information between a target node and its neighbors during training. Accounting for the different contributions of the nodes in a graph is important for real-world data, as not all edges have similar impacts. A natural alternative is to make the edge weights trainable for better expressive capability.

To assign learnable weights in the aggregation, the attention mechanism (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017) has been incorporated into GNNs. The weights can then be directly represented by attention coefficients between nodes and provide interpretability (Veličković et al. 2018; Thekumparampil et al. 2018; Zhou et al. 2018a). Although GNNs with attention-based aggregators achieve promising empirical performance on various tasks, a clear understanding of their discriminative power is missing, which hinders the design of more powerful attention-based GNNs. Recent works (Morris et al. 2019b; Xu et al. 2019; Maron et al. 2019) have theoretically analyzed the expressive power of GNNs, but their analyses do not account for the attention mechanism. It is therefore unclear whether using the attention mechanism in aggregation constrains the expressive power of GNNs.

In this work, we theoretically analyze the discriminative power of GNNs with attention-based aggregators. Our findings reveal that previously proposed attention-based aggregators fail to distinguish certain distinct structures. By determining all such cases, we show that the reason for those failures is the ignorance of cardinality information in aggregation. This inspires us to improve the attention mechanism via cardinality preservation. We propose models that can be applied to any kind of attention mechanism and achieve this goal.
In our experiments on node and graph classification, we confirm our theoretical analysis and validate the power of our proposed models. The best-performing one achieves competitive results compared to other baselines. Specifically, our key contributions are summarized as follows:
• We show that previously proposed attention-based aggregators in message-passing GNNs always fail to distinguish certain distinct structures. We determine all of those cases and demonstrate that the reason is the ignorance of cardinality information in attention-based aggregation.
• We propose Cardinality Preserved Attention (CPA) models to improve the original attention-based aggregator. With them, we can distinguish all cases that previously always fail an attention-based aggregator.
• Experiments on node and graph classification validate our theoretical analysis and the power of our CPA models. Compared to baselines, CPA models can reach the state-of-the-art level.
Preliminaries
Notations
Let $G = (V, E)$ be a graph with node set $V$ and edge set $E$. The nearest neighbors of node $i$ are defined as $N(i) = \{j \mid d(i,j) = 1\}$, where $d(i,j)$ is the shortest distance between nodes $i$ and $j$. We denote the set containing node $i$ and its nearest neighbors as $\tilde{N}(i) = N(i) \cup \{i\}$. For the nodes in $\tilde{N}(i)$, their feature vectors form a multiset $M(i) = (S_i, \mu_i)$, where $S_i = \{s_1, \ldots, s_n\}$ is the ground set of $M(i)$ and $\mu_i: S_i \rightarrow \mathbb{N}^*$ is the multiplicity function that gives the multiplicity of each $s \in S_i$. The cardinality $|M|$ of a multiset is the number of elements (with multiplicity) in the multiset.
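To make the notation concrete, the following is a minimal Python sketch (ours, not from the paper; the feature values are made up) of a multiset, its ground set, its multiplicity function, and its cardinality:

```python
from collections import Counter

# Hypothetical node features of node i and its nearest neighbors, as a multiset M(i).
features = ["a", "b", "b", "a", "b", "a"]

multiset = Counter(features)          # mu_i: maps each element of S_i to its multiplicity
ground_set = set(multiset)            # S_i = {"a", "b"}
cardinality = sum(multiset.values())  # |M| counts elements *with* multiplicity: 6

print(ground_set, dict(multiset), cardinality)  # {'a', 'b'} {'a': 3, 'b': 3} 6
```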
Graph Neural Networks

General GNNs
Graph Neural Networks (GNNs) adopt element (node or edge) features $X$ and the graph structure $A$ as input to learn the representation of each element, $h_i$, or of each graph, $h_G$, for different tasks. In this work, we focus on GNNs under the message-passing framework, which update the node embeddings iteratively by aggregating the embeddings of each node's nearest neighbors. In previous surveys, this type of GNN is referred to as Graph Convolutional Networks in (Wu et al. 2019) or as GNNs with a convolutional aggregator in (Zhou et al. 2018b). Under this framework, the learned representation of a node after $l$ aggregation layers contains the features and the structural information within the $l$-step neighborhood of the node. The $l$-th layer of a GNN can be formally represented as:

$$h_i^l = \phi^l\left(h_i^{l-1}, \left\{h_j^{l-1}, \forall j \in N(i)\right\}\right), \tag{1}$$

where the superscript $l$ denotes the $l$-th layer and $h_i^0$ is initialized as $X_i$. The aggregation function $\phi$ in Equation 1 propagates information between nodes and updates the hidden states of nodes.

In the final layer, since the node representation $h_i^L$ after $L$ iterations contains the $L$-step neighborhood information, it can be directly used for local/node-level tasks. For global/graph-level tasks, the whole-graph representation $h_G$ is needed, which requires an extra readout function $g$ to compute $h_G$ from all $h_i^L$:

$$h_G = g\left(\left\{h_i^L, \forall i \in G\right\}\right). \tag{2}$$
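As a concrete reference point, here is a minimal NumPy sketch (ours, not the paper's code) of one message-passing layer in the style of Equation 1 with a sum aggregator, plus a sum readout as in Equation 2. The graph, weight matrices, and dimensions are illustrative assumptions.

```python
import numpy as np

def mp_layer(h, neighbors, W_self, W_neigh):
    # One layer of Equation 1 with phi = ReLU(W_self h_i + W_neigh * sum_{j in N(i)} h_j).
    # h: (num_nodes, d_in); W_self, W_neigh: (d_out, d_in).
    new_h = np.empty((h.shape[0], W_self.shape[0]))
    for i in range(h.shape[0]):
        agg = np.sum(h[list(neighbors[i])], axis=0)  # sum over N(i)
        new_h[i] = np.maximum(0.0, W_self @ h[i] + W_neigh @ agg)
    return new_h

def readout(h):
    # Equation 2 with a sum readout g over all node embeddings.
    return h.sum(axis=0)

# Toy path graph 0-1-2-3 with random 5-dimensional features.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = np.random.default_rng(0)
h0 = rng.random((4, 5))
W_self, W_neigh = rng.random((8, 5)), rng.random((8, 5))
h1 = mp_layer(h0, neighbors, W_self, W_neigh)
print(readout(h1).shape)  # (8,)
```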
Attention-Based GNNs

When the aggregation function $\phi$ in Equation 1 adopts an attention mechanism, we consider the GNN to be attention-based. In a previous survey (Section 6 of (Lee et al. 2018)), this corresponds to the first two types of attention that have been applied to graph data. The attention-based aggregator in the $l$-th layer can be formulated as follows:

$$e_{ij}^{l-1} = \mathrm{Att}\left(h_i^{l-1}, h_j^{l-1}\right), \tag{3}$$

$$\alpha_{ij}^{l-1} = \mathrm{softmax}\left(e_{ij}^{l-1}\right) = \frac{\exp\left(e_{ij}^{l-1}\right)}{\sum_{k \in \tilde{N}(i)} \exp\left(e_{ik}^{l-1}\right)}, \tag{4}$$

$$h_i^l = f^l\left(\sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1}\right), \tag{5}$$

where the superscript $l$ denotes the $l$-th layer and $e_{ij}$ is the attention coefficient computed by an attention function $\mathrm{Att}$ to measure the relation between node $i$ and node $j$. $\alpha_{ij}$ is the attention weight calculated by the softmax function. Equation 5 is a weighted summation that uses all $\alpha$ as weights, followed by a nonlinear function $f$.
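The sketch below (ours, with an illustrative concatenation-based choice of Att rather than any particular published one) walks through Equations 3-5 for a single node: attention coefficients over $\tilde{N}(i)$, softmax normalization, and the weighted summation followed by a nonlinearity $f$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(h_i, h_neigh, a):
    # Eq. 3 with an illustrative Att: e_ij = a . [h_i ; h_j] (an assumption, not the paper's choice).
    nodes = [h_i] + list(h_neigh)                           # N~(i) = N(i) U {i}
    e = np.array([a @ np.concatenate([h_i, h_j]) for h_j in nodes])
    alpha = softmax(e)                                      # Eq. 4
    z = sum(w * h_j for w, h_j in zip(alpha, nodes))        # weighted summation in Eq. 5
    return np.tanh(z)                                       # f, a nonlinearity

rng = np.random.default_rng(0)
h_i = rng.normal(size=4)
h_neigh = [rng.normal(size=4) for _ in range(3)]
a = rng.normal(size=8)
print(attention_aggregate(h_i, h_neigh, a))
```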
Related Works

Since GNNs have achieved remarkable results in practice, a clear understanding of the power of GNNs in graph representation learning is needed to design better models and make further improvements. Recent works (Morris et al. 2019b; Xu et al. 2019; Maron et al. 2019) focus on understanding the discriminative power of GNNs by comparing them to the Weisfeiler-Lehman (WL) test (Weisfeiler and Leman 1968) for deciding graph isomorphism. It has been proved that message-passing-based GNNs, which aggregate the nearest neighbor node features of a node for embedding, are at most as powerful as the 1-WL test (Xu et al. 2019). Inspired by the higher discriminative power of the $k$-WL test for larger $k$ (Cai, Fürer, and Immerman 1992) relative to the 1-WL test, GNNs with a theoretically higher discriminative power than message-passing-based GNNs have been proposed based on the $k$-WL test (Morris et al. 2019b; Maron et al. 2019). However, the GNNs proposed in those works do not specifically include the attention mechanism in their analysis, so it is currently unknown whether the attention mechanism constrains the discriminative power. Our work focuses on message-passing-based GNNs with attention mechanisms, which are upper-bounded by the 1-WL test.

Another recent work (Knyazev, Taylor, and Amer 2019) aims to understand the attention mechanism over nodes in GNNs with experiments in a controlled environment. However, the attention mechanism discussed in that work is used in the pooling layer for the pooling of nodes, while our work investigates the use of the attention mechanism in the aggregation layer for the updating of nodes.

Limitation of Attention-Based GNNs

In this section, we theoretically analyze the discriminative power of attention-based GNNs and show their limitations. The discriminative power describes how well an attention-based GNN can distinguish different elements (local or global structures). We find that previously proposed attention-based GNNs can fail in certain cases, so their discriminative power is limited. Moreover, by theoretically finding all cases that always fail an attention-based GNN, we reveal that those failures come from the lack of cardinality preservation in attention-based aggregators. The details of the proofs are included in the Supplemental Material.
Discriminative Power of Attention-based GNNs
We assume the node input feature space is countable. For any attention-based GNN, we give in Lemma 1 the conditions under which it reaches the upper bound of its discriminative power when distinguishing different elements (local or global structures). In particular, each local structure belongs to a node and is the $k$-height subtree structure rooted at that node, which is naturally captured in the node feature $h_i^k$ after $k$ iterations in a GNN. The global structure contains the information of all such subtrees in a graph.

Lemma 1.
Let $A: G \rightarrow \mathbb{R}^g$ be a GNN following the neighborhood aggregation scheme with the attention-based aggregator (Equation 5). For global-level tasks, an extra readout function (Equation 2) is used in the final layer. $A$ can reach its upper bound of discriminative power (it can distinguish all distinct local structures, or it is as powerful as the 1-WL test when distinguishing distinct global structures) after sufficient iterations under the following conditions:
• Local-level: The function $f$ and the weighted summation in Equation 5 are injective.
• Global-level: In addition to the conditions for the local level, $A$'s readout function (Equation 2) is injective.

Given Lemma 1, we are interested in whether its conditions can always be satisfied, so that an attention-based GNN reaches the upper bound of its discriminative capacity. Since the function $f$ and the global-level readout function can be predetermined to be injective, we focus on whether the weighted summation function in the attention-based aggregator can be injective.

The Non-Injectivity of Attention-Based Aggregator
In this part, we aim to answer the following two questions:
Q1.
Can attention-based GNNs actually reach the upper bound of discriminative power? In other words, can the weighted summation function in an attention-based aggregator be injective?
Q2.
If not, can we determine all of the cases that prevent any kind of weighted summation function from being injective?
Given a countable feature space $\mathcal{H}$, a weighted summation function is a mapping $W: \mathcal{H} \rightarrow \mathbb{R}^n$. The exact $W$ is determined by the attention weights $\alpha$ computed from $\mathrm{Att}$ in Equation 3. Since $\mathrm{Att}$ is affected by stochastic optimization algorithms (e.g., SGD), which introduce stochasticity into $W$, we must keep in mind that $W$ is not fixed when dealing with the two questions.

In Theorem 1, we answer Q1 with No by giving the cases that prevent $W$ from being injective, so that attention-based GNNs can never meet their upper bound of discriminative power, as stated in Corollary 1. Moreover, we answer Q2 with Yes in Theorem 1 by pointing out that those cases are the only reason that always prevents $W$ from being injective. This alleviates the difficulty of summarizing the properties of those cases. Moreover, we can specifically propose methods to avoid those cases so as to let $W$ be injective.

Theorem 1.
Assume the input feature space $\mathcal{X}$ is countable. Given a multiset $X \subset \mathcal{X}$ and the node feature $c$ of the central node, the weighted summation function $h(c, X)$ in aggregation is defined as $h(c, X) = \sum_{x \in X} \alpha_{cx} f(x)$, where $f: \mathcal{X} \rightarrow \mathbb{R}^n$ is a mapping of the input feature vector and $\alpha_{cx}$ is the attention weight between $f(c)$ and $f(x)$, calculated by the attention function $\mathrm{Att}$ in Equation 3 and the softmax function in Equation 4. For all $f$ and $\mathrm{Att}$, $h(c_1, X_1) = h(c_2, X_2)$ if and only if $c_1 = c_2$, $X_1 = (S, \mu)$, and $X_2 = (S, k \cdot \mu)$ for $k \in \mathbb{N}^*$. In other words, $h$ maps different multisets to the same embedding if and only if the multisets have the same central node feature and the same distribution of node features.
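A quick numeric check of the "if" direction of Theorem 1 (a sketch under made-up choices of Att and f; by the theorem, any choices behave the same way): duplicating every element of the multiset $k$ times scales both the exponentials and the softmax normalizer by $k$, so the weighted sum is unchanged.

```python
import numpy as np

def weighted_sum(c, X, att, f):
    e = np.array([att(c, x) for x in X])
    alpha = np.exp(e) / np.exp(e).sum()          # softmax, Equation 4
    return sum(a * f(x) for a, x in zip(alpha, X))

att = lambda c, x: float(c * x)                  # an arbitrary attention function
f = lambda x: np.array([x, x ** 2])              # an arbitrary feature map
c, X1 = 1.0, [1.0, 2.0, 2.0]
X2 = X1 * 3                                      # same distribution, 3x cardinality

print(weighted_sum(c, X1, att, f))               # identical outputs (up to float error):
print(weighted_sum(c, X2, att, f))               # h(c, X1) == h(c, X2)
```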
Corollary 1. Let $A$ be the GNN defined in Lemma 1. $A$ never reaches its upper bound of discriminative power: there exist two different subtrees $S_1$ and $S_2$, or two graphs $G_1$ and $G_2$ that the Weisfeiler-Lehman test decides are non-isomorphic, such that $A$ always maps the two subtrees/graphs to the same embeddings.

Attention Mechanism Fails to Preserve Cardinality
With Theorem 1, we now examine the properties of all cases that always prevent the weighted summation functions $W$ from being injective. Since the multisets that all $W$ fail to distinguish share the same distribution of node features, $W$ ignores the multiplicity information of each identical element in the multisets. Thus, the cardinality of the multiset is not preserved:

Corollary 2.
Let $A$ be the GNN defined in Lemma 1. The attention-based aggregator in $A$ cannot preserve the cardinality information of the multiset of node features in aggregation.

In the next section, we propose improved attention-based models that preserve cardinality in aggregation.
Cardinality Preserved Attention (CPA) Model
Since the cardinality of the multiset is not preserved in attention-based aggregators, our goal is to modify any kind of attention mechanism so that it captures the cardinality information. In this way, all of the cases that always prevent an attention-based aggregator from being injective can be avoided.

To achieve our goal, we modify the weighted summation function in Equation 5 to incorporate the cardinality information, and we do not change the attention function in Equation 3 so as to keep its original expressive power.
Figure 1: An illustration of different attention-based aggregators on multisets of node features. Given two distinct multisets $H_1$ and $H_2$ that have the same central node feature $h_i$ and the same distribution of node features, aggregators map $h_i$ to $h_{i,1}$ and $h_{i,2}$ for $H_1$ and $H_2$. The Original model gets $h'_{i,1} = h'_{i,2}$ and fails to distinguish $H_1$ and $H_2$, while our Additive and Scaled models can always distinguish $H_1$ and $H_2$ with $h''_{i,1} \neq h''_{i,2}$ and $h'''_{i,1} \neq h'''_{i,2}$.

Two different models, named Additive and Scaled, are proposed to modify the Original model in Equation 5:
Model 1. (Additive)

$$h_i^l = f^l\left(\sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1} + w^l \odot \sum_{j \in \tilde{N}(i)} h_j^{l-1}\right), \tag{6}$$

Model 2. (Scaled)

$$h_i^l = f^l\left(\psi^l\left(\left|\tilde{N}(i)\right|\right) \odot \sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1}\right), \tag{7}$$

where $w$ is a non-zero vector in $\mathbb{R}^n$, $\odot$ denotes element-wise multiplication, $|\tilde{N}(i)|$ equals the cardinality of the multiset $\tilde{N}(i)$, and $\psi: \mathbb{Z}^+ \rightarrow \mathbb{R}^n$ is an injective function.

In the Additive model, each element in the multiset contributes to the term that we add to preserve the cardinality information. In the Scaled model, the original weighted summation is directly multiplied by a representational vector of the cardinality value. With these models, distinct multisets with the same distribution result in different embeddings $h$. Note that neither of our models changes the $\mathrm{Att}$ function, so they keep the learning power of the original attention mechanism. We summarize the effect of our models in Corollary 3 and illustrate it in Figure 1.
Corollary 3.
Let $T$ be the original attention-based aggregator in Equation 5. With our proposed Cardinality Preserved Attention (CPA) models in Equations 6 and 7, $T$'s discriminative power is increased: $T$ can now distinguish all different multisets in aggregation that it previously always failed to distinguish.

While the original attention-based aggregator is never injective, as shown in the previous sections, our cardinality preserved attention-based aggregator can be injective with certain learned attention weights and thereby reach its upper bound of discriminative power. We validate this in our experiments.

Regarding the time and space complexity of our CPA models compared to the original attention-based aggregator, Models 1 and 2 clearly take more time and space than the original one due to the introduced vectors $w$ and $\psi(|\tilde{N}(i)|)$. We therefore further simplify our models by fixing the values in $w$ and $\psi(|\tilde{N}(i)|)$ and define two CPA variants:

Model 3. (f-Additive)

$$h_i^l = f^l\left(\sum_{j \in \tilde{N}(i)} \left(\alpha_{ij}^{l-1} + 1\right) h_j^{l-1}\right), \tag{8}$$

Model 4. (f-Scaled)

$$h_i^l = f^l\left(\left|\tilde{N}(i)\right| \cdot \sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1}\right). \tag{9}$$

Models 3 and 4 still preserve the cardinality information and have reduced time and space complexity compared to Models 1 and 2. In fact, since $w$ and $\psi(|\tilde{N}(i)|)$ degenerate into constants, Models 3 and 4 have the same time and space complexity as the original model in Equation 5. In our experiments, we examine all 4 models together with the original one.
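To make the fixed-weight variants concrete, here is a sketch (ours; Att and f are the same arbitrary stand-ins as in the earlier sketch) of the f-Additive and f-Scaled aggregations of Equations 8 and 9, applied to a pair of multisets that fools the original aggregator; both variants now produce different outputs for X1 and X2.

```python
import numpy as np

def cpa_aggregate(c, X, att, f, mode="f-scaled"):
    e = np.array([att(c, x) for x in X])
    alpha = np.exp(e) / np.exp(e).sum()
    if mode == "f-additive":                       # Equation 8: weights alpha + 1
        return sum((a + 1.0) * f(x) for a, x in zip(alpha, X))
    if mode == "f-scaled":                         # Equation 9: scale by cardinality
        return len(X) * sum(a * f(x) for a, x in zip(alpha, X))
    raise ValueError(mode)

att = lambda c, x: float(c * x)
f = lambda x: np.array([x, x ** 2])
c, X1 = 1.0, [1.0, 2.0, 2.0]
X2 = X1 * 3                                        # same distribution, 3x cardinality

# Unlike the original aggregator, both variants separate X1 from X2:
for mode in ("f-additive", "f-scaled"):
    print(mode, cpa_aggregate(c, X1, att, f, mode), cpa_aggregate(c, X2, att, f, mode))
```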
Experiments

In our experiments, we focus on the following questions:
Q3.
Since attention-based GNNs (e.g., GAT) were originally proposed for local-level tasks like node classification, will those models fail or fall short of the upper bound of discriminative power when solving certain node classification tasks? If so, can our proposed CPA models improve the original model?
Q4.
For global-level tasks like graph classification, how well can the original attention-based GNNs perform? Can our proposed CPA models improve the original model?
Q5.
How do attention-based GNNs with our CPA models perform compared to baselines?
To answer Question 3, we design a node classification task: predicting whether or not a node is included in a triangle as a vertex in a graph. To answer Questions 4 and 5, we perform experiments on graph classification benchmarks and evaluate the performance of attention-based GNNs with CPA models.
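For reference, node labels for such a triangle-membership task can be derived with networkx's triangle counter. This is a hedged sketch with illustrative graph sizes, not the paper's exact generator (which is described in its Supplemental Material).

```python
import networkx as nx

G = nx.gnm_random_graph(n=100, m=300, seed=0)   # illustrative sizes, not the paper's
# nx.triangles returns, for each node, the number of triangles it participates in;
# the binary label is membership in at least one triangle.
labels = {v: int(t > 0) for v, t in nx.triangles(G).items()}
print(sum(labels.values()), "of", G.number_of_nodes(), "nodes lie in a triangle")
```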
Experimental Setup
Datasets
In our synthetic task (TRIANGLE-NODE) for predicting whether or not a node is included in a triangle, we generate a graph with different node features. In our experiments on graph classification, we use 6 benchmark datasets: 2 social network datasets (REDDIT-BINARY (RE-B), REDDIT-MULTI5K (RE-M5K)) and 4 bioinformatics datasets (MUTAG, PROTEINS, ENZYMES, NCI1). More details of the datasets are provided in the Supplemental Material.

Table 1: Testing accuracies (%) of GAT variants (the original GAT and the GAT applied with each of our 4 CPA models) on the TRIANGLE-NODE dataset for node classification. We highlight the result of the best-performing model. The proportion P of multisets that hold the properties in Theorem 1 among all multisets is also reported. [Table values not recoverable.]

Table 2: Testing accuracies (%) of GAT-GC variants (the original one and the ones applied with each of our 4 CPA models) on social network datasets. We highlight the result of the best-performing model per dataset. The proportion P of multisets that hold the properties in Theorem 1 among all multisets is also reported for each dataset. [Table values not recoverable.]

Models
In our experiments, the Original model is the one that uses the original version of an attention mechanism. We apply each of our 4 CPA models (Additive, Scaled, f-Additive, and f-Scaled) to the original attention mechanism for comparison. In the Additive and Scaled models, we take advantage of the approximation capability of the multi-layer perceptron (MLP) (Hornik, Stinchcombe, and White 1989; Hornik 1991) to model $f$ and $\psi$.

For node classification, we use GAT (Veličković et al. 2018) as the Original model. For graph classification, we build a GNN (GAT-GC) based on GAT as the Original model: we adopt the attention mechanism in GAT to specify the form of Equation 3: $e_{ij} = \mathrm{LeakyReLU}\left(a^{\top}\left[W h_i \,\|\, W h_j\right]\right)$. For the readout function, a naive way is to consider only the node embeddings from the last iteration. Although a sufficient number of iterations can help to avoid the cases in Theorem 1 by aggregating more diverse node features, the features from the later iterations may generalize worse, and GNNs usually have shallow structures (Xu et al. 2019; Zhou et al. 2018b). Therefore, GAT-GC adopts the same function as used in (Xu et al. 2018; Xu et al. 2019; Lee, Lee, and Kang 2019; Li et al. 2019), which concatenates graph embeddings from all iterations: $h_G = \big\Vert_{l=0}^{L}\, \mathrm{Readout}\left(\left\{h_i^l \mid i \in G\right\}\right)$, where the Readout function can be sum or mean. With CPA models, the cases in Theorem 1 can be avoided in each iteration. Full experimental settings are included in the Supplemental Material.

Figure 2: Training curves of GAT-GC variants on bioinformatics datasets.

Table 3: Testing accuracies (%) of GAT-GC variants (the original one and the ones applied with each of our 4 CPA models) on bioinformatics datasets. We highlight the result of the best-performing model per dataset. The highlighted results are significantly higher than those from the corresponding Original model under a paired t-test. The proportion P of multisets that hold the properties in Theorem 1 among all multisets is also reported for each dataset. [Table values not recoverable.]
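The concatenated readout used by GAT-GC can be sketched as follows (ours, with illustrative shapes): per-layer graph embeddings are computed by a sum or mean over nodes and concatenated across all $L+1$ layers, so features from earlier iterations are retained alongside the final ones.

```python
import numpy as np

def graph_embedding(h_per_layer, readout="sum"):
    # h_per_layer: list of (num_nodes, d) arrays for layers l = 0..L.
    reduce = np.sum if readout == "sum" else np.mean
    return np.concatenate([reduce(h, axis=0) for h in h_per_layer])

layers = [np.random.rand(10, 16) for _ in range(5)]  # L = 4 plus the input layer
print(graph_embedding(layers).shape)                 # (80,): 5 layers x 16 dims
```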
Node Classification

For the TRIANGLE-NODE dataset, the proportion P of multisets that hold the properties in Theorem 1 is reported in Table 1. The classification accuracy of the Original model (GAT) is significantly lower than that of the CPA models. This supports the claim in Corollary 1: the Original model fails to distinguish all distinct multisets in the dataset and exhibits constrained discriminative power. On the contrary, the CPA models can distinguish all different multisets in the graph, as suggested by Corollary 3, and indeed significantly improve the accuracy of the Original model, as shown in Table 1. This experiment thus answers Question 3.
Graph Classification
In this section, we aim to answer Question 4 by evaluating the performance of variants of the GAT-based GNN (GAT-GC) on graph classification benchmarks. In addition, we compare our best-performing CPA model with baseline models to answer Question 5.

Table 4: Testing accuracies (%) for graph classification on MUTAG, PROTEINS, ENZYMES, NCI1, RE-B, and RE-M5K, for the baselines (WL, PSCN, DGCNN, GIN, CapsGNN) and our GAT-GC variants. We highlight the result of the best-performing model for each dataset. Our GAT-GC (f-Scaled) model achieves the top 2 on all 6 datasets. [Table values not recoverable.]

Social Network Datasets
Since the RE-B and RE-M5K datasets have no original node features, we assign all node features to be the same, which gives $P = 100.0\%$ on those datasets. Thus, all multisets in aggregation are mapped to the same embedding by the Original GAT-GC. After a mean readout function on all multisets, all graphs are finally mapped to the same embedding. The performance of the Original model is then just random guessing of the graph labels, as reported in Table 2, while our CPA models can distinguish all different multisets and are confirmed to be significantly better than the Original one.

Here we also examine a naive approach to incorporating the cardinality information in the Original model: assigning node degrees as input node labels. In this way, the node features are diverse and we get $P = 0.0\%$, which means that the cases in Theorem 1 are all avoided. However, the testing accuracies of Original on RE-B and RE-M5K remain significantly lower than the results of the CPA models in Table 2. Thus, in practice, our proposed models exhibit good generalization power compared to the naive approach.

Bioinformatics Datasets
For bioinformatics datasets that contain diverse node labels, we also report the P values in Table 3. The results reveal the existence ($P > 0$) of cases in those datasets that can fool the Original model; thus, the discriminative power of the Original model is theoretically constrained.

To validate this empirically, we compare the training accuracies of the GAT-GC variants, since the discriminative power is directly indicated by the accuracies on the training sets: a higher training accuracy indicates a better ability to fit and distinguish different graphs. The training curves of the GAT-GC variants are shown in Figure 2. From these curves, we can see that even though the Original model has overfitted the different datasets, the fitting accuracies to which it converges are never higher than those of our CPA models. On several datasets, the CPA models reach training accuracies close to those obtained from the WL kernel (as shown in (Xu et al. 2019)). These findings validate that the discriminative power of the Original model is constrained, while our CPA models can approach the upper bound of discriminative power with certain learned weights.

In Table 3 we report the testing accuracies of the GAT-GC variants on bioinformatics datasets. The Original model can get meaningful results. However, our proposed CPA models further improve the testing accuracies of the Original model on all datasets. This indicates that the preservation of cardinality can also benefit the generalization power of the model, in addition to its discriminative power.

From the results in Tables 2 and 3, we find that the f-Scaled model performs the best under an average ranking measure (Taheri, Gimpel, and Berger-Wolf 2018). The good performance of the fixed-weight models (f-Additive and f-Scaled) compared to the full models (Additive and Scaled) demonstrates that the improvements achieved by the CPA models are not simply due to the increased capacity afforded by the additional embedded vectors.
Comparison to Baselines
We further compare the best-performing GAT-GC variant (f-Scaled) with other baselines: the WL kernel (WL) (Shervashidze et al. 2011), PATCHY-SAN (PSCN) (Niepert, Ahmed, and Kutzkov 2016), Deep Graph CNN (DGCNN) (Zhang et al. 2018), Graph Isomorphism Network (GIN) (Xu et al. 2019), and Capsule Graph Neural Network (CapsGNN) (Xinyi and Chen 2019). The results are reported in Table 4. Our GAT-GC (f-Scaled) model achieves 4 top-1 and 2 top-2 results on the 6 datasets. We expect that even better performance can be achieved with certain choices of attention mechanism other than the GAT one.
Conclusion
In this paper, we theoretically analyze the representational power of GNNs with attention-based aggregators: we determine all cases in which those GNNs always fail to distinguish distinct structures. The finding shows that the missing cardinality information in aggregation is the only reason for those failures. To address this, we propose Cardinality Preserved Attention (CPA) models. In our experiments, we validate our theoretical analysis that the performance of the original attention-based GNNs is limited, and we show that our models improve the original ones. Compared to other baselines, our best-performing model achieves competitive performance. In future work, a challenging problem is to better learn the attention weights so as to guarantee the injectivity of our cardinality preserved attention models after training. It would also be interesting to analyze the effects of different attention mechanisms.

References

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Cai, Fürer, and Immerman 1992] Cai, J.-Y.; Fürer, M.; and Immerman, N. 1992. An optimal lower bound on the number of variables for graph identification. Combinatorica 12(4):389–410.
[Duvenaud et al. 2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2224–2232.
[Hamilton, Ying, and Leskovec 2017] Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034.
[Hornik, Stinchcombe, and White 1989] Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.
[Hornik 1991] Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2):251–257.
[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.
[Ivanov and Burnaev 2018] Ivanov, S., and Burnaev, E. 2018. Anonymous walk embeddings. In International Conference on Machine Learning, 2191–2200.
[Kingma and Ba 2018] Kingma, D. P., and Ba, J. 2018. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Kipf and Welling 2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
[Knyazev, Taylor, and Amer 2019] Knyazev, B.; Taylor, G. W.; and Amer, M. R. 2019. Understanding attention and generalization in graph neural networks. arXiv preprint arXiv:1905.02850.
[Lee et al. 2018] Lee, J. B.; Rossi, R. A.; Kim, S.; Ahmed, N. K.; and Koh, E. 2018. Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984.
[Lee, Lee, and Kang 2019] Lee, J.; Lee, I.; and Kang, J. 2019. Self-attention graph pooling. In International Conference on Machine Learning, 3734–3743.
[Li et al. 2016] Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. In International Conference on Learning Representations.
[Li et al. 2019] Li, G.; Müller, M.; Thabet, A.; and Ghanem, B. 2019. DeepGCNs: Can GCNs go as deep as CNNs? In The IEEE International Conference on Computer Vision (ICCV).
[Maron et al. 2019] Maron, H.; Ben-Hamu, H.; Serviansky, H.; and Lipman, Y. 2019. Provably powerful graph networks. In Advances in Neural Information Processing Systems.
[Morris et al. 2019a] Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W. L.; Lenssen, J. E.; Rattan, G.; and Grohe, M. 2019a. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Morris et al. 2019b] Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W. L.; Lenssen, J. E.; Rattan, G.; and Grohe, M. 2019b. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4602–4609.
[Niepert, Ahmed, and Kutzkov 2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2014–2023.
[Scarselli et al. 2009] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
[Shervashidze et al. 2011] Shervashidze, N.; Schweitzer, P.; van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12:2539–2561.
[Shuman et al. 2013] Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30(3):83–98.
[Taheri, Gimpel, and Berger-Wolf 2018] Taheri, A.; Gimpel, K.; and Berger-Wolf, T. 2018. Learning graph representations with recurrent neural network autoencoders. In KDD Deep Learning Day.
[Thekumparampil et al. 2018] Thekumparampil, K. K.; Wang, C.; Oh, S.; and Li, L.-J. 2018. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
[Veličković et al. 2018] Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph attention networks. In International Conference on Learning Representations.
[Weisfeiler and Leman 1968] Weisfeiler, B., and Leman, A. 1968. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series 2, 9:12–16.
[Wu et al. 2019] Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Yu, P. S. 2019. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
[Xinyi and Chen 2019] Xinyi, Z., and Chen, L. 2019. Capsule graph neural network. In International Conference on Learning Representations.
[Xu et al. 2018] Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, 5449–5458.
[Xu et al. 2019] Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How powerful are graph neural networks? In International Conference on Learning Representations.
[Ying et al. 2018] Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, 4805–4815.
[Zhang et al. 2018] Zhang, M.; Cui, Z.; Neumann, M.; and Chen, Y. 2018. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Zhou et al. 2018a] Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018a. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, 4623–4629.
[Zhou et al. 2018b] Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; and Sun, M. 2018b. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434.
Proof.
Local-level: The aggregator in the first layer maps different 1-height subtree structures to different embeddings from the distinct input multisets of neighborhood node features, since it is injective. Iteratively, the aggregator in the $l$-th layer distinguishes different $l$-height subtree structures by mapping them to different embeddings from the distinct input multisets of ($l-1$)-height subtree features, since it is injective.

Global-level: From Lemma 2 and Theorem 3 in (Xu et al. 2019), we know that when all functions in $A$ are injective, $A$ reaches its upper bound of discriminative power, which is the same as that of the Weisfeiler-Lehman (WL) test (Weisfeiler and Leman 1968) for deciding graph isomorphism.

Proof for Theorem 1
Proof.
To prove Theorem 1, we consider both directions of the iff statement.

(1) Suppose $c_1 = c_2 = c$, $X_1 = (S, \mu)$, and $X_2 = (S, k \cdot \mu)$. Since $h(c, X) = \sum_{x \in X} \alpha_{cx} f(x)$, we have

$$h(c_i, X_i) = \sum_{x \in X_i} \alpha_{cx}^{(i)} f(x), \quad i \in \{1, 2\},$$

where $\alpha_{cx}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c)$ and $f(x)$, $x \in X_i$.

We can rewrite the equations using $S$ and $\mu$:

$$h(c_1, X_1) = h(c, S, \mu) = \sum_{s \in S} \mu(s)\, \alpha_{cs}^{(1)} f(s), \qquad h(c_2, X_2) = h(c, S, k \cdot \mu) = \sum_{s \in S} k \cdot \mu(s)\, \alpha_{cs}^{(2)} f(s),$$

where $\mu(s)$ is the multiplicity function and $\alpha_{cs}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c)$ and $f(s)$, $s \in S$.

Considering the softmax function in Equation 4, we can use the attention coefficients $e$ to rewrite the equations:

$$\sum_{s \in S} \mu(s)\, \alpha_{cs}^{(1)} f(s) = \frac{\sum_{s \in S} \mu(s) \exp\left(e_{cs}^{(1)}\right) f(s)}{\sum_{x \in X_1} \exp\left(e_{cx}^{(1)}\right)}, \qquad \sum_{s \in S} k \cdot \mu(s)\, \alpha_{cs}^{(2)} f(s) = \frac{k \cdot \sum_{s \in S} \mu(s) \exp\left(e_{cs}^{(2)}\right) f(s)}{\sum_{x \in X_2} \exp\left(e_{cx}^{(2)}\right)},$$

where $e_{cs}^{(i)}$ (resp. $e_{cx}^{(i)}$) is the attention coefficient belonging to $X_i$, between $f(c)$ and $f(s)$ (resp. between $f(c)$ and $f(x)$).

Since the attention coefficient $e$ is computed by the function $\mathrm{Att}$, which is independent of $X$, we have $e_{cs}^{(1)} = e_{cs}^{(2)} =: e_{cs}$ for all $s \in S$ and $e_{cx}^{(1)} = e_{cx}^{(2)} =: e_{cx}$ for all $x \in X_1, X_2$. Recall that $X_2$ has $k$ copies of the elements in $X_1$, so that

$$\sum_{x \in X_1} \exp(e_{cx}) = \frac{1}{k} \sum_{x \in X_2} \exp(e_{cx}).$$

Using this equation, we get

$$\frac{\sum_{s \in S} \mu(s) \exp(e_{cs}) f(s)}{\sum_{x \in X_1} \exp(e_{cx})} = \frac{k \cdot \sum_{s \in S} \mu(s) \exp(e_{cs}) f(s)}{\sum_{x \in X_2} \exp(e_{cx})}.$$

From all the equations above, we finally have $h(c_1, X_1) = h(c_2, X_2)$.

(2) Suppose $h(c_1, X_1) = h(c_2, X_2)$ for all $f$ and $\mathrm{Att}$. Then

$$\sum_{x \in X_1} \alpha_{cx}^{(1)} f(x) = \sum_{x \in X_2} \alpha_{cx}^{(2)} f(x), \quad \forall f, \mathrm{Att},$$

where $\alpha_{cx}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c_i)$ and $f(x)$, $x \in X_i$.

We denote $X_1 = (S_1, \mu_1)$ and $X_2 = (S_2, \mu_2)$ and rewrite the equation:

$$\sum_{s \in S_1} \mu_1(s)\, \alpha_{cs}^{(1)} f(s) = \sum_{s \in S_2} \mu_2(s)\, \alpha_{cs}^{(2)} f(s), \quad \forall f, \mathrm{Att},$$

where $\mu_i(s)$ is the multiplicity function of $X_i$ and $\alpha_{cs}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c_i)$ and $f(s)$, $s \in S_i$.

Considering the relations between $S_1$ and $S_2$, we have

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} f(s) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} f(s) = 0. \tag{10}$$

Assume that the equality in Equation 10 holds for all $f$ while $S_1 \neq S_2$. We can then define two functions $f_1$ and $f_2$ such that

$$f_2(s) = f_1(s), \ \forall s \in S_1 \cap S_2; \qquad f_2(s) = f_1(s) - 1, \ \forall s \in S_1 \setminus S_2; \qquad f_2(s) = f_1(s) + 1, \ \forall s \in S_2 \setminus S_1.$$

If the equality in Equation 10 holds for $f_2$, we have

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f_2(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} f_2(s) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} f_2(s) = 0. \tag{11}$$

We can rewrite Equation 11 using $f_1$:

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f_1(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} (f_1(s) - 1) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} (f_1(s) + 1) = 0.$$

Thus we know

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f_1(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} f_1(s) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} f_1(s) = \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} + \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)}. \tag{12}$$

Note that the LHS of Equation 12 is just the LHS of Equation 10 with $f = f_1$. As $\mu_i(s) \geq 1$ by the definition of multiplicity and $\alpha_{cs}^{(i)} > 0$ due to the softmax function, we have $\mu_i(s)\, \alpha_{cs}^{(i)} > 0$ for all $s \in S_i$, $i \in \{1, 2\}$. Thus the RHS of Equation 12 is $> 0$, so the equality in Equation 10 does not hold for $f_1$, and the assumption $S_1 \neq S_2$ is false.

We denote $S_1 = S_2 = S$. To make the remaining summation term always equal to 0, we need $\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)} = 0$ for all $\mathrm{Att}$. Considering the softmax function in Equation 4, we can rewrite this as

$$\frac{\mu_1(s)}{\mu_2(s)} = \frac{\exp\left(e_{cs}^{(2)}\right)}{\exp\left(e_{cs}^{(1)}\right)} \cdot \frac{\sum_{x \in X_1} \exp\left(e_{cx}^{(1)}\right)}{\sum_{x \in X_2} \exp\left(e_{cx}^{(2)}\right)}, \quad \forall \mathrm{Att}, \tag{13}$$

where $e_{cs}^{(i)}$ is the attention coefficient belonging to $X_i$, between $f(c_i)$ and $f(s)$, $s \in S$, and $e_{cx}^{(i)}$ is the attention coefficient belonging to $X_i$, between $f(c_i)$ and $f(x)$, $x \in X_i$.

The LHS of Equation 13 is a rational number. However, if $c_1 \neq c_2$, the RHS of Equation 13 can be irrational. We assume $S$ contains at least two elements $s_1$ and $s_2 \neq s_1$ (if not, we can directly get $c_1 = c_2$). Consider any attention mechanism that results in

$$e_{cs}^{(1)} = 1, \ \forall s \in S; \qquad e_{cs}^{(2)} = \begin{cases} 2, & \text{for } s = s_1, \\ 1, & \forall s \neq s_1 \in S. \end{cases}$$

Then, for $s = s_1$, the RHS of the equation becomes

$$\frac{e^2}{e} \cdot \frac{|X_1|\, e}{(|X_2| - n)\, e + n\, e^2} = \frac{|X_1|\, e}{(|X_2| - n) + n\, e},$$

where $n$ is the multiplicity of $s_1$ in $X_2$. The value of the RHS is clearly irrational. So we must have $c_1 = c_2$ for the equality to always hold.

With $c_1 = c_2$, we know $e_{cs}^{(1)} = e_{cs}^{(2)}$ for all $s \in S$ and $e_{cx}^{(1)} = e_{cx}^{(2)}$ for all $x \in X_1, X_2$. Denoting $e_{cx}^{(1)} = e_{cx}^{(2)} = e_{cx}$, Equation 13 becomes

$$\frac{\mu_1(s)}{\mu_2(s)} = \frac{\sum_{x \in X_1} \exp(e_{cx})}{\sum_{x \in X_2} \exp(e_{cx})} = \text{const.}, \quad \forall \mathrm{Att}.$$

We further denote $k = \mu_2(s)/\mu_1(s)$, $\forall s \in S$, so that $\mu_2 = k \cdot \mu_1$. Finally, by denoting $\mu_1 = \mu$, we have $X_1 = (S, \mu)$, $X_2 = (S, k \cdot \mu)$, and $c_1 = c_2$.

Proof for Corollary 1
Proof.
For subtrees: if $S_1$ and $S_2$ are 1-height subtrees that have the same root node feature and the same distribution of node features, $A$ obtains the same embeddings for $S_1$ and $S_2$ according to Theorem 1.

For graphs: let $G_1$ be a fully connected graph with $n$ nodes and $G_2$ be a ring-like graph with $n$ nodes. All nodes in $G_1$ and $G_2$ have the same feature $x$. It is clear that the Weisfeiler-Lehman test of isomorphism decides $G_1$ and $G_2$ to be non-isomorphic.

We denote $\{X_i\},\, i \in G_1$ as the set of multisets for aggregation in $G_1$, and $\{X_j\},\, j \in G_2$ as the set of multisets for aggregation in $G_2$. As $G_1$ is a fully connected graph, every multiset in $G_1$ contains the central node and $n - 1$ neighbors. As $G_2$ is a ring-like graph, every multiset in $G_2$ contains the central node and 2 neighbors. Thus we have

$$X_i = (\{x\}, \{\mu_1(x) = n\}), \ \forall i \in G_1; \qquad X_j = (\{x\}, \{\mu_2(x) = 3\}), \ \forall j \in G_2,$$

where $\mu_i(x)$ is the multiplicity function of the node with feature $x$ in $G_i$, $i \in \{1, 2\}$.

From Theorem 1, we know that $h(c, X_i) = h(c, X_j)$ for all $i \in G_1$ and $j \in G_2$. Considering Equation 5 of our paper, we have $h_i^l = h_j^l$ for all $i \in G_1, j \in G_2$ in each iteration $l$. Moreover, as the numbers of nodes in $G_1$ and $G_2$ both equal $n$, $A$ always maps $G_1$ and $G_2$ to the same set of multisets of node features $\{h^l\}$ in each iteration $l$ and finally obtains the same embedding for each graph.
Proof.
Given two distinct multisets of node features $X_1$ and $X_2$ that have the same central node feature and the same distribution of node features, i.e., $c_1 = c_2$, $X_1 = (S, \mu)$, and $X_2 = (S, k \cdot \mu)$ for $k \in \mathbb{N}^*$ (with $k \neq 1$ since the multisets are distinct), the cardinality of $X_2$ is $k$ times the cardinality of $X_1$. Thus $X_1$ and $X_2$ can be distinguished by their cardinality.

However, the weighted summation function $h$ in the attention-based aggregator $A$ maps them to the same embedding: $h(c_1, X_1) = h(c_2, X_2)$ according to Theorem 1. Thus we cannot distinguish $X_1$ and $X_2$ via $A$. In conclusion, $A$ loses the cardinality information after aggregation.
Proof.
For any two distinct multisets $X_1$ and $X_2$ that $T$ previously always failed to distinguish according to Theorem 1, we denote $X_1 = (S, \mu)$ and $X_2 = (S, k \cdot \mu)$ for some $k \in \mathbb{N}^*$, with central node feature $c \in S$. Thus $\sum_{x \in X_1} \alpha_{cx}^{(1)} f(x) = \sum_{x \in X_2} \alpha_{cx}^{(2)} f(x)$, where $\alpha_{cx}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c)$ and $f(x)$, $x \in X_i$, $i \in \{1, 2\}$. We denote $H = \sum_{x \in X_1} \alpha_{cx}^{(1)} f(x) = \sum_{x \in X_2} \alpha_{cx}^{(2)} f(x)$. When applying the CPA models, the aggregation functions in $T$ can be rewritten as:

$$h_1(c, X_i) = H + w \odot \sum_{x \in X_i} f(x), \quad i \in \{1, 2\},$$
$$h_2(c, X_i) = \psi\left(\left|X_i\right|\right) \odot H, \quad i \in \{1, 2\}.$$

We consider the following example: all elements in $w$ equal 1; the function $\psi$ maps $|X|$ to an $n$-dimensional vector in which all elements equal $|X|$; and $f(x) = N^{-Z(x)}$, where $Z: \mathcal{X} \rightarrow \mathbb{N}$ and $N > |X|$. The aggregation functions then become:

$$h_1(c, X_i) = H + \sum_{x \in X_i} f(x), \quad i \in \{1, 2\},$$
$$h_2(c, X_i) = \left|X_i\right| \cdot H, \quad i \in \{1, 2\}.$$

For $h_1$, we have $h_1(c, X_1) - h_1(c, X_2) = \sum_{x \in X_1} f(x) - \sum_{x \in X_2} f(x)$. According to Lemma 5 of (Xu et al. 2019), when $X_1 \neq X_2$, $\sum_{x \in X_1} f(x) \neq \sum_{x \in X_2} f(x)$. So $h_1(c, X_1) \neq h_1(c, X_2)$.

For $h_2$, we have $h_2(c, X_1) - h_2(c, X_2) = (|X_1| - |X_2|) \cdot H$. As $\alpha_{cx} > 0$ due to the softmax function and $f(x) > 0$ in our example, we know $H > 0$. Moreover, as $|X_1| - |X_2| \neq 0$, we get $h_2(c, X_1) \neq h_2(c, X_2)$.
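As a small numeric illustration of this construction (with a made-up injective code Z and N = 10, chosen only so that N exceeds the multiset sizes), the plain sums of $f(x) = N^{-Z(x)}$ already separate $X_1$ from $X_2 = (S, 2 \cdot \mu)$, which is exactly what the Additive correction term adds on top of $H$:

```python
Z = {"a": 1, "b": 2}          # an injective code for the countable features (assumed)
N = 10                        # N > |X2| = 6
f = lambda x: N ** (-Z[x])

X1 = ["a", "b", "b"]
X2 = X1 * 2                   # k = 2: same distribution, doubled cardinality
print(sum(map(f, X1)), sum(map(f, X2)))  # ~0.12 vs ~0.24: distinguishable
```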
Details of Datasets

For the node classification task, we generate a graph with 4800 nodes and 32400 edges. A fraction of the nodes are included in triangles as vertices, while the others are not. For graph classification, the statistics of the datasets are as follows:

Datasets | Graphs | Classes | Features | Node Avg. | Edge Avg.
MUTAG | 188 | 2 | 7 | 17.93 | 19.79
PROTEINS | 1113 | 2 | 4 | 39.06 | 72.81
ENZYMES | 600 | 6 | 6 | 32.63 | 62.14
NCI1 | 4110 | 2 | 23 | 29.87 | 32.30
RE-B | 2000 | 2 | - | 429.63 | 995.51
RE-M5K | 4999 | 5 | - | 508.52 | 1189.75
Details of Experiment Settings
For all experiments, we perform 10-fold cross-validation and repeat the experiments 10 times for each dataset and each model. To get a final accuracy for each run, we select the epoch with the best cross-validation accuracy averaged over all 10 folds. The average accuracies and their standard deviations are reported based on the results across the folds in all runs.

In our Additive and Scaled models, all MLPs have 2 layers with ReLU activation.

In the GAT variants, we use 2 GNN layers and a hidden dimensionality of 32. The negative input slope of the LeakyReLU in the GAT attention mechanism is 0.2. The number of heads in the multi-head attention is 1.

In the GAT-GC variants, we use 4 GNN layers. For the Readout function in all models, we use sum for the bioinformatics datasets and mean for the social network datasets. We apply batch normalization (Ioffe and Szegedy 2015) after every hidden layer. The hidden dimensionality is set to 32 for the bioinformatics datasets and 64 for the social network datasets. The negative input slope of the LeakyReLU in the GAT attention mechanism is 0.2. We use a single head in the multi-head attention in all models.

All models are trained using the Adam optimizer (Kingma and Ba 2018), and the learning rate is dropped by a factor of 0.5 every 400 epochs in the node classification task and every 50 epochs in the graph classification task. We use an initial learning rate of 0.01 for the TRIANGLE-NODE and bioinformatics datasets and 0.0025 for the social network datasets. For the GAT variants, we use a dropout ratio of 0 and a weight decay value of 0. For the GAT-GC variants on each dataset, the following hyper-parameters are tuned: (1) batch size; (2) dropout ratio after the dense layer; (3) L2 regularization (up to 0.001).