Improving Attention Mechanism in Graph Neural Networks via Cardinality Preservation
Shuo Zhang, Lei Xie
Ph.D. Program in Computer Science, The Graduate Center, The City University of New York
Department of Computer Science, Hunter College, The City University of New York
Helen and Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, Cornell University
[email protected], [email protected]
Abstract
Graph Neural Networks (GNNs) are powerful tools for learning representations of graph-structured data. Most GNNs use the message-passing scheme, where the embedding of a node is iteratively updated by aggregating the information of its neighbors. To better express the influence of individual nodes, attention mechanisms have become popular for assigning trainable weights to the nodes in aggregation. Although attention-based GNNs have achieved remarkable results on various tasks, a clear understanding of their discriminative capacity is missing. In this work, we present a theoretical analysis of the representational properties of GNNs that adopt the attention mechanism as an aggregator. Our analysis determines all cases in which those attention-based GNNs always fail to distinguish certain distinct structures. Those cases appear because attention-based aggregation ignores cardinality information. To improve the performance of attention-based GNNs, we propose Cardinality Preserved Attention (CPA) models that can be applied to any kind of attention mechanism. Our experiments on node and graph classification confirm our theoretical analysis and show the competitive performance of our CPA models.
Introduction
Graphs, as a powerful data structure in the non-Euclidean domain, can represent a set of instances (nodes) and the relationships (edges) between them, and thus have broad applications in various fields (Zhou et al. 2018b). Unlike regular Euclidean data such as text, images, and video, which have clear grid structures to which fundamental mathematical operations are relatively easy to generalize (Shuman et al. 2013), graph-structured data are irregular, so important deep learning operations (e.g., convolutions) are not straightforward to apply. Consequently, the analysis of graph-structured data remains a challenging and ubiquitous question.

In recent years, Graph Neural Networks (GNNs) have been proposed to learn representations of graph-structured data and have attracted growing interest (Scarselli et al. 2009; Li et al. 2016; Duvenaud et al. 2015; Niepert, Ahmed, and Kutzkov 2016; Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Zhang et al. 2018; Ying et al. 2018; Morris et al. 2019a; Xu et al. 2019). GNNs iteratively update node embeddings by aggregating/passing node features and structural information in the graph. The generated node embeddings can be fed into an extra classification/prediction layer, and the whole model is trained end-to-end for different tasks.

Although many GNNs have been proposed, we note that when updating the embedding of a node $v_i$ by aggregating the embeddings of its neighbor nodes $v_j$, most GNN variants assign non-parametric weights between $v_i$ and $v_j$ in their aggregators (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Xu et al. 2019). However, such aggregators (e.g., sum or mean) cannot learn to distinguish the information between a target node and its neighbors during training. Accounting for the different contributions of the nodes in a graph is important for real-world data, as not all edges have similar impacts. A natural alternative is to make the edge weights trainable for better expressive capability.

To assign learnable weights in the aggregation, the attention mechanism (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017) has been incorporated into GNNs. The weights can then be directly represented by attention coefficients between nodes and provide interpretability (Veličković et al. 2018; Thekumparampil et al. 2018; Zhou et al. 2018a). Although GNNs with attention-based aggregators achieve promising empirical performance on various tasks, a clear understanding of their discriminative power is missing, which hinders the design of more powerful attention-based GNNs. Recent works (Morris et al. 2019b; Xu et al. 2019; Maron et al. 2019) have theoretically analyzed the expressive power of GNNs, but their analyses do not account for the attention mechanism. It is therefore unclear whether using the attention mechanism in aggregation constrains the expressive power of GNNs.

In this work, we theoretically analyze the discriminative power of GNNs with attention-based aggregators. Our findings reveal that previously proposed attention-based aggregators fail to distinguish certain distinct structures. By determining all such cases, we show that the reason for those failures is the ignorance of cardinality information in aggregation. This inspires us to improve the attention mechanism via cardinality preservation. We propose models that can be applied to any kind of attention mechanism and achieve this goal.
In our experiments on node and graph classification, we confirm our theoretical analysis and validate the power of our proposed models. The best-performing one achieves competitive results compared to other baselines. Specifically, our key contributions are summarized as follows:
• We show that previously proposed attention-based aggregators in message-passing GNNs always fail to distinguish certain distinct structures. We determine all of those cases and demonstrate that the reason is the ignorance of cardinality information in attention-based aggregation.
• We propose Cardinality Preserved Attention (CPA) models to improve the original attention-based aggregator. With them, we can distinguish all cases that previously always fail an attention-based aggregator.
• Experiments on node and graph classification validate our theoretical analysis and the power of our CPA models. Compared to baselines, CPA models can reach the state-of-the-art level.
Preliminaries
Notations
Let $G = (V, E)$ be a graph with node set $V$ and edge set $E$. The nearest neighbors of node $i$ are defined as $N(i) = \{j \mid d(i,j) = 1\}$, where $d(i,j)$ is the shortest distance between nodes $i$ and $j$. We denote the set containing node $i$ and its nearest neighbors as $\tilde{N}(i) = N(i) \cup \{i\}$. For the nodes in $\tilde{N}(i)$, their feature vectors form a multiset $M(i) = (S_i, \mu_i)$, where $S_i = \{s_1, \ldots, s_n\}$ is the ground set of $M(i)$ and $\mu_i: S_i \rightarrow \mathbb{N}^*$ is the multiplicity function that gives the multiplicity of each $s \in S_i$. The cardinality $|M|$ of a multiset is the number of elements (with multiplicity) in the multiset.
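To make the notation concrete, the following is a minimal Python sketch (ours, not from the paper; the feature values are made up) of a multiset, its ground set, its multiplicity function, and its cardinality:

```python
from collections import Counter

# Hypothetical node features of node i and its nearest neighbors, as a multiset M(i).
features = ["a", "b", "b", "a", "b", "a"]

multiset = Counter(features)          # mu_i: maps each element of S_i to its multiplicity
ground_set = set(multiset)            # S_i = {"a", "b"}
cardinality = sum(multiset.values())  # |M| counts elements *with* multiplicity: 6

print(ground_set, dict(multiset), cardinality)  # {'a', 'b'} {'a': 3, 'b': 3} 6
```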
Graph Neural Networks

General GNNs
Graph Neural Networks (GNNs) adopt element (node or edge) features $X$ and the graph structure $A$ as input to learn the representation of each element, $h_i$, or of each graph, $h_G$, for different tasks. In this work, we focus on GNNs under the message-passing framework, which update the node embeddings iteratively by aggregating the embeddings of each node's nearest neighbors. In previous surveys, this type of GNN is referred to as Graph Convolutional Networks in (Wu et al. 2019) or as GNNs with a convolutional aggregator in (Zhou et al. 2018b). Under this framework, the learned representation of a node after $l$ aggregation layers contains the features and the structural information within the $l$-step neighborhood of the node. The $l$-th layer of a GNN can be formally represented as:

$$h_i^l = \phi^l\left(h_i^{l-1}, \left\{h_j^{l-1}, \forall j \in N(i)\right\}\right), \tag{1}$$

where the superscript $l$ denotes the $l$-th layer and $h_i^0$ is initialized as $X_i$. The aggregation function $\phi$ in Equation 1 propagates information between nodes and updates the hidden states of nodes.

In the final layer, since the node representation $h_i^L$ after $L$ iterations contains the $L$-step neighborhood information, it can be directly used for local/node-level tasks. For global/graph-level tasks, the whole-graph representation $h_G$ is needed, which requires an extra readout function $g$ to compute $h_G$ from all $h_i^L$:

$$h_G = g\left(\left\{h_i^L, \forall i \in G\right\}\right). \tag{2}$$
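As a concrete reference point, here is a minimal NumPy sketch (ours, not the paper's code) of one message-passing layer in the style of Equation 1 with a sum aggregator, plus a sum readout as in Equation 2. The graph, weight matrices, and dimensions are illustrative assumptions.

```python
import numpy as np

def mp_layer(h, neighbors, W_self, W_neigh):
    # One layer of Equation 1 with phi = ReLU(W_self h_i + W_neigh * sum_{j in N(i)} h_j).
    # h: (num_nodes, d_in); W_self, W_neigh: (d_out, d_in).
    new_h = np.empty((h.shape[0], W_self.shape[0]))
    for i in range(h.shape[0]):
        agg = np.sum(h[list(neighbors[i])], axis=0)  # sum over N(i)
        new_h[i] = np.maximum(0.0, W_self @ h[i] + W_neigh @ agg)
    return new_h

def readout(h):
    # Equation 2 with a sum readout g over all node embeddings.
    return h.sum(axis=0)

# Toy path graph 0-1-2-3 with random 5-dimensional features.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = np.random.default_rng(0)
h0 = rng.random((4, 5))
W_self, W_neigh = rng.random((8, 5)), rng.random((8, 5))
h1 = mp_layer(h0, neighbors, W_self, W_neigh)
print(readout(h1).shape)  # (8,)
```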
Attention-Based GNNs

When the aggregation function $\phi$ in Equation 1 adopts an attention mechanism, we consider the GNN to be attention-based. In a previous survey (Section 6 of (Lee et al. 2018)), this corresponds to the first two types of attention that have been applied to graph data. The attention-based aggregator in the $l$-th layer can be formulated as follows:

$$e_{ij}^{l-1} = \mathrm{Att}\left(h_i^{l-1}, h_j^{l-1}\right), \tag{3}$$

$$\alpha_{ij}^{l-1} = \mathrm{softmax}\left(e_{ij}^{l-1}\right) = \frac{\exp\left(e_{ij}^{l-1}\right)}{\sum_{k \in \tilde{N}(i)} \exp\left(e_{ik}^{l-1}\right)}, \tag{4}$$

$$h_i^l = f^l\left(\sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1}\right), \tag{5}$$

where the superscript $l$ denotes the $l$-th layer and $e_{ij}$ is the attention coefficient computed by an attention function $\mathrm{Att}$ to measure the relation between node $i$ and node $j$. $\alpha_{ij}$ is the attention weight calculated by the softmax function. Equation 5 is a weighted summation that uses all $\alpha$ as weights, followed by a nonlinear function $f$.
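The sketch below (ours, with an illustrative concatenation-based choice of Att rather than any particular published one) walks through Equations 3-5 for a single node: attention coefficients over $\tilde{N}(i)$, softmax normalization, and the weighted summation followed by a nonlinearity $f$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(h_i, h_neigh, a):
    # Eq. 3 with an illustrative Att: e_ij = a . [h_i ; h_j] (an assumption, not the paper's choice).
    nodes = [h_i] + list(h_neigh)                           # N~(i) = N(i) U {i}
    e = np.array([a @ np.concatenate([h_i, h_j]) for h_j in nodes])
    alpha = softmax(e)                                      # Eq. 4
    z = sum(w * h_j for w, h_j in zip(alpha, nodes))        # weighted summation in Eq. 5
    return np.tanh(z)                                       # f, a nonlinearity

rng = np.random.default_rng(0)
h_i = rng.normal(size=4)
h_neigh = [rng.normal(size=4) for _ in range(3)]
a = rng.normal(size=8)
print(attention_aggregate(h_i, h_neigh, a))
```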
Related Works

Since GNNs have achieved remarkable results in practice, a clear understanding of the power of GNNs in graph representation learning is needed to design better models and make further improvements. Recent works (Morris et al. 2019b; Xu et al. 2019; Maron et al. 2019) focus on understanding the discriminative power of GNNs by comparing them to the Weisfeiler-Lehman (WL) test (Weisfeiler and Leman 1968) for deciding graph isomorphism. It has been proved that message-passing-based GNNs, which aggregate the nearest neighbor node features of a node for embedding, are at most as powerful as the 1-WL test (Xu et al. 2019). Inspired by the higher discriminative power of the $k$-WL test for larger $k$ (Cai, Fürer, and Immerman 1992) relative to the 1-WL test, GNNs with a theoretically higher discriminative power than message-passing-based GNNs have been proposed based on the $k$-WL test (Morris et al. 2019b; Maron et al. 2019). However, the GNNs proposed in those works do not specifically include the attention mechanism in their analysis, so it is currently unknown whether the attention mechanism constrains the discriminative power. Our work focuses on message-passing-based GNNs with attention mechanisms, which are upper-bounded by the 1-WL test.

Another recent work (Knyazev, Taylor, and Amer 2019) aims to understand the attention mechanism over nodes in GNNs with experiments in a controlled environment. However, the attention mechanism discussed in that work is used in the pooling layer for the pooling of nodes, while our work investigates the use of the attention mechanism in the aggregation layer for the updating of nodes.

Limitation of Attention-Based GNNs

In this section, we theoretically analyze the discriminative power of attention-based GNNs and show their limitations. The discriminative power describes how well an attention-based GNN can distinguish different elements (local or global structures). We find that previously proposed attention-based GNNs can fail in certain cases, so their discriminative power is limited. Moreover, by theoretically finding all cases that always fail an attention-based GNN, we reveal that those failures come from the lack of cardinality preservation in attention-based aggregators. The details of the proofs are included in the Supplemental Material.
Discriminative Power of Attention-based GNNs
We assume the node input feature space is countable. For any attention-based GNN, we give in Lemma 1 the conditions under which it reaches the upper bound of its discriminative power when distinguishing different elements (local or global structures). In particular, each local structure belongs to a node and is the $k$-height subtree structure rooted at that node, which is naturally captured in the node feature $h_i^k$ after $k$ iterations in a GNN. The global structure contains the information of all such subtrees in a graph.

Lemma 1.
Let $A: G \rightarrow \mathbb{R}^g$ be a GNN following the neighborhood aggregation scheme with the attention-based aggregator (Equation 5). For global-level tasks, an extra readout function (Equation 2) is used in the final layer. $A$ can reach its upper bound of discriminative power (it can distinguish all distinct local structures, or it is as powerful as the 1-WL test when distinguishing distinct global structures) after sufficient iterations under the following conditions:
• Local-level: The function $f$ and the weighted summation in Equation 5 are injective.
• Global-level: In addition to the conditions for the local level, $A$'s readout function (Equation 2) is injective.

Given Lemma 1, we are interested in whether its conditions can always be satisfied, so that an attention-based GNN reaches the upper bound of its discriminative capacity. Since the function $f$ and the global-level readout function can be predetermined to be injective, we focus on whether the weighted summation function in the attention-based aggregator can be injective.

The Non-Injectivity of Attention-Based Aggregator
In this part, we aim to answer the following two questions:
Q1.
Can attention-based GNNs actually reach the upper bound of discriminative power? In other words, can the weighted summation function in an attention-based aggregator be injective?
Q2.
If not, can we determine all of the cases that prevent any kind of weighted summation function from being injective?
Given a countable feature space $\mathcal{H}$, a weighted summation function is a mapping $W: \mathcal{H} \rightarrow \mathbb{R}^n$. The exact $W$ is determined by the attention weights $\alpha$ computed from $\mathrm{Att}$ in Equation 3. Since $\mathrm{Att}$ is affected by stochastic optimization algorithms (e.g., SGD), which introduce stochasticity into $W$, we must keep in mind that $W$ is not fixed when dealing with the two questions.

In Theorem 1, we answer Q1 with No by giving the cases that prevent $W$ from being injective, so that attention-based GNNs can never meet their upper bound of discriminative power, as stated in Corollary 1. Moreover, we answer Q2 with Yes in Theorem 1 by pointing out that those cases are the only reason that always prevents $W$ from being injective. This alleviates the difficulty of summarizing the properties of those cases. Moreover, we can specifically propose methods to avoid those cases so as to let $W$ be injective.

Theorem 1.
Assume the input feature space $\mathcal{X}$ is countable. Given a multiset $X \subset \mathcal{X}$ and the node feature $c$ of the central node, the weighted summation function $h(c, X)$ in aggregation is defined as $h(c, X) = \sum_{x \in X} \alpha_{cx} f(x)$, where $f: \mathcal{X} \rightarrow \mathbb{R}^n$ is a mapping of the input feature vector and $\alpha_{cx}$ is the attention weight between $f(c)$ and $f(x)$, calculated by the attention function $\mathrm{Att}$ in Equation 3 and the softmax function in Equation 4. For all $f$ and $\mathrm{Att}$, $h(c_1, X_1) = h(c_2, X_2)$ if and only if $c_1 = c_2$, $X_1 = (S, \mu)$, and $X_2 = (S, k \cdot \mu)$ for $k \in \mathbb{N}^*$. In other words, $h$ maps different multisets to the same embedding if and only if the multisets have the same central node feature and the same distribution of node features.
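A quick numeric check of the "if" direction of Theorem 1 (a sketch under made-up choices of Att and f; by the theorem, any choices behave the same way): duplicating every element of the multiset $k$ times scales both the exponentials and the softmax normalizer by $k$, so the weighted sum is unchanged.

```python
import numpy as np

def weighted_sum(c, X, att, f):
    e = np.array([att(c, x) for x in X])
    alpha = np.exp(e) / np.exp(e).sum()          # softmax, Equation 4
    return sum(a * f(x) for a, x in zip(alpha, X))

att = lambda c, x: float(c * x)                  # an arbitrary attention function
f = lambda x: np.array([x, x ** 2])              # an arbitrary feature map
c, X1 = 1.0, [1.0, 2.0, 2.0]
X2 = X1 * 3                                      # same distribution, 3x cardinality

print(weighted_sum(c, X1, att, f))               # identical outputs (up to float error):
print(weighted_sum(c, X2, att, f))               # h(c, X1) == h(c, X2)
```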
Corollary 1. Let $A$ be the GNN defined in Lemma 1. $A$ never reaches its upper bound of discriminative power: there exist two different subtrees $S_1$ and $S_2$, or two graphs $G_1$ and $G_2$ that the Weisfeiler-Lehman test decides are non-isomorphic, such that $A$ always maps the two subtrees/graphs to the same embeddings.

Attention Mechanism Fails to Preserve Cardinality
With Theorem 1, we now examine the properties of all cases that always prevent the weighted summation functions $W$ from being injective. Since the multisets that all $W$ fail to distinguish share the same distribution of node features, $W$ ignores the multiplicity information of each identical element in the multisets. Thus, the cardinality of the multiset is not preserved:

Corollary 2.
Let $A$ be the GNN defined in Lemma 1. The attention-based aggregator in $A$ cannot preserve the cardinality information of the multiset of node features in aggregation.

In the next section, we propose improved attention-based models that preserve cardinality in aggregation.
Cardinality Preserved Attention (CPA) Model
Since the cardinality of the multiset is not preserved in attention-based aggregators, our goal is to modify any kind of attention mechanism so that it captures the cardinality information. In this way, all of the cases that always prevent an attention-based aggregator from being injective can be avoided.

To achieve our goal, we modify the weighted summation function in Equation 5 to incorporate the cardinality information, and we do not change the attention function in Equation 3 so as to keep its original expressive power.
Figure 1: An illustration of different attention-based aggregators on multisets of node features. Given two distinct multisets $H_1$ and $H_2$ that have the same central node feature $h_i$ and the same distribution of node features, aggregators map $h_i$ to $h_{i,1}$ and $h_{i,2}$ for $H_1$ and $H_2$. The Original model gets $h'_{i,1} = h'_{i,2}$ and fails to distinguish $H_1$ and $H_2$, while our Additive and Scaled models can always distinguish $H_1$ and $H_2$ with $h''_{i,1} \neq h''_{i,2}$ and $h'''_{i,1} \neq h'''_{i,2}$.

Two different models, named Additive and Scaled, are proposed to modify the Original model in Equation 5:
Model 1. (Additive)

$$h_i^l = f^l\left(\sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1} + w^l \odot \sum_{j \in \tilde{N}(i)} h_j^{l-1}\right), \tag{6}$$

Model 2. (Scaled)

$$h_i^l = f^l\left(\psi^l\left(\left|\tilde{N}(i)\right|\right) \odot \sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1}\right), \tag{7}$$

where $w$ is a non-zero vector in $\mathbb{R}^n$, $\odot$ denotes element-wise multiplication, $|\tilde{N}(i)|$ equals the cardinality of the multiset $\tilde{N}(i)$, and $\psi: \mathbb{Z}^+ \rightarrow \mathbb{R}^n$ is an injective function.

In the Additive model, each element in the multiset contributes to the term that we add to preserve the cardinality information. In the Scaled model, the original weighted summation is directly multiplied by a representational vector of the cardinality value. With these models, distinct multisets with the same distribution result in different embeddings $h$. Note that neither of our models changes the $\mathrm{Att}$ function, so they keep the learning power of the original attention mechanism. We summarize the effect of our models in Corollary 3 and illustrate it in Figure 1.
Corollary 3.
Let $T$ be the original attention-based aggregator in Equation 5. With our proposed Cardinality Preserved Attention (CPA) models in Equations 6 and 7, $T$'s discriminative power is increased: $T$ can now distinguish all different multisets in aggregation that it previously always failed to distinguish.

While the original attention-based aggregator is never injective, as shown in the previous sections, our cardinality preserved attention-based aggregator can be injective with certain learned attention weights and thereby reach its upper bound of discriminative power. We validate this in our experiments.

Regarding the time and space complexity of our CPA models compared to the original attention-based aggregator, Models 1 and 2 clearly take more time and space than the original one due to the introduced vectors $w$ and $\psi(|\tilde{N}(i)|)$. We therefore further simplify our models by fixing the values in $w$ and $\psi(|\tilde{N}(i)|)$ and define two CPA variants:

Model 3. (f-Additive)

$$h_i^l = f^l\left(\sum_{j \in \tilde{N}(i)} \left(\alpha_{ij}^{l-1} + 1\right) h_j^{l-1}\right), \tag{8}$$

Model 4. (f-Scaled)

$$h_i^l = f^l\left(\left|\tilde{N}(i)\right| \cdot \sum_{j \in \tilde{N}(i)} \alpha_{ij}^{l-1} h_j^{l-1}\right). \tag{9}$$

Models 3 and 4 still preserve the cardinality information and have reduced time and space complexity compared to Models 1 and 2. In fact, since $w$ and $\psi(|\tilde{N}(i)|)$ degenerate into constants, Models 3 and 4 have the same time and space complexity as the original model in Equation 5. In our experiments, we examine all 4 models together with the original one.
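To make the fixed-weight variants concrete, here is a sketch (ours; Att and f are the same arbitrary stand-ins as in the earlier sketch) of the f-Additive and f-Scaled aggregations of Equations 8 and 9, applied to a pair of multisets that fools the original aggregator; both variants now produce different outputs for X1 and X2.

```python
import numpy as np

def cpa_aggregate(c, X, att, f, mode="f-scaled"):
    e = np.array([att(c, x) for x in X])
    alpha = np.exp(e) / np.exp(e).sum()
    if mode == "f-additive":                       # Equation 8: weights alpha + 1
        return sum((a + 1.0) * f(x) for a, x in zip(alpha, X))
    if mode == "f-scaled":                         # Equation 9: scale by cardinality
        return len(X) * sum(a * f(x) for a, x in zip(alpha, X))
    raise ValueError(mode)

att = lambda c, x: float(c * x)
f = lambda x: np.array([x, x ** 2])
c, X1 = 1.0, [1.0, 2.0, 2.0]
X2 = X1 * 3                                        # same distribution, 3x cardinality

# Unlike the original aggregator, both variants separate X1 from X2:
for mode in ("f-additive", "f-scaled"):
    print(mode, cpa_aggregate(c, X1, att, f, mode), cpa_aggregate(c, X2, att, f, mode))
```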
Experiments

In our experiments, we focus on the following questions:
Q3.
Since attention-based GNNs (e.g., GAT) were originally proposed for local-level tasks like node classification, will those models fail or fall short of the upper bound of discriminative power when solving certain node classification tasks? If so, can our proposed CPA models improve the original model?
Q4.
For global-level tasks like graph classification, how well can the original attention-based GNNs perform? Can our proposed CPA models improve the original model?
Q5.
How do attention-based GNNs with our CPA models perform compared to baselines?
To answer Question 3, we design a node classification task: predicting whether or not a node is included in a triangle as a vertex in a graph. To answer Questions 4 and 5, we perform experiments on graph classification benchmarks and evaluate the performance of attention-based GNNs with CPA models.
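For reference, node labels for such a triangle-membership task can be derived with networkx's triangle counter. This is a hedged sketch with illustrative graph sizes, not the paper's exact generator (which is described in its Supplemental Material).

```python
import networkx as nx

G = nx.gnm_random_graph(n=100, m=300, seed=0)   # illustrative sizes, not the paper's
# nx.triangles returns, for each node, the number of triangles it participates in;
# the binary label is membership in at least one triangle.
labels = {v: int(t > 0) for v, t in nx.triangles(G).items()}
print(sum(labels.values()), "of", G.number_of_nodes(), "nodes lie in a triangle")
```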
Experimental Setup
Datasets
In our synthetic task (TRIANGLE-NODE) for predicting whether or not a node is included in a triangle, we generate a graph with different node features. In our experiments on graph classification, we use 6 benchmark datasets: 2 social network datasets (REDDIT-BINARY (RE-B), REDDIT-MULTI5K (RE-M5K)) and 4 bioinformatics datasets (MUTAG, PROTEINS, ENZYMES, NCI1). More details of the datasets are provided in the Supplemental Material.

Table 1: Testing accuracies (%) of GAT variants (the original GAT and the GAT applied with each of our 4 CPA models) on the TRIANGLE-NODE dataset for node classification. We highlight the result of the best-performing model. The proportion P of multisets that hold the properties in Theorem 1 among all multisets is also reported. [Table values not recoverable.]

Table 2: Testing accuracies (%) of GAT-GC variants (the original one and the ones applied with each of our 4 CPA models) on social network datasets. We highlight the result of the best-performing model per dataset. The proportion P of multisets that hold the properties in Theorem 1 among all multisets is also reported for each dataset. [Table values not recoverable.]

Models
In our experiments, the Original model is the one that uses the original version of an attention mechanism. We apply each of our 4 CPA models (Additive, Scaled, f-Additive, and f-Scaled) to the original attention mechanism for comparison. In the Additive and Scaled models, we take advantage of the approximation capability of the multi-layer perceptron (MLP) (Hornik, Stinchcombe, and White 1989; Hornik 1991) to model $f$ and $\psi$.

For node classification, we use GAT (Veličković et al. 2018) as the Original model. For graph classification, we build a GNN (GAT-GC) based on GAT as the Original model: we adopt the attention mechanism in GAT to specify the form of Equation 3: $e_{ij} = \mathrm{LeakyReLU}\left(a^{\top}\left[W h_i \,\|\, W h_j\right]\right)$. For the readout function, a naive way is to consider only the node embeddings from the last iteration. Although a sufficient number of iterations can help to avoid the cases in Theorem 1 by aggregating more diverse node features, the features from the later iterations may generalize worse, and GNNs usually have shallow structures (Xu et al. 2019; Zhou et al. 2018b). Therefore, GAT-GC adopts the same function as used in (Xu et al. 2018; Xu et al. 2019; Lee, Lee, and Kang 2019; Li et al. 2019), which concatenates graph embeddings from all iterations: $h_G = \big\Vert_{l=0}^{L}\, \mathrm{Readout}\left(\left\{h_i^l \mid i \in G\right\}\right)$, where the Readout function can be sum or mean. With CPA models, the cases in Theorem 1 can be avoided in each iteration. Full experimental settings are included in the Supplemental Material.

Figure 2: Training curves of GAT-GC variants on bioinformatics datasets.

Table 3: Testing accuracies (%) of GAT-GC variants (the original one and the ones applied with each of our 4 CPA models) on bioinformatics datasets. We highlight the result of the best-performing model per dataset. The highlighted results are significantly higher than those from the corresponding Original model under a paired t-test. The proportion P of multisets that hold the properties in Theorem 1 among all multisets is also reported for each dataset. [Table values not recoverable.]
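The concatenated readout used by GAT-GC can be sketched as follows (ours, with illustrative shapes): per-layer graph embeddings are computed by a sum or mean over nodes and concatenated across all $L+1$ layers, so features from earlier iterations are retained alongside the final ones.

```python
import numpy as np

def graph_embedding(h_per_layer, readout="sum"):
    # h_per_layer: list of (num_nodes, d) arrays for layers l = 0..L.
    reduce = np.sum if readout == "sum" else np.mean
    return np.concatenate([reduce(h, axis=0) for h in h_per_layer])

layers = [np.random.rand(10, 16) for _ in range(5)]  # L = 4 plus the input layer
print(graph_embedding(layers).shape)                 # (80,): 5 layers x 16 dims
```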
Node Classification

For the TRIANGLE-NODE dataset, the proportion P of multisets that hold the properties in Theorem 1 is reported in Table 1. The classification accuracy of the Original model (GAT) is significantly lower than that of the CPA models. This supports the claim in Corollary 1: the Original model fails to distinguish all distinct multisets in the dataset and exhibits constrained discriminative power. On the contrary, the CPA models can distinguish all different multisets in the graph, as suggested by Corollary 3, and indeed significantly improve the accuracy of the Original model, as shown in Table 1. This experiment thus answers Question 3.
Graph Classification
In this section, we aim to answer Question 4 by evaluating the performance of variants of the GAT-based GNN (GAT-GC) on graph classification benchmarks. In addition, we compare our best-performing CPA model with baseline models to answer Question 5.

Table 4: Testing accuracies (%) for graph classification on MUTAG, PROTEINS, ENZYMES, NCI1, RE-B, and RE-M5K, for the baselines (WL, PSCN, DGCNN, GIN, CapsGNN) and our GAT-GC variants. We highlight the result of the best-performing model for each dataset. Our GAT-GC (f-Scaled) model achieves the top 2 on all 6 datasets. [Table values not recoverable.]

Social Network Datasets
Since the RE-B and RE-M5K datasets have no original node features, we assign all node features to be the same, which gives $P = 100.0\%$ on those datasets. Thus, all multisets in aggregation are mapped to the same embedding by the Original GAT-GC. After a mean readout function on all multisets, all graphs are finally mapped to the same embedding. The performance of the Original model is then just random guessing of the graph labels, as reported in Table 2, while our CPA models can distinguish all different multisets and are confirmed to be significantly better than the Original one.

Here we also examine a naive approach to incorporating the cardinality information in the Original model: assigning node degrees as input node labels. In this way, the node features are diverse and we get $P = 0.0\%$, which means that the cases in Theorem 1 are all avoided. However, the testing accuracies of Original on RE-B and RE-M5K remain significantly lower than the results of the CPA models in Table 2. Thus, in practice, our proposed models exhibit good generalization power compared to the naive approach.

Bioinformatics Datasets
For bioinformatics datasets that contain diverse node labels, we also report the P values in Table 3. The results reveal the existence ($P > 0$) of cases in those datasets that can fool the Original model; thus, the discriminative power of the Original model is theoretically constrained.

To validate this empirically, we compare the training accuracies of the GAT-GC variants, since the discriminative power is directly indicated by the accuracies on the training sets: a higher training accuracy indicates a better ability to fit and distinguish different graphs. The training curves of the GAT-GC variants are shown in Figure 2. From these curves, we can see that even though the Original model has overfitted the different datasets, the fitting accuracies to which it converges are never higher than those of our CPA models. On several datasets, the CPA models reach training accuracies close to those obtained from the WL kernel (as shown in (Xu et al. 2019)). These findings validate that the discriminative power of the Original model is constrained, while our CPA models can approach the upper bound of discriminative power with certain learned weights.

In Table 3 we report the testing accuracies of the GAT-GC variants on bioinformatics datasets. The Original model can get meaningful results. However, our proposed CPA models further improve the testing accuracies of the Original model on all datasets. This indicates that the preservation of cardinality can also benefit the generalization power of the model, in addition to its discriminative power.

From the results in Tables 2 and 3, we find that the f-Scaled model performs the best under an average ranking measure (Taheri, Gimpel, and Berger-Wolf 2018). The good performance of the fixed-weight models (f-Additive and f-Scaled) compared to the full models (Additive and Scaled) demonstrates that the improvements achieved by the CPA models are not simply due to the increased capacity afforded by the additional embedded vectors.
Comparison to Baselines
We further compare the best-performing GAT-GC variant (f-Scaled) with other baselines: the WL kernel (WL) (Shervashidze et al. 2011), PATCHY-SAN (PSCN) (Niepert, Ahmed, and Kutzkov 2016), Deep Graph CNN (DGCNN) (Zhang et al. 2018), Graph Isomorphism Network (GIN) (Xu et al. 2019), and Capsule Graph Neural Network (CapsGNN) (Xinyi and Chen 2019). The results are reported in Table 4. Our GAT-GC (f-Scaled) model achieves 4 top-1 and 2 top-2 results on the 6 datasets. We expect that even better performance can be achieved with certain choices of attention mechanism other than the GAT one.
Conclusion
In this paper, we theoretically analyze the representational power of GNNs with attention-based aggregators: we determine all cases in which those GNNs always fail to distinguish distinct structures. The finding shows that the missing cardinality information in aggregation is the only reason for those failures. To address this, we propose Cardinality Preserved Attention (CPA) models. In our experiments, we validate our theoretical analysis that the performance of the original attention-based GNNs is limited, and we show that our models improve the original ones. Compared to other baselines, our best-performing model achieves competitive performance. In future work, a challenging problem is to better learn the attention weights so as to guarantee the injectivity of our cardinality preserved attention models after training. It would also be interesting to analyze the effects of different attention mechanisms.

References

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Cai, Fürer, and Immerman 1992] Cai, J.-Y.; Fürer, M.; and Immerman, N. 1992. An optimal lower bound on the number of variables for graph identification. Combinatorica 12(4):389–410.
[Duvenaud et al. 2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2224–2232.
[Hamilton, Ying, and Leskovec 2017] Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034.
[Hornik, Stinchcombe, and White 1989] Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.
[Hornik 1991] Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2):251–257.
[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.
[Ivanov and Burnaev 2018] Ivanov, S., and Burnaev, E. 2018. Anonymous walk embeddings. In International Conference on Machine Learning, 2191–2200.
[Kingma and Ba 2018] Kingma, D. P., and Ba, J. 2018. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Kipf and Welling 2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
[Knyazev, Taylor, and Amer 2019] Knyazev, B.; Taylor, G. W.; and Amer, M. R. 2019. Understanding attention and generalization in graph neural networks. arXiv preprint arXiv:1905.02850.
[Lee et al. 2018] Lee, J. B.; Rossi, R. A.; Kim, S.; Ahmed, N. K.; and Koh, E. 2018. Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984.
[Lee, Lee, and Kang 2019] Lee, J.; Lee, I.; and Kang, J. 2019. Self-attention graph pooling. In International Conference on Machine Learning, 3734–3743.
[Li et al. 2016] Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. In International Conference on Learning Representations.
[Li et al. 2019] Li, G.; Müller, M.; Thabet, A.; and Ghanem, B. 2019. DeepGCNs: Can GCNs go as deep as CNNs? In The IEEE International Conference on Computer Vision (ICCV).
[Maron et al. 2019] Maron, H.; Ben-Hamu, H.; Serviansky, H.; and Lipman, Y. 2019. Provably powerful graph networks. In Advances in Neural Information Processing Systems.
[Morris et al. 2019a] Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W. L.; Lenssen, J. E.; Rattan, G.; and Grohe, M. 2019a. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Morris et al. 2019b] Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W. L.; Lenssen, J. E.; Rattan, G.; and Grohe, M. 2019b. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4602–4609.
[Niepert, Ahmed, and Kutzkov 2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2014–2023.
[Scarselli et al. 2009] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
[Shervashidze et al. 2011] Shervashidze, N.; Schweitzer, P.; van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12:2539–2561.
[Shuman et al. 2013] Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30(3):83–98.
[Taheri, Gimpel, and Berger-Wolf 2018] Taheri, A.; Gimpel, K.; and Berger-Wolf, T. 2018. Learning graph representations with recurrent neural network autoencoders. In KDD Deep Learning Day.
[Thekumparampil et al. 2018] Thekumparampil, K. K.; Wang, C.; Oh, S.; and Li, L.-J. 2018. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
[Veličković et al. 2018] Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph attention networks. In International Conference on Learning Representations.
[Weisfeiler and Leman 1968] Weisfeiler, B., and Leman, A. 1968. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series 2, 9:12–16.
[Wu et al. 2019] Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Yu, P. S. 2019. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
[Xinyi and Chen 2019] Xinyi, Z., and Chen, L. 2019. Capsule graph neural network. In International Conference on Learning Representations.
[Xu et al. 2018] Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, 5449–5458.
[Xu et al. 2019] Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How powerful are graph neural networks? In International Conference on Learning Representations.
[Ying et al. 2018] Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, 4805–4815.
[Zhang et al. 2018] Zhang, M.; Cui, Z.; Neumann, M.; and Chen, Y. 2018. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Zhou et al. 2018a] Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018a. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, 4623–4629.
[Zhou et al. 2018b] Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; and Sun, M. 2018b. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434.
Proof.
Local-level: The aggregator in the first layer maps different 1-height subtree structures to different embeddings from the distinct input multisets of neighborhood node features, since it is injective. Iteratively, the aggregator in the $l$-th layer distinguishes different $l$-height subtree structures by mapping them to different embeddings from the distinct input multisets of ($l-1$)-height subtree features, since it is injective.

Global-level: From Lemma 2 and Theorem 3 in (Xu et al. 2019), we know that when all functions in $A$ are injective, $A$ reaches its upper bound of discriminative power, which is the same as that of the Weisfeiler-Lehman (WL) test (Weisfeiler and Leman 1968) for deciding graph isomorphism.

Proof for Theorem 1
Proof.
To prove Theorem 1, we consider both directions of the iff statement.

(1) Suppose $c_1 = c_2 = c$, $X_1 = (S, \mu)$, and $X_2 = (S, k \cdot \mu)$. Since $h(c, X) = \sum_{x \in X} \alpha_{cx} f(x)$, we have

$$h(c_i, X_i) = \sum_{x \in X_i} \alpha_{cx}^{(i)} f(x), \quad i \in \{1, 2\},$$

where $\alpha_{cx}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c)$ and $f(x)$, $x \in X_i$.

We can rewrite the equations using $S$ and $\mu$:

$$h(c_1, X_1) = h(c, S, \mu) = \sum_{s \in S} \mu(s)\, \alpha_{cs}^{(1)} f(s), \qquad h(c_2, X_2) = h(c, S, k \cdot \mu) = \sum_{s \in S} k \cdot \mu(s)\, \alpha_{cs}^{(2)} f(s),$$

where $\mu(s)$ is the multiplicity function and $\alpha_{cs}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c)$ and $f(s)$, $s \in S$.

Considering the softmax function in Equation 4, we can use the attention coefficients $e$ to rewrite the equations:

$$\sum_{s \in S} \mu(s)\, \alpha_{cs}^{(1)} f(s) = \frac{\sum_{s \in S} \mu(s) \exp\left(e_{cs}^{(1)}\right) f(s)}{\sum_{x \in X_1} \exp\left(e_{cx}^{(1)}\right)}, \qquad \sum_{s \in S} k \cdot \mu(s)\, \alpha_{cs}^{(2)} f(s) = \frac{k \cdot \sum_{s \in S} \mu(s) \exp\left(e_{cs}^{(2)}\right) f(s)}{\sum_{x \in X_2} \exp\left(e_{cx}^{(2)}\right)},$$

where $e_{cs}^{(i)}$ (resp. $e_{cx}^{(i)}$) is the attention coefficient belonging to $X_i$, between $f(c)$ and $f(s)$ (resp. between $f(c)$ and $f(x)$).

Since the attention coefficient $e$ is computed by the function $\mathrm{Att}$, which is independent of $X$, we have $e_{cs}^{(1)} = e_{cs}^{(2)} =: e_{cs}$ for all $s \in S$ and $e_{cx}^{(1)} = e_{cx}^{(2)} =: e_{cx}$ for all $x \in X_1, X_2$. Recall that $X_2$ has $k$ copies of the elements in $X_1$, so that

$$\sum_{x \in X_1} \exp(e_{cx}) = \frac{1}{k} \sum_{x \in X_2} \exp(e_{cx}).$$

Using this equation, we get

$$\frac{\sum_{s \in S} \mu(s) \exp(e_{cs}) f(s)}{\sum_{x \in X_1} \exp(e_{cx})} = \frac{k \cdot \sum_{s \in S} \mu(s) \exp(e_{cs}) f(s)}{\sum_{x \in X_2} \exp(e_{cx})}.$$

From all the equations above, we finally have $h(c_1, X_1) = h(c_2, X_2)$.

(2) Suppose $h(c_1, X_1) = h(c_2, X_2)$ for all $f$ and $\mathrm{Att}$. Then

$$\sum_{x \in X_1} \alpha_{cx}^{(1)} f(x) = \sum_{x \in X_2} \alpha_{cx}^{(2)} f(x), \quad \forall f, \mathrm{Att},$$

where $\alpha_{cx}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c_i)$ and $f(x)$, $x \in X_i$.

We denote $X_1 = (S_1, \mu_1)$ and $X_2 = (S_2, \mu_2)$ and rewrite the equation:

$$\sum_{s \in S_1} \mu_1(s)\, \alpha_{cs}^{(1)} f(s) = \sum_{s \in S_2} \mu_2(s)\, \alpha_{cs}^{(2)} f(s), \quad \forall f, \mathrm{Att},$$

where $\mu_i(s)$ is the multiplicity function of $X_i$ and $\alpha_{cs}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c_i)$ and $f(s)$, $s \in S_i$.

Considering the relations between $S_1$ and $S_2$, we have

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} f(s) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} f(s) = 0. \tag{10}$$

Assume that the equality in Equation 10 holds for all $f$ while $S_1 \neq S_2$. We can then define two functions $f_1$ and $f_2$ such that

$$f_2(s) = f_1(s), \ \forall s \in S_1 \cap S_2; \qquad f_2(s) = f_1(s) - 1, \ \forall s \in S_1 \setminus S_2; \qquad f_2(s) = f_1(s) + 1, \ \forall s \in S_2 \setminus S_1.$$

If the equality in Equation 10 holds for $f_2$, we have

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f_2(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} f_2(s) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} f_2(s) = 0. \tag{11}$$

We can rewrite Equation 11 using $f_1$:

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f_1(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} (f_1(s) - 1) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} (f_1(s) + 1) = 0.$$

Thus we know

$$\sum_{s \in S_1 \cap S_2} \left(\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)}\right) f_1(s) + \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} f_1(s) - \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)} f_1(s) = \sum_{s \in S_1 \setminus S_2} \mu_1(s)\, \alpha_{cs}^{(1)} + \sum_{s \in S_2 \setminus S_1} \mu_2(s)\, \alpha_{cs}^{(2)}. \tag{12}$$

Note that the LHS of Equation 12 is just the LHS of Equation 10 with $f = f_1$. As $\mu_i(s) \geq 1$ by the definition of multiplicity and $\alpha_{cs}^{(i)} > 0$ due to the softmax function, we have $\mu_i(s)\, \alpha_{cs}^{(i)} > 0$ for all $s \in S_i$, $i \in \{1, 2\}$. Thus the RHS of Equation 12 is $> 0$, so the equality in Equation 10 does not hold for $f_1$, and the assumption $S_1 \neq S_2$ is false.

We denote $S_1 = S_2 = S$. To make the remaining summation term always equal to 0, we need $\mu_1(s)\, \alpha_{cs}^{(1)} - \mu_2(s)\, \alpha_{cs}^{(2)} = 0$ for all $\mathrm{Att}$. Considering the softmax function in Equation 4, we can rewrite this as

$$\frac{\mu_1(s)}{\mu_2(s)} = \frac{\exp\left(e_{cs}^{(2)}\right)}{\exp\left(e_{cs}^{(1)}\right)} \cdot \frac{\sum_{x \in X_1} \exp\left(e_{cx}^{(1)}\right)}{\sum_{x \in X_2} \exp\left(e_{cx}^{(2)}\right)}, \quad \forall \mathrm{Att}, \tag{13}$$

where $e_{cs}^{(i)}$ is the attention coefficient belonging to $X_i$, between $f(c_i)$ and $f(s)$, $s \in S$, and $e_{cx}^{(i)}$ is the attention coefficient belonging to $X_i$, between $f(c_i)$ and $f(x)$, $x \in X_i$.

The LHS of Equation 13 is a rational number. However, if $c_1 \neq c_2$, the RHS of Equation 13 can be irrational. We assume $S$ contains at least two elements $s_1$ and $s_2 \neq s_1$ (if not, we can directly get $c_1 = c_2$). Consider any attention mechanism that results in

$$e_{cs}^{(1)} = 1, \ \forall s \in S; \qquad e_{cs}^{(2)} = \begin{cases} 2, & \text{for } s = s_1, \\ 1, & \forall s \neq s_1 \in S. \end{cases}$$

Then, for $s = s_1$, the RHS of the equation becomes

$$\frac{e^2}{e} \cdot \frac{|X_1|\, e}{(|X_2| - n)\, e + n\, e^2} = \frac{|X_1|\, e}{(|X_2| - n) + n\, e},$$

where $n$ is the multiplicity of $s_1$ in $X_2$. The value of the RHS is clearly irrational. So we must have $c_1 = c_2$ for the equality to always hold.

With $c_1 = c_2$, we know $e_{cs}^{(1)} = e_{cs}^{(2)}$ for all $s \in S$ and $e_{cx}^{(1)} = e_{cx}^{(2)}$ for all $x \in X_1, X_2$. Denoting $e_{cx}^{(1)} = e_{cx}^{(2)} = e_{cx}$, Equation 13 becomes

$$\frac{\mu_1(s)}{\mu_2(s)} = \frac{\sum_{x \in X_1} \exp(e_{cx})}{\sum_{x \in X_2} \exp(e_{cx})} = \text{const.}, \quad \forall \mathrm{Att}.$$

We further denote $k = \mu_2(s)/\mu_1(s)$, $\forall s \in S$, so that $\mu_2 = k \cdot \mu_1$. Finally, by denoting $\mu_1 = \mu$, we have $X_1 = (S, \mu)$, $X_2 = (S, k \cdot \mu)$, and $c_1 = c_2$.

Proof for Corollary 1
Proof.
For subtrees: if $S_1$ and $S_2$ are 1-height subtrees that have the same root node feature and the same distribution of node features, $A$ obtains the same embeddings for $S_1$ and $S_2$ according to Theorem 1.

For graphs: let $G_1$ be a fully connected graph with $n$ nodes and $G_2$ be a ring-like graph with $n$ nodes. All nodes in $G_1$ and $G_2$ have the same feature $x$. It is clear that the Weisfeiler-Lehman test of isomorphism decides $G_1$ and $G_2$ to be non-isomorphic.

We denote $\{X_i\},\, i \in G_1$ as the set of multisets for aggregation in $G_1$, and $\{X_j\},\, j \in G_2$ as the set of multisets for aggregation in $G_2$. As $G_1$ is a fully connected graph, every multiset in $G_1$ contains the central node and $n - 1$ neighbors. As $G_2$ is a ring-like graph, every multiset in $G_2$ contains the central node and 2 neighbors. Thus we have

$$X_i = (\{x\}, \{\mu_1(x) = n\}), \ \forall i \in G_1; \qquad X_j = (\{x\}, \{\mu_2(x) = 3\}), \ \forall j \in G_2,$$

where $\mu_i(x)$ is the multiplicity function of the node with feature $x$ in $G_i$, $i \in \{1, 2\}$.

From Theorem 1, we know that $h(c, X_i) = h(c, X_j)$ for all $i \in G_1$ and $j \in G_2$. Considering Equation 5 of our paper, we have $h_i^l = h_j^l$ for all $i \in G_1, j \in G_2$ in each iteration $l$. Moreover, as the numbers of nodes in $G_1$ and $G_2$ both equal $n$, $A$ always maps $G_1$ and $G_2$ to the same set of multisets of node features $\{h^l\}$ in each iteration $l$ and finally obtains the same embedding for each graph.
Proof.
Given two distinct multisets of node features $X_1$ and $X_2$ that have the same central node feature and the same distribution of node features, i.e., $c_1 = c_2$, $X_1 = (S, \mu)$, and $X_2 = (S, k \cdot \mu)$ for $k \in \mathbb{N}^*$ (with $k \neq 1$ since the multisets are distinct), the cardinality of $X_2$ is $k$ times the cardinality of $X_1$. Thus $X_1$ and $X_2$ can be distinguished by their cardinality.

However, the weighted summation function $h$ in the attention-based aggregator $A$ maps them to the same embedding: $h(c_1, X_1) = h(c_2, X_2)$ according to Theorem 1. Thus we cannot distinguish $X_1$ and $X_2$ via $A$. In conclusion, $A$ loses the cardinality information after aggregation.
Proof.
For any two distinct multisets $X_1$ and $X_2$ that $T$ previously always failed to distinguish according to Theorem 1, we denote $X_1 = (S, \mu)$ and $X_2 = (S, k \cdot \mu)$ for some $k \in \mathbb{N}^*$, with central node feature $c \in S$. Thus $\sum_{x \in X_1} \alpha_{cx}^{(1)} f(x) = \sum_{x \in X_2} \alpha_{cx}^{(2)} f(x)$, where $\alpha_{cx}^{(i)}$ is the attention weight belonging to $X_i$, between $f(c)$ and $f(x)$, $x \in X_i$, $i \in \{1, 2\}$. We denote $H = \sum_{x \in X_1} \alpha_{cx}^{(1)} f(x) = \sum_{x \in X_2} \alpha_{cx}^{(2)} f(x)$. When applying the CPA models, the aggregation functions in $T$ can be rewritten as:

$$h_1(c, X_i) = H + w \odot \sum_{x \in X_i} f(x), \quad i \in \{1, 2\},$$
$$h_2(c, X_i) = \psi\left(\left|X_i\right|\right) \odot H, \quad i \in \{1, 2\}.$$

We consider the following example: all elements in $w$ equal 1; the function $\psi$ maps $|X|$ to an $n$-dimensional vector in which all elements equal $|X|$; and $f(x) = N^{-Z(x)}$, where $Z: \mathcal{X} \rightarrow \mathbb{N}$ and $N > |X|$. The aggregation functions then become:

$$h_1(c, X_i) = H + \sum_{x \in X_i} f(x), \quad i \in \{1, 2\},$$
$$h_2(c, X_i) = \left|X_i\right| \cdot H, \quad i \in \{1, 2\}.$$

For $h_1$, we have $h_1(c, X_1) - h_1(c, X_2) = \sum_{x \in X_1} f(x) - \sum_{x \in X_2} f(x)$. According to Lemma 5 of (Xu et al. 2019), when $X_1 \neq X_2$, $\sum_{x \in X_1} f(x) \neq \sum_{x \in X_2} f(x)$. So $h_1(c, X_1) \neq h_1(c, X_2)$.

For $h_2$, we have $h_2(c, X_1) - h_2(c, X_2) = (|X_1| - |X_2|) \cdot H$. As $\alpha_{cx} > 0$ due to the softmax function and $f(x) > 0$ in our example, we know $H > 0$. Moreover, as $|X_1| - |X_2| \neq 0$, we get $h_2(c, X_1) \neq h_2(c, X_2)$.
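As a small numeric illustration of this construction (with a made-up injective code Z and N = 10, chosen only so that N exceeds the multiset sizes), the plain sums of $f(x) = N^{-Z(x)}$ already separate $X_1$ from $X_2 = (S, 2 \cdot \mu)$, which is exactly what the Additive correction term adds on top of $H$:

```python
Z = {"a": 1, "b": 2}          # an injective code for the countable features (assumed)
N = 10                        # N > |X2| = 6
f = lambda x: N ** (-Z[x])

X1 = ["a", "b", "b"]
X2 = X1 * 2                   # k = 2: same distribution, doubled cardinality
print(sum(map(f, X1)), sum(map(f, X2)))  # ~0.12 vs ~0.24: distinguishable
```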
Details of Datasets

For the node classification task, we generate a graph with 4800 nodes and 32400 edges. A fraction of the nodes are included in triangles as vertices, while the others are not. For graph classification, the statistics of the datasets are as follows:

Datasets | Graphs | Classes | Features | Node Avg. | Edge Avg.
MUTAG | 188 | 2 | 7 | 17.93 | 19.79
PROTEINS | 1113 | 2 | 4 | 39.06 | 72.81
ENZYMES | 600 | 6 | 6 | 32.63 | 62.14
NCI1 | 4110 | 2 | 23 | 29.87 | 32.30
RE-B | 2000 | 2 | - | 429.63 | 995.51
RE-M5K | 4999 | 5 | - | 508.52 | 1189.75
Details of Experiment Settings
For all experiments, we perform 10-fold cross-validation and repeat the experiments 10 times for each dataset and each model. To get a final accuracy for each run, we select the epoch with the best cross-validation accuracy averaged over all 10 folds. The average accuracies and their standard deviations are reported based on the results across the folds in all runs.

In our Additive and Scaled models, all MLPs have 2 layers with ReLU activation.

In the GAT variants, we use 2 GNN layers and a hidden dimensionality of 32. The negative input slope of the LeakyReLU in the GAT attention mechanism is 0.2. The number of heads in the multi-head attention is 1.

In the GAT-GC variants, we use 4 GNN layers. For the Readout function in all models, we use sum for the bioinformatics datasets and mean for the social network datasets. We apply batch normalization (Ioffe and Szegedy 2015) after every hidden layer. The hidden dimensionality is set to 32 for the bioinformatics datasets and 64 for the social network datasets. The negative input slope of the LeakyReLU in the GAT attention mechanism is 0.2. We use a single head in the multi-head attention in all models.

All models are trained using the Adam optimizer (Kingma and Ba 2018), and the learning rate is dropped by a factor of 0.5 every 400 epochs in the node classification task and every 50 epochs in the graph classification task. We use an initial learning rate of 0.01 for the TRIANGLE-NODE and bioinformatics datasets and 0.0025 for the social network datasets. For the GAT variants, we use a dropout ratio of 0 and a weight decay value of 0. For the GAT-GC variants on each dataset, the following hyper-parameters are tuned: (1) batch size; (2) dropout ratio after the dense layer; (3) L2 regularization (up to 0.001).