A Generalization of Transformer Networks to Graphs
Vijay Prakash Dwivedi, Xavier Bresson
School of Computer Science and Engineering, Nanyang Technological University, Singapore
[email protected], [email protected]

Abstract
We propose a generalization of the transformer neural network architecture for arbitrary graphs. The original transformer was designed for Natural Language Processing (NLP), which operates on fully connected graphs representing all connections between the words in a sequence. Such an architecture does not leverage the graph connectivity inductive bias, and can perform poorly when the graph topology is important and has not been encoded into the node features. We introduce a graph transformer with four new properties compared to the standard model. First, the attention mechanism is a function of the neighborhood connectivity for each node in the graph. Second, the positional encoding is represented by the Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings often used in NLP. Third, the layer normalization is replaced by a batch normalization layer, which provides faster training and better generalization performance. Finally, the architecture is extended to edge feature representation, which can be critical to tasks such as chemistry (bond type) or link prediction (entity relationship in knowledge graphs). Numerical experiments on a graph benchmark demonstrate the performance of the proposed graph transformer architecture. This work closes the gap between the original transformer, which was designed for the limited case of line graphs, and graph neural networks, which can work with arbitrary graphs. As our architecture is simple and generic, we believe it can be used as a black box for future applications that wish to consider transformers and graphs.

Source code: https://github.com/graphdeeplearning/graphtransformer

1 Introduction

There has been a tremendous success in the field of natural language processing (NLP) since the development of Transformers (Vaswani et al. 2017), which are currently the best performing neural network architectures for handling long-term sequential datasets such as sentences in NLP. This is achieved by the use of the attention mechanism (Bahdanau, Cho, and Bengio 2014), where a word in a sentence attends to every other word and combines the received information to generate its abstract feature representation. From the perspective of the message-passing paradigm (Gilmer et al. 2017) in graph neural networks (GNNs), this process of learning word feature representations by combining feature information from other words in a sentence can alternatively be viewed as a GNN applied on a fully connected graph of words (Joshi 2020). Transformer-based models have led to state-of-the-art performance on several NLP applications (Devlin et al. 2018; Radford et al. 2018; Brown et al. 2020). On the other hand, graph neural networks (GNNs) have been shown to be the most effective neural network architectures on graph datasets and have achieved significant success on a wide range of applications, such as knowledge graphs (Schlichtkrull et al. 2018; Chami et al. 2020), social sciences (Monti et al. 2019), physics (Cranmer et al. 2019; Sanchez-Gonzalez et al. 2020), etc. In particular, GNNs exploit the given arbitrary graph structure while learning the feature representations for nodes and edges, and the learned representations are eventually used for downstream tasks. In this work, we explore inductive biases at the convergence of these two active research areas in deep learning and present an improved version of the Graph Transformer (see Figure 1), which extends the key design components of NLP transformers to arbitrary graphs.
Figure 1: Block diagram of the Graph Transformer with Laplacian eigenvectors (λ) used as positional encoding (LapPE). LapPE is added to the input node embeddings before passing the features to the first layer. Left: Graph Transformer operating on node embeddings only to compute attention scores. Right: Graph Transformer with edge features, with a designated feature pipeline to maintain layer-wise edge representations. In this extension, the available edge attributes in a graph are used to explicitly modify the corresponding pairwise attention scores.

1.1 Related Work

As a preliminary, we highlight the most recent research works which attempt to develop graph transformers (Li et al. 2019; Nguyen, Nguyen, and Phung 2019; Zhang et al. 2020), with a few focused on specialized cases such as heterogeneous graphs, temporal networks, generative modeling, etc. (Yun et al. 2019; Xu, Joshi, and Bresson 2019; Hu et al. 2020; Zhou et al. 2020).

The model proposed in Li et al. (2019) employs attention over all graph nodes instead of a node's local neighbors for the purpose of capturing global information. This limits the efficient exploitation of sparsity, which we show is a good inductive bias for learning on graph datasets. For global information, we argue that there are other ways to incorporate it without giving up sparsity and local context: for example, graph-specific positional features (Zhang et al. 2020), node Laplacian position eigenvectors (Belkin and Niyogi 2003; Dwivedi et al. 2020), relative learnable positional information (You, Ying, and Leskovec 2019), virtual nodes (Li et al. 2015), etc. Zhang et al. (2020) propose Graph-BERT with an emphasis on pre-training and parallelized learning using a subgraph batching scheme that creates fixed-size linkless subgraphs to be passed to the model instead of the original graph. Graph-BERT employs a combination of several positional encoding schemes to capture absolute node structural and relative node positional information. Since the original graph is not used directly in Graph-BERT and the subgraphs do not have edges between the nodes (i.e., they are linkless), the proposed combination of positional encodings attempts to retain the original graph structure information in the nodes. We perform a detailed analysis of the Graph-BERT positional encoding schemes, along with an experimental comparison with the model we present in this paper, in Section 4.1.

Yun et al. (2019) developed Graph Transformer Networks (GTN) to learn on heterogeneous graphs, with the target of transforming a given heterogeneous graph into a meta-path-based graph and then performing convolution. Notably, their focus behind the use of the attention framework is on interpreting the generated meta-paths. There is another transformer-based approach developed for heterogeneous information networks, namely the Heterogeneous Graph Transformer (HGT) by Hu et al. (2020). Apart from its ability to handle an arbitrary number of node and edge types, HGT also captures the dynamics of information flow in heterogeneous graphs in the form of a relative temporal positional encoding, which is based on the timestamp differences of the central node and the message-passing nodes. Furthermore, Zhou et al. (2020) proposed a transformer-based generative model which generates temporal graphs by directly learning from dynamic information in networks. The architecture presented in Nguyen, Nguyen, and Phung (2019) somewhat proceeds along our goal to develop a graph transformer for arbitrary homogeneous graphs, with a coordinate-embedding-based positional encoding scheme. However, their experiments show that the coordinate embeddings are not universal in performance and only help in a couple of unsupervised learning experiments among all evaluations.
Overall, we find that the most fruitful ideas from the transformers literature in NLP can be applied in a more efficient way, and we posit that sparsity and positional encodings are two key aspects in the development of a Graph Transformer. As opposed to designing a best-performing model for specific graph tasks, our work aims at a generic, competitive transformer model which draws ideas together from the domains of NLP and GNNs. For an overview, this paper brings the following contributions:

• We put forward a generalization of transformer networks to homogeneous graphs of arbitrary structure, namely the Graph Transformer, and an extended version of the Graph Transformer with edge features that allows the usage of explicit domain information as edge features.

• Our method includes an elegant way to fuse node positional features using Laplacian eigenvectors for graph datasets, inspired by the heavy usage of positional encodings in NLP transformer models and by recent research on node positional features in GNNs. The comparison with the literature shows Laplacian eigenvectors to be better placed than existing approaches to encode node positional information for arbitrary homogeneous graphs.

• Our experiments demonstrate that the proposed model surpasses baseline isotropic and anisotropic GNNs. The architecture simultaneously emerges as a better attention-based GNN baseline as well as a simple and effective Transformer network baseline for graph datasets, for future research at the intersection of attention and graphs.
2 Proposed Architecture

As stated earlier, we take into account two key aspects to develop Graph Transformers – sparsity and positional encodings – which should ideally be used in the best possible way for learning on graph datasets. We first discuss the motivations behind these using a transition from NLP to graphs, and then introduce the proposed architecture.
2.1 On Graph Sparsity

In NLP transformers, a sentence is treated as a fully connected graph, and this choice can be justified for two reasons. a) First, it is difficult to find meaningful sparse interactions or connections among the words in a sentence. For instance, the dependency of a word in a sentence on another word can vary with context, the perspective of a user, and the specific application. There can be numerous plausible ground-truth connections among words in a sentence, and therefore text datasets of sentences do not have explicit word interactions available. It thereby makes sense to have each word attend to every other word in a sentence, as done in the Transformer architecture (Vaswani et al. 2017). b) Next, the so-called graph considered in an NLP transformer often has fewer than tens or hundreds of nodes (i.e., sentences are often shorter than tens or hundreds of words). This keeps the computation feasible, and large transformer models can be trained on such fully connected graphs of words.

In the case of actual graph datasets, graphs have an arbitrary connectivity structure available depending on the domain and target application, and have node counts ranging up to millions or billions. The available structure presents us with a rich source of information to exploit as an inductive bias in a neural network, whereas the node counts make it practically impossible to use a fully connected graph for such datasets. On these accounts, it is ideal and practical to have a Graph Transformer where a node attends to its local node neighbors, as in GNNs (Defferrard, Bresson, and Vandergheynst 2016; Kipf and Welling 2017; Monti et al. 2017; Gilmer et al. 2017; Veličković et al. 2018; Bresson and Laurent 2017; Xu et al. 2019).
2.2 On Positional Encodings

In NLP, transformer-based models are, in most cases, supplied with a positional encoding for each word. This is critical to ensure a unique representation for each word and, eventually, to preserve distance information. For graphs, the design of unique node positions is challenging as there are symmetries which prevent canonical node positional information (Murphy et al. 2019). In fact, most GNNs which are trained on graph datasets learn structural node information that is invariant to the node position (Srinivasan and Ribeiro 2020). This is a critical reason why simple attention-based models, such as GAT (Veličković et al. 2018), where the attention is a function of local neighborhood connectivity instead of full-graph connectivity, do not seem to achieve competitive performance on graph datasets. The issue of positional embeddings has been explored in recent GNN works (Murphy et al. 2019; You, Ying, and Leskovec 2019; Srinivasan and Ribeiro 2020; Dwivedi et al. 2020; Li et al. 2020) with the goal of learning both structural and positional features. In particular, Dwivedi et al. (2020) make use of the available graph structure to pre-compute Laplacian eigenvectors (Belkin and Niyogi 2003) and use them as node positional information. Since Laplacian PEs are a generalization of the PE used in the original transformer (Vaswani et al. 2017) to graphs, and since they better help encode distance-aware information (i.e., nearby nodes have similar positional features and farther nodes have dissimilar positional features), we use Laplacian eigenvectors as the PE in the Graph Transformer. Although these eigenvectors have multiplicity occurring due to the arbitrary sign of eigenvectors, we randomly flip the sign of the eigenvectors during training, following Dwivedi et al. (2020). We pre-compute the Laplacian eigenvectors of all graphs in the dataset. Eigenvectors are defined via the factorization of the graph Laplacian matrix:

$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U$,   (1)

where $A$ is the $n \times n$ adjacency matrix, $D$ is the degree matrix, and $\Lambda$, $U$ correspond to the eigenvalues and eigenvectors respectively. We use the $k$ smallest non-trivial eigenvectors of a node as its positional encoding, denoted by $\lambda_i$ for node $i$. Finally, we refer to Section 4.1 for a comparison of the Laplacian PE with the existing Graph-BERT PEs.

2.3 Graph Transformer Architecture

We now introduce the Graph Transformer Layer and the Graph Transformer Layer with edge features. The layer architecture is illustrated in Figure 1. The first model is designed for graphs which do not have explicit edge attributes, whereas the second model maintains a designated edge feature pipeline to incorporate the available edge information and maintain its abstract representation at every layer.
Input
First of all, we prepare the input node and edge embeddings to be passed to the Graph Transformer Layer. For a graph $\mathcal{G}$ with node features $\alpha_i \in \mathbb{R}^{d_n \times 1}$ for each node $i$ and edge features $\beta_{ij} \in \mathbb{R}^{d_e \times 1}$ for each edge between node $i$ and node $j$, the input node features $\alpha_i$ and edge features $\beta_{ij}$ are passed via a linear projection to embed them into $d$-dimensional hidden features $\hat{h}_i$ and $e_{ij}$:

$\hat{h}_i = A \alpha_i + a \; ; \quad e_{ij} = B \beta_{ij} + b$,   (2)

where $A \in \mathbb{R}^{d \times d_n}$, $B \in \mathbb{R}^{d \times d_e}$ and $a, b \in \mathbb{R}^{d}$ are the parameters of the linear projection layers. We now embed the pre-computed node positional encodings of dimension $k$ via a linear projection and add them to the node features $\hat{h}_i$:

$\lambda_i^0 = C \lambda_i + c \; ; \quad h_i = \hat{h}_i + \lambda_i^0$,   (3)

where $C \in \mathbb{R}^{d \times k}$ and $c \in \mathbb{R}^{d}$. Note that the Laplacian positional encodings are only added to the node features at the input layer and not during intermediate Graph Transformer layers.
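As an illustration, the following is a minimal sketch of how the Laplacian positional encodings of Equation 1 could be pre-computed and fused with the input features as in Equations 2 and 3. The helper names (`laplacian_pe`, `InputEmbedding`), the NumPy/PyTorch usage, and the placement of the sign flip are our own assumptions, not the authors' reference implementation.

```python
import numpy as np
import torch
import torch.nn as nn

def laplacian_pe(A: np.ndarray, k: int) -> np.ndarray:
    """k smallest non-trivial eigenvectors of the normalized Laplacian (Eq. 1)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    _, eigvec = np.linalg.eigh(L)          # columns sorted by ascending eigenvalue
    return eigvec[:, 1:k + 1]              # drop the trivial (constant) eigenvector

class InputEmbedding(nn.Module):
    """Linear projections of Eqs. 2-3: nodes/edges to d dims, LapPE added to nodes."""
    def __init__(self, d_n, d_e, k, d):
        super().__init__()
        self.A = nn.Linear(d_n, d)         # node feature projection
        self.B = nn.Linear(d_e, d)         # edge feature projection
        self.C = nn.Linear(k, d)           # positional encoding projection

    def forward(self, alpha, beta, lam, training=True):
        if training:                        # random sign flip of the eigenvectors
            sign = (2 * torch.randint(0, 2, (1, lam.size(1))) - 1).float()
            lam = lam * sign
        h = self.A(alpha) + self.C(lam)     # h_i = ĥ_i + λ_i  (Eq. 3)
        e = self.B(beta)                    # e_ij             (Eq. 2)
        return h, e
```

The eigenvectors are pre-computed once per graph, while the sign flip is re-sampled during training as described above.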
Graph Transformer Layer

The Graph Transformer layer closely follows the transformer architecture initially proposed in Vaswani et al. (2017); see Figure 1 (Left). We now proceed to define the node update equations for a layer $\ell$:

$\hat{h}_i^{\ell+1} = O_h^{\ell} \, \Big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w_{ij}^{k,\ell} \, V^{k,\ell} h_j^{\ell} \Big)$,   (4)

where

$w_{ij}^{k,\ell} = \mathrm{softmax}_j \Big( \frac{Q^{k,\ell} h_i^{\ell} \cdot K^{k,\ell} h_j^{\ell}}{\sqrt{d_k}} \Big)$,   (5)

$Q^{k,\ell}, K^{k,\ell}, V^{k,\ell} \in \mathbb{R}^{d_k \times d}$, $O_h^{\ell} \in \mathbb{R}^{d \times d}$, $\mathcal{N}_i$ is the neighborhood of node $i$, $H$ denotes the number of attention heads ($k = 1$ to $H$), and $\Vert$ denotes concatenation. For numerical stability, the outputs after taking exponents of the terms inside the softmax are clamped between $-5$ and $+5$. The attention outputs $\hat{h}_i^{\ell+1}$ are then passed to a Feed Forward Network (FFN), preceded and succeeded by residual connections and normalization layers:

$\hat{\hat{h}}_i^{\ell+1} = \mathrm{Norm}\big( h_i^{\ell} + \hat{h}_i^{\ell+1} \big)$,   (6)

$\hat{\hat{\hat{h}}}_i^{\ell+1} = W_2^{\ell} \, \mathrm{ReLU}\big( W_1^{\ell} \, \hat{\hat{h}}_i^{\ell+1} \big)$,   (7)

$h_i^{\ell+1} = \mathrm{Norm}\big( \hat{\hat{h}}_i^{\ell+1} + \hat{\hat{\hat{h}}}_i^{\ell+1} \big)$,   (8)

where $W_1^{\ell}, W_2^{\ell} \in \mathbb{R}^{d \times d}$, $\hat{\hat{h}}_i^{\ell+1}$ and $\hat{\hat{\hat{h}}}_i^{\ell+1}$ denote intermediate representations, and Norm can either be LayerNorm (Ba, Kiros, and Hinton 2016) or BatchNorm (Ioffe and Szegedy 2015). The bias terms are omitted for clarity of presentation.
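As a concrete illustration, below is a minimal, self-contained sketch of one such layer for a single graph in PyTorch. It uses a dense boolean adjacency mask for readability rather than the sparse DGL message-passing kernels an actual implementation would rely on; the class name `GraphTransformerLayer` and the exact placement of the score clamping are our own assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTransformerLayer(nn.Module):
    """Sparse multi-head attention layer of Eqs. 4-8 (node features only)."""
    def __init__(self, d, num_heads, batch_norm=True):
        super().__init__()
        assert d % num_heads == 0
        self.H, self.d_k, self.d = num_heads, d // num_heads, d
        self.Q = nn.Linear(d, d, bias=False)
        self.K = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d, d, bias=False)
        self.O = nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        norm = nn.BatchNorm1d if batch_norm else nn.LayerNorm
        self.norm1, self.norm2 = norm(d), norm(d)

    def forward(self, h, adj):
        # h: (n, d) node features; adj: (n, n) bool, adj[i, j] True iff j is a neighbor of i
        n = h.size(0)
        q = self.Q(h).view(n, self.H, self.d_k)
        k = self.K(h).view(n, self.H, self.d_k)
        v = self.V(h).view(n, self.H, self.d_k)
        scores = torch.einsum('ihd,jhd->hij', q, k) / self.d_k ** 0.5   # Eq. 5 numerator
        scores = scores.clamp(-5.0, 5.0)                                # numerical stability
        scores = scores.masked_fill(~adj.unsqueeze(0), float('-inf'))   # attend to N_i only
        w = F.softmax(scores, dim=-1)                                   # Eq. 5
        attn = torch.einsum('hij,jhd->ihd', w, v).reshape(n, self.d)    # Eq. 4 (heads concat)
        h1 = self.norm1(h + self.O(attn))                               # Eq. 6
        return self.norm2(h1 + self.ffn(h1))                            # Eqs. 7-8
```

With BatchNorm, normalization is applied over all nodes of the (batched) graphs, which is the setting the experiments below find preferable to LayerNorm.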
Graph Transformer Layer with edge features

The Graph Transformer with edge features is designed to better utilize the rich feature information available in several graph datasets in the form of edge attributes. See Figure 1 (Right) for the building block of this layer. Since our objective remains to make better use of the edge features, which are pairwise scores corresponding to a node pair, we tie these available edge features to the implicit edge scores computed by pairwise attention. In other words, say an intermediate attention score before the softmax, $\hat{w}_{ij}$, is computed when a node $i$ attends to node $j$ after the multiplication of the query and key feature projections; see the expression inside the brackets in Equation 5. Let us treat this score $\hat{w}_{ij}$ as implicit information about the edge $\langle i, j \rangle$. We now try to inject the available edge information for the edge $\langle i, j \rangle$ and improve the already computed implicit attention score $\hat{w}_{ij}$. This is done by simply multiplying the two values $\hat{w}_{ij}$ and $e_{ij}$; see Equation 12. This kind of information injection has not been explored much, or applied, in NLP transformers, as there is usually no feature information available between two words. However, in graph datasets such as molecular graphs or social media graphs, there is often some feature information available on the edge interactions, and it becomes natural to design an architecture to use this information while learning. For the edges, we also maintain a designated node-symmetric edge feature representation pipeline for propagating edge attributes from one layer to another; see Figure 1. We now proceed to define the layer update equations for a layer $\ell$:

$\hat{h}_i^{\ell+1} = O_h^{\ell} \, \Big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w_{ij}^{k,\ell} \, V^{k,\ell} h_j^{\ell} \Big)$,   (9)

$\hat{e}_{ij}^{\ell+1} = O_e^{\ell} \, \Big\Vert_{k=1}^{H} \big( \hat{w}_{ij}^{k,\ell} \big)$,   (10)

where

$w_{ij}^{k,\ell} = \mathrm{softmax}_j \big( \hat{w}_{ij}^{k,\ell} \big)$,   (11)

$\hat{w}_{ij}^{k,\ell} = \Big( \frac{Q^{k,\ell} h_i^{\ell} \cdot K^{k,\ell} h_j^{\ell}}{\sqrt{d_k}} \Big) \cdot E^{k,\ell} e_{ij}^{\ell}$,   (12)

$Q^{k,\ell}, K^{k,\ell}, V^{k,\ell}, E^{k,\ell} \in \mathbb{R}^{d_k \times d}$, $O_h^{\ell}, O_e^{\ell} \in \mathbb{R}^{d \times d}$, $H$ denotes the number of attention heads ($k = 1$ to $H$), and $\Vert$ denotes concatenation. For numerical stability, the outputs after taking exponents of the terms inside the softmax are clamped between $-5$ and $+5$. The outputs $\hat{h}_i^{\ell+1}$ and $\hat{e}_{ij}^{\ell+1}$ are then passed to separate Feed Forward Networks, preceded and succeeded by residual connections and normalization layers:

$\hat{\hat{h}}_i^{\ell+1} = \mathrm{Norm}\big( h_i^{\ell} + \hat{h}_i^{\ell+1} \big)$,   (13)

$\hat{\hat{\hat{h}}}_i^{\ell+1} = W_{h,2}^{\ell} \, \mathrm{ReLU}\big( W_{h,1}^{\ell} \, \hat{\hat{h}}_i^{\ell+1} \big)$,   (14)

$h_i^{\ell+1} = \mathrm{Norm}\big( \hat{\hat{h}}_i^{\ell+1} + \hat{\hat{\hat{h}}}_i^{\ell+1} \big)$,   (15)

where $W_{h,1}^{\ell}, W_{h,2}^{\ell} \in \mathbb{R}^{d \times d}$, and $\hat{\hat{h}}_i^{\ell+1}$, $\hat{\hat{\hat{h}}}_i^{\ell+1}$ denote intermediate representations, and

$\hat{\hat{e}}_{ij}^{\ell+1} = \mathrm{Norm}\big( e_{ij}^{\ell} + \hat{e}_{ij}^{\ell+1} \big)$,   (16)

$\hat{\hat{\hat{e}}}_{ij}^{\ell+1} = W_{e,2}^{\ell} \, \mathrm{ReLU}\big( W_{e,1}^{\ell} \, \hat{\hat{e}}_{ij}^{\ell+1} \big)$,   (17)

$e_{ij}^{\ell+1} = \mathrm{Norm}\big( \hat{\hat{e}}_{ij}^{\ell+1} + \hat{\hat{\hat{e}}}_{ij}^{\ell+1} \big)$,   (18)

where $W_{e,1}^{\ell}, W_{e,2}^{\ell} \in \mathbb{R}^{d \times d}$, and $\hat{\hat{e}}_{ij}^{\ell+1}$, $\hat{\hat{\hat{e}}}_{ij}^{\ell+1}$ denote intermediate representations.
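The following sketch extends the previous layer to the edge-feature variant. For readability it keeps a dense (n, n, d) edge tensor and reduces the per-dimension score of Eq. 12 to a per-head scalar before the softmax; an actual implementation (e.g., with DGL message passing over existing edges only) would store features per edge. The class name and this scalar-reduction detail are our assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTransformerLayerWithEdges(nn.Module):
    """Layer of Eqs. 9-18: projected edge features scale the pre-softmax attention
    scores (Eq. 12) and are themselves updated and propagated to the next layer."""
    def __init__(self, d, num_heads):
        super().__init__()
        assert d % num_heads == 0
        self.H, self.d_k, self.d = num_heads, d // num_heads, d
        self.Q = nn.Linear(d, d, bias=False)
        self.K = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d, d, bias=False)
        self.E = nn.Linear(d, d, bias=False)
        self.O_h, self.O_e = nn.Linear(d, d), nn.Linear(d, d)
        self.ffn_h = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ffn_e = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm_h1, self.norm_h2 = nn.BatchNorm1d(d), nn.BatchNorm1d(d)
        self.norm_e1, self.norm_e2 = nn.BatchNorm1d(d), nn.BatchNorm1d(d)

    def forward(self, h, e, adj):
        # h: (n, d) nodes; e: (n, n, d) dense edge features; adj: (n, n) bool edge mask
        n, d = h.shape
        q = self.Q(h).view(n, self.H, self.d_k)
        k = self.K(h).view(n, self.H, self.d_k)
        v = self.V(h).view(n, self.H, self.d_k)
        e_proj = self.E(e).view(n, n, self.H, self.d_k)
        # Eq. 12: implicit pairwise score, modulated element-wise by the edge projection
        w_hat = torch.einsum('ihd,jhd->ijhd', q, k) / self.d_k ** 0.5
        w_hat = w_hat * e_proj
        e_new = self.O_e(w_hat.reshape(n, n, d))                            # Eq. 10
        scores = w_hat.sum(-1).clamp(-5.0, 5.0)                             # per-head score
        scores = scores.masked_fill(~adj.unsqueeze(-1), float('-inf'))
        w = F.softmax(scores, dim=1)                                        # Eq. 11, softmax over j
        h_new = self.O_h(torch.einsum('ijh,jhd->ihd', w, v).reshape(n, d))  # Eq. 9
        h1 = self.norm_h1(h + h_new)                                        # Eq. 13
        h_out = self.norm_h2(h1 + self.ffn_h(h1))                           # Eqs. 14-15
        e1 = self.norm_e1((e + e_new).reshape(-1, d)).reshape(n, n, d)      # Eq. 16
        e_out = self.norm_e2((e1 + self.ffn_e(e1)).reshape(-1, d)).reshape(n, n, d)  # Eqs. 17-18
        return h_out, e_out
```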
Task based MLP Layers

The node representations obtained at the final layer of the Graph Transformer are passed to a task-based MLP network for computing task-dependent outputs, which are then fed to a loss function to train the parameters of the model. The formal definitions of the task-based layers that we use can be found in Appendix A.1.
3 Numerical Experiments

We evaluate the performance of the proposed Graph Transformer on three benchmark graph datasets – ZINC (Irwin et al. 2012), PATTERN and CLUSTER (Abbe 2017) – from a recent GNN benchmark (Dwivedi et al. 2020).
ZINC, Graph Regression
ZINC (Irwin et al. 2012) is a molecular dataset with the task of graph property regression for constrained solubility. Each ZINC molecule is represented as a graph with atoms as nodes and bonds as edges. Since this dataset has rich feature information in the form of bond types as edge attributes, we use the 'Graph Transformer with edge features' for this task. We use the 12K subset of the data as in Dwivedi et al. (2020).
PATTERN, Node Classification
PATTERN is a node classification dataset generated using the Stochastic Block Model (SBM) (Abbe 2017). The task is to classify the nodes into 2 communities. PATTERN graphs do not have explicit edge features and hence we use the simple 'Graph Transformer' for this task. The size of this dataset is 14K graphs.
CLUSTER, Node Classification
CLUSTER is also a synthetically generated dataset using the SBM model. The task is to assign a cluster label to each node; there are 6 cluster labels in total. Similar to PATTERN, CLUSTER graphs do not have explicit edge features and hence we use the simple 'Graph Transformer' for this task. The size of this dataset is 12K graphs. We refer the readers to Dwivedi et al. (2020) for additional information, including the preparation of these datasets.
Model Configurations
For the experiments, we follow the benchmarking protocol introduced in Dwivedi et al. (2020), based on PyTorch (Paszke et al. 2019) and DGL (Wang et al. 2019). We use 10 Graph Transformer layers, each layer having 8 attention heads and arbitrary hidden dimensions such that the total number of trainable parameters is in the range of 500k. We use a learning rate decay strategy to train the models, where training stops once the learning rate reaches a pre-defined minimum value. We run each experiment with 4 different seeds and report the mean and standard deviation of the performance measure over the 4 runs. The results are reported in Table 1 and the comparison in Table 2.

4 Analysis and Discussion

We now present the analysis of our experiments on the proposed Graph Transformer architecture; see Tables 1 and 2.

• The generalization of the transformer network to graphs works best when the Laplacian PE is used for node positions and Batch Normalization is selected instead of Layer Normalization. For all three benchmark datasets, the experiments score the highest performance in this setting; see Table 1.

• The proposed architecture performs significantly better than the baseline isotropic and anisotropic GNNs (GCN and GAT respectively), and helps close the gap between the original transformer and transformers for graphs. Notably, our architecture emerges as a fresh and improved attention-based GNN baseline surpassing GAT (see Table 2), which employs multi-headed attention inspired by the original transformer (Vaswani et al. 2017) and has often been used in the literature as a baseline for attention-based GNN models.

• As expected, sparse graph connectivity is a critical inductive bias for datasets with arbitrary graph structure, as demonstrated by comparing the sparse vs. full graph experiments.

• Our proposed extension of the Graph Transformer with edge features comes close to the best performing GNN, i.e., GatedGCN, on ZINC. This architecture specifically brings exciting promise to datasets where domain information along pairwise interactions can be leveraged for maximum learning performance.
Table 1: Results of GraphTransformer (GT) on all datasets. The performance measure is MAE for ZINC and Acc for PATTERN and CLUSTER. Results (higher is better for all except ZINC) are averaged over 4 runs with 4 different seeds. Bold: the best performing model for each dataset. We perform each experiment on the given graphs (Sparse Graph) and on the Full Graph, in which we create full connections among all nodes; for ZINC full graphs, edge features are discarded, given our motive of running the full-graph experiments without any sparse structure information.

Dataset | LapPE | L | #Param | Test ± s.d. (Sparse) | Train ± s.d. (Sparse) | Test ± s.d. (Full) | Train ± s.d. (Full)
Batch Norm: False; Layer Norm: True
ZINC    | x | 10 | 588353 | 0.278  | | |
ZINC    | ✓ | 10 | 588929 | 0.284  | | |
CLUSTER | x | 10 | 523146 | 70.879 | | |
CLUSTER | ✓ | 10 | 524026 | 70.649 | | |
PATTERN | x | 10 | 522742 | 73.140 | | |
PATTERN | ✓ | 10 | 522982 | 71.005 | | |
Batch Norm: True; Layer Norm: False
ZINC    | x | 10 | 588353 | 0.264  | | |
ZINC    | ✓ | 10 | 588929 |        | | |
CLUSTER | x | 10 | 523146 | 72.139 | | |
CLUSTER | ✓ | 10 | 524026 |        | | |
PATTERN | x | 10 | 522742 | 83.949 | | |
PATTERN | ✓ | 10 | 522982 |        | | |

Table 2: Comparison of our best performing scores (from Table 1) on each dataset (ZINC, CLUSTER, PATTERN) against the GNN baselines of 500k model parameters – GCN (Kipf and Welling 2017), GAT (Veličković et al. 2018) and GatedGCN (Bresson and Laurent 2017) – with baseline scores taken from Dwivedi et al. (2020). Note: only the GatedGCN and GT models use the available edge attributes in ZINC.

4.1 Comparison with Positional Encoding Schemes of Graph-BERT

In addition to the reasons underscored in Sections 1.1 and 2.2, we demonstrate in this section the usefulness of Laplacian eigenvectors as a suitable candidate PE for the Graph Transformer by comparing them with the different PE schemes applied in Graph-BERT (Zhang et al. 2020). In Graph-BERT, which operates on fixed-size sampled subgraphs, a node attends to every other node in a subgraph. For a given graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $\mathcal{V}$ nodes and $\mathcal{E}$ edges, a subgraph $g_i$ of size $k+1$ is created for every node $i$ in the graph, which means the original single graph $\mathcal{G}$ is converted into $\mathcal{V}$ subgraphs. For a subgraph $g_i$ corresponding to node $u_i$, the $k$ other nodes are the ones which have the top-$k$ intimacy scores with node $u_i$, based on a pre-computed intimacy matrix that maps every edge in the graph $\mathcal{G}$ to an intimacy score. While the sampling is great for parallelization and efficiency, the original graph structure is not directly used in the layers. Graph-BERT uses a combination of node PE schemes to inform the model on node structural, positional, and distance information from the original graph: i) intimacy based relative PE, ii) hop based relative distance encoding, and iii) Weisfeiler-Lehman based absolute PE (WL-PE). The intimacy based PE and the hop based PE are variant to the sampled subgraphs, i.e., these PEs for a node in a subgraph $g_i$ depend on the node $u_i$ with respect to which it is sampled, and cannot be directly used in other cases unless a similar sampling strategy is used. The WL-PE, which encodes the absolute structural roles of nodes in the original graph computed using the WL algorithm (Zhang et al. 2020; Niepert, Ahmed, and Kutzkov 2016), is not variant to the subgraphs and can easily be used as a generic PE mechanism. On that account, we swap out the Laplacian PE in our experiments for an ablation analysis and use the WL-PE from Graph-BERT; see Table 3. As the Laplacian PE captures better structural and positional information about the nodes, which essentially is the objective behind using the three Graph-BERT PEs, it outperforms the WL-PE. Besides, WL-PEs tend to overfit the SBM datasets and lead to poor generalization.

Note that we do not perform an empirical comparison with other PEs in the Graph Transformer literature except Graph-BERT, for two reasons: i) some existing Graph Transformer methods do not use PEs, and ii) if PEs are used, they are usually specialised; for instance, the Relative Temporal Encoding (RTE) for encoding dynamic information in heterogeneous graphs in Hu et al. (2020).

Table 3 (Sparse Graph; Batch Norm: True; Layer Norm: False; L = 10): Analysis of GraphTransformer (GT) using different PE schemes, reporting Test ± s.d. and Train Perf. ± s.d. for each of ZINC, CLUSTER and PATTERN under each PE setting. Notations – x: No PE; L: LapPE (ours); W: WL-PE (Zhang et al. 2020). Bold: the best performing model for each dataset.
5 Conclusion

This work presented a simple yet effective approach to generalize transformer networks to arbitrary graphs and introduced the corresponding architecture. Our experiments consistently showed that the presence of (i) Laplacian eigenvectors as node positional encodings and (ii) batch normalization, in place of layer normalization, around the transformer feed-forward layers enhanced the transformer universally on all experiments. Given the simple and generic nature of our architecture and its competitive performance against standard GNNs, we believe the proposed model can be used as a baseline for further improvement across graph applications employing node attention. In future work, we are interested in building upon the graph transformer along aspects such as efficient training on single large graphs, applicability to heterogeneous domains, etc., and in performing efficient graph representation learning keeping in mind the recent innovations in graph inductive biases.
Acknowledgments
XB is supported by NRF Fellowship NRFF2017-10.

References
Abbe, E. 2017. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research.

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer Normalization. NeurIPS Workshop on Deep Learning.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Belkin, M.; and Niyogi, P. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation.

Bresson, X.; and Laurent, T. 2017. Residual Gated Graph ConvNets. arXiv preprint arXiv:1711.07553.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Chami, I.; Wolf, A.; Juan, D.-C.; Sala, F.; Ravi, S.; and Ré, C. 2020. Low-Dimensional Hyperbolic Knowledge Graph Embeddings. arXiv preprint arXiv:2005.00545.

Cranmer, M. D.; Xu, R.; Battaglia, P.; and Ho, S. 2019. Learning Symbolic Physics with Graph Networks. arXiv preprint arXiv:1909.05862.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29, 3844–3852.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dwivedi, V. P.; Joshi, C. K.; Laurent, T.; Bengio, Y.; and Bresson, X. 2020. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning – Volume 70, 1263–1272. JMLR.org.

Hu, Z.; Dong, Y.; Wang, K.; and Sun, Y. 2020. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, 2704–2710.

Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; and Coleman, R. G. 2012. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling.

Joshi, C. 2020. Transformers are Graph Neural Networks. The Gradient.

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).

Li, P.; Wang, Y.; Wang, H.; and Leskovec, J. 2020. Distance Encoding – Design Provably More Powerful GNNs for Structural Representation Learning. arXiv preprint arXiv:2009.00142.

Li, Y.; Liang, X.; Hu, Z.; Chen, Y.; and Xing, E. P. 2019. Graph Transformer. URL https://openreview.net/forum?id=HJei-2RcK7.

Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.

Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; and Bronstein, M. M. 2017. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs. doi:10.1109/cvpr.2017.576.

Monti, F.; Frasca, F.; Eynard, D.; Mannion, D.; and Bronstein, M. M. 2019. Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673.

Murphy, R.; Srinivasan, B.; Rao, V.; and Ribeiro, B. 2019. Relational Pooling for Graph Representations. In International Conference on Machine Learning, 4663–4673.

Nguyen, D. Q.; Nguyen, T. D.; and Phung, D. 2019. Universal Self-Attention Network for Graph Classification. arXiv preprint arXiv:1909.11855.

Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2014–2023.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Köpf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.

Sanchez-Gonzalez, A.; Godwin, J.; Pfaff, T.; Ying, R.; Leskovec, J.; and Battaglia, P. W. 2020. Learning to simulate complex physics with graph networks. arXiv preprint arXiv:2002.09405.

Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 593–607. Springer.

Srinivasan, B.; and Ribeiro, B. 2020. On the Equivalence between Node Embeddings and Structural Graph Representations. International Conference on Learning Representations.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. International Conference on Learning Representations.

Wang, M.; Yu, L.; Zheng, D.; Gan, Q.; Gai, Y.; Ye, Z.; Li, M.; Zhou, J.; Huang, Q.; Ma, C.; Huang, Z.; Guo, Q.; Zhang, H.; Lin, H.; Zhao, J.; Li, J.; Smola, A. J.; and Zhang, Z. 2019. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. ICLR Workshop on Representation Learning on Graphs and Manifolds.

Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How Powerful are Graph Neural Networks? In International Conference on Learning Representations.

Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.

Xu, P.; Joshi, C. K.; and Bresson, X. 2019. Multi-graph transformer for free-hand sketch recognition. arXiv preprint arXiv:1912.11258.

You, J.; Ying, R.; and Leskovec, J. 2019. Position-aware graph neural networks. International Conference on Machine Learning.

Yun, S.; Jeong, M.; Kim, R.; Kang, J.; and Kim, H. J. 2019. Graph transformer networks. In Advances in Neural Information Processing Systems, 11983–11993.

Zhang, J.; Zhang, H.; Sun, L.; and Xia, C. 2020. Graph-Bert: Only Attention is Needed for Learning Graph Representations. arXiv preprint arXiv:2001.05140.

Zhou, D.; Zheng, L.; Han, J.; and He, J. 2020. A Data-Driven Graph Generative Model for Temporal Interaction Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 401–411.
A Appendix
A.1 Task based MLP layer equations
Graph prediction layer
For the graph prediction task, the final layer node features of a graph are averaged to get a $d$-dimensional graph-level feature vector $y_{\mathcal{G}}$:

$y_{\mathcal{G}} = \frac{1}{V} \sum_{i=0}^{V} h_i^{L}$,   (19)

The graph feature vector is then passed to an MLP to obtain the un-normalized prediction score $y_{\mathrm{pred}} \in \mathbb{R}^{C}$ for each class:

$y_{\mathrm{pred}} = P \, \mathrm{ReLU}( Q \, y_{\mathcal{G}} )$,   (20)

where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times d}$, and $C$ is the number of task labels (classes) to be predicted. Since we perform single-target graph regression on ZINC, $C = 1$, and the L1 loss between the predicted and ground-truth values is minimized during training.
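As a concrete illustration, here is a minimal sketch of such a readout head; the class name `GraphPredictionHead` and the mean-readout helper are our own naming, not the authors' code.

```python
import torch
import torch.nn as nn

class GraphPredictionHead(nn.Module):
    """Mean readout over nodes (Eq. 19) followed by the 2-layer MLP of Eq. 20."""
    def __init__(self, d, num_classes=1):
        super().__init__()
        self.Q = nn.Linear(d, d)
        self.P = nn.Linear(d, num_classes)

    def forward(self, h):                   # h: (num_nodes, d) final-layer node features
        y_g = h.mean(dim=0)                 # graph-level vector y_G
        return self.P(torch.relu(self.Q(y_g)))

# e.g. single-target regression as on ZINC: C = 1 with an L1 loss
# head = GraphPredictionHead(d=64, num_classes=1)
# loss = nn.L1Loss()(head(h), y_true)      # y_true: tensor of shape (1,)
```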
Node prediction layer

For the node prediction task, each node's feature vector is passed to an MLP for computing the un-normalized prediction scores $y_{i,\mathrm{pred}} \in \mathbb{R}^{C}$ for each class:

$y_{i,\mathrm{pred}} = P \, \mathrm{ReLU}( Q \, h_i^{L} )$,   (21)

where $P \in \mathbb{R}^{C \times d}$ and $Q \in \mathbb{R}^{d \times d}$.
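A corresponding sketch for the node-level head follows, again with an assumed class name (`NodePredictionHead`); a standard cross-entropy loss over the per-node scores is one natural training choice, though the exact loss weighting is not shown here.

```python
import torch
import torch.nn as nn

class NodePredictionHead(nn.Module):
    """Per-node 2-layer MLP producing un-normalized class scores y_{i,pred}."""
    def __init__(self, d, num_classes):
        super().__init__()
        self.Q = nn.Linear(d, d)
        self.P = nn.Linear(d, num_classes)

    def forward(self, h):                        # h: (num_nodes, d) final-layer features
        return self.P(torch.relu(self.Q(h)))     # (num_nodes, num_classes)

# e.g. for CLUSTER (6 classes):
# scores = NodePredictionHead(d=64, num_classes=6)(h)
# loss = nn.CrossEntropyLoss()(scores, node_labels)   # one natural training choice
```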