Self-Supervised Graph Transformer on Large-Scale Molecular Data

Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei
Tencent AI Lab, Shenzhen, China 518057

Wenbing Huang
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Junzhou Huang
University of Texas at Arlington, Arlington, TX 76019
Abstract
How to obtain informative representations of molecules is a crucial prerequisite in AI-driven drug design and discovery. Recent research abstracts molecules as graphs and employs Graph Neural Networks (GNNs) for task-specific and data-driven molecular representation learning. Nevertheless, two "dark clouds" impede the usage of GNNs in real scenarios: (1) insufficient labeled molecules for supervised training; (2) poor generalization capability to newly synthesized molecules. To address both, we propose a novel molecular representation framework, GROVER, which stands for Graph Representation frOm self-superVised mEssage passing tRansformer. With carefully designed self-supervised tasks at the node, edge and graph level, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Further, to encode such complex information, GROVER integrates Message Passing Networks with the Transformer-style architecture to deliver a class of more expressive molecular encoders. The flexibility of GROVER allows it to be trained efficiently on large-scale molecular datasets without requiring any supervision, thus being immunized against the two issues mentioned above. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules, the biggest GNN and the largest training dataset in molecular representation learning that we are aware of. We then leverage the pre-trained GROVER for downstream molecular property prediction tasks followed by task-specific fine-tuning, where we observe a huge improvement (more than 6% on average) over current state-of-the-art methods on 11 challenging benchmarks. The insight we gained is that well-designed self-supervision losses and highly expressive pre-trained models hold significant potential for boosting performance.
Preprint. Under review.

Introduction
Inspired by the remarkable achievements of deep learning in many scientific domains, such as computer vision [19, 55], natural language processing [49, 51], and social networks [3, 30], researchers are exploiting deep learning approaches to accelerate the process of drug discovery and reduce costs by facilitating the rapid identification of molecules [5]. Molecules can be naturally represented as molecular graphs, which preserve rich structural information. Therefore, supervised deep learning on graphs, especially with Graph Neural Networks (GNNs) [24, 45], has shown promising results in many tasks, such as molecular property prediction [13, 22] and virtual screening [54, 62].

Despite the fruitful progress, two "dark clouds" still impede the usage of deep learning in real scenarios: (1) insufficient labeled data for molecular tasks; (2) poor generalization capability of models over the enormous chemical space. First, deep learning models are prone to overfitting with insufficient labeled data. Worse, it is hard to acquire more labels for most molecular tasks, since obtaining molecular labels usually requires wet-lab experiments that are costly and time-consuming. Consequently, insufficient labeled data further leads to poor generalization, and the resulting models struggle to handle out-of-distribution molecules.

In Natural Language Processing (NLP), deep learning methods used to face similar challenges. Researchers in NLP adopt the pre-training strategy to force the model to learn implicit information from the large language space with low-cost self-supervised targets. Pre-trained language models such as BERT [9] and GPT [37] achieve huge success in boosting the performance of language tasks. In this vein, [56] pre-trains a BERT-style model on the sequential representation of molecules, SMILES [57]. Liu et al. [28] exploit the idea of the N-gram approach in NLP and embed a sequence of n vertices to self-predict the attributes of the vertices. However, these representations fail to explicitly encode the structural information of molecules.

To take the graph structure of molecules into account, several works aim to establish a pre-trained graph model for molecules. Hu et al. [18] investigate strategies to construct pre-training tasks on molecular graphs and propose three tasks, i.e., context prediction, node masking and graph-level prediction, for molecular pre-training. We argue that the current models and pre-training tasks restrict the power of the pre-trained representations. First, in the masking task, they treat the atom type as the label. Compared with NLP tasks, the number of atom types in molecules is much smaller than the size of a language vocabulary. Therefore, the masking task suffers from serious ambiguity in the atom type, and the model can hardly encode meaningful information, especially for highly frequent atoms. Second, the graph-level pre-training task in [18] depends on supervised labels. This limits the expressive power of the model at the graph level, since most molecules are completely unlabelled, and it also introduces the risk of negative transfer to downstream tasks.

In this paper, we improve the pre-training model for molecular graphs by introducing a novel molecular representation framework, GROVER, namely Graph Representation frOm self-superVised mEssage passing tRansformer. GROVER constructs two types of self-supervised tasks. For the node/edge-level tasks, instead of predicting the node/edge type alone,
GROVER randomly masks a local subgraph of the target node/edge and predicts this contextual property from the node/edge embedding. In this vein, GROVER can alleviate the ambiguity problem by considering both the masked target node/edge and its context. For the graph-level tasks, by incorporating domain knowledge, GROVER extracts the semantic motifs existing in molecular graphs and predicts the occurrence of these motifs for a molecule from the graph embedding. Since the semantic motifs can be obtained by a low-cost pattern matching method, GROVER can make use of any molecule to optimize the graph-level embedding.

With self-supervised tasks at the node, edge and graph levels, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Further, to encode such complex information, GROVER integrates Message Passing Networks with the Transformer-style architecture to deliver a class of highly expressive molecular encoders. The flexibility of GROVER allows it to be trained efficiently on large-scale molecular data without requiring any supervision. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules, the biggest GNN and the largest training dataset that we are aware of in molecular representation learning. We then apply the pre-trained GROVER to downstream molecular property prediction followed by task-specific fine-tuning. On the downstream tasks, GROVER achieves sizeable relative improvements over both N-Gram [28] and Hu et al. [18] on the classification tasks. Furthermore, even compared with current state-of-the-art methods, we observe a huge relative improvement of GROVER (more than 6% on average) over 11 popular benchmarks.
Related Work
Molecular Representation Learning.
To represent molecules in a vector space, traditional chemical fingerprints, such as ECFP [41], encode the neighbors of atoms in the molecule into a fixed-length vector. To improve the expressive power of chemical fingerprints, some studies [7, 10] introduce convolutional layers to learn neural fingerprints of molecules and apply them to downstream tasks such as property prediction. Following these works, [20, 60] take the SMILES representation [57] as input and use RNN-based models to produce molecular representations. Recently, many works [22, 44, 45] explore graph convolutional networks to encode molecular graphs into neural fingerprints. A line of work [43, 59] proposes to learn the aggregation weights by extending the Graph Attention Network (GAT) [52]. To better capture the interactions among atoms, [13] proposes a message passing framework, and [24, 61] extend this framework to model bond interactions. Furthermore, [29] builds a hierarchical GNN to capture multilevel interactions.
Self-supervised Learning on Graphs.
Self-supervised learning has a long history in machine learning and has achieved fruitful progress in many areas, such as computer vision [34] and language modeling [9]. Traditional graph embedding methods [14, 36] define different kinds of graph proximity, i.e., vertex proximity relationships, as the self-supervised objective to learn vertex embeddings. GraphSAGE [15] proposes a random-walk based proximity objective to train GNNs in an unsupervised fashion. [35, 48, 53] exploit the mutual information maximization scheme to construct objectives for GNNs. Recently, two works have been proposed to construct unsupervised representations for molecular graphs. Liu et al. [28] employ an N-gram model to extract the context of vertices and construct the graph representation by assembling the vertex embeddings along short walks in the graph. Hu et al. [18] investigate various strategies to pre-train GNNs and propose three self-supervised tasks to learn molecular representations. However, [18] isolates the highly correlated tasks of context prediction and node/edge type prediction, which makes it difficult to preserve the domain knowledge connecting the local structure and the node attributes. Besides, the graph-level task in [18] is constructed from supervised property labels, which is impeded by the limited number of supervised labels of molecules and has exhibited negative transfer to downstream tasks. In contrast to [18], the molecular representations derived by our method are more appropriate in terms of preserving domain knowledge, and they have demonstrated remarkable effectiveness on downstream tasks without negative transfer.
Preliminaries

GROVER is built upon the Transformer architecture [51] and GNNs, so we briefly formalize them and the supervised learning tasks in this section.
Supervised learning tasks of graphs.
Downstream tasks on molecules are often supervised learning tasks. A molecule can be abstracted as a topological graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of $|\mathcal{V}| = n$ nodes (atoms) and $\mathcal{E}$ is the set of $|\mathcal{E}| = m$ edges (bonds). $\mathcal{N}_v$ denotes the set of neighbors of node $v$ in the graph. We use $x_v$ to represent the initial features of node $v$, and $e_{uv}$ the initial features of edge $(u, v)$. For graph data, there are usually two categories of supervised learning tasks: i) node classification/regression, where each node $v$ has a label/target $y_v$, and the task is to learn to predict the labels of unlabelled nodes; ii) graph classification/regression, where a set of graphs $\{G_1, \ldots, G_N\}$ and their labels/targets $\{y_1, \ldots, y_N\}$ are given, and the task is to predict the label/target of a new graph.
Attention mechanism and the Transformer-style architectures.

The attention mechanism is the main building block of various Transformer-style models. We focus on multi-head attention, which stacks several scaled dot-product attention layers together and allows them to run in parallel. One scaled dot-product attention layer takes a set of queries, keys and values $(q, k, v)$ as inputs. It computes the dot products of the query with all keys and applies a softmax function to obtain the weights on the values. By stacking the sets of $(q, k, v)$ into matrices $(Q, K, V)$, it admits highly optimized matrix multiplication operations. Specifically, the outputs can be arranged as a matrix:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V, \qquad (1)$$
where $d$ is the dimension of $q$ and $k$. Suppose we arrange $k$ attention layers into the multi-head attention; then its output matrix can be written as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_k)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$
where $W_i^{Q}, W_i^{K}, W_i^{V}$ are the projection matrices of $\mathrm{head}_i$.
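To make Equation (1) and the multi-head form concrete, here is a minimal NumPy sketch; the toy dimensions, random inputs and projection shapes are illustrative assumptions, not a specific library API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V, cf. Equation (1)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

def multi_head_attention(Q, K, V, proj_q, proj_k, proj_v, W_O):
    """Concatenate k independent attention heads and mix them with W_O."""
    heads = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in zip(proj_q, proj_k, proj_v)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy usage: 5 tokens, model dim 8, 2 heads of dim 4 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
proj = [[rng.normal(size=(8, 4)) for _ in range(3)] for _ in range(2)]
proj_q, proj_k, proj_v = zip(*proj)
W_O = rng.normal(size=(2 * 4, 8))
print(multi_head_attention(X, X, X, proj_q, proj_k, proj_v, W_O).shape)  # (5, 8)
```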
Graph Neural Networks (GNNs).

Recently, GNNs have received a surge of interest in various domains, such as knowledge graphs, social networks and drug discovery. The key operation of GNNs can be abstracted as a message passing process, which involves message passing (also called neighborhood aggregation) between the nodes in the graph. The message passing operation iteratively updates a node $v$'s hidden state $h_v$ by aggregating the hidden states of $v$'s neighboring nodes and edges. In general, the message passing process involves several iterations, and each iteration can be further partitioned into several hops. Suppose there are $L$ iterations and iteration $l$ contains $K_l$ hops. Formally, in iteration $l$, the $k$-th hop can be formulated as
$$m_v^{(l,k)} = \mathrm{AGGREGATE}^{(l)}\!\left(\left\{\left(h_v^{(l,k-1)}, h_u^{(l,k-1)}, e_{uv}\right) \,\middle|\, u \in \mathcal{N}_v\right\}\right), \qquad (2)$$
$$h_v^{(l,k)} = \sigma\!\left(W^{(l)} m_v^{(l,k)} + b^{(l)}\right),$$
where $m_v^{(l,k)}$ is the aggregated message and $\sigma(\cdot)$ is an activation function. We adopt the convention that $h_v^{(l,0)} := h_v^{(l-1, K_{l-1})}$. There are several popular choices of $\mathrm{AGGREGATE}^{(l)}(\cdot)$, such as mean, max pooling and the graph attention mechanism [15, 52]. One iteration of message passing has one layer of trainable parameters (i.e., the parameters inside $\mathrm{AGGREGATE}^{(l)}$, $W^{(l)}$ and $b^{(l)}$), and these parameters are shared across the $K_l$ hops within iteration $l$. After $L$ iterations of message passing, the hidden states of the last hop in the last iteration are used as the node embeddings, i.e., $h_v^{(L,K_L)}, v \in \mathcal{V}$. Lastly, a READOUT operation is applied to obtain the graph-level representation:
$$h_G = \mathrm{READOUT}\!\left(\left\{h_v^{(0,K_0)}, \ldots, h_v^{(L,K_L)} \,\middle|\, v \in \mathcal{V}\right\}\right). \qquad (3)$$
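The following minimal PyTorch sketch illustrates one message passing iteration (Equation (2)) with mean aggregation and a simple sum READOUT (Equation (3)). The edge-list layout, the mean-aggregation choice and the parameter shapes are assumptions for illustration, not the exact GROVER implementation.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One iteration l of Equation (2): K hops sharing W^(l), b^(l)."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # message built from concat(h_v, h_u, e_uv), cf. Equation (2)
        self.linear = nn.Linear(2 * node_dim + edge_dim, node_dim)
        self.act = nn.ReLU()

    def forward(self, h, edge_index, edge_attr, num_hops):
        src, dst = edge_index                        # directed edge (u -> v)
        for _ in range(num_hops):                    # parameters shared across hops
            msg = torch.cat([h[dst], h[src], edge_attr], dim=-1)
            agg = torch.zeros_like(h)                # mean-aggregate over neighbors
            agg.index_add_(0, dst, self.linear(msg))
            deg = torch.bincount(dst, minlength=h.size(0)).clamp(min=1)
            h = self.act(agg / deg.unsqueeze(-1))
        return h

def readout(h):
    """Simple sum READOUT over node embeddings, cf. Equation (3)."""
    return h.sum(dim=0)

# Toy molecule: 4 atoms, 3 bonds stored as 6 directed edges.
h = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
edge_attr = torch.randn(edge_index.size(1), 8)
layer = MessagePassingLayer(16, 8)
h = layer(h, edge_index, edge_attr, num_hops=3)
print(readout(h).shape)  # torch.Size([16])
```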
GROVER Pre-training Framework
This section details our pre-training architecture together with the carefully designed self-supervision tasks. At a high level, the model is a Transformer-based neural network with tailored GNNs as the self-attention building blocks. The GNNs therein capture structural information in the graph data and allow information flow along both the node and edge message passing paths. Furthermore, we introduce a dynamic message passing scheme into the tailored GNN, which we find boosts the generalization performance of
GROVER models.
GROVER consists of two submodules: the node GNN transformer and the edge GNN transformer. To ease the exposition, we explain only the node GNN transformer (abbreviated as node GTransformer) in the sequel; the edge GNN transformer has a similar structure. The overall architecture of GROVER is deferred to Appendix A.
Figure 1: Overview of GTransformer. dyMPN blocks extract queries, keys and values from the input graph for a multi-head attention layer, followed by Aggregate2Node/Aggregate2Edge, feed-forward and normalization layers in the output module, with a long-range residual connection from the input.

GNN Transformer (GTransformer). The key component of the node GTransformer is our proposed graph multi-head attention component, an attention block tailored to structured input data. A vanilla attention block, such as that in Equation (1), requires vectorized inputs. However, graphs are structured data that are not naturally vectorized. We therefore use our designed dynamic GNNs (dyMPN, see the following sections for details) to extract vectors as queries, keys and values from the nodes of the graph, and then feed them into the attention block. This strategy is simple yet powerful, because it utilizes the highly expressive dyMPN to better model the structural information in molecular data.
Figure 2: Overview of the self-supervised task construction of GROVER: contextual property prediction (node/edge-level task) and graph-level motif prediction.

The high expressiveness of
GTransformer can be attributed to its bi-level information extraction framework. It is well known that the message passing process captures local structural information of the graph; therefore, using the outputs of dyMPN as queries, keys and values gets the local subgraph structure involved, constituting the first level of information extraction. Meanwhile, the Transformer encoder can be viewed as a variant of the GAT [21, 52] on a fully connected graph constructed over $\mathcal{V}$. Hence, applying the Transformer encoder on top of these dyMPN queries, keys and values makes it possible to extract global relations between nodes, which enables the second level of information extraction. This bi-level information extraction largely enhances the representational power of GROVER models.

The second feature of GTransformer is that we use a single long-range residual connection, while the Transformer encoder uses several short-range residual connections within each attention layer. The long-range residual connection conveys the initial node/edge feature information directly to the last layers of GTransformer. Two benefits can be obtained from this single long-range residual connection: i) like ordinary residual connections, it improves the training process by alleviating the vanishing gradient problem [17]; ii) compared with the various short-range residual connections in the Transformer encoder, our long-range residual connection can alleviate the over-smoothing [33, 42] problem in the message passing process.
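The following PyTorch sketch shows how these pieces compose in the node GTransformer: a dyMPN stage whose number of hops is sampled at random (as described in the next paragraph), graph multi-head attention over the resulting node vectors, and a single long-range residual connection from the initial node features. The toy one-hop GNN layer, dense adjacency and module sizes are illustrative assumptions rather than the exact GROVER implementation.

```python
import random
import torch
import torch.nn as nn

class NodeGTransformerSketch(nn.Module):
    """dyMPN -> graph multi-head attention -> long-range residual (node view)."""
    def __init__(self, in_dim, hidden, heads=4, hop_range=(3, 9)):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden)
        self.mpn = nn.Linear(2 * hidden, hidden)          # toy one-hop GNN layer
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden + in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.hop_range = hop_range

    def message_passing(self, h, adj, num_hops):
        for _ in range(num_hops):                         # parameters shared per hop
            neigh = adj @ h / adj.sum(-1, keepdim=True).clamp(min=1)
            h = torch.relu(self.mpn(torch.cat([h, neigh], dim=-1)))
        return h

    def forward(self, x, adj):
        # dyMPN: the receptive field is randomized at every forward pass.
        num_hops = random.randint(*self.hop_range)
        h = self.message_passing(self.embed(x), adj, num_hops)
        # Queries, keys and values all come from the dyMPN outputs.
        h_attn, _ = self.attn(h.unsqueeze(0), h.unsqueeze(0), h.unsqueeze(0))
        h_attn = self.norm(h_attn.squeeze(0))
        # Single long-range residual connection from the initial node features.
        return self.ffn(torch.cat([h_attn, x], dim=-1))

x = torch.randn(6, 32)                       # 6 atoms with 32-d input features
adj = (torch.rand(6, 6) > 0.5).float()       # toy adjacency matrix
print(NodeGTransformerSketch(32, 64)(x, adj).shape)   # torch.Size([6, 64])
```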
Dynamic Message Passing Network (dyMPN). The general message passing process (see Equation (2)) has two hyperparameters: the number of iterations/layers $L$ and the number of hops $K_l$, $l = 1, \ldots, L$, within each iteration. The number of hops is closely related to the size of the receptive field of the graph convolution operation, which affects the generalizability of the message passing model. Given a fixed number of layers $L$, we find that a pre-specified number of hops may not work well across different kinds of datasets. Instead of pre-specifying the number of hops, we develop a randomized strategy for choosing it during training: at each epoch, we draw $K_l$ from some random distribution for layer $l$. Two choices of randomization work well: i) $K_l \sim U(a, b)$, drawn from a uniform distribution; ii) $K_l$ drawn from a truncated normal distribution, which is derived from a normally distributed random variable by bounding it from both below and above. This randomized message passing scheme yields a random receptive field for each node in the graph convolution operation. We call the induced network the Dynamic Message Passing network (abbreviated as dyMPN). Extensive experiments demonstrate that dyMPN enjoys better generalization performance than vanilla message passing networks without the randomization strategy.

Self-supervised Task Construction

The success of a pre-training model crucially depends on the design of its self-supervision tasks. Different from Hu et al. [18], to avoid negative transfer on downstream tasks, we do not use supervised labels in pre-training and instead propose new self-supervision tasks on two levels: contextual property prediction and graph-level motif prediction, which are sketched in Figure 2.
Contextual Property Prediction.
A good self-supervision task at the node level should satisfy the following properties: i) the prediction target is reliable and easy to obtain; ii) the prediction target should reflect contextual information of the node/edge. Guided by these criteria, we present tasks on both nodes and edges. Both try to predict context-aware properties of the target node/edge within some local subgraph. What kind of context-aware properties should one use? We define recurrent statistical properties of a local subgraph in the following two-step manner (taking the node subgraph in Figure 3 as the example): i) Given a target node (e.g., the carbon atom in red), we extract its local subgraph as its $k$-hop neighboring nodes and edges. When $k = 1$, it involves the nitrogen atom, the oxygen atom, the double bond and the single bond. ii) We extract statistical properties of this subgraph; specifically, we count the number of occurrences of (node, edge) pairs around the center node, which yields node-edge-counts terms. We then list all the node-edge-counts terms in alphabetical order, which constitutes the final property, e.g., C_N-DOUBLE1_O-SINGLE1 in the example. This step can be viewed as a clustering process: the subgraphs are clustered according to the extracted properties, and one property corresponds to a cluster of subgraphs with the same statistical property.

Figure 3: Examples of constructing contextual properties.

With the context-aware property defined, the contextual property prediction task works as follows: given a molecular graph, after feeding it into the GROVER encoder, we obtain embeddings of its atoms and bonds. Suppose we randomly choose an atom $v$ with embedding $h_v$. Instead of predicting the atom type of $v$, we would like $h_v$ to encode contextual information around node $v$. We achieve this by feeding $h_v$ into a very simple model (such as a fully connected layer) and using the output to predict the contextual property of node $v$. This is a multi-class prediction problem (one class corresponds to one contextual property).
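A minimal RDKit sketch of the $k = 1$ node contextual property construction described above is shown next; the exact key format of GROVER's property vocabulary may differ, so the string layout here is an assumption.

```python
from collections import Counter
from rdkit import Chem

def node_contextual_property(mol, atom_idx):
    """Count (neighbor atom, bond type) pairs around a target atom (k = 1) and
    serialize them in alphabetical order, e.g. 'C_N-DOUBLE1_O-SINGLE1'."""
    atom = mol.GetAtomWithIdx(atom_idx)
    counts = Counter()
    for bond in atom.GetBonds():                      # 1-hop neighborhood
        nbr = bond.GetOtherAtom(atom)
        counts[(nbr.GetSymbol(), bond.GetBondType().name)] += 1
    terms = [f"{sym}-{btype}{cnt}" for (sym, btype), cnt in sorted(counts.items())]
    return "_".join([atom.GetSymbol()] + terms)

mol = Chem.MolFromSmiles("CC(=O)N")                   # acetamide as a toy molecule
for idx in range(mol.GetNumAtoms()):
    print(idx, node_contextual_property(mol, idx))
# The carbonyl carbon (index 1) yields 'C_C-SINGLE1_N-SINGLE1_O-DOUBLE1'.
```

During pre-training, a key like this is looked up in the extracted property dictionary and predicted by a small head (e.g., one fully connected layer) on top of the node embedding.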
Graph-level Motif Prediction.

The graph-level self-supervision task also needs reliable and cheap labels. Motifs are recurrent subgraphs among the input graph data and are prevalent in molecular graphs. One important class of motifs in molecules is functional groups, which encode rich domain knowledge of molecules and can easily be detected by professional software such as RDKit [26]. Formally, the motif prediction task is formulated as a multi-label classification problem, where each motif corresponds to one label. Suppose we consider the presence of $p$ motifs $\{m_1, \ldots, m_p\}$ in the molecular data. For one specific molecule (abstracted as a graph $G$), we use professional software to detect whether each motif shows up in $G$, and then use the result as the target of the motif prediction task.
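A sketch of building the multi-label motif target with RDKit follows; the fr_* fragment descriptors used below are an illustrative stand-in for the 85 functional groups mentioned in the pre-training configuration, not necessarily the exact set used by GROVER.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# RDKit descriptors whose name starts with 'fr_' count occurrences of a
# functional-group pattern; we binarize them into a multi-label motif target.
FRAGMENT_FNS = [(name, fn) for name, fn in Descriptors.descList
                if name.startswith("fr_")]

def motif_labels(smiles):
    """Return a 0/1 vector: does functional group i occur in the molecule?"""
    mol = Chem.MolFromSmiles(smiles)
    return [1 if fn(mol) > 0 else 0 for _, fn in FRAGMENT_FNS]

labels = motif_labels("CC(=O)Oc1ccccc1C(=O)O")    # aspirin
print(len(FRAGMENT_FNS), sum(labels))             # number of motifs, motifs present
```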
Fine-tuning for Downstream Tasks

After pre-training GROVER models on massive unlabelled data with the designed self-supervised tasks, one obtains a high-quality molecular encoder that outputs embeddings for both nodes and edges. These embeddings can be used for downstream tasks through a fine-tuning process. Various downstream tasks could benefit from the pre-trained GROVER models. They can be roughly divided into three categories: node-level tasks, e.g., node classification; edge-level tasks, e.g., link prediction; and graph-level tasks, such as property prediction for molecules. Take a graph-level task for instance. Given the node/edge embeddings output by the GROVER encoder, we first apply some READOUT function (Equation (3)) to obtain the graph embedding, and then use an additional multi-layer perceptron (MLP) to predict the property of the molecular graph. One uses part of the supervised data to fine-tune both the encoder and the additional parameters (READOUT and MLP). After several epochs of fine-tuning, one can expect a well-performing model for property prediction.
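A minimal sketch of such a fine-tuning head is given below, using a simple mean READOUT and an MLP; the self-attentive READOUT actually used for property prediction is described in Appendix A.1, so the mean pooling and all sizes here are placeholders.

```python
import torch
import torch.nn as nn

class PropertyPredictionHead(nn.Module):
    """READOUT (mean pooling here) + MLP on top of pre-trained node embeddings."""
    def __init__(self, emb_dim, hidden, num_tasks):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_tasks))

    def forward(self, node_embeddings):
        graph_emb = node_embeddings.mean(dim=0)   # READOUT, cf. Equation (3)
        return self.mlp(graph_emb)

# During fine-tuning, both the pre-trained encoder and this head are updated.
node_emb = torch.randn(9, 1200)                   # e.g. 9 atoms, 1200-d embeddings
head = PropertyPredictionHead(1200, 256, num_tasks=1)
logit = head(node_emb)
loss = nn.BCEWithLogitsLoss()(logit, torch.tensor([1.0]))
loss.backward()
```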
Experiments

Pre-training Data Collection.
We collect 11 million (M) unlabelled molecules sampled from the ZINC15 [46] and ChEMBL [11] datasets to pre-train GROVER. We randomly hold out 10% of the unlabelled molecules as a validation set for model selection.
Fine-tuning Tasks and Datasets.
To thoroughly evaluate
GROVER on downstream tasks, we conduct experiments on 11 benchmark datasets from MoleculeNet [58] with various targets, such as quantum mechanics, physical chemistry, biophysics and physiology. Details are deferred to Appendix B.1. (All datasets can be downloaded from http://moleculenet.ai/datasets-1.) In machine learning tasks, random splitting is a common way to split a dataset. However, for molecular property prediction, scaffold splitting [2] offers a more challenging yet realistic way of splitting. We adopt the scaffold splitting method with a train/validation/test ratio of 8:1:1. For each dataset, as suggested by [58], we perform three independent runs on three random-seeded scaffold splits and report the means and standard deviations.

[Table 1: Overall results on the 11 downstream benchmarks. Classification (ROC-AUC, higher is better): BBBP, SIDER, ClinTox, BACE, Tox21, ToxCast. Regression (RMSE/MAE, lower is better): FreeSolv, ESOL, Lipo, QM7, QM8. Rows: GraphConv [23], Weave [22], SchNet [44], MPNN [13], DMPNN [61], MGCN [29], AttentiveFP [59], N-GRAM [28], Hu et al. [18], GROVER base and GROVER large.]

Baselines.
We comprehensively evaluate
GROVER against 10 popular baselines from MoleculeNet [58] and several state-of-the-art (SOTA) approaches. Among them, TF_Robust [39] is a DNN-based multitask framework taking molecular fingerprints as input. GraphConv [23], Weave [22] and SchNet [44] are three graph convolutional models. MPNN [13] and its variants DMPNN [61] and MGCN [29] consider edge features during message passing. AttentiveFP [59] is an extension of the graph attention network. Specifically, to demonstrate the power of our self-supervised strategy, we also compare
GROVER with two pre-trained models: N-Gram [28] and Hu et al. [18]. We only report classification results for [18], since its original implementation does not admit regression tasks without non-trivial modifications.
Experimental Configurations.
We use the Adam optimizer for both pre-training and fine-tuning. The Noam learning rate scheduler [9] is adopted to adjust the learning rate during training. The specific configurations are as follows.
GROVER Pre-training.
For the contextual property prediction task, we set the context radius $k = 1$ to extract the contextual property dictionary, obtaining 2518 and 2686 distinct node and edge contextual properties as the node and edge labels, respectively. For each molecular graph, we randomly mask 15% of the node and edge labels for prediction. For the graph-level motif prediction task, we use RDKit [26] to extract 85 functional groups as the motifs of molecules and represent the motif label as a binary vector indicating the presence of each motif. To evaluate the effect of model size, we pre-train two GROVER models,
GROVER base and
GROVER large, with different hidden sizes while keeping all other hyper-parameters the same. Specifically, GROVER base contains ∼48M parameters and GROVER large contains ∼100M parameters.

Fine-tuning Procedure.
We use the validation loss to select the best model. For each training process, we train models for 100 epochs. For the hyper-parameters, we perform a random search on the validation set for each dataset and report the best results. More pre-training and fine-tuning details are deferred to Appendix C and Appendix D. (The N-Gram result on ToxCast is not presented since it is too time-consuming to finish in time.)

Results on Downstream Tasks

Table 1 documents the overall results of all models on all datasets, where cells in gray indicate the previous SOTA and cells in blue indicate the best result achieved by
GROVER. Table 1 offers the following observations. (1) GROVER models consistently achieve the best performance on all datasets, with a large margin on almost all of them. The overall relative improvement over the previous SOTA is more than 6% on average across all datasets, covering both classification and regression tasks. This remarkable boost validates the effectiveness of the pre-training model GROVER for molecular property prediction tasks. (2) Specifically, GROVER base outperforms the SOTA on 8/11 datasets, while GROVER large surpasses the SOTA on all datasets. This improvement can be attributed to the high expressive power of the large model, which can encode more information from the self-supervised tasks. (3) On the small dataset FreeSolv with only 642 labeled molecules, GROVER gains a substantial relative improvement over the existing SOTA. This confirms the strength of GROVER, since it can significantly help tasks with very little label information.
Ablation Studies on the GROVER Framework

How Useful is the Self-supervised Pre-training?
To investigate the contribution of the self-supervision strategies, we compare the performances of pre-trained
GROVER and
GROVER without pre-training on the classification datasets, both following the same hyper-parameter setting. We report the comparison on the classification tasks in Table 2; it is not surprising that the performance of GROVER becomes worse without pre-training. The self-supervised pre-training leads to a performance boost with an average AUC increase of 3.8% over the model without pre-training. This confirms that the self-supervised pre-training strategy can learn implicit domain knowledge and enhance the prediction performance on downstream tasks. Notably, the datasets with fewer samples, such as SIDER, ClinTox and BACE, gain a larger improvement from the self-supervised pre-training. This re-confirms the effectiveness of self-supervised pre-training for tasks with insufficient labeled molecules.
Effect of the Proposed dyMPN and
GTransformer. In this section, we use a toy dataset with 600K unlabelled molecules to pre-train GROVER with 38M parameters. To justify the rationale behind the proposed GTransformer and dyMPN, we implement two variants: GROVER w/o dyMPN and GROVER w/o GTrans. GROVER w/o dyMPN fixes the number of message passing hops $K_l$, while GROVER w/o GTrans replaces the GTransformer with the original Transformer. Figure 4 displays the training and validation loss curves of the three models. First, GROVER w/o GTrans is the worst in both training and validation. This implies that trivially combining a GNN with the Transformer does not enhance the expressive power of the GNN. Second, dyMPN slightly harms the training loss by introducing randomness into the training process, but the validation loss improves. Hence, dyMPN brings better generalization ability to GROVER by randomizing the receptive field of every message passing step. Overall, with the new Transformer-style architecture and the dynamic message passing mechanism, GROVER enjoys high expressive power and captures the structural information in molecules well, thus helping with various downstream molecular prediction tasks.
[Table 2: Comparison between GROVER with and without pre-training on the classification datasets (columns: GROVER, No Pretrain, Abs. Imp.; rows: datasets with their sizes, e.g., BBBP (2039)).]
Figure 4: The training and validation loss of GROVER and its variants (GROVER, GROVER w/o dyMPN, GROVER w/o GTrans).
Conclusion

We explore the potential of large-scale pre-trained GNN models in this work. With well-designed self-supervised tasks and a highly expressive architecture, our model GROVER can learn rich implicit information from enormous unlabelled graphs. More importantly, by fine-tuning GROVER, we achieve huge improvements (more than 6% on average, measured as relative improvement [50]) over the current SOTA on 11 challenging molecular property prediction benchmarks, which, to our knowledge, is the first verification of the power of self-supervised pre-training approaches in the graph learning area. Despite these successes, there is still room to improve GNN pre-training in the following aspects.

More self-supervised tasks.
Well-designed self-supervision tasks are the key to the success of GNN pre-training. Beyond the tasks presented in this paper, other meaningful tasks could also boost pre-training performance, such as distance-preserving tasks and tasks that involve 3D input information.
More downstream tasks.
It is desirable to explore a larger category of downstream tasks, such as node prediction and link prediction on different kinds of graphs. Different categories of downstream tasks might prefer different pre-training strategies and self-supervision tasks, which is worthwhile to study in the future.
Wider and deeper models.
Larger models are capable of capturing richer semantic information for more complicated tasks, as verified by several studies in the NLP area. It is also interesting to employ even larger models and more data than GROVER. However, one might need to alleviate potential problems when training super-large GNN models, such as gradient vanishing and over-smoothing.
References

[1] Tox21 challenge, 2017. https://tripod.nih.gov/tox21/challenge/.
[2] Guy W Bemis and Mark A Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996.
[3] Tian Bian, Xi Xiao, Tingyang Xu, Peilin Zhao, Wenbing Huang, Yu Rong, and Junzhou Huang. Rumor detection on social media with bi-directional graph convolutional networks. In AAAI 2020, 2020.
[4] L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009.
[5] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. Drug Discovery Today, 23(6):1241–1250, 2018.
[6] Zhengdao Chen, Xiang Li, and Joan Bruna. Supervised community detection with line graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
[7] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of Chemical Information and Modeling, 57(8):1757–1772, 2017.
[8] John S Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[11] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1):D1100–D1107, 2012.
[12] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell Chemical Biology, 23(10):1294–1301, 2016.
[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272. JMLR.org, 2017.
[14] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.
[19] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[20] Stanisław Jastrzębski, Damian Leśniak, and Wojciech Marian Czarnecki. Learning to SMILE(S). arXiv preprint arXiv:1602.06289, 2016.
[21] Chaitanya Joshi. Transformers are graph neural networks, 2020. https://graphdeeplearning.github.io/post/transformers-are-gnns/.
[22] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
[23] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
[24] Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. In International Conference on Learning Representations (ICLR), 2020.
[25] Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The SIDER database of drugs and side effects. Nucleic Acids Research, 44(D1):D1075–D1079, 2015.
[26] Greg Landrum et al. RDKit: Open-source cheminformatics. 2006.
[27] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. Semi-supervised graph classification: A hierarchical graph perspective. In The World Wide Web Conference, pages 972–982. ACM, 2019.
[28] Shengchao Liu, Mehmet F Demirel, and Yingyu Liang. N-gram graph: Simple unsupervised representation for graphs, with applications to molecules. In Advances in Neural Information Processing Systems, pages 8464–8476, 2019.
[29] Chengqiang Lu, Qi Liu, Chao Wang, Zhenya Huang, Peize Lin, and Lixin He. Molecular property prediction: A multilevel quantum interactions modeling perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1052–1060, 2019.
[30] Jing Ma, Wei Gao, and Kam-Fai Wong. Detect rumors on Twitter by promoting information campaigns with generative adversarial learning. In The World Wide Web Conference, pages 3049–3055, 2019.
[31] Ines Filipa Martins, Ana L Teixeira, Luis Pinheiro, and Andre O Falcao. A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling, 52(6):1686–1697, 2012.
[32] David L Mobley and J Peter Guthrie. FreeSolv: A database of experimental and calculated hydration free energies, with input files. Journal of Computer-Aided Molecular Design, 28(7):711–720, 2014.
[33] Kenta Oono and Taiji Suzuki. On asymptotic behaviors of graph CNNs from dynamical systems perspective. arXiv preprint arXiv:1905.10947, 2019.
[34] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018.
[35] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pages 259–270, 2020.
[36] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
[37] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
[38] Raghunathan Ramakrishnan, Mia Hartmann, Enrico Tapavicza, and O Anatole Von Lilienfeld. Electronic spectra from TDDFT and machine learning in chemical space. The Journal of Chemical Physics, 143(8):084111, 2015.
[39] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.
[40] Ann M Richard, Richard S Judson, Keith A Houck, Christopher M Grulke, Patra Volarath, Inthirany Thillainadarajah, Chihae Yang, James Rathman, Matthew T Martin, John F Wambaugh, et al. ToxCast chemical landscape: Paving the road to 21st century toxicology. Chemical Research in Toxicology, 29(8):1225–1251, 2016.
[41] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
[42] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2020.
[43] Seongok Ryu, Jaechang Lim, Seung Hwan Hong, and Woo Youn Kim. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint arXiv:1805.10988, 2018.
[44] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pages 991–1001, 2017.
[45] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8(1):1–8, 2017.
[46] Teague Sterling and John J Irwin. ZINC 15: Ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015.
[47] Govindan Subramanian, Bharath Ramsundar, Vijay Pande, and Rajiah Aldrin Denny. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. Journal of Chemical Information and Modeling, 56(10):1936–1949, 2016.
[48] Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations, 2019.
[49] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[50] Leo Törnqvist, Pentti Vartia, and Yrjö O Vartia. How should relative changes be measured? The American Statistician, 39(1):43–46, 1985.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[52] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[53] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[54] Izhar Wallach, Michael Dzamba, and Abraham Heifets. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015.
[55] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017.
[56] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 429–436, 2019.
[57] David Weininger, Arthur Weininger, and Joseph L Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences, 29(2):97–101, 1989.
[58] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
[59] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry, 2019.
[60] Zheng Xu, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery. In BCB, 2017.
[61] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019.
[62] Liangzhen Zheng, Jingrong Fan, and Yuguang Mu. OnionNet: A multiple-layer intermolecular-contact-based convolutional neural network for protein–ligand binding affinity prediction. ACS Omega, 4(14):15956–15965, 2019.
Appendix

A The Overall Architecture of GROVER
Figure 5: Overview of the whole
GROVER architecture with both node-view
GTransformer (in pink background) and edge-view
GTransformer (in green background).

Figure 5 illustrates the complete architecture of
the GROVER model, which contains a node-view GTransformer (in pink background) and an edge-view GTransformer (in green background). A brief presentation of the node-view GTransformer has been given in the main text, and the edge-view GTransformer has a similar structure. Here we elaborate on more details of the GROVER model and its associated four sets of output embeddings.

As shown in Figure 5, the node-view
GTransformer contains a node dyMPN, which maintains hidden states of nodes $h_v, v \in \mathcal{V}$ and performs message passing over nodes. Meanwhile, the edge-view GTransformer contains an edge dyMPN, which maintains hidden states of edges $h_{vw}, h_{wv}, (v, w) \in \mathcal{E}$ and conducts message passing over edges. The edge message passing is viewed as ordinary message passing over the line graph of the original graph, where the line graph describes the adjacency of edges in the original graph and enables an appropriate way to define message passing over edges [6]. Note that edge hidden states have directions, i.e., $h_{vw}$ is not identical to $h_{wv}$ in general. Then, after the multi-head attention, we denote the transformed node and edge hidden states by $\bar{h}_v$ and $\bar{h}_{vw}$, respectively.

Given the above setup, we can explain why GROVER outputs four sets of embeddings in Figure 5. Let us focus on the information flow in the pink panel of Figure 5 first. Here the node hidden states $\bar{h}_v$ encounter the two components Aggregate2Node and Aggregate2Edge, which aggregate the node hidden states into node messages and edge messages, respectively. Specifically, the Aggregate2Node and Aggregate2Edge components in the node-view GTransformer are formulated as follows:
$$m_v^{\text{node-embedding-from-node-states}} = \sum_{u \in \mathcal{N}_v} \mathrm{CONCAT}(\bar{h}_u, x_u), \qquad (4)$$
$$m_{vw}^{\text{edge-embedding-from-node-states}} = \sum_{u \in \mathcal{N}_v \setminus w} \mathrm{CONCAT}(\bar{h}_u, e_{uv}). \qquad (5)$$
Then the node-view GTransformer transforms the node messages $m_v^{\text{node-embedding-from-node-states}}$ and edge messages $m_{vw}^{\text{edge-embedding-from-node-states}}$ through Pointwise Feed Forward layers [51] and Add&LayerNorm to produce the final node embeddings and edge embeddings, respectively.

Similarly, for the information flow in the green panel, the edge hidden states $\bar{h}_{vw}$ encounter the two components Aggregate2Node and Aggregate2Edge as well. Their operations are formulated as follows:
$$m_v^{\text{node-embedding-from-edge-states}} = \sum_{u \in \mathcal{N}_v} \mathrm{CONCAT}(\bar{h}_{uv}, x_u), \qquad (6)$$
$$m_{vw}^{\text{edge-embedding-from-edge-states}} = \sum_{u \in \mathcal{N}_v \setminus w} \mathrm{CONCAT}(\bar{h}_{uv}, e_{uv}). \qquad (7)$$
Then, the edge-view GTransformer transforms the node messages and edge messages through Pointwise Feed Forward layers and Add&LayerNorm to produce the final node embeddings and edge embeddings, respectively.

Figure 6: Examples of constructing contextual properties for edges.

In summary, the
GROVER model outputs four sets of embeddings from two information flows. Thenode information flow (node
GTransformer) maintains node hidden states and finally transforms them into a set of node embeddings and a set of edge embeddings, while the edge information flow (edge GTransformer) maintains edge hidden states and also transforms them into node and edge embeddings. The four sets of embeddings reflect structural information extracted from two distinct views, and they can flexibly support downstream tasks, such as node-level prediction, edge-level prediction and graph-level prediction (via an extra READOUT component).
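The following sketch illustrates the Aggregate2Node and Aggregate2Edge operations of Equations (4) and (5) for the node-view flow; Equations (6) and (7) follow the same pattern with the edge hidden states in place of the node hidden states. The edge-list layout and the quadratic loop are simplifications for clarity, not the actual implementation.

```python
import torch

def aggregate_from_node_states(h_bar, x, e, edge_index):
    """Sketch of Equations (4)-(5): node and edge messages built from the
    transformed node states of the node-view GTransformer.

    h_bar: (N, d) transformed node states; x: (N, f) initial node features;
    e: (E, g) initial edge features; edge_index: (2, E) directed edges (u -> v).
    """
    src, dst = edge_index
    n, num_edges = h_bar.size(0), edge_index.size(1)

    # Eq. (4): m_v = sum_{u in N(v)} CONCAT(h_bar_u, x_u)
    node_msg = torch.zeros(n, h_bar.size(1) + x.size(1))
    node_msg.index_add_(0, dst, torch.cat([h_bar[src], x[src]], dim=-1))

    # Eq. (5): m_vw = sum_{u in N(v) \ {w}} CONCAT(h_bar_u, e_uv), per directed edge (v, w)
    edge_msg = []
    for k in range(num_edges):
        v, w = src[k].item(), dst[k].item()
        terms = [torch.cat([h_bar[src[j]], e[j]]) for j in range(num_edges)
                 if dst[j].item() == v and src[j].item() != w]
        edge_msg.append(torch.stack(terms).sum(dim=0) if terms
                        else torch.zeros(h_bar.size(1) + e.size(1)))
    return node_msg, torch.stack(edge_msg)

# Toy graph: 4 atoms, 3 bonds stored as 6 directed edges.
h_bar, x, e = torch.randn(4, 16), torch.randn(4, 8), torch.randn(6, 4)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
node_msg, edge_msg = aggregate_from_node_states(h_bar, x, e, edge_index)
print(node_msg.shape, edge_msg.shape)   # torch.Size([4, 24]) torch.Size([6, 20])
```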
A.1 Fine-tuning Model for Molecular Property Prediction
As explained above, given a molecular graph $G_i$ and the corresponding label $y_i$, GROVER produces two sets of node embeddings, $H_{i,\text{node-view}}$ and $H_{i,\text{edge-view}}$, from the node-view GTransformer and the edge-view GTransformer, respectively. We feed these two node embeddings into a shared self-attentive READOUT function to generate the graph-level embedding [27, 52]:
$$S = \mathrm{softmax}\!\left(W_2 \tanh\!\left(W_1 H^{\top}\right)\right), \qquad g = \mathrm{Flatten}(SH), \qquad (8)$$
where $W_1 \in \mathbb{R}^{d_{\text{attn\_hidden}} \times d_{\text{hidden\_size}}}$ and $W_2 \in \mathbb{R}^{d_{\text{attn\_out}} \times d_{\text{attn\_hidden}}}$ are two weight matrices and $g$ is the final graph embedding. After the READOUT, we employ two distinct MLPs to generate two predictions, $p_{i,\text{node-view}}$ and $p_{i,\text{edge-view}}$. Besides the supervised loss $\mathcal{L}(p_{i,\text{node-view}}, y_i) + \mathcal{L}(p_{i,\text{edge-view}}, y_i)$, the final loss function also includes a disagreement loss [27], $\mathcal{L}_{\text{dis}} = \|p_{i,\text{node-view}} - p_{i,\text{edge-view}}\|$, to encourage consensus between the two predictions.
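A sketch of the self-attentive READOUT of Equation (8) together with the disagreement loss is given below; the default sizes loosely follow Table 6, while the L2 form of the norm and the sigmoid applied to the predictions are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentiveReadout(nn.Module):
    """S = softmax(W2 tanh(W1 H^T)), g = Flatten(S H), cf. Equation (8)."""
    def __init__(self, hidden_size, attn_hidden=128, attn_out=4):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, attn_hidden, bias=False)
        self.w2 = nn.Linear(attn_hidden, attn_out, bias=False)

    def forward(self, H):                              # H: (num_nodes, hidden_size)
        S = torch.softmax(self.w2(torch.tanh(self.w1(H))).T, dim=-1)  # (attn_out, N)
        return (S @ H).flatten()                       # graph embedding g

readout = SelfAttentiveReadout(hidden_size=64)
mlp_node, mlp_edge = nn.Linear(4 * 64, 1), nn.Linear(4 * 64, 1)

H_node_view, H_edge_view = torch.randn(7, 64), torch.randn(7, 64)
p_node, p_edge = mlp_node(readout(H_node_view)), mlp_edge(readout(H_edge_view))
y = torch.tensor([1.0])

bce = nn.BCEWithLogitsLoss()
supervised = bce(p_node, y) + bce(p_edge, y)
disagreement = torch.norm(torch.sigmoid(p_node) - torch.sigmoid(p_edge))  # assumed L2
loss = supervised + 0.1 * disagreement                # 0.1 corresponds to dist_coff in Table 6
```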
A.2 Constructing Contextual Properties for Edges

In Section 4.2 we describe an example of constructing contextual properties for nodes; here we present an instance of constructing edge contextual properties to complete the picture. Similar to the node case, we define recurrent statistical properties of a local subgraph in a two-step manner. Take the graphs in Figure 6 for instance and consider the double chemical bond in red in the left graph. Step I: we extract its local subgraph as its $k$-hop neighboring nodes and edges. When $k = 1$, it involves the nitrogen atom, the carbon atom and the two single bonds. Step II: we extract statistical properties of this subgraph; specifically, we count the number of occurrences of (node, edge) pairs around the center edge, which yields node-edge-counts terms. We then list all the node-edge-counts terms in alphabetical order, which yields the final property, e.g., DOUBLE_C-SINGLE1_N-SINGLE1 in the example. Note that there are two graphs and two double bonds in red in Figure 6; since their subgraphs have the same statistical property, the resulting contextual properties of the two bonds are the same. From a different point of view, this step can be viewed as a clustering process: the subgraphs are clustered according to the extracted properties, and one property corresponds to a cluster of subgraphs with the same statistical property.
B Details about Experimental Setup
B.1 Dataset Description
Table 3: Dataset information
Type            Category            Dataset        #Tasks  #Molecules  Metric
Classification  Biophysics          BBBP           1       2039        ROC-AUC
                Physiology          SIDER          27      1427        ROC-AUC
                                    ClinTox        2       1478        ROC-AUC
                                    BACE           1       1513        ROC-AUC
                                    Tox21          12      7831        ROC-AUC
                                    ToxCast        617     8575        ROC-AUC
Regression      Physical chemistry  FreeSolv       1       642         RMSE
                                    ESOL           1       1128        RMSE
                                    Lipophilicity  1       4200        RMSE
                Quantum mechanics   QM7            1       6830        MAE
                                    QM8            12      21786       MAE
Table 3 summarizes the information of the benchmark datasets, including task type, dataset size and evaluation metric. The details of each dataset are listed below [58]:
Molecular Classification Datasets.

- BBBP [31] involves records of whether a compound carries the permeability property of penetrating the blood-brain barrier.
- SIDER [25] records marketed drugs along with their adverse drug reactions, also known as the Side Effect Resource.
- ClinTox [12] compares drugs approved by the FDA and drugs eliminated due to toxicity during clinical trials.
- BACE [47] records compounds which could act as inhibitors of human β-secretase 1 (BACE-1) collected over the past few years.
- Tox21 [1] is a public database measuring the toxicity of compounds, which was used in the 2014 Tox21 Data Challenge.
- ToxCast [40] contains multiple toxicity labels obtained by running high-throughput screening tests on thousands of chemicals.
Molecular Regression Datasets.

- QM7 [4] is a subset of GDB-13 that records computed properties (e.g., HOMO/LUMO, atomization energy) of stable and synthetically accessible organic molecules. It covers various molecular structures such as triple bonds, cycles, amide and epoxy groups.
- QM8 [38] contains computer-generated quantum mechanical properties, e.g., electronic spectra and excited state energies of small molecules.
- ESOL [8] is a small dataset documenting the solubility of compounds.
- Lipophilicity [11] is selected from the ChEMBL database. Lipophilicity is an important property that affects molecular membrane permeability and solubility; the data are obtained via octanol/water distribution coefficient experiments.
- FreeSolv [32] is selected from the Free Solvation Database, which contains hydration free energies of small molecules in water from both experiments and alchemical free energy calculations.
Dataset Splitting.
We apply scaffold splitting [2] for all tasks on all datasets. It partitions molecules with distinct two-dimensional structural frameworks into different subsets. This is a more challenging but practical setting, since the test molecules can be structurally different from the training set. We use scaffold splitting to construct the train/validation/test sets.
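A minimal sketch of scaffold splitting based on RDKit's Bemis-Murcko scaffolds is shown below; the group-assignment heuristic here is an assumption and may differ from the MoleculeNet/DeepChem implementation.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold, then fill train/valid/test
    with whole scaffold groups (largest groups first)."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smi, includeChirality=False)
        groups[scaffold].append(i)

    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "C1CCCCC1", "CC(=O)O"]
print(scaffold_split(smiles))
```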
B.2 Feature Extraction Processes for Molecules
The feature extraction contains two parts. 1) Node/edge feature extraction: we use RDKit to extract the atom and bond features used as the input of dyMPN; Table 4 and Table 5 show the atom and bond features used in GROVER. 2) Molecule-level feature extraction: following the same protocol as [58, 61], we extract 200 additional molecule-level features with RDKit for each molecule and concatenate them to the output of the self-attentive READOUT before the MLP for the final prediction.

Table 4: Atom features.
feature          size  description
atom type        100   type of atom (e.g., C, N, O), by atomic number
formal charge    5     integer electronic charge assigned to atom
number of bonds  6     number of bonds the atom is involved in
chirality        5     number of bonded hydrogen atoms
number of H      5     number of bonded hydrogen atoms
atomic mass      1     mass of the atom, divided by 100
aromaticity      1     whether this atom is part of an aromatic system
hybridization    5     sp, sp2, sp3, sp3d, or sp3d2
Table 5: Bond features.
feature     size  description
bond type   4     single, double, triple, or aromatic
stereo      6     none, any, E/Z or cis/trans
in ring     1     whether the bond is part of a ring
conjugated  1     whether the bond is conjugated
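The sketch below reads the raw atom and bond attributes behind Tables 4 and 5 with RDKit; the one-hot encoding into the listed sizes is omitted, and the accessors shown are standard RDKit calls rather than GROVER's exact featurization code.

```python
from rdkit import Chem

def atom_features(atom):
    """Raw values behind the atom features in Table 4 (one-hot encoding omitted)."""
    return {
        "atom type": atom.GetAtomicNum(),
        "formal charge": atom.GetFormalCharge(),
        "number of bonds": atom.GetDegree(),
        "chirality": str(atom.GetChiralTag()),
        "number of H": atom.GetTotalNumHs(),
        "atomic mass": atom.GetMass() / 100.0,
        "aromaticity": atom.GetIsAromatic(),
        "hybridization": str(atom.GetHybridization()),
    }

def bond_features(bond):
    """Raw values behind the bond features in Table 5."""
    return {
        "bond type": str(bond.GetBondType()),
        "stereo": str(bond.GetStereo()),
        "in ring": bond.IsInRing(),
        "conjugated": bond.GetIsConjugated(),
    }

mol = Chem.MolFromSmiles("c1ccccc1C(=O)O")   # benzoic acid
print(atom_features(mol.GetAtomWithIdx(0)))
print(bond_features(mol.GetBondWithIdx(0)))
```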
C Implementation and Pre-training Details
We use PyTorch to implement GROVER and Horovod for distributed training. We use the Adam optimizer with L2 weight decay and train the model for 500 epochs. The learning rate warms up over the first two epochs and then decreases exponentially. We use PReLU [16] as the activation function and a dropout rate of 0.1 for all layers. Both GROVER base and GROVER large use 4 attention heads. We set the number of iterations to $L = 1$ and sample $K_l \sim \psi(\mu = 6, \sigma = 1, a = 3, b = 9)$ for the embedded dyMPN in GROVER, where $\psi(\mu, \sigma, a, b)$ is a truncated normal distribution with truncation range $(a, b)$. The hidden sizes of GROVER base and GROVER large are 800 and 1200, respectively. We use 250 Nvidia V100 GPUs to pre-train
GROVER base and
GROVER large . Pre-training
GROVER base and
GROVER large took 2.5 days and 4 days respectively. For the models depictedin Figure 4 in the ablation study, we use 32 Nvidia V100 GPUs to pre-train the
GROVER model and its variants. We will release the pre-trained models and the training code in the future.
D Fine-tuning Details
For each task, we try 300 different hyper-parameter combinations via random search to find the best results. Table 6 lists all the hyper-parameters of the fine-tuning model. All fine-tuning tasks are run on a single P40 GPU.

Table 6: The fine-tuning hyper-parameters.
hyper-parameter  description                                                                                           range
batch_size       the input batch size                                                                                  32
init_lr          initial learning rate ratio of the Noam scheduler; the real initial learning rate is max_lr / init_lr   10
max_lr           maximum learning rate of the Noam scheduler                                                           –
final_lr         final learning rate ratio of the Noam scheduler; the real final learning rate is max_lr / final_lr      –
dropout          dropout ratio                                                                                         0, 0.05, 0.1, 0.2
attn_hidden      hidden size of the self-attentive readout                                                             128
attn_out         number of output heads of the self-attentive readout                                                  4, 8
dist_coff        coefficient of the disagreement loss                                                                  0.05, 0.1, 0.15
bond_drop_rate   drop-edge ratio [42]                                                                                  0, 0.2, 0.4, 0.6
ffn_num_layer    number of MLP layers                                                                                  2, 3
ffn_hidden_size  hidden size of the MLP layers                                                                         5, 7, 13

E Additional Experimental Results
Table 7 depicts the additional results of the comparison of the performance of pre-trained
GROVER and
GROVER without pre-training on regression tasks.Table 7: Comparison between
GROVER with and without pre-training on regression tasks
GROVER
No Pre-training Absolute ImprovementRMSE FreeSolv0.013