Self-Supervised Graph Transformer on Large-Scale Molecular Data

Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei
Tencent AI Lab, Shenzhen, China 518057

Wenbing Huang
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Junzhou Huang
University of Texas at Arlington, Arlington, TX 76019
Abstract
How to obtain informative representations of molecules is a crucial prerequisite in AI-driven drug design and discovery. Recent research abstracts molecules as graphs and employs Graph Neural Networks (GNNs) for task-specific and data-driven molecular representation learning. Nevertheless, two "dark clouds" impede the usage of GNNs in real scenarios: (1) insufficient labeled molecules for supervised training; (2) poor generalization capability to newly synthesized molecules. To address both, we propose a novel molecular representation framework, GROVER, which stands for Graph Representation frOm self-superVised mEssage passing tRansformer. With carefully designed self-supervised tasks at the node, edge and graph level, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Further, to encode such complex information, GROVER integrates Message Passing Networks with the Transformer-style architecture to deliver a class of more expressive molecular encoders. The flexibility of GROVER allows it to be trained efficiently on large-scale molecular datasets without requiring any supervision, thus being immunized against the two issues mentioned above. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules, the biggest GNN and the largest training dataset in molecular representation learning that we are aware of. We then leverage the pre-trained GROVER for downstream molecular property prediction tasks followed by task-specific fine-tuning, where we observe a huge improvement (more than 6% on average) over current state-of-the-art methods on 11 challenging benchmarks. The insight we gained is that well-designed self-supervision losses and highly expressive pre-trained models hold significant potential for boosting performance.
Preprint. Under review.

Introduction
Inspired by the remarkable achievements of deep learning in many scientific domains, such as computer vision [19, 55], natural language processing [49, 51], and social networks [3, 30], researchers are exploiting deep learning approaches to accelerate the process of drug discovery and reduce costs by facilitating the rapid identification of molecules [5]. Molecules can be naturally represented as molecular graphs, which preserve rich structural information. Therefore, supervised deep learning on graphs, especially with Graph Neural Networks (GNNs) [24, 45], has shown promising results in many tasks, such as molecular property prediction [13, 22] and virtual screening [54, 62].

Despite the fruitful progress, two "dark clouds" still impede the usage of deep learning in real scenarios: (1) insufficient labeled data for molecular tasks; (2) poor generalization capability of models over the enormous chemical space. First, deep learning models are prone to overfitting with insufficient labeled data. Worse, it is hard to acquire more labels for most molecular tasks, since obtaining molecular labels usually requires wet-lab experiments that are costly and time-consuming. Consequently, insufficient labeled data further leads to poor generalization, and the resulting models struggle to handle out-of-distribution molecules.

In Natural Language Processing (NLP), deep learning methods used to face similar challenges. Researchers in NLP adopt the pre-training strategy to force the model to learn implicit information from the large language space with low-cost self-supervised targets. Pre-trained language models such as BERT [9] and GPT [37] achieve huge success in boosting the performance of language tasks. In this vein, [56] pre-trains a BERT-style model on the sequential representation of molecules, SMILES [57]. Liu et al. [28] exploit the idea of the N-gram approach in NLP and embed a sequence of n vertices to self-predict the attributes of the vertices. However, these representations fail to explicitly encode the structural information of molecules.

To take the graph structure of molecules into account, several works aim to establish a pre-trained graph model for molecules. Hu et al. [18] investigate strategies to construct pre-training tasks on molecular graphs and propose three tasks, i.e., context prediction, node masking and graph-level prediction, for molecular pre-training. We argue that the current models and pre-training tasks restrict the power of the pre-trained representations. First, in the masking task, they treat the atom type as the label. Compared with NLP tasks, the number of atom types in molecules is much smaller than the size of a language vocabulary. Therefore, the masking task suffers from serious ambiguity in the atom type, and the model can hardly encode meaningful information, especially for highly frequent atoms. Second, the graph-level pre-training task in [18] depends on supervised labels. This limits the expressive power of the model at the graph level, since most molecules are completely unlabelled, and it also introduces the risk of negative transfer to downstream tasks.

In this paper, we improve the pre-training model for molecular graphs by introducing a novel molecular representation framework, GROVER, namely Graph Representation frOm self-superVised mEssage passing tRansformer. GROVER constructs two types of self-supervised tasks. For the node/edge-level tasks, instead of predicting the node/edge type alone,
GROVER randomly masks a local subgraph of the target node/edge and predicts this contextual property from the node/edge embedding. In this vein, GROVER can alleviate the ambiguity problem by considering both the masked target node/edge and its context. For the graph-level tasks, by incorporating domain knowledge, GROVER extracts the semantic motifs existing in molecular graphs and predicts the occurrence of these motifs for a molecule from the graph embedding. Since the semantic motifs can be obtained by a low-cost pattern matching method, GROVER can make use of any molecule to optimize the graph-level embedding.

With self-supervised tasks at the node, edge and graph levels, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Further, to encode such complex information, GROVER integrates Message Passing Networks with the Transformer-style architecture to deliver a class of highly expressive molecular encoders. The flexibility of GROVER allows it to be trained efficiently on large-scale molecular data without requiring any supervision. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules, the biggest GNN and the largest training dataset that we are aware of in molecular representation learning. We then apply the pre-trained GROVER to downstream molecular property prediction followed by task-specific fine-tuning. On the downstream tasks, GROVER achieves sizeable relative improvements over both N-Gram [28] and Hu et al. [18] on the classification tasks. Furthermore, even compared with current state-of-the-art methods, we observe a huge relative improvement of GROVER (more than 6% on average) over 11 popular benchmarks.
Related Work
Molecular Representation Learning.
To represent molecules in a vector space, traditional chemical fingerprints, such as ECFP [41], encode the neighbors of atoms in the molecule into a fixed-length vector. To improve the expressive power of chemical fingerprints, some studies [7, 10] introduce convolutional layers to learn neural fingerprints of molecules and apply them to downstream tasks such as property prediction. Following these works, [20, 60] take the SMILES representation [57] as input and use RNN-based models to produce molecular representations. Recently, many works [22, 44, 45] explore graph convolutional networks to encode molecular graphs into neural fingerprints. A line of work [43, 59] proposes to learn the aggregation weights by extending the Graph Attention Network (GAT) [52]. To better capture the interactions among atoms, [13] proposes a message passing framework, and [24, 61] extend this framework to model bond interactions. Furthermore, [29] builds a hierarchical GNN to capture multilevel interactions.
Self-supervised Learning on Graphs.
Self-supervised learning has a long history in machine learning and has achieved fruitful progress in many areas, such as computer vision [34] and language modeling [9]. Traditional graph embedding methods [14, 36] define different kinds of graph proximity, i.e., vertex proximity relationships, as the self-supervised objective to learn vertex embeddings. GraphSAGE [15] proposes a random-walk based proximity objective to train GNNs in an unsupervised fashion. [35, 48, 53] exploit the mutual information maximization scheme to construct objectives for GNNs. Recently, two works have been proposed to construct unsupervised representations for molecular graphs. Liu et al. [28] employ an N-gram model to extract the context of vertices and construct the graph representation by assembling the vertex embeddings along short walks in the graph. Hu et al. [18] investigate various strategies to pre-train GNNs and propose three self-supervised tasks to learn molecular representations. However, [18] isolates the highly correlated tasks of context prediction and node/edge type prediction, which makes it difficult to preserve the domain knowledge connecting the local structure and the node attributes. Besides, the graph-level task in [18] is constructed from supervised property labels, which is impeded by the limited number of supervised labels of molecules and has exhibited negative transfer to downstream tasks. In contrast to [18], the molecular representations derived by our method are more appropriate in terms of preserving domain knowledge, and they have demonstrated remarkable effectiveness on downstream tasks without negative transfer.
Preliminaries

GROVER is built upon the Transformer architecture [51] and GNNs, so we briefly formalize them and the supervised learning tasks in this section.
Supervised learning tasks of graphs.
Downstream tasks on molecules are often supervised learning tasks. A molecule can be abstracted as a topological graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of $|\mathcal{V}| = n$ nodes (atoms) and $\mathcal{E}$ is the set of $|\mathcal{E}| = m$ edges (bonds). $\mathcal{N}_v$ denotes the set of neighbors of node $v$ in the graph. We use $x_v$ to represent the initial features of node $v$, and $e_{uv}$ the initial features of edge $(u, v)$. For graph data, there are usually two categories of supervised learning tasks: i) node classification/regression, where each node $v$ has a label/target $y_v$, and the task is to learn to predict the labels of unlabelled nodes; ii) graph classification/regression, where a set of graphs $\{G_1, \ldots, G_N\}$ and their labels/targets $\{y_1, \ldots, y_N\}$ are given, and the task is to predict the label/target of a new graph.
Attention mechanism and the Transformer-style architectures.

The attention mechanism is the main building block of various Transformer-style models. We focus on multi-head attention, which stacks several scaled dot-product attention layers together and allows them to run in parallel. One scaled dot-product attention layer takes a set of queries, keys and values $(q, k, v)$ as inputs. It computes the dot products of the query with all keys and applies a softmax function to obtain the weights on the values. By stacking the sets of $(q, k, v)$ into matrices $(Q, K, V)$, it admits highly optimized matrix multiplication operations. Specifically, the outputs can be arranged as a matrix:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V, \qquad (1)$$
where $d$ is the dimension of $q$ and $k$. Suppose we arrange $k$ attention layers into the multi-head attention; then its output matrix can be written as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_k)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$
where $W_i^{Q}, W_i^{K}, W_i^{V}$ are the projection matrices of $\mathrm{head}_i$.
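To make Equation (1) and the multi-head form concrete, here is a minimal NumPy sketch; the toy dimensions, random inputs and projection shapes are illustrative assumptions, not a specific library API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V, cf. Equation (1)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

def multi_head_attention(Q, K, V, proj_q, proj_k, proj_v, W_O):
    """Concatenate k independent attention heads and mix them with W_O."""
    heads = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in zip(proj_q, proj_k, proj_v)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy usage: 5 tokens, model dim 8, 2 heads of dim 4 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
proj = [[rng.normal(size=(8, 4)) for _ in range(3)] for _ in range(2)]
proj_q, proj_k, proj_v = zip(*proj)
W_O = rng.normal(size=(2 * 4, 8))
print(multi_head_attention(X, X, X, proj_q, proj_k, proj_v, W_O).shape)  # (5, 8)
```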
Graph Neural Networks (GNNs).

Recently, GNNs have received a surge of interest in various domains, such as knowledge graphs, social networks and drug discovery. The key operation of GNNs can be abstracted as a message passing process, which involves message passing (also called neighborhood aggregation) between the nodes in the graph. The message passing operation iteratively updates a node $v$'s hidden state $h_v$ by aggregating the hidden states of $v$'s neighboring nodes and edges. In general, the message passing process involves several iterations, and each iteration can be further partitioned into several hops. Suppose there are $L$ iterations and iteration $l$ contains $K_l$ hops. Formally, in iteration $l$, the $k$-th hop can be formulated as
$$m_v^{(l,k)} = \mathrm{AGGREGATE}^{(l)}\!\left(\left\{\left(h_v^{(l,k-1)}, h_u^{(l,k-1)}, e_{uv}\right) \,\middle|\, u \in \mathcal{N}_v\right\}\right), \qquad (2)$$
$$h_v^{(l,k)} = \sigma\!\left(W^{(l)} m_v^{(l,k)} + b^{(l)}\right),$$
where $m_v^{(l,k)}$ is the aggregated message and $\sigma(\cdot)$ is an activation function. We adopt the convention that $h_v^{(l,0)} := h_v^{(l-1, K_{l-1})}$. There are several popular choices of $\mathrm{AGGREGATE}^{(l)}(\cdot)$, such as mean, max pooling and the graph attention mechanism [15, 52]. One iteration of message passing has one layer of trainable parameters (i.e., the parameters inside $\mathrm{AGGREGATE}^{(l)}$, $W^{(l)}$ and $b^{(l)}$), and these parameters are shared across the $K_l$ hops within iteration $l$. After $L$ iterations of message passing, the hidden states of the last hop in the last iteration are used as the node embeddings, i.e., $h_v^{(L,K_L)}, v \in \mathcal{V}$. Lastly, a READOUT operation is applied to obtain the graph-level representation:
$$h_G = \mathrm{READOUT}\!\left(\left\{h_v^{(0,K_0)}, \ldots, h_v^{(L,K_L)} \,\middle|\, v \in \mathcal{V}\right\}\right). \qquad (3)$$
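The following minimal PyTorch sketch illustrates one message passing iteration (Equation (2)) with mean aggregation and a simple sum READOUT (Equation (3)). The edge-list layout, the mean-aggregation choice and the parameter shapes are assumptions for illustration, not the exact GROVER implementation.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One iteration l of Equation (2): K hops sharing W^(l), b^(l)."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # message built from concat(h_v, h_u, e_uv), cf. Equation (2)
        self.linear = nn.Linear(2 * node_dim + edge_dim, node_dim)
        self.act = nn.ReLU()

    def forward(self, h, edge_index, edge_attr, num_hops):
        src, dst = edge_index                        # directed edge (u -> v)
        for _ in range(num_hops):                    # parameters shared across hops
            msg = torch.cat([h[dst], h[src], edge_attr], dim=-1)
            agg = torch.zeros_like(h)                # mean-aggregate over neighbors
            agg.index_add_(0, dst, self.linear(msg))
            deg = torch.bincount(dst, minlength=h.size(0)).clamp(min=1)
            h = self.act(agg / deg.unsqueeze(-1))
        return h

def readout(h):
    """Simple sum READOUT over node embeddings, cf. Equation (3)."""
    return h.sum(dim=0)

# Toy molecule: 4 atoms, 3 bonds stored as 6 directed edges.
h = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
edge_attr = torch.randn(edge_index.size(1), 8)
layer = MessagePassingLayer(16, 8)
h = layer(h, edge_index, edge_attr, num_hops=3)
print(readout(h).shape)  # torch.Size([16])
```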
GROVER Pre-training Framework
This section details our pre-training architecture together with the carefully designed self-supervision tasks. At a high level, the model is a Transformer-based neural network with tailored GNNs as the self-attention building blocks. The GNNs therein capture structural information in the graph data and allow information flow along both the node and edge message passing paths. Furthermore, we introduce a dynamic message passing scheme into the tailored GNN, which we find boosts the generalization performance of
GROVER models.
GROVER consists of two submodules: the node GNN transformer and the edge GNN transformer. To ease the exposition, we explain only the node GNN transformer (abbreviated as node GTransformer) in the sequel; the edge GNN transformer has a similar structure. The overall architecture of GROVER is deferred to Appendix A.
Figure 1: Overview of GTransformer. dyMPN blocks extract queries, keys and values from the input graph for a multi-head attention layer, followed by Aggregate2Node/Aggregate2Edge, feed-forward and normalization layers in the output module, with a long-range residual connection from the input.

GNN Transformer (GTransformer). The key component of the node GTransformer is our proposed graph multi-head attention component, an attention block tailored to structured input data. A vanilla attention block, such as that in Equation (1), requires vectorized inputs. However, graphs are structured data that are not naturally vectorized. We therefore use our designed dynamic GNNs (dyMPN, see the following sections for details) to extract vectors as queries, keys and values from the nodes of the graph, and then feed them into the attention block. This strategy is simple yet powerful, because it utilizes the highly expressive dyMPN to better model the structural information in molecular data.
Figure 2: Overview of the self-supervised task construction of GROVER: contextual property prediction (node/edge-level task) and graph-level motif prediction.

The high expressiveness of
GTransformer can be attributed to its bi-level information extraction framework. It is well known that the message passing process captures local structural information of the graph; therefore, using the outputs of dyMPN as queries, keys and values gets the local subgraph structure involved, constituting the first level of information extraction. Meanwhile, the Transformer encoder can be viewed as a variant of the GAT [21, 52] on a fully connected graph constructed over $\mathcal{V}$. Hence, applying the Transformer encoder on top of these dyMPN queries, keys and values makes it possible to extract global relations between nodes, which enables the second level of information extraction. This bi-level information extraction largely enhances the representational power of GROVER models.

The second feature of GTransformer is that we use a single long-range residual connection, while the Transformer encoder uses several short-range residual connections within each attention layer. The long-range residual connection conveys the initial node/edge feature information directly to the last layers of GTransformer. Two benefits can be obtained from this single long-range residual connection: i) like ordinary residual connections, it improves the training process by alleviating the vanishing gradient problem [17]; ii) compared with the various short-range residual connections in the Transformer encoder, our long-range residual connection can alleviate the over-smoothing [33, 42] problem in the message passing process.
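The following PyTorch sketch shows how these pieces compose in the node GTransformer: a dyMPN stage whose number of hops is sampled at random (as described in the next paragraph), graph multi-head attention over the resulting node vectors, and a single long-range residual connection from the initial node features. The toy one-hop GNN layer, dense adjacency and module sizes are illustrative assumptions rather than the exact GROVER implementation.

```python
import random
import torch
import torch.nn as nn

class NodeGTransformerSketch(nn.Module):
    """dyMPN -> graph multi-head attention -> long-range residual (node view)."""
    def __init__(self, in_dim, hidden, heads=4, hop_range=(3, 9)):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden)
        self.mpn = nn.Linear(2 * hidden, hidden)          # toy one-hop GNN layer
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden + in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.hop_range = hop_range

    def message_passing(self, h, adj, num_hops):
        for _ in range(num_hops):                         # parameters shared per hop
            neigh = adj @ h / adj.sum(-1, keepdim=True).clamp(min=1)
            h = torch.relu(self.mpn(torch.cat([h, neigh], dim=-1)))
        return h

    def forward(self, x, adj):
        # dyMPN: the receptive field is randomized at every forward pass.
        num_hops = random.randint(*self.hop_range)
        h = self.message_passing(self.embed(x), adj, num_hops)
        # Queries, keys and values all come from the dyMPN outputs.
        h_attn, _ = self.attn(h.unsqueeze(0), h.unsqueeze(0), h.unsqueeze(0))
        h_attn = self.norm(h_attn.squeeze(0))
        # Single long-range residual connection from the initial node features.
        return self.ffn(torch.cat([h_attn, x], dim=-1))

x = torch.randn(6, 32)                       # 6 atoms with 32-d input features
adj = (torch.rand(6, 6) > 0.5).float()       # toy adjacency matrix
print(NodeGTransformerSketch(32, 64)(x, adj).shape)   # torch.Size([6, 64])
```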
Dynamic Message Passing Network (dyMPN). The general message passing process (see Equation (2)) has two hyperparameters: the number of iterations/layers $L$ and the number of hops $K_l$, $l = 1, \ldots, L$, within each iteration. The number of hops is closely related to the size of the receptive field of the graph convolution operation, which affects the generalizability of the message passing model. Given a fixed number of layers $L$, we find that a pre-specified number of hops may not work well across different kinds of datasets. Instead of pre-specifying the number of hops, we develop a randomized strategy for choosing it during training: at each epoch, we draw $K_l$ from some random distribution for layer $l$. Two choices of randomization work well: i) $K_l \sim U(a, b)$, drawn from a uniform distribution; ii) $K_l$ drawn from a truncated normal distribution, which is derived from a normally distributed random variable by bounding it from both below and above. This randomized message passing scheme yields a random receptive field for each node in the graph convolution operation. We call the induced network the Dynamic Message Passing network (abbreviated as dyMPN). Extensive experiments demonstrate that dyMPN enjoys better generalization performance than vanilla message passing networks without the randomization strategy.

Self-supervised Task Construction

The success of a pre-training model crucially depends on the design of its self-supervision tasks. Different from Hu et al. [18], to avoid negative transfer on downstream tasks, we do not use supervised labels in pre-training and instead propose new self-supervision tasks on two levels: contextual property prediction and graph-level motif prediction, which are sketched in Figure 2.
Contextual Property Prediction.
A good self-supervision task at the node level should satisfy the following properties: i) the prediction target is reliable and easy to obtain; ii) the prediction target should reflect contextual information of the node/edge. Guided by these criteria, we present tasks on both nodes and edges. Both try to predict context-aware properties of the target node/edge within some local subgraph. What kind of context-aware properties should one use? We define recurrent statistical properties of a local subgraph in the following two-step manner (taking the node subgraph in Figure 3 as the example): i) Given a target node (e.g., the carbon atom in red), we extract its local subgraph as its $k$-hop neighboring nodes and edges. When $k = 1$, it involves the nitrogen atom, the oxygen atom, the double bond and the single bond. ii) We extract statistical properties of this subgraph; specifically, we count the number of occurrences of (node, edge) pairs around the center node, which yields node-edge-counts terms. We then list all the node-edge-counts terms in alphabetical order, which constitutes the final property, e.g., C_N-DOUBLE1_O-SINGLE1 in the example. This step can be viewed as a clustering process: the subgraphs are clustered according to the extracted properties, and one property corresponds to a cluster of subgraphs with the same statistical property.

Figure 3: Examples of constructing contextual properties.

With the context-aware property defined, the contextual property prediction task works as follows: given a molecular graph, after feeding it into the GROVER encoder, we obtain embeddings of its atoms and bonds. Suppose we randomly choose an atom $v$ with embedding $h_v$. Instead of predicting the atom type of $v$, we would like $h_v$ to encode contextual information around node $v$. We achieve this by feeding $h_v$ into a very simple model (such as a fully connected layer) and using the output to predict the contextual property of node $v$. This is a multi-class prediction problem (one class corresponds to one contextual property).
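A minimal RDKit sketch of the $k = 1$ node contextual property construction described above is shown next; the exact key format of GROVER's property vocabulary may differ, so the string layout here is an assumption.

```python
from collections import Counter
from rdkit import Chem

def node_contextual_property(mol, atom_idx):
    """Count (neighbor atom, bond type) pairs around a target atom (k = 1) and
    serialize them in alphabetical order, e.g. 'C_N-DOUBLE1_O-SINGLE1'."""
    atom = mol.GetAtomWithIdx(atom_idx)
    counts = Counter()
    for bond in atom.GetBonds():                      # 1-hop neighborhood
        nbr = bond.GetOtherAtom(atom)
        counts[(nbr.GetSymbol(), bond.GetBondType().name)] += 1
    terms = [f"{sym}-{btype}{cnt}" for (sym, btype), cnt in sorted(counts.items())]
    return "_".join([atom.GetSymbol()] + terms)

mol = Chem.MolFromSmiles("CC(=O)N")                   # acetamide as a toy molecule
for idx in range(mol.GetNumAtoms()):
    print(idx, node_contextual_property(mol, idx))
# The carbonyl carbon (index 1) yields 'C_C-SINGLE1_N-SINGLE1_O-DOUBLE1'.
```

During pre-training, a key like this is looked up in the extracted property dictionary and predicted by a small head (e.g., one fully connected layer) on top of the node embedding.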
Graph-level Motif Prediction.

The graph-level self-supervision task also needs reliable and cheap labels. Motifs are recurrent subgraphs among the input graph data and are prevalent in molecular graphs. One important class of motifs in molecules is functional groups, which encode rich domain knowledge of molecules and can easily be detected by professional software such as RDKit [26]. Formally, the motif prediction task is formulated as a multi-label classification problem, where each motif corresponds to one label. Suppose we consider the presence of $p$ motifs $\{m_1, \ldots, m_p\}$ in the molecular data. For one specific molecule (abstracted as a graph $G$), we use professional software to detect whether each motif shows up in $G$, and then use the result as the target of the motif prediction task.
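A sketch of building the multi-label motif target with RDKit follows; the fr_* fragment descriptors used below are an illustrative stand-in for the 85 functional groups mentioned in the pre-training configuration, not necessarily the exact set used by GROVER.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# RDKit descriptors whose name starts with 'fr_' count occurrences of a
# functional-group pattern; we binarize them into a multi-label motif target.
FRAGMENT_FNS = [(name, fn) for name, fn in Descriptors.descList
                if name.startswith("fr_")]

def motif_labels(smiles):
    """Return a 0/1 vector: does functional group i occur in the molecule?"""
    mol = Chem.MolFromSmiles(smiles)
    return [1 if fn(mol) > 0 else 0 for _, fn in FRAGMENT_FNS]

labels = motif_labels("CC(=O)Oc1ccccc1C(=O)O")    # aspirin
print(len(FRAGMENT_FNS), sum(labels))             # number of motifs, motifs present
```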
Fine-tuning for Downstream Tasks

After pre-training GROVER models on massive unlabelled data with the designed self-supervised tasks, one obtains a high-quality molecular encoder that outputs embeddings for both nodes and edges. These embeddings can be used for downstream tasks through a fine-tuning process. Various downstream tasks could benefit from the pre-trained GROVER models. They can be roughly divided into three categories: node-level tasks, e.g., node classification; edge-level tasks, e.g., link prediction; and graph-level tasks, such as property prediction for molecules. Take a graph-level task for instance. Given the node/edge embeddings output by the GROVER encoder, we first apply some READOUT function (Equation (3)) to obtain the graph embedding, and then use an additional multi-layer perceptron (MLP) to predict the property of the molecular graph. One uses part of the supervised data to fine-tune both the encoder and the additional parameters (READOUT and MLP). After several epochs of fine-tuning, one can expect a well-performing model for property prediction.
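A minimal sketch of such a fine-tuning head is given below, using a simple mean READOUT and an MLP; the self-attentive READOUT actually used for property prediction is described in Appendix A.1, so the mean pooling and all sizes here are placeholders.

```python
import torch
import torch.nn as nn

class PropertyPredictionHead(nn.Module):
    """READOUT (mean pooling here) + MLP on top of pre-trained node embeddings."""
    def __init__(self, emb_dim, hidden, num_tasks):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_tasks))

    def forward(self, node_embeddings):
        graph_emb = node_embeddings.mean(dim=0)   # READOUT, cf. Equation (3)
        return self.mlp(graph_emb)

# During fine-tuning, both the pre-trained encoder and this head are updated.
node_emb = torch.randn(9, 1200)                   # e.g. 9 atoms, 1200-d embeddings
head = PropertyPredictionHead(1200, 256, num_tasks=1)
logit = head(node_emb)
loss = nn.BCEWithLogitsLoss()(logit, torch.tensor([1.0]))
loss.backward()
```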
Experiments

Pre-training Data Collection.
We collect 11 million (M) unlabelled molecules sampled from the ZINC15 [46] and ChEMBL [11] datasets to pre-train GROVER. We randomly hold out 10% of the unlabelled molecules as a validation set for model selection.
Fine-tuning Tasks and Datasets.
To thoroughly evaluate
GROVER on downstream tasks, we conduct experiments on 11 benchmark datasets from MoleculeNet [58] with various targets, such as quantum mechanics, physical chemistry, biophysics and physiology. Details are deferred to Appendix B.1. (All datasets can be downloaded from http://moleculenet.ai/datasets-1.) In machine learning tasks, random splitting is a common way to split a dataset. However, for molecular property prediction, scaffold splitting [2] offers a more challenging yet realistic way of splitting. We adopt the scaffold splitting method with a train/validation/test ratio of 8:1:1. For each dataset, as suggested by [58], we perform three independent runs on three random-seeded scaffold splits and report the means and standard deviations.

[Table 1: Overall results on the 11 downstream benchmarks. Classification (ROC-AUC, higher is better): BBBP, SIDER, ClinTox, BACE, Tox21, ToxCast. Regression (RMSE/MAE, lower is better): FreeSolv, ESOL, Lipo, QM7, QM8. Rows: GraphConv [23], Weave [22], SchNet [44], MPNN [13], DMPNN [61], MGCN [29], AttentiveFP [59], N-GRAM [28], Hu et al. [18], GROVER base and GROVER large.]

Baselines.
We comprehensively evaluate
GROVER against 10 popular baselines from MoleculeNet [58] and several state-of-the-art (SOTA) approaches. Among them, TF_Robust [39] is a DNN-based multitask framework taking molecular fingerprints as input. GraphConv [23], Weave [22] and SchNet [44] are three graph convolutional models. MPNN [13] and its variants DMPNN [61] and MGCN [29] consider edge features during message passing. AttentiveFP [59] is an extension of the graph attention network. Specifically, to demonstrate the power of our self-supervised strategy, we also compare
GROVER with two pre-trained models: N-Gram [28] and Hu et al. [18]. We only report classification results for [18], since its original implementation does not admit regression tasks without non-trivial modifications.
Experimental Configurations.
We use the Adam optimizer for both pre-training and fine-tuning. The Noam learning rate scheduler [9] is adopted to adjust the learning rate during training. The specific configurations are as follows.
GROVER Pre-training.
For the contextual property prediction task, we set the context radius $k = 1$ to extract the contextual property dictionary, obtaining 2518 and 2686 distinct node and edge contextual properties as the node and edge labels, respectively. For each molecular graph, we randomly mask 15% of the node and edge labels for prediction. For the graph-level motif prediction task, we use RDKit [26] to extract 85 functional groups as the motifs of molecules and represent the motif label as a binary vector indicating the presence of each motif. To evaluate the effect of model size, we pre-train two GROVER models,
GROVER base and
GROVER large, with different hidden sizes while keeping all other hyper-parameters the same. Specifically, GROVER base contains ∼48M parameters and GROVER large contains ∼100M parameters.

Fine-tuning Procedure.
We use the validation loss to select the best model. For each training process, we train models for 100 epochs. For the hyper-parameters, we perform a random search on the validation set for each dataset and report the best results. More pre-training and fine-tuning details are deferred to Appendix C and Appendix D. (The N-Gram result on ToxCast is not presented since it is too time-consuming to finish in time.)

Results on Downstream Tasks

Table 1 documents the overall results of all models on all datasets, where cells in gray indicate the previous SOTA and cells in blue indicate the best result achieved by
GROVER. Table 1 offers the following observations. (1) GROVER models consistently achieve the best performance on all datasets, with a large margin on almost all of them. The overall relative improvement over the previous SOTA is more than 6% on average across all datasets, covering both classification and regression tasks. This remarkable boost validates the effectiveness of the pre-training model GROVER for molecular property prediction tasks. (2) Specifically, GROVER base outperforms the SOTA on 8/11 datasets, while GROVER large surpasses the SOTA on all datasets. This improvement can be attributed to the high expressive power of the large model, which can encode more information from the self-supervised tasks. (3) On the small dataset FreeSolv with only 642 labeled molecules, GROVER gains a substantial relative improvement over the existing SOTA. This confirms the strength of GROVER, since it can significantly help tasks with very little label information.
Ablation Studies on the GROVER Framework

How Useful is the Self-supervised Pre-training?
To investigate the contribution of the self-supervision strategies, we compare the performances of pre-trained
GROVER and
GROVER without pre-training on the classification datasets, both following the same hyper-parameter setting. We report the comparison on the classification tasks in Table 2; it is not surprising that the performance of GROVER becomes worse without pre-training. The self-supervised pre-training leads to a performance boost with an average AUC increase of 3.8% over the model without pre-training. This confirms that the self-supervised pre-training strategy can learn implicit domain knowledge and enhance the prediction performance on downstream tasks. Notably, the datasets with fewer samples, such as SIDER, ClinTox and BACE, gain a larger improvement from the self-supervised pre-training. This re-confirms the effectiveness of self-supervised pre-training for tasks with insufficient labeled molecules.
Effect of the Proposed dyMPN and
GTransformer. In this section, we use a toy dataset with 600K unlabelled molecules to pre-train GROVER with 38M parameters. To justify the rationale behind the proposed GTransformer and dyMPN, we implement two variants: GROVER w/o dyMPN and GROVER w/o GTrans. GROVER w/o dyMPN fixes the number of message passing hops $K_l$, while GROVER w/o GTrans replaces the GTransformer with the original Transformer. Figure 4 displays the training and validation loss curves of the three models. First, GROVER w/o GTrans is the worst in both training and validation. This implies that trivially combining a GNN with the Transformer does not enhance the expressive power of the GNN. Second, dyMPN slightly harms the training loss by introducing randomness into the training process, but the validation loss improves. Hence, dyMPN brings better generalization ability to GROVER by randomizing the receptive field of every message passing step. Overall, with the new Transformer-style architecture and the dynamic message passing mechanism, GROVER enjoys high expressive power and captures the structural information in molecules well, thus helping with various downstream molecular prediction tasks.
[Table 2: Comparison between GROVER with and without pre-training on the classification datasets (columns: GROVER, No Pretrain, Abs. Imp.; rows: datasets with their sizes, e.g., BBBP (2039)).]
Figure 4: The training and validation loss of GROVER and its variants (GROVER, GROVER w/o dyMPN, GROVER w/o GTrans).
Conclusion

We explore the potential of large-scale pre-trained GNN models in this work. With well-designed self-supervised tasks and a highly expressive architecture, our model GROVER can learn rich implicit information from enormous unlabelled graphs. More importantly, by fine-tuning GROVER, we achieve huge improvements (more than 6% on average, measured as relative improvement [50]) over the current SOTA on 11 challenging molecular property prediction benchmarks, which, to our knowledge, is the first verification of the power of self-supervised pre-training approaches in the graph learning area. Despite these successes, there is still room to improve GNN pre-training in the following aspects.

More self-supervised tasks.
Well-designed self-supervision tasks are the key to the success of GNN pre-training. Beyond the tasks presented in this paper, other meaningful tasks could also boost pre-training performance, such as distance-preserving tasks and tasks that involve 3D input information.
More downstream tasks.
It is desirable to explore a larger category of downstream tasks, such as node prediction and link prediction on different kinds of graphs. Different categories of downstream tasks might prefer different pre-training strategies and self-supervision tasks, which is worthwhile to study in the future.
Wider and deeper models.
Larger models are capable of capturing richer semantic information for more complicated tasks, as verified by several studies in the NLP area. It is also interesting to employ even larger models and more data than GROVER. However, one might need to alleviate potential problems when training super-large GNN models, such as gradient vanishing and over-smoothing.
References

[1] Tox21 challenge, 2017. https://tripod.nih.gov/tox21/challenge/.
[2] Guy W Bemis and Mark A Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996.
[3] Tian Bian, Xi Xiao, Tingyang Xu, Peilin Zhao, Wenbing Huang, Yu Rong, and Junzhou Huang. Rumor detection on social media with bi-directional graph convolutional networks. In AAAI 2020, 2020.
[4] L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009.
[5] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. Drug Discovery Today, 23(6):1241–1250, 2018.
[6] Zhengdao Chen, Xiang Li, and Joan Bruna. Supervised community detection with line graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
[7] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of Chemical Information and Modeling, 57(8):1757–1772, 2017.
[8] John S Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[11] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1):D1100–D1107, 2012.
[12] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell Chemical Biology, 23(10):1294–1301, 2016.
[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272. JMLR.org, 2017.
[14] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.
[19] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[20] Stanisław Jastrzębski, Damian Leśniak, and Wojciech Marian Czarnecki. Learning to SMILE(S). arXiv preprint arXiv:1602.06289, 2016.
[21] Chaitanya Joshi. Transformers are graph neural networks, 2020. https://graphdeeplearning.github.io/post/transformers-are-gnns/.
[22] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
[23] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
[24] Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. In International Conference on Learning Representations (ICLR), 2020.
[25] Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The SIDER database of drugs and side effects. Nucleic Acids Research, 44(D1):D1075–D1079, 2015.
[26] Greg Landrum et al. RDKit: Open-source cheminformatics. 2006.
[27] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. Semi-supervised graph classification: A hierarchical graph perspective. In The World Wide Web Conference, pages 972–982. ACM, 2019.
[28] Shengchao Liu, Mehmet F Demirel, and Yingyu Liang. N-gram graph: Simple unsupervised representation for graphs, with applications to molecules. In Advances in Neural Information Processing Systems, pages 8464–8476, 2019.
[29] Chengqiang Lu, Qi Liu, Chao Wang, Zhenya Huang, Peize Lin, and Lixin He. Molecular property prediction: A multilevel quantum interactions modeling perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1052–1060, 2019.
[30] Jing Ma, Wei Gao, and Kam-Fai Wong. Detect rumors on Twitter by promoting information campaigns with generative adversarial learning. In The World Wide Web Conference, pages 3049–3055, 2019.
[31] Ines Filipa Martins, Ana L Teixeira, Luis Pinheiro, and Andre O Falcao. A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling, 52(6):1686–1697, 2012.
[32] David L Mobley and J Peter Guthrie. FreeSolv: A database of experimental and calculated hydration free energies, with input files. Journal of Computer-Aided Molecular Design, 28(7):711–720, 2014.
[33] Kenta Oono and Taiji Suzuki. On asymptotic behaviors of graph CNNs from dynamical systems perspective. arXiv preprint arXiv:1905.10947, 2019.
[34] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018.
[35] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pages 259–270, 2020.
[36] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
[37] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
[38] Raghunathan Ramakrishnan, Mia Hartmann, Enrico Tapavicza, and O Anatole Von Lilienfeld. Electronic spectra from TDDFT and machine learning in chemical space. The Journal of Chemical Physics, 143(8):084111, 2015.
[39] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.
[40] Ann M Richard, Richard S Judson, Keith A Houck, Christopher M Grulke, Patra Volarath, Inthirany Thillainadarajah, Chihae Yang, James Rathman, Matthew T Martin, John F Wambaugh, et al. ToxCast chemical landscape: Paving the road to 21st century toxicology. Chemical Research in Toxicology, 29(8):1225–1251, 2016.
[41] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
[42] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2020.
[43] Seongok Ryu, Jaechang Lim, Seung Hwan Hong, and Woo Youn Kim. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint arXiv:1805.10988, 2018.
[44] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pages 991–1001, 2017.
[45] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8(1):1–8, 2017.
[46] Teague Sterling and John J Irwin. ZINC 15: Ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015.
[47] Govindan Subramanian, Bharath Ramsundar, Vijay Pande, and Rajiah Aldrin Denny. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. Journal of Chemical Information and Modeling, 56(10):1936–1949, 2016.
[48] Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations, 2019.
[49] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[50] Leo Törnqvist, Pentti Vartia, and Yrjö O Vartia. How should relative changes be measured? The American Statistician, 39(1):43–46, 1985.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[52] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[53] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[54] Izhar Wallach, Michael Dzamba, and Abraham Heifets. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015.
[55] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017.
[56] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 429–436, 2019.
[57] David Weininger, Arthur Weininger, and Joseph L Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences, 29(2):97–101, 1989.
[58] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
[59] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry, 2019.
[60] Zheng Xu, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery. In BCB, 2017.
[61] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019.
[62] Liangzhen Zheng, Jingrong Fan, and Yuguang Mu. OnionNet: A multiple-layer intermolecular-contact-based convolutional neural network for protein–ligand binding affinity prediction. ACS Omega, 4(14):15956–15965, 2019.
Appendix

A The Overall Architecture of GROVER
Figure 5: Overview of the whole
GROVER architecture with both node-view
GTransformer (in pink background) and edge-view
GTransformer (in green background).

Figure 5 illustrates the complete architecture of
the GROVER model, which contains a node-view GTransformer (in pink background) and an edge-view GTransformer (in green background). A brief presentation of the node-view GTransformer has been given in the main text, and the edge-view GTransformer has a similar structure. Here we elaborate on more details of the GROVER model and its associated four sets of output embeddings.

As shown in Figure 5, the node-view
GTransformer contains a node dyMPN, which maintains hidden states of nodes $h_v, v \in \mathcal{V}$ and performs message passing over nodes. Meanwhile, the edge-view GTransformer contains an edge dyMPN, which maintains hidden states of edges $h_{vw}, h_{wv}, (v, w) \in \mathcal{E}$ and conducts message passing over edges. The edge message passing is viewed as ordinary message passing over the line graph of the original graph, where the line graph describes the adjacency of edges in the original graph and enables an appropriate way to define message passing over edges [6]. Note that edge hidden states have directions, i.e., $h_{vw}$ is not identical to $h_{wv}$ in general. Then, after the multi-head attention, we denote the transformed node and edge hidden states by $\bar{h}_v$ and $\bar{h}_{vw}$, respectively.

Given the above setup, we can explain why GROVER outputs four sets of embeddings in Figure 5. Let us focus on the information flow in the pink panel of Figure 5 first. Here the node hidden states $\bar{h}_v$ encounter the two components Aggregate2Node and Aggregate2Edge, which aggregate the node hidden states into node messages and edge messages, respectively. Specifically, the Aggregate2Node and Aggregate2Edge components in the node-view GTransformer are formulated as follows:
$$m_v^{\text{node-embedding-from-node-states}} = \sum_{u \in \mathcal{N}_v} \mathrm{CONCAT}(\bar{h}_u, x_u), \qquad (4)$$
$$m_{vw}^{\text{edge-embedding-from-node-states}} = \sum_{u \in \mathcal{N}_v \setminus w} \mathrm{CONCAT}(\bar{h}_u, e_{uv}). \qquad (5)$$
Then the node-view GTransformer transforms the node messages $m_v^{\text{node-embedding-from-node-states}}$ and edge messages $m_{vw}^{\text{edge-embedding-from-node-states}}$ through Pointwise Feed Forward layers [51] and Add&LayerNorm to produce the final node embeddings and edge embeddings, respectively.

Similarly, for the information flow in the green panel, the edge hidden states $\bar{h}_{vw}$ encounter the two components Aggregate2Node and Aggregate2Edge as well. Their operations are formulated as follows:
$$m_v^{\text{node-embedding-from-edge-states}} = \sum_{u \in \mathcal{N}_v} \mathrm{CONCAT}(\bar{h}_{uv}, x_u), \qquad (6)$$
$$m_{vw}^{\text{edge-embedding-from-edge-states}} = \sum_{u \in \mathcal{N}_v \setminus w} \mathrm{CONCAT}(\bar{h}_{uv}, e_{uv}). \qquad (7)$$
Then, the edge-view GTransformer transforms the node messages and edge messages through Pointwise Feed Forward layers and Add&LayerNorm to produce the final node embeddings and edge embeddings, respectively.

Figure 6: Examples of constructing contextual properties for edges.

In summary, the
GROVER model outputs four sets of embeddings from two information flows. Thenode information flow (node
GTransformer) maintains node hidden states and finally transforms them into a set of node embeddings and a set of edge embeddings, while the edge information flow (edge GTransformer) maintains edge hidden states and also transforms them into node and edge embeddings. The four sets of embeddings reflect structural information extracted from two distinct views, and they can flexibly support downstream tasks, such as node-level prediction, edge-level prediction and graph-level prediction (via an extra READOUT component).
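The following sketch illustrates the Aggregate2Node and Aggregate2Edge operations of Equations (4) and (5) for the node-view flow; Equations (6) and (7) follow the same pattern with the edge hidden states in place of the node hidden states. The edge-list layout and the quadratic loop are simplifications for clarity, not the actual implementation.

```python
import torch

def aggregate_from_node_states(h_bar, x, e, edge_index):
    """Sketch of Equations (4)-(5): node and edge messages built from the
    transformed node states of the node-view GTransformer.

    h_bar: (N, d) transformed node states; x: (N, f) initial node features;
    e: (E, g) initial edge features; edge_index: (2, E) directed edges (u -> v).
    """
    src, dst = edge_index
    n, num_edges = h_bar.size(0), edge_index.size(1)

    # Eq. (4): m_v = sum_{u in N(v)} CONCAT(h_bar_u, x_u)
    node_msg = torch.zeros(n, h_bar.size(1) + x.size(1))
    node_msg.index_add_(0, dst, torch.cat([h_bar[src], x[src]], dim=-1))

    # Eq. (5): m_vw = sum_{u in N(v) \ {w}} CONCAT(h_bar_u, e_uv), per directed edge (v, w)
    edge_msg = []
    for k in range(num_edges):
        v, w = src[k].item(), dst[k].item()
        terms = [torch.cat([h_bar[src[j]], e[j]]) for j in range(num_edges)
                 if dst[j].item() == v and src[j].item() != w]
        edge_msg.append(torch.stack(terms).sum(dim=0) if terms
                        else torch.zeros(h_bar.size(1) + e.size(1)))
    return node_msg, torch.stack(edge_msg)

# Toy graph: 4 atoms, 3 bonds stored as 6 directed edges.
h_bar, x, e = torch.randn(4, 16), torch.randn(4, 8), torch.randn(6, 4)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
node_msg, edge_msg = aggregate_from_node_states(h_bar, x, e, edge_index)
print(node_msg.shape, edge_msg.shape)   # torch.Size([4, 24]) torch.Size([6, 20])
```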
A.1 Fine-tuning Model for Molecular Property Prediction
As explained above, given a molecular graph $G_i$ and the corresponding label $y_i$, GROVER produces two sets of node embeddings, $H_{i,\text{node-view}}$ and $H_{i,\text{edge-view}}$, from the node-view GTransformer and the edge-view GTransformer, respectively. We feed these two node embeddings into a shared self-attentive READOUT function to generate the graph-level embedding [27, 52]:
$$S = \mathrm{softmax}\!\left(W_2 \tanh\!\left(W_1 H^{\top}\right)\right), \qquad g = \mathrm{Flatten}(SH), \qquad (8)$$
where $W_1 \in \mathbb{R}^{d_{\text{attn\_hidden}} \times d_{\text{hidden\_size}}}$ and $W_2 \in \mathbb{R}^{d_{\text{attn\_out}} \times d_{\text{attn\_hidden}}}$ are two weight matrices and $g$ is the final graph embedding. After the READOUT, we employ two distinct MLPs to generate two predictions, $p_{i,\text{node-view}}$ and $p_{i,\text{edge-view}}$. Besides the supervised loss $\mathcal{L}(p_{i,\text{node-view}}, y_i) + \mathcal{L}(p_{i,\text{edge-view}}, y_i)$, the final loss function also includes a disagreement loss [27], $\mathcal{L}_{\text{dis}} = \|p_{i,\text{node-view}} - p_{i,\text{edge-view}}\|$, to encourage consensus between the two predictions.
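A sketch of the self-attentive READOUT of Equation (8) together with the disagreement loss is given below; the default sizes loosely follow Table 6, while the L2 form of the norm and the sigmoid applied to the predictions are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentiveReadout(nn.Module):
    """S = softmax(W2 tanh(W1 H^T)), g = Flatten(S H), cf. Equation (8)."""
    def __init__(self, hidden_size, attn_hidden=128, attn_out=4):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, attn_hidden, bias=False)
        self.w2 = nn.Linear(attn_hidden, attn_out, bias=False)

    def forward(self, H):                              # H: (num_nodes, hidden_size)
        S = torch.softmax(self.w2(torch.tanh(self.w1(H))).T, dim=-1)  # (attn_out, N)
        return (S @ H).flatten()                       # graph embedding g

readout = SelfAttentiveReadout(hidden_size=64)
mlp_node, mlp_edge = nn.Linear(4 * 64, 1), nn.Linear(4 * 64, 1)

H_node_view, H_edge_view = torch.randn(7, 64), torch.randn(7, 64)
p_node, p_edge = mlp_node(readout(H_node_view)), mlp_edge(readout(H_edge_view))
y = torch.tensor([1.0])

bce = nn.BCEWithLogitsLoss()
supervised = bce(p_node, y) + bce(p_edge, y)
disagreement = torch.norm(torch.sigmoid(p_node) - torch.sigmoid(p_edge))  # assumed L2
loss = supervised + 0.1 * disagreement                # 0.1 corresponds to dist_coff in Table 6
```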
A.2 Constructing Contextual Properties for Edges

In Section 4.2 we describe an example of constructing contextual properties for nodes; here we present an instance of constructing edge contextual properties to complete the picture. Similar to the node case, we define recurrent statistical properties of a local subgraph in a two-step manner. Take the graphs in Figure 6 for instance and consider the double chemical bond in red in the left graph. Step I: we extract its local subgraph as its $k$-hop neighboring nodes and edges. When $k = 1$, it involves the nitrogen atom, the carbon atom and the two single bonds. Step II: we extract statistical properties of this subgraph; specifically, we count the number of occurrences of (node, edge) pairs around the center edge, which yields node-edge-counts terms. We then list all the node-edge-counts terms in alphabetical order, which yields the final property, e.g., DOUBLE_C-SINGLE1_N-SINGLE1 in the example. Note that there are two graphs and two double bonds in red in Figure 6; since their subgraphs have the same statistical property, the resulting contextual properties of the two bonds are the same. From a different point of view, this step can be viewed as a clustering process: the subgraphs are clustered according to the extracted properties, and one property corresponds to a cluster of subgraphs with the same statistical property.
B Details about Experimental Setup
B.1 Dataset Description
Table 3: Dataset information
Type            Category            Dataset        #Tasks  #Molecules  Metric
Classification  Biophysics          BBBP           1       2039        ROC-AUC
                Physiology          SIDER          27      1427        ROC-AUC
                                    ClinTox        2       1478        ROC-AUC
                                    BACE           1       1513        ROC-AUC
                                    Tox21          12      7831        ROC-AUC
                                    ToxCast        617     8575        ROC-AUC
Regression      Physical chemistry  FreeSolv       1       642         RMSE
                                    ESOL           1       1128        RMSE
                                    Lipophilicity  1       4200        RMSE
                Quantum mechanics   QM7            1       6830        MAE
                                    QM8            12      21786       MAE
Table 3 summarizes the information of the benchmark datasets, including task type, dataset size and evaluation metric. The details of each dataset are listed below [58]:
Molecular Classification Datasets.

- BBBP [31] involves records of whether a compound carries the permeability property of penetrating the blood-brain barrier.
- SIDER [25] records marketed drugs along with their adverse drug reactions, also known as the Side Effect Resource.
- ClinTox [12] compares drugs approved by the FDA and drugs eliminated due to toxicity during clinical trials.
- BACE [47] records compounds which could act as inhibitors of human β-secretase 1 (BACE-1) collected over the past few years.
- Tox21 [1] is a public database measuring the toxicity of compounds, which was used in the 2014 Tox21 Data Challenge.
- ToxCast [40] contains multiple toxicity labels obtained by running high-throughput screening tests on thousands of chemicals.
Molecular Regression Datasets.

- QM7 [4] is a subset of GDB-13 that records computed properties (e.g., HOMO/LUMO, atomization energy) of stable and synthetically accessible organic molecules. It covers various molecular structures such as triple bonds, cycles, amide and epoxy groups.
- QM8 [38] contains computer-generated quantum mechanical properties, e.g., electronic spectra and excited state energies of small molecules.
- ESOL [8] is a small dataset documenting the solubility of compounds.
- Lipophilicity [11] is selected from the ChEMBL database. Lipophilicity is an important property that affects molecular membrane permeability and solubility; the data are obtained via octanol/water distribution coefficient experiments.
- FreeSolv [32] is selected from the Free Solvation Database, which contains hydration free energies of small molecules in water from both experiments and alchemical free energy calculations.
Dataset Splitting.
We apply scaffold splitting [2] for all tasks on all datasets. It partitions molecules with distinct two-dimensional structural frameworks into different subsets. This is a more challenging but practical setting, since the test molecules can be structurally different from the training set. We use scaffold splitting to construct the train/validation/test sets.
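A minimal sketch of scaffold splitting based on RDKit's Bemis-Murcko scaffolds is shown below; the group-assignment heuristic here is an assumption and may differ from the MoleculeNet/DeepChem implementation.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold, then fill train/valid/test
    with whole scaffold groups (largest groups first)."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smi, includeChirality=False)
        groups[scaffold].append(i)

    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "C1CCCCC1", "CC(=O)O"]
print(scaffold_split(smiles))
```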
B.2 Feature Extraction Processes for Molecules
The feature extraction contains two parts. 1) Node/edge feature extraction: we use RDKit to extract the atom and bond features used as the input of dyMPN; Table 4 and Table 5 show the atom and bond features used in GROVER. 2) Molecule-level feature extraction: following the same protocol as [58, 61], we extract 200 additional molecule-level features with RDKit for each molecule and concatenate them to the output of the self-attentive READOUT before the MLP for the final prediction.

Table 4: Atom features.
feature          size  description
atom type        100   type of atom (e.g., C, N, O), by atomic number
formal charge    5     integer electronic charge assigned to atom
number of bonds  6     number of bonds the atom is involved in
chirality        5     number of bonded hydrogen atoms
number of H      5     number of bonded hydrogen atoms
atomic mass      1     mass of the atom, divided by 100
aromaticity      1     whether this atom is part of an aromatic system
hybridization    5     sp, sp2, sp3, sp3d, or sp3d2
Table 5: Bond features.
feature     size  description
bond type   4     single, double, triple, or aromatic
stereo      6     none, any, E/Z or cis/trans
in ring     1     whether the bond is part of a ring
conjugated  1     whether the bond is conjugated
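The sketch below reads the raw atom and bond attributes behind Tables 4 and 5 with RDKit; the one-hot encoding into the listed sizes is omitted, and the accessors shown are standard RDKit calls rather than GROVER's exact featurization code.

```python
from rdkit import Chem

def atom_features(atom):
    """Raw values behind the atom features in Table 4 (one-hot encoding omitted)."""
    return {
        "atom type": atom.GetAtomicNum(),
        "formal charge": atom.GetFormalCharge(),
        "number of bonds": atom.GetDegree(),
        "chirality": str(atom.GetChiralTag()),
        "number of H": atom.GetTotalNumHs(),
        "atomic mass": atom.GetMass() / 100.0,
        "aromaticity": atom.GetIsAromatic(),
        "hybridization": str(atom.GetHybridization()),
    }

def bond_features(bond):
    """Raw values behind the bond features in Table 5."""
    return {
        "bond type": str(bond.GetBondType()),
        "stereo": str(bond.GetStereo()),
        "in ring": bond.IsInRing(),
        "conjugated": bond.GetIsConjugated(),
    }

mol = Chem.MolFromSmiles("c1ccccc1C(=O)O")   # benzoic acid
print(atom_features(mol.GetAtomWithIdx(0)))
print(bond_features(mol.GetBondWithIdx(0)))
```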
C Implementation and Pre-training Details
We use PyTorch to implement GROVER and Horovod for distributed training. We use the Adam optimizer with L2 weight decay and train the model for 500 epochs. The learning rate warms up over the first two epochs and then decreases exponentially. We use PReLU [16] as the activation function and a dropout rate of 0.1 for all layers. Both GROVER base and GROVER large use 4 attention heads. We set the number of iterations to $L = 1$ and sample $K_l \sim \psi(\mu = 6, \sigma = 1, a = 3, b = 9)$ for the embedded dyMPN in GROVER, where $\psi(\mu, \sigma, a, b)$ is a truncated normal distribution with truncation range $(a, b)$. The hidden sizes of GROVER base and GROVER large are 800 and 1200, respectively. We use 250 Nvidia V100 GPUs to pre-train
GROVER base and
GROVER large . Pre-training
GROVER base and
GROVER large took 2.5 days and 4 days respectively. For the models depictedin Figure 4 in the ablation study, we use 32 Nvidia V100 GPUs to pre-train the
GROVER model and its variants. We will release the pre-trained models and the training code in the future.
D Fine-tuning Details
For each task, we try 300 different hyper-parameter combinations via random search to find the best results. Table 6 lists all the hyper-parameters of the fine-tuning model. All fine-tuning tasks are run on a single P40 GPU.

Table 6: The fine-tuning hyper-parameters.
hyper-parameter  description                                                                                           range
batch_size       the input batch size                                                                                  32
init_lr          initial learning rate ratio of the Noam scheduler; the real initial learning rate is max_lr / init_lr   10
max_lr           maximum learning rate of the Noam scheduler                                                           –
final_lr         final learning rate ratio of the Noam scheduler; the real final learning rate is max_lr / final_lr      –
dropout          dropout ratio                                                                                         0, 0.05, 0.1, 0.2
attn_hidden      hidden size of the self-attentive readout                                                             128
attn_out         number of output heads of the self-attentive readout                                                  4, 8
dist_coff        coefficient of the disagreement loss                                                                  0.05, 0.1, 0.15
bond_drop_rate   drop-edge ratio [42]                                                                                  0, 0.2, 0.4, 0.6
ffn_num_layer    number of MLP layers                                                                                  2, 3
ffn_hidden_size  hidden size of the MLP layers                                                                         5, 7, 13

E Additional Experimental Results
Table 7 depicts the additional results of the comparison of the performance of pre-trained
GROVER and
GROVER without pre-training on regression tasks.Table 7: Comparison between
GROVER with and without pre-training on regression tasks
GROVER
No Pre-training Absolute ImprovementRMSE FreeSolv0.013