Advanced Graph and Sequence Neural Networks for Molecular Property Prediction and Drug Discovery
Zhengyang Wang, Meng Liu, Youzhi Luo, Zhao Xu, Yaochen Xie, Limei Wang, Lei Cai, Shuiwang Ji
Texas A&M University, Department of Computer Science and Engineering, College Station, TX 77843, USA
Currently with Microsoft, Bellevue, WA 98004, USA
* Correspondence should be addressed to: [email protected]
+ These authors contributed equally to this work.

Abstract
Properties of molecules are indicative of their functions and thus are useful in many applications. With the advances of deep learning methods, computational approaches for predicting molecular properties are gaining increasing momentum. However, there is currently a lack of customized and advanced methods and comprehensive tools for this task. Here we develop MoleculeKit, a suite of comprehensive machine learning methods and tools spanning different computational models and molecular representations for molecular property prediction and drug discovery. Specifically, MoleculeKit represents molecules as both graphs and sequences. Built on these representations, MoleculeKit includes novel deep models for learning from molecular graphs and sequences. Therefore, MoleculeKit not only serves as a comprehensive tool, but also contributes towards developing novel and advanced graph and sequence learning methodologies. Results on both online and offline antibiotics discovery and molecular property prediction tasks show that MoleculeKit achieves consistent improvements over prior methods.
Introduction
Molecular property prediction is one of the key tasks in cheminformatics and has applications in many fields, including quantum mechanics, physical chemistry, biophysics, and physiology. With the rapid advances of machine learning methods, there is a growing trend of applying computational methods to predict molecular properties, which saves a huge amount of time-consuming and expensive chemical experiments. With emerging global challenges like the COVID-19 pandemic, it is even more urgent to develop powerful machine learning methods for molecular property prediction, thereby accelerating the speed of drug discovery.

In the literature, there are two common ways to represent molecules in silico, i.e., molecular graphs and simplified molecular-input line-entry system (SMILES) sequences. Molecular graphs are intuitive to humans and informative to machine learning models. In particular, molecular graphs keep the structural information of molecules by clearly showing the connections between atoms. On the other hand, SMILES sequences have a longer history in cheminformatics and have been widely used. SMILES is a linguistic construct that represents molecules as single-line sequences using certain simple grammar rules and a vocabulary composed of characters representing atoms and bonds. SMILES sequences also store essential chemical information of molecules. Various machine learning methods have been proposed on either molecular graphs or SMILES sequences. Setting aside the difference in data types, these machine learning methods can also be divided by the method type. That is, we can categorize them into deep learning models and non-deep learning models. Deep learning methods have set new records in most machine learning tasks. However, non-deep learning methods still show advantages in certain cases, especially when the amount of available labeled data is small. Nevertheless, there is no clear evidence on which data type or method type leads to better performance for molecular property prediction. In addition, the development of deep learning models for molecular property prediction is still at an early stage.

In this work, we propose MoleculeKit, an advanced machine learning tool for molecular property prediction and drug discovery. MoleculeKit is unique in three aspects. First, MoleculeKit consists of a suite of comprehensive machine learning methods across different data types and method types. We expect them to provide complementary information for molecular property prediction and yield better performance. Second, we propose a new graph-based deep learning method named multi-level message passing neural network (ML-MPNN), which is included in MoleculeKit. ML-MPNN is able to aggregate information from a full range of levels in molecular graphs, including nodes, edges, subgraphs, and the entire graph, and thus makes full use of richly informative molecular graphs (Section 4.2.1). Third, MoleculeKit is equipped with a novel sequence-based deep learning method, namely contrastive-BERT. Contrastive-BERT is based on the state-of-the-art BERT and includes the masked embedding recovery task, a novel self-supervised pre-training task via contrastive learning (Section 4.3.1). With these three features, MoleculeKit not only serves as a comprehensive tool for molecular property prediction, but also contributes towards developing novel and advanced techniques.
Concretely, MoleculeKit has four modules covering deep and non-deep methods based on both molecular graphs and SMILES sequences; namely, ML-MPNN, the Weisfeiler-Lehman subtree (WL-subtree) kernel, contrastive-BERT, and the subsequence kernel (Fig. 1a).
Figure 1.
MoleculeKit overview, training and inference. a, MoleculeKit is composed of four modules, including two graph-based models, the multi-level message passing neural network (ML-MPNN) and the Weisfeiler-Lehman subtree (WL-subtree) kernel, and two SMILES sequence-based models, contrastive-BERT and the subsequence kernel. We represent molecules as molecular graphs and SMILES sequences, and feed them into the corresponding modules. The final prediction result of MoleculeKit is obtained by ensembling the predicted scores of the four modules. b, Given a training dataset with labeled molecules, each module is trained individually under the standard supervised setting. In particular, contrastive-BERT initializes the training parameters differently from the other three modules, which apply random initialization. The initialization of contrastive-BERT comes from an extra one-time pre-training phase on a large pre-training dataset with unlabeled molecules. In the prediction stage, MoleculeKit predicts the property of molecules by ensembling the scores predicted by the four modules.

In the following, we (1) demonstrate the overall effectiveness of MoleculeKit in the AI Cures open challenge task, (2) show the advantages of MoleculeKit and each of its modules in drug discovery with an offline dataset, and (3) analyze the improvements brought by MoleculeKit as well as our proposed ML-MPNN and contrastive-BERT on a wide range of molecular property prediction tasks from the MoleculeNet benchmarks. In addition, we discuss the key differences of ML-MPNN and contrastive-BERT from related works, and perform comprehensive ablation studies to support the novel designs in ML-MPNN and contrastive-BERT.

The proposed MoleculeKit consists of four modules; namely, the multi-level message passing neural network (ML-MPNN), the Weisfeiler-Lehman subtree (WL-subtree) kernel, contrastive-BERT, and the subsequence kernel (Fig. 1a). Despite the differences in data types and model architectures among these four modules, the training and inference procedures are unified (Fig. 1b). First, representing molecules as molecular graphs and SMILES sequences is effortless. Next, in the training stage, we assume a training dataset for each molecular property prediction task, where molecules are labeled by the property of interest. Each module is then trained under the standard supervised learning setting, just with different initialization methods. Specifically, contrastive-BERT initializes training parameters from an extra one-time pre-training phase, while the other three modules apply random initialization. The pre-training of contrastive-BERT is performed under a self-supervised contrastive learning setting on 2 million unlabeled molecules from the ZINC database. It is worth emphasizing that this pre-training phase is conducted only once for all tasks. Finally, during inference, we use the ensemble of the predicted scores from the four modules as the final prediction result of our MoleculeKit. Users of MoleculeKit are also given the freedom of employing fewer modules.

The COVID-19 pandemic has posed a great challenge to public health, and we first demonstrate how our MoleculeKit can be used to accelerate drug discovery for emerging global threats like COVID-19. To this end, we use our MoleculeKit to participate in an open challenge on drug discovery, known as AI Cures, which aims at discovering new antibiotics to cure the secondary lung infections caused by COVID-19.
The entire dataset consists of 2,335 chemical molecules, which is split into a training set of 2,097 molecules and a test set of 238 molecules. The prediction target is whether a molecule has antibacterial activity, or inhibition, against Pseudomonas aeruginosa, a bacterium leading to secondary lung infections in COVID-19 patients. The ground-truth binary labels of all molecules are obtained from an in-vitro screening assay. The class labels of the training data have been made available to challenge participants, while the labels of the test data are held out so that the challenge organizers can evaluate the performance of participants. The prediction performance is evaluated by ROC-AUC and PRC-AUC (Section 4.5) on the held-out test set.

We use all four modules of MoleculeKit in this task with both graph and sequence representations of molecules. Our final prediction is obtained with the ensemble of these four modules. Concretely, for each molecule of the test set, we use the average of the four predicted inhibition probabilities from the four modules as its final predicted inhibition probability.
Table 1. Results on the AI Cures open challenge task in terms of PRC-AUC and ROC-AUC. Our model name is MoleculeKit, and our team name is DIVE@TAMU.

(a) Test PRC-AUC
Rank  Model                            Author               Submissions  Test PRC-AUC
1     MolecularG                       AIDrug@PA            7            0.725
2     -                                AGL Team             20           0.702
3     MoleculeKit                      DIVE@TAMU            7            0.677
4     GB                               BI                   6            0.67
5     Chemprop ++                      AICures@MIT          4            0.662
6     -                                Mingjun Liu          3            0.657
7     Pre-trained OGB-GIN (ensemble)   Weihua Hu@Stanford   2            0.651
8     RF + fingerprint                 Cyrus Maher@Vir Bio  1            0.649
9     Graph Self-supervised Learning   SJTU_NRC_Mila        3            0.622
10    -                                Congjie He           10           0.611

(b) Test ROC-AUC
Rank  Model                            Author               Submissions  Test ROC-AUC
1     MoleculeKit                      DIVE@TAMU            7            0.928
2     Chemprop ++                      AICures@MIT          4            0.877
3     -                                Gianluca Bontempi    7            0.848
4     -                                Apoorv Umang         1            0.84
5     Pre-trained OGB-GIN (ensemble)   Weihua Hu@Stanford   2            0.837
6     -                                Kexin Huang          1            0.824
7     Chemprop                         Rajat Gupta          7            0.818
8     MLP                              IITM                 7            0.807
9     Graph Self-supervised Learning   SJTU_NRC_Mila        3            0.8
10    -                                Congjie He           10           0.8

There are a total of 27 teams participating in this open challenge. The prediction performance of our MoleculeKit and those of other top teams are reported in Table 1. In summary, our MoleculeKit achieves the highest test ROC-AUC of 0.928 and ranks third on test PRC-AUC with a value of 0.677. Note that the top two teams (AIDrug@PA and AGL Team) with higher test PRC-AUC (0.725 and 0.702) than us achieve very low ROC-AUC (0.7 and 0.675). Our MoleculeKit outperforms them on average by 35% on test ROC-AUC, while they outperform us on average by 5% on test PRC-AUC. These results indicate that our MoleculeKit achieves very stable and consistent performance across two different evaluation metrics. Our promising performance and turnkey solution to this open challenge on drug discovery demonstrate that MoleculeKit can be used for drug screening and for accelerating drug discovery.
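To make the ensembling and evaluation protocol concrete, the short sketch below averages per-module predicted probabilities and scores the result with ROC-AUC and PRC-AUC via scikit-learn; the labels and module outputs here are random placeholders standing in for the actual held-out test data, not the challenge submissions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_test = 238                          # size of the AI Cures test set
y_true = rng.integers(0, 2, n_test)   # placeholder labels; the real labels are held out

# Hypothetical predicted inhibition probabilities from the four modules
# (ML-MPNN, WL-subtree kernel + SVM, contrastive-BERT, subsequence kernel + SVM).
module_probs = rng.random((4, n_test))

# MoleculeKit's final score is the plain average of the four module probabilities.
ensemble_probs = module_probs.mean(axis=0)

# Evaluation as in the challenge: ROC-AUC and PRC-AUC
# (average precision approximates the area under the precision-recall curve).
print("ROC-AUC:", roc_auc_score(y_true, ensemble_probs))
print("PRC-AUC:", average_precision_score(y_true, ensemble_probs))
```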
In addition to the AI Cures open challenge task, we apply our MoleculeKit to another dataset to further demonstrate its advantages in accelerating drug discovery. In particular, we focus on predicting the capability of inhibiting the growth of
Escherichia coli. This would greatly help in finding new antibiotics to alleviate the bacteria's drug-resistance problem. The training dataset is composed of 2,335 molecules with binary labels obtained through screening. The testing dataset, provided by Stokes et al., contains 162 molecules from the Drug Repurposing Hub (DRH), among which 53 are found inhibitory against Escherichia coli.

Table 2. Results on the Drug Repurposing Hub (DRH) testing dataset in terms of PRC-AUC and ROC-AUC.
Model                 Test PRC-AUC  Test ROC-AUC
Baseline              0.777         0.888
ML-MPNN               0.828         0.900
WL-subtree kernel     0.857         0.913
Contrastive-BERT      0.804         0.877
Subsequence kernel    0.778         0.878
MoleculeKit           0.867         0.919

The baseline model in this experiment comes from Stokes et al. as well; that is, the directed message passing neural network (D-MPNN) is applied. Specifically, Stokes et al. randomly split the 2,335 training molecules into a new training dataset and a validation dataset with a ratio of 9:1 using 20 different random seeds, generating 20 different train-validation splits. They then train D-MPNN on these 20 splits independently, yielding 20 D-MPNN models. During prediction on the testing dataset, the average of the predicted scores from these 20 models is provided as the final prediction score. The performance is evaluated by PRC-AUC and ROC-AUC (Section 4.5).

We apply all four modules of MoleculeKit to this task. In order to show the superiority of MoleculeKit, we do not perform the ensemble over different train-validation splits. Instead, we randomly split the 2,335 training molecules into a new training dataset and a validation dataset with a ratio of 8.5:1.5 once, and use this train-validation split only. In addition to the ensemble of all four modules, we report the performance of each module as ablation studies, indicating that the better performance of MoleculeKit does not merely come from the ensemble of multiple methods. The results are provided in Table 2. Surprisingly, the subsequence kernel, without the ensemble over different train-validation splits, can already achieve competitive performance with the baseline by itself. The sequence-based deep learning method, contrastive-BERT, outperforms the baseline by a large margin in terms of PRC-AUC while achieving a similar ROC-AUC. The slight decrease in ROC-AUC is complemented by the graph-based methods. Both graph-based methods are very effective by themselves, especially the WL-subtree kernel. As a result, with the ensemble of the four modules, MoleculeKit improves the performance significantly.

The results clearly show three insights regarding MoleculeKit. First, by comparing ML-MPNN and contrastive-BERT with the baseline, we can see that our proposed deep learning methods are effective. Second, the kernel methods are very powerful, which is why MoleculeKit includes them. Third, different data types and method types provide complementary information, resulting in further improved performance with the ensemble.

With two successful applications to drug discovery, we further explore MoleculeKit on a wider range of molecular property prediction tasks. Particularly, we apply MoleculeKit to the MoleculeNet benchmarks for molecular property prediction. We perform experiments on 14 datasets of the MoleculeNet benchmarks, focusing on various molecular properties in the fields of quantum mechanics, physical chemistry, biophysics, and physiology (Fig. 2a).
Figure 2.
MoleculeKit on MoleculeNet benchmarks for molecular property prediction. a, We apply MoleculeKit on the MoleculeNet benchmarks for a wide range of molecular property prediction tasks. Experiments are performed on 14 datasets, covering different molecular properties in the fields of quantum mechanics (QM8, QM9), physical chemistry (ESOL, Lipophilicity, FreeSolv), biophysics (HIV, BACE, PCBA, MUV), and physiology (BBBP, Tox21, ToxCast, SIDER, ClinTox). b, Prediction performances of MoleculeKit on the testing datasets of 5 regression tasks, in terms of MAE or RMSE (Section 4.5, Supplementary Table 1). We follow Yang et al. to plot the error relative to the MoleculeNet baseline. The confidence intervals are marked by the standard deviations. c, Prediction performances of MoleculeKit on the testing datasets of 9 classification tasks, in terms of ROC-AUC or PRC-AUC (Section 4.5, Supplementary Table 1). The absolute performance values are plotted. The confidence intervals are marked by the standard deviations. For b&c, the mean and standard deviation results are computed from 3 independent runs over 3 different train/validation/test splits on each dataset. The same 3 splits are used for the baselines and our MoleculeKit for fair comparisons. Results of MoleculeNet and D-MPNN are directly copied from their papers. Note that D-MPNN is not evaluated on the BACE and ToxCast datasets. All the corresponding quantitative results are provided in Supplementary Table 2. These results show significant and consistent performance gains brought by our MoleculeKit.

These properties can be either binary labels, like those in drug discovery, or continuous values. Correspondingly, they are formulated as binary classification tasks or regression tasks. Typically, binary classification tasks are evaluated by PRC-AUC or ROC-AUC, while regression tasks are evaluated by MAE or RMSE (Section 4.5). In order to serve as benchmarks and enable fair comparisons, Wu et al. provide the split and evaluation methods for all the datasets. We report these key details of the datasets in Supplementary Table 1. Further information can be found in Wu et al.

Two baseline models are chosen for comparisons with our MoleculeKit. One is a collection of previously proposed molecular featurization and learning algorithms provided together with the MoleculeNet benchmarks. It includes both graph-based and sequence-based methods, covering both deep and non-deep models. Without loss of clarity, we denote this baseline as MoleculeNet and always report the best performance from the collection. The other model is the directed message passing neural network (D-MPNN), which incorporates the latest techniques in graph neural networks and achieves state-of-the-art performances on most datasets of the MoleculeNet benchmarks. These two models form strong baselines in terms of both breadth and depth.

All the results in this section, including the baselines and our MoleculeKit, are obtained by following the same settings provided in the MoleculeNet benchmarks. The performances of our MoleculeKit and the two baselines are compared in Fig. 2b&c. All the corresponding quantitative results are provided in Supplementary Table 2. Note that when showing performances on regression datasets (Fig. 2b), we follow Yang et al. to plot the error relative to the MoleculeNet baseline, because the scale of performance values differs significantly across datasets. On the other hand, when showing performances on classification datasets (Fig. 2c), the absolute performance values in [0, 1] are plotted.
Note that D-MPNN is not evaluated on the BACE and ToxCast datasets. It is also worth noting that, in this set of experiments, we do not always use the ensemble of all four modules for MoleculeKit. Instead, we choose modules for the final ensemble based on their validation performances. Besides, the two kernel methods are not used on the QM9, PCBA, MUV, and ToxCast datasets, as they are too large in terms of the number of molecules or tasks (Supplementary Table 1). In these cases, the kernel methods are infeasible due to lack of memory or exceedingly long computation time. Overall, our MoleculeKit outperforms both baselines on 13 out of 14 datasets. On the MUV dataset, MoleculeKit is beaten by MoleculeNet while improving over D-MPNN significantly. These results demonstrate the power of MoleculeKit on molecular property prediction.

Additionally, in order to demonstrate the effectiveness of our proposed novel deep learning methods, ML-MPNN and contrastive-BERT, we also perform ablation studies by comparing them individually with the baselines in the following.

Ablation studies on ML-MPNN.
The comparisons between our proposed ML-MPNN and the two baselines, MoleculeNet and D-MPNN, are provided in Fig. 3a&b. The plots follow the same settings as Fig. 2b&c. The corresponding quantitative results can be found in Supplementary Table 2.

On 9 out of 14 datasets, ML-MPNN outperforms both baselines. Concretely, ML-MPNN achieves better performances than MoleculeNet on 11 out of 14 datasets and than D-MPNN on 9 out of 12 datasets, respectively. In particular, we notice that MoleculeNet utilizes extra 3D coordinate information on the QM8 and QM9 datasets, which D-MPNN and our ML-MPNN do not use. D-MPNN barely outperforms MoleculeNet on QM8 and underperforms MoleculeNet on QM9, while our ML-MPNN yields a performance boost. The increased performances on a majority of the datasets show the effectiveness of our proposed ML-MPNN. Meanwhile, the comparisons between D-MPNN and ML-MPNN demonstrate that our proposed ML-MPNN can serve as the new state-of-the-art graph-based deep learning method for molecular property prediction.

Ablation studies on contrastive-BERT.
The comparisons between our proposed contrastive-BERT and the two baselines are provided in Fig. 3c&d. The corresponding quantitative results can be found in Supplementary Table 2. On half of the 14 datasets, contrastive-BERT shows performances equal to or better than both baselines.
Figure 3.
ML-MPNN and contrastive-BERT on MoleculeNet benchmarks for molecular property prediction. a, Prediction performances of ML-MPNN on the testing datasets of 5 regression tasks, in terms of MAE or RMSE (Section 4.5, Supplementary Table 1). b, Prediction performances of ML-MPNN on the testing datasets of 9 classification tasks, in terms of ROC-AUC or PRC-AUC (Section 4.5, Supplementary Table 1). c, Prediction performances of contrastive-BERT on the testing datasets of 5 regression tasks, in terms of MAE or RMSE (Section 4.5, Supplementary Table 1). d, Prediction performances of contrastive-BERT on the testing datasets of 9 classification tasks, in terms of ROC-AUC or PRC-AUC (Section 4.5, Supplementary Table 1). All the figures follow the same settings as Fig. 2b&c. All the corresponding quantitative results are provided in Supplementary Table 2. These results demonstrate the advantages of our proposed novel deep learning methods, ML-MPNN and contrastive-BERT.

Respectively, contrastive-BERT outperforms MoleculeNet on 8 out of 14 datasets and D-MPNN on 7 out of 12 datasets. Note that both D-MPNN and ML-MPNN use additional global molecular features obtained with the open-source package RDKit (Section 4.6), and MoleculeNet has access to extra 3D coordinate information on the QM8 and QM9 datasets, as mentioned above. However, contrastive-BERT only uses SMILES sequences, without additional features or extra information. Taking these into consideration, contrastive-BERT serves as a powerful sequence-based model for molecular property prediction.

We propose MoleculeKit, an advanced tool composed of graph-based and sequence-based machine learning methods, for various molecular property prediction tasks including drug discovery. Specifically, MoleculeKit consists of four modules, including two graph-based methods, the multi-level message passing neural network (ML-MPNN) and the Weisfeiler-Lehman subtree (WL-subtree) kernel, and two sequence-based methods, contrastive-BERT and the subsequence kernel. Among the four modules of MoleculeKit, ML-MPNN and contrastive-BERT are novel deep learning methods proposed in this work. In order to demonstrate the advantages of MoleculeKit as well as the two novel methods, we perform thorough experimental studies on two drug discovery applications and a wide range of molecular property prediction tasks.

We anticipate that our work will have potential impacts in multiple aspects. First, both our proposed ML-MPNN and contrastive-BERT achieve success in various molecular property prediction applications by themselves. They stand for advanced techniques for graph-based and sequence-based machine learning methods, respectively. We expect them to not only push forward the development of new models for molecular property prediction, but also inspire the proposal of more powerful machine learning methods in general. Second, we have shown that different data types and method types provide complementary information and boost the performances when ensembled in MoleculeKit. This serves as a useful insight when applying machine learning methods for molecular property prediction in practice. In particular, our MoleculeKit would serve as a reliable open-source tool. Third, our work may have a larger impact beyond the academic community.
For example, we achieve top performances in the AI Cures open challenge, which aims at accelerating drug discovery for emerging global threats like COVID-19.

Regarding the two novel deep learning methods proposed in this work, ML-MPNN and contrastive-BERT, we discuss the key differences from previous studies and illustrate the superiority of our methods. In terms of ML-MPNN, performing molecular property prediction by learning from molecular graphs has drawn a lot of attention in the community. Typically, existing works learn representations for molecules through message passing. The original MPNN updates node representations by aggregating messages from neighboring nodes and connected edges in each layer. Instead of updating representations for nodes, D-MPNN learns representations for each directed edge. The recently proposed HIMP incorporates representations associated with substructures during message passing. While these existing methods are shown to be effective, they only leverage limited information contained in the molecular graphs during message passing. Intuitively, information related to the molecular properties can come from different levels, such as edges, nodes, and subgraphs. Based on these insights, our ML-MPNN is designed to be capable of aggregating information from a full range of levels in molecular graphs.

Specifically, ML-MPNN can be regarded as an extension of the unified graph neural network framework. The key difference lies in taking subgraph-level representations into consideration. ML-MPNN incorporates this level of information effectively, leading to a more powerful message passing framework. In order to show that the improvements of ML-MPNN do come from subgraph-level representations, we perform ablation studies to compare the performances between ML-MPNN and ML-MPNN without subgraph-level representations on MoleculeNet benchmarks. The comparison results are provided in Supplementary Table 3 and Supplementary Fig. 6. With subgraph-level representations, improvements can be observed on 12 out of 14 datasets. Another novel aspect of ML-MPNN is the normalization methods. We extend the graph size normalization to SizeNorm, which normalizes multi-level representations before applying BatchNorm (Section 4.2.1). In order to verify the effectiveness of SizeNorm and BatchNorm, we conduct ablation studies as well and show the results in Supplementary Table 4 and Supplementary Fig. 7. Compared to the model without any normalization, SizeNorm can improve the performance on 5 out of 14 datasets and BatchNorm can improve the performance on 11 datasets. Applying SizeNorm and BatchNorm together can improve the performance on 12 datasets.

In terms of contrastive-BERT, there exist multiple studies on learning from sequential data. Recurrent neural networks (RNNs), like long short-term memory (LSTM) and the gated recurrent unit (GRU), were the most popular sequence-based machine learning models before the proposal of Transformer. Based on Transformer, BERT applies pre-training and achieves great success in various tasks. Transformer is theoretically more powerful than RNNs and shows improved performance in tasks like machine translation. However, we find it necessary to perform pre-training as in BERT to achieve good performance on molecular property prediction, due to the small sizes of available datasets.
To verify this, we conduct experiments comparing performances on MoleculeNet benchmarks between GRU, Transformer, and our contrastive-BERT, and report the results in Supplementary Table 5 and Supplementary Fig. 8. We can see that GRU outperforms Transformer significantly on small datasets, while contrastive-BERT achieves the best performances on 13 out of 14 datasets. Contrastive-BERT has a similar architecture to Transformer but is featured by a pre-training phase via contrastive learning (Section 4.3.1), indicating the necessity of effective pre-training.

Contrastive-BERT is not the first study that applies BERT to molecular property prediction. SMILES-BERT directly employs BERT on SMILES sequences with the masked language model pre-training task from BERT. This pre-training task is a classification task, where the model is asked to predict what the masked characters are. On the contrary, our contrastive-BERT proposes a novel pre-training task via contrastive learning, named the masked embedding recovery task, where the model is asked to predict the embeddings of the masked characters. This contrastive learning task is substantially harder than the original classification task and is supposed to encourage learning better representations of SMILES sequences for downstream tasks. In order to demonstrate the advantages of our proposed contrastive-BERT and the masked embedding recovery task, we perform ablation studies comparing our contrastive-BERT with the SMILES-BERT model. Results are provided in Supplementary Table 6 and Supplementary Fig. 9. Our contrastive-BERT model outperforms the SMILES-BERT model on 11 out of 14 datasets.

One limitation of our MoleculeKit is that it does not consider the 3D coordinate information of molecules when available. While we have shown that MoleculeKit is able to outperform models that take 3D coordinate information into consideration, we expect that incorporating such information into MoleculeKit may further improve the performances. Intuitively, molecules are 3D structures. Location information of atoms can be important for predicting molecular properties. Although there are several attempts to consider such 3D coordinate information, how to utilize such information effectively is still an open challenge in the community. We leave incorporating 3D coordinate information into MoleculeKit as future work.

References
1. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Science, 513–530 (2018).
2. Schneider, G. & Wrede, P. Artificial neural networks for computer-based molecular design. Prog. Biophysics Molecular Biology, 175–222 (1998).
3. Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Review Letters, 146401 (2007).
4. Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys. Review Letters, 136403 (2010).
5. Varnek, A. & Baskin, I. Machine learning methods for property prediction in chemoinformatics: quo vadis? J. Chemical Information Modeling, 1413–1437 (2012).
6. Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Review B, 184115 (2013).
7. Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2224–2232 (2015).
8. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, 1263–1272 (2017).
9. Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Communications, 1–8 (2017).
10. Hy, T. S., Trivedi, S., Pan, H., Anderson, B. M. & Kondor, R. Predicting molecular properties with covariant compositional networks. The Journal of Chemical Physics, 241745 (2018).
11. Unke, O. T. & Meuwly, M. PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. J. Chemical Theory Computation, 3678–3693 (2019).
12. Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chemical Information Modeling, 3370–3388 (2019).
13. Unterthiner, T. et al. Deep learning as an opportunity in virtual screening. In Proceedings of the Deep Learning Workshop at NIPS, vol. 27, 1–9 (2014).
14. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure–activity relationships. J. Chemical Information Modeling, 263–274 (2015).
15. Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell, 688–702 (2020).
16. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chemical Information Computer Sciences, 31–36 (1988).
17. Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. & Borgwardt, K. M. Weisfeiler-Lehman graph kernels. J. Machine Learning Research, 2539–2561 (2011).
18. Kearnes, S., McCloskey, K., Berndl, M., Pande, V. & Riley, P. Molecular graph convolutions: moving beyond fingerprints. J. Computer-Aided Molecular Design, 595–608 (2016).
19. Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chemical Information Modeling, 1757–1772 (2017).
20. Fey, M., Yuen, J. G. & Weichert, F. Hierarchical inter-message passing for learning on molecular graphs. In International Conference on Machine Learning, Graph Representation Learning and Beyond Workshop (2020).
21. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N. & Watkins, C. Text classification using string kernels. J. Machine Learning Research, 419–444 (2002).
22. Cao, D.-S. et al. In silico toxicity prediction by support vector machine and SMILES representation-based string kernel. SAR QSAR Environmental Research, 141–153 (2012).
23. Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076 (2017).
24. Mayr, A. et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Science, 5441–5451 (2018).
25. Wang, S., Guo, Y., Wang, Y., Sun, H. & Huang, J. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 429–436 (2019).
26. Honda, S., Shi, S. & Ueda, H. R. SMILES Transformer: pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738 (2019).
27. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature, 436–444 (2015).
28. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186, DOI: 10.18653/v1/N19-1423 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
29. Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S. & Coleman, R. G. ZINC: a free tool to discover chemistry for biology. J. Chemical Information Modeling, 1757–1768 (2012).
30. Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
31. Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y. & Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982 (2020).
32. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Computation, 1735–1780 (1997).
33. Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734, DOI: 10.3115/v1/D14-1179 (Association for Computational Linguistics, Doha, Qatar, 2014).
34. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008 (2017).
35. Hadsell, R., Chopra, S. & LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 1735–1742 (IEEE, 2006).
36. Baevski, A., Zhou, Y., Mohamed, A.-r. & Auli, M. wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Information Processing Systems (2020).
37. He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738 (2020).
38. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (2020).
39. Schütt, K. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, 991–1001 (2017).
40. Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In Proceedings of the 8th International Conference on Learning Representations (2019).
Acknowledgements
We thank Sheng Wang, Zhenqin Wu, and Kevin Yang for sharing their data and answering our questions. This work was supported in part by National Science Foundation grants DBI-1922969, IIS-1908198, and IIS-1955189.
Author contributions
S.J. conceived and initiated the research. Z.W., M.L., Y.L., Z.X., Y.X., L.W., L.C. and S.J. designed the methods. M.L., Y.L., Z.X., Y.X. and L.W. implemented the training and validation methods and designed and developed the software package. Z.W., L.C. and S.J. supervised the project. Z.W., M.L., Y.L., Z.X., Y.X., L.W. and S.J. wrote the manuscript.
Competing Interests
The authors declare no competing interests.
Methods
Our proposed MoleculeKit aims at making accurate predictions on certain properties of a molecule through graph-based and sequence-based machine learning models. Correspondingly, we represent a molecule as a molecular graph and as a simplified molecular-input line-entry system (SMILES) sequence. Here, we give basic notations and define the problems.

As a molecule is composed of chemical atoms and bonds between them, it is natural to represent it as a molecular graph, where nodes and edges correspond to atoms and bonds, respectively. In particular, a bond can be represented as a single undirected edge or as two directed edges with opposite directions. Formally, a molecular graph is denoted as G = (V, E) ∈ 𝒢, where V is the set of nodes and E ⊆ V × V is the set of edges. |V| and |E| are the numbers of nodes and edges, respectively. Moreover, we consider attributed molecular graphs. Specifically, there are feature vectors associated with each node v ∈ V and edge e ∈ E. These feature vectors are represented as x(v) ∈ R^p and w(e) ∈ R^q, given by the pre-defined node attribute mapping x: V → R^p and edge attribute mapping w: E → R^q, respectively.

On the other hand, a molecule can be represented as a SMILES sequence as well. Basically, a SMILES sequence is simply a string, where each character denotes a chemical atom or an indicator of structures like a bond, a ring, etc. Formally, we denote a SMILES sequence as S ∈ 𝒮 and its length as |S|, and S[i], i = 1, ..., |S| refers to the i-th character of S. All the characters come from a vocabulary Σ.

A property of a molecule can be represented by a scalar number y. A continuous y ∈ R can denote, for example, the water solubility of the molecule. A binary y ∈ {−1, +1} can indicate whether a molecule is antibacterial or not. Depending on y, the problem we focus on can be formulated as either a regression (y ∈ R) or a binary classification (y ∈ {−1, +1}) task. Specifically, given a molecule represented by the molecular graph G or the SMILES sequence S as the input, a machine learning model is asked to predict y. In the following, we describe our proposed MoleculeKit, which ensembles both graph-based and sequence-based machine learning methods.

With the molecular graph G as the input, MoleculeKit consists of two graph-based machine learning models; namely, our novel multi-level message passing neural network (ML-MPNN) and the Weisfeiler-Lehman subtree kernel (WL-subtree). Our proposed multi-level message passing neural network (ML-MPNN) is featured by the ability of aggregating information from a full range of levels in molecular graphs, including nodes, edges, subgraphs, and the entire graph. As a result, multi-level representations are initialized and updated through our ML-MPNN.
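As a concrete illustration of the attributed molecular graph G = (V, E) defined above, the sketch below builds node features, directed edges, and edge features from a SMILES string with RDKit; the particular atom and bond attributes chosen here are illustrative stand-ins, not the exact featurization of Yang et al. used by MoleculeKit.

```python
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Build a simple attributed molecular graph from a SMILES string.

    Returns node features x(v), directed edges (s_e, t_e), and edge features w(e).
    The features below are illustrative; MoleculeKit follows a richer featurization
    (atom type, number of bonds, formal charge, atomic mass, bond stereo, ...).
    """
    mol = Chem.MolFromSmiles(smiles)
    node_feats = [
        [atom.GetAtomicNum(), atom.GetDegree(), atom.GetFormalCharge(),
         int(atom.GetIsAromatic())]
        for atom in mol.GetAtoms()
    ]
    edges, edge_feats = [], []
    for bond in mol.GetBonds():
        u, v = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        feat = [bond.GetBondTypeAsDouble(), int(bond.GetIsConjugated()),
                int(bond.IsInRing())]
        # one chemical bond becomes two directed edges with opposite directions
        edges += [(u, v), (v, u)]
        edge_feats += [feat, feat]
    return node_feats, edges, edge_feats

x, e, w = smiles_to_graph("CC(C)COC(=O)c1ccccc1")  # any valid SMILES works here
```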
Multi-level Representations.
We first introduce the multi-level representations and their initialization. In particular, we consider node-level representations, edge-level representations, subgraph-level representations, and a graph-level representation. As discussed in Section 4.1, the molecular graph G = (V, E) is attributed, with feature vectors x(v) ∈ R^p and w(e) ∈ R^q associated with each node v ∈ V and edge e ∈ E. The node-level representations and edge-level representations in our ML-MPNN are then initialized as x(v) ∈ R^p and w(e) ∈ R^q, respectively. Concretely, we follow Yang et al. to define x(·) and w(·). The node attribute mapping x(·) encodes important attributes of a chemical atom, such as the type of atom, the number of related bonds, the formal charge, and the atomic mass. The edge attribute mapping w(·) encodes useful properties of a bond, including the bond type, the stereo configuration, whether the bond is part of a ring, and whether the bond is conjugated. In the following, we directly use x(v) ∈ R^p and w(e) ∈ R^q to denote the node-level representations and edge-level representations for simplicity.

ML-MPNN utilizes the junction tree to obtain subgraphs and initialize the subgraph-level representations. The junction tree represents a molecular graph as a tree, where each node in the tree corresponds to a subgraph in the original molecular graph. Specifically, we follow the tree decomposition algorithm to build the junction tree. Nodes in the resulting junction tree represent rings, bonds, bridged compounds, or singletons in the original molecular graph, as illustrated in Supplementary Fig. 1. Afterwards, in order to initialize subgraph-level representations for the original molecular graph, we simply generate node-level representations for the junction tree. Here, for each node in the junction tree, we encode the concrete subgraph type as its node-level representation. Formally, given a molecular graph G, we denote the node set of the junction tree as C, which is also the set of subgraphs. The subgraph-level representations are denoted by r(c) ∈ R^f for each subgraph c ∈ C.

The graph-level representation in our ML-MPNN is empirically initialized as the average of the edge-level representations in the corresponding graph. After ML-MPNN, the final graph-level representation obtained is used for predicting the molecular property. Formally, given a molecular graph G, we denote the graph-level representation as z(G) ∈ R^g.

As a summary, given a molecular graph G = (V, E) and the set of corresponding subgraphs C, the multi-level representations involved in our ML-MPNN are node-level representations x(v) ∈ R^p for each node v ∈ V, edge-level representations w(e) ∈ R^q for each edge e ∈ E, subgraph-level representations r(c) ∈ R^f for each subgraph c ∈ C, and a graph-level representation z(G) ∈ R^g. Note that ML-MPNN handles molecular graphs with directed edges. For each edge e ∈ E, we use s_e ∈ V and t_e ∈ V to denote the source node and target node. In addition, without loss of generality, we consider the nodes and subgraphs to be indexed from 1 to |V| and from 1 to |C|, respectively. Then the assignment of nodes to subgraphs can be denoted by an assignment matrix R ∈ {0, 1}^(|V|×|C|), where R[v, c] = 1 if node v belongs to subgraph c, and R[v, c] = 0 otherwise.
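To make the node-to-subgraph assignment concrete, the following sketch builds a binary matrix R from a deliberately simplified decomposition in which each ring and each non-ring bond becomes one subgraph; the actual ML-MPNN relies on the junction-tree decomposition cited above, so this is only an approximation for illustration.

```python
import numpy as np
from rdkit import Chem

def assignment_matrix(smiles: str) -> np.ndarray:
    """Binary matrix R of shape (|V|, |C|): R[v, c] = 1 iff node v is in subgraph c.

    Simplified decomposition: every ring is one subgraph and every bond outside a
    ring is one subgraph (the junction-tree decomposition used by ML-MPNN also
    handles bridged compounds and singletons).
    """
    mol = Chem.MolFromSmiles(smiles)
    subgraphs = [set(ring) for ring in mol.GetRingInfo().AtomRings()]
    for bond in mol.GetBonds():
        if not bond.IsInRing():
            subgraphs.append({bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()})
    R = np.zeros((mol.GetNumAtoms(), len(subgraphs)), dtype=np.int64)
    for c, atoms in enumerate(subgraphs):
        for v in atoms:
            R[v, c] = 1
    return R

print(assignment_matrix("CC(C)COC(=O)c1ccccc1").shape)
```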
ML-MPNN.
Next, we introduce how multi-level representations are updated through our ML-MPNN, where message passing is performed hierarchically. ML-MPNN can be viewed as an extension of the general graph neural network framework. ML-MPNN stacks several ML-MPNN layers followed by a final output layer for classification or regression. Each ML-MPNN layer updates the representations for each level in five steps, as illustrated in Supplementary Fig. 2.
Step 1: Update the edge-level representations.
We update the edge-level representations by aggregating information from the node-level representations of the source node and target node as well as the graph-level representation. Specifically, we concatenate these representations with the previous edge-level representation as the input to a multilayer perceptron (MLP) to update the edge-level representation. Formally, we have

w′(e) = MLP([w(e), x(s_e), x(t_e), z(G)]), ∀ e ∈ E,   (1)

where [·] denotes concatenation and w′(e) represents the updated edge-level representation. Note that, as the graph-level representation contains the global information of the molecular graph, it is supposed to guide the updating process and is thus used in each updating step below.
Step 2: Update the node-level representations.
For each node, we use the updated edge-level representations of its incoming edges and the node-level representations of the corresponding source nodes to compute the messages it receives. The node-level representation is then updated with the messages and the graph-level representation. The updating process is formulated as

M_en(v) = MLP( ∑_{e: t_e = v} [w′(e), x(s_e)] ), ∀ v ∈ V,   (2)
x′(v) = MLP([x(v), M_en(v), z(G)]), ∀ v ∈ V.   (3)

Here, x′(v) is the updated node-level representation of node v, and M_en(·) denotes the messages passed from edge-level representations to node-level representations.
Step 3: Update the subgraph-level representations.
The messages for updating the subgraph-level representations come from the nodes that are assigned to the subgraph as well as from neighboring subgraphs, i.e., neighboring nodes in the junction tree. The concrete updating procedure is given by

M_ns(c) = MLP( ∑_{v: R[v,c]=1} x′(v) ), ∀ c ∈ C,   (4)
r′(c) = MLP([r(c), M_ns(c), MLP( ∑_{u ∈ N(c)} r(u) ), z(G)]), ∀ c ∈ C,   (5)

where N(c) denotes the set of neighboring subgraphs of subgraph c and r′(c) is the updated subgraph-level representation of subgraph c. M_ns(·) represents the messages passed from node-level representations to subgraph-level representations.
Step 4: Update the graph-level representation.
Finally, the graph-level representation gets updated after receiving messages from the updated representations of each level. We have

M_eg(G) = (1/|E|) ∑_{e ∈ E} w′(e),   (6)
M_ng(G) = (1/|V|) ∑_{v ∈ V} x′(v),   (7)
M_sg(G) = (1/|C|) ∑_{c ∈ C} r′(c),   (8)
z′(G) = MLP([z(G), M_eg(G), M_ng(G), M_sg(G)]),   (9)

where z′(G) is the updated graph-level representation of graph G, and M_eg(·), M_ng(·), and M_sg(·) denote the messages passed from edge-level representations, node-level representations, and subgraph-level representations, respectively.
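The four updating steps above can be condensed into a single module; the PyTorch sketch below is a minimal re-implementation of Equations (1)-(9) for one graph with dense tensors, written for illustration rather than taken from the MoleculeKit implementation. It leaves out the Step-5 normalization described next, and the two-layer MLP helper, the hidden width, and the toy graph at the bottom are all assumptions.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))

class MLMPNNLayer(nn.Module):
    """One multi-level message passing layer (Eqs. (1)-(9)) for a single graph.

    x: (|V|, d) node states, w: (|E|, d) edge states, r: (|C|, d) subgraph states,
    z: (d,) graph state, src/dst: (|E|,) long tensors of directed-edge endpoints,
    R: (|V|, |C|) node-to-subgraph assignment, A_c: (|C|, |C|) junction-tree adjacency.
    """
    def __init__(self, d):
        super().__init__()
        self.edge_mlp = mlp(4 * d, d)    # Eq. (1)
        self.msg_en = mlp(2 * d, d)      # Eq. (2)
        self.node_mlp = mlp(3 * d, d)    # Eq. (3)
        self.msg_ns = mlp(d, d)          # Eq. (4)
        self.sub_nbr = mlp(d, d)         # inner MLP of Eq. (5)
        self.sub_mlp = mlp(4 * d, d)     # Eq. (5)
        self.graph_mlp = mlp(4 * d, d)   # Eq. (9)

    def forward(self, x, w, r, z, src, dst, R, A_c):
        zE, zV, zC = (z.expand(t.size(0), -1) for t in (w, x, r))
        # Step 1: edge update from its endpoints and the graph state, Eq. (1)
        w = self.edge_mlp(torch.cat([w, x[src], x[dst], zE], dim=-1))
        # Step 2: sum incoming [w'(e), x(s_e)] messages per target node, Eqs. (2)-(3)
        msg = torch.zeros(x.size(0), 2 * x.size(1)).index_add_(0, dst,
                                                               torch.cat([w, x[src]], dim=-1))
        x = self.node_mlp(torch.cat([x, self.msg_en(msg), zV], dim=-1))
        # Step 3: subgraph update from member nodes and junction-tree neighbors, Eqs. (4)-(5)
        r = self.sub_mlp(torch.cat([r, self.msg_ns(R.t() @ x),
                                    self.sub_nbr(A_c @ r), zC], dim=-1))
        # Step 4: graph update from the means of all updated levels, Eqs. (6)-(9)
        z = self.graph_mlp(torch.cat([z, w.mean(0), x.mean(0), r.mean(0)], dim=-1))
        return x, w, r, z

# toy usage: 5 atoms, 8 directed edges, 2 subgraphs, hidden size 16
d, layer = 16, MLMPNNLayer(16)
x, w, r, z = torch.randn(5, d), torch.randn(8, d), torch.randn(2, d), torch.randn(d)
src = torch.tensor([0, 1, 1, 2, 2, 3, 3, 4])
dst = torch.tensor([1, 0, 2, 1, 3, 2, 4, 3])
R = torch.zeros(5, 2); R[:3, 0] = 1; R[2:, 1] = 1
A_c = torch.tensor([[0., 1.], [1., 0.]])
x, w, r, z = layer(x, w, r, z, src, dst, R, A_c)
```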
Step 5: Normalization for multi-level representations.
Like modern deep learning models, our ML-MPNN uses batch training. Adding BatchNorm can stabilize the training process and usually achieves better performance. However, due to the various sizes of graphs, batching graphs will lead to representations at different scales, making it difficult to learn the optimal statistics for BatchNorm. As a result, certain normalization methods on sizes must be applied. In ML-MPNN, graphs of various sizes refer to graphs with a variable number of edges, nodes, and subgraphs. Correspondingly, we apply EdgeSizeNorm, NodeSizeNorm, and SubgraphSizeNorm to edge-level representations, node-level representations, and subgraph-level representations, respectively, given by

w̄′(e) = BatchNorm(EdgeSizeNorm(w′(e))), EdgeSizeNorm(w′(e)) = w′(e) / √|E|, ∀ e ∈ E,   (10)
x̄′(v) = BatchNorm(NodeSizeNorm(x′(v))), NodeSizeNorm(x′(v)) = x′(v) / √|V|, ∀ v ∈ V,   (11)
r̄′(c) = BatchNorm(SubgraphSizeNorm(r′(c))), SubgraphSizeNorm(r′(c)) = r′(c) / √|C|, ∀ c ∈ C,   (12)
z̄′(G) = BatchNorm(z′(G)).   (13)

MoleculeKit applies the Weisfeiler-Lehman subtree kernel (WL-subtree) as a traditional graph-based machine learning component. WL-subtree is able to capture the similarity among molecular graphs regarding isomorphism. In WL-subtree, we consider molecular graphs with undirected edges. In addition, the node attribute mapping x: V → R^p becomes a node label mapping ℓ: V → Z⁺ that assigns to each node v the atomic number of the corresponding atom, and there is no edge attribute.

In general, given any two graphs G_1, G_2 ∈ 𝒢, a graph kernel computes the similarity between them as k_g(G_1, G_2), where k_g: 𝒢 × 𝒢 → R is a kernel function. Suppose we have a training dataset {(G_i, y_i)}_{i=1}^N with N samples. We train a kernel Support Vector Machine (SVM) by optimizing

min_{{α_i}_{i=1}^N}  C [ ∑_{i=1}^N L( ∑_{j=1}^N α_j y_j K_{ji}, y_i ) ] + ∑_{i=1}^N ∑_{j=1}^N α_i α_j K_{ji},   (14)

where K_{ij} = k_g(G_i, G_j), C is a hyper-parameter, and {α_i}_{i=1}^N are training parameters. For a regression task, L is the ε-insensitive loss. After training, the prediction result for a graph G is given by ∑_{i=1}^N α_i y_i k_g(G_i, G). For a binary classification task, L is the hinge loss and the prediction result is given by sign( ∑_{i=1}^N α_i y_i k_g(G_i, G) ).

In particular, WL-subtree defines a kernel function k_g that computes the similarity based on common subtrees in two graphs. Specifically, let {s_1, s_2, ..., s_M} be the set of all possible subtrees within a certain depth. For a graph G, we compute

φ(G) = [n(s_1, G), n(s_2, G), ..., n(s_M, G)]^T,   (15)

where n(s_i, G), i = 1, 2, ..., M is the number of occurrences of the subtree s_i in G. With φ(·), the kernel function between G_1 and G_2 is given by k_g(G_1, G_2) = φ(G_1)^T φ(G_2).

The set {s_1, s_2, ..., s_M} is obtained by an iterative process of relabeling each node. Concretely, for the t-th iteration, we denote the label of each node v in a graph G as ℓ^(t)(v). At iteration 0, all the nodes in G are labelled by the node label mapping, ℓ^(0)(·) = ℓ(·). At iteration t (t ≥ 1), we update labels in two steps. First, for each node v, we build a multiset S(v) = {ℓ^(t−1)(u): u ∈ N(v)}, where N(v) is the set of neighboring nodes of v in G. Basically, S(v) collects the current labels of the neighboring nodes of v. Note that a multiset is allowed to include repeated elements. Meanwhile, labels in S(v) are sorted. Second, we hash {ℓ^(t−1)(v), S(v)} to a new label s_v. Then we have ℓ^(t)(v) = s_v.

This iterative process is performed on all graphs in the training dataset. Note that, for any pair of different nodes v_1 and v_2, no matter whether they are in the same graph or not, ℓ^(t)(v_1) = ℓ^(t)(v_2) if and only if the two t-hop subtrees rooted at v_1 and v_2 are identical. As a result, each label denotes a unique subtree. After T iterations, we collect all the labels that ever appear in any graph during the T iterations and put them into {s_1, s_2, ..., s_M}, the set of all possible subtrees within T hops. Finally, to obtain n(s_i, G), i = 1, 2, ..., M for any graph G, we simply perform the T iterations on G and count the number of times s_i appears.
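The relabeling-and-counting procedure just described can be sketched compactly as below, with scikit-learn's precomputed-kernel SVC standing in for the kernel SVM of Equation (14); the adjacency-dict graph format, the hash-based relabeling, and the toy molecules are illustrative assumptions rather than the MoleculeKit implementation.

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def wl_features(adj, labels, T=3):
    """Counter of WL subtree labels appearing within T relabeling iterations.

    adj: dict node -> list of neighbors; labels: dict node -> initial label (e.g., the
    atomic number). Hashing (own label, sorted neighbor labels) yields a new label that
    identifies the subtree rooted at each node, as in the relabeling steps above.
    """
    labels = dict(labels)
    counts = Counter(labels.values())
    for _ in range(T):
        labels = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in adj}
        counts.update(labels.values())
    return counts

def wl_kernel_matrix(graphs, T=3):
    """Gram matrix with K[i, j] = phi(G_i)^T phi(G_j) over shared subtree counts (Eq. (15))."""
    feats = [wl_features(adj, labels, T) for adj, labels in graphs]
    K = np.zeros((len(graphs), len(graphs)))
    for i, fi in enumerate(feats):
        for j, fj in enumerate(feats):
            K[i, j] = sum(fi[s] * fj[s] for s in fi.keys() & fj.keys())
    return K

# toy usage: two 3-atom molecules given as adjacency lists with atomic-number labels
g1 = ({0: [1], 1: [0, 2], 2: [1]}, {0: 6, 1: 6, 2: 8})
g2 = ({0: [1], 1: [0, 2], 2: [1]}, {0: 6, 1: 8, 2: 8})
K = wl_kernel_matrix([g1, g2])
svm = SVC(kernel="precomputed").fit(K, [1, -1])   # kernel SVM standing in for Eq. (14)
```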
With the SMILES sequence S as the input, MoleculeKit consists of two sequence-based machine learning models; namely, our proposed contrastive-BERT and the subsequence kernel.

Our contrastive-BERT is characterized by a novel self-supervised pre-training task via contrastive learning, namely the masked embedding recovery task. It results in better performances on downstream applications after fine-tuning. Note that the pre-training phase is performed only once on a large unlabeled dataset, and we have provided the pre-trained model. The fine-tuning phase is the same as end-to-end supervised training on labeled datasets, except that the pre-trained model is used to initialize the training parameters instead of random initialization. In the following, we first describe the network architecture of our contrastive-BERT and then focus on the proposed masked embedding recovery task.
Network architecture.
Similar to the original BERT, the network architecture of our contrastive-BERT mainly follows Transformer. Given a SMILES sequence S with the vocabulary Σ, a trainable embedding layer is first applied to transform each character S[i], i = 1, 2, ..., |S| into an embedding vector h_i ∈ R^d. The embedding layer actually maintains an embedding matrix of dimension |Σ| × d, where each row corresponds to the embedding vector of a character in Σ. The embedding vector of a character is supposed to capture chemical information about the corresponding atom or structure. In addition, since the order matters in SMILES sequences, the position embedding from Vaswani et al. is added to h_i. Afterwards, h_i serves as the input to Transformer and gets updated.

Transformer is composed of a stack of Transformer layers. A Transformer layer consists of a self-attention mechanism followed by a multilayer perceptron (MLP). Specifically, with h_i, i = 1, 2, ..., |S| as inputs, the self-attention mechanism allows direct interactions among them and updates each h_i by aggregating information according to the interaction results. Formally, by writing h_i, i = 1, 2, ..., |S| into a matrix H = [h_1, h_2, ..., h_{|S|}]^T ∈ R^(|S|×d), we have

H′ = Softmax( (H W_Q)(H W_K)^T / √d ) H W_V,   (16)

where H′ = [h′_1, h′_2, ..., h′_{|S|}]^T ∈ R^(|S|×d) represents the updated embeddings and W_Q, W_K, W_V ∈ R^(d×d) are training parameters. In particular, the multi-head technique is also used. The self-attention mechanism is similar to a global message passing in the sense that each character can receive information from any other character. The following MLP has two layers with the Gaussian error linear unit (GELU) as the activation function. In order to facilitate the training, residual connections and layer normalization are applied to both the self-attention mechanism and the MLP.

Consequently, the outputs of our contrastive-BERT are updated embedding vectors for each character of a SMILES sequence S, denoted as o_i, i = 1, 2, ..., |S|. In practice, a special token is also added to each sequence, whose output embedding serves as a summary representation of the whole SMILES sequence.
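Equation (16) can be written out directly; the snippet below is a single-head, single-sequence sketch of the self-attention update, omitting the multi-head split, residual connections, and layer normalization mentioned above, with randomly initialized projection matrices as placeholders.

```python
import torch
import torch.nn.functional as F

def self_attention(H, W_Q, W_K, W_V):
    """Single-head self-attention over one SMILES sequence (Eq. (16)).

    H: (|S|, d) character embeddings; W_Q, W_K, W_V: (d, d) projection matrices.
    """
    d = H.size(-1)
    scores = (H @ W_Q) @ (H @ W_K).t() / d ** 0.5   # (|S|, |S|) pairwise interactions
    return F.softmax(scores, dim=-1) @ (H @ W_V)    # weighted aggregation of values

# toy usage: a sequence of 7 characters with embedding size 32
d = 32
H = torch.randn(7, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
H_new = self_attention(H, W_Q, W_K, W_V)
```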
Masked embedding recovery task.
As described above, the training of the contrastive-BERT consists of a self-supervised pre-training phase followed by a supervised fine-tuning phase, as shown in Supplementary Fig. 4. Our contrastive-BERT mainly contributes to the self-supervised pre-training phase by proposing the masked embedding recovery task, a novel self-supervised task via contrastive learning.

Basically, the masked embedding recovery task asks the model to predict the embeddings of masked characters based on the embeddings of unmasked characters. The masking strategy in our contrastive-BERT follows the one in the original BERT, in order to fairly demonstrate the advantages of the new objective. Concretely, in a SMILES sequence, 15% of the characters are randomly masked. Each character that is selected to be masked has an 80% probability of being replaced by a special mask token, a 10% probability of being replaced by a random character, and a 10% probability of being left unchanged.
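A minimal sketch of this masking procedure on tokenized SMILES characters is given below; the token ids, vocabulary size, and mask-token id are hypothetical placeholders rather than MoleculeKit's actual vocabulary.

```python
import random

def mask_smiles_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions; of those, 80% become the
    special mask token, 10% become a random character, 10% stay unchanged."""
    corrupted = list(token_ids)
    masked_positions = []
    for i in range(len(corrupted)):
        if random.random() < mask_prob:
            masked_positions.append(i)
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_id                       # special mask token
            elif r < 0.9:
                corrupted[i] = random.randrange(vocab_size)  # random character
            # else: keep the original character unchanged
    return corrupted, masked_positions

# Example with hypothetical token ids for a short SMILES string
ids = [12, 12, 7, 3, 12, 9]
corrupted, positions = mask_smiles_tokens(ids, vocab_size=70, mask_id=0)
```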
Since a molecule generally admits multiple valid SMILES representations, for example, CCN and C(N)C are both valid SMILES sequences of the organic compound ethylamine with the formula $\mathrm{CH_3CH_2NH_2}$ (Supplementary Fig. 3), each time we perform training on a molecule we randomly choose one of its valid SMILES sequences as the input to our contrastive-BERT. Such data augmentation helps the model learn embeddings that capture intrinsic chemical information irrelevant to the representation format. During testing, we use the canonicalization algorithm to generate a unique canonical SMILES sequence for each molecule.

As a traditional sequence-based machine learning component, our MoleculeKit uses the subsequence kernel. As indicated by its name, the subsequence kernel computes the similarity between SMILES sequences based on their common subsequences. Formally, for a SMILES sequence $S \in \mathcal{S}$, a sequence $U$ is called a subsequence of $S$ if there exists a list of indices $\boldsymbol{i} = \{i_1, i_2, \cdots, i_{|U|}: 1 \leq i_1 < i_2 < \cdots < i_{|U|} \leq |S|\}$ such that $U[j] = S[i_j]$, $j = 1, \ldots, |U|$, i.e., $U = S[\boldsymbol{i}]$. Note that the subsequence $U$ is not necessarily composed of consecutive characters from $S$.

Similar to the graph kernel described in Section 4.2.2, a kernel function $k_s: \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ is defined to compute the similarity between two sequences $S_1, S_2 \in \mathcal{S}$ as $k_s(S_1, S_2)$. Given a training dataset $\{(S_i, y_i)\}_{i=1}^{N}$, the training process solves the same optimization problem as in Equation (14) with $K_{ij} = k_s(S_i, S_j)$. With the learned parameters $\{\alpha_i\}_{i=1}^{N}$, the prediction result for a sequence $S$ is given by $\sum_{i=1}^{N}\alpha_i y_i k_s(S_i, S)$ and $\mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i y_i k_s(S_i, S)\right)$ for regression and binary classification, respectively.

Concretely, the subsequence kernel focuses on subsequences with a pre-defined length $D$. For SMILES sequences over the vocabulary $\Sigma$, the set of all possible subsequences of length $D$ is $\Sigma^D = \{U_1, U_2, \ldots, U_{|\Sigma^D|}\}$. For a sequence $S$, we compute

$$n(U, S) = \sum_{\boldsymbol{i}\,:\,U = S[\boldsymbol{i}]} \lambda^{\,i_D - i_1 + 1}, \qquad (17)$$

$$\phi(S) = \left[n(U_1, S), n(U_2, S), \cdots, n(U_{|\Sigma^D|}, S)\right]^T, \qquad (18)$$

where $n(U_i, S)$, $i = 1, 2, \ldots, |\Sigma^D|$ is a weighted count of $U_i$ in $S$. Specifically, $\lambda$ with $0 < \lambda < 1$ is a decay factor, so an occurrence of $U$ that spans a longer range of $S$ (a larger $i_D - i_1 + 1$) contributes less to the count. With $\phi(\cdot)$, the kernel function $k_s$ between two SMILES sequences $S_1$ and $S_2$ is given by $k_s(S_1, S_2) = \phi(S_1)^T \phi(S_2)$. In practice, it is common to normalize it as $k_s(S_1, S_2)/\sqrt{k_s(S_1, S_1)\cdot k_s(S_2, S_2)}$.
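To illustrate Equations (17) and (18), the brute-force sketch below enumerates all index lists of length D and applies the λ decay over the occupied span. It is written for readability on short strings; the function names and the exhaustive enumeration are illustrative assumptions, whereas practical string-kernel implementations use dynamic programming.

```python
from itertools import combinations, product

def weighted_subseq_count(u, s, lam):
    """n(U, S): sum of lam**(i_D - i_1 + 1) over index lists i with S[i] = U (Eq. 17)."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):  # strictly increasing indices
        if all(s[i] == c for i, c in zip(idx, u)):
            total += lam ** (idx[-1] - idx[0] + 1)   # penalize long spans
    return total

def subsequence_kernel(s1, s2, d, lam, vocabulary):
    """k_s(S1, S2) = phi(S1)^T phi(S2) over all length-d subsequences (Eq. 18)."""
    value = 0.0
    for u in product(vocabulary, repeat=d):
        value += weighted_subseq_count(u, s1, lam) * weighted_subseq_count(u, s2, lam)
    return value

# Example on two tiny SMILES-like strings with a three-character vocabulary
k = subsequence_kernel("CCO", "CCN", d=2, lam=0.5, vocabulary="CON")
k_normalized = k / (subsequence_kernel("CCO", "CCO", 2, 0.5, "CON")
                    * subsequence_kernel("CCN", "CCN", 2, 0.5, "CON")) ** 0.5
```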
We introduce the loss functions used to train the deep learning components of MoleculeKit, i.e., our proposed ML-MPNN and contrastive-BERT. Note that the contrastive-BERT has an extra one-time self-supervised pre-training phase with a contrastive loss, as described in Section 4.3.1. The fine-tuning phase of the contrastive-BERT uses the same supervised losses as ML-MPNN. For all the training of deep learning models, we employ batch training, where we randomly select a batch of $B$ samples from the training dataset in each training iteration. The Adam optimizer is used to minimize the loss.

4.4.1 Supervised loss for regression tasks
In regression tasks, we use the mean squared error (MSE) loss:

$$L_{\mathrm{MSE}}\left(\{y_i\}_{i=1}^{B}, \{\hat{y}_i\}_{i=1}^{B}\right) = \frac{1}{B}\sum_{i=1}^{B}\left(y_i - \hat{y}_i\right)^2, \qquad (19)$$

where $y_i, \hat{y}_i \in \mathbb{R}$ are the ground-truth label and the prediction result of the $i$-th training sample in the batch, respectively.

4.4.2 Supervised loss for binary classification tasks
In binary classification tasks, we use the binary cross entropy (BCE) loss:

$$L_{\mathrm{BCE}}\left(\{y_i\}_{i=1}^{B}, \{\hat{y}_i\}_{i=1}^{B}\right) = -\frac{1}{B}\sum_{i=1}^{B}\left(\mathbb{1}_{\{y=1\}}(y_i)\log(\hat{y}_i) + \mathbb{1}_{\{y=-1\}}(y_i)\log(1-\hat{y}_i)\right), \qquad (20)$$

where $y_i \in \{-1, 1\}$ is the ground-truth label of the $i$-th training sample in the batch, $\hat{y}_i \in (0, 1)$ is the corresponding predicted classification score, and $\mathbb{1}_{\{y=1\}}(\cdot)$ and $\mathbb{1}_{\{y=-1\}}(\cdot)$ are indicator functions.

4.4.3 Self-supervised contrastive loss
The masked embedding recovery task uses a self-supervised contrastive loss. Given a SMILES sequence $S$, the input and output embeddings of the contrastive-BERT are $h_i$ and $o_i$, $i = 1, 2, \ldots, |S|$, respectively. If $S[i]$ is masked, the corresponding contrastive loss is given by

$$L_{\mathrm{contrast}}\left(o_i, h_1, h_2, \ldots, h_{|S|}\right) = -\log\frac{\exp\left(\mathrm{sim}(o_i, h_i)/\tau\right)}{\sum_{j=1}^{|S|}\exp\left(\mathrm{sim}(o_i, h_j)/\tau\right)}, \qquad (21)$$

where $\tau$ is a temperature parameter and the similarity function $\mathrm{sim}(\cdot,\cdot)$ computes the cosine similarity, defined by

$$\mathrm{sim}(o, h) = \frac{o^T h}{|o|\cdot|h|}. \qquad (22)$$

If a SMILES sequence has multiple masked characters, the final loss is the contrastive loss averaged over the masked positions.
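To relate Equations (21) and (22) to code, the sketch below computes the averaged contrastive loss for one sequence in PyTorch; the tensor shapes and the default temperature are assumptions made for illustration, not MoleculeKit's training code.

```python
import torch
import torch.nn.functional as F

def masked_embedding_recovery_loss(o, h, masked_positions, tau=0.1):
    """Average contrastive loss over masked positions (Eqs. 21-22).

    o: (|S|, d) output embeddings, h: (|S|, d) input embeddings,
    masked_positions: list of masked indices i.
    """
    # Cosine similarities between every output embedding and every input embedding.
    sim = F.normalize(o, dim=-1) @ F.normalize(h, dim=-1).T   # (|S|, |S|)
    logits = sim / tau
    losses = []
    for i in masked_positions:
        # -log softmax picks out the positive pair (o_i, h_i) against all h_j.
        losses.append(F.cross_entropy(logits[i].unsqueeze(0), torch.tensor([i])))
    return torch.stack(losses).mean()

# Example with random embeddings for a sequence of length 12, d = 64
o, h = torch.randn(12, 64), torch.randn(12, 64)
loss = masked_embedding_recovery_loss(o, h, masked_positions=[2, 7])
```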
As the molecular property prediction problem can be formulated as either a regression task or a binary classification task, we use different evaluation metrics for different datasets.

Given a testing dataset of $N$ samples for a regression task, we denote the ground-truth labels as $\{y_i\}_{i=1}^{N}$ and the corresponding prediction results as $\{\hat{y}_i\}_{i=1}^{N}$, where $y_i, \hat{y}_i \in \mathbb{R}$. The evaluation metric is the mean absolute error (MAE):

$$\mathrm{MAE}\left(\{y_i\}_{i=1}^{N}, \{\hat{y}_i\}_{i=1}^{N}\right) = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|, \qquad (23)$$

or the root-mean-square error (RMSE):

$$\mathrm{RMSE}\left(\{y_i\}_{i=1}^{N}, \{\hat{y}_i\}_{i=1}^{N}\right) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}. \qquad (24)$$

A smaller MAE or RMSE indicates a better performance.

Given a testing dataset of $N$ samples for a binary classification task, we denote the ground-truth labels as $\{y_i\}_{i=1}^{N}$ and the corresponding predicted classification scores as $\{\hat{y}_i\}_{i=1}^{N}$, where $y_i \in \{-1, 1\}$ and $\hat{y}_i \in [0, 1]$. The predicted classification score $\hat{y}_i$ can be transformed into a predicted label $\tilde{y}_i \in \{-1, 1\}$ with a threshold $\gamma$, where $\tilde{y}_i = 1$ if $\hat{y}_i \geq \gamma$ and $\tilde{y}_i = -1$ if $\hat{y}_i < \gamma$. If $\tilde{y}_i = y_i = 1$ (or $\tilde{y}_i = y_i = -1$), the $i$-th prediction is a true positive (or true negative); if $\tilde{y}_i \neq y_i$ and $\tilde{y}_i = 1$ (or $\tilde{y}_i = -1$), the $i$-th prediction is a false positive (or false negative). With the threshold $\gamma$, we use $TP_\gamma$, $TN_\gamma$, $FP_\gamma$, $FN_\gamma$ to represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. We can then compute the true positive rate (TPR), false positive rate (FPR), and positive predictive value (PPV) under $\gamma$ as:

$$\mathrm{TPR}_\gamma = \frac{TP_\gamma}{TP_\gamma + FN_\gamma}, \qquad (25)$$

$$\mathrm{FPR}_\gamma = \frac{FP_\gamma}{FP_\gamma + TN_\gamma}, \qquad (26)$$

$$\mathrm{PPV}_\gamma = \frac{TP_\gamma}{TP_\gamma + FP_\gamma}. \qquad (27)$$

Note that TPR and PPV are also known as recall and precision, respectively.

The evaluation metric for binary classification tasks is the area under the curve (AUC) of the receiver operating characteristic (ROC) curve or the precision-recall curve (PRC), denoted by ROC-AUC and PRC-AUC, respectively. Concretely, by varying $\gamma \in [0, 1]$, the ROC curve and the PRC plot how $(\mathrm{FPR}, \mathrm{TPR})$ and (recall, precision) change in an $xy$-coordinate plane, respectively. A larger ROC-AUC or PRC-AUC indicates a better performance.
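As a quick illustration, these metrics are available off the shelf in scikit-learn; the snippet below assumes labels have already been mapped from {-1, 1} to {0, 1} and uses toy arrays, so it is only a usage sketch rather than part of the MoleculeKit evaluation code.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, precision_recall_curve, auc,
                             mean_absolute_error, mean_squared_error)

# Regression metrics (Eqs. 23-24) on toy values
y_true_reg = np.array([0.5, 1.2, -0.3])
y_pred_reg = np.array([0.4, 1.0, 0.1])
mae = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Classification metrics: labels mapped from {-1, 1} to {0, 1}, scores in [0, 1]
y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.3])
roc_auc = roc_auc_score(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)
prc_auc = auc(recall, precision)
```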
The settings of our device are: GPU, Nvidia GeForce RTX 2080 Ti 11GB; CPU, Intel Xeon Silver 4116 2.10GHz; OS, Ubuntu 16.04.3 LTS.

We implement our proposed ML-MPNN in PyTorch and PyTorch Geometric. Following D-MPNN, we incorporate additional global molecular features that can be easily obtained with the open-source package RDKit. Each molecule has a 200-dimensional global molecular feature vector, which is concatenated with the final graph-level representation learned by our ML-MPNN and fed into a 2-layer MLP to make the prediction. The task-specific configurations include the settings of the initial learning rate, learning rate decay factor, weight decay, hidden dimension, dropout rate, and batch size, as listed in Supplementary Table 7.

Our contrastive-BERT model is also implemented in PyTorch. The network is composed of 6 Transformer layers with 1024 hidden units and 4 attention heads. For the one-time pre-training phase, we train the model for 10 epochs, with the initial learning rate set to 0.0001 and decayed with the cosine annealing strategy. The batch size is set to 128 and the dropout rate is 0.1. In the fine-tuning phase, the settings are task-specific. Specifically, we tune the batch size, the number of training epochs, and the learning rate for each dataset, as listed in Supplementary Table 8.

As for the kernel methods, the number of iterations of the graph kernel, the subsequence length and decay factor of the string kernel, and whether kernel normalization is used are task-specific. The settings of these hyperparameters for each dataset are summarized in Supplementary Table 9.

Reporting summary
Further information on research design can be found in the Nature Research Reporting Summary linked to this article.
Data availability
The dataset for the AI Cures open challenge task is available at . The dataset for antibiotic discovery, provided by Stokes et al., can be found at . The datasets and splits of the MoleculeNet benchmarks for molecular property prediction can be downloaded from http://deepchem.io/datasets/MolNet_bkp.zip and http://deepchem.io/trained_models/Hyperparameter_MoleculeNetv3.tar.gz. The dataset of 2 million molecules from ZINC, prepared by Hu et al., for pre-training our contrastive-BERT can be downloaded from http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip. We also collect the datasets and corresponding splits of the MoleculeNet benchmarks and the dataset of 2 million molecules from ZINC at https://github.com/divelab/MoleculeKit/tree/master/moleculekit/datasets in our MoleculeKit tool. We recommend that users use this collection for convenience.

Code availability
The code for MoleculeKit training, prediction, and evaluation (in Python/PyTorch) is publicly available at https://github.com/divelab/MoleculeKit.

References
Rarey, M. & Dixon, J. S. Feature trees: a new molecular similarity measure based on tree matching. J. computer-aided molecular design, 471–490 (1998).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th international conference on machine learning, 2323–2332 (2018).
Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning, 448–456 (2015).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. learning, 273–297 (1995).
Chang, C.-C. & Lin, C.-J. LIBSVM: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 1–27 (2011).
Vapnik, V. N. The Nature of Statistical Learning Theory (Springer-Verlag, Berlin, Heidelberg, 1995).
Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J. & Vapnik, V. Support vector regression machines. In Advances in neural information processing systems, 155–161 (1997).
Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
Neglur, G., Grossman, R. L. & Liu, B. Assigning unique keys to chemical compounds for data integration: some interesting counter examples. In International workshop on data integration in the life sciences, 145–157 (Springer, 2005).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd international conference on learning representations (2015).
Willmott, C. J. & Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. research, 79–82 (2005).
Fawcett, T. An introduction to ROC analysis. Pattern recognition letters, 861–874 (2006).
Davis, J. & Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd international conference on machine learning, 233–240 (2006).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, 8026–8037 (2019).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In International conference on learning representations, workshop on representation learning on graphs and manifolds (2019).
Hu, W. et al. Strategies for pre-training graph neural networks. In Proceedings of the 7th international conference on learning representations (2019).