Are 2D fingerprints still valuable for drug discovery?
Kaifu Gao, Duc Duy Nguyen, Vishnu Sresht, Alan M. Mathiowetz, Meihua Tu, Guo-Wei Wei
Department of Mathematics, Michigan State University, MI 48824, USA. Pfizer Medicine Design, 610 Main St, Cambridge, MA 02139, USA. Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA. Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA.

November 5, 2019
Abstract
Recently, molecular fingerprints extracted from three-dimensional (3D) structures using advanced mathematics, such as algebraic topology, differential geometry, and graph theory, have been paired with efficient machine learning, especially deep learning algorithms, to outperform other methods in drug discovery applications and competitions. This raises the question of whether classical 2D fingerprints are still valuable in computer-aided drug discovery. This work considers 23 datasets associated with four typical problems, namely protein-ligand binding, toxicity, solubility, and partition coefficient, to assess the performance of eight 2D fingerprints. Advanced machine learning algorithms, including random forest, gradient boosted decision tree, single-task deep neural network, and multitask deep neural network, are employed to construct efficient 2D-fingerprint-based models. Additionally, appropriate consensus models are built to further enhance the performance of 2D-fingerprint-based methods. It is demonstrated that 2D-fingerprint-based models perform as well as the state-of-the-art 3D structure-based models for the predictions of toxicity, solubility, partition coefficient, and protein-ligand binding affinity based on only ligand information. However, 3D structure-based models outperform 2D-fingerprint-based methods in complex-based protein-ligand binding affinity predictions.
I Introduction
Drug discovery is a multi-parameter optimization process, which involves a long list of chemical, biological, and physiological properties. For a drug candidate, numerous drug-related properties must be assessed, including binding affinity, toxicity, octanol-water partition coefficient (Log P), aqueous solubility (Log S), etc. Binding affinity assesses the strength of a drug's binding to its target, while toxicity is a measure of the degree to which a chemical compound can adversely damage an organism. In addition, a partition coefficient is defined as the ratio of concentrations of a solute in a mixture of two immiscible solvents at equilibrium and, in the case of Log P, represents the drug-relatedness of a compound as well as its hydrophobic effect on human bodies. Another relevant drug attribute is aqueous solubility, which plays a vital role in distribution, absorption, and biological activity, among other processes, because 65-90% of body mass is water.
Their importance to drug design and discovery has been emphasized by many recent surveys.
Indeed, unsatisfactory toxicity or pharmacokinetic properties are responsible for approximately half of drug candidate failures to reach the market. Traditional experiments for measuring drug properties are conducted either in vivo or in vitro. Such experiments are quite time consuming and expensive. Additionally, testing with animals can raise important ethical concerns. Therefore, various computer-aided or in silico methods have become more attractive, since they can produce quick results without sacrificing much accuracy in many situations. Among them, one of the most popular approaches is quantitative structure-activity/property relationship (QSAR/QSPR) analysis. It assumes that similar molecules have similar bioactivities or physicochemical properties. Based on this assumption, the activities and properties of new molecules can be predicted by studying the correlation between chemical or structural features of molecules and their activities or properties, reducing the need for time-consuming experiments.

Molecular fingerprints are one way of encoding the structural features of a molecule. They play a fundamental role in QSAR/QSPR analysis, virtual screening, similarity-based compound search, target molecule ranking, drug ADMET prediction, and other drug discovery processes. Molecular fingerprints are property profiles of a molecule, usually in the form of vectors, with each vector element indicating the existence, the degree, or the frequency of one particular structural feature. Various fingerprints have been developed for molecular feature encoding in the past few decades.

∗ Corresponding to Guo-Wei Wei. Email: [email protected]
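Fingerprints are consumed downstream as fixed-length bit vectors, so the similarity-based compound searches mentioned above reduce to simple bit arithmetic. A common score is the Tanimoto coefficient; the sketch below is only illustrative, with made-up bit patterns rather than real fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length 0/1 vectors: |A and B| / |A or B|."""
    on_both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    on_any = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return on_both / on_any if on_any else 0.0

# Two hypothetical 8-bit fingerprints (real ones have hundreds or thousands of bits)
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 0, 1, 1, 0]
sim = tanimoto(fp1, fp2)  # 3 bits in common / 5 bits set anywhere = 0.6
```

A similarity search simply ranks a compound library by this score against a query fingerprint.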
Most fingerprints are 2D fingerprints, which can be extracted from molecular connection tables without 3D structural information. However, high-dimensional fingerprints have also been developed to utilize 3D molecular structures and other information.

There are four main categories of 2D fingerprints, namely substructure key-based fingerprints, topological or path-based fingerprints, circular fingerprints, and pharmacophore fingerprints. Substructure key-based fingerprints are bit strings representing the presence of certain substructures or fragments, from a given list of structural keys, in the compound. The molecular access system (MACCS) is one of the most popular substructure key-based fingerprints. Topological or path-based fingerprints are based on analyzing all the fragments of a molecule along a (usually linear) path up to a certain number of bonds, and then hashing every one of these paths to create the fingerprint. The most prominent ones in this category are the FP2, Daylight, and electro-topological state (Estate) fingerprints. Circular fingerprints are also hashed topological fingerprints, but rather than looking for paths in a molecule, they record the environment of each atom up to a pre-determined radius. A well-known example of this class is the extended-connectivity fingerprint (ECFP). Pharmacophore fingerprints include the relevant features and interactions needed for a molecule to be active against a given target; examples include the 2D-pharmacophore and extended reduced graph (ERG) fingerprints. Since 2D fingerprints rely only on 2D structures, their generation is easy, fast, and convenient.

In addition to the four categories mentioned above, recent improvements in deep learning have enabled the creation of neural fingerprints, where the mapping between fingerprints and 2D structures is learned simultaneously with the parameters of the regression/classification model that maps fingerprints to targets.
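The hashing step behind the path-based fingerprints described above can be illustrated in a few lines: enumerate linear atom paths and hash each path string into a fixed-length bit vector. This is only a toy sketch; the molecule encoding, hash function, path length, and bit count are all illustrative choices, not those of FP2 or Daylight.

```python
from hashlib import md5

def path_fingerprint(atoms, bonds, n_bits=256, max_len=4):
    """Hash all linear atom paths of up to max_len atoms into an n_bits bit vector."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    bits = [0] * n_bits

    def walk(path):
        key = "-".join(atoms[i] for i in path)              # e.g. "C-C-O"
        bits[int(md5(key.encode()).hexdigest(), 16) % n_bits] = 1
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:                         # linear paths only, no revisits
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return bits

# Ethanol's heavy-atom skeleton as a toy graph: C-C-O
fp = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Circular fingerprints such as ECFP differ in that they hash each atom's neighborhood out to a fixed radius instead of hashing linear paths.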
These 'learned' fingerprints can potentially improve predictive performance on QSAR/QSPR tasks, but they must be relearned when trying to predict new properties across significantly different regions of chemical space. Since the focus of this work is on comparing 2D and 3D descriptors across a number of disparate tasks and chemically diverse datasets, we have chosen not to consider neural fingerprints.

Most commonly used 2D molecular fingerprints were derived over a decade ago, and their validation was carried out with classical regression or classification algorithms, such as linear regression, logistic regression, logistic classification, naive Bayes, k-nearest neighbors, support vector machines, etc. On the other hand, new 3D structure-based fingerprints built from algebraic topology, differential geometry, geometric graph theory, and algebraic graph theory have been developed in recent years. In particular, these new fingerprints were mostly paired with advanced machine learning algorithms, such as random forest (RF), gradient boosting decision tree (GBDT), single-task deep neural networks (ST-DNNs), multitask deep neural networks (MT-DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), etc., which are now easily accessible to the scientific community via user-friendly deep learning frameworks in popular programming languages. Often, these new methods have demonstrated higher accuracy or better performance than earlier methods in the literature, which are typically based on 2D fingerprints and/or simple machine learning algorithms, for drug discovery related applications such as protein-ligand binding, virtual screening, toxicity, solubility, partition coefficient, as well as protein folding stability change upon mutation.
Additionally, recent results from the D3R Grand Challenges, a community-wide annual competition series in computer-aided drug design, indicate that structure-based methods using sophisticated 3D structure-based fingerprints have an advantage over ligand-based methods using 2D fingerprints in scoring and free energy predictions.
These developments raise an interesting question of whether 2D fingerprints are still valuable for drug design and discovery. Therefore, there is a pressing need to reassess 2D fingerprints with advanced machine learning algorithms and compare their performance with state-of-the-art 3D structure-based fingerprints for drug discovery related applications.

The objective of the present work is to reassess the predictive power of eight popular 2D fingerprints on four important drug-related problems, namely toxicity, binding affinity, Log P, and Log S, involving a total of 23 datasets. These problems are selected for the availability of reference results generated by state-of-the-art 3D structure-based fingerprints in the literature. To optimize the 2D fingerprints' performance, advanced machine learning algorithms, including RF, GBDT, ST-DNN, and MT-DNN, are employed in the present study. Additionally, consensus models are constructed from appropriate combinations of 2D fingerprint-based predictions to further enhance their performance. The predictive power of each 2D fingerprint for certain functional groups is analyzed. Extensive numerical studies over 23 datasets using eight 2D fingerprints and four different machine learning algorithms indicate that the combination of appropriate machine learning algorithms and 2D fingerprint-based models, particularly consensus models, can bring significant improvements over previous 2D QSPR approaches, especially for toxicity predictions. Moreover, 2D fingerprint-based models perform as well as state-of-the-art 3D structure-based fingerprints in the predictions of toxicity, solubility, partition coefficient, and ligand-based protein-ligand binding affinity.
Finally, topology-based fingerprints extracted from 3D protein-ligand complexes have a significant advantage over 2D fingerprints in complex-based protein-ligand binding affinity predictions. We believe that the present performance analysis and assessment will provide a useful guideline on how to choose appropriate fingerprints and machine learning methods for drug discovery related applications.
II Methods

II.A 2D fingerprints
In the present work, we investigate eight popular 2D fingerprints, including the FP2, MACCS, Daylight, Estate1, Estate2, and ECFP4 fingerprints, the 2D-pharmacophore fingerprint (Pharm2D), and the extended reduced graph fingerprint (ERG). They are chosen to represent the four main 2D molecular fingerprint categories, namely substructure key-based fingerprints, topological or path-based fingerprints, circular fingerprints, and pharmacophore fingerprints, and are some of the most popular and commonly used ones. Table 1 summarizes the information related to these fingerprints. All 2D fingerprints were generated with Openbabel (version 2.4.1) and RDKit (version 2018.09.3).

II.B Ensemble methods
Two popular ensemble methods were used in our work. The first is random forest (RF), which constructs a multitude of decision trees during training. RF can be used to predict a classification label (classification model) or the mean prediction (regression model) of the individual trees. It is very robust against overfitting and easy to use. The second is gradient boosting decision tree (GBDT). In this approach, individual decision trees are combined in a stage-wise fashion to achieve the capability of learning complex features. It uses both gradient and boosting strategies to reduce model errors. Compared to deep neural network (DNN) approaches, these two ensemble methods are robust, relatively insensitive to hyperparameters, and easy to implement. Moreover, they are much faster to train than DNNs. In fact, for small datasets, RF and GBDT can perform even better than DNNs or other deep learning algorithms. Therefore, these methods have been applied to a variety of QSAR prediction problems, such as toxicity, solvation, and binding affinity predictions.
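The boosting idea behind GBDT can be seen in a toy one-dimensional regressor: each stage fits a depth-1 "stump" to the current residuals and is added with a small learning rate. This is a didactic sketch only; the actual models in this work are scikit-learn's ensembles with the hyperparameters listed in Table 2, and the data below are made up.

```python
def fit_stump(x, residual):
    """Best single-threshold split of 1-D data minimizing squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    return best[1:]  # (threshold, left mean, right mean)

def gbdt_1d(x, y, n_stages=50, lr=0.1):
    """Gradient boosting for squared loss: repeatedly fit stumps to the residuals."""
    pred = [sum(y) / len(y)] * len(y)                # start from the mean
    for _ in range(n_stages):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, ml, mr = fit_stump(x, resid)
        pred = [p + lr * (ml if xi <= t else mr) for xi, p in zip(x, pred)]
    return pred

# Made-up data with a step between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
fitted = gbdt_1d(x, y)
```

RF differs in that its trees are fit independently on bootstrap samples and averaged, rather than fit sequentially on residuals.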
II.C Single-task deep neural network (ST-DNN)
A DNN mimics the learning process of a biological brain by constructing a wide and deep architecture of numerous connected neuron units. A typical deep neural network often includes multiple hidden layers, with hundreds or even thousands of neurons in each layer. During the learning stage, the weights of each layer are updated by backpropagation. With a complex and deep network, a DNN is capable of constructing hierarchical features and modeling complex nonlinear relationships. An ST-DNN is a regular deep learning algorithm: it takes care of only one prediction task and therefore learns from only one specific training dataset. A typical four-layer ST-DNN is shown in Figure 1, where N_i (i = 1, ..., 4) represents the number of neurons in the i-th hidden layer.

II.D Multitask deep neural network (MT-DNN)
The multitask (MT) learning technique has achieved much success in qualitative Merck and Tox21 prediction challenges.
In the MT framework, multiple tasks share the same hidden layers, but the output layer is attached to different tasks. This framework enables the neural network to learn from all the data simultaneously for different tasks. Thus, the commonalities and differences among various datasets can be exploited. It has been shown that MT learning can typically improve the prediction accuracy of relatively small datasets if they are combined with relatively larger datasets in the training.

Figure 2 is an illustration of a typical four-layer MT-DNN training four different tasks simultaneously. Suppose there are in total T tasks, and the training data for the t-th task are (X_ti, y_ti), i = 1, ..., N_t, where t = 1, ..., T, N_t is the number of samples in the t-th task, X_ti is the feature vector of the i-th sample in the t-th task, and y_ti is the label of the i-th sample in the t-th task. The purpose of MT learning is to simultaneously minimize the loss over all tasks:

argmin_θ Σ_{t=1}^{T} Σ_{i=1}^{N_t} L(y_ti, f_t(X_ti, θ_t)),

where f_t(X_ti, θ_t) is the MT-DNN prediction for the i-th sample in the t-th task, a function of the feature vector X_ti; L is the loss function; and θ_t is the collection of model parameters for the t-th task. A popular cost function for regression is the mean squared error, which for the t-th task can be defined as

L_t = (1/N_t) Σ_{i=1}^{N_t} (y_ti − f_t(X_ti, θ_t))².

Fingerprint | Description | Number of features | Package
FP2 | A path-based fingerprint which indexes small-molecule fragments based on linear segments of up to 7 atoms | 256 | Openbabel
Daylight | A path-based fingerprint consisting of 2048 bits and encoding all connectivity pathways of a given length through a molecule | 2048 |
MACCS | A substructure key-based fingerprint with 166 structural keys based on SMARTS patterns | 166 |
ECFP | A circular fingerprint which uses an iterative process to assign numeric identifiers to each atom | |
Table 1: A summary of the 2D fingerprints studied in this work.

In this study, the MT learning technique is applied to toxicity prediction. The ultimate goal of this MT learning is to improve the overall performance of the multiple toxicity prediction models, especially for the smallest dataset, which performs relatively poorly in the ST-DNN. More concretely, it is reasonable to assume that different toxicity indexes share a common pattern, so these different tasks can be trained simultaneously when their feature vectors are constructed in the same manner. For our toxicity prediction, four different tasks (the LD50, IGC50, LC50, and LC50-DM data sets) are trained together. This leads to four output neurons in the output layer (see O1 to O4 in Figure 2), with each neuron being specific to one of the four tasks.

II.E Hyperparameters

Ensemble hyperparameters.
Both RF and GBDT were implemented with the scikit-learn package (version 0.20.1). In this work, there are a total of 23 datasets, with training-set sizes varying from 94 to 8199. RF has been shown to be consistent and robust across various datasets. However, if its parameters are carefully tuned to the size of a given training set, GBDT can attain better performance than RF in most cases. For all experiments in this work, the most essential parameters of GBDT are chosen as learning_rate=0.01, min_samples_split=3, and max_features='sqrt'. Detailed values of other parameters are given in Table 2.

Figure 1: An illustration of a typical ST-DNN. Only one task (data set) is trained in this network. Four hidden layers are included; k_i (i = 1, 2, 3, 4) represents the number of neurons in the i-th hidden layer, and N_i,j is the j-th neuron in the i-th hidden layer. Here, O is the single output for the task.

Figure 2: An illustration of a typical MT-DNN training four tasks (datasets) simultaneously. Four hidden layers are included in this network; k_i (i = 1, 2, 3, 4) represents the number of neurons in the i-th hidden layer, and N_i,j is the j-th neuron in the i-th hidden layer. Here, O1 to O4 represent four predictor outputs for the four tasks.

Network hyperparameters.
Since the numbers of features differ considerably among the 2D fingerprints, different network architectures have to be adopted. For example, the Estate1 fingerprint has only 79 bits; therefore, a 4-layer network is chosen, with 500, 1000, 1500, and 500 neurons in the four hidden layers. However, the Daylight fingerprint has as many as 2048 features, and thus a much larger network is needed. The network for this fingerprint still has 4 layers, but with 3000, 2000, 1000, and 500 neurons in the first, second, third, and fourth hidden layers, respectively. Other network parameters are as follows: the optimizer is stochastic gradient descent (SGD) with a momentum of 0.5; 2000 epochs were run for all networks; the mini-batch size is set to 4; and the learning rate is set to 0.01 for the first 1000 epochs and 0.001 for the remaining epochs. Our tests indicate that adding dropout or using L2 decay does not necessarily improve the accuracy, and thus we omit these two techniques. All network hyperparameters are summarized in Table 3. These hyperparameters are applied to both ST-DNN and MT-DNN. All DNN training is performed with PyTorch (version 1.0).

Training-set size | RF parameters | GBDT parameters
< 800 | n_estimators=1000, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0 | n_estimators=2000, max_depth=9, min_samples_split=3, learning_rate=0.01, subsample=0.1, max_features='sqrt'
800 to 5000 | (same as above) | n_estimators=10000, max_depth=7, min_samples_split=3, learning_rate=0.01, subsample=0.3, max_features='sqrt'
5000 to 10000 | (same as above) | n_estimators=20000, max_depth=7, min_samples_split=3, learning_rate=0.01, subsample=0.3, max_features='sqrt'
Table 2: RF and GBDT parameters for different training-set sizes.

Fingerprint | Number of features | Number of hidden layers | Number of neurons in each hidden layer | Optimizer | Mini-batch | Learning rate
Estate1 | 79 | 4 | 500, 1000, 1500, 500 | SGD with a momentum of 0.5 | 4 | First 1000 epochs: 0.01; then: 0.001
Estate2 | 79 | 4 | 500, 1000, 1500, 500 | (same) | (same) | (same)
Daylight | 2048 | 4 | 3000, 2000, 1000, 500 | (same) | (same) | (same)
Table 3: The network hyperparameters for both ST-DNN and MT-DNN.
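The shared-hidden-layer architecture and multitask objective of Section II.D can be sketched in a few lines of numpy. All shapes, weights, and data below are random placeholders rather than the Table 3 architecture, and the gradient updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def mt_loss(X_tasks, y_tasks, W_shared, heads):
    """Sum over tasks of the per-task mean squared error: every task shares
    W_shared (the common hidden layer) but has its own linear output head."""
    total = 0.0
    for X, y, w in zip(X_tasks, y_tasks, heads):
        hidden = np.maximum(X @ W_shared, 0.0)   # shared ReLU hidden layer
        pred = hidden @ w                        # task-specific output neuron
        total += np.mean((y - pred) ** 2)
    return total

n_features, n_hidden, n_tasks = 16, 8, 4
W_shared = rng.normal(size=(n_features, n_hidden))           # shared parameters
heads = [rng.normal(size=n_hidden) for _ in range(n_tasks)]  # one head per task
sizes = (30, 20, 10, 5)                 # tasks may have very different data sizes
X_tasks = [rng.normal(size=(n, n_features)) for n in sizes]
y_tasks = [rng.normal(size=n) for n in sizes]
loss = mt_loss(X_tasks, y_tasks, W_shared, heads)
```

Because the gradient of this summed loss flows through W_shared from every task, small datasets effectively borrow statistical strength from larger ones.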
III Results

III.A Toxicity prediction
Four toxicity datasets were studied in our work, namely oral rat LD50 (LD50), 40 h Tetrahymena pyriformis IGC50 (IGC50), 96 h fathead minnow LC50 (LC50), and 48 h Daphnia magna LC50 (LC50-DM). Among them, LD50 measures the amount of a chemical that kills half of the rats when orally ingested. IGC50 records the 50% growth-inhibitory concentration of the Tetrahymena pyriformis organism after 40 h. LC50 reports the concentration of a test chemical in water, in milligrams per liter, that causes 50% of fathead minnows to die after 96 h. The last one is LC50-DM, which represents the concentration of a test chemical in water, in milligrams per liter, that causes 50% of Daphnia magna to die after 48 h. The unit of toxicity reported in these four datasets is −log10(mol/L). All of them are accessible from recent publications.

Data set | Total size | Train set size | Test set size | Max value | Min value
LC50 | 823 | 659 | 164 | 9.261 | 0.037
Table 4: A quantitative summary of the toxicity data sets.

III.A.1 The performance of ensemble methods
Because they are easy to implement and fast to train, the two ensemble methods, RF and GBDT, were tested first. Since the four datasets have very different sizes, different numbers of estimators are used in the RF and GBDT models. Specifically, for the two relatively small sets, LC50 and LC50-DM, the number of estimators is set to 2000. For IGC50, 10000 estimators are used. For the largest set, LD50, we used 20000 estimators. The accuracy is reported in terms of the squared Pearson correlation coefficient (R²). Overall, GBDT's performance is consistently better than that of RF, which agrees with an earlier publication. Among all eight fingerprints tested, Estate2, Estate1, Daylight, FP2, ECFP, and MACCS usually work well on these four sets. Thus, the consensus of these six fingerprints was also considered ("Top 6-cons" in Figure 3). The consensus model typically gives a further improvement over all single fingerprints in most cases.

Figure 3: The R² on the LD50, IGC50, LC50, and LC50-DM test sets yielded by eight fingerprints and the consensuses of the top 6 fingerprints. Two ensemble methods were adopted (GBDT: blue, RF: red). The values shown in the figure are the R² of GBDT.

(a) LD50 test set. The LD50 dataset is the largest set, with as many as 7413 compounds. However, a higher experimental uncertainty of the values in this set makes it relatively difficult to predict (see "Max value" and "Min value" in Table 4). In our GBDT model, the best single fingerprint (MACCS) yields an R² of 0.643, while the consensus of the top 6 fingerprints increases the R² to 0.679.

(b) IGC50 test set. The IGC50 set is the second largest set (1792 compounds) among the four sets we investigated. As indicated in Table 4, the diversity of molecules in this set is the lowest among the four sets, which means the prediction should be somewhat easier. Our results show that Estate2 is the best single fingerprint, with an R² of 0.742, and the consensus of the top 6 fingerprints leads to an R² of 0.785.

(c) LC50 test set. The LC50 set is a relatively smaller set (823 compounds). With the GBDT model, the Estate2 fingerprint achieves the top performance, yielding an R² of 0.662. The consensus of the top 6 fingerprints improves the R² to 0.715.

(d) LC50-DM test set. Among the four sets, the LC50-DM set is the smallest, with only 283 training molecules and 70 test molecules, which makes it troublesome to build a robust model. Therefore, our model only generates moderate results. Specifically, the best single fingerprint, Estate1, only reaches an R² of 0.520. The consensus model even degrades the R² slightly, to as low as 0.486. A similar difficulty is also faced by other recent work; for example, the R² of the 3D-topology-based GBDT model only reaches 0.505. Thus, there is a need for multitask deep learning when dealing with such a small dataset.
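The consensus models used throughout this section are plain averages of the individual fingerprint models' predictions, scored by the squared Pearson correlation. A small sketch with made-up prediction vectors:

```python
def pearson_r2(y_true, y_pred):
    """Square of the Pearson correlation coefficient between two lists."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mt) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    return cov * cov / (var_t * var_p)

def consensus(*prediction_lists):
    """Element-wise average of several models' predictions."""
    return [sum(vals) / len(vals) for vals in zip(*prediction_lists)]

y_test = [2.1, 3.0, 1.2, 4.5, 3.3]      # hypothetical experimental values
model_a = [2.0, 3.2, 1.5, 4.1, 3.0]     # e.g. an Estate2-based model
model_b = [2.4, 2.7, 1.0, 4.8, 3.6]     # e.g. a MACCS-based model
cons = consensus(model_a, model_b)      # [2.2, 2.95, 1.25, 4.45, 3.3]
r2 = pearson_r2(y_test, cons)
```

As the LC50-DM case shows, such averaging helps only when the individual models are themselves reasonably accurate.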
III.A.2 The performance of single-task and multitask deep learning
On average, Estate2, Estate1, and Daylight are the top three fingerprints when using GBDT models on all four sets. Thus, these three fingerprints were selected for the higher-level ST-DNN and MT-DNN models. Since the lengths of the three fingerprints differ considerably, different DNN architectures are needed. Four hidden layers with 500, 1000, 1500, and 500 neurons are used for Estate1 and Estate2, whose fingerprints have 79 features. Four hidden layers with 3000, 2000, 1000, and 500 neurons are used for Daylight, whose fingerprint has 2048 bits.

The pattern of the ST-DNN results is similar to that of the GBDT results. On the four data sets, an ST-DNN consensus model yields an average R² of 0.658 (0.632, 0.791, 0.687, and 0.523, respectively). As a comparison, the average R² of a GBDT consensus model is 0.666 (0.679, 0.785, 0.715, and 0.486, respectively). However, the performance can be largely enhanced by the multitask strategy, because the two relatively smaller sets, LC50 and LC50-DM, can benefit much from the two larger sets, LD50 and IGC50. As shown in Table 5, while the MT-DNN model barely changes the performance on LD50 and IGC50, it gives rise to a dramatic improvement on LC50 and LC50-DM, especially on LC50-DM, where the consensus lifts the R² from 0.523 to 0.725.

Method | R² of LD50 | R² of IGC50 | R² of LC50 | R² of LC50-DM
Estate2 ST-DNN | 0.484 | 0.715 | 0.569 | 0.433
Estate2 MT-DNN | 0.489 | 0.696 | 0.660 | 0.623
Estate1 ST-DNN | 0.569 | 0.733 | 0.650 | 0.601
Estate1 MT-DNN | 0.566 | 0.735 | 0.694 | 0.684
Daylight ST-DNN | 0.619 | 0.701 | 0.570 | 0.346
Daylight MT-DNN | 0.617 | 0.717 | 0.724 | 0.694
Consensus ST-DNN | 0.632 | 0.791 | 0.687 | 0.523
Consensus MT-DNN | 0.639 | 0.794 | 0.765 | 0.725
Table 5: The R² of ST-DNN and MT-DNN based on the top 3 fingerprints in GBDT (Estate2, Estate1, Daylight) and their consensuses.

III.A.3 Systematic comparison with other toxicity predictions
A systematic comparison with other methods is provided in Table 6. The same datasets were also used to develop the Toxicity Estimation Software Tool (T.E.S.T.), so many related results can be found in its user's guide, including hierarchical, single-model, FDA, group-contribution, nearest-neighbor, and T.E.S.T. consensus methods. Since T.E.S.T. is also based on 2D descriptors, the comparison between the results of the present models and T.E.S.T. largely reflects the predictive power of the present models. As shown in Table 6, on the LD50, IGC50, and LC50 sets, the present MT-DNN consensus always leads to a higher R² than the T.E.S.T. consensus. In particular, on the IGC50 and LC50 sets, the present MT-DNN consensus models clearly beat T.E.S.T. (0.794 vs 0.764 and 0.765 vs 0.728), and on the LD50 set, the present GBDT results outperform T.E.S.T. (0.679 vs 0.626). On the LC50-DM set, because the training set is so small (283 molecules), the ensemble methods (RF and GBDT) and the ST-DNN are not well suited: the R² of GBDT and ST-DNN are, respectively, 0.486 and 0.523. However, the R² of MT-DNN reaches 0.725 for the LC50-DM dataset, which is quite comparable to the T.E.S.T. result with an R² of 0.739.

The 2D MT-DNN consensus has an average R² of 0.731 over these four datasets, while the average of the T.E.S.T. model is 0.714, and the recent 3D structure-based topological MT-DNN consensus result is also 0.731. These results confirm that 2D fingerprints integrated with the MT-DNN model surpass the previous 2D models and are as good as the recent 3D structure-based topological model.

III.B Aqueous solubility (Log S)
For Log S, following the previous literature, we test Klopman's test set with the original training set. The unit of Log S in these sets is log units. Since the size of the training set is 1290, 10000 estimators were used in the GBDT model.

Data set | Method | R² | RMSE | Coverage
LD50 | The present 2D MT-DNN consensus | 0.639 | 0.549 | 1.000
LD50 | The present 2D GBDT consensus | 0.679 | 0.580 | 1.000
IGC50 | The present 2D MT-DNN consensus | 0.794 | 0.457 | 1.000
IGC50 | The present 2D GBDT consensus | 0.785 | 0.457 | 1.000
LC50 | The present 2D MT-DNN consensus | 0.765 | 0.718 | 1.000
LC50 | The present 2D GBDT consensus | 0.715 | 0.783 | 1.000
LC50-DM | The present 2D MT-DNN consensus | 0.725 | 0.935 | 1.000
LC50-DM | The present 2D GBDT consensus | 0.486 | 1.239 | 1.000
Table 6: The R², RMSE, and coverage of the present 2D consensus models on the four toxicity test sets.

The consensus of the top 6 fingerprints achieves an R² and RMSE of 0.944 and 0.684, respectively. The consensus of the top 3 is even better, improving the R² and RMSE to 0.955 and 0.648 (see Table 8). A systematic comparison with other methods is included in Table 9.

Fingerprint | R² | RMSE
Cons-top 3 | 0.955 | 0.648
Cons-top 6 | 0.944 | 0.684
MACCS | 0.958 | 0.664
Estate1 | 0.932 | 0.791
Daylight | 0.923 | 0.780
FP2 | 0.908 | 0.853
ECFP | 0.904 | 0.875
Estate2 | 0.897 | 0.907
Pharm2D | 0.832 | 1.114
ERG | 0.811 | 1.202
Table 8: The R² and RMSE of Log S predictions by eight fingerprints and the consensuses of the top 3 and top 6 on Klopman's test set.

Method | R² | RMSE
Cons-top 3 | 0.955 | 0.648
Cons-top 6 | 0.944 | 0.684
MT-ESTD+-1 (3D) | |
Table 9: Comparison of the present Log S consensus models with other methods on Klopman's test set.

III.C Partition coefficient (Log P)
Three Log P test sets were evaluated using the GBDT model. The training set has 8199 molecules and was originally compiled by Cheng et al. There are three test sets, namely FDA, Star, and Non-star, which are summarized in Table 10. The Log P in these sets is in units of log mol/L. Due to the size of the training set, 20000 estimators are used in the GBDT model.

Accuracy is reported as R² or the acceptable rate. The acceptable rate here is defined as the percentage of molecules with a prediction error of less than 0.5. On all three sets, the 2D fingerprints Estate2, Estate1, MACCS, and ECFP are always the top 4. The consensus of the top 4 fingerprints produces an R² of up to 0.901 on the FDA set and attains an acceptable rate of 71.3% on the Star set. On the Non-star set, the top-4 consensus is somewhat worse than the best single fingerprint, Estate1, but it is still in second place with an acceptable rate of 46.5% (see Figure 4).

A detailed comparison with other Log P prediction methods is shown in Table 11. On the FDA data set, GBDT-ESTD+-2-AD and MT-ESTD-1 are based on 3D descriptors. The GBDT-ESTD+-2-AD model includes some molecules from the NIH dataset in its training set; therefore, its performance is slightly better than the present one. The 2D method ALOGPS also performs slightly better (0.908 vs 0.901) than the present one. However, a previous study has pointed out that, for the PHYSPROP database, the training set of ALOGPS actually contains all of the compounds in the FDA set. It is unclear how well it would perform if the overlapping compounds were removed from the training set. Unlike ALOGPS, XLOGP3's training data are completely independent of the test set. In this case, the present prediction is more accurate than that of XLOGP3 (0.901 vs 0.872). The present results on the Star and Non-star sets are also systematically compared with other state-of-the-art models, as shown in Table 12.
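The acceptable rate used for the Star and Non-star comparisons is straightforward to compute: the fraction of molecules whose absolute prediction error falls below 0.5 log units. The values below are illustrative, not taken from the paper's data sets.

```python
def acceptable_rate(y_true, y_pred, cutoff=0.5):
    """Fraction of predictions whose absolute error is below the cutoff."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) < cutoff)
    return hits / len(y_true)

logp_true = [1.2, 0.4, 3.1, 2.2, -0.5]   # hypothetical experimental Log P
logp_pred = [1.0, 1.1, 3.3, 2.1, -1.2]   # hypothetical model predictions
rate = acceptable_rate(logp_true, logp_pred)  # 3 of 5 within 0.5 -> 0.6
```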
For the Star set, we achieve 71% of the total number of molecules with a predicted error of less than 0.5 (an acceptable rate of 71%). This result is quite satisfactory and is comparable to the 3D structure-based model developed by Wu et al., which has an acceptable rate of 72% on the same training set ("MT-ESTD-1" in Table 12). Many commercial software packages have been developed to predict Log P, such as AB/Log P, S+logP, ACD/log P, etc. However, we cannot verify whether the training sets used in these software packages overlap with the Star set. It is more meaningful to compare the present model to XLOGP3, since its training dataset does not contain any molecules in the test set. Again, the present model outperforms the XLOGP3 package on the Star set, with acceptable rates of 71% and 60%, respectively. On the Non-star set, none of the published methods performs as accurately as on the FDA and Star data sets, since the structures in the Non-star set are relatively new and complex. Thus, our model also achieves an acceptable rate of only 47%. However, it is still tied for third place among all predictors. This result is even better than those of some 3D structure-based models, though the RMSE is relatively high due to a few large outliers.

Figure 4: The performance of eight fingerprints and the consensuses of the top 4 on the FDA, Star, and Non-star data sets of Log P. To be consistent with previous results, R² is reported on the FDA set, while the acceptable rate is reported on the Star and Non-star datasets.

Method | R² | RMSE
GBDT-ESTD+-2-AD (2D+3D) | |
Our Cons-top 4 (2D) | 0.901 | 0.63
XLOGP3 (2D) | 0.872 |
Table 11: Comparison of Log P predictions on the FDA set.

Method | Star set (N=223): % error < 0.5 | % error 0.5 to 1.0 | RMSE | Non-star set (N=43): % error < 0.5 | % error 0.5 to 1.0 | RMSE
AB/Log P | 84 | 12 | 0.41 | 42 | 23 | 1.00
MT-ESTD+-1-AD | 77 | 16 | 0.49 | 49 | 19 | 0.98
S+logP | 76 | 22 | 0.45 | 40 | 35 | 0.87
ACD/logP | 75 | 17 | 0.50 | 44 | 32 | 1.00
CLOGP | 74 | 20 | 0.52 | 47 | 28 | 0.91
MT-ESTD-1 | 72 | 18 | 0.55 | 33 | 28 | 1.01
ALOGPS | 71 | 23 | 0.53 | 42 | 30 | 0.82
Our cons-top 4 | 71 | 18 | 0.625 | 47 | 16 | 1.233
MiLogP | 69 | 22 | 0.57 | 49 | 30 | 0.86
KowWIN | 68 | 21 | 0.64 | 40 | 30 | 1.05
TLOGP | 67 | 16 | 0.74 | 30 | 37 | 1.12
CSLogP | 66 | 22 | 0.65 | 58 | 19 | 0.93
SLIPPER-2002 | 62 | 22 | 0.80 | 35 | 23 | 1.23
XLOGP3 | 60 | 30 | 0.62 | 47 | 23 | 0.89
XLOGP2 | 57 | 22 | 0.87 | 35 | 23 | 1.16
QLOGP | 48 | 26 | 0.96 | 21 | 26 | 1.42
VEGA | 47 | 27 | 1.04 | 28 | 30 | 1.24
SPARC | 45 | 22 | 1.36 | 28 | 21 | 1.70
LSER | 44 | 26 | 1.07 | 35 | 16 | 1.26
CLIP | 41 | 25 | 1.05 | 33 | 9 | 1.54
MLOGP(Sim+) | 38 | 30 | 1.26 | 26 | 28 | 1.56
HINTLOGP | 34 | 22 | 1.80 | 30 | 5 | 2.72
NC+NHET | 29 | 26 | 1.35 | 19 | 16 | 1.71
Table 12: Comparison of Log P predictions on the Star and Non-star sets.
III.D Protein-ligand binding affinity prediction

III.D.1 The S1322 dataset
To assess the predictive power of 2D-fingerprint based models, two protein-ligand binding affinity datasets were investigated. The first is denoted the S1322 set. It is a high-quality data set of 1322 protein-ligand complexes involving 7 protein clusters (labeled CL1, CL2, ..., CL7), and it is a subset of the refined set of PDBbind v2015. The other dataset is PDBbind v2016, in which the refined set excluding the core set is used as the training data and the core set serves as the test set. These two sets are summarized in Table 13.

                 S1322 set                                PDBbind v2016
CL1   CL2   CL3   CL4   CL5   CL6   CL7    refined set   training set   core set (test set)
333   264   219   156   134   122   94     4057          3767           290

Table 13: The quantitative summary of the S1322 and PDBbind v2016 data sets.

The ligand-based model is used in the present work. For the S1322 set, a 5-fold cross validation was conducted with the GBDT method. To be consistent with results in the previous literature, accuracy is reported in terms of the Pearson correlation coefficient (R). Because the results from the Daylight and Pharm2D fingerprints are relatively poor, they are omitted here. The performance of the other six fingerprints (ECFP, FP2, Estate2, MACCS, Estate1, ERG) and their consensus is shown in Figure 5.

Figure 5 indicates that for all seven clusters, the consensuses of the six fingerprints largely outperform any single fingerprint. Specifically, the R values of the consensus models are 0.717, 0.847, 0.708, 0.718, 0.831, 0.777, and 0.760 on the seven clusters, respectively, and 0.765 on average. These results are comparable to those achieved by a ligand-based 3D topology and GBDT model.

III.D.2 PDBbind v2016 refined set and core set
The present ligand-based model was also tested on PDBbind v2016. Rather than cross validation, here the core set is treated as a test set. Quite consistent with the cross validation on the S1322 set, the consensus of the six fingerprints yields a large improvement over any single fingerprint, with an R of 0.747. These results indicate that the present model has a stable and reliable performance on different protein-ligand binding affinity data sets.

Figure 5: Pearson correlation coefficient (R) on the seven clusters of the S1322 data set yielded by the six fingerprints (ECFP, FP2, Estate2, MACCS, Estate1, ERG) and their consensuses.

Figure 6: The R on the PDBbind v2016 binding affinity set yielded by the six fingerprints (ECFP, FP2, Estate2, MACCS, Estate1, ERG) and their consensus.

A recent study reports 2D fingerprint-based complex models, in which a recently developed 2D fingerprint is used to encode protein-ligand complex information. When combined with a DNN, that method gives rise to an R of 0.817 on the PDBbind v2016 core set. A complicated 3D structure-based model using the topology of the protein-ligand complex developed by our group, TopBP (Complex), attains an R of 0.861 on the same set. Table 14, which reports R and RMSE (in kcal/mol) for each method, lists these results.

IV Discussion

IV.A General analysis
In the present work, the predictive power of eight popular 2D fingerprints, as well as their consensuses, on four important drug-related properties (i.e., toxicity, Log S, Log P, and binding affinity) was investigated. The present study reveals that, with a proper machine learning algorithm, 2D fingerprint-based models, including their consensuses, outperform other 2D QSPR approaches in most cases, especially for toxicity predictions. Additionally, 2D fingerprint-based models are comparable to state-of-the-art 3D structure-based models in most drug-related property predictions, except for protein-ligand binding affinity prediction. Considering that 2D fingerprints are very "cheap" molecular descriptors that are easy and fast to generate, these results are impressive. They show that 2D fingerprints paired with appropriate machine learning algorithms are still very valuable for practical problems, such as the prediction of toxicity, aqueous solubility (Log S), and partition coefficient (Log P). However, for protein-ligand binding affinity prediction, complex-based models using 3D topological fingerprints have a major advantage over the present 2D fingerprints, i.e., they are about 15% more accurate.

IV.B The performance analysis of 2D fingerprints

IV.B.1 Analysis of 2D fingerprints for PDBbind v2016 core set predictions
The performance of each 2D fingerprint can be systematically analyzed by comparing the difference between the prediction errors of every pair of fingerprints as follows.

(1) The relative absolute error for the f-th fingerprint on the i-th sample (molecule) in the test set is defined by

    Error(f, i) = |predicted value(f, i) - experimental value(i)| / |experimental value(i)|

Ranking  FP2                                     Estate1                          Estate2                                  MACCS
1        carbonyl group: 24/41                   carbonyl group: 25/42            carbonyl group: 23/41                    bicyclic compounds: 17/36
2        unfused benzene ring: 21/41             unfused benzene ring: 18/42      unfused benzene ring: 22/41              pyridine: 17/36
3        bicyclic compounds: 19/41               aniline: 14/42                   carboxylate ion: 16/41                   ether: 16/36
4        hydroxyl: 16/41                         carboxylate ion: 14/42           bicyclic compounds: 15/41                carbonyl group: 15/36
5        ether: 14/41                            hydroxyl: 14/42                  carbonyl group with N: 13/41             hydroxyl: 15/36
6        F, Cl, Br, I: 12/41                     ether: 14/42                     ether: 13/41                             unfused benzene ring: 12/36
7        amide: 10/41                            carbonyl with nitrogen: 13/42    hydroxyl: 11/41                          amide: 10/36
8        azole: 8/41                             amide: 11/42                     amide: 11/41                             carboxylate ion: 9/36
9        multiple non-fused benzene rings: 7/41  F, Cl, Br, I: 11/41              aniline: 10/41                           azole: 7/36
10       aniline: 7/41                           phenol: 8/42                     multiple non-fused benzene rings: 8/41   furan: 5/36

Table 15: The top 10 most frequently occurring functional groups in the PDBbind v2016 core set for each fingerprint. For each fingerprint, the occurrence frequency and the total number of molecules are also given.

This analysis is quite informative, as shown in Table 15. It indicates that the fingerprints perform differently on certain functional groups: some fingerprints perform better on some functional groups, while other fingerprints perform better on others. Therefore, one can select an appropriate fingerprint to represent a certain class of functional groups based on Table 15. For the FP2, Estate1, and Estate2 fingerprints, the top two functional groups are carbonyl groups and unfused benzene rings. The MACCS fingerprint, however, is somewhat special: its top two functional groups are bicyclic compounds and pyridine. The third-ranked functional group differs considerably among the four fingerprints: bicyclic compounds for FP2, aniline for Estate1, carboxylate ion for Estate2, and ether for MACCS, which provides further guidance for choosing fingerprints. For example, if a molecule contains aniline, then Estate1 should be selected.
Noticeably, some functional groups occur exclusively for one or two types of fingerprints. For example, F, Cl, Br, and I are only on the lists of FP2 and Estate1, while azole appears only on the lists of FP2 and MACCS, and multiple non-fused benzene rings appear only on those of FP2 and Estate2. Moreover, phenol occurs only for Estate1, and furan only for MACCS.
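The per-molecule relative absolute error and the ranked pairwise error differences underlying this analysis can be sketched as follows; the predictions and experimental values below are synthetic stand-ins for two fingerprint models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictions from two fingerprint-based models and experimental values.
experimental = rng.uniform(1.0, 10.0, size=100)
pred_fp2 = experimental + rng.normal(0, 1.0, 100)
pred_maccs = experimental + rng.normal(0, 1.2, 100)

def relative_abs_error(pred, expt):
    """Error(f, i) = |predicted value - experimental value| / |experimental value|."""
    return np.abs(pred - expt) / np.abs(expt)

# Step (1): per-molecule relative errors for each fingerprint.
err_fp2 = relative_abs_error(pred_fp2, experimental)
err_maccs = relative_abs_error(pred_maccs, experimental)

# Pairwise error differences, ranked: negative entries mark molecules on which
# the first fingerprint outperforms the second, positive entries the reverse.
diff = np.sort(err_fp2 - err_maccs)
fp2_better = np.sum(diff < 0)
print(f"FP2 beats MACCS on {fp2_better} of {diff.size} molecules")
```

Grouping the molecules on which one fingerprint wins by their functional groups is what produces occurrence tables such as Table 15.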
IV.B.2 Analysis of 2D fingerprints for the IGC50 toxicity data set and other data sets

Using the same 5-step procedure outlined above, we carried out a performance analysis for the IGC50 toxicity dataset, which is shown in Figure 8 and Table 16. The molecules in this toxicity data set are typically small and simple, so the functional groups in Table 16 are also small. Moreover, since there are not many functional groups in these relatively simple molecules, only the top 8 functional groups are presented in the table, ranked for the FP2, Daylight, Estate1, and Estate2 fingerprints; for each fingerprint, the occurrence frequency and the total number of molecules are also given. Similar to the performance on the binding affinity set, for the top 4 fingerprints on the toxicity set, the carbonyl group is in first place. Unfused benzene rings also have a high occurrence frequency, ranking second or third. The difference between the performance of the various fingerprints mainly involves sulfide and aliphatic chains with 8 or more members: the FP2 fingerprint works well on sulfide, whereas Daylight, Estate1, and Estate2 work well on aliphatic chains with 8 or more members.

Figure 8: The ranked error differences between pairs of fingerprints for the IGC50 toxicity set of 358 molecules. Only the top 4 fingerprints (i.e., Estate2, FP2, Estate1, Daylight) are considered.

The same performance analyses were also conducted for the other toxicity and Log P data sets; the results are shown in Tables S1 to S4. These tables indicate that, for the LD50, LC50, and LC50-DM toxicity data sets, the performance of the Estate1 and Estate2 fingerprints is similar: both work well on bicyclic compounds. In comparison, the FP2 fingerprint works better on aliphatic chains with 8 or more members, and the Daylight fingerprint performs better on amide.
For the Log P data set, the ECFP and Estate2 fingerprints perform well on aniline, the Estate1 fingerprint works better on bicyclic compounds, and the MACCS fingerprint works better on unfused benzene rings.

IV.C The predictive power of the consensus of 2D fingerprints
The consensus of several different fingerprints typically further enhances the performance over any single fingerprint, and this enhancement can be quite significant. However, across the datasets of different drug-related properties, the best fingerprint combinations for the consensus are not consistent. One possible explanation is that different fingerprints are good at encoding certain functional groups, and datasets for different drug-related properties have different functional group distributions. This is also the reason why a consensus can enhance performance: the consensus captures more functional groups and counter-balances the systematic biases of the individual fingerprints.

For toxicity prediction, the best consensus combination is obtained with Estate2, Estate1, Daylight, FP2, ECFP, and MACCS. For Log S prediction, the best combination is achieved with MACCS, Estate1, and Daylight, while for Log P prediction the best consensus involves Estate2, Estate1, ECFP, and MACCS. Finally, for binding affinity prediction, the best consensus uses Estate2, Estate1, FP2, ECFP, MACCS, and ERG. It is worth noting that Estate-related models (Estate1, Estate2, or both) are always included in the best combinations; indeed, their individual performances are relatively good. This finding is not surprising, since Estate fingerprints encode the intrinsic electronic state of each atom as perturbed by the electronic influence of all other atoms, and it is well known that the electronic state is important to drug-related properties.
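A minimal sketch of such a consensus, averaging the predictions of several hypothetical single-fingerprint models (the data below are synthetic stand-ins):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Hypothetical test-set predictions from four single-fingerprint models,
# each with independent errors around the same experimental values.
experimental = rng.uniform(2.0, 12.0, size=150)
single_preds = [experimental + rng.normal(0, 1.0, 150) for _ in range(4)]

# Consensus: the simple average of the selected single-fingerprint predictions,
# which counter-balances the individual models' biases.
consensus = np.mean(single_preds, axis=0)

r_single = [pearsonr(p, experimental)[0] for p in single_preds]
r_consensus = pearsonr(consensus, experimental)[0]
print(f"best single R = {max(r_single):.3f}, consensus R = {r_consensus:.3f}")
```

With independent error sources, averaging shrinks the error variance, which is the statistical mechanism behind the consensus gains reported above.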
IV.D Multitask deep learning
Multitask deep learning was utilized in our toxicity prediction. It turns out that the smallest set, LC50-DM, with only 283 training samples, benefits dramatically from the multitask deep learning strategy: its R value rises from 0.523 to 0.725. This is because, in the framework of multitask deep learning, different data sets (tasks) share similar structure-function relationships. When a small dataset is trained together with a large dataset through shared neural networks, the statistics learned from the large dataset in the shared neurons help predict the property of the small dataset. As a result, the other three large toxicity sets can share the patterns they learned during training with the small toxicity set, enhancing its prediction. Therefore, multitask deep learning can be a useful strategy for training on relatively small datasets.

IV.E The limitations and advantages of 2D fingerprints
Typically, 2D fingerprints only encode small molecules, such as ligands, although high-level 2D fingerprint models including both proteins and ligands have also been developed.
Theoretically, 2D fingerprints are more suitable for target-independent or target-unspecific problems involving small molecules, such as toxicity, solvation free energy, aqueous solubility, partition coefficient, permeability, etc. The current investigation confirms this point: for toxicity, aqueous solubility, and partition coefficient, the present 2D-fingerprint based methods perform quite similarly to, or in some cases even somewhat better than, 3D structure-based methods.

For protein-ligand binding affinity predictions, both ligand-based and complex-based approaches are examined. For ligand-based approaches, 2D-fingerprint based methods can perform as well as 3D structure-based models. However, 3D structure-based topological models outperform 2D-fingerprint based methods (i.e., R: 0.861 vs 0.747 on the PDBbind v2016 core test). In fact, more sophisticated 2D fingerprint models that utilize protein-ligand complex information and DNNs are still not as accurate as 3D topology-based models (i.e., R: 0.817 vs 0.861 on the PDBbind v2016 core test and 0.774 vs 0.808 on the PDBbind v2013 core test). Essentially, algebraic topology is designed to simplify the geometric complexity of biological macromolecules; it is therefore able to extract vital information from protein-ligand complexes for predicting their binding affinities.

2D fingerprints are much easier to generate than 3D structure-based fingerprints built from algebraic topology, differential geometry, or graph theory. Therefore, 2D-fingerprint based models can be useful tools for preliminary drug screening studies.

V Conclusion
Two-dimensional molecular fingerprints, or 2D fingerprints, refer to molecular structural patterns, such as elemental composition, atomic connectivity, functional groups, 2D pharmacophores, etc., extracted from a molecule without taking into account the 3D structural representation of these properties. 2D fingerprints have been a main workhorse of cheminformatics and bioinformatics for decades. However, their validations on various datasets were typically carried out a long time ago with earlier machine learning algorithms. Recently, new 3D structure-based molecular fingerprints built from algebraic topology, differential geometry, geometric graph theory, and algebraic graph theory have found much success in drug discovery related applications, including the D3R Grand Challenges. This raises the interesting question of whether 2D fingerprints are still competitive in drug discovery related applications.

This work reassesses the performance of 2D fingerprints in drug discovery related applications. We consider a total of eight commonly used 2D fingerprints, namely FP2, Daylight, MACCS, Estate1, Estate2, ECFP, Pharm2D, and ERG. Four types of drug discovery related applications with 23 datasets are designed to validate the 2D fingerprints: solubility (Log S) and partition coefficient (Log P), which are independent of a target protein; toxicity, which may depend on certain unknown target proteins; and protein-ligand binding affinity, which depends on known target proteins. Advanced machine learning algorithms, including random forest (RF), gradient boosting decision trees (GBDT), single-task deep neural networks (ST-DNN), and multitask deep neural networks (MT-DNN), are used to optimize the performance of the above 2D fingerprints on the aforementioned four types of datasets. In particular, MT-DNN is designed to enhance the performance of 2D fingerprints on relatively small datasets by simultaneous training with relatively large datasets that share a similar pattern.
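The multitask strategy summarized above (a shared hidden layer feeding one regression head per task) can be sketched in plain NumPy; the two synthetic tasks, their sizes, and the layer widths below are illustrative stand-ins, not the paper's actual MT-DNN:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two hypothetical tasks sharing a structure-activity pattern: one large, one small.
def make_task(n):
    X = rng.normal(size=(n, 16))
    y = X[:, :4].sum(axis=1)  # the shared pattern both tasks obey
    return X, y

tasks = {"large": make_task(400), "small": make_task(30)}

# Shared hidden layer plus one linear regression head per task.
W1 = rng.normal(scale=0.1, size=(16, 32)); b1 = np.zeros(32)
heads = {t: [rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)] for t in tasks}

lr = 0.01
for epoch in range(300):
    for t, (X, y) in tasks.items():          # alternate tasks each epoch
        H = np.maximum(X @ W1 + b1, 0.0)     # shared ReLU layer
        W2, b2 = heads[t]
        pred = (H @ W2 + b2).ravel()
        g = 2 * (pred - y)[:, None] / len(y)       # gradient of MSE w.r.t. pred
        gW2 = H.T @ g; gb2 = g.sum(axis=0)
        gH = (g @ W2.T) * (H > 0)
        gW1 = X.T @ gH; gb1 = gH.sum(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2             # task-specific head update
        W1 -= lr * gW1; b1 -= lr * gb1             # shared layer sees both tasks

# The small task benefits from gradients contributed by the large task
# through the shared layer.
X_s, y_s = tasks["small"]
H = np.maximum(X_s @ W1 + b1, 0.0)
mse_small = np.mean(((H @ heads["small"][0] + heads["small"][1]).ravel() - y_s) ** 2)
print(f"small-task training MSE: {mse_small:.3f}")
```

The key design point is that only the heads are task-specific: every gradient step on the large task also refines the shared representation that the small task relies on.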
Since each fingerprint may have an explicit bias toward certain functional groups or 2D patterns, we construct various consensus models to further boost the performance of 2D fingerprints on all the datasets. Finally, the strengths of the top four 2D fingerprints for predicting protein-ligand binding affinity and quantitative toxicity are analyzed in detail.

Our general findings are as follows. 1) 2D fingerprint-based models are as good as 3D structure-based models on various toxicity, Log S, and Log P datasets under the same training-test conditions. 2) For ligand-based protein-ligand binding affinity predictions, 2D fingerprint-based models perform as well as 3D structure-based models that rely only on ligand 3D structures. 3) 3D structure-based models that utilize 3D protein-ligand complex information outperform 2D fingerprint models based on either ligand information or protein-ligand complex information. 4) Advanced machine learning algorithms, such as DNN and MT-DNN, are crucial for 2D fingerprints to achieve optimal performance. 5) No single 2D fingerprint outperforms all other 2D fingerprints in all applications. 6) An appropriate consensus of a few 2D models typically achieves better performance. Therefore, when combined with advanced machine learning algorithms, 2D fingerprints are still competitive in most drug discovery related applications, except for those that involve protein structures.
Acknowledgments
This work was supported in part by NSF Grants DMS-1721024, DMS-1761320, and IIS-1900473 and NIH grant GM126189.

References

[1] Li Di and Edward H Kerns.
Drug-like properties: concepts, structure design and methods from ADME to toxicity optimization. Academic Press, 2015.
[2] Niel M Henriksen, Andrew T Fenley, and Michael K Gilson. Computational calorimetry: high-precision calculation of host–guest binding thermodynamics. Journal of chemical theory and computation, 11(9):4377–4394, 2015.
[3] Kaifu Gao, Jian Yin, Niel M Henriksen, Andrew T Fenley, and Michael K Gilson. Binding enthalpy calculations for a neutral host–guest pair yield widely divergent salt effects across water models. Journal of chemical theory and computation, 11(10):4555–4564, 2015.
[4] Kedi Wu and Guo-Wei Wei. Quantitative toxicity prediction using topology based multitask deep neural networks. Journal of chemical information and modeling, 58(2):520–531, 2018.
[5] Kedi Wu, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. TopP–S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. Journal of computational chemistry, 39(20):1444–1454, 2018.
[6] Christopher A Lipinski, Franco Lombardo, Beryl W Dominy, and Paul J Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced drug delivery reviews, 23(1-3):3–25, 1997.
[7] Li Di and Edward H Kerns. Biological assay challenges from compound solubility: strategies for bioassay optimization. Drug discovery today, 11(9-10):446–451, 2006.
[8] Andrew L Hopkins, György M Keserü, Paul D Leeson, David C Rees, and Charles H Reynolds. The role of ligand efficiency metrics in drug discovery. Nature reviews Drug discovery, 13(2):105, 2014.
[9] Pascale Atallah, Kenneth B Wagener, and Michael D Schulz. ADMET: the future revealed. Macromolecules, 46(12):4735–4741, 2013.
[10] Han Van De Waterbeemd and Eric Gifford. ADMET in silico modelling: towards prediction paradise? Nature reviews Drug discovery, 2(3):192, 2003.
[11] Kyaw-Zeyar Myint, Lirong Wang, Qin Tong, and Xiang-Qun Xie. Molecular fingerprint-based artificial neural networks QSAR for ligand biological activity predictions. Molecular pharmaceutics, 9(10):2912–2923, 2012.
[12] Hanna Geppert, Martin Vogt, and Jurgen Bajorath. Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. Journal of chemical information and modeling, 50(2):205–216, 2010.
[13] Kunal Roy and Indrani Mitra. Electrotopological state atom (E-state) index in drug design, QSAR, property prediction and toxicity assessment. Current computer-aided drug design, 8(2):135–158, 2012.
[14] Mahmud Tareq Hassan Khan. Predictions of the ADMET properties of candidate drug molecules utilizing different QSAR/QSPR modelling approaches. Current drug metabolism, 11(4):285–295, 2010.
[15] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
[16] Yu-Chen Lo, Stefano E Rensi, Wen Torng, and Russ B Altman. Machine learning in chemoinformatics and drug discovery. Drug discovery today, 2018.
[17] Adrià Cereto-Massagué, María José Ojeda, Cristina Valls, Miquel Mulero, Santiago Garcia-Vallvé, and Gerard Pujadas. Molecular fingerprint similarity search in virtual screening. Methods, 71:58–63, 2015.
[18] Jitender Verma, Vijay M Khedkar, and Evans C Coutinho. 3D-QSAR in drug design: a review. Current topics in medicinal chemistry, 10(1):95–115, 2010.
[19] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of MDL keys for use in drug discovery. Journal of chemical information and computer sciences, 42(6):1273–1280, 2002.
[20] Noel M O'Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. Open Babel: An open chemical toolbox. Journal of cheminformatics, 3(1):33, 2011.
[21] Daylight Chemical Information Systems, Inc. Daylight.
[22] Lowell H Hall and Lemont B Kier. Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. Journal of Chemical Information and Computer Sciences, 35(6):1039–1045, 1995.
[23] Greg Landrum et al. RDKit: Open-source cheminformatics, 2006.
[24] Nikolaus Stiefl, Ian A Watson, Knut Baumann, and Andrea Zaliani. ErG: 2D pharmacophore descriptions for scaffold hopping. Journal of chemical information and modeling, 46(1):208–220, 2006.
[25] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
[26] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Are learned molecular representations ready for prime time? arXiv preprint arXiv:1904.01561, 2019.
[27] Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International journal for numerical methods in biomedical engineering, 34(2):e2914, 2018.
[28] Zixuan Cang, Lin Mu, and Guo-Wei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS computational biology, 14(1):e1005929, 2018.
[29] Duc Duy Nguyen and Guo-Wei Wei. DG-GL: Differential geometry-based geometric learning of molecular datasets. International journal for numerical methods in biomedical engineering, 35(3):e3179, 2019.
[30] Duc D Nguyen, Tian Xiao, Menglun Wang, and Guo-Wei Wei. Rigidity strengthening: A mechanism for protein–ligand binding. Journal of chemical information and modeling, 57(7):1715–1721, 2017.
[31] David Bramer and Guo-Wei Wei. Multiscale weighted colored graphs for protein flexibility and rigidity analysis. The Journal of chemical physics, 148(5):054103, 2018.
[32] Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-Wei Wei. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. Journal of computer-aided molecular design, 33(1):71–82, 2019.
[33] Vladimir Svetnik, Andy Liaw, Christopher Tong, J Christopher Culberson, Robert P Sheridan, and Bradley P Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences, 43(6):1947–1958, 2003.
[34] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
[35] Imad A Basheer and Maha Hajmeer. Artificial neural networks: fundamentals, computing, design, and application. Journal of microbiological methods, 43(1):3–31, 2000.
[36] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[37] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[39] Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics, 33(22):3549–3557, 2017.
[40] Zied Gaieb, Conor D Parks, Michael Chiu, Huanwang Yang, Chenghua Shao, W Patrick Walters, Millard H Lambert, Neysa Nevins, Scott D Bembenek, Michael K Ameriks, et al. D3R Grand Challenge 3: blind prediction of protein–ligand poses and affinity rankings. Journal of computer-aided molecular design, 33(1):1–18, 2019.
[41] T Martin. User's guide for TEST (version 4.2) (Toxicity Estimation Software Tool): A program to estimate toxicity from molecular structure, 2016.
[42] HL Morgan. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation, 5(2):107–113, 1965.
[43] Bao Wang, Chengzhang Wang, Kedi Wu, and Guo-Wei Wei. Breaking the polar-nonpolar division in solvation free energy prediction. Journal of computational chemistry, 39(4):217–233, 2018.
[44] Bao Wang, Zhixiong Zhao, Duc D Nguyen, and Guo-Wei Wei. Feature functional theory–binding predictor (FFT–BP) for the blind prediction of binding free energies. Theoretical Chemistry Accounts, 136(4):55, 2017.
[45] Stephen J Capuzzi, Regina Politi, Olexandr Isayev, Sherif Farag, and Alexander Tropsha. QSAR modeling of Tox21 challenge stress response and nuclear receptor signaling toxicity assays. Frontiers in Environmental Science, 4:3, 2016.
[46] Bharath Ramsundar, Bowen Liu, Zhenqin Wu, Andreas Verras, Matthew Tudor, Robert P Sheridan, and Vijay Pande. Is multitask deep learning practical for pharma? Journal of chemical information and modeling, 57(8):2068–2076, 2017.
[47] Jan Wenzel, Hans Matter, and Friedemann Schmidt. Predictive multitask deep neural network models for ADME-Tox properties: Learning from large data sets. Journal of chemical information and modeling, 2019.
[48] Zhuyifan Ye, Yilong Yang, Xiaoshan Li, Dongsheng Cao, and Defang Ouyang. An integrated transfer learning and multitask learning approach for pharmacokinetic parameter prediction. Molecular pharmaceutics, 16(2):533–541, 2018.
[49] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
[50] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, 6, 2017.
[51] Kevin S Akers, Glendon D Sinks, and T Wayne Schultz. Structure–toxicity relationships for selected halogenated aliphatic chemicals. Environmental toxicology and pharmacology, 7(1):33–39, 1999.
[52] Hao Zhu, Alexander Tropsha, Denis Fourches, Alexandre Varnek, Ester Papa, Paola Gramatica, Tomas Oberg, Phuong Dao, Artem Cherkasov, and Igor V Tetko. Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. Journal of chemical information and modeling, 48(4):766–784, 2008.
[53] TJ Hou, Ke Xia, Wei Zhang, and XJ Xu. ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach. Journal of chemical information and computer sciences, 44(1):266–275, 2004.
[54] Gilles Klopman, Shaomeng Wang, and Donald M Balthasar. Estimation of aqueous solubility of organic molecules by the group contribution approach. Application to the study of biodegradation. Journal of chemical information and computer sciences, 32(5):474–482, 1992.
[55] Tiejun Cheng, Yuan Zhao, Xun Li, Fu Lin, Yong Xu, Xinglong Zhang, Yan Li, Renxiao Wang, and Luhua Lai. Computation of octanol-water partition coefficients by guiding an additive model with knowledge. Journal of chemical information and modeling, 47(6):2140–2148, 2007.
[56] Alex Avdeef. Absorption and drug development: solubility, permeability, and charge state. John Wiley & Sons, 2012.
[57] Raimund Mannhold, Gennadiy I Poda, Claude Ostermann, and Igor V Tetko. Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds. Journal of pharmaceutical sciences, 98(3):861–893, 2009.
[58] P Howard and W Meylan. Physical/chemical property database (PHYSPROP), 1999.
[59] Zhihai Liu, Yan Li, Li Han, Jie Li, Jie Liu, Zhixiong Zhao, Wei Nie, Yuchen Liu, and Renxiao Wang. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31(3):405–412, 2014.
[60] Minyi Su, Qifan Yang, Yu Du, Guoqin Feng, Zhihai Liu, Yan Li, and Renxiao Wang. Comparative assessment of scoring functions: The CASF-2016 update. Journal of chemical information and modeling, 59(2):895–913, 2018.
[61] Maciej Wójcikowski, Michał Kukiełka, Marta M Stepniewska-Dziubinska, and Pawel Siedlecki. Development of a protein–ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics, 2018.
[62] Indra Kundu, Goutam Paul, and Raja Banerjee. A machine learning approach towards the prediction of protein–ligand binding affinity based on fundamental molecular properties.