Attribution Methods Reveal Flaws in Fingerprint-Based Virtual Screening
Vikram Sundar Lucy Colwell
Abstract
Fingerprint-based models for protein-ligand binding have demonstrated outstanding success on benchmark datasets; however, these models may not learn the correct binding rules. To assess this concern, we use in silico datasets with known binding rules to develop a general framework for evaluating model attribution. This framework identifies fragments that a model considers necessary to achieve a particular score, sidestepping the need for a model to be differentiable. Our results confirm that high-performing models may not learn the correct binding rule, and suggest concrete steps that can remedy this situation. We show that adding fragment-matched inactive molecules (decoys) to the data reduces attribution false negatives, while attribution false positives largely arise from the background correlation structure of molecular data. Normalizing for these background correlations helps to reveal the true binding logic. Our work highlights the danger of trusting attributions from high-performing models and suggests that a closer examination of fingerprint correlation structure and better decoy selection may help reduce misattributions.
1. Introduction
Identifying ligands that bind tightly to a given protein target is a crucial first step in drug discovery. Experimental methods such as high-throughput screening are time-consuming and costly, while physics-based methods are computationally expensive and can be inaccurate (MacConnell et al., 2017; Schneider, 2017; Grinter & Zou, 2014; Chen, 2015). The emergence of large datasets enables data-driven approaches to be applied to this problem. In recent years, a variety of ML-based approaches have been developed to identify active ligands for a protein target given screening data (Gawehn et al., 2016; Colwell, 2018; Ripphausen et al., 2011). These approaches report outstanding in silico success on benchmark datasets (Ragoza et al., 2017; Ramsundar et al., 2017; Gomes et al., 2017).

Google Research, Mountain View, CA, USA. Department of Chemistry, University of Cambridge, Cambridge, UK. Correspondence to: Lucy Colwell <[email protected]>. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

However, recent studies analyzing how neural network models attribute their results suggest that even high-performing models often do not learn the correct rule (McCloskey et al., 2019). Fingerprint-based models share these issues; Sheridan (2019) observed that high-performing models of the same dataset attribute binding activity to different atoms within a molecule. Attribution of virtual screening models is particularly important because accurate identification of pharmacophores would enable medicinal chemists to improve potential drug candidates.

In this paper, we evaluate attributions from fingerprint-based virtual screening models, complementing a previous analysis of graph convolutional models (McCloskey et al., 2019). We propose a general framework for evaluating model attributions that does not require gradients, and use in silico datasets to evaluate a number of standard models.
In agreement with previous work (Sheridan, 2019), our results establish that high-performing models may not learn the correct binding rule. Going beyond previous work, we analyze properties of the data that lead to misattributions, and provide insight into how these can be mitigated. Our analysis reveals that attribution results can be improved by (i) adding fragment-matched decoys, and (ii) accounting for spurious correlations in the data that originate from both the nature of small molecule structures and the definition of fingerprint descriptors.
2. Methods
To measure attribution we construct a number of in silico datasets, where each dataset has a specified binding logic that requires randomly selected fragments to be present for a molecule to be active. We identified active and inactive ligands from the ZINC12 database (Irwin et al., 2012) using each binding logic to generate each dataset. Dataset 0 required a benzene, alkyne, and amino group for a molecule to be considered active; dataset 1 an alkyne, benzene, and hydroxyl; dataset 2 a fluorine, alkene, and benzene; and dataset 3 a benzene, ether, and amino group. Our binding logics mimic known pharmacophores for real proteins; for example, most binders to soluble epoxide hydrolase have an amide and a urea group (Waltenberger et al., 2016).

We used ECFP6 fingerprints with 2048 bits (Rogers & Hahn, 2010) to featurize molecules. We tested Naive Bayes, Logistic Regression, and Random Forest models, all implemented using scikit-learn (Pedregosa et al., 2011). Naive Bayes used no prior; Logistic Regression used C = 1; Random Forest used a fixed number of trees with a capped maximum depth. Model AUCs (area under the receiver operating characteristic curve) were computed on a randomly held-out test set with an even split of actives and inactives. All train/test splits were repeated to measure the AUC variation.

For attribution analysis we retrained the models on all the data, without any hold-out sets. Since some models were not differentiable, we developed an attribution method that works for any fingerprint-based model. Our method relies on SIS (sufficient input subsets), which identifies subsets of the input features sufficient to reach a particular score threshold for the given model (Carter et al., 2018). Our null feature mask was the all-zeros vector. If our model predicted a binding probability of x for a given molecule, the threshold used for SIS attribution was set a fixed small decrement below x.

The next step in our attribution procedure is to translate the significant features (fingerprint bits) back to fragments. For each molecule, we used rdkit to indicate the relevant fragment or fragments for each bit (Landrum, 2006). For a pre-trained model m and a specific molecule, we generate an attribution vector v_m whose dimensionality is the number of atoms in the molecule. Each element of v_m corresponds to an atom in the molecule and counts the number of significant fragments containing that atom. When a single bit corresponds to multiple fragments, all are considered significant. The ground truth attribution vector v_R for each molecule has v_R = 1 for any atom belonging to a fragment in the binding logic and v_R = 0 otherwise.

We used v_m and v_R to compute four metrics of attribution accuracy. First, for a given molecule the cosine similarity (v_m · v_R) / (||v_m|| ||v_R||) provides an attribution score for the model m. This score is normalized by the attribution score for the same model trained on randomly labeled data, to avoid data structure effects (Adebayo et al., 2018). The attribution false positive score is the proportion of atoms that the model erroneously considers relevant. The attribution false negative score is the proportion of relevant atoms that the attribution misses. Finally, the comparative attribution score between two models m_1 and m_2 is (v_{m_1} · v_{m_2}) / (||v_{m_1}|| ||v_{m_2}||). We computed attribution scores for a set of randomly selected molecules, each predicted to bind with probability above a fixed threshold, to estimate errors in attribution scores.
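For concreteness, these attribution metrics reduce to a few lines of vector arithmetic. The sketch below is an illustration only (the function name and toy vectors are hypothetical, not from the paper's code); the comparative attribution score between two models is the same cosine formula applied to their two attribution vectors.

```python
import numpy as np

def attribution_metrics(v_m, v_r):
    """Compute attribution metrics for one molecule.

    v_m: per-atom attribution counts produced by the model.
    v_r: ground-truth indicator (1 = atom belongs to a binding-logic fragment).
    """
    v_m = np.asarray(v_m, dtype=float)
    v_r = np.asarray(v_r, dtype=float)
    # Cosine similarity between model and ground-truth attribution vectors.
    score = v_m @ v_r / (np.linalg.norm(v_m) * np.linalg.norm(v_r))
    # False positives: fraction of atoms erroneously flagged as relevant.
    fp = np.mean((v_m > 0) & (v_r == 0))
    # False negatives: fraction of truly relevant atoms the attribution misses.
    fn = np.sum((v_m == 0) & (v_r == 1)) / np.sum(v_r == 1)
    return score, fp, fn
```

The same `score` computation, applied to the attribution vectors of two different models in place of `v_r`, gives the comparative attribution score.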
3. Attribution Results
The models, especially logistic regression and random forest, perform outstandingly well on the in silico datasets; Figure 1 shows that many models achieve very high AUCs. However, the normalized attribution scores in Figure 1 are much lower. Even high-performing models tend to generate poor attribution scores with large variance between molecules, suggesting that the models are learning something very different from the ground truth logic.
Figure 1. Comparison of attribution scores to AUC on in silico datasets (AUC vs. normalized attribution score; models: Logistic Regression, Random Forest, Naive Bayes). The attribution score is a measure of how close the model's attribution is to the real rule for a particular molecule. Even high-performing models have poor attribution scores, suggesting that they are learning the wrong rule.
To better understand this data, Figure 2 splits the attribution score into false positives and false negatives, while Figure 3 shows some attribution images for dataset 1. Both suggest that logistic regression is more susceptible to false negatives, while random forest is more susceptible to false positives. These results call into question the validity of using these models to physically interpret predictions or to garner medicinal chemistry insight. To further establish that our attribution results provide insight into model performance, we used knowledge of the misattributed atoms to manually generate adversarial examples; a sample is shown in Figure 3. This shows that erroneous attributions correspond to weaknesses in the trained models.
To understand attribution false negatives, we compute the Pearson correlation between the presence of every feature and the binding activity of each molecule and examine those
Figure 2. (a) Attribution false positives and (b) false negatives by dataset, for Logistic Regression, Random Forest, and Naive Bayes models. We observe higher rates of attribution false negatives for the logistic regression models; random forest models have higher rates of false positives.
Figure 3. Attribution images and adversarial examples for dataset 1, with binding rule benzene, alkyne, and hydroxyl. For each model (Logistic Regression and Random Forest), the left molecule is the attribution image; red has the highest attribution, white the lowest. The right image is an adversarial example constructed using the observed misattributions (e.g. predicted binding probability 0.99 or 1.00 against an actual label of 0.0). The success of the adversarial examples indicates that our attribution method correctly identifies flaws in model performance.

features most correlated with activity. We observe that benzene does not appear among the top features for dataset 1 (data not shown), despite the fact that it is required in the binding logic, and despite the fact that the top-ranked features include several that are not present in the binding logic. This surprising observation likely explains why models fail to place high weight on benzene, as seen in Figure 3. To address this issue, we add fragment-matched decoys: inactive molecules that have some, but not all, of the fragments required for binding. Specifically, we include generic inactives, inactives with the first two fragments, inactives with the first and third fragments, and inactives with the second and third fragments.

We find that adding these fragment-matched decoys decreases the number of attribution false negatives (data not shown) and improves normalized attribution scores (Figure 4). However, our models are still not perfect: they retain many attribution false positives, and it is still possible to generate adversarial examples (data not shown). Thus adding fragment-matched decoys is necessary but not sufficient to improve the overall attribution accuracy of our models. When working with real-world datasets, one could identify fragment-matched decoys by screening molecules with some, but not all, of the fragments observed in known actives. For example, combinatorial libraries where molecules are generated by linking fragments selected from fragment libraries would serve this purpose.
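The feature-activity screen described above can be sketched in a few lines: compute the Pearson correlation between each fingerprint bit and the binary activity label, then rank the bits. This is an illustrative implementation (the function name and toy data are hypothetical, not from the paper's code).

```python
import numpy as np

def feature_activity_correlation(X, y):
    """Pearson correlation between each fingerprint bit and activity.

    X: (n_molecules, n_bits) binary fingerprint matrix.
    y: (n_molecules,) binary activity labels.
    Returns a per-bit correlation vector (NaN for constant bits).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each bit column
    yc = y - y.mean()                # center the labels
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        return Xc.T @ yc / denom

# Toy example: bit 0 tracks activity exactly; bits 1 and 2 are uninformative.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
r = feature_activity_correlation(X, y)
```

Sorting `r` by absolute value is then enough to inspect which bits a dataset most strongly associates with activity.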
Model N o r m a li z e d A tt r i b u t i o n S c o r e OriginalFragment-Matched Decoys
Figure 4.
Effect of Adding Fragment-Matched Decoys on Normal-ized Attribution Score for Dataset 1. Adding fragment-matcheddecoys improves overall attribution scores, but they are still notperfect, at least partly due to high rates of false positives.
To understand attribution false positives, we note that the features most correlated with activity in dataset 1 include a number of ethers that connect alkynes to benzenes or to alcohols. The alkynes, benzenes, and alcohols are part of the binding logic, but the ether is not. Similarly, in Figure 3, we see that ethers are incorrectly included in the attributed fragments. This suggests that at least some attribution false positives are due to features that are highly correlated with activity but are not part of the binding logic. This could be caused by the fact that the features are not truly independent, i.e. some features may co-occur in dataset molecules more often than would be expected at random. To measure this, we compute the Pearson correlation between every pair of features across all molecules in dataset 1. This analysis reveals that ether is highly correlated with both the benzene and alkyne groups that are present in the binding logic, explaining why ether is highly correlated with activity. Since the dataset molecules are drawn randomly from ZINC12, these high inter-feature correlations can only be a result of spurious background correlations already present in the ZINC12 dataset.

It has previously been shown that background correlations of hashed fingerprints are approximately drawn from the standard Marchenko-Pastur distribution that would be expected for random vectors drawn from a multivariate Gaussian distribution (Lee et al., 2016). This suggests that there is no additional correlation structure inherent in molecular fingerprint data. Despite this result, both the nature of small molecule structures and the manner in which fingerprints are constructed suggest that there should be correlations between bits that correspond to, e.g., fragments and their substructures.
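The pairwise feature correlation just described is a standard computation: `np.corrcoef` over the fingerprint matrix gives the Pearson correlation between every pair of bits, after dropping constant bits so the correlation is well defined. The function name and toy matrix below are hypothetical illustrations, not the paper's code.

```python
import numpy as np

def feature_pair_correlation(X):
    """Pairwise Pearson correlation between fingerprint bits.

    X: (n_molecules, n_bits) binary fingerprint matrix. Constant bits are
    dropped first; returns the correlation matrix and the kept bit indices.
    """
    X = np.asarray(X, dtype=float)
    varying = X.std(axis=0) > 0          # mask of non-constant bits
    corr = np.corrcoef(X[:, varying], rowvar=False)
    return corr, np.flatnonzero(varying)

# Toy example: bits 0 and 1 always co-occur, bit 2 is independent of them.
C, idx = feature_pair_correlation(
    [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
)
```

Large off-diagonal entries of `C` flag bit pairs that co-occur more often than chance, such as the ether/benzene and ether/alkyne pairs discussed above.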
To probe this more deeply, we computed the pair correlation matrix for a random sample of molecules from ZINC12 using unhashed ECFP6 fingerprints and found significant background correlation, as shown in Figure 5. This suggests that the hashing process obscures important information about the background correlation structure.

Figure 5. Correlation between background and dataset 1. (A) Top right: heatmap of the dataset 1 correlation matrix using hashed fingerprints; bottom left: background correlation matrix from unhashed fingerprints. (B) Scatter plot of these correlations. We observe that many spurious correlations in dataset 1 are likely caused by the background. Correlations that are only present in the dataset are useful for determining the correct binding logic for each dataset.
Our results indicate that some of the highest correlations in the hashed fingerprints observed in dataset 1 are related to this background correlation structure of the unhashed fingerprints, as shown by the cloud of points near the y = x line in the scatter plot of Figure 5. Other correlations appear only in the dataset and are not present in the background; these generate the cloud of points surrounding the y-axis in the scatter plot of Figure 5. Our trained models need to distinguish informative correlations caused by the binding logic from spurious background correlations caused by molecular descriptors.

Successfully distinguishing between spurious and legitimate correlations does help uncover the correlations that correspond directly to the binding logic. After normalizing the dataset correlations by the background correlations, the highest correlations are between an alkyne and a hydroxyl, a hydroxyl and a benzene, and an alkyne and a benzene: exactly the groups in the binding logic. Thus correlations that occur most strongly above background do correspond to the binding logic used to build the dataset.

To verify that attribution results for our in silico datasets reflect those for real datasets, we examined comparative attribution scores. Real activity data was acquired from ChEMBL24.1 (Davies et al., 2015; Gaulton et al., 2017) for a number of protein targets, with inactive molecules acquired from PubChem (Mervin et al., 2018; Kim et al., 2016) following the procedure in Sundar & Colwell (2020). We see similar degrees of disagreement between the models on the in silico datasets and on the real datasets, as measured by comparative attribution scores (data not shown). We note that if two high-performing models like logistic regression and random forest disagree on attributions for specific molecules for which they make accurate binding predictions, then at least one must be wrong.
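The normalization step can be realized in several ways; the exact procedure is not spelled out here, so the sketch below simply subtracts the background correlation matrix from the dataset correlation matrix and ranks the excess. Pairs near the y = x line of Figure 5 are suppressed, while dataset-only correlations stand out. All names are hypothetical illustrations.

```python
import numpy as np

def excess_correlation(dataset_corr, background_corr):
    """Highlight correlations present in the dataset but not the background.

    Both inputs are (n_bits, n_bits) Pearson correlation matrices over the
    same bit ordering. This sketch uses a simple difference as the
    normalization; self-correlations are zeroed as uninformative.
    """
    excess = np.asarray(dataset_corr) - np.asarray(background_corr)
    np.fill_diagonal(excess, 0.0)
    return excess

def top_pairs(excess, k=3):
    """Return the k bit pairs with the largest excess correlation."""
    iu = np.triu_indices_from(excess, k=1)   # upper triangle, no diagonal
    order = np.argsort(excess[iu])[::-1][:k]
    return [(iu[0][i], iu[1][i], excess[iu][i]) for i in order]

# Toy 3-bit example: the (1, 2) pair is correlated only in the dataset.
d = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
b = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.0], [0.1, 0.0, 1.0]])
pairs = top_pairs(excess_correlation(d, b), k=1)
```

On dataset 1, the analogous ranking surfaces the alkyne/hydroxyl, hydroxyl/benzene, and alkyne/benzene pairs reported above.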
4. Conclusions
Our results suggest a number of important cautionary notes about the success of fingerprint-based protein-ligand binding models. Even high-performing models may misattribute, and using in silico datasets to explicitly test model performance can help identify models that perform attribution correctly. We demonstrated that false negative attributions can be mitigated by adding fragment-matched decoys; future work will test the sensitivity to accurate decoy selection. A corresponding approach for real datasets would be to screen molecules with a subset of the fragments present in known actives. Further, we have shown that the strongest false positives originate from correlations present in the background data that cause various fragments to be spuriously correlated with activity. Thus one key to mitigating false positives is to develop models that account for the background correlation structure.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity Checks for Saliency Maps. Advances in Neural Information Processing Systems, 2018. arXiv:1810.03292.

Carter, B., Mueller, J., Jain, S., and Gifford, D. What made you do this? Understanding black-box decisions with sufficient input subsets. 2018. arXiv:1810.03805.

Chen, Y.-C. Beware of docking! Trends in Pharmacological Sciences, 36(2):78-95, 2015. doi: 10.1016/j.tips.2014.12.001.

Cleves, A. E. and Jain, A. N. Effects of inductive bias on computational evaluations of ligand-based modeling and on drug discovery. Journal of Computer-Aided Molecular Design, 22(3-4):147-159, 2008. doi: 10.1007/s10822-007-9150-y.

Colwell, L. J. Statistical and machine learning approaches to predicting protein-ligand interactions. Current Opinion in Structural Biology, 49:123-128, 2018. doi: 10.1016/j.sbi.2018.01.006.

Davies, M., Nowotka, M., Papadatos, G., Dedman, N., Gaulton, A., Atkinson, F., Bellis, L., and Overington, J. P. ChEMBL web services: Streamlining access to drug discovery data and utilities. Nucleic Acids Research, 43(W1):W612-W620, 2015. doi: 10.1093/nar/gkv352.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. Advances in Neural Information Processing Systems, 28:2224-2232, 2015.

Gaulton, A., Hersey, A., Nowotka, M. L., Patricia Bento, A., Chambers, J., Mendez, D., Mutowo, P., Atkinson, F., Bellis, L. J., Cibrian-Uhalte, E., Davies, M., Dedman, N., Karlsson, A., Magarinos, M. P., Overington, J. P., Papadatos, G., Smit, I., and Leach, A. R. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945-D954, 2017. doi: 10.1093/nar/gkw1074.

Gawehn, E., Hiss, J. A., and Schneider, G. Deep Learning in Drug Discovery. Molecular Informatics, 35(1):3-14, 2016. doi: 10.1002/minf.201501008.

Gomes, J., Ramsundar, B., Feinberg, E. N., and Pande, V. S. Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. 2017. arXiv:1703.10603.

Grinter, S. and Zou, X. Challenges, Applications, and Recent Advances of Protein-Ligand Docking in Structure-Based Drug Design. Molecules, 19(7):10150-10176, 2014. doi: 10.3390/molecules190710150.

Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. Journal of Chemical Information and Modeling, 52(7):1757-1768, 2012. doi: 10.1021/ci3001277.

Kearnes, S., McCloskey, K., Berndl, M., Pande, V., and Riley, P. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595-608, 2016. doi: 10.1007/s10822-016-9938-8.

Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B. A., Wang, J., Yu, B., Zhang, J., and Bryant, S. H. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202-D1213, 2016. doi: 10.1093/nar/gkv951.

Landrum, G. RDKit: Open-source Cheminformatics, 2006.

Lee, A. A., Brenner, M. P., and Colwell, L. J. Predicting protein-ligand affinity with a random matrix framework. Proceedings of the National Academy of Sciences, 2016. doi: 10.1073/pnas.1611138113.

Liu, S., Alnammi, M., Ericksen, S. S., Voter, A. F., Ananiev, G. E., Keck, J. L., Hoffmann, F. M., Wildman, S. A., and Gitter, A. Practical Model Selection for Prospective Virtual Screening. Journal of Chemical Information and Modeling, 59(1):282-293, 2019. doi: 10.1021/acs.jcim.8b00363.

MacConnell, A. B., Price, A. K., and Paegel, B. M. An Integrated Microfluidic Processor for DNA-Encoded Combinatorial Library Functional Screening. ACS Combinatorial Science, 19(3):181-192, 2017. doi: 10.1021/acscombsci.6b00192.

McCloskey, K., Taly, A., Monti, F., Brenner, M. P., and Colwell, L. Using Attribution to Decode Binding Mechanism in Neural Network Models for Chemistry. Proceedings of the National Academy of Sciences, 2019. doi: 10.1073/pnas.1820657116.

Mervin, L. H., Bulusu, K. C., Kalash, L., Afzal, A. M., Svensson, F., Firth, M. A., Barrett, I., Engkvist, O., and Bender, A. Orthologue chemical space and its influence on target prediction. Bioinformatics, 34(1):72-79, 2018. doi: 10.1093/bioinformatics/btx525.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J., and Koes, D. R. Protein-Ligand Scoring with Convolutional Neural Networks. Journal of Chemical Information and Modeling, 57(4):942-957, 2017. doi: 10.1021/acs.jcim.6b00740.

Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., and Pande, V. Massively Multitask Networks for Drug Discovery. 2015. arXiv:1502.02072.

Ramsundar, B., Liu, B., Wu, Z., Verras, A., Tudor, M., Sheridan, R. P., and Pande, V. Is Multitask Deep Learning Practical for Pharma? Journal of Chemical Information and Modeling, 57(8):2068-2076, 2017. doi: 10.1021/acs.jcim.7b00146.

Ripphausen, P., Nisius, B., and Bajorath, J. State-of-the-art in ligand-based virtual screening. Drug Discovery Today, 16(9-10):372-376, 2011. doi: 10.1016/j.drudis.2011.02.011.

Rogers, D. and Hahn, M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50(5):742-754, 2010. doi: 10.1021/ci100050t.

Rohrer, S. G. and Baumann, K. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. Journal of Chemical Information and Modeling, 2009. doi: 10.1021/ci8002649.

Schneider, G. Automating drug discovery. Nature Reviews Drug Discovery, 17(2):97-113, 2017. doi: 10.1038/nrd.2017.232.

Sheridan, R. P. Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. Journal of Chemical Information and Modeling, 53(4):783-790, 2013. doi: 10.1021/ci400084k.

Sheridan, R. P. Interpretation of QSAR Models by Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? Journal of Chemical Information and Modeling, 59:1324-1337, 2019.

Sundar, V. and Colwell, L. The Effect of Debiasing Protein-Ligand Binding Data on Generalization. Journal of Chemical Information and Modeling, 60(1):56-62, 2020. doi: 10.1021/acs.jcim.9b00415.

Unterthiner, T., Mayr, A., Klambauer, G., Steijaert, M., Wegner, J. K., Ceulemans, H., and Hochreiter, S. Deep Learning as an Opportunity in Virtual Screening. Proceedings of the Deep Learning Workshop at NIPS, pp. 1-9, 2014.

Verdonk, M. L., Berdini, V., Hartshorn, M. J., Mooij, W. T. M., Murray, C. W., Taylor, R. D., and Watson, P. Virtual Screening Using Protein-Ligand Docking: Avoiding Artificial Enrichment. Journal of Chemical Information and Computer Sciences, 44(3):793-806, 2004. doi: 10.1021/ci034289q.

Wallach, I. and Heifets, A. Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization. Journal of Chemical Information and Modeling, 2018. doi: 10.1021/acs.jcim.7b00403.

Wallach, I., Dzamba, M., and Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. 2015. arXiv:1510.02855.

Waltenberger, B., Garscha, U., Temml, V., Liers, J., Werz, O., Schuster, D., and Stuppner, H. Discovery of Potent Soluble Epoxide Hydrolase (sEH) Inhibitors by Pharmacophore-Based Virtual Screening.