We Should at Least Be Able to Design Molecules That Dock Well
Tobiasz Cieplinski, Tomasz Danel, Sabina Podlewska, Stanislaw Jastrzebski
Abstract
Designing compounds with desired properties is a key element of the drug discovery process. However, measuring progress in the field has been challenging due to the lack of realistic retrospective benchmarks, and the large cost of prospective validation. To close this gap, we propose a benchmark based on docking, a popular computational method for assessing molecule binding to a protein. Concretely, the goal is to generate drug-like molecules that are scored highly by SMINA, a popular docking software. We observe that popular graph-based generative models fail to generate molecules with a high docking score when trained using a realistically sized training set. This suggests a limitation of the current incarnation of models for de novo drug design. Finally, we propose a simplified version of the benchmark based on a simpler scoring function, and show that the tested models are able to partially solve it. We release the benchmark as an easy to use package available at https://github.com/cieplinski-tobiasz/smina-docking-benchmark. We hope that our benchmark will serve as a stepping stone towards the goal of automatically generating promising drug candidates.
Introduction

Designing compounds with desired chemical properties is the central challenge in the drug discovery process [Sliwoski et al., 2014, Elton et al., 2019]. De novo drug design is one of the most successful computational approaches: it generates new potential ligands from scratch, which avoids explicitly enumerating the vast space of possible structures. Recently, deep learning has unlocked new progress in drug design. Promising results using deep generative models have been shown in generating soluble [Kusner et al., 2017a], bioactive [Segler et al., 2018], and drug-like [Jin et al., 2018b] molecules.

A key challenge in the field of drug design is the lack of realistic benchmarks [Elton et al., 2019]. Ideally, a molecule generated by a de novo method should be tested in the wet lab for the desired property. In practice, a proxy is typically used. For example, the octanol-water partition coefficient or bioactivity is predicted using a computational model [Segler et al., 2018, Kusner et al., 2017a]. However, these models are often too simplistic [Elton et al., 2019]. This is aptly summarized in Coley et al. [2019] as "The current evaluations for generative models do not reflect the complexity of real discovery problems."

Figure 1: Visualization of the proposed docking-based benchmark for de novo drug design methods. First, the proposed molecule is docked to the target's binding site using SMINA, a popular docking software. In the most difficult version of the benchmark, the final score is computed from the ligand pose using the default SMINA docking score.

In contrast to drug design, more realistic benchmarks have been used in the design of photovoltaics [Pyzer-Knapp et al., 2015] or in the design of molecules with certain excitation energies [Sumita et al., 2018], where a physical calculation was carried out both to train models and to evaluate the generated compounds.

Our main contribution is a realistic benchmark for de novo drug design. We base our benchmark on docking, a popular computational method for predicting molecule binding to a protein. Concretely, the goal is to generate molecules that are scored highly by SMINA [Koes et al., 2013]. We picked SMINA due to its popularity and its availability under a free license. While we focus on de novo drug design, our methodology can be extended to retrospectively evaluate other approaches to designing molecules. Code to reproduce results and evaluate new models is available online at https://github.com/cieplinski-tobiasz/smina-docking-benchmark.

Our second contribution is exposing a limitation of currently popular de novo drug design methods for generating bioactive molecules. When trained on a few thousand compounds, a typical training set size, the tested methods fail to generate active structures according to the docking software. This suggests we should exercise caution when applying them in drug discovery pipelines, where we seldom have a larger number of known ligands. We hope our benchmark will serve as a stepping stone to further improve these promising models.

The paper is organised as follows. We first discuss prior work and introduce our benchmark.
Next, we use our benchmark to evaluate two popular models for de novo drug design. Finally, we analyse why the tested models fail on the most difficult version of the benchmark.
We begin by briefly discussing prior work and motivation. Next, we introduce our benchmark.
Standardized benchmarks are critical to measure progress in any field. The development of large-scale benchmarks such as ImageNet was critical for the recent developments in artificial intelligence [Deng et al., 2009, Wang et al., 2018]. Many new methods for de novo drug design are conceived every year, which motivates the need for a systematic and efficient way to compare them [Schneider and Clark, 2019].

De novo drug design methods are typically evaluated using proxy tasks that circumvent the need to test the generated compounds experimentally [Jin et al., 2018a, You et al., 2018, Maziarka et al., 2020, Kusner et al., 2017b, Gómez-Bombarelli et al., 2016]. Optimizing the octanol-water partition coefficient (logP) is a common example. The logP coefficient is commonly computed using an atom-based method that sums the contributions of individual atoms [Wildman and Crippen, 1999, Jin et al., 2018a], which is available in the RDKit package [Landrum, 2016]. Because it is easy to optimize the atom-based method by producing unrealistic molecules [Brown et al., 2019], a version that heuristically penalizes hard-to-synthesize compounds is used in practice [Jin et al., 2018a]. This example illustrates the need to develop more realistic ways to benchmark these methods. Another example is the QED score [Bickerton et al., 2012], which is designed to capture the drug-likeness of a compound. Finally, some approaches use a model (e.g. a neural network) to predict the bioactivity of the generated compounds [Segler et al., 2018]. Similarly to logP, these two tasks are also possible to optimize while producing unrealistic molecules. This is aptly summarized in Coley et al. [2019]:

"The current evaluations for generative models do not reflect the complexity of real discovery problems."

Interestingly, besides the aforementioned proxy tasks, more realistic proxy tasks are rarely used in the context of evaluating de novo drug design methods.
This is in contrast to the evaluation of generative models for generating photovoltaics [Pyzer-Knapp et al., 2015] or molecules with certain excitation energies [Sumita et al., 2018]. One notable exception is Aumentado-Armstrong [2018], who tries to generate compounds that are active according to the DrugScore [Neudert and Klebe, 2011], and then evaluates the generated compounds using rDock [Ruiz-Carmona et al., 2014]. This lack of overall diversity and realism in the typically used evaluation methods motivates us to propose our benchmark.
Our docking-based benchmark is defined by: (1) docking software that computes, for a generated compound, its pose in the binding site; (2) a function that scores the pose; and (3) a training set of compounds with an already computed docking score.

The goal is to generate a given number of molecules that achieve the best possible docking score. For the sake of simplicity, we do not impose limits on the distance of the proposed compounds to the training set. Thus, a simple baseline is to return the training set. Finding similar compounds that have a higher docking score is already prohibitively challenging for current state-of-the-art methods. As the field progresses, our benchmark can easily be extended to account for the similarity between the generated compounds and the training set.

Finally, we would like to stress that the benchmark is not limited to de novo methods. The benchmark is applicable to any other approach, such as virtual screening. The only limitation required for a fair comparison is that docking is performed only on the supplied training set.
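Schematically, the evaluation protocol above reduces to the following sketch. The `generate_molecules` and `dock_and_score` callables are hypothetical stand-ins for the model under test and a SMINA wrapper, not the actual API of the released package:

```python
from statistics import mean

def evaluate_generator(generate_molecules, dock_and_score,
                       training_smiles, n_molecules=250):
    """Sketch of the benchmark protocol: ask the model for candidate
    molecules, dock each one, and report the mean docking score
    (lower is better for SMINA)."""
    candidates = generate_molecules(training_smiles, n_molecules)
    scores = [dock_and_score(smiles) for smiles in candidates]
    return mean(scores)

def training_set_baseline(training_smiles, n_molecules):
    """The trivial baseline discussed above: return the training set."""
    return training_smiles[:n_molecules]
```

Because no distance-to-training-set constraint is imposed, `training_set_baseline` is a valid (and, per our results, hard-to-beat) submission.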
Instantiation

As a concrete instantiation of our docking-based benchmark, we use SMINA [Koes et al., 2013] due to its widespread use and free license. To create the training sets, we download from the ChEMBL [Gaulton et al., 2016] database molecules tested against popular drug targets: 5-HT1B, 5-HT2B, ACM2, and CYP2D6. Each dataset contains both active (Ki < 100 nM) and inactive (Ki > 1000 nM) molecules; we list the dataset sizes in Table 2.

We dock each molecule using default settings in SMINA to a manually selected binding site coordinate. Protein structures were downloaded from the Protein Data Bank, then cleaned and prepared for docking using the Schrodinger modeling package. The resulting protein structures are provided in our code repository. We describe further details on the preparation of the datasets in Appendix C.

Starting from the above, we define the following three variants of the benchmark. In the first variant, the goal is to propose molecules that achieve the lowest SMINA docking score computed in score_only mode, defined as follows:

Docking score = −0.035579 · gauss(o = 0, w = 0.5) − 0.005156 · gauss(o = 3, w = 2) + 0.840245 · repulsion − 0.035069 · hydrophobic − 0.587439 · non_dir_h_bond,

where all terms are computed based on the final docking pose. The first three terms measure the steric interaction between the ligand and the protein. The fourth and the fifth term look for hydrophobic contacts and hydrogen bonds between the ligand and the protein. We include in Appendix A a detailed description of all the terms.

Next, we propose two simpler variants of the benchmark based on individual terms of the SMINA scoring function. In the Repulsion task, the goal is to minimize only the repulsion component, which is defined as:

repulsion(a1, a2) =
  d_diff(a1, a2)^2, if d_diff(a1, a2) < 0
  0, otherwise,

where d_diff(a1, a2) is the distance between the atoms minus the sum of their van der Waals radii, measured in Ångströms (10^−10 m).

The third task, the Hydrogen Bond task, is to maximize the non_dir_h_bond term:

non_dir_h_bond(a1, a2) =
  0, if (a1, a2) do not form a hydrogen bond
  1, if d_diff(a1, a2) < −0.7
  0, if d_diff(a1, a2) ≥ 0
  d_diff(a1, a2) / (−0.7), otherwise.

To make the results more stable, we average the score over the top 5 best-scoring binding poses. Finally, to make the benchmark more realistic, we filter the generated compounds using the Lipinski rule, and discard molecules with a molecular weight lower than 100.
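For concreteness, the two task objectives can be written directly as functions of d_diff (in Ångströms), together with the top-5 pose averaging used by the benchmark. This is only a sketch: the per-atom-pair summation over a docked pose and the hydrogen-bond detection performed by SMINA are omitted.

```python
from statistics import mean

def repulsion(d_diff):
    """Repulsion term: quadratic penalty for atom pairs closer than the
    sum of their van der Waals radii (d_diff < 0); zero otherwise."""
    return d_diff ** 2 if d_diff < 0 else 0.0

def non_dir_h_bond(d_diff, forms_h_bond=True):
    """Non-directional hydrogen-bond term: 1 for sufficiently close
    donor/acceptor pairs, decaying linearly to 0 as d_diff reaches 0."""
    if not forms_h_bond or d_diff >= 0:
        return 0.0
    if d_diff < -0.7:
        return 1.0
    return d_diff / -0.7

def aggregate(pose_scores, k=5):
    """Average the k best poses, assuming lower scores are better,
    as for the SMINA docking score."""
    return mean(sorted(pose_scores)[:k])
```

A full score for a pose would sum these terms over all interacting atom pairs and combine them with the weights of the SMINA scoring function (Appendix A).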
Results and discussion
In this section, we evaluate two popular models for de novo drug design on our benchmark.
We compare two popular models for de novo drug design. Chemical Variational Autoencoder (CVAE) [Gómez-Bombarelli et al., 2018] applies a Variational Autoencoder [Kingma and Welling, 2013] by representing molecules as strings of characters (using the SMILES encoding). This approach was later extended by Grammar Variational Autoencoder (GVAE) [Kusner et al., 2017a], which ensures that the generated compounds are grammatically correct.
To generate active compounds, we follow an approach similar to Gómez-Bombarelli et al. [2018] and Kusner et al. [2017a]. First, we fine-tune a given generative model on the training set ligands, starting from the weights made available by the authors. All hyperparameters are set to the default values used in Gómez-Bombarelli et al. [2018] and Kusner et al. [2017a]. Additionally, we use the provided scores to train a multilayer perceptron (MLP) to predict the target (e.g. the SMINA scoring function) based on the latent space representation of the molecule.

To generate compounds, we first take a random sample of the model's latent space by sampling from a Gaussian distribution with a standard deviation of 1 and a mean of 0. Starting from this point in the latent space, we take gradient steps to optimize the output of the MLP. Using this approach we generate 250 compounds from each model.

All other experimental details, including the hyperparameter values used in the experiments, can be found in Appendix B.

In the experiments, we compare to three baselines: (1) random compounds from ZiNC, (2) random active compounds unseen in training (Ki < 100 nM), and (3) random inactive compounds unseen in training (Ki > 1000 nM).

In this section, we use the above procedure to generate compounds for the targets described earlier. Table 1a summarizes the results on all three tasks. Below we make several observations.

First, we observe that the CVAE and GVAE models fail to generate compounds that achieve higher scores than the three baselines. In particular, both models produce compounds whose average docking scores are below the mean of the training set. Even looking at the Top 1%, the achieved scores are below the mean. This is reminiscent of results in Gao and Coley [2020], who show that de novo models tend to generate difficult-to-synthesize molecules, even when optimizing for a proxy of synthesizability.
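The generation procedure described above (sample a latent point from a standard Gaussian, then follow the gradient of the predicted score) can be sketched in a few lines. Here a toy quadratic surrogate stands in for the trained MLP, and all names are illustrative rather than taken from our codebase:

```python
import random

def optimize_latent(grad, z0, steps=50, lr=0.1):
    """Gradient-based search in latent space: starting from z0, take
    `steps` gradient steps that lower the predicted docking score
    (lower SMINA score = better)."""
    z = list(z0)
    for _ in range(steps):
        g = grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z

# Toy stand-in for the MLP: the predicted score is a quadratic bowl with
# its minimum at z_opt. A real run backpropagates through the MLP instead.
z_opt = [0.5, -1.0]
predict = lambda z: sum((zi - oi) ** 2 for zi, oi in zip(z, z_opt))
grad = lambda z: [2 * (zi - oi) for zi, oi in zip(z, z_opt)]

random.seed(0)
z_start = [random.gauss(0.0, 1.0) for _ in range(2)]  # Gaussian init
z_final = optimize_latent(grad, z_start)
```

The optimized latent point is then decoded back into a SMILES string by the generative model; as shown below, the weak link is that the surrogate's predictions diverge from the true docking score precisely in the regions the optimizer reaches.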
Similarly, even though we optimize for the docking score using the MLP, the models fail to optimize the actual score. We analyse this phenomenon more closely in the next section.

The next two tasks are easier to solve. On the Repulsion task, both models improve upon the baselines. GVAE generates compounds with a repulsion score that is an order of magnitude lower than the best score observed in the training set. This, however, should be interpreted with caution. It is possible to minimize repulsion by avoiding docking to the binding site, which might explain why Inactives tend to have a lower repulsion than Actives. As such, we suggest treating the Repulsion task as a form of unit test for the validity of the generating procedure.

The difficulty of the Hydrogen Bond task lies between the first two tasks. The mean score of molecules generated by GVAE is larger than the mean score of Actives (also for the Top 1%). In contrast, the CVAE Top 1% results are worse than the ones in the Inactives dataset. This suggests that maximizing the non_dir_h_bond term (Appendix A) is harder than minimizing the repulsion term, but is still an easier task than optimizing the whole docking score.

The stronger results on the Repulsion and Hydrogen Bond tasks show that it is feasible to optimize individual components of the score. This suggests that solving the Score Function task is an attainable goal.

Finally, our results also suggest that generative models applied to de novo drug discovery pipelines might require substantially more data to generate active compounds than is typically available for training. In particular, on the 5-HT1B receptor, despite using a realistically sized training set of over one thousand compounds, the achieved docking scores are worse than in a random sample from the ZiNC dataset. The docking score is only a simple proxy of the actual binding affinity, and as such it should worry us that it is already challenging to optimize.

Table 1: Benchmark results for the (a) Score Function, (b) Repulsion, and (c) Hydrogen Bonding tasks. In Score Function, the goal is to propose compounds achieving the lowest mean SMINA docking score towards a given target. Each cell reports the mean score for all compounds, with the score for the top 1% of compounds in parentheses. We observe that for the Score Function task CVAE and GVAE fail to outperform ZiNC (a random sample of compounds from the ZiNC database). Missing results ("-") indicate that the model failed to generate molecules that satisfy the drug-like filters (described in the text).

Table 2: Sizes of the datasets used in the benchmark. A held-out fraction of each dataset is used for testing, and the rest is used for training.

              5HT1B  5HT2B  ACM2  CYP2D6
Dataset size  1891   1194   2341  4200

(Pretrained weights are available at https://github.com/aspuru-guzik-group/chemical_vae/tree/master/models/zinc and at https://github.com/mkusner/grammarVAE/tree/master/pretrained.)
In this section, we investigate why the CVAE and GVAE models fail to generate compounds that achieve high docking scores.

The most natural hypothesis is that predicting the docking score is difficult. Our procedure uses the gradient of the MLP to generate active compounds. If the model fails at predicting activity, it is reasonable to assume it also fails to guide the generating process towards active compounds.

Recall that both models failed to generate compounds that achieve better docking scores towards 5HT1B than, in particular, a random sample from the ZiNC dataset (see Table 1a). However, the MLP predicts that these compounds are active. We show this in Table 3. This discrepancy between the MLP predictions and the actual docking scores strongly suggests that modeling the docking score is the bottleneck.

Table 3: Docking scores predicted by the MLP for the compounds presented in Table 1a (mean, with Top 1% in parentheses). Comparing to Table 1a, we can observe that the predicted docking scores overestimate the true activity by two, three, or even four orders of magnitude.

       5HT1B              5HT2B              ACM2                 CYP2D6
CVAE   -60.071 (-58.792)  -28.958 (-30.205)  -354.084 (-352.818)  -515.886 (-377.572)
GVAE   -36.565 (-25.608)  -50.412 (-47.629)  -66.009 (-53.335)    -55.021 (-47.307)

To better quantify this effect, we measure the root mean squared error (RMSE) between the predicted and the true docking score on two datasets: (1) random samples of the latent space from a Gaussian distribution (Gauss), and (2) molecules generated using gradient-based latent space optimization (Generated). We compare this to SMINA itself by redocking compounds and measuring the discrepancy between the two docking runs (SMINA).

Table 4 reports the results, and Figure 2 shows the predicted docking score against the true docking score on the compounds generated by CVAE. We observe that on all subsets the RMSE is three to four orders of magnitude higher than the RMSE between two docking runs (SMINA). This shows that using better models for predicting docking scores is a promising avenue for improving results on the benchmark.

Table 4: Root mean square error between the predicted and the true docking scores for CVAE (left) and GVAE (right), compared to the difference between two runs of the docking software (SMINA).

(a) CVAE                       (b) GVAE
           MLP     SMINA                  MLP     SMINA
Gauss      4.064   0.004       Gauss      8.24    0.004
Generated  56.062  0.009       Generated  33.646  0.018

Figure 2: The predicted docking score (y axis) versus the true docking score (x axis) for the compounds generated by CVAE (Spearman correlation: -0.0836). Improving the modeling of the docking score is a promising avenue for improving on the benchmark.
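The discrepancy reported in Table 4 is a plain root-mean-squared-error computation between predicted scores and (re)docked scores; a minimal sketch:

```python
from math import sqrt

def rmse(predicted, true):
    """Root mean squared error between predicted and true docking scores."""
    assert len(predicted) == len(true) and predicted
    return sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(predicted))
```

For example, predictions of [-8.0, -9.0] against true scores of [-7.5, -9.5] give an RMSE of 0.5; the same function applied to two independent docking runs of the same compounds yields the SMINA column of Table 4.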
Conclusion

As concluded by Coley et al. [2019], "the current evaluations for generative models do not reflect the complexity of real discovery problems". In this work, we proposed a new, more realistic benchmark tailored to de novo drug design, using the docking score as the target to optimize. Code to evaluate new models is available at https://github.com/cieplinski-tobiasz/smina-docking-benchmark.

Our results also suggest that generative models applied to de novo drug discovery pipelines might require substantially more data to generate realistic compounds than is typically available for training. Despite using several thousand compounds for training (for the CYP2D6 target), the achieved docking scores are worse than in a random sample from the ZiNC dataset. The docking score is only a simple proxy of the actual binding affinity, and as such it should worry us that it is already challenging to optimize.

On a more optimistic note, the tested models were able to optimize the number of hydrogen bonds to the binding site, which is a term in the SMINA scoring function. This suggests that producing compounds that optimize the docking score based on the provided dataset is an attainable, albeit challenging, task. We hope our benchmark better reflects "the complexity of real discovery problems" and will serve as a stepping stone towards developing better de novo models for drug discovery.

References
Tristan Aumentado-Armstrong. Latent molecular optimization for targeted therapeutic design. CoRR, abs/1809.02032, 2018.

G. Richard Bickerton, Gaia V. Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L. Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90–98, 2012.

Nathan Brown, Marco Fiscato, Marwin H. S. Segler, and Alain C. Vaucher. GuacaMol: Benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 59(3):1096–1108, 2019.

Connor W. Coley, Natalie S. Eyke, and Klavs F. Jensen. Autonomous discovery in the chemical sciences part II: Outlook. Angewandte Chemie International Edition, 2019.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Daniel C. Elton, Zois Boukouvalas, Mark D. Fuge, and Peter W. Chung. Deep learning for molecular design—a review of the state of the art. Mol. Syst. Des. Eng., 4:828–849, 2019.

Wenhao Gao and Connor W. Coley. The synthesizability of molecules proposed by generative models, 2020.

Anna Gaulton, Anne Hersey, Michał Nowotka, A. Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J. Bellis, Elena Cibrián-Uhalte, Mark Davies, Nathan Dedman, Anneli Karlsson, María Paula Magariños, John P. Overington, George Papadatos, Ines Smit, and Andrew R. Leach. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, 2016. doi:10.1093/nar/gkw1074. URL https://doi.org/10.1093/nar/gkw1074.

Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. CoRR, abs/1610.02415, 2016.

Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pages 2328–2337, 2018a.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation, 2018b.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2013. URL http://arxiv.org/abs/1312.6114.

David Ryan Koes, Matthew P. Baumgartner, and Carlos J. Camacho. Lessons learned in empirical scoring with SMINA from the CSAR 2011 benchmarking exercise. Journal of Chemical Information and Modeling, 53(8):1893–1904, 2013.

Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder, 2017a.

Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder, 2017b.

Greg Landrum. RDKit: Open-source cheminformatics software, 2016. URL https://github.com/rdkit/rdkit/releases/tag/Release_2016_09_4.

Łukasz Maziarka, Agnieszka Pocha, Jan Kaczmarczyk, Krzysztof Rataj, Tomasz Danel, and Michał Warchoł. Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics, 12(1):2, 2020.

Gerd Neudert and Gerhard Klebe. DSX: A knowledge-based scoring function for the assessment of protein–ligand complexes. Journal of Chemical Information and Modeling, 51(10):2731–2745, 2011.

Edward O. Pyzer-Knapp, Kewei Li, and Alán Aspuru-Guzik. Learning from the Harvard Clean Energy Project: The use of neural networks to accelerate materials discovery. Advanced Functional Materials, 25(41):6495–6502, 2015.

Sergio Ruiz-Carmona, Daniel Alvarez-Garcia, Nicolas Foloppe, A. Beatriz Garmendia-Doval, Szilveszter Juhos, Peter Schmidtke, Xavier Barril, Roderick E. Hubbard, and S. David Morley. rDock: A fast, versatile and open source program for docking ligands to proteins and nucleic acids. PLOS Computational Biology, 10(4):1–7, 2014.

Gisbert Schneider and David E. Clark. Automated de novo drug design: Are we nearly there yet? Angewandte Chemie International Edition, 58(32):10792–10803, 2019.

Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120–131, 2018.

Gregory Sliwoski, Sandeepkumar Kothiwale, Jens Meiler, and Edward W. Lowe. Computational methods in drug discovery. Pharmacological Reviews, 66(1):334–395, 2014.

Masato Sumita, Xiufeng Yang, Shinsuke Ishihara, Ryo Tamura, and Koji Tsuda. Hunting for organic molecules with artificial intelligence: Molecules optimized for desired excitation energies, 2018.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, 2018.

Scott A. Wildman and Gordon M. Crippen. Prediction of physicochemical parameters by atomic contributions. Journal of Chemical Information and Computer Sciences, 39(5):868–873, 1999.

Jiaxuan You, Bowen Liu, Rex Ying, Vijay S. Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. CoRR, abs/1806.02473, 2018.
A Default SMINA scoring function
We include the definitions of SMINA's default scoring function components and the weights used for calculating the docking score in score_only mode. a1 and a2 denote atoms, d(a1, a2) is the distance between the atoms, d_opt(a1, a2) is the sum of their van der Waals radii, and d_diff(a1, a2) = d(a1, a2) − d_opt(a1, a2). The distance unit is the Ångström (10^−10 m).

Docking score = −0.035579 · gauss(o = 0, w = 0.5) − 0.005156 · gauss(o = 3, w = 2) + 0.840245 · repulsion − 0.035069 · hydrophobic − 0.587439 · non_dir_h_bond

gauss(a1, a2) = exp(−((d_diff(a1, a2) − o) / w)^2)

repulsion(a1, a2) =
  d_diff(a1, a2)^2, if d_diff(a1, a2) < 0
  0, otherwise

hydrophobic(a1, a2) =
  0, if not hydrophobic(a1) or not hydrophobic(a2)
  1, if d_diff(a1, a2) < 0.5
  0, if d_diff(a1, a2) ≥ 1.5
  1.5 − d_diff(a1, a2), otherwise

non_dir_h_bond(a1, a2) =
  0, if (a1, a2) do not form a hydrogen bond
  1, if d_diff(a1, a2) < −0.7
  0, if d_diff(a1, a2) ≥ 0
  d_diff(a1, a2) / (−0.7), otherwise

B Model details
We include the hyperparameters and training settings used in our models. Our code is available at https://github.com/cieplinski-tobiasz/smina-docking-benchmark.

The MLP is used to predict the docking score from the CVAE or GVAE latent space representation of a molecule. It is a simple feed-forward neural network with one hidden layer. The hyperparameters of this model are listed in Table 5.

Table 5: MLP hyperparameters
Parameter          Value
Training epochs    50
Number of layers   1
Hidden layer dim   1000
Loss function      Mean Squared Error
Optimizer          Adam
Learning rate      0.001

Both Chemical VAE and Grammar VAE are based on a variational autoencoder model with stacked convolution layers in the encoder and stacked GRU layers in the decoder. What differentiates them is the way a SMILES string is encoded as a one-hot vector. Chemical VAE encodes each character of the SMILES string as a separate one-hot vector, while Grammar VAE forms a parse tree from the SMILES string and encodes the parse rules. Details for CVAE are listed in Table 6 and for GVAE in Table 7.

Table 6: Chemical VAE hyperparameters
Parameter                             Value
MLP learning rate                     0.05
MLP descent iterations                50
Fine-tuning batch size                256
Fine-tuning epochs                    5
Latent space dim                      196
Number of encoder convolution layers  4
Number of decoder GRU layers          4

Table 7: Grammar VAE hyperparameters
Parameter                             Value
MLP learning rate                     0.01
MLP descent iterations                50
Fine-tuning batch size                256
Fine-tuning epochs                    5
Latent space dim                      56
Number of encoder convolution layers  3
Number of decoder GRU layers          3
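For reference, the MLP of Table 5 is a standard one-hidden-layer feed-forward regressor. A dependency-free sketch of its forward pass follows; the ReLU activation and the tiny dimensions are illustrative assumptions (the actual model uses a hidden dimension of 1000, and the appendix does not specify the activation):

```python
def mlp_forward(z, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer MLP mapping a latent vector z
    to a scalar docking-score prediction. w1 is a list of hidden-unit
    weight rows, b1 the hidden biases, w2 the output weights, b2 the
    output bias. ReLU is assumed for the hidden layer."""
    hidden = [max(0.0, sum(wi * zi for wi, zi in zip(row, z)) + bi)
              for row, bi in zip(w1, b1)]
    return sum(wi * hi for wi, hi in zip(w2, hidden)) + b2
```

During generation, the gradient of this scalar output with respect to z (computed by an autodiff framework in practice) is what drives the latent space optimization described in the experimental setup.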