Generate Novel Molecules With Target Properties Using Conditional Generative Models
Abhinav Sagar∗
Vellore Institute of Technology, Vellore, Tamil Nadu, India
[email protected]
Abstract
Drug discovery using deep learning has attracted a lot of attention of late as it has obvious advantages like higher efficiency, less manual guessing and faster process time. In this paper, we present a novel neural network for generating small molecules similar to the ones in the training set. Our network consists of an encoder made up of bi-GRU layers for converting the input samples to a latent space, a predictor made up of 1D-CNN layers for enhancing the capability of the encoder, and a decoder comprised of uni-GRU layers for reconstructing the samples from the latent space representation. A condition vector in the latent space is used for generating molecules with the desired properties. We present the loss functions used for training our network, experimental details and property prediction metrics. Our network outperforms previous methods using molecular weight, LogP and Quantitative Estimation of Drug-likeness as the evaluation metrics.
Deep learning has achieved tremendous success in many areas, tackling a range of datasets from images to text. One of those areas is chemoinformatics. Neural networks have been used to solve a variety of problems like molecule property prediction, drug design, chemical reaction prediction etc. In this work, we propose a novel network for de novo molecular design using deep learning. Neural networks offer a better approach as they speed up the process while also increasing the efficiency compared to traditional methods.

The total number of potential organic molecules is very large, and the chemical space contains a vast number of molecules (Virshup et al., 2013). Of these, only some molecules have been discovered. Traditional methods discovered new molecules using the chemical space of the molecules which were already discovered (Kim et al., 2016). A lot of newly discovered molecules were born out of trial and error.

The main challenge is to generate new molecules with desired properties (molecular weight, toxicity, solubility etc). Deep learning reduces the computational overhead of searching for new molecules. This approach is not only fast but also cheaper and more efficient. Also, the whole chemical space can be utilized to search for potential small molecules. The process of de novo molecular design can be separated into the following parts: 1. molecular generation; 2. an approach to rank molecules; 3. a function to optimize the molecular space in search of new molecules.

∗ Website of author - https://abhinavsagar.github.io/
Preprint. Under review.

Related Work
Generative models, primarily driven by GANs and VAEs, have been used successfully for efficient molecular design. (Gómez-Bombarelli et al., 2018) used VAEs to optimize molecular properties in a latent space where molecules are expressed as real vectors. Since the latent space is continuous and differentiable, gradient based optimization can be done. (Blaschke et al., 2018) used an adversarial autoencoder and Bayesian optimization to generate molecules according to a specific property. (Kadurin et al., 2017) compared AAE and VAE using reconstruction error and variability of the output molecular fingerprints as the evaluation metrics.

(Segler et al., 2018) and (Gupta et al., 2018) built generative models using Natural Language Processing (NLP) techniques. The molecules are represented as SMILES strings and the model learns the probability distribution of the next character given a piece of SMILES. Transfer learning was applied using a pre-trained backbone. (Popova et al., 2018) and (Segler et al., 2018) used Recurrent Neural Networks, while (Kusner et al., 2017) and (Dai et al., 2018) used Variational Autoencoders. Models directly generating the graph structures of molecules are used in (Simonovsky and Komodakis, 2018) and (Jin et al., 2018).

Lately, conditional molecular design has been used to generate molecules with properties close to pre-determined target conditions. (Segler et al., 2018) used recursive fine tuning while (Gómez-Bombarelli et al., 2018) used Bayesian optimization. Here the need is to find a latent representation which is close to the target condition. Since an additional optimization procedure is used, this method is inefficient, especially when there are multiple target conditions.

Our main contributions can be summarized as follows:

• We present a novel network for generating new but similar molecules to the ones it is trained on.
• Our network can be divided into three parts: an encoder which converts the training data into a latent space, a predictor which enhances the function of the encoder using another latent space conversion, and a decoder which is responsible for generating new molecules from the latent space.

• We present the loss functions, experimental details and evaluation metrics for property prediction.

• Our network outperforms previous methods on most of the metrics.
Recurrent Neural Networks are used where data is sequential in nature, i.e. there is a connectivity with previous terms. Generating SMILES sequences requires understanding of the sequence of characters. The outputs of an RNN depend on previous computations, creating a loop where the information at every stage is influenced by previous stages. An RNN is represented mathematically as in Equation 1:

h_t = \sigma_h(U_h x_t + V_h h_{t-1} + b_h)
o_t = \sigma_y(W_y h_t + b_y)    (1)

where x_t denotes the input vector (m × 1), h_t denotes the hidden vector (n × 1), o_t denotes the output vector (n × 1), b_h, b_y denote bias vectors (n × 1), U_h denotes a parameter matrix (n × m), V_h, W_y denote parameter matrices (n × n) and \sigma_h, \sigma_y denote activation functions.

Thus RNNs form a chain of repeating units which are all linked together for making sense of sequential information. RNNs are, however, very difficult to train. Many factors make training a challenge, of which the most common is the vanishing gradient problem. As training proceeds, the value of the gradient used to update the weights of the network becomes smaller. In the earlier layers, the value becomes vanishingly small and thus there is a memory loss. Gated Recurrent Units were introduced to counter this; their update and reset gates decide which past information to discard when it is no longer needed and which is important for making predictions (Cho et al., 2014). A GRU is represented mathematically as in Equation 2:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)
r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)
\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t    (2)

where h_t denotes the hidden state vector, x_t denotes the input vector, b_z, b_r, b_h denote bias vectors, W_z, W_r, W_h denote parameter matrices and \sigma, \tanh denote activation functions.
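As a concrete illustration, Equation 2 can be implemented as a single GRU step in plain NumPy; the toy dimensions and random weights below are purely illustrative, not the ones used in our network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following Equation 2; [h, x] denotes concatenation."""
    W_z, b_z, W_r, b_r, W_h, b_h = params
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)                        # update gate
    r_t = sigmoid(W_r @ hx + b_r)                        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_tilde          # new hidden state

# Toy dimensions: input size m = 4, hidden size n = 3.
rng = np.random.default_rng(0)
m, n = 4, 3
params = (rng.normal(size=(n, n + m)), np.zeros(n),
          rng.normal(size=(n, n + m)), np.zeros(n),
          rng.normal(size=(n, n + m)), np.zeros(n))
h = gru_step(rng.normal(size=m), np.zeros(n), params)
```

Because the candidate state passes through tanh and the update gate lies in (0, 1), the hidden state stays bounded, which is part of what makes GRUs easier to train than plain RNNs.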
Although CNNs have mostly been used with image data, they can also be used for text based data. The text needs to be converted into a one-hot encoded matrix, where each row represents a word or character. In the case of images, data is represented in two dimensional form along with a two dimensional filter. Text data, however, is in a 1D format and uses a one dimensional kernel. Zero padding is also needed at positions where the character is null.
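The one-hot encoding and one dimensional convolution described above can be sketched as follows; the character vocabulary here is a small hypothetical subset of SMILES symbols, not the full one used in our experiments:

```python
import numpy as np

# Hypothetical character vocabulary for SMILES strings.
VOCAB = ['C', 'O', 'N', '(', ')', '=', '1', 'E']   # 'E' would mark end-of-sequence

def one_hot(smiles, length):
    """Encode a SMILES string as a (length, |VOCAB|) one-hot matrix,
    zero-padding positions past the end of the string."""
    mat = np.zeros((length, len(VOCAB)))
    for i, ch in enumerate(smiles[:length]):
        mat[i, VOCAB.index(ch)] = 1.0
    return mat

def conv1d(x, kernel):
    """Valid 1D convolution along the sequence axis; kernel shape (k, |VOCAB|)."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(x.shape[0] - k + 1)])

x = one_hot('CC(=O)O', 10)                        # acetic acid, padded to length 10
out = conv1d(x, np.ones((3, len(VOCAB))))         # one all-ones kernel of width 3
```

The kernel slides only along the sequence dimension while spanning the full one-hot width, which is the key difference from the 2D filters used on images.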
A vanilla autoencoder is good at reconstructing the samples given as input to the encoder. However, it lacks the ability to generate new samples from existing samples. This is where the Variational Autoencoder comes to the rescue. VAEs are a kind of generative model in which the encoder maps inputs to a standard normal distribution over the latent space and the decoder generates new samples from it (Kingma and Welling, 2013). The objective function of a vanilla VAE is defined in Equation 3:

E[\log P(X | z)] - D_{KL}[Q(z | X) || P(z)]    (3)

where E denotes the expectation value, P, Q denote probability distributions, D_{KL} denotes the Kullback-Leibler divergence, X denotes the data and z denotes the latent space.

The decoder of a VAE is not suitable when the goal is to design drugs with a desired property. A CVAE solves this problem by feeding labels along with the input sample to the encoder. The latent vector contains information about the sampled data and hence the decoder is able to synthesize new samples with the desired property. A condition vector c is concatenated to the input and is used for both the encoder and decoder (Sohn et al., 2015). The objective function of the CVAE is changed accordingly and is defined in Equation 4:

E[\log P(X | z, c)] - D_{KL}[Q(z | X, c) || P(z | c)]    (4)

where E denotes the expectation value, P, Q denote probability distributions, D_{KL} denotes the Kullback-Leibler divergence, X denotes the data, z denotes the latent space and c denotes the condition vector.

A sample of 100,000 SMILES strings of drug like molecules was randomly sampled from the ZINC database. We use 90,000 molecules for training and 10,000 molecules for testing the property prediction performance. A special character indicating the end of sequence is appended to every sequence. To evaluate the performance of our network, we used three properties: molecular weight (MolWt), Wildman-Crippen partition coefficient (LogP) and quantitative estimation of drug-likeness (QED).

Molecular Representation
A lot of molecular representations have been used in the literature. The most common among them are SMILES (Weininger, 1988) and the graph representation. The more detailed the representation, the more computational burden it demands. The Simplified Molecular Input Line Entry System (SMILES) is a one dimensional representation of a two dimensional chemical drawing. It contains atom and bond symbols with an easy vocabulary.

Since it is easy to understand and parse, Natural Language Processing (NLP) techniques can be used on SMILES. More than one SMILES representation of a molecule is possible, however only one canonical form is used per molecule. The molecular latent space visualized in two dimensions using principal component analysis is shown in Figure 1.

Figure 1: 2D visualization of around 8000 molecules encoded into the latent space. The results are visualized using the RDKit Python package.
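The two dimensional visualization of Figure 1 amounts to projecting the latent vectors onto their first two principal components, which can be sketched with NumPy; the random latent vectors below stand in for real encoder outputs:

```python
import numpy as np

def pca_2d(latents):
    """Project latent vectors of shape (num_molecules, latent_dim) onto their
    first two principal components, as done for the Figure 1 visualization."""
    centered = latents - latents.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 292))   # 292-dimensional latent space, as in the paper
coords = pca_2d(z)                # (100, 2) coordinates to scatter-plot
```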
The following benchmarks were used to determine the performance of our network for generating molecules:
1. Validity:
It assesses whether the generated molecules are realistic or not. Examples of invalid molecules are ones with a wrong valency configuration or wrong SMILES syntax.
2. Uniqueness:
It assesses whether the molecules generated are different from one another or not.
3. Novelty:
It assesses whether the generated molecules are different from the ones in the training set or not.
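The three benchmarks can be computed from lists of SMILES strings as sketched below. The validity check is a caller-supplied placeholder (in practice one would test whether RDKit's Chem.MolFromSmiles returns a molecule), and the toy data is purely illustrative:

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness and novelty over generated SMILES.
    `is_valid` is a caller-supplied validity check; here it is a stub,
    whereas in practice it would wrap an RDKit parse."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        'validity': len(valid) / len(generated),
        'uniqueness': len(unique) / max(len(valid), 1),
        'novelty': len(novel) / max(len(unique), 1),
    }

# Toy run with a stand-in validity check (balanced parentheses only).
train = ['CCO', 'CCN']
gen = ['CCO', 'CCO', 'CCC', 'C(C']
m = generation_metrics(gen, train, lambda s: s.count('(') == s.count(')'))
```

Here 'C(C' fails the stand-in check, 'CCO' appears twice, and only 'CCC' is absent from the training set, giving validity 0.75, uniqueness 2/3 and novelty 0.5.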
Our network was trained on SMILES from the ChEMBL database. Since the context is small molecule generation, only SMILES strings with fewer than 120 characters were used. The data was divided into 80% training data and 20% testing data. Bayesian optimization was done to optimize hyper-parameters like the number of hidden layers, activation functions, learning rate etc.

Our model is comprised of encoder, predictor and decoder networks. The encoder in our network has three bi-GRU layers, a flatten layer and a dense layer. The predictor is comprised of a dense layer and three 1D convolutional layers. The latent space dimension was set to 292. The encoder was fed data from the SMILES database after one-hot encoding. The encoded data is sampled in the latent space using mean and standard deviation vectors. The latent vector produces new samples after passing through the decoder. The decoder has three uni-GRU layers followed by a dense and a flatten layer.

The input variable x is generated from a generative distribution p_\theta(x | y, z), which is conditioned on the output variable y and latent variable z. The prior distributions are denoted by p(y) = N(y | \mu_y, \Sigma_y) and p(z) = N(z | 0, I). In our case, x denotes molecules while y denotes properties. A standard deviation is applied to both the y and z terms before passing them through the decoder. Our network is shown in Figure 2.

Figure 2: Illustration of our network architecture.

After the network is trained, property prediction is done using the prediction network q_\phi(y | x). For an unlabeled instance x, the corresponding properties \hat{y} are predicted as defined in Equation 5:

\hat{y} \sim N(\mu_\phi(x), diag(\sigma^2_\phi(x)))    (5)

The point estimate \hat{y} can be obtained by maximizing the probability, which is given by \mu_\phi(x). The decoder network p_\theta(x | y, z) is used to generate a molecule.
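The latent sampling step above (a reparameterized draw from the mean and standard deviation vectors, with the condition vector appended for the decoder) can be sketched as follows; the target property values are illustrative:

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.normal(size=mu.shape)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(292), np.ones(292)   # 292-dimensional latent space, as in the paper
z = sample_latent(mu, sigma, rng)
y = np.array([350.0, 3.0, 0.7])           # illustrative targets: MolWt, LogP, QED
decoder_input = np.concatenate([z, y])    # condition vector fed to the decoder with z
```

Writing the sample as mu + sigma * eps keeps the draw differentiable with respect to mu and sigma, which is what allows the encoder to be trained by backpropagation.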
A molecule representation \hat{x} is obtained from y and z as defined in Equation 6:

\hat{x} = \arg\max_x \log p_\theta(x | y, z)    (6)

Given a sequence S of symbols s_t at steps t \in T, our language model assigns a probability as defined in Equation 7:

P_\theta(S) = P_\theta(s_1) \cdot \prod_{t=2}^{T} P_\theta(s_t | s_{t-1}, \ldots, s_1)    (7)

where the parameters \theta are learned from the training set. We use a GRU to estimate the probabilities of Equation 7. The probability distribution P_\theta(s_{t+1} | s_t, \ldots, s_1) of the next symbol given the already seen sequence is thus estimated using the output vector y_t of the recurrent neural network at time step t, as defined in Equation 8:

P_\theta(s_{t+1} = k | s_t, \ldots, s_1) = \frac{\exp(y_t^k)}{\sum_{k'=1}^{K} \exp(y_t^{k'})}    (8)

where y_t^k corresponds to the k-th element of vector y_t. Novel molecules can be generated by sampling from this distribution. This sampling procedure is repeated until the desired number of characters has been generated.

The open source package RDKit was used for testing the validity of the generated SMILES strings and calculating the properties of the molecules. Samples of the dataset are drawn from the ZINC database (Irwin et al., 2012). The learning rate in our experiments was set to 0.0001 with an exponential decay of 0.99. The model was trained until it converged. A generated molecule is considered successful if its target property is within a 10% range of the given target value. The molecules are encoded to the latent representation and Gaussian noise is added to it.

The standard deviation value was important to tune, as a lower value generated molecules similar to the ones in the training set, while a larger standard deviation generated molecules very different from the ones the network was trained on. The optimal value of the standard deviation was found to be 0.05. During training, we normalize each output variable to have a mean of 0 and a standard deviation of 1.
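The sampling procedure of Equations 7 and 8 can be sketched as below, with a stand-in scoring function in place of the trained GRU:

```python
import numpy as np

def softmax(y):
    """Equation 8: convert output scores y_t into next-symbol probabilities."""
    e = np.exp(y - y.max())              # shift by the max for numerical stability
    return e / e.sum()

def sample_sequence(step_fn, start, end, max_len, rng):
    """Sample one symbol at a time until the end token (or max_len is reached).
    `step_fn(seq)` stands in for the trained GRU returning scores over symbols."""
    seq = [start]
    while len(seq) < max_len:
        probs = softmax(step_fn(seq))
        s = int(rng.choice(len(probs), p=probs))
        seq.append(s)
        if s == end:
            break
    return seq

rng = np.random.default_rng(0)
# Toy model over 5 symbols that strongly prefers the end token (index 4).
seq = sample_sequence(lambda s: np.array([0.0, 0.0, 0.0, 0.0, 5.0]), 0, 4, 20, rng)
```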
A batch size of 50 was used along with ADAM as the optimizer. The property prediction performance is evaluated using mean absolute error (MAE) on the test set. The encoder, predictor and decoder networks consist of three hidden layers, each having 50 gated recurrent units (GRU). The target values for MolWt, LogP, and QED are set as (250, 350, 450), (1.5, 3.0, 4.5), and (0.5, 0.7, 0.9), respectively.

Several similarity and distance metrics are used for quantifying the distance between two molecules. The most common among them is the Tanimoto similarity, which has a value between 0 and 1. Measuring the similarity between two molecules is very important, as new molecules are designed using the data of drugs which are already present. Designing drugs with a new property but a similar profile requires changing the previous structure only nominally. The Tanimoto similarity metric is defined in Equation 9:

d(a, b) = \frac{n(A \cap B)}{n(A) + n(B) - n(A \cap B)}    (9)

where n(A) denotes the number of bits set to 1 in molecule a's fingerprint, n(B) denotes the number of bits set to 1 in molecule b's fingerprint, and n(A \cap B) denotes the number of bits set to 1 that molecules a and b have in common.

The fraction of invalid molecules using our network was less than 1%. The fraction of unique molecules generated is 90.2%. The average and standard deviation values are reported in Table 1 using MAE as the evaluation metric with varying fractions of labeled molecules. Our network outperforms the others in most of the cases.

An important tool for evaluating the performance of the network is the distribution of properties. The following three properties are used:
1. Molecular weight (MW):
It is the sum of atomic weights in a molecule. To figure out if the generated samples are biased towards lighter or heavier molecules, histograms of molecular weight for the generated and test sets are plotted.

2. LogP:
It is the ratio of a chemical's concentration in the octanol phase to its concentration in the aqueous phase.
3. Quantitative Estimation of Drug-likeness (QED):
It is a measure of how likely a molecule is a viable candidate for a drug. Its value lies between 0 and 1, both included.
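The distribution check described above can be sketched by histogramming a property over the generated and test sets on shared bins; the Gaussian samples below stand in for real MolWt values, which would come from RDKit descriptors in practice:

```python
import numpy as np

def property_histograms(gen_vals, test_vals, bins=10):
    """Histogram a property (e.g. molecular weight) over the generated and
    test sets on shared bin edges, to reveal any bias toward lighter or
    heavier molecules."""
    lo = min(min(gen_vals), min(test_vals))
    hi = max(max(gen_vals), max(test_vals))
    edges = np.linspace(lo, hi, bins + 1)
    h_gen, _ = np.histogram(gen_vals, bins=edges, density=True)
    h_test, _ = np.histogram(test_vals, bins=edges, density=True)
    return h_gen, h_test, edges

rng = np.random.default_rng(0)
gen = rng.normal(350, 40, size=500)    # illustrative MolWt values, generated set
test = rng.normal(340, 40, size=500)   # illustrative MolWt values, test set
h_gen, h_test, edges = property_histograms(gen, test)
```

Using the same bin edges for both sets is what makes the two densities directly comparable bin by bin.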
The property prediction performance with varying fractions of labeled molecules is compared with the following networks: ECFP (Rogers and Hahn, 2010), GraphConv (Kearnes et al., 2016), VAE (Gómez-Bombarelli et al., 2018) and SSVAE (Kang and Cho, 2018) in Table 1.

Table 1: Property prediction performance with varying fractions of labeled molecules.

Generative models like the VAE and CVAE are able to learn a smooth latent representation of the input data and perform interpolations on it. For carrying out the interpolations, start and end points are needed. Both of these points should represent a molecule in chemical space. The interpolation samples between aspirin and paracetamol are shown in Figure 6.

Figure 6: Generated samples using interpolation.
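The interpolation of Figure 6 is a linear walk between two latent codes, each point of which would then be decoded back into a molecule; the random vectors below stand in for the actual encodings of aspirin and paracetamol:

```python
import numpy as np

def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two latent codes; decoding each row
    yields the intermediate molecules along the path."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.array([(1 - a) * z_start + a * z_end for a in alphas])

rng = np.random.default_rng(0)
# Stand-in encodings; in practice these come from the trained encoder.
z_aspirin = rng.normal(size=292)
z_paracetamol = rng.normal(size=292)
path = interpolate(z_aspirin, z_paracetamol, steps=8)
```

This only produces sensible intermediate molecules because the latent space learned by the VAE is smooth, as noted above; in a non-smooth space the midpoints would decode to invalid strings.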
Drug discovery and new molecule generation have garnered a lot of attention from the deep learning community. Since this is a very important but challenging problem in cheminformatics, a lot of work has been done using a variety of neural networks. In this paper, we propose a novel network for generating similar but new molecules to the ones it has been trained on. The network is made up of three parts: encoder, predictor and decoder networks. We present the loss functions, molecular representation and experimental details. Using molecular weight, LogP and Quantitative Estimation of Drug-likeness as the property prediction metrics, our network performs better than previous state of the art approaches.
Acknowledgments
We would like to thank Nvidia for providing the GPUs for this work.
References
E. J. Bjerrum and B. Sattarov. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules, 8(4):131, 2018.

T. Blaschke, M. Olivecrona, O. Engkvist, J. Bajorath, and H. Chen. Application of generative autoencoder in de novo molecular design. Molecular Informatics, 37(1-2):1700123, 2018.

H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke. The rise of deep learning in drug discovery. Drug Discovery Today, 23(6):1241–1250, 2018.

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.

N. De Cao and T. Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

E. Gawehn, J. A. Hiss, and G. Schneider. Deep learning in drug discovery. Molecular Informatics, 35(1):3–14, 2016.

G. B. Goh, C. Siegel, A. Vishnu, N. O. Hodas, and N. Baker. Chemception: A deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models. arXiv preprint arXiv:1706.06689, 2017.

R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

A. Gupta, A. T. Müller, B. J. Huisman, J. A. Fuchs, P. Schneider, and G. Schneider. Generative recurrent networks for de novo drug design. Molecular Informatics, 37(1-2):1700111, 2018.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.

A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov. druGAN: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular Pharmaceutics, 14(9):3098–3104, 2017.

S. Kang and K. Cho. Conditional molecular design with deep generative models. Journal of Chemical Information and Modeling, 59(1):43–52, 2018.

S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.

S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, et al. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, 2016.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

J. Lim, S. Ryu, J. W. Kim, and W. Y. Kim. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of Cheminformatics, 10(1):1–9, 2018.

M. Popova, O. Isayev, and A. Tropsha. Deep reinforcement learning for de novo drug design. Science Advances, 4(7):eaap7885, 2018.

K. Preuer, P. Renz, T. Unterthiner, S. Hochreiter, and G. Klambauer. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736–1741, 2018.

K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, and T. Unterthiner. Interpretable deep learning in drug discovery. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 331–345. Springer, 2019.

O. Prykhodko, S. V. Johansson, P.-C. Kotsias, J. Arús-Pous, E. J. Bjerrum, O. Engkvist, and H. Chen. A de novo molecular generation method using latent vector based generative adversarial network. Journal of Cheminformatics, 11(1):74, 2019.

D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.

B. Sanchez-Lengeling and A. Aspuru-Guzik. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400):360–365, 2018.

M. H. Segler, T. Kogej, C. Tyrchan, and M. P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120–131, 2018.

M. Simonovsky and N. Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.

K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.

A. M. Virshup, J. Contreras-García, P. Wipf, W. Yang, and D. N. Beratan. Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. Journal of the American Chemical Society, 135(19):7296–7303, 2013.

J. Wang, H. Cao, J. Z. Zhang, and Y. Qi. Computational protein design with deep learning neural networks. Scientific Reports, 8(1):1–9, 2018.

D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.

D. Xue, Y. Gong, Z. Yang, G. Chuai, S. Qu, A. Shen, J. Yu, and Q. Liu. Advances and challenges in deep generative models for de novo molecule generation. Wiley Interdisciplinary Reviews: Computational Molecular Science, 9(3):e1395, 2019.

X. Yang, J. Zhang, K. Yoshizoe, K. Terayama, and K. Tsuda. ChemTS: An efficient Python library for de novo molecular generation.