Generative network complex (GNC) for drug discovery
GGenerative network complex (GNC) for drug discovery
Christopher Grow , Kaifu Gao , Duc Duy Nguyen , and Guo-Wei Wei , , , ∗ Department of Mathematics, Michigan State University, MI 48824, USA. Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA. Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA.November 1, 2019
Abstract
It remains a challenging task to generate a vast variety of novel compounds with desirable pharmacologicalproperties. In this work, a generative network complex (GNC) is proposed as a new platform for designing novelcompounds, predicting their physical and chemical properties, and selecting potential drug candidates that fulfillvarious druggable criteria such as binding affinity, solubility, partition coefficient, etc. We combine a SMILESstring generator, which consists of an encoder, a drug-property controlled or regulated latent space, and a de-coder, with verification deep neural networks, a target-specific three-dimensional (3D) pose generator, and math-ematical deep learning networks to generate new compounds, predict their drug properties, construct 3D posesassociated with target proteins, and reevaluate druggability, respectively. New compounds were generated inthe latent space by either randomized output, controlled output, or optimized output. In our demonstration, 2.08million and 2.8 million novel compounds are generated respectively for Cathepsin S and BACE targets. Thesenew compounds are very different from the seeds and cover a larger chemical space. For potentially active com-pounds, their 3D poses are generated using a state-of-the-art method. The resulting 3D complexes are furtherevaluated for druggability by a championing deep learning algorithm based on algebraic topology, differentialgeometry, and algebraic graph theories. Performed on supercomputers, the whole process took less than oneweek. Therefore, our GNC is an efficient new paradigm for discovering new drug candidates.
I Introduction
Drug design and discovery ultimately test our understanding of biological sciences, the status of biotechnol-ogy, and the maturity of computational sciences and mathematics. Technically, drug discovery involves targetdiscovery, lead discovery, lead optimization, preclinical development, three phases of clinical trials, and finally,launching to market only if everything goes well. Among them, lead discovery, lead optimization, and preclinicaldevelopment disqualify tens of thousands of molecules based on their binding affinities, solubilities, partitioncoefficients, clearances, permeabilities, toxicities, pharmacokinetics, etc., leaving only about ten compounds forclinical trails. Currently, drug discovery is both expensive and time-consuming. It takes about $2.6 billion dollarsand more than ten years, on average, to bring a new drug to the market. Reducing the cost and speeding upthe drug discovery process are crucial issues for the pharmaceutical industry. Much effort has been taken tooptimize key steps of the drug discovery pipeline. For example, the development of high-throughput screening(HTS) has led to an unprecedented increase in the number of potential targets and leads. HTS is able to quicklyconduct millions of tests to rapidly identify active compounds of interest using compound libraries. While there has been an increase in the number of potential targets and leads, the number of new molecularentities generated has remained stable because of a high attrition rate during preclinical development and clinicalphases, caused by the selection of leads with inappropriate physicochemical or pharmacological properties.
Rational drug design (RDD) approaches are proposed to better identify candidates with the highest probabilityof success. RDD aims at finding new medications based on the knowledge of biologically druggable targets.
Several empirical metrics, such as Lipinski’s rule of five (RO5), were established for estimating druglikeness,which describes the druggability of a substance with respect to factors like bioavailability, solubility, toxicity,etc. Generally, the early selection of candidates requires the design of molecules complementary in shape andcharge to the target of interest, which leads to a high binding affinity. Additionally, the determination of the nature ∗ Corresponding to Guo-Wei Wei. Email: [email protected] a r X i v : . [ q - b i o . B M ] O c t nd rates of physical/chemical/biological processes that are involved in the absorption, distribution, metabolism,and elimination (ADME) of drug candidates are also of primary importance. ADME profiling and prediction aremostly dependent on molecular descriptors such as RO5. Furthermore, cellular/animal disease models aretypically used during lead optimization to measure various pharmacokinetics. Finally, toxicity study is a primarytask for preclinical development.Recently, computer-aided drug design (CADD) has emerged as a useful approach in reducing the cost andperiod of drug discovery. Computational techniques have been developed for both virtual screening (VS)and optimizing the ADME properties of lead compounds. Essentially, these methods are designed as in silicofilters to eliminate compounds with undesirable properties. These filters are widely applied for the assembly ofcompound libraries using combinatorial chemistry. The integration of early ADME profiling of lead chemicalshas contributed to the speed-up of lead selection for phase-I trials without large amounts of revenue loss. Currently, compounds are added in libraries on the basis of target-focused design or diversity considerations. VS and HTS can screen compound libraries to select a subset of compounds whose properties are in agreementwith various criteria. 
Despite these efforts, the current size of databases of chemical compounds remains small when comparedwith the chemical space spanned by all possible energetically stable stoichiometric combinations of atoms andtopologies in molecules. Considering these factors, it is estimated that there are 10 distinct molecules. Amongthem, 10 are druglike. As a result, computational techniques are also being developed for the de novo designof druglike molecules and for generating large virtual chemical libraries, which can be more efficiently screenedfor in silico drug discovery.Among the computational techniques available, deep neural networks (DNN) have gained much interest for theirability to extract features and learn physical principles from training data. Currently, DNN-based architectureshave been successfully developed for applications in a wide variety of fields in the biological and biomedicalsciences. More interestingly, several deep generative models based on variational autoencoders (VAEs), adversarialautoencoders (AAEs), recurrent neural networks (RNNs), long short term memory networks (LSTMs) andgenerative adversarial networks (GANs) have been proposed for exploring the vast druglike chemical space.A policy-based reinforcement learning approach was proposed to tune RNNs for episodic tasks. A VAEwas used by Gomez-Bombarelli et al. to encode a molecule in the continuous latent space for exploringassociated properties. The usage of these models has been extended to generate molecules with desiredproperties. Miha Skalic et al. combined a conditional variational autoencoder and a captioning network togenerate previously unseen compounds from input voxelized molecular representations. Artur Kadurin et al. built an AAE to generate new compounds. Boris Sattarov et al. 
combined deep autoencoder RNNs withgenerative topographic mapping to carry out de novo molecular design.It is particularly interesting and important to generate potential drug candidates for specific drug targets. Tothis end, a network complex is required to fulfill various functions, including target-specific molecular generation,target-specific binding affinity ranking, and solubility and partition coefficient evaluation. In this work, we proposea generative network complex (GNC) to combine drug-property controlled or regulated autoencoder (AE) mod-els and DNN predictors to generate millions of new molecules and select potential drug candidates that haveappropriate druggable properties. Our GNC includes the following components:1. Using known molecules in a target-specific training set as seeds, a SMILES string generator is con-structed to generate millions of novel compounds. This generator consists of a CNN-based encoder, adrug-property controlled or regulated latent space, and a LSTM-based decoder.2. A pre-trained multitask DNN model is constructed to select drug candidates based on druggable properties.3. A 3D structure generator, MathPose, to convert selected 2D SMILES strings into 3D structures based ontarget receipt information.4. A 3D multitask druggable property predictor, mathematical deep learning (MathDL), to further select newdrug candidates via various druggable criteria.Some of these components, namely MathPose and MathDL, have been extensively validated in blind set-tings. Our GNC can not only generate new molecules, but also construct or pick up the molecules withideal drug properties. This makes it a very promising method for generating millions of new drug candidates insilico in a very short time period. 2
I MethodsII.A The structure of generative network complex (GNC)
N1CCN(CC1)C(C(F)=C2)=CC(=C2C4 =O)N(C3CC =O)N(C3CC
Affinity
LopP
LogSClearance …
3D structure generator by MathPose
Property prediction by MathDL
Training setsTraining setsSMILES string Encoder Decoder SMILES stringAffinity Clearance
LogP
LogS, … LatentspaceNew 3D structures
Experimental verification
Figure 1: A schematic illustration of a generative network complex. It consists of an autoencoder that takesSMILES strings (SS) into a drug-property regulated latent space, a regulated latent space, a LSTM-based au-todecoder, a multitask network for the evaluation of binding affinity, partition coefficient (LogP), solubility (LogS),clearance, etc., a 3D structure generator named MathPose, and MathDL, a refined 3D multitask druggable prop-erty predictor based on algebraic topology, differential geometry, and graph theory, to select new drug candidatestructures.In the proposed GNC, the first component is a generative network including encoder, drug-property regulatedlatent space, and decoder models. The generative network will take a given SMILES string as input to generatea novel one. The newly generated SMILES strings will be fed into the second component of our GNC, a 2Dfingerprint-based deep neural network (2DFP-DNN), so that only ones with desired druggable properties arekept. The next component is the MathPose model which is used to predict the 3D structure information of thecompounds selected by 2DFP-DNN. The bioactivities of those compounds are again estimated by the structure-based deep learning model named MathDL. The druggable properties predicted by this last component of ourGNC are used as an indicator to select the promising drug candidates. The outline of the GNC is illustrated inFigure 1.
II.A.1 Autoencoder
An autoencoder is a type of artificial neural network used to encode a set of data into vectors in the latent space.An autoencoder is typically combined with a decoder to transform the encoded vectors back into SMILES strings.In the present work, we propose a latent space technique which controls or regulates various drug properties,such as binding affinity, solubility (LogS), partition coefficient (LogP), clearance, etc.
Encoder
The encoder network in the present work is a convolutional neural network (CNN) which takes convertsSMILES strings into 3D molecular images before encoding their into the latent space. It consists of five 3D-convolutional layers. The number of output channels for each layer are 32, 32, 64, 64, 32, 32, 32, and 32 withkernel sizes all (1, 1, 1), respectively. For the sake of visualization, the encoder’s architecture is outlined in Figure2 In the present encoder, for each SMILES string, 3D conformers were generated via RDKit and optimizedusing the MMFF94 force field with default settings. Molecule atoms were then voxelized into a discretized 1 Åcubic grid with sides of length 24 Å prior to a random rotation and 2 Å translation of the molecule. The voxelized3 atent space (512) Encoder Decoder
SMILES strings
3D voxels via RDKit
5, 24 × 24 × 24
Conv 3D 32, 1× 1 × 1 × 2Maxpool 2× 2 × 2Conv 3D 64, 1× 1 × 1 × 2Maxpool 2× 2 × 2Conv 3D 32, 1× 1 × 1 × 2
Maxpool 2× 2 × 2 × 2Fully connected ReLu 62-unit LSTMOutput: 62 × 1024Fully connectedOutput: 62 × 29Max Output: 62 × 1 (pick up the index of the max value in each dimension)SMILES stringsAffinity Clearance LogPLogSTraining sets
Figure 2: Illustration of an autoencoder, which consists of a CNN-based encoder, a regulated latent space, anda LSTM-based decoder.value is determined by its atom type and the distance r between neighboring atoms and its center: n ( r ) = 1 − exp[ − ( r vdW /r ) ] , (1)where r vdW is the van der Waals radius.The voxelized values of five types of properties are calculated: hydrophobic, aromatic, H-bond donors, H-bondacceptors, and heavy atoms, leading to five different channels. Latent space
We propose three latent space regulation schemes, i.e., randomized output, controlled output,and optimized output, to construct new compounds. First, to generate new compounds from seeds, randomnoise can be added to the latent space. In other words, the encoded latent vectors can be perturbed by standardGaussian noise, rendering a possible new latent representation. The resulting latent vector will be fed into thedecoder network.Additionally, a more interesting control procedure is to select the latent space output through a druggableproperty assessment. As shown in Figures 1 and 2, we use the trained encoder to generate latent-space rep-resentations of a dataset of interest, such as the BACE dataset. Based on these representations of the datasetand its labels, we train deep learning network models to evaluate and predict various druggable properties, in-cluding binding affinities, solubility, partition coefficient, clearance, toxicity, etc. In certain situations, we also buildmultitask deep learning models to enhance latent-space evaluations. In this approach, each new compound inits latent space representation is evaluated for its druggable properties to determine whether it is to be fed intothe decoder.Finally, a more effective optimization scheme is to actively build new drug candidates in the latent spacerepresentation with desirable properties as shown in Figures 1 and 2. With appropriate training datasets, wefirst construct m latent-space predictive machine learning models as described above for m different properties,such as binding affinities, solubility, partition coefficient, clearance, etc. For each property, we set up a target4alue, y j . 
We then build an L loss function to optimize a given n -component latent-space vector X ∈ R n : min m (cid:88) j =1 k j (ˆ y j ( X ) − y j ) , (2)where k j is a preselected weight coefficient for the j th property and ˆ y j ( X ) : R n → R is the predicted j th propertyvalue of the latent-space vector X from latent-space machine learning models. Alternatively, we also use othermetrics, such as L or mixed metrics for constructing the loss function. The optimization with a gradient decentalgorithm leads to an iterative scheme for regularizing the latent-space vector X . Alternatively, a Monte Carloprocedure can be implemented.Target values y j can be chosen to optimize potential drugs. In case of binding affinity (BA), we use a targetedvalue of y BA ≤ − . kcal/mol. For LogP, we set y LogP ≤ . Note that additionally constraints, such as, similarity,Lipinski’s rule of five or their variants for druglikeness can be easily implemented with Eq. (2). Decoder
The decoder network here consists of several LSTMs. LSTMs are variants of RNNs that were pro-posed to handle language processing problems, which require the network to take into account the relationshipsbetween words rather than simply interpreting each word independently. RNNs are designed to pass fixed-sizepieces of information from one neuron to others in the network. However, RNNs are not very effective at process-ing information with long-term dependencies, as the persistence of information within the network is somewhatshort-lived. As a result, LSTMs were designed to overcome this problem.
The encoder network wastrained in the shape encoder framework.In each LSTM unit, there is a cell consisting of an input gate, an output gate, and a forget gate which aredescribed in the following equations H t = o t ∗ tanh( C t ) , (3)where o t depends on its input X t and the output of the last layer H t − : o t = σ ( W o [ H t − , X t ] + b o ) . (4)Here, σ is the activation function. Now, the cell state at the i th layer is given by: C t = f t ∗ C t − + i t ∗ ˆ C t , (5)where ˆ C t is the change of the cell state at the i th layer ˆ C t = tanh( W C [ h t − , x t ] + b C ) , (6)and f t and i t are given by the following, f t = σ ( W f [ H t − , X t ] + b f ) (7) i t = σ ( W i [ H t − , X t ] + b i ) . (8)The updated cell state, C t , is then passed to the next layer along with the output of the current layer and in-cludes accumulated information from all previous layers so that the network can handle long-term dependenciesbetween inputs.The purpose of LSTMs in our case is to decode molecules from the encoded vectors in the latent space.There is some variation in the decoding process via the use of probabilistic sampling. Due to the LSTM’s abilityto handle long-term dependencies, it can learn SMILES grammar, and build SMILES strings by selecting thenext token proportionally to its predicted probability. This means that some variation from the seed SMILESstring will occur. As a result, even when the input is the same, the output will not always be the same. Thiscauses the generated SMILES strings to be different from their seeds. Our LSTM decoder has 5 layers as thesame as the number of layers of the aforementioned CNN encoder. The architecture of the decoder is depictedin Figure 2. 
The Adam optimizer was applied with the learning rate 0.001 and a batch size of 128 to minimizethe following loss L = − N N (cid:88) i =1 M (cid:88) j =1 y ( i ) j log p ( i ) j , (9)where y ( i ) j and p ( i ) j , respectively, represent the the ground-truth and the predicted probability for component j thin the i th SMILES string. Also, N is the number of samples in each batch and M is the length of the SMILESstring. 5 I.A.2 2D fingerprint-based binding affinity predictors (2DFP-DNN)
The predictors are deep neural networks (DNN) pre-trained on our own training sets. A DNN mimics the learningprocess of a biological brain by constructing a wide and deep architecture of numerous connected neuron units.A typical DNN often includes multiple hidden layers. In each layer, there are hundreds or even thousands ofneurons. During the learning stage, weights on each layer are updated by backpropagation. A deep neuralnetwork is able to construct hierarchical features and model complex nonlinear relationships.The purpose of our DNN predictors is to predict the binding affinities and other properties of the generatedcompounds and, based on that, screen ideal drug candidates meeting our criteria. Binding affinity assesses adrug’s binding strength to its target, which is one of the most important drug properties.
The input of predictornetworks is 2D molecular fingerprints. In our case, a combination of ECFP and MACCS fingerprints wereused, yielding 2214 bits of features (2048 bits from ECFP and 166 bits from MACCS) in total. The output ofthe network is the drug properties, such as binding affinity, log P, and log S. During the training and predictionprocesses, the SMILES strings of compounds were first transformed to their 2D fingerprints and then fed intothe network. The fingerprint transformation from SMILES strings was conducted by RDKit. With appropriate training data, we can construct multitask DNNs for simultaneous predictions of binding affin-ity, log P, log S, and toxicity.
Our DNN predictor networks have 4 layers with 3000, 2000, 1000, and 500neurons in each hidden layer, respectively. For training, we used stochastic gradient descent with a momentumof 0.5. We trained each network for 2000 epochs with a mini-batch size of 4. We used a learning rate of 0.01 forthe first 1000 epochs and reduced it to 0.001 for the last 1000 epochs. Our tests indicate that adding dropout or L decay does not necessarily increase the accuracy of the networks, and as a consequence, we omitted thesetwo techniques. The DNN training and prediction are performed by Pytorch. II.A.3 MathDL for energy prediction … Protein-ligand complex Element specific groups
Machine learning predictionVarious Mathematical featuresAlgebraic topologyand/orDifferential geometry and/or
Graph Theory
Figure 3: A schematic illustration of the MathDL for binding affinity prediction in which the combination of severaladvanced mathematical representations is integrated with sophisticated CNN models.Our MathDL is constructed by the integration of mathematical representation features and deep learning net-works to generate a powerful binding affinity predictor.
Specifically, the MathDL is the blend of intensivelyvalidated models based on algebraic topology, differential geometry, and graph theory. In these meth-ods, algebraic topology model makes use of persistent homology in multi-component and multi-level manners tocharacterize protein-ligand complexes by topological invariants, i.e., Betti numbers counting various dimensionalholes. In the 3D space, we have Betti-0, Betti-1, and Bett-2 which receptively counts the numbers of independent6omponents, recognizes numbers of rings, and accounts for the cavity information.
Our previous work, wehave shown that algebraic topology network has outperformed other state-of-the-art methods in the classifyingproteins and active/inactive compounds, and the predictions of protein-ligand binding affinity, toxicity, lop P, and log S. Differential geometry describes how molecules assume complex structures, intricate shapes and convolutedinterfaces between different parts. In our differential geometry-based model, essential chemistry, physical, andbiological information are encoded into the low-dimensional interactive manifolds which are extracted from high-dimensional data space via a multiscale discrete-to-continuum mapping.
Thereby, the molecular structuresand atomic interactions can be conveniently represented via interactive curvatures, interactive areas, etc. Nu-merous numerical validations have shown that the differential geometry model has achieved the state-of-the-artperformances on various biological prediction tasks, namely drug toxicity, molecular solvation, and protein-ligandbinding affinity. Recently, we have developed a powerful algebraic graph-based scoring function which encodes the importantphysical and biological properties such as hydrogen bonds, hydrophilicity, hydrophobicity, van der Waals inter-actions, and electrostatics from the high-dimension space into the low-dimension description via the invariantsextracted from Laplacian, its pseudo-inverse, and adjacency matrices. Algebraic graph theory-based modelshave been widely utilized in the study of physical modeling and molecular analysis such as chemical analy-sis, protein flexibility analysis.
Despite its popularity, the graph-based quantitative models typically arenot as competitive as other quantitative approaches due to no categorization on element types and the missingcrucial non-covalent interactions. This missing information has been encoded in multiscale weighted coloredsubgraphs in our newly designed algebraic graph-based model, named AGL-Score. Extensive numerical vali-dation on PDBbind benchmarks with various evaluation metrics, namely scoring power, ranking power, dockingpower, and screening power has shown that our AGL-Score has outperformed other state-of-the-art methods onthese evaluations which are the standard criteria for virtual screening in drug discovery. The combination of these three powerful models gives rise to the MathDL model which is expected to be oneof the most accurate binding affinity predictors available in the literature. Indeed, the MathDL model achieved thetop performances on the affinity ranking and free energy prediction for Cathepsin S (CatS) inhibitors in the DrugDesign Data Resource (D3R) Grand Challenge 4 (GC4), a worldwide competition series in the computer-aideddrug design. Also, MathDL model was the competitive scoring functions on the binding energy predictions forbeta-secretase 1 (BACE) compounds in GC4. The outline of the MathDL model is depicted in Figure 3.
II.A.4 MathPose for 3D structure prediction
Selective set of complexes
MathDLPose generation by
Vina/GOLD/GLIDE
Input 2D
SMILES
Re-docking with
Vina/GOLD/GLIDE Top ranked pose
Figure 4: A schematic illustration of the MathPose approach for 3D structure generation from a given input 2DSMILES string.In our recent work, we have successfully designed an AGL-Score model to achieve the best performances indocking power metrics which validate the scoring function’s ability to identify the “native pose” from the computer-generated poses. Specifically, on the CASF-2007 benchmark, our AGL-Score achieves 84% accuracy onthe docking power assessment. The second best scoring function on this benchmark is from GOLD softwarewith ASP fitness score (82%). Our scoring function is still the top performer on docking power test of CASF-2013 benchmark with the accuracy as high as 90%, followed by the machine learning based-scoring function ∆ vina RF (87%). With such promising results, it is expected that the replacement of the single AGL-Scoremodel by intricate MathDL scoring function will certainly improve the quality of pose ranking. This results in theMathPose model whose framework is outlined in Figure 4. In our MathPose, besides the SMILES string of theinterested ligand L , we select a set of complexes having similar binding sites to the one the ligand L can bindto. A pool of nearly 1000 poses for the ligand L is generated by several common docking software, namelyAutodock Vina, GOLD, and GLIDE. Additionally, three docking software packages are utilized to re-dockthe complexes in the selective data set to form at least 100 decoy complexes per input. Then, our MathDL will7e trained on these decoy sets to learn the calculated root mean squared deviation (RMSD) between the decoyand native structures. The trained MathDL will be applied to pick up the top-ranked pose for the given ligand L . II.B The analysis of generated compoundsThe 2D similarity analysis between generated compounds and their seeds.
To investigate how “novel” ourgenerated compounds are from their seeds, a similarity analysis was performed on them. The 2D molecularSMILES strings of the generated molecules were also transformed into 2D molecular fingerprints and then thesimilarity scores between the fingerprints of the generated molecules and their seeds were calculated. Thefingerprints were the same ones used in the DNN predictors, a combination of ECFP and MACCS molecularfingerprints. The criteria used for the similarity scores was the Tanimoto coefficient. The fingerprint transfor-mation was also conducted by RDKit. The k-means clustering analysis of generated compounds.
Cluster analysis or clustering is the task ofgrouping a set of objects in such a way that objects in the same group are more similar to each other than tothose in other groups. It has already been widely applied to protein conformation analysis.
To present thediversity of our generated active compounds, k-means clustering analysis was performed. The input featureswere the same molecular fingerprints discussed above, and the k-means clustering was conducted by scikit-learn. For each cluster, the center was extracted to represent the cluster.
III Results
To examine and validate the performance of our proposed GNC for generating new compounds for drug targets,we consider two specific targets, namely Cathepsin S (CatS) set and Beta-Secretase 1 (BACE). These two tar-gets appeared in the D3R Grand Challenges, worldwide competition series in computer-aided drug design, with components addressing pose-prediction, affinity ranking, and free energy calculations.Both CatS and BACE are potential targets for significant human diseases. CatS constitutes an 11-memberfamily of proteases involved in protein degradation. It is highly expressed in antigen-presenting cells, whereit degrades major histocompatibility complex class II (MHC II)-associated invariant chain. CatS is a candidatetarget for regulating immune hyper-responsiveness, as the inhibition of CatS may limit antigen presentation.
BACE is a transmembrane aspartic-acid protease human protein encoded by the BACE1 gene. It is essentialfor the generation of beta-amyloid peptide in neural tissue, a component of amyloid plaques widely believed tobe critical in the development of Alzheimer’s, rendering BACE an attractive therapeutic target for this devastatingdisease. The rest of this section is devoted to the utilization of the proposed GNC on the exploration of newpotential drugs for CatS and BACE targets.
III..1 Faithful validation of generative network on CatS and ZINC data sets
To assess the performance of the autoencoder on the CatS data set, we converted all SMILES strings in thedata set into the canonical form using RDKit, and kept only the strings with length no more than 60. This wasdone because the decoder network was designed to produce only SMILES strings of length at most 60. This leftus with 1858 of the 2847 molecules in the CatS training set. After feeding these molecules through the network,1427 (76.8%) yielded valid SMILES strings, with none being identical to the original.We also tested the performance of the autoencoder on a larger data set of 1 million molecules, randomlychosen from the same subset of the ZINC 15 data set from which the training samples were drawn. The trainingset produced from the ZINC 15 data set contains 192,813,983 molecules, 26,880,000 of which were previouslyseen by the autoencoder during training. From these 1 million molecules, 994,219 (99.4%) yielded valid SMILESstrings, and 2,724 (0.27%) SMILES strings were reproduced exactly. A high valid molecule generation rate anda low reconstruction rate enable us to generate meaningful compounds with highly diverse chemical properties. III.A BACEIII.A.1 Data preparation
To enable the proposed GNC to generate meaningful BACE inhibitors, one needs to supply it with seed moleculesclosely related to the BACE target. To this end, we combine all BACE inhibitors provided in the D3R Grand Chal-lenge 4 ( https://drugdesigndata.org/about/grand-challenge-4 ) with the BACE ligands having the reportedbinding affinity on the ChemBL database ( ). That results in a BACE data set of3916 compounds with binding affinities ranging from -2.84 kcal/mol to -13.22 kcal/mol. If one sets -9.56 kcal/molas a threshold to label a compound as active, that BACE data set has 1231 active ligands. The distribution ofbinding affinity in the BACE data set is shown in Figure 5a. That figure reveals that most of the molecules in ourcollected data set having affinities between -10 kcal/mol and -7 kcal/mol. Also, there are more BACE inhibitorswith binding strength less than -10 kcal/mol than ones having binding affinity higher than -6 kcal/mol.8igure 5: Distributions on the BACE set. (a) The distribution of the experimental binding affinities in the BACEtraining set; (b) The similarity distribution of new molecules compared with their seeds in the BACE set. (c) Thedistribution of new BACE molecules’ binding affinities predicted by 2D fingerprint network model 2DFP-DNN.Figure 6: The illustration of similarity between a seed molecule in the BACE set and some generated compounds:(a) The seed; (b) The most similar compound generated from the seed (similarity score=0.50); (c) A compoundwith a medium similarity score of 0.34; (d) The most different one from the seed (similarity score=0.23).
III.A.2 Structure generation
By feeding the BACE data set of 3916 compounds to the generative network, as many as 2.8 million valid compounds were generated on supercomputers in less than one week. To indicate how "novel" our generated compounds are relative to their seeds, the similarity score between each generated compound and its seed is calculated. The similarity score distribution is illustrated in Figure 5b. The figure reveals that the similarity scores of the generated compounds span a broad range from 0.15 to 1.00, which means that our generated compounds cover a very large chemical space. A similarity score of 1 indicates that the generated compound is exactly the same as the seed. Fortunately, this is very rare, happening only 9 times among all 2,727,379 generated compounds. In most cases, the similarity scores are very low, with an average value of 0.34, implying a wide range of diversity among the generated samples.

To further verify that the generated compounds are genuinely distinct from the seeds, a seed molecule and several generated compounds are shown in Figure 6. Figure 6a depicts the seed, Figure 6b illustrates the most similar compound generated from the seed, Figure 6c plots a compound with a medium similarity score of 0.34, and Figure 6d presents the most different one. Even for the most similar compound, with a similarity score as high as 0.50, the chemical structures are still quite different due to the replacement of the fused ring by a carbon chain (see Figures 6a and 6b).
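The similarity score used above is typically the Tanimoto coefficient between 2D molecular fingerprints. As a minimal sketch, assuming fingerprints are represented as sets of on-bit indices (e.g., from an ECFP-style fingerprint; the bit values below are purely illustrative):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    intersection = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - intersection
    return intersection / union if union else 1.0

# Two hypothetical fingerprints sharing two of their on bits
seed_fp = {3, 17, 42, 99}
gen_fp = {17, 42, 105, 221}
score = tanimoto(seed_fp, gen_fp)  # 2 shared bits out of 6 distinct bits -> 1/3
```

A score of 1.0 means the two fingerprints are identical, which is why exact seed reproductions show up as similarity 1 in Figure 5b.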
III.A.3 Binding affinity screening by 2DFP-DNN
To efficiently select potential drug candidates, we use the 2D fingerprint DNN model discussed in Section II.A.2 to predict the binding affinities of more than 2.7 million compounds. Figure 5c illustrates the predicted energies. One can notice that the predicted affinities of the generated BACE compounds are distributed in a Gaussian manner. This result is probably due to the Gaussian-like distribution of similarity scores between the generated compounds and their corresponding seeds depicted in Figure 5b.

The predicted binding affinities spread widely from -3.89 kcal/mol to -10.20 kcal/mol, confirming that a large chemical space is covered. The peak is at -7.1 kcal/mol, which means about half of the generated compounds have binding affinities below -7.1 kcal/mol. Among this half, 5 compounds have predicted binding affinities below -10 kcal/mol, indicating that they are promising drug candidates. Moreover, there are 2130 compounds with binding affinities below -9 kcal/mol and 178,250 compounds with binding affinities below -8 kcal/mol. In this work, we use a common binding affinity threshold, namely -9.56 kcal/mol, to screen out likely less-active compounds. As a result, we are left with the 99 generated inhibitors having the lowest predicted binding energies.

Note that the 2D fingerprint DNN model for binding affinity prediction relies only on ligand information, without involving the target protein. Therefore, its accuracy is not as high as that of its 3D counterparts (e.g., MathDL), in which the interactions between the target binding site and the compounds of interest are fully incorporated. However, running such 3D-based binding affinity predictors on a large pool of molecules would be very time-consuming. Thus, in this work, we exploit the computational efficiency of the 2D-based model to rapidly filter a large number of compounds by predicted affinity.
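The screening step itself reduces to a simple filter on predicted energies. A minimal sketch (the compound IDs and values below are illustrative, not from the actual generated set):

```python
ACTIVE_THRESHOLD = -9.56  # kcal/mol; lower (more negative) means stronger predicted binding

def screen_actives(predicted, threshold=ACTIVE_THRESHOLD):
    """Keep only compounds whose predicted binding energy is below the threshold."""
    return {cid: e for cid, e in predicted.items() if e < threshold}

candidates = {"gen_1": -10.2, "gen_2": -7.0, "gen_3": -9.6}
actives = screen_actives(candidates)  # keeps gen_1 and gen_3
```

The surviving compounds are then ranked by predicted energy, and only the top few are passed to the expensive 3D-based re-evaluation.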
III.A.4 Clustering analysis of selected compounds
Figure 7: The centers of the 6 clusters found in our BACE generated set.

To illustrate how diverse our generated active compounds are, clustering analysis was performed on the 99 generated compounds with the highest predicted binding affinities discussed in Section III.A.3. By carrying out the k-means clustering method, one finds 6 clusters in our generated set; the center of each cluster is shown in Figure 7.

Statistically, the sizes of these 6 clusters are 7, 38, 10, 7, 12, and 25, respectively. Inside these 6 clusters, the average similarity scores to the centers are 0.69, 0.58, 0.62, 0.66, 0.63, and 0.67, respectively, which indicates that the compounds in the same cluster are relatively similar. By contrast, the similarity scores between different clusters are much lower. Specifically, the similarity scores between these 6 cluster centers are only around 0.40, implying the high diversity of our generated compounds.

Among these 6 clusters, cluster 2 is the biggest one, with 38 compounds. Moreover, it contains the largest number of compounds with high predicted binding affinities. In particular, cluster 2 has 5 compounds with predicted binding affinities below -10 kcal/mol. Since the compounds in the same cluster are similar, other compounds in cluster 2 may also have a high potential to become drugs. SMILES strings of all 99 compounds in the 6 clusters are included in Table S1 in the Supporting Information.
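The clustering step can be reproduced with any standard k-means implementation (e.g., scikit-learn on fingerprint feature vectors). As a self-contained illustration of the underlying algorithm, here is a minimal Lloyd's iteration on Euclidean feature vectors; it is a sketch, not the exact procedure or feature representation used in this work:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster emptied).
        new_centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Cluster centers found this way play the role of the representative molecules shown in Figure 7, and within-cluster similarity to the center measures how tight each cluster is.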
III.A.5 Binding affinity screening by MathDL
Figure 8: Top 4 generated BACE compounds having the lowest binding affinities predicted by the MathDL model. Their 3D structures were constructed by MathPose. Their IDs under our naming system and the predicted energies are, respectively, (a) BACE_gen_35 (-8.263 kcal/mol); (b) BACE_gen_66 (-8.258 kcal/mol); (c) BACE_gen_29 (-8.202 kcal/mol); and (d) BACE_gen_25 (-8.20 kcal/mol). Their SMILES strings are provided in Table S1, and their corresponding 3D structures are included in File S1. All of these molecules were docked to the protein extracted from the complex with PDB ID 3dv5.

The DNN model using the 2D fingerprint features, as discussed in Section III.A.3, relies only on ligand information and lacks the receptor environment. As a result, its reliability is not guaranteed when identifying the most promising drug candidates. It has been shown that structure-based models often outperform ligand-based models on diverse data sets.
Therefore, our MathDL model, discussed in Section II.A.3, is utilized to re-rank the compounds picked out by the 2DFP-DNN model. The MathDL model is trained on the BACE data set of 3916 compounds whose 3D structures are generated by MathPose, mentioned in Section II.A.4. The Kendall's Tau coefficient (τ) and Pearson correlation coefficient (R_p) of the cross-validation on the training data are 0.608 and 0.797, respectively. These accuracy evaluations indicate a well-trained MathDL model on this specific training set. The set of 99 generated compounds is fed into MathPose to obtain the 3D structures provided in File S1 in the Supporting Information. All of them were docked to the protein extracted from a complex with PDB ID 3dv5. Their binding affinities are then predicted by the aforementioned trained MathDL model. Notably, the binding affinity values of the 99 generated molecules predicted by MathDL are higher than those estimated by the 2D fingerprint DNN approach. Specifically, based on the MathDL predictor, the lowest binding energy is -8.263 kcal/mol, the highest energy is -5.972 kcal/mol, and the average energy over the 99 compounds is -7.33 kcal/mol. Figure 8 illustrates the binding poses of the top four ligands by predicted affinity, namely BACE_gen_35, BACE_gen_66, BACE_gen_29, and BACE_gen_25. The predicted energies of these top 4 molecules are -8.263 kcal/mol, -8.258 kcal/mol, -8.202 kcal/mol, and -8.20 kcal/mol, respectively. Despite having nearly the same predicted affinities, these top 4 compounds are quite different molecules as judged by their 2D similarity scores. Specifically, among these 4 compounds, BACE_gen_35 and BACE_gen_66 are the most similar structures, yet their similarity score is as low as 0.265.
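Both evaluation metrics used above are standard correlation statistics and can be computed in a few lines of pure Python. This is a minimal sketch (the Kendall's tau-a variant, without tie corrections; production code would use a library implementation such as scipy.stats):

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient R_p of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += 1 if prod > 0 else -1 if prod < 0 else 0
    return s / len(pairs)
```

Pearson measures linear agreement between predicted and experimental energies, while Kendall's tau measures ranking agreement, which is the more relevant quantity when the goal is to rank candidate inhibitors.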
In addition, BACE_gen_29 and BACE_gen_25 are the most dissimilar compounds, with a 2D similarity score of 0.11. Generating compounds with very strong binding affinities and diverse chemical formulas is an important goal at the pre-clinical stage, since it enhances the chance of selecting promising drug candidates with a low risk of side effects. Obtaining top-ranked yet disparate molecules demonstrates the capacity of our proposed GNC to capture a wide range of chemical space.

III.B CatS

III.B.1 Data preparation
Figure 9: Three distributions for the CatS set. (a) The distribution of experimental binding affinities in the CatS data set; (b) The distribution of similarity scores to their seeds in the CatS generated set; (c) The distribution of the CatS generated set's binding affinities predicted by the 2D fingerprint network model 2DFP-DNN.

As with the BACE target, CatS inhibitors were presented in the D3R Grand Challenges. Thus, these compounds (n = 593) are used as seeds to produce new CatS molecules. Other CatS compounds reported in the ChEMBL database are also included in our seeds. In total, we collected a data set of 2847 compounds. The binding affinities of these molecules range from -4.72 to -14.33 kcal/mol. As with the BACE data set, we chose -9.56 kcal/mol as the threshold for active compound selection. With this threshold, 1461 of the 2847 compounds in our seeds are active. The distribution of binding affinity in our collected CatS data set is shown in Figure 9a.

III.B.2 Structure generation
Using the 2847 compounds in the collected CatS set as seeds and feeding them into the generator network, we generated 1000 distinct compounds for each seed, for a total of 2,847,000 generated compounds. However, there was some duplication among the compounds generated by different seeds, resulting in only 2,080,566 distinct compounds. To determine the novelty of our generated compounds, the similarity score between each generated compound and its seed is evaluated and depicted in Figure 9b.

Similar to the results for the BACE data set, the similarity scores of the generated compounds span a broad range from 0.06 to 1.00. A similarity score of 1.00 was obtained only 12 times among all 2,847,000 generated molecules. In most cases, the similarity scores are very low, with an average value of 0.34, indicating a lot of diversity among the generated samples. To further verify that the generated compounds are genuinely different from the seeds, a seed molecule and several generated compounds are shown in Figure 10. Figure 10a is the seed, Figure 10b is the most similar compound generated from the seed with a similarity score of 0.45, Figure 10c is a compound with a medium similarity score of 0.31, and Figure 10d is the most different one with a similarity score of 0.18. Obtaining low similarity scores between generated compounds and their seeds is one of the desired features of our GNC model, in which the novelty of computer-generated molecules is emphasized.

Figure 10: Illustration of the similarity between a seed in the CatS set and some generated compounds: (a) The seed; (b) The most similar compound generated from the seed (similarity score = 0.45); (c) A compound with a medium similarity score of 0.31; (d) The most different one from the seed (similarity score = 0.18).
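The deduplication across seeds described above is a straightforward set/dictionary pass over the generated SMILES strings. A sketch, where `generate_for_seed` is a stand-in for the trained generator (the toy seeds and strings below are illustrative):

```python
def collect_unique(seeds, generate_for_seed, per_seed=1000):
    """Generate per_seed candidates for every seed and deduplicate globally.
    Returns a dict mapping each distinct SMILES to the first seed that produced it."""
    distinct = {}
    for seed in seeds:
        for smiles in generate_for_seed(seed, per_seed):
            distinct.setdefault(smiles, seed)
    return distinct

# Toy generator: both seeds emit an overlapping candidate "CCO"
fake_gen = lambda seed, n: {"s1": ["CCO", "CCN"], "s2": ["CCO", "CCF"]}[seed]
pool = collect_unique(["s1", "s2"], fake_gen)  # 4 generated, 3 distinct
```

Recording the first producing seed also makes it cheap to compute the seed-to-compound similarity scores reported in Figure 9b.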
III.B.3 Binding affinity screening by 2DFP-DNN
Here, we use the 2DFP-DNN model to filter out "bad" generated CatS molecules by the binding affinity criterion. As in the BACE compound screening, we use an affinity threshold of -9.56 kcal/mol; any molecule with predicted energy above this threshold is left out. As a result, we selected 61,571 potentially "good" compounds.

Furthermore, we are interested in the overall distribution of the binding affinities of the generated compounds. Figure 9c depicts the distribution of the predicted affinities for all 2,080,566 molecules. The distribution is fairly close to a Gaussian distribution. Consistent with the similarity score distribution above, the range of binding affinity predictions is very large, from -4.61 kcal/mol to -12.12 kcal/mol, confirming that a large chemical space is covered. The mean binding affinity is -7.62 kcal/mol. Among the compounds with the lowest predicted binding affinities, 21,283 compounds have binding affinities below -10 kcal/mol, 510 compounds below -11 kcal/mol, and 1 compound below -12 kcal/mol. These are potentially very highly active compounds. However, as discussed in Section III.A.3, there is no free lunch in the development of binding prediction models: the 2DFP-DNN predictor is extremely fast on millions of molecules, but its accuracy is less competitive than that of 3D-based models such as MathDL. Thus, we still utilize the MathDL scoring function to select the most promising drug candidates.
III.B.4 Clustering analysis of selected compounds
To illustrate how diverse our generated compounds are, clustering analysis was performed on the 61,571 selected compounds generated by our model. Six clusters are found in our generated set, and the center of each cluster is shown in Figure 11. The sizes of these 6 clusters are 7077, 13059, 15048, 9221, 6884, and 10282, respectively. Inside these 6 clusters, the average similarity scores to the centers are 0.37, 0.34, 0.34, 0.39, 0.41, and 0.36, respectively, which indicates a significant variation among the compounds in each cluster. In addition, the average binding affinities of the clusters are -10.01, -9.89, -9.91, -9.98, -9.92, and -9.93 kcal/mol, respectively. Unlike in the BACE data set, there is not much difference in the average energies between different clusters. Therefore, one can expect to obtain promising drug candidates with dissimilar physical and chemical properties.
III.B.5 Binding affinity screening by MathDL
MathDL was trained here with the 2847 seeds used in the generator network. The Pearson correlation coefficient and Kendall's Tau coefficient on the 10-fold cross-validation (CV) of the training set were found to be R_p = 0. and τ = 0. , respectively. The promising CV performance ensures a well-trained machine learning model. Furthermore, the reliability of the MathDL models for affinity ranking of CatS inhibitors was demonstrated in Grand Challenges 3 and 4, where our models ranked 1st among more than 50 teams from around the world.

Figure 11: The centers of the 6 clusters found in our CatS generated set.

To further validate the generated molecules, the top 1050 compounds by energy according to the 2DFP-DNN network are re-ranked by the MathDL model. To prepare the input for the structure-based model, the 3D poses of these 1050 molecules are predicted by our MathPose. Under the 2DFP-DNN predictor, the average binding affinity of these 1050 compounds is -11.05 kcal/mol; their affinities range from -10.693 kcal/mol to -12.159 kcal/mol with a standard deviation of 0.213 kcal/mol. On the other hand, under the MathDL model, the average binding affinity of the selected molecules is -9.27 kcal/mol, with a range between -7.008 kcal/mol and -11.681 kcal/mol and a standard deviation of 0.736 kcal/mol. The Pearson correlation coefficient between the energy predictions of 2DFP-DNN and MathDL for the generated compounds is as low as 0.112, which indicates the disagreement between the two models. This discrepancy was also observed when predicting the affinity ranking of the CatS molecules in Grand Challenge 4, where the structure-based model MathDL outperformed its ligand-based counterpart.
Therefore, MathDL’s predicted energies are chosen to select thepromising drug candidates among the computer-generated compounds.The 3D structures of top 4 compounds in term of affinity, namely CatS_gen_195, CatS_gen_968, CatS_gen_902,and CatS_gen_228, are plotted in Figure 12. Their reported affinities are, respectively, -11.681 kcal/mol, -11.608kcal/mol, -11.540 kcal/mol, and -11.536 kcal/mol. Despite similarly predicted affinities, their structures are quitedissimilar from each other. Specifically, the highest similarity score is 0.297 obtained between CatS_gen_968and CatS_gen_902 molecules. While the lowest similarity score is 0.11 evaluated between CatS_gen_902 andCatS_gen_228. The statistical information again confirms the ability of our proposed GNC to cover large chemi-cal space. IV Discussions
Since the chemical space is huge, there is a need to generate a wide variety of novel compounds covering all kinds of properties. This work introduces the GNC to generate novel molecules, predict their druggable properties, and finally select the drug candidates that meet thresholds for drug properties such as binding affinity. We discuss a number of issues concerning generative networks.

Figure 12: Four generated CatS compounds having the lowest binding affinities predicted by the MathDL model. Their 3D structures were predicted by MathPose. Their IDs under our naming system and the predicted energies are, respectively, (a) CatS_gen_195 (-11.681 kcal/mol); (b) CatS_gen_968 (-11.608 kcal/mol); (c) CatS_gen_902 (-11.540 kcal/mol); and (d) CatS_gen_228 (-11.536 kcal/mol). Their SMILES strings are provided in Table S2, and their corresponding 3D structures are included in File S2.
IV.A Latent space design of new compounds
Latent space information can be effectively modified by a variety of methods. In the current work, we propose three approaches: 1) randomized output, 2) controlled output, and 3) optimized output. The first approach can certainly create new molecules, although some of the new latent space configurations cannot be interpreted by the decoder.

The second approach is designed to discriminate potentially drug-like molecules from potentially inactive ones. We found that machine learning models built from latent space representations are highly accurate; therefore, the proposed approach is potentially very useful. Nonetheless, the performance of this method depends crucially on the quality of the training data sets. Additionally, for this approach to effectively control the druggability of the generated compounds, the decoder must be intensively trained with tens of millions of molecules and have a near-perfect reconstruction rate. Achieving a high reconstruction rate for a diverse class of test compounds is a challenging issue in the design of molecular autoencoders. This issue is under our consideration.

The third approach is introduced to create new compounds with desirable druggable properties. Similar to the previous method, the success of this approach depends on the quality of the training data sets and machine learning models, as well as the reconstruction rate of the decoder. Additionally, reference selection for each drug property is another important issue; it depends on our current understanding of and criteria for drug-like molecules. Nevertheless, this approach is very promising and will be an important direction for future studies.

Finally, it is noted that the third approach does not depend on the seed configuration. Therefore, its initial latent space distribution can be chosen randomly. As such, this method can be very fast and efficient.
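The randomized-output approach can be pictured as Gaussian perturbation of a seed's latent code. The following is only a schematic sketch, assuming a plain list-of-floats latent representation; the actual noise model and latent dimension of the GNC are not specified here, and each sampled vector would still need to be passed through the decoder to obtain a SMILES string:

```python
import random

def randomized_latents(z_seed, n_samples, sigma=0.1, rng=None):
    """Draw n_samples latent vectors by adding Gaussian noise to the seed's latent code."""
    rng = rng or random.Random(0)
    return [[z + rng.gauss(0.0, sigma) for z in z_seed]
            for _ in range(n_samples)]

# Three perturbed copies of a hypothetical 8-dimensional latent code
samples = randomized_latents([0.0] * 8, n_samples=3)
```

As noted above, some perturbed configurations may fall outside the region the decoder can interpret, which is why only a fraction of sampled vectors yield valid SMILES strings.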
IV.B Generator efficiency
One challenge the traditional pharmaceutical industry faces is that designing new drug candidates is very time-consuming. This low efficiency cannot address the variety of health crises humans currently encounter, such as drug-resistant infections and the fast mutation of viruses, which demand many new drugs in a very short time.

Computers are typically faster than human beings, so generating new drug candidates by computers is a potential solution. In our case, using a single Nvidia K20 CUDA GPU card, our generator network produced 2.08 million and 2.8 million novel compounds for the CatS and BACE targets, respectively, in less than one week; such a task is far beyond human power. Moreover, the process is fully automatic and does not even need human supervision, so such automatic generators can provide a huge drug-candidate database rather than a few sporadic candidates. What is more, as we showed in this work, by combining this generator with reliable automatic DNN predictors, the bulk of drug candidates can be further screened based on the predicted properties. This fully automatic workflow should be a promising future direction for the pharmaceutical industry.
IV.C Chemical spaces generated by generators
Our generated compounds originate from their seeds, some known ligands binding to CatS and BACE from the PDBbind database. Thanks to the random sources included in the autoencoder, these generated compounds are truly novel and quite far from their seeds: for both CatS and BACE, the average similarity score to the seeds is only around 0.3. This means the random sources and the generator create novel compounds rather than merely copying the seeds.

More importantly, these generated compounds spread over a huge chemical space, meaning they cover a large range of chemical properties, so it is more likely to hit potential drug candidates. First, the similarity scores to the seeds have a large range: for BACE it is 0.2 to 0.6, and for CatS it is 0.15 to 0.65. Second, our predicted binding affinities also have a wide range, from -5 kcal/mol to -10 kcal/mol or even -11 kcal/mol. All in all, the generator is powerful: it starts from seeds but covers a huge chemical space far away from them.
IV.D Faith vs novelty
The random noise regulated latent space we designed here can generate many novel compounds far from their seeds, because random sources are included by design. In other words, this generator is not faithful, since the output is quite different from the input. Such an architecture suits our purpose: we want to create a broad set of new compounds from the seeds rather than faithful reproductions.

In another scenario, however, faithfulness is highly needed. Griffiths et al., Jin et al., Kusner et al., and Dai et al. perform Bayesian optimization in the latent space to obtain compounds with desired properties. We have also designed controlled and optimized latent spaces in the present work. In these cases, outputs should faithfully reflect the latent space; otherwise, optimization in the latent space cannot be faithfully passed to the output. The reconstruction accuracy is a very critical evaluation. Much effort has already been put into improving reconstruction accuracy, such as the grammar VAE, syntax-directed VAE, and junction tree VAE, achieving a reconstruction accuracy as high as 0.76. In comparison, the VAE we applied only has a reconstruction accuracy of 0.20, which means that our outputs are almost always new compounds. However, our new gated recurrent unit (GRU)-based autoencoder can achieve a 99% reconstruction accuracy, which enables us to carry out desirable design in the latent space. The details of this work will be published elsewhere.

IV.E Chemical spaces of predicted high binding affinity compounds
Using our predictors, high binding affinity compounds can be screened out. One concern is whether these high binding affinity compounds spread over a large chemical space or are similar to each other within a small range. According to our results, even though the numbers of high binding affinity compounds are only 1050 and 99 for CatS and BACE, respectively, they are still quite diverse. In our clustering analysis, these high binding affinity compounds are classified into 6 clusters, and the similarities between clusters are only around 0.4.

The large chemical space covered by the high binding affinity compounds is beneficial to drug design. First, a good drug depends not only on binding affinity but also on other properties such as toxicity, log P, log S, clearance, etc. A large chemical space means these high binding affinity compounds differ in these other properties, so there is more chance for some of them to pass the screenings based on other properties. Additionally, a greater variety of related drugs makes it easier to tackle the fast mutation of viruses.
Conclusion
In our work, a generative network complex (GNC) is introduced. We propose three latent-space techniques, including randomized output, controlled output, and optimized output, to generate novel and potentially active compounds. Additionally, their physical and chemical properties are predicted by a two-dimensional (2D) fingerprint-based deep learning predictor, and potential drug candidates are preliminarily screened by the predicted properties. Moreover, for promising drug candidates, their 3D poses associated with specific protein targets are predicted by our MathPose, one of the most accurate pose prediction schemes according to the D3R Grand Challenges, a worldwide competition series in computer-aided drug design. Finally, more accurate property estimations based on the 3D poses are performed by our MathDL, an advanced mathematics-based deep learning network, leading to new drug candidates with desirable drug properties. This automated platform has been used to generate 2.08 million new drug candidates for Cathepsin S and 2.8 million novel compounds for BACE. For 1050 potential drug candidates for CatS and 99 potential drug candidates for BACE, 3D poses associated with their target proteins have been created to further evaluate their druggable properties. Our framework is designed to create new drugs in silico, so as to save time and reduce cost in drug discovery. Designing gated recurrent unit (GRU)-based autoencoders with near-perfect reconstruction accuracies is under our consideration to achieve robust latent space drug design.
Supplementary materials
Supplementary materials are available online for potential drug candidates for Cathepsin S and BACE targets.
TableS1.csv
A list of SMILES strings and predicted binding affinities of 99 potentially active compounds for theBACE target.
TableS2.csv
A list of SMILES strings and predicted binding affinities of 1050 potentially active compounds for the CatS target.

Supplementary materials are available upon request for potential drug candidates for the Cathepsin S and BACE targets (nearly 70 gigabytes in size).
FileS1.zip
Zip file of 3D structure information of 99 selectively generated BACE compounds and their receptors.
FileS2.zip
Zip file of 3D structure information of 1050 selectively generated CatS compounds and their recep-tors.
Acknowledgments
This work was supported in part by NSF Grants DMS-1721024, DMS-1761320, and IIS1900473 and NIH grantGM126189. DDN and GWW are also funded by Bristol-Myers Squibb and Pfizer.
References

[1] Joseph A DiMasi, Henry G Grabowski, and Ronald W Hansen. Innovation in the pharmaceutical industry: new estimates of R&D costs.
Journal of health economics , 47:20–33, 2016.[2] James P Hughes, Stephen Rees, S Barrett Kalindjian, and Karen L Philpott. Principles of early drugdiscovery.
British journal of pharmacology , 162(6):1239–1249, 2011.[3] Ricardo Macarron, Martyn N Banks, Dejan Bojanic, David J Burns, Dragan A Cirovic, Tina Garyantes,Darren VS Green, Robert P Hertzberg, William P Janzen, Jeff W Paslay, et al. Impact of high-throughputscreening in biomedical research.
Nature reviews Drug discovery , 10(3):188, 2011.[4] Ann I Graul, Laura Revel, Esmeralda Rosa, and Elisabet Cruces. Overcoming the obstacles in thepharma/biotech industry: 2008 update.
Drug News Perspect , 22(1):39, 2009.[5] Mahmud Tareq Hassan Khan. Predictions of the admet properties of candidate drug molecules utilizingdifferent qsar/qspr modelling approaches.
Current drug metabolism, 11(4):285–295, 2010.

[6] Michael J Waring, John Arrowsmith, Andrew R Leach, Paul D Leeson, Sam Mandrell, Robert M Owen, Garry Pairaudeau, William D Pennie, Stephen D Pickett, Jibo Wang, et al. An analysis of the attrition of drug candidates from four major pharmaceutical companies.
Nature reviews Drug discovery , 14(7):475,2015.[7] Kristian Strømgaard, Povl Krogsgaard-Larsen, and Ulf Madsen.
Textbook of drug design and discovery .CRC press, 2017.[8] Christopher A Lipinski, Franco Lombardo, Beryl W Dominy, and Paul J Feeney. Experimental and com-putational approaches to estimate solubility and permeability in drug discovery and development settings.
Advanced drug delivery reviews , 23(1-3):3–25, 1997.[9] Katya Tsaioun, Michel Bottlaender, and Aloise Mabondzo. Addme–avoiding drug development mistakesearly: central nervous system drug discovery perspective. In
BMC neurology , volume 9, page S1. BioMedCentral, 2009.[10] Saeed Alqahtani. In silico adme-tox modeling: progress and prospects.
Expert opinion on drug metabolism& toxicology , 13(11):1147–1158, 2017.[11] KV Balakin, YA Ivanenkov, and NP Savchuk. Compound library design for target families. In
Chemoge-nomics , pages 21–46. Springer, 2009.[12] IM Kapetanovic. Computer-aided drug discovery and development (caddd): in silico-chemico-biologicalapproach.
Chemico-biological interactions , 171(2):165–176, 2008.[13] Renjie Huang and Ivanhoe Leung. Protein-directed dynamic combinatorial chemistry: a guide to proteinligand and inhibitor discovery.
Molecules , 21(7):910, 2016.[14] Paweł Szyma ´nski, Magdalena Markowicz, and El˙zbieta Mikiciuk-Olasik. Adaptation of high-throughputscreening in drug discovery-toxicological screening tests.
International journal of molecular sciences ,13(1):427–452, 2012.[15] Gisbert Schneider and Uli Fechner. Computer-based de novo design of drug-like molecules.
Nature Re-views Drug Discovery , 4(8):649, 2005.[16] Seonwoo Min, Byunghan Lee, and Sungroh Yoon. Deep learning in bioinformatics.
Briefings in bioinformat-ics , 18(5):851–869, 2017.[17] Polina Mamoshina, Armando Vieira, Evgeny Putin, and Alex Zhavoronkov. Applications of deep learning inbiomedicine.
Molecular pharmaceutics , 13(5):1445–1454, 2016.[18] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 ,2013.[19] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoen-coders. arXiv preprint arXiv:1511.05644 , 2015.[20] Danilo P Mandic and Jonathon Chambers.
Recurrent neural networks for prediction: learning algorithms,architectures and stability . John Wiley & Sons, Inc., 2001.[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation , 9(8):1735–1780,1997.[22] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath.Generative adversarial networks: An overview.
IEEE Signal Processing Magazine , 35(1):53–65, 2018.[23] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo designthrough deep reinforcement learning.
Journal of cheminformatics , 9(1):48, 2017.[24] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drugdesign.
Science advances, 4(7):eaap7885, 2018.

[25] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules.
ACS central science , 4(2):268–276, 2018.[26] Seokho Kang and Kyunghyun Cho. Conditional molecular design with deep generative models.
Journal ofchemical information and modeling , 59(1):43–52, 2018.[27] Miha Skalic, José Jiménez, Davide Sabbadin, and Gianni De Fabritiis. Shape-based generative modelingfor de novo drug design.
Journal of chemical information and modeling , 59(3):1205–1214, 2019.[28] Artur Kadurin, Sergey Nikolenko, Kuzma Khrabrov, Alex Aliper, and Alex Zhavoronkov. drugan: an advancedgenerative adversarial autoencoder model for de novo generation of new molecules with desired molecularproperties in silico.
Molecular pharmaceutics , 14(9):3098–3104, 2017.[29] Boris Sattarov, Igor I Baskin, Dragos Horvath, Gilles Marcou, Esben Jannik Bjerrum, and Alexandre Varnek.De novo molecular design by combining deep autoencoder recurrent neural networks with generative topo-graphic mapping.
Journal of chemical information and modeling , 59(3):1182–1196, 2019.[30] Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-Wei Wei. Mathematical deeplearning for pose and binding affinity prediction and ranking in d3r grand challenges.
Journal of computer-aided molecular design , 33(1):71–82, 2019.[31] Duc Duy Nguyen, Kaifu Gao, Menglun Wang, and Guo-Wei Wei. Mathdl: Mathematical deep learning ford3r grand challenge 4.
Journal of Computer Aided Molecular Design , in press, 2019.[32] Greg Landrum et al. Rdkit: Open-source cheminformatics, 2006.[33] Thomas A Halgren. Merck molecular force field. i. basis, form, scope, parameterization, and performanceof mmff94.
Journal of computational chemistry , 17(5-6):490–519, 1996.[34] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits oflanguage modeling. arXiv preprint arXiv:1602.02410 , 2016.[35] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine transla-tion system: Enabling zero-shot translation.
Transactions of the Association for Computational Linguistics ,5:339–351, 2017.[36] Marco Daniele Parenti and Giulio Rastelli. Advances and applications of binding affinity prediction methodsin drug discovery.
Biotechnology advances , 30(1):244–250, 2012.[37] Kaifu Gao, Jian Yin, Niel M Henriksen, Andrew T Fenley, and Michael K Gilson. Binding enthalpy calcu-lations for a neutral host–guest pair yield widely divergent salt effects across water models.
Journal ofchemical theory and computation , 11(10):4555–4564, 2015.[38] David Rogers and Mathew Hahn. Extended-connectivity fingerprints.
Journal of chemical information andmodeling , 50(5):742–754, 2010.[39] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of mdl keys foruse in drug discovery.
Journal of chemical information and computer sciences , 42(6):1273–1280, 2002.[40] Kedi Wu, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. Topp–s: Persistent homology-based multi-taskdeep neural networks for simultaneous predictions of partition coefficient and aqueous solubility.
Journal ofcomputational chemistry , 39(20):1444–1454, 2018.[41] Kedi Wu and Guo-Wei Wei. Quantitative toxicity prediction using topology based multitask deep neuralnetworks.
Journal of chemical information and modeling , 58(2):520–531, 2018.1942] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch: Tensors and dynamic neuralnetworks in python with strong gpu acceleration.
PyTorch: Tensors and dynamic neural networks in Pythonwith strong GPU acceleration , 6, 2017.[43] Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and Guo-Wei Wei. A topological approach forprotein classification.
Computational and Mathematical Biophysics , 3(1), 2015.[44] Zixuan Cang, Lin Mu, and Guo-Wei Wei. Representability of algebraic topology for biomolecules in machinelearning based scoring and virtual screening.
PLoS computational biology , 14(1):e1005929, 2018.[45] Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learn-ing for protein-ligand binding affinity prediction.
International journal for numerical methods in biomedicalengineering , 34(2):e2914, 2018.[46] Duc Duy Nguyen and Guo-Wei Wei. Dg-gl: Differential geometry-based geometric learning of moleculardatasets.
International journal for numerical methods in biomedical engineering , 35(3):e3179, 2019.[47] Duc D Nguyen, Tian Xiao, Menglun Wang, and Guo-Wei Wei. Rigidity strengthening: A mechanism forprotein–ligand binding.
Journal of chemical information and modeling , 57(7):1715–1721, 2017.[48] Duc Nguyen and Guo-Wei Wei. Agl-score: Algebraic graph learning score for protein-ligand binding scoring,ranking, docking, and screening.
Journal of Chemical Information and Modeling , 59(7):3291–3304, 2019.[49] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology.
Discrete & Computational Geom-etry , 33(2):249–274, 2005.[50] Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure, flexibility, and folding.
Inter-national journal for numerical methods in biomedical engineering , 30(8):814–844, 2014.[51] Kelin Xia, Xin Feng, Yiying Tong, and Guo Wei Wei. Persistent homology for the quantitative prediction offullerene stability.
Journal of computational chemistry , 36(6):408–422, 2015.[52] Zixuan Cang and Guo-Wei Wei. Topologynet: Topology based deep convolutional and multi-task neuralnetworks for biomolecular property predictions.
PLoS computational biology , 13(7):e1005690, 2017.[53] Guo-Wei Wei. Differential geometry based multiscale models.
Bulletin of mathematical biology , 72(6):1562–1622, 2010.[54] Kelin Xia and Guo-Wei Wei. Multidimensional persistence in biomolecular data.
Journal of computationalchemistry , 36(20):1502–1520, 2015.[55] Dejan Plavši´c, Sonja Nikoli´c, Nenad Trinajsti´c, and Douglas J Klein. Relation between the wiener index andthe schultz index for several classes of chemical graphs.
Croatica Chemica Acta , 66(2):345–353, 1993.[56] Dusanka Janezic, Ante Milicevic, Sonja Nikolic, and Nenad Trinajstic.
Graph-theoretical matrices in chem-istry . CRC Press, 2015.[57] Nobuhiro Go, Tosiyuki Noguti, and Testuo Nishikawa. Dynamics of a small globular protein in terms oflow-frequency vibrational modes.
Proceedings of the National Academy of Sciences , 80(12):3696–3700,1983.[58] Ali Rana Atilgan, SR Durell, Robert L Jernigan, Melik C Demirel, O Keskin, and Ivet Bahar. Anisotropy offluctuation dynamics of proteins with an elastic network model.
Biophysical journal , 80(1):505–515, 2001.[59] Ivet Bahar, Timothy R Lezon, Lee-Wei Yang, and Eran Eyal. Global dynamics of proteins: bridging betweenstructure and function.
Annual review of biophysics , 39:23–42, 2010.[60] Tiejun Cheng, Xun Li, Yan Li, Zhihai Liu, and Renxiao Wang. Comparative assessment of scoring functionson a diverse test set.
Journal of chemical information and modeling , 49(4):1079–1093, 2009.2061] Yan Li, Li Han, Zhihai Liu, and Renxiao Wang. Comparative assessment of scoring functions on an updatedbenchmark: 2. evaluation methods and general results.
Journal of chemical information and modeling ,54(6):1717–1736, 2014.[62] Cheng Wang and Yingkai Zhang. Improving scoring-docking-screening powers of protein–ligand scoringfunctions using random forest.
Journal of computational chemistry , 38(3):169–177, 2017.[63] Oleg Trott and Arthur J Olson. Autodock vina: improving the speed and accuracy of docking with a newscoring function, efficient optimization, and multithreading.
Journal of computational chemistry , 31(2):455–461, 2010.[64] Gareth Jones, Peter Willett, Robert C Glen, Andrew R Leach, and Robin Taylor. Development and validationof a genetic algorithm for flexible docking.
Journal of molecular biology , 267(3):727–748, 1997.[65] Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz,Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. Glide: a new approach for rapid, accu-rate docking and scoring. 1. method and assessment of docking accuracy.
Journal of medicinal chemistry ,47(7):1739–1749, 2004.[66] Dávid Bajusz, Anita Rácz, and Károly Héberger. Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?
Journal of cheminformatics , 7(1):20, 2015.[67] Kaifu Gao, Hongqing He, Minghui Yang, and Honggao Yan. Molecular dynamics simulations of the es-cherichia coli hppk apo-enzyme reveal a network of conformational transitions.
Biochemistry , 54(44):6734–6742, 2015.[68] Kaifu Gao, Ya Jia, and Minghui Yang. A network of conformational transitions revealed by molecular dy-namics simulations of the binary complex of escherichia coli 6-hydroxymethyl-7, 8-dihydropterin pyrophos-phokinase with mgatp.
Biochemistry , 55(49):6931–6939, 2016.[69] Kaifu Gao and Yunjie Zhao. A network of conformational transitions in the apo form of ndm-1 enzymerevealed by md simulation and a markov state model.
The Journal of Physical Chemistry B , 121(14):2952–2960, 2017.[70] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel,Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning inpython.
Journal of machine learning research , 12(Oct):2825–2830, 2011.[71] Zied Gaieb, Conor D Parks, Michael Chiu, Huanwang Yang, Chenghua Shao, W Patrick Walters, Millard HLambert, Neysa Nevins, Scott D Bembenek, Michael K Ameriks, et al. D3r grand challenge 3: blind predic-tion of protein–ligand poses and affinity rankings.
Journal of computer-aided molecular design , 33(1):1–18,2019.[72] Michael K Ameriks, Scott D Bembenek, Matthew T Burdett, Ingrid C Choong, James P Edwards, DamaraGebauer, Yin Gu, Lars Karlsson, Hans E Purkey, Bart L Staker, et al. Diazinones as p2 replacements forpyrazole-based cathepsin s inhibitors.
Bioorganic & medicinal chemistry letters , 20(14):4060–4064, 2010.[73] Danielle K Wiener, Alice Lee-Dutra, Scott Bembenek, Steven Nguyen, Robin L Thurmond, Siquan Sun,Lars Karlsson, Cheryl A Grice, Todd K Jones, and James P Edwards. Thioether acetamides as p3 bindingelements for tetrahydropyrido-pyrazole cathepsin s inhibitors.
Bioorganic & medicinal chemistry letters ,20(7):2379–2382, 2010.[74] Robert Vassar, Dora M Kovacs, Riqiang Yan, and Philip C Wong. The β -secretase enzyme bace in healthand alzheimer’s disease: regulation, cell biology, function, and therapeutic potential. Journal of Neuro-science , 29(41):12787–12794, 2009.[75] Federica Prati, Giovanni Bottegoni, Maria Laura Bolognesi, and Andrea Cavalli. Bace-1 inhibitors: Fromrecent single-target molecules to multitarget compounds for alzheimer’s disease: Miniperspective.
Journalof medicinal chemistry , 61(3):619–637, 2017. 2176] Teague Sterling and John J. Irwin. Zinc 15 - ligand discovery for everyone.
Journal of Chemical Informationand Modeling , 55(11):2324–2337, 2015. http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00559.[77] Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained bayesian optimization for automaticchemical design. arXiv preprint arXiv:1709.05501 , 2017.[78] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for moleculargraph generation. arXiv preprint arXiv:1802.04364 , 2018.[79] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 1945–1954.JMLR. org, 2017.[80] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoderfor structured data. arXiv preprint arXiv:1802.08786arXiv preprint arXiv:1802.08786