ChemRxiv | 2021
A Bag of Tricks for Automated De Novo Design of Molecules with the Desired Properties: Application to EGFR Inhibitor Discovery
Abstract
Deep generative neural networks have been used increasingly in computational chemistry for de novo design of molecules with desired properties. Many deep learning approaches employ reinforcement learning for optimizing the target properties of the generated molecules. However, the success of this approach is often hampered by the problem of sparse rewards, as the majority of the generated molecules are, as expected, predicted to be inactive. We propose several technical innovations to address this problem and improve the balance between exploration and exploitation modes in reinforcement learning. In a proof-of-concept study, we demonstrate the application of a deep generative recurrent neural network, enhanced by several novel technical tricks, to the design of experimentally validated potent inhibitors of the epidermal growth factor receptor (EGFR). The proposed technical solutions are expected to substantially improve the success rate of finding novel bioactive compounds for specific biological targets using generative and reinforcement learning approaches.

Deep and reinforcement learning in drug discovery.
The development and application of deep generative models for de novo design of molecules with the desired properties have emerged as an important modern research direction in Computer-Assisted Drug Discovery (CADD).1–4 Deep generative models can be classified by the type of molecular representation employed in model development. The most commonly used types are SMILES strings5 and molecular graphs. Multiple models for generating SMILES strings6–9 and molecular graphs10–14 corresponding to synthetically feasible novel molecules have been proposed. Many such models use reinforcement learning (RL)7,15,16 techniques for optimizing the properties of the generated molecules. For example, Olivecrona et al.6 and Blaschke et al.17 proposed the REINVENT algorithm and memory-assisted reinforcement learning, respectively, and demonstrated how these approaches could maximize the predicted activity of generated molecules against the 5-hydroxytryptamine receptor type 1A (HTR1A) and the dopamine type 2 receptor (DRD2). Another recent example is the RationaleRL algorithm proposed by Jin et al.18 The authors used RationaleRL to maximize the predicted activity of inhibitors against glycogen synthase kinase-3 beta (GSK3β) and c-Jun N-terminal kinase-3 (JNK3). Unfortunately, the aforementioned studies included no experimental validation of the proposed computational hits. Notably, Zhavoronkov et al.19 not only proposed a novel generative tensorial reinforcement learning algorithm but also used their method to design potent DDR1 kinase inhibitors and performed experimental validation of the virtual hits.

Most theoretical works on de novo molecular design employ a series of benchmark tasks involving the maximization of properties that can be assessed for every molecule, such as LogP20 or QED21, or the benchmark collection proposed in GuacaMol.22 Such tasks employ objective metrics obtained directly from a molecule's SMILES5 or underlying molecular graph through a scoring function. These scoring functions return continuous values that can be used to assign a reward to generated molecules. For example, the Quantitative Estimate of Druglikeness (QED) score takes values between 0 and 1.0, with 0 being least drug-like and 1.0 being most drug-like. In such a case, every generated molecule receives a continuous score: higher score values correspond to higher reward values, and vice versa. Moreover, a naïve generative model pretrained on a dataset of drug-like compounds such as ChEMBL23 would produce molecules with relatively high QED values. In this case, optimization of the generative model via reinforcement learning proceeds efficiently, as every generated molecule gets a score. Indeed, efficient optimization of the QED score has been demonstrated many times in the literature.10,20,24
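As a minimal sketch of such a dense, continuous reward (assuming RDKit is available; the helper name qed_reward is ours for illustration and is not taken from the paper's code):

```python
# Minimal sketch of a dense, continuous reward based on QED.
# Assumes RDKit is installed; names are illustrative, not the paper's code.
from rdkit import Chem
from rdkit.Chem import QED

def qed_reward(smiles: str) -> float:
    """Return a reward in [0, 1] for a generated SMILES string.

    Every syntactically valid molecule receives a graded score,
    so the reward signal is dense: higher QED -> higher reward.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # invalid SMILES: no molecule, no reward
        return 0.0
    return QED.qed(mol)      # continuous drug-likeness score in [0, 1]

print(qed_reward("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, roughly 0.55
```

Because a pretrained generator already emits mostly valid, drug-like molecules, nearly every sample receives a nonzero value from such a function, which is exactly why QED optimization is easy for reinforcement learning.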
Problem of sparse rewards in reinforcement learning.
In contrast to physical properties such as LogP, which can be calculated directly from the molecular structure, the biological activity of a novel compound designed to bind a desired protein target cannot be predicted from its chemical structure alone. A common way to predict the binding affinity of novel, untested ligands is to use Quantitative Structure-Activity Relationship (QSAR) models25,26 trained on historical experimental data for a protein target of interest using machine learning techniques. These models have either continuous outputs (pKd, pIC50, etc.) for regression problems or binary outputs (active/inactive class labels) for classification problems. QSAR models could, in principle, be used to construct a reward function for reinforcement learning to optimize the binding affinity of generated molecules, as was shown, for instance, in our previous publication.7 However, unlike physical molecular properties like LogP, which every molecule possesses, specific bioactivity is a target property that exists for only a small fraction of molecules, which leads to reward sparseness in the generative models. This sparse rewards problem represents a serious obstacle to the effective use of reinforcement learning for designing molecules with high activity. Indeed, the low success probability often means that the overwhelming majority of training trajectories result in a zero reward, which implies that the reinforcement learning agent or policy network struggles to explore the environment and learn the optimal strategy for maximizing the expected reward. Thus, a promising molecule with high bioactivity for a protein of interest is unlikely to be observed if molecules are randomly sampled from a naïve generative model. Training the generative network to optimize the potency of generated molecules against a desired protein target is therefore an excellent example of a reinforcement learning problem with sparse rewards.

In this study, we demonstrate that the naïve generative model produces molecules predicted to be inactive in most cases. Under such a scenario, the naïve generative model rarely observes good examples and fails to maximize the binding affinity of generated ligands. We address this problem by proposing a set of heuristic approaches (a "bag of tricks") combined with reinforcement learning in the sparse rewards situation to increase the efficiency of optimizing the structures of generated molecules toward higher biological activity. Using epidermal growth factor receptor (EGFR) ligands as a case study, we show that by combining a reinforcement learning pipeline for generative model optimization with the proposed heuristics, we could overcome the sparse rewards issue and successfully rediscover known active scaffolds for EGFR using feedback from the classification QSAR model only. In addition to these methodological advances, we also performed experimental bioassay validation of the novel generated hit molecules, which confirmed the activity of the virtual hits.
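The sketch below illustrates why a classification QSAR reward is sparse. It is not the authors' code: clf stands for any pre-trained scikit-learn-style classifier over Morgan fingerprints, and the 0.75 admission threshold is an arbitrary example value.

```python
# Illustrative sketch (not the paper's code) of a sparse reward built
# from a binary QSAR classifier over Morgan fingerprints.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    """Morgan fingerprint (radius 2, 2048 bits), or None for invalid SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    arr = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def sparse_reward(smiles: str, clf, threshold: float = 0.75) -> float:
    """Reward 1.0 only when the QSAR model confidently predicts 'active'.

    Because the vast majority of generated molecules are predicted
    inactive, this reward is zero for almost all trajectories --
    the sparse rewards problem discussed in the text.
    """
    x = fingerprint(smiles)
    if x is None:
        return 0.0
    p_active = clf.predict_proba(x.reshape(1, -1))[0, 1]
    return 1.0 if p_active >= threshold else 0.0
```

Unlike the QED case, almost every call to such a function returns 0.0 for a naïve generator, so the policy gradient receives essentially no learning signal.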
Major findings.
We performed a series of experiments that resulted in the following chief observations:

1. The generative model trained with only the policy gradient algorithm could not discover any active molecules for EGFR due to sparse rewards.
2. The combination of the policy gradient algorithm with the proposed fine-tuning by (i) transfer learning, (ii) experience replay, and (iii) real-time reward shaping resulted in much better exploration and an increased number of generated molecules with high active class probabilities.
3. Experimental testing of selected computational hits that could be obtained from a commercial source validated the efficiency of our novel approach for discovering novel bioactive molecules.

Below, we discuss how we arrived at the above observations. Overall, the section consists of two main parts. In the first part, we describe the computational analysis underlying the first two observations. In the second part, we discuss the generation, selection, and experimental bioactivity testing of computational hit compounds for an important cancer target, the epidermal growth factor receptor (EGFR). The most active compound featured a privileged EGFR scaffold found in the known active molecules. Notably, the training set was not enriched for this scaffold compared to other scaffolds, and this scaffold was not selectively used as part of the reinforcement learning procedure.

Model pipeline.
Neural network training is a nontrivial task, as the hyperparameter values define the training protocol. Because of the large number of hyperparameters, the training hyperparameter space is vast. To complicate things further, neural network training is computationally expensive and can last from hours to days. The choice of training hyperparameters thus has a significant influence on model quality. We therefore ran a benchmark experiment to investigate how different training techniques interact and how they affect model quality. As a case study, we optimized the generative model with reinforcement learning to maximize the predicted probability of the active class for the EGFR protein. The experimental training pipeline is shown in Figure 1 and sketched in code below.

Figure 1: Pipeline of model training. The model was pre-trained on ChEMBL data and then trained for 20 epochs. Each epoch consists of three steps: policy gradient, policy experience replay, and fine-tuning. At the end of each step, 3,200 molecules are generated, and molecules with predicted activity exceeding the probability threshold are admitted into the replay buffer. The replay buffer, in turn, influences training at the policy replay and fine-tuning steps. At the end of the training, the model generates 16,000 molecules for evaluation.

We modified the number of iterations for all three steps in each epoch to understand their effects on training. We also used different libraries to initialize the replay buffer to understand how the replay buffer can influence model behavior. As describ
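As a structural illustration of the Figure 1 pipeline, the sketch below reproduces only the control flow described in the caption. Every class and function name, the stub model, and the 0.75 admission threshold are hypothetical placeholders, not the paper's implementation.

```python
# Structural sketch of the Figure 1 training loop. All names below are
# hypothetical placeholders; only the control flow follows the paper.
import random

N_EPOCHS, N_SAMPLE, N_FINAL = 20, 3200, 16000
THRESHOLD = 0.75  # example admission threshold; the paper's value may differ

class StubGenerator:
    """Stands in for the pretrained generative RNN."""
    def sample(self, n):
        return [f"SMILES_{random.randrange(10**6)}" for _ in range(n)]
    def policy_gradient_step(self, reward_fn): pass  # RL update on rewards
    def experience_replay_step(self, buffer): pass   # update on buffer samples
    def fine_tuning_step(self, buffer): pass         # transfer learning on buffer

def p_active(smiles):
    """Stands in for the QSAR classifier's predicted active-class probability."""
    return random.random()

model, replay_buffer = StubGenerator(), []
for epoch in range(N_EPOCHS):
    # Each epoch runs three steps: policy gradient, experience replay,
    # and fine-tuning on the replay buffer.
    steps = (lambda: model.policy_gradient_step(p_active),
             lambda: model.experience_replay_step(replay_buffer),
             lambda: model.fine_tuning_step(replay_buffer))
    for run_step in steps:
        run_step()
        # After every step, generate 3,200 molecules and admit those whose
        # predicted active probability exceeds the threshold into the
        # replay buffer, which feeds the replay and fine-tuning steps.
        for smi in model.sample(N_SAMPLE):
            if p_active(smi) > THRESHOLD:
                replay_buffer.append(smi)

final_library = model.sample(N_FINAL)  # 16,000 molecules for evaluation
```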