Deep Inverse Reinforcement Learning for Structural Evolution of Small Molecules
Brighter Agyemang, Wei-Ping Wu, Daniel Addo, Michael Y. Kpiebaareh, Ebenezer Nanor, Charles Roland Haruna
Brighter Agyemang*
School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu, PRC
* Contact: [email protected]

Wei-Ping Wu
School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu, PRC

Daniel Addo
School of Software Engineering, University of Electronic Science and Technology, Chengdu, PRC

Michael Y. Kpiebaareh
School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu, PRC

Ebenezer Nanor
School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu, PRC

Charles Roland Haruna
School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu, PRC

Abstract
The size and quality of the chemical libraries available to the drug discovery pipeline are crucial for developing new drugs or repurposing existing ones. Existing techniques such as combinatorial organic synthesis and High-Throughput Screening usually make the process extraordinarily tough and complicated, since the search space of synthetically feasible drugs is exorbitantly huge. While reinforcement learning has been mostly exploited in the literature for generating novel compounds, the requirement of designing a reward function that succinctly represents the learning objective could prove daunting in certain complex domains. Generative Adversarial Network-based methods also mostly discard the discriminator after training and can be hard to train. In this study, we propose a framework for training a compound generator and learning a transferable reward function based on the entropy maximization inverse reinforcement learning paradigm. We show from our experiments that the inverse reinforcement learning route offers a rational alternative for generating chemical compounds in domains where reward function engineering may be less appealing or impossible while data exhibiting the desired objective is readily available.

Keywords: Drug Design, Inverse Reinforcement Learning, Reinforcement Learning, Deep Learning, Small Molecules

Availability: The source code and data of this study are available at https://github.com/bbrighttaer/irelease
1 Introduction

Identifying promising leads is crucial to the early stages of drug discovery. Combinatorial organic synthesis and High-Throughput Screening (HTS) are well-known methods used to generate new compounds in the domain (drug and compound are used interchangeably in this study). This generation process is typically followed by expert analysis, which focuses on desired properties such as solubility, activity, pharmacokinetic profile, toxicity, and synthetic accessibility, to ascertain the desirability of a generated compound. Compound generation and modification methods are useful for enriching chemical databases and scaffold hopping [1]. A review of the structural and analog entity evolution patent landscape estimates that the pharmaceutical industry constitutes a substantial share of the domain [2]. Indeed, the compound generation task is noted to be hard and complicated [3], considering the vast number of synthetically feasible drug-like compounds [4]. With a large proportion [5] of drug development projects failing due to
unforeseen reasons, it is significant to ensure diversity in desirable compounds to avoid a fatal collapse of the drug discovery process. As a result of these challenges in the domain, there is a need for improved de novo compound generation methods.

In recent times, the proliferation of data, advances in computer hardware, novel algorithms for studying complex problems, and other related factors have contributed significantly to the steady growth of data-driven methods such as Deep Learning (DL). DL-based approaches have been applied to several domains, such as Natural Language Processing (NLP) [6, 7], Computer Vision [8], ProteoChemometric Modeling [9], compound and target representation learning [10, 11], and reaction analysis [12]. Consequently, there has been a growing interest in the literature in using data-driven methods to study the compound generation problem.

Deep Reinforcement Learning (DRL), Generative Adversarial Networks (GAN), and Transfer Learning (TL) are some of the approaches that have been used to generate compounds represented using the Simplified Molecular Input Line Entry System (SMILES) [13]. The DRL-based methods model the compound generation task as a sequential decision-making process and use Reinforcement Learning (RL) algorithms to design generators (agents) that estimate the statistical relationship between actions and outcomes. This statistical knowledge is then leveraged to maximize the outcome, thereby biasing the generator toward the desired chemical space. Motivated by the work in [14], the GAN-based methods also model the compound generation task as a sequential decision-making problem, but with a discriminator parameterizing the reward function. TL methods train a generator on a large dataset to increase the proportion of valid SMILES strings sampled from the generator before performing a fine-tuning training to bias the generator toward the target chemical space.

Regarding DRL-based methods, [1] proposed a Recurrent Neural Network (RNN)-based approach to train generative models for producing analogs of a query compound and compounds that satisfy certain chemical properties, such as activity toward the Dopamine Receptor D2 (DRD2) target. The generator takes as input one-hot encoded representations of the canonical SMILES of a compound, and for each experiment, a corresponding reward function is specified. Also, [15] proposed a stack-augmented RNN generative model using the REINFORCE [16] algorithm where, unlike [1], the canonical SMILES encoding of a compound is learned using backpropagation. The reward functions in [15] are parameterized by a prediction model that is trained separately from the generator. In both the studies of [1] and [15], the generator was pretrained on a large SMILES dataset using a supervised learning approach before applying RL to bias the generator. Similarly, [17] proposed a SMILES- and graph-based compound generation model that adopts the supervised pretraining and subsequent RL biasing approach. Unlike [1] and [15], [17] assigns intermediate rewards to valid incomplete SMILES strings. We posit that since an incomplete SMILES string could, in some cases, have meaning (e.g., moieties), assigning intermediate rewards could facilitate learning. While the DRL-based generative models can generate biased compounds, an accurate specification of the reward function is challenging and time-consuming in complex domains.
In most interesting de novo compound generation scenarios, compounds meeting multiple objectives may be required, and specifying such multi-objective reward functions leads to the generator (agent) exploiting the most straightforward objective and generating compounds with low variety.

In another vein, [3] trained an RNN model on a large dataset using supervised learning and then performed TL to the domain of interest to generate focused compounds. Since the supervised approach used to train the generator is different from the autoregressive sampling technique adopted at test time, such methods are not well suited for multi-step SMILES sampling [18]. This discrepancy is referred to as exposure bias. Additionally, methods such as [3] that maximize the likelihood of the underlying data are susceptible to learning distributions that place masses in low-density areas of the multivariate distribution giving rise to the underlying data.

On the other hand, [19] built on the work of [20] to propose a GAN-based generator that produces compounds matching some desirable metrics. The authors adopted an alternating approach to enable multi-objective RL optimization. As pointed out by [21], the challenges of training GANs cause a high rate of invalid SMILES strings, low diversity, and reward function exploitation. In [22], a memory-based generator replaced the generator proposed by [19] in order to mitigate the problems in [19]. In these GAN-based methods, the authors adopted a Monte-Carlo Tree Search (MCTS) method to assign intermediate rewards. However, GAN training can be unstable, and the generator can get worse as a result of early saturation [23]. Additionally, the discriminator in the GAN-based models is typically discarded after training.

In this paper, we propose a novel framework for training a compound generator and learning a reward function from data using a sample-based Deep Inverse Reinforcement Learning (DIRL) objective [24]. We observe that while it may be daunting or impossible to accurately specify the reward function of some complex drug discovery campaigns in order to leverage DRL algorithms, samples of compounds that satisfy the desired behavior may be readily available or collated. Therefore, our proposed method offers a solution for developing in-silico generators and recovering reward functions from compound samples. As pointed out by [25], the DIRL objective could lead to stability in training and produce effective generators. Also, unlike the typical GAN case where the discriminator is discarded, the sample-based DIRL
objective is capable of training both a generator and a reward function. Since the learned reward function succinctly represents the agent's objective, it could be transferred to related domains (with a possible fine-tuning step). Moreover, since the Binary Cross Entropy (BCE) loss usually applied to train a discriminator does not apply in the case of the sample-based DIRL objective, saturation problems in training the generator are eliminated. The DIRL approach also mitigates the challenge of imbalance between different RL objectives.

The outline of our study is as follows: Section 2 presents the RL and IRL background of this study, Section 3 discusses the research problem of this study and our proposed approach, Section 4 discusses the results of our experiments, and Section 5 draws the conclusions of this study.
2 Background

In this section, we review the concepts of Reinforcement Learning (RL) and Inverse Reinforcement Learning (IRL) related to this study.
2.1 Reinforcement Learning

The aim of Artificial Intelligence (AI) is to develop autonomous agents that can sense their environment and act intelligently. Since learning through interaction is vital for developing intelligent systems, the paradigm of RL has been adopted to study several AI research problems. In RL, an agent receives an observation from its environment, reasons about the received observation to decide on an action to execute in the environment, and receives a signal from the environment as to the usefulness of the executed action. This process continues in a trial-and-error manner over a finite or infinite time horizon for the agent to estimate the statistical relationship between the observations, actions, and their results. This statistical relationship is then leveraged by the agent to optimize the expected signal from the environment.

Formally, an RL agent receives a state s_t from the environment and takes an action a_t in the environment, leading to a scalar reward r_{t+1} at each time step t. The agent's behavior is defined by a policy π(a_t|s_t), and each action performed by the agent transitions the agent and the environment to a next state s_{t+1} or a terminal state s_T. The RL problem is modeled as a Markov Decision Process (MDP) consisting of:

- A set of states S, with a distribution over starting states p(s_0).
- A set of actions A.
- A state transition dynamics function T(s_{t+1}|s_t, a_t) that maps a state-action pair at time step t to a distribution over states at time step t+1.
- A reward function R(s_t, a_t) that assigns a reward to the agent after taking action a_t in state s_t.
- A discount factor γ ∈ [0, 1] that specifies a preference between immediate and long-term rewards. Also, γ < 1 ensures that a limit is placed on the time steps considered in the infinite-horizon case.

The policy π maps a state to a probability distribution over the action space: π : S → p(A = a|S). In an episodic MDP of length T, the sequence of states, actions, and rewards is referred to as a trajectory or rollout of the policy π. The return or total reward for a trajectory can be represented as R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}. The goal of an RL agent is to learn a policy π* that maximizes its expected total reward from all states:

\pi^* = \arg\max_\pi \mathbb{E}[R|\pi]    (1)

Existing RL methods for solving such an MDP can be classified into value function-based and policy search algorithms. Value function methods estimate the expected return of being in a given state. The state-value function V^π(s) estimates the expected return of starting in state s and following π afterward:

V^\pi(s) = \mathbb{E}[R|s, \pi]    (2)

Given the optimal values of all states, V^*(s) = \max_\pi V^\pi(s), \forall s \in S, the optimal policy π* can be determined as:

\pi^* = \arg\max_\pi V^*(s)    (3)

A closely related concept to the state-value function is the state-action value function Q^π(s, a). With the state-action value function, the initial action a is given, and the policy π is effective from the next state onward:

Q^\pi(s, a) = \mathbb{E}[R|s, a, \pi]    (4)

The agent can determine the optimal policy, given Q^π(s, a), by acting greedily in every state, \pi^* = \arg\max_a Q^\pi(s, a). Also, V^\pi(s) = \max_a Q^\pi(s, a).

On the other hand, policy search/gradient algorithms directly learn the policy π of the agent instead of deriving it indirectly from value-based estimates.
Specifically, policy-based methods optimize the policy π(a|s, θ), where θ is the set of parameters of the model approximating the true policy π*. We review the policy gradient (PG) methods used in this study below.

The REINFORCE algorithm [16] is a policy gradient method that estimates the gradient g := \nabla_\theta \mathbb{E}\left[\sum_{t=0}^{T-1} r_{t+1}\right], which has the form

g = \mathbb{E}\left[\sum_{t=0}^{T-1} \Psi_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right],    (5)

and updates θ in the direction of the estimated gradient. Ψ_t could be any of the functions defined by [26]. For instance, [15] defined Ψ_t = γ^t r(s_T), where r(s_T) denotes the reward at the terminal state. In another vein, [22] used the reward of a fully generated sequence. Due to the high variance of using the total reward of the trajectory \sum_{t=0}^{T-1} \gamma^t r_{t+1}, a baseline version, \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'+1} - b(s_t), could be adopted to reduce the variance. When a value function approximator is used to estimate the baseline b(s_t), the resulting method is referred to as an actor-critic method. A related concept in determining Ψ_t is the observation that it is less challenging to identify that an action has a better outcome than the default behavior than it is to learn the value of the action [27]. This concept gives rise to the advantage function A^π(s, a) = Q^π(s, a) - V^π(s), where V(s), serving as a baseline, is approximated by a function such as a neural network.

Schulman et al. [28] proposed a robust and data-efficient policy gradient algorithm as an alternative to prior RL training methods such as REINFORCE, Q-learning [29], and the relatively complicated Trust Region Policy Optimization (TRPO) [30]. The proposed method, named Proximal Policy Optimization (PPO), shares some similarities with TRPO in using the ratio between the new and old policies, scaled by the advantages of actions, to estimate the policy gradient instead of the logarithm of action probabilities (as seen in Equation 5). While TRPO uses a constrained optimization objective that requires the conjugate gradient algorithm to avoid large policy updates, PPO uses a clipped objective that forms a pessimistic estimate of the policy's performance to avoid destructive policy updates. These attributes of the PPO objective enable performing multiple epochs of policy updates with the data sampled from the environment. In REINFORCE, such multiple epochs of optimization often lead to destructive policy updates. Thus, PPO is also more sample efficient than the REINFORCE algorithm.

More formally, the PPO objective is expressed as:

L^{PPO}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]    (6)

where

r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},    (7)

and

A_t = \delta_t + (\gamma\lambda)\delta_{t+1} + ... + (\gamma\lambda)^{T-t+1}\delta_{T-1}    (8)

where δ_t = r_t + γV(s_{t+1}) - V(s_t), with hyperparameters γ and λ. The clip(·) function creates an alternate policy update to the expectation of r_t(θ)A_t, in which the action probability ratios between the old and updated policies are maintained in the range [1-ε, 1+ε]. Taking the minimum of the two possible policy updates ensures a lower-bound update is preferred. This is useful since a large update to the parameters of a highly non-linear policy function, such as a neural network, often results in a worse policy. In this paper, we set ε to a fixed value.
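To make the clipped objective concrete, the following is a minimal PyTorch sketch of Equation 6 together with truncated advantage estimates in the spirit of Equation 8. The helper names, default hyperparameters, and the clipping value of 0.2 are illustrative assumptions rather than the exact implementation used in this study.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Truncated advantage estimates for one trajectory (cf. Equation 8)."""
    T = len(rewards)
    advantages, running = [0.0] * T, 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # delta_t
        running = delta + gamma * lam * running
        advantages[t] = running
    return torch.tensor(advantages)

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate of Equation 6, negated so it can be minimized."""
    ratio = torch.exp(new_logp - old_logp)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # pessimistic bound
```

Because the clipped ratio bounds the size of each update, the same sampled batch can safely be reused for several optimization epochs, which is the source of PPO's sample efficiency noted above.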
2.2 Inverse Reinforcement Learning

IRL is the problem of learning the reward function of an observed agent, given its policy or behavior, thereby avoiding the manual specification of a reward function [31]. The IRL class of solutions assumes the following MDP\R_E:

- A set of states S, with a distribution over starting states p(s_0).
- A set of actions A.
- A state transition dynamics function T(s_{t+1}|s_t, a_t) that maps a state-action pair at time step t to a distribution over states at time step t+1.
- A set of demonstrated trajectories D = \left\{\langle (s_1^i, a_1^i), ..., (s_{T-1}^i, a_{T-1}^i) \rangle\right\}_{i=1}^N from the observed agent or expert.
- A discount factor γ ∈ [0, 1] that may be used to discount future rewards.

The goal of IRL then is to learn the reward function R_E that best explains the expert demonstrations. It is assumed that the demonstrations are perfectly observed and that the expert follows an optimal policy.

While learning reward functions from data is appealing, the IRL problem is ill-posed since there are many reward functions under which the observed expert behavior is optimal [32, 31]. An instance is a reward function that assigns 0 (or any constant value) to all selected actions; in such a case, any policy is optimal. Other main challenges are accurate inference, generalizability, the correctness of prior knowledge, and computational cost that grows with problem complexity.

To this end, several IRL proposals exist in the literature to mitigate the IRL challenges mentioned above [31]. In recent times, the entropy optimization class of IRL methods has been widely used by researchers due to the maximum entropy goal of obtaining an unbiased distribution of potential rewards [33, 34, 35, 24]. The intuition is that the solution that maximizes entropy violates the optimization constraints least and is, hence, least wrong. In the Maximum Entropy (MaxEnt) formulation, the probability of the expert's trajectory is proportional to the exponential of the total reward [33],

p(\tau) \propto \exp(R(\tau)), \ \forall \tau \in D,    (9)

where R(\tau) = \sum_{(s,a) \in \tau} r(s, a) and r(s, a) gives the reward for taking action a in state s.

3 Proposed Approach

We consider the problem of training a model G_θ(a_t|s_t) to generate a set of compounds C = {c_1, c_2, ..., c_M | M ∈ N}, each encoded as a valid SMILES string, such that C is generally biased to satisfy predefined criteria that can be evaluated by a function E. Considering the sequential nature of a SMILES string, we study this problem in the framework of IRL following the MDP described in Section 2.2; instead of the RL approach mostly adopted in the literature on compound generation, a parameterized reward function R_ψ has to be learned from D, a set of SMILES that satisfy the criteria evaluated by E, in order to train G_θ following the MDP in Section 2.1. Note that in the case of SMILES generation, T(s_{t+1}|s_t, a_t) = 1.

In this context, the action space A is defined as the set of unique tokens that are combined to form a canonical SMILES string, and the state space S is the set of all possible combinations of these tokens encoding a valid SMILES string, each of length l ∈ [1, T], T ∈ N. We set s_0 to a fixed state that denotes the beginning of a SMILES string, and s_T is the terminal state.

Furthermore, we compute r(s_{T-1}, a_{T-1}) = R_ψ(Y_{1:T}), where Y_{1:T} denotes a fully generated sequence. Intermediate state rewards are obtained by performing an N-time Monte Carlo Tree Search (MCTS) with G_θ as the rollout policy to obtain the corresponding rewards.
Specifically, given the N-time MCTS set of rollouts,

MC^{G_\theta}(Y_{1:t}, N) = \left\{Y_{1:T}^1, ..., Y_{1:T}^N\right\},    (10)

the reward for a state is then calculated as

r(s_t, a_t) =
\begin{cases}
\frac{1}{N}\sum_{n=1}^{N} R_\psi(Y_{1:T}^n), & Y_{1:T}^n \in MC^{G_\theta}(Y_{1:t}, N), \ t < T-1, \\
R_\psi(Y_{1:T}), & t = T-1.
\end{cases}
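As an illustration of this reward assignment, the sketch below averages the reward net's scores over N Monte Carlo completions of a SMILES prefix. Here, generator.rollout and reward_net are hypothetical interfaces standing in for the actual models; they are assumptions made for this sketch only.

```python
def intermediate_reward(prefix_tokens, t, T, generator, reward_net, n_rollouts=8):
    """Reward for the partial SMILES at step t, following the piecewise rule above.

    At the final step the full sequence is scored directly; otherwise the prefix
    is completed n_rollouts times with the current generator policy and the
    scores of the completed sequences are averaged.
    """
    if t == T - 1:
        return reward_net(prefix_tokens)
    total = 0.0
    for _ in range(n_rollouts):
        completed = generator.rollout(prefix_tokens, max_len=T)  # hypothetical rollout API
        total += reward_net(completed)
    return total / n_rollouts
```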
The workflow we propose in this study for the structural evolution of compounds is presented in Figure 1a; the framework is described in what follows.

Figure 1: (a) Illustration of the proposed framework for training a small molecule generator and learning a reward function using IRL. The workflow begins with pretraining the generator model using a large-scale compound sequence dataset, such as ChEMBL. The pretrained model is used to initialize the agent network in an IRL training scheme where a reward function is learned and the agent is biased to generate desirable compounds. The generated SMILES could be examined by an evaluation function, and compounds satisfying a specified threshold could be persisted for further analysis. (b) The general structure of the models used in the study. The agent is represented as an actor-critic architecture that is trained using the PPO [28] algorithm. The actor, critic, and reward function are all RNNs that share the same SMILES encoder. The actor net becomes the SMILES generator at test time.
The first stage of the workflow is creating a SMILES dataset, similar to most existing GAN and vanilla-RL methods. Possible repositories for facilitating this dataset creation are DrugBank [36], KEGG [37], STITCH [38], and ChEMBL [39]. This dataset is used at the next stage for pretraining the generator. For this purpose, we used the curated ChEMBL dataset of drug-like compounds from [15].

Also, a set of SMILES satisfying the constraints or criteria of interest is collated from an appropriate source as the demonstrations D of the IRL phase (for instance, see Section 3.6). To evaluate SMILES during the training of the generator, we assume the availability of an evaluation function E that can evaluate the extent to which a given compound satisfies the optimization constraints. Here, E could be an ML model, a robot that conducts chemical synthesis, or a molecular docking program.

In this study, we avoided composing the demonstrations dataset from the data used to train the evaluation function. Since, in practice, the data, rule set, or method used for developing any of the evaluation functions mentioned above could differ from the approach for constructing the demonstrations dataset, this independence provides a more realistic context for assessing our proposed method.

The next stage of the framework entails using the large-scale dataset of the previous step to pretrain a model that serves as a prior for initializing the agent network/policy. This pretraining step aims to enable the model to learn the SMILES syntax and attain a high rate of valid generated SMILES strings.

Since a SMILES string is a sequence of characters, we represent the prior model as a Recurrent Neural Network (RNN) with Gated Recurrent Units (GRUs). The architecture of the generator at this stage of the workflow is depicted in Figure 3b. The generator takes as input the output of an encoder that learns the embedding of SMILES characters, as shown in Figure 1b. For each given SMILES string, the generator is trained to predict the next token in the sequence using the cross-entropy loss function.

According to the findings of [40], regular RNNs cannot effectively generate sequences of a context-free language due to their lack of memory. To this end, [40] proposed a Stack-RNN architecture, which equips the standard RNN cell with a stack unit to mitigate this problem. Since the SMILES encoding of a compound is a context-free language, we follow [15] in using the Stack-RNN for generating SMILES, as it ensures that tasks such as noting the start and end parts of aromatic moieties, and parentheses for branches in the chemical structure, are well addressed. We note that while [15] used a single-layer Stack-RNN, we adopt a multi-layer Stack-RNN structure in our study. We reckon that the multi-layer Stack-RNN could facilitate learning better representations at multiple levels of abstraction, akin to multi-layer RNNs.

The stack unit enables the model to persist information across different time steps. Three differentiable operations can be performed on the stack at each time step: POP, PUSH, and NO-OP. The POP operation deletes the entry at the top of the stack, the PUSH operation updates the top of the stack with a new record, and the NO-OP operation leaves the stack unchanged for the next time step.
In [40], the entry at the top of stack s at time step t is indexed as s_t[0] and is updated as

s_t[0] = a_t[PUSH]\,\sigma(D h_t) + a_t[POP]\, s_{t-1}[1] + a_t[NO\text{-}OP]\, s_{t-1}[0],    (11)

where h_t ∈ R^{m×1} is the RNN hidden vector of a sequence at time step t, D ∈ R^{w×m}, w is the dimension of the stack, and a_t ∈ R^3 is a vector whose elements map to the stack control operations PUSH, POP, and NO-OP, respectively. Here, a_t is computed as

a_t = \text{softmax}(A h_t),    (12)

where A ∈ R^{3×m}. Intuitively, if a_t[PUSH] = 1 then s_t[0] is updated with Dh_t; if a_t[POP] = 1 then the top of the stack is updated with the second entry of the stack; and a_t[NO-OP] = 1 leaves the stack unchanged.

Regarding the remaining entries of the stack, that is, indexes i > 0, the update rule is

s_t[i] = a_t[PUSH]\, s_{t-1}[i-1] + a_t[POP]\, s_{t-1}[i+1] + a_t[NO\text{-}OP]\, s_{t-1}[i].    (13)

Subsequently, the RNN cell's hidden vector h_t is updated as

h_t = \sigma(U x_t + R h_{t-1} + P s_{t-1}[0]),    (14)

where σ(x) = 1/(1 + exp(-x)), U ∈ R^{m×d}, R ∈ R^{m×m}, P ∈ R^{m×w}, s_{t-1}[0] ∈ R^{w×1}, d is the dimension of the encoded representation of token x_t, and m is the dimension of the hidden state.
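The following is a minimal PyTorch sketch of the stack update in Equations 11-13, written for a single sequence without a batch dimension. The zero padding at the bottom of the stack after a POP is an assumption of this sketch rather than a detail taken from the study.

```python
import torch
import torch.nn.functional as F

def stack_step(stack, h_t, A, D):
    """One differentiable stack update (Equations 11-13).

    stack : (depth, w) tensor; stack[0] is the top entry s_t[0]
    h_t   : (m,) RNN hidden vector at time step t
    A     : (3, m) matrix producing the PUSH/POP/NO-OP weights a_t
    D     : (w, m) matrix projecting h_t onto a candidate top entry
    """
    a_t = F.softmax(A @ h_t, dim=0)              # Equation 12
    push, pop, noop = a_t[0], a_t[1], a_t[2]
    new_top = torch.sigmoid(D @ h_t)             # sigma(D h_t)
    # stack configuration for a pure PUSH: new entry on top, the rest shifted down
    pushed = torch.cat([new_top.unsqueeze(0), stack[:-1]], dim=0)
    # stack configuration for a pure POP: entries shifted up, bottom zero-padded
    popped = torch.cat([stack[1:], torch.zeros_like(stack[:1])], dim=0)
    # soft mixture of the three operations (Equations 11 and 13)
    return push * pushed + pop * popped + noop * stack
```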
Figure 2: Sample compounds generated using the pretrained model.

As stated earlier, we frame the SMILES string generation task as an MDP problem in which the reward function is unknown. Consequently, this MDP does not permit the direct use of RL algorithms to approximate the solution, since RL methods require a reward function. Therefore, we learn the reward function using demonstrations that exhibit the desired behavior.

Specifically, given the set of trajectories D = {τ_1, τ_2, ..., τ_N} observed from an expert that acts stochastically under an optimal policy, the density function is specified as

p(\tau) = \frac{1}{Z} \exp(R_\psi(\tau))    (15)

where τ = {⟨s_1, a_1⟩, ..., ⟨s_{T-1}, a_{T-1}⟩} is the trajectory of a generated SMILES string Y_{1:T} and R_ψ is an unknown reward function parameterized by ψ. Thus, the expert generates SMILES strings that satisfy the desired criteria with a probability that increases exponentially with the return. Here, the main challenge for this energy-based model is computing the partition function,

Z = \int \exp(R_\psi(\tau)) \, d\tau.    (16)

Ziebart et al. [33] computed Z using a Dynamic Programming (DP) approach to estimate the partition function. However, the DP method does not scale to large state-space domains, such as the space of synthetically accessible compounds mentioned earlier.

In [24], the authors proposed a sample-based entropy maximization IRL algorithm. They use a sample or background distribution to approximate the partition function and continuously adapt this sample distribution to provide a better estimation of Equation 15. Therefore, if the sample distribution represents the agent's policy and this distribution is trained using an RL algorithm, given an approximation of R_ψ, then the RL algorithm guides the sample distribution to a space that provides a better estimation of Z. We refer to the method discussed by [24] as Guided Reward Learning (GRL) in this study. We adopt this approach to bias a pretrained generator to the desired compound space since the GRL method produces both a policy and a learned reward function that could be transferred to other related domains.
Figure 3: The architectures of the RNN models used in this study. (a) The structure of the reward net (left) and the critic net (right). (b) The structure of the generator net during pretraining to learn the SMILES syntax. The model is trained using the cross-entropy loss. (c) The structure of the generator net during the RL phase; the model autoregressively generates a SMILES string given a fixed start state. The sampling process terminates when an end token is sampled or a length limit is reached.
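To make the shared-encoder design of Figures 1b and 3 concrete, below is a minimal PyTorch sketch in which a single embedding layer is shared by separate GRU heads for the actor, critic, and reward function. Plain GRUs are used here instead of the Stack-RNN cells described earlier, and the vocabulary size and layer widths are illustrative assumptions.

```python
import torch.nn as nn

class SmilesEncoder(nn.Module):
    """Token embedding shared by the actor, critic, and reward nets (Fig. 1b)."""
    def __init__(self, vocab_size, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        return self.embed(tokens)                # (batch, seq_len, emb_dim)

class GruHead(nn.Module):
    """Two-layer GRU followed by a linear output layer."""
    def __init__(self, emb_dim=128, hidden=256, out_dim=1, layers=2):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h, _ = self.gru(x)
        return self.out(h)                       # per-step outputs

# Illustrative wiring (sizes are assumptions):
# encoder = SmilesEncoder(vocab_size=45)
# actor   = GruHead(out_dim=45)   # action logits over the SMILES token vocabulary
# critic  = GruHead(out_dim=1)    # V(s_t) at each step
# reward  = GruHead(out_dim=1)    # R_psi, read at the final step
```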
Table 1: Hardware specifications used in the experiments.

Model                   | Number of cores | RAM (GB) | GPUs
Intel Xeon CPU E5-2687W | 48              | 128      | 1 GeForce GTX 1080
Intel Xeon CPU E5-2687W | 24              | 128      | 4 GeForce GTX 1080Ti

Table 2: Performance of ML models used as evaluation functions in the experiments. The reported values are averages of a 5-fold CV in each case; standard deviations are shown in parentheses. The corresponding experiment(s) that used each ML model is/are specified in parentheses in the first column. Precision, Recall, Accuracy, and AUC are binary classification metrics; RMSE and R^2 are regression metrics.

Model                    | Precision     | Recall        | Accuracy      | AUC           | RMSE          | R^2
RNN-Bin (DRD2)           | 0.971 (0.120) | 0.970 (0.120) | 0.985 (0.011) | 0.996 (0.001) | -             | -
RNN-Reg (LogP)           | -             | -             | -             | -             | 0.845 (0.508) | 0.708 (0.359)
XGB-Reg (JAK2 Min & Max) | -             | -             | -             | -             | 0.646 (0.037) | 0.691 (0.039)
Formally, taking the log-likelihood of Equation 15 provides the maximization objective,

L_{GRL}(\psi) = \frac{1}{N} \sum_{\tau_i \in D} R_\psi(\tau_i) - \log Z    (17)

\approx \frac{1}{N} \sum_{\tau_i \in D} R_\psi(\tau_i) - \log \frac{1}{M} \sum_{\tau_j \in \hat{D}} \frac{\exp(R_\psi(\tau_j))}{q(\tau_j)}    (18)

where \hat{D} is the set of M trajectories sampled using the background distribution q. As depicted in Figure 3a-left, R_ψ is represented as a 2-layer GRU RNN that takes as input the output of a SMILES encoder (see Figure 1b) and predicts a scalar reward for a generated valid or invalid SMILES string.

RL is used to train q to improve the background distribution used in estimating the partition function. The model architecture of q during the RL training and SMILES generation phases is shown in Figure 3c.

Since we represent q as an RNN in this study, training the sample distribution with high-variance RL algorithms such as the REINFORCE [16] objective could override the efforts of the pretraining due to the non-linearity of the model. Therefore, we train the sample distribution using the Proximal Policy Optimization (PPO) algorithm [28]. The PPO objective ensures a gradual change in the parameters of the agent policy (sample distribution). The PPO algorithm learns a value function to estimate A^π(s, a), making it an actor-critic method. The architecture of the critic in this study, modeled as an RNN, is illustrated in Figure 3a-right. Like the actor model, the critic takes the SMILES encoder's outputs as input to predict V(s_t) at each time step t.

Lastly, the evaluation function E of our proposed framework (see Figure 1a) is used, at designated periods of training, to evaluate a set of generated SMILES strings. Compounds deemed to satisfy the learning objective could be persisted for further examination in the drug discovery process, and the training of the generator could be terminated.
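A minimal sketch of the sample-based objective in Equation 18 is shown below. It assumes that reward_net maps a trajectory to a scalar tensor and that log q(τ) is available for each background sample; the function returns the negated objective so that a standard optimizer can minimize it. These interfaces are assumptions of the sketch, not the exact training code of this study.

```python
import math
import torch

def grl_loss(reward_net, demo_batch, sample_batch, sample_logq):
    """Negative of the sample-based MaxEnt IRL objective (Equation 18).

    demo_batch   : trajectories drawn from the demonstrations D
    sample_batch : trajectories drawn from the background distribution q
    sample_logq  : tensor of log q(tau_j), one entry per background trajectory
    """
    demo_term = torch.stack([reward_net(t) for t in demo_batch]).mean()
    sample_rewards = torch.stack([reward_net(t) for t in sample_batch])
    # importance-weighted estimate of log Z: log( (1/M) * sum exp(R - log q) )
    log_z = torch.logsumexp(sample_rewards - sample_logq, dim=0) \
            - math.log(len(sample_batch))
    return -(demo_term - log_z)
```

In training, the background samples would come from the current PPO-trained policy, so improving the policy tightens the estimate of the partition function, as described above.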
We performed four experiments to evaluate the approach described in this study. In each experiment, the aim is to ascertain the effectiveness of our proposed approach and how it compares to the case where the reward function is known, as found in the literature. The hardware specifications we used for our experiments are presented in Table 1. The performance of each of the evaluation functions used in the experiments is shown in Table 2.

The first experiment's objective was to train the generator to produce compounds that target the Dopamine Receptor D2 (DRD2) protein. Hence, we retrieved a DRD2 dataset of compounds from ExCAPE-DB [41] (https://git.io/JUgpt). This dataset contained a set of positive compounds (binding to DRD2). We then sampled an equal number of the remaining compounds as negatives (non-binding to DRD2) to create a balanced dataset. The balanced DRD2 dataset was then used to train a two-layer LSTM RNN, similar to the reward function network shown in Figure 3a-left but with an additional Sigmoid endpoint, using five-fold cross-validation with the BCE loss function.
The resulting five models of the CV training then served as the evaluation function E of this experiment. This evaluation function is referred to as RNN-Bin in Table 2. At test time, the average value of the predicted DRD2 activity probability was assigned as the result of the evaluation of a given compound.

Also, to create the set of demonstrations D for this experiment, we used the SVM classifier of [1] to filter the ChEMBL dataset of [15] for compounds with a predicted probability of activity above a set threshold. The compounds retained by this filtering served as D.

In the second experiment, we trained a generator to produce compounds biased toward having an octanol-water partition coefficient (logP) less than five and greater than one. LogP is one of the elements of Lipinski's rule of five.

For the evaluation function E of this experiment, we used the LogP dataset of [15] to train an LSTM RNN, similar to the reward function network shown in Figure 3a-left, using five-fold cross-validation with the Mean Square Error (MSE) loss function. Similar to the DRD2 experiment, the five models serve as the evaluation function E. The evaluation function of this experiment is labeled RNN-Reg in Table 2.

We constructed the LogP demonstrations dataset D by using the LogP-biased generator of [15] to produce SMILES strings and keeping the unique valid compounds, which then served as the set of demonstrations for this experiment.
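For illustration, the evaluation function E used in these experiments could be realized as a simple average over the five cross-validation models, as in the hypothetical sketch below; models and featurize are placeholders for the trained fold predictors and the featurization used by each experiment.

```python
def evaluate_compound(smiles, models, featurize):
    """Evaluation function E: average prediction of the five CV models.

    `models` are the fold predictors (e.g., DRD2 activity classifiers or logP
    regressors) and `featurize` maps a SMILES string to the model input; both
    are hypothetical placeholders for this sketch.
    """
    x = featurize(smiles)
    predictions = [m.predict(x) for m in models]
    return sum(predictions) / len(predictions)
```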
We also performed two experiments on producing compounds for JAK2 modulation. In the first JAK2 experiment, we trained a generator to produce compounds that maximize the negative logarithm of the half-maximal inhibitory concentration (pIC50) for JAK2. In this instance, we used the public JAK2 dataset of [15] to train an XGBoost model in a five-fold cross-validation scheme with the MSE loss function. The five resulting models then served as the evaluation function E, similar to the LogP and DRD2 experiments. The JAK2 inhibition evaluation function is referred to as XGB-Reg in Table 2. We used the JAK2-maximization-biased generator of [15] to produce a demonstration set of unique valid compounds from its generated SMILES.

On the other hand, we performed an experiment to bias the generator towards producing compounds that minimize the pIC50 values for JAK2. JAK2 minimization is useful for reducing off-target effects. While we maintained the evaluation function of the JAK2 maximization experiment in the JAK2 minimization experiment, we replaced the demonstrations set with the unique valid SMILES produced by the JAK2-minimization-biased generator of [15]. Also, we did not train a reward function for JAK2 minimization but rather transferred the reward function learned for JAK2 maximization to this experiment. However, we negated each reward obtained from the JAK2 maximization reward function for the minimization case study. This was done to examine the ability to transfer a learned reward function to a related domain.

As discussed in Section 3.4, we pretrained a two-layer Stack-RNN model with the ChEMBL dataset of [15] for one epoch. The training time was approximately 14 days. This pretrained model served as the initializing model for the following generators:

1. PPO-GRL: This model type follows our proposed approach. It is trained using the GRL objective at the IRL phase and the PPO algorithm at the RL phase.

2. PPO-Eval: This model type follows our proposed approach but without the IRL phase. The RL algorithm used is PPO. Since the problem presented in Section 3.1 assumes an evaluation function E to periodically determine the performance of the generator during training, this model enables us to evaluate an instance where E is able to serve as a reward function directly, such as the ML models in our experiments. We note that other instances of E, such as molecular docking or a robot performing synthesis, may be too expensive to serve as the reward function in RL training.

3. REINFORCE: This model type uses the SMILES generation method proposed in [15] to train a two-layer Stack-RNN generator following the method and reward functions of [15]. Thus, no IRL is performed. In the DRD2 experiment, we used the reward function of [1]. Also, for the JAK2 and LogP experiments, we used their respective reward functions in [15].

4. REINFORCE-GRL: Likewise, the REINFORCE-GRL model type is trained using the REINFORCE algorithm at the RL phase and the GRL method to learn the reward function. This model facilitates assessment of the significance of the PPO algorithm's properties to our proposed DIRL approach.

5. Stack-RNN-TL: This model type is trained using TL.
Specifically, after the pretraining stage, the demonstrations dataset is used to fine-tune the prior model to bias the generator towards the desired chemical space. Unlike the previous model types, which are trained using RL/IRL, this generator is trained using supervised learning (cross-entropy loss function). Since this model type is a possible candidate for scenarios where the reward function is not specified but exemplar data could be provided, TL serves as a useful baseline against which to compare the PPO-GRL approach.

The training time of each generator in each experiment is shown in Table 3.
Apart from the Stack-RNN-TL model, all other models in each experiment were trained for a fixed maximum number of episodes (across all trajectories), with early termination once the threshold of the experiment was reached. The Stack-RNN-TL model was trained for two epochs (due to time constraints) in each experiment. The threshold for each experiment is the average score of the evaluation function on the demonstration set. Also, only the model weights yielding the best score, as evaluated by E during training, were saved for each model type.

Furthermore, we assessed the performance of all trained generators using the metrics provided by [20] and the internal diversity metric in [21]. For each of these metrics, the best value is 1 and the worst value is 0. We give a brief introduction to the metrics below:

- Diversity: The diversity metrics measure the relative diversity between the generated compounds and a reference set. Given a compound from the generated set, a value of 1 connotes that the substructures of the compound are diverse from the reference set, whereas a value of 0 indicates that the compound shares several substructures with the compounds in the reference set. In our study, a random sample of the demonstrations dataset (1000 compounds) constitutes the external diversity [20] reference set. On the other hand, all generated compounds constitute the reference set when calculating internal diversity [21]. Intuitively, the internal diversity metric indicates whether the compound generator repeats similar substructures. We used the ECFP8 (Extended Connectivity Fingerprint with diameter 8, represented using 2048 bits) vector of each compound for calculating internal and external diversities (a sketch of the internal diversity computation is given after this list).

- Solubility: Measures the likeliness of a compound to mix with water. This is also referred to as the water-octanol partition coefficient.

- Naturalness: Measures how similar a generated compound is to the structure space of Natural Products (NPs). NPs are small molecules synthesized by living organisms and are viewed as good starting points for drug development [42].

- Synthesizability: Measures how well a compound lends itself to chemical synthesis (0 indicates hard to make and 1 indicates easy to make) [43].

- Druglikeness: Estimates the plausibility of a generated compound being a drug candidate. The synthesizability and solubility of a compound contribute to the compound's druglikeness.
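As an example of the diversity computation, the sketch below estimates internal diversity as one minus the mean pairwise Tanimoto similarity of ECFP8 fingerprints using RDKit. It follows the spirit of [21], though the exact normalization used in the original metric may differ.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def internal_diversity(smiles_list, radius=4, n_bits=2048):
    """1 minus the mean pairwise Tanimoto similarity of ECFP8 fingerprints
    (Morgan radius 4, 2048 bits) over a set of generated SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
           for m in mols if m is not None]
    similarities = []
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            similarities.append(DataStructs.TanimotoSimilarity(fps[i], fps[j]))
    return 1.0 - sum(similarities) / len(similarities) if similarities else 0.0
```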
4 Results and Discussion

We generated samples from each trained generator and used the molecular metrics in Section 3.6.5 to assess the generator's performance. During training, we also generated samples periodically and maintained an exponential average to monitor performance. Figure 4 shows the density plots of each model's valid SMILES strings and the convergence progression of each model toward the threshold of an experiment, beginning from the score of samples generated by the pretrained model.

Also, Table 3 presents the results of each metric for the valid SMILES samples of each model. Likewise, Table 4 presents the results for the set of compounds filtered from the valid SMILES samples of each model by applying the threshold of each experiment. In the case of the logP results in Table 4, we selected compounds whose values are less than 5 and greater than 1. The added proportion-in-threshold column in Table 4 reports the quota of a generator's valid SMILES that fell within the threshold of the experiment; the maximum value is 1 and the minimum is 0.
Figure 4: Column 1 - The distribution plots of the evaluation function's outputs for samples generated from the different model types used in the experiment. Column 2 - The convergence plots of the RL- and IRL-trained models during training; the y-axis represents the mean value of the experiment's evaluation function output for samples generated at a point during training. The Demo SMILES results correspond to the demonstration files of a given experiment. The Unbiased SMILES results correspond to samples generated from the pretrained (unbiased or prior) model. Column 3 - The convergence plot of the Stack-RNN-TL model during training. (a) The results of the DRD2 experiment. (b) The results of the LogP optimization experiment. (c) The results of JAK2 maximization. (d) The results of the JAK2 minimization experiment.
Table 3: Results of experiments without applying a threshold to generated or dataset samples. PPO-GRL follows our proposed approach, REINFORCE follows the work in [15], and PPO-Eval, REINFORCE-GRL, and Stack-RNN-TL are baselines in this study.

Objective | Algorithm/Dataset | Num. of unique canonical SMILES | Internal Diversity | External Diversity | Solubility | Naturalness | Synthesizability | Druglikeness | Approx. Training Time (minutes)
DRD2 | Demonstrations | 7732 | 0.897 | - | 0.778 | 0.536 | 0.638 | 0.603 | -
DRD2 | Unbiased | 2048 | 0.917 | 0.381 | 0.661 | 0.556 | 0.640 | 0.602 | -
DRD2 | PPO-GRL | 3264 | 0.887 | 0.189 | 0.887 | 0.674 | 0.579 | 0.271 | 41
DRD2 | PPO-Eval | 4538 | 0.878 | 0.183 | 0.885 | 0.722 | 0.521 | 0.236 | 33
DRD2 | REINFORCE | 7393 | 0.904 | 0.325 | 0.802 | 0.703 | 0.718 | 0.443 | 42
DRD2 | REINFORCE-GRL | 541 | 0.924 | 0.513 | 0.846 | 0.794 | 0.293 | 0.304 | 390
DRD2 | Stack-RNN-TL | 6927 | 0.918 | 0.391 | 0.655 | 0.551 | 0.647 | 0.611 | 3064
LogP | Demonstrations | 5019 | 0.880 | - | 0.900 | 0.553 | 0.803 | 0.512 | -
LogP | Unbiased | 2051 | 0.917 | 0.372 | 0.658 | 0.553 | 0.640 | 0.603 | -
LogP | PPO-GRL | 4604 | 0.903 | 0.159 | 0.836 | 0.580 | 0.770 | 0.494 | 60
LogP | PPO-Eval | 4975 | 0.733 | 0.060 | 0.982 | 0.731 | 0.449 | 0.075 | 50
LogP | REINFORCE | 7704 | 0.897 | 0.101 | 0.774 | 0.580 | 0.827 | 0.618 | 56
LogP | REINFORCE-GRL | 7225 | 0.915 | 0.347 | 0.663 | 0.551 | 0.689 | 0.627 | 62
LogP | Stack-RNN-TL | 6927 | 0.918 | 0.391 | 0.655 | 0.551 | 0.647 | 0.611 | 1480
JAK2 Max | Demonstrations | 3608 | 0.805 | - | 0.083 | 0.485 | 0.560 | 0.483 | -
JAK2 Max | Unbiased | 2050 | 0.917 | 0.379 | 0.654 | 0.554 | 0.641 | 0.604 | -
JAK2 Max | PPO-GRL | 5911 | 0.928 | 0.614 | 0.587 | 0.664 | 0.468 | 0.581 | 22
JAK2 Max | PPO-Eval | 6937 | 0.917 | 0.386 | 0.657 | 0.554 | 0.644 | 0.604 | 38
JAK2 Max | REINFORCE | 6768 | 0.916 | 0.351 | 0.608 | 0.608 | 0.529 | 0.627 | 30
JAK2 Max | REINFORCE-GRL | 7039 | 0.917 | 0.381 | 0.658 | 0.555 | 0.644 | 0.607 | 133
JAK2 Max | Stack-RNN-TL | 6927 | 0.918 | 0.391 | 0.655 | 0.551 | 0.647 | 0.611 | 608
JAK2 Min | Demonstrations | 285 | 0.828 | - | 0.534 | 0.548 | 0.895 | 0.604 | -
JAK2 Min | Unbiased | 2050 | 0.917 | 0.379 | 0.654 | 0.554 | 0.641 | 0.604 | -
JAK2 Min | PPO-GRL | 3446 | 0.907 | 0.244 | 0.488 | 0.506 | 0.776 | 0.663 | 34
JAK2 Min | PPO-Eval | 1533 | 0.703 | 0.008 | 0.997 | 0.756 | 0.414 | 0.049 | 148
JAK2 Min | REINFORCE | 7694 | 0.908 | 0.234 | 0.655 | 0.591 | 0.799 | 0.649 | 40
JAK2 Min | REINFORCE-GRL | 6953 | 0.917 | 0.376 | 0.662 | 0.547 | 0.657 | 0.613 | 47
JAK2 Min | Stack-RNN-TL | 6927 | 0.918 | 0.391 | 0.655 | 0.551 | 0.647 | 0.611 | 83
Table 4: Results of experiments with the optimization threshold applied to generated or dataset samples. PPO-GRL follows our proposed approach, REINFORCE follows the work in [15], and PPO-Eval, REINFORCE-GRL, and Stack-RNN-TL are baselines in this study.

Objective | Algorithm/Dataset | Num. of unique canonical SMILES | Proportion in threshold | Internal Diversity | External Diversity | Solubility | Naturalness | Synthesizability | Druglikeness
DRD2 | Demonstrations | 4941 | 0.635 | 0.896 | - | 0.793 | 0.524 | 0.642 | 0.588
DRD2 | Unbiased | 528 | 0.266 | 0.919 | 0.412 | 0.691 | 0.586 | 0.588 | 0.609
DRD2 | PPO-GRL | 2406 | 0.737 | 0.864 | 0.104 | 0.939 | 0.709 | 0.508 | 0.181
DRD2 | PPO-Eval | 3465 | 0.764 | 0.849 | 0.092 | 0.947 | 0.756 | 0.458 | 0.144
DRD2 | REINFORCE | 5005 | 0.677 | 0.888 | 0.222 | 0.837 | 0.715 | 0.706 | 0.392
DRD2 | REINFORCE-GRL | 201 | 0.372 | 0.907 | 0.223 | 0.929 | 0.853 | 0.211 | 0.194
DRD2 | Stack-RNN-TL | 1715 | 0.248 | 0.920 | 0.447 | 0.696 | 0.586 | 0.585 | 0.608
LogP | Demonstrations | 5007 | 1.000 | 0.880 | - | 0.898 | 0.553 | 0.803 | 0.512
LogP | Unbiased | 1903 | 0.927 | 0.915 | 0.346 | 0.687 | 0.546 | 0.652 | 0.605
LogP | PPO-GRL | 4548 | 0.988 | 0.903 | 0.149 | 0.842 | 0.579 | 0.772 | 0.493
LogP | PPO-Eval | 4969 | 0.999 | 0.733 | 0.061 | 0.983 | 0.731 | 0.449 | 0.075
LogP | REINFORCE | 7638 | 0.991 | 0.897 | 0.098 | 0.777 | 0.579 | 0.829 | 0.618
LogP | REINFORCE-GRL | 6796 | 0.941 | 0.914 | 0.317 | 0.686 | 0.546 | 0.698 | 0.630
LogP | Stack-RNN-TL | 6479 | 0.935 | 0.916 | 0.364 | 0.678 | 0.545 | 0.655 | 0.614
JAK2 Max | Demonstrations | 3449 | 0.958 | 0.794 | - | 0.073 | 0.482 | 0.560 | 0.479
JAK2 Max | Unbiased | 717 | 0.354 | 0.916 | 0.360 | 0.639 | 0.572 | 0.565 | 0.572
JAK2 Max | PPO-GRL | 2768 | 0.468 | 0.925 | 0.543 | 0.578 | 0.700 | 0.376 | 0.533
JAK2 Max | PPO-Eval | 2427 | 0.350 | 0.916 | 0.369 | 0.640 | 0.564 | 0.565 | 0.576
JAK2 Max | REINFORCE | 2942 | 0.434 | 0.912 | 0.281 | 0.586 | 0.609 | 0.462 | 0.605
JAK2 Max | REINFORCE-GRL | 2508 | 0.356 | 0.916 | 0.355 | 0.644 | 0.561 | 0.568 | 0.577
JAK2 Max | Stack-RNN-TL | 2376 | 0.343 | 0.917 | 0.370 | 0.650 | 0.557 | 0.578 | 0.580
JAK2 Min | Demonstrations | 145 | 0.518 | 0.785 | - | 0.539 | 0.506 | 0.908 | 0.617
JAK2 Min | Unbiased | 123 | 0.062 | 0.887 | 0.075 | 0.680 | 0.422 | 0.827 | 0.655
JAK2 Min | PPO-GRL | 693 | 0.201 | 0.880 | 0.061 | 0.534 | 0.453 | 0.852 | 0.673
JAK2 Min | PPO-Eval | 10 | 0.007 | 0.618 | 0.000 | 1.000 | 0.728 | 0.387 | 0.050
JAK2 Min | REINFORCE | 1078 | 0.140 | 0.891 | 0.086 | 0.681 | 0.500 | 0.887 | 0.677
JAK2 Min | REINFORCE-GRL | 490 | 0.070 | 0.897 | 0.142 | 0.682 | 0.432 | 0.831 | 0.671
JAK2 Min | Stack-RNN-TL | 426 | 0.061 | 0.897 | 0.130 | 0.683 | 0.430 | 0.832 | 0.662
Figure 5: The size of the demonstrations dataset used in each experiment of this study (unique compounds and compounds within the threshold).

Firstly, we observed from the results that the PPO-GRL model focused on generating compounds that either satisfy the demonstration dataset threshold or move toward the threshold within a few episodes of training in each of the experiments. This early convergence of the PPO-GRL generator (in terms of number of episodes) connotes that an appropriate reward function has been retrieved from the demonstration dataset to facilitate the biasing optimization using RL. Also, while the evaluation function provided an appropriate signal for training PPO-Eval in some cases, the diversity metrics of the PPO-Eval model were typically lower than those of the PPO-GRL model. We also realized during training that generating focused compounds toward a threshold was accompanied by an increase in the number of invalid SMILES strings. This increase in invalid compounds explains the drop in the number of valid SMILES of PPO-GRL in Table 3. Although the REINFORCE model mostly generated a high number of valid SMILES samples, it was less sample efficient. We reckon that the difference in sample efficiency between PPO-GRL and REINFORCE-GRL is a result of the variance in estimating Ψ (see Equation 5). Thus, a stable estimation of Ψ is significant for the sample-based objective in Equation 18, since a better background distribution can then be estimated. The performance of REINFORCE-GRL, as shown in Figure 4, reifies this high-variance challenge. Also, the Stack-RNN-TL model recorded the same scores for all metrics across the experiments, as seen in Table 3. This performance of the Stack-RNN-TL model connotes that no focused compounds could be generated in each experiment after two epochs of training, even though it mostly took longer to train than the other models. We also note that the Stack-RNN-TL model produced a higher number of valid canonical SMILES than the unbiased or prior generator after the fine-tuning step.

Concerning the DRD2 experiment, although the PPO-GRL model generated more valid SMILES strings than the number of unique compounds in the unbiased dataset, its count was approximately one-third lower than that of the PPO-Eval model. As shown in Figure 4a, this is observed in the mean predicted DRD2 activity of the PPO-Eval model, which reaches the threshold in fewer episodes than the PPO-GRL and REINFORCE models. The PPO-GRL model produced a higher proportion of its valid compounds within the activity threshold than the REINFORCE model, and its generated samples seem to share fewer substructures with the demonstrations set (external diversity) than those of the PPO-Eval approach, as reported in Table 4. Unsurprisingly, due to the variance problem mentioned earlier, REINFORCE-GRL performed poorly and generated the fewest valid compounds, with more than half of its produced SMILES falling below the DRD2 activity threshold.

Regarding the logP optimization experiment, most of the compounds sampled from all the generators had logP values that fell within the range of 1 to 5. However, while the REINFORCE-GRL and Stack-RNN-TL models recorded lower average logP values, the PPO-GRL, PPO-Eval, and REINFORCE models recorded higher average logP values closer to the demonstration dataset's average logP value, as shown in Figure 4b. Considering that the PPO-GRL model was trained without the reward function used to train the PPO-Eval and REINFORCE generators, our proposed approach was effective at recovering a proper reward function for biasing the generator.
Interestingly, the samples of the PPO-GRL generator recorded better diversity scores and were deemed more druglike than the PPO-Eval generator's samples, as shown in Table 3.

On the JAK2 maximization experiment, the PPO-Eval method could not reach the threshold. The PPO-GRL generator reached the threshold within relatively few episodes but with fewer valid SMILES strings than the REINFORCE generator. As shown in Table 4, the proportion of compounds within the JAK2 max threshold among the PPO-GRL generated samples, when compared to the other models in the JAK2 maximization experiment, indicates that the PPO-GRL model seems to focus on the objective earlier in training than the other generators.

On the other hand, the JAK2 minimization experiment provides an insightful view of the PPO-GRL behavior despite its inability, like that of the other models, to reach the threshold in Figure 4d. In Figure 5, we show the size of the demonstrations dataset in each experiment and the number of compounds that satisfy the experiment's threshold. We consider the proportion of each demonstration dataset satisfying the threshold as vital to the learning objective; hence, fewer such compounds could make learning a useful reward function more challenging. Thus, the size and quality of the demonstration dataset contribute significantly to learning a useful reward function. We suggest this explains the PPO-GRL generator's inability to reach the JAK2 minimization threshold. It is therefore no surprise that the REINFORCE-GRL generator's best mean pIC50 value was approximately the same as the pretrained model's score, as seen in Figure 4d. We note that the number of unique PPO-GRL generated compounds was almost five times larger than the demonstrations set, with approximately the same internal variation. It is worth noting that while the REINFORCE approach used the JAK2 minimization reward function of [15], the PPO-GRL method used the negated rewards of the learned JAK2 maximization reward function. This ability to transfer learned reward functions could be a useful technique in drug discovery campaigns.

In a nutshell, the preceding results and analysis show that our proposed framework offers a rational approach to train compound generators and learn a reward function, transferable to related domains, in situations where specifying the reward function for RL training is challenging or not possible. While the PPO-Eval model can perform well in some instances, the evaluation function may not be readily available, or it may be too expensive to serve in the training loop as a reward function in some real-world scenarios, such as a robot synthesis or molecular docking evaluation function.

5 Conclusion

This study reviewed the importance of chemical libraries to the drug discovery process and discussed some notable existing proposals in the literature for evolving chemical structures to satisfy specific objectives. We pointed out that while RL methods could facilitate this process, specifying the reward function in certain drug discovery studies could be challenging, and its absence renders even the most potent RL technique inapplicable. Our study has proposed a reward function learning and structural evolution model development framework based on the entropy maximization IRL method. The experiments conducted have shown the promise such a direction offers in the face of the large-scale chemical data repositories that have become available in recent times. We conclude that further studies into improving the proposed method could provide a powerful technique to aid drug discovery.
Further Studies
An area for further research could be techniques for reducing the number of invalid SMILES while the generator focuses on the training objective. An approach that could be investigated is a training method that actively teaches the reward function to distinguish between valid and invalid SMILES strings. Also, a study providing a principled understanding of developing the demonstrations dataset could be an exciting and useful direction. Additionally, considering the impact of transformer models in DL research, the Gated Transformer proposed by Parisotto et al. [44] to extend the capabilities of transformer models to the RL domain offers the opportunity to develop better compound generators. In particular, our work could be extended by replacing the Stack-RNN used to model the generator with a memory-based Gated Transformer.
Acknowledgements
We would like to thank Zhihua Lei, Kwadwo Boafo Debrah, and all reviewers of this study.
Funding
This work was partly supported by SipingSoft Co. Ltd.
References

[1] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):1-14, 2017.
[2] Yan A. Ivanenkov, Vladimir A. Aladinskiy, Nikolay A. Bushkov, Andrey A. Ayginin, Alexander G. Majouga, and Alexandre V. Ivachtchenko. Small-molecule inhibitors of hepatitis C virus (HCV) non-structural protein 5A (NS5A): a patent review (2010-2015). Expert Opinion on Therapeutic Patents, 27(4):401-414, 2017. PMID: 27967269.
[3] Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120-131, Jan 2018.
[4] P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design, 27(8):675-679, August 2013.
[5] Aroon D. Hingorani, Valerie Kuan, Chris Finan, Felix A. Kruger, Anna Gaulton, Sandesh Chopade, Reecha Sofat, Raymond J. MacAllister, John P. Overington, Harry Hemingway, Spiros Denaxas, David Prieto, and Juan Pablo Casas. Improving the odds of drug development success through human genomics: modelling study. Scientific Reports, 9(1):1-25, 2019.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, pages 1-15, 2014.
[7] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2018.
[8] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048-2057, Lille, France, 07-09 Jul 2015. PMLR.
[9] Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, and David Ryan Koes. Protein-ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling, 2017.
[10] Brighter Agyemang, Wei-Ping Wu, Michael Yelpengne Kpiebaareh, Zhihua Lei, Ebenezer Nanor, and Lei Chen. Multi-view self-attention for interpretable drug-target interaction prediction. Journal of Biomedical Informatics, 110:103547, 2020.
[11] Masashi Tsubaki, Kentaro Tomii, and Jun Sese. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics, 35(2):309-318, 2019.
[12] Marwin H. S. Segler and Mark P. Waller. Modelling chemical reasoning to predict and invent reactions. Chemistry - A European Journal, 2017.
[13] David Weininger. SMILES, a chemical language and information system: 1: Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 1988.
[14] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473, 2016.
[15] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science Advances, 4(7):1-15, 2018.
[16] R. J. Williams. Simple statistical gradient-following methods for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
[17] Fangzhou Shi, Shan You, and Chang Xu. Reinforced molecule generation with heterogeneous states. In Jianyong Wang, Kyuseok Shim, and Xindong Wu, editors, IEEE International Conference on Data Mining (ICDM), pages 548-557. IEEE, 2019.
[18] Florian Schmidt. Generalization in generation: A closer look at exposure bias. pages 1-12, 2019.
[19] Benjamin Sanchez-Lengeling, Carlos Outeiral, Gabriel L. Guimaraes, and Alán Aspuru-Guzik. Optimizing distributions over molecular space. An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC). ChemRxiv, pages 1-18, 2017.
[20] Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. CoRR, abs/1705.10843, 2017.
[21] Mostapha Benhenda. ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity? arXiv e-prints, 2017.
[22] Evgeny Putin, Arip Asadulaev, Yan Ivanenkov, Vladimir Aladinskiy, Benjamin Sanchez-Lengeling, Alán Aspuru-Guzik, and Alex Zhavoronkov. Reinforced adversarial neural computer for de novo molecular design. Journal of Chemical Information and Modeling, 58(6):1194-1204, June 2018.
[23] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks.
CoRR , abs/1701.04862, 2017.[24] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policyoptimization. In
Proceedings of the 33nd International Conference on Machine Learning (ICML) , volume 48 of
JMLR Workshop and Conference Proceedings , pages 49–58. JMLR.org, 2016.[25] Chelsea Finn, Paul F. Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarialnetworks, inverse reinforcement learning, and energy-based models.
CoRR , abs/1611.03852, 2016.[26] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continu-ous control using generalized advantage estimation. In
Proceedings of the International Conference on LearningRepresentations (ICLR) , 2016.[27] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A Brief Survey of DeepReinforcement Learning. arXiv e-prints , pages 1–14, 2017.[28] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimizationalgorithms.
CoRR , abs/1707.06347, 2017.[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei a Rusu, Joel Veness, Marc G Bellemare, AlexGraves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik,Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning.
Nature , 518(7540):529–533, 2015.[30] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust Region PolicyOptimization.
CoRR , pages 1–21, 2015.[31] Saurabh Arora and Prashant Doshi. A Survey of Inverse Reinforcement Learning: Challenges, Methods andProgress. arXiv e-prints , 2018.[32] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning.
Proceedings of the SeventeenthInternational Conference on Machine Learning , pages 663–670, 2000.[33] Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse ReinforcementLearning. In
AAAI Conference on Artificial Intelligence , pages 1433–1438, 2008.[34] Markus Wulfmeier and Peter Ondr. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv e-prints ,2016.[35] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Daniel D. Lee, Masashi Sugiyama,Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors,
Advances in Neural Information ProcessingSystems 29: Annual Conference on Neural Information Processing Systems , pages 4565–4573, 2016.[36] Craig Knox, Vivian Law, Timothy Jewison, Philip Liu, Son Ly, Alex Frolkis, Allison Pon, Kelly Banco, ChristineMak, Vanessa Neveu, Yannick Djoumbou, Roman Eisner, An Chi Guo, and David S. Wishart. DrugBank 3.0: Acomprehensive resource for ’Omics’ research on drugs.
Nucleic Acids Research , 2011.[37] Minoru Kanehisa, Susumu Goto, Yoko Sato, Miho Furumichi, and Mao Tanabe. KEGG for integration andinterpretation of large-scale molecular data sets.
Nucleic Acids Research , 2012.[38] Damian Szklarczyk, Alberto Santos, Christian Von Mering, Lars Juhl Jensen, Peer Bork, and Michael Kuhn.STITCH 5: Augmenting protein-chemical interaction networks with tissue and affinity data.
Nucleic AcidsResearch , 2016.[39] A. Patrícia Bento, Anna Gaulton, Anne Hersey, Louisa J. Bellis, Jon Chambers, Mark Davies, Felix A. Krüger,Yvonne Light, Lora Mak, Shaun McGlinchey, Michal Nowotka, George Papadatos, Rita Santos, and John P.Overington. The ChEMBL bioactivity database: An update.
Nucleic Acids Research , 2014.[40] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. InCorinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors,
Advancesin Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems ,pages 190–198, 2015.[41] Jiangming Sun, Nina Jeliazkova, Vladimir Chupakin, Jose Felipe Golib-Dzib, Ola Engkvist, Lars Carlsson, JörgWegner, Hugo Ceulemans, Ivan Georgiev, Vedrin Jeliazkov, Nikolay Kochev, Thomas J. Ashby, and HongmingChen. ExCAPE-DB: An integrated large scale dataset facilitating Big Data analysis in chemogenomics.
Journalof Cheminformatics , 2017. 18. A
GYEMANG ET AL .[42] Maria Sorokina and Christoph Steinbeck. Naples: A natural products likeness scorer—web application anddatabase.
Journal of Cheminformatics , 11(1):1–7, 2019.[43] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based onmolecular complexity and fragment contributions.
Journal of Cheminformatics , 2009.[44] Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M. Jayakumar, MaxJaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, and RaiaHadsell. Stabilizing Transformers for Reinforcement Learning.