Scaffold-constrained molecular generation
Maxime Langevin, Herve Minoux, Maximilien Levesque, Marc Bianciotto
Maxime Langevin,†,‡ Hervé Minoux,‡ Maximilien Levesque,∗,¶,§ and Marc Bianciotto∗,‡
† PASTEUR, Département de chimie, Ecole Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
‡ Molecular Design Sciences - Integrated Drug Discovery, Sanofi R&D, Vitry-sur-Seine, France
¶ PASTEUR, Département de chimie, Ecole Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
§ Aqemia, Paris, France
E-mail: [email protected]; marc.bianciotto@sanofi.com
Abstract
One of the major applications of generative models for drug discovery targets the lead-optimization phase. During the optimization of a lead series, it is common to have scaffold constraints imposed on the structure of the molecules designed. Without enforcing such constraints, the probability of generating molecules with the required scaffold is extremely low, which hinders the practicality of generative models for de-novo drug design. To tackle this issue, we introduce a new algorithm to perform scaffold-constrained in-silico molecular design. We build on the well-known SMILES-based Recurrent Neural Network (RNN) generative model, with a modified sampling procedure to achieve scaffold-constrained generation. We directly benefit from the associated reinforcement learning methods, which allow us to design molecules optimized for different properties while exploring only the relevant chemical space. We showcase the method’s ability to perform scaffold-constrained generation on various tasks: designing novel molecules around scaffolds extracted from SureChEMBL chemical series, generating novel active molecules on the Dopamine Receptor D2 (DRD2) target, and finally, designing predicted actives on the MMP-12 series, an industrial lead-optimization project.
Introduction
Finding new drugs is a long, costly and difficult problem, with potential failure all along the drug discovery pipeline and an overall success rate close to only 4%. Lead-optimization, where medicinal chemists refine bioactive molecules into potential drugs, takes up a major part of the time and cost in the discovery pipeline. Finding a good drug candidate requires finding a molecule that is active on the target of interest while satisfying multiple criteria mainly related to safety and pharmacokinetics. In this respect, lead-optimization can be viewed as a multi-objective optimization problem in chemical space. There has been a recent surge of interest in generative models for de-novo drug design and their application in the drug discovery pipeline. Generative models have been studied for two types of tasks: distribution learning and goal-oriented learning. Distribution learning aims at reproducing the distribution of a known dataset, in order to sample a large library of molecules similar to the initial dataset used to train the generative model. Goal-oriented learning, on the other hand, takes as input a scoring function and aims at finding the molecules with the highest scores. Applying generative models to lead-optimization can be understood as a special case of goal-oriented learning, where the scoring function reflects the adequacy of the molecule to the different project objectives. Distribution learning benchmarks are also frequently used to assess whether a model has learnt to generate drug-like molecules, and will be a good starting point for goal-oriented learning.
However, real-life lead-optimization projects very often impose constraints on the scaffold of designed molecules.
Interesting scaffolds are identified during the lead identification phase of the pipeline, and are often kept throughout the rest of the pipeline with only minor changes. The scaffold, or “core” of the molecule, is key to (i) preserving biological activity identified earlier in the pipeline, (ii) staying within areas of chemical space where prior information gathered is relevant, making SAR (structure-activity relationship) understandable, (iii) maintaining a high throughput for compound synthesis by using common precursors, speeding up the Design-Make-Test-Analysis (DMTA) cycle to a level acceptable by industry standards, and (iv) exploring a relevant chemical space, translatable into a Markush structure, from an Intellectual Property perspective.
Therefore, lead-optimization is most of the time a multi-objective optimization problem under scaffold constraints. Yet goal-oriented learning with scaffold constraints has not been extensively studied, despite being of major interest for the application of generative models to medicinal chemistry. As shown in our work, simply including the presence of the scaffold of interest in the scoring function as a supplementary criterion and applying existing goal-oriented models gives poor results. Hence, there is a clear need for algorithms that include those constraints directly in their generative process.
Before the rise of interest in generative models for chemistry, traditional approaches were already studied by computational chemists for de-novo design of optimized molecules. Those approaches relied either on the combination of fragments or on the exploration of chemical space by performing virtual reactions with handcrafted rules. Genetic algorithms, which showed success in many optimization tasks, were also adapted to chemistry and used to design molecules of interest.
Following Gómez-Bombarelli et al., who applied recent generative models in order to generate new molecules, there has been growing interest in de-novo design with generative models. Indeed, relying on a data-driven method that learns the underlying distribution of molecules helps to remain in a chemical space of drug-like molecules.
Among the various molecular generative models published in recent years, many rely on RNN, and especially on the Long Short Term Memory (LSTM) architecture. RNN are frequently used as generative models for sequential data, and in particular for SMILES. Those models differ in the way they learn the probability distribution of the training set. Variational Auto-Encoders (VAE) learn this distribution through an encoder-decoder loss, where the generator is used as a decoder and another RNN is used as an encoder. Generative Adversarial Networks (GAN) learn the distribution through a competition between the generator and a discriminator, and character-level generators learn it through simple negative log-likelihood minimization. While the methods to learn the distribution differ, in the end they rely on the same architecture to generate new molecules, namely a SMILES-based RNN. To design optimized molecules with RNN, the easiest way is to generate a large virtual library of molecules and then screen it against the objectives, but the likelihood of finding optimized molecules is extremely small. Reinforcement Learning (RL) was proposed as a way to design optimized molecules. Either through policy gradient or hill-climbing, which both aim at maximizing the probability of the highest scoring SMILES, the RNN can be fine-tuned to generate high scoring molecules. Hill-climbing was shown to achieve state-of-the-art results on various goal-oriented benchmarks.
Optimizing directly on the molecular graph, controlled either by a neural network or through Monte-Carlo Tree Search, can yield interesting results, but also leads to compounds of lesser quality, which hinders adoption of those approaches in drug discovery projects. Methods based on Graph Neural Networks, which work directly on the molecular graph rather than with SMILES, have also been studied recently. Those architectures are more experimental and have not yet proved to perform as well as SMILES-based methods.
Scaffold-constrained generation has been investigated in only a few previous works. Fragment growing with RNN has been described, but is very limited in its scope and cannot accommodate scaffold constraints besides simply imposing the presence of a fragment. DeepScaffold relies on Graph Neural Networks to build molecules on an existing scaffold. The architecture used is complex and experimental, relying on graph convolutional neural networks. The absence of built-in reinforcement learning also hinders the adoption of this method to design optimized molecules. The Reinvent Scaffold Decorator is based on SMILES and RNN and uses an encoder-decoder architecture. It completes scaffolds in two different ways, single-step and multi-step. Single-step completes all open positions at once, while multi-step completes each open position iteratively. The architecture used is also complex, as it relies on an encoder-decoder scheme, and is limited to open positions at branching points. The model also requires dataset-specific preprocessing and training for each application. Moreover, no RL procedure is directly applicable, which also limits the scope of this method.
Prior work on Markov models and especially RNN showed that, for a given language, constraints that can be translated sequentially can be enforced during sampling. Thus, the main idea behind our model is that scaffold constraints can be translated to the SMILES language.
Acknowledging this fact, we investigate how we can leverage this translation from structure to SMILES to perform scaffold-constrained generation. We show that scaffold-constrained generation can be achieved with SMILES-based RNN using simply a modified sampling procedure. This means that our method tackles scaffold-constrained generation without the need for a new model, or even for retraining the existing model. Moreover, all previous experience acquired by computational chemists with SMILES-based RNN carries over to our procedure. The available state-of-the-art reinforcement learning methods for goal-oriented learning ensure that our procedure is easily applicable to lead-optimization. By performing scaffold-constrained generation, we can include the project’s structural constraints directly in our generative process. This allows us to search directly within the very sparse subspace of chemical space that is relevant to the lead-optimization project. Furthermore, our framework allows low-level control over scaffold-constrained molecular generation, including the possibility to limit oneself to linear fragments, to control the presence of branches and cycles or the type of atoms authorized, as well as the ability to perform scaffold-hopping.
We begin by giving background on molecule generation with SMILES-based RNN. Then, we show how to adapt the sampling procedure in order to perform scaffold-constrained generation with an RNN. Finally, we provide experimental results on different de-novo design tasks, and show the usefulness of our method in the context of scaffold-constrained generation.
Methods
The SMILES language represents molecules as a chain of characters. Each character corresponds to an atom or denotes structural information, such as opening or closing of cycles and branches, stereochemistry or multiple bonds.
Starting from a molecular graph, cycles are broken down and marked with numbers, and a starting atom is chosen. The SMILES string is obtained by printing the symbol of the nodes encountered in a depth-first traversal starting at the chosen atom, with parentheses indicating branching points in the graph. For a given molecular graph, there are at least as many SMILES as possible starting atoms. A canonicalization algorithm can be used to pick the starting atom, thus yielding the canonical SMILES of the molecule. The corresponding molecular graph can be easily retrieved from a given SMILES.
RNN can generate SMILES in a sequential fashion by modeling the conditional probability distribution over SMILES tokens (conditioned on the beginning of a SMILES). The SMILES language is enriched with a GO and an EOS token that denote the beginning and the end of a SMILES string. Let $s = x_0, \ldots, x_T$ be the tokenized version of a SMILES string, with $x_i$ characters from the SMILES language (here $x_0$ and $x_T$ denote respectively the GO and EOS tokens). The RNN models $P(x_t \mid x_0, \ldots, x_{t-1})$, i.e. the conditional probability of a token given all the previous ones. The RNN is trained on a database of drug-like molecules (such as ChEMBL) to predict the next token of a SMILES given the beginning of the sequence, and training is achieved through minimization of the negative log-likelihood of the training set SMILES. The objective is to minimize (w.r.t. the RNN parameters denoted by $\theta$) the following loss:
$$L(\theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_0, \ldots, x_{t-1}) \qquad (1)$$
The RNN relies on its internal state $h$ to process the information from the previous tokens, and models the conditional probability $P(x_t \mid x_0, \ldots, x_{t-1})$ as $P(x_t \mid h_{t-1})$.
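To make the loss of Eq. (1) concrete, here is a minimal Python sketch that computes the negative log-likelihood of one tokenized SMILES from per-step conditional probabilities. The function name `sequence_nll` and the probability values are illustrative assumptions; in practice the probabilities would be produced by the trained RNN rather than written by hand.

```python
import math

def sequence_nll(step_probs, tokens):
    """Negative log-likelihood of one tokenized SMILES (one term of Eq. 1 per step).

    step_probs[t] maps every candidate token to its conditional probability
    at step t (hypothetical stand-in for the RNN output); tokens[t] is the
    token actually observed at that step.
    """
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, tokens))

# Toy example: the probabilities below are made up for illustration only.
tokens = ["C", "O", "EOS"]
step_probs = [
    {"C": 0.5, "O": 0.25, "EOS": 0.25},
    {"C": 0.25, "O": 0.5, "EOS": 0.25},
    {"C": 0.25, "O": 0.25, "EOS": 0.5},
]
loss = sequence_nll(step_probs, tokens)  # each observed token has probability 0.5
```

Summing this quantity over all SMILES of the training set gives the objective that training minimizes.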
Data augmentation by enriching the dataset with non-canonical SMILES has shown benefits in different tasks; we therefore adopt it when training RNN for the following experiments.
Once the conditional probability distribution $P(x_t \mid x_0, \ldots, x_{t-1})$ is learnt, sampling is achieved by initializing the sequence with a GO token and then sampling tokens sequentially. This procedure ends when an EOS token is sampled, and returns a SMILES with its corresponding likelihood, which is amenable to backpropagation.
Algorithm 1: Generating new SMILES samples
Result: SMILES string
initialize h_0; x_0 = GO; t = 1;
while x_{t-1} ≠ EOS do
    Sample x_t from P(x_t | x_0, ..., x_{t-1}) and update h_{t-1} to h_t;
    t = t + 1;
end
Output: x_0, ..., x_{t-1}
This framework can be adapted to scaffold-constrained generation. Indeed, a scaffold can be viewed as an incomplete SMILES. This incomplete SMILES can have different positions with open possibilities, denoted by the special token “*”. An example of how scaffold constraints can be translated in SMILES, in the case of an open branch, is given in Table 1.
Table 1: Linking SMILES syntax with molecular structure: open branch
Original structure:            CC(C)(C(=O)O)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(c3ccccc3)c4ccccc4
Structure with open position:  CC(C)(C(=O)O)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(*)c4ccccc4
When the open position is within a linker, an example is given in Table 2.
Table 2: Linking SMILES syntax with molecular structure: open linker
Original structure:            CC(C)(C(=O)O)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(c3ccccc3)c4ccccc4
Structure with open position:  CC(C)(C(=O)O)c1ccc(cc1)C(O)C*N2CCC(CC2)C(O)(c3ccccc3)c4ccccc4
In the classic procedure, the RNN freely samples tokens until the end of the SMILES is reached. For scaffold-constrained generation, the RNN is constrained to follow the SMILES scaffold, and sampling is enabled only when an open position “*” of the SMILES is reached. The major subtlety lies in the fact that the sampling procedure depends upon the configuration of the molecular graph around the open position, especially with regard to the decision to stop sampling and resume reading. In this work, we tackle three different kinds of open positions: first, open positions at branching points of the molecular graph (commonly referred to as R-groups); then, open positions in linkers (which link different cycles of the molecule); and finally constrained choices, where the position is open but the number of possibilities is already limited within the drug discovery project. Each kind of open position requires an adequate sampling procedure. The general algorithm for scaffold-constrained sampling is the following:
Algorithm 2: Generating new SMILES samples with scaffold constraints
Result: SMILES string with scaffold s
Input: scaffold s = s_1, ..., s_n
initialize h_0; x_0 = GO; t = 1;
for i ← 1 to n do
    if s_i is not * then
        Read s_i and update h_{t-1} to h_t;
        x_t = s_i;
        t = t + 1;
    else
        Sample y = (y_1, ..., y_k) and update h according to the special sampling procedure for scaffold decoration;
        x_t, ..., x_{t+k-1} = y_1, ..., y_k;
        t = t + k;
    end
end
x_t = EOS;
Output: x_0, ..., x_t
Decorating a fixed scaffold is one of the predominant techniques used by medicinal chemists in lead optimization, and being able to perform this task with a generative model is of major practical importance for the application of such models to drug discovery. For branched decorations (i.e. R-groups), the SMILES translation is straightforward: opening and closing parentheses denote the beginning and end of the branch, and therefore an open branched decoration translates to “(*)” in the SMILES language. In this case, the sampling procedure is easily designed: the RNN is free to sample any tokens while the branch is open. When a closing parenthesis that matches the opening parenthesis of the branch is sampled, the RNN has finished sampling the branched decoration and can resume reading the rest of the scaffold. This also implies that it is necessary to keep track of the opening and closing parentheses sampled within the decoration. Another technical issue that arises due to the SMILES language is that sampled cycle identifiers (1, 2, 3, ...) should be different from the ones used in the scaffold, to ensure that the given scaffold is respected in corner cases where a cycle opened before an open position is closed after said open position.
Algorithm 3: Decorating a scaffold on a given branching position
Result: SMILES string with completed decoration
Input: hidden state h_0
h = h_0; opened = 1; closed = 0; t = 1;
while opened > closed do
    Sample x_t from P(x | h_{t-1}) and update h_{t-1} to h_t;
    if x_t = '(' then
        opened += 1;
    else if x_t = ')' then
        closed += 1;
    t = t + 1;
end
Additional refinements are possible, such as specifying the beginning of the branch (e.g. “(CN*)” instead of “(*)”), or restraining the branch to be a linear fragment (by forbidding the opening of new branches and cycles).
Figure 1: Sampling a new branch
While branched decorations represent the majority of modifications allowed on a scaffold during lead optimization phases, it is sometimes interesting to allow more profound changes to the scaffold by performing scaffold hopping. To tackle this particular task, we investigate the possibility of having an open position within a linker between cycles. This problem presents several difficulties. First, contrary to branched decorations, there is no clear indicator for stopping sampling and resuming reading the pattern. Thus, the stopping criterion will necessarily be arbitrary. We implement it in the form of a user-defined probability distribution on the length of the added fragment. Furthermore, as the end of sampling is arbitrarily decided rather than chosen by the RNN, we need to keep track of opening and closing parentheses as well as cycles to ensure that branches and cycles are completed within the added fragment before stopping sampling. The stopping criterion for sampling is therefore a combination of the specified probability distribution and the requirement that the sampled fragment contains no uncompleted cycles or branches. We do not look into the task of modifying existing cycles, which is more complex.
The ability to modify the core of a molecule allows our method to tackle more than simple scaffold decoration and thus to perform scaffold-hopping as well (see figure S1).
Algorithm 4: Scaffold hopping by linker completion
Input: hidden state h_0, distribution on linker size P_size
Result: SMILES string with completed linker
h = h_0; sample n_char from P_size; opened = 0; closed = 0; step = 0; cycle = False;
while cycle or step < n_char or opened > closed do
    Sample x_t from P(x | h_{t-1}) and update h_{t-1} to h_t;
    if x_t = '(' then
        opened = opened + 1;
    else if x_t = ')' then
        closed = closed + 1;
    else if x_t is a cycle identifier then
        if corresponding cycle not opened then
            cycle = True; keep track of opened cycle;
        else
            close corresponding cycle;
            if no cycle still opened then cycle = False;
    step = step + 1;
end
The SMARTS language, an extension of SMILES, deals with discrete choices with the following syntax: $[x_1, x_2, \ldots, x_k]$ represents a discrete choice between SMILES characters $x_1, x_2, \ldots, x_k$. We use this syntax for sampling between a finite set of discrete choices. The sampling procedure is straightforward: restrict the possible tokens to those present in the discrete choices, renormalize the probability distribution and then sample. Instead of drawing the next token from $P(x \mid h_t)$, we sample it from:
$$Q(x) = \mathrm{softmax}\Big[ P(x \mid h_t) \cdot \sum_{c \in \mathrm{choices}} \delta_{x,c} \Big] \qquad (2)$$
By design, our method can deal with multiple open positions of different nature within the same scaffold. This is important as scaffold-constrained lead optimization often deals with multiple open positions at once.
Experiments
The objective of the following experiments is to verify the ability of our method to perform the major tasks that can be encountered in the context of scaffold-constrained molecular generation. We first check the ability of our method to perform scaffold-constrained generation on previously unseen scaffolds. We then assess the capacity of our method to generate, if needed, analogs to a given chemical series, in a focused learning task. Finally, we measure its performance in the optimization of molecules with given constraints by benchmarking it in different scaffold-constrained goal-oriented scenarios.
We rely on different sets of molecules for training and validation of the method. Following Olivecrona et al., we train on molecules extracted from ChEMBL. For validation, we use the SureChEMBL database, a database of patented molecules. Clustering molecules by their Bemis-Murcko scaffold, we extract 18 chemical series with 18 different scaffolds (see figure S2). Those scaffolds are chosen for validation as they
• are sampled from real-life drug discovery projects
• were not present in the training set (we remove any molecule that has one of the 18 scaffolds from the training set)
To study the ability to explore a focused region of chemical space and design analogs to a given series, we also isolate the largest (93 molecules) among the extracted chemical series. This yields 17 scaffolds for the validation set, and one scaffold for the focused learning validation set. Below, we refer to the molecules extracted from ChEMBL as the training set, the chemical series from SureChEMBL as the validation set, and the chemical series referenced above as the focused learning validation set.
To implement our model, we build on the existing codebase released by Olivecrona et al., which already includes a SMILES-based RNN and with which many researchers are already familiar.
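The modified sampling procedure described in the Methods section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `toy_model` is a hypothetical stand-in for the trained RNN (fixed probabilities, no hidden state), tokens are single characters, only “(*)” branch positions are handled, and the renaming of ring identifiers is left out. The helper `restrict_to_choices` illustrates the restrict-and-renormalize step used for constrained choices.

```python
import random

def toy_model(prefix):
    """Hypothetical stand-in for the RNN: next-token probabilities given a prefix."""
    if prefix and prefix[-1] == "(":
        return {"C": 0.6, "N": 0.2, "O": 0.2}  # a branch starts with an atom
    return {"C": 0.4, "N": 0.1, "O": 0.1, "(": 0.1, ")": 0.3}

def restrict_to_choices(probs, allowed):
    """Keep only the allowed tokens and renormalize their probabilities."""
    kept = {tok: p for tok, p in probs.items() if tok in allowed}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def decorate_branch(prefix, next_token_probs, rng, max_steps=50):
    """Sample tokens until the ')' matching the already-read '(' is drawn."""
    opened, closed, branch = 1, 0, []
    for _ in range(max_steps):
        probs = next_token_probs(prefix + branch)
        tokens, weights = zip(*probs.items())
        x = rng.choices(tokens, weights=weights)[0]
        if x == "(":
            opened += 1
        elif x == ")":
            closed += 1
            if closed == opened:
                return branch  # the matching ')' is supplied by the scaffold
        branch.append(x)
    # Step limit reached: close the inner branches that are still open.
    return branch + [")"] * (opened - closed - 1)

def sample_with_scaffold(scaffold, next_token_probs, rng):
    """Read scaffold tokens as-is; sample only at open positions '*'."""
    out = []
    for tok in scaffold:
        if tok == "*":
            out.extend(decorate_branch(out, next_token_probs, rng))
        else:
            out.append(tok)
    return "".join(out)

rng = random.Random(0)
smiles = sample_with_scaffold(list("c1ccccc1C(*)N"), toy_model, rng)
```

With a real model, `next_token_probs` would wrap the RNN forward pass and its hidden state, so that scaffold tokens update the state exactly as sampled tokens do.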
The RNN used in the subsequent tasks are either trained on the training set (henceforth named “Generic RNN”) or on the focused learning validation set chemical series (named “Focused RNN”).
The rationale for using the Generic RNN in most applications is that the training set comprises diverse drug-like compounds, and that an RNN trained on it should be able to explore a large and varied chemical space. For the focused learning task, the goal is to assess the ability to generate close analogs to a given chemical series, and therefore the Focused RNN trained specifically on this chemical series is used.
For the task of generating molecules around an unseen scaffold, we use the Generic RNN to design molecules conditioned on the different scaffolds in the validation set. Major metrics used in benchmarking suites for in-silico molecular generation are evaluated. The main goal is to ensure the ability to generate valid and unique SMILES, as well as to assess whether physico-chemical properties are similar to those of drug-like compounds. For the focused learning task, we compare distributions of molecules generated by the Generic RNN and the Focused RNN with the training set, validation set and the focused learning validation set.
As for goal-oriented benchmarks, we begin with the DRD2 target to provide a fair comparison with prior work [21, 22] on scaffold-constrained generation, and investigate the ability of our method to generate predicted DRD2 actives on the different validation scaffolds studied there. To generate optimized molecules, we rely on a state-of-the-art Reinforcement Learning procedure, hill-climbing. We then benchmark our method on the MMP-12 series, which is a large publicly available industrial lead-optimization dataset. Scaffold constraints are present in the dataset and explicitly mentioned in the original work that released the data, which supports our statement that they are common within drug-discovery lead-optimization campaigns. After building a QSAR model on pIC50 values, we compare our model against SMILES-based LSTM with hill-climbing, a state-of-the-art in-silico de-novo design algorithm, in the task of generating predicted actives with the required scaffold.
Generating molecules with new scaffolds
To verify the ability of our method to generate novel molecules around unseen scaffolds, we perform classic distribution learning benchmarks on sets of molecules designed around the scaffolds from the validation set. For each of the 17 scaffolds, we generate 10 000 SMILES. We first compute the proportion of (i) valid and (ii) unique SMILES. Validity is defined as whether the SMILES defines a valid molecular structure and is checked with the RDKit. Note that validity ensures that the SMILES is syntactically valid (all rings and branches closed, no illegal atom types) but does not mean that the molecule will necessarily be synthesizable in a wet lab. We also compute several physico-chemical properties for each valid molecule generated: calculated logarithm of partition coefficient (logP), Molecular Weight (MW), Synthetic Accessibility Score (SAS), Quantitative Estimate of Drug-Likeness (QED), and numbers of H donors (HBD) and acceptors (HBA). All properties were computed with the RDKit. We group molecules generated for each of the different scaffolds together and compare the distributions of those properties between generated molecules, the training set and the validation set.
The proportions of unique and valid molecules for each scaffold in the validation set (calculated over 10 000 SMILES per scaffold) are given in Figure 3, where the scaffolds are ordered by increasing number of open positions. They are on par with the best scores obtained by various generative models. The uniqueness proportion increases with the number of open positions, which matches the intuition that the more possibilities there are to modify a scaffold, the more diverse the generated SMILES will be.
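As a rough illustration of these two metrics, the sketch below computes a validity and a uniqueness proportion over a list of sampled SMILES. It rests on several assumptions that differ from the actual pipeline: validity is replaced by a purely syntactic proxy (balanced parentheses and paired ring-closure digits) rather than an RDKit parse, uniqueness is measured on raw strings among valid samples rather than on canonical SMILES, and the function names are hypothetical.

```python
def branches_and_rings_closed(smiles):
    """Syntactic proxy for validity: all branches and rings are closed.

    Digits inside bracket atoms (isotopes, charges, H counts) are ignored.
    This does not check atom types or valences as a chemistry toolkit would.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing a branch that was never opened
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens the ring, second closes it
    return depth == 0 and not open_rings

def validity_and_uniqueness(samples):
    """Return (valid / total, unique valid / valid) for a list of SMILES."""
    valid = [s for s in samples if branches_and_rings_closed(s)]
    validity = len(valid) / len(samples)
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness

# Toy sample: two copies of isopropanol, benzene, an unclosed branch, an unclosed ring.
samples = ["CC(C)O", "CC(C)O", "c1ccccc1", "CC(C", "c1ccccc"]
validity, uniqueness = validity_and_uniqueness(samples)
```

Swapping the proxy for a toolkit-based parser and canonicalizing before deduplication would bring this closer to the metrics reported here.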
[Figure 3 plot: validity and uniqueness (y-axis) against scaffold index (x-axis)]
Figure 3: Proportions of valid and unique molecules out of 10000 generated for each validation scaffold
[Figure 4 plots: proportion of valid molecules (Validity); proportion of unique molecules (Uniqueness)]
Figure 4: Validity and uniqueness proportions (min, lower quartile, median, upper quartile and max) across the 17 validation scaffolds
We then compare the distributions of various properties as a sanity check to ensure that generated molecules are similar to the drug-like molecules of the training and validation sets.
[Figure 5 panels: Molecular Weight (Da); Calculated log P; Hydrogen Bonds Acceptors; Hydrogen Bonds Donors; SAS; QED. y-axis: Proportion. Series: Generic RNN, Validation set, Training set]
Figure 5: Histogram of properties across generated molecules, training and validation set. For molecular weight and ClogP, values outside the [250,750] g.mol⁻¹ and [-1,6] ranges are not shown, and bars at the extremities accumulate values outside those ranges.
We note that properties are distributed similarly for the different sets of molecules, suggesting that generated molecules populate a similar property space as the training and validation sets. An interesting fact is that generated molecules have lower QED scores; this is rather intuitive, as molecules from the training and validation sets are actual molecules, which implies a bias: they were necessarily synthesizable and were considered interesting enough in a drug discovery context for their synthesis to actually be performed.
Focused learning on a chemical series
In the context of lead-optimization, we might like to narrow the search for optimized molecules to a focused chemical space. Therefore, we investigate whether we can generate close analogs to an existing chemical series. We use the Focused RNN, trained specifically on the focused learning validation set (which consists of a single chemical series), to generate molecules with the scaffold of the focused learning validation set. For comparison, we also generate molecules with the same scaffold but using the Generic RNN.
We then perform dimensionality reduction to 2D with PCA on ECFP4 fingerprints. Generated molecules are plotted in Figure 6 against the focused learning validation set and the training set.
[Figure 6 plot; series: Training set, Focused learning validation set, Focused RNN, Generic RNN]
Figure 6: PCA of generated molecules with Focused and Generic RNN compared with training set and focused learning validation set
Molecules generated with the Generic RNN overlap almost completely with the training set, while molecules generated with the Focused RNN overlap with the focused learning validation series. This indicates that using a focused RNN makes it possible to sample a chemical space close to a chemical series of interest.
It should be noted that while dimensionality reduction such as PCA on fingerprints is commonly used to compare distributions of molecules, the question of whether this is the most relevant approach (especially as fingerprints are high-dimensional and binary) is still open. This should be kept in mind, as well as the fact that we are also comparing molecules with a shared substructure.
Generating DRD2 actives
Benchmarking our model on goal-oriented tasks is the main focus of our experiments. A benchmark for goal-oriented scaffold-constrained generation was proposed by Arús-Pous et al. We therefore start by tackling the same task, which is generating predicted actives on the DRD2 target with specific scaffolds. We use the same 5 validation scaffolds. First, we assess uniqueness and validity proportions of generated molecules to ensure that our method generalizes well to those scaffolds. Then, for each scaffold, we run a reinforcement learning procedure, hill-climbing, with the objective of generating predicted actives. Activity prediction is done with the QSAR model from previous work. After ten epochs dedicated to learning to generate predicted actives, the best 50 molecules are kept.
Table 3: Metrics on distribution learning and goal-oriented learning for DRD2 validation scaffolds
Scaffold    Valid molecules    Unique molecules    Predicted active molecules out of 50 best
86% 84%
82% 92%
86% 96%
92% 90%
98% 32%
[…] as no insight was given of the total number of molecules generated for each scaffold. As mentioned previously, we chose to assess the 50 best molecules generated. Nonetheless, we find that our model is able to generate a much higher proportion of predicted actives for each scaffold. This is achieved even without access to known actives on DRD2 with different scaffolds, unlike in Arús-Pous et al., where training with actives is required. Furthermore, relying on a much simpler architecture translates into a much higher throughput for generated molecules (roughly 100 times faster in CPU time). Contrary to the Reinvent Scaffold Decorator, we also do not require the use of a special preprocessing algorithm or of specific pretraining. On the other hand, we require a Reinforcement Learning procedure to be run, though it comes at a rather cheap computational cost (provided that the property of interest can be computed efficiently). This allows us to be much more efficient in finding molecules optimizing a given objective.
Table 4: Speed of generation comparison
Method                          CPU time for generating 1000 molecules    Molecules generated per second (CPU time)
Reinvent Scaffold Decorator     620.9 seconds                             1.6 molecules/second
Scaffold-constrained generator  7.02 seconds                              143 molecules/second
Comparison with […] is not provided, as we failed to make use of the released code to replicate the findings of this work.
Lead-optimization use case: the MMP-12 series
Experiments on the DRD2 target show that our method can design predicted actives on a biological target while satisfying scaffold constraints. To provide further comparison, we designed a novel benchmark on the MMP-12 series, a publicly available industrial lead-optimization dataset, and assessed how scaffold-constrained generation fares against a state-of-the-art generative model.
As the primary target of goal-oriented models is lead optimization, it makes sense to design a benchmark that resembles the industrial problem we would like to solve. We compare our method with a classic SMILES-based RNN trained with hill-climbing reinforcement learning. For each method, 10 runs of reinforcement learning are launched, with the exact same optimization procedure and number of steps to ensure a fair comparison. For each run, the top 50 molecules are kept. For each molecule, given its predicted pIC50, the score is computed as:

score = max(1, -(7.[...] - pIC50) / [...])    if scaffold constraints are met
score = 0                                     otherwise

Molecules with a predicted pIC50 above the activity threshold are considered active. Generated molecules therefore fall into four groups: actives with the required substructure (therefore matching 2/2 of the project's requirements), only actives (1/2 requirements), only matching substructures (1/2 requirements as well), and none of the two (0/2 requirements met).

With our method, 23% of molecules satisfy 2/2 requirements for the project, while no satisfying molecules are found with the classic SMILES-based RNN. This discrepancy is probably due to the fact that the chemical space where the structure constraint is met is a very sparse subspace of chemical space, which hinders the performance of reinforcement learning. The classic RNN therefore struggles to generate molecules within this chemical space while optimizing activity. In contrast, our method generates by design only molecules that meet the scaffold requirement, allowing the optimization procedure to focus only on the true objective. Among the molecules discovered by our method are some experimentally validated actives that were part of the held-out validation set, yielding yet another confirmation that inverse QSAR powered by generative models can discover experimentally validated molecules.

Table 5: Generating actives on the MMP-12 series
Model                             Right scaffold, active (2/2)    Right scaffold, not active (1/2)    Active, without scaffold (1/2)    Not active, without scaffold (0/2)
Classic RNN                       0%                              56%                                 42%                               2%
Scaffold-constrained generator    23%                             77%                                 0%                                0%

Figure 7: Re-discovered active from the held-out validation set, with the scaffold constraint highlighted

This experiment shows that, by relieving the optimization procedure of the structure constraint (as it is built into our model), our method can search much more efficiently for optimized molecules (w.r.t. predicted activity) within a subspace of molecules with a required scaffold. This experiment also shows that optimization under structure constraints seems to be a difficult task for generative models. For instance, on the GuacaMol benchmarking suite, optimization on multiple objectives is not problematic and high scores are achieved on a wide range of tasks by SMILES-based RNNs. Yet, with only two objectives in this task, one of them being a scaffold constraint, the classic SMILES-based RNN struggles to generate optimized molecules, which gives a significant advantage to scaffold-constrained generation. As scaffold constraints are common within drug discovery projects, we think that our method could therefore prove to be very useful in this context.

Conclusion
Generative models for drug discovery can be applied to lead-optimization tasks, which often require scaffold constraints to be respected. Those constraints are mainly present for preserving biological activity and staying in known SAR domains for the optimized properties, as well as for synthesizability and optimization of the synthesis process. Including those constraints in the generative process of a model is a difficult problem, with potential practical impact on the applications of generative models to drug discovery. In this work, we investigate how a well-known model, the SMILES-based RNN, can be slightly modified to achieve different scaffold-constrained generative tasks. This is possible thanks to a modified sampling procedure. Our approach for scaffold-constrained generation thus does not require designing a new model, or even retraining it. Furthermore, all previous works on reinforcement learning for molecular property optimization with SMILES-based RNNs stay applicable. Using distribution learning benchmarks, we show that our method can generalize across unseen scaffolds, and can also generate molecules in a focused fashion. The validation scaffolds used for this task are extracted from SureChEMBL and thus derived from real lead-optimization chemical series. On scaffold-constrained goal-oriented benchmarks, our method largely outperforms state-of-the-art de-novo design algorithms. Our approach was able to generate new predicted actives on the DRD2 target, without specific pretraining. Furthermore, we showed that it outperformed classic SMILES-based reinforcement learning for designing predicted actives on the MMP-12 series, proposing held-out experimentally validated actives. We also show that, by design, reinforcement learning methods are applicable in this context. This enables our method to use state-of-the-art algorithms for goal-oriented tasks, and we show strong performance on scaffold-constrained in-silico molecular optimization.
Our model also goes beyond simple scaffold decoration. It can provide low-level control over the way the scaffold is completed, and can also be used in other tasks such as scaffold hopping. Limitations of the method include the need to handcraft the scaffold constraints in SMILES format, as well as the fact that, in particular instances, specificities of the SMILES language require manual overriding of sampled cycles to ensure that scaffold constraints are respected. Overall, we believe our method is of real practical interest for the scaffold-constrained optimization tasks that make up most actual lead-optimization challenges in drug discovery. Coupled with the fact that we rely on a well-known and already widely adopted model, we expect that it will benefit researchers looking to apply generative models to lead-optimization tasks.

References
(1) Schneider, P.; Schneider, G. De Novo Design at the Edge of Chaos. Journal of Medicinal Chemistry, 4077–4086.
(2) Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S. R.; Schacht, A. L. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery, 203–214.
(3) Elton, D. C.; Boukouvalas, Z.; Fuge, M. D.; Chung, P. W. Deep learning for molecular design - a review of the state of the art. Mol. Syst. Des. Eng., 828–849.
(4) Brown, N.; Fiscato, M.; Segler, M. H.; Vaucher, A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. Journal of Chemical Information and Modeling, 1096–1108.
(5) Hughes, J.; Rees, S.; Kalindjian, S.; Philpott, K. Principles of early drug discovery. British Journal of Pharmacology, 1239–1249.
(6) Ståhl, N.; Falkman, G.; Karlsson, A.; Mathiason, G.; Boström, J. Deep Reinforcement Learning for Multiparameter Optimization in de novo Drug Design. Journal of Chemical Information and Modeling, 3166–3176.
(7) Hartenfeller, M.; Schneider, G. Enabling future drug discovery by de novo design. WIREs Computational Molecular Science, 742–759.
(8) Yoshikawa, N.; Terayama, K.; Honma, T.; Oono, K.; Tsuda, K. Population-based De Novo Molecule Generation, Using Grammatical Evolution. Chemistry Letters.
(9) Jensen, J. H. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chem. Sci., 3567–3572.
(10) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science, 268–276.
(11) Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nature Reviews Drug Discovery, 353–364.
(12) Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Central Science, 120–131.
(13) Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. Molecular De Novo Design through Deep Reinforcement Learning. Journal of Cheminformatics.
(14) Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; Kadurin, A.; Nikolenko, S. I.; Aspuru-Guzik, A.; Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. CoRR, abs/1811.12823.
(15) Sanchez-Lengeling, B.; Outeiral, C.; Guimaraes, G. L.; Aspuru-Guzik, A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC).
(16) Zhou, Z.; Kearnes, S.; Li, L.; Zare, R. N.; Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Scientific Reports, 10752.
(17) Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2323–2332.
(18) Cao, N. D.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. CoRR, abs/1805.11973.
(19) Merk, D.; Friedrich, L.; Grisoni, F.; Schneider, G. De Novo Design of Bioactive Small Molecules by Artificial Intelligence. Molecular Informatics, 1700153.
(20) Gupta, A.; Müller, A. T.; Huisman, B. J. H.; Fuchs, J. A.; Schneider, P.; Schneider, G. Generative Recurrent Networks for De Novo Drug Design. Molecular Informatics, 1700111.
(21) Li, Y.; Hu, J.; Wang, Y.; Zhou, J.; Zhang, L.; Liu, Z. DeepScaffold: A Comprehensive Tool for Scaffold-Based De Novo Drug Discovery Using Deep Learning. Journal of Chemical Information and Modeling, 77–91.
(22) Arús-Pous, J.; Patronov, A.; Bjerrum, E. J.; Tyrchan, C.; Reymond, J.-L.; Chen, H.; Engkvist, O. SMILES-based deep generative scaffold decorator for de-novo drug design. Journal of Cheminformatics.
(23) Walder, C.; Kim, D. Computer Assisted Composition with Recurrent Neural Networks. 359–374.
(24) Papadopoulos, A.; Pachet, F.; Roy, P.; Sakellariou, J. Exact Sampling for Regular and Markov Constraints with Belief Propagation. 341–350.
(25) Weininger, D. SMILES-A Language for Molecules and Reactions. Handbook of Chemoinformatics, 80–102.
(26) Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Research.
(27) Arús-Pous, J.; Johansson, S. V.; Prykhodko, O.; Bjerrum, E. J.; Tyrchan, C.; Reymond, J.-L.; Chen, H.; Engkvist, O. Randomized SMILES strings improve the quality of molecular generative models. Journal of Cheminformatics, 71.
(28) Böhm, H.-J.; Flohr, A.; Stahl, M. Scaffold hopping. Drug Discovery Today: Technologies, 217–224.
(29) Papadatos, G.; Davies, M.; Dedman, N.; Chambers, J.; Gaulton, A.; Siddle, J.; Koks, R.; Irvine, S.; Pettersson, J.; Goncharoff, N.; Hersey, A.; Overington, J. SureChEMBL: A large-scale, chemically annotated patent document database. Nucleic Acids Research.
(30) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. Journal of Medicinal Chemistry, 2887–2893.
(31) Pickett, S. D.; Green, D. V. S.; Hunt, D. L.; Pardoe, D. A.; Hughes, I. Automated Lead Optimization of MMP-12 Inhibitors Using a Genetic Algorithm. ACS Medicinal Chemistry Letters, 28–33.
(32) Landrum, G. RDKit: Open-source cheminformatics.
(33) Ertl, P.; Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics.
(34) Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature Chemistry, 90–98.

Figure 8: For Table of Contents Only

Supporting Information: Scaffold-constrained molecular generation
Data curation and software availability
Code availability
All software and data needed to reproduce the results of this paper are available at https://github.com/maxime-langevin/scaffold-constrained-generation. To implement the methods presented above, we build on the existing codebase of Olivecrona et al., available at https://github.com/MarcusOlivecrona/REINVENT. The different experiments are reproduced in Jupyter notebooks available in our codebase. An extra notebook showing basic usage, for researchers interested in simply using our method without necessarily reproducing our results, is also available.
ChEMBL dataset
The ChEMBL database is often used to train generative models of drug-like molecules. To train our RNN, we use a preprocessed version of ChEMBL in which only molecules having between 10 and 50 heavy atoms and composed of elements ∈ {H, B, C, N, O, F, Si, P, S, Cl, Br, I} were kept. The original filtered ChEMBL dataset can be downloaded at https://github.com/MarcusOlivecrona/REINVENT/blob/master/data/ChEMBL_filtered, and is also in our repository at https://github.com/maxime-langevin/scaffold-constrained-generation/data/ChEMBL_filtered. Furthermore, we filtered the dataset to exclude molecules having one of the 17 validation scaffolds as a substructure, yielding the final dataset at https://github.com/maxime-langevin/scaffold-constrained-generation/data/ChEMBL_without_sureChEMBL.smi.
SureChEMBL dataset
The SureChEMBL database is composed of patented compounds. The database can be downloaded at https://chembl.gitbook.io/chembl-interface-documentation/downloads. 34000 compounds were extracted from SureChEMBL v2019.10.01. Compounds were clustered by Bemis-Murcko scaffold, and 18 chemical series (every molecule in each series having the same scaffold) were kept to be used as a validation set. The molecules in the 18 series can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/data/SureChEMBL/200323_SureChemBL_dataset_636.sdf, and the 18 scaffolds at https://github.com/maxime-langevin/scaffold-constrained-generation/data/SureChEMBL/surechembl_scaffolds.sdf.
DRD2 dataset
The full DRD2 dataset can be found at https://github.com/undeadpixel/reinvent-scaffold-decorator/blob/master/training_sets/drd2.excapedb.smi.gz. The scaffolds used for the goal-directed benchmark are the ones used in Arús-Pous et al., and can be found in our codebase at https://github.com/maxime-langevin/scaffold-constrained-generation/data/DRD2/drd2_scaffolds.sdf.
MMP-12 dataset
The MMP-12 dataset was downloaded from the supplementary materials of Pickett et al. The dataset can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/data/MMP12/mmp12.csv.
Implementation details
Computation time benchmarks
Computation time benchmarks were run on an Amazon EC2 p2.xlarge instance, and assessed the runtime of generating molecules using one CPU.
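As an illustration of how such a CPU-time measurement could be set up, the sketch below times a single sampling call with Python's `time.process_time`. It is not the benchmarking code used for Table 4: `cpu_throughput` and `dummy_generate` are hypothetical names, and the dummy generator merely stands in for a model's sampling routine.

```python
import time
from typing import Callable, List, Tuple


def cpu_throughput(
    generate: Callable[[int], List[str]], n_molecules: int = 1000
) -> Tuple[float, float]:
    """Return (CPU seconds, molecules per CPU second) for one generation call."""
    start = time.process_time()  # CPU time of this process, not wall-clock time
    molecules = generate(n_molecules)
    elapsed = time.process_time() - start
    # Guard against a zero reading when the call is faster than the clock resolution.
    rate = len(molecules) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


def dummy_generate(n: int) -> List[str]:
    """Hypothetical stand-in for a generative model's sampling routine."""
    return ["C" * (1 + i % 5) for i in range(n)]


elapsed, rate = cpu_throughput(dummy_generate, 1000)
print(f"{elapsed:.4f} CPU seconds, {rate:.1f} molecules per CPU second")
```

Note that `time.process_time` sums CPU time over all threads of the process, so a measurement meant to reflect a single CPU would also need the model restricted to one thread.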
Distribution learning benchmarks
To assess distribution learning benchmarks, 10000 molecules were generated for each scaffold.
Validity
The validity score for a scaffold is the ratio of the number of valid molecules, as defined by RDKit, to all 10000 generated molecules.
Unicity
Out of the generated valid molecules, the unicity score is computed as the ratio of molecules with distinct canonical SMILES strings.
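The two ratios defined above can be sketched as follows. This is a minimal illustration rather than the code from the repository: `validity_and_unicity` and the `canonicalize` hook are hypothetical names, and the toy lookup table only stands in for a cheminformatics toolkit. In practice the hook could wrap RDKit's `Chem.MolFromSmiles` (which returns `None` for an unparsable string) followed by `Chem.MolToSmiles`.

```python
from typing import Callable, Iterable, Optional, Tuple


def validity_and_unicity(
    smiles: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Tuple[float, float]:
    """Return (validity, unicity) for a batch of generated SMILES strings.

    `canonicalize` is a hypothetical hook: it returns a canonical SMILES
    for a valid molecule and None for an invalid one.
    """
    generated = list(smiles)
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    # Validity: valid molecules out of all generated molecules.
    validity = len(valid) / len(generated) if generated else 0.0
    # Unicity: distinct canonical SMILES out of the valid molecules.
    unicity = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unicity


# Toy canonicalizer standing in for a real toolkit (assumption for the demo).
_TOY_TABLE = {"CCO": "CCO", "OCC": "CCO", "c1ccccc1": "c1ccccc1"}


def toy_canonicalize(s: str) -> Optional[str]:
    return _TOY_TABLE.get(s)


batch = ["CCO", "OCC", "c1ccccc1", "C1CC"]  # last entry is treated as invalid
validity, unicity = validity_and_unicity(batch, toy_canonicalize)
print(validity, unicity)
```

Here "CCO" and "OCC" collapse to the same canonical string, so three valid molecules yield two distinct ones.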
Physico-chemical properties
All physico-chemical properties were computed using RDKit. Properties were computed on the valid molecules out of the 10000 generated for each scaffold, and then grouped together. The overall distributions were plotted against the distributions of both the training and the validation set, in order to check that there was no striking dissimilarity between them.
Predicting DRD2 activity
One of the main goals of the DRD2 experiment was to benchmark our method against the Reinvent Scaffold Decorator. Thus, it seems natural to use the same QSAR model. As we were not able to find this QSAR model within the codebase reproducing the experiments of the article, we used a QSAR model from another work by the same group on the DRD2 dataset, and we assumed that the QSAR model used in the two works was the same. In our codebase, the model used for DRD2 activity prediction can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/data/clf.pkl
Predicting MMP-12 activity
To predict activity on the MMP-12 target, the dataset was split into a training and a test set. Then, a random forest regression algorithm (implemented with Scikit-learn) was fitted on the training set with continuous targets (corresponding to the experimental pIC50), and evaluated on the test set. The evaluation yielded a coefficient of determination r² = 0.84. The QSAR model is accessible at https://github.com/maxime-langevin/scaffold-constrained-generation/data/MMP12/final_activity_model.pkl, and the evaluation on the test set at https://github.com/maxime-langevin/scaffold-constrained-generation/MMP12_experiments.ipynb.
Hill climbing procedure
To optimize molecules in goal-oriented benchmarks, a hill-climbing procedure was used, as it was shown to be overall the best method amongst different generative models. The algorithm can be summarized as a repetition of the following steps:
• Generate 500 molecules
• Score them and keep the top 50 unique molecules
• Perform 10 rounds of log-likelihood maximization with the 50 best molecules
Those steps are repeated 10 times in a row. The code for performing hill-climbing can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/hill_climbing.py.
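The loop described above can be sketched generically as follows. This is an illustrative sketch under stated assumptions, not the contents of `hill_climbing.py`: `hill_climb`, `ToyModel` and the scoring lambda are hypothetical, and the toy model (which samples integers around a learnable centre) only stands in for the RNN and its log-likelihood maximization step.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def hill_climb(
    sample: Callable[[int], List[T]],
    score: Callable[[T], float],
    fit: Callable[[Sequence[T]], None],
    n_iterations: int = 10,
    n_samples: int = 500,
    n_keep: int = 50,
    n_fit_rounds: int = 10,
) -> List[T]:
    """Repeat: sample, keep the best unique candidates, fit the model on them."""
    best: List[T] = []
    for _ in range(n_iterations):
        candidates = sample(n_samples)
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        best = sorted(unique, key=score, reverse=True)[:n_keep]
        for _ in range(n_fit_rounds):
            fit(best)  # stands in for one round of log-likelihood maximization
    return best


class ToyModel:
    """Hypothetical stand-in for the RNN: samples integers around a centre."""

    def __init__(self, seed: int = 0) -> None:
        self.centre = 0.0
        self.rng = random.Random(seed)

    def sample(self, n: int) -> List[int]:
        return [round(self.rng.gauss(self.centre, 20.0)) for _ in range(n)]

    def fit(self, batch: Sequence[int]) -> None:
        # Move the centre halfway toward the mean of the selected candidates.
        target = sum(batch) / len(batch)
        self.centre += 0.5 * (target - self.centre)


model = ToyModel(seed=0)
best = hill_climb(model.sample, lambda x: -abs(x - 100), model.fit)
print(round(model.centre), len(best))
```

With the default arguments the loop mirrors the figures quoted above (500 samples, top 50 unique, 10 fitting rounds, 10 repetitions). In the scaffold-constrained setting, `sample` would be the constrained sampling routine and `score` the project scoring function.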