Scaffold-constrained molecular generation

Abstract

One of the major applications of generative models for drug discovery targets the lead-optimization phase. During the optimization of a lead series, it is common to have scaffold constraints imposed on the structure of the molecules designed. Without enforcing such constraints, the probability of generating molecules with the required scaffold is extremely low and hinders the practicality of generative models for de-novo drug design. To tackle this issue, we introduce a new algorithm to perform scaffold-constrained in-silico molecular design. We build on the well-known SMILES-based Recurrent Neural Network (RNN) generative model, with a modified sampling procedure to achieve scaffold-constrained generation. We directly benefit from the associated reinforcement learning methods, which allows us to design molecules optimized for different properties while exploring only the relevant chemical space. We showcase the method's ability to perform scaffold-constrained generation on various tasks: designing novel molecules around scaffolds extracted from SureChEMBL chemical series, generating novel active molecules on the Dopamine Receptor D2 (DRD2) target, and, finally, designing predicted actives on the MMP-12 series, an industrial lead-optimization project.



Maxime Langevin,†,‡ Hervé Minoux,‡ Maximilien Levesque,∗,¶,§ and Marc Bianciotto∗,‡

† PASTEUR, Département de chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
‡ Molecular Design Sciences - Integrated Drug Discovery, Sanofi R&D, Vitry-sur-Seine, France
¶ PASTEUR, Département de chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
§ Aqemia, Paris, France

E-mail: [email protected]; marc.bianciotto@sanofi.com

Introduction

Finding new drugs is a long, costly and difficult problem, with potential failure all along the drug discovery pipeline and an overall success rate close to only 4%. Lead-optimization, where medicinal chemists refine bioactive molecules into potential drugs, takes up a major part of the time and cost in the discovery pipeline. Finding a good drug candidate requires finding a molecule that is active on the target of interest while satisfying multiple criteria mainly related to safety and pharmacokinetics. In this respect, lead-optimization can be viewed as a multi-objective optimization problem in chemical space.

There has been a recent surge of interest in generative models for de-novo drug design and their application in the drug discovery pipeline. Generative models have been studied for two types of tasks: distribution learning and goal-oriented learning. Distribution learning aims at reproducing the distribution of a known dataset, in order to sample a large library of molecules similar to the initial dataset used to train the generative model. Goal-oriented learning, on the other hand, takes as input a scoring function and aims at finding the molecules with the highest scores. Applying generative models to lead-optimization can actually be understood as a special case of goal-oriented learning, where the scoring function reflects the adequacy of the molecule to the different project objectives. Distribution learning benchmarks are also frequently used to assess whether a model has learnt to generate drug-like molecules, and whether it will be a good starting point for goal-oriented learning.

However, real-life lead-optimization projects very often impose constraints on the scaffold of designed molecules.
Interesting scaffolds are identified during the lead identification phase of the pipeline, and often kept throughout the rest of the pipeline, with only minor changes. The scaffold, or “core” of the molecule, is key to (i) preserving the biological activity identified earlier in the pipeline, (ii) staying within areas of chemical space where the prior information gathered is relevant, making the SAR (structure-activity relationship) understandable, (iii) maintaining a high throughput for compound synthesis by using common precursors, speeding up the Design-Make-Test-Analysis (DMTA) cycle to a level acceptable by industry standards, and (iv) exploring a relevant chemical space, translatable into a Markush structure, from an Intellectual Property perspective.

Therefore, lead-optimization is, most of the time, a multi-objective optimization problem under scaffold constraints. Yet goal-oriented learning with scaffold constraints has not been extensively studied, despite being of major interest for the application of generative models to medicinal chemistry. As shown in our work, simply including the presence of the scaffold of interest in the scoring function as a supplementary criterion and applying existing goal-oriented models gives poor results. Hence, there is a clear need for algorithms that include those constraints directly in their generative process.

Before the rise of interest in generative models for chemistry, traditional approaches were already studied by computational chemists for de-novo design of optimized molecules. Those approaches relied either on the combination of fragments or on the exploration of chemical space by performing virtual reactions with handcrafted rules. Genetic algorithms, which have shown success in many optimization tasks, were also adapted to chemistry and used to design molecules of interest.

Following Gomez-Bombarelli et al., who applied recent generative models to generate new molecules, there has been growing interest in de-novo design with generative models. Indeed, relying on a data-driven method that learns the underlying distribution of molecules helps to remain in a chemical space of drug-like molecules.

Among the various molecular generative models published in recent years, many rely on RNN, and especially on the Long Short-Term Memory (LSTM) architecture. RNN are frequently used as generative models for sequential data, and in particular for SMILES. Those models differ in the way they learn the probability distribution of the training set. Variational Auto-Encoders (VAE) learn this distribution through an encoder-decoder loss, where the generator is used as a decoder and another RNN is used as the encoder. Generative Adversarial Networks (GAN) learn the distribution through a competition between the generator and a discriminator, and character-level generators through simple negative log-likelihood minimization. While the methods used to learn the distribution differ, in the end they all rely on the same architecture to generate new molecules, namely a SMILES-based RNN. To design optimized molecules with an RNN, the easiest way is to generate a large virtual library of molecules and then screen it against the objectives, but the likelihood of finding optimized molecules is extremely small. Reinforcement Learning (RL) was proposed as a way to design optimized molecules. Either through policy gradient or hill-climbing, which both aim at maximizing the probability of the highest-scoring SMILES, the RNN can be fine-tuned to generate high-scoring molecules. Hill-climbing was shown to achieve state-of-the-art results on various goal-oriented benchmarks.
Optimizing directly on the molecular graph, controlled either by a neural network or through Monte-Carlo Tree Search, can yield interesting results, but also leads to compounds of lesser quality, which hinders the adoption of those approaches in drug discovery projects. Methods based on Graph Neural Networks, which work directly on the molecular graph rather than on SMILES, have also been studied recently. Those architectures are more experimental and have not yet been shown to perform as well as SMILES-based methods.

Scaffold-constrained generation has been investigated in only a few previous works. Fragment growing with RNN has been described, but is very limited in scope and cannot accommodate scaffold constraints beyond simply imposing the presence of a fragment. DeepScaffold relies on Graph Neural Networks to build molecules on an existing scaffold. The architecture used is complex and experimental, relying on graph convolutional neural networks. The absence of built-in reinforcement learning also hinders the adoption of this method to design optimized molecules. The Reinvent Scaffold Decorator is based on SMILES and RNN and uses an encoder-decoder architecture. It completes scaffolds in two different ways, single-step and multi-step. Single-step completes all open positions at once, while multi-step completes each open position iteratively. The architecture used is also complex, as it relies on an encoder-decoder scheme, and is limited to open positions at branching points. The model also requires dataset-specific preprocessing and training for each application. Moreover, no RL procedure is directly applicable, which further limits the scope of this method.

Prior work on Markov models, and especially on RNN, showed that, for a given language, constraints that can be translated sequentially can be enforced during sampling. Thus, the main idea behind our model is that scaffold constraints can actually be translated to the SMILES language.
Acknowledging this fact, we investigate how we can leverage this translation from structure to SMILES to perform scaffold-constrained generation. We show that scaffold-constrained generation can be achieved with a SMILES-based RNN simply by using a modified sampling procedure. This means that our method tackles scaffold-constrained generation without the need for a new model, or even for retraining an existing one. Moreover, all previous experience acquired by computational chemists with SMILES-based RNN carries over to our procedure. The availability of state-of-the-art reinforcement learning methods for goal-oriented learning ensures that our procedure is easily applicable to lead-optimization. By performing scaffold-constrained generation, we can include the project’s structural constraints directly in our generative process. This allows us to search directly within the very sparse subspace of chemical space that is relevant to the lead-optimization project. Furthermore, our framework allows low-level control over scaffold-constrained molecular generation, including the possibility to limit oneself to linear fragments, to control the presence of branches and cycles or the type of atoms authorized, as well as the ability to perform scaffold-hopping.

We begin by giving background on molecule generation with SMILES-based RNN. Then, we show how to adapt the sampling procedure in order to perform scaffold-constrained generation with an RNN. Finally, we provide experimental results on different de-novo design tasks, and show the usefulness of our method in the context of scaffold-constrained generation.

Methods

The SMILES language represents molecules as a chain of characters. Each character corresponds to an atom or denotes structural information, such as the opening or closing of cycles and branches, stereochemistry or multiple bonds.

Starting from a molecular graph, cycles are broken down and marked with numbers, and a starting atom is chosen. The SMILES string is obtained by printing the symbol of the nodes encountered in a depth-first traversal starting at the chosen atom, with parentheses indicating branching points in the graph. For a given molecular graph, there are at least as many SMILES as possible starting atoms. A canonicalization algorithm can be used to pick the starting atom, thus yielding the canonical SMILES of the molecule. The corresponding molecular graph can be easily retrieved from a given SMILES.

RNN can generate SMILES in a sequential fashion by modeling the conditional probability distribution over SMILES tokens (conditioned on the beginning of a SMILES). The SMILES language is enriched with a GO and an EOS token that denote the beginning and the end of a SMILES string. Let $s = x_0, \ldots, x_T$ be the tokenized version of a SMILES string, with $x_i$ characters from the SMILES language (here $x_0$ and $x_T$ denote respectively the GO and EOS tokens). The RNN models $P(x_t \mid x_0, \ldots, x_{t-1})$, i.e. the conditional probability of a token given all the previous ones. The RNN is trained on a database of drug-like molecules (such as ChEMBL) to predict the next token of a SMILES given the beginning of the sequence, and training is achieved through minimization of the negative log-likelihood of the training set SMILES. The objective is to minimize (w.r.t. the RNN parameters, denoted by $\theta$) the following loss:

$$L(\theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_0, \ldots, x_{t-1}) \tag{1}$$

The RNN relies on its internal state $h$ to process the information from the previous tokens, and models the conditional probability $P(x_t \mid x_0, \ldots, x_{t-1})$ as $P(x_t \mid h_{t-1})$.
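As a worked illustration of Eq. (1), the following minimal Python sketch computes the negative log-likelihood of one tokenized SMILES. The bigram table `BIGRAM` and the helper names `sequence_nll` and `toy_cond_prob` are assumptions made for this example only; a real model would obtain each conditional probability from the RNN hidden state rather than from a lookup table.

```python
import math

def sequence_nll(tokens, cond_prob):
    """Negative log-likelihood of one tokenized SMILES, following Eq. (1).

    tokens    : list of tokens, starting with "GO" and ending with "EOS"
    cond_prob : callable(prefix, token) -> P(token | prefix); it stands in
                for the RNN, which would summarize the prefix in its state h
    """
    nll = 0.0
    for t in range(1, len(tokens)):
        nll -= math.log(cond_prob(tokens[:t], tokens[t]))
    return nll

# Toy stand-in for the trained RNN (an assumption): a bigram table.
BIGRAM = {
    "GO": {"C": 0.9, "O": 0.1},
    "C": {"C": 0.5, "O": 0.3, "EOS": 0.2},
    "O": {"C": 0.4, "EOS": 0.6},
}

def toy_cond_prob(prefix, token):
    """P(token | prefix) that only looks at the last token of the prefix."""
    return BIGRAM[prefix[-1]].get(token, 0.0)

ethanol = ["GO", "C", "C", "O", "EOS"]      # SMILES "CCO"
nll = sequence_nll(ethanol, toy_cond_prob)  # -log(0.9 * 0.5 * 0.3 * 0.6)
```

Summing this quantity over all training SMILES gives the training objective; the sketch leaves out batching and gradient computation.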
Data augmentation by enriching the dataset with non-canonical SMILES has shown benefits in different tasks; we therefore adopt it in our methodology when training RNN for the following experiments.

Once the conditional probability distribution $P(x_t \mid x_0, \ldots, x_{t-1})$ is learnt, sampling is achieved by initializing the sequence with a GO token, and then sampling tokens sequentially. This procedure ends when an EOS token is sampled, and returns a SMILES with its corresponding likelihood, which is amenable to backpropagation.

Algorithm 1: Generating new SMILES samples

Result: SMILES string
initialize $h_0$; $x_0$ = GO; $t = 1$;
while $x_{t-1} \neq$ EOS do
    Sample $x_t$ from $P(x_t \mid x_0, \ldots, x_{t-1})$ and update $h_{t-1}$ to $h_t$;
    $t = t + 1$;
end
Output: $x_0, \ldots, x_{t-1}$

This framework can be adapted to scaffold-constrained generation. Indeed, a scaffold can be viewed as an incomplete SMILES. This incomplete SMILES can have different positions with open possibilities, denoted by the special token “*”. An example of how scaffold constraints can be translated into SMILES, in the case of an open branch, is given in Table 1.

Table 1: Linking SMILES syntax with molecular structure: open branch

Original structure: CC(C)(C(=O)O)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(c3ccccc3)c4ccccc4
Structure with open position: CC(C)(C(=O)O)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(*)c4ccccc4

When the open position is within a linker, an example is given in Table 2.

Table 2: Linking SMILES syntax with molecular structure: open linker

Original structure: CC(C)(C(=O)O)c1ccc(cc1)C(O)CCCN2CCC(CC2)C(O)(c3ccccc3)c4ccccc4
Structure with open position: CC(C)(C(=O)O)c1ccc(cc1)C(O)C*N2CCC(CC2)C(O)(c3ccccc3)c4ccccc4

In the classic procedure, the RNN freely samples tokens until the end of the SMILES is reached. For scaffold-constrained generation, the RNN is constrained to follow the SMILES scaffold, and sampling is enabled only when an open position “*” of the SMILES is reached.

The major subtlety lies in the fact that the sampling procedure depends upon the configuration of the molecular graph around the open position, especially with regard to the decision to stop sampling and resume reading. In this work, we tackle three different kinds of open positions: first, open positions at branching points of the molecular graph (commonly referred to as R-groups); then, open positions in linkers (which link different cycles of the molecule); and finally, constrained choices, where the position is open but the number of possibilities is already limited within the drug discovery project. Each kind of open position requires an adequate sampling procedure. The general algorithm for scaffold-constrained sampling is given in Algorithm 2.
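The reading-and-sampling loop described above, formalized in Algorithm 2, can be sketched in Python for the simplest case of an open branch written “(*)”. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_next_token_probs` is a hypothetical uniform stand-in for the trained RNN, tokens are single characters, and the three guards inside `decorate_branch` (no EOS inside a branch, no empty “()” group, a length cap) are additions needed only because the toy model has learnt nothing about SMILES.

```python
import random

ATOMS = ["C", "N", "O"]
VOCAB = ATOMS + ["(", ")", "EOS"]

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for the RNN: uniform over VOCAB."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def sample_token(probs, forbidden, rng):
    """Drop forbidden tokens, then sample proportionally to what remains."""
    tokens = [tok for tok, p in probs.items() if tok not in forbidden and p > 0]
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def decorate_branch(prefix, next_token_probs, rng, max_len=12):
    """Sample a branch until the scaffold's opening parenthesis is matched."""
    opened, closed = 1, 0            # the scaffold already opened one "("
    branch = []
    while opened > closed:
        last = branch[-1] if branch else "("
        forbidden = {"EOS"}          # assumption: never end the SMILES mid-branch
        if last == "(":
            forbidden.add(")")       # assumption: forbid empty "()" groups
        if len(branch) >= max_len:   # assumption: length cap for the toy model
            forbidden.add("(")
            if last != "(":
                forbidden.update(ATOMS)  # only ")" is left: close the branch
        tok = sample_token(next_token_probs(prefix + branch), forbidden, rng)
        if tok == "(":
            opened += 1
        elif tok == ")":
            closed += 1
        branch.append(tok)
    return branch                    # ends with the matching ")"

def sample_with_scaffold(scaffold, next_token_probs, rng):
    """Read scaffold tokens; sample only at open positions written "(*)"."""
    out = ["GO"]
    i = 0
    while i < len(scaffold):
        if scaffold[i] != "*":
            out.append(scaffold[i])  # reading step (an RNN would update h here)
            i += 1
        else:
            out.extend(decorate_branch(out, next_token_probs, rng))
            i += 2                   # skip "*" and the ")" that was just sampled
    out.append("EOS")
    return out

rng = random.Random(0)
scaffold = list("c1ccccc1C(*)O")     # character-level tokens, one open branch
tokens = sample_with_scaffold(scaffold, toy_next_token_probs, rng)
smiles = "".join(tokens[1:-1])
```

With a uniform toy model the output is only guaranteed to have balanced parentheses around the fixed scaffold; chemical validity would come from a trained model. The masking step in `sample_token` is also one way to restrict sampling to a limited set of tokens.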

Algorithm 2: Generating new SMILES samples with scaffold constraints

Result: SMILES string with scaffold $s$
Input: scaffold $s = s_1, \ldots, s_n$
initialize $h_0$; $x_0$ = GO; $t = 1$;
for $i \leftarrow 1$ to $n$ do
    if $s_i$ is not “*” then
        Read $s_i$ and update $h_{t-1}$ to $h_t$;
        $x_t = s_i$;
        $t = t + 1$;
    else
        Sample $y = (y_1, \ldots, y_k)$ and update $h$ according to the special sampling procedure for scaffold decoration;
        $x_t, \ldots, x_{t+k-1} = y_1, \ldots, y_k$;
        $t = t + k$;
    end
end
$x_t$ = EOS;
Output: $x_0, \ldots, x_t$

Decorating a fixed scaffold is one of the predominant techniques used by medicinal chemists in lead optimization, and being able to perform this task with a generative model is of major practical importance for the application of such models to drug discovery. For branched decorations (i.e. R-groups), the SMILES translation is straightforward: opening and closing parentheses denote the beginning and end of the branch, and therefore an open branched decoration translates to “(*)” in the SMILES language. In this case, the sampling procedure is easily designed: the RNN is free to sample any tokens while the branch is open. When a closing parenthesis that matches the opening parenthesis of the branch is sampled, the RNN has finished sampling the branched decoration and can resume reading the rest of the scaffold. This implies that it is necessary to keep track of the opening and closing parentheses sampled within the decoration. Another technical issue that arises from the SMILES language is that sampled cycle identifiers (1, 2, 3, ...) should be different from the ones used in the scaffold, to ensure that the given scaffold is respected in corner cases where a cycle opened before an open position is closed after said open position.

Algorithm 3: Decorating a scaffold on a given branching position

Result: SMILES string with completed decoration
Input: hidden state $h$
$h_0 = h$; opened = 1; closed = 0; $t = 1$;
while opened > closed do
    Sample $x_t$ from $P(x \mid h_{t-1})$ and update $h_{t-1}$ to $h_t$;
    if $x_t$ = ’(’ then
        opened += 1;
    else if $x_t$ = ’)’ then
        closed += 1;
    $t = t + 1$;
end

Additional refinements are possible, such as specifying the beginning of the branch (e.g. “(CN*)” instead of “(*)”), or restraining the branch to be a linear fragment (by forbidding the opening of new branches and cycles).

Figure 1: Sampling a new branch

While branched decorations represent the majority of modifications allowed on a scaffold during lead-optimization phases, it is sometimes interesting to allow more profound changes to the scaffold by performing scaffold hopping. To tackle this particular task, we investigate the possibility of having an open position within a linker between cycles. This problem presents different difficulties. First, contrary to branched decorations, there is no clear indicator for stopping sampling and resuming reading the pattern. Thus, the stopping criterion will necessarily be arbitrary. We implement it in the form of a user-defined probability distribution on the length of the added fragment. Furthermore, as the end of sampling is arbitrarily decided rather than chosen by the RNN, we need to keep track of opening and closing parentheses as well as cycles to ensure that branches and cycles are completed within the added fragment before sampling stops. The stopping criterion for sampling is therefore the combination of a specified probability distribution and the requirement that the sampled fragment contains no uncompleted cycles or branches. We do not look into the task of modifying existing cycles, which is more complex.

The ability to modify the core of a molecule allows our method to tackle more than simply scaffold decoration and thus to perform scaffold-hopping as well (see Figure S1).

Algorithm 4: Scaffold hopping by linker completion

Input: hidden state $h$, distribution on linker size $P_{size}$
Result: SMILES string with completed linker
$h_0 = h$; sample $n_{char}$ from $P_{size}$; opened = 0; closed = 0; step = 0; cycle = False;
while cycle or step < $n_{char}$ or opened > closed do
    Sample $x_t$ from $P(x \mid h_{t-1})$ and update $h_{t-1}$ to $h_t$;
    if $x_t$ = ’(’ then
        opened = opened + 1;
    else if $x_t$ = ’)’ then
        closed = closed + 1;
    else if $x_t \in \{1, 2, \ldots, 9\}$ then
        if corresponding cycle not opened then
            cycle = True; keep track of opened cycle;
        else
            close corresponding cycle;
            if no cycle still opened then
                cycle = False;
    step = step + 1;
end

The SMARTS language, an extension of SMILES, deals with discrete choices with the following syntax: $[x_1, x_2, \ldots, x_k]$ represents a discrete choice between SMILES characters $x_1, x_2, \ldots, x_k$. We use this syntax for sampling between a finite set of discrete choices. The sampling procedure is straightforward: restrict the possible tokens to those present in the discrete choices, renormalize the probability distribution and then sample. Instead of drawing the next token from $P(x \mid h_t)$, we sample it from:

$$Q(x) = \mathrm{softmax}\Big[ P(x \mid h_t) \cdot \sum_{c \in \mathrm{choices}} \delta_{x,c} \Big] \tag{2}$$

By design, our method can deal with multiple open positions of different nature within the same scaffold. This is important as scaffold-constrained lead optimization often deals with multiple open positions at once.

Experiments

The objective of the following experiments is to verify the ability of our method to perform the major tasks that can be encountered in the context of scaffold-constrained molecular generation. We first check the ability of our method to perform scaffold-constrained generation on previously unseen scaffolds. We then assess the capacity of our method to generate, if needed, analogs to a given chemical series, in a focused learning task. Finally, we measure its performance in the optimization of molecules with given constraints by benchmarking it in different scaffold-constrained goal-oriented scenarios.

We rely on different sets of molecules for training and validation of the method. Following Olivecrona et al., the training molecules are extracted from ChEMBL. For validation, we use SureChEMBL, a database of patented molecules. Clustering molecules by their Bemis-Murcko scaffold, we extract 18 chemical series with 18 different scaffolds (see Figure S2). Those scaffolds are chosen for validation as they
• are sampled from real-life drug discovery projects
• were not present in the training set (we remove any molecule that has one of the 18 scaffolds from the training set)

To study the ability to explore a focused region of chemical space and design analogs to a given series, we also isolate the largest (93 molecules) among the extracted chemical series. This yields 17 scaffolds for the validation set, and one scaffold for the focused learning validation set. Below, we refer to the molecules extracted from ChEMBL as the training set, the chemical series from SureChEMBL as the validation set, and the largest chemical series, isolated above, as the focused learning validation set.

To implement our model, we build on the existing codebase released by Olivecrona et al., which already includes a SMILES-based RNN, and with which many researchers are already familiar.
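The scaffold-based split described above can be sketched as follows. This is a hedged illustration rather than the pipeline actually used: the function names and the `scaffold_fn` parameter are assumptions, and the default scaffold function relies on RDKit's `MurckoScaffold.MurckoScaffoldSmiles`, imported lazily so that the grouping logic can be exercised with any scaffold function.

```python
from collections import defaultdict

def murcko_scaffold_rdkit(smiles):
    """Bemis-Murcko scaffold as a SMILES string, computed with RDKit."""
    from rdkit.Chem.Scaffolds import MurckoScaffold  # lazy import
    return MurckoScaffold.MurckoScaffoldSmiles(smiles)

def group_by_scaffold(smiles_list, scaffold_fn=murcko_scaffold_rdkit):
    """Cluster molecules into chemical series sharing the same scaffold."""
    series = defaultdict(list)
    for smi in smiles_list:
        series[scaffold_fn(smi)].append(smi)
    return dict(series)

def largest_series(series):
    """Scaffold of the largest series (candidate focused-learning set)."""
    return max(series, key=lambda scaffold: len(series[scaffold]))

def remove_held_out(training_smiles, held_out_scaffolds,
                    scaffold_fn=murcko_scaffold_rdkit):
    """Drop every training molecule whose scaffold is held out for validation."""
    held_out = set(held_out_scaffolds)
    return [smi for smi in training_smiles if scaffold_fn(smi) not in held_out]
```

The held-out scaffolds should be produced by the same `scaffold_fn` as the one used on the training set, so that the string comparison in `remove_held_out` is meaningful.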
The RNN used in the subsequent tasks are trained either on the training set (hereafter named “Generic RNN”) or on the focused learning validation set chemical series (named “Focused RNN”).

The rationale for using the Generic RNN in most applications is that the training set is comprised of diverse drug-like compounds, and that an RNN trained on it should be able to explore a large and varied chemical space. For the focused learning task, the goal is to assess the ability to generate close analogs to a given chemical series, and therefore the Focused RNN trained specifically on this chemical series is used.

For the task of generating molecules around an unseen scaffold, we use the Generic RNN to design molecules conditioned on the different scaffolds in the validation set. The major metrics used in benchmarking suites for in-silico molecular generation are evaluated. The main goal is to ensure the ability to generate valid and unique SMILES, as well as to assess whether physico-chemical properties are similar to those of drug-like compounds. For the focused learning task, we compare the distributions of molecules generated by the Generic RNN and the Focused RNN with the training set, the validation set and the focused learning validation set. As for goal-oriented benchmarks, we begin with the DRD2 target to provide a fair comparison with prior work (refs. 21 and 22) on scaffold-constrained generation, and investigate the ability of our method to generate predicted DRD2 actives on the different validation scaffolds studied there. To generate optimized molecules, we rely on a state-of-the-art Reinforcement Learning procedure, hill-climbing. We then benchmark our method on the MMP-12 series, which is a large publicly available industrial lead-optimization dataset. Scaffold constraints are present in the dataset and explicitly mentioned in the original work that released the data, which supports our statement that they are common within drug-discovery lead-optimization campaigns. After building a QSAR model on pIC50 values, we compare our model against a SMILES-based LSTM with hill-climbing, a state-of-the-art in-silico de-novo design algorithm, in the task of generating predicted actives with the required scaffold.

Generating molecules with new scaffolds

To verify the ability of our method to generate novel molecules around unseen scaffolds, we perform classic distribution learning benchmarks on sets of molecules designed around the scaffolds from the validation set. For each of the 17 scaffolds, we generate 10 000 SMILES. We first compute the proportion of (i) valid and (ii) unique SMILES. Validity is defined as whether the SMILES defines a valid molecular structure, and is checked with the RDKit. Note that validity ensures that the SMILES is syntactically valid (all rings and branches closed, no illegal atom types), but does not mean that the molecule will necessarily be synthesizable in a wet lab. We also compute several physico-chemical properties for each valid molecule generated: calculated logarithm of the partition coefficient (logP), Molecular Weight (MW), Synthetic Accessibility Score (SAS), Quantitative Estimate of Drug-Likeness (QED), and numbers of hydrogen-bond donors (HBD) and acceptors (HBA). All properties were computed with the RDKit. We group the molecules generated for each of the different scaffolds together and compare the distributions of those properties between the generated molecules, the training set and the validation set.

The proportions of unique and valid molecules for each scaffold in the validation set (calculated over 10 000 SMILES per scaffold) are given in Figure 3, where the scaffolds are ordered by increasing number of open positions. These proportions are on par with the best scores obtained by various generative models. The uniqueness proportion increases with the number of open positions, which matches the intuition that the more possibilities there are to modify a scaffold, the more diverse the generated SMILES will be.
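The two proportions discussed above can be computed with a few lines of Python. The sketch below rests on assumptions that the text does not spell out: uniqueness is measured on canonical SMILES and relative to the valid molecules only, and the helper names are hypothetical. The default `rdkit_canonical` uses RDKit's `Chem.MolFromSmiles`, which returns `None` when a string cannot be parsed; it is imported lazily so that another checker can be plugged in.

```python
def rdkit_canonical(smiles):
    """Canonical SMILES, or None if RDKit cannot parse the string."""
    from rdkit import Chem  # lazy import
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)

def validity_and_uniqueness(generated, canonicalize=rdkit_canonical):
    """Return (proportion of valid SMILES, proportion of unique valid ones).

    Assumption: uniqueness is computed among valid molecules, after
    canonicalization, so that two spellings of one molecule count once.
    """
    canonical = [canonicalize(smi) for smi in generated]
    valid = [smi for smi in canonical if smi is not None]
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness
```

If uniqueness is instead reported relative to all generated SMILES, only the denominator of the last ratio changes.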

Figure 3: Proportions of valid and unique molecules out of 10 000 generated for each validation scaffold (x-axis: scaffold index; y-axis: validity and uniqueness)

Figure 4: Validity and uniqueness proportions (min, lower quartile, median, upper quartile and max) across the 17 validation scaffolds (panels: proportion of valid molecules; proportion of unique molecules)

We then compare the distributions of various properties as a sanity check, to ensure that the generated molecules are similar to the drug-like molecules of the training and validation sets.

Figure 5: Histograms of properties across generated molecules (Generic RNN), training set and validation set. Panels: molecular weight (Da), calculated logP, hydrogen-bond acceptors, hydrogen-bond donors, SAS and QED; y-axis: proportion. For molecular weight and ClogP, values outside the [250, 750] g.mol⁻¹ and [-1, 6] ranges are not shown individually, and the bars at the extremities accumulate the values outside those ranges.

We note that the properties are distributed similarly for the different sets of molecules, suggesting that the generated molecules populate a similar property space as the training and validation sets. An interesting fact is that the generated molecules have lower QED scores; this is rather intuitive, as molecules from the training and validation sets are actual molecules, which implies the bias that they were necessarily synthesizable and considered interesting enough in a drug discovery context for their synthesis to actually be performed.

Focused learning on a chemical series

In the context of lead-optimization, we might like to narrow the search for optimized molecules to a focused chemical space. Therefore, we investigate whether we can generate close analogs to an existing chemical series. We use the Focused RNN, trained specifically on the focused learning validation set (comprised of a single chemical series), to generate molecules with the scaffold of the focused learning validation set. For comparison, we also generate molecules with the same scaffold using the Generic RNN.

We then perform dimensionality reduction to 2D with PCA on ECFP4 fingerprints. The generated molecules are plotted in Figure 6 against the focused learning validation set and the training set.

Figure 6: PCA of generated molecules with Focused and Generic RNN compared with training set and focused learning validation set

Molecules generated with the Generic RNN overlap almost completely with the training set, while molecules generated with the Focused RNN overlap with the focused learning validation series. This indicates that using a Focused RNN makes it possible to sample a chemical space close to a chemical series of interest.

It should be noted that, while dimensionality reduction such as PCA on fingerprints is commonly used to compare distributions of molecules, the question of whether this is the most relevant approach (especially as fingerprints are high-dimensional and binary) is still open and should be kept in mind, as should the fact that we are comparing molecules with a shared substructure.

Generating DRD2 actives

Benchmarking our model on goal-oriented tasks is the main focus of our experiments. A benchmark for goal-oriented scaffold-constrained generation was proposed by Arús-Pous et al. We therefore start by tackling the same task, which is generating predicted actives on the DRD2 target with specific scaffolds. We use the same 5 validation scaffolds. First, we assess the uniqueness and validity proportions of generated molecules to ensure that our method generalizes well to those scaffolds. Then, for each scaffold, we run a reinforcement learning procedure, hill-climbing, with the objective of generating predicted actives. Activity prediction is done with a QSAR model taken from prior work. After ten epochs dedicated to learning to generate predicted actives, the best 50 molecules are kept.

Table 3: Metrics on distribution learning and goal-oriented learning for DRD2 validation scaffolds

Scaffold | Valid molecules | Unique molecules | Predicted active molecules out of 50 best
 | 86% | 84% |
 | 82% | 92% |
 | 86% | 96% |
 | 92% | 90% |

 | 98% | 32% |

A direct comparison with the results of Arús-Pous et al. is difficult, as no insight was given into the total number of molecules generated for each scaffold. As mentioned previously, we chose to assess the 50 best molecules generated. Nonetheless, we find that our model is able to generate a much higher proportion of predicted actives for each scaffold. This is achieved even without access to known DRD2 actives with different scaffolds, unlike in Arús-Pous et al., where training with actives is required. Furthermore, relying on a much simpler architecture translates into a much higher throughput for generated molecules (roughly 100 times faster in CPU time). Contrary to the Reinvent Scaffold Decorator, we also do not require a special preprocessing algorithm or specific pretraining. On the other hand, we require a reinforcement learning procedure to be run, though it comes at a rather low computational cost (provided that the property of interest can be computed efficiently). This allows us to be much more efficient in finding molecules optimizing a given objective.

Table 4: Speed of generation comparison

Method | CPU time for generating 1000 molecules | Molecules generated per second (CPU time)
Reinvent Scaffold Decorator | 620.9 seconds | 1.6 molecules/second
Scaffold-constrained generator | 7.02 seconds | 143 molecules/second

A comparison with another published method is not provided, as we failed to make use of the code released to replicate the findings of that work.

Lead-optimization use case: the MMP-12 series

Experiments on the DRD2 target show that our method can design predicted actives on a biological target while satisfying scaffold constraints. To provide further comparison, we designed a novel benchmark on the MMP-12 series, a publicly available industrial lead-optimization dataset, and assessed how scaffold-constrained generation fares against a state-of-the-art generative model.
As the primary target of goal-oriented models is lead optimization, it makes sense to design a benchmark that resembles the industrial problem we would like to solve. We compare our method with a classic SMILES-based RNN trained with hill-climbing reinforcement learning. For each method, 10 runs of reinforcement learning are launched, with the exact same optimization procedure and number of steps to ensure a fair comparison. For each run, the top 50 molecules are kept. For each molecule, given its predicted pIC50, the score is computed as:

\[
\text{score} =
\begin{cases}
\max\bigl(1,\ -(7.\ldots - \mathrm{pIC}_{50}) / \ldots\bigr), & \text{if scaffold constraints are met} \\
0, & \text{otherwise}
\end{cases}
\]

Molecules whose predicted pIC50 exceeds the activity threshold are considered active. Generated molecules then fall into four categories: actives with the required substructure (therefore matching 2/2 of the project's requirements), actives without the substructure (1/2 requirements), molecules matching the substructure only (1/2 requirements as well), and molecules meeting neither criterion (0/2 requirements met).

With our method, 23% of molecules satisfy 2/2 requirements for the project, while no satisfying molecules are found with the classic SMILES-based RNN. This discrepancy is probably due to the fact that the region of chemical space where the structure constraint is met is very sparse, which hinders the performance of reinforcement learning. As a result, the classic RNN struggles to generate molecules within this region while optimizing activity. In contrast, our method generates by design only molecules that meet the scaffold requirement, allowing the optimization procedure to focus only on the true objective. Among the molecules discovered by our method are some experimentally validated actives that were part of the held-out validation set, yielding yet another confirmation that inverse QSAR powered by generative models can discover experimentally validated molecules.

Table 5: Generating actives on the MMP-12 series

Method | Right scaffold, active (2/2) | Right scaffold, not active (1/2) | Active, without scaffold (1/2) | Not active, without scaffold (0/2)
Classic RNN | 0% | 56% | 42% | 2%
Scaffold-constrained generator | 23% | 77% | 0% | 0%

Figure 7: Re-discovered active from the held-out validation set, with the scaffold constraint highlighted

This experiment shows that, by relieving the optimization of the structure constraint (as it is built into our model), our method can search much more efficiently for optimized molecules (w.r.t. predicted activity) within a subspace of molecules with a required scaffold. This experiment also shows that optimization under structure constraints seems to be a difficult task for generative models. For instance, on the GuacaMol benchmarking suite, optimization on multiple objectives is not problematic, and high scores are achieved on a wide range of tasks by SMILES-based RNNs. Yet in this task, with only two objectives, one being a scaffold constraint, the classic SMILES-based RNN struggles to generate optimized molecules, which gives a significant advantage to scaffold-constrained generation. As scaffold constraints are common within drug discovery projects, we think that our method could prove very useful in this context.

Conclusion

Generative models for drug discovery can be applied to lead optimization tasks, which often require scaffold constraints to be respected. Those constraints are mainly present to preserve biological activity and stay in known SAR domains for the optimized properties, as well as for synthesizability and optimization of the synthesis process. Including those constraints in the generative process of a model is a difficult problem, and one that has potential practical impact on the applications of generative models to drug discovery. In this work, we investigate how a well-known model, the SMILES-based RNN, can be slightly modified to achieve different scaffold-constrained generative tasks. This is possible thanks to a modified sampling procedure. Our approach to scaffold-constrained generation thus does not require designing a new model, or even retraining it. Furthermore, all previous work on reinforcement learning for molecular property optimization with SMILES-based RNNs remains applicable. Using distribution learning benchmarks, we show that our method can generalize across unseen scaffolds, and can also generate molecules in a focused fashion. The validation scaffolds used for this task are extracted from SureChEMBL and thus derived from real lead optimization chemical series. On scaffold-constrained goal-oriented benchmarks, our method largely outperforms state-of-the-art de-novo design algorithms. Our approach was able to generate new predicted actives on the DRD2 target without specific pretraining. Furthermore, we showed that it outperformed classic SMILES-based reinforcement learning for designing predicted actives on the MMP-12 series, proposing held-out experimentally validated actives. We also show that, by design, reinforcement learning methods are applicable in this context. This enables our method to use state-of-the-art algorithms for goal-oriented tasks, and we show strong performance on scaffold-constrained in-silico molecular optimization.
Our model also goes beyond simple scaffold decoration. It can provide low-level control over the way the scaffold is completed, and can also be used in other tasks such as scaffold hopping. Limitations of the method include the need to handcraft the scaffold constraints in SMILES format, as well as the fact that, in particular instances, specificities of the SMILES language require manual overriding of sampled cycles to ensure that scaffold constraints are respected. Overall, we believe our method is of real practical interest for scaffold-constrained optimization tasks, which make up most of the actual lead optimization challenges in drug discovery. Coupled with the fact that we rely on a well-known and already widely adopted model, we expect that it will benefit researchers looking to apply generative models to lead optimization tasks.

References

(1) Schneider, P.; Schneider, G. De Novo Design at the Edge of Chaos. Journal of Medicinal Chemistry, 4077–4086.
(2) Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S. R.; Schacht, A. L. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery, 203–214.
(3) Elton, D. C.; Boukouvalas, Z.; Fuge, M. D.; Chung, P. W. Deep learning for molecular design - a review of the state of the art. Mol. Syst. Des. Eng., 828–849.
(4) Brown, N.; Fiscato, M.; Segler, M. H.; Vaucher, A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. Journal of Chemical Information and Modeling, 1096–1108.
(5) Hughes, J.; Rees, S.; Kalindjian, S.; Philpott, K. Principles of early drug discovery. British Journal of Pharmacology, 1239–1249.
(6) Ståhl, N.; Falkman, G.; Karlsson, A.; Mathiason, G.; Boström, J. Deep Reinforcement Learning for Multiparameter Optimization in de novo Drug Design. Journal of Chemical Information and Modeling, 3166–3176.
(7) Hartenfeller, M.; Schneider, G. Enabling future drug discovery by de novo design. WIREs Computational Molecular Science, 742–759.
(8) Yoshikawa, N.; Terayama, K.; Honma, T.; Oono, K.; Tsuda, K. Population-based De Novo Molecule Generation, Using Grammatical Evolution. Chemistry Letters.
(9) Jensen, J. H. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chem. Sci., 3567–3572.
(10) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science, 268–276.
(11) Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nature Reviews Drug Discovery, 353–364.
(12) Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Central Science, 120–131.
(13) Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. Molecular De Novo Design through Deep Reinforcement Learning. Journal of Cheminformatics.
(14) Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; Kadurin, A.; Nikolenko, S. I.; Aspuru-Guzik, A.; Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. CoRR, abs/1811.12823.
(15) Sanchez-Lengeling, B.; Outeiral, C.; Guimaraes, G. L.; Aspuru-Guzik, A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC).
(16) Zhou, Z.; Kearnes, S.; Li, L.; Zare, R. N.; Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Scientific Reports, 10752.
(17) Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2323–2332.
(18) Cao, N. D.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. CoRR, abs/1805.11973.
(19) Merk, D.; Friedrich, L.; Grisoni, F.; Schneider, G. De Novo Design of Bioactive Small Molecules by Artificial Intelligence. Molecular Informatics, 1700153.
(20) Gupta, A.; Müller, A. T.; Huisman, B. J. H.; Fuchs, J. A.; Schneider, P.; Schneider, G. Generative Recurrent Networks for De Novo Drug Design. Molecular Informatics, 1700111.
(21) Li, Y.; Hu, J.; Wang, Y.; Zhou, J.; Zhang, L.; Liu, Z. DeepScaffold: A Comprehensive Tool for Scaffold-Based De Novo Drug Discovery Using Deep Learning. Journal of Chemical Information and Modeling, 77–91.
(22) Arús-Pous, J.; Patronov, A.; Bjerrum, E. J.; Tyrchan, C.; Reymond, J.-L.; Chen, H.; Engkvist, O. SMILES-based deep generative scaffold decorator for de-novo drug design. Journal of Cheminformatics.
(23) Walder, C.; Kim, D. Computer Assisted Composition with Recurrent Neural Networks. 359–374.
(24) Papadopoulos, A.; Pachet, F.; Roy, P.; Sakellariou, J. Exact Sampling for Regular and Markov Constraints with Belief Propagation. 341–350.
(25) Weininger, D. SMILES-A Language for Molecules and Reactions. Handbook of Chemoinformatics, 80–102.
(26) Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Research.
(27) Arús-Pous, J.; Johansson, S. V.; Prykhodko, O.; Bjerrum, E. J.; Tyrchan, C.; Reymond, J.-L.; Chen, H.; Engkvist, O. Randomized SMILES strings improve the quality of molecular generative models. Journal of Cheminformatics, 71.
(28) Böhm, H.-J.; Flohr, A.; Stahl, M. Scaffold hopping. Drug Discovery Today: Technologies, 217–224.
(29) Papadatos, G.; Davies, M.; Dedman, N.; Chambers, J.; Gaulton, A.; Siddle, J.; Koks, R.; Irvine, S.; Pettersson, J.; Goncharoff, N.; Hersey, A.; Overington, J. SureChEMBL: A large-scale, chemically annotated patent document database. Nucleic Acids Research.
(30) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. Journal of Medicinal Chemistry, 2887–2893.
(31) Pickett, S. D.; Green, D. V. S.; Hunt, D. L.; Pardoe, D. A.; Hughes, I. Automated Lead Optimization of MMP-12 Inhibitors Using a Genetic Algorithm. ACS Medicinal Chemistry Letters, 28–33.
(32) Landrum, G. RDKit: Open-source cheminformatics.
(33) Ertl, P.; Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics.
(34) Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature Chemistry, 90–98.

Figure 8: For Table of Contents Only

Supporting Information: Scaffold-constrained molecular generation

Maxime Langevin,†,‡ Hervé Minoux,‡ Maximilien Levesque,∗,¶,§ and Marc Bianciotto∗,‡

† PASTEUR, Département de chimie, Ecole Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
‡ Molecular Design Sciences - Integrated Drug Discovery, Sanofi R&D, Vitry-sur-Seine, France
¶ PASTEUR, Département de chimie, Ecole Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
§ Aqemia, Paris, France

E-mail: [email protected]; marc.bianciotto@sanoﬁ.com

Data curation and software availability

Code availability

All software and data to reproduce the results of this paper are available at: https://github.com/maxime-langevin/scaffold-constrained-generation. To implement the methods presented above, we build on the existing codebase of Olivecrona et al., available at https://github.com/MarcusOlivecrona/REINVENT. The different experiments are reproduced in Jupyter notebooks available in our codebase. An extra notebook showing basic usage, for researchers interested in simply using our method without necessarily reproducing our results, is also available.

ChEMBL dataset

The ChEMBL database is often used to train generative models of drug-like molecules. To train our RNN, we use a preprocessed version of ChEMBL in which only molecules having between 10 and 50 heavy atoms and comprised of elements ∈ {H, B, C, N, O, F, Si, P, S, Cl, Br, I} were kept. The original filtered ChEMBL dataset can be found and downloaded at https://github.com/MarcusOlivecrona/REINVENT/blob/master/data/ChEMBL_filtered, and in our repository at https://github.com/maxime-langevin/scaffold-constrained-generation/data/ChEMBL_filtered. Furthermore, we filtered the dataset to exclude molecules having one of the 17 validation scaffolds as a substructure, yielding the final dataset at https://github.com/maxime-langevin/scaffold-constrained-generation/data/ChEMBL_without_sureChEMBL.smi.

SureChEMBL dataset

The SureChEMBL database is comprised of patented compounds. The database can be downloaded at https://chembl.gitbook.io/chembl-interface-documentation/downloads. 34000 compounds were extracted from SureChEMBL v2019.10.01. Compounds were clustered by Bemis-Murcko scaffold, and 18 chemical series (every molecule in each series having the same scaffold) were kept to be used as a validation set. The molecules in the 18 series can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/data/SureChEMBL/200323_SureChemBL_dataset_636.sdf, and the 18 scaffolds at https://github.com/maxime-langevin/scaffold-constrained-generation/data/SureChEMBL/surechembl_scaffolds.sdf.

DRD2 dataset

The full DRD2 dataset can be found at https://github.com/undeadpixel/reinvent-scaffold-decorator/blob/master/training_sets/drd2.excapedb.smi.gz. The scaffolds used for the goal-directed benchmark are the ones used in Arús-Pous et al., and can be found in our codebase at https://github.com/maxime-langevin/scaffold-constrained-generation/data/DRD2/drd2_scaffolds.sdf.

MMP-12 dataset

The MMP-12 dataset was downloaded from the supplementary materials of Pickett et al. The dataset can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/data/MMP12/mmp12.csv.

Implementation details

Computation time benchmarks

Computation time benchmarks were run on an Amazon EC2 p2.xlarge instance and assessed the runtime of generating molecules using one CPU.
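As an illustration of how such a CPU-time measurement could be set up, the following minimal sketch times a sampler with `time.process_time` and converts the result to molecules per CPU-second. The names `time_generation`, `molecules_per_second` and `dummy_sampler` are hypothetical and are not taken from the released codebase; the sampler is a placeholder for any function that generates a requested number of molecules.

```python
import time
from typing import Callable, Tuple


def molecules_per_second(n_molecules: int, cpu_seconds: float) -> float:
    """Throughput in molecules per CPU-second."""
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return n_molecules / cpu_seconds


def time_generation(
    generate: Callable[[int], object], n_molecules: int = 1000
) -> Tuple[float, float]:
    """Return (CPU seconds, molecules per CPU-second) for one call to `generate`.

    `generate` is a hypothetical stand-in for any sampler that produces
    `n_molecules` molecules when called.
    """
    start = time.process_time()  # CPU time of this process, not wall-clock time
    generate(n_molecules)
    elapsed = time.process_time() - start
    # Very fast calls can fall below the timer resolution.
    rate = molecules_per_second(n_molecules, elapsed) if elapsed > 0 else float("inf")
    return elapsed, rate


if __name__ == "__main__":
    def dummy_sampler(n: int):
        # Placeholder workload standing in for a generative model.
        return ["C" * (1 + i % 5) for i in range(n)]

    seconds, rate = time_generation(dummy_sampler)
    print(f"{seconds:.4f} CPU seconds, {rate:.1f} molecules per CPU-second")
```

With the figures reported in Table 4, `molecules_per_second(1000, 620.9)` gives about 1.6 molecules per second for the Reinvent Scaffold Decorator.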

Distribution learning benchmarks

To assess distribution learning benchmarks, 10000 molecules were generated for each scaffold.

Validity

The validity score for a scaffold is the ratio of the number of valid molecules, as defined in the RDKit, to all 10000 generated molecules.

Unicity

Out of the generated valid molecules, the unicity score is computed as the ratio of molecules with distinct canonical SMILES strings.
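The two ratios above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported results: `validity_and_unicity` and `rdkit_canonicalize` are hypothetical names, and the canonicalizer is passed in as a function so that the metric itself does not depend on a particular toolkit. The RDKit adaptor assumes that a SMILES string is valid when `Chem.MolFromSmiles` returns a molecule.

```python
from typing import Callable, Iterable, Optional, Tuple


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Canonical SMILES via RDKit, or None if the string cannot be parsed."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


def validity_and_unicity(
    generated: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Tuple[float, float]:
    """Return (validity, unicity) for a collection of generated SMILES.

    validity = valid molecules / all generated molecules
    unicity  = distinct canonical SMILES / valid molecules
    """
    generated = list(generated)
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    validity = len(valid) / len(generated) if generated else 0.0
    unicity = len(set(valid)) / len(valid) if valid else 0.0
    return validity, unicity
```

With RDKit available, the scores for one scaffold would be obtained by calling `validity_and_unicity(smiles_list, rdkit_canonicalize)` on the generated strings.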

Physico-chemical properties

All physico-chemical properties were computed using the RDKit. Properties were computed on the valid molecules out of the 10000 generated for each scaffold, and then grouped together. The overall distributions were plotted against the distributions of both the training and the validation set, in order to check that there was no striking dissimilarity between them.

Predicting DRD2 activity

One of the major points of the DRD2 activity task was to benchmark our method against the Reinvent Scaffold Decorator. Thus, it seems natural to use the same QSAR model. As we were not able to find this QSAR model within the codebase reproducing the experiments of the article, we used a QSAR model from another work by the same group on the DRD2 dataset, and we assumed that the QSAR model used in the two works was the same. In our codebase, the model used for DRD2 activity prediction can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/data/clf.pkl

Predicting MMP-12 activity

To predict activity on the MMP-12 target, the dataset was split into a training and a test set. Then, a random forest regression algorithm (implemented with Scikit-learn) was fitted on the training set with continuous targets (corresponding to the experimental pIC50) and evaluated on the test set. The evaluation yielded a coefficient of determination r² = 0.84. The QSAR model is accessible at https://github.com/maxime-langevin/scaffold-constrained-generation/data/MMP12/final_activity_model.pkl, and the evaluation on the test set at https://github.com/maxime-langevin/scaffold-constrained-generation/MMP12_experiments.ipynb.

Hill climbing procedure

To optimize molecules in goal-oriented benchmarks, a hill-climbing procedure was used, as it was shown to be overall the best method among different generative models. The algorithm can be summarized as a repetition of the following steps:

• Generate 500 molecules
• Score them and keep the top 50 unique molecules
• Perform 10 rounds of log-likelihood maximization with the 50 best molecules

Those steps are repeated 10 times in a row. The code for performing hill-climbing can be found at https://github.com/maxime-langevin/scaffold-constrained-generation/hill_climbing.py.
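The loop described above can be sketched generically as follows. This is an illustrative outline only and is not taken from `hill_climbing.py`: the function `hill_climb`, its parameter names, and the `ToyModel` class are hypothetical. The sampler, the scoring function and the likelihood-maximization step are passed in as callables, uniqueness is decided on the raw strings rather than on canonical SMILES, and returning the molecules kept in the final iteration is an assumption.

```python
import random
from typing import Callable, List, Sequence


def hill_climb(
    generate: Callable[[int], List[str]],
    score: Callable[[str], float],
    fine_tune: Callable[[Sequence[str]], None],
    n_iterations: int = 10,
    n_generated: int = 500,
    n_kept: int = 50,
    n_rounds: int = 10,
) -> List[str]:
    """Repeat: sample, keep the top unique molecules, fine-tune on them."""
    best: List[str] = []
    for _ in range(n_iterations):
        # Deduplicate on the raw string while preserving sampling order.
        unique = list(dict.fromkeys(generate(n_generated)))
        best = sorted(unique, key=score, reverse=True)[:n_kept]
        for _ in range(n_rounds):
            fine_tune(best)  # stands in for one round of log-likelihood maximization
    return best


class ToyModel:
    """Toy stand-in for an RNN: strings over {C, N}, with N drawn with chance p_n."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.p_n = 0.2

    def generate(self, n: int) -> List[str]:
        return [
            "".join("N" if self.rng.random() < self.p_n else "C" for _ in range(8))
            for _ in range(n)
        ]

    def fine_tune(self, molecules: Sequence[str]) -> None:
        # Move p_n a small step toward the N fraction of the kept strings.
        target = sum(m.count("N") for m in molecules) / (8 * len(molecules))
        self.p_n += 0.1 * (target - self.p_n)


if __name__ == "__main__":
    model = ToyModel(seed=0)
    top = hill_climb(model.generate, lambda s: s.count("N"), model.fine_tune)
    print(len(top), "molecules kept; final p_n =", round(model.p_n, 3))
```

In the toy run the score is simply the number of N characters, so the sampler drifts toward N-rich strings; with a real generative model, `score` would wrap the QSAR prediction and `fine_tune` a gradient step on the kept molecules.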