Predicting retrosynthetic pathways using a combined linguistic model and hyper-graph exploration strategy
Philippe Schwaller, Riccardo Petraglia, Valerio Zullo, Vishnu H Nair, Rico Andreas Haeuselmann, Riccardo Pisoni, Costas Bekas, Anna Iuliano, Teodoro Laino
IBM Research – Zurich, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland
Dipartimento di Chimica e Chimica Industriale, Università di Pisa, Via Giuseppe Moruzzi 13, I-56124, Pisa, Italy
Abstract
We present an extension of our Molecular Transformer architecture combined with a hyper-graph exploration strategy for automatic retrosynthesis route planning without human intervention. The single-step retrosynthetic model sets a new state of the art for predicting reactants as well as reagents, solvents and catalysts for each retrosynthetic step. We introduce new metrics (coverage, class diversity, round-trip accuracy and Jensen-Shannon divergence) to evaluate single-step retrosynthetic models, using a forward prediction model and a reaction classification model, both also based on the transformer architecture. The hypergraph is constructed on the fly, and its nodes are filtered and further expanded based on a Bayesian-like probability. We critically assessed the end-to-end framework on several retrosynthesis examples from the literature and from academic exams. Overall, the framework performs very well, with few weaknesses due to biases induced during the training process. The newly introduced metrics open up the possibility of optimizing entire retrosynthetic frameworks by focusing on the performance of the single-step model only.
The field of organic chemistry has been continuously evolving, moving its attention from the synthesis of complex natural products to the understanding of molecular functions and activities [1–3]. These advancements were made possible thanks to the vast chemical knowledge and intuition of human experts, acquired over several decades of practice. Among the different tasks involved, the design of efficient synthetic routes for a given target (retrosynthesis) is arguably one of the most complex problems. Key reasons include the need to identify a cascade of disconnection schemes, suitable building blocks and functional group protection strategies. Therefore, it is not surprising that computers have been employed since the 1960s [4], giving rise to several computer-aided retrosynthetic tools. Rule-based or similarity-based methods have been the most successful approaches implemented in computer programs for many years. While they suggest very effective [5, 6] pathways to molecules of interest, these methods do not strictly learn chemistry from data but rather encode synthon generation rules. The main drawback of rule-based systems is the need for laborious manual encoding, which prevents scaling with increasing data set sizes. Moreover, the complexity of assessing the logical consistency between all existing rules and new ones grows with the number of codified rules and may sooner or later reach a level where the problem becomes intractable.

The dawn of AI-driven chemistry.
While human chemical knowledge will keep fueling organic chemistry research in the years to come, a careful analysis of current trends [5, 7–20] and the application of basic extrapolation principles undeniably show that there are growing expectations on the use of Artificial Intelligence (AI) architectures to mimic human chemical intuition and to provide research assistant services to bench chemists worldwide. Concurrently with rule-based systems, a wide range of AI approaches have been reported for retrosynthetic analysis [9, 12], prediction of reaction outcomes [21–26] and optimization of reaction conditions [27]. All these AI models superseded rule-based methods in their potential for mimicking the human brain by learning chemistry from large data sets without human intervention. While this extensive production of AI models for organic chemistry was made possible by the availability of public data [28, 29], the noise contained in these data, generated by the text-mining extraction process, heavily reduces their potential. In fact, while rule-based systems [30] demonstrated, through wet-lab experiments, the capability to design target molecules with fewer purification steps, and hence savings in time and cost [31], the AI approaches [6, 9, 12, 16, 32–38] still have a long way to go. Among the different AI approaches [39], those treating chemical reaction prediction as a natural language (NL) problem [40] are becoming increasingly popular. They are currently the state of the art in forward reaction prediction, scoring an undefeated accuracy of more than 90% [22]. In the NL framework, chemical reactions are encoded as sentences using reaction SMILES [41] and the forward- or retro-reaction prediction is cast as a translation problem, using different types of neural machine translation architectures.
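In practice, casting reactions as sentences requires splitting SMILES strings into chemically meaningful tokens (multi-character atoms such as Cl and Br, bracketed atoms, ring-closure digits). The sketch below uses a regular expression along the lines of the one published with the Molecular Transformer; the exact pattern here is reproduced from memory and should be treated as illustrative:

```python
import re

# Tokenization pattern along the lines of the Molecular Transformer's
# published regex (reproduced from memory; treat as illustrative).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into the tokens fed to the translation model."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1"))
print(tokenize("BrC[OH-]Cl"))  # multi-character atoms kept whole
```

Token sequences like these, joined with spaces, are what the transformer consumes and emits on each side of the translation.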
One of the greatest advantages of representing synthetic chemistry as a language is the intrinsic scalability to larger data sets, as it avoids the need for humans to assign reaction centers, which is an important caveat of rule-based systems [6, 30]. The Molecular Transformer architecture [42], whose trained models fuel the cloud-based IBM RXN for Chemistry platform [43], is currently the most popular architecture treating chemistry as a language.
Transformer-based retrosynthesis: current status.
Inspired by the success of the Molecular Transformer [22, 42, 43] for forward reaction prediction, a few retrosynthetic models based on the same architecture were reported shortly after [32, 33, 35–37]. Zheng et al. [32] proposed a template-free self-corrected retrosynthesis predictor built on the Transformer architecture. The model achieves 43.7% top-1 accuracy on a small standardized data set of 50k reactions [44]. Using a coupled neural network-based syntax checker, they were able to reduce the initial number of invalid candidate precursors from 12.1% to 0.7%. It is interesting to note that previous work using the Transformer architecture reported fewer than 0.5% invalid candidates in forward reaction prediction [22], without the need for any additional syntax checker. Karpov et al. [33] described a Transformer model for retrosynthetic reaction predictions trained on the same data set [44]. They were able to successfully predict the reactants with a top-1 accuracy of 42.7%. Lin et al. [35] combined a Monte-Carlo tree search, previously introduced for retrosynthesis in the ground-breaking work by Segler et al. [12], with a single-step retrosynthetic Transformer architecture for predicting multi-step reactions. In a single-step setting, the model described by Lin et al. [35] achieved top-1 prediction accuracies of 43.1% and 54.1% when trained on the same small data set [44] and on a ten times larger collection, respectively. Duan et al. [37] increased the batch size and the training time for their Transformer model and were able to achieve a top-1 accuracy of 54.1% on the 50k USPTO data set [44]. Later on, the same architecture was reported to have a top-1 accuracy of 43.8% [36], in line with the three previous transformer-based approaches [32, 33, 35] but significantly lower than the accuracy previously reported by Duan et al. [37].
Interestingly, the transformer model was also trained on a proprietary data set [36], including only reactions with two reactants and with a Tanimoto similarity distribution peaked at 0.75, characteristic of an excessive degree of similarity (roughly 2 times higher than the USPTO). Despite the high reported top-1 accuracy on the proprietary training and testing set, it is questionable how a model that overfits a particular ensemble of identical chemical transformations could be used in practice. Recently, a graph-enhanced transformer model [45] and a mixture model [46] were proposed, achieving a top-1 accuracy of 44.9% and more diverse reactant suggestions, respectively, with no substantial improvements over previous works. Except for the work of Lin et al. [35], all transformer-based retrosynthetic approaches have so far been limited to single-step predictions. Moreover, none of the previously reported works attempts the concurrent prediction of reagents, catalysts and solvent conditions, but only of reactants. In this work, we present an extension of our Molecular Transformer architecture combined with a hyper-graph exploration strategy to design retrosynthetic pathways without human intervention. Compared to all other existing AI works, we predict reactants as well as reagents for each retrosynthetic step. Throughout the article, we will refer to reactants and reagents (e.g. solvents and catalysts) as precursors. Instead of using the confidence level intrinsic to the retrosynthetic model, we introduce new metrics (coverage, class diversity, round-trip accuracy and Jensen-Shannon divergence) to evaluate the single-step retrosynthetic model, using the corresponding forward prediction and reaction classification models.
This provides a general assessment of each retrosynthetic step, capturing the important aspects a model should have to perform similarly to human experts in retrosynthetic analysis. The optimal synthetic pathway is found through a beam search on the hypergraph of the possible disconnection strategies, which allows circumventing potential selectivity traps. The hypergraph is constructed on the fly, and the nodes are filtered and subject to further expansion based on a Bayesian-like probability that makes use of the forward prediction likelihood and the SCScore [47] to prioritize synthetic steps. This strategy penalizes non-selective reactions and precursors with higher complexity than their targets, leading to termination when commercially available building blocks are identified. The quality of the retrosynthetic tree is strongly related to the likelihood distributions of the forward prediction model across the twelve different superclasses generated in single-step retrosynthesis. We encode the analysis of the probability distributions using the Jensen-Shannon divergence. This provides, for the first time, a holistic analysis and a key indicator to systematically improve the quality of multi-step retrosynthetic tools. Finally, we critically assessed the entire AI framework by reviewing several retrosynthetic problems, some of them from literature data and others from academic exams. We show that reaching high performance on a subset of metrics for single-step retrosynthetic prediction is not beneficial in a multi-step framework. We also demonstrate that the use of all newly defined metrics provides an evaluation of end-to-end solutions while focusing only on the quality of the single-step prediction model. The trained models and the entire architecture are freely available online [43].
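The exact form of the Bayesian-like score is given in Section 4.5; purely as an illustration, a node-ranking rule with the two stated ingredients (high forward likelihood rewarded, precursors more complex than the target penalized) could look as follows. The function name, the division-based combination and the numbers are assumptions, not the published formula:

```python
def node_score(forward_likelihood: float,
               target_complexity: float,
               precursor_complexities: list) -> float:
    """Hypothetical ranking score for a retrosynthetic disconnection.

    Rewards disconnections whose forward reaction is predicted with high
    likelihood, and penalizes precursor sets more complex than the target
    (the complexities play the role of SCScore values, which range 1-5).
    This is an illustrative stand-in, not the paper's formula.
    """
    max_precursor = max(precursor_complexities)
    # Penalty exceeds 1 only when a precursor is more complex than the target.
    complexity_penalty = max(1.0, max_precursor / target_complexity)
    return forward_likelihood / complexity_penalty

# A selective reaction with simpler precursors outranks a low-likelihood
# disconnection whose precursor is more complex than the target itself.
good = node_score(0.95, target_complexity=3.5, precursor_complexities=[2.1, 1.8])
bad = node_score(0.40, target_complexity=3.5, precursor_complexities=[4.2])
print(good > bad)  # True
```

Any monotone combination with the same two properties would order the hypergraph nodes in a qualitatively similar way.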
The potential of the presented technology is high: it augments the skills of less experienced chemists and also enables chemists to design and protect the intellectual property of non-obvious synthetic routes for given targets.
Solving a retrosynthetic problem is equivalent to exploring a directed acyclic graph of all possible retrosyntheses of a given target and finding the optimal route based on the optimization of specific cost functions (price of synthesis, raw materials availability, efficacy, etc.). Monte-Carlo Tree Search (MCTS) algorithms were the method of choice to explore retrosynthetic graphs in previous works [12, 35, 38]. Here, we use a hypergraph exploration strategy (see Section 4.5). We construct the directed acyclic hypergraph on the fly, using a Bayesian-like probability to decide the direction along which the graph is expanded. The combined use of the SCScore [47] drives the tree towards simpler precursors (see 4.5). In Figure 1, we show a schematic representation of the multi-step retrosynthetic workflow. Given a target molecule, we use a single-step retrosynthetic model to generate a certain number of possible disconnections (i.e. precursor sets). Upon canonicalization, for each of these options we determine the reaction class (as additional information to display to users), and compute the SCScore as well as the reaction likelihood with the forward prediction model on the corresponding inchified entry. In order to discourage the use of non-selective reactions, we filter the single-step retrosynthetic predictions by applying a threshold on the reaction likelihood returned by the forward model. The likelihood and SCScore of the filtered predictions are combined to compute a probability score to rank all the options. In case all the predicted precursors are commercially available, the retrosynthetic analysis reports that option as a possible solution and the exploration of that tree branch is considered complete. If not, we repeat the entire cycle using the precursors as new target molecules until we reach either commercially available molecules or the maximum number of specified retrosynthesis steps.
The multi-step framework is entirely based on the use of statistical information and does not include chemical knowledge. Therefore, it is important to analyze the performance of the single-step retrosynthetic model in detail to understand the strengths and weaknesses of the entire methodology.
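The cycle described above can be sketched as a depth-limited recursive expansion. The interfaces `retro_model`, `forward_model` and `is_commercial` are hypothetical stand-ins for the single-step transformer models and the building-block lookup of the actual framework, and the sketch omits the SCScore-based ranking and the hypergraph bookkeeping:

```python
def expand(target, retro_model, forward_model, is_commercial,
           likelihood_threshold=0.5, max_depth=10, depth=0):
    """Recursively expand a retrosynthetic route for `target`.

    Returns a route as a nested dict, or None if no route is found within
    `max_depth` steps. `retro_model(target)` yields candidate precursor
    sets; `forward_model(precursors)` returns the likelihood that they
    react back to the target.
    """
    if depth >= max_depth:
        return None
    # Filter out non-selective disconnections via the forward likelihood.
    candidates = [(p, forward_model(p)) for p in retro_model(target)]
    candidates = [(p, lh) for p, lh in candidates if lh >= likelihood_threshold]
    # Rank the surviving options by likelihood (the real framework also
    # folds the SCScore into the ranking at this point).
    for precursors, _ in sorted(candidates, key=lambda c: -c[1]):
        route = {"target": target, "precursors": {}}
        for p in precursors:
            if is_commercial(p):
                route["precursors"][p] = "commercial"
            else:
                sub = expand(p, retro_model, forward_model, is_commercial,
                             likelihood_threshold, max_depth, depth + 1)
                if sub is None:
                    break  # dead end: try the next disconnection
                route["precursors"][p] = sub
        else:
            return route  # every precursor resolved
    return None

# Toy usage: a single disconnection C -> A + B, both commercially available.
toy_retro = {"C": [("A", "B")]}
route = expand("C",
               retro_model=lambda t: toy_retro.get(t, []),
               forward_model=lambda p: 0.9,
               is_commercial=lambda m: m in {"A", "B"})
print(route)
```

A branch terminates either on commercial building blocks (a complete path) or when the depth limit or filter exhausts the options (an unfinished path), mirroring the two outcomes in Figure 1.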
Solving retrosynthetic problems requires a careful analysis of which among multiple precursors could lead to the desired product most efficiently, as seen, for example, for 5-Bromo-2-methoxypyridine in Figure 2. Humans address this issue by mentally listing and analyzing all possible disconnection sites and retaining only the disconnection for which the corresponding precursors are thought to produce the target molecule in the most selective way. For the evaluation of single-step retrosynthetic models, the top-N accuracy score was commonly used. Top-N accuracy means that the ground truth precursors were found within the first N suggestions of the retrosynthetic model. Unfortunately, the disconnection of a target molecule rarely originates from one set of precursors only. In fact, quite often the presence of different functional groups allows a multitude of possible disconnection strategies to exist, leading to different sets of reactants, as well as possible solvents and catalysts. Moreover, the analysis of the USPTO stereo data set, derived from the text-mined open-source reaction data set by Lowe [28, 29], and of the Pistachio data set [49] shows that 6% and 14% of the products, respectively, have at least two different sets of precursors. While these numbers only reflect the organic chemistry represented by each data set, the total number of possible disconnections is certainly larger. Considering the limited size of existing data sets, it is evident that, in the context of retrosynthesis, the top-N accuracy rewards the ability of a model to retrieve expected answers from a data set more than its ability to predict chemically meaningful precursors. Therefore, a top-N comparison with the ground truth is not an adequate metric for assessing retrosynthetic models.

Figure 1: Schematic of the multi-step retrosynthetic workflow.

Here, we dispute the previous use of top-N accuracy [6, 9, 12, 16, 32–37] and introduce four different metrics, namely round-trip accuracy, coverage, class diversity and Jensen-Shannon divergence [50], as seen in Figure 3, to evaluate single-step retrosynthetic models and, through them, retrosynthetic tools as a whole. All four metrics have been critically designed and assessed with the help of human domain experts (see Section 4.2 for a detailed description).
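Anticipating the definitions given below (round-trip accuracy: fraction of suggested precursor sets whose forward prediction reproduces the target; coverage: fraction of targets with at least one such set; class diversity: average number of distinct reaction classes among the round-tripping suggestions, which is one plausible reading), the three counting metrics can be sketched as follows. The input format and the function name are illustrative assumptions:

```python
def evaluate(results):
    """Compute round-trip accuracy, coverage and class diversity.

    `results` maps each target molecule to a list of
    (predicted_product, reaction_class) pairs, one per suggested
    precursor set, where predicted_product is the forward model's
    output for that precursor set.
    """
    n_predictions = sum(len(preds) for preds in results.values())
    # Round-trip: suggested precursors whose forward prediction
    # reproduces the original target.
    n_round_trip = sum(
        sum(1 for product, _ in preds if product == target)
        for target, preds in results.items())
    # Coverage: targets with at least one round-tripping suggestion.
    n_covered = sum(
        1 for target, preds in results.items()
        if any(product == target for product, _ in preds))
    # Class diversity: distinct reaction classes among the
    # round-tripping suggestions, averaged over targets.
    diversity = [len({cls for product, cls in preds if product == target})
                 for target, preds in results.items()]
    return {
        "round_trip": n_round_trip / n_predictions,
        "coverage": n_covered / len(results),
        "class_diversity": sum(diversity) / len(diversity),
    }
```

The fourth metric, the Jensen-Shannon divergence, operates on likelihood distributions rather than counts and is discussed with Table 1.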
Figure 2: Highlighting a few of the precursors and reactions leading to 5-Bromo-2-methoxypyridine that are found in the US Patents data set: an Ar ether synthesis (1.7.11, US05922742A), a bromination (10.1.1, US20120088764A1) and an O-methylation (1.7.14, US20150210671A1). The molecules were depicted with CDK [48].

[Figure 3 diagram: a retro model under evaluation proposes N candidate precursor sets for a desired product; a fixed forward model predicts their products. Comparing candidates with ground truth precursors (previous work) is fast but sub-optimal, since many sets of precursors lead to the same product; human expert validation and experiments are ideal but not scalable; comparing predicted with desired products (this work) is scalable. Metrics: round-trip accuracy (% of predicted products equal to the desired product), coverage (at least one predicted product equals the desired product), class diversity (diversity of suggested reaction classes), Jensen-Shannon divergence (similarity of class probability distributions).]
Figure 3: Overview of single-step retrosynthesis evaluation metrics.

During the development phase we trained different retrosynthetic transformer-based models with two different data sets, one fully based on open-source data (stereo) and one based on commercially available data from Pistachio (pistachio). In some cases, the data set was inchified [51] (labelled with inchi). Table 1 shows the results for the retrosynthetic models, evaluated using a fixed forward prediction model (pistachio inchi) on two validation sets (stereo and pistachio). The coverage represents the percentage of desired products for which at least one valid precursor set was suggested. It was similar and above 90% for all the model combinations, which is an important requirement to guarantee the possibility to always offer at least one disconnection strategy. Likewise, the class diversity, which is an average of how many different reaction classes are predicted in a single retrosynthetic step, was comparable for both models, with a slightly better performance for the pistachio model. During the different training runs, we noticed that the stereo retro model consistently performed better than the pistachio model in terms of round-trip accuracy, which is the percentage of precursor sets leading back to the initial target when evaluated with the forward model. Notwithstanding, the synthesis routes generated with this model were often characterized by a sequence of illogical protection/deprotection steps, as if the model were heavily biased towards those reaction classes.

Table 1: Evaluation of single-step retrosynthetic models. The test data set consisted of 10K entries. For every reaction we generated 10 predictions; the number of resulting precursor suggestions was 100K. Round-trip accuracy (RT), coverage (Cov.), class diversity (CD), the inverse of the Jensen-Shannon divergence of the class likelihood distributions (1/JSD), the percentage of invalid SMILES (invalid smi) and the human expert evaluation (human eval) are reported. Models with the "inchi" suffix were trained on an inchified data set.

Retro model    Forward model   Test data   RT [%]   Cov. [%]   CD    1/JSD   invalid smi [%]   Human eval
stereo inchi   pist inchi      stereo      81.2     95.1       1.8   16.5    0.5               -
stereo inchi   pist inchi      pist        79.1     93.8       1.8   20.6    1.1               -
pist inchi     pist inchi      pist        74.9     95.3       2.1   22.0    0.5               +
pist           pist inchi      pist        71.1     92.6       2.1   27.2    0.6               ++

This apparent paradox became clear when we analyzed in detail how humans approach the problem of retrosynthesis. For an expert, it is not sufficient to always find at least one disconnection site (coverage) and to be sure that the corresponding precursors will selectively lead to the original target (round-trip accuracy). It is necessary to generate a diverse sample of disconnection strategies to cope with competitive functional group reactivity (class diversity). Most importantly, there needs to be a guarantee that every disconnection class has a similar probability distribution to all the others (Jensen-Shannon divergence, JSD). Continuing the parallelism with human experts, if one were exposed to the same reaction classes for many years, those familiar schemes would appear more frequently in route planning, leading to strongly biased retrosyntheses. Therefore, it is important to reduce any bias in single-step retrosynthetic models to a minimum. To evaluate the bias of a single-step model we use the JSD of the likelihood distributions of the predictions, divided into different reaction superclasses, which we report in Table 1 as 1/JSD. The larger this number, the more similar the likelihood distributions of the reactions belonging to different classes are and hence the less dominant (lower bias) individual reaction classes are in the multi-step synthesis (2.1). In Figure 4, we show the likelihood distributions for the different models in Table 1.
Except for the resolution class, all of the distributions show a peak close to 1.0, which clearly shows that the model learned how to predict the reactions in those classes. The resolution class, in contrast, is relatively flat, as a consequence of the poor quality and quantity of data for stereochemical reactions in both the stereo and pistachio data sets. Interestingly, one can see that for the stereo model the likelihood distributions of the deprotection, reduction and oxidation reactions are quite different from (and generally more peaked than) all other distributions generated with the same model. This statistical imbalance favours those reaction classes and explains the occurrence of illogical loops of protection/deprotection or oxidation/reduction strategies, in agreement with the human expert assessment (last column in Table 1).

Figure 4: The likelihood distributions predicted by a forward model (pistachio inchi) for the reactions suggested by different retro models. We show the likelihood range between 0.5 and 1.0.

While a peaked distribution is desirable, as it is a consequence of the model learning to predict disconnection strategies in a precise class, the dissimilarity (JSD) between the twelve probability distributions reflects a clear quality issue, likely due to unbalanced data sets. Among the models reported, the pistachio model was found to have the most similar reaction likelihood distributions; it is the one analyzed in the subsequent part of the manuscript and made available online. The evaluation of the four metrics (round-trip accuracy, coverage, class diversity and 1/JSD) requires the identification of the reaction class for each prediction. We used a transformer-based reaction classification model, as described in [53]. In Figure 5, we report the ground truth classified by the NameRXN [52] tool, the class distribution predicted by our classification model on the ground truth reactions and, finally, the class distributions predicted for the reactions suggested by the retrosynthesis models (see Table 1). We observe that the classifications made by our class prediction model are in agreement with those of NameRXN [52]
and match them with an accuracy of 93.8%. The distributions of the single-step retrosynthetic models resemble the original one, with the one difference that the number of unrecognized reactions has nearly been halved: all of the models learned to predict more recognizable reactions, even for products for which the ground truth reaction was unrecognized.

Figure 5: Distribution of reaction superclasses for the ground truth [52], the predicted superclasses for the ground truth reactions and the predicted superclasses for the reactions suggested by the different retrosynthesis models.

The design of single-step retrosynthetic prediction models through multi-objective (round-trip accuracy, coverage, class diversity and 1/JSD) optimization opens the way to the systematic improvement of entire retrosynthetic multi-step algorithms without the need to manually review the quality of entire retrosynthetic routes.
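The 1/JSD indicator can be reproduced in a few lines: compute the generalized Jensen-Shannon divergence between the per-class likelihood histograms and invert it, using the standard identity JSD(P1, …, Pn) = H(mean of Pi) − mean of H(Pi). The toy histograms and the binning below are assumptions, not the paper's exact procedure:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jensen_shannon_divergence(distributions):
    """Generalized JSD of equally weighted discrete distributions.

    JSD(P1..Pn) = H(mean of Pi) - mean of H(Pi).
    Each distribution must sum to 1 over the same bins.
    """
    n = len(distributions)
    mixture = [sum(p[i] for p in distributions) / n
               for i in range(len(distributions[0]))]
    return entropy(mixture) - sum(entropy(p) for p in distributions) / n

# Toy per-class likelihood histograms (e.g. binned forward likelihoods
# for two reaction superclasses): identical distributions give JSD = 0
# (an unbiased model, 1/JSD diverging), while dissimilar distributions
# give a finite 1/JSD, signalling class bias.
p_class_a = [0.1, 0.2, 0.7]
p_class_b = [0.6, 0.3, 0.1]
jsd = jensen_shannon_divergence([p_class_a, p_class_b])
print(round(1 / jsd, 2))
```

Because the generalized JSD is bounded (by ln n for n equally weighted distributions), 1/JSD has a finite lower bound, and larger values indicate more similar per-class behaviour, as in Table 1.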
An evaluation of the model was carried out by performing the retrosynthesis of the compounds reported in Figure 6. Some of these are known compounds for which the synthesis is reported in the literature (1, 2, 5, 7, 8); others are unknown structures (3, 4, 6, 9). For the first group, the evaluation of the model could be made by comparing the proposed retrosynthetic analysis with the known synthetic pathway. For the second group, a critical evaluation of the proposed retrosynthesis was performed, taking into account the level of chemo-, regio- and stereoselectivity of every retrosynthetic step. The parameters used for each retrosynthesis are reported in the SI. In some cases, the default values were changed to increase the hypergraph exploration and yield better results. As an output, the model generates several retrosynthetic sequences for each compound, each one with a different confidence level. Because the model predicts not only reactants but also reagents, solvents and catalysts, there are several sequences with similar confidence levels and identical disconnection strategies, differing only by the suggested reaction solvents in a few steps. Therefore, we report only one of each group of similar sequences in the SI.

Figure 6: Set of molecules used to assess the quality of retrosynthesis.

All of the retrosynthetic routes generated for compounds 1, 2 and 3 fulfill the criteria of chemoselectivity; the highest-confidence sequence (called "sequence 0") for 1 corresponds to the reported synthesis of the product [54] and starts from commercially available acrylonitrile. The other two sequences (17 and 22) use synthetic equivalents of acrylonitrile and also show their preparation. For compound 2, the highest-confidence retrosynthetic sequence (sequence 0) does not correspond to the synthetic pathway reported in the literature, where the key step is the opening of an epoxide ring.
Two other sequences (5 and 23) report this step, and one of them (sequence 5) corresponds to the literature synthesis [55]. The retrosynthetic sequence for compound 3 provides a Diels-Alder reaction as the first disconnection strategy and proposes a correct retrosynthetic path for the synthesis of the diene from available precursors. A straightforward retrosynthetic sequence was found also in the case of compound 4, where the diene moiety was disconnected by two olefination reactions and the sequence uses structurally simple compounds as starting materials. It may be debatable whether the two olefinations, through a Horner-Wadsworth-Emmons reaction, can really be stereoselective towards the E-configured alkenes, or whether the reduction of the conjugated aldehyde by NaBH4 can be completely chemoselective towards the formation of the allylic alcohol. Only experimental work can solve this puzzle and give the correct answer. The retrosynthesis of racemic omeprazole 5 returned a sequence consisting of one step only, because the model finds in its library of available compounds the sulfide precursor of the final sulfoxide. When repeating the retrosynthesis using benzene as the starting molecule in conjunction with a restricted set of available compounds, we obtained a more complete retrosynthetic sequence with some steps in common with the reported one [56]. However, although all of the steps fulfill the chemoselectivity requirement, the sequence is characterised by some avoidable protection-deprotection steps. This nicely reflects the bias present in the likelihood distributions of the different superclasses for the chosen model. In fact, although the single-step retrosynthetic model has the best Jensen-Shannon divergence among all of the trained models, there is still room for improvements, which we will explore in the future.
A higher similarity across the likelihood distributions would prevent the occurrence of illogical protection-deprotection and esterification/saponification steps. In addition, the reported sequence for 5 lists as starting material a compound not present in the restricted set of available molecules. A de novo retrosynthesis of this compound solved the problem. The retrosynthetic sequence for the structurally complex compound 6 was possible only with wider settings allowing a more extensive hypergraph exploration. The result was a retrosynthetic route starting from simple precursors: notably, the sequence also showed the synthesis of the triazole ring through a Huisgen cycloaddition. However, we recognized the occurrence of some chemoselectivity problems in step 6, where the enolate of the ketone is generated in the presence of an acetate group used as protection of the alcohol. This problem could be avoided by using a different protecting group for the alcohol. By contrast, the alkylation of the ketone enolate by means of a benzyl bromide bearing an enolizable ester group in its structure appears less problematic, due to the high reactivity of the bromide. The retrosynthesis of the chiral stereodefined compound indinavir, 7, was completed in one step, through finding a very complex precursor in the set of available molecules. Sequences of lower confidence resulted in more retrosynthetic steps, disconnecting the molecule as in the reported synthesis [57], but stopped at the stereodefined epoxide, with no further disconnection paths available. However, when the retrosynthesis was performed on the same racemic molecule, a chemoselective retrosynthetic pathway was found, disconnecting the epoxide and starting from simple precursors. The same holds for the other optically active compound, propranolol, 8, which was disconnected according to the published synthetic pathway [58] only when the retrosynthesis was performed on the racemic compound. The problem experienced with stereodefined molecules reflects the poor likelihood distribution of the resolution superclass in Figure 4. In fact, because all current USPTO-derived data sets (stereo and pistachio) have particularly noisy stereochemical data, we decided to retain only a few entries in order to avoid jeopardizing the overall quality. With a limited number of stereochemical examples available in the training set, the model was not able to learn reactions belonging to the resolution class, failing to provide disconnection options for stereodefined centers. The retrosynthesis of the last molecule, 9, succeeded only with intensive hypergraph exploration settings. However, the retrosynthetic sequence is tediously long, with several avoidable esterification-saponification steps. Similarly to 5, the bias in the likelihood distributions is one reason for this peculiar behavior. In addition, a non-symmetric allyl bromide was chosen as the precursor of the corresponding tertiary amine: this choice entails a regioselectivity problem, given that the allyl bromide can undergo nucleophilic displacement not only at the ipso position, giving rise to the correct product, but also at the allylic position, resulting in the formation of the regioisomeric amine. Lastly, the model was unable to find a retrosynthetic path for one complex building block, which was not found in the available molecule set. However, a slight modification of the structure of this intermediate enabled a nice retrosynthetic path to be found, which can also easily be applied to the original problem, starting from 1,3-cyclohexanedione instead of cyclohexanone. We also compared our retrosynthetic architecture with previous work [6, 12], using the same compounds for the assessments (see SI). The model performed well on the majority of these compounds, showing problems in the case of stereodefined compounds, as in the previous examples. Retrosynthetic paths were easily obtained only for their racemic structures.
The proposed retrosyntheses are in some cases quite similar to those reported [6], while for some compounds [12] they are different but still chemoselective. Only in a few cases did the model fail to find a retrosynthesis.
In this work we presented an extension of our Molecular Transformer architecture combined with a hyper-graph exploration strategy to design retrosynthesis without human intervention. We introduced a single-step retrosynthetic model that, for the first time, predicts reagents as well as reactants. We also introduced four new metrics (coverage, class diversity, round-trip accuracy and Jensen-Shannon divergence) to provide a thorough evaluation of the single-step retrosynthetic model. The optimal synthetic pathway is found through a beam search on the hyper-graph of the possible disconnection strategies, which allows potential selectivity traps to be circumvented. The hypergraph is constructed on the fly, and the nodes are filtered and further expanded based on a Bayesian-like probability score until commercially available building blocks are identified. We assessed the entire framework by reviewing several retrosynthetic problems to highlight strengths and weaknesses. As confirmed by the statistical analysis, the entire framework performs very well for a wide class of disconnections. An intrinsic bias towards a few classes (reduction/oxidation/esterification/saponification) may lead, in some cases, to illogical disconnection strategies that are a peculiar fingerprint of the current learning process. Also, an insufficient ability to handle stereochemical reactions is the result of a poor-quality training data set covering only a few examples in the resolution class. The use of the four new metrics, combined with the critical analysis of the current model, provides a well-defined strategy to optimize the retrosynthetic framework by focusing exclusively on the performance of the single-step retrosynthetic model. A key role in this strategy will be played by the construction of statistically relevant training data sets to improve the confidence of the model in different types of reaction classes and disconnections.
Similar to our previous works, we use SMILES to represent molecules, additionally exploiting the auxiliary fragment information in which the grouped fragment indices are written after the label 'f:'. Different groups are separated by ',' and the connected fragments within a group are separated by '.'. An example would be '|f:1.2,4.5|', where fragments 1 and 2, as well as 4 and 5, belong together. Nothing enforces closeness of fragments in the SMILES string, hence different fragments belonging to the same compound could end up at opposite ends of the string; typical examples are organometallic compounds. Here, we relate the fragments within a group with a '~' character instead of a '.'. Consequently, the fragmented molecules are kept together in the reaction string.

Atom-mapping as well as reactant-reagent roles are a rich source of information generated by highly complicated tasks [59], the assignment often being made subjectively by humans. Schwaller et al. [22] recently proposed to ignore reactant and reagent roles for the reaction prediction task. In contrast to previous works [32, 33, 35, 36], the single-step retrosynthetic model presented here predicts reactants and reagents. To simplify the prediction task, the most common precursors with a length of more than 50 tokens were replaced by single molecule tokens. Those molecules were converted back into the usual tokenization before calculating the likelihood with the forward model. Moreover, to ensure a basic tautomer standardization and to improve the quality of the forward prediction model, we inchified our molecules, as described in [60]. In contrast to previous work [16], we never use a reaction class token as input for the retrosynthesis model.

The data sets used to train the different models in this work are derived from the open-source USPTO reaction database by Lowe [28, 29] and the Pistachio database by NextMove Software [49].
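The regrouping of fragments described above can be sketched as a small string transformation (the function name and the plain-string handling are our own illustration; the production pipeline operates on tokenized reaction SMILES):

```python
def merge_fragment_groups(smiles: str, groups: str) -> str:
    """Join SMILES fragments that belong to the same compound with '~'.

    `groups` follows the fragment annotation described in the text, e.g.
    "1.2,4.5": groups are comma-separated, member fragment indices are
    dot-separated. Fragments in a group are concatenated with '~' at the
    position of the group's first member.
    """
    fragments = smiles.split('.')
    grouped = [[int(i) for i in g.split('.')] for g in groups.split(',')]
    pieces, consumed = [], set()
    for idx, frag in enumerate(fragments):
        if idx in consumed:
            continue
        group = next((g for g in grouped if idx in g), None)
        if group:
            # keep the whole compound together in the reaction string
            pieces.append('~'.join(fragments[i] for i in group))
            consumed.update(group)
        else:
            pieces.append(frag)
    return '.'.join(pieces)
```

For example, `merge_fragment_groups("CCO.[Na+].[OH-].CC(=O)Cl", "1.2")` returns `"CCO.[Na+]~[OH-].CC(=O)Cl"`, so the sodium hydroxide fragments can no longer drift apart in the string.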
We preprocessed both data sets to filter out incomplete reactions, keeping 1M and 2M entries, respectively. As done previously in [22, 61], we added 800k textbook reactions to the training of specific forward and retrosynthetic models.

The evaluation of retrosynthetic routes is a task for human experts. Unfortunately, every evaluation is tedious and difficult to scale to a large number of examples. Therefore, it is challenging to generate statistically relevant results for more than a few different model settings. By analogy with human experts, we propose to use a forward prediction model [12, 62] and a reaction classification model to assess the quality of the retrosynthetic predictions. They can not only predict products when given a set of precursors, but also estimate the likelihood of the corresponding forward reaction and provide its classification. Model scores have already been used as an alternative to human annotators to evaluate generative adversarial networks [63]. In our context, we define a retrosynthetic prediction as valid if the suggested set of precursors leads to the original product when processed by the forward chemical reaction prediction model (see Figure 3). In Section 4.3 we report more details on the assessment of the forward prediction model compared to human experts.

Here we introduce four metrics (round-trip accuracy, coverage, class diversity and the Jensen-Shannon divergence) to systematically evaluate retrosynthetic models.

The round-trip accuracy quantifies what percentage of the retrosynthetic suggestions is valid. This is an important evaluation, as it is desirable to have as many valid suggestions as possible.
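The round-trip check, together with the coverage metric defined in the text, reduces to a simple counting loop. In this sketch `forward_model` is a hypothetical stand-in for the trained forward Transformer (it maps a precursor string to its top-1 product), not the paper's actual API:

```python
def round_trip_and_coverage(predictions, forward_model):
    """predictions: {target_smiles: [precursor_set_smiles, ...]} from the
    retrosynthetic model. Returns (round_trip_accuracy, coverage):
    - round-trip accuracy: fraction of suggestions whose top-1 forward
      prediction recovers the original target;
    - coverage: fraction of targets with at least one valid suggestion."""
    n_suggestions = n_valid = covered = 0
    for target, precursor_sets in predictions.items():
        valid_here = 0
        for precursors in precursor_sets:
            n_suggestions += 1
            if forward_model(precursors) == target:  # round-trip check
                n_valid += 1
                valid_here += 1
        if valid_here:
            covered += 1
    round_trip = n_valid / n_suggestions if n_suggestions else 0.0
    coverage = covered / len(predictions) if predictions else 0.0
    return round_trip, coverage
```

A toy forward model can be a dictionary lookup, which makes the two metrics easy to verify by hand.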
This metric depends strongly on the number of beams: generating more outcomes with a larger beam search may lower the percentage of valid suggestions, as the additional beams tend to be of lower quality.

The coverage quantifies the fraction of target molecules with at least one valid disconnection. This metric prevents rewarding models that produce many valid disconnections for only a few targets, which would result in a small coverage: a retrosynthetic model should be able to produce valid suggestions for a wide variety of target molecules.

The class diversity is complementary to the coverage: instead of relating to targets, it counts the number of distinct reaction superclasses predicted by the retrosynthetic model, upon classification. A single-step retrosynthetic model should predict a wide diversity of disconnection strategies, i.e. generate precursors leading to the same product through reactions belonging to different reaction classes. Allowing a multitude of different disconnection strategies is beneficial for an optimal route search and is specifically important when the target molecule contains multiple functional groups.

Finally, the
Jensen-Shannon divergence, which is used to compare the likelihood distributions of the suggested reactions belonging to different classes above a threshold of 0.5, is calculated as follows:

JSD(P_1, P_2, ..., P_N) = H( (1/N) Σ_{i=1}^{N} P_i ) − (1/N) Σ_{i=1}^{N} H(P_i),    (1)

where P_i denotes the probability distribution for class i and H(P) the Shannon entropy of the distribution P.

To calculate the Jensen-Shannon divergence we split the reactions into superclasses and use the likelihoods predicted by the forward model to build a likelihood distribution within each class. This metric is crucial for assessing the model quality when building a meaningful sequence of retrosynthetic steps. In fact, a model with dissimilar likelihood distributions is analogous to a human expert who favours a few specific reaction classes over others: it introduces a bias favouring the classes with dominant likelihood distributions. While it is desirable to have a peaked distribution, as this is an evident sign of the model learning from the data, it is also desirable to have all the likelihood distributions equally peaked, with none of them exercising more influence than the others during the construction of the retrosynthetic tree. The inverse of the Jensen-Shannon divergence (1/JSD) is a measure of the similarity of the likelihood distributions among the different superclasses, and we use this parameter as an effective metric to guarantee uniform likelihood distributions among all possible predicted reaction classes.
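With equal weights for the N class distributions, the divergence of Eq. (1) can be computed in a few lines (a sketch; in the framework the P_i are histograms of forward-model likelihoods per superclass):

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(P) in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jensen_shannon_divergence(distributions):
    """H of the mean distribution minus the mean of the entropies,
    with equal weights 1/N. `distributions` is a list of N equally
    long lists, each summing to 1 over the same bins."""
    n = len(distributions)
    mixture = [sum(col) / n for col in zip(*distributions)]
    return (shannon_entropy(mixture)
            - sum(shannon_entropy(p) for p in distributions) / n)
```

Identical distributions give a divergence of 0 (maximal 1/JSD similarity), while completely disjoint distributions give the maximum value of log2 N.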
An uneven distribution may be connected to the nature of the predictive model and, most importantly, to the nature of the training data set. The combined use of these metrics paves the way for a systematic improvement of entire retrosynthetic frameworks, by properly tuning data sets that optimize the different single-step performance indicators in a multi-objective fashion. Additionally, it is also essential that the model produces syntactically valid molecules (grammatically correct SMILES). We check this with the open-source chemoinformatics software RDKit [64].

The forward prediction model was trained with the same hyperparameters as the original Molecular Transformer [22], apart from the size of the attention layers, which was increased from 256 to 384. Thanks to the increase in capacity, a higher validation accuracy could be reached. For the final model we used a data set derived from Pistachio 3.0 [49] in which all the molecules were inchified. As described in the work of Schwaller et al. [22], we augmented the training data with random SMILES and textbook reactions.

The forward prediction model can be used in two modes. First, when given a precursor set, the most likely products can be predicted. Second, when given a precursor set and a target product, the likelihood of this specific reaction can be estimated. In this work, we set the beam size of the forward model to 3. As described previously, we use the forward chemical prediction model as a digital domain expert for evaluating the correctness of the predictions generated by the retrosynthetic model. As recently published [22], the accuracy of this model is higher than 90% when compared with a public data set. To calibrate the forward prediction model within the entire retrosynthetic framework, 50 random forward reaction predictions were analyzed by human experts. The assessment gave an accuracy of 78%, which should be compared to an accuracy of 80% given by the trained model.
Although the data set is too limited to claim statistical relevance, this assessment offers strong evidence in favour of using the forward prediction model as a digital twin of human chemists.
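The syntactic-validity check mentioned above reduces, in practice, to asking RDKit to parse every predicted molecule (a minimal sketch; the helper name is ours):

```python
from rdkit import Chem  # open-source chemoinformatics toolkit [64]

def all_smiles_valid(precursor_smiles: str) -> bool:
    """Return True only if every '.'-separated molecule in a predicted
    precursor string is grammatically correct SMILES: RDKit returns
    None for strings it cannot parse and sanitize."""
    return all(Chem.MolFromSmiles(s) is not None
               for s in precursor_smiles.split('.'))
```

Predictions failing this check are discarded before any forward-model evaluation, so no compute is wasted on strings that do not encode a molecule.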
To classify reactions, we used a data-driven reaction classification model [53] trained similarly to the Molecular Transformer forward and retrosynthetic models. It is characterized by four encoder layers and one decoder layer, trained using the same hyperparameters. The main difference is that the inputs were made up of the complete reaction string (precursors → products) and the outputs of the reaction class identifier from NameRXN, consisting of three numbers corresponding to superclass, class/category and named reaction. More details on reaction classes can be found in [44]. The classification model used in this work matches the class assigned by the NameRXN tool [52] for 93.8% of the reactions.

A retrosynthetic tree is equivalent to a directed acyclic hyper-graph, a mathematical object composed of hyper-arcs (A) that link nodes (N). The main difference compared to a typical graph is that a hyper-arc can link multiple nodes, similar to what happens in a retrosynthesis: if a node represents a target molecule, the hyper-arcs connecting it to different nodes represent all possible reactions involving the corresponding molecules. Hyper-arcs have an intrinsic directionality, and their direction defines whether the reaction is forward or retro (see Figure 7). Similar to the construction of a dependency list in object-oriented programming languages, a retrosynthetic route is a simplified version of a hyper-graph, as its structure needs to be free of any loops. This requirement renders the retrosynthetic route a hyper-tree [65], in which the removal of any edge leads to two disconnected hyper-trees.
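A minimal rendering of this structure (our own naming, not the paper's implementation) stores one hyper-arc per reaction, linking a set of precursor nodes to a single product node:

```python
from dataclasses import dataclass, field

@dataclass
class HyperArc:
    """One reaction: a hyper-arc from several precursor nodes to one product."""
    precursors: frozenset        # reactant/reagent SMILES
    product: str                 # product SMILES
    forward_likelihood: float    # P(precursors -> product) from the forward model

@dataclass
class HyperGraph:
    nodes: set = field(default_factory=set)
    arcs: list = field(default_factory=list)

    def add_reaction(self, precursors, product, likelihood):
        self.nodes.add(product)
        self.nodes.update(precursors)
        self.arcs.append(HyperArc(frozenset(precursors), product, likelihood))

    def retro_expansions(self, target):
        """All hyper-arcs pointing at `target`, i.e. the retro direction."""
        return [a for a in self.arcs if a.product == target]
```

Traversing arcs from product to precursors gives the retro direction; the same arc read the other way is the forward reaction, mirroring the directionality of Figure 7.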
The hyper-tree, in which the root is the target molecule and the leaves are the commercially available starting materials, is an optimal structure to represent a retrosynthetic pathway (see Figure 8). If the hyper-graph of the entire chemical space were available, an exhaustive search could reveal all the possible synthetic pathways leading to a target molecule from defined starting materials. Here, instead of constructing a hyper-tree of all available reactions, we build the relevant portion of the hyper-tree on the fly: only the nodes and arcs expanding in the direction of the hyper-tree exploration strategy are calculated and added to the existing tree.
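The viability and chemo-selectivity filters applied during node expansion (with the 0.6 and 0.2 thresholds given in the text) can be sketched as a single predicate. Here `forward_model` is a hypothetical stand-in returning `(product, likelihood)` pairs sorted by likelihood, not the paper's actual API:

```python
def keep_precursor_set(precursors, target, forward_model):
    """Return True if a predicted precursor set survives both filters:
    viability (top-1 forward prediction recovers the target) and
    chemo-selectivity (top-1 likelihood beats top-2 by more than 0.2)."""
    ranked = forward_model(precursors)          # [(product, likelihood), ...]
    top1_product, top1_p = ranked[0]
    if top1_product != target:                  # viability fails
        return False
    if top1_p > 0.6:                            # likelihoods sum to one, so a
        return True                             # 0.2 margin over top-2 is implied
    top2_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return top1_p - top2_p > 0.2                # chemo-selectivity margin
```

Precursor sets rejected here either fail to reproduce the target or compete too closely with an alternative product, i.e. they would likely yield mixtures in practice.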
Figure 7: A generic reaction (top of the picture) can be represented as a hyper-graph. Each molecule involved in the reaction becomes a node in the hyper-graph, while the hyper-arc, connecting the reactants and reagents to the product, represents the reaction arrow.

Algorithm 1 provides an overview of the on-the-fly hyper-graph expansion strategy: given a starting node N, the graph is expanded by predicting the reactions and precursors (R_i) leading to the molecule N. The single-step retrosynthetic model uses a beam search to explore the possible disconnections, and we retain the top-15 predicted sets of precursors (thus, i = {1, 2, ..., 15}). The SMILES corresponding to these predictions are canonicalized and duplicate entries removed. Any SMILES that fails the canonicalization step or contains the target molecule is also removed. The remaining sets of precursors are further filtered by using the forward model to assess reaction viability and selectivity. Regarding viability, we retain only those precursors (R_i) whose top-1 forward model predictions match the molecule N. This guarantees that, in the presence of multiple functional groups, the recommended disconnection leads to the desired target. While this is a necessary condition, it is not a sufficient one, as competitive reactions (top-2 and following) may lead to a mixture of molecules different from the desired target. To enforce chemo-selectivity, we use the likelihood of the top-1 forward prediction and select only top-1 predictions with a likelihood larger than the subsequent top-2 by at least 0.2. As the sum of the likelihoods for the predictions of different sets of precursors (R_i) leading to a target N is one, any prediction likelihood higher than 0.6 automatically satisfies the requirements above and passes our filter. This filtering protocol increases the occurrence of chemo-selective reactions along the retrosynthetic path, penalizing disconnections that are highly competitive.

Figure 8: Example of hyper-graph complexity. The molecule H is the target (purple label). The red lines represent the synthetic path from commercially available precursors (highlighted in green) to the target molecule. The yellow line does not affect the retrosynthesis of H, and neither does the last reaction with black lines.

Moreover, precursor sets are clustered together to identify similar disconnection strategies and reduce tree complexity. Within the same cluster, the precursors with the highest forward prediction likelihood are used as starting nodes for further tree expansion. Every precursor molecule, unless already present in the graph, will generate a new node, and every reaction will connect each of the reactants to the target molecule by means of a new hyper-arc.

Every hyper-arc in the tree is scored with a so-called optimization score, which is used to define the "best" retrosynthetic route. The total score of a retrosynthetic pathway is calculated by multiplying the scores of all the arcs contained in the path. The score of a single arc is defined as:

S(C ⇒ A + B) = P(A + B → C) · s(A) · s(B) / s(C),    (2)

where S(C ⇒ A + B) denotes the score for a single retrosynthetic step (the higher the score, the higher the preference towards that step), P(A + B → C) is the likelihood of the forward chemical reaction computed by the forward prediction model, and s(X), X ∈ {A, B, C}, is the simplicity score of molecule X:

s(X) = 1 − (SC(X) − 1) / 4,    (3)

where SC(X) is the SCScore [47] of molecule X. The SCScore of a molecule increases from 1 to 5 with increasing complexity of the synthetic route.

Algorithm 1: Hyper-graph expansion algorithm
Data: existing node N, beam size B, retrosynthesis model, forward model
Result: new nodes connected to N
begin
    R = {R_i | i = 1..B} ← predict possible retrosynthesis steps (top-B)
    // the R_i are represented as SMILES
    for R_i ∈ R do  // select precursor sets for expansion
        R_i ← try to canonicalize R_i; discard if not canonicalizable
        discard R_i if N is a precursor in R_i
        L(R_i → N) ← compute likelihood of the reaction R_i → N
        if L(R_i → N) > 0.6 then
            attach R_i to N with a hyper-arc
        else
            F_top-1, F_top-2 ← predict top-2 forward reactions from R_i
            if the product of F_top-1 is N and Likelihood(F_top-1) > Likelihood(F_top-2) + 0.2 then
                attach R_i to N with a hyper-arc
            else
                discard R_i
end

A cycle may originate from one of the precursors (R_i) surviving the filtering or from the hyper-arcs generated by the expansion forming a loop in the tree. From a chemical point of view, this means that one of the precursors of the product requires the product to synthesize itself. Every time a pathway enters a cycle, the pathway itself is considered terminated. The tree exploration returns all the possible paths leading to a successful retrosynthesis, sorted by the optimization score.

References

[1] Suzuki, A. Recent advances in the cross-coupling reactions of organoboron derivatives with organic electrophiles, 1995–1998.
Journal of Organometallic Chemistry, 147–168 (1999).
[2] Ai, Y., Ye, N., Wang, Q., Yahata, K. & Kishi, Y. Zirconium/nickel-mediated one-pot ketone synthesis. Angewandte Chemie, 10931–10935 (2017).
[3] Liu, X., Li, X., Chen, Y., Hu, Y. & Kishi, Y. On Ni catalysts for catalytic, asymmetric Ni/Cr-mediated coupling reactions. Journal of the American Chemical Society, 6136–6139 (2012). PMID: 22443690.
[4] Corey, E. J. The logic of chemical synthesis: multistep synthesis of complex carbogenic molecules (Nobel Lecture). Angewandte Chemie International Edition in English, 455–465 (1991).
[5] Szymkuć, S. et al. Computer-Assisted Synthetic Planning: The End of the Beginning.
Angewandte Chemie (International ed. in English), 5904–5937 (2016).
[6] Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Central Science, 1237–1245 (2017).
[7] Schreck, J. S., Coley, C. W. & Bishop, K. J. M. Learning Retrosynthetic Planning through Simulated Experience. ACS Central Science, 970–981 (2019).
[8] Watson, I. A., Wang, J. & Nicolaou, C. A. A retrosynthetic analysis algorithm implementation. Journal of Cheminformatics, 1 (2019).
[9] Coley, C. W., Green, W. H. & Jensen, K. F. Machine Learning in Computer-Aided Synthesis Planning. Accounts of Chemical Research, 1281–1289 (2018).
[10] Fagerberg, R., Flamm, C., Kianian, R., Merkle, D. & Stadler, P. F. Finding the K best synthesis plans. Journal of Cheminformatics, 19 (2018).
[11] Lowe, D. AI designs organic syntheses. Nature, 592–593 (2018).
[12] Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI.
Nature, 604–610 (2018).
[13] Feng, F., Lai, L. & Pei, J. Computational Chemical Synthesis Analysis and Pathway Design. Frontiers in Chemistry, 199 (2018).
[14] Savage, J., Kishimoto, A., Buesser, B., Diaz-Aviles, E. & Alzate, C. Chemical Reactant Recommendation Using a Network of Organic Chemistry (ACM, New York, New York, USA, 2017).
[15] Segler, M. H. S. & Waller, M. P. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chemistry (Weinheim an der Bergstrasse, Germany), 5966–5971 (2017).
[16] Liu, B. et al. Retrosynthetic reaction prediction using neural sequence-to-sequence models.
ACS Central Science, 1103–1113 (2017).
[17] Masoumi, A., Soutchanski, M. & Marrella, A. Organic Synthesis as Artificial Intelligence Planning. In International Workshop on Semantic Web Applications and Tools for Life Sciences (SWATLS) (2013).
[18] Law, J. et al. Route Designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation. Journal of Chemical Information and Modeling, 593–602 (2009).
[19] Todd, M. H. Computer-Aided Organic Synthesis. ChemInform (2005).
[20] Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning.
Science, eaax1566 (2019).
[21] Schwaller, P., Gaudin, T., Lányi, D., Bekas, C. & Laino, T. "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci., 6091–6098 (2018).
[22] Schwaller, P. et al. Molecular Transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Central Science (2019).
[23] Kayala, M. A. & Baldi, P. ReactionPredictor: prediction of complex chemical reactions at the mechanistic level using machine learning. Journal of Chemical Information and Modeling, 2526–2540 (2012).
[24] Segler, M. H. S. & Waller, M. P. Modelling Chemical Reasoning to Predict and Invent Reactions. Chemistry (Weinheim an der Bergstrasse, Germany), 6118–6128 (2017).
[25] Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Central Science, 434–443 (2017). PMID: 28573205.
[26] Coley, C. et al. A graph-convolutional neural network model for the prediction of chemical reactivity.
Chem. Sci., 370–377 (2019).
[27] Gao, H. et al. Using Machine Learning To Predict Suitable Conditions for Organic Reactions. ACS Central Science, 1465–1476 (2018).
[28] Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. Ph.D. thesis, University of Cambridge (2012).
[29] Lowe, D. Chemical reactions from US patents (1976–Sep 2016) (2017).
[30] Grzybowski, B. A., Bishop, K. J. M., Kowalczyk, B. & Wilmer, C. E. The 'wired' universe of organic chemistry. Nature Chemistry, 31–36 (2009).
[31] Klucznik, T. et al. Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory.
Chem, 522–532 (2018).
[32] Zheng, S., Rao, J., Zhang, Z., Xu, J. & Yang, Y. Predicting retrosynthetic reaction using self-corrected transformer neural networks. arXiv preprint arXiv:1907.01356 (2019).
[33] Karpov, P., Godin, G. & Tetko, I. A transformer model for retrosynthesis (2019).
[34] Liu, X., Li, P. & Song, S. Decomposing retrosynthesis into reactive center prediction and molecule generation. bioRxiv (2019).
[35] arXiv preprint arXiv:1906.02308 (2019).
[36] Lee, A. A. et al. Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space. Chem. Commun. (2019).
[37] Duan, H., Wang, L., Zhang, C. & Li, J. Retrosynthesis with attention-based NMT model and chemical analysis of the "wrong" predictions. arXiv preprint arXiv:1908.00727 (2019).
[38] Thakkar, A., Kogej, T., Reymond, J.-L., Engkvist, O. & Bjerrum, E. J. Datasets and Their Influence on the Development of Computer Assisted Synthesis Planning Tools in the Pharmaceutical Domain (2019).
[39] de Almeida, A. F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nature Reviews Chemistry, 1–16 (2019).
[40] Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. & Grzybowski, B. A. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie (International ed. in English), 8108–8112 (2014).
[41] Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 31–36 (1988).
[42] Molecular Transformer. URL https://github.com/pschwllr/MolecularTransformer (Accessed Jul 29, 2019).
[43] IBM RXN for Chemistry. URL https://rxn.res.ibm.com (Accessed Oct 10, 2019).
[44] Schneider, N., Lowe, D. M., Sayle, R. A., Tarselli, M. A. & Landrum, G. A. Big data from pharmaceutical patents: a computational analysis of medicinal chemists' bread and butter. Journal of Medicinal Chemistry, 4385–4402 (2016).
[45] Anonymous. Molecular graph enhanced transformer for retrosynthesis prediction. In Submitted to International Conference on Learning Representations (2020). Under review.
[46] Anonymous. Learning to make generalizable and diverse predictions for retrosynthesis. In
Submitted to International Conference on Learning Representations (2020). Under review.
[47] Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. SCScore: Synthetic complexity learned from a reaction corpus. Journal of Chemical Information and Modeling, 252–261 (2018).
[48] Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics, 33 (2017).
[49] NextMove Software Pistachio. (Accessed Jul 29, 2019).
[50] Lin, J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 145–151 (1991).
[51] Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. Journal of Cheminformatics, 23 (2015).
[52] NextMove Software NameRXN. (Accessed Jul 29, 2019).
[53] Schwaller, P., Vaucher, A., Nair, V. H. & Laino, T. Data-Driven Chemical Reaction Classification with Attention-Based Neural Networks (2019).
[54] Lednicer, D. & Mitscher, L. A. The organic chemistry of drug synthesis. 2. A Wiley-Interscience publication (Wiley, New York, 1980). OCLC: 310877189.
[55] Worthington, P. A.
Synthesis and Fungicidal Activity of Triazole Tertiary Alcohols, chap. 27, 302–317. URL https://pubs.acs.org/doi/abs/10.1021/bk-1987-0355.ch027.
[56] Cotton, H. et al. Asymmetric synthesis of esomeprazole. Tetrahedron: Asymmetry, 3819–3825 (2000).
[57] Larrow, J. F., Roberts, E. & Jacobsen, E. N. (1S,2R)-1-Aminoindan-2-ol. Organic Syntheses, 46 (1999).
[58] Crowther, A. F. & Smith, L. H. β-Adrenergic blocking agents. II. Propranolol and related 3-amino-1-naphthoxy-2-propanols. Journal of Medicinal Chemistry, 1009–1013 (1968). PMID: 5697060.
[59] Schneider, N., Stiefl, N. & Landrum, G. A. What's what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 2336–2346 (2016).
[60] O'Boyle, N. M. Towards a universal SMILES representation - a standard method to generate canonical SMILES based on the InChI. Journal of Cheminformatics, 22 (2012).
[61] Nam, J. & Kim, J. Linking the neural machine translation and the prediction of organic chemistry reactions. arXiv preprint arXiv:1612.09529 (2016).
[62] Satoh, H. & Funatsu, K. SOPHIA, a knowledge base-guided reaction prediction system - utilization of a knowledge base derived from a reaction database. Journal of Chemical Information and Computer Sciences, 34–44 (1995).
[63] Salimans, T. et al. Improved techniques for training GANs. In
Advances in Neural Information Processing Systems, 2234–2242 (2016).
[64] Landrum, G. et al. rdkit/rdkit: 2019_03_4 (Q1 2019) release (2019). URL https://doi.org/10.5281/zenodo.3366468.
[65] Nieminen, J. & Peltola, M. Hypertrees. Applied Mathematics Letters 12.