Graph Transformation for Enzymatic Mechanisms

Abstract

Motivation: The design of enzymes is as challenging as it is consequential for making chemical synthesis in medical and industrial applications more efficient, cost-effective and environmentally friendly. While several aspects of this complex problem are computationally assisted, the drafting of catalytic mechanisms, i.e. the specification of the chemical steps (and hence intermediate states) that the enzyme is meant to implement, is largely left to human expertise. The ability to capture specific chemistries of multi-step catalysis in a fashion that enables its computational construction and design is therefore highly desirable and would equally impact the elucidation of existing enzymatic reactions whose mechanisms are unknown.

Results: We use the mathematical framework of graph transformation to express the distinction between rules and reactions in chemistry. We derive about 1000 rules for amino acid side chain chemistry from the M-CSA database, a curated repository of enzymatic mechanisms. Using graph transformation we are able to propose hundreds of hypothetical catalytic mechanisms for a large number of unrelated reactions in the Rhea database. We analyze these mechanisms to find that they combine in chemically sound fashion individual steps from a variety of known multi-step mechanisms, showing that plausible novel mechanisms for catalysis can be constructed computationally.


Jakob L. Andersen, Rolf Fagerberg, Christoph Flamm, Walter Fontana, Juraj Kolčák, Christophe V.F.P. Laurent, Daniel Merkle, and Nikolai Nøjgaard

Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
Department of Theoretical Chemistry, University of Vienna, Vienna, Austria
Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, USA


Availability and Implementation:

We provide a demo version of our approach online at https://cheminf-live.imada.sdu.dk/mechsearch/. The code of the initial prototype is available upon request.

Contact: [email protected]

Supplementary information:

Supplementary data are available at https://cheminf.imada.sdu.dk/preprints/ECCB-2021

arXiv [q-bio.MN]

Introduction

Since the advent of the digital revolution, the vast repertoire of chemical knowledge has become accessible through a growing number of repositories. The utility of such warehousing hinges on computational tools for searching and aggregating content and for exploring its consequences. Indeed, many tools exist for reasoning about molecules and their reactions at the level of symbolic chemistry [Todd, 2005, Cook et al., 2012, Segler and Waller, 2017]. Likewise, software implementing quantum and classical methods enables the study of configurational energy landscapes that undergird symbolic chemistry [Welborn and Head-Gordon, 2019].

The toolbox dwindles, however, when it comes to reasoning about the chemistry of reaction networks. While computational and mathematical infrastructure exists for studying the kinetics or flux balance of networks, there is little in the way of systematically constructing such networks while taking into account chemical possibilities. Tools aimed at the kinetics of chemical reaction networks are cast in terms of variables that refer to concentrations, but their dynamics alone cannot introduce new components beyond those explicitly specified at the outset. To extend a network requires an executable representation of actual chemistry.

The construction, and therefore also the design, of chemical networks is made possible by the notion of a chemical rule, which is distinct from a chemical reaction. In a reaction, molecules are completely specified, whereas a rule makes explicit only those aspects of molecules that are necessary for a reaction to occur at the level of abstraction defined by symbolic chemistry. A rule is a schema that represents the transformation of an educt pattern into a product pattern. Given completely specified educt molecules, a rule generates possible reactions for those molecules that contain its educt pattern.
Since symbolic chemistry represents molecules as typed graphs, the formal domain of graph transformation [Ehrig et al., 1973, Habel et al., 2001, Ehrig et al., 2006] seems to be the natural foundation for implementing the distinction between reactions and rules.

The distillation of rules from large catalogs of reactions would open the door to the iterative construction and design of networks by repeated application of specifically chosen rules. Rule collections with a formally sound application semantics make chemical knowledge “executable”, but the realization of this notion in the form of computational tools is challenging.

In this contribution we provide an initial example towards realizing this vision. A key component is an open-source software platform, known as MØD, for specifying and iteratively applying chemical rules [Andersen et al., 2013, 2016]. We deploy this platform on rules that we derive from a database of hand-curated mechanisms of enzymatically catalyzed reactions known as the “Mechanism and Catalytic Site Atlas” or M-CSA [Ribeiro et al., 2017].

Our specific focus, thus, is on the design of enzymatic catalysis. Such design is of significance in a range of applications from addressing disease to shifting chemical industry towards more sustainable, waste minimizing, and environmentally friendly production processes [Zimmerman et al., 2020, Pleissner and Kümmerer, 2020, Schrittwieser et al., 2018]. Chemical networks are central to this goal because almost all catalysis rests on a network-based mechanism, despite informal language often referring to a singular agent (the catalyst). Specifically, at the catalytic site of an enzyme several reaction steps combine into a network in such a way that upon completion of the overall transformation from substrates into products each protein component has regained the same chemical state it had initially. This requires the network to be a cycle. Cyclical network catalysis also occurs at larger scales.
For example, the citric acid cycle at the core of modern biological metabolism acts as a network catalyst regardless of the fact that its individual steps are also catalyzed by enzymes.

Designing a full enzyme requires attention to structure, which controls specificity and provides a stable niche that guarantees a proper causal ordering of the catalytic steps. While the network view does not address structure, it underscores that designing an enzyme also requires designing a multi-step catalytic process. The computational implementation of this view through graph transformation should lend further credence to the computer assisted design of enzymes [Welborn and Head-Gordon, 2019].

Our contribution is organized as follows. We first proceed by making a subset of the M-CSA executable as graph transformation rules. We then exemplify the utilization of such rules by constructing proposals for catalytic network mechanisms for some reactions in the Rhea database [Lombardot et al., 2019], a database unrelated to the M-CSA that also contains enzymatic reactions, mostly listed without suggestions for an underlying network mechanism.

Like any other reaction, an enzymatic reaction is usually expressed as the conversion of educt molecules (substrates) into products by means of a specific protein functioning as catalyst. Biology would quite literally be unthinkable without catalysis. For example, the spontaneous decarboxylation of arginine occurs roughly with a rate constant of $2 \times 10^{-17}\,\mathrm{s}^{-1}$ (a half-life of about 1.1 billion years), whereas the Escherichia coli arginine decarboxylase has a reported 7 × […] fold catalytic proficiency [Snider and Wolfenden, 2000].

Enzymatic catalysis, however, is oftentimes not a single event but involves multiple steps that together constitute a catalytic mechanism. Each of these steps can be seen as an elementary reaction in which components of the substrate, amino acids, and possibly cofactors (such as flavin adenine dinucleotide), react to a stable intermediate state that becomes the input to the subsequent step, eventually resulting in the formation of the product and regeneration of the catalyst. While the packaging of a cyclic reaction network within a large protein warrants referring to enzymatic catalysis as if it were a single event, it is essential for our purpose to unpack catalysis into a detailed mechanism supporting the overall reaction. This mechanism not only transforms a substrate into a product but must also guarantee that any protein components deployed in the process regain their initial state upon completion.

For clarity we fix terminology as follows. The phrase “reaction step” (or “step” for short) is used to denote a reaction judged elementary, that is, further indivisible, by M-CSA contributors. Moreover, we use the phrase “reaction mechanism” to refer to a causal succession of steps and the phrase “overall reaction” as the chemical sum over a mechanism.
“Reaction” can mean any of the above, depending on the context.

The “reaction center” of a step refers to the atoms that undergo an electronic displacement or whose bonds are rearranged by the reaction step (boldface atoms in Figure 1); see also Section 2.2. We refer to everything else that stays fixed and is not in the reaction center as the “context” of a step.

We further classify an amino acid explicitly mentioned in a step as either active or passive depending on whether it intersects the reaction center or not. Although passive amino acids are not subject to bond rearrangements, they are nonetheless deemed critical for efficient catalysis by contributing to the physicochemical properties of the catalytic pocket, such as establishing charge distributions and spatial constraints.

The M-CSA [Ribeiro et al., 2017] contains overall reactions for a collection of 964 representative enzymes (so-called entries in M-CSA terminology). Of these reactions, 684 are listed with at least one manually curated step-by-step description of the catalytic mechanism that converts reactants into products. The validity of the mechanisms considered for inclusion is judged on the basis of direct experimental evidence and observations explained by it. This can result in the inclusion of several reaction mechanisms for the same enzymatic overall reaction. In total, the M-CSA provides 818 detailed reaction mechanisms.

The reaction mechanisms in the M-CSA include an English description, based on direct or indirect experimental evidence, of the catalytic process in terms of active and passive amino acids as well as any cofactors involved. A reaction step is formally expressed as an arrow pushing diagram describing the displacement of electrons. Figure 1A depicts a step as obtained from the M-CSA.

The M-CSA aims at providing a representative (non-redundant) set of all known enzymatic reactions, hence resulting in a collection of mechanisms that are considerably dissimilar.
Yet, many of the individual steps across different mechanisms appear similar or outright identical to each other if one were to restrict the context. This suggests that enzymatic mechanisms are composed of identifiable building blocks best described by rules at the step level. Once identified, such rules could be used to construct mechanisms for reactions other than those in the M-CSA.

To make reaction knowledge executable, a formal framework is needed. Since for many purposes symbolic chemistry represents molecules as connected graphs that are undirected and typed, the formalism of graph transformation (or graph rewriting) appears to be an appropriate choice. Graph transformation is an extension of the idea of term rewriting to graphs and has a well-developed foundation in category theory [Ehrig et al., 1973, 2006].

In a molecule graph, a typed node represents an atom with a charge and a typed edge represents a bond of a certain order (single, double, triple, aromatic). A collection of molecules as a whole is then represented as a disconnected graph, whose connected components represent the individual molecules. We refer to such a graph as a “state graph” (or “state” for short). Graph transformation is about defining rules by which one state can be transformed into another. The idea of a rule is to specify the transformation of a graphical input pattern into an output pattern and to carry this transformation over to the state if it contains the input pattern.

Figure 1: The relationship between a chemical step and a graph transformation rule. (A) The panel is adapted from step 1 in mechanism 1 of M-CSA ID: 337 and shows the initial reaction step performed by the protein-glutamate methylesterase (CheB) (UniProtKB: P04042, EC 3.1.1.61, EC 3.5.1.44). Aspartate increases the pKa of the histidine imidazole ring by forming a hydrogen bond to the histidine Nε. This turns histidine into a powerful base and hence activates the serine residue. The latter is then able to perform a nucleophilic attack on the glutamine methyl-ester substrate. The left-hand side illustrates the electron movement from the histidine Nδ atom via serine to the O atom in the carbonyl in the substrate ester group. The right-hand side of the arrow shows the resulting intermediates. Electronic displacements are shown as arrow pushes in magenta. For details on gray shades refer to the caption of panel B. (B) This panel shows the graph transformation rule derived from panel A following our heuristic guidelines (Section 2.3). Explicit hydrogen atoms have been added for clarity. The bonds and nodes that change from L to R are emphasized in color (red for bonds, blue for nodes). The reaction center is shown in boldface. The rule asserts that those parts that are grayed out in panel A constitute molecular elements that have no bearing on the chemical transformation. Structures were downloaded from the M-CSA.

In more formal terms, a rule (a double-pushout rule in the jargon of graph rewriting) is a span $p = (L \xleftarrow{\,l\,} K \xrightarrow{\,r\,} R)$, where L and R indicate the left and right pattern, respectively. K is the invariant graph, containing elements common to L and R. The correspondence between atoms in L and R (the atom map) is specified by the injections l and r. The transformation from L into R is then given by breaking those bonds in L that are not in K and forming those bonds in R that are not in K. The atoms in L and R also carry modifiable state, such as charge (blue font in Figure 1B).

A rule must at the very least specify all the bonds that are broken and formed in L and R, respectively, or that undergo order modification. We call such a minimal rule the action. The action is the formalization of the reaction center mentioned in the previous section. A “maximal rule” (or “reaction rule”) is one in which L and R are completely specified educt and product molecules.
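
To make the span notation concrete, the following is a minimal sketch of a rule as three bond sets over shared atom identifiers. It is not MØD's data model: the class `Rule`, the helper `bonds`, and the toy proton-transfer rule are hypothetical names introduced here for illustration, and sharing atom identifiers between L, K and R is an assumption that turns the injections l and r into identities.

```python
from dataclasses import dataclass

# A bond set maps an unordered pair of atom ids to a bond order.
Bonds = dict[frozenset[str], int]


def bonds(*triples: tuple[str, str, int]) -> Bonds:
    """Build a bond set from (atom, atom, order) triples."""
    return {frozenset((a, b)): order for a, b, order in triples}


@dataclass
class Rule:
    """Sketch of a span L <- K -> R over shared atom ids (an assumption)."""
    left: Bonds                   # bonds of the left pattern L
    invariant: Bonds              # bonds of K, unchanged between L and R
    right: Bonds                  # bonds of the right pattern R
    charge_left: dict[str, int]   # atom charges in L
    charge_right: dict[str, int]  # atom charges in R

    def bonds_broken(self) -> Bonds:
        # Bonds of L that are not preserved (with the same order) in K.
        return {b: o for b, o in self.left.items() if self.invariant.get(b) != o}

    def bonds_formed(self) -> Bonds:
        # Bonds of R that are not preserved (with the same order) in K.
        return {b: o for b, o in self.right.items() if self.invariant.get(b) != o}

    def reaction_center(self) -> set[str]:
        # Atoms touched by a broken or formed bond, or whose charge changes.
        center: set[str] = set()
        for pair in list(self.bonds_broken()) + list(self.bonds_formed()):
            center |= pair
        center |= {a for a, q in self.charge_left.items()
                   if self.charge_right.get(a) != q}
        return center


# Toy rule: a proton moves from an O-H group to a nitrogen atom.
# C1 is context: it refines the pattern but is not part of the action.
rule = Rule(
    left=bonds(("C1", "O1", 1), ("O1", "H1", 1)),
    invariant=bonds(("C1", "O1", 1)),
    right=bonds(("C1", "O1", 1), ("N1", "H1", 1)),
    charge_left={"C1": 0, "O1": 0, "H1": 0, "N1": 0},
    charge_right={"C1": 0, "O1": -1, "H1": 0, "N1": 1},
)

context_atoms = set(rule.charge_left) - rule.reaction_center()
```

In this toy rule the O1-H1 bond is broken, the N1-H1 bond is formed, and C1 remains as context. Deleting C1 and its bond from all three bond sets would leave the bare action.
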
The context C is given by all nodes and bonds that refine (add invariant detail to) the action. An action, thus, has an empty context and a maximal rule contains as much context as needed for L and R to specify molecular species. By varying the context between these two extremes we can construct a variety of rules that are all refinements of the same action.

Recall that a state G is a typically disconnected graph whose components are molecules. A rule transforms one state into another by application. A rule application consists in embedding L into the host graph G and replacing the subgraph of G selected by the embedding with R, while respecting the atom map given by l and r. For example, in Figure 1, the rule shown in panel B is applied to the molecules on the left of the reaction arrow in panel A, yielding the depicted reaction. By applying a rule to a state in general and not just to state components it actually modifies (the educts) we let the state act as a context for successive applications of rules, i.e. a mechanism. This will simplify how we present our strategy in Section 2.4.

Like any reaction, a step can be viewed as the special case of a maximally refined rule. Such a rule consists of all molecular species, including passive ones, that the M-CSA curators deemed to be necessary for the complete documentation of the step. Maximal rules are probably overspecified because the context includes molecular parts that are unlikely to all be necessary for setting off the chemical transformation. In particular, rules that mirror reactions are unlikely to overlap with one another in a way that permits their composition. Their L-patterns are unlikely to be embeddable in anything other than the molecular species represented by L itself.

By decreasing the context C of a maximally refined rule we might better capture the chemistry that drives a reaction and decrease the chemical specificity of a rule in a fashion that provides more opportunities for composition.
This would facilitate the construction of reaction networks. Clearly, by going to the extreme of emptying C, thus retaining the action alone, we might misrepresent the chemistry and increase compositionality too indiscriminately.

Given a collection of partially redundant reaction examples, a major challenge is to devise statistical methods for identifying the right amount of context to be enshrined in a rule that abstracts a reaction class. The M-CSA, however, is not the right collection for such an endeavor because its objective is to provide maximal non-redundant coverage of distinct enzymatic mechanisms.

For the present purpose we use, therefore, heuristics that are also aimed at keeping the combinatorial explosion of the state space generated by the repeated application of rules in check. Since educt combinations of molecular species often contain upwards of 100 atoms, we require a relatively large context to keep the state space (Section 2.4) manageable. At the same time, we try to limit the extent of context, especially context originating from the substrate of an enzymatic reaction, by invoking patterns commonly used in chemistry. These considerations led us to devise three guidelines for crafting the context C of a step. We then assemble a rule by adding C to the action.

1. Local topology: We assume that the immediate surroundings of the reaction center are a significant driver of the reaction. Hence, all atoms and bonds that are directly connected to the reaction center are retained in the context. For instance, the immediate surroundings allow us to distinguish between reactions acting on a carbon chain or a methyl group.

2. Functional patterns: We compiled a set of 157 chemical patterns based on functional groups, cycles, and small molecules common in organic chemistry (e.g. carboxylic acid, imidazole, water). In addition, the set is adapted to the M-CSA by including minor variations in the patterns that were observed in the substrates utilized by the enzymes documented in the M-CSA. The chemical patterns are then embedded into both the reactant and product graphs of a reaction step. If a match intersects the reaction center, all the atoms and bonds of the pattern are included in the invariant graph K. A list of chemical patterns can be found in the supplementary data.

3. Active amino acids: We posit that the active amino acids are crucial for driving a reaction step. If any part of an amino acid intersects the reaction center, the whole amino acid side-chain is included in the context.

A reaction step in arrow pushing notation and the rule inferred on the basis of these guidelines are shown in Figure 1.

Our rules do not include any molecules that do not share at least one atom with the reaction center. This may well misrepresent the chemistry, because aspects of these molecules could be necessary for allowing the reaction to proceed. For example, they might act as electrostatic stabilizers, shift pKa values, provide hydrogen bonds or guide the reaction sterically. None of these properties can be represented explicitly in the graph transformation framework. They could, however, be represented implicitly, precisely by including the molecular parts responsible for these properties in the L and R patterns. Through the L pattern they act as necessary (matching) conditions for the rule to apply. Although the level of abstraction defined by graph transformation cannot represent the physical causes of a reaction directly, it can be informed by them. This suggests that molecular dynamics, quantum mechanical calculations, or more phenomenological approaches, including machine learning, that are capable of determining the molecular patterns required for a reaction will be useful in defining the content of a rule. By shaping a rule, this information becomes executable. However, an augmentation of rule construction of this sort is beyond the present scope. Here, we simply argue heuristically.

The proposed coarsening of a reaction step into a rule can be readily applied to any step in the M-CSA. Because the objective of the M-CSA is to non-redundantly cover known enzymatic reaction mechanisms, its mechanisms are diverse and include steps that are fairly complex and perhaps overly specific or rare.
This is ill suited to creating a rule set that can be applied to find catalytic mechanisms for a variety of reactions. Moreover, our present framework cannot handle some of the complexity present in the M-CSA and we therefore separate out the steps we can handle.

First, we exclude reaction steps that rely on metal ions as a cofactor. Metal ions frequently interact with molecules through coordination bonds. This type of covalent bond comprises two electrons originating from only one atom. At the current state of development, our graph model for chemistry is unable to represent the chemistry of coordination bonds.

Second, we exclude reaction steps from mechanisms that rely on radicals and single electron jumps. This kind of electron movement is also frequently associated with metal ion chemistry.

The presentation of mechanisms in the M-CSA targets human readers. As a consequence, across the presentation of steps of a mechanism some molecules might appear, disappear, or change abstraction level. While such editing is beneficial for the purpose of visualization, it is detrimental for rule construction. In particular, it hampers correct tracking of individual amino acids in the event of covalent bond formation with the substrate. Thus, thirdly, we submit all mechanisms to a sanity check by tracking atoms across the full complement of molecules mentioned across their steps. Many of the changes are easy to detect and fix by propagating the disappearing or appearing molecules across the mechanism. However, if our atom tracking fails, we exclude all the steps of the mechanism.

Taking this filtering into account, we obtain a total of 1083 different rules derived from reaction steps across 471 M-CSA mechanisms. For 368 of those mechanisms, all steps are used for rule construction, making them fully reproducible by our rule set. The majority of disqualified reaction steps are dependent on metal ions.
We consider this limitation acceptable, especially since we focus on the chemistry of catalysts that do not require cofactors.

We can link the rule set thus obtained to the chemical process classification tags of each step provided by the M-CSA. Each of the five most common chemical processes in the M-CSA is instantiated by at least 49 % of reaction steps used in the construction of our rule set. Specifically, proton transfer is represented with 58 . […]

In the previous section we described our procedure for converting a reaction step in the M-CSA into a graph transformation rule. In this section we focus on how a collection of rules so obtained can be used to reconstruct M-CSA mechanisms or propose new ones for reactions not included in the M-CSA.

Recalling Section 2.2, the term “state” refers to a graph where each connected component represents a molecule. Any particular molecule can occur in multiple instances to accommodate stoichiometry. Moreover, since reactions conserve mass, all states have an invariant atomic composition. We first need to define the states that are reachable with a given set of rules $\mathcal{R}$ from an initial state G. For this we need a bit of notation.

For a rule $p \in \mathcal{R}$ to apply to a state G, the left-hand pattern L of the rule must embed in G. We refer to such an embedding as a “match” and define the set of all possible matches of a given rule in state G as $M(p, G)$. We write $G \stackrel{p,m}{\Longrightarrow} H$ for the transformation of G into H by rule $p \in \mathcal{R}$ using the match $m \in M(p, G)$. Each such rule application is a “direct derivation” which we call a transition between states. Recall from Section 2.2 that G is the full complement of molecules as it might have been formed by prior rule applications. It may include molecules that are not altered by rule p. A state G can thus be thought of as the test-tube mixture in the context of which a rule application (a reaction step) takes place to produce a new state H.
Restricting a state to only components that are matched by m then corresponds to the usual notion of a reaction (a “proper derivation” [Andersen et al., 2014]).

Aggregating all H that can be generated by all rules in $\mathcal{R}$ using all possible matches onto G, we obtain the set of states that constitute a 1-step extension of G. In symbols, we define the k-step extension of G as

$$\mathcal{G}_0 = \{G\}, \qquad \mathcal{G}_{k+1} = \bigl\{\, H \bigm| \exists\, G' \in \mathcal{G}_k,\ p \in \mathcal{R},\ m \in M(p, G') : G' \stackrel{p,m}{\Longrightarrow} H \,\bigr\} \cup \mathcal{G}_k .$$

$\mathcal{G}_k$, together with the transitions for all p and m, define the reachable state space $S_k$ at depth k. Specifically, $S_k$ is a directed graph with the node set $\mathcal{G}_k$ and an edge $(G, H)$ between nodes $G, H \in \mathcal{G}_k$ if the transition $G \stackrel{p,m}{\Longrightarrow} H$ exists for some rule p and match m.

A further notion is that of a trace $\tau$, which is a sequence of transitions that transforms the initial educt state $G_I \equiv G$ into the final product state $G_F$ within a state space S. A trace thus carves a path from $G_I$ to $G_F$ in S; it corresponds to a (match-)specific composition of rules that generate the transitions in $\tau$. We write $G_I \stackrel{\tau}{\Longrightarrow} G_F$ to denote the transformation of state $G_I$ into state $G_F$ by $\tau$. Since transitions represent reaction steps, $\tau$ represents a mechanism for transforming the educt molecules in $G_I$ into product molecules in $G_F$. In principle, a trace could take many detours in going from $G_I$ to $G_F$. In the present setting it makes sense to eliminate such meandering by stipulating that $\tau$ be a minimal path from $G_I$ to $G_F$ in S.

Since we aim at catalytic reactions we require certain molecules, represented by graph A, to be a subgraph of both $G_I$ and $G_F$. To line up with intuition, we write $G_I = E \oplus A$ and $G_F = P \oplus A$, where $\oplus$ is the disjoint graph union. We then speak of an A-catalyzed transformation trace $\tau_A$ of educts E into products P, in symbols $E \stackrel{\tau_A}{\Longrightarrow} P$. Our task, therefore, is to identify a possible mechanism $\tau_A$ within the state space $S_k$ constructed from the initial state $G_I$. The restriction to $S_k$ means that the mechanisms cannot be longer than k steps.

The difficulty, of course, is in searching for potential mechanisms in a state space $S_k$ whose size is beyond astronomic even for modest k. Although the M-CSA contains mechanisms of up to 20 steps in length, these appear to be outliers. Most M-CSA mechanisms are relatively short, averaging 3 . […] $S_k$ to some small iteration depth k.

In order to improve our chances of finding a $\tau_A$ we leverage the information in both E and P when constructing the state space $S_k$. To find a $\tau_A$ such that $E \stackrel{\tau_A}{\Longrightarrow} P$ at depth k, the construction of $S_k$ starts from $G_I$ and we make this explicit by writing $S_k(G_I)$.
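
The k-step extension and the search for a minimal trace can be illustrated with a small, self-contained sketch. It deliberately replaces graph matching with a toy model in which a state is a sorted tuple of molecule names and a rule consumes and produces whole molecules. The functions `make_rule`, `expand` and `minimal_trace`, and the molecule names, are assumptions made for illustration and are not part of MØD or of the implementation described in this paper.

```python
from collections import deque


def make_rule(consumed, produced):
    """Toy rule on multisets of molecule names (a stand-in for graph matching)."""
    def apply(state):
        remaining = list(state)
        for name in consumed:
            if name not in remaining:
                return []  # the left-hand pattern has no match in this state
            remaining.remove(name)
        # One successor per match; this toy rule has at most one match.
        return [tuple(sorted(remaining + list(produced)))]
    return apply


def expand(initial, rules, k):
    """Build S_k(initial) as a dict: state -> list of (rule name, successor)."""
    edges = {initial: []}
    frontier = [initial]
    for _ in range(k):
        next_frontier = []
        for state in frontier:
            for name, rule in rules.items():
                for succ in rule(state):
                    edges[state].append((name, succ))
                    if succ not in edges:
                        edges[succ] = []
                        next_frontier.append(succ)
        frontier = next_frontier
    # Transitions leaving the last layer count only if they stay inside G_k.
    for state in frontier:
        for name, rule in rules.items():
            for succ in rule(state):
                if succ in edges:
                    edges[state].append((name, succ))
    return edges


def minimal_trace(edges, source, target):
    """Breadth-first search for a shortest trace; returns None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        state = queue.popleft()
        if state == target:
            trace = []
            while parent[state] is not None:
                previous, name = parent[state]
                trace.append((name, state))
                state = previous
            return trace[::-1]
        for name, succ in edges.get(state, []):
            if succ not in parent:
                parent[succ] = (state, name)
                queue.append(succ)
    return None


# Toy catalysis: Glu- (the catalyst A) abstracts a proton and returns it later.
RULES = {
    "deprotonate": make_rule(["SH", "Glu-"], ["S-", "GluH"]),
    "rearrange": make_rule(["S-"], ["P-"]),
    "protonate": make_rule(["P-", "GluH"], ["PH", "Glu-"]),
}
G_I = tuple(sorted(["SH", "Glu-"]))  # E + A
G_F = tuple(sorted(["PH", "Glu-"]))  # P + A

space = expand(G_I, RULES, 3)
trace = minimal_trace(space, G_I, G_F)
```

In this toy setting the catalyst is present in both the initial and the final state, and the three-step trace is found only once the expansion depth reaches 3. A real implementation would enumerate subgraph matches of each L in the state graph instead of testing for whole molecule names.
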
However, we need to be certain that the state $G_F$ is contained in the node set of $S_k(G_I)$, that is, $G_F$ must be reachable from $G_I$. We can ensure this by exploiting the invertibility of graph transformation rules in the double pushout framework [Ehrig, 1979]. Specifically, we invert all rules $p \in \mathcal{R}$ by swapping their left-hand pattern $L_p$ and right-hand pattern $R_p$, to obtain a rule that transforms the pattern $R_p$ into $L_p$. We then join $S_k(G_F)$, after inversion of all transitions, with $S_k(G_I)$ on the shared parts of the underlying states to obtain a state space that is guaranteed to contain all paths from $G_I$ to $G_F$ of length at most 2k, while only exploring each state space to a depth of k. Such a combination still results in a graph with possibly numerous “dead ends”, that is, states that are reachable only from $G_I$ or only from $G_F$. We remove such states in order to obtain a succinct representation and refer to the resulting state space as the relevant state space. The relevant state space can be envisioned as a flow with a single source $G_I$ and a single drain $G_F$.

The approach described in Section 2 has been implemented in Python with the more computationally intensive tasks in C++. In constructing a state space we rely on MØD [Andersen et al., 2016] for efficient graph transformations and on NetworkX [Hagberg et al., 2008] for general graph algorithms. During the conversion of steps to rules, we use the Marvin Molecule File Converter 20.20.0 (ChemAxon Ltd., https://chemaxon.com/) for adding explicit hydrogen atoms to the Marvin files downloaded from the M-CSA database. MarvinSketch 20.20.0 (ChemAxon Ltd.) was used to draw the molecules in the figures.

To test the practicality of our approach, as well as the generality of the constructed rule set, we use reaction data provided by Rhea [Lombardot et al., 2019].
Rhea is an expert-curated database containing information about reactions of biological interest, many of which are enzymatic and obtained from peer-reviewed literature.

Most Rhea reactions are not annotated with detailed catalyst information. While one or more proteins might be mentioned, the catalytic sites are generally unknown or not reported. The reactions provided by Rhea are therefore suitable for testing our ability to propose new mechanisms using the rules we constructed from the M-CSA.

                 Glu   His   Asp   Cys   Lys   Thr
Rhea             624   421   226    50     5     1
Rhea (unique)    135    28    73     4     4     1
M-CSA (global)   579   660   621   280   407   157
M-CSA (single)    25    34    20     7    14     4

Table 1: Comparison of amino acid utilization frequency. The first row specifies the total number of reactions for which a mechanism using the specified amino acid was discovered. The second row counts the reactions for which all the discovered mechanisms relied on the particular amino acid. The third row lists the total number of incidences across the entire M-CSA database. The fourth row counts all the M-CSA mechanisms represented by our rule set that only use a single active amino acid, of the type specified.

The choice of catalysts for a reaction is part of the mechanism prediction process. In general, enzymatic reactions rely on a combination of amino acid side chains and possibly cofactors. As mentioned in Section 2.3, to curtail the combinatorial explosion we limit ourselves to catalysis that employs only amino acids. To this end, we identify 26 tautomers of 17 amino acids commonly used within the M-CSA database (supplementary data). Alanine, glycine and proline were not included.
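
The construction of the relevant state space from Section 2.4 (a forward expansion from the educt state, a backward expansion from the product state using inverted rules, a join, and the removal of dead ends) can be sketched as follows. The sketch is a toy under stated assumptions: states are integers, the two "rules" are the arithmetic steps +1 and ×2 with inverses −1 and ÷2, and the function names `explore`, `closure` and `relevant_state_space` are hypothetical. It only illustrates the bookkeeping, not the chemistry or the actual implementation.

```python
def explore(start, step, depth):
    """Collect the transitions found within `depth` steps of `start`."""
    seen = {start}
    edges = set()
    frontier = {start}
    for _ in range(depth):
        new = set()
        for state in frontier:
            for succ in step(state):
                edges.add((state, succ))
                if succ not in seen:
                    seen.add(succ)
                    new.add(succ)
        frontier = new
    return edges


def closure(start, edges):
    """All states reachable from `start` along the directed `edges`."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    seen, stack = {start}, [start]
    while stack:
        for succ in adjacency.get(stack.pop(), []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def relevant_state_space(g_i, g_f, forward, backward, k):
    """Join S_k(g_i) with the inverted S_k(g_f) and prune dead ends."""
    fwd = explore(g_i, forward, k)
    bwd = explore(g_f, backward, k)
    # A backward transition (s, t) under an inverted rule is the forward
    # transition (t, s) under the original rule.
    joined = fwd | {(t, s) for s, t in bwd}
    from_source = closure(g_i, joined)
    to_target = closure(g_f, {(t, s) for s, t in joined})
    keep = from_source & to_target
    return keep, {(s, t) for s, t in joined if s in keep and t in keep}


# Toy rules: "+1" and "*2" going forward, "-1" and "/2" as their inverses.
def forward(n):
    return [n + 1, n * 2]


def backward(n):
    return [n - 1] + ([n // 2] if n % 2 == 0 else [])


nodes, transitions = relevant_state_space(1, 10, forward, backward, 2)
```

With k = 2 the two half-spaces meet at state 4, so the only surviving path 1 → 2 → 4 → 5 → 10 has length 2k = 4. States 3, 8 and 9 are dead ends, reachable from only one side, and are pruned.
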

The Rhea database lists over $10^4$ reactions. Some of these cannot be analyzed at the level of abstraction of our present model, such as conversions of substrates between tautomers differing only sterically. After removing these, we are left with 8805 reactions for which we attempt to predict mechanisms within the limits of computational complexity.

We refer to the 368 mechanisms in the M-CSA that are covered by our rule set as the “covered” M-CSA mechanisms. 35 ... $l = 6$ steps. We therefore decided to limit our search to mechanisms of length $l \le 6$, thus limiting the state space expansion to a depth of 3.

For 786 (8 ... $l \le$ ... $l \le$ ... th most common amino acid in the M-CSA, does not show up in any of our mechanisms; its role is taken by threonine. However, the disparity in usage among amino acids that occur in our mechanisms is much more pronounced than in the M-CSA. This suggests that our rule set favors mechanisms that limit the interaction with the amino acid to proton transfers.

Despite the limitation on length and the restriction to catalysis by a single amino acid, our results include some interesting mechanisms. While many of the generated mechanisms consist of rules that were all derived from the same M-CSA mechanism, others contain rules derived from multiple distinct M-CSA mechanisms. For instance, a two-step glutamate-based mechanism that combines rules abstracted from distinct M-CSA mechanisms could be constructed for 54 reactions, such as the example in Figure 2. While glutamate itself acts only as a proton acceptor, the rules are utilized in nontrivial chemistry, consisting of an assisted keto-enol tautomerisation followed by unimolecular elimination and dehydration.

The mechanism in Figure 2 is but one example of amino acids engaging only in proton transfers yet triggering a more complex interaction. For instance, in 517 Rhea reactions with a rule-based mechanism (67 ...

Conducting a comprehensive exploration to seek mechanisms using more than one amino acid must be left to future work. Just trying any combination of two amino acids requires $26^2 = 676$ state space constructions and trace searches for each of the 8805 eligible reactions in the Rhea database. Instead, we demonstrate that we can query for mechanisms that exhibit a specified behavioral motif found in the M-CSA by carefully selecting the set of catalytic amino acids when expanding the state space.

One such behavior motif that is common among several of the covered M-CSA mechanisms consists of the joint action of histidine and serine (or cysteine).
Specifically, histidine acts as a proton acceptor deprotonating serine (or cysteine), thereby activating the latter and allowing it to attack the substrate in a nucleophilic addition, which results in the formation of a covalent enzyme-substrate complex. We can ask whether the same behavior can be used in the construction of catalytic mechanisms (based on our rule set) for Rhea reactions.

We thus define the set of catalytic amino acids to consist of histidine, cysteine, and serine, including all their tautomers as indicated in the previous section. We then search for mechanisms limited in length to at most 6 steps within the state space S for each Rhea reaction for which our procedure could not find any sAA mechanism.

For 133 of the 7863 (= 8805 − ... − ... O-(trans-sinapoyl)-β-D-glucose and choline into D-glucose and O-sinapoyl-choline. As the associated EC number of the reaction (EC 2.3.1.91) already suggests, the D-glucose moiety of 1-O-(trans-sinapoyl)-β-D-glucose is replaced by a choline molecule.

The proposed reaction mechanism can be split into two parts, each consisting of an addition and a subsequent elimination step. The first part of the mechanism is based on two rules, of which the first is depicted in Figure 1B. In step 1 the enzyme covalently binds to the substrate via a single bond between serine and sinapoyl-glucose, resulting in the formation of an oxyanion on the substrate side. The nucleophilic attack of the substrate by the serine is facilitated by the action of histidine, which abstracts a proton from the serine. The second rule application (step 2) yields the collapse of the unstable oxyanion. Together with the fully protonated histidine, the collapse results in the release of the glucose molecule from the enzyme-substrate complex. At the current stage of development our model does not allow differentiation between stereo-isomers and we therefore cannot identify specifically D-glucose as listed in RHEA:12024.

Figure 3: Proposed 5-step mechanism for the conversion of choline (i) and sinapoyl-glucose (iv) into glucose (vii) and sinapoyl-choline (xi) (RHEA:12024 entry) depending on two amino acids, namely histidine (ii) and serine (iii). In the initial step histidine (ii) acts as proton acceptor of the hydrogen atom released from serine (iii). Said serine (iii) then acts as a nucleophile towards the ester of the sinapoyl-glucose (iv), which results in the covalent linkage of the former to the latter. In the second step the formed oxyanion collapses, which causes the deprotonation of a fully protonated histidine (v) and elimination of a glucose molecule (vii). The third step describes the deprotonation of choline (i) by histidine (ii). In step four the activated choline (ix) attacks the transition molecule (viii) created during step 2. In step 5, after covalent binding of choline (ix), the oxyanion in the transition molecule (x) collapses, which results in the deprotonation of histidine (v), the release of serine (iii), and ultimately the formation of sinapoyl-choline (xi). Electronic displacements are shown as arrow pushes in magenta. Arrows between the Step panels indicate which rule was applied. Rule 1 is depicted in Figure 1B. The states of the molecules are indicated by the roman numerals (i) through (xi).

The second part of the proposed mechanism is based on three different rules, but follows a very similar pattern as the first part just described. In step 3, the choline is deprotonated by a histidine, which allows the deprotonated choline to attack the enzyme-substrate complex in step 4, resulting in the covalent attachment of the choline to the enzyme-substrate complex. In the fifth and final step, the oxyanion formed in the process collapses and, with assistance from a proton transfer by the fully protonated histidine, eliminates the serine from the enzyme-substrate complex.
This final step results in the catalytic components being restored and the products formed.

The two rules used in the first part of the hypothetical mechanism have been extracted from four different M-CSA mechanisms in which they jointly occur (entries number 337, 705, 733 and 866). In contrast, the three rules comprising the second part are each derived from a different M-CSA mechanism. Thus, the constructed mechanism combines knowledge from at least four different mechanisms listed in the M-CSA.

The constructed mechanism employs a histidine for two distinct tasks. Histidine acts as a proton acceptor twice; it is restored in the middle of the mechanism after the first use and reused a second time thereafter. This suggests the possibility that a different proton acceptor, e.g. a second instance of a histidine, could be deployed in one of the addition/elimination steps, should the catalytic site geometry of an actual enzyme require it.

The other mechanism that uses one serine and one histidine is functionally equivalent to the described mechanism, except that the histidine-enabled attachment of the serine to the substrate occurs in two steps; specifically, via steps 1 and 2 of M-CSA ID: 94 proposal 1.

Among the 173 mechanisms constructed for RHEA:12024 we identified mechanisms with the same chemistry as presented in Figure 3 but using cysteine in place of serine to anchor the substrate. We chose to detail the serine example since aspartate, histidine and serine are predicted by similarity to be (part of) the active site of the only protein, UniProtKB: Q8VZU3, listed as playing a role in RHEA:12024.
The presence of aspartate, in addition to histidine and serine, in the active site is interesting as in all of the relevant M-CSA mechanisms, histidine is assisted by aspartate or, in one case, glutamate as a passive amino acid. This catalytic triad engages in a fairly common process in enzymatic reactions. A hydrogen bond between aspartate (or glutamate) and histidine increases the $pK_a$ of the latter [Stehle et al., 2006], thus expediting the deprotonation of serine and choline. As aspartate (or glutamate) is not part of the reaction center, it is not present in the rules derived from these cases.

Discussion

Enzymatic catalysis is critical for enabling efficient and cost-effective chemical synthesis for medical, environmental, and industrial applications. The design of enzymes is therefore highly desirable. Among the numerous aspects that must be taken into account when designing an enzyme is the draft of a catalytic mechanism: a sequence of steps that is cyclical in the participating amino acids and whose traversal converts substrate(s) into product(s). Each step leads to a transition intermediate and contributes to the requirements that the architecture of the catalytic site must satisfy to make the mechanism effective. Such requirements include spatial arrangement and the fashioning of a physicochemical milieu.

In this paper we demonstrate that the drafting of mechanisms can be approached computationally. Central to our approach is the well-defined formalism of graph transformation with which we encode chemistry in the form of rules used to generate reactions by directly rewriting chemical graphs representing molecules. The approach thus hinges on extracting rules of chemistry from reaction examples, effectively generalizing these examples. Once rules are available they can be deployed to construct chemical spaces in which to search for suitable mechanisms.

We show the feasibility of this approach by converting the M-CSA into rules pertinent to amino acid side chain chemistry. In this way we make the knowledge in the M-CSA executable in the sense of enabling the computational construction of hypothetical mechanisms that catalyze reactions outside its scope.

Specifically, we construct multiple mechanisms using a single amino acid to catalyze a large number of reactions in the Rhea database.
The analysis of these mechanisms indicates that we succeed in capturing interesting and meaningful chemistry resulting from the combination of rules derived from reaction steps belonging to distinct M-CSA mechanisms.

Our procedure also generates a hypothetical catalytic mechanism relying on two amino acids, serine and histidine, for the Rhea reaction of sinapoyl-glucose and choline into glucose and sinapoyl-choline. For this reaction the literature [Fraser et al., 2007] suggests an enzyme (UniProtKB: Q8VZU3) whose active site is predicted to include Ser178, Asp389 and His443 based on similarity. Aspartate is known to influence the $pK_a$ of histidine [Stehle et al., 2006] when in proximity, thus playing a “passive” role (Section 2.1), which prevents it from being included in our rules as they are presently constructed (Section 2.3). The example leads us to believe that in addition to proposing novel catalytic mechanisms for general reactions, our approach is capable of assisting in the prediction of catalytic sites of enzymes with known substrates and products but unknown or incomplete mechanisms.

The generality of rules can be regimented by tuning the context of their action (Section 2.3). Depending on the extent of context, a varying number of mechanisms can be subsumed under the same set of rules from which they can be generated, much like a formal language. It is conceivable therefore to use explicit rule compositions to formulate search criteria for retrieving mechanisms of specified chemistry from mechanism databases.

There are many possibilities for advancing the expressivity of rules as implemented by MØD, most notably by inclusion of stereochemistry [Andersen et al., 2017] and the decoration of rules with application constraints taking physicochemical parameters into consideration.
The navigation of large state spaces is a general challenge in computational science, but the results we obtained with the simple rule-heuristics employed here suggest that the automated generation of catalytic mechanisms for arbitrary reactions is a meaningful goal to pursue.

Acknowledgments

The authors gratefully acknowledge Leon Middelboe Hansen and Mikkel Pilegaard for providing scripts constituting the preliminary analysis of the rules, and Bernhard Thiel for his insightful pre-study during his master project at the University of Vienna.

Funding

This work is supported by the Novo Nordisk Foundation grant NNF19OC0057834 and by the Independent Research Fund Denmark, Natural Sciences, grants DFF-0135-00420B and DFF-7014-00041.

References

J. L. Andersen, C. Flamm, D. Merkle, and P. F. Stadler. Inferring chemical reaction patterns using rule composition in graph grammars. Journal of Systems Chemistry, 4(1):4, 2013. ISSN 1759-2208. doi: 10.1186/1759-2208-4-4.

J. L. Andersen, C. Flamm, D. Merkle, and P. F. Stadler. Generic strategies for chemical space exploration. International Journal of Computational Biology and Drug Design, 7(2/3):225–258, 2014. doi: 10.1504/IJCBDD.2014.061649.

J. L. Andersen, C. Flamm, D. Merkle, and P. F. Stadler. A software package for chemically inspired graph transformation. In R. Echahed and M. Minas, editors, Graph Transformation. ICGT 2016. Lecture Notes in Computer Science, volume 9761, pages 73–88. Springer, Cham, Switzerland, 2016. ISBN 978-3-319-40529-2. doi: 10.1007/978-3-319-40530-8_5.

J. L. Andersen, C. Flamm, D. Merkle, and P. F. Stadler. Chemical graph transformation with stereo-information. In J. de Lara and D. Plump, editors, Graph Transformation. ICGT 2017. Lecture Notes in Computer Science, volume 10373, pages 54–69. Springer, Cham, Switzerland, 2017. ISBN 978-3-319-61469-4. doi: 10.1007/978-3-319-61470-0_4.

A. Cook, A. P. Johnson, J. Law, M. Mirzazadeh, O. Ravitz, and A. Simon. Computer-aided synthesis design: 40 years on. WIREs Computational Molecular Science, 2:79–107, 2012. doi: 10.1002/wcms.61.

H. Ehrig. Introduction to the algebraic theory of graph grammars (a survey). In V. Claus, H. Ehrig, and G. Rozenberg, editors, Graph-Grammars and Their Application to Computer Science and Biology, pages 1–69, Berlin and Heidelberg, Germany, 1979. Springer. ISBN 978-3-540-09525-5. doi: 10.1007/BFb0025714.

H. Ehrig, M. Pfender, and H. J. Schneider. Graph-grammars: An algebraic approach. In , pages 167–180, USA, 1973. doi: 10.1109/SWAT.1973.11.

H. Ehrig, K. Ehrig, U. Golas, and G. Taentzer. Fundamentals of algebraic graph transformation. In W. Brauer, G. Rozenberg, and A. Salomaa, editors, Monographs in Theoretical Computer Science. An EATCS Series. Springer, Berlin and Heidelberg, Germany, 2006. ISBN 3-540-31187-4. doi: 10.1007/3-540-31188-2.

C. M. Fraser, M. G. Thompson, A. M. Shirley, J. Ralph, J. A. Schoenherr, T. Sinlapadech, M. C. Hall, and C. Chapple. Related Arabidopsis serine carboxypeptidase-like sinapoylglucose acyltransferases display distinct but overlapping substrate specificities. Plant Physiology, 144(4):1986–1999, 8 2007. ISSN 0032-0889. doi: 10.1104/pp.107.098970.

A. Habel, J. Müller, and D. Plump. Double-pushout graph transformation revisited. Mathematical Structures in Computer Science, 11(5):637–688, 2001. doi: 10.1017/S0960129501003425.

A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring network structure, dynamics, and function using NetworkX. In G. Varoquaux, T. Vaught, and J. Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA, 2008.

T. Lombardot, A. Morgat, K. B. Axelsen, L. Aimo, N. Hyka-Nouspikel, A. Niknejad, A. Ignatchenko, I. Xenarios, E. Coudert, N. Redaschi, and A. Bridge. Updates in Rhea: SPARQLing biochemical reaction data. Nucleic Acids Research, 47(D1):D596–D600, 1 2019. ISSN 0305-1048. doi: 10.1093/nar/gky876.

D. Pleissner and K. Kümmerer. Green chemistry and its contribution to industrial biotechnology. In M. Fröhling and M. Hiete, editors, Advances in Biochemical Engineering/Biotechnology, volume 173, pages 281–298, Cham, Switzerland, 2020. Springer. ISBN 978-3-030-47065-4. doi: 10.1007/10 ...

... Nucleic Acids Research, 46(D1):D618–D623, 2017. ISSN 0305-1048. doi: 10.1093/nar/gkx1012.

J. H. Schrittwieser, S. Velikogne, M. Hall, and W. Kroutil. Artificial biocatalytic linear cascades for preparation of organic molecules. Chemical Reviews, 118(1):270–348, 2018. doi: 10.1021/acs.chemrev.7b00033.

M. H. S. Segler and M. P. Waller. Modelling chemical reasoning to predict and invent reactions. Chemistry - A European Journal, 23(25):6118, 2017. doi: 10.1002/chem.201604556.

M. J. Snider and R. Wolfenden. The rate of spontaneous decarboxylation of amino acids. Journal of the American Chemical Society, 122(46):11507–11508, 11 2000. ISSN 0002-7863. doi: 10.1021/ja002851c.

F. Stehle, W. Brandt, C. Milkowski, and D. Strack. Structure determinants and substrate recognition of serine carboxypeptidase-like acyltransferases from plant secondary metabolism. FEBS Letters, 580(27):6366–6374, 11 2006. ISSN 00145793. doi: 10.1016/j.febslet.2006.10.046.

M. H. Todd. Computer-aided organic synthesis. Chemical Society Reviews, 34:247–266, 2005. doi: 10.1039/b104620a.

V. V. Welborn and T. Head-Gordon. Computational design of synthetic enzymes. Chemical Reviews, 119(11):6613–6630, 2019. doi: 10.1021/acs.chemrev.8b00399.

J. B. Zimmerman, P. T. Anastas, H. C. Erythropel, and W. Leitner. Designing for a green chemistry future.