Exact Topology of Dynamic Probability Surface of an Activated Process by Persistent Homology
EExact Topology of Dynamic ProbabilitySurface of an Activated Process by PersistentHomology
Farid Manuchehrfar , Huiyu Li , Wei Tian , Ao Ma ∗ , and Jie Liang ∗ February 9, 2021Center for Bioinformatics and Quantiative Biology and Department of Bioengneering, Uni-versity of Illinois at Chicago, Chicago, IL 60607. These authors contributed equally, ∗ Corresponding authors.1 a r X i v : . [ m a t h - ph ] F e b bstract To gain insight into reaction mechanism of activated processes, we introduce anexact approach for quantifying the topology of high-dimensional probability surfacesof the underlying dynamic processes. Instead of Morse indexes, we study the homologygroups of a sequence of superlevel sets of the probability surface over high-dimensionalconfiguration spaces using persistent homology. For alanine-dipeptide isomerization,a prototype of activated processes, we identify locations of probability peaks andconnecting-ridges, along with measures of their global prominence. Instead of a saddle-point, the transition state ensemble (TSE) of conformations are at the most prominentprobability peak after reactants/products, when proper reaction coordinates are in-cluded. Intuition-based models, even those exhibiting a double-well, fail to capture thedynamics of the activated process. Peak occurrence, prominence, and locations can bedistorted upon subspace projection. While principal component analysis account forconformational variance, it inflates the complexity of the surface topology and destroydynamic properties of the topological features. In contrast, TSE emerges naturally asthe most prominent peak beyond the reactant/product basins, when projected to asubspace of minimum dimension containing the reaction coordinates. Our approach isgeneral and can be applied to investigate the topology of high-dimensional probabilitysurfaces of other activated process.
Keywords— surface topology; landscape analysis; persistent homology; active process; tran-sition state; energy flow Introduction
Activated processes are ubiquitous in molecular systems, ranging from chemical reactions of smallmolecules to dynamic conformational changes and enzymatic reactions of proteins. In proteins, allfunctionally important processes are activated processes, which provide well-defined rates essentialfor proteins to carry out their roles in the cellular context, as proper timing is required for properfunction.The prevalent picture describing an activated process is that of a transition between two meta-stable basins on the free energy landscape separated by a barrier, whose height is large comparedto thermal energy [1]. The slow time scale of activation arises from the fact that the molecularsystem can rarely accumulate sufficient energy in the relevant degrees of freedom (DoFs) to surpassthe transition barrier. This simple and elegant picture originates from reaction rate theories, suchas the well-known transition state theory and Kramer’s theory [1–7] developed in studies of thedynamics of chemical reactions of small molecules.A key concept in reaction rate theories is that of reaction coordinates: a few special coordinatesexists that can fully determine the progress of a reaction process [8–10]. A requirement for reactioncoordinates is that they must accurately locate the transition barrier. Accordingly, the numerousdegree of freedoms (DoFs) in a complex molecular system ( e.g. , a protein molecule, a system ofsolute and solvent) can be divided into reaction coordinates and heat bath. Reaction coordinatesplay central roles as they determine both the mechanism and the rate of activation. For example, tomodify the activity of an enzyme, one should modify residues involved in the reaction coordinatesof the enzyme activities [11, 12], as this will modify both the reaction pathway and the barrierheight for activation. In contrast, modifying residues that belong to the heat bath will not alterthe enzymatic activity, as the role of the heat bath is to provide energy to the reaction coordinatesto cross the activation barrier during rare fluctuations, which is largely a non-specific process.Given such significance, it is important to develop a rigorous and quantitative criterion fordetermining the correct reaction coordinates. This task was accomplished with the development ofthe procedure of committor test , which is characterized by the committor value p B [8, 9, 13, 14]: the robability that a dynamic trajectory initiated from a given configuration to reach the product basinbefore visiting the reactant basin. By definition, the reactant and product states have committorvalues of 0 and 1, respectively, whereas the optimal transition state coincides with p B = 0 . p B therefore provides a rigorous parameterization of the reaction process.Thus, the intuitive, albeit qualitative, notion of reaction coordinates translates into a rigorousdefinition of the few coordinates that are sufficient for determining the committor value of anygiven configuration. Du et. al first adopted this rigorous definition of reaction coordinates in thecontext of protein folding [9]; Chandler and co-workers established its usage as a standard practicein the general context of activated processes [8].While this rigorous criterion has been well accepted, identifying the correct reaction coordi-nates turns out to be rather difficult, even for systems of modest complexity. One example is the C eq → C az isomerization reaction of the alanine dipeptide in vacuum, a prototype for studyingbiomolecular conformational transitions. Alanine dipeptide is the smallest molecule that satisfiesthe criterion that distinguishes complex molecules from small molecules: the non-reaction coordi-nates in the system constitute a large enough thermal bath to provide the reaction coordinates withadequate energy to cross the activation barrier. As a result, the C eq → C az transition displaysfeatures of activated dynamics that are unique to complex molecules but absent in small molecules.It was first found by Bolhuis et. al [15] that the conventional Ramachandran torsional angles φ and ψ , while sufficient for distinguishing the two stable basins, are inadequate for locating the transi-tion state. Instead, another torsional angle θ was found to be an essential reaction coordinate––arather counter-intuitive finding [16,17]. The counter-intuitive nature of reaction coordinates turnedout to be more often the norm than exception in complex systems [15, 18, 19], posing a formidablechallenge, as sans intuition, there appears no guidance in sight.The challenge in identifying reaction coordinates had motivated efforts in developing rigorousmethods for their identification in complex systems since the early 2000’s [8,9,15,18,20,21]. Beyondunreliable intuition and trial-and-error, the first systematic method was that of machine-learning,in which a neural network was used to automatically identify the optimal reaction coordinatesfrom a prepared pool of candidates [21]. This method was used to successfully identify the key olvent coordinate that controls the isomerization dynamics of an alanine dipeptide in solution,which had defied prior intuition-based trial-and-error efforts. The success of this machine learningbased approach lead to a series of further developments along similar lines [10, 18, 19, 22–28].However, a major deficiency of machine learning-based methods is that they cannot answerthe real question concerning reaction coordinates – why some coordinates are more important foractivation than the others? Instead, these methods can only inform us empirically which coordinatesappear to be important based on well-defined criteria. Recently, Ma and co-workers developed arigorous theory for mapping out the flow of potential energy through individual coordinates [17,29].It was found that the reaction coordinates are the coordinates that carry high energy flows duringthe activation process. This result suggested an appealing physical picture: energy flows from fastcoordinates into slow coordinates during activation, so that adequate energy can accumulate in theslow coordinates, enabling them to cross the activation barrier on path of these slow coordinates.This physical picture also suggested that reaction coordinates are preferred channels of energyflows and are encoded in the protein structure. Through analysis of energy flow, one can obtaineda prioritized list of coordinates that likely play most significant roles in the activation process.The most celebrated concept in reaction rate theories is that of the transition state (TS), whichis the dynamic bottleneck of an activated process. Conventional thoughts are that they are locatedat a critical point with Mores index of 1 on the high-dimensional potential energy landscape ofthe molecule. If all the reaction coordinates of an activated process are known, the TS will bean index-1 critical point on the free energy surface of the reaction coordinates, the sole directionwith negative Hessian pointing along the ideal one-dimensional reaction coordinate valid at the topof the activation barrier. Based on this picture, the surface of the multi-dimensional probabilitydistribution of reaction coordinates constructed from an ensemble of reactive trajectories should behighly structured, with the important dynamic states ( e.g. TS) manifesting as critical topologicalfeatures.To gain insights into reaction mechanism, a common practice is to study features of the freeenergy surface. However, all relevant information of an activated process is contained in the re-active trajectories. Instead of the free energy surface, one can construct the dynamic probability urface of the transition state. For certain systems, this can be achieved using the transition pathsampling (TPS) method. Once a large ensemble of reactive trajectories are generated, an ensembleof configurations that the system sampled during the transition process can be harvested. Fromthis ensemble, one can generate a dynamic probability landscape or transition state surface, whichis usually high-dimensional.The focus of this work is the analysis of the exact topology of the high-dimensional dynamicprobability landscape, and the establishment of its relationship with the transition state of the activeprocess. Instead of Morse index, we characterize the high-dimensional transition state surface by itstopological structures in homology groups. We analyze topological changes in the superlevel set ofthe probability surface at different probability levels. Using the technique of persistent homology,we identify the locations of probability peaks in the high-dimensional configuration space and ridgesconnecting them, along with measures of their global prominence.We apply this approach to study the active process of C eq → C ax isomerization of the alaninedipeptide, a well-characterized model system [15, 17, 29]. After the exact topological structuresof the dynamic probability surface are constructed, we identify the location of the ensemble oftransition state conformations. Instead of a saddle point with a Morse index of 1, the transitionstate ensemble (TSE) is found to be located at the top of the most prominent peak after thoseof the reactant and product. In addition, the dynamically important topological structures areretained when the surface is projected onto the 2-dimensional plane of ( φ − θ ), which are known tobe the reaction coordinates [17]. In contrast, when projected to the intuition-derived ( φ − θ )-plane,the topological features of the probability surface no longer contain dynamic information on thetransition state. Furthermore, we find that PCA dimension reductions distort surface topologies,such that the transition state ensemble cannot be recovered from PCA-derived topological features:Instead of simplification, PCA destroys the dynamic properties of topological features of the originaltransition state surface.Overall, we introduce a novel approach for quantifying the exact topology of high-dimensionalsurfaces. With this approach, we have characterized the precise topology of the transition statesurface of the active process of alanine dipeptide isomerization. We have also established that he TSE is located at the most prominent probability peak beyond those of the reactant andthe product, when the subspace of projection contain the proper reaction coordinates. The newlanguage of homology group and the technique of persistent homology on superlevel sets of high-dimensional probability surfaces introduced here are general and can be applied to investigations ofthe topology of high-dimensional surfaces over configuration space encountered in other problemsof activated process. We briefly discuss how the dynamic probability surface can be constructed for the molecular systemof our interests. We then discuss the problem of understanding 118the topological structures of the probability surface over the relevant configuration space. Ourfocus will be the introduction of the method of homology groups for analyzing the exact topologicalstructures of the dynamic probability surface. This approach is based on the homological structuresof the sequence of the superlevel sets of a probability surface, and differs from previous efforts thatare based on critical points, Morse theory, and Euler characteristics.
We use the transition path sampling (TPS) method [8] to generate a sufficiently large ensemble ofreactive trajectories for the active process we study. To ensure the transition process is fully coveredwithout bias, the duration of the trajectories is much longer than that of the transition process.Along each trajectories, we further harvested a number of system configuration with a specific timeintervals to generate an ensemble of the relevant configurations that the system samples duringthe transition process. From this ensemble, we construct a dynamic probability landscape over a d -dimensional subspace, namely, the joint probability distribution over the d -coordinates. .2 Configuration space and probability surface. We now discuss the general problem of analyzing probability surface over a configuration space of ar-bitrary dimension. We first introduce a few relevant concepts, and then review current approaches.This is followed by an exposition of the basics of homology groups and persistent homology. Wefocus on developing these concepts in the setting of cubics and superlevel sets of the probabilitysurface. We will also discuss the key concept of filtration and its construction, and how it can beused to uncover the exact high-dimensional topological features.
Configuration space.
We begin our discussion in a general setting. We use d number offeatures that describe the configurations of a molecule. These can be bond lengths, bond angles,and torsional angles describing the structure of the molecule. For alanine dipeptide, there are atotal of 60 features that fully describes the configuration of the molecule. For a molecule of finitesize, the configuration space M ⊂ R d is compact. It is likely that M lies in a subspace of theEuclidean space R d , as there are coupling between different degrees of freedom. Probability surface.
Each configuration x = ( x , x , · · · , x d ) ∈ M has a probability f ( x ) ∈ [0 ,
1] associated with it. Namely, we have a function f : M → R [0 , that assigns the probabilityvalue p ( x ) to each configuration x . Superlevel sets and sublevel sets.
For a value 0 ≤ a ≤
1, we can identify all x whoseprobability values f ( x ) ≥ a , which is called the superlevel set M f ≥ a : M f ≥ a ≡ { x ∈ M | f ( x ) ≥ a } = f − ([ a, . Similarly, we can have the the sublevel set M f ≤ a : M f ≤ a ≡ { x ∈ M | f ( x ) ≤ a } = f − ((0 , a ]) . .3 Topology of probability surface: Critical points and Morseindices. It is of great importance to understand the topological structures of the probability surface f ( x ).The approach of analyzing the critical points is well-practiced for exploring the topological struc-tures of high-dimensional surfaces. Critical Points.
The critical points of f ( M ) are where all first derivatives of f vanishes: ∂f ( x ) ∂x = 0 , ∂f ( x ) ∂x = 0 , · · · , ∂f ( x ) ∂x d = 0 . Critical points are coordinates independent and can be further classified into different types bythe secondary derivatives. We can organize the secondary derivatives into a d × d Hessian matrix H f ( x ), whose entries are: ( H f ) i, j = ∂ f∂x i ∂x j . At non-degenerative critical points, where the Hessian matrices are non-singular, the Hessian willhave a mixture of positive and negative eigenvalues. The number of negative eigenvalues η is the Morse index of the critical point.
Topology of surface by critical points and Morse theory.
At a critical point, thetopology of sublevel sets changes. For a critical point x with f ( x ) = a , consider the sublevel sets M f