Identification of slow molecular order parameters for Markov model construction
Guillermo Perez-Hernandez, Fabian Paul, Toni Giorgino, Gianni de Fabritiis, Frank Noé
Guillermo Pérez-Hernández, Fabian Paul +, Toni Giorgino +, Gianni de Fabritiis †, and Frank Noé ∗

February 28, 2013
Abstract
A goal in the kinetic characterization of a macromolecular system is the description of its slow relaxation processes, involving (i) identification of the structural changes involved in these processes, and (ii) estimation of the rates or timescales at which these slow processes occur. Most approaches to this task, including Markov models, master-equation models, and kinetic network models, start by discretizing the high-dimensional state space and then characterize relaxation processes in terms of the eigenvectors and eigenvalues of a discrete transition matrix. The practical success of such an approach depends very much on the ability to finely discretize the slow order parameters. How can this task be achieved in a high-dimensional configuration space without relying on subjective guesses of the slow order parameters? In this paper, we use the variational principle of conformation dynamics to derive an optimal way of identifying the "slow subspace" of a large set of prior order parameters, either generic internal coordinates (distances and dihedral angles) or a user-defined set of parameters. It is shown that a method to identify this slow subspace exists in statistics: time-lagged independent component analysis (TICA). Furthermore, optimal indicators are readily identified: order parameters that indicate the progress of the slow transitions and may thus serve as reaction coordinates. We demonstrate that the slow subspace is well suited to construct accurate kinetic models of two sets of molecular dynamics simulations, the 6-residue fluorescent peptide MR121-GSGSW and the 30-residue natively disordered peptide KID. The identified optimal indicators reveal the structural changes associated with the slow processes of the molecular system under analysis.
Addresses: 1 FU Berlin, Arnimallee 6, 14195 Berlin, Germany. 2 Institute of Biomedical Engineering (ISIB), National Research Council of Italy (CNR), Corso Stati Uniti 4, I-35127 Padua, Italy. 3 GRIB, Barcelona Biomedical Research Park (PRBB), C/ Dr. Aiguader 88, 08003 Barcelona, Spain. 4 Max Planck Institute of Colloids and Interfaces, Division Theory and Bio-Systems, Science Park Potsdam-Golm, 14424 Potsdam, Germany. + equal contribution. Corresponding authors: † [email protected], ∗ [email protected]

Introduction
Conformational transitions between long-lived, or "metastable", states are essential to the function of biomolecules [27, 58, 52, 24, 78, 28, 95]. These rare transitions are ubiquitously found in biomolecular processes, including folding [34, 44], complex conformational rearrangements between native protein substates [26, 63], and ligand binding [68]. Rare conformational transitions can be explicitly traced either by single-molecule experiments [70, 88, 17, 44] or by high-throughput molecular dynamics simulations, realized either with few long trajectories [84, 48] or with many shorter trajectories [92, 11, 67, 13, 77, 12]. Molecular dynamics (MD) simulations are unique in their ability to resolve the dynamics and all structural features of a biomolecule simultaneously. When the sampling problem can be overcome and the appropriateness of the force field parameters is confirmed by accompanying experimental evidence, MD simulations are amongst the most powerful tools to investigate conformational transitions in biomolecules.

A current challenge with high-throughput MD simulations is to extract meaningful information from vast trajectory data in an objective way. To achieve this goal, the last few years have seen vast activity in the development of computational methods that extract kinetic models from MD data. Kinetic models usually first partition the conformation space into discrete states [93, 61, 40, 33, 94, 14, 74, 56, 21, 80, 69]. Subsequently, transition rates or probabilities between these states can be estimated [45, 56, 62, 16, 90, 80, 69].
The resulting models are often called transition networks [74, 62, 32], diffusion maps [18, 76], master equation models [87, 14], Markov models [64], or Markov state models (MSMs) [86, 16], where "Markovianity" means that the kinetics are modeled by a memoryless jump process between states. The recent integration of classical statistical mechanics with modern molecular kinetics highlights the crucial role of the eigenvectors and eigenvalues of the Markov model transition matrix or master equation rate matrix. This is because they approximate the exact eigenfunctions and eigenvalues of the propagator of the continuous dynamics [79]. The following eigenvalue equation is fundamental to conformation dynamics:

P φ_i = λ_i φ_i    (1)

Here, P is the transfer operator that propagates probability densities of molecular configurations [81, 66], the φ_i are its eigenfunctions, and the λ_i are the associated eigenvalues. Equivalent expressions are obtained by expressing the eigenfunctions in differently weighted spaces, leading to the transfer operator formulation [81] or the symmetrized propagator formulation [14]. Eq. (1) is fundamental because, when it is solved for the largest eigenvalues and associated eigenfunctions, all stationary and kinetic quantities are defined by them.
For example:

• P is guaranteed to have a unit eigenvalue λ_1 = 1, whose eigenfunction is the stationary distribution µ(x) = φ_1(x); the ensemble average of an observable o can be calculated from o and µ.

• Experimentally measurable relaxation rates of the system can be computed from the eigenvalues as κ_i = −τ^{−1} ln λ_i, or the corresponding timescales as t_i = κ_i^{−1}.

• The metastable states (often referred to as "free energy basins", although we will avoid this term as it would imply the projection onto some pre-defined coordinate set) can be computed from the sign structure of the leading eigenfunctions [81, 22].

• The structural transition associated with each relaxation timescale is defined by the corresponding eigenfunction [72] and corresponds to a transition between metastable sets. This fact can be used to assign structural changes to experimentally measurable timescales [65].

• Experimentally measurable correlation functions (e.g. fluorescence correlation, intermediate scattering functions in dynamic neutron or X-ray scattering) can be computed as a sum of single-exponential relaxations, with timescales computed from the λ_i and amplitudes from the φ_i and the experimental observable [65, 43, 15].

• From the largest m eigenvalues and their associated eigenfunctions, a rank-m propagator can be assembled that describes the dynamics slower than timescale t_m [46]. From this propagator, many properties can be calculated, such as transition pathways between two sets of configurations [51, 7, 67].

The approximation error of all of the above quantities can be cast in terms of the approximation error of the eigenvalues and eigenfunctions [79, 23, 72, 71]. Vice versa, all of the above quantities are easily and precisely computable when the eigenvalues and eigenfunctions of P have been approximated with high precision.
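The mapping from eigenvalues to stationary distribution and timescales can be made concrete in the discrete setting. The following sketch uses a hypothetical 3-state transition matrix of ours (not from the paper's data) at lag time τ and recovers µ from the unit eigenvalue and t_i = −τ/ln λ_i from the remaining ones:

```python
import numpy as np

# Illustrative sketch (hypothetical 3-state transition matrix, not from the
# paper's data): the eigendecomposition of T gives the stationary distribution
# (eigenvalue lambda_1 = 1) and the relaxation timescales t_i = -tau/ln(lambda_i).
T = np.array([[0.95, 0.04, 0.01],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])   # row-stochastic, lag time tau = 1 step
tau = 1.0

eigvals, eigvecs = np.linalg.eig(T.T)        # left eigenvectors of T
order = np.argsort(-np.abs(eigvals))         # sort by eigenvalue magnitude
eigvals = eigvals[order].real
eigvecs = eigvecs[:, order].real

mu = eigvecs[:, 0] / eigvecs[:, 0].sum()     # stationary distribution (lambda_1 = 1)
timescales = -tau / np.log(eigvals[1:])      # t_i = kappa_i^{-1} = -tau/ln(lambda_i)
print(mu, timescales)
```

The first timescale is omitted because λ_1 = 1 corresponds to t_1 = ∞, i.e. the stationary process.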
Consequently, any modeling method that attempts to compute the above quantities must aim at approximating the eigenvalues and eigenfunctions of P, either explicitly or implicitly.

Markov models and most of the other aforementioned kinetic models require a discretization of configuration space. This is typically done by choosing representative configurations with some data clustering method and then partitioning the configuration space by a Voronoi tessellation. In contrast to other fields of data analysis, the purpose of the clusters is not a classification of configurations, but rather a sufficiently fine discretization of configuration space such that the eigenfunctions can be well approximated in terms of step functions on the Voronoi cells [72]. In order to achieve this, the metric must be chosen such that it permits a fine partition of the relevant "slow" order parameters, i.e. those which are good indicators of the slow eigenfunctions φ_i.

How can the slow order parameters be identified without already having a high-precision Markov model? It was noted early on that a priori order parameters, such as the root mean square distance (RMSD) to a single reference structure, the radius of gyration, or pre-selected distances or angles, are often not good indicators of the slow eigenfunctions, and thus bear the danger of disguising the slow kinetics [45, 61, 56]. In order to avoid this, Markov model construction has in the last years focused on the other extreme: using general metrics that are capable of describing every sort of configurational change. Most notable is the minimal RMSD metric, which assigns to each pair of configurations their minimal Euclidean distance subject to rigid-body translation and rotation [39]. Minimal RMSD has been used successfully in many examples, especially protein folding (see [9, 72] and references therein).
Recent applications include folding of the MR121-GSGS-W peptide [72], folding of the FiP35 WW domain, GTT, NTL9, and protein G [6], and the discovery of cryptic allosteric sites in β-lactamase, interleukin-2, and RNase H [10]. However, minimal RMSD tends to fail in situations where the largest-amplitude motions are not the slowest (an example of this is the natively disordered KID peptide analyzed below). Principal component analysis (PCA) is a frequently used method to reduce the dimension of an order parameter space by projecting it onto its linear subspace of the largest-amplitude motions [4]. PCA has also been used successfully in Markov model construction [67, 65]; however, it suffers from similar limitations as minimal RMSD, as there is no general guarantee that large-amplitude motions are associated with slow transitions.

It is an important challenge to find a metric that provides a good indicator of the slow processes, such that a good approximation of the eigenfunctions φ_i is feasible with a moderate number of clusters. The aim of this paper is to identify such a method. To be more precise, let r_1, ..., r_d ∈ ℝ be a possibly large set of d order parameters of a molecular system that are a priori specified by the user. Typical examples of order parameters include intramolecular distances and torsion angles. However, complex order parameters, like the instantaneous dipole moment of a molecule or an experimentally measurable quantity such as a FRET efficiency, may also be included. Given this set of order parameters, we aim to:

1. Find the linear combination of order parameters that optimally approximates the dominant eigenvalues and eigenfunctions, such that a high-precision Markov model can be built in these order parameters with direct clustering.

2.
Identify the m order parameters that are the best and least redundant indicators of the m dominant eigenfunctions, thus providing the user with a direct physical interpretation of which structural changes are associated with the slowest relaxation timescales (feature selection).

Here we use the variational principle of conformation dynamics [66] to derive an optimal solution for problem 1, and show that an existing extension of PCA solves this problem: time-lagged independent component analysis (TICA) combines information from the covariance matrix and a time-lagged covariance matrix of the data [55]. See [2] for a detailed description of the method. TICA has recently been applied to the analysis of MD data. Naritomi and Fuchigami [57] used TICA to investigate domain motion of the LAO protein and compared it to PCA. Mitsutake et al. [53] used relaxation mode analysis, a related technique, to analyze the dynamics of Met-enkephalin. Both studies showed that the slow modes were not necessarily associated with large amplitudes, and time-lagged mode analyses were thus better suited to detect them than PCA. Here, we demonstrate the usefulness of TICA coordinates for constructing Markov models for two rather different molecular processes: (i) the conformational dynamics of the small fluorescent peptide MR121-GSGSW, for which good Markov models can be built using a variety of methods, and (ii) the conformational dynamics of the natively unstructured 30-residue peptide KID, modeled through a large ensemble of explicit-solvent molecular dynamics (MD) simulations.

We also propose a way to approach problem 2, identifying the optimal indicators of the slowest processes. These indicators inform the user of the structural process that is governing the slow relaxations of the macromolecule. Optimal indicators help in understanding what comprises the slow kinetics, and dramatically reduce the time the user needs to spend searching for the structural character of the slow processes in a Markov model.
We summarize the variational principle of conformation dynamics, which states that the true eigenfunctions are best approximated by a Markov model when the estimated timescales t̂_i are maximized. We derive a way to optimally approximate the true eigenfunctions in terms of a linear combination of the original order parameters. We then show that this method is identical to time-lagged independent component analysis (TICA), an established method in statistics. The TICA problem can be easily solved by subsequently solving two simple eigenvalue problems.

We start by providing an expression for the propagator of exact continuous molecular dynamics, and show that in order to approximate its long-time behavior, its largest eigenvalues and associated eigenfunctions must be well approximated. We use x_t to denote the full molecular configuration at time t (if velocities are available, x_t denotes a point in full phase space) in state or phase space Ω. We assume that the molecular dynamics implementation is Markovian in Ω (i.e. the time step to x_{t+τ} is computed based on the current value of x_t only), and gives rise to a unique stationary density µ(x), usually the Boltzmann density:

µ(x) = Z^{−1} e^{−βH(x)},

where H is the Hamiltonian, Z is the partition function, and β = (k_B T)^{−1} is the inverse temperature. We also assume that the dynamics are statistically reversible, i.e. that the molecular system is simulated in thermal equilibrium. Let us denote a probability density of molecular configurations as ρ_t, and let us subsume the action of the molecular dynamics implementation into the propagator P(τ). The propagator describes the probability that a trajectory that is at configuration x_t at time t will be found at a configuration x_{t+τ} a time τ later.
In an ensemble view, the propagator takes a probability density of configurations ρ_t and predicts the probability density of configurations at a later time, ρ_{t+τ}:

ρ_{t+τ} = P(τ) ρ_t

We can write the propagator by expanding it in terms of its eigenvalues λ_i(τ) = e^{−τ/t_i} and its eigenfunctions φ_i as:

ρ_{t+τ}(y) = P(τ) ρ_t(x) = Σ_{i=1}^∞ e^{−τ/t_i} ⟨ψ_i, ρ_t⟩ φ_i,    (2)

where the eigenfunctions φ_i(x) take the role of basis functions with which probability densities ρ can be constructed. The first eigenvalue is λ_1 = 1 and the remaining eigenvalues have a norm strictly smaller than 1. Thus, the first timescale is t_1 = ∞ and corresponds to the stationary distribution, while all other timescales t_i are finite relaxation timescales. ψ_i(x) = µ^{−1}(x) φ_i(x) are the eigenfunctions weighted by the stationary density. Eq. (2) has a straightforward physical interpretation: the scalar product ⟨ψ_i, ρ_t⟩ measures the overlap of the starting density ρ_t with the i-th eigenfunction and thus determines the amplitude by which this eigenfunction contributes to the dynamics. At any time τ, the new probability density ρ_{t+τ} is composed of a set of basis functions φ_i. With increasing time, the contributions of all basis functions φ_i with i > 1 vanish exponentially with a timescale given by t_i. After infinite time, τ → ∞, only the first term with t_1 = ∞ (and hence exp(−τ/t_1) = 1) is left, and the stationary density is reached: lim_{τ→∞} P(τ) ρ_t = φ_1 = µ. Stationarity implies that µ will not be changed under the action of the propagator: P(τ) µ = µ.

Suppose we are interested in slow timescales τ ≫ t_{m+1}. At such long times, the dynamics are governed by the m largest timescales t_i and eigenfunctions of the propagator:

ρ_{t+τ} = P(τ) ρ_t ≈ Σ_{i=1}^m e^{−τ/t_i} ⟨ψ_i, ρ_t⟩ φ_i.
All kinetic properties at this timescale, and all stationary properties, can be accurately computed when the dominant m eigenvalues and eigenfunctions are approximated. This is our goal.

We can make a few general statements on how to approximate the true timescales t_i and eigenfunctions. These general properties can be used to derive a general method that achieves the aim of this paper: the identification of the slowest order parameters in a molecule. Since φ_i and ψ_i are interchangeable using the weights µ, the approximation problem can be described using either kind of eigenfunction. Subsequently we will always refer to the problem of approximating the weighted eigenfunctions ψ_i.

Consider some function of the molecular configuration, f(x). From Eq. (2) we can express the time-autocorrelation function of f as a function of τ:

⟨f(x_t) f(x_{t+τ})⟩_t = Σ_{i=1}^∞ e^{−τ/t_i} ⟨φ_i, f⟩²    (3)

Suppose we knew the true eigenfunction ψ_i(x). It is then easy to show [66] that the time-autocorrelation function of ψ_i(x) yields the exact i-th eigenvalue, and thus permits recovery of the exact i-th timescale:

λ̂_i(τ) = ⟨ψ_i(x_t) ψ_i(x_{t+τ})⟩ = e^{−τ/t_i}
t̂_i = −τ / ln |λ̂_i(τ)| = t_i.

However, in reality we will not know the exact eigenfunction ψ_i. Suppose that we guess a model function ψ̂_2 that is supposed to be similar to ψ_2. When we make sure that ψ̂_2 is appropriately normalized, the variational principle of conformation dynamics [66] shows that the time-autocorrelation function of ψ̂_2 approximates the true eigenvalue, and the true timescale, from below:

⟨ψ̂_2(x_t) ψ̂_2(x_{t+τ})⟩ ≤ e^{−τ/t_2}
t̂_2 ≤ t_2    (4)

where equality only holds for ψ̂_2 = ψ_2.
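The bound (4) can be checked numerically for a small reversible Markov chain, where every quantity is computable exactly. The chain below is a hypothetical example of ours (not from the paper); the trial function is made mean-free and normalized with respect to the stationary distribution, and its lag-1 autocorrelation underestimates the true second eigenvalue:

```python
import numpy as np

# Numerical sketch of the variational bound on a reversible 3-state chain
# (hypothetical numbers, lag tau = 1 step): the autocorrelation of a
# normalized, mean-free trial function underestimates the second eigenvalue.
C = np.array([[90., 5., 3.],
              [5., 80., 6.],
              [3., 6., 70.]])          # symmetric counts => reversible dynamics
T = C / C.sum(axis=1, keepdims=True)   # transition matrix
pi = C.sum(axis=1) / C.sum()           # stationary distribution

lam2_true = np.sort(np.linalg.eigvals(T).real)[-2]   # exact second eigenvalue

# Trial function psi_hat: mean-free and normalized with respect to pi.
f = np.array([1.0, 0.0, -1.0])
f = f - pi @ f                          # <psi_hat, 1>_pi = 0
f = f / np.sqrt(pi @ (f * f))           # <psi_hat, psi_hat>_pi = 1

# <psi_hat(x_t) psi_hat(x_{t+tau})> = sum_ij pi_i f_i T_ij f_j
lam2_est = f @ (pi[:, None] * T) @ f
print(lam2_est, lam2_true)              # lam2_est <= lam2_true
```

Because the chain is reversible, this is exactly a Rayleigh-quotient bound: any trial vector orthogonal to the stationary eigenfunction yields an eigenvalue estimate no larger than λ_2.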
Thus, we have a recipe for finding an optimal approximation to the second timescale and its associated eigenfunction: we must seek a function ψ̂_2 that has the maximum timescale t̂_2.

Similar inequalities can be shown for the other eigenvalues and timescales t_3, ..., t_m. It can be shown that if one proposes a model function ψ̂_i that is orthogonal to the eigenfunctions ψ_1 through ψ_{i−1}, we also have:

t̂_i ≤ t_i.    (5)

This variational principle of conformation dynamics is analogous to the variational principle in quantum mechanics.

2.3 Best approximation of the eigenfunctions

What is the relation of the variational principle above to Markov models? Since the eigenfunctions ψ_i are initially unknown and difficult to guess, it is reasonable to approximate them by functions ψ̂_i that are assembled from a linear combination of basis functions:

ψ̂_i(x) = Σ_{k=1}^n b_ik χ_k(x)    (6)

which must be defined a priori. The optimization problem then consists of finding the optimal parameters b_ik, which we will denote by vectors b_i ∈ ℝ^n, where n is the number of basis functions. The Ritz method [75] provides the optimal set of coefficients for an orthonormal basis set. Formally, we define the covariance matrix between Ansatz functions as:

c^χ_ij(τ) = E_t[χ_i(x_t) χ_j(x_{t+τ})]

If we require that the basis functions are orthonormal (which is equivalent to them being uncorrelated at lag time 0):

⟨χ_i, χ_j⟩_µ = E_t[χ_i(x_t) χ_j(x_t)] = c^χ_ij(0) = δ_ij    (7)

then the optimal set of coefficients is given by the eigenvectors b_i of the following eigenvalue problem:

C^χ(τ) b_i = b_i λ̂_i(τ)    (8)

Let us now consider the more general case where the Ansatz functions are not orthonormal, i.e. ⟨χ_i, χ_j⟩_µ ≠ δ_ij. In this situation we must first orthonormalize the basis functions. This is done via a generalization of Eq. (8).
For a non-orthonormal basis set, the optimal approximation to the true eigenvalues and eigenfunctions is obtained by solving the generalized eigenvalue problem:

C^χ(τ) b_i = C^χ(0) b_i λ̂_i(τ)    (9)

One may formally rewrite Eq. (9) as D(τ) b_i = λ̂_i b_i with D(τ) = (C^χ(0))^{−1} C^χ(τ). Numerically, however, the matrix inversion of C^χ(0) is often poorly conditioned and should therefore be avoided. The results (8) and (9) are well known from variational calculus. The appendix contains an illustrative derivation of Eq. (9) relevant to the special choice of basis set used in this paper.

Based on the above results we can now formulate a method to find the linear combination of molecular order parameters r = (r_1(x), ..., r_d(x)) that best resolves the slow relaxation processes. This is done by finding the optimal coefficients for Eq. (6). For this, we define the basis function χ_i to be identical to the mean-free coordinate r_i(x) (if the original order parameters r′_i(x) are not mean-free, we simply subtract the mean: r_i(x) = r′_i(x) − ⟨r′_i(x)⟩):

χ_i(x) = r_i(x)    (10)

Thus, our basis set has n = d dimensions. Now let us compute the correlation matrix of the mean-free order parameters:

c^r_ij(τ) = E_t[r_i(x_t) r_j(x_{t+τ})] = c^χ_ij(τ)

Solving Eq. (9) with this correlation matrix for lag times 0 and τ then provides the linear combination of input order parameters that optimally approximates the exact propagator eigenfunctions. See the appendix for a sketch of the usual derivation of Eq. (9) for the case of TICA. It turns out that Eq. (9) with the choice of coordinates (10) is known as time-lagged independent component analysis (TICA) in statistics [55, 57]. A robust algorithm to solve Eq. (9), known as the AMUSE algorithm [47], will be given below. The eigenfunction approximations via Eq.
(6), using the coefficients b_i, are the optimal approximation to the true eigenfunctions and will give an optimal approximation of the timescales. As a result of the variational principle, λ̂_2(τ) ≤ λ_2(τ) and t̂_2(τ) = −τ/ln λ̂_2(τ) ≤ t_2. However, there is no guarantee that ψ_i ≈ ψ̂_i is true for i > 2, and therefore the variational principle can at this point not be extended to further timescales than t_2. In other words, the TICA timescales t̂_3, ..., t̂_m may be either under- or overestimated.

We do not intend to use the TICA timescales directly, but rather use the TICA subspace in order to construct a Markov model by finely discretizing this space. What can be said about the timescales of the resulting Markov model? We can use the variational principle summarized above to bound the timescales of a Markov model. Classical Markov models operate by assigning a configuration x uniquely to one of the geometric clusters used to construct them. It can be shown [79] that this operation is equivalent to using the basis functions

χ_i(x) = 1_i(x) / √π_i,

i.e. each basis function is built on the indicator function 1_i of the i-th cluster: a step function that has a constant value on the configurations belonging to the i-th cluster and is zero elsewhere. This basis is an orthonormal basis set: ⟨χ_i, χ_j⟩_µ = δ_ij π_i^{−1} ∫_{x∈S_i} µ(x) dx = δ_ij. Thus, the direct Ritz method applies and, as shown in [66], Eq. (8) becomes:

T(τ) r_i = r_i λ̃_i(τ)    (11)

where T(τ) is the Markov model transition matrix and R = [r_1, ..., r_n] are its right eigenvectors. To relate Eqs. (8) and (11) we have used the relation c^χ_ij(τ) = √(π_i/π_j) T_ij(τ), i.e. the covariance matrix between the Ansatz functions χ is the symmetrized transition matrix as given in [14]. Thus, a Markov model is the Ritz method for the choice of a step-function basis on the clusters used to build it, and thus gives an optimal step-function approximation to the eigenfunctions and maximal eigenvalues amongst all choices of functions that can be supported by the clustering. It follows from Eq.
(4) that at least the second timescale will then be underestimated. When the Markov model is sufficiently good at approximating the slowest processes, all of the first m timescales will be underestimated, as given by Eq. (5). It was shown [23] that this estimation error becomes smaller when τ is increased; Prinz et al. [71] showed that it decreases as τ^{−1}. As a result, when plotting the estimated timescales t̂_i(τ) as a function of τ, one obtains the well-known implied timescale plots shown in Figs. 2 and 4, where the estimated timescales t̂_i(τ) slowly converge to the true timescales as τ is increased.

We have now seen that both the TICA eigenvalue λ̂_2 and the corresponding timescale t̂_2 are underestimated, as are the Markov model eigenvalue λ̃_2 and the corresponding timescale t̃_2. Unfortunately, we cannot make a rigorous statement about how t̂_2 and t̃_2 are related to each other. However, we can make the ad hoc statement that we intend to cluster the dominant TICA subspace "sufficiently finely". Thereby, the Markov model step functions on the dominant TICA components allow the nonlinear eigenfunction ψ_2(x) to be approximated better than by the linear combination of order parameters (10) directly. For example, it is typical that the eigenfunction ψ_2(x) stays almost constant over a large part of configuration space and then changes abruptly to a different level in the transition state [81, 72]. Such behavior can be much better described by a step function than by a linear fit. Therefore, we shall here assume that the estimates of the dominant timescale order as t̂_2 < t̃_2 < t_2: the dominant TICA timescale t̂_2 is a lower bound to the true timescale t_2, but typically a poor one. The Markov model timescale t̃_2 is typically larger, and thus a better estimate of the true timescale t_2. This concept is illustrated in Fig. 1.
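The underestimation and slow convergence of Markov model timescales with τ can be reproduced in a toy setting of ours (not from the paper's data): a reversible 4-state chain is lumped into two coarse states, which is exactly a step-function basis, and the resulting timescale t̃_2(τ) = −τ/ln λ̃_2(τ) approaches the true t_2 from below as the lag time grows:

```python
import numpy as np

# Hypothetical reversible 4-state chain (symmetric counts imply detailed balance).
C = np.array([[50., 10., 1., 0.],
              [10., 40., 2., 0.],
              [1., 2., 40., 10.],
              [0., 0., 10., 50.]])
T = C / C.sum(axis=1, keepdims=True)
pi = C.sum(axis=1) / C.sum()
t2_true = -1.0 / np.log(np.sort(np.linalg.eigvals(T).real)[-2])

# Lump into two coarse states {0,1} and {2,3}: a 2-function step basis.
sets = [[0, 1], [2, 3]]
t_msm = []
for tau in [1, 5, 20, 100]:
    Ttau = np.linalg.matrix_power(T, tau)           # exact propagator at lag tau
    Tc = np.zeros((2, 2))
    for I, SI in enumerate(sets):
        for J, SJ in enumerate(sets):
            Tc[I, J] = (pi[SI][:, None] * Ttau[np.ix_(SI, SJ)]).sum() / pi[SI].sum()
    lam2 = np.sort(np.linalg.eigvals(Tc).real)[-2]  # second MSM eigenvalue
    t_msm.append(-tau / np.log(lam2))
print(t_msm, t2_true)   # t_msm increases toward t2_true from below
```

By the variational principle, every entry of `t_msm` is bounded from above by `t2_true`, and the bias shrinks as τ increases, which is exactly the behavior of an implied timescale plot.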
Figure 1: Scheme illustrating different approximations to the dominant eigenfunction ψ_2 of the molecular dynamics propagator, and the associated approximations to the slowest relaxation timescale t_2. TICA (blue) approximates the eigenfunction ψ_2 (black) as a linear combination of molecular observables, ψ̂_2(x) = Σ_{k=1}^d b_k r_k(x), and the TICA timescale t̂_2 associated with the TICA eigenvalue λ̂_2 underestimates (usually strongly) the true timescale t_2. The estimate is then improved by building a Markov model in TICA space, which approximates the eigenfunction ψ_2 by a step function (red) that is constant on the Markov model clusters. The corresponding Markov model estimate of the relaxation timescale, t̃_2, is thus typically larger than the TICA timescale t̂_2 and a better estimate of the true timescale t_2.

Having identified the "slow" linear combinations of input order parameters, the hope is that clustering in a low-dimensional linear subspace will provide a useful clustering metric for the accurate and efficient construction of Markov models with a moderate number of clusters. Here, we compare the performance of different cluster metrics, which are briefly described in the present section.

There are many software packages available for performing data clustering. For clustering and Markov model construction of molecular dynamics data, the packages EMMA [83], MSMbuilder [5], Wordom [82] and METAGUI [8] are currently available. Here, we use the EMMA package.

Clustering methods for discretizing MD trajectory data can be divided into two categories:

1. explicit coordinate methods that treat molecular coordinates as elements of an explicit vector space. The MD data is projected into the chosen coordinate set and then clustered by some distance metric (e.g. the Euclidean distance) in that space.

2.
pure metric methods that have no explicit vector space available, but rather a metric that measures the distance between pairs of molecular conformations. The clustering algorithm groups only existing configurations via this distance metric.

Coordinates used in (1) include Cartesian coordinates (provided there is a meaningful coordinate origin, for example defined by the largest molecule or domain in the system). For example, in Refs. [30, 13], the binding of a small ligand was analyzed by using the three-dimensional positions of the ligand with respect to the protein. Other frequently used coordinate sets are intramolecular coordinates such as dihedral angles [62] or inter-atomic distances. As an alternative to directly clustering the primary set of coordinates, coordinate transforms can be applied to preprocess the data, with the hope of identifying subspaces in which the clustering will be more informative. Below we discuss the transformations PCA and TICA in detail.

A metric often used in (2) is the normalized Euclidean metric after rigid-body translation and rotation have been removed, in short the minimal RMSD (or least RMSD) metric [39]. Minimal RMSD is an established metric for the analysis of MD trajectories and MSM construction, and is used here as a reference.

Discretization of trajectory data is performed using clustering algorithms. In a first stage, the trajectory is explored and n representative points of the coordinate space are selected as cluster centers with a clustering method. Various algorithms have been proposed in the literature, including k-means [50], k-centers [20], k-medoids [42], regular spatial clustering [83], regular temporal clustering [83] and Ward clustering [6].
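The discretization stage itself is simple once cluster centers are available. The following sketch, on synthetic placeholder data of ours, picks existing frames as centers (standing in for any of the algorithms above) and assigns every frame to its closest center, i.e. to a Voronoi cell:

```python
import numpy as np

# Minimal sketch of the discretization stage on synthetic data: cluster centers
# are picked from the trajectory, and every frame is then assigned to its
# closest center, yielding a Voronoi tessellation of the observed space.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # 1000 frames in a 3D order-parameter space

# Stand-in for any clustering algorithm: pick 10 existing frames as centers.
centers = X[rng.choice(len(X), size=10, replace=False)]

# Squared Euclidean distance of every frame to every center, then argmin:
d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
dtraj = d2.argmin(axis=1)               # discrete trajectory of Voronoi cell indices
print(np.bincount(dtraj, minlength=10)) # frames per cell; cells partition the data
```

The resulting discrete trajectory `dtraj` is the input from which transition counts, and hence the Markov model transition matrix, are estimated.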
The k-means algorithm requires the coordinate space to be a vector space (in order to compute the mean), whereas the other aforementioned algorithms only require a metric. Subsequent to the identification of cluster centers, the state space is partitioned by assigning each trajectory frame to its closest cluster center according to the same metric used for clustering. The discretization obtained this way is a Voronoi tessellation of the observed coordinate space. The Voronoi cells form a complete partition of the conformation space.

PCA is a linear transform that transforms coordinates in such a way that their instantaneous correlations vanish. It is frequently used in the MD community in order to identify the linear subspace in which the largest-amplitude motions occur, with the hope that these large-amplitude motions are most informative of functionally relevant transitions [4]. Let r ∈ ℝ^d be the vector of order parameters used, for example distances or Cartesian positions. Without restriction of generality we assume that r is mean-free, i.e. the mean of the data has already been subtracted. Note that r is generally only a subset of the full phase space coordinates, thus ℝ^d is a subset of Ω. For example, in protein simulations r usually contains only the protein coordinates, but not those of the solvent. The covariance matrix C^r of the order parameters r is defined by the elements:

c^r_ij = ⟨r_i r_j⟩

while the estimator for trajectory data containing N discrete time steps is:

ĉ^r_ij = 1/(N−1) Σ_{t=1}^N r_i(t) r_j(t)

The elements c^r_ij are covariances between different order parameters if i ≠ j and autocovariances if i = j. Principal components (PCs) are uncorrelated variables y that are obtained via an orthonormal transform of the original order parameters r.
For this, the eigenvectors w_i of the covariance matrix are obtained:

C^r w_i = w_i σ_i,

or, in matrix form, with the eigenvector matrix W = [w_1, ..., w_d] and the matrix of variances Σ = diag(σ_1, ..., σ_d):

C^r W = W Σ.

In order to transform an original coordinate vector r into principal components, we perform:

y^T = r^T W    (12)

Usually, principal components are sorted according to their variance σ_i. If σ_i decays rapidly with i, one often selects a threshold and ignores all PCs with smaller σ_i. This is done by using a matrix W′ ∈ ℝ^{d×m}, which contains only the dominant m < d column vectors of W. Used in this way, PCA is a tool for dimension reduction. In the present paper we use PCA in two ways: (1) as a direct dimension reduction tool to yield a subspace for clustering and subsequent Markov model construction, and (2) to transform the original data into the full set of principal components via Eq. (12), thus arriving at a decorrelated coordinate set as input for the subsequent transform (see the subsequent section). This way of using PCA is called whitening the data [1].

PCA is often used to analyze MD data, and has also been employed in the construction of Markov models. We used PCA to reduce the dimension of Pin WW protein simulations in order to build a protein folding Markov model with a lag time of only τ = 2 ns [67]. Stock and co-workers have explored the possibility of using dihedral angles as input to PCA (dPCA). As angular coordinates, they cannot be averaged in the same way as nonperiodic coordinates. Ref. [3] thus suggests using the sine and cosine of backbone dihedral angles as input coordinates. It is found for hepta-alanine that small-amplitude PCs are unimodal while large-amplitude PCs have multimodal distributions and thus contain the interesting conformation dynamics. Performing k-means clustering on the PCs did produce states with high metastability.
However, it has proven difficult to analyze proteins with several secondary structure elements using dPCA. In subsequent work [35], the approach was extended to the "dihedral PCA by parts" method.

Like PCA, TICA [55] uses a linear transform to map the original order parameters $r(t)$ to a new set of order parameters $z(t)$, the independent components (ICs). Unlike PCs, ICs have to fulfill two properties:
1. they are uncorrelated, and
2. their autocovariances at a fixed lag time $\tau$ are maximal.
The time-lagged covariance matrix $\mathbf{C}^r(\tau)$ is defined by
$$c^r_{ij}(\tau) = \langle r_i(t)\, r_j(t+\tau) \rangle,$$
and the estimator for trajectory data containing $N$ time steps is given by
$$\hat{c}^r_{ij}(\tau) = \frac{1}{N-\tau-1} \sum_{t=1}^{N-\tau} r_i(t)\, r_j(t+\tau).$$
The elements of $\mathbf{C}^r(\tau)$ are time-lagged autocovariances if $i = j$ and time-lagged cross-covariances if $i \neq j$. As shown in the Appendix, this matrix is symmetric under the assumption of reversible dynamics and in the limit of good statistics. For a finite dataset, symmetry must be enforced. We seek a transformation matrix $\mathbf{U} = [u_1, \ldots, u_d]$ that diagonalizes $\mathbf{C}^r(0)$ (to fulfill property 1) and maximizes the autocovariances $c^z_{ii}(\tau) = u_i^T \mathbf{C}^r(\tau)\, u_i$ for every column $u_i$ of $\mathbf{U}$ (to fulfill property 2). As described in Sec. 2.4, this is accomplished by solving
$$\mathbf{C}^r(\tau)\, u_i = \mathbf{C}^r(0)\, u_i\, \hat{\lambda}_i(\tau). \qquad (13)$$
Eq. (13) is equivalent to Eq. (9); see the Appendix for an illustrative derivation of (13). As described in Sec. 2.4, the second-largest estimated eigenvalue is a lower bound for the true second-largest propagator eigenvalue: $\hat{\lambda}_2(\tau) \le \lambda_2(\tau)$. ICs are now ordered according to the magnitude of the autocovariance $\hat{\lambda}_i(\tau)$, and the ICs with the largest autocovariances $\hat{\lambda}_i(\tau)$ will be called dominant.
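For illustration, Eq. (13) can be solved directly with a symmetric-definite generalized eigensolver such as `scipy.linalg.eigh`, provided $\mathbf{C}^r(0)$ has full rank. This sketch (the function name `tica` and the two-channel toy signal are our assumptions, not the paper's implementation) builds both covariance matrices from a trajectory, symmetrizes the lagged one, and sorts the ICs by dominance:

```python
import numpy as np
from scipy.linalg import eigh

def tica(r, tau):
    """Solve the generalized eigenvalue problem of Eq. (13),
    C^r(tau) u_i = lambda_i(tau) C^r(0) u_i."""
    r = r - r.mean(axis=0)
    N = len(r)
    C0 = r.T @ r / (N - 1)                      # instantaneous covariances
    Ct = r[:-tau].T @ r[tau:] / (N - tau - 1)   # time-lagged covariances
    Ct = 0.5 * (Ct + Ct.T)                      # enforce symmetry (finite data)
    lam, U = eigh(Ct, C0)                       # generalized eigenproblem
    order = np.argsort(np.abs(lam))[::-1]       # dominant ICs first
    return lam[order], U[:, order]

# toy trajectory: one slowly mixing channel plus one fast noise channel
rng = np.random.default_rng(2)
slow = np.cumsum(rng.standard_normal(2000)) * 0.05
r = np.column_stack([slow + 0.1 * rng.standard_normal(2000),
                     rng.standard_normal(2000)])
lam, U = tica(r, tau=10)
```

The dominant eigenvalue picks out the slowly decorrelating channel, while the white-noise channel yields an eigenvalue near zero.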
Since the dominant $m$ ICs span the linear subspace in which most of the slow processes are contained, it is reasonable to perform a direct clustering in this subspace, thus aiming at approximating the nonlinear behavior of the slowest $m$ eigenfunctions with step functions. This will yield a better approximation to the $m$ slowest timescales. Rewriting Eq. (13) in matrix form, with the matrix of autocovariances $\hat{\boldsymbol{\Lambda}}(\tau) = \mathrm{diag}(\hat{\lambda}_1(\tau), \ldots, \hat{\lambda}_d(\tau))$, yields
$$\mathbf{C}^r(\tau)\, \mathbf{U} = \mathbf{C}^r(0)\, \mathbf{U}\, \hat{\boldsymbol{\Lambda}}(\tau). \qquad (14)$$
In order to transform an original coordinate vector $r$ into independent components, we perform
$$\mathbf{z}^T = \mathbf{r}^T \mathbf{U}. \qquad (15)$$
How can (14) be solved? If $\mathbf{C}^r(0)$ or $\mathbf{C}^r(\tau)$ were invertible, the generalized eigenvalue problem could be transformed into an ordinary eigenvalue problem. But as we expect some of our original order parameters to be highly correlated, the determinants of $\mathbf{C}^r(0)$ and $\mathbf{C}^r(\tau)$ will be nearly zero, prohibiting this option. Alternatively, one can seek the solution of (13) via generalized eigensolvers. However, there is a simple and efficient alternative: problem (14) can also be solved by solving two ordinary eigenvalue problems using the AMUSE algorithm [47]. It consists of the following steps:
1. Use PCA to transform the mean-free data $r(t)$ into principal components $y(t)$.
2. Normalize the principal components: $y'(t) = \boldsymbol{\Sigma}^{-1/2}\, y(t)$.
3. Compute the symmetrized time-lagged covariance matrix $\mathbf{C}'^{\,y'}(\tau) = \frac{1}{2}\left[\mathbf{C}^{y'}(\tau) + \left(\mathbf{C}^{y'}(\tau)\right)^T\right]$ of the normalized PCs.
4. Compute an eigenvalue decomposition of $\mathbf{C}'^{\,y'}(\tau)$, obtaining the eigenvector matrix $\mathbf{V}$, and project the trajectory $y'(t)$ onto the dominant eigenvectors to obtain $z(t)$.
This only works when the eigenvectors of $\mathbf{C}'^{\,y'}(\tau)$ are uniquely defined, i.e. if the eigenvalues are not degenerate [1]. The main idea of this algorithm is that properties (1) and (2) can be fulfilled one after the other.
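The four AMUSE steps above can be sketched in numpy as follows. This is a minimal illustration under our own conventions (function name `amuse`, the toy three-channel signal, and the rank cutoff for near-degenerate directions are ours), not the paper's implementation:

```python
import numpy as np

def amuse(r, tau):
    """TICA via AMUSE: (1) PCA, (2) normalization, (3) symmetrized
    time-lagged covariance of the normalized PCs, (4) its
    eigendecomposition.  Returns the ICs z(t) and their autocovariances."""
    r = r - r.mean(axis=0)
    N = len(r)
    # steps 1-2: whitening, y'(t) = Sigma^{-1/2} W^T r(t)
    sigma, W = np.linalg.eigh(r.T @ r / (N - 1))
    keep = sigma > 1e-10 * sigma.max()        # drop near-singular directions
    T = W[:, keep] / np.sqrt(sigma[keep])     # W Sigma^{-1/2}
    y = r @ T
    # step 3: symmetrized time-lagged covariance of the whitened data
    Ct = y[:-tau].T @ y[tau:] / (N - tau - 1)
    Ct = 0.5 * (Ct + Ct.T)
    # step 4: eigendecomposition; V is orthogonal, so the ICs stay uncorrelated
    lam, V = np.linalg.eigh(Ct)
    order = np.argsort(np.abs(lam))[::-1]     # dominant ICs first
    return r @ (T @ V[:, order]), lam[order]  # z^T = r^T W Sigma^{-1/2} V

rng = np.random.default_rng(3)
slow = np.cumsum(rng.standard_normal(5000)) * 0.02
r = np.column_stack([slow,
                     rng.standard_normal(5000),
                     slow + rng.standard_normal(5000)])
z, lam = amuse(r, tau=20)
```

Because the whitening step makes the sample covariance of $y'$ the identity and $\mathbf{V}$ is orthogonal, the returned ICs are exactly uncorrelated with unit variance.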
First, steps 1 and 2 use PCA to produce decorrelated and normalized trajectories $y'(t)$, which is also known as whitening the data. Then steps 3 and 4 maximize the time-lagged autocovariances. Because the matrix $\mathbf{V}$ used in step 4 is orthogonal, it preserves scalar products between the vectors $y'(t)$. Thus, if the $y'(t)$ are uncorrelated (and properly normalized), then the $z(t)$ will also be uncorrelated. In summary, the transformation Eq. (15) can be written as a concatenation of three linear transforms:
$$\mathbf{z}^T(t) = \mathbf{r}^T(t)\, \mathbf{U} = \mathbf{r}^T(t)\, \mathbf{W} \boldsymbol{\Sigma}^{-1/2} \mathbf{V}. \qquad (16)$$
TICA will be used here as a dimension reduction technique: only the dominant TICA components will be used to construct a Markov model.

Markov models are constructed by first performing a data clustering using an appropriate metric described in the results section (using the EMMA command mm_cluster) and subsequently converting the molecular dynamics trajectory files into discrete trajectory files containing the sequence of cluster indexes visited (using the EMMA command mm_assign). For the purposes of the current paper, the main analysis concerns the behavior of the relaxation timescales implied by the estimated Markov model (EMMA command mm_timescales). All Markov model estimation is done as proposed in [72], using the maximum probability estimator of reversible transition matrices with a weak neighbor prior count matrix (EMMA default). Calling the transition matrix $\mathbf{T}(\tau)$, it has right eigenvectors $\psi_i$, left eigenvectors $\phi_i$, and eigenvalues $\tilde{\lambda}_i$ according to the following eigenvalue equations:
$$\mathbf{T}(\tau)\, \psi_i = \tilde{\lambda}_i(\tau)\, \psi_i$$
$$\phi_i^T\, \mathbf{T}(\tau) = \tilde{\lambda}_i(\tau)\, \phi_i^T$$
We order eigenvalues by descending norm. When $\mathbf{T}(\tau)$ is connected (irreducible), it has a unique eigenvalue of norm 1. The corresponding eigenvector can be normalized to yield the stationary distribution $\pi$:
$$\pi^T = \pi^T\, \mathbf{T}(\tau).$$
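For illustration, the eigenvalue equations and the stationary distribution can be obtained numerically for a small transition matrix. The 3-state matrix below is hypothetical, invented for the example; it is not from the paper's data:

```python
import numpy as np

# Hypothetical 3-state transition matrix T(tau); rows sum to 1.
T = np.array([[0.90, 0.08, 0.02],
              [0.04, 0.90, 0.06],
              [0.02, 0.12, 0.86]])

# Left eigenvectors phi_i^T T = lambda_i phi_i^T are right eigenvectors of T^T.
lam, Phi = np.linalg.eig(T.T)
order = np.argsort(np.abs(lam))[::-1]          # order by descending norm
lam, Phi = lam[order].real, Phi[:, order].real

# The eigenvector of the unique eigenvalue 1 gives the stationary distribution.
pi = Phi[:, 0] / Phi[:, 0].sum()               # pi^T = pi^T T(tau)
```

Since the matrix is irreducible, the largest eigenvalue is exactly 1 and the normalized left eigenvector is a valid probability distribution.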
Since $\mathbf{T}(\tau)$ fulfills detailed balance, the left and right eigenvectors are related by
$$\phi_i = \mathrm{diag}(\pi)\, \psi_i.$$
The estimated (implied) relaxation timescales of the Markov model are given by
$$\tilde{t}_i = -\frac{\tau}{\ln \tilde{\lambda}_i(\tau)},$$
which are, ignoring statistical errors, related to the true relaxation timescales by $\tilde{t}_i \le t_i$ (see theory section), and are typically larger than the timescales implied by the TICA eigenvalues (see theory section and Fig. 1).

Optimal indicators

Given the final Markov model transition matrix $\mathbf{T}(\tau)$, we can establish a simple way to quantify how well each of the input order parameters $r_k$ serves as an indicator of the slow process described by the eigenvector $\psi_i$: we simply compute the correlation between all pairs of order parameters and eigenvectors and then, for each eigenvector, choose the order parameter with maximum correlation:
$$r_{\mathrm{opt}}(i) = \arg\max_{r_k} \frac{\langle r_k \psi_i \rangle - \langle r_k \rangle \langle \psi_i \rangle}{\sqrt{\mathrm{var}(r_k)\, \mathrm{var}(\psi_i)}}. \qquad (17)$$
The averages in Eq. (17) can be computed via the Markov model states (in this case the average value of $r_k$ is computed for every microstate $j$, obtaining $\bar{r}_{k,j}$, and the correlation is given by $\langle r_k \psi_i \rangle = \sum_{j=1}^{n} \pi_j\, \bar{r}_{k,j}\, \psi_{i,j}$). Here we instead choose to evaluate the averages as time averages over all trajectory data; in this case, the eigenvector coordinate of each trajectory frame is given by the microstate it is assigned to.

The proposed methodology is demonstrated on two different peptide systems: the fluorescent peptide MR121-GSGSW and the 30-residue natively unstructured peptide KID. MR121-GSGSW is a well-studied fluorescent peptide that has been extensively characterized by experiments [59], simulations [19], and Markov models [60, 65].
Here, a data set of two explicit-solvent simulations of 3 µs each is used, which is publicly accessible as a benchmark dataset for the EMMA software package (see http://simtk.org/home/emma). The details of the simulation setup are described in [72]. The slowest relaxation timescale of the MR121-GSGSW data set has been estimated to be between 20 and 30 ns, and it has been found that the slowest processes are dominated by the interaction between MR121 and the tryptophan residue [65]. The data set is used as a benchmark system to test whether Markov model construction in PCA or TICA coordinates manages to identify the slow parameters, approximate the slow processes, and assign the correct timescales.

Fig. 2A1 shows a sample structure of MR121-GSGSW. Fig. 2B shows a benchmark for the relaxation timescales computed by regular-space clustering in the pairwise minimal-RMSD metric using 1000 cluster centers. The slowest processes are found at about 25 ns, 12 ns, and 8 ns, slightly larger (and thus more accurate according to the variational principle in Eq. (4)) than with the coarser Markov model in [65]. To set up the direct clustering, two internal coordinate sets are considered: (i) the set of 66 distances between the 12 coordinates defined by the 5 Cα's and the 7 ring centers involved, and (ii) the center position and orientation vector coordinates of the tryptophan sidechain in a coordinate set defined by the MR121 principal axes (see Fig. 2A2 for an illustration). Fig. 2C1-3 shows the results of direct k-means clustering with 1000 cluster centers in the space of the 66 intramolecular distances (C1), only the 9 tryptophan coordinates (C2), and the combined set (C3). It is clearly seen that the intramolecular distances are not suited to resolve the slowest processes, while the tryptophan coordinates resolve them very well. This can be understood from the structural arrangements shown in Fig. 3, which are dominated by the relative orientation of the tryptophan sidechain with respect to the MR121 ring system.
In particular the slowest process, the stacking-order exchange of the two ring systems, cannot be well described by the intramolecular distances, which are similar whether the tryptophan is "above" or "below" the MR121. Fig. 2C3 shows that discretizing the combined coordinate set resolves the slowest processes with timescales similar to those obtained with the 9 Trp coordinates alone. This is not always expected, as increasing the dimensionality of the space to be clustered while keeping the number of clusters constant will often reduce the resolution.

In the subsequent PCA and TICA analyses, different linear subspaces of the combined coordinate set were considered. Interestingly, clustering the principal components reduces the quality of the Markov model significantly. This is explained by the fact that the largest-amplitude motion in the present system is the transition between structures in which Trp and MR121 are in contact and open structures. However, open structures have a very low population, giving rise to a rather fast timescale of the opening/closing process. The slowest processes, involving different arrangements and orientations of the Trp and MR121 while in contact, give rise to comparatively small-amplitude motions. Using one and four PCA components (D1 and D2), the three slowest processes are not found. Using ten PCA components, the two slowest processes are found, although slightly underestimated, while the third-slowest process is not found.

Fig. 2E1-3 shows that the TICA coordinates perform very well indeed. Using only the single slowest TICA coordinate already resolves the slowest process well and gives rise to a timescale of 20-25 ns, close to the expected value. Using the four slowest TICA coordinates resolves the two slowest processes well, while somewhat underestimating the third process.
With ten TICA coordinates all slow processes are well resolved, and the timescales are found to be 27 ns, 13 ns, and 10 ns. Fig. 3A shows the resulting Markov model states in a kinetic map, where the coordinates are given by the two slowest left eigenvectors $\phi_2, \phi_3$: a cluster $i$ is drawn at the position $(\phi_{2,i}, \phi_{3,i})$ with a size proportional to its stationary probability $\pi_i$. The map is termed "kinetic" because similar positions in eigenvector space mean that the states can reach one another relatively quickly, while distant positions exchange only on timescale $t_2$ along the horizontal axis and on timescale $t_3$ along the vertical axis. The left eigenvectors are chosen instead of the right eigenvectors because the left eigenvectors are weighted by the stationary distribution: $\phi_{k,i} = \pi_i\, \psi_{k,i}$. Thus, points on the border of the map tend to have larger stationary probability. Therefore, the extremal points are at the same time populous and kinetically distant, and can roughly be associated with the most stable "free energy minima", while the smaller clusters connecting them correspond to transition states. The structures, shown for the most populous and kinetically distinct clusters, indicate that the slowest relaxations are associated with a stacking-order exchange of the MR121 and Trp groups and with a rotation of the Trp group with respect to the MR121 group (see "marker" atom shown as a blue sphere).

Fig. 3B illustrates the optimal indicators of the slowest processes, i.e. the input order parameters that have the largest correlation with the individual right Markov model eigenvectors $\psi_2$ and $\psi_3$. The correlation plots show that the respective order parameters attain clearly different values at the end-states of the transition, i.e. for the minimal and maximal values of the respective eigenvector. At intermediate values of the eigenvectors, i.e. in transition states, the order parameter can access many different values. This is easily seen in the slowest process (Fig.
3B1), where the best indicator is the Trp z-position that mediates the stacking-order exchange (correlation coefficient 0.84 with the second eigenvector $\psi_2$). While the value of the Trp z-position is clearly defined in the transition end-states, where the Trp is located "above" or "below" the MR121 moiety, the transition states include open configurations where the Trp and the MR121 are not in contact at all, and therefore all values of the z-position are accessible in these states. A similar behavior is seen for the second-slowest process (Trp sidechain rotation).

We now turn to another molecular system. Here, an extensive set of simulations of the kinase-inducible domain (KID) in explicit solvent was investigated. KID is part of the cAMP response element-binding protein (CREB). CREB is a transcription factor involved in processes as important as glucose regulation and memory, and it binds the CREB-binding protein (CBP), a well-known cancer-related molecular hub with around 300 interacting protein partners [41]. KID belongs to a large and important class of intrinsically unstructured peptides (IUP), encompassing many hormones, domains, and even whole proteins [91]. Unstructured regions perform their function even though they lack a well-defined secondary or tertiary structure in solution. Although standardized algorithms exist to detect unstructured regions on the basis of the primary amino acid sequence, the structural details of how disordered regions exert their function are still elusive. For example, some unstructured domains, including KID, become folded upon binding [89]; it is therefore of much interest (e.g.
for the druggability of protein-protein interactions) to investigate whether the presence of pre-formed elements causes folded conformations to be selected from the ensemble (conformational selection) [54], or whether the binding rather occurs through an induced-fit mechanism [85]. To shed light on this problem, we set up an ensemble of all-atom simulations of the phosphorylated KID (pKID) domain. We have performed 7706 all-atom explicit-solvent simulations of 24 ns each using the ACEMD software [29] on the GPUGRID distributed computing platform [12], yielding a total of 168 µs of simulation data. However, due to the short simulations, only short lagtimes could be used, presenting a challenge to the Markov model construction. The detailed simulation setup is described in the Appendix.

Fig. 4 shows the performance of different metrics in their ability to resolve the slowest processes of KID. KID is a more difficult case than the MR121-GSGSW peptide because its natively unstructured nature gives rise to many fast large-amplitude motions which conceal the slow processes in most ad hoc metrics. Fig. 4B and C show that neither regular-space clustering in the minimal pairwise RMSD metric nor direct clustering in all distances yields a converged estimate of the timescales up to lagtimes of 10 ns. Of these two, regular-space RMSD is better, reaching a timescale of about 170 ns at τ = 10 ns, while the direct clustering produces a timescale below 100 ns at τ = 10 ns. Larger lagtimes were avoided as they lead to a severe reduction of the usable data, because the connected set of clusters drops significantly below 100% beyond that point. Fig. 4D1-3 shows that the performance of principal components is even worse than direct clustering, giving rise to timescale estimates below 20 ns for one principal component and below 50 ns for ten principal components. This confirms that the largest-amplitude motions are not the slowest in KID. Fig.
4E1-3 shows the performance of the TICA coordinates using one, four, and ten dimensions. Using only the slowest TICA coordinate, a slow process of >200 ns is found that has not been resolved by the clustering in any of the other metrics; however, this timescale does not converge for lagtimes up to 10 ns. Using the four slowest TICA coordinates, three processes above 100 ns are already resolved, and the convergence behavior improves. Using the ten slowest TICA coordinates, five processes slower than 100 ns are resolved. The slowest process converges to a timescale around 220 ns, and does so already at a lagtime τ of 2-5 ns. Thus, the lagtime needed is a factor of 50-100 smaller than the timescale of the process, indicating a very good discretization of the corresponding process.

Fig. 5A illustrates the structural transitions associated with the two slowest relaxation processes of KID, as identified by the Markov model built in the ten-dimensional TICA subspace. We have decided to focus on the two slowest processes, with relaxation times around 200 and 220 ns, because they are somewhat separated from the next-slowest processes occurring at around 100 ns. As the peptide has great structural variability, it is of little value to plot all relevant structures. Therefore, we have again plotted the positions of the microstates in a kinetic map, using the coordinates of the two dominant left eigenvectors $\phi_2, \phi_3$. It is seen that at the slowest timescales, the system rearranges between mostly open and disordered structures (left), structures with one helix folded or partially folded (top right), and hairpin-like structures (bottom right). Thus the system has some residual helical structure, although it is not very stable in the absence of a stabilizing binding partner.

Fig. 5B illustrates the optimal indicators of the slowest processes, i.e. the input order parameters that have the largest correlation with the resulting right Markov model eigenvectors $\psi_2$ and $\psi_3$.
Like for MR121-GSGSW, the correlation plots show that the respective order parameters are mainly able to distinguish the end-states of the transition; but unlike for MR121-GSGSW, multiple Cα-Cα distances are almost equally good indicators of the same process. Fig. 5B shows correlation plots of the best indicators of $\psi_2$ and $\psi_3$, but indicates the five best correlations (all correlation coefficients above 0.7) in the structure. It is seen that the slowest process (timescale 220 ns) is best described by a hinge opening and closing, where the closed hinge appears to induce at least partial formation of the N-terminal helix (red, see Fig. 5B2). This is consistent with NMR experiments that have shown the N-terminal region to be approximately 50% helical in the apo form [73]. The second-slowest process (timescale 200 ns) is best described by partial helix formation in the C-terminal part (blue) of the protein.

In the present manuscript we have derived a method to find the optimal linear combination of input coordinates for approximating the slowest relaxation processes in complex conformational rearrangements of molecules. It is shown that an implementation of this method is already known in statistics as the TICA method, which is combined here with Markov modeling in order to construct models of the slow relaxation processes and precise estimates of the related relaxation timescales. It is shown that this approach to constructing Markov models yields slower timescales, and thus a more precise approximation to the true relaxation processes, than previous approaches.
This is also achieved for the natively unstructured peptide KID, where established approaches such as direct clustering in distance space, minimal-RMSD-based clustering, or clustering in PCA space did not perform well, because the largest-amplitude motions were not good indicators of the slowest relaxation processes. Beyond providing an approach to construct quantitatively accurate Markov models in a way that is more robust than most previous approaches, we readily obtain a way to find the best indicators of the slowest transitions. Best indicators are those molecular order parameters that are most correlated with the Markov model eigenvectors describing the slowest processes, and thus serve as candidates for good reaction coordinates. Being able to point out such indicators provides a way to make sometimes complex structural rearrangements readily understandable.
Acknowledgements
We thank all the volunteers of GPUGRID who donated GPU computing time to the project. We are grateful to Thomas Weikl (MPI Potsdam) for advice and support. G.P.-H. acknowledges support from German Science Foundation DFG fund NO 825-3. F.P. acknowledges funding from the Max Planck Society. T.G. gratefully acknowledges former support from the "Beatriu de Pinós" scheme of the Agència de Gestió d'Ajuts Universitaris i de Recerca (Generalitat de Catalunya). G.D.F. acknowledges support from the Ramón y Cajal scheme and support by the Spanish Ministry of Science and Innovation (Ref. BIO2011-27450). F.N. acknowledges funding from the DFG research center
Matheon.
Figure 2: MR121-GSGSW peptide and its dominant relaxation timescales calculated via different Markov model construction methods. (A1) Sample structure of the peptide. (A2) Illustration of the Trp coordinates used. The center position of the Trp and the orientation vectors are given in a coordinate system defined by the MR121 principal axes. (B) Relaxation timescales using regular-space RMSD clustering with approximately 1000 clusters. (C-E) Relaxation timescales using k-means with 1000 clusters and Euclidean metric but operating on different subspaces: (C1) Intramolecular distances between all Cα's and ring centers. (C2) Center position and orientation coordinates of the Trp moiety in the MR121 coordinate system. (C3) Combined coordinate set including intramolecular distances and Trp coordinates. (D1-3) Dominant PCA subspace of the combined coordinate set using 1, 4, and 10 dimensions. (E1-3) Dominant TICA subspace of the combined coordinate set using 1, 4, and 10 dimensions.

Figure 3: (A)
Kinetic map of the two slowest relaxation processes of MR121-GSGSW (around 27 ns and 13 ns) calculated from the Markov model shown in Fig. 2E3. The grey discs mark the coordinates of the 1000 microstates in the space of the left eigenvectors $\phi_2, \phi_3$. The slowest relaxation of the system thus takes place on the horizontal axis, the second-slowest on the vertical axis, and distances are associated with kinetic separation. The area of a disc is proportional to the stationary probability of the corresponding microstate. Some representative (kinetically distant and populous) microstates are shown as molecular structures. (B) Optimal indicators of the slow processes. The scatter plots show the correlation between the second and third right Markov model eigenvectors $\psi_2, \psi_3$ and the order parameters most correlated with them. The arrows in the structures show the optimal indicators. (B1) The Trp z-position mediates the stacking-order exchange and has a correlation coefficient of 0.84 with the second eigenvector $\psi_2$ (timescale 27 ns). (B2) The smallest Trp axis of inertia mediates the rotation of the side chain and has a correlation coefficient of 0.59 with the third eigenvector $\psi_3$ (timescale 13 ns).
Figure 4: KID peptide and its estimated dominant relaxation timescales using different Markov model construction methods. (A) Sample structure of KID. (B) Relaxation timescales using regular-space RMSD clustering with 1000 clusters. (C-E) Relaxation timescales using k-means with 1000 clusters and Euclidean metric but operating on different subspaces. (C) All Cα-Cα distances. (D1-3) Dominant PCA subspace of the Cα-Cα distances using 1, 4, and 10 dimensions. (E1-3) Dominant TICA subspace of the Cα-Cα distances using 1, 4, and 10 dimensions.

Figure 5: (A)
Kinetic map of the two slowest relaxation processes of the KID peptide (around 200 ns and 220 ns) calculated from the Markov model shown in Fig. 4E3. The grey discs mark the coordinates of the 1000 microstates in the space of the left eigenvectors $\phi_2, \phi_3$. The slowest relaxation of the system thus takes place on the horizontal axis, the second-slowest on the vertical axis, and distances are associated with kinetic separation. The area of a disc is proportional to the stationary probability of the corresponding microstate. Some representative (kinetically distant and populous) microstates are shown as molecular structures. (B) Optimal indicators of the slow processes. The scatter plots show the correlation between the second and third right Markov model eigenvectors $\psi_2, \psi_3$ and the order parameters most correlated with them. The colored lines show the five best indicators, all having correlation coefficients with the respective eigenvectors of 0.7 or greater. The slowest process may thus be described as opening/closing of the hinge between the two helical domains of KID (timescale 220 ns), while hinge-closing is associated with at least partial N-terminal helix formation (red). The second-slowest process may be described as partial helix formation in the "blue" region (timescale 200 ns).

Appendix: Derivation of TICA
The generalized eigenvalue problem of Eq. (9), and more specifically the TICA problem, can be derived in different ways. It goes back to the classical Ritz method [75] and can be found in many mathematical texts. In the following we sketch a standard derivation using variational calculus; see also [36, 2] for a thorough discussion of this approach.

Let $r \in \mathbb{R}^d$ be a vector of coordinates, for example distances or Cartesian positions. Without restriction of generality we assume that $r$ is mean-free, i.e. the mean of the data has already been subtracted. Note that $r$ generally contains only a subset of the full phase space coordinates, thus $\mathbb{R}^d$ is a subset of $\Omega$. We now seek new coordinates $z \in \mathbb{R}^m$ as a linear transformation of $r$ such that
1. the $z$ are uncorrelated, and
2. the autocovariances of $z$ at a fixed lag time $\tau$ are maximal.
We will show that if the coordinates $z$ are given by a weighted sum of $r$,
$$z_i(r) = \sum_{k=1}^{d} u_{ik}\, r_k, \qquad (18)$$
the weight coefficients have to fulfill the generalized eigenvalue problem (see theory section)
$$\mathbf{C}^r(\tau)\, u_i = \hat{\lambda}_i\, \mathbf{C}^r(0)\, u_i, \qquad (19)$$
where $\mathbf{C}^r(\tau)$ is the time-lagged covariance matrix defined by
$$c^r_{ij}(\tau) = \langle r_i(t)\, r_j(t+\tau) \rangle. \qquad (20)$$
To prove this, we rewrite the covariance matrix of $z$ and the time-lagged covariance matrix of $z$ using the defining equations (18) and (20):
$$c^z_{ij}(0) = \langle z_i(t)\, z_j(t) \rangle = \sum_{k,l} u_{ik} u_{jl} \langle r_k(t)\, r_l(t) \rangle = \sum_{k,l} u_{ik} u_{jl}\, c^r_{kl}(0)$$
$$c^z_{ij}(\tau) = \langle z_i(t)\, z_j(t+\tau) \rangle = \sum_{k,l} u_{ik} u_{jl} \langle r_k(t)\, r_l(t+\tau) \rangle = \sum_{k,l} u_{ik} u_{jl}\, c^r_{kl}(\tau)$$
We wish to maximize $c^z_{ii}(\tau)$ (property 2) under the constraint that $c^z_{ij}(0) = \delta_{ij}$ (property 1). We start by computing one coordinate $z_1$ with maximal autocovariance $c^z_{11}(\tau) = \sum_{i,j} u_i u_j\, c^r_{ij}(\tau)$, where we used the shorthand notation $u_i = u_{1i}$.
The constraint (property 1) for $z_1$ is now $c^z_{11}(0) = \sum_{i,j} u_i u_j\, c^r_{ij}(0) = 1$. Since the matrix elements $c^r_{ij}(0)$ are fixed, the autocovariance $c^z_{11}(\tau)$ can be treated as a differentiable function of the coefficients $u_i$. Therefore, we need to maximize the function
$$F(u_1, \ldots, u_d) = \sum_{k,l} u_k u_l\, c^r_{kl}(\tau) - \hat{\lambda}\left(\sum_{k,l} u_k u_l\, c^r_{kl}(0) - 1\right),$$
where $\hat{\lambda}$ is a Lagrange multiplier. We perform the maximization by setting the partial derivatives of $F$ with respect to the weight coefficients to zero:
$$\frac{\partial F}{\partial u_k} = 2\left(\sum_l u_l\, c^r_{kl}(\tau)\right) - 2\hat{\lambda}\left(\sum_l u_l\, c^r_{kl}(0)\right) = 0.$$
Rearranging and rewriting this equation in matrix-vector form leads to (19) for $i = 1$. The same argument is used for the subsequent eigenvalues. We now prove that the solutions of (19) fulfill the properties requested above:

1. The ICs obtained by solving (19) are uncorrelated: Let $u_i$ be a generalized eigenvector with eigenvalue $\lambda_i$ and let $u_j$ be a generalized eigenvector with eigenvalue $\lambda_j \neq \lambda_i$. Then the orthogonality condition
$$u_i^T\, \mathbf{C}^r(0)\, u_j = \delta_{ij} \qquad (21)$$
will hold if $\mathbf{C}^r(0)$ and $\mathbf{C}^r(\tau)$ are symmetric matrices. If $u_i$ and $u_j$ are used as the weights in (18), this is equivalent to $c^z_{ij}(0) = \delta_{ij}$.
Proof:
$$\lambda_i\, \mathbf{C}^r(0) u_i \cdot u_j = \mathbf{C}^r(\tau) u_i \cdot u_j = u_i \cdot \mathbf{C}^r(\tau) u_j = u_i \cdot \lambda_j\, \mathbf{C}^r(0) u_j = \lambda_j\, \mathbf{C}^r(0) u_i \cdot u_j$$
Therefore $0 = (\lambda_i - \lambda_j)\left(u_i^T \mathbf{C}^r(0)\, u_j\right)$. Because $\lambda_i \neq \lambda_j$, the orthogonality condition must hold. This does not hold if eigenvectors are degenerate, i.e. $\lambda_i = \lambda_j$ for some $i \neq j$. However, degeneracy can be avoided by changing the lag time $\tau$ such that no eigenvalues with large magnitude coincide. Solutions with smaller eigenvalues might still be degenerate, but this is unproblematic since these solutions are discarded for clustering. In addition, "fast" modes will necessarily be uncorrelated with "slow" modes, because their eigenvalues are far apart.

2.
The autocovariances at a fixed lag time $\tau$ are maximal: We show that the autocovariances are identical to the Lagrange multipliers, and thus to the eigenvalues in Eq. (19):
$$c^z_{ij}(\tau) = \hat{\lambda}_i\, \delta_{ij} \qquad (22)$$
To see this, multiply (19) with $u_j^T$ from the left to obtain
$$c^z_{ji}(\tau) = u_j^T\, \mathbf{C}^r(\tau)\, u_i = \hat{\lambda}_i\, u_j^T\, \mathbf{C}^r(0)\, u_i,$$
and now use the orthogonality condition (21):
$$c^z_{ji}(\tau) = c^z_{ij}(\tau) = u_j^T\, \mathbf{C}^r(\tau)\, u_i = \hat{\lambda}_i\, \delta_{ij}.$$
To show that the optimum found is indeed a maximum, we calculate the Hessian of the constrained autocovariance $c^z_{11}(\tau)$. Its elements are
$$H_{kl} = \frac{\partial^2 F}{\partial u_k\, \partial u_l} = 2\, c^r_{kl}(\tau) - 2\hat{\lambda}_1\, c^r_{kl}(0),$$
and we show that it is a negative semidefinite matrix:
$$\mathbf{x}^T \mathbf{H}\, \mathbf{x} = 2\left[\mathbf{x}^T \mathbf{C}^r(\tau)\, \mathbf{x} - \hat{\lambda}_1\, \mathbf{x}^T \mathbf{C}^r(0)\, \mathbf{x}\right] \le 0 \quad \forall\, \mathbf{x}.$$
We first expand $\mathbf{x}$ in the basis of the generalized eigenvectors, $\mathbf{x} = \sum_i^m u_i (u_i \cdot \mathbf{x}) = \sum_i^m u_i c_i$, and use equations (21) and (22):
$$\sum_{i,j} c_i c_j\, u_i^T \mathbf{C}^r(\tau)\, u_j - \hat{\lambda}_1 \sum_{i,j} c_i c_j\, u_i^T \mathbf{C}^r(0)\, u_j = \sum_i c_i^2\, \hat{\lambda}_i - \hat{\lambda}_1 \sum_i c_i^2.$$
Without loss of generality, we assume that the solution vectors of Eq. (19) were sorted by descending eigenvalues $\hat{\lambda}_i$ to obtain an ordering from "slow" modes to "fast" modes, $\hat{\lambda}_1 > \hat{\lambda}_2 > \ldots > \hat{\lambda}_m$. From this follows $\sum_i c_i^2\, \hat{\lambda}_i - \hat{\lambda}_1 \sum_i c_i^2 \le 0$ for the first solution $u_1$. The second solution is restricted to a subspace that is orthogonal to $u_1$ according to (21) and (22):
$$\mathbf{x}^T \mathbf{C}^r(0)\, u_1 = \mathbf{x}^T \mathbf{C}^r(\tau)\, u_1 = 0.$$
Therefore we can ignore the coefficient $c_1$ in the expansion of $\mathbf{x}$ and obtain the quadratic form
$$\sum_{i=2} c_i^2\, \hat{\lambda}_i - \hat{\lambda}_2 \sum_{i=2} c_i^2.$$
Again, this is nonpositive, because $\hat{\lambda}_2$ is the largest eigenvalue in the sum. This procedure can be extended to the third, fourth, ... eigenvalue, showing that the optima are maxima for all solutions. As a result, we can sort the solution vectors of Eq. (19) by descending eigenvalues $\hat{\lambda}_i$ to obtain an ordering from "slow" modes to "fast" modes.
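The orthogonality and autocovariance properties (21) and (22) can be checked numerically. In the sketch below, the matrices standing in for $\mathbf{C}^r(0)$ and $\mathbf{C}^r(\tau)$ are hypothetical symmetric test data, not quantities from the paper:

```python
import numpy as np
from scipy.linalg import eigh

# Build a symmetric positive-definite stand-in for C(0) and a symmetric
# stand-in for C(tau).
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
C0 = A @ A.T + 4 * np.eye(4)         # C(0): symmetric positive definite
B = rng.standard_normal((4, 4))
Ct = 0.1 * (B + B.T)                 # C(tau): symmetric

lam, U = eigh(Ct, C0)                # solves C(tau) u_i = lam_i C(0) u_i
G0 = U.T @ C0 @ U                    # should be the identity, Eq. (21)
Gt = U.T @ Ct @ U                    # should be diag(lam), Eq. (22)
```

`scipy.linalg.eigh` normalizes the generalized eigenvectors so that $\mathbf{U}^T \mathbf{C}(0)\, \mathbf{U} = \mathbf{I}$, which is exactly the normalization convention of Eq. (21).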
Symmetry and symmetrization of the time-lagged covariance matrix

Consider the covariance matrix of mean-free coordinates r for lag time \tau,

c^r_{ij}(\tau) = \langle r_i(t) r_j(t+\tau) \rangle,

and the corresponding correlation matrix,

cor^r_{ij}(\tau) = \frac{c^r_{ij}(\tau)}{\sigma_i \sigma_j} = \frac{\langle r_i(t) r_j(t+\tau) \rangle}{\sqrt{\langle r_i^2(t) \rangle \langle r_j^2(t) \rangle}} = \int_x \int_y dx \, dy \, \frac{xy}{\sigma_i \sigma_j} \, p(r_i(t) = x, r_j(t+\tau) = y),

where p(r_i(t) = x, r_j(t+\tau) = y) is the unconditional transition probability between the set S_1 = \{r_i = x\} and the set S_2 = \{r_j = y\} within lag time \tau. In statistically reversible dynamics, such an unconditional set-transition probability is symmetric (this follows directly from integrating the detailed balance condition \mu(x) p_\tau(y \mid x) = \mu(y) p_\tau(x \mid y) over the sets). Thus, we can exchange time indices and show:

cor^r_{ij}(\tau) = \int_x \int_y dx \, dy \, \frac{xy}{\sigma_i \sigma_j} \, p(r_i(t+\tau) = x, r_j(t) = y) = \int_y \int_x dy \, dx \, \frac{yx}{\sigma_j \sigma_i} \, p(r_j(t) = y, r_i(t+\tau) = x) = cor^r_{ji}(\tau),

and then trivially

c^r_{ij}(\tau) = c^r_{ji}(\tau) \quad \forall \tau.

When estimating correlation or covariance matrices from simulations, one cannot expect \hat{c}_{ij} = \hat{c}_{ji} to hold exactly. A trivial remedy is to use

c_{ij}(\tau) = \frac{1}{2} \left( \hat{c}_{ij}(\tau) + \hat{c}_{ji}(\tau) \right),

where \hat{c}_{ij}(\tau) is the simulation estimate.

Simulation setup, KID
The coordinates of the phosphorylated KID domain (28 residues, CREB residues 119-146) were extracted from chain B of entry 1KDX deposited in the Protein Data Bank. The entry represents the folded configuration of the pKID-CBP bound structure, determined through NMR [89]. Neutral acetylated and N-methyl caps were added to avoid artifactual charges at the peptide's termini; the protein was solvated with 6572 water molecules at an 85 mM KCl concentration (matching the experimental ionic strength). The system was then parametrized with the AMBER ff99SB-ILDN force field [49]; water and ions were modeled with the TIP3P and Joung-Cheatham parameter sets, respectively [37, 38]. The system thus prepared was first equilibrated for 24 ns in the constant-pressure ensemble, during which it stabilized at a volume of approximately (60 Å)^3. The peptide was then denatured by heating it at 500 K for 17.6 ns under constant-volume conditions; 176 frames were extracted from this trajectory and used as starting configurations for the production runs. All of the simulations were performed with a time step of 4 fs, enabled by the hydrogen mass repartitioning scheme [25]; long-range electrostatic forces were computed with the particle-mesh Ewald summation method. A nonbonded cutoff distance of 9 Å was used, with a switching distance of 7.5 Å for van der Waals interactions, while the lengths of bonds involving hydrogen atoms were constrained with the SHAKE algorithm. A set of 7706 production runs was executed on the GPUGRID distributed computing network [12]. Each simulation was performed in the constant-volume ensemble at 315 K for 24 ns with the same parametrization used for equilibration, storing structural snapshots every 100 ps. Each production simulation began either from one of the configurations visited during the denaturation run or from frames visited during preceding production trajectories.
Starting frames were selected iteratively with an adaptive strategy in order to minimize the statistical uncertainty of the largest eigenvalue, computed on the already available simulation data, based on Singhal and Pande's algorithm [31].

References

[1] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis, chapter 18, page 344. John Wiley & Sons, 2001.
[2] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[3] Alexandros Altis, Moritz Otten, Phuong H. Nguyen, Rainer Hegger, and Gerhard Stock. Construction of the free energy landscape of biomolecules via dihedral angle principal component analysis. The Journal of Chemical Physics, 128(24):245102, 2008.
[4] A. Amadei, A. B. Linssen, and H. J. C. Berendsen. Essential dynamics of proteins. Proteins, 17:412–425, 1993.
[5] Kyle A. Beauchamp, Gregory R. Bowman, Thomas J. Lane, Lutz Maibaum, Imran S. Haque, and Vijay S. Pande. MSMBuilder2: Modeling conformational dynamics at the picosecond to millisecond scale. Journal of Chemical Theory and Computation, 7(10):3412–3419, October 2011.
[6] Kyle A. Beauchamp, Robert McGibbon, Yu-Shan Lin, and Vijay S. Pande. Simple few-state models reveal hidden complexity in protein folding. Proceedings of the National Academy of Sciences, 2012.
[7] Alexander Berezhkovskii, Gerhard Hummer, and Attila Szabo. Reactive flux and folding pathways in network models of coarse-grained protein dynamics. The Journal of Chemical Physics, 130(20), May 2009.
[8] Xevi Biarnés, Fabio Pietrucci, Fabrizio Marinelli, and Alessandro Laio. METAGUI. A VMD interface for analyzing metadynamics and molecular dynamics simulations. Computer Physics Communications, 183(1):203–211, Jan 2012.
[9] Gregory R. Bowman, Kyle A. Beauchamp, George Boxer, and Vijay S. Pande. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys., 131(12):124101, September 2009.
[10] Gregory R. Bowman and Phillip L. Geissler. Equilibrium fluctuations of a single folded protein reveal a multitude of potential cryptic allosteric sites. Proceedings of the National Academy of Sciences, 109(29):11681–11686, 2012.
[11] Gregory R. Bowman, Vincent A. Voelz, and Vijay S. Pande. Atomistic folding simulations of the five-helix bundle protein lambda 6-85. Journal of the American Chemical Society, 133(4):664–667, February 2011.
[12] I. Buch, M. J. Harvey, T. Giorgino, D. P. Anderson, and G. De Fabritiis. High-throughput all-atom molecular dynamics simulations using distributed computing. Journal of Chemical Information and Modeling, 50(3):397–403, Mar 2010.
[13] Ignasi Buch, Toni Giorgino, and Gianni De Fabritiis. Complete reconstruction of an enzyme-inhibitor binding process by molecular dynamics simulations. Proceedings of the National Academy of Sciences, 108(25):10184–10189, June 2011.
[14] Nicolae V. Buchete and Gerhard Hummer. Coarse master equations for peptide folding dynamics. J. Phys. Chem. B, 112:6057–6069, 2008.
[15] Ginka S. Buchner, Ronan D. Murphy, Nicolae-Viorel Buchete, and Jan Kubelka. Dynamics of protein folding: Probing the kinetic network of folding-unfolding transitions with experiment and theory. Biochimica et Biophysica Acta, 1814:1001–1020, 2011.
[16] J. D. Chodera, K. A. Dill, N. Singhal, V. S. Pande, W. C. Swope, and J. W. Pitera. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys., 126:155101, 2007.
[17] H. S. Chung, J. M. Louis, and W. A. Eaton. Single-molecule fluorescence experiments determine protein folding transition path times. Science, 335:981–984, 2012.
[18] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. USA, 102:7426–7431, 2005.
[19] Isabella Daidone, Hannes Neuweiler, Sören Doose, Markus Sauer, and Jeremy C. Smith. Hydrogen-bond driven loop-closure kinetics in unfolded polypeptide chains. PLoS One, 6:e1000645, 2010.
[20] S. Dasgupta and P. Long. Performance guarantees for hierarchical clustering. J. Comput. Syst. Sci., 70(4):555–569, June 2005.
[21] B. de Groot, X. Daura, A. Mark, and H. Grubmüller. Essential dynamics of reversible peptide folding: Memory-free conformational dynamics governed by internal hydrogen bonds. J. Mol. Biol., 301:299–313, 2001.
[22] P. Deuflhard and M. Weber. Robust Perron cluster analysis in conformation dynamics. ZIB Report, 03-09, 2003.
[23] Natasa Djurdjevac, Marco Sarich, and Christof Schütte. Estimating the eigenvalue error of Markov state models. Multiscale Model. Simul., 10:61–81, 2012.
[24] Elan Z. Eisenmesser, Oscar Millet, Wladimir Labeikovsky, Dmitry M. Korzhnev, Magnus Wolf-Watz, Daryl A. Bosco, Jack J. Skalicky, Lewis E. Kay, and Dorothee Kern. Intrinsic dynamics of an enzyme underlies catalysis. Nature, 438(7064):117–121, November 2005.
[25] K. Anton Feenstra, Berk Hess, and Herman J. C. Berendsen. Improving efficiency of large time-scale molecular dynamics simulations of hydrogen-rich systems. Journal of Computational Chemistry, 20(8):786–798, 1999.
[26] S. Fischer, B. Windshuegel, D. Horak, K. C. Holmes, and J. C. Smith. Structural mechanism of the recovery stroke in the myosin molecular motor. Proc. Natl. Acad. Sci. USA, 102:6873–6878, 2005.
[27] Alexander Gansen, Alessandro Valeri, Florian Hauger, Suren Felekyan, Stanislav Kalinin, Katalin Tóth, Jörg Langowski, and Claus A. M. Seidel. Nucleosome disassembly intermediates characterized by single-molecule FRET. Proc. Natl. Acad. Sci. USA, 106(36):15308–15313, September 2009.
[28] J. Christof M. Gebhardt, Thomas Bornschlögl, and Matthias Rief. Full distance-resolved folding energy landscape of one single protein molecule. Proc. Natl. Acad. Sci. USA, 107(5):2013–2018, February 2010.
[29] M. J. Harvey, G. Giupponi, and G. De Fabritiis. ACEMD: Accelerating biomolecular dynamics in the microsecond time scale. Journal of Chemical Theory and Computation, 5(6):1632–1639, Jun 2009.
[30] Martin Held, Philipp Metzner, and Frank Noé. Mechanisms of protein-ligand association and its modulation by protein mutations. Biophys. J., 100:701–710, 2011.
[31] Nina Singhal Hinrichs and Vijay S. Pande. Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. The Journal of Chemical Physics, 126(24):244101, Jun 2007.
[32] Danzhi Huang and Amedeo Caflisch. The free energy landscape of small molecule unbinding. PLoS Comput. Biol., 7(2):e1002002, February 2011.
[33] Isaac A. Hubner, Eric J. Deeds, and Eugene I. Shakhnovich. Understanding ensemble protein folding at atomic detail. Proc. Natl. Acad. Sci. USA, 103(47):17747–17752, November 2006.
[34] M. Jäger, Y. Zhang, J. Bieschke, H. Nguyen, M. Dendle, M. E. Bowman, J. P. Noel, M. Gruebele, and J. W. Kelly. Structure-function-folding relationship in a WW domain. Proc. Natl. Acad. Sci. USA, 103:10648–10653, 2006.
[35] Abhinav Jain, Rainer Hegger, and Gerhard Stock. Hidden complexity of protein free-energy landscapes revealed by principal component analysis by parts. The Journal of Physical Chemistry Letters, 1(19):2769–2773, 2010.
[36] I. T. Jolliffe. Principal Component Analysis. Springer, New York, 2nd edition, 2002.
[37] William L. Jorgensen, Jayaraman Chandrasekhar, Jeffry D. Madura, Roger W. Impey, and Michael L. Klein. Comparison of simple potential functions for simulating liquid water. The Journal of Chemical Physics, 79(2):926–935, July 1983.
[38] In Suk Joung and Thomas E. Cheatham. Determination of alkali and halide monovalent ion parameters for use in explicitly solvated biomolecular simulations. The Journal of Physical Chemistry B, 112(30):9020–9041, Jul 2008.
[39] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32(5):922–923, Sep 1976.
[40] Mary E. Karpen, Douglas J. Tobias, and Charles L. Brooks. Statistical clustering techniques for the analysis of long molecular dynamics trajectories: analysis of 2.2-ns trajectories of YPGDV. Biochemistry, 32(2):412–420, January 1993.
[41] Lawryn H. Kasper, Tomofusa Fukuyama, Michelle A. Biesen, Fayçal Boussouar, Caili Tong, Antoine de Pauw, Peter J. Murray, Jan M. A. van Deursen, and Paul K. Brindle. Conditional knockout mice reveal distinct functions for the global transcriptional coactivators CBP and p300 in T-cell development. Molecular and Cellular Biology, 26(3):789–809, Feb 2006.
[42] L. Kaufman and P. J. Rousseeuw. Clustering by means of medoids. In Statistical Data Analysis Based on the L1-Norm and Related Methods, pages 405–416. North-Holland, 1987.
[43] Bettina Keller, Jan-Hendrik Prinz, and Frank Noé. Markov models and dynamical fingerprints: Unraveling the complexity of molecular kinetics. Chem. Phys., 396:92–107, 2012.
[44] Andrei Y. Kobitski, Alexander Nierth, Mark Helm, Andres Jäschke, and G. Ulrich Nienhaus. Mg2+-dependent folding of a Diels-Alderase ribozyme probed by single-molecule FRET analysis. Nucleic Acids Res., 35:2047–2059, 2007.
[45] S. V. Krivov and M. Karplus. Hidden complexity of free energy surfaces for peptide (protein) folding. Proc. Natl. Acad. Sci. USA, 101:14766–14770, 2004.
[46] Susanna Kube and Marcus Weber. A coarse graining method for the identification of transition rates between molecular conformations. J. Chem. Phys., 126(2):024103, 2007.
[47] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu. AMUSE: a new blind identification algorithm. Circuits and Systems, 3:1784–1787, 1990.
[48] Kresten Lindorff-Larsen, Stefano Piana, Ron O. Dror, and David E. Shaw. How fast-folding proteins fold. Science, 334:517–520, 2011.
[49] Kresten Lindorff-Larsen, Stefano Piana, Kim Palmo, Paul Maragakis, John L. Klepeis, Ron O. Dror, and David E. Shaw. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins, 78(8):1950–1958, Jun 2010.
[50] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.
[51] P. Metzner, C. Schütte, and E. Vanden-Eijnden. Transition path theory for Markov jump processes. Multiscale Model. Simul., 7:1192–1219, 2009.
[52] W. Min, G. Luo, B. J. Cherayil, S. C. Kou, and X. S. Xie. Observation of a power-law memory kernel for fluctuations within a single protein molecule. Phys. Rev. Lett., 94:198302, 2005.
[53] Ayori Mitsutake, Hiromitsu Iijima, and Hiroshi Takano. Relaxation mode analysis of a peptide system: Comparison with principal component analysis. The Journal of Chemical Physics, 135(16):164102, 2011.
[54] Amrita Mohan, Christopher J. Oldfield, Predrag Radivojac, Vladimir Vacic, Marc S. Cortese, A. Keith Dunker, and Vladimir N. Uversky. Analysis of molecular recognition features (MoRFs). Journal of Molecular Biology, 362(5):1043–1059, Oct 2006.
[55] L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72:3634–3637, Jun 1994.
[56] Stefanie Muff and Amedeo Caflisch. Kinetic analysis of molecular dynamics simulations reveals changes in the denatured state and switch of folding pathways upon single-point mutation of a beta-sheet miniprotein. Proteins, 70:1185–1195, 2007.
[57] Yusuke Naritomi and Sotaro Fuchigami. Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: The case of domain motions. The Journal of Chemical Physics, 134(6):065101, 2011.
[58] Heike Neubauer, Natalia Gaiko, Sylvia Berger, Jörg Schaffer, Christian Eggeling, Jennifer Tuma, Laurent Verdier, Claus A. Seidel, Christian Griesinger, and Andreas Volkmer. Orientational and dynamical heterogeneity of rhodamine 6G terminally attached to a DNA helix revealed by NMR and single-molecule fluorescence spectroscopy. J. Am. Chem. Soc., 129(42):12746–12755, October 2007.
[59] Hannes Neuweiler, Marc Löllmann, Sören Doose, and M. Sauer. Dynamics of unfolded polypeptide chains in crowded environment studied by fluorescence correlation spectroscopy. J. Mol. Biol., 365:856–869, 2007.
[60] F. Noé, I. Daidone, J. C. Smith, A. di Nola, and A. Amadei. Solvent electrostriction driven peptide folding revealed by quasi-Gaussian entropy theory and molecular dynamics simulation. Journal of Physical Chemistry B, 112:11155–11163, 2008.
[61] F. Noé and S. Fischer. Transition networks for modeling the kinetics of conformational transitions in macromolecules. Curr. Opin. Struct. Biol., 18:154–162, 2008.
[62] F. Noé, I. Horenko, C. Schütte, and J. C. Smith. Hierarchical analysis of conformational dynamics in biomolecules: Transition networks of metastable states. J. Chem. Phys., 126:155102, 2007.
[63] F. Noé, D. Krachtus, J. C. Smith, and S. Fischer. Transition networks for the comprehensive characterization of complex conformational change in proteins. J. Chem. Theory Comput., 2:840–857, 2006.
[64] Frank Noé. Probability distributions of molecular observables computed from Markov models. J. Chem. Phys., 128:244103, 2008.
[65] Frank Noé, Sören Doose, Isabella Daidone, Marc Löllmann, John D. Chodera, Markus Sauer, and Jeremy C. Smith. Dynamical fingerprints for probing individual relaxation processes in biomolecular dynamics with simulations and kinetic experiments. Proc. Natl. Acad. Sci. USA, 108:4822–4827, 2011.
[66] Frank Noé and Feliks Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul. (submitted, available at: http://publications.mi.fu-berlin.de/1109/), 2012.
[67] Frank Noé, Christof Schütte, Eric Vanden-Eijnden, Lothar Reich, and Thomas R. Weikl. Constructing the full ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci. USA, 106:19011–19016, 2009.
[68] Andreas Ostermann, Robert Waschipky, Fritz G. Parak, and Ulrich G. Nienhaus. Ligand binding and conformational motions in myoglobin. Nature, 404:205–208, 2000.
[69] Albert C. Pan and Benoît Roux. Building Markov state models along pathways to determine free energies and rates of transitions. J. Chem. Phys., 129(6):064107, August 2008.
[70] Menahem Pirchi, Guy Ziv, Inbal Riven, Sharona Sedghani Cohen, Nir Zohar, Yoav Barak, and Gilad Haran. Single-molecule fluorescence spectroscopy maps the folding landscape of a large protein. Nature Communications, 2:493, 2011.
[71] Jan-Hendrik Prinz, John D. Chodera, and Frank Noé. Robust rate estimate from spectral estimation. Phys. Rev. Lett. (submitted), 2011.
[72] Jan-Hendrik Prinz, Hao Wu, Marco Sarich, Bettina Keller, Martin Fischbach, Martin Held, John D. Chodera, Christof Schütte, and Frank Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134:174105, 2011.
[73] Ishwar Radhakrishnan, Gabriela C. Pérez-Alvarado, David Parker, H. Jane Dyson, Marc R. Montminy, and Peter E. Wright. Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: A model for activator:coactivator interactions. Cell, 91:741–752, 1997.
[74] F. Rao and A. Caflisch. The protein folding network. J. Mol. Biol., 342:299–306, 2004.
[75] Walter Ritz. Über eine neue Methode zur Lösung gewisser Variationsprobleme der mathematischen Physik. Journal für die Reine und Angewandte Mathematik, 135:1–61, 1909.
[76] M. A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi. Determination of reaction coordinates via locally scaled diffusion map. J. Chem. Phys., 134:124116, 2011.
[77] S. K. Sadiq, F. Noé, and G. De Fabritiis. Kinetic characterization of the critical step in HIV-1 protease maturation. Proc. Natl. Acad. Sci. USA (in press), 2012.
[78] Yusdi Santoso, Catherine M. Joyce, Olga Potapova, Ludovic Le Reste, Johannes Hohlbein, Joseph P. Torella, Nigel D. F. Grindley, and Achillefs N. Kapanidis. Conformational transitions in DNA polymerase I revealed by single-molecule FRET. Proc. Natl. Acad. Sci. USA, 107(2):715–720, January 2010.
[79] Marco Sarich, Frank Noé, and Christof Schütte. On the approximation error of Markov state models. SIAM Multiscale Model. Simul., 8:1154–1177, 2010.
[80] V. Schultheis, T. Hirschberger, H. Carstens, and P. Tavan. Extracting Markov models of peptide conformational dynamics from simulation data. J. Chem. Theory Comput., 1:515–526, 2005.
[81] C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard. A direct approach to conformational dynamics based on hybrid Monte Carlo. J. Comput. Phys., 151:146–168, 1999.
[82] M. Seeber, A. Felline, F. Raimondi, S. Muff, Ran Friedman, F. Rao, A. Caflisch, and F. Fanelli. Wordom: a user-friendly program for the analysis of molecular structures, trajectories, and free energy surfaces. J. Comp. Chem., 6:1183–1194, 2011.
[83] M. Senne, B. Trendelkamp-Schroer, A. S. J. S. Mey, C. Schütte, and F. Noé. EMMA: A software package for Markov model building and analysis. J. Chem. Theory Comput., 8:2223–2238, 2012.
[84] David E. Shaw, Paul Maragakis, Kresten Lindorff-Larsen, Stefano Piana, Ron O. Dror, Michael P. Eastwood, Joseph A. Bank, John M. Jumper, John K. Salmon, Yibing Shan, and Willy Wriggers. Atomic-level characterization of the structural dynamics of proteins. Science, 330(6002):341–346, October 2010.
[85] Benjamin A. Shoemaker, John J. Portman, and Peter G. Wolynes. Speeding molecular recognition by using the folding funnel: The fly-casting mechanism. Proceedings of the National Academy of Sciences, 97(16):8868–8873, Aug 2000.
[86] N. Singhal and V. S. Pande. Error analysis and efficient sampling in Markovian state models for molecular dynamics. J. Chem. Phys., 123:204909, 2005.
[87] S. Sriraman, I. G. Kevrekidis, and G. Hummer. Coarse master equation from Bayesian analysis of replica molecular dynamics simulations. J. Phys. Chem. B, 109:6479–6484, 2005.
[88] Johannes Stigler, Fabian Ziegler, Anja Gieseke, J. Christof M. Gebhardt, and Matthias Rief. The complex folding network of single calmodulin molecules. Science, 334:512–516, 2011.
[89] Kenji Sugase, H. Jane Dyson, and Peter E. Wright. Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature, 447(7147):1021–1025, Jul 2007.
[90] W. C. Swope, J. W. Pitera, and F. Suits. Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B, 108:6571–6581, 2004.
[91] Vladimir N. Uversky. Intrinsically disordered proteins from A to Z. The International Journal of Biochemistry & Cell Biology, 43(8):1090–1103, Aug 2011.
[92] Vincent A. Voelz, Gregory R. Bowman, Kyle Beauchamp, and Vijay S. Pande. Molecular simulation of ab initio protein folding for a millisecond folder NTL9. J. Am. Chem. Soc., 132(5):1526–1528, February 2010.
[93] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, 2003.
[94] M. Weber. Improved Perron cluster analysis. ZIB Report, 03-04, 2003.
[95] Beth G. Wensley, Sarah Batey, Fleur A. C. Bone, Zheng M. Chan, Nuala R. Tumelty, Annette Steward, Lee G. Kwa, Alessandro Borgia, and Jane Clarke. Experimental evidence for a frustrated energy landscape in a three-helix-bundle protein family.