[PDF] Accurate protein-folding transition-path statistics from a simple free-energy landscape

Abstract

A central goal of protein-folding theory is to predict the stochastic dynamics of transition paths --- the rare trajectories that transit between the folded and unfolded ensembles --- using only thermodynamic information, such as a low-dimensional equilibrium free-energy landscape. However, commonly used one-dimensional landscapes typically fall short of this aim, because an empirical coordinate-dependent diffusion coefficient has to be fit to transition-path trajectory data in order to reproduce the transition-path dynamics. We show that an alternative, first-principles free-energy landscape predicts transition-path statistics that agree well with simulations and single-molecule experiments without requiring dynamical data as an input. This 'topological configuration' model assumes that distinct, native-like substructures assemble on a timescale that is slower than native-contact formation but faster than the folding of the entire protein. Using only equilibrium simulation data to determine the free energies of these coarse-grained intermediate states, we predict a broad distribution of transition-path transit times that agrees well with the transition-path durations observed in simulations. We further show that both the distribution of finite-time displacements on a one-dimensional order parameter and the ensemble of transition-path trajectories generated by the model are consistent with the simulated transition paths. These results indicate that a landscape based on transient folding intermediates, which are often hidden by one-dimensional projections, can form the basis of a predictive model of protein-folding transition-path dynamics.

Full PDF

aa r X i v : . [ q - b i o . B M ] A ug Accurate protein-folding transition-path statistics from a simple free-energy landscape

William M. Jacobs and Eugene I. Shakhnovich

Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA (Dated: August 9, 2018)A central goal of protein-folding theory is to predict the stochastic dynamics of transition paths— the rare trajectories that transit between the folded and unfolded ensembles — using only ther-modynamic information, such as a low-dimensional equilibrium free-energy landscape. However,commonly used one-dimensional landscapes typically fall short of this aim, because an empiricalcoordinate-dependent diﬀusion coeﬃcient has to be ﬁt to transition-path trajectory data in orderto reproduce the transition-path dynamics. We show that an alternative, ﬁrst-principles free-energylandscape predicts transition-path statistics that agree well with simulations and single-moleculeexperiments without requiring dynamical data as an input. This ‘topological conﬁguration’ modelassumes that distinct, native-like substructures assemble on a timescale that is slower than native-contact formation but faster than the folding of the entire protein. Using only equilibrium simulationdata to determine the free energies of these coarse-grained intermediate states, we predict a broaddistribution of transition-path transit times that agrees well with the transition-path durations ob-served in simulations. We further show that both the distribution of ﬁnite-time displacements on aone-dimensional order parameter and the ensemble of transition-path trajectories generated by themodel are consistent with the simulated transition paths. These results indicate that a landscapebased on transient folding intermediates, which are often hidden by one-dimensional projections,can form the basis of a predictive model of protein-folding transition-path dynamics.

I. INTRODUCTION

In studies of complex molecular systems, free-energylandscapes provide a tractable and intuitive frameworkfor predicting rare events. Free-energy landscapes arelow-dimensional projections that describe the equilib-rium distribution of molecular conﬁgurations with re-spect to a small number of collective variables. However,when appropriately deﬁned, these landscapes can alsobe used to predict dynamical properties of equilibriumor near-equilibrium stochastic trajectories, such as therelative rates of transitions between macrostates. Thisfeature, combined with the fact that a variety of com-putational techniques have been developed for eﬃcientlycalculating free energies without directly simulating rareevents [1], makes free-energy landscapes useful for ratio-nalizing reaction and phase-transformation mechanismsin complex systems. Consequently, theories based onfree-energy landscapes are widely applied to problems inclassical [2–4] and non-classical [5, 6] nucleation, phaseseparation [7, 8], and protein folding [9–12].A particularly important problem in protein folding isthe prediction of transition paths between the unfoldedand folded ensembles [13–18]. These trajectories areboth rare and, at an atomistic level, extremely hetero-geneous, making this problem ideal for landscape-basedtheories. One of the most widely adopted approachesis to model folding as a diﬀusion process on a smooth,one-dimensional free-energy landscape [19–25]. Never-theless, even when using a good reaction coordinate, In this context, a good reaction coordinate is one that not onlydistinguishes the unfolded and folded states, but also takes asingle value for all conﬁgurations visited on transition paths thathave an equal probability of reaching either of these macrostates. one-dimensional landscapes typically have to be cor-rected empirically to reproduce the dynamical propertiesof the actual stochastic folding trajectories [25, 26]. Thiscorrection can be achieved by introducing a position-dependent diﬀusion coeﬃcient [27], since the gradientof the free-energy landscape itself is not suﬃcient topredict the relative rates of molecular motions on theone-dimensional reaction coordinate. The key limita-tion of this approach is that the transition-path trajec-tories that we wish to predict are required as input,either to determine the coordinate-dependence of thediﬀusion coeﬃcient [26, 28–30] or to ﬁnd a projectionfor which the apparent diﬀusive behavior is coordinate-independent [31, 32]. It is also unclear whether a sin-gle optimized one-dimensional coordinate can always befound for large proteins, which may have more compli-cated or parallel folding pathways [33]. Furthermore, re-cent single-molecule measurements of folding transitionpaths [15, 34] have provided experimental evidence ofthe shortcomings of one-dimensional landscapes, as thefolding free-energy barrier inferred by applying a one-dimensional diﬀusion model to measured transit-time dis-tributions is often inconsistent with the landscape deter-mined directly in the same experiments [35, 36].In contrast, an optimal free-energy landscape is onethat is capable of predicting the statistical properties ofstochastic transition paths without requiring additional,empirical kinetic information. To address this problem,we recently proposed an alternative, ﬁrst-principles ap-proach [37] for constructing structure-based free-energylandscapes to describe protein-folding transition paths.Based on an analysis of a native-centric ‘Ising-like’model [38–40], we postulated that the key events alongtransition paths coincide with the formation of native-like loops in the polymer backbone [41]. We thereforedevised a coarse-graining procedure in which microstatessharing the same set of native-like loops, but diﬀerentsets of native contacts, are grouped into the same ‘topo-logical conﬁguration.’ Because the loss of entropy dueto loop closure is not compensated until multiple stabi-lizing native contacts are formed, we further postulatedthat these topological conﬁgurations are, in general, sep-arated by free-energy barriers, leading to a separationof timescales between the formation of individual nativecontacts and the assembly of topological conﬁgurations.Consistent with this prediction, we found that topologi-cal conﬁgurations interconvert on much slower timescalesthan individual native contacts in atomistic simulationsand that these slower transitions follow roughly Marko-vian dynamics [37].In this paper, we show that the topological conﬁgura-tion model accurately describes the stochastic dynamicsof transition-path trajectories. By estimating the freeenergies of the predicted topological conﬁgurations us-ing equilibrium all-atom simulation snapshots [42], weapply this model to generate an ensemble of transitionpaths in terms of transitions between coarse-grained, par-tially folded states. First, we show that the distribu-tion of transit times predicted by this approach is muchcloser to the distribution of simulated transit times thanthat predicted by a model of diﬀusion on a smooth one-dimensional landscape. Second, we demonstrate that thedistribution of simulated ﬁnite-time displacements on aone-dimensional order parameter can be rationalized bythe topological conﬁguration model. Lastly, we use a hid-den Markov framework to show that the predicted sepa-ration of timescales generates an ensemble of transitionpaths that is consistent with the simulated folding trajec-tories. Overall, these results indicate that a free-energylandscape deﬁned on the basis of transient, native-like in-termediates can predict protein-folding transition pathswithout requiring post hoc corrections to the transition-path dynamics.

II. THEORYA. Deﬁnition of a topological conﬁguration

The central principle of the topological conﬁgura-tion model [37] is a separation between three diﬀerenttimescales: a relatively fast timescale associated withnative-contact formation, a slower timescale on whichsubstantial portions of native structure assemble, and aslowest timescale on which the entire protein folds. Thetimescale separation for these processes has been well es-tablished experimentally, with measurements reportingsingle-contact formation on timescales of approximately10 ns [43], native-like loop formation on timescales of100 ns–1 µ s [44], and the folding of proteins with approx-imately 100 residues or more on timescales of 100 µ s orlonger [45]. It is also well established that the entropy–enthalpy compensation of protein folding is imperfect, a ab bc cd d e ef f g g Ubiquitin (1ubq) contact map

Residue R e s i du e a Schematic ‘topological conﬁguration’ landscape b a ab abd abde abdeg abdefg abcdefg ∅ F r eee n e r g y FIG. 1. Deﬁnition of topological conﬁgurations for the proteinUbiquitin. (a) A map of the residue–residue native contactsdetermined from a crystal structure (PDB ID: 1ubq). Sub-structures, comprising sets of at least six adjacent native con-tacts, are colored and labeled. Native contacts that are notpart of any substructure are shown in gray. (b) A schematictopological conﬁguration free-energy landscape showing a sin-gle pathway between the completely unfolded, ‘ ∅ ,’ and na-tive, ‘abcdefg,’ conﬁgurations. These conﬁgurations are sepa-rated by free-energy barriers and thus interconvert on a slowertimescale than the formation of individual native contacts.Selected conﬁgurations are illustrated below, where stretchesof disordered residues are indicated by dashed lines. meaning that while a small number of native contacts israrely suﬃcient to counter the associated loss of conﬁgu-rational entropy completely, the formation of subsequentnative contacts tends to be more thermodynamically fa-vorable [46, 47]. This general feature, which is respon-sible for the overall free-energy barrier that determinesthe slowest timescale of protein folding, also gives rise tomany smaller barriers on folding transition paths. In par-ticular, the entropic penalty associated with the closureof a single native-like loop typically results in a small yetsigniﬁcant free-energy barrier, which in turn leads to adynamical timescale that is slower than the average rateof native-contact formation but faster than the rate atwhich the entire protein folds.By identifying native loops and considering all per-mutations of the order in which they can form, we canconstruct a free-energy landscape that captures this keyintermediate timescale. We previously demonstrated [37]how this analysis could be applied to a structure-based,‘Ising-like’ model based on the pioneering work of Eatonand colleagues [38, 48, 49]. In that work, we calculatedthe free-energy barriers between conﬁgurations in thestructure-based model to support the predicted separa-tion of timescales. We also provided indirect evidence ofintermediate barriers in an all-atom model by analyzingthe dwell times associated with the predicted topologicalconﬁgurations in all-atom simulations.As in Ref. 37, we shall focus on the protein Ubiquitin inorder to test the topological conﬁguration model’s abilityto predict transition-path dynamics. In the contact mapshown in Figure 1a, individual ‘substructures’ comprisenative contacts that are adjacent in the contact map.Topological conﬁgurations are deﬁned as combinationsof one or more substructures, including any additionalnative contacts between the residues that comprise thesesubstructures. As shown in the schematic Figure 1b, wecan then construct a free-energy landscape in the dis-crete space of topological conﬁgurations. Transitions areallowed between conﬁgurations that diﬀer by a single sub-structure (and thus a single native-like loop closure). Itis important to note that each topological conﬁgurationis not a rigid structure but rather an ensemble of mi-crostates, in which diﬀerent sets of native contacts arepresent but the same set of substructures (and, conse-quently, native-like loops) are represented. Furthermore,due to the predicted separation of timescales, the ﬂuc-tuations in native-contact formation within a topologicalconﬁguration are typically much faster than the transi-tions between conﬁgurations. B. Estimation of topological conﬁguration freeenergies from Molecular Dynamics simulations

In this paper, our goal is to evaluate the predictionsof a free-energy landscape constructed solely from ther-modynamic data. For this purpose, we analyzed snap-shots from all-atom Molecular Dynamics (MD) simula-tions [42], which were conducted under conditions wherea total of ten unbiased, reversible folding and unfolding Two contacts are adjacent if each residue in the ﬁrst contact is animmediate neighbor of one of the residues in the second contact;for example ( i, j ) is adjacent to ( i, j + 1) and ( i + 1 , j + 1). transition paths of Ubiquitin were observed. To calculatethe free energy associated with each predicted topologicalconﬁguration, we ﬁrst classiﬁed all simulation snapshots,recorded at 1 ns intervals, according to which substruc-tures are present. In each snapshot, we found all na-tive residue–residue contacts in which at least one pairof heavy atoms is less than 4 . i, j ) were estimated according to the relative frequen-cies of the classiﬁed snapshots,∆ F ij = − k B T log ( N i /N j ) , (1)where N i is the total number of snapshots assigned toconﬁguration i , k B is the Boltzmann constant, and T isthe absolute temperature. For comparison with theoriesbased on one-dimensional free-energy landscapes, we alsocalculated the free-energy landscape as a function of thenumber of native heavy-atom contacts [50] using a 4 . F ( x ) = − k B T log N ( x ) + const ., (2)where N ( x ) is the total number of snapshots in whichthe number of native contacts falls in the range[ x − ∆ x/ , x + ∆ x/ x is taken to be 4 native contacts.The central object of the topological conﬁgurationmodel is the rate matrix T for transitions between topo-logical conﬁgurations. Ideally, one should determine thetransition rates either from the free-energy barriers or themean ﬁrst passage times between conﬁgurations, but ac-curate calculations of this type are not possible given theavailable simulation data. Instead, we simply assumed asymmetric form that enforces detailed balance for the for-ward and backward rates between conﬁgurations i and j , T ij = k exp (cid:18) − ∆ F ji k B T (cid:19) i = j, (3)for pairs of conﬁgurations ( i, j ) that diﬀer by the additionor removal of a single substructure; the diagonal elementsof the matrix are then T ii = − P j = i T ij . The prefactor k is the same for all transitions and is left as an adjustableparameter that scales all barrier heights between con-ﬁgurations equivalently. However, due to the assumedseparation of timescales, we know that k should be slowcompared to the average rate of native-contact forma-tion. Then, given the matrix T , it is straightforward tocalculate the overall folding rate, k fold ; the committorassociated with each conﬁguration i , p fold ,i ; the proba-bility of ﬁnding a trajectory in a speciﬁc conﬁguration ∅ ad abadabd abde abdefabcdefabcdegabcdefg024680 100 200 300 400 500 600 700 800 900 ∅ adg abad abd abde abcdeabdefabcdefabcdegabcdefg00.510 100 200 300 400 500 600 700 800 900 Number of native contacts ∅ abdabe abde abcdeabcdgabdefabdeg abcdefabcdegabcdefgaba p f o l d F / k T F / k T - dp r o j ec t i o n - dp r o j ec t i o n abc x A x B FIG. 2. Comparison of the topological conﬁguration and one-dimensional (1-d) free-energy landscapes. (a) Projections ofthe free-energy surfaces onto the number of native contacts.Selected curves corresponding to distinct topological conﬁg-urations (solid colored lines) are labeled; the 1-d projection F ( x ), with bin width ∆ x = 4 native contacts, is shown by thedashed black line. (b) The unimodal projections of the topo-logical conﬁgurations onto the number of native contacts canbe characterized by Gaussian distributions with the indicatedmeans (squares) and standard deviations (error bars). Thedotted lines indicate allowed transitions between conﬁgura-tions that diﬀer by exactly one substructure. (c) The com-mittors, p fold , associated with each conﬁguration (squares),projected onto the number of native contacts. Selected conﬁg-urations are labeled, and the predicted folding ﬂuxes throughthe network of states are indicated by the widths of the graylines; ﬂuxes less than 5% have been omitted. For comparison,the dashed black line shows p fold ( x ) predicted by the 1-d land-scape, where the vertical dotted lines indicate the boundariesof the transition-path region, x A and x B . In all panels, theconﬁgurations are colored according to their estimated freeenergies, F i = − k B T log N i , with blue and red indicating lowand high free energies, respectively. i on a transition path, m AB i ; and the folding ﬂuxes be-tween adjacent states i and j , f AB ij , using transition-paththeory [51]. These quantities will be used throughoutour analysis. We shall show that, despite not undertak-ing detailed calculations of the individual barrier heights,this model produces remarkably accurate transition-pathstatistics. Furthermore, alternative choices for the formof the rates given in Eq. (3), such as a Metropolis func-tion [52], do not change the qualitative nature of ourresults. We also note that all timescales determined di-rectly from the atomistic simulations are accelerated rel-ative to experiments, in part due to the elevated temper-ature at which the simulations were conducted [42]. C. General properties of topological conﬁgurationfree-energy landscapes

In addition to a separation of timescales, our previ-ous analysis [37] of this model made two general pre-dictions that hold up when comparing with simulationdata. First, the Boltzmann-weighted ensemble of mi-crostates associated with each topological conﬁgurationis predicted to be unimodal when projected onto a one-dimensional coordinate. For example, the free energyas a function of the number of native contacts is uni-modal for all topological conﬁgurations, as shown bythe labeled colored curves in Figure 2a, suggesting thatthere are no signiﬁcant free-energy barriers between mi-crostates within each conﬁguration. Consequently, it isreasonable to approximate the projection of each conﬁg-uration onto this order parameter using a Gaussian dis-tribution with the estimated mean, h x i i , and variance, h x i i − h x i i , where the subscripts indicate averages overall snapshots classiﬁed as topological conﬁguration i , asshown in Figure 2b. This approximation will be used inthe discussion of hidden Markov modeling below.Second, in the case of proteins such as Ubiquitin withlittle structural symmetry, the free energies of the varioustopological conﬁgurations are relatively heterogeneous,which results in a small number of high-probabilitytransition paths through the network. By applyingtransition-path theory [51] to the rate matrix T , we cal-culated p fold for each conﬁguration and the folding ﬂuxbetween conﬁgurations on transition paths. Figure 2cshows that there is a nearly one-to-one correspondencebetween the predicted p fold and the number of nativecontacts when we consider only those conﬁgurations thatare likely to be visited on transition paths, even thoughthis order parameter played no role in the transition-paththeory calculations. This observation is consistent withthe fact that the number of native contacts is a goodreaction coordinate for identifying the ensemble of tran-sition states, where p fold = / , from simulated Ubiquitintransition paths [50]. However, knowing the location ofthe transition state on an order parameter is, in general,not suﬃcient to predict the transition-path dynamics. Inaddition, any one-dimensional projection almost invari-ably hides some of the intermediate free-energy barrierson transition paths that play an important role in thetransition-path dynamics [53, 54]. III. STATISTICAL ANALYSESA. Distribution of transition-path transit times

As an initial test of the model, we examined the pre-dicted distribution of transit times between the unfoldedand folded ensembles. For two-state proteins, transittimes are orders of magnitude smaller than the charac-teristic waiting time until a folding or unfolding eventoccurs [13, 15, 55, 56]. Nevertheless, recent advances insingle-molecule experiments [15, 34] have made it pos-sible to measure these brief trajectories. Although ex-perimental measurements have conﬁrmed that the dis-tribution of transit times has an exponential tail as ex-pected for stochastic barrier-crossing processes, the shapeof the distribution generally disagrees with the predic-tions of one-dimensional landscape theories [34–36]. Inparticular, the measured distributions are typically muchbroader than expected for a one-dimensional landscapewith a harmonic barrier.To compare the predictions of the one-dimensional(1-d) and topological conﬁguration models, we used ki-netic Monte Carlo (kMC) simulations [57] to sample tran-sition paths between absorbing states of a rate matrix, T . We ﬁrst calculated the distribution of transit timesfor the 1-d native-contacts landscape shown in Figure 2a.To this end, we discretized this landscape between theunfolded and folded free-energy minima into 150 bins(such that ∆ x = 4 native contacts) and constructed atri-diagonal transition matrix, T . We assumed thesymmetric form T x,x ± ∆ x ) = k exp (cid:20) − F ( x ± ∆ x ) − F ( x )2 k B T (cid:21) , (4)with T x,x ) = − [ T x,x − ∆ x ) + T x,x +∆ x ) ]. Because thetransit times are inversely proportional to the transition-matrix prefactor k , transit-time distributions for diﬀer-ent models can be compared by scaling t AB according tothe mean transit time, h t AB i . In this way we can seethat the predicted distribution of transit times, p ( t AB ),between the folded and unfolded free-energy minima x A and x B (Figure 2c) is relatively narrow, with a coeﬃcientof variation of 0 .

39 and an exponential tail (Figure 3a).Alternatively, we can ﬁt the decay constant of the ex-ponential tail, ω − , in order to compare with the the-oretical distribution for 1-d harmonic barrier crossings, P harmAB ( ωt ; ∆ F † ) [58], where the shape parameter ∆ F † is the height of the barrier (Figure 3b). The simulateddistribution of 1-d transit times agrees well with this har-monic prediction using the barrier height ∆ F † = 5 . kT determined directly from the empirical 1-d landscape(Figure 2a), despite the fact that this landscape is notperfectly harmonic. t AB / h t AB i P A B ( . k T ) P A B ( . k T ) ωt AB MD transition-path transit times p ( t A B ) p ( t A B ) ba FIG. 3. Distributions of transition-path transit times, p ( t AB ),for the models of Ubiquitin folding shown in Figure 2. (a) Thedistributions calculated via kinetic Monte Carlo simulationsof the 1-d native contacts landscape model, the topologicalconﬁguration model, and the quasi-one-dimensional minimumfree-energy path through the topological conﬁguration net-work. To compare the shapes of the distributions, all transittimes are scaled according to the mean transit time, h t AB i , foreach model. Also shown are the ten transit times observed inall-atom MD simulations. (b) The same three distributionswere ﬁt to the theoretical distribution for a harmonic barrier, P harmAB , in order to estimate the decay constant, ω − , of the ex-ponential tail. The transit-time distribution calculated fromthe empirical 1-d landscape agrees well with the theoreticaldistribution using the empirical barrier height, 5 . kT , whilethe distribution calculated from the topological conﬁgurationmodel can only be ﬁt by P harmAB if we set the shape parameter∆ F † equal to a much lower barrier height of 0 . kT . We then repeated these calculations using the topo-logical conﬁguration rate matrix deﬁned in Eq. (3). Theresulting distribution of transit times (Figure 3a) also hasan exponential tail, but is substantially broader, with acoeﬃcient of variation of 0 .

91. We ﬁnd that the distri-bution of transit times obtained from the MD simula-tions, which has a coeﬃcient of variation of 1 . ± . − − − −

20 0 20 40 60 80 . kT discrete state F / k T P A B ( . k T ) t AB / h t AB i harmonic periodic (CG) p ( t A B ) periodic ab FIG. 4. Distributions of transition-path transit times for twoone-dimensional toy landscapes. (a) The two toy free-energylandscapes, with a single harmonic barrier [blue empty circles; F ( x ) ∝ A cos(2 πx/ F ( x ) ∝ A cos(2 πx/ kT cos(10 πx/ h t AB i , foreach model. The single-barrier landscape has a transit-timedistribution that is narrower than the theoretical prediction, P harmAB , for this barrier height (black dashed line), while theintermediate-barrier transit-time distribution and its coarse-grained approximation are signiﬁcantly broader. is 10 ). When comparing the transit-time distributionpredicted by the topological conﬁguration model with theharmonic prediction P harmAB ( ωt ; ∆ F † ), the best ﬁt is ob-tained with a shape parameter ∆ F † that corresponds toa one-dimensional landscape with a 0 . kT barrier (Fig-ure 3b). Interestingly, this order-of-magnitude diﬀerencebetween the actual barrier height and that returned bya ﬁt to the harmonic theory is reminiscent of the discrep-ancy found in experimental measurements [34]. Note that the barrier on the minimum free-energy path throughthe topological conﬁguration network is also greater than 5 kT . The striking diﬀerence between the shapes of thesetransit-time distributions is primarily a consequence ofthe intermediate barrier crossings, as opposed to the mul-tidimensionality of the network model. For example, bysimulating transition paths that only traverse the quasi-one-dimensional minimum free-energy path through theconﬁguration network, we obtained a similar distributionof transit times (Figure 3). To explore this reasoningfurther, we constructed two toy 1-d landscapes with thesame barrier height and number of bins as in the em-pirical 1-d landscape (Figure 4a). In the ﬁrst landscape,the single barrier is, to a good approximation, harmonic,while in the second landscape, there are ﬁve interme-diate barriers. We then computed T using Eq. (4)and simulated the transition paths for these toy modelsvia kMC. As expected, the presence of the intermediatebarriers signiﬁcantly broadens the transit-time distribu-tion, resulting in a coeﬃcient of variation of 0.47 for theintermediate-barrier landscape versus 0.36 for the single-barrier toy landscape (Figure 4b). It is also possible tocoarse-grain the dynamics over the intermediate-barriertoy landscape by calculating the mean ﬁrst passage timesbetween the local free-energy minima (Figure 4a). Simu-lating the transition paths for this coarse-grained modelresults in a good agreement with the full intermediate-barrier toy landscape (Figure 4b). Although the diﬀer-ence between the transit-time distributions for these par-ticular toy models is smaller than that shown in Figure 3,this comparison clearly demonstrates that the presenceof intermediate barriers on the transition paths tends tobroaden the distribution of transit times. B. Distribution of ﬁnite-time displacements on aone-dimensional order parameter

As a second statistical test, we examined distributionsof ﬁnite-time displacements on a one-dimensional orderparameter. Using the number of native contacts as theorder parameter, we measured displacements, ∆ x , givena lag time ∆ t on all transition-path trajectories in theatomistic MD simulations. We considered lag times rang-ing from 1 ns, which is longer than the typical time re-quired for the formation of a single native contact, to ∼

100 ns, which is much shorter than the mean transittime, 2 . µ s, observed in the MD simulations. Afterverifying that Ubiquitin transition paths exhibit subdif-fusive motion over this range of lag times [59], meaningthat h [∆ x (∆ t )] i ∝ (∆ t ) p with p <

1, we sought to deter-mine whether, for a given lag time, the distribution of fre-quent, small displacements is predictive of larger jumps.By averaging over all MD transition-path trajectories andremoving the net directional motion, ( x B − x A ) /t AB , weﬁnd that the vast majority of displacements, for which | ∆ x | ≤ p h ∆ x (∆ t ) i , are well described by Gaussiandistributions over the entire range of lag times. However,larger displacements are much more frequent than pre-dicted by the tails of the Gaussian distributions, regard- − − − ∆ x/ p h ∆ x (∆ t ) i MD transition paths − − − p ( ∆ x | ∆ t ) ∆ x/ h σ i i Topo. conﬁg. network l og ( ∆ t / n s ) ,l og (cid:2) ∆ t × (cid:0) k f o l d / k M D f o l d (cid:1) (cid:3) − − − ∆ x/ p h ∆ x (∆ t ) i Toy model: harmonic − − − p ( ∆ x | ∆ t ) ∆ x/ p h ∆ x (∆ t ) i Toy model: periodic l og ( k ∆ t ) c ba d FIG. 5. Distributions of ﬁnite-time displacements on the 1-dnative contacts order parameter, averaged over the transition-path ensemble. (a) The distribution of displacements ∆ x aftera lag time ∆ t observed in all-atom MD transition paths. Thedistributions are centered, such that h ∆ x i = 0, and scaled bythe root-mean-squared displacement at each lag time. Thefrequent small displacements are well described by a Gaus-sian distribution with a unit standard deviation (black dashedline); however, larger displacements are much more frequentthan predicted by the tails of this Gaussian distribution. Col-ors correspond to the lag time, in units of nanoseconds, asshown by the scale bar on the right. (b) The predicted dis-tribution of displacements corresponding to transitions be-tween conﬁgurations in the topological conﬁguration model(dotted lines) is broader than a Gaussian distribution ﬁt tothe transition-path-ensemble-averaged ﬂuctuations within in-dividual conﬁgurations, leading to similar fat-tailed behavior.The lag times are scaled by the slowest timescale of the MDsimulations, ( k MDfold ) − , for comparison with panel a. (c–d) Dis-tributions calculated from kinetic Monte Carlo simulations oftransition paths on the two toy landscapes shown in Figure 4a.Only the intermediate-barrier landscape (panel d) reproducesthe fat-tailed behavior observed in the MD simulations. less of the lag time. This ‘fat-tailed’ behavior is shownin Figure 5a, where the distribution for each lag time isscaled according to its root-mean-squared displacementand compared to a unit Gaussian distribution indicatedby the black dashed line.This unusual feature is naturally predicted by thetimescale separation in the topological conﬁguration model, since the distribution of one-dimensional displace-ments is narrower for ﬂuctuations within a conﬁgura-tion than for transitions from one conﬁguration to an-other. To illustrate this idea, we compare the expecteddisplacement associated with a step between conﬁgura-tions, h ( σ i + σ j ) / i , with the average size of a ﬂuctu-ation within a conﬁguration, h σ i i , where the averagesare taken over all conﬁgurations and weighted by m AB i ,the probability of ﬁnding a transition-path trajectoryin any conﬁguration i (Figure 5b). The contributionfrom transitions between conﬁgurations, shown by col-ored dotted lines in Figure 5b, increases with the lagtime, since the probability of moving from conﬁguration i to j within a ﬁnite time ∆ t is given by the matrix expo-nential [exp(∆ t T )] ij . As a result of these larger displace-ments, the tails of the distribution are always outside ofthe unit Gaussian distribution, suggesting that the fat-tailed behavior observed in the MD simulation-deriveddistributions is also indicative of relatively rare interme-diate barrier crossings.To test this hypothesis, we analyzed the distribu-tions of ﬁnite-time displacements obtained from simu-lated transition paths over the two toy landscapes shownin Figure 4a. By scaling the distributions accordingto their root-mean-squared displacements and compar-ing with a unit Gaussian distribution (Figure 5c,d), weﬁnd that only the landscape with intermediate barriersresults in a qualitatively similar fat-tailed distribution.The single-barrier toy landscape, by contrast, results inlarge displacements being less frequent than predictedby a Gaussian distribution. We therefore conclude thatthe relative enrichment of large displacements within lagtimes of a few tens of nanoseconds is a likely consequenceof intermediate barrier crossings in the MD-simulatedtransition paths. C. Likelihood comparison between predicted andsimulated transition-path trajectories

Finally, we tested whether the transition-path trajecto-ries observed in the MD simulations, when projected ontoa one-dimensional coordinate, are representative of thetransition-path ensembles predicted by the topologicalconﬁguration model. To do so, we treated this stochas-tic process as a hidden Markov model with a discretestate space of topological conﬁgurations. In this model,transition paths traverse the discrete state space in accor-dance with the rate matrix T , but we assume that we canonly observe the instantaneous projection of each state s onto the 1-d coordinate x . With the exception of thetransition-matrix prefactor k , both the transition proba-bilities between states and the topological conﬁguration-dependent probabilities of observing a given number ofnative contacts, p ( x | s ), are completely determined byquantities calculated from the equilibrium MD simula-tion data. We ﬁrst removed all conﬁgurations in thetopological conﬁguration network that are not likely tobe visited on transition paths (less than 1% of predictedfolding ﬂux) to guard against overﬁtting. We then usedthe standard Viterbi algorithm [60, 61] to determine theunique sequence of conﬁgurations, { s l } , that maximizesthe log likelihood of the observed time series { x l } for eachtransition-path trajectory, h log L i = n − n X l =1 log (cid:2) p ( x l | s l )( e ∆ t T ) s l − ,s l (cid:3) , (5)where the index l runs over all consecutive snapshots oneach transition path, and we have normalized the loglikelihood to remove the trivial dependence on the totaltrajectory length n .A representative maximum likelihood ﬁt, using theﬁxed transition rates and emission probabilities deﬁnedin Eq. (3) and Figure 2b, respectively, is shown in Fig-ure 6a, where the apparent separation of timescales be-tween high-frequency oscillations and slower, step-likebehavior can be easily discerned by eye. Furthermore,the log likelihood of most probable path, h log L max i , isonly weakly dependent on the transition-matrix prefac-tor, k . (Figure 6b). Analyzing one frame per nanosec-ond, the maximum of h log L max i is found at a value of k that results in a predicted folding rate, k fold , of ap-proximately 1 × − frames. Importantly, this foldingrate agrees well with the empirical folding rate that wecalculated directly from the mean waiting times betweenfolding and unfolding transitions in the full MD trajec-tories, k MDfold ≃ . − (Figure 6b, inset ).We then performed an analogous hidden Markov anal-ysis for the 1-d landscape model with a constant diﬀusioncoeﬃcient. In this case, the discrete states { ¯ x } are binsof width ∆ x , the rate matrix is given by Eq. (4), andthe Gaussian emissions p ( x | s = ¯ x ) are assumed to havea standard deviation of ∆ x . For example, a represen-tative maximum likelihood ﬁt, assuming a bin width of∆ x = 16 native contacts, is shown in Figure 6a. Unlikethe topological conﬁguration model, we ﬁnd that the loglikelihood of the most probable path in the 1-d modelis strongly dependent on the transition-matrix prefac-tor, k . Furthermore, the maximum with respect to k corresponds to a folding rate that is orders of magnitudegreater than that determined from MD simulations. Thismeans that, in order to generate a transition path withthe observed high frequency ﬂuctuations on the empiri-cal 1-d landscape, k has to be tuned to a point wherethe predicted rates of folding and unfolding events areunrealistically fast. The topological conﬁguration modeldoes not suﬀer from this contradiction, since the sepa-ration of timescales between the fast motions within atopological conﬁguration and slower transitions betweenconﬁgurations is an intrinsic feature of the model.To ensure a fair comparison between these models, wecalculated the expected values of the log likelihood fortransition paths generated directly by both models, h log L i model = X s m AB s log (cid:2) h p ( x | s ) ih ( e ∆ t T ) s ′ ,s i (cid:3) , (6) − − − − − − − − − − − − − − M D f o l d i n g r a t e log [ k ∆ t × ( k fold /k )] − − − − − − M D r a t e − − − − −

204 8 12 16 20 24 ∆ t = 8 ns∆ t = 4 ns∆ t = 2 ns∆ t = 1 nsbin width ∆ x − − − − − ∆ t = n s ∆ t = n s ∆ t = n s ∆ t = n s unfold.folded 0 1000 2000 3000 Topo. conﬁg. unfold.folded 0 1000 2000 3000 t/ ns xx h l og L m a x i − h l og L i m o d e l Topo. conf. 1-d native contact landscape

Topo. conﬁg. model 1-d landscape bca ∆ x = 16 h l og L m a x i FIG. 6. Comparison of transition-path trajectories fromatomistic MD simulations and trajectories generated by theo-retical models. (a) The maximum-likelihood transition paths(red solid lines) through the discrete states of the topolog-ical conﬁguration and 1-d landscape models given a repre-sentative MD transition-path trajectory, projected onto thenumber of native contacts, x (blue lines). For the topologicalconﬁguration model, the expected ﬂuctuations (i.e., the stan-dard deviations of the Gaussian approximations in Figure 2b)within the most probable states are shown by red dashed lines.(b) The dependence of the per-frame log likelihood of the mostprobable sequence of states on the transition-matrix prefac-tor, k . This rate is scaled by the folding rate predicted by themodel, k fold , for comparison with the estimated MD foldingrate, k MDfold (black dashed line). The maximum of h log L max i for the topological conﬁguration model coincides with the MDfolding rate (inset). (c) Comparison between the log likeli-hood of the most probable sequence of states and the expectedlog likelihood for each model, h log L i model (see text). Whenﬁtting to the 1-d model, the agreement between these quan-tities depends strongly on the spatial coarse-graining, ∆ x , ofthe landscape and temporal averaging of the MD transition-path trajectories over time windows of width ∆ t . Error barsindicate the standard error of the mean. where the expectation values for the emission andtransition probabilities are h p ( x | s ) i ≡ R ∞−∞ p ( x | s ) dx and h [exp(∆ t T )] s ′ ,s i ≡ P s ′ { [exp(∆ t T )] s ′ ,s } , respectively, ineach state s . Fixing k to match the MD folding rate,Figure 6c shows that the log likelihood of the most prob-able path determined by ﬁtting the MD data is consis-tent with the expected log likelihood for the ensembleof transition paths generated by the topological conﬁg-uration model. This result is independent of temporalcoarse-graining, i.e., down-sampling the trajectory x ( t )by averaging over a moving window of width ∆ t . Bycontrast, temporal coarse-graining has a signiﬁcant eﬀecton the diﬀerence between the best-ﬁt and expected loglikelihoods for the 1-d landscape model when k is ﬁxedaccording to the MD folding rate, since increasing ∆ t preferentially removes high frequency ﬂuctuations. Thisdiﬀerence is also sensitive to the landscape bin width ∆ x ,since increasing ∆ x reduces the number of distinct statesand consequently slows the rate of transitions betweenadjacent states. As a result, increasing the bin width re-sults in an eﬀective separation of timescales on the 1-dlandscape, albeit without a ﬁrst-principles justiﬁcation.Only by introducing a separation of timescales through a post hoc combination of temporal coarse-graining of thetrajectories and spatial coarse-graining of the 1-d land-scape is it possible for the 1-d model to generate transi-tion paths that are consistent with those observed in theMD simulations (Figure 6c).In conclusion, this hidden Markov analysis highlightsthe importance of a separation of timescales for reproduc-ing the transition-path trajectories observed in atomisticMD simulations. In particular, the co-occurrence of fastﬂuctuations in the number of native contacts and infre-quent folding events is naturally captured by the topo-logical conﬁguration model, as seen by the agreementbetween the log likelihoods of the ﬁtted and predictedtransition-path trajectory ensembles. One-dimensionallandscape models that lack an intermediate timescale,by contrast, require that the trajectories observed in MDsimulations be smoothed substantially in order to con-form to the predicted transition-path dynamics. IV. DISCUSSION

We have shown that a theoretical model of protein fold-ing, which emphasizes an intermediate timescale associ-ated with transitions between distinct conﬁgurations ofpartial native structure, accurately predicts multiple sta-tistical properties of the stochastic dynamics of foldingtransition paths. By using equilibrium simulation data toconstruct an approximate rate matrix for transitions be-tween topological conﬁgurations, we demonstrated thatthe transition-path ensembles generated by this modelhave broad transit-time distributions that are consistentwith both all-atom simulations and experimental obser-vations. We then showed that the non-Gaussian distri-butions of ﬁnite-time displacements that are predicted by this model qualitatively match all-atom simulationresults. Lastly, we demonstrated that this model canreconcile rapid local ﬂuctuations on a one-dimensionalorder parameter with a slow overall rate of folding, twoseemingly contradictory features that are simultaneouslyobserved in simulated transition-path trajectories. Mostimportantly, all predictions of the topological conﬁgura-tion model were made without the use of any dynamicalinformation.The intermediate timescale in the topological conﬁgu-ration model is predicted to arise due to local free-energybarriers that separate transient states with distinct setsof native-like loops [37]. In this work, we assumed thatthe rates of transitions between states could be approxi-mated using a simple formula that satisﬁes detailed bal-ance. However, a more accurate approach would involvethe calculation of the rates between adjacent topologicalconﬁgurations in the all-atom model. Such an approachmight beneﬁt from recent advances in Markov state mod-eling [62, 63], although, in this application, the deﬁnitionsof the states would be assumed a priori on the basisof the native structure. Nevertheless, it is remarkablethat we are able to obtain qualitatively accurate resultsfor a variety of statistical tests using a highly simpliﬁedMarkov model and the estimated free energies of the pre-dicted topological conﬁgurations. This success indicatesthat the statistical analyses that we considered dependto a greater extent on the presence of intermediate bar-riers than on their precise heights, provided that thesebarriers are comparable to the thermal energy ( & k B T )and are thus kinetically relevant. Generalizing beyondUbiquitin, we anticipate that accounting for this inter-mediate timescale is likely to be especially important inthe context of large proteins, which tend to have com-plicated native topologies. Transition-path analyses ofsmall, ultrafast-folding proteins [64], by contrast, are lesslikely to beneﬁt from the coarse-graining strategy de-scribed here, since these proteins typically contain onlyone or two native-state loops whose formation dominatesthe overall folding rate. The lower probability of encoun-tering substantial free-energy barriers on folding transi-tion paths, which are needed to assume approximatelyMarkovian transitions between intermediate conﬁgura-tions, suggests that models of diﬀusion over a single har-monic barrier may be more appropriate in these partic-ular cases.To compare our approach with commonly used one-dimensional models, we assumed a projection onto thenumber of native contacts and a constant diﬀusion coef-ﬁcient. The number of native contacts has been shownto be a good reaction coordinate in the sense that, formany small proteins, it can be used to locate the tran-sition state from transition-path trajectories with highprobability [50]. The results presented here are not in-consistent with this notion of a good reaction coordinatewhen a single pathway through the network of topolog-ical conﬁgurations dominates, as shown by the similar-ity between the committors predicted by the two mod-0els. Furthermore, if one were to account for coordinate-dependent diﬀusion, it is likely that these two approacheswould lead to similar predictions for the transition-pathdynamics, since existing methods for ﬁtting coordinate-dependent diﬀusion coeﬃcients often reveal the existenceof intermediate barriers that were hidden by the projec-tion onto the original reaction coordinate [31]. However,to carry out such an analysis, dynamical information isalways required in some form [25], meaning that the un-derlying landscape is not, by itself, truly predictive.The topological conﬁguration model that we have ex-amined here diﬀers in a number of important ways fromalternative models of folding intermediates that havebeen proposed previously. Unlike the early hierarchicalmodel of Ptitsyn [65] and the more recent ‘foldon’ hy-pothesis [66], the assembly of native-like intermediatesneed not lead to a more negative free energy at everystep. At the same time, we have not assumed that thefree energy decreases only upon incorporation of the ﬁnalnative-like substructure, as proposed in a recent ‘volcano’model of folding [67]. By contrast, the highest point onthe minimum free-energy path through the network oftopological conﬁgurations is determined by the free en-ergies of the various conﬁgurations and the barriers be-tween them, which depend, in turn, on the temperatureand solvent conditions [37]. The topological conﬁgura-tion model also suggests a natural deﬁnition of a foldingpathway [68] at the level of topological conﬁgurationswhile allowing for alternative, yet less probable, path-ways.Finally, this theoretical analysis has a number ofimplications for experimental investigations of protein-folding transition paths. The transit-time and ﬁnite-time displacement distributions that we have discussedcan now be measured directly in single-molecule ex-periments. However, to distinguish between alterna-tive theoretical models, it would be most useful to an-alyze high-resolution experimental transition-path mea-surements using hidden Markov models in order to detectand characterize transient folding intermediates. Usingestablished non-parametric methods [69, 70], it shouldbe possible to assess both the number of distinct tran-sient states and any separation of timescales objectively.Furthermore, by combining such analyses with structure-based models, like the type discussed here, it should bepossible to extract more detailed information regarding the underlying free-energy landscape from these mea-surements. In this way, continued advances in single-molecule measurements can be used to improve predic-tive landscape-based models of protein-folding transi-tion paths, in particular for large proteins with complexnative-state topologies. V. CONCLUSION

We have shown that the introduction of an interme-diate timescale, which is faster than native-contact for-mation but slower than the typical time for folding anentire protein, can qualitatively alter the statistics ofprotein-folding transition paths. We proposed that thisintermediate timescale is associated with the assembly ofnative-like loops, and we used this principle to build acoarse-grained free-energy landscape for Ubiquitin fromequilibrium atomistic simulation data. Without rely-ing on any dynamical information from simulations, weshowed that this model predicts distributions of transittimes and ﬁnite-time displacements that are consistentwith simulated transition paths, but diﬀer qualitativelyfrom the predictions of a one-dimensional model of dif-fusion on an empirical free-energy landscape. We alsoused a hidden Markov analysis to demonstrate that thismodel generates transition paths that agree with boththe dynamics and kinetics inferred from reversible fold-ing simulations. Our results suggest that the analysis ofsingle-molecule transition-path trajectories may be im-proved by accounting for intermediate free-energy barri-ers, which are a fundamental aspect of the complexity offolding large biomolecules.

ACKNOWLEDGMENTS

The authors would like to thank D.E. Shaw Researchfor providing the all-atom Molecular Dynamics simu-lation data. In addition, the authors are grateful formany insightful discussions with Michael Manhart. Thiswork was supported by NIH grants F32GM116231 andGM068670. All analysis and simulation code is availablefrom the authors upon request. [1] D. Frenkel and B. Smit,

Understanding molecular simula-tion: From algorithms to applications (Academic Press,2001).[2] J. W. Gibbs, Am. J. Sci. , 441 (1878).[3] P. R. ten Wolde, M. J. Ruiz-Montero, and D. Frenkel,J. Chem. Phys. , 9932 (1996).[4] J. J. De Yoreo and P. G. Vekilov, Rev. Mineral. Geochem. , 57 (2003).[5] P. R. ten Wolde and D. Frenkel, Science , 1975 (1997). [6] W. M. Jacobs, A. Reinhardt, and D. Frenkel, Proc. Natl.Acad. Sci. U.S.A. , 6313 (2015).[7] D. Frenkel, Physica A: Stat. Mech. , 26 (1999).[8] W. Poon, F. Renth, and R. Evans, J. Phys.: Condens.Matter , A269 (2000).[9] V. Abkevich, A. Gutin, and E. Shakhnovich, J. Chem.Phys. , 6052 (1994).[10] E. I. Shakhnovich, Chem. Rev. , 1559 (2006). [11] D. Gfeller, P. De Los Rios, A. Caﬂisch, and F. Rao, Proc.Natl. Acad. Sci. U.S.A. , 1817 (2007).[12] D. Thirumalai, E. P. O’Brien, G. Morrison, andC. Hyeon, Ann. Rev. Biophys. , 159 (2010).[13] V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich,Biochemistry , 10026 (1994).[14] E. I. Shakhnovich, Curr. Opin. Struct. Biol. , 29 (1997).[15] H. S. Chung, K. McHale, J. M. Louis, and W. A. Eaton,Science , 981 (2012).[16] H. S. Chung and W. A. Eaton, Nature , 685 (2013).[17] H. S. Chung, T. Cellmer, J. M. Louis, and W. A. Eaton,Chem. Phys. , 229 (2013).[18] K. Truex, H. S. Chung, J. M. Louis, and W. A. Eaton,Phys. Rev. Lett. , 018101 (2015).[19] A. ˇSali, E. I. Shakhnovich, and M. Karplus, Nature ,248 (1994).[20] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G.Wolynes, Proteins: Struct., Func., and Bioinf. , 167(1995).[21] J. N. Onuchic and P. G. Wolynes, Curr. Opin. Struct.Biol. , 70 (2004).[22] A. Berezhkovskii and A. Szabo, J. Chem. Phys. ,014503 (2005).[23] R. B. Best and G. Hummer, Phys. Rev. Lett. , 228104(2006).[24] B. W. Zhang, D. Jasnow, and D. M. Zuckerman, J.Chem. Phys. , 074504 (2007).[25] R. B. Best and G. Hummer, Phys. Chem. Chem. Phys. , 16902 (2011).[26] G. Hummer, New J. Phys. , 34 (2005).[27] A. Berezhkovskii and A. Szabo, J. Chem. Phys. ,074108 (2011).[28] R. B. Best and G. Hummer, Proc. Natl. Acad. Sci. U.S.A. , 6732 (2005).[29] M. Hinczewski, Y. von Hansen, J. Dzubiella, and R. R.Netz, J. Chem. Phys. , 245103 (2010).[30] M. L. Mugnai and R. Elber, J. Chem. Phys. , 014105(2015).[31] R. B. Best and G. Hummer, Proc. Natl. Acad. Sci. U.S.A. , 1088 (2010).[32] P. Tiwary and B. Berne, J. Chem. Phys. , 152701(2017).[33] R. Du, V. S. Pande, A. Y. Grosberg, T. Tanaka, andE. I. Shakhnovich, J. Chem. Phys. , 334 (1998).[34] K. Neupane, D. A. N. Foster, D. R. Dee, H. Yu, F. Wang,and M. T. Woodside, Science , 239 (2016).[35] D. E. Makarov, J. Chem. Phys. , 071101 (2017).[36] R. Satija, A. Das, and D. E. Makarov, J. Chem. Phys. , 152707 (2017).[37] W. M. Jacobs and E. I. Shakhnovich, Biophys. J. ,925 (2016).[38] V. Mu˜noz and W. A. Eaton, Proc. Natl. Acad. Sci. U.S.A. , 11311 (1999).[39] E. Alm and D. Baker, Proc. Natl. Acad. Sci. U.S.A. ,11305 (1999).[40] O. V. Galzitskaya and A. V. Finkelstein, Proc. Natl.Acad. Sci. U.S.A. , 11299 (1999).[41] D. Frenkel, Biophys. J. , 893 (2016).[42] S. Piana, K. Lindorﬀ-Larsen, and D. E. Shaw, Proc.Natl. Acad. Sci. U.S.A. , 5915 (2013). [43] P. A. Thompson, W. A. Eaton, and J. Hofrichter, Bio-chemistry , 9200 (1997).[44] L. J. Lapidus, W. A. Eaton, and J. Hofrichter, Proc.Natl. Acad. Sci. U.S.A. , 7220 (2000).[45] S. O. Garbuzynskiy, D. N. Ivankov, N. S. Bogatyreva,and A. V. Finkelstein, Proc. Natl. Acad. Sci. U.S.A. ,147 (2013).[46] B. H. Zimm and J. K. Bragg, J. Chem. Phys. , 526(1959).[47] K. A. Dill, K. M. Fiebig, and H. S. Chan, Proc. Natl.Acad. Sci. U.S.A. , 1942 (1993).[48] J. Kubelka, E. R. Henry, T. Cellmer, J. Hofrichter, andW. A. Eaton, Proc. Natl. Acad. Sci. U.S.A. , 18655(2008).[49] E. R. Henry, R. B. Best, and W. A. Eaton, Proc. Natl.Acad. Sci. U.S.A. , 17880 (2013).[50] R. B. Best, G. Hummer, and W. A. Eaton, Proc. Natl.Acad. Sci. U.S.A. , 17874 (2013).[51] P. Metzner, C. Sch¨utte, and E. Vanden-Eijnden, Multi-scale Model. Sim. , 1192 (2009).[52] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth,A. H. Teller, and E. Teller, J. Chem. Phys. , 1087(1953).[53] S. V. Krivov and M. Karplus, Proc. Natl. Acad. Sci.U.S.A. , 14766 (2004).[54] S. V. Krivov and M. Karplus, J. Phys. Chem. B ,12689 (2006).[55] F. Ding, N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley,and E. I. Shakhnovich, Biophys. J. , 3525 (2002).[56] H. S. Chung, J. M. Louis, and W. A. Eaton, Proc. Natl.Acad. Sci. U.S.A. , 11837 (2009).[57] D. T. Gillespie, J. Phys. Chem. , 2340 (1977).[58] S. Chaudhury and D. E. Makarov, J. Chem. Phys. ,034118 (2010).[59] S. V. Krivov, PLoS Comp. Biol. , e1000921 (2010).[60] G. D. Forney, Proc. IEEE , 268 (1973).[61] “hmmlearn — hmmlearn 0.2.1 documentation,” http://hmmlearn.readthedocs.io/en/latest/ , ac-cessed: June 19, 2018.[62] G. R. Bowman, V. S. Pande, and F. No´e, in An intro-duction to Markov state models and their application tolong timescale molecular simulation (Springer, 2014) pp.1–6.[63] J. D. Chodera and F. No´e, Curr. Opin. Struct. Biol. ,135 (2014).[64] K. Lindorﬀ-Larsen, S. Piana, R. O. Dror, and D. E.Shaw, Science , 517 (2011).[65] O. Ptitsyn, Doklady Akademii Nauk SSSR , 1213(1973).[66] H. Maity, M. Maity, M. M. Krishna, L. Mayne, andS. W. Englander, Proc. Natl. Acad. Sci. U.S.A. , 4741(2005).[67] G. C. Rollins and K. A. Dill, J. Am. Chem. Soc. ,11420 (2014).[68] A. N. Adhikari, K. F. Freed, and T. R. Sosnick, Proc.Natl. Acad. Sci. U.S.A. , 17442 (2012).[69] I. Sgouralis, M. Whitmore, L. Lapidus, M. J. Comstock,and S. Press´e, J. Chem. Phys. , 123320 (2018).[70] I. Sgouralis and S. Press´e, Biophys. J.112