Accurate protein-folding transition-path statistics from a simple free-energy landscape
aa r X i v : . [ q - b i o . B M ] A ug Accurate protein-folding transition-path statistics from a simple free-energy landscape
William M. Jacobs and Eugene I. Shakhnovich
Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA (Dated: August 9, 2018)A central goal of protein-folding theory is to predict the stochastic dynamics of transition paths— the rare trajectories that transit between the folded and unfolded ensembles — using only ther-modynamic information, such as a low-dimensional equilibrium free-energy landscape. However,commonly used one-dimensional landscapes typically fall short of this aim, because an empiricalcoordinate-dependent diffusion coefficient has to be fit to transition-path trajectory data in orderto reproduce the transition-path dynamics. We show that an alternative, first-principles free-energylandscape predicts transition-path statistics that agree well with simulations and single-moleculeexperiments without requiring dynamical data as an input. This ‘topological configuration’ modelassumes that distinct, native-like substructures assemble on a timescale that is slower than native-contact formation but faster than the folding of the entire protein. Using only equilibrium simulationdata to determine the free energies of these coarse-grained intermediate states, we predict a broaddistribution of transition-path transit times that agrees well with the transition-path durations ob-served in simulations. We further show that both the distribution of finite-time displacements on aone-dimensional order parameter and the ensemble of transition-path trajectories generated by themodel are consistent with the simulated transition paths. These results indicate that a landscapebased on transient folding intermediates, which are often hidden by one-dimensional projections,can form the basis of a predictive model of protein-folding transition-path dynamics.
I. INTRODUCTION
In studies of complex molecular systems, free-energylandscapes provide a tractable and intuitive frameworkfor predicting rare events. Free-energy landscapes arelow-dimensional projections that describe the equilib-rium distribution of molecular configurations with re-spect to a small number of collective variables. However,when appropriately defined, these landscapes can alsobe used to predict dynamical properties of equilibriumor near-equilibrium stochastic trajectories, such as therelative rates of transitions between macrostates. Thisfeature, combined with the fact that a variety of com-putational techniques have been developed for efficientlycalculating free energies without directly simulating rareevents [1], makes free-energy landscapes useful for ratio-nalizing reaction and phase-transformation mechanismsin complex systems. Consequently, theories based onfree-energy landscapes are widely applied to problems inclassical [2–4] and non-classical [5, 6] nucleation, phaseseparation [7, 8], and protein folding [9–12].A particularly important problem in protein folding isthe prediction of transition paths between the unfoldedand folded ensembles [13–18]. These trajectories areboth rare and, at an atomistic level, extremely hetero-geneous, making this problem ideal for landscape-basedtheories. One of the most widely adopted approachesis to model folding as a diffusion process on a smooth,one-dimensional free-energy landscape [19–25]. Never-theless, even when using a good reaction coordinate, In this context, a good reaction coordinate is one that not onlydistinguishes the unfolded and folded states, but also takes asingle value for all configurations visited on transition paths thathave an equal probability of reaching either of these macrostates. one-dimensional landscapes typically have to be cor-rected empirically to reproduce the dynamical propertiesof the actual stochastic folding trajectories [25, 26]. Thiscorrection can be achieved by introducing a position-dependent diffusion coefficient [27], since the gradientof the free-energy landscape itself is not sufficient topredict the relative rates of molecular motions on theone-dimensional reaction coordinate. The key limita-tion of this approach is that the transition-path trajec-tories that we wish to predict are required as input,either to determine the coordinate-dependence of thediffusion coefficient [26, 28–30] or to find a projectionfor which the apparent diffusive behavior is coordinate-independent [31, 32]. It is also unclear whether a sin-gle optimized one-dimensional coordinate can always befound for large proteins, which may have more compli-cated or parallel folding pathways [33]. Furthermore, re-cent single-molecule measurements of folding transitionpaths [15, 34] have provided experimental evidence ofthe shortcomings of one-dimensional landscapes, as thefolding free-energy barrier inferred by applying a one-dimensional diffusion model to measured transit-time dis-tributions is often inconsistent with the landscape deter-mined directly in the same experiments [35, 36].In contrast, an optimal free-energy landscape is onethat is capable of predicting the statistical properties ofstochastic transition paths without requiring additional,empirical kinetic information. To address this problem,we recently proposed an alternative, first-principles ap-proach [37] for constructing structure-based free-energylandscapes to describe protein-folding transition paths.Based on an analysis of a native-centric ‘Ising-like’model [38–40], we postulated that the key events alongtransition paths coincide with the formation of native-like loops in the polymer backbone [41]. We thereforedevised a coarse-graining procedure in which microstatessharing the same set of native-like loops, but differentsets of native contacts, are grouped into the same ‘topo-logical configuration.’ Because the loss of entropy dueto loop closure is not compensated until multiple stabi-lizing native contacts are formed, we further postulatedthat these topological configurations are, in general, sep-arated by free-energy barriers, leading to a separationof timescales between the formation of individual nativecontacts and the assembly of topological configurations.Consistent with this prediction, we found that topologi-cal configurations interconvert on much slower timescalesthan individual native contacts in atomistic simulationsand that these slower transitions follow roughly Marko-vian dynamics [37].In this paper, we show that the topological configura-tion model accurately describes the stochastic dynamicsof transition-path trajectories. By estimating the freeenergies of the predicted topological configurations us-ing equilibrium all-atom simulation snapshots [42], weapply this model to generate an ensemble of transitionpaths in terms of transitions between coarse-grained, par-tially folded states. First, we show that the distribu-tion of transit times predicted by this approach is muchcloser to the distribution of simulated transit times thanthat predicted by a model of diffusion on a smooth one-dimensional landscape. Second, we demonstrate that thedistribution of simulated finite-time displacements on aone-dimensional order parameter can be rationalized bythe topological configuration model. Lastly, we use a hid-den Markov framework to show that the predicted sepa-ration of timescales generates an ensemble of transitionpaths that is consistent with the simulated folding trajec-tories. Overall, these results indicate that a free-energylandscape defined on the basis of transient, native-like in-termediates can predict protein-folding transition pathswithout requiring post hoc corrections to the transition-path dynamics.
II. THEORYA. Definition of a topological configuration
The central principle of the topological configura-tion model [37] is a separation between three differenttimescales: a relatively fast timescale associated withnative-contact formation, a slower timescale on whichsubstantial portions of native structure assemble, and aslowest timescale on which the entire protein folds. Thetimescale separation for these processes has been well es-tablished experimentally, with measurements reportingsingle-contact formation on timescales of approximately10 ns [43], native-like loop formation on timescales of100 ns–1 µ s [44], and the folding of proteins with approx-imately 100 residues or more on timescales of 100 µ s orlonger [45]. It is also well established that the entropy–enthalpy compensation of protein folding is imperfect, a ab bc cd d e ef f g g Ubiquitin (1ubq) contact map
Residue R e s i du e a Schematic ‘topological configuration’ landscape b a ab abd abde abdeg abdefg abcdefg ∅ F r eee n e r g y FIG. 1. Definition of topological configurations for the proteinUbiquitin. (a) A map of the residue–residue native contactsdetermined from a crystal structure (PDB ID: 1ubq). Sub-structures, comprising sets of at least six adjacent native con-tacts, are colored and labeled. Native contacts that are notpart of any substructure are shown in gray. (b) A schematictopological configuration free-energy landscape showing a sin-gle pathway between the completely unfolded, ‘ ∅ ,’ and na-tive, ‘abcdefg,’ configurations. These configurations are sepa-rated by free-energy barriers and thus interconvert on a slowertimescale than the formation of individual native contacts.Selected configurations are illustrated below, where stretchesof disordered residues are indicated by dashed lines. meaning that while a small number of native contacts israrely sufficient to counter the associated loss of configu-rational entropy completely, the formation of subsequentnative contacts tends to be more thermodynamically fa-vorable [46, 47]. This general feature, which is respon-sible for the overall free-energy barrier that determinesthe slowest timescale of protein folding, also gives rise tomany smaller barriers on folding transition paths. In par-ticular, the entropic penalty associated with the closureof a single native-like loop typically results in a small yetsignificant free-energy barrier, which in turn leads to adynamical timescale that is slower than the average rateof native-contact formation but faster than the rate atwhich the entire protein folds.By identifying native loops and considering all per-mutations of the order in which they can form, we canconstruct a free-energy landscape that captures this keyintermediate timescale. We previously demonstrated [37]how this analysis could be applied to a structure-based,‘Ising-like’ model based on the pioneering work of Eatonand colleagues [38, 48, 49]. In that work, we calculatedthe free-energy barriers between configurations in thestructure-based model to support the predicted separa-tion of timescales. We also provided indirect evidence ofintermediate barriers in an all-atom model by analyzingthe dwell times associated with the predicted topologicalconfigurations in all-atom simulations.As in Ref. 37, we shall focus on the protein Ubiquitin inorder to test the topological configuration model’s abilityto predict transition-path dynamics. In the contact mapshown in Figure 1a, individual ‘substructures’ comprisenative contacts that are adjacent in the contact map.Topological configurations are defined as combinationsof one or more substructures, including any additionalnative contacts between the residues that comprise thesesubstructures. As shown in the schematic Figure 1b, wecan then construct a free-energy landscape in the dis-crete space of topological configurations. Transitions areallowed between configurations that differ by a single sub-structure (and thus a single native-like loop closure). Itis important to note that each topological configurationis not a rigid structure but rather an ensemble of mi-crostates, in which different sets of native contacts arepresent but the same set of substructures (and, conse-quently, native-like loops) are represented. Furthermore,due to the predicted separation of timescales, the fluc-tuations in native-contact formation within a topologicalconfiguration are typically much faster than the transi-tions between configurations. B. Estimation of topological configuration freeenergies from Molecular Dynamics simulations
In this paper, our goal is to evaluate the predictionsof a free-energy landscape constructed solely from ther-modynamic data. For this purpose, we analyzed snap-shots from all-atom Molecular Dynamics (MD) simula-tions [42], which were conducted under conditions wherea total of ten unbiased, reversible folding and unfolding Two contacts are adjacent if each residue in the first contact is animmediate neighbor of one of the residues in the second contact;for example ( i, j ) is adjacent to ( i, j + 1) and ( i + 1 , j + 1). transition paths of Ubiquitin were observed. To calculatethe free energy associated with each predicted topologicalconfiguration, we first classified all simulation snapshots,recorded at 1 ns intervals, according to which substruc-tures are present. In each snapshot, we found all na-tive residue–residue contacts in which at least one pairof heavy atoms is less than 4 . i, j ) were estimated according to the relative frequen-cies of the classified snapshots,∆ F ij = − k B T log ( N i /N j ) , (1)where N i is the total number of snapshots assigned toconfiguration i , k B is the Boltzmann constant, and T isthe absolute temperature. For comparison with theoriesbased on one-dimensional free-energy landscapes, we alsocalculated the free-energy landscape as a function of thenumber of native heavy-atom contacts [50] using a 4 . F ( x ) = − k B T log N ( x ) + const ., (2)where N ( x ) is the total number of snapshots in whichthe number of native contacts falls in the range[ x − ∆ x/ , x + ∆ x/ x is taken to be 4 native contacts.The central object of the topological configurationmodel is the rate matrix T for transitions between topo-logical configurations. Ideally, one should determine thetransition rates either from the free-energy barriers or themean first passage times between configurations, but ac-curate calculations of this type are not possible given theavailable simulation data. Instead, we simply assumed asymmetric form that enforces detailed balance for the for-ward and backward rates between configurations i and j , T ij = k exp (cid:18) − ∆ F ji k B T (cid:19) i = j, (3)for pairs of configurations ( i, j ) that differ by the additionor removal of a single substructure; the diagonal elementsof the matrix are then T ii = − P j = i T ij . The prefactor k is the same for all transitions and is left as an adjustableparameter that scales all barrier heights between con-figurations equivalently. However, due to the assumedseparation of timescales, we know that k should be slowcompared to the average rate of native-contact forma-tion. Then, given the matrix T , it is straightforward tocalculate the overall folding rate, k fold ; the committorassociated with each configuration i , p fold ,i ; the proba-bility of finding a trajectory in a specific configuration ∅ ad abadabd abde abdefabcdefabcdegabcdefg024680 100 200 300 400 500 600 700 800 900 ∅ adg abad abd abde abcdeabdefabcdefabcdegabcdefg00.510 100 200 300 400 500 600 700 800 900 Number of native contacts ∅ abdabe abde abcdeabcdgabdefabdeg abcdefabcdegabcdefgaba p f o l d F / k T F / k T - dp r o j ec t i o n - dp r o j ec t i o n abc x A x B FIG. 2. Comparison of the topological configuration and one-dimensional (1-d) free-energy landscapes. (a) Projections ofthe free-energy surfaces onto the number of native contacts.Selected curves corresponding to distinct topological config-urations (solid colored lines) are labeled; the 1-d projection F ( x ), with bin width ∆ x = 4 native contacts, is shown by thedashed black line. (b) The unimodal projections of the topo-logical configurations onto the number of native contacts canbe characterized by Gaussian distributions with the indicatedmeans (squares) and standard deviations (error bars). Thedotted lines indicate allowed transitions between configura-tions that differ by exactly one substructure. (c) The com-mittors, p fold , associated with each configuration (squares),projected onto the number of native contacts. Selected config-urations are labeled, and the predicted folding fluxes throughthe network of states are indicated by the widths of the graylines; fluxes less than 5% have been omitted. For comparison,the dashed black line shows p fold ( x ) predicted by the 1-d land-scape, where the vertical dotted lines indicate the boundariesof the transition-path region, x A and x B . In all panels, theconfigurations are colored according to their estimated freeenergies, F i = − k B T log N i , with blue and red indicating lowand high free energies, respectively. i on a transition path, m AB i ; and the folding fluxes be-tween adjacent states i and j , f AB ij , using transition-paththeory [51]. These quantities will be used throughoutour analysis. We shall show that, despite not undertak-ing detailed calculations of the individual barrier heights,this model produces remarkably accurate transition-pathstatistics. Furthermore, alternative choices for the formof the rates given in Eq. (3), such as a Metropolis func-tion [52], do not change the qualitative nature of ourresults. We also note that all timescales determined di-rectly from the atomistic simulations are accelerated rel-ative to experiments, in part due to the elevated temper-ature at which the simulations were conducted [42]. C. General properties of topological configurationfree-energy landscapes
In addition to a separation of timescales, our previ-ous analysis [37] of this model made two general pre-dictions that hold up when comparing with simulationdata. First, the Boltzmann-weighted ensemble of mi-crostates associated with each topological configurationis predicted to be unimodal when projected onto a one-dimensional coordinate. For example, the free energyas a function of the number of native contacts is uni-modal for all topological configurations, as shown bythe labeled colored curves in Figure 2a, suggesting thatthere are no significant free-energy barriers between mi-crostates within each configuration. Consequently, it isreasonable to approximate the projection of each config-uration onto this order parameter using a Gaussian dis-tribution with the estimated mean, h x i i , and variance, h x i i − h x i i , where the subscripts indicate averages overall snapshots classified as topological configuration i , asshown in Figure 2b. This approximation will be used inthe discussion of hidden Markov modeling below.Second, in the case of proteins such as Ubiquitin withlittle structural symmetry, the free energies of the varioustopological configurations are relatively heterogeneous,which results in a small number of high-probabilitytransition paths through the network. By applyingtransition-path theory [51] to the rate matrix T , we cal-culated p fold for each configuration and the folding fluxbetween configurations on transition paths. Figure 2cshows that there is a nearly one-to-one correspondencebetween the predicted p fold and the number of nativecontacts when we consider only those configurations thatare likely to be visited on transition paths, even thoughthis order parameter played no role in the transition-paththeory calculations. This observation is consistent withthe fact that the number of native contacts is a goodreaction coordinate for identifying the ensemble of tran-sition states, where p fold = / , from simulated Ubiquitintransition paths [50]. However, knowing the location ofthe transition state on an order parameter is, in general,not sufficient to predict the transition-path dynamics. Inaddition, any one-dimensional projection almost invari-ably hides some of the intermediate free-energy barrierson transition paths that play an important role in thetransition-path dynamics [53, 54]. III. STATISTICAL ANALYSESA. Distribution of transition-path transit times
As an initial test of the model, we examined the pre-dicted distribution of transit times between the unfoldedand folded ensembles. For two-state proteins, transittimes are orders of magnitude smaller than the charac-teristic waiting time until a folding or unfolding eventoccurs [13, 15, 55, 56]. Nevertheless, recent advances insingle-molecule experiments [15, 34] have made it pos-sible to measure these brief trajectories. Although ex-perimental measurements have confirmed that the dis-tribution of transit times has an exponential tail as ex-pected for stochastic barrier-crossing processes, the shapeof the distribution generally disagrees with the predic-tions of one-dimensional landscape theories [34–36]. Inparticular, the measured distributions are typically muchbroader than expected for a one-dimensional landscapewith a harmonic barrier.To compare the predictions of the one-dimensional(1-d) and topological configuration models, we used ki-netic Monte Carlo (kMC) simulations [57] to sample tran-sition paths between absorbing states of a rate matrix, T . We first calculated the distribution of transit timesfor the 1-d native-contacts landscape shown in Figure 2a.To this end, we discretized this landscape between theunfolded and folded free-energy minima into 150 bins(such that ∆ x = 4 native contacts) and constructed atri-diagonal transition matrix, T . We assumed thesymmetric form T x,x ± ∆ x ) = k exp (cid:20) − F ( x ± ∆ x ) − F ( x )2 k B T (cid:21) , (4)with T x,x ) = − [ T x,x − ∆ x ) + T x,x +∆ x ) ]. Because thetransit times are inversely proportional to the transition-matrix prefactor k , transit-time distributions for differ-ent models can be compared by scaling t AB according tothe mean transit time, h t AB i . In this way we can seethat the predicted distribution of transit times, p ( t AB ),between the folded and unfolded free-energy minima x A and x B (Figure 2c) is relatively narrow, with a coefficientof variation of 0 .
39 and an exponential tail (Figure 3a).Alternatively, we can fit the decay constant of the ex-ponential tail, ω − , in order to compare with the the-oretical distribution for 1-d harmonic barrier crossings, P harmAB ( ωt ; ∆ F † ) [58], where the shape parameter ∆ F † is the height of the barrier (Figure 3b). The simulateddistribution of 1-d transit times agrees well with this har-monic prediction using the barrier height ∆ F † = 5 . kT determined directly from the empirical 1-d landscape(Figure 2a), despite the fact that this landscape is notperfectly harmonic. t AB / h t AB i P A B ( . k T ) P A B ( . k T ) ωt AB MD transition-path transit times p ( t A B ) p ( t A B ) ba FIG. 3. Distributions of transition-path transit times, p ( t AB ),for the models of Ubiquitin folding shown in Figure 2. (a) Thedistributions calculated via kinetic Monte Carlo simulationsof the 1-d native contacts landscape model, the topologicalconfiguration model, and the quasi-one-dimensional minimumfree-energy path through the topological configuration net-work. To compare the shapes of the distributions, all transittimes are scaled according to the mean transit time, h t AB i , foreach model. Also shown are the ten transit times observed inall-atom MD simulations. (b) The same three distributionswere fit to the theoretical distribution for a harmonic barrier, P harmAB , in order to estimate the decay constant, ω − , of the ex-ponential tail. The transit-time distribution calculated fromthe empirical 1-d landscape agrees well with the theoreticaldistribution using the empirical barrier height, 5 . kT , whilethe distribution calculated from the topological configurationmodel can only be fit by P harmAB if we set the shape parameter∆ F † equal to a much lower barrier height of 0 . kT . We then repeated these calculations using the topo-logical configuration rate matrix defined in Eq. (3). Theresulting distribution of transit times (Figure 3a) also hasan exponential tail, but is substantially broader, with acoefficient of variation of 0 .
91. We find that the distri-bution of transit times obtained from the MD simula-tions, which has a coefficient of variation of 1 . ± . − − − −
20 0 20 40 60 80 . kT discrete state F / k T P A B ( . k T ) t AB / h t AB i harmonic periodic (CG) p ( t A B ) periodic ab FIG. 4. Distributions of transition-path transit times for twoone-dimensional toy landscapes. (a) The two toy free-energylandscapes, with a single harmonic barrier [blue empty circles; F ( x ) ∝ A cos(2 πx/ F ( x ) ∝ A cos(2 πx/ kT cos(10 πx/ h t AB i , foreach model. The single-barrier landscape has a transit-timedistribution that is narrower than the theoretical prediction, P harmAB , for this barrier height (black dashed line), while theintermediate-barrier transit-time distribution and its coarse-grained approximation are significantly broader. is 10 ). When comparing the transit-time distributionpredicted by the topological configuration model with theharmonic prediction P harmAB ( ωt ; ∆ F † ), the best fit is ob-tained with a shape parameter ∆ F † that corresponds toa one-dimensional landscape with a 0 . kT barrier (Fig-ure 3b). Interestingly, this order-of-magnitude differencebetween the actual barrier height and that returned bya fit to the harmonic theory is reminiscent of the discrep-ancy found in experimental measurements [34]. Note that the barrier on the minimum free-energy path throughthe topological configuration network is also greater than 5 kT . The striking difference between the shapes of thesetransit-time distributions is primarily a consequence ofthe intermediate barrier crossings, as opposed to the mul-tidimensionality of the network model. For example, bysimulating transition paths that only traverse the quasi-one-dimensional minimum free-energy path through theconfiguration network, we obtained a similar distributionof transit times (Figure 3). To explore this reasoningfurther, we constructed two toy 1-d landscapes with thesame barrier height and number of bins as in the em-pirical 1-d landscape (Figure 4a). In the first landscape,the single barrier is, to a good approximation, harmonic,while in the second landscape, there are five interme-diate barriers. We then computed T using Eq. (4)and simulated the transition paths for these toy modelsvia kMC. As expected, the presence of the intermediatebarriers significantly broadens the transit-time distribu-tion, resulting in a coefficient of variation of 0.47 for theintermediate-barrier landscape versus 0.36 for the single-barrier toy landscape (Figure 4b). It is also possible tocoarse-grain the dynamics over the intermediate-barriertoy landscape by calculating the mean first passage timesbetween the local free-energy minima (Figure 4a). Simu-lating the transition paths for this coarse-grained modelresults in a good agreement with the full intermediate-barrier toy landscape (Figure 4b). Although the differ-ence between the transit-time distributions for these par-ticular toy models is smaller than that shown in Figure 3,this comparison clearly demonstrates that the presenceof intermediate barriers on the transition paths tends tobroaden the distribution of transit times. B. Distribution of finite-time displacements on aone-dimensional order parameter
As a second statistical test, we examined distributionsof finite-time displacements on a one-dimensional orderparameter. Using the number of native contacts as theorder parameter, we measured displacements, ∆ x , givena lag time ∆ t on all transition-path trajectories in theatomistic MD simulations. We considered lag times rang-ing from 1 ns, which is longer than the typical time re-quired for the formation of a single native contact, to ∼
100 ns, which is much shorter than the mean transittime, 2 . µ s, observed in the MD simulations. Afterverifying that Ubiquitin transition paths exhibit subdif-fusive motion over this range of lag times [59], meaningthat h [∆ x (∆ t )] i ∝ (∆ t ) p with p <
1, we sought to deter-mine whether, for a given lag time, the distribution of fre-quent, small displacements is predictive of larger jumps.By averaging over all MD transition-path trajectories andremoving the net directional motion, ( x B − x A ) /t AB , wefind that the vast majority of displacements, for which | ∆ x | ≤ p h ∆ x (∆ t ) i , are well described by Gaussiandistributions over the entire range of lag times. However,larger displacements are much more frequent than pre-dicted by the tails of the Gaussian distributions, regard- − − − ∆ x/ p h ∆ x (∆ t ) i MD transition paths − − − p ( ∆ x | ∆ t ) ∆ x/ h σ i i Topo. config. network l og ( ∆ t / n s ) ,l og (cid:2) ∆ t × (cid:0) k f o l d / k M D f o l d (cid:1) (cid:3) − − − ∆ x/ p h ∆ x (∆ t ) i Toy model: harmonic − − − p ( ∆ x | ∆ t ) ∆ x/ p h ∆ x (∆ t ) i Toy model: periodic l og ( k ∆ t ) c ba d FIG. 5. Distributions of finite-time displacements on the 1-dnative contacts order parameter, averaged over the transition-path ensemble. (a) The distribution of displacements ∆ x aftera lag time ∆ t observed in all-atom MD transition paths. Thedistributions are centered, such that h ∆ x i = 0, and scaled bythe root-mean-squared displacement at each lag time. Thefrequent small displacements are well described by a Gaus-sian distribution with a unit standard deviation (black dashedline); however, larger displacements are much more frequentthan predicted by the tails of this Gaussian distribution. Col-ors correspond to the lag time, in units of nanoseconds, asshown by the scale bar on the right. (b) The predicted dis-tribution of displacements corresponding to transitions be-tween configurations in the topological configuration model(dotted lines) is broader than a Gaussian distribution fit tothe transition-path-ensemble-averaged fluctuations within in-dividual configurations, leading to similar fat-tailed behavior.The lag times are scaled by the slowest timescale of the MDsimulations, ( k MDfold ) − , for comparison with panel a. (c–d) Dis-tributions calculated from kinetic Monte Carlo simulations oftransition paths on the two toy landscapes shown in Figure 4a.Only the intermediate-barrier landscape (panel d) reproducesthe fat-tailed behavior observed in the MD simulations. less of the lag time. This ‘fat-tailed’ behavior is shownin Figure 5a, where the distribution for each lag time isscaled according to its root-mean-squared displacementand compared to a unit Gaussian distribution indicatedby the black dashed line.This unusual feature is naturally predicted by thetimescale separation in the topological configuration model, since the distribution of one-dimensional displace-ments is narrower for fluctuations within a configura-tion than for transitions from one configuration to an-other. To illustrate this idea, we compare the expecteddisplacement associated with a step between configura-tions, h ( σ i + σ j ) / i , with the average size of a fluctu-ation within a configuration, h σ i i , where the averagesare taken over all configurations and weighted by m AB i ,the probability of finding a transition-path trajectoryin any configuration i (Figure 5b). The contributionfrom transitions between configurations, shown by col-ored dotted lines in Figure 5b, increases with the lagtime, since the probability of moving from configuration i to j within a finite time ∆ t is given by the matrix expo-nential [exp(∆ t T )] ij . As a result of these larger displace-ments, the tails of the distribution are always outside ofthe unit Gaussian distribution, suggesting that the fat-tailed behavior observed in the MD simulation-deriveddistributions is also indicative of relatively rare interme-diate barrier crossings.To test this hypothesis, we analyzed the distribu-tions of finite-time displacements obtained from simu-lated transition paths over the two toy landscapes shownin Figure 4a. By scaling the distributions accordingto their root-mean-squared displacements and compar-ing with a unit Gaussian distribution (Figure 5c,d), wefind that only the landscape with intermediate barriersresults in a qualitatively similar fat-tailed distribution.The single-barrier toy landscape, by contrast, results inlarge displacements being less frequent than predictedby a Gaussian distribution. We therefore conclude thatthe relative enrichment of large displacements within lagtimes of a few tens of nanoseconds is a likely consequenceof intermediate barrier crossings in the MD-simulatedtransition paths. C. Likelihood comparison between predicted andsimulated transition-path trajectories
Finally, we tested whether the transition-path trajecto-ries observed in the MD simulations, when projected ontoa one-dimensional coordinate, are representative of thetransition-path ensembles predicted by the topologicalconfiguration model. To do so, we treated this stochas-tic process as a hidden Markov model with a discretestate space of topological configurations. In this model,transition paths traverse the discrete state space in accor-dance with the rate matrix T , but we assume that we canonly observe the instantaneous projection of each state s onto the 1-d coordinate x . With the exception of thetransition-matrix prefactor k , both the transition proba-bilities between states and the topological configuration-dependent probabilities of observing a given number ofnative contacts, p ( x | s ), are completely determined byquantities calculated from the equilibrium MD simula-tion data. We first removed all configurations in thetopological configuration network that are not likely tobe visited on transition paths (less than 1% of predictedfolding flux) to guard against overfitting. We then usedthe standard Viterbi algorithm [60, 61] to determine theunique sequence of configurations, { s l } , that maximizesthe log likelihood of the observed time series { x l } for eachtransition-path trajectory, h log L i = n − n X l =1 log (cid:2) p ( x l | s l )( e ∆ t T ) s l − ,s l (cid:3) , (5)where the index l runs over all consecutive snapshots oneach transition path, and we have normalized the loglikelihood to remove the trivial dependence on the totaltrajectory length n .A representative maximum likelihood fit, using thefixed transition rates and emission probabilities definedin Eq. (3) and Figure 2b, respectively, is shown in Fig-ure 6a, where the apparent separation of timescales be-tween high-frequency oscillations and slower, step-likebehavior can be easily discerned by eye. Furthermore,the log likelihood of most probable path, h log L max i , isonly weakly dependent on the transition-matrix prefac-tor, k . (Figure 6b). Analyzing one frame per nanosec-ond, the maximum of h log L max i is found at a value of k that results in a predicted folding rate, k fold , of ap-proximately 1 × − frames. Importantly, this foldingrate agrees well with the empirical folding rate that wecalculated directly from the mean waiting times betweenfolding and unfolding transitions in the full MD trajec-tories, k MDfold ≃ . − (Figure 6b, inset ).We then performed an analogous hidden Markov anal-ysis for the 1-d landscape model with a constant diffusioncoefficient. In this case, the discrete states { ¯ x } are binsof width ∆ x , the rate matrix is given by Eq. (4), andthe Gaussian emissions p ( x | s = ¯ x ) are assumed to havea standard deviation of ∆ x . For example, a represen-tative maximum likelihood fit, assuming a bin width of∆ x = 16 native contacts, is shown in Figure 6a. Unlikethe topological configuration model, we find that the loglikelihood of the most probable path in the 1-d modelis strongly dependent on the transition-matrix prefac-tor, k . Furthermore, the maximum with respect to k corresponds to a folding rate that is orders of magnitudegreater than that determined from MD simulations. Thismeans that, in order to generate a transition path withthe observed high frequency fluctuations on the empiri-cal 1-d landscape, k has to be tuned to a point wherethe predicted rates of folding and unfolding events areunrealistically fast. The topological configuration modeldoes not suffer from this contradiction, since the sepa-ration of timescales between the fast motions within atopological configuration and slower transitions betweenconfigurations is an intrinsic feature of the model.To ensure a fair comparison between these models, wecalculated the expected values of the log likelihood fortransition paths generated directly by both models, h log L i model = X s m AB s log (cid:2) h p ( x | s ) ih ( e ∆ t T ) s ′ ,s i (cid:3) , (6) − − − − − − − − − − − − − − M D f o l d i n g r a t e log [ k ∆ t × ( k fold /k )] − − − − − − M D r a t e − − − − −
204 8 12 16 20 24 ∆ t = 8 ns∆ t = 4 ns∆ t = 2 ns∆ t = 1 nsbin width ∆ x − − − − − ∆ t = n s ∆ t = n s ∆ t = n s ∆ t = n s unfold.folded 0 1000 2000 3000 Topo. config. unfold.folded 0 1000 2000 3000 t/ ns xx h l og L m a x i − h l og L i m o d e l Topo. conf. 1-d native contact landscape
Topo. config. model 1-d landscape bca ∆ x = 16 h l og L m a x i FIG. 6. Comparison of transition-path trajectories fromatomistic MD simulations and trajectories generated by theo-retical models. (a) The maximum-likelihood transition paths(red solid lines) through the discrete states of the topolog-ical configuration and 1-d landscape models given a repre-sentative MD transition-path trajectory, projected onto thenumber of native contacts, x (blue lines). For the topologicalconfiguration model, the expected fluctuations (i.e., the stan-dard deviations of the Gaussian approximations in Figure 2b)within the most probable states are shown by red dashed lines.(b) The dependence of the per-frame log likelihood of the mostprobable sequence of states on the transition-matrix prefac-tor, k . This rate is scaled by the folding rate predicted by themodel, k fold , for comparison with the estimated MD foldingrate, k MDfold (black dashed line). The maximum of h log L max i for the topological configuration model coincides with the MDfolding rate (inset). (c) Comparison between the log likeli-hood of the most probable sequence of states and the expectedlog likelihood for each model, h log L i model (see text). Whenfitting to the 1-d model, the agreement between these quan-tities depends strongly on the spatial coarse-graining, ∆ x , ofthe landscape and temporal averaging of the MD transition-path trajectories over time windows of width ∆ t . Error barsindicate the standard error of the mean. where the expectation values for the emission andtransition probabilities are h p ( x | s ) i ≡ R ∞−∞ p ( x | s ) dx and h [exp(∆ t T )] s ′ ,s i ≡ P s ′ { [exp(∆ t T )] s ′ ,s } , respectively, ineach state s . Fixing k to match the MD folding rate,Figure 6c shows that the log likelihood of the most prob-able path determined by fitting the MD data is consis-tent with the expected log likelihood for the ensembleof transition paths generated by the topological config-uration model. This result is independent of temporalcoarse-graining, i.e., down-sampling the trajectory x ( t )by averaging over a moving window of width ∆ t . Bycontrast, temporal coarse-graining has a significant effecton the difference between the best-fit and expected loglikelihoods for the 1-d landscape model when k is fixedaccording to the MD folding rate, since increasing ∆ t preferentially removes high frequency fluctuations. Thisdifference is also sensitive to the landscape bin width ∆ x ,since increasing ∆ x reduces the number of distinct statesand consequently slows the rate of transitions betweenadjacent states. As a result, increasing the bin width re-sults in an effective separation of timescales on the 1-dlandscape, albeit without a first-principles justification.Only by introducing a separation of timescales through a post hoc combination of temporal coarse-graining of thetrajectories and spatial coarse-graining of the 1-d land-scape is it possible for the 1-d model to generate transi-tion paths that are consistent with those observed in theMD simulations (Figure 6c).In conclusion, this hidden Markov analysis highlightsthe importance of a separation of timescales for reproduc-ing the transition-path trajectories observed in atomisticMD simulations. In particular, the co-occurrence of fastfluctuations in the number of native contacts and infre-quent folding events is naturally captured by the topo-logical configuration model, as seen by the agreementbetween the log likelihoods of the fitted and predictedtransition-path trajectory ensembles. One-dimensionallandscape models that lack an intermediate timescale,by contrast, require that the trajectories observed in MDsimulations be smoothed substantially in order to con-form to the predicted transition-path dynamics. IV. DISCUSSION
We have shown that a theoretical model of protein fold-ing, which emphasizes an intermediate timescale associ-ated with transitions between distinct configurations ofpartial native structure, accurately predicts multiple sta-tistical properties of the stochastic dynamics of foldingtransition paths. By using equilibrium simulation data toconstruct an approximate rate matrix for transitions be-tween topological configurations, we demonstrated thatthe transition-path ensembles generated by this modelhave broad transit-time distributions that are consistentwith both all-atom simulations and experimental obser-vations. We then showed that the non-Gaussian distri-butions of finite-time displacements that are predicted by this model qualitatively match all-atom simulationresults. Lastly, we demonstrated that this model canreconcile rapid local fluctuations on a one-dimensionalorder parameter with a slow overall rate of folding, twoseemingly contradictory features that are simultaneouslyobserved in simulated transition-path trajectories. Mostimportantly, all predictions of the topological configura-tion model were made without the use of any dynamicalinformation.The intermediate timescale in the topological configu-ration model is predicted to arise due to local free-energybarriers that separate transient states with distinct setsof native-like loops [37]. In this work, we assumed thatthe rates of transitions between states could be approxi-mated using a simple formula that satisfies detailed bal-ance. However, a more accurate approach would involvethe calculation of the rates between adjacent topologicalconfigurations in the all-atom model. Such an approachmight benefit from recent advances in Markov state mod-eling [62, 63], although, in this application, the definitionsof the states would be assumed a priori on the basisof the native structure. Nevertheless, it is remarkablethat we are able to obtain qualitatively accurate resultsfor a variety of statistical tests using a highly simplifiedMarkov model and the estimated free energies of the pre-dicted topological configurations. This success indicatesthat the statistical analyses that we considered dependto a greater extent on the presence of intermediate bar-riers than on their precise heights, provided that thesebarriers are comparable to the thermal energy ( & k B T )and are thus kinetically relevant. Generalizing beyondUbiquitin, we anticipate that accounting for this inter-mediate timescale is likely to be especially important inthe context of large proteins, which tend to have com-plicated native topologies. Transition-path analyses ofsmall, ultrafast-folding proteins [64], by contrast, are lesslikely to benefit from the coarse-graining strategy de-scribed here, since these proteins typically contain onlyone or two native-state loops whose formation dominatesthe overall folding rate. The lower probability of encoun-tering substantial free-energy barriers on folding transi-tion paths, which are needed to assume approximatelyMarkovian transitions between intermediate configura-tions, suggests that models of diffusion over a single har-monic barrier may be more appropriate in these partic-ular cases.To compare our approach with commonly used one-dimensional models, we assumed a projection onto thenumber of native contacts and a constant diffusion coef-ficient. The number of native contacts has been shownto be a good reaction coordinate in the sense that, formany small proteins, it can be used to locate the tran-sition state from transition-path trajectories with highprobability [50]. The results presented here are not in-consistent with this notion of a good reaction coordinatewhen a single pathway through the network of topolog-ical configurations dominates, as shown by the similar-ity between the committors predicted by the two mod-0els. Furthermore, if one were to account for coordinate-dependent diffusion, it is likely that these two approacheswould lead to similar predictions for the transition-pathdynamics, since existing methods for fitting coordinate-dependent diffusion coefficients often reveal the existenceof intermediate barriers that were hidden by the projec-tion onto the original reaction coordinate [31]. However,to carry out such an analysis, dynamical information isalways required in some form [25], meaning that the un-derlying landscape is not, by itself, truly predictive.The topological configuration model that we have ex-amined here differs in a number of important ways fromalternative models of folding intermediates that havebeen proposed previously. Unlike the early hierarchicalmodel of Ptitsyn [65] and the more recent ‘foldon’ hy-pothesis [66], the assembly of native-like intermediatesneed not lead to a more negative free energy at everystep. At the same time, we have not assumed that thefree energy decreases only upon incorporation of the finalnative-like substructure, as proposed in a recent ‘volcano’model of folding [67]. By contrast, the highest point onthe minimum free-energy path through the network oftopological configurations is determined by the free en-ergies of the various configurations and the barriers be-tween them, which depend, in turn, on the temperatureand solvent conditions [37]. The topological configura-tion model also suggests a natural definition of a foldingpathway [68] at the level of topological configurationswhile allowing for alternative, yet less probable, path-ways.Finally, this theoretical analysis has a number ofimplications for experimental investigations of protein-folding transition paths. The transit-time and finite-time displacement distributions that we have discussedcan now be measured directly in single-molecule ex-periments. However, to distinguish between alterna-tive theoretical models, it would be most useful to an-alyze high-resolution experimental transition-path mea-surements using hidden Markov models in order to detectand characterize transient folding intermediates. Usingestablished non-parametric methods [69, 70], it shouldbe possible to assess both the number of distinct tran-sient states and any separation of timescales objectively.Furthermore, by combining such analyses with structure-based models, like the type discussed here, it should bepossible to extract more detailed information regarding the underlying free-energy landscape from these mea-surements. In this way, continued advances in single-molecule measurements can be used to improve predic-tive landscape-based models of protein-folding transi-tion paths, in particular for large proteins with complexnative-state topologies. V. CONCLUSION
We have shown that the introduction of an interme-diate timescale, which is faster than native-contact for-mation but slower than the typical time for folding anentire protein, can qualitatively alter the statistics ofprotein-folding transition paths. We proposed that thisintermediate timescale is associated with the assembly ofnative-like loops, and we used this principle to build acoarse-grained free-energy landscape for Ubiquitin fromequilibrium atomistic simulation data. Without rely-ing on any dynamical information from simulations, weshowed that this model predicts distributions of transittimes and finite-time displacements that are consistentwith simulated transition paths, but differ qualitativelyfrom the predictions of a one-dimensional model of dif-fusion on an empirical free-energy landscape. We alsoused a hidden Markov analysis to demonstrate that thismodel generates transition paths that agree with boththe dynamics and kinetics inferred from reversible fold-ing simulations. Our results suggest that the analysis ofsingle-molecule transition-path trajectories may be im-proved by accounting for intermediate free-energy barri-ers, which are a fundamental aspect of the complexity offolding large biomolecules.
ACKNOWLEDGMENTS
The authors would like to thank D.E. Shaw Researchfor providing the all-atom Molecular Dynamics simu-lation data. In addition, the authors are grateful formany insightful discussions with Michael Manhart. Thiswork was supported by NIH grants F32GM116231 andGM068670. All analysis and simulation code is availablefrom the authors upon request. [1] D. Frenkel and B. Smit,
Understanding molecular simula-tion: From algorithms to applications (Academic Press,2001).[2] J. W. Gibbs, Am. J. Sci. , 441 (1878).[3] P. R. ten Wolde, M. J. Ruiz-Montero, and D. Frenkel,J. Chem. Phys. , 9932 (1996).[4] J. J. De Yoreo and P. G. Vekilov, Rev. Mineral. Geochem. , 57 (2003).[5] P. R. ten Wolde and D. Frenkel, Science , 1975 (1997). [6] W. M. Jacobs, A. Reinhardt, and D. Frenkel, Proc. Natl.Acad. Sci. U.S.A. , 6313 (2015).[7] D. Frenkel, Physica A: Stat. Mech. , 26 (1999).[8] W. Poon, F. Renth, and R. Evans, J. Phys.: Condens.Matter , A269 (2000).[9] V. Abkevich, A. Gutin, and E. Shakhnovich, J. Chem.Phys. , 6052 (1994).[10] E. I. Shakhnovich, Chem. Rev. , 1559 (2006). [11] D. Gfeller, P. De Los Rios, A. Caflisch, and F. Rao, Proc.Natl. Acad. Sci. U.S.A. , 1817 (2007).[12] D. Thirumalai, E. P. O’Brien, G. Morrison, andC. Hyeon, Ann. Rev. Biophys. , 159 (2010).[13] V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich,Biochemistry , 10026 (1994).[14] E. I. Shakhnovich, Curr. Opin. Struct. Biol. , 29 (1997).[15] H. S. Chung, K. McHale, J. M. Louis, and W. A. Eaton,Science , 981 (2012).[16] H. S. Chung and W. A. Eaton, Nature , 685 (2013).[17] H. S. Chung, T. Cellmer, J. M. Louis, and W. A. Eaton,Chem. Phys. , 229 (2013).[18] K. Truex, H. S. Chung, J. M. Louis, and W. A. Eaton,Phys. Rev. Lett. , 018101 (2015).[19] A. ˇSali, E. I. Shakhnovich, and M. Karplus, Nature ,248 (1994).[20] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G.Wolynes, Proteins: Struct., Func., and Bioinf. , 167(1995).[21] J. N. Onuchic and P. G. Wolynes, Curr. Opin. Struct.Biol. , 70 (2004).[22] A. Berezhkovskii and A. Szabo, J. Chem. Phys. ,014503 (2005).[23] R. B. Best and G. Hummer, Phys. Rev. Lett. , 228104(2006).[24] B. W. Zhang, D. Jasnow, and D. M. Zuckerman, J.Chem. Phys. , 074504 (2007).[25] R. B. Best and G. Hummer, Phys. Chem. Chem. Phys. , 16902 (2011).[26] G. Hummer, New J. Phys. , 34 (2005).[27] A. Berezhkovskii and A. Szabo, J. Chem. Phys. ,074108 (2011).[28] R. B. Best and G. Hummer, Proc. Natl. Acad. Sci. U.S.A. , 6732 (2005).[29] M. Hinczewski, Y. von Hansen, J. Dzubiella, and R. R.Netz, J. Chem. Phys. , 245103 (2010).[30] M. L. Mugnai and R. Elber, J. Chem. Phys. , 014105(2015).[31] R. B. Best and G. Hummer, Proc. Natl. Acad. Sci. U.S.A. , 1088 (2010).[32] P. Tiwary and B. Berne, J. Chem. Phys. , 152701(2017).[33] R. Du, V. S. Pande, A. Y. Grosberg, T. Tanaka, andE. I. Shakhnovich, J. Chem. Phys. , 334 (1998).[34] K. Neupane, D. A. N. Foster, D. R. Dee, H. Yu, F. Wang,and M. T. Woodside, Science , 239 (2016).[35] D. E. Makarov, J. Chem. Phys. , 071101 (2017).[36] R. Satija, A. Das, and D. E. Makarov, J. Chem. Phys. , 152707 (2017).[37] W. M. Jacobs and E. I. Shakhnovich, Biophys. J. ,925 (2016).[38] V. Mu˜noz and W. A. Eaton, Proc. Natl. Acad. Sci. U.S.A. , 11311 (1999).[39] E. Alm and D. Baker, Proc. Natl. Acad. Sci. U.S.A. ,11305 (1999).[40] O. V. Galzitskaya and A. V. Finkelstein, Proc. Natl.Acad. Sci. U.S.A. , 11299 (1999).[41] D. Frenkel, Biophys. J. , 893 (2016).[42] S. Piana, K. Lindorff-Larsen, and D. E. Shaw, Proc.Natl. Acad. Sci. U.S.A. , 5915 (2013). [43] P. A. Thompson, W. A. Eaton, and J. Hofrichter, Bio-chemistry , 9200 (1997).[44] L. J. Lapidus, W. A. Eaton, and J. Hofrichter, Proc.Natl. Acad. Sci. U.S.A. , 7220 (2000).[45] S. O. Garbuzynskiy, D. N. Ivankov, N. S. Bogatyreva,and A. V. Finkelstein, Proc. Natl. Acad. Sci. U.S.A. ,147 (2013).[46] B. H. Zimm and J. K. Bragg, J. Chem. Phys. , 526(1959).[47] K. A. Dill, K. M. Fiebig, and H. S. Chan, Proc. Natl.Acad. Sci. U.S.A. , 1942 (1993).[48] J. Kubelka, E. R. Henry, T. Cellmer, J. Hofrichter, andW. A. Eaton, Proc. Natl. Acad. Sci. U.S.A. , 18655(2008).[49] E. R. Henry, R. B. Best, and W. A. Eaton, Proc. Natl.Acad. Sci. U.S.A. , 17880 (2013).[50] R. B. Best, G. Hummer, and W. A. Eaton, Proc. Natl.Acad. Sci. U.S.A. , 17874 (2013).[51] P. Metzner, C. Sch¨utte, and E. Vanden-Eijnden, Multi-scale Model. Sim. , 1192 (2009).[52] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth,A. H. Teller, and E. Teller, J. Chem. Phys. , 1087(1953).[53] S. V. Krivov and M. Karplus, Proc. Natl. Acad. Sci.U.S.A. , 14766 (2004).[54] S. V. Krivov and M. Karplus, J. Phys. Chem. B ,12689 (2006).[55] F. Ding, N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley,and E. I. Shakhnovich, Biophys. J. , 3525 (2002).[56] H. S. Chung, J. M. Louis, and W. A. Eaton, Proc. Natl.Acad. Sci. U.S.A. , 11837 (2009).[57] D. T. Gillespie, J. Phys. Chem. , 2340 (1977).[58] S. Chaudhury and D. E. Makarov, J. Chem. Phys. ,034118 (2010).[59] S. V. Krivov, PLoS Comp. Biol. , e1000921 (2010).[60] G. D. Forney, Proc. IEEE , 268 (1973).[61] “hmmlearn — hmmlearn 0.2.1 documentation,” http://hmmlearn.readthedocs.io/en/latest/ , ac-cessed: June 19, 2018.[62] G. R. Bowman, V. S. Pande, and F. No´e, in An intro-duction to Markov state models and their application tolong timescale molecular simulation (Springer, 2014) pp.1–6.[63] J. D. Chodera and F. No´e, Curr. Opin. Struct. Biol. ,135 (2014).[64] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E.Shaw, Science , 517 (2011).[65] O. Ptitsyn, Doklady Akademii Nauk SSSR , 1213(1973).[66] H. Maity, M. Maity, M. M. Krishna, L. Mayne, andS. W. Englander, Proc. Natl. Acad. Sci. U.S.A. , 4741(2005).[67] G. C. Rollins and K. A. Dill, J. Am. Chem. Soc. ,11420 (2014).[68] A. N. Adhikari, K. F. Freed, and T. R. Sosnick, Proc.Natl. Acad. Sci. U.S.A. , 17442 (2012).[69] I. Sgouralis, M. Whitmore, L. Lapidus, M. J. Comstock,and S. Press´e, J. Chem. Phys. , 123320 (2018).[70] I. Sgouralis and S. Press´e, Biophys. J.112