[PDF] Adaptive Markov State Model estimation using short reseeding trajectories

Abstract

In the last decade, advances in molecular dynamics (MD) and Markov State Model (MSM) methodologies have made possible accurate and efficient estimation of kinetic rates and reactive pathways for complex biomolecular dynamics occurring on slow timescales. A promising approach to enhanced sampling of MSMs is to use so-called "adaptive" methods, in which new MD trajectories are "seeded" preferentially from previously identified states. Here, we investigate the performance of various MSM estimators applied to reseeding trajectory data, for both a simple 1D free energy landscape, and for mini-protein folding MSMs of WW domain and NTL9(1-39). Our results reveal the practical challenges of reseeding simulations, and suggest a simple way to reweight seeding trajectory data to better estimate both thermodynamic and kinetic quantities.

Full PDF

MMSM estimation with short reseeding trajectories

Adaptive Markov State Model estimation using short reseeding trajectories

Hongbin Wan and Vincent A. Voelz a) Department of Chemistry, Temple University, Philadelphia, PA 19122, USA (Dated: 13 December 2019)

In the last decade, advances in molecular dynamics (MD) and Markov State Model (MSM)methodologies have made possible accurate and eﬃcient estimation of kinetic rates and reac-tive pathways for complex biomolecular dynamics occurring on slow timescales. A promisingapproach to enhanced sampling of MSMs is to use so-called adaptive methods, in which newMD trajectories are seeded preferentially from previously identiﬁed states. Here, we inves-tigate the performance of various MSM estimators applied to reseeding trajectory data, forboth a simple 1D free energy landscape, and for mini-protein folding MSMs of WW domainand NTL9(1-39). Our results reveal the practical challenges of reseeding simulations, andsuggest a simple way to reweight seeding trajectory data to better estimate both thermody-namic and kinetic quantities.Keywords: Markov Model, molecular kinetics, adaptive sampling

INTRODUCTION

In the last decade, Markov State Model (MSM)methodologies have made possible accurate and eﬃcientestimation of kinetic rates and reactive pathways for slowand complex biomolecular dynamics.

One of the keyadvantages touted by MSM methods is the ability to uselarge ensembles of short-timescale trajectories for sam-pling events that occur on slow timescales. The main ideais that suﬃcient sampling using many short trajectoriescan circumvent the need to sample long trajectories.With this in mind, many “adaptive” methods havebeen developed for the purpose of accelerating samplingof MSMs. The simplest of these can be called adaptiveseeding , where one or more new rounds of unbiased sim-ulations are performed by “seeding” swarms of trajecto-ries throughout the landscape. The choice of seeds arebased on some initial approximation of the free energylandscape, possibly from non-equilibrium or enhanced-sampling methods. Adaptive seeding can be performedby ﬁrst identifying a set of metastable states, then initi-ating simulations from each state. If the seeding trajec-tories provide suﬃcient connectivity and statistical sam-pling of transition rates, an MSM can be constructed toaccurately estimate both kinetics and thermodynamics.Similarly, so-called adaptive sampling algorithms havebeen developed for MSMs in which successive rounds oftargeted seeding simulations are performed, updating theMSM after each round.

A simple adaptive samplingstrategy is to start successive rounds of simulations fromunder-sampled states, for instance, from the state withthe least number of transition counts. A more sophis-ticated approach is the FAST algorithm, which is de-signed to discover states and reactive pathways of inter-est by choosing new states based on an objective functionthat balances under-sampling with a reward for samplingdesired structural observables.

Other algorithms in-clude: REAP, which eﬃciently explores folding land-scapes by using reinforcement learning to choose new a) Electronic mail: [email protected] states, and surprisal-based sampling, which choosesnew states that will minimize the uncertainty of the rel-ative entropy between two or more MSMs.A key problem with adaptive sampling of MSMs arisesbecause we are often interested in equilibrium properties,while trajectory seeding is decidedly non-equilibrium .This may seem like a subtle point, because the dynamicaltrajectories themselves are unbiased, but of course, theensemble of starting points for each trajectory are almostalways statistically biased, i.e. the seeds are not drawnfrom the true equilibrium distribution. This can be prob-lematic because most MSMs are constructed from tran-sition rate estimators that enforce detailed balance andassume trajectory data is obtained at equilibrium. Thedistribution of sampled transitions, however, will only re-ﬂect equilibrium conditions in the limit of long trajectorylength.One way around this problem is to focus mostly onthe kinetic information obtained by adaptive sampling.A recent study of the ability of FAST to accurately de-scribe reactive pathways concluded that the most re-liable MSM estimator to use with adaptive samplingdata is a row-normalized transition count matrix. In-deed, weighted-ensemble path sampling algorithms focusmainly on sampling the kinetics of reactive pathways,information which can be used to recover global thermo-dynamic properties.

A major disadvantage of thisapproach is that it ignores potentially valuable equilib-rium information. As shown by Trendelkamp-Schroerand No´e, detailed balance is a powerful constraint toinfer rare-event transition rates from equilibrium popu-lations. Speciﬁcally, when faced with limited sampling,dedicating half of one’s simulation samples toward en-hanced thermodynamic sampling (e.g. umbrella sam-pling) can result in a signiﬁcant reduction in the uncer-tainty of estimated rates, simply because the improvedestimates of equilibrium state populations inform the rateestimates through detailed balance.Another way around this problem, recently describedby N¨uske et al., is to use an estimator based on observableoperator model (OOM) theory, which utilizes informa-tion from transitions observed at lag times τ and 2 τ toobtain estimates unbiased by the initial distribution of a r X i v : . [ q - b i o . B M ] D ec SM estimation with short reseeding trajectories 2seeding trajectories. Although the OOM estimator isable to make better MSM estimates at shorter lag times,it requires the storage of transition count arrays thatscale as the cube of the number states, and dense-matrixsingular-value decomposition, which can be impracticalfor MSMs with large numbers of states. N¨uske et al.derive an expression quantifying the error incurred bynon-equilibrium seeding, from which they conclude thatsuch bias is diﬃcult to remove without either increasingthe lag time or improving the state discretization.Here, we explore an alternative way to recover accu-rate MSM estimates from biased seeding trajectories, byreweighting sampled transition counts to better approxi-mate counts that would be observed at equilibrium. Likethe Trendelkamp-Schroer and No´e method, this re-quires some initial estimate of state populations, perhapsobtained from previous rounds of adaptive sampling.We are particularly interested in examining how theperformance of this reweighting method compares toother estimators, in cases where it is impractical to gen-erate long trajectories and instead one must rely on en-sembles of short seeding trajectories. An example of acase like this is adaptive seeding of protein folding MSMsbuilt from ultra-long trajectories simulated on the Antonsupercomputer. Because such computers are not widelyavailable, adaptive seeding using conventional computersmay be one of the only practical ways to leverage MSMsto predict the eﬀect of mutations, for example.In this manuscript, we ﬁrst perform adaptive seed-ing tests using 1D-potential energy models, and comparehow diﬀerent estimators perform at accurately capturingkinetics and thermodynamics. We then perform simi-lar tests for MSMs built from ultra-long reversible fold-ing trajectories of two mini-proteins, WW domain andNTL9(1-39). Our results, described below, suggest thatreweighting trajectory counts with estimates of equilib-rium state populations can achieve a good balance of ki-netic and thermodynamic estimation.

Estimators

We explored the accuracy and eﬃciency of severaldiﬀerent transition probability estimators using adap-tive seeding trajectory data as input: (1) a maximum-likelihood estimator (MLE), (2) a MLE estimator wherethe exact equilibrium populations π i of each state i areknown a priori , (3) an MLE estimator where each in-put trajectory is weighted by an a priori estimate ofthe equilibrium population of its starting state, (4) row-normalized transition counts, and (5) an observable op-erator model (OOM) estimator. Maximum-likelihood estimator (MLE).

The MLE for a reversible MSM assumes that observedtransition counts are independent, and drawn from theequilibrium distribution, so that reversibility (i.e. de-tailed balance) can be used as a constraint. The like-lihood of observing a set of given transition counts, L = (cid:81) i (cid:81) j p c ij ij , when maximized under the constraintthat π i p ij = π j p ji , for all i, j , yields a self-consistentexpression that can be iterated to ﬁnd the equilibriumpopulations, π i = (cid:88) j c ij + c jiN j π j + N i π i (1)where N i = (cid:80) j c ij . The transition probabilities p ij aregiven by p ij = ( c ij + c ji ) π j N j π i + N i π j (2) Maximum-likelihood estimator (MLE) with knownpopulations π i Maximization of the likelihood function above, withthe additional constraint of ﬁxed populations π i , yieldsa similar self-consistent equation that can be used to de-termine a set of Lagrange multipliers, λ i = (cid:88) j ( c ij + c ji ) π j λ i λ j π i + λ i π j , (3)from which the transition probabilities p ij can be ob-tained as p ij = ( c ij + c ji ) π j λ j π i + λ i π j . (4) Maximum-likelihood estimator (MLE) withpopulation-weighted trajectory counts.

For this estimator, ﬁrst a modiﬁed count matrix c (cid:48) ij iscalculated, c (cid:48) ij = (cid:88) k w ( k ) c ( k ) ij , (5)where transition counts c ( k ) ji from trajectory k areweighted in proportion to w ( k ) = π ( k ) , the estimatedequilibrium population of the initial state of the trajec-tory. The idea behind this approach is to counteract thestatistical bias from adaptive seeding by scaling the ob-served transition counts proportional to their equilibriumﬂuxes. The modiﬁed counts are then used as input to theMLE. Row-normalized counts.

For this estimator, the transition probabilities are ap-proximated as p ij = c ij (cid:80) j c ij . (6)This approach does not guarantee reversible transitionprobabilities, which only occurs in the limit of large num-bers of reversible transition counts. In practice, how-ever, the largest eigenvectors of the transition probabil-ity matrix have very nearly real eigenvalues, such that weSM estimation with short reseeding trajectories 3can report relevant relaxation timescales and equilibriumpopulations. Observable operator model (OOM) theory

For an introduction to OOMs and their use in esti-mating MSMs, see references.

OOMs are closelyrelated to Hidden Markov Models (HMMs) but havethe advantage, like MSMs, that they can be learned di-rectly from observable-projected trajectory data. Un-like MSMs, they require the collection of both one-steptransition count matrices, and a complete set of two-steptransition count matrices. With enough training data,and suﬃcient model rank, OOMs can recapitulate exactrelaxation timescales, uncontaminated by MSM state dis-cretization error. We used the OOM estimator as described by N¨uskeet al. and implemented in PyEMMA 2.3, using thedefault method of choosing the OOM model rank bydiscarding singular values that contribute a bootstrap-estimated signal-to-noise ratio of less than 10. Thesettings used to instantiate the model in PyEMMA are: OOMReweightedMSM(reversible=True,count mode=’sliding’, sparse=False,connectivity=’largest’,rank Ct=’bootstrap counts’, tol rank=10.0,score method=’VAMP2’, score k=10,mincount connectivity=’1/n’)

The OOM estimator returns two estimates: (1) a so-called corrected

MSM, which derives from using the unbi-ased OOM equilibrium correlation matrices to constructa matrix of MSM transition probabilities, and (2) the

OOM timescales , which (given suﬃcient training data)estimate the relaxation timescales uncontaminated bydiscretizaiton error. Note that there is no correspond-ing MSM for the OOM timescales.

RESULTSA simple example of adaptive seeding estimation error

To illustrate some of estimation errors that can arisewith adaptive seeding, we ﬁrst consider a simple modelof 1D diﬀusion over a potential energy surface V ( x ).The continuous time evolution of a population distribu-tion p ( x, t ) is analytically described by the Fokker-Planckequation, ∂∂t p ( x, t ) = D ∂ ∂x p ( x, t ) + µ ∂∂x (cid:2) p ( x, t ) ∇ V ( x ) (cid:3) (7)where D is the diﬀusion constant and µ = D/k B T is themobility. Here x and t are unitless values with D set tounity and the energy function u ( x ) in units of k B T .Here we consider a “ﬂat-bottom” landscape, over arange x ∈ (0 ,

2) with periodic boundaries, described by V ( x ) = e − b ( | x − |− . e − b ( | x − |− . . (8) where b = 100. The landscape is partitioned into twostates: x is assigned to state 0 if x < . x ≥ . b → ∞ , the free energy diﬀerence between thestates is exactly 1 ( k B T ), with an equilibrium constantof K = e ≈ . b = 100, K ≈ . b was chosen to help the stability of numericalintegration (performed at a resolution of ∆ x = 2 . × − and ∆ t = 5 × − ).To emulate adaptive seeding, we propagate the con-tinuous dynamics starting from an initial distribution p ( x,

0) centered on state i , and compute the transitionprobabilities p ij from state i to j from the evolved den-sity after some lag time τ (Figure 1B).We ﬁrst consider a seeding strategy where simula-tions are started from the center of each state, which wemodel by propagating density from an initial distribu-tions p ( x,

0) = δ ( x ) to compute p and p as a functionof τ , and from p ( x,

0) = δ ( x −

1) to compute p and p (Figure 1C). By t = 1, the population has reachedequilibrium. By detailed balance, the estimated equi-librium constant is ˆ K = p /p , which systematically underestimates the true value (Figure 1D). This is be-cause the initial distribution is non-equilibrium, in thedirection of being too uniform. There are also errors inestimating the relaxation timescales. The diﬀerent esti-mators described above (MLE, MLE with known equi-librium populations π i , population-weighted MLE, androw-normalized counts) were applied to estimate the im-plied timescale τ (Figure 1E). For this simple two-statesystem, all of the estimates are very similar to the ana-lytical result τ = − τ / ln(1 − p − p ) (with the excep-tion of MLE, which yields τ estimates that are about9% larger when t = 0 . τ = 0 . τ is overestimated . This is becauseat early times, outgoing population ﬂuxes are underesti-mated.We also consider a seeding strategy where simulationsare started from randomly from inside each state, whichwe model by propagating density from initially uniformdistributions across each state (Figure 1F). As before,we ﬁnd that the equilibrium constant is underestimated,again, because the initial seeding is out-of-equilibrium(Figure 1G). The estimate is somewhat improved, how-ever, by the “pre-equilibration” of each state. In thiscase, the implied timescale τ is underestimated at earlytimes (Figure 1H). This is can be understood as arisingfrom the overestimation of outgoing population ﬂuxes, asnot enough population has yet reached the ground state(state 0).From these two scenarios, we can readily understandthat the success of estimating both kinetics and ther-modynamics from adaptive seeding greatly depends onhow well the initial seeding distribution approximatesthe equilibrium distribution. Moreover, we can see thattimescale estimates can be either under-estimated orover-estimated, depending on the initial seeding distribu-tion. It is important to note that these errors arise pri-marily because of non-equilibrium sampling, rather thanSM estimation with short reseeding trajectories 4ﬁnite sampling error or discretization error. Seeding of a 1-D potential energy surface

We next consider the following two-well potentialenergy surface, as used by Stelzl et al. : U ( x ) = − k B T . ln[ e − x − − + e − x − ] for x ∈ [1 . , . k B T = 0 .

596 kcal · mol − . The state space is uniformlydivided into 20 states of width 0.2 to calculate discrete-state quantities. Diﬀusion on the 1-D landscape is ap-proximated by a Markov Chain Monte Carlo (MCMC)procedure in which new moves are translations randomlychosen from δ ∈ [ − . , +0 .

05] and accepted with proba-bility min(1 , exp( − β [ U ( x + δ ) − U ( x )])), i.e. the Metropo-lis criterion.As a standard against which to compare estimatesfrom seeding trajectories, we used a set of very longtrajectories to construct an optimal MSM model us-ing the above state deﬁnitions. To estimate the “true”relaxation timescales of the optimal model, we gener-ated long MCMC trajectories of 10 steps, samplingfrom a series of scaled potentials U ( λ ) ( x ) = λU ( x ) for λ ∈ [0 . , . , . , . , . , . λ value, 20 tra-jectories were generated, with half of them starting from x = 2 . x = 5 .

0, result-ing in a total of 120 trajectories. The DTRAM estimatorof Wu et al. was used to estimate the slowest relaxationtimescale as 9.66 ( ± × steps, using a lag timeof 1000 steps.We next generated adaptive seeding trajectory data forthis system. We limited the lag time to τ = 100 steps,and generated s trajectories (5 ≤ s ≤ nτ (for n = 10 , , × s × n transition counts between allstates i and j in lag time τ , stored in a 20 ×

20 countmatrix of entries c ij . From these counts, estimates of thetransition probabilities, p ij were made.Estimates of the slowest MSM implied timescales werecomputed as τ = − τ / ln µ , where µ is the largest non-stationary eigenvalue of the matrix of transition probabil-ities. Estimates of the free energy diﬀerence ∆ F betweenthe two wells were computed as − k B T ln( π L / (1 − π L ))where π L is the estimated total equilibrium populationof MSM states with x < .

3. To estimate uncertaintiesin ln τ and ∆ F , twenty bootstraps were constructed bysampling from the s trajectories with replacement.Results for ﬁve diﬀerent estimators are shown in Fig-ure 3. In all cases, timescale estimates improve withlarger numbers of seeding trajectories, and greater num-bers of steps in each trajectory. From these results, onecan clearly see that MLE (red lines in Figure 3), the de-fault estimator in software packages like PyEMMA andMSMBuilder , consistently underestimates the impliedtimescale, although this artifact becomes less severe asmore trajectory data is incorporated. At the same time,MLE estimates of equilibrium populations become sys-tematically worse with more trajectory data. This is pre-cisely because uniform seeding produces a biased, non-equilibrium distribution of sampled trajectories, but the MLE assumes that trajectory data is drawn from equi-librium . In contrast, the MLE with known equilibriumpopulations (blue lines in Figure 3) accurately predictsimplied timescales with much less trajectory data, un-derscoring the powerful constraint that detailed balanceprovides in determining rates. Of all the estimators, thismethod is the most accurate, but relies on having verygood estimates of equilibrium state populations a priori .The row-normalized counts estimator (yellow lines inFigure 3) estimates timescales well, provided that a suf-ﬁcient number of observed transitions must be sampledto obtain proper estimates. This is necessary to achieveconnectivity and to ensure that the transition matrix isat least approximately reversible. However, the free en-ergy diﬀerences estimated by row-normalized counts arehighly uncertain, and in the limit of large numbers ofsuﬃciently long trajectories (10 steps), estimates of freeenergies become systematically incorrect due the the sta-tistical bias from seeding. In comparison, the MLE withpopulation-weighted trajectory counts (magenta lines inFigure 3) is able to make much more accurate estimatesof free energies with less trajectory data. We note thatwhile here we are reweighting the transition counts withthe known populations, the results are similar if reason-able approximations of the state populations are used(data not shown). Compared to the row-normalizedcounts estimator, implied timescales are slower to con-verge with more trajectory data, but are comparablewhen the seeding trajectories are suﬃciently long (10 steps). With suﬃciently long trajectories (10 steps),the MLE timescales are better estimated than either therow-normalized counts estimator or population-weightedtrajectory counts, but MLE is ﬂawed because the freeenergies are so biased from the seeding. The population-weighting trajectory counts estimator avoids this artifactto give good estimates of free energies.The “corrected” and OOM timescales from the OOMestimator consistently underestimate the slowest impliedtimescale for this test system, although these estimatesimprove with greater numbers and lengths of trajectories(cyan and green lines in Figure 3). The OOM estimatoris susceptible statistical noise, as evidenced by the aver-age OOM model rank selected by the PyEMMA imple-mentation, which increases from 2 to 10 as the numberof trajectories are increased (Figure 4). Like the row-normalized counts estimator, the OOM estimator givespoorly converged estimates of two-well free energy diﬀer-ences. Adaptive seeding of folding landscapes for WW Domainand NTL9(1-39)

For studying protein folding, an adaptive seeding strat-egy may be useful for several reasons. For one, gen-erating ensembles of short trajectories may simply bea practical necessity, due to the limited availability ofspecial-purpose hardware to generate ultra-long trajec-tories. Adaptive seeding may also be able to eﬃcientlyleverage an existing MSM (perhaps built from ultra-longtrajectory data) to predict perturbations to folding bySM estimation with short reseeding trajectories 5

A BCDE FGH

FIG. 1. Adaptive seeding in a simple two-state system. (A) The potential energy surface V ( x ), annotated with state indices.(B) Propagation of an initial distribution p ( x,

0) = δ ( x −

1) (blue), plotted at increments ∆ t = 0.1, up to t = 1. Startingfrom an initial distribution p ( x,

0) = δ ( x − x ) using x = 0 ,

1, shown are estimated quantities as a function of lag time τ :(C) transition probabilities p ij , (D) equilibrium constant K = π /π , and (E) relaxation timescale τ , as computed by variousestimators. Starting from uniform distributions p ( x,

0) over each state, shown are estimated quantities as a function of lagtime τ : (F) transition probabilities p ij , (G) equilibrium constant K = π /π , and (H) relaxation timescale τ , as computed byvarious estimators. mutations, or to examine the eﬀect of diﬀerent simula-tion parameters like force ﬁeld or temperature.How can we best utilize adaptive seeding of MSMsbuilt from ultra-long trajectory data to make good es-timates of both kinetics and thermodynamics? To ad-dress this question, here we test the performance of var-ious MSM estimators on the challenging task of esti-mating MSMs from adaptive seeding of protein foldinglandscapes. The folding landscapes come from MSMsbuilt from ultra-long reversible folding trajectories offast-folding mini-proteins WW domain and NTL9(1-39).WW domain is a 35-residue protein signaling do-main with a three-stranded β -sheet structure that bindsto proline-rich peptides. Its folding kinetics and ther- modynamics have been studied extensively by time-resolved spectroscopy, and its folding mechanismhas been probed extensively using molecular simulationstudies. Moreover, an impressively large number ofsite-directed mutants of WW domain have been con-structed to investigate folding mechanism and discoverfast-folding variants, including the Fip mutant (Fip35)of human pin1 WW domain, for which mutations in ﬁrsthairpin loop region resulting in a fast folding time of ∼ µ s at 77.5 ◦ C. Further mutation of the the sec-ond hairpin loop of Fip35 produced an even faster-foldingvariant, called GTT, in which the native sequence Asn-Ala-Ser (NAS) was replaced with Gly-Thr-Thr (GTT),resulting in a a fast relaxation time ∼ µ s at 80 ◦ C. SM estimation with short reseeding trajectories 6 x U ( x ) ( kc a l / m o l ) FIG. 2. A 1-D two-state potential was used to test the perfor-mance of MSM estimators on seeding trajectory data. Verti-cal lines denote the twenty uniformly-spaced discrete states.

NTL9(1-39) domain is a 39-residue truncation variantof the N-terminal domain of ribosomal protein L9, whosefolded state consists of an α -helix and three-strand β -sheet. NTL9(1-39), which has a folding relaxation time of ∼ ◦ C, has been extensively probed by bothexperimental and computational studies. The K12Mmutant of NTL9(1-39) is a fast-folding variant with afolding timescale of ∼ µ s at 25 ◦ C. Markov State Model construction

Molecular simulation trajectory data for GTT WWdomain and K12M NTL9(1-39) were provided by Shawet al., Our ﬁrst objective was to construct high-qualityreference MSMs from the available trajectory data, tobe used as benchmarks against which we could compareadaptive seeding results.Two independent ultra-long trajectories of 651 and 486 µ s, performed at 360K using the CHARMM22* forceﬁeld, were used to construct MSMs of GTT WW domain,as previously described by Wan and Voelz. The MSMwas constructed by ﬁrst projecting the trajectory datato all pairwise distances between C α and C β atoms, per-forming dimensionality reduction via tICA (time-laggedindependent component analysis), and then cluster-ing in this lower-dimensional space using the k -centersalgorithm to deﬁne a set of metastable states. A 1000-state MSM for GTT WW domain was constructed usinga lag time of 100 ns. The generalized matrix Raleigh quo-tient (GMRQ) method was used to choose the optimalnumber of states (1000) and number of independent tICAcomponents (eight). The MSM gives an estimated fold-ing relaxation timescale of 10.2 µ s, which is comparableto the folding time of 21 ± µ s estimated from analy-sis of the trajectory data by Lindorﬀ-Larsen et al. Itis also comparable to the 8 µ s timescale estimate froma three-state MSM model built using sliding constraint rate estimation by Beauchamp et al. As described be-low, since seeding trajectory data for a 1000-state MSMwas too expensive to analyze using the OOM estimator,we additionally built a 200-state model, using the samemethods, which gives an estimated folding timescale of2.3 µ s.Four trajectories of the K12M mutant of NTL9(1-39),of lengths 1052 µ s, 990 µ s, 389 µ s, and 377 µ s, simulatedat 355 K using the CHARMM22* force ﬁeld, were pro-vided by Shaw et al. . Choosing a lagtime of 200 ns,we constructed an MSM using the same methodology asGTT WW domain described above, resulting in a 1000-state model utilizing 6 tICA components which gives afolding timescale estimate of ∼ µ s. Although this is afaster timescale than the experimentally measured valueat 25 ◦ C, it accurately reﬂects the timescale of foldingevents observed in the trajectory data (at 355 K), andis similar to folding timescales measured by T-jump 2-D IR spectroscopy at 80 ◦ C ( ∼ µ s). We also notethat previous analysis of the K12M NTL9(1-39) trajec-tory data has given similar timescale estimates. Analysisof the trajectory data by Lindorﬀ-Larsen et al. yieldedan estimate of ∼ µ s, a three-state MSM model builtusing sliding constraint rate estimation by Beauchampet al. gives an estimate of ∼ µ s, and an MSM byBaiz et al. gives a timescale estimate of 18 µ s. As wedid for GTT WW domain, and using the same methods,we additionally built a 200-state MSM, which gives anestimated folding timescale of 5.2 µ s. Estimated folding rates and equilibrium populations fromreseeding simulations

To emulate data sets that would be obtained by adap-tive reseeding, we randomly sampled segments of the ref-erence trajectory data that start from each metastablestate. An advantage of emulating reseeding trajectorydata in this way is the ability to recapitulate the originalmodel in the limit of long trajectories. The full set of ref-erence trajectory data ( ∼ ∼ numbers of trajectories10 i m p li ed t i m e sc a l e e s t i m a t e steps numbers of trajectories10 i m p li ed t i m e sc a l e e s t i m a t e steps numbers of trajectories10 i m p li ed t i m e sc a l e e s t i m a t e steps MLEMLE known_piMLE pop-weightedrow-normcorrectedOOMactual numbers of trajectories20246810 ∆ F e s t i m a t e ( kc a l / m o l ) steps numbers of trajectories20246810 ∆ F e s t i m a t e ( kc a l / m o l ) steps numbers of trajectories20246810 ∆ F e s t i m a t e ( kc a l / m o l ) steps MLEMLE known_piMLE pop-weightedrow-normcorrectedactual

FIG. 3. Estimates of the slowest implied timescales (left) and two-well free energy diﬀerences (right), shown as a functionof number of seeding trajectories of various lengths. Shown are results for: (red) MLE, (blue) MLE with known equilibriumpopulations, (magenta) MLE with population-weighted trajectory counts, (yellow) row-normalized counts, (cyan) correctedMSMs, and (green) OOM timescales. Black horizontal lines denote the highly converged estimates from DTRAM. Shadedbounds represent uncertainties estimated from a bootstrap procedure. the equilibrium populations estimated from the referenceMSMs. In cases where the row-normalized counts esti-mator fails at large lag times due to the disconnectionof states, a single pseudo-count was added to elementsof the transition count matrix that are nonzero in thereference MSM. a. Estimates of slowest implied timescales.

We ap-plied diﬀerent estimators to the reseeding trajectory datagenerated for 200-state and 1000-state MSMs of GTTWW domain and K12M NTL9(1-39), and comparedthe predicted slowest implied timescales to the referenceMSMs. Similar implied timescale estimates were ob-tained using ﬁve independent reseeding trajectories ini-tiated from each state (Figure 7) and ten independentreseeding trajectories (Figure 8).Across most of the estimators tested, implied timescaleestimates are statistically well-converged using trajecto-ries of 100 ns or more. For the MLE-based estimators in particular (MLE, MLE with known populations, MLEwith population reweighting), the estimates have smalluncertainties and are not strongly sensitive to trajectorylengths tested. This suggests that while there is enoughseeding data to combat statistical sampling error, thereremain systematic errors in estimated timescales, mainlydue to the estimation method used.The MLE performs well for the 200-state MSMs ofGTT, but consistently underestimates the slowest im-plied timescale in all other MSMs. As seen with ourtests above with the 1-D two-state potential, this error islikely due to using a detailed balance constraint with sta-tistically biased seeding trajectory data. The MLE withknown populations is much more accurate at estimatingthe slowest implied timescale, and with less uncertainty.For the NLT9 MSMs, however, MLE with known popu-lations systematically underestimates the slowest impliedtimescale, again suggesting the biased seeding trajectorySM estimation with short reseeding trajectories 8 numbers of trajectories024681012 OO M r an k steps numbers of trajectories024681012 OO M r an k steps numbers of trajectories024681012 OO M r an k steps FIG. 4. The average OOM rank increases with the numberof trajectories. Shaded bounds represent uncertainties esti-mated from a bootstrap procedure.. data has an inﬂuence despite the strong constraints en-forced by the estimator. MLE with population reweight-ing has more uncertainty than the other MLE-based es-timators, but overall is arguably the most accurate esti-mator across all the MSM reseeding tests.The OOM-based estimates of the slowest impliedtimescale (which could only be performed for the com-putationally tractable 200-state MSMs) required seedingtrajectory data of least 90 ns, in order to have suﬃcienttrajectory data to perform the bootstrapped signal-to-noise estimation, from which the OOM rank is selected.While the predicted OOM timescales overestimate the re-laxation timescales of the 200-state MSMs of both GTTand NTL9 systems, the timescale estimates from thecorrected MSMs more closely recapitulate the referenceMSMs timescales, with accuracy comparable to the MLEwith population reweighting, especially for seeding tra-jectories over 200 ns.The row-normalized counts estimator yields slowest implied timescales estimates accelerated compared to thereference MSM, similar to the MLE results. This accel-eration is greater than might be expected given our testswith the 1-D two-state potential, and may in part be dueto the necessity of adding pseudo-counts to the transitioncount matrix to ensure connectivity. b. Estimates of native-state populations.

We ap-plied diﬀerent estimators to the reseeding trajectory datagenerated for 200-state and 1000-state MSMs of GTTWW domain and K12M NTL9(1-39), and comparedthe predicted native-state populations to the referenceMSMs. The native-state population was calculated as theequilibrium population of the MSM state correspondingto the native conformation. Similar native-state popu-lations were estimated using ﬁve independent reseedingtrajectories initiated from each state (Figure 9) and tenindependent reseeding trajectories (Figure 10). These es-timates are compared to the native-state populations es-timated from the reference MSMs: 75.3% and 74.8% forthe 200-state and 1000-state MSMs of GTT WW domain,respectively, and 75.9% and 79.4% for the 200-state and1000-state MSMs of K12M NTL9(1-39), respectively.As we saw in our tests using a 1-D two-state potential,obtaining accurate estimates of the equilibrium popula-tions from very short seeding trajectories is a challengingtask. In those tests, we found that the most accurateestimates of native-state populations come from estima-tors that utilize some a priori knowledge of native-statepopulations. Here, the MLE with known populations re-capitulates the native population exactly (by deﬁnition),while the MLE with population reweighting successfullycaptures the the native-state population in a 200 micro-state model of both GTT and NTL9, and slightly over-estimates the native state population for the 1000-stateMSMs. In contrast, the MLE, row-normalized counts,and OOM estimators all underestimate the folded popu-lations. With the exception of some slight improvementthat begins to occur as the seeding trajectory length in-creases, these underestimates are independent of the tra-jectory length, indicating that that non-equilibrium seed-ing is to blame. The native state populations are under-estimated in these cases because there are many moreMSM microstates belonging the unfolded-state basin ofthe folding landscape.

DISCUSSION

Markov State Model approaches have enjoyed greatsuccess in the last decade due to their ability to inte-grate ensembles of short trajectories. In light of thissuccess, the allure of adaptive sampling methods, whichpromise to leverage the focused sampling of short tra-jectories, is understandable. Recent studies have shownthat a judiciously-chosen adaptive sampling strategy canaccelerate barrier-crossing, state discovery and pathwaysampling.

Our work in this manuscript underscores that a keyproblem with the practical use of adaptive sampling al-gorithms is the statistical bias introduced by focusedsampling, which makes diﬃcult the unbiased estimationSM estimation with short reseeding trajectories 9

A BDCE F native

FIG. 5. 200-state MSMs of (A) GTT WW domain and (B) K12M NTL9(1-39), built from ultra-long folding trajectory data,are shown projected onto the two largest tICA components. Red circles show the locations of the MSM microstates. The heatmap shows a density plot of the raw trajectory data, with color bar values denoting the natural logarithm of histogram counts.1000-state MSMs are shown for (C) GTT WW domain, and (D) K12M NTL9(1-39). (E) and (F) show corresponding 1-D freeenergy proﬁles along the ﬁrst tICA component, calculated from histogramming the trajectory data in bins of width 0.025. of both kinetics and thermodynamics. We have illus-trated these problems in simple 1D diﬀusion models, andin large-scale all-atom simulations of protein folding, bytesting the performance of various MSM estimators onadaptive seeding data generated these scenarios. In all cases, we ﬁnd that estimators that incorporate some formof a priori knowledge about equilibrium populations areable to correct for sampling bias to some extent, resultingin accurate estimates of both slowest implied timescalesand equilibrium free energies. Moreover, we have shownSM estimation with short reseeding trajectories 10

FIG. 6. Implied timescale plots for 200-state MSMs of (A) GTT WW domain and (B) K12M NTL9(1-39), and 1000-stateMSMs of (C) GTT WW domain and (D) K12M NTL9(1-39). Uncertainties (shaded areas) for GTT and NTL9 were estimatedusing 5-fold and 8-fold leave-one-out bootstraps, respectively. The dotted vertical line marks the lag time of 100 ns chosen toconstuct the MSMs.

A BC D

FIG. 7. Slowest implied timescales predicted by various MSM estimators as a function of seeding trajectory length. Estimatedslowest timescales are shown for 200-microstate MSMs of (A) GTT WW domain and (B) K12M NTL9(1-39), and 1000-microstate MSMs of (C) GTT WW domain and (D) K12M NTL9(1-39). Adaptive seeding data was generated using ﬁveindependent trajectories from each microstate. Uncertainties were computed using a bootstrap procedure. that an MLE estimator with population-weighted transi-tion counts is a very simple way to counteract this bias,and achieve reasonably accurate results.As pointed out by others previously, the success of such estimators is further demonstration of the impor-tance of incorporating thermodynamic information intoMSM estimates, given the powerful constraint that de-tailed balance provides. With this in mind, it is noSM estimation with short reseeding trajectories 11 A BC D

FIG. 8. Slowest implied timescales predicted by various MSM estimators as a function of seeding trajectory length. Estimatedslowest timescales are shown for 200-microstate MSMs of (A) GTT WW domain and (B) K12M NTL9(1-39), and 1000-microstate MSMs of (C) GTT WW domain and (D) K12M NTL9(1-39). Adaptive seeding data was generated using tenindependent trajectories from each microstate. Trajectory lengths longer than 100 ns for GTT WW domain and 200 nsfor K12M NTL9(1-39) result in trajectory datasets larger than the reference model, and are not shown. Uncertainties werecomputed using a bootstrap procedure.

AC BD

FIG. 9. Native state populations predicted by various MSM estimators as a function of seeding trajectory length. Estimatedpopulations are shown for 200-microstate MSMs of (A) GTT WW domain and (B) K12M NTL9(1-39), and 1000-microstateMSMs of (C) GTT WW domain and (D) K12M NTL9(1-39). Adaptive seeding data was generated using ﬁve independenttrajectories from each microstate. Uncertainties were computed using a bootstrap procedure. surprise that new multi-ensemble MSM estimators likeTRAM and DHAMed have been able to achieve moreaccurate results than previous estimators. Thus, whilemany existing adaptive sampling strategies focus sam- pling to reﬁne estimated transition rates between states,we expect that improved adaptive sampling estimatorsmay come from similar “on-the-ﬂy” focusing that seeksto reﬁne thermodynamic estimates as well.SM estimation with short reseeding trajectories 12 AC BD

FIG. 10. Native state populations predicted by various MSM estimators as a function of seeding trajectory length. Estimatedpopulations are shown for 200-microstate MSMs of (A) GTT WW domain and (B) K12M NTL9(1-39), and 1000-microstateMSMs of (C) GTT WW domain and (D) K12M NTL9(1-39). Adaptive seeding data was generated using ten independenttrajectories from each microstate. Trajectory lengths longer than 100 ns for GTT WW domain and 200 ns for K12M NTL9(1-39) result in trajectory datasets larger than the reference model, and are not shown. Uncertainties were computed using abootstrap procedure.

As adaptive sampling strategies and estimators con-tinue to improve, we expect the more eﬃcient use ofultra-long simulation trajectories combined with ensem-bles of short trajectories for adaptive sampling, especiallyto probe the eﬀects of protein mutations, or diﬀerentbinding partners for molecular recognition.

ACKNOWLEDGMENTS

The authors thank the participants of Folding@home,without whom this work would not be possible. We thankD. E. Shaw Research for providing access to folding tra-jectory data. This research was supported in part by theNational Science Foundation through major research in-strumentation grant number CNS-09-58854 and NationalInstitutes of Health grants 1R01GM123296 and NIH Re-search Resource Computer Cluster Grant S10-OD020095. F. No´e, C. Sch¨utte, E. Vanden-Eijnden, L. Reich, and T. R.Weikl, Proc. Natl. Acad. Sci. U. S. A. , 19011 (2009). V. A. Voelz, G. R. Bowman, K. Beauchamp, and V. S. Pande,J. Am. Chem. Soc. , 1526 (2010). J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held,J. D. Chodera, C. Sch¨utte, and F. No´e, J. Chem. Phys. ,174105 (2011). J. D. Chodera and F. No´e, Curr. Opin. Struct. Biol. , 135(2014). F. No´e, J. Chodera, G. Bowman, V. Pande, and F. No´e, “An in-troduction to markov state models and their application to long timescale molecular simulation, vol. 797 of advances in experi-mental medicine and biology,” (2014). X. Huang, G. R. Bowman, S. Bacallado, and V. S. Pande, Pro-ceedings of the National Academy of Sciences , 19765 (2009). V. A. Voelz, B. Elman, A. M. Razavi, and G. Zhou, J. Chem.Theory Comput. , 5716 (2014). Z. Shamsi, A. S. Moﬀett, and D. Shukla, Nature PublishingGroup , 12700 (2017). S. Doerr and G. De Fabritiis, Journal of chemical theory andcomputation , 2064 (2014). M. I. Zimmerman and G. R. Bowman, J. Chem. Theory Comput. , 5747 (2015). M. I. Zimmerman, J. R. Porter, X. Sun, R. R. Silva, and G. R.Bowman, Journal of Chemical Theory and Computation , 5459(2018). Z. Shamsi, K. J. Cheng, and D. Shukla, arXiv preprintarXiv:1710.00495 (2017). B. W. Zhang, D. Jasnow, and D. M. Zuckerman, The Journalof Chemical Physics , 054107 (2010). M. C. Zwier, J. L. Adelman, J. W. Kaus, A. J. Pratt, K. F. Wong,N. B. Rego, E. Su´arez, S. Lettieri, D. W. Wang, M. Grabe, D. M.Zuckerman, and L. T. Chong, Journal of chemical theory andcomputation , 800 (2015). A. Dickson and S. D. Lotz, The Journal of Physical ChemistryB , 5377 (2016). S. D. Lotz and A. Dickson, Journal of the American ChemicalSociety , jacs.7b08572 (2018). T. Dixon, S. D. Lotz, and A. Dickson, Journal of Computer-Aided Molecular Design , 547 (2018). B. Trendelkamp-Schroer and F. Noe, Physical Review X ,011009 (2016). F. N¨uske, H. Wu, J.-H. Prinz, C. Wehmeyer, C. Clementi, andF. No´e, The Journal of Chemical Physics , 094104 (2017). K. Lindorﬀ-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Sci-

SM estimation with short reseeding trajectories 13 ence , 517 (2011). H. Wu, A. S. J. S. Mey, E. Rosta, and F. No´e, The Journal ofChemical Physics , 214106 (2014). G. R. Bowman, K. A. Beauchamp, G. Boxer, and V. S. Pande,J. Chem. Phys. , 124101 (2009). B. Trendelkamp-Schroer and F. No´e, Physical Review X ,011009 (2016). H. Jaeger, Neural computation , 1371 (2000). H. Wu, J.-H. Prinz, and F. No´e, The Journal of Chemical Physics , 144101 (2015). M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Perez-Hernandez, M. Hoﬀmann, N. Plattner, C. Wehmeyer, J.-H.Prinz, and F. No´e, Journal of Chemical Theory and Compu-tation , 5525 (2015). L. S. Stelzl, A. Kells, E. Rosta, and G. Hummer, Journal ofchemical theory and computation , 6328 (2017). H. Wu, A. S. J. S. Mey, E. Rosta, and F. No´e, The Journal ofChemical Physics , 214106 (2014). M. P. Harrigan, M. M. Sultan, C. X. Hern´andez, B. E. Husic,P. Eastman, C. R. Schwantes, K. A. Beauchamp, R. T. McGib-bon, and V. S. Pande, Biophysj , 10 (2017). H. Nguyen, M. J¨ager, A. Moretto, M. Gruebele, and J. W. Kelly,Proc. Natl. Acad. Sci. U. S. A. , 3948 (2003). M. J¨ager, Y. Zhang, J. Bieschke, H. Nguyen, M. Dendle, M. E.Bowman, J. P. Noel, M. Gruebele, and J. W. Kelly, Proc. Natl.Acad. Sci. U. S. A. , 10648 (2006). F. Liu and M. Gruebele, Chem. Phys. Lett. , 1 (2008). K. Dave, M. J¨ager, H. Nguyen, J. W. Kelly, and M. Gruebele,Journal of Molecular Biology , 1617 (2016). F. Liu, D. Du, A. A. Fuller, J. E. Davoren, P. Wipf, J. W. Kelly,and M. Gruebele, Proc. Natl. Acad. Sci. U. S. A. , 2369(2008). Y. Gao, C. Zhang, X. Wang, and T. Zhu, Chemical PhysicsLetters , 112 (2017). R. B. Best, G. Hummer, and W. A. Eaton, Proceedings of theNational Academy of Sciences , 201311599 (2013). S. Piana, K. Sarkar, K. Lindorﬀ-Larsen, M. Guo, M. Gruebele,and D. E. Shaw, Journal of Molecular Biology , 43 (2011). H. Nguyen, M. J¨ager, J. W. Kelly, and M. Gruebele, J. Phys.Chem. B , 15182 (2005). J.-C. Horng, V. Moroz, and D. P. Raleigh, Journal of molecularbiology , 1261 (2003). J.-H. Cho and D. P. Raleigh, Journal of Molecular Biology ,174 (2005). V. A. Voelz, G. R. Bowman, K. Beauchamp, and V. S. Pande,Journal of the American Chemical Society , 1526 (2010). C. R. Schwantes and V. S. Pande, Journal of chemical theoryand computation , 2000 (2013). J.-H. Cho, W. Meng, S. Sato, E. Y. Kim, H. Schindelin, andD. P. Raleigh, Proceedings of the National Academy of Sciencesof the United States of America , 12079 (2014). C. R. Baiz, Y.-S. Lin, C. S. Peng, K. A. Beauchamp, V. A. Voelz,V. S. Pande, and A. Tokmakoﬀ, Biophysj , 1359 (2014). J.-H. Cho, S. Sato, and D. P. Raleigh, Journal of MolecularBiology , 827 (2004). K. Lindorﬀ-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Sci-ence (New York, N.Y.) , 517 (2011). H. Wan, G. Zhou, and V. A. Voelz, Journal of Chemical Theoryand Computation , 5768 (2016). C. R. Schwantes and V. S. Pande, J. Chem. Theory Comput. ,2000 (2013). G. Perez-Hernandez, F. Paul, T. Giorgino, G. De Fabritiis, andF. No´e, J. Chem. Phys. , 015102 (2013). R. T. McGibbon and V. S. Pande, J. Chem. Phys. , 124105(2015). K. A. Beauchamp, R. McGibbon, Y.-S. Lin, and V. S. Pande,Proc. Natl. Acad. Sci. U. S. A. , 17807 (2012). E. Hruska, J. R. Abella, F. N¨uske, L. E. Kavraki, andC. Clementi, The Journal of Chemical Physics , 244119(2018). H. Wu, F. Paul, C. Wehmeyer, and F. No´e, Proceedings of theNational Academy of Sciences of the United States of America113