[PDF] Long-timescale predictions from short-trajectory data: A benchmark analysis of the trp-cage miniprotein

Abstract

Elucidating physical mechanisms with statistical confidence from molecular dynamics simulations can be challenging owing to the many degrees of freedom that contribute to collective motions. To address this issue, we recently introduced a dynamical Galerkin approximation (DGA) [Thiede et al. J. Phys. Chem. 150, 244111 (2019)], in which chemical kinetic statistics that satisfy equations of dynamical operators are represented by a basis expansion. Here, we reformulate this approach, clarifying (and reducing) the dependence on the choice of lag time. We present a new projection of the reactive current onto collective variables and provide improved estimators for rates and committors. We also present simple procedures for constructing suitable smoothly varying basis functions from arbitrary molecular features. To evaluate estimators and basis sets numerically, we generate and carefully validate a dataset of short trajectories for the unfolding and folding of the trp-cage miniprotein, a well-studied system. Our analysis demonstrates a comprehensive strategy for characterizing reaction pathways quantitatively.

Full PDF

LLong-timescale predictions from short-trajectorydata: A benchmark analysis of the trp-cageminiprotein

John Strahan, † Adam Antoszewski, † Chatipat Lorpaiboon, † Bodhi P. Vani, † Jonathan Weare, ∗ , ‡ and Aaron R. Dinner ∗ , † † Department of Chemistry, University of Chicago, Chicago, IL 60637, USA ‡ Courant Institute of Mathematical Sciences, New York University, New York, New York10012, USA

E-mail: [email protected]; [email protected]

Abstract

Elucidating physical mechanisms with statistical conﬁdence from molecular dy-namics simulations can be challenging owing to the many degrees of freedom that con-tribute to collective motions. To address this issue, we recently introduced a dynamicalGalerkin approximation (DGA) [Thiede et al. J. Phys. Chem. , 244111 (2019)],in which chemical kinetic statistics that satisfy equations of dynamical operators arerepresented by a basis expansion. Here, we reformulate this approach, clarifying (andreducing) the dependence on the choice of lag time. We present a new projection ofthe reactive current onto collective variables and provide improved estimators for ratesand committors. We also present simple procedures for constructing suitable smoothlyvarying basis functions from arbitrary molecular features. To evaluate estimators andbasis sets numerically, we generate and carefully validate a dataset of short trajectoriesfor the unfolding and folding of the trp-cage miniprotein, a well-studied system. Our a r X i v : . [ phy s i c s . d a t a - a n ] S e p nalysis demonstrates a comprehensive strategy for characterizing reaction pathwaysquantitatively. Molecular dynamics simulations enable atomic-resolution investigation of complex processes.These investigations are often carried out by direct simulation: the equations of motionare numerically integrated forward in time to generate trajectories (times series of atomicpositions and, as needed, momenta) for as long as possible given available computationalresources. Since most events of interest occur on timescales longer than those accessibleby direct simulation, many enhanced sampling schemes have been developed to allow moreextensive interrogation of an event of interest without sacriﬁcing model ﬁdelity. Splittingmethods, for example, branch and prune a collection of simultaneously evolving trajecto-ries to promote progress in a small number of order parameters (or collective variables,CVs).

Regardless of whether trajectory data are generated by direct simulation or en-hanced sampling, an essential question remains: How can these data be analyzed to yieldnew understanding about the process under study?We recently introduced dynamical Galerkin approximation (DGA) to analyze trajec-tory data generated by direct simulation, as well as many enhanced sampling schemes. Inthis approach, conditional expectations such as committor functions are cast as solutionsto equations involving the operator determining the statistics of the underlying process, itstransition operator. The solution to the equation is then approximated as a linear combina-tion of basis functions. This approach builds on an extensive literature from the last decadethat shows that the eigenvalues and eigenvectors of the transition operator can be approxi-mated from trajectory data, subject to a Markov assumption. These spectral estimationmethods aim to characterize the slowest dynamical features of the system (e.g., transitionsbetween metastable states) as eigenvectors corresponding to the largest eigenvalues of the2ransition operator. When the goal is to study a particular event of interest, the indirectrelationship between the eigenvectors of the generator and the event of interest is a weaknessof the spectral estimation approach. Indeed, for many complex systems the true slowestdynamical features of the system are too slow to be of any physical interest.In contrast, the aim of DGA is not to extract spectral information. Instead DGA aimsto compute statistics that directly characterize a particular event under study. For example,when transitions between particular metastable states are of interest, the statistics thatDGA yields can be combined within the framework of transition path theory (TPT) to obtain reactive ﬂuxes and in turn reaction mechanisms. Because DGA analyzes shorttrajectory fragments, it can be used to process the data generated by many splitting schemes.Alternatively, the trajectory data can be generated by seeding initial conditions for shortdirect simulations throughout state space.Though most often employed as a spectral estimation tool, Markov State Models (MSMs)have also been used to approximate TPT related quantities.

DGA can be viewed asan extension of these MSM variants to a more general class of target quantities and tomore general representations (basis set expansions) of those quantities. However, even withparameters chosen as in MSMs, the DGA estimators introduced in this article improve upontheir MSM counterparts in several ways including reduced dependence on the crucial lagtime parameter and lower variance estimates of certain TPT quantities.In our previous study, we compared diﬀusion map and indicator basis sets for predict-ing mean ﬁrst-passage times and committors for the M¨uller-Brown model and the foldingof the protein Fip35 using six long equilibrium trajectories from D. E. Shaw Research. Because there were only a few folding events within those trajectories, it was diﬃcult toassess the performance of the method. One goal of the present study is to generate a proteinfolding dataset that enables robust application of the approach and to compare diﬀerentbasis sets and estimators systematically.To this end, we study the trp-cage miniprotein, a 20-residue fast-folding artiﬁcial sequence3asn-leu-tyr-ile-glu-trp-leu-lys-asp-gly-gly-pro-ser-ser-gly-arg-pro-pro-pro-ser) that has beenstudied extensively both experimentally and computationally.

In solution at 298 K,the protein folds on a 4 µ s timescale and unfolds on a 12 µ s timescale, which makes theseprocesses diﬃcult but not impossible to simulate directly. In particular, D. E. Shaw Researchproduced a 208 µ s equilibrium simulation of the K8A mutant of trp-cage using the Antonsupercomputer with the CHARMM 22* force ﬁeld. Although, like the Fip35 data, thistrajectory contains relatively few folding events, it has been the subject of previous MSM and variational approach for Markov processes (VAMP) studies. These earlier studies serveas valuable points of comparison and enable us to identify CVs that provide good controlover sampling. Though DGA does not depend directly on any choice of CV, its performanceis strongly aﬀected by the quality of the dataset of sampled trajectories. We use our chosenset of CVs together with enhanced sampling methods to generate a new dataset comprisedof many short trajectories that are distributed evenly throughout the CV space.In this article we reformulate DGA in terms of the transition operator of the underlyingMarkov process. This has two primary advantages relative to our previous formulation interms of the generator of the process. First, it clariﬁes the role of lag time in DGA esti-mates, showing that correctly constructed estimators should have no dependence on lag timein the inﬁnite-basis, inﬁnite-sampling limit. Second, the formulation in terms of the tran-sition operator leads directly to estimators that correctly account for boundary conditionsby stopping underlying trajectories appropriately. Using our improved DGA estimators weintroduce new estimators for TPT reaction rates and reactive currents. To make compu-tation of the reactive current tractable and the result readily interpretable, we introduce aprojection formula for the reactive current onto a CV space which allows us to assign relativeweights to transition paths in arbitrary CV spaces. We also introduce a new procedure forconstructing a basis set from arbitrary molecular features (here, primarily pairwise distancesbetween C α atoms, though we also explore CVs with delay embedding) and compare it withtwo basis sets that are used widely in the MSM literature: indicator functions on molecular4eatures and indicator functions on time-lagged independent component analysis (TICA)coordinates. We show that our DGA estimators with selected basis sets can robustlyyield remarkably good agreement with published results for committors and pathways, eventhough the total simulation time of our trp-cage dataset is only 30 µ s, with a maximumtrajectory length of 30 ns. The projection of the reactive currents on CVs facilitates bothvisualization and quantiﬁcation of information about pathways, enabling immediate identi-ﬁcation of the deﬁning properties of transition states. This makes our approach an eﬃcientone for exploring mechanisms. In this section, we introduce key dynamical statistics and explain how they can be deﬁnedin terms of an evolution operator (Section 2.1). An emphasis on forms that lead directlyto practical and accurate numerical estimators causes several departures from the standardpresentation of this material. We present our approach for solving the operator equationsnumerically by Galerkin (basis expansion) approximation (Section 2.2), and distinguishforward-in-time statistics (Section 2.2.1) from backward-in-time statistics involving the ad-joint of the evolution operator (Section 2.2.2); this is followed by a discussion of basis sets(Section 2.2.3) and an approach for constructing an approximately Markovian process whenthe molecular representation does not adequately capture the dynamics (delay embedding,Section 2.2.4). Finally, guided by TPT, we combine the dynamical statistics estimated byDGA to yield approximations of reaction rates and currents. (Section 2.3). The dynamics of a Markov process X ( t ) can be encoded in its associated transition operator, T t , which speciﬁes the evolution of the expectation of a function f over some interval of time5 ≥ T t f ( x ) = E [ f ( X ( t )) | X (0) = x ] . (1)The time index t can be continuous or discrete. The transition operator (also known as theKoopman operator), and in particular its eigenvectors and eigenvalues, are the key quan-tities in well-established methods for discovering slowly decorrelating features of a Markovprocess. The transition operator is also central to the DGA approach. However, in DGA,instead of estimating the spectrum of the transition operator, the goal is to solve linearequations representing certain conditional expectations.In ref. 7, we presented DGA in terms of the generator which, for a continuous timeprocess is deﬁned by the limit: L f ( x ) = lim t → T t f ( x ) − f ( x ) t . (2)For a discrete time process the limit is removed and t in (2) is replaced by the unit of asingle time step. A presentation in terms of the generator has the advantage that it resultsin very concise equations for quantities of interest. For example, consider the (forward)committor, q + ( x ), which is the probability of entering a product state B before a reactantstate A starting from x / ∈ A ∪ B : q + ( x ) = P [ X ( T A ∪ B ) ∈ B | x = x ] , (3)where T A ∪ B = inf { t > X ( t ) ∈ A ∪ B } is the time of ﬁrst entrance into A ∪ B . For x ∈ A , q + ( x ) = 1, and, for x ∈ B , q + ( x ) = 0. The committor satisﬁes the Feynman-Kac relation L q + ( x ) = 0 for x / ∈ A ∪ B, q + ( x ) = B ( x ) =  , x ∈ B , x / ∈ B for x ∈ A ∪ B (4)(see Eqs. (18) and (19) of ref. 7). 6n this article we choose to work directly with the transition operator instead of thegenerator because it facilitates the implementation of numerical formulas. It also greatlysimpliﬁes our description of TPT and clariﬁes the relationship between DGA and the well-established VAC approach to approximating spectral properties of the transition operator(see ref. 41). In the case of the committor, we integrate (4) until a chosen time τ to obtainthe equivalent form of the Feynman-Kac relation, T τA ∪ B q + ( x ) − q + ( x ) = 0 for x / ∈ A ∪ B, q + ( x ) = B ( x ) for x ∈ A ∪ B. (5)In this expression we have introduced the notation T tA ∪ B for the transition operator of thestopped process X ( t ∧ T A ∪ B ), i.e., T tA ∪ B f ( x ) = E [ f ( X ( t ∧ T A ∪ B )) | X (0) = x ] . (6)Here and below t ∧ T A ∪ B = min { t, T A ∪ B } , indicating that the evolution process does notproceed beyond escape.For a more general domain D and T D c = inf { t > X ( t ) / ∈ D } , the conditionalexpectation u ( x ) = E (cid:20) b ( X ( T D c )) − Z T D c a ( X ( t )) dt (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) for x ∈ D (7)solves the equation T τD c u ( x ) − u ( x ) = Z τ T tD c a ( x ) dt for x ∈ D, u ( x ) = b ( x ) for x / ∈ D. (8)To obtain (5) for the committor, choose D = ( A ∪ B ) c , b = B , and a = 0. In (7) and (8) weassume for simplicity that a ( x ) = 0 for x / ∈ D . For a discrete-time process the time integralin these expressions should be interpreted as a sum.7rucially, (8) holds for any choice of τ ≥ τ , (8) converges to (7). However, in most cases of interest, the escape time T D c is very large, making estimation of u in (7) by direct simulation of sample trajectoriesof X ( t ) prohibitively expensive. In the context of DGA, the signiﬁcance of (8) is that itexpresses u in terms of an expectation over short trajectories. The catch is that (8) must be“inverted” to solve for u . We now describe a Galerkin approach to approximating conditional expectations from shorttrajectory data. We ﬁrst introduce a “guess” function ψ that satisﬁes the boundary condi-tions (i.e., ψ ( x ) = b ( x ) for x / ∈ D ). Our approximation has the form u ( x ) ≈ ψ ( x ) + n X j =1 φ j ( x ) v j , (9)where { φ j ( x ) } is a set of n basis functions satisfying φ j ( x ) = 0 for x / ∈ D , and v is a vectorof n coeﬃcients. We begin by approximating predictions of quantities forward-in-time as in (7) by expandingthe solution u of (8) at a particular user chosen value of τ called the lag time. While thesolution u itself is independent of τ in (8), the quality of our approximation of u with a ﬁnitebasis may depend on the choice of lag time (even in the absence of sampling error). A similarphenomenon has recently been explained in detail in the context of the VAC algorithm. Substituting (9) into (8), multiplying by φ i and integrating over the distribution of sampledpoints µ to form the inner product h f, g i = R f ( x ) g ( x ) µ ( dx ), we obtain the linear system ofequations: ( C τ − C ) v = r τ , (10)8ith matrices C s ∈ R n × n for s = 0 , τ , C sij = h φ i , T sD c φ j i µ (11)and vector r τ ∈ R n , r τi = (cid:28) φ i , ψ ( x ) − T τD c ψ ( x ) + Z τ T tD c a ( x ) dt (cid:29) µ . (12)Given (11) and (12), (10) can be readily solved for v by standard methods of linear algebra.In models that represent molecules with high ﬁdelity, (11) and (12) cannot be evaluateddirectly because a closed form of T τD c is not known. DGA overcomes this issue by approxi-mating the action of the transition operator using short molecular dynamics trajectories: if X (0) is a sample drawn from µ and { X ( t ) } τt =0 is a trajectory segment of length τ startingfrom X (0), then we can estimate C sij (for s = 0 , τ ) and r τi as C sij ≈ M M X m =1 φ i ( X ( m ) (0)) φ j ( X ( m ) ( s ∧ T D c )) (13) r τi ≈ M M X m =1 φ i ( X ( m ) (0)) ψ ( X ( m ) (0)) − ψ ( X ( m ) ( τ ∧ T D c )) + ∆ N X p =0 a ( X ( m ) ( p ∆)) ! (14)where m indexes trajectory segments, ∆ is the sampling interval, and N satisﬁes N ∆ = τ ∧ T D c . To avoid overhead, it is advantageous to generate trajectories much longer than τ (but still much shorter than typical values of T D c ) and use a rolling window to generateshort trajectories of length τ . We further note that in practice conﬁgurations are not savedat every molecular dynamics step. This limits the resolution of both the lag time and thestopping time, which we take to be the time of the ﬁrst saved conﬁguration outside thedomain D . 9 .2.2 Adjoints, the steady state, and backward-in-time predictions To compute many important quantities we need not only to solve equations involving thetransition operator but also equations involving its adjoint ( T t ) † µ in the µ -weighted innerproduct, which by deﬁnition satisﬁes Z f ( x ) T t g ( x ) µ ( dx ) = Z g ( x )( T t ) † µ f ( x ) µ ( dx ) . (15)One such equation is for the change of measure w = dπ/dµ , which can be used to reweightfrom the sampling distribution µ to the stationary distribution π : Z f ( x ) π ( dx ) = Z f ( x ) w ( x ) µ ( dx ) , (16)assuming µ and w are normalized such that R w ( x ) µ ( dx ) = 1. Owing to the time transla-tional invariance of averages over the stationary distribution π , (15), and (16), the changeof measure satisﬁes the equation ( T τ ) † µ w ( x ) − w ( x ) = 0 . (17)(17) can be solved analogously to (8), but, in this case, there are no boundary conditions.The introduction of a basis leads to a linear system of equations of the form( ¯ C τ − ¯ C ) > v = 0 , (18)with ¯ C s (for s = 0 , τ ) diﬀering from C s only in the choice of basis (which is no longerrestricted to D ) and the use of T s in place of T sD c ; > denotes the transpose. We note thatby including φ ( x ) = 1 in the basis we can guarantee that the equation for v has a solution.10iven an approximate w , (16) can be computed as Z f ( x ) π ( dx ) ≈ M X j =1 f ( X ( j ) (0)) w ( X ( j ) (0)) , (19)with the weights normalized such that M X j =1 w ( X ( j ) (0)) = 1 . (20)That the change of measure can be estimated from short nonequilibrium trajectory data waspreviously observed in ref. 16.Another important quantity expressible in terms of an equation involving an adjoint ofthe transition operator is the backwards committor q − ( x ) = P (cid:2) X ( − T − A ∪ B ) ∈ A | X (0) = x (cid:3) , T − A ∪ B = inf { t > X ( − t ) ∈ A ∪ B } , (21)where X ( − t ), t ≥ T − t f ( x ) = ( T t ) † π f ( x ) = 1 w ( x ) ( T t ) † µ [ f w ]( x ) (22)(the last equality can be veriﬁed using (15)). The backward committor is the probabilitythat a trajectory currently at position x last came from the reactant state A rather than theproduct state B . It satisﬁes the Feynman-Kac relation T − τA ∪ B q − ( x ) − q − ( x ) = 0 for x / ∈ A ∪ B, q − ( x ) = A ( x ) for x ∈ A ∪ B. (23)Consistent with our deﬁnition of T tA ∪ B above, T − tA ∪ B is the transition operator for the steady-state backward-in-time process stopped upon ﬁrst entrance in A ∪ B .To expand and approximate q − according to the DGA recipe described above, we needto estimate µ -weighted inner products involving T − tA ∪ B . To that end we note that, as long as11 = 0 on A ∪ B , (cid:10) g, T − tA ∪ B f (cid:11) µ = Z E (cid:20) f ( X ( S A ∪ B ( t ))) g ( X ( t )) w ( X ( t )) (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) w ( x ) µ ( dx ) (24)where S A ∪ B ( t ) = sup { s ≤ t : X ( s ) ∈ A ∪ B } (25)(with S A ∪ B ( t ) = 0 if X ( s ) / ∈ A ∪ B for all 0 ≤ s ≤ t ). We provide a derivation of (24) inAppendix A. Just as for the forward committor, we expect that use of a sampling measure µ with high resolution in transition regions will lead to higher approximation accuracy (i.e.,better ability of a ﬁnite basis to capture the dynamics). However, in our experience thefactor of w − ( X ( t )) in (24) leads to signiﬁcant sampling errors for larger values of t . For ourbackward committor calculation we therefore weight inner products by π , using the formula (cid:10) g, T − tA ∪ B f (cid:11) π = Z E (cid:20) f ( X ( S A ∪ B ( t ))) g ( X ( t )) (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) w ( x ) µ ( dx ) . (26)(26) allows inner products involving T − tA ∪ B to be computed using forward trajectories of X initiated according to µ, i.e., exactly the same ingredients required to make forward-in-timepredictions by DGA.Following our procedure for forward quantities outlined in Section 2.2, given a guessfunction ψ satisfying ψ = 1 on A and ψ = 0 on B and basis functions φ j that are zero on A ∪ B , we can build an approximation q − ( x ) ≈ ψ ( x ) + n X j =1 φ j ( x ) v j (27)by solving ( C − τ − C ) v = r − τ (28)12ith C − τij = (cid:10) φ i , T − τA ∪ B φ j (cid:11) π = Z E (cid:20) φ j ( X ( S A ∪ B ( τ ))) φ i ( X ( τ )) (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) w ( x ) µ ( dx ) (29)and r − τi = h φ i , ψ i π − (cid:10) φ i , T − τA ∪ B ψ (cid:11) π = Z E (cid:20) ( ψ ( X ( τ )) − ψ ( X ( S A ∪ B ( τ )))) φ i ( X ( τ )) (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) w ( x ) µ ( dx ) (30)where the second equality in each display follows from (26).Along with the forward committor q + and the stationary change of measure w , thebackward committor is a key ingredient of TPT. In Section 2.3 we describe how DGAestimates of these quantities can be combined with TPT to reveal key properties of steady-state transition paths from the reactant state A to the product state B . However, beforethat, we complete our presentation of DGA with a discussion of molecular representationsand basis sets, with emphasis on those that we employ in the present study to analyzetrp-cage miniprotein unfolding and folding. A key determinant of the performance of DGA is the choice of basis set. Constructing abasis set that respects the boundary conditions of the problem and captures the dynamicswith relatively few functions requires care. Here we discuss how we generated the basis setsthat we compare later in our numerical experiments, and explain why we chose them overalternatives.In addition to the choice of functions, there is also a choice of molecular representation(i.e., the features that serve as inputs to the functions). Although molecular dynamics tra-jectories are generally recorded as sequences of Cartesian coordinates, the inputs to the basisfunctions are generally internal coordinates. This removes the eﬀects of trivial translations13nd rotations, and it can improve the statistics. The internal coordinates that we use arepairwise distances between C α atoms greater than two sequence positions apart; for trp-cage,there are 153 such distances. In other words, the process X ( t ) to which we apply DGA (andTPT) is the length 153 vector of pairwise distance values. In our tests we found that in-cluding additional features, such as backbone dihedral angles, did not improve performance.We assume that the reactant state A and product state B of interest can be characterizedin terms of these variables. We construct basis functions of these variables that satisfy thehomogeneous boundary condition on the domain D = ( A ∪ B ) c .In this work, we compare three choices of basis set: indicator functions on the pairwisedistances, indicator functions constructed on the top 10 TICA coordinates computedfrom the pairwise distances at a lag time of 0.5 ns, and smooth functions of pairwise distancesthat satisfy the boundary conditions. We refer to these henceforth as the distance indicator,TICA indicator, and modiﬁed distance basis sets. We constructed the distance indicator andTICA indicator basis sets and their guess functions as follows:1. For the distance indicator basis set, we constructed 200 indicator functions by mini-batch k -means clustering as implemented in PYEMMA on the values of the 153 pair-wise distances. For the TICA indicator basis set, the clustering was performed on thetop 10 TICA coordinates constructed on the pairwise distances.2. We retained all resulting indicator functions with non-zero regions fully contained in( A ∪ B ) c as the basis set. We split any indicator functions with non-zero regionsoverlapping with A or B , and we redeﬁned them to be non-zero only in the portions in( A ∪ B ) c . For the change of measure calculations, boundary conditions are not present,so we used all indicator functions unmodiﬁed.3. For the forward committor calculation we took the the guess function to be ψ ( x ) = B ( x ). For the backward committor calculation we took the guess function to be ψ ( x ) = A ( x ). 14ith an indicator basis, the DGA and MSM estimator (with appropriate state deﬁnitions)of the forward committor q + and change of measure w become similar. We note howeverthat the DGA (as formulated here) and MSM approaches diverge both in DGA’s use ofstopped trajectories and in the way q + and w (and q − ) are used to estimate TPT quantitiesas described in Section 2.3.We constructed the distance basis set and its guess function as follows:1. We computed d A and d B as the distance in feature space (i.e. in 153-dimensionalEuclidean space) to the sampled points in states A and B , respectively.2. We set h ( x ) = d A d B / ( d A + d B ) , which obeys the homogeneous boundary conditionsby construction.3. We computed basis functions obeying the boundary conditions by multiplying eachcoordinate of the pairwise distance vector x by h ( x ): φ i ( x ) = x i h ( x ). For the changeof measure calculation, we use φ i = x i and add the constant function into the set ofchosen features.4. To remove any linear dependencies introduced by enforcing the boundary conditions,and to ensure numerical stability, we orthogonalized the basis set φ i with respect tothe sampling measure (up to sampling error) using a singular value decomposition.5. For the forward committor calculation we took the guess function to be ψ ( x ) = d B / ( d A + d B ) . For the backward committor calculation we took the guess function tobe ψ ( x ) = d A / ( d A + d B ) .Although here we use the backbone pairwise distances, we note that this construction proce-dure could be used to generate basis sets obeying the homogeneous boundary conditions fora choice of variables other than the pairwise distances such as dihedral angles, radial basisfunctions, or soft indicator functions. 15he indicator and TICA basis sets are the most widely used in the MSM literature. Var-ious alternatives have been proposed speciﬁcally in the context of spectral estimation. In our previous work, we considered a basis set based on diﬀusion maps. Due to thesize of our trp-cage dataset ( ∼ datapoints), the O ( N ) scaling of the matrix diagonaliza-tion associated with the diﬀusion map proved prohibitively computationally costly withoutsubsampling and out of sample extension. Application of DGA as described so far assumes that the underlying process X ( t ) is Marko-vian; the conditional expectations that DGA seeks to approximate are not fully deﬁned if X ( t ) is not Markovian. Yet, in the previous section we described an approach to building abasis set for DGA consisting of functions of only a subset of the full collection of variables(selected pairwise distances). Though the dynamics of this subset are not strictly Marko-vian, in Section 4 we show that, at least in the speciﬁc context of the trp-cage system, theremaining degrees of freedom relax suﬃciently fast that DGA yields accurate results.However, in some circumstances, one may only have access to a small number of variablesthat are insuﬃcient to specify the dynamics. This situation is typical when the data are froman experiment. In this case, we can construct a more expressive representation of the systemfrom time-lagged images, i.e., if X ( t ) is not itself Markovian we can instead apply DGA tothe augmented process ¯ X ( t ) = ( X ( t − M δ ) , X ( t − ( M + 1) δ ) , . . . , X ( t )). For large enough M one can expect ¯ X ( t ) to be nearly Markovian.In Section 4.5, we show that delay embedding can signiﬁcantly improve DGA estimateswhen a small number of CVs is used to characterize molecular conﬁgurations. Writing thevalues of the CVs at time t as the vector X ( t ), we construct the delay embedded process¯ X ( t ). We then construct a basis set following the recipe in Section 2.2.3 for the modiﬁeddistance basis, but replacing X with ¯ X . We then extend other functions f of the CV spaceto the delay-embedded space by f ( ¯ X ( t )) = f ( X ( t − b M/ c δ )). This allows us to extend the16tates A and B (which can both be deﬁned in terms of the CVs) as well as the functions a and b in (7). We then apply DGA as outlined above directly on the delay-embedded space. Estimates of rates from simulations are frequently of interest because they can be compareddirectly with experimental measurements, and they can provide indirect information aboutmechanisms. TPT in principle provides not just rate estimates but reactive currents orﬂuxes, which provide direct information about mechanisms. However, previous calculationsof reactive current have been limited to toy models and depictions of the reactive ﬂux betweenmetastable states can been diﬃcult to interpret. Working within the TPT framework andbuilding upon DGA approximations of w , q + , and q − , in this section we introduce robustestimates of the reaction rate and of an easily interpretable projection of the reactive currentonto CVs (as opposed to over the network of metastable states).There are various expressions for the rate in TPT. One approach is based on the rateat which trajectories transition from A to B , R AB . If U is any set for which A ⊂ U and B ⊂ U c then, for a continuous time process, R AB = lim t → t Z (cid:0) U ( x ) T t [ U c q + ]( x ) − U c ( x ) T t [ U q + ]( x ) (cid:1) q − ( x ) π ( dx )= lim t → t Z (cid:0) U ( x ) T t q + ( x ) − T t [ U q + ]( x ) (cid:1) q − ( x ) π ( dx ) , (31)where the second line is obtained by noting that U c ( x ) = 1 − U ( x ). Here and below, for adiscrete time process the limit is removed and t is replaced by the unit of a single time step.Expression (31) simply counts trajectories with forward crossings of the surface dividing U and U c , weighted by their probabilities that they start in A and end in B . Consequently,when using this formula to estimate rates from data, only those trajectories that cross thesurface dividing U and U c contribute. Because these trajectories are generally a small fractionof the data, this results in relatively large variances in estimates. We can obtain considerably17etter estimates by considering the isocommittor surfaces: U ( z ) = { x : q + ( x ) ≤ z, x ∈ D } for z ∈ (0 , R AB is independent of z . Integrating (31) with respect to z and using linearity of T t yields R AB = Z R AB dz = lim t → t Z (cid:26) T t q + ( x ) Z [0 ,z ] (cid:0) q + ( x ) (cid:1) dz − T t (cid:20) q + Z [0 ,z ] ( q + ) dz (cid:21) ( x ) (cid:27) q − ( x ) π ( dx )= lim t → t Z (cid:26) T t q + ( x ) (cid:2) − q + ( x ) (cid:3) − T t (cid:2) q + (1 − q + ) (cid:3) ( x ) (cid:27) q − ( x ) π ( dx )= lim t → t Z (cid:0) T t q ( x ) − q + ( x ) T t q + ( x ) (cid:1) q − ( x ) π ( dx ) , (32)where we have made use of the fact that the integral of the Heaviside function (which entersthrough the indicator functions) is the ramp function. This expression for R AB immediatelysuggests the estimator: R AB ≈ t X i q + ( X ( i ) ( t ∧ T A ∪ B )) (cid:2) q + ( X ( i ) ( t ∧ T A ∪ B )) − q + ( X ( i ) (0)) (cid:3) q − ( X ( i ) (0)) w ( X ( i ) (0))(33)for some small choice of t . Note the use of stopped trajectories in (33). For very small valuesof t the inclusion of the stopping time T A ∪ B has no impact. However, in our numericalexperiments we ﬁnd that use of stopped trajectories improves the accuracy of (33) and, inparticular, (37) below, for most choices of t . Given an estimate of R AB , the rate constant is k AB = R AB P i q − ( X ( i ) (0)) w ( X ( i ) (0)) . (34)The denominator in 34 is the mean of the backward committor, which is the fraction of timethe system spends having last visited state A.As noted above, we can also use simulations to understand how reactive trajectories ﬂowthrough a CV space. One way to do this is to partition the space into discrete states andthen estimate the reactive ﬂuxes between pairs of states. However, the resulting directed18raph can be complicated and diﬃcult to interpret. When the sample paths are continuousthe reactive ﬂux between neighboring values in CV space is can be summarized as a singlevector ﬁeld in CV space. If θ is a vector-valued CV and ds is a bin in CV space of volume | ds | , the reactive current at point s is J θAB ( s ) = lim t, | ds |→ t | ds | Z (cid:0) T t [ θq + ]( x ) − θ ( x ) T t q + ( x ) (cid:1) { θ ∈ ds } ( x ) q − ( x ) π ( dx )+ (cid:0) T t [ { θ ∈ ds } θq + ]( x ) − θ ( x ) T t [ { θ ∈ ds } q + ]( x ) (cid:1) q − ( x ) π ( dx ) (35)In appendix C, we show that J θAB ( s ) = R J AB · ∇ θ ( x ) δ ( θ ( x ) − s ) π ( dx ), and we establish thatthe projected reactive current satisﬁes Z ∂C θ J θAB ( s ) · n C θ dσ C θ = Z ∂C J AB · n C dσ C , (36)where C θ is any region of CV space such that its inverse image (under the CV mapping) inthe full conﬁguration space, C , contains A and does not intersect B . To estimate J θAB fromtrajectory data we have the following estimator: J θAB ( s ) ≈ t | ds | M X i =1 q + ( X ( i ) ( t ∧ T A ∪ B )) (cid:0) θ ( X ( i ) ( t ∧ T A ∪ B )) − θ ( X ( i ) (0)) (cid:1) × θ ∈ ds ( X ( i ) (0)) q − ( X ( i ) (0)) w ( X ( i ) (0))+ 12 t | ds | M X i =1 q + ( X ( i ) ( t )) (cid:0) θ ( X ( i ) ( t )) − θ ( X ( i ) ( S A ∪ B ( t ))) (cid:1) × θ ∈ ds ( X ( i ) ( t )) q − ( X ( i ) ( S A ∪ B ( t ))) w ( X ( i ) (0)) (37)Note that the lag time t in (33) and (37) need not be the same as the lag time τ used toestimate the committors q + and q − . In contrast to the role of τ , even with perfect samplingand a perfect basis, estimates of TPT quantities will depend on t . Several considerationsare involved in the choice of t . For larger values of t (33) and (37) incur signiﬁcant bias dueto poor approximation of the t → t , we found that (37) suﬀers large statistical errors. The choice of t is additionallyconstrained by the need for the change of measure at all initial time points to be used in(37), which requires that one uses the same lag time for computing both J θAB and w . A fullanalysis of the error sources is beyond the scope of this work, and in practice we choose alag time that gives reasonable results for the change of measure and reasonable smoothnessin the vector ﬁeld. In this section, we specify the computational procedure to generate and analyze the datasetfor the unfolding and folding of trp-cage. We describe preparing the system and its underlyingdynamics (Section 3.1), choosing collective variables based on their ability to distinguishmetastable states (Section 3.2), generating and validating the dataset of short trajectories(3.3), and deﬁning the unfolded and folded states (Section 3.4).

Unless otherwise noted, all molecular dynamics simulations were performed with GROMACS5.1.4 and PLUMED 2.3 using the CHARMM36m force ﬁeld in the NVT ensembleat 300 K using the Langevin thermostat with a temperature coupling constant of 10 ps − applied to all atoms, and a time step of 2 fs. Bonds to hydrogen atoms were constrainedusing the LINCS algorithm. Electrostatic interactions were computed using particle-meshEwald summation with a cutoﬀ of 1.2 nm. Lennard-Jones interactions were switched oﬀfrom 1.0 to 1.2 nm using the default GROMACS switching function.The system was prepared from an NMR structure of trp-cage (PDB code 1L2Y ). Theprotein was solvated in a 50 ˚A cubic box with the TIP3P water model using CHARMM-GUI 3.0.

10 K + and 11 Cl − ions were added, bringing the system to charge neutralityand 150 mM KCl. The energy of the system was minimized until the maximum force was20elow 1000 kJ/mol nm. The system was then equilibrated for 1 ns in the NVT ensemblewith position restraints (using a 1 fs timestep), 10 ns in the NPT ensemble with harmonicrestraints on non-hydrogen atom positions (force constant 400 kj/mol nm for backboneatoms and 40 kj/mol nm for side chain atoms.) and a Parrinello-Rahman barostat with apressure coupling constant of 5 ps − , 5 ns in the NPT ensemble without position restraints,and then 10 ns in the NVT ensemble without position restraints. The cubic box length wasdetermined from the restraint-free NPT equilibration run to be 4.48 nm and ﬁxed at thatvalue after that run. The performance of DGA rests on having a dataset with good sampling of all states thatcontribute to the reaction mechanism. As mentioned in the Introduction, the availablephysically weighted molecular dynamics data for trp-cage contain few unfolding and foldingtransitions. We thus sought to use enhanced sampling methods to generate a dataset withimproved representation outside the stable states. To this end, we evaluated CVs for theirability to control sampling and resolve the unfolded and folded states.Based on previous studies, we considered ﬁve CVs:1. The radius of gyration of the C α atoms ( R g );2. The root mean squared deviation (RMSD) of all C α atoms from their positions in anequilibrated structure (RMSD full );3. The RMSD of the C α atoms of residues 2 to 9, which make up the α helix in the nativestate (RMSD hx );4. The RMSD of the C α atoms of residues 11 to 15, which make up the 3-10 helix in thenative state (RMSD − );5. The end-to-end distance ( d ). 21 g , RMSD full , and RMSD hx were used in ref. 33, and RMSD − was used in ref. 37 (theredeﬁned only to residue 14), where they found that it was able to resolve several metastablestates identiﬁed by spectral clustering.To explore how these collective variables change as trp-cage unfolds, we ran a seriesof Adiabatic Bias Molecular Dynamics (ABMD) simulations to drive unfolding from theequilibrated native structure. ABMD uses a ratchet-and-pawl-like bias to trap spontaneousﬂuctuations that move the system forward in selected CVs. By applying ABMD with dif-ferent combinations of the CVs above, we found that RMSD full and RMSD − yielded rea-sonable control of the system and enabled exploration of all metastable states characterizedin previous studies. To initialize a dataset of short trajectories for DGA, we deﬁned a grid of 64 points in the spaceof RMSD full and RMSD − (Figure 1). We then used 64 independent ABMD simulationsto steer the system to each of these points from the ﬁnal structure from the equilibrationsimulations described in Section 3.1. We ran each ABMD simulation for 1 ns, saving thestructure every 5 ps; the force constants were 1.25 kJ/(mol ˚A ) and 1.0 kJ/(mol ˚A ) forRMSD full and RMSD − , respectively. From the set of all recorded structures, we chose the64 structures closest to the targets and equilibrated each for 1 ns with a harmonic restraintwith the same force constants as in the AMBD simulations. From each of the resultingstructures, we then launched 14 free simulations (with diﬀerent random number generatorseeds) of length 30 ns each, saving structures every 5 ps.From this dataset, we computed all possible two-dimensional potentials of mean force(PMFs) involving the CVs listed in Section 3.2. We compared these PMFs with correspond-ing ones from replica exchange umbrella sampling (REUS). Based on the DGA PMFs, weused the RMSD of the α helix (RMSD hx ), and the RMSD of the 3-10 helix (RMSD − ),and the end-to-end distance ( d ) to control the sampling. REUS window centers were placed22igure 1: Initialization points for the dataset of short trajectories. ABMD targets (symbols)are overlaid on DGA PMFs (color scale and contours, spaced every 1 k B T ) for the CVs usedfor steering. (left) The initial 64 ABMD targets were based on RMSD full and RMSD ; 14free simulations of length 30 ns were launched from each of the structures resulting fromthese ABMD simulations. (right) 64 ABMD targets in RMSD hx and end-to-end distanceadded to ensure adequate sampling of the unfolded state; 2 free simulations of length 30 nswere launched from each of the structures resulting from these ABMD simulations.on a uniform 8 × × hx ranging from 0.3 to 2.8 ˚A,RMSD − ranging 0.3 to 3.3 ˚A, and d ranging from 6 to 38 ˚A. This grid fully covered therelevant areas of CV space identiﬁed by previous simulations. The force constants for theharmonic potentials for each window were 29.2 kJ/(mol · ˚A ) for RMSD hx , 20.3 kJ/(mol · ˚A )for RMSD − , and 0.178 kJ/(mol · ˚A ) for d , following ref. 59. To initialize each window,structures were taken from the DGA database that were closest to each window center. Thebuilt-in replica exchange functionality of GROMACS was used to create a three-dimensionalreplica exchange procedure, where structures from nearby windows were periodically ex-changed. Every window was ﬁrst simulated for 100 ps, with swaps attempted betweenadjacent windows in d space (i.e., window centers with the same RMSD hx and RMSD − d values) every 10 ps. This was repeated for a total of three 100 psiterations, with the second and third iterations proposing swaps between neighboring win-dows in RMSD hx and RMSD − , respectively. This 300-ps procedure was repeated until atotal simulation time of 10 ns was reached for each window, with structures saved every 10ps. Following this protocol, structures were exchanged across all of the three-dimensionalgrid, with exchange probabilities in the range 10-60%. The PMF was constructed by usingthe Eigenvector Method for Umbrella Sampling (EMUS) extended to REUS. The REUSsimulations were run until the asymptotic variance of the PMF dropped below 0.1 ( k B T ) (Figure S1).The REUS PMFs suggested that the initial DGA dataset did not adequately sampleconﬁgurations with RMSD hx > . > . k B T . Therefore, we selected 64 morepoints from a grid with RMSD hx > . hx the most, but other PMFs were also noticeably improved. Thedataset used for all further DGA calculations thus contains a total of 1024 trajectories, eachof length 30 ns, with structures saved every 5 ps. We found that PMFs projected onto only global measures of unfolding (RMSD full , R g , and d ) did not have clearly identiﬁable unfolded basins (Figures 1 and 2). By contrast, the PMFon the CVs tracking secondary structure (RMSD hx and RMSD − ) had clearly identiﬁableunfolded and folded basins, as well as several intermediates. Based on this analysis, we took24igure 2: PMFs for the indicated CVs. Results shown are computed by DGA with themodiﬁed distance basis set and a lag time of 0.5 ns. We use a 50 ×

50 grid to compute eachPMF. Similar results are obtained with other basis sets and REUS; see Figures S4, S5, andS6.the unfolded state to be | RMSD hx − .

15 ˚A | .

008 ˚A + | RMSD − . | .

125 ˚A < . (38)The folded state is(RMSD hx − . . + (RMSD − . .

04 ˚A < d <

17 ˚A . (39)25e included the end-to-end distance constraint on the folded state to exclude structureswhich are extended but have the secondary structure intact.Figure 3: Top nontrivial TICA eigenvector averaged on the RMSD hx and RMSD CVswith physical weighting. The unfolded and folded states are indicated in yellow with repre-sentative structures. Intermediate states in Table 1 are marked and labeled.Heterogeneous structures contribute to the unfolded state, making it challenging to deﬁne,and there is no guarantee that the choices above are optimal in any sense. Because we expectunfolding and folding to be among the slowest motions of the system, an alternative would beto deﬁne the states in terms of the slowest mode of the system identiﬁed by a dimensionality-reduction algorithm. However, data-driven state deﬁnitions are often diﬃcult to interpretphysically, despite their theoretical justiﬁcations. Furthermore, data-driven state deﬁnitionscan be diﬃcult to incorporate into sampling algorithms. We thus use physical CVs forpath sampling, stratiﬁcation, and state deﬁnitions, and we then check for consistency witha data-driven state choice.Figure 3 shows that the slowest mode of the system identiﬁed by TICA applied to theDGA dataset correlates with the PMF and switches between low and high values in going26etween the unfolded and folded states. Here and going forward, all functions we projectonto CVs are conditional averages of the form R f ( x ) δ ( θ ( x ) − s ) π ( dx ) / R δ ( θ ( x ) − s ) π ( dx ).We estimate these by binning our CV space into bins, and for each bin ds , plotting: R f ( x ) δ ( θ ( x ) − s ) π ( dx ) R δ ( θ ( x ) − s ) π ( dx ) ≈ P i f ( X ( i ) (0) w ( X ( i ) (0)) θ ∈ ds (( X ( i ) (0)) P i w ( X ( i ) (0)) θ ∈ ds (( X ( i ) (0)) . (40)We furthermore show in Section 4.2 that this mode correlates with the committor. Wethus feel that RMSD hx and RMSD − enable the clearest two-dimensional projection of thereaction and present most of our results in terms of these CVs. In addition to the unfoldedand folded states, we deﬁne four intermediate states U1, U2, L1, and L2 shown on Figure 3.In the next section, we apply our DGA and TPT formalism to show that trp-cage can foldalong an upper path through intermediates U1 and U2, or a lower path through L1 and L2. In this section, we evaluate how the three basis sets described in Section 2.2.3 (indicatorfunctions of pairwise distances, indicator functions of TICA coordinates, and pairwise dis-tances modiﬁed to satisfy the boundary conditions) impact the performance of DGA forestimating PMFs, rates, committors, and reactive currents for the unfolding and folding ofthe trp-cage miniprotein. Where possible, we compare our results with references obtainedby independent means.

Figure 2 shows PMFs computed on each pair of the physically motivated CVs with DGAwith the modiﬁed distance basis set. The corresponding PMFs from REUS are shown inFigure S2; diﬀerence maps comparing the results obtained with the two methods and threebasis sets are shown in Figures S4, S5, and S6. All of the main basins identiﬁed by REUSare present in the DGA PMFs, and there is good quantitative agreement between REUS27nd DGA, with RMSDs of < k B T for all three basis sets (that said, of these, the distanceindicator basis set results in the largest deviations). Consistent with their agreement with theREUS PMF, the three DGA PMFs are in agreement with each other. We did observe thatREUS tends to give slightly ﬂatter PMFs than DGA with all three basis sets. In principle,there are two sources of error in the DGA PMFs: (i) approximation error from representingthe true change of measure with a basis expansion and (ii) estimation (sampling) error.Analysis of error in DGA will be the subject of future work. Error in US is discussed in refs.61, 62, and 63. Table 1: CV values for metastable states.State RMSD full / ˚A RMSD hx / ˚A d/ ˚A RMSD / ˚A R g / ˚AFolded 1.1 0.30 11.1 0.30 7.0Unfolded 5.8 2.1 20.2 2.8 9.2U1 2.4 0.34 13.1 1.2 7.3U2 5.2 0.34 19.3 2.8 8.8L1 2.2 1.2 9.5 0.30 7.2L2 2.6 1.9 14.5 0.30 7.3We found that the projection onto the RMSD hx and RMSD coordinates was best ableto separate the pathways and states of interest, so we now focus on this projection. Figure 3indicates the folded (lower left) and unfolded (upper right) basins, as well four intermediates.The intermediates deﬁne two pathways, which we label upper (with intermediates U1 andU2) and lower (with intermediates L1 and L2). Table 1 gives the ﬁve CV values for each ofthe six states.To understand the characteristics of the intermediate states, we turn to Figure 4, whichshows the solvent-accessible surface area (SASA) of trp-6 on the left, and pro-12 on theright. We ﬁnd that the U2 intermediate state is characterized by partial solvation of thehydrophobic core, measured by the SASA of trp-6, and nearly full detachment of pro-12.Furthermore, the U2 state is signiﬁcantly more extended than the lower pathway intermedi-ates as measured both by R g and end-to-end distance. In addition to being more compact,with near-native R g values, L1 and L2 have near-native trp-6 and pro-12 SASA values, sug-28igure 4: Equilibrium average solvent accessible surface area (SASA) projected onto theRMSD hx and RMSD CVs for (left) trp-6 and (right) proline-12.gesting the hydrophobic core is fully formed. These intermediate states can be mapped tothose previously reported in the literature. Bolhuis and Jurazek identiﬁed three foldingintermediates. Our U1 and U2 intermediates roughly map onto their P d and I intermediates,and our L1 and L2 intermediates roughly map onto their L intermediate. U1 and U2 alsocorrespond to states S and S identiﬁed by Sidky et al. We next calculated both forward and backward committors using DGA with the three basissets and lag times ranging from 0.5 ns to 12 ns (Figure 5 and Figure S7). As they should,the backward committors mirror the forward committors, so we focus our discussion on thelatter. The timescale of trp-cage folding is on the order of 5 µ s from both experiment andsimulation, thus both our trajectory lengths (30 ns) and lag times are several orders ofmagnitude shorter than the motions of interest, providing an appropriate setting in which29e expect DGA to show beneﬁts.Figure 5: DGA forward committors. Left, middle, and right columns are computed withthe modiﬁed distance, distance indicator, and TICA indicator basis sets, respectively. Top,middle, and bottom rows are computed with lag times of 0.5, 2.5, and 7.5 ns, respectively.In contrast to the PMFs, we found the committors to be sensitive to the choice of basisset (and associated guess function). The modiﬁed distance basis set, in addition to beingsubstantially faster to construct as it avoids slow and unstable high-dimensional clustering,is less prone to discontinuities at the boundary than the distance indicator function basisset. The TICA indicator function basis set performs similarly to the modiﬁed distancebasis set and has the advantage over the distance indicator basis set that clustering on thelower-dimensional subspace is signiﬁcantly faster and more stable.30or a given basis set, we found relatively little variation in the committors across lag times.This is in contrast to variational approach for conformational dynamics (VAC) algorithm,where the results can strongly depend on the lag time (although this can be mitigated byusing multiple lag times ). We postpone a full investigation of DGA’s error properties, andin particular its dependence on the choice of lag time, to future work.Because we expect unfolding and folding to be among the slowest motions of the system,we can validate the DGA committors by comparing them with the slowest mode of the sys-tem identiﬁed by TICA. Comparing Figures 3 and 5 shows that the largest TICA eigenvector(estimated with a lag time of 0.5 ns) correlates almost perfectly with the estimated com-mittors obtained with the modiﬁed distance basis set, when projected onto RMSD hx andRMSD − . The agreement between these two independent calculations furthermore sug-gests that the physically motivated CVs capture the behavior detected by the data-drivenmethod. In this projection, we see that the transition states fall where the SASA of trp-6(Figure 4) changes rapidly.As an additional validation, we used DGA with the modiﬁed distance basis set and a lagtime of 0.5 ns (Figure 6) to compute committors on the CVs used by Juraszek and Bolhuis. When projected onto RMSD and RMSD hx , the positions of the transition states in Figure 4 ofref. 32 fall in areas estimated to have q + = 0 . We computed reactive currents for the three basis sets using the estimator in (37) and thecommittors from the the previous section (Figure 7). For this calculation, we use the shortestlag time of 0.5 ns for both the committor and reactive current, though in principle they could31igure 6: Forward committor (left) and reactive current (right) projected onto the RMSD hx and full RMSD CVs used in ref. 32. Results shown are computed with the modiﬁed distancebasis set and a lag time of 0.5 ns.be chosen separately. As previously, we primarily present our results projected onto RMSD hx and RMSD − . Overall the results for the three basis sets are similar, though the distanceindicator basis set exhibits greater noise around (RMSD hx , RMSD − ) = (1.3 ˚A, 1.3 ˚A),consistent with the plateau in the committor in this region.The currents, which provide information directly about dynamics, conﬁrm the presenceof two paths for the folding process: an upper path with formation of the α helix prior toformation of the 3-10 helix, and a lower path with the order of these events transposed.The upper path proceeds through intermediates U1 and U2, with folding beginning withformation of the α helix and partial desolvation of trp-6, followed by full formation of the3-10 helix. The lower path proceeds through L1 and L2, with folding beginning with collapseinto the L2 intermediate with no α helix, but the hydrophobic core fully formed, followed byformation of the α helix. Both of these paths correspond to troughs in the PMFs on theseCVs. 32igure 7: Folding (top) and unfolding (bottom) reactive current projected onto the RMSD hx and RMSD CVs using (37) with the three choices of basis set. Left, middle, and rightcolumns are computed with the modiﬁed distance, distance indicator, and TICA indicatorbasis sets, respectively. All computations use a lag time of 0.5 ns.Previous studies have found multiple pathways resembling the ones we ﬁnd here. Kim etal. used diﬀusion maps to identify two pathways: one with tertiary contacts forming ﬁrst,followed by α helix formation, and another with the order transposed. Jurazek and Bolhiuscame to similar conclusions using transition path sampling. An advantage of the reactive current is that we can use it to assign weights to the twopaths. By computing the relative ﬂux crossing RMSD = 1 . hx < . hx > . α helix, and then the 3-10 helix and hydrophobiccore (i.e., the upper pathway). Although we are not aware of a previous estimate of thereactive current for this system, we can compare these numbers to the frequencies withwhich transition path sampling sampled the pathways in ref. 32. There, Juraszek and Bolhuisobserved the pathway in which tertiary contacts form ﬁrst (i.e., the lower pathway) 80% of33he time. The diﬀerence may be due to diﬀerent CV and state deﬁnitions (Jurazek andBolhuis used 5 CVs in their state deﬁnitions , whereas we consider only RMSD − andRMSD hx ) or force ﬁeld and setup diﬀerences. Finally, we computed rates using the estimator in (31). We present our results as inverserates (unfolding and folding times) to make comparisons to lag times and trajectory lengthsclear. As mentioned previously, these times are expected to be on the order of microseconds.In particular, Juraszek and Bolhuis used transition interface sampling to estimate inverseunfolding and folding rates of 1.2 µ s and 0.4 µ s, though as noted previously those resultsare for a diﬀerent model.Figure 8: Inverse rates estimated for folding (left) and unfolding (right).All three basis sets gave rate estimates that were within an order of magnitude of thosenumbers. However, the results for the distance indicator basis were markedly faster. Fur-thermore, in all three cases, the inverse rate exhibited signiﬁcant dependence on lag time.34ur analysis of the trajectory of the K8A mutant suggests the need for a lag time of at least100 ns (consistent with ref. 37), though as discussed in the Introduction, these data do notcontain a suﬃcient number of unfolding and folding events to obtain accurate rate estimates.Juxtaposed with the lack of sensitivity to lag time for the committor and reactive current,these observations suggest that DGA’s strength is in its ability to give statistical insight intomechanisms with relatively little data, but that rates may be more eﬃciently computed bymethods that directly sample relevant statistics such as stratiﬁcation schemes. As described in Section 2.2.4, delay embedding can be used to construct an approximatelyMarkovian process when the feature space does not fully capture the dynamics. To illustratethis idea using our trp-cage dataset, we restrict the feature space to the ﬁve physical CVsand apply DGA with the modiﬁed distance basis set on either the feature space itself or thedelay-embedded feature space. Figure 9 shows the reactive currents and committors resultingfrom DGA on these two spaces. We ﬁnd that the committor and current constructed fromthe delay embedded representation largely agree with the DGA result constructed on the153 pairwise distances. Without delay embedding, we ﬁnd several qualitative disagreements,in particular the U2 state has a committor value close to zero, and the reactive current doesnot resolve the two pathways since many of the arrows point directly towards the foldedstate. 35igure 9: Comparison of DGA estimates for the forward committor (top) and reactivecurrent for folding (bottom) with the modiﬁed distance basis set on a feature space restrictedto the ﬁve physical CVs (right) and a delay-embedded feature space (left). The delay-embedded results are obtained with a delay of δ = 0 .

125 ns, N = 40 images, and a DGA lagtime of 0.5 ns. In this paper, we have cast the dynamical Galerkin approximation (DGA) for computingchemical kinetic statistics from short trajectories in terms of the stopped transition operator.This formulation can be immediately translated into expressions that can be applied tosimulation data. It also clariﬁes the role of the lag time, showing that estimates of conditionalexpectations computed by DGA are exact in the inﬁnite basis and data limit, independentof the choice of lag time.To evaluate DGA’s performance, we generated and carefully validated a dataset of shorttrajectories for the unfolding and folding of the trp-cage miniprotein, a well-characterizedsystem. We used umbrella sampling to validate our short trajectory dataset by comparingthe resulting PMFs. Quantitative agreement between the PMFs was observed, suggesting36hat our short trajectory dataset had suﬃcient sampling to compute dynamical statistics.The PMF calculations furthermore enabled us to rapidly assess diﬀerent combinations ofCVs for their abilities to separate metastable states. The α helix RMSD and 3-10 helixRMSD in particular allowed us to resolve intermediates to a greater degree than found inprevious studies.We next applied DGA to compute forward and backward committors between the un-folded and folded states. We evaluated a number of competing estimators for the backwardcommittor and found that one based on forward trajectories weighted by the stationarydistribution gave the best results. The committors by themselves are not able to identifyreaction pathways or transition states, but they can be combined according to transitionpath theory to extract this information. Speciﬁcally, we introduce a new estimator for theTPT rate, and a projection formula and corresponding estimator for the reactive currentin a CV space. Our projected reactive current allows us to easily resolve and visualize thepathways that the system takes in arbitrary CV spaces, and even lets us assign relativeweights to these pathways. Acquiring this kind of mechanistic information has previouslybeen possible only through transition path sampling and related methods; such methods donot as readily allow exploration of CVs and state deﬁnitions because the sampling is linkeddirectly to them.We introduced a simple procedure that takes an arbitrary set of molecular features andadapts them to produce a basis set that satisﬁes the homogeneous boundary conditions.Using pairwise distances as the molecular features, we compared the performance of such abasis set with indicator functions on the molecular features and indicator functions on TICAcoordinates. Other basis constructions such as diﬀusion maps and radial basis functions arepossible, and we expect that the best choice will be system dependent. We applied our DGAand TPT formalism to our dataset, and identiﬁed intermediate states and pathways whichhave been previously reported in the literature, providing further validation of our methods.We found that the estimates of the TPT rate, while on the same order of magnitude as37revious estimates, nevertheless show signiﬁcant dependence on lag time. Finally, we showedthat delay embedding can be an eﬀective strategy for constructing a molecular representationwith approximately Markovian dynamics from a low-dimensional feature space.Our results suggest several interesting directions for future investigation. We have seenthat in our trp-cage application the choice of lag time has only a modest eﬀect on DGAestimates of conditional expectations, while TPT quantities, and in particular the rate de-pend sensitively on lag time. Recently, we showed that integrating over lag times for VACimproves the robustness of that method. It will be interesting to see if an analogous strat-egy can improve rate estimates from DGA. An in depth mathematical study of DGA’s errorand its dependence on lag time along the lines of our previous analysis of VAC is also inorder. Though DGA has performed well in our tests so far, looking ahead to larger and morecomplex systems, it may become necessary move away from a Galerkin approach and towardmore ﬂexible representations of the kinetic functions we seek to approximate. This would beconsistent with the current trend toward using neural networks to represent eigenfunctionsin spectral estimation. Introducing this higher level of representational ﬂexibility whilemaintaining the reliability we observe in our trp-cage application of DGA will be a challenge.

Acknowledgement

We thank Erik Thiede and Justin Finkel for their critical readings of the manuscript andhelpful feedback as well as D. E. Shaw Research for making available the K8A mutanttrajectory. We also than Robert Webber for helpful conversations. Research reported inthis publication was supported by the National Institute of General Medical Sciences ofthe National Institutes of Health under award number R35GM136381. Simulations wereperformed on resources from the Research Computing Center at the University of Chicago.38 upporting Information Available

The following ﬁles are available free of charge. Mathematical appendices, asymptoticvariances for REUS, and diﬀerence maps between DGA and REUS PMFs.

References (1) Huber, G. A.; Kim, S. Weighted-ensemble Brownian dynamics simulations for proteinassociation reactions.

Biophysical Journal , , 97–110.(2) Dickson, A.; Dinner, A. R. Enhanced sampling of nonequilibrium steady states. Annualreview of physical chemistry , , 441–459.(3) Guttenberg, N.; Dinner, A. R.; Weare, J. Steered transition path sampling. The Journalof Chemical Physics , , 234103, Publisher: American Institute of Physics.(4) Aristoﬀ, D.; Bello-Rivas, J. M.; Elber, R. A Mathematical Framework for Exact Mile-stoning. Multiscale Modeling & Simulation , , 301–322, Publisher: Society forIndustrial and Applied Mathematics.(5) Dinner, A. R.; Mattingly, J. C.; Tempkin, J. O. B.; Koten, B. V.; Weare, J. TrajectoryStratiﬁcation of Stochastic Dynamics. SIAM Review , , 909–938, Publisher:Society for Industrial and Applied Mathematics.(6) Allen, R. J.; Frenkel, D.; ten Wolde, P. R. Simulating rare events in equilibrium ornonequilibrium stochastic systems. The Journal of chemical physics , , 024102.(7) Thiede, E. H.; Giannakis, D.; Dinner, A. R.; Weare, J. Galerkin approximation ofdynamical quantities using trajectory data. The Journal of Chemical Physics , , 244111.(8) Schmid, P. J. Dynamic mode decomposition of numerical and experimental data. Jour-nal of Fluid Mechanics , , 5–28, Publisher: Cambridge University Press.399) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.;Sch¨utte, C.; No´e, F. Markov models of molecular kinetics: Generation and validation. The Journal of chemical physics , , 174105.(10) Nske, F.; Keller, B. G.; Prez-Hernndez, G.; Mey, A. S. J. S.; No, F. Variational Approachto Molecular Kinetics. Journal of Chemical Theory and Computation , , 1739–1752.(11) Williams, M. O.; Kevrekidis, I. G.; Rowley, C. W. A DataDriven Approximation of theKoopman Operator: Extending Dynamic Mode Decomposition. Journal of NonlinearScience , , 1307–1346.(12) Husic, B. E.; Pande, V. S. Markov State Models: From an Art to a Science. Journal ofthe American Chemical Society , , 2386–2396, Publisher: American ChemicalSociety.(13) Klus, S.; N¨uske, F.; Koltai, P.; Wu, H.; Kevrekidis, I.; Sch¨utte, C.; No´e, F. Data-drivenmodel reduction and transfer operator approximation. Journal of Nonlinear Science , , 985–1010.(14) Pande, V. S.; Beauchamp, K.; Bowman, G. R. Everything you wanted to know aboutMarkov State Models but were afraid to ask. Methods (San Diego, Calif.) , ,99–105.(15) Eisner, T.; Farkas, B.; Haase, M.; Nagel, R. Operator Theoretic Aspects of ErgodicTheory ; Springer, 2015; Vol. 272.(16) Wu, H.; Nske, F.; Paul, F.; Klus, S.; Koltai, P.; No, F. Variational Koopman models:slow collective variables and molecular kinetics from short oﬀ-equilibrium simulations.

The Journal of Chemical Physics , , 154104, arXiv: 1610.06773.4017) Mardt, A.; Pasquali, L.; Wu, H.; No, F. VAMPnets for deep learning of molecular kinet-ics. Nature Communications , , 1–11, Number: 1 Publisher: Nature PublishingGroup.(18) N¨uske, F.; Schneider, R.; Vitalini, F.; No´e, F. Variational tensor approach for approx-imating the rare-event kinetics of macromolecular systems. The Journal of ChemicalPhysics , , 054105.(19) Prinz, J.-H.; Chodera, J. D.; No´e, F. Spectral rate theory for two-state kinetics. PhysicalReview X , , 011020.(20) Del Moral, P. Feynman-Kac Formulae ; Springer, 2004.(21) Karatzas, I.; Shreve, S.

Brownian Motion and Stochastic Calculus ; Springer Science &Business Media, 2012; Vol. 113.(22) Vanden-Eijnden, E.

Computer Simulations in Condensed Matter Systems: From Mate-rials to Chemical Biology Volume 1 ; Springer, 2006; pp 453–493.(23) E., W.; Vanden-Eijnden, E. Towards a Theory of Transition Paths.

Journal of StatisticalPhysics , , 503.(24) Metzner, P.; Schtte, C.; Vanden-Eijnden, E. Transition Path Theory for Markov JumpProcesses. Multiscale Modeling & Simulation , , 1192–1219, Publisher: Societyfor Industrial and Applied Mathematics.(25) No´e, F.; Sch¨utte, C.; Vanden-Eijnden, E.; Reich, L.; Weikl, T. R. Constructing theequilibrium ensemble of folding pathways from short oﬀ-equilibrium simulations. Pro-ceedings of the National Academy of Sciences , , 19011–19016.(26) Liu, Y.; Hickey, D. P.; Minteer, S. D.; Dickson, A.; Calabrese Barton, S. Markov-State Transition Path Analysis of Electrostatic Channeling. The Journal of PhysicalChemistry C , , 15284–15292, Publisher: American Chemical Society.4127) Meng, Y.; Shukla, D.; Pande, V. S.; Roux, B. Transition path theory analysis of c-Srckinase activation. Proceedings of the National Academy of Sciences , , 9193–9198.(28) Coifman, R. R.; Lafon, S. Diﬀusion maps. Applied and Computational Harmonic Anal-ysis , , 5–30.(29) Mller, K.; Brown, L. D. Location of saddle points and minimum energy paths by aconstrained simplex optimization procedure. Theoretica chimica acta , , 75–93.(30) Shaw, D. E.; Maragakis, P.; Lindorﬀ-Larsen, K.; Piana, S.; Dror, R. O.; East-wood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y.; Wriggers, W.Atomic-Level Characterization of the Structural Dynamics of Proteins. Science , , 341–346, Publisher: American Association for the Advancement of Science Section:Research Article.(31) Qiu, L.; Pabit, S. A.; Roitberg, A. E.; Hagen, S. J. Smaller and Faster: The 20-ResidueTrp-Cage Protein Folds in 4 s. Journal of the American Chemical Society , ,12952–12953.(32) Juraszek, J.; Bolhuis, P. G. Sampling the multiple folding mechanisms of Trp-cage inexplicit solvent. Proceedings of the National Academy of Sciences , , 15859–15864, Publisher: National Academy of Sciences Section: Biological Sciences.(33) Juraszek, J.; Bolhuis, P. G. Rate Constant and Reaction Coordinate of Trp-Cage Fold-ing in Explicit Water. Biophysical Journal , , 4246–4257, Publisher: Elsevier.(34) Marinelli, F.; Pietrucci, F.; Laio, A.; Piana, S. A kinetic model of trp-cage folding frommultiple biased molecular dynamics simulations. PLoS Comput Biol , , e1000452.(35) Lindorﬀ-Larsen, K.; Piana, S.; Dror, R. O.; Shaw, D. E. How Fast-Folding ProteinsFold | Science.

Science , , 517–520.4236) Deng, N.-j.; Dai, W.; Levy, R. M. How Kinetics within the Unfolded State AﬀectsProtein Folding: An Analysis Based on Markov State Models and an Ultra-Long MDTrajectory. The Journal of Physical Chemistry B , , 12787–12799, Publisher:American Chemical Society.(37) Sidky, H.; Chen, W.; Ferguson, A. L. High-Resolution Markov State Models for theDynamics of Trp-Cage Miniprotein Constructed Over Slow Folding Modes Identiﬁedby State-Free Reversible VAMPnets. The Journal of Physical Chemistry B , ,7999–8009.(38) Molgedey, L.; Schuster, H. G. Separation of a mixture of independent signals using timedelayed correlations. Physical review letters , , 3634.(39) P´erez-Hern´andez, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; No´e, F. Identiﬁcationof slow molecular order parameters for Markov model construction. The Journal ofchemical physics , , 07B604 1.(40) Schwantes, C. R.; Pande, V. S. Improvements in Markov state model construction revealmany non-native interactions in the folding of NTL9. Journal of chemical theory andcomputation , , 2000–2009.(41) Webber, R. J.; Thiede, E. H.; Dow, D.; Dinner, A. R.; Weare, J. Error bounds fordynamical spectral estimation. arXiv:2005.02248 [physics] , arXiv: 2005.02248.(42) Sch¨utte, C.; No´e, F.; Lu, J.; Sarich, M.; Vanden-Eijnden, E. Markov state models basedon milestoning. The Journal of chemical physics , , 05B609.(43) Vitalini, F.; No´e, F.; Keller, B. A basis set for peptides for the variational approach toconformational kinetics. Journal of chemical theory and computation , , 3992–4004. 4344) Boninsegna, L.; Gobbo, G.; No´e, F.; Clementi, C. Investigating molecular kinetics byvariationally optimized diﬀusion maps. Journal of chemical theory and computation , , 5947–5960.(45) Weber, M.; Fackeldey, K.; Sch¨utte, C. Set-free Markov state model building. The Jour-nal of Chemical Physics , , 124133.(46) Abraham, M. J.; Murtola, T.; Schulz, R.; P´all, S.; Smith, J. C.; Hess, B.; Lindah, E.Gromacs: High performance molecular simulations through multi-level parallelism fromlaptops to supercomputers. SoftwareX , , 19–25.(47) Bonomi, M.; Branduardi, D.; Bussi, G.; Camilloni, C.; Provasi, D.; Raiteri, P.; Dona-dio, D.; Marinelli, F.; Pietrucci, F.; Broglia, R. A.; Parrinello, M. PLUMED: A portableplugin for free-energy calculations with molecular dynamics. Computer Physics Com-munications , , 1961–1972.(48) Tribello, G. A.; Bonomi, M.; Branduardi, D.; Camilloni, C.; Bussi, G. PLUMED 2:New feathers for an old bird. Computer Physics Communications , , 604–613.(49) PLUMED, Promoting transparency and reproducibility in enhanced molecular simula-tions. Nature Methods , , 670–673.(50) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.;Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.;Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prod-hom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.;Watanabe, M.; Wi´orkiewicz-Kuczera, J.; Yin, D.; Karplus, M. All-Atom EmpiricalPotential for Molecular Modeling and Dynamics Studies of Proteins . The Journal ofPhysical Chemistry B , , 3586–3616.(51) Best, R. B.; Zhu, X.; Shim, J.; Lopes, P. E. M.; Mittal, J.; Feig, M.; MacKerell, A. D.Optimization of the Additive CHARMM All-Atom Protein Force Field Targeting Im-44roved Sampling of the Backbone φ , ψ and Side-Chain χ χ Journal of Chemical Theory and Computation , , 3257–3273.(52) Huang, J.; Rauscher, S.; Nawrocki, G.; Ran, T.; Feig, M.; De Groot, B. L.;Grubm¨uller, H.; MacKerell, A. D. CHARMM36m: An improved force ﬁeld for foldedand intrinsically disordered proteins. Nature Methods , , 71–73.(53) Hess, B.; Bekker, H.; Berendsen, H. J. C.; Fraaije, J. G. E. M. LINCS: A linear con-straint solver for molecular simulations. Journal of Computational Chemistry , ,1463–1472.(54) Neidigh, J. W.; Fesinmeyer, R. M.; Andersen, N. H. Designing a 20-residue protein. Nature Structural Biology , , 425–430.(55) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L.Comparison of simple potential functions for simulating liquid water. The Journal ofChemical Physics , , 926–935.(56) Jo, S.; Kim, T.; Iyer, V. G.; Im, W. CHARMM-GUI: A web-based graphical userinterface for CHARMM. Journal of Computational Chemistry , , 1859–1865.(57) Lee, J.; Cheng, X.; Swails, J. M.; Yeom, M. S.; Eastman, P. K.; Lemkul, J. A.; Wei, S.;Buckner, J.; Jeong, J. C.; Qi, Y.; Jo, S.; Pande, V. S.; Case, D. A.; Brooks, C. L.;MacKerell, A. D.; Klauda, J. B.; Im, W. CHARMM-GUI Input Generator for NAMD,GROMACS, AMBER, OpenMM, and CHARMM/OpenMM Simulations Using theCHARMM36 Additive Force Field. Journal of Chemical Theory and Computation , , 405–413.(58) Marchi, M.; Ballone, P. Adiabatic bias molecular dynamics: A method to navigatethe conformational space of complex molecular systems. Journal of Chemical Physics , , 3697–3702. 4559) Park, S.; Kim, T.; Im, W. Transmembrane Helix Assembly by Window Exchange Um-brella Sampling. Physical Review Letters , , 108102.(60) Sugita, Y.; Kitao, A.; Okamoto, Y. Multidimensional replica-exchange method for free-energy calculations. Journal of Chemical Physics , , 6042–6051.(61) Thiede, E. H.; Van Koten, B.; Weare, J.; Dinner, A. R. Eigenvector method for umbrellasampling enables error analysis. Journal of Chemical Physics , .(62) Antoszewski, A.; Feng, C.-J.; Vani, B. P.; Thiede, E. H.; Hong, L.; Weare, J.; Tok-makoﬀ, A.; Dinner, A. R. Insulin Dissociates by Diverse Mechanisms of Coupled Un-folding and Unbinding. The Journal of Physical Chemistry B , , 5571–5587.(63) Dinner, A. R.; Thiede, E. H.; Koten, B. V.; Weare, J. Stratiﬁcation as a GeneralVariance Reduction Method for Markov Chain Monte Carlo. SIAM/ASA Journal onUncertainty Quantiﬁcation , 1139–1188, Publisher: Society for Industrial andApplied Mathematics.(64) Lorpaiboon, C.; Thiede, E. H.; Webber, R. J.; Weare, J.; Dinner, A. R. Inte-grated VAC: A robust strategy for identifying eigenfunctions of dynamical operators. arXiv:2007.08027 [physics] , arXiv: 2007.08027.(65) Kim, S. B.; Dsilva, C. J.; Kevrekidis, I. G.; Debenedetti, P. G. Systematic charac-terization of protein folding pathways using diﬀusion maps: Application to Trp-cageminiprotein.

The Journal of Chemical Physics , , 085101, Publisher: AmericanInstitute of Physics. 46 raphical TOC Entry upporting Information forLong time-scale predictions from short-trajectorydata: A benchmark analysis of the trp-cagemini-protein John Strahan, † Adam Antoszewski, † Chatipat Lorpaiboon, † Bodhi P. Vani, † Jonathan Weare, ∗ , ‡ and Aaron R. Dinner ∗ , † † Department of Chemistry, University of Chicago, Chicago, IL 60637, USA ‡ Courant Institute of Mathematical Sciences, New York University, New York, New York10012, USA

E-mail: [email protected]; [email protected]

A Backward-in-time inner products

In this appendix we provide an elementary derivation of (24) which is key to our estimatesof inner products involving the stopped backward-in-time transition operator T tA ∪ B . For thepurposes of this derivation we assume that X ( t ) is a discrete time process (so that t is a non-negative integer) with probability density p ( x (1) | x (0)) for transition from x (0) to x (1) andstationary density π . The steady state backward-in-time process X ( − t ) then has transitiondensity q ( x (0) | x (1)) = p ( x (1) | x (0)) π ( x (0)) π ( x (1)) . (S1)1 a r X i v : . [ phy s i c s . d a t a - a n ] S e p rom this expression we immediately ﬁnd that π ( x ( t )) q ( x ( t − | x ( t )) q ( x ( t − | x ( t − · · · q ( x (0) | x (1))= π ( x (0)) p ( x ( t ) | x ( t − p ( x ( t − | x ( t − · · · p ( x (1) | x (0)) (S2)relating the steady state backward-in-time path density to the steady state forward-in-timepath density. Therefore, for any path function F ( x (0) , x (1) , . . . , x ( t )) and any density µ (equivalent to π ) we ﬁnd (recalling that here w = π/µ ) that Z E [ F ( X (0) , X ( − , . . . , X ( − t )) | X (0) = x ] µ ( x ) dx = Z F ( x (0) , . . . , x ( − t )) w ( x (0)) π ( x (0)) q ( x ( − | x (0)) · · · q ( x ( − t ) | x ( − t + 1)) dx (0) · · · dx ( − t )= Z F ( x ( t ) , . . . , x (0)) w ( x ( t )) π ( x ( t )) q ( x ( t − | x ( t )) · · · q ( x (0) | x (1)) dx ( t ) · · · dx (0)= Z E (cid:20) F ( X ( t ) , X ( t − , . . . , X (0)) w ( X ( t )) (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) w ( x ) µ ( x ) dx. (S3)We will use (S3) to ﬁnd an expression for h g, T − tA ∪ B f i = Z E (cid:2) g ( x ) f ( X ( − T − A ∪ B ∧ t )) | X (0) = x (cid:3) µ ( x ) dx (S4)in terms of the forward-in-time process. If we choose F ( X (0) , X ( − , . . . , X ( − t )) = g ( X (0)) f ( X ( − T − A ∪ B ∧ t )) , (S5)then h g, T − tA ∪ B f i = Z E [ F ( X (0) , X ( − , . . . , X ( − t )) | X (0) = x ] µ ( x ) dx. (S6)In terms of the forward process F ( X ( t ) , X ( t − , . . . , X (0)) = f ( X ( S A ∪ B ( t ))) g ( X ( t )) , (S7)2here we remind the reader of (25): S A ∪ B ( t ) = sup { s ≤ t : X ( s ) ∈ A ∪ B } . Applying (S3) with this choice of F yields (24): (cid:10) g, T − tA ∪ B f (cid:11) = Z E (cid:20) f ( X ( S A ∪ B ( t ))) g ( X ( t )) w ( X ( t )) (cid:12)(cid:12)(cid:12)(cid:12) X (0) = x (cid:21) w ( x ) µ ( dx ) . B A formula for the reactive current

It has been shown that for a diﬀusion with generator L f ( x ) = X j b j ( x ) ∂f∂x j ( x ) + 12 X ij a ij ( x ) ∂ f∂x i ∂x j ( x ) (S8)the reactive current is the vector ﬁeld given by( J AB ) i = q + ( x ) q − ( x ) J i + π ( x ) q − ( x ) X j a ij ( x ) ∂q + ∂x j ( x ) − π ( x ) q + ( x ) X j a ij ( x ) ∂q − ∂x j ( x ) , (S9)where J is the equilibrium current: J i = π ( x ) b i ( x ) − X j ∂ [ πa ij ] ∂x j ( x ) . (S10)To project the current onto a CV space of interest, we take the dot product with ∇ θ for anysmooth CV θ and, using the identity J i · ∇ f ( x ) = π ( x )2 (cid:0) L f ( x ) − L † π f ( x ) (cid:1) , (S11)3hich follows from direct manipulations, we can write J AB · ∇ θ ( x ) = π ( x )2 (cid:0) q − ( x ) L [ q + θ ]( x ) − q + ( x ) L † π [ q − θ ]( x ) (cid:1) . (S12)This formula is not useful computationally since it still contains a backward-in-time genera-tor. To compute statistics from data, we need to formulate their estimators as expectationsagainst the stationary distribution since this (1) permits the use of the adjoint relation toclear away backward transition operators and (2) is consistent with our reweighting scheme.To this end, we deﬁne the projected reactive current as J θAB ( s ) = Z J AB ( x ) · ∇ θ ( x ) δ ( θ ( x ) − s ) dx = lim | ds |→ | ds | Z { θ ( x ) ∈ ds } J AB ( x ) · ∇ θ ( x ) dx, (S13)where ds is an inﬁnitesimal region of CV space with s ∈ ds , and { x : θ ( x ) ∈ ds } does notintersect A ∪ B. Using (S12) and the fact that L q + = 0 and L † π q − = 0 on ( A ∪ B ) c , we have J θAB ( s ) = lim | ds |→ | ds | Z { θ ( x ) ∈ ds } π ( x )2 (cid:0) q − ( x ) L [ q + θ ]( x ) − q + ( x ) L † π [ q − θ ]( x ) (cid:1) dx = lim | ds |→ | ds | Z { θ ( x ) ∈ ds } π ( x )2 (cid:16) q − ( x ) L [ q + θ ]( x ) − q − ( x ) L q + ( x ) θ ( x ) − q + ( x ) L † π [ q − θ ]( x ) + q + ( x ) L † π q − ( x ) θ ( x ) (cid:17) dx. (S14)Writing this expression in terms of the transition operator and canceling terms, we ﬁnd that J θAB ( s ) = lim t, | ds |→ t | ds | Z { θ ( x ) ∈ ds } π ( x ) (cid:16) q − ( x ) T t [ q + θ ]( x ) − q − ( x ) T t q + ( x ) θ ( x ) − q + ( x )( T t ) † π [ q − θ ]( x ) + q + ( x )( T t ) † π q − ( x ) θ ( x ) (cid:17) dx = lim t, | ds |→ t | ds | Z π ( x ) q − ( x ) (cid:16) { θ ( x ) ∈ ds } (cid:0) T t [ q + θ ]( x ) − T t q + ( x ) θ ( x ) (cid:1) + (cid:0) T t [ q + θ { θ ∈ ds } ]( x ) − T t [ q + { θ ∈ ds } ]( x ) θ ( x ) (cid:1) (cid:17) dx, (S15)where the second equality follows from the deﬁnition of the adjoint ( T t ) † π .4xpression (S15) for J θAB ( s ) can be directly translated into an estimator for computingfrom short-trajectory data: J θAB ( s ) ≈ t | ds | M X i =1 q + ( X ( i ) ( t )) (cid:0) θ ( X ( i ) ( t )) − θ ( X ( i ) (0)) (cid:1) × q − ( X ( i ) (0)) θ ∈ ds ( X ( i ) (0)) w ( X ( i ) (0))+ 12 t | ds | M X i =1 q + ( X ( i ) ( t )) (cid:0) θ ( X ( i ) ( t )) − θ ( X ( i ) (0)) (cid:1) × q − ( X ( i ) (0)) θ ∈ ds ( X ( i ) ( t )) w ( X ( i ) (0)) . (S16)Finally, without aﬀecting the t → A ∪ B , yielding the estimator J θAB ( s ) ≈ t | ds | M X i =1 q + ( X ( i ) ( t ∧ T A ∪ B )) (cid:0) θ ( X ( i ) ( t ∧ T A ∪ B )) − θ ( X ( i ) (0)) (cid:1) × q − ( X ( i ) (0)) θ ∈ ds ( X ( i ) (0)) w ( X ( i ) (0))+ 12 t | ds | M X i =1 q + ( X ( i ) ( t )) (cid:0) θ ( X ( i ) ( t )) − θ ( X ( i ) ( S A ∪ B ( t ))) (cid:1) × q − ( X ( i ) ( S A ∪ B ( t ))) θ ∈ ds ( X ( i ) ( t )) w ( X ( i ) (0)) (S17)which, in our experience, outperformed (S16) for larger values of t . Note that we could havecanceled additional terms in (S15) to yield a more concise estimator. However, we foundthat the estimator (S17) gave less noisy results. C Reactive current on a CV space

We now establish that our projected reactive current gives the ﬂux over surfaces in CVspace. We assume that our CVs are smooth and that, for some subset C θ of CV space withsmooth boundary, the set C = { x : θ ( x ) ∈ C θ } contains A and does not intersect B . We5ill establish that for such a subset, Z ∂C θ J θAB ( s ) · n C θ dσ C θ = Z ∂C J AB · n C dσ C . (S18)Here n C θ is the outward pointing normal vector to the boundary ∂C θ of C θ , n C is the normalvector to the boundary ∂C of C , σ C θ is the surface measure on ∂C θ and, σ C is the surfacemeasure on ∂C . The signiﬁcance of (S18) is that it shows that our deﬁnition of J θAB preservesreactive ﬂux across surfaces in the CV space so that statistics of reactive paths could, inprinciple, be computed directly from J θAB .Let f δ be a smooth function on CV space that is equal to 1 on C θ and equal to 0 for x adistance of more than δ from C θ . Applying the divergence theorem and integrating by partswe ﬁnd that Z ∂C θ J θAB ( s ) · n C θ dσ C θ = Z C θ div J θAB ( s ) ds = lim δ → Z f δ ( s ) div J θAB ( s ) ds = − lim δ → Z J θAB ( s ) · ∇ f δ ( s ) ds. (S19)Inserting our deﬁnition of J θAB we ﬁnd that Z ∂C θ J θAB ( s ) · n C θ dσ C θ = − lim δ → X j Z Z J AB ( x ) · ∇ θ j ( x ) δ ( θ ( x ) − s ) ∂f δ ( s ) ∂s j dxds = − lim δ → X j Z J AB ( x ) · ∇ θ j ( x ) ∂f δ ∂s j ( θ ( x )) dx. (S20)Using the chain rule the last expression can be rewritten as Z ∂C θ J θAB ( s ) · n C θ dσ C θ = − lim δ → Z J AB ( x ) · ∇ f δ ( θ ( x )) dx. (S21)Integrating by parts, taking the δ → References (1) Vanden-Eijnden, E.

Computer Simulations in Condensed Matter Systems: From Mate-rials to Chemical Biology Volume 1 ; Springer, 2006; pp 453–493.7igure S1: EMUS asymptotic variance for REUS PMFs.8igure S2: REUS PMFs.9igure S3: Diﬀerence between DGA with the modiﬁed distance basis set without the αα