[PDF] Non-Parametric Analysis of Non-Equilibrium Simulations

Abstract

We extend the non-parametric framework of reaction coordinate optimization to non-equilibrium ensembles of (short) trajectories. For example, we show how, starting from such an ensemble, one can obtain an equilibrium free energy profile along the committor, which can be used to determine important properties of the dynamics exactly. New adaptive sampling approach, the transition state ensemble enrichment, is suggested, which samples the configuration space by "growing" committor segments towards each other starting from the boundary states. This framework is suggested as a general tool, alternative to the Markov state models, for a rigorous and accurate analysis of simulations of large biomolecular systems, as it has the following attractive properties. It is immune to the curse of dimensionality, it does not require system specific information, it can approximate arbitrary reaction coordinates with high accuracy and it has sensitive and rigorous criteria to test optimality and convergence. The approaches are illustrated on a 50-dimensional model system and a realistic protein folding trajectory.

Full PDF

NNon-Parametric Analysis of Non-EquilibriumSimulations

Sergei V. Krivov ∗ Astbury Center for Structural Molecular Biology, Faculty of Biological Sciences, Universityof Leeds, Leeds LS2 9JT, United Kingdom

E-mail: [email protected]

Abstract

We extend the non-parametric framework of reaction coordinate optimization tonon-equilibrium ensembles of (short) trajectories. For example, we show how, start-ing from such an ensemble, one can obtain an equilibrium free energy proﬁle alongthe committor, which can be used to determine important properties of the dynamicsexactly. New adaptive sampling approach, the transition state ensemble enrichment,is suggested, which samples the conﬁguration space by ”growing” committor segmentstowards each other starting from the boundary states. This framework is suggestedas a general tool, alternative to the Markov state models, for a rigorous and accurateanalysis of simulations of large biomolecular systems, as it has the following attractiveproperties. It is immune to the curse of dimensionality, it does not require systemspeciﬁc information, it can approximate arbitrary reaction coordinates with high accu-racy and it has sensitive and rigorous criteria to test optimality and convergence. Theapproaches are illustrated on a 50-dimensional model system and a realistic proteinfolding trajectory. a r X i v : . [ phy s i c s . c h e m - ph ] F e b Introduction

One general strategy in overcoming the sampling problem in biomolecular simulations con-sists of simulating a very large ensemble of short trajectories rather than a singe long tra-jectory. This strategy allows seamless parallelization and is a promising approach towardssimulations employing exascale or cloud computing.

Adaptive sampling approaches canbe considered as an extension of this strategy, where one, for example, improves samplingin less sampled parts of conﬁguration space, or parts that produce largest error or controlsthe exploration/exploitation balance.

The swarms of trajectories is another successfulvariation of this idea.To analyze such ensembles of short trajectories one commonly employs the Markov statemodel (MSM) framework.

Assuming that the sampling is suﬃciently extensive, andusing a ﬁne-grained clustering of the conﬁguration space of the system, one can estimate thetransition probability matrix. For large system, as conﬁguration space size grows exponen-tially with system size, dimensionality reduction is performed before clustering. Knowingthe matrix, one can compute many important properties of the equilibrium dynamics, forexample, the equilibrium probabilities/populations, ﬂuxes and rates. The minimal lag timewhen a MSM becomes approximately Markovian, which can be estimated by the conver-gence of implied timescales or by Chapman-Kolmogorov criterion, is a good indicator of theaccuracy of the constructed model. The shorter is the lag time, the shorter are the trajec-tories, required to construct the MSM, and the larger is the possible speedup over a direct,brute-force simulation. State of the art approaches have lag times in the range of tens ofnanoseconds.

Recently, we have suggested non-parametric approaches, which can determine thecommittor and a few slowest eigenvectors, that pass stringent validation tests at much shorterlag time of trajectory sampling interval of 0 . as they use no system speciﬁc information and thus do not require an extensive expertisewith the system. In particular, they do not require a functional form with many parameters2o closely approximate a reaction coordinate (RC), e.g., a linear combinations of moleculardescriptors or a deep neural network, and can approximate any RC with high accuracy.The approaches however were restricted to long equilibrium trajectories. Here we report theextension of the framework to non-equilibrium ensembles of trajectories and suggest it as ageneral framework, alternative to the MSM, for a rigorous and accurate analysis of dynamicsof large biomolecular systems.We describe approaches which can be used to analyze such non-equilibrium ensembles oftrajectories, to determine the following important descriptors/properties of the dynamics:the committor function, the re-weighting factors (related to the equilibrium probabilities),the eigenvectors of the equilibrium and non-equilibrium transfer operators and that of thetransition probability. In particular, we show how one can determine the equilibrium freeenergy proﬁle as a function of the committor, which can be used to determine exactly suchimportant properties of the dynamics as the equilibrium ﬂux, the mean ﬁrst passage times,and the mean transition path times between any two points on the committor. One way to analyze non-equilibrium simulations consist in computing the re-weightingfactors ﬁrst, and then use them to re-weight the sampling, thus essentially reducing theproblem to the equilibrium case. Such a straightforward approach, however, has the fol-lowing shortcoming. The accuracy of the analysis depends on the accuracy of the obtainedre-weighting factors. Thus, it requires an approach capable of determining the re-weightingfactors for every trajectory point robustly and accurately, which is a very diﬃcult task. Here,we present approaches that do not assume the existence of the re-weighting factors, and thusfree of the shortcoming.The paper is as follows. We start by reviewing the non-parametric framework for equilib-rium simulations. Next, equations to determine the committor function from non-equilibriumsimulations are derived. They are followed by derivation of equations to determine re-weighting factors. The power of the developed approaches is illustrated on two examples:a 50-dimensional model system and a realistic protein folding trajectory. Next, we discuss3 number of realistic practical scenarios of how the developed approaches can be combinedwith existing enhanced/adaptive sampling techniques. We then describe a generic adaptivesampling approach, the transition state ensemble enrichment, TSEE, which is based on thedeveloped non-parametric approaches. The performance of the TSEE is illustrated on the50-dimensional model system. We end with a concluding discussion.

A rigorous way to analyze dynamics produced by biomolecular simulations is to describe itas a diﬀusion on a free energy landscape, free energy as a function of RCs. The simulationtrajectory is projected onto a RC by computing the RC time-series r ( t ) as a function oftime, which is used to determine the corresponding free energy landscape and diﬀusioncoeﬃcient. For such a description to be quantitatively accurate, the RCs should be chosen inan optimal way. The committor function is an example of such a RC, that can be usedto compute some important properties of the dynamics exactly.

The eigenvectors (EVs)of the transfer operator are another example. Recently we have developed non-parametricapproaches to accurately determine such coordinates from a long equilibrium trajectory.

Here we describe how to extend this framework to non-equilibrium simulations, makingpossible to use these approaches for enhanced/adaptive sampling.

The overall idea of iterative non-parametric RC optimization is as follows.

We start witha seed RC time-series r ( t ). During each iteration we consider a variation of RC as r ( t )+ δr ( t ),where δr ( t ) can be (time-series of) any function of conﬁguration space, collective variablesand the RC itself. For example, one can take δr ( t ) = f ( r ( t ) , y ( t )), where y ( t ) is time-seriesof a randomly chosen coordinate of conﬁguration space X i ( t ) or a randomly chosen collectivevariable and f ( r, y ) = (cid:80) lm α lm r l y m is a low degree polynomial. The coeﬃcients/parameters4f the variation are chosen such that r ( t ) + δr ( t ) provides the best approximation to thetarget optimal RC (e.g., committor). Speciﬁcally, they deliver optimum to a speciﬁc targetfunctional α (cid:63) = arg min α I ( r + δr ). The RC time-series is updated r ( t ) ← r ( t ) + δr (cid:63) ( t ),where δr (cid:63) ( t ) is the optimal variation, i.e., δr (cid:63) ( t ) = (cid:80) lm α (cid:63)lm r l ( t ) y m ( t ). Iterating the processone repeatedly improves the putative RC time-series by incorporating information containedin diﬀerent coordinates or collective variables. Alternatively, by repeating the iterations, thetarget functional is optimized by considering variations along diﬀerent coordinates. For thetarget functionals considered here, the optimal coeﬃcients α (cid:63) are found as solutions of linearsystems of equations.While each iteration may depend on the exact choice of the family of collective variables y or the parametrization of the variation δr ( t ), the ﬁnal RC does not, since it providesoptimum to a (non-parametric) target functional, when the optimization converges. In thissense such an approach is non-parametric.If the system obeys some symmetry (e.g., the rotational and translational symmetriesfor biomolecules), then the optimal RC should obey the same symmetry. A simple way toensure this is to use as y , variables that respect the symmetry, for example, the distancesbetween randomly chosen pairs of atoms y ( t ) = r ij ( t ).The equations for non-parametric RC optimization are derived in the following sequenceof steps. We ﬁrst ﬁnd a variational principle when the system dynamics is described by aﬁnite Markov chain (a MSM). Next the variational principle is reformulated in terms of RCtime-series. If this is possible, it means, that there is no need to construct/consider a ﬁniteMarkov chain and one can operate using just RC time-series. By varying the functional weobtain the ﬁnal equations for the optimal values of the parameters α (cid:63) .We employ the framework of ﬁnite Markov chains to describe Markov dynamics in theconﬁguration space. While the conﬁguration space in molecular simulations is continuousand rigorous treatment requires the usage of integral operators, we prefer ﬁnite Markovchains due to their convenience, simplicity and manifest invariance to the choice of coor-5inate systems. Moreover, a ﬁnite Markov chain can provide an excellent approximationto continuous conﬁguration space. Since, we consider such a Markov chain as a theoreticalconcept, e.g., to derive the equations, and there is no need for an actual construction of sucha chain in practice, the number of states can be arbitrarily large as long as it stays ﬁnite. Forexample, a molecular system of interest can be embedded into a large box with boundaries at ± L along each coordinate. Each coordinate can be discretized with a very ﬁne step of, say, d ∼ .

01 ˚A. Since the molecular dynamics simulations are usually performed by numericalintegrating the Newtons equations of motion, the dynamics, at the timescales close to thesimulation time step, is Markovian in the phase space not the conﬁguration space. At longertimescales the dynamics loses memory about the momenta and can be considered approxi-mately Markovian in the conﬁguration space. We assume that given simulation trajectoriesare recorded with such or longer sampling interval, since we are mainly interested in the de-termination of optimal RCs as functions of the conﬁguration space. However, it is possible,in principle, to apply the developed approaches to determine optimal RCs as functions ofphase space at a shorter sampling interval.

We ﬁrst review the approach for equilibrium trajectories.

We use the following no-tation: X ( i ∆ t ) denotes a long equilibrium multidimensional trajectory, where ∆ t is thetrajectory sampling interval; x ( i ) denotes an arbitrary RC as a function of MSM state i ,while q ( i ) is reserved for the committor; r ( i ∆ t ) is an arbitrary RC as a function of trajec-tory snapshot or time along the trajectory or, shortly, a function of trajectory; again q ( i ∆ t )is reserved for the committor. Here we describe how, given X ( i ∆ t ), one can determineputative time-series r ( i ∆ t ), which closely approximates the committor q ( i ∆ t ).Assume that, by using a ﬁne-grained clustering, we are able to construct an accurateMarkov state model, with transition probability matrix deﬁned as P ( i | k, ∆ t ) = n ( i | k, ∆ t ) /n ( k ),6here P ( i | k, ∆ t ) is the transition probability from state k to state i after time-interval (lagtime) ∆ t , n ( i | k, ∆ t ) is the number of transition from state k to state i after time interval∆ t and n ( k ) = (cid:80) i n ( i | k, ∆ t ).The committor function satisﬁes the following equation (cid:88) i [ q ( i ) − q ( k )] P ( i | k, ∆ t ) = 0 , for k (cid:54) = A, B (1a) q ( A ) = 0; q ( B ) = 1 . (1b)Consider the following optimization problem:min x (cid:88) ij [ x ( i ) − x ( j )] n ( i | j, ∆ t ) (2a) x ( A ) = 0; x ( B ) = 1 , (2b)here, x ( i ) is an arbitrary RC as a function of state i . By diﬀerentiation with respect to x ( k ),and using the detailed balance condition n ( i | j, ∆ t ) = n ( j | i, ∆ t ) one obtains Eq. 1, i.e., thecommittor function provides the minimum to the functional 2.Before reformulating Eq. 2 optimization problem in terms of RC time-series lets introducea convenient abbreviation for the sums like (cid:80) T/ ∆ ti =0 f ( i ∆ t ) g ( i ∆ t + ∆ t ), where ∆ t equals ∆ t or its multiple and T = N ∆ t is trajectory length. If ∆ t = k ∆ t , it means that only 1 /k -thfraction of points in the trajectory are used. To use all the points in the trajectory onecan average over the starting point as 1 /k (cid:80) k − j =0 (cid:80) T/ ∆ ti =0 f (( ik + j )∆ t ) g (( ik + j )∆ t + k ∆ t ),which equals 1 /k (cid:80) T − ∆ ti =0 f ( i ∆ t ) g ( i ∆ t + ∆ t ). We denote such a sum as (cid:80) t f ( t ) g ( t + ∆ t ).Even though we will mainly use ∆ t = ∆ t , the notation allows the consideration of arbitrarylag times.The optimization problem Eqs. 2 is translated to RC time-series r ( i ∆ t ) (for lag time7 t ) as min r (cid:88) t [ r ( t + ∆ t ) − r ( t )] (3a) r ( t ) = 0 , X ( t ) ∈ A ; r ( t ) = 1 , X ( t ) ∈ B, (3b)here r ( i ∆ t ) is an arbitrary RC as a function of trajectory. The total squared displacementfunctional in Eq. 3a, which is optimized, is referred later as ∆ r for brevity. Here andbelow we assume ∆ t = ∆ t unless stated otherwise. The theoretical minimum value ofthe functional, attained for r = q , equals ∆ q = 2 N AB , where N AB is the total numberof transitions from state A to B, or from B to A. Thus, if during RC optimization ∆ r / N AB , it follows that the putative RC closely approximates the committor.To satisfy the constraint Eq. 3b during optimization, we, ﬁrst construct the seed RCtime-series that satisﬁes the constraint, and second, during optimization, we keep positionsof these points ﬁxed by setting δr ( t ) = 0 for them. Lets introduce boundary indicatorfunction I b ( t ), which equals 1 when point X ( t ) belongs to a boundary and is thus ﬁxedduring optimization, and zero otherwise. ˜ I b ( t ) = 1 − I b ( t ) is its negative. The variationof the putative time series, which keeps the positions of points/frames in boundary statesﬁxed can be taken as r ( t ) + δr ( t ) = r ( t ) + ˜ I b ( t ) (cid:80) j α j f j ( t ), where f j ( t ) are basis functions,that are discussed below. Optimal coeﬃcients α (cid:63) , which give the best approximation tothe committor for the considered variation, can be found by equating the derivative of thefunctional with respect to α k to zero: (cid:88) t [ r ( t + ∆ t ) − r ( t ) + δr ( t + ∆ t ) − δr ( t )][ f k ( t + ∆ t ) ˜ I b ( t + ∆ t ) − f k ( t ) ˜ I b ( t )] = 0 , (4)or more compact (cid:88) t [∆ r ( t ) + ∆ δr ( t )]∆[ f k ( t ) ˜ I b ( t )] = 0 , (5)where operator ∆ denotes the forward time diﬀerence ∆ f ( t ) = f ( t + ∆ t ) − f ( t ). It equals8he following system of linear equations (cid:88) j A kj α (cid:63)j = b k (6a) A kj = (cid:88) t ∆[ f k ( t ) ˜ I b ( t )]∆[ f j ( t ) ˜ I b ( t )] (6b) b k = − (cid:88) t ∆ r ( t )∆[ f k ( t ) ˜ I b ( t )] (6c)As basis functions f j ( t ) one can take the terms of a low-degree polynomial, i.e., r l ( t ) y m ( t ) for l + m ≤ n and y ( t ) is a randomly chosen coordinate of the conﬁguration space, y ( t ) = X i ( t ),or a collective variable. To focus optimization on a particular region of RC, one can modulatethe polynomial terms by a common envelop. For example, to focus optimization on pointsaround r on the RC, one can use e −| r ( t ) − r | /d × r l ( t ) y m ( t ), where d is some small numberthat deﬁnes the scale. This option is useful to optimize regions corresponding to free energyminima along the committor, which get exponentially shrunk. Generally, the higher is the degree of the polynomial, the faster is the optimization,though more computationally demanding. However a very high degree may lead to numericalinstabilities and strong overﬁtting. The following strategy was found useful: use a polynomial f ( r, y ) with a relatively small degree (3-6) for updates involving r ( t ) and y ( t ) followed by apolynomial f ( r ) of a high degree (e.g., 16) for updates involving only r ( t ).The basic algorithm (which we call NPq) is as follows. Initialization: a seed RC isconstructed, which satisﬁes the boundary constrains, for example, r ( t ) = 0 if X ( t ) ∈ A , r ( t ) = 1 if X ( t ) ∈ B and r ( t ) = 0 . Iterations: one selects times-series y ( t )(a randomly chosen coordinate of conﬁguration space X or a collective variable), computesbasis functions, solves Eqs. 6 and updates r ( t ). Stopping: iterations stop when ∆ r / N AB - the number of transitions from state A to state B or thatfrom B to A .If the system has been extensively sampled, and overﬁtting is not possible, then, as9utative ∆ r / N AB , the RC should closely approximate the committor. To conﬁrmthat, one can use the the Z C, validation/optimality criterion for the committor. Z C, can be straightforwardly computed from time-series r ( i ∆ t ): each transition of trajectoryfrom x = r ( i ∆ t ) to x = r ( i ∆ t + ∆ t ) adds 1 / | x − x | to Z C, ( x, ∆ t ) for all points x between x and x . Validation:

If a putative RC closely approximates the committor,then Z C, ( x, ∆ t ) ≈ N AB for all x and ∆ t , where Z C, ( x, ∆ t ) are computed using transitionpath segment summation scheme. Optimality: for a suboptimal RC, Z C, ( x, ∆ t ) valuesgenerally decrease to the limiting value of N AB , as ∆ t increases. The larger the diﬀerencebetween Z C, ( x, ∆ t ) and Z C, ( x, ∆ t ) the less optimal the RC around x . Jupyter notebooksillustrating usage of Z C,α proﬁles for RC analyses and, in particular, as the committor andeigenvector criteria are available at https://github.com/krivovsv/CFEPs. For realistic systems with limited sampling, this simple algorithm may start to overﬁtthe RC in some regions and underﬁt in other. One way to overcome this problem is tomake optimization adaptive, by focusing optimization on less optimized spatio-temporalregions. Here we consider another strategy - to use adaptive sampling in order to improvesampling in regions that are overﬁt or undersampled. Since adaptive sampling is no-longerequilibrium, and the described approach assumes the detailed balance, we describe a newapproach applicable to non-equilibrium sampling.

We assume that we are given a non-equilibrium ensemble of (short) trajectories. Whileeach trajectory was simulated by following the unperturbed or natural dynamics of interest,the starting conﬁgurations are chosen arbitrarily, for example, according to an enhanced oradaptive sampling scheme.We employ the following representation of a non-equilibrium ensemble of trajectories.All the short trajectories are concatenated into a single long trajectory X ( i ∆ t ) combined10ith itraj( i ∆ t ) index function which maps frames to the trajectory numbers they belongto. Our aim is to determine putative time-series r ( i ∆ t ), which closely approximates thecommittor q ( i ∆ t ).Analysis of such non-equilibrium ensembles of trajectories by the MSM formalism iscarried out without modiﬁcation. One determines the transition numbers n ( i | j, ∆ t ) andtransition probability matrix P ( i | j, ∆ t ), which can be used, e.g., to determine the committorusing Eq. 1 or the equilibrium probability. The non-parametric approach, however, needsmodiﬁcations, as the detailed balance is not satisﬁed in such non-equilibrium ensembles, i.e., n ( i | j, ∆ t ) (cid:54) = n ( j | i, ∆ t ) and the minimum of Eq. 2 is no longer provided by the committorfunction. To ﬁnd a functional for non-equilibrium case, i.e., a functional whose minimumis provided by the committor function, which does not assume the detailed balance andwhich can be expressed in terms of RC time-series, we used the following trick. Consider thefollowing optimization problem,min x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) ij [ x (cid:48) ( i ) − x ( j )] n ( i | j, ∆ t ) (7a) x ( A ) = 0 , x ( B ) = 1 , (7b)where min x (cid:12)(cid:12)(cid:12) x (cid:48) = x sign means that we optimize by varying x , while variables x (cid:48) are ﬁxed duringoptimization and are updated as x (cid:48) = x straight after, then the optimization cycle is repeateduntil converged. For example, assume that we minimize the functional by the steepest-descent algorithm (SD), i.e., by iteratively making steps against the gradient: x ( k ) = x ( k ) − γ ∇ k , where ∇ k is gradient and γ is the step size. The SD will stop when the gradient is zero ∇ k = − (cid:80) i [ x (cid:48) ( i ) − x ( k )] n ( i | k, ∆ t ) = 0 , for k (cid:54) = A, B (8)Now, introduce the update of x (cid:48) variables, after every SD step, as x (cid:48) = x . Since, we iterativelydecrease a positive functional, the process should converge, hence we let x (cid:48) = x in Eq. 8,11nd obtain that x is the committor (Eq. 1).Before translating Eq. 7 functional to RC time-series terms, we update our nota-tion to take into account summation over trajectories in the ensemble. Consider sum (cid:80) N tr k =1 (cid:80) T k − ∆ ti =0 f k ( i ∆ t ) g k ( i ∆ t + ∆ t ), where the ﬁrst sum with index k , is the sum over N tr trajectories in the ensemble and the second sum with index i , is the sum along k -thtrajectory with length T k . We denote such a sum as (cid:80) t f ( t ) g ( t + ∆ t ) I t ( t ), where the sumover t , is the sum over the long trajectory obtained by concatenating all the trajectories inthe ensemble, and I t ( t ) is indicator function, which equals 1 when f ( t ) and g ( t + ∆ t ) belongto the same short trajectory, i.e., itraj( t ) = itraj( t + ∆ t ), and zero otherwise. I t ( t ) kills allthe cross-trajectories terms, ensuring that only terms, where f ( t ) and g ( t + ∆ t ) are fromthe same trajectory, contribute to the sum. This short, intuitive notation, makes equationsbelow less cluttered.The optimization problem of Eq. 7 is translated to RC time-series terms as followsmin r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t [ r (cid:48) ( t + ∆) − r ( t )] I t ( t ) (9a) r ( A ) = 0 , r ( B ) = 1 (9b)Taking RC variation as r ( t ) + δr ( t ) = r ( t ) + ˜ I b ( t ) (cid:80) j α j f j ( t ) one obtains for the gradient ∂/∂α k = − (cid:88) t [ r (cid:48) ( t + ∆) − r ( t ) − δr ( t )] f k ( t ) ˜ I b ( t ) I t ( t ) (10)Instead of optimizing with the SD, which converges rather slow, one can ﬁnd analytically12 (cid:63) , the optimal values of α , where ∂/∂α k = 0: (cid:88) j A kj α (cid:63)j = b k (11a) A kj = (cid:88) t f k ( t ) f j ( t ) ˜ I b ( t ) I t ( t ) (11b) b k = (cid:88) t [ r (cid:48) ( t + ∆ t ) − r ( t )] f k ( t ) ˜ I b ( t ) I t ( t ) (11c)One may attempt to speed up the convergence of the iterations further by letting r (cid:48) = r + δr in Eq. 10. Which leads to the following system of linear equations (cid:88) j A kj α (cid:63)j = b k (12a) A kj = (cid:88) t [ f j ( t ) ˜ I b ( t ) − f j ( t + ∆ t ) ˜ I b ( t + ∆ t )] f k ( t ) ˜ I b ( t ) I t ( t ) (12b) b k = (cid:88) t [ r ( t + ∆ t ) − r ( t )] f k ( t ) ˜ I b ( t ) I t ( t ) (12c)In the non-equilibrium case, in contrast to the equilibrium one, the lower bound of the ∆ r functional is not known, because it depends on the sampling. To monitor the convergence ofthe optimization process here, we suggest to adopt one of the metrics in iterative equationsolving - the increment size | x new − x old | . Since the optimization is stochastic ( y are selectedrandomly), we suggest to monitor the increment size during the last n iterations || r − r − n || = (cid:112)(cid:80) t [ r ( t ) − r − n ( t )] , to have a representative estimate; here subscript − n means n iterationsback.The basic NPNEq algorithm is similar to the equilibrium case and is as follows.

Ini-tialization: a seed RC is constructed, which satisﬁes the boundary constrains, for example, r ( t ) = 0 if X ( t ) ∈ A , r ( t ) = 1 if X ( t ) ∈ B and r ( t ) = 0 . Iterations: one se-lects times-series y ( t ) (a randomly chosen coordinate of conﬁguration space X or a collectivevariable), computes basis functions, solves Eqs. 12 and updates r ( t ). Stopping: iterationsstop when the change of RC time-series during the last n iterations || r − r − n || is suﬃciently13mall.We show in Appendix that Eq. 12 can be obtained in other ways. Using the Galerkincondition, where one minimizes the error terms (the residuals or the deviations from 0) inEq. 7, by making them orthogonal to the basis functions. Or, by minimizing the weightedsum of the error terms squared - a standard approach of solving system of linear equationsiteratively. Here we suggest a generalization of the Z C, criterion for the committor to the non-equilibriumcase. Consider function Z q ( x, ∆ t ), whose derivative equals ∂Z q ( x, ∆ t ) ∂x = (cid:88) ij δ ( x − x ( j ))[ x ( i ) − x ( j )] n ( i | j, ∆ t ) (13)It can be computed from RC time-series r ( i ∆ t ) as ∂Z q ( x, ∆ t ) ∂x = (cid:88) t δ ( x − r ( t ))[ r ( t + ∆ t ) − r ( t )] (14)If x ( i ) is the committor, i.e., satisﬁes Eq. 1, then by summing Eq. 13 over i , one obtainsthat the derivative is zero for all j but the boundary nodes. Which leads to the validationcriterion: Z q ( x, ∆ t ) is constant for the committor function for all x (but boundary nodes,see below) and ∆ t . Note that, in contrast to the equilibrium case, the constant value hereis not informative, as it is deﬁned by the transitions from the state A and depends on thesampling. In Appendix we show that Z q is an outgoing part of the Z C, proﬁle and Z q = Z C, for an equilibrium trajectory with the detailed balance.Note that, analogous to Z C, , Z q deviates from the constant value around the boundariesfor ∆ t > ∆ t . The deviations can be eliminated by employing the transition path segment14ummation scheme. However, since one expects the trajectories to be relatively short, thedeviations are expected to be small, and we do not see a signiﬁcant advantage in introducingthis scheme here.If a putative RC deviates from the committor, then Z q derivative should deviate fromzero. However, it is not clear if the diﬀerence between the derivatives for two diﬀerent ∆ t canserve as a measure of RC sub-optimality. Here, the equilibrium Z C, criterion is used for thatpurpose, which can obtained by re-weighting the non-equilibrium sampling, as demonstratedlater. Another quantity of interest in non-equilibrium sampling are the equilibrium probabilitiesor re-weighting factors. Having determine the transition matrix one can compute the equi-librium probabilities, π ( i ), as the solution of π ( i ) = (cid:88) j P ( i | j, ∆ t ) π ( j ) . (15)Introducing re-weighting factors w ( i ), which correct the non-equilibrium distribution π ( i ) = n ( i ) w ( i ), the equation can be written also as w ( i ) n ( i ) = (cid:88) j n ( i | j, ∆ t ) w ( j ) , (16)here n ( i ) = (cid:80) j n ( j | i, ∆ t ). For a single long equilibrium trajectory, where the number ofingoing and outgoing transitions for every node is equal, (cid:80) j n ( j | i, ∆ t ) = (cid:80) j n ( i | j, ∆ t ), w ( i ) = 1 is the solution - no re-weighting is necessary.The re-weighting factors do not represent a RC, as it makes little sense to project thedynamics on them. However, they can be determined by the developed formalism, and the15erminology of the formalism will be used for consistency. In particular, we will refer toarbitrary re-weighting factors as a RC and the correct re-weighting factor as the optimal RCdenoted by w (analogous to q for the committor). The aim here is to determine putativetime-series r ( i ∆ t ), which closely approximates the re-weighting factors w ( i ∆ t ).The corresponding optimization functional for Eq. 16 ismin x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) i n ( i ) x ( i ) / − (cid:88) ij x ( i ) n ( i | j, ∆ t ) x (cid:48) ( j ) (17)which is translated to RC time-seriesmin r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t I t ( t ) r ( t ) / − r ( t + ∆ t ) r (cid:48) ( t ) I t ( t ) . (18)Considering RC variation as r ( t ) + (cid:80) j α j f j ( t ) one obtains the following equations for theoptimal parameters (cid:88) j A kj α (cid:63)j = b k (19a) A kj = (cid:88) t [ f k ( t ) − f k ( t + ∆ t )] f j ( t ) I t ( t ) (19b) b k = − (cid:88) t [ f k ( t ) − f k ( t + ∆ t )] r ( t ) I t ( t ) (19c) A j = (cid:88) t f j ( t ) I t ( t ) (19d) b = (cid:88) t − r ( t ) I t ( t ) (19e)The re-weighting factors are deﬁned up to an overall factor, which we ﬁx by requiring thetotal weight to be equal that of an equilibrium trajectory, i.e., (cid:80) t w ( t ) I t ( t ) = (cid:80) t

1. Thisleads to Eqs. 19d-e. They should replace equations Eqs. 19b-c for k = 1, for constant basisfunction f ( t ) = 1, for which Eqs. 19b-c give zeros.The re-weighting factors can also be considered as the ﬁrst right eigenvector (with16igenvalue λ =1) of a non-equilibrium version of the transfer operator n ( i | j, ∆ t ) /n ( i ). Ap-pendix discusses the corresponding equations for the eigenvectors of the transfer operator P ( i | j, ∆ t ) π ( j ) /π ( i ) and the transition probability P ( i | j, ∆ t ).The optimization of eigenvectors, and, correspondingly, of the re-weighting factors, hasan inherent instability. For example, if time-series y ( t ), which is used to improve putativere-weighting factors, enters a region in conﬁguration space, but does not come back, it willtry to increase the weight of this region inﬁnitely. Short trajectories are likely to make thissituation more probable. To make the optimization of the re-weighting factors robust, onemay need to employ some ideas discussed in and this is a work in progress. Here we suggestto use a selected set of proper collective variables that sample all the regions extensively,i.e., they contain transitions to and from all the sampled regions.In the simplest case one can take as y ( t ) only a single committor coordinate time-series.In this case one will determine w ( q ), re-weighting factors as a function of the committor.This is suﬃcient, for example, for the ﬁrst passage ensemble which consists of trajectoriesstarting in A and stopping as soon as they reach B, since the biasing factor in this ensemble isa function of the committor. It should also be suﬃcient for an ensemble of short trajectoriesfor a system with a single dominant pathway. In this case, the committor function, increasingalong the pathway, can be used to parameterize the pathway and the re-weighting factors.If there are two (or a few more) parallel pathways one can incorporate a proper collectivevariable that distinguishes between them into optimization as y ( t ).The basic NPNEw algorithm is as follows.

Initialization: a seed RC is initialized to r ( t ) = 1. Iterations: one randomly selects times-series y ( t ) from a set of proper collectivevariables, computes basis functions, solves Eqs. 19 and updates the putative RC time-series. Stopping: iterations stop when the change of RC time-series during the last n iterations || r − r − n || is suﬃciently small.Once computed, the re-weighting factors are used to determine the equilibrium properties.For example, for the equilibrium free energy proﬁle F ( r ): each trajectory point r ( i ∆ t )17ontributes with corresponding weight of w ( i ∆ t ); for equilibrium Z C,α cut-proﬁles: eachtransition from r ( i ∆ t ) to r ( i ∆ t + ∆ t ) contributes with corresponding weight of w ( i ∆ t ). As the ﬁrst model system we consider a high-dimensional system for which the committorfunction can be computed analytically. It qualitatively resembles a protein folding landscapewith radially symmetric potential energy U ( X ) = U ( R ), decreasing towards the beginningof the coordinates. The decrease in enthalpy is compensated by the decrease in entropy sothat the resulting free energy proﬁle as a function of R has two minima, separated by abarrier (Fig. 1). More speciﬁcally, U ( R ) = U ( R ) − ( n −

1) ln( R ), where R = (cid:112)(cid:80) ni =1 X i and U ( R ) =  R < R − ≤ R ≤

12 4 e − ( R − + 4 e − ( R − < R R − (20)Due to high dimensionality of the conﬁguration space, n = 50 here, the system can notbe analyzed directly by an MSM approach, one would need to preform a dimensionalityreduction ﬁrst. For example, a trajectory of 10 frames will not even visit every possibleregion of conﬁguration space with diﬀerent combinations of coordinate signs. Approachesassuming pathways can not be applied also, as the system does not have a well deﬁnedpathway.Non-equilibrium ensemble of short trajectories was obtained by randomly selecting apoint in the 50 dimensional conﬁguration space with uniform distribution in 1 < R <

13 andsimulating a diﬀusion trajectory for 10 steps with D ( X ) = 1, simulation step ∆ t sim = 0 . t = 0 .

1. The total size of the ensemble is 10 points. The free18nergy proﬁle as a function of the radius F H ( R ), computed from the trajectories, is diﬀerentfrom U ( R ) (Fig. 1), conﬁrming the non-equilibrium character of sampling. The NPNEq R F F H U Figure 1: Non-equilibrium free energy proﬁle F H ( R ) (blue) and U ( R ) (red).algorithm is used to ﬁnd the putative committor time-series. Speciﬁcally, Initialization: aseed RC is constructed as r ( t ) = 0 if R ( t ) < r ( t ) = 1 if R ( t ) > r ( t ) = 0 . Iterations:

Every iteration consists of four RC updates.NPNEq equations (Eqs. 12) with basis functions being the terms of polynomial f ( r, y ) ofdegree 6, where y ( t ) = X i ( t ) and i is randomly chosen from 1 , , ...,

50, i.e., y ( t ) is a randomlychosen coordinate time-series. It is followed by NPNEq equations with basis functions beingthe terms of polynomial f ( r ) of degree 16 with envelop exp( −| − r | / . −| r | / . Stopping: iterations are terminatedwhen || r − r − || < .

3. A Jupyter notebook with the analysis is provided in the SupportingInformation and is also available at https://github.com/krivovsv/NPNE. The results are robust with respect to the polynomial degrees, frequency of updates withenvelops, size of the envelops, etc. Higher degrees generally lead to faster convergence, a bitsmaller value of the ∆ r functional, and less ﬂuctuating Z q , though very high degrees mayresult in instability and occasional failure to converge.Fig. 2 demonstrates the convergence of the iterations of the optimization process. Thesize of increments || r − r − || are steadily getting smaller with the iteration number. Thechange in the functional ∆ r value as a function of iteration number is getting smaller,19ndicates that we are approaching the minimum. The change of the RC time-series during thelast 100 iterations, for selected frames, is bounded by 0 . r a r rr b || rr || Figure 2: Convergence of the NPNEq optimization. a) ∆ r (black) and || r − r − || (blue) asfunctions of iteration number. b) Change of the RC time-series during the last 100 iterationsfor selected frames.Fig. 3 inspects how closely the determined time-series approximates the committor. Thevalidation criterion is relatively constant. The root mean squared deviations of Z q are about5, 6 and 10 for ∆ t = 1 , t = 4 could be dueto general statistical ﬂuctuations because of limited sampling. Unlike the equilibrium Z C, proﬁles, the mean values of the Z q proﬁles are not very meaningful, as they depend on thetransitions from state A, which depend on the sampling. The committor as a function of R can be compute analytically as q ( R ) = (cid:90) RR ( A ) D − ( x ) e U ( x ) dx/ (cid:90) R ( B ) R ( A ) D − ( x ) e U ( x ) dx, D ( x ) = 1. Fig. 3b shows that the latter is in a good agreement with the putative RC,which is referred as committor henceforth. q Z q a t = 1 t = 2 t = 4 R q b Figure 3: Validation of the putative RC time-series. a) Non-equilibrium committor validationcriterion Z q ( x, ∆ t ) along putative time-series q for ∆ t = 1 , , R (black line) withthat for selected frames from RC time-series (yellow dots).The re-weighting factors are computed using the NPNEw algorithm. Speciﬁcally, Initial-ization: a seed RC is initialized as r ( t ) = 1. Iterations:

NPNEw equations (Eqs. 19) withbasis functions being the terms of polynomial f ( r, y ) of degree 5, where y ( t ) is the putativecommittor time-series q ( t ). Stopping: iterations are terminated when || r − r − || < . Z C, as the function of the putative committor q , the committor validationcriterion. The proﬁle is constant with ﬂuctuations bounded by 10%, conﬁrming that q approximates the committor rather well. 21he equilibrium Z C, proﬁle can be used to compute the equilibrium ﬂux J AB = N AB /T ,where T = N ∆ t , is the total length of trajectory and N AB is the number of transitionsfrom A to B. N AB can be computed as N − AB = (cid:82) q ( B ) q ( A ) Z − C, ( q ) dq and N = (cid:80) t w ( t ) I t ( t ).The obtained value of the equilibrium ﬂux J AB = 0 . J AB = N AB /Z = 0 . N − AB = (cid:82) R ( B ) R ( A ) Z − C, ( x ) dx = (cid:82) R ( B ) R ( A ) D − ( x ) e U ( x ) dx and Z = (cid:82) e − U ( x ) dx , where D ( x ) = 1. q Z C , a t = 1 t = 2 t = 4 q , R F b F H ( q ) U ( R ) Figure 4: Equilibrium properties. a) Equilibrium validation/optimality criterion along theputative time-series is constant. The deviations from the constant value are in the range of10%. b) F H (˜ q ) (green), equilibrium free energy proﬁle as a function of ˜ q , is in agreementwith U ( R ) (red); F H (˜ q ) was shifted horizontally and U ( R ) vertically for maximum overlap.The re-weighting factors can be used to compute the equilibrium free energy proﬁle F H ( q )and the diﬀusion coeﬃcient D ( q ) as functions of the committor and thus provide the diﬀusivemodel of the equilibrium dynamics along the committor, which can be used to compute thefollowing important properties of the dynamics exactly: the equilibrium ﬂux, the mean22rst passage times, and the mean transition path times between any two points on thecommittor. However, using F H ( q ) for the analysis and description of the dynamics is notvery convenient as the diﬀusion coeﬃcient varies signiﬁcantly along the coordinate. It is moreconvenient to use a “natural” coordinate, ˜ q , where the diﬀusion coeﬃcient is constant D (˜ q ) = 1. It is related to q by the following monotonous transformation d ˜ q/dq = D ( q ) − / .Fig. 4b shows that F H (˜ q ) is in a very good agreement with U ( R ).In summary, this example illustrates that the NPNEq and NPNEw algorithms can beused to determine the committor and re-weighting factors from non-equilibrium ensemblesof short trajectories and to construct a diﬀusive model of equilibrium dynamics, which canbe used to compute important properties of the equilibrium dynamics exactly.Functions implementing NPq (Eq. 6), NPNEq (Eq. 12) and NPNEw (Eq. 19) iterations,computing Z q and Z H proﬁles and performing transformation to natural coordinate areavailable as Python library npnelib.py at https://github.com/krivovsv/NPNE. We have demonstrated that the NPNEq algorithm can accurately determine the committorRC from a non-equilibrium sampling of the model system. The model system has a relativelysimple conﬁguration space and a relatively simple committor function, which is a functionof R only. It is of interest to see how accurately the NPNEq algorithm can approximate thecommittor for a realistic system. To this end, the NPNEq algorithm is applied to a longequilibrium protein folding trajectory of HP35 Nle/Nle double mutant consisting of 1509392snapshots at 380 K, in particular, to compare with its equilibrium version. The analysisdetails can be found in a Jupyter notebook, provided in the Supporting information and athttps://github.com/krivovsv/NPNE. The optimization continued for 40000 iterations. The ﬁnal ∆ r / ∼ . N AB , i.e., almosttwo times higher than the target value of N AB = 74 .

5. Fig. 5a inspects the convergence of thealgorithms. As once can see the increment size || r − r − || converges to some non-zero value,23hile the ∆ r / r / N AB . Fig. 5b shows that the free energy proﬁle as the function of the putative committor, F ( q ),is very similar to that obtained with equilibrium adaptive non-parametric optimization, indicating that the non-equilibrium approach has similar approximation power. iteration r / a q F bc q Z q || rr || Figure 5: Application of the NPNEq to a realistic protein folding trajectory. a) Convergenceof the NPNEq optimization: || r − r − || (blue) and ∆ r (black) as functions of iterationnumber. b) Free energy as a function of committor, F H ( q ) . c) Z q ( x, ∆ t ) along putativecommittor time-series for ∆ t = 1 (blue), 2 (orange), ..., . Z q criterion (Fig. 5c), which, for equilibrium dynamics, is equivalent to Z C, shows that Z q ( q, ∆ t ) is almost 2 times larger then N AB = 74 .

5. It means that the diﬀusive model of24ynamics is accurate, within a factor of 2, at the time scale of ∆ t = 0 . kT ln(2) ∼ . Z q ( q, ∆ t ) − Z q ( q, t ) is smallest for 0 . < q < .

6. Ifcontinued further, Z q ( q, ∆ t ) will get lower than Z q ( q, t ) in that region, indicating thatthe putative RC is overﬁtted around the transition state. One way to avoid overﬁtting, isto make optimization more uniform by focusing it on underﬁtted/suboptimal regions. Analternative approach consists in performing additional extensive sampling of the transitionstate by starting many short trajectories from the frames in the overﬁtted region and analyzethe combined simulations using the developed non-equilibrium approach.

Given a representative and extensive, possibly non-equilibrium sampling of the conﬁgurationspace the proposed approaches can be used to determine the equilibrium free energy proﬁleas a function of the committor. The later, in particular, can be used to determine importantproperties of the equilibrium dynamics exactly. By a representative sampling we mean sucha sampling which contains all the important regions of the conﬁguration space, e.g., allthe important transition pathways, or a representative sample of them, if their number isinﬁnite. By an extensive sampling we mean a sampling of such a size that overﬁtting bythe non-parametric approaches is not possible or negligible. In this section we will discusspossible strategies of generating such a representative and extensive sampling.Consider ﬁrst the case where a trajectory or ensemble of trajectories provide representa-tive, though not extensive sampling, for example, the state-of-the-art protein folding trajec-tories.

Applying the non-parametric approaches (either equilibrium or non-equilibrium)one ﬁnds that the optimization soon starts to overﬁt the committor RC in the TS region,25ecause the sampling of this region is relatively poor, compare to the rest of the conﬁgura-tion space.

The regions where the putative RC is overﬁtted can be detected by using theoptimality criteria. In order to avoid the overﬁtting, many additional short simulations areperformed, starting from the conﬁgurations that belong the overﬁtted regions, e.g., the TSregion. Then, the total simulation data is analyzed by the non-parametric non-equilibriumapproach.A more diﬃcult case is when the initial representative sampling is absent. For systemswith relatively simple, small conﬁguration space, selection of initial conﬁgurations to startmany short simulations as well as the seed RC can be done analytically, as it was done for themodel system considered here. Such systems may include practically important cases suchas, e.g., studies of dynamics of a ligand binding/unbinding to/from a protein, or diﬀusionof a small molecule/ion through an ion channel pore.If one of the boundary states has a much shorter lifetime compared to the other state,then many trajectories should be started from the former state, which shall generate a non-equilibrium (ﬁrst passage) representative sampling.If both boundary states have long residence times, while the transition path times arerather short, one can use the transition path sampling approach to generate a representativesampling. Inclusion of the rejected paths will increase the size of the sampling and removeconditioning on the boundary states.Another possibility is to use biased, non-equilibrium sampling, though, in this case, rep-resentative sampling of transition paths is not guaranteed. For example, one may use sam-pling at a higher temperature or sampling with a bias potential, e.g., umbrella sampling, steered-MD, replica-exchange, meta-dynamics, or forward ﬂux sampling. If anenhanced sampling method perturbs the dynamics of interest, e.g., a higher temperature ora biasing potential, then many short simulations with unperturbed dynamics, need to beperformed, starting from the obtained conﬁgurations.String method using swarms of trajectories can be straightforwardly combined with26he NPNEq approach. Since the latter does not assume the existence of a dominant pathway,it may improve performance of the former in systems, where this assumption does not hold.Consider now the forward ﬂux sampling (FFS), where one uses an order parame-ter (OP), which can be diﬀerent from the optimal RC - the committor, to propagate thetrajectories from state A to state B. While the accuracy of FFS does not depend on theOP, the eﬃciency does. Thus it would be desirable to propagate FFS trajectories using theoptimal RC or committor. Since the committor RC is not known in advance, one possibilityis to compute the committor during sampling, applying the developed approach to the datasampled so far. Having the idea in mind we propose the following approach.We ﬁrst describe an idealized scenario. Assume that relatively long unbiased simulationswere performed in both boundary states A and B. The simulations are not long enough,however, for the system to sample the transitions between the states, and thus can not beused to construct the entire committor RC. Assume now that these simulations, however,can be used to construct the committor in the sampled regions, i.e., the starting and endingsegments of the committor for example [0 , α ] and [ β, α and β . Analyzing the combined new andold simulations, one extends the RC segments to a large value of α and a smaller value of β ,since some of the stochastic trajectories will travel to these regions. One continues in suchan iterative manner to grow the two segments towards each other, until they meet, when α = β , thus providing the initial representative sampling of transition paths.Unfortunately, it is not possible to construct accurately just the two segments of theRC, because as soon as the RC is divided into two non-overlapping segments, [0 , α ] and[ β, α → β →

1. However, even an approximate RC, obtained just before the RC is dividedinto two segments can be useful. It is possible, when such a partial optimization of the RCincreases the fraction of points with correct values of RC. In this case, a new ensemble ofmany short trajectories is prepared, by starting them from points selected uniformly along27he RC. The new ensemble, will have a higher fraction of points with higher values of α andsmaller values of β . By iterating this process, one can converge to the ensemble with pointsuniformly sampled along the RC. This process is somewhat analogous to the way uraniumis enriched in centrifuges: each cycle leads only to a marginal increase in the concentrationof the desired isotope. However, by repeating the cycle many times, the concentration getsexponentially increased. We call this approach the transition state ensemble enrichment,TSEE. R q a R q b F F Figure 6: Convergence of the TSEE approach in application to the model system. Initialdistribution of points along an OP F ( R ) (blue line); two-dimensional distributions of pointson the q-R plane (yellow dots). Panel a and b show ﬁrst and second iterations, respectively.For details see text.We illustrate the TSEE approach on the model 50 dimensional system (Fig. 6). FirstIteration . We start by sampling the boundary states. 10000 short trajectories of length 10with timestep of 0 . R = 2 and R = 12. This isdone by assigning the 50 coordinates to random numbers uniformly distributed in the range[-0.5,0.5], and re-scaling them so that R equals to 2 or 12. F ( R ) on Fig. 6a shows that28he points are distributed mainly around R = 2 and R = 12, with almost no points in theTS region. The NPNEq algorithm is applied to optimize the putative RC. The degree ofpolynomial initially is set at 2 to limit the ﬂexibility of RC to avoid its quick division intotwo segments. The degree is gradually increased during optimization. NPNEq optimizationis continued either until a segment 0 . < q < .

55 had fewer than 100 points, i.e., theRC is close to be divided into two segments (stopping condition one), or until the iterationnumber reached 1000 (stopping condition 2). The NPNEq optimization exited after 566iterations with stopping condition one. The distribution of points on the q-R plane on Fig.6a shows that optimization has stratiﬁed points according to the putative committor or thatthe initial and ﬁnal parts of the committor are determined relatively accurate. By selectingpoints from the diﬀerent regions along q , diﬀerent regions of the conﬁgurations space can besampled more uniformly. Second Iteration . 10000 points are drawn uniformly along ˜ q - the putative committortransformed to the natural coordinate. These points are used to start 10000 short trajectoriesof length 10 with time step of 0 . F ( R ) on Fig. 6b shows that some of these trajectoriesvisited the TSE. The same NPNEq algorithm is applied to optimize the putative RC. Thealgorithm terminated after 1000 iterations with stopping condition two, i.e., the optimizationis robust with no division of the RC. The distribution of points on the q-R plane on Fig.6b shows that they cover all of the committor. Thus, the TSEE algorithm converged on thesecond iteration. We have described non-parametric non-equilibrium approaches to accurately determine thecommittor function and re-weighting factors from non-equilibrium simulations. Given arepresentative and extensive sampling of the conﬁguration space, e.g., a large ensembleof short trajectories, the proposed approaches can be used to determine the equilibrium29ree energy proﬁle as a function of the committor. The proﬁle, together with the positiondependent diﬀusion coeﬃcient, specify a diﬀusive model of the equilibrium dynamics. Themodel can be used to compute the following important properties of the dynamics exactly :the equilibrium ﬂux, the mean ﬁrst passage times, and the mean transition path timesbetween any two points on the committor.

The power of the approach was illustratedon a model 50-dimensional system and a realistic protein folding trajectory.In application to the eigenvectors optimization problem, the obtained equations are sim-ilar to those obtained in the EDMD approach.

Here, however, these equations describe asingle iteration of the iterative optimization process, which leads to the following advantages.A major weakness of the parametric approaches, e.g., those using a linear combinations ofmolecular descriptors/features or a deep neural network, is the choice-of-basis (choice of func-tional form) problem. While it was argued that ”the expressive power of neural networksprovides a natural solution to the choice-of-basis problem”, ﬁnding an optimal architec-ture of a neural network and input variables are diﬃcult tasks. While intuition can help tosolve the problem for low-dimensional model systems, the diﬃculty in the case of complexrealistic systems becomes apparent, when one remembers that such a function should beable to accurately project a few million snapshots of a very high-dimensional trajectory. Inparticular, it implies an extensive knowledge of the system, and that an acceptable solutionis likely to be system speciﬁc. The developed non-parametric approaches can approximateany reaction coordinate with high accuracy. While each iteration may depend on the exactchoice of the family of collective variables/molecular descriptors/features, the ﬁnal reactioncoordinate does not, since it provides optimum to a (non-parametric) target functional, whenthe optimization converges. We assume, of course, that the employed input variables pro-vide all the important information. For the analysis of biomolecular simulations one cansuggest the inter atom distances, or the sines and cosines of internal angles as the standardsets of input variables. The developed non-parametric approaches are able to accuratelyapproximate the committors and eigenvectors of realistic systems at the shortest timescales30f trajectory sampling interval of 0 . Also, one of the reasons of using re-weighteddata in eigenvector approximation is to avoid complex eigenvalues/eigenvectors since theylack interpretability. This strategy assumes that the re-weighting factors can be accuratelydetermined for every trajectory point, which is a diﬃcult task. The problem of complexvalues, however, has a simple solution in the iterative optimization. First, since the numberof basis functions used during each iteration is rather small, the statistical noise is small andthe occurrence of complex eigenvalues/eigenvectors is an infrequent event. Thus one caneither skip such an iteration or accept it, truncating complex variables to real parts.Note, that while we call the approaches non-parametric, emphasizing that we focus onRC as a function of trajectory (trajectory time or trajectory snapshot), r ( i ∆ t ), rather thanas a function of conﬁguration space, r ( X ), it is possible to record all the RC transformationsduring iterative optimization (training) and apply them later to new (test) data, e.g., forcross-validation. Alternatively, one can perform cross-validation on the ﬂy, by computingparameters of RC transformations on the train part of the data, while applying these trans-formations to the train and test parts of the data. It can be trivially implemented by setting I t ( t ) = 0 for the test data.It is instructive to compare diﬀerent descriptions of molecular dynamics, e.g., using com-mittors vs using eigenvectors as reaction coordinates for free energy landscapes or usingeigenvectors to approximate the evolution (forward, backward or Koopman) operators of thedynamics. They all have strong and week points. For example, if it is suﬃcient to know justsuch important quantities of dynamics as the equilibrium ﬂux, the mean ﬁrst passage timesor the mean transition path times between two states of interest, e.g., folded and unfoldedstates or bound and unbound states, then the diﬀusive model along the committor allowsone to determine these properties exactly (between any two points along the committor).This result is valid for any system, irrespective of complexity of its free energy landscape,and does not assume the separation of timescales. The diﬀusive model can be used todetermine, rigorously and in a direct manner, the free energy barrier and the pre-exponential31actor - the major determinants of molecular kinetics. Distribution of transition-path timesis an example of quantity that can not be accurately determined from the diﬀusive mod-els, in general. If one assumes the separation of timescales, then the projected dynamicsbecomes Markovian, and the diﬀusive model provides it complete description. A set of slow-est eigenvectors/eigenfunctions (basis) can provide a close approximation to the evolutionoperators and thus can be used to compute accurately many properties of the dynamics.One, however, may require a relatively large basis set to accurately estimate the quantities,that can be computed exactly by the diﬀusive model along the committor, that requiresthe determination of just one optimal coordinate. Also, it is not straightforward to visu-alize an approximated evolution operator, while a free energy landscape as a function ofone or two optimal reaction coordinates, provides a clear, intuitive and quantitative pictureof the dynamics. Another diﬀerence is that iterative optimization of committors is robust,while that of eigenvectors has an inherent instability. In the committor case, one seeksan optimal coordinate between two given states (a variant of supervised learning or ratherreinforcement learning). The eigenvector optimization can be considered as a variant ofunsupervised learning: one seeks eigenvectors with smallest eigenvalues, which describe theslowest dynamics. However, some of such eigenvectors are not of interest. For example, inprotein folding, such an eigenvector could describe a much slower torsion angle isomerizationprocess.

Another, more likely possibility, is due to a limited sampling, especially in thecase of many short trajectories. There are many parts of the conﬁguration space that werevisited only once, and eigenvectors describing those transitions have small eigenvalues. Thus,starting with an eigenvector of interest, the iterative approach may eventually converge toan eigenvector, with smaller eigenvalue, but of no interest. To determine the committor,one needs to specify two boundary states. Proper deﬁnition of such states is a diﬃcultproblem. For example, a natural approach of using the rmsd from a structure may lead toinaccuracies and hide complexity of the free energy landscapes. The problem is likely tobe more severe for systems with complex free energy landscape, e.g., intrinsically disordered32roteins. One general strategy of blind, unbiased analysis of dynamics, that uses strongpoints of both eigenvectors and committors is as follows. First, eigenvectors, even not com-pletely optimized/converged, are used for an exploratory analysis of free energy landscapes,e.g., to locate and deﬁne the boundary states. This is followed by the determination of thecommittors between these states and the corresponding equilibrium free energy proﬁles.The described non-parametric approaches have only two assumptions - representativesampling and that the underlying dynamics is Markovian in the conﬁguration space. Forexample, for atomistic MD simulations, where the dynamics is Newtonian at the integra-tion time-step, the sampling/saving interval ∆ t needs to be suﬃciently large, so that thedynamics have no memory about the momenta. In principle, shorter sampling intervals canbe employed if dynamics in the phase space is considered, i.e., the committor is a functionof positions and momenta, however it is not yet clear, if it can bring signiﬁcant advantages.Since representative sampling does not need to cover exhaustively the entire conﬁgurationspace, the approaches do not suﬀer from the curse of dimensionality. It suggests that theseapproaches can be used to investigate dynamics of large biomolecular systems in a rigorousand accurate way. The approaches allows straightforward parallelization and can be adaptedfor exascale computing.By a representative sampling we mean such a sampling which provides a representa-tive, but not exhaustive/complete sampling of all the important regions of the conﬁgurationspace. By an extensive sampling we mean a sampling of such a size that overﬁtting bythe non-parametric approaches is not possible or negligible. In case, when the sampling isnot extensive, i.e., some regions do get overﬁtted, it can be straightforwardly rectiﬁed byperforming many short simulations starting from the conﬁgurations in the overﬁtted regions.We have suggested how one can generate such a representative and extensive samplingin a number of realistic practical scenarios, e.g., in tandem with many developed enhancedsampling techniques. We have also described a generic approach, the transition state en-semble enrichment, TSEE, which generates such a representative and extensive sampling in33n iterative, self-consistent manner, by ”growing” committor segments towards each otherstarting from the boundary states.The developed non-parametric approaches determine values of a speciﬁc optimal RC(e.g., the committor) for an ensemble of conﬁgurations, without using any system speciﬁcinformation. They can be considered analogous to linear algebra routines (e.g. the LAPACKlibrary ), where given, for example, a matrix, one can obtain numerical values of eigenvectorcomponents. Here, however, the task is complicated by the fact that the transition probabil-ity matrix is not given explicitly; only an ensemble of trajectories is provided. In particular,one can not compute the matrix-vector product, Av , the basic operation in iterative linearalgebra methods. Also, the conﬁguration space is continuous, meaning that we are dealingwith an inﬁnite-dimensional problem, which is somewhat simpliﬁed by considering a largerepresentative sample of points instead. In addition, the non-parametric approaches to de-termine the eigenvectors, require additional eﬀorts to suppress the ’inherent instability’. However, the developed approaches show that these problems are solvable, and further de-velopment of the framework should deliver rigorous, robust and eﬃcient tools to solve thesampling problem.

Another way to derive Eq. 12 is by using the Galerkin condition. Consider a variationof the RC, approximating the committor function, that satisﬁes the boundary conditions: x ( i ) + δx ( i ) = x ( i ) + ˜ I b ( i ) (cid:80) j α j f j ( i ), where x ( i ) satisﬁes the boundary condition x ( A ) = 0and x ( B ) = 1, while ˜ I b ( A ) = 0, ˜ I b ( B ) = 0 and ˜ I b ( i ) = 1 otherwise, and f j are the basisfunctions. The error vector (cid:15) ( j ), or the vector of residuals, is deﬁned as (cid:88) i [ x ( i ) − x ( j )] n ( i | j, ∆ t ) = (cid:15) ( j ) , for j (cid:54) = A, B (21)34or the committor function (cid:15) = 0. The optimal variation is deﬁned by the Galerkin condition:the error vector is orthogonal to all the basis functions of the variation (cid:15) ⊥ f k (cid:88) j [ (cid:88) i [ x ( i ) + δx ( i ) − x ( j ) − δx ( j )] n ( i | j, ∆ t )] ˜ I b ( j ) f k ( j ) = 0 , (22)where we used ˜ I b to extend the summation to all j . This system of equations is translatedto the RC time-series as follows (cid:88) t [ r ( t + ∆ t ) + δr ( t + ∆ t ) − r ( t ) − δr ( t )] f k ( t ) I t ( t ) I b ( t ) = 0 , (23)which leads to Eq. 12.Yet another way to derive Eq. 12 is to consider the following optimization functionalmin x (cid:88) j (cid:54) = A,B n ( j )[ x ( j ) − (cid:88) i P ( i | j, ∆ t ) x ( i )] (24a) x ( A ) = 0 , x ( B ) = 1 (24b)The functional equals (cid:80) j (cid:15) ( j ) /n ( j ) and attains its minimum when (cid:15) ( j ) = 0, which givesthe committor equation (Eq. 1). The functional does not assume the detailed balance. Min-imization of such a functional is a standard approach of solving a linear system of equations(for committor) iteratively. This functional, however, can not be expressed in terms of RCtime-series r ( i ∆ t ), and thus can not be used for non-parametric optimization. Considernow the modiﬁed optimization problemmin x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) j (cid:54) = A,B n ( j )[ x ( j ) − (cid:88) i P ( i | j, ∆ t ) x (cid:48) ( i )] (25a) x ( A ) = 0 , x ( B ) = 1 (25b)While the entire functional can not be expressed in terms of RC time-series, the part that35epends on x can be expressed. The other part is not important, as it depends solely on x (cid:48) and is ﬁxed during optimization. When expressed in terms of RC time-series, it equals (cid:80) t r ( t ) I t ( t ) − r (cid:48) ( t + ∆ t ) r ( t ) I t ( t ), i.e., it is equal to Eq. 9 up to the term r (cid:48) ( t + ∆ t ) I t ( t ),which is also held constant and disappears after diﬀerentiation. It means that Eq 12 (theNPNEq algorithm) can be also interpreted as iterative solving of the (more conventional)optimization problem of Eqs 24. Z q criterion Consider a functions ”conjugated” or time-reversed to Z q ∂Z Tq ( x, ∆ t ) ∂x = (cid:88) ij δ ( x − x ( i ))( x ( j ) − x ( i )) n ( i | j, ∆ t ) (26)For the half-sum of the two functions one obtains ∂∂x Z q ( x, ∆ t ) + Z Tq ( x, ∆ t )2 = (cid:88) ij [ δ ( x − x ( j )) − δ ( x − x ( i ))][ x ( i ) − x ( j )] n ( i | j, ∆ t ) / n ( i | j, ∆ t ) transitions from j to i . Then, if x ( i ) > x ( j ), one obtains a rectangular pulse from x ( j ) to x ( i ) of height n ( i | j, ∆ t ) / × | x ( i ) − x ( j ) | . If x ( i ) < x ( j ), one obtains a rectangular pulse from x ( i ) to x ( j )of height n ( i | j, ∆ t ) / × | x ( i ) − x ( j ) | . But this is exactly the deﬁnition of Z C, . Thus, Z q ( x, ∆ t ) + Z Tq ( x, ∆ t )2 = Z C, ( x, ∆ t ) . For equilibrium dynamics, where n ( i | j, ∆ t ) = n ( j | i, ∆ t ), one ﬁnds Z q = Z Tq = Z C, .36 .3 Non-parametric determination of eigenvectors from non-equilibriumsampling The re-weighting factors can also be considered as the components of the ﬁrst right eigen-vector (with λ =1) of a non-equilibrium version of the transfer operator n ( i | j, ∆ t ) /n ( i ): (cid:88) j n ( i | j, ∆ t ) u ( j ) = λn ( i ) u ( i ) (28)They are related to the right eigenvectors of the equilibrium transfer operator T ( i | j, ∆ t ) = P ( i | j, ∆ t ) π ( j ) /π ( i ) = n ( i | j, ∆ t ) w ( j ) / [ n ( i ) w ( i )] as (cid:88) j T ( i | j, ∆ t )[ u ( j ) /w ( j )] = λ [ u ( i ) /w ( i )] , (29)i.e., eigenvectors of the transfer operator can be obtained as eigenvectors of Eq. 28 dividedby the re-weighting factors w ( i ) (the ﬁrst eigenvector of Eq. 28). The eigenvectors of Eq.28 can be found as the solution of optimization problemmax x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) ij x ( i ) n ( i | j, ∆ t ) x (cid:48) ( j ) (30a) (cid:88) i n ( i ) x ( i ) = 1 (30b)which is translated to RC time-seriesmax r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t r ( t + ∆ t ) r (cid:48) ( t ) I t ( t ) (31a) (cid:88) t r ( t ) I t ( t ) = 1 (31b)Taking RC variations as r ( t ) = (cid:80) j α j f j ( t ) and following steps analogous those used toderive the NPNEq equations one obtains the following equations (the generalized eigenvalue37roblem) for the optimal parameters A kj α (cid:63)j = λB kj α (cid:63)j (32a) A kj = (cid:88) t f k ( t + ∆ t ) f j ( t ) I t ( t ) (32b) B kj = (cid:88) t f k ( t ) f j ( t ) I t ( t ) (32c)The left eigenvectors of the transition probability matrix (cid:88) j v ( j ) P ( j | i, ∆ t ) = λv ( i ) (33)can be found as the solution to optimization problemmax x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) ij x (cid:48) ( j ) n ( j | i, ∆ t ) x ( i ) (34a) (cid:88) i n ( i ) x ( i ) = 1 (34b)which is translated to RC time-seriesmax r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t r (cid:48) ( t + ∆ t ) r ( t ) I t ( t ) (35a) (cid:88) t r ( t ) I t ( t ) = 1 (35b)with the following equations on optimal parameters[ A T ] kj α (cid:63)j = λB kj α (cid:63)j , (36)where matrices A and B are deﬁned in Eq. 32. If the dynamics is inherently reversible orequilibrium (though the sampling may be not), i.e., P ( i | j, ∆ t ) π ( j ) = P ( j | i, ∆ t ) π ( i ), then T ( i | j, ∆ t ) = P ( j | i, ∆ t ) and v ( j ) in Eq. 33 are the right eigenvectors of the transfer operator.38hus, Eq. 36 can be used to determine the eigenvectors of the transfer operator withoutusing the re-weighting factors, for the dynamics which is inherently reversible or equilibrium,which is usually assumed for molecular simulations. Eqs. 32 and 36 are similar to equationsfor obtaining linear combinations of feature variables best approximating eigenvectors of theKoopman operator. Further discussion on how to select basis functions or how to suppress possible instabilityduring iterative optimization of eigenvectors can be found in Ref. 21.

Supporting Information.

Jupyter notebooks containing the analyses are provided in a single zip archive.

References (1) Kohlhoﬀ, K. J.; Shukla, D.; Lawrenz, M.; Bowman, G. R.; Konerding, D. E.; Belov, D.;Altman, R. B.; Pande, V. S. Cloud-based simulations on Google Exacycle reveal ligandmodulation of GPCR activation pathways.