NNon-Parametric Analysis of Non-EquilibriumSimulations
Sergei V. Krivov ∗ Astbury Center for Structural Molecular Biology, Faculty of Biological Sciences, Universityof Leeds, Leeds LS2 9JT, United Kingdom
E-mail: [email protected]
Abstract
We extend the non-parametric framework of reaction coordinate optimization tonon-equilibrium ensembles of (short) trajectories. For example, we show how, start-ing from such an ensemble, one can obtain an equilibrium free energy profile alongthe committor, which can be used to determine important properties of the dynamicsexactly. New adaptive sampling approach, the transition state ensemble enrichment,is suggested, which samples the configuration space by ”growing” committor segmentstowards each other starting from the boundary states. This framework is suggestedas a general tool, alternative to the Markov state models, for a rigorous and accurateanalysis of simulations of large biomolecular systems, as it has the following attractiveproperties. It is immune to the curse of dimensionality, it does not require systemspecific information, it can approximate arbitrary reaction coordinates with high accu-racy and it has sensitive and rigorous criteria to test optimality and convergence. Theapproaches are illustrated on a 50-dimensional model system and a realistic proteinfolding trajectory. a r X i v : . [ phy s i c s . c h e m - ph ] F e b Introduction
One general strategy in overcoming the sampling problem in biomolecular simulations con-sists of simulating a very large ensemble of short trajectories rather than a singe long tra-jectory. This strategy allows seamless parallelization and is a promising approach towardssimulations employing exascale or cloud computing.
Adaptive sampling approaches canbe considered as an extension of this strategy, where one, for example, improves samplingin less sampled parts of configuration space, or parts that produce largest error or controlsthe exploration/exploitation balance.
The swarms of trajectories is another successfulvariation of this idea.To analyze such ensembles of short trajectories one commonly employs the Markov statemodel (MSM) framework.
Assuming that the sampling is sufficiently extensive, andusing a fine-grained clustering of the configuration space of the system, one can estimate thetransition probability matrix. For large system, as configuration space size grows exponen-tially with system size, dimensionality reduction is performed before clustering. Knowingthe matrix, one can compute many important properties of the equilibrium dynamics, forexample, the equilibrium probabilities/populations, fluxes and rates. The minimal lag timewhen a MSM becomes approximately Markovian, which can be estimated by the conver-gence of implied timescales or by Chapman-Kolmogorov criterion, is a good indicator of theaccuracy of the constructed model. The shorter is the lag time, the shorter are the trajec-tories, required to construct the MSM, and the larger is the possible speedup over a direct,brute-force simulation. State of the art approaches have lag times in the range of tens ofnanoseconds.
Recently, we have suggested non-parametric approaches, which can determine thecommittor and a few slowest eigenvectors, that pass stringent validation tests at much shorterlag time of trajectory sampling interval of 0 . as they use no system specific information and thus do not require an extensive expertisewith the system. In particular, they do not require a functional form with many parameters2o closely approximate a reaction coordinate (RC), e.g., a linear combinations of moleculardescriptors or a deep neural network, and can approximate any RC with high accuracy.The approaches however were restricted to long equilibrium trajectories. Here we report theextension of the framework to non-equilibrium ensembles of trajectories and suggest it as ageneral framework, alternative to the MSM, for a rigorous and accurate analysis of dynamicsof large biomolecular systems.We describe approaches which can be used to analyze such non-equilibrium ensembles oftrajectories, to determine the following important descriptors/properties of the dynamics:the committor function, the re-weighting factors (related to the equilibrium probabilities),the eigenvectors of the equilibrium and non-equilibrium transfer operators and that of thetransition probability. In particular, we show how one can determine the equilibrium freeenergy profile as a function of the committor, which can be used to determine exactly suchimportant properties of the dynamics as the equilibrium flux, the mean first passage times,and the mean transition path times between any two points on the committor. One way to analyze non-equilibrium simulations consist in computing the re-weightingfactors first, and then use them to re-weight the sampling, thus essentially reducing theproblem to the equilibrium case. Such a straightforward approach, however, has the fol-lowing shortcoming. The accuracy of the analysis depends on the accuracy of the obtainedre-weighting factors. Thus, it requires an approach capable of determining the re-weightingfactors for every trajectory point robustly and accurately, which is a very difficult task. Here,we present approaches that do not assume the existence of the re-weighting factors, and thusfree of the shortcoming.The paper is as follows. We start by reviewing the non-parametric framework for equilib-rium simulations. Next, equations to determine the committor function from non-equilibriumsimulations are derived. They are followed by derivation of equations to determine re-weighting factors. The power of the developed approaches is illustrated on two examples:a 50-dimensional model system and a realistic protein folding trajectory. Next, we discuss3 number of realistic practical scenarios of how the developed approaches can be combinedwith existing enhanced/adaptive sampling techniques. We then describe a generic adaptivesampling approach, the transition state ensemble enrichment, TSEE, which is based on thedeveloped non-parametric approaches. The performance of the TSEE is illustrated on the50-dimensional model system. We end with a concluding discussion.
A rigorous way to analyze dynamics produced by biomolecular simulations is to describe itas a diffusion on a free energy landscape, free energy as a function of RCs. The simulationtrajectory is projected onto a RC by computing the RC time-series r ( t ) as a function oftime, which is used to determine the corresponding free energy landscape and diffusioncoefficient. For such a description to be quantitatively accurate, the RCs should be chosen inan optimal way. The committor function is an example of such a RC, that can be usedto compute some important properties of the dynamics exactly.
The eigenvectors (EVs)of the transfer operator are another example. Recently we have developed non-parametricapproaches to accurately determine such coordinates from a long equilibrium trajectory.
Here we describe how to extend this framework to non-equilibrium simulations, makingpossible to use these approaches for enhanced/adaptive sampling.
The overall idea of iterative non-parametric RC optimization is as follows.
We start witha seed RC time-series r ( t ). During each iteration we consider a variation of RC as r ( t )+ δr ( t ),where δr ( t ) can be (time-series of) any function of configuration space, collective variablesand the RC itself. For example, one can take δr ( t ) = f ( r ( t ) , y ( t )), where y ( t ) is time-seriesof a randomly chosen coordinate of configuration space X i ( t ) or a randomly chosen collectivevariable and f ( r, y ) = (cid:80) lm α lm r l y m is a low degree polynomial. The coefficients/parameters4f the variation are chosen such that r ( t ) + δr ( t ) provides the best approximation to thetarget optimal RC (e.g., committor). Specifically, they deliver optimum to a specific targetfunctional α (cid:63) = arg min α I ( r + δr ). The RC time-series is updated r ( t ) ← r ( t ) + δr (cid:63) ( t ),where δr (cid:63) ( t ) is the optimal variation, i.e., δr (cid:63) ( t ) = (cid:80) lm α (cid:63)lm r l ( t ) y m ( t ). Iterating the processone repeatedly improves the putative RC time-series by incorporating information containedin different coordinates or collective variables. Alternatively, by repeating the iterations, thetarget functional is optimized by considering variations along different coordinates. For thetarget functionals considered here, the optimal coefficients α (cid:63) are found as solutions of linearsystems of equations.While each iteration may depend on the exact choice of the family of collective variables y or the parametrization of the variation δr ( t ), the final RC does not, since it providesoptimum to a (non-parametric) target functional, when the optimization converges. In thissense such an approach is non-parametric.If the system obeys some symmetry (e.g., the rotational and translational symmetriesfor biomolecules), then the optimal RC should obey the same symmetry. A simple way toensure this is to use as y , variables that respect the symmetry, for example, the distancesbetween randomly chosen pairs of atoms y ( t ) = r ij ( t ).The equations for non-parametric RC optimization are derived in the following sequenceof steps. We first find a variational principle when the system dynamics is described by afinite Markov chain (a MSM). Next the variational principle is reformulated in terms of RCtime-series. If this is possible, it means, that there is no need to construct/consider a finiteMarkov chain and one can operate using just RC time-series. By varying the functional weobtain the final equations for the optimal values of the parameters α (cid:63) .We employ the framework of finite Markov chains to describe Markov dynamics in theconfiguration space. While the configuration space in molecular simulations is continuousand rigorous treatment requires the usage of integral operators, we prefer finite Markovchains due to their convenience, simplicity and manifest invariance to the choice of coor-5inate systems. Moreover, a finite Markov chain can provide an excellent approximationto continuous configuration space. Since, we consider such a Markov chain as a theoreticalconcept, e.g., to derive the equations, and there is no need for an actual construction of sucha chain in practice, the number of states can be arbitrarily large as long as it stays finite. Forexample, a molecular system of interest can be embedded into a large box with boundaries at ± L along each coordinate. Each coordinate can be discretized with a very fine step of, say, d ∼ .
01 ˚A. Since the molecular dynamics simulations are usually performed by numericalintegrating the Newtons equations of motion, the dynamics, at the timescales close to thesimulation time step, is Markovian in the phase space not the configuration space. At longertimescales the dynamics loses memory about the momenta and can be considered approxi-mately Markovian in the configuration space. We assume that given simulation trajectoriesare recorded with such or longer sampling interval, since we are mainly interested in the de-termination of optimal RCs as functions of the configuration space. However, it is possible,in principle, to apply the developed approaches to determine optimal RCs as functions ofphase space at a shorter sampling interval.
We first review the approach for equilibrium trajectories.
We use the following no-tation: X ( i ∆ t ) denotes a long equilibrium multidimensional trajectory, where ∆ t is thetrajectory sampling interval; x ( i ) denotes an arbitrary RC as a function of MSM state i ,while q ( i ) is reserved for the committor; r ( i ∆ t ) is an arbitrary RC as a function of trajec-tory snapshot or time along the trajectory or, shortly, a function of trajectory; again q ( i ∆ t )is reserved for the committor. Here we describe how, given X ( i ∆ t ), one can determineputative time-series r ( i ∆ t ), which closely approximates the committor q ( i ∆ t ).Assume that, by using a fine-grained clustering, we are able to construct an accurateMarkov state model, with transition probability matrix defined as P ( i | k, ∆ t ) = n ( i | k, ∆ t ) /n ( k ),6here P ( i | k, ∆ t ) is the transition probability from state k to state i after time-interval (lagtime) ∆ t , n ( i | k, ∆ t ) is the number of transition from state k to state i after time interval∆ t and n ( k ) = (cid:80) i n ( i | k, ∆ t ).The committor function satisfies the following equation (cid:88) i [ q ( i ) − q ( k )] P ( i | k, ∆ t ) = 0 , for k (cid:54) = A, B (1a) q ( A ) = 0; q ( B ) = 1 . (1b)Consider the following optimization problem:min x (cid:88) ij [ x ( i ) − x ( j )] n ( i | j, ∆ t ) (2a) x ( A ) = 0; x ( B ) = 1 , (2b)here, x ( i ) is an arbitrary RC as a function of state i . By differentiation with respect to x ( k ),and using the detailed balance condition n ( i | j, ∆ t ) = n ( j | i, ∆ t ) one obtains Eq. 1, i.e., thecommittor function provides the minimum to the functional 2.Before reformulating Eq. 2 optimization problem in terms of RC time-series lets introducea convenient abbreviation for the sums like (cid:80) T/ ∆ ti =0 f ( i ∆ t ) g ( i ∆ t + ∆ t ), where ∆ t equals ∆ t or its multiple and T = N ∆ t is trajectory length. If ∆ t = k ∆ t , it means that only 1 /k -thfraction of points in the trajectory are used. To use all the points in the trajectory onecan average over the starting point as 1 /k (cid:80) k − j =0 (cid:80) T/ ∆ ti =0 f (( ik + j )∆ t ) g (( ik + j )∆ t + k ∆ t ),which equals 1 /k (cid:80) T − ∆ ti =0 f ( i ∆ t ) g ( i ∆ t + ∆ t ). We denote such a sum as (cid:80) t f ( t ) g ( t + ∆ t ).Even though we will mainly use ∆ t = ∆ t , the notation allows the consideration of arbitrarylag times.The optimization problem Eqs. 2 is translated to RC time-series r ( i ∆ t ) (for lag time7 t ) as min r (cid:88) t [ r ( t + ∆ t ) − r ( t )] (3a) r ( t ) = 0 , X ( t ) ∈ A ; r ( t ) = 1 , X ( t ) ∈ B, (3b)here r ( i ∆ t ) is an arbitrary RC as a function of trajectory. The total squared displacementfunctional in Eq. 3a, which is optimized, is referred later as ∆ r for brevity. Here andbelow we assume ∆ t = ∆ t unless stated otherwise. The theoretical minimum value ofthe functional, attained for r = q , equals ∆ q = 2 N AB , where N AB is the total numberof transitions from state A to B, or from B to A. Thus, if during RC optimization ∆ r / N AB , it follows that the putative RC closely approximates the committor.To satisfy the constraint Eq. 3b during optimization, we, first construct the seed RCtime-series that satisfies the constraint, and second, during optimization, we keep positionsof these points fixed by setting δr ( t ) = 0 for them. Lets introduce boundary indicatorfunction I b ( t ), which equals 1 when point X ( t ) belongs to a boundary and is thus fixedduring optimization, and zero otherwise. ˜ I b ( t ) = 1 − I b ( t ) is its negative. The variationof the putative time series, which keeps the positions of points/frames in boundary statesfixed can be taken as r ( t ) + δr ( t ) = r ( t ) + ˜ I b ( t ) (cid:80) j α j f j ( t ), where f j ( t ) are basis functions,that are discussed below. Optimal coefficients α (cid:63) , which give the best approximation tothe committor for the considered variation, can be found by equating the derivative of thefunctional with respect to α k to zero: (cid:88) t [ r ( t + ∆ t ) − r ( t ) + δr ( t + ∆ t ) − δr ( t )][ f k ( t + ∆ t ) ˜ I b ( t + ∆ t ) − f k ( t ) ˜ I b ( t )] = 0 , (4)or more compact (cid:88) t [∆ r ( t ) + ∆ δr ( t )]∆[ f k ( t ) ˜ I b ( t )] = 0 , (5)where operator ∆ denotes the forward time difference ∆ f ( t ) = f ( t + ∆ t ) − f ( t ). It equals8he following system of linear equations (cid:88) j A kj α (cid:63)j = b k (6a) A kj = (cid:88) t ∆[ f k ( t ) ˜ I b ( t )]∆[ f j ( t ) ˜ I b ( t )] (6b) b k = − (cid:88) t ∆ r ( t )∆[ f k ( t ) ˜ I b ( t )] (6c)As basis functions f j ( t ) one can take the terms of a low-degree polynomial, i.e., r l ( t ) y m ( t ) for l + m ≤ n and y ( t ) is a randomly chosen coordinate of the configuration space, y ( t ) = X i ( t ),or a collective variable. To focus optimization on a particular region of RC, one can modulatethe polynomial terms by a common envelop. For example, to focus optimization on pointsaround r on the RC, one can use e −| r ( t ) − r | /d × r l ( t ) y m ( t ), where d is some small numberthat defines the scale. This option is useful to optimize regions corresponding to free energyminima along the committor, which get exponentially shrunk. Generally, the higher is the degree of the polynomial, the faster is the optimization,though more computationally demanding. However a very high degree may lead to numericalinstabilities and strong overfitting. The following strategy was found useful: use a polynomial f ( r, y ) with a relatively small degree (3-6) for updates involving r ( t ) and y ( t ) followed by apolynomial f ( r ) of a high degree (e.g., 16) for updates involving only r ( t ).The basic algorithm (which we call NPq) is as follows. Initialization: a seed RC isconstructed, which satisfies the boundary constrains, for example, r ( t ) = 0 if X ( t ) ∈ A , r ( t ) = 1 if X ( t ) ∈ B and r ( t ) = 0 . Iterations: one selects times-series y ( t )(a randomly chosen coordinate of configuration space X or a collective variable), computesbasis functions, solves Eqs. 6 and updates r ( t ). Stopping: iterations stop when ∆ r / N AB - the number of transitions from state A to state B or thatfrom B to A .If the system has been extensively sampled, and overfitting is not possible, then, as9utative ∆ r / N AB , the RC should closely approximate the committor. To confirmthat, one can use the the Z C, validation/optimality criterion for the committor. Z C, can be straightforwardly computed from time-series r ( i ∆ t ): each transition of trajectoryfrom x = r ( i ∆ t ) to x = r ( i ∆ t + ∆ t ) adds 1 / | x − x | to Z C, ( x, ∆ t ) for all points x between x and x . Validation:
If a putative RC closely approximates the committor,then Z C, ( x, ∆ t ) ≈ N AB for all x and ∆ t , where Z C, ( x, ∆ t ) are computed using transitionpath segment summation scheme. Optimality: for a suboptimal RC, Z C, ( x, ∆ t ) valuesgenerally decrease to the limiting value of N AB , as ∆ t increases. The larger the differencebetween Z C, ( x, ∆ t ) and Z C, ( x, ∆ t ) the less optimal the RC around x . Jupyter notebooksillustrating usage of Z C,α profiles for RC analyses and, in particular, as the committor andeigenvector criteria are available at https://github.com/krivovsv/CFEPs. For realistic systems with limited sampling, this simple algorithm may start to overfitthe RC in some regions and underfit in other. One way to overcome this problem is tomake optimization adaptive, by focusing optimization on less optimized spatio-temporalregions. Here we consider another strategy - to use adaptive sampling in order to improvesampling in regions that are overfit or undersampled. Since adaptive sampling is no-longerequilibrium, and the described approach assumes the detailed balance, we describe a newapproach applicable to non-equilibrium sampling.
We assume that we are given a non-equilibrium ensemble of (short) trajectories. Whileeach trajectory was simulated by following the unperturbed or natural dynamics of interest,the starting configurations are chosen arbitrarily, for example, according to an enhanced oradaptive sampling scheme.We employ the following representation of a non-equilibrium ensemble of trajectories.All the short trajectories are concatenated into a single long trajectory X ( i ∆ t ) combined10ith itraj( i ∆ t ) index function which maps frames to the trajectory numbers they belongto. Our aim is to determine putative time-series r ( i ∆ t ), which closely approximates thecommittor q ( i ∆ t ).Analysis of such non-equilibrium ensembles of trajectories by the MSM formalism iscarried out without modification. One determines the transition numbers n ( i | j, ∆ t ) andtransition probability matrix P ( i | j, ∆ t ), which can be used, e.g., to determine the committorusing Eq. 1 or the equilibrium probability. The non-parametric approach, however, needsmodifications, as the detailed balance is not satisfied in such non-equilibrium ensembles, i.e., n ( i | j, ∆ t ) (cid:54) = n ( j | i, ∆ t ) and the minimum of Eq. 2 is no longer provided by the committorfunction. To find a functional for non-equilibrium case, i.e., a functional whose minimumis provided by the committor function, which does not assume the detailed balance andwhich can be expressed in terms of RC time-series, we used the following trick. Consider thefollowing optimization problem,min x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) ij [ x (cid:48) ( i ) − x ( j )] n ( i | j, ∆ t ) (7a) x ( A ) = 0 , x ( B ) = 1 , (7b)where min x (cid:12)(cid:12)(cid:12) x (cid:48) = x sign means that we optimize by varying x , while variables x (cid:48) are fixed duringoptimization and are updated as x (cid:48) = x straight after, then the optimization cycle is repeateduntil converged. For example, assume that we minimize the functional by the steepest-descent algorithm (SD), i.e., by iteratively making steps against the gradient: x ( k ) = x ( k ) − γ ∇ k , where ∇ k is gradient and γ is the step size. The SD will stop when the gradient is zero ∇ k = − (cid:80) i [ x (cid:48) ( i ) − x ( k )] n ( i | k, ∆ t ) = 0 , for k (cid:54) = A, B (8)Now, introduce the update of x (cid:48) variables, after every SD step, as x (cid:48) = x . Since, we iterativelydecrease a positive functional, the process should converge, hence we let x (cid:48) = x in Eq. 8,11nd obtain that x is the committor (Eq. 1).Before translating Eq. 7 functional to RC time-series terms, we update our nota-tion to take into account summation over trajectories in the ensemble. Consider sum (cid:80) N tr k =1 (cid:80) T k − ∆ ti =0 f k ( i ∆ t ) g k ( i ∆ t + ∆ t ), where the first sum with index k , is the sum over N tr trajectories in the ensemble and the second sum with index i , is the sum along k -thtrajectory with length T k . We denote such a sum as (cid:80) t f ( t ) g ( t + ∆ t ) I t ( t ), where the sumover t , is the sum over the long trajectory obtained by concatenating all the trajectories inthe ensemble, and I t ( t ) is indicator function, which equals 1 when f ( t ) and g ( t + ∆ t ) belongto the same short trajectory, i.e., itraj( t ) = itraj( t + ∆ t ), and zero otherwise. I t ( t ) kills allthe cross-trajectories terms, ensuring that only terms, where f ( t ) and g ( t + ∆ t ) are fromthe same trajectory, contribute to the sum. This short, intuitive notation, makes equationsbelow less cluttered.The optimization problem of Eq. 7 is translated to RC time-series terms as followsmin r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t [ r (cid:48) ( t + ∆) − r ( t )] I t ( t ) (9a) r ( A ) = 0 , r ( B ) = 1 (9b)Taking RC variation as r ( t ) + δr ( t ) = r ( t ) + ˜ I b ( t ) (cid:80) j α j f j ( t ) one obtains for the gradient ∂/∂α k = − (cid:88) t [ r (cid:48) ( t + ∆) − r ( t ) − δr ( t )] f k ( t ) ˜ I b ( t ) I t ( t ) (10)Instead of optimizing with the SD, which converges rather slow, one can find analytically12 (cid:63) , the optimal values of α , where ∂/∂α k = 0: (cid:88) j A kj α (cid:63)j = b k (11a) A kj = (cid:88) t f k ( t ) f j ( t ) ˜ I b ( t ) I t ( t ) (11b) b k = (cid:88) t [ r (cid:48) ( t + ∆ t ) − r ( t )] f k ( t ) ˜ I b ( t ) I t ( t ) (11c)One may attempt to speed up the convergence of the iterations further by letting r (cid:48) = r + δr in Eq. 10. Which leads to the following system of linear equations (cid:88) j A kj α (cid:63)j = b k (12a) A kj = (cid:88) t [ f j ( t ) ˜ I b ( t ) − f j ( t + ∆ t ) ˜ I b ( t + ∆ t )] f k ( t ) ˜ I b ( t ) I t ( t ) (12b) b k = (cid:88) t [ r ( t + ∆ t ) − r ( t )] f k ( t ) ˜ I b ( t ) I t ( t ) (12c)In the non-equilibrium case, in contrast to the equilibrium one, the lower bound of the ∆ r functional is not known, because it depends on the sampling. To monitor the convergence ofthe optimization process here, we suggest to adopt one of the metrics in iterative equationsolving - the increment size | x new − x old | . Since the optimization is stochastic ( y are selectedrandomly), we suggest to monitor the increment size during the last n iterations || r − r − n || = (cid:112)(cid:80) t [ r ( t ) − r − n ( t )] , to have a representative estimate; here subscript − n means n iterationsback.The basic NPNEq algorithm is similar to the equilibrium case and is as follows.
Ini-tialization: a seed RC is constructed, which satisfies the boundary constrains, for example, r ( t ) = 0 if X ( t ) ∈ A , r ( t ) = 1 if X ( t ) ∈ B and r ( t ) = 0 . Iterations: one se-lects times-series y ( t ) (a randomly chosen coordinate of configuration space X or a collectivevariable), computes basis functions, solves Eqs. 12 and updates r ( t ). Stopping: iterationsstop when the change of RC time-series during the last n iterations || r − r − n || is sufficiently13mall.We show in Appendix that Eq. 12 can be obtained in other ways. Using the Galerkincondition, where one minimizes the error terms (the residuals or the deviations from 0) inEq. 7, by making them orthogonal to the basis functions. Or, by minimizing the weightedsum of the error terms squared - a standard approach of solving system of linear equationsiteratively. Here we suggest a generalization of the Z C, criterion for the committor to the non-equilibriumcase. Consider function Z q ( x, ∆ t ), whose derivative equals ∂Z q ( x, ∆ t ) ∂x = (cid:88) ij δ ( x − x ( j ))[ x ( i ) − x ( j )] n ( i | j, ∆ t ) (13)It can be computed from RC time-series r ( i ∆ t ) as ∂Z q ( x, ∆ t ) ∂x = (cid:88) t δ ( x − r ( t ))[ r ( t + ∆ t ) − r ( t )] (14)If x ( i ) is the committor, i.e., satisfies Eq. 1, then by summing Eq. 13 over i , one obtainsthat the derivative is zero for all j but the boundary nodes. Which leads to the validationcriterion: Z q ( x, ∆ t ) is constant for the committor function for all x (but boundary nodes,see below) and ∆ t . Note that, in contrast to the equilibrium case, the constant value hereis not informative, as it is defined by the transitions from the state A and depends on thesampling. In Appendix we show that Z q is an outgoing part of the Z C, profile and Z q = Z C, for an equilibrium trajectory with the detailed balance.Note that, analogous to Z C, , Z q deviates from the constant value around the boundariesfor ∆ t > ∆ t . The deviations can be eliminated by employing the transition path segment14ummation scheme. However, since one expects the trajectories to be relatively short, thedeviations are expected to be small, and we do not see a significant advantage in introducingthis scheme here.If a putative RC deviates from the committor, then Z q derivative should deviate fromzero. However, it is not clear if the difference between the derivatives for two different ∆ t canserve as a measure of RC sub-optimality. Here, the equilibrium Z C, criterion is used for thatpurpose, which can obtained by re-weighting the non-equilibrium sampling, as demonstratedlater. Another quantity of interest in non-equilibrium sampling are the equilibrium probabilitiesor re-weighting factors. Having determine the transition matrix one can compute the equi-librium probabilities, π ( i ), as the solution of π ( i ) = (cid:88) j P ( i | j, ∆ t ) π ( j ) . (15)Introducing re-weighting factors w ( i ), which correct the non-equilibrium distribution π ( i ) = n ( i ) w ( i ), the equation can be written also as w ( i ) n ( i ) = (cid:88) j n ( i | j, ∆ t ) w ( j ) , (16)here n ( i ) = (cid:80) j n ( j | i, ∆ t ). For a single long equilibrium trajectory, where the number ofingoing and outgoing transitions for every node is equal, (cid:80) j n ( j | i, ∆ t ) = (cid:80) j n ( i | j, ∆ t ), w ( i ) = 1 is the solution - no re-weighting is necessary.The re-weighting factors do not represent a RC, as it makes little sense to project thedynamics on them. However, they can be determined by the developed formalism, and the15erminology of the formalism will be used for consistency. In particular, we will refer toarbitrary re-weighting factors as a RC and the correct re-weighting factor as the optimal RCdenoted by w (analogous to q for the committor). The aim here is to determine putativetime-series r ( i ∆ t ), which closely approximates the re-weighting factors w ( i ∆ t ).The corresponding optimization functional for Eq. 16 ismin x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) i n ( i ) x ( i ) / − (cid:88) ij x ( i ) n ( i | j, ∆ t ) x (cid:48) ( j ) (17)which is translated to RC time-seriesmin r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t I t ( t ) r ( t ) / − r ( t + ∆ t ) r (cid:48) ( t ) I t ( t ) . (18)Considering RC variation as r ( t ) + (cid:80) j α j f j ( t ) one obtains the following equations for theoptimal parameters (cid:88) j A kj α (cid:63)j = b k (19a) A kj = (cid:88) t [ f k ( t ) − f k ( t + ∆ t )] f j ( t ) I t ( t ) (19b) b k = − (cid:88) t [ f k ( t ) − f k ( t + ∆ t )] r ( t ) I t ( t ) (19c) A j = (cid:88) t f j ( t ) I t ( t ) (19d) b = (cid:88) t − r ( t ) I t ( t ) (19e)The re-weighting factors are defined up to an overall factor, which we fix by requiring thetotal weight to be equal that of an equilibrium trajectory, i.e., (cid:80) t w ( t ) I t ( t ) = (cid:80) t
1. Thisleads to Eqs. 19d-e. They should replace equations Eqs. 19b-c for k = 1, for constant basisfunction f ( t ) = 1, for which Eqs. 19b-c give zeros.The re-weighting factors can also be considered as the first right eigenvector (with16igenvalue λ =1) of a non-equilibrium version of the transfer operator n ( i | j, ∆ t ) /n ( i ). Ap-pendix discusses the corresponding equations for the eigenvectors of the transfer operator P ( i | j, ∆ t ) π ( j ) /π ( i ) and the transition probability P ( i | j, ∆ t ).The optimization of eigenvectors, and, correspondingly, of the re-weighting factors, hasan inherent instability. For example, if time-series y ( t ), which is used to improve putativere-weighting factors, enters a region in configuration space, but does not come back, it willtry to increase the weight of this region infinitely. Short trajectories are likely to make thissituation more probable. To make the optimization of the re-weighting factors robust, onemay need to employ some ideas discussed in and this is a work in progress. Here we suggestto use a selected set of proper collective variables that sample all the regions extensively,i.e., they contain transitions to and from all the sampled regions.In the simplest case one can take as y ( t ) only a single committor coordinate time-series.In this case one will determine w ( q ), re-weighting factors as a function of the committor.This is sufficient, for example, for the first passage ensemble which consists of trajectoriesstarting in A and stopping as soon as they reach B, since the biasing factor in this ensemble isa function of the committor. It should also be sufficient for an ensemble of short trajectoriesfor a system with a single dominant pathway. In this case, the committor function, increasingalong the pathway, can be used to parameterize the pathway and the re-weighting factors.If there are two (or a few more) parallel pathways one can incorporate a proper collectivevariable that distinguishes between them into optimization as y ( t ).The basic NPNEw algorithm is as follows.
Initialization: a seed RC is initialized to r ( t ) = 1. Iterations: one randomly selects times-series y ( t ) from a set of proper collectivevariables, computes basis functions, solves Eqs. 19 and updates the putative RC time-series. Stopping: iterations stop when the change of RC time-series during the last n iterations || r − r − n || is sufficiently small.Once computed, the re-weighting factors are used to determine the equilibrium properties.For example, for the equilibrium free energy profile F ( r ): each trajectory point r ( i ∆ t )17ontributes with corresponding weight of w ( i ∆ t ); for equilibrium Z C,α cut-profiles: eachtransition from r ( i ∆ t ) to r ( i ∆ t + ∆ t ) contributes with corresponding weight of w ( i ∆ t ). As the first model system we consider a high-dimensional system for which the committorfunction can be computed analytically. It qualitatively resembles a protein folding landscapewith radially symmetric potential energy U ( X ) = U ( R ), decreasing towards the beginningof the coordinates. The decrease in enthalpy is compensated by the decrease in entropy sothat the resulting free energy profile as a function of R has two minima, separated by abarrier (Fig. 1). More specifically, U ( R ) = U ( R ) − ( n −
1) ln( R ), where R = (cid:112)(cid:80) ni =1 X i and U ( R ) = R < R − ≤ R ≤
12 4 e − ( R − + 4 e − ( R − < R R − (20)Due to high dimensionality of the configuration space, n = 50 here, the system can notbe analyzed directly by an MSM approach, one would need to preform a dimensionalityreduction first. For example, a trajectory of 10 frames will not even visit every possibleregion of configuration space with different combinations of coordinate signs. Approachesassuming pathways can not be applied also, as the system does not have a well definedpathway.Non-equilibrium ensemble of short trajectories was obtained by randomly selecting apoint in the 50 dimensional configuration space with uniform distribution in 1 < R <
13 andsimulating a diffusion trajectory for 10 steps with D ( X ) = 1, simulation step ∆ t sim = 0 . t = 0 .
1. The total size of the ensemble is 10 points. The free18nergy profile as a function of the radius F H ( R ), computed from the trajectories, is differentfrom U ( R ) (Fig. 1), confirming the non-equilibrium character of sampling. The NPNEq R F F H U Figure 1: Non-equilibrium free energy profile F H ( R ) (blue) and U ( R ) (red).algorithm is used to find the putative committor time-series. Specifically, Initialization: aseed RC is constructed as r ( t ) = 0 if R ( t ) < r ( t ) = 1 if R ( t ) > r ( t ) = 0 . Iterations:
Every iteration consists of four RC updates.NPNEq equations (Eqs. 12) with basis functions being the terms of polynomial f ( r, y ) ofdegree 6, where y ( t ) = X i ( t ) and i is randomly chosen from 1 , , ...,
50, i.e., y ( t ) is a randomlychosen coordinate time-series. It is followed by NPNEq equations with basis functions beingthe terms of polynomial f ( r ) of degree 16 with envelop exp( −| − r | / . −| r | / . Stopping: iterations are terminatedwhen || r − r − || < .
3. A Jupyter notebook with the analysis is provided in the SupportingInformation and is also available at https://github.com/krivovsv/NPNE. The results are robust with respect to the polynomial degrees, frequency of updates withenvelops, size of the envelops, etc. Higher degrees generally lead to faster convergence, a bitsmaller value of the ∆ r functional, and less fluctuating Z q , though very high degrees mayresult in instability and occasional failure to converge.Fig. 2 demonstrates the convergence of the iterations of the optimization process. Thesize of increments || r − r − || are steadily getting smaller with the iteration number. Thechange in the functional ∆ r value as a function of iteration number is getting smaller,19ndicates that we are approaching the minimum. The change of the RC time-series during thelast 100 iterations, for selected frames, is bounded by 0 . r a r rr b || rr || Figure 2: Convergence of the NPNEq optimization. a) ∆ r (black) and || r − r − || (blue) asfunctions of iteration number. b) Change of the RC time-series during the last 100 iterationsfor selected frames.Fig. 3 inspects how closely the determined time-series approximates the committor. Thevalidation criterion is relatively constant. The root mean squared deviations of Z q are about5, 6 and 10 for ∆ t = 1 , t = 4 could be dueto general statistical fluctuations because of limited sampling. Unlike the equilibrium Z C, profiles, the mean values of the Z q profiles are not very meaningful, as they depend on thetransitions from state A, which depend on the sampling. The committor as a function of R can be compute analytically as q ( R ) = (cid:90) RR ( A ) D − ( x ) e U ( x ) dx/ (cid:90) R ( B ) R ( A ) D − ( x ) e U ( x ) dx, D ( x ) = 1. Fig. 3b shows that the latter is in a good agreement with the putative RC,which is referred as committor henceforth. q Z q a t = 1 t = 2 t = 4 R q b Figure 3: Validation of the putative RC time-series. a) Non-equilibrium committor validationcriterion Z q ( x, ∆ t ) along putative time-series q for ∆ t = 1 , , R (black line) withthat for selected frames from RC time-series (yellow dots).The re-weighting factors are computed using the NPNEw algorithm. Specifically, Initial-ization: a seed RC is initialized as r ( t ) = 1. Iterations:
NPNEw equations (Eqs. 19) withbasis functions being the terms of polynomial f ( r, y ) of degree 5, where y ( t ) is the putativecommittor time-series q ( t ). Stopping: iterations are terminated when || r − r − || < . Z C, as the function of the putative committor q , the committor validationcriterion. The profile is constant with fluctuations bounded by 10%, confirming that q approximates the committor rather well. 21he equilibrium Z C, profile can be used to compute the equilibrium flux J AB = N AB /T ,where T = N ∆ t , is the total length of trajectory and N AB is the number of transitionsfrom A to B. N AB can be computed as N − AB = (cid:82) q ( B ) q ( A ) Z − C, ( q ) dq and N = (cid:80) t w ( t ) I t ( t ).The obtained value of the equilibrium flux J AB = 0 . J AB = N AB /Z = 0 . N − AB = (cid:82) R ( B ) R ( A ) Z − C, ( x ) dx = (cid:82) R ( B ) R ( A ) D − ( x ) e U ( x ) dx and Z = (cid:82) e − U ( x ) dx , where D ( x ) = 1. q Z C , a t = 1 t = 2 t = 4 q , R F b F H ( q ) U ( R ) Figure 4: Equilibrium properties. a) Equilibrium validation/optimality criterion along theputative time-series is constant. The deviations from the constant value are in the range of10%. b) F H (˜ q ) (green), equilibrium free energy profile as a function of ˜ q , is in agreementwith U ( R ) (red); F H (˜ q ) was shifted horizontally and U ( R ) vertically for maximum overlap.The re-weighting factors can be used to compute the equilibrium free energy profile F H ( q )and the diffusion coefficient D ( q ) as functions of the committor and thus provide the diffusivemodel of the equilibrium dynamics along the committor, which can be used to compute thefollowing important properties of the dynamics exactly: the equilibrium flux, the mean22rst passage times, and the mean transition path times between any two points on thecommittor. However, using F H ( q ) for the analysis and description of the dynamics is notvery convenient as the diffusion coefficient varies significantly along the coordinate. It is moreconvenient to use a “natural” coordinate, ˜ q , where the diffusion coefficient is constant D (˜ q ) = 1. It is related to q by the following monotonous transformation d ˜ q/dq = D ( q ) − / .Fig. 4b shows that F H (˜ q ) is in a very good agreement with U ( R ).In summary, this example illustrates that the NPNEq and NPNEw algorithms can beused to determine the committor and re-weighting factors from non-equilibrium ensemblesof short trajectories and to construct a diffusive model of equilibrium dynamics, which canbe used to compute important properties of the equilibrium dynamics exactly.Functions implementing NPq (Eq. 6), NPNEq (Eq. 12) and NPNEw (Eq. 19) iterations,computing Z q and Z H profiles and performing transformation to natural coordinate areavailable as Python library npnelib.py at https://github.com/krivovsv/NPNE. We have demonstrated that the NPNEq algorithm can accurately determine the committorRC from a non-equilibrium sampling of the model system. The model system has a relativelysimple configuration space and a relatively simple committor function, which is a functionof R only. It is of interest to see how accurately the NPNEq algorithm can approximate thecommittor for a realistic system. To this end, the NPNEq algorithm is applied to a longequilibrium protein folding trajectory of HP35 Nle/Nle double mutant consisting of 1509392snapshots at 380 K, in particular, to compare with its equilibrium version. The analysisdetails can be found in a Jupyter notebook, provided in the Supporting information and athttps://github.com/krivovsv/NPNE. The optimization continued for 40000 iterations. The final ∆ r / ∼ . N AB , i.e., almosttwo times higher than the target value of N AB = 74 .
5. Fig. 5a inspects the convergence of thealgorithms. As once can see the increment size || r − r − || converges to some non-zero value,23hile the ∆ r / r / N AB . Fig. 5b shows that the free energy profile as the function of the putative committor, F ( q ),is very similar to that obtained with equilibrium adaptive non-parametric optimization, indicating that the non-equilibrium approach has similar approximation power. iteration r / a q F bc q Z q || rr || Figure 5: Application of the NPNEq to a realistic protein folding trajectory. a) Convergenceof the NPNEq optimization: || r − r − || (blue) and ∆ r (black) as functions of iterationnumber. b) Free energy as a function of committor, F H ( q ) . c) Z q ( x, ∆ t ) along putativecommittor time-series for ∆ t = 1 (blue), 2 (orange), ..., . Z q criterion (Fig. 5c), which, for equilibrium dynamics, is equivalent to Z C, shows that Z q ( q, ∆ t ) is almost 2 times larger then N AB = 74 .
5. It means that the diffusive model of24ynamics is accurate, within a factor of 2, at the time scale of ∆ t = 0 . kT ln(2) ∼ . Z q ( q, ∆ t ) − Z q ( q, t ) is smallest for 0 . < q < .
6. Ifcontinued further, Z q ( q, ∆ t ) will get lower than Z q ( q, t ) in that region, indicating thatthe putative RC is overfitted around the transition state. One way to avoid overfitting, isto make optimization more uniform by focusing it on underfitted/suboptimal regions. Analternative approach consists in performing additional extensive sampling of the transitionstate by starting many short trajectories from the frames in the overfitted region and analyzethe combined simulations using the developed non-equilibrium approach.
Given a representative and extensive, possibly non-equilibrium sampling of the configurationspace the proposed approaches can be used to determine the equilibrium free energy profileas a function of the committor. The later, in particular, can be used to determine importantproperties of the equilibrium dynamics exactly. By a representative sampling we mean sucha sampling which contains all the important regions of the configuration space, e.g., allthe important transition pathways, or a representative sample of them, if their number isinfinite. By an extensive sampling we mean a sampling of such a size that overfitting bythe non-parametric approaches is not possible or negligible. In this section we will discusspossible strategies of generating such a representative and extensive sampling.Consider first the case where a trajectory or ensemble of trajectories provide representa-tive, though not extensive sampling, for example, the state-of-the-art protein folding trajec-tories.
Applying the non-parametric approaches (either equilibrium or non-equilibrium)one finds that the optimization soon starts to overfit the committor RC in the TS region,25ecause the sampling of this region is relatively poor, compare to the rest of the configura-tion space.
The regions where the putative RC is overfitted can be detected by using theoptimality criteria. In order to avoid the overfitting, many additional short simulations areperformed, starting from the configurations that belong the overfitted regions, e.g., the TSregion. Then, the total simulation data is analyzed by the non-parametric non-equilibriumapproach.A more difficult case is when the initial representative sampling is absent. For systemswith relatively simple, small configuration space, selection of initial configurations to startmany short simulations as well as the seed RC can be done analytically, as it was done for themodel system considered here. Such systems may include practically important cases suchas, e.g., studies of dynamics of a ligand binding/unbinding to/from a protein, or diffusionof a small molecule/ion through an ion channel pore.If one of the boundary states has a much shorter lifetime compared to the other state,then many trajectories should be started from the former state, which shall generate a non-equilibrium (first passage) representative sampling.If both boundary states have long residence times, while the transition path times arerather short, one can use the transition path sampling approach to generate a representativesampling. Inclusion of the rejected paths will increase the size of the sampling and removeconditioning on the boundary states.Another possibility is to use biased, non-equilibrium sampling, though, in this case, rep-resentative sampling of transition paths is not guaranteed. For example, one may use sam-pling at a higher temperature or sampling with a bias potential, e.g., umbrella sampling, steered-MD, replica-exchange, meta-dynamics, or forward flux sampling. If anenhanced sampling method perturbs the dynamics of interest, e.g., a higher temperature ora biasing potential, then many short simulations with unperturbed dynamics, need to beperformed, starting from the obtained configurations.String method using swarms of trajectories can be straightforwardly combined with26he NPNEq approach. Since the latter does not assume the existence of a dominant pathway,it may improve performance of the former in systems, where this assumption does not hold.Consider now the forward flux sampling (FFS), where one uses an order parame-ter (OP), which can be different from the optimal RC - the committor, to propagate thetrajectories from state A to state B. While the accuracy of FFS does not depend on theOP, the efficiency does. Thus it would be desirable to propagate FFS trajectories using theoptimal RC or committor. Since the committor RC is not known in advance, one possibilityis to compute the committor during sampling, applying the developed approach to the datasampled so far. Having the idea in mind we propose the following approach.We first describe an idealized scenario. Assume that relatively long unbiased simulationswere performed in both boundary states A and B. The simulations are not long enough,however, for the system to sample the transitions between the states, and thus can not beused to construct the entire committor RC. Assume now that these simulations, however,can be used to construct the committor in the sampled regions, i.e., the starting and endingsegments of the committor for example [0 , α ] and [ β, α and β . Analyzing the combined new andold simulations, one extends the RC segments to a large value of α and a smaller value of β ,since some of the stochastic trajectories will travel to these regions. One continues in suchan iterative manner to grow the two segments towards each other, until they meet, when α = β , thus providing the initial representative sampling of transition paths.Unfortunately, it is not possible to construct accurately just the two segments of theRC, because as soon as the RC is divided into two non-overlapping segments, [0 , α ] and[ β, α → β →
1. However, even an approximate RC, obtained just before the RC is dividedinto two segments can be useful. It is possible, when such a partial optimization of the RCincreases the fraction of points with correct values of RC. In this case, a new ensemble ofmany short trajectories is prepared, by starting them from points selected uniformly along27he RC. The new ensemble, will have a higher fraction of points with higher values of α andsmaller values of β . By iterating this process, one can converge to the ensemble with pointsuniformly sampled along the RC. This process is somewhat analogous to the way uraniumis enriched in centrifuges: each cycle leads only to a marginal increase in the concentrationof the desired isotope. However, by repeating the cycle many times, the concentration getsexponentially increased. We call this approach the transition state ensemble enrichment,TSEE. R q a R q b F F Figure 6: Convergence of the TSEE approach in application to the model system. Initialdistribution of points along an OP F ( R ) (blue line); two-dimensional distributions of pointson the q-R plane (yellow dots). Panel a and b show first and second iterations, respectively.For details see text.We illustrate the TSEE approach on the model 50 dimensional system (Fig. 6). FirstIteration . We start by sampling the boundary states. 10000 short trajectories of length 10with timestep of 0 . R = 2 and R = 12. This isdone by assigning the 50 coordinates to random numbers uniformly distributed in the range[-0.5,0.5], and re-scaling them so that R equals to 2 or 12. F ( R ) on Fig. 6a shows that28he points are distributed mainly around R = 2 and R = 12, with almost no points in theTS region. The NPNEq algorithm is applied to optimize the putative RC. The degree ofpolynomial initially is set at 2 to limit the flexibility of RC to avoid its quick division intotwo segments. The degree is gradually increased during optimization. NPNEq optimizationis continued either until a segment 0 . < q < .
55 had fewer than 100 points, i.e., theRC is close to be divided into two segments (stopping condition one), or until the iterationnumber reached 1000 (stopping condition 2). The NPNEq optimization exited after 566iterations with stopping condition one. The distribution of points on the q-R plane on Fig.6a shows that optimization has stratified points according to the putative committor or thatthe initial and final parts of the committor are determined relatively accurate. By selectingpoints from the different regions along q , different regions of the configurations space can besampled more uniformly. Second Iteration . 10000 points are drawn uniformly along ˜ q - the putative committortransformed to the natural coordinate. These points are used to start 10000 short trajectoriesof length 10 with time step of 0 . F ( R ) on Fig. 6b shows that some of these trajectoriesvisited the TSE. The same NPNEq algorithm is applied to optimize the putative RC. Thealgorithm terminated after 1000 iterations with stopping condition two, i.e., the optimizationis robust with no division of the RC. The distribution of points on the q-R plane on Fig.6b shows that they cover all of the committor. Thus, the TSEE algorithm converged on thesecond iteration. We have described non-parametric non-equilibrium approaches to accurately determine thecommittor function and re-weighting factors from non-equilibrium simulations. Given arepresentative and extensive sampling of the configuration space, e.g., a large ensembleof short trajectories, the proposed approaches can be used to determine the equilibrium29ree energy profile as a function of the committor. The profile, together with the positiondependent diffusion coefficient, specify a diffusive model of the equilibrium dynamics. Themodel can be used to compute the following important properties of the dynamics exactly :the equilibrium flux, the mean first passage times, and the mean transition path timesbetween any two points on the committor.
The power of the approach was illustratedon a model 50-dimensional system and a realistic protein folding trajectory.In application to the eigenvectors optimization problem, the obtained equations are sim-ilar to those obtained in the EDMD approach.
Here, however, these equations describe asingle iteration of the iterative optimization process, which leads to the following advantages.A major weakness of the parametric approaches, e.g., those using a linear combinations ofmolecular descriptors/features or a deep neural network, is the choice-of-basis (choice of func-tional form) problem. While it was argued that ”the expressive power of neural networksprovides a natural solution to the choice-of-basis problem”, finding an optimal architec-ture of a neural network and input variables are difficult tasks. While intuition can help tosolve the problem for low-dimensional model systems, the difficulty in the case of complexrealistic systems becomes apparent, when one remembers that such a function should beable to accurately project a few million snapshots of a very high-dimensional trajectory. Inparticular, it implies an extensive knowledge of the system, and that an acceptable solutionis likely to be system specific. The developed non-parametric approaches can approximateany reaction coordinate with high accuracy. While each iteration may depend on the exactchoice of the family of collective variables/molecular descriptors/features, the final reactioncoordinate does not, since it provides optimum to a (non-parametric) target functional, whenthe optimization converges. We assume, of course, that the employed input variables pro-vide all the important information. For the analysis of biomolecular simulations one cansuggest the inter atom distances, or the sines and cosines of internal angles as the standardsets of input variables. The developed non-parametric approaches are able to accuratelyapproximate the committors and eigenvectors of realistic systems at the shortest timescales30f trajectory sampling interval of 0 . Also, one of the reasons of using re-weighteddata in eigenvector approximation is to avoid complex eigenvalues/eigenvectors since theylack interpretability. This strategy assumes that the re-weighting factors can be accuratelydetermined for every trajectory point, which is a difficult task. The problem of complexvalues, however, has a simple solution in the iterative optimization. First, since the numberof basis functions used during each iteration is rather small, the statistical noise is small andthe occurrence of complex eigenvalues/eigenvectors is an infrequent event. Thus one caneither skip such an iteration or accept it, truncating complex variables to real parts.Note, that while we call the approaches non-parametric, emphasizing that we focus onRC as a function of trajectory (trajectory time or trajectory snapshot), r ( i ∆ t ), rather thanas a function of configuration space, r ( X ), it is possible to record all the RC transformationsduring iterative optimization (training) and apply them later to new (test) data, e.g., forcross-validation. Alternatively, one can perform cross-validation on the fly, by computingparameters of RC transformations on the train part of the data, while applying these trans-formations to the train and test parts of the data. It can be trivially implemented by setting I t ( t ) = 0 for the test data.It is instructive to compare different descriptions of molecular dynamics, e.g., using com-mittors vs using eigenvectors as reaction coordinates for free energy landscapes or usingeigenvectors to approximate the evolution (forward, backward or Koopman) operators of thedynamics. They all have strong and week points. For example, if it is sufficient to know justsuch important quantities of dynamics as the equilibrium flux, the mean first passage timesor the mean transition path times between two states of interest, e.g., folded and unfoldedstates or bound and unbound states, then the diffusive model along the committor allowsone to determine these properties exactly (between any two points along the committor).This result is valid for any system, irrespective of complexity of its free energy landscape,and does not assume the separation of timescales. The diffusive model can be used todetermine, rigorously and in a direct manner, the free energy barrier and the pre-exponential31actor - the major determinants of molecular kinetics. Distribution of transition-path timesis an example of quantity that can not be accurately determined from the diffusive mod-els, in general. If one assumes the separation of timescales, then the projected dynamicsbecomes Markovian, and the diffusive model provides it complete description. A set of slow-est eigenvectors/eigenfunctions (basis) can provide a close approximation to the evolutionoperators and thus can be used to compute accurately many properties of the dynamics.One, however, may require a relatively large basis set to accurately estimate the quantities,that can be computed exactly by the diffusive model along the committor, that requiresthe determination of just one optimal coordinate. Also, it is not straightforward to visu-alize an approximated evolution operator, while a free energy landscape as a function ofone or two optimal reaction coordinates, provides a clear, intuitive and quantitative pictureof the dynamics. Another difference is that iterative optimization of committors is robust,while that of eigenvectors has an inherent instability. In the committor case, one seeksan optimal coordinate between two given states (a variant of supervised learning or ratherreinforcement learning). The eigenvector optimization can be considered as a variant ofunsupervised learning: one seeks eigenvectors with smallest eigenvalues, which describe theslowest dynamics. However, some of such eigenvectors are not of interest. For example, inprotein folding, such an eigenvector could describe a much slower torsion angle isomerizationprocess.
Another, more likely possibility, is due to a limited sampling, especially in thecase of many short trajectories. There are many parts of the configuration space that werevisited only once, and eigenvectors describing those transitions have small eigenvalues. Thus,starting with an eigenvector of interest, the iterative approach may eventually converge toan eigenvector, with smaller eigenvalue, but of no interest. To determine the committor,one needs to specify two boundary states. Proper definition of such states is a difficultproblem. For example, a natural approach of using the rmsd from a structure may lead toinaccuracies and hide complexity of the free energy landscapes. The problem is likely tobe more severe for systems with complex free energy landscape, e.g., intrinsically disordered32roteins. One general strategy of blind, unbiased analysis of dynamics, that uses strongpoints of both eigenvectors and committors is as follows. First, eigenvectors, even not com-pletely optimized/converged, are used for an exploratory analysis of free energy landscapes,e.g., to locate and define the boundary states. This is followed by the determination of thecommittors between these states and the corresponding equilibrium free energy profiles.The described non-parametric approaches have only two assumptions - representativesampling and that the underlying dynamics is Markovian in the configuration space. Forexample, for atomistic MD simulations, where the dynamics is Newtonian at the integra-tion time-step, the sampling/saving interval ∆ t needs to be sufficiently large, so that thedynamics have no memory about the momenta. In principle, shorter sampling intervals canbe employed if dynamics in the phase space is considered, i.e., the committor is a functionof positions and momenta, however it is not yet clear, if it can bring significant advantages.Since representative sampling does not need to cover exhaustively the entire configurationspace, the approaches do not suffer from the curse of dimensionality. It suggests that theseapproaches can be used to investigate dynamics of large biomolecular systems in a rigorousand accurate way. The approaches allows straightforward parallelization and can be adaptedfor exascale computing.By a representative sampling we mean such a sampling which provides a representa-tive, but not exhaustive/complete sampling of all the important regions of the configurationspace. By an extensive sampling we mean a sampling of such a size that overfitting bythe non-parametric approaches is not possible or negligible. In case, when the sampling isnot extensive, i.e., some regions do get overfitted, it can be straightforwardly rectified byperforming many short simulations starting from the configurations in the overfitted regions.We have suggested how one can generate such a representative and extensive samplingin a number of realistic practical scenarios, e.g., in tandem with many developed enhancedsampling techniques. We have also described a generic approach, the transition state en-semble enrichment, TSEE, which generates such a representative and extensive sampling in33n iterative, self-consistent manner, by ”growing” committor segments towards each otherstarting from the boundary states.The developed non-parametric approaches determine values of a specific optimal RC(e.g., the committor) for an ensemble of configurations, without using any system specificinformation. They can be considered analogous to linear algebra routines (e.g. the LAPACKlibrary ), where given, for example, a matrix, one can obtain numerical values of eigenvectorcomponents. Here, however, the task is complicated by the fact that the transition probabil-ity matrix is not given explicitly; only an ensemble of trajectories is provided. In particular,one can not compute the matrix-vector product, Av , the basic operation in iterative linearalgebra methods. Also, the configuration space is continuous, meaning that we are dealingwith an infinite-dimensional problem, which is somewhat simplified by considering a largerepresentative sample of points instead. In addition, the non-parametric approaches to de-termine the eigenvectors, require additional efforts to suppress the ’inherent instability’. However, the developed approaches show that these problems are solvable, and further de-velopment of the framework should deliver rigorous, robust and efficient tools to solve thesampling problem.
Another way to derive Eq. 12 is by using the Galerkin condition. Consider a variationof the RC, approximating the committor function, that satisfies the boundary conditions: x ( i ) + δx ( i ) = x ( i ) + ˜ I b ( i ) (cid:80) j α j f j ( i ), where x ( i ) satisfies the boundary condition x ( A ) = 0and x ( B ) = 1, while ˜ I b ( A ) = 0, ˜ I b ( B ) = 0 and ˜ I b ( i ) = 1 otherwise, and f j are the basisfunctions. The error vector (cid:15) ( j ), or the vector of residuals, is defined as (cid:88) i [ x ( i ) − x ( j )] n ( i | j, ∆ t ) = (cid:15) ( j ) , for j (cid:54) = A, B (21)34or the committor function (cid:15) = 0. The optimal variation is defined by the Galerkin condition:the error vector is orthogonal to all the basis functions of the variation (cid:15) ⊥ f k (cid:88) j [ (cid:88) i [ x ( i ) + δx ( i ) − x ( j ) − δx ( j )] n ( i | j, ∆ t )] ˜ I b ( j ) f k ( j ) = 0 , (22)where we used ˜ I b to extend the summation to all j . This system of equations is translatedto the RC time-series as follows (cid:88) t [ r ( t + ∆ t ) + δr ( t + ∆ t ) − r ( t ) − δr ( t )] f k ( t ) I t ( t ) I b ( t ) = 0 , (23)which leads to Eq. 12.Yet another way to derive Eq. 12 is to consider the following optimization functionalmin x (cid:88) j (cid:54) = A,B n ( j )[ x ( j ) − (cid:88) i P ( i | j, ∆ t ) x ( i )] (24a) x ( A ) = 0 , x ( B ) = 1 (24b)The functional equals (cid:80) j (cid:15) ( j ) /n ( j ) and attains its minimum when (cid:15) ( j ) = 0, which givesthe committor equation (Eq. 1). The functional does not assume the detailed balance. Min-imization of such a functional is a standard approach of solving a linear system of equations(for committor) iteratively. This functional, however, can not be expressed in terms of RCtime-series r ( i ∆ t ), and thus can not be used for non-parametric optimization. Considernow the modified optimization problemmin x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) j (cid:54) = A,B n ( j )[ x ( j ) − (cid:88) i P ( i | j, ∆ t ) x (cid:48) ( i )] (25a) x ( A ) = 0 , x ( B ) = 1 (25b)While the entire functional can not be expressed in terms of RC time-series, the part that35epends on x can be expressed. The other part is not important, as it depends solely on x (cid:48) and is fixed during optimization. When expressed in terms of RC time-series, it equals (cid:80) t r ( t ) I t ( t ) − r (cid:48) ( t + ∆ t ) r ( t ) I t ( t ), i.e., it is equal to Eq. 9 up to the term r (cid:48) ( t + ∆ t ) I t ( t ),which is also held constant and disappears after differentiation. It means that Eq 12 (theNPNEq algorithm) can be also interpreted as iterative solving of the (more conventional)optimization problem of Eqs 24. Z q criterion Consider a functions ”conjugated” or time-reversed to Z q ∂Z Tq ( x, ∆ t ) ∂x = (cid:88) ij δ ( x − x ( i ))( x ( j ) − x ( i )) n ( i | j, ∆ t ) (26)For the half-sum of the two functions one obtains ∂∂x Z q ( x, ∆ t ) + Z Tq ( x, ∆ t )2 = (cid:88) ij [ δ ( x − x ( j )) − δ ( x − x ( i ))][ x ( i ) − x ( j )] n ( i | j, ∆ t ) / n ( i | j, ∆ t ) transitions from j to i . Then, if x ( i ) > x ( j ), one obtains a rectangular pulse from x ( j ) to x ( i ) of height n ( i | j, ∆ t ) / × | x ( i ) − x ( j ) | . If x ( i ) < x ( j ), one obtains a rectangular pulse from x ( i ) to x ( j )of height n ( i | j, ∆ t ) / × | x ( i ) − x ( j ) | . But this is exactly the definition of Z C, . Thus, Z q ( x, ∆ t ) + Z Tq ( x, ∆ t )2 = Z C, ( x, ∆ t ) . For equilibrium dynamics, where n ( i | j, ∆ t ) = n ( j | i, ∆ t ), one finds Z q = Z Tq = Z C, .36 .3 Non-parametric determination of eigenvectors from non-equilibriumsampling The re-weighting factors can also be considered as the components of the first right eigen-vector (with λ =1) of a non-equilibrium version of the transfer operator n ( i | j, ∆ t ) /n ( i ): (cid:88) j n ( i | j, ∆ t ) u ( j ) = λn ( i ) u ( i ) (28)They are related to the right eigenvectors of the equilibrium transfer operator T ( i | j, ∆ t ) = P ( i | j, ∆ t ) π ( j ) /π ( i ) = n ( i | j, ∆ t ) w ( j ) / [ n ( i ) w ( i )] as (cid:88) j T ( i | j, ∆ t )[ u ( j ) /w ( j )] = λ [ u ( i ) /w ( i )] , (29)i.e., eigenvectors of the transfer operator can be obtained as eigenvectors of Eq. 28 dividedby the re-weighting factors w ( i ) (the first eigenvector of Eq. 28). The eigenvectors of Eq.28 can be found as the solution of optimization problemmax x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) ij x ( i ) n ( i | j, ∆ t ) x (cid:48) ( j ) (30a) (cid:88) i n ( i ) x ( i ) = 1 (30b)which is translated to RC time-seriesmax r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t r ( t + ∆ t ) r (cid:48) ( t ) I t ( t ) (31a) (cid:88) t r ( t ) I t ( t ) = 1 (31b)Taking RC variations as r ( t ) = (cid:80) j α j f j ( t ) and following steps analogous those used toderive the NPNEq equations one obtains the following equations (the generalized eigenvalue37roblem) for the optimal parameters A kj α (cid:63)j = λB kj α (cid:63)j (32a) A kj = (cid:88) t f k ( t + ∆ t ) f j ( t ) I t ( t ) (32b) B kj = (cid:88) t f k ( t ) f j ( t ) I t ( t ) (32c)The left eigenvectors of the transition probability matrix (cid:88) j v ( j ) P ( j | i, ∆ t ) = λv ( i ) (33)can be found as the solution to optimization problemmax x (cid:12)(cid:12)(cid:12) x (cid:48) = x (cid:88) ij x (cid:48) ( j ) n ( j | i, ∆ t ) x ( i ) (34a) (cid:88) i n ( i ) x ( i ) = 1 (34b)which is translated to RC time-seriesmax r (cid:12)(cid:12)(cid:12) r (cid:48) = r (cid:88) t r (cid:48) ( t + ∆ t ) r ( t ) I t ( t ) (35a) (cid:88) t r ( t ) I t ( t ) = 1 (35b)with the following equations on optimal parameters[ A T ] kj α (cid:63)j = λB kj α (cid:63)j , (36)where matrices A and B are defined in Eq. 32. If the dynamics is inherently reversible orequilibrium (though the sampling may be not), i.e., P ( i | j, ∆ t ) π ( j ) = P ( j | i, ∆ t ) π ( i ), then T ( i | j, ∆ t ) = P ( j | i, ∆ t ) and v ( j ) in Eq. 33 are the right eigenvectors of the transfer operator.38hus, Eq. 36 can be used to determine the eigenvectors of the transfer operator withoutusing the re-weighting factors, for the dynamics which is inherently reversible or equilibrium,which is usually assumed for molecular simulations. Eqs. 32 and 36 are similar to equationsfor obtaining linear combinations of feature variables best approximating eigenvectors of theKoopman operator. Further discussion on how to select basis functions or how to suppress possible instabilityduring iterative optimization of eigenvectors can be found in Ref. 21.
Supporting Information.
Jupyter notebooks containing the analyses are provided in a single zip archive.
References (1) Kohlhoff, K. J.; Shukla, D.; Lawrenz, M.; Bowman, G. R.; Konerding, D. E.; Belov, D.;Altman, R. B.; Pande, V. S. Cloud-based simulations on Google Exacycle reveal ligandmodulation of GPCR activation pathways.