Integrated VAC: A robust strategy for identifying eigenfunctions of dynamical operators
Chatipat Lorpaiboon, Erik Henning Thiede, Robert J. Webber, Jonathan Weare, Aaron R. Dinner
Chatipat Lorpaiboon,∗,†,‡,⊥ Erik Henning Thiede,∗,¶,§,⊥ Robert J. Webber,∗,‖,⊥ Jonathan Weare,∗,‖ and Aaron R. Dinner∗,†,‡
†Department of Chemistry, University of Chicago, Chicago, IL 60637
‡James Franck Institute, University of Chicago, Chicago, IL 60637
¶Flatiron Institute, New York, NY
§Department of Computer Science, University of Chicago, Chicago, IL 60637
‖Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
⊥Equal contributions
E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Abstract
One approach to analyzing the dynamics of a physical system is to search for long-lived patterns in its motions. This approach has been particularly successful for molecular dynamics data, where slowly decorrelating patterns can indicate large-scale conformational changes. Detecting such patterns is the central objective of the variational approach to conformational dynamics (VAC), as well as the related methods of time-lagged independent component analysis and Markov state modeling. In VAC, the search for slowly decorrelating patterns is formalized as a variational problem solved by the eigenfunctions of the system's transition operator. VAC computes solutions to this variational problem by optimizing a linear or nonlinear model of the eigenfunctions using time series data. Here, we build on VAC's success by addressing two practical limitations. First, VAC can give poor eigenfunction estimates when the lag time parameter is chosen poorly. Second, VAC can overfit when using flexible parameterizations such as artificial neural networks with insufficient regularization. To address these issues, we propose an extension that we call integrated VAC (IVAC). IVAC integrates over multiple lag times before solving the variational problem, making its results more robust and reproducible than VAC's.
Introduction
Many physical systems exhibit motion across fast and slow timescales. Whereas individual subcomponents may relax rapidly to a quasi-equilibrium, large collective motions occur over timescales that are orders of magnitude longer. These slow motions are often the most scientifically significant. For instance, observing the large-scale conformational changes that govern protein function requires microseconds to days, even though individual atomic vibrations have periods of femtoseconds. However, when exploring new systems, such slow collective processes may not be fully understood from the outset. Rather, they must be detected from time series data.

One approach for automating this process is the "variational approach to conformational dynamics" (VAC).
In the VAC framework, slow dynamical processes are identified using functions that decorrelate slowly. These functions are the eigenfunctions of a self-adjoint operator associated with the system's dynamics known as the transition operator. The transition operator evolves expectations of functions over the system's state forward in time and completely defines the dynamics on a distributional level. VAC estimates the transition operator's eigenfunctions by constructing a linear or nonlinear model and using data to optimize parameters in the model. VAC encompasses commonly used approaches such as time-lagged independent component analysis and eigenfunction estimates constructed using Markov state models. In addition, recent VAC approaches use artificial neural networks to learn approximations to the eigenfunctions.
While VAC has been successful in some applications, the approach has limitations. The accuracy of the estimated eigenfunctions depends strongly on the function space in which the eigenfunctions are approximated, the amount of data available, and a hyperparameter known as the lag time. In our previous work we gave a comprehensive error analysis for the linear VAC algorithm. This error analysis showed that the choice of lag time can be critical to achieving an accurate VAC scheme. Choosing a lag time that is too short can cause substantial systematic bias in estimated eigenfunctions, while choosing a lag time that is too long can make VAC exponentially sensitive to sampling error.

In this paper, we present an extension of the VAC procedure in which we integrate the correlation functions in VAC over a time window. We term this approach integrated VAC (IVAC). Because IVAC is less sensitive to the choice of lag time, it reduces error compared to VAC. Additionally, when IVAC is applied using an approximation space parameterized by a neural network, the approach leads to stable training and mitigates the overfitting problems associated with VAC.

We organize the rest of the paper as follows. In the theory section, we review the role of the transition operator and its eigenfunctions, and we introduce the VAC approach for estimating eigenfunctions. We then present the procedure for IVAC. In the results section, we evaluate the performance of IVAC on two model systems. We conclude with a summary and a discussion of further ways IVAC can be extended.

Methods
Background
In this section, we review the VAC theoretical framework that shows how the slowly decorrelating functions in a physical system can be identified using a linear operator known as the transition operator.

We assume that the system of interest is a continuous-time Markov process X_t ∈ ℝ^n with a stationary, ergodic distribution µ (specifically a Feller process). We use E to denote expectations of the process X_t started from µ. For example, if µ is the Boltzmann distribution associated with the Hamiltonian H and temperature T, then expectations of the process satisfy

E[f(X_t)] = \frac{\int f(x)\, e^{-H(x)/k_B T}\, dx}{\int e^{-H(x)/k_B T}\, dx} (1)

for all t ≥
0. However, our results are valid for systems with other, more general, stationary distributions.
The transition operator
To begin, we consider the space of real-valued functions with finite second moment (E[f(X_0)^2] < ∞). Equipped with the inner product

⟨f, g⟩ = E[f(X_0) g(X_0)] (2)

this forms a Hilbert space, which we denote L^2_µ. We define the transition operator at a lag time τ to be the operator

T^τ f(x) = E[f(X_τ) | X_0 = x] (3)

applied to a function f ∈ L^2_µ. Here, we are interpreting the conditional expectation as a function of the initial point x.

The transition operator is also called the Markov or (stochastic) Koopman operator. We use the term transition operator as it is well established in the literature on stochastic processes, and the terminology emphasizes the connection with finite-state Markov chains. For a finite-state Markov chain, f is a column vector and T^τ is a row-stochastic transition matrix.

The transition operator lets us rewrite correlation functions in terms of inner products in L^2_µ:

E[f(X_0) g(X_τ)] = ⟨f, T^τ g⟩. (4)

Moreover, we can express the slow motions of a system's dynamics in terms of the transition operator. The slow motions are identified by functions f for which the normalized correlation function

\frac{E[f(X_0) f(X_τ)]}{E[f(X_0) f(X_0)]} = \frac{⟨f, T^τ f⟩}{⟨f, f⟩} (5)

is large. We will show in the next subsection that these slowly decorrelating functions lie in the linear span of the top eigenfunctions of the transition operator.

Eigenfunctions of the transition operator
We can immediately see that T^τ has the constant function as an eigenfunction, because

T^τ 1 = E[1 | X_0 = x] = 1. (6)

However, there is no guarantee that any other eigenfunctions exist. We must therefore impose additional assumptions.

We first assume that X_t obeys detailed balance. For any functions f, g ∈ L^2_µ, we have

E[f(X_0) g(X_τ)] = E[f(X_τ) g(X_0)], (7)

or equivalently

⟨f, T^τ g⟩ = ⟨T^τ f, g⟩. (8)

This detailed balance condition ensures that T^τ is a self-adjoint operator on L^2_µ.

Next we assume that T^τ is a compact operator. In our context, assuming compactness is the same as assuming that the action of T^τ can be decomposed as an infinite sum involving eigenfunctions and eigenvalues:

T^τ f(x) = \sum_{i=1}^{∞} e^{-σ_i τ} ⟨η_i, f⟩ η_i(x). (9)

Our assumption of compactness is made for the sake of simplicity; in fact a weaker assumption of quasi-compactness is sufficient. We refer the reader to Webber et al. for a more general treatment.

At all lag times τ >
0, the function η_i is an eigenfunction of the transition operator T^τ with eigenvalue

λ_i^τ = e^{-σ_i τ}. (10)

The eigenvalues are indexed so that

0 = σ_1 < σ_2 ≤ σ_3 ≤ ⋯ (11)

and lim_{i→∞} σ_i = ∞. Because the process is ergodic, it is known that the largest eigenvalue λ_1^τ = 1 is a simple eigenvalue and all other eigenvalues are bounded away from 1. The particular dependence of the eigenvalues on τ occurs because the transition operator can be written as

T^τ = e^{τL}, ∀ τ ≥ 0, (12)

where L is an operator known as the infinitesimal generator. We note that it is also common to consider the implied timescale (ITS) associated with eigenfunction i, defined as

ITS_i = σ_i^{-1}. (13)

We can use the eigenvalues and eigenfunctions of the transition operator to rewrite the normalized correlation function (5). Observing that T^0 f(x) = f(x) and substituting (9) into the numerator and denominator of (5) gives

\frac{E[f(X_0) f(X_τ)]}{E[f(X_0) f(X_0)]} = \frac{\sum_{i=1}^{∞} e^{-σ_i τ} ⟨η_i, f⟩^2}{\sum_{i=1}^{∞} ⟨η_i, f⟩^2}. (14)

We now consider which functions maximize the normalized correlation function. Applying (11), we find that the normalized correlation function is maximized when we set f to be the constant function f(x) = η_1(x) = 1, because

\frac{\sum_{i=1}^{∞} e^{-σ_i τ} ⟨η_i, f⟩^2}{\sum_{i=1}^{∞} ⟨η_i, f⟩^2} ≤ \frac{\sum_{i=1}^{∞} e^{-σ_1 τ} ⟨η_i, f⟩^2}{\sum_{i=1}^{∞} ⟨η_i, f⟩^2} (15)
= e^{-σ_1 τ} (16)

for all functions f ∈ L^2_µ. If we constrain the search to functions that are orthogonal to η_1, i.e., functions where

⟨η_1, f⟩ = E[f(X_0)] = 0 (17)

and assume σ_2 < σ_3, the normalized correlation function is maximized when f = η_2. If we constrain f to be orthogonal to both η_1 and η_2, then the next slowest decorrelating function would be η_3, and so forth. Maximizing the normalized correlation function at any lag time τ is therefore equivalent to identifying the eigenfunctions of the transition operator.

Because of the connection to slowly decorrelating functions, the eigenfunctions provide a natural coordinate system for dimensionality reduction.
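This variational picture can be checked concretely on a finite-state chain, where, as noted above, the transition operator is a row-stochastic matrix. The sketch below is our own illustration (assuming numpy), not code from the paper: a symmetric transition matrix satisfies detailed balance with respect to the uniform distribution, its largest eigenvalue is 1 with a constant eigenfunction, and no mean-zero observable exceeds the normalized autocorrelation of the second eigenfunction.

```python
import numpy as np

# Toy reversible three-state chain: a symmetric row-stochastic matrix obeys
# detailed balance with respect to the uniform stationary distribution mu,
# so the transition operator is self-adjoint and its spectrum is real.
T = np.array([[0.90, 0.10, 0.00],
              [0.10, 0.80, 0.10],
              [0.00, 0.10, 0.90]])
mu = np.ones(3) / 3.0

evals, evecs = np.linalg.eigh(T)             # real spectrum of a symmetric matrix
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort eigenvalues in decreasing order

# Top eigenvalue is 1 with a constant eigenfunction, cf. eq. (6).
eta1 = evecs[:, 0] / evecs[0, 0]             # normalize so the first entry is 1

def autocorr(f, lag=1):
    """Normalized correlation <f, T^lag f> / <f, f> of eq. (5), with the
    mu-weighted inner product of eq. (2)."""
    Tt = np.linalg.matrix_power(T, lag)
    return np.sum(mu * f * (Tt @ f)) / np.sum(mu * f * f)

eta2 = evecs[:, 1]                    # slowest nontrivial eigenfunction
g = np.array([1.0, -1.0, 0.0])        # an arbitrary mean-zero observable
# autocorr(g) <= autocorr(eta2): eta2 maximizes eq. (5) among mean-zero f
```

Running this, `autocorr(eta2)` reproduces the second eigenvalue exactly, while the arbitrary mean-zero observable `g` gives a strictly smaller normalized autocorrelation, as the argument around eqs. (15)-(17) predicts.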
The first few eigenfunctions provide a compact representation of all the slowest motions of the system. Additionally, clustering data based on the eigenfunction values makes it possible to identify metastable states.

The variational approach to conformational dynamics
The "variational approach to conformational dynamics" (VAC) is a procedure for identifying eigenfunctions by maximizing the normalized correlation function. The first eigenfunction, η_1(x) = 1, is known exactly and is set to the constant function. To identify subsequent eigenfunctions, we parameterize a candidate solution f using a vector of parameters θ. We then construct an estimate γ_i for the i-th eigenfunction by tuning the parameters to maximize (5). We set γ_i = f_{θ*}, where

θ* = \arg\max_θ \frac{E[f_θ(X_0) f_θ(X_τ)]}{E[f_θ(X_0) f_θ(X_0)]} (18)

subject to ⟨f_θ, γ_j⟩ = 0 for all j < i. In practice, we use empirical estimates of the correlations constructed from sampled data. For instance, if our data set consists of a single equilibrium trajectory x_0, x_Δ, …, x_{T−Δ}, we would then construct the estimate

\hat{E}[f(X_0) g(X_τ)] = \frac{Δ}{T − τ} \sum_{s=0}^{(T−Δ−τ)/Δ} \frac{f(x_{sΔ}) g(x_{sΔ+τ}) + f(x_{sΔ+τ}) g(x_{sΔ})}{2}. (19)

Here and in the rest of the paper, we use the ˆ symbol to indicate quantities constructed using sampled data.

Once we have obtained an estimated eigenfunction γ̂_i using data, we can estimate the associated eigenvalue and implied timescale using

\hat{λ}_i^τ = \frac{\hat{E}[γ̂_i(X_0) γ̂_i(X_τ)]}{\hat{E}[γ̂_i(X_0) γ̂_i(X_0)]} (20)

\hat{σ}_i = −τ^{−1} \log \hat{λ}_i^τ. (21)

If the sampling is perfect, the variational principle ensures that VAC eigenvalues and VAC implied timescales are bounded from above by the true eigenvalues e^{−σ_i τ} and implied timescales σ_i^{−1}, and the upper bound is achieved when the VAC eigenfunction is the true eigenfunction η_i. However, since the empirical estimate (20) is used in practice, it is possible to obtain estimates that exceed the variational upper bound.

The earliest VAC approaches estimated the eigenfunctions of the transition operator by using linear combinations of basis functions {φ_i}, a procedure now known as linear VAC.
In linear VAC, the optimization parameters are the unknown linear coefficients v_i, which solve the generalized eigenvalue problem

\hat{C}(τ) v_i = \hat{λ}_i^τ \hat{C}(0) v_i, (22)

where

\hat{C}_{jk}(t) = \hat{E}[φ_j(X_0) φ_k(X_t)]. (23)

In approaches known as time-lagged independent component analysis and relaxation mode analysis, the basis functions {φ_i} were chosen to be the system's coordinate axes. This choice of approximation space is still commonly used to construct collective variable spaces either for analyzing dynamics or for streamlining further sampling. Markov state models (MSMs) provide an alternative approach for estimating eigenfunctions using linear combinations of basis functions. MSMs can serve as general dynamical models for the estimation of metastable structures and chemical rates.
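The linear VAC computation of eqs. (19) and (22)-(23) amounts to two correlation matrices and one generalized eigenvalue problem. The following is our own minimal numpy/scipy sketch, not the authors' code; the symmetrized estimators and the AR(1) test trajectory are illustrative choices made here.

```python
import numpy as np
from scipy.linalg import eigh

def linear_vac(phi, lag):
    """Linear VAC sketch: phi is a (T, S) array of basis functions evaluated
    along one trajectory; lag is the lag time in frames.  Builds symmetrized
    estimators in the spirit of eqs. (19) and (23) and solves the generalized
    eigenvalue problem of eq. (22)."""
    X, Y = phi[:-lag], phi[lag:]
    n = len(X)
    C_lag = (X.T @ Y + Y.T @ X) / (2.0 * n)   # symmetrized C(lag)
    C_0 = (X.T @ X + Y.T @ Y) / (2.0 * n)     # C(0) over the same frames
    evals, evecs = eigh(C_lag, C_0)           # C(lag) v = lambda C(0) v
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Toy stationary trajectory: an AR(1) process standing in for simulation data.
rng = np.random.default_rng(0)
x = np.zeros(20000)
for t in range(1, len(x)):
    x[t] = 0.95 * x[t - 1] + rng.standard_normal()
phi = np.stack([np.ones_like(x), x, x * x], axis=1)   # small linear basis

lag = 10
lam, V = linear_vac(phi, lag)
its = -lag / np.log(lam[1])   # implied timescale via eq. (21)
```

On this toy data the top eigenvalue is close to 1 (the constant function), and the second eigenvalue is close to 0.95^10, so the recovered implied timescale approximates the AR(1) relaxation time −1/ln 0.95 ≈ 19.5 frames.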
When MSMs are applied to estimate eigenfunctions and eigenvalues, the approach is equivalent to performing linear VAC using a basis of indicator functions on disjoint sets. Noé and Nüske unified the linear VAC approaches and exploited a general variational principle for identifying eigenvalues and eigenfunctions of the transition operator. Subsequent work further developed the methodology and introduced more general linear basis functions. Moreover, it was observed that the general variational principle allows one to model the eigenfunctions using nonlinear approximation spaces such as the output of a neural network.
This can lead to very flexible and powerful approximation spaces. However, in our experience, the greater flexibility can also lead to overfitting problems that need to be addressed through regularization.

In a common nonlinear VAC approach, a neural network outputs a set of functions φ_1, φ_2, …, φ_S that serve as a basis set for linear VAC calculations. The network parameters are then optimized to maximize the VAMP score, which under our assumption of detailed balance can be calculated using

VAMP-k = \sum_{i=1}^{S} |\hat{λ}_i^τ|^k. (24)

The hyperparameter k is typically set to 1 or 2. In this paper, we use the VAMP-1 score, since we find that it leads to more robust training. We note that the score function we use is also called the generalized matrix Rayleigh quotient.

Challenges in VAC calculations
A major challenge in VAC calculations is selecting the lag time τ. Since the early days of VAC, it was noted that lag times that are too short or too long can lead to inaccurate eigenfunction estimates. Our recent work revealed that the sensitivity to lag time is caused by a combination of approximation error at short lag times and estimation error at long lag times. In this section, we describe the impact of approximation error and estimation error and provide a schematic (Figure 1) that illustrates the trade-off between approximation error and estimation error at different lag times.

Approximation error is the systematic error of VAC that exists even when VAC is performed with an infinite data set. We expect approximation error to dominate the calculation when the basis set is of poor quality and our approximation space cannot faithfully represent the eigenfunctions of the transition operator. The approximation error is greatest at short lag times, and it decreases and eventually stabilizes as the lag time is increased. Therefore, VAC users can typically reduce approximation error by avoiding the very shortest lag times.
Estimation error is the random error of VAC that comes from statistical sampling. As shown in our previous work, with increasing lag time the results of VAC become exponentially sensitive to small variations in the data set, leading to high estimation error. At large enough lag times, all the eigenfunction estimates γ̂_1^τ, γ̂_2^τ, … are essentially random noise.
Figure 1: Schematic illustrating the sources of VAC error at different lag times. Even without sampling, VAC solutions have approximation error. Random variation due to sampling contributes additional estimation error.

In Webber et al., we proposed measuring VAC's sensitivity to estimation error using the condition number κ^τ. The condition number measures the largest possible changes that can occur in the subspace of VAC eigenfunctions {γ_j^τ, γ_{j+1}^τ, …, γ_k^τ} when there are small errors in the entries of C(0) and C(τ). The condition number is calculated using the expression

κ^τ = \frac{1}{\min\{\hat{λ}_{j−1}^τ − \hat{λ}_j^τ,\; \hat{λ}_k^τ − \hat{λ}_{k+1}^τ\}}. (25)

For a given problem and a given lag time, we can use the condition number to determine which subspaces of VAC eigenfunctions are highly sensitive to estimation error and which subspaces are comparatively less sensitive to estimation error.

Although we rigorously derived the condition number only in the case of linear VAC, we find that the condition number is also helpful for measuring estimation error in nonlinear VAC. If the condition number κ^τ is large, then identifying eigenfunctions is very difficult and requires a large data set. We recommend that authors report the condition number along with their VAC results, helping readers to assess whether the results are potentially sensitive to estimation error.

Integrated VAC
To address the difficulty inherent in choosing a good lag time, we propose an extension of VAC called "integrated VAC" (IVAC) where we integrate over a range of different lag times before solving a variational problem. We find that the new approach is more robust to lag time selection and it often gives better results overall.

Just as VAC maximizes the correlation function in (5), IVAC solves a variational problem by identifying a subspace of functions f that maximize the integrated correlation function

\int_{τ_{min}}^{τ_{max}} \frac{E[f(X_0) f(X_s)]}{E[f(X_0) f(X_0)]}\, ds. (26)

As in VAC, the functions solving the variational problem are the eigenfunctions of the transition operator. When the eigenfunction η_i is substituted into the integrated correlation function (26), the resulting expression is related to the implied timescales by

\int_{τ_{min}}^{τ_{max}} \frac{E[η_i(X_0) η_i(X_s)]}{E[η_i(X_0) η_i(X_0)]}\, ds = \frac{e^{−σ_i τ_{min}} − e^{−σ_i τ_{max}}}{σ_i}. (27)

Therefore, like VAC, IVAC is a variational approach for identifying both eigenfunctions and implied timescales.

IVAC is a natural extension of VAC; in the limit as τ_max approaches τ_min, IVAC gives the same eigenfunction and implied timescale estimates as regular VAC. However, when τ_max and τ_min are separated from each other, the results of IVAC and VAC start to diverge. We find that IVAC with minimal tuning performs comparably to VAC with optimal tuning. IVAC has the desirable feature that it is not very sensitive to the values of τ_min and τ_max.

Previous approaches for estimating eigenfunctions using multiple time lags have attempted to reduce approximation error by accounting for unobserved degrees of freedom. In contrast, IVAC uses multiple time lags to reduce estimation error and improve robustness to parameter choice.
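The identity in eq. (27) can be verified numerically, and inverting it for σ_i is how implied timescales are recovered from integrated correlations. The sketch below is our own illustration (assuming scipy), not the authors' code; the helper name `sigma_from_integral` is hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

# For an eigenfunction, the normalized autocorrelation at lag s is
# exp(-sigma * s), so its integral over [tau_min, tau_max] has the closed
# form of eq. (27).  We check that numerically, then invert the relation.
sigma, tau_min, tau_max = 0.25, 1.0, 10.0

integral, _ = quad(lambda s: np.exp(-sigma * s), tau_min, tau_max)
closed_form = (np.exp(-sigma * tau_min) - np.exp(-sigma * tau_max)) / sigma

def sigma_from_integral(value, tau_min, tau_max):
    """Recover sigma from the integrated autocorrelation by root finding;
    the implied timescale of eq. (13) is then 1/sigma."""
    f = lambda s: (np.exp(-s * tau_min) - np.exp(-s * tau_max)) / s - value
    return brentq(f, 1e-8, 1e3)   # right-hand side of eq. (27) is monotone in sigma

sigma_hat = sigma_from_integral(closed_form, tau_min, tau_max)
its_hat = 1.0 / sigma_hat        # recovers the implied timescale 1/sigma = 4
```

The bracket for `brentq` works because the closed form decreases monotonically from τ_max − τ_min (as σ → 0) toward zero, so any admissible integrated correlation value crosses it exactly once.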
Linear IVAC
Linear IVAC uses linear combinations of basis functions to maximize the integrated autocorrelation function (26). However, as simulation data are sampled at discrete time points, we cannot directly calculate the integral. We therefore replace (26) with a discrete sum taken over uniformly spaced lag times. We seek to maximize

\sum_{τ=τ_{min}}^{τ_{max}} \frac{E[f(X_0) f(X_τ)]}{E[f(X_0) f(X_0)]}, (28)

where τ = τ_min, τ_min + Δ, τ_min + 2Δ, …, τ_max and Δ is the sampling interval. The discrete sum (28) approximates (26) up to a constant multiple, and its value is maximized when f lies within the span of the top eigenfunctions of the transition operator. Setting f to be the eigenfunction η_i, we can sum the resulting finite geometric series:

\sum_{τ=τ_{min}}^{τ_{max}} \frac{E[η_i(X_0) η_i(X_τ)]}{E[η_i(X_0) η_i(X_0)]} = \frac{e^{−σ_i τ_{min}} − e^{−σ_i (τ_{max}+Δ)}}{1 − e^{−σ_i Δ}}. (29)

In linear IVAC, we optimize linear combinations of basis functions {φ_i} to maximize the functional (28). The optimization parameters are the unknown linear coefficients v_i, which solve the generalized eigenvalue problem

\hat{I}(τ_{min}, τ_{max}) v_i = \hat{λ}_i \hat{C}(0) v_i, (30)

where we have defined

\hat{C}_{jk}(t) = \hat{E}[φ_j(X_0) φ_k(X_t)] (31)

\hat{I}(τ_{min}, τ_{max}) = \sum_{τ=τ_{min}}^{τ_{max}} \hat{C}(τ). (32)

We solve the generalized eigenvalue problem to obtain estimates γ̂_i for the transition operator's eigenfunctions. Then, we form the sum

\sum_{τ=τ_{min}}^{τ_{max}} \frac{\hat{E}[γ̂_i(X_0) γ̂_i(X_τ)]}{\hat{E}[γ̂_i(X_0) γ̂_i(X_0)]}, (33)

and we estimate implied timescales by solving (29) for σ̂_i using a root-finding algorithm.

Nonlinear IVAC
Nonlinear IVAC maximizes the integrated correlation function (26) by constructing approximations in a nonlinear space of functions, for example, those represented by a neural network. Specifically, the nonlinear model provides a set of functions φ_1, φ_2, …, φ_S that serve as a basis set for linear IVAC. The parameters are trained to maximize the VAMP-k score

\sum_{i=1}^{S} |\hat{λ}_i|^k, (34)

where the eigenvalues λ̂_i are defined using equation (30). In a linear approximation space, all values of k lead to identical eigenfunction estimates. In a nonlinear approximation space, it is theoretically possible that optimizing with different values of k would lead to different estimates. However, in practice we find there is little difference between the estimates at the optima. We present our results using k = 1 because it leads to the most stable convergence; we found that higher values of k are prone to large gradients and, in turn, unstable training. When k = 1, the score function can be computed using

tr(\hat{C}(0)^{−1} \hat{I}(τ_{min}, τ_{max})). (35)

The main practical challenge in an application of nonlinear IVAC is that the basis functions φ_1, φ_2, …, φ_S change at every iteration, requiring costly re-evaluation of Ĉ(0), Î(τ_min, τ_max), and the gradient of (35) with respect to the parameters. To reduce this cost, we have developed the batch subsampling approach described in Algorithm 1, which we apply at the start of each optimization iteration.

Algorithm 1: subsampling routine
input: data x_0, …, x_{T−Δ}; τ_min; τ_max; number of samples N
for n ∈ 1, 2, …, N do
    Sample τ_n from {τ_min, …, τ_max};
    Sample s_n from {0, …, T − τ_n − Δ};
end
output: sampled pairs (x_{s_n}, x_{s_n+τ_n})

In the subsampling approach, we draw a randomly chosen set of data points, which allow us to estimate the matrix entries Ĉ_ij(0) using

\sum_{n=1}^{N} \frac{φ_i(x_{s_n}) φ_j(x_{s_n}) + φ_i(x_{s_n+τ_n}) φ_j(x_{s_n+τ_n})}{2N} (36)

and the matrix entries Î_ij(τ_min, τ_max) using

\sum_{n=1}^{N} \frac{φ_i(x_{s_n}) φ_j(x_{s_n+τ_n}) + φ_i(x_{s_n+τ_n}) φ_j(x_{s_n})}{2NΔ/(τ_{max} − τ_{min} + Δ)}. (37)

After constructing these random matrices, we calculate the score function (35). We then use automatic differentiation to obtain the gradient of the score function with respect to the parameters, and we perform an optimization step. By randomly drawing new data points at each optimization step, we ensure a thorough sampling of the data set and we are able to train the nonlinear representation at reduced cost. Typically, we find that 10–10 data points per batch is enough for the score function (35) to be estimated with low bias.

Results and discussion
In this section, we provide evidence that IVAC is more robust than VAC and can give more accurate eigenfunction estimates. First, we show results from applying IVAC and VAC to the alanine dipeptide. VAC can provide accurate eigenfunction estimates for this test problem owing to the large spectral gap and the approximation space that overlaps closely with the eigenfunctions of the transition operator. However, VAC requires a careful tuning of the lag time. In contrast, IVAC is much less sensitive to lag time choice. IVAC gives solutions that are comparable to VAC with the optimal lag time parameter and substantially better than VAC with a poorly chosen lag time.

Second, we show results for the villin headpiece protein. Because the data set has a small number of independent samples and the neural network approximation space is flexible and prone to overfitting, VAC and IVAC suffer from estimation error at long lag times. Despite these challenges, we present a robust protocol for choosing parameters in IVAC to limit the estimation error, and we show that IVAC is less sensitive to overfitting for this problem compared with VAC.
Application to the alanine dipeptide
In this section we compare linear IVAC and VAC applied to Langevin dynamics simulations of the alanine dipeptide (i.e., N-acetyl-alanyl-N-methylamide) in aqueous solvent; further simulation details are given in the supporting information.

The alanine dipeptide is a well-studied model for conformational changes in proteins. Like many protein systems, the alanine dipeptide has dynamics that are dominated by transitions between metastable states. The top eigenfunctions are useful for locating barriers between states, as these eigenfunctions change sharply when passing from one well to another. We focus on estimating η_2 and η_3, as large changes in these eigenfunctions correspond to transitions over the alanine dipeptide's two largest barriers. We refer to the span of η_1, η_2, and η_3 as the 3D subspace.

In our experiments, we consider trajectories of length 10 ns and 20 ns. The trajectories are long enough to observe approximately 15 or 30 transitions respectively along the dipeptide's slowest degree of freedom. Folding simulations of proteins, such as the villin headpiece considered below, often have a similar number of transitions between the folded and unfolded states.

There are several features that make it possible for VAC to perform well on this example. First, the linear approximation space, which consists of all the dihedral angles in the molecular backbone, is small (just 9 basis functions), and it is known to overlap heavily with the top eigenfunctions of the dynamics. Second, we are estimating a well-conditioned subspace with a minimum condition number of just

\min_τ κ^τ = \min_τ \left( \hat{λ}_3^τ − \hat{λ}_4^τ \right)^{−1} = 1. (38)

and therefore we do not expect a heavy amplification of sampling error that degrades eigenfunction estimates.

To evaluate the error in our eigenfunction estimates, we compare to "ground truth" eigenfunctions computed using a Markov state model built with a very long time series (1.5 µs) and a fine discretization of the dihedral angles.
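The condition numbers of eqs. (25) and (38) depend only on the spectral gaps at the boundaries of the targeted subspace, so they are cheap to compute from estimated eigenvalues. The helper below is our own sketch (assuming numpy; the function name and the example spectrum are hypothetical, not values from the paper).

```python
import numpy as np

def condition_number(evals, j, k):
    """Condition number of eq. (25) for the subspace of VAC eigenfunctions
    j..k (1-indexed, eigenvalues sorted in decreasing order).  Only the
    spectral gaps at the subspace boundaries enter; when j = 1 the top gap
    does not apply, as in eq. (38)."""
    lam = np.asarray(evals)
    gaps = []
    if j > 1:
        gaps.append(lam[j - 2] - lam[j - 1])   # gap above the subspace
    if k < len(lam):
        gaps.append(lam[k - 1] - lam[k])       # gap below the subspace
    return 1.0 / min(gaps)

# Hypothetical spectrum with a large gap after the third eigenvalue:
lam = [1.0, 0.97, 0.93, 0.40, 0.35]
kappa_3d = condition_number(lam, j=1, k=3)   # well conditioned: 1/(0.93-0.40)
kappa_23 = condition_number(lam, j=2, k=3)   # limited by the small gap 1.0-0.97
```

This illustrates why targeting a whole subspace that ends at a large spectral gap (here eigenfunctions 1-3) is far better conditioned than targeting a subspace whose boundary falls in a cluster of nearby eigenvalues.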
We measure error using the projection distance, which evaluates the overlap between one subspace and the orthogonal complement of another subspace. For subspaces U and V with orthonormal basis functions {u_i} and {v_i}, the projection distance is given by

d(U, V) = \sqrt{\sum_{i,j} \left( δ_{ij} − ⟨u_i, v_j⟩^2 \right)}. (39)

This measure, which combines the error in the different eigenfunctions into a single number, is useful because VAC is typically used to identify subspaces of eigenfunctions rather than individual eigenfunctions. The maximum possible error when estimating k eigenfunctions is √k.

Our main result from the alanine dipeptide application is that IVAC is more robust to the selection of lag time parameters than VAC. In Figure 2, we report the accuracy of IVAC and VAC for different lag times and trajectory lengths. In the left column, we show the root mean square errors (RMSE) for IVAC (orange) and VAC (purple), aggregated over thirty independent trajectories. From the aggregated results, IVAC performs nearly as well as VAC with the best possible τ and consistently gives results much better than VAC with a poorly chosen τ. The RMSE of IVAC is just 0.58 with 10 ns trajectories and 0.45 with 20 ns trajectories. These low error levels are not far from the minimum error of 0.37 that is possible using our linear approximation space.

In the right column of Figure 2, we show results for a 10 ns trajectory and a 20 ns trajectory. The trajectories were selected to help illustrate differences in the error profiles for VAC and IVAC; similar plots for all other trajectories can be found in the supporting information. We observe two key differences. First, VAC error can exhibit high-frequency stochastic variability as a function of lag time, a source of variability that does not affect integrated VAC results. Second, VAC can have high error levels at very short and long lag times. The projection distance against our reference often reaches 1.0, which might indicate that a true eigenfunction is completely orthogonal to our estimated subspace. The error of IVAC is unlikely to reach such high extremes.

We note that the parameter values τ_min = 1 ps and τ_max = 1 ns used in IVAC are not hard to tune. The range 1 ps to 1 ns spans the lag times over which the estimated eigenvalues λ̂^τ decrease from values near one to values near zero.
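The projection distance of eq. (39) is straightforward to compute for subspaces with orthonormal bases; the sketch below is our own numpy illustration, not the authors' code, and uses the equivalent form √(k − ‖UᵀV‖_F²).

```python
import numpy as np

def projection_distance(U, V):
    """Projection distance of eq. (39) between the subspaces spanned by the
    orthonormal columns of U and V.  Equals sqrt(k - ||U^T V||_F^2) for k
    columns, so the maximum error for k eigenfunctions is sqrt(k)."""
    k = U.shape[1]
    overlap = U.T @ V
    return np.sqrt(max(k - np.sum(overlap ** 2), 0.0))

# Illustrative check on two 2D subspaces of R^3 sharing one direction:
U = np.eye(3)[:, :2]        # span{e1, e2}
V = np.eye(3)[:, [0, 2]]    # span{e1, e3}
d = projection_distance(U, V)   # one shared direction out of two
```

Identical subspaces give distance 0, and the example above, which shares one of two directions, gives distance 1, consistent with the √k upper bound of 2 being reached only for fully orthogonal subspaces.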
Figure 2: Linear IVAC and VAC errors for alanine dipeptide trajectories. IVAC was applied with τ_min = 1 ps and τ_max = 1 ns. VAC was applied with variable lag time τ (horizontal axis). Errors are computed using the projection distance to the MSM reference for the span of η_2 and η_3. (left) Root mean square errors (RMSE) over 30 independent trajectories. (right) Errors for a single trajectory.

(Figure 3 panels: VAC at τ = 1 ps, 25 ps, and 250 ps; IVAC over (1–1000) ps; MSM reference; axes are the dihedral angles.)
Figure 3: Clusters on the eigenfunctions estimated using VAC and IVAC compared with clusters on an accurate MSM. (left of the dashed line) VAC and IVAC results for the 20 ns trajectory from Figure 2. (right of the dashed line) Clustering on η_2 and η_3 evaluated using an accurate MSM reference.

Figure 4: Lag time dependence of VAC and IVAC results. All results shown are for the single 20 ns alanine dipeptide trajectory in Figure 2. (left) Projection distance between VAC results at the horizontal axis lag time and VAC results at the vertical axis lag time. (center) First six estimated eigenvalues of the transition operator. (right) Error in IVAC results at different values of τ_min and τ_max, evaluated using the projection distance.

In contrast, it is much harder to tune the VAC lag time τ. VAC results are very sensitive to high or low lag times as seen in Figure 2.

When eigenfunction estimates are accurate, we expect that the eigenfunction coordinates will help identify the system's metastable states. In Figure 3, we compare the results of clustering configurations in the 20 ns alanine dipeptide trajectory in Figure 2 using the associated IVAC and VAC estimates. We plot the predicted metastable states against the dipeptide's φ and ψ dihedral angles. In the figure, we present VAC results taken at a short lag time, an intermediate lag time, and a long lag time. We also present results for the MSM reference. Comparing against the reference, we find that IVAC identifies clusters as accurately as VAC at a well-chosen lag time, and IVAC performs far better than VAC at a poorly chosen lag time.

Next, we present additional analyses applied to a single 20 ns alanine dipeptide trajectory that provide insight into why IVAC is more robust to lag time selection than VAC.
To start, we examine the discrepancy in VAC results at different lag times. In Figure 4, left, we performed VAC with a range of different lag times, and we measured the projection distance between the VAC results obtained at one lag time (horizontal axis) and the VAC results obtained at a different lag time (vertical axis). The square with low projection distance between 3 ps and 200 ps indicates that VAC results with lag times chosen within this range are similar to one another, but not to those with lag times taken from outside this range.

The discrepancy between VAC results at both low and high lag times can be explained by a plot of VAC eigenvalues (Figure 4, center). At 3 ps, there is an eigenvalue crossing between the eigenvalues λ̂_3^τ and λ̂_4^τ (shown in purple and magenta). The eigenvalue crossing causes VAC to misidentify the third VAC eigenfunction (which is inside the 3D subspace) and the fourth VAC eigenfunction (which is outside the 3D subspace). At 200 ps, there is a different problem related to insufficient sampling. The third eigenvalue descends into noise, causing VAC to fit the first two eigenfunctions at the expense of the 3D subspace.

With integrated VAC, the problem of finding a single good lag time is replaced with the problem of finding two endpoints for a range of lag times. This proves to be an easier task as IVAC is more tolerant of lag times outside the region where VAC gives good results. In Figure 4, right, we show the error of IVAC as a function of τ_min and τ_max (horizontal and vertical axes, respectively). This figure, which shows the error of IVAC estimates computed from comparison with the reference, is different from the figure on the left which shows only the discrepancy between VAC results at different lag times.
Figure 4, right, also shows the error of VAC, which appears along the diagonal of the plot corresponding to the case τ_min = τ_max. The figure reveals that the range of lag time parameters for which IVAC exhibits low error is much broader than the range of lag times for which VAC exhibits low error. This supports our basic argument that choosing good parameters in IVAC is easier than choosing good parameters in VAC. To achieve low errors, we do not need to identify the optimal VAC lag times; we only need to integrate over a window that contains the optimal VAC lag times while ensuring that τ_max is not excessively high.

Application to the villin headpiece
Next we apply IVAC to a difficult spectral estimation problem with limited data. We seek to estimate the slow dynamics of an engineered 35-residue subdomain of the villin headpiece protein. Our data consist of a 125 µs molecular dynamics simulation performed by Lindorff-Larsen et al. Villin is a common model system for protein folding in both experimental and computational studies, and its top eigenfunctions correlate with the folding and unfolding of the protein.

On the surface, the villin data set would seem to be much larger and more useful for spectral estimation than the 10–20 ns trajectories we examined for the alanine dipeptide. However, the villin headpiece relaxes to equilibrium orders of magnitude more slowly than the alanine dipeptide. The data set contains just 34 folding/unfolding events, with a folding time of roughly 2 µs. The limited number of observed events is characteristic of simulations of larger and more complex biomolecules, since simulations require massive computational resources and conformational changes take place slowly over many molecular dynamics time steps. The fact that the dynamics of villin are not understood nearly as well as the dynamics of the alanine dipeptide presents an additional challenge. Compared to the alanine dipeptide, villin has a more complex free energy surface and a larger number of degrees of freedom. Since the true eigenfunctions of the system are unknown, it is appropriate to apply spectral estimation using a large and diverse feature set. However, the large size and diversity of the feature set increase the risk of estimation error.

In contrast to the alanine dipeptide results, where we applied IVAC using linear combinations of basis functions, here we apply IVAC using a neural network. The increased flexibility of the neural network approximation reduces approximation error. However, the procedure for optimizing the neural network is more complicated than the procedure for applying linear VAC. Moreover, the complexity of the neural network representation (around 5 × 10^4 parameters) makes overfitting a concern for this example.

We use a slight modification of the neural network architecture published in Sidky et al., with 2 hidden layers of 50 neurons, tanh nonlinearities, and batch normalization between layers. The network is built on top of a rich set of features, consisting of all the Cα pairwise distances as well as sines and cosines of all dihedral angles. At each optimization step, we subsample a batch of data points using Algorithm 1.
We optimize the neural network parameters using AdamW with a fixed learning rate and weight decay coefficient. Following standard practice, we use the first half of the data set for training and the second half for validation. We validate the neural network against the validation data set every 100 optimization steps, and we perform early stopping with a patience of 10.

We present our results for villin in two parts. First we describe our procedure for selecting parameters in nonlinear IVAC. Then we highlight evidence that nonlinear IVAC shows greater robustness to overfitting than nonlinear VAC.

Selection of parameters
Here, we describe the protocols we use for selecting IVAC parameters. By establishing clear protocols, we help ensure that IVAC performs to the best of its ability, providing robust eigenfunction estimates even in a high-dimensional setting with limited data.

Our first protocol is to evaluate the condition number for the subspace of eigenfunctions that we are estimating. This protocol is motivated by the theoretical error analysis in Webber et al., where we showed that spectral estimates are less sensitive to estimation error for a well-conditioned subspace. To ensure that we are estimating a well-conditioned subspace, we first use IVAC to estimate eigenvalues for the transition operator. We then identify a subspace of eigenfunctions η_1, η_2, ..., η_k that is separated from all other eigenfunctions by a large spectral gap λ̂_k^τ − λ̂_{k+1}^τ.

For the villin data, we choose the subspace consisting only of the constant eigenfunction η_1 = 1 and the first nontrivial eigenfunction η_2. This is a well-conditioned subspace with a minimum condition number

min_τ κ_τ = min_τ ( λ̂_2^τ − λ̂_3^τ )^(−1) ≈ 1.    (40)

Our second protocol for ensuring robustness is to check that eigenfunction estimates remain consistent when the random seeds used in initialization and subsampling are changed. We train ten nonlinear IVAC neural networks and quantify the inconsistency in the results using the root mean square projection distance between eigenspace estimates from different runs. The results of this calculation are plotted in Figure 5 across a range of τ_min and τ_max values. The results for VAC appear along the diagonal of the plot in Figure 5, corresponding to the case τ_min = τ_max.

Figure 5: Nonlinear IVAC results for the 125 µs villin headpiece trajectory. (left) Estimated eigenvalues of the transition operator. (right) Root mean square projection distance between 10 replicates of nonlinear IVAC at the specified values of τ_min and τ_max.

Figure 6: Nonlinear IVAC results plotted on the first two time-lagged independent component analysis (tICA) coordinates. (top) IVAC with a 1–3 ns integration window and (bottom) IVAC with a 1–100 ns integration window, each computed with three different random seeds.

Figure 5 reveals consistency problems for both IVAC and VAC. IVAC is robust to the choice of τ_min; however, setting τ_max < 30 ns or τ_max > 300 ns leads to poor consistency. If we train the neural network with these problematic τ_max values, then solutions can look very different depending on the random seeds used for optimization. With VAC, setting τ < 10 ns or τ > 300 ns would lead to inconsistent results.

IVAC provides more flexibility to address the consistency issues than VAC, since we can integrate over a range of lag times. For the villin data, we choose to set τ_min = 1 ns and τ_max = 100 ns. For these parameter values, the consistency score is very good: the typical projection distance between subspaces with different random seeds is just 0.05. Moreover, 1–100 ns is a wide range of lag times, helping to ensure that optimal or near-optimal VAC lag times are included in the integration window.

To help explain why the consistency is so poor for small τ_max values, we present in Figure 6 a set of IVAC solutions obtained with an integration window of 1–3 ns, alongside solutions obtained with the chosen 1–100 ns window. As shown in Figure 6, the 1–100 ns IVAC solutions are nearly identical regardless of the random seed.

In summary, we have proposed a robust procedure for approximating eigenfunctions of the villin headpiece system. We have chosen to approximate a well-conditioned eigenspace that is separated from other eigenspaces by a wide spectral gap. Moreover, we have ensured that the IVAC results are consistent regardless of the random initialization and the randomly drawn subsets used to train the neural network. Because of these protocols, the neural network estimates shown in Figure 6 reliably identify clusters in the trajectory data indicative of folded/unfolded states.
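As a toy illustration of the first protocol, one could scan the estimated eigenvalues for the largest spectral gap and report the reciprocal-gap condition number in the spirit of Eq. (40). The helper below is our own sketch, not part of the paper's software:

```python
def pick_subspace(evals):
    """Choose the subspace dimension k with the largest spectral gap.

    evals: estimated eigenvalues sorted in decreasing order.
    Returns (k, kappa) where kappa = 1 / (evals[k-1] - evals[k]) is the
    reciprocal-gap condition number; a small kappa indicates a
    well-separated, well-conditioned subspace.
    """
    gaps = [evals[k - 1] - evals[k] for k in range(1, len(evals))]
    k = max(range(1, len(evals)), key=lambda j: gaps[j - 1])
    return k, 1.0 / gaps[k - 1]
```

For example, eigenvalues [1.0, 0.9, 0.2, 0.1] have their widest gap after the second eigenvalue, so the two-dimensional subspace is selected.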
Robustness to overfitting
In this section, we present results suggesting that nonlinear IVAC is more robust to overfitting than nonlinear VAC. This is crucial when the data set is too small for cross-validation.

To expose the overfitting issue with small data sets, we eliminate the early stopping and train IVAC and VAC until the training loss stabilizes. We calculate implied timescales by performing linear VAC on the outputs of the networks trained using IVAC and VAC, which we present in Figure 7.

Figure 7: Implied time scales (ITS) and power spectral densities (PSD) obtained with nonlinear IVAC and VAC with neural network basis functions applied to the villin headpiece data set. The VAC training lag time is marked by the dotted line in each panel.

We first compare the estimated implied timescales between the training and validation data sets. For both algorithms, the implied timescales calculated on the training data are larger than those calculated on the validation data. This is clear evidence of overfitting. However, we see that IVAC gives larger implied timescales on the validation data than VAC. In combination with the variational principle associated with the implied timescales, this suggests that IVAC is giving an improved estimate of the slow eigenfunctions.

Examining the implied timescales estimated on the training data shows further signs of overfitting. The VAC implied timescale estimates for the training data exhibit sharp peaks at the training lag time that are absent in the implied timescale estimates of the validation data. This suggests a hypothesis for the mechanism of overfitting: with a sufficiently flexible approximation space, VAC is able to find spurious correlations between features that happen to be separated by τ. This explains the smaller peaks at integral multiples of the lag time, since features artificially correlated at τ will be correlated at 2τ as well.

To confirm our hypothesis, we plot the power spectral density (PSD) of the time trace of the eigenfunction estimates in Figure 7. The PSD confirms the existence of a periodic component in the VAC results with a frequency at the inverse training lag time. In contrast, IVAC does not exhibit such a periodic component: in Figure 7, the 1–100 ns integration window leads to implied timescale estimates that depend smoothly on the data for both the training and the test data sets, providing further evidence that IVAC is comparatively robust while VAC results can be very sensitive to the particular lag time used.
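The paper computes PSDs with Welch's method (ref 40). As a rough stand-in, a plain periodogram is enough to show how a spurious periodic component at the inverse training lag time appears as a spectral peak; the function name and synthetic signal below are our own illustration:

```python
import numpy as np

def periodogram(x, dt=1.0):
    """Plain periodogram of a time trace (Welch's method would average
    windowed segments of this estimate to reduce its variance)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=dt)
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    return freqs, power

# a time trace with a spurious periodic component at frequency 1/tau
tau = 20.0
t = np.arange(2000)
trace = np.sin(2 * np.pi * t / tau)
freqs, power = periodogram(trace)
peak = freqs[1:][np.argmax(power[1:])]  # skip the zero-frequency bin
# the dominant peak sits at 1/tau = 0.05 for this synthetic trace
```

An overfit VAC eigenfunction estimate would show an analogous peak at the inverse training lag time, while a smooth IVAC estimate would not.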
Conclusion
In this paper we have presented integrated VAC (IVAC), a new extension of the popular variational approach to conformational dynamics (VAC). By integrating correlation functions over a window of lag times, IVAC provides robust estimates of the eigenfunctions of a system's transition operator.

To test the efficacy of the new approach, we compared IVAC and VAC results on two molecular systems. First, we applied the spectral estimation methods to simulation data for the alanine dipeptide. This is a relatively simple system that permits generation of extensive reference data for validating our calculations. As we varied the lag time parameters and the amount of data available, we observed the improved robustness of IVAC compared to VAC. IVAC gives low-error eigenfunction estimates even when the lag times range over multiple orders of magnitude. In contrast, VAC requires more precise lag-time tuning to give reasonable results.

Next we applied IVAC to analyze a folding/unfolding trajectory for the villin headpiece. These data contain relatively few folding/unfolding events despite pushing the limits of present computing technology. For this application, we used a flexible neural network representation built on top of a rich feature set. We presented a procedure for selecting parameters in IVAC that helps lead to robust performance in the face of uncertainty. For the application to the villin data, we found that VAC exhibited pronounced artifacts from overfitting unless precautions were specifically taken to prevent it, while IVAC did not.

Our work highlights the sensitivity of VAC calculations to error from insufficient sampling. Examining our results on the villin headpiece, we see that regularization (here, by early stopping) and validation are crucial when running VAC with neural networks or other flexible approximation spaces. With insufficient regularization or poor validation, these schemes easily overfit.
Even for the alanine dipeptide example, where we employ a simple basis on a statistically well-conditioned problem, we see that VAC has a high probability of giving spurious results with insufficient data.

Integrated VAC addresses this problem by considering information across multiple time lags. Future extensions of the work could further leverage this information. For instance, employing a well-chosen weighting function within the integral in (5) could further decrease hyperparameter sensitivity. Additionally, future numerical experiments could point to improved procedures for selecting τ_min and τ_max values. Finally, we could integrate over multiple lag times in other formalisms using the transition operator, such as schemes that estimate committors and mean first-passage times. These extensions would further strengthen the basic message of our work: combining information from multiple lag times leads to improved estimates of the transition operator and its properties.
Acknowledgement
EHT was supported by DARPA grant HR00111890038. RJW was supported by the National Science Foundation through award DMS-1646339. CL, ARD, and JW were supported by National Institutes of Health award R35 GM136381. JW was supported by the Advanced Scientific Computing Research Program within the DOE Office of Science through award DE-SC0020427. The villin headpiece data set was provided by D. E. Shaw Research. Computing resources were provided by the University of Chicago Research Computing Center.
Supporting Information Available
The following files are available free of charge: alanine dipeptide simulation details and error plots for additional individual alanine dipeptide trajectories; loss functions for villin nonlinear VAC and IVAC training.
References

(1) Noé, F.; Nüske, F. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Modeling & Simulation, 635–655.
(2) Nüske, F.; Keller, B. G.; Pérez-Hernández, G.; Mey, A. S.; Noé, F. Variational approach to molecular kinetics. Journal of Chemical Theory and Computation, 1739–1752.
(3) Pérez-Hernández, G.; Paul, F.; Giorgino, T.; De Fabritiis, G.; Noé, F. Identification of slow molecular order parameters for Markov model construction. The Journal of Chemical Physics, 07B604_1.
(4) Molgedey, L.; Schuster, H. G. Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 3634.
(5) Schwantes, C. R.; Pande, V. S. Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9. Journal of Chemical Theory and Computation, 2000–2009.
(6) Klus, S.; Nüske, F.; Koltai, P.; Wu, H.; Kevrekidis, I.; Schütte, C.; Noé, F. Data-driven model reduction and transfer operator approximation. Journal of Nonlinear Science, 985–1010.
(7) Schütte, C.; Fischer, A.; Huisinga, W.; Deuflhard, P. A direct approach to conformational dynamics based on hybrid Monte Carlo. Journal of Computational Physics, 146–168.
(8) Swope, W. C.; Pitera, J. W.; Suits, F. Describing protein folding kinetics by molecular dynamics simulations. 1. Theory. The Journal of Physical Chemistry B, 6571–6581.
(9) Swope, W. C.; Pitera, J. W.; Suits, F.; Pitman, M.; Eleftheriou, M.; Fitch, B. G.; Germain, R. S.; Rayshubski, A.; Ward, T. C.; Zhestkov, Y. et al. Describing protein folding kinetics by molecular dynamics simulations. 2. Example applications to alanine dipeptide and a β-hairpin peptide. The Journal of Physical Chemistry B, 6582–6594.
(10) Mardt, A.; Pasquali, L.; Wu, H.; Noé, F. VAMPnets for deep learning of molecular kinetics. Nature Communications, 5.
(11) Chen, W.; Sidky, H.; Ferguson, A. L. Nonlinear discovery of slow molecular modes using state-free reversible VAMPnets. The Journal of Chemical Physics, 214114.
(12) Webber, R. J.; Thiede, E. H.; Dow, D.; Dinner, A. R.; Weare, J. Error sources in spectral estimation for Markov processes. arXiv preprint arXiv:TBD.
(13) Takano, H.; Miyashita, S. Relaxation modes in random spin systems. Journal of the Physical Society of Japan, 3688–3698.
(14) Kallenberg, O. Foundations of Modern Probability; Springer Science & Business Media, 2006.
(15) Eisner, T.; Farkas, B.; Haase, M.; Nagel, R. Operator Theoretic Aspects of Ergodic Theory; Springer, 2015; Vol. 272.
(16) Hirao, H.; Koseki, S.; Takano, H. Molecular dynamics study of relaxation modes of a single polymer chain. Journal of the Physical Society of Japan, 3399–3405.
(17) Swope, W. C.; Pitera, J. W.; Suits, F. Describing protein folding kinetics by molecular dynamics simulations. 1. Theory. The Journal of Physical Chemistry B, 6571–6581.
(18) Pande, V. S.; Beauchamp, K.; Bowman, G. R. Everything you wanted to know about Markov State Models but were afraid to ask. Methods, 99–105.
(19) Vanden-Eijnden, E. An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation; Springer, 2014; pp 91–100.
(20) Noé, F.; Prinz, J.-H. In An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, Vol. 797 of Advances in Experimental Medicine and Biology; Bowman, G. R., Pande, V. S., Noé, F., Eds.; Springer, 2014; Chapter 6.
(21) Keller, B. G.; Aleksic, S.; Donati, L. In Biomolecular Simulations in Drug Discovery; Gervasio, F. L., Spiwok, V., Eds.; Wiley-VCH, 2019; Chapter 4.
(22) Vitalini, F.; Noé, F.; Keller, B. A basis set for peptides for the variational approach to conformational kinetics. Journal of Chemical Theory and Computation, 3992–4004.
(23) Boninsegna, L.; Gobbo, G.; Noé, F.; Clementi, C. Investigating molecular kinetics by variationally optimized diffusion maps. Journal of Chemical Theory and Computation, 5947–5960.
(24) Schwantes, C. R.; McGibbon, R. T.; Pande, V. S. Perspective: Markov models for long-timescale biomolecular dynamics. The Journal of Chemical Physics, 09B201.
(25) Wu, H.; Nüske, F.; Paul, F.; Klus, S.; Koltai, P.; Noé, F. Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations. The Journal of Chemical Physics, 154104.
(26) McGibbon, R. T.; Pande, V. S. Variational cross-validation of slow dynamical modes in molecular kinetics. The Journal of Chemical Physics, 124105.
(27) Naritomi, Y.; Fuchigami, S. Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: the case of domain motions. The Journal of Chemical Physics, 02B617.
(28) Husic, B. E.; Pande, V. S. Note: MSM lag time cannot be used for variational model selection. The Journal of Chemical Physics, 176101.
(29) Wu, H.; Prinz, J.-H.; Noé, F. Projected metastable Markov processes and their estimation with observable operator models. The Journal of Chemical Physics, 10B610_1.
(30) Suárez, E.; Adelman, J. L.; Zuckerman, D. M. Accurate estimation of protein folding and unfolding times: beyond Markov state models. Journal of Chemical Theory and Computation, 3473–3481.
(31) Cao, S.; Montoya-Castillo, A.; Wang, W.; Markland, T. E.; Huang, X. On the advantages of exploiting memory in Markov state models for biomolecular dynamics. The Journal of Chemical Physics, 014105.
(32) Thiede, E. H.; Giannakis, D.; Dinner, A. R.; Weare, J. Galerkin approximation of dynamical quantities using trajectory data. The Journal of Chemical Physics, 244111.
(33) Edelman, A.; Arias, T. A.; Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 303–353.
(34) Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Shaw, D. E. How fast-folding proteins fold. Science, 517–520.
(35) McKnight, J. C.; Doering, D. S.; Matsudaira, P. T.; Kim, P. S. A thermostable 35-residue subdomain within villin headpiece. 1996.
(36) Kubelka, J.; Eaton, W. A.; Hofrichter, J. Experimental tests of villin subdomain folding simulations. Journal of Molecular Biology, 625–630.
(37) Duan, Y.; Kollman, P. A. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science, 740–744.
(38) Sidky, H.; Chen, W.; Ferguson, A. L. High-Resolution Markov State Models for the Dynamics of Trp-Cage Miniprotein Constructed Over Slow Folding Modes Identified by State-Free Reversible VAMPnets. The Journal of Physical Chemistry B, 7999–8009; PMID: 31453697.
(39) Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. 2017.
(40) Welch, P. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 70–73.

Graphical TOC Entry: Error of VAC with different lag times compared with IVAC over a window of lag times.

Supporting Information for Integrated VAC: A robust strategy for identifying eigenfunctions of dynamical operators
E-mail: [email protected]; [email protected]; [email protected]; [email protected];[email protected] a r X i v : . [ phy s i c s . d a t a - a n ] S e p imulation details for the alanine dipeptide experiments All simulations were conducted using Gromacs 5.1.4.
The molecule was represented by the CHARMM 27 force field in a solvent modelled by 513 water molecules using a rigid TIP3P model. Long-range electrostatics were computed using particle-mesh Ewald summation at fourth order with a Fourier spacing of 0.12 nm. Each simulation used Langevin dynamics, integrated using the GROMACS leap-frog Langevin integrator with a 1 fs time step and a time constant of 0.1 ps at a temperature of 310 K. Hydrogen bonds were constrained to be rigid using LINCS, and water rigidity was enforced using SETTLE. In each simulation, the system was initialized at a density of 1 kg/L. The system was then equilibrated for 50 ps at constant volume, followed by another 50 ps equilibration at constant pressure using the Parrinello-Rahman barostat. Finally, the system was again equilibrated at constant volume for 50 ns. The data set used was obtained from a production run of 50 ns, with structures saved every 500 fs. To construct our references for the true eigenfunctions, we ran 10 simulations each of length 150 ns and constructed an MSM on all dihedral angles. This MSM had 500 Markov states, identified by k-means clustering, and we estimated the eigenfunctions and eigenvalues using pyEMMA.

References

(1) Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J. C.; Hess, B.; Lindahl, E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers.
SoftwareX, 19–25.
(2) Páll, S.; Abraham, M. J.; Kutzner, C.; Hess, B.; Lindahl, E. Tackling exascale software challenges in molecular dynamics simulations with GROMACS. International Conference on Exascale Applications and Software, 2014; pp 3–27.
(3) Pronk, S.; Páll, S.; Schulz, R.; Larsson, P.; Bjelkmar, P.; Apostolov, R.; Shirts, M. R.; Smith, J. C.; Kasson, P. M.; van der Spoel, D. et al. GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics, 845–854.
(4) Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J. GROMACS: fast, flexible, and free. Journal of Computational Chemistry, 1701–1718.
(5) Lindahl, E.; Hess, B.; Van Der Spoel, D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. Molecular Modeling Annual, 306–317.
(6) Berendsen, H. J.; van der Spoel, D.; van Drunen, R. GROMACS: a message-passing parallel molecular dynamics implementation. Computer Physics Communications, 43–56.
(7) MacKerell Jr, A. D.; Banavali, N.; Foloppe, N. Development and current status of the CHARMM force field for nucleic acids. Biopolymers: Original Research on Biomolecules, 257–265.
(8) Berendsen, H. J.; Postma, J. P.; van Gunsteren, W. F.; Hermans, J. Intermolecular Forces; Springer, 1981; pp 331–342.
(9) Essmann, U.; Perera, L.; Berkowitz, M. L.; Darden, T.; Lee, H.; Pedersen, L. G. A smooth particle mesh Ewald method.
The Journal of Chemical Physics, 8577–8593.
(10) Hess, B.; Bekker, H.; Berendsen, H. J.; Fraaije, J. G. LINCS: a linear constraint solver for molecular simulations. Journal of Computational Chemistry, 1463–1472.
(11) Miyamoto, S.; Kollman, P. A. Settle: An analytical version of the SHAKE and RATTLE algorithm for rigid water models. Journal of Computational Chemistry, 952–962.
(12) Parrinello, M.; Rahman, A. Polymorphic transitions in single crystals: A new molecular dynamics method. Journal of Applied Physics, 7182–7190.
(13) Scherer, M. K.; Trendelkamp-Schroer, B.; Paul, F.; Pérez-Hernández, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J.-H.; Noé, F. PyEMMA 2: a software package for estimation, validation, and analysis of Markov models. Journal of Chemical Theory and Computation, 5525–5542.
Figure S1: VAC (at the horizontal-axis lag time) and IVAC (with τ_min = 1 ps and τ_max = 1 ns) errors for all 30 of the 10-ns long alanine dipeptide trajectories.
Figure S2: VAC (at the horizontal-axis lag time) and IVAC (with τ_min = 1 ps and τ_max = 1 ns) errors for all 30 of the 20-ns long alanine dipeptide trajectories.

Figure S3: Train and test loss (VAMP score) versus optimization steps for nonlinear IVAC (1–3 ns, 1–100 ns, and 1–3000 ns windows) and VAC (100 ns and 3000 ns lag times).