Deflation reveals dynamical structure in nondominant reaction coordinates
Brooke E. Husic (1, 2, a) and Frank Noé (1, 3, b)
1) Department of Mathematics and Computer Science, Freie Universität Berlin, Germany
2) Department of Chemistry, Stanford University, Stanford, CA 94305, USA
3) Department of Chemistry, Rice University, Houston, TX 77005, USA
The output of molecular dynamics simulations is high-dimensional, and the degrees of freedom among the atoms are related in intricate ways. Therefore, a variety of analysis frameworks have been introduced in order to distill complex motions into lower-dimensional representations that model the system dynamics. These dynamical models have been developed to optimally approximate the system's global kinetics. However, the separate aims of optimizing global kinetics and modeling a process of interest diverge when the process of interest is not the slowest process in the system. Here, we introduce deflation into state-of-the-art methods in molecular kinetics in order to preserve the use of variational optimization tools when the slowest dynamical mode is not the same as the one we seek to model and understand. First, we showcase deflation for a simple toy system and introduce the deflated variational approach to Markov processes (dVAMP). Using dVAMP, we show that nondominant reaction coordinates produced using deflation are more informative than their counterparts generated without deflation. Then, we examine a protein folding system in which the slowest dynamical mode is not folding. Following a dVAMP analysis, we show that deflation can be used to obscure this undesired slow process from a kinetic model, in this case a VAMPnet. The incorporation of deflation into current methods opens the door for enhanced sampling strategies and more flexible, targeted model building.
I. INTRODUCTION
Molecular dynamics (MD) simulations provide a detailed view of a system that describes its atomistic dynamics: how a protein folds or misfolds, how a ligand binds, or how conformational changes occur. However, interpretable dynamical motifs are encoded in collective variables that must be computed from the simulation output. A wealth of research has been dedicated to extracting thermodynamic and kinetic information from time-series simulation data, often under the assumption that a Markovian lag time can be defined that is long enough for transitions to be memoryless but short enough to resolve the dynamics of interest (see Ref. 1 for a review).

When models are Markovian, the future state of the system depends only on its present state, and not on its history. Thus, the system dynamics can be completely described by time-lagged pairs of data points. The true Markovian dynamics are described by some unknown dynamical operator whose eigen- or singular functions can be approximated in a data-driven way using the time-lagged representation.

Recently, two variational principles were developed for computing the optimal approximations to the dominant eigen- or singular functions of the unknown operator, which represent the slow dynamical modes of the system. First, Noé and Nüske introduced the variational approach to conformational dynamics (VAC) for reversible dynamics; more recently, Wu and Noé derived the variational approach to Markov processes (VAMP) for irreversible dynamics, which is a generalization of the VAC.

a) Electronic mail: [email protected]
b) Electronic mail: [email protected]
When analyzing time-lagged data, suitable feature representations must be chosen for the data, which often involve transformations of the original spatial coordinates such that the features are rotation- and translation-invariant. For proteins, common features include backbone dihedral angles and pairwise contact distances. From the time-lagged feature representation, VAMP can be used in two ways. First, VAMP yields the optimal linear model from those features. Second, VAMP can be used to compare the approximation quality of nonlinear models of the features, such as Markov state models (MSMs), Koopman models, or VAMPnets. A method for comparing feature choices directly has also recently been presented.

The introduction of variational methods into kinetic model construction has shifted the modeling community away from heuristic parameter choices and towards objective model rankings. However, the VAC and the VAMP optimize the global kinetics, which are characterized by the slowest dynamical modes of the system. It is a common criticism of these methods that the slow modes are not always the same as the modes of interest. For example, the optimization method introduced by Banushkina and Krivov identifies a rare event as the optimal reaction coordinate instead of the desired mode, and the method derived by McGibbon, Husic, and Pande applied to a protein folding dataset finds an irrelevant distance contraction instead of the folding process. Thus, applying a variational method when modeling a time-series dataset will optimize the approximation of the slowest modes, regardless of whether they are interesting. While in some cases it can be straightforward to modify the kinetic model a posteriori in order to remove the undesired modes after model construction, it becomes messier if not impossible to remove the same slow mode from every candidate model when multiple models are ranked and their modes are nonequivalent.

Furthermore, constraints of commonly used kinetic models require that the slow modes they identify are mutually orthogonal, which places restrictions on the modeling of the process of interest if it is not the slowest process. Finally, researchers sometimes find that the reaction coordinates of desired modes feature structures also present in slower reaction coordinates of undesired modes, thus obscuring the nondominant process.

Inspired by these limitations, in this work we consider the use of deflation to address two problems at the frontier of molecular kinetics analyses. First, we investigate whether the incorporation of deflation into the variational approach can increase the interpretability of nondominant reaction coordinates. Deflated reaction coordinates are no longer required to be orthogonal to previously computed coordinates; thus, if previously computed reaction coordinates are undesired, deflation may prevent the contamination of interesting modes with uninteresting ones. Second, we examine whether the deflation of a specific process can be used to construct a new kinetic model that optimizes the description of only the desired slow dynamics. In this case, deflation is used to create a new basis, from which multiple models can be constructed and ranked with the guarantee that they are not optimizing for the accuracy of the uninteresting slow mode. From our results, we suggest several next steps for sampling and model construction.

II. THEORY

A. Time-lagged independent component analysis
The bio- and chemical physics communities often analyze time-series data in which the ordering of data points is not arbitrary but instead is indexed by time. A suite of algorithms developed for MD simulations involves the analysis of time-lagged data using feature transformations such that the dynamics are Markovian—i.e., independent of the system history for a defined lag time τ. Since the dynamics do not depend on system history, the entire dataset can be represented by T − τ time-lagged pairs, for T total simulation time. Equivalently, time-lagged data can be represented by two matrices,

X ≡ [z_1, . . . , z_{T−τ}]⊤,   (1)
Y ≡ [z_{1+τ}, . . . , z_T]⊤,   (2)

where z_t is the column vector of mean-free feature values at time t, such that X represents the first T − τ data points and Y represents their values after a time shift of τ. For MD analyses, we assume that the length of z_t is smaller than T; i.e., X and Y have more rows than columns. For example, z_t might contain the (mean-free) inter-residue distances or side chain torsional angles of a protein.

Time-lagged independent component analysis (TICA) leverages the matrix representations (1) and (2) to solve the generalized eigenvalue problem,

C_τ Φ = C_0 Φ Λ,   (3)

for symmetric correlation and covariance matrices C_τ and C_0, where,

C_τ ≡ (X⊤Y + Y⊤X) / 2(T − τ),   (4)
C_0 ≡ (X⊤X + Y⊤Y) / 2(T − τ),   (5)

respectively.

The TICA solution (3) gives a set of weights stored in the columns of Φ (i.e., the eigenvector solutions) which can be used to transform the data from its original space to a new space. Each solution is a reaction coordinate: it characterizes a dynamical process within the data. We will see later that TICA is a special case of canonical correlation analysis (CCA).

It is clear from definitions (4) and (5) that a strict reversibility requirement is imposed upon the TICA formulation. In particular, (4) assumes that viewing the data in the backwards direction in time would be equally valid if it were to have occurred in the forward direction, which introduces bias that may not be appropriate for systems with rare events. This reversibility requirement has both mathematical and physical implications: the eigenvalues of the solution will be real and the dynamical system modeled is assumed to be in thermodynamic equilibrium. The two interpretations are connected: the processes in the simulated system, represented by the weights (eigenvector solutions), relax to equilibrium when the dynamics are reversible.

TICA was introduced into MD analysis in 2011 as a method to identify slow kinetic modes in proteins. Later, TICA was used as a preprocessing step in the construction of MSMs. Most importantly for our purposes, TICA is equivalent to the linear VAC: namely, it has been shown to produce the optimal linear approximations for the slow modes of reversible systems. In other words, by transforming our time-lagged features using TICA solutions, we obtain the best possible linear description of the system kinetics using those features—as long as we are comfortable assuming reversible dynamics.

B. Variational approach to Markov processes
However, what if we are not comfortable assuming reversible dynamics, but still want to obtain linearly optimal approximations of the slow modes in our features? In 2017, Wu and Noé solved this problem: instead of seeking the eigenvector solutions of TICA, which requires a reversibility assumption, the optimal solutions instead come from the singular vectors of the so-called Koopman matrix.

The Koopman matrix K minimizes the error in the regression problem Y = XK. For time-lagged data, K thus approximates a dynamical propagator: the right hand side approximates the propagation of the data after the lag time τ. Approximating the singular functions of the propagator by analyzing K is the linear VAMP.

To compute K we first calculate three matrices from our mean-free, time-lagged dataset (without any reversibility requirements):

C_00 ≡ (1 / (T − τ)) X⊤X,   (6)
C_0τ ≡ (1 / (T − τ)) X⊤Y,   (7)
C_ττ ≡ (1 / (T − τ)) Y⊤Y.   (8)

Then, we perform a whitening transformation on the matrices X and Y,

X̃ ≡ X C_00^{−1/2},   (9)
Ỹ ≡ Y C_ττ^{−1/2},   (10)

in order to remove their internal covariances. Instead of determining K directly, we instead calculate K̃, which propagates X̃ → Ỹ in the whitened space. The solution to the regression problem Ỹ = X̃K̃ is given by the ordinary least squares estimator,

K̃ = (X̃⊤X̃)^{−1} X̃⊤Ỹ,   (11)

which, by using substitutions from (6) through (10), we can show yields,

K̃ = (X⊤X)^{−1/2} X⊤Y (Y⊤Y)^{−1/2} = C_00^{−1/2} C_0τ C_ττ^{−1/2}.   (12)

VAMP proceeds by performing the singular value decomposition (SVD) K̃ = U′SV′⊤, and then transforming the resulting matrices of left (U′) and right (V′) singular vectors into the non-whitened space. This procedure is summarized in Alg. 1.

Algorithm 1
VAMP
Require: X ∈ R^{n×m}, Y ∈ R^{n×p}   ▷ mean-free columns

    U′SV′⊤ ← SVD of (X⊤X)^{−1/2} X⊤Y (Y⊤Y)^{−1/2}
    U ← √n (X⊤X)^{−1/2} U′
    V ← √n (Y⊤Y)^{−1/2} V′
    x weights are the columns of U
    y weights are the columns of V

The outputs U and V store the reaction coordinates in their columns—the weights for the transformations of X̃ and Ỹ, respectively. By the definition of the SVD, we have the following orthogonality conditions:

U⊤ C_00 U = I,   (13)
V⊤ C_ττ V = I.   (14)

When C_00 = C_ττ and C_0τ = C_0τ⊤, the SVD of K̃ is equal to its eigendecomposition. In this case, the data is reversible, and VAMP is identical to TICA and the linear VAC.

C. Iterative CCA with NIPALS
CCA is a well-known statistical algorithm that finds transformations of X and Y such that their representations in the new latent space (i.e., defined by the transformations) are maximally correlated. Thus, VAMP is a version of CCA where the data is time-lagged. For VAMP, we saw in Alg. 1 that a single SVD is performed to obtain all the weights simultaneously.

CCA can be performed with a single SVD as in VAMP, or alternatively by using iterative deflation. This produces different weights due to the deflation process. The iterative deflation algorithm developed by Wold proceeds as follows:

Algorithm 2
CCA (NIPALS with iterative deflation) / dVAMP
Require: X ∈ R^{n×m}, Y ∈ R^{n×p}   ▷ mean-free columns
Let: A+ indicate the Moore-Penrose pseudoinverse of A

    for each component c do
        choose nonzero vector ω_c
        while φ_c is not converged do
            φ_c ← X+ ω_c   ▷ φ_c ≡ cth x weight vector
            ξ_c ← X φ_c   ▷ ξ_c ≡ cth x score vector
            ψ_c ← Y+ ξ_c   ▷ ψ_c ≡ cth y weight vector
            ω_c ← Y ψ_c   ▷ ω_c ≡ cth y score vector
        end while
        X ← X − ξ_c [X⊤ ξ_c (ξ_c⊤ ξ_c)^{−1}]⊤   ▷ Deflation of X
        Y ← Y − ω_c [Y⊤ ω_c (ω_c⊤ ω_c)^{−1}]⊤   ▷ Deflation of Y
    end for

The nonlinear iterative least squares (NIPALS) method in lines 1–9 of Alg. 2 is equivalent to the power method for computing dominant singular vectors. In lines 11–12, the data is deflated such that new dominant singular vectors can be calculated in the next iteration. In the molecular kinetics context, we can call Alg. 2 "dVAMP", i.e., VAMP with deflation. The first reaction coordinates (weights) calculated from dVAMP are identical to the first VAMP coordinates (see Appendices A and B). The crucial difference between VAMP and dVAMP is that the nondominant reaction coordinates (i.e., all but the first coordinate) are different due to deflation.

Deflation is common to statistical algorithms such as partial least squares (PLS) and principal component analysis (PCA), both of which can be performed using Alg. 2 with changes only to lines 4 and 7. Deflation can be viewed as subtracting the rank-one approximations of X and Y at step c from X and Y at the previous update. This is intended to minimize the effect of a previously computed weight on the subsequent weights by removing the presence of the associated singular value and vector. Using deflation to obtain the weights destroys the orthogonality conditions in (13) and (14).
Furthermore, there is no persistent matrix of which the singular vectors are computed (see also Appendix C). By applying the deflated weights to the non-deflated data,

ξ′_c = X φ_c,   (15)
ω′_c = Y ψ_c,   (16)

dVAMP produces transformations that differ from VAMP. We expect the weights identified using Alg. 2 to be minimally contaminated by previously identified coordinates.

To demonstrate the application of dVAMP in elucidating nondominant reaction coordinates, we consider a low-dimensional toy system. Then, for an atomistic protein folding system, we explore the use of deflation by performing a dVAMP analysis in order to deflate a specific slow dynamical mode from the feature data before applying a kinetic model. This strategy allows selected modes to be obscured from the model optimization procedure, because the associated timescale no longer appears slow.
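As a concrete illustration, Algs. 1 and 2 can be sketched in a few lines of NumPy. The function names and normalization conventions below are our own (not from a published implementation); in particular, the NIPALS weights are returned with unit norm, so they agree with the VAMP weights only up to sign and scale on the first component.

```python
# Minimal sketches of Alg. 1 (linear VAMP) and Alg. 2 (dVAMP).
# Names and normalizations are illustrative choices, not a published API.
import numpy as np

def vamp(X, Y):
    """Alg. 1: weight matrices U, V and singular values s for mean-free X, Y."""
    n = X.shape[0]
    def inv_sqrt(C):
        # inverse matrix square root via a symmetric eigendecomposition
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(w ** -0.5) @ Q.T
    C00_is = inv_sqrt(X.T @ X / n)               # C_00^{-1/2}
    Ctt_is = inv_sqrt(Y.T @ Y / n)               # C_tautau^{-1/2}
    Ktilde = C00_is @ (X.T @ Y / n) @ Ctt_is     # whitened Koopman matrix (12)
    Up, s, Vpt = np.linalg.svd(Ktilde)
    return C00_is @ Up, s, Ctt_is @ Vpt.T        # de-whitened U, V

def dvamp(X, Y, n_components, n_iter=2000, tol=1e-12):
    """Alg. 2: NIPALS with iterative deflation on mean-free X, Y."""
    X, Y = X.copy(), Y.copy()
    Wx, Wy, Sx, Sy = [], [], [], []
    for c in range(n_components):
        Xp, Yp = np.linalg.pinv(X), np.linalg.pinv(Y)
        omega = Y[:, 0].copy()                   # any nonzero starting vector
        phi = np.zeros(X.shape[1])
        for _ in range(n_iter):
            phi_new = Xp @ omega                 # c-th x weight vector
            phi_new /= np.linalg.norm(phi_new)   # unit-normalize for stability
            xi = X @ phi_new                     # c-th x score vector
            psi = Yp @ xi                        # c-th y weight vector
            omega = Y @ psi                      # c-th y score vector
            converged = np.linalg.norm(phi_new - phi) < tol
            phi = phi_new
            if converged:
                break
        Wx.append(phi); Wy.append(psi); Sx.append(xi); Sy.append(omega)
        # rank-one deflation of X and Y (the deflation lines of Alg. 2)
        X = X - np.outer(xi, X.T @ xi) / (xi @ xi)
        Y = Y - np.outer(omega, Y.T @ omega) / (omega @ omega)
    return (np.array(Wx).T, np.array(Wy).T, np.array(Sx).T, np.array(Sy).T)
```

On any dataset, the first dVAMP weight vector should coincide, up to sign and scale, with the first column of the VAMP U matrix, while subsequent components generally differ because of the deflation.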
III. EXPERIMENTS

A. Asymmetric double wells
Due to deflation, we expect that reaction coordinates obtained with dVAMP will minimally contaminate subsequently calculated coordinates. To investigate this, we design a dataset with dependencies among its degrees of freedom in order to mimic a real system, but with few enough dimensions that we can assess the results relative to the original data.

First, we perform three independent simulations along the following asymmetric double well potential,

U(x) = x^4 − x^2 + 0.1x,

which is visualized with a red line in the left plot of Fig. 1A. The first simulation produces the histogram of particle positions P_1(n) for n up to 50000 simulation steps. The histogram is illustrated on the same plot, and represents the first dimension of our toy dataset. The other two independent simulations yield particle positions P′_2(n) and P′_3(n). We then generate the second and third dimensions of our toy dataset from P′_2 and P′_3 so that they are dependent upon the first dimension P_1:

P_2(n) = { x ∼ N(0, 0.5),   P_1(n) > 0
         { P′_2(n) + 2,     otherwise

P_3(n) = { x ∼ N(1, 0.5),   P_1(n) > 0.5
         { P′_3(n) − 1,     otherwise,

where N is the normal distribution. The histograms of particle positions P_2(n) and P_3(n) are shown in the center and right plots of Fig. 1A.

FIG. 1. (A) Histograms of particle distributions for the three processes in the toy dataset. The potential used to simulate the first process is overlaid on the leftmost plot in red. (B) All histograms for the three reaction coordinates (transformations) produced from analysis with VAMP and (C) dVAMP. Gold and green stars indicate regions of the coordinates characterized by particle positions in the corresponding locations of the second and third processes, respectively. Filled stars indicate that the region of the reaction coordinate is exclusively populated by those regions of the original processes, and unfilled stars indicate that other regions of the process can also be found at that location on the reaction coordinate. The singular values σ in (B) are the same for the VAMP and dVAMP coordinates. The ρ values in (C) are the Spearman rank correlation coefficients of the dVAMP transformation with the corresponding VAMP transformation.

Now we wish to compare the results of the standard VAMP algorithm with the results of dVAMP. To do this, we input the mean-free three-dimensional coordinates into Algs. 1 and 2 and compute the VAMP and dVAMP reaction coordinates. Figures 1B and C illustrate the toy system results. The first reaction coordinate is identical to the first deflated reaction coordinate because no deflation has been performed. The singular values of the coordinates, provided in Fig. 1B, are identical.

The difference between VAMP and dVAMP becomes apparent in the nondominant reaction coordinates. The second VAMP coordinate is roughly Gaussian; in contrast, the second dVAMP coordinate reveals meaningful structures within the original data. Similarly, the third dVAMP transformation presents a more familiar structure with respect to the original data, whereas the third VAMP transformation is still "contaminated" by a Gaussian near the center of the coordinate. The Spearman rank correlation coefficients between pairs of corresponding reaction coordinates are also provided in Fig. 1C, and the coefficients' distances from 1 are consistent with qualitative assessments of their differences in structure.

FIG. 2. (A) Sampled villin structures from the MD trajectory analyzed. Helical secondary structure is colored and coils are white. Each image represents five structures sampled from similar locations in TIC space as determined by a 250-center k-means model built upon the first three original TICs. The purple structure represents the folded state, and the blue structure represents the denatured state. The green structure is a rare helical misfolded state that we assert is an artifact. (B) Two-dimensional histograms for TICA transformations constructed from villin contact distances. Dashed lines indicate the regions corresponding to the sampled structures of the same color. The first TIC tracks the conversion to and from the rare artifact only. The second TIC tracks the majority of the folding process and correlates well with RMSD to the folded structure.

B. Folding of villin

1. Modeling with TICA
Now we are interested in exploring how deflation can be used to tackle setbacks to kinetic modeling in a higher-dimensional protein example. To do this, we analyze an MD simulation of the 35-residue villin headpiece (PDB ID: 2f4k) performed by Lindorff-Larsen et al. The folded state is depicted in the purple structures in Fig. 2A. The simulation dataset consists of a single 125 µs trajectory with 34 instances each of folding and unfolding. For our analysis, we subsample the original dataset by a factor of 10 to produce 2 ns timesteps.

Before constructing a kinetic model, we convert the Cartesian coordinates output from the simulation into features; in this case, a set of pairwise contact distances, where the distance is defined by the closest heavy atoms in the pair. For our analysis, we consider the pairwise distances among a subset of twelve residues at three-residue intervals starting with the first (LEU 1, GLU 4, . . . , LEU 34) for a total of 66 distances.

Now, we want to use TICA to build a kinetic model for our data. To perform TICA, we can use VAMP (Alg. 1) with the reversibility conditions in Sec. II A. This is achieved by letting X ← X′ ⌢ Y′ and Y ← Y′ ⌢ X′, where X′ and Y′ are the original time-lagged data and ⌢ signifies concatenation along the row axis. We use a lag time of τ = 50 ns, or 25 time steps. The TICA results are shown in a two-dimensional histogram in Fig. 2B. The first TICA coordinate separates the majority of configuration space from a rare misfolded state (green in Fig. 2A). The second TICA coordinate separates the denatured state from the folded state (blue and purple in Fig. 2A, respectively).

In order to demonstrate the application of dVAMP and subsequent analysis with deflation, we will now assert that we are not interested in the rare misfolded state due to its poor sampling, and that it is therefore an artifact that is disruptive to our kinetic model.
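The reversibility trick described above (X ← X′ ⌢ Y′ and Y ← Y′ ⌢ X′) can be sketched as follows; the helper name is ours, not from a published API.

```python
# Sketch of the reversibility trick: concatenating the time-lagged
# data in both temporal directions so that the VAMP covariances
# reduce to the symmetric TICA estimates of Eqs. (4)-(5).
# The helper name is illustrative.
import numpy as np

def symmetrize_pairs(X_prime, Y_prime):
    X = np.concatenate([X_prime, Y_prime], axis=0)  # X <- X' ⌢ Y'
    Y = np.concatenate([Y_prime, X_prime], axis=0)  # Y <- Y' ⌢ X'
    return X, Y
```

After this substitution, X⊤Y is symmetric and X⊤X = Y⊤Y, so the SVD of K̃ coincides with an eigendecomposition and Alg. 1 returns the TICA solution (Sec. II B).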
If we wanted to stop at TICA, we could ignore the first TIC. However, we may be interested in any of the following further analyses:

(i) creating an MSM, especially when using kinetic mapping (in which the slowest modes are most important to the MSM state decomposition);
(ii) evaluating the approximation quality of that MSM using the VAC;
(iii) macrostating our TICA model or MSM using an eigenvector-based approach such as Perron cluster cluster analysis (PCCA);
(iv) constructing a VAMPnet to obtain a kinetic model (similar to performing (i)-(iii) in sequence);

or the application of a variety of other tools. What (i)-(iv) have in common is that if an undesired slow process is present in the model, then it will correspondingly affect analysis in the following ways:

(i) the MSM state decomposition will be influenced by the artifact,
(ii) MSM variational scores will be a function of how accurately the artifact kinetics are modeled,
(iii) divisive clustering methods such as PCCA will divide macrostates according to the artifact, and
(iv) the VAMPnet optimization will involve optimally modeling the artifact.
2. Removing undesired processes with deflation
The state of the art in Markovian kinetic modeling has not yet provided a solution to these problems other than the acquiescence that if we are approximating system kinetics then we must work with whatever the global kinetics are. However, the use of deflation provides a new option. In Sec. III A, we saw how the use of dVAMP can reveal structure in nondominant reaction coordinates. Now, we will selectively use deflation to prevent our model from optimizing the kinetics of the precise process we do not want. Rewriting line 11 in Alg. 2 for this purpose, we have,

X_new = X_original − ξ_c [X_original⊤ ξ_c (ξ_c⊤ ξ_c)^{−1}]⊤,   (17)

where c is the index of the coordinate that is undesired, and ξ_c is the score of that coordinate, which was calculated in the course of Alg. 2. This removes the effect of the cth coordinate on the dynamics by setting the eigen- or singular value of the associated eigen- or singular vector of the matrix product K̃ (12) to zero.

For our purposes, this means that deflation has rendered the corresponding dynamical process fast: it appears as noise in the new basis. Because the VAC, VAMP, and related methods seek to optimally approximate the slowest processes in the system, deflation effectively "removes" this process from the optimization procedure of the global kinetic model by lowering its timescale in the new feature basis.

To achieve this for villin, we perform dVAMP as in Sec. III A through the first dVAMP coordinate in order to obtain the scores for the undesired (slowest) process. Then, we apply (17) with c = 1. Recall that our data X_original are the 66 pairwise contact distances, and so X_new are corresponding deflated distances. The deflated distances form a basis in which the eigenvalue of an undesired process is set to zero.
They can now be input into any model ranking framework (e.g., MSMs or VAMPnets with variable hyperparameters) such that the resulting kinetic models will not be optimizing the artifact kinetics alongside the interesting ones.
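As a sketch, the deflation of (17) is a single rank-one update; the helper below (our naming) removes the linear signal along a given score vector, so that the deflated features carry no component along that score.

```python
# Sketch of Eq. (17): rank-one deflation of the feature matrix to
# remove an undesired coordinate. `xi` is the score vector of the
# unwanted process obtained from Alg. 2; names are illustrative.
import numpy as np

def deflate(X_original, xi):
    """Return X_new = X_original - xi [X_original^T xi (xi^T xi)^{-1}]^T."""
    return X_original - np.outer(xi, X_original.T @ xi) / (xi @ xi)
```

The deflated matrix is exactly orthogonal to the removed score (ξ_c⊤ X_new = 0), which is the sense in which the undesired process no longer appears slow in the new basis.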
FIG. 3. One-dimensional histograms for two-state VAMPnets constructed from the original distances (top row) and deflated distances (bottom row). Each row shows two different analyses of the same data. The use of two states was chosen as a hyperparameter a priori. The hard state assignments are determined by mapping soft state memberships to the state with the higher probability. The first VAMPnet (original distances) separates the rare artifact from the rest of the configurations. The second VAMPnet (deflated distances) separates structures near the folded state from the denatured state. Percentages indicate state populations.
3. Constructing VAMPnets for villin folding
We now investigate the difference in modeling results between the original and deflated bases using VAMPnets. VAMP (Alg. 1) calculates an SVD in line 1 which produces a diagonal k × k matrix S of positive singular values σ_i, i ≤ k. The sum of the highest k singular values raised to the rth power was proven by Wu and Noé to be a variational score—the VAMP-r score—for the approximation quality of the kinetic model:

VAMP-r ≡ Σ_{i=1}^{k} σ_i^r.   (18)

It follows from the existence of a scalar variational bound that machine learning algorithms can be designed to optimize it. Using neural networks, VAMPnets achieve exactly this by constructing few-state Koopman models directly from features by optimizing a VAMP score. Here, we build VAMPnets for both our original (mean-free) contact distances and our new, deflated basis. We chose VAMPnets because, due to the neural network architecture, there is no opportunity to manually modify a model while it is being built (e.g., by manually removing an undesired TIC).

For our VAMPnets, we use a lag time of 300 ns and choose two metastable states. The training and validation sets are split to be approximately equal in size (51% and 49% of the data, respectively), and 10 epochs are performed when training each model. The remaining hyperparameters are the defaults at the time of modeling (batch size, 1000; network depth, 6; layer width, 100; learning rate, 0.0001). The VAMPnet output is a list of state assignment probabilities for each frame in the trajectory; binary ("hard") memberships are determined by assigning each frame to the state with the greater probability.

Our VAMPnet results are illustrated in Fig. 3. The top row of histograms summarizes the VAMPnet constructed on the original contact distances. The upper left plot shows distributions for each (hard) VAMPnet state across the values of the first TIC from the original TICA analysis.
The upper right plot shows distributions for the same states across root-mean-square deviation (RMSD) values to the folded structure of villin. It is clear from these plots that the two-state VAMPnet separates our artifact (4.6% of frames) from the rest of the ensemble.

The second row of histograms shows the same analysis for the VAMPnet constructed on the deflated contact distances with the same hyperparameters. The lower left plot shows that both VAMPnet states contain structures from the artifact region. The lower right plot shows that the new two-state VAMPnet generally isolates the folded structure (26.8%) from the denatured ensemble. Thus, by using deflated contact distances, we obtain a VAMPnet that captures folding and unfolding using the same parameters with which we were unable to capture folding and unfolding before the basis deflation.

We can see from the lower left histogram in Fig. 3 that the "artifact" is present in the folded state, and further analysis shows that the majority of structures assigned to the folded state with RMSDs greater than 7 Å correspond to the artifact. Deflation therefore does not remove the artifact structures from the dataset. Instead, we have demonstrated that dVAMP can be used to prevent a kinetic model from optimizing the accuracy of a (subjectively) undesired process a priori.

IV. DISCUSSION
We have examined the incorporation of deflation into state-of-the-art methods in molecular kinetics. It is already known that VAMP is equivalent to CCA without deflation. In this study we explore the use of dVAMP—VAMP with deflation—for the development of reaction coordinates. Then, we demonstrate the selective use of deflation to remove a specific slow process from a kinetic model; in this case, a VAMPnet.

In the toy example in Sec. III A, we see that nondominant reaction coordinates obtained from dVAMP qualitatively provide more information about the system than their nondeflated VAMP counterparts. In addition to increased interpretability, these coordinates might be used for enhanced sampling. For example, MD simulations can be started from the top of a barrier revealed only in a deflated reaction coordinate.

In the protein folding example, we see that deflation can obscure a long-timescale process from a kinetic model in order to facilitate further analysis that is desired to be independent of that process. In MD simulations, the dominant processes may not be the same as the processes of interest due to low sampling or force field artifacts. Thus, deflation presents a systematic way of removing the effects of these undesired modes on model building. Using deflated data, MSMs, Koopman models, or VAMPnets can be constructed such that the eigen- or singular vectors of the resulting model capture only those dynamics which are under investigation. Furthermore, multiple models can be compared with the guarantee that none of them is optimizing the accuracy of the undesired slow mode.

Frameworks such as MSMs and Koopman models require that the dynamical modes are orthogonal. Thus, if dynamical modes of interest are faster than an undesired process, they are constrained by this orthogonality condition.
By deflating the basis, we gain the flexibility to optimally model the dynamical modes of interest without requiring that they are orthogonal to something uninteresting.

This work opens the door not only for improved analysis but also for related methods developments. Often we are interested in some separate observable that changes in time with the data and its relationship to the features. The implementation of a deflation method in the context of explaining the relationships between a feature and an observable is a valuable pursuit and has already been explored in the time-independent context. The adaptation of such methods to time-series dynamics would be of great utility to the field, and the advance presented here may serve as a starting point.

Several open source Python software packages facilitated this study. The double well potential in Sec. III A was simulated using PyEMMA. The 2D histogram in Fig. 2 was generated using Corner, contact distances for villin were calculated with MDTraj, and VAMPnets were calculated using the deeptime package. An example Jupyter notebook is provided to reproduce the analysis in Sec. III A, which also uses NumPy, SciPy, Scikit-learn, Matplotlib, and Seaborn. Protein trajectories were visualized with VMD.

ACKNOWLEDGEMENTS
BEH is extremely grateful to Fabian Paul, Muneeb Sultan, Moritz Hoffmann, Andreas Mardt, Luca Pasquali, Stefan Klus, Tim Hempel, and Chris Wehmeyer for helpful discussions and feedback. We thank D. E. Shaw Research for providing the protein simulation dataset. BEH and FN acknowledge funding from the European Commission (ERC CoG 772230 "ScaleCell") and the Deutsche Forschungsgemeinschaft (SFB1114/A04).
Appendix A: Intuition for the relationship between VAMP and canonical CCA
We can demonstrate the equivalence of VAMP and canonical CCA ("dVAMP" for time-lagged data) by considering the calculation of the first component, which is identical in both formulations.

By making substitutions from lines 4–8 of Alg. 2, we can write,

    φ_c ← X⁺ Y Y⁺ X φ_c.    (A1)

Further substituting the definition of the Moore–Penrose pseudoinverse, X⁺ ≡ (X^⊤X)^{−1} X^⊤, we obtain,

    φ_c ← (X^⊤X)^{−1} X^⊤ Y (Y^⊤Y)^{−1} Y^⊤ X φ_c,    (A2)

which has been shown to be equivalent to the power method for identifying the dominant eigenvector of a matrix. Thus we see that φ_c is the dominant eigenvector of the expression,

    (X^⊤X)^{−1} (X^⊤Y) (Y^⊤Y)^{−1} (Y^⊤X),    (A3)

for X at a given iteration (i.e., possibly deflated).

To obtain an intuition for the connection of this expression (A3) to the VAMP solution in Alg. 1, we first substitute (9), (10), and (12) into Ỹ = X̃K̃,

    Y C_ττ^{−1/2} = (X C^{−1/2})(C^{−1/2} C_τ C_ττ^{−1/2}),    (A4)

which simplifies to,

    Y = X C^{−1} C_τ.    (A5)

Thus the matrix that converts X to Y (i.e., propagates the time series if X and Y represent time-lagged data) is,

    K_f ≡ C^{−1} C_τ = (X^⊤X)^{−1} X^⊤Y,    (A6)

where the subscript f indicates that, in the time-lagged interpretation, the matrix propagates the data forward in time. It is easy to see that for the backward direction,

    K_b ≡ C_ττ^{−1} C_τ^⊤ = (Y^⊤Y)^{−1} Y^⊤X.    (A7)

Now, we can see that (A3) is equal to K_f K_b, as well as the analogous solution for the Y weights:

    (X^⊤X)^{−1} (X^⊤Y) (Y^⊤Y)^{−1} (Y^⊤X) = K_f K_b,    (A8)
    (Y^⊤Y)^{−1} (Y^⊤X) (X^⊤X)^{−1} (X^⊤Y) = K_b K_f.    (A9)

CCA with deflation via NIPALS obtains the dominant eigenvector of the left-hand sides, whereas VAMP calculates the dominant eigenvector of the right-hand sides. The dominant eigenvalues of the expressions in (A8) and (A9) are the squares of the dominant singular values obtained from VAMP and dVAMP.

In general, the equivalences above apply when X and Y are the same for both algorithms, and only the dominant component is considered.

Appendix B: CCA with iterative deflation via the SVD
Algorithm 2 can be rewritten with an SVD in place of the inner NIPALS loop as follows:
Algorithm 3 CCA (SVD with iterative deflation)

Require: X ∈ R^{n×m}, Y ∈ R^{n×p}    ▷ mean-free columns
for each component c do
    U′SV′^⊤ ← SVD[(X^⊤X)^{−1/2} X^⊤ Y (Y^⊤Y)^{−1/2}]
    U ← √n (X^⊤X)^{−1/2} U′
    V ← √n (Y^⊤Y)^{−1/2} V′
    φ_c ← first column of U    ▷ φ_c ≡ cth x weight vector
    ψ_c ← first column of V    ▷ ψ_c ≡ cth y weight vector
    ξ_c ← X φ_c    ▷ ξ_c ≡ cth x score vector
    ω_c ← Y ψ_c    ▷ ω_c ≡ cth y score vector
    X ← X − ξ_c [X^⊤ ξ_c (ξ_c^⊤ ξ_c)^{−1}]^⊤    ▷ Deflation of X
    Y ← Y − ω_c [Y^⊤ ω_c (ω_c^⊤ ω_c)^{−1}]^⊤    ▷ Deflation of Y
end for

However, the implementation of this algorithm contains multiple inverse square roots, which will be numerically unstable with deflation, leading to errors that are propagated with each new component. Regularization can be employed to avoid ill-conditioned covariance matrices, but then the solution should no longer be expected to match that of the algorithm with NIPALS. Indeed, it is generally a good idea to regularize covariance matrices, including in the NIPALS algorithm and TICA. The interested reader is referred to Appendix B in Ref. 9.
Appendix C: Connection to Koopman operator approximation
The matrix K is a finite-dimensional linear approximation to the continuous integral Koopman operator 𝒦 and can be called the Koopman matrix. In the time-lagged context, the operator propagates a view of the system at time t to a possibly different view of the system at time t + τ. While we have focused on obtaining reaction coordinates, it is useful to note that the weight (singular) vectors obtained in VAMP (Alg. 1) are approximations to the singular functions of the Koopman operator. The first weight vector calculation in dVAMP (Alg. 2) approximates the dominant singular functions because it is equivalent to the single SVD calculation in VAMP (Alg. 1; see Appendices A and B).

However, if a relationship between the subsequent weight vectors calculated in dVAMP and 𝒦 exists, it is not straightforward. In a sense, a new operator is approximated with each deflation for a matrix built from new, deflated datasets X_c and Y_c, in which the presence of the previously calculated c weight vectors has been removed. Thus, each new set of weight vectors approximates the dominant singular functions of some new operator 𝒦_c, which describes the deflated dynamics.

[1] B. E. Husic and V. S. Pande, "Markov state models: From an art to a science," J. Am. Chem. Soc., 2386–2396 (2018).
[2] C. Schütte and M. Sarich, "A critical appraisal of Markov state models," Eur. Phys. J., 2445–2462 (2015).
[3] F. Noé and F. Nüske, "A variational approach to modeling slow processes in stochastic dynamical systems," Multiscale Model. Simul., 635–655 (2013).
[4] H. Wu and F. Noé, "Variational approach for learning Markov processes from time series data," arXiv preprint arXiv:1707.04659 (2017).
[5] H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai, and F. Noé, "Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations," J. Chem. Phys., 154104 (2017).
[6] A. Mardt, L. Pasquali, H. Wu, and F. Noé, "VAMPnets for deep learning of molecular kinetics," Nature Commun., 5 (2018).
[7] M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu, and F. Noé, "Variational selection of features for molecular kinetics," J. Chem. Phys., 194108 (2019).
[8] P. V. Banushkina and S. V. Krivov, "Nonparametric variational optimization of reaction coordinates," J. Chem. Phys., 184108 (2015).
[9] R. T. McGibbon, B. E. Husic, and V. S. Pande, "Identification of simple reaction coordinates from complex dynamics," J. Chem. Phys., 044109 (2017).
[10] Y. Naritomi and S. Fuchigami, "Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: the case of domain motions," J. Chem. Phys., 065101 (2011).
[11] C. R. Schwantes and V. S. Pande, "Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9," J. Chem. Theory Comput., 2000–2009 (2013).
[12] G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, "Identification of slow molecular order parameters for Markov model construction," J. Chem. Phys., 015102 (2013).
[13] Here, we adopt the statistics convention by writing our linear equation Y = XK. In the propagator context, it is more common to write Y = KX, and thus the equations for K in this work are equal to K^⊤ with the latter convention.
[14] F. Paul, H. Wu, M. Vossel, B. L. de Groot, and F. Noé, "Identification of kinetic order parameters for non-equilibrium dynamics," J. Chem. Phys., 164120 (2019).
[15] C. Wehmeyer and F. Noé, "Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics," J. Chem. Phys., 241703 (2018).
[16] We assume in (9)–(12) that the matrices X^⊤X, etc., are full rank.
[17] H. Hotelling, "Relations between two sets of variates," Biometrika, 321–377 (1936).
[18] H. Wold, "Partial least squares," in Encyclopedia of Statistical Sciences (American Cancer Society, 2006).
[19] F. Noé, "Machine learning for molecular dynamics on long timescales," arXiv preprint arXiv:1812.07669 (2018).
[20] The Moore–Penrose pseudoinverse A⁺ is defined as (A^⊤A)^{−1}A^⊤ and is equivalent to A^{−1} when A is invertible.
[21] H. Wold, "Nonlinear iterative partial least squares (NIPALS) modelling: some current developments," in Multivariate Analysis–III (Elsevier, 1973), pp. 383–407.
[22] J. A. Wegelin et al., "A survey of partial least squares (PLS) methods, with emphasis on the two-block case," University of Washington, Department of Statistics, Tech. Rep. (2000).
[23] H. Wold, "Path models with latent variables: The NIPALS approach," in Quantitative Sociology (Elsevier, 1975), pp. 307–357.
[24] PCA is equivalent to PLS when X = Y.
[25] P. A. White, "The computation of eigenvalues and eigenvectors of a matrix," SIAM J. Appl. Math., 393–437 (1958).
[26] Note that if the deflated weights are applied to the deflated data, the transformations will be the same in VAMP and dVAMP. The nonequivalence of VAMP and dVAMP transformations occurs when applying the deflated weights to nondeflated data.
[27] C. Spearman, "The proof and measurement of association between two things," Am. J. Psychol., 72–101 (1904).
[28] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, "How fast-folding proteins fold," Science, 517–520 (2011).
[29] F. Noé and C. Clementi, "Kinetic distance and kinetic maps from molecular dynamics simulation," J. Chem. Theory Comput., 5002–5011 (2015).
[30] R. T. McGibbon and V. S. Pande, "Variational cross-validation of slow dynamical modes in molecular kinetics," J. Chem. Phys., 124105 (2015).
[31] P. Deuflhard, W. Huisinga, A. Fischer, and C. Schütte, "Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains," Linear Alg. Appl., 39–59 (2000).
[32] This is different from ξ′_c in Eqn. 15, which is not calculated in the course of Alg. 2 on data that has been deflated c times, but instead is calculated on the original data afterward.
[33] Note that if the undesired process is not the slowest process, i.e., c > 1, we can deflate it exclusively using (17) once for c only. However, we must perform dVAMP through the cth coordinates to obtain the necessary weights.
[34] To calculate all of the distances, we perform the same transformation for Y using ω_c in order to obtain the last τ data points, which are not present in X due to the time lag.
[35] K. A. Beauchamp, R. McGibbon, Y.-S. Lin, and V. S. Pande, "Simple few-state models reveal hidden complexity in protein folding," Proc. Natl. Acad. Sci., 17807–17813 (2012).
[36] J. S. Hub and B. L. de Groot, "Detection of functional modes in protein dynamics," PLoS Comput. Biol., e1000480 (2009).
[37] T. Krivobokova, R. Briones, J. S. Hub, A. Munk, and B. L. de Groot, "Partial least-squares functional mode analysis: application to the membrane proteins AQP1, Aqy1, and CLC-ec1," Biophys. J., 786–796 (2012).
[38] C. Wehmeyer, M. K. Scherer, T. Hempel, B. E. Husic, S. Olsson, and F. Noé, "Introduction to Markov state modeling with the PyEMMA software [Article v1.0]," LiveCoMS, 5965 (2018).
[39] D. Foreman-Mackey, "corner.py: Scatterplot matrices in Python," J. Open Source Softw. (2016).
[40] R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande, "MDTraj: a modern open library for the analysis of molecular dynamics trajectories," Biophys. J., 1528–1532 (2015).
[41] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, and C. Willing, "Jupyter notebooks: a publishing format for reproducible computational workflows," in Positioning and Power in Academic Publishing: Players, Agents and Agendas, edited by F. Loizides and B. Schmidt (IOS Press, 2016), pp. 87–90.
[42] S. Van Der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: a structure for efficient numerical computation," Comput. Sci. Eng., 22 (2011).
[43] E. Jones, T. Oliphant, P. Peterson, et al., "SciPy: Open source scientific tools for Python," (2001–).
[44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., 2825–2830 (2011).
[45] J. D. Hunter, "Matplotlib: A 2D graphics environment," Comput. Sci. Eng., 90–95 (2007).
[46] M. Waskom, O. Botvinnik, D. O'Kane, P. Hobson, S. Lukauskas, D. C. Gemperline, T. Augspurger, Y. Halchenko, J. B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Yarkoni, M. L. Williams, C. Evans, C. Fitzgerald, Brian, C. Fonnesbeck, A. Lee, and A. Qalieh, "mwaskom/seaborn: v0.8.1 (September 2017)," (2017).
[47] W. Humphrey, A. Dalke, and K. Schulten, "VMD: Visual Molecular Dynamics," J. Mol. Graph., 33–38 (1996).
[48] R. Penrose, "A generalized inverse for matrices," Math. Proc. Cambridge Philos. Soc., 406–413 (1955).
[49] B. O. Koopman, "Hamiltonian systems and transformation in Hilbert space," Proc. Natl. Acad. Sci., 315–318 (1931).
[50] I. Mezić, "Spectral properties of dynamical systems, model reduction and decompositions," Nonlinear Dyn., 309–325 (2005).
[51] S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, C. Schütte, and F. Noé, "Data-driven model reduction and transfer operator approximation," J. Nonlinear Sci. 28.