Accelerated Simulations of Molecular Systems through Learning of their Effective Dynamics
Pantelis R. Vlachas, Julija Zavadlav, Matej Praprotnik, Petros Koumoutsakos
Computational Science and Engineering Laboratory, ETH Zurich, CH-8092 Zurich, Switzerland
Professorship of Multiscale Modeling of Fluid Materials, Department of Mechanical Engineering, Technical University of Munich, 80333 Munich, Germany
Laboratory for Molecular Modeling, National Institute of Chemistry, SI-1001 Ljubljana, Slovenia
Department of Physics, Faculty of Mathematics and Physics, University of Ljubljana, SI-1000 Ljubljana, Slovenia
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
† Corresponding author. E-mail: [email protected]
Simulations are vital for understanding and predicting the evolution of complex molecular systems. However, despite advances in algorithms and special-purpose hardware, accessing the timescales necessary to capture the structural evolution of bio-molecules remains a daunting task. In this work we present a novel framework to advance simulation timescales by up to three orders of magnitude, by learning the effective dynamics (LED) of molecular systems. LED augments the equation-free methodology by employing a probabilistic mapping between coarse and fine scales using mixture density network (MDN) autoencoders, and evolves the non-Markovian latent dynamics using long short-term memory MDNs. We demonstrate the effectiveness of LED in the Müller-Brown potential, the Trp Cage protein, and the alanine dipeptide. LED identifies explainable reduced-order representations and can generate, at any instant, the respective all-atom molecular trajectories. We believe that the proposed framework provides a dramatic increase in simulation capabilities and opens new horizons for the effective modeling of complex molecular systems.

Introduction
Over the last 30 years, molecular dynamics (MD) simulations of biological macro-molecules have advanced our understanding of their structure and function ( ). Today, MD simulations have become an essential tool for scientific discovery in the fields of biology, chemistry, and medicine. However, they remain hampered by their limited access to timescales of biological relevance for protein folding pathways, conformational dynamics, and rare-event kinetics.

In order to resolve this bottleneck, two complementary approaches have been pursued. First efforts centered around innovative hardware solutions, starting with crowd-sourcing of compute cycles ( ), and have more recently received a boost with the Anton machine ( ), enabling remarkable, second-long simulations of small bio-molecules. Complementary algorithmic efforts aim to advance time scales by systematic coarse-graining of the system dynamics. One of the first such studies used principal component or normal mode analysis to simulate the conformational changes in proteins ( ). Several coarse-graining (CG) methods reduce the complexity of molecular systems by modeling several atoms as a single particle ( ). Backmapping techniques ( ) can subsequently be utilized to recover the atomistic degrees of freedom from a CG representation. Multiscale approaches combine atomistic and coarse-grained/continuum models ( ) to augment the accessible timescales, while significant efforts have focused on enhanced sampling techniques ( ). Several of these methods exploit the fact that the coarse kinetic dynamics on the molecular level are often governed by a few slow collective variables (CVs), also termed reaction coordinates ( ), or by transitions between a few long-lived metastable states (28, 29).
The CVs are typically specified a priori, and their choice crucially impacts the performance and success of the respective sampling methods. Similar to the CG models, the CVs provide a low-order representation of the molecular system, albeit without a particle representation. CVs are of much lower dimensionality than CG models, and retrieving atomistic configurations from CVs is a more challenging problem. While many research efforts have addressed the coarse-to-fine mapping in CG models, the literature is still scarce on methods to retrieve atomistic configurations from CVs.

Machine learning (ML) methods (30, 31),
exploiting the expressive power of deep networks and their scalability to large datasets, have been used to alleviate the computational burden associated with the simulation of proteins, leading to profound scientific discoveries ( ). The pioneering work of Ref. ( ) utilized neural networks to learn an approximate potential energy surface of density functional theory (DFT) in bulk silicon from quantum mechanical calculations, performing MD simulations with this approximate potential and thereby accelerating the DFT simulations. The field of data-driven learning of potential energy surfaces and force fields is rapidly attracting attention, with important recent extensions and applications ( ). ML is employed to identify CG models for MD in Refs. ( ). Boltzmann generators are proposed in Ref. ( ) to sample directly from the equilibrium distribution of a molecular system, surpassing the need to perform MD.

Early ML methods for the identification of CVs build on the variational approach ( ), leading to time-lagged independent component analysis (TICA) ( ). TICA is based on Koopman operator theory, suggesting the existence of a latent transformation to an infinite-dimensional space that linearizes the dynamics on average. As a consequence, slow CVs are modeled as linear combinations of feature functions of the state of the protein (atom coordinates, or internal structural coordinates). Coarse-graining of the molecular dynamics is achieved by discretizing the state space and employing indicator vector functions as features (49, 51–53).
Consequently, the feature-state dynamics reduce to the propagation law of a Markov state model (MSM). More recently, the need for expert knowledge to construct the latent feature functions has been alleviated by learning the latent space using neural networks (54, 55).
The dynamics in the latent space are assumed to be linear and Markovian. For example, VAMPnets (54, 56)
learn nonlinear features of the molecular state with autoencoder (AE) networks. However, they are not generative and cannot recover the detailed configuration of the protein (decoding part). Moreover, the method requires the construction of an MSM to sample the latent dynamics and approximate the time-scales of the dynamics. Time-lagged AEs have been utilized to identify a reaction-coordinate embedding and propagate the dynamics in Ref. ( ), but they are not generative, as the learned mappings are deterministic, while the effective dynamics are assumed to be Markovian.

Extensions to generative approaches include Refs. ( ). In Ref. ( ), a deep generative MSM is utilized to capture the long-timescale dynamics and sample realistic alanine dipeptide configurations. Even though mixture density networks (MDNs) are employed in Ref. ( ) to propagate the dynamics in the latent space, memory effects are not taken into account, and the proposed method is based on the autocorrelation loss, which suffers from a dependency on the batch size ( ). In ( ), the reweighted autoencoded variational Bayes for enhanced sampling (RAVE) method is proposed, which alternates between iterations of MD and a variational AE (VAE) model. RAVE encodes each time-step independently, without taking into account the temporal aspect of the latent dynamics, and it requires the transition to the high-dimensional configuration space to progress the simulation in time, which can be computationally expensive.

The works mentioned above imply memory-less (Markovian) latent-space dynamics by selecting an appropriate time-lag in the master equations (51, 52).
The time-lag is usually estimated heuristically, balancing the requirements to be large enough so that the Markovian assumption holds, and at the same time small enough to ensure that the method samples the configuration space efficiently. We remark that in cases where a protein interacts with a solvent, only the configuration of the protein is taken into account and not the solvent. This renders the Markovian assumption on the latent dynamics rather unrealistic. This issue is addressed in this work by employing Long Short-Term Memory (LSTM) ( ) recurrent neural networks (RNNs) that capture memory effects of the latent dynamics.

Here we propose a novel data-driven generative framework that relies on learning the effective dynamics (LED) of molecular systems ( ). LED is founded on the equation-free framework (EFF) ( ) and enriches it with ML methodologies: the latent-space dynamics are evolved with a Mixture Density Network - Long Short-Term Memory RNN (MDN-LSTM), and the two-way mapping between coarse and fine scales is learned with Mixture Density Network autoencoders (MDN-AEs) ( ). These enrichments are essential in extending the applicability of EFF to non-Markovian settings and problems with strong nonlinearities. We demonstrate the effectiveness of the LED framework in simulations of the Müller-Brown potential (MBP), the Trp Cage miniprotein, and the alanine dipeptide in water. LED can accurately capture the statistics and reproduce the free energy landscape from data. Moreover, LED uncovers low-energy metastable states in the free energy projected onto the latent space and recovers the transition time-scales between them. We find that in simulations of the alanine dipeptide and the Trp Cage miniprotein, LED is three orders of magnitude faster than the classical MD solver. As a data-driven generative method, LED can sample novel unseen configurations, interpolating the training data and accelerating the exploration of the state space.

Materials and Methods

The LED framework ( ) for molecular systems is founded on the equation-free framework (EFF) ( ). It addresses the key bottlenecks of EFF, namely the coarse-to-fine mapping and the evolution of the latent space, using an MDN-AE and an MDN-LSTM, respectively. An illustration of the LED framework is given in Figure 1.

In the following, the state of a molecule at time $t$ is described by a high-dimensional vector $s_t \in \Omega \subseteq \mathbb{R}^{d_s}$, where $d_s \in \mathbb{N}$ denotes its dimension. The state vector can include the atom positions or their rotation/translation-invariant features, obtained for example with the Kabsch transform ( ). A trajectory of this system is obtained by an MD integrator, and the state of the molecule after a time-step $\Delta t$ is described by the probability distribution function (PDF)

$$ p(s_{t+\Delta t} \mid s_t). \qquad (1) $$

The transition distribution in Equation 1 depends on the choice of $\Delta t$.

Mixture Density Network (MDN) Autoencoder (AE)
Here the MDN-AE is utilized to identify the latent (coarse) representation and to upscale it probabilistically to the high-dimensional state space. MDNs ( ) are neural architectures that can represent arbitrary conditional distributions: the MDN output is a parametrization of the distribution of a multivariate random variable conditioned on the input of the network.

The latent state is computed by $z_t = \mathcal{E}(s_t; w_E)$, where $\mathcal{E}$ is the encoder (a deep neural network) with trainable parameters $w_E$, and $z_t \in \mathbb{R}^{d_z}$ with $d_z \ll d_s$. Since $z_t$ is a coarse approximation, many states can be mapped to the same $z_t$. As a consequence, a deterministic mapping $z_t \to s_t$, like the one used in Refs. (54, 55), is not suitable. Here, an MDN is employed to model the upscaling conditional PDF $p(s_t \mid z_t)$, described by the parameters $w_{s|z}$. These parameters are the outputs of the decoder with weights $w_D$ and are a function of the latent representation $z_t$, i.e.,

$$ w_{s|z}(z_t) = \mathcal{D}(z_t; w_D). \qquad (2) $$

The state of the molecule can then be sampled from $p(s_t \mid z_t) := p(s_t; w_{s|z})$.

Including in the state $s_t$ the rotation/translation-invariant features of the molecule under study ( ) ensures that the MDN samples physically meaningful molecular configurations. The state $s_t$ is composed of states representing bond lengths, $s_t^b \in \mathbb{R}^{d_s^b}$, and angles, $s_t^a \in \mathbb{R}^{d_s^a}$. Initially, the MD data of the bonds are scaled to $[0, 1]$. An auxiliary variable vector $v_t \in \mathbb{R}^{d_s^b}$ is defined to model the distribution of bonds. In particular, $p(v_t \mid z_t)$ is modeled as a Gaussian mixture model with $K_s$ mixture kernels,

$$ p(v_t \mid z_t) = \sum_{k=1}^{K_s} \pi_v^k(z_t)\, \mathcal{N}\big(\mu_v^k(z_t), \sigma_v^k(z_t)\big), \qquad (3) $$

and the mapping $s_t^b = \ln(1 + \exp(v_t))$ is used to recover the distribution of the scaled bond lengths at the output. The functional form of the mixing coefficients $\pi_v^k(z_t)$, the means $\mu_v^k(z_t)$, and the variances $\sigma_v^k(z_t)$ is a deep neural network (decoder $\mathcal{D}$). The distribution of the angles is modeled with the circular normal (von Mises) distribution, i.e.,

$$ p(s_t^a \mid z_t) = \sum_{k=1}^{K_s} \pi_{s^a}^k(z_t)\, \frac{\exp\!\big(\nu_{s^a}^k(z_t) \cos(s_t^a - \mu_{s^a}^k(z_t))\big)}{2\pi I_0\big(\nu_{s^a}^k(z_t)\big)}, \qquad (4) $$

where $I_0(\nu_{s^a}^k)$ is the modified Bessel function of order $0$. Here again, the functional form of $\pi_{s^a}^k(z_t)$, $\mu_{s^a}^k(z_t)$, and $\nu_{s^a}^k(z_t)$ is a deep neural network (decoder $\mathcal{D}$).

In total, the outputs of the decoder $\mathcal{D}$ that parametrize $p(s_t \mid z_t)$ are

$$ w_{s|z} = \{\pi_v^k, \mu_v^k, \sigma_v^k, \pi_{s^a}^k, \mu_{s^a}^k, \nu_{s^a}^k\}_{k \in \{1, \ldots, K_s\}}, \qquad (5) $$

which are all functions of the latent state $z_t$, the decoder input.
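To make the decoder parametrization of Eqs. (2)-(5) concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: class names, layer sizes, and the use of a single shared mixture index per sample are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNDecoder(nn.Module):
    """Sketch of the MDN decoder head: a Gaussian mixture over auxiliary
    bond variables v_t (Eq. 3) and a von Mises mixture over angles (Eq. 4)."""
    def __init__(self, d_z, d_b, d_a, K):
        super().__init__()
        self.K, self.d_b, self.d_a = K, d_b, d_a
        self.hidden = nn.Sequential(nn.Linear(d_z, 50), nn.Tanh())  # placeholder size
        self.pi_v, self.mu_v, self.sigma_v = (nn.Linear(50, K),
                                              nn.Linear(50, K * d_b),
                                              nn.Linear(50, K * d_b))
        self.pi_a, self.mu_a, self.nu_a = (nn.Linear(50, K),
                                           nn.Linear(50, K * d_a),
                                           nn.Linear(50, K * d_a))

    def forward(self, z):
        h = self.hidden(z)
        bonds = (F.softmax(self.pi_v(h), dim=-1),
                 self.mu_v(h).view(-1, self.K, self.d_b),
                 F.softplus(self.sigma_v(h)).view(-1, self.K, self.d_b))
        angles = (F.softmax(self.pi_a(h), dim=-1),
                  torch.pi * torch.tanh(self.mu_a(h)).view(-1, self.K, self.d_a),
                  F.softplus(self.nu_a(h)).view(-1, self.K, self.d_a))
        return bonds, angles

def sample_state(decoder, z):
    """Sample bond lengths and angles from p(s_t | z_t)."""
    (pi_v, mu_v, sig_v), (pi_a, mu_a, nu_a) = decoder(z)
    k = torch.distributions.Categorical(probs=pi_v).sample()
    idx = k[:, None, None].expand(-1, 1, mu_v.shape[-1])
    v = torch.normal(mu_v.gather(1, idx), sig_v.gather(1, idx)).squeeze(1)
    s_b = F.softplus(v)                       # s^b = ln(1 + e^v), Eq. (3) mapping
    k = torch.distributions.Categorical(probs=pi_a).sample()
    idx = k[:, None, None].expand(-1, 1, mu_a.shape[-1])
    s_a = torch.distributions.VonMises(
        mu_a.gather(1, idx), nu_a.gather(1, idx)).sample().squeeze(1)
    return s_b, s_a
```

The softplus output nonlinearity enforces positive standard deviations and concentrations, and the von Mises means are squashed to $[-\pi, \pi]$; these are common design choices, not details confirmed by the text.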
Figure 1: High-dimensional (fine-scale) dynamics $s_t$ are simulated for a short period $T_\mu$ (for proteins in water, the water molecules are ignored in the state). During this warm-up period, the state $s_t$ is passed through the encoder network (teacher forcing of the RNN for $T_\mu$). The outputs of the encoder $z_t$ provide the time-series input to the LSTM, allowing for the update of its hidden state $h_t$ and thus capturing non-Markovian effects. The output of the LSTM is a parametrization of the probabilistic non-Markovian latent dynamics $p(z_t \mid h_t)$, i.e., $z_{t+\Delta t} \sim p(\cdot \mid z_t, z_{t-\Delta t}, \ldots)$. Starting from the last latent state $z_t$, the LSTM iteratively samples $p(z_t \mid h_t)$ and propagates the low-order latent dynamics up to a total horizon of $T_m$ time units, with $T_m \gg T_\mu$. The LED decoder may be utilized at any desired time-scale to map the latent state back to a high-dimensional representation, $\tilde{s}_{t+T_m} \sim p(\cdot \mid z_{t+T_m}, z_{t+T_m-\Delta t}, \ldots)$. Propagation in the low-order space unraveled by LED is orders of magnitude cheaper than evolving the high-dimensional system based on first principles (molecular dynamics, density functional theory, etc.).

The MDN-AE is trained to predict the mixing coefficients by maximizing the data likelihood,

$$ w_E, w_D = \underset{w_E, w_D}{\operatorname{argmax}}\; p(s_t \mid z_t) = \underset{w_E, w_D}{\operatorname{argmax}}\; p\big(s_t; w_{s|z}\big), \qquad (6) $$

where $w_{s|z} = \mathcal{D}\big(\mathcal{E}(s_t; w_E); w_D\big)$ is the output of the MDN-AE and $s_t$ are the MD data. The details of the training procedure can be found in ( ).

Long Short-Term Memory recurrent neural network (LSTM)

The latent dynamics may be characterized by non-Markovian effects, i.e., $p(z_{t+\Delta t} \mid z_t, z_{t-\Delta t}, \ldots)$, due to the neglected degrees of freedom (solvent) or the selection of a relatively small time-lag $\Delta t$. Here the LSTM cell architecture ( ) is utilized to evolve the nonlinear and non-Markovian latent dynamics. The propagation in the LSTM is given by

$$ h_t, c_t = \mathcal{R}\big(z_t, h_{t-\Delta t}, c_{t-\Delta t}; w_R\big), \qquad (7) $$

where the hidden-to-hidden recurrent mapping $\mathcal{R}$ takes the form

$$
\begin{aligned}
g_t^f &= \sigma_f\!\big(W_f [h_{t-\Delta t}, z_t] + b_f\big), \\
g_t^i &= \sigma_i\!\big(W_i [h_{t-\Delta t}, z_t] + b_i\big), \\
\tilde{c}_t &= \tanh\!\big(W_c [h_{t-\Delta t}, z_t] + b_c\big), \\
c_t &= g_t^f \odot c_{t-\Delta t} + g_t^i \odot \tilde{c}_t, \\
g_t^z &= \sigma_h\!\big(W_h [h_{t-\Delta t}, z_t] + b_h\big), \\
h_t &= g_t^z \odot \tanh(c_t),
\end{aligned} \qquad (8)
$$

where $g_t^f, g_t^i, g_t^z \in \mathbb{R}^{d_h}$ are the gate vector signals (forget, input, and output gates), $z_t \in \mathbb{R}^{d_z}$ is the latent input at time $t$, $h_t \in \mathbb{R}^{d_h}$ is the hidden state, $c_t \in \mathbb{R}^{d_h}$ is the cell state, $W_f, W_i, W_c, W_h \in \mathbb{R}^{d_h \times (d_h + d_z)}$ are weight matrices, and $b_f, b_i, b_c, b_h \in \mathbb{R}^{d_h}$ are biases. The symbol $\odot$ denotes the element-wise product. The activation functions $\sigma_f$, $\sigma_i$, and $\sigma_h$ are sigmoids. The dimension of the hidden state $d_h$ (number of hidden units) controls the capacity of the cell to encode history information. The set of trainable parameters of the LSTM is

$$ w_R = \{b_f, b_i, b_c, b_h, W_f, W_i, W_c, W_h\}. \qquad (9) $$

Figure 2: Information flow in an LSTM cell.

An illustration of the information flow in an LSTM cell is given in Figure 2. The cell state can encode the history of the latent state evolution and capture non-Markovian effects.
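For illustration, one step of the update in Eq. (8) can be written as follows; this is a sketch with hypothetical weight containers W and b, and in practice a library cell such as torch.nn.LSTMCell implements the same gating.

```python
import torch

def lstm_cell_step(z_t, h_prev, c_prev, W, b):
    """One step of Eq. (8). W and b hold the four gate parametrizations
    (f: forget, i: input, c: candidate, h: output), each acting on [h, z]."""
    hz = torch.cat([h_prev, z_t], dim=-1)
    g_f = torch.sigmoid(W["f"] @ hz + b["f"])   # forget gate
    g_i = torch.sigmoid(W["i"] @ hz + b["i"])   # input gate
    c_tilde = torch.tanh(W["c"] @ hz + b["c"])  # candidate cell state
    c_t = g_f * c_prev + g_i * c_tilde          # cell state update
    g_z = torch.sigmoid(W["h"] @ hz + b["h"])   # output gate
    h_t = g_z * torch.tanh(c_t)                 # hidden state
    return h_t, c_t
```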
Mixture Density LSTM Network (MDN-LSTM)
The LSTM captures the history of the latent state, and the non-Markovian latent transition dynamics are expressed as

$$ p(z_{t+\Delta t} \mid z_t, z_{t-\Delta t}, \ldots) = p(z_{t+\Delta t} \mid h_t), \qquad (10) $$

where $h_t$ is given in Equation 7. A second MDN is used to model the conditional distribution $p(z_{t+\Delta t} \mid h_t)$ of the latent transition dynamics. This MDN is conditioned on the hidden state of the LSTM $h_t$ and implicitly conditioned on the history, i.e., $p(z_{t+\Delta t} \mid z_t, z_{t-\Delta t}, \ldots) := p(z_{t+\Delta t}; w_{z|h})$, so it can capture non-Markovian dynamics. The distribution $p(z_{t+\Delta t} \mid h_t)$ is modeled as a Gaussian mixture with $K_z$ mixture kernels,

$$ p(z_{t+\Delta t} \mid h_t) = \sum_{k=1}^{K_z} \pi_z^k(h_t)\, \mathcal{N}\big(\mu_z^k(h_t), \sigma_z^k(h_t)\big), \qquad (11) $$

with parameters $w_{z|h}$ given by

$$ w_{z|h}(h_t) = \{\pi_z^k(h_t), \mu_z^k(h_t), \sigma_z^k(h_t)\}, \qquad (12) $$

which are functions of $h_t$. These parameters are the outputs of the neural network $\mathcal{Z}(h_t; w_Z)$, with trainable weights $w_Z$, i.e.,

$$ p(z_{t+\Delta t} \mid h_t) := p(z_{t+\Delta t}; w_{z|h}), \qquad w_{z|h}(h_t) = \mathcal{Z}(h_t; w_Z). \qquad (13) $$

The weights of the LSTM $w_R$ and of the latent MDN $w_Z$ are trained to output the parameters $w_{z|h}$ that maximize the likelihood of the latent evolution,

$$ w_R, w_Z = \underset{w_R, w_Z}{\operatorname{argmax}}\; p\big(z_{t+\Delta t}; w_{z|h}\big), \qquad (14) $$

where $w_{z|h}$ is defined in Equation 13 and $h_t$, appearing in Equation 13, is defined in Equation 7. During the training phase, the MD trajectory data $s_t$ are provided at the input of the trained MDN-AE encoder, $z_t = \mathcal{E}(s_t; w_E)$. The encoder outputs the latent states $z_t$ that are used to update the hidden state of the LSTM and optimize its weights according to Equation 14. In contrast to the linear operator utilized in MSMs, the recurrent functional form in Equation 7 can be nonlinear and incorporate memory effects via the hidden state of the LSTM.
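A compact sketch of Eqs. (10)-(14), pairing a standard LSTM with an MDN head, is given below; this is an assumption-laden illustration (class name, layer shapes, and training loop omitted), not the reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class MDNLSTM(nn.Module):
    """An LSTM summarizes the latent history in h_t; an MDN head
    parametrizes the Gaussian mixture p(z_{t+dt} | h_t) of Eq. (11)."""
    def __init__(self, d_z, d_h, K_z):
        super().__init__()
        self.lstm = nn.LSTM(d_z, d_h, batch_first=True)  # Eqs. (7)-(8)
        self.pi = nn.Linear(d_h, K_z)                    # mixture weights
        self.mu = nn.Linear(d_h, K_z * d_z)              # kernel means
        self.sigma = nn.Linear(d_h, K_z * d_z)           # kernel std devs
        self.K_z, self.d_z = K_z, d_z

    def forward(self, z_seq, state=None):
        h_seq, state = self.lstm(z_seq, state)
        h = h_seq[:, -1]                                 # hidden state h_t
        pi = F.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.K_z, self.d_z)
        sigma = F.softplus(self.sigma(h)).view(-1, self.K_z, self.d_z)
        return (pi, mu, sigma), state

    @staticmethod
    def mixture(pi, mu, sigma):
        """The Gaussian mixture of Eq. (11) as a torch distribution."""
        return MixtureSameFamily(Categorical(probs=pi),
                                 Independent(Normal(mu, sigma), 1))

    def nll(self, pi, mu, sigma, z_next):
        """Negative log-likelihood, whose minimization realizes Eq. (14)."""
        return -self.mixture(pi, mu, sigma).log_prob(z_next).mean()
```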
Learned Effective Dynamics

The LED framework can be employed to accelerate MD simulations and enable more efficient exploration of the state space and the uncovering of novel protein configurations (shown in SM Section 3). The networks in LED are trained on trajectories from MD simulations in two phases. First, the MDN-AE is trained to provide a reduced-order representation, maximizing the data likelihood (Ref. ( )). The MDN-AE is trained with backpropagation ( ) using the adaptive stochastic optimization method Adam ( ). Adding a pre-training phase that fits the kernels $\mu^k, \sigma^k$ of the MDN-AE to the data, and fixing them during MDN-AE training, led to better results. Next, the MDN-LSTM is trained to forecast the latent-space dynamics (with the MDN-AE weights kept fixed), maximizing the latent data likelihood. The MDN-LSTM is trained with backpropagation through time (BPTT) ( ) and the Adam optimizer.

LED propagates the computationally inexpensive dynamics in its latent space. Starting from an initial state from a test dataset (unseen during training), a short time history $T_\mu$ of the state evolution is utilized to warm up the hidden state of the LED. The MDN-LSTM is then used to propagate the latent dynamics for a time horizon $T_m \gg T_\mu$. High-dimensional state configurations can be recovered at any time instant using the probabilistic decoder of the MDN-AE. We find that the LED framework can accelerate MD simulations by three orders of magnitude.
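Putting the pieces together, the inference procedure sketched in Figure 1 could look as follows, reusing the hypothetical MDNLSTM and sample_state sketches above; encoder, decoder, and all names are placeholders.

```python
import torch

@torch.no_grad()
def led_rollout(encoder, decoder, mdn_lstm, s_warmup, n_steps):
    """Warm up the LSTM on a short MD segment (T_mu), then iteratively
    propagate the latent dynamics for T_m >> T_mu; decode only at the end."""
    z_warm = encoder(s_warmup)                        # (1, T_mu, d_z), framewise encoding
    (pi, mu, sigma), state = mdn_lstm(z_warm)         # teacher-forced warm-up
    latent_traj = []
    for _ in range(n_steps):                          # cheap latent propagation
        z = mdn_lstm.mixture(pi, mu, sigma).sample()  # draw z_{t+dt}
        latent_traj.append(z)
        (pi, mu, sigma), state = mdn_lstm(z[:, None, :], state)
    s_final = sample_state(decoder, latent_traj[-1])  # probabilistic decoding
    return torch.stack(latent_traj, dim=1), s_final
```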
Results

The LED framework is tested on three systems: single-particle Langevin dynamics in the two-dimensional MBP, the Trp Cage miniprotein, and the alanine dipeptide, all widely adopted as benchmarks for molecular dynamics modeling (49, 54, 55, 59, 69).

Müller-Brown potential (MBP)
The Langevin dynamics of a particle in the MBP are characterized by the stochastic differential equation

$$ m \ddot{x}(t) = -\nabla V\big(x(t)\big) - \gamma \dot{x}(t) + \sqrt{2 \gamma k_B T}\, R(t), \qquad (15) $$

where $x \in \mathbb{R}^2$ is the position, $\dot{x}$ the velocity, $\ddot{x}$ the acceleration, $V(x)$ the MBP (defined in SM Section 1), $k_B$ the Boltzmann constant, $T$ the temperature, $\gamma$ the damping coefficient, and $R(t)$ a delta-correlated stationary Gaussian process with zero mean. The nature of the dynamics is affected by the damping coefficient $\gamma$: low damping coefficients lead to an inertial regime, while high damping factors lead to a diffusive regime (Brownian motion) with less prominent memory effects. Here, a low damping $\gamma = 1$ is considered, along with $k_B T = 15$. The equations are integrated with the velocity Verlet algorithm with time-step $\delta t = 10^{-…}$, starting from initial conditions sampled uniformly from a rectangle in the $(x_1, x_2)$ plane (see SM Section 1), up to $T = 10^{…}$, after truncating an initial transient period of $\tilde{T} = 10^{…}$.
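For reference, a minimal NumPy sketch of such a simulation is given below. It assumes the standard Müller-Brown parameters of Eq. (16) and a simple velocity-Verlet-style discretization of Eq. (15) in which friction and noise are folded into the force; more careful Langevin splittings (e.g., BAOAB) are preferable in production.

```python
import numpy as np

# Standard Mueller-Brown parameters (Eq. 16, following Mueller & Brown, 1979).
A  = np.array([-200.0, -100.0, -170.0, 15.0])
a  = np.array([-1.0, -1.0, -6.5, 0.7])
b  = np.array([0.0, 0.0, 11.0, 0.6])
c  = np.array([-10.0, -10.0, -6.5, 0.7])
X1 = np.array([1.0, 0.0, -0.5, -1.0])
X2 = np.array([0.0, 0.5, 1.5, 1.0])

def grad_V(x):
    """Analytical gradient of the Mueller-Brown potential."""
    d1, d2 = x[0] - X1, x[1] - X2
    e = A * np.exp(a * d1**2 + b * d1 * d2 + c * d2**2)
    return np.array([np.sum(e * (2 * a * d1 + b * d2)),
                     np.sum(e * (2 * c * d2 + b * d1))])

def langevin_trajectory(x0, dt, n_steps, gamma=1.0, kBT=15.0, m=1.0, seed=0):
    """Crude velocity-Verlet/BBK-type integration of Eq. (15)."""
    rng = np.random.default_rng(seed)
    x, v = np.array(x0, dtype=float), np.zeros(2)
    traj = np.empty((n_steps, 2))
    f = -grad_V(x)
    for i in range(n_steps):
        v += 0.5 * dt * f / m                     # half kick
        x += dt * v                               # drift
        f = (-grad_V(x) - gamma * m * v           # conservative + friction force
             + np.sqrt(2 * gamma * kBT * m / dt) * rng.standard_normal(2))
        v += 0.5 * dt * f / m                     # half kick
        traj[i] = x
    return traj
```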
Figure 3: From left to right: the Müller-Brown potential; a scatter plot of the joint state distribution computed from reference data (with annotation of the two long-lived metastable states); and the same scatter plot obtained from LED-sampled trajectories.

The data are sub-sampled, keeping every 50th data point, to create the training and testing datasets for LED; the coarse time-step of LED is $\Delta t = 0.…$. A subset of the initial conditions is used for training, another for validation, and all are used for testing. LED is trained with a one-dimensional reduced-order latent representation $z_t \in \mathbb{R}$. The reader is referred to SM Section 1 for further information regarding the MBP parametrization of Ref. ( ) and the hyperparameters of LED.

The MBP is shown in Figure 3, along with density scatter plots of the joint distribution of the MBP states computed from the testing data and from LED. The joint distribution reveals two long-lived metastable states that correspond to the low-energy regions. LED learns to transition probabilistically between the metastable states, mimicking the dynamics of the system and reproducing the state statistics.

The free energy projected on the latent space, i.e., $F = -k_B T \log p(z_t)$, is plotted in Figure 4. The free energy profile of the trajectories sampled from LED closely matches the one from the reference data, with a root mean square error between the two profiles of $\approx 0.…\, k_B T$. LED reveals two minima in the free energy profile. Utilizing the LED decoder, the latent states in these regions are mapped to their image in the two-dimensional state representation $s_t \in \mathbb{R}^2$ (here corresponding to $x_t \in \mathbb{R}^2$) in Figure 4. LED maps the low-energy regions of the free energy profile to the long-lived metastable states in the two-dimensional space of the MBP.
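The projected free energy can be estimated directly from a sampled latent trajectory with a histogram; a minimal sketch, assuming a one-dimensional latent trajectory z stored as a NumPy array:

```python
import numpy as np

def free_energy_profile(z, kBT=15.0, bins=100):
    """F(z) = -kBT * log p(z), with p(z) estimated by a histogram of the
    latent trajectory (1D latent space, as used for the MBP)."""
    hist, edges = np.histogram(z, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = hist > 0                    # avoid log(0) in empty bins
    return centers[mask], -kBT * np.log(hist[mask])
```

Evaluating this on both the reference and the LED-sampled latent trajectories over a common grid yields the root-mean-square discrepancy quoted in the text.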
Figure 4: Middle: free energy profile projected on the latent space learned by the LED encoder, i.e., $F = -k_B T \ln p(z_t)$. The free energy profile computed by LED (propagation of the latent dynamics with LED) closely matches the one from the reference data; quantitatively, the root mean square error is $0.…\, k_B T$. LED recovers two low-energy regions that are mapped to the two long-lived metastable states (left and right panels) in the two-dimensional state space $s_t \in \mathbb{R}^2$.

Next, we evaluate the LED framework in reproducing the transition times between the long-lived states. In LED, metastable states can be defined either in the reduced-order latent space $z_t \in \mathbb{R}$ or in the state space $s_t \in \mathbb{R}^2$ (as the decoder can map any latent state to the state space). In the following, two metastable states are defined as ellipses in the state space, depicted in Figure 3 and defined in SM Section 1. The time-scales vary depending on the definition of the metastable states in the phase space. The distribution of transition times computed from LED trajectories is compared with the transition-time distribution from the test data in Figure 5. LED captures the transition-time distributions qualitatively, and the mean values are close to each other. In SM Section 1, we also report the transition times obtained with metastable states defined in the latent space; this approach has the benefit of not requiring prior knowledge about the metastable states in the state space.

Figure 5: The distribution of the transition times learned by LED, computed from sampled trajectories, matches the original fine-scale transition times of the MBP dynamics. Left: histogram of $T_{1 \to 2}$ (reference: max = 832, mean = 61, N = 680; LED: max = 1122, mean = 91, N = 740). Right: histogram of $T_{2 \to 1}$ (reference: max = 1132, mean = 188, N = 601; LED: max = 1027, mean = 164, N = 525). LED has learned to propagate the effective dynamics (a one-dimensional latent state $z$) and to capture the non-Markovian effects.

In conclusion, LED captures the joint state distribution of the MBP and matches the time-scales of the system.
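Transition-time statistics such as those in Figure 5 can be extracted by labeling each frame with its metastable state and timing the switches; a hedged sketch, where the ellipse parameters follow the definitions of SM Section 1 and the function names are illustrative:

```python
import numpy as np

def label_states(x, centers, axes, thetas):
    """Assign each 2D point to an elliptical metastable state (or -1)."""
    labels = -np.ones(len(x), dtype=int)
    for s, (c, ab, th) in enumerate(zip(centers, axes, thetas)):
        d = x - c
        u = d[:, 0] * np.cos(th) + d[:, 1] * np.sin(th)   # rotate into ellipse frame
        w = -d[:, 0] * np.sin(th) + d[:, 1] * np.cos(th)
        labels[(u / ab[0])**2 + (w / ab[1])**2 <= 1.0] = s
    return labels

def transition_times(labels, src, dst, dt):
    """Times from the last exit of state `src` until the first entry of `dst`."""
    times, t_exit, current = [], None, None
    for i, s in enumerate(labels):
        if s == src:
            current, t_exit = src, i
        elif s == dst:
            if current == src and t_exit is not None:
                times.append((i - t_exit) * dt)
            current = dst
    return np.array(times)
```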
Trp Cage

The Trp-cage is considered a prototypical miniprotein for the study of protein folding ( ). The protein is simulated with MD ( ) with a time-step of $\delta t = 1$ fs, up to a total time of $T = 100$ ns.
The data are sub-sampled at $\Delta t = 0.…$ ps, creating a trajectory of $N = 10^{…}$ samples. The data are divided into sequences of $T = 400$ ps each; the first sequences are used for training, the next for validation, while all the data are used for testing.

The protein positions are transformed into rototranslational-invariant features (internal coordinates), composed of bonds, angles, and dihedral angles, leading to a state of dimension $d_s = 456$. LED is trained with a latent space $z_t \in \mathbb{R}^2$, i.e., $d_z = 2$. LED is tested by starting from the initial condition of each test sequence and iteratively propagating the latent space to forecast $T = 400$ ps. For more information on the hyperparameters of LED, refer to SM Section 2.

The projection of the MD trajectory data onto the LED latent space is illustrated in Figure 6 (left), in the form of the free energy $F = -k_B T \log p(z_t)$, with $z_t = (z_1, z_2)^T \in \mathbb{R}^2$. The free energy on the latent space computed from trajectories sampled from LED is given in Figure 6 (right). LED successfully captures the three metastable states of the Trp Cage miniprotein, while being three orders of magnitude faster than the MD solver. Quantitatively, the two profiles agree up to an error margin of approximately $0.…\, k_B T$. SM Section 2 provides additional results on the agreement of the marginal state distributions, together with realistic samples of the protein configuration generated by LED.
Figure 6: Free energy projection on the latent space, $F = -k_B T \log p(z_t)$, with $z_t \in \mathbb{R}^2$. Left: MD data projected onto the LED latent space. Right: free energy of trajectories sampled from LED. LED captures the free energy profile.

Alanine dipeptide
The alanine dipeptide is often used as a testing ground for enhanced sampling methods ( ). LED is evaluated in learning and propagating the dynamics of alanine dipeptide in water. The molecule is simulated with MD ( ) with a time-step of $\delta t = 1$ fs, up to $T = 100$ ns. We sub-sample the data, keeping every $…$th data point, creating a trajectory of $N = 10^{…}$ samples; LED is thus operating on a timescale of $\Delta t = 0.…$ ps. The data are divided into sequences of $T = 400$ ps each; the first sequences are used for training, the next for validation, while all the data are used for testing. LED is tested by starting from the initial condition of each test sequence and iteratively propagating the latent space to forecast $T = 400$ ps.

The dipeptide positions are transformed into rototranslational-invariant features (internal coordinates), composed of bonds, angles, and dihedral angles, leading to a state of dimension $d_s = 24$. In order to demonstrate that LED can uncover the dynamics in a drastically reduced-order latent space, the dimension of the latter is set to one, $d_z = 1$, i.e., $z_t \in \mathbb{R}$. For more information on the hyperparameters of LED, refer to SM Section 3.

The metastable states of the dynamics are represented in terms of the energetically favored regions in the state space of the two backbone dihedral angles $\phi$ and $\psi$, i.e., the Ramachandran space ( ), plotted in Figure 7. Specifically, previous works consider five low-energy clusters, i.e., $\{C_5, P_{II}, \alpha_R, \alpha_L, C_{7ax}\}$. The trained LED qualitatively reproduces the density in the Ramachandran plot in Figure 7, identifying the three dominant low-energy metastable states $\{C_5, P_{II}, \alpha_R\}$. LED, however, fails to capture the state density in the less frequently observed states in the training data, $\{\alpha_L, C_{7ax}\}$. The marginal distributions of the trajectories generated by LED closely match the ground-truth ones (MD data), as depicted in Figure 13 in SM Section 3. Even though LED propagates a one-dimensional latent state, it reproduces the statistics while being three orders of magnitude faster than the MD solver.
Figure 7: Ramachandran plot of the alanine dipeptide, i.e., the space spanned by the two backbone dihedral angles $(\phi, \psi)$. Scatter plots are colored based on the joint density of $(\phi, \psi)$. Left: test data. Right: LED trajectories. Five energetically favorable metastable states are observed, denoted $\{C_5, P_{II}, \alpha_R, \alpha_L, C_{7ax}\}$. LED captures the three dominant metastable states $\{C_5, P_{II}, \alpha_R\}$; the states $\{\alpha_L, C_{7ax}\}$ are rarely observed in the training data.

The free energy is projected onto the latent space, i.e., $F = -k_B T \ln p(z_t)$, and plotted in Figure 8. The free energy projection computed from the MD trajectories (test data) is compared with the one computed from trajectories sampled from LED; the two free energy profiles agree up to a root mean square error of $0.…\, k_B T$. Note that LED unravels three dominant minima in the latent space. These low-energy regions correspond to metastable states of the dynamics. The Ramachandran space $(\phi, \psi)$ is frequently used to describe the long-term behavior and the metastable states of the system (55, 74).
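The Ramachandran density can be computed from a trajectory with standard tools; a minimal sketch using mdtraj, where the file names are hypothetical:

```python
import mdtraj as md
import numpy as np

# Load an alanine dipeptide trajectory (hypothetical file names).
traj = md.load("alanine_dipeptide.xtc", top="alanine_dipeptide.pdb")
_, phi = md.compute_phi(traj)   # backbone dihedrals in radians, (n_frames, 1)
_, psi = md.compute_psi(traj)
phi_deg, psi_deg = np.degrees(phi[:, 0]), np.degrees(psi[:, 0])

# 2D histogram over the Ramachandran space; -kBT * log(density) gives the
# free-energy surface in which the metastable states appear as minima.
H, xe, ye = np.histogram2d(phi_deg, psi_deg, bins=72,
                           range=[[-180, 180], [-180, 180]], density=True)
```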
The latent encoding of the LED is evaluated based on the mapping between the latent space and the Ramachandran space. Utilizing the MDN decoder, LED can map the latent state $z$ to the respective rototranslational-invariant features (bonds and angles) and to regions in the Ramachandran plot. As illustrated in Figure 8, LED maps the three low-energy regions in the latent space to the three dominant metastable states in the Ramachandran plot, $\{C_5, P_{II}, \alpha_R\}$. Even though LED propagates a reduced-order, one-dimensional latent state, it captures the stochastic dynamics of the system.

In Figure 9, a configuration randomly sampled from the MD data is given for each metastable state. The closest configuration sampled from LED is compared with the MD data sample in terms of the root mean square deviation (RMSD) score. LED samples realistic configurations with low RMSD errors for all metastable states. The mean and standard deviation of the RMSD scores of the closest neighbors sampled from LED are $\mu \pm \sigma = 0.… \pm 0.021$ Å for the $C_5$ MD sample configuration (Figure 9, top left). This score for the rest of the metastable states is $0.… \pm 0.463$ Å for $P_{II}$, $0.… \pm 0.019$ Å for $\alpha_R$, $0.… \pm 0.162$ Å for $\alpha_L$, and $0.… \pm 0.125$ Å for $C_{7ax}$. LED samples similar configurations with low RMSD scores for the most frequently observed metastable states $\{C_5, P_{II}, \alpha_R\}$; the average RMSD error is slightly higher and fluctuates more for the less frequently observed $\{\alpha_L, C_{7ax}\}$.

The dynamics learned by LED are evaluated according to the mean first-passage times (MFPTs) between the dominant metastable states. The MFPT is the average time-scale to reach a final metastable state, starting from any initial state.
Figure 8: Free energy profile projected on the latent state learned by LED, i.e., $F = -k_B T \ln p(z_t)$. The latent free energy profile of the MD trajectories is compared with that of trajectories sampled from LED; the two profiles agree up to a root mean square error of $0.…\, k_B T$. Utilizing the LED decoder, the low-energy regions in the latent space (close to the minima) can be mapped to the corresponding protein configurations and metastable states in the Ramachandran plot. LED uncovers the three dominant metastable states $\{C_5, P_{II}, \alpha_R\}$ as minima of the free energy surface, while being computationally three orders of magnitude cheaper than MD.

The MFPTs are computed a posteriori from trajectories sampled from LED and from the MD test trajectories, using the PyEMMA software ( ). The metastable states considered here are given in SM Section 3. As a reference for the MFPTs, we consider an MSM fitted to the MD data (test dataset); the reference MFPTs agree with previous literature ( ). The time-lag of the MSM is set to $\Delta t_{MSM} = 10$ ps to ensure the necessary Markovianity of the dynamics. This time-lag is two orders of magnitude larger than the time-step of LED.
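A minimal sketch of this estimation with PyEMMA follows; the discrete trajectories, frame interval, and cluster labels are illustrative stand-ins, not values from the paper.

```python
import numpy as np
import pyemma

# Toy stand-in for the discrete trajectories: each frame labeled with its
# metastable set (in practice, labels follow the definitions in SM Section 3).
dtrajs = [np.random.randint(0, 3, size=100_000)]
dt_ps = 0.1                        # frame interval in ps (illustrative value)

lag = int(10.0 / dt_ps)            # number of frames giving a 10 ps MSM lag
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=lag)
# MFPT between two sets of states, in frame units by default; convert to ns.
mfpt_ns = msm.mfpt([0], [1]) * dt_ps * 1e-3
```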
Figure 9: For each metastable state, a random alanine dipeptide configuration sampled from the MD data is compared against the closest configuration sampled from LED with $d_z = 1$. The root mean square deviation (RMSD) in Å between the two is reported for reference.

Fitting an MSM with a time-lag of $\Delta t_{MSM} = 1$ ps on the MD data results in very high errors ($\approx 85\%$ on average) in the computation of the MFPTs. This emphasizes the need for non-Markovian models that can reproduce the system's dynamics and statistics independently of the selection of the time-lag. The MFPTs of trajectories sampled from LED are estimated with an MSM with a time-lag of $\Delta t_{MSM} = 10$ ps. Note that LED operates on a time-step of $\Delta t = 0.…$ ps. The MFPTs are identified with a low average error of 10.51%. The results on the MFPTs are summarized in Table 1. LED captures very well the transitions that are dominant in the data, e.g., $T_{C_5 \to P_{II}}$, $T_{P_{II} \to C_5}$, or $T_{\alpha_R \to C_5}$. In contrast, LED exhibits high MFPT errors for transitions that are rarely observed in the training data.

Table 1: Mean first-passage times (MFPT) between the metastable states of alanine dipeptide in water, in [ns]. MFPTs are estimated by fitting MSMs with time-lags of 1 ps and 10 ps on trajectories generated by MD or by the LED framework. The average relative error is given for reference.

MFPT | MSM (MD, $\Delta t_{MSM}=10$ ps) | MSM (MD, $\Delta t_{MSM}=1$ ps) | Rel. error | MSM (LED, $\Delta t_{MSM}=10$ ps) | Rel. error
$T_{C_5 \to P_{II}}$ | 0.105 | 0.016 | 85% | 0.… | …
$T_{C_5 \to \alpha_R}$ | 0.104 | 0.015 | 86% | 0.… | …
$T_{P_{II} \to C_5}$ | 0.226 | 0.036 | 84% | 0.… | …
$T_{P_{II} \to \alpha_R}$ | 0.105 | 0.015 | 86% | 0.… | …
$T_{\alpha_R \to C_5}$ | 0.236 | 0.036 | 85% | 0.… | …
$T_{\alpha_R \to P_{II}}$ | 0.116 | 0.017 | 86% | 0.… | …
Average relative error | | | 85.09% | | 10.51%

LED identifies the dominant MFPTs successfully by utilizing a very small amount of training data and propagating the latent dynamics in a reduced-order space ($d_z = 1$). LED trajectories are three orders of magnitude cheaper to obtain than MD data. At the same time, MSM fitting is a relatively fast procedure once the clustering based on the metastable states is obtained. In contrast, a careless selection of the time-lag in the MSM that fails to render the dynamics Markovian (e.g., $\Delta t = 1$ ps) leads to a surrogate model that fails to capture the system time-scales. This emphasizes the need to model non-Markovian effects, as done in LED, in the case of limited data sampled at high frequency (small time-steps $\Delta t$). A more informed selection of the time-lag may alleviate this problem, rendering the dynamics Markovian as in the reference MSM. Still, the consequent sub-sampling of the data can lead to omission of effects whose time-scales are smaller than the time-lag. As a consequence, the heuristic selection of the time-lag renders the modeling process error-prone.

SM Section 3 provides additional results on the MFPTs estimated based on a metastable-state definition in the latent space of LED (without prior knowledge). Furthermore, the effectiveness of LED in interpolating the training data and unraveling novel configurations of the protein (state space) is also illustrated.

Discussion

This work proposes a data-driven framework, LED, to learn and propagate the effective dynamics of molecular systems, accelerating MD simulations by orders of magnitude. Previous state-of-the-art methods are based on the Markovian assumption on the latent state, or minimize the autocorrelation or the variational loss on the data. The latter explicitly take into account the error on the long-term equilibrium statistics to capture the system time-scales, but suffer from a dependency on the batch size ( ). In contrast, LED is trained to maximize the data likelihood and identify a continuous reduced-order latent representation. The nonlinear dynamics are propagated in the latent space, and memory effects are captured through the hidden state of the LSTM. Moreover, the method is generative, and the decoder part of the MDN-AE can be employed to sample high-dimensional configurations on any desired time-scale.

The encoder of LED is analogous to the design of a coarse-grained model, while the decoder implicitly learns a backmapping to atomistic configurations. LED automates the dimensionality reduction often associated with the empirical a-priori selection of collective variables in molecular simulations (22, 55).
At the same time, the MDN-LSTM propagates the dynamics in the latent space in a form that is comparable to nonlinear, non-Markovian metadynamics ( ).

The effectiveness of LED is demonstrated for three systems. In the case of the Langevin dynamics in the MBP, LED recovers the free energy landscape in the latent space, identifies two low-energy states corresponding to the long-lived metastable states of the potential, and captures the transition times between the metastable states. For the Trp Cage miniprotein, LED captures the free energy projection on the latent space and unravels three metastable states. Lastly, for the system of alanine dipeptide in water, LED captures the configuration statistics of the system accurately while being three orders of magnitude faster than MD solvers. It identifies three low-energy regions in the free energy profile projected onto the one-dimensional latent state, corresponding to the three dominant metastable states $\{\alpha_R, C_5, P_{II}\}$. LED is also able to capture the dominant mean first-passage times, in contrast to an MSM operating on the same time-scale, owing to the non-Markovian propagation in the latent state with the MDN-LSTM. Furthermore, we showcase how our framework is capable of unraveling novel protein configurations by interpolating the training data.

The speed-up achieved by LED depends on the MD solver used and on the dimensionality and complexity of the protein under study. Still, it is expected that the computationally cheap propagation in the latent space of LED is orders of magnitude faster than the MD solver.

The success of LED paves the way for faster exploration of the conformational space of molecular systems. Future research efforts will target the application of LED to larger proteins and the investigation of LED's capabilities and limitations in uncovering metastable states as minima in the free energy profile. An alternative research direction is the automatic extraction of features directly from the raw position and velocity data (without using rototranslational-invariant features). Moreover, further studies will concentrate on coupling LED with an MD solver in an alternating fashion for faster exploration of the state space.

Acknowledgments
The authors thank Ioannis Mandralis, Pascal Weber, and Fabian Wermelinger for fruitful discussions and feedback. The authors also acknowledge the Swiss National Supercomputing Centre (CSCS) for providing the necessary computational resources under Project s930. The authors declare no competing interests.

Data and materials availability
Code and data to reproduce the findings of this study will be made openly available in a public repository upon publication. The PyEMMA software package ( ) is employed in the current work for MSM fitting and MFPT estimation.

Supplementary Materials
Section S1: Definition of the MBP, the metastable states, and time-scale analysis in the LED latent space
Section S1, Figure 10: Marginal state statistics of LED in the MBP
Section S1, Table 3: AE hyperparameter tuning in the MBP
Section S1, Table 4: LSTM hyperparameter tuning in the MBP
Section S1, Table 5: LED hyperparameters in the MBP
Section S2, Figure 11: Marginal state statistics of LED in the Trp Cage
Section S2, Figure 12: Configuration sampled from LED in the Trp Cage
Section S2, Table 6: AE hyperparameter tuning in the Trp Cage
Section S2, Table 7: LSTM hyperparameter tuning in the Trp Cage
Section S2, Table 8: LED hyperparameters in the Trp Cage
Section S3: Information on the simulation of alanine dipeptide, definition of the metastable states, time-scale analysis in the LED latent space, and study on unraveling novel configurations with LED
Section S3, Table 9: Metastable state definitions in alanine dipeptide
Section S3, Table 10: Mean first-passage time analysis in alanine dipeptide
Section S3, Figure 13: Marginal state statistics of LED in alanine dipeptide
Section S3, Figure 14: Unraveling novel configurations with LED in alanine dipeptide
Section S3, Table 11: AE hyperparameter tuning in alanine dipeptide
Section S3, Table 12: LSTM hyperparameter tuning in alanine dipeptide
Section S3, Table 13: LED hyperparameters in alanine dipeptide
References
1. M. Karplus, J. A. McCammon, Molecular dynamics simulations of biomolecules. Nat. Struct. Mol. Biol., 646 (2002).
2. M. Shirts, V. S. Pande, Screen Savers of the World Unite! Science, 1903 (2000).
3. D. E. Shaw et al., in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09 (Association for Computing Machinery, New York, NY, USA, 2009).
4. M. A. Balsera, W. Wriggers, Y. Oono, K. Schulten, Principal component analysis and long time protein dynamics. J. Phys. Chem., 2567 (1996).
5. B. Brooks, M. Karplus, Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor. PNAS, 6571 (1983).
6. L. Skjaerven, A. Martinez, N. Reuter, Principal component and normal mode analysis of proteins; a quantitative comparison using the GroEL subunit. Proteins, 232 (2011).
7. M. Praprotnik, D. Janežič, Molecular dynamics integration meets standard theory of molecular vibrations. J. Chem. Inf. Model., 1571 (2005).
8. W. G. Noid, Perspective: Coarse-grained models for biomolecular systems. J. Chem. Phys., 09B201_1 (2013).
9. J. Zavadlav, G. Arampatzis, P. Koumoutsakos, Bayesian selection for coarse-grained models of liquid water. Sci. Rep., 1 (2019).
10. J. W. Wagner, J. F. Dama, A. E. P. Durumeric, G. A. Voth, On the representability problem and the physical meaning of coarse-grained models. J. Chem. Phys., 044108 (2016).
11. W. Pezeshkian, M. König, T. A. Wassenaar, S. J. Marrink, Backmapping triangulated surfaces to coarse-grained membrane models. Nat. Commun., 1 (2020).
12. M. Stieffenhofer, M. Wand, T. Bereau, Adversarial reverse mapping of equilibrated condensed-phase molecular structures. arXiv preprint arXiv:2003.07753 (2020).
13. B. Hess, S. León, N. van der Vegt, K. Kremer, Long time atomistic polymer trajectories from coarse grained simulations: Bisphenol-A polycarbonate. Soft Matter, 409 (2006).
14. T. Werder, J. H. Walther, P. Koumoutsakos, Hybrid atomistic–continuum method for the simulation of dense fluid flows. J. Comput. Phys., 373 (2005).
15. G. S. Ayton, W. G. Noid, G. A. Voth, Multiscale modeling of biomolecular systems: in serial and in parallel. Curr. Opin. Struct. Biol., 192 (2007).
16. M. Praprotnik, L. Delle Site, K. Kremer, Multiscale simulation of soft matter: From scale bridging to adaptive resolution. Annu. Rev. Phys. Chem., 545 (2008).
17. T. Huber, A. E. Torda, W. F. van Gunsteren, Local elevation: a method for improving the searching properties of molecular dynamics simulation. J. Comput. Aided Mol. Des., 695 (1994).
18. C. Voudouris, Guided Local Search—An illustrative example in function optimisation. BT Technol. J., 46 (1998).
19. C. Dellago, P. G. Bolhuis, D. Chandler, Efficient transition path sampling: Application to Lennard-Jones cluster rearrangements. J. Chem. Phys., 9236 (1998).
20. A. Laio, M. Parrinello, Escaping free-energy minima. PNAS, 12562 (2002).
21. T. S. van Erp, D. Moroni, P. G. Bolhuis, A novel path sampling method for the calculation of rate constants. J. Chem. Phys., 7762 (2003).
22. L. Maragliano, A. Fischer, E. Vanden-Eijnden, G. Ciccotti, String method in collective variables: Minimum free energy paths and isocommittor surfaces. J. Chem. Phys., 024106 (2006).
23. T. Jaffrelot Inizan et al., High-resolution mining of SARS-CoV-2 main protease conformational space: Supercomputer-driven unsupervised adaptive sampling (2020).
24. B. Peters, B. L. Trout, Obtaining reaction coordinates by likelihood maximization. J. Chem. Phys., 054108 (2006).
25. H. Stamati, C. Clementi, L. E. Kavraki, Application of nonlinear dimensionality reduction to characterize the conformational landscape of small peptides. Proteins, 223 (2010).
26. A. Bittracher, R. Banisch, C. Schütte, Data-driven computation of molecular reaction coordinates. J. Chem. Phys., 154103 (2018).
27. L. Bonati, V. Rizzi, M. Parrinello, Data-driven collective variables for enhanced sampling. J. Phys. Chem. Lett., 2998 (2020).
28. C. Schütte, F. Noé, J. Lu, M. Sarich, E. Vanden-Eijnden, Markov state models based on milestoning. J. Chem. Phys., 05B609 (2011).
29. A. Bittracher et al., Transition manifolds of complex metastable systems. J. Nonlinear Sci., 471 (2018).
30. D. Michie, "Memo" functions and machine learning. Nature, 19 (1968).
31. C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics (Springer, New York, NY, 2006).
32. F. Noé, A. Tkatchenko, K.-R. Müller, C. Clementi, Machine learning for molecular simulation. Annu. Rev. Phys. Chem., 361 (2020).
33. K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science. Nature, 547 (2018).
34. F. Noé, in Machine Learning Meets Quantum Physics (Springer, 2020), pp. 331–372.
35. J. Behler, M. Parrinello, Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett., 146401 (2007).
36. M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld, Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 058301 (2012).
37. S. Chmiela, H. E. Sauceda, K.-R. Müller, A. Tkatchenko, Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun., 1 (2018).
38. K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, K.-R. Müller, SchNet—A deep learning architecture for molecules and materials. J. Chem. Phys., 241722 (2018).
39. P. Rowe, V. L. Deringer, P. Gasparotto, G. Csányi, A. Michaelides, An accurate and transferable machine learning potential for carbon. J. Chem. Phys., 034702 (2020).
40. A. P. Bartók et al., Machine learning unifies the modeling of materials and molecules. Sci. Adv., e1701816 (2017).
41. G. Imbalzano et al., Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials. J. Chem. Phys., 241730 (2018).
42. K. Hansen et al., Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput., 3404 (2013).
43. F. A. Faber et al., Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput., 5255 (2017).
44. B. Cheng, E. A. Engel, J. Behler, C. Dellago, M. Ceriotti, Ab initio thermodynamics of liquid and solid water. PNAS, 1110 (2019).
45. L. Zhang, J. Han, H. Wang, R. Car, W. E, DeePCG: Constructing coarse-grained models via deep neural networks. J. Chem. Phys., 034101 (2018).
46. J. Wang et al., Machine learning of coarse-grained molecular dynamics force fields. ACS Cent. Sci., 755 (2019).
47. A. E. Durumeric, G. A. Voth, Adversarial-residual-coarse-graining: Applying machine learning theory to systematic molecular coarse-graining. J. Chem. Phys., 124110 (2019).
48. F. Noé, S. Olsson, J. Köhler, H. Wu, Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, eaaw1147 (2019).
49. F. Nüske, B. G. Keller, G. Pérez-Hernández, A. S. Mey, F. Noé, Variational approach to molecular kinetics. J. Chem. Theory Comput., 1739 (2014).
50. G. Pérez-Hernández, F. Noé, Hierarchical time-lagged independent component analysis: computing slow modes and reaction coordinates for large molecular systems. J. Chem. Theory Comput., 6118 (2016).
51. N.-V. Buchete, G. Hummer, Coarse master equations for peptide folding dynamics. J. Phys. Chem. B, 6057 (2008).
52. F. Noé et al., Dynamical fingerprints for probing individual relaxation processes in biomolecular dynamics with simulations and kinetic experiments. PNAS, 4822 (2011).
53. J. M. L. Ribeiro, P. Bravo, Y. Wang, P. Tiwary, Reweighted autoencoded variational Bayes for enhanced sampling (RAVE). J. Chem. Phys., 072301 (2018).
54. A. Mardt, L. Pasquali, H. Wu, F. Noé, VAMPnets for deep learning of molecular kinetics. Nat. Commun., 1 (2018).
55. C. Wehmeyer, F. Noé, Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys., 241703 (2018).
56. W. Chen, H. Sidky, A. L. Ferguson, Nonlinear discovery of slow molecular modes using state-free reversible VAMPnets. J. Chem. Phys., 214114 (2019).
57. H. Wu, A. Mardt, L. Pasquali, F. Noé, in NeurIPS (2018), pp. 3975–3984.
58. C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, V. S. Pande, Variational encoding of complex dynamics. Phys. Rev. E, 062412 (2018).
59. H. Sidky, W. Chen, A. L. Ferguson, Molecular latent space simulators. Chem. Sci., 9459 (2020).
60. I. G. Kevrekidis et al., Equation-free, coarse-grained multiscale computation: Enabling microscopic simulators to perform system-level analysis. Commun. Math. Sci., 715 (2003).
61. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput., 1735 (1997).
62. C. M. Bishop, Mixture density networks (1994).
63. W. Kabsch, A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A, 922 (1976).
64. P. R. Vlachas et al., Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Netw. (2020).
65. P. J. Werbos, Generalization of backpropagation with application to a recurrent gas market model. Neural Netw., 339 (1988).
66. P. R. Vlachas, G. Arampatzis, C. Uhler, P. Koumoutsakos, Learning the effective dynamics of complex multiscale systems. arXiv preprint arXiv:2006.13431 (2020).
67. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, Tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science (1985).
68. D. P. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
69. K. Müller, L. D. Brown, Location of saddle points and minimum energy paths by a constrained simplex optimization procedure. Theor. Chim. Acta, 75 (1979).
70. H. V. Guzman et al., ESPResSo++ 2.0: Advanced methods for multiscale molecular simulation. Comput. Phys. Commun., 66 (2019).
71. J. McCarty, M. Parrinello, A variational conformational dynamics approach to the selection of collective variables in metadynamics. J. Chem. Phys., 204109 (2017).
72. G. N. Ramachandran, Stereochemistry of polypeptide chain configurations. J. Mol. Biol., 95 (1963).
73. M. K. Scherer et al., PyEMMA 2: A software package for estimation, validation, and analysis of Markov models. J. Chem. Theory Comput., 5525 (2015).
74. B. Trendelkamp-Schroer, F. Noé, Efficient estimation of rare-event kinetics. Phys. Rev. X, 011009 (2016).
75. H. Jang, T. B. Woolf, Multiple pathways in conformational transitions of the alanine dipeptide: an application of dynamic importance sampling. J. Comput. Chem., 1136 (2006).
76. D. S. Chekmarev, T. Ishida, R. M. Levy, Long-time conformational transitions of alanine dipeptide in aqueous solution: Continuous and discrete-state kinetic models. J. Phys. Chem. B, 19487 (2004).
77. H. Wang, C. Schütte, G. Ciccotti, L. Delle Site, Exploring the conformational dynamics of alanine dipeptide in solution subjected to an external electric field: A nonequilibrium molecular dynamics simulation. J. Chem. Theory Comput., 1376 (2014).
78. Y. Duan et al., A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem., 1999 (2003).
79. U. W. Schmitt, G. A. Voth, The computer simulation of proton transport in water. J. Chem. Phys., 9361 (1999).
80. G. S. Grest, K. Kremer, Molecular dynamics simulation for polymers in the presence of a heat bath. Phys. Rev. A, 3628 (1986).
81. M. Neumann, The dielectric constant of water. Computer simulations with the MCY potential. J. Chem. Phys., 5663 (1985).

Table 2: Metastable states in the MBP, modeled as ellipses $x_1^2/\alpha^2 + x_2^2/\beta^2 \leq 1$ rotated by $\theta$.

State | Center $(x_1, x_2)$ | Axes $(\alpha, \beta)$ | $\theta$
1 | $(-0.…, ….45)$ | $(0.…, 0.…)$ | $\pi/…$
2 | $(0.…, ….05)$ | $(0.…, 0.15)$ | $0$
The MBP has the form

$$ V(x) = \sum_{k=1}^{4} A_k \exp\!\big( \alpha_k (x_1 - \hat{X}_{1,k})^2 + b_k (x_1 - \hat{X}_{1,k})(x_2 - \hat{X}_{2,k}) + c_k (x_2 - \hat{X}_{2,k})^2 \big), \qquad (16) $$

where $x = [x_1, x_2]^T$ is the position. The parametrization

$$ \alpha = [-1, -1, -6.5, 0.7]^T, \quad b = [0, 0, 11, 0.6]^T, \quad c = [-10, -10, -6.5, 0.7]^T, \quad A = [-200, -100, -170, 15]^T, $$
$$ \hat{X} = \begin{bmatrix} 1 & 0 & -0.5 & -1 \\ 0 & 0.5 & 1.5 & 1 \end{bmatrix}, \qquad (17) $$

is followed, according to Ref. ( ). The marginal distributions of the MBP states from trajectories sampled from LED are compared with the ground truth (test data) in Figure 10.

Definition of Metastable States
The metastable states of the MBP are defined as ellipses in the $x \in \mathbb{R}^2$ space, with the centers and axes given in Table 2.

Time-scales in the LED Latent Space
The latent space learned by LED can be utilized to identify low-energy metastable states without the need for prior knowledge. The definition of the metastable states in the rotationally and translationally invariant space constitutes such prior knowledge; minima in the free energy projection on the LED latent space constitute probable metastable states.
Figure 10: Comparison of the marginal distributions of the MBP states x and x between thetest data and trajectories of LED. The LED is propagating the dynamics on a one dimensionalreduced order latent state, i.e., d z = 1 .the metastable states in the rotationally and translationally invariant space constitutes such priorknowledge. Minima in the free energy projection on the LED latent space, constitute probablemetastable states.The trajectories sampled with LED are clustered based on these latent metastable clustersdepicted in Figure 4. An MSM is fitted on the clustered trajectories. The time-lag of the MSM isset to time units to ensure Markovianity. The timescales computed by MSM are T → = 49 and T → = 321 . LED is overestimating T → and underestimating T → . The order of thetimescales, however, is captured. In contrast, an MSM with a time-lag of ∆ t = 0 . , which is thetimestep of the LED, fails to capture the order of the timescales due to the violated Markovianityassumption ( T → = 3 and T → = 21 ). 35 ED Hyperparameters
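The MSM fit described above can be sketched with PyEMMA ( 73 )-style calls; the discrete trajectories `dtrajs` and the lag value are placeholders, and the random two-state stand-in data only keeps the snippet runnable.

```python
import numpy as np
import pyemma  # PyEMMA, Ref. (73)

# Discrete trajectories: each LED latent sample assigned to a metastable
# cluster. A random two-state stand-in replaces the clustered LED samples.
dtrajs = [np.random.randint(0, 2, size=10000)]

# Estimate an MSM at a lag large enough for Markovianity to hold.
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=100)  # lag in steps, illustrative
print(msm.timescales())                 # implied relaxation timescales
print(msm.mfpt(0, 1), msm.mfpt(1, 0))   # mean first-passage times between states
```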
LED Hyperparameters

In order to prepare the dataset for training, validation, and testing of the LED in the MBP, initial conditions are sampled from a rectangular region of the (x_1, x_2) plane. The dynamics are solved with the Velocity Verlet algorithm with time-step δt, up to T = 5000 time units, after discarding an initial transient period from the data. The data are sub-sampled to a coarse time-step ∆t, keeping every 50th data point. In this way, trajectories each corresponding to T = 5000 time units are created. LED is trained on a subset of these trajectories, a further subset is used for validation, and all trajectories are used for testing. The number and size of hidden layers are the same for the encoder E, the decoder D, and the latent MDN Z. In the first phase, the MDN-AE is trained, tuning its hyperparameters based on a grid search reported in Table 3. The autoencoder with the smallest error on the state statistics on the validation dataset is picked. Next, the MDN-LSTM is trained, tuning its hyperparameters based on a grid search reported in Table 4. The LED model with the smallest error on the state statistics on the validation dataset is picked. Both networks are trained with validation-based early stopping. The LED is tested on the full set of initial conditions. For more information on the training technicalities, the interested reader is referred to Ref. ( ).

Table 3: Hyperparameter tuning of AE for MBP
Hyperparameter                       Values
Batch size                           –
Initial learning rate                –
Weight decay rate                    –
Number of AE layers                  –
Size of AE layers                    –
Activation of AE layers              selu, tanh
Latent dimension                     –
Input/Output data scaling            [0, –]
MDN-AE kernels                       –
MDN-AE hidden units                  –
MDN-AE multivariate                  –
MDN-AE covariance scaling factor     –

Table 4: Hyperparameter tuning of LSTM for MBP
Hyperparameter                       Values
Initial learning rate                –
BPTT sequence length                 –
Number of LSTM layers                –
Size of LSTM layers                  –
Activation of LSTM cell              tanh
MDN-LSTM kernels                     –
MDN-LSTM hidden units                –
MDN-LSTM multivariate                –
MDN-LSTM covariance scaling factor   –

Table 5: Hyperparameters of LED model with lowest validation error on MBP
Hyperparameter                       Value
Number of AE layers                  –
Size of AE layers                    –
Activation of AE layers              tanh
Latent dimension                     –
MDN-AE kernels                       –
MDN-AE hidden units                  –
MDN-AE multivariate                  –
MDN-AE covariance scaling factor     –
Weight decay rate                    –
BPTT sequence length                 –
Number of LSTM layers                –
Size of LSTM layers                  –
Activation of LSTM cell              tanh
MDN-LSTM kernels                     –
MDN-LSTM hidden units                –
MDN-LSTM multivariate                –
MDN-LSTM covariance scaling factor   0.4
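The MDN components appearing in Tables 3–5 follow the standard mixture-density construction: a network head that outputs mixing weights, means, and variances of a Gaussian mixture. The sketch below is a generic PyTorch rendering of such a head and its negative log-likelihood, not the authors' exact architecture; in particular, the covariance scaling factor of the tables is omitted.

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Generic mixture-density head: maps features to a K-kernel Gaussian
    mixture over a d-dimensional output with diagonal covariance."""
    def __init__(self, in_dim, out_dim, n_kernels):
        super().__init__()
        self.out_dim, self.K = out_dim, n_kernels
        self.pi = nn.Linear(in_dim, n_kernels)                    # mixing weights
        self.mu = nn.Linear(in_dim, n_kernels * out_dim)          # kernel means
        self.log_sigma = nn.Linear(in_dim, n_kernels * out_dim)   # log std. devs.

    def forward(self, h):
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.K, self.out_dim)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.K, self.out_dim)
        return pi, mu, sigma

def mdn_nll(pi, mu, sigma, y):
    """Negative log-likelihood of targets y (batch, out_dim) under the mixture."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(y.unsqueeze(1)).sum(-1)              # (batch, K)
    return -torch.logsumexp(torch.log(pi + 1e-8) + log_prob, -1).mean()
```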
SI: Trp Cage
Marginal State Distributions
The marginal distributions of the trajectories generated by LED match the ground-truth ones (MD data) closely, as depicted in Figure 11. In Figure 12, a sample from the MD data of the Trp Cage is compared with a close sample (in terms of the latent space) of LED. The RMSD is 2.784 Å.

Figure 11: Marginal state distributions in the Trp Cage miniprotein: comparison of the state distributions estimated from the MD data (test dataset) and from trajectories sampled from LED.

Figure 12: Trp Cage protein configurations found in the MD data compared to a sample of LED that is in close proximity in the latent space. The RMSD between the two configurations is 2.784 Å.
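The RMSD reported in Figure 12 is, in general, computed after optimal rigid-body superposition of the two structures. A generic NumPy sketch of the Kabsch-aligned RMSD follows; it is not tied to the authors' pipeline.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after removing the optimal
    rigid-body translation and rotation (Kabsch algorithm)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt      # optimal rotation
    return np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1)))
```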
LED Hyperparameters

In the LED architecture, the number and size of hidden layers are the same for the encoder E, the decoder D, and the latent MDN Z. The MDN-AE is trained, tuning its hyperparameters based on the grid search reported in Table 6. The latent space of the MDN-AE is z ∈ R^2, i.e., d_z = 2. The MDN-AE model with the lowest error on the state statistics in the validation dataset is picked. Then, the MDN-AE is coupled with the MDN-LSTM as LED. The MDN-LSTM is trained to minimize the latent data likelihood, tuning its hyperparameters according to the grid search reported in Table 7. The LED model with the lowest error on the state statistics in the validation dataset is selected; its hyperparameters are reported in Table 8. The LED is tested on initial conditions randomly sampled from the testing data. Starting from these initial conditions, we utilize the iterative propagation in the latent space of the LED to forecast T = 400 ps.

Table 6: Hyperparameter tuning of AE for Trp Cage
Hyperparameter                       Values
Batch size                           –
Initial learning rate                –
Weight decay rate                    –
Number of AE layers                  –
Size of AE layers                    –
Activation of AE layers              selu, tanh
Latent dimension                     –
Input/Output data scaling            [0, –]
MDN-AE kernels                       –
MDN-AE hidden units                  –
MDN-AE covariance scaling factor     –

Table 7: Hyperparameter tuning of LSTM for Trp Cage
Hyperparameter                       Values
Batch size                           –
Initial learning rate                –
BPTT sequence length                 –
Number of LSTM layers                –
Size of LSTM layers                  –
Activation of LSTM cell              tanh
MDN-LSTM kernels                     –
MDN-LSTM hidden units                –
MDN-LSTM multivariate                –
MDN-LSTM covariance scaling factor   –

Table 8: Hyperparameters of LED model with lowest validation error on Trp Cage
Hyperparameter                       Value
Number of AE layers                  –
Size of AE layers                    –
Activation of AE layers              tanh
Latent dimension                     –
MDN-AE kernels                       –
MDN-AE hidden units                  –
MDN-AE multivariate                  –
MDN-AE covariance scaling factor     –
Weight decay rate                    –
BPTT sequence length                 –
Number of LSTM layers                –
Size of LSTM layers                  –
Activation of LSTM cell              tanh
MDN-LSTM kernels                     –
MDN-LSTM hidden units                –
MDN-LSTM multivariate                –
MDN-LSTM covariance scaling factor   –

SI: Alanine Dipeptide
A molecule of alanine dipeptide in water is simulated with MD ( ). The peptide is modeled with the AMBER03 force field ( ), while the water is modeled with TIP3P/Fs ( ). The Velocity Verlet algorithm is employed for the integration. The simulation domain is a cubic box with periodic boundary conditions and the minimum image convention. The temperature is maintained with a local Langevin thermostat ( ). A cutoff distance r_c is used for the nonbonded interactions, and the reaction field method ( ) is used for the electrostatic interaction beyond the cutoff. A timestep of δt = 1 fs is considered, and the dynamics are integrated up to a total time of T = 100 ns. The data are subsampled, keeping every 100th datapoint, so that the coarse time-step of LED is ∆t = 0.1 ps and the trajectory comprises N = 10^6 samples. The protein positions are transformed into rototranslationally invariant features (internal coordinates), composed of bonds, angles, and dihedral angles. The data are split into trajectories of 4000 samples (each trajectory corresponds to T = 400 ps of MD data), discarding the remaining data. The first portion of these trajectories is used for training and the next for validation. All initial conditions are used for testing.
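The rototranslationally invariant featurization relies on internal coordinates; as one example, the dihedral angle of four consecutive atoms (e.g., the backbone φ and ψ of the Ramachandran plot) can be computed as follows. This is the standard atan2-based construction; the function name is illustrative.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (degrees) defined by four atom positions, e.g. the
    backbone phi/psi angles of the Ramachandran plot."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))
```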
Marginal State Distributions

The marginal distributions of the trajectories generated by LED match the ground-truth ones (MD data) closely, as depicted in Figure 13.
Metastable State Definition
The protein is considered to lie in each of the five metastable states {C, P_II, α_R, α_L, C_ax} if the distance in the Ramachandran plot between the protein state and the metastable state center is smaller than a fixed threshold in degrees. The metastable state centers are defined in Table 9.

Table 9: Centers of the metastable states in the Ramachandran plot.
Metastable state   Center (φ, ψ)
P_II               (−…, …)
C                  (−…, …)
α_R                (−…, −…)
α_L                (67, …)
C_ax               (70, …)
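The state assignment described above can be sketched as follows; the center values are placeholders to be replaced by those of Table 9, and the default cutoff is taken from the 40-degree data-removal experiment later in this section.

```python
import numpy as np

# Placeholder (phi, psi) centers in degrees; substitute the values of Table 9.
CENTERS = {
    "C":    (-150.0, 150.0),
    "P_II": (-70.0, 140.0),
    "a_R":  (-80.0, -30.0),
    "a_L":  (67.0, 30.0),
    "C_ax": (70.0, -60.0),
}

def ang_dist(a, b):
    """Shortest distance between two angles on a 360-degree periodic axis."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def assign_state(phi, psi, cutoff=40.0):
    """Assign a (phi, psi) point to the nearest metastable center, or None
    if it is farther than `cutoff` degrees from all centers."""
    best, best_d = None, np.inf
    for name, (c_phi, c_psi) in CENTERS.items():
        d = np.hypot(ang_dist(phi, c_phi), ang_dist(psi, c_psi))
        if d < best_d:
            best, best_d = name, d
    return best if best_d < cutoff else None
```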
Metastable States on the Latent Space and Mean First Passage Times

The metastable states can be defined on the latent space of LED by projecting the free energy on the latent space and identifying the local minima. This alleviates the need for expert knowledge (the definition of the metastable states). The MFPTs between the metastable states on the latent space of the LED are compared with the MFPTs between the corresponding metastable states in the Ramachandran space in Table 10. Note that the results depend on how the latent metastable states are defined. However, in order to capture the order of the timescales without the need of prior expert knowledge, a rough approximation (a small region around the minima in the latent space) is adequate. The LED is able to capture the order of the timescales, alleviating the need for expert knowledge in the definition of the metastable states.

Table 10: Mean first-passage times (MFPT) between the metastable states of alanine dipeptide in water, in [ns]. MFPTs are estimated by fitting MSMs on MD trajectories. In LED, the metastable states are considered as regions around the local minima of the free-energy projection on the latent space. The average relative error is given for reference.

                          Metastable states on      Metastable states on
                          Ramachandran space        LED latent space
MFPT                      MSM     LED   rel. err.   LED     rel. err.
T_{C → P_II}              0.105   –     –           0.143   36%
T_{C → α_R}               0.104   –     –           0.124   19%
T_{P_II → C}              0.226   –     –           0.356   57%
T_{P_II → α_R}            0.105   –     –           0.123   18%
T_{α_R → C}               0.236   –     –           0.361   53%
T_{α_R → P_II}            0.116   –     –           0.148   27%
Average relative error          10.51%                    35.21%
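The free-energy projection used to locate the latent metastable states can be sketched as a histogram-based estimate, F(z) = −kT ln p(z); the bin count and units below are illustrative.

```python
import numpy as np

def latent_free_energy(z, bins=100, kT=1.0):
    """Histogram-based free-energy profile F(z) = -kT ln p(z) for a 1-D
    latent trajectory; local minima indicate metastable states."""
    hist, edges = np.histogram(z, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    F = -kT * np.log(np.clip(hist, 1e-12, None))
    return centers, F - F.min()
```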
Unraveling Novel Configurations with LED
We evaluate LED's effectiveness in unraveling novel configurations of the protein (state-space) absent from the training data. For this purpose, we create four different small datasets composed of trajectories of the protein, each one excluding one of the metastable states {C, P_II, α_R, C_ax}. This is done by removing any state that lies closer than 40 degrees to the respective metastable state's center (see the sketch after this paragraph). In this way, we guarantee that LED has not seen any state close to the metastable state missing from the data. Note that in this case, the LED is not trained on a single large MD trajectory but on small trajectories that are not temporally adjacent. We end up with four datasets, each one consisting of short trajectories of length T = 50 ps (500 steps of 0.1 ps), so that each dataset covers only a limited amount of protein simulation time. These datasets are created to evaluate the effectiveness of LED in generating truly novel configurations for faster exploration of the state space. We do not aim at this point for an accurate reproduction of the statistics, due to the minimal data used for training. In Figure 14, we plot the Ramachandran plots of the training data along with the ones obtained by analyzing the trajectories of the trained LED models in each of the four cases. We observe that the LED can unravel the metastable states P_II, C, C_ax, and α_R, even though they were not part of the training data. However, when removing states that lie close to the metastable state α_R, the LED cannot capture the α_R and C_ax metastable states. This is because the LED is trained on only a small subset of the training dataset and the transitions to these metastable states are rare.
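The 40-degree data-removal procedure can be sketched by reusing `CENTERS` and `ang_dist` from the state-assignment snippet above; the array names are illustrative.

```python
import numpy as np

def remove_state(phis, psis, frames, state="a_R", cutoff=40.0):
    """Drop every frame whose (phi, psi) lies within `cutoff` degrees of the
    chosen metastable center; CENTERS and ang_dist as defined above."""
    c_phi, c_psi = CENTERS[state]
    d = np.hypot(ang_dist(phis, c_phi), ang_dist(psis, c_psi))
    return frames[d > cutoff]
```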
Figure 13: Plot of the marginal state distributions: comparison of the state distributions estimated from the MD data (test dataset) and from trajectories sampled from LED.
Figure 14: Ramachandran plots of the training data (top row, removing in turn the states P_II, α_R, C, and C_ax) and the predicted density of the corresponding trained LED models (bottom row). LED is trained in four scenarios, hiding data that lie closer than 40 degrees to one of the metastable states {P_II, C, α_R, C_ax} each time. LED can successfully generate novel probable configurations close to the metastable states {P_II, C, α_R, C_ax}. Due to the limited training data, however, capturing the state density in the Ramachandran plot is challenging.

LED Hyperparameters
Regarding the LED architecture, the number and size of hidden layers are the same for the encoder E, the decoder D, and the latent MDN Z. The MDN-AE is trained, tuning its hyperparameters based on the grid search reported in Table 11. The latent space of the MDN-AE is z ∈ R, i.e., d_z = 1. The MDN-AE model with the lowest validation error on the state statistics is picked. Then, the MDN-AE is coupled with the MDN-LSTM in LED. The MDN-LSTM is trained to minimize the latent data likelihood, tuning its hyperparameters according to the grid search reported in Table 12. The LED model with the lowest error on the state statistics in the validation dataset is selected; its hyperparameters are reported in Table 13. The LED is tested on initial conditions randomly sampled from the testing data. Starting from these initial conditions, we utilize the iterative propagation in the latent space of the LED to forecast T = 400 ps.

Table 11: Hyperparameter tuning of AE for alanine dipeptide
Hyperparameter                       Values
Batch size                           –
Initial learning rate                –
Weight decay rate                    –
Number of AE layers                  –
Size of AE layers                    –
Activation of AE layers              selu, tanh
Latent dimension                     –
Input/Output data scaling            [0, –]
MDN-AE kernels                       –
MDN-AE hidden units                  –
MDN-AE multivariate                  –
MDN-AE covariance scaling factor     –

Table 12: Hyperparameter tuning of LSTM for alanine dipeptide
Hyperparameter                       Values
Initial learning rate                –
BPTT sequence length                 –
Number of LSTM layers                –
Size of LSTM layers                  –
Activation of LSTM cell              tanh
MDN-LSTM kernels                     –
MDN-LSTM hidden units                –
MDN-LSTM multivariate                –
MDN-LSTM covariance scaling factor   –

Table 13: Hyperparameters of LED model with lowest validation error on alanine dipeptide
Hyperparameter                       Value
Number of AE layers                  –
Size of AE layers                    –
Activation of AE layers              tanh
Latent dimension                     –
MDN-AE kernels                       –
MDN-AE hidden units                  –
MDN-AE multivariate                  –
MDN-AE covariance scaling factor     –
Weight decay rate                    –
BPTT sequence length                 –
Number of LSTM layers                –
Size of LSTM layers                  –
Activation of LSTM cell              tanh
MDN-LSTM kernels                     –
MDN-LSTM hidden units                –
MDN-LSTM multivariate                –
MDN-LSTM covariance scaling factor   0.4
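Finally, the iterative latent propagation used for forecasting (here and for the Trp Cage) can be sketched as follows; `encoder`, `mdn_lstm`, and `decoder` stand for the trained LED components with assumed interfaces, and the sampling step matches the generic MDN head sketched earlier.

```python
import torch

@torch.no_grad()
def led_forecast(x0, encoder, mdn_lstm, decoder, n_steps):
    """Iterative LED forecasting: encode one fine-scale configuration,
    propagate the latent state by sampling the MDN-LSTM at every step,
    and decode each latent sample back to the fine scale."""
    z = encoder(x0)                      # (1, d_z) latent state
    hidden, traj = None, []
    for _ in range(n_steps):
        (pi, mu, sigma), hidden = mdn_lstm(z, hidden)
        k = torch.multinomial(pi.squeeze(0), 1).item()   # sample a kernel
        z = (mu[0, k] + sigma[0, k] * torch.randn_like(mu[0, k])).unsqueeze(0)
        traj.append(decoder(z))          # back-map to the all-atom state
    return torch.cat(traj, dim=0)
```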