Learning Molecular Dynamics with Simple Language Model built upon Long Short-Term Memory Neural Network
Sun-Ting Tsai, En-Jui Kuo, and Pratyush Tiwary a)
Department of Physics and Institute for Physical Science and Technology, University of Maryland, College Park 20742, USA.
Department of Physics and Joint Quantum Institute, University of Maryland, College Park 20742, USA.
Department of Chemistry and Biochemistry and Institute for Physical Science and Technology, University of Maryland, College Park 20742, USA.
a) Electronic mail: [email protected]
(Dated: 28 April 2020)
Recurrent neural networks (RNNs) have led to breakthroughs in natural language processing and speech recognition, wherein hundreds of millions of people use such tools on a daily basis through smartphones, email servers and other avenues. In this work, we show that such RNNs, specifically Long Short-Term Memory (LSTM) neural networks, can also be applied to capturing the temporal evolution of typical trajectories arising in chemical and biological physics. Specifically, we use a character-level language model based on LSTM. This learns a probabilistic model from 1-dimensional stochastic trajectories generated from molecular dynamics simulations of a higher dimensional system. We show that the model can not only capture the Boltzmann statistics of the system but also reproduce kinetics across a large spectrum of timescales. We demonstrate how the embedding layer, introduced originally for representing the contextual meaning of words or characters, exhibits here a nontrivial connectivity between different metastable states in the underlying physical system. We demonstrate the reliability of our model and interpretations through different benchmark systems. We anticipate that our work represents a stepping stone in the use of RNNs for modeling and predicting dynamics of complex stochastic molecular systems.
I. INTRODUCTION
Recurrent neural networks (RNNs) are a machine learning technique developed for modeling temporal sequences, with demonstrated successes including but not limited to modeling human languages. A specific and extremely popular instance of RNNs are long short-term memory (LSTM) neural networks, which possess more flexibility and can be used for challenging tasks such as language modeling, machine translation, and weather forecasting. LSTMs were developed to alleviate a limitation of previously existing RNN architectures, namely that they could not learn information originating from far in the past. This is known as the vanishing gradient problem, a term that captures how the gradient or force experienced by the RNN parameters vanishes as a function of how long ago the change happened in the underlying data.
LSTMs deal with this problem by controlling flows of gradients through a so-called gating mechanism, where the gates can open or close as determined by their values learned for each input. The gradients can now be preserved over longer sequences by deliberately gating out some of the effects. This way it has been shown that LSTMs can accumulate information over a long period of time by allowing the network to dynamically learn to forget aspects of information. Very recently LSTMs have also been shown to have the potential to mimic trajectories produced by experiments or simulations, making accurate predictions about a short time into the future, given access to a large amount of data in the past. Similarly, another RNN variant named reservoir computing has been recently applied to learn and predict chaotic systems. Such a capability is already useful for instance in weather forecasting, where one needs extremely accurate predictions valid for a short period of time.

In this work, we consider an alternate and arguably novel use of RNNs, specifically LSTMs, in making predictions that, in contrast to previous work, are valid for very long periods of time but only in a statistical sense. Unlike domains such as weather forecasting or speech recognition where LSTMs have allowed very accurate predictions albeit valid only for a short duration of time, here we are interested in problems from chemical and biological physics, where the emphasis is more on making statistically valid predictions over extremely long durations of time. This is typified for example through the use of the ubiquitous notion of a rate constant for activated barrier crossing, where short-time movements are typically treated as noise and are not of interest for being captured through a dynamical model. Here we suggest an alternative way to use an LSTM-based language model to learn a probabilistic model from the time sequence along some low-dimensional order parameters produced by computer simulations or experiments of a high-dimensional system. We also show through our computer simulations of different model systems that the language model can reproduce not only the correct Boltzmann statistics (as can other AI methods) but also the kinetics over a large spectrum of modes characterizing the dynamics in the underlying data. We highlight a unique aspect of this calculation: the order parameter our framework needs could be arbitrarily close to or far from the true underlying slow mode, often called the reaction coordinate. This in turn dictates how long of a memory kernel must be captured, which is in general a very hard problem to solve. Our framework is agnostic to proximity from the true reaction coordinate and reconstructs statistically accurate dynamics for a wide range of order parameters. Our work thus represents a new usage of a popular artificial intelligence (AI) framework to perform dynamical reconstruction in a domain of potentially high fundamental and practical relevance, including materials and drug design.

The manuscript is structured as follows: In Sec. II we explain the method and the neural network architecture we used in this work. In Sec. III A, we show how the minimization of the loss function leads to learning the path entropy of a physical system. In Sec. III B, we show the connection between the embedding layer and the transition probability. Following this connection, we also show how we can define a transition probability through embedding vectors.
Our computational results are then given in Sec. IV. The computational details, including the software we used, are given in Sec. IV A. In Sec. IV B and IV C, we show our tests on Boltzmann statistics and kinetics for Langevin dynamics of model potentials and the MD simulation of alanine dipeptide, respectively. In Sec. IV D, we compare numerically the transition probability introduced in Sec. III B with the actual counts in the trajectory. Finally, a summary is given in Sec. V.
II. METHOD
Our central rationale in this work is that molecular dynamics (MD) trajectories, adequately discretized in space and time, can be mapped into a sequence of characters in some language. By using a character-level language model that is effective in predicting future characters given the characters so far in a sequence, we can then learn the evolution of the MD trajectory that was mapped into the characters. The model we use is stochastic since it learns each character through the probability with which it appears in a corpus used for training. This language model consists of three sequential parts shown schematically in Fig. 1. First, there is an embedding layer mapping one-hot vectors to dense vectors, followed by an LSTM layer which connects input states and hidden states at different time steps through a trainable recursive function, and finally a dense layer to transform the output of the LSTM to the categorical probability vector.

Specifically, here we consider as input a one-dimensional time series produced by a physical system, for instance through Langevin dynamics being undergone by a complex molecular system. The time series consists of data points {ξ(t)}, where t labels the time step and ξ ∈ R is some one-dimensional collective variable or order parameter for the high-dimensional molecular system.
FIG. 1: Schematic of the simple character-level language model used in this work. The model consists of three main parts: the embedding layer, the LSTM layer, and a dense output layer. The embedding layer is a linear layer which multiplies the one-hot input s^(t) by a matrix and produces an embedding vector x^(t). The vector x^(t) is then used as the input of the LSTM network, in which the forget gate f^(t), the input gate i^(t), the output gate o^(t), and the candidate value c̃^(t) are all controlled by (x^(t), h^(t−1)). The forget gate and input gate are then used to produce the update equation of the cell state c^(t). The output gate decides how much information propagates to the next time step. The output layer predicts the probabilities ŷ^(t) by parametrizing the transformation from h^(t) to ŷ^(t) with learned weights D_d and learned biases b_d. Finally, we compute the cross entropy between the predicted probability distribution ŷ^(t) and the true probability distribution y^(t) = s^(t+1).

In line with standard practice for probabilistic models, we convert the data points to one-hot encoded representations that implement spatial discretization. Thus each data point ξ(t) is represented by an N-dimensional binary vector s^(t), where N is the number of discrete grid points. An entry of one stands for the representative value and all the other entries are set to zero. The representative values are in general finite if the order parameter is bounded, and are equally spaced in R with in total N representative values. Note that the time series {ξ(t)} does not have to be one-dimensional. For a higher-dimensional series, we can always choose a set of representative values corresponding to locations visited by the trajectory in the higher-dimensional space. This would typically lead to a larger N in the one-hot encoded representation, but the training set size itself will naturally stay the same. We find that the computational effort depends only on the size of the training set and very weakly on N, and thus the time spent for learning a higher dimensional time series does not increase much relative to a one-dimensional series.

In the sense of modeling languages, the one-hot representation on its own cannot capture the relation between different characters. Take for instance that there is no word in the English language where the character c is followed by x, unless of course one allows for the possibility of a space or some other letter in between. To deal with this, computational linguists make use of an embedding layer. The embedding layer works as a look-up table which converts each one-hot vector s^(t) to a dense vector x^(t) ∈ R^M by multiplication with a matrix Λ called the embedding matrix, where M is called the embedding dimension:

x^(t) = Λ s^(t)    (1)

The sequence of dense representations x^(t) accounts for the relation between different characters as seen in the training time series. x^(t) is then used as the input of the LSTM layer. Each x^(t) generates an output h^(t) ∈ R^L from the LSTM layer, where L is a tunable hyperparameter. Larger L generally gives better learning capability but needs more computational resources.
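As a concrete illustration of the spatial discretization and one-hot encoding described above, the following minimal NumPy sketch maps a one-dimensional trajectory {ξ(t)} onto N equally spaced representative values and the corresponding one-hot vectors s^(t). The grid size N = 32 and the toy random-walk trajectory are illustrative choices for this sketch, not values taken from our simulations.

```python
import numpy as np

def one_hot_encode(xi, n_states=32):
    """Discretize a 1-D trajectory xi(t) onto n_states equally spaced
    representative values and return integer labels and one-hot vectors s(t)."""
    edges = np.linspace(xi.min(), xi.max(), n_states + 1)           # bin edges spanning the sampled range
    labels = np.clip(np.digitize(xi, edges) - 1, 0, n_states - 1)   # index of the representative value
    s = np.eye(n_states)[labels]                                    # (T, n_states) one-hot matrix
    return labels, s

# toy trajectory standing in for a Langevin/MD time series
xi = np.cumsum(np.random.randn(10000)) * 0.01
labels, s = one_hot_encode(xi)
print(labels.shape, s.shape)   # (10000,) (10000, 32)
```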
The LSTM itself consists of the following elements: the input gate i^(t), the forget gate f^(t), the output gate o^(t), the cell state c^(t), the candidate value c̃^(t), and h^(t), which is the hidden state vector and the final output from the LSTM. Each gate processes information in a different way. Briefly, the input gate decides which information is to be written, the forget gate decides which information is to be erased, and the output gate decides which information is to be read from the cell state to the hidden state. The update equations of these elements can be written as follows:

f^(t) = σ(W_f x^(t) + U_f h^(t−1) + b_f)    (2)
i^(t) = σ(W_i x^(t) + U_i h^(t−1) + b_i)    (3)
o^(t) = σ(W_o x^(t) + U_o h^(t−1) + b_o)    (4)
c̃^(t) = tanh(W_c x^(t) + U_c h^(t−1) + b_c)    (5)
c^(t) = f^(t) ∘ c^(t−1) + i^(t) ∘ c̃^(t)    (6)
h^(t) = o^(t) ∘ tanh(c^(t))    (7)

where the W, U and b are the corresponding weight matrices and bias vectors. The tanh(v) operates piecewise on each element of the vector v. The operation ∘ is the Hadamard product.

The final layer in Fig. 1 is a simple dense layer with fully connected neurons which converts the output h^(t) of the LSTM to a vector ŷ^(t) in which each entry denotes the categorical probability of the representative value for the next time step t+1. The loss function J for minimization during training is then defined as the cross entropy between the output of the model ŷ^(t) and the actual probability for the next timestep, which is just the one-hot vector s^(t+1):

ŷ^(t) = softmax(D_d h^(t) + b_d)    (8)

J = −Σ_{t=0}^{T−1} y^(t) · ln ŷ^(t) = −Σ_{t=0}^{T−1} s^(t+1) · ln ŷ^(t)    (9)

where T is the total length of the trajectory, and the final loss function is the sum over the whole time series. Here softmax(x)_i = exp(x_i)/Σ_j exp(x_j) is the softmax function mapping x to a probability vector ŷ.
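For readability, a minimal sketch of this three-part architecture (embedding, LSTM, dense softmax output) and the cross-entropy loss of Eq. 9 is given below using the tf.keras API; our own implementation used TensorFlow 1.10, and the optimizer settings and stand-in data here are illustrative assumptions rather than the exact training setup reported in Sec. IV A.

```python
import numpy as np
import tensorflow as tf

N, M, L, SEQ_LEN = 32, 8, 64, 100   # grid size, embedding dim M, LSTM units L, sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=N, output_dim=M),   # x(t) = Lambda s(t), Eq. 1
    tf.keras.layers.LSTM(L, return_sequences=True),         # h(t) from the update equations, Eqs. 2-7
    tf.keras.layers.Dense(N, activation="softmax"),          # y_hat(t), Eq. 8
])
# Sparse categorical cross entropy with integer targets is the loss of Eq. 9 summed over the sequence.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy")

# inputs are labels(t), targets are labels(t+1), reshaped into batches of SEQ_LEN steps
labels = np.random.randint(0, N, size=20100)                 # stand-in for a discretized trajectory
X = labels[:-1][: 200 * SEQ_LEN].reshape(-1, SEQ_LEN)
Y = labels[1:][: 200 * SEQ_LEN].reshape(-1, SEQ_LEN)
model.fit(X, Y, batch_size=64, epochs=1, verbose=0)
```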
III. THEORY

A. Training LSTM is equivalent to learning path entropy

The central finding of this work, which we demonstrate through numerical results for different systems in Sec. IV, is that an LSTM framework used to model languages can also be used to capture kinetic and thermodynamic aspects of dynamical trajectories prevalent in chemical and biological physics. In this section we demonstrate theoretically why LSTMs possess such a capability. Before we get into the mathematical reasoning detailed here as well as in the Supplementary Information (SI), we first state our key conceptual idea. Minimizing the loss function J in the LSTM (Eq. 9), which trains the model at time t to generate output ŷ^(t) resembling the target output s^(t+1), is equivalent to minimizing the difference between the actual and LSTM-learned path probabilities. This difference between path probabilities can be calculated as a cross entropy J′ defined as:

J′ = −Σ_{x^(T) ... x^(0)} P(x^(T) ... x^(0)) ln Q(x^(T) ... x^(0))    (10)

where P(x^(T), ..., x^(0)) and Q(x^(T), ..., x^(0)) are the corresponding true and neural-network-learned path probabilities of the system. Eq. 10 can be rewritten as the sum of the path entropy H(P) of the true distribution P and the Kullback-Leibler divergence D_KL between P and Q: J′ = H(P) + D_KL(P||Q). Since D_KL is strictly non-negative, attaining the value of 0 iff Q = P, the global minimum of J′ happens when Q = P, at which point J′ equals the path entropy H(P) of the system. Thus we claim that minimizing the loss function in the LSTM is equivalent to learning the path entropy of the underlying physical model, which is what makes it capable of capturing kinetic information of the dynamical trajectory.

To prove this claim we start by rewriting J in Eq. 9. For a long enough observation period T, or for a very large number of trajectories, J can be expressed as the cross entropy between conditional probabilities:

J = −Σ_{t=0}^{T−1} Σ_{x^(t+1)} P(x^(t+1) | x^(t) ... x^(0)) ln Q(x^(t+1) | x^(t) ... x^(0))    (11)

where P(x^(t+1) | x^(t) ... x^(0)) is the true conditional probability for the physical system, and Q(x^(t+1) | x^(t) ... x^(0)) is the conditional probability learned by the neural network. The minimization of Eq. 11 leads to minimization of the cross entropy J′ as shown in the SI. Here we conversely show how Eq. 10 reduces to Eq. 9 by assuming a stationary first-order Markov process as in Ref. 23:

P(x^(T) ... x^(0)) = P(x^(T) | x^(T−1)) ... P(x^(1) | x^(0)) P(x^(0))
Q(x^(T) ... x^(0)) = Q(x^(T) | x^(T−1)) ... Q(x^(1) | x^(0)) Q(x^(0))    (12)

where P(x^(t+1)_j | x^(t)_i) ≡ P_ij is the transition probability from state x_i to state x_j and P(x^(0)_k) ≡ P_k is the occupation probability of the single state x_k. Plugging Eq. 12 into Eq. 10, and following the derivation in Ref. 23 with the constraints

Σ_j P_ij = 1,    Σ_i P_i P_ij = P_j    (13)

we arrive at an expression for the cross entropy J′, which is very similar to the path entropy type expressions derived for instance in the framework of Maximum Caliber:

J′ = −Σ_i P_i ln Q_i − T Σ_{lm} P_l P_lm ln Q_lm    (14)
   → −T Σ_{lm} P(x_l) P(x_m | x_l) ln Q(x_m | x_l)    (15)

In Eq. 14, as the trajectory length T increases, the second term dominates the estimate of J′, leading to Eq. 15.
This second term is the ensemble average of a time-dependent quantity J̃(x^(t)_l) ≡ −Σ_m P(x^(t+1)_m | x^(t)_l) ln Q(x^(t+1)_m | x^(t)_l). For a large enough T, the ensemble average can be replaced by the time average by assuming ergodicity:

J′ = −Σ_{t=1}^{T} Σ_m P(x^(t+1)_m | x^(t)_l) ln Q(x^(t+1)_m | x^(t)_l)    (16)

from which we directly obtain Eq. 9. Therefore, under first-order Markovianity and ergodicity, minimizing the loss function J of Eq. 9 is equivalent to minimizing J′ and thereby learning the path entropy. In the SI we provide a proof of this statement that lifts the Markovianity assumption as well; the central idea there is similar to what we showed here.
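This claim is easy to check numerically on a small Markov chain: the per-step path cross entropy (the second term of Eq. 14, divided by T) is bounded below by the path entropy rate and touches it only when Q = P. The transition matrices in the sketch below are arbitrary illustrative choices, not fitted to any of our systems.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a row-stochastic transition matrix."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def path_cross_entropy_rate(P, Q):
    """Per-step cross entropy -sum_l pi_l sum_m P_lm ln Q_lm."""
    pi = stationary(P)
    return -np.sum(pi[:, None] * P * np.log(Q))

P = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])
Q = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])

print(path_cross_entropy_rate(P, P))   # the path entropy rate H(P) per step
print(path_cross_entropy_rate(P, Q))   # always >= the above, with equality iff Q == P
```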
B. Embedding layer in LSTM captures kinetic distances

In word embedding theory, the embedding layer provides a measure of similarity between words. However, from the path probability representation, it is unclear how the embedding layer works, since the derivation can be done without the embedding vectors x. To gain an understanding of Q_lm in the first-order Markov process, we first write the conditional probability Q_lm = Q(x^(t+1)_m | x^(t)_l) explicitly with the softmax defined in Eq. 8 and the embedding vectors x defined in Eq. 1:

Q_lm = exp(s^(t+1)_m · (D_d h^(t) + b_d)) / Σ_k exp(s_k · (D_d h^(t) + b_d))
     = exp(s^(t+1)_m · (D_d f_θ(x^(t)) + b_d)) / Σ_k exp(s_k · (D_d f_θ(x^(t)) + b_d))    (17)

where f_θ is the recursive function h^(t) = f_θ(x^(t), h^(t−1)) ≈ f_θ(x^(t)) defined through the update equations Eqs. 2-7. In Eq. 17, θ denotes the various parameters, including all weight matrices and biases, and the summation index k runs over all possible states. Now we can use the multivariable Taylor theorem to approximate f_θ by its linear term around a point a, as long as a is not at any local minimum of f_θ:

f_θ(x^(t)) ≈ f_θ(a) + A_θ (x^(t) − a)    (18)

where A_θ is the L by M matrix defined by (A_θ)_ij = ∂(f_θ)_i/∂x_j evaluated at x = a. Then Eq. 17 becomes

Q_lm = exp(C^(t+1)_m) exp(s^(t+1)_m · D_d A_θ x^(t)_l) / Σ_k exp(C_k) exp(s_k · D_d A_θ x^(t)_l)    (19)

where C^(t+1)_i = s^(t+1)_i · [D_d (f_θ(a) − A_θ a) + b_d]. We can see in Eq. 19 how the embedding vectors enter the transition probability. Specifically, there is a symmetric form between the output one-hot vectors s^(t+1)_m and the input one-hot vectors s^(t), in which x^(t) = Λ s^(t) with Λ the input embedding matrix, D_d A_θ can be seen as the output embedding matrix, and C^(t+1)_i is a correction accounting for the time-lag effect. While we do not have an explicit way to calculate the output embedding matrix so defined, Eq. 19 motivates us to define the following ansatz for the transition probability:

Q_lm = Q(x_m | x_l) = exp(x_m · x_l) / Σ_k exp(x_k · x_l)    (20)

where x_m and x_l are both calculated with the input embedding matrix Λ. The expression in Eq. 20 is thus a tractable approximation to the more exact transition probability in Eq. 19. Furthermore, we will show in Sec. IV D, through numerical examples on test systems, that our ansatz for Q_lm does correspond to the kinetic connectivity between states. That is, the LSTM embedding layer, with the transition probability defined through Eq. 20, can capture the average commute time between two states in the original physical system, irrespective of the quality of the low-dimensional projection fed to the LSTM.
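Evaluating the ansatz of Eq. 20 only requires the learned input embedding matrix Λ. A minimal sketch is given below; reading Λ off a trained tf.keras model as model.layers[0].get_weights()[0] is an assumption tied to the sketch in Sec. II, and the random Λ used here is a stand-in for a trained matrix.

```python
import numpy as np

def embedding_transition_matrix(Lambda):
    """Q_lm = exp(x_m . x_l) / sum_k exp(x_k . x_l), Eq. 20, with x_l the rows of Lambda."""
    logits = Lambda @ Lambda.T                       # (N, N) matrix of dot products x_l . x_m
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax numerically
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)          # row l holds Q(x_m | x_l) over all m

# Lambda = model.layers[0].get_weights()[0]   # (N, M) embedding matrix from a trained model
Lambda = np.random.randn(32, 8)               # stand-in values for illustration
Q = embedding_transition_matrix(Lambda)
print(Q.shape, Q.sum(axis=1)[:3])             # rows sum to 1
```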
IV. COMPUTATIONAL RESULTS

A. Test systems

To demonstrate our ideas, here we consider three model potentials and the popular model molecule alanine dipeptide. When applying our neural network to the model systems, the embedding dimension M was set to 8 and the number of LSTM units L to 64. For alanine dipeptide we took M = 128 and L = 512. All time series were batched into sequences with a sequence length of 100 and a batch size of 64. For each model potential, the neural network was trained using stochastic gradient descent for 20 epochs, until the training loss became smaller than the validation loss, indicating that an appropriate amount of training had been reached. For alanine dipeptide, 40 training epochs were used. Our neural network was built using TensorFlow version 1.10.

All model potentials have two degrees of freedom x and y. Our first two models (shown in Fig. 2(a) and Fig. 2(b)) have three metastable states with governing potential U(x, y) given by

U(x, y) = W(x + y) − G(x, x_1)G(y, y_1) − G(x, x_2)G(y, y_2) − G(x, x_3)G(y, y_3)    (21)

where W is the prefactor of the confining term, G(x, x_i) = exp(−(x − x_i)²/(2σ²)) denotes a Gaussian function centered at x_i with width σ, and the three well centers (x_i, y_i) are placed either along a line (linear 3-state potential) or at the vertices of a triangle (triangular 3-state potential). The third model (Fig. 2(c)) has four metastable states and is governed by

U(x, y) = W(x + y) + G(x, x_0)G(y, y_0) − Σ_{i=1}^{4} G(x, x_i)G(y, y_i)    (22)

where the positive Gaussian term creates a barrier and the four negative Gaussian terms create four symmetrically placed metastable wells. The analytical free energies associated with these model potentials are shown in Figs. 2(a)-(c). The integration timestep for the Langevin dynamics simulations was 0.01 units, and the simulations were performed at β = 9, where β = 1/k_B T. The MD trajectory for alanine dipeptide was obtained using the software GROMACS 5.0.4, patched with PLUMED 2.4. The temperature was kept constant at 450 K using the velocity rescaling thermostat.
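For completeness, a minimal sketch of how such two-dimensional Langevin trajectories can be generated is shown below. The confining term, well positions, well width, number of steps and β used here are illustrative stand-ins and not the exact parameters of the potentials in Fig. 2; only the x component is retained, mimicking the one-dimensional order parameter fed to the LSTM.

```python
import numpy as np

def gaussian(v, v0, sigma):
    return np.exp(-((v - v0) ** 2) / (2 * sigma ** 2))

def potential_grad(x, y, wells, W=0.05, sigma=0.5, eps=1e-5):
    """Confining term plus Gaussian wells; gradient estimated by central finite differences."""
    def U(x, y):
        u = W * (x ** 2 + y ** 2)                    # illustrative confining term
        for (x0, y0) in wells:
            u -= gaussian(x, x0, sigma) * gaussian(y, y0, sigma)
        return u
    dx = (U(x + eps, y) - U(x - eps, y)) / (2 * eps)
    dy = (U(x, y + eps) - U(x, y - eps)) / (2 * eps)
    return dx, dy

def overdamped_langevin(wells, beta=9.0, dt=0.01, n_steps=50000, seed=0):
    """Overdamped Langevin dynamics x <- x - grad(U) dt + sqrt(2 dt / beta) * noise."""
    rng = np.random.default_rng(seed)
    x, y, traj = 0.0, 0.0, np.empty(n_steps)
    noise = np.sqrt(2 * dt / beta)
    for t in range(n_steps):
        dx, dy = potential_grad(x, y, wells)
        x += -dx * dt + noise * rng.standard_normal()
        y += -dy * dt + noise * rng.standard_normal()
        traj[t] = x                                   # keep only the x projection
    return traj

traj = overdamped_langevin(wells=[(-2.0, 0.0), (0.0, 0.0), (2.0, 0.0)])
```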
B. Boltzmann statistics and kinetics for model potentials

The first test we perform for our model is its ability to capture the Boltzmann weighted statistics of the different states in each model potential. This is the probability distribution P, or equivalently the related free energy F = −(1/β) ln P, and can be calculated by direct counting from the trajectory. As can be seen in Fig. 2, the LSTM does an excellent job of recovering the Boltzmann probability within error bars.
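The comparison in Fig. 2(g)-(i) amounts to histogramming the input and LSTM-generated trajectories and converting state populations into free energies. A minimal sketch, with an illustrative bin count and β, is:

```python
import numpy as np

def free_energy_1d(traj, n_bins=32, beta=9.0):
    """Estimate F(xi) = -(1/beta) ln P(xi) from a 1-D trajectory by direct counting."""
    counts, edges = np.histogram(traj, bins=n_bins)
    P = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    F = np.full(n_bins, np.inf)
    F[P > 0] = -np.log(P[P > 0]) / beta
    F -= F[np.isfinite(F)].min()          # shift the minimum of the free energy to zero
    return centers, F

# centers, F_md   = free_energy_1d(traj_md)     # from the Langevin/MD input data
# centers, F_lstm = free_energy_1d(traj_lstm)   # from a trajectory generated by the trained LSTM
```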
FIG. 2: The analytical free energy surfaces generated from the (a) linear 3-state, (b) triangular 3-state, and (c) symmetric 4-state model potentials, and (d), (e), (f) the corresponding 1-dimensional projections along the x-direction. In the bottom row, we compare the Boltzmann probabilities of the (g) linear 3-state, (h) triangular 3-state, and (i) symmetric 4-state models at each labeled state, generated from the actual MD simulation and from our neural network model.

Next we describe how our LSTM deals with a well-known problem in analyzing high-dimensional data sets through low-dimensional projections. One can project the high-dimensional data along many different possible low-dimensional order parameters, for instance x, y or a combination thereof in Fig. 2. However, most such projections will end up not being kinetically truthful and will give a wrong impression of how distant the metastable states actually are from each other in the underlying high-dimensional space. It is in general hard to come up with a projection that preserves the kinetic properties of the high-dimensional space. Consequently, it is hard to design analysis or sampling methods that, even when given a time series along a sub-optimal projection, still capture the true kinetic distances in the underlying high-dimensional space.

Here we show how our LSTM model is agnostic to the quality of the low-dimensional projection in capturing accurate kinetics. Given that for each of the 3 potentials the LSTM was provided only the x trajectory, we can expect that the chosen model potentials constitute different levels of difficulty in generating correct kinetics. Specifically, a one-dimensional projection along x is kinetically truthful for the linear 3-state potential in Fig. 2(a) but not for the triangular 3-state and the 4-state potentials in Figs. 2(b) and (c) respectively. For instance, Fig. 2(e) gives the impression that state C is kinetically very distant from state A, while in reality for this potential all 3 pairs of states are equally close to each other. Similar concerns apply to the 4-state potential.

In Figs. 3 and 4 we compare the actual versus LSTM-predicted kinetics for moving between different metastable states for the different model potentials, for all pairs of transitions in both directions (i.e. for instance A to B and B to A). Specifically, Fig. 3(a)-(c) and Fig. 3(d)-(f) show results for moving between the 3 pairs of states in the linear and triangular 3-state potentials respectively. Fig. 4 shows results for the 6 pairs of states in the 4-state potential. Furthermore, for every pair of states, we analyze the transition time between those states as a function of different minimum commitment times, i.e. the minimum time that must be spent by the trajectory in a given state for it to be classified as having committed to that state. A limiting value, and more specifically the rate at which the population decays to attain such a limiting value, corresponds to the inverse of the rate constant for moving between those states. Thus here we show how our LSTM captures not just the rate constant, but also the time-dependent fluctuations in the population of a given metastable state as equilibrium is attained. The results are averaged over 20 independent segments taken from the trajectories of different trials of training for the 3-state potentials and 10 independent segments for the 4-state potential.
As can be seen in Figs. 3 and 4, the LSTM model does an excellent job of reproducing, well within error bars, the transition times between different metastable states for the different model potentials, irrespective of the quality of the low-dimensional projection. Firstly, our model does tell the difference between the linear and triangular 3-state models (Fig. 3) even though the projected free energies along the x variable input into the LSTM are the same (Fig. 2). For the linear configuration, the number of transitions between states A and C is smaller than for the other pairs; while for the triangular configuration, the numbers of transitions between all pairs of states are similar. The rate at which the transition count decays as a function of commitment time is also preserved between the input data and the LSTM prediction.

The next part of our second test is the 4-state model potential. In Fig. 4 we show comparisons for all 6 pairs of transitions in both forward and reverse directions. A few features are immediately striking here. Firstly, even though states B and C are perceived to be kinetically proximal from the free energy (Fig. 2), the LSTM captures that they are distal from each other and correctly assigns a similar kinetic distance to the pair B,C as it does to A,D. Secondly, there is asymmetry between the forward and backward directions (e.g. A to D and D to A), indicating that the input trajectory itself has not yet sufficiently sampled the slow transitions in this potential. As can be seen from Fig. 2(c), the input trajectory has barely 1 or 2 direct transitions across the very high barriers A to D or B to C. This is a likely explanation for why our LSTM model does a bit worse here than for the other two model potentials in capturing the slowest transition rates, and for the higher error bars we see here. In other words, so far we can conclude that while our LSTM model can capture equilibrium probabilities and transition rates for different model potentials irrespective of the input projection direction or order parameter, it is still not a panacea for insufficient sampling itself, as one would expect.
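The transition counts in Figs. 3 and 4 follow from a simple dwell-time bookkeeping: a visit to a state only counts as committed once the trajectory has stayed there for at least the chosen commitment time. The sketch below is one possible implementation of this counting, not the exact script used for the figures.

```python
import numpy as np
from collections import Counter

def committed_transition_counts(labels, commit_steps):
    """Count transitions between metastable states; a state only counts as committed
    once the trajectory has dwelled in it for at least commit_steps frames."""
    labels = np.asarray(labels)
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [len(labels)])))
    states = labels[starts]

    counts = Counter()
    current = None
    for s, d in zip(states, lengths):
        if d < commit_steps:
            continue                       # too short-lived to count as a committed visit
        if current is not None and s != current:
            counts[(current, s)] += 1      # committed transition current -> s
        current = s
    return counts

# number of transitions between state 0 and state 1 for a range of commitment times, as in Fig. 3
# for tc in (10, 50, 100, 500):
#     print(tc, committed_transition_counts(state_labels, tc)[(0, 1)])
```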
FIG. 3: Number of transitions between different pairs of metastable states as a function of the commitment time defined in Sec. IV B. The calculations for the linear and triangular configurations are shown in (a)-(c) and (d)-(f) respectively.
C. Boltzmann statistics and kinetics for alanine dipeptide
Finally, we apply our LSTM model to the study of conformational transitions in alanine dipeptide, a model biomolecular system comprising 22 atoms, experiencing thermal fluctuations when coupled to a heat bath. The structure of alanine dipeptide is shown in Fig. 5(a). While the full system comprises around 63 degrees of freedom, typically the torsional angles φ and ψ are used to identify the conformations of this peptide. Over the years a large number of methods have been tested on this system in order to perform enhanced sampling of these torsions, as well as to construct optimal reaction coordinates. Here we show that our LSTM model can very accurately capture the correct Boltzmann statistics as well as the transition rates for moving between the two dominant metastable states, known as C_eq and C_ax. Importantly, the reconstruction of the equilibrium probability and transition kinetics, as shown in Fig. 5 and Table I, is extremely accurate irrespective of the choice of the one-dimensional projection time series fed into the LSTM. Specifically, we do this along sin φ and sin ψ, both of which are known to be quite distant from an optimized kinetically truthful reaction coordinate, and in both cases we find excellent agreement between the input and LSTM-predicted results.
FIG. 4: Number of transitions between different pairs of metastable states as a function of the commitment time defined in Sec. IV B, for the 4-state model system.
FIG. 5: (a) The molecular structure of alanine dipeptide used in the actual MD simulation, with the torsional angles φ and ψ used as the collective variables (CVs). (b) and (c) The 1-dimensional free energy curves along sin φ and sin ψ, calculated using the actual MD data and the data generated from the LSTM.

Rate: Alanine dipeptide
CVs      Label    C_eq to C_ax (ps)    C_ax to C_eq (ps)
sin φ    actual   5689.22 ± ...        ...
sin φ    LSTM     ...                  ...
sin ψ    actual   5001.42 ± ...        ...
sin ψ    LSTM     ...                  ...
TABLE I: Transition rates for conformational transitions in alanine dipeptide calculated from actual MD trajectories and from the LSTM model. Here we show the calculation along two different CVs: sin φ and sin ψ.

D. Embedding layer based kinetic distance
In Sec. III B, we first derived a non-tractable relation for the conditional transition probability in terms of the embedding layer (Eq. 19), and then through Eq. 20 we introduced a tractable ansatz in the spirit of Eq. 19. In this section we revisit and numerically validate Eq. 20. Specifically, given any two embedding vectors x_l and x_m calculated from any two states l and m, we estimate the conditional probability Q_lm using Eq. 20. We use Q_i to denote the Boltzmann probability of state i predicted by the LSTM model. We then write down the interconversion probability k_lm between states l and m as:

k_lm = Q_l Q_lm + Q_m Q_ml ≡ 1/t_lm    (23)

From inverting this rate we then calculate an LSTM-kinetic time as t_lm ≡ 1/k_lm = 1/(Q_l Q_lm + Q_m Q_ml). In Fig. 6, we compare t_lm with the actual transition time τ_lm obtained from the input data, defined as

τ_lm = T / ⟨N_lm⟩    (24)

Here N_lm is the number of transitions between states l and m. As this number varies with the precise value of the commitment time, we average N_lm over all commitment times to get ⟨N_lm⟩. These two timescales t_lm and τ_lm thus represent the average commute time or kinetic distance between two states l and m. To facilitate the comparison between these two very differently derived timescales or kinetic distances, we rescale and shift them to lie between 0 and 1. The results in Fig. 6 show that the embedding vectors display the connectivity corresponding to the original high-dimensional configuration space rather than that corresponding to the one-dimensional projection. The model captures the correct connectivity by learning kinetics, which is clear evidence that it is able to bypass the projection error along any degree of freedom. This result also explains why, no matter what degree of freedom we use, our LSTM model still gives correct transition times. As long as the degree of freedom we choose to train the model can be used to discern all metastable states, we can even use Eq. 20 to see the underlying connectivity. Therefore, the embedding vectors in the LSTM define a useful distance metric which can be used to understand and model dynamics, and are possibly part of the reason why LSTMs can model kinetics accurately in spite of the quality of the projection and the associated non-Markovian effects.
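Putting Eqs. 20, 23 and 24 together, the LSTM-side time t_lm and the trajectory-side commute time τ_lm can be computed and rescaled to [0, 1] for a comparison like that in Fig. 6. The function and variable names in this sketch are ours and purely illustrative.

```python
import numpy as np

def lstm_kinetic_time(Q_occ, Q_trans, l, m):
    """t_lm = 1 / (Q_l Q_lm + Q_m Q_ml), Eq. 23, from LSTM-predicted occupation
    probabilities Q_occ and the embedding-based transition matrix Q_trans (Eq. 20)."""
    return 1.0 / (Q_occ[l] * Q_trans[l, m] + Q_occ[m] * Q_trans[m, l])

def actual_commute_time(T, mean_transition_counts, l, m):
    """tau_lm = T / <N_lm>, Eq. 24, from the mean number of l<->m transitions
    counted in an input trajectory of total length T."""
    return T / mean_transition_counts[(l, m)]

def rescale(values):
    """Shift and rescale a set of kinetic times to lie between 0 and 1, as in Fig. 6."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())
```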
FIG. 6: Analysis of the embedding layer constructed for (a) the linear and triangular 3-state and (b) the 4-state model systems. In (a), we use solid circle and empty square markers respectively to represent the linear and triangular 3-state model potentials. In each plot, the data points are shifted slightly to the right for clarity. The distances marked "actual" and "LSTM" represent rescaled mean transition times as per Eqs. 24 and 23 respectively. Error bars were calculated over 50 different trajectories.
V. CONCLUSIONS
In summary, we believe this work demonstrates the potential for using AI approaches developed for natural language processing, such as speech recognition and machine translation, in unrelated domains such as chemical and biological physics. This work represents a first step in this direction, wherein we used AI, specifically the LSTM flavor of recurrent neural networks, to perform kinetic reconstruction tasks that other methods could have also performed. We would argue that demonstrating the ability of AI approaches to perform tasks that one could have done otherwise is a crucial first step. In future works we will explore different directions in which the AI protocol developed here could be used to perform tasks that are increasingly non-trivial in non-AI setups. More specifically, in this work we have shown that a simple character-level language model based on an LSTM neural network can learn a probabilistic model of a time series generated from a physical system, such as the evolution of Langevin dynamics or an MD simulation of a complex molecular model. We have shown that the probabilistic model can not only learn the Boltzmann statistics but also capture a large spectrum of kinetics. The embedding layer, which was designed for encoding the contextual meaning of words and characters, displays a nontrivial connectivity and has been shown to correlate with the kinetic map defined for reversible Markov chains.
For the different model systems considered here, we could obtain correct timescales and rate constants irrespective of the quality of the order parameter fed into the LSTM. As a result, we believe this kind of model outperforms traditional approaches for learning thermodynamics and kinetics, which can often be very sensitive to the choice of projection. Finally, the embedding layer can be used to define a new type of distance metric for high-dimensional data when one has access to only some low-dimensional projection. We hope that this work represents a first step in the use of RNNs for modeling, understanding and predicting the dynamics of complex stochastic systems found in biology, chemistry and physics.
ACKNOWLEDGMENTS
P.T. thanks Dr. Steve Demers for suggesting the use of LSTMs. The authors thank Carlos Cuellar for help in the early stages of this project, Yihang Wang for in-depth discussions, and Dedi Wang, Yixu Wang, and Zachary Smith for their helpful insights and suggestions. Acknowledgment is made to the Donors of the American Chemical Society Petroleum Research Fund for partial support of this research (PRF 60512-DNI6). We also thank Deepthought2, MARCC and XSEDE (projects CHE180007P and CHE180027P) for computational resources used in this work.
REFERENCES

Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, volume 5. GMD-Forschungszentrum Informationstechnik Bonn, 2002.
R. Rico-Martinez, K. Krischer, I. G. Kevrekidis, M. C. Kube, and J. L. Hudson. Discrete- vs. continuous-time nonlinear signal processing of Cu electrodissolution data. Chemical Engineering Communications, 118(1):25-48, 1992.
N. Gicquel, J. S. Anderson, and I. G. Kevrekidis. Noninvertibility and resonance in discrete-time neural networks for time-series processing. Physics Letters A, 238(1):8-18, 1998.
Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855-868, 2008.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645-6649. IEEE, 2013.
Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. 2014.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802-810, 2015.
Kai Chen, Yi Zhou, and Fangyan Dai. A LSTM-based method for stock returns prediction: A case study of China stock market. In 2015 IEEE International Conference on Big Data, pages 2823-2824. IEEE, 2015.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.
Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
Joshua C. Agar, Brett Naul, Shishir Pandya, Stefan van der Walt, Joshua Maher, Yao Ren, Long-Qing Chen, Sergei V. Kalinin, Rama K. Vasudevan, Ye Cao, et al. Revealing ferroelectric switching character using deep recurrent neural networks. Nature Communications, 10(1):1-11, 2019.
Mohammad Javad Eslamibidgoli, Mehrdad Mokhtari, and Michael H. Eikerling. Recurrent neural network-based model for accelerated trajectory analysis in AIMD simulations. arXiv preprint arXiv:1909.10124, 2019.
Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009.
Jaideep Pathak, Brian Hunt, Michelle Girvan, Zhixin Lu, and Edward Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Physical Review Letters, 120(2):024102, 2018.
Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.
Giovanni Bussi and Alessandro Laio. Using metadynamics to explore complex free-energy landscapes. Nature Reviews Physics, pages 1-13, 2020.
Yihang Wang, João Marcelo Lamim Ribeiro, and Pratyush Tiwary. Past-future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics. Nature Communications, 10(1):1-8, 2019.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
Steve Pressé, Kingshuk Ghosh, Julian Lee, and Ken A. Dill. Principles of maximum entropy and maximum caliber in statistical physics. Reviews of Modern Physics, 85(3):1115, 2013.
Calvin C. Moore. Ergodic theorem, ergodic theory, and statistical mechanics. Proceedings of the National Academy of Sciences, 112(7):1907-1911, 2015.
Frank Noé, Ralf Banisch, and Cecilia Clementi. Commute maps: Separating slowly mixing molecular configurations for kinetic modeling. Journal of Chemical Theory and Computation, 12(11):5620-5630, 2016.
Frank Noé and Cecilia Clementi. Kinetic distance and kinetic maps from molecular dynamics simulation. Journal of Chemical Theory and Computation, 11(10):5002-5011, 2015.
Sun-Ting Tsai and Pratyush Tiwary. Molecular Simulation, in press. doi:10.1080/08927022.2020.1761548.
Giovanni Bussi and Michele Parrinello. Accurate sampling using Langevin dynamics. Physical Review E, 75(5):056707, 2007.
Herman J. C. Berendsen, David van der Spoel, and Rudi van Drunen. GROMACS: a message-passing parallel molecular dynamics implementation. Computer Physics Communications, 91(1-3):43-56, 1995.
Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C. Smith, Berk Hess, and Erik Lindahl. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19-25, 2015.
Massimiliano Bonomi, Giovanni Bussi, Carlo Camilloni, et al. Promoting transparency and reproducibility in enhanced molecular simulations. Nature Methods, 16:670-673, 2019.
Giovanni Bussi, Davide Donadio, and Michele Parrinello. Canonical sampling through velocity rescaling. The Journal of Chemical Physics, 126(1):014101, 2007.
Peter Hänggi, Peter Talkner, and Michal Borkovec. Reaction-rate theory: fifty years after Kramers. Reviews of Modern Physics, 62(2):251, 1990.
Bruce J. Berne, Michal Borkovec, and John E. Straub. Classical and modern methods in reaction rate theory. The Journal of Physical Chemistry, 92(13):3711-3725, 1988.
Omar Valsson, Pratyush Tiwary, and Michele Parrinello. Enhancing important fluctuations: Rare events and metadynamics from a conceptual viewpoint. Annual Review of Physical Chemistry, 67:159-184, 2016.
Matteo Salvalaglio, Pratyush Tiwary, and Michele Parrinello. Assessing the reliability of the dynamics reconstructed from metadynamics. Journal of Chemical Theory and Computation, 10(4):1420-1425, 2014.
Ao Ma and Aaron R. Dinner. Automatic method for identifying reaction coordinates in complex systems. The Journal of Physical Chemistry B, 109(14):6769-6779, 2005.
Peter G. Bolhuis, Christoph Dellago, and David Chandler. Reaction coordinates of biomolecular isomerization. Proceedings of the National Academy of Sciences, 97(11):5877-5882, 2000.
Zachary Smith, Debabrata Pramanik, Sun-Ting Tsai, and Pratyush Tiwary. Multi-dimensional spectral gap optimization of order parameters (SGOOP) through conditional probability factorization. The Journal of Chemical Physics, 149(23):234105, 2018.
Guillermo Pérez-Hernández, Fabian Paul, Toni Giorgino, Gianni De Fabritiis, and Frank Noé. Identification of slow molecular order parameters for Markov model construction. The Journal of Chemical Physics, 139(1):07B604_1, 2013.
John D. Chodera and Frank Noé. Markov state models of biomolecular conformational dynamics. Current Opinion in Structural Biology, 25:135-144, 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3159-3166, 2019.

Supplementary Information: Learning Molecular Dynamics with Simple Language Model built upon Long Short-Term Memory Neural Network
VI. APPROXIMATING THE LOSS FUNCTION
In the following derivation, we show how we obtain J′, introduced in Eq. 10 of the main text, as an estimate when minimizing the cross entropy J from Eq. 9 of the main text. To begin, we start with the cross entropy J:

J = −Σ_{t=0}^{T−1} y^(t) · ln ŷ^(t) = −Σ_{t=0}^{T−1} Σ_{s^(t+1)} P(s^(t+1) | s^(t), ... s^(0)) ln ŷ^(t)    (25)

where P(s^(t+1) | s^(t), ... s^(0)) is the conditional probability of the physical system computed from the one-hot vectors of the data. Even if the trajectory has a dependency on its long-term history, as long as the trajectory length T ≫ 0,
we can approximate Eq. 25 as:

J ≈ −Σ_{t=0}^{T−1} Σ_{s^(t+1)} P(s^(t+1) | s^(t), ... s^(t−ΔT)) ln ŷ^(t)    (26)

where ΔT denotes the effective memory of the dynamics. As is typical in character-level language models, we assume the embedding dimension M is much greater than the input dimension N and rewrite the above equation as:

J = T [ −(1/T) Σ_{t=0}^{T−1} Σ_{x^(t+1)} P(x^(t+1) | x^(t), ... x^(t−ΔT+1)) ln Q(x^(t+1) | x^(t), ... x^(t−ΔT+1)) ]    (27)
  = T [ (1/T) Σ_{t=0}^{T−1} J̃^(t)(x^(t), ... x^(t−ΔT+1)) ]    (28)

where

J̃^(t)(x^(t), ... x^(t−ΔT+1)) ≡ −Σ_{x^(t+1)} P(x^(t+1) | x^(t), ... x^(t−ΔT+1)) ln Q(x^(t+1) | x^(t), ... x^(t−ΔT+1))

is the cross entropy between conditional probabilities. With large enough T, we can also assume ergodicity and convert the time average into an ensemble average:

J ≈ T Σ_{x^(T−1) ... x^(0)} P(x^(T−1), ..., x^(0)) J̃(x^(T−1), ... x^(0))    (29)
  = −T Σ_{x^(T)} Σ_{x^(T−1) ... x^(0)} P(x^(T−1), ..., x^(0)) P(x^(T) | x^(T−1), ..., x^(0)) ln Q(x^(T) | x^(T−1), ..., x^(0))    (30)
  = −T Σ_{x^(T) ... x^(0)} P(x^(T), ..., x^(0)) ln Q(x^(T) | x^(T−1), ..., x^(0))    (31)

The cross entropy J achieves its global minimum when Q approaches P:

Q(x^(t+1) | x^(t), ... x^(t−ΔT+1)) → P(x^(t+1) | x^(t), ... x^(t−ΔT+1)) for all t    (32)

We also know that

Q(x^(T) | x^(T−1), ..., x^(0)) = Q(x^(T), ..., x^(0)) / Q(x^(T−1), ..., x^(0))    (33)

Plugging Eq. 33 into Eq. 31,

J = −T Σ_{x^(T) ... x^(0)} P(x^(T), ..., x^(0)) ln Q(x^(T) | x^(T−1), ..., x^(0))
  = −T Σ_{x^(T) ... x^(0)} [ P(x^(T), ..., x^(0)) ln Q(x^(T), ..., x^(0)) − P(x^(T), ..., x^(0)) ln Q(x^(T−1), ..., x^(0)) ]
  = J′ − J′′

where Q(x^(T), ..., x^(0)) → P(x^(T), ..., x^(0)) during the minimization. Here we have defined J′ and J′′ as follows:

J′ ≡ −T Σ_{x^(T) ... x^(0)} P(x^(T), ..., x^(0)) ln Q(x^(T), ..., x^(0))    (34)

J′′ ≡ −T Σ_{x^(T) ... x^(0)} P(x^(T), ..., x^(0)) ln Q(x^(T−1), ..., x^(0))    (35)
    = −T Σ_{x^(T−1) ... x^(0)} P(x^(T−1), ..., x^(0)) ln Q(x^(T−1), ..., x^(0))    (36)

Since the summations run over the state space such that the normalization conditions of P and Q hold, according to Gibbs' inequality J′ and J′′ are both well-defined cross entropies, and their global minima occur when Q = P. Therefore, minimizing J leads to minimization of both J′ and J′′, at which point both J′ and J′′ are path entropies.
Table II shows the calculated transition rates for the conformational change in alanine dipeptide when running stochastic gradient optimization for 60 training epochs. The result is similar to what we obtained with 40 epochs.
VIII. TRAJECTORIES
In this section we provide representative trajectories obtained for the different systems described in the main text. For each system, in order to make the learning process more stable, we removed ephemeral/spurious transitions by smoothening before feeding the data into the LSTM model.
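One simple way to implement such smoothening is a minimum-dwell filter on the discretized state labels, in which any visit shorter than a cutoff is reassigned to the previously committed state. The sketch below shows one such filter; the cutoff of 20 frames is an illustrative choice, not the value used in our preprocessing.

```python
import numpy as np

def remove_ephemeral_visits(labels, min_dwell=20):
    """Reassign any visit shorter than min_dwell frames to the last committed state,
    removing spurious recrossings before the sequence is fed to the LSTM."""
    labels = np.asarray(labels).copy()
    boundaries = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], boundaries, [len(labels)]))
    committed = labels[0]
    for i in range(len(starts) - 1):
        a, b = starts[i], starts[i + 1]
        if b - a >= min_dwell:
            committed = labels[a]      # long enough: this state becomes the committed one
        else:
            labels[a:b] = committed    # too short-lived: overwrite with the committed state
    return labels
```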
Rate: Alanine dipeptide
CVs      Label    C_eq to C_ax (ps)    C_ax to C_eq (ps)
sin φ    actual   5017.33 ± ...        ...
sin φ    LSTM     ...                  ...
sin ψ    actual   5169.94 ± ...        ...
sin ψ    LSTM     ...                  ...
Results with stochastic gradient optimization run for 60 training epochs.
FIG. 7:
Trajectories for the linear 3-state model potential. (a) The actual trajectory, with the locations of the metastable states shown by horizontal solid and dashed lines. (b) The trajectory after spatial discretization, where the trajectory now consists of a sequence of labels representing metastable states. To make the learning process more stable, we removed ephemeral/spurious transitions by smoothening before feeding into the LSTM model. (c) The trajectory generated by our LSTM model.
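The LSTM-generated trajectories in panel (c) of Figs. 7-9 can be produced by sampling the trained model autoregressively, one label at a time. The sketch below assumes a trained tf.keras model of the kind outlined in Sec. II and re-feeds a sliding window of recent labels at every step; this is the simplest, though not the most efficient, way to generate such a trajectory.

```python
import numpy as np

def generate_trajectory(model, seed_labels, n_steps, window=100, rng=None):
    """Autoregressively sample a synthetic label sequence from a trained model:
    feed the most recent `window` labels, read off the softmax y_hat at the last
    step, and draw the next label from that categorical distribution."""
    rng = rng or np.random.default_rng()
    traj = [int(l) for l in seed_labels]
    for _ in range(n_steps):
        context = np.array(traj[-window:])[None, :]                   # shape (1, <= window)
        probs = model.predict(context, verbose=0)[0, -1].astype(np.float64)
        probs /= probs.sum()                                          # guard against float32 round-off
        traj.append(int(rng.choice(len(probs), p=probs)))
    return np.array(traj)

# synthetic = generate_trajectory(model, seed_labels=labels[:100], n_steps=10000)
```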
FIG. 8:
Trajectories for the triangular 3-state model potential. (a) The actual trajectory, with the locations of the metastable states shown by horizontal solid and dashed lines. (b) The trajectory after spatial discretization, where the trajectory now consists of a sequence of labels representing metastable states. To make the learning process more stable, we removed ephemeral/spurious transitions by smoothening before feeding into the LSTM model. (c) The trajectory generated by our LSTM model.
FIG. 9:
Trajectories for the 4-state model potential. (a) The actual trajectory, with the locations of the metastable states shown by horizontal solid and dashed lines. (b) The trajectory after spatial discretization, where the trajectory now consists of a sequence of labels representing metastable states. To make the learning process more stable, we removed ephemeral/spurious transitions by smoothening before feeding into the LSTM model. (c) The trajectory generated by our LSTM model.
FIG. 10:
Trajectories along sin φ for alanine dipeptide.