[PDF] Deciphering the Protein Motion of S1 Subunit in SARS-CoV-2 Spike Glycoprotein Through Integrated Computational Methods

Abstract

The novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major worldwide public health emergency that has infected over 1.5 million people. The partially open state of S1 subunit in spike glycoprotein is considered vital for its infection with host cell and is represented as a key target for neutralizing antibodies. However, the mechanism elucidating the transition from the closed state to the partially open state still remains unclear. Here, we applied a combination of Markov state model, transition path theory and random forest to analyze the S1 motion. Our results explored a promising complete conformational movement of receptor-binding domain, from buried, partially open, to detached states. We also numerically confirmed the transition probability between those states. Based on the asymmetry in both the dynamics behavior and backbone C α importance, we further suggested a relation between chains in the trimer spike protein, which may help in the vaccine design and antibody neutralization.

Full PDF

DDeciphering the Protein Motion of S1 Subunit inSARS-CoV-2 Spike Glycoprotein ThroughIntegrated Computational Methods

Hao Tian and Peng Tao ∗ Department of Chemistry, Center for Scientiﬁc Computation, Center for Drug Discovery,Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas, United Statesof America

E-mail: [email protected]

Abstract

The novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a ma-jor worldwide public health emergency that has infected over 1 . α importance, we further suggested a relation betweenchains in the trimer spike protein, which may help in the vaccine design and antibodyneutralization. a r X i v : . [ q - b i o . B M ] A p r Introduction

As of April 9 2020, there has been over 1 . ,

000 cases of death, according to theWorld Heath Organization (WHO). SARS-CoV-2 is related to bats-derived coronavirusesand the SARS-CoV, reported in the Guangdong province of China in 2002, and identiﬁed asa new member of betacoronavirus.

Due to its fast spreading though human contacts, theWHO declared it as a Public Health Emergency of International Concern (PHEIC).SARS-CoV-2 is known to infect human through the interaction between its spike (S)protein and human host receptors.

The S protein is a trimer (chain A, B and C) andeach chain is formed by S1 and S2 subunits that are related with host receptors binding andmembranes fusion, respectively.

The S1 subunit consists of a N-terminal domain (NTD),receptor-binding domain (RBD) and two subdomains (SD1 and SD2). It is reported thatRBD undergoes a conformational change from stable close state to dynamically-less-favorablepartially open state in chain A.

In the closed state, the determinants of receptor bindingare buried and inaccessible to receptors, while in the partially open state are exposed and isexpected to be necessary for the interaction with host cells.

In the cases of SARS-CoV-2and SARS-CoV, S glycoprotein is found to inherently sample closed and open states and thisbehavior is suggested to exist in the most pathogenic coronaviruses.

While the partiallyopen state plays an important role in human cell infection, little study is done illustratingthis protein motion on residue level, which could potentailly facilitate the prevention andtreatment.Molecular dynamics (MD) simulations can provide atomic scale information and arewidely used in sampling protein movement and structure landscape. Two kinds of tra-jectories from closed and open states of SARS-CoV-2 S protein initiating from close state(PDB 6VXX) and partially open state (PDB 6VYB) are available from D E Shaw Re-search . However, the timescale (10 microseconds) is still relatively trivial compared withthe timescale of biological processes in the real world. To gain more information from the2esult of MD simulation, Markov state model (MSM) is applied to catch long-time kineticinformation given time-limited simulation trajectories. Another advantage of MSM isthat it can divide a large number of protein structures from simulation into subspaces basedon the extracted kinetic information, and diﬀerences among those spaces can be calculatedand used for comparison.Machine learning techniques have achieved great accomplishments in chemistry and bi-ology including material discovery, structure representation and computation accelera-tion . The great contributions from machine learning mainly come from its ability to dealwith large scale data and its accurate and explainable models, which provide a new op-portunity to decipher protein. In this study, tree-based machine learning models were usedto identify important residues. Speciﬁcally, random forest model was applied as a classiﬁca-tion model to classify diﬀerent structures and calculate the contribution of each residue andstructure importance for the close-open transition process.The transition from close state to partially open state of S1 subunits of SARS-CoV-2S protein is investigated in this research though a series of Markov state model, transitionpath theory and random forest. Our results numerically conﬁrmed the close-open transitionprobability through estimated transition matrix, showed a complete transition path fromclose to open state and identiﬁed a relationship between the motion of chain A and twoother chains. The root-mean-square deviation (RMSD) is used to measure the overall conformationalchange of each frame with regard to a reference structure in a MD simulation trajectory.For a molecular structure represented by Cartesian coordinate vector r i ( i = 1 to N ) of N atoms, the RMSD is calculated as following: 3MSD = (cid:115) (cid:80) Ni =1 ( r i − U r i ) N (1) r i represents the Cartesian coordinate vector of the i th atom in the reference structure.The transformation matrix U is deﬁned as the best ﬁt alignment between the protein struc-tures along trajectories with respect to the reference structure.The root-mean-square ﬂuctuation (RMSF) is used to measure the ﬂuctuation of eachatom in each frame with regard to a reference structure in a MD simulation trajectory.RMSF i = (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) j =1 ( v ji − ¯ v i ) (2)Where T is the total frames and r i is the average position of atom i in the given trajectory. Distances between pairs of backbone C α were chosen as representative features for one struc-ture. Speciﬁcally, each distance pair of one C α carbon and all other C α carbon in all aminoacids was calculated. A protein contact map technique is normally built by transformingeach pair-wised distance value into 1 if that feature value is below a threshold or 0 oth-erwise . However, this feature preprocessing technique risks to ignore potentially usefulspacial information forcing a boolean value on the features. Therefore, inspired by ReLUactivation function in neural network, which the equation is shown below, we proposeda revised feature transformation method by transforming each feature value into 0 if thatfeature is above a threshold and keep it the same value otherwise. Compared with referencefeature transformation rule, our proposed technique can still build a protein contact mapwhile can diﬀerentiate local features with the least local information loss. f ( z ) = max (0 , z ) (3)4 .3 Random Forest Model Random forest is a machine learning technique that can be used for classiﬁcation.

Arandom forest is composed of several decision trees, which are trained based on giventraining data. The ﬁnal classiﬁcation output of a random forest model is a collection ofclasses predicted by each decision tree model. The random forest algorithm carried out inthis study is implemented in scikit-learn program package version 0.20.1. The number ofdecision trees used was 50. One advantage of random forest model over decision tree modelis that employing multiple decision tree models mitigates the overﬁtting problem suﬀered bysingle decision tree model. In a random forest model, a quantitative evaluation of the importance for each feature usedfor training is calculated through training process. This feature importance is calculatedusing Gini impurity: Gini impurity = C (cid:88) i =1 − f i (1 − f i ) (4)where f i is the frequency of a label at a node of being picked to split the data set and C isthe total number of unique labels. A random forest model is a collection of several decisiontree models. The importance of node i in decision tree is calculated as: n i = w i C i − m (cid:88) i w m ( i ) C m ( i ) (5)where w j is the weighted number of samples reading node i , C i is the impurity value ofnode i and m is the number of child nodes. The feature importance of feature i is calculatedas 5 i = (cid:80) sl n j (cid:80) k ∈ all nodes n k (6)where s is the times of node j split on feature i . Feature importance within a decisiontree is further normalized by: norm f i = f i (cid:80) j ∈ all features f j (7)The feature importance in random forest is the averaged importance of feature i in alldecision tree models: F i = (cid:80) j ∈ all decision trees norm f i N (8)where norm f i is the normalized feature importance of one single decision tree and N isthe number of decision trees. Feature importance of all pairwise C α distances were calculated using the above methods.The feature importance of an amino acid is the summation of importances of features thatare related with that amino acid. The relative accumulated feature importance of each aminoacid shows the distinguishability and contribution of that amino acid among all amino acidsin diﬀerentiating states. Markov state model (MSM) is used to construct long-timescale dynamics behavior . Mini-Batch k-means clustering was used to classify each simulation frame to microstates. Perron-cluster cluster analysis (PCCA) was applied to cluster microstates into macrostates. Macrostatesare considered as equilibrium or steady states. Transition matrix and transition probabil-ity were calculated to quantitatively show the transition dynamics between macrostates. Aspeciﬁc time interval, referred to as lag time, needs to be determined to construct transitionmatrix. The value of the lag time, as well as the number of macrostates, is selected based6n the result of estimated relaxation timescale. MSMBuilder version 3.8.0 was employedto build markov state models in this study. Transition path theory (TPT) is used to calculate the probability of transitioning fromone state to another within the framework of a MSM. In the current study, the most repre-sentative structure of closed state was chosen among the 4 macrostates in the closed stateand deﬁned as macrostate 2 due to the reason that it was the most probable macrostateto transfer to other open macrostates. The open state was chosen based on the stability ofeach open macrostates and macrostate 8 was chosen as representative open state. All otherstates are treated as intermediate states. Possible transition pathways from the closed to theopen state were explored. The committor probability q + i is deﬁned as the probability fromstate i to reach the target state rather than initial state. Based on deﬁnition, q + i = 0 for allmicrostates in initial state and q + i = 1 for all microstates in target state. The committorprobability for all other microstates are calculated by the following equation: − q + i + (cid:88) k ∈ I T ik q + k = − (cid:88) k ∈ target state T ik (9)where T ik is the transition probability from state i to state k . Sequentially, the eﬀectiveﬂux is calculated as: f ij = π i q − i T ij q + j (10)where π i is the equilibrium probability of normalized transition matrix T π and q − i is thebackward-committor probability calculated as q − i = 1 − q + i . However, backward ﬂux f ji should also be considered and subtracted when calculating net ﬂux. Therefore, the net ﬂux f + ij = max (0 , f ij − f ji ). Total ﬂux can then be calculated as:7 igure 1: Simulations of SARS-CoV-2 S glycoprotein: (A) RMSD of the trajectories in theclose (red) and open (blue) states. (B) RMSF of simulation trajectories in the close (red)and open (blue) states. The protein is divided into three chains separated by grey dashedlines. Atom number was counted from 0. F = (cid:88) i ∈ initial state (cid:88) j / ∈ initial state π i T ij q + j (11)The ﬂux from initial state to target state can be decomposed to individual pathways p i . Dijkstra algorithm is implemented in MSMBuilder for pathway decomposition. A set ofpathways p i can be generated along f i , which provides a relative probability by: p i = f i (cid:80) j f j (12) Simulation trajectory analysis shows dynamical activity

Two 10 microseconds simulation trajectories of the trimeric SARS-CoV-2 S glycoproteinwere treated as reference and backbone alpha carbons (C α ) of the trimer were chosen andextracted as representative features of structures.8o probe the dynamical stability of two structures, the time evolution of the RMSD wereplotted in Figure 1 (A). All RMSD values were calculated with reference to the ﬁrst frameof each trajectory. The average RMSD values in two states are 5 . A and 10 . A , respectively.The plot suggested that the close state is relatively stable while the partially open state isdynamically active and undergoes signiﬁcant conformational changes after 1 microsecond.However, the simulation of open state after 6 microseconds suggested a convergence in theRMSD value and a relatively stable structure, which corresponds to the detached S1 subunitfrom S2 fusion machinery.RMSF results were plotted in Figure 1 (B). The asymmetry in protein motion was noticedby comparing the individual dynamics behavior among three chains. Corresponding to theRMSD result in chain A, the RMSF result showed a similar high-degree conformationalchange in the RBD domain. However, through the detachment in chain A, chain B andC showed diﬀerent movements that both NTD and RBD in chain B are more dynamicallyactive than those in chain C. while in closed state the chains B and C displayed similardynamics. Markov state model and transition path theory elucidates the close-open transition

Simulation trajectories were projected onto a two-dimensional (2D) plot in RMSD of C α atoms with reference to the close and open state structures, respectively (Figure 2A). Toapply MSM analysis, MiniBatch k-means clustering was applied to divide the simulationsampling space into 300 microstates based on the reduced-dimension plot, shown in S1. Theestimated relaxation timescale was plotted in Figure 2 B. The trend of implied relaxationtimescale showed that the estimated timescale is converged after 15 ns, which was chosen asthe lag time for markov state model. The number of macrostates were determined based onthe band gap in estimated relaxation timescale plot.Total of 8 macrostates were chosen to divide simulations into kinetically separate macrospaces.9 igure 2: Distribution of SARS-CoV-2 S glycoprotein simulations: (A) 2D protein confor-mation space using RMSD values with references both closed and partially open states. (B)Implied relaxation timescale of top 20 variables based on the data coordinates in the reducedmap regarding with the diﬀerent lag time as interval.

Table 1: The probability of top 5 channels.

Channels Probability2,3,5,6,8 23.7%2,3,4,7,5,6,8 15.8%2,5,6,8 11.0%2,3,7,5,6,8 9.6%2,3,4,7,8 8.0%Top 10 channels 88.7%PCCA was applied to map microstates onto macrostates based on the eigenfunction structureof transition probability matrix. The resulting macrostates with transition probability areshown in Figure 3 (A). Close state and open state were equally divided into 4 macrostates, asstates 2, 3, 4, 7 belonging to the close state and states 1, 5, 6, 8 belonging to the open state.The close state is stable with 95 .

5% probability to stay within close macrostates. Macrostate2 was found important due to its high probabiltity of 9 .

9% to transfer from itself to openmacrostates. The average transition from closed macrostates to open macrostates is 4 . igure 3: Markov state model based on the simulation: (A) Macrostates and estimatedtransition probabilities among them; (B) Representative structures of macrostate 2 (blue),3 (red), 5 (green), 6 (orange) and 8 (yellow).pathway connecting these two states. Total of 2 ,

317 pathways were generated and dividedas 51 distinct channels ﬂoating from the State 2 to the State 8. The probability of eachchannel was calculated based on net ﬂux from initial state to the target state. Overall, theprobability of top 5 channels was listed in the Table 1, with the contribution from the top 10channels accounting for 88 .

7% of total population. The most probable path was from State2 → State 3 → State 5 → State 6 → State 8, and the representative structures were plottedin Figure 3B to show a series of transition processes.

Random forest identiﬁes important residues and structures

To better understand the shift between close and open states in S1 subunit, the pair wised C α distances of S1 were extracted as features representing the character of protein conﬁgurations.There are 540 residues on each chain, residue ID from 27 to 676, and total of 1620 ∗ / , ,

390 C α distances were collected as features. Before further analysis, features weretransformed into contact map with our proposed feature transformation technique describedin Methods section. Considering the non-bonded chemical interactions length, we pick 10 . A igure 4: Random forest classiﬁcation model using pair-wised C α distances in S1 subunit:(A) Classiﬁcation accuracy using diﬀerent depths of trees. Depth 7 (shown in grey dashedline) was chosen with 92 .

18% accuracy; (B) Classiﬁcation accuracy regarding diﬀerent depthsof trees using pair-wised C α distances within chain B and C. Depth 8 (shown in grey dashedline) was chosen with 88 .

04% accuracy.as threshold for feature transformation.Supervised machine learning model was applied to extract the key residues that arevital during allosteric process and study the structural diﬀerences among macrostates. Foreach simulation trajectory, frames were saved for every 1 . ,

334 frames. Therefore, 16 ,

668 samples with 1 , ,

390 features were extracted from thetrajectories. Each sample was labeled based on the above macrostate result. Full datasetwas further split into training set(70%) and testing set(30%). After the preparation ofdata, random forest model was applied to distinguish the intrinsic conformational diﬀerencesamong macrostates. Training scores and testing scores were plotted in Figure 4A. 7 waschosen for the depth with corresponding testing accuracy of 92 . α distances. To further investigate the relationship between chainA and two other chains, the original C α distances related with chain A was excluded asreduced features and applied to another random forest model. Training and testing results12 igure 5: (A) Residue sequence with tertiary structure in S1 subunit, referenced to thereleased structure in the prefusion conformation . NTD (blue), N-terminal domain; RBD(green), receptor-binding domain; SD1 and SD2 (orange), subdomains. The sequence startswith residue Ace 26 to Thr 676. (B-C) The position of top 20 important residues are shownin red color. Accumulated tertiary structure importance in chain B and chain C are shownin numbers, respectively. Table 2: Top 5 C α distances. C α distances ImportanceChain C Phe 342, Chain C Asp 442 0.86%Chain C Ala 419, Chain C Tyr 423 0.83%Chain B Thr 323, Chain B Thr 333 0.76%Chain C Cys 136, Chain C Gly 142 0.71%Chain B Leu 390, Chain B Gly 545 0.60%are shown in Figure 4B. The top 500 features accounted for 74 .

8% percent of the overallfeature importance, shown in Figure S2. The testing accuracy with reduced features reached88 .

04% at depth 8.The top ﬁve important C α distances were listed in Table 2. In order to identify keyresidues along the transition from the closed to the partially open state, the feature im-portance of each C α distance was added and accumulated to the two related residues. S1subunit structure was plotted in Figure 5A as reference. Top 20 important residues on chainB and C, with corresponding structure and accumulated structure importance under eachﬁgure, were plotted in Figure 5 (B-C). Full results of residue importance on chain B and Care shown in S3. 13 Discussion

The signiﬁcance of the partially open state of receptor-binding domain (RBD) in SARS-CoV-2 for interacting with the host cell receptor has been extensively studied.

The opening ofS1 subunit, thus exposing RBD, is necessary for engaging with ACE2 and following cleavageof S (cid:48) site. While the RBD exhibits inherently ﬂexibility enabling itself recognized by thereceptor , the motion of this close to open state shift still needs indepth study.It is reported that the SARS-CoV-2 S trimer shows a C Through the RMSF result, we noticed the asym-metry in dynamics at both close and open states. The S1 subunit in chain B and the S2subunit in chain C are more dynamically active than the corresponding structures in theclose state, respectively. The S1 domain in chain B is also more ﬂexible than that in chainC in the open state. This implied the asymmetrical biological functions among the threechains.This implication is supported by the random forest model results. Random forest modelreached high accuracy in predicting macrostates based on the pair-wised C α distances on S1subunit. The reason for the good performance in macrostates prediction is that the inputfeature contains the motion of chain A, which is suﬃcient to identify the main conformationalchange from the close to the open state. There is only a small drop of 4 .

14% in the predictionaccuracy with reduced features, which does not include the motion of chain A. Combinedwith the ﬁnding of asymmetric dynamics in RMSF result, we hypothesized that there is acorrelation between the chain A and two other chains, causing the asymmetrical dynamicsbehavior. Moreover, Moreover, in order to focus on the tertiary structure, the importanceof C α distances was accumulated to S1 domain structure and we numerically identiﬁed keystructure correlated with RBD in chain B, NTD in chain C, RBD in chain B and NTDin chain B in descending order. This relationship may come from the chain B and C’scontribution to the protein motion in chain A. This could also originate from the protein-protein interaction along the opening movement of chain A. Further investigation of this14utual inﬂuence is warranted for a detailed clariﬁcation.Markov state model showed a great diﬀerence in transition probability in macrostate2 (78 . . . .

7% probability. This channel isconsidered important in representing the transient shifting and can be treated as the typicalprotein movement. A rational design aﬀecting the close-to-open transition can be developedthrough this key channel along with the important residue pairs and key tertiary structuredomains.

The spike protein is essential for SARS-CoV-2 as it destabilizes the trimer structure, causingthe detachment of S1 subunit and exposing the RBD domain to host cell membrane. Inthis study, we used publicly available simulation trajectories of S protein and studied theasymmetric dynamics nature of the trimer structure. Markov state model was applied todivide the conformational space into 8 macrostates. The representative structures along themost probable channel in transition path theory are shown to present a clear route fromthe close state to the partially open state. Transition matrix was calculated to determinethe probability of close-open transitions, with maximum of 9 .

9% from the close macrostate2 to the open macrostates. The little diﬀerence between prediction accuracy results in tworandom forest models, where one includes the movement of chain A and the other does not,15mplied a relationship between chain A and two other chains. Yet, whether this relationoriginates from the mutual inﬂuence among chains or the intrinsic asymmetry in biologicalfunctions needs further investigation. Overall, our study quantitatively analyzed the motionof S1 subunit in S protein with important C α distances and residues. Possible preventionand cure searching progress could be facilitated combining with these vital ﬁndings, whichprovide new opportunities for potential drug research and vaccine development. Acknowledgement

Computational time was generously provided by Southern Methodist University’s Center forScientiﬁc Computation. The authors thank D. E. Shaw for sharing the SARS-CoV-2 spikeglycoprotein trajectories. 16 eferences (1) Wu, F.; Zhao, S.; Yu, B.; Chen, Y.-M.; Wang, W.; Song, Z.-G.; Hu, Y.; Tao, Z.-W.;Tian, J.-H.; Pei, Y.-Y., et al. A new coronavirus associated with human respiratorydisease in China.

Nature , , 265–269.(2) Lu, R.; Zhao, X.; Li, J.; Niu, P.; Yang, B.; Wu, H.; Wang, W.; Song, H.; Huang, B.;Zhu, N., et al. Genomic characterisation and epidemiology of 2019 novel coronavirus:implications for virus origins and receptor binding. The Lancet , , 565–574.(3) Cavanagh, D. The coronaviridae ; Springer, 1995; pp 73–113.(4) Lu, G.; Wang, Q.; Gao, G. F. Bat-to-human: spike features determining ’host jump’ ofcoronaviruses SARS-CoV, MERS-CoV, and beyond.

Trends in microbiology , ,468–478.(5) Wang, Q.; Wong, G.; Lu, G.; Yan, J.; Gao, G. F. MERS-CoV spike protein: Targetsfor vaccines and therapeutics. Antiviral research , , 165–177.(6) Walls, A. C.; Park, Y.-J.; Tortorici, M. A.; Wall, A.; McGuire, A. T.; Veesler, D.Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell ,(7) Li, F. Structure, function, and evolution of coronavirus spike proteins.

Annual reviewof virology , , 237–261.(8) Li, F. Receptor recognition mechanisms of coronaviruses: a decade of structural studies. Journal of virology , , 1954–1964.(9) Wrapp, D.; Wang, N.; Corbett, K. S.; Goldsmith, J. A.; Hsieh, C.-L.; Abiona, O.;Graham, B. S.; McLellan, J. S. Cryo-EM structure of the 2019-nCoV spike in theprefusion conformation. Science , , 1260–1263.1710) Bosch, B. J.; van der Zee, R.; de Haan, C. A.; Rottier, P. J. The coronavirus spikeprotein is a class I virus fusion protein: structural and functional characterization ofthe fusion core complex. Journal of virology , , 8801–8811.(11) Gui, M.; Song, W.; Zhou, H.; Xu, J.; Chen, S.; Xiang, Y.; Wang, X. Cryo-electronmicroscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite con-formational state for receptor binding. Cell research , , 119–129.(12) Pallesen, J.; Wang, N.; Corbett, K. S.; Wrapp, D.; Kirchdoerfer, R. N.; Turner, H. L.;Cottrell, C. A.; Becker, M. M.; Wang, L.; Shi, W., et al. Immunogenicity and structuresof a rationally designed prefusion MERS-CoV spike antigen. Proceedings of the NationalAcademy of Sciences , , E7348–E7357.(13) Shang, J.; Zheng, Y.; Yang, Y.; Liu, C.; Geng, Q.; Tai, W.; Du, L.; Zhou, Y.; Zhang, W.;Li, F. Cryo-electron microscopy structure of porcine deltacoronavirus spike protein inthe prefusion state. Journal of virology , , e01556–17.(14) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.;Sch¨utte, C.; No´e, F. Markov models of molecular kinetics: Generation and validation. The Journal of chemical physics , , 174105.(15) ”Molecular Dynamics Simulations Related to SARS-CoV-2,” D. E. Shaw ResearchTechnical Data, 2020. ,Accessed: 2020-04-05.(16) Su´arez, E.; Adelman, J. L.; Zuckerman, D. M. Accurate estimation of protein foldingand unfolding times: beyond Markov state models. Journal of chemical theory andcomputation , , 3473–3481.(17) Adelman, J. L.; Ghezzi, C.; Bisignano, P.; Loo, D. D.; Choe, S.; Abramson, J.; Rosen-berg, J. M.; Wright, E. M.; Grabe, M. Stochastic steps in secondary active sugar trans-port. Proceedings of the National Academy of Sciences , , E3960–E3966.1818) Raccuglia, P.; Elbert, K. C.; Adler, P. D.; Falk, C.; Wenny, M. B.; Mollo, A.; Zeller, M.;Friedler, S. A.; Schrier, J.; Norquist, A. J. Machine-learning-assisted materials discoveryusing failed experiments. Nature , , 73–76.(19) Faber, F.; Lindmaa, A.; von Lilienfeld, O. A.; Armiento, R. Crystal structure repre-sentations for machine learning models of formation energies. International Journal ofQuantum Chemistry , , 1094–1101.(20) Botu, V.; Ramprasad, R. Adaptive machine learning framework to accelerate ab initiomolecular dynamics. International Journal of Quantum Chemistry , , 1074–1083.(21) Kotsiantis, S. B.; Zaharakis, I.; Pintelas, P. Supervised machine learning: A reviewof classiﬁcation techniques. Emerging artiﬁcial intelligence applications in computerengineering , , 3–24.(22) Jing, Y.; Bian, Y.; Hu, Z.; Wang, L.; Xie, X.-Q. S. Deep learning for drug design:an artiﬁcial intelligence paradigm for drug discovery in the big data era. The AAPSjournal , , 58.(23) Doerr, S.; Ariz-Extreme, I.; Harvey, M. J.; De Fabritiis, G. Dimensionality reductionmethods for molecular simulations. arXiv preprint arXiv:1710.10629 ,(24) Nair, V.; Hinton, G. E. Rectiﬁed linear units improve restricted boltzmann machines.Proceedings of the 27th international conference on machine learning (ICML-10). 2010;pp 807–814.(25) Liaw, A.; Wiener, M., et al. Classiﬁcation and regression by randomForest. R news , , 18–22.(26) Wang, F.; Shen, L.; Zhou, H.; Wang, S.; Wang, X.; Tao, P. Machine Learning Clas-19iﬁcation Model for Functional Binding Modes of TEM-1 β -Lactamase. Frontiers inmolecular biosciences , , 47.(27) Utgoﬀ, P. E. Incremental induction of decision trees. Machine learning , , 161–186.(28) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blon-del, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V., et al. Scikit-learn: Machine learningin Python. Journal of machine learning research , , 2825–2830.(29) Breiman, L. Random forests. Machine learning , , 5–32.(30) Wang, F.; Zhou, H.; Wang, X.; Tao, P. Dynamical Behavior of β -Lactamases andPenicillin-Binding Proteins in Diﬀerent Functional States and Its Potential Role inEvolution. Entropy , , 1130.(31) Deuﬂhard, P.; Weber, M. Robust Perron cluster analysis in conformation dynamics. Linear algebra and its applications , , 161–184.(32) Bowman, G. R.; Huang, X.; Pande, V. S. Using generalized ensemble simulations andMarkov state models to identify conformational states. Methods , , 197–201.(33) Harrigan, M. P.; Sultan, M. M.; Hern´andez, C. X.; Husic, B. E.; Eastman, P.;Schwantes, C. R.; Beauchamp, K. A.; McGibbon, R. T.; Pande, V. S. MSMBuilder:statistical models for biomolecular dynamics. Biophysical journal , , 10–15.(34) No´e, F.; Sch¨utte, C.; Vanden-Eijnden, E.; Reich, L.; Weikl, T. R. Constructing theequilibrium ensemble of folding pathways from short oﬀ-equilibrium simulations. Pro-ceedings of the National Academy of Sciences , , 19011–19016.(35) Metzner, P.; Sch¨utte, C.; Vanden-Eijnden, E. Transition path theory for Markov jumpprocesses. Multiscale Modeling & Simulation , , 1192–1219.2036) Walls, A. C.; Xiong, X.; Park, Y.-J.; Tortorici, M. A.; Snijder, J.; Quispe, J.;Cameroni, E.; Gopal, R.; Dai, M.; Lanzavecchia, A., et al. Unexpected receptor func-tional mimicry elucidates activation of coronavirus fusion. Cell , , 1026–1039.(37) Yuan, Y.; Cao, D.; Zhang, Y.; Ma, J.; Qi, J.; Wang, Q.; Lu, G.; Wu, Y.; Yan, J.;Shi, Y., et al. Cryo-EM structures of MERS-CoV and SARS-CoV spike glycoproteinsreveal the dynamic receptor binding domains. Nature communications , , 15092.(38) Kirchdoerfer, R. N.; Wang, N.; Pallesen, J.; Wrapp, D.; Turner, H. L.; Cottrell, C. A.;Corbett, K. S.; Graham, B. S.; McLellan, J. S.; Ward, A. B. Stabilized coronavirusspikes are resistant to conformational changes induced by receptor recognition or pro-teolysis. Scientiﬁc reports , , 1–11.(39) Kirchdoerfer, R. N.; Cottrell, C. A.; Wang, N.; Pallesen, J.; Yassine, H. M.;Turner, H. L.; Corbett, K. S.; Graham, B. S.; McLellan, J. S.; Ward, A. B. Pre-fusionstructure of a human coronavirus spike protein. Nature , , 118–121.21 upporting Information Available Figure S1:

MiniBatch k-means clustering results of 300 microstates shown in red dots.22 igure S2:

Accumulated feature importance of top 500 C α distances.23 igure S3:igure S3: