[PDF] Automatic Feature Selection in Markov State Models Using Genetic Algorithm

Abstract

Markov State Models (MSMs) are a powerful framework to reproduce the long-time conformational dynamics of biomolecules using a set of short Molecular Dynamics (MD) simulations. However, precise kinetics predictions of MSMs heavily rely on the features selected to describe the system. Despite the importance of feature selection for large system, determining an optimal set of features remains a difficult unsolved problem. Here, we introduce an automatic approach to optimize feature selection based on genetic algorithms (GA), which adaptively evolves the most fitted solution according to natural selection laws. The power of the GA-based method is illustrated on long atomistic folding simulations of four proteins, varying in length from 28 to 80 residues. Due to the diversity of tested proteins, we expect that our method will be extensible to other proteins and drive MSM building to a more objective protocol.

Full PDF

AAutomatic Feature Selection in Markov State Models UsingGenetic Algorithm

Qihua Chen , a , Jiangyan Feng , b , Shriyaa Mittal c and Diwakar Shukla ∗ , b , c , d , a Department of Material Science and Engineering, b Department of Chemical and Biomolecular Engineering, c Center forBiophysics and Quantitative Biology, d National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. Qihua Chen and Jiangyan Feng contributed equally to this work. ∗ Correspondence and requests for materials should beaddressed to D.S. (email: [email protected])

Markov State Models (MSMs) are a powerful framework to reproduce thelong-time conformational dynamics of biomolecules using a set of shortMolecular Dynamics (MD) simulations. However, precise kinetics predictionsof MSMs heavily rely on the features selected to describe the system. Despitethe importance of feature selection for large system, determining an optimalset of features remains a difficult unsolved problem. Here, we introduce anautomatic approach to optimize feature selection based on genetic algorithms(GA), which adaptively evolves the most fitted solution according to naturalselection laws. The power of the GA-based method is illustrated on longatomistic folding simulations of four proteins, varying in length from 28to 80 residues. Due to the diversity of tested proteins, we expect that ourmethod will be extensible to other proteins and drive MSM building to amore objective protocol.Additional Key Words and Phrases: Genetic algorithm, feature selection,markov state model, molecular dynamics simulation, generalized matrixRayleigh quotient

Molecular Dynamics (MD) simulation, first introduced by Alderand Wainwright[1] in the late 1950’s, has evolved into a major tech-nique to study the detailed actions and mechanisms of proteins[2–6].Based on Newton’s equations of motion, MD simulations can de-scribe protein dynamics in unprecedented spatial and temporalresolution. However, one of the major challenges for MD simula-tions are the analysis of high dimensional data and the incompat-ibility between timescales accessible to MD simulation and thatare functionally relevant[7–11]. Markov State Models (MSMs)[12–14] have recently been used to address the aforementioned issuesby predicting protein dynamics at long timescales from a pool ofshort MD simulations. The MSM itself is a "transition probabilitymatrix"[15], describing mathematically the memoryless transitionsbetween metastable states. To construct a MSM, raw MD trajectoriesare first transformed from their Cartesian coordinates to features,such as dihedral angles[16, 17] or pairwise contact distances of a pro-tein. This step is often called "featurization". The dimensionality ofthese features may be further reduced through dimensionality reduc-tion step. One commonly used method is time-structure independentcomponents analysis (tICA), which creates linear combinations of

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. To copy otherwise, or republish, to post on servers or to redistributeto lists, requires prior specific permission and/or a fee. Copyright ©JOCSE, a supportedpublication of the Shodor Education Foundation Inc. input features by maximizing their decorrelation time[18–22]. Witha properly constructed MSM, useful thermodynamic and kineticproperties of the dynamic process can be extracted. Despite theattractive feature of MSMs, the thermodynamics and kinetics pre-dicted by MSMs are highly sensitive to which features are selected todiscretize the configuration space[5, 23, 24]. Ideally, features shouldbe chosen to capture the slowest motions of the protein, which areusually the most interesting or important processes. However, deter-mining an optimal set of features remains a considerable challengeespecially when a protein system is sufficiently complex. Here, weshow that machineCurrently, there are two major ways of selecting features in termsof "contact featurization", where pairwise contact distances of a pro-tein are used as features. One is using all pairwise contact distancesof a protein as features. In principle, no important information aboutthe system is missed out since all the contact distances are consid-ered. However, it is costly to calculate all distances even for a smallprotein. For a protein system with R residues, the total number ofdistances among each other will be R ( R -1)/2, which creates a heavyload of calculation on computers. In addition, irrelevant featuresthat do not contribute to the dynamics process may lead to the poorgeneralization performance of the model. Thus, using all availablefeatures may degrade the performance of the MSM both in speed(due to high dimensionality) and accuracy (due to irrelevant informa-tion). Alternatively, the most commonly used method is choosing asubset of contact distances based on human intuition. Consequently,the thermodynamics and kinetics extracted from MSMs can be bi-ased by the manually chosen features. In summary, either way isappropriate for the selection of features and a more convenient,accurate and automated method for feature selection is necessary.A variety of machine learning methods have been recently reportedfor dimensionality reduction and/or feature selection for moleculardynamics datasets[17]. However, the use of these ideas for automaticfeature selection in building MSMs has not been explored.Here, we present a genetic-algorithm based method to selectan optimal set of residue pair distances for contact featurization.Genetic algorithm (GA) is one of the advanced methods to helpwith dealing feature selection problems in data science. First pro-posed by John H. Holland[25, 26], GA is a heuristic and adaptivesimulation algorithm that evolves the most fitted solution to aproblem based on Darwinian natural selection laws. GA has beenbroadly applied to help with function optimization[27], protein fold-ing prediction[28, 29], multiple sequence alignment[30] and more a r X i v : . [ q - b i o . B M ] J un cientific investigations[31]. In nature, useful traits in genes tend tobe preserved in offspring for a higher survival probability. Like thereal cases in nature, better solutions to a problem can be derived byGA according to this principal. In our case, each gene represents thealpha carbon distance between a residue pair, and chromosomes arecombinations of residue pair distances. To seed the whole process,we randomly select one residue pair distance as the starting point ofthe GA. The adaptability of each chromosome (a set of residue pairdistances) is quantitatively expressed as fitness scores in GA. In thisstudy, we use generalized matrix Rayleigh quotient (GMRQ) scoreas the fitness score. GMRQ was recently introduced to quantita-tively evaluate MSMs based on its distance from a theoretical upperlimit[32–34]. The higher the GMRQ score is, the more prominentthe MSM is to capture the slow underlying dynamical motions whilea low GMRQ score indicates that the MSM is not able to reveal theslow dynamics of the system. Therefore, the goal of our methodis optimizing a set of residue pairs that gives the highest GMRQscore. The framework of our GA-based method is adapted fromthe "Optimal Probes" method proposed by Mittal and Shukla[35].In their study, an optimal choice of residue pairs, capturing theslow conformational dynamics, is successfully predicted for doubleelectron-election resonance spectroscopy, an experimental tech-nique capable of detecting conformational changes by monitoringthe distance between electron spins.In this method, we (1) perform contact featurization for eachset of residue pair alpha carbon distances, (2) use tICA to furtherreduce the dimensionality of the data, (3) construct MSMs based onthe reduced dimensionality, and (4) calculate GMRQ for each set ofresidue pair distances to evaluate the MSMs. Based on the GMRQscore, the combination of residue pair distances will be updated.The algorithm will then go back to step (1) to repeat the wholeprocess until reaching user specified number of iterations. In theend, the set of distances with the maximum GMRQ score is chosenas an optimal set of residues for the construction of the "best MSM".To evaluate of our method, we test the GA-based method on fourfolding proteins with the size ranging from 28 residues to 80 residues.Our experimental results show that the method yields comparableand even better accuracy compared with using all available features.To our knowledge, this is the first attempt to automatically selectproper MSM features for analysis. The GA-based method describedhere to larger proteins undergoing conformational changes can beextended. Molecular Dynamics (MD) Simulation Dataset.

MD simulationdatasets of the four folding proteins for analysis were generatedby Lindorff-Larsen et al [4]. The four proteins (BBA, Villin, WWdomain and λ -repressor) vary in length from 28 to 80 amino acids.More details of the simulations are summarized in Table 1. For theanalysis, we retain all the trajectory frames. Three small proteins(BBA, Villin and WW domain) are chosen to evaluate the proposedmethod and the best GMRQ achieved using all contact distancesserves as the benchmark. The 80-amino-acid λ -repressor is usedto test the feasibility of the method on large proteins, as using all Table 1.

Protein and trajectory information.Protein PDB Residues Total simulation time ( µ s)BBA 1FME 28 325Villin 2F4K 47 429WW domain 2F21 35 1137 λ -repressor 1LMB 80 643distances is impractical. Markov State Models (MSMs).

In this study, the goal is optimiz-ing a set of residue pair distances to build the best MSM basedGMRQ. MSMs are kinetic models that reveal the dynamics of asystem[3, 14, 15, 36, 37]. An MSM describes a network of metastableconformational states and reveals the probabilities of each stateperforming jumps from one to another over an appropriate time res-olution ( τ , also called lag time). The jumps are memoryless, whichmeans the probability to transit to the present state is not dependenton the previous ones. Such information is presented in a "transitionprobability matrix" by MSM, where an n × n square matrix depictsthe transitions among n states[15]. The probability of each jumpcan be expressed according to the equation below: p j ( t + τ ) = n (cid:213) i = p i ( t ) T ij ( τ ) (1)The equation can also be expressed in a matrix form: p T ( t + τ ) = p T ( t ) T ( τ ) (2)where p i ( t ) is a population vector whose elements show the proba-bility at time t , p j ( t + τ ) is a population vector after time τ , T ij ( τ ) is the probability to jump from state i to state j and T ( τ ) is thetransition probability matrix that T ( τ ) ∈ R n × n . Further details ofthe transition matrix can be found in literatures[13, 15].The transition probability matrix can be decomposed into eigen-functions and eigenvalues shown below: T ( τ ) ◦ ψ i = λ i ψ i (3)where ψ i is the eigenfunction and λ i are the real eigenvalues that λ i ≤

1, arranged in descending order.Here, each step of the MSM building process used in this studyis described in detail. All the hyperparameters (e.g. the number oftICA components, tICA lagtime, the number of clusters, the numberof MSM timescales and MSM lagtime) are shown in Table 2.(1) Featurization. To construct an MSM, the first step is to pro-cess the datasets that we plan to work on. In our case, we usethe MD simulation data sets listed in Table 1. The datasetsare given in the form of MD trajectories, which present seriesof motion of the protein atoms in a frame-wise arrangement.Because the simulated movements recorded in Cartesian co-ordinates are not ideal for analysis, and too much noise notrelevant to our study may be included, it is better to interpretthe data in other ways. As a result, a lot of reasonable metricssuch as dihedral angle[16] and contact distances betweenresidue pairs are used to featurize the data. The featurizationmethod we choose here is contact distance analysis. By usingsuch technique, more useful information can be extracted rom the redundant MD trajectories. Again, our goal of thisstudy is to optimize the choice of residue pairs for contact dis-tance calculation, so that an MSM with more information andless noise can be found by this method. The method outlinedin this study could be applied to any chosen set of featurescalculated using simulation data.(2) Dimensionality reduction. We further processed our featur-ized data by tICA so as to reduce the dimensionality of thedata. After featurization, the featurized data were projectedonto linear subspaces of the slowest dynamics. The compo-nents of tICA are termed time structure-based independentcomponents (tICs), which are linear combination of the in-put features (a set of contact distances in our case). Top tICscapture the slowest motion captured by tICA and usuallyrepresent the most interesting dynamics[18–22].(3) Clustering. We perform mini-batch k-means clustering onthe processed data. Clustering refers to the coarse graininganalysis that groups certain datasets based on their similar-ities, so that macrostates can be formed to be better under-stood. Commonly used clustering algorithms, such as mini-batch k-means[15, 33, 38], mini-batch k-medoids[39, 40] andk-centers[21, 41], have shown similar performance when thedata is preprocessed with tICA[17–20].(4) MSM construction. After the clustering, a MSM can be builtbased on the processed datasets. The process was imple-mented in a Python environment and the software involved toproduce the analysis above include Numpy[42], MDTraj[24]and MSMBuilder3.8[23]. Generalized Rayleigh Quotient (GMRQ).

In short, an ideal MSMshould successfully identify the slowest dynamics of the protein. Be-cause the state decomposition mentioned above reveals the dynam-ical processes in the system, the identification of true eigenfunctionand eigenvalues become the major problem for scientists to solve.A more quantitative method is needed to help evaluate and find thetrue state decomposition, which is directly related to the choice ofmetrics in the featurization stage.To help solve this problem, GMRQ was introduced as a quanti-tative way of evaluating the quality of an MSM[33, 33, 34]. GMRQis derived from the variational principle that adds up the first m eigenvalues, which denote the slowest m dynamical processes in thesystem. The variational principles set an upper boundary[32–34]for the total sum of real eigenvalues shown below: GMRQ ≡ m (cid:213) i = ˆ λ i ≤ m (cid:213) i = λ i (4)where the ˆ λ i is the estimated eigenvalue and the λ i is the realeigenvalue. In this study, as we try to maximize GMRQ score toapproach the upper boundary, the larger the GMRQ score we get,the closer we are to the slowest dynamics of the protein.To help avoid overfitting, cross-validation must be applied toevaluate our GMRQ scores. The dataset from the MD simulation issplit into a training set and a test set. The training set is first usedto estimate the model parameters such as the eigenvalues, then theestimated model is applied to score its performance in the test set.In this way, the model will not be biased by overfitting the data onto the model. The process of deriving GMRQ scores is achievedby Osprey package[43] and the recruited parameters are shown inTable 2. Mean GMRQ of five cross-validation iterations are used forthe analysis. Fig. 1.

The flow chart showing the whole process of our GA-basedmethod.

Genetic-algorithm-based Method for automatic feature se-lection in Markov State Models (MSMs).

To simulate the naturalselection process according to Darwinian laws, we must decide howthe natural selection principles are implemented in our algorithm. Inthis section, we introduce our basic operators of GA, the frameworkthat we follow to perform GA, and the protocol we adopt to finallygenerate optimal residue pairs. The construction of the GA is basedon the work of Mittal and Shukla[35].In the field of programming, operators refer to the actions totake during each step of execution of the algorithm. The basic op-erators in our study are composed of natural selection, mutation,and crossover. In the following section, we discuss our method tohelp predict an optimized set of residue pair distances for MSMconstruction using genetic algorithm. We also provide the series of teps as a flow chart shown in Figure 1. Some important parametersthat are involved in these steps are: populationSize , percentMutation and percentCrossover . These parameters can be changed accordingto user’s need.(1) A set of all possible residue pairs is identified. R ( R -1)/2 residuepairs for a protein with R residues.(2) populationSize preliminary sets of residue pair are randomlyselected from the set of all possible residue pairs for the firstiteration. Each set contains only one residue pair as the start-ing point for selection. These sets of residue pairs serve asthe initial generation G and are assigned fitness scores of 0.(3) Natural selection is performed to choose new generation ofresidue pairs according to their fitness scores. The naturalselection operator corresponds to the reproductive process innature, which selects genomes with ideal traits for breedingoffspring. In our case, we define a parameter populationSize that describes the number of elements randomly chosen fromthe parental set for a new generation G new .(4) Mutation is performed to maintain diversity to the currentgeneration of residue pair selections. The mutation operatorcorresponds to the mutation process in nature to increasegenetic diversity. In our version of GA, we define a parameter percentMutation to maintain a ratio of mutation in our combi-nation of residue pairs. During the mutation step, the numberof residue pairs to be mutated are generated by ( percentMu-tation × populationSize ) /

100 from G new and those residuepairs are randomly replaced by other residue pairs that areexcluded in the G new .(5) Crossover is performed to add more diversity to the cur-rent generation. The crossover operator corresponds to thenatural recombination process of chromosomes. Here, wedefine another parameter percentCrossover as the percentageof crossover in our combination of residue pairs. The numberof residue pairs to perform crossover is generated by ( per-centCrossover × populationSize ) /

100 from G new . The residuepair distance sets will then be swapped according to the num-ber calculated before to create a new combination of residuepair distances.(6) Evaluations are performed to assign fitness scores to thenewly generated residue pairs. MSMs are constructed basedon contact featurization using the current generation of residuepairs, and GMRQ scores are calculated accordingly to serveas fitness scores.(7) If more iterations are designed to be finished, the next itera-tion should restart at step (3) and use the current generationof residue pairs as G . As the iteration number increases, thefitness scores for the selection of residue pairs should show aconvergence of fitness scores.All the parameters used in this study are organized in Table 2. In this section, we discuss the optimized set of residue pair distancesobtained from our GA-based approach. As described in the method

Table 2.

Model hyperparameteres.

Featurization α -carbon contact distances Decomposition

Components Lag time (ns)tICA 5 0.2

Clustering

ClustersMini-batch k-means 200

Model fitting

N_timescales Lag time (ns)MSM 5 50

Scoring

GMRQ

Cross-validation

Iterations Test set sizeShuffle & Split 5 0.5

Genetic algorithm

Iterations 40populationSize 20%percentMutation 50%percentCrossover 20%part, the unbiased and extensive MD simulation data (>100 µs ) simu-lating the folding process of the proteins is taken from literature[4].Preliminary sets of residue pair distance are randomly selected fromthe set of all the possible residue pairs as the starting point of thegenetic algorithms. These sets go through selection, mutation andcrossover steps to provide a new generation of residue pair distances.In the setting of GA, we choose a population size of 10%, mutationpercentage of 50% and crossover percentage of 20%. Next, the newlygenerated residue pair distances are used to build MSMs and assignnew GMRQ scores (fitness scores) for evaluation. The next iterationwill then go back to the selection step and select according to thenewly assigned fitness scores. As the process goes through moreiterations, the GMRQ scores will converge and a best GMRQ scorecan be found.This method is applied to 4 proteins for demonstration of itsfunctionality: BBA, Villin, WW domain and λ -repressor. Among theproteins, 3 proteins (BBA, Villin and WW domain) are small proteins,each of which has a residue number that smaller than 40 ( R < 40).To examine the effectiveness of our method, we compare the GMRQscores and implied timescales with their corresponding benchmarkvalues (using all contact distances as features). In the end, we showthe ability of our GA-based method to process larger proteins such as λ -repressor, a protein with 80 residues, which cannot be featurizedusing all contact distances. We featurize the small proteins (BBA, Villin, WW domain) usingall contacts featurization to produce benchmark GMRQ scores forcomparison. Benchmark GMRQ scores will serve as a compara-ble reference to evaluate the performance of our method of usingGA to generate optimal residue pairs as featurization metrics. Bycomparing the best GMRQ scores from our GA-based method to able 3. Comparison of the best GMRQ scores generated and benchmark GMRQ scores from all contact featurization. The fraction of allresidue pairs is the fraction of chosen residue pairs in all residue pairs. The best GMRQ refers to the highest GMRQ score that we obtain fromMSMs using residue pair distance features given by our genetic algorithm approach, and the benchmark GMRQ score is the GMRQ providedby the MSM constructed with all contact featurization. The deviation column if the deviation of our best GMRQ score from the benchmarkGMRQ score.Protein Residues Number of chosen Fraction of Best Benchmark Deviation (%)distances pairs (%) GMRQ GMRQBBA (1FME) 28 47 12.43 4.445 4.239 +4.80Villin (2F4K) 35 61 10.25 3.203 3.705 -13.5WW domain (2F21) 35 4 0.67 4.198 4.111 +2.12 λ -repressor (1LMB) 80 60 1.90 4.956 N/A N/A Fig. 2.

GMRQ scores reflecting the MSMs based on the GA-predicted residue pairs. (A) BBA (PDB ID: 1FME), (B) Villin (PDB ID: 2F4K), (C)WW domain (PDB ID: 2F21), (D) λ -repressor (PDB ID: 1LMB). Green, dashed lines indicate the best GMRQ score corresponding to MSMsbased on all contact featurization. Each violin plot shows the increase of GMRQ scores over 40 iterations. In each set of data, the center dotshows the mean values and the vertical line shows the range of this GMRQ data set.the benchmark GMRQ scores, we are able to check whether ourmethod successfully provides the residue pair sets that depict theslowest process of the protein dynamics. We also apply this methodto λ -repressor, a medium sized protein with 80 residues, to show itsability to process larger proteins. The GMRQ scores are calculatedby adding up the eigenvalues of the transition probability matrixprovided by MSMs[33, 34]. The theoretical upper limit of GMRQscore is 6 in all cases[32–34], due to the fact that the number ofMSM timescales is chosen to be 5 in the MSM settings. Therefore,in our case, high GMRQ score that approaches 6 usually suggests abetter ability of an MSM to capture the slowest process, whereaslow GMRQ score implies ineffective state decomposition during the MSM construction process. All information regarding the GMRQscores and residue pair selection is summarized in Table 3. As shownin Figure 2, all GMRQ scores converged over 40 iterations. In Fig-ure 2A, the highest GMRQ score for BBA is around 4.445, which ishigher than the benchmark GMRQ score (4.239). Similar traits areshown by WW domain in Figure 2C that the best GMRQ from GA(4.198) is higher than the benchmark (4.111). However, one excep-tion happens in Villin, shown in Figure 2B. In Figure 2B, the bestpredicted GMRQ (3.203) does not reach the benchmark (3.705). Moreiterations for Villin are needed to reach a best GMRQ score that ishigher than the benchmark, but there exists a trade-off between theaccuracy and computational resource needed. Overall, the percent ariances between our predicted GMRQ score and the benchmarkGMRQ score are +4.8% for BBA, -13.5% for Villin and +2.12% forWW domain, in which Villin has the highest difference comparedto the other two proteins.Similar analysis is applied to λ -repressor, except that λ -repressorlacks a benchmark GMRQ score due to its higher number of residues.Hence, there is no reference value to compare in this case. The bestGMRQ score given over 40 iterations is around 4.956. Consideringthat the upper limit of the GMRQ score in this system is 6, we believethat a score of 4.956 is a relatively high GMRQ that effectivelycaptures the slow dynamics of the protein folding mechanisms.Therefore, we can conclude that the method proves its ability toprovide the optimal selection of residue pairs for the constructionof the best MSM. By plotting lag time dependent implied timescale plots, we canquantitatively visualize the slow modes of protein dynamics. Figure3 shows the comparison between the converged slowest impliedtimescales provided by all contact featurization and our GA-basedmethod. Again, the reference values are provided by utilizing allresidue pair distances as features. Since λ -repressor is too big for allcontacts featurization, there is no benchmark data available and itsimplied timescale is not shown. In Figure 3A, the slowest impliedtimescales (solid and dashed red lines) of BBA nearly overlap witheach other, indicating that our method has chosen a set residue pairdistances that captures the slowest process. In addition, the predictedsecond and third slowest implied timescales (yellow and cyan) areslower than the corresponding timescale for the benchmarks. InFigure 3B, the predicted implied timescales of Villin has a largerdeviation. This inconsistency will be explained and justified in thenext paragraph. In the case of WW domain (Figure 3C), we capturea slower timescale than the benchmarks. We find that inclusion of all residue pair distances can add noises to the model, and our GA-based method helps improve the MSM construction by excludingthose irrelevant features. Other than the GMRQ scores and implied timescale plots, moreinformation can be obtained from the sets of residue pair distances.In Table 2, we collect and summarize the number of distances se-lected by GA and the actual residue numbers in each protein. Oneinteresting thing is that the number of residues in a protein is notnecessary correlated to the number of distances needed to captureits slowest dynamics. For example, it can be observed in Table 2that although both Villin and WW domain have 35 residues intheir sequences, WW domain only needs 4 distances of residuepairs while Villin requires 61 distances. This may be due to thecomplex folding mechanism of Villin. Though both proteins arefast-folding proteins with small numbers of residues, the secondarystructure elements in Villin fold more independently without muchinteractions[44]. Such minimized interaction or minimal frustrationmakes the folding kinetics fast for Villin, according to the foldingfunnel theory[45]. Consequently, because the protein folds quickly,this phenomenon suggests a continuous reduction in energy in thefolding funnel[45, 46], which implies multiple parallel pathwaysduring the folding hypothesis[47]. On the other hand, WW domainfolds much slower than Villin[15] and has more consistent foldingpathways[48]. The independent features in Villin make it hard forGA to fully capture its slowest dynamics.In a previous study, Feng and Shukla[49] utilized evolution cou-plings (ECs) as functional features to capture protein folding andconformational dynamics, which gives the similar results for Villinand WW domain. Their work identified that Villin needs 73 ECs andWW domain only needs 5 ECs to fully describe the protein dynamics.They stated that more ECs are needed if the ECs has low correla-tion. Here, our results show the same trait that Villin requires morefeatures for identification of its slowest dynamic processes, whichis reasonable due to the folding complexity of Villin comparing toother fast-folding small proteins. To fully capture the slow dynamicsof proteins like Villin, a large number of features should be included

Fig. 3.

The first three slowest implied timescales as a function of MSM lag time. (A) BBA (PDB ID: 1FME), (B) Villin (PDB ID: 2F4K), (C) WWdomain (PDB ID: 2F21). The red, yellow and cyan colored lines indicate the slowest, second slowest and third slowest implied timescales,respectively. Dashed lines correspond to the reference value given by the MSMs built on all contacts featurization. Solid lines correspond tothe implied timescales given by the MSMs achieved by using the set of distances optimally chosen by our GA-based method. ig. 4. GA-chosen residue pairs visualized on the unfolded MD structures and folded crystal structures. (A) BBA (PDB ID: 1FME), (B) Villin(PDB ID: 2F4K), (C) WW domain (PDB ID: 2F21), (D) λ -repressor (PDB ID: 1LMB). The black lines specify the distances between residue pairschosen by our GA-based method, which capture the slowest dynamics of the proteins.from its whole dataset. This is a different scenario comparing tocapturing the dynamics of the proteins that needs small numbers offeatures, which is a problem easier for GA to solve. For proteins likeVillin, other methods needs to be explored for a more efficient wayto capture the slowest dynamics. Although our method results insome degrees of deviations from the benchmarks (shown in Figure2B and 3B), it still shows effectiveness in dealing with proteins withcomplicated kinetics.To present our predicted results in a more understandable way,we visualize the optimal sets of residue pairs for all four proteins inFigure 4. Each section (A, B, C and D) of Figure 4 consists of twoparts, representing the unfolded and folded structure of the proteinrespectively. It is easy to notice that the residue pair distances chosenby our method spread out in the protein to capture the complexdynamics of protein folding. Feature selection of MSM construction determines the accuracy ofpredicted kinetics properties. Currently, the selection of features isdone using trial and error. The utilization of GMRQ score enables aquantitative description of the accuracy of MSMs in representingthe molecular dynamics observed in a simulation dataset. UsingGMRQ score as fitness score, we introduce a GA-based methodin order to optimize a set of residue pair distances that producesuperior MSMs. In this study, we have shown that our methodcan provide an automatic, efficient and accurate way to choose theoptimal residue pair distances as features for MSMs construction.This significantly improves the efficiency in the overall process ofbuilding MSMs while still guarantees the quality of MSMs to capturethe slowest protein dynamics. Due to the diversity of tested proteins,our method can be widely applied to other proteins to help with the feature selection process and we anticipate that this methodwill shift MSM building one step closer to a systemic and objectiveprotocol. It is important to be aware that the underlying assumptionof this approach is that the slowest dynamic processes correspond tothe process of interest. However, this assumption can be challengedin the case of insufficient sampling or inaccurate force field.However, the method also has some limitations. The proposedmethod belongs to the class of wrapper methods for feature selec-tion that find the “optimal" feature subset by iteratively selectingfeatures based on the classifier performance. The performance ofthese methods drops significantly for datasets with large numberof important but uncorrelated features. Our method also does notperform well on systems with complex dynamics that requires alarge number of features to capture the underlying dynamics. Inother words, the effectiveness partially depends on the complexityof the conformational changes in the protein, which is shown in thediscussion of Villin. As the folding complexity increases, more path-ways are available for the protein, so the selection of residue pairsmay not fully depict the slowest dynamics of the protein. However, alarge number of biologically relevant dynamic processes have beenshown to involve only a few important features[7–11, 49–51]. Inaddition, sequence information and crystal structure of the proteinshould be known, and sufficient amount of MD simulation datashould be generated to apply our method. In conclusion, the pro-posed algorithm, can help identify essential residue pair distancesfor featurization and exclude noises for MSM construction withhigh efficiency.

The year-long Blue Waters Internship enriched my experience inmany aspects. This opportunity was rare and precious, especially hat I can utilize one of the leading-edge petascale computationalresources on the Blue Waters Supercomputer. I was excited to beoffered the opportunity to meet other interns to study and practicecomputational skills together. Starting last summer, I have beeninvolved in a variety of activities, including a two-week educationalworkshop at University of Illinois at Urbana-Champaign, regularwebinars, monthly reports and preparing a manuscript. Majoring inMaterial Science, I joined the internship with limited computationalexperience. However, I quickly gained essential skills and becameadept with the help of internship coordinators, my research advi-sor and mentors in the lab. In addition, working on the projectshelped me to be familiar with the life in a research group and bebetter prepared for the graduate school. My presentation skills wereimproved through attending group meetings and poster sessions. Ialso practiced my writing skills through regular progress reportsand writing this manuscript. Overall, the past year was a busy year,but it has became a unique experience in my undergraduate studies. ACKNOWLEDGMENTS

The authors thank D. E. Shaw Research for MD simulation trajec-tories (BBA, Villin, WW domain and λ -repressor). Q.C. would liketo thank the support of the Shodor Education Foundation and theBlue Waters Student Internship Program. J.F. would like to thankChia-chen Chu fellowship for support. S.M. is supported by the CSEFellows Program, Computational Science and Engineering at Univer-sity of Illinois, Urbana-Champaign, Urbana, IL. D. S. acknowledgessupport from the Foundation for Food and Agriculture Research viathe new innovator program. REFERENCES [1] BJ Alder and TEf Wainwright. Phase transition for a hard sphere system.

J. Chem.Phys. , 27(5):1208–1209, 1957.[2] Thomas J Lane, Diwakar Shukla, Kyle A Beauchamp, and Vijay S Pande. Tomilliseconds and beyond: challenges in the simulation of protein folding.

Currentopinion in structural biology , 23(1):58–65, 2013.[3] Nuria Plattner, Stefan Doerr, Gianni De Fabritiis, and Frank Noé. Completeprotein–protein association kinetics in atomic detail revealed by molecular dy-namics simulations and markov modelling.

Nat. Chem. , 9(10):1005, 2017.[4] Kresten Lindorff-Larsen, Stefano Piana, Ron O Dror, and David E Shaw. Howfast-folding proteins fold.

Science , 334(6055):517–520, 2011.[5] S Doerr, MJ Harvey, Frank Noé, and G De Fabritiis. Htmd: high-throughputmolecular dynamics for molecular discovery.

J. Chem. Theory. Comput. , 12(4):1845–1852, 2016.[6] Alexander S Moffett and Diwakar Shukla. Using molecular simulation to explorethe nanoscale dynamics of the plant kinome.

Biochemical Journal , 475(5):905–921,2018.[7] Diwakar Shukla, Ariana Peck, and Vijay S Pande. Conformational heterogeneityof the calmodulin binding interface.

Nat. Commun. , 7:10910, 2016.[8] Dan K Vanatta, Diwakar Shukla, Morgan Lawrenz, and Vijay S Pande. A networkof molecular switches controls the activation of the two-component responseregulator ntrc.

Nature communications , 6:7283, 2015.[9] Diwakar Shukla, Yilin Meng, Benoît Roux, and Vijay S Pande. Activation path-way of src kinase reveals intermediate states as targets for drug design.

Naturecommunications , 5:3397, 2014.[10] Kai J Kohlhoff, Diwakar Shukla, Morgan Lawrenz, Gregory R Bowman, David EKonerding, Dan Belov, Russ B Altman, and Vijay S Pande. Cloud-based simulationson google exacycle reveal ligand modulation of gpcr activation pathways.

Naturechemistry , 6(1):15, 2014.[11] Morgan Lawrenz, Diwakar Shukla, and Vijay S Pande. Cloud computing ap-proaches for prediction of ligand binding poses and pathways.

Scientific reports ,5:7918, 2015.[12] Brooke E Husic and Vijay S Pande. Markov state models: From an art to a science.

Journal of the American Chemical Society , 140(7):2386–2396, 2018.[13] Diwakar Shukla, Carlos X Hernández, Jeffrey K Weber, and Vijay S Pande. Markovstate models provide insights into dynamic modulation of protein function.

Acc. Chem. Res. , 48(2):414–422, 2015.[14] Vijay S Pande, Kyle Beauchamp, and Gregory R Bowman. Everything you wantedto know about markov state models but were afraid to ask.

Methods , 52(1):99–105,2010.[15] Gregory R Bowman, Vijay S Pande, and Frank Noé. Introduction and overview ofthis book. In

An introduction to Markov state models and their application to longtimescale molecular simulation , pages 1–6. Springer, 2014.[16] Brooke E Husic, Robert T McGibbon, Mohammad M Sultan, and Vijay S Pande.Optimized parameter selection reveals trends in markov state models for proteinfolding.

J. Chem. Phys. , 145(19):194103, 2016.[17] Shriyaa Mittal and Diwakar Shukla. Recruiting machine learning methods formolecular simulations of proteins.

Mol. Simul. , pages 1–14, 2018.[18] Christian R Schwantes and Vijay S Pande. Improvements in markov state modelconstruction reveal many non-native interactions in the folding of ntl9.

J. Chem.Theory. Comput. , 9(4):2000–2009, 2013.[19] Guillermo Pérez-Hernández, Fabian Paul, Toni Giorgino, Gianni De Fabritiis, andFrank Noé. Identification of slow molecular order parameters for markov modelconstruction.

J. Chem. Phys. , 139(1):07B604_1, 2013.[20] Mohammad M. Sultan and Vijay S Pande. tica-metadynamics: Accelerating meta-dynamics by using kinetically selected collective variables.

J. Chem. Theory.Comput. , 13(6):2440–2447, 2017.[21] Lisa J Lapidus, Srabasti Acharya, Christian R Schwantes, Ling Wu, Diwakar Shukla,Michael King, Stephen J DeCamp, and Vijay S Pande. Complex pathways in foldingof protein g explored by simulation and experiment.

Biophys. J. , 107(4):947–955,2014.[22] Christian R Schwantes, Diwakar Shukla, and Vijay S Pande. Markov state modelsand tica reveal a nonnative folding nucleus in simulations of nug2.

Biophysicaljournal , 110(8):1716–1719, 2016.[23] Kyle A Beauchamp, Gregory R Bowman, Thomas J Lane, Lutz Maibaum, Imran SHaque, and Vijay S Pande. Msmbuilder2: modeling conformational dynamics onthe picosecond to millisecond scale.

J. Chem. Theory. Comput. , 7(10):3412–3419,2011.[24] Robert T McGibbon, Kyle A Beauchamp, Matthew P Harrigan, Christoph Klein,Jason M Swails, Carlos X Hernández, Christian R Schwantes, Lee-Ping Wang,Thomas J Lane, and Vijay S Pande. Mdtraj: a modern open library for the analysisof molecular dynamics trajectories.

Biophys. J. , 109(8):1528–1532, 2015.[25] David E Goldberg and John H Holland. Genetic algorithms and machine learning.

Mach. Learn. , 3(2):95–99, 1988.[26] John Henry Holland.

Adaptation in natural and artificial systems: an introductoryanalysis with applications to biology, control, and artificial intelligence . MIT press,1992.[27] Mohammad Taherdangkoo, Mahsa Paziresh, Mehran Yazdi, and Mohammad HadiBagheri. An efficient algorithm for function optimization: modified stem cellsalgorithm.

Cent. Eur. J. Eng. , 1(3):36–50, 2013.[28] M Kenneth Jr, Scott M LeGrand, et al.

The protein folding problem and tertiarystructure prediction . Springer Science & Business Media, 2012.[29] Ron Unger and John Moult. Genetic algorithms for protein folding simulations.

Biochem Mol Biol J. , 231(1):75–81, 1993.[30] Cedric Gondro and Brian P Kinghorn. A simple genetic algorithm for multiplesequence alignment.

Genet. Mol. Res. , 6(4):964–982, 2007.[31] Md Tamjidul Hoque and Sumaiya Iqbal. Genetic algorithm-based improvedsampling for protein structure prediction.

Int. J. Bio. Inspir. Com. , 9(3):129–141,2017.[32] Brooke E Husic and Vijay S Pande. Note: Msm lag time cannot be used forvariational model selection.

J. Chem. Phys. , 147(17):176101, 2017.[33] Robert T McGibbon and Vijay S Pande. Variational cross-validation of slowdynamical modes in molecular kinetics.

J. Chem. Phys. , 142(12):03B621_1, 2015.[34] Frank Noé and Feliks Nuske. A variational approach to modeling slow processesin stochastic dynamical systems.

Multiscale Model Simul. , 11(2):635–655, 2013.[35] Shriyaa Mittal and Diwakar Shukla. Predicting optimal deer label positions tostudy protein conformational heterogeneity.

J. Phys. Chem. B , 121(42):9761–9770,2017.[36] Gerhard Hummer and Attila Szabo. Optimal dimensionality reduction of mul-tistate kinetic and markov-state models.

J. Phys. Chem. B , 119(29):9029–9037,2014.[37] Jan-Hendrik Prinz, Hao Wu, Marco Sarich, Bettina Keller, Martin Senne, MartinHeld, John D Chodera, Christof Schütte, and Frank Noé. Markov models ofmolecular kinetics: Generation and validation.

J. Chem. Phys. , 134(17):174105,2011.[38] Alexander S Moffett, Kyle W Bender, Steven C Huber, and Diwakar Shukla. Al-losteric control of a plant receptor kinase through s-glutathionylation.

Biophys.J. , 113(11):2354–2363, 2017.[39] John D Chodera, Nina Singhal, Vijay S Pande, Ken A Dill, and William C Swope.Automatic discovery of metastable states for the construction of markov modelsof macromolecular conformational dynamics.

J. Chem. Phys. , 126(15):04B616,2007.

40] Jerome Friedman, Trevor Hastie, and Robert Tibshirani.

The elements of statisticallearning , volume 1. Springer series in statistics New York, 2001.[41] Kyle A Beauchamp, Daniel L Ensign, Rhiju Das, and Vijay S Pande. Quantitativecomparison of villin headpiece subdomain simulations and triplet–triplet energytransfer experiments.

Proc. Natl. Acad. Sci. U.S.A , 108(31):12734–12739, 2011.[42] David Ascher, Paul F Dubois, Konrad Hinsen, Jim Hugunin, Travis Oliphant, et al.Numerical python, 2001.[43] Robert T McGibbon, Carlos X Hernández, Matthew P Harrigan, Steven Kearnes,Mohammad M Sultan, Stanislaw Jastrzebski, Brooke E Husic, and Vijay S Pande.Osprey: Hyperparameter optimization for machine learning.

J. O. S. S , 1:00034,2016.[44] James C McKnight, Don S Doering, Paul T Matsudaira, and Peter S Kim. Athermostable 35-residue subdomain within villin headpiece, 1996.[45] Joseph D Bryngelson, Jose Nelson Onuchic, Nicholas D Socci, and Peter G Wolynes.Funnels, pathways, and the energy landscape of protein folding: a synthesis.

Proteins: Struct., Funct., Bioinf. , 21(3):167–195, 1995.[46] Ken A Dill and Hue Sun Chan. From levinthal to pathways to funnels.

Nat. Struct.Mol. Biol. , 4(1):10, 1997.[47] Li Zhu, Kingshuk Ghosh, Michael King, Troy Cellmer, Olgica Bakajin, and Lisa JLapidus. Evidence of multiple folding pathways for the villin headpiece subdomain.

J. Phys. Chem. B , 115(43):12632–12637, 2011.[48] Silvio a Beccara, Tatjana Škrbić, Roberto Covino, and Pietro Faccioli. Dominantfolding pathways of a ww domain.

Proc. Natl. Acad. Sci. U.S.A , 109(7):2330–2335,2012.[49] Jiangyan Feng and Diwakar Shukla. Characterizing conformational dynamics ofproteins using evolutionary couplings.

J. Phys. Chem. B , 2018.[50] Zahra Shamsi, Alexander S Moffett, and Diwakar Shukla. Enhanced unbiasedsampling of protein dynamics using evolutionary coupling information.

ScientificReports , 7(1):12700, 2017.[51] Mohammad M Sultan, Gert Kiss, Diwakar Shukla, and Vijay S Pande. Automaticselection of order parameters in the analysis of large scale molecular dynamicssimulations.

Journal of chemical theory and computation , 10(12):5217–5223, 2014., 10(12):5217–5223, 2014.