Unsupervised learning of dynamical and molecular similarity using variance minimization
Brooke E. Husic and Vijay S. Pande
Department of Chemistry, Stanford University, Stanford, CA 94305
{bhusic, pande}@stanford.edu
Abstract
In this report, we present an unsupervised machine learning method for determining groups of molecular systems according to similarity in their dynamics or structures using Ward's minimum variance objective function. We first apply the minimum variance clustering to a set of simulated tripeptides using the information theoretic Jensen-Shannon divergence between Markovian transition matrices in order to gain insight into how point mutations affect protein dynamics. Then, we extend the method to partition two chemoinformatic datasets according to structural similarity to motivate a train/validation/test split for supervised learning that avoids overfitting.
Scientists have sought to understand the dynamical behavior of proteins at atomic resolution since the first molecular dynamics (MD) simulation of the 58-amino acid bovine pancreatic trypsin inhibitor (BPTI) in 1977 [McCammon et al., 1977]. In the past 40 years, computational chemists have seen major improvements in molecular dynamics methods [Adcock and McCammon, 2006], and modern datasets can reach biologically relevant timescales (tens of milliseconds using femtosecond time steps) due to specialized hardware [Shaw et al., 2008] and distributed computing platforms such as Folding@home [Shirts and Pande, 2000], GPUGRID [Buch et al., 2010], and Google Exacycle [Kohlhoff et al., 2014].

The enormous size of modern MD datasets requires complementary methods to understand and analyze the data in a statistically rigorous way. Markov state models (MSMs) are a popular framework for this type of analysis that use a master equation to represent the thermodynamics and kinetics of a molecular system [Bowman et al., 2014]. Recent advances in MSM applications include complex, multi-system dynamics such as protein-protein association [Plattner et al., 2017, Zhou et al., 2017] or aggregated datasets containing simulations in multiple force fields [McKiernan et al., 2017, Olsson et al., 2017]. It is thus necessary to develop complementary tools for understanding these types of aggregated datasets and quantifying the dynamical similarity among the different systems. Minimum variance cluster analysis (MVCA), a recently introduced unsupervised learning method to coarse-grain a single MSM, has the versatility to be used both within a single model (for coarse-graining) and among a set of models for dynamical clustering [Husic et al., 2017].
While the authors focus on the coarse-graining application, they conclude with a motivation of the latter application in which they analyze separate folding simulations of a small protein in nine different protein and water force field combinations.

In this report, we focus on the ability of the unsupervised MVCA algorithm to identify groups of molecular systems. We first provide a theoretical background of the MSM transition matrix in order to motivate the development of MVCA for identifying dynamical groups. We then demonstrate the method on MSMs of capped amino acids to gain insight into protein dynamics for larger systems. Finally, we showcase the versatility of MVCA by applying it to a completely different problem: the selection of training, validation, and test sets for cross-validating a chemoinformatic supervised learning task.

We begin with a continuous-time Markovian process, which means that the probability of transitioning from state x to state y after a time interval of τ does not depend on any states occupied before the system was in x. We assume this process is time-homogeneous, stochastic, and reversible with respect to its stationary distribution, μ. The process is also irreducible, which means that there is a path from each state to every other state given sufficient time, and aperiodic. If we represent our system by a probability distribution p_t at time t, and the transition density kernel for x → y as p(x, y), we can write the new probability distribution at time t + τ as,

\[ \frac{p_{t+\tau}(y)}{\mu(y)} = \int_\Omega \mathrm{d}x\, p_t(x)\, p(x, y) = \mathcal{T}(\tau) \circ \frac{p_t(y)}{\mu(y)}. \tag{1} \]

The continuous transfer operator \(\mathcal{T}\), which is characterized by its lag time τ, is compact and self-adjoint with respect to the stationary distribution μ. The transfer operator admits a decomposition into eigenvalues and eigenfunctions,

\[ \mathcal{T}(\tau) \circ \psi_i = \lambda_i \psi_i, \tag{2} \]

where the eigenvalues λ_i are real and indexed in decreasing order.
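To make the decomposition concrete in the discrete setting, consider the following minimal numerical sketch (the 3-state matrix below is hypothetical, not from the paper's dataset): for a reversible transition matrix the eigenvalues are real, the largest equals 1, and the rest lie strictly inside the unit interval.

```python
import numpy as np

# Hypothetical symmetric (hence reversible, doubly stochastic) 3-state
# transition matrix; the values are illustrative only.
T = np.array([[0.90, 0.08, 0.02],
              [0.08, 0.84, 0.08],
              [0.02, 0.08, 0.90]])

# A symmetric stochastic matrix has real eigenvalues; sort them in
# decreasing order, as in the text.
eigvals = np.sort(np.linalg.eigvalsh(T))[::-1]

# The largest eigenvalue is 1 and corresponds to the stationary process;
# the remaining eigenvalues satisfy |lambda_i| < 1.
print(eigvals)
```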
The eigenvalue λ_1 = 1 is unique and corresponds to the stationary process. All subsequent eigenvalues are on the interval |λ_{i>1}| < 1 and correspond to dynamical processes in the time series.

To build a MSM, we decompose the subset of conformation space explored by the MD simulation into discrete, disjoint states. By counting the transitions between these states, we calculate the maximum likelihood estimator of the transition probability matrix to obtain a discrete approximation to \(\mathcal{T}\). This transition matrix is the MSM master equation.

To cluster MSM transition matrices, we review the theory presented in the original MVCA paper [Husic et al., 2017]. For two MSMs with row-stochastic transition probability matrices P and Q, the divergence from the ith row of Q to the ith row of P can be written as the Kullback-Leibler divergence,

\[ \mathrm{div}_{\mathrm{KL}}(P_i \,\|\, Q_i) \equiv \sum_j P_i(j) \log \frac{P_i(j)}{Q_i(j)}, \tag{3} \]

where P_i can be thought of as the "reference" distribution and Q_i as a "test" distribution [Bowman et al., 2010]. The information theoretic Jensen-Shannon divergence [Lin, 1991] is a related symmetric formulation that utilizes M, the elementwise mean of P and Q,

\[ \mathrm{div}_{\mathrm{JS}}(P_i \,\|\, Q_i) \equiv \tfrac{1}{2}\, \mathrm{div}_{\mathrm{KL}}(P_i \,\|\, M_i) + \tfrac{1}{2}\, \mathrm{div}_{\mathrm{KL}}(Q_i \,\|\, M_i). \tag{4} \]

Since it has been shown that the square root of (4) obeys the triangle inequality [Endres and Schindelin, 2003], we can write the following distance metric,

\[ \mathrm{div}_{\sqrt{\mathrm{JS}}}(P_i \,\|\, Q_i) \equiv \sqrt{\mathrm{div}_{\mathrm{JS}}(P_i \,\|\, Q_i)}. \tag{5} \]

Finally, for a scalar distance between the two transition matrices P and Q, each of which contains the same number of rows, we define the sum [Husic et al., 2017],

\[ D_{\sqrt{\mathrm{JS}}} \equiv \sum_i \mathrm{div}_{\sqrt{\mathrm{JS}}}(P_i \,\|\, Q_i). \tag{6} \]

Hierarchical agglomerative clustering is an unsupervised learning method that is initiated with a set of pairwise distances between data points and iteratively merges the two closest clusters or singletons.
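The scalar distance in (6) can be computed in a few lines. The sketch below is our own minimal implementation for two row-stochastic matrices sharing the same state definitions (the helper name d_sqrt_js is ours):

```python
import numpy as np

def d_sqrt_js(P, Q):
    """Scalar distance between two row-stochastic transition matrices:
    the square-root Jensen-Shannon divergence of each row pair, summed
    over rows, as in Eq. (6)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    M = 0.5 * (P + Q)  # elementwise mean of the two matrices

    def kl(a, b):
        # Kullback-Leibler divergence with the 0 log 0 = 0 convention.
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    total = 0.0
    for p, q, m in zip(P, Q, M):
        js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
        total += np.sqrt(max(js, 0.0))  # guard tiny negative round-off
    return total

P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.5, 0.5], [0.5, 0.5]])
print(d_sqrt_js(P, P))  # 0.0: identical matrices are at distance zero
print(d_sqrt_js(P, Q))  # positive, and symmetric in its arguments
```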
Hierarchical agglomerative clustering thus requires a similarity function that quantifies the distance between all data points and an objective function that determines how to use those distances to determine which existing clusters to merge at each agglomerating step. Common examples of objective functions used for hierarchical agglomerative clustering are single, average, and complete linkages, which define the distance between two clusters as the shortest, average, or greatest distance, respectively, between any point in one cluster and any point in the other cluster.

Another objective function used for hierarchical agglomerative clustering is Ward's minimum variance criterion [Ward, 1963]. The agglomeration is performed by merging the two clusters or singletons such that the resulting increase in intra-cluster variance is minimized. Ward's method is usually implemented according to the following recursive distance update [Müllner, 2013],

\[ d(u, v) = \sqrt{\frac{|v| + |s|}{T}\, d(v, s)^2 + \frac{|v| + |t|}{T}\, d(v, t)^2 - \frac{|v|}{T}\, d(s, t)^2}, \tag{7} \]

where the clusters s and t have just been merged to create a new cluster, u, and the new distance between u and some other cluster v needs to be updated. Here |c| represents the number of data points contained in cluster c, and T ≡ |s| + |t| + |v|. A nonrecursive formula for the distance update when s or t is a singleton has also been derived [Husic and Pande, 2017]. We note that the minimum variance formulation is rigorously defined for Euclidean distances, but that the objective function can be used for any similarity function with the understanding that it no longer corresponds to Euclidean variance, which requires the \(\ell_2\)-norm.

In this paper, we first use hierarchical agglomerative clustering with Ward's minimum variance objective function and the \(D_{\sqrt{\mathrm{JS}}}\) similarity function (6) to quantify the similarity between multiple models for related dynamical systems.
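For example, given a precomputed pairwise distance matrix (the values below are invented for illustration), SciPy's hierarchy module implements the update in (7) as method="ward":

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric pairwise distance matrix for five systems:
# the first two and the last three form two tight groups.
D = np.array([[0.0, 0.1, 0.9, 0.8, 0.9],
              [0.1, 0.0, 0.8, 0.9, 0.9],
              [0.9, 0.8, 0.0, 0.1, 0.2],
              [0.8, 0.9, 0.1, 0.0, 0.1],
              [0.9, 0.9, 0.2, 0.1, 0.0]])

# linkage expects a condensed distance vector, hence squareform; note that
# Ward's update is being applied here to non-Euclidean distances, with the
# caveat discussed in the text.
Z = linkage(squareform(D), method="ward")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. two groups {0, 1} and {2, 3, 4}; cluster ids are arbitrary
```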
We then apply Ward's minimum variance objective function with different similarity functions to the hierarchical clustering of molecular structures and reaction fingerprints in order to motivate a cross-validation scheme for a supervised learning model.

It is often desirable to modify the dynamics of a protein by introducing a sequence mutation, i.e., substituting a selected amino acid for the one present in the wild type protein. Simulating mutated proteins using MD can provide a window into whether or not the mutation has affected the protein dynamics. The MVCA algorithm can be used to cluster a set of mutants based on their dynamics, and can lend insight into which mutations produce similar effects. It is known that all solvated amino acids except glycine and proline occupy certain regions of the space defined by their ϕ and ψ backbone dihedral angles [Vitalini et al., 2016]. Figure 1 shows the ϕ and ψ angles on alanine (left), the structure of which can be compared with glycine (center) and proline (right). Glycine, the smallest amino acid, occupies more conformations due to its flexibility, and proline occupies fewer conformations due to its five-membered ring.

Figure 1: Amino acids are differentiated by their side chains, and the dihedrals about the bonds adjacent to the side chains (light blue, left) occupy known regions of ϕ × ψ space. Alanine (left) is an amino acid with a standard range of side chain motion. Glycine (center) is more flexible because it does not have a side chain. Proline (right) is less flexible because its side chain forms a ring.

To demonstrate MVCA on a dataset for which we can visualize the dynamical degrees of freedom, we consider a set of proline-containing tripeptides simulated for 1 µs each in the AMBER ff99SB-ILDN force field at 300 K. We hypothesize that the dynamics of the central amino acid in a tripeptide are affected by the presence of proline at any of the three positions.
To investigate the number of dynamical groups represented by this tripeptide dataset, we first create a MSM at a 100 ps lag time for each system using a regular spatial clustering of the ϕ and ψ angles of the central amino acid. Since all MSMs were built using the same state definitions, we can assess the similarity of their transition matrices using MVCA with \(D_{\sqrt{\mathrm{JS}}}\). The MVCA analysis shows that the 42 tripeptides cluster into five natural groups, which is clear from a dendrogram representation of the hierarchical tree (Fig. 2, top). Plotting the free energy surfaces of the central amino acid for each tripeptide shows that the energy landscapes within each cluster are very similar. Three large clusters identify tripeptides with proline as the first, second, and third amino acid (Fig. 2; bottom; blue, purple, and green, respectively). Two singleton clusters are also identified, which represent the two systems in the dataset for which glycine is the central amino acid (Fig. 2, gray). It is clear from their free energy landscapes that these systems are dynamically very different from the others.

We suspected that the systems would differentiate according to the location of the proline because it is easy to understand the degrees of freedom for this simple system. This analysis is important because it has fidelity to the expected result, and can successfully group the tripeptides based on only the transition matrices of their MSMs. The analysis does not produce a reasonable result when the single, average, or complete linkage objective function is used instead of Ward's minimum variance objective function (see Appendix A).

Predicting properties from molecular structures such as solubility and binding affinity is a significant challenge, and state of the art approaches such as atomic convolutional networks [Gomes et al., 2017] and graph convolutions [Kearnes et al., 2016] have been used for supervised learning of these quantities.
When using supervised learning to predict molecular properties from a representation of the molecular structure, it is important to use cross-validation when assessing the model's accuracy. A standard cross-validation split involves dividing the labeled data into three sets: training, validation, and test. The model is trained on the training set and is evaluated on the validation set during development. Once hyperparameters have been selected, the final model is evaluated on the test set. For chemoinformatic models designed to predict the properties of novel compounds, care must be taken in choosing how to divide the data into these three sets such that the model is not overfit to the training data. When using neural network architectures, for example, the goal is to produce a model that has learned some complex underlying feature from the data, but not a model that has memorized every data point and simply recalls what it has memorized. A train/validation/test split motivated by differentiating these two options is therefore critical for evaluating supervised learning models in the context of characterizing the properties of novel compounds. Common partitions for chemoinformatic studies include random splits, temporal splits, stratified splits according to the quantity being learned, and scaffold splits [Bemis and Murcko, 1996] in which molecules are assigned to a set based on the frequency of the molecular scaffold.
The latter method is generally difficult for models with deep architectures, since the model must apply what it has learned to a different type of data [Gomes et al., 2017].

Figure 2: A dendrogram from clustering 42 transition matrices using \(D_{\sqrt{\mathrm{JS}}}\) and Ward's minimum variance objective function shows that the systems cluster into five groups (top). The energy landscape of each system is plotted for −180° < ϕ < 180° on the x-axis and −180° < ψ < 180° on the y-axis, with darker colors representing more stable conformations (bottom). The five clusters are identified using boxes corresponding to the colors in the dendrogram. Since we have analyzed a system for which the degrees of freedom are interpretable, we can see that the clustering analysis identifies groups with similar free energy surfaces.
A "realistically novel" training/test split for kinase virtual screening has also been recently published with the goal of avoiding overfit models [Martin et al., 2017]. In the spirit of scaffold and "realistically novel" splittings, here we apply MVCA to choose the training, validation, and test subsets from a molecular database. Although it is much too small to use for training a deep network with many hyperparameters, we present an example for a reduced group of molecules so we can interpret the results. We thus select the 59 compounds with molecular weight less than 75 g/mol from a dataset containing aqueous solubilities [Delaney, 2004]. In this dataset, the molecules are represented by the simplified molecular-input line-entry system (SMILES), which are one-dimensional string representations of molecular structures. To quantify the similarity between each pair of molecules, we use the Levenshtein distance ratio between the SMILES strings, which is a function of the number of single character edits that must be made to convert one representation to the other and is normalized for string length. Since the Levenshtein distance ratio ranges from 0 to 1, where a value of 1 means the strings are identical, we encode the similarity between each pair of strings as the Levenshtein distance ratio subtracted from 1 so that smaller values correspond to closer strings. Then, we apply MVCA to the pairwise SMILES representations (instead of pairwise \(D_{\sqrt{\mathrm{JS}}}\) distances).

Reaction fingerprints were generated using the CreateStructuralFingerprintForReaction command in the RDKit Python package [rdkit.org]. To quantify the similarity between reaction fingerprints we use the Tanimoto coefficient subtracted from 1 so that smaller values correspond to closer distances, as above. We then use MVCA to group the 2000 reactions with the lowest total molecular weight. The resulting dendrogram shows two natural groups with 78% of the reactions in one group and 22% of the reactions in the other group.
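Both distance functions above can be sketched in a few lines of Python. The edit-distance normalization below is one common choice and may differ in detail from the Levenshtein ratio used in this analysis; the 8-bit fingerprints are toy examples (the real fingerprints here are 4096-bit):

```python
import numpy as np

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over insertions,
    deletions, and substitutions of single characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def smiles_distance(s1, s2):
    """Length-normalized edit distance, so identical strings give 0
    (one common normalization; assumption, not the paper's exact ratio)."""
    if not s1 and not s2:
        return 0.0
    return levenshtein(s1, s2) / max(len(s1), len(s2))

def tanimoto_distance(fp1, fp2):
    """1 minus the Tanimoto (Jaccard) coefficient between two binary
    fingerprints, so identical fingerprints give 0."""
    fp1, fp2 = np.asarray(fp1, bool), np.asarray(fp2, bool)
    union = np.logical_or(fp1, fp2).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(fp1, fp2).sum() / union

print(smiles_distance("CCO", "CCN"))                         # one edit over three characters
print(tanimoto_distance([1, 1, 0, 1, 0, 0, 0, 1],
                        [1, 0, 0, 1, 1, 0, 0, 1]))           # intersection 3, union 5
```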
When we calculate the percent allocation of each structural class in Table 1, we see that unsupervised clustering of reaction fingerprints does not partition the reactions according to their classes.

From the perspective of a modeler seeking to design a train/validation/test split for predicting reaction products, there are two choices for designing a cross-validation scheme. First, the modeler can determine this split according to the reaction classes in order to represent different types of reactions in the train, validation, and test sets. However, this may result in an overfit model, since the validation and test sets will contain reactions with representations similar to those in the training set according to their Tanimoto coefficients. Alternatively, the modeler can determine the split according to the similarity of reaction representations. In this case, the train, validation, and test sets would be chosen according to minimum variance groupings of fingerprints. For the reaction dataset analyzed here, each group would likely contain reactions from all 10 classes. Performing the same analysis with the original reaction SMILES strings and the Levenshtein distance ratio produces nearly identical results, since 98% of the group assignments are the same as in the fingerprint analysis. This is expected because the fingerprints are generated from the SMILES strings. The fingerprints used for this analysis were 4096-bit, and the same analysis was performed for fingerprints with sizes through . The minimum similarity between any pair of MVCA assignments for fingerprints of different sizes was 93%.
The minimum similarity between SMILES string assignments and fingerprint assignments was 92% for the 128-bit fingerprints.

Unsupervised clustering of reaction fingerprints shows that the calculated similarity in molecular representations may not align with the similarity assessed by a domain expert, and it is important to consider the similarity of the data according to its representation using an appropriate metric. Choosing a cross-validation scheme according to human intuition may therefore lead to overfit models when the molecular representations are quantitatively similar but heuristically dissimilar. Since overfit models have been shown to be a problem in chemoinformatic supervised learning tasks [Martin et al., 2017], we anticipate this method will be useful to the chemoinformatic machine learning community.

In this report, we present the unsupervised learning of peptide dynamics and of molecular structures using MVCA for hierarchical agglomerative clustering with Ward's minimum variance objective function. We first demonstrate MVCA with the Jensen-Shannon divergence between MSM transition matrices to identify dynamical groups from a dataset of 42 tripeptides containing proline. Then, we apply MVCA in a completely new way, using the Levenshtein edit distance ratio between SMILES representations of small molecules, and the Tanimoto coefficient between reaction fingerprints, to partition the datasets for cross-validation. This analysis is intended to address a knowledge gap specific to supervised machine learning for chemoinformatic analyses designed to predict the properties of novel molecules: namely, how to perform cross-validation such that model generalizability is properly measured. Here, we suggest constructing training, validation, and test sets for model cross-validation according to minimum variance in order to assess model performance with a practical test set; i.e., one that contains newly designed compounds.
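As an illustration of this suggestion, cluster labels from MVCA could seed a split in which similar data points never straddle the train/validation/test boundary. The greedy helper split_by_cluster below is our own sketch under that assumption, not a procedure from this report:

```python
from collections import defaultdict

def split_by_cluster(cluster_labels, frac_train=0.8, frac_valid=0.1):
    """Assign whole clusters to train/validation/test so that members of
    the same cluster never straddle the split (hypothetical helper)."""
    groups = defaultdict(list)
    for idx, lab in enumerate(cluster_labels):
        groups[lab].append(idx)
    # Largest clusters first, greedily filling train, then validation,
    # with whatever remains going to test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(cluster_labels)
    train, valid, test = [], [], []
    for members in ordered:
        if len(train) < frac_train * n:
            train += members
        elif len(valid) < frac_valid * n:
            valid += members
        else:
            test += members
    return train, valid, test

# Toy labels: four clusters over ten data points.
labels = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4]
train, valid, test = split_by_cluster(labels, frac_train=0.6, frac_valid=0.2)
print(train, valid, test)
```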
While common machine learning wisdom dictates that the training, validation, and test sets should be drawn from the same distribution, for the prediction of a novel compound's chemical properties it is crucial for us to demonstrate that models can generalize to new kinds of molecules. If it turns out to be the case that maximally novel test sets break such models, then it is important for the field as a whole to question whether supervised machine learning approaches, in particular those with deep neural network architectures, are appropriate for predicting chemical properties of new compounds.

Machine learning plays a crucial role in both modern MD analyses and the prediction of molecular properties from structure. We thus anticipate that MVCA will be broadly applicable to various applications in machine learning for molecular data. The MVCA algorithm is available in the open-source MSMBuilder software package [Harrigan et al., 2017], which was used to build the MSMs in Sec. 3.1. The MVCA analyses presented in this report can also be implemented using the SciPy Python package [Jones et al., 2001].
Acknowledgments
B.E.H. is grateful to Evan Feinberg, Zhenqin Wu, and Bowen Liu for discussions during the preparation of this manuscript. The authors thank Matt Harrigan for providing the tripeptide dataset. We acknowledge the National Institutes of Health under No. NIH R01-GM62868 for funding. V.S.P. is a consultant & SAB member of Schrodinger, LLC and Globavir, sits on the Board of Directors of Apeel Inc, Freenome Inc, Omada Health, Patient Ping, Rigetti Computing, and is a General Partner at Andreessen Horowitz. An earlier version of this paper was accepted to the Machine Learning for Molecules and Materials Workshop at NIPS 2017 in Long Beach, CA. The workshop proceedings can be found at .

References

[Adcock and McCammon, 2006] Adcock, S. A. and McCammon, J. A. (2006). Molecular dynamics: Survey of methods for simulating the activity of proteins.
Chem. Rev., 106(5):1589–1615.

[Bemis and Murcko, 1996] Bemis, G. W. and Murcko, M. A. (1996). The properties of known drugs. 1. Molecular frameworks. J. Med. Chem., 39(15):2887–2893.

[Bowman et al., 2010] Bowman, G. R., Ensign, D. L., and Pande, V. S. (2010). Enhanced modeling via network theory: Adaptive sampling of Markov state models. J. Chem. Theory Comput., 6(3):787–794.

[Bowman et al., 2014] Bowman, G. R., Pande, V. S., and Noé, F. (2014). An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, volume 797. Springer.

[Buch et al., 2010] Buch, I., Harvey, M. J., Giorgino, T., Anderson, D. P., and Fabritiis, G. D. (2010). High-throughput all-atom molecular dynamics simulations using distributed computing. J. Chem. Inf. Model., 50(3):397–403.

[Delaney, 2004] Delaney, J. S. (2004). ESOL: Estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci., 44(3):1000–1005.

[Endres and Schindelin, 2003] Endres, D. M. and Schindelin, J. E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860.

[Gomes et al., 2017] Gomes, J., Ramsundar, B., Feinberg, E. N., and Pande, V. S. (2017). Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603.

[Harrigan et al., 2017] Harrigan, M. P., Sultan, M. M., Hernández, C. X., Husic, B. E., Eastman, P., Schwantes, C. R., Beauchamp, K. A., McGibbon, R. T., and Pande, V. S. (2017). MSMBuilder: Statistical models for biomolecular dynamics. Biophys. J., 112(1):10–15.

[Husic et al., 2017] Husic, B. E., McKiernan, K. A., Wayment-Steele, H. K., Sultan, M. M., and Pande, V. S. (2017). A minimum variance clustering approach produces robust and interpretable coarse-grained models. J. Chem. Theory Comput. Just Accepted Manuscript.

[Husic and Pande, 2017] Husic, B. E. and Pande, V. S. (2017). Ward clustering improves cross-validated Markov state models of protein folding. J. Chem. Theory Comput., 13(3):963–967.

[Jones et al., 2001] Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python.

[Kearnes et al., 2016] Kearnes, S., McCloskey, K., Berndl, M., Pande, V., and Riley, P. (2016). Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des., 30(8):595–608.

[Kohlhoff et al., 2014] Kohlhoff, K. J., Shukla, D., Lawrenz, M., Bowman, G. R., Konerding, D. E., Belov, D., Altman, R. B., and Pande, V. S. (2014). Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nature Chem., 6(1):15–21.

[Lin, 1991] Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.

[Liu et al., 2017] Liu, B., Ramsundar, B., Kawthekar, P., Shi, J., Gomes, J., Nguyen, Q. L., Ho, S., Sloane, J., Wender, P., and Pande, V. (2017). Retrosynthetic reaction prediction using neural sequence-to-sequence models. arXiv preprint arXiv:1706.01643.

[Martin et al., 2017] Martin, E. J., Polyakov, V. R., Tian, L., and Perez, R. C. (2017). Profile-QSAR 2.0: Kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds. J. Chem. Inf. Model., 57(8):2077–2088.

[McCammon et al., 1977] McCammon, J. A., Gelin, B. R., and Karplus, M. (1977). Dynamics of folded proteins. Nature, 267(5612):585–590.

[McKiernan et al., 2017] McKiernan, K. A., Husic, B. E., and Pande, V. S. (2017). Modeling the mechanism of CLN025 beta-hairpin formation. J. Chem. Phys., 147(10):104107.

[Müllner, 2013] Müllner, D. (2013). fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. J. Stat. Soft., 53(1):1–18.

[Olsson et al., 2017] Olsson, S., Wu, H., Paul, F., Clementi, C., and Noé, F. (2017). Combining experimental and simulation data of molecular processes via augmented Markov models. Proc. Natl. Acad. Sci., 114(31):8265–8270.

[Plattner et al., 2017] Plattner, N., Doerr, S., De Fabritiis, G., and Noé, F. (2017). Complete protein–protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling. Nat. Chem., 9(10):1005–1011.

[Schneider et al., 2016] Schneider, N., Stiefl, N., and Landrum, G. A. (2016). What's what: The (nearly) definitive guide to reaction role assignment. J. Chem. Inf. Model., 56(12):2336–2346.

[Shaw et al., 2008] Shaw, D. E., Deneroff, M. M., Dror, R. O., Kuskin, J. S., Larson, R. H., Salmon, J. K., Young, C., Batson, B., Bowers, K. J., Chao, J. C., Eastwood, M. P., Gagliardo, J., Grossman, J. P., Ho, C. R., Ierardi, D. J., Kolossváry, I., Klepeis, J. L., Layman, T., McLeavey, C., Moraes, M. A., Mueller, R., Priest, E. C., Shan, Y., Spengler, J., Theobald, M., Towles, B., and Wang, S. C. (2008). Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM, 51(7):91–97.

[Shirts and Pande, 2000] Shirts, M. and Pande, V. S. (2000). Screen savers of the world unite! Science, 290(5498):1903–1904.

[Vitalini et al., 2016] Vitalini, F., Noé, F., and Keller, B. (2016). Molecular dynamics simulations data of the twenty encoded amino acids in different force fields. Data in Brief, 7:582–590.

[Ward, 1963] Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc., 58(301):236–244.

[Zhou et al., 2017] Zhou, G., Pantelopulos, G. A., Mukherjee, S., and Voelz, V. A. (2017). Bridging microscopic and macroscopic mechanisms of p53-MDM2 binding with kinetic network models. Biophys. J., 113(4):785–793.

Appendix A: Clustering with other objective functions
It is interesting to contrast single, average, and complete linkage functions for hierarchical agglomerative clustering with Ward's method. The following analysis was performed using SciPy [Jones et al., 2001].

Figure 4: Dendrograms created from clustering 42 transition matrices using \(D_{\sqrt{\mathrm{JS}}}\) and single (top), average (center), and complete (bottom) linkage objective functions. These dendrograms, which do not identify intuitively natural groups, can be contrasted with Fig. 2 (top). In all three cases, the clustering algorithm cannot separate the system dynamics according to the location of the proline, and can only separate the two glycine-containing tripeptides from the other 40 systems. All three alternative objective functions identify many other singletons, and no coherent groups can be obtained from the results.
[Figure: dendrograms from clustering the solubility-dataset compounds using \(D\) with single, average, and complete linkage objective functions, with panels labeled "Shortest distance," "Average distance," and "Greatest distance."]