Persistent topology of protein space
PPersistent topology of protein space
W. Hamilton, J.E. Borgert, T. Hamelryck, J.S. Marron
Abstract
Protein fold classification is a classic problem in structural biologyand bioinformatics. We approach this problem using persistent homology.In particular, we use alpha shape filtrations to compare a topologicalrepresentation of the data with a different representation that makes use ofknot-theoretic ideas. We use the statistical method of Angle-based Jointand Individual Variation Explained (AJIVE) to understand similaritiesand differences between these representations.
In this study we used persistent homology (PH) and Wasserstein distances be-tween topological representations to quantify the geometry and topology of pro-tein folds. We present our analysis of the resulting geometry of “protein space”,and compare our results to existing protein structure quantification approaches,notably the Gaussian Integral Tuned (GIT) [RF03] vector representation.A major goal is to understand what insights the PH approach reveals thatGIT vectors do not. We applied statistical methodology developed for multi-block data to contrast the two methods in a quantifiable and interpretable man-ner. Multi-block data refers to the setting where disparate sets of features on acommon set of samples are represented by multiple data blocks. The method,Angle-based Joint and Individual Variation Explained (AJIVE) [FJHM18], de-composes the data blocks into joint and individual modes of variation. ApplyingAJIVE to the data blocks generated by each representation (GIT and PH) al-lows us to compare what was jointly captured by both methods and, moreinterestingly, contrast the individual features. In particular, clustering of thePH individual approximation finds topologically similar clusters of proteins thatare not identified by the GIT representation.We computed the t-distributed stochastic neighbor embedding (t-SNE) [vdMH08]coordinates of each data matrix for visual comparison. The t-SNE method is anonlinear dimensionality reduction technique that is useful for embedding high-dimensional data in lower dimensional space for visualization; moreover, t-SNEtends to separate data into clusters and is very popular for visualizing clustereddata. Hierarchical clustering with Wards linkage on the t-SNE coordinates ofthe approximated PH individual matrix finds clusters of proteins in this viewthat are not clustered together in the GIT view, as shown by Figure 1. On1 a r X i v : . [ q - b i o . B M ] F e b he left is the t-SNE plot of the PH individual matrix with proteins coloredby cluster label and on the right is the t-SNE plot of the original GIT vectorsmatrix with the same coloring. Recall that the PH individual matrix gives uscomponents of the variation that are captured by PH but not by the GIT vec-tors. Since the clusters are tight in the left plot of Figure 1 and spread out inthe right plot, we see that the topological similarities of these proteins identifiedby PH are not detected by the GIT approach.Figure 1: The 10 most homogeneous clusters from performing hierarchical clus-tering with Wards linkage on t-SNE coordinates of PH individual matrix. Onthe left, t-SNE of PH individual matrix; on the right, t-SNE coordinates of GITvectors. Clusters are tight in the PH individual plot but appear spread all overin the GIT plot. Peter Røgen provided helpful comments and suggestions regarding the GIT vec-tor discussion. The research of Elyse Borgert and J. S. Marron were partiallysupported by the grants NIH/NIAMS, P30AR072580 and NIH/NIH, R21AR074685.
Alpha complexes and alpha filtrations are families of triangulated surfaces asso-ciated to a point cloud in R or R [EKS83, Ede95], and have been proposed aseffective geometrizations of protein tertiary structure [WSS09]. In this sectionwe describe the construction of alpha shapes and alpha filtrations and illustratethese objects on a protein chain in Figure 2.Alpha complexes are one approach to defining a combinatorial structure ona point cloud in Euclidean space. The idea is to place a ball of some fixed radiusaround each point, and connect points with an edge if the corresponding ballsintersect, connect three points with a triangle if all three balls intersect, etc.Since we want each edge, triangle, etc., to be geometrically meaningful, we alsoask that the balls are intersected with the Voronoi cells of their centers. Recall2igure 2: Left: a visualization of the protein chain 1HFE-T, referring to chainT of the structure with identifier 1HFE in the RCSB Protein Data Bank (PDB)[BBB + C α atoms used to build the alphafiltration. Center-right: the alpha shape for 1HFE-T for alpha= 1 .
7. Thegrowing balls in the alpha shape construction are indicated in red, edges betweenpoints are indicated in black, and triangles between triples of points are shown inblue. At this level, the 5 α -helices in 1HFE-T can be inferred from the locationsof triangles, and the hole they surround has as boundary the union of red balls.Right: the alpha shape for 1HFE-T for alpha= 2 .
6. The locations of α -helicesare less clear, but the prominent hole surrounded by the α -helices is still clearlyvisible.that the Voronoi cell V x i of a point x i in a point cloud x , x , ..., x N is the set V x i = { x : | x i − x | ≤ | x j − x | for j (cid:54) = i } . This intersection requirement ensuresthat edges cannot “jump over” points to connect non-“adjacent” points in thepoint cloud, and that no combinatorial objects of dimension higher than theambient space can appear. For an explicit definition of an alpha complex X α ,let B α ( x i ) = { x : | x − x i | ≤ α and x ∈ V x i } be the ball of radius α centeredat x i and define the k -skeleton of X α , denoted X ( k ) α , to be the collection of all k -simplices in the complex: X ( k ) α = { ( x i , ..., x i k ) : ∩ kj =1 B α ( x i j ) (cid:54) = ∅} . The alpha complex for radius α is then the union of k -skeletons for k = 1 , , ..., d for a point cloud in R d .One natural question is: what radius of ball should we use to construct thealpha complex? Note that if α < α , we have an inclusion of complexes X α ⊂ X α . This enables us to keep track of the alpha complexes for various radii andutilizing a multiscale view of the data. Mathematically, a filtration is a family ofcomplexes X ⊂ X ⊂ · · · X n ⊂ · · · such that X i ⊂ X i +1 , and an alpha filtrationis a family of alpha complexes parametrized by the radius of balls used in theconstruction. Even though we can vary the radius continuously, new simplices inan alpha filtration are added at discrete stages. Thus, in practice, this family ofalpha complexes for a continuously varying radius can be completely describedby a discrete family of alpha complexes X α ⊂ X α ⊂ · · · ⊂ X α n for some largestradius α n . Filtrations are computationally useful when we consider persistenthomology and intrinsic topological features across scales.In Figure 2 we show a 3d render of the protein chain 1HFE-T (left) [BBB + C α atoms (center-left) and two alpha complexes in3he filtration (center-right, right). Note that the average distance between adja-cent C α atoms along a protein is approximately 3.5 ˚A. Using balls of radius 1.7(center-right), about half the average intermolecular distance, the alpha shapehas a single connected component. Moreover, the 5 α -helices of 1HFE-T arevisible as the regions with many triangles present. Using balls of radius 2.6(right) the individual α -helices are no longer visible and the “hole” in the centerof the protein is more clearly defined. Persistent homology (PH) is a mathematical approach to quantify multiscaletopological features, such as connected components (0-dimensional), non-contractibleloops, cavities (1-dimensional), and higher-dimensional analogues. At each stageof a filtration topological features are computed, and then matched with the cor-responding features at other stages (if the features still exist). The collectionof all persistent features across the filtration for dimension d is referred to as H k , or H k ( X α ) to emphasize the filtration used. For an introductory article see[Ghr08, Car09], and for a textbook introduction see [EH10, Ghr14].The PH of a filtration is summarized as a persistence diagram (PD), some-times called a barcode. A PD is a multiset of points { ( b i , d i ) } i that encodethe filtration index where a feature (indexed by i ) appears ( b i ) and where thatfeature disappears ( d i ). In the literature b i is called the birth time, and d i thedeath time. Points ( b i , d i ) farther from the diagonal have a larger persistence,which is the difference between the birth and death time, and are interpretedas intrinsic topological features. Since a PD consists of points in R , we canvisualize the topological features of a point cloud by plotting its PD, as shownin the left of Figure 3; the plotted points are features, whose x -coordinates areusually the birth time and whose y -coordinates are usually the death time. Analternative visualization is to plot a horizontal line for each feature, whose ini-tial x -coordinate is its birth time b i and terminal x -coordinate its death time d i .The y -axis corresponds to persistent feature index, and the user has freedom tochoose how the intervals are sorted. Popular choices are to sort the intervalsby their birth time, or by their death time. The right plot in Figure 3 showsthe barcode representation for the same PD displayed in the left plot, with thefeatures sorted by birth time.As an example of PDs and their visualization, we revisit the alpha filtrationbuilt on 1HFE-T and consider its 1-dimensional persistent features H in Figure3. In the center the alpha complex for alpha = 2 . x = 2 .
6. 4igure 3: Left: the persistence diagram (PD) of the protein chain 1HFE-T.Each dot corresponds to a 1-dimensional feature that appear anywhere in thefiltration, and the blue dot corresponds to the persistent “hole” seen in thealpha shapes of 1HFE-T. Center: the alpha shape of 1HFE-T for alpha = 2 . x = 2 .
7, indicating the radius of balls usedto construct the alpha shape in the center plot. Notice that the vertical lineonly intersects the horizontal red line, and none of the black lines, meaning thatthe alpha shape only has one 1-dimensional feature present.Fast algorithms exist to compute PDs using a variety of different mathe-matical frameworks: the first algorithms used a birth-death simplex matchingmethod [ZC05], and recent algorithms have used discrete morse-theoretic re-ductions to simplify the filtration [MN13] and matroid factorizations [Hen17] tospeed up or reduce the complexity of these computations. In this study we usedDionysus [Mor] to compute PH of alpha filtrations on proteins.
For two PDs D = { ( b i , d i ) } i and D = { ( b j , d j ) } j , the q -Wasserstein distanceis defined as the cost of the optimal matching between the two point sets: W q ( D , D ) := inf γ : D ∗ → D ∗ (cid:88) x ∈ D ∗ (cid:107) x − γ ( x ) (cid:107) q ∞ /q , where D ∗ i := D i ∪ { ( x, x ) : x ∈ [0 , ∞ ) } and the infimum ranges over all bijectivemaps γ : D ∗ → D ∗ [EH10, PZ20]. Points along the diagonal are included in thematching process since PDs often contain a different number of points. Sincediagonal points have the same birth and death time, they do not correspondto any topological features appearing in the filtration. Figure 4 illustrates anoptimal matching between two persistence diagrams: the top-left and top-rightplots contain PDs, and the bottom-left plot has the two PDs overlayed. In thebottom-right plot, the optimal matching is shown as black lines representingminimal length matchings of features to features, or features to the diagonal.5he (squared) 2-Wasserstein distance between these two PDs is the sum of thesquared L ∞ norms of the black lines.Figure 4: An illustration of the Wasserstein distance between two persistencediagrams. Top row: two persistence diagrams. Bottom left: the persistencediagrams overlayed. Bottom right: the optimal matching between the diagramsindicated by solid black bars. W q gives a metric on the space of PDs for 1 ≤ q ≤ ∞ . In our study weused the 2-Wasserstein distance W to compute distances between PDs, sincewe are interested in comparing PDs based on every feature, in contrast to the ∞ -Wasserstein distance which only accounts for the matched pair of PD pointsthat are furthest apart. We used Dionysus [Mor] for our Wasserstein distancecomputations, which relies on novel geometric approaches to efficiently computeoptimal matchings [KMN17]. One approach to quantifying protein structure is through representations calledGaussian Integral Tuned (GIT) vectors. This approach interprets the structureof a protein chain as a curve in 3D Euclidean space that is characterized usinga vector in R consisting of geometric invariants [RF03, Roe05] originating inknot theory. The inspiration for this technique is through a quantity called thewrithe Wr of a closed curve γ : [0 , → R , defined asWr( γ ) = 12 π (cid:90) (cid:90) 63% of proteins in the red box do not haveArchitecture label 60, while in the GIT plot 45 . 62% of the proteins do not haveArchitecture label 60. This suggests that the Wasserstein metric clusters pro-teins with similar CATH labels slightly more compactly than the GIT vectorsdo. In the center-right column, proteins with Class and Architecture labels 2and 60 are plotted with colors corresponding to their Topology label. The redboxes bound proteins with Topology label 40, which are the immunoglobulin-like proteins. In the MDS plot, 36 . 14% of the proteins in the red box do nothave Topology label 40, while this percentage is 47 . 29% in the GIT vector plot.Again, this suggests that the global topological summaries and the Wassersteinmetric more tightly cluster proteins with similar CATH labels, as comparedto the GIT vector representation. The black colored proteins are precisely theimmunoglobulins.Since we are most interested in aspects of the data found by PH that are notcontained in the GIT vectors, we compared the two approaches using AJIVE.The results from each method were given as data blocks, X (PH) and X (GIT), to the AJIVE algorithm. The estimated joint components in the AJIVEdecomposition, ˆ J and ˆ J , revealed features of protein structure captured wellby both methods. More interestingly, features identified by PH that were notidentified using the GIT approach were revealed by the estimated PH individualmatrix, ˆ I .The first two components of the common normalized scores (CNS) fromAJIVE are shown in Figure 9. The CNS represent a common basis of row ( ˆ J ),where row ( ˆ J ) denotes the estimated common score subspace of the joint matri-ces J and J . On the left, the kernel density estimate shows two denser regionsin the center of the plot. The middle panel shows separation of the Class labels.The green points, representing the class of mixed α -helices and β -sheets (Classlabel 3), are concentrated in the dense center regions of the density estimate.Note that the red points, which represent the α -helix class of proteins, appearto lie in a distinct region of the density estimate and have noticeable clusters.The right panel shows a drill-down of the red points with proteins colored bytheir Architecture label, the second level of CATH classification. Separation byArchitecture label is not clear and clusters do not appear homogeneous. Exam-ination of the CNS shows separation by Class label is a joint feature of PH andGIT vectors, but separation by Architecture label is not well captured by thejoint components.Next, we examined the individual block specific scores (BSS), which give theestimated matrices ˆ I and ˆ I in the AJIVE decomposition. In particular, we areinterested in proteins that are closely related by PH but not by GIT. We com-12igure 8: A drilldown analysis comparing the tightness of protein clusters viathe Wasserstein metric and GIT vector representation. In the top row arescatter plots using the MDS coordinates for the Wasserstein metric betweenproteins, in the bottom row are scatter plots of GIT vectors. Left column: thePC 1 and 2 plots of the data with Class label colors. Middle column: scatterplots of just the Class 2 proteins (mostly β -sheets) using Architecture labelsfor colors. The red boxes bound proteins with Architecture label 60 (SandwichArchitecture). In the MDS plot (top) 38 . 63% of the proteins in the red box donot have Architecture label 60, while 45 . 62% do not have Architecture label 60in the GIT plot (bottom). Right column: scatter plots of just the Class 2 andArchitecture 60 proteins using Topology labels for colors. The red boxes boundproteins with Topology label 40 (Immunoglobulin-like proteins). In the MDSplot (top) 36 . 14% proteins in the red box do not have Architecture label 60,while 47 . 29% do not have Architecture label 60 in the GIT plot (bottom).puted the t-SNE coordinates of the PH individual BSS matrix and performedhierarchical clustering on these coordinates using Wards linkage and numberof clusters n = 300. The clustering was performed on the t-SNE coordinatesrather than the original individual matrices since t-SNE gives a useful repre-sentation for identifying and visualizing clusters. We then selected the 10 mosthomogeneous clusters in terms of their Class and Architectures labels, wherehomogeneity was measured by calculating the entropy in each label of each ofthe 300 clusters. Figure 1 shows the results of this analysis. On the left, we seethe clusters are grouped tightly in the t-SNE coordinates of the PH individualmatrix. On the right, these same clusters in the t-SNE coordinates of the fulldata block of the GIT vectors are scattered. Recall that the full data block, X for the GIT vectors, is composed of both the joint and individual variationmatrices. These clusters demonstrate that PH individually finds aspects of thedata not captured by GIT vectors. Interestingly, this analysis reveals 8 proteinscompletely homogeneous in their Class and Architecture label that are clustered13igure 9: Plots of the common normalized scores (CNS) from AJIVE. Theleft plot shows the kernel density estimate of the first two components. Themiddle plot shows the first two components with proteins colored by Class label.Separation of Class label is a mode of variation common to both methods. Onthe right, α -helices colored by their Architecture label.tightly in the PH individual view but not in the GIT view. Protein fold classification can be done by considering proteins as curves in space,characterizing these curves by their corresponding 30-dimensional GIT vectorsand subsequently identifying clusters of GIT vectors. Such clusters correspondto proteins with similar folds. The great advantage of this method is that a com-putationally expensive pairwise comparison of protein structures is replaced byefficient clustering of vectors. Here we explored a similar approach, but insteadof considering protein structures as curves in Euclidean space characterized byGIT vectors, we treat protein structures as topological objects characterized byvectors obtained from PH.The AJIVE analysis provides a decomposition of the data that allows usto meaningfully compare our approach to existing methods. Moreover, theindividual components of the MDS coordinates returned by AJIVE indicatethat PH perceives aspects of protein topology different from the GIT vectors. References [BBB + 20] Stephen K Burley, Charmi Bhikadiya, Chunxiao Bi, Sebastian Bit-trich, Li Chen, Gregg V Crichlow, Cole H Christie, Kenneth Dalen-berg, Luigi Di Costanzo, Jose M Duarte, Shuchismita Dutta, ZukangFeng, Sai Ganesan, David S Goodsell, Sutapa Ghosh, Rachel KramerGreen, Vladimir Guranovi´c, Dmytro Guzenko, Brian P Hudson,Catherine L Lawson, Yuhe Liang, Robert Lowe, Harry Namkoong,Ezra Peisach, Irina Persikova, Chris Randle, Alexander Rose, YanaRose, Andrej Sali, Joan Segura, Monica Sekharan, Chenghua Shao,14i-Ping Tao, Maria Voigt, John D Westbrook, Jasmine Y Young,Christine Zardecki, and Marina Zhuravleva. RCSB Protein DataBank: powerful new tools for exploring 3D structures of biologi-cal macromolecules for basic and applied research and educationin fundamental biology, biomedicine, biotechnology, bioengineeringand energy sciences. Nucleic Acids Research , 49(D1):D437–D451, 112020.[CAH + 10] Vincent B. Chen, W. Bryan Arendall, III, Jeffrey J. Headd, Daniel A.Keedy, Robert M. Immormino, Gary J. Kapral, Laura W. Murray,Jane S. Richardson, and David C. Richardson. MolProbity : all-atom structure validation for macromolecular crystallography. ActaCrystallographica Section D , 66(1):12–21, Jan 2010.[Car09] Gunnar Carlsson. Topology and data. Bull. Amer. Math. Soc.(N.S.) , 46(2):255–308, 2009.[Ede95] H. Edelsbrunner. The union of balls and its dual shape. DiscreteComput. Geom. , 13(3-4):415–440, 1995.[EH10] Herbert Edelsbrunner and John L. Harer. Computational topology .American Mathematical Society, Providence, RI, 2010. An introduc-tion.[EKS83] Herbert Edelsbrunner, David G. Kirkpatrick, and Raimund Seidel.On the shape of a set of points in the plane. IEEE Trans. Inform.Theory , 29(4):551–559, 1983.[FJHM18] Qing Feng, Meilei Jiang, Jan Hannig, and J. S. Marron. Angle-based joint and individual variation explained. J. Multivariate Anal. ,166:241–265, 2018.[Ghr08] Robert Ghrist. Barcodes: the persistent topology of data. Bull.Amer. Math. Soc. (N.S.) , 45(1):61–75, 2008.[Ghr14] R. Ghrist. Elementary Applied Topology . Createspace, 2014.[GHR20] C. Grønbæk, T. Hamelryck, and P. Røgen. Gisa: using gauss inte-grals to identify rare conformations in protein structures. PeerJ , 8,2020.[Hen17] Gregory F. Henselman. Matroids And Canonical Forms: Theoryand Applications . ProQuest LLC, Ann Arbor, MI, 2017. Thesis(Ph.D.)–University of Pennsylvania.[Jol02] I. T. Jolliffe. Principal component analysis . Springer Series in Statis-tics. Springer-Verlag, New York, second edition, 2002.[KMN17] Michael Kerber, Dmitriy Morozov, and Arnur Nigmetov. Geometryhelps to compare persistence diagrams. ACM J. Exp. Algorithmics ,22:Art. 1.4, 20, 2017. 15Lab10] The Richardson Laboratory. The top8000 protein dataset, 2010.[MN13] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtra-tions and efficient computation of persistent homology. DiscreteComput. Geom. , 50(2):330–353, 2013.[Mor] D. Morozov. Dionysus .[PZ20] Victor M Panaretos and Yoav Zemel. An invitation to statistics inWasserstein space . Springer Nature, 2020.[RF03] P. Roegen and B. Fain. Automatic classification of protein structureby using gauss integrals. PNAS , 2003.[Roe05] P. Roegen. Evaluating protein structure descriptors and tuning gaus-sian integral based descriptors. J. Phys.: Condens. Matter , 2005.[SDL + 19] I. Sillitoe, N. Dawson, T.E. Lewis, S. Das, J.G. Lees, P. Ashford,A. Tolulope, H. M. Scholes, I. Senatorov, A. Bujan, F. C. Rodriguez-Conde, B. Dowling, J. Thornton, and C.A. Orengo. Cath: expandingthe horizons of structure-based functional annotations for genomesequences. Nucleic Acids Res. , 2019.[Tor52] Warren S. Torgerson. Multidimensional scaling. I. Theory andmethod. Psychometrika , 17:401–419, 1952.[vdMH08] Laurens van der Maaten and Geoffrey Hinton. Visualizing data usingt-sne. Journal of Machine Learning Research , 9(86):2579–2605, 2008.[WSS09] P. Winter, H. Sterner, and P. Sterner. Alpha shapes and proteins. IEEE , 2009.[ZC05] Afra Zomorodian and Gunnar Carlsson. Computing persistent ho-mology.