HIERARCHICAL CLUSTERING AND ZEROTH PERSISTENT HOMOLOGY
İSMAİL GÜZEL AND ATABEY KAYGUN

Abstract. In this article, we show that hierarchical clustering and the zeroth persistent homology do deliver the same topological information about a given data set. We show this fact using cophenetic matrices constructed out of the filtered Vietoris-Rips complex of the data set at hand. As in any cophenetic matrix, one can also display the inter-relations of zeroth homology classes via a rooted tree, also known as a dendrogram. Since homological cophenetic matrices can be calculated for higher homologies, one can also sketch similar dendrograms for higher persistent homology classes.
1. Introduction
An overview.
In this article, we compare the topological information coming from hierarchical clustering algorithms and persistent homology. In the barcode representation of persistent homology, one only keeps a record of the dimensions of the persistent homology groups in the form of life-time intervals. However, persistent homology classes carry a rich combinatorial structure, and one can do more than just counting them. Carlsson expresses the same idea as a question in [23, Ch. 8] and [6]:

The dendrogram can be regarded as the "right" version of the invariant 𝜋₀ in the statistical world of finite metric spaces. The question now becomes if there are similar invariants that can capture the notions of higher homotopy groups or homology groups.

The central problem.
The central problem this article aims to answer is whether hierarchical clustering and the zeroth persistent homology deliver the same topological information. The solution we found relies on writing a cophenetic matrix for persistent homology classes using purely homological information coming from the changing scale parameter.
A bridge between hierarchical clustering and persistent homology.
We analyze how arbitrary persistent homology classes of all degrees "merge" as the scale parameter changes, on top of recording the life-times of these classes. We also investigate what type of representations would be more appropriate to display this combinatorial information. We solve both of these problems by forming a bridge between zeroth persistent homology and hierarchical clustering in the form of a cophenetic matrix. Since cophenetic matrices already exist for hierarchical clustering, one can now determine whether these fundamentally different methods do indeed yield the same information.

Cophenetic matrices [29, 28, 16] in combination with the Mantel test [21, 20] are the most widely used non-parametric statistical tools in biology and ecology for comparing different phylogenetic trees, and therefore, are uniquely appropriate for our purposes. Moreover, since we have a cophenetic matrix in every homological degree, we are also in a position to use all of the statistical tools available for analyzing connected components of a data set à la hierarchical clustering, for the higher topological invariants represented by higher Betti numbers.
An answer to Carlsson’s question.
In forming the bridge between hierarchical clustering and zeroth persistent homology, we also found that the answer to the question raised by Carlsson in [6] comes from algebraic topology: cobordisms. Dendrograms are 1-dimensional cobordism classes of disjoint unions of points. For higher homology classes, one has to resort to cobordisms of 𝑛-spheres. For example, for persistent homology in degree 1 such cobordisms are given by oriented genus-𝑔 Riemann surfaces with finitely many punctures, and the classification of such 2-manifolds is complete. Unfortunately, in dimensions 2 and higher such cobordisms are very difficult to classify. We are going to investigate the special case of persistent homology of degree 1 in an upcoming paper [18].
Prior art.
Topological data analysis (TDA) is a new data analysis discipline whose fundamentals straddle both very abstract and concrete sub-disciplines of mathematical research. Even though the theoretical roots of TDA are firmly placed in algebraic topology, to solve its computational needs it heavily uses computational geometry and numerical linear algebra. Since TDA relies on the topology of the ambient space from which data sets are sampled, rather than a particular metric structure, in theory TDA is more suitable for extracting information from high-dimensional and large-volume data sets compared to standard machine learning algorithms.

Clustering algorithms, on the other hand, have been around for a long time and they form an important and well-understood class of machine learning algorithms [19, 28, 16, 20, 7]. For each data set, these algorithms aim to deliver an optimal partition whose subsets are supposed to show a high degree of heterogeneity between, and a high degree of homogeneity within, each subset. The hierarchical clustering algorithms that we consider in this paper extract their results based solely on the metric structure of the ambient space where the data set is embedded. In addition, they use a convenient tree representation called a dendrogram to display the information on how these clusters merge as the underlying scale parameter changes [17].

Similar to clustering algorithms, the TDA methods we investigate in this paper rely on a changing scale parameter. But instead of relying on the metric structure alone, these methods propose using persistent homology to compute topological invariants of a data set. Persistent homology was first introduced to investigate topological simplifications of alpha shapes [12], but later extended to arbitrary dimensional spaces [32]. The topological invariants that persistent homology identifies are the Betti numbers defined for every natural number 𝑛. For instance, the Betti number for 𝑛 = 0 counts the number of connected components, and persistent homology records how long such invariants persist as the scale parameter changes.

Plan of the article.
This paper is organized as follows. We give the necessary background material we need on hierarchical clustering in Section 2, and we do the same for persistent homology in Section 3. In Section 4, we introduce the cophenetic matrix for zeroth persistent homology as the desired bridge between hierarchical clustering and persistent homology. The results of our numerical experiments are given in Section 5, and in Section 6 we present our detailed analysis of our experimental work in the light of the theoretical discussions we presented in the earlier sections. We also propose several avenues of future work in the same section. The source code and the data of the numerical experiments we conducted in the paper can be found on the authors' GitHub page at https://github.com/ismailguzel/TDA-HC.
Acknowledgment.
The first author was supported by Research Fund Project Number TDK-2020-42698 of the Istanbul Technical University. During the writing of this article, the second author was on sabbatical leave at Queen's University, Canada, and would like to thank both Istanbul Technical University and Queen's University for their support.

2. Hierarchical Clustering

Assume we have a connected metric space (𝑋, 𝑑), and let 𝜋₀(𝑋) be the set of connected components of 𝑋. Assume we have a finite random sample of points 𝐷 ⊆ 𝑋 taken from 𝑋 whose distribution we do not know. Our aim is to deduce any information about the set of connected components of 𝑋 using 𝐷. We are going to do this by finding a finite clustering of 𝐷, which is a set function 𝑐 : 𝐷 → ℕ such that each cluster 𝑐⁻¹(𝑖) lies within a distinct connected component for each 𝑖 ∈ ℕ.

2.1. Hierarchical clustering.
In its simplest form, in hierarchical clustering we have a function 𝑐𝜀 : 𝐷 → ℕ for each scale parameter 𝜀 > 0. This function satisfies 𝑐𝜀(𝑥) = 𝑐𝜀(𝑦) for any two points 𝑥, 𝑦 ∈ 𝐷 when there is a sequence of points 𝑥₀, . . . , 𝑥𝑚 ∈ 𝐷 such that 𝑑(𝑥ᵢ, 𝑥ᵢ₊₁) < 𝜀 for every 𝑖 = 0, . . . , 𝑚 − 1 with 𝑥₀ = 𝑥 and 𝑥𝑚 = 𝑦. Notice that the clustering algorithm is monotone in the sense that if 𝑐𝜀(𝑥) = 𝑐𝜀(𝑦) then 𝑐𝜂(𝑥) = 𝑐𝜂(𝑦) for every 𝜂 > 𝜀. Moreover, since 𝐷 ⊆ 𝑋 is finite and 𝑋 is connected, there is a large enough scale parameter 𝜀 > 0 for which 𝑐𝜀 yields a single cluster.

2.2. Variants of hierarchical clustering.
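The clustering function 𝑐𝜀 described above can be sketched with a union-find pass over all close pairs. The helper below (`cluster_at_scale`) is our own illustration rather than the paper's code, and assumes points are given as coordinate tuples:

```python
from itertools import combinations
from math import dist

def cluster_at_scale(points, eps):
    """Partition indices of `points`: x and y share a cluster iff a chain of
    points with consecutive gaps < eps connects them (single linkage)."""
    parent = list(range(len(points)))

    def find(i):
        # representative of i's cluster, with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # merge the clusters of every pair closer than eps
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) < eps:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The monotonicity noted above is visible here: raising `eps` can only add merging pairs, so clusters only grow.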
We noticed in the previous section that as we increase the scale parameter 𝜀 > 0, points merge into ever larger clusters. Since we replace points with clusters as we form clusters, we are going to need to calculate distances between clusters. See Algorithm 1.

procedure Cluster({𝑥₁, . . . , 𝑥𝑁}, 𝜀)
    C ← ∅
    for 𝑖 from 1 to 𝑁 do
        Add {𝑥ᵢ} as a cluster to C
    end for
    repeat
        Find a distinct pair (𝐶ᵢ, 𝐶ⱼ) in C such that 𝑑(𝐶ᵢ, 𝐶ⱼ) < 𝜀
        Remove the clusters 𝐶ᵢ and 𝐶ⱼ from C
        Add the new cluster 𝐶ᵢ ∪ 𝐶ⱼ to C
    until 𝑑(𝐶ᵢ, 𝐶ⱼ) ⩾ 𝜀 for all pairs of clusters
    return C
end procedure

Figure 1. Clustering function pseudocode.

For a fixed 𝜀 > 0, let us use 𝐶ᵢ = 𝑐𝜀⁻¹(𝑖) to denote a cluster, and set 𝑛ᵢ = |𝐶ᵢ|. Let us use 𝑑ᵢⱼ for the distance between the clusters 𝐶ᵢ and 𝐶ⱼ. Lance and Williams [19] introduced the following general formula for calculating distances between clusters:

𝑑(𝑖𝑗)𝑘 = 𝛼ᵢⱼ 𝑑ᵢₖ + 𝛼ⱼᵢ 𝑑ⱼₖ + 𝛽 𝑑ᵢⱼ + 𝛾 |𝑑ᵢₖ − 𝑑ⱼₖ|

for parameters 𝛼ᵢⱼ, 𝛽 and 𝛾 to be determined. Here, 𝑑(𝑖𝑗)𝑘 denotes the distance between the cluster 𝐶ₖ and the single cluster 𝐶ᵢⱼ = 𝐶ᵢ ∪ 𝐶ⱼ obtained by merging 𝐶ᵢ and 𝐶ⱼ. We list the parameters for commonly used methods of calculating distances between clusters in Table 1. See [19] for details.

Method     𝛼ᵢⱼ                        𝛽                        𝛾
Single     1/2                        0                        −1/2
Complete   1/2                        0                        1/2
Average    𝑛ᵢ/(𝑛ᵢ + 𝑛ⱼ)               0                        0
Ward       (𝑛ᵢ + 𝑛ₖ)/(𝑛ᵢ + 𝑛ⱼ + 𝑛ₖ)   −𝑛ₖ/(𝑛ᵢ + 𝑛ⱼ + 𝑛ₖ)       0

Table 1. Lance-Williams parameters for commonly used methods of computing 𝑑(𝑖𝑗)𝑘; the parameter 𝛼ⱼᵢ is obtained from 𝛼ᵢⱼ by exchanging 𝑖 and 𝑗.

2.3. Dendrograms.
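The Lance-Williams recurrence above can be sketched as a single update function. This is our own sketch, not the paper's implementation, and it hard-codes only the single, complete, and average linkage rows:

```python
# (alpha_ij, alpha_ji, beta, gamma) as functions of the cluster sizes
LW_PARAMS = {
    "single":   lambda ni, nj, nk: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ni, nj, nk: (0.5, 0.5, 0.0, 0.5),
    "average":  lambda ni, nj, nk: (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
}

def lw_update(d_ik, d_jk, d_ij, ni, nj, nk, method="single"):
    """Distance d_(ij)k from the merged cluster C_i | C_j to C_k."""
    a_ij, a_ji, beta, gamma = LW_PARAMS[method](ni, nj, nk)
    return a_ij * d_ik + a_ji * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)
```

For the single and complete rows the update reduces to min(d_ik, d_jk) and max(d_ik, d_jk), respectively, which is why these two methods need no cluster sizes at all.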
Dendrograms are the most common presentation of the results of hierarchical clustering obtained from a sample set 𝐷. They display information about how clusters merge when one increases the distance scale 𝜀. The topology of the tree structure of any dendrogram has two important pieces: nodes and stems. The nodes of the dendrogram represent clusters at a given scale parameter, and the lengths of stems represent the distances at which any two clusters merge. See Figure 2.

Figure 2. Dendrogram terminology explained.

3. Persistent Homology

3.1. Point clouds and simplicial complexes.
In hierarchical clustering, a collection of points given in an ambient metric space carries no information other than the distances between them. Derived information such as cophenetic matrices also relies on this metric structure. However, there are other tools to derive more information about the topology of the data set at hand. One of these useful tools is a simplicial complex.
An abstract simplicial complex 𝐾 in a space 𝑋 is a collection of subsets of 𝑋 such that for any two 𝑥, 𝑦 ∈ 𝐾 one also has 𝑥 ∩ 𝑦 ∈ 𝐾. There are two variants of simplicial complexes that are of interest for us: Vietoris-Rips complexes and Čech complexes.

3.1.1. Vietoris-Rips complexes.
Given a point cloud 𝐷, the Vietoris-Rips complex associated with 𝐷 is defined to be the simplicial complex whose simplices are the subsets of 𝐷 whose points are pairwise at most 𝜀 apart:

𝑅𝜀(𝐷) = {𝜎 ⊆ 𝐷 | 𝑑(𝑥, 𝑦) ⩽ 𝜀 for all 𝑥, 𝑦 ∈ 𝜎}
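The Vietoris-Rips complex just defined can be built by brute force for small point clouds directly from the definition. A sketch of ours (exponential in the worst case, so only for illustration):

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, eps, max_dim=2):
    """All simplices of R_eps(D) with at most max_dim + 1 vertices:
    subsets of D whose points are pairwise at most eps apart."""
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps}
    simplices = [(i,) for i in range(n)]      # vertices
    for k in range(2, max_dim + 2):           # k vertices give (k-1)-simplices
        for sigma in combinations(range(n), k):
            if all(pair in close for pair in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices
```

Note that membership of a simplex is decided by pairwise distances alone, which is exactly what distinguishes the Vietoris-Rips complex from the Čech complex defined next.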
Given a point cloud 𝐷 , the Čech complex associated with 𝐷 is definedto be the simplicial complex given by C 𝜀 ( 𝐷 ) = (cid:40) 𝜎 ⊆ 𝐷 | (cid:217) 𝑥 ∈ 𝜎 𝐵 𝜀 ( 𝑥 ) ≠ ∅ (cid:41) . In other words, a collection of points 𝜎 = ( 𝑥 , . . . , 𝑥 ℓ ) forms an ℓ -simplex if the set of balls ofradius 𝜀 centered at these points has non-empty intersection.3.2. Choosing an appropriate scale parameter.
In order to turn a point cloud 𝐷 into a simplicial complex, we are going to use the Vietoris-Rips complex 𝑅𝜀(𝐷) associated with 𝐷 for a chosen proximity parameter 𝜀 > 0. We then try to capture the topological features of the data by changing the parameter 𝜀. As we see in Figure 3, we may not capture all of the topological features of the data for a given proximity 𝜀. Finding the optimal value of 𝜀 for a given data set 𝐷 is a challenging problem.

Figure 3. Vietoris-Rips complexes with increasing values of the parameter 𝜀.

Edelsbrunner-Letscher-Zomorodian [12] and Carlsson-Zomorodian [32] proposed that persistent homology might help to determine an optimal value for 𝜀. In persistent homology, one records the longevity of each topological feature (in this case homology classes of 𝑅𝜀(𝐷)) of a given data set as the proximity parameter 𝜀 changes. One does this by observing the persistence of these topological features depending on 𝜀.

3.3. Persistent Homology.
Let {𝐾𝜀 | 𝜀 ∈ ℝ₊} be a filtration on a simplicial complex 𝐾. In other words, each 𝐾𝜀 is a simplicial complex with 𝐾𝜀 ⊆ 𝐾𝜂 for every 𝜀 < 𝜂, and we have 𝐾 = ⋃𝜀>0 𝐾𝜀. The 𝑘-th persistent homology of 𝐾 is given by

PH𝑘(𝐾) := {𝐻𝑘(𝐾𝜀)}𝜀∈ℝ₊

together with the collection of linear maps 𝜓ᵏ𝜀,𝜂 : 𝐻𝑘(𝐾𝜀) → 𝐻𝑘(𝐾𝜂) induced by the inclusion maps 𝐾𝜀 ↪ 𝐾𝜂 for all 𝑘 ∈ ℕ and 𝜀 < 𝜂 in ℝ₊.
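For 𝑘 = 0, the persistent homology of a Vietoris-Rips filtration just defined is elementary to compute: every class is born at 𝜀 = 0, and a class dies each time a minimum-spanning-tree edge joins two components. The sketch below is our own illustration, not the paper's code; it returns the zeroth barcode as (birth, death) pairs:

```python
from itertools import combinations
from math import dist

def h0_barcode(points):
    """Zeroth persistent homology of the Vietoris-Rips filtration of `points`:
    n bars born at 0; deaths are the single-linkage (MST) merge scales."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for w, i, j in edges:          # Kruskal: each MST edge kills one class
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)
    return [(0.0, d) for d in deaths] + [(0.0, float("inf"))]
```

The single infinite bar reflects the fact that 𝛽₀ is eventually 1 once 𝜀 grows large.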
3.4. Bar codes.
Persistent homology produces a collection of intervals depending on the parameter 𝜀 where we store the life-time of topological features of the point cloud. Here, by life-time we mean the interval on which a homology cycle is non-trivial as 𝜀 ranges from 0 to ∞. We record both the birth, i.e. when a topological feature appears, and the death, i.e. when a topological feature disappears, as 𝜀 increases. To illustrate the life-time, we use barcodes as introduced by Carlsson et al. [8] and Ghrist [15].

In a barcode, we place the basis vectors for the homology on the vertical axis, whereas the horizontal axis represents the life span of each basis element in terms of the scale parameter 𝜀. When we draw a vertical line at a particular 𝜀ᵢ, the number of intersecting line segments in the barcode is the dimension of the corresponding homology group, i.e. the Betti number, for that parameter 𝜀ᵢ. See Figure 4.

Figure 4. An example barcode.

In Figure 4, one can see barcodes for zeroth and first persistent homology together with the Vietoris-Rips complex corresponding to a particular 𝜀. For example, the blue horizontal line whose left endpoint is at 0.22 and whose right endpoint is at 0.25 represents a nonzero element in 𝐻₁(𝑅₀.₂₂) that persisted until 𝐻₁(𝑅₀.₂₅), at which point it either disappeared or merged with another class.

We will postulate that the longest living topological features in the barcode are the genuine topological features of the point cloud, whereas the shorter ones can be seen as artificial artifacts of the method we use. Notice also that there will always be one connected component as 𝜀 grows large, i.e. the zeroth Betti number 𝛽₀ is always going to be 1 eventually.

4. A Bridge Between Persistent Homology and Hierarchical Clustering

4.1. Cophenetic matrix.
An important notion we need in studying and comparing clustering methods is the cophenetic matrix [29, 28, 16]. Assume we have a clustering function 𝑐𝜀 : 𝐷 → ℕ, and let C = {𝐶ᵢ = 𝑐𝜀⁻¹(𝑖) | 𝑖 ∈ ℕ}. Let 𝜀ᵢⱼ be the proximity level at which the clusters 𝐶ᵢ and 𝐶ⱼ merge to form 𝐶ᵢⱼ for the first time. We record these numbers in the cophenetic matrix 𝐶(𝐷) = (𝜀ᵢⱼ) for every pair of clusters 𝐶ᵢ and 𝐶ⱼ. The cophenetic distance is a metric under the assumption of monotonicity [27].

4.2. Homological cophenetic distance.
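The classical cophenetic matrix just described is implemented in SciPy's `scipy.cluster.hierarchy` module. A short sketch, assuming SciPy and NumPy are available and using a toy point cloud of our own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

# three collinear points: pairs (0,1), (0,2), (1,2) merge at scales 1, 2, 2
pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
Z = linkage(pts, method="single")   # single-linkage merge history
coph = cophenet(Z)                  # condensed cophenetic distances
```

Here `cophenet(Z)` returns the entries 𝜀ᵢⱼ in condensed (upper-triangular) order, which is the form most SciPy distance routines expect.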
Given a point cloud 𝐷, we consider the Vietoris-Rips complex 𝑅𝜀(𝐷). By gradually increasing 𝜀 we get a filtered simplicial complex, and thus we can calculate the persistent homology associated with this filtration. Recall that when we have a filtered simplicial complex {𝑅𝜀}𝜀>0, we have homology groups {𝐻𝑘(𝑅𝜀)}𝜀 and connecting linear maps 𝜓ᵏ𝜀,𝜂 : 𝐻𝑘(𝑅𝜀) → 𝐻𝑘(𝑅𝜂) for every pair 𝜀 < 𝜂 and for every 𝑘 ∈ ℕ. We would like to emphasize that even though 𝑅𝜀 ⊆ 𝑅𝜂, the induced maps in homology 𝜓ᵏ𝜀,𝜂 need not be injections.

Now, for each linearly independent pair of homology classes 𝛼 and 𝛽 in 𝐻𝑘(𝑅𝜀), one can test if 𝜓ᵏ𝜀,𝜂(𝛼) and 𝜓ᵏ𝜀,𝜂(𝛽) are still linearly independent in 𝐻𝑘(𝑅𝜂). If the pair 𝜓ᵏ𝜀,𝜂(𝛼) and 𝜓ᵏ𝜀,𝜂(𝛽) fails to be linearly independent, we will say that the two classes 𝛼 and 𝛽 merged at time 𝜂. Thus we can define the 𝑘-th homological cophenetic distance as

𝐷𝑘(𝛼, 𝛽) = inf{𝜂 ⩾ 𝜀 | 𝜓ᵏ𝜀,𝜂(𝛼) and 𝜓ᵏ𝜀,𝜂(𝛽) are linearly dependent in 𝐻𝑘(𝑅𝜂)}.

4.3. The zeroth homology and hierarchical clustering.
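For 𝑘 = 0 the homological cophenetic distance has a direct description: two classes become linearly dependent in 𝐻₀(𝑅𝜂) exactly when their connected components have merged by scale 𝜂. A pure-Python sketch of ours computing the resulting matrix:

```python
from itertools import combinations
from math import dist

def h0_cophenetic_matrix(points):
    """Entry (a, b) is the smallest eps at which the classes of points a and b
    become linearly dependent in H_0(R_eps), i.e. land in one component."""
    n = len(points)
    comps = [{i} for i in range(n)]
    C = [[0.0] * n for _ in range(n)]
    for w, i, j in sorted((dist(points[i], points[j]), i, j)
                          for i, j in combinations(range(n), 2)):
        ci = next(c for c in comps if i in c)
        cj = next(c for c in comps if j in c)
        if ci is not cj:                  # two classes merge at scale w
            for a in ci:
                for b in cj:
                    C[a][b] = C[b][a] = w
            comps.remove(ci)
            comps.remove(cj)
            comps.append(ci | cj)
    return C
```

By construction this coincides with the single-linkage cophenetic matrix, which is precisely the comparison the paper sets up.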
For the specific case of 𝛽₀, there is a simplification: all zeroth homology classes appear at 𝜀 = 0, and no new classes appear as 𝜀 goes to ∞ in the subsequent Vietoris-Rips complexes. Thus, it is enough to test whether the classes 𝜓⁰₀,𝜀(𝛼) and 𝜓⁰₀,𝜀(𝛽) are linearly independent in 𝐻₀(𝑅𝜀). Notice that since each point 𝑥 ∈ 𝐷 is a homology class in 𝐻₀(𝑅₀), and points also mark the rows and columns of the cophenetic matrix 𝐶(𝐷) coming from hierarchical clustering, we can compare these matrices.

4.4. Mantel Test.
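The Mantel statistic used below can be sketched in pure Python. This helper is our own: we standardize over the upper-triangular entries, and the p-value is a one-sided permutation estimate:

```python
import random
from itertools import combinations
from statistics import mean, stdev

def mantel(X, Y, permutations=999, seed=0):
    """Normalized Mantel statistic r between two n x n distance matrices,
    with a permutation p-value (rows and columns of Y permuted together)."""
    n = len(X)
    pairs = list(combinations(range(n), 2))

    def r(order):
        a = [X[i][j] for i, j in pairs]
        b = [Y[order[i]][order[j]] for i, j in pairs]
        ma, mb, sa, sb = mean(a), mean(b), stdev(a), stdev(b)
        return sum((x - ma) / sa * (y - mb) / sb
                   for x, y in zip(a, b)) / (len(pairs) - 1)

    identity = list(range(n))
    r_obs = r(identity)
    rng = random.Random(seed)
    hits = sum(r(rng.sample(identity, n)) >= r_obs
               for _ in range(permutations))
    return r_obs, (hits + 1) / (permutations + 1)
```

Since the p-value depends on the random permutations drawn, repeated runs give the same r but slightly different p-values, as the paper notes in its conclusions.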
As we stated above, we need to compare dendrograms coming from different cophenetic matrices. For this purpose we are going to use the Mantel test [21], which is commonly used in biology and ecology. It is a non-parametric statistical method that computes the significance of the correlation between two distance matrices through permutations of the rows and columns of one of the input matrices.

We consider two distance or cophenetic matrices 𝐷₁ = (𝑥ᵢⱼ) and 𝐷₂ = (𝑦ᵢⱼ) of size 𝑛 × 𝑛. The normalized Mantel statistic 𝑟 is defined as

𝑟 = (2 / ((𝑛 − 2)(𝑛 + 1))) Σᵢ₌₁ⁿ Σⱼ₌ᵢ₊₁ⁿ ((𝑥ᵢⱼ − 𝑥̄)/𝑠ₓ) ((𝑦ᵢⱼ − 𝑦̄)/𝑠ᵧ)

where
(i) 𝑥̄ and 𝑦̄ are averages of all entries of each matrix, and
(ii) 𝑠ₓ and 𝑠ᵧ are the standard deviations for 𝑥 and 𝑦.

Note that the normalizing factor 2/((𝑛 − 2)(𝑛 + 1)) is 1/(𝑑 − 1) where 𝑑 = 𝑛(𝑛 − 1)/2 is the number of distinct pairs. The test statistic is the Pearson product-moment correlation coefficient 𝑟 ∈ [−1, 1]. A value in the neighborhood of −1 indicates a strong negative correlation, while a value near +1 indicates a strong positive correlation. If the calculated statistic is unlikely to have been obtained under the null-hypothesis, then the null-hypothesis is rejected. See [20, Sect. 10.5] for details.

5. Experiments

To determine if our approach is sound, we performed numerical experiments to compare the dendrograms we obtained from the Euclidean distance and the dendrograms we obtained from the cophenetic distance for the zeroth persistent homology. In this section, we summarize these experiments.

5.1. A sample of cities in Turkey.
For our first experiment, we used a subset of 24 cities in Turkey whose coordinates are encoded as longitudes and latitudes in radians. See Figure 5.

Figure 5. A sample of cities in Turkey.

5.1.1. Bar codes and dendrograms.
The left hand side of Figure 6 is the dendrogram we obtained from the homological cophenetic distance matrix for the zeroth homology. The right hand side of Figure 6 is the ordinary barcode obtained from the zeroth persistent homology, which displays the birth and death times of each homology class, whereas the left hand side is the dendrogram that indicates which classes merge.

5.1.2. Comparison of dendrograms.
Next, we apply hierarchical clustering (with single linkage), using the Euclidean distance matrix 𝐸(𝐷) and the homological cophenetic distance matrix 𝐶(𝐷) for the zeroth persistent homology. The resulting dendrograms are given in Figure 7. Then, in Figure 8, we align the labels from both dendrograms without changing the underlying cluster structure. In tanglegram representations, one compares the tree structures using a metric derived from matches between labels placed on branches [26, 13, 5].

For the next phase, we need to compare dendrograms. We are going to use the Mantel test (see Section 4.4) for this task. The resulting statistic is a measure of how well the labels of the two dendrograms are aligned. For the sample of cities we used, the Mantel statistic value we obtained for the matrices 𝐸(𝐷) and 𝐶(𝐷) was 0.98, with a statistically significant p-value.

All of the computational tools we use in this section come from the dendextend package [14] and the vegan package [25] of the R programming language. For the map, we used Generic Mapping Tools [31].
Figure 6. Hierarchically enriched barcodes and classical barcodes in TDA.

Figure 7. Two dendrograms: one from the homological cophenetic distance, and the other from the Euclidean distance.

5.2. Random point clouds.
In our second experiment, we sampled 20 points uniformly at random from the unit square [0, 1] × [0, 1], and then we compared their Euclidean distance and homological cophenetic distance matrices using the Mantel test. We repeated this procedure 100 times. The median statistic was 0.94, again with statistically significant p-values.

Figure 8. Tanglegram of the dendrograms in Figure 7.

Figure 9. A histogram of Mantel statistics from random point clouds.

6. Conclusions and Future Work

6.1. Conclusions.
The results of our numerical experiments in Section 5 indicate that there is a strong positive correlation between the dendrograms from the homological cophenetic distance matrix 𝐶(𝐷) and the dendrograms from the Euclidean distance matrix 𝐸(𝐷). The p-values we obtained indicate that our results are statistically significant. Note that since this test is based on random permutations, repeated runs will always yield the same observed correlation 𝑟 but seldom the same p-value.
The statistical evidence we collected supports our hypothesis that hierarchical clustering and zeroth persistent homology yield the same topological information about the connected components of the sampled manifold using completely different methods. While hierarchical clustering relies exclusively on the metric structure, persistent homology relies on simplicial machinery to derive its results. The highly correlated nature of the results comes from the fact that the Vietoris-Rips complex is derived from the same metric structure used in hierarchical clustering. However, the homological machinery opens new avenues for statistical data analysis in different directions.

6.2. Future work.
One can extend the results of this article in different directions.
(i) One can investigate the topological structure of the data by replacing the zeroth homology with persistent homology in higher degrees, or
(ii) one can replace the metric structure with a pure topology where no metric may exist for the data at hand, or
(iii) one can enrich the barcode representation of persistent homology using combinatorial structures such as matroids.

The first avenue for extension, namely extending our results by replacing the zeroth homology with higher persistent homology, is going to be the subject matter of a future article [18]. However, the visual representation of the results would require deep topological results, since one has to deal with higher cobordisms of 𝑛-spheres [30] if one is to develop a similar theory for the 𝑛-th persistent homology. For the first persistent homology, the cobordisms we would need are given by genus-𝑔 Riemann surfaces with punctures. Fortunately, there is a complete classification of such surfaces [11], and one can display the cobordism results using a representation similar to the dendrograms that one uses to display the results of hierarchical clustering. Unfortunately, for higher dimensional homology, the cobordisms require higher dimensional manifolds with finitely many punctures for which no classification exists.

The second avenue is of particular interest if the data at hand cannot be easily embedded in an affine space. This is often the case when one deals with categorical data that require different techniques than numerical data [2]. We have shown that, provided one can define a simplicial complex out of data sets whose features are purely or partially categorical, the homological cophenetic distance would yield usable information about the data set on par with hierarchical clustering.

Another exciting avenue of research would be employing matroids to analyze the combinatorics of homology classes.
Recall that our homological cophenetic distance essentially records when two given homology classes become linearly dependent. In a suitable extension, we would need to record the homological information in terms of linear dependence relations of finite subsets of basis elements of homology groups as the filtration parameter changes in persistent homology. However, the combinatorics of linear dependence relations of more than two elements is far too complicated to be represented by a simple cophenetic matrix. The natural mathematical structure that records and allows a rigorous analysis of such dependency relations for finite sets of elements is a matroid [3]. This requires an extension of barcodes by including the relevant matroids.

References

[1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier,
Persistence images: a stable vector representation of persistent homology, The Journal of Machine Learning Research, 18 (2017), pp. 218–252.
[2] A. Agresti, An Introduction to Categorical Data Analysis, Wiley, 2007.
[3] A. Björner, M. Las Vergnas, B. Sturmfels, N. White, and G. M. Ziegler, Oriented Matroids, vol. 46 of Encyclopedia of Mathematics and its Applications, Cambridge University Press, Cambridge, second ed., 1999.
[4] P. Bubenik, Statistical topological data analysis using persistence landscapes, The Journal of Machine Learning Research, 16 (2015), pp. 77–102.
[5] K. Buchin, M. Buchin, J. Byrka, M. Nöllenburg, Y. Okamoto, R. I. Silveira, and A. Wolff, Drawing (complete) binary tanglegrams, in International Symposium on Graph Drawing, Springer, 2008, pp. 324–335.
[6] G. Carlsson, Persistent homology and applied homotopy theory, 2020.
[7] G. Carlsson and F. Mémoli, Characterization, stability and convergence of hierarchical clustering methods, The Journal of Machine Learning Research, 11 (2010), pp. 1425–1470.
[8] G. Carlsson, A. Zomorodian, A. Collins, and L. J. Guibas, Persistence barcodes for shapes, International Journal of Shape Modeling, 11 (2005), pp. 149–187.
[9] Y.-M. Chung and A. Lawson, Persistence curves: a canonical framework for summarizing persistence diagrams, arXiv preprint arXiv:1904.07768, (2019).
[10] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, Stability of persistence diagrams, Discrete & Computational Geometry, 37 (2007), pp. 103–120.
[11] S. Donaldson, Riemann Surfaces, vol. 22 of Oxford Graduate Texts in Mathematics, Oxford University Press, Oxford, 2011.
[12] H. Edelsbrunner, D. Letscher, and A. Zomorodian, Topological persistence and simplification, in Proceedings 41st Annual Symposium on Foundations of Computer Science, IEEE, 2000, pp. 454–463.
[13] H. Fernau, M. Kaufmann, and M. Poths, Comparing trees via crossing minimization, Journal of Computer and System Sciences, 76 (2010), pp. 593–608.
[14] T. Galili, dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering, Bioinformatics, 31 (2015), pp. 3718–3720.
[15] R. Ghrist, Barcodes: the persistent topology of data, Bulletin of the American Mathematical Society, 45 (2008), pp. 61–75.
[16] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., 1988.
[17] S. C. Johnson, Hierarchical clustering schemes, Psychometrika, 32 (1967), pp. 241–254.
[18] A. Kaygun and İ. Güzel, Matroids, cobordisms and topological data analysis, in preparation.
[19] G. N. Lance and W. T. Williams, A general theory of classificatory sorting strategies: 1. Hierarchical systems, The Computer Journal, 9 (1967), pp. 373–380.
[20] P. Legendre and L. Legendre, Numerical Ecology, Elsevier, 3rd ed., 2012.
[21] N. Mantel, The detection of disease clustering and a generalized regression approach, Cancer Research, 27 (1967), pp. 209–220.
[22] E. Merelli, M. Rucco, P. Sloot, and L. Tesei, Topological characterization of complex systems: using persistent entropy, Entropy, 17 (2015), pp. 6872–6892.
[23] H. Miller, Handbook of Homotopy Theory, CRC Press, 2020.
[24] C. Moon, N. Giansiracusa, and N. A. Lazar, Persistence terrace for topological inference of point cloud data, Journal of Computational and Graphical Statistics, 27 (2018), pp. 576–586.
[25] J. Oksanen, F. G. Blanchet, M. Friendly, R. Kindt, P. Legendre, D. McGlinn, P. R. Minchin, R. O'Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, E. Szoecs, and H. Wagner, vegan: Community Ecology Package, 2019. R package version 2.5-6.
[26] C. Scornavacca, F. Zickmann, and D. H. Huson, Tanglegrams for rooted phylogenetic trees and networks, Bioinformatics, 27 (2011), pp. i248–i256.
[27] T. Sergios and K. Konstantinos, Pattern Recognition, Academic Press, Boston, fourth ed., 2009.
[28] P. H. Sneath, R. R. Sokal, et al., Numerical Taxonomy: The Principles and Practice of Numerical Classification, W. H. Freeman and Company, San Francisco, 1973.
[29] R. R. Sokal and F. J. Rohlf, The comparison of dendrograms by objective methods, Taxon, 11 (1962), pp. 33–40.
[30] R. E. Stong, Notes on Cobordism Theory, Mathematical Notes, Princeton University Press, Princeton, N.J.; University of Tokyo Press, Tokyo, 1968.
[31] P. Wessel, J. F. Luis, L. Uieda, R. Scharroo, F. Wobbe, W. H. F. Smith, and D. Tian, The Generic Mapping Tools version 6, Geochemistry, Geophysics, Geosystems, 20 (2019), pp. 5556–5564.
[32] A. Zomorodian and G. Carlsson, Computing persistent homology, Discrete & Computational Geometry, 33 (2005), pp. 249–274.
Email address: [email protected]

Email address: [email protected]