TopoMap: A 0-dimensional Homology Preserving Projection of High-Dimensional Data

Abstract

Multidimensional Projection is a fundamental tool for high-dimensional data analytics and visualization. With very few exceptions, projection techniques are designed to map data from a high-dimensional space to a visual space so as to preserve some dissimilarity (similarity) measure, such as the Euclidean distance for example. In fact, although adopting distinct mathematical formulations designed to favor different aspects of the data, most multidimensional projection methods strive to preserve dissimilarity measures that encapsulate geometric properties such as distances or the proximity relation between data objects. However, geometric relations are not the only interesting property to be preserved in a projection. For instance, the analysis of particular structures such as clusters and outliers could be more reliably performed if the mapping process gives some guarantee as to topological invariants such as connected components and loops. This paper introduces TopoMap, a novel projection technique which provides topological guarantees during the mapping process. In particular, the proposed method performs the mapping from a high-dimensional space to a visual space, while preserving the 0-dimensional persistence diagram of the Rips filtration of the high-dimensional data, ensuring that the filtrations generate the same connected components when applied to the original as well as projected data. The presented case studies show that the topological guarantee provided by TopoMap not only brings confidence to the visual analytic process but also can be used to assist in the assessment of other projection methods.


Harish Doraiswamy, Julien Tierny, Paulo J. S. Silva, Luis Gustavo Nonato, and Claudio Silva

Fig. 1. The result of mapping three-dimensional data to a 2D space using geometry preserving projections: Classical MDS, Isomap, t-SNE, UMAP; and the proposed topology preserving TopoMap method. While geometry preserving methods tend either to split connected components or to mix them up, TopoMap is guaranteed to preserve them, enabling more reliable analysis.


Index Terms: Topological data analysis, computational topology, high-dimensional data, projection.

1 INTRODUCTION

Multidimensional Scaling (MDS) accounts for the problem of embedding data in a Cartesian space while preserving intrinsic properties of the data. A particularly important task in the context of MDS is dimensionality reduction, which aims to map data from a $d$-dimensional to a $k$-dimensional Cartesian space where $k \ll d$. In the context of visualization, where the embedding space is 2D or 3D, MDS is typically called multidimensional projection (MDP).

• H. Doraiswamy and C. Silva are with New York University; J. Tierny is with CNRS and Sorbonne Université; P. J. S. Silva is with University of Campinas; and L. G. Nonato is with University of Sao Paulo, Sao Carlos.
• E-mail: {harishd,csilva}@nyu.edu.

Over the last decades, a multitude of MDP methods have been developed to map high-dimensional data to a visual space while preserving geometric properties such as the Euclidean distance between data objects. A main issue shared by all those methods is that the preservation of geometric properties can only be guaranteed under very particular conditions. Thus, errors and distortions are highly likely in the resulting mapping, introducing uncertainties to analytical procedures carried out from projection layouts [68]. For instance, structures observed in the point cloud resulting from a projection such as neighborhood relations might not be the ones existing in the original data, thus potentially leading inexperienced practitioners to wrong conclusions.

Although a number of alternatives have been proposed to render the analysis of projection layouts more reliable [54], few are focused on developing MDP methods with theoretical guarantees as to properties preserved by the mapping. Guaranteeing that a certain property is preserved exactly by the mapping makes the analytical process more reliable and meaningful, ensuring that what is seen is indeed what takes place in the high-dimensional space.

This work introduces TopoMap, a novel MDP technique that is guaranteed to preserve topological structures during the dimensionality reduction process. Specifically, TopoMap maps high-dimensional data to a visual space while preserving 0-homology (Betti-0) topological persistence as defined by a Rips filtration over the input data points. Intuitively, a Rips filtration grows a high dimensional ball around the data points, and adds an edge (or a high dimensional simplex) to the filtration when two (or more) balls intersect. In other words, the proposed method ensures that the topological filtrations over both the original as well as the projected data generate the same connected components at the same instances of the respective filtrations. The topological guarantee provided by TopoMap allows analysts to confidently explore high-dimensional data by visualizing which groups of objects are more tightly connected in the high-dimensional space. As we show in the provided case studies, visualizing persistent components via Betti-0 preserving projections enables an intuitive analytical process, making the identification of objects with similar properties an easier task. Moreover, in contrast to many distance-based (dissimilarity-based) projections, there is no uncertainty in the visual identification of groups (clusters) in the layout produced by TopoMap.

Besides enabling reliable mechanisms for data exploration, the proposed methodology can be used to assess and better understand distance-based projections. Since TopoMap is guaranteed to preserve the connected components of a particular neighborhood graph structure, one can rely on it to analyze how those connected components are mapped by other projection methods. As a result, one can further understand how distance-based MDP methods split or merge components, thus revealing regions with distortion.

In summary, the main contributions of the work are:

• A theoretical framework to support the design of a dimensionality reduction technique called TopoMap, which is guaranteed to preserve the Betti-0 topological persistence defined by the Rips filtration over the data.
• An optimization procedure that ensures the correct mapping of the connected components resulting from the filtration process.
• An exhaustive evaluation using both labeled data as well as case studies over unlabeled data showing the potential of TopoMap to support the analysis of high-dimensional data as well as distance-based projection techniques.

To the best of our knowledge, TopoMap is the first dimensionality reduction technique to provide guarantees as to the preservation of the topological properties of the Rips filtration of the data under analysis.

2 RELATED WORK

In order to better contextualize the proposed methodology, we organize the related work in two main parts: topological data analysis (which describes related work in topological data representations) and topology-based multidimensional projection (which describes how these topological data representations can drive projection methods).

2.1 Topological Data Analysis

Topology-based methods [24] have been very popular in the last two decades to support advanced data analysis and visualization tasks [37]. By providing a concise, structural representation of the data, these techniques greatly help in the visualization and analysis of the data. They have been applied successfully to a variety of domains, such as astrophysics [69, 73], biological imaging [2, 9, 14], chemistry [8, 32, 59], fluid dynamics [41], material sciences [34, 35, 72], or turbulent combustion [11, 33, 44].

The Rips filtration [5, 24] is often used to analyze the topology of high dimensional point clouds, and is motivated by the work by Chazal and Oudot [16] who showed that Rips filtrations can provably capture the homology of the manifold sampled by the point cloud.

Among the popular representations in topological data analysis, the Reeb graph [63] is obtained by contracting to a single point each of the connected components of level sets of an input scalar field, resulting in a characteristic skeleton-like representation of the input data. For discrete point cloud data, the Mapper [71] is an approximation of the Reeb graph of some user-defined function (often called lens function) defined over a nearest neighbor graph of the input point cloud. Another popular abstraction is the Morse-Smale complex [21], which is a cellular decomposition of the domain of an input scalar field, such that all the points of a given cell admit the same gradient integration extremities. For discrete point cloud data, the Morse-Smale complex has been used over the $k$-nearest neighbor graph of the input point cloud for clustering purposes [15]. As discussed next, all of these representations (Mapper, Reeb graph, Morse-Smale complex) have also been used as a driving data representation for dimensionality reduction.

Given the increasing popularity in using topology-based techniques for data analysis, it is not surprising that there are several open source tools and libraries available [1, 6, 12, 27, 49, 52, 53, 75].

2.2 Topology-based Multidimensional Projection

Multidimensional projection has long been a fundamental analytical tool, mainly in the context of visualization [42, 54]. In fact, the visualization community has not only proposed a number of MDP methods tailored to visual analytic tasks [39], but has also developed methodologies to facilitate the analysis of MDP distortions [3, 50] and to enrich MDP layouts so as to uncover information hidden in the projections [31, 40]. The extensive literature about MDS/MDP techniques has been organized over several books [10, 47] and surveys [19, 54, 77]. In order to emphasize our contribution, we focus only on techniques that explicitly rely on topological concepts to perform and assess multidimensional projections, disregarding distance preserving methods such as the classical MDS [47] and neighborhood preserving techniques such as LLE [66], t-SNE [76], and Lamp [39]. We refer interested readers to the above books and surveys for a broader discussion about MDS/MDP methods.

Isomap [74] is possibly one of the first MDS techniques to resort to topological mechanisms to accomplish dimensionality reduction. Isomap aims to capture the topological (manifold) structure of the data through a graph representation from which geodesic distances are computed. A number of variants of Isomap have been proposed, including Landmark (L-Isomap) [70], out-of-sample [7] and spatio-temporal extensions [38]. An interesting variant of Isomap is the method proposed by Lee and Verleysen [46], which tears a graph representation of the data so as to preserve essential (non-contractible) loops, thus enabling loop preserving manifold unfoldings. The recent work by Yan et al. [79] is another particularly interesting variant of Isomap (precisely, a variant of L-Isomap). Similar to previous work on skeletonization [43], this approach identifies cycles in the original data, but additionally aims at preserving these cycles when projecting the data to 2D. Specifically, it focuses on the Mapper (cf. Sect. 2.1) of a function defined on the KNN-graph structure of the data to select landmark points. The underlying idea is that the topology-based landmark selection captures the structure of the 1-dimensional homology groups of the data, which hopefully are preserved during the dimensionality reduction phase accomplished via regular L-Isomap. However, their approach does not take into account 0-homology groups, which are therefore not preserved. In contrast, the TopoMap method proposed in this work provides theoretical guarantees as to 0-dimensional homology group preservation, thus ensuring that the connected components visualized in the projection layout are the same as in the original high-dimensional data, according to its Rips filtration. Similar to Yan et al. [79], Gerber et al. [29, 30] introduced projection methods driven by topological data representations. In particular, their approach differs from our work in the sense that the introduced projections are driven by the network of cells of maximum dimension (called crystals) of the Morse-Smale complex (Sect. 2.1). They do not aim at specifically preserving the persistence diagram of the Rips complex as studied in this paper and therefore encode different information, specifically tailored for regression tasks.

In scientific visualization, Weber et al. [78] introduced a terrain metaphor to provide an intuitive visualization of the topological features present in a volume scalar field. Due to occlusion, these features can be challenging to visualize when represented in their original 3D space. This work addresses this issue by constructing a 2D terrain whose elevation is carefully designed, such that the contour tree of the elevation map matches the contour tree of the original data in 3D. The resulting elevation can also be displayed as a planar heat map and the original data points can in principle be projected to this planar layout, by inserting each 3D point in the 2D region corresponding to its arc in the contour tree. This method can be interpreted as topology preserving, as the contour tree of the 2D heatmap is guaranteed by construction to be equal to the contour tree of the original data in 3D. Note however, that the algorithm for constructing the terrain solely focuses on the contour tree and ignores the metric information coming from the original data. In particular, it places the root of the branch decomposition of the contour tree at the center of the layout and then arranges the children branches along a spiral trajectory [78]. This can have the effect of projecting in a small 2D neighborhood topological features which were originally arbitrarily far apart in 3D. Harvey and Wang [36] proposed algorithms to generate an ensemble of terrains each having the same contour tree as the input data.
However, the shortcomings described above apply to these terrains as well.

In a series of papers [55–58], Oesterling et al. extended this approach to the case of high-dimensional point clouds. This line of work is probably the most related to our approach. When extending the terrain metaphor to such data, the first difficulty is to derive a simplicial representation of the point cloud. In their work, Oesterling et al. suggest using a specific adjacency graph called the Gabriel graph [28]. The second challenge consists in deriving a scalar field on this graph which faithfully describes the data. The authors opt for a kernel density estimation of the point cloud (with a Gaussian kernel). From this point, the terrain metaphor [78] can be applied and the authors introduce various improvements [56, 58] based on contour profiles for instance [57].

In our work, by considering the Rips filtration, we focus our analysis on distances, while Oesterling et al. focus on densities. In that regard, these two approaches are complementary, just like distance-based and density-based clustering methods are complementary. More importantly, the two approaches differ in the way the layout of the data in 2D is computed. As discussed above, the terrain metaphor [78] provides a constructive approach for computing the output 2D layout which discards the metric information of the original space, as acknowledged by the authors [55]. Data points which are arbitrarily far in the original space can be projected arbitrarily close, and reciprocally. In contrast, our layout strategy enforces the preservation of the persistent homology of the Rips filtration. This makes it possible to better take into account the metric properties of the data, and to some extent be more faithful to its original geometry. This tends to preserve the spatial relations between clusters (which are not taken into account in terrain metaphors). For instance, in Fig. 1, the central clusters in the data (top row: red, middle row: blue) are indeed projected in between the other clusters with our method (top and middle row, right column).

A subtle, yet important, distinction between our work and terrain metaphors [55–58] is that our approach preserves topological features strictly when projecting the data to 2D. In particular, the 0-dimensional persistence diagram of the Rips filtration of the projected data is strictly equal to that of the high dimensional data, by construction. In contrast, terrain metaphors for high dimensional data [55–58] provide topology-preserving terrains, but not necessarily topology-preserving projections, as each data point is placed “at a random position along its (density) contour” [55]. Finally, note that to our knowledge, no public implementation of the terrain metaphors is available.

The recently introduced UMAP approach [51] is based on topological notions, namely category theory, while our approach focuses on Persistent Homology [24]. As reported by its authors, UMAP provides visual results which are highly similar to t-SNE. For this reason, it is often regarded as a faster, more modern and more scalable alternative to t-SNE, which still provides visually similar outputs.

Fig. 2. The ball growth model used to analyze the topological properties of point data sets. (a) Input data. (b)–(f) Different stages of the filtration with increasing diameter $\delta$. These stages correspond to the instants in the filtration when two components (0-cycles) merge into one. The edge from the Rips filtration responsible for this merge is also shown. Note that this collection of edges corresponds to the minimum spanning tree of the input points.

Topological tools have also been the basis of methods designed to evaluate the quality of dimensionality reduction techniques. A good example is the work by Rieck and Leitte [65], which assesses the quality of a projection technique from the 2nd Wasserstein distance between persistence diagrams computed from the original and projected data. In a follow up work, Rieck and Leitte [64] proposed the use of persistent homology to compare quality measures for dimensionality reduction, making it possible to analyze the agreement of multiple quality measures, thus identifying regions where different quality measures disagree the most. Persistent homology has also been employed by Paul and Chalup [60] as a mechanism to validate dimensionality reduction methods when applied to particular benchmark data. As we shall prove later, TopoMap is guaranteed to preserve connected components under filtration, and is therefore exact (no error) when comparing the 0-dimensional homology persistence diagrams generated by a filtration in the visual and original spaces respectively.

3 TOPOLOGY PRESERVING PROJECTION

Given a data set that is a collection of high-dimensional points in $\mathbb{R}^d$, a common topology-based approach to analyze this data is to study the evolution of cycles in the simplicial complexes resulting from a Euclidean distance based Rips filtration over these points. Our goal is to project the data onto $\mathbb{R}^2$ such that the above evolution for a subset of the cycles is preserved in the projected space as well.

In this section, we first introduce the necessary notations and formalize the problem that is of interest in this work. We refer the reader to Edelsbrunner and Harer [24] for a comprehensive discussion on these topics. Next, we describe a high level approach for solving the problem, and discuss different choices that can be made in the implementation of the high level solution.

Vietoris-Rips complex.

Let $P = \{p_1, p_2, \ldots, p_n\}$ be a set of points in $\mathbb{R}^d$. Given a distance threshold $\delta$, the Vietoris-Rips complex [24] (or Rips complex) is defined as the set of all $k$-simplexes $K \subseteq P$, $|K| = k + 1$, $k \geq 0$, such that $d(p_i, p_j) \leq \delta$, $\forall p_i, p_j \in K$. Here, $d(\cdot, \cdot)$ is the Euclidean distance. Intuitively, the Rips complex for a distance threshold $\delta$ captures the shape of the data when each point $p_i$ is replaced with a $d$-dimensional ball of diameter $\delta$ centered around it. For example, consider the 6 points in $\mathbb{R}^2$ shown in Fig. 2(a). Fig. 2(b)–2(f) illustrates this shape for 5 different values of the distance threshold $\delta$.

Rips Filtration.

Consider a model where the distance threshold $\delta$ is increased from 0 to $\infty$. That is, the $d$-dimensional balls are gradually grown in size. A Rips filtration captures this growth model.

Consider an ordered set of simplexes $\mathcal{K}_P = \{K_0 = \emptyset, K_1, K_2, \ldots, K_m\}$. Let $\delta_i$, $i \in [1, m]$, be the smallest distance threshold such that simplex $K_i$ is part of the Rips complex defined for $\delta_i$. Then, the above ordered set is a Rips filtration if $\forall i, j$, $i < j$:

1. $\exists\, l \leq i$ s.t. $K_i \cap K_j = K_l$; and
2. $\delta_i \leq \delta_j$.

Topological Persistence and Persistence Diagram.

Consider the growth as defined by the Rips filtration, wherein the simplexes from the filtration are added one at a time. That is, the $i$th iteration in this growth will consist of the subset $S_i = \{K_0, K_1, \ldots, K_{i-1}\}$. The addition of each new simplex can change the topology of the underlying data, where the topology is captured by the set of cycles in the simplicial complex defined by $S_i$. More specifically, a new $k$-cycle, $k \geq 0$, can either be created or an existing $k$-cycle can be destroyed [25]. Informally, a 0-cycle corresponds to a connected component, a 1-cycle to a loop, a 2-cycle to a void, and so on. Given one such $k$-cycle, let $\delta_c$ be the threshold at which this cycle is created, and $\delta_d$ the threshold at which it is destroyed. Then the topological persistence [25] of this $k$-cycle is defined as $\delta_d - \delta_c$, and intuitively captures the lifetime of this cycle in the given filtration. Note that a cycle that is not destroyed has a persistence equal to infinity.

The persistence diagram [17] plots all the cycles created during the filtration as a scatter plot, where the coordinates of the point corresponding to a cycle are its creation and destruction thresholds (i.e., the $x$- and $y$-axes of this plot correspond to the creation and destruction thresholds).

Problem Definition.

Let $PD^k_P$ denote the persistence diagram restricted to $k$-cycles computed using the Rips filtration over the point set $P$. Given a set of points $P = \{p_1, p_2, \ldots, p_n\}$ in $\mathbb{R}^d$, our goal is to compute a corresponding set of points $P' = \{p'_1, p'_2, \ldots, p'_n\}$ in $\mathbb{R}^2$ such that $PD^0_P = PD^0_{P'}$, where there is a one to one correspondence between the connected components or 0-cycles (i.e., a point $p_i$ belongs to a 0-cycle w.r.t. $P$ if, and only if, the point $p'_i$ belongs to the corresponding 0-cycle w.r.t. $P'$).

In other words, the Rips filtration over the projected points $P'$ not only has the exact same connected components during each iteration of the growth, but even the iterations at which they are created and destroyed are the same when compared to the Rips filtration over the high-dimensional points $P$.

Since we are interested only in the evolution of the set of connected components, it is sufficient to consider only the 0- and 1-simplexes (vertices and edges respectively) of the filtration. Consider the set of edges in the above filtration. Only a subset of these edges result in a change in topology, or in other words, merge two disconnected components into a single component. The following lemma bounds the number of such topology changing edges in the filtration.

Lemma 1.

Given a Rips filtration defined over a set of $n$ points, there are exactly $n - 1$ topology changing edges that result in reducing the number of 0-cycles.

Proof. Consider an input with $n$ points. At the beginning of the filtration, say at an infinitesimally small threshold $\varepsilon > 0$, there are a total of $n$ components each corresponding to an input point. The addition of each topology changing edge reduces the count of connected components by one. Since a single connected component remains once the threshold is large enough, there exist exactly $n - 1$ topology changing edges.

Let $\bar{\mathcal{K}}_P = \{\emptyset, p_1, p_2, \ldots, p_n, e_1, e_2, \ldots, e_{n-1}\} \subset \mathcal{K}_P$ be the subset of a filtration, where $p_i$, $1 \leq i \leq n$, are the input points and $e_i$, $1 \leq i < n$, correspond to the topology changing edges (in order of their appearance in $\mathcal{K}_P$). Note that we ignore all other edges in $\mathcal{K}_P \setminus \bar{\mathcal{K}}_P$, since they do not change the topology with respect to 0-cycles.

Consider only the ordered set of topology changing edges $\hat{K} = \{e_1, e_2, \ldots, e_{n-1}\}$ from the above filtration. By definition, the lengths of these edges satisfy $|e_1| < |e_2| < \ldots < |e_{n-1}|$. While these inequalities might not hold in practice (two consecutive edges could have the same length), a simulated small perturbation [26] of the points can ensure this property holds. The following lemma, which shows the equivalence between $\hat{K}$ and the Euclidean distance minimum spanning tree (EMST) computed over $P$, provides the basis for our projection algorithm.

Lemma 2.

Given a set of points $P = \{p_1, p_2, \ldots, p_n\}$, let $G$ be the complete weighted graph defined over $P$ such that the weight of each edge $(p_i, p_j)$ is equal to the Euclidean distance $d(p_i, p_j)$ between the corresponding end points. Then, the ordered set of topology changing edges $\hat{K} = \{e_1, e_2, \ldots, e_{n-1}\}$ is precisely the set of edges of the minimum spanning tree (MST), in increasing order of weight, computed over $G$.

Proof. We prove this by showing, through induction, that the ordered set of topology changing edges is the same as the sequence of edges added by Kruskal's algorithm [18].

Consider the first edge $e_1$ of the filtration. By definition, it is the edge with the smallest length, and thus also the first edge that is added by Kruskal's algorithm. Let edges $e_1, e_2, \ldots, e_{i-1}$ be the first $i - 1$ edges added by Kruskal's algorithm. Consider the $i$th topology changing edge of the filtration, $e_i$. For the sake of contradiction, say the $i$th edge added by Kruskal's algorithm is $e' \neq e_i$. This implies that the edge $e'$ has length less than that of $e_i$, and connects two connected subtrees together. Then, by definition, $e'$ will occur before $e_i$ in the filtration, and will also be a topology changing edge. Thus, the case of $e' \neq e_i$ is not possible.

The following proposition follows from the above lemmas and it guarantees the existence of a mapping (projection) that retains the topology with respect to 0-cycles in the projected space.

Proposition 1.

Let $P$ be a set of points in $\mathbb{R}^d$, $\hat{K} = \{e_1, e_2, \ldots, e_{n-1}\}$ be the ordered subset of topology changing edges in $\mathcal{K}_P$ and $C^i_P$ be the set of connected components obtained during the filtration over $\mathcal{K}_P$ after the addition of the first $i$ topology changing edges $\{e_1, e_2, \ldots, e_i\}$. Let $M : \mathbb{R}^d \to \mathbb{R}^k$ be a mapping that maps points in $P$ to $P'$, and $\hat{K}' = \{e'_1, e'_2, \ldots, e'_{n-1}\}$ be the set of topology changing edges in $\mathcal{K}_{P'}$. Then, there exists at least one mapping $M$ satisfying the following properties:

(a) edge lengths $|e'_i| = |e_i|$, $\forall i \in [1, n-1]$;
(b) the components generated by the filtrations are identical, i.e., $C^i_{P'} = C^i_P$, $\forall i \in [1, n-1]$; and
(c) $PD^0_{P'} = PD^0_P$, where $PD^0_{P'}$ and $PD^0_P$ are the persistence diagrams of $\mathcal{K}_{P'}$ and $\mathcal{K}_P$ respectively.

We abuse notation in the above proposition when stating that $C^i_{P'} = C^i_P$. What this notation means is that the mapping $M$ establishes a one-to-one relation between the components in $C^i_P$ and $C^i_{P'}$, that is, every point in a component $C' \in C^i_{P'}$ is the image of a point in the corresponding component $C \in C^i_P$. Note that in the above proposition, guaranteeing properties (a) and (b) is a sufficient condition for (c).

The proof of Proposition 1 is constructive and is provided in Sect. 3.3. In fact, using the above lemmas, we design the iterative algorithm shown in Procedure TopoMap that projects a set of high-dimensional points onto $\mathbb{R}^2$ while guaranteeing the properties stated in Proposition 1. The algorithm places the points onto a plane such that the minimum spanning tree edges are preserved.
In other words, it “draws” the minimum spanning tree maintaining the edge lengths.

Procedure TopoMap
Require: High dimensional points $P = \{p_1, p_2, \ldots, p_n\}$
 1: Compute the Euclidean minimum spanning tree $E_{mst}$ over $P$
 2: Let $E_{mst} = \{e_1, e_2, \ldots, e_{n-1}\}$ be the edges ordered on length
 3: Let $P' = \{p'_1, p'_2, \ldots, p'_n\}$, where $p'_i = (0, 0)$, $\forall i$
 4: Let $C_i = \{p'_i\}$ be the initial set of components
 5: for each $i \in [1, n-1]$ do
 6:     Let $(p_a, p_b)$ be the end points of edge $e_i$
 7:     Let $C_a$ be the component containing $p'_a$ and $C_b$ be the component containing $p'_b$
 8:     Place $C_a$ and $C_b$ in $\mathbb{R}^2$ s.t. $\min_{p'_j \in C_a,\, p'_k \in C_b} \{d(p'_j, p'_k)\} = \mathrm{length}(e_i)$
 9:     Let $C' = C_a \cup C_b$
10:     Remove $C_a$ and $C_b$ from the set of components, and add $C'$ into this set
11: end for
12: return $P'$

Fig. 3. Projecting the points from Fig. 2 in 1-dimensional space. Each iteration processes one edge (in increasing order of length) from the minimum spanning tree.

The algorithm initially maintains all the points as separate components, and stores the minimum spanning tree edges as an ordered set. In each iteration, the algorithm then adds the smallest edge from this ordered set to connect two components, thus reducing the number of maintained components by one. The length of the edge is preserved during this step, that is, its placement is such that the distance between the two components is equal to the length of the connecting edge that, in turn, has the same length as its counterpart in the original space. This edge is then removed from the edge set and the process is repeated until all edges are appropriately placed. The key aspect of the algorithm is Line 8, which places the points based on the minimum spanning tree edge lengths. We describe different ways of accomplishing this in the next section. The maintenance of the set of connected components is accomplished using the union-find data structure [18].
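As a minimal sketch of Procedure TopoMap, the following Python code computes the Euclidean MST with Kruskal's algorithm over a union-find forest and then performs the placement of Line 8 in its simplest form, shifting the second component along a single axis. The function names (`emst_edges`, `topomap_1d`, `merge_thresholds`) are illustrative and do not come from the paper; the quadratic enumeration of candidate edges and the list-based component bookkeeping are simplifying assumptions made for readability, not the authors' implementation.

```python
import itertools
import math


def emst_edges(points):
    """Kruskal's algorithm on the complete Euclidean graph.

    Returns the MST edges as (length, a, b), in increasing order of length.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    candidates = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in itertools.combinations(range(n), 2)
    )
    mst = []
    for length, a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge merges two components: topology changing
            parent[ra] = rb
            mst.append((length, a, b))
    return mst


def topomap_1d(points):
    """Place points on the x-axis so that every MST edge length is preserved."""
    n = len(points)
    x = [0.0] * n                         # line 3: every p'_i starts at the origin
    members = {i: [i] for i in range(n)}  # line 4: one component per point
    label = list(range(n))                # component id of each point
    for length, a, b in emst_edges(points):    # lines 1, 2 and 5
        la, lb = label[a], label[b]
        ca, cb = members[la], members.pop(lb)  # line 7
        # Line 8 (1D variant): shift C_b rigidly to the right of C_a so that
        # the gap between the two components equals the edge length.
        shift = max(x[i] for i in ca) + length - min(x[i] for i in cb)
        for i in cb:
            x[i] += shift
            label[i] = la
        ca.extend(cb)                     # lines 9 and 10: merge C_b into C_a
    return [(xi, 0.0) for xi in x]


def merge_thresholds(points):
    """Thresholds at which 0-cycles die in the Rips filtration (births are 0)."""
    return [length for length, _, _ in emst_edges(points)]
```

Because every 0-cycle is born at threshold 0 and one of them dies with each MST edge (a single component never dies), comparing `merge_thresholds(points)` with `merge_thresholds(topomap_1d(points))` gives a quick sanity check of the guarantee stated in Proposition 1.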

The TopoMap algorithm starts with $e_1$, the smallest topology changing edge in $\hat{K}$. Placing the end points of $e_1$ (which are individual components at the start of this procedure) in a lower dimensional space such that this distance is preserved is straightforward. In other words, $e_1 = e'_1$. Now, suppose that the first $i - 1$ edges have been placed so that properties (a) and (b) of Proposition 1 hold for them. In the $i$th step, let edge $e'_i$ be added so as to connect two components from $C^{i-1}_{P'}$ that are counterparts of the components in $C^{i-1}_P$ linked by $e_i$. As mentioned in Line 8 of the algorithm, the goal now is to place these two components such that the minimum distance between them is equal to $|e_i|$. If this condition is satisfied, then the properties $|e_j| = |e'_j|$ and $C^j_{P'} = C^j_P$, $\forall j \in \{1, \ldots, i\}$, are naturally satisfied. By repeating the process for all $e_i$, $i \in \{1, \ldots, n-1\}$, we also guarantee that $PD^0_{P'} = PD^0_P$. Therefore, proving Proposition 1 now requires showing that there exists a way to place two components connected by an edge $e'_i$ whose length is $|e_i|$. In fact, there are several ways in which this can be accomplished, as we show next. Note that there always exists a valid solution to this problem.

A solution in 1-dimensional space.

Consider two sets $C_1$ and $C_2$, where each point in these sets is associated with an $x$ and a $y$ value corresponding to its 2D coordinates. Let $p_r \in C_1$ be such that $p_r.x > p'.x$, $\forall p' \neq p_r \in C_1$. In other words, $p_r$ is the rightmost point in $C_1$. Similarly, let $p_l \in C_2$, with $p_l.x < p'.x$, $\forall p' \neq p_l \in C_2$, be the leftmost point in $C_2$. Let the input edge length be $d$. The trivial solution is to simply translate the points of both components such that $(p_r.x, p_r.y) = (-d/2, 0)$ and $(p_l.x, p_l.y) = (+d/2, 0)$, which leaves the two components exactly $d$ apart. Fig. 3 illustrates this procedure for the example shown in Fig. 2.

A geometric 2D solution.

Note that the above solution places all the points only along the $x$-axis. Thus, to obtain a more compact solution that also uses the second dimension, we modify the above solution as follows, by arranging components in the plane with local rotations, similarly to circular layout strategies in tree drawing [67]. Let $hull(C_1)$ and $hull(C_2)$ be the convex hulls of the two components. Pick an edge $e_t$ from $hull(C_1)$ and $e_b$ from $hull(C_2)$. Transform (rotate) $C_1$ such that $e_t$ is parallel to the $x$-axis, and is the topmost edge of the convex hull (i.e., has the highest $y$-coordinate). Similarly, transform $C_2$ such that $e_b$ is also parallel to the $x$-axis, but is the bottommost edge. Let $left(e)$ denote the left endpoint of the edge $e$. Now, translate component $C_1$ such that $left(e_t) = (0, 0)$, and component $C_2$ such that $left(e_b) = (0, d)$. Alternatively, the right endpoints of the edges $e_t$ and $e_b$ can be used as well to align the two components.

There are different ways in which the edge of the convex hull can be selected (as well as to decide which end point is used for the alignment). We decided to choose the edge that contains one of the end points of the minimum spanning tree edge that is under consideration. In case this is not possible, we choose the edge closest to this point. The intuition here is to not only preserve the connected components after every iteration, but to also try and preserve the end points of the minimum spanning tree edges as much as possible.

Fig. 4 illustrates this procedure for the example points in Fig. 2. Note that the addition of the first two edges results in the same state as in the 1D solution above. However, when the third edge ( p , p ) is added, the point p is placed in a perpendicular orientation. When the last edge ( p , p ) is processed, since both components have more than two points each, the convex hull is used to perform the alignment by appropriately transforming both components.

Fig. 4. Projecting the points from Fig. 2 in 2-dimensional space.

An optimization-based 2D solution.
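Before turning to the optimization model, the geometric hull-edge alignment described above can be sketched as follows. This is a simplified illustration under stated assumptions: each component has at least two distinct points, and the hull edge is chosen arbitrarily (the first counter-clockwise edge) rather than by proximity to the minimum spanning tree end points as in the text. All function names are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def align(points, edge_start, edge_end, target_angle, anchor, target):
    """Rotate so the edge points along target_angle, then move anchor to target."""
    phi = target_angle - math.atan2(edge_end[1] - edge_start[1],
                                    edge_end[0] - edge_start[0])
    c, s = math.cos(phi), math.sin(phi)

    def rot(p):
        return (c * p[0] - s * p[1], s * p[0] + c * p[1])

    ax, ay = rot(anchor)
    return [(x - ax + target[0], y - ay + target[1]) for x, y in map(rot, points)]


def place_by_hull_edges(comp_1, comp_2, d):
    """Place comp_1 below y = 0 and comp_2 above y = d, exactly d apart."""
    h1, h2 = convex_hull(comp_1), convex_hull(comp_2)
    # Assumption: take the first counter-clockwise hull edge of each component.
    a1, b1 = h1[0], h1[1]
    a2, b2 = h2[0], h2[1]
    # A CCW edge pointing in -x has the hull below it (topmost edge);
    # its left end point is b1, which goes to (0, 0).
    placed_1 = align(comp_1, a1, b1, math.pi, b1, (0.0, 0.0))
    # A CCW edge pointing in +x has the hull above it (bottommost edge);
    # its left end point is a2, which goes to (0, d).
    placed_2 = align(comp_2, a2, b2, 0.0, a2, (0.0, d))
    return placed_1, placed_2
```

Because the first component ends up entirely at or below y = 0 and the second entirely at or above y = d, no pair of points can be closer than d, and the two aligned left end points realize that distance up to floating-point rounding.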

During data analysis, in addition to preserving the topology changing edges of the filtration, it might be beneficial to also preserve other properties as much as possible. In this section, we show how our projection approach can be tuned to support such modifications.

For example, it might be natural to consider a case where we are also interested in keeping the resulting projection “compact”, in the sense that we want the points to be as close to each other as possible, while still ensuring that the distance between the two components, $C_1$ and $C_2$, is equal to the given value. One way of doing this is to minimize the sum of squared distances between the points in $C_1$ and $C_2$ after the placement. This can be achieved using the optimization model described next.

First, fix one of the components, say $C_1$, and expand it to contain all points in the plane that have distance to $C_1$ less than or equal to $d$ (this is the region which should not contain any point from $C_2$). This is achieved by considering lines that are parallel and at a distance $d$ to the edges of $hull(C_1)$. Since this expanded set of lines is also a convex hull, its inner region can be described by a set of linear inequalities on the plane. Let this set of linear inequalities be denoted by $A x \le b$, $A \in \mathbb{R}^{k \times 2}$, $b \in \mathbb{R}^{k}$.

The goal then becomes to find the rigid motion (rotation plus translation) that, applied to $C_2$, minimizes the sum of the squared distances to the expanded convex hull without penetrating it. Formally, this problem can be formulated as:

$$\min_{\theta, t} \sum_{p_1 \in C_1,\, p_2 \in C_2} \| p_1 - (R(\theta) p_2 + t) \|^2 \quad \text{s.t.} \quad A (R(\theta) p_2 + t) \text{ is not strictly smaller than } b, \ \forall p_2 \in C_2,$$

where $\theta$ represents an angle with rotation matrix $R(\theta)$ and $t$ is a translation vector. Note that it is possible to consider only the points that define the convex hull of $C_2$ above.

This problem can be cast as a mixed integer nonlinear optimization problem. The integer variables are needed because the constraints in the above model are actually a “union of sets” instead of an “intersection of the sets” that is usual in optimization. Unfortunately, solving such problems for a large number of points is impractical. We therefore decided to use a simplified heuristic that (i) individually optimizes with respect to each edge of the convex hull $hull(C_1)$; and (ii) minimizes the sum of distances between points in $C_2$ to a single point in $C_1$. Formally, let $A_i$ be a row of the matrix $A$ and $b_i$ the respective right-hand side. Let $p' \in C_1$ be a point of interest. We first solve the following optimization problem for $i = 1, \ldots, k$:

$$\min_{\theta, t} \sum_{p \in C_2} \| p' - (R(\theta) p + t) \|^2 \quad \text{s.t.} \quad A_i (R(\theta) p + t) \ge b_i, \ \forall p \in C_2.$$

We then consider as final solution the one that obtained the smallest objective value. If we also want to preserve the edge from the filtration in the projection, weights can be applied to the objective function above, such that the edge endpoint in $C_2$ has higher weight when compared to other points. In our implementation, we consider $p'$ to be the end point of the edge that is being processed in that iteration.

Fig. 5 illustrates this optimization process. It shows sets of points $C_1$ (colored violet) and $C_2$ (colored orange) that are to be placed at a distance $d$ from each other (Fig. 5(a)). The points colored red and blue correspond to the filtration edge under consideration. The blue point is chosen as the point of interest in order to minimize the objective function of the optimization. The points in $C_2$ are first randomly placed in the feasibility region corresponding to one of the edges of $hull(C_1)$, and the optimization problem is solved (Fig. 5(b)). The resulting solution is shown in Fig. 5(c).

Fig. 5. Solving the optimization model to place two sets of points. (a) Two components that are to be merged. (b) The feasibility region with respect to the highlighted edge of $hull(C_1)$ is shaded gray. $C_2$ is initially placed in this region and the optimization is solved. (c) Solution.

While the simplified optimization model is also nonlinear (due to the rotation), it does not have integer variables and can then be solved by standard nonlinear optimization algorithms. Note that the final solution is not guaranteed to have two points, one in $C_1$ and the other in $C_2$, at exact distance $d$ (and therefore may not satisfy the required filtration constraint). However, this can still be ensured by sliding $C_2$ parallel to the edge of $hull(C_1)$ that is associated with the solution obtained in the optimization process.

Table 1. Data sets used in our experiments.

Data set      Instances   Dimensions   Classes
Iris [23]     150         5            3
Seeds [23]    210         8            3
Heart [23]    261         11           2
Cancer [23]   699         11           2
Mfeat [23]    2000        64           10
MNIST [45]    20000       784          10
Urban         17520       6            not labeled

Implementation.

TopoMap was implemented using C++. It can be divided into two phases. The first is to compute the Euclidean minimum spanning tree, for which we used the implementation of the dual tree EMST algorithm [48] provided by the mlpack library [20], which has a time complexity of $O(N \log N \, \alpha(N))$, where $N$ is the size of the input. The next phase is to lay out the points. Each iteration of TopoMap aligns two components corresponding to the MST edge being processed. We use the union-find data structure to maintain the list of components, which can be accomplished in $O(N \alpha(N))$ time. The convex hull of the resulting merged component is then computed using the qhull library [4], which takes $O(n \log n)$ time to compute the convex hull of $n$ points. However, since in each iteration we use only the points in the convex hull of the individual components, $n \ll N$ in practice. On several large data sets, we found that the layout phase using the geometric approach scaled linearly with the input. On the other hand, computing the EMST became the primary bottleneck when increasing dimensions.

For the optimization-based approach, we use the Algencan [61, 62] library for solving our optimization model. It is a robust and high performance implementation of the augmented Lagrangian method for nonlinear optimization problems whose code is freely available. As expected, the optimization approach was slower than the geometric approach. However, the main bottleneck was still the EMST phase, especially for large point clouds.

From the topological perspective, it is well known that the persistence diagram is robust to noise [17], especially in the context of topology inference [16]: small displacements of points in the original space induce small variations in the persistence diagram. Since TopoMap strictly preserves the persistence diagram, topological robustness to noise is guaranteed by definition. On the other hand, the ordering of the filtration might change slightly due to the noise-induced perturbation. Thus, with respect to the actual projection itself, the locations of the points in 2D space might vary marginally when using the geometric solution. Note that the optimization-based approach, being non-deterministic, can produce different layouts even for the same input when run multiple times. However, since the connected components (that represent the points in the persistence diagram) are robust, these components are always maintained by the projection.
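The robustness statement can be checked numerically for the 0-dimensional case. If every point moves by at most ε, every pairwise distance changes by at most 2ε, and so does each entry of the sorted list of minimum spanning tree edge lengths (which, up to the scaling convention of the Rips filtration, are the values recorded in the 0-dimensional persistence diagram). The sketch below uses a brute-force Kruskal scan and hypothetical helper names; it illustrates the bound and is not part of the TopoMap code.

```python
import math
import random
from itertools import combinations


def mst_lengths(points):
    """Sorted MST edge lengths (brute-force Kruskal, small inputs only)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted((math.dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(len(points)), 2))
    lengths = []
    for w, a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            lengths.append(w)
    return lengths


def perturb(points, eps, rng):
    """Move every point by at most eps (Euclidean norm) in a random direction."""
    out = []
    for p in points:
        step = [rng.gauss(0.0, 1.0) for _ in p]
        norm = math.sqrt(sum(s * s for s in step)) or 1.0
        scale = eps * rng.random() / norm
        out.append(tuple(x + scale * s for x, s in zip(p, step)))
    return out


def max_length_change(points, eps, seed=0):
    """Largest change of any sorted MST edge length after an eps-perturbation."""
    noisy = perturb(points, eps, random.Random(seed))
    before, after = mst_lengths(points), mst_lengths(noisy)
    return max(abs(x - y) for x, y in zip(before, after))
```

For any input, `max_length_change(points, eps)` should not exceed `2 * eps` beyond floating-point rounding, even though the order in which individual edges enter the filtration may differ between the two runs.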

MAPPING EVALUATION AND INTERPRETATION

In this section we present the results of applying TopoMap to project high-dimensional data to a visual space ($\mathbb{R}^2$). Our goal in this section is to analyze the properties of the layout produced by TopoMap, the way it visually encodes the information contained in the high-dimensional data, and how readable the TopoMap layout is when compared to the ones produced by dissimilarity preserving projection methods.

To facilitate the above analyses, we use several data sets (see Table 1) having different numbers of instances and dimensions, some of which are labeled (i.e., the classes are known for the instances). We used the implementations provided by scikit-learn (v0.19.0) for existing methods. All experiments were run on a machine with an Intel(R) Xeon(R) CPU E5-2630 v2 running at 2.60GHz and 64 GB of memory.

Fig. 6. Layouts produced by MDS, Isomap, t-SNE, UMAP and TopoMap when applied to the first five data sets in Table 1. Right images in the TopoMap column highlight in colors the denser areas in the left images.

Fig. 1 compares the layout produced by MDS, Isomap, t-SNE, UMAP, and TopoMap for 3 synthetic data sets with well known properties. The first is simply a set of points sampled from three Gaussians, the second is points sampled from three rings, while the third is sampled from two concentric spheres. This figure illustrates the ability of TopoMap to preserve the connected components observed in the original space and to nicely reflect their relative adjacencies. Fig. 6 performs the same comparison with respect to the first five data sets in Table 1. One can notice from these examples that the layouts resulting from TopoMap are quite different from the ones produced by dissimilarity-based methods. This is not a surprise, since TopoMap preserves the $n-1$ distances given by the minimum spanning tree edges, whereas the other methods aim to preserve the pairwise distances (or distributions in the case of t-SNE) between instances.

Star Shaped Ensembles.

TopoMap produces a layout made up of star shaped ensembles with branches connecting and emanating from them. One way of interpreting this layout is through the use of the equivalence between the 0-homology filtration of the Rips complex and hierarchical clustering using the single-linkage criterion [13]. The connected components built during the filtration are exactly the same as the clusters formed when moving up in the hierarchy. In other words, hierarchical clustering with single-linkage produces, by construction, identical results when considering the input (high-dimensional) data and the two-dimensional projection provided by TopoMap. Thus, when interpreting our projections, users should visually identify centers of stars, as these correspond to clusters in the data (these also tend to correspond to the denser parts of the projection, see the TopoMap results of Fig. 6, right column). In contrast, the tips of the stars’ branches should be interpreted as outliers or points lying at the boundary between clusters (these correspond to the least dense regions of the projections). Notice from Fig. 6 that there is a good correspondence between the star ensembles and the classes of data instances. When the star shaped ensembles are used to guide the exploration, TopoMap enables visual analysis that does not demand a great cognitive effort to figure out which are the main groups of instances in the data.

Overall, compared to dissimilarity-based methods, TopoMap is equally, if not more, informative. In fact, except for the Heart data set, one can easily build a visual correspondence between star shaped ensembles and classes. However, even for the Heart data set, TopoMap indicates the presence of groups of similar instances, while the layouts resulting from MDS, Isomap and t-SNE are meaningless. In the Cancer data set, MDS and Isomap clearly reveal one well defined group (blue dots); however, without the labels, it would be difficult to claim that the red points make up a class. The same is true with t-SNE as well, which clearly pinpoints the compact class (in red), but splits the blue class into a number of local clusters, increasing the potential for misleading interpretation. TopoMap, on the other hand, shows two star shaped ensembles, one well defined and another more elongated, indicating the presence of two classes, one of them not so compact.

Density and Dispersion.

The rightmost images in Fig. 6 (TopoMap column) highlight in colors the denser regions in each layout produced by TopoMap, while gray regions correspond to less dense areas. In particular, we use a Kernel Density Estimator (KDE) with a Gaussian kernel (one Gaussian is centered at each point in 2D and the sum of the contributions of all Gaussians is considered as a density estimation at each point). We additionally use an opacity transfer function, driven by this density estimation, that the users can further adjust if needed (by default, a simple threshold at half of the maximum estimated density). The density-based visualization makes it easier to identify tightly connected groups. Although density-based visualizations have been used to evaluate dissimilarity-based methods [50], the presence of errors and distortions prevents the analysis from being accomplished with high confidence [3]. Note that the “centers” of the starred ensembles correspond to denser areas of the layout, thus corresponding to tightly grouped data instances. This is evident in the examples involving the Heart and Cancer data sets, where a density analysis in the layout resulting from MDS, Isomap, and t-SNE would be of little use.
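The density estimate and the default threshold described above can be sketched as follows. The bandwidth is an assumption of this sketch (the text does not state the value used), the density is evaluated only at the projected points themselves, and the function names are illustrative.

```python
import math


def gaussian_kde(points, bandwidth):
    """Density at each 2D point: normalized sum of Gaussians centred on all points."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    densities = []
    for px, py in points:
        total = sum(
            math.exp(-((px - qx) ** 2 + (py - qy) ** 2) / (2.0 * bandwidth ** 2))
            for qx, qy in points
        )
        densities.append(norm * total)
    return densities


def dense_mask(densities, fraction=0.5):
    """True where the density reaches `fraction` of the maximum (default: half)."""
    cutoff = fraction * max(densities)
    return [rho >= cutoff for rho in densities]
```

Points flagged `True` would be drawn in color and the remaining ones in gray; exposing `fraction` as a parameter mirrors the adjustable transfer function mentioned in the text.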

Branches and Outliers.

When considering the above mentioned density-based visualization, it is easy to see that branches stemming from the starred ensembles typically are low density regions.

These branches are essentially of two types: those connecting the starred ensembles; and the ones emanating outwards from the stars. The latter is composed of points whose neighborhoods are not tightly connected. From the hierarchical clustering perspective, these can be considered as single point clusters (outliers for example) which merge with already existing large clusters as one moves up the hierarchy. For example, the TopoMap projection of the Cancer data set in Fig. 6 contains one compact and another more sparse class. The sparse class (red) gives rise to a starred ensemble with a small “center” and long branches emanating from it. The density visualization, coupled with the guarantee that the topology changing edges’ lengths are preserved by TopoMap, gives us confidence to claim that the longer branches come from the sparser class. Moreover, outliers also tend to be part of the loose branches, mainly in less dense areas of the layout. This fact can be observed in the projection of the Mfeat data set, where classes become mixed in loose branches (TopoMap column, left image), but not in the center of the ensembles.

Fig. 7. MNIST data projected using TopoMap (using cosine distance). Transitions between the different starred ensemble clusters: (a) 0 and 8. (b) 3 and 8. (c) 7 and 9. (d) 1 and 8. (e) 0 and 6. (f) Class 2, while being a cluster, is far from 0 and is connected to it via outliers.

Transitioning Between Clusters.

Branches connecting the centers of starred ensembles tend to encapsulate instances that lie between clusters, and represent a transition from one group to another. To illustrate this property, we used a uniform sample of 20000 instances from the MNIST data set, and projected it using the angular distance as distance metric (see Sect. 4.4 for a discussion on this). Fig. 7 highlights the different transitions between the star ensembles captured by TopoMap. In particular, note that the cluster corresponding to class 8 transitions to 0 (a), 3 (b), as well as 1 (d). Other transitions, such as from class 7 to 9 (c) and from 0 to 6 (e), can also be clearly seen. When well defined clusters are far apart from each other, as is the case of the cluster corresponding to class 2 (Fig. 7(f)), the branches emanating from 2 are seen to be formed by “outliers” lying in between the clusters. Notice, however, that the TopoMap layout clearly shows clusters located far apart in the layout, making it easy for users to be aware of which branches are more prone to be made up of outliers (using the density-based visualization to help this process).

In general, dissimilarity-based techniques capable of emphasizing clusters, such as t-SNE and UMAP, do not capture well the transitions between the clusters. In contrast, techniques capable of grasping transitions, such as Isomap, do not emphasize clusters well. Therefore, besides its theoretical guarantees, TopoMap combines properties that are difficult to obtain simultaneously in dissimilarity-based projection methods.

Information loss.

There are two main scenarios that can result in a loss of information during the TopoMap projection. First, since the focus is on preserving the 0-cycles, any information with respect to higher dimensional cycles is lost. The 3 rings example in Fig. 1 is one such instance, where the 1-cycles formed by the 3 main components are simply represented as 3 star ensembles. A similar loss can be seen in the concentric spheres example, where the 2-cycles are lost in the projection. Another scenario which can result in incorrect interpretation is when exploring the long branches of the star ensembles. The distance between two points adjacent in a long branch is not necessarily the distance between them in the high dimensional space. Rather, it represents the distance between the corresponding connected components. For example, say two points are connected in the minimum spanning tree (and hence form an edge of the filtration). This does not guarantee that these two points will form an edge in the projection: the edge will be between the connected components containing them. Thus, the two points may be assigned to different branches of a star ensemble depending on the strategy used during the projection. This property must be considered when interpreting the layout.

Layout interpretation guideline.

Based on the above observations, we use the following guideline to explore data using TopoMap for the remainder of this paper:
• Use a density-based colormap to visualize the projection.
• Start exploration by looking at centers of stars with high density. These typically represent clearly distinct clusters in the data.
• Use low density stars to study “uncommon behaviors”.
• Explore branches to analyze sparse clusters and outliers.

There are several urban data sets available representing different facets of the city corresponding to its different properties. These are typically studied in isolation, which can sometimes result in missing out on interesting patterns arising from the interactions between these facets. For example, when analyzing just the count of taxi trips, it is easy to observe that both Times Square and Penn Station are identified as hot spots [22] almost throughout the day. Given that the former is a popular tourist attraction, while the latter is a transit hub, one would however expect differences in the way usage patterns of these places change depending on other conditions.

In this experiment, our goal is to see if such patterns do exist. To do so, we generate high dimensional data sets by combining the NYC taxi data and the weather data as follows. We divided two years (2014-2015) into hourly intervals. Then, given a location, we consider the taxi pickups that happened within a 100 m radius of the location. We then create one high dimensional point for every hourly interval having the following dimensions: count of taxi pickups, average fare, average distance, temperature, precipitation (rainfall), and wind speed. Thus, the data set corresponding to each location is a collection of 6D points. We then projected the data corresponding to Times Square and Penn Station using TopoMap, visually analyzed the patterns present in the projection, and analyzed the different starred ensembles by looking at the temporal distribution of the points forming these clusters. The analysis of the data corresponding to Penn Station can be found in the supplemental material.
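The construction of the 6D points can be sketched as below. The record layouts are assumptions made for illustration: trips are taken to be already restricted to the 100 m radius around the location, weather is keyed by hour, and hours without pickups receive zero averages (the text does not say how such hours were handled). The function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_points(trips, weather):
    """Build one 6D point per hour listed in `weather`.

    trips:   iterable of (pickup_time, fare, distance), already restricted
             to pickups near the location of interest.
    weather: dict mapping an hour (datetime truncated to the hour) to
             (temperature, precipitation, wind_speed).
    """
    by_hour = defaultdict(list)
    for pickup_time, fare, distance in trips:
        hour = pickup_time.replace(minute=0, second=0, microsecond=0)
        by_hour[hour].append((fare, distance))

    points = {}
    for hour, (temperature, precipitation, wind_speed) in sorted(weather.items()):
        rides = by_hour.get(hour, [])
        count = len(rides)
        # Assumption: hours without any pickup get zero averages.
        avg_fare = sum(f for f, _ in rides) / count if count else 0.0
        avg_dist = sum(d for _, d in rides) / count if count else 0.0
        points[hour] = (count, avg_fare, avg_dist,
                        temperature, precipitation, wind_speed)
    return points
```

Since the six dimensions have very different units and ranges, some form of rescaling before computing Euclidean distances may be advisable; whether and how this was done is not stated in the text.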

Times Square.

Fig. 8 shows the results obtained for Times Square. In this scenario it is interesting to note that the most dense region (Figs. 8(b) and (c)) corresponds to the summer months. This is in turn divided into two clusters: Fig. 8(b) corresponds to the main part of the day (10 am to 6 pm), while Fig. 8(c) corresponds to night hours. Similarly, the winter months formed their own cluster (Fig. 8(d)). It was interesting to note that there was also a cluster with a smaller number of points corresponding to the Spring and Autumn months (Fig. 8(e)).

A curious observation in the above cases was that none of these clusters included the time period 4 am–8 am. We found these points for summer in a separate cluster shown in Fig. 8(f). On further analysis, we found that this is primarily because the taxi activity at these times was not only lower than at other times, but these trips also had longer distances and higher fares than normal. Additionally, we also found that the points corresponding to periods when there was rainfall formed a less dense cluster among the outliers (Fig. 8(g)).

Fig. 8. Analyzing Times Square using TopoMap. (a) Projection obtained using TopoMap. (b)–(g): Different clusters are selected and the temporal distributions of the selected points visualized as a histogram.

Since TopoMap bears theoretical guarantees, it can be used to probe other projection methods in order to further understand how those methods behave, especially regarding distortions and cluster preservation. To illustrate this, consider the layouts in Fig. 9. The layout on the top is the TopoMap projection of the urban data used in the previous section corresponding to Times Square. The highlighted points correspond to the 10 largest connected components obtained by stopping the topological filtration after adding 5000 topology changing edges. There is one large component (gold) and nine smaller ones highlighted in different colors. Recall that if we apply this filtration in the high-dimensional space we would get exactly the same connected components.

Fig. 9. Connected components (colored groups) guaranteed to exist in the high-dimensional space are broken apart by t-SNE.

The bottom image of Fig. 9 shows the result of projecting the same data using t-SNE. The highlighted points here correspond to the same components from the TopoMap layout. Notice how t-SNE spreads the large gold component around the layout. Even tightly connected components such as the ones indicated as (a), (b), and (c) in the TopoMap layout are broken apart by t-SNE. This example reveals an important property of t-SNE, namely, clusters visualized in a t-SNE layout tend to correspond to pieces of clusters present in the high-dimensional data. With the help of TopoMap, one can see where t-SNE is placing the different pieces of a cluster. Although experienced users are usually aware of this “breaking cluster” property of t-SNE, we are not aware of any work capable of revealing the extent and intensity of this phenomenon. Revealing this nature of t-SNE is quite important, and can be considered a side contribution of the present work, helping to illustrate the potential of using TopoMap as an analytical tool.
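The component extraction used for this comparison can be sketched as stopping a Kruskal-style filtration after a fixed number of topology changing edges and keeping the largest resulting components. The code below is a brute-force illustration for small inputs with hypothetical function names; it is not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def components_after_k_edges(points, k):
    """Label each point with its component after the k shortest MST edges."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted((math.dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(len(points)), 2))
    added = 0
    for _, a, b in pairs:
        if added == k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:  # a topology changing edge: it merges two components
            parent[ra] = rb
            added += 1
    return [find(i) for i in range(len(points))]


def largest_components(labels, m):
    """Return the m largest components as lists of point indices."""
    top = [label for label, _ in Counter(labels).most_common(m)]
    return [[i for i, lab in enumerate(labels) if lab == label] for label in top]
```

Because the projection preserves the lengths of the topology changing edges and the components they merge, running this on the high-dimensional points or on their TopoMap projection should yield the same groups of indices, provided ties between equal edge lengths are broken consistently. The same index groups can then be highlighted in any other layout, such as the t-SNE one.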

Using alternate distance metrics.

As shown in Sect. 4.1 (Fig. 7), TopoMap can also be used with an alternate distance metric. This requires computing the MST using this metric, in which case the running time for computing the MST degenerates to $O(N^2)$ due to the computation of the distance matrix. Note that when another distance metric is used, the filtration in the projected space is still preserved with respect to Euclidean distance in the visual space. This also makes it easier for the user to gauge the projection in the visualized space, allowing for a comparison between the effects of using different distance metrics.

Other 2D and 3D solutions.

While our approach provides a solution ensuring that the 0-dimensional homology is preserved, there can be other valid solutions as well. Depending on the application, one can also trade off preserving the persistence of outliers against preserving neighborhoods, or optimize for a different property. Similarly, it would be interesting to see how the outliers would behave when moving to 3D.

CONCLUSIONS

In this paper we presented TopoMap, the first planar projection technique that is guaranteed to preserve the homology of 0-cycles of the Rips filtration. Evaluation of our approach using a variety of data sets demonstrated several key properties that are desirable in a visual analytical tool: the layout is easy to understand while its theoretical guarantees provide confidence to the users. In the future, we would like to explore ways in which 1-cycles can be preserved as well in the projection. Analyzing the effectiveness of TopoMap to assist clustering mechanisms is another direction we will pursue.

Acknowledgments.

This work was partially supported by the DARPA D3M program; Moore Sloan Data Science Environment at NYU; NSF awards CNS-1229185, CCF-1533564, CNS-1544753, CNS-1730396, CNS-1828576; European Commission grant ERC-2019-COG “TORI” (ref. 863464); CNPq-Brazil (303552/2017-4, 304301/2019-1); and the São Paulo Research Foundation (FAPESP) - Brazil (2013/07375-0, 2016/04190-7, 2018/07551-6, 2018/24293-0). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and DARPA.

REFERENCES

[1] H. Adams, A. Tausz, and M. Vejdemo-Johansson. Javaplex: A research software package for persistent (co)homology. In ICMS, 2014. https://github.com/appliedtopology/javaplex.
[2] K. Anderson, J. Anderson, S. Palande, and B. Wang. Topological data analysis of functional MRI connectivity in time and space domains. In MICCAI Workshop on Connectomics in NeuroImaging, 2018.
[3] M. Aupetit. Visualizing distortions and recovering topology in continuous projection techniques. Neurocomputing, 70(7):1304–1330, 2007.
[4] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw., 22(4):469–483, Dec. 1996.
[5] U. Bauer. Ripser: efficient computation of Vietoris-Rips persistence barcodes, Aug. 2019. Preprint.
[6] U. Bauer, M. Kerber, J. Reininghaus, and H. Wagner. PHAT - persistent homology algorithms toolbox. In ICMS, 2014. https://github.com/blazs/phat.
[7] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In NIPS, pages 177–184, 2004.
[8] H. Bhatia, A. G. Gyulassy, V. Lordi, J. E. Pask, V. Pascucci, and P.-T. Bremer. Topoms: Comprehensive topological exploration for molecular and condensed-matter systems. J. Comput. Chem., 39(16):936–952, 2018.
[9] A. Bock, H. Doraiswamy, A. Summers, and C. T. Silva. Topoangler: Interactive topology-based extraction of fishes. IEEE Trans. Comp. Graph., 24(1):812–821, 2018.
[10] I. Borg and P. Groenen. Modern Multidimensional Scaling - Theory and Applications. Springer Series in Statistics, 1997.
[11] P. Bremer, G. Weber, J. Tierny, V. Pascucci, M. Day, and J. Bell. Interactive exploration and analysis of large scale simulations using topology-based data segmentation. IEEE Trans. Comp. Graph., 17(9):1307–1324, 2011.
[12] P. Bubenik and P. Dłotko. A persistence landscapes toolbox for topological statistics. J. Symb. Comp., 78:91–114, 2017.
[13] G. Carlsson. Topology and Data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.
[14] H. A. Carr, J. Snoeyink, and M. van de Panne. Simplifying Flexible Isosurfaces Using Local Geometric Measures. In IEEE VIS, pages 497–504, 2004.
[15] F. Chazal, L. J. Guibas, S. Y. Oudot, and P. Skraba. Persistence-based clustering in Riemannian manifolds. J. ACM, 60(6), 2013.
[16] F. Chazal and S. Oudot. Towards persistence-based reconstruction in Euclidean spaces. In Symp. on Comp. Geom., pages 232–241, 2008.
[17] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Disc. Comput. Geom., 37(1):103–120, 2007.
[18] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 2001.
[19] J. P. Cunningham and Z. Ghahramani. Linear dimensionality reduction: Survey, insights, and generalizations. J. Mach. Learn. Res., 16(89):2859–2900, 2015.
[20] R. R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, and S. Zhang. mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software, 3:726, 2018.
[21] L. De Floriani, U. Fugacci, F. Iuricich, and P. Magillo. Morse complexes for shape segmentation and homological analysis: discrete models and algorithms. Comput. Graph. Forum, 34(2):761–785, 2015.
[22] H. Doraiswamy, N. Ferreira, T. Damoulas, J. Freire, and C. T. Silva. Using topological analysis to support event-guided exploration in urban data. IEEE Trans. Comp. Graph., 20(12):2634–2643, 2014.
[23] D. Dua and C. Graff. UCI machine learning repository. https://archive.ics.uci.edu/ml/machine-learning-databases/, 2017.
[24] H. Edelsbrunner and J. Harer. Computational Topology. An Introduction. Amer. Math. Society, Jan. 2010.
[25] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological Persistence and Simplification. Disc. Comput. Geom., 28(4):511–533, 2002.
[26] H. Edelsbrunner and E. P. Mücke. Simulation of simplicity: a technique to cope with degenerate cases in geometric algorithms. ACM Trans. Graph., 9(1):66–104, 1990.
[27] B. T. Fasy, J. Kim, F. Lecci, and C. Maria. Introduction to the R package TDA. CoRR, abs/1411.1830, 2014. https://cran.r-project.org/web/packages/TDA/index.html.
[28] R. K. Gabriel and R. R. Sokal. A new statistical approach to geographic variation analysis. Systematic Zoology, 18(3):259–278, 1969.
[29] S. Gerber, P. Bremer, V. Pascucci, and R. Whitaker. Visual Exploration of High Dimensional Scalar Functions. IEEE Trans. Comp. Graph., 16(6):1271–1280, 2010.
[30] S. Gerber, O. Rübel, P.-T. Bremer, V. Pascucci, and R. T. Whitaker. Morse-Smale regression. J. Comput. Graph. Stat., 22(1):193–214, 2013.
[31] E. Gomez-Nieto, W. Casaca, D. Motta, I. Hartmann, G. Taubin, and L. G. Nonato. Dealing with multiple requirements in geometric arrangements. IEEE Trans. Comp. Graph., 22(3):1223–1235, 2016.
[32] D. Guenther, R. Alvarez-Boto, J. Contreras-Garcia, J.-P. Piquemal, and J. Tierny. Characterizing molecular interactions in chemical systems. IEEE Trans. Comp. Graph., 20(12):2476–2485, 2014.
[33] A. Gyulassy, P. Bremer, R. Grout, H. Kolla, J. Chen, and V. Pascucci. Stability of dissipation elements: A case study in combustion. Comput. Graph. Forum, 33(3):51–60, 2014.
[34] A. Gyulassy, M. A. Duchaineau, V. Natarajan, V. Pascucci, E. Bringa, A. Higginbotham, and B. Hamann. Topologically clean distance fields. IEEE Trans. Comp. Graph., 13(6):1432–1439, 2007.
[35] A. Gyulassy, A. Knoll, K. Lau, B. Wang, P. Bremer, M. Papka, L. A. Curtiss, and V. Pascucci. Interstitial and interlayer ion diffusion geometry extraction in graphitic nanosphere battery materials. IEEE Trans. Comp. Graph., 22(1):916–925, 2016.
[36] W. Harvey and Y. Wang. Topological Landscape Ensembles for Visualization of Scalar-Valued Functions. Comput. Graph. Forum, 29:993–1002, 2010.
[37] C. Heine, H. Leitte, M. Hlawitschka, F. Iuricich, L. De Floriani, G. Scheuermann, H. Hagen, and C. Garth. A survey of topology-based methods in visualization. Comput. Graph. Forum, 35(3):643–667, 2016.
[38] O. C. Jenkins and M. J. Matarić. A spatio-temporal extension to isomap nonlinear dimension reduction. In Proc. ICML, page 56, 2004.
[39] P. Joia, D. Coimbra, J. Cuminato, F. Paulovich, and L. Nonato. Local affine multidimensional projection. IEEE Trans. Comp. Graph., 17:2563–2571, 2011.
[40] P. Joia, F. Petronetto, and L. G. Nonato. Uncovering representative groups in multidimensional projections. Comput. Graph. Forum, 34(3):281–290, 2015.
[41] J. Kasten, J. Reininghaus, I. Hotz, and H. Hege. Two-dimensional time-dependent vortex regions based on the acceleration magnitude. IEEE Trans. Comp. Graph., 17(12):2080–2087, 2011.
[42] J. Krause, A. Dasgupta, J. Fekete, and E. Bertini. Seekaview: An intelligent dimensionality reduction strategy for navigating high-dimensional data spaces. In M. Hadwiger, R. Maciejewski, and K. Moreland, editors,

IEEE LDAV. , pages 11–19, 2016.[43] V. Kurlin. A one-dimensional homologically persistent skeleton of anunstructured point cloud in any metric space.

Comput. Graph. Forum ,34(5):253–262, 2015.[44] D. E. Laney, P. Bremer, A. Mascarenhas, P. Miller, and V. Pascucci.Understanding the structure of the turbulent mixing layer in hydrodynamicinstabilities.

IEEE Trans. Comp. Graph. , 12(5):1053–1060, 2006.[45] Y. LeCun, C. Cortes, and C. J. Burges. THE MNIST DATABASE ofhandwritten digits. http://yann.lecun.com/exdb/mnist/ , 2020.[46] J. A. Lee and M. Verleysen. Nonlinear dimensionality reduction of datamanifolds with essential loops.

Neurocomputing , 67:29–53, 2005.[47] J. A. Lee and M. Verleysen.

Nonlinear Dimensionality Reduction . Springer,2007.[48] W. B. March, P. Ram, and A. G. Gray. Fast euclidean minimum spanningtree: Algorithm, analysis, and applications. In

Proceedings of the 16thACM SIGKDD International Conference on Knowledge Discovery andData Mining , KDD ’10, pages 603–612. ACM, 2010.[49] C. Maria, J. Boissonnat, M. Glisse, and M. Yvinec. The gudhi library:Simplicial complexes and persistent homology. In

ICMS , 2014. http://gudhi.gforge.inria.fr/ .[50] R. Martins, D. Coimbra, R. Minghim, and A. Telea. Visual analysis ofdimensionality reduction quality for parameterized projections.

Comp. &Graph. , 41:26–42, 2014.[51] L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Ap-proximation and Projection for Dimension Reduction.

ArXiv e-prints , Feb.2018.[52] D. Morozov. Dionysus. ,2010. Accessed: 2016-09-15.[53] V. Nanda. Perseus, the persistent homology software. , 2013. Accessed: 2016-09-15.[54] L. G. Nonato and M. Aupetit. Multidimensional projection for visual ana-lytics: Linking techniques with distortions, tasks, and layout enrichment.

EEE Trans. Comp. Graph. , 25(8):2650–2673, 2019.[55] P. Oesterling, C. Heine, H. J¨anicke, and G. Scheuermann. Visual analysisof high dimensional point clouds using topological landscapes. In

Proc.PaciﬁcVis , pages 113–120, 2010.[56] P. Oesterling, C. Heine, H. J¨anicke, G. Scheuermann, and G. Heyer. Vi-sualization of High Dimensional Point Clouds Using their Density Dis-tribution’s Topology.

IEEE Trans. Comp. Graph. , 17(11):1547–1559,2011.[57] P. Oesterling, C. Heine, G. H. Weber, and G. Scheuermann. Visualizing ndpoint clouds as topological landscape proﬁles to guide local data analysis.

IEEE Trans. Vis. Comput. Graph. , 19(3):514–526, 2013.[58] P. Oesterling, G. Scheuermann, S. Teresniak, G. Heyer, S. Koch, T. Ertl,and G. H. Weber. Two-stage framework for a topology-based projectionand visualization of classiﬁed document collections. In

Proc. IEEE VAST ,pages 91–98, 2010.[59] M. Olejniczak, A. S. P. Gomes, and J. Tierny. A Topological Data AnalysisPerspective on Non-Covalent Interactions in Relativistic Calculations.

Int.J. Quantum Chem. , 120(8):e26133, 2020.[60] R. Paul and S. K. Chalup. A study on validating non-linear dimensionalityreduction using persistent homology.

Pattern Recognition Letters , 100:160–166, 2017.[61] J. M. M. R. Andreani, E. G. Birgin and M. L. Schuverdt. On augmentedlagrangian methods with general lower-level constraints.

SIAM Opt. ,18:1286–1309, 2007.[62] J. M. M. R. Andreani, E. G. Birgin and M. L. Schuverdt. Augmented la-grangian methods under the constant positive linear dependence constraintqualiﬁcation.

Math. Prog. , 111:5–32, 2008.[63] G. Reeb. Sur les points singuliers dune forme de Pfaff compl`etementint´egrable ou d’une fonction num´erique.

Comptes Rendus des s´eances del’Acad´emie des Sciences , 222(847-849):76, 1946.[64] B. Rieck and H. Leitte. Agreement analysis of quality measures fordimensionality reduction. In

Topological Methods in Data Analysis andVisualization , pages 103–117. Springer, 2015.[65] B. Rieck and H. Leitte. Persistent homology for the evaluation of di-mensionality reduction schemes.

Comput. Graph. Forum , 34(3):431–440,2015.[66] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locallylinear embedding. science , 290(5500):2323–2326, 2000.[67] A. Rusu. Tree drawing algorithms. In

Handbook of Graph Drawing andVisualization , 2013. [68] D. Sacha, L. Zhang, M. Sedlmair, J. A. Lee, J. Peltonen, D. Weiskopf,S. C. North, and D. A. Keim. Visual interaction with dimensionalityreduction: A structured literature analysis.

IEEE Trans. Comp. Graph. ,23(1):241–250, 2017.[69] N. Shivashankar, P. Pranav, V. Natarajan, R. van de Weygaert, E. P. Bos,and S. Rieder. Felix: A topology based framework for visual explorationof cosmic ﬁlaments.

IEEE Trans. Comp. Graph. , 22(6):1745–1759, 2016. http://vgl.serc.iisc.ernet.in/felix/index.html .[70] V. D. Silva and J. B. Tenenbaum. Global versus local methods in nonlineardimensionality reduction. In

NIPS , pages 721–728, 2003.[71] G. Singh, F. Memoli, and G. Carlsson. Topological Methods for theAnalysis of High Dimensional Data Sets and 3D Object Recognition. In

Eurographics Symposium on Point-Based Graphics , 2007.[72] M. Soler, M. Petitfrere, G. Darche, M. Plainchault, B. Conche, andJ. Tierny. Ranking Viscous Finger Simulations to an Acquired GroundTruth with Topology-Aware Matchings. In

IEEE LDAV. , pages 62–72,2019.[73] T. Sousbie. The persistent cosmic web and its ﬁlamentary structure:Theory and implementations.

Royal Astronomical Society , 414:350– 383, 06 2011. .[74] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global ge-ometric framework for nonlinear dimensionality reduction. science ,290(5500):2319–2323, 2000.[75] J. Tierny, G. Favelier, J. A. Levine, C. Gueunet, and M. Michaux. TheTopology ToolKit.

IEEE Trans. Comp. Graph. , 24(1):832–842, 2018. https://topology-tool-kit.github.io/ .[76] L. van der Maaten and G. Hinton. Visualizing high-dimensional data usingt-sne.

J. Mach. Learn. Res. , 9:2579–2605, 2008.[77] L. van der Maaten, E. Postma, and J. van den Herik. Dimensionalityreduction: A comparative review. Technical report, Tilburg University,2007.[78] G. Weber, P.-T. Bremer, and V. Pascucci. Topological Landscapes: A Ter-rain Metaphor for Scientiﬁc Data.

IEEE Trans. Comp. Graph. , 13(6):1416–1423, Nov. 2007.[79] L. Yan, Y. Zhao, P. Rosen, C. Scheidegger, and B. Wang. Homology-preserving dimensionality reduction via manifold landmarking and tearing.