Branch points and stability

Abstract

The hierarchy poset and branch point poset for a data set both admit a calculus of least upper bounds. A method involving upper bounds is used to show that the map of branch points associated to the inclusion of data sets is a controlled homotopy equivalence, where the control is expressed by an upper bound relation that is constrained by Hausdorff distance.


arXiv [math.AT]

J.F. Jardine

March 16, 2020


Introduction

This paper is a discussion of clustering phenomena that arise in connection with inclusions $X \subset Y \subset \mathbb{R}^n$ of data sets, interpreted through the lens of hierarchies of clusters and branch points.

Suppose that $X$ is a finite subset (a data set) in a metric space $Z$. There is a well known system of simplicial complexes $V_s(X)$ whose simplices are the subsets $\sigma$ of $X$ such that $d(x,y) \leq s$ for each pair of points $x, y \in \sigma$, where $d$ is the metric on $Z$. The complexes $V_s(X)$ are the Vietoris-Rips complexes for the data set $X$.

If $k$ is a positive integer, $L_{s,k}(X)$ is the subcomplex of $V_s(X)$ whose simplices $\sigma$ have vertices $x$ such that $d(x,y) \leq s$ for at least $k$ distinct points $y \neq x$ in $X$. This object is variously called a degree Rips complex, or a Lesnick complex. The number $k$ is a density parameter.

The simplicial complexes $V_s(X)$ and $L_{s,k}(X)$ are defined by their respective partially ordered sets (posets) of simplices $P_s(X)$ and $P_{s,k}(X)$ [4]. The corresponding nerves $BP_s(X)$ and $BP_{s,k}(X)$ are barycentric subdivisions of the respective complexes $V_s(X)$ and $L_{s,k}(X)$, and therefore have the same homotopy types. This identification of homotopy types is assumed in this paper, so that $V_s(X) = BP_s(X)$ and $L_{s,k}(X) = BP_{s,k}(X)$, respectively.

A relationship $s \leq t$ between spatial parameters induces an inclusion
$$L_{s,k}(X) \subset L_{t,k}(X).$$
Some of the complexes $L_{s,k}(X)$ could be empty, and $L_{s,k}(X)$ is the barycentric subdivision of a big simplex for $s$ sufficiently large if $k$ is bounded above by the cardinality of $X$.
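The definition of the degree Rips complex can be made concrete with a short computation. The following is a minimal sketch, not taken from the paper, that lists the path components of $L_{s,k}(X)$ for a finite set of points in Euclidean space. The function name `degree_rips_components` and the use of the Euclidean metric are assumptions made for illustration. Path components depend only on the vertices and edges of the complex, so higher simplices are ignored.

```python
from math import dist


def degree_rips_components(points, s, k):
    """Path components of L_{s,k}(X), as sorted lists of point indices.

    Assumptions (not from the paper): points are coordinate tuples and
    d is the Euclidean metric.
    """
    n = len(points)
    # neighbours[i] lists the indices j != i with d(x_i, x_j) <= s.
    neighbours = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= s]
        for i in range(n)
    ]
    # A point is a vertex of L_{s,k}(X) if it has at least k such neighbours.
    vertices = [i for i in range(n) if len(neighbours[i]) >= k]
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Two vertices at distance <= s span an edge of L_{s,k}(X).
    for v in vertices:
        for w in neighbours[v]:
            if w in parent:
                parent[find(v)] = find(w)

    components = {}
    for v in vertices:
        components.setdefault(find(v), []).append(v)
    return sorted(components.values())
```

With `k = 0` every point is a vertex, so the function returns the path components of the Vietoris-Rips complex $V_s(X)$. For larger `k`, points with fewer than `k` neighbours within distance `s` are dropped before components are formed.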
Observe also that $L_{s,0}(X) = V_s(X)$, and that the subobjects $L_{s,k}(X)$ filter $V_s(X)$.

For a fixed integer $k$, the sets $\pi_0 L_{s,k}(X)$ of path components, as $s$ varies, define a tree $\Gamma_k(X)$ with elements $(s,[x])$ such that $[x] \in \pi_0 L_{s,k}(X)$.

The tree $\Gamma_k(X)$ is the object studied by the HDBSCAN clustering algorithm, while the individual sets of clusters $\pi_0 L_{s,k}(X)$ are computed for the DBSCAN algorithm.

The tree $\Gamma_k(X)$ has a subobject $\mathrm{Br}_k(X)$ whose elements are the branch points of the tree $\Gamma_k(X)$. The branch points of $\Gamma_k(X)$ are in one to one correspondence with the stable components for $\Gamma_k(X)$ that are defined in [3], in the sense that every stable component starts at a unique branch point. We replace the stable component discussion of [3] with the branch point tree $\mathrm{Br}_k(X)$, and make particular use of its ordering.

The branch point tree $\mathrm{Br}_k(X)$ is a highly compressed version of the hierarchy $\Gamma_k(X)$ that is produced by the HDBSCAN algorithm.

We derive a stability result (Theorem 2) for the branch point tree. This result follows from a stability theorem for the degree Rips complex [4], together with a calculus of least upper bounds for the branch point tree that is developed in the next section.

Suppose that $i: X \subset Y$ are data sets in $Z$, and that $r > 0$. Suppose that the Hausdorff distance satisfies $d_H(X^{k+1}_{\mathrm{dis}}, Y^{k+1}_{\mathrm{dis}}) < r$ in $Z^{k+1}$, where $X^{k+1}_{\mathrm{dis}}$ is the set of $(k+1)$-tuples of distinct points in $X$, interpreted as a subset of the product metric space $Z^{k+1}$. The inclusion $i$ induces an inclusion $i: L_{s,k}(X) \to L_{s,k}(Y)$ of simplicial complexes, which is natural in all $s$ and $k$.

The stability theorem for the degree Rips complex (Theorem 6 of [4], which is a statement about posets) implies the following:

Theorem 1.

Suppose that $X \subset Y \subset Z$ are data sets, and we have the relation
$$d_H(X^{k+1}_{\mathrm{dis}}, Y^{k+1}_{\mathrm{dis}}) < r$$
on Hausdorff distance between associated configuration spaces in $Z^{k+1}$. Then there is a diagram of simplicial complex maps
$$
\begin{array}{ccc}
L_{s,k}(X) & \xrightarrow{\ \sigma\ } & L_{s+2r,k}(X) \\
{\scriptstyle i}\big\downarrow & \nearrow{\scriptstyle \theta} & \big\downarrow{\scriptstyle i} \\
L_{s,k}(Y) & \xrightarrow{\ \sigma\ } & L_{s+2r,k}(Y)
\end{array}
\tag{1}
$$
in which the horizontal and vertical maps are natural inclusions. The upper triangle of the diagram commutes, and the lower triangle commutes up to a homotopy which fixes $L_{s,k}(X)$.

Theorem 1 specializes to the Rips Stability Theorem in the case $k = 0$ (see [4], [1]). The picture (1) is often called a homotopy interleaving.

Application of the path component functor $\pi_0$ to the diagram (1) gives a commutative diagram
$$
\begin{array}{ccc}
\pi_0 L_{s,k}(X) & \xrightarrow{\ \sigma\ } & \pi_0 L_{s+2r,k}(X) \\
{\scriptstyle i}\big\downarrow & \nearrow{\scriptstyle \theta} & \big\downarrow{\scriptstyle i} \\
\pi_0 L_{s,k}(Y) & \xrightarrow{\ \sigma\ } & \pi_0 L_{s+2r,k}(Y)
\end{array}
\tag{2}
$$
which is an interleaving of clusters. This is true for all homotopy invariants: in particular, application of homology functors to (1) produces interleaving diagrams of homology groups.

The tree $\Gamma_k(X)$ has least upper bounds, and these restrict to least upper bounds for the subtree $\mathrm{Br}_k(X)$ of branch points (Lemma 3).

The inclusion $\mathrm{Br}_k(X) \subset \Gamma_k(X)$ is a homotopy equivalence of posets, where the homotopy inverse is defined by taking the maximal branch point $(s_0,[x_0]) \leq (s,[x])$ below $(s,[x])$ for each object $(s,[x])$ of $\Gamma_k(X)$. The existence of the maximal branch point below an object $(s,[x])$ is a consequence of Lemma 5.

The poset map $i: \Gamma_k(X) \to \Gamma_k(Y)$ defines a poset map $i_*: \mathrm{Br}_k(X) \to \mathrm{Br}_k(Y)$, via the homotopy equivalences for the data sets $X$ and $Y$ of the last paragraph. The maps $\theta: \pi_0 L_{s,k}(Y) \to \pi_0 L_{s+2r,k}(X)$ induce morphisms of trees $\theta_*: \Gamma_k(Y) \to \Gamma_k(X)$ and $\theta_*: \mathrm{Br}_k(Y) \to \mathrm{Br}_k(X)$.

We then have the following:

Theorem 2.

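Under the assumptions of Theorem 1, there is a homotopy commutative diagram
$$
\begin{array}{ccc}
\mathrm{Br}_k(X) & \xrightarrow{\ \sigma_*\ } & \mathrm{Br}_k(X) \\
{\scriptstyle i_*}\big\downarrow & \nearrow{\scriptstyle \theta_*} & \big\downarrow{\scriptstyle i_*} \\
\mathrm{Br}_k(Y) & \xrightarrow{\ \sigma_*\ } & \mathrm{Br}_k(Y)
\end{array}
\tag{3}
$$
of morphisms of trees.

This paper is devoted to a proof and interpretation of this result.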
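For small data sets, the Hausdorff distance hypothesis of Theorems 1 and 2 can be checked by brute force. The sketch below is an illustration only. It assumes that points are coordinate tuples with the Euclidean metric on $Z$, and that the product metric on $Z^{k+1}$ is the maximum of the coordinate distances. This section does not specify the product metric, so that choice is an assumption. The function name `config_hausdorff` is hypothetical.

```python
from itertools import permutations
from math import dist


def config_hausdorff(X, Y, k):
    """Hausdorff distance between X^{k+1}_dis and Y^{k+1}_dis.

    Assumptions (not from the paper): Euclidean metric on Z, max metric
    on the product Z^{k+1}, and both X and Y have at least k + 1 points.
    The points of X, and of Y, are assumed pairwise distinct.
    """
    def d(p, q):
        # Max metric on (k+1)-tuples of points.
        return max(dist(a, b) for a, b in zip(p, q))

    # Ordered (k+1)-tuples of distinct points.
    xs = list(permutations(X, k + 1))
    ys = list(permutations(Y, k + 1))

    def directed(A, B):
        # sup over a in A of the distance from a to the set B.
        return max(min(d(a, b) for b in B) for a in A)

    return max(directed(xs, ys), directed(ys, xs))
```

The number of tuples grows like $|Y|^{k+1}$, so this is only practical for very small examples. When $X \subset Y$ the directed distance from the $X$-tuples to the $Y$-tuples is zero, and only the other direction contributes.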

Fix the density number $k$ and suppose that $L_{s,k}(X) \neq \emptyset$ for $s$ sufficiently large. Apply the path component functor to the $L_{s,k}(X)$, to get a diagram of functions
$$\cdots \to \pi_0 L_{s,k}(X) \to \pi_0 L_{t,k}(X) \to \cdots$$
The graph $\Gamma_k(X)$ has vertices $(s,[x])$ with $[x] \in \pi_0 L_{s,k}(X)$, and edges $(s,[x]) \to (t,[x])$ with $s \leq t$. This graph underlies a poset with a terminal object, and is therefore a tree (or hierarchy).

The morphisms of $\Gamma_k(X)$ are relations $(s,[x]) \leq (t,[y])$. The existence of such a relation means that $[x] = [y] \in \pi_0 L_{t,k}(X)$, or that the image of $[x] \in \pi_0 L_{s,k}(X)$ is $[y]$ under the induced function $\pi_0 L_{s,k}(X) \to \pi_0 L_{t,k}(X)$.

Remarks:

1) Partitions of $X$ given by the set $\pi_0 V_s(X)$ are standard clusters. The tree $\Gamma_0(X) = \Gamma(V_*(X))$ defines a hierarchical clustering that is similar to the single linkage clustering.

2) The set $\pi_0 L_{s,k}(X)$ gives a partitioning of the set of elements of $X$ having at least $k$ neighbours of distance $\leq s$, which is the subject of the DBSCAN algorithm. The tree $\Gamma_k(X) = \Gamma(\pi_0 L_{*,k}(X))$ is the structural object underlying the HDBSCAN algorithm.

A branch point in the tree $\Gamma_k(X)$ is a vertex $(t,[x])$ such that either of the following two conditions holds:

1) there is an $s_0 < t$ such that for all $s_0 \leq s < t$ there are distinct vertices $(s,[x_0])$ and $(s,[x_1])$ with $(s,[x_0]) \leq (t,[x])$ and $(s,[x_1]) \leq (t,[x])$, or

2) there is no relation $(s,[y]) \leq (t,[x])$ with $s < t$.

The second condition means that a representing vertex $x$ of the path component $[x] \in \pi_0 L_{t,k}(X)$ is not a vertex of $L_{s,k}(X)$ for $s < t$. Write $\mathrm{Br}_k(X)$ for the set of branch points $(s,[x])$ in $\Gamma_k(X)$.

The set $\mathrm{Br}_k(X)$ inherits a partial ordering from the poset $\Gamma_k(X)$, and the inclusion $\mathrm{Br}_k(X) \subset \Gamma_k(X)$ of the set of branch points defines a monomorphism of posets.

Every branch point $(s,[x])$ of $\Gamma_k(X)$ has $s = s_i$, where $s_i$ is a phase change number for $X$.
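The two branch point conditions can be tested directly on a small example. The following is a minimal sketch under stated assumptions, not the paper's algorithm. It assumes the Euclidean metric, and it samples the parameter only at $0$ and at the pairwise distances between points of $X$, since the complexes $L_{s,k}(X)$ do not change between consecutive such values. The names `components` and `branch_points` are hypothetical.

```python
from itertools import combinations
from math import dist


def components(points, s, k):
    """Path components of L_{s,k}(X), as a set of frozensets of indices."""
    n = len(points)
    near = {
        i: {j for j in range(n) if j != i and dist(points[i], points[j]) <= s}
        for i in range(n)
    }
    # Vertices of L_{s,k}(X): points with at least k neighbours within s.
    vertices = {i for i in range(n) if len(near[i]) >= k}
    seen, result = set(), set()
    for v in vertices:
        if v in seen:
            continue
        # Depth-first search through neighbours that are also vertices.
        stack, comp = [v], {v}
        while stack:
            u = stack.pop()
            for w in near[u] & vertices:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        result.add(frozenset(comp))
    return result


def branch_points(points, k):
    """Branch points (t, sorted component) of the tree, ordered by t."""
    # Assumed parameter values: 0 together with all pairwise distances.
    params = sorted({0.0} | {dist(p, q) for p, q in combinations(points, 2)})
    found, previous = [], set()
    for t in params:
        current = components(points, t, k)
        for comp in current:
            # Components at the previous parameter value that map into comp.
            below = [c for c in previous if c <= comp]
            # Zero components below: condition 2 (nothing precedes the vertex).
            # Two or more components below: condition 1 (a merge occurs at t).
            if len(below) != 1:
                found.append((t, sorted(comp)))
        previous = current
    return sorted(found)
```

For the three collinear points `[(0, 0), (1, 0), (3, 0)]` with `k = 0`, the sketch reports the three singletons at parameter `0.0`, the merge `[0, 1]` at `1.0`, and the merge `[0, 1, 2]` at `2.0`. With `k = 1` the only branch point is `(1.0, [0, 1])`: the third point joins an existing component at parameter `2.0`, which satisfies neither condition.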
The phase change numbers are the various distances $d(x,y)$ between the elements of the finite set $X$.

The branch point poset $\mathrm{Br}_k(X)$ is a tree, because the element $(s,[x])$ corresponding to the largest phase change number $s$ is terminal.

Suppose that $(s,[x])$ and $(t,[y])$ are vertices of the graph $\Gamma_k(X)$. There is a vertex $(v,[w])$ such that $(s,[x]) \leq (v,[w])$ and $(t,[y]) \leq (v,[w])$. The two relations specify that $[x] = [w] = [y]$ in $\pi_0 L_{v,k}(X)$.

There is a unique smallest vertex $(u,[z])$ which is an upper bound for both $(s,[x])$ and $(t,[y])$. The number $u$ is the smallest parameter (necessarily a phase change number) such that $[x] = [y]$ in $\pi_0 L_{u,k}(X)$, and so $[z] = [x] = [y]$. In this case, one writes
$$(s,[x]) \cup (t,[y]) = (u,[z]).$$
The vertex $(u,[z])$ is the least upper bound (or join) of $(s,[x])$ and $(t,[y])$.

Every finite collection of points $(s_1,[x_1]), \dots, (s_p,[x_p])$ has a least upper bound
$$(s_1,[x_1]) \cup \dots \cup (s_p,[x_p])$$
in the tree $\Gamma_k(X)$.

Lemma 3.

The least upper bound $(u,[z])$ of branch points $(s,[x])$ and $(t,[y])$ is a branch point.

Proof. If there is a number $v$ such that $s, t < v < u$, then $(v,[x])$ and $(v,[y])$ are distinct because $(u,[z])$ is a least upper bound, so that $(u,[z])$ is a branch point.

Otherwise, $s = u$ or $t = u$, in which case $(u,[z]) = (s,[x])$ or $(u,[z]) = (t,[y])$. In either case, $(u,[z])$ is a branch point.

It follows from Lemma 3 that any two branch points $(s,[x])$ and $(t,[y])$ have a least upper bound in $\mathrm{Br}_k(X)$, and that the poset inclusion $\alpha: \mathrm{Br}_k(X) \to \Gamma_k(X)$ preserves least upper bounds.

We have the following observation:

Lemma 4.

Suppose that $(s_1,[x_1])$, $(s_2,[x_2])$ and $(s_3,[x_3])$ are vertices of $\Gamma_k(X)$. Then
$$(s_1,[x_1]) \cup (s_3,[x_3]) \leq \big((s_1,[x_1]) \cup (s_2,[x_2])\big) \cup \big((s_2,[x_2]) \cup (s_3,[x_3])\big).$$

Remark: Carlsson and Mémoli [2] define an ultrametric $d$ on $X = V_0(X)$, for which they say that $d(x,y) = s$, where $s$ is the minimum parameter value such that $[x] = [y] \in \pi_0 V_s(X)$.

The least upper bound concept is both an extension of and a potential replacement for this ultrametric, and Lemma 4 is the analog of the triangle inequality.

The Carlsson-Mémoli theory does not apply to the full tree $\Gamma_k(X)$, because the vertex sets of the Lesnick complexes $L_{s,k}(X)$ can vary with changes of the distance parameter $s$. We can, however, define an ultrametric on each of the sets $\pi_0 L_{s,k}(X)$ as follows:

Suppose given $[x]$ and $[y]$ in $\pi_0 L_{s,k}(X)$ (or equivalently, points $(s,[x])$ and $(s,[y])$ in $\Gamma_k(X)$). Write $d([x],[y]) = u - s$, where $(s,[x]) \cup (s,[y]) = (u,[w])$.

Lemma 5.

Every vertex $(s,[x])$ of $\Gamma_k(X)$ has a unique largest branch point $(s_0,[x_0])$ such that $(s_0,[x_0]) \leq (s,[x])$.

Proof. The least upper bound of the finite list of the branch points $(t,[y])$ such that $(t,[y]) \leq (s,[x])$ is a branch point, by Lemma 3.

In the situation of Lemma 5, one says that $(s_0,[x_0])$ is the maximal branch point below $(s,[x])$.

If $(s,[x])$ is a branch point, then the maximal branch point below $(s,[x])$ is $(s,[x])$, by construction.

Lemma 6.

Suppose that $(s_0,[x_0])$ and $(t_0,[y_0])$ are maximal branch points below the points $(s,[x])$ and $(t,[y])$ in $\Gamma_k(X)$, respectively. Then $(s_0,[x_0]) \cup (t_0,[y_0])$ is the maximal branch point below $(s,[x]) \cup (t,[y])$.

Proof. Suppose that $s \leq t$.

We have
$$(s_0,[x_0]) \cup (t_0,[y_0]) \leq (s,[x]) \cup (t,[y]),$$
and $(s_0,[x_0]) \cup (t_0,[y_0])$ is a branch point by Lemma 3.

Write $(v,[z]) = (s_0,[x_0]) \cup (t_0,[y_0])$.

1) Suppose that $v \leq t$. Then
$$(t_0,[y_0]) \leq (t,[y_0]) = (t,[y])$$
and
$$(t_0,[y_0]) \leq (v,[z]) = (v,[y_0]),$$
so that
$$(v,[z]) = (v,[y_0]) \leq (t,[y_0]) = (t,[y])$$
since $v \leq t$.

Also, $(s_0,[x_0]) \leq (s,[x])$ and $(s_0,[x_0]) \leq (v,[z]) \leq (t,[y])$, so that $(s,[x]) \leq (t,[y])$. Then $(s_0,[x_0]) \leq (t_0,[y_0])$ by maximality, and it follows that
$$(s_0,[x_0]) \cup (t_0,[y_0]) = (t_0,[y_0])$$
is the maximal branch point below
$$(s,[x]) \cup (t,[y]) = (t,[y]).$$

2) Suppose that $v > t$. Then $(s,[x]) = (s,[x_0]) \leq (v,[z])$ and $(t,[y]) = (t,[y_0]) \leq (v,[z])$ because $s \leq t < v$, so that
$$(s,[x]) \cup (t,[y]) \leq (s_0,[x_0]) \cup (t_0,[y_0]).$$
Thus, $(s_0,[x_0]) \cup (t_0,[y_0]) = (s,[x]) \cup (t,[y])$ is a branch point, by Lemma 3.

Lemma 7.

The poset inclusion $\alpha: \mathrm{Br}_k(X) \to \Gamma_k(X)$ has an inverse $\max: \Gamma_k(X) \to \mathrm{Br}_k(X)$, up to homotopy, and $\mathrm{Br}_k(X)$ is a strong deformation retract of $\Gamma_k(X)$.

Proof. Lemma 5 implies that every vertex $(s,[x])$ of $\Gamma_k(X)$ has a unique maximal branch point $(s_0,[x_0])$ such that $(s_0,[x_0]) \leq (s,[x])$. Set
$$\max(s,[x]) = (s_0,[x_0]).$$
The maximality condition implies that $\max$ preserves the ordering. The composite $\max \cdot \alpha$ is the identity on $\mathrm{Br}_k(X)$, and the relations $(s_0,[x_0]) \leq (s,[x])$ define a homotopy $\alpha \cdot \max \leq 1_{\Gamma_k(X)}$.

Return to the inclusion $i: X \subset Y \subset \mathbb{R}^n$ of finite data sets. Suppose that $d_H(X^{k+1}_{\mathrm{dis}}, Y^{k+1}_{\mathrm{dis}}) < r$ and that $L_{s,k}(Y)$ is non-empty, as in the statement of Theorem 1.

Write $i_*: \mathrm{Br}_k(X) \to \mathrm{Br}_k(Y)$ for the composite poset morphism
$$\mathrm{Br}_k(X) \xrightarrow{\ \alpha\ } \Gamma_k(X) \xrightarrow{\ i_*\ } \Gamma_k(Y) \xrightarrow{\ \max\ } \mathrm{Br}_k(Y).$$
This map takes a branch point $(s,[x])$ to the maximal branch point below $(s,[i(x)])$.

Remark: The map $i_*: \mathrm{Br}_k(X) \to \mathrm{Br}_k(Y)$ only preserves least upper bounds up to homotopy. Suppose that $(s,[x])$ and $(t,[y])$ are branch points of $X$, and let $(s_0,[x_0]) \leq (s,[i(x)])$ and $(t_0,[y_0]) \leq (t,[i(y)])$ be maximal branch points below the images of $(s,[x])$ and $(t,[y])$ in $\Gamma_k(Y)$. Then $(s_0,[x_0]) \cup (t_0,[y_0])$ is the maximal branch point below $(s,[i(x)]) \cup (t,[i(y)])$ by Lemma 6, but it may not be the maximal branch point below $i_*((s,[x]) \cup (t,[y]))$.

Poset morphisms $\theta_*: \mathrm{Br}_k(Y) \to \mathrm{Br}_k(X)$ and $\sigma_*: \mathrm{Br}_k(X) \to \mathrm{Br}_k(X)$ are similarly defined, by the poset morphism $\theta: \Gamma_k(Y) \to \Gamma_k(X)$ given by $(t,[y]) \mapsto (t+2r,[\theta(y)])$, and the shift morphism $\sigma: \Gamma_k(X) \to \Gamma_k(X)$ given by $(s,[x]) \mapsto (s+2r,[x])$. These maps again preserve least upper bounds up to homotopy.

1) Consider the poset maps
$$\mathrm{Br}_k(X) \xrightarrow{\ i_*\ } \mathrm{Br}_k(Y) \xrightarrow{\ \theta_*\ } \mathrm{Br}_k(X).$$
If $(s,[x])$ is a branch point for $X$, choose maximal branch points $(s_0,[x_0]) \leq (s,[i(x)])$ for $Y$, $(s_1,[x_1]) \leq (s_0+2r,[\theta(x_0)])$ and $(v,[y]) \leq (s+2r,[x])$ below the respective objects.

Then $\theta_* i_*(s,[x]) = (s_1,[x_1])$, and there is a natural relation
$$\theta_* i_*(s,[x]) = (s_1,[x_1]) \leq (v,[y]) = \sigma_*(s,[x])$$
by a maximality argument. We therefore have a homotopy of poset maps
$$\theta_* i_* \leq \sigma_*: \mathrm{Br}_k(X) \to \mathrm{Br}_k(X). \tag{4}$$

2) Similarly, if $(t,[y])$ is a branch point of $Y$, then
$$i_* \theta_*(t,[y]) \leq \sigma_*(t,[y]),$$
giving a homotopy
$$i_* \theta_* \leq \sigma_*: \mathrm{Br}_k(Y) \to \mathrm{Br}_k(Y). \tag{5}$$

The construction of the poset maps $i_*$, $\theta_*$ and $\sigma_*$, together with the relations (4) and (5), completes the proof of Theorem 2.

There are relations
$$(s,[x]) \leq \sigma_*(s,[x]) \leq (s+2r,[x]) \tag{6}$$
for branch points $(s,[x])$. It follows that the poset map $\sigma_*: \mathrm{Br}_k(X) \to \mathrm{Br}_k(X)$ is homotopic to the identity on $\mathrm{Br}_k(X)$.

It also follows that $\sigma_*(s,[x]) = (t,[x])$ is close to $(s,[x])$ in the sense that $t - s \leq 2r$. Thus, the branch points $(s,[x])$ and $\theta_* i_*(s,[x])$ have a common upper bound, namely $\sigma_*(s,[x])$, which is close to $(s,[x])$.

The subobject of $\mathrm{Br}_k(X)$ consisting of all branch points of the form $(s,[x])$ as $s$ varies has an obvious notion of distance: the distance between points $(s,[x])$ and $(t,[x])$ is $|t - s|$.

If $(t,[y])$ is a branch point of $\Gamma_k(Y)$, the branch point $\sigma_*(t,[y])$ is similarly an upper bound for $(t,[y])$ and $i_* \theta_*(t,[y])$ that is close to $(t,[y])$.

References

[1] Andrew J. Blumberg and Michael Lesnick. Universality of the homotopy interleaving distance. CoRR, abs/1705.01690, 2017.

[2] Gunnar Carlsson and Facundo Mémoli. Characterization, stability and convergence of hierarchical clustering methods.