Towards an Axiomatic Approach to Hierarchical Clustering of Measures
arXiv [stat.ML] — Journal of Machine Learning Research 1 (2015) 1-48. Submitted 4/00; Published 10/00.
Philipp Thomann [email protected]
Ingo Steinwart [email protected]
Nico Schmid [email protected]
Institute for Stochastics and Applications, University of Stuttgart, Germany
Editors: Vladimir Vapnik, Alexander Gammerman, and Vladimir Vovk
Abstract
We propose some axioms for hierarchical clustering of probability measures and investigate their ramifications. The basic idea is to let the user stipulate the clusters for some elementary measures. This is done without the need of any notion of metric, similarity, or dissimilarity. Our main results then show that for each suitable choice of user-defined clustering on elementary measures we obtain a unique notion of clustering on a large set of distributions satisfying a set of additivity and continuity axioms. We illustrate the developed theory by numerous examples including some with and some without a density.
Keywords: axiomatic clustering, hierarchical clustering, infinite samples clustering, density level set clustering, mixed Hausdorff dimensions
1. Introduction
Clustering is one of the most basic tools to investigate unsupervised data: finding groups in data. Its applications reach from categorization of news articles over medical imaging to crime analysis. For this reason, a wealth of algorithms have been proposed, among the best-known being: k-means (MacQueen, 1967), linkage (Ward, 1963; Sibson, 1973; Defays, 1977), cluster tree (Stuetzle, 2003), DBSCAN (Ester et al., 1996), spectral clustering (Donath and Hoffman, 1973; von Luxburg, 2007), and expectation-maximization for generative models (Dempster et al., 1977). For more information and research on clustering we refer the reader to Jardine and Sibson (1971); Hartigan (1975); Kaufman and Rousseeuw (1990); Mirkin (2005); Gan et al. (2007); Kogan (2007); Ben-David (2015); Menardi (2015) and the references therein.

However, each ansatz has its own implicit or explicit definition of what clustering is. Indeed, for k-means it is a particular Voronoi partition, for Hartigan (1975, Section 11.13) it is the collection of connected components of a density level set, and for generative models it is the decomposition of mixed measures into their parts. Stuetzle (2003) stipulates a grouping around the modes of a density, while Chacón (2014) uses gradient flows. Thus, there is no universally accepted definition.

A good notion of clustering certainly needs to address the inherent random variability in data. This can be achieved by notions of clusterings for infinite sample regimes or complete knowledge scenarios—as von Luxburg and Ben-David (2005) put it. Such an approach has various advantages: one can talk about ground truth, can compare alternative clustering algorithms (empirically, theoretically, or in a combination of both by using artificial data), and can define and establish consistency and learning rates. Defining clusters as the connected components of density level sets satisfies all of these requirements.
Yet it seems to be slightly ad hoc, and it will always be debatable whether thin bridges should connect components, and whether close components should really be separated. Similar concerns may be raised for other infinite sample notions of clustering such as Stuetzle (2003) and Chacón (2014).

In this work we address these and other issues by asking ourselves: What does the set of clustering functions look like? What can defining properties—or axioms—of clustering functions be, and what are their ramifications? Given such defining properties, are there functions fulfilling these? How many are there? Can a fruitful theory be developed? And finally, for which distributions do we obtain a clustering and for which not?
These questions have led us to an axiomatic approach. The basic idea is to let the user stipulate the clusters for some elementary measures. Here, the user's choice need not rely on a metric or another pointwise notion of similarity; only basic shapes for geometry and a separation relation have to be specified. Our main results then show that for each suitable choice we obtain a unique notion of clustering satisfying a set of additivity and continuity axioms on a large set of measures. These will be motivated in Section 1.2 and are defined in Axioms 1, 2, and 3. The major technical achievement of this work is Theorem 20: it establishes criteria (cf. Definition 18) to ensure a unique limit structure, which in turn makes it possible to define a unique additive and continuous clustering in Theorem 21. Furthermore, in Section 3.5 we explain how this framework is linked to density-based clustering, and in the examples of Section 4.3 we investigate the consequences in the setting of mixed Hausdorff dimensions.
Some axioms for clustering have been proposed and investigated, but to our knowledge, all approaches concern clustering of finite data. Jardine and Sibson (1971) were probably the first to consider axioms for hierarchical clusterings: these are maps of sets of dissimilarity matrices to sets of e.g. ultrametric matrices. Given such sets they obtain continuity and uniqueness of such a map using several axioms. This setting was used by Janowitz and Wille (1995) to classify clusterings that are equivariant for all monotone transformations of the values of the distance matrix. Later, Puzicha et al. (1999) investigated axioms for cost functions of data partitionings and then obtained clustering functions as optimizers of such cost functions. They consider a hierarchical version as well, marking the last axiomatic treatment of that case until today. More recently, Kleinberg (2003) put forward an impossibility result. He gives three axioms and shows that any (non-hierarchical) clustering of distance matrices can fulfill at most two of them. Zadeh and Ben-David (2009) remedy the impossibility by restricting to k-partitions, and they use minimum spanning trees to characterize different clustering functions. A completely different setting is Meilă (2005), where an arsenal of axioms is given for distances of clustering partitions. They characterize some distances (variation of information, classification error metric) using different subsets of their axioms.

One of the reviewers brought clustering of discrete data to our attention. As far as we understand, consensus clustering (Mirkin, 1975; Day and McMorris, 2003) and additive clustering (Shepard and Arabie, 1979; Mirkin, 1987) are popular in social studies clustering communities. What we call additive clustering in this work is something completely different, though. Still, application of our notions to clustering of discrete structures warrants further research.
Let us now give a brief description of our approach. To this end assume for simplicity that we wish to find a hierarchical clustering for certain distributions on R^d. We denote the set of such distributions by P. Then a clustering is simply a map c that assigns to every P ∈ P a collection c(P) of non-empty events. Since we are interested in hierarchical clustering, c(P) will always be a forest, i.e. we have

A, A′ ∈ c(P) =⇒ A ⊥ A′ or A ⊂ A′ or A ⊃ A′. (1)

Here A ⊥ A′ means sufficiently distinct, i.e. A ∩ A′ = ∅ or something stronger (cf. Definition 1). Following the idea that eventually one needs to store and process the clustering c(P) on a computer, our first axiom assumes that c(P) is finite. For a distribution with a continuous density the level set forest, i.e. the collection of all connected components of density level sets, will therefore not be viewed as a clustering. For densities with finitely many modes, however, this level set forest consists of long chains interrupted by finitely many branchings. In this case, the most relevant information for clustering is certainly represented at the branchings and not in the intermediate chains. Based on this observation, our second clustering axiom postulates that c(P) does not contain chains. More precisely, if s(F) denotes the forest that is obtained by replacing each chain in the forest F by the maximal element of the chain, our structured forest axiom demands that

s(c(P)) = c(P). (2)

To simplify notations we further extend the clustering to the cone defined by P by setting

c(αP) := c(P) (3)

for all α > 0 and P ∈ P. Equivalently, we can view P as a collection of finite non-trivial measures and c as a map on P such that for α > 0 and P ∈ P we have αP ∈ P and c(αP) = c(P).
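As an informal aside (our own sketch, not part of the paper's formal development), the forest condition (1) and the chain-pruning operation s can be made concrete for finite systems of finite sets; the function names are ours:

```python
def is_forest(F, separated=lambda A, B: not (A & B)):
    """Condition (1): any two distinct nodes are separated or nested.
    F is a collection of frozensets; 'separated' plays the role of ⊥
    (here: plain disjointness, the weakest choice)."""
    return all(separated(A, B) or A <= B or B <= A
               for A in F for B in F if A != B)

def structure(F):
    """s(F): keep every root and every node that has a direct sibling
    (a distinct node with the same strict ancestors); this prunes all
    intermediate chain nodes, matching the structured forest axiom (2)."""
    F = set(F)
    ancestors = {A: frozenset(B for B in F if A < B) for A in F}
    return {A for A in F
            if not ancestors[A]  # A is a root
            or any(B != A and ancestors[B] == ancestors[A] for B in F)}
```

Note that `structure` is idempotent, so any forest it returns satisfies s(c(P)) = c(P) by construction.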
It is needless to say that this extended view on clusterings does not change the nature of a clustering.

Our next two axioms are based on the observation that there not only exist distributions for which the "right notion" of a clustering is debatable, but there are also distributions for which everybody would agree about the clustering. For example, if P is the uniform distribution on a Euclidean ball B, then certainly everybody would set c(P) = {B}. Clearly, other such examples are possible, too, and therefore we view the determination of distributions with such simple clusterings as a design decision. More precisely, we assume that we have a collection A of closed sets, called base sets, and a family Q = {Q_A}_{A∈A} ⊂ P, called base measures, with the property A = supp Q_A for all A ∈ A. Now, our base measure axiom stipulates

c(Q_A) = {A}. (4)

It is not surprising that different choices of A, Q, and ⊥ may lead to different clusterings. In particular, we will see that larger classes A usually result in more distributions for which we can construct a clustering satisfying all our clustering axioms. On the other hand, taking a larger class A means that more agreement needs to be sought about the distributions having a trivial clustering (4). For this reason the choice of A can be viewed as a trade-off.

Figure 1: Example of disjoint additivity for two distributions having a density.

Axiom (4) only describes distributions that have a trivial clustering. However, there are also distributions for which everybody would agree on a non-trivial clustering. For example, if P is the uniform distribution on two well separated Euclidean balls B_1 and B_2, then the "natural" clustering would be c(P) = {B_1, B_2}. Our disjoint additivity axiom generalizes this observation by postulating

supp P_1 ⊥ supp P_2 =⇒ c(P_1 + P_2) = c(P_1) ∪ c(P_2). (5)

In other words, if P consists of two spatially well separated sources P_1 and P_2, the clustering of P should reflect this spatial separation, see also Figure 1. Moreover, note that this axiom formalizes the vague term "spatially well separated" with the help of the relation ⊥, which, like A and Q, is a design parameter that usually influences the nature of the clustering.

The axioms (4) and (5) only describe the horizontal behaviour of clusterings, i.e. the depth of the clustering forest is not affected by (4) and (5). Our second additivity axiom addresses this. To motivate it, assume that we have a P ∈ P and a base measure Q_A, e.g. a uniform distribution on A, such that supp P ⊂ A. Then adding Q_A to P can be viewed as pouring uniform noise over P. Intuitively, this uniform noise should not affect the internal and possibly delicate clustering of P but only its roots, see also Figure 2. Our base additivity axiom formalizes this intuition by stipulating

supp P ⊂ A =⇒ c(P + Q_A) = s(c(P) ∪ {A}). (6)

Here the structure operation s(·) is applied on the right-hand side to avoid a conflict with the structured forest axiom (2). Also note that it is this very axiom that directs our theory towards hierarchical clustering, since it controls the vertical growth of clusterings under a simple operation.

Figure 2: Example of base additivity.

Any clustering satisfying the axioms (1) to (6) will be called an additive clustering. Now the first, and rather simple, part of our theory shows that under some mild technical assumptions there is a unique additive clustering on the set of simple measures on forests

S(A) := { ∑_{A∈F} α_A Q_A | F ⊂ A is a forest and α_A > 0 for all A ∈ F }.
Moreover, for P ∈ S(A) there is a unique representation P = ∑_{A∈F} α_A Q_A, and the additive clustering is given by c(P) = s(F).

Unfortunately, the set S(A) of simple measures, on which the uniqueness holds, is usually rather small. Consequently, additive clusterings on large collections P are far from being uniquely determined. Intuitively, we may hope to address this issue if we additionally impose some sort of continuity on the clusterings, i.e. an implication of the form

P_n → P =⇒ c(P_n) → c(P). (7)

Indeed, having an implication of the form (7), it is straightforward to show that the clustering is not only uniquely determined on S(A) but actually on the "closure" of S(A). To find a formalization of (7), we first note that from a user perspective, c(P_n) → c(P) usually describes a desired type of convergence. Following this idea, P_n → P then describes a sufficient condition for (7) to hold. In the remainder of this section we thus begin by presenting desirable properties c(P_n) → c(P) and resulting necessary conditions on P_n → P.

Let us begin by assuming that all P_n are contained in S(A), and let us further denote the corresponding forests in the unique representation of P_n by F_n. Then we already know that c(P_n) = s(F_n), so that the convergence on the right hand side of (7) becomes

s(F_n) → c(P). (8)

Now, every s(F_n), as well as c(P), is a finite forest, and so a minimal requirement for (8) is that s(F_n) and c(P) are graph isomorphic, at least for all sufficiently large n. Moreover, we certainly also need to demand that every node in s(F_n) converges to the corresponding node in c(P). To describe the latter postulation more formally, we fix graph isomorphisms ζ_n : s(F_1) → s(F_n) and ζ : s(F_1) → c(P). Then our postulate reads as

ζ_n(A) → ζ(A) (9)

for all A ∈ s(F_1). Of course, there do exist various notions for describing convergence of sets, e.g. in terms of the symmetric difference or the Hausdorff metric, so at this stage we need to make a decision. To motivate our choice, we first note that (9) actually contains two statements, namely, that ζ_n(A) converges for n → ∞, and that its limit equals ζ(A). Now recall from various branches of mathematics that definitions of continuous extensions typically separate these two statements by considering approximating sequences that automatically converge. Based on this observation, we decided to consider monotone sequences in (9), i.e. we assume that A ⊂ ζ_2(A) ⊂ ζ_3(A) ⊂ … for all A ∈ s(F_1). Let us denote the resulting limit forest by F_∞, i.e.

F_∞ := { ⋃_n ζ_n(A) | A ∈ s(F_1) },

which is indeed a forest under some mild assumptions on A and ⊥. Moreover, ζ_∞ : s(F_1) → F_∞ defined by ζ_∞(A) := ⋃_n ζ_n(A) becomes a graph isomorphism, and hence (9) reduces to

ζ_∞(A) = ζ(A)  P-almost surely for all A ∈ s(F_1). (10)

Summing up our considerations so far, we have seen that our demands on c(P_n) → c(P) imply some conditions on the forests associated to the sequence (P_n), namely ζ_n(A) ր for all A ∈ s(F_1). Without a formalization of P_n → P, however, there is clearly no hope that this monotone convergence alone can guarantee (7). Like for (9), there are again various ways of formalizing a convergence P_n → P. To motivate our decision, we first note that a weak continuity axiom is certainly more desirable, since this would potentially lead to more instances of clusterings. Furthermore, observe that (7) becomes weaker the stronger the notion of P_n → P is chosen. Now, if P_n and P had densities f_n and f, then one of the strongest notions of convergence would be f_n ր f. In the absence of densities such a convergence can be expressed by P_n ր P, i.e. by P_n(B) ր P(B) for all measurable B. Combining these ideas we write (P_n, F_n) ր P iff P_n ր P and there are graph isomorphisms ζ_n : s(F_1) → s(F_n) with ζ_n(A) ր for all A ∈ s(F_1). Our formalization of (7) then becomes

(P_n, F_n) ր P =⇒ F_∞ = c(P) in the sense of (10), (11)

which should hold for all P_n ∈ S(A) and their representing forests F_n.

While it seems tempting to stipulate such a continuity axiom, it is unfortunately inconsistent. To illustrate this inconsistency, consider, for example, the uniform distribution P on [0, 1]. Then P can be approximated by the following two sequences:

P_n^(1) := 1_{[1/n, 1−1/n]} P,
P_n^(2) := 1_{[0, 1/2−1/n]} P + 1_{[1/2, 1]} P.

By (11) the first approximation would then lead to the clustering c(P) = {[0, 1]}, while the second approximation would give c(P) = {[0, 1/2], [1/2, 1]}.

Interestingly, this example not only shows that (11) is inconsistent but also gives a hint how to resolve the inconsistency. Indeed, the first sequence seems to be "adapted" to the limiting distribution, whereas the second sequence (P_n^(2)) is intuitively too complicated, since its members have two clusters rather than the anticipated one cluster. Therefore, the idea to find a consistent alternative to (11) is to restrict the left-hand side of (11) to "adapted sequences", so that our continuity axiom becomes

(P_n, F_n) ր P and P_n is P-adapted for all n =⇒ F_∞ = c(P) in the sense of (10).

In simple words, our main result then states that there exists exactly one such continuous clustering on the closure of S(A). The main message of this paper thus is: Starting with very simple building blocks Q = (Q_A)_{A∈A}, for which we (need to) agree that they only have one trivial cluster {A}, we can construct a unique additive and continuous clustering on a rich set of distributions.
Or, in other words, as soon as we have fixed (A, Q) and a separation relation ⊥, there is no ambiguity left about what a clustering is.

What is left is to explore how the choice of the clustering base (A, Q, ⊥) influences the resulting clustering. To this end, we first present various clustering bases, which, e.g., describe the minimal thickness of clusters, their shape, and how far clusters need to be apart from each other. For distributions having a Lebesgue density we then illustrate how different clustering bases lead to different clusterings. Finally, we show that our approach goes beyond density-based clusterings by considering distributions consisting of several lower-dimensional, overlapping parts.
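The inconsistency example for the uniform distribution P on [0, 1] can also be checked numerically. The following sketch (our own; restriction implemented directly for intervals, function names are ours) shows that both sequences increase setwise to P, even though the first has one support component and the second has two for every n:

```python
def uniform_mass(a, b):
    """P([a, b]) for the uniform distribution P on [0, 1]."""
    return max(0.0, min(b, 1.0) - max(a, 0.0))

def P1(n, a, b):
    """First sequence: P restricted to [1/n, 1 - 1/n] (one support component)."""
    return max(0.0, min(b, 1 - 1 / n) - max(a, 1 / n))

def P2(n, a, b):
    """Second sequence: P restricted to [0, 1/2 - 1/n] ∪ [1/2, 1] (two components)."""
    left = max(0.0, min(b, 0.5 - 1 / n) - max(a, 0.0))
    right = max(0.0, min(b, 1.0) - max(a, 0.5))
    return left + right
```

On every interval, both P1(n, ·) and P2(n, ·) increase towards the uniform mass as n grows, which is exactly the P_n ր P convergence used in (11); only the cluster structure of the approximants differs.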
2. Additive Clustering
In this section we introduce base sets, separation relations, and simple measures, as well as the corresponding axioms for clustering. Finally, we show that there exists a unique additive clustering on the set of simple measures.

Throughout this work let Ω = (Ω, T) be a Hausdorff space and let B ⊃ σ(T) be a σ-algebra that contains the Borel sets. Furthermore, we assume that M = M_Ω is the set of finite, non-zero, inner regular measures P on Ω. Similarly, M_Ω^∞ denotes the set of non-zero measures on Ω if Ω is a Radon space, and else of non-zero, inner regular measures on Ω. In this respect, recall that any Polish space—i.e. a completely metrizable separable space—is Radon. In particular, all open and closed subsets of R^d are Polish spaces and thus Radon. For inner regular measures the support is well-defined and satisfies the usual properties, see Appendix A for details. The set M_Ω forms a cone: for all P, P′ ∈ M_Ω and all α > 0 we have P + P′ ∈ M_Ω and αP ∈ M_Ω.

Intuitively, any notion of a clustering should combine aspects of concentration and contiguousness. What is a possible core of this? On the one hand, clustering should be local in the sense of disjoint additivity, which was presented in the introduction: if a measure P is understood on two parts of its support and these parts are nicely separated, then the clustering should be just a union of the two local ones. Observe that in this case supp P is not connected! On the other hand—in view of base clustering—base sets need to be impossible to partition into nicely separated components. Therefore they ought to be nicely connected. Of course, the meanings of nicely connected and nicely separated are interdependent, and highly disputable. For this reason, our notion of clustering assumes that both meanings are specified up front, e.g. by the user. Provided that both meanings satisfy certain technical criteria, we then show that there exists exactly one clustering. To motivate how these technical criteria may look, let us recall that for all connected sets A and all closed sets B_1, …, B_k we have

A ⊂ B_1 ∪̇ … ∪̇ B_k =⇒ ∃! i ≤ k : A ⊂ B_i. (12)

The left hand side here contains the condition that the B_1, …, B_k are pairwise disjoint, for which we already introduce the following notation:

B ⊥_∅ B′ :⇐⇒ B ∩ B′ = ∅.

In order to transfer the notion of connectedness to other relations it is handy to generalize the notation B_1 ∪̇ … ∪̇ B_k. To this end, let ⊥ be a relation on subsets of Ω. Then we denote the union B_1 ∪ … ∪ B_k of some B_1, …, B_k ⊂ Ω by

B_1 ⊥∪ … ⊥∪ B_k

iff we have B_i ⊥ B_j for all i ≠ j. Now the key idea of the next definition is to generalize the notion of connectivity and separation by replacing ⊥_∅ in (12) by another suitable relation.

Definition 1
Let A ⊂ B be a collection of closed, non-empty sets. A symmetric relation ⊥ defined on B is called an A-separation relation iff the following holds:

(a) Reflexivity: For all B ∈ B: B ⊥ B =⇒ B = ∅.
(b) Monotonicity: For all A, A′, B ∈ B: A ⊂ A′ and A′ ⊥ B =⇒ A ⊥ B.
(c) A-Connectedness: For all A ∈ A and all closed B_1, …, B_k ∈ B: A ⊂ B_1 ⊥∪ … ⊥∪ B_k =⇒ ∃ i ≤ k : A ⊂ B_i.

Moreover, an A-separation relation ⊥ is called stable iff for all A_1 ⊂ A_2 ⊂ … with A_n ∈ A for all n ≥ 1, and all B ∈ B:

A_n ⊥ B for all n ≥ 1 =⇒ ⋃_{n≥1} A_n ⊥ B. (13)

Finally, given a separation relation ⊥, we say that B, B′ are ⊥-separated if B ⊥ B′. We write B ◦◦ B′ iff not B ⊥ B′, and say in this case that B, B′ are ⊥-connected.

It is not hard to check that the disjointness relation ⊥_∅ is a stable A-separation relation whenever all A ∈ A are topologically connected. To present another example of a separation relation, we fix a metric d on Ω and some τ > 0. Moreover, for B, B′ ⊂ Ω we write

B ⊥_τ B′ :⇐⇒ d(B, B′) ≥ τ.

In addition, recall that a B ⊂ Ω is τ-connected if, for all x, x′ ∈ B, there exist x_0, …, x_n ∈ B with x_0 = x, x_n = x′, and d(x_{i−1}, x_i) < τ for all i = 1, …, n. Then it is easy to show that ⊥_τ is a stable A-separation relation if all A ∈ A are τ-connected. For more examples of separation relations we refer to Section 4.1.

It can be shown that ⊥_∅ is the weakest separation relation, i.e. for every A-separation relation ⊥ we have A ⊥ A′ =⇒ A ⊥_∅ A′ for all A, A′ ∈ A. We refer to Lemma 30, which also shows that ⊥-unions are unique, i.e., for all A_1, …, A_k and all A′_1, …, A′_{k′} in A we have

A_1 ⊥∪ … ⊥∪ A_k = A′_1 ⊥∪ … ⊥∪ A′_{k′} =⇒ {A_1, …, A_k} = {A′_1, …, A′_{k′}}.

Finally, the stability implication (13) is trivially satisfied for finite sequences A_1 ⊂ · · · ⊂ A_m in A, since in this case we have A_1 ∪ · · · ∪ A_m = A_m. For this reason stability will only become important when we consider limits in Section 3.

We can now describe the properties a clustering base should satisfy.

Definition 2
A (stable) clustering base is a triple (A, Q, ⊥) where A ⊂ B \ {∅} is a class of non-empty sets, ⊥ is a (stable) A-separation relation, and Q = {Q_A}_{A∈A} ⊂ M is a family of probability measures on Ω with the following properties:

(a) Flatness: For all A, A′ ∈ A with A ⊂ A′ we either have Q_{A′}(A) = 0 or
Q_A(·) = Q_{A′}(· ∩ A) / Q_{A′}(A).
(b) Fittedness: For all A ∈ A we have A = supp Q_A.

We call a set A a base set iff A ∈ A, and a measure a ∈ M a base measure on A iff A ∈ A and there is an α > 0 with a = αQ_A.

Let us motivate the two conditions of clustering bases. Flatness concerns nesting of base sets: let A ⊂ A′ be base sets and consider the sum of their base measures αQ_A + Q_{A′}. If the clustering base is not flat, weird things can happen. The way we defined flatness excludes such cases without taking densities into account. As a result we will be able to handle aggregations of measures of different Hausdorff dimension in Section 4.3. Fittedness, on the other hand, establishes a link between the sets A ∈ A and their associated base measures.

Probably the easiest example of a clustering base has measures of the form

Q_A(·) = μ(· ∩ A) / μ(A), i.e. dQ_A = 1_A dμ / μ(A), (14)

where μ is some reference measure independent of Q_A. The next proposition shows that under mild technical assumptions such distributions do indeed provide a clustering base.

Proposition 3
Let μ ∈ M_Ω^∞ and ⊥ be a (stable) A-separation relation for some A ⊂ K(μ), where

K(μ) := { C ∈ B | 0 < μ(C) < ∞ and C = supp μ(· ∩ C) }

denotes the set of μ-support sets. We write Q_{μ,A} := { Q_A | A ∈ A }, where Q_A is defined by (14). Then (A, Q_{μ,A}, ⊥) is a (stable) clustering base.

Interestingly, distributions of the form (14) are not the only examples of clustering bases. For further details we refer to Section 4.3, where we discuss distributions supported by sets of different Hausdorff dimension.
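To get a concrete feel for Proposition 3, the following sketch (our own; a discrete reference measure μ is given as a point-mass dictionary, and all function names are ours) verifies the flatness condition of Definition 2 for base measures of the form (14):

```python
from fractions import Fraction

def restrict_normalize(mu, A):
    """Base measure Q_A = mu(. ∩ A) / mu(A) as in (14), for a discrete
    reference measure mu given as {point: integer mass}; A is a set of points."""
    mass = sum(m for x, m in mu.items() if x in A)
    assert mass > 0, "a base set must carry positive mu-mass"
    return {x: Fraction(m, mass) for x, m in mu.items() if x in A}

def is_flat(mu, A, Aprime):
    """Flatness for nested base sets A ⊂ A': either Q_{A'}(A) = 0 or
    Q_A equals the conditional measure Q_{A'}(. ∩ A) / Q_{A'}(A)."""
    QA = restrict_normalize(mu, A)
    QAp = restrict_normalize(mu, Aprime)
    QAp_of_A = sum(QAp.get(x, Fraction(0)) for x in A)
    if QAp_of_A == 0:
        return True
    return QA == {x: QAp[x] / QAp_of_A for x in QAp if x in A}
```

Since both measures are normalized restrictions of the same μ, the conditional of Q_{A′} on A always reproduces Q_A, which is the discrete analogue of why (14) yields a flat clustering base.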
As outlined in the introduction, we are interested in hierarchical clusterings, i.e. in clusterings that map a finite measure to a forest of sets. In this section we therefore recall some fundamental definitions and notations for such forests.
Let A be a class of closed, non-empty sets, ⊥ be an A-separation relation, and C be a class with A ⊂ C ⊂ B \ {∅}. We say that a non-empty F ⊂ C is a (C-valued) ⊥-forest iff

A, A′ ∈ F =⇒ A ⊥ A′ or A ⊂ A′ or A′ ⊂ A.

We denote the set of all such finite forests by F_C and write F := F_{B\{∅}}.

A finite ⊥-forest F ∈ F is partially ordered by the inclusion relation. The maximal elements max F := {A ∈ F : ∄ A′ ∈ F s.t. A ⊊ A′} are called roots and the minimal elements min F := {A ∈ F : ∄ A′ ∈ F s.t. A′ ⊊ A} are called leaves. It is not hard to see that A ⊥ A′ whenever A, A′ ∈ F is a pair of roots or leaves. Moreover, the ground of F is

G(F) := ⋃_{A∈F} A,

that is, G(F) equals the union over the roots of F. Finally, F is a tree iff it has only a single root, or equivalently, G(F) ∈ F, and F is a chain iff it has a single leaf, or equivalently, iff it is totally ordered.

In addition to these standard notions, we often need a notation for describing certain sub-forests. Namely, for a finite forest F ∈ F with A ∈ F we write

F|_{⊋A} := { A′ ∈ F | A′ ⊋ A }

for the chain of strict ancestors of A. Analogously, we will use the notations F|_{⊃A}, F|_{⊂A}, and F|_{⊊A} for the chain of ancestors of A (including A), the tree of descendants of A (including A), and the finite forest of strict descendants of A, respectively. We refer to Figure 3 for an example of these notations.

Definition 5
Let F be a finite forest. Then we call A_1, A_2 ∈ F direct siblings iff A_1 ≠ A_2 and they have the same strict ancestors, i.e. F|_{⊋A_1} = F|_{⊋A_2}. In this case, any element

A′ ∈ min F|_{⊋A_1} = min F|_{⊋A_2}

is called a direct parent of A_1 and A_2. On the other hand, for A, A′ ∈ F we denote A′ as a direct child of A iff A′ ∈ max F|_{⊊A}. Moreover, the structure of F is defined by

s(F) := { A ∈ F | A is a root or it has a direct sibling A′ ∈ F }

and F is a structured forest iff F = s(F).

For later use we note that direct siblings A_1, A_2 in a ⊥-forest F always satisfy A_1 ⊥ A_2. Moreover, the structure of a forest is obtained by pruning all sub-chains in F, see Figure 3. We further note that s(s(F)) = s(F) for all forests, and if F, F′ are structured ⊥-forests with G(F) ⊥ G(F′) then we have s(F ∪ F′) = F ∪ F′.

Let us now present our first set of axioms for (hierarchical) clustering.

Figure 3: Illustrations of a forest F and of its structure s(F).

Axiom 1 (Clustering)
Let (A, Q, ⊥) be a clustering base and P ⊂ M_Ω be a set of measures with Q ⊂ P. A map c : P → F is called an A-clustering if it satisfies:

(a) Structured: For all P ∈ P the forest c(P) is structured, i.e. c(P) = s(c(P)).
(b) ScaleInvariance: For all P ∈ P and α > 0 we have αP ∈ P and c(αP) = c(P).
(c) BaseMeasureClustering: For all A ∈ A we have c(Q_A) = {A}.

Note that the scale invariance is solely for notational convenience. Indeed, we could have defined clusterings for distributions only, in which case the scale invariance would have been obsolete. Moreover, assuming that a clustering produces structured forests essentially means that the clustering is only interested in the skeleton of the cluster forest. Finally, the axiom of base measure clustering means that we have a set of elementary measures, namely the base measures, for which we have already agreed that they can only be clustered in a trivial way. In Section 4 we will present a couple of examples of (A, Q, ⊥) for which such an agreement is possible. Finally, note that these axioms guarantee that if c : P → F is a clustering and a is a base measure on A, then a ∈ P and c(a) = {A}.

So far our axioms only determine the clusterings for base measures. Therefore, the goal of this subsection is to describe the behaviour of clusterings on certain combinations of measures. Furthermore, we will show that the axioms describing this behaviour are consistent and uniquely determine a hierarchical clustering on a certain set of measures induced by Q. Let us begin by introducing the axioms of additivity, which we have already described and motivated in the introduction.

Axiom 2 (Additive Clustering)
Let (A, Q, ⊥) be a clustering base and P ⊂ M_Ω be a set of measures with Q ⊂ P. A clustering c : P → F is called additive iff the following conditions are satisfied:

(a) DisjointAdditivity: For all P_1, …, P_k ∈ P with mutually ⊥-separated supports, i.e. supp P_i ⊥ supp P_j for all i ≠ j, we have P_1 + … + P_k ∈ P and
c(P_1 + … + P_k) = c(P_1) ∪ · · · ∪ c(P_k).
(b) BaseAdditivity: For all P ∈ P and all base measures a with supp P ⊂ supp a we have a + P ∈ P and
c(a + P) = s({supp a} ∪ c(P)).

Our next goal is to show that additive clusterings exist and that they are unique on a set S of measures that, in some sense, is spanned by Q. The following definition introduces this set.

Definition 6
Let (A, Q, ⊥) be a clustering base and F ∈ F_A be an A-valued finite ⊥-forest. A measure Q is simple on F iff there exist base measures a_A on A ∈ F such that

Q = ∑_{A∈F} a_A. (15)

We denote the set of all simple measures with respect to (A, Q, ⊥) by S := S(A).

Figure 4 provides an example of a simple measure. The next lemma shows that the representation (15) of simple measures is actually unique.
Lemma 7
Let (A, Q, ⊥) be a clustering base and Q ∈ S(A). Then there exists exactly one F_Q ∈ F_A such that Q is simple on F_Q. Moreover, the representing base measures a_A in (15) are also unique and we have supp Q = G(F_Q).

Figure 4: Simple measure.

Using Lemma 7 we can now define certain restrictions of simple measures Q ∈ S(A) with representation (15). Namely, any subset F′ ⊂ F gives a measure

Q|_{F′} := ∑_{A∈F′} a_A.

We write Q|_{⊃A} := Q|_{F|_{⊃A}} and similarly Q|_{⊋A}, Q|_{⊂A}, Q|_{⊊A}.

With the help of Lemma 7 it is now easy to explain how a possible additive clustering could look on S(A). Indeed, for a Q ∈ S(A), Lemma 7 provides a unique finite forest F_Q ∈ F_A that represents Q, and therefore the structure s(F_Q) is a natural candidate for a clustering of Q. The next theorem shows that this idea indeed leads to an additive clustering, and that every additive clustering on S(A) retrieves the structure of the underlying forest of a simple measure.

Theorem 8
Let (A, Q, ⊥) be a clustering base and S(A) the set of simple measures. Then we can define an additive A-clustering c : S(A) → F_A by

c(Q) := s(F_Q),  Q ∈ S(A).   (16)

Moreover, every additive A-clustering c : P → F satisfies both S(A) ⊂ P and (16).
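For finite forests, the structure s(F_Q) appearing in Theorem 8 collapses pure chains, so a node survives exactly when it is a root or has a sibling. A small sketch under our own encoding (nodes as finite sets ordered by inclusion; not the paper's notation):

```python
# Sketch: compute s(F) for a finite forest of sets. A node survives iff it
# is a root or has at least one sibling; pure chains collapse to their top.
def parent(forest, a):
    """Smallest strict superset of `a` in the forest, or None for roots."""
    sups = [b for b in forest if a < b]
    return min(sups, key=len) if sups else None

def structure(forest):
    kept = []
    for a in forest:
        p = parent(forest, a)
        if p is None or any(b != a and parent(forest, b) == p for b in forest):
            kept.append(a)
    return kept

R = frozenset(range(6))      # root
M = frozenset(range(4))      # unique child of R -> chain {R, M} collapses
A = frozenset({0, 1})        # sibling of B
B = frozenset({2, 3})
S = structure([R, M, A, B])
```

Here the pure chain {R, M} is replaced by its union R, so the computed structure is {R, A, B}.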
3. Continuous Clustering
As described in the introduction, we typically need, besides additivity, also some notion of continuity for clusterings. The goal of this section is to introduce such a notion and to show that, similarly to Theorem 8, this continuity uniquely defines a clustering on a suitably defined extension of S(A). To this end, we first introduce a notion of monotone convergence for sequences of simple measures that does not change the graph structure of the corresponding clusterings given by Theorem 8. We then discuss a richness property of the clustering base, which essentially ensures that we can approximate the non-disjoint union of two base sets by another base set. In the next step we describe monotone sequences of simple measures that are in some sense adapted to the limiting distribution. In the final part of this section we then axiomatically describe continuous clusterings and show both their existence and their uniqueness.
The goal of this section is to introduce a notion of monotone convergence for simple measures that preserves the graph structure of the corresponding clusterings. Our first step in this direction is done in the following definition, which introduces a sort of monotonicity for isomorphic set-valued forests.
Definition 9
Let
F, F′ ∈ F be two finite forests. Then F and F′ are isomorphic, denoted by F ≅ F′, iff there is a bijection ζ : F → F′ such that for all A, A′ ∈ F we have:

A ⊂ A′ ⟺ ζ(A) ⊂ ζ(A′).   (17)

Moreover, we write F ≤ F′ iff F ≅ F′ and there is a map ζ : F → F′ satisfying (17) and

A ⊂ ζ(A),  A ∈ F.   (18)

In this case, the map ζ, which is uniquely determined by (17), (18), and the fact that F and F′ are finite, is called the forest relating map (FRM) between F and F′.

Forests can be viewed as directed acyclic graphs: there is an edge between A and A′ in F iff A ⊂ A′ and no other node is in between. Then F ≅ F′ holds iff F and F′ are isomorphic as directed graphs. From this it becomes clear that ≅ is an equivalence relation. Moreover, the relation F ≤ F′ means that each node A of F can be graph-isomorphically mapped to a node of F′ that contains A, see Figure 5 for an illustration. Note that ≤ is a partial order on F and in particular it is transitive. Consequently, if we have finite forests F_1 ≤ · · · ≤ F_k then F_1 ≤ F_k and there is an FRM ζ_k : F_1 → F_k. This observation is used in the following definition, which introduces monotone sequences of forests and their limit. Definition 10 An isomonotone sequence of forests is a sequence of finite forests (F_n)_n ⊂ F such that s(F_n) ≤ s(F_{n+1}) for all n ≥ 1. If this is the case, we define the limit by

F_∞ := lim_{n→∞} s(F_n) := { ⋃_{n≥1} ζ_n(A) | A ∈ s(F_1) },

where ζ_n : s(F_1) → s(F_n) is the FRM obtained from s(F_1) ≤ s(F_n).

Figure 5: F ≤ F′ and the arrows indicate ζ.

It is easy to see that, in general, the limit forest F_∞ of an isomonotone sequence of A-valued forests is not A-valued. To describe the values of F_∞ we define the monotone closure of an A ⊂ B by

Ā := { ⋃_{n≥1} A_n | A_n ∈ A and A_1 ⊂ A_2 ⊂ ... }.

The next lemma states some useful properties of Ā and F_∞. Lemma 11
Let ⊥ be an A-separation relation. Then ⊥ is actually an Ā-separation relation. Moreover, if ⊥ is stable and (F_n) ⊂ F_A is an isomonotone sequence then F_∞ := lim_n s(F_n) is an Ā-valued ⊥-forest and we have s(F_n) ≤ F_∞ for all n ≥ 1.

Unlike forests, it is straightforward to compare two measures Q_1 and Q_2 on B. Indeed, we say that Q_2 majorizes Q_1, in symbols Q_1 ≤ Q_2, iff

Q_1(B) ≤ Q_2(B),  for all B ∈ B.

For (Q_n) ⊂ M_Ω and P ∈ M_Ω, we similarly speak of monotone convergence Q_n ↑ P iff Q_1 ≤ Q_2 ≤ · · · ≤ P and lim_{n→∞} Q_n(B) = P(B) for all B ∈ B. Clearly, Q ≤ Q′ implies supp Q ⊂ supp Q′, and it is easy to show that Q_n ↑ P implies

P( supp P \ ⋃_{n≥1} supp Q_n ) = 0.

We will use such arguments throughout this section. For example, if a, a′ are base measures on A, A′ with a ≤ a′ then A ⊂ A′. With the help of these preparations we can now define isomonotone convergence of simple measures. Definition 12
Let (A, Q, ⊥) be a clustering base and (Q_n) ⊂ S(A) be a sequence of simple measures on finite forests (F_n) ⊂ F_A. Then isomonotone convergence, denoted by (Q_n, F_n) ↑ P, means that both Q_n ↑ P and s(F_1) ≤ s(F_2) ≤ ... . In addition, S̄ := S̄(A) denotes the set of all isomonotone limits, i.e.

S̄(A) = { P ∈ M_Ω | (Q_n, F_n) ↑ P for some (Q_n) ⊂ S(A) on (F_n) ⊂ F_A }.

For a measure P ∈ S̄(A) it is probably tempting to define its clustering by c(P) := lim_n s(F_n), where (Q_n, F_n) ↑ P is some isomonotone sequence. Unfortunately, such an approach does not yield a well-defined clustering, as we have discussed in the introduction. For this reason, we need to develop some tools that help us to distinguish between "good" and "bad" isomonotone approximations. This is the goal of the following two subsections.

In this subsection we present and discuss a technical assumption on a clustering base that will make it possible to obtain unique continuous clusterings. Let us begin by introducing a notation that will be frequently used in the following. To this end, we fix a clustering base (A, Q, ⊥) and a P ∈ M_Ω. For B ∈ B we then define

Q_P(B) := { αQ_A | α > 0, A ∈ A, B ⊂ A, αQ_A ≤ P },

i.e. Q_P(B) denotes the set of all base measures below P whose support contains B. Now, our first definition describes events that can be combined in a base set: Definition 13
Let (A, Q, ⊥) be a clustering base and P ∈ M_Ω. Two non-empty B, B′ ∈ B are called kin below P, denoted as B ∼_P B′, iff Q_P(B ∪ B′) ≠ ∅, i.e., iff there is a base measure a ∈ Q such that the following holds:

(a) B ∪ B′ ⊂ supp a
(b) a ≤ P.

Moreover, we say that every such a ∈ Q_P(B ∪ B′) supports B and B′ below P.

Figure 6: Kinship.

Kinship of two events can be used to test whether they belong to the same root in the cluster forest. To illustrate this we consider two events B and B′ with B ≁_P B′. Moreover, assume that there is an A ∈ A with B ∪ B′ ⊂ A. Then B ≁_P B′ implies that for all such A there is no α > 0 with αQ_A ≤ P. This situation is displayed on the right-hand side of Figure 6. Now assume that we have two base measures a, a′ ≤ P on A, A′ ∈ A that satisfy A ∼_P A′ and P(A ∩ A′) > 0. If A is rich in the sense of A ∪ A′ ∈ A, then we can find a base measure b on B := A ∪ A′ with a ≤ b ≤ P or a′ ≤ b ≤ P. The next definition relaxes the requirement A ∪ A′ ∈ A, see also Figure 7 for an illustration. Definition 14
Let P ∈ M_Ω be a measure. For B, B′ ∈ B we write

B ⊥⊥_P B′ :⟺ P(B ∩ B′) = 0   and   B ◦◦_P B′ :⟺ P(B ∩ B′) > 0.

Moreover, a clustering base (A, Q, ⊥) is called P-subadditive iff for all base measures a, a′ ≤ P on A, A′ ∈ A we have

A ◦◦_P A′ ⟹ ∃ b ∈ Q_P(A ∪ A′) : b ≥ a or b ≥ a′.   (19)

Figure 7: P-subadditivity.

Note that the implication (19) in particular ensures Q_P(A ∪ A′) ≠ ∅, i.e. A ∼_P A′. Moreover, the relation ⊥⊥_P is weaker than any separation relation ⊥ since we obviously have A ◦◦_P A′ ⟹ A ◦◦_∅ A′ ⟹ A ◦◦ A′, where the second implication is shown in Lemma 30. The following definition introduces a stronger notion of additivity. Definition 15
Let ◦◦ be a relation on B. An A ⊂ B is ◦◦-additive iff for all A, A′ ∈ A

A ◦◦ A′ ⟹ A ∪ A′ ∈ A.

The next proposition compares the several notions of (sub)additivity. In particular it implies that if A is ◦◦_∅-additive then (A, Q_{µ,A}, ⊥) is P-subadditive for all P ∈ M_Ω. Proposition 16
Let (A, Q_{µ,A}, ⊥) be a clustering base as in Proposition 3. If A is ◦◦_P-additive for some P ∈ M_Ω, then (A, Q_{µ,A}, ⊥) is P-subadditive. Conversely, if (A, Q_{µ,A}, ⊥) is P-subadditive for all P ≪ µ then A is ◦◦_µ-additive and thus also ◦◦_P-additive for all P ≪ µ.

We have already seen that isomonotone approximations by simple measures are not structurally unique. In this subsection we will therefore identify the most economical structure needed to approximate a distribution by simple measures. Such most parsimonious structures will then be used to define continuous clusterings. Let us begin by introducing a different view on simple measures.
Definition 17
Let (A, Q, ⊥) be a clustering base and Q be a simple measure on F ∈ F_A with the unique representation Q = Σ_{A ∈ F} α_A Q_A. We define the map λ_Q : F → Q by

λ_Q(A) := ( Σ_{A′ ∈ F : A′ ⊃ A} α_{A′} Q_{A′}(A) ) · Q_A,  A ∈ F.

Moreover, we call the base measure λ_Q(A) ∈ Q the level of A in Q.

Figure 8: Level.

In some sense, the level of an A in Q combines all ancestor measures including Q_A and then restricts this combination to A, see Figure 8 for an illustration of the level of a node. With the help of levels we can now describe structurally economical approximations of measures by simple measures. Definition 18
Let (A, Q, ⊥) be a clustering base and P ∈ M_Ω a finite measure. Then a simple measure Q on a forest F ∈ F_A is P-adapted iff all direct siblings A_1, A_2 in F are:

(a) P-grounded: if they are kin below P, i.e. Q_P(A_1 ∪ A_2) ≠ ∅, then there is a parent around them in F.

(b) P-fine: every b ∈ Q_P(A_1 ∪ A_2) can be majorized by a base measure b̃ that supports all direct siblings A_1, ..., A_k of A_1 and A_2, i.e.

b ∈ Q_P(A_1 ∪ A_2) ⟹ ∃ b̃ ∈ Q_P(A_1 ∪ ... ∪ A_k) with b̃ ≥ b.

(c) strictly motivated: for their levels a_1 := λ_Q(A_1) and a_2 := λ_Q(A_2) in Q there is an α ∈ (0, 1) such that every base measure b that supports them below P is not larger than α a_1 or α a_2, i.e.

∀ b ∈ Q : b ≥ α a_1 or b ≥ α a_2 ⟹ b ∉ Q_P(A_1 ∪ A_2).   (20)

Finally, an isomonotone sequence (Q_n, F_n) ↑ P is adapted if Q_n is P-adapted for all n ≥ 1.

Since siblings are ⊥-separated, they are ⊥⊥_P-separated, so strict motivation is no contradiction to P-subadditivity. Levels are called motivated iff they satisfy condition (20) for α = 1. Figure 9 illustrates the three conditions describing adapted measures. It can be shown that if A is ◦◦_∅-additive, then any isomonotone sequence can be made adapted.

Figure 9: Illustrations of the conditions: not motivated vs. motivated; not grounded; grounded but not fine; and adapted (grounded, fine and motivated).

The following self-consistency result shows that every simple measure is adapted to itself. This result will guarantee that the extension of the clustering from S to S̄ is indeed an extension. Proposition 19
Let (A, Q, ⊥) be a clustering base. Then every Q ∈ S(A) is Q-adapted.

In this subsection we finally introduce continuous clusterings with the help of adapted, isomonotone sequences. Furthermore, we will show the existence and uniqueness of such clusterings. Let us begin by introducing a notation that will be used to identify two clusterings as identical. To this end let F_1, F_2 ∈ F be two forests and P ∈ M_Ω be a finite measure. Then we write F_1 =_P F_2 if there exists a graph isomorphism ζ : F_1 → F_2 such that

P(A △ ζ(A)) = 0,  for all A ∈ F_1.

Now our first result shows that adapted isomonotone limits of two different sequences coincide in this sense. Theorem 20
Let (A, Q, ⊥) be a stable clustering base and P ∈ M_Ω be a finite measure such that A is P-subadditive. If (Q_n, F_n) ↑ P and (Q′_n, F′_n) ↑ P are adapted isomonotone sequences then we have

lim_n s(F_n) =_P lim_n s(F′_n).

Theorem 20 shows that different adapted sequences approximating a measure P necessarily have isomorphic forests and that the corresponding limit nodes of the forests coincide up to P-null sets. This result makes the following axiom possible. Axiom 3 (Continuous Clustering)
Let (A, Q, ⊥) be a clustering base and P ⊂ M_Ω be a set of measures. We say that c : P → F is a continuous clustering if it is an additive clustering and for all P ∈ P and all adapted isomonotone sequences (Q_n, F_n) ↑ P we have

c(P) =_P lim_n s(F_n).

The following main result of this section shows that there exist continuous clusterings and that they are uniquely determined on a large subset of S̄(A). Theorem 21
Let (A, Q, ⊥) be a stable clustering base and set

P_A := { P ∈ S̄(A) | A is P-subadditive and there is an adapted (Q_n, F_n) ↑ P }.

Then there exists a continuous clustering c_A : P_A → F_Ā. Moreover, c_A is unique on P_A, that is, for all continuous clusterings c : P̃ → F we have c_A(P) =_P c(P), P ∈ P_A.

Recall from Proposition 16 that A is P-subadditive for all P ∈ M_Ω if A is ◦◦_∅-additive. It can be shown that if A is ◦◦_∅-additive, then any isomonotone sequence can be made adapted. In this case we thus have P_A = S̄(A) and Theorem 21 shows that there exists a unique continuous clustering on S̄(A).

Let us recall from Proposition 3 that a simple way to define a set of base measures Q was with the help of a reference measure µ. Given a stable separation relation ⊥, we denoted the resulting stable clustering base by (A, Q_{µ,A}, ⊥). Now observe that for this clustering base every Q ∈ S(A) is µ-absolutely continuous and its unique representation yields the µ-density f = Σ_{A ∈ F} α_A 1_A for suitable coefficients α_A > 0. Consequently, each level set {f > λ} consists of some elements A ∈ F, and if all elements in A are connected, the additive clustering c(Q) of Q thus coincides with the "classical" cluster tree obtained from the level sets. It is therefore natural to ask whether such a relation still holds for continuous clusterings on distributions P ∈ P_A. Clearly, the first answer to this question needs to be negative, since in general the cluster tree is an infinite forest whereas our clusterings are always finite.
To illustrate this, let us consider the Factory density on [0, 2], which is defined by

f(x) := 1 − x, if x ∈ [0, 1),  and  f(x) := 1, if x ∈ [1, 2].

Clearly, this gives the following ⊥_∅-decomposition of the level sets:

{f > λ} = [0, 2], if λ < 0,
{f > λ} = [0, 1 − λ) ⊥_∅∪ [1, 2], if 0 ≤ λ < 1,

which leads to the clustering forest F_f = {[0, 2], [1, 2]} ∪ {[0, 1 − λ) | 0 ≤ λ < 1}. Now observe that even though F_f is infinite, it is as a graph somehow simple: there is a root [0, 2], a node [1, 2], and an infinite chain [0, 1 − λ), 0 ≤ λ < 1. Replacing this chain by its supremum [0, 1) we obtain the structured forest {[0, 2], [0, 1), [1, 2]}, for which we can then ask whether it coincides with the continuous clustering obtained from (A, Q_{µ,A}, ⊥_∅) if A consists of all closed intervals in [0, 2] and µ is the Lebesgue measure.

To answer this question we first need to formalize the operation that assigns a structured forest to an infinite forest. To this end, let F be an arbitrary ⊥-forest. We say that C ⊂ F is a pure chain iff for all C, C′ ∈ C and A ∈ F \ C the following two implications hold:

A ⊂ C ⟹ A ⊂ C′,
C ⊂ A ⟹ C′ ⊂ A.

Roughly speaking, the first implication ensures that no node above a bifurcation is contained in the chain, while the second implication ensures that no node below a bifurcation is contained in the chain. With this interpretation in mind it is not surprising that we can define the structure of the forest F with the help of the maximal pure chains by setting

s(F) := { ⋃ C | C ⊂ F is a maximal pure chain }.

Note that for infinite forests the structure s(F) may or may not be finite. For example, for the Factory density it is finite as we have already seen above. We have seen in Lemma 11 that the nodes of a continuous clustering are ⊥-separated elements of Ā.
Consequently, it only makes sense to compare a continuous clustering with the structure of a level set forest if this forest shares this property. This is ensured in the following definition. Definition 22
Let f : Ω → [0, ∞] be a measurable function and (A, Q, ⊥) be a stable clustering base. Then f is of (A, Q, ⊥)-type iff there is a dense subset Λ ⊂ [0, sup f) such that for all λ ∈ Λ the level set {f > λ} is a finite union of pairwise ⊥-separated events B_1(λ), ..., B_{k(λ)}(λ) ∈ Ā. If this is the case the level set ⊥-forest is given by

F_{f,Λ} := { B_i(λ) | i ≤ k(λ) and λ ∈ Λ }.

Note that for given f and Λ the forest F_{f,Λ} is indeed well-defined since ⊥ is an Ā-separation relation by Lemma 11 and therefore the decomposition of {f > λ} into the sets B_1(λ), ..., B_{k(λ)}(λ) ∈ Ā is unique by Lemma 30. With the help of these preparations we can now formulate the main result of this subsection, which compares continuous clusterings with the structure of level set ⊥-forests: Theorem 23
Let µ ∈ M_Ω, let (A, Q_{µ,A}, ⊥) be the stable clustering base described in Proposition 3, and let P ∈ M_Ω be such that A is P-subadditive. Assume that P has a µ-density f that is of (A, Q, ⊥)-type with a dense subset Λ such that s(F_{f,Λ}) is finite and for all λ ∈ Λ and all i < j ≤ k(λ) we have B_i(λ) ⊥ B_j(λ). Then we have P ∈ S̄(A) and

c(P) =_µ s(F_{f,Λ}).

On the other hand, it is not difficult to show that if P ∈ S̄(A) then P has a density of (A, Q, ⊥)-type. We do not know, though, whether there has to be a density of (A, Q, ⊥)-type for which even the closures of siblings are separated. If supp µ ≠ Ω one might think that this is not true, since on the complement of the support anything goes. To be more precise, if µ is not inner regular and hence no support is defined, assume there is an open set O ⊂ Ω with µ(O) = 0. This then means that there is no base set A ⊂ O, because base sets are support sets. Hence anything that would happen on O is determined by what happens in supp P!

In the literature, density based clustering is usually only considered for continuous densities, since these may serve as a canonical version of the density. The following result investigates such densities. Proposition 24
For a compact Ω ⊂ R^d and a measure µ ∈ M_Ω we consider the stable clustering base (A, Q_{µ,A}, ⊥_∅). We assume that all open, connected sets are contained in Ā and that P ∈ M_Ω is a finite measure such that A is P-subadditive. If P has a continuous density f that has only finitely many local maxima x*_1, ..., x*_k then P ∈ P_A and there is a bijection ψ : {x*_1, ..., x*_k} → min c(P) such that x*_i ∈ ψ(x*_i). In this case

c(P) =_µ { B_i(λ) | i ≤ k(λ) and λ ∈ Λ },

where Λ = {λ_1 < ... < λ_m < sup f} is the finite set of levels at which the splits occur.
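Proposition 24 predicts that for a continuous density the minimal clusters correspond bijectively to the local maxima. A crude numerical sketch for a bimodal density on [0, 1] (the particular density and the grid discretization are our own choices, not taken from the text):

```python
# Sketch: count connected components of {f > lambda} on a 1-D grid and
# observe that the number of leaves matches the number of local maxima (2).
def runs(mask):
    """Number of maximal runs of True values in a boolean sequence."""
    n, prev = 0, False
    for m in mask:
        if m and not prev:
            n += 1
        prev = m
    return n

xs = [i / 1000 for i in range(1001)]
# continuous density with exactly two local maxima, at x = 1/4 and x = 3/4
f = [0.5 - min(abs(x - 0.25), abs(x - 0.75)) for x in xs]

n_low = runs([v > 0.1 for v in f])    # below the split level: one component
n_high = runs([v > 0.4 for v in f])   # above the split level: two components
```

The single split level here is λ = 1/4; sweeping λ over a grid and recording where the component count changes recovers the finite set Λ of the proposition.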
4. Examples
After having given the skeleton of this theory we now give more examples of how to use it. These should also motivate some of the design decisions. It will moreover become clear in what way the choice of a clustering base (A, Q, ⊥) influences the clustering. In this subsection we present several examples of clustering bases. Our first three examples consider different separation relations.
Example 1 (Separation relations)
The following define stable A-separation relations: (a) Disjointness: If
A ⊂ B is a collection of non-empty, closed, and topologically connected sets then

B ⊥_∅ B′ :⟺ B ∩ B′ = ∅.

(b) τ-separation: Let (Ω, d) be a metric space, τ > 0, and A ⊂ B be a collection of non-empty, closed, and τ-connected sets; then

B ⊥_τ B′ :⟺ d(B, B′) ≥ τ.

(c) Linear separation: Let H be a Hilbert space with inner product ⟨·|·⟩ and Ω ⊂ H. Then non-empty events A, B ⊂ Ω are linearly separated, A ⊥_ℓ B, iff A ⊥_∅ B and

∃ v ∈ H \ {0}, α ∈ R ∀ a ∈ A, b ∈ B : ⟨a|v⟩ ≤ α and ⟨b|v⟩ ≥ α.

The latter means there is an affine hyperplane U ⊂ Ω such that A and B are on different sides. Then ⊥_ℓ is an A-separation relation if no base set A ∈ A can be written as a finite union of pairwise ⊥_ℓ-disjoint closed sets. It is stable if H is finite-dimensional.

Our next goal is to present some examples of base set collections A. Since these describe the sets on which we agree that they can only be trivially clustered, smaller collections A are generally preferred. Let µ be the Lebesgue measure on R^d. To define possible collections A we will consider the following building blocks in R^d:

C_Dyad := { axis-parallel boxes with dyadic coordinates },
C_p := { closed ℓ_p^d-balls },  p ∈ [1, ∞],
C_Conv := { convex and compact µ-support sets }.

C_Dyad corresponds to the cells of a histogram whereas C_p has connections to moving-window density estimation. When combined with ⊥_∅ or ⊥_τ and base measures of the form (14) these collections may already serve as clustering bases. However, C̄_• and S̄_{C_•} are not very rich since monotone increasing sequences in C_• converge to sets of the same shape, and hence the sets in C̄_• have the same shape constraint as those in C_•. As a result the sets of measures S̄_{C_•} for which we can determine the unique continuous clustering are rather small.
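The separation relations of Example 1 are easy to test for finite point clouds; a small sketch of our own (Euclidean setting, hypothetical helper names):

```python
from math import dist  # Euclidean distance between two points

def tau_separated(B1, B2, tau):
    """(b) tau-separation: d(B1, B2) >= tau for finite point sets."""
    return min(dist(p, q) for p in B1 for q in B2) >= tau

def linearly_separated(B1, B2, v):
    """(c) linear separation: some threshold alpha puts B1 and B2
    on opposite sides of the direction v."""
    dot = lambda p: sum(pi * vi for pi, vi in zip(p, v))
    return max(dot(p) for p in B1) <= min(dot(q) for q in B2)

A = [(0.0, 0.0), (0.0, 1.0)]   # a vertical pair on the y-axis
B = [(2.0, 0.0), (2.0, 1.0)]   # the same pair shifted by 2 in x
```

Here d(A, B) = 2, so A and B are τ-separated exactly for τ ≤ 2, and the direction v = (1, 0) linearly separates them.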
However, more interesting collections can be obtained by considering finite, connected unions built of such sets. To describe such unions in general we need the following definition. Definition 25
Let ⊥⊥ be a relation on B, ◦◦ be its negation, and C ⊂ B be a class of non-empty events. The ⊥⊥-intersection graph on C, G_⊥⊥(C), has C as nodes and there is an edge between A, B ∈ C iff A ◦◦ B. We define:

C_⊥⊥(C) := { C_1 ∪ ... ∪ C_k | C_1, ..., C_k ∈ C and the graph G_⊥⊥({C_1, ..., C_k}) is connected }.

Obviously any separation relation can be used. But one can also consider weaker relations like ⊥⊥_P, or e.g. A ⊥⊥ A′ if A ∩ A′ has empty interior, or if it contains no ball of size τ. Such examples yield smaller A and indeed in these cases S̄ is much smaller. The following example provides stable clustering bases. Example 2 (Clustering bases)
The following examples are ◦◦_∅-additive:

A_Dyad := C_⊥∅(C_Dyad) = { finite connected unions of boxes with dyadic coordinates },
A_p := C_⊥∅(C_p) = { finite connected unions of closed ℓ_p-balls },
A_Conv := C_⊥∅(C_Conv) = { finite connected unions of convex µ-support sets }.

Then A_Dyad, A_p, A_Conv ⊂ K(µ). Furthermore the following examples are ◦◦_τ-additive:

A^τ_Dyad := C_⊥τ(C_Dyad),  A^τ_p := C_⊥τ(C_p),  A^τ_Conv := C_⊥τ(C_Conv).

This leads to the following examples of stable clustering bases:

(A_Dyad, Q_{µ,A_Dyad}, ⊥_∅),  (A_p, Q_{µ,A_p}, ⊥_∅),  (A_Conv, Q_{µ,A_Conv}, ⊥_∅),
(A^τ_Dyad, Q_{µ,A^τ_Dyad}, ⊥_τ),  (A^τ_p, Q_{µ,A^τ_p}, ⊥_τ),  (A^τ_Conv, Q_{µ,A^τ_Conv}, ⊥_τ),
(A_Dyad, Q_{µ,A_Dyad}, ⊥_τ),  (A_p, Q_{µ,A_p}, ⊥_τ),  (A_Conv, Q_{µ,A_Conv}, ⊥_τ).
The first row is the most common case, using connected sets and their natural separationrelation. The second row is the τ -connected case. The third row shows how the fine tuningcan be handled: We consider connected base sets, but siblings need to be τ -separated, hencee.g. saddle points cannot be approximated. The larger the extended class ¯ A is, the more measures we can cluster. The followingproposition provides a sufficient condition for ¯ A being rich. Proposition 26
Assume all A ∈ A are path-connected. Then all B ∈ Ā are path-connected. Furthermore assume that A is intersection-additive and that it contains a countable neighbourhood base. Then Ā contains all open, path-connected sets.

One can show that the first statement also holds for topological connectedness. Furthermore note that C_Dyad is a countable neighbourhood base, and therefore A_Dyad, A_p, and A_Conv fulfill the conditions of Proposition 26.
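The construction C_⊥⊥(C) of Definition 25 only admits unions whose intersection graph is connected. A sketch of our own, with ◦◦ taken as plain nonempty intersection and finite sets standing in for events:

```python
# Sketch: is C_1 ∪ ... ∪ C_k admissible for C_⊥⊥(C), i.e. is the
# intersection graph (edges = nonempty overlap) connected? BFS over overlaps.
def union_admissible(sets):
    if not sets:
        return False
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, s in enumerate(sets):
            if j not in seen and sets[i] & s:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(sets)
```

For example, a chain of pairwise overlapping pieces is admissible, while two pieces with no overlap (and no connector) are not; weaker relations ⊥⊥ simply change the edge test `sets[i] & s`.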
Following the manual for clustering densities given in Theorem 23, namely decomposing the density level sets into ⊥-disjoint components, one first needs to understand the ⊥-disjoint components of general events. In this subsection we investigate such decompositions and the resulting clusterings. We assume µ to be the Lebesgue measure on some suitable Ω ⊂ R^d and let the base measures be the ones considered in Proposition 3. For visualization purposes we further restrict our considerations to the one- and two-dimensional case only.

d = 1. In the one-dimensional case, in which Ω is an interval, the examples A_p = A_Conv simply consist of compact intervals, and their monotone closures consist of all intervals. To understand the resulting clusters let us first consider the twin peaks density

f(x) := 1/2 − min{ |x − 1/4|, |x − 3/4| },  x ∈ [0, 1].

Clearly, this gives the following ⊥_∅-decomposition of the level sets:

H_f(λ) = (λ − 1/4, 5/4 − λ) ∩ [0, 1], for λ < 1/4,
H_f(λ) = (λ − 1/4, 3/4 − λ) ⊥_∅∪ ((1/4 + λ, 5/4 − λ) ∩ [0, 1]), for 1/4 ≤ λ < 1/2,

and hence the ⊥_∅-clustering forest is {(0, 1), (0, 1/2), (1/2, 1)}. Since none of the boundary points can be reached, any isomonotone, adapted sequence yields this result. However, the clustering changes if the separation relation ⊥_τ is considered. We obtain

H_f(λ) = (λ − 1/4, 5/4 − λ) ∩ [0, 1], for λ < 1/4 + τ/2,
H_f(λ) = (λ − 1/4, 3/4 − λ) ⊥_τ∪ ((1/4 + λ, 5/4 − λ) ∩ [0, 1]), for 1/4 + τ/2 ≤ λ < 1/2,

if τ ∈ (0, 1/2), and the resulting ⊥_τ-clustering is {(0, 1), (τ/2, 1/2 − τ/2), (1/2 + τ/2, 1 − τ/2)}. Finally, if τ ≥ 1/2 then all level sets are τ-connected and the forest is simply {(0, 1)}. In Table 1 more examples of clustering of densities can be found.

Table 1: Examples of clustering in dimension d = 1 for the Merlon, Camel, M, and Factory densities under (A_p, ⊥_∅), (A_p, ⊥_τ) with τ small, and (A_p, ⊥_τ) with τ large.

d = 2. Our goal in this subsection is to understand the ⊥-separated decomposition of closed events. We further present the resulting clusterings for some densities that are indicator functions and illustrate clusterings for continuous densities having a saddle point. Let us begin by assuming that P has a Lebesgue density of the form 1_B, where B is some µ-support set. Then one can show, see Lemma 50 for details, that adapted, isomonotone sequences (F_n) of forests F_n ↑ B are of the form F_n = {A^n_1, ..., A^n_k}, where the elements of each forest F_n are mutually disjoint and can be ordered in such a way that A^1_i ⊂ A^2_i ⊂ ... . The limit forest F_∞ then consists of the k pairwise ⊥-separated sets

B_i := ⋃_{n≥1} A^n_i,

and there is a µ-null set N ∈ B with

B = B_1 ⊥∪ ... ⊥∪ B_k ⊥∪ N.   (21)

Let us now consider the base sets A_p in Example 2. By Proposition 26 we know that Ā_p contains all open, path-connected sets and therefore all open ℓ_q-balls. Moreover, all closed ℓ_q-balls B are µ-support sets with µ(∂B) = 0. Our initial consideration shows that B can be approximated by an adapted, isomonotone sequence (F_n) of forests of the form F_n = {A_n} with A_n ∈ A_p.
However, depending on p and q the µ-null set N in (21) may differ. Now that we have an understanding of Ā_p and adapted, isomonotone approximations, we can investigate some more interesting cases and appreciate the influence of the choice of A on the outcome of the clustering in the following example. Example 3 (Clustering of indicators)
We consider 6 examples of µ-support sets B ⊂ R². The first 4 have two parts that only intersect at one point, the second to last has two topological components, and the last one is connected in a fat way. By natural approximations we get the clusterings of Table 2. The red dots indicate points which are never achieved by any approximation. Observe how the geometry encoded in A shapes the clustering. Since A_Conv and A_2 are invariant under rotation, they yield the same structure of clustering for rotated sets. The classes A_1 and A_∞ on the other hand are not rotation-invariant and therefore the clustering depends on the orientation of B.

Table 2: Clusterings of the six indicator examples under A_1 = C_⊥∅(C_1), A_2 = C_⊥∅(C_2), A_∞ = C_⊥∅(C_∞), A_Conv, and the τ-separated base.

After having familiarized ourselves with the clustering of indicator functions we finally consider a continuous density that has a saddle point.
Example 4 On Ω := [−1, 1]² consider the density f : Ω → [0, 2] given by f(x, y) := x·y + 1. Then we have the following ⊥_∅-decomposition of the level sets H_f(λ) of f:

H_f(λ) = { (x, y) : xy > λ − 1 }, if λ ∈ [0, 1),
H_f(λ) = [−1, 0)² ∪̇ (0, 1]², if λ = 1,
H_f(λ) = { (x, y) : x < 0 and xy > λ − 1 } ∪̇ { (x, y) : x > 0 and xy > λ − 1 }, if λ ∈ (1, 2].

For (A_p, Q_{µ,A_p}, ⊥_∅) the clustering forest is therefore given by {[−1, 1]², [−1, 0)², (0, 1]²}. Moreover, for (A^τ, Q_{µ,A^τ}, ⊥_τ) the clustering forest consists of the root [−1, 1]² and two τ-separated components around the corners (−1, −1) and (1, 1).

So far we have only considered clusterings of Lebesgue absolutely continuous distributions. In this subsection we provide some examples indicating that the developed theory goes far beyond this standard example. At first, lower dimensional base sets and their resulting clusterings are investigated. Afterwards we discuss collections of base sets of different dimensions and provide clusterings for some measures that are not absolutely continuous with respect to any Hausdorff measure. For the sake of simplicity we will restrict our considerations to ⊥_∅-clusterings, but generalizations along the lines of the previous subsections are straightforward.
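Before moving on, the ⊥_∅-component counts behind Example 4 can be checked numerically; the grid resolution and the 4-connectivity below are our own crude choices, not part of the theory:

```python
from collections import deque

# Sketch: count 4-connected components of {f > lam} for f(x, y) = x*y + 1
# on a grid over [-1, 1]^2 (cf. Example 4: one component below level 1,
# two components at and above level 1).
def n_components(mask):
    rows, cols = len(mask), len(mask[0])
    seen, count = set(), 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                count += 1
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count

grid = [-1 + i / 100 for i in range(201)]
def level_mask(lam):
    return [[x * y + 1 > lam for x in grid] for y in grid]
```

At level 1 the two surviving quadrant regions touch only at the origin, which 4-connectivity does not bridge, mirroring the ⊥_∅-split of the example.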
7→ H s ( B ) jumps from ∞ to . If B has Hausdorff-dimension s , then H s ( B ) can be either zero, finite,or infinite. Hausdorff-measures are inner regular (Federer, 1969, Cor. 2.10.23) and H d equalsthe Lebesgue-measure up to a normalization factor. For a reference on Hausdorff-dimensionsand -measures we refer to Falconer (1993) and Federer (1969). Recall that given a Borel set C ⊂ R s a map ϕ : C → Ω is bi-Lipschitz iff there are constants < c , c < ∞ s.t. c d ( x, y ) ≤ d ( ϕ ( x ) , ϕ ( y )) ≤ c d ( x, y ) . Lemma 27 If C is a Lebesgue-support set in R s and ϕ : C → Ω is bi-Lipschitz then C ′ := ϕ ( C ) has Hausdorff-dimension s and it is an H s -support set in Ω . Motivated by Lemma 27, consider the following collection of s -dimensional base sets in Ω : C p,s := (cid:8) ϕ ( C ) ⊂ Ω | C is the closed unit p -ball in R s and ϕ : C → Ω is bi-Lipschitz (cid:9) . Using the notation of Definition 25 and Proposition 3 we further write A p,s := C ⊥ ∅ ( C p,s ) and Q p,s := Q H s , A p,s . By A := (cid:8) { x } | x ∈ Ω (cid:9) we denote the singletons and Q the collection of Dirac measures.Since continuous mappings of connected sets are connected, ( A p,s , Q p,s , ⊥ ∅ ) is a stable ⊥ ∅ -additive clustering base. Remark that we take the union after embedding into R d andtherefore also crossings do happen, e.g. the cross [ − , × { }∪ { }× [ − , ∈ A p, . Anotherpossibility would be to embed A p via a set of transformations into R d . Finally we confinethe examples here only to integer Hausdorff-dimensions—it would be interesting though toconsider e.g. the Cantor set or the Sierpinski triangle. The following example presents aresulting clustering of an H -absolutely continuous measure on R . Example 5 (Measures on curves in the plane) On Ω := [ − , consider the measure P := f d H whose density is given by f ( x, y ) := f Merlon ( x ) if x ≥ and y = 0 ,f Camel ( t ) if x = − t − and y = 3 − t ,f M ( t ) if x = 2 t − and y = − − t . 
Here the densities and clusterings for the Merlon, the Camel and the M can be seen in Table 1. So for (A_{p,1}, Q_{p,1}, ⊥_∅) with any fixed p ≥ 1 the clustering forest of P is given by

c(P) = { [0, ] × {0}, [0, ] × {0}, [ , ] × {0}, g₁((0, )), g₁((0. , 0. )), g₁((0. , 0. )), g₂([0, )), g₂([0, 0. )), g₂((0. , )) },

where g_i : [0, 1] → Ω are given by g₁(t) = (−t − , 3 − t) and g₂(t) = (2t − , − − t).

In this subsection we consider measures that can be decomposed into measures that are absolutely continuous with respect to Hausdorff measures of different dimensions. To this end, we write μ ≺ μ′ for two measures μ and μ′ on B iff for all B ∈ B with B ⊂ supp μ ∩ supp μ′ we have

μ(B) < ∞ ⟹ μ′(B) = 0.

For Q, Q′ ⊂ M_Ω we further write Q ≺ Q′ if μ ≺ μ′ for all μ ∈ Q and μ′ ∈ Q′. Clearly, the relation ≺ is transitive. Moreover, we have H^s ≺ H^t whenever s < t. The next proposition shows that clustering bases whose base measures dominate each other in the sense of ≺ can be merged. Proposition 28
Let (A₁, Q₁, ⊥), …, (A_m, Q_m, ⊥) be stable clustering bases sharing the same separation relation ⊥, and assume Q₁ ≺ ··· ≺ Q_m. We define

A := ⋃_i A_i and Q := ⋃_i Q_i.

Then (A, Q, ⊥) is a stable clustering base.

Proposition 28 shows that the ⊥_∅-additive, stable bases (A_{p,s}, Q_{p,s}, ⊥_∅) on R^d can be merged. Unfortunately, however, their union is no longer ⊥_∅-additive, and we therefore need to investigate P-subadditivity in order to describe distributions for which our theory provides a clustering. This is done in the next proposition. Proposition 29
Let (A₁, Q₁, ⊥) and (A₂, Q₂, ⊥) be clustering bases with Q₁ ≺ Q₂, and let P₁ and P₂ be finite measures with P₁ ≺ A₂ and A₁ ≺ P₂. Furthermore, assume that A_i is P_i-subadditive for both i = 1, 2, and let P := P₁ + P₂. Then we have:

(a) For i = 1, 2, every base measure a ∈ Q_i with a ≤ P already satisfies a ≤ P_i.
(b) If for every base measure a ∈ Q_P with supp P₂ ◦◦ supp a there exists a base measure ã ∈ Q_P(supp P₂) with a ≤ ã, then A₁ ∪ A₂ is P-subadditive.

To illustrate condition (b), consider the clustering bases (A_{p,s}, Q_{p,s}, ⊥_∅) and (A_{p,t}, Q_{p,t}, ⊥_∅) for some s < t. The condition specifies that any such base measure a intersecting supp P₂ can be majorized by one which supports supp P₂. Then all parts of supp P₁ intersecting at least one component of supp P₂ have to be on the same niveau line of P₂. Note that this is trivially satisfied if supp P₁ ∩ supp P₂ = ∅. Recall that mixtures of the latter form have already been clustered in Rinaldo and Wasserman (2010) by a kernel smoothing approach. Clearly, our axiomatic approach makes it possible to define clusterings for significantly more involved distributions, as the following two examples demonstrate. Example 6 (Mixture of atoms and full measure)
Consider
Ω := R. Let (A₀, Q₀, ⊥_∅) be the singletons with Dirac measures and consider, for any fixed p ≥ 1, the clustering base (A_{p,1}, Q_{p,1}, ⊥_∅). Both are ◦◦_∅-additive and stable, and we have Q₀ ≺ Q_{p,1}. Now consider the measures

P₁ := δ + 2δ + δ    and    P₂(dx) := sin²(xπ) H¹(dx).

Then the assumptions of Proposition 29 are satisfied and the clustering of P := P₁ + P₂ is given by

c(P) = c(P₁) ∪ c(P₂) = { { }, (0, ), ( , ), { }, { } }.

Our last example combines Examples 4 and 5.
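The flavour of Example 6 can be mimicked in a small numerical sketch (ours, not part of the paper): atoms contribute singleton clusters, while the absolutely continuous part contributes the connected components of the set where its density is positive. The atom locations, the density, and the grid-based notion of support components are hypothetical illustration choices standing in for the base-set machinery.

```python
# Toy clustering of a mixture of Dirac atoms and a density: singletons for
# atoms, grid-detected components of {density > 0} for the continuous part.
import math

atoms = [0.0, 2.0, 3.0]                       # hypothetical atom locations

def density(x):
    return math.sin(math.pi * x) ** 2

def support_components(f, lo, hi, n=4001, tol=1e-12):
    """Connected components (as intervals) of {f > tol} on a grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    comps, start, prev = [], None, lo
    for x in xs:
        if f(x) > tol:
            if start is None:
                start = x
        elif start is not None:
            comps.append((start, prev))
            start = None
        prev = x
    if start is not None:
        comps.append((start, prev))
    return comps

clusters = [{a} for a in atoms] + support_components(density, 0.0, 2.0)
print(len(clusters))  # 3 singleton clusters plus 2 components of the density
```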
Example 7 (Mixtures in dimension 2)
Consider
Ω := [−1, 1]² and the densities f₁ and f₂ introduced in Examples 5 and 4, respectively. Furthermore, consider the measures P₁ := f₁ dH¹ and P₂ := f₂ dH², and the clustering bases (A_{p,1}, Q_{p,1}, ⊥_∅) and (A_{p′,2}, Q_{p′,2}, ⊥_∅) for some fixed p, p′ ≥ 1. As above, Q_{p,1} ≺ Q_{p′,2}, and by Proposition 29 the clustering forest of P = P₁ + P₂ is given by

c(P₁) ∪ c(P₂) = { [0, ] × {0}, [0, ] × {0}, [ , ] × {0}, g₁((0, )), g₁((0. , 0. )), g₁((0. , 0. )), g₂([0, )), g₂([0, 0. )), g₂((0. , )), [−1, 1]², [−1, 0)², (0, 1]² },

where g_i : [0, 1] → Ω are given by g₁(t) = (−t − , 3 − t) and g₂(t) = (2t − , − − t). Observe that g₁ and g₂ lie on niveau lines of f₂.
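The two-dimensional part of this forest, inherited from Example 4, can be checked numerically. The following sketch (our own; grid resolution and 4-neighbour adjacency are arbitrary choices) counts the connected components of the level sets of f(x, y) = xy + 1 over [−1, 1]²: below the level 1 there is a single component, above it there are two.

```python
# Count connected components of {f >= level} for f(x, y) = x*y + 1 on an
# n x n grid over [-1, 1]^2, via breadth-first flood fill.
from collections import deque

def level_set_components(level, n=201):
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    inside = [[xs[i] * xs[j] + 1.0 >= level for j in range(n)] for i in range(n)]
    seen = [[False] * n for _ in range(n)]
    components = 0
    for i in range(n):
        for j in range(n):
            if inside[i][j] and not seen[i][j]:
                components += 1
                queue = deque([(i, j)])
                seen[i][j] = True
                while queue:
                    a, b = queue.popleft()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        u, v = a + da, b + db
                        if 0 <= u < n and 0 <= v < n and inside[u][v] and not seen[u][v]:
                            seen[u][v] = True
                            queue.append((u, v))
    return components

print(level_set_components(0.5))  # one component below the split level
print(level_set_components(1.5))  # two components above it
```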
5. Proofs
We begin with some simple properties of separation relations.
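As a toy illustration (ours, not part of the paper; finite unions of closed intervals stand in for general closed sets), the relation "the two sets have disjoint closures" behaves as a separation relation, and elementary properties of this kind can be checked directly:

```python
# Toy separation relation on finite unions of closed intervals: two such
# sets are separated iff no interval of one touches an interval of the
# other.  The checks mirror symmetry, monotonicity under subsets, and the
# fact that separated sets are disjoint.

def separated(B1, B2):
    """B1, B2: lists of closed intervals (a, b); True iff no pair touches."""
    return all(b < c or d < a for (a, b) in B1 for (c, d) in B2)

A  = [(0.0, 1.0)]
B  = [(2.0, 3.0), (4.0, 5.0)]
A0 = [(0.25, 0.5)]                      # a subset of A

assert separated(A, B) and separated(B, A)   # symmetry
assert separated(A0, B)                      # monotonicity in the first slot
assert not separated(A, [(1.0, 2.0)])        # closures touch at 1.0
print("separation checks passed")
```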
Lemma 30
Let ⊥ be an A-separation relation. Then the following statements are true:

(a) For all B, B′ ∈ B with B ⊥ B′ we have B ∩ B′ = ∅.
(b) Suppose that ⊥ is stable and (A_i)_{i≥1} ⊂ A is increasing. For A := ⋃_n A_n and all B ∈ B we then have: A_n ⊥ B for all n ≥ 1 ⟺ A ⊥ B.
(c) Let A ∈ A and B₁, …, B_k ∈ B be closed. Then: A ⊂ B₁ ⊥∪ … ⊥∪ B_k ⟹ ∃! i ≤ k : A ⊂ B_i.
(d) For all A₁, …, A_k ∈ A and all A′₁, …, A′_{k′} ∈ A, we have: A₁ ⊥∪ … ⊥∪ A_k = A′₁ ⊥∪ … ⊥∪ A′_{k′} ⟹ {A₁, …, A_k} = {A′₁, …, A′_{k′}}.

Proof of Lemma 30: (a).
Let us write B₀ := B ∩ B′. Monotonicity and B ⊥ B′ imply B₀ ⊥ B′, and thus B′ ⊥ B₀ by symmetry. Another application of monotonicity gives B₀ ⊥ B₀, and reflexivity thus shows B ∩ B′ = B₀ = ∅.

(b). "⇒" is stability and "⇐" follows from monotonicity.

(c). The existence of such an i is A-connectedness. Now assume that there is a j ≠ i with A ⊂ B_j. Then ∅ ≠ A ⊂ B_i ∩ B_j, contradicting B_i ⊥ B_j by (a).

(d). We write F := {A₁, …, A_k} and F′ := {A′₁, …, A′_{k′}}. By (c) we find an injection I : F → F′ such that A ⊂ I(A), and hence k ≤ k′. Analogously, we find an injection J : F′ → F such that A′ ⊂ J(A′), and we get k = k′. Consequently, I and J are bijections. Let us now fix an A_i ∈ F. For A_j := J ∘ I(A_i) ∈ F we then find A_i ⊂ I(A_i) ⊂ J(I(A_i)) = A_j. This implies i = j, since otherwise A_i ⊂ A_j would contradict A_i ⊥ A_j by (a). Therefore we find A_i = I(A_i), and the bijectivity of I thus yields the assertion.

Proof of Proposition 3:
We first need to check that the support is defined for all restrictions μ|_C := μ(· ∩ C) to sets C ∈ B that satisfy 0 < μ(C) < ∞. To this end, we check that μ|_C is inner regular: If Ω is a Radon space, then there is nothing to prove since μ|_C is a finite measure. If Ω is not a Radon space, then the definition of M^∞_Ω guarantees that μ is inner regular, and hence μ|_C is inner regular by Lemma 51.

Let us now verify that (A, Q_{μ,A}, ⊥) is a (stable) clustering base. To this end, we first observe that each Q_A ∈ Q_{μ,A} is a probability measure by construction, and since we have already seen that μ|_C is inner regular for all C ∈ K(μ), we conclude that Q_{μ,A} ⊂ M. Moreover, fittedness follows from A ⊂ K(μ). For flatness let A, A′ ∈ A with A ⊂ A′ and Q_{A′}(A) > 0. Then for all B ∈ B we have

Q_A(B) = μ(B ∩ A)/μ(A) = [μ(B ∩ A ∩ A′)/μ(A′)] · [μ(A′)/μ(A)] = Q_{A′}(B ∩ A)/Q_{A′}(A).

Proof of Lemma 7:
Let Q = Σ_{A∈F} α_A Q_A and Q = Σ_{A′∈F′} α′_{A′} Q_{A′} be two representations of Q ∈ Q. By part (d) of Lemma 51 we then obtain

supp Q = supp( Σ_{A∈F} α_A Q_A ) = ⋃_{A∈F} supp Q_A = ⋃_{A∈F} A = ⊔F,

and since we analogously find supp Q = ⊔F′, we conclude that ⊔F = ⊔F′. The latter together with Lemma 30 gives max F = max F′. To show that α_A = α′_A for all roots A ∈ max F = max F′, we pick a root A ∈ max F and assume that α_A < α′_A. Now, if A has no direct child, we set B := A. Otherwise we define B := A \ (A₁ ∪ … ∪ A_k), where A₁, …, A_k are the direct children of A in F. Because of the definition of a direct child and part (d) of Lemma 30 we find A₁ ∪ … ∪ A_k ⊊ A in the second case. In both cases we conclude that B is non-empty and relatively open in A = supp Q_A, and by Lemma 51 we obtain Q_A(B) > 0. Consequently, our assumption α_A < α′_A yields α_A Q_A(B) < α′_A Q_A(B) ≤ Q(B). However, our construction also gives

Q(B) = Σ_{A″∈F} α_{A″} Q_{A″}(B) = α_A Q_A(B) + Σ_{A″⊊A} α_{A″} Q_{A″}(B) + Σ_{A″⊥A} α_{A″} Q_{A″}(B) = α_A Q_A(B),

i.e. we have found a contradiction. Summing up, we already know that max F = max F′ and α_A = α′_A for all A ∈ max F. This yields

Σ_{A∈max F} α_A Q_A = Σ_{A′∈max F′} α′_{A′} Q_{A′}.

Eliminating the roots gives the forests F₁ := F \ max F and F′₁ := F′ \ max F′ and

Q₁ := Σ_{A∈F₁} α_A Q_A = Q − Σ_{A∈max F} α_A Q_A = Q − Σ_{A′∈max F′} α′_{A′} Q_{A′} = Σ_{A′∈F′₁} α′_{A′} Q_{A′},

i.e. Q₁ has two representations based upon the reduced forests F₁ and F′₁. Applying the argument above recursively thus yields F = F′ and α_A = α′_A for all A ∈ F.

Proof of Theorem 8:
We first show that (16) defines an additive clustering. Since Axiom 1is obviously satisfied, it suffices to check the two additivity axioms for P := S ( A ) . We beginby establishing DisjointAdditivity. To this end, we pick Q , . . . , Q k ∈ S ( A ) with representing ⊥ -forests F i such that supp Q i = G F i are mutually ⊥ -separated. For A ∈ max F i and A ′ ∈ max F j with i = j , we then have A ⊥ A ′ , and therefore F := F ∪ . . . ∪ F k is the representing ⊥ -forest of Q := Q + . . . + Q k . This gives Q ∈ S ( A ) and c ( Q ) = s ( F ) = s ( F ) ∪ · · · ∪ s ( F k ) = c ( Q ) ∪ · · · ∪ c ( Q k ) . To check BaseAdditivity we fix a Q ∈ S ( A ) with representing ⊥ -forest F and a base measure a = αQ A with supp Q ⊂ supp a . For all A ′ ∈ F we then have A ′ ⊂ G F = supp Q ⊂ A andtherefore F ′ := { A } ∪ F is the representing ⊥ -forest of a + Q . This yields a + Q ∈ S ( A ) and c ( a + Q ) = s ( F ′ ) = s (cid:0) { A } ∪ F (cid:1) = s (cid:0) supp a ∪ c ( Q ) (cid:1) . Let us now show that every additive A -clustering c : P → F satisfies both S ( A ) ⊂ P and(16). To this end we pick a Q ∈ S ( A ) with representing forest F and show by induction over | F | = n that both Q ∈ P and c ( Q ) = s ( F ) . Clearly, for n = 1 this immediately follows fromAxiom 1. For the induction step we assume that for some n ≥ we have already proved Q ′ ∈ P and c ( Q ′ ) = s ( F ′ ) for all Q ′ ∈ S ( A ) with representing forest F ′ of size | F ′ | < n .Let us first consider the case in which F is a tree. Let A be its root and α A be corre-sponding coefficient in the representation of Q . Then Q ′ := Q − α A Q A is a simple measurewith representing forest F ′ := F \ A and since | F ′ | = n − we know Q ′ ∈ P and c ( Q ′ ) = s ( F ′ ) by the induction assumption. 
By the axiom of BaseAdditivity we conclude that c ( Q ) = c ( α A Q A + Q ′ ) = s ( { A } ∪ c ( Q ′ )) = s ( { A } ∪ F ′ ) = s ( F ) , where the last equality follows from the assumption that F is a tree with root A .Now consider the case where F is a forest with k ≥ roots A , . . . , A k . For i ≤ k wedefine Q i := Q (cid:12)(cid:12) ⊂ A i . Then all Q i are simple measures with representing forests F i := F (cid:12)(cid:12) ⊂ A i and we have Q = Q + · · · + Q k . Therefore, the induction assumption guarantees Q i ∈ P and c ( Q i ) = s ( F i ) . Since supp Q i = A i and A i ⊥ A j whenever i = j , the axiom ofDisjointAdditivity then shows Q ∈ P and c ( Q ) = c ( Q ) ∪ · · · ∪ c ( Q k ) = s ( F ) ∪ · · · ∪ s ( F k ) = s ( F ) . homann, Steinwart, and Schmid For the first assertion it suffices to check ¯ A -connectedness. To thisend, we fix an A ∈ ¯ A and closed sets B , . . . , B k with A ⊂ B ⊥ ∪ . . . ⊥ ∪ B k . Let ( A n ) ⊂ A with A n ր A . For all n ≥ part (c) of Lemma 30 then gives exactly one i ( n ) with A n ⊂ B i ( n ) .This uniqueness together with A n ⊂ A n +1 yields i (1) = i (2) = . . . and hence A n ⊂ B i (1) forall n . We conclude that A ⊂ B i (1) by part (b) of Lemma 30.For the second assertion we pick an isomonotone sequence ( F n ) ⊂ F A and define F ∞ :=lim n s ( F n ) . Let us first show that F ∞ is a ⊥ -forest. To this end, we pick A, A ′ ∈ F ∞ . Bythe construction of F ∞ there then exist A , A ′ ∈ s ( F ) such that for A n := ζ n ( A ) and A ′ n := ζ n ( A ′ ) we have A n ր A and A ′ n ր A ′ Now, if A ⊥ A ′ then A n ⊥ A ′ n and thus A m ⊥ A ′ n for all m, n by isomonotonicity. Using the stability of ⊥ twice we first obtain A ⊥ A ′ n for all n and then A ⊥ A ′ . If A A ′ , we may assume A ⊂ A ′ since s ( F ) isa ⊥ -forest. Isomonotonicity implies A n ⊂ A ′ n ⊂ A ′ for all n and hence A ⊂ A ′ . Finally, s ( F n ) ≤ F ∞ is trivial. Proof of Proposition 16:
We first show that A is P-subadditive if A is ◦◦_P-additive. To this end we fix A, A′ ∈ A with A ◦◦_P A′. Since A is ◦◦_P-additive we find B := A ∪ A′ ∈ A. This yields

Q_B(A) = μ(A ∩ B)/μ(B) = μ(A)/μ(B) > 0,

and analogously we obtain Q_B(A′) > 0. For αQ_A, α′Q_{A′} ≤ P we can therefore assume that β := α/Q_B(A) < α′/Q_B(A′). Setting b := βQ_B we now obtain by the flatness assumption

αQ_A(·) = α · Q_B(· ∩ A)/Q_B(A) = b(· ∩ A) ≤ b(·).

Now assume that (A, Q_{μ,A}, ⊥) is P-subadditive for all P ≪ μ. Let A, A′ ∈ A with A ◦◦_μ A′. Then we have P := Q_A + Q_{A′} ≪ μ and Q_A, Q_{A′} ≤ P. Since A is P-subadditive there is a base measure b ≤ P with A ∪ A′ ⊂ supp b ⊂ supp P = A ∪ A′ by Lemma 51. Consequently we obtain A ∪ A′ = supp b ∈ A. Lemma 31
Let P ∈ M and (A, Q, ⊥) be a P-subadditive clustering base. Then the kinship relation ∼_P is a symmetric and transitive relation on {B ∈ B | P(B) > 0} and an equivalence relation on the set {A ∈ A | ∃α > 0 such that αQ_A ≤ P}. Finally, for all finite sequences A₁, …, A_k ∈ A of sets that are pairwise kin below P there is a b ∈ Q_P(A₁ ∪ … ∪ A_k). Proof of Lemma 31:
Symmetry is clear. Let B₁ ∼_P B₂ and B₂ ∼_P B₃ be events with P(B_i) > 0. Then there are base measures c = γQ_C ∈ Q_P(B₁ ∪ B₂) and c′ = γ′Q_{C′} ∈ Q_P(B₂ ∪ B₃) supporting them. This yields B₂ ⊂ C ∩ C′ and thus P(C ∩ C′) ≥ P(B₂) > 0. In other words, we have C ◦◦_P C′, and by subadditivity we conclude that there is a b ∈ Q_P(C ∪ C′). This gives B₁ ∪ B₃ ⊂ C ∪ C′ ⊂ supp b, and therefore B₁ ∼_P B₃ at b. To show reflexivity on the specified subset of A, we fix an A ∈ A and an α > 0 such that a := αQ_A ≤ P. Then we have a ∈ Q_P(A), and hence we obtain A ∼_P A.

The last statement follows by induction over k, where the initial step k = 2 is simply the definition of kinship. Let us therefore assume the statement is true for some k ≥ 2. Let A₁, …, A_{k+1} ∈ A be pairwise kin. By assumption there is a b ∈ Q_P(A₁ ∪ … ∪ A_k). Since the latter yields A₁ ⊂ supp b, we find A₁ ∼_P supp b, and by the transitivity of ∼_P we hence have A_{k+1} ∼_P supp b. By definition there is thus a b̃ ∈ Q_P(A_{k+1} ∪ supp b), and since this gives A₁ ∪ … ∪ A_{k+1} ⊂ A_{k+1} ∪ supp b ⊂ supp b̃, we find b̃ ∈ Q_P(A₁ ∪ … ∪ A_{k+1}). Lemma 32
Let (A, Q, ⊥) be a clustering base and Q ∈ S(A) with representing forest F ∈ F_A. Then for all A ∈ F we have Q(· ∩ A) = λ_Q(A) + Q|_{⊊A}. Proof of Lemma 32:
Let A₀ ∈ max F be the root with A ⊂ A₀. Then we can decompose F into F = {A′ ∈ F : A′ ⊇ A} ∪̇ {A′ ∈ F : A′ ⊊ A} ∪̇ {A′ ∈ F : A′ ⊥ A}. Moreover, flatness of Q gives Q_{A′}(· ∩ A) = Q_{A′}(A) · Q_A(·) for all A′ ∈ A with A ⊂ A′, while fittedness gives Q_{A′}(A) = 0 for all A′ ∈ A with A′ ⊥ A by the monotonicity of ⊥, part (a) of Lemma 30, and part (b) of Lemma 51. For B ∈ B we thus have

Q(B ∩ A) = Σ_{A′⊇A} α_{A′} Q_{A′}(B ∩ A) + Σ_{A′⊊A} α_{A′} Q_{A′}(B ∩ A) + Σ_{A′⊥A} α_{A′} Q_{A′}(B ∩ A)
         = Σ_{A′⊇A} α_{A′} Q_{A′}(A) Q_A(B) + Σ_{A′⊊A} α_{A′} Q_{A′}(B ∩ A)
         = λ_Q(A)(B) + Q|_{⊊A}(B),

where the last step uses Q_{A′}(B ∩ A) = Q_{A′}(B) for A′ ⊂ A, which follows from fittedness. Lemma 33
Let (A, Q, ⊥) be a clustering base and let a, b be base measures on A, B ∈ A with A ⊂ B. Then for all C ∈ B with a(C ∩ A) > 0 we have

b(· ∩ A) = [b(C ∩ A)/a(C ∩ A)] · a(· ∩ A).

Proof of Lemma 33:
By assumption there are α, β > 0 with a = αQ_A and b = βQ_B. Moreover, flatness guarantees Q_B(· ∩ A) = Q_B(A) · Q_A(·). For all C ∈ B we thus obtain

b(C ∩ A) = βQ_B(C ∩ A) = βQ_B(A) · Q_A(C) = βQ_B(A) · Q_A(C ∩ A) = (βQ_B(A)/α) · a(C ∩ A),

where in the second to last step we used Q_A(·) = Q_A(· ∩ A), which follows from A = supp Q_A. For C ∈ B with a(C ∩ A) > 0 we thus find βQ_B(A)/α = b(C ∩ A)/a(C ∩ A), and inserting this in the previous formula gives the assertion. Lemma 34
Let (A, Q, ⊥) be a clustering base, Q ∈ S(A) a simple measure, a a base measure on some A ∈ A, and C ∈ B. Then the following statements are true:

(a) If a ≤ Q then there is a level b in Q with a ≤ b.
(b) If b(· ∩ C) ≤ a(· ∩ C) for all levels b of Q, then Q(C) ≤ a(C).
(c) For all P ∈ M we have Q ≤ P if and only if b ≤ P for all levels b in Q.

Proof of Lemma 34:
In the following we denote the representing forest of Q by F . (a). By a ≤ Q we find A ⊂ supp Q = G F . Since the roots max F form a finite ⊥ -disjointunion of closed sets of G F , the A -connectedness shows that A is already contained in one ofthe roots, say A ∈ max F . For F ′ := { A ′ ∈ F | A ⊂ A ′ } we thus have A ∈ F ′ . Moreover, F ′ is a chain, since if there were ⊥ -disjoint A ′ , A ′′ ∈ F ′ then A would only be containedin one of them by Lemma 30. Therefore there is a unique leaf B := min F ′ ∈ F and thus A ⊂ B . We denote the level of B in Q by b . Then it suffices to show a ≤ b . To this end,let { C , . . . , C k } = max F (cid:12)(cid:12) $ B be the direct children of B in F . By construction we know A C i for all i = 1 , . . . , k and hence A -connectedness yields A C ⊥ ∪ . . . ⊥ ∪ C k . Therefore C := A \ S i C i is non-empty and relatively open in A = supp Q A . This gives a ( C ∩ A ) > by Lemma 51. Let us write b := λ Q ( B ) for the level of B in Q . Lemma 32 applied to thenode B ∈ F then gives Q ( C ) = b ( C ) + Q (cid:12)(cid:12) $ B ( C ) = b ( C ) + X A ′ ∈ F : A ′ $ B α A ′ Q A ′ ( C ) = b ( C ) since for A ′ ∈ F with A ′ $ B we have A ′ ⊂ S i C i and thus supp Q A ′ ∩ C = A ′ ∩ C = ∅ .Therefore, we find a ( C ∩ A ) = a ( C ) ≤ Q ( C ) = b ( C ) = b ( C ∩ B ) . By Lemma 33 weconclude that b ( · ∩ A ) ≥ a ( · ∩ A ) . For B ′ ∈ B the decomposition B ′ = ( B ′ \ A ) ˙ ∪ ( B ′ ∩ A ) and the fact that A = supp a ⊂ supp b then yields the assertion. (b). For A ∈ F we define B A := A \ [ A ′ ∈ F : A ′ $ A A ′ , i.e. B A is obtained by removing the strict descendants from A . From this description it iseasy to see that { B A : A ∈ F } is a partition of G ( F ) = supp Q . 
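This partition step can be checked concretely. In the following sketch (our own; nodes are modelled as finite sets of grid points, and the four-node forest is a hypothetical choice), each B_A is a node minus its strict descendants, and together the B_A partition the support of the root:

```python
# For each node A of a toy forest, B_A := A minus all strict descendants;
# the B_A are pairwise disjoint and their union is the root's support.

nodes = {
    "root": set(range(100)),          # covers the whole toy support
    "child1": set(range(10, 30)),
    "child2": set(range(50, 90)),
    "grandchild": set(range(60, 70)),
}
children = {"root": ["child1", "child2"], "child1": [],
            "child2": ["grandchild"], "grandchild": []}

def strict_descendants(name):
    out = set()
    for c in children[name]:
        out |= nodes[c] | strict_descendants(c)
    return out

B = {name: nodes[name] - strict_descendants(name) for name in nodes}
pieces = list(B.values())

assert all(p.isdisjoint(q) for i, p in enumerate(pieces) for q in pieces[i + 1:])
assert set().union(*pieces) == nodes["root"]
print("B_A partition verified")
```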
Hence we obtain Q ( C ) = X A ∈ F Q ( C ∩ B A ) = X A ∈ F X A ′ ∈ F α A ′ Q A ′ ( C ∩ B A )= X A ∈ F X A ′ ⊃ A α A ′ Q A ′ ( C ∩ B A ) + X A ∈ F X A ′ $ A α A ′ Q A ′ ( C ∩ B A )= X A ∈ F λ Q ( A )( C ∩ B A ) , (22)where we used Q A ′ ( C ∩ B A ) = Q A ′ ( C ∩ B A ∩ A ) together with flatness applied to pairs A ⊂ A ′ as well as A ′ ∩ B A = ∅ applied to pairs A ′ $ A . Our assumption now yields Q ( C ) ≤ X A ∈ F a ( C ∩ B A ) = a ( C ∩ supp Q ) ≤ a ( C ) . (c). Let b := λ Q ( B ) be a level of B in Q with b P . Then there is a B ′ ∈ B with b ( B ′ ) > P ( B ) and for B ′′ := B ′ ∩ supp b = B ′ ∩ B we find Q ( B ′′ ) ≥ a ( B ′′ ) = a ( B ′ ) >P ( B ′ ) ≥ P ( B ′′ ) . Conversely, assume b ≤ P for all levels b in Q . By the decomposition (22)we then obtain Q ( C ) = X A ∈ F λ Q ( A )( C ∩ B A ) ≤ X A ∈ F P ( C ∩ B A ) = P ( C ∩ supp Q ) ≤ P ( C ) . owards an Axiomatic Approach to Hierarchical Clustering of Measures Corollary 35
Let (A, Q, ⊥) be a clustering base, Q ∈ S(A) a simple measure with representing forest F, and A₁, A₂ ∈ F. Then for all a ∈ Q_Q(A₁ ∪ A₂) there exists a level b in Q such that A₁ ∪ A₂ ⊂ supp b and a ≤ b. Proof of Corollary 35:
Let us fix an a ∈ Q_Q(A₁ ∪ A₂). Since a ≤ Q, Lemma 34 gives a level b in Q with a ≤ b. Setting B := supp b ∈ F then gives A₁ ∪ A₂ ⊂ supp a ⊂ B. Proof of Proposition 19:
Let Q be a simple measure and Q = Σ_{A∈F} α_A Q_A its unique representation. Moreover, let A₁, A₂ be direct siblings in F and a₁, a₂ the corresponding levels in Q. Then Q-groundedness follows directly from Corollary 35. To show that A₁, A₂ are Q-motivated and Q-fine, we fix an a ∈ Q_Q(A₁ ∪ A₂). Furthermore, let b be the level in Q found by Corollary 35, i.e. we have A₁ ∪ A₂ ⊂ supp b =: B and a ≤ b ≤ Q. Now let A₃, …, A_k ∈ F be the remaining direct siblings of A₁ and A₂. Since B is an ancestor of A₁ and A₂, it is also an ancestor of A₃, …, A_k, and hence A₁ ∪ ··· ∪ A_k ⊂ B. This immediately gives b ∈ Q_Q(A₁ ∪ ··· ∪ A_k), and we already know b ≥ a. In other words, A₁, A₂ are Q-fine. Finally, observe that for B ⊂ A′ flatness gives Q_{A′}(B)Q_B(·) = Q_{A′}(· ∩ B). Since A₁ ⊂ B we hence obtain

a(A₁) ≤ b(A₁) = Σ_{A′⊇B} α_{A′} Q_{A′}(B) Q_B(A₁) = Σ_{A′⊇B} α_{A′} Q_{A′}(A₁),

and since Q_{A₁}(A₁) = 1 we also find

a₁(A₁) = Σ_{A′⊇A₁} α_{A′} Q_{A′}(A₁) Q_{A₁}(A₁) = Σ_{A′⊇A₁} α_{A′} Q_{A′}(A₁) = Σ_{A′⊇B} α_{A′} Q_{A′}(A₁) + α_{A₁}.

Since α_{A₁} > 0 we conclude that a(A₁) < (1 − ε₁) a₁(A₁) for a suitable ε₁ > 0. Analogously, we find an ε₂ > 0 with a(A₂) < (1 − ε₂) a₂(A₂), and taking α := 1 − min{ε₁, ε₂} thus yields Q-motivation. Lemma 36
Let (A, Q, ⊥) be a clustering base, P ∈ M_Ω, and let Q, Q′ ≤ P be simple measures on finite forests F and F′. If all roots in both F and F′ are P-grounded, then any root of one forest can be kin below P to at most one root of the other forest. Proof of Lemma 36:
Let us assume the converse, i.e. there are A ∈ max F and distinct B, B′ ∈ max F′ such that A ∼_P B and A ∼_P B′. Let a, b, b′ be the respective summands in the simple measures Q and Q′. Then 0 < a(A) ≤ Q(A) ≤ P(A), and analogously P(B), P(B′) > 0. By the transitivity of ∼_P established in Lemma 31 we then have B ∼_P B′, and by groundedness there would have to be a parent of both B and B′ in F′, so they would not be roots. Proposition 37
Let (A, Q, ⊥) be a stable clustering base and P ∈ M such that A is P-subadditive. Let (Q_n, F_n) ↑ P, where all forests F_n have k roots A_{1,n}, …, A_{k,n}, which, in addition, are assumed to be P-grounded. Then the limit roots A_i := ⋃_n A_{i,n} are unique among all such approximations up to P-null sets. Proof of Proposition 37:
The A , . . . , A k are pairwise ⊥ -disjoint by Lemma 11 and byLemma 53 they partition supp P up to a P -null set, i.e. P (supp P \ S i A i ) = 0 . Thereforeany B ∈ B with P ( B ) > intersects at least one of the A i . Moreover, we have . Now let ( Q ′ n , F ′ n ) ↑ P be another approximationof the assumed type with roots B in and limit roots B , . . . , B k ′ . Clearly, our preliminaryconsiderations also hold for these limit roots. Now consider the binary relation i ∼ j , whichis defined to hold iff A i ◦◦ P B j .Since P ( A i ) > there has to be a B j with P ( A i ∩ B j ) > , so for all i ≤ k there is a j ≤ k ′ with i ∼ j . Then, since A in ∩ B jn ↑ A i ∩ B j , there is an n ≥ with P ( A in ∩ B jn ) > .By P -subadditivity of A we conclude that A in and B jn are kin below P , and Lemma 36shows that this can only happen for at most one j ≤ k ′ . Consequently, we have k ≤ k ′ and ∼ defines an injection i j ( i ) . The same argument also holds in the other direction andwe see that k = k ′ and that i ∼ j defines a bijection. Clearly, we may assume that i ∼ j iff i = j . Then P ( A i ∩ B j ) > if and only if i = j , and since both sets of roots partition supp P up to a P -null set, we conclude that P ( A i △ B i ) = 0 . Lemma 38
Let (A, Q, ⊥) be a clustering base and P ∈ M such that A is P-subadditive. Moreover, let a₁, …, a_k ≤ P be base measures on A₁, …, A_k ∈ A such that A₁ ◦◦_P A_i for all 2 ≤ i ≤ k. Then there are a b ∈ Q_P(A₁ ∪ … ∪ A_k) and an i ≤ k with b ≥ a_i; and if k ≥ 3 and a₂, …, a_k satisfy the motivation implication (20) pairwise, then b ≥ a₁. Proof of Lemma 38:
The proof of the first assertion is based on induction. For k = 2 the assertion is exactly P-subadditivity. Now assume that the statement is true for k. Then there are a b ∈ Q_P(A₁ ∪ … ∪ A_k) and an i ≤ k with b ≥ a_i. The assumed A₁ ◦◦_P A_{k+1} thus yields

P(A_{k+1} ∩ supp b) ≥ P(A_{k+1} ∩ A₁) > 0,

and hence P-subadditivity gives a b̃ ∈ Q_P(A_{k+1} ∪ supp b) with b̃ ≥ a_{k+1} or b̃ ≥ b ≥ a_i. For the second assertion observe that b ∈ Q_P(A_i ∪ A_j) for all i, j, and hence (20) implies b ≱ a_i for all i ≥ 2; since b majorizes some a_i, this forces b ≥ a₁. Lemma 39
Let (A, Q, ⊥) be a clustering base and Q ≤ P be a simple and P-adapted measure with representing forest F. Let C₁, …, C_k ∈ F be direct siblings for some k ≥ 2. Then there exists an ε > 0 such that:

(a) For all a ∈ Q_P(C₁ ∪ … ∪ C_k) and all i ≤ k we have a(C_i) ≤ (1 − ε) · Q(C_i).
(b) Assume that A is P-subadditive and that a ≤ P is a simple measure with supp a ◦◦_P C_i for at least two i ≤ k. Then for all i ≤ k we have a(C_i) ≤ (1 − ε) · Q(C_i).
(c) If A is P-subadditive and Q′ ≤ P is a simple measure with representing forest F′ such that there is an i ≤ k with the property that for all B ∈ F′ we have B ◦◦_P C_i ⟹ ∃ j ≠ i : B ◦◦_P C_j, then Q′(· ∩ C_i) ≤ (1 − ε) Q(· ∩ C_i) holds true.

Proof of Lemma 39:
Let c , . . . , c k be the levels of C , . . . , C k in Q . Since Q is adapted,(20) holds for some α ∈ (0 , . We define ε := 1 − α . (a). We fix an a ∈ Q P ( C ∪ . . . ∪ C k ) , an i ≤ k , and a j ≤ k with j = i . Let c i , c j be thelevels of C i and C j in Q . Since α c i and α c j are motivated, we have a α c i and a α c j .Hence, there is a C ∈ B with a ( C ) < α c i ( C ) and thus also a ( C ∩ C i ) < α c i ( C ∩ C i ) .Lemma 33 then yields a ( · ∩ C i ) ≤ α c i ( · ∩ C i ) and the definition of levels gives a ( C i ) ≤ α c i ( C i ) = αQ ( C i ) = (1 − ε ) Q ( C i ) . (b). We may assume supp a ◦◦ P C and supp a ◦◦ P C . By the second part of Lemma 38applied to supp a , C , C there is an a ′ ∈ Q P (supp a ∪ C ∪ C ) ⊂ Q P ( C ∪ C ) with a ′ ≥ a ,and since Q is P -fine, we may actually assume that a ′ ∈ Q P ( C ∪ . . . ∪ C k ) . Now part (a)yields a ′ ( C i ) ≤ (1 − ε ) · Q ( C i ) for all i = 1 , . . . , k . (c). We may assume i = 1 . Our first goal is to show b ( · ∩ C ) ≤ (1 − ε ) c ( · ∩ C ) (23)for all levels b in Q ′ , To this end, we fix a level b in Q ′ and write B := supp b . If P ( B ∩ C ) =0 , then (23) follows from b ( C ) = b ( B ∩ C ) ≤ P ( B ∩ C ) = 0 . In the other case we have B ◦◦ P C and our assumption gives a j = 1 with B ◦◦ P C j . Bythe second part of Lemma 38 we find an a ∈ Q P ( B ∪ C ∪ C j ) ⊂ Q P ( C ∪ C j ) with a ≥ b ,and by (a) we thus obtain a ( C ) ≤ (1 − ε ) Q ( C ) = (1 − ε ) c ( C ) . Now, Lemma 33 gives a ( · ∩ C ) ≤ (1 − ε ) c ( · ∩ C ) and hence (23) follows.With the help of (23) we now conclude by part (b) of Lemma 34 that Q ′ ( · ∩ C ) ≤ (1 − ε ) c ( · ∩ C ) and using c ( · ∩ C ) ≤ Q ( · ∩ C ) we thus obtain the assertion. Lemma 40
Let (A, Q, ⊥) be a clustering base and P ∈ M such that A is P-subadditive. Moreover, let Q, Q′ ≤ P be simple P-adapted measures on F, F′, and let S ∈ s(F) and S′ ∈ s(F′) be two nodes that have children in s(F) and s(F′), respectively. Let {C₁, …, C_k} = max s(F)|_{⊊S} and {D₁, …, D_{k′}} = max s(F′)|_{⊊S′} be their direct children, and consider the relation

i ∼ j :⟺ C_i ◦◦_P D_j.

Then we have k, k′ ≥ 2, and if ∼ is left-total, i.e. for every i ≤ k there is a j ≤ k′ with i ∼ j, then it is also right-unique, i.e. for every i ≤ k there is at most one j ≤ k′ with i ∼ j. Proof of Lemma 40:
The definition of the structure of a forest gives k, k ′ ≥ . Moreover,we note that P ( A ) ≥ Q ( A ) > for all A ∈ F and P ( A ) ≥ Q ′ ( A ) > for all A ∈ F ′ .Now assume that ∼ is not right-unique, say ∼ j and ∼ j ′ for some j = j ′ . Applying P -subadditivity twice we then find a b ∈ Q P ( C ∪ D j ∪ D j ′ ) with b ≥ c or b ≥ d j or b ≥ d j ′ , where c , d j , and d j ′ are the corresponding levels. Since d j , d j ′ are motivated weconclude that b ≥ c . Now, because of Q P ( C ∪ D j ∪ D j ′ ) ⊂ Q P ( D j ∪ D j ′ ) and P -finenessof Q ′ there is a b ′ ∈ Q P ( D ∪ . . . ∪ D k ′ ) with b ′ ≥ b . Now pick a direct sibling of C , say C . Then there is a j ′′ with ∼ j ′′ , and since B ′ := supp b ′ ⊃ D ∪ . . . ∪ D k ′ this implies homann, Steinwart, and Schmid P ( B ′ ∩ C ) ≥ P ( D j ′′ ∩ C i ) > . By P -subadditivity we hence find a b ′′ ∈ Q P ( B ′ ∪ C ) ⊂Q P ( C ∪ C ) with b ′′ ≥ b ′ or b ′′ ≥ c . Clearly, b ′′ ≥ c violates the fact that C , C aremotivated, and thus b ′′ ≥ b ′ . However, we have shown b ′ ≥ b ≥ c , and thus b ′′ ≥ c . Sincethis again violates the fact that C , C are motivated, we have found a contradiction. Proof of Theorem 20:
We prove the theorem by induction over the generations in theforests. For a finite forest F , we define s ( F ) := max F and s N +1 ( F ) := s N ( F ) ∪ (cid:8) A ∈ s ( F ) | A is a direct child of a leaf in s N ( F ) (cid:9) . We will now show by induction over N that there is a graph-isomorphism ζ N : s N ( F ∞ ) → s N ( F ′∞ ) with P ( A △ ζ N ( A )) = 0 for all A ∈ s N ( F ∞ ) . For N = 0 this has already been shownin Proposition 37. Let us therefore assume that the statement is true for some N ≥ . Letus fix an S ∈ min s N ( F ∞ ) and let S ′ := ζ N ( S ) ∈ min s N ( F ′∞ ) be the corresponding node.We have to show that both have the same number of direct children in s N +1 ( · ) and thatthese children are equal up to P -null sets. By induction this then finishes the proof.Since S ∈ s N ( F ∞ ) ⊂ s ( F ∞ ) , the node S has either no children or at least . Now, ifboth S and S ′ have no direct children then we are finished. Hence we can assume that S has direct children C , . . . , C k for some k ≥ , i.e. max( F ∞ (cid:12)(cid:12) $ S ) = { C , . . . , C k } . Let S n , C n , . . . , C kn ∈ s ( F n ) and S ′ n ∈ s ( F ′ n ) be the nodes that correspond to S, C , . . . , C k ,and S ′ , respectively. Since P ( S △ S ′ ) = 0 we then obtain for all i ≤ kP ( S ′ ∩ C i ) = P ( S ∩ C i ) = P ( C i ) ≥ Q ( C i ) ≥ Q ( C i ) > , that is S ′ ◦◦ P C i for all i ≤ k . Since S ′ = S n S ′ n and C i = S n C in this can only happen if S ′ n ◦◦ P C in for all sufficiently large n . We therefore may assume without loss of generalitythat S ′ ◦◦ P C in for all i ≤ k and all n ≥ . (24)Let us now investigate the structure of F ′ n (cid:12)(cid:12) ⊂ S ′ n . To this end, we will seek a kind of anchor B ′ n ∈ F ′ n (cid:12)(cid:12) ⊂ S ′ n , which will turn out later to be the direct parent of the yet to find ζ N +1 ( C i ) ∈ F ′∞ . We define this anchor by B ′ n := min { B ∈ F ′ n | B ◦◦ P C i for all i = 1 , . . . , k } . This minimum is unique. 
Indeed, let ˜ B ′ n be any other minimum with ˜ B ′ n ◦◦ P C i for all i ≤ k . Since both are minima, none is contained in the other and because F ′ n is a forest thismeans B ′ n ⊥ ˜ B ′ n . Let b ′ n and ˜ b ′ n be their levels in Q ′ n . Since Q ′ n is P -adapted, these twolevels are motivated. This means that there can be no base measure majorizing one of themand supporting B ′ n ∪ ˜ B ′ n . On the other hand, by the second part of Lemma 38 there exists a b ′′ n ∈ Q p ( B ′ n ∪ C ∪· · ·∪ C k ) with b ′′ n ≥ b ′ n . Now because of P ( ˜ B ′ n ∩ supp b ′′ n ) ≥ P ( ˜ B ′ n ∩ C ) > and P -subadditivity there exists a base measure majorizing ˜ b ′ n ≥ b ′ n or b ′′ n and supporting ˜ B ′ n ∩ supp b ′′ n . This contradicts the motivatedness of b ′ n and ˜ b ′ n and hence the minimum B ′ n is unique. owards an Axiomatic Approach to Hierarchical Clustering of Measures Since B ′ n is the unique minimum among all B ∈ F ′ n with B ◦◦ P C i for all i , we also have B ′ n ⊂ B for all such B and hence B ′ n ⊂ S ′ n by (24). The major difficulty in handling B ′ n though is that it may jump around as a function of n : Indeed we may have B ′ n ∈ F ′ n \ s ( F ′ n ) and therefore the monotonicity s ( F ′ n ) ≤ s ( F ′ n +1 ) says nothing about B ′ n . In particular, wehave in general B ′ n B ′ n +1 .Let us now enumerate the set min F ′ n (cid:12)(cid:12) $ B ′ n of direct children of B ′ n by D n , . . . , D k n n ,where k n ≥ . Again these D in can jump around as a function of n . The number k n specifiesdifferent cases: we have B ′ n ∈ min F ′ n , i.e. B ′ n is a leaf, iff k n = 0 ; on the other hand D in ∈ s ( F ′ n ) iff k n ≥ . Next we show that for all i ≤ k and all sufficiently large n there isan index j ( i, n ) ∈ { , . . . , k n } with C i ◦◦ P D j ( i,n ) n . (25)Note that this in particular implies k n ≥ for sufficiently large n . To this end we fixan i ≤ k . Suppose that C i ⊥⊥ P ( D n m ∪ · · · ∪ D k nm n m ) for infinitely many n , n , . . . . 
By construction, B'_{n_m} is the smallest element of F'_{n_m} that ◦◦_P-intersects C^i. More precisely, for any A ∈ F'_{n_m} with A ◦◦_P C^i we have A ⊃ B'_{n_m}, and therefore A ◦◦_P C^{i'} for all such A and all i' ≤ k. Hence all Q'_{n_m} in this subsequence fulfill the conditions of the last statement in Lemma 39, and we get an ε > 0 such that for all such n_m

Q'_{n_m}(C^i) ≤ (1 − ε) Q(C^i) ≤ (1 − ε) P(C^i), (26)

which contradicts Q'_{n_m}(C^i) ↑ P(C^i) since P(C^i) > 0.

Therefore, for all i ≤ k and all sufficiently large n there is an index j(i, n) such that (25) holds. Clearly, we may thus assume that there is such a j(i, n) for all n ≥ 1. Since j(i, n) ∈ {1, …, k_n}, we conclude that k_n ≥ 1 for all n ≥ 1. Moreover, k_n = 1 is impossible, since k_n = 1 yields j(i, n) = 1, and this would mean that C^i ◦◦_P D^1_n for all i ≤ k, contradicting that B'_n is the minimal set in F'_n having this property. Consequently, B'_n has the direct children D^1_n, …, D^{k_n}_n, where k_n ≥ 2 for all n ≥ 1.

So far we have seen that D^1_n, …, D^{k_n}_n ∈ s(F'_n) are inside S'_n. Therefore S'_n is not a leaf, and hence S' ∉ min F'_∞ as well. But still, for infinitely many n these D^j_n might not be the direct children of S'_n. Let us therefore denote the direct children of S'_n ∈ s(F'_n) by E^1_n, …, E^{k'}_n ∈ s(F'_n), where we pick a numbering such that E^i_n ⊂ E^i_{n+1}; by the definition of the structure of a forest we have k' ≥ 2.

For an arbitrary but fixed n we now show {D^1_n, …, D^{k_n}_n} = {E^1_n, …, E^{k'}_n}. To this end, let us assume the converse. Since the E^j_n are the direct children of S'_n in the structure s(F'_n), there is a j_n ≤ k' with D^j_n ⊂ E^{j_n}_n for all j, and since B'_n is the direct parent of the D^j_n, we conclude that B'_n ⊂ E^{j_n}_n. Therefore we have C^i ◦◦_P E^{j_n}_n for all i ≤ k.
Since Q and Q'_n are adapted, we can use Lemma 40 to see that for all i ≤ k we have C^i ⊥⊥_P E^j_n for all j ≠ j_n. Let us fix a j ≠ j_n. Our goal is to show

Q_m(E^j_n) < (1 − ε) Q'_n(E^j_n) for all sufficiently large m ≥ n,

since this inequality contradicts the assumed convergence of Q_m(E^j_n) to P(E^j_n) ≥ Q'_n(E^j_n) > 0. By part (c) of Lemma 39, with Q'_n in the role of Q and Q_m in the role of Q', it suffices to show that for all A ∈ F_m and all sufficiently large m ≥ n we have

A ◦◦_P E^j_n ⟹ A ◦◦_P E^{j_n}_n. (27)

To this end, we fix an A ∈ F_m with A ◦◦_P E^j_n. Then we first observe that for all m ≥ n we have P(A ∩ S'_m) ≥ P(A ∩ S'_n) ≥ P(A ∩ E^j_n) > 0. Moreover, the induction assumption ensures P(S △ S') = 0, and since S_m ↗ S and S'_m ↗ S', we conclude that P(A ∩ S_m) > 0 for all sufficiently large m. Now, C^1_m, …, C^k_m are direct siblings, and hence we either have C^1_m ∪ ⋯ ∪ C^k_m ⊂ A or A ⊂ C^{i_0}_m for exactly one i_0 ≤ k. In the first case we get

P(A ∩ E^{j_n}_n) ≥ P(C^1_m ∩ E^{j_n}_n) > 0

for all sufficiently large m, by the already established C^i ◦◦_P E^{j_n}_n for all i ≤ k together with C^1_m ↗ C^1. The second case is impossible, since it contradicts adaptedness. Indeed, A ⊂ C^{i_0}_m implies C^{i_0}_m ◦◦_P E^j_n, and by the already established C^i ◦◦_P E^{j_n}_n for all i ≤ k we also know C^{i_0}_m ◦◦_P E^{j_n}_n. By the second part of Lemma 38 we therefore find a c̃ ∈ Q_P(C^{i_0}_m ∪ E^j_n ∪ E^{j_n}_n) with c̃ ≥ c^{i_0}_m, where c^{i_0}_m is the level of C^{i_0}_m in Q_m. Now fix any i ≤ k with i ≠ i_0 and observe that, since C^i ◦◦_P E^{j_n}_n and C^i_m ↗ C^i, we have

P(C^i_m ∩ supp c̃) ≥ P(C^i_m ∩ E^{j_n}_n) > 0

for all sufficiently large m, and hence P-subadditivity yields a c'' ∈ Q_P(C^i_m ∪ supp c̃) with c'' ≥ c^i_m or c'' ≥ c̃ ≥ c^{i_0}_m, where c^i_m is the level of C^i_m in Q_m. Since c'' ∈ Q_P(C^i_m ∪ supp c̃) ⊂ Q_P(C^i_m ∪ C^{i_0}_m), we have thus found a contradiction to the fact that the direct siblings C^i_m and C^{i_0}_m are P-motivated.

So far we have shown {D^1_n, …, D^{k_n}_n} = {E^1_n, …, E^{k'}_n} and k_n = k' for all n. Without loss of generality we may thus assume that D^j_n = E^j_n for all n and all j ≤ k'. In particular, this means that the direct children of S'_n in s(F'_n) equal the direct children of B'_n in F'_n. Let us write

D^j := ⋃_{n ≥ 1} D^j_n,  j = 1, …, k',

and i ∼ j iff C^i ◦◦_P D^j. We have seen around (25) that for all i ≤ k there is at least one j ≤ k_n = k' with i ∼ j, namely j(i, 1). By Lemma 40 we then conclude that j(i, 1) is the only index j ≤ k' satisfying i ∼ j. By reversing the roles of C^i and D^j, which is possible since D^j = E^j is a direct child of S'_n in s(F'_n), we can further see that for all j there is an index i with i ∼ j, and again by Lemma 40 we conclude that there is at most one i with i ∼ j. Consequently, i ∼ j defines a bijection between {C^1, …, C^k} and {D^1, …, D^{k'}}, and hence we have k = k'. Moreover, we may assume without loss of generality that i ∼ j iff i = j. From the latter we obtain C^i ◦◦_P D^j iff i = j.

To generalize the latter, we fix n, m ≥ 1 and write i ∼ j iff C^i_n ◦◦_P D^j_m. Since P(C^i_n ∩ D^i_m) ↑ P(C^i ∩ D^i) > 0 as n, m → ∞, we may assume i ∼ i for all n, m, and by Lemma 40 we again see that i ∼ j is false for i ≠ j. This yields C^i_n ◦◦_P D^j_m iff i = j, and by taking the limits we find C^i ◦◦_P D^j iff i = j.

Next we show that P(C^i △ D^i) = 0 for all i ≤ k. Clearly, it suffices to consider the case i = 1. To this end, assume that R := C^1 \ D^1 satisfies P(R) > 0. For R_n := R ∩ C^1_n = C^1_n \ D^1 we then have R_n ↑ R, since C^1_n ↑ C^1 and R ⊂ C^1. Consequently, 0 < P(R) = P(R ∩ C^1) implies P(R_n) > 0 for all sufficiently large n. On the other hand, we have P(R ∩ D^1) = 0 by the definition of R, and P(R ∩ D^j) ≤ P(C^1 ∩ D^j) = 0 for all j ≠ 1 as we have shown above.

We next show that Q'_m(R_n) = Q'_m |_{⊃ B'_m}(R_n).
To this end it suffices to show that for any A ∈ F'_m with A ∉ F'_m |_{⊃ B'_m} we have Q'_m(A ∩ R_n) ≤ P(A ∩ R_n) = 0. Let us thus fix an A ∈ F'_m with A ∉ F'_m |_{⊃ B'_m}. Then we either have A ⊊ B'_m or A ⊥ B'_m. In the first case there is a j ≤ k with A ⊂ D^j_m, which means, as shown above, that P(A ∩ R_n) ≤ P(D^j_m ∩ R_n) = 0. In the second case, by the definition of structure, we even have A ⊥ S'_m. So there is an A'_m ∈ s(F'_m) with A ⊂ A'_m and A'_m ⊥ S'_m, and by isomonotonicity of the structure there is an A' ∈ F'_∞ with A'_m ⊂ A' and A' ⊥ S'. Hence, by the induction assumption,

P(A ∩ R_n) ≤ P(A ∩ S_n) ≤ P(A ∩ S) ≤ P(A' ∩ S) = P(A' ∩ S') = 0.

Using P(C^i ∩ D^i) > 0 we now observe that Q'_m |_{⊃ B'_m} fulfills the conditions of part (c) of Lemma 39 for C^1 and C^2, and by R_n ⊂ C^1_n we thus obtain

Q'_m(R_n) = Q'_m |_{⊃ B'_m}(R_n) ≤ (1 − ε) Q_n(R_n) ≤ (1 − ε) P(R_n).

This contradicts 0 < P(R_n) = lim_{m→∞} Q'_m(R_n). So we can assume P(R_n) = 0 for all n, and therefore P(R) = lim_{n→∞} P(R_n) = 0. By reversing roles we thus find P(D^1 △ C^1) = P(C^1 \ D^1) + P(D^1 \ C^1) = 0, and therefore the children are indeed the same up to P-null sets.

Finally, we are able to finish the induction: to this end we extend ζ_N to the map ζ_{N+1} : s^{N+1}(F_∞) → s^{N+1}(F'_∞) by setting ζ_{N+1}(C^i) := D^i for every leaf S ∈ min s^N(F_∞), where C^1, …, C^k ∈ s^{N+1}(F_∞) are the direct children of S and D^1, …, D^k ∈ s^{N+1}(F'_∞) are the nodes we have found during our above construction. Clearly, our construction shows that ζ_{N+1} is a graph isomorphism satisfying P(A △ ζ_{N+1}(A)) = 0 for all A ∈ s^{N+1}(F_∞).

Lemma 41
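As an aside, the generation-wise expansion s^N driving the induction above can be made concrete on a finite forest. The helper below is our own illustration (plain dictionaries stand in for the abstract forests, and the restriction to the structure s(F) is ignored): starting from the roots (s^0 = max F), each step adds the direct children of the current leaves.

```python
def expand_generations(children, roots, N):
    """Finite stand-in for s^N: s^0 is the set of roots; each of the N
    steps adds the direct children of the current frontier, mirroring
    s^{N+1}(F) = s^N(F) plus the direct children of the leaves of s^N(F)."""
    current = set(roots)
    frontier = set(roots)
    for _ in range(N):
        new_nodes = set()
        for node in frontier:
            new_nodes.update(children.get(node, ()))
        current |= new_nodes
        frontier = new_nodes
    return current

# A toy forest: root 'r' with children 'a', 'b'; 'a' has children 'c', 'd'.
children = {'r': ['a', 'b'], 'a': ['c', 'd']}
print(sorted(expand_generations(children, ['r'], 0)))  # ['r']
print(sorted(expand_generations(children, ['r'], 2)))  # ['a', 'b', 'c', 'd', 'r']
```

After finitely many steps the expansion is exhausted, which is the sense in which the induction over N reaches every node of a finite forest.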
Let (A, Q, ⊥) be a clustering base, P_1, …, P_k ∈ M with supp P_i ⊥ supp P_j for all i ≠ j, and let Q_i ≤ P_i be simple measures with representing forests F_i. We define P := P_1 + … + P_k, Q := Q_1 + … + Q_k, and F := F_1 ∪ ⋯ ∪ F_k. Then we have:

(a) The measure Q is simple and F is its representing ⊥-forest.
(b) For every base measure a ≤ P there exists exactly one i with a ≤ P_i.
(c) If A is P_i-subadditive for all i ≤ k, then A is P-subadditive.
(d) If Q_i is P_i-adapted for all i ≤ k, then Q is adapted to P.

Proof of Lemma 41: (a).
Since Q_i ≤ P_i ≤ P we have G_{F_i} = supp Q_i ⊂ supp P_i. By the monotonicity of ⊥ we then obtain G_{F_i} ⊥ G_{F_j} for i ≠ j. From this we obtain the assertion.

(b). Let a ≤ P be a base measure on A ∈ A. Then we have A = supp a ⊂ supp P = ⋃_i supp P_i. By A-connectedness there thus exists an i with A ⊂ supp P_i. For B ∈ B we then find a(B) = a(B ∩ supp P_i) ≤ P(B ∩ supp P_i) = P_i(B ∩ supp P_i) = P_i(B). Moreover, for j ≠ i we have a(A) > 0 and P_j(A) = 0, and thus i is unique.

(c). Let a, a' ≤ P be base measures on base sets A, A' with A ◦◦_P A'. Since A ⊥ A' implies A ⊥_∅ A', we have A ◦◦ A'. By (b) we find unique indices i, i' with a ≤ P_i and a' ≤ P_{i'}. This implies A ⊂ supp P_i and A' ⊂ supp P_{i'}, and hence we have supp P_i ◦◦ supp P_{i'} by monotonicity. This gives i = i', i.e. a, a' ≤ P_i. Since A is P_i-subadditive there now is an ã ∈ Q_{P_i}(A ∪ A') with ã ≥ a or ã ≥ a', and since ã ≤ P_i ≤ P we obtain the assertion.

(d). From (b) we conclude Q_P(A_1 ∪ A_2) = ∅ for all roots A_1 ∈ F_i and A_2 ∈ F_j with i ≠ j. This can be used to infer the groundedness and fineness of Q from the groundedness and fineness of the Q_i. Now let a, a' ≤ P be the levels of some direct siblings A, A' ∈ F in Q, and let b ∈ Q_P(A ∪ A') be any base measure. By (b) there is a unique i with b ≤ P_i, and hence a, a' ≤ P_i as well. Therefore Q inherits strict motivation from Q_i.

Lemma 42
Let (A, Q, ⊥) be a clustering base, P ∈ M, a a base measure on A ∈ A with supp P ⊂ A, and Q ≤ P a simple measure with representing forest F. We define P' := a + P, Q' := a + Q, and F' := {A} ∪ F. Then the following statements hold:

(a) The measure Q' is simple and F' is its representing ⊥-forest.
(b) Let a' ≤ P' be a base measure on A'. Then either a' ≤ a or there is an α ∈ (0, 1] such that a'(· ∩ A') = a(· ∩ A') + α a'(· ∩ A').
(c) If A is P-subadditive, then A is P'-subadditive.
(d) If Q is P-adapted, then Q' is P'-adapted.

Proof of Lemma 42: (a).
We have G_F = supp Q ⊂ supp P ⊂ A, and hence F' is a ⊥-forest, which obviously represents Q'.

(b). Let us assume that a' ≰ a, i.e. there is a C ∈ B with a'(C) > a(C), and thus we find a'(C ∩ A') = a'(C) > a(C) ≥ a(C ∩ A'). In addition, we have A' = supp a' ⊂ supp a = A, and therefore Lemma 33 shows a(· ∩ A') = γ a'(· ∩ A'), where

γ := a(C ∩ A') / a'(C ∩ A') < 1.

Setting α := 1 − γ yields the assertion.

(c). Let a_1, a_2 ≤ P' be base measures on sets A_1, A_2 ∈ A with A_1 ◦◦_{P'} A_2. Since supp P' = A, we have A_1 ∪ A_2 ⊂ A, and thus a ∈ Q_{P'}(A_1 ∪ A_2). Clearly, if a ≥ a_1 or a ≥ a_2, there is nothing left to prove, and hence we assume a_1 ≰ a and a_2 ≰ a. Then (b) gives α_i ∈ (0, 1] with a_i(· ∩ A_i) = a(· ∩ A_i) + α_i a_i(· ∩ A_i). We conclude that a(· ∩ A_i) + α_i a_i(· ∩ A_i) = a_i(· ∩ A_i) ≤ P'(· ∩ A_i) = a(· ∩ A_i) + P(· ∩ A_i), and thus α_i a_i = α_i a_i(· ∩ A_i) ≤ P(· ∩ A_i) ≤ P. Since A is P-subadditive, we thus find an ã ∈ Q_P(A_1 ∪ A_2) with, say, ã ≥ α_1 a_1. For Ã := supp ã we then have

ã' := a(· ∩ Ã) + ã(· ∩ Ã) ≥ a(· ∩ Ã) + α_1 a_1(· ∩ Ã) ≥ a(· ∩ A_1) + α_1 a_1(· ∩ A_1) = a_1,

where we used supp a_1 = A_1 ⊂ Ã. Moreover, Ã = supp ã ⊂ supp P ⊂ A, which together with the flatness of Q shows that ã' is a base measure, and we also have ã' ≤ a + ã ≤ a + P = P'. Finally, we observe that A_1 ∪ A_2 ⊂ Ã = supp ã', and hence ã' ∈ Q_{P'}(A_1 ∪ A_2).

(d). Clearly, F' is grounded because it is a tree. Now let A_1, …, A_k ∈ F', k ≥ 2, be direct siblings and let a'_i be their levels in Q'. Since A is the only root, it has no siblings, so for all i we have A_i ∈ F. Moreover, the levels a_i of A_i in Q are P-motivated and P-fine, since Q is P-adapted. Now let b ∈ Q_{P'}(A_1 ∪ … ∪ A_k) and B := supp b.

To check that Q' is P'-fine, we first observe that in the case b ≤ a there is nothing to prove, since a ∈ Q_{P'}(A_1 ∪ … ∪ A_k) by construction.
In the remaining case b ≰ a, we find a β > 0 with b(· ∩ B) = a(· ∩ B) + β b(· ∩ B) by (b), and by the P-fineness of Q there exists a b̃ ∈ Q_P(A_1 ∪ … ∪ A_k) with b̃ ≥ β b. Since supp b̃ ⊂ supp P ⊂ supp a, we see that a + b̃ is a simple measure, and hence we can consider the level b̃' of supp b̃ in a + b̃. Since b̃' ≤ a + b̃ ≤ a + P ≤ P', we then obtain b̃' ∈ Q_{P'}(A_1 ∪ … ∪ A_k), and for C ∈ B we also have

b(C) = b(C ∩ B) = a(C ∩ B) + β b(C ∩ B) ≤ a(C ∩ B) + b̃(C ∩ B) = b̃'(C ∩ B) ≤ b̃'(C).

To check that Q' is strictly P'-motivated, we fix the constant α ∈ (0, 1) appearing in the strict P-motivation of Q. Then there are α̃_i ∈ (0, 1) such that a(· ∩ A_i) + α a_i = α̃_i a'_i. We set α̃ := max{α̃_1, α̃_2} ∈ (0, 1) and obtain a(· ∩ A_i) + α a_i ≤ α̃ a'_i for both i = 1, 2. Let us first consider the case b ≤ a. Since our construction yields a'_i = a(· ∩ A_i) + a_i, there is a C ∈ B with a'_i(C) > a(C). This implies

α̃ a'_i(C) ≥ a(C ∩ A_i) + α a_i(C) > a(C ∩ A_i) ≥ b(C ∩ A_i),

i.e. b ≱ α̃ a'_i. Consequently, it remains to consider the case b ≰ a. By (b) and supp b ⊂ supp P' = A there is a β ∈ (0, 1] with b(· ∩ B) = a(· ∩ B) + β b(· ∩ B). Then

β b = β b(· ∩ B) = b(· ∩ B) − a(· ∩ B) ≤ P'(· ∩ B) − a(· ∩ B) = P(· ∩ B) ≤ P,

and since β b ∈ Q_P(A_1 ∪ A_2) we obtain β b ≱ α a_i for i = 1, 2. Hence there is an event C ⊂ supp b with β b(C) < α a_i(C), which yields

b(C ∩ A_i) = a(C ∩ A_i ∩ B) + β b(C ∩ A_i) < a(C ∩ A_i) + α a_i(C ∩ A_i) ≤ α̃ a'_i(C ∩ A_i),

i.e. b ≱ α̃ a'_i.

Proof of Theorem 21:
For a P ∈ S̄(A) and a P-adapted isomonotone sequence (Q_n, F_n) ↗ P we define

c_A(P) := P-lim_{n→∞} s(F_n),

which is possible by Theorem 20. By Proposition 19 we then know that c_A(Q) = c(Q) for all Q ∈ Q, and hence c_A satisfies the Axiom of BaseMeasureClustering. Furthermore, c_A is obviously structured and scale-invariant, and continuity follows from Theorem 20.

To check that c_A is disjoint-additive, we fix P_1, …, P_k ∈ P_A with pairwise ⊥-disjoint supports and let (Q^i_n, F^i_n) ↗ P_i be P_i-adapted isomonotone sequences of simple measures. We set Q_n := Q^1_n + ⋯ + Q^k_n and P := P_1 + … + P_k. By Lemma 41, Q_n is simple on F_n := F^1_n ∪ ⋯ ∪ F^k_n and P-adapted, and A is P-subadditive. Moreover, we have Q_n ↗ P, and s(F_n) = ⋃_i s(F^i_n) inherits monotonicity as well. Therefore (Q_n, F_n) ↗ P is P-adapted, and lim s(F_n) = ⋃_i lim s(F^i_n) yields disjoint-additivity.

To check BaseAdditivity, we fix a P ∈ P_A and a base measure a with supp P ⊂ supp a. Moreover, let (Q_n, F_n) ↗ P be a P-adapted sequence, and let Q'_n := a + Q_n and P' := a + P. Then by Lemma 42, Q'_n is simple on F'_n := {A} ∪ F_n and P'-adapted, and A is P'-subadditive. Furthermore, we have (Q'_n, F'_n) ↗ P', and therefore we find P' ∈ P_A and lim s(F'_n) = s({A} ∪ lim s(F_n)).

For the uniqueness we finally observe that Theorem 8 together with the Axioms of Additivity shows equality on S(A), and the Axiom of Continuity in combination with Theorem 20 extends this equality to P_A.

Lemma 43
Let μ ∈ M^∞_Ω, and consider (A, Q_{μ,A}, ⊥).

(a) If A, A' ∈ A with A ⊂ A' μ-a.s., then A ⊂ A'.
(b) Let P ∈ M_Ω be such that A is P-subadditive and P has a μ-density f that is of (A, Q, ⊥)-type with a dense subset Λ such that s(F_{f,Λ}) is finite. Then for all λ ∈ Λ and all A_1, …, A_k ∈ A with A_1 ∪ … ∪ A_k ⊂ {f > λ} μ-a.s. there is a B ∈ A with A_1 ∪ … ∪ A_k ⊂ B pointwise and B ⊂ {f > λ} μ-a.s.

Proof of Lemma 43: (a).
Let A, A' ∈ A with A ⊂ A' μ-a.s. and let x ∈ A. Now B := A \ A' is relatively open in A, and if it were non-empty then μ(B) > 0, since A is a support set. Since by assumption μ(B) = 0, we have B = ∅.

(b). Since H := {f > λ} ∈ Ā, there is an increasing sequence B_n ↑ H of base sets. Let db_n := λ 1_{B_n} dμ ∈ Q_P. For all i ≤ k we eventually have B_n ◦◦_μ A_i, so there is an n such that B_n is connected to all of them. By P-subadditivity between b_n and λ 1_{A_1} dμ, …, λ 1_{A_k} dμ there is a dc = λ' 1_C dμ ∈ Q_P that supports all of them and majorizes at least one of them. Hence λ ≤ λ', and thus A_1 ∪ … ∪ A_k ⊂ C ⊂ {f > λ'} ⊂ {f > λ} μ-a.s. By (a) we are finished.

Lemma 44
Let f be a density of (A, Q, ⊥)-type, set P := f dμ, and assume that A is P-subadditive and that F_{f,Λ} is a chain. For some k ≥ 0 and all n ∈ N let B_n = C^1 ∪ … ∪ C^k be a (possibly empty) union of base sets C^1, …, C^k ∈ A with B_n ⊂ {f > λ} for all λ ∈ Λ. Then P := f dμ ∈ S̄(A), and there is an adapted (Q_n, F_n) ↗ P such that for all n the forest F_n is a chain and B_n ⊂ min F_n.

Proof of Lemma 44:
Let (λ_n)_n ⊂ Λ be a dense countable subset with λ_n < ρ, and set Λ_n := {λ_1, …, λ_n} and Λ_∞ := ⋃_n Λ_n. Note that max Λ_n < ρ for all n, |Λ_n| = n, and Λ_1 ⊂ Λ_2 ⊂ …. For every n we enumerate the n elements of Λ_n by λ(1, n) < … < λ(n, n), and for every λ ∈ Λ_∞ we let n_λ := min{n | λ ∈ Λ_n} ∈ N.

Since f is of (A, Q, ⊥)-type, H(λ) := {f > λ} ∈ Ā for λ ∈ Λ. Therefore there are A_{λ,n} ∈ A with A_{λ,n} ↑ H(λ) as n → ∞. We would like to use these A_{λ,n} to construct Q_n, but they need to be made compatible in order that (Q_n, F_n)_n becomes isomonotone. Hence we construct by induction a family of sets A(λ, n) ∈ A, λ ∈ Λ_n, n ∈ N, with the following properties for λ = λ(i, n):

A_{λ,n} ∪ A(λ(i + 1, n), n) ∪ A(λ, n − 1) ∪ B_n ⊂ A(λ, n) ⊂ H(λ) ∪̇ N(λ, n),  μ(N(λ, n)) = 0.

Here A(λ(i + 1, n), n) is understood to be empty if i = n, and similarly A(λ, n − 1) = ∅ if n = 1 or λ ∉ Λ_{n−1}. All of the involved sets C are base sets with C ⊂ H(λ), and hence by Lemma 43 there is such an A(λ, n). Since A_{λ,n} ↗_n H(λ), we then also have A(λ, n_λ + n) ↑ H(λ).

Now for all n consider the chain F_n := {A(λ, n) | λ ∈ Λ_n} ⊂ A and the simple measure Q_n on F_n given by

h_n := Σ_{i=1}^n (λ(i, n) − λ(i − 1, n)) · 1_{A(λ(i,n),n)} = Σ_{λ ∈ Λ_n} λ · 1_{A(λ,n) \ ⋃_{λ' > λ} A(λ',n)}  (λ(0, n) := 0).

Let x ∈ B and Λ_n(x) := {λ ∈ Λ_n | x ∈ A(λ, n)}. Then h_n(x) = max Λ_n(x). And if x ∈ A(λ, n) then x ∈ A(λ, n + 1), so Λ_n(x) ⊂ Λ_{n+1}(x) and we have

h_n(x) = max Λ_n(x) ≤ max Λ_{n+1}(x) = h_{n+1}(x).

Furthermore, if λ ∈ Λ_n(x) then x ∈ A(λ, n) ⊂ H(λ), implying h(x) > λ. Therefore h_1 ≤ h_2 ≤ ⋯ ≤ h.

On the other hand, for all ε > 0, since Λ_∞ is dense, there are n and λ ∈ Λ_n with h(x) − ε ≤ λ < h(x). Then x ∈ H(λ), and therefore for n big enough x ∈ A(λ, n), and then

h(x) ≥ h_n(x) ≥ λ ≥ h(x) − ε.

This means h_n(x) ↑ h(x) for all x ∈ B, so we have h_n ↑ h pointwise, and by monotone convergence (Q_n, F_n) ↑ P.

Proof of Theorem 23:
Let f be a density as supposed and set F := s(F_{f,Λ}). By assumption F is finite. If |F| = 1, then F_{f,Λ} is a chain and the theorem follows from Lemma 44 with B_n = ∅, n ∈ N, in the notation of that lemma. Hence we can now assume |F| > 1. We prove by induction over |F| that f dμ ∈ S̄(A) and c(f dμ) =_μ s(F_{f,Λ}), and assume that this is true for all f' with level forests |s(F_{f',Λ'})| < |F|. For readability we first handle the case that F is not a tree.

Assume that F has two or more roots A_1, …, A_k with k = k(0). Denote by f_i := f|_{A_i} the corresponding densities, hence f = f_1 + … + f_k, and set F_i := s(F_{f_i,Λ}) = F|_{⊂ A_i} and P_i := f_i dμ. We cannot use DisjointAdditivity, because separation of the A_i does not imply separation of the supports. Hence we have to construct a P-adapted isomonotone sequence (Q_n, F_n) ↗ P. Since F = F_1 ∪̇ … ∪̇ F_k, we have |F_i| < |F|, and hence by the induction assumption we have c(P_i) = F_i for all i ≤ k, and there are isomonotone P_i-adapted sequences (Q_{i,n}, F_{i,n}) ↗ P_i. For Q_n := Q_{1,n} + … + Q_{k,n} and F_n := F_{1,n} ∪ … ∪ F_{k,n} it is clear that (Q_n, F_n) ↗ P is isomonotone. Let b ∈ Q_P and B := supp b. We show that B is ◦◦_μ-connected to exactly one A_i. There is a β > 0 such that db = β 1_B dμ and β 1_B ≤ f μ-a.s. Now let λ ∈ Λ with λ < β and λ < inf{λ' ∈ Λ | k(λ') ≠ k(0)}. Because for all λ ∈ Λ also the closures of clusters are ⊥-separated, we have

B ⊂ H_f(λ) = B_1(λ) ⊥∪ … ⊥∪ B_k(λ).

By connectedness there is a unique i ≤ k with B ⊂ B_i(λ), and by monotonicity B ⊥ B_j(λ) for all j ≠ i. Since this holds for all sufficiently small λ ∈ Λ and Λ is dense, this means that B is ◦◦_μ-connected to exactly one A_i. Using this, the P-adaptedness of Q_n is inherited from the P_i-adaptedness of the Q_{i,n}. Therefore P = lim_n Q_n ∈ S̄(A) and c(P) = F.

Now assume that F is a tree.
Since |F| > 1, there are direct children A_1, …, A_k of the root in the structured forest F with k ≥ 2. Let ρ := inf{λ ∈ Λ | k(λ) = 1}. Since F is a tree, ρ > 0. Let f_0(ω) := min{ρ, f(ω)} and f'(ω) := max{0, f(ω) − ρ} for all ω ∈ Ω, and set dP_0 := f_0 dμ and dP' := f' dμ. Then P = P_0 + P' is split into a podest corresponding to the root and its chain, and the density corresponding to the children. We set Λ' := {λ − ρ | λ ∈ Λ, λ > ρ}. Then |s(F_{f',Λ'})| = |F| − 1, and by the induction assumption there is an adapted (Q'_n, F'_n) ↑ P'. Set B_n := G_{F'_n} and B := ⋃_n B_n. Then by Lemma 44 there is an adapted (Q_n, F_n) ↗ P_0, which is given by a density h_n.

Now there might be a gap ε_n := ρ − sup h_n > 0. By construction ε_n → 0, but to be precise we let

Q̃_n := Q'_n + Σ_{A ∈ max F'_n} ε_n · 1_A dμ.

This is still a simple measure on F'_n, and therefore (Q_n + Q̃_n, F_n ∪ F'_n) ↗ P. We have to show that this sequence is P-adapted:

Grounded: This is fulfilled, since we consider trees at the moment.

Fine: Let C_1, …, C_k ∈ F_n ∪ F'_n be direct siblings. Then C_1, …, C_k ∈ F'_n, because F_n is a chain. If they are contained in one of the roots of F'_n, fineness is inherited from the adaptedness of Q'_n. Otherwise they are the roots of F'_n. Let a = α 1_A dμ ∈ Q_P be a base measure that ◦◦_P-intersects, say, C_1 and C_2. Then it is clear that α ≤ ρ, and by P-subadditivity fineness is granted.

Motivated: Let C, C' ∈ F_n ∪ F'_n be direct siblings. Then again C, C' ∈ F'_n. If they are contained in one of the roots of F'_n, motivatedness is inherited from the adaptedness of Q'_n. Otherwise they are the roots of F'_n. Let a = α 1_A dμ ∈ Q_P be a base measure that supports C ∪ C'. Again it is clear that α ≤ ρ, and hence it can majorize neither the level of C nor that of C'.

Proof of Proposition 24:
Since f is continuous, all H_f(λ) are open, and each is the disjoint union of its open connected components. We show that any connected component contains at least one of x̂_1, …, x̂_k. To this end, let λ ≥ 0 and let B be a connected component of H_f(λ) (then B ≠ ∅). Because Ω is compact, so is the closure B̄, and hence the maximum of f on B̄ is attained at some y ∈ B̄. Since there is a y_0 ∈ B with f(y) ≥ f(y_0) > λ, we have y ∈ H_f(λ). Now H_f(λ) is an open set, so y is an inner point of this open set, and we know y ∈ B̄; therefore y ∈ B. Therefore y ∈ B is a local maximum.

Hence for all λ there are at most k components, and f is of (A, Q_{μ,A}, ⊥_∅)-type. The generalized structure s̃(F_f) is finite, since there are only k leaves.

Now fix for the moment a local maximum x̂_i. Since x̂_i is a local maximum, there is an ε_0 > 0 such that f(y) ≤ f(x̂_i) for all y with d(y, x̂_i) < ε_0. For all ε ∈ (0, ε_0) consider the sphere

S_ε(λ) := {y ∈ Ω : f(y) ≥ λ and d(y, x̂_i) = ε}.

Since Ω is compact and S_ε(λ) is closed, it is also compact. So as λ ↑ f(x̂_i), the S_ε(λ) form a monotone decreasing family of compact sets. Assume that all S_ε(λ) were non-empty: for λ_n ↑ f(x̂_i) let y_n ∈ S_ε(λ_n); then (y_n)_n is a sequence in the compact set S_ε(λ_1), hence there would be a subsequence converging to some y_ε. This subsequence eventually is in every S_ε(λ_n), and hence y_ε ∈ ⋂_λ S_ε(λ).
Lemma 45
Let A, A' be closed, non-empty, and (path-)connected. Then:

A ∪ A' is (path-)connected ⟺ A ◦◦_∅ A'.

Therefore any finite or countable union A_1 ∪ … ∪ A_k, k ≤ ∞, of such sets is connected iff the graph induced by the intersection relation is connected.

Proof of Lemma 45: Topological connectivity means that A ∪ A' cannot be written as a disjoint union of closed non-empty sets. Hence, if A ∪ A' is connected, then this union cannot be disjoint. On the other hand, if x ∈ A ∩ A' ≠ ∅ and A ∪ A' = B ∪ B' with non-empty closed sets, then x ∈ B or x ∈ B'. Say x ∈ B; then B' still has to intersect A or A', say B' ∩ A ≠ ∅. Then both B and B' intersect A, and both C := B ∩ A and C' := B' ∩ A are closed and non-empty. But since A = C ∪ C' is connected, there is a y ∈ C ∩ C' ⊂ B ∩ B', and therefore B ∪ B' is not a disjoint union.

For path-connectivity: if x ∈ A ∩ A' ≠ ∅, then for all y ∈ A ∪ A' there is a path connecting x to y, so A ∪ A' is path-connected. On the other hand, if A ∪ A' is path-connected, then for any x ∈ A and x' ∈ A' there is a continuous path f : [0, 1] → A ∪ A' connecting x to x'. Then B := f^{-1}(A) and B' := f^{-1}(A') are closed and non-empty, and B ∪ B' = [0, 1]. Since [0, 1] is topologically connected, there is a y ∈ B ∩ B', and so f(y) ∈ A ∩ A'.

Proof of Example 1:
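The intersection-graph criterion of Lemma 45 is easy to make concrete for finitely many sets. The sketch below is our own illustration (finite point sets stand in for the closed connected sets; the decision procedure is a plain breadth-first search, not part of the paper's formal development):

```python
from collections import deque

def intersection_graph_connected(sets):
    """Decide whether the graph with one node per set and an edge
    whenever two sets share a point (the intersection graph of
    Lemma 45) is connected."""
    n = len(sets)
    if n == 0:
        return True
    adj = [[i for i in range(n) if i != j and sets[i] & sets[j]]
           for j in range(n)]
    seen = {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n

# A meets B and B meets C, so the union of all three is connected...
A, B, C = {1, 2}, {2, 3}, {3, 4}
print(intersection_graph_connected([A, B, C]))  # True
# ...but dropping the middle set disconnects the union.
print(intersection_graph_connected([A, C]))     # False
```

By the lemma, connectedness of this graph is equivalent to (path-)connectedness of the union when the pieces are closed, non-empty, and (path-)connected.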
Reflexivity and monotonicity are trivial for all three relations.

Disjointness: Stability is trivial, and connectedness follows from Lemma 45 and from the observation

A ⊂ B_1 ⊥_∅∪ … ⊥_∅∪ B_k ⟹ A = (A ∩ B_1) ⊥_∅∪ … ⊥_∅∪ (A ∩ B_k).

τ-separation: Connectedness follows from the definition of τ-connectedness. For stability, let A_n ↑_n A and A_n ⊥_τ B for n ∈ N and observe

d(A, B) = sup_{x ∈ A} d(x, B) = sup_{n ∈ N} sup_{x ∈ A_n} d(x, B) = sup_{n ∈ N} d(A_n, B) ≥ τ.

Linear separation: Connectedness follows from the condition on A, since A ⊂ B_1 ⊥_ℓ∪ … ⊥_ℓ∪ B_k implies A = (A ∩ B_1) ⊥_ℓ∪ … ⊥_ℓ∪ (A ∩ B_k). To prove stability, let A_n ↑_n A and A_n ⊥_ℓ B for n ∈ N. Observe that v ↦ sup{α ∈ R | ⟨v | a⟩ ≤ α ∀a ∈ A} is continuous, and the same holds for the upper bound for the α. Hence for each n and any vector v ∈ H with ⟨v | v⟩ = 1 there is a compact, possibly empty interval I_n(v) of values α fulfilling the separation along v. Since by assumption the unit sphere is compact, so is the set I_n := {(v, α) | α ∈ I_n(v)}. Since I_n ≠ ∅ and I_n ⊃ I_{n+1} form a monotone sequence of non-empty compact sets, the limit ⋂_n I_n is non-empty.

Lemma 46
Let μ ∈ M^∞_Ω. If C ⊂ K(μ), then C^{⊥⊥}(C) ⊂ K(μ).

Proof of Lemma 46:
Let A = C_1 ∪ … ∪ C_k ∈ C^{⊥⊥}(C). Then

supp 1_A dμ = supp (1_{C_1} + ⋯ + 1_{C_k}) dμ = C_1 ∪ … ∪ C_k = A.

Lemma 47
Let C ⊂ B be a class of non-empty closed sets. We assume the following generalized stability: if B ∈ B and A_1, …, A_k ∈ C form a connected subgraph of G_{⊥⊥}(C), then

A_i ⊥⊥ B ∀ i ≤ k ⟹ A_1 ∪ … ∪ A_k ⊥⊥ B.

Then C^{⊥⊥}(C) is ⊥⊥-intersection additive. Furthermore, the monotone closure of C^{⊥⊥}(C) is

C̄^{⊥⊥}(C) := {C_1 ∪ C_2 ∪ … | C_1, C_2, … ∈ C and the graph G_{⊥⊥}({C_1, C_2, …}) is connected}.

Proof of Lemma 47:
Let A = C_1 ∪ … ∪ C_n, A' = C'_1 ∪ … ∪ C'_{n'} ∈ C^{⊥⊥}(C) with A ◦◦ A'. If for all j ≤ n' we had C'_j ⊥⊥ A, then by assumption A' ⊥⊥ A; therefore there has to be a j ≤ n' with C'_j ◦◦ A. By the same argument there then is an i ≤ n with C_i ◦◦ C'_j. Therefore the intersection graph on C_1, …, C_n, C'_1, …, C'_{n'} is connected, and

A ∪ A' = C_1 ∪ … ∪ C_n ∪ C'_1 ∪ … ∪ C'_{n'} ∈ C^{⊥⊥}(C).

Now let A_1, A_2, … ∈ C^{⊥⊥}(C) with A_n ↑ B. Then for all n we have A_n = C_{n1} ∪ … ∪ C_{n k(n)} with C_{nj} ∈ C, and their intersection graph is connected. Since A_n ⊂ A_{n+1}, for every C_{nj} there is a j' with C_{nj} ⊂ C_{(n+1)j'}, which even gives C_{nj} ◦◦ C_{(n+1)j'}. Hence the family {C_{nj}}_{n,j}, being countable, can be enumerated C̃_1, C̃_2, … such that for all m there is an i(m) < m with C̃_m ◦◦ C̃_{i(m)}. Therefore, for all m the intersection graph on C̃_1, …, C̃_m is connected, and hence Ã_m := C̃_1 ∪ … ∪ C̃_m ∈ C^{⊥⊥}(C). We see that ⋃_m Ã_m ∈ C̄^{⊥⊥}(C), and therefore

B = ⋃_n A_n = ⋃_{n,j} C_{nj} = ⋃_m C̃_m ∈ C̄^{⊥⊥}(C).

Now let B ∈ C̄^{⊥⊥}(C) with B = ⋃_n C_n, where C_n ∈ C and the intersection graph on C_1, C_2, … is connected. By Zorn's Lemma it has a spanning tree. Since there are at most countably many nodes, one can assume that this tree is locally countable, and therefore there is an enumeration of the nodes C_{n(1)}, C_{n(2)}, … such that they form a connected subgraph for every m. Then the intersection graph on C_{n(1)}, …, C_{n(m)} is connected for all m, and therefore A_m := C_{n(1)} ∪ … ∪ C_{n(m)} ∈ C^{⊥⊥}(C). The sequence A_m ↑ B is monotone, and we have B = ⋃_m A_m in the monotone closure of C^{⊥⊥}(C).

Proposition 48
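The enumeration used in the proof of Lemma 47 — ordering countably many sets so that every set after the first intersects an earlier one, i.e. every prefix of the enumeration has a connected intersection graph — can be realized by a breadth-first traversal. The sketch below is our own finite illustration (set names and the helper are ours, and connectedness of the intersection graph is assumed, as in the lemma):

```python
from collections import deque

def prefix_connected_order(sets):
    """Order the given sets as C~_1, C~_2, ... so that each C~_m with
    m > 1 intersects some earlier C~_{i(m)}, i(m) < m.  A BFS over the
    intersection graph (edge iff two sets share a point) produces such
    an order whenever that graph is connected."""
    n = len(sets)
    order, seen = [0], {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in range(n):
            if w not in seen and sets[v] & sets[w]:
                seen.add(w)
                order.append(w)
                queue.append(w)
    return [sets[i] for i in order]

# Connected intersection graph, given in a scrambled order:
pieces = [{4, 5}, {1, 2}, {3, 4}, {2, 3}]
ordered = prefix_connected_order(pieces)
# Every set after the first meets the union of its predecessors.
for m in range(1, len(ordered)):
    assert ordered[m] & set().union(*ordered[:m])
```

This is exactly the property that lets the proof build the monotone sequence Ã_m := C̃_1 ∪ … ∪ C̃_m inside C^{⊥⊥}(C).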
Let C ⊂ B be a class of non-empty, closed events and ⊥ a C-separation relation. We assume the following generalized countable stability: if B ∈ B and A_1, A_2, … ∈ C form a connected subgraph of G_⊥(C), then

A_n ⊥ B ∀ n ⟹ ⋃_n A_n ⊥ B.

Then ⊥ is a C^⊥(C)-separation relation.

Proof of Proposition 48:
Set Ã := C^⊥(C). The assumption ensures Ã-stability. We have to show Ã-connectedness. So let A ∈ Ã and B_1, …, B_k ∈ B be closed with

A ⊂ B_1 ⊥∪ … ⊥∪ B_k.

By the definition of C^⊥(C) there are C_1, …, C_n ∈ C with A = C_1 ∪ … ∪ C_n such that the ⊥-intersection graph on {C_1, …, C_n} is connected. For all j ≤ n we have C_j ⊂ A ⊂ B_1 ∪ … ∪ B_k, and by C-connectedness there is an i(j) ≤ k with C_j ⊂ B_{i(j)}. Now, whenever i(j) ≠ i(j'), then since B_{i(j)} and B_{i(j')} are ⊥-separated we have C_j ⊥ C_{j'} by monotonicity. So whenever there is an edge between C_j and C_{j'}, then i(j) = i(j'). This means that i(·) is constant on connected components of the graph, and hence on the whole graph.

Proposition 49
Let C ⊂ B be a class of non-empty, closed events and ⊥ a C-separation relation with the following alternative C^⊥(C)-stability: for all A_1, A_2, … ∈ C and B ∈ B:

G_{⊥⊥}({A_1, A_2, …}) is connected and A_n ⊥ B for all n ⟹ ⋃_n A_n ⊥ B. (28)

Then ⊥ is a C^⊥(C)-separation relation and C^⊥(C) is ⊥-intersection additive.

Assume furthermore that ⊥⊥ is a weaker relation (B ⊥ B' ⟹ B ⊥⊥ B'). Then ⊥ is a C^{⊥⊥}(C)-separation relation and C^{⊥⊥}(C) is ⊥⊥-intersection additive.

Proof of Proposition 49:
The first part is a corollary of Lemma 47 and Proposition 48. For the second part, observe that C^{⊥⊥}(C) ⊂ C^⊥(C); hence ⊥ is also a C^{⊥⊥}(C)-separation relation. But now C^{⊥⊥}(C) is only ⊥⊥-intersection additive.

Proof of Proposition 26:
First, if A_n ↑ B ∈ Ā, then for all x, x' ∈ B there is an n with x, x' ∈ A_n, and since A_n is path-connected there is a path connecting x and x' in A_n ⊂ B; so they are connected in B as well.

Let O be open and path-connected, and let (A_n)_n ⊂ A' be the sequence of all A ∈ A' with A ⊂ O. Since O is open and A' is a neighborhood base, O = ⋃_n A_n. Consider the graph on the (A_n)_n given by the intersection relation. Then by Zorn's Lemma there is a spanning tree, and we can assume that it is locally at most countable. Therefore there is an enumeration A'_1, A'_2, … such that {A'_1, …, A'_n} is a connected subgraph for all n. By intersection-additivity we hence have Ã_n := A'_1 ∪ … ∪ A'_n ∈ A and Ã_n ↑ O.

Lemma 50
Let μ ∈ M^∞_Ω and assume there is a B ∈ K(μ) with dP = 1_B dμ. Assume that (A, Q_{μ,A}, ⊥_A) is a P-subadditive stable clustering base and that (Q_n, F_n) ↑ P is adapted. Then s(F_n) = {A_{n1}, …, A_{nk}} consists only of roots and can be ordered in such a way that A_{1i} ⊂ A_{2i} ⊂ …. The limit forest F_∞ then consists of the k pairwise ⊥_A-separated sets

B_i := ⋃_{n ≥ 1} A_{ni},

and there is a μ-null set N ∈ B with

B = B_1 ⊥_A∪ … ⊥_A∪ B_k ⊥_∅∪ N. (29)

Proof of Lemma 50:
Once we have shown that all s(F_n) consist only of their roots, the rest is a direct consequence of the isomonotonicity and the fact that there is a µ-null set N s.t.

B = supp P = N ⊥_∅∪ ⋃_n supp Q_n = B_1 ⊥_A∪ . . . ⊥_A∪ B_k ⊥_∅∪ N.

Now let
A, A′ ∈ F_n be direct siblings and denote by a, a′ ≤ P their levels in Q_n. Then there are α, α′ > 0 with da = α 1_A dµ and da′ = α′ 1_{A′} dµ. Now, a, a′ ≤ P implies α 1_A, α′ 1_{A′} ≤ 1_B (µ-a.s.) and hence α, α′ ≤ 1. Assume they have a common root A_0 ∈ max F_n, i.e. A ∪ A′ ⊂ A_0 ⊂ B. Then α 1_A, α′ 1_{A′} ≤ 1_{A_0} ≤ 1_B (µ-a.s.), and hence they cannot be motivated.

Proof of Lemma 27:
The Hausdorff dimension is calculated in (Falconer, 1993, Corollary 2.4). Proposition 2.2 therein gives for all events B ⊂ C and B′ ⊂ C′:

H^s(ϕ(B)) ≤ c^s H^s(B)  and  H^s(ϕ^{-1}(B′)) ≤ c^s H^s(B′).

We show that C′ is an H^s-support set. Let B′ ⊂ C′ be any relatively open set and set B := ϕ^{-1}(B′) ⊂ C. Then B ⊂ C is open because ϕ is a homeomorphism. And since C is a support set we have 0 < H^s(B) < ∞. This gives

0 < H^s(B) = H^s(ϕ^{-1}(B′)) ≤ c^s H^s(B′)  and  H^s(B′) = H^s(ϕ(B)) ≤ c^s H^s(B) < ∞.

Therefore C′ is an H^s-support set.

Proof of Proposition 28:
The proof is split into four steps.

(a). We first show that for all A ∈ A there is a unique index i(A) with A ∈ A_{i(A)}. To this end, we fix an A ∈ A. Then there is an i ≤ m with A ∈ A_i. Let µ ∈ Q_i be the corresponding base measure with supp µ = A. Let j ≤ m and µ′ ∈ Q_j be another measure with supp µ′ = A. Then µ(A) = 1 and µ′(A) = 1. If j > i then by assumption µ ≺ µ′, and this would give µ′(A) = 0. If j < i we have µ′ ≺ µ, and this would give µ(A) = 0. So i = j.

(b). Next we show that for all
A, A′ ∈ A with A ⊂ A′ we have i(A) ≤ i(A′). To this end we first observe that A = A ∩ A′ = supp Q_A ∩ supp Q_{A′}. If we had i := i(A) > j := i(A′), then Q_{A′} ∈ Q_j ≺ Q_i ∋ Q_A, and since Q_{A′}(A) ≤ Q_{A′}(A′) = 1 < ∞ we would have Q_A(A) = 0. Therefore i ≤ j.

(c). Now we show that ⊥ is a stable A-separation relation. Clearly, it suffices to show A-stability and A-connectedness. The former follows since i(A_n) is monotone if A_1 ⊂ A_2 ⊂ . . . by (b) and hence eventually constant. For the latter let A ∈ A_i and B_1, . . . , B_k ∈ B be closed with A ⊂ B_1 ⊥∪ . . . ⊥∪ B_k. Then, since ⊥ is an A_i-separation relation, there is a j ≤ k with A ⊂ B_j.

(d). Finally, we show that (A, Q, ⊥) is a stable clustering base. To this end observe that fittedness is inherited from the individual clustering bases. Let A ∈ A_i and A′ ∈ A_j with A ⊂ A′. Then i ≤ j by (b). If i = j then flatness follows from flatness of A_i. If i < j then by assumption Q_A ≺ Q_{A′}, and because Q_A(A) = 1 < ∞ we have Q_{A′}(A) = 0.

Proof of Proposition 29: (a).
Let a ≤ P be a base measure on A ∈ A_i. If i = 1 then Q_A(A ∩ supp P_2) ≤ Q_A(A) = 1, and by A_1 ≺ P_2 we have Q_A ≺ P_2 and hence P_2(A ∩ supp P_2) = P_2(A) = 0. Now for all events C ⊂ A^c we therefore have a(C) = 0 ≤ α_1 P_1(C), and for all C ⊂ A:

a(C) ≤ P(C) = α_1 P_1(C) + α_2 P_2(C) = α_1 P_1(C).

Now if i = 2 then by assumption P_1 ≺ a, and since 0 < P_1(A ∩ supp P_1) < ∞ we therefore have a(A ∩ supp P_1) ≤ a(supp P_1) = 0. For all events C ⊂ Ω \ supp P_1 we have a(C) ≤ P(C) = α_2 P_2(C), and for all events C ⊂ supp P_1: a(C) ≤ a(supp P_1) = 0 ≤ α_2 P_2(C).

(b). Let a, a′ ≤ P be base measures on A ∈ A_i and A′ ∈ A_j with A ◦◦_A A′. By the previous statement we then already have a ≤ α_i P_i and a′ ≤ α_j P_j. Now, if i = j then by P_i-subadditivity of A_i there is a base measure b ≤ α_i P_i ≤ P on some B ∈ A_i with B ⊃ A ∪ A′. If i ≠ j, consider say i = 2 and j = 1. Since A ∩ supp P_1 ⊃ A ∩ A′ ≠ ∅ by assumption, a can be majorized by a base measure ã ≤ P on Ã ∈ A_2 with supp P_1 ⊂ Ã and ã ≥ a. The latter also gives A ⊂ Ã, so ã supports A; moreover supp a′ ⊂ supp P_1 ⊂ Ã and ã ≥ a.

Acknowledgments
This work has been supported by DFG Grant STE 1074/2-1. We thank the reviewers andeditors for their helpful comments.
Appendix A. Measure and Integration Theoretic Tools
Throughout this appendix, Ω is a Hausdorff space and B is its Borel σ-algebra. Recall that a measure µ on B is inner regular iff for all A ∈ B we have

µ(A) = sup{ µ(K) | K ⊂ A is compact }.

A Radon space is a topological space in which all finite measures are inner regular. Cohn (2013, Theorem 8.6.14) gives several examples of such spaces: a) Polish spaces, i.e. separable spaces whose topology can be described by a complete metric, b) open and closed subsets of Polish spaces, and c) Banach spaces equipped with their weak topology. In particular, all separable Banach spaces equipped with their norm topology are Polish spaces, and infinite-dimensional spaces equipped with the weak topology are not Polish spaces, but they are still Radon spaces. Furthermore, Hausdorff measures, which are considered in Section 4.3, are inner regular (Federer, 1969, Cor. 2.10.23). For any inner regular measure µ we define the support by

supp µ := Ω \ ⋃{ O ⊂ Ω | O is open and µ(O) = 0 }.

By definition the support is closed and hence measurable. The following lemma collects some more basic facts about the support that are used throughout this paper.
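For finite discrete measures, inner regularity is automatic and the support is simply the set of atoms, so the support facts collected in Lemma 51 below can be sanity-checked directly. The following is a minimal sketch in plain Python; the helper names `combine`, `support`, and `restrict` are illustrative choices of ours, not notation from the paper.

```python
def combine(mu, nu, a, b):
    """The measure a*mu + b*nu for a, b > 0 (measures given as dicts point -> mass)."""
    return {x: a * mu.get(x, 0.0) + b * nu.get(x, 0.0) for x in set(mu) | set(nu)}

def support(mu):
    """For a finite discrete measure, supp mu is simply its set of atoms of positive mass."""
    return {x for x, m in mu.items() if m > 0}

def restrict(mu, A):
    """The restriction mu|_A with mu|_A(B) = mu(B ∩ A): keep only the atoms lying in A."""
    return {x: m for x, m in mu.items() if x in A}

mu = {0.0: 0.5, 1.0: 0.5}
nu = {1.0: 0.25, 2.0: 0.75}

# Lemma 51(c): supp(a*mu + b*nu) = supp(mu) ∪ supp(nu) whenever a, b > 0
assert support(combine(mu, nu, 2.0, 3.0)) == support(mu) | support(nu)

# Lemma 51(d): supp(mu|_A) ⊂ A ∩ supp(mu)  (closures are trivial for finite sets)
A = {1.0, 2.0}
assert support(restrict(mu, A)) <= A & support(mu)
```

Of course this only illustrates the atomic toy case; the content of the lemma lies in the general inner regular setting, where approximation by compact sets replaces the finite atom count.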
Lemma 51
Let µ be an inner regular measure and A ∈ B. Then we have:

(a) If A ⊥_∅ supp µ, then we have µ(A) = 0.

(b) If ∅ ≠ A ⊂ supp µ is relatively open in supp µ, then µ(A) > 0.

(c) If µ′ is another inner regular measure and α, α′ > 0, then supp(αµ + α′µ′) = supp(µ) ∪ supp(µ′).

(d) The restriction µ|_A of µ to A, defined by µ|_A(B) = µ(B ∩ A), is an inner regular measure and supp µ|_A ⊂ Ā ∩ supp µ.

If µ is not inner regular, (d) also holds provided that Ω is a Radon space and µ(A) < ∞.

Proof of Lemma 51: (a).
We show that A := Ω \ supp µ is a µ-null set. Let K ⊂ A be any compact set. By definition A is the union of all open sets O ⊂ Ω with µ(O) = 0, so those sets form an open cover of A and therefore of K. Since K is compact there exists a finite sub-cover {O_1, . . . , O_n} of K. By σ-subadditivity of µ we find

µ(K) ≤ Σ_{i=1}^n µ(O_i) = 0,

and since this holds for all such compact K ⊂ A, we have by inner regularity µ(A) = sup_{K ⊂ A} µ(K) = 0.

(b). By assumption there is an open O ⊂ Ω with ∅ ≠ A = O ∩ supp µ. Now O ∩ supp µ ≠ ∅ implies µ(O) > 0. Moreover, we have the partition O = A ∪ (O \ supp µ), and since O \ supp µ is open, we know µ(O \ supp µ) = 0; hence we conclude that µ(A) = µ(O) > 0.

(c). This follows from the fact that for all open O ⊂ Ω we have

(αµ + α′µ′)(O) = αµ(O) + α′µ′(O) = 0 ⇐⇒ µ(O) = 0 and µ′(O) = 0.

(d). The measure µ|_A is inner regular since for B ∈ B we have

µ|_A(B) = sup{ µ(K′) | K′ ⊂ B ∩ A is compact } ≤ sup{ µ|_A(K′) | K′ ⊂ B is compact } ≤ µ|_A(B).

Now observe that Ω \ (Ā ∩ supp µ) ⊂ Ω \ (A ∩ supp µ) = (Ω \ A) ∪ (Ω \ supp µ). For the open set O := Ω \ (Ā ∩ supp µ) we thus find

µ|_A(O) ≤ µ|_A(Ω \ A) + µ|_A(Ω \ supp µ) ≤ µ(Ω \ supp µ) = 0.

Lemma 52
Let Q, Q′ be σ-finite measures.

(a) If Q and Q′ have densities h, h′ with respect to some measure µ, then Q ≤ Q′ ⇐⇒ h ≤ h′ µ-a.s.

(b) If Q ≤ Q′ then Q is absolutely continuous with respect to Q′, i.e. Q has a density function h with respect to Q′, dQ = h dQ′, such that h(x) ∈ [0, 1] if x ∈ supp Q′ and h(x) = 0 else.

Proof of Lemma 52: (a). "⇐": A direct calculation and the monotonicity of the integral give

Q(B) = ∫_B h dµ ≤ ∫_B h′ dµ = Q′(B).

For "⇒" assume µ({x : h(x) > h′(x)}) > 0. Then

∫_{h>h′} h dµ = Q({h > h′}) ≤ Q′({h > h′}) = ∫_{h>h′} h′ dµ < ∫_{h>h′} h dµ,

where the last inequality holds since we assume µ({x : h(x) > h′(x)}) > 0, again using the monotonicity of the integral. This contradiction implies the statement.

(b). Q ≤ Q′ implies that every Q′-null set is a Q-null set. Furthermore, since Q′ is σ-finite, Q is σ-finite as well. So we can use the Radon–Nikodym theorem: there is an h ≥ 0 s.t. dQ = h dQ′. Since the complement of supp Q′ is a Q′-null set, we can assume h(x) = 0 on this complement.

We have to show that h ≤ 1 a.s. Let E_n := {h ≥ 1 + 1/n} and E := {h > 1}. Then E_n ↑ E and we have

Q′(E_n) ≥ Q(E_n) = ∫_{E_n} h dQ′ ≥ (1 + 1/n) · Q′(E_n),

which implies Q′(E_n) = 0 for all n. Therefore Q′(E) = lim_n Q′(E_n) = 0.

Lemma 53
(a) Let Q_n ↑ P, A := supp P and B := ⋃_n supp Q_n. Then B ⊂ A and P(A \ B) = 0.

(b) Assume Q is a finite measure and Q_1 ≤ Q_2 ≤ . . . ≤ Q, and let h_n := dQ_n/dQ be the densities. Then h_1 ≤ h_2 ≤ . . . ≤ 1 Q-a.s. Furthermore, the following are equivalent: (i) Q_n ↑ Q; (ii) h_n ↑ 1 Q-a.s.; (iii) h_n ↑ 1 in L_1.

Proof of Lemma 53: (a). Since Q_n ≤ P we have supp Q_n ⊂ A and therefore B ⊂ A. Because of (A \ B) ∩ supp Q_n = ∅ we have Q_n(A \ B) = 0 for all n, and the convergence yields

P(A \ B) = lim_{n→∞} Q_n(A \ B) = 0.
(b). By the previous lemma we have h_1 ≤ h_2 ≤ · · · ≤ 1 Q-a.s.

(i) ⇒ (ii): Since (h_n)_n is monotone Q-a.s., it converges Q-a.s. to a limit h ≤ 1. Let E_n := {h ≤ 1 − 1/n} and E := {h < 1}. Then E_n ↑ E and we have by the monotone convergence theorem

Q_m(E_n) = ∫_{E_n} h_m dQ −→ ∫_{E_n} h dQ ≤ (1 − 1/n) Q(E_n)  as m → ∞.

But since Q_m(E_n) ↑_m Q(E_n), this means that Q(E_n) = 0 for all n and therefore Q(E) = lim_n Q(E_n) = 0.

(ii) ⇒ (iii): This follows from monotone convergence, because 1 ∈ L_1(Q).

(iii) ⇒ (i): For all B ∈ B:

Q(B) − Q_n(B) = ∫_B |1 − h_n| dQ ≤ ∫ |1 − h_n| dQ → 0

because of h_n → 1 in L_1.

References
S. Ben-David. Computational Feasibility of Clustering under Clusterability Assumptions. ArXiv e-prints, January 2015.

J. E. Chacón. A population background for nonparametric density-based clustering. ArXiv e-prints, August 2014. URL http://arxiv.org/abs/1408.1381.

D. L. Cohn. Measure Theory. Birkhäuser, 2nd edition, 2013.

W. Day and F. McMorris. Axiomatic Consensus Theory in Group Choice and Biomathematics. Society for Industrial and Applied Mathematics, 2003.

D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, 1977.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17(5):420–425, September 1973.

M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.

K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. Wiley, 1993.

H. Federer. Geometric Measure Theory. Springer, 1969.

G. Gan, C. Ma, and J. Wu. Data Clustering: Theory, Algorithms, and Applications. SIAM, 2007.

J. A. Hartigan. Clustering Algorithms. Wiley, 1975.

M. F. Janowitz and R. Wille. On the classification of monotone-equivariant cluster methods. In Cox, Hansen, and Julesz, editors, Partitioning Data Sets: DIMACS Workshop 1993, pages 117–142. AMS, 1995.

N. Jardine and R. Sibson. Mathematical Taxonomy. Wiley, 1971.

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.

J. M. Kleinberg. An impossibility theorem for clustering. In Becker, Thrun, and Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 463–470. MIT Press, 2003.

J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press, 2007.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297. University of California Press, 1967.

M. Meilă. Comparing clusterings: An axiomatic view. In International Conference on Machine Learning, ICML '05, pages 577–584. ACM, 2005.

G. Menardi. A review on modal clustering. International Statistical Review, 2015.

B. Mirkin. Clustering for Data Mining: A Data Recovery Approach. Chapman & Hall/CRC, 2005.

B. G. Mirkin. On the problem of reconciling partitions. In Quantitative Sociology: International Perspectives on Mathematical and Statistical Modelling, pages 441–449, 1975.

B. G. Mirkin. Additive clustering and qualitative factor analysis methods for similarity matrices. Journal of Classification, 4(1):7–31, 1987.

J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition, 33:617–634, 1999.

A. Rinaldo and L. Wasserman. Generalized density clustering. Annals of Statistics, 38(5):2678–2722, 2010.

R. N. Shepard and P. Arabie. Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86(2):87–123, 1979.

R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.

W. Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(1):25–47, 2003.

U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

U. von Luxburg and S. Ben-David. Towards a statistical theory of clustering. In PASCAL Workshop on Statistics and Optimization of Clustering, 2005.

J. H. Ward, Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

R. B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Bilmes and Ng, editors, Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.