A score function for Bayesian cluster analysis
John Noble, Lukasz Rajkowski

May 27, 2019
Abstract
We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within-cluster variance and the between-cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or the K-means algorithm.

1 Introduction

Many clustering methods generate a family of clusterings that depend on some user-defined parameters. The most prominent example is the K-means algorithm, where the investigator has to specify the number of clusters. Similarly, in hierarchical clustering a whole family of clusterings is obtained, starting from the finest partition into singletons and ending in the coarsest clustering, i.e. a single cluster. Again, the investigator chooses the number of clusters, based on the dendrogram.

All these methods come with a variety of suggestions for how to choose the optimal number of clusters. Some of these are rather heuristic in nature, while others have deep theoretical foundations. For the K-means algorithm these include the elbow method and the average silhouette method (Rousseeuw [1987]). Another solution is to use a score statistic (a function which is intended to measure the quality of a clustering) and, among the different clusterings proposed by a given method, choose the one that maximises the score statistic. Constructing score statistics is not a trivial task; one of the most popular choices is the gap statistic (Tibshirani et al. [2001]).

In this article we propose a new score statistic. It is derived as a limit of the first order approximation to the posterior probability (up to the norming constant) in a Nonparametric Bayesian Mixture Model with the inverse Wishart distribution as a base measure for the within-group covariance matrices and the Gaussian distribution as a base measure for the cluster means, with Gaussian component measures. In order to derive the limit we assume that the data is an independent sample from some 'input' probability distribution on the observation space; this gives a method of assessing the compatibility of partitions of the observation space with the input distribution. The score function is obtained by taking the empirical measure as the input distribution and tweaking it slightly so that it is well defined on all possible data clusterings.

1.1 The score function

Our main contribution is the formulation of a novel score function for clusterings, which is motivated theoretically and performs well on the analysed datasets. Suppose that we have a sequence of observations $x_1,\ldots,x_n \in \mathbb R^d$ and we believe that it consists of several groups, within each of which the data is distributed according to some Gaussian distribution (with unknown mean and covariance matrix). The goal is to construct a simple function that measures how well a given clustering of the dataset corresponds to the assumption of being Gaussian within clusters. Our proposition is the following: for $I\subset[n]$ we define $\bar x_I = \frac{1}{|I|}\sum_{i\in I}x_i$ and $\hat V_x(I) = \frac{1}{|I|}\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t$, and for notational simplicity we denote $\hat V_x := \hat V_x([n])$. For $x=(x_1,\ldots,x_n)$ and $\mathcal I$ a partition of $[n]=\{1,2,\ldots,n\}$ let
$$\mathcal D(x,\mathcal I) := -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\det\Big(\frac{\hat V_x}{|I|}+\hat V_x(I)\Big) + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}. \qquad (1.1)$$
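To make (1.1) concrete, here is a minimal sketch in Python (ours, not part of the paper; the function name `score_D`, the synthetic data and the use of scikit-learn's K-means are illustrative assumptions) that evaluates the score for a given clustering and uses it to choose the number of clusters for K-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def score_D(x, labels):
    """Score (1.1): -1/2 sum_I (|I|/n) ln det(V_hat_x/|I| + V_hat_x(I)) + sum_I (|I|/n) ln(|I|/n)."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    V_full = np.atleast_2d(np.cov(x, rowvar=False, bias=True))   # \hat V_x, covariance of the whole sample
    score = 0.0
    for lab in np.unique(labels):
        block = x[labels == lab]
        m = len(block)
        centred = block - block.mean(axis=0)
        V_I = centred.T @ centred / m                            # \hat V_x(I), within-cluster covariance
        _, logdet = np.linalg.slogdet(V_full / m + V_I)          # slogdet for numerical stability
        score += -0.5 * (m / n) * logdet + (m / n) * np.log(m / n)
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # three well-separated Gaussian groups in R^2 (illustrative data, not from the paper)
    x = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
    for k in range(1, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        print(k, round(score_D(x, labels), 3))   # the score should typically peak at k = 3 here
```

On well-separated data of this kind one would expect the maximiser of the score over $k$ to coincide with the true number of groups.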
It should be noted that if $x$ is a realisation of an independent random sample $X_1,\ldots,X_n$ from some distribution $P$ on $\mathcal X$, then the components of formula (1.1) can be treated as empirical estimates of the relevant probabilities and conditional covariance matrices. This is in fact how (1.1) is obtained; we investigate the details in Section 3. This remark is also convenient when dealing with large datasets, where the exact computation of (1.1) could be time consuming; in such cases the variance components of (1.1) can be approximated using random subsamples of the clusters.

We start our presentation with a formal definition of a score function, intended to measure the quality of a data clustering.
Notation.
For $n\in\mathbb N$ let $[n]=\{1,\ldots,n\}$ and let $\Pi_n$ be the set of all partitions of $[n]$. Let $\mathcal X=\mathbb R^d$ be the observation space. Let $\mathcal O=\bigcup_{n=1}^{\infty}\mathcal X^n\times\Pi_n$ be the set of all possible finite sequences of observations together with their partitions, and let $\bar{\mathbb R}=\mathbb R\cup\{-\infty,\infty\}$.

Definition. A clustering score function is any function $\mathcal S:\mathcal O\to\bar{\mathbb R}$.

Definition.
Let $\mathcal S$ be a score function and let $\mathcal F$ be a family of functions from $\mathcal X$ to $\mathcal X$. We say that $\mathcal S$ is robust to $\mathcal F$ if for every $x=(x_1,\ldots,x_n)\in\mathcal X^n$, every $\mathcal I,\mathcal J\in\Pi_n$ and every $f\in\mathcal F$ we have $\mathcal S(x,\mathcal I)\le\mathcal S(x,\mathcal J)$ if and only if $\mathcal S(f(x),\mathcal I)\le\mathcal S(f(x),\mathcal J)$, where $f(x)=\big(f(x_1),\ldots,f(x_n)\big)$.

Hence robustness to $\mathcal F$ means that if we apply any function $f\in\mathcal F$ to all observations, the optimal clustering indicated by the score function does not change. If no prior knowledge about the clustering structure is available, a natural demand of a score function is that it be robust to linear isomorphisms of $\mathcal X$. In particular, it should be robust to scaling of the axes, since it would be strange if the result of applying the score function depended on the units used to measure the observations. For similar reasons, we expect a good score function to be robust to translations. Note that, on the other hand, robustness to all linear transformations would be undesirable: in particular, moving all points to the origin is a linear transformation and we do not expect any clusters to be visible after applying it.

Notation.
Let $\mathcal A$ and $\mathcal B$ be two partitions of the same set. We say that $\mathcal A$ is finer than $\mathcal B$ if for every $A\in\mathcal A$ there exists $B\in\mathcal B$ such that $A\subset B$. Equivalently, we say that $\mathcal B$ is coarser than $\mathcal A$, and we write $\mathcal A\preceq\mathcal B$.

Definition.
Let $\mathcal S$ be a clustering score function. We say that it is non-decreasing if for every $x\in\mathcal X^n$ and $\mathcal I,\mathcal J\in\Pi_n$ such that $\mathcal I\preceq\mathcal J$ we have $\mathcal S(x,\mathcal I)\le\mathcal S(x,\mathcal J)$. If $-\mathcal S$ is non-decreasing then $\mathcal S$ is non-increasing.

Clearly, no non-decreasing score function would be good for clustering purposes, as it would assign the highest score to the clustering into one full cluster, regardless of the data. Similarly, a non-increasing function gives the highest score to the partition into singletons. It seems desirable for these two tendencies to interplay, and it is theoretically appealing to identify increasing and decreasing parts in a given score function.

2.2 Properties of the $\mathcal D$ score function

Notation.
To simplify notation in the remaining part of the text, we use $|\Sigma|$ to denote the determinant of a square matrix $\Sigma$.

Definition.
With the notation presented in Section 1.1 we define
$$\mathcal D_\Sigma(x,\mathcal I) := -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}+\hat V_x(I)\Big| + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}, \qquad (2.1)$$
and then $\mathcal D(x,\mathcal I)=\mathcal D_{\hat V_x}(x,\mathcal I)$ (which is equivalent to (1.1)). Moreover, we use $\mathcal D_0$ to denote $\mathcal D_\Sigma$ with $\Sigma$ being the zero matrix.

Property 1.
Let $x_1,\ldots,x_n\in\mathcal X$ be such that $x_1,\ldots,x_n$ span $\mathcal X$. Let $x=(x_1,\ldots,x_n)$. Then $|\mathcal D(x,\mathcal I)|<\infty$ for any $\mathcal I\in\Pi_n$.

Proof. For any $v\in\mathbb R^d$,
$$v^t\Big(\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t\Big)v = \sum_{i\in I}\big(v^t(x_i-\bar x_I)\big)^2 \ge 0,$$
so $\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t$ is non-negative definite. Moreover, it follows from the assumptions that $\hat V_x$ is positive definite. The sum of a non-negative definite matrix and a positive definite matrix is positive definite, so its determinant is positive. Therefore all the summands in (1.1) are finite and the proof follows.

Property 2.
The score function $\mathcal D$ is robust to translations and linear isomorphisms.

Proof. It is easy to check that for any $x\in\mathcal X^n$, $\mathcal I\in\Pi_n$ and any translation $T$ we have $\mathcal D(x,\mathcal I)=\mathcal D\big(T(x),\mathcal I\big)$, and hence robustness to translations.

Let $L:\mathcal X\to\mathcal X$ be a linear automorphism, defined by $L(x)=Ax$, where $A$ is a $d\times d$ invertible matrix. Then
$$\begin{aligned}
\mathcal D\big(L(x),\mathcal I\big) &= -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{1}{|I|}A\hat V_xA^t+\frac{1}{|I|}\sum_{i\in I}A(x_i-\bar x_I)(x_i-\bar x_I)^tA^t\Big| + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}\\
&= -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|A\Big(\frac{\hat V_x}{|I|}+\frac{1}{|I|}\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t\Big)A^t\Big| + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}\\
&= -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big(|A|\cdot\Big|\frac{\hat V_x}{|I|}+\frac{1}{|I|}\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t\Big|\cdot|A^t|\Big) + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}\\
&= \mathcal D(x,\mathcal I) - \ln|A|,
\end{aligned} \qquad (2.3)$$
which clearly implies robustness to linear isomorphisms.

Property 3. As functions of the partition $\mathcal I$ (in the sense of the definitions above):
(a) $\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}$ is increasing,
(b) $-\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\big|\hat V_x(I)\big|$ is decreasing,
(c) $-\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}\Big|$ is increasing.

Proof. Parts (a) and (b) follow from Proposition 6 by taking the empirical measure instead of $P$. Part (c) follows from (a), because
$$-\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}\Big| = d\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n} + d\ln n - \ln|\Sigma|. \qquad (2.4)$$
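As a quick numerical illustration of Property 2 (our own sketch, not from the paper; it compactly re-implements (1.1) so as to be self-contained), applying an invertible affine map $x\mapsto Ax+b$ should shift the score of every clustering of the same data by the same constant $-\ln|\det A|$, so the ranking of clusterings is unchanged.

```python
import numpy as np

def score_D(x, labels):
    # compact re-implementation of the score (1.1); see the sketch in Section 1
    x = np.asarray(x, float)
    n = len(x)
    V_full = np.cov(x, rowvar=False, bias=True)
    total = 0.0
    for lab in np.unique(labels):
        block = x[labels == lab]
        m = len(block)
        c = block - block.mean(axis=0)
        _, logdet = np.linalg.slogdet(V_full / m + c.T @ c / m)
        total += -0.5 * (m / n) * logdet + (m / n) * np.log(m / n)
    return total

rng = np.random.default_rng(1)
x = rng.normal(size=(60, 3))
labels1 = rng.integers(0, 3, size=60)          # two arbitrary clusterings of the same data
labels2 = rng.integers(0, 4, size=60)
A = rng.normal(size=(3, 3))                    # a generic (almost surely invertible) matrix
b = rng.normal(size=3)
y = x @ A.T + b                                # affine image of the data
shift = -np.log(abs(np.linalg.det(A)))         # predicted constant shift, -ln|det A|
for lab in (labels1, labels2):
    print(score_D(y, lab) - score_D(x, lab), shift)   # the two numbers should agree up to rounding
```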
3 The derivation

In this section we give the theoretical foundations for considering the function $\mathcal D$ as a clustering score function. We present a general formulation of a Bayesian Mixture Model and then concentrate on the case where the data within clusters are Gaussian. We analyse the asymptotics of the formula for the (unnormalised) posterior in this model. In this way we concentrate on scoring partitions of the observation space rather than of the data themselves. However, it is easy to switch to the score statistic by considering an empirical counterpart of $P$ instead of $P$; this yields $\mathcal D_0$ (cf. (2.1)). The general form of (2.1) is constructed to prevent the function $\mathcal D$ from assigning an infinite score to clusterings with very small clusters (of size less than the dimension of the observation space); on the other hand, when the clusters are large enough, $\mathcal D$ approximates $\mathcal D_0$.

Let $\Theta\subset\mathbb R^p$ be the parameter space and $\{G_\theta:\theta\in\Theta\}$ a family of probability measures on the observation space $\mathbb R^d$. Consider a prior distribution $\pi$ on $\Theta$. Let $\nu$ be a probability distribution on the $m$-dimensional simplex $\Delta_m=\{p=(p_i)_{i=1}^m : \sum_{i=1}^m p_i=1 \text{ and } p_i\ge 0 \text{ for } i\le m\}$ (where $m\in\mathbb N\cup\{\infty\}$). Let
$$\begin{aligned}
p=(p_i)_{i=1}^m &\sim \nu\\
\theta=(\theta_i)_{i=1}^m &\stackrel{iid}{\sim} \pi\\
x=(x_1,\ldots,x_n)\mid p,\theta &\stackrel{iid}{\sim} \textstyle\sum_{i=1}^m p_iG_{\theta_i}.
\end{aligned} \qquad (3.1)$$
This is a Bayesian Mixture Model. If $G_\theta$ is a Gaussian distribution for all $\theta\in\Theta$, we say that (3.1) defines a
Bayesian Mixture of Gaussians. In this case a convenient choice of the parameter space is $\Theta=\mathbb R^d\times S^+_d$, where $S^+_d$ is the space of positive definite $d\times d$ matrices. Then for $\theta=(\mu,\Lambda)$ the distribution $G_\theta$ is the multivariate normal distribution $\mathcal N(\mu,\Lambda)$. A conjugate prior distribution $\pi$ on $\Theta$ is the Normal-inverse-Wishart distribution, which is given by
$$\Lambda\sim\mathcal W^{-1}(\eta+d+1,\ \eta\Sigma_0), \qquad \mu\mid\Lambda\sim\mathcal N(\mu_0,\Lambda/\kappa). \qquad (3.2)$$
Here $\mathcal W^{-1}$ denotes the inverse Wishart distribution and the hyperparameters are $\kappa,\eta>0$, $\mu_0\in\mathbb R^d$ and $\Sigma_0\in S^+_d$. This prior is listed in Gelman et al. [2013] with slightly different hyperparameters, but we made this modification to obtain
$$\mathbb E\,\Lambda=\Sigma_0, \qquad \mathbb V(\mu)=\mathbb E\,\mathbb V(\mu\mid\Lambda)+\mathbb V\,\mathbb E(\mu\mid\Lambda)=\mathbb E\,\Lambda/\kappa+\mathbb V(\mu_0)=\Sigma_0/\kappa, \qquad (3.3)$$
which gives a nice interpretation of the hyperparameters.

Formula (3.1) can model data clustering; clusters are defined by deciding which $G_{\theta_i}$ generated a given data point. In order to formally define the clusters, we need to rewrite (3.1) as
$$\begin{aligned}
p=(p_i)_{i=1}^m &\sim \nu\\
\theta=(\theta_i)_{i=1}^m &\stackrel{iid}{\sim} \pi\\
\phi=(\phi_1,\ldots,\phi_n)\mid p,\theta &\stackrel{iid}{\sim} \textstyle\sum_{i=1}^m p_i\delta_{\theta_i}\\
x_i\mid p,\theta,\phi &\sim G_{\phi_i} \quad\text{independently for all } i\le n.
\end{aligned} \qquad (3.4)$$
Then the clusters are the equivalence classes of the relation $i\sim j \iff \phi_i=\phi_j$. In this way the distribution $\nu$ on the $m$-dimensional simplex generates a probability distribution $P_{\nu,n}$ on the partitions of the set $[n]$ into at most $m$ subsets.
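As a side check of the hyperparameter interpretation (3.3), the following sketch (ours; the dimension and hyperparameter values are arbitrary assumptions) samples from the prior (3.2) with SciPy and verifies that the Monte Carlo mean of $\Lambda$ and the covariance of $\mu$ are close to $\Sigma_0$ and $\Sigma_0/\kappa$ respectively.

```python
import numpy as np
from scipy.stats import invwishart

d, eta, kappa = 2, 5.0, 2.0
Sigma0 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
mu0 = np.zeros(d)

rng = np.random.default_rng(0)
# Lambda ~ W^{-1}(eta + d + 1, eta * Sigma0), as in (3.2)
Lams = invwishart.rvs(df=eta + d + 1, scale=eta * Sigma0, size=10000, random_state=rng)
print(Lams.mean(axis=0))               # should be close to Sigma0, since E(Lambda) = Sigma0
# mu | Lambda ~ N(mu0, Lambda / kappa); marginally Var(mu) = Sigma0 / kappa by (3.3)
mus = np.array([rng.multivariate_normal(mu0, L / kappa) for L in Lams[:4000]])
print(np.cov(mus, rowvar=False))       # should be close to Sigma0 / kappa
```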
Example 3.1. Let $V_1,V_2,\ldots\stackrel{iid}{\sim}\mathrm{Beta}(1,\alpha)$, $p_1=V_1$, $p_k=V_k\prod_{i=1}^{k-1}(1-V_i)$ for $k>1$, and let $\nu$ be the distribution of $p=(p_1,p_2,\ldots)$. The probability on the space of partitions of $[n]$ that $\nu$ generates is the Generalized Pólya Urn Scheme (Blackwell et al. [1973]), also known as the Chinese Restaurant Process (Aldous [1985]), with the probability weights given by
$$P_{\nu,n}(\mathcal I)=\frac{\alpha^{|\mathcal I|}}{\alpha^{(n)}}\prod_{I\in\mathcal I}(|I|-1)!, \qquad (3.5)$$
where $\alpha^{(n)}=\alpha(\alpha+1)\cdots(\alpha+n-1)$.
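The weights (3.5) are easy to work with numerically. The sketch below (ours; the parameter values are arbitrary) samples a partition of $[n]$ sequentially from the Chinese Restaurant Process and evaluates the log of (3.5) for it, using $\alpha^{(n)}=\Gamma(\alpha+n)/\Gamma(\alpha)$.

```python
import numpy as np
from math import lgamma, exp

def crp_sample(n, alpha, rng):
    """Sample a partition of [n] from the Chinese Restaurant Process with parameter alpha."""
    clusters = []                                   # list of lists of indices
    for i in range(n):
        sizes = np.array([len(c) for c in clusters], dtype=float)
        probs = np.append(sizes, alpha) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(clusters):
            clusters.append([i])                    # open a new cluster
        else:
            clusters[k].append(i)                   # join an existing cluster
    return clusters

def log_crp_prob(clusters, n, alpha):
    """log P_{nu,n}(I) from (3.5): alpha^{|I|} / alpha^{(n)} * prod_I (|I| - 1)!."""
    log_rising = lgamma(alpha + n) - lgamma(alpha)  # log of alpha (alpha+1) ... (alpha+n-1)
    return (len(clusters) * np.log(alpha) - log_rising
            + sum(lgamma(len(c)) for c in clusters))   # lgamma(m) = log((m-1)!)

rng = np.random.default_rng(0)
part = crp_sample(20, alpha=1.5, rng=rng)
print([len(c) for c in part], exp(log_crp_prob(part, 20, 1.5)))
```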
Lemma 3.2. Let $\nu$ be a probability distribution on $\Delta_m$ that generates a probability $P_{\nu,n}$ on the partitions of $[n]$. Then for every partition $\mathcal I$ of $[n]$
$$P_{\nu,n}(\mathcal I)=\int_{\Delta_m}\sum_{\psi:\mathcal I\hookrightarrow[m]}\prod_{I\in\mathcal I}p_{\psi(I)}^{|I|}\,\mathrm d\nu(p), \qquad (3.6)$$
where the ''middle sum'' ranges over all injective functions from $\mathcal I$ to $[m]$ (with the convention $[\infty]=\mathbb N$).

Proof. If $|\mathcal I|>m$ then both sides of (3.6) are 0. We now assume that $|\mathcal I|\le m$. Let us go back to (3.4) and suppose that the weights $p=(p_i)_{i=1}^m$ and the atoms $\theta=(\theta_i)_{i=1}^m$ are fixed. We need to know the probability that $\phi=(\phi_1,\ldots,\phi_n)\mid p,\theta\stackrel{iid}{\sim}\sum_{i=1}^m p_i\delta_{\theta_i}$ induces the partition $\mathcal I$. This would mean that for every $I\in\mathcal I$ all the values $\phi_i$ for $i\in I$ are equal to $\theta_j$ for some $j\le m$; let $j=\psi(I)$. The values $\psi(I)$ must be different for different $I\in\mathcal I$, otherwise $\mathcal I$ would not be generated. The probability of a sequence $(\phi_1,\ldots,\phi_n)$ where $\phi_i=\theta_{\psi(I)}$ for $i\in I$ is equal to $\prod_{I\in\mathcal I}p_{\psi(I)}^{|I|}$. Since any injective assignment of clusters to atoms is valid, for fixed $p$ the probability of $\mathcal I$ is equal to $\sum_{\psi:\mathcal I\hookrightarrow[m]}\prod_{I\in\mathcal I}p_{\psi(I)}^{|I|}$. Since $p\sim\nu$ is random, we have to integrate it out, and (3.6) follows.

Let $P_{\nu,n}$ be the probability distribution on the space of partitions generated by $\nu$. We can formulate (3.1) as follows: first we generate the partition of the observations into clusters, and then for every cluster we sample the actual observations from the relevant marginal distribution. Formally, (3.1) is equivalent to
$$\begin{aligned}
\mathcal I&\sim P_{\nu,n}\\
x_I:=(x_i)_{i\in I}\mid\mathcal I&\sim f_{|I|}\quad\text{independently for all } I\in\mathcal I,
\end{aligned} \qquad (3.7)$$
where for $\theta\sim\pi$, $k\in\mathbb N$ and $u=(u_1,\ldots,u_k)\mid\theta\stackrel{iid}{\sim}G_\theta$, $f_k$ is the marginal density of $u$, i.e.
$$f_k(u_1,\ldots,u_k):=\int_\Theta\pi(\theta)\prod_{i=1}^kg_\theta(u_i)\,\mathrm d\theta \qquad (3.8)$$
($g_\theta$ is the density of $G_\theta$). We stress the fact that the independent sampling on the 'lower' level of (3.7) relates to the independence between clusters (conditioned on the random partition); within one cluster the observations are (marginally) dependent. To make the notation more concise we define
$$f(x\mid\mathcal I):=\prod_{I\in\mathcal I}f_{|I|}(x_I). \qquad (3.9)$$
Then (3.7) becomes
$$\mathcal I\sim P_{\nu,n}, \qquad x\mid\mathcal I\sim f(\,\cdot\mid\mathcal I). \qquad (3.10)$$
The further analysis requires the exact formula for $f_k$; in our case it is straightforward to compute, since $\pi$ and $G_\theta$ are conjugate. We state the result here for the reader's convenience.

Proposition 1. Let $\theta=(\mu,\Lambda)$ have the distribution given by (3.2) and let $u=(u_1,\ldots,u_k)\mid\theta\stackrel{iid}{\sim}\mathcal N(\mu,\Lambda)$. Then the marginal distribution of $u$ is given by
$$f_k(u)=\frac{|\eta\Sigma_0|^{\nu_0/2}\,\kappa^{d/2}\,\Gamma_d\!\big(\tfrac{\nu_k}{2}\big)}{\pi^{dk/2}\,\kappa_k^{d/2}\,\Gamma_d\!\big(\tfrac{\nu_0}{2}\big)}\cdot\det\big(\Sigma(u)\big)^{-\nu_k/2}, \qquad (3.11)$$
where $\Gamma_d$ is the multivariate Gamma function and
$$\nu_k=\eta+d+1+k,\qquad \kappa_k=\kappa+k, \qquad\text{and} \qquad (3.12)$$
$$\Sigma(u)=\eta\Sigma_0+\sum_{i=1}^k(u_i-\bar u)(u_i-\bar u)^t+\frac{\kappa k}{\kappa+k}(\bar u-\mu_0)(\bar u-\mu_0)^t. \qquad (3.13)$$

Proof.
The proof follows from Murphy [2007], equation (266).
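For readers who want to compute with (3.11), here is a sketch (ours; the variable names and test values are arbitrary assumptions) of the log marginal likelihood $\ln f_k(u)$, using SciPy's multivariate log-gamma; in low dimensions it can be sanity-checked against a Monte Carlo estimate of (3.8).

```python
import numpy as np
from scipy.special import multigammaln

def log_f_k(u, eta, kappa, mu0, Sigma0):
    """log f_k(u) from (3.11)-(3.13), for u of shape (k, d)."""
    u = np.atleast_2d(np.asarray(u, float))
    k, d = u.shape
    nu0, nuk, kapk = eta + d + 1, eta + d + 1 + k, kappa + k
    ubar = u.mean(axis=0)
    S = (u - ubar).T @ (u - ubar)
    Sigma_u = eta * Sigma0 + S + (kappa * k / (kappa + k)) * np.outer(ubar - mu0, ubar - mu0)
    _, logdet0 = np.linalg.slogdet(eta * Sigma0)
    _, logdetu = np.linalg.slogdet(Sigma_u)
    return (0.5 * nu0 * logdet0 + 0.5 * d * np.log(kappa)
            + multigammaln(nuk / 2.0, d)
            - 0.5 * d * k * np.log(np.pi) - 0.5 * d * np.log(kapk)
            - multigammaln(nu0 / 2.0, d)
            - 0.5 * nuk * logdetu)

# tiny example: three points in R^2 under arbitrary hyperparameters
u = np.array([[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]])
print(log_f_k(u, eta=3.0, kappa=1.0, mu0=np.zeros(2), Sigma0=np.eye(2)))
```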
Throughout this section $P$ is some fixed probability distribution on $\mathbb R^d$.

Definition 3.3.
We say that a family $\mathcal A$ of $P$-measurable subsets of $\mathbb R^d$ is a $P$-partition if
• $P\big(\bigcup_{A\in\mathcal A}A\big)=1$,
• $P(A_1\cap A_2)=0$ for all $A_1,A_2\in\mathcal A$, $A_1\neq A_2$.

Notation.
Let $\mathcal A$ be a $P$-partition of the observation space. Let $X_1,X_2,\ldots\stackrel{iid}{\sim}P$ and for $n\in\mathbb N$ let $\mathcal I^{\mathcal A}_n=\{J^A_n : A\in\mathcal A\}$, where $J^A_n=\{i\le n : X_i\in A\}$ (if $J^A_n=\emptyset$, we do not include it in $\mathcal I^{\mathcal A}_n$). We say that $\mathcal I^{\mathcal A}_n$ is induced by $\mathcal A$.

Proposition 2.
Let $\mathcal A$ be a $P$-partition of the observation space. Then $\mathcal I^{\mathcal A}_n$ is almost surely a partition of $[n]$.

Proof. The proof is straightforward and therefore omitted.

Let $E_P(A)=\mathbb E_P(X\mid X\in A)$ and $V_P(A)=\mathrm{Var}_P(X\mid X\in A)$, where $X\sim P$; that is, $E_P(A)$ is the conditional expected value and $V_P(A)$ is the conditional covariance matrix of $X$ given the event $X\in A$. For a family $\mathcal A$ of sets with positive $P$-measure let
$$\mathcal V_P(\mathcal A)=\sum_{A\in\mathcal A}P(A)\ln|V_P(A)|, \qquad \mathcal H_P(\mathcal A)=-\sum_{A\in\mathcal A}P(A)\ln P(A), \qquad (3.14)$$
where $|\cdot|$ denotes the determinant. Let
$$\Delta_P(\mathcal A)=-\tfrac12\,\mathcal V_P(\mathcal A)-\mathcal H_P(\mathcal A). \qquad (3.15)$$
It turns out that (3.15) is essentially (modulo an additive constant) the first order approximation to $\frac1n$ times the logarithm of the unnormalised posterior probability, in the Bayesian Mixture Model, of the data clustering defined by $\mathcal A$, when the data comes as an iid sample from $P$.

Proposition 3.
$$\sqrt[n]{P_{\nu,n}(\mathcal I^{\mathcal A}_n)\cdot f(X_n\mid\mathcal I^{\mathcal A}_n)}\approx(2e)^{-d/2}\exp\{\Delta_P(\mathcal A)\}, \quad\text{where}\quad \Delta_P(\mathcal A)=-\frac12\sum_{A\in\mathcal A}P(A)\ln|V_P(A)|+\sum_{A\in\mathcal A}P(A)\ln P(A). \qquad (3.16)$$

Proof.
The result follows from Proposition 4 and Proposition 5.

It should be noted that Proposition 3 does not depend on the form of the prior on probability measures. This prior is responsible for the 'entropy' part of (3.16).

The final goal is not to score partitions of the observation space but clusterings of the data. A natural idea is to replace the distribution $P$ in (3.15) by its empirical counterpart. Let $\hat P_n=\frac1n\sum_{i\le n}\delta_{x_i}$ be the empirical probability measure of $x$. This is how $\mathcal D_0$ is obtained. The function $\mathcal D_0$ would not be a good score statistic, because if $\mathcal J$ contains a cluster $J$ of size less than $d$, then $\sum_{j\in J}(x_j-\bar x_J)(x_j-\bar x_J)^t$ is singular and hence $\mathcal D_0(x,\mathcal J)=\infty$. To circumvent this, one could add some positive definite matrix to the within-group covariance matrix; in this way the relevant determinant will always be greater than zero. Since we would like to avoid any arbitrary constants in the score function, a natural idea is to use the covariance matrix of the whole dataset, $\hat V_x=\frac1n\sum_{i\le n}(x_i-\bar x_{[n]})(x_i-\bar x_{[n]})^t$. This operation is also motivated by considering the adaptive model, where the strength of the prior distribution increases linearly with the number of observations; the details of this approach are given in Section 4. On the other hand, we do not want this modification to affect the score significantly when the clusters are large and the empirical covariance matrices are good estimates of the theoretical ones. Therefore we decrease the importance of the modification linearly with the cluster size. This gives (1.1), which is a well defined score statistic.
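To see the problem concretely (a toy sketch of ours, not from the paper): for a singleton cluster the within-cluster scatter matrix is the zero matrix, so the log-determinant appearing in $\mathcal D_0$ is $-\infty$ and the score blows up, while the term $\hat V_x/|I|$ in (1.1) keeps it finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
V_full = np.cov(x, rowvar=False, bias=True)          # \hat V_x of the whole dataset

def cluster_logdet(block, ridge):
    c = block - block.mean(axis=0)
    sign, logdet = np.linalg.slogdet(ridge + c.T @ c / len(block))
    return logdet if sign > 0 else -np.inf

singleton = x[:1]                                    # a cluster of size 1 < d
print(cluster_logdet(singleton, np.zeros((2, 2))))   # -inf: D_0 assigns an infinite score
print(cluster_logdet(singleton, V_full / 1))         # finite: the \hat V_x / |I| term of (1.1)
```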
Proposition 4. Let $P$ be a probability distribution on $\mathbb R^d$ and let $\mathcal A$ be a finite $P$-partition of the observation space. Then
$$\lim_{n\to\infty}\sqrt[n]{f(X_n\mid\mathcal I^{\mathcal A}_n)} \stackrel{a.s.}{=} (2e)^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}.$$

Before we present the proof of Proposition 4, we formulate an auxiliary lemma that concerns the asymptotics of the function $\Gamma_d$.

Notation.
If $(a_n)_{n=1}^{\infty}$ and $(b_n)_{n=1}^{\infty}$ are real sequences, we write $a_n\approx b_n$ if $\lim_{n\to\infty}\frac{a_n}{b_n}=1$, and $a_n=o(b_n)$ if $\lim_{n\to\infty}\frac{a_n}{b_n}=0$. Similarly, if $a,b:\mathbb R\to\mathbb R$ are real functions, we write $a(x)\approx b(x)$ if $\lim_{x\to\infty}\frac{a(x)}{b(x)}=1$ and $a(x)=o\big(b(x)\big)$ if $\lim_{x\to\infty}\frac{a(x)}{b(x)}=0$.

Lemma 3.4.
Let $\alpha,\beta,a,b>0$. If $a_n\approx\alpha n^a$ and $b_n-\beta=o(n^{-b})$, then $a_n^{b_n}\approx(\alpha n^a)^\beta$.

Proof. For sufficiently large $n$ we have $1<a_n<2\alpha n^a$ and $-n^{-b}<b_n-\beta<n^{-b}$, hence
$$(2\alpha n^a)^{-n^{-b}}<a_n^{-n^{-b}}<a_n^{b_n-\beta}<a_n^{n^{-b}}<(2\alpha n^a)^{n^{-b}}. \qquad (3.17)$$
The left- and right-hand sides of (3.17) converge to 1, so $\lim_{n\to\infty}a_n^{b_n-\beta}=1$. The proof follows from
$$\frac{a_n^{b_n}}{(\alpha n^a)^\beta}=\Big(\frac{a_n}{\alpha n^a}\Big)^{\beta}a_n^{b_n-\beta}.$$

Lemma 3.5. If $x_n\approx\lambda n$ and $x_n/n-\lambda=o(n^{-a})$ for some $a>0$, then $\sqrt[n]{\Gamma_d(x_n)}\approx\big(\tfrac{\lambda n}{e}\big)^{\lambda d}$.

Proof. Recall Stirling's formula: $\Gamma(x)\approx\sqrt{2\pi/x}\,\big(\tfrac xe\big)^x$. It follows from Lemma 3.4 that
$$\sqrt[n]{\Gamma(x_n)}\approx\Big(\sqrt{2\pi/x_n}\,\Big(\frac{x_n}{e}\Big)^{x_n}\Big)^{1/n}=(2\pi/x_n)^{1/(2n)}\Big(\frac{x_n}{e}\Big)^{x_n/n}\approx\Big(\frac{\lambda n}{e}\Big)^{\lambda}, \qquad (3.18)$$
since $(2\pi/x_n)^{1/(2n)}\approx 1$.
Note that for fixed $t>0$ we also have $(x_n-t)\approx\lambda n$, and as a result
$$\sqrt[n]{\Gamma_d(x_n)}=\sqrt[n]{\pi^{d(d-1)/4}}\,\prod_{j=1}^d\sqrt[n]{\Gamma\Big(x_n-\frac{j-1}{2}\Big)}\approx\Big(\frac{\lambda n}{e}\Big)^{\lambda d}. \qquad (3.19)$$

Proof of Proposition 4. Note that $|J^A_n|$ is a random variable with distribution $\mathrm{Bin}(n,P(A))$ for all $A\in\mathcal A$. By the Law of the Iterated Logarithm, almost surely $\big(|J^A_n|/n-P(A)\big)=o(n^{-1/2+\varepsilon})$ for any $\varepsilon>0$, so by Lemma 3.5
$$\sqrt[n]{\Gamma_d\Big(\frac{|J^A_n|+\eta+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\frac{P(A)}{2}\cdot\frac ne\Big)^{P(A)d/2}. \qquad (3.20)$$
Because $\mathcal A$ is finite and $\sum_{A\in\mathcal A}P(A)=1$, this means that
$$\sqrt[n]{\prod_{A\in\mathcal A}\Gamma_d\Big(\frac{|J^A_n|+\eta+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{d/2}. \qquad (3.21)$$
By the strong law of large numbers we have
$$\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t\stackrel{a.s.}{\approx}V_P(A)\quad\text{for } A\in\mathcal A, \qquad (3.22)$$
and hence, by (3.13), for $A\in\mathcal A$
$$\frac{\big|\Sigma(X_{J^A_n})\big|}{|J^A_n|^d}=\Big|\frac{\eta\Sigma_0}{|J^A_n|}+\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t+\frac{\kappa}{\kappa+|J^A_n|}(\bar x_A-\mu_0)(\bar x_A-\mu_0)^t\Big|\stackrel{a.s.}{\approx}\Big|\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t\Big|\stackrel{a.s.}{\approx}|V_P(A)|. \qquad (3.23)$$
Hence $\big|\Sigma(X_{J^A_n})\big|\stackrel{a.s.}{\approx}|J^A_n|^d\,|V_P(A)|\stackrel{a.s.}{\approx}n^dP(A)^d\,|V_P(A)|$. Using the Law of the Iterated Logarithm and Lemma 3.4 again we get
$$\sqrt[n]{\big|\Sigma(X_{J^A_n})\big|^{-(|J^A_n|+\eta+d+1)/2}}\approx\big(P(A)^{P(A)}\big)^{-d/2}\,n^{-dP(A)/2}\,|V_P(A)|^{-P(A)/2}, \qquad (3.24)$$
which means that
$$\sqrt[n]{\prod_{A\in\mathcal A}\big|\Sigma(X_{J^A_n})\big|^{-(|J^A_n|+\eta+d+1)/2}}\approx\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{-d/2}n^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}, \qquad (3.25)$$
and therefore
$$\sqrt[n]{f(X_n\mid\mathcal I^{\mathcal A}_n)}\stackrel{a.s.}{\approx}\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{d/2}\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{-d/2}n^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}=(2e)^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}. \qquad (3.26)$$
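A quick numerical illustration of the $\Gamma_d$ asymptotics used above (Lemma 3.5; our own sketch, with the arbitrary choices $\lambda=0.3$, $d=2$): $\frac1n\ln\Gamma_d(\lambda n)$ should approach $\lambda d\,\ln(\lambda n/e)$ as $n$ grows.

```python
import numpy as np
from scipy.special import multigammaln

lam, d = 0.3, 2
for n in (10**2, 10**3, 10**4, 10**5):
    lhs = multigammaln(lam * n, d) / n          # (1/n) * ln Gamma_d(x_n) with x_n = lam * n
    rhs = lam * d * np.log(lam * n / np.e)      # log of the limiting expression (lam * n / e)^(lam * d)
    print(n, lhs, rhs, lhs - rhs)               # the difference should shrink towards 0
```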
Proposition 5. Let $P$ be a probability distribution on $\mathbb R^d$ and let $\mathcal A$ be a finite $P$-partition of the observation space. Let $P_{\nu,n}$ be a probability distribution on the partitions of $[n]$, generated by the probability distribution $\nu$ on $\Delta_\infty$. Then $\lim_{n\to\infty}\sqrt[n]{P_{\nu,n}(\mathcal I^{\mathcal A}_n)}\stackrel{a.s.}{=}\prod_{A\in\mathcal A}P(A)^{P(A)}$.

Proof. The proof is a direct consequence of the Law of Large Numbers and Theorem 3.8.

By (3.15), $\Delta_P$ consists of two components: $\mathcal V_P$ and $\mathcal H_P$. These two behave differently when two clusters are joined; the variance component increases whereas the entropy component decreases.

Proposition 6. Let $\mathcal A$ be a partition of $\mathbb R^d$ and let $A,B\in\mathcal A$. Let $\mathcal C$ be the partition obtained from $\mathcal A$ by joining $A$ and $B$, i.e. $\mathcal C=\mathcal A\cup\{A\cup B\}\setminus\{A,B\}$. Then
(a) $\mathcal H_P(\mathcal A)\ge\mathcal H_P(\mathcal C)$,
(b) $\mathcal V_P(\mathcal A)\le\mathcal V_P(\mathcal C)$.

Proof. Let $C=A\sqcup B$. Part (a):
$$P(A)\ln P(A)+P(B)\ln P(B)-P(C)\ln P(C)=P(A)\ln\frac{P(A)}{P(C)}+P(B)\ln\frac{P(B)}{P(C)}\le 0,$$
since $P(A),P(B)\le P(C)$.

Lemma 3.6.
Let $A\cap B=\emptyset$ and $C:=A\cup B$. Then
$$P(A)V_P(A)+P(B)V_P(B)\preceq P(C)V_P(C), \qquad (3.28)$$
where $\preceq$ is the Löwner partial order, i.e. $M_1\preceq M_2$ iff $M_2-M_1$ is non-negative definite.

Proof. Let $e_1(A)=\mathbb E\,X\mathbb 1_A(X)$ and $e_2(A)=\mathbb E\,XX^t\mathbb 1_A(X)$, where $X\sim P$. Then
$$V_P(A)=\frac{e_2(A)}{P(A)}-\frac{e_1(A)e_1(A)^t}{P(A)^2}. \qquad (3.29)$$
Note that the functions $P,e_1,e_2$ are additive, hence
$$\begin{aligned}
P(C)V_P(C)&-P(A)V_P(A)-P(B)V_P(B)\\
&=\Big(e_2(C)-\frac{e_1(C)e_1(C)^t}{P(C)}\Big)-\Big(e_2(A)-\frac{e_1(A)e_1(A)^t}{P(A)}\Big)-\Big(e_2(B)-\frac{e_1(B)e_1(B)^t}{P(B)}\Big)\\
&=\frac{e_1(A)e_1(A)^t}{P(A)}+\frac{e_1(B)e_1(B)^t}{P(B)}-\frac{e_1(C)e_1(C)^t}{P(C)}\\
&=\frac{e_1(A)e_1(A)^t}{P(A)}+\frac{e_1(B)e_1(B)^t}{P(B)}-\frac{\big(e_1(A)+e_1(B)\big)\big(e_1(A)+e_1(B)\big)^t}{P(A)+P(B)}\\
&=\frac{1}{P(A)P(B)\big(P(A)+P(B)\big)}\big(e_1(A)P(B)-e_1(B)P(A)\big)\big(e_1(A)P(B)-e_1(B)P(A)\big)^t.
\end{aligned} \qquad (3.30)$$
The last matrix in (3.30) is clearly non-negative definite and the proof follows.
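A small numerical sanity check of (3.28) (our own sketch, taking $P$ to be the empirical measure of a random sample and two arbitrary disjoint events): the matrix $P(C)V_P(C)-P(A)V_P(A)-P(B)V_P(B)$ should have only non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
in_A = x[:, 0] < -0.5             # two disjoint events A and B, with C = A ∪ B
in_B = x[:, 0] > 0.5
in_C = in_A | in_B

def p_times_v(mask):
    """P(A) * V_P(A) for the empirical measure of the sample."""
    block = x[mask]
    c = block - block.mean(axis=0)
    return mask.mean() * (c.T @ c / len(block))

diff = p_times_v(in_C) - p_times_v(in_A) - p_times_v(in_B)
print(np.linalg.eigvalsh(diff))   # all eigenvalues should be >= 0 (up to rounding)
```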
Theorem 3.7 (Theorem 2.4.4 in Horn et al. [1990]). The function $\ln\det(\cdot)$ is concave on the space of positive definite matrices.

Proof of Proposition 6, part (b):
$$\frac{P(A)}{P(C)}\ln|V_P(A)|+\frac{P(B)}{P(C)}\ln|V_P(B)|\ \stackrel{\text{Thm 3.7}}{\le}\ \ln\Big|\frac{P(A)}{P(C)}V_P(A)+\frac{P(B)}{P(C)}V_P(B)\Big|\ \stackrel{\text{Lemma 3.6}}{\le}\ \ln|V_P(C)|, \qquad (3.31)$$
and the proof follows.

Theorem 3.8. Let $P_{\nu,n}$ be a probability distribution on the partitions of $[n]$, generated by the probability distribution $\nu$ on $\Delta_\infty$. Fix $K\in\mathbb N$ and consider a sequence of partitions $(\mathcal I_n)_{n\in\mathbb N}$, where $\mathcal I_n=\{I_{n,1},\ldots,I_{n,K}\}$ is a partition of $[n]$ (it is possible that $I_{n,i}=\emptyset$ for some $i\le K$). Assume that $|I_{n,k}|/n\to\alpha_k>0$ for $k\le K$. Then
$$\lim_{n\to\infty}\sqrt[n]{P_{\nu,n}(\mathcal I_n)}=\prod_{k=1}^K\alpha_k^{\alpha_k}. \qquad (3.32)$$

Proof.
Firstly note that for sufficiently large $n$ we have $|I_{n,k}|\ge 1$ for $k\le K$. Then in (3.6) we sum functions that depend on exactly $K$ coordinates of $p$. Hence we can express (3.6) in the form of an integral over the $K$-dimensional set $N_K=\{(p_1,\ldots,p_K):\sum_{k=1}^Kp_k\le 1,\ p_k\in(0,1]\text{ for all }k\le K\}$ as
$$P_{\nu,n}(\mathcal I_n)=\int_{N_K}\prod_{k=1}^Kp_k^{|I_{n,k}|}\,\mathrm d\nu_K(p), \qquad (3.33)$$
where $\nu_K$ is a measure on $N_K$ defined by
$$\nu_K(A)=\sum_{\psi:[K]\hookrightarrow\mathbb N}\nu\big((p_{\psi(1)},p_{\psi(2)},\ldots,p_{\psi(K)})\in A\big) \qquad (3.34)$$
for $A\subset N_K$, where $[K]=\{1,2,\ldots,K\}$. Hence
$$\sqrt[n]{P_{\nu,n}(\mathcal I_n)}=\sqrt[n]{\int_{N_K}\prod_{k=1}^Kp_k^{|I_{n,k}|}\,\mathrm d\nu_K(p)}=\|g_n\|_n, \qquad (3.35)$$
where $g_n(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k^{|I_{n,k}|/n}$ and $\|\cdot\|_n$ is the norm in the space $L^n(N_K,\nu_K)$.

Since $\nu_K$ is not a finite measure on $N_K$, in the remaining part of the proof we have to be careful that the functions we are considering belong to the space $L^n(N_K,\nu_K)$ for sufficiently large $n$.

Let $g(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k^{\alpha_k}$ and let $h(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k$. Note that
$$\int_{N_K}h(p)\,\mathrm d\nu_K(p)=P_{\nu,K}\Big(\big\{\{1\},\{2\},\ldots,\{K\}\big\}\Big)\le 1. \qquad (3.36)$$
Moreover, for $n>1/\min_k\alpha_k$ we have $g^n(p)\le h(p)$ and therefore $g\in L^n(N_K,\nu_K)$ for $n>1/\min_k\alpha_k$. Because $g$ is bounded by 1 we get
$$\|g\|_n\to\|g\|_\infty=\sup_{N_K}g=\prod_{k\le K}\alpha_k^{\alpha_k} \qquad (3.37)$$
(the fact that $\|g\|_\infty=\sup_{N_K}g=\prod_{k\le K}\alpha_k^{\alpha_k}$ follows easily from applying Lagrange multipliers; see Lemma 3.9 below).

We now prove that $\|g_n-g\|_n\to 0$.
This is not a straightforward consequence of the pointwise convergence of $g_n$ to $g$, since $\nu_K$ is not a finite measure on $N_K$. Clearly $|I_{n,k}|/n-\alpha_k/2\to\alpha_k/2>0$ for every $k\le K$, and hence $\|g_ng^{-1/2}-g^{1/2}\|_\infty\to 0$ on $N_K$. Let $N\in\mathbb N$ be chosen so that for $n>N$ we have $\|g_ng^{-1/2}-g^{1/2}\|_\infty<\varepsilon$ and $n\alpha_k\ge 2$ for $k\le K$. Then for $n>N$
$$\|g_n-g\|_n^n=\int_{N_K}|g_n-g|^n\,\mathrm d\nu_K(p)=\int_{N_K}\big|g_ng^{-1/2}-g^{1/2}\big|^n\,g^{n/2}\,\mathrm d\nu_K(p)\le\varepsilon^n\int_{N_K}g^{n/2}\,\mathrm d\nu_K(p)\le\varepsilon^n\int_{N_K}h\,\mathrm d\nu_K(p)\le\varepsilon^n, \qquad (3.38)$$
hence $\|g_n-g\|_n\to 0$.
The result now follows from the triangle inequality
$$\big|\,\|g_n\|_n-\|g\|_\infty\big|\le\big|\,\|g_n\|_n-\|g\|_n\big|+\big|\,\|g\|_n-\|g\|_\infty\big|\le\|g_n-g\|_n+\big|\,\|g\|_n-\|g\|_\infty\big|. \qquad (3.39)$$

Lemma 3.9. Let $\alpha_i>0$ for $i\le K$ and $\sum_{i=1}^K\alpha_i=1$. Let $g(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k^{\alpha_k}$. Then $\sup_{N_K}g=\prod_{k\le K}\alpha_k^{\alpha_k}$.

Proof. As $\alpha_i>0$ for $i\le K$, the function $g$ is continuous and, because the closure of $N_K$ is compact in $\mathbb R^K$ and $g$ extends continuously to it (by $0$ where a coordinate vanishes), the supremum $\sup_{N_K}g>0$ is attained at some $\hat p=(\hat p_1,\ldots,\hat p_K)\in N_K$. Clearly $\hat p\in\Delta_K$.
Indeed, otherwise $s=\sum_{i=1}^K\hat p_i<1$, $\hat p/s\in N_K$ and $g(\hat p/s)=g(\hat p)/s>g(\hat p)$, which contradicts the definition of $\hat p$. Since $g$ is nonnegative on $\Delta_K$ and equal to $0$ on the boundary of $\Delta_K$, we know that $\hat p$ is in the interior of $\Delta_K$. The function $g$ is positive on the interior of $\Delta_K$, so by considering the function $\ln g$ and using Lagrange multipliers we get that $\hat p$ satisfies
$$0=(\alpha_i\ln p_i)'+\lambda=\frac{\alpha_i}{p_i}+\lambda \qquad (3.40)$$
for $i\le K$ and some $\lambda\in\mathbb R$. Hence the $p_i$'s are proportional to the $\alpha_i$'s and, because $\sum_{i=1}^K\alpha_i=1$, we get that $\hat p_i=\alpha_i$ and the proof follows.

4 The adaptive model

We now allow the parameters of the model (3.2) to change with the number of observations. More precisely, we substitute $\eta_n:=\lambda n$, so that the expected value of the within-group covariance matrix stays fixed while its prior distribution becomes increasingly concentrated on $\Sigma_0$. We investigate the limit formula for the posterior as $n$ goes to infinity. Note that in this case $\Sigma(X_{J^A_n})/|J^A_n|\to\frac{\lambda}{P(A)}\Sigma_0+V_P(A)$. The prior becomes
$$\Lambda\sim\mathcal W^{-1}(\eta_n+d+1,\ \eta_n\Sigma_0), \qquad \mu\mid\Lambda\sim\mathcal N(\mu_0,\Lambda/\kappa). \qquad (4.1)$$

Proposition 7.
Let $P$ be a probability distribution on $\mathbb R^d$ and let $\mathcal A$ be a finite $P$-partition of the observation space. Then
$$\sqrt[n]{f(X_n\mid\mathcal I^{\mathcal A}_n)}\stackrel{a.s.}{\approx}(2e)^{-(1+|\mathcal A|\lambda)d/2}\prod_{A\in\mathcal A}\Big|\frac{\lambda}{P(A)+\lambda}\Sigma_0+\frac{P(A)}{P(A)+\lambda}V_P(A)\Big|^{-(P(A)+\lambda)/2}. \qquad (4.2)$$

Proof.
Note that $|J^A_n|$ is a random variable with distribution $\mathrm{Bin}(n,P(A))$ for all $A\in\mathcal A$. By the Law of the Iterated Logarithm, almost surely $\big(|J^A_n|/n-P(A)\big)=o(n^{-1/2+\varepsilon})$ for any $\varepsilon>0$, so by Lemma 3.5
$$\sqrt[n]{\Gamma_d\Big(\frac{|J^A_n|+\eta_n+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\frac{P(A)+\lambda}{2}\cdot\frac ne\Big)^{(P(A)+\lambda)d/2}. \qquad (4.3)$$
Because $\mathcal A$ is finite and $\sum_{A\in\mathcal A}P(A)=1$, this means that
$$\sqrt[n]{\prod_{A\in\mathcal A}\Gamma_d\Big(\frac{|J^A_n|+\eta_n+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\prod_{A\in\mathcal A}\big(P(A)+\lambda\big)^{P(A)+\lambda}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{(1+|\mathcal A|\lambda)d/2}. \qquad (4.4)$$
By the strong law of large numbers we have
$$\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t\stackrel{a.s.}{\approx}V_P(A)\quad\text{for } A\in\mathcal A, \qquad (4.5)$$
and hence, by (3.13), for $A\in\mathcal A$
$$\frac{\big|\Sigma(X_{J^A_n})\big|}{|J^A_n|^d}=\Big|\frac{\eta_n\Sigma_0}{|J^A_n|}+\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t+\frac{\kappa}{\kappa+|J^A_n|}(\bar x_A-\mu_0)(\bar x_A-\mu_0)^t\Big|\stackrel{a.s.}{\approx}\Big|\frac{\lambda}{P(A)}\Sigma_0+V_P(A)\Big|. \qquad (4.6)$$
Hence $\big|\Sigma(X_{J^A_n})\big|\stackrel{a.s.}{\approx}n^d\big(P(A)+\lambda\big)^d\,\Big|\frac{\lambda}{P(A)+\lambda}\Sigma_0+\frac{P(A)}{P(A)+\lambda}V_P(A)\Big|$. Using the Law of the Iterated Logarithm and Lemma 3.4 again we get
$$\sqrt[n]{\big|\Sigma(X_{J^A_n})\big|^{-(|J^A_n|+\eta_n+d+1)/2}}\approx\big(n(P(A)+\lambda)\big)^{-(P(A)+\lambda)d/2}\,\Big|\frac{\lambda}{P(A)+\lambda}\Sigma_0+\frac{P(A)}{P(A)+\lambda}V_P(A)\Big|^{-(P(A)+\lambda)/2}, \qquad (4.7)$$
and (4.2) follows.

5 Conclusion

In this article we proposed a score function that can be used for choosing the number of clusters in popular clustering methods. It is derived as a limit in a Bayesian Mixture Model of Gaussians. We derived some of its properties, though some questions remain unanswered. For example, it is interesting to ask what assumptions on $P$ should be made to ensure that the supremum of possible values of the $\Delta$ function is finite.

References
David J Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1983, pages 1-198. Springer, 1985.

David Blackwell and James B MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353-355, 1973.

Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.

Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 1990.

Kevin P Murphy. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, 2007.

Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.

Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.