A score function for Bayesian cluster analysis
John Noble, Lukasz Rajkowski

May 27, 2019
Abstract
We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within-cluster variance and the between-cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or the K-means algorithm.

1 Introduction

Many clustering methods generate a family of clusterings that depend on some user-defined parameters. The most prominent example is the K-means algorithm, where the investigator has to specify the number of clusters. Similarly, in hierarchical clustering a whole family of clusterings is obtained, starting from the finest partition into singletons and ending in the coarsest clustering, i.e. a single cluster. Again, the investigator chooses the number of clusters, based on the dendrogram.

All these methods come with a variety of suggestions for how to choose the optimal number of clusters. Some of these are rather heuristic in nature, while others have deep theoretical foundations. For the K-means algorithm these include the elbow method and the average silhouette method (Rousseeuw [1987]). Another solution is to use a score statistic (a function which is intended to measure the quality of a clustering) and, among the different clusterings proposed by a given method, choose the one that maximises the score statistic. Constructing score statistics is not a trivial task; one of the most popular choices is the gap statistic (Tibshirani et al. [2001]).

In this article we propose a new score statistic. It is derived as a limit of the first order approximation to the posterior probability (up to the norming constant) in a Nonparametric Bayesian Mixture Model with the inverse Wishart distribution as a base measure for the within-group covariance matrices and the Gaussian distribution as a base measure for the cluster means, with Gaussian component measures. In order to derive the limit we assume that the data is an independent sample from some 'input' probability distribution on the observation space; this gives a method of assessing the compatibility of partitions of the observation space with the input distribution. The score function is obtained by taking the empirical measure as the input distribution and tweaking it slightly so that it is well defined on all possible data clusterings.

1.1 The score function

Our main contribution is the formulation of a novel score function for clusterings, which is motivated theoretically and performs well on the analysed datasets. Suppose that we have a sequence of observations $x_1,\ldots,x_n \in \mathbb R^d$ and we believe that it consists of several groups, within each of which the data is distributed according to some Gaussian distribution (with unknown mean and covariance matrix). The goal is to construct a simple function that measures how well a given clustering of the dataset corresponds to the assumption of being Gaussian within clusters. Our proposition is the following: for $I\subset[n]$ we define $\bar x_I = \frac{1}{|I|}\sum_{i\in I}x_i$ and $\hat V_x(I) = \frac{1}{|I|}\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t$, and for notational simplicity we denote $\hat V_x := \hat V_x([n])$. For $x=(x_1,\ldots,x_n)$ and $\mathcal I$ a partition of $[n]=\{1,2,\ldots,n\}$ let
$$\mathcal D(x,\mathcal I) := -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\det\Big(\frac{\hat V_x}{|I|}+\hat V_x(I)\Big) + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}. \qquad (1.1)$$
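To make (1.1) concrete, here is a minimal sketch in Python (ours, not part of the paper; the function name `score_D`, the synthetic data and the use of scikit-learn's K-means are illustrative assumptions) that evaluates the score for a given clustering and uses it to choose the number of clusters for K-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def score_D(x, labels):
    """Score (1.1): -1/2 sum_I (|I|/n) ln det(V_hat_x/|I| + V_hat_x(I)) + sum_I (|I|/n) ln(|I|/n)."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    V_full = np.atleast_2d(np.cov(x, rowvar=False, bias=True))   # \hat V_x, covariance of the whole sample
    score = 0.0
    for lab in np.unique(labels):
        block = x[labels == lab]
        m = len(block)
        centred = block - block.mean(axis=0)
        V_I = centred.T @ centred / m                            # \hat V_x(I), within-cluster covariance
        _, logdet = np.linalg.slogdet(V_full / m + V_I)          # slogdet for numerical stability
        score += -0.5 * (m / n) * logdet + (m / n) * np.log(m / n)
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # three well-separated Gaussian groups in R^2 (illustrative data, not from the paper)
    x = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
    for k in range(1, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        print(k, round(score_D(x, labels), 3))   # the score should typically peak at k = 3 here
```

On well-separated data of this kind one would expect the maximiser of the score over $k$ to coincide with the true number of groups.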
It should be noted that if $x$ is a realisation of an independent random sample $X_1,\ldots,X_n$ from some distribution $P$ on $\mathcal X$, then the components of formula (1.1) can be treated as empirical estimates of the relevant probabilities and conditional covariance matrices. This is in fact how (1.1) is obtained; we investigate the details in Section 3. This remark is also convenient when dealing with large datasets, where the exact computation of (1.1) could be time consuming; in such cases the variance components of (1.1) can be approximated using random subsamples of the clusters.

We start our presentation with a formal definition of a score function, intended to measure the quality of a data clustering.
Notation.
For $n\in\mathbb N$ let $[n]=\{1,\ldots,n\}$ and let $\Pi_n$ be the set of all partitions of $[n]$. Let $\mathcal X=\mathbb R^d$ be the observation space. Let $\mathcal O=\bigcup_{n=1}^{\infty}\mathcal X^n\times\Pi_n$ be the set of all possible finite sequences of observations together with their partitions, and let $\bar{\mathbb R}=\mathbb R\cup\{-\infty,\infty\}$.

Definition. A clustering score function is any function $\mathcal S:\mathcal O\to\bar{\mathbb R}$.

Definition.
Let $\mathcal S$ be a score function and let $\mathcal F$ be a family of functions from $\mathcal X$ to $\mathcal X$. We say that $\mathcal S$ is robust to $\mathcal F$ if for every $x=(x_1,\ldots,x_n)\in\mathcal X^n$, every $\mathcal I,\mathcal J\in\Pi_n$ and every $f\in\mathcal F$ we have $\mathcal S(x,\mathcal I)\le\mathcal S(x,\mathcal J)$ if and only if $\mathcal S(f(x),\mathcal I)\le\mathcal S(f(x),\mathcal J)$, where $f(x)=\big(f(x_1),\ldots,f(x_n)\big)$.

Hence robustness to $\mathcal F$ means that if we apply any function $f\in\mathcal F$ to all observations, the optimal clustering indicated by the score function does not change. If no prior knowledge about the clustering structure is available, a natural demand of a score function is that it be robust to linear isomorphisms of $\mathcal X$. In particular, it should be robust to scaling of the axes, since it would be strange if the result of applying the score function depended on the units used to measure the observations. For similar reasons, we expect a good score function to be robust to translations. Note that, on the other hand, robustness to all linear transformations would be undesirable: in particular, moving all points to the origin is a linear transformation and we do not expect any clusters to be visible after applying it.

Notation.
Let $\mathcal A$ and $\mathcal B$ be two partitions of the same set. We say that $\mathcal A$ is finer than $\mathcal B$ if for every $A\in\mathcal A$ there exists $B\in\mathcal B$ such that $A\subset B$. Equivalently, we say that $\mathcal B$ is coarser than $\mathcal A$, and we write $\mathcal A\preceq\mathcal B$.

Definition.
Let $\mathcal S$ be a clustering score function. We say that it is non-decreasing if for every $x\in\mathcal X^n$ and $\mathcal I,\mathcal J\in\Pi_n$ such that $\mathcal I\preceq\mathcal J$ we have $\mathcal S(x,\mathcal I)\le\mathcal S(x,\mathcal J)$. If $-\mathcal S$ is non-decreasing then $\mathcal S$ is non-increasing.

Clearly, no non-decreasing score function would be good for clustering purposes, as it would assign the highest score to the clustering into one full cluster, regardless of the data. Similarly, a non-increasing function gives the highest score to the partition into singletons. It seems desirable for these two tendencies to interplay, and it is theoretically appealing to identify increasing and decreasing parts in a given score function.

2.2 Properties of the $\mathcal D$ score function

Notation.
To simplify notation in the remaining part of the text, we use $|\Sigma|$ to denote the determinant of a square matrix $\Sigma$.

Definition.
With the notation presented in Section 1.1 we define
$$\mathcal D_\Sigma(x,\mathcal I) := -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}+\hat V_x(I)\Big| + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}, \qquad (2.1)$$
and then $\mathcal D(x,\mathcal I)=\mathcal D_{\hat V_x}(x,\mathcal I)$ (which is equivalent to (1.1)). Moreover, we use $\mathcal D_0$ to denote $\mathcal D_\Sigma$ with $\Sigma$ being the zero matrix.

Property 1.
Let $x_1,\ldots,x_n\in\mathcal X$ be such that $x_1,\ldots,x_n$ span $\mathcal X$. Let $x=(x_1,\ldots,x_n)$. Then $|\mathcal D(x,\mathcal I)|<\infty$ for any $\mathcal I\in\Pi_n$.

Proof. For any $v\in\mathbb R^d$,
$$v^t\Big(\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t\Big)v = \sum_{i\in I}\big(v^t(x_i-\bar x_I)\big)^2 \ge 0,$$
so $\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t$ is non-negative definite. Moreover, it follows from the assumptions that $\hat V_x$ is positive definite. The sum of a non-negative definite matrix and a positive definite matrix is positive definite, so its determinant is positive. Therefore all the summands in (1.1) are finite and the proof follows.

Property 2.
The score function $\mathcal D$ is robust to translations and linear isomorphisms.

Proof. It is easy to check that for any $x\in\mathcal X^n$, $\mathcal I\in\Pi_n$ and any translation $T$ we have $\mathcal D(x,\mathcal I)=\mathcal D\big(T(x),\mathcal I\big)$, and hence robustness to translations.

Let $L:\mathcal X\to\mathcal X$ be a linear automorphism, defined by $L(x)=Ax$, where $A$ is a $d\times d$ invertible matrix. Then
$$\begin{aligned}
\mathcal D\big(L(x),\mathcal I\big) &= -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{1}{|I|}A\hat V_xA^t+\frac{1}{|I|}\sum_{i\in I}A(x_i-\bar x_I)(x_i-\bar x_I)^tA^t\Big| + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}\\
&= -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|A\Big(\frac{\hat V_x}{|I|}+\frac{1}{|I|}\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t\Big)A^t\Big| + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}\\
&= -\frac12\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big(|A|\cdot\Big|\frac{\hat V_x}{|I|}+\frac{1}{|I|}\sum_{i\in I}(x_i-\bar x_I)(x_i-\bar x_I)^t\Big|\cdot|A^t|\Big) + \sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}\\
&= \mathcal D(x,\mathcal I) - \ln|A|,
\end{aligned} \qquad (2.3)$$
which clearly implies robustness to linear isomorphisms.

Property 3. As functions of the partition $\mathcal I$ (in the sense of the definitions above):
(a) $\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n}$ is increasing,
(b) $-\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\big|\hat V_x(I)\big|$ is decreasing,
(c) $-\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}\Big|$ is increasing.

Proof. Parts (a) and (b) follow from Proposition 6 by taking the empirical measure instead of $P$. Part (c) follows from (a), because
$$-\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\Big|\frac{\Sigma}{|I|}\Big| = d\sum_{I\in\mathcal I}\frac{|I|}{n}\ln\frac{|I|}{n} + d\ln n - \ln|\Sigma|. \qquad (2.4)$$
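As a quick numerical illustration of Property 2 (our own sketch, not from the paper; it compactly re-implements (1.1) so as to be self-contained), applying an invertible affine map $x\mapsto Ax+b$ should shift the score of every clustering of the same data by the same constant $-\ln|\det A|$, so the ranking of clusterings is unchanged.

```python
import numpy as np

def score_D(x, labels):
    # compact re-implementation of the score (1.1); see the sketch in Section 1
    x = np.asarray(x, float)
    n = len(x)
    V_full = np.cov(x, rowvar=False, bias=True)
    total = 0.0
    for lab in np.unique(labels):
        block = x[labels == lab]
        m = len(block)
        c = block - block.mean(axis=0)
        _, logdet = np.linalg.slogdet(V_full / m + c.T @ c / m)
        total += -0.5 * (m / n) * logdet + (m / n) * np.log(m / n)
    return total

rng = np.random.default_rng(1)
x = rng.normal(size=(60, 3))
labels1 = rng.integers(0, 3, size=60)          # two arbitrary clusterings of the same data
labels2 = rng.integers(0, 4, size=60)
A = rng.normal(size=(3, 3))                    # a generic (almost surely invertible) matrix
b = rng.normal(size=3)
y = x @ A.T + b                                # affine image of the data
shift = -np.log(abs(np.linalg.det(A)))         # predicted constant shift, -ln|det A|
for lab in (labels1, labels2):
    print(score_D(y, lab) - score_D(x, lab), shift)   # the two numbers should agree up to rounding
```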
3 The derivation

In this section we give the theoretical foundations for considering the function $\mathcal D$ as a clustering score function. We present a general formulation of a Bayesian Mixture Model and then concentrate on the case where the data within clusters are Gaussian. We analyse the asymptotics of the formula for the (unnormalised) posterior in this model. In this way we concentrate on scoring partitions of the observation space rather than of the data themselves. However, it is easy to switch to the score statistic by considering an empirical counterpart of $P$ instead of $P$; this yields $\mathcal D_0$ (cf. (2.1)). The general form of (2.1) is constructed to prevent the function $\mathcal D$ from assigning an infinite score to clusterings with very small clusters (of size less than the dimension of the observation space); on the other hand, when the clusters are large enough, $\mathcal D$ approximates $\mathcal D_0$.

Let $\Theta\subset\mathbb R^p$ be the parameter space and $\{G_\theta:\theta\in\Theta\}$ a family of probability measures on the observation space $\mathbb R^d$. Consider a prior distribution $\pi$ on $\Theta$. Let $\nu$ be a probability distribution on the $m$-dimensional simplex $\Delta_m=\{p=(p_i)_{i=1}^m : \sum_{i=1}^m p_i=1 \text{ and } p_i\ge 0 \text{ for } i\le m\}$ (where $m\in\mathbb N\cup\{\infty\}$). Let
$$\begin{aligned}
p=(p_i)_{i=1}^m &\sim \nu\\
\theta=(\theta_i)_{i=1}^m &\stackrel{iid}{\sim} \pi\\
x=(x_1,\ldots,x_n)\mid p,\theta &\stackrel{iid}{\sim} \textstyle\sum_{i=1}^m p_iG_{\theta_i}.
\end{aligned} \qquad (3.1)$$
This is a Bayesian Mixture Model. If $G_\theta$ is a Gaussian distribution for all $\theta\in\Theta$, we say that (3.1) defines a
Bayesian Mixture of Gaussians. In this case a convenient choice of the parameter space is $\Theta=\mathbb R^d\times S^+_d$, where $S^+_d$ is the space of positive definite $d\times d$ matrices. Then for $\theta=(\mu,\Lambda)$ the distribution $G_\theta$ is the multivariate normal distribution $\mathcal N(\mu,\Lambda)$. A conjugate prior distribution $\pi$ on $\Theta$ is the Normal-inverse-Wishart distribution, which is given by
$$\Lambda\sim\mathcal W^{-1}(\eta+d+1,\ \eta\Sigma_0), \qquad \mu\mid\Lambda\sim\mathcal N(\mu_0,\Lambda/\kappa). \qquad (3.2)$$
Here $\mathcal W^{-1}$ denotes the inverse Wishart distribution and the hyperparameters are $\kappa,\eta>0$, $\mu_0\in\mathbb R^d$ and $\Sigma_0\in S^+_d$. This prior is listed in Gelman et al. [2013] with slightly different hyperparameters, but we made this modification to obtain
$$\mathbb E\,\Lambda=\Sigma_0, \qquad \mathbb V(\mu)=\mathbb E\,\mathbb V(\mu\mid\Lambda)+\mathbb V\,\mathbb E(\mu\mid\Lambda)=\mathbb E\,\Lambda/\kappa+\mathbb V(\mu_0)=\Sigma_0/\kappa, \qquad (3.3)$$
which gives a nice interpretation of the hyperparameters.

Formula (3.1) can model data clustering; clusters are defined by deciding which $G_{\theta_i}$ generated a given data point. In order to formally define the clusters, we need to rewrite (3.1) as
$$\begin{aligned}
p=(p_i)_{i=1}^m &\sim \nu\\
\theta=(\theta_i)_{i=1}^m &\stackrel{iid}{\sim} \pi\\
\phi=(\phi_1,\ldots,\phi_n)\mid p,\theta &\stackrel{iid}{\sim} \textstyle\sum_{i=1}^m p_i\delta_{\theta_i}\\
x_i\mid p,\theta,\phi &\sim G_{\phi_i} \quad\text{independently for all } i\le n.
\end{aligned} \qquad (3.4)$$
Then the clusters are the equivalence classes of the relation $i\sim j \iff \phi_i=\phi_j$. In this way the distribution $\nu$ on the $m$-dimensional simplex generates a probability distribution $P_{\nu,n}$ on the partitions of the set $[n]$ into at most $m$ subsets.
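As a side check of the hyperparameter interpretation (3.3), the following sketch (ours; the dimension and hyperparameter values are arbitrary assumptions) samples from the prior (3.2) with SciPy and verifies that the Monte Carlo mean of $\Lambda$ and the covariance of $\mu$ are close to $\Sigma_0$ and $\Sigma_0/\kappa$ respectively.

```python
import numpy as np
from scipy.stats import invwishart

d, eta, kappa = 2, 5.0, 2.0
Sigma0 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
mu0 = np.zeros(d)

rng = np.random.default_rng(0)
# Lambda ~ W^{-1}(eta + d + 1, eta * Sigma0), as in (3.2)
Lams = invwishart.rvs(df=eta + d + 1, scale=eta * Sigma0, size=10000, random_state=rng)
print(Lams.mean(axis=0))               # should be close to Sigma0, since E(Lambda) = Sigma0
# mu | Lambda ~ N(mu0, Lambda / kappa); marginally Var(mu) = Sigma0 / kappa by (3.3)
mus = np.array([rng.multivariate_normal(mu0, L / kappa) for L in Lams[:4000]])
print(np.cov(mus, rowvar=False))       # should be close to Sigma0 / kappa
```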
Example 3.1. Let $V_1,V_2,\ldots\stackrel{iid}{\sim}\mathrm{Beta}(1,\alpha)$, $p_1=V_1$, $p_k=V_k\prod_{i=1}^{k-1}(1-V_i)$ for $k>1$, and let $\nu$ be the distribution of $p=(p_1,p_2,\ldots)$. The probability on the space of partitions of $[n]$ that $\nu$ generates is the Generalized Pólya Urn Scheme (Blackwell et al. [1973]), also known as the Chinese Restaurant Process (Aldous [1985]), with the probability weights given by
$$P_{\nu,n}(\mathcal I)=\frac{\alpha^{|\mathcal I|}}{\alpha^{(n)}}\prod_{I\in\mathcal I}(|I|-1)!, \qquad (3.5)$$
where $\alpha^{(n)}=\alpha(\alpha+1)\cdots(\alpha+n-1)$.
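The weights (3.5) are easy to work with numerically. The sketch below (ours; the parameter values are arbitrary) samples a partition of $[n]$ sequentially from the Chinese Restaurant Process and evaluates the log of (3.5) for it, using $\alpha^{(n)}=\Gamma(\alpha+n)/\Gamma(\alpha)$.

```python
import numpy as np
from math import lgamma, exp

def crp_sample(n, alpha, rng):
    """Sample a partition of [n] from the Chinese Restaurant Process with parameter alpha."""
    clusters = []                                   # list of lists of indices
    for i in range(n):
        sizes = np.array([len(c) for c in clusters], dtype=float)
        probs = np.append(sizes, alpha) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(clusters):
            clusters.append([i])                    # open a new cluster
        else:
            clusters[k].append(i)                   # join an existing cluster
    return clusters

def log_crp_prob(clusters, n, alpha):
    """log P_{nu,n}(I) from (3.5): alpha^{|I|} / alpha^{(n)} * prod_I (|I| - 1)!."""
    log_rising = lgamma(alpha + n) - lgamma(alpha)  # log of alpha (alpha+1) ... (alpha+n-1)
    return (len(clusters) * np.log(alpha) - log_rising
            + sum(lgamma(len(c)) for c in clusters))   # lgamma(m) = log((m-1)!)

rng = np.random.default_rng(0)
part = crp_sample(20, alpha=1.5, rng=rng)
print([len(c) for c in part], exp(log_crp_prob(part, 20, 1.5)))
```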
Lemma 3.2. Let $\nu$ be a probability distribution on $\Delta_m$ that generates a probability $P_{\nu,n}$ on the partitions of $[n]$. Then for every partition $\mathcal I$ of $[n]$
$$P_{\nu,n}(\mathcal I)=\int_{\Delta_m}\sum_{\psi:\mathcal I\hookrightarrow[m]}\prod_{I\in\mathcal I}p_{\psi(I)}^{|I|}\,\mathrm d\nu(p), \qquad (3.6)$$
where the ''middle sum'' ranges over all injective functions from $\mathcal I$ to $[m]$ (with the convention $[\infty]=\mathbb N$).

Proof. If $|\mathcal I|>m$ then both sides of (3.6) are 0. We now assume that $|\mathcal I|\le m$. Let us go back to (3.4) and suppose that the weights $p=(p_i)_{i=1}^m$ and the atoms $\theta=(\theta_i)_{i=1}^m$ are fixed. We need to know the probability that $\phi=(\phi_1,\ldots,\phi_n)\mid p,\theta\stackrel{iid}{\sim}\sum_{i=1}^m p_i\delta_{\theta_i}$ induces the partition $\mathcal I$. This would mean that for every $I\in\mathcal I$ all the values $\phi_i$ for $i\in I$ are equal to $\theta_j$ for some $j\le m$; let $j=\psi(I)$. The values $\psi(I)$ must be different for different $I\in\mathcal I$, otherwise $\mathcal I$ would not be generated. The probability of a sequence $(\phi_1,\ldots,\phi_n)$ where $\phi_i=\theta_{\psi(I)}$ for $i\in I$ is equal to $\prod_{I\in\mathcal I}p_{\psi(I)}^{|I|}$. Since any injective assignment of clusters to atoms is valid, for fixed $p$ the probability of $\mathcal I$ is equal to $\sum_{\psi:\mathcal I\hookrightarrow[m]}\prod_{I\in\mathcal I}p_{\psi(I)}^{|I|}$. Since $p\sim\nu$ is random, we have to integrate it out, and (3.6) follows.

Let $P_{\nu,n}$ be the probability distribution on the space of partitions generated by $\nu$. We can formulate (3.1) as follows: first we generate the partition of the observations into clusters, and then for every cluster we sample the actual observations from the relevant marginal distribution. Formally, (3.1) is equivalent to
$$\begin{aligned}
\mathcal I&\sim P_{\nu,n}\\
x_I:=(x_i)_{i\in I}\mid\mathcal I&\sim f_{|I|}\quad\text{independently for all } I\in\mathcal I,
\end{aligned} \qquad (3.7)$$
where for $\theta\sim\pi$, $k\in\mathbb N$ and $u=(u_1,\ldots,u_k)\mid\theta\stackrel{iid}{\sim}G_\theta$, $f_k$ is the marginal density of $u$, i.e.
$$f_k(u_1,\ldots,u_k):=\int_\Theta\pi(\theta)\prod_{i=1}^kg_\theta(u_i)\,\mathrm d\theta \qquad (3.8)$$
($g_\theta$ is the density of $G_\theta$). We stress the fact that the independent sampling on the 'lower' level of (3.7) relates to the independence between clusters (conditioned on the random partition); within one cluster the observations are (marginally) dependent. To make the notation more concise we define
$$f(x\mid\mathcal I):=\prod_{I\in\mathcal I}f_{|I|}(x_I). \qquad (3.9)$$
Then (3.7) becomes
$$\mathcal I\sim P_{\nu,n}, \qquad x\mid\mathcal I\sim f(\,\cdot\mid\mathcal I). \qquad (3.10)$$
The further analysis requires the exact formula for $f_k$; in our case it is straightforward to compute, since $\pi$ and $G_\theta$ are conjugate. We state the result here for the reader's convenience.

Proposition 1. Let $\theta=(\mu,\Lambda)$ have the distribution given by (3.2) and let $u=(u_1,\ldots,u_k)\mid\theta\stackrel{iid}{\sim}\mathcal N(\mu,\Lambda)$. Then the marginal distribution of $u$ is given by
$$f_k(u)=\frac{|\eta\Sigma_0|^{\nu_0/2}\,\kappa^{d/2}\,\Gamma_d\!\big(\tfrac{\nu_k}{2}\big)}{\pi^{dk/2}\,\kappa_k^{d/2}\,\Gamma_d\!\big(\tfrac{\nu_0}{2}\big)}\cdot\det\big(\Sigma(u)\big)^{-\nu_k/2}, \qquad (3.11)$$
where $\Gamma_d$ is the multivariate Gamma function and
$$\nu_k=\eta+d+1+k,\qquad \kappa_k=\kappa+k, \qquad\text{and} \qquad (3.12)$$
$$\Sigma(u)=\eta\Sigma_0+\sum_{i=1}^k(u_i-\bar u)(u_i-\bar u)^t+\frac{\kappa k}{\kappa+k}(\bar u-\mu_0)(\bar u-\mu_0)^t. \qquad (3.13)$$

Proof.
The proof follows from Murphy [2007], equation (266).
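For readers who want to compute with (3.11), here is a sketch (ours; the variable names and test values are arbitrary assumptions) of the log marginal likelihood $\ln f_k(u)$, using SciPy's multivariate log-gamma; in low dimensions it can be sanity-checked against a Monte Carlo estimate of (3.8).

```python
import numpy as np
from scipy.special import multigammaln

def log_f_k(u, eta, kappa, mu0, Sigma0):
    """log f_k(u) from (3.11)-(3.13), for u of shape (k, d)."""
    u = np.atleast_2d(np.asarray(u, float))
    k, d = u.shape
    nu0, nuk, kapk = eta + d + 1, eta + d + 1 + k, kappa + k
    ubar = u.mean(axis=0)
    S = (u - ubar).T @ (u - ubar)
    Sigma_u = eta * Sigma0 + S + (kappa * k / (kappa + k)) * np.outer(ubar - mu0, ubar - mu0)
    _, logdet0 = np.linalg.slogdet(eta * Sigma0)
    _, logdetu = np.linalg.slogdet(Sigma_u)
    return (0.5 * nu0 * logdet0 + 0.5 * d * np.log(kappa)
            + multigammaln(nuk / 2.0, d)
            - 0.5 * d * k * np.log(np.pi) - 0.5 * d * np.log(kapk)
            - multigammaln(nu0 / 2.0, d)
            - 0.5 * nuk * logdetu)

# tiny example: three points in R^2 under arbitrary hyperparameters
u = np.array([[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]])
print(log_f_k(u, eta=3.0, kappa=1.0, mu0=np.zeros(2), Sigma0=np.eye(2)))
```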
Throughout this section $P$ is some fixed probability distribution on $\mathbb R^d$.

Definition 3.3.
We say that a family $\mathcal A$ of $P$-measurable subsets of $\mathbb R^d$ is a $P$-partition if
• $P\big(\bigcup_{A\in\mathcal A}A\big)=1$,
• $P(A_1\cap A_2)=0$ for all $A_1,A_2\in\mathcal A$, $A_1\neq A_2$.

Notation.
Let $\mathcal A$ be a $P$-partition of the observation space. Let $X_1,X_2,\ldots\stackrel{iid}{\sim}P$ and for $n\in\mathbb N$ let $\mathcal I^{\mathcal A}_n=\{J^A_n : A\in\mathcal A\}$, where $J^A_n=\{i\le n : X_i\in A\}$ (if $J^A_n=\emptyset$, we do not include it in $\mathcal I^{\mathcal A}_n$). We say that $\mathcal I^{\mathcal A}_n$ is induced by $\mathcal A$.

Proposition 2.
Let $\mathcal A$ be a $P$-partition of the observation space. Then $\mathcal I^{\mathcal A}_n$ is almost surely a partition of $[n]$.

Proof. The proof is straightforward and therefore omitted.

Let $E_P(A)=\mathbb E_P(X\mid X\in A)$ and $V_P(A)=\mathrm{Var}_P(X\mid X\in A)$, where $X\sim P$; that is, $E_P(A)$ is the conditional expected value and $V_P(A)$ is the conditional covariance matrix of $X$ given the event $X\in A$. For a family $\mathcal A$ of sets with positive $P$-measure let
$$\mathcal V_P(\mathcal A)=\sum_{A\in\mathcal A}P(A)\ln|V_P(A)|, \qquad \mathcal H_P(\mathcal A)=-\sum_{A\in\mathcal A}P(A)\ln P(A), \qquad (3.14)$$
where $|\cdot|$ denotes the determinant. Let
$$\Delta_P(\mathcal A)=-\tfrac12\,\mathcal V_P(\mathcal A)-\mathcal H_P(\mathcal A). \qquad (3.15)$$
It turns out that (3.15) is essentially (modulo an additive constant) the first order approximation to $\frac1n$ times the logarithm of the unnormalised posterior probability, in the Bayesian Mixture Model, of the data clustering defined by $\mathcal A$, when the data comes as an iid sample from $P$.

Proposition 3.
$$\sqrt[n]{P_{\nu,n}(\mathcal I^{\mathcal A}_n)\cdot f(X_n\mid\mathcal I^{\mathcal A}_n)}\approx(2e)^{-d/2}\exp\{\Delta_P(\mathcal A)\}, \quad\text{where}\quad \Delta_P(\mathcal A)=-\frac12\sum_{A\in\mathcal A}P(A)\ln|V_P(A)|+\sum_{A\in\mathcal A}P(A)\ln P(A). \qquad (3.16)$$

Proof.
The result follows from Proposition 4 and Proposition 5.

It should be noted that Proposition 3 does not depend on the form of the prior on probability measures. This prior is responsible for the 'entropy' part of (3.16).

The final goal is not to score partitions of the observation space but clusterings of the data. A natural idea is to replace the distribution $P$ in (3.15) by its empirical counterpart. Let $\hat P_n=\frac1n\sum_{i\le n}\delta_{x_i}$ be the empirical probability measure of $x$. This is how $\mathcal D_0$ is obtained. The function $\mathcal D_0$ would not be a good score statistic, because if $\mathcal J$ contains a cluster $J$ of size less than $d$, then $\sum_{j\in J}(x_j-\bar x_J)(x_j-\bar x_J)^t$ is singular and hence $\mathcal D_0(x,\mathcal J)=\infty$. To circumvent this, one could add some positive definite matrix to the within-group covariance matrix; in this way the relevant determinant will always be greater than zero. Since we would like to avoid any arbitrary constants in the score function, a natural idea is to use the covariance matrix of the whole dataset, $\hat V_x=\frac1n\sum_{i\le n}(x_i-\bar x_{[n]})(x_i-\bar x_{[n]})^t$. This operation is also motivated by considering the adaptive model, where the strength of the prior distribution increases linearly with the number of observations; the details of this approach are given in Section 4. On the other hand, we do not want this modification to affect the score significantly when the clusters are large and the empirical covariance matrices are good estimates of the theoretical ones. Therefore we decrease the importance of the modification linearly with the cluster size. This gives (1.1), which is a well defined score statistic.
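To see the problem concretely (a toy sketch of ours, not from the paper): for a singleton cluster the within-cluster scatter matrix is the zero matrix, so the log-determinant appearing in $\mathcal D_0$ is $-\infty$ and the score blows up, while the term $\hat V_x/|I|$ in (1.1) keeps it finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
V_full = np.cov(x, rowvar=False, bias=True)          # \hat V_x of the whole dataset

def cluster_logdet(block, ridge):
    c = block - block.mean(axis=0)
    sign, logdet = np.linalg.slogdet(ridge + c.T @ c / len(block))
    return logdet if sign > 0 else -np.inf

singleton = x[:1]                                    # a cluster of size 1 < d
print(cluster_logdet(singleton, np.zeros((2, 2))))   # -inf: D_0 assigns an infinite score
print(cluster_logdet(singleton, V_full / 1))         # finite: the \hat V_x / |I| term of (1.1)
```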
Proposition 4. Let $P$ be a probability distribution on $\mathbb R^d$ and let $\mathcal A$ be a finite $P$-partition of the observation space. Then
$$\lim_{n\to\infty}\sqrt[n]{f(X_n\mid\mathcal I^{\mathcal A}_n)} \stackrel{a.s.}{=} (2e)^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}.$$

Before we present the proof of Proposition 4, we formulate an auxiliary lemma that concerns the asymptotics of the function $\Gamma_d$.

Notation.
If $(a_n)_{n=1}^{\infty}$ and $(b_n)_{n=1}^{\infty}$ are real sequences, we write $a_n\approx b_n$ if $\lim_{n\to\infty}\frac{a_n}{b_n}=1$, and $a_n=o(b_n)$ if $\lim_{n\to\infty}\frac{a_n}{b_n}=0$. Similarly, if $a,b:\mathbb R\to\mathbb R$ are real functions, we write $a(x)\approx b(x)$ if $\lim_{x\to\infty}\frac{a(x)}{b(x)}=1$ and $a(x)=o\big(b(x)\big)$ if $\lim_{x\to\infty}\frac{a(x)}{b(x)}=0$.

Lemma 3.4.
Let $\alpha,\beta,a,b>0$. If $a_n\approx\alpha n^a$ and $b_n-\beta=o(n^{-b})$, then $a_n^{b_n}\approx(\alpha n^a)^\beta$.

Proof. For sufficiently large $n$ we have $1<a_n<2\alpha n^a$ and $-n^{-b}<b_n-\beta<n^{-b}$, hence
$$(2\alpha n^a)^{-n^{-b}}<a_n^{-n^{-b}}<a_n^{b_n-\beta}<a_n^{n^{-b}}<(2\alpha n^a)^{n^{-b}}. \qquad (3.17)$$
The left- and right-hand sides of (3.17) converge to 1, so $\lim_{n\to\infty}a_n^{b_n-\beta}=1$. The proof follows from
$$\frac{a_n^{b_n}}{(\alpha n^a)^\beta}=\Big(\frac{a_n}{\alpha n^a}\Big)^{\beta}a_n^{b_n-\beta}.$$

Lemma 3.5. If $x_n\approx\lambda n$ and $x_n/n-\lambda=o(n^{-a})$ for some $a>0$, then $\sqrt[n]{\Gamma_d(x_n)}\approx\big(\tfrac{\lambda n}{e}\big)^{\lambda d}$.

Proof. Recall Stirling's formula: $\Gamma(x)\approx\sqrt{2\pi/x}\,\big(\tfrac xe\big)^x$. It follows from Lemma 3.4 that
$$\sqrt[n]{\Gamma(x_n)}\approx\Big(\sqrt{2\pi/x_n}\,\Big(\frac{x_n}{e}\Big)^{x_n}\Big)^{1/n}=(2\pi/x_n)^{1/(2n)}\Big(\frac{x_n}{e}\Big)^{x_n/n}\approx\Big(\frac{\lambda n}{e}\Big)^{\lambda}, \qquad (3.18)$$
since $(2\pi/x_n)^{1/(2n)}\approx 1$.
Note that for fixed $t>0$ we also have $(x_n-t)\approx\lambda n$, and as a result
$$\sqrt[n]{\Gamma_d(x_n)}=\sqrt[n]{\pi^{d(d-1)/4}}\,\prod_{j=1}^d\sqrt[n]{\Gamma\Big(x_n-\frac{j-1}{2}\Big)}\approx\Big(\frac{\lambda n}{e}\Big)^{\lambda d}. \qquad (3.19)$$

Proof of Proposition 4. Note that $|J^A_n|$ is a random variable with distribution $\mathrm{Bin}(n,P(A))$ for all $A\in\mathcal A$. By the Law of the Iterated Logarithm, almost surely $\big(|J^A_n|/n-P(A)\big)=o(n^{-1/2+\varepsilon})$ for any $\varepsilon>0$, so by Lemma 3.5
$$\sqrt[n]{\Gamma_d\Big(\frac{|J^A_n|+\eta+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\frac{P(A)}{2}\cdot\frac ne\Big)^{P(A)d/2}. \qquad (3.20)$$
Because $\mathcal A$ is finite and $\sum_{A\in\mathcal A}P(A)=1$, this means that
$$\sqrt[n]{\prod_{A\in\mathcal A}\Gamma_d\Big(\frac{|J^A_n|+\eta+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{d/2}. \qquad (3.21)$$
By the strong law of large numbers we have
$$\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t\stackrel{a.s.}{\approx}V_P(A)\quad\text{for } A\in\mathcal A, \qquad (3.22)$$
and hence, by (3.13), for $A\in\mathcal A$
$$\frac{\big|\Sigma(X_{J^A_n})\big|}{|J^A_n|^d}=\Big|\frac{\eta\Sigma_0}{|J^A_n|}+\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t+\frac{\kappa}{\kappa+|J^A_n|}(\bar x_A-\mu_0)(\bar x_A-\mu_0)^t\Big|\stackrel{a.s.}{\approx}\Big|\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t\Big|\stackrel{a.s.}{\approx}|V_P(A)|. \qquad (3.23)$$
Hence $\big|\Sigma(X_{J^A_n})\big|\stackrel{a.s.}{\approx}|J^A_n|^d\,|V_P(A)|\stackrel{a.s.}{\approx}n^dP(A)^d\,|V_P(A)|$. Using the Law of the Iterated Logarithm and Lemma 3.4 again we get
$$\sqrt[n]{\big|\Sigma(X_{J^A_n})\big|^{-(|J^A_n|+\eta+d+1)/2}}\approx\big(P(A)^{P(A)}\big)^{-d/2}\,n^{-dP(A)/2}\,|V_P(A)|^{-P(A)/2}, \qquad (3.24)$$
which means that
$$\sqrt[n]{\prod_{A\in\mathcal A}\big|\Sigma(X_{J^A_n})\big|^{-(|J^A_n|+\eta+d+1)/2}}\approx\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{-d/2}n^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}, \qquad (3.25)$$
and therefore
$$\sqrt[n]{f(X_n\mid\mathcal I^{\mathcal A}_n)}\stackrel{a.s.}{\approx}\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{d/2}\Big(\prod_{A\in\mathcal A}P(A)^{P(A)}\Big)^{-d/2}n^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}=(2e)^{-d/2}\prod_{A\in\mathcal A}|V_P(A)|^{-P(A)/2}. \qquad (3.26)$$
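A quick numerical illustration of the $\Gamma_d$ asymptotics used above (Lemma 3.5; our own sketch, with the arbitrary choices $\lambda=0.3$, $d=2$): $\frac1n\ln\Gamma_d(\lambda n)$ should approach $\lambda d\,\ln(\lambda n/e)$ as $n$ grows.

```python
import numpy as np
from scipy.special import multigammaln

lam, d = 0.3, 2
for n in (10**2, 10**3, 10**4, 10**5):
    lhs = multigammaln(lam * n, d) / n          # (1/n) * ln Gamma_d(x_n) with x_n = lam * n
    rhs = lam * d * np.log(lam * n / np.e)      # log of the limiting expression (lam * n / e)^(lam * d)
    print(n, lhs, rhs, lhs - rhs)               # the difference should shrink towards 0
```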
Proposition 5. Let $P$ be a probability distribution on $\mathbb R^d$ and let $\mathcal A$ be a finite $P$-partition of the observation space. Let $P_{\nu,n}$ be a probability distribution on the partitions of $[n]$, generated by the probability distribution $\nu$ on $\Delta_\infty$. Then $\lim_{n\to\infty}\sqrt[n]{P_{\nu,n}(\mathcal I^{\mathcal A}_n)}\stackrel{a.s.}{=}\prod_{A\in\mathcal A}P(A)^{P(A)}$.

Proof. The proof is a direct consequence of the Law of Large Numbers and Theorem 3.8.

By (3.15), $\Delta_P$ consists of two components: $\mathcal V_P$ and $\mathcal H_P$. These two behave differently when two clusters are joined; the variance component increases whereas the entropy component decreases.

Proposition 6. Let $\mathcal A$ be a partition of $\mathbb R^d$ and let $A,B\in\mathcal A$. Let $\mathcal C$ be the partition obtained from $\mathcal A$ by joining $A$ and $B$, i.e. $\mathcal C=\mathcal A\cup\{A\cup B\}\setminus\{A,B\}$. Then
(a) $\mathcal H_P(\mathcal A)\ge\mathcal H_P(\mathcal C)$,
(b) $\mathcal V_P(\mathcal A)\le\mathcal V_P(\mathcal C)$.

Proof. Let $C=A\sqcup B$. Part (a):
$$P(A)\ln P(A)+P(B)\ln P(B)-P(C)\ln P(C)=P(A)\ln\frac{P(A)}{P(C)}+P(B)\ln\frac{P(B)}{P(C)}\le 0,$$
since $P(A),P(B)\le P(C)$.

Lemma 3.6.
Let $A\cap B=\emptyset$ and $C:=A\cup B$. Then
$$P(A)V_P(A)+P(B)V_P(B)\preceq P(C)V_P(C), \qquad (3.28)$$
where $\preceq$ is the Löwner partial order, i.e. $M_1\preceq M_2$ iff $M_2-M_1$ is non-negative definite.

Proof. Let $e_1(A)=\mathbb E\,X\mathbb 1_A(X)$ and $e_2(A)=\mathbb E\,XX^t\mathbb 1_A(X)$, where $X\sim P$. Then
$$V_P(A)=\frac{e_2(A)}{P(A)}-\frac{e_1(A)e_1(A)^t}{P(A)^2}. \qquad (3.29)$$
Note that the functions $P,e_1,e_2$ are additive, hence
$$\begin{aligned}
P(C)V_P(C)&-P(A)V_P(A)-P(B)V_P(B)\\
&=\Big(e_2(C)-\frac{e_1(C)e_1(C)^t}{P(C)}\Big)-\Big(e_2(A)-\frac{e_1(A)e_1(A)^t}{P(A)}\Big)-\Big(e_2(B)-\frac{e_1(B)e_1(B)^t}{P(B)}\Big)\\
&=\frac{e_1(A)e_1(A)^t}{P(A)}+\frac{e_1(B)e_1(B)^t}{P(B)}-\frac{e_1(C)e_1(C)^t}{P(C)}\\
&=\frac{e_1(A)e_1(A)^t}{P(A)}+\frac{e_1(B)e_1(B)^t}{P(B)}-\frac{\big(e_1(A)+e_1(B)\big)\big(e_1(A)+e_1(B)\big)^t}{P(A)+P(B)}\\
&=\frac{1}{P(A)P(B)\big(P(A)+P(B)\big)}\big(e_1(A)P(B)-e_1(B)P(A)\big)\big(e_1(A)P(B)-e_1(B)P(A)\big)^t.
\end{aligned} \qquad (3.30)$$
The last matrix in (3.30) is clearly non-negative definite and the proof follows.
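A small numerical sanity check of (3.28) (our own sketch, taking $P$ to be the empirical measure of a random sample and two arbitrary disjoint events): the matrix $P(C)V_P(C)-P(A)V_P(A)-P(B)V_P(B)$ should have only non-negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
in_A = x[:, 0] < -0.5             # two disjoint events A and B, with C = A ∪ B
in_B = x[:, 0] > 0.5
in_C = in_A | in_B

def p_times_v(mask):
    """P(A) * V_P(A) for the empirical measure of the sample."""
    block = x[mask]
    c = block - block.mean(axis=0)
    return mask.mean() * (c.T @ c / len(block))

diff = p_times_v(in_C) - p_times_v(in_A) - p_times_v(in_B)
print(np.linalg.eigvalsh(diff))   # all eigenvalues should be >= 0 (up to rounding)
```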
Theorem 3.7 (Theorem 2.4.4 in Horn et al. [1990]). The function $\ln\det(\cdot)$ is concave on the space of positive definite matrices.

Proof of Proposition 6, part (b):
$$\frac{P(A)}{P(C)}\ln|V_P(A)|+\frac{P(B)}{P(C)}\ln|V_P(B)|\ \stackrel{\text{Thm 3.7}}{\le}\ \ln\Big|\frac{P(A)}{P(C)}V_P(A)+\frac{P(B)}{P(C)}V_P(B)\Big|\ \stackrel{\text{Lemma 3.6}}{\le}\ \ln|V_P(C)|, \qquad (3.31)$$
and the proof follows.

Theorem 3.8. Let $P_{\nu,n}$ be a probability distribution on the partitions of $[n]$, generated by the probability distribution $\nu$ on $\Delta_\infty$. Fix $K\in\mathbb N$ and consider a sequence of partitions $(\mathcal I_n)_{n\in\mathbb N}$, where $\mathcal I_n=\{I_{n,1},\ldots,I_{n,K}\}$ is a partition of $[n]$ (it is possible that $I_{n,i}=\emptyset$ for some $i\le K$). Assume that $|I_{n,k}|/n\to\alpha_k>0$ for $k\le K$. Then
$$\lim_{n\to\infty}\sqrt[n]{P_{\nu,n}(\mathcal I_n)}=\prod_{k=1}^K\alpha_k^{\alpha_k}. \qquad (3.32)$$

Proof.
Firstly note that for sufficiently large $n$ we have $|I_{n,k}|\ge 1$ for $k\le K$. Then in (3.6) we sum functions that depend on exactly $K$ coordinates of $p$. Hence we can express (3.6) in the form of an integral over the $K$-dimensional set $N_K=\{(p_1,\ldots,p_K):\sum_{k=1}^Kp_k\le 1,\ p_k\in(0,1]\text{ for all }k\le K\}$ as
$$P_{\nu,n}(\mathcal I_n)=\int_{N_K}\prod_{k=1}^Kp_k^{|I_{n,k}|}\,\mathrm d\nu_K(p), \qquad (3.33)$$
where $\nu_K$ is a measure on $N_K$ defined by
$$\nu_K(A)=\sum_{\psi:[K]\hookrightarrow\mathbb N}\nu\big((p_{\psi(1)},p_{\psi(2)},\ldots,p_{\psi(K)})\in A\big) \qquad (3.34)$$
for $A\subset N_K$, where $[K]=\{1,2,\ldots,K\}$. Hence
$$\sqrt[n]{P_{\nu,n}(\mathcal I_n)}=\sqrt[n]{\int_{N_K}\prod_{k=1}^Kp_k^{|I_{n,k}|}\,\mathrm d\nu_K(p)}=\|g_n\|_n, \qquad (3.35)$$
where $g_n(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k^{|I_{n,k}|/n}$ and $\|\cdot\|_n$ is the norm in the space $L^n(N_K,\nu_K)$.

Since $\nu_K$ is not a finite measure on $N_K$, in the remaining part of the proof we have to be careful that the functions we are considering belong to the space $L^n(N_K,\nu_K)$ for sufficiently large $n$.

Let $g(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k^{\alpha_k}$ and let $h(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k$. Note that
$$\int_{N_K}h(p)\,\mathrm d\nu_K(p)=P_{\nu,K}\Big(\big\{\{1\},\{2\},\ldots,\{K\}\big\}\Big)\le 1. \qquad (3.36)$$
Moreover, for $n>1/\min_k\alpha_k$ we have $g^n(p)\le h(p)$ and therefore $g\in L^n(N_K,\nu_K)$ for $n>1/\min_k\alpha_k$. Because $g$ is bounded by 1 we get
$$\|g\|_n\to\|g\|_\infty=\sup_{N_K}g=\prod_{k\le K}\alpha_k^{\alpha_k} \qquad (3.37)$$
(the fact that $\|g\|_\infty=\sup_{N_K}g=\prod_{k\le K}\alpha_k^{\alpha_k}$ follows easily from applying Lagrange multipliers; see Lemma 3.9 below).

We now prove that $\|g_n-g\|_n\to 0$.
This is not a straightforward consequence of the pointwise convergence of $g_n$ to $g$, since $\nu_K$ is not a finite measure on $N_K$. Clearly $|I_{n,k}|/n-\alpha_k/2\to\alpha_k/2>0$ for every $k\le K$, and hence $\|g_ng^{-1/2}-g^{1/2}\|_\infty\to 0$ on $N_K$. Let $N\in\mathbb N$ be chosen so that for $n>N$ we have $\|g_ng^{-1/2}-g^{1/2}\|_\infty<\varepsilon$ and $n\alpha_k\ge 2$ for $k\le K$. Then for $n>N$
$$\|g_n-g\|_n^n=\int_{N_K}|g_n-g|^n\,\mathrm d\nu_K(p)=\int_{N_K}\big|g_ng^{-1/2}-g^{1/2}\big|^n\,g^{n/2}\,\mathrm d\nu_K(p)\le\varepsilon^n\int_{N_K}g^{n/2}\,\mathrm d\nu_K(p)\le\varepsilon^n\int_{N_K}h\,\mathrm d\nu_K(p)\le\varepsilon^n, \qquad (3.38)$$
hence $\|g_n-g\|_n\to 0$.
The result now follows from the triangle inequality
$$\big|\,\|g_n\|_n-\|g\|_\infty\big|\le\big|\,\|g_n\|_n-\|g\|_n\big|+\big|\,\|g\|_n-\|g\|_\infty\big|\le\|g_n-g\|_n+\big|\,\|g\|_n-\|g\|_\infty\big|. \qquad (3.39)$$

Lemma 3.9. Let $\alpha_i>0$ for $i\le K$ and $\sum_{i=1}^K\alpha_i=1$. Let $g(p_1,\ldots,p_K)=\prod_{k=1}^Kp_k^{\alpha_k}$. Then $\sup_{N_K}g=\prod_{k\le K}\alpha_k^{\alpha_k}$.

Proof. As $\alpha_i>0$ for $i\le K$, the function $g$ is continuous and, because the closure of $N_K$ is compact in $\mathbb R^K$ and $g$ extends continuously to it (by $0$ where a coordinate vanishes), the supremum $\sup_{N_K}g>0$ is attained at some $\hat p=(\hat p_1,\ldots,\hat p_K)\in N_K$. Clearly $\hat p\in\Delta_K$.
Indeed, otherwise $s=\sum_{i=1}^K\hat p_i<1$, $\hat p/s\in N_K$ and $g(\hat p/s)=g(\hat p)/s>g(\hat p)$, which contradicts the definition of $\hat p$. Since $g$ is nonnegative on $\Delta_K$ and equal to $0$ on the boundary of $\Delta_K$, we know that $\hat p$ is in the interior of $\Delta_K$. The function $g$ is positive on the interior of $\Delta_K$, so by considering the function $\ln g$ and using Lagrange multipliers we get that $\hat p$ satisfies
$$0=(\alpha_i\ln p_i)'+\lambda=\frac{\alpha_i}{p_i}+\lambda \qquad (3.40)$$
for $i\le K$ and some $\lambda\in\mathbb R$. Hence the $p_i$'s are proportional to the $\alpha_i$'s and, because $\sum_{i=1}^K\alpha_i=1$, we get that $\hat p_i=\alpha_i$ and the proof follows.

4 The adaptive model

We now allow the parameters of the model (3.2) to change with the number of observations. More precisely, we substitute $\eta_n:=\lambda n$, so that the expected value of the within-group covariance matrix stays fixed while its prior distribution becomes increasingly concentrated on $\Sigma_0$. We investigate the limit formula for the posterior as $n$ goes to infinity. Note that in this case $\Sigma(X_{J^A_n})/|J^A_n|\to\frac{\lambda}{P(A)}\Sigma_0+V_P(A)$. The prior becomes
$$\Lambda\sim\mathcal W^{-1}(\eta_n+d+1,\ \eta_n\Sigma_0), \qquad \mu\mid\Lambda\sim\mathcal N(\mu_0,\Lambda/\kappa). \qquad (4.1)$$

Proposition 7.
Let $P$ be a probability distribution on $\mathbb R^d$ and let $\mathcal A$ be a finite $P$-partition of the observation space. Then
$$\sqrt[n]{f(X_n\mid\mathcal I^{\mathcal A}_n)}\stackrel{a.s.}{\approx}(2e)^{-(1+|\mathcal A|\lambda)d/2}\prod_{A\in\mathcal A}\Big|\frac{\lambda}{P(A)+\lambda}\Sigma_0+\frac{P(A)}{P(A)+\lambda}V_P(A)\Big|^{-(P(A)+\lambda)/2}. \qquad (4.2)$$

Proof.
Note that $|J^A_n|$ is a random variable with distribution $\mathrm{Bin}(n,P(A))$ for all $A\in\mathcal A$. By the Law of the Iterated Logarithm, almost surely $\big(|J^A_n|/n-P(A)\big)=o(n^{-1/2+\varepsilon})$ for any $\varepsilon>0$, so by Lemma 3.5
$$\sqrt[n]{\Gamma_d\Big(\frac{|J^A_n|+\eta_n+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\frac{P(A)+\lambda}{2}\cdot\frac ne\Big)^{(P(A)+\lambda)d/2}. \qquad (4.3)$$
Because $\mathcal A$ is finite and $\sum_{A\in\mathcal A}P(A)=1$, this means that
$$\sqrt[n]{\prod_{A\in\mathcal A}\Gamma_d\Big(\frac{|J^A_n|+\eta_n+d+1}{2}\Big)}\stackrel{a.s.}{\approx}\Big(\prod_{A\in\mathcal A}\big(P(A)+\lambda\big)^{P(A)+\lambda}\Big)^{d/2}\Big(\frac{n}{2e}\Big)^{(1+|\mathcal A|\lambda)d/2}. \qquad (4.4)$$
By the strong law of large numbers we have
$$\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t\stackrel{a.s.}{\approx}V_P(A)\quad\text{for } A\in\mathcal A, \qquad (4.5)$$
and hence, by (3.13), for $A\in\mathcal A$
$$\frac{\big|\Sigma(X_{J^A_n})\big|}{|J^A_n|^d}=\Big|\frac{\eta_n\Sigma_0}{|J^A_n|}+\frac{1}{|J^A_n|}\sum_{i\in J^A_n}(x_i-\bar x_A)(x_i-\bar x_A)^t+\frac{\kappa}{\kappa+|J^A_n|}(\bar x_A-\mu_0)(\bar x_A-\mu_0)^t\Big|\stackrel{a.s.}{\approx}\Big|\frac{\lambda}{P(A)}\Sigma_0+V_P(A)\Big|. \qquad (4.6)$$
Hence $\big|\Sigma(X_{J^A_n})\big|\stackrel{a.s.}{\approx}n^d\big(P(A)+\lambda\big)^d\,\Big|\frac{\lambda}{P(A)+\lambda}\Sigma_0+\frac{P(A)}{P(A)+\lambda}V_P(A)\Big|$. Using the Law of the Iterated Logarithm and Lemma 3.4 again we get
$$\sqrt[n]{\big|\Sigma(X_{J^A_n})\big|^{-(|J^A_n|+\eta_n+d+1)/2}}\approx\big(n(P(A)+\lambda)\big)^{-(P(A)+\lambda)d/2}\,\Big|\frac{\lambda}{P(A)+\lambda}\Sigma_0+\frac{P(A)}{P(A)+\lambda}V_P(A)\Big|^{-(P(A)+\lambda)/2}, \qquad (4.7)$$
and (4.2) follows.

5 Conclusion

In this article we proposed a score function that can be used for choosing the number of clusters in popular clustering methods. It is derived as a limit in a Bayesian Mixture Model of Gaussians. We derived some of its properties, though some questions remain unanswered. For example, it is interesting to ask what assumptions on $P$ should be made to ensure that the supremum of possible values of the $\Delta$ function is finite.

References
David J Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1983, pages 1-198. Springer, 1985.

David Blackwell and James B MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353-355, 1973.

Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.

Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 1990.

Kevin P Murphy. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, 2007.

Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.

Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.