Finite mixture models do not reliably learn the number of components
By Diana Cai* (Princeton University), Trevor Campbell* (University of British Columbia), and Tamara Broderick (Massachusetts Institute of Technology)

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. A common suggestion is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent; that is, the posterior concentrates on the true generating number of components. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM component-count posterior diverges: the posterior probability of any particular finite number of latent components converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.

*First authorship is shared jointly by D. Cai and T. Campbell.
1. Introduction.
A probabilistic generative model is typically, of necessity, a simplification of the complex real-world phenomena that govern any observed data. This simplification facilitates tractable data analysis and discovery of meaningful and actionable patterns in data. But it also follows that typical models of real-world data sets are misspecified. And certain types of misspecification can be dangerous, in that they may lead to fundamentally inaccurate or misleading inferences.

For instance, Miller and Harrison (2013, 2014) caution against misspecification in mixture modeling. Mixture models are widely used to discover latent groups, or components, within a population. Often the number of components is unknown in advance, and one of the principal inferential goals is estimating and interpreting this number. For example, practitioners might wish to find the number of latent genetic populations (Pritchard et al., 2000; Lorenzen et al., 2006; Huelsenbeck and Andolfatto, 2007), gene tissue profiles (Yeung et al., 2001; Medvedovic and Sivaganesan, 2002), cell types (Chan et al., 2008; Prabhakaran et al., 2016), haplotypes (Xing et al., 2006), switching Markov regimes in US dollar exchange rate data (Otranto and Gallo, 2002), gamma-ray burst types (Mukherjee et al., 1998), or segmentation regions in an image (e.g., tissue types in an MRI scan (Banfield and Raftery, 1993)).
Keywords and phrases: finite mixture models, number of components, model misspecification, posterior asymptotics, Bayesian nonparametrics.
One way to capture what we know about the number of components is to take a Bayesian approach; the Bayesian posterior represents our state of belief about the value of interest after observing a data set. A natural check on a Bayesian analysis is to establish that—when the true, generating number of components is known—our posterior increasingly concentrates near the truth as the number of data points becomes arbitrarily large.

One popular Bayesian methodology uses a Dirichlet process mixture model (DPMM). The DPMM itself is constructed with a countable infinity of components. So practitioners instead use the posterior on the number of clusters (Pella and Masuda, 2006; Huelsenbeck and Andolfatto, 2007)—i.e., components represented in the observed data. For any N data points, a DPMM implicitly gives a prior over the number of clusters with support on $[N] := \{1, 2, \ldots, N\}$. Unfortunately, due to the latent infinitude of components in a DPMM, the DPMM cluster-count posterior concentrates strictly away from the true generating number of components when the generating number is finite (Miller and Harrison, 2013, 2014). In fact, Miller and Harrison (2013, 2014) show the cluster-count posterior probability on the generating number of components goes to 0 as the amount of data grows.

Miller and Harrison (2018) suggest that a potential alternative prior may resolve these difficulties. Namely, they consider a prior directly on the number of components, with component counts supported on all possible strictly-positive integers (Nobile, 1994; Stephens, 2000; Green and Richardson, 2001; Nobile, 2004, 2007; Nobile and Fearnside, 2007; Miller and Harrison, 2013, 2014, 2018; Grazian et al., 2020). As in Nobile (1994), we call the resulting model the finite mixture model (FMM) to emphasize that, unlike in the DPMM or similar nonparametric models, the expected number of components in the prior is fixed and finite across data set sizes. Nobile (1994) has shown that the resulting component-count posterior in this case does concentrate at the true number of components. But crucially this result depends on the assumption that the component likelihoods are perfectly specified. In practice, though, we can expect that the component likelihoods are at least somewhat imperfectly specified, since they are necessarily simplifications of real-world phenomena. For instance, while Gaussian mixture models are ubiquitous, data are rarely perfectly Gaussian. Miller and Dunson (2019) provide empirical evidence of undesirable posterior behavior in an FMM with misspecified likelihood.

In this paper, we examine FMMs with essentially any component shape, subject to some mild regularity conditions; in particular, we do not restrict only to, for instance, Gaussian component likelihoods. We show that when the component likelihoods are not perfectly specified, the component-count posterior concentrates strictly away from the true number of components, just as in the nonparametric case. In fact, we go further to show that the FMM posterior for the number of components diverges: for any finite $k \in \mathbb{N}$, the posterior probability that the number of components is k converges to 0 almost surely as the amount of data grows.

It follows that we expect using either FMMs or DPMMs will lead to unreliable estimates of component counts in typical practical applications—even though the source of misspecification in the two cases is different.
That is, the DPMM cluster-count posterior is unreliable even when the component likelihoods are accurately specified—because the DPMM assumes an infinity of components. On the other hand, the FMM component-count posterior is unreliable only when the component likelihoods are misspecified. But since component likelihoods are typically (at least very slightly) misspecified in practical applications, we still expect the FMM estimate of the true generating number of components to diverge.

Besides resolving a theoretical gap in the literature, our analysis therefore has important practical implications. Namely, if we reject DPMMs for their poor asymptotic estimation of the number of components, then FMMs must be discarded as well, except in applications where we can guarantee perfectly specified likelihoods a priori. An alternative perspective is that while both models suffer in the asymptotic regime, finite data may provide some regularization properties when using either DPMMs or FMMs. A potential challenge of this latter perspective follows as a corollary of our analysis: the amount of implicit regularization changes as the data grows, and thus the same analysis will give different answers for different data set sizes. We demonstrate this effect in real-data experiments by varying the number of data points; our results show that the number of components varies substantially as we increase the data set size within realistic ranges. Our results demonstrate that past research using a Bayesian prior on the number of components may have strongly depended on the size of the data set in consideration.

In the same spirit as Miller and Harrison (2013, 2014), we believe our paper adds rigor to folk wisdom in the data analysis community. For practitioners interested in a reliable estimate of cluster cardinality, our results suggest more robust methods are needed. Indeed, work in this direction has already begun (Grünwald, 2006; Woo and Sriram, 2006, 2007; Rodriguez and Dunson, 2011; Wang et al., 2017; Miller and Dunson, 2019; Huggins and Miller, 2019).

Section 2 begins by introducing FMMs and stating our main result on posterior divergence in the number of components. Then in Section 3, we discuss the primary assumptions needed for the main result to hold. Section 4 provides the proof of the main theoretical result. Section 5 presents an extension of the main theorem to priors that may vary as the data set grows. Section 6 discusses related work on the asymptotics of the posterior number of components in FMMs. The paper concludes in Section 7 with empirical evidence that the FMM posterior on the number of components depends on the amount of data considered.
2. Main result.
We begin with a brief description of the finite mixture model we consider in this work. In this section, we provide just enough detail to state Theorem 2.1 and leave the precise probabilistic details for Section 3. Let g be a mixing measure $g := \sum_{j=1}^{k} p_j \delta_{\theta_j}$ on a parameter space Θ with $p_j \in [0, 1]$ and $\sum_{j=1}^{k} p_j = 1$, and let $\Psi = \{\psi_\theta : \theta \in \Theta\}$ be a family of component distributions dominated by a σ-finite measure µ. Then we can express a finite mixture f of the components as

$$f = \int_\Theta \psi_\theta \, dg(\theta) = \sum_{j=1}^{k} p_j \psi_{\theta_j}.$$

Consider a Bayesian model with a prior distribution Π on the set of all mixing measures $\mathcal{G}$ on Θ with finitely many atoms, i.e., $g \sim \Pi$, and likelihood corresponding to conditionally i.i.d. data from $f = \int \psi_\theta \, dg(\theta)$. The model assumes the likelihood is f, but the model is misspecified; i.e., the observations $X_N := (X_1, \ldots, X_N)$ are generated conditionally i.i.d. from a finite mixture $f_0$ of distributions not in Ψ.

Our main result is that under this misspecification of the likelihood, the posterior on the number of components $\Pi(k \mid X_N)$ diverges; i.e., for any finite $k \in \mathbb{N}$, $\Pi(k \mid X_N) \to 0$ as $N \to \infty$. We make only two requirements of the mixture model to guarantee this result: (1) the true data-generating distribution $f_0$ must be arbitrarily well-approximated by finite mixtures of Ψ, and (2) the family Ψ must satisfy mild regularity conditions that hold for popular mixture models (e.g., the family Ψ of Gaussians parametrized by mean and variance). We provide precise definitions of the assumptions needed for Theorem 2.1 to hold in Section 3, and a proof in Section 4.

Theorem 2.1. Suppose observations $X_N$ are generated i.i.d. from a distribution $f_0$ that is not a finite mixture of Ψ. Assume that:

Assumption 3.1: $f_0$ is in the KL-support of the prior Π, and
Assumption 3.6: Ψ is continuous, is mixture-identifiable, and has degenerate limits.

Then the posterior on the number of components diverges; i.e., for all $k \in \mathbb{N}$,

$$\Pi(k \mid X_N) \xrightarrow{N \to \infty} 0 \quad f_0\text{-a.s.} \qquad (1)$$

Note that the conditions of the theorem—although technical—are satisfied by a wide class of models used in practice. Assumption 3.1 requires that the prior Π places enough mass on mixtures near the true generating distribution $f_0$. Assumption 3.6 enforces regularity of the component family and is satisfied by many popular models used in practice, such as the multivariate Gaussian family.
Proposition 2.2. Let $\Psi = \{\mathcal{N}(\nu, \Sigma) : \nu \in \mathbb{R}^d, \Sigma \in S^d_{++}\}$ be the multivariate Gaussian family, where $S^d_{++} := \{\Sigma \in \mathbb{R}^{d \times d} : \Sigma = \Sigma^\top, \Sigma \succ 0\}$ is the set of d × d symmetric, positive definite matrices. Then Ψ satisfies Assumption 3.6.

Thus, provided that $f_0$ is in the KL-support of the prior, under a misspecified Gaussian mixture model, our main result implies that the posterior number of components diverges. While Proposition 2.2 is stated for Gaussian component distributions, we generalize it to mixture-identifiable location-scale families Ψ in Proposition A.2.

Additionally, we note that the divergence of the posterior given in Equation (1) is stronger than the behavior described in Miller and Harrison (2013) for DPMMs, i.e., that the posterior probability converges to 0 at the true number of components. In contrast, here we show that the posterior probability converges to 0 for any finite number of components. We conjecture that posterior divergence also holds for DPMMs, but the proof is outside of the scope of this paper.

Finally, while the result of Theorem 2.1 assumes that the model uses a fixed prior Π, in many practical modeling scenarios it may be natural to specify a prior $\Pi_N$ that depends on the amount of observed data $X_N$. In Section 5 we show that, if $f_0$ satisfies a modified KL-support condition with respect to the sequence of priors $\Pi_N$, the number of components also diverges in this setting.
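To fix ideas, the following minimal sketch (Python; all numerical values are illustrative choices of ours, not from the paper) evaluates a two-component Gaussian mixture $f = \sum_j p_j \psi_{\theta_j}$ next to a Laplace density playing the role of $f_0$: the data-generating distribution lies outside the model's component family, which is exactly the misspecification the theorem concerns.

```python
import numpy as np
from scipy.stats import norm, laplace

# A finite mixing measure g = sum_j p_j * delta_{theta_j} with k = 2 atoms;
# each theta_j = (mean, sd) indexes a Gaussian component psi_theta.
p = np.array([0.3, 0.7])                              # mixture weights (sum to 1)
means, sds = np.array([-1.0, 2.0]), np.array([1.0, 0.5])

def f(x):
    # Mixture density f(x) = sum_j p_j * psi_{theta_j}(x).
    return np.sum(p * norm.pdf(x[..., None], means, sds), axis=-1)

# The true data-generating density f0: a Laplace, which is not a finite
# Gaussian mixture, so the component likelihood is misspecified.
f0 = laplace(loc=0.0, scale=1.0).pdf

x = np.linspace(-5, 5, 11)
print(np.c_[x, f(x), f0(x)])   # compare model density vs truth pointwise
```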
3. Precise setup and assumptions in Theorem 2.1.
This section makes the details of the modeling setup and each of the conditions in Theorem 2.1 precise.
3.1. Notation and setup.
Let $\mathcal{X}$ and Θ be Polish spaces for the observations and parameters, respectively, and endow both with their Borel σ-algebras. For a topological space $(\cdot)$, let $C(\cdot)$ be the bounded continuous functions from $(\cdot)$ into ℝ, and let $\mathcal{P}(\cdot)$ be the set of probability measures on $(\cdot)$, endowed with the weak topology metrized by the Prokhorov distance d (Ghosal and van der Vaart, 2017, Appendix A, p. 508) and the corresponding Borel σ-algebra. We use $f_i \Rightarrow f$ and $f_i \Longleftrightarrow f'_i$ to denote $\lim_{i\to\infty} d(f_i, f) = 0$ and $\lim_{i\to\infty} d(f_i, f'_i) = 0$, respectively, for $f_i, f'_i, f \in \mathcal{P}(\cdot)$. We assume that the family of distributions $\Psi = \{\psi_\theta : \theta \in \Theta\}$ is absolutely continuous with respect to a σ-finite base measure µ, i.e., $\psi_\theta \ll \mu$ for all θ ∈ Θ, and that for measurable $A \subseteq \mathcal{X}$, $\psi_\theta(A)$ is a measurable function on Θ. Define the measurable mapping $F : \mathcal{P}(\Theta) \to \mathcal{P}(\mathcal{X})$ from mixing measures to mixtures of Ψ, $F(g) = \int \psi_\theta \, dg(\theta)$. Let $\mathcal{G}$ be the set of atomic probability measures on Θ with finitely many atoms, and let $\mathcal{F}$ be the set of finite mixtures of Ψ; define $\mathcal{G}^*$ and $\mathcal{F}^*$ analogously for countable mixtures.

In the Bayesian finite mixture model from Section 2, a mixing measure $g \sim \Pi$ is generated from a prior measure Π on $\mathcal{G}$, and $f = F(g)$ is a likelihood distribution. The posterior distribution on the mixing measure is, for all measurable $A \subseteq \mathcal{G}$,

$$\Pi(A \mid X_N) = \frac{\int_A \prod_{n=1}^N \frac{df}{d\mu}(X_n) \, d\Pi(g)}{\int_{\mathcal{G}} \prod_{n=1}^N \frac{df}{d\mu}(X_n) \, d\Pi(g)}, \qquad (2)$$

where $\frac{df}{d\mu}$ is the density of $f = F(g)$ with respect to µ. This posterior on the mixing measure $g \in \mathcal{G}$ induces a posterior on the number of components $k \in \mathbb{N}$ by counting the number of atoms in g, and it also induces a posterior on mixtures $f \in \mathcal{F}$ via the pushforward through the mapping F. We overload the notation $\Pi(\cdot \mid X_N)$ to refer to all of these posterior distributions and $\Pi(\cdot)$ to refer to prior distributions; the meaning should be clear from context.
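As a concrete (if highly simplified) illustration of Equation (2) and the induced posterior on k, the sketch below computes the component-count posterior exactly by brute-force enumeration after discretizing Θ to a small grid and restricting to k ≤ 2. The grid, the uniform prior over grid mixing measures, and the equal prior mass on k = 1 and k = 2 are all our own simplifying assumptions, not part of the model.

```python
import itertools
import numpy as np
from scipy.stats import norm, laplace
from scipy.special import logsumexp

rng = np.random.default_rng(0)
X = laplace.rvs(size=200, random_state=rng)   # data from f0 (Laplace): misspecified

# Discretized parameter space: unit-variance Gaussian components with means on
# a grid; mixing weights on a grid. The prior is uniform over this finite set.
thetas = np.linspace(-3.0, 3.0, 7)
weight_grid = np.linspace(0.1, 0.9, 5)

def log_lik(atoms, probs):
    # sum_n log f(X_n) for f = sum_j probs[j] * N(atoms[j], 1)
    comp = norm.logpdf(X[:, None], loc=np.asarray(atoms)[None, :])
    return np.logaddexp.reduce(comp + np.log(probs)[None, :], axis=1).sum()

log_liks = {1: [], 2: []}
for t in thetas:                                    # k = 1 mixing measures
    log_liks[1].append(log_lik([t], np.array([1.0])))
for t1, t2 in itertools.combinations(thetas, 2):    # k = 2 mixing measures
    for w in weight_grid:
        log_liks[2].append(log_lik([t1, t2], np.array([w, 1.0 - w])))

# Posterior over k via Equation (2): average the likelihood over the (uniform)
# prior within each k, then normalize across k (equal prior mass on k = 1, 2).
log_marg = {k: logsumexp(v) - np.log(len(v)) for k, v in log_liks.items()}
norm_c = logsumexp(list(log_marg.values()))
print({k: float(np.exp(v - norm_c)) for k, v in log_marg.items()})
```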
3.2. Model assumptions.
The first assumption of Theorem 2.1 is that while the true data-generating distribution $f_0$ is not contained in the model class ($f_0 \notin \mathcal{F}$), it lies on the boundary of the model class. In particular, we assume $f_0$ is in the KL-support of the prior Π. Denote the Kullback-Leibler (KL) divergence between probability measures $f_0$ and f as

$$\mathrm{KL}(f_0, f) := \begin{cases} \int \log\left(\frac{df_0}{df}\right) df_0 & f_0 \ll f \\ \infty & \text{otherwise.} \end{cases}$$

Assumption 3.1. For all ε > 0, the prior distribution Π satisfies $\Pi(f \in \mathcal{F} : \mathrm{KL}(f_0, f) < \varepsilon) > 0$.

We use Assumption 3.1 in the proof of Theorem 2.1 primarily to ensure that the Bayesian posterior is consistent for $f_0$. Note that Assumption 3.1 is fairly weak in practice. Intuitively, it just requires that the family Ψ is rich enough so that mixtures of Ψ can approximate $f_0$ arbitrarily well, and that the prior Π places sufficient mass on those mixtures close to $f_0$. For Bayesian mixture modeling, Ghosal et al. (1999, Theorem 3), Tokdar (2006, Theorem 3.2), Wu and Ghosal (2008, Theorem 2.3), and Petralia et al. (2012, Theorem 1) provide conditions needed to satisfy Assumption 3.1.

The second assumption of Theorem 2.1 is that the family of component distributions Ψ is well-behaved. This assumption has three stipulations. First, the mapping $\theta \mapsto \psi_\theta$ must be continuous; this condition essentially asserts that similar parameter values θ must result in similar component distributions $\psi_\theta$.

Definition 3.2. The family Ψ is continuous if the map $\theta \mapsto \psi_\theta$ is continuous.

Second, the family Ψ must be mixture-identifiable, which guarantees that each mixture $f \in \mathcal{F}$ is associated with a unique mixing measure $g \in \mathcal{G}$.

Definition 3.3. The family Ψ is mixture-identifiable if the mapping $F(g) = \int \psi_\theta \, dg(\theta)$ restricted to finite mixtures, $F : \mathcal{G} \to \mathcal{F}$, is a bijection.

In practice, one should always use an identifiable mixture model for clustering; without identifiability, the task of learning the number of components is ill posed. And many models satisfy mixture-identifiability, such as finite mixtures of the multivariate Gaussian family (Yakowitz and Spragins, 1968), the Cauchy family (Yakowitz and Spragins, 1968), the gamma family (Teicher, 1963), the generalized logistic family, the generalized Gumbel family, the Weibull family, and the von Mises family (Ho and Nguyen, 2016, Theorem 3.3). A number of authors (e.g., Chen (1995); Ishwaran et al. (2001); Nguyen (2013); Ho and Nguyen (2016); Guha et al. (2019); Heinrich and Kahn (2018)) appeal to stronger notions of identifiability for mixtures than Definition 3.3. But, to show posterior divergence in the present work, we do not require conditions stronger than Definition 3.3.
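For a rough numerical sense of Assumption 3.1's approximability requirement, the sketch below fits k-component Gaussian mixtures to Laplace data by maximum likelihood (scikit-learn's EM is only a stand-in here for mass placed by a prior) and Monte Carlo estimates $\mathrm{KL}(f_0, f)$: the estimate shrinks as k grows, consistent with $f_0$ lying on the boundary of the model class, while remaining positive for each finite k.

```python
import numpy as np
from scipy.stats import laplace
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
train = laplace.rvs(size=20000, random_state=rng)[:, None]
test = laplace.rvs(size=20000, random_state=rng)

for k in [1, 2, 4, 8]:
    gm = GaussianMixture(n_components=k, random_state=0).fit(train)
    # Monte Carlo estimate of KL(f0, f) = E_{f0}[log f0(X) - log f(X)].
    kl = np.mean(laplace.logpdf(test) - gm.score_samples(test[:, None]))
    print(k, kl)
```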
The third stipulation—that the family Ψ has degenerate limits—guarantees that a "poorly behaved" sequence of parameters $(\theta_i)_{i\in\mathbb{N}}$ creates a likewise "poorly behaved" sequence of distributions $(\psi_{\theta_i})_{i\in\mathbb{N}}$. This condition allows us to rule out such sequences in the proof of Theorem 2.1, and is the essential regularity condition to guarantee that a sequence of finite mixtures of at most k components cannot approximate $f_0$ arbitrarily closely.

Definition 3.4. A sequence of distributions $(\psi_i)_{i=1}^\infty$ is µ-wide if for any closed set C such that µ(C) = 0 and any sequence of distributions $(\varphi_i)_{i=1}^\infty$ such that $\psi_i \Longleftrightarrow \varphi_i$, $\limsup_{i\to\infty} \varphi_i(C) = 0$.

Definition 3.5. The family Ψ has degenerate limits if for any tight, µ-wide sequence $(\psi_{\theta_i})_{i\in\mathbb{N}}$, we have that $(\theta_i)_{i\in\mathbb{N}}$ is relatively compact.

The contrapositive of Definition 3.5 provides an intuitive explanation of the condition: as $i \to \infty$, for any sequence of parameters $\theta_i$ that eventually leaves every compact set $K \subseteq \Theta$, either the $\psi_{\theta_i}$ become "arbitrarily flat" (not tight) or "arbitrarily peaky" (not µ-wide). For example, consider the family Ψ of Gaussians on ℝ with Lebesgue measure µ. If the variance of $\psi_{\theta_i}$ shrinks as i grows, the sequence of distributions converges weakly to a sequence of point masses (not dominated by the Lebesgue measure). If either the variance or the mean diverges, the distributions flatten out and the sequence is not tight. We use the fact that these are the only two possibilities when a sequence of parameters is poorly behaved (not relatively compact) in the proof of Theorem 2.1.

These three stipulations together yield Assumption 3.6.

Assumption 3.6. The mixture component family Ψ is continuous, is mixture-identifiable, and has degenerate limits.
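The two failure modes in this Gaussian example are easy to see numerically. In the sketch below (the scale sequences are our own illustrative choices), a shrinking-variance sequence piles its mass into an arbitrarily small neighborhood of the µ-null set {0}, while a growing-variance sequence leaks mass out of every fixed compact set.

```python
import numpy as np
from scipy.stats import norm

# Two "poorly behaved" sequences of Gaussians psi_i on R (mu = Lebesgue measure).
for i in [1, 10, 100, 1000, 10000]:
    peaky = norm(loc=0.0, scale=1.0 / i)  # variance -> 0: weak limit is a point mass
    flat = norm(loc=0.0, scale=float(i))  # variance -> infinity: sequence is not tight
    mass_near_null_set = peaky.cdf(1e-3) - peaky.cdf(-1e-3)  # -> 1 ("arbitrarily peaky")
    mass_in_compact = flat.cdf(10.0) - flat.cdf(-10.0)       # -> 0 ("arbitrarily flat")
    print(i, round(mass_near_null_set, 4), round(mass_in_compact, 4))
```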
4. Proof of Theorem 2.1.
The proof has two essential steps. The first is to show that the Bayesian posterior is weakly consistent for the mixture $f_0$; i.e., for any weak neighborhood U of $f_0$, the sequence of posterior distributions satisfies

$$\Pi(U \mid X_N) \xrightarrow{N\to\infty} 1, \quad f_0\text{-a.s.} \qquad (3)$$

By Ghosh and Ramamoorthi (2003, Theorem 4.4.2), weak consistency for $f_0$ is guaranteed directly by Assumption 3.1 and the fact that Ψ is dominated by a σ-finite measure µ. The second step is to show that for any finite $k \in \mathbb{N}$, there exists a weak neighborhood U of $f_0$ containing no mixtures of the family Ψ with at most k components. Together, these steps show that the posterior probability of the set of all k-component mixtures converges to 0 $f_0$-a.s. as the amount of observed data grows.

We provide a proof of the second step. To begin, note that Assumption 3.1 has two additional implications about $f_0$ beyond Equation (3). First, $f_0$ must be absolutely continuous with respect to the dominating measure µ; if it were not, then there would exist a measurable set A such that $f_0(A) > 0$ and µ(A) = 0. Since µ dominates Ψ, any $f \in \mathcal{F}$ satisfies f(A) = 0. Therefore $\mathrm{KL}(f_0, f) = \infty$, and the prior support condition cannot hold. Second, it implies that $f_0$ can be arbitrarily well-approximated by finite mixtures under the weak metric; i.e., there exists a sequence of finite mixtures $f_i \in \mathcal{F}$, $i \in \mathbb{N}$, such that $f_i \Rightarrow f_0$ as $i \to \infty$. This holds because $\sqrt{\mathrm{KL}(f_0, f)} \geq \mathrm{TV}(f_0, f) \geq d(f_0, f)$.

Now suppose the contrary of the claim for the second step, i.e., that there exists a sequence $(f_i)_{i=1}^\infty$ of mixtures of at most k components from Ψ such that $f_i \Rightarrow f_0$. By mixture-identifiability, we have a sequence of mixing measures $g_i$ with at most k atoms such that $F(g_i) = f_i$. Suppose first that the atoms of the sequence $(g_i)_{i\in\mathbb{N}}$ either stay in a compact set or have weights converging to 0. More precisely, suppose there exists a compact set $K \subseteq \Theta$ such that

$$g_i(\Theta \setminus K) \to 0. \qquad (4)$$

Decompose each $g_i = g_{i,K} + g_{i,\Theta\setminus K}$ such that $g_{i,K}$ is supported on K and $g_{i,\Theta\setminus K}$ is supported on $\Theta \setminus K$. Define the sequence of probability measures $\hat g_{i,K} = g_{i,K} / g_{i,K}(\Theta)$ for sufficiently large i such that the denominator is nonzero. Then Equation (4) implies $F(\hat g_{i,K}) \Rightarrow f_0$.

Since Ψ is continuous and mixture-identifiable, the restriction of F to the domain $\mathcal{G}$ is continuous and invertible; and since K is compact, the elements of $(\hat g_{i,K})_{i\in\mathbb{N}}$ are contained in a compact set $\mathcal{G}_K \subseteq \mathcal{G}$ by Prokhorov's theorem (Ghosal and van der Vaart, 2017, Theorem A.4). Therefore $F(\mathcal{G}_K) = \mathcal{F}_K$ is also compact, and the map F restricted to the domain $\mathcal{G}_K$ is uniformly continuous with a uniformly continuous inverse by Rudin (1976, Theorems 4.14, 4.17, 4.19). Next, since $F(\hat g_{i,K}) \Rightarrow f_0$, the sequence $F(\hat g_{i,K})$ is Cauchy in $\mathcal{F}_K$; and since $F^{-1}$ is uniformly continuous on $\mathcal{F}_K$, the sequence $\hat g_{i,K}$ must also be Cauchy in $\mathcal{G}_K$. Since $\mathcal{G}_K$ is compact, $\hat g_{i,K}$ converges in $\mathcal{G}_K$. Lemma 4.1 below guarantees that the convergent limit $g_K$ is also a mixing measure with at most k atoms; continuity of F implies that $F(g_K) = f_0$, which is a contradiction, since by assumption $f_0$ is not representable as a finite mixture of Ψ.

Lemma 4.1. Suppose $\varphi, (\varphi_i)_{i\in\mathbb{N}}$ are Borel probability measures on a Polish space such that $\varphi_i \Rightarrow \varphi$ and $\sup_i |\mathrm{supp}\,\varphi_i| \leq k \in \mathbb{N}$. Then $|\mathrm{supp}\,\varphi| \leq k$.
Suppose $|\mathrm{supp}\,\varphi| > k$. Then we can find k + 1 distinct points $x_1, \ldots, x_{k+1} \in \mathrm{supp}\,\varphi$. Pick any metric ρ on the Polish space, and denote the minimum pairwise distance between the points ε. Then for each point $j = 1, \ldots, k+1$ define the bounded, continuous function $h_j(x) = 0 \vee (1 - 2\varepsilon^{-1}\rho(x, x_j))$. Since $x_j \in \mathrm{supp}\,\varphi$, we have that $\int h_j \, d\varphi > 0$. Weak convergence $\varphi_i \Rightarrow \varphi$ therefore implies $\min_{j=1,\ldots,k+1} \liminf_{i\to\infty} \int h_j \, d\varphi_i > 0$. But the $h_j$ are nonzero on disjoint sets, and each $\varphi_i$ has at most k atoms; the pigeonhole principle yields a contradiction.

Now we consider the remaining case: for all compact sets $K \subseteq \Theta$, $g_i(\Theta \setminus K) \not\to 0$. Therefore there exists a sequence of parameters $(\theta_i)_{i=1}^\infty$ that is not relatively compact such that $\limsup_{i\to\infty} g_i(\{\theta_i\}) > 0$. By Assumption 3.6, the sequence $(\psi_{\theta_i})_{i\in\mathbb{N}}$ is either not tight or not µ-wide. If $(\psi_{\theta_i})_{i\in\mathbb{N}}$ is not tight then $f_i = F(g_i)$ is not tight, and by Prokhorov's theorem $f_i$ cannot converge to a probability measure, which contradicts $f_i \Rightarrow f_0$. If $(\psi_{\theta_i})_{i\in\mathbb{N}}$ is not µ-wide then $f_i = F(g_i)$ is not µ-wide. Denote by $(\varphi_i)_{i\in\mathbb{N}}$ the singular sequence associated with $(f_i)_{i\in\mathbb{N}}$, and by C the closed set such that $\limsup_{i\to\infty} \varphi_i(C) > 0$, µ(C) = 0, and $\varphi_i \Longleftrightarrow f_i$ per Definition 3.4. Since $f_0 \ll \mu$, $f_0(C) = 0$. But $f_i \Rightarrow f_0$ implies that $\varphi_i \Rightarrow f_0$, so $\limsup_{i\to\infty} \varphi_i(C) \leq f_0(C) = 0$ by the Portmanteau theorem (Ghosal and van der Vaart, 2017, Theorem A.2). This is a contradiction.
5. Extension to priors that vary with N.

Our main result (i.e., Theorem 2.1) applies to the setting of a fixed prior Π. However, it is often natural to specify a prior distribution that changes with N (e.g., Roeder and Wasserman (1997), Richardson and Green (1997), and Miller and Harrison (2018, Section 7.2.1)). Corollary 5.2 below demonstrates that a result nearly identical to Theorem 2.1 holds for priors that are allowed to vary with N, provided that $f_0$ is in the KL-support of the sequence of priors $\Pi_N$. The only difference is that our result in this case is slightly weaker: we show that the posterior number of components diverges in probability rather than almost surely.

Assumption 5.1. For all ε > 0, the sequence of prior distributions $\Pi_N$ satisfies $\liminf_{N\to\infty} \Pi_N(f : \mathrm{KL}(f_0, f) < \varepsilon) > 0$.

Corollary 5.2. Suppose in the setting of Theorem 2.1 we replace Assumption 3.1 with Assumption 5.1. Then the posterior on the number of components diverges in $f_0$-probability; i.e., for all $k \in \mathbb{N}$, $\Pi(k \mid X_N) \xrightarrow{N\to\infty} 0$ in $f_0$-probability.
Since for any ε > 0, $\liminf_{N\to\infty} \Pi_N(f : \mathrm{KL}(f_0, f) < \varepsilon) > 0$, Ghosal and van der Vaart (2017, Theorem 6.17, Lemma 6.26, and Example 6.20) imply that the posterior is weakly consistent at $f_0$ in probability; i.e., for any weak neighborhood U of $f_0$, $\Pi(U \mid X_N) \xrightarrow{N\to\infty} 1$ in $f_0$-probability. Assumption 5.1 also implies that for sufficiently large N, $f_0$ is a weak limit of finite mixtures in $\mathcal{F}$. The remainder of the proof is identical to that of Theorem 2.1.
6. Related work.
In this work, we consider FMMs with a prior on the number of components. We consider the case where this prior does not vary with the number of data points and has support on all strictly-positive component counts. Posterior consistency for the mixture density (Ghosal et al., 1999; Lijoi et al., 2004; Kruijer et al., 2010) and the mixing measure (Nguyen, 2013; Ho and Nguyen, 2016; Guha et al., 2019) in a wide class of mixture models is well established. But posterior consistency for the number of components is not as thoroughly characterized. There are several results establishing consistency for the number of components in well-specified FMMs. Nobile (1994, Proposition 3.5) and Guha et al. (2019, Theorem 3.1a) demonstrate that FMMs exhibit posterior consistency for the number of components when the model is well specified and Ψ is mixture-identifiable. The present work characterizes the behavior of the FMM posterior on the number of components under component misspecification.

A related approach for handling a finite but unknown number of components is to specify an overfitted mixture model, i.e., a finite mixture model with a number of components in excess of the true number (e.g., Ishwaran et al. (2001); Rousseau and Mengersen (2011); Malsiner-Walli et al. (2016)). In the setting of overfitted FMMs with well-specified component densities, Rousseau and Mengersen (2011, Theorem 1) show that under a stronger identifiability condition than mixture-identifiability and additional regularity assumptions on the model, the posterior will concentrate properly by emptying the extra components. Ishwaran et al. (2001, Theorem 1) consider the setting of estimating the number of components with the assumption of a known upper bound on the true number of components and well-specified components, and show that the posterior does not asymptotically underestimate the number of components when assuming a stronger identifiability condition than mixture-identifiability and a KL-support condition on the prior. In addition to assuming component misspecification, we here consider only priors that place full support on the natural numbers, as opposed to priors that place an upper bound on the number of components.

Frühwirth-Schnatter (2006) provides a wide-ranging review of methodology for finite mixture modeling. Frühwirth-Schnatter (2006, e.g., Section 7.1) observes that, in practice, the learned number of mixture components will generally be higher than the true generating number of components when the likelihood is misspecified—but does not prove a result about the number of components under misspecification. Similarly, Miller and Harrison (2018, Section 7.1.5) discuss the issue of estimating the number of components in FMMs under model misspecification and state that the posterior number of components is expected to diverge to infinity as the number of samples increases, but no proof of this asymptotic behavior is provided.

Finally, a growing body of work is focused on developing more robust FMMs and related mixture models. In order to address the issue of component misspecification, a number of authors propose using finite mixture models with nonparametric component densities, e.g., components that are themselves mixtures of Gaussians (Bartolucci, 2005; Di Zio et al., 2007; Malsiner-Walli et al., 2017).
7. Experiments.
In this section, we demonstrate one of the primary practical implications of our theory: the inferred number of components can change drastically depending on the amount of observed data in misspecified finite mixture models.

For all experiments below, we use a finite mixture model with a multivariate Gaussian component family having diagonal covariance matrices and a conjugate prior on each dimension. In particular, consider number of components k, mixture weights $p \in \mathbb{R}^k$, Gaussian component precisions $\tau \in \mathbb{R}^{k\times D}_+$ and means $\theta \in \mathbb{R}^{k\times D}$, labels $Z \in \{1, \ldots, k\}^N$, and data $X \in \mathbb{R}^{N\times D}$. Then the probabilistic generative model is

$$k \sim \mathrm{Geom}(r), \quad p \sim \mathrm{Dirichlet}_k(\gamma, \ldots, \gamma), \quad \tau_{jd} \overset{\text{i.i.d.}}{\sim} \mathrm{Gam}(\alpha, \beta), \quad \theta_{jd} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, \kappa_{jd}^{-1}),$$
$$Z_n \overset{\text{i.i.d.}}{\sim} \mathrm{Categorical}(p), \quad X_{nd} \overset{\text{ind}}{\sim} \mathcal{N}(\theta_{Z_n d}, \tau_{Z_n d}^{-1}),$$

where j ranges over $1, \ldots, k$, d ranges over $1, \ldots, D$, and n ranges over $1, \ldots, N$. For posterior inference, we use a Julia implementation of split-merge collapsed Gibbs sampling (Neal, 2000; Jain and Neal, 2004) from Miller and Harrison (2018).* The model and inference algorithm are described in more detail in Miller and Harrison (2018, Sec. 7.2.2, Algorithm 1). Note that we use this model primarily to illustrate the problem of posterior divergence under model misspecification; it should not be interpreted as a carefully-specified model for the data examples that we study. Also note that while the empirical examples below involve Gaussian FMMs, our theory applies to a more general class of component distributions. A forward simulation of this generative model is sketched below.

*Code available at https://github.com/jwmi/BayesianMixtures.jl.
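The following is a minimal forward simulation of the generative model above, written in Python rather than the Julia used for inference. The hyperparameter values are hypothetical placeholders, and we take $\kappa_{jd}$ equal to a single constant κ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 2
# Hypothetical hyperparameters (placeholders, not the paper's values):
r, gamma, alpha, beta, m, kappa = 0.1, 1.0, 2.0, 1.0, 0.0, 1.0

k = rng.geometric(r)                                 # k ~ Geom(r)
p = rng.dirichlet(np.full(k, gamma))                 # p ~ Dirichlet_k(gamma, ..., gamma)
tau = rng.gamma(alpha, 1.0 / beta, size=(k, D))      # tau_jd ~ Gam(alpha, beta), beta as a rate
theta = rng.normal(m, np.sqrt(1.0 / kappa), size=(k, D))  # theta_jd ~ N(m, kappa^{-1}), scalar kappa
Z = rng.choice(k, size=N, p=p)                       # Z_n ~ Categorical(p)
X = rng.normal(theta[Z], np.sqrt(1.0 / tau[Z]))      # X_nd ~ N(theta_{Z_n,d}, tau_{Z_n,d}^{-1})
print(k, X.shape)
```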
7.1. Synthetic data.
Our first experiments are on synthetic data and are inspired by Figure 3 of Miller and Dunson (2019), which investigates the posterior of a mixture of perturbed Gaussians. Here we study the effects of varying data set sizes under both well-specified and misspecified models. We generated data sets of five increasing sizes N from 1- and 2-component univariate Gaussian and Laplace mixture models, where the 1-component distributions have mean 0 and scale 1, and the 2-component distributions have two components with distinct means, scales, and mixing weights. We generated the sequence of data sets such that each was a subset of the next, larger data set in the sequence. Following Miller and Harrison (2018, Section 7.2.1), we set the hyperparameters of the Bayesian finite mixture model as follows: $m = \frac{1}{2}(\max_{n\in[\tilde N]} X_n + \min_{n\in[\tilde N]} X_n)$ and $\kappa = (\max_{n\in[\tilde N]} X_n - \min_{n\in[\tilde N]} X_n)^{-2}$ for a fixed subset size Ñ (so that the prior does not change across the nested data sets), with α = 2, r = 0.1, γ = 1, and β drawn from a gamma hyperprior that depends on κ. A sketch of the data-generation protocol appears below.
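A sketch of this data-generation protocol follows; the numeric sizes, means, scales, and weights here are illustrative placeholders of ours rather than the experiments' values, but the nesting structure matches the construction described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 1000, 10000]                                     # placeholder data set sizes
means, scales, weights = [-2.0, 2.0], [1.5, 1.0], [0.5, 0.5]  # placeholder parameters

def sample_mixture(n, family):
    z = rng.choice(2, size=n, p=weights)
    loc, scale = np.take(means, z), np.take(scales, z)
    if family == "gaussian":
        return rng.normal(loc, scale)
    return rng.laplace(loc, scale)   # Laplace data: misspecified for a Gaussian FMM

# Generate once at the largest size; each smaller data set is a prefix of
# the next larger one, matching the nested construction in the text.
full = {fam: sample_mixture(max(sizes), fam) for fam in ("gaussian", "laplace")}
datasets = {fam: [x[:n] for n in sizes] for fam, x in full.items()}
print({fam: [d.shape for d in ds] for fam, ds in datasets.items()})
```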
Fig 1: Upper and middle rows: posterior probability of the number of components k for Gaussian mixture models with a fixed prior, fit to univariate data generated from (a, b) 1- and 2-component Gaussian mixture models and (c, d) 1- and 2-component Laplace mixture models. Lower row: posterior probability of the number of components for Gaussian mixture models with a varying prior, fit to (e) 2-component univariate Gaussian mixture data and (f) 2-component univariate Laplace mixture data.
Fig 2: Posterior probability of the number of components k for Gaussian mixture models, fit to (a) mouse cortex single-cell RNA sequencing data and (b) lung tissue gene expression data.

We refer to Miller and Harrison (2018, Section 7.2.1) for additional details on the choice of model hyperparameters and the sampling of β. We ran a total of 100,000 Markov chain Monte Carlo iterations per data set; we discarded the first 10,000 iterations as burn-in.

The results of the simulations are shown in the top and middle rows of Figure 1. For the data generated from the 1-component models, the posterior on the number of components concentrates around 1 in the case of Gaussian-generated data as the sample size increases (Figure 1a), whereas the posterior on the number of components diverges for the Laplace data (Figure 1c). We observe similar behavior in the 2-component case, where the posterior concentrates around the correct value in the Gaussian case (Figure 1b) but not the Laplace case (Figure 1d).

Finally, we considered the Gaussian mixture model above but with a prior that varies with the data. Specifically, for the prior on the means, we set the hyperparameters to $m_N = \frac{1}{2}(\max_{n\in[N]} X_n + \min_{n\in[N]} X_n)$ and $\kappa_N = (\max_{n\in[N]} X_n - \min_{n\in[N]} X_n)^{-2}$, which is the setting considered by Miller and Harrison (2018, Section 7.2.1); the other hyperparameters were set to the same values as above. We used the 2-component Gaussian and Laplace data sets constructed above for the fixed-prior case. The bottom row of Figure 1 shows the posterior number of components under this prior for the well-specified and misspecified cases; again we observe that the posterior diverges under model misspecification.
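In code, the data-dependent hyperparameters take the following form (a minimal sketch assuming the midrange and inverse-squared-range forms above, with a toy data vector standing in for the observations):

```python
import numpy as np

X = np.random.default_rng(0).laplace(size=1000)  # toy stand-in for the observed data

# Prior hyperparameters on component means, recomputed from all N observations:
m_N = 0.5 * (X.max() + X.min())       # midrange of the observed data
kappa_N = (X.max() - X.min()) ** -2   # precision from the squared data range
print(m_N, kappa_N)
```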
Computational biologists are interested in classifyingcell types by applying clustering techniques to gene expression data (Yeung et al.,2001; Medvedovic and Sivaganesan, 2002; McLachlan et al., 2002; Medvedovic et al.,2004; Rasmussen et al., 2008; de Souto et al., 2008; McNicholas and Murphy, 2010).In our next set of experiments, we apply the Gaussian finite mixture model to two D. CAI, T. CAMPBELL, T. BRODERICK gene expression data sets: (1) single-cell RNA sequencing data from mouse cortex andhippocampus cells (Zeisel et al., 2015) with the same feature selection as Prabhakaranet al. (2016) ( N = 3008 , D = 558 , 11,000 Gibbs sampling steps with 1,000 of thoseas burn-in) and (2) mRNA expression data from human lung tissue (Bhattacharjeeet al., 2001) ( N = 203 , D = 1543 , and 10,000 Gibbs sampling steps with 1,000of those burn-in). Our experiments here represent a simplified version of previousmixture model analyses for these and other related data sets (de Souto et al., 2008;Prabhakaran et al., 2016; Armstrong et al., 2001; Miller and Harrison, 2018).As these gene expression data sets contain counts, we first transformed the datato real numerical values. In particular, we used a base-2 log transform followed bystandardization—such that each dimension of the data had zero mean and unitvariance—per standard practices (e.g., Miller and Harrison (2018)). Then to examinethe effect of increasing data set size on inferential results, we randomly sampledsubsets of increasing size without replacement; each smaller subset was contained inthe next larger data set. For both data sets, we used hyperparameters α = 1 , β = 1 , m = 0 , κ jd = τ jd , r = 0 . , and γ = 1 .For the single-cell RNAseq data set, the posterior on the number of components isshown in Figure 2a. Here the ground truth number of clusters is captured when thedata set size is N = 100 . But as predicted by our theory, as we increase the numberof data points, the posterior number of components diverges.The posterior on the number of components for the lung gene expression data isshown in Figure 2b. Again we find that on the smallest data subsets, the posteriorappears to capture the ground truth number of clusters, but that as we examinemore and more data, the posterior diverges. While diagonal covariance Gaussiancomponents are likely not rich enough to model the cluster shapes, our purpose hereis to capture the effect of model misspecification on the posterior on the number ofcomponents. Thus, these examples suggest the need for more robust analyses.
8. Discussion.
We have shown that the Bayesian posterior distribution for the number of components in finite mixtures diverges when the mixture component family is misspecified. Since misspecification is almost unavoidable in real applications, it follows that finite mixture models are typically unreliable for estimating the number of components. In practice, our conclusion implies that inferential results on the number of components can change drastically depending on the size of the data set, calling into question the usefulness of these results in application.

A number of open questions remain. Because our analysis is inherently asymptotic, it is possible that the Bayesian posterior on the number of components may still provide useful inferences for a finite sample—for instance, if care is taken to account for the aforementioned dependence of inferential conclusions on data set size. Additionally, a number of authors have recently proposed robust Bayesian inference methods to mitigate likelihood misspecification (Grünwald and van Ommen, 2017; Miller and Dunson, 2019; Wang et al., 2017); it remains to better understand connections between our results and these methods.
Acknowledgments.
We thank Jeff Miller for helpful conversations and comments on a draft of this paper. D. Cai was supported in part by a Google Ph.D. Fellowship in Machine Learning. T. Campbell was supported by a National Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and an NSERC Discovery Launch Supplement. T. Broderick was supported in part by ONR grant N00014-17-1-2072, an MIT Lincoln Laboratory Advanced Concepts Committee Award, a Google Faculty Research Award, the CSAIL–MSR Trustworthy AI Initiative, and an ARO YIP award.

APPENDIX A: PROOF OF PROPOSITION 2.2

Consider the multivariate Gaussian family
$\Psi = \{\mathcal{N}(\nu, \Sigma) : \nu \in \mathbb{R}^d, \Sigma \in S^d_{++}\}$ with parameter space $\Theta = \mathbb{R}^d \times S^d_{++}$, equipped with the topology induced by the Euclidean metric. Let $(\lambda_j(\Sigma))_{j=1}^d$ denote the eigenvalues of the covariance matrix $\Sigma \in S^d_{++}$, which satisfy $\infty > \lambda_1(\Sigma) \geq \cdots \geq \lambda_d(\Sigma) > 0$. Since the family of Gaussians is continuous and mixture-identifiable (Yakowitz and Spragins, 1968, Proposition 2), the main condition we need to verify is that the family has degenerate limits (Definition 3.5). A useful fact is that if a sequence of Gaussian distributions is tight, then the sequences of means and of largest covariance eigenvalues are bounded.
Lemma A.1. Let $(\psi_i)_{i\in\mathbb{N}}$ be a sequence of Gaussian distributions with means $\nu_i \in \mathbb{R}^d$ and covariances $\Sigma_i \in S^d_{++}$. If $(\psi_i)_{i\in\mathbb{N}}$ is a tight sequence of measures, then the sequences $(\nu_i)_{i\in\mathbb{N}}$ and $(\lambda_1(\Sigma_i))_{i\in\mathbb{N}}$ are bounded.
Let $Y_i$ denote a random variable with distribution $\psi_i$. For each covariance matrix $\Sigma_i$, consider its eigenvalue decomposition $\Sigma_i = U_i \Lambda_i U_i^\top$, where $U_i \in \mathbb{R}^{d\times d}$ is an orthonormal matrix and $\Lambda_i \in \mathbb{R}^{d\times d}$ is a diagonal matrix. Then the random variable $Z_i = U_i^\top Y_i$ has distribution $\mathcal{N}(U_i^\top \nu_i, \Lambda_i)$. If either $\|\nu_i\| = \|U_i^\top \nu_i\|$ is unbounded or $\|\Lambda_i\|_F$ is unbounded, then $(Z_i)$ is not tight (Billingsley, 1986, Example 25.10). Since $Z_i$ and $Y_i$ lie in any ball centered at the origin with the same probability, $(Y_i)$ is not tight.

We now show that the multivariate Gaussian family has degenerate limits.
If the parameters $(\theta_i)_{i\in\mathbb{N}}$ are not a relatively compact subset of Θ, then either some coordinate of the sequence of means $\nu_i$ diverges, $\lambda_1(\Sigma_i) \to \infty$, or $\lambda_d(\Sigma_i) \to 0$. If some coordinate of the mean $\nu_i$ diverges or the maximum eigenvalue diverges, i.e., $\lambda_1(\Sigma_i) \to \infty$, then the sequence $(\psi_{\theta_i})$ is not tight by Lemma A.1. On the other hand, if $\lambda_d(\Sigma_i) \to 0$ as $i \to \infty$, then $\psi_{\theta_i}$ converges weakly to a sequence of degenerate Gaussian measures that concentrate on $C_i = \{x \in \mathbb{R}^d : (x - \nu_i)^\top u_{d,i} = 0\}$, where $u_{d,i}$ is the d-th eigenvector of $\Sigma_i$. Note that $\mu(C_i) = 0$ for Lebesgue measure µ; so if we define $C = \cup_i C_i$ in the setting of Definition 3.4, the sequence is not µ-wide.
We can generalize Proposition 2.2 beyond multivariate Gaussians to mixture-identifiable location-scale families, as shown in Proposition A.2. Examples of such families include the multivariate Gaussian family, the Cauchy family, the logistic family, the von Mises family, and generalized extreme value families. The proof is similar to that of Proposition 2.2.
Proposition A.2. Suppose Ψ is a location-scale family that is mixture-identifiable and absolutely continuous with respect to Lebesgue measure µ, i.e.,

$$\frac{d\Psi}{d\mu} = \left\{ |\Sigma|^{-1/2} \, \varphi\!\left(\Sigma^{-1/2}(x - \nu)\right) : \nu \in \mathbb{R}^d, \Sigma \in S^d_{++} \right\},$$

where $\varphi : \mathbb{R}^d \to \mathbb{R}$ is a probability density function. Then Ψ satisfies Assumption 3.6.

REFERENCES
B. Aragam, C. Dan, P. Ravikumar, and E. P. Xing. Identifiability of nonparametric mixture models and Bayes optimal clustering. arXiv preprint arXiv:1802.04397, 2018.
S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. d. Boer, M. D. Minden, S. E. Sallan, E. S. Lander, T. R. Golub, and S. J. Korsmeyer. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1):41, 2001.
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3):803–821, 1993.
F. Bartolucci. Clustering univariate observations via mixtures of unimodal normal mixtures. Journal of Classification, 22(2):203–219, 2005.
A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E. Mark, E. Lander, W. Wong, B. Johnson, T. Golub, D. Sugarbaker, and M. Meyerson. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 98(24):13790–13795, 2001.
P. Billingsley. Probability and Measure. John Wiley and Sons, third edition, 1986.
C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. B. Kepler. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A, 73(8):693–701, 2008.
J. Chen. Optimal rate of convergence for finite mixture models. The Annals of Statistics, 23(1):221–233, 1995.
M. C. de Souto, I. G. Costa, D. S. de Araujo, T. B. Ludermir, and A. Schliep. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1):497, 2008.
M. Di Zio, U. Guarnera, and R. Rocci. A mixture of mixture models for a classification problem: The unity measure error. Computational Statistics & Data Analysis, 51(5):2573–2585, 2007.
S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Series in Statistics, 2006.
S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2017.
S. Ghosal, J. Ghosh, and R. Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.
J. Ghosh and R. Ramamoorthi. Bayesian Nonparametrics. Springer Series in Statistics, 2003.
C. Grazian, C. Villa, and B. Liseo. On a loss-based prior for the number of components in mixture models. Statistics & Probability Letters, 158:108656, 2020.
P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28(2):355–375, 2001.
P. Grünwald and T. van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.
P. D. Grünwald. Bayesian inconsistency under misspecification. In World Meeting of the International Society for Bayesian Analysis, 2006.
A. Guha, N. Ho, and X. Nguyen. On posterior contraction of parameters and interpretability in Bayesian mixture modeling. arXiv preprint arXiv:1901.05078, 2019.
P. Heinrich and J. Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. The Annals of Statistics, 46(6A):2844–2870, 2018.
N. Ho and X. Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1):271–307, 2016.
J. P. Huelsenbeck and P. Andolfatto. Inference of population structure under a Dirichlet process model. Genetics, 175(4):1787–1802, 2007.
J. H. Huggins and J. W. Miller. Using bagged posteriors for robust inference and model criticism. arXiv preprint arXiv:1912.07104, 2019.
H. Ishwaran, L. F. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96(456):1316–1332, 2001.
S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.
W. Kruijer, J. Rousseau, and A. van der Vaart. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225–1257, 2010.
A. Lijoi, I. Prünster, and S. Walker. Extending Doob's consistency theorem to nonparametric densities. Bernoulli, 10(4):651–663, 2004.
E. D. Lorenzen, P. Arctander, and H. R. Siegismund. Regional genetic structuring and evolutionary history of the impala Aepyceros melampus. Journal of Heredity, 97(2):119–132, 2006.
G. Malsiner-Walli, S. Frühwirth-Schnatter, and B. Grün. Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing, 26(1-2):303–324, 2016.
G. Malsiner-Walli, S. Frühwirth-Schnatter, and B. Grün. Identifying mixtures of mixtures using Bayesian estimation. Journal of Computational and Graphical Statistics, 26(2):285–295, 2017.
G. J. McLachlan, R. Bean, and D. Peel. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18(3):413–422, 2002.
P. D. McNicholas and T. B. Murphy. Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26(21):2705–2712, 2010.
M. Medvedovic and S. Sivaganesan. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9):1194–1206, 2002.
M. Medvedovic, K. Y. Yeung, and R. E. Bumgarner. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20(8):1222–1232, 2004.
J. W. Miller and D. B. Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
J. W. Miller and M. T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206, 2013.
J. W. Miller and M. T. Harrison. Inconsistency of Pitman-Yor process mixtures for the number of components. Journal of Machine Learning Research, 15(1):3333–3370, 2014.
J. W. Miller and M. T. Harrison. Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113(521):340–356, 2018.
S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and A. Raftery. Three types of gamma-ray bursts. The Astrophysical Journal, 508(1):314, 1998.
R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1):370–400, 2013.
A. Nobile. Bayesian analysis of finite mixture distributions. PhD thesis, Carnegie Mellon University, 1994.
A. Nobile. On the posterior distribution of the number of components in a finite mixture. The Annals of Statistics, 32(5):2044–2073, 2004.
A. Nobile. Bayesian finite mixtures: a note on prior specification and posterior computation. arXiv preprint arXiv:0711.0458, 2007.
A. Nobile and A. T. Fearnside. Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing, 17(2):147–162, 2007.
E. Otranto and G. M. Gallo. A nonparametric Bayesian approach to detect the number of regimes in Markov switching models. Econometric Reviews, 21(4):477–496, 2002.
J. Pella and M. Masuda. The Gibbs and split merge sampler for population mixture analysis from genetic data with incomplete baselines. Canadian Journal of Fisheries and Aquatic Sciences, 63(3):576–596, 2006.
F. Petralia, V. Rao, and D. B. Dunson. Repulsive mixtures. In Advances in Neural Information Processing Systems, pages 1889–1897, 2012.
S. Prabhakaran, E. Azizi, A. Carr, and D. Pe'er. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. In International Conference on Machine Learning, pages 1070–1079, 2016.
J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
C. Rasmussen, B. de la Cruz, Z. Ghahramani, and D. Wild. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):615–628, 2008.
S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792, 1997.
A. Rodriguez and D. B. Dunson. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1):145–178, 2011.
K. Roeder and L. Wasserman. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92(439):894–902, 1997.
J. Rousseau and K. Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710, 2011.
W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976.
M. Stephens. Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods. The Annals of Statistics, 28(1):40–74, 2000.
H. Teicher. Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1):244–248, 1961.
H. Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 1265–1269, 1963.
S. T. Tokdar. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā: The Indian Journal of Statistics, pages 90–110, 2006.
Y. Wang, A. Kucukelbir, and D. M. Blei. Reweighted data for robust probabilistic models. In International Conference on Machine Learning, pages 3646–3655, 2017.
M.-J. Woo and T. Sriram. Robust estimation of mixture complexity. Journal of the American Statistical Association, 101(476):1475–1486, 2006.
M.-J. Woo and T. Sriram. Robust estimation of mixture complexity for count data. Computational Statistics & Data Analysis, 51(9):4379–4392, 2007.
Y. Wu and S. Ghosal. Kullback Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics, 2:298–331, 2008.
E. P. Xing, K.-A. Sohn, M. I. Jordan, and Y.-W. Teh. Bayesian multi-population haplotype inference via a hierarchical Dirichlet process mixture. In International Conference on Machine Learning, pages 1049–1056, 2006.
S. J. Yakowitz and J. D. Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1):209–214, 1968.
K. Y. Yeung, D. R. Haynor, and W. L. Ruzzo. Validating clustering for gene expression data. Bioinformatics, 17(4):309–318, 2001.
A. Zeisel, A. B. Muñoz-Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling-Leffler, and S. Linnarsson. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226):1138–1142, 2015.