Finite mixture models do not reliably learn the number of components
By Diana Cai* (Princeton University), Trevor Campbell* (University of British Columbia), and Tamara Broderick (Massachusetts Institute of Technology)

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. A common suggestion is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent; that is, the posterior concentrates on the true generating number of components. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM component-count posterior diverges: the posterior probability of any particular finite number of latent components converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.

*First authorship is shared jointly by D. Cai and T. Campbell.
1. Introduction.
A probabilistic generative model is typically, of necessity, a simplification of the complex real-world phenomena that govern any observed data. This simplification facilitates tractable data analysis and discovery of meaningful and actionable patterns in data. But it also follows that typical models of real-world data sets are misspecified. And certain types of misspecification can be dangerous, in that they may lead to fundamentally inaccurate or misleading inferences.

For instance, Miller and Harrison (2013, 2014) caution against misspecification in mixture modeling. Mixture models are widely used to discover latent groups, or components, within a population. Often the number of components is unknown in advance, and one of the principal inferential goals is estimating and interpreting this number. For example, practitioners might wish to find the number of latent genetic populations (Pritchard et al., 2000; Lorenzen et al., 2006; Huelsenbeck and Andolfatto, 2007), gene tissue profiles (Yeung et al., 2001; Medvedovic and Sivaganesan, 2002), cell types (Chan et al., 2008; Prabhakaran et al., 2016), haplotypes (Xing et al., 2006), switching Markov regimes in US dollar exchange rate data (Otranto and Gallo, 2002), gamma-ray burst types (Mukherjee et al., 1998), or segmentation regions in an image (e.g., tissue types in an MRI scan (Banfield and Raftery, 1993)).
Keywords and phrases: finite mixture models, number of components, model misspecification, posterior asymptotics, Bayesian nonparametrics.
One way to capture what we know about the number of components is to take a Bayesian approach; the Bayesian posterior represents our state of belief about the value of interest after observing a data set. A natural check on a Bayesian analysis is to establish that—when the true, generating number of components is known—our posterior increasingly concentrates near the truth as the number of data points becomes arbitrarily large.

One popular Bayesian methodology uses a Dirichlet process mixture model (DPMM). The DPMM itself is constructed with a countable infinity of components. So practitioners instead use the posterior on the number of clusters (Pella and Masuda, 2006; Huelsenbeck and Andolfatto, 2007)—i.e., components represented in the observed data. For any N data points, a DPMM implicitly gives a prior over the number of clusters with support on $[N] := \{1, 2, \ldots, N\}$. Unfortunately, due to the latent infinitude of components in a DPMM, the DPMM cluster-count posterior concentrates strictly away from the true generating number of components when the generating number is finite (Miller and Harrison, 2013, 2014). In fact, Miller and Harrison (2013, 2014) show the cluster-count posterior probability on the generating number of components goes to 0 as the amount of data grows.

Miller and Harrison (2018) suggest that a potential alternative prior may resolve these difficulties. Namely, they consider a prior directly on the number of components, with component counts supported on all possible strictly-positive integers (Nobile, 1994; Stephens, 2000; Green and Richardson, 2001; Nobile, 2004, 2007; Nobile and Fearnside, 2007; Miller and Harrison, 2013, 2014, 2018; Grazian et al., 2020). As in Nobile (1994), we call the resulting model the finite mixture model (FMM) to emphasize that, unlike in the DPMM or similar nonparametric models, the expected number of components in the prior is fixed and finite across data set sizes. Nobile (1994) has shown that the resulting component-count posterior in this case does concentrate at the true number of components. But crucially this result depends on the assumption that the component likelihoods are perfectly specified. In practice, though, we can expect that the component likelihoods are at least somewhat imperfectly specified, since they are necessarily simplifications of real-world phenomena. For instance, while Gaussian mixture models are ubiquitous, data are rarely perfectly Gaussian. Miller and Dunson (2019) provide empirical evidence of undesirable posterior behavior in an FMM with misspecified likelihood.

In this paper, we examine FMMs with essentially any component shape, subject to some mild regularity conditions; in particular, we do not restrict only to, for instance, Gaussian component likelihoods. We show that when the component likelihoods are not perfectly specified, the component-count posterior concentrates strictly away from the true number of components, just as in the nonparametric case. In fact, we go further to show that the FMM posterior for the number of components diverges: for any finite $k \in \mathbb{N}$, the posterior probability that the number of components is k converges to 0 almost surely as the amount of data grows.

It follows that we expect using either FMMs or DPMMs will lead to unreliable estimates of component counts in typical practical applications—even though the source of misspecification in the two cases is different.
That is, the DPMM cluster-count posterior is unreliable even when the component likelihoods are accurately specified—because the DPMM assumes an infinity of components. On the other hand, the FMM component-count posterior is unreliable only when the component likelihoods are misspecified. But since component likelihoods are typically (at least very slightly) misspecified in practical applications, we still expect the FMM estimate of the true generating number of components to diverge.

Besides resolving a theoretical gap in the literature, our analysis therefore has important practical implications. Namely, if we reject DPMMs for their poor asymptotic estimation of the number of components, then FMMs must be discarded as well, except in applications where we can guarantee perfectly specified likelihoods a priori. An alternative perspective is that while both models suffer in the asymptotic regime, finite data may provide some regularization properties when using either DPMMs or FMMs. A potential challenge of this latter perspective follows as a corollary of our analysis: the amount of implicit regularization changes as the data grows, and thus the same analysis will give different answers for different data set sizes. We demonstrate this effect in real-data experiments by varying the number of data points; our results show that the number of components varies substantially as we increase the data set size within realistic ranges. Our results demonstrate that past research using a Bayesian prior on the number of components may have strongly depended on the size of the data set in consideration.

In the same spirit as Miller and Harrison (2013, 2014), we believe our paper adds rigor to folk wisdom in the data analysis community. For practitioners interested in a reliable estimate of cluster cardinality, our results suggest more robust methods are needed. Indeed, work in this direction has already begun (Grünwald, 2006; Woo and Sriram, 2006, 2007; Rodriguez and Dunson, 2011; Wang et al., 2017; Miller and Dunson, 2019; Huggins and Miller, 2019).

Section 2 begins by introducing FMMs and stating our main result on posterior divergence in the number of components. Then in Section 3, we discuss the primary assumptions needed for the main result to hold. Section 4 provides the proof of the main theoretical result. Section 5 presents an extension of the main theorem to priors that may vary as the data set grows. Section 6 discusses related work on the asymptotics of the posterior number of components in FMMs. The paper concludes in Section 7 with empirical evidence that the FMM posterior on the number of components depends on the amount of data considered.
2. Main result.
We begin with a brief description of the finite mixture model we consider in this work. In this section, we provide just enough detail to state Theorem 2.1 and leave the precise probabilistic details for Section 3. Let g be a mixing measure $g := \sum_{j=1}^{k} p_j \delta_{\theta_j}$ on a parameter space Θ with $p_j \in [0, 1]$ and $\sum_{j=1}^{k} p_j = 1$, and let $\Psi = \{\psi_\theta : \theta \in \Theta\}$ be a family of component distributions dominated by a σ-finite measure µ. Then we can express a finite mixture f of the components as

$$f = \int_\Theta \psi_\theta \, dg(\theta) = \sum_{j=1}^{k} p_j \psi_{\theta_j}.$$

Consider a Bayesian model with a prior distribution Π on the set of all mixing measures $\mathcal{G}$ on Θ with finitely many atoms, i.e., $g \sim \Pi$, and likelihood corresponding to conditionally i.i.d. data from $f = \int \psi_\theta \, dg(\theta)$. The model assumes the likelihood is f, but the model is misspecified; i.e., the observations $X_N := (X_1, \ldots, X_N)$ are generated conditionally i.i.d. from a finite mixture $f_0$ of distributions not in Ψ.

Our main result is that under this misspecification of the likelihood, the posterior on the number of components $\Pi(k \mid X_N)$ diverges; i.e., for any finite $k \in \mathbb{N}$, $\Pi(k \mid X_N) \to 0$ as $N \to \infty$. We make only two requirements of the mixture model to guarantee this result: (1) the true data-generating distribution $f_0$ must be arbitrarily well-approximated by finite mixtures of Ψ, and (2) the family Ψ must satisfy mild regularity conditions that hold for popular mixture models (e.g., the family Ψ of Gaussians parametrized by mean and variance). We provide precise definitions of the assumptions needed for Theorem 2.1 to hold in Section 3, and a proof in Section 4.

Theorem 2.1. Suppose observations $X_N$ are generated i.i.d. from a distribution $f_0$ that is not a finite mixture of Ψ. Assume that:

Assumption 3.1: $f_0$ is in the KL-support of the prior Π, and
Assumption 3.6: Ψ is continuous, is mixture-identifiable, and has degenerate limits.

Then the posterior on the number of components diverges; i.e., for all $k \in \mathbb{N}$,

$$\Pi(k \mid X_N) \xrightarrow{N \to \infty} 0 \quad f_0\text{-a.s.} \qquad (1)$$

Note that the conditions of the theorem—although technical—are satisfied by a wide class of models used in practice. Assumption 3.1 requires that the prior Π places enough mass on mixtures near the true generating distribution $f_0$. Assumption 3.6 enforces regularity of the component family and is satisfied by many popular models used in practice, such as the multivariate Gaussian family.
Proposition 2.2. Let $\Psi = \{\mathcal{N}(\nu, \Sigma) : \nu \in \mathbb{R}^d, \Sigma \in S^d_{++}\}$ be the multivariate Gaussian family, where $S^d_{++} := \{\Sigma \in \mathbb{R}^{d \times d} : \Sigma = \Sigma^\top, \Sigma \succ 0\}$ is the set of d × d symmetric, positive definite matrices. Then Ψ satisfies Assumption 3.6.

Thus, provided that $f_0$ is in the KL-support of the prior, under a misspecified Gaussian mixture model, our main result implies that the posterior number of components diverges. While Proposition 2.2 is stated for Gaussian component distributions, we generalize it to mixture-identifiable location-scale families Ψ in Proposition A.2.

Additionally, we note that the divergence of the posterior given in Equation (1) is stronger than the behavior described in Miller and Harrison (2013) for DPMMs, i.e., that the posterior probability converges to 0 at the true number of components. In contrast, here we show that the posterior probability converges to 0 for any finite number of components. We conjecture that posterior divergence also holds for DPMMs, but the proof is outside of the scope of this paper.

Finally, while the result of Theorem 2.1 assumes that the model uses a fixed prior Π, in many practical modeling scenarios it may be natural to specify a prior $\Pi_N$ that depends on the amount of observed data $X_N$. In Section 5 we show that, if $f_0$ satisfies a modified KL-support condition with respect to the sequence of priors $\Pi_N$, the number of components also diverges in this setting.
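To fix ideas, the following minimal sketch (Python; all numerical values are illustrative choices of ours, not from the paper) evaluates a two-component Gaussian mixture $f = \sum_j p_j \psi_{\theta_j}$ next to a Laplace density playing the role of $f_0$: the data-generating distribution lies outside the model's component family, which is exactly the misspecification the theorem concerns.

```python
import numpy as np
from scipy.stats import norm, laplace

# A finite mixing measure g = sum_j p_j * delta_{theta_j} with k = 2 atoms;
# each theta_j = (mean, sd) indexes a Gaussian component psi_theta.
p = np.array([0.3, 0.7])                              # mixture weights (sum to 1)
means, sds = np.array([-1.0, 2.0]), np.array([1.0, 0.5])

def f(x):
    # Mixture density f(x) = sum_j p_j * psi_{theta_j}(x).
    return np.sum(p * norm.pdf(x[..., None], means, sds), axis=-1)

# The true data-generating density f0: a Laplace, which is not a finite
# Gaussian mixture, so the component likelihood is misspecified.
f0 = laplace(loc=0.0, scale=1.0).pdf

x = np.linspace(-5, 5, 11)
print(np.c_[x, f(x), f0(x)])   # compare model density vs truth pointwise
```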
3. Precise setup and assumptions in Theorem 2.1.
This section makes the details of the modeling setup and each of the conditions in Theorem 2.1 precise.
3.1. Notation and setup.
Let $\mathcal{X}$ and Θ be Polish spaces for the observations and parameters, respectively, and endow both with their Borel σ-algebras. For a topological space $(\cdot)$, let $C(\cdot)$ be the bounded continuous functions from $(\cdot)$ into ℝ, and let $\mathcal{P}(\cdot)$ be the set of probability measures on $(\cdot)$, endowed with the weak topology metrized by the Prokhorov distance d (Ghosal and van der Vaart, 2017, Appendix A, p. 508) and the corresponding Borel σ-algebra. We use $f_i \Rightarrow f$ and $f_i \Longleftrightarrow f'_i$ to denote $\lim_{i\to\infty} d(f_i, f) = 0$ and $\lim_{i\to\infty} d(f_i, f'_i) = 0$, respectively, for $f_i, f'_i, f \in \mathcal{P}(\cdot)$. We assume that the family of distributions $\Psi = \{\psi_\theta : \theta \in \Theta\}$ is absolutely continuous with respect to a σ-finite base measure µ, i.e., $\psi_\theta \ll \mu$ for all θ ∈ Θ, and that for measurable $A \subseteq \mathcal{X}$, $\psi_\theta(A)$ is a measurable function on Θ. Define the measurable mapping $F : \mathcal{P}(\Theta) \to \mathcal{P}(\mathcal{X})$ from mixing measures to mixtures of Ψ, $F(g) = \int \psi_\theta \, dg(\theta)$. Let $\mathcal{G}$ be the set of atomic probability measures on Θ with finitely many atoms, and let $\mathcal{F}$ be the set of finite mixtures of Ψ; define $\mathcal{G}^*$ and $\mathcal{F}^*$ analogously for countable mixtures.

In the Bayesian finite mixture model from Section 2, a mixing measure $g \sim \Pi$ is generated from a prior measure Π on $\mathcal{G}$, and $f = F(g)$ is a likelihood distribution. The posterior distribution on the mixing measure is, for all measurable $A \subseteq \mathcal{G}$,

$$\Pi(A \mid X_N) = \frac{\int_A \prod_{n=1}^N \frac{df}{d\mu}(X_n) \, d\Pi(g)}{\int_{\mathcal{G}} \prod_{n=1}^N \frac{df}{d\mu}(X_n) \, d\Pi(g)}, \qquad (2)$$

where $\frac{df}{d\mu}$ is the density of $f = F(g)$ with respect to µ. This posterior on the mixing measure $g \in \mathcal{G}$ induces a posterior on the number of components $k \in \mathbb{N}$ by counting the number of atoms in g, and it also induces a posterior on mixtures $f \in \mathcal{F}$ via the pushforward through the mapping F. We overload the notation $\Pi(\cdot \mid X_N)$ to refer to all of these posterior distributions and $\Pi(\cdot)$ to refer to prior distributions; the meaning should be clear from context.
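As a concrete (if highly simplified) illustration of Equation (2) and the induced posterior on k, the sketch below computes the component-count posterior exactly by brute-force enumeration after discretizing Θ to a small grid and restricting to k ≤ 2. The grid, the uniform prior over grid mixing measures, and the equal prior mass on k = 1 and k = 2 are all our own simplifying assumptions, not part of the model.

```python
import itertools
import numpy as np
from scipy.stats import norm, laplace
from scipy.special import logsumexp

rng = np.random.default_rng(0)
X = laplace.rvs(size=200, random_state=rng)   # data from f0 (Laplace): misspecified

# Discretized parameter space: unit-variance Gaussian components with means on
# a grid; mixing weights on a grid. The prior is uniform over this finite set.
thetas = np.linspace(-3.0, 3.0, 7)
weight_grid = np.linspace(0.1, 0.9, 5)

def log_lik(atoms, probs):
    # sum_n log f(X_n) for f = sum_j probs[j] * N(atoms[j], 1)
    comp = norm.logpdf(X[:, None], loc=np.asarray(atoms)[None, :])
    return np.logaddexp.reduce(comp + np.log(probs)[None, :], axis=1).sum()

log_liks = {1: [], 2: []}
for t in thetas:                                    # k = 1 mixing measures
    log_liks[1].append(log_lik([t], np.array([1.0])))
for t1, t2 in itertools.combinations(thetas, 2):    # k = 2 mixing measures
    for w in weight_grid:
        log_liks[2].append(log_lik([t1, t2], np.array([w, 1.0 - w])))

# Posterior over k via Equation (2): average the likelihood over the (uniform)
# prior within each k, then normalize across k (equal prior mass on k = 1, 2).
log_marg = {k: logsumexp(v) - np.log(len(v)) for k, v in log_liks.items()}
norm_c = logsumexp(list(log_marg.values()))
print({k: float(np.exp(v - norm_c)) for k, v in log_marg.items()})
```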
3.2. Model assumptions.
The first assumption of Theorem 2.1 is that while the true data-generating distribution $f_0$ is not contained in the model class ($f_0 \notin \mathcal{F}$), it lies on the boundary of the model class. In particular, we assume $f_0$ is in the KL-support of the prior Π. Denote the Kullback-Leibler (KL) divergence between probability measures $f_0$ and f as

$$\mathrm{KL}(f_0, f) := \begin{cases} \int \log\left(\frac{df_0}{df}\right) df_0 & f_0 \ll f \\ \infty & \text{otherwise.} \end{cases}$$

Assumption 3.1. For all ε > 0, the prior distribution Π satisfies $\Pi(f \in \mathcal{F} : \mathrm{KL}(f_0, f) < \varepsilon) > 0$.

We use Assumption 3.1 in the proof of Theorem 2.1 primarily to ensure that the Bayesian posterior is consistent for $f_0$. Note that Assumption 3.1 is fairly weak in practice. Intuitively, it just requires that the family Ψ is rich enough so that mixtures of Ψ can approximate $f_0$ arbitrarily well, and that the prior Π places sufficient mass on those mixtures close to $f_0$. For Bayesian mixture modeling, Ghosal et al. (1999, Theorem 3), Tokdar (2006, Theorem 3.2), Wu and Ghosal (2008, Theorem 2.3), and Petralia et al. (2012, Theorem 1) provide conditions needed to satisfy Assumption 3.1.

The second assumption of Theorem 2.1 is that the family of component distributions Ψ is well-behaved. This assumption has three stipulations. First, the mapping $\theta \mapsto \psi_\theta$ must be continuous; this condition essentially asserts that similar parameter values θ must result in similar component distributions $\psi_\theta$.

Definition 3.2. The family Ψ is continuous if the map $\theta \mapsto \psi_\theta$ is continuous.

Second, the family Ψ must be mixture-identifiable, which guarantees that each mixture $f \in \mathcal{F}$ is associated with a unique mixing measure $g \in \mathcal{G}$.

Definition 3.3. The family Ψ is mixture-identifiable if the mapping $F(g) = \int \psi_\theta \, dg(\theta)$ restricted to finite mixtures, $F : \mathcal{G} \to \mathcal{F}$, is a bijection.

In practice, one should always use an identifiable mixture model for clustering; without identifiability, the task of learning the number of components is ill posed. And many models satisfy mixture-identifiability, such as finite mixtures of the multivariate Gaussian family (Yakowitz and Spragins, 1968), the Cauchy family (Yakowitz and Spragins, 1968), the gamma family (Teicher, 1963), the generalized logistic family, the generalized Gumbel family, the Weibull family, and the von Mises family (Ho and Nguyen, 2016, Theorem 3.3). A number of authors (e.g., Chen (1995); Ishwaran et al. (2001); Nguyen (2013); Ho and Nguyen (2016); Guha et al. (2019); Heinrich and Kahn (2018)) appeal to stronger notions of identifiability for mixtures than Definition 3.3. But, to show posterior divergence in the present work, we do not require conditions stronger than Definition 3.3.
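For a rough numerical sense of Assumption 3.1's approximability requirement, the sketch below fits k-component Gaussian mixtures to Laplace data by maximum likelihood (scikit-learn's EM is only a stand-in here for mass placed by a prior) and Monte Carlo estimates $\mathrm{KL}(f_0, f)$: the estimate shrinks as k grows, consistent with $f_0$ lying on the boundary of the model class, while remaining positive for each finite k.

```python
import numpy as np
from scipy.stats import laplace
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
train = laplace.rvs(size=20000, random_state=rng)[:, None]
test = laplace.rvs(size=20000, random_state=rng)

for k in [1, 2, 4, 8]:
    gm = GaussianMixture(n_components=k, random_state=0).fit(train)
    # Monte Carlo estimate of KL(f0, f) = E_{f0}[log f0(X) - log f(X)].
    kl = np.mean(laplace.logpdf(test) - gm.score_samples(test[:, None]))
    print(k, kl)
```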
The third stipulation—that the family Ψ has degenerate limits—guarantees that a "poorly behaved" sequence of parameters $(\theta_i)_{i\in\mathbb{N}}$ creates a likewise "poorly behaved" sequence of distributions $(\psi_{\theta_i})_{i\in\mathbb{N}}$. This condition allows us to rule out such sequences in the proof of Theorem 2.1, and is the essential regularity condition to guarantee that a sequence of finite mixtures of at most k components cannot approximate $f_0$ arbitrarily closely.

Definition 3.4. A sequence of distributions $(\psi_i)_{i=1}^\infty$ is µ-wide if for any closed set C such that µ(C) = 0 and any sequence of distributions $(\varphi_i)_{i=1}^\infty$ such that $\psi_i \Longleftrightarrow \varphi_i$, $\limsup_{i\to\infty} \varphi_i(C) = 0$.

Definition 3.5. The family Ψ has degenerate limits if for any tight, µ-wide sequence $(\psi_{\theta_i})_{i\in\mathbb{N}}$, we have that $(\theta_i)_{i\in\mathbb{N}}$ is relatively compact.

The contrapositive of Definition 3.5 provides an intuitive explanation of the condition: as $i \to \infty$, for any sequence of parameters $\theta_i$ that eventually leaves every compact set $K \subseteq \Theta$, either the $\psi_{\theta_i}$ become "arbitrarily flat" (not tight) or "arbitrarily peaky" (not µ-wide). For example, consider the family Ψ of Gaussians on ℝ with Lebesgue measure µ. If the variance of $\psi_{\theta_i}$ shrinks as i grows, the sequence of distributions converges weakly to a sequence of point masses (not dominated by the Lebesgue measure). If either the variance or the mean diverges, the distributions flatten out and the sequence is not tight. We use the fact that these are the only two possibilities when a sequence of parameters is poorly behaved (not relatively compact) in the proof of Theorem 2.1.

These three stipulations together yield Assumption 3.6.

Assumption 3.6. The mixture component family Ψ is continuous, is mixture-identifiable, and has degenerate limits.
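The two failure modes in this Gaussian example are easy to see numerically. In the sketch below (the scale sequences are our own illustrative choices), a shrinking-variance sequence piles its mass into an arbitrarily small neighborhood of the µ-null set {0}, while a growing-variance sequence leaks mass out of every fixed compact set.

```python
import numpy as np
from scipy.stats import norm

# Two "poorly behaved" sequences of Gaussians psi_i on R (mu = Lebesgue measure).
for i in [1, 10, 100, 1000, 10000]:
    peaky = norm(loc=0.0, scale=1.0 / i)  # variance -> 0: weak limit is a point mass
    flat = norm(loc=0.0, scale=float(i))  # variance -> infinity: sequence is not tight
    mass_near_null_set = peaky.cdf(1e-3) - peaky.cdf(-1e-3)  # -> 1 ("arbitrarily peaky")
    mass_in_compact = flat.cdf(10.0) - flat.cdf(-10.0)       # -> 0 ("arbitrarily flat")
    print(i, round(mass_near_null_set, 4), round(mass_in_compact, 4))
```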
4. Proof of Theorem 2.1.
The proof has two essential steps. The first is to show that the Bayesian posterior is weakly consistent for the mixture $f_0$; i.e., for any weak neighborhood U of $f_0$, the sequence of posterior distributions satisfies

$$\Pi(U \mid X_N) \xrightarrow{N\to\infty} 1, \quad f_0\text{-a.s.} \qquad (3)$$

By Ghosh and Ramamoorthi (2003, Theorem 4.4.2), weak consistency for $f_0$ is guaranteed directly by Assumption 3.1 and the fact that Ψ is dominated by a σ-finite measure µ. The second step is to show that for any finite $k \in \mathbb{N}$, there exists a weak neighborhood U of $f_0$ containing no mixtures of the family Ψ with at most k components. Together, these steps show that the posterior probability of the set of all k-component mixtures converges to 0 $f_0$-a.s. as the amount of observed data grows.

We provide a proof of the second step. To begin, note that Assumption 3.1 has two additional implications about $f_0$ beyond Equation (3). First, $f_0$ must be absolutely continuous with respect to the dominating measure µ; if it were not, then there would exist a measurable set A such that $f_0(A) > 0$ and µ(A) = 0. Since µ dominates Ψ, any $f \in \mathcal{F}$ satisfies f(A) = 0. Therefore $\mathrm{KL}(f_0, f) = \infty$, and the prior support condition cannot hold. Second, it implies that $f_0$ can be arbitrarily well-approximated by finite mixtures under the weak metric; i.e., there exists a sequence of finite mixtures $f_i \in \mathcal{F}$, $i \in \mathbb{N}$, such that $f_i \Rightarrow f_0$ as $i \to \infty$. This holds because $\sqrt{\mathrm{KL}(f_0, f)} \geq \mathrm{TV}(f_0, f) \geq d(f_0, f)$.

Now suppose the contrary of the claim for the second step, i.e., that there exists a sequence $(f_i)_{i=1}^\infty$ of mixtures of at most k components from Ψ such that $f_i \Rightarrow f_0$. By mixture-identifiability, we have a sequence of mixing measures $g_i$ with at most k atoms such that $F(g_i) = f_i$. Suppose first that the atoms of the sequence $(g_i)_{i\in\mathbb{N}}$ either stay in a compact set or have weights converging to 0. More precisely, suppose there exists a compact set $K \subseteq \Theta$ such that

$$g_i(\Theta \setminus K) \to 0. \qquad (4)$$

Decompose each $g_i = g_{i,K} + g_{i,\Theta\setminus K}$ such that $g_{i,K}$ is supported on K and $g_{i,\Theta\setminus K}$ is supported on $\Theta \setminus K$. Define the sequence of probability measures $\hat g_{i,K} = g_{i,K} / g_{i,K}(\Theta)$ for sufficiently large i such that the denominator is nonzero. Then Equation (4) implies $F(\hat g_{i,K}) \Rightarrow f_0$.

Since Ψ is continuous and mixture-identifiable, the restriction of F to the domain $\mathcal{G}$ is continuous and invertible; and since K is compact, the elements of $(\hat g_{i,K})_{i\in\mathbb{N}}$ are contained in a compact set $\mathcal{G}_K \subseteq \mathcal{G}$ by Prokhorov's theorem (Ghosal and van der Vaart, 2017, Theorem A.4). Therefore $F(\mathcal{G}_K) = \mathcal{F}_K$ is also compact, and the map F restricted to the domain $\mathcal{G}_K$ is uniformly continuous with a uniformly continuous inverse by Rudin (1976, Theorems 4.14, 4.17, 4.19). Next, since $F(\hat g_{i,K}) \Rightarrow f_0$, the sequence $F(\hat g_{i,K})$ is Cauchy in $\mathcal{F}_K$; and since $F^{-1}$ is uniformly continuous on $\mathcal{F}_K$, the sequence $\hat g_{i,K}$ must also be Cauchy in $\mathcal{G}_K$. Since $\mathcal{G}_K$ is compact, $\hat g_{i,K}$ converges in $\mathcal{G}_K$. Lemma 4.1 below guarantees that the convergent limit $g_K$ is also a mixing measure with at most k atoms; continuity of F implies that $F(g_K) = f_0$, which is a contradiction, since by assumption $f_0$ is not representable as a finite mixture of Ψ.

Lemma 4.1. Suppose $\varphi, (\varphi_i)_{i\in\mathbb{N}}$ are Borel probability measures on a Polish space such that $\varphi_i \Rightarrow \varphi$ and $\sup_i |\mathrm{supp}\,\varphi_i| \leq k \in \mathbb{N}$. Then $|\mathrm{supp}\,\varphi| \leq k$.
Suppose $|\mathrm{supp}\,\varphi| > k$. Then we can find k + 1 distinct points $x_1, \ldots, x_{k+1} \in \mathrm{supp}\,\varphi$. Pick any metric ρ on the Polish space, and denote the minimum pairwise distance between the points ε. Then for each point $j = 1, \ldots, k+1$ define the bounded, continuous function $h_j(x) = 0 \vee (1 - 2\varepsilon^{-1}\rho(x, x_j))$. Since $x_j \in \mathrm{supp}\,\varphi$, we have that $\int h_j \, d\varphi > 0$. Weak convergence $\varphi_i \Rightarrow \varphi$ therefore implies $\min_{j=1,\ldots,k+1} \liminf_{i\to\infty} \int h_j \, d\varphi_i > 0$. But the $h_j$ are nonzero on disjoint sets, and each $\varphi_i$ has at most k atoms; the pigeonhole principle yields a contradiction.

Now we consider the remaining case: for all compact sets $K \subseteq \Theta$, $g_i(\Theta \setminus K) \not\to 0$. Therefore there exists a sequence of parameters $(\theta_i)_{i=1}^\infty$ that is not relatively compact such that $\limsup_{i\to\infty} g_i(\{\theta_i\}) > 0$. By Assumption 3.6, the sequence $(\psi_{\theta_i})_{i\in\mathbb{N}}$ is either not tight or not µ-wide. If $(\psi_{\theta_i})_{i\in\mathbb{N}}$ is not tight then $f_i = F(g_i)$ is not tight, and by Prokhorov's theorem $f_i$ cannot converge to a probability measure, which contradicts $f_i \Rightarrow f_0$. If $(\psi_{\theta_i})_{i\in\mathbb{N}}$ is not µ-wide then $f_i = F(g_i)$ is not µ-wide. Denote by $(\varphi_i)_{i\in\mathbb{N}}$ the singular sequence associated with $(f_i)_{i\in\mathbb{N}}$, and by C the closed set such that $\limsup_{i\to\infty} \varphi_i(C) > 0$, µ(C) = 0, and $\varphi_i \Longleftrightarrow f_i$ per Definition 3.4. Since $f_0 \ll \mu$, $f_0(C) = 0$. But $f_i \Rightarrow f_0$ implies that $\varphi_i \Rightarrow f_0$, so $\limsup_{i\to\infty} \varphi_i(C) \leq f_0(C) = 0$ by the Portmanteau theorem (Ghosal and van der Vaart, 2017, Theorem A.2). This is a contradiction.
5. Extension to priors that vary with N.

Our main result (i.e., Theorem 2.1) applies to the setting of a fixed prior Π. However, it is often natural to specify a prior distribution that changes with N (e.g., Roeder and Wasserman (1997), Richardson and Green (1997), and Miller and Harrison (2018, Section 7.2.1)). Corollary 5.2 below demonstrates that a result nearly identical to Theorem 2.1 holds for priors that are allowed to vary with N, provided that $f_0$ is in the KL-support of the sequence of priors $\Pi_N$. The only difference is that our result in this case is slightly weaker: we show that the posterior number of components diverges in probability rather than almost surely.

Assumption 5.1. For all ε > 0, the sequence of prior distributions $\Pi_N$ satisfies $\liminf_{N\to\infty} \Pi_N(f : \mathrm{KL}(f_0, f) < \varepsilon) > 0$.

Corollary 5.2. Suppose in the setting of Theorem 2.1 we replace Assumption 3.1 with Assumption 5.1. Then the posterior on the number of components diverges in $f_0$-probability; i.e., for all $k \in \mathbb{N}$, $\Pi(k \mid X_N) \xrightarrow{N\to\infty} 0$ in $f_0$-probability.
Since for any ε > 0, $\liminf_{N\to\infty} \Pi_N(f : \mathrm{KL}(f_0, f) < \varepsilon) > 0$, Ghosal and van der Vaart (2017, Theorem 6.17, Lemma 6.26, and Example 6.20) imply that the posterior is weakly consistent at $f_0$ in probability; i.e., for any weak neighborhood U of $f_0$, $\Pi(U \mid X_N) \xrightarrow{N\to\infty} 1$ in $f_0$-probability. Assumption 5.1 also implies that for sufficiently large N, $f_0$ is a weak limit of finite mixtures in $\mathcal{F}$. The remainder of the proof is identical to that of Theorem 2.1.
6. Related work.
In this work, we consider FMMs with a prior on the number of components. We consider the case where this prior does not vary with the number of data points and has support on all strictly-positive component counts. Posterior consistency for the mixture density (Ghosal et al., 1999; Lijoi et al., 2004; Kruijer et al., 2010) and the mixing measure (Nguyen, 2013; Ho and Nguyen, 2016; Guha et al., 2019) in a wide class of mixture models is well established. But posterior consistency for the number of components is not as thoroughly characterized. There are several results establishing consistency for the number of components in well-specified FMMs. Nobile (1994, Proposition 3.5) and Guha et al. (2019, Theorem 3.1a) demonstrate that FMMs exhibit posterior consistency for the number of components when the model is well specified and Ψ is mixture-identifiable. The present work characterizes the behavior of the FMM posterior on the number of components under component misspecification.

A related approach for handling a finite but unknown number of components is to specify an overfitted mixture model, i.e., a finite mixture model with a number of components in excess of the true number (e.g., Ishwaran et al. (2001); Rousseau and Mengersen (2011); Malsiner-Walli et al. (2016)). In the setting of overfitted FMMs with well-specified component densities, Rousseau and Mengersen (2011, Theorem 1) show that under a stronger identifiability condition than mixture-identifiability and additional regularity assumptions on the model, the posterior will concentrate properly by emptying the extra components. Ishwaran et al. (2001, Theorem 1) consider the setting of estimating the number of components with the assumption of a known upper bound on the true number of components and well-specified components, and show that the posterior does not asymptotically underestimate the number of components when assuming a stronger identifiability condition than mixture-identifiability and a KL-support condition on the prior. In addition to assuming component misspecification, we here consider only priors that place full support on the natural numbers, as opposed to priors that place an upper bound on the number of components.

Frühwirth-Schnatter (2006) provides a wide-ranging review of methodology for finite mixture modeling. Frühwirth-Schnatter (2006, e.g., Section 7.1) observes that, in practice, the learned number of mixture components will generally be higher than the true generating number of components when the likelihood is misspecified—but does not prove a result about the number of components under misspecification. Similarly, Miller and Harrison (2018, Section 7.1.5) discuss the issue of estimating the number of components in FMMs under model misspecification and state that the posterior number of components is expected to diverge to infinity as the number of samples increases, but no proof of this asymptotic behavior is provided.

Finally, a growing body of work is focused on developing more robust FMMs and related mixture models. In order to address the issue of component misspecification, a number of authors propose using finite mixture models with nonparametric component densities, e.g., components that are themselves mixtures of Gaussians (Bartolucci, 2005; Di Zio et al., 2007; Malsiner-Walli et al., 2017).
7. Experiments.
In this section, we demonstrate one of the primary practical implications of our theory: the inferred number of components can change drastically depending on the amount of observed data in misspecified finite mixture models.

For all experiments below, we use a finite mixture model with a multivariate Gaussian component family having diagonal covariance matrices and a conjugate prior on each dimension. In particular, consider number of components k, mixture weights $p \in \mathbb{R}^k$, Gaussian component precisions $\tau \in \mathbb{R}^{k\times D}_+$ and means $\theta \in \mathbb{R}^{k\times D}$, labels $Z \in \{1, \ldots, k\}^N$, and data $X \in \mathbb{R}^{N\times D}$. Then the probabilistic generative model is

$$k \sim \mathrm{Geom}(r), \quad p \sim \mathrm{Dirichlet}_k(\gamma, \ldots, \gamma), \quad \tau_{jd} \overset{\text{i.i.d.}}{\sim} \mathrm{Gam}(\alpha, \beta), \quad \theta_{jd} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(m, \kappa_{jd}^{-1}),$$
$$Z_n \overset{\text{i.i.d.}}{\sim} \mathrm{Categorical}(p), \quad X_{nd} \overset{\text{ind}}{\sim} \mathcal{N}(\theta_{Z_n d}, \tau_{Z_n d}^{-1}),$$

where j ranges over $1, \ldots, k$, d ranges over $1, \ldots, D$, and n ranges over $1, \ldots, N$. For posterior inference, we use a Julia implementation of split-merge collapsed Gibbs sampling (Neal, 2000; Jain and Neal, 2004) from Miller and Harrison (2018).* The model and inference algorithm are described in more detail in Miller and Harrison (2018, Sec. 7.2.2, Algorithm 1). Note that we use this model primarily to illustrate the problem of posterior divergence under model misspecification; it should not be interpreted as a carefully-specified model for the data examples that we study. Also note that while the empirical examples below involve Gaussian FMMs, our theory applies to a more general class of component distributions. A forward simulation of this generative model is sketched below.

*Code available at https://github.com/jwmi/BayesianMixtures.jl.
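The following is a minimal forward simulation of the generative model above, written in Python rather than the Julia used for inference. The hyperparameter values are hypothetical placeholders, and we take $\kappa_{jd}$ equal to a single constant κ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 2
# Hypothetical hyperparameters (placeholders, not the paper's values):
r, gamma, alpha, beta, m, kappa = 0.1, 1.0, 2.0, 1.0, 0.0, 1.0

k = rng.geometric(r)                                 # k ~ Geom(r)
p = rng.dirichlet(np.full(k, gamma))                 # p ~ Dirichlet_k(gamma, ..., gamma)
tau = rng.gamma(alpha, 1.0 / beta, size=(k, D))      # tau_jd ~ Gam(alpha, beta), beta as a rate
theta = rng.normal(m, np.sqrt(1.0 / kappa), size=(k, D))  # theta_jd ~ N(m, kappa^{-1}), scalar kappa
Z = rng.choice(k, size=N, p=p)                       # Z_n ~ Categorical(p)
X = rng.normal(theta[Z], np.sqrt(1.0 / tau[Z]))      # X_nd ~ N(theta_{Z_n,d}, tau_{Z_n,d}^{-1})
print(k, X.shape)
```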
7.1. Synthetic data.
Our first experiments are on synthetic data and are inspired by Figure 3 of Miller and Dunson (2019), which investigates the posterior of a mixture of perturbed Gaussians. Here we study the effects of varying data set sizes under both well-specified and misspecified models. We generated data sets of five increasing sizes N from 1- and 2-component univariate Gaussian and Laplace mixture models, where the 1-component distributions have mean 0 and scale 1, and the 2-component distributions have two components with distinct means, scales, and mixing weights. We generated the sequence of data sets such that each was a subset of the next, larger data set in the sequence. Following Miller and Harrison (2018, Section 7.2.1), we set the hyperparameters of the Bayesian finite mixture model as follows: $m = \frac{1}{2}(\max_{n\in[\tilde N]} X_n + \min_{n\in[\tilde N]} X_n)$ and $\kappa = (\max_{n\in[\tilde N]} X_n - \min_{n\in[\tilde N]} X_n)^{-2}$ for a fixed subset size Ñ (so that the prior does not change across the nested data sets), with α = 2, r = 0.1, γ = 1, and β drawn from a gamma hyperprior that depends on κ. A sketch of the data-generation protocol appears below.
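A sketch of this data-generation protocol follows; the numeric sizes, means, scales, and weights here are illustrative placeholders of ours rather than the experiments' values, but the nesting structure matches the construction described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 1000, 10000]                                     # placeholder data set sizes
means, scales, weights = [-2.0, 2.0], [1.5, 1.0], [0.5, 0.5]  # placeholder parameters

def sample_mixture(n, family):
    z = rng.choice(2, size=n, p=weights)
    loc, scale = np.take(means, z), np.take(scales, z)
    if family == "gaussian":
        return rng.normal(loc, scale)
    return rng.laplace(loc, scale)   # Laplace data: misspecified for a Gaussian FMM

# Generate once at the largest size; each smaller data set is a prefix of
# the next larger one, matching the nested construction in the text.
full = {fam: sample_mixture(max(sizes), fam) for fam in ("gaussian", "laplace")}
datasets = {fam: [x[:n] for n in sizes] for fam, x in full.items()}
print({fam: [d.shape for d in ds] for fam, ds in datasets.items()})
```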
Fig 1: Upper and middle rows: posterior probability of the number of components k for Gaussian mixture models with a fixed prior, fit to univariate data generated from (a, b) 1- and 2-component Gaussian mixture models and (c, d) 1- and 2-component Laplace mixture models. Lower row: posterior probability of the number of components for Gaussian mixture models with a varying prior, fit to (e) 2-component univariate Gaussian mixture data and (f) 2-component univariate Laplace mixture data.
Fig 2: Posterior probability of the number of components k for Gaussian mixture models, fit to (a) mouse cortex single-cell RNA sequencing data and (b) lung tissue gene expression data.

We refer to Miller and Harrison (2018, Section 7.2.1) for additional details on the choice of model hyperparameters and the sampling of β. We ran a total of 100,000 Markov chain Monte Carlo iterations per data set; we discarded the first 10,000 iterations as burn-in.

The results of the simulations are shown in the top and middle rows of Figure 1. For the data generated from the 1-component models, the posterior on the number of components concentrates around 1 in the case of Gaussian-generated data as the sample size increases (Figure 1a), whereas the posterior on the number of components diverges for the Laplace data (Figure 1c). We observe similar behavior in the 2-component case, where the posterior concentrates around the correct value in the Gaussian case (Figure 1b) but not the Laplace case (Figure 1d).

Finally, we considered the Gaussian mixture model above but with a prior that varies with the data. Specifically, for the prior on the means, we set the hyperparameters to $m_N = \frac{1}{2}(\max_{n\in[N]} X_n + \min_{n\in[N]} X_n)$ and $\kappa_N = (\max_{n\in[N]} X_n - \min_{n\in[N]} X_n)^{-2}$, which is the setting considered by Miller and Harrison (2018, Section 7.2.1); the other hyperparameters were set to the same values as above. We used the 2-component Gaussian and Laplace data sets constructed above for the fixed-prior case. The bottom row of Figure 1 shows the posterior number of components under this prior for the well-specified and misspecified cases; again we observe that the posterior diverges under model misspecification.
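In code, the data-dependent hyperparameters take the following form (a minimal sketch assuming the midrange and inverse-squared-range forms above, with a toy data vector standing in for the observations):

```python
import numpy as np

X = np.random.default_rng(0).laplace(size=1000)  # toy stand-in for the observed data

# Prior hyperparameters on component means, recomputed from all N observations:
m_N = 0.5 * (X.max() + X.min())       # midrange of the observed data
kappa_N = (X.max() - X.min()) ** -2   # precision from the squared data range
print(m_N, kappa_N)
```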
Computational biologists are interested in classifyingcell types by applying clustering techniques to gene expression data (Yeung et al.,2001; Medvedovic and Sivaganesan, 2002; McLachlan et al., 2002; Medvedovic et al.,2004; Rasmussen et al., 2008; de Souto et al., 2008; McNicholas and Murphy, 2010).In our next set of experiments, we apply the Gaussian finite mixture model to two D. CAI, T. CAMPBELL, T. BRODERICK gene expression data sets: (1) single-cell RNA sequencing data from mouse cortex andhippocampus cells (Zeisel et al., 2015) with the same feature selection as Prabhakaranet al. (2016) ( N = 3008 , D = 558 , 11,000 Gibbs sampling steps with 1,000 of thoseas burn-in) and (2) mRNA expression data from human lung tissue (Bhattacharjeeet al., 2001) ( N = 203 , D = 1543 , and 10,000 Gibbs sampling steps with 1,000of those burn-in). Our experiments here represent a simplified version of previousmixture model analyses for these and other related data sets (de Souto et al., 2008;Prabhakaran et al., 2016; Armstrong et al., 2001; Miller and Harrison, 2018).As these gene expression data sets contain counts, we first transformed the datato real numerical values. In particular, we used a base-2 log transform followed bystandardization—such that each dimension of the data had zero mean and unitvariance—per standard practices (e.g., Miller and Harrison (2018)). Then to examinethe effect of increasing data set size on inferential results, we randomly sampledsubsets of increasing size without replacement; each smaller subset was contained inthe next larger data set. For both data sets, we used hyperparameters α = 1 , β = 1 , m = 0 , κ jd = τ jd , r = 0 . , and γ = 1 .For the single-cell RNAseq data set, the posterior on the number of components isshown in Figure 2a. Here the ground truth number of clusters is captured when thedata set size is N = 100 . But as predicted by our theory, as we increase the numberof data points, the posterior number of components diverges.The posterior on the number of components for the lung gene expression data isshown in Figure 2b. Again we find that on the smallest data subsets, the posteriorappears to capture the ground truth number of clusters, but that as we examinemore and more data, the posterior diverges. While diagonal covariance Gaussiancomponents are likely not rich enough to model the cluster shapes, our purpose hereis to capture the effect of model misspecification on the posterior on the number ofcomponents. Thus, these examples suggest the need for more robust analyses.
8. Discussion.
We have shown that the Bayesian posterior distribution for the number of components in finite mixtures diverges when the mixture component family is misspecified. Since misspecification is almost unavoidable in real applications, it follows that finite mixture models are typically unreliable for estimating the number of components. In practice, our conclusion implies that inferential results on the number of components can change drastically depending on the size of the data set, calling into question the usefulness of these results in application.

A number of open questions remain. Because our analysis is inherently asymptotic, it is possible that the Bayesian posterior on the number of components may still provide useful inferences for a finite sample—for instance, if care is taken to account for the aforementioned dependence of inferential conclusions on data set size. Additionally, a number of authors have recently proposed robust Bayesian inference methods to mitigate likelihood misspecification (Grünwald and van Ommen, 2017; Miller and Dunson, 2019; Wang et al., 2017); it remains to better understand connections between our results and these methods.
Acknowledgments.
We thank Jeff Miller for helpful conversations and comments on a draft of this paper. D. Cai was supported in part by a Google Ph.D. Fellowship in Machine Learning. T. Campbell was supported by a National Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and an NSERC Discovery Launch Supplement. T. Broderick was supported in part by ONR grant N00014-17-1-2072, an MIT Lincoln Laboratory Advanced Concepts Committee Award, a Google Faculty Research Award, the CSAIL–MSR Trustworthy AI Initiative, and an ARO YIP award.

APPENDIX A: PROOF OF PROPOSITION 2.2

Consider the multivariate Gaussian family
$\Psi = \{\mathcal{N}(\nu, \Sigma) : \nu \in \mathbb{R}^d, \Sigma \in S^d_{++}\}$ with parameter space $\Theta = \mathbb{R}^d \times S^d_{++}$, equipped with the topology induced by the Euclidean metric. Let $(\lambda_j(\Sigma))_{j=1}^d$ denote the eigenvalues of the covariance matrix $\Sigma \in S^d_{++}$, which satisfy $\infty > \lambda_1(\Sigma) \geq \cdots \geq \lambda_d(\Sigma) > 0$. Since the family of Gaussians is continuous and mixture-identifiable (Yakowitz and Spragins, 1968, Proposition 2), the main condition we need to verify is that the family has degenerate limits (Definition 3.5). A useful fact is that if a sequence of Gaussian distributions is tight, then the sequences of means and of largest covariance eigenvalues are bounded.
Lemma A.1. Let $(\psi_i)_{i\in\mathbb{N}}$ be a sequence of Gaussian distributions with means $\nu_i \in \mathbb{R}^d$ and covariances $\Sigma_i \in S^d_{++}$. If $(\psi_i)_{i\in\mathbb{N}}$ is a tight sequence of measures, then the sequences $(\nu_i)_{i\in\mathbb{N}}$ and $(\lambda_1(\Sigma_i))_{i\in\mathbb{N}}$ are bounded.
Let $Y_i$ denote a random variable with distribution $\psi_i$. For each covariance matrix $\Sigma_i$, consider its eigenvalue decomposition $\Sigma_i = U_i \Lambda_i U_i^\top$, where $U_i \in \mathbb{R}^{d\times d}$ is an orthonormal matrix and $\Lambda_i \in \mathbb{R}^{d\times d}$ is a diagonal matrix. Then the random variable $Z_i = U_i^\top Y_i$ has distribution $\mathcal{N}(U_i^\top \nu_i, \Lambda_i)$. If either $\|\nu_i\| = \|U_i^\top \nu_i\|$ is unbounded or $\|\Lambda_i\|_F$ is unbounded, then $(Z_i)$ is not tight (Billingsley, 1986, Example 25.10). Since $Z_i$ and $Y_i$ lie in any ball centered at the origin with the same probability, $(Y_i)$ is not tight.

We now show that the multivariate Gaussian family has degenerate limits.
If the parameters $(\theta_i)_{i\in\mathbb{N}}$ are not a relatively compact subset of Θ, then either some coordinate of the sequence of means $\nu_i$ diverges, $\lambda_1(\Sigma_i) \to \infty$, or $\lambda_d(\Sigma_i) \to 0$. If some coordinate of the mean $\nu_i$ diverges or the maximum eigenvalue diverges, i.e., $\lambda_1(\Sigma_i) \to \infty$, then the sequence $(\psi_{\theta_i})$ is not tight by Lemma A.1. On the other hand, if $\lambda_d(\Sigma_i) \to 0$ as $i \to \infty$, then $\psi_{\theta_i}$ converges weakly to a sequence of degenerate Gaussian measures that concentrate on $C_i = \{x \in \mathbb{R}^d : (x - \nu_i)^\top u_{d,i} = 0\}$, where $u_{d,i}$ is the d-th eigenvector of $\Sigma_i$. Note that $\mu(C_i) = 0$ for Lebesgue measure µ; so if we define $C = \cup_i C_i$ in the setting of Definition 3.4, the sequence is not µ-wide.
We can generalize Proposition 2.2 beyond multivariate Gaussians to mixture-identifiable location-scale families, as shown in Proposition A.2. Examples of such families include the multivariate Gaussian family, the Cauchy family, the logistic family, the von Mises family, and generalized extreme value families. The proof is similar to that of Proposition 2.2.
Proposition A.2. Suppose Ψ is a location-scale family that is mixture-identifiable and absolutely continuous with respect to Lebesgue measure µ, i.e.,

$$\frac{d\Psi}{d\mu} = \left\{ |\Sigma|^{-1/2} \, \varphi\!\left(\Sigma^{-1/2}(x - \nu)\right) : \nu \in \mathbb{R}^d, \Sigma \in S^d_{++} \right\},$$

where $\varphi : \mathbb{R}^d \to \mathbb{R}$ is a probability density function. Then Ψ satisfies Assumption 3.6.

REFERENCES
B. Aragam, C. Dan, P. Ravikumar, and E. P. Xing. Identifiability of nonparametric mixture models and Bayes optimal clustering. arXiv preprint arXiv:1802.04397, 2018.
S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. d. Boer, M. D. Minden, S. E. Sallan, E. S. Lander, T. R. Golub, and S. J. Korsmeyer. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1):41, 2001.
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3):803–821, 1993.
F. Bartolucci. Clustering univariate observations via mixtures of unimodal normal mixtures. Journal of Classification, 22(2):203–219, 2005.
A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E. Mark, E. Lander, W. Wong, B. Johnson, T. Golub, D. Sugarbaker, and M. Meyerson. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 98(24):13790–13795, 2001.
P. Billingsley. Probability and Measure. John Wiley and Sons, third edition, 1986.
C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. B. Kepler. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A, 73(8):693–701, 2008.
J. Chen. Optimal rate of convergence for finite mixture models. The Annals of Statistics, 23(1):221–233, 1995.
M. C. de Souto, I. G. Costa, D. S. de Araujo, T. B. Ludermir, and A. Schliep. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1):497, 2008.
M. Di Zio, U. Guarnera, and R. Rocci. A mixture of mixture models for a classification problem: The unity measure error. Computational Statistics & Data Analysis, 51(5):2573–2585, 2007.
S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Series in Statistics, 2006.
S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2017.
S. Ghosal, J. Ghosh, and R. Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.
J. Ghosh and R. Ramamoorthi. Bayesian Nonparametrics. Springer Series in Statistics, 2003.
C. Grazian, C. Villa, and B. Liseo. On a loss-based prior for the number of components in mixture models. Statistics & Probability Letters, 158:108656, 2020.
P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28(2):355–375, 2001.
P. Grünwald and T. van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.
P. D. Grünwald. Bayesian inconsistency under misspecification. In World Meeting of the International Society for Bayesian Analysis, 2006.
A. Guha, N. Ho, and X. Nguyen. On posterior contraction of parameters and interpretability in Bayesian mixture modeling. arXiv preprint arXiv:1901.05078, 2019.
P. Heinrich and J. Kahn. Strong identifiability and optimal minimax rates for finite mixture estimation. The Annals of Statistics, 46(6A):2844–2870, 2018.
N. Ho and X. Nguyen. On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1):271–307, 2016.
J. P. Huelsenbeck and P. Andolfatto. Inference of population structure under a Dirichlet process model. Genetics, 175(4):1787–1802, 2007.
J. H. Huggins and J. W. Miller. Using bagged posteriors for robust inference and model criticism. arXiv preprint arXiv:1912.07104, 2019.
H. Ishwaran, L. F. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96(456):1316–1332, 2001.
S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.
W. Kruijer, J. Rousseau, and A. van der Vaart. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225–1257, 2010.
A. Lijoi, I. Prünster, and S. Walker. Extending Doob's consistency theorem to nonparametric densities. Bernoulli, 10(4):651–663, 2004.
E. D. Lorenzen, P. Arctander, and H. R. Siegismund. Regional genetic structuring and evolutionary history of the impala Aepyceros melampus. Journal of Heredity, 97(2):119–132, 2006.
G. Malsiner-Walli, S. Frühwirth-Schnatter, and B. Grün. Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing, 26(1-2):303–324, 2016.
G. Malsiner-Walli, S. Frühwirth-Schnatter, and B. Grün. Identifying mixtures of mixtures using Bayesian estimation. Journal of Computational and Graphical Statistics, 26(2):285–295, 2017.
G. J. McLachlan, R. Bean, and D. Peel. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18(3):413–422, 2002.
P. D. McNicholas and T. B. Murphy. Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26(21):2705–2712, 2010.
M. Medvedovic and S. Sivaganesan. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9):1194–1206, 2002.
M. Medvedovic, K. Y. Yeung, and R. E. Bumgarner. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20(8):1222–1232, 2004.
J. W. Miller and D. B. Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
J. W. Miller and M. T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206, 2013.
J. W. Miller and M. T. Harrison. Inconsistency of Pitman-Yor process mixtures for the number of components. Journal of Machine Learning Research, 15(1):3333–3370, 2014.
J. W. Miller and M. T. Harrison. Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113(521):340–356, 2018.
S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and A. Raftery. Three types of gamma-ray bursts. The Astrophysical Journal, 508(1):314, 1998.
R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1):370–400, 2013.
A. Nobile. Bayesian analysis of finite mixture distributions. PhD thesis, Carnegie Mellon University, 1994.
A. Nobile. On the posterior distribution of the number of components in a finite mixture. The Annals of Statistics, 32(5):2044–2073, 2004.
A. Nobile. Bayesian finite mixtures: a note on prior specification and posterior computation. arXiv preprint arXiv:0711.0458, 2007.
A. Nobile and A. T. Fearnside. Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing, 17(2):147–162, 2007.
E. Otranto and G. M. Gallo. A nonparametric Bayesian approach to detect the number of regimes in Markov switching models. Econometric Reviews, 21(4):477–496, 2002.
J. Pella and M. Masuda. The Gibbs and split merge sampler for population mixture analysis from genetic data with incomplete baselines. Canadian Journal of Fisheries and Aquatic Sciences, 63(3):576–596, 2006.
F. Petralia, V. Rao, and D. B. Dunson. Repulsive mixtures. In Advances in Neural Information Processing Systems, pages 1889–1897, 2012.
S. Prabhakaran, E. Azizi, A. Carr, and D. Pe'er. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. In International Conference on Machine Learning, pages 1070–1079, 2016.
J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
C. Rasmussen, B. de la Cruz, Z. Ghahramani, and D. Wild. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):615–628, 2008.
S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792, 1997.
A. Rodriguez and D. B. Dunson. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1):145–178, 2011.
K. Roeder and L. Wasserman. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92(439):894–902, 1997.
J. Rousseau and K. Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710, 2011.
W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976.
M. Stephens. Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods. The Annals of Statistics, 28(1):40–74, 2000.
H. Teicher. Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1):244–248, 1961.
H. Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 1265–1269, 1963.
S. T. Tokdar. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhyā: The Indian Journal of Statistics, pages 90–110, 2006.
Y. Wang, A. Kucukelbir, and D. M. Blei. Reweighted data for robust probabilistic models. In International Conference on Machine Learning, pages 3646–3655, 2017.
M.-J. Woo and T. Sriram. Robust estimation of mixture complexity. Journal of the American Statistical Association, 101(476):1475–1486, 2006.
M.-J. Woo and T. Sriram. Robust estimation of mixture complexity for count data. Computational Statistics & Data Analysis, 51(9):4379–4392, 2007.
Y. Wu and S. Ghosal. Kullback Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics, 2:298–331, 2008.
E. P. Xing, K.-A. Sohn, M. I. Jordan, and Y.-W. Teh. Bayesian multi-population haplotype inference via a hierarchical Dirichlet process mixture. In International Conference on Machine Learning, pages 1049–1056, 2006.
S. J. Yakowitz and J. D. Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1):209–214, 1968.
K. Y. Yeung, D. R. Haynor, and W. L. Ruzzo. Validating clustering for gene expression data. Bioinformatics, 17(4):309–318, 2001.
A. Zeisel, A. B. Muñoz-Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling-Leffler, and S. Linnarsson. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226):1138–1142, 2015.