Optimal Bayesian estimation of Gaussian mixtures with growing number of components
Ilsang Ohn and Lizhen Lin
The University of Notre Dame
July 21, 2020
Abstract
We study posterior concentration properties of Bayesian procedures for estimating finite Gaussian mixtures in which the number of components is unknown and allowed to grow with the sample size. Under this general setup, we derive a series of new theoretical results. More specifically, we first show that under mild conditions on the prior, the posterior distribution concentrates around the true mixing distribution at a near optimal rate with respect to the Wasserstein distance. Under a separation condition on the true mixing distribution, we further show that a better and adaptive convergence rate can be achieved, and the number of components can be consistently estimated. Furthermore, we derive optimal convergence rates for the higher-order mixture models where the number of components diverges arbitrarily fast. In addition, we consider the fractional posterior and investigate its posterior contraction rates, which are also shown to be minimax optimal in estimating the mixing distribution under mild conditions. We also investigate Bayesian estimation of general mixtures under strong identifiability conditions, and derive the optimal convergence rates when the number of components is fixed. Lastly, we study theoretical properties of the posterior of the popular Dirichlet process (DP) mixture prior, and show that such a model can provide a reasonable estimate for the number of components while only guaranteeing a slow convergence rate for the mixing distribution estimation.
Finite mixture models are powerful tools for modeling heterogeneous data, which have been used in a wide range of applications in statistics and machine learning including density estimation [26], clustering [11], document modeling [3], image generation [39] and designing generative adversarial networks. A large literature studies convergence rates for estimating the mixing distribution of a finite mixture. A classical result of Chen established the
point-wise convergence rate $C_{\nu^\star} n^{-1/4}$ for estimating the mixing distribution under the $L_1$ distance, where $n$ denotes the sample size and $C_{\nu^\star}$ is a constant depending on the true mixing distribution $\nu^\star$. This convergence result holds for the so-called strongly identifiable mixtures, which include the Gaussian location mixtures as special cases, as do the results stated below. Nguyen [37] and Scricciolo [42] derived the $n^{-1/4}$ point-wise posterior contraction rate under the second-order Wasserstein distance. Ho and Nguyen [22] proved that the maximum likelihood estimator (MLE) can also achieve this point-wise rate. Under the first-order Wasserstein distance, a better point-wise convergence rate $C_{\nu^\star} n^{-1/2}$ can be obtained. Heinrich and Kahn [21], Ho et al. [23] and Guha et al. [20] established the $n^{-1/2}$ point-wise rate for the minimum Kolmogorov distance estimator, the minimum Hellinger distance estimator and the Bayesian procedure with the mixture of finite mixtures (MFM) prior, respectively. On the other hand, for continuous mixtures where the mixing distribution admits a density function, Martin [28] derived a point-wise convergence rate of mixing density estimation for the predictive recursion algorithm [36, 45].

However, due to a lack of uniformity in the constant $C_{\nu^\star}$, these analyses have been restricted to the fixed truth setup, with the number of components assumed to be either known or fixed. Also note that these point-wise rates are not upper bounds of the actual minimax optimal rates of mixing distribution estimation, which were later derived by Heinrich and Kahn [21]. It was shown that the minimax optimal convergence rate of mixing distribution estimation for strongly identifiable mixtures is of order $n^{-1/(4(k^\star - k_0) + 2)}$, where $k^\star$ and $k_0$ denote the total number of components and the number of well-separated components, respectively; the rate deteriorates with the difference $k^\star - k_0$, which can be viewed as the degree of overspecification. Heinrich and Kahn [21] also proposed a minimax optimal minimum Kolmogorov distance estimator, which however can be computationally expensive. More recently, Wu and Yang [47] proposed a computationally efficient estimator called the denoised method of moments estimator for Gaussian mixture models, and showed that this estimator achieves the minimax rate. However, these minimax optimal estimators require knowledge of the number of components $k^\star$, which is not practical. On the other hand, no Bayesian procedure has yet been able to yield a minimax optimal rate.

In general, one does not have prior knowledge of the number of components, and selecting an appropriate value of the number of components is a crucial step in providing accurate estimates of the true mixing distribution. With too many components, one may suffer from large variances, whereas too few components may lead to biased estimators. Also, estimating the number of components may be of interest in itself in practice, especially when each component has a physical interpretation.
A widely used approach to choosing the number of components is based on a model selection criterion applied before estimating parameters, and a few consistent model selection criteria are available in the literature, such as the complete likelihood [2], the Bayesian information criterion (BIC) [25], the singular Bayesian information criterion (sBIC) [8] and the Bayes factor [6].

A Bayesian approach is an attractive alternative due to its ability to estimate both the number of components and the parameters in a unified manner. A natural strategy for inferring a mixture model with an unknown number of components is to also impose a prior on the number of components $k$. By doing so, it provides a way of not only choosing the best number of components (i.e., model selection), but also combining results from different mixture models with possibly varying numbers of components (i.e., model averaging). One notable disadvantage of such models is that posterior computations may be challenging, since they require developing Markov chain Monte Carlo (MCMC) algorithms for sampling from a parameter space of varying dimension, which often results in poor mixing or slow convergence of the Markov chain to the stationary distribution. Several MCMC methods have been proposed to circumvent this issue, including [40, 44, 38, 34]. On the theoretical side, Guha et al. [20] derived the $n^{-1/2}$ point-wise posterior contraction rate for this type of prior distribution. They also obtained posterior consistency of the fixed number of components under the strong identifiability condition. Another promising approach is to use over-fitted mixtures. This approach considers a mixture model with a number of components larger than the true one and estimates the true model by discarding spurious components. Rousseau and Mengersen [41] studied asymptotic properties of over-fitted mixtures and proved that, with a prior on the weights of a mixture using a Dirichlet distribution with a suitably selected hyperparameter, the spurious components vanish asymptotically at the rate $n^{-1/2}\log^a n$ for some $a > 0$.
For the popular Dirichlet process (DP) mixture prior, we show that the number of clusters can be a reasonable estimate of the true number of components (Theorem 4.1), while for mixing distribution estimation the performance of the DP is inferior in view of the convergence rate (Theorem 4.2).
We first introduce some notation that will be used throughout the paper. For a positive integer $n \in \mathbb{N}$, we let $[n] := \{1, 2, \ldots, n\}$. For two positive sequences $\{a_n\}_{n\in\mathbb{N}}$ and $\{b_n\}_{n\in\mathbb{N}}$, we write $a_n \lesssim b_n$ if there exists a positive constant $C > 0$ such that $a_n \le C b_n$ for any $n \in \mathbb{N}$. Moreover, we write $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. For a real number $x \in \mathbb{R}$, $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$ and $\lceil x \rceil$ the smallest integer larger than or equal to $x$. For $n$ random variables $X_1, \ldots, X_n$, we use the shorthand notation $X^n := (X_1, \ldots, X_n)$. We denote by $\mathbb{1}(\cdot)$ the indicator function. Let $\delta_\theta$ denote the Dirac measure at $\theta$.

Let $(\mathcal{X}, \mathscr{X})$ be a measurable space equipped with the Lebesgue measure $\lambda$. For $q > 0$ and a function $f$ on $\mathcal{X}$, we let $\|f\|_q$ denote its $\ell_q$ norm with respect to the Lebesgue measure, i.e., $\|f\|_q := \left(\int |f(x)|^q \lambda(dx)\right)^{1/q}$. For a probability measure $G$ on $(\mathcal{X}, \mathscr{X})$, let $\mathbb{P}_G$ denote the probability or the expectation under the measure $G$. We denote by $p_G$ the probability density function of $G$ with respect to the Lebesgue measure $\lambda$. For $n \in \mathbb{N}$, let $\mathbb{P}^{(n)}_G$ be the probability or the expectation under the product measure and $p^{(n)}_G$ its density function. For two probability densities $p_1$ and $p_2$, we denote by $\mathrm{KL}(p_1, p_2)$ the Kullback-Leibler (KL) divergence from $p_1$ to $p_2$ and by $\mathrm{KL}_2(p_1, p_2)$ the KL variation, i.e.,

$\mathrm{KL}(p_1, p_2) := \int \log\left(\frac{p_1(x)}{p_2(x)}\right) p_1(x) \lambda(dx)$, $\quad \mathrm{KL}_2(p_1, p_2) := \int \log^2\left(\frac{p_1(x)}{p_2(x)}\right) p_1(x) \lambda(dx)$.

Moreover, we let $R_\alpha(p_1, p_2)$ denote the Rényi $\alpha$-divergence of order $\alpha \in (0, 1)$ from $p_1$ to $p_2$ and $h(p_1, p_2)$ denote the Hellinger distance between $p_1$ and $p_2$, i.e.,

$R_\alpha(p_1, p_2) := -\frac{1}{1-\alpha} \log\left(\int p_1^\alpha(x) p_2^{1-\alpha}(x) \lambda(dx)\right)$, $\quad h(p_1, p_2) := \left\{\int \left(\sqrt{p_1(x)} - \sqrt{p_2(x)}\right)^2 \lambda(dx)\right\}^{1/2}$.

For a convex function $f: \mathbb{R} \mapsto \mathbb{R}$ such that $f(1) = 0$, the $f$-divergence from $p_1$ to $p_2$ is defined by

$D_f(p_1, p_2) := \int f\left(\frac{p_1(x)}{p_2(x)}\right) p_2(x) \lambda(dx)$.

For $\zeta > 0$, a space of certain distributions $\mathcal{G}$ and a distribution $G^\star \in \mathcal{G}$, we define a $\zeta$-KL neighborhood of $G^\star$ by

$B_{\mathrm{KL}}(\zeta, G^\star, \mathcal{G}) := \left\{G \in \mathcal{G} : \mathrm{KL}(p_{G^\star}, p_G) < \zeta^2, \ \mathrm{KL}_2(p_{G^\star}, p_G) < \zeta^2\right\}$.

For a metric space $(\mathcal{Z}, \rho)$, we let $N(\epsilon, \mathcal{Z}, \rho)$ denote the $\epsilon$-covering number of $(\mathcal{Z}, \rho)$ and let $\mathrm{diam}(\mathcal{Z}) := \sup\{\rho(z_1, z_2) : z_1, z_2 \in \mathcal{Z}\}$.

In this paper, we initially consider the Gaussian location mixture model in one dimension:

$X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} \sum_{j=1}^{k} w_j \mathrm{N}(\theta_j, \sigma^2)$,   (2.1)

where $\theta_1, \ldots, \theta_k \in \mathbb{R}$ are the atoms and $(w_1, \ldots, w_k) \in \Delta_k$ are the mixing weights. Here we define
$\Delta_k := \{(w_1, \ldots, w_k) \in [0,1]^k : \|w\|_1 = 1\}$

for $k \in \mathbb{N}$. We assume that the variance $\sigma^2$ is known and, without loss of generality, $\sigma^2 = 1$. With the convolution denoted by the symbol $\ast$, we simply write

$\nu \ast \Phi = \sum_{j=1}^{k} w_j \mathrm{N}(\theta_j, 1)$

for the mixing distribution $\nu := \sum_{j=1}^{k} w_j \delta_{\theta_j}$, where $\Phi$ denotes the standard normal distribution. For a set $\Theta \subset \mathbb{R}$ and $k \in \mathbb{N}$, we define the set of $k$-atomic distributions

$\mathcal{M}_k(\Theta) := \left\{\sum_{j=1}^{k} w_j \delta_{\theta_j} : (w_1, \ldots, w_k) \in \Delta_k, \ \theta_1, \ldots, \theta_k \in \Theta\right\}$.

Note that $\mathcal{M}_k(\Theta) \subset \mathcal{M}_{k+1}(\Theta)$ for every $k \in \mathbb{N}$. The parameter space is given by $\mathcal{M}(\Theta) := \bigcup_{k \in \mathbb{N}} \mathcal{M}_k(\Theta)$. For mathematical convenience, we introduce the notation $\mathcal{P}(\Theta)$ to denote the set of all distributions supported on $\Theta$. Note that $\mathcal{M}(\Theta) \subset \mathcal{P}(\Theta)$.

For mixture models, the Wasserstein distance is widely used as a performance measure for mixing distribution estimation. To define the Wasserstein distance between two atomic distributions, we first define, for two given weight vectors $w \in \Delta_k$ and $w' \in \Delta_{k'}$,

$\mathcal{Q}(w, w') := \left\{(p_{jh})_{j \in [k], h \in [k']} \in [0,1]^{k \times k'} : \sum_{h=1}^{k'} p_{jh} = w_j, \ \sum_{j=1}^{k} p_{jh} = w'_h, \ \forall j \in [k], h \in [k']\right\}$,

which is the set of joint distributions on $[k] \times [k']$ with marginal distributions $w$ and $w'$. For any $q \ge 1$, the $q$-th order Wasserstein distance between two atomic distributions $\nu := \sum_{j=1}^{k} w_j \delta_{\theta_j}$ and $\nu' := \sum_{h=1}^{k'} w'_h \delta_{\theta'_h}$ is defined as

$W_q(\nu, \nu') := \left(\inf_{p \in \mathcal{Q}(w, w')} \sum_{j=1}^{k} \sum_{h=1}^{k'} p_{jh} |\theta_j - \theta'_h|^q\right)^{1/q}$.

Our analysis of mixing distribution estimation invokes the connection between the difference of moments and the Wasserstein distance, which was developed by [47]. For $\nu \in \mathcal{M}(\Theta)$, we denote by $m_h(\nu)$ the $h$-th moment of $\nu$, that is, $m_h(\nu) := \mathbb{E}(X^h)$, where $X$ is a random variable such that $X \sim \nu$. The $r$-th order moment vector is defined by $m_{1:r}(\nu) := (m_1(\nu), \cdots, m_r(\nu))$. Closeness of the moment vectors of two atomic distributions implies their closeness in the Wasserstein distance; see Lemmas 6.1 and 6.5.
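To make these two quantities concrete, here is a small numerical sketch (ours, not part of the paper; the two example distributions are arbitrary). It computes $W_1$ between two atomic mixing distributions, using the fact that on the real line $W_1$ equals the $L_1$ distance between distribution functions, which scipy evaluates directly from atoms and weights, together with the sup-norm gap of their moment vectors.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def moment_vector(atoms, weights, r):
    """Moment vector (m_1(nu), ..., m_r(nu)) of nu = sum_j w_j delta_{theta_j}."""
    return np.array([np.sum(weights * atoms**h) for h in range(1, r + 1)])

# Two 3-atomic mixing distributions supported on [-L, L]
atoms1, w1 = np.array([-2.0, 0.0, 2.0]), np.array([1/3, 1/3, 1/3])
atoms2, w2 = np.array([-2.0, 0.1, 2.0]), np.array([0.3, 0.4, 0.3])

# First-order Wasserstein distance between the two atomic distributions
print("W1:", wasserstein_distance(atoms1, atoms2, u_weights=w1, v_weights=w2))

# Sup-norm gap of the moment vectors m_{1:5}
gap = np.max(np.abs(moment_vector(atoms1, w1, 5) - moment_vector(atoms2, w2, 5)))
print("moment gap:", gap)
```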
We first assume that the true data generating process is given as $\nu^\star \ast \Phi$, where $\nu^\star \in \mathcal{M}_{k^\star}([-L, L])$ for some $L > 0$ and $k^\star \in \mathbb{N}$, which is the true number of mixing components. For simplicity, we write $\mathcal{M}_k := \mathcal{M}_k([-L, L])$ for each $k \in \mathbb{N}$ and $\mathcal{M} := \mathcal{M}([-L, L]) = \cup_{k=1}^{\infty} \mathcal{M}_k$. We consider a general model in which the true mixing distribution $\nu^\star \in \mathcal{M}_{k^\star}$ can vary with the sample size $n$; in particular, the true number of components $k^\star$ can vary with $n$. This is a critical difference from the existing Bayesian literature on mixture models, which assumed a fixed true mixing distribution [37, 42, 20].

We assume a known upper bound $\bar{k}_n$ on the true number of components $k^\star$. This assumption alleviates some technical difficulties, and can be justified by taking $\bar{k}_n \asymp \log n / \log\log n$, as Wu and Yang [47] did, since the minimax optimal convergence rate of mixing distribution estimation for large mixtures $\nu^\star \in \mathcal{M}_{k^\star}$ with $k^\star \gtrsim \log n / \log\log n$ is the slow rate $\log\log n / \log n$ (see Proposition 8 of [47]), and we will show that one can develop a Bayesian procedure that attains this rate without knowing an upper bound on the true number of components; see Theorem 2.7 in Section 2.6.

We now introduce our prior distribution on the finite Gaussian mixture model. The prior first samples the number of components $k$ from a prior $\Pi(k)$, and then samples the atoms $\theta \in [-L, L]^k$ and the weights $w \in \Delta_k$ from $\Pi(\theta \mid k)$ and $\Pi(w \mid k)$, respectively. Thus the prior distribution is a distribution on $\mathcal{M} = \cup_{k \in \mathbb{N}} \mathcal{M}_k$. We impose the following conditions on the prior.

Assumption P.
Recall that $\bar{k}_n$ is the known upper bound on the true number of components. The prior distribution $\Pi$ satisfies the following conditions:

(P1) The prior distribution on the number of components $k$ is data-dependent. There are constants $c_1 > 0$ and $A > 0$ such that for any $n \in \mathbb{N}$ and any $k_\circ \in \mathbb{N}$,

$\frac{\Pi(k = k_\circ + 1)}{\Pi(k = k_\circ)} \le c_1 e^{-A \bar{k}_n \log n}$.   (2.2)

Additionally, there are constants $c_2 > 0$ and $c_3 > 0$ such that for any $n \in \mathbb{N}$ and any $k_\dagger \in [\bar{k}_n]$,

$\Pi(k = k_\dagger) \ge c_2 e^{-(c_3 \bar{k}_n \log n) k_\dagger}$.   (2.3)

(P2) For any $k \in \mathbb{N}$ and any $(w^0_1, \ldots, w^0_k) \in \Delta_k$, there are positive constants $c_4$ and $c_5$ such that for any $\eta \in (0, 1/k)$,

$\Pi\left(\sum_{j=1}^{k} |w_j - w^0_j| \le \eta \,\Big|\, k\right) \ge c_4 \eta^{c_5 k}$.   (2.4)

(P3) For any $k \in \mathbb{N}$ and any $\theta^0 \in [-L, L]^k$, there are positive constants $c_6$ and $c_7$ such that for any $\eta > 0$,

$\Pi\left(\max_{1 \le j \le k} |\theta_j - \theta^0_j| \le \eta \,\Big|\, k\right) \ge c_6 \eta^{c_7 k}$.   (2.5)

We now provide some examples of prior distributions satisfying Assumption P. In the following examples, the constant $A > 0$ is the one appearing in (P1).
The mixture of finite mixtures (MFM) prior considered in [34, 20] is a hierarchical prior consisting of a distribution on the number of components, a Dirichlet distribution on the weights and a distribution on the atoms. Assumption P is met by the MFM prior with appropriate choices of each distribution. An example is given as follows. The geometric distribution on $k$ with probability mass function $(1 - p_n)^{k-1} p_n$, where $p_n := 1 - a \exp(-A \bar{k}_n \log n)$ for arbitrary $a > 0$, satisfies (2.2) and (2.3) since $p_n \gtrsim 1$. The Dirichlet distribution $\mathrm{DIR}(\kappa_1, \ldots, \kappa_k)$ on the mixing weights with $\kappa_j \in (\kappa_0, 1)$ for every $j \in [k]$ and some $\kappa_0 \in (0, 1)$ satisfies (P2); see Lemma A.5. If the prior distribution on $\theta$ behaves like a uniform distribution up to a multiplicative constant, then (P3) holds.

Example 2.
Consider a Poisson distribution for $k$ supported on $\mathbb{N}$ with probability mass function $e^{-\lambda_n} \lambda_n^{k-1} / (k-1)!$, where $\lambda_n := a \exp(-A \bar{k}_n \log n)$ for arbitrary $a > 0$. Then this Poisson distribution clearly satisfies (2.2). Also it satisfies (2.3) with the choice $c_3 = A + c'$ for some constant $c' > 0$, since $e^{-\lambda_n} \gtrsim 1$ and

$((k-1)!)^{-1} \ge \exp(-k \log k) \ge \exp(-\bar{k}_n \log \bar{k}_n) \ge \exp(-c' \bar{k}_n \log n)$

for $k \in [\bar{k}_n]$. The MFM prior with such a Poisson prior on the number of components also satisfies Assumption P.
Example 3. Consider a binomial prior distribution on the number of components such that $k - 1 \sim \mathrm{BINOM}(\bar{k}_n - 1, p_n)$ with $p_n := a \exp(-A \bar{k}_n \log n)$ for arbitrary $a > 0$. Then this prior satisfies (2.2) since the ratio of successive probabilities is at most $\bar{k}_n p_n / (1 - p_n)$, where $\bar{k}_n \lesssim e^{\log\log n}$ and $1 - p_n \le 1$. Also it satisfies (2.3) since $1 - p_n \gtrsim 1$. The MFM prior with this binomial prior distribution satisfies Assumption P.
Example 4.
The spike and slab prior distribution on the unnormalized weights can satisfy (P1) and (P2). Suppose that we consider an over-fitted mixture model $\nu = \sum_{j=1}^{\bar{k}_n} w_j \delta_{\theta_j}$. Let $S := \{j \in [\bar{k}_n] : w_j > 0\}$, the set of indices corresponding to nonzero weights. Then we can write $\nu = \sum_{j \in S} w_j \delta_{\theta_j}$. Let $\tilde{w} \equiv (\tilde{w}_j)_{j \in [\bar{k}_n]}$ be independent random variables, where $\tilde{w}_1$ is generated from $\mathrm{GAMMA}(\kappa, b)$ and the other variables, i.e., $\tilde{w}_2, \ldots, \tilde{w}_{\bar{k}_n}$, are generated from the spike and slab distribution $(1 - p_n)\delta_0 + p_n \mathrm{GAMMA}(\kappa, b)$ with $p_n := a \exp(-A \bar{k}_n \log n)$ for $a > 0$, $b > 0$ and $\kappa \in (0, 1)$. If we define the number of components as the number of nonzero elements in $\tilde{w}$ and the weights as a normalized version of $(\tilde{w}_j)_{j \in S}$, i.e., $k := \|\tilde{w}\|_0$ and $w_j := \tilde{w}_j / \|\tilde{w}\|_1$ for $j \in S$, then $k - 1 \sim \mathrm{BINOM}(\bar{k}_n - 1, p_n)$ and, given $k$, $(w_j)_{j \in S}$ follows $\mathrm{DIR}(\kappa, \ldots, \kappa)$. Thus Assumption P holds by Examples 1 and 3.

In this section, we present concentration properties of the posterior distribution $\Pi(\cdot \mid X^n)$ defined below, with the prior given in Section 2.3 and the data generated from the model (2.1):
$\Pi(d\nu \mid X^n) := \frac{p^{(n)}_{\nu \ast \Phi}(X^n) \, \Pi(d\nu)}{\int p^{(n)}_{\nu \ast \Phi}(X^n) \, \Pi(d\nu)}$.   (2.6)

We first show that our posterior distribution does not overestimate the number of components.

Theorem 2.1.
Assume $\nu^\star \in \mathcal{M}_{k^\star}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi(\nu \in \mathcal{M}_{k^\star} \mid X^n)\right] \to 1$.   (2.7)
Remark 1.
Note that the condition $\nu^\star \in \mathcal{M}_{k^\star}$ does not mean that $\nu^\star$ is not included in lower-order models such as $\mathcal{M}_1, \ldots, \mathcal{M}_{k^\star - 1}$, because there may be overlapping atoms or zero weights. In view of this observation, Theorem 2.1 can be stated with a more precise argument as follows. Let $\breve{k}^\star$ be the smallest number of components of the true mixing distribution $\nu^\star$, in the sense that $\nu^\star \in \mathcal{M}_{\breve{k}^\star} \setminus \mathcal{M}_{\breve{k}^\star - 1}$. Then the conclusion of the theorem actually means that $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}[\Pi(\nu \in \mathcal{M}_{\breve{k}^\star} \mid X^n)] \to 1$. □

The following theorem shows the optimal concentration property of the posterior distribution of the mixing distribution.
Theorem 2.2.
Under the same assumptions as Theorem 2.1, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M \bar{\epsilon}_n \,\Big|\, X^n\right)\right] = o(1)$   (2.8)

for some universal constant $M > 0$, where

$\bar{\epsilon}_n := (k^\star)^{\frac{6k^\star - 3}{4k^\star - 2}} \left(\frac{\bar{k}_n \log n}{n}\right)^{\frac{1}{4k^\star - 2}}$.   (2.9)

If the number of components $k^\star$ is fixed, the convergence rate in Theorem 2.2 is equivalent to the minimax optimal rate $n^{-1/(4k^\star - 2)}$ [47, Proposition 7] up to at most a logarithmic factor, since $\bar{k}_n \lesssim \log n$.

Compared with the minimax rate, our rate has two redundant factors, $\bar{k}_n$ and $\log n$. The $\log n$ factor is common in the nonparametric Bayesian literature, and often arises due to the popular "prior mass and testing" proof technique. We refer to the papers [24, 13] for discussions of this phenomenon. We also adopt the "prior mass and testing" approach and thus incur the $\log n$ factor. The $\bar{k}_n$ factor is paid for model selection. Unlike the frequentist work [47], which proposes an estimation algorithm that attains the exact minimax optimal rate under the assumption that the true number of components is known, it is unclear whether the $\bar{k}_n$ factor can be removed. We may be able to remove this factor using somewhat refined proof techniques without assuming a known number of components. For example, some Bayesian works on linear regression [5, 29] and Gaussian directed acyclic graph models [4, 27] simultaneously achieved model selection consistency and the exact minimax convergence rates for parameter estimation through a careful analysis of the likelihood ratio. We will investigate whether the same can be done for Gaussian mixture models in the near future.

To improve the convergence rate in Theorem 2.2, one may assume that the atoms are well separated and the weights are bounded away from zero. We introduce the formal definition related to this notion.
Definition 1.
An atomic distribution $\nu := \sum_{j=1}^{k} w_j \delta_{\theta_j}$ is said to be $k_0(\gamma, \omega)$-separated, for $k_0 \in [k]$, $\gamma > 0$ and $\omega > 0$, if there exists a partition $S_1, \ldots, S_{k_0}$ of $[k]$ such that

• $|\theta_j - \theta_{j'}| \ge \gamma$ for any $j \in S_l$, $j' \in S_{l'}$ and any $l, l' \in [k_0]$ with $l \ne l'$;
• $\sum_{j \in S_l} w_j \ge \omega$ for any $l \in [k_0]$.

We let $\mathcal{M}_{k, k_0, \gamma, \omega} := \{\nu \in \mathcal{M}_k : \nu \text{ is } k_0(\gamma, \omega)\text{-separated}\}$.

In the next theorem, we derive the optimal posterior contraction rate of the mixing distribution under the separation assumption. We call this contraction rate an adaptive rate because the result is achieved without any knowledge of the number of well-separated components $k_0$ of the true mixing distribution.

Theorem 2.3.
Assume $\nu^\star \in \mathcal{M}_{k^\star, k_0, \gamma, \omega}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Moreover, assume that $\gamma\omega > M' \bar{\epsilon}_n$ for a sufficiently large constant $M' > 0$, where $\bar{\epsilon}_n$ is the convergence rate defined in (2.9). Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M \tilde{\epsilon}_n \,\Big|\, X^n\right)\right] = o(1)$,   (2.10)

for some universal constant $M > 0$, where

$\tilde{\epsilon}_n := (k^\star)^{\frac{6(k^\star - k_0) + 3}{4(k^\star - k_0) + 2}} \, \gamma^{-\frac{4k_0 - 2}{4(k^\star - k_0) + 2}} \left(\frac{\bar{k}_n \log n}{n}\right)^{\frac{1}{4(k^\star - k_0) + 2}}$.   (2.11)
Remark 2.
A nice surprise from the result of Theorem 2.3 is that our Bayesian procedure can achieve a better convergence rate than the one in Theorem 2.2 without requiring any further condition on the prior distribution. This is because the condition $\gamma\omega > M' \bar{\epsilon}_n$ guarantees that the mixing distribution $\nu$ is $k_0(a\gamma, a\omega)$-separated asymptotically for some constant $a \in (0, 1)$ under the posterior distribution, provided that Theorem 2.2 holds. □

Under the same separation condition, but with the additional assumption that the number of components $k^\star$ is known, Wu and Yang [47] achieved the convergence rate $C_{k^\star, \gamma}\, n^{-1/(4(k^\star - k_0) + 2)}$ for the denoised method of moments estimator, where $C_{k^\star, \gamma}$ is some quantity depending on $k^\star$ and $\gamma$. Compared with the rate of [47], our convergence rate (2.11) has a redundant factor $\bar{k}_n \log n$ due to the proof technique and the existence of the model selection step. Again, the factor $\bar{k}_n$ can be removed if one assumes that the number of components is known.

In view of Proposition 2.4 presented below, the convergence rate in Theorem 2.3 is minimax optimal [21, Theorem 3.2] up to a logarithmic factor if the model parameters $k^\star$, $k_0$ and $\gamma$ are fixed constants. Heinrich and Kahn [21] established the minimax optimal rate $n^{-1/(4(k^\star - k_0) + 2)}$ for the estimation of mixing distributions satisfying a locally varying condition. Namely, they showed that for fixed $k^\star \in \mathbb{N}$, $k_0 \in [k^\star]$ and $\nu_0 \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$, it follows that

$\inf_{\{\hat{\nu}\}} \sup_{\nu^\star \in \mathcal{M}_{k^\star} : W_1(\nu^\star, \nu_0) \le \epsilon^\dagger_n} \mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[W_1(\hat{\nu}, \nu^\star)\right] \gtrsim n^{-\frac{1}{4(k^\star - k_0) + 2}}$,   (2.12)

where the infimum ranges over all possible sequences of estimators and $\epsilon^\dagger_n := n^{-1/(4(k^\star - k_0) + 2) + \iota}$ for some $\iota > 0$, so that the true mixing distribution is allowed to vary only locally. This locally varying condition is seemingly different from the separation condition given in Definition 1, but in fact the former is a sufficient condition for the latter. Intuitively, we can expect that a true distribution $\nu^\star \in \mathcal{M}_{k^\star}$ close to $\nu_0 \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$ has at least $k_0$ well-separated components, and therefore satisfies the separation condition. We formally state this argument in the next proposition.

Proposition 2.4.
Let $k_0 \in \mathbb{N}$ and $\nu_0 := \sum_{j=1}^{k_0} w^0_j \delta_{\theta^0_j} \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$. Define

$\gamma(\nu_0) := \min_{j, h \in [k_0] : j \ne h} |\theta^0_j - \theta^0_h| > 0$, $\quad \omega(\nu_0) := \min_{j \in [k_0]} w^0_j > 0$.

Let $k \in \{k_0, k_0 + 1, \ldots\}$ and $c \in (0, 1/4)$. Then we have

$\left\{\nu \in \mathcal{M}_k : W_1(\nu, \nu_0) < c\, \gamma(\nu_0)\, \omega(\nu_0)\right\} \subset \mathcal{M}_{k, k_0, (1 - 2c)\gamma(\nu_0), \frac{1 - 2c}{1 - c}\omega(\nu_0)}$.

Due to Proposition 2.4, it is clear that our Bayesian procedure is also near-optimal for the estimation of the mixing distribution under the locally varying condition. We merely state the result.

Corollary 2.5.
Assume $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Let $k_0 \in \mathbb{N}$ be a fixed constant such that $k_0 \le k^\star$, and let $\nu_0 \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$ be a fixed distribution. Moreover, assume that the prior distribution $\Pi$ satisfies Assumption P. Then there exist universal constants $\tau > 0$ and $M > 0$ such that

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M (k^\star)^{\frac{6(k^\star - k_0) + 3}{4(k^\star - k_0) + 2}} \left(\frac{\bar{k}_n \log n}{n}\right)^{\frac{1}{4(k^\star - k_0) + 2}} \,\Big|\, X^n\right)\right] = o(1)$   (2.13)

for any $\nu^\star \in \mathcal{M}_{k^\star}$ with $W_1(\nu^\star, \nu_0) < \tau$ eventually.

As a byproduct, we can obtain posterior consistency of the true number of components when the true mixing distribution $\nu^\star$ is perfectly separated, that is, $k^\star = k_0$. Note that in this case, $\nu^\star \in \mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1}$. The following theorem states this formally.

Theorem 2.6.
Assume $\nu^\star \in \mathcal{M}_{k^\star, k^\star, \gamma, \omega}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Moreover, assume that

$\gamma\omega > M' \max\{\bar{\epsilon}_n, \tilde{\epsilon}_n\}$   (2.14)

for a sufficiently large constant $M' > 0$, where $\bar{\epsilon}_n$ and $\tilde{\epsilon}_n$ are the convergence rates defined in (2.9) and (2.11), respectively. Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(\nu \in \mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1} \,\big|\, X^n\right)\right] \to 1$.   (2.15)

The condition (2.14) provides a threshold for detection. This condition plays a similar role to the beta-min condition for variable selection in linear regression [5, 29].

Guha et al. [20] obtained a consistency result with a prior distribution similar to ours, but their analysis is restricted to the fixed truth case.
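The following back-of-the-envelope calculator (ours) evaluates the $n$-dependence of the rates in Theorems 2.2 and 2.3, with the polynomial prefactors in $k^\star$ and $\gamma$ suppressed. It makes visible how the exponent improves from $1/(4k^\star - 2)$ toward the parametric $1/2$ as the number of well-separated components $k_0$ approaches $k^\star$.

```python
import numpy as np

def rate_no_separation(n, k_star, kbar_n):
    # n-dependence of (2.9): (kbar_n log n / n)^{1/(4 k* - 2)}
    return (kbar_n * np.log(n) / n) ** (1 / (4 * k_star - 2))

def rate_separation(n, k_star, k0, kbar_n):
    # n-dependence of (2.11): (kbar_n log n / n)^{1/(4 (k* - k0) + 2)}
    return (kbar_n * np.log(n) / n) ** (1 / (4 * (k_star - k0) + 2))

n, k_star, kbar_n = 10_000, 4, 5
print("no separation   :", rate_no_separation(n, k_star, kbar_n))       # exponent 1/14
print("k0 = 3          :", rate_separation(n, k_star, 3, kbar_n))       # exponent 1/6
print("k0 = k* (full)  :", rate_separation(n, k_star, k_star, kbar_n))  # exponent 1/2
```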
In Section 2, we have assumed that $k^\star \lesssim \log n / \log\log n$. This assumption is justified by the minimax result for the estimation of higher-order mixtures presented by [47]. In this section, we prove that there is a Bayesian procedure, similar to the one considered in Section 2 but not assuming a known upper bound on the number of components, that can attain this minimax optimality. In this case, instead of Assumption (P1), we impose the milder condition given below on the prior.

(P1′) There are constants $c_1 > 0$ and $c_2 > 0$ such that for any $k_\circ \in \mathbb{N}$,

$\Pi(k = k_\circ) \ge c_1 e^{-c_2 k_\circ^2}$.   (2.16)

(P1′) is satisfied by the Poisson and geometric distributions with constant mean and constant success probability, respectively.

The next theorem provides the convergence rate of mixing distribution estimation without any restriction on the true number of components.

Theorem 2.7.
Assume $\nu^\star \in \mathcal{M}$. Then with the prior distribution $\Pi$ satisfying (P1′), (P2) and (P3), we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M \frac{\log\log n}{\log n} \,\Big|\, X^n\right)\right] = o(1)$   (2.17)

for some universal constant $M > 0$.

If the true mixing distribution $\nu^\star$ belongs to $\mathcal{M}_{k^\star}$ with $k^\star \asymp \log n / \log\log n$, the convergence rate in the above theorem is rate-exact optimal [47, Theorem 5]. Indeed, the above result holds even when the true generating process is given by $\mu^\star \ast \Phi$ with $\mu^\star \in \mathcal{P}([-L, L])$, which includes continuous or infinite mixtures.

In this section, we consider the fractional posterior, also called the $\alpha$-posterior, as the estimator. With the prior distribution $\Pi$ and the data $X^n$, the fractional posterior $\Pi_\alpha(\cdot \mid X^n)$ of order $\alpha \in (0, 1)$ is defined by

$\Pi_\alpha(d\nu \mid X^n) := \frac{\{p^{(n)}_{\nu \ast \Phi}(X^n)\}^\alpha \, \Pi(d\nu)}{\int \{p^{(n)}_{\nu \ast \Phi}(X^n)\}^\alpha \, \Pi(d\nu)}$.   (2.18)

The fractional posterior has received a great deal of recent attention, mainly due to its empirically demonstrated robustness to model misspecification [19, 31]. In particular, numerical experiments of [31] showed that fractional posteriors of Gaussian mixtures are robust to a certain type of model misspecification, while the regular posteriors are not. Another key advantage is that concentration of the fractional posterior can be established under fewer conditions on the prior compared to the regular posterior [1]. This also turns out to be the case for the Gaussian mixtures. The use of the fractional posterior allows us to avoid the construction of an exponential test function, so the proof is substantially simplified.
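A toy illustration of (2.18) may help fix ideas (a sketch of ours; the two-atom candidate family, flat grid prior and all numerical choices are arbitrary). Over a finite grid of candidate mixing distributions, the $\alpha$-posterior simply raises each likelihood to the power $\alpha$ before normalizing, and $\alpha = 1$ recovers the regular posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Truth: 0.5 N(-2,1) + 0.5 N(2,1)
x = rng.normal(loc=rng.choice([-2.0, 2.0], size=500), scale=1.0)

# Candidates: two-atom mixing distributions nu_t with atoms (-t, t), equal weights
grid = np.linspace(0.5, 3.5, 61)

def loglik(t):
    dens = 0.5 * norm.pdf(x, -t, 1) + 0.5 * norm.pdf(x, t, 1)
    return np.sum(np.log(dens))

ll = np.array([loglik(t) for t in grid])
for alpha in (1.0, 0.5):
    w = np.exp(alpha * (ll - ll.max()))  # tempered likelihood, flat prior on the grid
    w /= w.sum()
    print(f"alpha = {alpha}: posterior mean of t = {np.dot(grid, w):.3f}")
```

The next theorem shows that the fractional posterior has the same optimal concentration properties as the regular posterior.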
Theorem 2.8. Fix $\alpha \in (0, 1)$. Assume $\nu^\star \in \mathcal{M}_{k^\star}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Moreover, assume that the prior distribution $\Pi$ satisfies Assumption P. Then there exist positive constants $c_1$, $c_2$ and $c_3$ such that

$\Pi_\alpha(\nu \in \mathcal{M}_{k^\star} \mid X^n) \ge 1 - c_1 e^{-c_2 \bar{k}_n \log n}$   (2.19)

and

$\int W_1(\nu, \nu^\star) \, \Pi_\alpha(d\nu \mid X^n) \lesssim \bar{\epsilon}_n + e^{-c_2 \bar{k}_n \log n}$,   (2.20)

with $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}$-probability at least $1 - c_3/(\bar{k}_n \log n)$, where $\bar{\epsilon}_n$ is the convergence rate defined in (2.9).

If $e^{-c_2 \bar{k}_n \log n} \lesssim \bar{\epsilon}_n$, which holds for any diverging $\bar{k}_n$, the fractional posterior attains the minimax optimal convergence rate up to a logarithmic factor.

In this section, we extend the theoretical analysis of the Gaussian mixtures provided in Section 2.4 to general mixture models satisfying strong identifiability conditions. With a slight abuse of notation, for a mixing distribution $\nu \in \mathcal{M}(\Theta)$ and a family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ for $\Theta \subset \mathbb{R}$, we let $\nu \ast F$ denote the distribution having the density function

$p_{\nu \ast F}(\cdot) := \int f(\cdot, \theta) \, \nu(d\theta)$,   (3.1)

where $f(\cdot, \theta)$ denotes the probability density function of $F(\cdot, \theta)$. We call $F(\cdot, \cdot)$ a kernel distribution function and $f(\cdot, \cdot)$ a kernel density function. We assume here that the data are i.i.d. observations from the distribution $\nu^\star \ast F$ for some $k^\star$-atomic mixing distribution $\nu^\star \in \mathcal{M}_{k^\star}$ and a family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ satisfying some regularity and strong identifiability conditions. We first introduce the strong identifiability condition.

Definition 2.
A family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ for $\Theta \subset \mathbb{R}$ is said to be $q$-strongly identifiable if for any finite subset $B$ of $\Theta$,

$\left\|\sum_{j=0}^{q} \sum_{\theta' \in B} a_{j, \theta'} \frac{\partial^j f}{\partial \theta^j}(\cdot, \theta')\right\|_\infty = 0 \ \Longrightarrow \ \max_{j \in \{0, \ldots, q\}} \max_{\theta' \in B} |a_{j, \theta'}| = 0$.

We say the mixture $\nu \ast F$ is $q$-strongly identifiable if $\{F(\cdot, \theta) : \theta \in \Theta\}$ is $q$-strongly identifiable. Heinrich and Kahn [21, Theorem 2.4] show that location mixture models, i.e., $f(x, \theta) = f_0(x - \theta)$, in which both the kernel density function $f_0(\cdot)$ and its derivatives up to order $q - 1$ vanish at $\pm\infty$, are $q$-strongly identifiable. In particular, the Gaussian location mixture is $\infty$-strongly identifiable. Also scale mixtures, i.e., $f(x, \theta) = \theta^{-1} f_0(\theta^{-1} x)$ for $\theta \in \Theta \subset \mathbb{R}_+$, with the same condition on the kernel density function, are $q$-strongly identifiable.

We impose the following regularity conditions, including the strong identifiability condition.

Assumption F($q$). The family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ with $\Theta \subset \mathbb{R}$ satisfies the following conditions:

(F1) For any $x \in \mathbb{R}$, $F(x, \theta)$ is $q$-differentiable with respect to $\theta$.

(F2) $\{F(\cdot, \theta) : \theta \in \Theta\}$ is $q$-strongly identifiable.

(F3) There are constants $c > 0$ and $s > 0$ such that

$\left\|\frac{\partial^q F}{\partial \theta^q}(\cdot, \theta_1) - \frac{\partial^q F}{\partial \theta^q}(\cdot, \theta_2)\right\|_\infty \le c |\theta_1 - \theta_2|^s$

for any $\theta_1, \theta_2 \in \Theta$.

(F4) There are constants $c > 0$ and $b \in (0, 1]$ such that

$\int p_{\nu_1 \ast F}(x) \left(\frac{p_{\nu_1 \ast F}(x)}{p_{\nu_2 \ast F}(x)}\right)^b \lambda(dx) \le c$

for any $\nu_1, \nu_2 \in \mathcal{M}_q(\Theta)$.

The first three conditions are inherited from the regularity conditions of [21]. The additional condition (F4) is introduced to control the prior concentration of a KL neighborhood of the true distribution $\nu^\star \ast F$. If the set $\Theta$ is given as an interval, say $[-L, L]$, the condition (F4) is satisfied by various location mixtures, in particular by the Laplace location mixture [14] and the Gaussian location mixture [18].

In this section, we assume that the number of components $k^\star$ is fixed but still unknown. We thus use the prior distribution on the number of components satisfying (P1) with the constant $\bar{k}_n$. Furthermore, since we consider a general set of atoms $\Theta \subset \mathbb{R}$ rather than the interval $[-L, L]$, in order to include, for example, scale mixtures and exponential family mixtures, Assumption (P3) is slightly modified so that Equation (2.5) is met for any $k \in \mathbb{N}$ and $\theta^0 \in \Theta^k$. We also assume that the kernel distribution function $F(\cdot, \cdot)$ is known, i.e., there is no misspecification of the kernel distribution function. That is, we consider the posterior distribution denoted by $\Pi_F(\cdot \mid X^n)$, which is defined as

$\Pi_F(d\nu \mid X^n) := \frac{p^{(n)}_{\nu \ast F}(X^n) \, \Pi(d\nu)}{\int p^{(n)}_{\nu \ast F}(X^n) \, \Pi(d\nu)}$.   (3.2)

Note that we still allow the true mixing distribution to vary with the sample size. This setup is still substantially more general than the fixed truth setup considered in the existing Bayesian literature [37, 42, 20].
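Since (3.1) is the only place the kernel enters the likelihood, a generic implementation is immediate. The short sketch below (ours) evaluates $p_{\nu \ast F}$ with a pluggable kernel density $f(\cdot, \theta)$; the Gaussian and Laplace location kernels shown are the two strongly identifiable examples mentioned above.

```python
import numpy as np

def gaussian_kernel(x, theta):
    # f(x, theta) = phi(x - theta), standard normal location kernel
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def laplace_kernel(x, theta):
    # f(x, theta) = 0.5 * exp(-|x - theta|), Laplace location kernel
    return 0.5 * np.exp(-np.abs(x - theta))

def mixture_density(x, atoms, weights, kernel):
    # p_{nu*F}(x) = sum_j w_j f(x, theta_j), vectorized over x and atoms
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(np.asarray(weights) * kernel(x, np.asarray(atoms)), axis=-1)

xs = np.linspace(-5, 5, 5)
print(mixture_density(xs, atoms=[-1.0, 1.0], weights=[0.5, 0.5], kernel=gaussian_kernel))
print(mixture_density(xs, atoms=[-1.0, 1.0], weights=[0.5, 0.5], kernel=laplace_kernel))
```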
Theorem 3.1.
Let $\Theta$ be a compact subset of $\mathbb{R}$ with nonempty interior. Assume that $\nu^\star \in \mathcal{M}_{k^\star}(\Theta)$ with $k^\star \in \mathbb{N}$ being fixed, and that the family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ satisfies Assumption F($q$) with $q = k^\star$. Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast F}\left[\Pi_F\left(W_1(\nu, \nu^\star) \ge M \left(\frac{\log n}{n}\right)^{\frac{1}{4k^\star - 2}} \,\Big|\, X^n\right)\right] = o(1)$   (3.3)

for some universal constant $M > 0$.

The convergence rate in Theorem 3.1 is equivalent to the convergence rate (2.9) for the Gaussian mixtures with a fixed number of components $k^\star$.

Remark 3. We believe that even if the number of components grows, the result of Theorem 3.1 still holds with the same convergence rate as (3.3) up to a constant depending on $k^\star$, provided that Assumption F($q$) is met with $q = \infty$. We would need to establish a uniform version of Lemma 6.8 over the number of components, which is a key technical tool for the proof. This could be an objective of future work. □

Moreover, our Bayesian procedure can attain the minimax optimal convergence rate [21, Theorem 3.2] under the locally varying condition on the true mixing distribution, which is assumed in Corollary 2.5 for the Gaussian mixtures.
Theorem 3.2.
Let $\Theta$ be a compact subset of $\mathbb{R}$ with nonempty interior. Let $k^\star, k_0 \in \mathbb{N}$ be fixed constants with $k^\star \ge k_0$ and let $\nu_0 \in \mathcal{M}_{k_0}(\Theta) \setminus \mathcal{M}_{k_0 - 1}(\Theta)$ be a fixed distribution. Assume the family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ satisfies Assumption F($q$) with $q = k^\star$. Moreover, assume that the prior distribution $\Pi$ satisfies Assumption P. Then there exist universal constants $\tau > 0$ and $M > 0$ such that

$\mathbb{P}^{(n)}_{\nu^\star \ast F}\left[\Pi_F\left(W_1(\nu, \nu^\star) \ge M \left(\frac{\log n}{n}\right)^{\frac{1}{4(k^\star - k_0) + 2}} \,\Big|\, X^n\right)\right] = o(1)$   (3.4)

for any $\nu^\star \in \mathcal{M}_{k^\star}$ with $W_1(\nu^\star, \nu_0) < \tau$ eventually.

In this section, we consider the Dirichlet process (DP) prior [10] on the mixing distribution, which results in an infinite mixture model: the popular Dirichlet process (DP) mixture model. Although a DP mixture model is minimax optimal for density estimation, it attains only a slow logarithmic rate
of order $(\log n)^{-1/2}$ in estimating the mixing distribution of the Gaussian location mixtures, as shown by [37]. That result assumes that the number of components $k^\star$ is fixed. We consider the DP prior for mixing distribution estimation and derive the posterior contraction rates in the most general setup, by allowing the number of components of the true mixing distribution to grow. Furthermore, we adopt a natural strategy of using the number of clusters $T_n$ of the data to estimate the number of components, and we establish posterior consistency of such a procedure.

Note that the DP prior does not satisfy Assumption (P1), and thus the theorems in Section 2.4 do not cover the case of the DP prior. This section aims to separately analyze concentration properties of the posterior of DP mixture models.

In our Gaussian location mixture setup, the DP is a distribution on infinite-atomic distributions of the form

$\tilde{\nu} := \sum_{j=1}^{\infty} w_j \delta_{\theta_j}$,   (4.1)

where $w_1, w_2, \cdots \in [0, 1]$ are mixing weights such that $\sum_{j=1}^{\infty} w_j = 1$ and $\theta_1, \theta_2, \cdots \in [-L, L]$. We let $\mathcal{M}_\infty$ be the set of distributions of the form (4.1). The DP with a concentration parameter $\kappa > 0$ and a base distribution $H$, denoted by $\mathrm{DP}(\kappa, H)$, can be expressed by the following stick-breaking generation process [43]:

$E_j \overset{\mathrm{iid}}{\sim} \mathrm{BETA}(1, \kappa)$, $\quad w_j = E_j \prod_{h=1}^{j-1} (1 - E_h)$, $\quad \theta_j \overset{\mathrm{iid}}{\sim} H$.

Since the weights generated from the above procedure are positive with probability 1, one can say that $\Pi_{\mathrm{DP}}(\tilde{\nu} \in \mathcal{M}_\infty \setminus \mathcal{M}) = 1$.
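The stick-breaking process above translates directly into code. The following sketch (ours; the truncation level 500 is an arbitrary computational device) samples a truncated approximation of $\mathrm{DP}(\kappa, H)$ with uniform base distribution and reports the number of distinct clusters among $n$ latent assignments, illustrating how a tiny concentration parameter puts almost all mass on very few atoms.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(kappa, L=6.0, trunc=500):
    betas = rng.beta(1.0, kappa, size=trunc)           # E_j ~ Beta(1, kappa)
    sticks = np.concatenate(([1.0], np.cumprod(1 - betas[:-1])))
    w = betas * sticks                                  # w_j = E_j * prod_{h<j}(1 - E_h)
    theta = rng.uniform(-L, L, size=trunc)              # atoms drawn from the base H
    return w / w.sum(), theta                           # renormalize the truncation

def num_clusters(n, kappa):
    w, _ = stick_breaking(kappa)
    z = rng.choice(len(w), size=n, p=w)                 # latent assignments Z_i
    return len(np.unique(z))

n = 2000
print("kappa = 1          :", num_clusters(n, 1.0))
print("kappa = 1/(n log n):", num_clusters(n, 1.0 / (n * np.log(n))))
```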
This implies that every mixing distribution generated from the posterior of the DP mixture model has an infinite number of components; therefore the posterior distribution of the number of components $k$ cannot provide any reasonable estimate of the true number of components.

One possible solution is to use an additional post-processing procedure for the posterior distribution. For example, Guha et al. [20] proposed an operator $T$ on infinite mixing distributions which removes weak components (in the sense that the corresponding weights are very small) and merges similar components (whose atoms are very close) of an infinite mixing distribution, so that $T(\tilde{\nu})$ is a finite mixing distribution. They proved that for a fixed truth $\nu^\star \in \mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1}$, the posterior distribution of the finite mixing distribution $T(\tilde{\nu})$ obtained after post-processing concentrates on the model $\mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1}$ under the DP prior distribution with a fixed concentration parameter.

Instead, we use the number of clusters, $T_n$, of the data $X^n$ as an estimate of the number of components. Note that for $i \in [n]$, $X_i \overset{\mathrm{iid}}{\sim} \tilde{\nu} \ast \Phi$ can be written equivalently with the latent assignment variable $Z_i \in \mathbb{N}$ as

$Z_i \overset{\mathrm{iid}}{\sim} w[\tilde{\nu}] := \sum_{j=1}^{\infty} w_j \delta_j$, $\quad X_i \mid Z_i \overset{\mathrm{ind}}{\sim} \mathrm{N}(\theta_{Z_i}, 1)$,

where $w[\tilde{\nu}] \in \mathcal{P}(\mathbb{N})$ can be viewed as the distribution on $\mathbb{N}$ such that $w[\tilde{\nu}](J) = \tilde{\nu}(\{\theta_j : j \in J\})$ for any $J \subset \mathbb{N}$. The number of clusters $T_n$ is defined by

$T_n := T_n(Z^n) := \left|\{j \in \mathbb{N} : \exists i \in [n] \text{ s.t. } Z_i = j\}\right|$.

Here we consider the joint posterior distribution of the mixing distribution $\tilde{\nu}$ and the latent assignment variables $Z^n$ conditioned on the data $X^n$, which is given as

$\Pi_{\mathrm{DP}}(d\tilde{\nu}, Z^n \mid X^n) := \frac{\left[\prod_{i=1}^{n} \phi(X_i - \theta_{Z_i}) \, p_{w[\tilde{\nu}]}(Z_i)\right] \Pi_{\mathrm{DP}}(d\tilde{\nu})}{\int \sum_{Z^n \in \mathbb{N}^n} \left[\prod_{i=1}^{n} \phi(X_i - \theta_{Z_i}) \, p_{w[\tilde{\nu}]}(Z_i)\right] \Pi_{\mathrm{DP}}(d\tilde{\nu})}$,   (4.2)

where $\phi(\cdot)$ denotes the probability density function of the standard normal distribution and $\Pi_{\mathrm{DP}}$ denotes the DP prior.

Note that the data are still assumed to be generated from the finite Gaussian mixture model $\nu^\star \ast \Phi$, where $\nu^\star \in \mathcal{M}_{k^\star}$ for $k^\star \in \mathbb{N}$, but we allow the number of components to grow at an arbitrarily fast speed. Even in such general situations, we show in the following theorem that the DP prior with a suitably chosen concentration parameter can provide a nearly tight upper bound on the true number of components.

Theorem 4.1.
Assume $\nu^\star \in \mathcal{M}_{k^\star}$ with $k^\star \in \mathbb{N}$. Then with the DP prior $\mathrm{DP}(\kappa_n, H)$, where $\kappa_n \asymp (n \log n)^{-1}$ and $H$ is the uniform distribution on $[-L, L]$, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi_{\mathrm{DP}}(T_n > C k^\star \mid X^n)\right] = o(1)$   (4.3)

for some constant $C > 0$ depending only on the prior distribution.

Miller and Harrison [32, 33] showed that the posterior distribution of the number of clusters does not concentrate at the true number of components if one uses the DP prior with a constant concentration parameter. In particular, if the true data generating process is $\mathrm{N}(0, 1) = \delta_0 \ast \Phi$, the posterior probability that the number of clusters is equal to the true number of components (i.e., 1) goes to zero [32, Theorem 5.1]. Our proposed data-dependent concentration parameter resolves this inconsistency.
Remark 4.
Under the prior $\Pi$ considered in Section 2.3, the posterior distribution of $T_n$ is asymptotically the same as that of $k$. Miller and Harrison [34] proved that $|\Pi(k = k_\circ \mid X^n) - \Pi(T_n = k_\circ \mid X^n)| \to 0$ for any $k_\circ \in \mathbb{N}$, as long as $\Pi(k = k') > 0$ for any $k' \in [k_\circ]$. In view of this fact, the number of clusters $T_n$ can be used to infer the true number of components $k^\star$ even if we use the prior distribution $\Pi$ in Section 2.3. □

Remark 5.
One may wonder whether the choice of the concentration parameter $\kappa_n \asymp (n \log n)^{-1}$ would lead to a slower posterior contraction rate when the DP mixture model is used for density estimation, as a DP mixture model is commonly adopted for. It turns out that it would not. In fact, even for $\kappa_n \asymp (n \log n)^{-1}$, one can show that there is a universal constant $M > 0$ such that

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi_{\mathrm{DP}}\left(h(p_{\tilde{\nu} \ast \Phi}, p_{\nu^\star \ast \Phi}) \ge M \frac{\log^a n}{\sqrt{n}} \,\Big|\, X^n\right)\right] = o(1)$

for any $\nu^\star \in \mathcal{P}([-L, L])$, for some $a > 0$. One can easily check the above result. Following the proof of Theorem 5.1 of [18] and applying Lemma A.5, we can see that the prior concentration near the true mixing distribution is lower bounded by $(n^{-1} \kappa_n)^{c_1 \log n} \gtrsim \exp(-c_2 \log^2 n)$ for some $c_1, c_2 > 0$. Thus the usual prior mass and testing approach leads to the conclusion in the preceding display for estimating the density. □
However, using the DP prior leads to a very slow convergence rate for mixing distribution estimation in general, as stated in the next theorem.
Theorem 4.2.
Assume $\nu^\star \in \mathcal{M}$. Then with the DP prior $\mathrm{DP}(\kappa_n, H)$, where $\exp(-c \log^a n) \lesssim \kappa_n \lesssim 1$ for some $a > 1$ and $c > 0$ and $H$ is the uniform distribution on $[-L, L]$, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi_{\mathrm{DP}}\left(W_1(\tilde{\nu}, \nu^\star) \ge M \frac{\log\log n}{\log n} \,\Big|\, X^n\right)\right] = o(1)$   (4.4)

for some universal constant $M > 0$.

The above result holds even when the true mixing distribution $\nu^\star$ is an arbitrary distribution supported on $[-L, L]$. As one can see from the theorem above, if the true mixing distribution is of high order such that $k^\star \asymp \log n / \log\log n$, the posterior of the DP mixture model attains the minimax optimality [47, Theorem 5]. However, unlike the Bayesian procedure proposed in Section 2, we conjecture that the posterior of the DP mixture model cannot attain an improved convergence rate for estimating a mixing distribution when the true number of components grows slowly, say $k^\star \ll \log n / \log\log n$, because it tends to produce many redundant components. Nguyen [37] analyzed the posterior of a Dirichlet process mixture endowed with a fixed concentration parameter for estimating a mixing distribution with a fixed number of components, and obtained a slow convergence rate of order $(\log n)^{-1/2}$ with respect to the second-order Wasserstein distance.

We conduct numerical experiments to validate our theoretical findings. For the prior distribution, we use an MFM prior consisting of a Poisson distribution with mean $\lambda$ on the number of components, the Dirichlet distribution on the weights and the uniform distribution on the atoms. For the Dirichlet distribution prior on the mixing weights, we fix its concentration parameter as a $k$-dimensional vector of 1's. For the mean parameter of the Poisson distribution, we consider the following two choices: a constant one and one inversely proportional to the sample size. We call the former MFM const and the latter MFM vary; the latter is motivated by our theory. For posterior computation, we employ the reversible jump MCMC algorithm of [40]. For each posterior computation, we ran a single Markov chain of length 105,000; we saved every 100-th sample after a burn-in period of 5,000 samples.
We compare the performance of the proposed Bayesian method with other competitors. We consider the denoised method of moments (DMM) estimator proposed by [47] and the maximum a posteriori (MAP) estimator with the Dirichlet distribution prior on the weights and the uniform distribution prior on the atoms. In the implementation of the DMM algorithm, we use the authors' Python code, which is available on GitHub. We consider the MAP estimators of two types of mixture models: exact-fitted and over-fitted mixtures. The number of components of the exact-fitted mixture is exactly equal to the true number of components, and that of the over-fitted mixture is some upper bound $\bar{k}$ on the true number of components; in this simulation, we set $\bar{k} = 2k^\star$. We call the MAP estimator of the exact-fitted mixture MAP exact and the one of the over-fitted mixture MAP over. We use the standard expectation-maximization (EM) algorithm to obtain the MAP estimators. For the proposed Bayesian method, we use the posterior mode of the mixing distribution as an estimator. We consider the two choices of the mean parameter of the Poisson prior, $\lambda_n = n^{-1}$ (MFM vary) and $\lambda_n = 1$ (MFM const). For all four Bayesian methods, we set the support of the uniform distribution prior to the interval $[-6, 6]$ and the concentration parameter of the Dirichlet distribution prior to the vector of 1's.

We generated synthetic data sets from a Gaussian mixture model $\nu^\star \ast \Phi$ with $\nu^\star := \sum_{j=1}^{k^\star} w^\star_j \delta_{\theta^\star_j}$. We consider the following four cases of the true mixing distribution:

Case 1 (Well-separated): $\theta^\star = (-3, -1, 1, 3)$ with equal mixing weights;
Case 2 (Overlapped components): four atoms, two of which are located close to each other;
Case 3 (Weak component): four atoms, one of which carries a very small mixing weight;
Case 4 (Higher-order): $\theta^\star = (-6, -4, -2, 0, 2, 4, 6)$ with equal mixing weights.
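A data-generation sketch for these four cases is given below (ours; the atom and weight values used for Cases 2 and 3 are illustrative placeholders consistent with the case descriptions above, not the exact settings of the study).

```python
import numpy as np

rng = np.random.default_rng(3)

CASES = {
    1: (np.array([-3., -1., 1., 3.]),   np.full(4, 1 / 4)),             # well-separated
    2: (np.array([-3., -1., -0.5, 3.]), np.full(4, 1 / 4)),             # two overlapped atoms (placeholder)
    3: (np.array([-3., -1., 1., 3.]),   np.array([.3, .3, .35, .05])),  # one weak component (placeholder)
    4: (np.arange(-6., 7., 2.),         np.full(7, 1 / 7)),             # higher-order
}

def sample(case, n):
    """Draw n observations from nu* convolved with the standard normal."""
    atoms, w = CASES[case]
    z = rng.choice(len(atoms), size=n, p=w)     # latent component labels
    return rng.normal(loc=atoms[z], scale=1.0)  # X_i | z_i ~ N(theta_{z_i}, 1)

print(sample(1, 500)[:5])
```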
We let the sample size $n$ range over a grid of values. We repeat this data generation 20 times for each experiment and report the average of the first-order Wasserstein distance between each estimator and the true mixing distribution.

Figure 1 displays the average first-order Wasserstein errors of the five estimators for the four cases of the data generating process.

Figure 1: The average of the first-order Wasserstein errors of five estimators by sample size; panels (a)-(d) correspond to Cases 1-4.

Contrary to its theoretical optimality, DMM performs the worst among the five estimators in all scenarios. The performance gap between DMM and the Bayesian methods is largest for Case 4. We observed numerical instability of the DMM implementation in estimating the higher-order mixtures, which leads to the poor performance of the method. For Case 1, the over-fitted mixture model MAP over performs worse than the other Bayesian methods, but performs similarly in the other three cases. For Cases 2 and 3, MFM vary tends to select a smaller mixture than the true one; in general, its posterior distribution is maximized at $k = 3$ while $k^\star = 4$. Note that this does not contradict our theoretical results, where we establish consistent estimation of the number of well-separated components, which might be equal to 3 in these two cases. This leads to slightly better performance for Case 2, where overlapped components exist, and slightly worse performance for Case 3, where weak components exist. For the higher-order mixture case, all four Bayesian methods perform almost identically. Overall, knowing the true number of components does not give a substantial improvement in empirical performance, which corresponds to our theory that it yields at most a $\log n$ gain in the convergence rate.

In this experiment, we assess the performance of the proposed Bayesian procedure and the DP mixture model with data-dependent hyperparameters.
We generated data from the Gaussian mixture with atoms $(-2, 0, 2)$ and equal weights $(1/3, 1/3, 1/3)$. Five independent data sets are generated from this Gaussian mixture model for each sample size $n \in \{50, 100, 250, 1000, 2500\}$. We compare four Bayesian methods: the two MFM models with Poisson mean parameter $\lambda_n = n^{-1}$ (MFM vary) and $\lambda_n = 1$ (MFM const), and the two DP mixture models with concentration parameter $\kappa_n = (n \log n)^{-1}$ (DP vary) and $\kappa_n = 1$ (DP const).
DP vary captures the true number ofcomponents well for large samples. It is a widely observed that the DP mix-ture tends to produce redundant clusters, in particular, Miller and Harrison[34] and Guha et al. [20] observed this phenomenon in their simulation studies,however our simulation shows that a data-dependent concentration parameterinversely related to the sample size can circumvent this issue.
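The effect of the concentration parameter on the number of clusters can be checked directly on the prior. The sketch below (ours) simulates the Chinese restaurant process induced by $\mathrm{DP}(\kappa, H)$; the prior expected number of clusters grows like $\kappa \log(1 + n/\kappa)$, so $\kappa_n = (n \log n)^{-1}$ keeps the prior cluster count essentially bounded while $\kappa = 1$ lets it grow logarithmically. This illustrates only the prior's preference for few clusters; the posterior behavior is the content of Theorem 4.1.

```python
import numpy as np

def crp_num_clusters(n, kappa, rng):
    counts = []                                   # current cluster sizes
    for _ in range(n):
        # join an existing cluster w.p. prop. to its size, or a new one w.p. prop. to kappa
        probs = np.array(counts + [kappa], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(counts):
            counts.append(1)
        else:
            counts[j] += 1
    return len(counts)

rng = np.random.default_rng(4)
for n in (50, 250, 2500):
    print(n,
          "const:", crp_num_clusters(n, 1.0, rng),
          "vary :", crp_num_clusters(n, 1.0 / (n * np.log(n)), rng))
```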
Proof of Theorem 2.1.
Let $\tilde{\zeta}_n := \sqrt{\log n / n}$. We state the following well-known result in the Bayesian literature (e.g., Lemma 8.1 of [15]):

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left(\int \frac{p^{(n)}_{\nu \ast \Phi}}{p^{(n)}_{\nu^\star \ast \Phi}}(X^n) \, \Pi(d\nu) \ge e^{-2n\tilde{\zeta}_n^2} \, \Pi(B_{\mathrm{KL}}(\tilde{\zeta}_n, \nu^\star \ast \Phi, \mathcal{M}))\right) \ge 1 - \frac{1}{n\tilde{\zeta}_n^2}$.

By Lemma A.1, we have

$\mathrm{KL}(p_{\nu^\star \ast \Phi}, p_{\nu \ast \Phi}) \le W_2^2(\nu \ast \Phi, \nu^\star \ast \Phi)$.

Since $\int p_{\nu^\star \ast \Phi}(x) (p_{\nu^\star \ast \Phi}(x) / p_{\nu \ast \Phi}(x))^b \lambda(dx) < \infty$ for some $b \in (0, 1)$, which is shown by Equation (4.6) of [18], Lemma A.1 and Lemma A.2 imply that

$\mathrm{KL}_2(p_{\nu^\star \ast \Phi}, p_{\nu \ast \Phi}) \le c_1 W_2^2(\nu \ast \Phi, \nu^\star \ast \Phi) \log^2\left(\frac{1}{W_2(\nu \ast \Phi, \nu^\star \ast \Phi)}\right)$

for some constant $c_1 > 0$. Since moreover $W_2^2(\nu \ast \Phi, \nu^\star \ast \Phi) \le W_2^2(\nu, \nu^\star) \le 2L\, W_1(\nu, \nu^\star)$, it follows that

$\Pi(B_{\mathrm{KL}}(\tilde{\zeta}_n, \nu^\star \ast \Phi, \mathcal{M})) \ge \Pi\left(\nu \in \mathcal{M} : W_1(\nu, \nu^\star) \le c_2 (n \log n)^{-1}\right) \ge \Pi\left(\nu \in \mathcal{M}_{k^\star} : W_1(\nu, \nu^\star) \le c_2 (n \log n)^{-1}\right) \Pi(k = k^\star)$

for some constant $c_2 > 0$. We now lower bound the prior mass of the Wasserstein ball in $\mathcal{M}_{k^\star}$ in the preceding display. By Lemma A.3, we have that for any $\nu \in \mathcal{M}_{k^\star}$,

$W_1(\nu, \nu^\star) \le \max_{1 \le j \le k^\star} |\theta_j - \theta^\star_j| + L \sum_{j=1}^{k^\star} |w_j - w^\star_j|$.

By (P2) and (P3),

$\Pi\left(\nu \in \mathcal{M}_{k^\star} : W_1(\nu, \nu^\star) \le c_2 (n \log n)^{-1}\right) \ge \Pi\left(\sum_{j=1}^{k^\star} |w_j - w^\star_j| \le \frac{c_2}{2L\, n \log n}\right) \Pi\left(|\theta_j - \theta^\star_j| \le \frac{c_2}{2\, n \log n}, \ \forall j \in [k^\star]\right) \gtrsim \left((n \log n)^{-1}\right)^{c_3 k^\star} \gtrsim e^{-c_4 k^\star \log n}$

for some constants $c_3, c_4 > 0$. Therefore,

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi(\nu \notin \mathcal{M}_{k^\star} \mid X^n)\right] = \mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\frac{\int_{\nu \notin \mathcal{M}_{k^\star}} p^{(n)}_{\nu \ast \Phi}(X^n)/p^{(n)}_{\nu^\star \ast \Phi}(X^n) \, \Pi(d\nu)}{\int p^{(n)}_{\nu \ast \Phi}(X^n)/p^{(n)}_{\nu^\star \ast \Phi}(X^n) \, \Pi(d\nu)}\right] \lesssim \frac{\Pi(k > k^\star)}{e^{-2n\tilde{\zeta}_n^2} \, \Pi(B_{\mathrm{KL}}(\tilde{\zeta}_n, \nu^\star \ast \Phi, \mathcal{M}))} + \frac{1}{n\tilde{\zeta}_n^2} \lesssim e^{(2 + c_4) k^\star \log n} \frac{\Pi(k > k^\star)}{\Pi(k = k^\star)} + \frac{1}{n\tilde{\zeta}_n^2} \lesssim e^{(2 + c_4) k^\star \log n} e^{-A \bar{k}_n \log n} + \frac{1}{n\tilde{\zeta}_n^2}$.

Hence if $A > c_4 + 2$, the desired result follows, since $n\tilde{\zeta}_n^2 = \log n \to \infty$.
For the proof of Theorem 2.2, we use the following moment comparison lemma to translate the mixing distribution estimation problem into a moment vector estimation problem.

Lemma 6.1 (Proposition 1 of Wu and Yang [47]). Suppose that $\nu_1, \nu_2 \in \mathcal{M}_k([-L, L])$ for $L > 0$. Let $\zeta := \|m_{1:(2k-1)}(\nu_1) - m_{1:(2k-1)}(\nu_2)\|_\infty$. Then

$W_1(\nu_1, \nu_2) \le c\, k\, \zeta^{\frac{1}{2k-1}}$   (6.1)

for some constant $c > 0$ depending only on $L$.

We use a standard "prior mass and testing" approach to prove the convergence of the moment vector. The crucial step is to construct a test function with exponentially small error probabilities. We employ the median denoised moment estimator proposed by [47] in the construction of such a test function.
Definition 3.
Let $X^n$ be $n$ independent samples, and let $k \in \mathbb{N}$ and $\eta \in (0, 1)$. Divide the sample into $N := \lfloor \log(k/\eta) \rfloor \wedge n$ batches of almost equal size, say $\mathcal{X}_1, \ldots, \mathcal{X}_N$, where each batch has $\lfloor n/N \rfloor$ or $\lfloor n/N \rfloor + 1$ elements. For $l \in [N]$ and $h \in [2k - 1]$, compute

$\tilde{M}^{(\eta)}_{l, h} := \frac{1}{|\mathcal{X}_l|} \sum_{X \in \mathcal{X}_l} X^h$, $\quad M^{(\eta)}_{l, h} := h! \sum_{a=0}^{\lfloor h/2 \rfloor} \frac{(-1)^a}{2^a \, a! \, (h - 2a)!} \tilde{M}^{(\eta)}_{l, h - 2a}$.

Then we define the median denoised moment estimator $\hat{m}^{(\eta)}_{1:(2k-1)} = (\hat{m}^{(\eta)}_h)_{h \in [2k-1]}$ by

$\hat{m}^{(\eta)}_h := \hat{m}^{(\eta)}_h(X^n) := \mathrm{Median}\left(\{M^{(\eta)}_{l, h} : l \in [N]\}\right)$.   (6.2)

For the median denoised moment estimator we have the following exponential tail bound. Recall that $\mathcal{P}([-L, L])$ stands for the set of all distributions supported on $[-L, L]$.
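An implementation sketch of Definition 3 follows (ours; it relies on the standard fact that the probabilists' Hermite polynomial satisfies $\mathbb{E}[\mathrm{He}_h(\theta + Z)] = \theta^h$ for $Z \sim \mathrm{N}(0,1)$, which is what makes the batchwise denoising unbiased): Hermite-denoise the raw empirical moments within each batch, then take coordinatewise medians across batches.

```python
import numpy as np
from math import floor, factorial

def denoise(raw, h):
    """Denoised estimate of m_h(mu) from raw moments of X ~ mu * N(0,1),
    i.e. the empirical average of He_h(X) written in terms of raw moments."""
    return factorial(h) * sum(
        (-1) ** a / (2 ** a * factorial(a) * factorial(h - 2 * a)) * raw[h - 2 * a]
        for a in range(floor(h / 2) + 1)
    )

def median_denoised_moments(x, k, eta):
    n = len(x)
    N = max(1, min(int(np.log(max(k / eta, np.e))), n))   # number of batches
    batches = np.array_split(x, N)
    ests = np.empty((N, 2 * k - 1))
    for l, b in enumerate(batches):
        raw = [np.mean(b ** h) for h in range(2 * k)]      # raw[0] = 1
        ests[l] = [denoise(raw, h) for h in range(1, 2 * k)]
    return np.median(ests, axis=0)                         # coordinatewise median

rng = np.random.default_rng(6)
x = rng.normal(loc=rng.choice([-1.0, 1.0], size=20_000), scale=1.0)
print(median_denoised_moments(x, k=2, eta=0.05))   # should be near (0, 1, 0) = m_{1:3}
```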
Lemma 6.2. Suppose that $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} \mu \ast \Phi$ where $\mu \in \mathcal{P}([-L, L])$. Then for any $k \in \mathbb{N}$ and $\epsilon > 0$, there is a constant $c > 0$ depending only on $L$ such that

$\mathbb{P}^{(n)}_{\mu \ast \Phi}\left(\|\hat{m}^{(\eta_\epsilon)}_{1:(2k-1)} - m_{1:(2k-1)}(\mu)\|_\infty \ge \epsilon\right) \le (2k) \exp\left(-n\left\{(ck)^{-2k+1} \epsilon^2 \wedge 1\right\}\right)$,

where $\hat{m}^{(\eta_\epsilon)}_{1:(2k-1)}$ is the median denoised moment estimator presented in Definition 3 with $\eta = \eta_\epsilon$, where $\eta_\epsilon := (2k) \exp\left(-(ck)^{-2k+1} n \epsilon^2\right)$.

To control the covering number of the parameter space $\mathcal{M}$, we need the following two lemmas.

Lemma 6.3.
For any $\nu_1, \nu_2 \in \mathcal{M}_k([-L, L])$, we have

$\|m_{1:(2k-1)}(\nu_1) - m_{1:(2k-1)}(\nu_2)\|_\infty \le c_1 \left(\sqrt{c_2 k}\right)^{2(k-1)} \|p_{\nu_1 \ast \Phi} - p_{\nu_2 \ast \Phi}\|_1$

for some constants $c_1 > 0$ and $c_2 > 0$ depending only on $L$.

Lemma 6.4 (Theorem 3.1 of Ghosal and van der Vaart [18]). For any $\epsilon \in (0, 1/2)$,

$\log N\left(\epsilon, \{p_{\mu \ast \Phi} : \mu \in \mathcal{P}([-L, L])\}, \|\cdot\|_1\right) \le c \left(\log \frac{1}{\epsilon}\right)^2$

for some universal constant $c > 0$.
Proof of Theorem 2.2.
Let $\zeta_n := \sqrt{\bar{k}_n \log n / n}$. In the proof of Theorem 2.1, we have shown that $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}(\mathcal{A}_n) \to 1$ as $n \to \infty$, where

$\mathcal{A}_n := \left\{X^n \in \mathbb{R}^n : \int \frac{p^{(n)}_{\nu \ast \Phi}}{p^{(n)}_{\nu^\star \ast \Phi}}(X^n) \, \Pi(d\nu) \ge e^{-c_0 k^\star n \zeta_n^2}\right\}$

for some constant $c_0 > 0$. Since $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}[\Pi(\nu \in \mathcal{M}_{k^\star} \mid X^n)] \to 1$, the proof is done if we prove that $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}[\Pi(\tilde{\mathcal{U}} \mid X^n)] = o(1)$, where

$\tilde{\mathcal{U}} := \{\nu \in \mathcal{M} : W_1(\nu, \nu^\star) \ge M \bar{\epsilon}_n\} \cap \{\nu \in \mathcal{M}_{k^\star}\} = \{\nu \in \mathcal{M}_{k^\star} : W_1(\nu, \nu^\star) \ge M \bar{\epsilon}_n\}$.

For notational simplicity, we suppress the subscript $1\!:\!(2k^\star - 1)$ of the moment vector and its denoised estimator, writing $m(\cdot) := m_{1:(2k^\star - 1)}(\cdot)$ and $\hat{m}^{(\eta)} := \hat{m}^{(\eta)}_{1:(2k^\star - 1)}$. Let $\rho(\nu_1, \nu_2) := \|m(\nu_1) - m(\nu_2)\|_\infty$ for $\nu_1, \nu_2 \in \mathcal{M}$. Let

$\mathcal{U} := \left\{\nu \in \mathcal{M}_{k^\star} : \rho(\nu, \nu^\star) \ge \lceil M_{k^\star} \rceil \zeta_n\right\}$,

where $M_{k^\star} := \sqrt{k^\star}\left(\sqrt{M_1}\, k^\star\right)^{2k^\star - 1}$ with $M_1 > 0$. Since $\lceil M_{k^\star} \rceil^{1/(2k^\star - 1)} \le (2M_{k^\star})^{1/(2k^\star - 1)} \lesssim \sqrt{M_1}\, k^\star$, by Lemma 6.1, if we take $M$ such that $M \ge c_1 \sqrt{M_1}$ for some constant $c_1 > 0$ depending only on $L$, we have $\tilde{\mathcal{U}} \subset \mathcal{U}$.

It remains to bound the posterior probability of $\mathcal{U}$. To do this we use a standard peeling device. Define

$\mathcal{U}_t := \left\{\nu \in \mathcal{M}_{k^\star} : t\zeta_n \le \|m(\nu) - m(\nu^\star)\|_\infty < (t + 1)\zeta_n\right\}$.

Since $\|m(\nu)\|_\infty \le (1 \vee L)^{2k^\star - 1}$ for any $\nu \in \mathcal{M}_{k^\star}([-L, L])$, for $t$ larger than $2(1 \vee L)^{2k^\star - 1}/\zeta_n$ the set $\mathcal{U}_t$ is empty. Therefore,

$\mathcal{U} \subset \bigcup_{t = \lceil M_{k^\star} \rceil}^{t^*_n} \mathcal{U}_t$, $\quad$ where $t^*_n := \sup\left\{t \in \mathbb{N} : t \le 2(1 \vee L)^{2k^\star - 1}/\zeta_n\right\}$.

Let $(\nu_{t,s} : s \in [S_t])$ be a $t\zeta_n/4$-net of $\mathcal{U}_t$ in the distance $\rho(\cdot, \cdot)$ for each $t$, where $S_t := N(t\zeta_n/4, \mathcal{U}_t, \rho)$. We further decompose $\mathcal{U}_t$ into $\mathcal{U}_{t,s}$, $s \in [S_t]$, where

$\mathcal{U}_{t,s} := \left\{\nu \in \mathcal{U}_t : \|m(\nu) - m(\nu_{t,s})\|_\infty < t\zeta_n/4\right\}$,
L. L IN U t ⊂ (cid:83) S t s = U t , s .Now we construct the test function for the test H : ν = ν (cid:63) versus H : ν ∈U t , s with exponentially small type I and II error probabilities. Let ψ t , s : R n (cid:55)→ [
0, 1 ] be the function given by ψ t , s ( X n ) : = (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν (cid:63) ) (cid:107) ∞ ≥ t ζ n /4 (cid:17) ,where ˆ m ( η n , t ) is the median denoised moments defined in Definition 3 with η n , t : = ( k (cid:63) ) exp (cid:16) − ( c k (cid:63) ) − k (cid:63) + n ( t ζ n /4 ) (cid:17) .Here, the universal constant c > L is chosen so that P ( n ) ν (cid:63) ∗ Φ (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ > t ζ n /4 (cid:17) (cid:46) k (cid:63) exp (cid:18) − n (cid:110) ( c k (cid:63) ) − k (cid:63) + ( t ζ n /4 ) ∧ (cid:111)(cid:19) .Note that the existence of the constant c is guaranteed by Lemma 6.2. We justshowed the exponential type I error bound for the test function ψ t , s . By triangleinequality, we have that for every ν ∈ U t , s , (cid:107) ˆ m ( η n , t ) − m ( ν (cid:63) ) (cid:107) ∞ ≥ (cid:107) m ( ν t , s ) − m ( ν (cid:63) ) (cid:107) ∞ − (cid:107) m ( ν ) − m ( ν t , s ) (cid:107) ∞ − (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ ≥ t ζ n − t ζ n /4 − (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ .Thus the type II error probability is bounded exponentially assup ν ∈U t , s P ( n ) ν ∗ Φ (cid:0) − ψ t , s ( X n ) (cid:1) = sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν (cid:63) ) (cid:107) ∞ < t ζ n /4 (cid:17) ≤ sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:16) t ζ n /4 − (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ < t ζ n /4 (cid:17) ≤ sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ > t ζ n /2 (cid:17) (cid:46) k (cid:63) exp (cid:18) − n (cid:110) ( c k (cid:63) ) − k (cid:63) + ( t ζ n /4 ) ∧ (cid:111)(cid:19) .We need to compute the upper bound of S t . By Lemma 6.3, for any ν , ν ∈M k (cid:63) , we have ρ ( ν , ν ) ≤ c (cid:16)(cid:112) c k (cid:63) (cid:17) k (cid:63) − (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) , AYESIAN ESTIMATION OF G AUSSIAN MIXTURES c , c > L , which implies that S t : = N (cid:0) t ζ n /4, U t , ρ (cid:1) ≤ N t ζ n c (cid:16) √ c k (cid:63) (cid:17) k (cid:63) − , { p ν ∗ Φ : ν ∈ M k (cid:63) } , (cid:107) · (cid:107) (cid:46) (cid:32) ( k (cid:63) ) k (cid:63) t ζ n (cid:33) c (cid:46) e c log n for some universal constants c , c > ( k (cid:63) ) k (cid:63) (cid:46) exp ( k (cid:63) log ( k (cid:63) )) (cid:46) exp ( c log n ) for some c > ψ : R n (cid:55)→ [
0, 1 ] defined by ψ : = sup t ∈ N : M k (cid:63) ≤ t ≤ t ∗ n max s ∈ [ S t ] ψ t , s .For notational simplicity, we denote A ( M , k (cid:63) , ζ n ) : = n (cid:20)(cid:110) ( c k (cid:63) ) − k (cid:63) + ( M k (cid:63) ζ n /4 ) (cid:111) ∧ (cid:21) = n (cid:20)(cid:110) k (cid:63) ( M / c ) k (cid:63) − ( ζ n /4 ) (cid:111) ∧ (cid:21) .Then the type I error probability of ψ is bounded by P ( n ) ν (cid:63) ∗ Φ ψ ( X n ) ≤ t ∗ n ∑ t = (cid:100) M k (cid:63) (cid:101) S t P ( n ) ν (cid:63) ∗ Φ ψ t , s ( X n ) (cid:46) t ∗ n k (cid:63) e c log n exp (cid:0) − A ( M , k (cid:63) , ζ n ) (cid:1) (cid:46) k (cid:63) exp (cid:0) c log n − A ( M , k (cid:63) , ζ n ) (cid:1) (6.4)for some constants c , c > L , where the third inequalityfollows from the fact that t ∗ n ≤ ( ∨ L ) k (cid:63) − / ζ n (cid:46) e c log n for some constant c > L . On the other hand, the type II error is boundedbysup ν ∈U P ( n ) ν ∗ Φ ( − ψ ( X n )) ≤ sup t ∈ N : M k (cid:63) ≤ t ≤ t ∗ n sup s ∈ [ S t ] sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:0) − ψ t , s ( X n ) (cid:1) (cid:46) k (cid:63) exp (cid:0) − A ( M , k (cid:63) , ζ n ) (cid:1) . (6.5). O HN AND
L. L IN P ( n ) ν (cid:63) ∗ Φ (cid:104) Π (cid:0) U | X n (cid:1)(cid:105) ≤ P ( n ) ν (cid:63) ∗ Φ ψ ( X n ) + P ( n ) ν (cid:63) ∗ Φ (cid:104) ( − ψ ( X n )) Π (cid:0) U | X n (cid:1) A n (cid:105) + o ( ) ≤ P ( n ) ν (cid:63) ∗ Φ ψ ( X n ) + − c k (cid:63) n ζ n sup ν ∈U P ( n ) ν ∗ Φ ( − ψ ( X n )) + o ( ) (cid:46) k (cid:63) exp (cid:16) c log n + c k (cid:63) ¯ k n log n − A ( M , k (cid:63) , ζ n ) (cid:17) + o ( ) (cid:46) exp (cid:16) c k (cid:63) ¯ k n log n − A ( M , k (cid:63) , ζ n ) (cid:17) + o ( ) for some constant c > L . Note that for any M such that M > c , we have A ( M , k (cid:63) , ζ n ) ≥ k (cid:63) n (cid:32) M c ¯ k n log n n (cid:33) ∧ ≥ c M k (cid:63) ¯ k n log n for some constant c > c , where the second inequalityis due to that ¯ k n log n / n = o ( ) . Hence the posterior probability of U goes tozero if we choose M such that M > max { c / c , c , 1 } . We need the following adaptive version of the moment comparison lemma toestabilish the adaptive rate.
Lemma 6.5 (Proposition 4 of Wu and Yang [47]) . Suppose that ν and ν are sup-ported on a set of r atoms in [ − L , L ] , and each atom is at least ˜ γ away from all but atmost r (cid:48) atoms. Let ζ : = (cid:13)(cid:13)(cid:13) m ( r − ) ( ν ) − m ( r − ) ( ν ) (cid:13)(cid:13)(cid:13) ∞ . Then W ( ν , ν ) ≤ c r (cid:32) r r − ˜ γ r − r (cid:48) − ζ (cid:33) r (cid:48) , (6.6) for some constant c > depending only on L.Proof of Theorem 2.3. To avoid confusion, we denote by ¯ M instead of M the suf-ficiently large constant appearing in (2.8) and we let M be the constant appear-ing in (2.10). If M (cid:101) n ≥ ¯ M ¯ (cid:101) n , the result follows trivially from Theorem 2.2, sowe assume throughout that M (cid:101) n < ¯ M ¯ (cid:101) n . AYESIAN ESTIMATION OF G AUSSIAN MIXTURES P ( n ) ν (cid:63) ∗ Φ (cid:2) Π ( ν / ∈ M k (cid:63) | X n ) (cid:3) = o ( ) by Theorem 2.1 P ( n ) ν (cid:63) ∗ Φ (cid:2) Π ( W ( ν , ν (cid:63) ) ≥ ¯ M ¯ (cid:101) n | X n ) (cid:3) = o ( ) by Theorem 2.2,we will be done with the proof if we can show that P ( n ) ν (cid:63) ∗ Φ (cid:20) Π (cid:16)(cid:8) ν ∈ M k (cid:63) : ¯ M ¯ (cid:101) n > W ( ν , ν (cid:63) ) ≥ M (cid:101) n (cid:9) | X n (cid:17)(cid:21) = o ( ) .Let ν : = ∑ k (cid:63) j = w j δ θ j be the mixing distribution satisfying W ( ν , ν (cid:63) ) ≤ ¯ M ¯ (cid:101) n forthe true mixing distribution ν (cid:63) : = ∑ k (cid:63) j = w (cid:63) j δ θ (cid:63) j . Since ν (cid:63) is k ( γ , ω ) -separated,there is a partition ( S l : l ∈ [ k ]) of [ k (cid:63) ] such that | θ j − θ j (cid:48) | ≥ γ for any j ∈ S l , j (cid:48) ∈ S l (cid:48) and any l , l (cid:48) ∈ [ k ] with l (cid:54) = l (cid:48) and ∑ j ∈ S l w j ≥ ω for any l ∈ [ k ] . Foreach h ∈ [ k (cid:63) ] , let j ∗ h = argmin j ∈ [ k (cid:63) ] | θ j − θ (cid:63) h | . Note that for any l ∈ [ k ] , W ( ν , ν (cid:63) ) ≥ ∑ h ∈ S l w (cid:63) h | θ j ∗ h − θ (cid:63) h | ≥ ω min h ∈ S l | θ j ∗ h − θ (cid:63) h | .We now suppose that the assumption γω > M (cid:48) ¯ (cid:101) n holds with M (cid:48) : = c ¯ M forsome constant c less than 1/2. Thenmin h ∈ S l | θ j ∗ h − θ (cid:63) h | ≤ W ( ν , ν (cid:63) ) / ω ≤ ¯ M ¯ (cid:101) n / ω ≤ c γ .That is, for any l ∈ [ k ] , there is h ∈ S l such that θ (cid:63) h is close to some atomof ν within distance γ / c . Hence the mixing distribution ν is k (( − c ) γ , 0 ) separated. Let S : = (cid:110) θ j : j ∈ [ k (cid:63) ] (cid:111) ∪ (cid:110) θ (cid:63) j : j ∈ [ k (cid:63) ] (cid:111) . Then each element in S is ( − c ) γ away from at least 2 ( k − ) elements in S . Therefore by invokingLemma 6.5 with r = k (cid:63) , r (cid:48) = k (cid:63) − − ( k − ) = ( k (cid:63) − k ) + γ =( − c ) γ , we have for sufficiently large M > (cid:8) ν ∈ M k (cid:63) : M ¯ (cid:101) n > W ( ν , ν (cid:63) ) ≥ M (cid:101) n (cid:9) ⊂ (cid:110) ν ∈ M k (cid:63) : (cid:107) m ( k (cid:63) − ) ( ν ) − m ( k (cid:63) − ) ( ν (cid:63) ) (cid:107) ∞ ≥ (cid:6) M k (cid:63) (cid:7) ζ n (cid:111) ,where M k (cid:63) : = √ k (cid:63) ( √ M k (cid:63) ) k (cid:63) − with M > ζ n : = (cid:113) ¯ k n log n / n . The only remaining part of the proof is to bound the pos-terior probability of the right-hand side of the preceding display, and this isshown in the proof of Theorem 2.2. Proof of Proposition 2.4.
We set γ : = γ ( ν ) and ω : = ω ( ν ) for short. Supposethat ν : = ∑ kj = w j δ θ j ∈ M k satisfies W ( ν , ν ) < c γω . Since ν is k ( γ , ω ) -separated, by the similar argument in the proof of Theorem 2.3, we have thatfor every h ∈ [ k ] , | θ j ∗ h − θ h | ≤ W ( ν , ν ) / ω ≤ c γ ,. O HN AND
L. L IN j ∗ h = argmin j ∈ [ k ] | θ j − θ (cid:63) h | . Thus, ν is k (( − c ) γ , 0 ) sepa-rated. Moreover, since | θ j ∗ h − θ l | ≥ | θ h − θ l | − | θ j ∗ h − θ h | ≥ ( − c ) γ > c γ for any l (cid:54) = h , the indices j ∗ , . . . , j ∗ k are distinct. Thus there is a partition S , . . . , S k of [ k ] such that | θ j − θ j (cid:48) | ≥ ( − c ) γ for any j ∈ S h , j (cid:48) ∈ S h (cid:48) andany h , h (cid:48) ∈ [ k ] with h (cid:54) = h (cid:48) and j ∗ h ∈ S h for any h ∈ [ k ] . Let ( p ∗ jh ) j ∈ [ k ] , h ∈ [ k ] ∈Q (( w j ) j ∈ [ k ] , ( w j ) j ∈ [ k ] ) be the optimal coupling such that W ( ν , ν ) = ∑ kj = ∑ k h = p jh | θ j − θ h | . Then for any h ∈ [ k ] , we have c γω > W ( ν , ν ) ≥ k ∑ j = p ∗ jh | θ j − θ h | = ∑ j ∈ S h p ∗ jh | θ j − θ h | + ∑ j / ∈ S h p ∗ jh | θ j − θ h |≥ + w h − ∑ j ∈ S h p ∗ jh ( − c ) γ ,where the last inequality follows from that | θ j − θ h | ≥ | θ j − θ j ∗ h | − | θ j ∗ h − θ h | ≥ ( − c ) γ − c γ for any j / ∈ S h . Hence, ∑ j ∈ S h w j ≥ ∑ j ∈ S h p ∗ jh ≥ w h − c − c ω ≥ − c − c ω ,which completes the proof. Proof of Theorem 2.6.
Assume that ν : = ∑ kj = w j δ θ j ∈ M k with k < k (cid:63) . Thenthere exists an index h ∗ ∈ [ k (cid:63) ] such that | θ j − θ (cid:63) h ∗ | ≥ min h ∈ [ k (cid:63) ] : h (cid:54) = j ∗ | θ j − θ (cid:63) h | for any j ∈ [ k ] , which implies that2 | θ j − θ (cid:63) h ∗ | ≥ | θ j − θ (cid:63) h ∗ | + min h ∈ [ k (cid:63) ] : h (cid:54) = j ∗ | θ j − θ (cid:63) h |≥ min h , l ∈ [ k (cid:63) ] : h (cid:54) = l | θ (cid:63) l − θ (cid:63) h | .Therefore, for the optimal coupling ( p ∗ jh ) j ∈ [ k ] , h ∈ [ k (cid:63) ] ∈ Q (( w j ) j ∈ [ k ] , ( w (cid:63) j ) j ∈ [ k (cid:63) ] ) ,we have W ( ν , ν (cid:63) ) = k ∑ j = k (cid:63) ∑ h = p ∗ jh | θ j − θ (cid:63) h |≥ k ∑ j = p ∗ jh ∗ | θ j − θ (cid:63) h ∗ |≥ w (cid:63) h ∗ | θ (cid:63) l − θ (cid:63) h | ≥ γω AYESIAN ESTIMATION OF G AUSSIAN MIXTURES γω > M (cid:48) (cid:101) n for some large constant M (cid:48) > { ν ∈ M k } ⊂ (cid:8) ν ∈ M : W ( ν , ν (cid:63) ) ≥ γω /2 (cid:9) ⊂ (cid:110) ν ∈ M : W ( ν , ν (cid:63) ) ≥ M (cid:48) (cid:101) n /2 (cid:111) .The proof is complete by Theorem 2.3. We invoke the following moment comparison lemma for general distributions.
Lemma 6.6.
Let µ , µ ∈ P ([ − L , L ]) and r ∈ N . Then W ( µ , µ ) ≤ c (cid:26) r + + √ r ( c ) r (cid:107) m r ( µ ) − m r ( µ ) (cid:107) ∞ (cid:27) . for some constants c > and c > depending only on L . Proof.
Let µ (cid:48) and µ (cid:48) be distributions supported on [ −
1, 1 ] constructed by scal-ing µ and µ respectively. Then by Lemma 24 of Wu and Yang [47], W ( µ (cid:48) , µ (cid:48) ) ≤ π r + + ( + √ ) r (cid:107) m r ( µ (cid:48) ) − m r ( µ (cid:48) ) (cid:107) .Since W ( µ , µ ) = L W ( µ (cid:48) , µ (cid:48) ) and | m j ( µ ) − m j ( µ ) | = L j | m j ( µ (cid:48) ) − m j ( µ (cid:48) ) | for any j ∈ N , we have W ( µ , µ ) ≤ π Lr + + L ( + √ ) r √ r max ≤ j ≤ r L − j | m j ( µ ) − m j ( µ ) |≤ π Lr + + L √ r (( + √ )( ∨ L − )) r (cid:107) m r ( µ ) − m r ( µ ) (cid:107) ∞ ,which completes the proof. Proof of Theorem 2.7.
Let ˜ ξ n : = n − log n and ξ n : = n − log − n so that ξ n log ( ξ n ) (cid:46) ˜ ξ n . Following the proof of Theorem 2.1, we have D n : = (cid:90) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) Π ( d ν ) ≥ e − n ˜ ξ n Π ( B KL ( ˜ ξ n , ν (cid:63) ∗ Φ , M )) ≥ e − n ˜ ξ n Π (cid:16) ν ∈ M : W ( ν , ν (cid:63) ) ≤ c ξ n (cid:17) with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − n ˜ ξ n for some constant c >
0. Let R : = (cid:108) L / √ c ξ n (cid:109) and B , . . . , B R be a partition of [ − L , L ] such that diam ( B j ) ≤√ c ξ n /2. By Lemma A.3, W ( ν , ν (cid:63) ) ≤ √ c ξ n + L (cid:16) R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | (cid:17) .. O HN AND
L. L IN Π (cid:16) ν ∈ M : W ( ν , ν (cid:63) ) ≤ c ξ n (cid:17) ≥ Π ν ∈ M : R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L ≥ Π ν ∈ M : R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L (cid:12)(cid:12)(cid:12) { k = R } ∩ E × Π ( E | k = R ) Π ( k = R ) ,where E denotes the event such that each B j contains exactly one atom of ν .By (P1 (cid:48) ), − log Π ( k = R ) (cid:38) R (cid:38) n and by (P3), − log Π ( E | k = R ) (cid:38) − R log ( ξ − n ) (cid:38) n log n . By (P2), − log Π ν ∈ M : R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L (cid:12)(cid:12)(cid:12) { k = R } ∩ E (cid:38) n log n .Combining the results, we arrive at P ( n ) ν (cid:63) ∗ Φ (cid:16) D n (cid:38) e − c n log n (cid:17) ≥ − n log n for some constant c > k be the positive integer such that ˆ k (cid:16) log n / log log n but 2ˆ k − ≤ log n / log log n . By applying Lemma 6.6 with r = k −
1, if M is sufficientlylarge, we obtain (cid:40) ν ∈ M : W ( ν , ν (cid:63) ) ≥ M log log n log n (cid:41) ⊂ (cid:40) ν ∈ M : (cid:107) m k − ( ν ) − m k − ( ν (cid:63) ) (cid:107) ∞ ≥ M (cid:48) ( ˆ k ) − c − ˆ k log log n log n (cid:41) ⊂ (cid:110) ν ∈ M : (cid:107) m k − ( ν ) − m k − ( ν (cid:63) ) (cid:107) ∞ ≥ M (cid:48) c − ˆ k log − n (cid:111) for some constant M (cid:48) > M and L , and some c > L . Following the proof of Theorem 2.2, it suffices to showthat (cid:16) ˆ k c ˆ k ∨ e c n log n (cid:17) exp (cid:16) − c ( M (cid:48) ) n ˆ k − k + c − k log − n (cid:17) = o ( ) for some constants c , c >
0. Note that ( u ˆ k ) − k + (cid:38) n − for any constant u >
0, thus the preceding display holds clearly.
AYESIAN ESTIMATION OF G AUSSIAN MIXTURES For the fractional posterior, the following oracle inequality holds in general.
Lemma 6.7 (Corollary 3.7 of Bhattacharya et al. [1]) . Let X , . . . , X n iid ∼ G (cid:63) forsome distribution G (cid:63) . Let G be a set of distribution and Π be the prior distribution on G . Then for any ζ ∈ (
0, 1 ) such that n ζ > and α ∈ (
0, 1 ) , we have (cid:90) G ∈G R α ( p G , p G (cid:63) ) Π α ( d G | X n ) ≤ αζ − n log Π ( B KL ( ζ , G , G )) with P ( n ) G (cid:63) -probability at least − n ζ .Proof of Theorem 2.8. Let ζ n : = (cid:113) ¯ k n log n / n . We prove the first assertion. Recallthat Π α ( ν / ∈ M k (cid:63) | X n ) = (cid:82) ν / ∈M k (cid:63) ( p ( n ) ν ∗ Φ ( X n ) / p ( n ) ν (cid:63) ∗ Φ ( X n )) α Π ( d ν ) (cid:82) ( p ( n ) ν ∗ Φ ( X n ) / p ( n ) ν (cid:63) ∗ Φ ( X n )) α Π ( d ν ) .We deal with the numerator and denominator separately. For the denominator,we have the high probability bound (see the proof of Theorem 3.1 of [1]), (cid:90) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≥ e − c n ζ n Π (cid:0) B KL ( ζ n , ν (cid:63) ∗ Φ , M ) (cid:1) with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − c / ( n ζ n ) = − c / ( ¯ k n log n ) , for someconstants c and c depending only on α . Since Π ( B KL ( ζ n , ν (cid:63) ∗ Φ , M )) ≥ exp ( − c k (cid:63) n ζ n ) Π ( k = k (cid:63) ) for some c >
0, we further have (cid:90) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≥ e − ( c + c ) k (cid:63) n ζ n Π ( k = k (cid:63) ) (6.7)with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − c / ( ¯ k n log n ) . For the expectation of thenumerator with respect to P ( n ) ν (cid:63) ∗ Φ , by Fubini’s theorem, we obtain P ( n ) ν (cid:63) ∗ Φ (cid:90) ν / ∈M k (cid:63) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≤ (cid:90) ν / ∈M k (cid:63) (cid:90) n ∏ i = (cid:104) p αν ∗ Φ ( X i ) p − αν (cid:63) ∗ Φ ( X i ) d X i (cid:105) Π ( d ν ) .Since M is convex, (i.e., for any ν , ν ∈ M and t ∈ (
0, 1 ) , there is ¯ ν ∈ M such that p ¯ ν ∗ Φ = ( − t ) p ν ∗ Φ + tp ν ∗ Φ ), we can apply Lemma 2.1 of [1] to. O HN AND
L. L IN < (cid:82) p αν ∗ f ( X i ) p − αν (cid:63) ∗ f ( X i ) d X i ≤
1. Hence, the expectation of numeratoris further bounded by the prior probability Π ( k > k (cid:63) ) . By Markov’s inequalitywe obtain the following high probability bound for the numerator P ( n ) ν (cid:63) ∗ Φ (cid:90) ν / ∈M k (cid:63) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≥ ( ¯ k n log n ) Π ( k > k (cid:63) ) ≤ k n log n .(6.8)Combining (6.7), (6.8) and Assumption (2.2), we have Π α ( ν / ∈ M k (cid:63) | X n ) (cid:46) e − c ¯ k n log n for some constant c >
0, with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( + c ) / ( ¯ k n log n ) .For the second assertion, we note that the Wasserstein distance between anytwo atomic distributions ν ∈ M and ν ∈ M is bounded by diam ([ − L , L ]) = L , and so (cid:90) W ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:46) (cid:90) ν ∈M k (cid:63) W ( ν , ν (cid:63) ) Π α ( d ν | X n )+ Π α ( ν / ∈ M k (cid:63) | X n ) for any given data X n ∈ R n . We have shown that the second term vanishes atspeed e − c ¯ k n log n . We now focus on the first term. For notational simplicity, welet ρ ( ν , ν (cid:63) ) : = (cid:107) m ( k (cid:63) − ) ( ν ) − m ( k (cid:63) − ) ( ν (cid:63) ) (cid:107) ∞ By Lemma 6.1, (cid:90) ν ∈M k (cid:63) W ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:46) k (cid:63) (cid:90) ν ∈M k (cid:63) ρ ( ν , ν (cid:63) ) k (cid:63) − Π α ( d ν | X n ) ≤ k (cid:63) (cid:34) (cid:90) ν ∈M k (cid:63) ρ ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:35) k (cid:63) − for any given data X n ∈ R n , where the second inequality follows from Jensen’sinequality for concave functions. For any ν ∈ M k (cid:63) , Lemma 6.3 implies that ρ ( ν , ν (cid:63) ) (cid:46) ( c k (cid:63) ) k (cid:63) − (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) for some constant c >
0. Since both p ν ∗ Φ and p ν (cid:63) ∗ Φ are bounded by 1/ √ π ,we have (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) ≤ √ π h ( p ν ∗ Φ , p ν ∗ Φ ) ≤ ( π ) − α ∧ ( − α ) R α ( p ν ∗ Φ , p ν ∗ Φ ) AYESIAN ESTIMATION OF G AUSSIAN MIXTURES (cid:90) ν ∈M k (cid:63) R α ( p ν ∗ Φ , p ν ∗ Φ ) Π α ( d ν | X n ) ≤ (cid:90) R α ( p ν ∗ Φ , p ν ∗ Φ ) Π α ( d ν | X n ) (cid:46) k (cid:63) ¯ k n log nn .with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( ¯ k n log n ) . Combining the derived bounds,we arrive at (cid:90) ν ∈M k (cid:63) W ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:46) k (cid:63) (cid:20) (cid:90) ( k (cid:63) ) k (cid:63) − R α ( p ν ∗ Φ , p ν ∗ Φ ) Π α ( d ν | X n ) (cid:21) k (cid:63) − (cid:46) ( k (cid:63) ) k (cid:63) − k (cid:63) − (cid:32) ¯ k n log nn (cid:33) k (cid:63) − ,with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( ¯ k n log n ) , which completes the proof. We introduce the notation F ( · , ν ) : = (cid:90) F ( · , θ ) ν ( d θ ) for a distribution function F ( · , · ) and the mixing distribution ν ∈ M , whichdenotes the distribution function of ν ∗ F . For convenience, we let F be a set ofall distribution functions.A key technical device for the proof is the following relationship betweenthe Kolmogorov distance and the Wasserstein distance. Lemma 6.8 (Theorem 6.3 of Heinrich and Kahn [21]) . Let Θ be a compact subsetof R with nonempty interior. Fix k ∈ N . Suppose that Assumption F(q) is met withq = k.1. There exists a constant c > such that W k − ( ν , ν ) ≤ c (cid:13)(cid:13) F ( · , ν ) − F ( · , ν ) (cid:13)(cid:13) ∞ (6.9) for any ν , ν ∈ M k .2. Let k ∈ [ k ] and let ν ∈ M k \ M k − . There exist constants τ > andc > such that W ( k − k )+ ( ν , ν ) ≤ c (cid:13)(cid:13) F ( · , ν ) − F ( · , ν ) (cid:13)(cid:13) ∞ (6.10) for any ν , ν ∈ M k with W ( ν , ν ) ∨ W ( ν , ν ) < τ . . O HN AND
L. L IN Lemma 6.9 (Lemma 1 of Scricciolo [42]) . Let F (cid:63) be a continuous distributionfunction and P F (cid:63) denote the probability operator with respect to F (cid:63) . Let F be a cset of certain distribution functions. Let { ˜ ζ n } n ∈ N be a positive sequence such that ˜ ζ n (cid:38) (cid:112) log n / n. If the prior distribution on F satisfies Π (cid:16) B KL ( ˜ ζ n , F (cid:63) , F ) (cid:17) (cid:38) exp ( − c n ˜ ζ n ) (6.11) for some constant c > , then P ( n ) F (cid:63) (cid:34) Π (cid:18)(cid:110) F ∈ F : (cid:107) F − F (cid:63) (cid:107) ∞ ≥ M ˜ ζ n (cid:111) | X n (cid:19)(cid:35) = o ( ) for sufficiently large M > Proof of Theorem 3.1.
Let ˜ ζ n : = (cid:112) log n / n . By the first assertion of Lemma 6.8,we have that ν ∈ M ( Θ ) : W ( ν , ν (cid:63) ) ≥ M (cid:18) log nn (cid:19) k (cid:63) − ⊂ (cid:40) ν ∈ M k (cid:63) ( Θ ) : (cid:107) F ( · , ν ) − F ( · , ν (cid:63) ) (cid:107) ∞ ≥ c M (cid:18) log nn (cid:19) (cid:41) ∪ (cid:8) ν / ∈ M k (cid:63) ( Θ ) (cid:9) for some constant c >
0. By the similar argument of Theorem 2.1, it is nothard to prove that the expected posterior probability of the event { ν / ∈ M k (cid:63) } goes to zero. For the first event of the right-hand side of the preceding display,we will apply Lemma 6.9 to conclude the desired result. By (F4), Lemma A.2implies that B KL ( ˜ ζ n , ν (cid:63) ∗ Φ , M ( Θ )) ⊃ (cid:110) ν ∈ M ( Θ ) : h ( p ν ∗ F , p ν (cid:63) ∗ F ) ≤ c ( n log n ) − (cid:111) for some constant c >
0. Furthermore, by (F3), h ( f ( · , θ ) , f ( · , θ )) ≤ (cid:107) f ( · , θ ) − f ( · , θ ) (cid:107) ≤ c | θ − θ | s for any θ , θ ∈ [ − L , L ] for some constant c > h ( p ν ∗ F , p ν ∗ F ) ≤ c W ss ( ν , ν ) for any AYESIAN ESTIMATION OF G AUSSIAN MIXTURES ν , ν ∈ M . Therefore, by Lemma A.3, we obtain Π ( B KL ( ˜ ζ n , ν (cid:63) ∗ F , M ( Θ ))) ≥ Π (cid:16) ν ∈ M k (cid:63) ( Θ ) : W ss ( ν , ν (cid:63) ) ≤ c ( n log n ) − (cid:17) Π ( k = k (cid:63) ) ≥ Π k (cid:63) ∑ j = | w j − w (cid:63) j | ≤ c ( L ) s n log n × Π (cid:32) | θ j − θ (cid:63) j | s ≤ c s n log n , ∀ j ∈ [ k (cid:63) ] (cid:33) Π ( k = k (cid:63) ) (cid:38) e − c log n for some constants c , c >
0, where the last inequality follows from Assump-tion P. Thus the prior concentration condition (6.11) of Lemma 6.9 is fulfilledand the proof is done.
Proof of Theorem 3.2.
Using the similar argument in the proof of Theorem 3.1combined with the second assertion of Lemma 6.8, we obtain the desired result.
Proof of Theorem 4.1. If k (cid:63) > n , the event of interest is empty, so we focus on thecases that k (cid:63) ≤ n . Let ˜ ζ n : = (cid:112) log n / n . As in the proof of Theorem 2.1, we havethat (cid:90) p ( n ) ˜ ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) Π DP ( d ˜ ν ) ≥ e − n ˜ ζ n Π DP (cid:16) B KL ( ˜ ζ n , ν (cid:63) ∗ Φ , M ∞ ) (cid:17) with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( n ˜ δ n ) . Let ξ n : = ( n log n ) − so that ξ n log ( ξ n ) (cid:46) ξ n log ( ξ n ) (cid:46) ˜ ζ n . Then since (cid:82) p ν (cid:63) ∗ Φ ( x )( p ν (cid:63) ∗ Φ ( x ) / p ν ∗ Φ ( x )) b d λ ( x ) < ∞ forsome b ∈ (
0, 1 ) by Equation (4.6) of [18], Lemma A.2 implies that Π DP (cid:0) B KL ( ζ n , ν (cid:63) ∗ Φ , M ∞ ) (cid:1) ≥ Π DP (cid:0) ˜ ν ∈ M ∞ : (cid:107) p ˜ ν ∗ Φ − p ν (cid:63) ∗ Φ (cid:107) ≤ c ξ n (cid:1) .for some constant c >
0. Let B , B , . . . , B k (cid:63) be a partition of [ − L , L ] such that θ (cid:63) j ∈ B j , diam ( B j ) = c ξ n /4 for each j ∈ [ k (cid:63) ] (Here we assume without lossof generality that all the atoms of ν (cid:63) does not overlap with each other, other-wise, we can consider a partition where each set contains exactly one distinctatom). Since the vector ( ˜ ν ( B ) , ˜ ν ( B ) , . . . , ˜ ν ( B k (cid:63) )) follows the Dirichlet distri-bution with parameter ( κ n H ( B ) , κ n H ( B ) , . . . , κ n H ( B k (cid:63) )) , by Lemma A.4 anddiam ( B j ) = c ξ n /4 for every j ∈ [ k (cid:63) ] , we have (cid:8) ˜ ν ∈ M ∞ : (cid:107) p ˜ ν ∗ Φ − p ˜ ν (cid:63) ∗ Φ (cid:107) ≤ c ξ n (cid:9) ⊃ ˜ ν ∈ M ∞ : k (cid:63) ∑ j = (cid:12)(cid:12)(cid:12) ˜ ν ( B j ) − w (cid:63) j (cid:12)(cid:12)(cid:12) ≤ c ξ n . O HN AND
L. L IN w (cid:63) j : =
0. Finally, by Lemma A.5, Π DP ˜ ν ∈ M ∞ : k (cid:63) ∑ j = (cid:12)(cid:12)(cid:12) ˜ ν ( B j ) − w (cid:63) j (cid:12)(cid:12)(cid:12) ≤ c ξ n ≥ ( c ξ n /4 ) k (cid:63) κ k (cid:63) + n k ∏ j = H ( B j ) (cid:38) κ k (cid:63) + n exp ( − c k (cid:63) log n ) for some constant c >
0, where the second inequality follows from that H ( B ) = − ∑ k (cid:63) j = H ( B j ) (cid:38) −
1/ log n (cid:38) H ( B j ) = c ξ n / ( L ) for j ∈ [ k (cid:63) ] andlog ( ξ n ) (cid:16) log n .On the other hand, we use Fubini’s theorem to obtain P ( n ) ν (cid:63) ∗ Φ (cid:90) ∑ Z n ∈ N n : T n ( Z n ) > Ck (cid:63) (cid:40) n ∏ i = φ ( X i − θ Z i ) p w [ ˜ ν ] ( Z i ) p ν (cid:63) ∗ Φ ( X i ) (cid:41) Π DP ( d ˜ ν ) = (cid:90) ∑ Z n ∈ N n : T n ( Z n ) > Ck (cid:63) (cid:40) (cid:90) n ∏ i = φ ( X i − θ Z i ) d X n (cid:41) p ( n ) w [ ˜ ν ] ( Z n ) Π DP ( d ˜ ν )= ∑ Z n ∈ N n : T n ( Z n ) > Ck (cid:63) (cid:90) p ( n ) w [ ˜ ν ] ( Z n ) Π DP ( d ˜ ν )= P CRP ( κ n ) ( T n ( Z n ) > Ck (cid:63) ) ,where CRP ( κ n ) denotes the Chinese restaurant process with concentration pa-rameter κ n . It is known that the probability mass function of T n is given by(e.g., see Proposition 4.9 of [17]) P CRP ( κ n ) ( T n = t ) = C n ( t ) n ! κ tn Γ ( κ n ) Γ ( κ n + n ) where C n ( t ) : = ( n ! ) − ∑ S ⊂ [ n − ] : | S | = n − t ∏ i ∈ S i . Since C n ( t + ) C n ( t ) = ∑ S ⊂ [ n − ] : | S | = t ∏ i ∈ S i ∑ S ⊂ [ n − ] : | S | = t − ∏ i ∈ S i ≤ ∑ S ⊂ [ n − ] : | S | = t − ∏ i ∈ S i (cid:16) ∑ n − i (cid:48) = i (cid:48) (cid:17) ∑ S ⊂ [ n − ] : | S | = t − ∏ i ∈ S i ≤ log ( e ( n − )) we have P CRP ( κ n ) ( T n ≥ t + ) (cid:46) n ∑ h = t + ( κ n log n ) h − (cid:46) ( κ n log n ) t .Hence, P ( n ) ν (cid:63) ∗ Φ (cid:2) Π DP ( T n > Ck ∗ | X n ) (cid:3) (cid:46) e c k (cid:63) log n ( κ n log n ) Ck (cid:63) − κ k (cid:63) + n + o ( ) (cid:46) e c k (cid:63) log n e − (( C − ) k (cid:63) − ) log n + o ( ) AYESIAN ESTIMATION OF G AUSSIAN MIXTURES c >
0. If C > c +
3, the desired result follows.
Proof of Theorem 4.2.
Let ˜ ξ n : = n − log n and ξ n : = n − log − n so that ξ n log ( ξ n ) (cid:46) ˜ ξ n . By the same arguments used in the proof of Theorem 2.7,we have that D n : = (cid:90) p ( n ) ˜ ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) Π DP ( d ˜ ν ) ≥ e − n ˜ ξ n Π DP ˜ ν ∈ M ∞ : R ∑ j = | ˜ ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − n ˜ ξ n , where we define R : = (cid:108) L / √ c ξ n (cid:109) and ( B , . . . , B R ) is a partition of [ − L , L ] such that diam ( B j ) ≤ √ c ξ n /2 forsome c >
0. Since ( ˜ ν ( B ) , . . . , ˜ ν ( B R )) follows the Dirichlet distribution withparameter ( κ n H ( B ) , . . . , κ n H ( B R )) , by Lemma A.5, D n is further bounded as D n (cid:38) e − n ˜ ξ n ξ Rn ( κ n ξ n ) R (cid:38) e − c n log + a n .with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − n ˜ ξ n . Following the proof of Theorem 2.7,we obtain the desired result. A Appendix: Additional lemmas and proofs
A.1 Technical lemmas
The following three lemmas provide inequalities that are useful throughout theproofs.
Lemma A.1 (Lemma 1 of Nguyen [37]) . Let f : R (cid:55)→ R be a convex function suchthat f ( ) = and let f ( · , θ ) denote a probability density function with parameter θ . For two atomic meassures ν : = ∑ k j = w j δ θ j ∈ M k and ν : = ∑ k j = w j δ θ j ∈M k , define W ψ , f ( ν , ν ) : = inf ( p jh ) ∈Q ( w , w ) k ∑ j = k ∑ h = p jh D f (cid:16) f ( · , θ j ) , f ( · , θ j ) (cid:17) , where w : = ( w , . . . , w k ) and w : = ( w , . . . , w k ) . Then D f k ∑ j = w j f ( · , θ j ) , k ∑ j = w j f ( · , θ j ) ≤ W ψ , f ( ν , ν ) .. O HN AND
L. L IN In particular, for the standard normal density function φ , we have h k ∑ j = w j φ ( · − θ j ) , k ∑ j = w j φ ( · − θ j ) ≤ W ( ν , ν ) , KL k ∑ j = w j φ ( · − θ j ) , k ∑ j = w j φ ( · − θ j ) ≤ W ( ν , ν ) . Lemma A.2 (Theorem 5 of Wong and Shen [46]) . Let ζ > be sufficiently small.For two density functions p and p such that h ( p , p ) ≤ ζ andC ζ : = (cid:90) p ( x ) (cid:32) p ( x ) p ( x ) (cid:33) b λ ( d x ) < ∞ for some b ∈ (
0, 1 ] , we have KL ( p , p ) ≤ c ζ ∨ log (cid:32) C ζ ζ (cid:33) , KL ( p , p ) ≤ c ζ ∨ log (cid:32) C ζ ζ (cid:33) for some constants c , c > Lemma A.3 (Lemma 3 of Gao and van der Vaart [14]) . For any µ , µ ∈ P ( Θ ) ,any countably many partition ( B j ) j ∈ N of Θ and any q ≥ , we have W q ( µ , µ ) ≤ sup j ∈ N diam ( B j ) + diam ( Θ ) (cid:16) ∞ ∑ j = | µ ( B j ) − µ ( B j ) | (cid:17) q . In particular, for any ν , ν ∈ M k ( Θ ) with ν : = ∑ kj = w ij δ θ j and ν : = ∑ kj = w j δ θ j ,and any k ∈ N , we have W q ( ν , ν ) ≤ sup j ∈ [ k ] | θ j − θ j | + diam ( Θ ) (cid:16) ∞ ∑ j = | w j − w j | (cid:17) q . A.2 Proofs of Lemmas 6.2 and 6.3
Proof of Lemma 6.2.
Recall that N : = (cid:6) log ( k / η (cid:101) ) (cid:7) ∧ n where η (cid:101) , which is theconstant depending only on (cid:101) , will be specified later. For simplicity we dropthe subscript (cid:101) of η (cid:101) . Let n l : = |X l | then n l is either (cid:98) n / N (cid:99) or (cid:98) n / N (cid:99) +
1. Bythe variance bound presented in Lemma 5 of [47], we have
Var ( M ( η ) l , h ) ≤ n l (cid:16) c (cid:48) ( L + √ h ) (cid:17) h , AYESIAN ESTIMATION OF G AUSSIAN MIXTURES l ∈ [ N ] and any h ∈ [ k − ] , for some universal constant c (cid:48) >
0. Thenby the Chebyshev inequality, the expectation of the random variable definedas Z l , h : = (cid:12)(cid:12)(cid:12) M ( η ) l , h − m h ( ν ) (cid:12)(cid:12)(cid:12) < (cid:115) n l (cid:16) c (cid:48) ( L + √ h ) (cid:17) h is bounded below by P l , h : = P ( n l ) ν ∗ Φ Z l , h ≥
34 , (A.1)for any l ∈ [ N ] and any h ∈ [ k − ] . Now we use the well-known mediantrick. By definition of median and the fact that n l ≥ (cid:98) n / N (cid:99) ≥ n / ( N ) , wehave that (cid:32)(cid:13)(cid:13)(cid:13) ˆ m ( η (cid:101) ) h − m h ( ν ) (cid:13)(cid:13)(cid:13) ∞ ≥ (cid:114) N n (cid:16) c (cid:48) ( L + √ h ) (cid:17) h (cid:33) ≤ (cid:32) N ∑ l = Z l , h ≤ N (cid:33) . (A.2)By Hoeffding’s inequality, the probability of the right-hand side of the preced-ing display is bounded as P ( n ) ν ∗ Φ (cid:32) N ∑ l = Z l , h ≤ N (cid:33) ≤ P ( n ) ν ∗ Φ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N ∑ l = Z l , h − N ∑ l = P l , h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ N ≤ e − N /8 , (A.3)where the first inequality is due to (A.1). Since √ ( c (cid:48) ( L + √ h )) h ≤ (cid:113) c (cid:48) k k − for any h ∈ [ k − ] for some universal constant c (cid:48) > L ,(A.2), (A.3) and the union bound imply P ( n ) ν ∗ Φ (cid:32)(cid:13)(cid:13)(cid:13) ˆ m ( η (cid:101) ) ( k − ) − m ( k − ) ( ν ) (cid:13)(cid:13)(cid:13) ∞ ≥ (cid:114) Nn (cid:18)(cid:113) c (cid:48) k (cid:19) k − (cid:33) ≤ ( k ) e − N /8 .Let η : = ( k ) exp ( − ( c (cid:48) k ) k − n (cid:101) ) , then N : = (cid:4) log ( k / η ) (cid:5) ∧ n ≤ ( c (cid:48) k ) k − n (cid:101) and so (cid:101) ≥ √ N / n ( c (cid:48) k ) k − . Thus, by noticing that N ≥ (cid:110) (( c (cid:48) k ) − k + n (cid:101) ) ∧ n (cid:111) −
1, we get the desired result.
Proof of Lemma 6.3.
We write ν : = ∑ kj = w j δ θ j and ν : = ∑ kj = w j δ θ j . Let H j ( z ) , j ∈ N be the Hermite polynomials defined by the generating functione xt − t = ∞ ∑ j = H j ( x ) t j j ! .Then we have the identity φ ( x − t ) = ( π ) (cid:113) φ ( √ x ) e xt − t = ( π ) (cid:113) φ ( √ x ) ∞ ∑ j = H j ( √ x ) ( t / √ ) j j ! e − t . O HN AND
L. L IN p ν ∗ Φ ( x ) = (cid:90) φ ( x − t ) d ν ( t ) = ( π ) (cid:113) φ ( √ x ) ∞ ∑ j = H j ( √ x ) j /2 j ! E (cid:16) T j e − T (cid:17) for any mixing distribution ν ∈ P ( R ) , where T is the random variable suchthat T ∼ ν . By the orthogonality of the Hermite polynomials, we have √ (cid:90) H l ( √ x ) H j ( √ x ) φ ( √ x ) d x = (cid:90) H l ( x ) H j ( x ) φ ( x ) d x = j ! ( l = j ) .Hence, (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) = ∞ ∑ j = j !2 j √ π (cid:26) E (cid:16) T j e − T (cid:17) − E (cid:16) T j e − T (cid:17)(cid:27) ,where T and T are random variables such that T ∼ ν and T ∼ ν . Thepreceding display implies (cid:12)(cid:12)(cid:12) E (cid:16) T j e − T (cid:17) − E (cid:16) T j e − T (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:113) j !2 j √ π (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) .For each j ∈ [ k − ] , we let P j ( x ) = ∑ k − h = a j , h x h be the unique polynomial ofdegree ( k − ) that interpolates the 2 k points (( θ il , θ jil e − θ il /4 )) i ∈{ } ; l ∈ [ k ] . Weassume all the atoms of ν and ν are distinct, otherwise, we can consider theinterpolation polynomial of degree r , where r < k − (cid:12)(cid:12)(cid:12) E ( T j ) − E ( T j ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) E (cid:16) P j ( T ) e − T (cid:17) − E (cid:16) P j ( T ) e − T (cid:17)(cid:12)(cid:12)(cid:12) ≤ k − ∑ h = | a j , h | (cid:12)(cid:12)(cid:12) E (cid:16) T j e − T (cid:17) − E (cid:16) T j e − T (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) k − ∑ h = | a j , h | (cid:113) j !2 j √ π .It remains to bound the coefficients ( a j , h ) j ∈ [ k − ] ; h ∈ [ k − ] ∪{ } . Let x k ∗ ( i − )+ l − : = θ il and y j , k ∗ ( i − )+ l − : = θ jil e − θ il /4 for i =
1, 2 , l ∈ [ k ] and j ∈ [ k − ] . Then wecan express P j in the Newton form such that P j ( x ) = k − ∑ h = b j [ x , . . . , x h ] h − ∏ l = ( x − x l ) , AYESIAN ESTIMATION OF G AUSSIAN MIXTURES b j is defined recursively as b j [ x h ] : = y j , h b j [ x h , . . . , x h + l ] : = b j [ x h + , . . . , x h + l ] − b j [ x h , . . . , x h + l − ] x h + l − x h .Since the derivatives of all orders of the function x (cid:55)→ x j e − x /4 are uniformlybounded on [ − L , L ] , ( b j [ z , . . . , z h ]) h = k − are uniformly bounded too. Hence,since | x h | ≤ L for every h ∈ [ k − ] ∪ { } , ( a j , h ) j ∈ [ k − ] ; th ∈ [ k − ] ∪{ } are boundedby ( k − ) c k − for some universal constant c >
0. Thus the desired resultfollows from the bound j ! ≤ j j . A.3 Lemmas for the proofs for Section 4
Lemma A.4.
Let B , B , . . . , B k be a measurable partition of a compact set Θ ⊂ R , ( w , . . . , w k ) ∈ ∆ k and θ j ∈ B j for j ∈ [ k ] . Let ν : = ∑ kj = w j δ θ j . Then for anydistribution µ ∈ P ( Θ ) , (cid:13)(cid:13)(cid:13) p µ ∗ Φ − p ν ∗ Φ (cid:13)(cid:13)(cid:13) ≤ ≤ j ≤ k diam ( B j ) + k ∑ j = | µ ( B j ) − w j | , with w : = .Proof. We start with the decomposition p µ ∗ Φ − p ν ∗ Φ = (cid:90) U φ ( x − θ ) d µ ( θ ) + k ∑ j = (cid:90) U j (cid:110) φ ( x − θ ) − φ ( x − θ j ) (cid:111) d µ ( θ )+ k ∑ j = φ ( x − θ j ) (cid:110) µ ( B j ) − w j (cid:111) .Since (cid:107) φ ( · − θ ) − φ ( · − θ j ) (cid:107) ≤ | θ − θ j | and (cid:107) φ (cid:107) =
1, the desired result fol-lows.
Lemma A.5.
Let ( w , . . . , w k ) be distributed according to the Dirichlet distributionwith parameter ( κ , . . . , κ k ) such that κ j ∈ (
0, 1 ] for any j ∈ [ k ] . Then for any ( w , . . . , w k ) ∈ ∆ k and any η ∈ (
0, 1/ k ] , P k ∑ j = | w j − w j | ≤ η ≥ η ( k − ) k ∏ j = κ j . Proof.
Without loss of generality, assume w k ≥ k . Then for ( w , . . . , w k ) suchthat | w j − w j | < η / k , we have k − ∑ j = w j ≤ − w k + ( k − ) η k ≤ ( + η ) k − k ≤
1. O
HN AND
L. L IN η ≤ k . This implies that ( w , . . . , w k ) ∈ ∆ k . Moreover, ∑ kj = | w j − w j | ≤ ∑ k − j = | w j − w j | < η . Thus, P k ∑ j = | w j − w j | ≤ η ≥ P (cid:18) | w j − w j | ≤ η k , j ∈ [ k − ] (cid:19) ≥ Γ ( ∑ kj = κ k ) ∏ kj = Γ ( κ j ) k − ∏ j = (cid:90) ( w j + η / k ) ∧ ( w j − η / k ) ∨ w κ j − j d w j ,where the second inequality follows from the fact that ( − ∑ k − j = w j ) κ k − ≥ κ k ≥
1. Since 1 ≤ Γ ( κ ) ≤ κ for any κ ∈ (
0, 1 ] , w κ j − j ≥ ( w j + η / k ) ∧ − ( w j − η / k ) ∨ ≥ η / k , we further have that P k ∑ j = | w j − w j | < η ≥ (cid:18) η k (cid:19) k − k ∏ j = κ j ≥ η ( k − ) k ∏ j = κ j ,which completes the proof. Acknowledgement
We would like to thank Minwoo Chae for very useful comments and discus-sions. We acknowledge the generous support of NSF grant DMS CAREER1654579.
References [1] Bhattacharya, A., Pati, D., and Yang, Y. (2019). Bayesian fractional posteri-ors.
The Annals of Statistics , 47(1):39–66.[2] Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixturemodel for clustering with the integrated completed likelihood.
IEEE Trans-actions on Pattern Analysis and Machine Intelligence , 22(7):719–725.[3] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation.
Journal of Machine Learning Research , 3(Jan):993–1022.[4] Cao, X., Khare, K., and Ghosh, M. (2019). Posterior graph selection and esti-mation consistency for high-dimensional Bayesian DAG models.
The Annalsof Statistics , 47(1):319–348.[5] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack:Posterior concentration for possibly sparse sequences.
The Annals of Statis-tics , 40(4):2069–2101.
AYESIAN ESTIMATION OF G AUSSIAN MIXTURES
The Annals of Statistics , 36(2):938–962.[7] Chen, J. (1995). Optimal rate of convergence for finite mixture models.
TheAnnals of Statistics , 23(1):221–233.[8] Drton, M. and Plummer, M. (2017). A Bayesian information criterion forsingular models.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 79(2):323–380.[9] Eghbal-zadeh, H., Zellinger, W., and Widmer, G. (2019). Mixture densitygenerative adversarial networks. In
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pages 5820–5829.[10] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric prob-lems.
The Annals of Statistics , 1(2):209–230.[11] Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminantanalysis, and density estimation.
Journal of the American statistical Association ,97(458):611–631.[12] Fruhwirth-Schnatter, S., Celeux, G., and Robert, C. P. (2019).
Handbook ofmixture analysis . CRC Press.[13] Gao, C. and Zhou, H. H. (2016). Rate exact Bayesian adaptation withmodified block priors.
The Annals of Statistics , 44(1):318–345.[14] Gao, F. and van der Vaart, A. (2016). Posterior contraction rates for de-convolution of Dirichlet-Laplace mixtures.
Electronic Journal of Statistics ,10(1):608–627.[15] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergencerates of posterior distributions.
The Annals of Statistics , 28(2):500–531.[16] Ghosal, S. and van der Vaart, A. (2007). Posterior convergence rates ofDirichlet mixtures at smooth densities.
The Annals of Statistics , 35(2):697–723.[17] Ghosal, S. and van der Vaart, A. (2017).
Fundamentals of nonparametricBayesian inference , volume 44. Cambridge University Press.[18] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of conver-gence for maximum likelihood and Bayes estimation for mixtures of normaldensities.
The Annals of Statistics , 29(5):1233–1263.[19] Gr ¨unwald, P. and Van Ommen, T. (2017). Inconsistency of Bayesian infer-ence for misspecified linear models, and a proposal for repairing it.
BayesianAnalysis , 12(4):1069–1103.. O
HN AND
L. L IN arXiv preprintarXiv:1901.05078 .[21] Heinrich, P. and Kahn, J. (2018). Strong identifiability and optimal mini-max rates for finite mixture estimation. The Annals of Statistics , 46(6A):2844–2870.[22] Ho, N. and Nguyen, X. (2016). On strong identifiability and convergencerates of parameter estimation in finite mixtures.
Electronic Journal of Statistics ,10(1):271–307.[23] Ho, N., Nguyen, X., and Ritov, Y. (2020). Robust estimation of mixingmeasures in finite mixture models.
Bernoulli , 26(2):828–857.[24] Hoffmann, M., Rousseau, J., and Schmidt-Hieber, J. (2015). On adaptiveposterior concentration rates.
The Annals of Statistics , 43(5):2259–2295.[25] Keribin, C. (2000). Consistent estimation of the order of mixture models.
Sankhy¯a: The Indian Journal of Statistics, Series A , pages 49–66.[26] Kruijer, W., Rousseau, J., and van der Vaart, A. (2010). Adaptive Bayesiandensity estimation with location-scale mixtures.
Electronic Journal of Statis-tics , 4:1225–1257.[27] Lee, K., Lee, J., and Lin, L. (2019). Minimax posterior convergence ratesand model selection consistency in high-dimensional DAG models based onsparse Cholesky factors.
The Annals of Statistics , 47(6):3413–3437.[28] Martin, R. (2012). Convergence rate for predictive recursion estimation offinite mixtures.
Statistics & Probability Letters , 82(2):378–384.[29] Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes pos-terior concentration in sparse high-dimensional linear models.
Bernoulli ,23(3):1822–1847.[30] McLachlan, G. J., Lee, S. X., and Rathnayake, S. I. (2019). Finite mixturemodels.
Annual Review of Statistics and its Application , 6:355–378.[31] Miller, J. W. and Dunson, D. B. (2019). Robust Bayesian inference via coars-ening.
Journal of the American Statistical Association , 114(527):1113–1125.[32] Miller, J. W. and Harrison, M. T. (2013). A simple example of dirichletprocess mixture inconsistency for the number of components. In
Advancesin Neural Information Processing Systems , pages 199–206.[33] Miller, J. W. and Harrison, M. T. (2014). Inconsistency of pitman-yor pro-cess mixtures for the number of components.
The Journal of Machine LearningResearch , 15(1):3333–3370.
AYESIAN ESTIMATION OF G AUSSIAN MIXTURES
Journal of the American Statistical Association ,113(521):340–356.[35] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet processmixture models.
Journal of Computational and Graphical Statistics , 9(2):249–265.[36] Newton, M. A. (2002). On a nonparametric recursive estimator of themixing distribution.
Sankhy¯a: The Indian Journal of Statistics, Series A , pages306–322.[37] Nguyen, X. (2013). Convergence of latent mixing measures in finite andinfinite mixture models.
The Annals of Statistics , 41(1):370–400.[38] Nobile, A. and Fearnside, A. T. (2007). Bayesian finite mixtures with anunknown number of components: The allocation sampler.
Statistics andComputing , 17(2):147–162.[39] Richardson, E. and Weiss, Y. (2018). On GANs and GMMs. In
Advances inNeural Information Processing Systems , pages 5847–5858.[40] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtureswith an unknown number of components (with discussion).
Journal of theRoyal Statistical Society: Series B (Statistical Methodology) , 59(4):731–792.[41] Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the pos-terior distribution in overfitted mixture models.
Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) , 73(5):689–710.[42] Scricciolo, C. (2017). Bayesian Kantorovich deconvolution in finite mix-ture models. In
Convegno della Societ`a Italiana di Statistica , pages 119–134.Springer.[43] Sethuraman, J. (1994). A constructive definition of Dirichlet priors.
Statis-tica Sinica , 4:639–650.[44] Stephens, M. (2000). Bayesian analysis of mixture models with an un-known number of componentsan alternative to reversible jump methods.
The Annals of Statistics , 28(1):40–74.[45] Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009). Consistency of a recursiveestimate of mixing distributions.
The Annals of Statistics , 37(5A):2502–2522.[46] Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ra-tios and convergence rates of sieve MLEs.
The Annals of Statistics , 23(2):339–362.[47] Wu, Y. and Yang, P. (2018). Optimal estimation of Gaussian mixtures viadenoised method of moments. arXiv preprint arXiv:1807.07237arXiv preprint arXiv:1807.07237