Optimal Bayesian estimation of Gaussian mixtures with growing number of components
Ilsang Ohn and Lizhen Lin
The University of Notre Dame
July 21, 2020
Abstract
We study posterior concentration properties of Bayesian procedures for estimating finite Gaussian mixtures in which the number of components is unknown and allowed to grow with the sample size. Under this general setup, we derive a series of new theoretical results. More specifically, we first show that under mild conditions on the prior, the posterior distribution concentrates around the true mixing distribution at a near optimal rate with respect to the Wasserstein distance. Under a separation condition on the true mixing distribution, we further show that a better and adaptive convergence rate can be achieved, and the number of components can be consistently estimated. Furthermore, we derive optimal convergence rates for the higher-order mixture models where the number of components diverges arbitrarily fast. In addition, we consider the fractional posterior and investigate its posterior contraction rates, which are also shown to be minimax optimal in estimating the mixing distribution under mild conditions. We also investigate Bayesian estimation of general mixtures under strong identifiability conditions, and derive the optimal convergence rates when the number of components is fixed. Lastly, we study theoretical properties of the posterior of the popular Dirichlet process (DP) mixture prior, and show that such a model can provide a reasonable estimate for the number of components while only guaranteeing a slow convergence rate for the mixing distribution estimation.
Finite mixture models are powerful tools for modeling heterogeneous data, which have been used in a wide range of applications in statistics and machine learning including density estimation [26], clustering [11], document modeling [3], image generation [39] and designing generative adversarial networks. A large literature studies convergence rates for estimating the mixing distribution of a finite mixture. A classical result of Chen established the
point-wise convergence rate $C_{\nu^\star} n^{-1/4}$ for estimating the mixing distribution under the $L_1$ distance, where $n$ denotes the sample size and $C_{\nu^\star}$ is a constant depending on the true mixing distribution $\nu^\star$. This convergence result holds for the so-called strongly identifiable mixtures, which include the Gaussian location mixtures as special cases, as do the results stated below. Nguyen [37] and Scricciolo [42] derived the $n^{-1/4}$ point-wise posterior contraction rate under the second-order Wasserstein distance. Ho and Nguyen [22] proved that the maximum likelihood estimator (MLE) can also achieve this point-wise rate. Under the first-order Wasserstein distance, a better point-wise convergence rate $C_{\nu^\star} n^{-1/2}$ can be obtained. Heinrich and Kahn [21], Ho et al. [23] and Guha et al. [20] established the $n^{-1/2}$ point-wise rate for the minimum Kolmogorov distance estimator, the minimum Hellinger distance estimator and the Bayesian procedure with the mixture of finite mixtures (MFM) prior, respectively. On the other hand, for continuous mixtures where the mixing distribution admits a density function, Martin [28] derived a point-wise convergence rate of mixing density estimation for the predictive recursion algorithm [36, 45].

However, due to a lack of uniformity in the constant $C_{\nu^\star}$, these analyses have been restricted to the fixed truth setup, with the number of components assumed to be either known or fixed. Also note that these point-wise rates are not upper bounds of the actual minimax optimal rates of mixing distribution estimation, which were later derived by Heinrich and Kahn [21]. It was shown that the minimax optimal convergence rate of mixing distribution estimation for strongly identifiable mixtures is of order $n^{-1/(4(k^\star - k_0) + 2)}$, where $k^\star$ and $k_0$ denote the total number of components and the number of well-separated components, respectively; the rate deteriorates with the difference $k^\star - k_0$, which can be viewed as the degree of overspecification. Heinrich and Kahn [21] also proposed a minimax optimal minimum Kolmogorov distance estimator, which however can be computationally expensive. More recently, Wu and Yang [47] proposed a computationally efficient estimator called the denoised method of moments estimator for Gaussian mixture models, and showed that this estimator achieves the minimax rate. However, these minimax optimal estimators require knowledge of the number of components $k^\star$, which is not practical. On the other hand, no Bayesian procedure has yet been able to yield a minimax optimal rate.

In general, one does not have prior knowledge of the number of components, and selecting an appropriate value of the number of components is a crucial step in providing accurate estimates of the true mixing distribution. With too many components, one may suffer from large variances, whereas too few components may lead to biased estimators. Also, estimating the number of components may be of interest in itself in practice, especially when each component has a physical interpretation.
A widely used approach to choosing the number of components is based on a model selection criterion applied before estimating parameters, and a few consistent model selection criteria are available in the literature, such as the complete likelihood [2], the Bayesian information criterion (BIC) [25], the singular Bayesian information criterion (sBIC) [8] and the Bayes factor [6].

A Bayesian approach is an attractive alternative due to its ability to estimate both the number of components and the parameters in a unified manner. A natural strategy for inferring a mixture model with an unknown number of components is to also impose a prior on the number of components $k$. By doing so, it provides a way of not only choosing the best number of components (i.e., model selection), but also combining results from different mixture models with possibly varying numbers of components (i.e., model averaging). One notable disadvantage of such models is that posterior computations may be challenging, since they require developing Markov chain Monte Carlo (MCMC) algorithms for sampling from a parameter space of varying dimension, which often results in poor mixing or slow convergence of the Markov chain to the stationary distribution. Several MCMC methods have been proposed to circumvent this issue, including [40, 44, 38, 34]. On the theoretical side, Guha et al. [20] derived the $n^{-1/2}$ point-wise posterior contraction rate for this type of prior distribution. They also obtained posterior consistency of the fixed number of components under the strong identifiability condition. Another promising approach is to use over-fitted mixtures. This approach considers a mixture model with a number of components larger than the true one and estimates the true model by discarding spurious components. Rousseau and Mengersen [41] studied asymptotic properties of over-fitted mixtures and proved that, with a prior on the weights of a mixture using a Dirichlet distribution with a suitably selected hyperparameter, the spurious components vanish asymptotically at the rate $n^{-1/2}\log^a n$ for some $a > 0$.
For the popular Dirichlet process (DP) mixture prior, we show that the number of clusters can be a reasonable estimate of the true number of components (Theorem 4.1), while for mixing distribution estimation the performance of the DP is inferior in view of the convergence rate (Theorem 4.2).
We first introduce some notation that will be used throughout the paper. For a positive integer $n \in \mathbb{N}$, we let $[n] := \{1, 2, \ldots, n\}$. For two positive sequences $\{a_n\}_{n\in\mathbb{N}}$ and $\{b_n\}_{n\in\mathbb{N}}$, we write $a_n \lesssim b_n$ if there exists a positive constant $C > 0$ such that $a_n \le C b_n$ for any $n \in \mathbb{N}$. Moreover, we write $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. For a real number $x \in \mathbb{R}$, $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$ and $\lceil x \rceil$ the smallest integer larger than or equal to $x$. For $n$ random variables $X_1, \ldots, X_n$, we use the shorthand notation $X^n := (X_1, \ldots, X_n)$. We denote by $\mathbb{1}(\cdot)$ the indicator function. Let $\delta_\theta$ denote the Dirac measure at $\theta$.

Let $(\mathcal{X}, \mathscr{X})$ be a measurable space equipped with the Lebesgue measure $\lambda$. For $q > 0$ and a function $f$ on $\mathcal{X}$, we let $\|f\|_q$ denote its $\ell_q$ norm with respect to the Lebesgue measure, i.e., $\|f\|_q := \left(\int |f(x)|^q \lambda(dx)\right)^{1/q}$. For a probability measure $G$ on $(\mathcal{X}, \mathscr{X})$, let $\mathbb{P}_G$ denote the probability or the expectation under the measure $G$. We denote by $p_G$ the probability density function of $G$ with respect to the Lebesgue measure $\lambda$. For $n \in \mathbb{N}$, let $\mathbb{P}^{(n)}_G$ be the probability or the expectation under the product measure and $p^{(n)}_G$ its density function. For two probability densities $p_1$ and $p_2$, we denote by $\mathrm{KL}(p_1, p_2)$ the Kullback-Leibler (KL) divergence from $p_1$ to $p_2$ and by $\mathrm{KL}_2(p_1, p_2)$ the KL variation, i.e.,

$\mathrm{KL}(p_1, p_2) := \int \log\left(\frac{p_1(x)}{p_2(x)}\right) p_1(x) \lambda(dx)$, $\quad \mathrm{KL}_2(p_1, p_2) := \int \log^2\left(\frac{p_1(x)}{p_2(x)}\right) p_1(x) \lambda(dx)$.

Moreover, we let $R_\alpha(p_1, p_2)$ denote the Rényi $\alpha$-divergence of order $\alpha \in (0, 1)$ from $p_1$ to $p_2$ and $h(p_1, p_2)$ denote the Hellinger distance between $p_1$ and $p_2$, i.e.,

$R_\alpha(p_1, p_2) := -\frac{1}{1-\alpha} \log\left(\int p_1^\alpha(x) p_2^{1-\alpha}(x) \lambda(dx)\right)$, $\quad h(p_1, p_2) := \left\{\int \left(\sqrt{p_1(x)} - \sqrt{p_2(x)}\right)^2 \lambda(dx)\right\}^{1/2}$.

For a convex function $f: \mathbb{R} \mapsto \mathbb{R}$ such that $f(1) = 0$, the $f$-divergence from $p_1$ to $p_2$ is defined by

$D_f(p_1, p_2) := \int f\left(\frac{p_1(x)}{p_2(x)}\right) p_2(x) \lambda(dx)$.

For $\zeta > 0$, a space of certain distributions $\mathcal{G}$ and a distribution $G^\star \in \mathcal{G}$, we define a $\zeta$-KL neighborhood of $G^\star$ by

$B_{\mathrm{KL}}(\zeta, G^\star, \mathcal{G}) := \left\{G \in \mathcal{G} : \mathrm{KL}(p_{G^\star}, p_G) < \zeta^2, \ \mathrm{KL}_2(p_{G^\star}, p_G) < \zeta^2\right\}$.

For a metric space $(\mathcal{Z}, \rho)$, we let $N(\epsilon, \mathcal{Z}, \rho)$ denote the $\epsilon$-covering number of $(\mathcal{Z}, \rho)$ and let $\mathrm{diam}(\mathcal{Z}) := \sup\{\rho(z_1, z_2) : z_1, z_2 \in \mathcal{Z}\}$.

In this paper, we initially consider the Gaussian location mixture model in one dimension:

$X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} \sum_{j=1}^{k} w_j \mathrm{N}(\theta_j, \sigma^2)$,   (2.1)

where $\theta_1, \ldots, \theta_k \in \mathbb{R}$ are the atoms and $(w_1, \ldots, w_k) \in \Delta_k$ are the mixing weights. Here we define
$\Delta_k := \{(w_1, \ldots, w_k) \in [0,1]^k : \|w\|_1 = 1\}$

for $k \in \mathbb{N}$. We assume that the variance $\sigma^2$ is known and, without loss of generality, $\sigma^2 = 1$. With the convolution denoted by the symbol $\ast$, we simply write

$\nu \ast \Phi = \sum_{j=1}^{k} w_j \mathrm{N}(\theta_j, 1)$

for the mixing distribution $\nu := \sum_{j=1}^{k} w_j \delta_{\theta_j}$, where $\Phi$ denotes the standard normal distribution. For a set $\Theta \subset \mathbb{R}$ and $k \in \mathbb{N}$, we define the set of $k$-atomic distributions

$\mathcal{M}_k(\Theta) := \left\{\sum_{j=1}^{k} w_j \delta_{\theta_j} : (w_1, \ldots, w_k) \in \Delta_k, \ \theta_1, \ldots, \theta_k \in \Theta\right\}$.

Note that $\mathcal{M}_k(\Theta) \subset \mathcal{M}_{k+1}(\Theta)$ for every $k \in \mathbb{N}$. The parameter space is given by $\mathcal{M}(\Theta) := \bigcup_{k \in \mathbb{N}} \mathcal{M}_k(\Theta)$. For mathematical convenience, we introduce the notation $\mathcal{P}(\Theta)$ to denote the set of all distributions supported on $\Theta$. Note that $\mathcal{M}(\Theta) \subset \mathcal{P}(\Theta)$.

For mixture models, the Wasserstein distance is widely used as a performance measure for mixing distribution estimation. To define the Wasserstein distance between two atomic distributions, we first define, for two given weight vectors $w \in \Delta_k$ and $w' \in \Delta_{k'}$,

$\mathcal{Q}(w, w') := \left\{(p_{jh})_{j \in [k], h \in [k']} \in [0,1]^{k \times k'} : \sum_{h=1}^{k'} p_{jh} = w_j, \ \sum_{j=1}^{k} p_{jh} = w'_h, \ \forall j \in [k], h \in [k']\right\}$,

which is the set of joint distributions on $[k] \times [k']$ with marginal distributions $w$ and $w'$. For any $q \ge 1$, the $q$-th order Wasserstein distance between two atomic distributions $\nu := \sum_{j=1}^{k} w_j \delta_{\theta_j}$ and $\nu' := \sum_{h=1}^{k'} w'_h \delta_{\theta'_h}$ is defined as

$W_q(\nu, \nu') := \left(\inf_{p \in \mathcal{Q}(w, w')} \sum_{j=1}^{k} \sum_{h=1}^{k'} p_{jh} |\theta_j - \theta'_h|^q\right)^{1/q}$.

Our analysis of mixing distribution estimation invokes the connection between the difference of moments and the Wasserstein distance, which was developed by [47]. For $\nu \in \mathcal{M}(\Theta)$, we denote by $m_h(\nu)$ the $h$-th moment of $\nu$, that is, $m_h(\nu) := \mathbb{E}(X^h)$, where $X$ is a random variable such that $X \sim \nu$. The $r$-th order moment vector is defined by $m_{1:r}(\nu) := (m_1(\nu), \cdots, m_r(\nu))$. Closeness of the moment vectors of two atomic distributions implies their closeness in the Wasserstein distance; see Lemmas 6.1 and 6.5.
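To make these two quantities concrete, here is a small numerical sketch (ours, not part of the paper; the two example distributions are arbitrary). It computes $W_1$ between two atomic mixing distributions, using the fact that on the real line $W_1$ equals the $L_1$ distance between distribution functions, which scipy evaluates directly from atoms and weights, together with the sup-norm gap of their moment vectors.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def moment_vector(atoms, weights, r):
    """Moment vector (m_1(nu), ..., m_r(nu)) of nu = sum_j w_j delta_{theta_j}."""
    return np.array([np.sum(weights * atoms**h) for h in range(1, r + 1)])

# Two 3-atomic mixing distributions supported on [-L, L]
atoms1, w1 = np.array([-2.0, 0.0, 2.0]), np.array([1/3, 1/3, 1/3])
atoms2, w2 = np.array([-2.0, 0.1, 2.0]), np.array([0.3, 0.4, 0.3])

# First-order Wasserstein distance between the two atomic distributions
print("W1:", wasserstein_distance(atoms1, atoms2, u_weights=w1, v_weights=w2))

# Sup-norm gap of the moment vectors m_{1:5}
gap = np.max(np.abs(moment_vector(atoms1, w1, 5) - moment_vector(atoms2, w2, 5)))
print("moment gap:", gap)
```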
We first assume that the true data generating process is given as $\nu^\star \ast \Phi$, where $\nu^\star \in \mathcal{M}_{k^\star}([-L, L])$ for some $L > 0$ and $k^\star \in \mathbb{N}$, which is the true number of mixing components. For simplicity, we write $\mathcal{M}_k := \mathcal{M}_k([-L, L])$ for each $k \in \mathbb{N}$ and $\mathcal{M} := \mathcal{M}([-L, L]) = \cup_{k=1}^{\infty} \mathcal{M}_k$. We consider a general model in which the true mixing distribution $\nu^\star \in \mathcal{M}_{k^\star}$ can vary with the sample size $n$; in particular, the true number of components $k^\star$ can vary with $n$. This is a critical difference from the existing Bayesian literature on mixture models, which assumed a fixed true mixing distribution [37, 42, 20].

We assume a known upper bound $\bar{k}_n$ on the true number of components $k^\star$. This assumption alleviates some technical difficulties, and can be justified by taking $\bar{k}_n \asymp \log n / \log\log n$, as Wu and Yang [47] did, since the minimax optimal convergence rate of mixing distribution estimation for large mixtures $\nu^\star \in \mathcal{M}_{k^\star}$ with $k^\star \gtrsim \log n / \log\log n$ is the slow rate $\log\log n / \log n$ (see Proposition 8 of [47]), and we will show that one can develop a Bayesian procedure that attains this rate without knowing an upper bound on the true number of components; see Theorem 2.7 in Section 2.6.

We now introduce our prior distribution on the finite Gaussian mixture model. The prior first samples the number of components $k$ from a prior $\Pi(k)$, and then samples the atoms $\theta \in [-L, L]^k$ and the weights $w \in \Delta_k$ from $\Pi(\theta \mid k)$ and $\Pi(w \mid k)$, respectively. Thus the prior distribution is a distribution on $\mathcal{M} = \cup_{k \in \mathbb{N}} \mathcal{M}_k$. We impose the following conditions on the prior.

Assumption P.
Recall that $\bar{k}_n$ is the known upper bound on the true number of components. The prior distribution $\Pi$ satisfies the following conditions:

(P1) The prior distribution on the number of components $k$ is data-dependent. There are constants $c_1 > 0$ and $A > 0$ such that for any $n \in \mathbb{N}$ and any $k_\circ \in \mathbb{N}$,

$\frac{\Pi(k = k_\circ + 1)}{\Pi(k = k_\circ)} \le c_1 e^{-A \bar{k}_n \log n}$.   (2.2)

Additionally, there are constants $c_2 > 0$ and $c_3 > 0$ such that for any $n \in \mathbb{N}$ and any $k_\dagger \in [\bar{k}_n]$,

$\Pi(k = k_\dagger) \ge c_2 e^{-(c_3 \bar{k}_n \log n) k_\dagger}$.   (2.3)

(P2) For any $k \in \mathbb{N}$ and any $(w^0_1, \ldots, w^0_k) \in \Delta_k$, there are positive constants $c_4$ and $c_5$ such that for any $\eta \in (0, 1/k)$,

$\Pi\left(\sum_{j=1}^{k} |w_j - w^0_j| \le \eta \,\Big|\, k\right) \ge c_4 \eta^{c_5 k}$.   (2.4)

(P3) For any $k \in \mathbb{N}$ and any $\theta^0 \in [-L, L]^k$, there are positive constants $c_6$ and $c_7$ such that for any $\eta > 0$,

$\Pi\left(\max_{1 \le j \le k} |\theta_j - \theta^0_j| \le \eta \,\Big|\, k\right) \ge c_6 \eta^{c_7 k}$.   (2.5)

We now provide some examples of prior distributions satisfying Assumption P. In the following examples, the constant $A > 0$ is the one appearing in (P1).
The mixture of finite mixtures (MFM) prior considered in [34, 20] is a hierarchical prior consisting of a distribution on the number of components, a Dirichlet distribution on the weights and a distribution on the atoms. Assumption P is met by the MFM prior with appropriate choices of each distribution. An example is given as follows. The geometric distribution on $k$ with probability mass function $(1 - p_n)^{k-1} p_n$, where $p_n := 1 - a \exp(-A \bar{k}_n \log n)$ for arbitrary $a > 0$, satisfies (2.2) and (2.3) since $p_n \gtrsim 1$. The Dirichlet distribution $\mathrm{DIR}(\kappa_1, \ldots, \kappa_k)$ on the mixing weights with $\kappa_j \in (\kappa_0, 1)$ for every $j \in [k]$ and some $\kappa_0 \in (0, 1)$ satisfies (P2); see Lemma A.5. If the prior distribution on $\theta$ behaves like a uniform distribution up to a multiplicative constant, then (P3) holds.

Example 2.
Consider a Poisson distribution for $k$ supported on $\mathbb{N}$ with probability mass function $e^{-\lambda_n} \lambda_n^{k-1} / (k-1)!$, where $\lambda_n := a \exp(-A \bar{k}_n \log n)$ for arbitrary $a > 0$. Then this Poisson distribution clearly satisfies (2.2). Also it satisfies (2.3) with the choice $c_3 = A + c'$ for some constant $c' > 0$, since $e^{-\lambda_n} \gtrsim 1$ and

$((k-1)!)^{-1} \ge \exp(-k \log k) \ge \exp(-\bar{k}_n \log \bar{k}_n) \ge \exp(-c' \bar{k}_n \log n)$

for $k \in [\bar{k}_n]$. The MFM prior with such a Poisson prior on the number of components also satisfies Assumption P.
Example 3. Consider a binomial prior distribution on the number of components such that $k - 1 \sim \mathrm{BINOM}(\bar{k}_n - 1, p_n)$ with $p_n := a \exp(-A \bar{k}_n \log n)$ for arbitrary $a > 0$. Then this prior satisfies (2.2) since the ratio of successive probabilities is at most $\bar{k}_n p_n / (1 - p_n)$, where $\bar{k}_n \lesssim e^{\log\log n}$ and $1 - p_n \le 1$. Also it satisfies (2.3) since $1 - p_n \gtrsim 1$. The MFM prior with this binomial prior distribution satisfies Assumption P.
Example 4.
The spike and slab prior distribution on the unnormalized weights can satisfy (P1) and (P2). Suppose that we consider an over-fitted mixture model $\nu = \sum_{j=1}^{\bar{k}_n} w_j \delta_{\theta_j}$. Let $S := \{j \in [\bar{k}_n] : w_j > 0\}$, the set of indices corresponding to nonzero weights. Then we can write $\nu = \sum_{j \in S} w_j \delta_{\theta_j}$. Let $\tilde{w} \equiv (\tilde{w}_j)_{j \in [\bar{k}_n]}$ be independent random variables, where $\tilde{w}_1$ is generated from $\mathrm{GAMMA}(\kappa, b)$ and the other variables, i.e., $\tilde{w}_2, \ldots, \tilde{w}_{\bar{k}_n}$, are generated from the spike and slab distribution $(1 - p_n)\delta_0 + p_n \mathrm{GAMMA}(\kappa, b)$ with $p_n := a \exp(-A \bar{k}_n \log n)$ for $a > 0$, $b > 0$ and $\kappa \in (0, 1)$. If we define the number of components as the number of nonzero elements in $\tilde{w}$ and the weights as a normalized version of $(\tilde{w}_j)_{j \in S}$, i.e., $k := \|\tilde{w}\|_0$ and $w_j := \tilde{w}_j / \|\tilde{w}\|_1$ for $j \in S$, then $k - 1 \sim \mathrm{BINOM}(\bar{k}_n - 1, p_n)$ and, given $k$, $(w_j)_{j \in S}$ follows $\mathrm{DIR}(\kappa, \ldots, \kappa)$. Thus Assumption P holds by Examples 1 and 3.

In this section, we present concentration properties of the posterior distribution $\Pi(\cdot \mid X^n)$ defined below, with the prior given in Section 2.3 and the data generated from the model (2.1):
$\Pi(d\nu \mid X^n) := \frac{p^{(n)}_{\nu \ast \Phi}(X^n) \, \Pi(d\nu)}{\int p^{(n)}_{\nu \ast \Phi}(X^n) \, \Pi(d\nu)}$.   (2.6)

We first show that our posterior distribution does not overestimate the number of components.

Theorem 2.1.
Assume $\nu^\star \in \mathcal{M}_{k^\star}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi(\nu \in \mathcal{M}_{k^\star} \mid X^n)\right] \to 1$.   (2.7)
Remark 1.
Note that the condition $\nu^\star \in \mathcal{M}_{k^\star}$ does not mean that $\nu^\star$ is not included in lower-order models such as $\mathcal{M}_1, \ldots, \mathcal{M}_{k^\star - 1}$, because there may be overlapping atoms or zero weights. In view of this observation, Theorem 2.1 can be stated with a more precise argument as follows. Let $\breve{k}^\star$ be the smallest number of components of the true mixing distribution $\nu^\star$, in the sense that $\nu^\star \in \mathcal{M}_{\breve{k}^\star} \setminus \mathcal{M}_{\breve{k}^\star - 1}$. Then the conclusion of the theorem actually means that $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}[\Pi(\nu \in \mathcal{M}_{\breve{k}^\star} \mid X^n)] \to 1$. □

The following theorem shows the optimal concentration property of the posterior distribution of the mixing distribution.
Theorem 2.2.
Under the same assumptions as Theorem 2.1, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M \bar{\epsilon}_n \,\Big|\, X^n\right)\right] = o(1)$   (2.8)

for some universal constant $M > 0$, where

$\bar{\epsilon}_n := (k^\star)^{\frac{6k^\star - 3}{4k^\star - 2}} \left(\frac{\bar{k}_n \log n}{n}\right)^{\frac{1}{4k^\star - 2}}$.   (2.9)

If the number of components $k^\star$ is fixed, the convergence rate in Theorem 2.2 is equivalent to the minimax optimal rate $n^{-1/(4k^\star - 2)}$ [47, Proposition 7] up to at most a logarithmic factor, since $\bar{k}_n \lesssim \log n$.

Compared with the minimax rate, our rate has two redundant factors, $\bar{k}_n$ and $\log n$. The $\log n$ factor is common in the nonparametric Bayesian literature, and often arises due to the popular "prior mass and testing" proof technique. We refer to the papers [24, 13] for discussions of this phenomenon. We also adopt the "prior mass and testing" approach and thus incur the $\log n$ factor. The $\bar{k}_n$ factor is paid for model selection. Unlike the frequentist work [47], which proposes an estimation algorithm that attains the exact minimax optimal rate under the assumption that the true number of components is known, it is unclear whether the $\bar{k}_n$ factor can be removed. We may be able to remove this factor using somewhat refined proof techniques without assuming a known number of components. For example, some Bayesian works on linear regression [5, 29] and Gaussian directed acyclic graph models [4, 27] simultaneously achieved model selection consistency and the exact minimax convergence rates for parameter estimation through a careful analysis of the likelihood ratio. We will investigate whether the same can be done for Gaussian mixture models in the near future.

To improve the convergence rate in Theorem 2.2, one may assume that the atoms are well separated and the weights are bounded away from zero. We introduce the formal definition related to this notion.
Definition 1.
An atomic distribution $\nu := \sum_{j=1}^{k} w_j \delta_{\theta_j}$ is said to be $k_0(\gamma, \omega)$-separated, for $k_0 \in [k]$, $\gamma > 0$ and $\omega > 0$, if there exists a partition $S_1, \ldots, S_{k_0}$ of $[k]$ such that

• $|\theta_j - \theta_{j'}| \ge \gamma$ for any $j \in S_l$, $j' \in S_{l'}$ and any $l, l' \in [k_0]$ with $l \ne l'$;
• $\sum_{j \in S_l} w_j \ge \omega$ for any $l \in [k_0]$.

We let $\mathcal{M}_{k, k_0, \gamma, \omega} := \{\nu \in \mathcal{M}_k : \nu \text{ is } k_0(\gamma, \omega)\text{-separated}\}$.

In the next theorem, we derive the optimal posterior contraction rate of the mixing distribution under the separation assumption. We call this contraction rate an adaptive rate because the result is achieved without any knowledge of the number of well-separated components $k_0$ of the true mixing distribution.

Theorem 2.3.
Assume $\nu^\star \in \mathcal{M}_{k^\star, k_0, \gamma, \omega}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Moreover, assume that $\gamma\omega > M' \bar{\epsilon}_n$ for a sufficiently large constant $M' > 0$, where $\bar{\epsilon}_n$ is the convergence rate defined in (2.9). Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M \tilde{\epsilon}_n \,\Big|\, X^n\right)\right] = o(1)$,   (2.10)

for some universal constant $M > 0$, where

$\tilde{\epsilon}_n := (k^\star)^{\frac{6(k^\star - k_0) + 3}{4(k^\star - k_0) + 2}} \, \gamma^{-\frac{4k_0 - 2}{4(k^\star - k_0) + 2}} \left(\frac{\bar{k}_n \log n}{n}\right)^{\frac{1}{4(k^\star - k_0) + 2}}$.   (2.11)
Remark 2.
A nice surprise from the result of Theorem 2.3 is that our Bayesian procedure can achieve a better convergence rate than the one in Theorem 2.2 without requiring any further condition on the prior distribution. This is because the condition $\gamma\omega > M' \bar{\epsilon}_n$ guarantees that the mixing distribution $\nu$ is $k_0(a\gamma, a\omega)$-separated asymptotically for some constant $a \in (0, 1)$ under the posterior distribution, provided that Theorem 2.2 holds. □

Under the same separation condition, but with the additional assumption that the number of components $k^\star$ is known, Wu and Yang [47] achieved the convergence rate $C_{k^\star, \gamma}\, n^{-1/(4(k^\star - k_0) + 2)}$ for the denoised method of moments estimator, where $C_{k^\star, \gamma}$ is some quantity depending on $k^\star$ and $\gamma$. Compared with the rate of [47], our convergence rate (2.11) has a redundant factor $\bar{k}_n \log n$ due to the proof technique and the existence of the model selection step. Again, the factor $\bar{k}_n$ can be removed if one assumes that the number of components is known.

In view of Proposition 2.4 presented below, the convergence rate in Theorem 2.3 is minimax optimal [21, Theorem 3.2] up to a logarithmic factor if the model parameters $k^\star$, $k_0$ and $\gamma$ are fixed constants. Heinrich and Kahn [21] established the minimax optimal rate $n^{-1/(4(k^\star - k_0) + 2)}$ for the estimation of mixing distributions satisfying a locally varying condition. Namely, they showed that for fixed $k^\star \in \mathbb{N}$, $k_0 \in [k^\star]$ and $\nu_0 \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$, it follows that

$\inf_{\{\hat{\nu}\}} \sup_{\nu^\star \in \mathcal{M}_{k^\star} : W_1(\nu^\star, \nu_0) \le \epsilon^\dagger_n} \mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[W_1(\hat{\nu}, \nu^\star)\right] \gtrsim n^{-\frac{1}{4(k^\star - k_0) + 2}}$,   (2.12)

where the infimum ranges over all possible sequences of estimators and $\epsilon^\dagger_n := n^{-1/(4(k^\star - k_0) + 2) + \iota}$ for some $\iota > 0$, so that the true mixing distribution is allowed to vary only locally. This locally varying condition is seemingly different from the separation condition given in Definition 1, but in fact the former is a sufficient condition for the latter. Intuitively, we can expect that a true distribution $\nu^\star \in \mathcal{M}_{k^\star}$ close to $\nu_0 \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$ has at least $k_0$ well-separated components, and therefore satisfies the separation condition. We formally state this argument in the next proposition.

Proposition 2.4.
Let $k_0 \in \mathbb{N}$ and $\nu_0 := \sum_{j=1}^{k_0} w^0_j \delta_{\theta^0_j} \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$. Define

$\gamma(\nu_0) := \min_{j, h \in [k_0] : j \ne h} |\theta^0_j - \theta^0_h| > 0$, $\quad \omega(\nu_0) := \min_{j \in [k_0]} w^0_j > 0$.

Let $k \in \{k_0, k_0 + 1, \ldots\}$ and $c \in (0, 1/4)$. Then we have

$\left\{\nu \in \mathcal{M}_k : W_1(\nu, \nu_0) < c\, \gamma(\nu_0)\, \omega(\nu_0)\right\} \subset \mathcal{M}_{k, k_0, (1 - 2c)\gamma(\nu_0), \frac{1 - 2c}{1 - c}\omega(\nu_0)}$.

Due to Proposition 2.4, it is clear that our Bayesian procedure is also near-optimal for the estimation of the mixing distribution under the locally varying condition. We merely state the result.

Corollary 2.5.
Assume $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Let $k_0 \in \mathbb{N}$ be a fixed constant such that $k_0 \le k^\star$, and let $\nu_0 \in \mathcal{M}_{k_0} \setminus \mathcal{M}_{k_0 - 1}$ be a fixed distribution. Moreover, assume that the prior distribution $\Pi$ satisfies Assumption P. Then there exist universal constants $\tau > 0$ and $M > 0$ such that

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M (k^\star)^{\frac{6(k^\star - k_0) + 3}{4(k^\star - k_0) + 2}} \left(\frac{\bar{k}_n \log n}{n}\right)^{\frac{1}{4(k^\star - k_0) + 2}} \,\Big|\, X^n\right)\right] = o(1)$   (2.13)

for any $\nu^\star \in \mathcal{M}_{k^\star}$ with $W_1(\nu^\star, \nu_0) < \tau$ eventually.

As a byproduct, we can obtain posterior consistency of the true number of components when the true mixing distribution $\nu^\star$ is perfectly separated, that is, $k^\star = k_0$. Note that in this case, $\nu^\star \in \mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1}$. The following theorem states this formally.

Theorem 2.6.
Assume $\nu^\star \in \mathcal{M}_{k^\star, k^\star, \gamma, \omega}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Moreover, assume that

$\gamma\omega > M' \max\{\bar{\epsilon}_n, \tilde{\epsilon}_n\}$   (2.14)

for a sufficiently large constant $M' > 0$, where $\bar{\epsilon}_n$ and $\tilde{\epsilon}_n$ are the convergence rates defined in (2.9) and (2.11), respectively. Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(\nu \in \mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1} \,\big|\, X^n\right)\right] \to 1$.   (2.15)

The condition (2.14) provides a threshold for detection. This condition plays a similar role to the beta-min condition for variable selection in linear regression [5, 29].

Guha et al. [20] obtained a consistency result with a prior distribution similar to ours, but their analysis is restricted to the fixed truth case.
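The following back-of-the-envelope calculator (ours) evaluates the $n$-dependence of the rates in Theorems 2.2 and 2.3, with the polynomial prefactors in $k^\star$ and $\gamma$ suppressed. It makes visible how the exponent improves from $1/(4k^\star - 2)$ toward the parametric $1/2$ as the number of well-separated components $k_0$ approaches $k^\star$.

```python
import numpy as np

def rate_no_separation(n, k_star, kbar_n):
    # n-dependence of (2.9): (kbar_n log n / n)^{1/(4 k* - 2)}
    return (kbar_n * np.log(n) / n) ** (1 / (4 * k_star - 2))

def rate_separation(n, k_star, k0, kbar_n):
    # n-dependence of (2.11): (kbar_n log n / n)^{1/(4 (k* - k0) + 2)}
    return (kbar_n * np.log(n) / n) ** (1 / (4 * (k_star - k0) + 2))

n, k_star, kbar_n = 10_000, 4, 5
print("no separation   :", rate_no_separation(n, k_star, kbar_n))       # exponent 1/14
print("k0 = 3          :", rate_separation(n, k_star, 3, kbar_n))       # exponent 1/6
print("k0 = k* (full)  :", rate_separation(n, k_star, k_star, kbar_n))  # exponent 1/2
```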
In Section 2, we have assumed that $k^\star \lesssim \log n / \log\log n$. This assumption is justified by the minimax result for the estimation of higher-order mixtures presented by [47]. In this section, we prove that there is a Bayesian procedure, similar to the one considered in Section 2 but not assuming a known upper bound on the number of components, that can attain this minimax optimality. In this case, instead of Assumption (P1), we impose the milder condition given below on the prior.

(P1′) There are constants $c_1 > 0$ and $c_2 > 0$ such that for any $k_\circ \in \mathbb{N}$,

$\Pi(k = k_\circ) \ge c_1 e^{-c_2 k_\circ^2}$.   (2.16)

(P1′) is satisfied by the Poisson and geometric distributions with constant mean and constant success probability, respectively.

The next theorem provides the convergence rate of mixing distribution estimation without any restriction on the true number of components.

Theorem 2.7.
Assume $\nu^\star \in \mathcal{M}$. Then with the prior distribution $\Pi$ satisfying (P1′), (P2) and (P3), we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi\left(W_1(\nu, \nu^\star) \ge M \frac{\log\log n}{\log n} \,\Big|\, X^n\right)\right] = o(1)$   (2.17)

for some universal constant $M > 0$.

If the true mixing distribution $\nu^\star$ belongs to $\mathcal{M}_{k^\star}$ with $k^\star \asymp \log n / \log\log n$, the convergence rate in the above theorem is rate-exact optimal [47, Theorem 5]. Indeed, the above result holds even when the true generating process is given by $\mu^\star \ast \Phi$ with $\mu^\star \in \mathcal{P}([-L, L])$, which includes continuous or infinite mixtures.

In this section, we consider the fractional posterior, also called the $\alpha$-posterior, as the estimator. With the prior distribution $\Pi$ and the data $X^n$, the fractional posterior $\Pi_\alpha(\cdot \mid X^n)$ of order $\alpha \in (0, 1)$ is defined by

$\Pi_\alpha(d\nu \mid X^n) := \frac{\{p^{(n)}_{\nu \ast \Phi}(X^n)\}^\alpha \, \Pi(d\nu)}{\int \{p^{(n)}_{\nu \ast \Phi}(X^n)\}^\alpha \, \Pi(d\nu)}$.   (2.18)

The fractional posterior has received a great deal of recent attention, mainly due to its empirically demonstrated robustness to model misspecification [19, 31]. In particular, numerical experiments of [31] showed that fractional posteriors of Gaussian mixtures are robust to a certain type of model misspecification, while the regular posteriors are not. Another key advantage is that concentration of the fractional posterior can be established under fewer conditions on the prior compared to the regular posterior [1]. This also turns out to be the case for the Gaussian mixtures. The use of the fractional posterior allows us to avoid the construction of an exponential test function, so the proof is substantially simplified.
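A toy illustration of (2.18) may help fix ideas (a sketch of ours; the two-atom candidate family, flat grid prior and all numerical choices are arbitrary). Over a finite grid of candidate mixing distributions, the $\alpha$-posterior simply raises each likelihood to the power $\alpha$ before normalizing, and $\alpha = 1$ recovers the regular posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Truth: 0.5 N(-2,1) + 0.5 N(2,1)
x = rng.normal(loc=rng.choice([-2.0, 2.0], size=500), scale=1.0)

# Candidates: two-atom mixing distributions nu_t with atoms (-t, t), equal weights
grid = np.linspace(0.5, 3.5, 61)

def loglik(t):
    dens = 0.5 * norm.pdf(x, -t, 1) + 0.5 * norm.pdf(x, t, 1)
    return np.sum(np.log(dens))

ll = np.array([loglik(t) for t in grid])
for alpha in (1.0, 0.5):
    w = np.exp(alpha * (ll - ll.max()))  # tempered likelihood, flat prior on the grid
    w /= w.sum()
    print(f"alpha = {alpha}: posterior mean of t = {np.dot(grid, w):.3f}")
```

The next theorem shows that the fractional posterior has the same optimal concentration properties as the regular posterior.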
Theorem 2.8. Fix $\alpha \in (0, 1)$. Assume $\nu^\star \in \mathcal{M}_{k^\star}$ where $k^\star \le \bar{k}_n \lesssim \log n / \log\log n$. Moreover, assume that the prior distribution $\Pi$ satisfies Assumption P. Then there exist positive constants $c_1$, $c_2$ and $c_3$ such that

$\Pi_\alpha(\nu \in \mathcal{M}_{k^\star} \mid X^n) \ge 1 - c_1 e^{-c_2 \bar{k}_n \log n}$   (2.19)

and

$\int W_1(\nu, \nu^\star) \, \Pi_\alpha(d\nu \mid X^n) \lesssim \bar{\epsilon}_n + e^{-c_2 \bar{k}_n \log n}$,   (2.20)

with $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}$-probability at least $1 - c_3/(\bar{k}_n \log n)$, where $\bar{\epsilon}_n$ is the convergence rate defined in (2.9).

If $e^{-c_2 \bar{k}_n \log n} \lesssim \bar{\epsilon}_n$, which holds for any diverging $\bar{k}_n$, the fractional posterior attains the minimax optimal convergence rate up to a logarithmic factor.

In this section, we extend the theoretical analysis of the Gaussian mixtures provided in Section 2.4 to general mixture models satisfying strong identifiability conditions. With a slight abuse of notation, for a mixing distribution $\nu \in \mathcal{M}(\Theta)$ and a family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ for $\Theta \subset \mathbb{R}$, we let $\nu \ast F$ denote the distribution having the density function

$p_{\nu \ast F}(\cdot) := \int f(\cdot, \theta) \, \nu(d\theta)$,   (3.1)

where $f(\cdot, \theta)$ denotes the probability density function of $F(\cdot, \theta)$. We call $F(\cdot, \cdot)$ a kernel distribution function and $f(\cdot, \cdot)$ a kernel density function. We assume here that the data are i.i.d. observations from the distribution $\nu^\star \ast F$ for some $k^\star$-atomic mixing distribution $\nu^\star \in \mathcal{M}_{k^\star}$ and a family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ satisfying some regularity and strong identifiability conditions. We first introduce the strong identifiability condition.

Definition 2.
A family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ for $\Theta \subset \mathbb{R}$ is said to be $q$-strongly identifiable if for any finite subset $B$ of $\Theta$,

$\left\|\sum_{j=0}^{q} \sum_{\theta' \in B} a_{j, \theta'} \frac{\partial^j f}{\partial \theta^j}(\cdot, \theta')\right\|_\infty = 0 \ \Longrightarrow \ \max_{j \in \{0, \ldots, q\}} \max_{\theta' \in B} |a_{j, \theta'}| = 0$.

We say the mixture $\nu \ast F$ is $q$-strongly identifiable if $\{F(\cdot, \theta) : \theta \in \Theta\}$ is $q$-strongly identifiable. Heinrich and Kahn [21, Theorem 2.4] show that location mixture models, i.e., $f(x, \theta) = f_0(x - \theta)$, in which both the kernel density function $f_0(\cdot)$ and its derivatives up to order $q - 1$ vanish at $\pm\infty$, are $q$-strongly identifiable. In particular, the Gaussian location mixture is $\infty$-strongly identifiable. Also scale mixtures, i.e., $f(x, \theta) = \theta^{-1} f_0(\theta^{-1} x)$ for $\theta \in \Theta \subset \mathbb{R}_+$, with the same condition on the kernel density function, are $q$-strongly identifiable.

We impose the following regularity conditions, including the strong identifiability condition.

Assumption F($q$). The family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ with $\Theta \subset \mathbb{R}$ satisfies the following conditions:

(F1) For any $x \in \mathbb{R}$, $F(x, \theta)$ is $q$-differentiable with respect to $\theta$.

(F2) $\{F(\cdot, \theta) : \theta \in \Theta\}$ is $q$-strongly identifiable.

(F3) There are constants $c > 0$ and $s > 0$ such that

$\left\|\frac{\partial^q F}{\partial \theta^q}(\cdot, \theta_1) - \frac{\partial^q F}{\partial \theta^q}(\cdot, \theta_2)\right\|_\infty \le c |\theta_1 - \theta_2|^s$

for any $\theta_1, \theta_2 \in \Theta$.

(F4) There are constants $c > 0$ and $b \in (0, 1]$ such that

$\int p_{\nu_1 \ast F}(x) \left(\frac{p_{\nu_1 \ast F}(x)}{p_{\nu_2 \ast F}(x)}\right)^b \lambda(dx) \le c$

for any $\nu_1, \nu_2 \in \mathcal{M}_q(\Theta)$.

The first three conditions are inherited from the regularity conditions of [21]. The additional condition (F4) is introduced to control the prior concentration of a KL neighborhood of the true distribution $\nu^\star \ast F$. If the set $\Theta$ is given as an interval, say $[-L, L]$, the condition (F4) is satisfied by various location mixtures, in particular by the Laplace location mixture [14] and the Gaussian location mixture [18].

In this section, we assume that the number of components $k^\star$ is fixed but still unknown. We thus use the prior distribution on the number of components satisfying (P1) with the constant $\bar{k}_n$. Furthermore, since we consider a general set of atoms $\Theta \subset \mathbb{R}$ rather than the interval $[-L, L]$, in order to include, for example, scale mixtures and exponential family mixtures, Assumption (P3) is slightly modified so that Equation (2.5) is met for any $k \in \mathbb{N}$ and $\theta^0 \in \Theta^k$. We also assume that the kernel distribution function $F(\cdot, \cdot)$ is known, i.e., there is no misspecification of the kernel distribution function. That is, we consider the posterior distribution denoted by $\Pi_F(\cdot \mid X^n)$, which is defined as

$\Pi_F(d\nu \mid X^n) := \frac{p^{(n)}_{\nu \ast F}(X^n) \, \Pi(d\nu)}{\int p^{(n)}_{\nu \ast F}(X^n) \, \Pi(d\nu)}$.   (3.2)

Note that we still allow the true mixing distribution to vary with the sample size. This setup is still substantially more general than the fixed truth setup considered in the existing Bayesian literature [37, 42, 20].
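Since (3.1) is the only place the kernel enters the likelihood, a generic implementation is immediate. The short sketch below (ours) evaluates $p_{\nu \ast F}$ with a pluggable kernel density $f(\cdot, \theta)$; the Gaussian and Laplace location kernels shown are the two strongly identifiable examples mentioned above.

```python
import numpy as np

def gaussian_kernel(x, theta):
    # f(x, theta) = phi(x - theta), standard normal location kernel
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def laplace_kernel(x, theta):
    # f(x, theta) = 0.5 * exp(-|x - theta|), Laplace location kernel
    return 0.5 * np.exp(-np.abs(x - theta))

def mixture_density(x, atoms, weights, kernel):
    # p_{nu*F}(x) = sum_j w_j f(x, theta_j), vectorized over x and atoms
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(np.asarray(weights) * kernel(x, np.asarray(atoms)), axis=-1)

xs = np.linspace(-5, 5, 5)
print(mixture_density(xs, atoms=[-1.0, 1.0], weights=[0.5, 0.5], kernel=gaussian_kernel))
print(mixture_density(xs, atoms=[-1.0, 1.0], weights=[0.5, 0.5], kernel=laplace_kernel))
```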
Theorem 3.1.
Let $\Theta$ be a compact subset of $\mathbb{R}$ with nonempty interior. Assume that $\nu^\star \in \mathcal{M}_{k^\star}(\Theta)$ with $k^\star \in \mathbb{N}$ being fixed, and that the family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ satisfies Assumption F($q$) with $q = k^\star$. Then with the prior distribution $\Pi$ satisfying Assumption P, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast F}\left[\Pi_F\left(W_1(\nu, \nu^\star) \ge M \left(\frac{\log n}{n}\right)^{\frac{1}{4k^\star - 2}} \,\Big|\, X^n\right)\right] = o(1)$   (3.3)

for some universal constant $M > 0$.

The convergence rate in Theorem 3.1 is equivalent to the convergence rate (2.9) for the Gaussian mixtures with a fixed number of components $k^\star$.

Remark 3. We believe that even if the number of components grows, the result of Theorem 3.1 still holds with the same convergence rate as (3.3) up to a constant depending on $k^\star$, provided that Assumption F($q$) is met with $q = \infty$. We would need to establish a uniform version of Lemma 6.8 over the number of components, which is a key technical tool for the proof. This could be an objective of future work. □

Moreover, our Bayesian procedure can attain the minimax optimal convergence rate [21, Theorem 3.2] under the locally varying condition on the true mixing distribution, which is assumed in Corollary 2.5 for the Gaussian mixtures.
Theorem 3.2.
Let $\Theta$ be a compact subset of $\mathbb{R}$ with nonempty interior. Let $k^\star, k_0 \in \mathbb{N}$ be fixed constants with $k^\star \ge k_0$ and let $\nu_0 \in \mathcal{M}_{k_0}(\Theta) \setminus \mathcal{M}_{k_0 - 1}(\Theta)$ be a fixed distribution. Assume the family of distribution functions $\{F(\cdot, \theta) : \theta \in \Theta\}$ satisfies Assumption F($q$) with $q = k^\star$. Moreover, assume that the prior distribution $\Pi$ satisfies Assumption P. Then there exist universal constants $\tau > 0$ and $M > 0$ such that

$\mathbb{P}^{(n)}_{\nu^\star \ast F}\left[\Pi_F\left(W_1(\nu, \nu^\star) \ge M \left(\frac{\log n}{n}\right)^{\frac{1}{4(k^\star - k_0) + 2}} \,\Big|\, X^n\right)\right] = o(1)$   (3.4)

for any $\nu^\star \in \mathcal{M}_{k^\star}$ with $W_1(\nu^\star, \nu_0) < \tau$ eventually.

In this section, we consider the Dirichlet process (DP) prior [10] on the mixing distribution, which results in an infinite mixture model: the popular Dirichlet process (DP) mixture model. Although a DP mixture model is minimax optimal for density estimation, it attains only a slow logarithmic rate
of order $(\log n)^{-1/2}$ in estimating the mixing distribution of the Gaussian location mixtures, as shown by [37]. That result assumes that the number of components $k^\star$ is fixed. We consider the DP prior for mixing distribution estimation and derive the posterior contraction rates in the most general setup, by allowing the number of components of the true mixing distribution to grow. Furthermore, we adopt a natural strategy of using the number of clusters $T_n$ of the data to estimate the number of components, and we establish posterior consistency of such a procedure.

Note that the DP prior does not satisfy Assumption (P1), and thus the theorems in Section 2.4 do not cover the case of the DP prior. This section aims to separately analyze concentration properties of the posterior of DP mixture models.

In our Gaussian location mixture setup, the DP is a distribution on infinite-atomic distributions of the form

$\tilde{\nu} := \sum_{j=1}^{\infty} w_j \delta_{\theta_j}$,   (4.1)

where $w_1, w_2, \cdots \in [0, 1]$ are mixing weights such that $\sum_{j=1}^{\infty} w_j = 1$ and $\theta_1, \theta_2, \cdots \in [-L, L]$. We let $\mathcal{M}_\infty$ be the set of distributions of the form (4.1). The DP with a concentration parameter $\kappa > 0$ and a base distribution $H$, denoted by $\mathrm{DP}(\kappa, H)$, can be expressed by the following stick-breaking generation process [43]:

$E_j \overset{\mathrm{iid}}{\sim} \mathrm{BETA}(1, \kappa)$, $\quad w_j = E_j \prod_{h=1}^{j-1} (1 - E_h)$, $\quad \theta_j \overset{\mathrm{iid}}{\sim} H$.

Since the weights generated from the above procedure are positive with probability 1, one can say that $\Pi_{\mathrm{DP}}(\tilde{\nu} \in \mathcal{M}_\infty \setminus \mathcal{M}) = 1$.
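The stick-breaking process above translates directly into code. The following sketch (ours; the truncation level 500 is an arbitrary computational device) samples a truncated approximation of $\mathrm{DP}(\kappa, H)$ with uniform base distribution and reports the number of distinct clusters among $n$ latent assignments, illustrating how a tiny concentration parameter puts almost all mass on very few atoms.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(kappa, L=6.0, trunc=500):
    betas = rng.beta(1.0, kappa, size=trunc)           # E_j ~ Beta(1, kappa)
    sticks = np.concatenate(([1.0], np.cumprod(1 - betas[:-1])))
    w = betas * sticks                                  # w_j = E_j * prod_{h<j}(1 - E_h)
    theta = rng.uniform(-L, L, size=trunc)              # atoms drawn from the base H
    return w / w.sum(), theta                           # renormalize the truncation

def num_clusters(n, kappa):
    w, _ = stick_breaking(kappa)
    z = rng.choice(len(w), size=n, p=w)                 # latent assignments Z_i
    return len(np.unique(z))

n = 2000
print("kappa = 1          :", num_clusters(n, 1.0))
print("kappa = 1/(n log n):", num_clusters(n, 1.0 / (n * np.log(n))))
```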
This implies that every mixing distribution generated from the posterior of the DP mixture model has an infinite number of components; therefore the posterior distribution of the number of components $k$ cannot provide any reasonable estimate of the true number of components.

One possible solution is to use an additional post-processing procedure for the posterior distribution. For example, Guha et al. [20] proposed an operator $T$ on infinite mixing distributions which removes weak components (in the sense that the corresponding weights are very small) and merges similar components (whose atoms are very close) of an infinite mixing distribution, so that $T(\tilde{\nu})$ is a finite mixing distribution. They proved that for a fixed truth $\nu^\star \in \mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1}$, the posterior distribution of the finite mixing distribution $T(\tilde{\nu})$ obtained after post-processing concentrates on the model $\mathcal{M}_{k^\star} \setminus \mathcal{M}_{k^\star - 1}$ under the DP prior distribution with a fixed concentration parameter.

Instead, we use the number of clusters, $T_n$, of the data $X^n$ as an estimate of the number of components. Note that for $i \in [n]$, $X_i \overset{\mathrm{iid}}{\sim} \tilde{\nu} \ast \Phi$ can be written equivalently with the latent assignment variable $Z_i \in \mathbb{N}$ as

$Z_i \overset{\mathrm{iid}}{\sim} w[\tilde{\nu}] := \sum_{j=1}^{\infty} w_j \delta_j$, $\quad X_i \mid Z_i \overset{\mathrm{ind}}{\sim} \mathrm{N}(\theta_{Z_i}, 1)$,

where $w[\tilde{\nu}] \in \mathcal{P}(\mathbb{N})$ can be viewed as the distribution on $\mathbb{N}$ such that $w[\tilde{\nu}](J) = \tilde{\nu}(\{\theta_j : j \in J\})$ for any $J \subset \mathbb{N}$. The number of clusters $T_n$ is defined by

$T_n := T_n(Z^n) := \left|\{j \in \mathbb{N} : \exists i \in [n] \text{ s.t. } Z_i = j\}\right|$.

Here we consider the joint posterior distribution of the mixing distribution $\tilde{\nu}$ and the latent assignment variables $Z^n$ conditioned on the data $X^n$, which is given as

$\Pi_{\mathrm{DP}}(d\tilde{\nu}, Z^n \mid X^n) := \frac{\left[\prod_{i=1}^{n} \phi(X_i - \theta_{Z_i}) \, p_{w[\tilde{\nu}]}(Z_i)\right] \Pi_{\mathrm{DP}}(d\tilde{\nu})}{\int \sum_{Z^n \in \mathbb{N}^n} \left[\prod_{i=1}^{n} \phi(X_i - \theta_{Z_i}) \, p_{w[\tilde{\nu}]}(Z_i)\right] \Pi_{\mathrm{DP}}(d\tilde{\nu})}$,   (4.2)

where $\phi(\cdot)$ denotes the probability density function of the standard normal distribution and $\Pi_{\mathrm{DP}}$ denotes the DP prior.

Note that the data are still assumed to be generated from the finite Gaussian mixture model $\nu^\star \ast \Phi$, where $\nu^\star \in \mathcal{M}_{k^\star}$ for $k^\star \in \mathbb{N}$, but we allow the number of components to grow at an arbitrarily fast speed. Even in such general situations, we show in the following theorem that the DP prior with a suitably chosen concentration parameter can provide a nearly tight upper bound on the true number of components.

Theorem 4.1.
Assume $\nu^\star \in \mathcal{M}_{k^\star}$ with $k^\star \in \mathbb{N}$. Then with the DP prior $\mathrm{DP}(\kappa_n, H)$, where $\kappa_n \asymp (n \log n)^{-1}$ and $H$ is the uniform distribution on $[-L, L]$, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi_{\mathrm{DP}}(T_n > C k^\star \mid X^n)\right] = o(1)$   (4.3)

for some constant $C > 0$ depending only on the prior distribution.

Miller and Harrison [32, 33] showed that the posterior distribution of the number of clusters does not concentrate at the true number of components if one uses the DP prior with a constant concentration parameter. In particular, if the true data generating process is $\mathrm{N}(0, 1) = \delta_0 \ast \Phi$, the posterior probability that the number of clusters is equal to the true number of components (i.e., 1) goes to zero [32, Theorem 5.1]. Our proposed data-dependent concentration parameter resolves this inconsistency.
Remark 4.
Under the prior $\Pi$ considered in Section 2.3, the posterior distribution of $T_n$ is asymptotically the same as that of $k$. Miller and Harrison [34] proved that $|\Pi(k = k_\circ \mid X^n) - \Pi(T_n = k_\circ \mid X^n)| \to 0$ for any $k_\circ \in \mathbb{N}$, as long as $\Pi(k = k') > 0$ for any $k' \in [k_\circ]$. In view of this fact, the number of clusters $T_n$ can be used to infer the true number of components $k^\star$ even if we use the prior distribution $\Pi$ in Section 2.3. □

Remark 5.
One may wonder whether the choice of the concentration parameter $\kappa_n \asymp (n \log n)^{-1}$ would lead to a slower posterior contraction rate when the DP mixture model is used for density estimation, as a DP mixture model is commonly adopted for. It turns out that it would not. In fact, even for $\kappa_n \asymp (n \log n)^{-1}$, one can show that there is a universal constant $M > 0$ such that

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi_{\mathrm{DP}}\left(h(p_{\tilde{\nu} \ast \Phi}, p_{\nu^\star \ast \Phi}) \ge M \frac{\log^a n}{\sqrt{n}} \,\Big|\, X^n\right)\right] = o(1)$

for any $\nu^\star \in \mathcal{P}([-L, L])$, for some $a > 0$. One can easily check the above result. Following the proof of Theorem 5.1 of [18] and applying Lemma A.5, we can see that the prior concentration near the true mixing distribution is lower bounded by $(n^{-1} \kappa_n)^{c_1 \log n} \gtrsim \exp(-c_2 \log^2 n)$ for some $c_1, c_2 > 0$. Thus the usual prior mass and testing approach leads to the conclusion in the preceding display for estimating the density. □
However, using the DP prior leads to a very slow convergence rate for mixing distribution estimation in general, as stated in the next theorem.
Theorem 4.2.
Assume $\nu^\star \in \mathcal{M}$. Then with the DP prior $\mathrm{DP}(\kappa_n, H)$, where $\exp(-c \log^a n) \lesssim \kappa_n \lesssim 1$ for some $a > 1$ and $c > 0$ and $H$ is the uniform distribution on $[-L, L]$, we have

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi_{\mathrm{DP}}\left(W_1(\tilde{\nu}, \nu^\star) \ge M \frac{\log\log n}{\log n} \,\Big|\, X^n\right)\right] = o(1)$   (4.4)

for some universal constant $M > 0$.

The above result holds even when the true mixing distribution $\nu^\star$ is an arbitrary distribution supported on $[-L, L]$. As one can see from the theorem above, if the true mixing distribution is of high order such that $k^\star \asymp \log n / \log\log n$, the posterior of the DP mixture model attains the minimax optimality [47, Theorem 5]. However, unlike the Bayesian procedure proposed in Section 2, we conjecture that the posterior of the DP mixture model cannot attain an improved convergence rate for estimating a mixing distribution when the true number of components grows slowly, say $k^\star \ll \log n / \log\log n$, because it tends to produce many redundant components. Nguyen [37] analyzed the posterior of a Dirichlet process mixture endowed with a fixed concentration parameter for estimating a mixing distribution with a fixed number of components, and obtained a slow convergence rate of order $(\log n)^{-1/2}$ with respect to the second-order Wasserstein distance.

We conduct numerical experiments to validate our theoretical findings. For the prior distribution, we use an MFM prior consisting of a Poisson distribution with mean $\lambda$ on the number of components, the Dirichlet distribution on the weights and the uniform distribution on the atoms. For the Dirichlet distribution prior on the mixing weights, we fix its concentration parameter as a $k$-dimensional vector of 1's. For the mean parameter of the Poisson distribution, we consider the following two choices: a constant one and one inversely proportional to the sample size. We call the former MFM const and the latter MFM vary; the latter is motivated by our theory. For posterior computation, we employ the reversible jump MCMC algorithm of [40]. For each posterior computation, we ran a single Markov chain of length 105,000; we saved every 100-th sample after a burn-in period of 5,000 samples.
We compare the performance of the proposed Bayesian method with other competitors. We consider the denoised method of moments (DMM) estimator proposed by [47] and the maximum a posteriori (MAP) estimator with the Dirichlet distribution prior on the weights and the uniform distribution prior on the atoms. In the implementation of the DMM algorithm, we use the authors' Python code, which is available on GitHub. We consider the MAP estimators of two types of mixture models: exact-fitted and over-fitted mixtures. The number of components of the exact-fitted mixture is exactly equal to the true number of components, and that of the over-fitted mixture is some upper bound $\bar{k}$ on the true number of components; in this simulation, we set $\bar{k} = 2k^\star$. We call the MAP estimator of the exact-fitted mixture MAP exact and the one of the over-fitted mixture MAP over. We use the standard expectation-maximization (EM) algorithm to obtain the MAP estimators. For the proposed Bayesian method, we use the posterior mode of the mixing distribution as an estimator. We consider the two choices of the mean parameter of the Poisson prior, $\lambda_n = n^{-1}$ (MFM vary) and $\lambda_n = 1$ (MFM const). For all four Bayesian methods, we set the support of the uniform distribution prior to the interval $[-6, 6]$ and the concentration parameter of the Dirichlet distribution prior to the vector of 1's.

We generated synthetic data sets from a Gaussian mixture model $\nu^\star \ast \Phi$ with $\nu^\star := \sum_{j=1}^{k^\star} w^\star_j \delta_{\theta^\star_j}$. We consider the following four cases of the true mixing distribution:

Case 1 (Well-separated): $\theta^\star = (-3, -1, 1, 3)$ with equal mixing weights;
Case 2 (Overlapped components): four atoms, two of which are located close to each other;
Case 3 (Weak component): four atoms, one of which carries a very small mixing weight;
Case 4 (Higher-order): $\theta^\star = (-6, -4, -2, 0, 2, 4, 6)$ with equal mixing weights.
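A data-generation sketch for these four cases is given below (ours; the atom and weight values used for Cases 2 and 3 are illustrative placeholders consistent with the case descriptions above, not the exact settings of the study).

```python
import numpy as np

rng = np.random.default_rng(3)

CASES = {
    1: (np.array([-3., -1., 1., 3.]),   np.full(4, 1 / 4)),             # well-separated
    2: (np.array([-3., -1., -0.5, 3.]), np.full(4, 1 / 4)),             # two overlapped atoms (placeholder)
    3: (np.array([-3., -1., 1., 3.]),   np.array([.3, .3, .35, .05])),  # one weak component (placeholder)
    4: (np.arange(-6., 7., 2.),         np.full(7, 1 / 7)),             # higher-order
}

def sample(case, n):
    """Draw n observations from nu* convolved with the standard normal."""
    atoms, w = CASES[case]
    z = rng.choice(len(atoms), size=n, p=w)     # latent component labels
    return rng.normal(loc=atoms[z], scale=1.0)  # X_i | z_i ~ N(theta_{z_i}, 1)

print(sample(1, 500)[:5])
```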
We let the sample size $n$ range over a grid of values. We repeat this data generation 20 times for each experiment and report the average of the first-order Wasserstein distance between each estimator and the true mixing distribution.

Figure 1 displays the average first-order Wasserstein errors of the five estimators for the four cases of the data generating process.

Figure 1: The average of the first-order Wasserstein errors of five estimators by sample size; panels (a)-(d) correspond to Cases 1-4.

Contrary to its theoretical optimality, DMM performs the worst among the five estimators in all scenarios. The performance gap between DMM and the Bayesian methods is largest for Case 4. We observed numerical instability of the DMM implementation in estimating the higher-order mixtures, which leads to the poor performance of the method. For Case 1, the over-fitted mixture model MAP over performs worse than the other Bayesian methods, but performs similarly in the other three cases. For Cases 2 and 3, MFM vary tends to select a smaller mixture than the true one; in general, its posterior distribution is maximized at $k = 3$ while $k^\star = 4$. Note that this does not contradict our theoretical results, where we establish consistent estimation of the number of well-separated components, which might be equal to 3 in these two cases. This leads to slightly better performance for Case 2, where overlapped components exist, and slightly worse performance for Case 3, where weak components exist. For the higher-order mixture case, all four Bayesian methods perform almost identically. Overall, knowing the true number of components does not give a substantial improvement in empirical performance, which corresponds to our theory that it yields at most a $\log n$ gain in the convergence rate.

In this experiment, we assess the performance of the proposed Bayesian procedure and the DP mixture model with data-dependent hyperparameters.
We generated data from the Gaussian mixture with atoms $(-2, 0, 2)$ and equal weights $(1/3, 1/3, 1/3)$. Five independent data sets are generated from this Gaussian mixture model for each sample size $n \in \{50, 100, 250, 1000, 2500\}$. We compare four Bayesian methods: the two MFM models with Poisson mean parameter $\lambda_n = n^{-1}$ (MFM vary) and $\lambda_n = 1$ (MFM const), and the two DP mixture models with concentration parameter $\kappa_n = (n \log n)^{-1}$ (DP vary) and $\kappa_n = 1$ (DP const).
DP vary captures the true number ofcomponents well for large samples. It is a widely observed that the DP mix-ture tends to produce redundant clusters, in particular, Miller and Harrison[34] and Guha et al. [20] observed this phenomenon in their simulation studies,however our simulation shows that a data-dependent concentration parameterinversely related to the sample size can circumvent this issue.
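The effect of the concentration parameter on the number of clusters can be checked directly on the prior. The sketch below (ours) simulates the Chinese restaurant process induced by $\mathrm{DP}(\kappa, H)$; the prior expected number of clusters grows like $\kappa \log(1 + n/\kappa)$, so $\kappa_n = (n \log n)^{-1}$ keeps the prior cluster count essentially bounded while $\kappa = 1$ lets it grow logarithmically. This illustrates only the prior's preference for few clusters; the posterior behavior is the content of Theorem 4.1.

```python
import numpy as np

def crp_num_clusters(n, kappa, rng):
    counts = []                                   # current cluster sizes
    for _ in range(n):
        # join an existing cluster w.p. prop. to its size, or a new one w.p. prop. to kappa
        probs = np.array(counts + [kappa], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(counts):
            counts.append(1)
        else:
            counts[j] += 1
    return len(counts)

rng = np.random.default_rng(4)
for n in (50, 250, 2500):
    print(n,
          "const:", crp_num_clusters(n, 1.0, rng),
          "vary :", crp_num_clusters(n, 1.0 / (n * np.log(n)), rng))
```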
Proof of Theorem 2.1.
Let $\tilde{\zeta}_n := \sqrt{\log n / n}$. We state the following well-known result in the Bayesian literature (e.g., Lemma 8.1 of [15]):

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left(\int \frac{p^{(n)}_{\nu \ast \Phi}}{p^{(n)}_{\nu^\star \ast \Phi}}(X^n) \, \Pi(d\nu) \ge e^{-2n\tilde{\zeta}_n^2} \, \Pi(B_{\mathrm{KL}}(\tilde{\zeta}_n, \nu^\star \ast \Phi, \mathcal{M}))\right) \ge 1 - \frac{1}{n\tilde{\zeta}_n^2}$.

By Lemma A.1, we have

$\mathrm{KL}(p_{\nu^\star \ast \Phi}, p_{\nu \ast \Phi}) \le W_2^2(\nu \ast \Phi, \nu^\star \ast \Phi)$.

Since $\int p_{\nu^\star \ast \Phi}(x) (p_{\nu^\star \ast \Phi}(x) / p_{\nu \ast \Phi}(x))^b \lambda(dx) < \infty$ for some $b \in (0, 1)$, which is shown by Equation (4.6) of [18], Lemma A.1 and Lemma A.2 imply that

$\mathrm{KL}_2(p_{\nu^\star \ast \Phi}, p_{\nu \ast \Phi}) \le c_1 W_2^2(\nu \ast \Phi, \nu^\star \ast \Phi) \log^2\left(\frac{1}{W_2(\nu \ast \Phi, \nu^\star \ast \Phi)}\right)$

for some constant $c_1 > 0$. Since moreover $W_2^2(\nu \ast \Phi, \nu^\star \ast \Phi) \le W_2^2(\nu, \nu^\star) \le 2L\, W_1(\nu, \nu^\star)$, it follows that

$\Pi(B_{\mathrm{KL}}(\tilde{\zeta}_n, \nu^\star \ast \Phi, \mathcal{M})) \ge \Pi\left(\nu \in \mathcal{M} : W_1(\nu, \nu^\star) \le c_2 (n \log n)^{-1}\right) \ge \Pi\left(\nu \in \mathcal{M}_{k^\star} : W_1(\nu, \nu^\star) \le c_2 (n \log n)^{-1}\right) \Pi(k = k^\star)$

for some constant $c_2 > 0$. We now lower bound the prior mass of the Wasserstein ball in $\mathcal{M}_{k^\star}$ in the preceding display. By Lemma A.3, we have that for any $\nu \in \mathcal{M}_{k^\star}$,

$W_1(\nu, \nu^\star) \le \max_{1 \le j \le k^\star} |\theta_j - \theta^\star_j| + L \sum_{j=1}^{k^\star} |w_j - w^\star_j|$.

By (P2) and (P3),

$\Pi\left(\nu \in \mathcal{M}_{k^\star} : W_1(\nu, \nu^\star) \le c_2 (n \log n)^{-1}\right) \ge \Pi\left(\sum_{j=1}^{k^\star} |w_j - w^\star_j| \le \frac{c_2}{2L\, n \log n}\right) \Pi\left(|\theta_j - \theta^\star_j| \le \frac{c_2}{2\, n \log n}, \ \forall j \in [k^\star]\right) \gtrsim \left((n \log n)^{-1}\right)^{c_3 k^\star} \gtrsim e^{-c_4 k^\star \log n}$

for some constants $c_3, c_4 > 0$. Therefore,

$\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\Pi(\nu \notin \mathcal{M}_{k^\star} \mid X^n)\right] = \mathbb{P}^{(n)}_{\nu^\star \ast \Phi}\left[\frac{\int_{\nu \notin \mathcal{M}_{k^\star}} p^{(n)}_{\nu \ast \Phi}(X^n)/p^{(n)}_{\nu^\star \ast \Phi}(X^n) \, \Pi(d\nu)}{\int p^{(n)}_{\nu \ast \Phi}(X^n)/p^{(n)}_{\nu^\star \ast \Phi}(X^n) \, \Pi(d\nu)}\right] \lesssim \frac{\Pi(k > k^\star)}{e^{-2n\tilde{\zeta}_n^2} \, \Pi(B_{\mathrm{KL}}(\tilde{\zeta}_n, \nu^\star \ast \Phi, \mathcal{M}))} + \frac{1}{n\tilde{\zeta}_n^2} \lesssim e^{(2 + c_4) k^\star \log n} \frac{\Pi(k > k^\star)}{\Pi(k = k^\star)} + \frac{1}{n\tilde{\zeta}_n^2} \lesssim e^{(2 + c_4) k^\star \log n} e^{-A \bar{k}_n \log n} + \frac{1}{n\tilde{\zeta}_n^2}$.

Hence if $A > c_4 + 2$, the desired result follows, since $n\tilde{\zeta}_n^2 = \log n \to \infty$.
For the proof of Theorem 2.2, we use the following moment comparison lemma to translate the mixing distribution estimation problem into a moment vector estimation problem.

Lemma 6.1 (Proposition 1 of Wu and Yang [47]). Suppose that $\nu_1, \nu_2 \in \mathcal{M}_k([-L, L])$ for $L > 0$. Let $\zeta := \|m_{1:(2k-1)}(\nu_1) - m_{1:(2k-1)}(\nu_2)\|_\infty$. Then

$W_1(\nu_1, \nu_2) \le c\, k\, \zeta^{\frac{1}{2k-1}}$   (6.1)

for some constant $c > 0$ depending only on $L$.

We use a standard "prior mass and testing" approach to prove the convergence of the moment vector. The crucial step is to construct a test function with exponentially small error probabilities. We employ the median denoised moment estimator proposed by [47] in the construction of such a test function.
Definition 3.
Let $X^n$ be $n$ independent samples, and let $k \in \mathbb{N}$ and $\eta \in (0, 1)$. Divide the sample into $N := \lfloor \log(k/\eta) \rfloor \wedge n$ batches of almost equal size, say $\mathcal{X}_1, \ldots, \mathcal{X}_N$, where each batch has $\lfloor n/N \rfloor$ or $\lfloor n/N \rfloor + 1$ elements. For $l \in [N]$ and $h \in [2k - 1]$, compute

$\tilde{M}^{(\eta)}_{l, h} := \frac{1}{|\mathcal{X}_l|} \sum_{X \in \mathcal{X}_l} X^h$, $\quad M^{(\eta)}_{l, h} := h! \sum_{a=0}^{\lfloor h/2 \rfloor} \frac{(-1)^a}{2^a \, a! \, (h - 2a)!} \tilde{M}^{(\eta)}_{l, h - 2a}$.

Then we define the median denoised moment estimator $\hat{m}^{(\eta)}_{1:(2k-1)} = (\hat{m}^{(\eta)}_h)_{h \in [2k-1]}$ by

$\hat{m}^{(\eta)}_h := \hat{m}^{(\eta)}_h(X^n) := \mathrm{Median}\left(\{M^{(\eta)}_{l, h} : l \in [N]\}\right)$.   (6.2)

For the median denoised moment estimator we have the following exponential tail bound. Recall that $\mathcal{P}([-L, L])$ stands for the set of all distributions supported on $[-L, L]$.
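An implementation sketch of Definition 3 follows (ours; it relies on the standard fact that the probabilists' Hermite polynomial satisfies $\mathbb{E}[\mathrm{He}_h(\theta + Z)] = \theta^h$ for $Z \sim \mathrm{N}(0,1)$, which is what makes the batchwise denoising unbiased): Hermite-denoise the raw empirical moments within each batch, then take coordinatewise medians across batches.

```python
import numpy as np
from math import floor, factorial

def denoise(raw, h):
    """Denoised estimate of m_h(mu) from raw moments of X ~ mu * N(0,1),
    i.e. the empirical average of He_h(X) written in terms of raw moments."""
    return factorial(h) * sum(
        (-1) ** a / (2 ** a * factorial(a) * factorial(h - 2 * a)) * raw[h - 2 * a]
        for a in range(floor(h / 2) + 1)
    )

def median_denoised_moments(x, k, eta):
    n = len(x)
    N = max(1, min(int(np.log(max(k / eta, np.e))), n))   # number of batches
    batches = np.array_split(x, N)
    ests = np.empty((N, 2 * k - 1))
    for l, b in enumerate(batches):
        raw = [np.mean(b ** h) for h in range(2 * k)]      # raw[0] = 1
        ests[l] = [denoise(raw, h) for h in range(1, 2 * k)]
    return np.median(ests, axis=0)                         # coordinatewise median

rng = np.random.default_rng(6)
x = rng.normal(loc=rng.choice([-1.0, 1.0], size=20_000), scale=1.0)
print(median_denoised_moments(x, k=2, eta=0.05))   # should be near (0, 1, 0) = m_{1:3}
```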
Lemma 6.2. Suppose that $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} \mu \ast \Phi$ where $\mu \in \mathcal{P}([-L, L])$. Then for any $k \in \mathbb{N}$ and $\epsilon > 0$, there is a constant $c > 0$ depending only on $L$ such that

$\mathbb{P}^{(n)}_{\mu \ast \Phi}\left(\|\hat{m}^{(\eta_\epsilon)}_{1:(2k-1)} - m_{1:(2k-1)}(\mu)\|_\infty \ge \epsilon\right) \le (2k) \exp\left(-n\left\{(ck)^{-2k+1} \epsilon^2 \wedge 1\right\}\right)$,

where $\hat{m}^{(\eta_\epsilon)}_{1:(2k-1)}$ is the median denoised moment estimator presented in Definition 3 with $\eta = \eta_\epsilon$, where $\eta_\epsilon := (2k) \exp\left(-(ck)^{-2k+1} n \epsilon^2\right)$.

To control the covering number of the parameter space $\mathcal{M}$, we need the following two lemmas.

Lemma 6.3.
For any $\nu_1, \nu_2 \in \mathcal{M}_k([-L, L])$, we have

$\|m_{1:(2k-1)}(\nu_1) - m_{1:(2k-1)}(\nu_2)\|_\infty \le c_1 \left(\sqrt{c_2 k}\right)^{2(k-1)} \|p_{\nu_1 \ast \Phi} - p_{\nu_2 \ast \Phi}\|_1$

for some constants $c_1 > 0$ and $c_2 > 0$ depending only on $L$.

Lemma 6.4 (Theorem 3.1 of Ghosal and van der Vaart [18]). For any $\epsilon \in (0, 1/2)$,

$\log N\left(\epsilon, \{p_{\mu \ast \Phi} : \mu \in \mathcal{P}([-L, L])\}, \|\cdot\|_1\right) \le c \left(\log \frac{1}{\epsilon}\right)^2$

for some universal constant $c > 0$.
Proof of Theorem 2.2.
Let $\zeta_n := \sqrt{\bar{k}_n \log n / n}$. In the proof of Theorem 2.1, we have shown that $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}(\mathcal{A}_n) \to 1$ as $n \to \infty$, where

$\mathcal{A}_n := \left\{X^n \in \mathbb{R}^n : \int \frac{p^{(n)}_{\nu \ast \Phi}}{p^{(n)}_{\nu^\star \ast \Phi}}(X^n) \, \Pi(d\nu) \ge e^{-c_0 k^\star n \zeta_n^2}\right\}$

for some constant $c_0 > 0$. Since $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}[\Pi(\nu \in \mathcal{M}_{k^\star} \mid X^n)] \to 1$, the proof is done if we prove that $\mathbb{P}^{(n)}_{\nu^\star \ast \Phi}[\Pi(\tilde{\mathcal{U}} \mid X^n)] = o(1)$, where

$\tilde{\mathcal{U}} := \{\nu \in \mathcal{M} : W_1(\nu, \nu^\star) \ge M \bar{\epsilon}_n\} \cap \{\nu \in \mathcal{M}_{k^\star}\} = \{\nu \in \mathcal{M}_{k^\star} : W_1(\nu, \nu^\star) \ge M \bar{\epsilon}_n\}$.

For notational simplicity, we suppress the subscript $1\!:\!(2k^\star - 1)$ of the moment vector and its denoised estimator, writing $m(\cdot) := m_{1:(2k^\star - 1)}(\cdot)$ and $\hat{m}^{(\eta)} := \hat{m}^{(\eta)}_{1:(2k^\star - 1)}$. Let $\rho(\nu_1, \nu_2) := \|m(\nu_1) - m(\nu_2)\|_\infty$ for $\nu_1, \nu_2 \in \mathcal{M}$. Let

$\mathcal{U} := \left\{\nu \in \mathcal{M}_{k^\star} : \rho(\nu, \nu^\star) \ge \lceil M_{k^\star} \rceil \zeta_n\right\}$,

where $M_{k^\star} := \sqrt{k^\star}\left(\sqrt{M_1}\, k^\star\right)^{2k^\star - 1}$ with $M_1 > 0$. Since $\lceil M_{k^\star} \rceil^{1/(2k^\star - 1)} \le (2M_{k^\star})^{1/(2k^\star - 1)} \lesssim \sqrt{M_1}\, k^\star$, by Lemma 6.1, if we take $M$ such that $M \ge c_1 \sqrt{M_1}$ for some constant $c_1 > 0$ depending only on $L$, we have $\tilde{\mathcal{U}} \subset \mathcal{U}$.

It remains to bound the posterior probability of $\mathcal{U}$. To do this we use a standard peeling device. Define

$\mathcal{U}_t := \left\{\nu \in \mathcal{M}_{k^\star} : t\zeta_n \le \|m(\nu) - m(\nu^\star)\|_\infty < (t + 1)\zeta_n\right\}$.

Since $\|m(\nu)\|_\infty \le (1 \vee L)^{2k^\star - 1}$ for any $\nu \in \mathcal{M}_{k^\star}([-L, L])$, for $t$ larger than $2(1 \vee L)^{2k^\star - 1}/\zeta_n$ the set $\mathcal{U}_t$ is empty. Therefore,

$\mathcal{U} \subset \bigcup_{t = \lceil M_{k^\star} \rceil}^{t^*_n} \mathcal{U}_t$, $\quad$ where $t^*_n := \sup\left\{t \in \mathbb{N} : t \le 2(1 \vee L)^{2k^\star - 1}/\zeta_n\right\}$.

Let $(\nu_{t,s} : s \in [S_t])$ be a $t\zeta_n/4$-net of $\mathcal{U}_t$ in the distance $\rho(\cdot, \cdot)$ for each $t$, where $S_t := N(t\zeta_n/4, \mathcal{U}_t, \rho)$. We further decompose $\mathcal{U}_t$ into $\mathcal{U}_{t,s}$, $s \in [S_t]$, where

$\mathcal{U}_{t,s} := \left\{\nu \in \mathcal{U}_t : \|m(\nu) - m(\nu_{t,s})\|_\infty < t\zeta_n/4\right\}$,
L. L IN U t ⊂ (cid:83) S t s = U t , s .Now we construct the test function for the test H : ν = ν (cid:63) versus H : ν ∈U t , s with exponentially small type I and II error probabilities. Let ψ t , s : R n (cid:55)→ [
0, 1 ] be the function given by ψ t , s ( X n ) : = (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν (cid:63) ) (cid:107) ∞ ≥ t ζ n /4 (cid:17) ,where ˆ m ( η n , t ) is the median denoised moments defined in Definition 3 with η n , t : = ( k (cid:63) ) exp (cid:16) − ( c k (cid:63) ) − k (cid:63) + n ( t ζ n /4 ) (cid:17) .Here, the universal constant c > L is chosen so that P ( n ) ν (cid:63) ∗ Φ (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ > t ζ n /4 (cid:17) (cid:46) k (cid:63) exp (cid:18) − n (cid:110) ( c k (cid:63) ) − k (cid:63) + ( t ζ n /4 ) ∧ (cid:111)(cid:19) .Note that the existence of the constant c is guaranteed by Lemma 6.2. We justshowed the exponential type I error bound for the test function ψ t , s . By triangleinequality, we have that for every ν ∈ U t , s , (cid:107) ˆ m ( η n , t ) − m ( ν (cid:63) ) (cid:107) ∞ ≥ (cid:107) m ( ν t , s ) − m ( ν (cid:63) ) (cid:107) ∞ − (cid:107) m ( ν ) − m ( ν t , s ) (cid:107) ∞ − (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ ≥ t ζ n − t ζ n /4 − (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ .Thus the type II error probability is bounded exponentially assup ν ∈U t , s P ( n ) ν ∗ Φ (cid:0) − ψ t , s ( X n ) (cid:1) = sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν (cid:63) ) (cid:107) ∞ < t ζ n /4 (cid:17) ≤ sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:16) t ζ n /4 − (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ < t ζ n /4 (cid:17) ≤ sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:16) (cid:107) ˆ m ( η n , t ) − m ( ν ) (cid:107) ∞ > t ζ n /2 (cid:17) (cid:46) k (cid:63) exp (cid:18) − n (cid:110) ( c k (cid:63) ) − k (cid:63) + ( t ζ n /4 ) ∧ (cid:111)(cid:19) .We need to compute the upper bound of S t . By Lemma 6.3, for any ν , ν ∈M k (cid:63) , we have ρ ( ν , ν ) ≤ c (cid:16)(cid:112) c k (cid:63) (cid:17) k (cid:63) − (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) , AYESIAN ESTIMATION OF G AUSSIAN MIXTURES c , c > L , which implies that S t : = N (cid:0) t ζ n /4, U t , ρ (cid:1) ≤ N t ζ n c (cid:16) √ c k (cid:63) (cid:17) k (cid:63) − , { p ν ∗ Φ : ν ∈ M k (cid:63) } , (cid:107) · (cid:107) (cid:46) (cid:32) ( k (cid:63) ) k (cid:63) t ζ n (cid:33) c (cid:46) e c log n for some universal constants c , c > ( k (cid:63) ) k (cid:63) (cid:46) exp ( k (cid:63) log ( k (cid:63) )) (cid:46) exp ( c log n ) for some c > ψ : R n (cid:55)→ [
0, 1 ] defined by ψ : = sup t ∈ N : M k (cid:63) ≤ t ≤ t ∗ n max s ∈ [ S t ] ψ t , s .For notational simplicity, we denote A ( M , k (cid:63) , ζ n ) : = n (cid:20)(cid:110) ( c k (cid:63) ) − k (cid:63) + ( M k (cid:63) ζ n /4 ) (cid:111) ∧ (cid:21) = n (cid:20)(cid:110) k (cid:63) ( M / c ) k (cid:63) − ( ζ n /4 ) (cid:111) ∧ (cid:21) .Then the type I error probability of ψ is bounded by P ( n ) ν (cid:63) ∗ Φ ψ ( X n ) ≤ t ∗ n ∑ t = (cid:100) M k (cid:63) (cid:101) S t P ( n ) ν (cid:63) ∗ Φ ψ t , s ( X n ) (cid:46) t ∗ n k (cid:63) e c log n exp (cid:0) − A ( M , k (cid:63) , ζ n ) (cid:1) (cid:46) k (cid:63) exp (cid:0) c log n − A ( M , k (cid:63) , ζ n ) (cid:1) (6.4)for some constants c , c > L , where the third inequalityfollows from the fact that t ∗ n ≤ ( ∨ L ) k (cid:63) − / ζ n (cid:46) e c log n for some constant c > L . On the other hand, the type II error is boundedbysup ν ∈U P ( n ) ν ∗ Φ ( − ψ ( X n )) ≤ sup t ∈ N : M k (cid:63) ≤ t ≤ t ∗ n sup s ∈ [ S t ] sup ν ∈U t , s P ( n ) ν ∗ Φ (cid:0) − ψ t , s ( X n ) (cid:1) (cid:46) k (cid:63) exp (cid:0) − A ( M , k (cid:63) , ζ n ) (cid:1) . (6.5). O HN AND
L. L IN P ( n ) ν (cid:63) ∗ Φ (cid:104) Π (cid:0) U | X n (cid:1)(cid:105) ≤ P ( n ) ν (cid:63) ∗ Φ ψ ( X n ) + P ( n ) ν (cid:63) ∗ Φ (cid:104) ( − ψ ( X n )) Π (cid:0) U | X n (cid:1) A n (cid:105) + o ( ) ≤ P ( n ) ν (cid:63) ∗ Φ ψ ( X n ) + − c k (cid:63) n ζ n sup ν ∈U P ( n ) ν ∗ Φ ( − ψ ( X n )) + o ( ) (cid:46) k (cid:63) exp (cid:16) c log n + c k (cid:63) ¯ k n log n − A ( M , k (cid:63) , ζ n ) (cid:17) + o ( ) (cid:46) exp (cid:16) c k (cid:63) ¯ k n log n − A ( M , k (cid:63) , ζ n ) (cid:17) + o ( ) for some constant c > L . Note that for any M such that M > c , we have A ( M , k (cid:63) , ζ n ) ≥ k (cid:63) n (cid:32) M c ¯ k n log n n (cid:33) ∧ ≥ c M k (cid:63) ¯ k n log n for some constant c > c , where the second inequalityis due to that ¯ k n log n / n = o ( ) . Hence the posterior probability of U goes tozero if we choose M such that M > max { c / c , c , 1 } . We need the following adaptive version of the moment comparison lemma toestabilish the adaptive rate.
Lemma 6.5 (Proposition 4 of Wu and Yang [47]) . Suppose that ν and ν are sup-ported on a set of r atoms in [ − L , L ] , and each atom is at least ˜ γ away from all but atmost r (cid:48) atoms. Let ζ : = (cid:13)(cid:13)(cid:13) m ( r − ) ( ν ) − m ( r − ) ( ν ) (cid:13)(cid:13)(cid:13) ∞ . Then W ( ν , ν ) ≤ c r (cid:32) r r − ˜ γ r − r (cid:48) − ζ (cid:33) r (cid:48) , (6.6) for some constant c > depending only on L.Proof of Theorem 2.3. To avoid confusion, we denote by ¯ M instead of M the suf-ficiently large constant appearing in (2.8) and we let M be the constant appear-ing in (2.10). If M (cid:101) n ≥ ¯ M ¯ (cid:101) n , the result follows trivially from Theorem 2.2, sowe assume throughout that M (cid:101) n < ¯ M ¯ (cid:101) n . AYESIAN ESTIMATION OF G AUSSIAN MIXTURES P ( n ) ν (cid:63) ∗ Φ (cid:2) Π ( ν / ∈ M k (cid:63) | X n ) (cid:3) = o ( ) by Theorem 2.1 P ( n ) ν (cid:63) ∗ Φ (cid:2) Π ( W ( ν , ν (cid:63) ) ≥ ¯ M ¯ (cid:101) n | X n ) (cid:3) = o ( ) by Theorem 2.2,we will be done with the proof if we can show that P ( n ) ν (cid:63) ∗ Φ (cid:20) Π (cid:16)(cid:8) ν ∈ M k (cid:63) : ¯ M ¯ (cid:101) n > W ( ν , ν (cid:63) ) ≥ M (cid:101) n (cid:9) | X n (cid:17)(cid:21) = o ( ) .Let ν : = ∑ k (cid:63) j = w j δ θ j be the mixing distribution satisfying W ( ν , ν (cid:63) ) ≤ ¯ M ¯ (cid:101) n forthe true mixing distribution ν (cid:63) : = ∑ k (cid:63) j = w (cid:63) j δ θ (cid:63) j . Since ν (cid:63) is k ( γ , ω ) -separated,there is a partition ( S l : l ∈ [ k ]) of [ k (cid:63) ] such that | θ j − θ j (cid:48) | ≥ γ for any j ∈ S l , j (cid:48) ∈ S l (cid:48) and any l , l (cid:48) ∈ [ k ] with l (cid:54) = l (cid:48) and ∑ j ∈ S l w j ≥ ω for any l ∈ [ k ] . Foreach h ∈ [ k (cid:63) ] , let j ∗ h = argmin j ∈ [ k (cid:63) ] | θ j − θ (cid:63) h | . Note that for any l ∈ [ k ] , W ( ν , ν (cid:63) ) ≥ ∑ h ∈ S l w (cid:63) h | θ j ∗ h − θ (cid:63) h | ≥ ω min h ∈ S l | θ j ∗ h − θ (cid:63) h | .We now suppose that the assumption γω > M (cid:48) ¯ (cid:101) n holds with M (cid:48) : = c ¯ M forsome constant c less than 1/2. Thenmin h ∈ S l | θ j ∗ h − θ (cid:63) h | ≤ W ( ν , ν (cid:63) ) / ω ≤ ¯ M ¯ (cid:101) n / ω ≤ c γ .That is, for any l ∈ [ k ] , there is h ∈ S l such that θ (cid:63) h is close to some atomof ν within distance γ / c . Hence the mixing distribution ν is k (( − c ) γ , 0 ) separated. Let S : = (cid:110) θ j : j ∈ [ k (cid:63) ] (cid:111) ∪ (cid:110) θ (cid:63) j : j ∈ [ k (cid:63) ] (cid:111) . Then each element in S is ( − c ) γ away from at least 2 ( k − ) elements in S . Therefore by invokingLemma 6.5 with r = k (cid:63) , r (cid:48) = k (cid:63) − − ( k − ) = ( k (cid:63) − k ) + γ =( − c ) γ , we have for sufficiently large M > (cid:8) ν ∈ M k (cid:63) : M ¯ (cid:101) n > W ( ν , ν (cid:63) ) ≥ M (cid:101) n (cid:9) ⊂ (cid:110) ν ∈ M k (cid:63) : (cid:107) m ( k (cid:63) − ) ( ν ) − m ( k (cid:63) − ) ( ν (cid:63) ) (cid:107) ∞ ≥ (cid:6) M k (cid:63) (cid:7) ζ n (cid:111) ,where M k (cid:63) : = √ k (cid:63) ( √ M k (cid:63) ) k (cid:63) − with M > ζ n : = (cid:113) ¯ k n log n / n . The only remaining part of the proof is to bound the pos-terior probability of the right-hand side of the preceding display, and this isshown in the proof of Theorem 2.2. Proof of Proposition 2.4.
We set γ : = γ ( ν ) and ω : = ω ( ν ) for short. Supposethat ν : = ∑ kj = w j δ θ j ∈ M k satisfies W ( ν , ν ) < c γω . Since ν is k ( γ , ω ) -separated, by the similar argument in the proof of Theorem 2.3, we have thatfor every h ∈ [ k ] , | θ j ∗ h − θ h | ≤ W ( ν , ν ) / ω ≤ c γ ,. O HN AND
L. L IN j ∗ h = argmin j ∈ [ k ] | θ j − θ (cid:63) h | . Thus, ν is k (( − c ) γ , 0 ) sepa-rated. Moreover, since | θ j ∗ h − θ l | ≥ | θ h − θ l | − | θ j ∗ h − θ h | ≥ ( − c ) γ > c γ for any l (cid:54) = h , the indices j ∗ , . . . , j ∗ k are distinct. Thus there is a partition S , . . . , S k of [ k ] such that | θ j − θ j (cid:48) | ≥ ( − c ) γ for any j ∈ S h , j (cid:48) ∈ S h (cid:48) andany h , h (cid:48) ∈ [ k ] with h (cid:54) = h (cid:48) and j ∗ h ∈ S h for any h ∈ [ k ] . Let ( p ∗ jh ) j ∈ [ k ] , h ∈ [ k ] ∈Q (( w j ) j ∈ [ k ] , ( w j ) j ∈ [ k ] ) be the optimal coupling such that W ( ν , ν ) = ∑ kj = ∑ k h = p jh | θ j − θ h | . Then for any h ∈ [ k ] , we have c γω > W ( ν , ν ) ≥ k ∑ j = p ∗ jh | θ j − θ h | = ∑ j ∈ S h p ∗ jh | θ j − θ h | + ∑ j / ∈ S h p ∗ jh | θ j − θ h |≥ + w h − ∑ j ∈ S h p ∗ jh ( − c ) γ ,where the last inequality follows from that | θ j − θ h | ≥ | θ j − θ j ∗ h | − | θ j ∗ h − θ h | ≥ ( − c ) γ − c γ for any j / ∈ S h . Hence, ∑ j ∈ S h w j ≥ ∑ j ∈ S h p ∗ jh ≥ w h − c − c ω ≥ − c − c ω ,which completes the proof. Proof of Theorem 2.6.
Assume that ν : = ∑ kj = w j δ θ j ∈ M k with k < k (cid:63) . Thenthere exists an index h ∗ ∈ [ k (cid:63) ] such that | θ j − θ (cid:63) h ∗ | ≥ min h ∈ [ k (cid:63) ] : h (cid:54) = j ∗ | θ j − θ (cid:63) h | for any j ∈ [ k ] , which implies that2 | θ j − θ (cid:63) h ∗ | ≥ | θ j − θ (cid:63) h ∗ | + min h ∈ [ k (cid:63) ] : h (cid:54) = j ∗ | θ j − θ (cid:63) h |≥ min h , l ∈ [ k (cid:63) ] : h (cid:54) = l | θ (cid:63) l − θ (cid:63) h | .Therefore, for the optimal coupling ( p ∗ jh ) j ∈ [ k ] , h ∈ [ k (cid:63) ] ∈ Q (( w j ) j ∈ [ k ] , ( w (cid:63) j ) j ∈ [ k (cid:63) ] ) ,we have W ( ν , ν (cid:63) ) = k ∑ j = k (cid:63) ∑ h = p ∗ jh | θ j − θ (cid:63) h |≥ k ∑ j = p ∗ jh ∗ | θ j − θ (cid:63) h ∗ |≥ w (cid:63) h ∗ | θ (cid:63) l − θ (cid:63) h | ≥ γω AYESIAN ESTIMATION OF G AUSSIAN MIXTURES γω > M (cid:48) (cid:101) n for some large constant M (cid:48) > { ν ∈ M k } ⊂ (cid:8) ν ∈ M : W ( ν , ν (cid:63) ) ≥ γω /2 (cid:9) ⊂ (cid:110) ν ∈ M : W ( ν , ν (cid:63) ) ≥ M (cid:48) (cid:101) n /2 (cid:111) .The proof is complete by Theorem 2.3. We invoke the following moment comparison lemma for general distributions.
Lemma 6.6.
Let µ , µ ∈ P ([ − L , L ]) and r ∈ N . Then W ( µ , µ ) ≤ c (cid:26) r + + √ r ( c ) r (cid:107) m r ( µ ) − m r ( µ ) (cid:107) ∞ (cid:27) . for some constants c > and c > depending only on L . Proof.
Let µ (cid:48) and µ (cid:48) be distributions supported on [ −
1, 1 ] constructed by scal-ing µ and µ respectively. Then by Lemma 24 of Wu and Yang [47], W ( µ (cid:48) , µ (cid:48) ) ≤ π r + + ( + √ ) r (cid:107) m r ( µ (cid:48) ) − m r ( µ (cid:48) ) (cid:107) .Since W ( µ , µ ) = L W ( µ (cid:48) , µ (cid:48) ) and | m j ( µ ) − m j ( µ ) | = L j | m j ( µ (cid:48) ) − m j ( µ (cid:48) ) | for any j ∈ N , we have W ( µ , µ ) ≤ π Lr + + L ( + √ ) r √ r max ≤ j ≤ r L − j | m j ( µ ) − m j ( µ ) |≤ π Lr + + L √ r (( + √ )( ∨ L − )) r (cid:107) m r ( µ ) − m r ( µ ) (cid:107) ∞ ,which completes the proof. Proof of Theorem 2.7.
Let ˜ ξ n : = n − log n and ξ n : = n − log − n so that ξ n log ( ξ n ) (cid:46) ˜ ξ n . Following the proof of Theorem 2.1, we have D n : = (cid:90) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) Π ( d ν ) ≥ e − n ˜ ξ n Π ( B KL ( ˜ ξ n , ν (cid:63) ∗ Φ , M )) ≥ e − n ˜ ξ n Π (cid:16) ν ∈ M : W ( ν , ν (cid:63) ) ≤ c ξ n (cid:17) with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − n ˜ ξ n for some constant c >
0. Let R : = (cid:108) L / √ c ξ n (cid:109) and B , . . . , B R be a partition of [ − L , L ] such that diam ( B j ) ≤√ c ξ n /2. By Lemma A.3, W ( ν , ν (cid:63) ) ≤ √ c ξ n + L (cid:16) R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | (cid:17) .. O HN AND
L. L IN Π (cid:16) ν ∈ M : W ( ν , ν (cid:63) ) ≤ c ξ n (cid:17) ≥ Π ν ∈ M : R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L ≥ Π ν ∈ M : R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L (cid:12)(cid:12)(cid:12) { k = R } ∩ E × Π ( E | k = R ) Π ( k = R ) ,where E denotes the event such that each B j contains exactly one atom of ν .By (P1 (cid:48) ), − log Π ( k = R ) (cid:38) R (cid:38) n and by (P3), − log Π ( E | k = R ) (cid:38) − R log ( ξ − n ) (cid:38) n log n . By (P2), − log Π ν ∈ M : R ∑ j = | ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L (cid:12)(cid:12)(cid:12) { k = R } ∩ E (cid:38) n log n .Combining the results, we arrive at P ( n ) ν (cid:63) ∗ Φ (cid:16) D n (cid:38) e − c n log n (cid:17) ≥ − n log n for some constant c > k be the positive integer such that ˆ k (cid:16) log n / log log n but 2ˆ k − ≤ log n / log log n . By applying Lemma 6.6 with r = k −
1, if M is sufficientlylarge, we obtain (cid:40) ν ∈ M : W ( ν , ν (cid:63) ) ≥ M log log n log n (cid:41) ⊂ (cid:40) ν ∈ M : (cid:107) m k − ( ν ) − m k − ( ν (cid:63) ) (cid:107) ∞ ≥ M (cid:48) ( ˆ k ) − c − ˆ k log log n log n (cid:41) ⊂ (cid:110) ν ∈ M : (cid:107) m k − ( ν ) − m k − ( ν (cid:63) ) (cid:107) ∞ ≥ M (cid:48) c − ˆ k log − n (cid:111) for some constant M (cid:48) > M and L , and some c > L . Following the proof of Theorem 2.2, it suffices to showthat (cid:16) ˆ k c ˆ k ∨ e c n log n (cid:17) exp (cid:16) − c ( M (cid:48) ) n ˆ k − k + c − k log − n (cid:17) = o ( ) for some constants c , c >
0. Note that ( u ˆ k ) − k + (cid:38) n − for any constant u >
0, thus the preceding display holds clearly.
AYESIAN ESTIMATION OF G AUSSIAN MIXTURES For the fractional posterior, the following oracle inequality holds in general.
Lemma 6.7 (Corollary 3.7 of Bhattacharya et al. [1]) . Let X , . . . , X n iid ∼ G (cid:63) forsome distribution G (cid:63) . Let G be a set of distribution and Π be the prior distribution on G . Then for any ζ ∈ (
0, 1 ) such that n ζ > and α ∈ (
0, 1 ) , we have (cid:90) G ∈G R α ( p G , p G (cid:63) ) Π α ( d G | X n ) ≤ αζ − n log Π ( B KL ( ζ , G , G )) with P ( n ) G (cid:63) -probability at least − n ζ .Proof of Theorem 2.8. Let ζ n : = (cid:113) ¯ k n log n / n . We prove the first assertion. Recallthat Π α ( ν / ∈ M k (cid:63) | X n ) = (cid:82) ν / ∈M k (cid:63) ( p ( n ) ν ∗ Φ ( X n ) / p ( n ) ν (cid:63) ∗ Φ ( X n )) α Π ( d ν ) (cid:82) ( p ( n ) ν ∗ Φ ( X n ) / p ( n ) ν (cid:63) ∗ Φ ( X n )) α Π ( d ν ) .We deal with the numerator and denominator separately. For the denominator,we have the high probability bound (see the proof of Theorem 3.1 of [1]), (cid:90) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≥ e − c n ζ n Π (cid:0) B KL ( ζ n , ν (cid:63) ∗ Φ , M ) (cid:1) with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − c / ( n ζ n ) = − c / ( ¯ k n log n ) , for someconstants c and c depending only on α . Since Π ( B KL ( ζ n , ν (cid:63) ∗ Φ , M )) ≥ exp ( − c k (cid:63) n ζ n ) Π ( k = k (cid:63) ) for some c >
0, we further have (cid:90) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≥ e − ( c + c ) k (cid:63) n ζ n Π ( k = k (cid:63) ) (6.7)with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − c / ( ¯ k n log n ) . For the expectation of thenumerator with respect to P ( n ) ν (cid:63) ∗ Φ , by Fubini’s theorem, we obtain P ( n ) ν (cid:63) ∗ Φ (cid:90) ν / ∈M k (cid:63) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≤ (cid:90) ν / ∈M k (cid:63) (cid:90) n ∏ i = (cid:104) p αν ∗ Φ ( X i ) p − αν (cid:63) ∗ Φ ( X i ) d X i (cid:105) Π ( d ν ) .Since M is convex, (i.e., for any ν , ν ∈ M and t ∈ (
0, 1 ) , there is ¯ ν ∈ M such that p ¯ ν ∗ Φ = ( − t ) p ν ∗ Φ + tp ν ∗ Φ ), we can apply Lemma 2.1 of [1] to. O HN AND
L. L IN < (cid:82) p αν ∗ f ( X i ) p − αν (cid:63) ∗ f ( X i ) d X i ≤
1. Hence, the expectation of numeratoris further bounded by the prior probability Π ( k > k (cid:63) ) . By Markov’s inequalitywe obtain the following high probability bound for the numerator P ( n ) ν (cid:63) ∗ Φ (cid:90) ν / ∈M k (cid:63) p ( n ) ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) α Π ( d ν ) ≥ ( ¯ k n log n ) Π ( k > k (cid:63) ) ≤ k n log n .(6.8)Combining (6.7), (6.8) and Assumption (2.2), we have Π α ( ν / ∈ M k (cid:63) | X n ) (cid:46) e − c ¯ k n log n for some constant c >
0, with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( + c ) / ( ¯ k n log n ) .For the second assertion, we note that the Wasserstein distance between anytwo atomic distributions ν ∈ M and ν ∈ M is bounded by diam ([ − L , L ]) = L , and so (cid:90) W ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:46) (cid:90) ν ∈M k (cid:63) W ( ν , ν (cid:63) ) Π α ( d ν | X n )+ Π α ( ν / ∈ M k (cid:63) | X n ) for any given data X n ∈ R n . We have shown that the second term vanishes atspeed e − c ¯ k n log n . We now focus on the first term. For notational simplicity, welet ρ ( ν , ν (cid:63) ) : = (cid:107) m ( k (cid:63) − ) ( ν ) − m ( k (cid:63) − ) ( ν (cid:63) ) (cid:107) ∞ By Lemma 6.1, (cid:90) ν ∈M k (cid:63) W ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:46) k (cid:63) (cid:90) ν ∈M k (cid:63) ρ ( ν , ν (cid:63) ) k (cid:63) − Π α ( d ν | X n ) ≤ k (cid:63) (cid:34) (cid:90) ν ∈M k (cid:63) ρ ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:35) k (cid:63) − for any given data X n ∈ R n , where the second inequality follows from Jensen’sinequality for concave functions. For any ν ∈ M k (cid:63) , Lemma 6.3 implies that ρ ( ν , ν (cid:63) ) (cid:46) ( c k (cid:63) ) k (cid:63) − (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) for some constant c >
0. Since both p ν ∗ Φ and p ν (cid:63) ∗ Φ are bounded by 1/ √ π ,we have (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) ≤ √ π h ( p ν ∗ Φ , p ν ∗ Φ ) ≤ ( π ) − α ∧ ( − α ) R α ( p ν ∗ Φ , p ν ∗ Φ ) AYESIAN ESTIMATION OF G AUSSIAN MIXTURES (cid:90) ν ∈M k (cid:63) R α ( p ν ∗ Φ , p ν ∗ Φ ) Π α ( d ν | X n ) ≤ (cid:90) R α ( p ν ∗ Φ , p ν ∗ Φ ) Π α ( d ν | X n ) (cid:46) k (cid:63) ¯ k n log nn .with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( ¯ k n log n ) . Combining the derived bounds,we arrive at (cid:90) ν ∈M k (cid:63) W ( ν , ν (cid:63) ) Π α ( d ν | X n ) (cid:46) k (cid:63) (cid:20) (cid:90) ( k (cid:63) ) k (cid:63) − R α ( p ν ∗ Φ , p ν ∗ Φ ) Π α ( d ν | X n ) (cid:21) k (cid:63) − (cid:46) ( k (cid:63) ) k (cid:63) − k (cid:63) − (cid:32) ¯ k n log nn (cid:33) k (cid:63) − ,with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( ¯ k n log n ) , which completes the proof. We introduce the notation F ( · , ν ) : = (cid:90) F ( · , θ ) ν ( d θ ) for a distribution function F ( · , · ) and the mixing distribution ν ∈ M , whichdenotes the distribution function of ν ∗ F . For convenience, we let F be a set ofall distribution functions.A key technical device for the proof is the following relationship betweenthe Kolmogorov distance and the Wasserstein distance. Lemma 6.8 (Theorem 6.3 of Heinrich and Kahn [21]) . Let Θ be a compact subsetof R with nonempty interior. Fix k ∈ N . Suppose that Assumption F(q) is met withq = k.1. There exists a constant c > such that W k − ( ν , ν ) ≤ c (cid:13)(cid:13) F ( · , ν ) − F ( · , ν ) (cid:13)(cid:13) ∞ (6.9) for any ν , ν ∈ M k .2. Let k ∈ [ k ] and let ν ∈ M k \ M k − . There exist constants τ > andc > such that W ( k − k )+ ( ν , ν ) ≤ c (cid:13)(cid:13) F ( · , ν ) − F ( · , ν ) (cid:13)(cid:13) ∞ (6.10) for any ν , ν ∈ M k with W ( ν , ν ) ∨ W ( ν , ν ) < τ . . O HN AND
L. L IN Lemma 6.9 (Lemma 1 of Scricciolo [42]) . Let F (cid:63) be a continuous distributionfunction and P F (cid:63) denote the probability operator with respect to F (cid:63) . Let F be a cset of certain distribution functions. Let { ˜ ζ n } n ∈ N be a positive sequence such that ˜ ζ n (cid:38) (cid:112) log n / n. If the prior distribution on F satisfies Π (cid:16) B KL ( ˜ ζ n , F (cid:63) , F ) (cid:17) (cid:38) exp ( − c n ˜ ζ n ) (6.11) for some constant c > , then P ( n ) F (cid:63) (cid:34) Π (cid:18)(cid:110) F ∈ F : (cid:107) F − F (cid:63) (cid:107) ∞ ≥ M ˜ ζ n (cid:111) | X n (cid:19)(cid:35) = o ( ) for sufficiently large M > Proof of Theorem 3.1.
Let ˜ ζ n : = (cid:112) log n / n . By the first assertion of Lemma 6.8,we have that ν ∈ M ( Θ ) : W ( ν , ν (cid:63) ) ≥ M (cid:18) log nn (cid:19) k (cid:63) − ⊂ (cid:40) ν ∈ M k (cid:63) ( Θ ) : (cid:107) F ( · , ν ) − F ( · , ν (cid:63) ) (cid:107) ∞ ≥ c M (cid:18) log nn (cid:19) (cid:41) ∪ (cid:8) ν / ∈ M k (cid:63) ( Θ ) (cid:9) for some constant c >
0. By the similar argument of Theorem 2.1, it is nothard to prove that the expected posterior probability of the event { ν / ∈ M k (cid:63) } goes to zero. For the first event of the right-hand side of the preceding display,we will apply Lemma 6.9 to conclude the desired result. By (F4), Lemma A.2implies that B KL ( ˜ ζ n , ν (cid:63) ∗ Φ , M ( Θ )) ⊃ (cid:110) ν ∈ M ( Θ ) : h ( p ν ∗ F , p ν (cid:63) ∗ F ) ≤ c ( n log n ) − (cid:111) for some constant c >
0. Furthermore, by (F3), h ( f ( · , θ ) , f ( · , θ )) ≤ (cid:107) f ( · , θ ) − f ( · , θ ) (cid:107) ≤ c | θ − θ | s for any θ , θ ∈ [ − L , L ] for some constant c > h ( p ν ∗ F , p ν ∗ F ) ≤ c W ss ( ν , ν ) for any AYESIAN ESTIMATION OF G AUSSIAN MIXTURES ν , ν ∈ M . Therefore, by Lemma A.3, we obtain Π ( B KL ( ˜ ζ n , ν (cid:63) ∗ F , M ( Θ ))) ≥ Π (cid:16) ν ∈ M k (cid:63) ( Θ ) : W ss ( ν , ν (cid:63) ) ≤ c ( n log n ) − (cid:17) Π ( k = k (cid:63) ) ≥ Π k (cid:63) ∑ j = | w j − w (cid:63) j | ≤ c ( L ) s n log n × Π (cid:32) | θ j − θ (cid:63) j | s ≤ c s n log n , ∀ j ∈ [ k (cid:63) ] (cid:33) Π ( k = k (cid:63) ) (cid:38) e − c log n for some constants c , c >
0, where the last inequality follows from Assump-tion P. Thus the prior concentration condition (6.11) of Lemma 6.9 is fulfilledand the proof is done.
Proof of Theorem 3.2.
Using the similar argument in the proof of Theorem 3.1combined with the second assertion of Lemma 6.8, we obtain the desired result.
Proof of Theorem 4.1. If k (cid:63) > n , the event of interest is empty, so we focus on thecases that k (cid:63) ≤ n . Let ˜ ζ n : = (cid:112) log n / n . As in the proof of Theorem 2.1, we havethat (cid:90) p ( n ) ˜ ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) Π DP ( d ˜ ν ) ≥ e − n ˜ ζ n Π DP (cid:16) B KL ( ˜ ζ n , ν (cid:63) ∗ Φ , M ∞ ) (cid:17) with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − ( n ˜ δ n ) . Let ξ n : = ( n log n ) − so that ξ n log ( ξ n ) (cid:46) ξ n log ( ξ n ) (cid:46) ˜ ζ n . Then since (cid:82) p ν (cid:63) ∗ Φ ( x )( p ν (cid:63) ∗ Φ ( x ) / p ν ∗ Φ ( x )) b d λ ( x ) < ∞ forsome b ∈ (
0, 1 ) by Equation (4.6) of [18], Lemma A.2 implies that Π DP (cid:0) B KL ( ζ n , ν (cid:63) ∗ Φ , M ∞ ) (cid:1) ≥ Π DP (cid:0) ˜ ν ∈ M ∞ : (cid:107) p ˜ ν ∗ Φ − p ν (cid:63) ∗ Φ (cid:107) ≤ c ξ n (cid:1) .for some constant c >
0. Let B , B , . . . , B k (cid:63) be a partition of [ − L , L ] such that θ (cid:63) j ∈ B j , diam ( B j ) = c ξ n /4 for each j ∈ [ k (cid:63) ] (Here we assume without lossof generality that all the atoms of ν (cid:63) does not overlap with each other, other-wise, we can consider a partition where each set contains exactly one distinctatom). Since the vector ( ˜ ν ( B ) , ˜ ν ( B ) , . . . , ˜ ν ( B k (cid:63) )) follows the Dirichlet distri-bution with parameter ( κ n H ( B ) , κ n H ( B ) , . . . , κ n H ( B k (cid:63) )) , by Lemma A.4 anddiam ( B j ) = c ξ n /4 for every j ∈ [ k (cid:63) ] , we have (cid:8) ˜ ν ∈ M ∞ : (cid:107) p ˜ ν ∗ Φ − p ˜ ν (cid:63) ∗ Φ (cid:107) ≤ c ξ n (cid:9) ⊃ ˜ ν ∈ M ∞ : k (cid:63) ∑ j = (cid:12)(cid:12)(cid:12) ˜ ν ( B j ) − w (cid:63) j (cid:12)(cid:12)(cid:12) ≤ c ξ n . O HN AND
L. L IN w (cid:63) j : =
0. Finally, by Lemma A.5, Π DP ˜ ν ∈ M ∞ : k (cid:63) ∑ j = (cid:12)(cid:12)(cid:12) ˜ ν ( B j ) − w (cid:63) j (cid:12)(cid:12)(cid:12) ≤ c ξ n ≥ ( c ξ n /4 ) k (cid:63) κ k (cid:63) + n k ∏ j = H ( B j ) (cid:38) κ k (cid:63) + n exp ( − c k (cid:63) log n ) for some constant c >
0, where the second inequality follows from that H ( B ) = − ∑ k (cid:63) j = H ( B j ) (cid:38) −
1/ log n (cid:38) H ( B j ) = c ξ n / ( L ) for j ∈ [ k (cid:63) ] andlog ( ξ n ) (cid:16) log n .On the other hand, we use Fubini’s theorem to obtain P ( n ) ν (cid:63) ∗ Φ (cid:90) ∑ Z n ∈ N n : T n ( Z n ) > Ck (cid:63) (cid:40) n ∏ i = φ ( X i − θ Z i ) p w [ ˜ ν ] ( Z i ) p ν (cid:63) ∗ Φ ( X i ) (cid:41) Π DP ( d ˜ ν ) = (cid:90) ∑ Z n ∈ N n : T n ( Z n ) > Ck (cid:63) (cid:40) (cid:90) n ∏ i = φ ( X i − θ Z i ) d X n (cid:41) p ( n ) w [ ˜ ν ] ( Z n ) Π DP ( d ˜ ν )= ∑ Z n ∈ N n : T n ( Z n ) > Ck (cid:63) (cid:90) p ( n ) w [ ˜ ν ] ( Z n ) Π DP ( d ˜ ν )= P CRP ( κ n ) ( T n ( Z n ) > Ck (cid:63) ) ,where CRP ( κ n ) denotes the Chinese restaurant process with concentration pa-rameter κ n . It is known that the probability mass function of T n is given by(e.g., see Proposition 4.9 of [17]) P CRP ( κ n ) ( T n = t ) = C n ( t ) n ! κ tn Γ ( κ n ) Γ ( κ n + n ) where C n ( t ) : = ( n ! ) − ∑ S ⊂ [ n − ] : | S | = n − t ∏ i ∈ S i . Since C n ( t + ) C n ( t ) = ∑ S ⊂ [ n − ] : | S | = t ∏ i ∈ S i ∑ S ⊂ [ n − ] : | S | = t − ∏ i ∈ S i ≤ ∑ S ⊂ [ n − ] : | S | = t − ∏ i ∈ S i (cid:16) ∑ n − i (cid:48) = i (cid:48) (cid:17) ∑ S ⊂ [ n − ] : | S | = t − ∏ i ∈ S i ≤ log ( e ( n − )) we have P CRP ( κ n ) ( T n ≥ t + ) (cid:46) n ∑ h = t + ( κ n log n ) h − (cid:46) ( κ n log n ) t .Hence, P ( n ) ν (cid:63) ∗ Φ (cid:2) Π DP ( T n > Ck ∗ | X n ) (cid:3) (cid:46) e c k (cid:63) log n ( κ n log n ) Ck (cid:63) − κ k (cid:63) + n + o ( ) (cid:46) e c k (cid:63) log n e − (( C − ) k (cid:63) − ) log n + o ( ) AYESIAN ESTIMATION OF G AUSSIAN MIXTURES c >
0. If C > c +
3, the desired result follows.
Proof of Theorem 4.2.
Let ˜ ξ n : = n − log n and ξ n : = n − log − n so that ξ n log ( ξ n ) (cid:46) ˜ ξ n . By the same arguments used in the proof of Theorem 2.7,we have that D n : = (cid:90) p ( n ) ˜ ν ∗ Φ p ( n ) ν (cid:63) ∗ Φ ( X n ) Π DP ( d ˜ ν ) ≥ e − n ˜ ξ n Π DP ˜ ν ∈ M ∞ : R ∑ j = | ˜ ν ( B j ) − ν (cid:63) ( B j ) | ≤ c ξ n L with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − n ˜ ξ n , where we define R : = (cid:108) L / √ c ξ n (cid:109) and ( B , . . . , B R ) is a partition of [ − L , L ] such that diam ( B j ) ≤ √ c ξ n /2 forsome c >
0. Since ( ˜ ν ( B ) , . . . , ˜ ν ( B R )) follows the Dirichlet distribution withparameter ( κ n H ( B ) , . . . , κ n H ( B R )) , by Lemma A.5, D n is further bounded as D n (cid:38) e − n ˜ ξ n ξ Rn ( κ n ξ n ) R (cid:38) e − c n log + a n .with P ( n ) ν (cid:63) ∗ Φ -probability at least 1 − n ˜ ξ n . Following the proof of Theorem 2.7,we obtain the desired result. A Appendix: Additional lemmas and proofs
A.1 Technical lemmas
The following three lemmas provide inequalities that are useful throughout theproofs.
Lemma A.1 (Lemma 1 of Nguyen [37]) . Let f : R (cid:55)→ R be a convex function suchthat f ( ) = and let f ( · , θ ) denote a probability density function with parameter θ . For two atomic meassures ν : = ∑ k j = w j δ θ j ∈ M k and ν : = ∑ k j = w j δ θ j ∈M k , define W ψ , f ( ν , ν ) : = inf ( p jh ) ∈Q ( w , w ) k ∑ j = k ∑ h = p jh D f (cid:16) f ( · , θ j ) , f ( · , θ j ) (cid:17) , where w : = ( w , . . . , w k ) and w : = ( w , . . . , w k ) . Then D f k ∑ j = w j f ( · , θ j ) , k ∑ j = w j f ( · , θ j ) ≤ W ψ , f ( ν , ν ) .. O HN AND
L. L IN In particular, for the standard normal density function φ , we have h k ∑ j = w j φ ( · − θ j ) , k ∑ j = w j φ ( · − θ j ) ≤ W ( ν , ν ) , KL k ∑ j = w j φ ( · − θ j ) , k ∑ j = w j φ ( · − θ j ) ≤ W ( ν , ν ) . Lemma A.2 (Theorem 5 of Wong and Shen [46]) . Let ζ > be sufficiently small.For two density functions p and p such that h ( p , p ) ≤ ζ andC ζ : = (cid:90) p ( x ) (cid:32) p ( x ) p ( x ) (cid:33) b λ ( d x ) < ∞ for some b ∈ (
0, 1 ] , we have KL ( p , p ) ≤ c ζ ∨ log (cid:32) C ζ ζ (cid:33) , KL ( p , p ) ≤ c ζ ∨ log (cid:32) C ζ ζ (cid:33) for some constants c , c > Lemma A.3 (Lemma 3 of Gao and van der Vaart [14]) . For any µ , µ ∈ P ( Θ ) ,any countably many partition ( B j ) j ∈ N of Θ and any q ≥ , we have W q ( µ , µ ) ≤ sup j ∈ N diam ( B j ) + diam ( Θ ) (cid:16) ∞ ∑ j = | µ ( B j ) − µ ( B j ) | (cid:17) q . In particular, for any ν , ν ∈ M k ( Θ ) with ν : = ∑ kj = w ij δ θ j and ν : = ∑ kj = w j δ θ j ,and any k ∈ N , we have W q ( ν , ν ) ≤ sup j ∈ [ k ] | θ j − θ j | + diam ( Θ ) (cid:16) ∞ ∑ j = | w j − w j | (cid:17) q . A.2 Proofs of Lemmas 6.2 and 6.3
Proof of Lemma 6.2.
Recall that N : = (cid:6) log ( k / η (cid:101) ) (cid:7) ∧ n where η (cid:101) , which is theconstant depending only on (cid:101) , will be specified later. For simplicity we dropthe subscript (cid:101) of η (cid:101) . Let n l : = |X l | then n l is either (cid:98) n / N (cid:99) or (cid:98) n / N (cid:99) +
1. Bythe variance bound presented in Lemma 5 of [47], we have
Var ( M ( η ) l , h ) ≤ n l (cid:16) c (cid:48) ( L + √ h ) (cid:17) h , AYESIAN ESTIMATION OF G AUSSIAN MIXTURES l ∈ [ N ] and any h ∈ [ k − ] , for some universal constant c (cid:48) >
0. Thenby the Chebyshev inequality, the expectation of the random variable definedas Z l , h : = (cid:12)(cid:12)(cid:12) M ( η ) l , h − m h ( ν ) (cid:12)(cid:12)(cid:12) < (cid:115) n l (cid:16) c (cid:48) ( L + √ h ) (cid:17) h is bounded below by P l , h : = P ( n l ) ν ∗ Φ Z l , h ≥
34 , (A.1)for any l ∈ [ N ] and any h ∈ [ k − ] . Now we use the well-known mediantrick. By definition of median and the fact that n l ≥ (cid:98) n / N (cid:99) ≥ n / ( N ) , wehave that (cid:32)(cid:13)(cid:13)(cid:13) ˆ m ( η (cid:101) ) h − m h ( ν ) (cid:13)(cid:13)(cid:13) ∞ ≥ (cid:114) N n (cid:16) c (cid:48) ( L + √ h ) (cid:17) h (cid:33) ≤ (cid:32) N ∑ l = Z l , h ≤ N (cid:33) . (A.2)By Hoeffding’s inequality, the probability of the right-hand side of the preced-ing display is bounded as P ( n ) ν ∗ Φ (cid:32) N ∑ l = Z l , h ≤ N (cid:33) ≤ P ( n ) ν ∗ Φ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N ∑ l = Z l , h − N ∑ l = P l , h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ N ≤ e − N /8 , (A.3)where the first inequality is due to (A.1). Since √ ( c (cid:48) ( L + √ h )) h ≤ (cid:113) c (cid:48) k k − for any h ∈ [ k − ] for some universal constant c (cid:48) > L ,(A.2), (A.3) and the union bound imply P ( n ) ν ∗ Φ (cid:32)(cid:13)(cid:13)(cid:13) ˆ m ( η (cid:101) ) ( k − ) − m ( k − ) ( ν ) (cid:13)(cid:13)(cid:13) ∞ ≥ (cid:114) Nn (cid:18)(cid:113) c (cid:48) k (cid:19) k − (cid:33) ≤ ( k ) e − N /8 .Let η : = ( k ) exp ( − ( c (cid:48) k ) k − n (cid:101) ) , then N : = (cid:4) log ( k / η ) (cid:5) ∧ n ≤ ( c (cid:48) k ) k − n (cid:101) and so (cid:101) ≥ √ N / n ( c (cid:48) k ) k − . Thus, by noticing that N ≥ (cid:110) (( c (cid:48) k ) − k + n (cid:101) ) ∧ n (cid:111) −
1, we get the desired result.
Proof of Lemma 6.3.
We write ν : = ∑ kj = w j δ θ j and ν : = ∑ kj = w j δ θ j . Let H j ( z ) , j ∈ N be the Hermite polynomials defined by the generating functione xt − t = ∞ ∑ j = H j ( x ) t j j ! .Then we have the identity φ ( x − t ) = ( π ) (cid:113) φ ( √ x ) e xt − t = ( π ) (cid:113) φ ( √ x ) ∞ ∑ j = H j ( √ x ) ( t / √ ) j j ! e − t . O HN AND
L. L IN p ν ∗ Φ ( x ) = (cid:90) φ ( x − t ) d ν ( t ) = ( π ) (cid:113) φ ( √ x ) ∞ ∑ j = H j ( √ x ) j /2 j ! E (cid:16) T j e − T (cid:17) for any mixing distribution ν ∈ P ( R ) , where T is the random variable suchthat T ∼ ν . By the orthogonality of the Hermite polynomials, we have √ (cid:90) H l ( √ x ) H j ( √ x ) φ ( √ x ) d x = (cid:90) H l ( x ) H j ( x ) φ ( x ) d x = j ! ( l = j ) .Hence, (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) = ∞ ∑ j = j !2 j √ π (cid:26) E (cid:16) T j e − T (cid:17) − E (cid:16) T j e − T (cid:17)(cid:27) ,where T and T are random variables such that T ∼ ν and T ∼ ν . Thepreceding display implies (cid:12)(cid:12)(cid:12) E (cid:16) T j e − T (cid:17) − E (cid:16) T j e − T (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:113) j !2 j √ π (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) .For each j ∈ [ k − ] , we let P j ( x ) = ∑ k − h = a j , h x h be the unique polynomial ofdegree ( k − ) that interpolates the 2 k points (( θ il , θ jil e − θ il /4 )) i ∈{ } ; l ∈ [ k ] . Weassume all the atoms of ν and ν are distinct, otherwise, we can consider theinterpolation polynomial of degree r , where r < k − (cid:12)(cid:12)(cid:12) E ( T j ) − E ( T j ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) E (cid:16) P j ( T ) e − T (cid:17) − E (cid:16) P j ( T ) e − T (cid:17)(cid:12)(cid:12)(cid:12) ≤ k − ∑ h = | a j , h | (cid:12)(cid:12)(cid:12) E (cid:16) T j e − T (cid:17) − E (cid:16) T j e − T (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:107) p ν ∗ Φ − p ν ∗ Φ (cid:107) k − ∑ h = | a j , h | (cid:113) j !2 j √ π .It remains to bound the coefficients ( a j , h ) j ∈ [ k − ] ; h ∈ [ k − ] ∪{ } . Let x k ∗ ( i − )+ l − : = θ il and y j , k ∗ ( i − )+ l − : = θ jil e − θ il /4 for i =
1, 2 , l ∈ [ k ] and j ∈ [ k − ] . Then wecan express P j in the Newton form such that P j ( x ) = k − ∑ h = b j [ x , . . . , x h ] h − ∏ l = ( x − x l ) , AYESIAN ESTIMATION OF G AUSSIAN MIXTURES b j is defined recursively as b j [ x h ] : = y j , h b j [ x h , . . . , x h + l ] : = b j [ x h + , . . . , x h + l ] − b j [ x h , . . . , x h + l − ] x h + l − x h .Since the derivatives of all orders of the function x (cid:55)→ x j e − x /4 are uniformlybounded on [ − L , L ] , ( b j [ z , . . . , z h ]) h = k − are uniformly bounded too. Hence,since | x h | ≤ L for every h ∈ [ k − ] ∪ { } , ( a j , h ) j ∈ [ k − ] ; th ∈ [ k − ] ∪{ } are boundedby ( k − ) c k − for some universal constant c >
0. Thus the desired resultfollows from the bound j ! ≤ j j . A.3 Lemmas for the proofs for Section 4
Lemma A.4.
Let B , B , . . . , B k be a measurable partition of a compact set Θ ⊂ R , ( w , . . . , w k ) ∈ ∆ k and θ j ∈ B j for j ∈ [ k ] . Let ν : = ∑ kj = w j δ θ j . Then for anydistribution µ ∈ P ( Θ ) , (cid:13)(cid:13)(cid:13) p µ ∗ Φ − p ν ∗ Φ (cid:13)(cid:13)(cid:13) ≤ ≤ j ≤ k diam ( B j ) + k ∑ j = | µ ( B j ) − w j | , with w : = .Proof. We start with the decomposition p µ ∗ Φ − p ν ∗ Φ = (cid:90) U φ ( x − θ ) d µ ( θ ) + k ∑ j = (cid:90) U j (cid:110) φ ( x − θ ) − φ ( x − θ j ) (cid:111) d µ ( θ )+ k ∑ j = φ ( x − θ j ) (cid:110) µ ( B j ) − w j (cid:111) .Since (cid:107) φ ( · − θ ) − φ ( · − θ j ) (cid:107) ≤ | θ − θ j | and (cid:107) φ (cid:107) =
1, the desired result fol-lows.
Lemma A.5.
Let ( w , . . . , w k ) be distributed according to the Dirichlet distributionwith parameter ( κ , . . . , κ k ) such that κ j ∈ (
0, 1 ] for any j ∈ [ k ] . Then for any ( w , . . . , w k ) ∈ ∆ k and any η ∈ (
0, 1/ k ] , P k ∑ j = | w j − w j | ≤ η ≥ η ( k − ) k ∏ j = κ j . Proof.
Without loss of generality, assume w k ≥ k . Then for ( w , . . . , w k ) suchthat | w j − w j | < η / k , we have k − ∑ j = w j ≤ − w k + ( k − ) η k ≤ ( + η ) k − k ≤
1. O
HN AND
L. L IN η ≤ k . This implies that ( w , . . . , w k ) ∈ ∆ k . Moreover, ∑ kj = | w j − w j | ≤ ∑ k − j = | w j − w j | < η . Thus, P k ∑ j = | w j − w j | ≤ η ≥ P (cid:18) | w j − w j | ≤ η k , j ∈ [ k − ] (cid:19) ≥ Γ ( ∑ kj = κ k ) ∏ kj = Γ ( κ j ) k − ∏ j = (cid:90) ( w j + η / k ) ∧ ( w j − η / k ) ∨ w κ j − j d w j ,where the second inequality follows from the fact that ( − ∑ k − j = w j ) κ k − ≥ κ k ≥
1. Since 1 ≤ Γ ( κ ) ≤ κ for any κ ∈ (
0, 1 ] , w κ j − j ≥ ( w j + η / k ) ∧ − ( w j − η / k ) ∨ ≥ η / k , we further have that P k ∑ j = | w j − w j | < η ≥ (cid:18) η k (cid:19) k − k ∏ j = κ j ≥ η ( k − ) k ∏ j = κ j ,which completes the proof. Acknowledgement
We would like to thank Minwoo Chae for very useful comments and discus-sions. We acknowledge the generous support of NSF grant DMS CAREER1654579.
References [1] Bhattacharya, A., Pati, D., and Yang, Y. (2019). Bayesian fractional posteri-ors.
The Annals of Statistics , 47(1):39–66.[2] Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixturemodel for clustering with the integrated completed likelihood.
IEEE Trans-actions on Pattern Analysis and Machine Intelligence , 22(7):719–725.[3] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation.
Journal of Machine Learning Research , 3(Jan):993–1022.[4] Cao, X., Khare, K., and Ghosh, M. (2019). Posterior graph selection and esti-mation consistency for high-dimensional Bayesian DAG models.
The Annalsof Statistics , 47(1):319–348.[5] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack:Posterior concentration for possibly sparse sequences.
The Annals of Statis-tics , 40(4):2069–2101.
AYESIAN ESTIMATION OF G AUSSIAN MIXTURES
The Annals of Statistics , 36(2):938–962.[7] Chen, J. (1995). Optimal rate of convergence for finite mixture models.
TheAnnals of Statistics , 23(1):221–233.[8] Drton, M. and Plummer, M. (2017). A Bayesian information criterion forsingular models.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 79(2):323–380.[9] Eghbal-zadeh, H., Zellinger, W., and Widmer, G. (2019). Mixture densitygenerative adversarial networks. In
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pages 5820–5829.[10] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric prob-lems.
The Annals of Statistics , 1(2):209–230.[11] Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminantanalysis, and density estimation.
Journal of the American statistical Association ,97(458):611–631.[12] Fruhwirth-Schnatter, S., Celeux, G., and Robert, C. P. (2019).
Handbook ofmixture analysis . CRC Press.[13] Gao, C. and Zhou, H. H. (2016). Rate exact Bayesian adaptation withmodified block priors.
The Annals of Statistics , 44(1):318–345.[14] Gao, F. and van der Vaart, A. (2016). Posterior contraction rates for de-convolution of Dirichlet-Laplace mixtures.
Electronic Journal of Statistics ,10(1):608–627.[15] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergencerates of posterior distributions.
The Annals of Statistics , 28(2):500–531.[16] Ghosal, S. and van der Vaart, A. (2007). Posterior convergence rates ofDirichlet mixtures at smooth densities.
The Annals of Statistics , 35(2):697–723.[17] Ghosal, S. and van der Vaart, A. (2017).
Fundamentals of nonparametricBayesian inference , volume 44. Cambridge University Press.[18] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of conver-gence for maximum likelihood and Bayes estimation for mixtures of normaldensities.
The Annals of Statistics , 29(5):1233–1263.[19] Gr ¨unwald, P. and Van Ommen, T. (2017). Inconsistency of Bayesian infer-ence for misspecified linear models, and a proposal for repairing it.
BayesianAnalysis , 12(4):1069–1103.. O
HN AND
L. L IN arXiv preprintarXiv:1901.05078 .[21] Heinrich, P. and Kahn, J. (2018). Strong identifiability and optimal mini-max rates for finite mixture estimation. The Annals of Statistics , 46(6A):2844–2870.[22] Ho, N. and Nguyen, X. (2016). On strong identifiability and convergencerates of parameter estimation in finite mixtures.
Electronic Journal of Statistics ,10(1):271–307.[23] Ho, N., Nguyen, X., and Ritov, Y. (2020). Robust estimation of mixingmeasures in finite mixture models.
Bernoulli , 26(2):828–857.[24] Hoffmann, M., Rousseau, J., and Schmidt-Hieber, J. (2015). On adaptiveposterior concentration rates.
The Annals of Statistics , 43(5):2259–2295.[25] Keribin, C. (2000). Consistent estimation of the order of mixture models.
Sankhy¯a: The Indian Journal of Statistics, Series A , pages 49–66.[26] Kruijer, W., Rousseau, J., and van der Vaart, A. (2010). Adaptive Bayesiandensity estimation with location-scale mixtures.
Electronic Journal of Statis-tics , 4:1225–1257.[27] Lee, K., Lee, J., and Lin, L. (2019). Minimax posterior convergence ratesand model selection consistency in high-dimensional DAG models based onsparse Cholesky factors.
The Annals of Statistics , 47(6):3413–3437.[28] Martin, R. (2012). Convergence rate for predictive recursion estimation offinite mixtures.
Statistics & Probability Letters , 82(2):378–384.[29] Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes pos-terior concentration in sparse high-dimensional linear models.
Bernoulli ,23(3):1822–1847.[30] McLachlan, G. J., Lee, S. X., and Rathnayake, S. I. (2019). Finite mixturemodels.
Annual Review of Statistics and its Application , 6:355–378.[31] Miller, J. W. and Dunson, D. B. (2019). Robust Bayesian inference via coars-ening.
Journal of the American Statistical Association , 114(527):1113–1125.[32] Miller, J. W. and Harrison, M. T. (2013). A simple example of dirichletprocess mixture inconsistency for the number of components. In
Advancesin Neural Information Processing Systems , pages 199–206.[33] Miller, J. W. and Harrison, M. T. (2014). Inconsistency of pitman-yor pro-cess mixtures for the number of components.
The Journal of Machine LearningResearch , 15(1):3333–3370.
AYESIAN ESTIMATION OF G AUSSIAN MIXTURES
Journal of the American Statistical Association ,113(521):340–356.[35] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet processmixture models.
Journal of Computational and Graphical Statistics , 9(2):249–265.[36] Newton, M. A. (2002). On a nonparametric recursive estimator of themixing distribution.
Sankhy¯a: The Indian Journal of Statistics, Series A , pages306–322.[37] Nguyen, X. (2013). Convergence of latent mixing measures in finite andinfinite mixture models.
The Annals of Statistics , 41(1):370–400.[38] Nobile, A. and Fearnside, A. T. (2007). Bayesian finite mixtures with anunknown number of components: The allocation sampler.
Statistics andComputing , 17(2):147–162.[39] Richardson, E. and Weiss, Y. (2018). On GANs and GMMs. In
Advances inNeural Information Processing Systems , pages 5847–5858.[40] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtureswith an unknown number of components (with discussion).
Journal of theRoyal Statistical Society: Series B (Statistical Methodology) , 59(4):731–792.[41] Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the pos-terior distribution in overfitted mixture models.
Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) , 73(5):689–710.[42] Scricciolo, C. (2017). Bayesian Kantorovich deconvolution in finite mix-ture models. In
Convegno della Societ`a Italiana di Statistica , pages 119–134.Springer.[43] Sethuraman, J. (1994). A constructive definition of Dirichlet priors.
Statis-tica Sinica , 4:639–650.[44] Stephens, M. (2000). Bayesian analysis of mixture models with an un-known number of componentsan alternative to reversible jump methods.
The Annals of Statistics , 28(1):40–74.[45] Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009). Consistency of a recursiveestimate of mixing distributions.
The Annals of Statistics , 37(5A):2502–2522.[46] Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ra-tios and convergence rates of sieve MLEs.
The Annals of Statistics , 23(2):339–362.[47] Wu, Y. and Yang, P. (2018). Optimal estimation of Gaussian mixtures viadenoised method of moments. arXiv preprint arXiv:1807.07237arXiv preprint arXiv:1807.07237