Jeffreys priors for mixture estimation: properties and alternatives
Clara Grazian∗ and Christian P. Robert†

Abstract.
While Jeffreys priors usually are well-defined for the parameters of mixtures of distributions, they are not available in closed form. Furthermore, they often are improper priors. Hence, they have never been used to draw inference on the mixture parameters. The implementation and the properties of Jeffreys priors in several mixture settings are studied here. It is shown that the associated posterior distributions most often are improper. Nevertheless, the Jeffreys prior for the mixture weights, conditionally on the parameters of the mixture components, will be shown to be conservative with respect to the number of components in the case of overfitted mixtures, and it can therefore be used as a default prior in this context.
Key words and phrases:
Noninformative prior, mixture of distributions, Bayesian analysis, Dirichlet prior, improper prior, improper posterior, label switching.
1. INTRODUCTION
Bayesian inference in mixtures of distributions has been studied quite extensively in the literature. See, e.g., MacLachlan and Peel (2000) and Frühwirth-Schnatter (2006) for book-long references and Lee et al. (2009) for one among many surveys. From a Bayesian perspective, one of the several difficulties with this type of distribution,

(1)   \sum_{\ell=1}^{k} p_\ell f_\ell(x \mid \theta_\ell), \qquad \sum_{\ell=1}^{k} p_\ell = 1,

is that its ill-defined nature (non-identifiability, multimodality, unbounded likelihood, etc.) leads to restrictive prior modelling, since most improper priors are not acceptable. This is due in particular to the feature that a sample from (1) may contain no subset from one of the k components f_\ell(\cdot \mid \theta_\ell) (see, e.g., Titterington et al., 1985). Although the probability of such an event decreases quickly to zero as the sample size grows, it nonetheless prevents the use of independent improper priors, unless such events are prohibited (Diebolt and Robert, 1994). Similarly, the exchangeable nature of the components often induces both multimodality in the posterior distribution and convergence difficulties, as exemplified by the label switching phenomenon that is now quite well-documented (Celeux et al., 2000; Stephens, 2000; Jasra et al., 2005; Frühwirth-Schnatter, 2006; Geweke, 2007; Puolamäki and Kaski, 2009). This feature is characterized by a lack of symmetry in the outcome of a Markov chain Monte Carlo (MCMC) algorithm, in that the posterior density is exchangeable in the components of the mixture but the MCMC sample does not exhibit this symmetry.

∗Corresponding Author: Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Microbiology Department, Headley Way, Oxford, OX3 9DU, United Kingdom. E-mail: [email protected]
†CEREMADE, Université Paris-Dauphine, University of Warwick and CREST, Paris. E-mail: [email protected]
In addition, most MCMC samplers do not concentrate around a single mode of the posterior density, partly exploring several modes, which makes the construction of Bayes estimators of the components much harder.

When specifying a prior over the parameters of (1), it is therefore quite delicate to produce a manageable and sensible non-informative version, and some have argued against using non-informative priors in this setting (for example, MacLachlan and Peel (2000) argue that it is impossible to obtain proper posterior distributions from fully noninformative priors), on the basis that mixture models are ill-defined objects that require informative priors to give a meaning to the notion of a component of (1). For instance, the distance between two components needs to be bounded from below to avoid repeating the same component indefinitely. Alternatively, the components all need to be informed by the data, as exemplified in Diebolt and Robert (1994), who imposed a completion scheme (i.e., a joint model on both parameters and latent variables) such that all components were allocated at least two observations, thereby ensuring that the (truncated) posterior was well-defined. Wasserman (2000) proved ten years later that this truncation led to consistent estimators and moreover that only this type of priors could produce consistency. While the constraint on the allocations is not fully compatible with the i.i.d. representation of a mixture model, it naturally expresses a modelling requirement that all components have a meaning in terms of the data, namely that all components genuinely contributed to generating a part of the data.
This translates as a form of weak prior information on how much one trusts the model and how meaningful each component is on its own (as opposed to the possibility of adding meaningless artificial extra components with almost zero weights or almost identical parameters).

While we do not seek Jeffreys priors as the ultimate prior modelling for non-informative settings, being altogether convinced of the lack of unique reference priors (Robert, 2001a; Robert et al., 2009), we think it is nonetheless worthwhile to study the performances of those priors in the setting of mixtures, in order to determine whether they can indeed provide a version of reference priors and whether they are at least well-defined in such settings. We will show that only in very specific situations does the Jeffreys prior provide reasonable inference.

In Section 2 we provide a formal characterisation of properness of the posterior distribution for the parameters of a mixture model, in particular with Gaussian components, when a Jeffreys prior is used for them. In Section 3 we analyze the properness of the Jeffreys prior and of the related posterior distribution: only when the weights of the components (which are defined on a compact space) are the only unknown parameters does it turn out that the Jeffreys prior (and hence the corresponding posterior) is proper; on the other hand, when the other parameters are unknown, the Jeffreys prior will be proved to be improper, and in only one situation does it provide a proper posterior distribution. In Section 4 we present a way to carry out a noninformative analysis of mixture models; in particular, we propose to use the Jeffreys prior as a default prior in the case of overfitted mixtures and introduce improper priors for at least some parameters. The default proposal of Section 4 will be tested on several simulation studies in Section 5 and several real examples in Section 6, on both well-known datasets in the mixture literature and a new dataset. Section 7 concludes the paper.
2. JEFFREYS PRIORS FOR MIXTURE MODELS
We recall that the Jeffreys prior was introduced by Jeffreys (1939) as a default prior based on the Fisher information matrix

(2)   \pi^J(\theta) \propto |I(\theta)|^{1/2} = \left| -\mathbb{E}\left[ \frac{\partial^2}{\partial\theta\,\partial\theta^T} \log g(X;\theta) \right] \right|^{1/2},

whenever the latter is well-defined; I(\cdot) stands for the expected Fisher information matrix and the symbol |\cdot| denotes the determinant. Although the prior is endowed with some frequentist properties like matching and asymptotic minimal information (Robert, 2001a, Chapter 3), it does not constitute the ultimate answer to the selection of prior distributions in non-informative settings and there exist many alternatives, such as reference priors (Berger et al., 2009), maximum entropy priors (Rissanen, 2012), matching priors (Ghosh et al., 1995), and other proposals (Kass and Wasserman, 1996). In most settings Jeffreys priors are improper, which may explain their conspicuous absence in the domain of mixture estimation, since the latter prohibits the use of independent improper priors by allowing any subset of components to go "empty" with positive probability. That is, the likelihood of a mixture model can always be decomposed as a sum over all possible partitions of the data into at most k groups, where k is the number of components of the mixture. This means that there are terms in this sum where no observation from the sample brings any amount of information about the parameters of a specific component.

Approximations of the Jeffreys prior in the setting of mixtures can be found, e.g., in Figueiredo and Jain (2002), where the authors revert to independent Jeffreys priors on the components of the mixture. This induces the same negative side-effect as with other independent priors, namely an impossibility to handle improper priors. Rubio and Steel (2014) provides a closed-form expression for the Jeffreys prior for a location-scale mixture with two components.
The family of distributions considered in Rubio and Steel (2014) is

\frac{2\epsilon}{\sigma_1} f\!\left(\frac{x-\mu}{\sigma_1}\right) I_{x<\mu} + \frac{2(1-\epsilon)}{\sigma_2} f\!\left(\frac{x-\mu}{\sigma_2}\right) I_{x>\mu}

(which thus hardly qualifies as a mixture, due to the orthogonality in the supports of both components, which allows one to identify from which component each observation arose). The factor 2 in the fractions is due to the assumption of symmetry around zero for the density f. For this specific model, if we impose that the weight \epsilon is a function of the variance parameters, \epsilon = \sigma_1/(\sigma_1+\sigma_2), the Jeffreys prior is given by

\pi(\mu, \sigma_1, \sigma_2) \propto 1/\{\sigma_1\sigma_2(\sigma_1+\sigma_2)\}.

However, in this setting, Rubio and Steel (2014) demonstrates that the posterior associated with the (regular) Jeffreys prior is improper, hence not relevant for conducting inference. Rubio and Steel (2014) also considers alternatives to the genuine Jeffreys prior, either by reducing the range or even the number of parameters, or by building a product of conditional priors. They further consider so-called non-objective priors that are only relevant to the specific case of the above mixture.

Another obvious explanation for the absence of Jeffreys priors is computational, namely that the closed-form derivation of the Fisher information matrix is analytically unavailable. The reason is that the generic [j,h]-th element, with j, h \in \{1, \dots, k\}, of the Fisher information matrix for mixture models is an integral of the form

(3)   -\int_{\mathcal{X}} \frac{\partial^2 \log\left[\sum_{\ell=1}^{k} p_\ell f_\ell(x \mid \theta_\ell)\right]}{\partial\theta_j\,\partial\theta_h} \left[\sum_{\ell=1}^{k} p_\ell f_\ell(x \mid \theta_\ell)\right] dx

(in the special case of component densities with a univariate parameter), which cannot be computed analytically. Since these are unidimensional integrals, we derive an approximation of the elements of the Fisher information matrix based on Riemann sums.
The resulting computational expense is of order O(b²) if b is the total number of (independent) parameters. Since the elements of the information matrix usually are ratios between the component densities and the mixture density, there may be difficulties with non-probabilistic methods of integration.
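As a concrete, purely illustrative sketch of this Riemann-sum strategy, the snippet below approximates the single Fisher information element I(p) for a two-component Gaussian mixture whose component parameters are known; the numerical values of the means and standard deviations are arbitrary choices for illustration, not values taken from the paper.

```python
import math

def norm_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fisher_weight(p, mu=(-1.0, 2.0), sigma=(1.0, 0.5), lo=-10.0, hi=10.0, m=2000):
    """Midpoint Riemann-sum approximation of
    I(p) = \\int (f1 - f2)^2 / (p f1 + (1-p) f2) dx
    for a two-component mixture with known (hypothetical) components."""
    h = (hi - lo) / m
    total = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * h
        f1 = norm_pdf(x, mu[0], sigma[0])
        f2 = norm_pdf(x, mu[1], sigma[1])
        total += (f1 - f2) ** 2 / (p * f1 + (1.0 - p) * f2) * h
    return total

# Unnormalized Jeffreys prior on the weight: pi^J(p) is proportional to sqrt(I(p)).
prior = {p: math.sqrt(fisher_weight(p)) for p in (0.1, 0.5, 0.9)}
```

Consistently with the convexity of the weight prior recalled in Section 3, the approximation takes larger values near the boundary of the unit interval than in its middle.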
3. CHARACTERIZATION OF THE JEFFREYS PRIORS FOR MIXTURE MODELS AND RESPECTIVE POSTERIORS
Unsurprisingly, most Jeffreys priors associated with mixture models are improper, the exception being when only the weights of the mixture are unknown, as already demonstrated in Bernardo and Girón (1988).

We will characterize properness and improperness of Jeffreys priors and derived posteriors when some or all of the parameters of distributions from location-scale families are unknown. These results are analytically established; the behavior of the Jeffreys prior and of the derived posterior has also been studied through simulations, with sufficiently large Monte Carlo experiments (see Section 5). The following results are often presented for Gaussian mixture models; however, the Jeffreys prior behaves similarly for all location-scale families, and the results may therefore be generalized to any location-scale family.
A representation of the Jeffreys prior and of the derived posterior distribution for the weights of a three-component mixture model is given in Figure 1: the prior distribution is much more concentrated around extreme values of the support, i.e., it is conservative with respect to the number of important components.
Lemma 3.1. When the weights p_i are the only unknown parameters in (1), the corresponding Jeffreys prior is proper.

Fig 1. Approximations (on a grid of values) of the Jeffreys prior (on the log-scale) when only the weights of a Gaussian mixture model with 3 components are unknown (top) and of the derived posterior distribution, with known means equal to -1, 0 and 2 and known standard deviations equal to 1, 5 and 0.5, respectively. The red cross represents the true values.
Proof.
The generic element of the Fisher information matrix I(p) of the mixture model (1) when the weights are the only unknown parameters is (for j, h = 1, \dots, k-1)

(4)   \int_{\mathcal{X}} \frac{(f_j(x) - f_k(x))(f_h(x) - f_k(x))}{\sum_{\ell=1}^{k} p_\ell f_\ell(x)} \, dx

when we consider the parametrization in (p_1, \dots, p_{k-1}), with p_k = 1 - p_1 - \dots - p_{k-1}.

Consider now a data-augmented model, where a latent variable describing the allocation of each observation to a particular component is introduced. In other words, a latent variable z_i is considered such that z_i has z_{i\ell} = 1 in the \ell-th position of the vector, and 0 elsewhere, if x_i has been generated from the \ell-th component, for i = 1, \dots, n, where n is the sample size and \ell = 1, \dots, k. Therefore, z = (z_1, \dots, z_n) is a multinomial variable with k possible outcomes such that

(5)   g(x, z \mid \theta, p) = g(x \mid z, \theta, p)\, g(z \mid \theta, p) = \prod_{i=1}^{n} g(x_i \mid z_i, \theta, p)\, g(z_i \mid \theta, p) = \prod_{i=1}^{n} \prod_{\ell=1}^{k} \left[ f_\ell(x_i \mid \theta_\ell)\, p_\ell \right]^{I[z_{i,\ell}=1]} = \prod_{\ell=1}^{k} \prod_{i: z_{i,\ell}=1} f_\ell(x_i \mid \theta_\ell) \left[ \prod_{\ell=1}^{k} p_\ell^{n_\ell} \right],

where I[z_{i,\ell}=1] is the indicator function of z_{i,\ell} = 1 and n_\ell is the number of allocations to the \ell-th component. For an extensive review of the techniques of data augmentation in the case of mixture models one may refer to Frühwirth-Schnatter (2006). Equation (5) shows that the likelihood function is separable in \theta and p and that the second part is multinomial.
Therefore, when looking for the Jeffreys prior for the weights of a complete (data-augmented) mixture model, the elements of the Fisher information matrix are

-\mathbb{E}\left[ \frac{\partial^2}{\partial p_\ell^2} \log g(x, z \mid \theta, p) \right] = \frac{\mathbb{E}[n_\ell]}{p_\ell^2} = \frac{n}{p_\ell}, \qquad -\mathbb{E}\left[ \frac{\partial^2}{\partial p_\ell\,\partial p_j} \log g(x, z \mid \theta, p) \right] = 0,

leading to the usual Jeffreys prior associated with the multinomial model, a Dirichlet distribution Dir(1/2, \dots, 1/2).

The above only applies to the artificial case when the allocations z_i are known. When they are unknown, it is easy to see that the log-likelihood function becomes

(6)   \log g(x \mid \theta, p) = \log g(x, z \mid \theta, p) - \sum_{i=1}^{n} \sum_{\ell=1}^{k} I[z_{i,\ell}=1] \log p(z_{i,\ell} = 1 \mid x_i, \theta, p),

where the second term on the right-hand side of the equation represents the loss of information compared to the data-augmented likelihood function. Define the expected Fisher information matrix of the data-augmented model (when only the weights are unknown) as I^{data-aug}(p). Then, since the difference between both matrices is positive definite, this implies that

det(I(p)) \le det(I^{data-aug}(p)),
[det(I(p))]^{1/2} \le [det(I^{data-aug}(p))]^{1/2},
\pi^J(p) \le \pi^{data-aug}_J(p).

This result shows that the Jeffreys prior on the weights of a mixture model when the allocations are unknown is proper, since it is bounded by the Jeffreys prior Dir(1/2, \dots, 1/2) of the complete model.

As a particular case, when all the mixands converge to the same distribution, each of the elements of the form (4) tends to

\int_{\mathcal{X}} \frac{(f_j(x) - f_k(x))(f_h(x) - f_k(x))}{f_j(x)} \, dx,

which does not depend on p.
Therefore, in this case, the determinant of the derived Fisher information matrix is constant in p = (p_1, \dots, p_k) and the resulting Jeffreys prior is uniform on the k-dimensional simplex.

We note that this result is a generalization to a k-component mixture of the prior derived in Bernardo and Girón (1988) for k = 2 (however, these authors derive the reference prior for the limiting cases when all the components have pairwise disjoint supports and when all the components converge to the same distribution). This reasoning led Bernardo and Girón (1988) to conclude that the usual Dir(\lambda_1, \dots, \lambda_k) Dirichlet prior with \lambda_\ell \in [1/2, 1] for all \ell = 1, \dots, k seems to be a reasonable approximation. They also prove that the Jeffreys prior for the weights p_\ell is convex, with an argument based on the sign of the second derivative.

It is important to stress that, in a mixture model setting, it is usual to saturate the model when the number of components is not known a priori and to consider a large number of components k. The main difficulty in this setting is non-identifiability; in particular, the rate of estimation for the saturated model is much slower than the standard 1/\sqrt{n}. Rousseau and Mengersen (2011) have studied the effect of a prior distribution on the weights of a general mixture on regularizing the posterior distribution, i.e., on obtaining consistency towards a single configuration of the reduced parameter space. This is achievable with a prior which allows one to empty the extra components or to merge the existing ones. In particular, Rousseau and Mengersen (2011) propose a Dirichlet prior distribution, with parameters \lambda_1, \dots, \lambda_k smaller than r/2 (where r is the dimension of \theta_\ell) to empty the extra components, or larger than r/2 to merge them; nevertheless, the choice of \lambda_j (j = 1, \dots, k) is quite influential for finite sample sizes. The configuration studied in the proof of Lemma 3.1 is compatible with the Dirichlet configuration of the prior proposed by Rousseau and Mengersen (2011). This is an important property of the Jeffreys prior, since it makes the prior conservative in the number of components. Namely, one can asymptotically identify the components that are artificially added to the model but have no meaning for the data. Moreover, it offers an automatic choice, contrary to the Dirichlet prior, where the hyperparameters have to be chosen.

The shape of the Jeffreys prior for the weights of a mixture model depends on the type of the components: see Appendix A of the Supplementary Material for a discussion.
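As an informal check of this conservativeness, one can compare draws from the Dir(1/2, ..., 1/2) distribution that bounds the Jeffreys prior in the proof of Lemma 3.1 with draws from the uniform Dir(1, ..., 1); the former pushes more mass towards the faces of the simplex, hence towards emptying components. The following stand-alone sketch uses only the Python standard library and is not the authors' code.

```python
import random

def dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(42)
k, n_draws = 3, 5000

# Average value of the smallest weight under each prior.
mean_min_half = sum(min(dirichlet([0.5] * k, rng)) for _ in range(n_draws)) / n_draws
mean_min_unif = sum(min(dirichlet([1.0] * k, rng)) for _ in range(n_draws)) / n_draws
# Dir(1/2, 1/2, 1/2) concentrates near the boundary of the simplex, so its
# smallest weight is typically smaller than under the uniform Dir(1, 1, 1).
```

Smaller Dirichlet exponents thus make it easier for a superfluous component to receive a weight close to zero, which is the mechanism exploited by Rousseau and Mengersen (2011).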
The marginal Jeffreys prior for the weight of one component is more concentrated around one if that component is more informative in terms of the Fisher information matrix: for example, if we consider a two-component mixture model with a Gaussian and a Student t component, the Jeffreys prior for the weights will be more symmetric as the number of degrees of freedom of the Student t increases.

In this Section we consider mixtures of location-scale distributions. If the components of the mixture model (1) are distributions from a location-scale family and the location or scale parameters of the mixture components are unknown, this turns the mixture itself into a location-scale model:

(7)   p_1 f(x \mid \mu, \tau) + \sum_{\ell=2}^{k} p_\ell f_\ell(a_\ell + b_\ell x \mid \mu, \tau, a_\ell, b_\ell).

As a result, model (1) may be reparametrized following Mengersen and Robert (1996), in the case of Gaussian components,

(8)   p\, N(\mu, \tau^2) + (1-p)\, N(\mu + \tau\delta, \tau^2\sigma^2),

namely using a reference location \mu and a reference scale \tau (which may be, for instance, the location and scale of a specific component). Equation (8) may be generalized to the case of k components as

p\, N(\mu, \tau^2) + \sum_{\ell=1}^{k-2} (1-p)(1-q_1)\cdots(1-q_{\ell-1})\, q_\ell\, N(\mu + \tau\theta_1 + \dots + \tau\sigma_1\cdots\sigma_{\ell-1}\theta_\ell,\ \tau^2\sigma_1^2\cdots\sigma_\ell^2)
+ (1-p)(1-q_1)\cdots(1-q_{k-2})\, N(\mu + \tau\theta_1 + \dots + \tau\sigma_1\cdots\sigma_{k-2}\theta_{k-1},\ \tau^2\sigma_1^2\cdots\sigma_{k-1}^2).

Since the mixture model is a location-scale model, the Jeffreys prior is as in the following Lemma (see also Robert, 2001a, Chapter 3).
Lemma 3.2. When the parameters of a location-scale mixture model are unknown, the Jeffreys prior is improper, constant in \mu and powered as \tau^{-d/2}, where d is the total number of unknown parameters of the components (i.e., excluding the weights).

A new version of the proof, never presented before, is available in Appendix B of the Supplementary Material, while the characterization of the Jeffreys prior for \delta is given in Appendix C.

We now derive analytical characterizations of the posterior distributions associated with the Jeffreys priors for mixture models. Consider, first, the case where only the location parameters of a mixture model are unknown. There is a substantial difference between the cases where k = 2 and k > 2.

Lemma 3.3. When k = 2, the posterior distribution derived from the Jeffreys prior when only the location parameters of model (14) are unknown is proper.

The complete proof of Lemma 3.3 is given in Appendix D of the Supplementary Material. Here it is worth noticing that the properness of the posterior distribution in the context of Lemma 3.3 depends on the representation of the mixture model as a location-scale distribution, where the second component is defined with respect to a reference component: if we focus the attention on the part of the likelihood depending only on the second component, even if the prior is constant with respect to the difference between the location parameters \delta as \delta \to \pm\infty, the likelihood decays exponentially fast in \delta^2, and therefore the posterior distribution converges.

Figure 2 shows an approximation of the Jeffreys prior for the location parameters of a two-component Gaussian mixture model on a grid of values and confirms that the prior is constant in the overall location and takes higher and higher values as the difference between the means increases, while the posterior distribution, even if showing the classical multimodal nature (Celeux et al., 2000), seems to concentrate around the true modes.
It also appears to be perfectly symmetric because the other parameters (weights and standard deviations) have been fixed as identical.

The same proof cannot be extended to the general case of k components, because the location parameters are defined as several distances from the reference location parameter: if we again focus the attention on the part of the likelihood depending on the second component, the integral with respect to \delta_2 converges; however, the prior is constant with respect to any other \delta_j (j = 3, \dots, k) as \delta_j \to \pm\infty, and the integral does not converge with respect to the other differences. Then the following Lemma holds (the formal proof is available in Appendix E).

Fig 2. Approximations (on a grid of values) of the Jeffreys prior (on the log-scale) when only the means of a Gaussian mixture model with two components are unknown (above) and of the derived posterior distribution, with known weights both equal to 0.5 and known standard deviations both equal to 5 (below).
Lemma 3.4. When k > 2, the posterior distribution derived from the Jeffreys prior is improper when only the location parameters of model (14) are unknown.

This result confirms the idea that each part of the likelihood gives information about at most the difference between the location of the respective component and the reference location, but not about the locations of the other components. We can now consider the case where all the parameters of (14) are unknown.
Theorem 3.1. The posterior distribution of the parameters of a mixture model with location-scale components derived from the Jeffreys prior when all parameters of model (14) are unknown is improper.
The proof is available in Appendix F of the Supplementary Material.
4. A NONINFORMATIVE ALTERNATIVE TO JEFFREYS PRIOR
The information brought by the Jeffreys prior, or the lack thereof, does not seem to be enough to conduct inference in the case of mixture models. The computation of the determinant creates a dependence between the elements of the Fisher information matrix in the definition of the prior distribution, which makes it difficult to find and justify moderate modifications of this prior that would lead to a proper posterior distribution. For example, using a proper prior for part of the scale parameters and the Jeffreys prior conditionally on them does not avoid impropriety, as is shown in Appendix G of the Supplementary Material.

The literature covers attempts to define priors that add a small amount of information, sufficient to conduct the statistical analysis without overwhelming the information contained in the data. Some of these are related to the computational issues in estimating the parameters of mixture models, as in the approach of Casella et al. (2002), who find a way to use a perfect slice sampler by focusing on components in the exponential family and conjugate priors. A characteristic example is given by Richardson and Green (1997), who propose weakly informative priors, which are data-dependent (or empirical Bayes) and are represented by flat normal priors over an interval corresponding to the range of the data. Nevertheless, since mixture models belong to the class of ill-posed problems, the influence of a proper prior on the resulting inference is difficult to assess.

Another solution found in Mengersen and Robert (1996) proceeds through the reparametrization (8) and introduces a reference component that allows for improper priors. This approach then envisions the other parameters as departures from the reference and ties them together by considering each parameter \theta_\ell as a perturbation of the parameter of the previous component \theta_{\ell-1}.
This perspective is justified by the argument that the (\ell-1)-th component serves as a reference for the \ell-th one. For instance, a three-component Gaussian mixture is then written as

p\, N(\mu, \tau^2) + (1-p)q\, N(\mu + \tau\theta, \tau^2\sigma_1^2) + (1-p)(1-q)\, N(\mu + \tau\theta + \tau\sigma_1\epsilon, \tau^2\sigma_1^2\sigma_2^2),

where one can impose the constraint 1 \ge \sigma_1 \ge \sigma_2 for identifiability reasons. Under this representation, it is possible to use an improper prior on the global location-scale parameter (\mu, \tau), while proper priors must be applied to the remaining parameters. This reparametrization has also been used for exponential components by Gruet et al. (1999) and for Poisson components by Robert and Titterington (1998). Moreover, Roeder and Wasserman (1997) propose a Markov prior which follows the same reasoning of dependence between the parameters for Gaussian components, where each parameter is again a perturbation of the parameter of the previous component \theta_{\ell-1}. Kamary et al. (2017) also propose a reparametrization of location-scale mixtures based on invariance that allows for weakly informative priors.

On one hand, this representation suggests defining a global location-scale parameter in a more implicit way, via a hierarchical model that considers more levels in the analysis and chooses noninformative priors at the last level of the hierarchy. On the other hand, we believe that an essential feature of a default prior is that it should let the analysis identify the correct number of meaningful components, in particular in the standard case where an overfitted mixture is assumed because the a priori information on the number of components is weak. We thus propose a prior scenario which combines both the hierarchical representation and the conservativeness property in terms of components.

More precisely, consider the Gaussian mixture model (1)

(9)   g(x \mid \theta) = \sum_{\ell=1}^{k} p_\ell\, N(x \mid \mu_\ell, \sigma_\ell^2).
The parameters of each component may be considered as related in some way; for example, the observations induce a reasonable range, which makes it highly improbable to face very different means in the above Gaussian mixture model. A similar argument may be used for the standard deviations. Therefore, at the second level of the hierarchical model, we may write

(10)   \mu_\ell \overset{iid}{\sim} N(\mu_0, \zeta_0^2), \qquad \sigma_\ell \overset{iid}{\sim} \tfrac{1}{2}\, U(0, \zeta_0) + \tfrac{1}{2}\, \frac{1}{U(0, 1/\zeta_0)}, \qquad p \mid \mu, \sigma \sim \pi^J(p \mid \mu, \sigma),

which indicates that the location parameters vary between components, but are likely to be close, and that the scale parameters may be smaller or larger than \zeta_0; we have decided to define both \mu_\ell and \sigma_\ell as depending on the hyperparameter \zeta_0 without loss of generality, as one may notice by analysing the mean and variance of the random variables; this representation allows the application of the MCMC scheme proposed in Robert and Mengersen (1999), which allows a better mixing of the chains. The mixture weights are given the prior distribution \pi^J(p \mid \mu, \sigma), which is the Jeffreys prior for the weights, conditional on the location and scale parameters, given in Section 3.1; this choice makes use of the conservative property of the Jeffreys prior for the weights, which is essential in the case of misspecification of the number of components.

At the third level of the hierarchical model, the prior may be noninformative:

(11)   \pi(\mu_0, \zeta_0) \propto 1/\zeta_0.

As in Mengersen and Robert (1996), the parameters in the mixture model are considered tied together; on the other hand, this feature is not obtained via a constrained representation of the mixture model itself, but via a hierarchy in the definition of the model and the parameters.

Theorem 4.1. The posterior distribution derived from the hierarchical representation of the Gaussian mixture model associated with (9), (10) and (11) is proper.
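A draw from the second level of the hierarchy can be sketched as follows, assuming the scale prior in (10) mixes, with equal weights, a uniform U(0, \zeta_0) with the reciprocal of a uniform on (0, 1/\zeta_0) (our reading of the display above); the hyperparameter values \mu_0 = 0, \zeta_0 = 2 and k = 3 are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)
MU0, ZETA0, K = 0.0, 2.0, 3   # hypothetical hyperparameter values

def draw_second_level():
    """One draw of (mu_1..K, sigma_1..K) from the second level of the
    hierarchy, assuming sigma ~ 1/2 U(0, zeta0) + 1/2 * 1/U(0, 1/zeta0)."""
    mus = [rng.gauss(MU0, ZETA0) for _ in range(K)]
    sigmas = []
    for _ in range(K):
        if rng.random() < 0.5:
            sigmas.append(rng.uniform(0.0, ZETA0))   # scale below zeta0
        else:
            u = rng.uniform(1e-12, 1.0 / ZETA0)
            sigmas.append(1.0 / u)                   # scale above zeta0
    return mus, sigmas

draws = [draw_second_level() for _ in range(4000)]
frac_below = sum(s < ZETA0 for _, sig in draws for s in sig) / (4000 * K)
# About half of the scale draws fall below zeta0 and half above, matching
# the statement that sigma may be smaller or larger than zeta0.
```

In an actual analysis \mu_0 and \zeta_0 would not be fixed but given the noninformative prior (11) at the third level of the hierarchy.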
The proof of Theorem 4.1 is available in Appendix H of the Supplementary Material. As a side remark, even if Theorem 4.1 is stated for Gaussian mixture models, it may be extended to other location-scale distributions. Section 6 will present an example with log-normal components, and Section 6.1 one with Gumbel components. However, it cannot be generalized to any location-scale distribution.

This hierarchical version of the mixture model presents some advantages; in particular, the Jeffreys prior used for the weights is conservative in terms of the number of components in the case of misspecification. We recall that when the number of components is not known, it is usual in practice to fix a model with a high number of components (if one wants to avoid a nonparametric analysis); therefore, it is essential that the posterior distribution gives hints on the right k. This feature of the Jeffreys prior allows the experimenter to do so in a noninformative way. More precisely, this hierarchical prior respects Assumption 5 of Rousseau and Mengersen (2011).

Fig 3. Boxplots of posterior means of the largest weight p, with the hierarchical prior on the parameters, in particular a conditional Jeffreys prior on the weights, for sample sizes n = 50, ….
5. SIMULATION STUDY
In this section we present the results of several simulation studies conducted to support the theoretical results presented so far. The results of additional simulations are given in Appendices G and H of the Supplementary Material.

As a remark, integrals of the form (3) need to be approximated, as mentioned in Section 2, and this raises numerical issues. We decided to use Riemann sums (with 550 points) when the component standard deviations are sufficiently large, as they produce stable results, and Monte Carlo integration (with sample sizes of 1500) when they are small. In the latter case, the variability of the Monte Carlo results seems to decrease as σi approaches 0. See the Supplementary Material for a detailed description of these computational issues.

We can analyse the property of conservativeness in overfitted mixtures through simulations, using the hierarchical prior proposed in Section 4. We first consider a very simple example to illustrate this theoretical result. Suppose we want to fit a two-component Gaussian mixture model, with weights p and 1 − p and unknown component parameters, to a sample of data x = {x1, ..., xn} generated from a standard normal distribution N(0, 1). Figure 3 shows, over M = 20 replications of samples of increasing size starting at n = 50, that the posterior mean of the largest weight p increases to 1 as n increases.

We have also considered a more complicated situation, where we fit models with an increasing number of components (k = 2, 3, 4, 5) to samples x = {x1, ..., xn} generated from the two-component mixture model

(12)   0.… N(−…, 1) + 0.… N(3, …).

Fig 4. Boxplots of posterior means of the weights, with the hierarchical prior on the parameters, in particular a conditional Jeffreys prior on the weights, for sample sizes n = 50, … and for models with k = 2, 3, 4, 5 components.

Figures 4 and 5 show the boxplots of the posterior means of the weights obtained through M = 20 replications of the experiment, with a correct (k = 2) or a misspecified (k = 3, 4, 5) number of components.
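As a concrete illustration of the two approximation routes just mentioned, the sketch below (illustrative parameter values and a hypothetical helper function, not code from the paper's repository) approximates a diagonal Fisher-information term of the type entering the Jeffreys prior for a two-component Gaussian mixture, once by a Riemann sum on a fixed grid of 550 points and once by Monte Carlo with 1500 draws from the mixture:

```python
import numpy as np
from scipy.stats import norm

def fisher_diag(i, p, mu, sigma, grid=None, n_mc=1500, rng=None):
    """Approximate (p_i^2 / sigma_i^4) * int [(x - mu_i) N(x|mu_i, s_i^2)]^2 / g(x) dx,
    with g the mixture density, by Riemann sum (if a grid is given) or Monte Carlo."""
    p, mu, sigma = map(np.asarray, (p, mu, sigma))
    num = lambda x: ((x - mu[i]) * norm.pdf(x, mu[i], sigma[i])) ** 2
    g = lambda x: norm.pdf(x[:, None], mu, sigma) @ p      # mixture density g(x)
    c = p[i] ** 2 / sigma[i] ** 4
    if grid is not None:                                   # Riemann sum on a fixed grid
        return c * np.sum(num(grid) / g(grid)) * (grid[1] - grid[0])
    rng = rng or np.random.default_rng(0)                  # Monte Carlo under the mixture:
    z = rng.choice(len(p), size=n_mc, p=p)                 # E_g[num / g^2] equals
    x = rng.normal(mu[z], sigma[z])                        # int num / g dx
    return c * np.mean(num(x) / g(x) ** 2)

p, mu, sigma = [0.5, 0.5], [-1.0, 2.0], [1.0, 1.0]
riemann = fisher_diag(0, p, mu, sigma, grid=np.linspace(-12.0, 12.0, 550))
mc = fisher_diag(0, p, mu, sigma)
```

For moderate component standard deviations the two estimates agree closely; as the standard deviations shrink, the integrand becomes spiky and a fixed grid misses mass, which is the numerical issue discussed above.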
6. ILLUSTRATIONS
In this section we analyse the performance of the approach proposed in Section 4 on three datasets so well known in the mixture-model literature that they can be taken as benchmarks, and on a new dataset that we introduce here for the first time. In order to better present this new dataset, its analysis is presented separately.

Fig 5. As in Figure 4, for sample sizes n = 1000, ….

Fig 6. Predictive distribution of the galaxy dataset (velocity of galaxies, in 1000 km/s): the red line represents the estimated density, the shaded blue area represents the credible intervals obtained in simulations by assuming a ten-component mixture model.

The first dataset contains data about the velocity (in km per second) of 82 galaxies in the Corona Borealis region. The goal of this analysis is to understand the number of stellar populations, in order to support a particular theory of the formation of the Galaxy. The galaxy dataset has been investigated by several authors, including Richardson and Green (1997), Raftery (1996), Escobar and West (1995) and Roeder (1990), among others.

The galaxy velocities are considered as random variables distributed according to a mixture of k normal distributions. The evaluation of the number of components has proved to be delicate, with estimates ranging from 3 in Roeder and Wasserman (1997) to 5 in Richardson and Green (1997) and 7 in Escobar and West (1995).

We have assumed a ten-component mixture model and checked whether or not the hierarchical approach that uses the conditional Jeffreys prior on the weights of the mixture model manages to identify a smaller number of significant components. The results are available in Figure 6 and Table 1.
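There is no off-the-shelf implementation of the hierarchical conditional Jeffreys prior, but the qualitative phenomenon exploited here, an overfitted ten-component mixture emptying its superfluous components, can be sketched with scikit-learn's variational BayesianGaussianMixture, which uses a (Dirichlet-process) weight prior rather than the Jeffreys prior; the bimodal data below are synthetic stand-ins, not the actual galaxy velocities:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# synthetic bimodal data standing in for the galaxy velocities (NOT the real dataset)
x = np.concatenate([rng.normal(10.0, 1.0, 300),
                    rng.normal(21.0, 2.0, 200)]).reshape(-1, 1)

# overfitted k = 10 mixture; a small weight_concentration_prior mimics a prior that
# is conservative in the number of components (variational Dirichlet process, not
# the conditional Jeffreys prior studied in the paper)
bgm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                              max_iter=2000, random_state=0).fit(x)
active = int(np.sum(bgm.weights_ > 0.05))  # components with non-negligible weight
```

With a small weight_concentration_prior, most of the ten weights collapse towards zero and only the components supported by the data keep non-negligible mass.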
The algorithm identifies 5 components with weights larger than zero, a result in line with Richardson and Green (1997) and more conservative than Escobar and West (1995), which confirms the Jeffreys prior's feature of being conservative in the number of components. Credible intervals also show that the parameters of the components whose marginal posterior distributions for the weights are not concentrated around zero are estimated with lower uncertainty.

The second dataset is related to a population study conducted to validate caffeine as a probe drug to establish the genetic status of rapid acetylators and slow acetylators (Bechtel et al., 1993): many drugs, including caffeine, are metabolized by a polymorphic enzyme (EC 2.3.1.5) in humans, and the white population is divided into two groups of slow acetylators and rapid acetylators. Caffeine is considered an interesting drug for studying the phenotype of people because it is regularly consumed by a large part of the population. Several population studies have been conducted, some of them reporting a bimodality, some others a trimodality. We focus on the study presented by Bechtel et al. (1993), involving 245 unrelated patients and computing the molar ratio between two metabolites of caffeine, AFMU and 1X, both measured in urine 4 to 6 hours after ingestion of 200 mg of caffeine.

Fig 7. Predictive distribution of the enzyme dataset (molar ratio AFMU/1X): the red line represents the estimated density, the shaded blue area represents the credible intervals obtained in simulations by assuming a ten-component mixture model.

We have again assumed a ten-component mixture model and checked whether or not the hierarchical approach which uses the conditional Jeffreys prior on the weights of the mixture model is able to identify a smaller number of significant components.

The results are available in Figure 7 and Table 1. The algorithm identifies two components with weights clearly larger than zero and two other components with very small weights. Bechtel et al. (1993) identify a bimodal density, while Richardson and Green (1997) consider a 3-5 component mixture highly likely. The Jeffreys prior allows the analysis to concentrate on mainly two subgroups and it suggests that Gaussian components may be inappropriate in this setting: by looking at the location of the components with small weights, it may be more adequate to consider asymmetric distributions.

Our third dataset is related to measuring the acid neutralizing capacity (ANC) (on the log scale) of a sample of 155 lakes in north-central Wisconsin, to determine the number of lakes that have been affected by acidic deposition (Crawford et al., 1992): the ANC measures the capability of a lake to neutralize acid, i.e. low values may indicate a problem for the lake's biological diversity.

The results are available in Figure 8 and Table 1.
The algorithm identifies two components with significant weights and two other components with very small weights.

Fig 8. Predictive distribution of the acidity dataset (log(ANC)): the red line represents the estimated density, the shaded blue area represents the credible intervals obtained in simulations by assuming a ten-component mixture model.

Crawford et al. (1992) assume a bimodal density, while Richardson and Green (1997) consider a 3-5 component model highly likely. The Jeffreys prior again allows the analysis to concentrate on two main subgroups and suggests investigating the importance of the other two components and possibly the goodness of fit of the log-normal distribution in this setting.

A recent trend in computer network systems is the deployment of network functions in software (Nunes et al., 2014). The so-called "software dataplanes" are emerging as an alternative to traditional hardware switches and routers, reducing costs and enhancing programmability.

The monitoring of IP packets is, among all possible network functions, one of the most suitable for a software deployment. However, monitoring has a huge cost in terms of CPU (processing) time consumed per packet. The main reason for this is that each incoming packet triggers the retrieval, from a large hash table, of all the information related to the packet's flow (i.e. the packet's family). This operation is generally called flow-entry retrieval. The time required for the flow-entry retrieval (retrieval time) mainly depends on whether such information is available in one of the processor caches (e.g. L1, L2, L3) or in memory.

The dataset used in this analysis consists of generated samples of retrieval times, each with 10 times, under two different set-ups. In the first one, the flow-entry has been forced to reside in fast processor caches ("hit").
In the second one, all flow-entries have been forced to reside in the server RAM (memory), which results in a slower flow-entry retrieval ("miss").

Both samples show a heavy tail, due to possible hash collisions on the table, as well as additional delays introduced by measuring the retrieval time at a nanosecond timescale. In the case of "miss", another reason for the heavy tail can be identified in the virtual/physical memory mapping, which can inflate the retrieval time in some cases.

Table 1
Posterior means for the weights, the means and the standard deviations of a ten-component mixture model, assumed for the galaxy, the enzyme and the acidity datasets (the first number in brackets is the posterior mean and the second is the posterior standard deviation). We have decided not to show the estimated location and scale parameters when the weights are concentrated around zero.

Dataset:    galaxy    enzyme    acidity
p1          …         …         …
p2          …         …         …
p3          …         …         …
p4          …         …         …
p5          …         …         …
Σℓ=6 pℓ     …         …         …

The goal of a realistic analysis is to infer the proportion of reported times which may be considered to come from the "hit" distribution and the proportion of times which may be considered to come from the "miss" distribution, i.e. to derive the percentage of packets for which the flow-entry was in the cache and the percentage of packets for which the flow-entry was in memory.

However, a first simulation is generally used to test the procedure. The interest of the analysis lies in the region of the space where the two distributions overlap; the interest is therefore not in the external tails, which may, nonetheless, affect inference. A preliminary analysis may therefore be conducted in order to understand whether a part of the future observations may be discarded from the analysis. In this particular case, the conservative property of the Jeffreys prior may be used in order to understand how important the tails of each distribution are and to identify the right models to use.
For instance, a comparison between a Gaussian mixture model and a mixture model with Gumbel components may be run: if in both cases the analysis run with a Jeffreys prior for the mixture weights identifies more than the two (assumed) distributions of interest, this may be a suggestion that the observations allocated to the external components (neither the "hit" nor the "miss" ones) may be discarded, providing inference on the proportion of observations to discard as well.

Figure 9 and Table 2 show the results of this analysis: adopting a Jeffreys prior for the mixture weights when assuming Gumbel components allows a better estimation of the first component and describes the asymmetry observed in the data as an asymmetry in the first component instead of an additional component. Nevertheless, it is not sufficient to identify the observations in the right tail of the second component as part of its tail, since the algorithm identifies a third component located in that part of the space.

Fig 9. Predictive distribution of the network dataset: the red line represents the estimated density, the shaded blue area (very concentrated around the red lines) represents the credible intervals obtained in simulations by assuming a ten-component mixture model, with Gaussian components on the left (a) and Gumbel components on the right (b).

In this setting, the Jeffreys prior allows us to i) identify a misspecification of the model assumptions (the approximated Bayes factor of the mixture of Gumbel components against the mixture of normal components is 2.10) and ii) identify which part of the observations to discard from further studies.
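A simplified, non-Bayesian version of this Gaussian-versus-Gumbel comparison can be sketched by fitting both two-component mixtures by maximum likelihood and comparing BIC values; the data, starting values and helper functions below are illustrative assumptions, not the paper's actual procedure or dataset:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
# synthetic stand-in for the retrieval times: a fast "hit" mode and a skewed "miss" mode
x = np.concatenate([rng.normal(80.0, 3.0, 300), rng.gumbel(160.0, 8.0, 700)])

def nll(theta, logpdf):
    """Negative log-likelihood of a two-component mixture with component density
    logpdf(x, loc, scale); theta = (logit weight, loc1, log scale1, loc2, log scale2)."""
    w = 1.0 / (1.0 + np.exp(-theta[0]))
    l1 = np.log(w) + logpdf(x, theta[1], np.exp(theta[2]))
    l2 = np.log1p(-w) + logpdf(x, theta[3], np.exp(theta[4]))
    return -np.logaddexp(l1, l2).sum()

def bic(logpdf):
    start = [np.log(0.3 / 0.7), 80.0, np.log(3.0), 160.0, np.log(8.0)]
    res = optimize.minimize(nll, start, args=(logpdf,), method="Nelder-Mead",
                            options={"maxiter": 5000, "maxfev": 5000, "fatol": 1e-8})
    return 2.0 * res.fun + 5.0 * np.log(len(x))  # 5 free parameters in each model

bic_normal = bic(stats.norm.logpdf)
bic_gumbel = bic(stats.gumbel_r.logpdf)
```

Since the synthetic "miss" mode is skewed, the Gumbel mixture should attain the lower BIC, mirroring the direction of the Bayes-factor comparison reported above.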
7. CONCLUSION
This thorough analysis of the Jeffreys priors in the setting of mixtures with location-scale components shows that mixture distributions deserve the qualification of an ill-posed problem with regard to the production of non-informative priors. Indeed, we have shown that most configurations for Bayesian inference in this framework do not allow the standard Jeffreys prior to be taken as a reference. While this is not the first occurrence where Jeffreys priors cannot be used as reference priors, we have shown that the Jeffreys prior for the mixture weights has the important property of being conservative in the number of components, with a configuration compatible with the results of Rousseau and Mengersen (2011). This is a general feature of the Jeffreys prior for the mixture weights, which is independent of the shape of the distributions composing the mixture.

Nevertheless, we have decided to study its behavior in the specific case of components from location-scale families. We have proposed a hierarchical representation of the mixture model, which allows for improper priors at the highest level of the hierarchy and assumes the Jeffreys prior for the mixture weights at the second level, conditional on prior distributions for the location and scale parameters along the lines of Mengersen and Robert (1996).

Through several examples, both on simulated and real datasets, we have shown that this representation seems to be more conservative on the number of components than other non- or weakly informative prior distributions for mixture models available in the literature. In particular, it seems to be able to recognize
Table 2
Posterior means for the weights, the means and the standard deviations of a ten-component mixture model, assumed for the network dataset (credible intervals of level 0.95 in brackets).
Gaussian comp.       p                 µ                     σ
                (0.180, 0.249)   (222.657, 233.842)   (45.483, 55.265)
                (0.474, 0.568)   (160.216, 161.882)   (6.830, 8.212)
                (0.188, 0.257)   (81.057, 82.270)     (1.666, 2.135)
                (0.029, 0.064)   (91.710, 93.700)     (2.698, 4.388)
                Σℓ=5 pℓ = 0.…

Gumbel comp.         p                 µ                     σ
                (0.183, 0.251)   (213.446, 213.846)   (53.526, 64.667)
                (0.479, 0.562)   (160.113, 160.482)   (7.465, 8.482)
                (0.219, 0.302)   (83.251, 83.270)     (3.005, 3.753)
                Σℓ=4 pℓ = 0.…

the meaningful components, which is an essential property for a noninformative prior for mixture models: in an objective setting, it is essential to consider the possibility of having assumed a wrong number of components. In this sense, the Jeffreys prior for the mixture weights may be used to identify the meaningful components and possible misspecifications of either the number or the distributional family of the components.

As a side note, we have mainly analyzed mixtures of Gaussian distributions in this paper, with extensions of the theoretical results to the other distributions of the location-scale family. Nevertheless, the possible difficulties deriving from the use of distributions different from the Gaussian are not considered here and will be the focus of future research. In particular, poorly specified likelihoods and ill-behaved cases are more likely to meet difficulties. However, the Jeffreys prior is known as a regularization prior that does not necessarily reflect prior beliefs but, in combination with the likelihood function, yields posteriors with desirable properties; see Hoogerheide and van Dijk (2008) for a detailed review of ill-behaved posterior cases and the role of the Jeffreys prior in those cases.

ACKNOWLEDGEMENTS AND NOTES
The code used for the Gaussian mixture models is available online at the following link: https://github.com/cgrazian/Jeffreys_mixtures. The authors want to thank Gioacchino Tangari, from the Department of Electronic and Electrical Engineering, University College London, for having provided the simulations of Section 6.1.
REFERENCES
Bechtel, Y.C., Bonaiti-Pellie, C., Poisson, N., Magnette, J. and Bechtel, P.R. (1993). A population and family study of N-acetyltransferase using caffeine urinary metabolites. Clinical Pharmacology & Therapeutics.
Berger, J., Bernardo, J. and Sun, D. (2009). Natural induction: An objective Bayesian approach. Rev. Acad. Sci. Madrid, A 103.
Bernardo, J. and Girón, F. (1988). A Bayesian analysis of simple mixture problems. In Bayesian Statistics 3 (J. Bernardo, M. DeGroot, D. Lindley and A. Smith, eds.). Oxford University Press, Oxford, 67–78.
Casella, G., Mengersen, K., Robert, C. and Titterington, D. (2002). Perfect slice samplers for mixtures of distributions. J. Royal Statist. Society Series B.
Celeux, G., Hurn, M. and Robert, C. (2000). Computational and inferential difficulties with mixture posterior distributions. J. American Statist. Assoc.
Crawford, S.L., DeGroot, M.H., Kadane, J.B. and Small, M.J. (1992). Modeling lake-chemistry distributions: Approximate Bayesian methods for estimating a finite-mixture model. Technometrics.
Diebolt, J. and Robert, C. (1994). Estimation of finite mixture distributions by Bayesian sampling. J. Royal Statist. Society Series B.
Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. American Statist. Assoc., (430) 577–588.
Figueiredo, M. and Jain, A. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer-Verlag, New York.
Geweke, J. (2007). Interpretation and inference in mixture models: Simple MCMC works. Comput. Statist. Data Analysis.
Ghosh, M., Carlin, B.P. and Srivastava, M.S. (1995). Probability matching priors for linear calibration. TEST.
Gruet, M., Philippe, A. and Robert, C. (1999). MCMC control spreadsheets for exponential mixture estimation. J. Comput. Graph. Statist.
Hoogerheide, L.F. and van Dijk, H.K. (2008). Possibly ill-behaved posteriors in econometric models.
Jasra, A., Holmes, C. and Stephens, D. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci.
Jeffreys, H. (1939). Theory of Probability. 1st ed. The Clarendon Press, Oxford.
Kamary, K., Lee, J.E. and Robert, C.P. (2017). Weakly informative reparameterisations for location-scale mixtures. Pre-print, arXiv:1601.01178v2.
Kass, R. and Wasserman, L. (1996). Formal rules of selecting prior distributions: A review and annotated bibliography. J. American Statist. Assoc.
Lee, K., Marin, J.-M., Mengersen, K. and Robert, C. (2009). Bayesian inference on mixtures of distributions. In Perspectives in Mathematical Sciences I: Probability and Statistics (N. N. Sastry, M. Delampady and B. Rajeev, eds.). World Scientific, Singapore, 165–202.
MacLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley, New York.
Mengersen, K. and Robert, C. (1996). Testing for mixtures: A Bayesian entropic approach (with discussion). In Bayesian Statistics 5 (J. Berger, J. Bernardo, A. Dawid, D. Lindley and A. Smith, eds.). Oxford University Press, Oxford, 255–276.
Nunes, B.A.A., Mendonca, M., Nguyen, X.N., Obraczka, K. and Turletti, T. (2014). A survey of software-defined networking: Past, present, and future of programmable networks. IEEE Communications Surveys & Tutorials, 16(3), 1617–1634.
Puolamäki, K. and Kaski, S. (2009). Bayesian solutions to the label switching problem. In Advances in Intelligent Data Analysis VIII (N. Adams, C. Robardet, A. Siebes and J.-F. Boulicaut, eds.), vol. 5772 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, 381–392.
Raftery, A.E. (1996). Hypothesis testing and model selection. In Markov Chain Monte Carlo in Practice (W.R. Gilks, D.J. Spiegelhalter and S. Richardson, eds.). Chapman and Hall, London, 163–188.
Richardson, S. and Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Royal Statist. Society Series B.
Rissanen, J. (2012). Optimal Estimation of Parameters. Cambridge University Press.
Robert, C. (2001a). The Bayesian Choice. 2nd ed. Springer-Verlag, New York.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. 2nd ed. Springer-Verlag, New York.
Robert, C., Chopin, N. and Rousseau, J. (2009). Theory of Probability revisited (with discussion). Statist. Science.
Robert, C. and Mengersen, K. (1999). Reparametrization issues in mixture estimation and their bearings on the Gibbs sampler. Comput. Statist. Data Analysis.
Robert, C. and Titterington, M. (1998). Reparameterisation strategies for hidden Markov models and Bayesian approaches to maximum likelihood estimation. Statistics and Computing.
Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. J. American Statist. Assoc., (411) 617–624.
Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. J. American Statist. Assoc.
Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. Royal Statist. Society Series B.
Rubio, F. and Steel, M. (2014). Inference in two-piece location-scale models with Jeffreys priors. Bayesian Analysis.
Stephens, M. (2000). Dealing with label switching in mixture models. J. Royal Statist. Society Series B.
Titterington, D., Smith, A. and Makov, U. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley, New York.
Wasserman, L. (2000). Asymptotic inference for mixture models using data dependent priors. J. Royal Statist. Society Series B.

Fig 10. Approximations of the marginal prior distributions for the first weight of a two-component Gaussian mixture model, p N(−…, 1) + (1 − p) N(10, …) (black), p N(−…, 1) + (1 − p) N(1, …) (red) and p N(−…, 1) + (1 − p) N(10, …) (blue).

SUPPLEMENTARY MATERIAL

Appendix A: Form of the Jeffreys prior for the weights of the mixture model.
The shape of the Jeffreys prior for the weights of a mixture model depends on the type of the components. Figures 10, 11 and 12 show the form of the Jeffreys prior for a two-component mixture model for different choices of components. It is always concentrated around the extreme values of the support; however, the amount of concentration around 0 or 1 depends on the information brought by each component. In particular, Figure 10 shows that the prior becomes more symmetric as the variances of the two components become more similar, while Figure 11 shows that the prior is much more concentrated around 1 for the weight of the normal component when the second component is a Student t distribution.

Finally, Figure 12 shows the behavior of the Jeffreys prior when the first component is Gaussian and the second is a Student t with an increasing number of degrees of freedom. As expected, as the Student t approaches a normal distribution, the Jeffreys prior becomes more and more symmetric.

Appendix B: Proof of Lemma 3.2
When the parameters of a location-scale mixture model are unknown, the Jeffreys prior is improper, constant in µ and powered as τ^{−d/2}, where d is the total number of unknown parameters of the components (i.e. excluding the weights).

Proof.
We first consider the case where the means are the only unknown parameters of a Gaussian mixture model,

g_X(x) = Σ_{l=1}^{k} p_l N(x | µ_l, σ_l²).

Fig 11. Approximations of the marginal prior distributions for the first weight of a two-component mixture model where the first component is Gaussian and the second is Student t, p N(−…, 1) + (1 − p) t(df = 1, …, …) (black), p N(−…, 1) + (1 − p) t(df = 1, …, …) (red) and p N(−…, 1) + (1 − p) t(df = 1, …, …) (blue).

Fig 12. Approximations of the marginal prior distributions for the first weight of a two-component mixture model where the first component is Gaussian and the second is Student t with an increasing number of degrees of freedom.

The generic elements of the expected Fisher information matrix are, for the diagonal and off-diagonal terms respectively,

E[−∂² log g_X(X)/∂µ_i²] = (p_i²/σ_i⁴) ∫_{−∞}^{+∞} [(x − µ_i) N(x | µ_i, σ_i²)]² / Σ_{ℓ=1}^{k} p_ℓ N(x | µ_ℓ, σ_ℓ²) dx,

E[−∂² log g_X(X)/∂µ_i ∂µ_j] = (p_i p_j/(σ_i² σ_j²)) ∫_{−∞}^{+∞} (x − µ_i) N(x | µ_i, σ_i²) (x − µ_j) N(x | µ_j, σ_j²) / Σ_{ℓ=1}^{k} p_ℓ N(x | µ_ℓ, σ_ℓ²) dx.

Now consider the change of variable t = x − µ_i in the above integrals, where µ_i is the mean of the i-th Gaussian component (i ∈ {1, ..., k}). The above integrals are then equal to

E[−∂² log g_X(X)/∂µ_j²] = (p_j²/σ_j⁴) ∫_{−∞}^{+∞} [(t − µ_j + µ_i) N(t | µ_j − µ_i, σ_j²)]² / Σ_{ℓ=1}^{k} p_ℓ N(t | µ_ℓ − µ_i, σ_ℓ²) dt,

E[−∂² log g_X(X)/∂µ_j ∂µ_m] = (p_j p_m/(σ_j² σ_m²)) ∫_{−∞}^{+∞} (t − µ_j + µ_i) N(t | µ_j − µ_i, σ_j²) (t − µ_m + µ_i) N(t | µ_m − µ_i, σ_m²) / Σ_{ℓ=1}^{k} p_ℓ N(t | µ_ℓ − µ_i, σ_ℓ²) dt.

Therefore, the terms of the Fisher information matrix only depend on the differences δ_j = µ_i − µ_j for j ∈ {1, ..., k}. This implies that the Jeffreys prior is improper, since a reparametrization in (µ_i, δ) shows that the prior does not depend on µ_i.

Moreover, consider a two-component mixture model with all the parameters unknown,

p N(µ, τ²) + (1 − p) N(µ + τδ, τ²σ²).
With some computations, it is straightforward to derive the Fisher information matrix for this model, partly shown in Table 3, where each element is multiplied by a term that does not depend on τ.

Table 3
Factors depending on τ of the Fisher information matrix for the reparametrized model (8), for the parameters (σ, δ, p, µ, τ). [Entries are powers of τ; the p × σ, p × δ and p × p entries equal 1, and the remaining legible entries are negative powers of τ.]

Therefore, the Fisher information matrix, considered as a function of τ, is a block matrix. From well-known results in linear algebra, if we consider a block matrix

M = [ A  B
      C  D ],

then its determinant is given by det(M) = det(A − B D^{−1} C) det(D). In the case of a two-component mixture model, where the total number of component parameters (i.e. not counting the weights) is d = 4, det(D) and det(A − B D^{−1} C) are proportional to negative powers of τ whose product is τ^{−4}. Then the Jeffreys prior for a two-component Gaussian mixture model is proportional to τ^{−2}. If we generalize to the case of a Gaussian mixture model with k components, the total number of component parameters is d = 2k and the Jeffreys prior for a k-component Gaussian mixture model is proportional to τ^{−k}.

When considering the general case of components from a location-scale family, this improperness of the Jeffreys prior distribution remains valid because, once reference location-scale parameters are chosen, the mixture model may be rewritten as

(13)  p1 f(x | µ, τ) + Σ_{ℓ=2}^{k} p_ℓ f_ℓ(a_ℓ + x b_ℓ | µ, τ, a_ℓ, b_ℓ).

Then the second derivatives of the logarithm of model (13) behave as the ones derived for the Gaussian case, i.e. they depend on the differences between each location parameter and the reference one, but not on the reference location itself. The Jeffreys prior is then constant with respect to the global location parameter and powered in the global scale parameter.
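The block-determinant identity invoked in this proof, det(M) = det(A − B D⁻¹ C) det(D) for invertible D, can be checked numerically (the matrices below are arbitrary illustrative examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 3))
D = rng.normal(size=(2, 2)) + 4.0 * np.eye(2)   # diagonal shift keeps D invertible

M = np.block([[A, B], [C, D]])                   # the block matrix [[A, B], [C, D]]
lhs = np.linalg.det(M)
rhs = np.linalg.det(A - B @ np.linalg.inv(D) @ C) * np.linalg.det(D)
# lhs and rhs agree up to floating-point error
```

The identity requires only that D be invertible, which is why the proof can read off the power of τ blockwise.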
Appendix C: Jeffreys prior for δ = µ2 − µ1

The Jeffreys prior of δ conditional on µ when only the location parameters are unknown is improper.

Proof.
When considering the reparametrization by Mengersen and Robert (1996), the Jeffreys prior of δ for fixed µ has the form

π^J(δ | µ) ∝ ∫_X [(1 − p) x exp{−x²/2}]² / ( (p/σ) exp{−(x + δστ)²/(2σ²)} + (1 − p) exp{−x²/2} ) dx,

and the following result may be demonstrated. The improperness of the conditional Jeffreys prior on δ depends (up to a constant) on the double integral

∫_∆ ∫_X [(1 − p) x exp{−x²/2}]² / ( (p/σ) exp{−(x + δστ)²/(2σ²)} + (1 − p) exp{−x²/2} ) dx dδ.

The order of the integrals may be exchanged, giving

∫_X x² ∫_∆ [(1 − p) exp{−x²/2}]² / ( (p/σ) exp{−(x + δστ)²/(2σ²)} + (1 − p) exp{−x²/2} ) dδ dx.

Define d(x) = (1 − p) e^{−x²/2}. Then the inner integral becomes

∫_∆ d(x)² / ( (p/σ) exp{−(x + δστ)²/(2σ²)} + d(x) ) dδ.

Since the first term of the denominator behaves as exp{−δ²} (up to constants) as δ → ∞, for A large enough and any ε > 0 we have

∫_{−∞}^{+∞} d²/( (p/σ) exp{−(x + δστ)²/(2σ²)} + d ) dδ > ∫_A^{+∞} d²/( (p/σ) exp{−(x + δστ)²/(2σ²)} + d ) dδ,

because the integrand is positive, and, since the exponential term is smaller than ε for δ ≥ A,

∫_A^{+∞} d²/( (p/σ) exp{−(x + δστ)²/(2σ²)} + d ) dδ > ∫_A^{+∞} d²/( ε + d ) dδ = +∞.

Fig 13. Approximations (on a grid of values) of the Jeffreys prior (on the natural scale) of the difference between the means of a Gaussian mixture model with only the means unknown (left) and of the derived posterior distribution (right; the red line represents the true value), with known weights equal to (0.…, 0.…) (black lines), (0.…, 0.…) (green and blue lines) and known standard deviations equal to (5, …) (black lines), (1, …) (green lines) and (7, …) (blue lines).
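The flattening that drives this improperness can be seen numerically. The sketch below uses a simplified stand-in for the δ-dependent integral above: a two-component Gaussian mixture with known weight p and unit variances. The Fisher-type term tends to the constant 1 − p as the components separate, so its integral over δ diverges:

```python
import numpy as np
from scipy.stats import norm

def fisher_term(delta, p=0.5):
    """Riemann approximation of
    int [(1-p)(x-delta) N(x|delta,1)]^2 / (p N(x|0,1) + (1-p) N(x|delta,1)) dx,
    a simplified stand-in for the delta-dependent part of pi^J(delta | mu)."""
    x = np.linspace(-15.0, delta + 15.0, 6000)
    g = p * norm.pdf(x) + (1.0 - p) * norm.pdf(x, delta, 1.0)
    num = ((1.0 - p) * (x - delta) * norm.pdf(x, delta, 1.0)) ** 2
    return float(np.sum(num / g) * (x[1] - x[0]))

# the term stabilizes at (1 - p) = 0.5 for well-separated components, so the
# (unnormalized) prior cannot be integrable in delta
vals = [fisher_term(d) for d in (0.0, 5.0, 10.0, 25.0)]
```

Here fisher_term(0) ≈ 0.25 while fisher_term(10) ≈ fisher_term(25) ≈ 0.5: the term stabilizes at a nonzero constant, hence no normalizing constant in δ exists.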
Therefore the conditional Jeffreys prior on δ is improper.

Figure 13 compares the behaviour of the prior and of the resulting posterior distribution for the difference between the means of a two-component Gaussian mixture model: the prior distribution is symmetric and behaves differently depending on the values of the other parameters, but it always stabilizes for large enough values; the posterior distribution appears to always concentrate around the true value.

Appendix D: Proof of Lemma 3.3
When k = 2, the posterior distribution derived from the Jeffreys prior, when only the location parameters of model

(14)  p1 f(x | µ, τ) + Σ_{ℓ=2}^{k} p_ℓ f_ℓ(a_ℓ + x b_ℓ | µ, τ, a_ℓ, b_ℓ)

are unknown, is proper.

Proof.
The conditional Jeffreys prior for the means of a Gaussian mixture model follows the behavior of the (square root of the) determinant of the Fisher information matrix,

(p1² p2²)/(σ1⁴ σ2⁴) { ∫_{−∞}^{+∞} [t N(t | 0, σ1²)]² / (p1 N(t | 0, σ1²) + p2 N(t | δ, σ2²)) dt
   × ∫_{−∞}^{+∞} [u N(u | 0, σ2²)]² / (p1 N(u | −δ, σ1²) + p2 N(u | 0, σ2²)) du
   − ( ∫_{−∞}^{+∞} t N(t | 0, σ1²) (t − δ) N(t | δ, σ2²) / (p1 N(t | 0, σ1²) + p2 N(t | δ, σ2²)) dt )² },

where δ = µ2 − µ1. The posterior distribution is then defined as

∏_{i=1}^{n} [p1 N(xi | µ1, σ1²) + p2 N(xi | µ2, σ2²)] π^J(µ1, µ2 | p, σ).

The likelihood may be rewritten (without loss of generality taking σ1 = σ2 = 1, since they are known) as

(15)  L(θ) = ∏_{i=1}^{n} [p1 N(xi | µ1, 1) + p2 N(xi | µ2, 1)]
   = (2π)^{−n/2} [ p1ⁿ exp{−½ Σ_{i=1}^{n} (xi − µ1)²}
     + Σ_{j=1}^{n} p1^{n−1} p2 exp{−½ Σ_{i≠j} (xi − µ1)² − ½ (xj − µ2)²}
     + Σ_{j=1}^{n} Σ_{k≠j} p1^{n−2} p2² exp{−½ [Σ_{i≠j,k} (xi − µ1)² + (xj − µ2)² + (xk − µ2)²]}
     + ··· + p2ⁿ exp{−½ Σ_{j=1}^{n} (xj − µ2)²} ].

Then, for |µ1| → ∞, L(θ) tends to the term

p2ⁿ exp{−½ Σ_{j=1}^{n} (xj − µ2)²},

which is constant in µ1. Therefore we can study the behavior of the posterior distribution for this part of the likelihood to assess its properness. This explains why we want the following integral to converge:

∫_{R×R} p2ⁿ exp{−½ Σ_{j} (xj − µ2)²} π^J(µ1, µ2) dµ1 dµ2,

which is equal to (by the change of variable µ2 − µ1 = δ)

∫_{R×R} p2ⁿ exp{−½ Σ_{j} (xj − µ1 − δ)²} π^J(µ1, δ) dµ1 dδ.

From Appendix C of the Supplementary Material, the prior distribution only depends on the difference between the means, δ, so that

∫_{R} p2ⁿ ∫_{R} exp{−½ Σ_{j} (xj − µ1 − δ)²} dµ1 π^J(δ) dδ
   ∝ ∫_{R} ∫_{R} exp{µ1 Σ_{j} (xj − δ) − n µ1²/2} dµ1 exp{−½ Σ_{j} (xj − δ)²} π^J(δ) dδ
   = ∫_{R} exp{−½ Σ_{j} (xj − δ)² + [Σ_{j} (xj − δ)]²/(2n)} π^J(δ) dδ
(16)  ≈ ∫_{R} exp{−(n − 1) δ²/2} π^J(δ) dδ,

where π^J(δ) is the conditional prior derived in Appendix C. As δ → ±∞ this prior is constant with respect to δ. Therefore the integral (16) is convergent for n ≥ 2.

Appendix E: Proof of Lemma 3.4
When \(k > 2\), the posterior distribution derived from the Jeffreys prior when only the location parameters of model (14) are unknown is improper.

Proof.
In the case of \(k > 2\) components, the Jeffreys prior for the location parameters is still constant with respect to a reference mean (for example, \(\mu_1\)); therefore it depends only on the difference parameters \((\delta_2 = \mu_2 - \mu_1,\ \delta_3 = \mu_3 - \mu_1,\ \cdots,\ \delta_k = \mu_k - \mu_1)\). The Jeffreys prior depends on the product of the terms on the diagonal:

\[
\int_{-\infty}^{\infty} \frac{\left[t\, N(t\mid 0,\sigma_1^2)\right]^2}{p_1 N(t\mid 0,\sigma_1^2) + \cdots + p_k N(t\mid \delta_k,\sigma_k^2)}\, dt
\;\cdots\;
\int_{-\infty}^{\infty} \frac{\left[u\, N(u\mid 0,\sigma_k^2)\right]^2}{p_1 N(u\mid -\delta_k,\sigma_1^2) + \cdots + p_k N(u\mid 0,\sigma_k^2)}\, du.
\]

If, as in Lemma 3.3, only the part of the likelihood depending on, e.g., \(\mu_2\) is considered, the convergence of the following integral has to be studied:

\[
\int_{\mathbb{R}} \cdots \int_{\mathbb{R}} e^{-\frac{n-1}{2}\delta_2^2}\, \pi^J(\delta_2,\cdots,\delta_k)\, d\delta_2 \cdots d\delta_k.
\]

In this case, however, even though the integral with respect to \(\delta_2\) may converge, the integrals with respect to the \(\delta_j\) with \(j \neq 2\) will diverge, since the prior tends to a constant in each \(\delta_j\) as \(|\delta_j| \to \infty\).
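The flattening of the prior in each mean difference, which drives the impropriety above, can be checked numerically. The sketch below is our own illustration (not code from the paper): it approximates, by a midpoint Riemann sum, a Fisher-information-type integral of the form \(\int [(1-p)\,x\,N(x\mid 0,1)]^2 / [p\,N(x\mid -\delta,1) + (1-p)\,N(x\mid 0,1)]\, dx\) and shows that it stabilizes at the constant \((1-p)\int x^2 N(x\mid 0,1)\,dx = 1-p\) as \(|\delta|\) grows, so the prior on \(\delta\) cannot vanish at infinity.

```python
import math

def norm_pdf(x, mu, sigma=1.0):
    """Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def prior_integrand_mass(delta, p=0.5, lo=-30.0, hi=30.0, n=6000):
    """Midpoint Riemann sum of the Fisher-information-type integral
    int [(1-p) x N(x|0,1)]^2 / (p N(x|-delta,1) + (1-p) N(x|0,1)) dx,
    which governs the tail behavior of pi^J(delta)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        num = ((1.0 - p) * x * norm_pdf(x, 0.0)) ** 2
        den = p * norm_pdf(x, -delta) + (1.0 - p) * norm_pdf(x, 0.0)
        total += num / den
    return total * h

# As |delta| grows, the mass flattens to the constant (1 - p): the prior on
# each mean difference tends to a constant and is not integrable at infinity.
```

With \(p = 0.5\), evaluating `prior_integrand_mass` at increasingly large values of `delta` returns values that are numerically indistinguishable from \(1 - p = 0.5\), matching the constancy claim used in the proof.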
Appendix F: Proof of Theorem 3.1
The posterior distribution of the parameters of a mixture model with location-scale components derived from the Jeffreys prior when all parameters of model (14) are unknown is improper.
Proof.
Consider a mixture model with components coming from a location-scale family. The proof considers Gaussian components; however, it may be generalized to any location-scale distribution. Consider the elements on the diagonal of the Fisher information matrix: again, since the Fisher information matrix is positive definite, its determinant is bounded by the product of the terms on the diagonal. Consider the reparametrization \(\tau = \sigma_1\) and \(\tau\sigma = \sigma_2\). Then it is straightforward to see that the integral of this part of the prior distribution will depend on a term \(\tau^{-(d+1)}\sigma^{-d}\), as seen in the proof of Lemma 3.2. The likelihood, on the other hand, is given by

\[
L(\theta) = \prod_{i=1}^n \left[ p\, N(x_i \mid \mu, \tau^2) + (1-p)\, N(x_i \mid \mu + \tau\delta, \tau^2\sigma^2) \right]
= \frac{1}{(2\pi)^{n/2}} \Bigg[ \frac{p^n}{\tau^n}\, e^{-\frac{1}{2\tau^2}\sum_{i=1}^n (x_i-\mu)^2}
+ \frac{1}{\tau^n \sigma} \sum_{i=1}^n p^{n-1}(1-p)\, e^{-\frac12 \sum_{j\neq i} \left(\frac{x_j-\mu}{\tau}\right)^2 - \frac12 \left(\frac{x_i-(\mu+\tau\delta)}{\tau\sigma}\right)^2}
+ \frac{1}{\tau^n \sigma^2} \sum_{i=1}^n \sum_{k\neq i} p^{n-2}(1-p)^2\, e^{-\frac12 \sum_{j\neq i,k} \left(\frac{x_j-\mu}{\tau}\right)^2}\, e^{-\frac{(x_i-(\mu+\tau\delta))^2 + (x_k-(\mu+\tau\delta))^2}{2\tau^2\sigma^2}}
+ \cdots + \frac{(1-p)^n}{\tau^n\sigma^n}\, e^{-\frac{1}{2\tau^2\sigma^2}\sum_{i=1}^n (x_i-(\mu+\tau\delta))^2} \Bigg]. \tag{17}
\]

When composing the prior with the part of the likelihood which only depends on the first component, this part provides no information about the parameter \(\sigma\) and the integral will diverge. In particular, for a two-component mixture model, the part of the posterior distribution corresponding to the first likelihood term and to the product of the diagonal terms of the Fisher information matrix has (written schematically) the form

\[
\int_0^1 \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \int_0^{\infty} \int_0^{\infty}
c\, \frac{p^n}{\tau^n}\, \exp\left\{ -\frac{1}{2\tau^2} \sum_{i=1}^n (x_i-\mu_1)^2 \right\}
\times I_p(\theta)\, I_{\mu_1}(\theta)\, I_{\mu_2}(\theta)\, I_{\tau}(\theta)\, I_{\sigma}(\theta)\;
d\tau\, d\sigma\, d\mu_1\, d\mu_2\, dp,
\]

where the \(I_{\cdot}(\theta)\) denote the integrals defining the diagonal terms of the Fisher information matrix associated with each parameter. These integrals do not represent an issue for convergence with respect to the scale parameters, because exponential terms going to 0 as the scale parameters tend to 0 are present. However, outside these integrals a factor \(\sigma^{-1}\) remains, which causes divergence: this particular part of the posterior distribution does not integrate. When considering the case of \(k\) components, the integral depends inversely on \(\sigma_2, \sigma_3, \cdots, \sigma_k\), which implies that the posterior always is improper.
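The role of the leftover inverse scale factor can be made concrete with a toy numerical check (our own illustration, not part of the paper): the mass \(\int_\varepsilon^1 \sigma^{-1}\, d\sigma = -\log \varepsilon\) accumulated near 0 grows without bound as \(\varepsilon \to 0\), so the corresponding part of the posterior cannot integrate.

```python
import math

def mass_near_zero(eps, n=200_000):
    """Midpoint Riemann sum of the integral of 1/sigma over (eps, 1),
    whose closed form is -log(eps): it diverges as eps -> 0."""
    h = (1.0 - eps) / n
    return sum(1.0 / (eps + (i + 0.5) * h) for i in range(n)) * h

# The numerical mass matches -log(eps) and blows up as eps shrinks,
# mirroring the divergence caused by the leftover 1/sigma factor.
```

Every time the lower truncation point is divided by 10, the accumulated mass increases by another \(\log 10 \approx 2.30\), with no finite limit.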
Appendix G: Improperness of the posterior distribution deriving from themultivariate Jeffreys prior
Since the posterior distribution which follows from the use of the multivariate Jeffreys prior on the complete set of parameters is improper, we expect to see non-convergent behavior in the MCMC simulations, in particular for small sample sizes. For small sample sizes, the chains tend to get stuck when very small values of the standard deviations are accepted. Figure 14 shows the results for different sample sizes and different scenarios (in particular, the situations where the means are close to or well separated from one another) for a mixture model with two and three Gaussian components: sometimes the chains do not converge and drift towards very extreme values of the means, sometimes the chains get stuck at very small values of the standard deviations.

The improperness of the posterior distribution is not only due to the scale parameters: we may reparametrize the problem as in Equation (8) and use a proper prior on the parameter \(\sigma\), for example, by following Robert and Mengersen (1999),

\[
p(\sigma) = \frac12\, \mathcal{U}_{[0,1]}(\sigma) + \frac12\, \frac{1}{\sigma^2}\, \mathbb{I}_{[1,+\infty)}(\sigma),
\]

and the Jeffreys prior for all the other parameters \((p, \mu, \delta, \tau)\) conditionally on \(\sigma\), and still face the same issue. Actually, using a proper prior on \(\sigma\) does not avoid convergence trouble, as demonstrated by Figure 15, which shows that, even if the chains for the standard deviations are no longer stuck around 0 when using a proper prior for \(\sigma\) in the reparametrization proposed by Robert and Mengersen (1999), the chains for the location parameters exhibit a divergent behavior.

Fig 14. All parameters unknown: results from 50 replications of the experiment with close means (solid lines) and well-separated means (dashed lines). The graph shows the proportion of Monte Carlo chains stuck at values of the standard deviations close to zero (blue lines) and the proportion of chains diverging towards high values of the means. The case of a two-component GMM is on the left, the case of a three-component GMM on the right.

Fig 15. All parameters unknown, proper prior on \(\sigma\), two-component GMM: results from 50 replications of the experiment for both close means (solid lines) and far means (dashed lines). The graph shows the proportion of Monte Carlo chains stuck at values of the standard deviations close to 0 (blue lines) and the proportion of chains diverging towards high values of the means (red lines).

These problems are overcome by the hierarchical prior proposed in Section 4: a simulation study (not shown) along the lines of the one just presented for the posterior distribution deriving from the multivariate Jeffreys prior confirms that the chains obtained via MCMC for 50 replications of the experiment always exhibit convergent behavior despite the posterior being improper.
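The proportions plotted in Figures 14 and 15 are simple summaries of the MCMC output. A minimal sketch of this kind of monitoring is given below; the function names and thresholds are our own and purely illustrative, not the diagnostics actually coded for the paper.

```python
def stuck_proportion(sigma_chains, eps=1e-3, frac=0.9):
    """Proportion of chains whose standard-deviation draws stay below `eps`
    for at least a fraction `frac` of the iterations (the 'stuck at 0' symptom)."""
    stuck = sum(
        1 for chain in sigma_chains
        if sum(1 for s in chain if s < eps) >= frac * len(chain)
    )
    return stuck / len(sigma_chains)

def divergent_proportion(mu_chains, bound=1e3):
    """Proportion of chains whose location draws escape towards extreme values."""
    return sum(
        1 for chain in mu_chains if max(abs(m) for m in chain) > bound
    ) / len(mu_chains)

# Summaries of this kind, computed over replicated runs, yield curves
# analogous to the blue and red lines of Figures 14-15.
```

Applying the two functions to the post burn-in draws of each replication, and repeating over sample sizes, reproduces the kind of proportion-versus-sample-size curves shown in the figures.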
Appendix H: The properness of the hierarchical representation of Theorem 4.1
The posterior distribution derived from the hierarchical representation of the Gaussian mixture model associated with (9), (10) and (11) is proper.
Proof.
Consider the composition of the three levels of the hierarchical model described in Equations (9), (10) and (11):

\[
\pi(\mu_1,\sigma_1,\mu_2,\sigma_2,\mu_0,\zeta,p;\, x) \propto
L(\mu_1,\mu_2,\sigma_1,\sigma_2;\, x)\; p^{-1/2}(1-p)^{-1/2}
\times \frac{1}{2\pi\zeta^2} \exp\left\{ -\frac{(\mu_1-\mu_0)^2 + (\mu_2-\mu_0)^2}{2\zeta^2} \right\}
\times \left[ \frac12\, \frac{1}{\zeta}\, \mathbb{I}_{(0,\zeta)}(\sigma_1) + \frac12\, \frac{\zeta}{\sigma_1^2}\, \mathbb{I}_{(\zeta,+\infty)}(\sigma_1) \right]
\times \left[ \frac12\, \frac{1}{\zeta}\, \mathbb{I}_{(0,\zeta)}(\sigma_2) + \frac12\, \frac{\zeta}{\sigma_2^2}\, \mathbb{I}_{(\zeta,+\infty)}(\sigma_2) \right], \tag{18}
\]

where \(L(\cdot\,;x)\) is given by Equation (17). Once again, we can initialize the proof by considering only the first term in the sum composing the likelihood function for the mixture model. Then the product in (18) may be split into four terms corresponding to the different terms in the scale parameters' prior. For instance, the first term is

\[
\int_0^{\infty} \int_{-\infty}^{\infty} \int_{\mathbb{R}} \int_{\mathbb{R}} \int_{\mathbb{R}^+} \int_{\mathbb{R}^+} \int_0^1
\frac{p^n}{\sigma_1^n}\, \exp\left\{ -\frac{\sum_{i=1}^n (x_i-\mu_1)^2}{2\sigma_1^2} \right\}
\times \frac{1}{\zeta^2} \exp\left\{ -\frac{(\mu_1-\mu_0)^2 + (\mu_2-\mu_0)^2}{2\zeta^2} \right\}
\times \frac14\, \frac{1}{\zeta^2}\, \mathbb{I}_{(0,\zeta)}(\sigma_1)\, \mathbb{I}_{(0,\zeta)}(\sigma_2)\;
dp\, d\sigma_1\, d\sigma_2\, d\mu_1\, d\mu_2\, d\mu_0\, d\zeta
\]

and the second one

\[
\int_0^{\infty} \int_{-\infty}^{\infty} \int_{\mathbb{R}} \int_{\mathbb{R}} \int_{\mathbb{R}^+} \int_{\mathbb{R}^+} \int_0^1
\frac{p^n}{\sigma_1^n}\, \exp\left\{ -\frac{\sum_{i=1}^n (x_i-\mu_1)^2}{2\sigma_1^2} \right\}
\times \frac{1}{\zeta^2} \exp\left\{ -\frac{(\mu_1-\mu_0)^2 + (\mu_2-\mu_0)^2}{2\zeta^2} \right\}
\times \frac14\, \frac{1}{\zeta}\, \frac{\zeta}{\sigma_2^2}\, \mathbb{I}_{(0,\zeta)}(\sigma_1)\, \mathbb{I}_{(\zeta,\infty)}(\sigma_2)\;
dp\, d\sigma_1\, d\sigma_2\, d\mu_1\, d\mu_2\, d\mu_0\, d\zeta.
\]

The integrals with respect to \(\mu_0\), \(\mu_1\) and \(\mu_2\) converge, since the data carry information about \(\mu_0\) through \(\mu_1\). The integral with respect to \(\sigma_1\) converges as well, because, as \(\sigma_1 \to 0\), the exponential function goes to 0 faster than \(\sigma_1^{-n}\) goes to \(\infty\) (integrals where \(\sigma_1 > \zeta\) are not considered here because this reasoning easily extends to those cases). The integrals with respect to \(\sigma_2\) converge, because they provide factors proportional to \(\zeta\) and \(1/\zeta\) respectively, which simplify with the normalizing constant of the reference distribution (the uniform in the first case and the Pareto in the second one). Finally, the term \(1/\zeta\) resulting from the previous operations has its counterpart in the integrals relative to the location priors; therefore, the integral with respect to \(\zeta\) converges. The part of the posterior distribution relative to the weights is not an issue, since the weights belong to the corresponding simplex.

Appendix I: Effect of the sample size on the conservativeness of the Jeffreys prior
This appendix shows the estimation of the density (19) when a higher number of components is assumed, together with a Jeffreys prior for the weights of the mixture, for sample sizes 50, 100, 500 and 1,000. The true density (19) is a two-component Gaussian mixture, with a first component of unit variance and negative mean and a second component centred at 3.

Fig 16. Estimated densities in 20 replications of the experiment (in grey) against the true model (in red) for n = 50; the panels correspond to fitted models with k = 2, 3, 4 and 5 components.

Figures 16-19 show the M = 20 resulting estimated densities against (19); as the number of components increases, the estimated densities are less and less smooth, but this feature is mitigated as the sample size increases.

Appendix J: Implementation Features
The computational cost of deriving the Jeffreys prior for a set of parameter values is in \(O(d^2)\) if \(d\) is the total number of (independent) parameters: each element of the Fisher information matrix is an integral of the form

\[
-\int_{\mathcal{X}} \frac{\partial^2 \log\left[ \sum_{h=1}^k p_h f(x\mid\theta_h) \right]}{\partial\theta_i\, \partial\theta_j} \left[ \sum_{h=1}^k p_h f(x\mid\theta_h) \right] dx,
\]

which has to be approximated. We have applied both numerical integration and Monte Carlo integration, and simulations show that, in general, numerical integration via Gauss-Kronrod quadrature produces more stable results. Nevertheless, when the values of one or more standard deviations or weights are too small, either the approximations tend to be very dependent on the bounds used for numerical integration (usually chosen to omit a negligible part of the density), or the numerical approximation may not even be applicable. In this case, Monte Carlo integration appears more stable, where the stability of the results depends on the Monte Carlo sample size.

Figure 20 shows the value of the Jeffreys prior obtained via Monte Carlo integration of the elements of the Fisher information matrix for an increasing number of Monte Carlo simulations, both in a case where the Jeffreys prior diverges (where the standard deviations are small) and in a case where it assumes low values. The value obtained via Monte Carlo integration is then compared with the value obtained via numerical integration. The Monte Carlo sample size may then be chosen as the point where the graph stabilizes, i.e., the number of simulations needed to reach a fixed amount of variability.

Since the approximation problem is one-dimensional, another numerical solution may be based on Riemann sums; Figure 21 compares the approximations of the Jeffreys prior obtained via Monte Carlo integration and via Riemann sums: the Riemann sums clearly lead to more stable results than Monte Carlo integration. On the other hand, they can be applied in more situations than the Gauss-Kronrod quadrature, in particular in cases where the standard deviations are very small. Nevertheless, when the standard deviations are smaller still, one has to pay attention to the features of the function to integrate. In fact, the mixture density tends to concentrate around the modes, with regions of density close to 0 between them. The elements of the Fisher information matrix are, in general, ratios between the components' densities and the mixture density, so in those regions an indeterminate form of type 0/0 is obtained; Figure 22 represents the behavior of one of these elements when \(\sigma_i \to 0\), \(i = 1, \cdots, k\).

Thus, we have decided to use Riemann sums (with a number of points equal to 550) to approximate the Jeffreys prior when the standard deviations are sufficiently large, and Monte Carlo integration (with sample sizes of 1,500) when they are too small. In this case, the variability of the results seems to decrease as \(\sigma_i\) approaches 0, as shown in Figure 23.

Fig 17. As in Figure 16, for n = 100.

Fig 18. As in Figure 16, for n = 500.

Fig 19. As in Figure 16, for n = 1000.

Fig 20. Jeffreys prior obtained via Monte Carlo integration (and via numerical integration, in red) for a three-component Gaussian mixture model on which the prior takes low values (above) and for a three-component model with small standard deviations on which it diverges (below). On the x-axis, the number of Monte Carlo simulations.

Fig 21. Boxplots of 100 replications of the procedures based on Monte Carlo integration (above) and Riemann sums (below) approximating the Fisher information matrix of a three-component Gaussian mixture model, for increasing sample sizes. The value obtained via numerical integration is represented by the red line (in the graph below, all the approximations obtained with a sufficiently large number of knots give the same result, exactly equal to the one obtained via Riemann sums).

Fig 22. The first element on the diagonal of the Fisher information matrix relative to the first weight of a two-component Gaussian mixture model with a first component of variance 0.01 and a second component centred at 2.

Fig 23. Approximation of the Jeffreys prior (in log-scale) for a two-component Gaussian mixture model with a component centred at 2, where \(\sigma\) is taken equal for both components and increasing.
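As a concrete illustration of the two approximation schemes just discussed, the sketch below (our own minimal Python illustration, not the implementation used for the paper, with Gauss-Kronrod quadrature replaced by a midpoint Riemann sum for self-containment) approximates the Fisher-information element associated with the weight of a two-component Gaussian mixture, \(I_{pp} = \int (f_1(x) - f_2(x))^2 / f(x)\, dx\), both by a Riemann sum and by Monte Carlo averaging over draws from the mixture.

```python
import math
import random

def phi(x, mu):
    """Unit-variance Gaussian density N(x | mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def mixture(x, p, mu1, mu2):
    """Two-component mixture density p*N(mu1,1) + (1-p)*N(mu2,1)."""
    return p * phi(x, mu1) + (1.0 - p) * phi(x, mu2)

def fisher_pp_riemann(p, mu1, mu2, lo=-15.0, hi=15.0, n=3000):
    """Midpoint Riemann sum of I_pp = int (phi1 - phi2)^2 / f dx."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += (phi(x, mu1) - phi(x, mu2)) ** 2 / mixture(x, p, mu1, mu2)
    return total * h

def fisher_pp_monte_carlo(p, mu1, mu2, n=20000, seed=0):
    """Monte Carlo estimate: E_f[((phi1 - phi2)/f)^2] with x drawn from f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        mu = mu1 if rng.random() < p else mu2   # pick a component, then draw
        x = rng.gauss(mu, 1.0)
        ratio = (phi(x, mu1) - phi(x, mu2)) / mixture(x, p, mu1, mu2)
        total += ratio ** 2
    return total / n
```

For moderate standard deviations the two estimates agree closely and the Riemann sum is the more stable of the two; when the scales become very small the fixed grid must instead track narrow, nearly disjoint modes, which is where the Monte Carlo version becomes preferable, in line with the hybrid strategy described above.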