Testing hypotheses via a mixture estimation model
Kaniav Kamary, Kerrie Mengersen, Christian P. Robert, Judith Rousseau
Kaniav Kamary
Universit´e Paris-Dauphine, CEREMADE
Kerrie Mengersen
Queensland University of Technology, Brisbane
Christian P. Robert
Universit´e Paris-Dauphine, CEREMADE, Dept. of Statistics, University of Warwick, and CREST, Saclay
Judith Rousseau
University of Oxford, Dept. of Statistics and Universit´e Paris-Dauphine, CEREMADE
Abstract.
We consider a novel paradigm for Bayesian testing of hypotheses and Bayesian model comparison. Instead of the traditional comparison of posterior probabilities of the competing hypotheses, given the data, we consider the hypotheses as components of a mixture model. We therefore replace the original testing problem with an estimation one that focuses on the probability, or weight, of a given hypothesis within the mixture model as the parameter of interest, and on the posterior distribution of this weight as the outcome of the test. A major differentiating feature of this approach is that generic improper priors are acceptable. For example, a reference Beta B(a₀, a₀) prior on the mixture weight parameter can be used for the common problem of testing two contrasting hypotheses. In this case, the sensitivity of the posterior estimates of the weights to the choice of a₀ vanishes as the sample size increases, leading to a consistent procedure and a suggested default choice of a₀ = 0.5. Another feature of this easily implemented alternative to the classical Bayesian solution is that the speeds of convergence of the posterior mean of the weight and of the corresponding posterior probability are quite similar.
Key words and phrases:
Noninformative prior, mixture of distributions, Bayesian analysis, testing statistical hypotheses, Dirichlet prior, posterior probability.

∗ Kaniav Kamary, CEREMADE, Université Paris-Dauphine, 75775 Paris cedex 16, France, [email protected]; Kerrie Mengersen, QUT, Brisbane, QLD, Australia, [email protected]; Christian P. Robert and Judith Rousseau, CEREMADE, Université Paris-Dauphine, 75775 Paris cedex 16, France, xian,[email protected]. Research partly supported by the Agence Nationale de la Recherche (ANR, 212 rue de Bercy, 75012 Paris) through the 2012–2015 grant ANR-11-BS01-0010 “Calibration” and by two 2010–2015 and 2016–2021 senior chair grants of the Institut Universitaire de France. Thanks to Phil O’Neill and Theo Kypraios from the University of Nottingham for their hospitality and a fruitful discussion that drove this research towards new pathways. We are also grateful to the participants of BAYSM’14, ISBA 16, several seminars, and to the members of the Bristol Bayesian Cake Club for their comments.
1. INTRODUCTION
1.1 An open problem
Statistical testing of hypotheses and the related model choice problem are central issues for statistical inference. Perspectives and methods relating to these issues have been developed over the past two centuries, but they remain an object of study and debate, in particular because the most standard approach, based on p-values, is open to misuse and abuse, as highlighted by a recent ASA warning (Wasserstein and Lazar, 2016), and also because the classical (frequentist) and Bayesian paradigms (Neyman and Pearson, 1933; Jeffreys, 1939; Berger and Sellke, 1987; Casella and Berger, 1987; Gigerenzer, 1991; Berger, 2003; Mayo and Cox, 2006; Gelman, 2008) are at odds, both conceptually and practically.

From the perspectives of Neyman–Pearson and Fisher, tests are constructed as a competition between so-called null and alternative hypotheses, and typically evaluated with respect to their ability to control the type I error, i.e., the probability of falsely rejecting the null hypothesis in favour of the alternative. These procedures therefore handle the two hypotheses differently, with an imbalance that makes the subsequent action of accepting the null hypothesis problematic. The ASA statement (Wasserstein and Lazar, 2016) recommends against basing decisions solely on the p-value and advocates the use of supplementary indicators such as those obtained by Bayesian methods.

However, from a Bayesian perspective, the handling of hypothesis testing is also problematic, for different reasons (Jeffreys, 1939; Bernardo, 1980; Berger, 1985; Aitkin, 1991; Berger and Jefferys, 1992; De Santis and Spezzaferri, 1997; Bayarri and Garcia-Donato, 2007; Christensen et al., 2011; Johnson and Rossell, 2010; Gelman et al., 2013a; Robert, 2014).
In particular, we consider that the issue of non-informative Bayesian hypothesis testing is still mostly unresolved, both theoretically and in practice, despite having produced much debate and many proposals; witness the specific case of the Lindley or Jeffreys–Lindley paradox (Lindley, 1957; Shafer, 1982; DeGroot, 1982; Robert, 1993; Lad, 2003; Spanos, 2013; Sprenger, 2013; Robert, 2014) and discussions about pseudo-Bayes factors (Aitkin, 1991; Berger and Pericchi, 1996; O’Hagan, 1995, 1997; Aitkin, 2010; Gelman et al., 2013b).

There are similar difficulties with Bayesian model selection. Several perspectives can be defended even for the canonical problem of comparing models with the aim of choosing one of them. For instance, Hoeting et al. (1999) propose a model averaging approach; Christensen et al. (2011) argue that this is a decision issue that pertains to testing; Robert (2001) expresses this as a model index estimation setting; Gelman et al. (2013a) prefer to rely on more exploratory predictive tools; and Park and Casella (2008) and Rockova and George (2016) restrict the inference to a maximisation problem in a sparse model or variable selection context.

Under all the frameworks described above, a generally accepted perspective is that hypothesis testing and model selection do not primarily seek to identify which alternative or model is “true” (if any). From a Bayesian perspective, hypotheses can be formulated as models, and hypothesis testing can therefore be viewed as a form of model selection, in which the aim is to compare several potential statistical models in order to identify the model that is most strongly supported by the data (see, e.g., Berger and Jefferys, 1992; Madigan and Raftery, 1994; Balasubramanian, 1997; MacKay, 2002; Consonni et al., 2013).
For this reason, we will use the terms hypotheses and models interchangeably in the context of hypothesis testing and model choice.

The most common approaches to Bayesian hypothesis testing in practice are posterior probabilities of the model given the data (see, e.g., Robert, 2001), the Bayes factor (Jeffreys, 1939) and its approximations such as the Bayesian information criterion (BIC) and the Deviance information criterion (DIC) (Schwarz, 1978; Csiszár and Shields, 2000; Spiegelhalter et al., 2002; Forbes et al., 2006; Plummer, 2008), and posterior predictive tools and their variants (Gelman et al., 2013a; Vehtari and Ojanen, 2012). For example, consider two families of models, one for each of the hypotheses under comparison,
$$\mathfrak{M}_1 : x \sim f_1(x\mid\theta_1),\ \theta_1 \in \Theta_1 \qquad\text{and}\qquad \mathfrak{M}_2 : x \sim f_2(x\mid\theta_2),\ \theta_2 \in \Theta_2.$$
Following Berger (1985) and Robert (2001), a standard Bayesian approach is to associate with each of those models a prior distribution, θ₁ ∼ π₁(θ₁) and θ₂ ∼ π₂(θ₂), and to compute the marginal or integrated likelihoods
$$m_1(x) = \int_{\Theta_1} f_1(x\mid\theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1 \qquad\text{and}\qquad m_2(x) = \int_{\Theta_2} f_2(x\mid\theta_2)\,\pi_2(\theta_2)\,\mathrm{d}\theta_2,$$
used either through the Bayes factor or through the posterior probability, respectively:
$$B_{12} = \frac{m_1(x)}{m_2(x)}, \qquad \mathbb{P}(\mathfrak{M}_1 \mid x) = \frac{\omega_1 m_1(x)}{\omega_1 m_1(x) + \omega_2 m_2(x)}, \qquad \omega_1 + \omega_2 = 1,\ \omega_i \ge 0,$$
the ωᵢ being the prior probabilities of both models. The Bayesian decision step proceeds by comparing the Bayes factor B₁₂ to the threshold value of one, or by comparing the posterior probability P(M₁ | x) to a bound derived from a 0–1 loss function or a “golden” bound like α = 0.05 inspired from frequentist practice (Berger and Sellke, 1987; Berger et al., 1997, 1999; Berger, 2003; Ziliak and McCloskey, 2008). As a general rule, when comparing more than two models, the model with the largest posterior probability is selected, but this rule is highly dependent on the prior modelling, even with large datasets, which makes it hard to promote as the default solution in practical studies.
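As a hedged numerical sketch of the formulas above, consider the simplest case where each hypothesis fully specifies the sampling density, so that each marginal likelihood mᵢ(x) reduces to a plain likelihood; the two Normal models and all numerical choices below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def log_lik(x, mean, sd):
    """Log-likelihood of the sample under a fully specified Normal model.
    With no free parameters, the marginal likelihood m_i(x) is the likelihood itself."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sd**2) - (x - mean)**2 / (2 * sd**2)))

def posterior_prob_m1(x, w1=0.5, w2=0.5):
    """P(M1 | x) = w1 m1(x) / (w1 m1(x) + w2 m2(x)), computed on the log scale."""
    log_m1 = log_lik(x, mean=0.0, sd=1.0)   # M1: N(0, 1) -- illustrative choice
    log_m2 = log_lik(x, mean=1.0, sd=1.0)   # M2: N(1, 1) -- illustrative choice
    log_b12 = log_m1 - log_m2               # log Bayes factor B12
    # logistic transform of log(w1 m1) - log(w2 m2), stable for large |log B12|
    return 1.0 / (1.0 + np.exp(-(log_b12 + np.log(w1 / w2))))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)          # data actually generated from M1
p1 = posterior_prob_m1(x)                   # strongly favours M1, as expected
```

With data genuinely drawn from M₁, the log Bayes factor grows linearly in n, so P(M₁ | x) is driven to one; the example mirrors the linear impact of the prior weights ω₁, ω₂ noted in the text.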
Given that the difficulties associated with the traditional handling of posterior probabilities for Bayesian testing and model selection are well documented and comprehensively reviewed by Vehtari and Lampinen (2002) and Vehtari and Ojanen (2012), we do not replicate or dwell further on this discussion, nor do we attempt to resolve the attendant problems. Instead, we propose a different approach, which we argue provides a convergent and naturally interpretable solution, a measure of uncertainty on the outcome, a wider range of prior modelling, and straightforward calibration tools.

The proposed approach is described here in the context of a hypothesis test or model selection problem with two alternatives. We represent the distribution of each individual observation as a two-component mixture between both models M₁ and M₂. The resulting mixture model is thus an encompassing model, as it contains both models under comparison as special cases, namely the extreme cases whose weights are located at the boundaries of the interval (0, 1).

While mixtures are natural tools for classification and clustering, which can separate a sample into observations associated with each model, we argue that the posterior distribution on the weights provides a relevant indicator of the strength of support for each model given the data. Given a sufficiently large sample from model M₂, say, this distribution will almost surely concentrate near 0. Starting from this theoretical guarantee, we can calibrate the degree of concentration near 0 versus 1 for the current sample size by comparing the posterior distribution to posteriors associated with simulated samples from models M₁ and M₂, respectively. Even though we fundamentally object to returning a decision about “the” model corresponding to the data (McShane et al., 2018), and thus would like to halt the statistician’s input at returning the above posterior, it is furthermore straightforward to produce a posterior median estimate of the component weight that can be compared with realisations from each model. Quite obviously, the mixture posterior produced by a standard Bayesian analysis (Frühwirth-Schnatter, 2006) also provides information about the model parameters and the presence of potential outliers in the data.

With regard to the classical approach to Bayesian hypothesis testing, this mixture representation is not equivalent to the use of a posterior probability. In fact, a posterior estimate of the mixture weight cannot be viewed as a proxy for the numerical value of this posterior probability, which we do not see as a worthwhile tool for testing, for reasons given below. As mentioned in the previous paragraph, this new tool can be calibrated in its own right, while allowing for a degree of uncertainty in the hypothesis evaluation, which is not the case for the Bayes factor. In particular, while posterior probabilities are scaled against the (0, 1) interval, it can be argued (Fraser, 2011; Fraser et al., 2009, 2016) that they cannot be taken at face value because of their lack of frequentist coverage and hence need to be calibrated as well. Furthermore, the mixture approach offers the valuable feature of limiting the number of parameters in the model and hence is in keeping with Occam’s razor; see, e.g., Jefferys and Berger (1992); Rasmussen and Ghahramani (2001).

The plan of the paper is as follows. Section 2 provides a description of the mixture model specifically created for this setting and presents a simple example of its implementation. Section 3 expands Rousseau and Mengersen (2011) to provide conditions on the hyperparameters of the mixture model that are sufficient to achieve convergence. The performance of the mixture approach is then illustrated through three further examples in Section 4, and concluding remarks about the general applicability of the method are made in Section 5.
2. TESTING PROBLEMS AS ESTIMATING MIXTURE MODELS
2.1 A new paradigm for testing
Following from the above, given two classes of statistical models, M₁ and M₂, which may correspond to a hypothesis to be tested and its alternative, respectively, it is always possible to embed both models within an encompassing mixture model
$$(1)\qquad \mathfrak{M}_\alpha : x \sim \alpha f_1(x\mid\theta_1) + (1-\alpha) f_2(x\mid\theta_2), \qquad 0 \le \alpha \le 1.$$
Indeed, both models correspond to very special cases of the mixture model, one for α = 1 and the other for α = 0 (with a slight notational inconsistency in the indices).

The choice of possible encompassing models is obviously unlimited: for instance, a geometric mixture (Meng, 2016, personal communication)
$$x \sim f_\alpha(x) \propto f_1(x\mid\theta_1)^{\alpha}\, f_2(x\mid\theta_2)^{1-\alpha}$$
is a conceivable alternative. However, such alternatives are less practical to manage, starting with the issue of the intractable normalizing constant. Note also that when f₁ and f₂ are Gaussian densities, the geometric mixture remains Gaussian for all values of α. Similar drawbacks can be found with harmonic mixtures.

When considering a sample (x₁, …, xₙ) from one of the two models, the mixture representation still holds at the likelihood level, namely the likelihood for each model is a special case of the weighted sum of both likelihoods. However, this is not directly appealing for estimation purposes since it corresponds to a mixture with a single observation. See however O’Neill and Kypraios (2014) for a computational solution based upon this representation.

What we propose in this paper is to draw inference on the individual mixture representation (1), acting as if each observation was individually and independently produced by the mixture model. Hence α represents the probability that a new observation is sampled from f₁ and thus belongs to model M₁.
The approach proposed here therefore aims at answering the question: what is the proportion of the data that supports one model? This gives a definite predictive flavour to the testing problem.

Here are five advantages we see for this approach.

First, if the data were indeed generated from model M₁, then the Bayesian estimate of the weight α and the posterior probability of model M₁ produce equally convergent indicators of preference for this model (see Section 3). Moreover, the posterior distribution of α evaluates more thoroughly the strength of the support for a given model than the single-figure outcome of a Bayes factor or of a posterior probability, while the variability of the posterior distribution on α allows for a more thorough assessment of the strength of the support of one model against the other. Indeed, the approach allows for the possibility that, for a finite dataset, one model, both models, or neither model could be acceptable, as illustrated in Section 4.

Second, the mixture approach also removes the need for artificial prior probabilities on the model indices, ω₁ and ω₂. These priors are rarely discussed in a classical Bayesian approach, even though they linearly impact the posterior probabilities. Under the new approach, prior modelling only involves selecting an operational prior on α, for instance a Beta B(a₀, a₀) distribution, with a wide range of acceptable values for the hyperparameter a₀, as demonstrated in Section 3.
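To make the encompassing representation (1) concrete, here is a minimal sketch of the mixture density and of sampling from it, using the paper's Normal components N(θ, 1) and N(θ, 2) as a running example; the parameter values and grid are illustrative placeholders.

```python
import numpy as np

def norm_pdf(x, mean, sd):
    """Normal density, standing in for the component densities f1 and f2."""
    return np.exp(-(x - mean)**2 / (2 * sd**2)) / np.sqrt(2 * np.pi * sd**2)

def mixture_pdf(x, alpha, theta1=0.0, theta2=0.0):
    """Encompassing model (1): alpha f1(x|theta1) + (1 - alpha) f2(x|theta2),
    with f1 = N(theta1, 1) and f2 = N(theta2, 2) (variance 2)."""
    return (alpha * norm_pdf(x, theta1, 1.0)
            + (1 - alpha) * norm_pdf(x, theta2, np.sqrt(2.0)))

def sample_mixture(n, alpha, theta1=0.0, theta2=0.0, rng=None):
    """Each observation is drawn from f1 with probability alpha, else from f2."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.random(n) < alpha                       # latent component indicators
    return rng.normal(np.where(z, theta1, theta2), np.where(z, 1.0, np.sqrt(2.0)))

# alpha = 1 recovers M1 exactly and alpha = 0 recovers M2 exactly
grid = np.linspace(-12.0, 12.0, 9601)
total = float(np.sum(mixture_pdf(grid, 0.3)) * (grid[1] - grid[0]))  # ~1
```

The boundary cases α ∈ {0, 1} reproduce the two competing models, which is exactly the sense in which (1) encompasses both.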
While the value of a₀ impacts the posterior distribution of α, it can be argued that (a) it nonetheless leads to an accumulation of the mass near 1 or 0; (b) a sensitivity analysis on the impact of a₀ is straightforward to carry out; and (c) in most settings the approach can be easily calibrated by a parametric bootstrap experiment, so that the prior predictive error can be directly estimated and can drive the choice of a₀ if need be.

Third, the problematic computation (Chen et al., 2000; Marin and Robert, 2011) of the marginal likelihoods is bypassed, since standard algorithms are available for Bayesian mixture estimation (Richardson and Green, 1997; Berkhof et al., 2003; Frühwirth-Schnatter, 2006; Lee et al., 2009). Moreover, the (simultaneously conceptual and computational) difficulty of “label switching” (Celeux et al., 2000; Stephens, 2000; Jasra et al., 2005) that plagues both Bayesian estimation and Bayesian computation for most mixture models completely vanishes in this particular context, since the components are no longer exchangeable. In particular, we compute neither a Bayes factor nor a posterior probability related with the substitute mixture model, and we hence avoid the difficulty of recovering the modes of the posterior distribution (Berkhof et al., 2003; Lee et al., 2009; Rodriguez and Walker, 2014). Our perspective is completely centred on estimating the parameters of a mixture model where both components are always identifiable.

Fourth, the extension to a finite collection of models to be compared is straightforward, as this simply involves a larger number of components. The mixture approach allows consideration of all these models at once rather than engaging in costly pairwise comparisons. It is thus possible to eliminate the least likely models from simulations, since they will not be explored by the corresponding computational algorithm (Carlin and Chib, 1995; Richardson and Green, 1997). (Dependent observations like Markov chains can be modeled by a straightforward extension of (1), where both terms in the mixture are conditional on the relevant past observations. Using a Bayes factor to test for the number of components in the mixture (1), as in Richardson and Green (1997), would also be possible; however, the outcome would fail to answer the original question of selecting between both (or more) models.)

Finally, while standard (proper and informative) prior modeling can be painlessly reproduced in this novel setting, non-informative (improper) priors are also permitted, provided both models under comparison are first reparameterised so that they share parameters with common meaning. For instance, in the special case when all parameters make sense in both models, the mixture model (1) can read as
$$\mathfrak{M}_\alpha : x \sim \alpha f_1(x\mid\theta) + (1-\alpha) f_2(x\mid\theta), \qquad 0 \le \alpha \le 1.$$
For instance, if θ is a location parameter, a flat prior π(θ) ∝ 1 can then be used for both components. (While this may sound like an extremely restrictive requirement in a traditional mixture model, we stress here that the presence of common-meaning parameters becomes quite natural within a testing setting. To wit, when comparing two different models for the same data, moments like E[X^γ] are defined in terms of the observed data and hence should be the same for both models. Reparametrising the models in terms of those common-meaning moments does lead to a mixture model with some, and maybe all, common parameters.)

In order to illustrate the proposed approach, consider a hypothesis test between a Normal N(θ₁, 1) and a Normal N(θ₂, 2) distribution. We construct the mixture so that the same location parameter θ is used in both the Normal N(θ, 1) and the Normal N(θ, 2) distributions. This allows the use of Jeffreys’ (1939) noninformative prior π(θ) = 1, in contrast with the corresponding Bayes factor. We thus embed the test in the mixture of Normal models
$$\alpha\,\mathcal{N}(\theta, 1) + (1-\alpha)\,\mathcal{N}(\theta, 2),$$
with a Beta B(a₀, a₀) prior on α. In this case, considering the posterior distribution on (α, θ), conditional on the allocation vector ζ, leads to conditional independence between θ and α:
$$\theta \mid \mathbf{x}, \zeta \sim \mathcal{N}\!\left(\frac{n_1\bar{x}_1 + 0.5\,n_2\bar{x}_2}{n_1 + 0.5\,n_2},\ \frac{1}{n_1 + 0.5\,n_2}\right), \qquad \alpha \mid \zeta \sim \mathcal{B}e(a_0 + n_1,\ a_0 + n_2),$$
where nᵢ and x̄ᵢ denote the number of observations and the empirical mean of the observations allocated to component i, respectively (with the convention that nᵢx̄ᵢ = 0 when nᵢ = 0). Since this conditional posterior distribution is well-defined for every possible value of ζ, and since the distribution of ζ has a finite support, π(θ | x) is proper.

Note that, for this example, the conditional evidence π(x | ζ) can easily be derived in closed form, which means that a random walk on the allocation space {1, 2}ⁿ could be implemented. We did not follow that direction, as it seemed unlikely that such a random walk would be more efficient than a Metropolis–Hastings algorithm on the parameter space only.

In order to evaluate the convergence of the estimates of the mixture weights, we simulated 100 N(0, 1) datasets. Figure 1 displays the range of the posterior means and medians of α when either a₀ or n varies, showing the concentration effect (with a lingering impact of a₀) when n increases. We also included the posterior probability of M₁ in the comparison, derived from the Bayes factor
$$B_{12} = 2^{(n-1)/2} \Big/ \exp\Big\{\tfrac{1}{4}\sum_{i=1}^{n} (x_i - \bar{x})^2\Big\},$$
with equal prior weights, even though it is not formally well defined since it is based on an improper prior. The shrinkage of the posterior expectations towards 0.5 reflects the impact of the Beta prior on α. The same concentration phenomenon occurs for the N(0, 2) case, as illustrated in Figure 2 for a single N(0, 2) sample.

Fig 1: Normal Example: Boxplots of the posterior means (wheat) and medians of α (dark wheat), compared with a boxplot of the exact posterior probabilities of M₁ (gray), for N(0, 1) samples, derived from 100 datasets with sample sizes n = 15, 50, 100, and 500 (MCMC-based estimates).
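The conditional distributions given for this Normal example suggest a direct Gibbs sampler. Below is a minimal sketch, assuming components N(θ, 1) and N(θ, 2), a flat prior π(θ) = 1, and a Beta B(a₀, a₀) prior on α; the allocation step and all tuning values (a₀ = 0.5, iteration counts, seed) are our own illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def gibbs_mixture_test(x, a0=0.5, iters=2000, burn=500, rng=None):
    """Gibbs sampler for the mixture alpha*N(theta,1) + (1-alpha)*N(theta,2),
    alternating allocations zeta, location theta, and weight alpha."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    theta, alpha = float(np.mean(x)), 0.5
    alphas, thetas = [], []
    for t in range(iters):
        # 1. allocations: P(zeta_i = 1) proportional to alpha * N(x_i; theta, 1);
        #    the common factor 1/sqrt(2*pi) cancels in the ratio
        d1 = alpha * np.exp(-0.5 * (x - theta)**2)
        d2 = (1.0 - alpha) * np.exp(-0.25 * (x - theta)**2) / np.sqrt(2.0)
        z = rng.random(n) < d1 / (d1 + d2)
        n1 = int(np.sum(z)); n2 = n - n1
        s1 = float(np.sum(x[z]))      # n1 * xbar1 (zero when n1 = 0, per convention)
        s2 = float(np.sum(x[~z]))     # n2 * xbar2
        # 2. theta | x, zeta ~ N((n1 xbar1 + 0.5 n2 xbar2)/(n1 + 0.5 n2), 1/(n1 + 0.5 n2))
        prec = n1 + 0.5 * n2
        theta = rng.normal((s1 + 0.5 * s2) / prec, np.sqrt(1.0 / prec))
        # 3. alpha | zeta ~ Beta(a0 + n1, a0 + n2)
        alpha = rng.beta(a0 + n1, a0 + n2)
        if t >= burn:
            alphas.append(alpha); thetas.append(theta)
    return np.array(alphas), np.array(thetas)

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=500)           # data generated from M1: N(0, 1)
alphas, thetas = gibbs_mixture_test(x, a0=0.5, rng=rng)
```

For data from M₁ with n = 500, the posterior draws of α should pile up towards 1, echoing the concentration effect displayed in Figure 1.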
3. ASYMPTOTIC CONSISTENCY
In this section we study the asymptotic properties of our mixture testing procedure. More precisely, we study the asymptotic behaviour of the posterior distribution of α, and we prove that this posterior concentrates on the true value of α, in the sense that if model M₁ is correct the posterior distribution concentrates on α = 1, if model M₂ is correct then the posterior distribution concentrates on α = 0, and if neither is correct then the posterior concentrates on the value of α which minimizes the Kullback–Leibler divergence. This shows that the posterior on α leads to a consistent testing procedure. Moreover, we also study the separation rate associated with such a procedure when the models are embedded, and we show that the approach leads to the optimal separation rate, contrary to the Bayes factor, which has an extra √log n factor.

To do so, we consider two different cases. In the first case, the two models, M₁ and M₂, are well separated while, in the second case, model M₁ is a submodel of M₂. We denote by π the prior distribution on (α, θ) with θ = (θ₁, θ₂), and assume that θⱼ ∈ Θⱼ ⊂ R^{dⱼ}. We first prove that, under weak regularity conditions on each model, we can obtain posterior concentration rates for the marginal density
$$f_{\theta,\alpha}(\cdot) = \alpha f_{1,\theta_1}(\cdot) + (1-\alpha) f_{2,\theta_2}(\cdot).$$
Let xⁿ = (x₁, …, xₙ) be an n-sample with true density f*.

Proposition 1
Assume that, for all C > 0, there exist a subset Θₙ of Θ₁ × Θ₂ and constants B₁, B₂ ≥ 0 such that
$$(2)\qquad \pi[\Theta_n^c] \le n^{-C}, \qquad \Theta_n \subset \{\|\theta_1\| + \|\theta_2\| \le B_1 n^{B_2}\},$$
and that there exist H ≥ 0 and L, δ > 0 such that, for j = 1, 2,
$$\sup_{\theta,\theta' \in \Theta_n} \|f_{j,\theta_j} - f_{j,\theta'_j}\| \le L n^{H} \|\theta_j - \theta'_j\|, \qquad \theta = (\theta_1, \theta_2),\ \theta' = (\theta'_1, \theta'_2),$$
$$(3)\qquad \mathrm{KL}(f_{j,\theta_j}, f_{j,\theta_j^*}) \lesssim \|\theta_j - \theta_j^*\| \qquad \forall\, \|\theta_j - \theta_j^*\| \le \delta.$$
We then have that, when f* = f_{θ*,α*} with α* ∈ [0, 1], there exists M > 0 such that
$$\pi\big[(\alpha,\theta):\ \|f_{\theta,\alpha} - f^*\| > M\sqrt{\log n / n}\ \big|\ x^n\big] = o_p(1).$$

The proof of Proposition 1 is a direct consequence of Theorem 2.1 of Ghosal et al. (2000) and is thus omitted here. Condition (3) is a weak regularity condition on each of the candidate models. Combined with condition (2), it allows consideration of noncompact parameter sets in the usual way; see, for instance, Ghosal et al. (2000). It is satisfied in all examples considered in this paper.

We build on Proposition 1 to describe the asymptotic behaviour of the posterior distribution on the parameters. It is possible to sharpen the above posterior concentration rate into Mₙ/√n for any sequence Mₙ going to infinity by controlling the local entropy and obtaining precise upper bounds on neighbourhoods of f*. This is not useful in the case of separated models but becomes more important in the context of embedded models. Although we do not treat this here, following Kleijn and van der Vaart (2006), if the true distribution f does not belong to the embedding model f_{θ,α}, then the posterior will concentrate on the f* which minimizes the Kullback–Leibler divergence between f and f_{θ,α}, at a similar rate.

Fig 2: Normal Example: Posterior distributions of the mixture weight α (left) and of its logarithmic transform log α (right) under a Beta B(a₀, a₀) prior with a₀ = 0.1, 0.2, 0.3, 0.4, 0.5, 1, for a N(0, 2) sample of 10 observations, based on MCMC simulation.
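The remark that, under misspecification, the posterior targets the mixture minimizing the Kullback–Leibler divergence can be illustrated numerically. The sketch below fixes both component parameters and picks a hypothetical true density lying in neither model, then locates the pseudo-true weight α* by grid search; all distributional choices here are our own illustrative assumptions.

```python
import numpy as np

def norm_pdf(x, mean, var):
    """Normal density parameterised by variance."""
    return np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# grid for numerical integration
x = np.linspace(-12.0, 12.0, 8001)
dx = x[1] - x[0]

f_true = norm_pdf(x, 0.0, 1.44)   # hypothetical truth N(0, 1.2^2): in neither model
f1 = norm_pdf(x, 0.0, 1.0)        # component f1 = N(0, 1), location fixed at 0
f2 = norm_pdf(x, 0.0, 2.0)        # component f2 = N(0, 2)

def kl_to_mixture(alpha):
    """KL(f_true || alpha f1 + (1 - alpha) f2), approximated by a Riemann sum."""
    f_mix = alpha * f1 + (1.0 - alpha) * f2
    return float(np.sum(f_true * np.log(f_true / f_mix)) * dx)

a_grid = np.linspace(0.0, 1.0, 201)
kls = np.array([kl_to_mixture(a) for a in a_grid])
alpha_star = float(a_grid[np.argmin(kls)])   # pseudo-true weight the posterior targets
```

Since the mixture density is linear in α, the KL divergence is convex in α, so the grid minimum is the global pseudo-true value; with a true variance strictly between those of the two components, α* falls in the interior of (0, 1).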
Assume that both models are separated, in the sense that there is identifiability:

(4)  ∀ α, α′ ∈ [0, 1], ∀ θ_j, θ′_j, j = 1, 2:  P_{θ,α} = P_{θ′,α′} ⇒ α = α′, θ = θ′,

where P_{θ,α} denotes the distribution associated with f_{θ,α}. We assume that (4) also holds on the boundary of Θ₁ × Θ₂. In other words, the following separation condition holds:

inf_{θ₁∈Θ₁} inf_{θ₂∈Θ₂} ‖f_{1,θ₁} − f_{2,θ₂}‖ > 0,

and, for θ*_j ∈ Θ_j, j = 1, 2, if P_{θ_j} converges in the weak topology to P_{θ*_j}, then θ_j converges in the Euclidean topology to θ*_j. The following result then holds:

Theorem 1
Assume that (4) is satisfied, together with (2) and (3); then, for all ε > 0,

π[|α − α*| > ε | xⁿ] = o_p(1).

In addition, assume that the mapping θ_j → f_{j,θ_j} is twice continuously differentiable in a neighbourhood of θ*_j, j = 1, 2, that f_{1,θ*₁} − f_{2,θ*₂}, ∇f_{1,θ*₁} and ∇f_{2,θ*₂} are linearly independent as functions of y, and that there exists δ > 0 such that

∇f_{1,θ*₁}, ∇f_{2,θ*₂}, sup_{|θ₁−θ*₁|<δ} |D²f_{1,θ₁}|, sup_{|θ₂−θ*₂|<δ} |D²f_{2,θ₂}| ∈ L¹.

Then

(5)  π[|α − α*| > M √(log n / n) | xⁿ] = o_p(1).

Theorem 1 allows for the interpretation of the quantity α under the posterior distribution. In particular, if the data xⁿ are generated from model M₁ (resp. M₂), then the posterior distribution of α concentrates around α = 1 (resp. around α = 0), which establishes the consistency of our mixture approach.

We now consider the embedded case. In this Section we assume that M₁ is a submodel of M₂, in the sense that θ₂ = (θ₁, ψ) with ψ ∈ S ⊂ R^{d_ψ}, and that f_{2,θ₂} ∈ M₁ when θ₂ = (θ₁, ψ₀) for some given value ψ₀, say ψ₀ = 0. Condition (4) is no longer verified for all α's: we assume however that it is verified for all α, α* ∈ [0, 1) and that θ* = (θ*₁, ψ*) satisfies ψ* ≠ 0. In this case, under the same conditions as in Theorem 1, we immediately obtain the posterior concentration rate √(log n / n) for estimating α when α* ∈ [0, 1) and ψ* ≠ 0, and Theorem 1 implies that (5) holds, which in turn implies that, if α* = 1, i.e. if the distribution comes from model M₁,

π[1 − α > M √(log n / n) | xⁿ] = o_p(1).

We now treat the case where ψ* = 0; in other words, f* is in model M₁. As in Rousseau and Mengersen (2011), we consider both possible paths to approximate f*: either α goes to 1 or ψ goes to ψ₀ = 0. In the first case, called path 1, (α*, θ*) = (1, θ*₁, θ*₁, ψ) with ψ ∈ S; in the second, called path 2, (α*, θ*) = (α, θ*₁, θ*₁, 0) with α ∈ [0, 1]. We write P* for the associated true distribution and denote F*g = ∫ f*(x) g(x) dµ(x) for any integrable function g. For sparsity reasons, we consider the following structure for the prior on (α, θ):

π(α, θ) = π_α(α) π₁(θ₁) π_ψ(ψ),  θ = (θ₁, ψ).

This means that the parameter θ₁ is common to both models, i.e., that f_{2,θ₂} shares the parameter θ₁ with f_{1,θ₁}. Condition (4) is replaced by

(6)  P_{θ,α} = P* ⇒ α = 1, θ₁ = θ*₁, θ₂ = (θ*₁, ψ), or α ≤ 1, θ₁ = θ*₁, θ₂ = (θ*₁, 0).

Let Θ* be the above parameter set. As in the case of separated models, the posterior distribution concentrates on Θ*. We now describe more precisely the asymptotic behaviour of the posterior distribution, using Rousseau and Mengersen (2011). We cannot apply Theorem 1 of Rousseau and Mengersen (2011) directly, hence the following result is an adaptation of it. We require the following assumptions, with f* = f_{1,θ*₁}. For the sake of simplicity, we assume that Θ₁ and S are compact. The extension to noncompact sets can be handled similarly to Rousseau and Mengersen (2011).

B1 Regularity: θ₁ → f_{1,θ₁} and θ₂ → f_{2,θ₂} are 3 times continuously differentiable and

F*( f̄_{1,θ*₁} / f̲_{1,θ*₁} ) < +∞, where f̄_{1,θ*₁} = sup_{|θ₁−θ*₁|<δ} f_{1,θ₁}, f̲_{1,θ*₁} = inf_{|θ₁−θ*₁|<δ} f_{1,θ₁},
F*( sup_{|θ₁−θ*₁|<δ} |∇f_{1,θ₁}| / f̲_{1,θ*₁} ) < +∞,  F*( |∇f_{1,θ*₁}| / f_{1,θ*₁} ) < +∞,
F*( sup_{|θ₁−θ*₁|<δ} |D²f_{1,θ₁}| / f̲_{1,θ*₁} ) < +∞,  F*( sup_{|θ₁−θ*₁|<δ} |D³f_{1,θ₁}| / f̲_{1,θ*₁} ) < +∞.

B2 Integrability: There exists S₀ ⊂ S ∩ {|ψ| > δ₀}, for some positive δ₀ and satisfying Leb(S₀) > 0, such that for all ψ ∈ S₀,

F*( sup_{|θ₁−θ*₁|<δ} f_{2,θ₁,ψ} / f_{1,θ*₁} ) < +∞,  F*( sup_{|θ₁−θ*₁|<δ} f_{2,θ₁,ψ}³ / f_{1,θ*₁}³ ) < +∞.

B3 Stronger identifiability: Set ∇f_{2,θ*₁,ψ}(x) = ( ∇_{θ₁} f_{2,θ*₁,ψ}(x)ᵀ, ∇_ψ f_{2,θ*₁,ψ}(x)ᵀ )ᵀ. Then, for all ψ ∈ S with ψ ≠ 0, if η₀ ∈ R and η₁ ∈ R^{d₁},

(7)  η₀ (f_{1,θ*₁} − f_{2,θ*₁,ψ}) + η₁ᵀ [∇_{θ₁} f_{1,θ*₁} − ∇_{θ₁} f_{2,θ*₁,ψ}(x)] = 0 ⇔ η₀ = 0, η₁ = 0.

Assumptions B1–B3 are similar to, but weaker than, Rousseau and Mengersen (2011)'s set of conditions; in fact B3 is milder than the strong identifiability condition imposed in that paper. Hence these conditions are satisfied for a wide range of regular models. We can now state the main theorem:

Theorem 2
Given the model

f_{θ₁,ψ,α} = α f_{1,θ₁} + (1 − α) f_{2,θ₁,ψ},

assume that the data comprise the n-sample xⁿ = (x₁, · · · , xₙ) issued from f_{1,θ*₁} for some θ*₁ ∈ Θ₁, and that assumptions B1–B3 are satisfied. Then, for all sequences M_n going to infinity,

(8)  π[(α, θ): ‖f_{θ,α} − f*‖ > M_n/√n | xⁿ] = o_p(1).

If the prior π_α on α is a Beta B(a₀, a₀) distribution, with a₀ < d_ψ, and if the prior π₁ × π_ψ is absolutely continuous with positive and continuous density at (θ*₁, 0), then for all M_n going to infinity,

π[|α − 1| > M_n/√n | xⁿ] = o_p(1).

If a₀ > d_ψ, then for any e_n = o(1),

π[|α − 1| < e_n | xⁿ] = o_p(1).

Note that the phase transition in the behaviour of the posterior distribution occurs at a₀ < d_ψ versus a₀ > d_ψ, which is not quite the same as in Rousseau and Mengersen (2011). Theorems 1 and 2 imply that testing decisions can be taken based on the posterior distribution of 1 − α when a₀ < d_ψ. Indeed, in this case, if one considers a testing approach of the form: H₀ is rejected if π(1 − α > M_n/√n | xⁿ) ≥ 1/2, with M_n large or increasing to infinity, then this testing procedure is consistent under both the null and the alternative. In contrast to the Bayes factor, which converges to 0 under the alternative model M₂ exponentially quickly, the convergence rate of α to α* ≠ 1 is of order 1/√n. However, this does not mean that the separation rate of the procedure based on the mixture model is worse than that of the Bayes factor.
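The embedded setting of Theorem 2 can be mimicked with a toy encompassing model α N(0,1) + (1 − α) N(ψ,1), where M₁: ψ = 0 gives d_ψ = 1. The sketch below (our own illustration, with arbitrary prior and sampler settings, not taken from the paper) runs a latent-allocation Gibbs sampler in which the weight update α | ζ ∼ Beta(a₀ + n₁, a₀ + n₂) is the generic conditional for a two-component mixture weight under a Beta B(a₀, a₀) prior:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)   # data generated under M1 (psi = 0)
a0 = 0.5                              # Beta(a0, a0) prior on the weight alpha
alpha, psi = 0.5, 0.0
alpha_draws = []

def norm_pdf(z):
    return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)

for it in range(2000):
    # allocate each observation to M1 or M2 given the current (alpha, psi)
    p1 = alpha * norm_pdf(x)
    p2 = (1.0 - alpha) * norm_pdf(x - psi)
    z2 = rng.random(x.size) < p2 / (p1 + p2)   # True -> component M2
    n2 = int(z2.sum()); n1 = x.size - n2
    # conjugate update of the weight: alpha | zeta ~ Beta(a0 + n1, a0 + n2)
    alpha = rng.beta(a0 + n1, a0 + n2)
    # psi | zeta ~ N(sum_{M2} x_i / (n2 + 1), 1 / (n2 + 1)) under a N(0,1) prior
    var = 1.0 / (n2 + 1.0)
    psi = rng.normal(var * x[z2].sum(), np.sqrt(var))
    alpha_draws.append(alpha)

post = np.array(alpha_draws[500:])
assert 0.0 < post.mean() < 1.0   # at finite n the weight stays interior
```

Under H₀ the two components nearly coincide (ψ ≈ 0), so α is only weakly identified at finite n; Theorem 2 describes the slow drift of its posterior towards 1 when a₀ < d_ψ.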
On the contrary, while it is well known that the Bayes factor leads to a separation rate of order √(log n)/√n in parametric models, we show in the following theorem that our approach can lead to a testing procedure with a better separation rate, of order 1/√n. To prove this result we need to strengthen assumption B3 slightly:

B4 Second-order identifiability: Set D²_ψ f_{2,θ₁,0} as the second derivative of f_{2,θ₁,ψ} with respect to ψ calculated at θ₂ = (θ₁, 0). Then, for all θ₁ ∈ Θ₁, if η₁ ∈ R^{d₁} and η₂, η₃ ∈ R^{d_ψ},

(9)  η₁ᵀ ∇_{θ₁} f_{1,θ₁} + η₂ᵀ ∇_ψ f_{2,θ₁,0}(x) + η₃ᵀ D²_ψ f_{2,θ₁,0} η₃ = 0 ⇔ η₁ = 0, η₂ = η₃ = 0.

Note that condition B4 is very similar to the strong identifiability condition of Rousseau and Mengersen (2011).

Theorem 3

Given the model

f_{θ,α} = f_{θ₁,ψ,α} = α f_{1,θ₁} + (1 − α) f_{2,θ₁,ψ},  θ₂ = (θ₁, ψ),

assume that the data comprise the n-sample xⁿ = (x₁, · · · , xₙ) issued from f*_n = f_{2,θ₁,ₙ,ψₙ} for some sequences θ₁,ₙ ∈ Θ₁ and ψₙ ∈ S. Let assumptions B1–B4 be satisfied. Moreover, if the prior π_{θ₁,ψ} is absolutely continuous with positive and continuous density on Θ₁ × S, and if the prior π_α on α is a Beta B(a₀, a₀) distribution, then there exists M′ > 0 such that

sup_{θ₁,ₙ ∈ Θ₁, ‖ψₙ‖ ≥ Mₙ/√n} E_{θ₁,ₙ,ψₙ} π[|α − 1| ≤ M′Mₙ/√n | xⁿ] = o(1)

for any sequence Mₙ going to infinity such that Mₙ = o(√n).

Theorem 3 implies in particular that, if the testing procedure is: H₀ is rejected as soon as π(1 − α > M₀/√n | xⁿ) ≥ 1/2, with M₀ an arbitrarily large constant, then the separation rate is of order √(M₀/n). Although Theorem 3 holds for any value of a₀ and d_ψ, for the testing procedure to make sense one needs to choose a₀ < d_ψ since, otherwise, for any e_n = o(1), the posterior distribution satisfies π(1 − α < e_n | xⁿ) = o_p(1) under H₀. Calibrating the procedure by a prior predictive approach under both H₀ and H₁ will lead to a consistent testing procedure.
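Operationally, the rejection rule above reduces to an empirical tail probability of the posterior draws of α. A minimal sketch (the draws and the constant M are placeholders for actual MCMC output and a calibrated threshold):

```python
import numpy as np

def reject_H0(alpha_draws, n, M=5.0):
    """Reject H0 when the posterior mass of {1 - alpha > M / sqrt(n)}
    reaches 1/2, following the consistency results of Theorems 2 and 3."""
    alpha_draws = np.asarray(alpha_draws, dtype=float)
    tail = np.mean(1.0 - alpha_draws > M / np.sqrt(n))
    return tail >= 0.5

# stand-in posterior samples: concentrated near 1 (H0-like) versus spread out
assert not reject_H0(np.random.default_rng(0).beta(200, 1, 5000), n=1000)
assert reject_H0(np.random.default_rng(0).beta(2, 2, 5000), n=1000)
```

The constant M is best calibrated by the prior predictive simulations mentioned in the text rather than fixed a priori.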
4. ILLUSTRATIONS
In this Section, we present three further examples that demonstrate the performance of themixture estimation approach and provide confirmation of the consistency results obtained inSection 3. The first follows from the example given in Section 2.2 and is a direct application ofTheorem 1. The second is cast in a nonparametric setting and is an application of Theorem 2.The third example is a case study that illustrates the hypothesis testing approach in a regressionsetup.
Example 4.1
Inspired by Marin et al. (2014), we oppose the Normal N(µ, 1) model to the double-exponential L(µ, √2) model. The scale √2 of the double-exponential is fixed, while the location µ can be shared by both models, which allows for the use of the flat Jeffreys prior on µ. As in the example in Section 2.2, Beta distributions B(a₀, a₀) are compared with respect to their hyperparameter a₀. However, whereas in the previous example we illustrated that the posterior distribution of the weight of the true model converged to 1, we now consider a setting in which neither model is correct. We achieve this feature by simulating the data from a centred Normal distribution whose variance corresponds to neither model M₁ nor model M₂. In this specific case, both posterior means and medians of α fail to concentrate near 0 and 1 as the sample size increases, as shown in Figure 3. Thus, in the majority of cases in this experiment, the outcome indicates that neither of the two models is favored by the data. This example does not exactly follow the assumptions of Theorem 1 since the Laplace distribution is not differentiable everywhere. However, it is both almost surely differentiable and differentiable in quadratic mean, so we expect to see the same types of behaviour as predicted by Theorem 1.

Fig 3: Example 4.1: Ranges of posterior means (skyblue) and medians (dotted) of the weight α of model N(µ, 1) over 100 simulated datasets, for sample sizes from 1 to 1000. Each estimate is based on a Beta prior with a₀ = .1, .3, .5, respectively, and 10⁴ MCMC iterations.
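A minimal Metropolis–Hastings implementation of this example can be sketched as follows. This is our own illustration: step sizes, iteration counts, and the data-generating variance 1.5 (chosen only so that the data match neither model) are arbitrary. The sampler targets the posterior of (µ, α) in the mixture α N(µ, 1) + (1 − α) L(µ, √2) under the flat prior on µ and a Beta B(a₀, a₀) prior on α:

```python
import numpy as np

def log_post(mu, alpha, x, a0):
    """Log posterior for alpha*N(mu,1) + (1-alpha)*Laplace(mu, sqrt(2)),
    flat prior on mu, Beta(a0, a0) prior on alpha."""
    norm = np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)
    lap = np.exp(-np.abs(x - mu) / np.sqrt(2.0)) / (2.0 * np.sqrt(2.0))
    return (np.sum(np.log(alpha * norm + (1.0 - alpha) * lap))
            + (a0 - 1.0) * (np.log(alpha) + np.log(1.0 - alpha)))

def mh_mixture(x, a0=0.5, iters=5000, seed=2):
    rng = np.random.default_rng(seed)
    mu, logit_a = np.median(x), 0.0
    draws = np.empty((iters, 2))
    for t in range(iters):
        # random-walk proposal on (mu, logit(alpha)); the Jacobian of the
        # logit transform, log(alpha * (1 - alpha)), enters the ratio
        mu_p = mu + 0.2 * rng.normal()
        logit_p = logit_a + 0.5 * rng.normal()
        a, a_p = 1.0 / (1.0 + np.exp(-logit_a)), 1.0 / (1.0 + np.exp(-logit_p))
        num = log_post(mu_p, a_p, x, a0) + np.log(a_p * (1.0 - a_p))
        den = log_post(mu, a, x, a0) + np.log(a * (1.0 - a))
        if np.log(rng.random()) < num - den:
            mu, logit_a = mu_p, logit_p
        draws[t] = mu, 1.0 / (1.0 + np.exp(-logit_a))
    return draws

x = np.random.default_rng(0).normal(0.0, np.sqrt(1.5), 100)  # neither model is true
draws = mh_mixture(x)
mu_hat, alpha_hat = draws[1000:].mean(axis=0)
assert 0.0 < alpha_hat < 1.0 and abs(mu_hat) < 1.0
```

Working on the logit scale keeps α strictly inside (0, 1), which matters when a₀ < 1 makes the Beta prior unbounded at the endpoints.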
In this example, the Bayes factor associated with the Jeffreys prior is defined as

B₁₂ = [ exp{ −Σᵢ₌₁ⁿ (xᵢ − x̄)²/2 } / ( (2π)^{(n−1)/2} √n ) ]  /  [ ∫_{−∞}^{∞} exp{ −Σᵢ₌₁ⁿ |xᵢ − µ|/√2 } / (2√2)ⁿ dµ ],

where the denominator is available in closed form. As above, since the prior is improper, the Bayes factor is formally undefined, even though the classical Bayesian approach argues in favour of using the same prior on both µ's. Nonetheless, we employ it in order to compare Bayes estimators of α with the posterior probability of the model being a N(µ, 1) distribution. Based on a Monte Carlo experiment involving 100 replicas of the simulated datasets, Figure 4 demonstrates the reluctance of the estimates of α to approach 0 or 1, while P(M₁|x) varies over the whole range between 0 and 1 for all sample sizes considered here. While this is a weakly informative indication, the right-hand side of Figure 4 shows that, on average, the posterior estimates of α converge toward a value strictly between 0 and 1 that depends on a₀, while the posterior probabilities converge to about 0.6. In this respect, both criteria offer a similar interpretation of the data, because neither α nor P(M₁|x) provides definitive support for either model.
Fig 4: Example 4.1: (left) Boxplots of the posterior means (wheat) and medians (dark wheat) of α, and of the posterior probabilities of model N(µ, 1), over 100 simulated datasets for sample sizes n = 10, 40, 100, 500; (right) averages of the posterior means and posterior medians of α against the posterior probabilities P(M₁|x), for sample sizes going from 1 to 1000. Each posterior approximation is based on 10⁴ Metropolis–Hastings iterations.
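The Normal marginal in B₁₂ is in closed form and the Laplace marginal is a one-dimensional integral, so the Bayes factor is easy to evaluate directly. A sketch (grid bounds and sample size are our choices; the quadrature is a numerical stand-in for the closed-form piecewise expression mentioned in the text):

```python
import numpy as np

def log_bayes_factor(x):
    """log B12 for N(mu,1) versus L(mu, sqrt(2)) under the flat prior on mu."""
    n = x.size
    # Normal marginal: exp{-sum (x_i - xbar)^2 / 2} / ((2 pi)^{(n-1)/2} sqrt(n))
    log_m1 = (-0.5 * np.sum((x - x.mean()) ** 2)
              - 0.5 * (n - 1) * np.log(2.0 * np.pi) - 0.5 * np.log(n))
    # Laplace marginal by trapezoidal quadrature on a wide grid around the data
    mu = np.linspace(x.min() - 10.0, x.max() + 10.0, 20001)
    ll = (-np.abs(x[:, None] - mu[None, :]).sum(axis=0) / np.sqrt(2.0)
          - n * np.log(2.0 * np.sqrt(2.0)))
    m = ll.max()
    w = np.exp(ll - m)                      # rescaled to avoid underflow
    log_m2 = m + np.log(np.sum((w[1:] + w[:-1]) * np.diff(mu)) / 2.0)
    return log_m1 - log_m2

x = np.random.default_rng(0).normal(0.0, 1.0, 200)
log_b12 = log_bayes_factor(x)
p_m1 = 1.0 / (1.0 + np.exp(-log_b12))   # P(M1 | x) under equal prior weights
assert log_b12 > 0.0                     # Normal data favor the Normal model here
```

With data drawn from N(0, 1), the log Bayes factor grows roughly linearly in n, in contrast with the 1/√n behaviour of the mixture weight discussed above.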
Example 4.2
In this example we investigate a nonparametric goodness-of-fit problem: testing whether or not the data come from a Gaussian distribution. We represent non-Gaussian distributions as nonparametric mixtures of Gaussian distributions, so that our encompassing model becomes (with an abuse of notation)

M_α : α N(µ₀, σ²) + (1 − α) ∫_R N(µ, σ²) dP(µ),

where we consider a prior distribution on (µ₀, σ, P) defined by

µ₀ | σ ∼ N(0, τ²σ²),  σ² ∼ IG(b₁, b₂),  P ∼ DP(M, N(0, σ²τ²)),

where DP(M, G) denotes the Dirichlet process with base measure MG and IG(b₁, b₂) the inverse Gamma distribution with parameters (b₁, b₂). This model defines a standard nonparametric prior distribution on the density f of the observations. Although the model does not follow the theory developed in Section 3, since that theory is restricted to the parametric case, the general theory on nonparametric mixture models implies that the posterior distribution on f concentrates under M_α around the true density in the Hellinger or L₁ sense; see, for instance, Kruijer et al. (2010) or Ghosal and van der Vaart (2007). This implies that if the true distribution, with density f, is not Gaussian, i.e.

inf_{µ,σ} ‖f − ϕ_{µ,σ}‖ = δ > 0,

where ϕ_{µ,σ} is the density of a N(µ, σ²) random variable, then the posterior probability Π(α > 1 − δ/2 + ε | xⁿ), for all ε > 0, goes to 0 almost surely under f. This is a consequence of

‖f − αϕ_{µ₀,σ} − (1 − α) ∫_R ϕ_{µ,σ} dP(µ)‖ ≥ ‖f − αϕ_{µ₀,σ}‖ − (1 − α) ≥ ‖f − ϕ_{µ₀,σ}‖ − 2(1 − α).

The convergence under f = ϕ_{µ₀,σ₀} is more intricate, but the following heuristic argument gives us some hints on how to choose the hyperparameters. Using Scricciolo (2011), we find that the posterior distribution concentrates around f at the rate √(log n)/√n. In Nguyen (2013), it is proved that for nonparametric location mixtures the posterior distribution on the mixing density is Wasserstein consistent. Here the model is a location mixture of Gaussians, but the common scale is also unknown, and we conjecture that the result of Nguyen (2013) still holds in our case. Hence, assuming that the posterior distribution of Q_α = (α δ_{(µ₀)} + (1 − α) P) × δ_{(σ)} converges in L₂-Wasserstein distance to δ_{(µ₀)} × δ_{(σ₀)}, we consider a Taylor expansion and obtain (see Nguyen (2013) and Rousseau and Mengersen (2011))

(log n/n)^{1/2} ≳ ‖ ϕ_{µ₀,σ₀} − αϕ_{µ₀,σ} − (1 − α) ∫_R ϕ_{µ,σ} dP(µ) ‖
 = ‖ ½ ( L″_{µ,µ} [E^{Q_α}(µ − µ₀)² + 2σ₀(σ − σ₀)] + (σ − σ₀)² L″_{σ,σ} + 2(µ̄ − µ₀)(σ − σ₀) L″_{σ,µ} ) + (µ̄ − µ₀) ∇_µ ϕ + o(u_n) ‖,

where µ̄ = E^{Q_α}(µ), u_n = |µ̄ − µ₀| + |E^{Q_α}(µ − µ₀)² + 2σ₀(σ − σ₀)| + (σ − σ₀)², and L″_{µ,µ}, L″_{σ,µ} and L″_{σ,σ} are the second derivatives of ϕ_{µ,σ} with respect to µ, (µ, σ) and σ, respectively. By linear independence, this leads to |u_n| ≲ √(log n / n). In particular, if 1 − α < ε, the prior mass of this event is bounded by a power of (log n/n) involving M and a₀ ∧ 1/2, and it is small enough as soon as M + a₀ ∧ 1/2 > a₀.
Hence, using the same argument as in Rousseau and Mengersen (2011), under the Gaussian model the posterior distribution of α will concentrate around 1. This reasoning leads us to consider hyperparameters satisfying M + a₀ ∧ 1/2 > a₀, for instance a₀ = 1, M > 1/2, or a₀ = 1/2, M = 1. We implemented an MCMC algorithm using a marginal representation of the mixture, that is, integrating out the parameters µ₀, σ and P, and sampling only α and the allocation random variables in the data augmentation scheme. The output of this implementation is illustrated in Figure 5 for both Normal and non-Normal (t distributed) samples, showing a departure away from α = 1 for the latter, the decrease being slower for larger degrees of freedom.

Example 4.3
In this last example we demonstrate that the theory and methodology corresponding to Theorem 1 can be extended to the regression case under the assumption that the design is random. We consider a binary response setup, using the R dataset about diabetes in Pima Indian women (R Development Core Team, 2006) as a benchmark (as in Marin and Robert, 2007). The dataset contains a random sample of 200 women tested for diabetes according to WHO criteria. The response variable y is "Yes" or "No", for presence or absence of diabetes, and the explanatory variable x is restricted here to the body mass index (bmi), weight in kg/(height in m)². For this problem, either logistic or probit regression models could be suitable, so we compare these fits via our method. If y = (y₁ y₂ . . . yₙ) is the vector of binary responses and X = [1ₙ x] is the n × 2 design matrix, the models in competition are (i = 1, . . . , n)

(10)  M₁ : yᵢ | xⁱ, θ₁ ∼ B(1, pᵢ), where pᵢ = exp(xⁱθ₁) / (1 + exp(xⁱθ₁)),
   M₂ : yᵢ | xⁱ, θ₂ ∼ B(1, qᵢ), where qᵢ = Φ(xⁱθ₂),

where xⁱ = (1 xᵢ) is the vector of explanatory variables and where θ_j, j = 1, 2, is a 2 × 1 vector of parameters of model M₁ or M₂.

Fig 5: Example 4.2: Boxplots of the posterior 25%, 50%, and 75% quantiles of the mixture weight α of the Normal component, for 100 replications of N simulations from (a) a standard normal distribution (N = 100); (b) a standard normal distribution (N = 1000); and (c)–(g) t distributions with increasing degrees of freedom (N = 100). All values are based on 2·10⁴ Metropolis–Hastings iterations and 100 replications of the MCMC runs.

We once again consider the case where both models share the same parameter. However, for this generalised linear model there is no moment equation that relates θ₁ and θ₂, so we adopt a local reparameterisation strategy by rescaling the parameters of the probit model M₂ so that the MLE's of both models coincide. This strategy follows from the remark by Choudhury et al. (2007) regarding the connection between the Normal cdf and a logistic function,

Φ(xⁱθ) ≈ exp(k xⁱθ) / (1 + exp(k xⁱθ)),

and we attempt to find the best estimate of k to make both parameterisations coherent. Given

(k₁, k₂) = ( θ̂₁,₁/θ̂₂,₁ , θ̂₁,₂/θ̂₂,₂ ),

which denote the componentwise ratios of the maximum likelihood estimates of the logistic model parameters to those of the probit model, we redefine qᵢ in (10) as

(11)  qᵢ = Φ( xⁱ (κ⁻¹θ) ),  κ⁻¹θ = (θ₁/k₁, θ₂/k₂).

Once the mixture model is thus parameterised, we set our now standard Beta B(a₀, a₀) prior on α, the weight of M₁, and choose the default g-prior on the regression parameter (see, e.g., Marin and Robert, 2007, Chapter 4), so that

θ ∼ N₂(0, n(XᵀX)⁻¹).

Table 1
Dataset Pima.tr: Posterior medians of the mixture model parameters.

      Logistic model parameters    Probit model parameters
a₀    α      θ₁      θ₂           θ₁/k₁    θ₂/k₂
.1    .352   -4.06   .103         -2.51    .064
.2    .427   -4.03   .103         -2.49    .064
.3    .440   -4.02   .102         -2.49    .063
.4    .456   -4.01   .102         -2.48    .063
.5    .449   -4.05   .103         -2.51    .064
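The rescaling step can be reproduced in a few lines; in the sketch below, the two MLEs are obtained by direct optimisation of the likelihoods with scipy, and the design, sample size and true coefficients are our own stand-ins for the bmi data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# simulated stand-in design: intercept plus one covariate
rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.random(n) < 1.0 / (1.0 + np.exp(-X @ np.array([0.5, 1.0])))  # logistic data

def neg_loglik(theta, link):
    """Negative Bernoulli log likelihood under the given link (logistic or probit)."""
    p = np.clip(link(X @ theta), 1e-12, 1.0 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p))

logit_mle = minimize(neg_loglik, np.zeros(2), args=(lambda t: 1.0/(1.0+np.exp(-t)),)).x
probit_mle = minimize(neg_loglik, np.zeros(2), args=(norm.cdf,)).x

# componentwise scaling factors (k1, k2) = logistic MLE / probit MLE, so that
# q_i = Phi(x_i (theta / k)) and the logistic fit coincide at the MLE
k = logit_mle / probit_mle
assert np.all(k > 1.0)   # probit coefficients are smaller, k is roughly 1.6-1.8
```

The factors k recover numerically the familiar ≈1.7 scaling between logistic and probit coefficients that the Choudhury et al. (2007) approximation formalises.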
Table 2
Simulated dataset: Posterior medians of the mixture model parameters.

      True model: logistic with θ = (5, 1.5)     True model: probit with θ = (3.5, .8)
a₀    α      θ₁      θ₂      θ₁/k₁   θ₂/k₂      α      θ₁      θ₂      θ₁/k₁   θ₂/k₂
.1    .998   4.940   1.480   2.460   .640       .003   7.617   1.777   3.547   .786
.2    .972   4.935   1.490   2.459   .650       .039   7.606   1.778   3.542   .787
.3    .918   4.942   1.484   2.463   .646       .088   7.624   1.781   3.550   .788
.4    .872   4.945   1.485   2.464   .646       .141   7.616   1.791   3.547   .792
.5    .836   4.947   1.489   2.465   .648       .186   7.596   1.782   3.537   .788

In a Gibbs representation (not implemented here), the full conditional posterior distributions given the allocation vector ζ are

α ∼ B(a₀ + n₁, a₀ + n₂)

and

(12)  π(θ | y, X, ζ) ∝ [ exp{ Σ_{i: ζᵢ=1} yᵢ xⁱθ } / ∏_{i: ζᵢ=1} (1 + exp(xⁱθ)) ] × exp{ −θᵀ(XᵀX)θ / 2n } × ∏_{i: ζᵢ=2} Φ(xⁱ(κ⁻¹θ))^{yᵢ} (1 − Φ(xⁱ(κ⁻¹θ)))^{(1−yᵢ)},

where n₁ and n₂ are the numbers of observations allocated to the logistic and probit models, respectively. This conditional representation shows that the posterior distribution is clearly defined, which is obvious when considering that the chosen prior is proper.

For the Pima dataset, we computed the maximum likelihood estimates of both GLMs and derived the corresponding scaling factors (k₁, k₂). For a₀ = .1, .2, .3, .4, .5, the posterior estimates of α reported in Table 1 remain between .35 and .46, and the estimates of θ₁ and θ₂ are very stable (and quite similar to the MLEs). We note a slight increase of α towards 0.5 as a₀ increases, but do not want to over-interpret the phenomenon. This behaviour leads us to conclude that (a) none or both of the models are appropriate for the Pima Indian data, and (b) the sample size may be insufficiently large to allow discrimination between the logit and the probit models.

To follow up on this last remark, we ran a second experiment with simulated logit and probit datasets and a larger sample size, n = 10,000, with true parameter θ = (5, 1.5) for the logit model and θ = (3.5, .8) for the probit model. The estimates of the parameters of M_α for both datasets are presented in Table 2. For every a₀, the estimates in the true model are quite close to the true values, and the posterior estimates of α are close to 1 in the logit case and to 0 in the probit case. For this large sample setting, there is thus consistency in the selection of the proper model. In addition, Figure 6 shows that when the sample size is large enough, the posterior distribution of α concentrates its mass near 1 and 0 when the data are simulated from a logit and a probit model, respectively.

Fig 6: Example 4.3: Histograms of the posterior distributions of α in favor of the logistic model, based on 10⁴ Metropolis–Hastings iterations, where a₀ = .1, .2, .3, .4, .5: (a) Pima dataset; (b) data from the logistic model; (c) data from the probit model.

5. CONCLUSION
Bayesian inference has been used in a very wide and increasing range of contexts over the past thirty years, and many of the applications of the Bayesian paradigm have concentrated on comparing scientific theories and testing hypotheses. Due to the ever-increasing complexity of the statistical models handled in such applications, the natural and understandable tendency of practitioners has been to rely on the default solution of the posterior probability (or, equivalently, of the Bayes factor) without fully understanding the sensitivity of these methods to both prior modeling and posterior calibration (Robert et al., 2011). In this area, objective Bayes solutions remain tentative and have not reached consensus. The novel approach we have proposed here for Bayesian testing of hypotheses and Bayesian model comparison offers in our opinion many incentives over these established methods. By casting the problem as an encompassing mixture model, not only do we replace the original testing problem with a better controlled estimation target that focuses on the frequency of a given model within the mixture model, but we also allow for posterior variability of this frequency. The posterior distribution of the weights of both components in the mixture offers a setting for deciding which model is most favored by the data that is at least as intuitive as the sole number corresponding to either the posterior probability or the Bayes factor. The range of acceptance, rejection and indecision conclusions can easily be calibrated by simulation under both models, as well as by deciding on the values of the weights that are extreme enough in favor of one model. The examples provided in this paper have shown that the posterior medians of such weights settle very quickly near the boundary values 0 and 1. Although we do not advocate such practice, it is even possible to derive a Bayesian p-value by considering the posterior area under the tail of the distribution of the weight.
Moreover, the approach does not induce additional computational strain on the analysis. Besides decision making, another issue of potential concern about this new approach is the impact of the prior modelling. As demonstrated in our examples, a partly common parameterisation is often feasible and hence allows for reference priors, at least on the common parameters. This proposal thus allows for a partial removal of the prohibition on using improper priors in hypothesis testing (DeGroot, 1973), a problem which has plagued the objective Bayes literature for decades. Concerning the prior on the weight parameter, we analysed the sensitivity of the resulting posterior distribution to various Beta priors on those weights. While the sensitivity is clearly present, it naturally vanishes as the sample size increases, in agreement with our consistency results, and remains of a moderate magnitude. This leads us to suggest the default value of a₀ = 0.5.

REFERENCES
Aitkin, M. (1991). Posterior Bayes factors (with discussion). J. Royal Statist. Society Series B, 53:111–142.
Aitkin, M. (2010). Statistical Inference: A Bayesian/Likelihood Approach. CRC Press, Chapman & Hall, New York.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computat., 9(2):349–368.
Bayarri, M. and Garcia-Donato, G. (2007). Extending conventional priors for testing general hypotheses in linear models. Biometrika, 94:135–152.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, second edition.
Berger, J. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1):1–32.
Berger, J., Boukai, B., and Wang, Y. (1997). Unified frequentist and Bayesian testing of a precise hypothesis (with discussion). Statistical Science, 12:133–160.
Berger, J., Boukai, B., and Wang, Y. (1999). Simultaneous Bayesian-frequentist sequential testing of nested hypotheses. Biometrika, 86:79–92.
Berger, J. and Jefferys, W. (1992). Ockham's razor and Bayesian analysis. American Scientist, 80:64–72.
Berger, J. and Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. J. American Statist. Assoc., 91:109–122.
Berger, J., Pericchi, L., and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A, 60:307–321.
Berger, J. and Sellke, T. (1987). Testing a point-null hypothesis: the irreconcilability of significance levels and evidence (with discussion). J. American Statist. Assoc., 82:112–122.
Berkhof, J., van Mechelen, I., and Gelman, A. (2003). A Bayesian approach to the selection and testing of mixture models. Statistica Sinica, 13:423–442.
Bernardo, J. (1980). A Bayesian analysis of classical hypothesis testing. In Bernardo, J., DeGroot, M. H., Lindley, D. V., and Smith, A., editors, Bayesian Statistics. Oxford University Press.
Carlin, B. and Chib, S. (1995). Bayesian model choice through Markov chain Monte Carlo. J. Royal Statist. Society Series B, 57(3):473–484.
Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J. American Statist. Assoc., 82:106–111.
Celeux, G., Hurn, M., and Robert, C. (2000). Computational and inferential difficulties with mixture posterior distributions. J. American Statist. Assoc., 95(3):957–979.
Chen, M., Shao, Q., and Ibrahim, J. (2000). Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York.
Choudhury, A., Ray, S., and Sarkar, P. (2007). Approximating the cumulative distribution function of the normal distribution. Journal of Statistical Research, 41:59–67.
Christensen, R., Johnson, W., Branscum, A., and Hanson, T. (2011). Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians. CRC Press, New York.
Consonni, G., Forster, J. J., and La Rocca, L. (2013). The whetstone and the alum block: Balanced objective Bayesian comparison of nested models for discrete data. Statistical Science, 28(3):398–423.
Csiszár, I. and Shields, P. (2000). The consistency of the BIC Markov order estimator. Ann. Statist., 28:1601–1619.
De Santis, F. and Spezzaferri, F. (1997). Alternative Bayes factors for model selection. Canadian J. Statist., 25:503–515.
DeGroot, M. (1973). Doing what comes naturally: Interpreting a tail area as a posterior probability or as a likelihood ratio. J. American Statist. Assoc., 68:966–969.
DeGroot, M. (1982). Discussion of Shafer's 'Lindley's paradox'. J. American Statist. Assoc., 77(378):337–339.
Forbes, F., Celeux, G., Robert, C., and Titterington, D. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 1:651–674.
Fraser, D. (2011). Is Bayes posterior just quick and dirty confidence? Statistical Science, 26(3):299–316. (With discussion.)
Fraser, D., Bédard, M., Wong, A., Lin, W., and Fraser, A. (2016). Bayes, reproducibility, and the quest for truth. Statistical Science, 31(4):578–590.
Fraser, D., Wong, A., and Sun, Y. (2009). Three enigmatic examples and inference from likelihood. Canadian J. Statist., 37(1):1–21.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer-Verlag, New York.
Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3(3):445–450.
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. (2013a). Bayesian Data Analysis. Chapman and Hall, New York, third edition.
Gelman, A., Robert, C., and Rousseau, J. (2013b). Inherent difficulties of non-Bayesian likelihood-based inference, as revealed by an examination of a recent book by Aitkin (with a reply from the author). Statistics & Risk Modeling, 30:1001–1016.
Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531.
Ghosal, S. and van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist., 35(2):697–723.
Gigerenzer, G. (1991). The superego, the ego and the id in statistical reasoning. In Keren, G. and Lewis, C., editors, Methodological and Quantitative Issues in the Analysis of Psychological Data. Erlbaum, Hillsdale, New Jersey.
Hoeting, J. A., Madigan, D., Raftery, A., and Volinsky, C. (1999). Bayesian model averaging: A tutorial (with discussion). Statistical Science, 14(4):382–417.
Jasra, A., Holmes, C., and Stephens, D. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67.
Jefferys, W. and Berger, J. (1992). Ockham's razor and Bayesian analysis. American Scientist, 80:64–72.
Jeffreys, H. (1939). Theory of Probability. The Clarendon Press, Oxford, first edition.
Johnson, V. and Rossell, D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. J. Royal Statist. Society Series B, 72:143–170.
Kleijn, B. and van der Vaart, A. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist., 34(2):837–877.
Kruijer, W., Rousseau, J., and van der Vaart, A. (2010). Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225–1257.
Lad, F. (2003). Appendix: the Jeffreys–Lindley paradox and its relevance to statistical testing. In Conference on Science and Democracy, Palazzo Serra di Cassano, Napoli.
Lee, K., Marin, J.-M., Mengersen, K., and Robert, C. (2009). Bayesian inference on mixtures of distributions. In Sastry, N. N., Delampady, M., and Rajeev, B., editors, Perspectives in Mathematical Sciences I: Probability and Statistics, pages 165–202. World Scientific, Singapore.
Lindley, D. (1957). A statistical paradox. Biometrika, 44:187–192.
MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press, Cambridge, UK.
Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. J. American Statist. Assoc., 89:1535–1546.
Marin, J., Pillai, N., Robert, C., and Rousseau, J. (2014). Relevant statistics for Bayesian model choice. J. Royal Statist. Society Series B, 76(5):833–859.
Marin, J. and Robert, C. (2007). Bayesian Core. Springer-Verlag, New York.
Marin, J. and Robert, C. (2011). Importance sampling methods for Bayesian discrimination between embedded models. In Chen, M.-H., Dey, D., Müller, P., Sun, D., and Ye, K., editors, Frontiers of Statistical Decision Making and Bayesian Analysis. Springer-Verlag, New York.
Mayo, D. and Cox, D. (2006). Frequentist statistics as a theory of inductive inference. In Rojo, J., editor, Optimality: The Second Erich L. Lehmann Symposium, Lecture Notes–Monograph Series, pages 77–97. Institute of Mathematical Statistics, Beachwood, Ohio, USA.
McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2018). Abandon statistical significance. The American Statistician. (To appear.)
Neyman, J. and Pearson, E. (1933). The testing of statistical hypotheses in relation to probabilities a priori. Proc. Cambridge Philos. Soc., 24:492–510.
Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist., 41(1):370–400.
O'Hagan, A. (1995). Fractional Bayes factors for model comparisons. J. Royal Statist. Society Series B, 57:99–138.
O'Hagan, A. (1997). Properties of intrinsic and fractional Bayes factors. Test, 6:101–118.
O'Neill, P. D. and Kypraios, T. (2014). Bayesian model choice via mixture distributions with application to epidemics and population process models. ArXiv e-prints.
Park, T. and Casella, G. (2008). The Bayesian lasso. J. American Statist. Assoc., 103(482):681–686.
Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3):523–539.
R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rasmussen, C. E. and Ghahramani, Z. (2001). Occam's razor. In
Advances in Neural Information ProcessingSystems , volume 13.Richardson, S. and Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Royal Statist. Society Series B , 59:731–792.Robert, C. (1993). A note on Jeffreys-Lindley paradox.
Statistica Sinica , 3(2):601–608.Robert, C. (2001).
The Bayesian Choice . Springer-Verlag, New York, second edition.Robert, C. (2014). On the Jeffreys–Lindley paradox.
Philosophy of Science , 5(2):216–232.Robert, C., Cornuet, J.-M., Marin, J.-M., and Pillai, N. (2011). Lack of confidence in ABC model choice.
Proceedings of the National Academy of Sciences , 108(37):15112–15117.Rockova, V. and George, E. I. (2016). Fast bayesian factor analysis via automatic rotations to sparsity.
Journalof the American Statistical Association , 111(516):1608–1622.Rodriguez, C. and Walker, S. (2014). Label switching in Bayesian mixture models: Deterministic relabelingstrategies.
Journal of Computational and Graphical Statistics , 23(1):25–45.Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixturemodels.
J. Royal Statist. Society Series B , 73(5):689–710.Schwarz, G. (1978). Estimating the dimension of a model.
Ann. Statist. , 6:461–464.Scricciolo, C. (2011). Posterior rates of convergence for dirichlet mixtures of exponential power densities.
Electron.J. Statist. , 5:270–308.Shafer, G. (1982). On Lindley’s paradox (with discussion).
Journal of the American Statistical Association ,378:325–351.Spanos, A. (2013). Who should be afraid of the Jeffreys–Lindley paradox?
Philosophy of Science , 80:73–93.Spiegelhalter, D. J., Best, N., B.P., C., and Van der Linde, A. (2002). Bayesian measures of model complexityand fit.
Journal of the Royal Statistical Society, Series B , 64:583–640.Sprenger, J. (2013). Testing a precise null hypothesis: The case of Lindley’s paradox.
Philosophy of Science ,80(5):733–744.Stephens, M. (2000). Dealing with label switching in mixture models.
J. Royal Statist. Society Series B , 62(4):795–809.Vehtari, A. and Lampinen, J. (2002). Bayesian model assessment and comparison using crossvalidation predictivedensities.
Neural Computation , 14:2439–2468.Vehtari, A. and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection andcomparison.
Statistics Surveys , 6:142–228.Wasserstein, R. L. and Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose.
The American Statistician , 70(2):129–133.Ziliak, S. and McCloskey, D. (2008).
The Cult of Statistical Significance: How the standard error costs us jobs,justice, and lives . Univ of Michigan Pr.
APPENDIX 1: PROOFS OF SECTION 3
In this section we give the proofs of Theorems 1, 2 and 3.
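Before the formal arguments, the claim of Theorem 1, namely posterior consistency of the mixture weight $\alpha$ at an interior true value $\alpha^*$, can be checked numerically. The sketch below is ours and not part of the paper: data are simulated from the two-component Gaussian mixture $0.7\,N(0,1) + 0.3\,N(3,1)$ with both component densities treated as known, a reference Beta(0.5, 0.5) prior is placed on $\alpha$, and a plain random-walk Metropolis sampler targets the posterior of $\alpha$ alone. All function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate n observations from the mixture 0.7*N(0,1) + 0.3*N(3,1).
n, alpha_star = 500, 0.7
labels = rng.random(n) < alpha_star
x = np.where(labels, rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.0, n))

def npdf(y, mean):
    """Gaussian density with unit standard deviation."""
    return np.exp(-0.5 * (y - mean) ** 2) / np.sqrt(2.0 * np.pi)

def log_post(alpha, a=0.5):
    """Log posterior of the weight alpha under a Beta(a, a) prior,
    both component densities being treated as known."""
    if not 0.0 < alpha < 1.0:
        return -np.inf
    loglik = np.sum(np.log(alpha * npdf(x, 0.0) + (1.0 - alpha) * npdf(x, 3.0)))
    return loglik + (a - 1.0) * (np.log(alpha) + np.log(1.0 - alpha))

def metropolis(n_iter=5000, step=0.05, init=0.5):
    """Plain random-walk Metropolis on alpha in (0, 1)."""
    draws = np.empty(n_iter)
    alpha, lp = init, log_post(init)
    for t in range(n_iter):
        prop = alpha + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:
            alpha, lp = prop, lp_prop
        draws[t] = alpha
    return draws

post_mean = metropolis()[1000:].mean()  # discard burn-in
print(f"posterior mean of alpha: {post_mean:.3f}")
```

Re-running the sketch with larger n visibly tightens the posterior around $\alpha^*$, in line with the consistency statement.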
Proof of Theorem 1
Using Proposition 1, we have that $\pi(A_n \mid x^n) = 1 + o_p(1)$ with $A_n = \{(\alpha,\theta);\ \|f_{\theta,\alpha} - f_{\theta^*,\alpha^*}\|_1 \le \delta_n\}$ and $\delta_n = M\sqrt{\log n/n}$. Consider a subsequence $(\alpha_{n_P}, \theta_{1,n_P}, \theta_{2,n_P})$ which converges to $(\alpha, \mu_1, \mu_2)$, where convergence holds in the sense that $\alpha_{n_P} \to \alpha$ and $P_{j,\theta_{j,n_P}}$ converges weakly to $\mu_j$. Note that $\mu_j(\mathcal{X}) \le 1$ and
$\alpha\mu_1 + (1-\alpha)\mu_2 = \alpha^* P_{1,\theta_1^*} + (1-\alpha^*) P_{2,\theta_2^*}$.
The above equality implies that $\mu_1$ and $\mu_2$ are probabilities. Using (4), we obtain that $\alpha = \alpha^*$ and $\mu_j = P_{j,\theta_j^*}$, which implies posterior consistency for $\alpha$. The proof of (5) follows the same lines as in Rousseau and Mengersen (2011). Consider first the case where $\alpha^* \in (0,1)$, so that $\theta$ concentrates around $\theta^*$. Writing
$L' = \big(f_{1,\theta_1^*} - f_{2,\theta_2^*},\ \alpha^*\nabla f_{1,\theta_1^*},\ (1-\alpha^*)\nabla f_{2,\theta_2^*}\big) := (L_\alpha, L_1, L_2)$, $L'' = \mathrm{diag}\big(0,\ \alpha^* D^2 f_{1,\theta_1^*},\ (1-\alpha^*) D^2 f_{2,\theta_2^*}\big)$
and $\eta = (\alpha-\alpha^*,\ \theta_1-\theta_1^*,\ \theta_2-\theta_2^*)$, $\omega = \eta/|\eta|$, we then have

(13) $\|f_{\theta,\alpha} - f_{\theta^*,\alpha^*}\|_1 = |\eta|\,\big|\omega^T L' + \tfrac{|\eta|}{2}\,\omega^T L''\omega + |\eta|\,\omega_1\big[\omega_2^T\nabla f_{1,\theta_1^*} - \omega_3^T\nabla f_{2,\theta_2^*}\big] + o(|\eta|)\big|$

For all $(\alpha,\theta) \in A_n$, $\eta = (\alpha-\alpha^*, \theta_1-\theta_1^*, \theta_2-\theta_2^*)$ goes to 0, and for $n$ large enough there exists $\epsilon > 0$ such that $|\alpha-\alpha^*| + |\theta-\theta^*| \le \epsilon$. We now prove that there exists $c > 0$ such that for all $(\alpha,\theta) \in A_n$,
$v(\omega) = \big|\omega^T L' + \tfrac{|\eta|}{2}\,\omega^T L''\omega + |\eta|\,\omega_1\big[\omega_2^T\nabla f_{1,\theta_1^*} - \omega_3^T\nabla f_{2,\theta_2^*}\big] + o(|\eta|)\big| > c$,
where $\omega$ is defined with respect to $(\alpha,\theta)$. Were it not the case, there would exist a sequence $(\alpha_n,\theta_n) \in A_n$ such that the associated $v(\omega_n) \le c_n$ with $c_n = o(1)$. As $\omega_n$ belongs to a compact set, we could find a subsequence converging to a point $\bar\omega$. At the limit we would obtain $\bar\omega^T L' = 0$, and by linear independence $\bar\omega = 0$, which is not possible. Thus for all $(\alpha,\theta) \in A_n$, $|\alpha-\alpha^*| + |\theta-\theta^*| \lesssim \delta_n$.

Assume now instead that $\alpha^* = 0$. Then define $L' = (L_\alpha, L_2)$, $L'' = \mathrm{diag}(0, D^2 f_{2,\theta_2^*})$ and $\eta = (\alpha-\alpha^*,\ \theta_2-\theta_2^*)$, $\omega = \eta/|\eta|$, and consider a Taylor expansion with $\theta_1$ fixed and $|\eta|$ going to 0. This leads to

(14) $\|f_{\theta,\alpha} - f_{\theta^*,\alpha^*}\|_1 = |\eta|\,\big|\omega^T L' + \tfrac{|\eta|}{2}\,\omega^T L''\omega - |\eta|\,\omega_1\,\omega_2^T\nabla f_{2,\theta_2^*}\big| + o(|\eta|)$

in place of (13); using the same argument as in the case $\alpha^* \in (0,1)$, we conclude that $|\alpha-\alpha^*| + |\theta_2-\theta_2^*| \lesssim \delta_n$.

Proof of Theorem 2
Recall that $f^* = f_{1,\theta_1^*}$. To prove (8) we must first find a precise lower bound on
$D_n := \int_\alpha\int_\Theta e^{\ell_n(f_{\theta,\alpha}) - \ell_n(f^*)}\,d\pi_\theta(\theta)\,d\pi_\alpha(\alpha)$.
Consider the approximating set
$S_n(\epsilon) = \{(\theta,\alpha);\ \alpha > 1 - 1/\sqrt{n},\ |\theta_1-\theta_1^*| \le 1/\sqrt{n},\ |\psi-\bar\psi| \le \epsilon\}$, $\theta = (\theta_1,\psi)$,
with $|\bar\psi| > 2\epsilon$ a fixed parameter value. Using the same computations as in Rousseau and Mengersen (2011), it holds that for all $\delta > 0$ there exists $C_\delta > 0$ such that
$P^*\big(D_n < e^{-C_\delta}\,\pi(S_n(\epsilon))/2\big) < \delta$,
so that with probability greater than $1-\delta$, $D_n \gtrsim n^{-(a+d)/2}$. Denote $B_n = \{(\theta,\alpha);\ \|f_{\theta,\alpha} - f^*\|_1 \le M_n/\sqrt{n}\}$; we know that
$\pi\big(\|f_{\theta,\alpha} - f^*\|_1 \ge M_n\sqrt{\log n}/\sqrt{n} \mid x^n\big) = o_p(1)$.
Let $M_n \le j \le M_n\sqrt{\log n}$ and consider the slice $S_n(j) = \{j/\sqrt{n} \le \|f_{\theta,\alpha} - f^*\|_1 \le (j+1)/\sqrt{n}\}$. We now upper bound $\pi(S_n(j))$. To do so we split the parameter space into $\alpha < 1-\delta$, for a fixed arbitrarily small $\delta$, and $\alpha \ge 1-\delta$. In the first case, $\psi$ converges to 0 and $\theta_1$ to $\theta_1^*$. A Taylor expansion leads to

(16) $\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{1,\theta_1^*}\|_1 = \|(\theta_1-\theta_1^*)^T\nabla_\theta f_{1,\theta_1^*} + (1-\alpha)\psi^T\nabla_\psi f_{2,\theta_1^*,0}\|_1 + O\big(\|\theta_1-\theta_1^*\|^2 + (1-\alpha)\|\psi\|^2\big)$

Setting $v_n = \|\theta_1-\theta_1^*\| + \|(1-\alpha)\psi\|$ and $\eta = (\theta_1-\theta_1^*,\ (1-\alpha)\psi)/v_n$, (16) implies that
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{1,\theta_1^*}\|_1 \ge v_n\,\|\eta^T\nabla f_{2,\theta_1^*,0}\|_1 + O(v_n^2)$
and by linear independence of $\nabla f_{2,\theta_1^*,0}$ we obtain that $v_n \le Cj/\sqrt{n}$ on $S_n(j) \cap \{\alpha \in (0,1-\delta)\}$ and
$\pi_{n,1} := \pi\big(S_n(j) \cap \{\alpha \in (0,1-\delta)\}\big) \lesssim j^d n^{-d/2}$.
Now consider $1-\alpha \le \delta$. If $1-\alpha \le Mj/\sqrt{n}$, then $\|\theta_1-\theta_1^*\| \lesssim j/\sqrt{n}$, which has prior probability bounded by $O\big((j/\sqrt{n})^{d+a}\big)$. If $1-\alpha > Mj/\sqrt{n}$, then $\psi$ goes to 0 and (16) implies that $\|\theta_1-\theta_1^*\| + \|(1-\alpha)\psi\| \lesssim j/\sqrt{n}$, which in turn implies that
$\pi(S_n(j)) \lesssim j^d n^{-d/2} + (j/\sqrt{n})^{d+a} + j^d n^{-d/2}\int_{Mj/\sqrt{n}}^{\delta} u^{a-d_\psi-1}\,du \lesssim j^d n^{-d/2} + (j/\sqrt{n})^{d+a}$.
Since $a \le d_\psi$,

(17) $\pi(S_n(j))\,/\,\pi(S_n(\epsilon)) \lesssim j^{d_\psi+a}$.

Write $S_\epsilon = \{\epsilon \le \|f_{\theta,\alpha} - f^*\|_1 \le 2\epsilon\}$. Equation (16) implies that if $M_n\sqrt{\log n}/\sqrt{n} \ge \epsilon \ge M_n/\sqrt{n}$, then $S_\epsilon \subset \{\|\theta_1-\theta_1^*\| \le \tau\epsilon\} \cap \{\|(1-\alpha)\psi\| \le \tau\epsilon\}$ for some $\tau > 0$. To cover $S_\epsilon$ by $L_1$ balls of radius $\epsilon/8$, we choose $\|\theta_1-\theta_1'\| \le \zeta\epsilon$ and we split the set into $1-\alpha \le \kappa\epsilon$ and $1-\alpha > \kappa\epsilon$. Note that by choosing $\zeta$ small enough,
$\|f_{\theta,\alpha} - f_{\theta',\alpha'}\|_1 \le \|f_{\theta_1,\psi,\alpha} - f_{\theta_1,\psi',\alpha'}\|_1 + \epsilon/16 \le \|(1-\alpha)(f_{2,\theta_1,\psi} - f_{2,\theta_1,0}) - (1-\alpha')(f_{2,\theta_1,\psi'} - f_{2,\theta_1,0})\|_1 + \epsilon/8$.
Assume first that $1-\alpha > \kappa\epsilon$ for some $\kappa > 0$. Then on $S_\epsilon$, $\|\psi\| \le \tau_1/\kappa$ and, by choosing $\kappa$ large enough,
$\|f_{\theta,\alpha} - f_{\theta',\alpha'}\|_1 \le M\|(1-\alpha)\psi - (1-\alpha')\psi'\| + \epsilon/8 \le \epsilon/4$
as soon as $\|(1-\alpha)\psi - (1-\alpha')\psi'\| \le \zeta\epsilon$ with $\zeta$ small enough. If $1-\alpha \le \kappa\epsilon$, choose $|\alpha'-\alpha| \le \epsilon/16$, so that
$\|f_{\theta,\alpha} - f_{\theta',\alpha'}\|_1 \le (1-\alpha)\|f_{2,\theta_1,\psi} - f_{2,\theta_1,\psi'}\|_1 + \epsilon/8 \le (1-\alpha)\|\psi-\psi'\|\,M_1 + \epsilon/8 \le \epsilon/4$
as soon as $\|\psi-\psi'\| \le 1/(4M_1\kappa)$. Hence the local $L_1$ entropy is bounded by a constant for all $M_n\sqrt{\log n}/\sqrt{n} > \epsilon > M_n/\sqrt{n}$, and using Theorem 2.4 of Ghosal et al. (2000), we obtain (8).

Now consider $d_\psi > a$ and let $A_n = \{(\theta,\alpha) \in B_n;\ 1-\alpha > z_n/\sqrt{n}\}$, with $z_n$ a sequence increasing to infinity faster than $M_n$ and $B_n = \{\|f_{\theta,\alpha} - f^*\|_1 \le M_n/\sqrt{n}\}$. We prove that $\pi(A_n \mid x^n) = o_p(1)$ by proving that $\pi(A_n) = o(n^{-(a+d)/2})$ and using the lower bound on $D_n$ of order $n^{-(a+d)/2}$. We split $B_n$ into
$B_{n,1}(\delta) = B_n \cap \{(\theta,\alpha),\ \theta = (\theta_1,\psi);\ \|\psi\| < \delta\}$, $B_{n,2}(\delta) = B_n \cap B_{n,1}(\delta)^c$, $\delta > \delta_n = M_n/\sqrt{n}$.
First we prove that for all $\delta > 0$, $A_n \cap B_{n,2}(\delta) = \emptyset$ when $n$ is large enough. Let $\delta > 0$; then for any $(\theta,\alpha) \in A_n \cap B_{n,2}(\delta)$ we have $\|\psi\| \ne o(1)$, $\alpha = 1 + o(1)$ and $|\theta_1-\theta_1^*| = o(1)$. Consider a Taylor expansion of $f_{\theta,\alpha}$ around $\alpha = 1$ and $\theta_1 = \theta_1^*$, with $\psi$ fixed. This leads to
$f_{\theta,\alpha} - f^* = (\alpha-1)[f_{1,\theta_1^*} - f_{2,\theta_1^*,\psi}] + (\theta_1-\theta_1^*)^T[\nabla_\theta f_{1,\theta_1^*} - \nabla_\theta f_{2,\theta_1^*,\psi}] + \tfrac12(\theta_1-\theta_1^*)^T\big(\bar\alpha D^2_\theta f_{1,\bar\theta_1} + (1-\bar\alpha) D^2_\theta f_{2,\bar\theta_1,\psi}\big)(\theta_1-\theta_1^*) + (\alpha-1)(\theta_1-\theta_1^*)^T[\nabla_\theta f_{1,\bar\theta_1} - \nabla_\theta f_{2,\bar\theta_1,\psi}] = (\alpha-1)[f_{1,\theta_1^*} - f_{2,\theta_1^*,\psi}] + (\theta_1-\theta_1^*)^T[\nabla_\theta f_{1,\theta_1^*} - \nabla_\theta f_{2,\theta_1^*,\psi}] + o(|\alpha-1| + \|\theta_1-\theta_1^*\|)$
with $\bar\alpha \in (0,1)$, $\bar\theta_1 \in (\theta_1, \theta_1^*)$, and the $o(1)$ uniform over $A_n \cap B_{n,2}(\delta)$. Set $\eta = (\alpha-1,\ \theta_1-\theta_1^*)$ and $x = \eta/|\eta|$ if $|\eta| > 0$. Then
$\|f_{\theta,\alpha} - f^*\|_1 = |\eta|\,\big(x^T L(\psi) + o(1)\big)$, $L(\psi) = \big(f_{1,\theta_1^*} - f_{2,\theta_1^*,\psi},\ \nabla_\theta f_{1,\theta_1^*} - \nabla_\theta f_{2,\theta_1^*,\psi}\big)$.
We now prove that on $A_n \cap B_{n,2}(\delta)$, $\|f_{\theta,\alpha} - f^*\|_1 \gtrsim |\eta|$. Assume it is not the case; then there exist $c_n = o(1)$ and a sequence $(\theta_{1,n}, \alpha_n)$ such that along that subsequence $|x_n^T L(\psi_n) + o(1)| \le c_n$ with $x_n = \eta_n/|\eta_n|$. Since it belongs to a compact set, together with $\psi_n$, any converging subsequence satisfies at the limit $(\bar x, \bar\psi)$
$\bar x^T L(\bar\psi) = 0$,
which is not possible. Hence $|\alpha-1| \lesssim M_n/\sqrt{n} = o(z_n/\sqrt{n})$, which is not possible either, so that $A_n \cap B_{n,2}(\delta) = \emptyset$ when $n$ is large enough. We now bound $\pi(A_n \cap B_{n,1}(\delta))$ for $\delta > 0$ small, via a Taylor expansion around $\theta^* = (\theta_1^*, 0)$ with $\alpha$ fixed. Note that $\nabla_\theta f_{2,\theta_1^*,0} = \nabla_\theta f_{1,\theta_1^*}$. We have
$f_{\theta,\alpha} - f^* = (\theta_1-\theta_1^*)^T\nabla_\theta f_{1,\theta_1^*} + (1-\alpha)\psi^T\nabla_\psi f_{2,\theta_1^*,0} + \tfrac12(\theta-\theta^*)^T H_{\alpha,\bar\theta}(\theta-\theta^*)$
where $H_{\alpha,\bar\theta}$ is the block matrix
$H_{\alpha,\bar\theta} = \begin{pmatrix} \alpha D^2_\theta f_{1,\bar\theta_1} + (1-\alpha) D^2_{\theta_1,\theta_1} f_{2,\bar\theta} & (1-\alpha) D^2_{\theta_1,\psi} f_{2,\bar\theta} \\ (1-\alpha) D^2_{\psi,\theta_1} f_{2,\bar\theta} & (1-\alpha) D^2_{\psi,\psi} f_{2,\bar\theta} \end{pmatrix}$.
Since $H_{\alpha,\bar\theta}$ is bounded in $L_1$ (in the sense that each of its components is bounded as a function in $L_1$), uniformly in neighbourhoods of $\theta^*$, writing $\eta = (\theta_1-\theta_1^*,\ (1-\alpha)\psi)$ and $x = \eta/|\eta|$, we have that $|\eta| = o(1)$ on $A_n \cap B_{n,1}(\delta)$ and
$\|f_{\theta,\alpha} - f^*\|_1 \gtrsim |\eta|\,\big(x^T\nabla f_{2,\theta_1^*,0} + o(1)\big)$
if $\epsilon$ is small enough. Using a similar argument to before, this leads to $|\eta| \lesssim \delta_n$ on $A_n \cap B_{n,1}(\delta)$, so that
$\pi(A_n \cap B_{n,1}(\delta)) \lesssim \delta_n^d \int_{z_n/\sqrt{n}}^{\delta} (\delta_n/u)^{d_\psi}\,u^{a-1}\,du \lesssim \delta_n^{d+d_\psi}\,z_n^{a-d_\psi}\,n^{(d_\psi-a)/2} \lesssim n^{-(d+a)/2}\,M_n^d\,z_n^{a-d_\psi} = o(n^{-(d+a)/2})$,
choosing $M_n = o(z_n^{(d_\psi-a)/d})$ going to infinity (recall that $d_\psi > a$).

Finally assume that $d_\psi < a$ and denote $C_n = \{(\theta,\alpha) \in B_n;\ 1-\alpha < e_n\}$; then the same arguments imply that if $\delta > 0$ is small enough, $C_n = C_n \cap B_{n,1}(\delta)$ and
$\pi(C_n \cap B_{n,1}(\delta)) \lesssim \delta_n^d \int_0^{e_n} (\delta_n/u)^{d_\psi}\,u^{a-1}\,du \lesssim \delta_n^{d+d_\psi}\,e_n^{a-d_\psi} \lesssim n^{-d/2}\,M_n^{d_\psi}\,e_n^{a-d_\psi} = o(n^{-d/2})$
if $M_n^{d_\psi} = o(e_n^{-(a-d_\psi)})$. Now, working with $S_n' = \{\|\theta_1-\theta_1^*\| \le 1/\sqrt{n},\ \|\psi\| \le 1/\sqrt{n},\ \alpha \in (\bar\alpha-e_n', \bar\alpha+e_n')\}$ with $e_n'$ going to 0 arbitrarily slowly, we have that with probability going to 1, $D_n \gtrsim \pi(S_n') \gtrsim n^{-d/2} e_n'$, so that by choosing $e_n'$ accordingly, $\pi(C_n) = o(n^{-d/2} e_n')$ and Theorem 2 is proved.

Proof of Theorem 3
The proof of Theorem 3 proceeds along the same lines as the previous proof. Let $f_n^* = f_{2,\theta_{1,n},\psi_n}$ with $\|\psi_n\| = o(1)$; the other case has already been treated in Theorem 1. Recall that
$\pi\big(\|f_{\theta,\alpha} - f_n^*\|_1 \ge M\sqrt{\log n}/\sqrt{n} \mid x^n\big) = o_p(1)$
if $M$ is large enough. We consider the case $M_n/\sqrt{n} \le \|\psi_n\| = o(n^{-1/4})$, since the other case can be treated more easily. We first prove that the posterior concentration rate can be sharpened into

(18) $\pi\big(\|f_{\theta,\alpha} - f_n^*\|_1 \ge z_n/\sqrt{n} \mid x^n\big) = o_p(1)$, for any $z_n \to +\infty$.

To do so we first obtain a sharp lower bound on $D_n$. From the regularity assumptions [B1] and [B2], for all $(\alpha, \theta_1, \psi) \in \tilde S_n = \{\|\theta_1-\theta_{1,n}\| \le 1/\sqrt{n},\ \|(1-\alpha)\psi - \psi_n\| + \|\psi_n\|\,\|\psi\| \le 1/\sqrt{n}\}$, and via a Taylor expansion with $\theta = (\theta_1,\psi)$ around $\theta_{1,n}$ and 0 (both for $\psi$ and $\psi_n$), we bound
$KL(f_n^*, f_{\theta,\alpha}) \le \int \frac{(f_{\theta_{1,n},\psi_n} - f_{\theta,\alpha})^2}{f_{\theta_{1,n},\psi_n}}(x)\,dx \lesssim \|\theta_1-\theta_{1,n}\|^2 + \|(1-\alpha)\psi-\psi_n\|^2 + (1-\alpha)^2\|\psi\|^2\|\psi-\psi_n\|^2 \lesssim \|\theta_1-\theta_{1,n}\|^2 + \|(1-\alpha)\psi-\psi_n\|^2 + \|\psi_n\|^2\|\psi\|^2$
with a similar inequality for $\int f_n^*\big(\log f_n^* - \log f_{\theta,\alpha}\big)^2(x)\,dx$. Hence, outside an event of probability bounded by $\epsilon$,
$D_n \ge e^{-C_\epsilon}\,\pi(\tilde S_n)$
for some large positive constant $C_\epsilon$. We have the following lower bound on $\pi(\tilde S_n)$. Note that $\|(1-\alpha)\psi-\psi_n\| + \|\psi_n\|\,\|\psi\| \le 1/\sqrt{n}$ implies that $1-\alpha \gtrsim \sqrt{n}\,\|\psi_n\|^2$, and

(19) $\pi(\tilde S_n) \gtrsim n^{-d/2}\,n^{-d_\psi/2}\int_{\sqrt{n}\|\psi_n\|^2}^{\delta} u^{a-d_\psi-1}\,du \gtrsim n^{-(d+d_\psi)/2}\,\big(\sqrt{n}\,\|\psi_n\|^2\big)^{-(d_\psi-a)_+}$.

We now bound from above $\pi(S_n(j))$ and control the entropy of $S_n(j)$ for the neighbourhoods $S_n(j) = \{j/\sqrt{n} \le \|f_{\theta_1,\psi,\alpha} - f_n^*\|_1 \le (j+1)/\sqrt{n}\}$. On $S_n(j)$ we have
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},\psi_n}\|_1 \le (j+1)/\sqrt{n}$.
We split this set into the two subsets $\alpha < 1-\delta$, for a fixed arbitrarily small $\delta$, and $\alpha \ge 1-\delta$. In the first case, we have as a first approximation
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},0}\|_1 \lesssim \|\psi_n\| + j/\sqrt{n}$,
which implies in turn that $\|\theta_1-\theta_{1,n}\| \lesssim \|\psi_n\| + j/\sqrt{n}$ and $\|\psi\| \lesssim \|\psi_n\| + j/\sqrt{n}$, where the constants depend on $\delta$. A more refined Taylor expansion then leads to

(20) $\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},\psi_n}\|_1 = \|(\theta_1-\theta_{1,n})^T(\nabla_\theta f_{1,\theta_{1,n}} + o(1)) + ((1-\alpha)\psi - \psi_n)^T\nabla_\psi f_{2,\theta_{1,n},0} + (1-\alpha)(\psi-\psi_n)^T(D^2_\psi f_{2,\theta_{1,n},0} + o(1))(\psi-\psi_n)\|_1$

Setting $v_n = \|\theta_1-\theta_{1,n}\| + \|(1-\alpha)\psi - \psi_n\|$ and $\eta = (\theta_1-\theta_{1,n},\ (1-\alpha)\psi - \psi_n)/v_n$, (20) implies that
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},\psi_n}\|_1 \ge v_n\,\|\eta^T(\nabla f_{2,\theta_{1,n},0} + o(1))\|_1 + O(\|\psi_n\|^2 + v_n^2)$
and by linear independence of $\nabla f_{2,\theta_{1,n},0}$ (assumption B4), $v_n \le Cj/\sqrt{n}$ on $S_n(j)$ and
$\pi\big(S_n(j) \cap \{\alpha \le 1-\delta\}\big) \lesssim j^d n^{-d/2}$.
We now consider the last case: $\alpha > 1-\delta$. As before, a first crude approximation leads to $\|f_{1,\theta_1} - f_{2,\theta_{1,n},\psi_n}\|_1 \lesssim 1-\alpha + j/\sqrt{n}$, which in turn implies that $\|\theta_1-\theta_{1,n}\| + \|\psi_n\| \lesssim 1-\alpha + j/\sqrt{n}$. In particular, if $j/\sqrt{n} = o(\|\psi_n\|)$, then $1-\alpha \gtrsim \|\psi_n\|$. Consider first the case where $j/\sqrt{n} \gtrsim \|\psi_n\|$. Then

(21) $\pi\big(S_n(j) \cap \{1-\alpha \le \kappa j/\sqrt{n}\}\big) \lesssim (j/\sqrt{n})^{d+a}$.

Moreover,
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},\psi_n}\|_1 \le \|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},0}\|_1 + O(\|\psi_n\|)$,
so that when $a \le d_\psi$, (17) holds. When $a > d_\psi$, the proof of Theorem 2 implies that there exists $C > 0$ such that
$S_n(j) \cap \{1-\alpha > \kappa j/\sqrt{n}\} \subset \{\|\theta_1-\theta_{1,n}\| + \|(1-\alpha)\psi - \psi_n\| \le Cj/\sqrt{n}\}$
and

(22) $\pi(S_n(j)) \lesssim (j/\sqrt{n})^d$.

Now consider the case where $j/\sqrt{n} = o(\|\psi_n\|)$ and recall that $1-\alpha \gtrsim \|\psi_n\|$ and $\|\theta_1-\theta_{1,n}\| = o(1)$. A Taylor expansion with $\theta_1$ around $\theta_{1,n}$ and $1-\alpha$ around 0, holding $\psi$ fixed and non-null, leads to
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - f_{2,\theta_{1,n},\psi_n}\|_1 = \|-\psi_n^T\nabla_\psi f_{2,\theta_{1,n},0} + (1-\alpha)(f_{2,\theta_{1,n},\psi} - f_{2,\theta_{1,n},\psi_n}) + (\theta_1-\theta_{1,n})^T\nabla_\theta f_{1,\theta_{1,n}}\|_1 + o(\|\theta_1-\theta_{1,n}\| + 1/n) + O((1-\alpha)\|\psi_n\|)$.
Set $\eta = (\theta_1-\theta_{1,n},\ 1-\alpha,\ \psi_n)$ and $\omega = \eta/\|\eta\|$; then
$\|\eta\|\,\|(1-\alpha)(f_{2,\theta_{1,n},\psi} - f_{2,\theta_{1,n},0}) + (\theta_1-\theta_{1,n})^T\nabla_\theta f_{1,\theta_{1,n}}\|_1 \lesssim j/\sqrt{n}$,
so that by linear independence of $(f_{2,\theta_{1,n},\psi} - f_{2,\theta_{1,n},0})$ and $\nabla_\theta f_{1,\theta_{1,n}}$ (assumption B3), $\|\psi_n\| \lesssim j/\sqrt{n}$, which is impossible. Therefore $\|\psi\| = o(1)$. A Taylor expansion then implies that

(23) $\|(\theta_1-\theta_{1,n})^T\nabla_\theta f_{1,\theta_{1,n}} + ((1-\alpha)\psi - \psi_n)^T\nabla_\psi f_{2,\theta_{1,n},0} + (1-\alpha)(\psi-\psi_n)^T D^2_\psi f_{2,\theta_{1,n},0}(\psi-\psi_n)/2\|_1 + o\big(\|\theta_1-\theta_{1,n}\| + (1-\alpha)\|\psi\|^2\big) \lesssim j/\sqrt{n}$

When $\|(1-\alpha)\psi - \psi_n\| \le \delta(1-\alpha)\|\psi-\psi_n\|^2$ with $\delta$ small, then
$\|(\theta_1-\theta_{1,n})^T\nabla_\theta f_{1,\theta_{1,n}} + (1-\alpha)(\psi-\psi_n)^T D^2_\psi f_{2,\theta_{1,n},0}(\psi-\psi_n)/2\|_1 \lesssim j/\sqrt{n}$;
set $\eta = (\theta_1-\theta_{1,n},\ \sqrt{1-\alpha}\,(\psi-\psi_n))$ and $\omega = \eta/\|\eta\|$; assumption B4 implies that
$\|\theta_1-\theta_{1,n}\| + (1-\alpha)\|\psi-\psi_n\|^2 \lesssim j/\sqrt{n}$, $\|(1-\alpha)\psi - \psi_n\| = o(j/\sqrt{n})$,
and since $1-\alpha \le \delta$ is small, $\|\psi-\psi_n\| = \|\psi\|(1+o(1))$, so that
$\|\psi\|\,\|\psi_n\| \lesssim j/\sqrt{n}$, $(1-\alpha)\,\frac{j}{\sqrt{n}\,\|\psi_n\|} \gtrsim \|(1-\alpha)\psi\| = \|\psi_n\| + o(j/\sqrt{n})$, $1-\alpha \gtrsim \frac{\sqrt{n}\,\|\psi_n\|^2}{j}$.
The prior mass of this set is bounded above by
$\pi_{n,1} \lesssim (j/\sqrt{n})^d\,\Big(\frac{\sqrt{n}\,\|\psi_n\|^2}{j}\Big)^{-(d_\psi-a)_+}$.
Similarly, when $(1-\alpha)\|\psi-\psi_n\|^2 \le \delta\|(1-\alpha)\psi-\psi_n\|$, (23) becomes
$\|(\theta_1-\theta_{1,n})^T\nabla_\theta f_{1,\theta_{1,n}} + ((1-\alpha)\psi-\psi_n)^T\nabla_\psi f_{2,\theta_{1,n},0}\|_1 \lesssim j/\sqrt{n}$,
which in turn implies that $\|\theta_1-\theta_{1,n}\| + \|(1-\alpha)\psi-\psi_n\| \lesssim j/\sqrt{n}$ and $(1-\alpha)\|\psi-\psi_n\|^2 = o(j/\sqrt{n})$, so that
$\frac{\alpha}{1-\alpha}\,\|\psi_n\| \lesssim \sqrt{\frac{j}{\sqrt{n}\,(1-\alpha)}}$
and $1-\alpha \gtrsim \sqrt{n}\,\|\psi_n\|^2/j$, and the prior mass of this set is bounded from above by
$\pi_{n,2} \lesssim (j/\sqrt{n})^d\,(\sqrt{n}\,\|\psi_n\|^2)^{-(d_\psi-a)_+}\,j^{(d_\psi-a)_+}$.
Finally, let $\|(1-\alpha)\psi-\psi_n\| \ge \delta(1-\alpha)\|\psi-\psi_n\|^2 \ge \delta^2\|(1-\alpha)\psi-\psi_n\|$. Set $\eta_1 = (\theta_1-\theta_{1,n})/\|\theta_1-\theta_{1,n}\|$, $\eta_2 = ((1-\alpha)\psi-\psi_n)/\|(1-\alpha)\psi-\psi_n\|$, $\eta_3 = (\psi-\psi_n)/\|\psi-\psi_n\|$, and
$u_n = \|\theta_1-\theta_{1,n}\| + \|(1-\alpha)\psi-\psi_n\| + (1-\alpha)\|\psi-\psi_n\|^2$,
$w_1 = \|\theta_1-\theta_{1,n}\|/u_n$, $w_2 = \|(1-\alpha)\psi-\psi_n\|/u_n$, $w_3 = (1-\alpha)\|\psi-\psi_n\|^2/u_n$.
Then $(w_1, w_2, w_3)$ belongs to the sphere of radius 1 in $\mathbb{R}^3$ and, for each $k = 1, 2, 3$, $\eta_k$ belongs to the sphere (of radius 1) in $\mathbb{R}^{d_k}$ with $d_k = d_\psi$ or $d$, so that (23) becomes
$u_n\,\|w_1\eta_1^T\nabla_\theta f_{1,\theta_{1,n}} + w_2\eta_2^T\nabla_\psi f_{2,\theta_{1,n},0} + w_3\eta_3^T D^2_\psi f_{2,\theta_{1,n},0}\,\eta_3\|_1 \lesssim j/\sqrt{n}$.
Assumption B4 implies that $u_n \lesssim j/\sqrt{n}$. This leads to the same constraints as in the case of $\pi_{n,1}$, so that finally
$\pi(S_n(j)) \lesssim (j/\sqrt{n})^d + (j/\sqrt{n})^d\,(\sqrt{n}\,\|\psi_n\|^2)^{-(d_\psi-a)_+}\,j^{(d_\psi-a)_+}$
and
$\pi(S_n(j))\,/\,\pi(\tilde S_n) \lesssim j^{d+2(d_\psi-a)_+}$.
We now control the entropy of $S_n(j)$ for $j \le M\sqrt{\log n}$, i.e. the logarithm of the covering number of $S_n(j)$ by $L_1$ balls of radius $\zeta j/\sqrt{n}$, $\zeta > 0$. From the computations above, $S_n(j)$ is included in
$\{\|\theta_1-\theta_{1,n}\| \le C_1 j/\sqrt{n}\} \cap \{\|(1-\alpha)\psi-\psi_n\| \le C_1 j/\sqrt{n}\} \cap \{(1-\alpha)\|\psi-\psi_n\|^2 \le C_1 j/\sqrt{n}\}$
for some $C_1 > 0$. If $\|\psi_n\| \lesssim j/\sqrt{n}$, we are back to the proof of Theorem 2, and the local entropy is bounded by a constant. If $j/\sqrt{n} \le \delta\|\psi_n\|$ with $\delta$ small, recall from the above computations that $1-\alpha \ge \tau\sqrt{n}\,\|\psi_n\|^2/j$ for some $\tau > 0$, so that $\|\psi-\psi_n\| \le C_2\,j^{1/2}(1-\alpha)^{-1/2} n^{-1/4}$. A Taylor expansion in $\psi, \psi', \theta_1, \theta_1'$ leads to
$\|\alpha f_{1,\theta_1} + (1-\alpha) f_{2,\theta_1,\psi} - (\alpha' f_{1,\theta_1'} + (1-\alpha') f_{2,\theta_1',\psi'})\|_1 = \|((1-\alpha)\psi - (1-\alpha')\psi')^T\nabla_\psi f_{2,\theta_1,0} + \frac{1-\alpha}{2}\,\psi^T D^2_\psi f_{2,\theta_1,\bar\psi}\,\psi - \frac{1-\alpha'}{2}\,(\psi')^T D^2_\psi f_{2,\theta_1,\bar\psi'}\,\psi'\|_1 + O(\|\theta_1-\theta_1'\|) + o(j/\sqrt{n}) = \|((1-\alpha)\psi - (1-\alpha')\psi')^T\nabla_\psi f_{2,\theta_1,0} + \frac12\,[(1-\alpha)\psi - (1-\alpha')\psi']^T D^2_\psi f_{2,\theta_1,\bar\psi}\,\psi + \frac{1-\alpha'}{2}\,(\psi')^T D^2_\psi f_{2,\theta_1,\bar\psi'}\,(\psi-\psi')\|_1 + O(\|\theta_1-\theta_1'\|) + o(j/\sqrt{n}) \le \zeta j/\sqrt{n}$
as soon as $\|\theta_1-\theta_1'\| \le \delta\zeta j/(4\sqrt{n})$, $\|(1-\alpha)\psi - (1-\alpha')\psi'\| \le \delta\zeta j/(4\sqrt{n})$ and $\|\psi'-\psi\| \le \delta\zeta j/(4\sqrt{n}\,\|\psi_n\|)$. The number of such balls needed to cover
$\{\|(1-\alpha)\psi-\psi_n\| \le C_1 j/\sqrt{n}\} \cap \{(1-\alpha)\|\psi-\psi_n\|^2 \le C_1 j/\sqrt{n}\} \cap \{\|\theta_1-\theta_{1,n}\| \le C_1 j/\sqrt{n}\}$
is bounded by a constant independent of $j$ and $n$. Finally, combining the lower bound on $D_n$, the upper bound on $\pi(S_n(j))$, the entropy bounds above and Theorem 2.4 of Ghosal et al. (2000), we obtain that for all sequences $z_n$ increasing to infinity, uniformly over $\|\psi_n\| \ge M_n/\sqrt{n}$ and $\theta_{1,n} \in \Theta_1$,
$E_{f_n^*}\big[\pi\big(\|f_{\alpha,\theta_1,\psi} - f_n^*\|_1 \ge z_n/\sqrt{n} \mid x^n\big)\big] = o(1)$, $f_n^* = f_{2,\theta_{1,n},\psi_n}$.
From the above computations, with $j = z_n$ going to infinity and $z_n = o(\sqrt{n}\,\|\psi_n\|) = o(n^{1/4})$, we have that $1-\alpha \ge C\sqrt{n}\,\|\psi_n\|^2$, so that there exists $M > 0$ such that
$\sup_{\|\psi\| \ge M_n/\sqrt{n}}\ \sup_{\theta_1 \in \Theta_1} E_{\theta_1,\psi}\big[\pi\big(1-\alpha < M\,M_n^2/\sqrt{n} \mid x^n\big)\big] = o(1)$
and Theorem 3 is proved. Note that when $a > d_\psi$, for any $e_n > 0$ the set $A_n = \{1-\alpha \le e_n\}$ has posterior probability going to 0 under $f_n^*$, since the prior mass of the event
$\{\|\theta_1-\theta_{1,n}\| + \|(1-\alpha)\psi-\psi_n\| + (1-\alpha)\|\psi-\psi_n\|^2 \le z_n/\sqrt{n}\} \cap \{1-\alpha \le e_n\}$
is of order $O(e_n^{a-d_\psi} z_n^d n^{-d/2}) = o(n^{-d/2})$ as soon as $e_n^{a-d_\psi} z_n^d = o(1)$. Since we can choose $z_n$
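The boundary behaviour established in Theorems 2 and 3 can also be illustrated numerically: when the data are generated from the null component, the posterior mass of $1-\alpha$ collapses towards 0, and the influence of the Beta(a, a) hyperparameter a fades as n grows. The sketch below is ours and not part of the paper; it computes the posterior of $\alpha$ by deterministic grid integration for data drawn from the null N(0,1), under a in {0.1, 0.5, 1}, with the component densities treated as known. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(0.0, 1.0, n)  # data generated from the null component N(0,1)

def npdf(y, mean):
    """Gaussian density with unit standard deviation."""
    return np.exp(-0.5 * (y - mean) ** 2) / np.sqrt(2.0 * np.pi)

# Two known component densities: null N(0,1) versus alternative N(3,1).
f1, f2 = npdf(x, 0.0), npdf(x, 3.0)

def posterior_mean_alpha(a, grid_size=2000):
    """Posterior mean of the weight alpha under a Beta(a, a) prior,
    computed by grid integration (stabilized by subtracting the max)."""
    grid = np.linspace(1e-4, 1.0 - 1e-4, grid_size)
    loglik = np.array([np.sum(np.log(g * f1 + (1.0 - g) * f2)) for g in grid])
    logpost = loglik + (a - 1.0) * (np.log(grid) + np.log(1.0 - grid))
    w = np.exp(logpost - logpost.max())
    w /= w.sum()
    return float(np.sum(grid * w))

means = {a: posterior_mean_alpha(a) for a in (0.1, 0.5, 1.0)}
for a, m in means.items():
    print(f"a = {a}: posterior mean of 1 - alpha = {1.0 - m:.4f}")
```

In this run the posterior mean of $1-\alpha$ is small for every choice of a, and the three values are nearly identical, matching the claimed vanishing sensitivity to a.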