Bayes Factors for Peri-Null Hypotheses
arXiv [math.ST], Feb
Alexander Ly
University of Amsterdam
Centrum Wiskunde & Informatica
Eric-Jan Wagenmakers
University of Amsterdam

Abstract

A perennial objection against Bayes factor point-null hypothesis tests is that the point-null hypothesis is known to be false from the outset. Following Morey and Rouder (2011) we examine the consequences of approximating the sharp point-null hypothesis by a hazy ‘peri-null’ hypothesis instantiated as a narrow prior distribution centered on the point of interest. The peri-null Bayes factor then equals the point-null Bayes factor multiplied by a correction term which is itself a Bayes factor. For moderate sample sizes, the correction term is relatively inconsequential; however, for large sample sizes the correction term becomes influential and causes the peri-null Bayes factor to be inconsistent and approach a limit that depends on the ratio of prior ordinates evaluated at the maximum likelihood estimate. We characterize the asymptotic behavior of the peri-null Bayes factor and discuss how to construct peri-null Bayes factor hypothesis tests that are also consistent.

Keywords: Consistency, Peri-null correction factor, Asymptotic sampling distribution

PERI-NULL BAYES FACTORS ARE INCONSISTENT

vagueness leads nowhere.
Jeffreys, 1937
Introduction
In the Bayesian paradigm, the support that data $y^n := (y_1, \ldots, y_n)$ provide for an alternative hypothesis $\mathcal{H}_1$ versus a point-null hypothesis $\mathcal{H}_0$ is given by the Bayes factor $\mathrm{BF}_{10}(y^n)$:

$$\underbrace{\frac{p(y^n \mid \mathcal{H}_1)}{p(y^n \mid \mathcal{H}_0)}}_{\mathrm{BF}_{10}(y^n)} = \overbrace{\frac{P(\mathcal{H}_1 \mid y^n)}{P(\mathcal{H}_0 \mid y^n)}}^{\text{posterior model odds}} \Big/ \overbrace{\frac{P(\mathcal{H}_1)}{P(\mathcal{H}_0)}}^{\text{prior model odds}} \tag{1}$$

$$= \frac{\int_{\Theta_1} f(y^n \mid \theta_1)\, \pi(\theta_1 \mid \mathcal{H}_1)\, \mathrm{d}\theta_1}{\int_{\Theta_0} f(y^n \mid \theta_0)\, \pi(\theta_0 \mid \mathcal{H}_0)\, \mathrm{d}\theta_0}, \tag{2}$$

where the first line indicates that the Bayes factor quantifies the change from prior to posterior model odds (Wrinch & Jeffreys, 1921), and the second line indicates that this change is given by a ratio of marginal likelihoods, that is, a comparison of prior predictive performance obtained by integrating the parameters $\theta_j$ out of the $j$th model’s likelihood $f(y^n \mid \theta_j)$ at the observations $y^n$ with respect to the prior density $\pi(\theta_j \mid \mathcal{H}_j)$ (Jeffreys, 1935, 1939; Kass & Raftery, 1995). Although the general framework applies to the comparison of any two models (as long as the models make probabilistic predictions; Dawid, 1984; Shafer & Vovk, 2019), the procedure developed by Harold Jeffreys in the late 1930s was explicitly designed as an improvement on $p$-value null-hypothesis significance testing. In the prototypical scenario, a null-hypothesis $\mathcal{H}_0$ has $p_0$ free parameters, whereas an alternative hypothesis $\mathcal{H}_1$ has $p_1 = p_0 + 1$ free parameters; the additional free parameter in $\mathcal{H}_1$ is the one that is test-relevant. For instance, in Jeffreys’s $t$-test the test-relevant parameter $\delta = \mu/\sigma$ represents standardized effect size; after assigning prior distributions to the model parameters we may compute the Bayes factor for $\mathcal{H}_0: \delta = 0$ with free parameter $\theta_0 = \sigma \in (0, \infty)$ versus $\mathcal{H}_1: \theta_1 = (\delta, \sigma) \in \mathbb{R} \times (0, \infty)$, where $\delta$ is unrestricted and $\sigma$ denotes the common nuisance parameter.
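Eq. (1) can be rearranged to turn a Bayes factor into a posterior model probability. The following is a minimal sketch of that arithmetic; the function name `posterior_prob_h1` and the default of equal prior model probabilities are illustrative assumptions, not part of the original text.

```python
def posterior_prob_h1(bf10: float, prior_prob_h1: float = 0.5) -> float:
    """Posterior probability of H1 implied by Eq. (1).

    posterior odds = BF_10 * prior odds, and P(H1 | y) = odds / (1 + odds).
    The default prior_prob_h1 = 0.5 (equal prior odds) is an assumption.
    """
    if not 0.0 < prior_prob_h1 < 1.0:
        raise ValueError("prior_prob_h1 must lie strictly between 0 and 1")
    if bf10 < 0.0:
        raise ValueError("a Bayes factor cannot be negative")
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# A Bayes factor of 3 with equal prior odds, and with prior P(H1) = 0.25.
print(posterior_prob_h1(3.0))
print(posterior_prob_h1(3.0, prior_prob_h1=0.25))
```

With equal prior odds the posterior odds equal the Bayes factor itself, which is why the prior model probabilities only rescale, and never reverse, the evidence in the data.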
When $\mathrm{BF}_{01}(y^n) = 1/\mathrm{BF}_{10}(y^n)$ is larger than 1, the data provide evidence that the ‘general law’ $\mathcal{H}_0$ can be retained; when $\mathrm{BF}_{01}(y^n)$ is smaller than 1, the data provide evidence that it ought to be replaced by $\mathcal{H}_1$, the model that relaxes the general law. The larger the deviation from 1, the stronger the evidence. Importantly, in Jeffreys’s framework the test-relevant parameter is fixed under $\mathcal{H}_0$ and free to vary under $\mathcal{H}_1$. The hypothesis $\mathcal{H}_0$ is generally known as a ‘point-null’ hypothesis.

A perennial objection against point-null hypothesis testing, whether Bayesian or frequentist, is that in most practical applications the point-null is never true exactly (e.g., Bakan, 1966; Berkson, 1938; Edwards, Lindman, & Savage, 1963; Jones & Tukey, 2000; Kruschke & Liddell, 2018; see also Laplace, 1774/1986, p. 375). If this argument is accepted and $\mathcal{H}_0$ is deemed to be false from the outset, then the test merely assesses whether or not the sample size was sufficiently large to detect the non-zero effect. This objection was forcefully made by Tukey:

“Statisticians classically asked the wrong question – and were willing to answer with a lie, one that was often a downright lie. They asked “Are the effects of A and B different?” and they were willing to answer “no.” All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking “Are the effects different?” is foolish.” (Tukey, 1991, p. 100)

This perennial objection has been rebutted in several ways (e.g., Jeffreys, 1937, 1961; Kass & Raftery, 1995); in the current work we focus on the most common rebuttal, namely that the point-null hypothesis is a mathematically convenient approximation to a more realistic ‘peri-null’ (Tukey, 1995) hypothesis $\mathcal{H}_{\tilde 0}$ that assigns the test-relevant parameter a distribution tightly concentrated around the value specified by the point-null hypothesis (e.g., Good, 1967, p. 416; Berger & Delampady, 1987; Cornfield, 1966, 1969; Dickey, 1976; Edwards et al., 1963; Gallistel, 2009; George & McCulloch, 1993; Jeffreys, 1935, 1936; Rouder, Speckman, Sun, Morey, & Iverson, 2009). For instance, in the case of the $t$-test the peri-null $\mathcal{H}_{\tilde 0}$ could specify $\delta \sim \pi(\delta \mid \mathcal{H}_{\tilde 0}) = N(0, \kappa_0^2)$, where the width $\kappa_0$ is set to a small value.

Previous work has suggested that the approximation of a point-null hypothesis by an interval is reasonable when that interval is half a standard error wide (Berger & Delampady, 1987) or one standard error wide (Jeffreys, 1935). Here we explore the consequences of replacing the point-null hypothesis $\mathcal{H}_0$ by a peri-null hypothesis $\mathcal{H}_{\tilde 0}$ from a different angle. We alter only the specification of the null-hypothesis $\mathcal{H}_0$, which means that the alternative hypothesis $\mathcal{H}_1$ now overlaps with $\mathcal{H}_{\tilde 0}$. Below we show, first, that the effect on the Bayes factor of replacing $\mathcal{H}_0$ with $\mathcal{H}_{\tilde 0}$ is given by another Bayes factor, namely that between $\mathcal{H}_0$ and $\mathcal{H}_{\tilde 0}$ (cf. Morey & Rouder, 2011, p. 411). This ‘peri-null correction factor’ is usually near 1, unless sample size grows large. In the limit of large sample sizes, we demonstrate that the Bayes factor for the peri-null $\mathcal{H}_{\tilde 0}$ versus the alternative $\mathcal{H}_1$ is bounded by the ratio of the prior ordinates evaluated at the maximum likelihood estimate. This proves earlier statements from Morey and Rouder (2011, pp. 411-412) and confirms suggestions in Jeffreys (1961, p. 367) and Jeffreys (1973, p. 39, Eq. 2). In other words, the Bayes factor for the peri-null hypothesis is inconsistent. We end with suggestions on how a consistent method for hypothesis testing can be obtained without fully committing to a point-null hypothesis.

The Peri-Null Correction Factor
Consider the three hypotheses discussed earlier: the point-null hypothesis $\mathcal{H}_0$ fixes the test-relevant parameter at a single value (e.g., $\delta = 0$); the peri-null hypothesis $\mathcal{H}_{\tilde 0}$ assigns the test-relevant parameter a distribution that is tightly centered around the value of interest (e.g., $\delta \sim \pi(\delta \mid \mathcal{H}_{\tilde 0}) = N(0, \kappa_0^2)$ with $\kappa_0$ small); and the alternative hypothesis $\mathcal{H}_1$ assigns the test-relevant parameter a relatively wide prior distribution, $\delta \sim \pi(\delta \mid \mathcal{H}_1)$. The Bayes factor of interest is between $\mathcal{H}_1$ and $\mathcal{H}_{\tilde 0}$, which can be expressed as the product of two Bayes factors involving $\mathcal{H}_0$:

$$\underbrace{\frac{p(y^n \mid \mathcal{H}_1)}{p(y^n \mid \mathcal{H}_{\tilde 0})}}_{\text{Peri-null } \mathrm{BF}_{1\tilde 0}} = \underbrace{\frac{p(y^n \mid \mathcal{H}_1)}{p(y^n \mid \mathcal{H}_0)}}_{\text{Point-null } \mathrm{BF}_{10}} \times \underbrace{\frac{p(y^n \mid \mathcal{H}_0)}{p(y^n \mid \mathcal{H}_{\tilde 0})}}_{\text{Correction factor } \mathrm{BF}_{0\tilde 0}}. \tag{3}$$

In words, the Bayes factor for the alternative hypothesis against the peri-null hypothesis equals the Bayes factor for the alternative hypothesis against the point-null hypothesis, multiplied by a correction factor (cf. Morey & Rouder, 2011, p. 411). This correction factor quantifies the extent to which the point-null hypothesis outpredicts the peri-null hypothesis. With data sets of moderate size, and $\kappa_0$ small, the peri-null and point-null hypotheses will make similar predictions, and consequently the correction factor will be close to 1. In such cases, the point-null can indeed be considered a mathematically convenient approximation to the peri-null.

Example
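As a numerical illustration of Eq. (3), the sketch below checks the decomposition in a deliberately simplified setting that is an assumption of this example rather than the paper's $t$-test: observations $Y_i \sim N(\delta, 1)$ with known variance, a normal prior $N(0, \kappa_1^2)$ under the alternative, and a normal prior $N(0, \kappa_0^2)$ under the peri-null. In that setting the sample mean is sufficient and every marginal likelihood is available in closed form. The function names are hypothetical.

```python
import math


def log_marginal(ybar: float, n: int, prior_sd: float) -> float:
    """Log marginal density of the sample mean when delta ~ N(0, prior_sd^2).

    Assumes Y_i ~ N(delta, 1), so ybar | delta ~ N(delta, 1/n) and the
    marginal is N(0, prior_sd^2 + 1/n). prior_sd = 0 gives the point null.
    """
    var = prior_sd ** 2 + 1.0 / n
    return -0.5 * (math.log(2.0 * math.pi * var) + ybar ** 2 / var)


def log_bf_decomposition(ybar: float, n: int, kappa0: float, kappa1: float) -> dict:
    """Log Bayes factors appearing in Eq. (3), computed on the log scale."""
    log_m1 = log_marginal(ybar, n, kappa1)      # alternative
    log_m0 = log_marginal(ybar, n, 0.0)         # point null
    log_mp = log_marginal(ybar, n, kappa0)      # peri-null
    return {
        "point_null": log_m1 - log_m0,   # log BF_10
        "correction": log_m0 - log_mp,   # log of the correction factor
        "peri_null": log_m1 - log_mp,    # log of the peri-null Bayes factor
    }


# The correction factor is near 1 (log near 0) for moderate n, but not for large n.
for n in (50, 500, 50_000):
    parts = log_bf_decomposition(ybar=0.2, n=n, kappa0=0.01, kappa1=1.0)
    print(n, {name: round(value, 4) for name, value in parts.items()})
```

Working on the log scale avoids numerical underflow of the point-null marginal likelihood at large $n$; on that scale the product in Eq. (3) becomes a sum.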
Consider the hypothesis that “more desired objects are seen as closer” (Balcetis & Dunning, 2011). In the authors’ Study I, 90 participants had to estimate their distance to a bottle of water. Immediately prior to this task, 47 ‘thirsty’ participants had consumed a serving of pretzels, whereas 43 ‘quenched’ participants had drunk as much as they wanted from four 8-oz glasses of water. In line with the authors’ predictions, “Thirsty participants perceived the water bottle as closer ($M = 25.$[…] $SD = 7.3$) than quenched participants did ($M = 28.$[…] $SD = 6.$[…] $t = 2.00$ and $p = .$[…] $t$-test concerning the test-relevant parameter $\delta$ may contrast $\mathcal{H}_0: \delta = 0$ versus $\mathcal{H}_1$ with a Cauchy distribution with mode 0 and interquartile range $\kappa_1$, the common default value $\kappa_1 = 1/\sqrt{2}$ […] $= 1.$[…] $\mathcal{H}_1$. We may also compute a peri-null correction factor by contrasting $\mathcal{H}_0: \delta = 0$ against $\mathcal{H}_{\tilde 0}: \delta \sim N(0, \kappa_0^2)$, with $\kappa_0 = 0.01$, say. The resulting peri-null correction factor is $\mathrm{BF}_{0\tilde 0} = 0.$[…] $\kappa_0 = 0.05$, we have $\mathrm{BF}_{0\tilde 0} = 0.$[…] factor of $\mathrm{BF}_{1\tilde 0} = 1.$[…] $= 1.259$ to $\mathrm{BF}_{1\tilde 0} = 1.167$ is utterly inconsequential.

(Footnote: Calculated using the Summary Stats module in JASP (e.g., Ly et al., 2018; jasp-stats.org), and based on Gronau, Ly, and Wagenmakers (2020).)

The difference between the peri-null and point-null Bayes factor remains inconsequential for larger values of $t$. When we change $t = 2.00$ to $t = 4.00$, the point-null Bayes factor equals $\mathrm{BF}_{10} = 174$, which according to Jeffreys’s classification of evidence (e.g., Jeffreys, 1961, Appendix B) is considered compelling evidence for $\mathcal{H}_1$. With $\kappa_0 = 0.01$, the peri-null correction factor equals $\mathrm{BF}_{0\tilde 0} = 0.986$ and consequently the peri-null Bayes factor equals about 172 in favor of $\mathcal{H}_1$ over $\mathcal{H}_{\tilde 0}$. With $\kappa_0 = 0.05$, $\mathrm{BF}_{0\tilde 0} = 0.713$ and $\mathrm{BF}_{1\tilde 0} \approx 124$ […] $P(\mathcal{H}_1 \mid y^n) = 174/175 \approx .994$ versus $124/125 = 0.992$ […] $\mathcal{H}_1$ and $\mathcal{H}_{\tilde 0}$ at the maximum likelihood estimate.

The Peri-Null Bayes Factor is Inconsistent
Historically, the main motivation for the development of the Bayes factor was the desire to be able to obtain arbitrarily large evidence for a general law: “We are looking for a system that will in suitable cases attach probabilities near 1 to a law.” (Jeffreys, 1977, p. 88; see also Etz & Wagenmakers, 2017; Ly et al., 2020; Wrinch & Jeffreys, 1921).

Statistically, this desideratum means that we want Bayes factors to be consistent, which implies that, as sample size increases, $\mathrm{BF}_{10}(Y^n)$ (i) grows without bound when the data are generated under the alternative model $\mathcal{H}_1$, and (ii) tends to zero when the data are generated under the null model, that is,

$$\mathrm{BF}_{10}(Y^n) \overset{P_\theta}{\to} \infty, \; P_\theta \in \mathcal{H}_1, \quad \text{and} \quad \mathrm{BF}_{10}(Y^n) \overset{P_\theta}{\to} 0, \; P_\theta \in \mathcal{H}_0, \tag{4}$$

thus, regardless of the chosen prior model probabilities $P(\mathcal{H}_0), P(\mathcal{H}_1) \in (0, 1)$,

$$P(\mathcal{H}_1 \mid Y^n) \overset{P_\theta}{\to} 1, \; P_\theta \in \mathcal{H}_1, \quad \text{and} \quad P(\mathcal{H}_0 \mid Y^n) \overset{P_\theta}{\to} 1, \; P_\theta \in \mathcal{H}_0, \tag{5}$$

where $P_\theta$ refers to the data generating distribution, here, $Y_i \overset{\text{iid}}{\sim} P_\theta$, and where $X_n \overset{P_\theta}{\to} X$ denotes convergence in probability, that is, $\lim_{n \to \infty} P_\theta(|X_n - X| > \epsilon) = 0$ for every $\epsilon > 0$, as usual.

Suppose that the parameter $\theta \in \mathbb{R}^p$ can be separated into a test-relevant parameter $\delta \in \mathbb{R}$ and nuisance parameters $\sigma \in \mathbb{R}^{p-1}$. Below we prove that when the point-null hypothesis $\mathcal{H}_0: \delta = 0$ is replaced by a distribution over $\delta$, i.e., the peri-null hypothesis $\mathcal{H}_{\tilde 0}: \delta \sim \pi(\delta \mid \mathcal{H}_{\tilde 0})$, the resulting peri-null Bayes factor $\mathrm{BF}_{1\tilde 0}(Y^n)$ is inconsistent (cf. suggestions by Jeffreys, 1961, p. 367; Jeffreys, 1973, p. 39, Eq. 2; and the statements by Morey & Rouder, 2011, pp. 411-412).

The inconsistency of peri-null Bayes factors follows quite directly from Laplace’s method (Laplace, 1774/1986) for nested model comparisons, and consistency of the maximum likelihood estimator (MLE).
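The contrast between a consistent and an inconsistent Bayes factor can be previewed numerically. The sketch below is not the paper's $t$-test; it assumes, for tractability, $Y_i \sim N(\delta, 1)$ with known variance and normal priors $N(0, \kappa_1^2)$ and $N(0, \kappa_0^2)$ on $\delta$ under the alternative and the peri-null, so that both Bayes factors have closed forms in terms of the sample mean. The function names, the seed, and the parameter values are illustrative choices.

```python
import math
import random


def log_normal_pdf(x: float, var: float) -> float:
    """Log density of N(0, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def log_bayes_factors(ybar: float, n: int, kappa0: float, kappa1: float):
    """Return (log point-null BF, log peri-null BF) for the assumed model.

    With delta ~ N(0, kappa^2) and Y_i ~ N(delta, 1), the sample mean has
    marginal distribution N(0, kappa^2 + 1/n); kappa = 0 is the point null.
    """
    log_m1 = log_normal_pdf(ybar, kappa1 ** 2 + 1.0 / n)
    log_m0 = log_normal_pdf(ybar, 1.0 / n)
    log_m_peri = log_normal_pdf(ybar, kappa0 ** 2 + 1.0 / n)
    return log_m1 - log_m0, log_m1 - log_m_peri


def log_prior_ratio(delta: float, kappa0: float, kappa1: float) -> float:
    """Log ratio of the two prior densities at the data-governing delta."""
    return log_normal_pdf(delta, kappa1 ** 2) - log_normal_pdf(delta, kappa0 ** 2)


rng = random.Random(2021)
delta, kappa0, kappa1 = 0.3, 0.1, 1.0
for n in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
    # Draw the sample mean directly: ybar ~ N(delta, 1/n).
    ybar = rng.gauss(delta, 1.0 / math.sqrt(n))
    log_bf10, log_bf_peri = log_bayes_factors(ybar, n, kappa0, kappa1)
    print(n, round(log_bf10, 2), round(log_bf_peri, 3))
print("log prior ratio at delta:", round(log_prior_ratio(delta, kappa0, kappa1), 3))
```

In this simplified model the point-null log Bayes factor keeps growing with $n$ when $\delta \neq 0$, whereas the peri-null log Bayes factor settles near the log ratio of the prior densities at $\delta$.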
Both Laplace’s method and consistency of the MLE hold under weaker conditions than stated here, namely, for absolutely continuous priors (e.g., van der Vaart, 1998, Chapter 10), and regular parametric models (e.g., van der Vaart, 1998, Chapter 7; Ly, Marsman, Verhagen, Grasman, & Wagenmakers, 2017, Appendix E). These models only need to be once differentiable with respect to $\theta$ in quadratic mean and have non-degenerate Fisher information matrices that are continuous in $\theta$ with determinants that are bounded away from zero and infinity. The inconsistency of the peri-null Bayes factor is therefore expected to hold more generally.

We show that under the stronger conditions of Kass, Tierney, and Kadane (1990), the asymptotic sampling distribution of peri-null Bayes factors can be easily derived. These stronger conditions imply that the model is regular, for which we know that the MLE is not only consistent, but also locally asymptotically normal with a variance equal to the inverse of the observed Fisher information matrix at $\hat\theta$, which has entries

$$[\hat{I}(\hat\theta)]_{a,b} = -\frac{1}{n} \sum_{i=1}^{n} \left( \frac{\partial^2}{\partial\theta_a \partial\theta_b} \log f(Y_i \mid \theta) \right) \bigg|_{\theta = \hat\theta}, \tag{6}$$

see for instance Ly et al. (2017) for details.

Theorem 1 (Limit of a peri-null Bayes factor). Let $Y^n = (Y_1, \ldots, Y_n)$ be independently and identically distributed random variables with common distribution $P_\theta \in \mathcal{P}_\Theta$, where $\mathcal{P}_\Theta$ is an identifiable family of distributions that is Laplace-regular (Kass et al., 1990). This implies that $\mathcal{P}_\Theta$ admits densities $f(y^n \mid \theta)$ with respect to the Lebesgue measure that are six times continuously differentiable in $\theta$ at the data-governing parameter $\theta \in \Theta \subset \mathbb{R}^p$ and $\Theta$ open with non-empty interior. Furthermore, assume that the (peri-null) prior densities $\pi(\theta \mid \mathcal{H}_{\tilde 0})$ and $\pi(\theta \mid \mathcal{H}_1)$ assign positive mass to a neighborhood at the data-governing parameter $\theta$ and are four times continuously differentiable at $\theta$; then

$$\mathrm{BF}_{1\tilde 0}(Y^n) \overset{P_\theta}{\to} \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})}. \; \diamond$$

Proof.
The condition that the model is Laplace-regular allows us to employ the Laplace method to approximate the numerator and the denominator of the peri-null Bayes factor by

$$p(Y^n \mid \mathcal{H}_j) = f(Y^n \mid \hat\theta) \left( \frac{2\pi}{n} \right)^{p/2} |\hat{I}(\hat\theta)|^{-1/2}\, \pi(\hat\theta \mid \mathcal{H}_j) \times \left( 1 + \frac{C_1(\hat\theta \mid \mathcal{H}_j)}{n} + \frac{C_2(\hat\theta \mid \mathcal{H}_j)}{n^2} + O(n^{-3}) \right), \tag{7}$$

where $C_1(\hat\theta \mid \mathcal{H}_j)$ and $C_2(\hat\theta \mid \mathcal{H}_j)$ for $j = \tilde 0, 1$ are the higher-order correction terms of the Laplace expansion. Because the likelihood term, $(2\pi/n)^{p/2}$, and $|\hat{I}(\hat\theta)|^{-1/2}$ do not depend on $j$, they cancel in the ratio, so that

$$\mathrm{BF}_{1\tilde 0}(Y^n) = \frac{\pi(\hat\theta \mid \mathcal{H}_1) \left[ 1 + \frac{C_1(\hat\theta \mid \mathcal{H}_1)}{n} + O(n^{-2}) \right]}{\pi(\hat\theta \mid \mathcal{H}_{\tilde 0}) \left[ 1 + \frac{C_1(\hat\theta \mid \mathcal{H}_{\tilde 0})}{n} + O(n^{-2}) \right]}. \tag{8}$$

Identifiability and the regularity conditions on the model imply that the maximum likelihood estimator is consistent, thus, $\hat\theta \overset{P_\theta}{\to} \theta$ (e.g., van der Vaart, 1998, Chapter 5). As all functions of $\hat\theta$ in Eq. (8) are smooth at $\theta$, the continuous mapping theorem applies and the assertion follows.

Theorem 1 implies that $\mathrm{BF}_{1\tilde 0}(Y^n)$ is inconsistent; for all data-governing parameter values that have a neighborhood that receives positive mass from both priors, the peri-null Bayes factor approaches a limit that is given by the ratio of prior densities evaluated at the data-governing $\theta$ as $n$ increases. Note that this holds in particular for the test point of interest, e.g., $\delta = 0$, which has a neighborhood that the peri-null prior assigns positive mass to. The limit in Theorem 1 can also be derived using the generalized Savage-Dickey density ratio (Verdinelli & Wasserman, 1995) and exploiting the transitivity of the Bayes factor. Theorem 1, however, can be more straightforwardly extended to characterize the asymptotic behavior of the peri-null Bayes factor.

The limiting value of the peri-null Bayes factor is not representative when $n$ is small or moderate. Theorem 2 below shows that the sampling mean of $\log \mathrm{BF}_{1\tilde 0}(Y^n)$ is expected to be of smaller magnitude than its limiting value.
In other words, the limit in Theorem 1 should be viewed as an upper bound under the alternative and a lower bound under the null.

Theorem 2 exploits the fact that without a point-null hypothesis the gradients of the densities $\pi(\theta \mid \mathcal{H}_1)$ and $\pi(\theta \mid \mathcal{H}_{\tilde 0})$ are of the same dimension, which implies that the gradient $\frac{\partial}{\partial\theta} \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right)$ is well-defined. As such, the delta method can be used to show that the peri-null Bayes factor inherits the asymptotic normality property of the MLE.

To state the theorem we write $D$ for the differential operator with respect to $\theta$, e.g., $[D^1 \pi(\theta \mid \mathcal{H}_j)] = \frac{\partial}{\partial\theta} \pi(\theta \mid \mathcal{H}_j)$ denotes the gradient, and $[D^2 \pi(\theta \mid \mathcal{H}_j)] = \frac{\partial^2}{\partial\theta\, \partial\theta^T} \pi(\theta \mid \mathcal{H}_j)$ denotes the Hessian matrix.

Theorem 2 (Asymptotic sampling distribution of a peri-null Bayes factor). Under the regularity conditions stated in Theorem 1 and for all data-governing parameters $\theta$ for which

$$\dot{v}(\theta) := \left[ D^1 \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) \right] \neq 0, \tag{9}$$

the asymptotic sampling distribution of the logarithm of the peri-null Bayes factor is normal, that is,

$$\sqrt{n} \left( \log \mathrm{BF}_{1\tilde 0}(Y^n) - \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) - E(\theta, n) \right) \overset{P_\theta}{\rightsquigarrow} N\left( 0, \dot{v}(\theta)^T I^{-1}(\theta)\, \dot{v}(\theta) \right), \tag{10}$$

where $\overset{P_\theta}{\rightsquigarrow}$ denotes convergence in distribution under $P_\theta$ and where

$$E(\theta, n) = \log\left( \frac{1 + C_1(\theta \mid \mathcal{H}_1)/n + C_2(\theta \mid \mathcal{H}_1)/n^2}{1 + C_1(\theta \mid \mathcal{H}_{\tilde 0})/n + C_2(\theta \mid \mathcal{H}_{\tilde 0})/n^2} \right), \tag{11}$$

is a bias term that is asymptotically negligible.

For all $\theta$ for which $\dot{v}(\theta) = 0$, but $\ddot{v}(\theta) := \left[ D^2 \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) \right] \neq 0$, the asymptotic distribution of $\log \mathrm{BF}_{1\tilde 0}(Y^n)$ has a quadratic form, that is,

$$n \left( \log \mathrm{BF}_{1\tilde 0}(Y^n) - \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) - E(\theta, n) \right) \overset{P_\theta}{\rightsquigarrow} \tfrac{1}{2}\, Z^T I^{-1/2}(\theta)\, \ddot{v}(\theta)\, I^{-1/2}(\theta)\, Z, \tag{12}$$

where $Z \sim N(0, I)$ with $I \in \mathbb{R}^{p \times p}$ the identity matrix. $\diamond$

Proof.
The proof depends on (another) Taylor series expansion, see Appendix A for full details. Firstly, we recall that $\sqrt{n}(\hat\theta - \theta) \overset{P_\theta}{\rightsquigarrow} N(0, I^{-1}(\theta))$. To relate this asymptotic distribution to that of $\log \mathrm{BF}_{1\tilde 0}(Y^n)$, we note that Eq. (8) is, up to a decreasing error in $n$, a smooth function of the maximum likelihood estimator. The goal is to ensure that the error terms $1 + C_1(\theta \mid \mathcal{H}_j)/n + C_2(\theta \mid \mathcal{H}_j)/n^2$ are asymptotically negligible. A Taylor series expansion at the data-governing $\theta$ shows that

$$\log \mathrm{BF}_{1\tilde 0}(Y^n) = \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) + E(\theta, n) + (\hat\theta - \theta)^T \left( \dot{v}(\theta) + [D^1 E(\theta, n)] \right) + (\hat\theta - \theta)^T\, \frac{\ddot{v}(\theta) + [D^2 E(\theta, n)]}{2}\, (\hat\theta - \theta) + O_P(n^{-3/2}). \tag{13}$$

The asymptotic normality result follows after rearranging Eq. (13), multiplying both sides by $\sqrt{n}$, and applying Slutsky’s lemma. Similarly, when $\dot{v}(\theta)$ is zero, but $\ddot{v}(\theta)$ is not, we have

$$n \log \mathrm{BF}_{1\tilde 0}(Y^n) = n \left( \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) + E(\theta, n) \right) + \sqrt{n}(\hat\theta - \theta)^T\, \frac{\ddot{v}(\theta) + O(n^{-1})}{2}\, \sqrt{n}(\hat\theta - \theta) + O_P(n^{-1/2}). \tag{14}$$

Since $\sqrt{n}(\hat\theta - \theta) \overset{P_\theta}{\rightsquigarrow} N(0, I(\theta)^{-1})$, the second order result follows.

To conclude that the bias term is indeed asymptotically negligible, note that $\log(1 + x/n) \approx x/n$ as $n \to \infty$ and therefore $D^k E(\theta, n) = O\left( \frac{1}{n} D^k \left( C_1(\theta \mid \mathcal{H}_1) - C_1(\theta \mid \mathcal{H}_{\tilde 0}) \right) \right)$ for all $k \leq 3$. The approximation $\log(1 + x/n) \approx x/n$ requires $C_k(\theta \mid \mathcal{H}_j)$ for $k = 1, 2$ and $j = \tilde 0, 1$ […] $\kappa_0$ is relatively small compared to $\kappa_1$. The bias is, therefore, expected to decay much more slowly.

Theorem 2 also shows that under the alternative hypothesis, $\log \mathrm{BF}_{1\tilde 0}(Y^n)$ is expected to increase towards the limiting value $\log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right)$ as $n \to \infty$ whenever $E(\theta, n) < 0$. The bias is negative, because if the data-governing parameter $\delta$ is far from zero, but the peri-null prior is specified such that it is peaked at zero, the Laplace approximations become less accurate. In other words, for fixed $n$ and $\delta \neq 0$, we typically have $C_1(\theta \mid \mathcal{H}_1) \leq C_1(\theta \mid \mathcal{H}_{\tilde 0})$ and $C_2(\theta \mid \mathcal{H}_1) \leq C_2(\theta \mid \mathcal{H}_{\tilde 0})$ and, therefore, $E(\theta, n) < 0$.

Example
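As a numerical companion to this example, the limit in Theorem 1 can be evaluated directly for a Cauchy prior on $\delta$ under the alternative and a normal prior on $\delta$ under the peri-null. The sketch below assumes that $\kappa_1$ is the Cauchy scale, that $\kappa_0$ is the standard deviation of the normal peri-null prior, and that the common prior factor on $\sigma$ cancels in the ratio; the function name `log_limit` is hypothetical.

```python
import math


def log_limit(delta: float, kappa0: float, kappa1: float) -> float:
    """Log of Cauchy(delta; 0, kappa1) / N(delta; 0, kappa0^2).

    This is the log prior-density ratio that Theorem 1 identifies as the
    limit of the log peri-null Bayes factor at effect size delta = mu/sigma.
    """
    log_cauchy = -math.log(math.pi * kappa1 * (1.0 + (delta / kappa1) ** 2))
    log_normal = (-0.5 * math.log(2.0 * math.pi * kappa0 ** 2)
                  - delta ** 2 / (2.0 * kappa0 ** 2))
    return log_cauchy - log_normal


# Limits under the null (delta = 0) and under small non-zero effects.
for delta, kappa0 in ((0.0, 0.05), (0.0, 0.10), (0.167, 0.05), (0.314, 0.10)):
    print(delta, kappa0, round(log_limit(delta, kappa0, kappa1=1.0), 3))
```

Because the limit depends on the data-governing parameter only through the two prior densities, a narrower peri-null prior makes the bound more extreme at $\delta = 0$ and moves the effect size at which the bound favors the alternative closer to zero.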
We consider a Bayesian $t$-test and for the peri-null Bayes factor use the priors

$$\pi(\delta, \sigma \mid \mathcal{H}_1) \propto \text{Cauchy}(\delta; 0, \kappa_1)\, \sigma^{-1} \quad \text{and} \quad \pi(\delta, \sigma \mid \mathcal{H}_{\tilde 0}) \propto N(\delta; 0, \kappa_0^2)\, \sigma^{-1}. \tag{15}$$

Note that $\pi(\delta, \sigma \mid \mathcal{H}_1)$ is chosen as in the default Bayesian $t$-test (Jeffreys, 1948; Ly, Verhagen, & Wagenmakers, 2016a, 2016b; Rouder et al., 2009), where $\kappa_1 >$ […] $\delta = \mu/\sigma$, and $\pi(\sigma) \propto \sigma^{-1}$ implies that the prior on the standard deviation common to both models is proportional to $\sigma^{-1}$ (for advantages of this choice see Grünwald, de Heide, & Koolen, 2019; Hendriksen, de Heide, & Grünwald, in press). For data-governing parameters $\theta = (\mu, \sigma)$, where $\mu$ is the population mean, Theorem 1 shows that as $n \to \infty$

$$\log \mathrm{BF}_{1\tilde 0}(Y^n; \kappa_0, \kappa_1) \overset{P_\theta}{\to} \log \frac{\sqrt{2}\, \kappa_0 \exp\left( \frac{\mu^2}{2 \kappa_0^2 \sigma^2} \right)}{\sqrt{\pi}\, \kappa_1 \left( 1 + \left[ \frac{\mu}{\kappa_1 \sigma} \right]^2 \right)} =: v(\theta). \tag{16}$$

Direct calculations show that $\dot{v}(\theta) = 0$ only when $\mu = 0$. Hence, under the alternative $\mu \neq 0$, the logarithm of these peri-null Bayes factor $t$-tests is asymptotically normal with an approximate variance of

$$\frac{(\mu^4 + 2\mu^2\sigma^2)\left( \mu^2 + (\kappa_1^2 - 2\kappa_0^2)\sigma^2 \right)^2}{2 \kappa_0^4 \sigma^4 \left( \mu^2 + \kappa_1^2 \sigma^2 \right)^2 n}. \tag{17}$$

To characterize the asymptotic mean we also require the bias term $E(\theta, n)$, which for the problem at hand comprises $C_1(\theta \mid \mathcal{H}_1)$, $C_2(\theta \mid \mathcal{H}_1)$, $C_1(\theta \mid \mathcal{H}_{\tilde 0})$, and $C_2(\theta \mid \mathcal{H}_{\tilde 0})$.

[Eqs. (18)–(21), giving these four terms as rational functions of $\mu$, $\sigma$, $\kappa_0$, and $\kappa_1$: not legible]

More concretely, under $\mu = 0.167$ and $\sigma = 1$, $\log \mathrm{BF}_{1\tilde 0}(Y^n; 0.05, 1)$ converges in probability to $\log(10)$. This limit is depicted as the brown dashed horizontal curve in the top left subplot of Fig. 1.

[Figure 1: four panels of $\log \mathrm{BF}$ against $n$; one column per value of $\kappa_0$.]

Figure 1. Under the alternative, the logarithm of the peri-null Bayes factor $t$-test is asymptotically normal with a mean (i.e., the solid curves) that increases to the limit, e.g., $\log \mathrm{BF}_{1\tilde 0} = \log(10)$ and $\log \mathrm{BF}_{1\tilde 0} = \log(30)$ in the top and bottom row respectively. The black and red curves correspond to the simulated and asymptotic normal sampling distribution respectively. The dotted curves show the 97.5% and 2.5% quantiles of the respective sampling distribution. Note that the convergence to the upper bound is slower when the peri-null is more concentrated, e.g., compare the left to the right column.

This subplot also shows the mean (solid red curve) and the 97.5% and 2.5% quantiles (dotted red curves above and below the solid curve respectively) based on the asymptotic normal result of Theorem 2. The black curves represent the analogous quantities based on simulated normal data with $\mu = 0.167$ and $\sigma = 1$ based on 1,000 replications at sample sizes $n = 100,$ […] $p(Y^n \mid \mathcal{H}_{\tilde 0})$ is still inaccurate. As expected, the Laplace approximation becomes accurate sooner whenever the peri-null prior is less concentrated. The top right subplot depicts results of $\log \mathrm{BF}_{1\tilde 0}(Y^n; 0.10, 1)$ under $\mu = 0.314$ and $\sigma = 1$, which converges in probability to $\log(10)$.

Similarly, the asymptotic normal distribution becomes adequate at a smaller sample size for larger population means $\mu$. The bottom left subplot corresponds to $\log \mathrm{BF}_{1\tilde 0}(Y^n; 0.05, 1)$ under $\mu = 0.182$ and $\sigma = 1$, whereas the bottom right subplot corresponds to $\log \mathrm{BF}_{1\tilde 0}(Y^n; 0.10, 1)$ under $\mu = 0.348$ and $\sigma = 1$. The logarithms of both peri-null Bayes factors converge in probability to $\log(30)$.

In sum, the plots show that under the alternative hypothesis the asymptotic normal distribution approximates the sampling distribution of the logarithm of the peri-null Bayes factor quite well, and it approximates better when the peri-null prior is less concentrated.

Under the null hypothesis $\mu = 0$, the gradient $\dot{v}(0, \sigma) = 0$, and so is the Hessian, except for the first entry of $\ddot{v}$, that is,

$$\frac{\partial^2}{\partial\mu^2} v(\mu, \sigma) \bigg|_{\mu = 0} = \frac{\kappa_1^2 - 2\kappa_0^2}{\kappa_0^2 \kappa_1^2 \sigma^2}. \tag{22}$$

As such, $\log \mathrm{BF}_{1\tilde 0}(Y^n)$ asymptotically has a shifted and scaled $\chi^2(1)$-distribution, i.e.,

$$n \left( \log \mathrm{BF}_{1\tilde 0}(Y^n; \kappa_0, \kappa_1) - \log\left( \frac{\pi(\theta \mid \mathcal{H}_1)}{\pi(\theta \mid \mathcal{H}_{\tilde 0})} \right) - E(\theta, n) \right) \overset{P_{0,\sigma}}{\rightsquigarrow} \frac{\kappa_1^2 - 2\kappa_0^2}{2 \kappa_0^2 \kappa_1^2}\, Z^2, \tag{23}$$

where $Z \sim N(0, 1)$. Under $\mu = 0$ and $\sigma = 1$, $\log \mathrm{BF}_{1\tilde 0}(Y^n; 0.05, 1)$ converges in probability to $-3.22$, whereas $\log \mathrm{BF}_{1\tilde 0}(Y^n; 0.10, 1)$ converges in probability to $-2.53$. Both cases yield evidence for the null hypothesis, but the evidence is stronger for the peri-null that is more tightly concentrated around 0. The approximation based on the asymptotic $\chi^2(1)$-distribution (in red) and the simulations (in black) are shown in Fig. 2. In the left subplot, the curves based on the asymptotic $\chi^2(1)$-distribution only start from $n = 185$, because only for $n \geq 185$ does $\log(1 + C_1(0, 1 \mid \mathcal{H}_{\tilde 0})/n + C_2(0, 1 \mid \mathcal{H}_{\tilde 0})/n^2)$ have a positive argument; for $\kappa_0 = 0.05$ we have that $C(0, 1 \mid \mathcal{H}_{\tilde 0}) = -$[…]$.83$. Note that under the null hypothesis, the Laplace approximations are accurate sooner than under the alternative hypothesis, because the priors are already concentrated at zero. Under the null hypothesis the general observation remains true that for reasonable sample sizes the expected peri-null Bayes factor is far from the limiting value.

[Figure 2: two panels of $\log \mathrm{BF}$ against $n$; one panel per value of $\kappa_0$.]

Figure 2. Under the null, the logarithm of the peri-null Bayes factor $t$-test asymptotically has a shifted $\chi^2(1)$-distribution with a mean (i.e., the solid curves) that decreases to the limit, e.g., $\log \mathrm{BF}_{1\tilde 0} = -3.22$ and $\log \mathrm{BF}_{1\tilde 0} = -2.53$ in the left and right plot respectively. The black and red curves correspond to the simulated and asymptotic $\chi^2(1)$ sampling distribution respectively. The dotted curves show the 97.5% and 2.5% quantiles of the respective sampling distribution. Note that the convergence to the lower bound is slower when the peri-null is more concentrated, e.g., compare the left to the right plot.

Unlike the peri-null Bayes factor, the (default) point-null Bayes factor is consistent. Fig. 3 shows the simulated sampling distribution of the point-null and peri-null Bayes factors in blue and black respectively. As before the 97.5% quantile (top dotted curve), the average (solid curve), and the 2.5% quantile (bottom dotted curve) are depicted as well.

The top left subplot of Fig. 3 shows that under $\mu = 0.167$ and $\sigma = 1$ the point-null and peri-null Bayes factor behave similarly up to $n = 30$. Furthermore, the average point-null log Bayes factor crosses the peri-null upper bound of $\log(10)$ at around $n = 380$, whereas the peri-null Bayes factor remains bounded even in the limit, and is therefore inconsistent. The top right subplot shows, under $\mu = 0.314$ and $\sigma = 1$, that the discrepancy between the point-null and peri-null Bayes factor becomes apparent sooner when the peri-null prior is less concentrated, i.e., $\kappa_0 = 0.10$ instead of $\kappa_0 = 0.05$. Also note that under these alternatives, the logarithm of the point-null Bayes factor grows linearly (e.g., Bahadur & Bickel, 2009; Johnson & Rossell, 2010). Hence, the point-null Bayes factor has a larger power to detect an effect than that afforded by the peri-null Bayes factor.

The bottom row of Fig. 3 paints a similar picture; under the null the point-null Bayes factor accumulates evidence for the null hypothesis without bound as $n$ grows. For $\kappa_0 = 0.05$ the behavior of the peri-null and the point-null Bayes factor is similar up to $n = 200$ and it takes about $n = 1{,}000$ samples before the average point-null log Bayes factor crosses the peri-null lower bound of $-3.22$. For $\kappa_0 = 0.10$ only $n = 270$ samples are needed before the log Bayes factor for the point-null hypothesis crosses the peri-null lower bound of $-2.53$.

Towards Consistent Peri-Null Bayes Factors
[Figure 3: four panels of $\log \mathrm{BF}$ against $n$; one column per value of $\kappa_0$.]

Figure 3. (Default) point-null Bayes factor $t$-tests (depicted in blue) are consistent under both the alternative and null, e.g., top and bottom row respectively, as opposed to peri-null Bayes factors (depicted in black). Note that the peri-null and the default point-null Bayes factors behave similarly when $n$ is small. The domain where the two types of Bayes factors behave similarly is smaller when the peri-null is less concentrated, e.g., compare the right to the left column.

There are at least three methods to adjust the peri-null Bayes factor in order to avoid inconsistency. The first method changes both the point-null hypothesis $\mathcal{H}_0$ and the alternative hypothesis $\mathcal{H}_1$. Specifically, one may define the hypotheses under test to be non-overlapping (e.g., Chandramouli & Shiffrin, 2019). The resulting procedure is usually known as an ‘interval-null hypothesis’ test, where the interval-null is defined as a (renormalized) slice of the prior distribution for the test-relevant parameter under an alternative hypothesis (e.g., Morey & Rouder, 2011). For instance, in the case of a $t$-test an encompassing hypothesis $\mathcal{H}_e$ may assign effect size $\delta$ a Cauchy distribution with mode 0 and interquartile range $\kappa_e$; from this encompassing hypothesis one may construct two rival hypotheses by restricting the Cauchy prior to particular intervals: the interval-null hypothesis truncates the encompassing Cauchy to an interval centered on $\delta = 0$: $\delta \sim \text{Cauchy}(0, \kappa_e)\, I(-a, a)$, whereas the interval-alternative hypothesis is the conjunction of the remaining two intervals, $\delta \sim \text{Cauchy}(0, \kappa_e)\, I(-\infty, -a)$ and $\delta \sim \text{Cauchy}(0, \kappa_e)\, I(a, \infty)$. The resulting Bayes factor is then consistent in accordance with subjective interval belief: for all data-governing parameters $\delta$ in the interior of the interval-null, the Bayes factor in favor of the interval-alternative tends to zero as $n \to \infty$, and for $\delta$ in the interior of the sliced-out alternative, the Bayes factor in favor of the interval-null tends to zero as $n \to \infty$.
In particular, when $a = 1$ and the data-governing $\delta = 0.7$, then this Bayes factor will eventually show unbounded evidence for the interval-null. (Footnote: For consistency to hold the standard condition is assumed that the interval-null or sliced-up prior assigns positive mass to a neighborhood of $\delta$ in the respective intervals.) Apart from the need to specify the width of the interval (Jeffreys, 1961, p. 367), the disadvantages of this method are twofold. Firstly, the prior distributions for the rival interval hypotheses are of an unusual shape – a continuous distribution up to the point of truncation, where the prior mass abruptly drops to zero. It is debatable whether such artificial forms would ever result from an elicitation effort. The second disadvantage is that it seems somewhat circuitous to parry the critique “the null hypothesis is never true exactly” by adjusting both the null hypothesis and the alternative hypothesis.

The second method to specify a (partially) consistent peri-null Bayes factor is to change the point-null hypothesis to a peri-null hypothesis by supplementing rather than supplanting the spike with a distribution (Morey & Rouder, 2011). In other words, the point-null hypothesis is upgraded to include a narrow distribution around the spike. This mixture distribution is generally known as a ‘spike-and-slab’ prior, but here the slab represents the peri-null hypothesis and is relatively peaked. This mixture model $\mathcal{H}'_0$ may be called a ‘hybrid null hypothesis’ (Morey & Rouder, 2011), a ‘mixture null hypothesis’, or a ‘peri-point null hypothesis’. Thus, $\mathcal{H}'_0 = \xi \mathcal{H}_0 + (1 - \xi) \mathcal{H}_{\tilde 0}$, with $\xi \in (0, 1)$ the mixture weight and, say, $\xi =$ […]. Because $\xi > 0$, the Bayes factor comparing $\mathcal{H}'_0$ to $\mathcal{H}_1$ will be consistent when the data come from $\mathcal{H}_0$; and because $\mathcal{H}'_0$ also has mass away from the point under test, the presence of a tiny true non-zero effect will not lead to the certain rejection of the null hypothesis as $n$ grows large. The data determine which of the two peri-point components receives the most weight. As before, for modest sample sizes and small $\kappa_0$, the distinction between point-null, peri-null, and peri-point null is immaterial. The main drawback of the peri-point null hypothesis is that it is consistent only when the data come from $\mathcal{H}_0$; when the data come from $\mathcal{H}_1$ or $\mathcal{H}_{\tilde 0}$, the Bayes factor remains bounded as before (i.e., Eq. 8).

The third method is to define a peri-null hypothesis whose width $\kappa_0$ slowly decreases with sample size (i.e., a ‘shrinking peri-null hypothesis’). For the $t$-test, one can take $\kappa_0 = c\sigma/\sqrt{n}$ for some constant $c >$ […] $\kappa_0$ shrinks too quickly. The representation Eq. (3) shows that this consistency fix is equivalent to keeping the peri-null correction Bayes factor $\mathrm{BF}_{0\tilde 0}$ close to one regardless of the data. Note that this is attainable as $\kappa_0 \to$ […]

Concluding Comments
The objection that “the null hypothesis is never true” may be countered by abandoning the point-null hypothesis in favor of a peri-null hypothesis. For moderate sample sizes and relatively narrow peri-nulls, this change leaves the Bayes factor relatively unaffected. For large sample sizes, however, the change exerts a profound influence and causes the Bayes factor to be inconsistent, with a limiting value given by the ratio of prior ordinates evaluated at the maximum likelihood estimate (cf. Jeffreys, 1961, p. 367 and Morey & Rouder, 2011, pp. 411-412). Here we also derived the asymptotic sampling distribution of the peri-null Bayes factor and showed that its limiting value is essentially an upper bound under the alternative and a lower bound under the null. The asymptotic distributions also provide insight into typical values of the peri-null Bayes factor at a finite n. Note that there exist several Bayes factor methods that have replaced point-null hypotheses with either peri-null hypotheses (e.g., Stochastic Search Variable Selection (SSVS; George & McCulloch, 1993)) or with other hypotheses that have a continuous prior distribution close to zero (e.g., the sceptical prior proposed by Pawel & Held, 2020). As far as evidence from the marginal likelihood is concerned, these methods are therefore inconsistent.

Footnote: “A similar setup in this context was considered by Mitchell and Beauchamp (1988), who instead used “spike and slab” mixtures. An important distinction of our approach is that we do not put a probability mass on β i = 0.” (George & McCulloch, 1993, p. 883).

Inconsistency may not trouble subjective Bayesians: if the peri-null hypothesis truly reflects the belief of a subjective sceptic, and the alternative hypothesis truly reflects the belief of a subjective proponent, then the Bayes factor provides the relative predictive success for the sceptic versus the proponent, and it is irrelevant whether or not this relative success is bounded. Objective Bayesians, however, develop and apply procedures that meet various desiderata (e.g., Bayarri, Berger, Forte, & García-Donato, 2012; Consonni, Fouskakis, Liseo, & Ntzoufras, 2018), with consistency a prominent example. As indicated above, the desire for consistency was the primary motivation for the development of the Bayesian hypothesis test (Wrinch & Jeffreys, 1921). For objective Bayesians then, it appears the point-null hypothesis is more than just a mathematically convenient approximation to the peri-null hypothesis (Jeffreys, 1961, p. 367). The peri-point mixture model (consistent only under the point-null hypothesis) and the shrinking peri-point model (incoherent because the prior width depends on sample size) may provide acceptable compromise solutions.

Regardless of one’s opinion on the importance of consistency, it is evident that seemingly inconsequential changes in model specification may asymptotically yield fundamentally different results. Researchers who entertain the use of peri-null hypotheses should be aware of the asymptotic consequences; in addition, it generally appears prudent to apply several tests and establish that the conclusions are relatively robust.
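The bounded limit described in this section can be illustrated numerically. The sketch below is not the t-test setting of the paper: it assumes a simpler normal-mean model with known unit variance, a N(0, g²) prior under the alternative and a N(0, κ²) peri-null prior, and it sets the sample mean equal to an assumed true effect instead of simulating data. All function names and default values are hypothetical choices made for illustration.

```python
import math


def log_normal_pdf(x, var):
    """Log density of a N(0, var) distribution evaluated at x."""
    return -0.5 * x * x / var - 0.5 * math.log(2.0 * math.pi * var)


def log_bayes_factors(ybar, n, g=1.0, kappa=0.1):
    """Log Bayes factors for a normal mean with known unit variance.

    Assumed (hypothetical) priors on the mean: N(0, g^2) under the
    alternative, N(0, kappa^2) under the peri-null, and a point mass at 0
    under the point-null. The sample mean ybar is sufficient, so each
    marginal likelihood is proportional to a normal density in ybar.
    """
    log_m_alt = log_normal_pdf(ybar, g**2 + 1.0 / n)
    log_m_peri = log_normal_pdf(ybar, kappa**2 + 1.0 / n)
    log_m_point = log_normal_pdf(ybar, 1.0 / n)
    log_bf_10 = log_m_alt - log_m_point       # alternative vs point-null
    log_bf_0peri = log_m_point - log_m_peri   # peri-null correction term
    log_bf_1peri = log_m_alt - log_m_peri     # alternative vs peri-null
    return log_bf_10, log_bf_0peri, log_bf_1peri


def log_prior_ordinate_ratio(mu, g=1.0, kappa=0.1):
    """Log ratio of the alternative and peri-null prior densities at mu."""
    return log_normal_pdf(mu, g**2) - log_normal_pdf(mu, kappa**2)


if __name__ == "__main__":
    mu_true = 0.3  # illustrative true effect; ybar is set equal to it
    for n in (10, 100, 10_000, 1_000_000):
        b10, b0p, b1p = log_bayes_factors(mu_true, n)
        print(f"n={n:>9}  log BF_10={b10:12.2f}  log BF_1peri={b1p:8.4f}")
    print("log prior ordinate ratio:", round(log_prior_ordinate_ratio(mu_true), 4))
```

Working on the log scale avoids underflow of the point-null marginal likelihood at large n. Under these assumptions the point-null log Bayes factor grows without bound, whereas the peri-null log Bayes factor settles at the log ratio of prior ordinates.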
Acknowledgements
This research was supported by the Netherlands Organisation for Scientific Research (NWO; grant
References

Bahadur, R. R., & Bickel, P. J. (2009). An optimality property of Bayes’ test statistics. Lecture Notes-Monograph Series, 18–30.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 423–437.
Balcetis, E., & Dunning, D. (2011). Wishful seeing: More desired objects are seen as closer. Psychological Science, 147–152.
Bayarri, M. J., Berger, J. O., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 1550–1577.
Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 317–352.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 526–536.
Chandramouli, S. H., & Shiffrin, R. M. (2019). Commentary on Gronau and Wagenmakers. Computational Brain & Behavior, 12–21.
Consonni, G., Fouskakis, D., Liseo, B., & Ntzoufras, I. (2018). Prior distributions for objective Bayesian analysis. Bayesian Analysis, 627–679.
Cornfield, J. (1966). A Bayesian test of some classical hypotheses - with applications to sequential clinical trials. Journal of the American Statistical Association, 577–594.
Cornfield, J. (1969). The Bayesian outlook and its application. Biometrics, 617–657.
Dawid, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society Series A, 278–292.
Dickey, J. M. (1976). Approximate posterior distributions. Journal of the American Statistical Association, 680–689.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 193–242.
Etz, A., & Wagenmakers, E.-J. (2017). J. B. S. Haldane’s contribution to the Bayes factor hypothesis test. Statistical Science, 313–329.
Gallistel, C. R. (2009). The importance of proving the null. Psychological Review, 439–453.
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 881–889.
Good, I. J. (1967). A Bayesian significance test for multinomial distributions. Journal of the Royal Statistical Society, Series B (Methodological), 399–431.
Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (2020). Informed Bayesian t-tests. The American Statistician, 137–143.
Grünwald, P., de Heide, R., & Koolen, W. (2019). Safe testing. arXiv preprint arXiv:1906.07801.
Hendriksen, A., de Heide, R., & Grünwald, P. (in press). Optional stopping with Bayes factors: A categorization and extension of folklore results, with an application to invariant situations. Bayesian Analysis.
Isserlis, L. (1918). On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika, 134–139.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 203–222.
Jeffreys, H. (1936). Further significance tests. In (Vol. 32, pp. 416–445).
Jeffreys, H. (1937). Scientific method, causality, and reality. Proceedings of the Aristotelian Society, 61–70.
Jeffreys, H. (1939). Theory of probability (1st ed.). Oxford, UK: Oxford University Press.
Jeffreys, H. (1948). Theory of probability (2nd ed.). Oxford, UK: Oxford University Press.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Oxford University Press.
Jeffreys, H. (1973). Scientific inference (3rd ed.). Cambridge, UK: Cambridge University Press.
Jeffreys, H. (1977). Probability theory in geophysics. Journal of the Institute of Mathematics and its Applications, 87–96.
Johnson, V. E., & Rossell, D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 143–170.
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 411–414.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 773–795.
Kass, R. E., Tierney, L., & Kadane, J. B. (1990). The validity of posterior expansions based on Laplace’s method. In S. Geisser, J. S. Hodges, S. J. Press, & A. Zellner (Eds.), Bayesian and likelihood methods in statistics and econometrics: Essays in honor of George A. Barnard (Vol. 1, pp. 473–488). Elsevier.
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta–analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 178–206.
Laplace, P.-S. (1774/1986). Memoir on the probability of the causes of events. Statistical Science, 364–378.
Ly, A., Komarlu Narendra Gupta, A. R., Etz, A., Marsman, M., Gronau, Q. F., & Wagenmakers, E.-J. (2018). Bayesian reanalyses from summary statistics and the strength of statistical evidence. Advances in Methods and Practices in Psychological Science, (3), 367–374. doi: 10.1177/2515245918779348
Ly, A., Marsman, M., Verhagen, A. J., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2017). A tutorial on Fisher information. Journal of Mathematical Psychology, 40–55.
Ly, A., Stefan, A., van Doorn, J., Dablander, F., van den Bergh, D., Sarafoglou, A., . . . Wagenmakers, E.-J. (2020). The Bayesian methodology of Sir Harold Jeffreys as a practical alternative to the p-value hypothesis test. Computational Brain & Behavior, (3), 153–161.
Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016a). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology, 19–32.
Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016b). An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. Journal of Mathematical Psychology, 43–55.
McCullagh, P. (2018). Tensor methods in statistics. Courier Dover Publications.
Mitchell, T. J., & Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 1023–1032.
Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 406–419.
Morey, R. D., & Rouder, J. N. (2018). BayesFactor 0.9.12-4.2. Comprehensive R Archive Network. Retrieved from http://cran.r-project.org/web/packages/BayesFactor/index.html
Pawel, S., & Held, L. (2020). The sceptical Bayes factor for the assessment of replication success. Manuscript submitted for publication. Retrieved from https://arxiv.org/abs/2009.01520
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 225–237.
Shafer, G., & Vovk, V. (2019). Game-theoretic foundations for probability and finance (Vol. 455). John Wiley & Sons.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 100–116.
Tukey, J. W. (1995). Controlling the proportion of false discoveries for multiple comparisons: Future directions. In V. S. L. Williams, L. V. Jones, & I. Olkin (Eds.), Perspectives on statistics for educational research: Proceedings of a workshop (pp. 6–9). Research Triangle Park, NC: National Institute of Statistical Sciences.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.
Verdinelli, I., & Wasserman, L. (1995). Computing Bayes factors using a generalization of the Savage–Dickey density ratio.
Journal of the American Statistical Association, 614–618.
Wrinch, D., & Jeffreys, H. (1921). On certain fundamental principles of scientific inquiry. Philosophical Magazine, 369–390.

Appendix A. Laplace Approximation

The Laplace approximation uses a (multivariate) Taylor expansion for which we introduce notation. Let $h : \Theta \subset \mathbb{R}^p \to \mathbb{R}$, i.e., $h(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid \theta)$, and we write $\hat{\theta}$ for the point in its domain where $h$ takes its global minimum. Furthermore, we use subscripts to denote partial derivatives, whereas superscripts refer to components of a vector, or more generally an array. For instance, $\pi_a = \frac{\partial}{\partial \theta^a} \pi(\hat{\theta})$ refers to the $a$-th component of the vector of partial derivatives $[D^1 \pi(\hat{\theta})]$ of the prior $\pi$ evaluated at the MLE. Similarly, we write $h_{abc} = \frac{\partial^3}{\partial \theta^a \partial \theta^b \partial \theta^c} h(\hat{\theta})$ for the $abc$-th component of the three-dimensional array $[D^3 h(\hat{\theta})]$. Hence, the number of indices in the subscript corresponds to the number of derivatives of $h$ and the indices, each in $1, 2, \ldots, p$, provide the location of the component.

We use superscripts to refer to the component of a vector. For instance, $u^a = (\theta^a - \hat{\theta}^a)$ represents the $a$-th component of the difference vector $u = \theta - \hat{\theta}$, thus, equivalently $u^a := e_a^T u$, where $e_a$ is the unit (column) vector with entry 1 at index $a$ and zero elsewhere. Similarly, $\varsigma^{abcd}$ is the $abcd$-th component of a four-dimensional array.

Moreover, we employ Einstein’s summation convention and suppress the sum whenever an index occurs in both the sub- and superscript. For instance,

$$ h_a u^a := \sum_{a=1}^{p} h_a u^a, \tag{24} $$

$$ h_{abc} u^a u^b u^c := \sum_{a=1}^{p} \sum_{b=1}^{p} \sum_{c=1}^{p} h_{abc} u^a u^b u^c. \tag{25} $$

The former defines an inner product between the gradient of $h$ and deviations $u$, whereas $h_{abc} = [D^3 h]_{abc}$ refers to the $a$-th row, $b$-th column, and $c$-th depth of the three-dimensional array consisting of partial derivatives of $h$ of order three.
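The summation convention of Eqs. (24) and (25) can be spelled out as explicit loops. The following is a minimal sketch using nested Python lists as arrays; the helper name `contract` and the example values are hypothetical and serve only to illustrate the index bookkeeping.

```python
from itertools import product


def contract(tensor, u, order):
    """Sum tensor[a1]...[ak] * u[a1] * ... * u[ak] over all index tuples.

    `tensor` is a nested list with `order` levels (e.g. order=3 for h_abc),
    and `u` is a list holding the components u^1, ..., u^p.
    """
    p = len(u)
    total = 0.0
    for idx in product(range(p), repeat=order):
        entry = tensor
        weight = 1.0
        for a in idx:
            entry = entry[a]   # walk down to the component tensor[a1]...[ak]
            weight *= u[a]     # accumulate the product u^{a1} ... u^{ak}
        total += entry * weight
    return total


if __name__ == "__main__":
    u = [0.5, -1.0]
    gradient = [1.0, 2.0]                       # plays the role of h_a
    ones3 = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]  # h_abc = 1
    print(contract(gradient, u, 1))             # h_a u^a, as in Eq. (24)
    print(contract(ones3, u, 3))                # h_abc u^a u^b u^c, Eq. (25)
```

With every component of the third-order array set to one, the triple sum reduces to the cube of the sum of the components of u, which gives a quick sanity check of the loop.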
Lastly, we use the shorthand notation

$$ h_a h_b u^a u^b := \sum_{a} \sum_{b} h_a h_b u^a u^b, \tag{26} $$

to denote the nested sum which is needed for Cauchy products $(h_a u^a)(h_b u^b)$. For instance, with $p = 2$

$$ (h_1 u^1 + h_2 u^2)(h_1 u^1 + h_2 u^2) = h_1 u^1 h_1 u^1 + 2 h_1 u^1 h_2 u^2 + h_2 u^2 h_2 u^2, \tag{27} $$

which is equivalent to

$$ h_1 h_1 u^1 u^1 + h_1 h_2 u^1 u^2 + h_2 h_1 u^2 u^1 + h_2 h_2 u^2 u^2. \tag{28} $$

With these notational conventions a multivariate Taylor approximation is denoted as

$$ h(\theta) = h(\hat{\theta}) + h_a u^a + \tfrac{1}{2} h_{ab} u^a u^b + \tfrac{1}{6} h_{abc} u^a u^b u^c + \tfrac{1}{24} h_{abcd} u^a u^b u^c u^d + O(|u|^5), \tag{29} $$

and note the similarity to its one-dimensional counterpart.

Theorem 3 (Laplace expansion with error term). Let $\mathcal{P}_\Theta$ be a collection of density functions that are six times continuously differentiable in $\theta \in \Theta \subset \mathbb{R}^p$, and $\pi(\theta)$ a prior density that is four times continuously differentiable. Let $Y_i \overset{\text{iid}}{\sim} f(y \mid \theta)$ for certain $\theta$, then with $\hat{\theta}$ the MLE

$$ p(y^n) = \int_\Theta f(y^n \mid \theta) \pi(\theta) \, \mathrm{d}\theta \tag{30} $$

$$ = \left( \frac{2\pi}{n} \right)^{p/2} f(y^n \mid \hat{\theta}) \, \pi(\hat{\theta}) \, |\hat{I}(\hat{\theta})|^{-1/2} \left[ 1 + \frac{C^{(1)}(\hat{\theta})}{n} + \frac{C^{(2)}(\hat{\theta})}{n^2} + O(n^{-3}) \right], \tag{31} $$

where $|\cdot|$ denotes the determinant and

$$ C^{(1)}(\hat{\theta}) = \frac{\pi_{ab}}{2 \pi(\hat{\theta})} \varsigma^{ab} - \left( \frac{h_{abcd}}{24} + \frac{h_{abc} \pi_d}{6 \pi(\hat{\theta})} \right) \varsigma^{abcd} + \frac{h_{abc} h_{def}}{72} \varsigma^{abcdef}, \tag{32} $$

$$ \begin{aligned} C^{(2)}(\hat{\theta}) ={}& \frac{\pi_{abcd}}{24 \pi(\hat{\theta})} \varsigma^{abcd} - \frac{\pi(\hat{\theta}) h_{abcdef} + 6 h_{abcde} \pi_f + 15 h_{abcd} \pi_{ef} + 20 h_{abc} \pi_{def}}{720 \pi(\hat{\theta})} \varsigma^{abcdef} \\ &+ \frac{5 \pi(\hat{\theta}) h_{abcd} h_{efgh} + 8 \pi(\hat{\theta}) h_{abcde} h_{fgh} + 40 h_{abc} \left( h_{defg} \pi_h + h_{def} \pi_{gh} \right)}{5760 \pi(\hat{\theta})} \varsigma^{abcdefgh} \\ &- \frac{3 \pi(\hat{\theta}) h_{abcd} h_{efg} h_{hij} + 4 h_{abc} h_{def} h_{ghi} \pi_j}{5184 \pi(\hat{\theta})} \varsigma^{abcdefghij} + \frac{h_{abc} h_{def} h_{ghi} h_{jkl}}{31104} \varsigma^{abcdefghijkl}, \end{aligned} \tag{33} $$

where $\varsigma^{ab}$, $\varsigma^{abcd}$, $\varsigma^{abcdef}$, $\varsigma^{abcdefgh}$, $\varsigma^{abcdefghij}$, and $\varsigma^{abcdefghijkl}$ represent the $ab$-th component of the second, the $abcd$-th component of the fourth, the $abcdef$-th component of the sixth, the $abcdefgh$-th component of the eighth, the $abcdefghij$-th component of the tenth, and the $abcdefghijkl$-th component of the twelfth moment of the $p$-dimensional random vector $Q \sim \mathcal{N}_p(0, \hat{I}(\hat{\theta})^{-1})$, respectively. ⋄

Proof.
The proof is based on (i) Taylor-expanding the log-likelihood in the exponent to order five around $\hat{\theta}$, (ii) the definition of the exponential as a series and Taylor-expanding $\pi$ to third order at the same point $\hat{\theta}$, and (iii) properties of the normal distribution.

Step (i) Let $h(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid \theta)$, then since $h \in C^6(\Theta)$ we know that there exists $\delta > 0$ such that on the ball $B_{\hat{\theta}}(\delta) \subset \mathbb{R}^p$ of radius $\delta$ centered at $\hat{\theta}$ the average log-likelihood $h(\theta)$ is well-approximated by a Taylor expansion of order 5. This combined with $\hat{\theta}$ being the MLE and the notation $\tilde{q} = \theta - \hat{\theta}$ yields

$$ p(y^n) = \int_\Theta e^{-n h(\theta)} \pi(\theta) \, \mathrm{d}\theta = \int_{B_{\hat{\theta}}(\delta)} e^{-n h(\hat{\theta}) - \frac{n}{2} h_{ab} \tilde{q}^a \tilde{q}^b - \tilde{R}(\tilde{q})} \pi(\tilde{q}) \, \mathrm{d}\tilde{q}, \tag{34} $$

$$ = f(y^n \mid \hat{\theta}) \int_{B_{\hat{\theta}}(\delta)} e^{-\frac{n}{2} h_{ab} \tilde{q}^a \tilde{q}^b} e^{-\tilde{R}(\tilde{q})} \pi(\tilde{q}) \, \mathrm{d}\tilde{q}, \tag{35} $$

where

$$ \tilde{R}(\tilde{q}) = n \left[ \tfrac{1}{6} h_{abc} \tilde{q}^a \tilde{q}^b \tilde{q}^c + \tfrac{1}{24} h_{abcd} \tilde{q}^a \tilde{q}^b \tilde{q}^c \tilde{q}^d + \tfrac{1}{120} h_{abcde} \tilde{q}^a \tilde{q}^b \tilde{q}^c \tilde{q}^d \tilde{q}^e + O(|\tilde{q}|^6) \right], \tag{36} $$

is the bounded remainder term since $h \in C^6(\Theta)$. The replacement of $\Theta$ by $B_{\hat{\theta}}(\delta)$ in the integral is justified if the mass is concentrated at $\hat{\theta}$, thus, whenever the integral with respect to the first order term falls off quadratically, that is, if

$$ |n \hat{I}(\hat{\theta})|^{1/2} \int_{\Theta \setminus B_{\hat{\theta}}(\delta)} e^{-n (h(\theta) - h(\hat{\theta}))} \pi(\theta) \, \mathrm{d}\theta = O(n^{-2}), \tag{37} $$

which is the case when the likelihood is unimodal. When it is not unimodal, but $\hat{\theta}$ is a global maximum, then the condition amounts to the requirement that the contribution of the other maxima is not too big.

Step (ii)
After centering the integral at $\hat{\theta}$ we scale with respect to $\sqrt{n}$, that is, we apply the change of variable $q = \sqrt{n} \tilde{q}$, thus, $\int n^{-p/2} \, \mathrm{d}q = \int \mathrm{d}\tilde{q}$ and therefore

$$ p(y^n) = \left( \frac{2\pi}{n} \right)^{p/2} f(y^n \mid \hat{\theta}) \, |\hat{I}(\hat{\theta})|^{-1/2} \int_{B_{\hat{\theta}}(\sqrt{n} \delta)} \tilde{\varphi}(q) \, e^{-R(q)} \, \tilde{\pi}(q) \, \mathrm{d}q, \tag{38} $$

where $\tilde{\varphi}$ is the density of a multivariate normal distribution centered at 0 and covariance matrix $\Sigma = \hat{I}^{-1}(\hat{\theta})$, and where $\tilde{\pi}(q)$ is the Taylor approximation of $\pi$ at the MLE, that is,

$$ \tilde{\pi}(q) = \pi(\hat{\theta}) + \frac{\pi_a(\hat{\theta}) q^a}{n^{1/2}} + \frac{\pi_{ab}(\hat{\theta}) q^a q^b}{2 n} + \frac{\pi_{abc}(\hat{\theta}) q^a q^b q^c}{6 n^{3/2}} + O(n^{-2}), \tag{39} $$

and where the remainder term is now

$$ R(q) = \frac{h_{abc} q^a q^b q^c}{6 n^{1/2}} + \frac{h_{abcd} q^a q^b q^c q^d}{24 n} + \frac{h_{abcde} q^a q^b q^c q^d q^e}{120 n^{3/2}} + \frac{h_{abcdef} q^a q^b q^c q^d q^e q^f}{720 n^2} + O(n^{-5/2}). $$

To exploit the properties of Gaussian integrals we replace integration domain $B_{\hat{\theta}}(\sqrt{n} \delta)$ by $\mathbb{R}^p$, which is justified when $n$ is large, and because the tails of a normal density fall off exponentially.

By definition of $e^{-R(q)}$ as a series and without the exponential approximation error

$$ p(y^n) \approx \left( \frac{2\pi}{n} \right)^{p/2} f(y^n \mid \hat{\theta}) \, |\hat{I}(\hat{\theta})|^{-1/2} \int_{\mathbb{R}^p} \tilde{\varphi}(q) \left[ 1 - R(q) + \tfrac{1}{2} R^2(q) - \tfrac{1}{6} R^3(q) + O(|R(q)|^4) \right] \tilde{\pi}(q) \, \mathrm{d}q. \tag{40} $$

From here onwards we focus on the integral in Eq. (40), which after some straightforward but tedious computations can be shown to be of the form

$$ \int_{\mathbb{R}^p} \tilde{\varphi}(q) \left[ A_0 + A_1 n^{-1/2} + A_2 n^{-1} + A_3 n^{-3/2} + A_4 n^{-2} + O(n^{-3}) \right] \mathrm{d}q, \tag{41} $$

where the $A_j$ terms are functions of $q$ and $\hat{\theta}$ defined by the series representation of $e^{-R(q)}$ and $\tilde{\pi}(q)$.

Step (iii)
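The terms $A_j$ are given below. Of the following results only the exact values of $A_0$, $A_2$ and $A_4$ matter; what matters for $A_1$ and $A_3$ is that they only involve odd powers of $q$:

$$ A_0 = \pi(\hat{\theta}) \tag{42} $$

$$ A_1 = \pi_a q^a - \frac{h_{abc} \pi(\hat{\theta})}{6} q^a q^b q^c \tag{43} $$

$$ A_2 = \frac{\pi_{ab}}{2} q^a q^b - \frac{\pi(\hat{\theta}) h_{abcd} + 4 h_{abc} \pi_d}{24} q^a q^b q^c q^d + \frac{\pi(\hat{\theta}) h_{abc} h_{def}}{72} q^a q^b q^c q^d q^e q^f \tag{44} $$

$$ \begin{aligned} A_3 ={}& \frac{\pi_{abc}}{6} q^a q^b q^c - \frac{6 h_{abcde} \pi(\hat{\theta}) + 30 h_{abcd} \pi_e + 60 h_{abc} \pi_{de}}{720} q^a q^b q^c q^d q^e \\ &+ \frac{h_{abc} h_{defg} \pi(\hat{\theta}) + 2 h_{abc} h_{def} \pi_g}{144} q^a q^b q^c q^d q^e q^f q^g - \frac{\pi(\hat{\theta}) h_{abc} h_{def} h_{ghi}}{1296} q^a q^b q^c q^d q^e q^f q^g q^h q^i \end{aligned} \tag{45} $$

$$ \begin{aligned} A_4 ={}& \frac{\pi_{abcd}}{24} q^a q^b q^c q^d \\ &- \frac{\pi(\hat{\theta}) h_{abcdef} + 6 h_{abcde} \pi_f + 15 h_{abcd} \pi_{ef} + 20 h_{abc} \pi_{def}}{720} q^a q^b q^c q^d q^e q^f \\ &+ \frac{5 \pi(\hat{\theta}) h_{abcd} h_{efgh} + 8 \pi(\hat{\theta}) h_{abcde} h_{fgh} + 40 h_{abc} \left( h_{defg} \pi_h + h_{def} \pi_{gh} \right)}{5760} q^a q^b q^c q^d q^e q^f q^g q^h \\ &- \frac{3 \pi(\hat{\theta}) h_{abcd} h_{efg} h_{hij} + 4 h_{abc} h_{def} h_{ghi} \pi_j}{5184} q^a q^b q^c q^d q^e q^f q^g q^h q^i q^j \\ &+ \frac{\pi(\hat{\theta}) h_{abc} h_{def} h_{ghi} h_{jkl}}{31104} q^a q^b q^c q^d q^e q^f q^g q^h q^i q^j q^k q^l. \end{aligned} \tag{46} $$

Since for $k$ odd $A_k$ only involves odd powers of $q$ we conclude that their integral with respect to $\tilde{\varphi}(q)$ vanishes. Hence,

$$ p(y^n) = \left( \frac{2\pi}{n} \right)^{p/2} f(y^n \mid \hat{\theta}) \, |\hat{I}(\hat{\theta})|^{-1/2} \, \pi(\hat{\theta}) \left[ 1 + \frac{E[A_2]}{n \pi(\hat{\theta})} + \frac{E[A_4]}{n^2 \pi(\hat{\theta})} + O(n^{-3}) \right], \tag{47} $$

where $E[A_2]$ and $E[A_4]$ are expectations with respect to $Q \sim \mathcal{N}(0, \hat{I}(\hat{\theta})^{-1})$. This implies that the order $n^{-1}$ and $n^{-2}$ terms in the assertion are $C^{(1)}(\hat{\theta}) = E[A_2]/\pi(\hat{\theta})$ and $C^{(2)}(\hat{\theta}) = E[A_4]/\pi(\hat{\theta})$.

The components of higher moments can be expressed in terms of the covariances $\varsigma^{ab} = \mathrm{Cov}(Q^a, Q^b)$ using Isserlis’ formula (Isserlis, 1918; McCullagh, 2018). For moments $\varsigma^{a_1 \cdots a_w}$, that is, a component of the $w$-th moment of $Q$ with $w = 2v$ even, the following holds

$$ \varsigma^{a_1 \cdots a_w} = \sum_{u \in P_w} \prod_{(i,j) \in u} \varsigma^{ij}, \tag{48} $$

where $P_w$ is the collection of all partitions of the indices $a_1, \ldots, a_w$ into $v$ pairs. For instance, for $w = 4$, $\varsigma^{abcd}$ is a sum of products of 2 covariances, for $w = 6$, $\varsigma^{abcdef}$ is a sum of products of 3 covariances, and so forth. More specifically,

$$ \varsigma^{abcd} = \varsigma^{ab} \varsigma^{cd} + \varsigma^{ac} \varsigma^{bd} + \varsigma^{ad} \varsigma^{bc} \tag{49} $$

$$ \begin{aligned} \varsigma^{abcdef} ={}& \varsigma^{ab} \varsigma^{cd} \varsigma^{ef} + \varsigma^{ab} \varsigma^{ce} \varsigma^{df} + \varsigma^{ab} \varsigma^{cf} \varsigma^{de} \\ &+ \varsigma^{ac} \varsigma^{bd} \varsigma^{ef} + \varsigma^{ac} \varsigma^{be} \varsigma^{df} + \varsigma^{ac} \varsigma^{bf} \varsigma^{de} \\ &+ \varsigma^{ad} \varsigma^{bc} \varsigma^{ef} + \varsigma^{ad} \varsigma^{be} \varsigma^{cf} + \varsigma^{ad} \varsigma^{bf} \varsigma^{ce} \\ &+ \varsigma^{ae} \varsigma^{bc} \varsigma^{df} + \varsigma^{ae} \varsigma^{bd} \varsigma^{cf} + \varsigma^{ae} \varsigma^{bf} \varsigma^{cd} \\ &+ \varsigma^{af} \varsigma^{bc} \varsigma^{de} + \varsigma^{af} \varsigma^{bd} \varsigma^{ce} + \varsigma^{af} \varsigma^{be} \varsigma^{cd}, \end{aligned} \tag{50} $$

where all indexes $a, b, c, d, e, f = 1, 2, \ldots, p$. The expressions of $\varsigma^{abcdefgh}$, $\varsigma^{abcdefghij}$, and $\varsigma^{abcdefghijkl}$ are sums of $105 = 3 \times 5 \times 7$, $945 = 3 \times 5 \times 7 \times 9$, and $10{,}395 = 3 \times 5 \times 7 \times 9 \times 11$ terms, respectively.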
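Isserlis’ formula in Eq. (48) can be checked by brute-force enumeration of pairings. The sketch below is an illustration only: the function names are hypothetical, the covariance matrix is passed as a nested list, and indices are zero-based rather than running from 1 to p.

```python
def pairings(indices):
    """Yield every way to split the list `indices` into unordered pairs."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for k in range(len(rest)):
        partner = rest[k]
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def gaussian_moment(cov, indices):
    """Moment E[Q^{a_1} ... Q^{a_w}] of a zero-mean Gaussian vector Q.

    `cov` is the covariance matrix as a nested list and `indices` holds the
    (zero-based) components a_1, ..., a_w. Odd-order moments vanish.
    """
    if len(indices) % 2 == 1:
        return 0.0
    total = 0.0
    for pairing in pairings(list(indices)):
        term = 1.0
        for i, j in pairing:
            term *= cov[i][j]  # one covariance factor per pair
        total += term
    return total


def count_pairings(w):
    """Number of terms in the Isserlis sum for a moment of even order w."""
    return sum(1 for _ in pairings(list(range(w))))


if __name__ == "__main__":
    # Term counts for orders 4, 6, 8, 10, 12: 3, 15, 105, 945, 10395.
    print([count_pairings(w) for w in (4, 6, 8, 10, 12)])
    cov = [[2.0, 0.5], [0.5, 1.0]]
    print(gaussian_moment(cov, (0, 0, 1, 1)))  # E[(Q^0)^2 (Q^1)^2]
```

The enumeration grows as the double factorial (w − 1)!!, so it is practical for the moments of order up to twelve used here but not for much higher orders.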