Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximation
Thomas J. Faulkenberry ∗ Tarleton State University
Bayesian inference affords scientists powerful tools for testing hypotheses. One of these tools is the Bayes factor, which indexes the extent to which support for one hypothesis over another is updated after seeing the data. Part of the hesitance to adopt this approach may stem from an unfamiliarity with the computational tools necessary for computing Bayes factors. Previous work has shown that closed-form approximations of Bayes factors are relatively easy to obtain for between-groups methods, such as an analysis of variance or t-test. In this paper, I extend this approximation to develop a formula for the Bayes factor that directly uses information that is typically reported for ANOVAs (e.g., the F ratio and degrees of freedom). After giving two examples of its use, I report the results of simulations which show that, even with minimal input, this approximate Bayes factor produces results similar to those of existing software solutions.

Note: to appear in Biometrical Letters.

I. INTRODUCTION
Hypothesis testing is the primary tool for statistical inference across much of the biological and behavioral sciences. As such, most scientists are trained in classical null hypothesis significance testing (NHST). The scenario for testing a hypothesis is likely familiar to most readers of this journal. Suppose one wants to test a specific research hypothesis (e.g., some treatment has an effect on some outcome measure). NHST works by first assuming a null hypothesis (e.g., the treatment has no effect) and then computing some test statistic for a sample of data. This sample test statistic is then compared to a hypothetical distribution of test statistics that would arise if the null hypothesis were true. If the sample's test statistic is in the tail of the distribution (that is, it would occur with low probability), the scientist decides to reject the null hypothesis in favor of the alternative hypothesis. Further, the p-value, which indicates how surprising the sample would be if the null hypothesis were true, is often taken as a measure of evidence: the lower the p-value, the stronger the evidence.

While orthodox across many disciplines, NHST does have philosophical criticisms [13]. Also, the p-value is prone to misinterpretation [2, 3]. Finally, NHST is ideally suited to providing support for the alternative hypothesis, but the procedure does not work in the case where one wants to measure support for the null hypothesis. That is, we can reject the null, but we cannot accept the null. To overcome this limitation, we can use an alternative method for testing hypotheses that is based on Bayesian inference: the Bayes factor.

∗ [email protected]

A. The Bayes factor
Bayesian inference is a method of measurement that is based on the computation of P(H | D), which is called the posterior probability of a hypothesis H, given data D. Bayes' theorem casts this probability as

    P(H | D) = P(D | H) · P(H) / P(D).   (1)

One may think of Equation 1 in the following manner: before observing data D, one assigns a prior probability P(H) to hypothesis H. After observing data, one can then update this prior probability to a posterior probability P(H | D) by multiplying the prior P(H) by the likelihood P(D | H). This product is then rescaled to a probability distribution (i.e., total probability = 1) by dividing by the marginal probability P(D).

Bayes' theorem provides a natural way to test hypotheses. Suppose we have two competing hypotheses: an alternative hypothesis H1 and a null hypothesis H0. We can directly compare the posterior probabilities of H1 and H0 by computing their ratio; that is, we can compute the posterior odds in favor of H1 over H0 as P(H1 | D)/P(H0 | D). Using Bayes' theorem (Equation 1), it is trivial to see that

    P(H1 | D)/P(H0 | D) = [P(D | H1)/P(D | H0)] · [P(H1)/P(H0)],   (2)

where the left-hand side is the posterior odds, the first factor on the right is the Bayes factor, and the second factor is the prior odds.

This equation can also be interpreted in terms of the "updating" metaphor that was explained above. Specifically, the posterior odds are equal to the prior odds multiplied by an updating factor. This updating factor is equal to the ratio of the likelihoods P(D | H1) and P(D | H0), and is called the Bayes factor [4]. Intuitively, the Bayes factor can be interpreted as the weight of evidence provided by a set of data D. For example, suppose that one assigned prior odds of H1 over H0 equal to 1; that is, H1 and H0 are a priori assumed to be equally likely. Then, suppose that after observing data D, the Bayes factor was computed to be 10.
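As a concrete numeric illustration of the prior-to-posterior updating in Equation 1, the short sketch below computes P(H | D) for a two-hypothesis partition. The input probabilities are invented for illustration, not taken from the paper.

```python
def posterior(prior_h, lik_d_given_h, lik_d_given_not_h):
    """P(H | D) via Bayes' theorem; the marginal P(D) is expanded
    over the two-hypothesis partition {H, not-H}."""
    marginal_d = lik_d_given_h * prior_h + lik_d_given_not_h * (1 - prior_h)
    return lik_d_given_h * prior_h / marginal_d

# Hypothetical inputs: P(H) = 0.5, P(D | H) = 0.8, P(D | not-H) = 0.2
print(posterior(0.5, 0.8, 0.2))  # 0.8 -- the data shift belief toward H
```

With equal priors, the posterior simply tracks the likelihood ratio, which is the intuition behind the Bayes factor introduced next.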
Now, the posterior odds (the odds of H1 over H0 after observing data) are 10:1 in favor of H1 over H0. As such, the Bayes factor provides an easily interpretable measure of the evidence in favor of H1.

To help with interpreting Bayes factors, various classification schemes have been proposed. One simple scheme is a four-way classification proposed by Raftery [9], where Bayes factors between 1 and 3 are considered weak evidence; between 3 and 20, positive evidence; between 20 and 150, strong evidence; and beyond 150, very strong evidence.

Note that in the discussion above, there was no specific assumption about the order in which we addressed H1 and H0. If instead we wanted to assess the weight of evidence in favor of H0 over H1, Equation 2 could simply be adjusted by taking reciprocals. As such, implied direction is important when computing Bayes factors, so one must be careful to define notation when representing Bayes factors. A common convention is to define BF10 as the Bayes factor for H1 over H0; similarly, BF01 represents the Bayes factor for H0 over H1. Note that BF01 = 1/BF10.

In summary, the Bayes factor provides an index of preference for one hypothesis over another that has some advantages over NHST. First, the Bayes factor tells us by how much a data sample should update our belief in one hypothesis over a competing one. Second, though NHST does not allow one to accept a null hypothesis, doing so within a Bayesian framework makes perfect sense. Given these advantages, it may be surprising that Bayesian inference has not been used more often in the empirical sciences. One reason for the lack of more widespread adoption may be that Bayes factors are quite difficult to compute. We tackle this issue in the next section.

II. COMPUTING BAYES FACTORS
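The odds-updating rule of Equation 2 and the Raftery classification are simple enough to express directly in code. The sketch below (function names are my own) updates prior odds by a Bayes factor and labels the result with the four-way scheme described above.

```python
def posterior_odds(prior_odds, bayes_factor):
    """Equation 2: posterior odds = Bayes factor x prior odds."""
    return bayes_factor * prior_odds

def raftery_label(bf):
    """Raftery's [9] four-way evidence classification for a Bayes factor >= 1."""
    if bf < 1:
        raise ValueError("take the reciprocal first so the favored hypothesis has bf >= 1")
    if bf <= 3:
        return "weak"
    if bf <= 20:
        return "positive"
    if bf <= 150:
        return "strong"
    return "very strong"

# Equal prior odds updated by a Bayes factor of 10 (the example in the text)
odds = posterior_odds(1.0, 10.0)
print(odds, raftery_label(odds))  # 10.0 positive
```

The reciprocal relationship BF01 = 1/BF10 means the same function classifies evidence in either direction once the favored hypothesis is placed in the numerator.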
As an example, suppose we are interested in computing the Bayes factor for a null hypothesis H0 over an alternative hypothesis H1, given data D. Recall from Equation 2 that this Bayes factor (denoted BF01) is equal to

    BF01 = P(D | H0) / P(D | H1).

While this equation may seem conceptually quite simple, it is computationally much more difficult. This is because in order to compute the numerator and denominator, one must parameterize the hypotheses (or models, to be more clear), and then each likelihood is computed by conditioning on all possible parameter values and averaging over this set. Since these potential parameter values often range over a continuous parameter space, this computation requires integration, and thus the formula for the Bayes factor amounts to

    BF01 = [ ∫_{θ0 ∈ Θ0} P(D | H0, θ0) π0(θ0) dθ0 ] / [ ∫_{θ1 ∈ Θ1} P(D | H1, θ1) π1(θ1) dθ1 ],   (3)

where Θ0 and Θ1 are the parameter spaces for models H0 and H1, respectively, and π0 and π1 are the prior probability density functions of the parameters of H0 and H1, respectively.

Thus, in order to compute BF01, one must specify the priors π0 and π1 for H0 and H1. Further, the integrals usually do not have closed-form solutions, so numerical integration techniques are necessary. These requirements render the computation of Bayes factors inaccessible to all but those researchers with a more-than-modest amount of mathematical training.

Fortunately, there are an increasing number of solutions that avoid a direct encounter with computations of the above type. Recently, researchers have proposed default priors for standard experimental designs such as t-tests [7, 11] and ANOVA [10]. These default priors are implemented in software packages such as the R package BayesFactor [8], and as such, have provided a user-friendly method for researchers to compute Bayes factors without the computational overhead required by Equation 3.
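To make Equation 3 concrete, here is a minimal sketch of such a computation for the simplest possible case; this is my own toy example, not one from the paper. It compares a point-null binomial model H0: θ = 0.5 against an alternative H1 in which θ has a uniform prior on (0, 1), with the marginal likelihood under H1 obtained by trapezoidal numerical integration over the parameter space.

```python
from math import comb

def likelihood(theta, k, n):
    """Binomial likelihood P(D | theta) for k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def bf01_binomial(k, n, grid_size=10_000):
    """BF01 for the point null theta = 0.5 versus theta ~ Uniform(0, 1).
    The integral in the denominator of Equation 3 is approximated on a grid."""
    h0 = likelihood(0.5, k, n)
    step = 1.0 / grid_size
    thetas = [i * step for i in range(grid_size + 1)]
    vals = [likelihood(t, k, n) for t in thetas]
    h1 = step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))  # trapezoid rule
    return h0 / h1

# 5 heads in 10 flips: the data sit exactly on the null value, so the null is favored
print(round(bf01_binomial(5, 10), 2))  # 2.71
```

Even this toy case requires a prior (uniform here) and an integral; the software packages mentioned above do exactly this kind of work, with carefully chosen default priors, for realistic designs.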
While these software solutions work quite well for computing Bayes factors from raw data, they are somewhat limited in the following context. Suppose that in the course of reading some published literature, a researcher comes across a result that is presented as "nonsignificant", with associated test statistic F(1, 23) = 2.…, p = 0.15. In an NHST context, this nonsignificant result does not provide evidence for the null hypothesis; rather, it just implies that we cannot reject the null. A natural question would be: what, if any, support does this result provide for the null hypothesis? Of course, a Bayes factor would be useful here, but without the raw data, we cannot use the previously mentioned software solutions. To this end, it would be advantageous if there were some easy way to compute a Bayes factor directly from the reported test statistic.

It turns out that this computation is indeed possible, at least in certain cases. In the following, I will show how one particular method for computing Bayes factors [the BIC approximation; 9] can be adapted to solve this problem, thus allowing researchers to compute approximate Bayes factors from summary statistics alone (with no need for raw data). Further, I will show through simulations that this method compares well to the default Bayes factors for ANOVA developed by Rouder et al. [10].
III. THE BIC APPROXIMATION OF THE BAYES FACTOR
Wagenmakers [13] demonstrated a method [based on earlier work by 9] for computing approximate Bayes factors using the BIC (Bayesian Information Criterion). For a given model Hi, the BIC is defined as

    BIC(Hi) = −2 log Li + ki · log n,

where n is the number of observations, ki is the number of free parameters of model Hi, and Li is the maximum likelihood for model Hi. He then showed that the Bayes factor for H0 over H1 can be approximated as

    BF01 ≈ exp(ΔBIC10 / 2),   (4)

where ΔBIC10 = BIC(H1) − BIC(H0). Further, Wagenmakers [13] showed that when comparing an alternative hypothesis H1 to a null hypothesis H0,

    ΔBIC10 = n log(SSE1 / SSE0) + (k1 − k0) log n.   (5)

In this equation, SSE1 and SSE0 represent the sums of squares for the error terms in models H1 and H0, respectively. Both Wagenmakers [13] and Masson [6] give excellent examples of how to use this approximation to compute Bayes factors, assuming one is given information about SSE1 and SSE0, as is the case with most statistical software. However, we will now consider the situation where one is given only the statistical summary (i.e., F(1, 23) = 2.…, p = 0.15).

Suppose an ANOVA is performed, producing an F-ratio F(df1, df2), where df1 represents the degrees of freedom associated with the manipulation, and df2 represents the degrees of freedom associated with the error term. Then

    F = (SS1 / df1) / (SSE / df2) = (SS1 / SSE) · (df2 / df1),

where SS1 and SSE are the sums of squares associated with the manipulation and the error term, respectively.

From Equation 5, we see that

    ΔBIC10 = n log(SSE1 / SSE0) + (k1 − k0) log n
           = n log( SSE / (SS1 + SSE) ) + df1 log n.

This equality holds because SSE1 represents the sum of squares that is not explained by H1, which is simply SSE (the error term). Similarly, SSE0 is the sum of squares not explained by H0, which is the sum of SS1 and SSE [see 13, p. 799]. Finally, in the context of comparing H1 and H0 in an ANOVA design, we have k1 − k0 = df1.

Now, we can use algebra to re-express ΔBIC10 in terms of F:

    ΔBIC10 = n log( SSE / (SS1 + SSE) ) + df1 log n
           = −n log( SS1/SSE + 1 ) + df1 log n
           = −n log( (df1/df2)·F + 1 ) + df1 log n
           = n log( df2 / (F·df1 + df2) ) + df1 log n.

Substituting this into Equation 4, we can compute:

    BF01 ≈ exp(ΔBIC10 / 2)
         = exp[ (1/2)( n log( df2 / (F·df1 + df2) ) + df1 log n ) ]
         = ( df2 / (F·df1 + df2) )^(n/2) · n^(df1/2)
         = sqrt( n^df1 · ( df2 / (F·df1 + df2) )^n ).

Rearranging this last expression slightly yields the approximation

    BF01 ≈ sqrt( n^df1 · ( 1 + F·df1/df2 )^(−n) ).   (6)

Practically speaking, the approximation given in Equation 6 offers nothing new over the previous formulations of the BIC approximation given in Wagenmakers [13] and Masson [6]. However, it does have two advantages over these previous formulations. First, one can directly take reported ANOVA statistics (e.g., sample size, degrees of freedom, and the F-ratio) and compute BF01 without having to compute SSE1 or SSE0. We should note that Masson [6] correctly mentions that SSE1/SSE0 = 1 − η²p, so if a paper reports η²p, the need for computing SSE1 and SSE0 is nullified. However, the method of Masson [6] is still essentially a two-step process; one first computes ΔBIC10, which in turn is used to compute BF01. In contrast, the expression derived in Equation 6 is a one-step process that can easily be implemented using a scientific calculator or a simple spreadsheet.
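Both routes to BF01 are easy to script. The sketch below (function names are my own) implements the two-step ΔBIC route of Equations 4 and 5 and the one-step route of Equation 6; the two agree whenever SSE1/SSE0 = df2/(F·df1 + df2), which is exactly the substitution used in the derivation.

```python
from math import exp, log

def bf01_from_bic(sse1, sse0, n, df1):
    """Two-step route: Delta-BIC_10 (Equation 5, with k1 - k0 = df1),
    then BF01 = exp(Delta-BIC_10 / 2) (Equation 4)."""
    delta_bic = n * log(sse1 / sse0) + df1 * log(n)
    return exp(delta_bic / 2)

def bf01_from_F(F, df1, df2, n):
    """One-step route (Equation 6): BF01 from the F ratio, its degrees
    of freedom, and the number of observations n."""
    return (n ** df1 * (1 + F * df1 / df2) ** (-n)) ** 0.5

# With F = 0 the manipulation explains nothing, and the BIC penalty alone
# drives the result: BF01 = sqrt(n**df1).
print(bf01_from_F(0.0, 1, 17, 18))  # sqrt(18), about 4.243
```

Stronger F ratios shrink BF01 below 1, at which point the reciprocal BF10 = 1/BF01 quantifies evidence for the alternative.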
In this section, we will discuss two examples of using Equation 6 to compute Bayes factors. In the first example, I will show how to compute and interpret a Bayes factor for a reported null effect in the field of experimental psychology. In the second example, I will show how to modify Equation 6 to work with an independent-samples t-test.

A. Example 1
Sevos et al. [12] performed an experiment to assess whether schizophrenics could internally simulate motor actions when perceiving graspable objects. The evidence for such internal simulation comes from a statistical interaction between response-orientation compatibility and the presence of an individual name prime. Sevos et al. reported that in a sample of n = 18 schizophrenics, there was no interaction between this compatibility and name prime, F(1, 17) = 2.…, p = 0.…. We can use BF01 to assess the evidence for this null effect.

To this end, we use Equation 6 with n = 18, df1 = 1, df2 = 17, and the reported F ratio:

    BF01 ≈ sqrt( n^df1 · ( 1 + F·df1/df2 )^(−n) ) = sqrt( 18 · ( 1 + F/17 )^(−18) ) = 1.19.

This Bayes factor can be interpreted as follows: after seeing the data, our belief in the null hypothesis should increase only by a factor of 1.19. In other words, these data are not very informative with respect to our belief in the null, which implies that the claim of a null effect in Sevos et al. [12] may be a bit optimistic. According to the classification scheme of Raftery [9], this result provides only weak evidence for the null.
B. Example 2
Borota et al. [1] observed that with a sample of n = 73 participants, those who received 200 mg of caffeine performed significantly better on a test of object memory compared to a control group of participants who received a placebo, t(71) = 2.…, p = 0.…. Since Equation 6 is stated in terms of an F-ratio, it may not be immediately obvious whether we can use it in this context. It turns out to be straightforward to modify Equation 6 to work for an independent-samples t-test. All we need are two simple transformations: (1) F = t², and (2) df1 = 1. Applying these to Equation 6, we get

    BF01 ≈ sqrt( n^df1 · ( 1 + F·df1/df2 )^(−n) ) = sqrt( n · ( 1 + t²/df )^(−n) ).

We can now apply this equation to the reported results of Borota et al. [1]. We see that

    BF01 ≈ sqrt( n · ( 1 + t²/df )^(−n) ) = sqrt( 73 · ( 1 + t²/71 )^(−73) ) = 1.…,

which constitutes evidence in favor of the null! Such results are an example of Lindley's paradox [5], where "significant" p-values between 0.04 and 0.05 can actually imply evidence in favor of the null when analyzed in a Bayesian framework.
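The t-test form of the approximation is again a one-liner. The sketch below (the function name is mine) applies the substitutions F = t² and df1 = 1; the usage line uses an invented t = 2.5 for illustration rather than any value from the paper.

```python
def bf01_from_t(t, df, n):
    """Approximate BF01 for an independent-samples t test via Equation 6,
    using F = t**2 and df1 = 1; n is the total sample size and df = n - 2."""
    return (n * (1 + t * t / df) ** (-n)) ** 0.5

# Hypothetical t = 2.5 with n = 73 participants (df = 71)
bf = bf01_from_t(2.5, 71, 73)
print(bf < 1)  # True -- for this invented t, the evidence favors the alternative
```

Because BF01 decreases as |t| grows, modest t values near the significance threshold can still leave BF01 above 1, which is precisely the Lindley-paradox situation described above.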
At this stage, it is clear that Equation 6 provides a straightforward method for computing an approximate Bayes factor, especially in cases when one is given only minimal output from a reported ANOVA or t test. However, it is not yet clear to what extent this BIC approximation would result in the same decision if a Bayesian analysis of variance [e.g., 10] were performed on the raw data. To answer this question, I performed a series of simulations.

Each simulation consisted of 1000 randomly generated data sets under a 2 × 3 factorial design with cell sizes n = 20, 50, or 80. Specifically, each data set consisted of a vector y generated as

    y_ijk = α_i + τ_j + γ_ij + ε_ijk,

where i = 1, 2; j = 1, 2, 3; and k = 1, ..., n. The "effects" α, τ, and γ were generated from multivariate normal distributions with mean 0 and variance g, yielding three different effect sizes obtained by setting g = 0, 0.05, and 0.2. For each combination of the cell sizes (n = 20, 50, 80) with the three effect sizes (g = 0, 0.05, 0.2), I computed BF10 to assess evidence in favor of the alternative hypothesis over the null hypothesis. Similar to Wang [14], I set the decision criterion to select the alternative hypothesis if log(BF10) > 0, and the null hypothesis otherwise. Because the different cell sizes resulted in similar outcomes, for brevity I only report the n = 50 cell-size condition in the summaries below. Also note that all BayesFactor models were fit with a "wide" prior, which is roughly equivalent to the unit-information prior used by Raftery [9] for the BIC approximation.

First, I will report the results of computing Bayes factors for the main effect α in each of the effect size conditions g = 0, g = 0.05, and g = 0.2. Five-number summaries for log(BF10) are reported for the n = 50 simulation in Table I, as well as the proportion of simulated data sets for which the Bayesian ANOVA and the BIC approximation from Equation 6 selected the same model.

As shown in Table I, the BIC approximation from Equation 6 provides a similar distribution of Bayes factors compared to those computed from the BayesFactor package in R. Figure 1 shows this pattern of results quite clearly, as the kernel density plots for the two different types of Bayes factors exhibit a large amount of overlap for the g = 0.05 and g = 0.2 conditions, with somewhat less overlap in the g = 0 case. However, as can be seen in the "Consistency" column of Table I, regardless of effect size condition, the two different types of Bayes factors resulted in the same decision in a large proportion of simulations (at least 98.4% of simulation trials).

A similar picture emerges for the main effect τ. As can be seen in Table II and Figure 2, the BIC approximation and the BayesFactor outputs are largely consistent and result in mostly the same model-choice decisions. As with the results for main effect α, there is some slight difference in the kernel density plots when simulating null effects (i.e., the condition g = 0). However, both methods chose the same model on at least 92.7% of simulated trials, showing a good amount of consistency.

Finally, we can see in Table III and Figure 3 that the BIC approximation closely mirrors the output of the BayesFactor package for the interaction effect γ. Indeed, the kernel density plots in Figure 3 show considerable overlap between the distributions of BIC values and the distributions of BayesFactor outputs, and this picture is consistent across all three effect sizes (g = 0, 0.05, 0.2).

VI. CONCLUSION
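A miniature version of this simulation logic is easy to reproduce. The sketch below is my own simplification: a two-group one-way design rather than the paper's 2 × 3 design, with the decision rule applied only to the BIC approximation (the BayesFactor comparison requires the R package). It generates data under a true null and checks how often Equation 6 selects the null.

```python
import random

def one_way_F(groups):
    """F ratio for a one-way between-subjects ANOVA, computed from sums of squares."""
    all_vals = [x for g in groups for x in g]
    n_total = len(all_vals)
    grand = sum(all_vals) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df1 = len(groups) - 1
    df2 = n_total - len(groups)
    return (ss_between / df1) / (ss_within / df2), df1, df2

def bf01(F, df1, df2, n):
    """Equation 6, with n the total number of observations."""
    return (n ** df1 * (1 + F * df1 / df2) ** (-n)) ** 0.5

random.seed(1)
cell_n, n_sims, effect = 50, 200, 0.0   # effect = 0.0 simulates a true null
null_chosen = 0
for _ in range(n_sims):
    a = [random.gauss(0.0, 1.0) for _ in range(cell_n)]
    b = [random.gauss(effect, 1.0) for _ in range(cell_n)]
    F, df1, df2 = one_way_F([a, b])
    if bf01(F, df1, df2, 2 * cell_n) > 1:   # log(BF10) < 0 -> select the null
        null_chosen += 1
print(null_chosen / n_sims)  # under a true null, the null is selected on most trials
```

Raising `effect` above zero flips the typical decision toward the alternative, mirroring the pattern the full simulations show across the g conditions.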
The BIC approximation given in Equation 6 provides an easy-to-use estimate of Bayes factors for simple between-subjects ANOVA and t test designs. It requires only minimal information, which makes it well suited for use in a meta-analytic context. In simulations, the estimates derived from Equation 6 compare favorably to Bayes factors computed using existing software solutions with raw data. Thus, the researcher can confidently add this BIC approximation to the ever-growing collection of Bayesian tools for scientific measurement.

[1] Borota, D., Murray, E., Keceli, G., Chang, A., Watabe, J. M., Ly, M., Toscano, J. P., and Yassa, M. A. (2014). Post-study caffeine administration enhances memory consolidation in humans. Nature Neuroscience, 17(2):201–203.
[2] Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5):587–606.
[3] Hoekstra, R., Morey, R. D., Rouder, J. N., and Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5):1157–1164.
[4] Jeffreys, H. (1961). The Theory of Probability (3rd ed.). Oxford University Press, Oxford, UK.
[5] Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1-2):187–192.
[6] Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3):679–690.
[7] Morey, R. D. and Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4):406–419.
[8] Morey, R. D. and Rouder, J. N. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-2.
[9] Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25:111–163.
TABLE I. Summary of values for log(BF10) for main effect α with cell size n = 50.

  g     BF type       Min     Q1      Median   Q3      Max      Consistency
  0     BayesFactor   -2.40   -2.35   -2.14    -1.74   3.38
        BIC           -2.85   -2.80   -2.59    -2.17   3.17     0.985
  0.05  BayesFactor   -2.40   -1.83   -0.20    4.24    47.49
        BIC           -2.85   -2.23   -0.41    4.44    53.29    0.984
  0.2   BayesFactor   -2.40   -0.61   4.88     17.26   119.69
        BIC           -2.85   -0.63   6.76     21.71   146.02   0.987

FIG. 1. Kernel density plots of distributions of log(BF10) for main effect α, presented as a function of Bayes factor type (BIC versus BayesFactor) and effect size (g = 0, 0.05, 0.2).

TABLE II. Summary of values for log(BF10) for main effect τ with cell size n = 50.

  g     BF type       Min     Q1      Median   Q3      Max      Consistency
  0     BayesFactor   -3.97   -3.68   -3.31    -2.70   2.25
        BIC           -3.40   -3.10   -2.70    -2.05   3.19     0.980
  0.05  BayesFactor   -3.97   -1.89   1.15     6.28    43.55
        BIC           -3.40   -1.05   2.28     8.07    48.17    0.927
  0.2   BayesFactor   -3.92   3.19    12.00    28.51   134.04
        BIC           -3.34   6.03    17.15    36.58   149.39   0.952

TABLE III. Summary of values for log(BF10) for interaction effect γ with cell size n = 50.

  g     BF type       Min     Q1      Median   Q3      Max      Consistency
  0     BayesFactor   -3.97   -3.08   -2.72    -2.05   3.47
        BIC           -3.40   -3.12   -2.74    -1.98   4.03     0.990
  0.05  BayesFactor   -3.57   -2.30   -0.96    1.24    20.64
        BIC           -3.40   -2.27   -0.78    1.64    22.40    0.970
  0.2   BayesFactor   -3.37   -0.40   3.26     9.57    54.17
        BIC           -3.39   -0.20   3.84     10.72   57.61    0.975
FIG. 2. Kernel density plots of distributions of log(BF10) for main effect τ, presented as a function of Bayes factor type (BIC versus BayesFactor) and effect size (g = 0, 0.05, 0.2).

FIG. 3. Kernel density plots of distributions of log(BF10) for the interaction effect γ, presented as a function of Bayes factor type (BIC versus BayesFactor) and effect size (g = 0, 0.05, 0.2).

[10] Rouder, J. N., Morey, R. D., Speckman, P. L., and Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5):356–374.
[11] Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237.
[12] Sevos, J., Grosselin, A., Brouillet, D., Pellet, J., and Massoubre, C. (2016). Is there any influence of variations in context on object-affordance effects in schizophrenia? Perception of property and goals of action.
[13] Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804.