Decision-theoretic justifications for Bayesian hypothesis testing using credible sets
Måns Thulin

Abstract
In Bayesian statistics the precise point-null hypothesis θ = θ0 can be tested by checking whether θ0 is contained in a credible set. This permits testing of θ = θ0 without having to put prior probabilities on the hypotheses. While such inversions of credible sets have a long history in Bayesian inference, they have been criticised for lacking decision-theoretic justification. We argue that these tests have many advantages over the standard Bayesian tests that use point-mass probabilities on the null hypothesis. We present a decision-theoretic justification for the inversion of central credible intervals, and in a special case HPD sets, by studying a three-decision problem with directional conclusions. Interpreting the loss function used in the justification, we discuss when tests based on credible sets are applicable. We then give some justifications for using credible sets when testing composite hypotheses, showing that tests based on credible sets coincide with standard tests in this setting.

Keywords:
Bayesian inference; credible set; confidence interval; decision theory; directional conclusion; hypothesis testing; three-decision problem.
1 Introduction

The first step of the standard solution to Bayesian testing of the point-null (or sharp, or precise) hypothesis θ ∈ Θ0 = {θ0} is to assign a prior probability to the hypothesis. This is necessary if one wishes to use e.g. tests based on Bayes factors, as an absolutely continuous prior for θ would yield P(θ ∈ Θ0 | x) = 0 regardless of the data x. We shall refer to this procedure, described in detail e.g. by Robert (2007, Section 5.2.4), as the standard test or the standard solution in the following.

(Author's address: Department of Mathematics, Uppsala University, Box 480, 751 06 Uppsala, Sweden. Phone: +46 (0)18 471 3389; e-mail: [email protected].)

An absolutely continuous prior can however be utilized in a simpler alternative strategy, in which the evidence against Θ0 is evaluated indirectly by using inverted credible sets for testing. In this procedure a credible set Θc, such that P(θ ∈ Θc | x) ≥ 1 − α, is computed and the null hypothesis is rejected if θ0 ∉ Θc. Using credible sets for testing avoids the additional complication of imposing explicit probabilities on the hypotheses, avoids Lindley's paradox and allows for the use of non-informative priors.

While it has been argued that tests of point-null hypotheses are unnatural from a Bayesian point of view, they are undeniably of considerable practical interest (Berger & Delampady, 1987; Ghosh et al., 2006, Section 2.7.2; Robert, 2007, Section 5.2.4). Inverting credible sets is arguably the simplest way to test such hypotheses. For this reason, inverting credible sets as a means to test hypotheses has long been, and continues to be, a part of Bayesian statistics. Tests of this type have been studied by Box & Tiao (1965), Hsu (1982), Kim (1991), Drummond & Rambaut (2007) and Kruschke (2011), among many others. Zellner (1971, Section 10.2) credited this procedure to Lindley (1965, Section 5.6), who in turn credited it to Jeffreys (1961).
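The inversion procedure described above is straightforward to carry out in practice. The following is a minimal sketch, not taken from the paper: it assumes a conjugate normal model with known variance, and the function name and interface are our own.

```python
import numpy as np
from scipy import stats

def central_interval_test(x, theta0, prior_mean, prior_sd, sigma, alpha=0.05):
    """Reject theta = theta0 iff theta0 lies outside the central 1 - alpha
    credible interval. Illustrative conjugate model: x_i ~ N(theta, sigma^2)
    with sigma known and theta ~ N(prior_mean, prior_sd^2)."""
    n = len(x)
    post_prec = 1 / prior_sd**2 + n / sigma**2  # posterior precision
    post_mean = (prior_mean / prior_sd**2 + np.sum(x) / sigma**2) / post_prec
    post_sd = np.sqrt(1 / post_prec)
    lo, hi = stats.norm.ppf([alpha / 2, 1 - alpha / 2], post_mean, post_sd)
    return not (lo <= theta0 <= hi), (lo, hi)

# Hypothetical data; test theta = 0 at the 5 % level.
data = np.array([2.1, 1.9, 2.4, 2.0])
reject, interval = central_interval_test(data, theta0=0.0,
                                         prior_mean=0.0, prior_sd=10.0,
                                         sigma=1.0)
```

Note that no prior probability is placed on the hypothesis itself; only the posterior for θ is needed.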
Lindley suggested that it may be useful when the prior information is vague or diffuse, with the caveat that the distribution of θ must be reasonably smooth around θ0 for the inversion to be sensible. Koch (2007, Section 3.4) motivated this procedure by its resemblance to frequentist methods. Ghosh et al. (2006, Section 2.7) described the inversion of credible sets as "a very informal way of testing". Zellner remarked that "no decision theoretic justification appears available for this procedure".

The purpose of this short note is to discuss such justifications and to put inverted credible sets on a decision-theoretic footing, making them formal Bayesian tests. These justifications shed light on when tests based on credible sets should be used. Throughout the text we assume that the parameter space Θ ⊆ R and that the posterior distribution of θ is proper and absolutely continuous.

We will study three types of credible sets. A highest posterior density (HPD) set contains all θ for which the posterior density is greater than some kα. Its appeal lies in the fact that among all 1 − α credible sets, the HPD set has the shortest length. With q_α(θ | x) denoting the posterior quantile function of θ, such that P(θ ≤ q_α(θ | x) | x) = α, the central 1 − α credible interval is (q_{α/2}(θ | x), q_{1−α/2}(θ | x)), which excludes the α/2 most extreme values in each tail of the posterior. The third type is the one-sided credible set, bounded below (or above) by a posterior quantile, which appears in connection with composite hypotheses.

In the following points we compare tests based on credible sets with the standard solution, with π0 = 1 − π1 being the prior probability of Θ0 in the standard solution.
1. The use of mixture priors in the standard test.
One could argue, as Berger & Sellke (1987, Section 2), that when using the standard solution it is in general reasonable to have π0 ≥ 1/2, so that Θ0 is not rejected merely because of a low prior probability. This is common in practice. It is however much more sensible to adjust the loss function if stronger evidence is required before Θ0 is rejected. The prior distribution should reflect the prior beliefs (or lack of them) and not the severity of rejecting Θ0. Choosing a large π0 corresponds to using a prior with a sharp spike close to θ0. Such priors are only reasonable if there is a very strong prior belief in the null hypothesis, and cannot be used in an analysis where at least some degree of objectivity is desired.

When lacking prior information or striving for complete objectivity, it is common practice to use a prior that in some sense is non-informative or objective. The point-null hypothesis is often a simplification of the hypothesis that |θ − θ0| < ε for some small ε, in which case π0 = P(|θ − θ0| < ε). If P(|θ − θ0| < ε) is computed under an objective prior on θ, π0 will invariably be extremely small. This is at odds with the allegedly "objective" prior π0 = 1/2.

Berger & Sellke (1987, Section 2) claim that a point-null test only makes sense to a Bayesian if the prior actually has a sharp spike near θ0. But it is perfectly reasonable to formulate a hypothesis before constructing an informative prior or to require that an objective prior is used when no prior information is available. It is furthermore reasonable to require that the prior distribution can be used for more than one type of statistical decision, e.g. both estimation and hypothesis testing; the prior should reflect prior beliefs and not the type of decision that one is interested in.
2. The poor frequentist performance of the standard test.
There are many situations in which it is reasonable to require that a statistical procedure works well from both a Bayesian and a frequentist perspective, particularly when an objective analysis is desirable. A well-known complication with the standard solution to testing point-null hypotheses is the asymptotic discrepancy between Bayesian and frequentist analysis known as Lindley's paradox (Lindley, 1957), in which, for any fixed prior, P(θ ∈ Θ0 | x) goes to 1 as the sample size increases for values of a test statistic that correspond to a fixed (small) p-value. Thus a frequentist and a Bayesian will reach different conclusions as more and more data is collected. This is another example of how the standard Bayesian test favours the null hypothesis.

There is no such discrepancy for tests based on credible sets. On the contrary, when based on non-informative priors, credible sets tend to have favourable properties when treated as frequentist confidence intervals; see e.g. Brown et al. (2001); and hypothesis testing using a credible set is often a valid frequentist procedure.
3. Measures of evidence.
The main argument against the use of credible sets for testing is that they do not utilize P(θ ∈ Θ0 | x), and so do not really measure the amount of evidence for and against Θ0. Alas, the first part of this argument seems weak, considering the contradictions and paradoxes that the standard tests utilizing P(θ ∈ Θ0 | x) give rise to. It should also be noted that a test can be fully Bayesian without being based on P(θ ∈ Θ0 | x), as long as it depends only on the posterior distribution. Tests based on inverted credible sets have this property.

While credible sets do not measure the evidence against Θ0 directly, there are indirect measures associated with such tests. With T(x) being the largest HPD set not containing θ0, Pereira & Stern (1999) proposed P(θ ∉ T(x) | x), i.e. the smallest α for which θ0 is not contained in the 1 − α HPD set, as a measure of evidence against Θ0, with values close to 0 indicating that Θ0 is false. Similar in spirit to the frequentist p-value, this is a measure of how far out in the tails of the posterior distribution θ0 is. With S(x) being the largest central interval not containing θ0, one can similarly use P(θ ∉ S(x) | x) as a measure of evidence. When T(x) and S(x) are viewed as frequentist confidence intervals, these measures of evidence coincide with the p-values of the corresponding two-sided tests. This stands in sharp contrast to the evidence measure P(θ ∈ Θ0 | x) utilized in the standard solution, which is irreconcilable with p-values (Berger & Sellke, 1987).

Pereira & Stern (1999) proposed, along with the p-value-like measure of evidence mentioned above, a test of Θ0 = {θ0} that is equivalent to inverting HPD sets, dubbing it the Full Bayesian Significance Test. See also Madruga et al. (2003) and Pereira et al. (2008).
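For a unimodal posterior, the evidence value P(θ ∉ T(x) | x) can be computed numerically by locating the point on the far side of the mode with the same density as θ0. The sketch below is our own illustration (the function and its bracketing arguments lo, hi are not from the paper); it assumes the posterior is available as a frozen, unimodal scipy.stats distribution.

```python
import numpy as np
from scipy import stats, optimize

def pereira_stern_evidence(posterior, theta0, lo, hi):
    """Evidence P(theta not in T(x) | x), with
    T(x) = {theta : pdf(theta) > pdf(theta0)}, for a frozen, unimodal
    scipy.stats posterior whose mode lies in (lo, hi) and whose density at
    lo and hi falls below the density at theta0. Small values speak
    against Theta0 = {theta0}."""
    mode = optimize.minimize_scalar(lambda t: -posterior.pdf(t),
                                    bounds=(lo, hi), method="bounded").x
    p0 = posterior.pdf(theta0)
    if theta0 <= mode:
        left = theta0
        # matching point on the other side of the mode with equal density
        right = optimize.brentq(lambda t: posterior.pdf(t) - p0, mode, hi)
    else:
        right = theta0
        left = optimize.brentq(lambda t: posterior.pdf(t) - p0, lo, mode)
    return posterior.cdf(left) + posterior.sf(right)

# For a symmetric posterior the measure reduces to a two-sided tail
# probability, mirroring the p-value connection noted in the text.
ev = pereira_stern_evidence(stats.norm(0, 1), 1.96, -10, 10)
```

For a skewed posterior the two tail masses differ, but the construction is the same.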
None of the aforementioned papers discuss the historical background of this test, briefly outlined in Section 1 of the present paper.

Let π(· | x) denote the posterior density of θ and define T(x) = {θ : π(θ | x) > π(θ0 | x)}. T(x) is the most credible HPD set not containing θ0. The Pereira–Stern test rejects Θ0 when P(θ ∉ T(x) | x) is "small" (< 0.05, say). In other words, Θ0 is rejected at the 5 % level if and only if it is not contained in the 95 % HPD set.

Madruga, Esteves and Wechsler (2001) provided a decision-theoretic justification for the Pereira–Stern test that is non-standard in that the loss function depends on x. Let the test function ϕ be 0 if Θ0 is accepted and 1 if Θ0 is rejected. The Madruga–Esteves–Wechsler loss is

    L_MEW(θ, ϕ, x) = a(1 − I(θ ∈ T(x))),  if ϕ(x) = 1,
                     b + c·I(θ ∈ T(x)),   if ϕ(x) = 0,

with a, b, c > 0. Minimization of the expected posterior loss leads to the Pereira–Stern test, where Θ0 is rejected if P(θ ∉ T(x) | x) < (b + c)/(a + c). As an example, the test resulting from the choice a = 0.975 and b = c = 0.025 is equivalent to inverting the 95 % HPD set.

The controversial part of this decision-theoretic justification of inverting HPD sets is that the loss function depends on x. While such loss functions have appeared in the literature a few times (Madruga et al. (2001) give Kadane (1992) and Bernardo & Smith (1994, Section 6.1.4) as examples), they do not appear to be widely accepted. Indeed, choosing the loss function after x has been observed is arguably not entirely in line with classic decision theory. Pereira et al. (2008) argue that an advantage of this type of loss is that it allows the statistician to include his or her embarrassment over accepting Θ0 when θ0 is not in a particular HPD set. Such loss functions seem applicable in highly subjective analyses, but should be avoided when objectivity is desirable. As we will show next, tests based on central intervals can, in contrast, be justified using a standard loss function.

Central intervals

When testing Θ0, credible (and confidence) intervals are in practice often used for directional conclusions. If θ0 is above (or below) the credible interval, it is common practice to reject Θ0 and to conclude that θ is smaller (or larger) than θ0.

Next, we provide a decision-theoretic justification of the inversion of central intervals that uses a standard loss function not involving x, and for which the directional conclusions are fully valid. A justification for the use of HPD sets is obtained in a special case as a corollary.

We formally reformulate the problem of testing the point-null hypothesis Θ0 = {θ0} as a three-decision problem with directional conclusions. Θ0 is then tested against both Θ− = {θ : θ < θ0} and Θ+ = {θ : θ > θ0}. Such problems have previously been studied by for instance Jones & Tukey (2000) and Jonsson (2012) in the frequentist setting and Bansal & Sheng (2010) and Bansal et al. (2012) in the Bayesian setting.

Theorem 1.
Let Θ0 = {θ0}, Θ− = {θ : θ < θ0} and Θ+ = {θ : θ > θ0}. Let ϕ be a decision function that takes values in {−1, 0, 1}, such that we accept Θi if ϕ(x) = i, where Θ−1 = Θ− and Θ1 = Θ+. Under an absolutely continuous prior for θ and the loss function

    L^(1)(θ, ϕ) = 0,    if θ ∈ Θi and ϕ = i, i ∈ {−1, 0, 1},
                  α/2,  if θ ∉ Θ0 and ϕ = 0,
                  1,    if θ ∈ Θi ∪ Θ0 and ϕ = −i, i ∈ {−1, 1},

with 0 < α < 1, the Bayes test is to reject Θ0 if θ0 is not contained in the central 1 − α credible set.

Proof. The expected posterior loss is

    E(L^(1)(θ, ϕ) | x) = P(θ ∈ Θ+ | x) I{−1}(ϕ) + P(θ ∈ Θ− | x) I{1}(ϕ) + (α/2) P(θ ∉ Θ0 | x) I{0}(ϕ)
                       = P(θ ∈ Θ+ | x) I{−1}(ϕ) + P(θ ∈ Θ− | x) I{1}(ϕ) + (α/2) I{0}(ϕ),

since P(θ ∉ Θ0 | x) = 1 because of the absolute continuity of θ. The posterior expected losses are therefore E(L^(1)(θ, 0) | x) = α/2 and E(L^(1)(θ, i) | x) = P(θ ∈ Θ−i | x) for i ∈ {−1, 1}. It follows that the Bayes decision rule is to accept Θ0 if α/2 < min{P(θ ∈ Θ− | x), P(θ ∈ Θ+ | x)}, or, equivalently, if 1 − α/2 > max{P(θ ∈ Θ− | x), P(θ ∈ Θ+ | x)}. But P(θ ∈ Θ− | x) > 1 − α/2 precisely when q_{1−α/2}(θ | x) < θ0, in which case θ0 is not contained in the central interval. Similarly, P(θ ∈ Θ+ | x) > 1 − α/2 precisely when q_{α/2}(θ | x) > θ0. Thus Θ0 is accepted if and only if θ0 is contained in the 1 − α central interval.

It is not uncommon that the posterior density is unimodal and symmetric, typical cases of this being normal and Student's t posteriors. In this case, the central interval coincides with the HPD set. The following corollary is therefore immediate.

Corollary 1. If π(θ | x) is absolutely continuous, symmetric and unimodal, the Bayes test under L^(1) is to reject Θ0 if θ0 is not contained in the 1 − α HPD set.
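The Bayes rule of Theorem 1 only requires the posterior probabilities of the two directional alternatives. The sketch below is our own illustration of the rule (the function name is ours), assuming the posterior is available as a frozen scipy.stats distribution.

```python
from scipy import stats

def three_decision_test(posterior, theta0, alpha=0.05):
    """Bayes rule under the loss of Theorem 1: accept a directional
    alternative only if its posterior probability exceeds 1 - alpha/2,
    otherwise accept Theta0 = {theta0}. Returns -1 (conclude theta <
    theta0), 0 (accept Theta0) or 1 (conclude theta > theta0)."""
    p_minus = posterior.cdf(theta0)  # P(theta < theta0 | x); the posterior
    p_plus = 1.0 - p_minus           # is absolutely continuous, so
                                     # P(theta = theta0 | x) = 0
    if p_minus > 1 - alpha / 2:
        return -1
    if p_plus > 1 - alpha / 2:
        return 1
    return 0

# Theta0 is rejected exactly when theta0 falls outside the central
# 1 - alpha interval (posterior.ppf(alpha/2), posterior.ppf(1 - alpha/2)).
decision = three_decision_test(stats.norm(2.0, 0.5), theta0=0.0)
```

Rejection thus always comes with a valid directional conclusion, in line with the theorem.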
Typically, a small α is used in L^(1), in which case the evidence for either Θ− or Θ+ has to be quite large before Θ0 is rejected. This is a conservative type of analysis, where the statistician is somewhat reluctant to reject Θ0. Examples of situations where such a loss is applicable include many investigations of hypotheses in science, circumstances where falsely rejecting Θ0 leads to considerable economic losses, and legal and forensic questions (this loss is in line with the classic in dubio pro reo principle).

The posterior probability of the null hypothesis is not evaluated in this test (it is always 0), but the probabilities of the two alternative hypotheses are. By the process of elimination, Θ0 is accepted if neither Θ− nor Θ+ has a high enough posterior probability. Θ0 is rejected if θ0 is far out in the tails of the posterior distribution, making these tests quite similar to frequentist hypothesis tests.

For historical reasons, tests based on HPD sets are much more common in the literature than tests using central intervals; confer the references given in Section 1 for examples. However, from a decision-theoretic perspective, the somewhat tautological use of the HPD set T(x) in the loss function L_MEW seems artificial in comparison to the easy-to-interpret loss function L^(1). The L_MEW justification of tests using HPD sets involves a good deal more subjectivity than does the justification of tests using central intervals. In most situations, the latter test seems to be preferable when a test with a decision-theoretic justification is desirable.

A further argument for preferring inverted central intervals is given by studying how the two types of tests reject the null hypothesis. For θ ∈ R, when Θ0 = {θ0} is rejected, there is always an implicit directional aspect to this decision: the null hypothesis is rejected because it seems more likely that θ belongs to either Θ− = {θ : θ < θ0} or Θ+ = {θ : θ > θ0}.
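The directional asymmetry between the two interval types is easy to see numerically. As an illustration of our own (not code from the paper), take the posterior for an exponential rate under the Jeffreys prior, which is a gamma distribution; the HPD interval is computed by minimising interval length over the lower tail probability.

```python
import numpy as np
from scipy import stats, optimize

def central_and_hpd(posterior, alpha=0.05):
    """Central and (numerically computed) HPD 1 - alpha intervals of a
    frozen, unimodal scipy.stats posterior. The HPD interval is the
    shortest interval with posterior probability 1 - alpha, found by
    minimising its length over the lower tail probability p."""
    central = posterior.ppf([alpha / 2, 1 - alpha / 2])
    res = optimize.minimize_scalar(
        lambda p: posterior.ppf(p + 1 - alpha) - posterior.ppf(p),
        bounds=(1e-9, alpha - 1e-9), method="bounded")
    hpd = posterior.ppf([res.x, res.x + 1 - alpha])
    return central, hpd

# Posterior for an exponential rate lambda under the Jeffreys prior:
# lambda | x ~ Gamma(n, rate = sum(x)); here n = 10 and sum(x) = 10.
posterior = stats.gamma(a=10, scale=1 / 10)
central, hpd = central_and_hpd(posterior)
# The right-skewed posterior pulls the HPD interval to the left of the
# central interval, so values of theta0 between the two upper endpoints
# are rejected by the HPD test but not by the central interval test.
```

This makes the directional rejection bias of the HPD test concrete: its rejection region is not split evenly between the two tails.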
When the posterior is skewed, HPD sets are, by construction, biased towards one of these sub-alternatives. Such directional rejection biases can be avoided by using inverted central intervals for testing.

When the frequentist properties of tests based on HPD and central intervals are evaluated in some common point-null testing problems for scale or rate parameters, it is seen that this rejection bias causes the HPD set test to have worse power properties than the central interval test. For inference about the scale/rate parameters in the normal, gamma, inverse gamma and Weibull distributions using the corresponding Jeffreys priors, the two tests have comparable power for θ < θ0, while the test using the central interval has higher power for θ > θ0; sometimes substantially so. Examples are given in Figures 1 and 2; the power functions for the other cases are similar. In general, the central interval test also seems to come closer to attaining the nominal size in this setting.

[Figure 1: Power of two-sided tests of the hypothesis H0 : σ² = 1, where σ² is the variance of the normal distribution with known mean, based on 95 % credible sets using the Jeffreys prior, for n = 10, 20, 30. Power is plotted against the true variance for the HPD interval test and the central interval test.]

[Figure 2: Power of two-sided tests of the hypothesis H0 : λ = 1, where λ is the rate parameter of the exponential distribution, based on 95 % credible sets using the Jeffreys prior, for n = 10, 20, 30. Power is plotted against the true rate for the HPD interval test and the central interval test.]

A similar issue occurs when the posterior density is monotone, as is for instance the case when using the conjugate Pareto prior for θ in U(0, θ). In this setting the HPD set is an upper or lower confidence bound, meaning that it is only possible to reject Θ0 in one direction, so that the resulting test in fact is one-sided.

Testing composite hypotheses
When performing tests of composite hypotheses, there is typically no discrepancy between tests based on credible sets and the standard Bayesian tests. This will be illustrated with two examples below. For Θ0 and Θ1 being uncountable subsets of R, both with positive probabilities under an absolutely continuous prior, we will consider the weighted 0–1 loss function

    L^(2)(θ, ϕ) = 0,  if ϕ = 1 − I_{Θ0}(θ),
                  a,  if θ ∈ Θ0 and ϕ = 1,
                  b,  if θ ∈ Θ1 and ϕ = 0.

First, we consider one-sided composite hypotheses. The proof of the following theorem is in analogue with that of Theorem 1 and is therefore omitted.
Theorem 2.
For the hypotheses Θ0 = {θ : θ ≤ θ0} and Θ1 = {θ : θ > θ0}, with P(Θ0), P(Θ1) > 0, let ϕ be a test function, such that Θ0 is rejected if ϕ = 1 and accepted if ϕ = 0. Under an absolutely continuous prior for θ and the loss function L^(2) with a = 1 − α and b = α, 0 < α < 1, the Bayes test is to reject Θ0 if and only if θ0 is not contained in the lower-bound credible set {θ : θ ≥ q_α(θ | x)}, or, equivalently, if and only if P(θ ∈ Θ0 | x) ≤ α.

This is a standard one-sided test and the credible set test is thus in agreement with the standard methods. The corresponding justification for upper-bound credible sets is given by considering testing the hypothesis Θ0 = {θ : θ ≥ θ0} against Θ1 = {θ : θ < θ0} instead.

For a general composite hypothesis θ ∈ Θ0 with the alternative θ ∈ Θ1 = Θ \ Θ0, a correspondence between hypothesis testing and credible sets can be established using a loss function where falsely accepting the null hypothesis is considered to be much worse than falsely rejecting it. Such a loss can be of considerable use in situations where false acceptance of the null hypothesis can be costly, for instance when screening for diseases, malicious web pages or signs of terrorist activity.

Theorem 3.
For the hypotheses Θ0 and Θ1 = Θ \ Θ0, with P(Θ0), P(Θ1) > 0, let ϕ be a test function, such that Θ0 is rejected if ϕ = 1 and accepted if ϕ = 0. Under an absolutely continuous prior for θ and the loss function L^(2) with a = α and b = 1 − α, 0 < α < 1/2, the Bayes test is to reject Θ0 if and only if there exists at least one α credible set that does not contain a non-null subset of Θ0, or, equivalently, if and only if P(θ ∈ Θ0 | x) ≤ 1 − α.

Proof. By Proposition 5.2.2 of Robert (2007), the Bayes test is to accept Θ0 if P(θ ∈ Θ0 | x) ≥ 1 − α. Now, an α credible set Θc is a set such that P(Θc | x) ≥ α. Thus, by definition, if Θ0 is such that P(Θ0 | x) ≥ 1 − α, Θc can be an α credible set only if P(Θ0 ∩ Θc | x) > 0. Hence, we accept the null hypothesis if and only if each α credible set contains a non-null subset of Θ0.

Once again, this is a standard test, albeit a less commonly used one. Using this loss function corresponds to the statistician being either unsure about which credible set to report or unwilling to accept Θ0 if there exists at least one way to look at the data that puts Θ0 to question.

Acknowledgments
The author wishes to thank Silvelyn Zwanzig for valuable suggestions that helped improve the exposition of the paper, Christian Robert for pointing out a crucial mistake in an earlier draft and Paulo Marques, who helped provide some key references in an early stage of this work.
References
Bansal, N.K., Hamedani, G.G., Sheng, R. (2012), Bayesian analysis of hypothesis testing problems for general population: a Kullback–Leibler alternative, Journal of Statistical Planning and Inference, 1991–1998.

Bansal, N.K., Sheng, R. (2010), Bayesian decision theoretic approach to hypothesis testing problems with skewed alternatives, Journal of Statistical Planning and Inference, 2894–2903.

Berger, J.O., Delampady, M. (1987), Testing precise hypotheses, Statistical Science, 317–335.

Berger, J.O., Sellke, T. (1987), Testing a point null hypothesis: the irreconcilability of p values and evidence, Journal of the American Statistical Association, 112–122.

Bernardo, J.M., Smith, A.F.M. (1994), Bayesian Theory, John Wiley & Sons, New York.

Box, G.E.P., Tiao, G.C. (1965), Multiparameter problems from a Bayesian point of view, The Annals of Mathematical Statistics, 1468–1482.

Brown, L.D., Cai, T.T., DasGupta, A. (2001), Interval estimation for a binomial proportion, Statistical Science, 101–133.

Drummond, A.J., Rambaut, A. (2007), BEAST: Bayesian evolutionary analysis by sampling trees, BMC Evolutionary Biology, 214.

Ghosh, J.K., Delampady, M., Samanta, T. (2006), An Introduction to Bayesian Analysis, Springer, New York.

Hsu, D.A. (1982), Bayesian robust detection of shift in the risk structure of stock returns, Journal of the American Statistical Association, 29–39.

Jeffreys, H. (1961), Theory of Probability, 3rd edition, Clarendon Press, Oxford.

Jones, L.V., Tukey, J.W. (2000), A sensible formulation of the significance test, Psychological Methods, 411–414.

Jonsson, F. (2012), Characterizing optimality among three-decision procedures for directional conclusions, Journal of Statistical Planning and Inference, 392–399.

Kadane, J.B. (1992), Healthy scepticism as an expected utility explanation of the phenomena of Allais and Ellsberg, Theory and Decision, 57–64.

Kim, D. (1991), A Bayesian significance test of the stationarity of regression parameters, Biometrika, 667–675.

Koch, K.-R. (2007), Introduction to Bayesian Statistics, Springer, Berlin.

Kruschke, J.K. (2011), Bayesian assessment of null values via parameter estimation and model comparison, Perspectives on Psychological Science, 299–312.

Lindley, D.V. (1957), A statistical paradox, Biometrika, 187–192.

Lindley, D.V. (1965), Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 2. Inference, Cambridge University Press, Cambridge.

Madruga, M.R., Esteves, L.G., Wechsler, S. (2001), On the Bayesianity of Pereira–Stern tests, Test, 291–299.

Madruga, M.R., Pereira, C.A.B., Stern, J.M. (2003), Bayesian evidence test for precise hypotheses, Journal of Statistical Planning and Inference, 185–198.

Pereira, C.A.B., Stern, J.M. (1999), Evidence and credibility: full Bayesian significance test for precise hypotheses, Entropy, 99–110.

Pereira, C.A.B., Stern, J.M., Wechsler, S. (2008), Can a significance test be genuinely Bayesian?, Bayesian Analysis, 79–100.

Robert, C.P. (2007), The Bayesian Choice, Springer, New York.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, John Wiley & Sons, New York.